This code trains a Support Vector Machine classifier on a dataset of women's clothing reviews. The trained model then takes new reviews and assigns each one a rating from 1 to 5, where low ratings indicate bad reviews and high ratings indicate good ones.

In [1]:
import pandas as pd
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

Before training, the data would normally be cleaned of irregularities and NaN values, and feature engineering would be applied. For simplicity, we only remove the NaN values here.

In [2]:
# load the CSV data into a pandas DataFrame (read_csv already returns a DataFrame)
df = pd.read_csv('data/womenReviews.csv')
# print(df)

In [3]:
# removing unnecessary columns
cols2use = ['title', 'rating']  # for simplicity we keep only the title and rating columns
# drop every column that is not listed in cols2use
df2use = df.drop([x for x in df.columns if x not in cols2use], axis=1)
# print(df2use)

In [4]:
# for simplicity, the only cleaning step we perform is removing NaN values
# check whether the full dataframe (df) contains any NaN values
print(df.isnull().values.any())
# drop rows with NaN values from the columns we use
df2use = df2use.dropna()

True


Now we split the data into training and testing sets, vectorize the text, and fit the model on the training vectors. It's normally a better approach to keep two separate datasets for training and testing rather than splitting a single one.

In [5]:
# splitting the data into train and test sets: 90 percent for training, 10 percent for testing
train, test = train_test_split(df2use, test_size=0.10, random_state=42)

x_train = train['title']  # input for training data
y_train = train['rating']  # output for training data

x_test = test['title']  # input for testing data
y_test = test['rating']  # output for testing data
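
When one rating is much more common than the others, a purely random split can leave the small test set with a different mix of ratings than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. The `toy_df` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 five-star and 4 one-star reviews
toy_df = pd.DataFrame({
    "title": [f"review {i}" for i in range(20)],
    "rating": [5] * 16 + [1] * 4,
})

# stratify keeps the 80/20 rating mix in both the train and test parts
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.25, random_state=42, stratify=toy_df["rating"]
)

print(toy_train["rating"].value_counts().to_dict())
print(toy_test["rating"].value_counts().to_dict())
```

In the notebook itself, the equivalent change would be passing `stratify=df2use['rating']` to the existing call, assuming every rating occurs often enough to be split.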

In [6]:
# the titles must be converted into numeric vectors before the model can use them
# TF-IDF vectorization (a weighted bag-of-words representation)
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(x_train)  # training input vectorization
test_x_vectors = vectorizer.transform(x_test)  # testing input vectorization
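
The difference between `fit_transform` and `transform` is easier to see on a tiny example. The sketch below uses made-up titles (`toy_train` and `toy_test` are hypothetical, not from the dataset): the vocabulary is learned from the training text only, so a word that appears only in the test text is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles, made up for illustration
toy_train = ["great fit", "poor fit", "great colour"]
toy_test = ["great fabric"]  # "fabric" never appears in toy_train

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then vectorizes
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)
# transform reuses the learned vocabulary and weights without updating them
toy_test_vectors = toy_vectorizer.transform(toy_test)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)                    # ['colour', 'fit', 'great', 'poor']
print(toy_train_vectors.shape)  # (3, 4): three titles, four known words
print(toy_test_vectors.shape)   # (1, 4): same columns as the training matrix
print(toy_test_vectors.nnz)     # 1: only "great" is known, "fabric" is dropped
```

This is why the test data is passed through `transform` rather than `fit_transform`: both matrices must share the same columns for the classifier to make sense of them.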

In [7]:
# we use SVC (support vector classifier) from sklearn's svm module as our model
model = svm.SVC()
model.fit(train_x_vectors, y_train)  # fitting the classifier on the training vectors

We then check the accuracy of the trained model on the test set.

In [8]:
# now we make predictions for the test data
prediction = model.predict(test_x_vectors)
# calculating the accuracy of the predictions on the test data
accuracy = metrics.accuracy_score(y_test, prediction)
print(accuracy)

0.6432926829268293
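
An accuracy figure is easier to judge next to a baseline. The sketch below uses made-up labels (`toy_y_test` and `toy_prediction` are hypothetical stand-ins for `y_test` and `prediction`) to show how a model that always predicts the most common rating can reach a seemingly reasonable accuracy, and how a confusion matrix exposes that. Whether this applies to the real data would need to be checked by running the same calls on `y_test` and `prediction`.

```python
from collections import Counter

from sklearn import metrics

# Hypothetical labels: half of the true ratings are 5, and the model always says 5
toy_y_test = [5, 5, 5, 4, 3, 1, 5, 2, 5, 4]
toy_prediction = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

# Baseline: accuracy obtained by always predicting the most common true rating
majority_label, majority_count = Counter(toy_y_test).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_y_test)

model_accuracy = metrics.accuracy_score(toy_y_test, toy_prediction)
print(majority_label, baseline_accuracy, model_accuracy)

# Rows are true ratings 1..5, columns are predicted ratings 1..5
cm = metrics.confusion_matrix(toy_y_test, toy_prediction, labels=[1, 2, 3, 4, 5])
print(cm)
```

If every non-zero entry of the confusion matrix sits in a single column, the classifier is effectively ignoring its input, however good the accuracy looks.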


Once we are satisfied with the accuracy, we save the model to disk using pickle. Note that only the classifier is saved here; the fitted vectorizer stays in memory and is reused in the cells below.

In [10]:
import pickle
# save the model to disk; the with-block closes the file handle afterwards
filename = 'model/clothReviewModel.pkl'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
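
A model loaded in a fresh session also needs the vectorizer it was trained with, because new text has to be mapped onto the same columns. One possible approach, sketched here on made-up data, is to wrap both steps in a scikit-learn `Pipeline` and pickle that single object. The names `toy_titles`, `toy_ratings`, `toy_pipeline` and the temporary file path are assumptions for illustration, not part of the notebook.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for x_train and y_train
toy_titles = ["great fit", "love it", "poor quality", "bad fit"]
toy_ratings = [5, 5, 1, 1]

# The pipeline vectorizes raw text and then classifies it
toy_pipeline = make_pipeline(TfidfVectorizer(), svm.SVC())
toy_pipeline.fit(toy_titles, toy_ratings)

# Save vectorizer and classifier together as one object
path = os.path.join(tempfile.mkdtemp(), "toyReviewPipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_pipeline, f)

# Load it back and predict directly from raw strings
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(["great fit"]))
```

With this layout the loading side never has to call `vectorizer.transform` itself. As with any pickle file, it should only be loaded from a trusted source.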

We now load the saved model and use it to classify new reviews.

In [11]:
# loading the saved model; the with-block closes the file handle afterwards
modelname = 'model/clothReviewModel.pkl'
with open(modelname, 'rb') as f:
    model = pickle.load(f)

In [12]:
# creating sample reviews for testing
singleReview = ['The cloth was a great fit. I looked very good, and i felt comfortable.']
reviewSamples = ['it did not fit me at all. Absolutely hated it.', 'it was satisfactory.']

In [13]:
# transforming sample texts into vectors
single_vector = vectorizer.transform(singleReview)
reviewSampleVector = vectorizer.transform(reviewSamples)

In [14]:
# checking the predictions of the samples
prediction = model.predict(single_vector)
print(prediction)
prediction = model.predict(reviewSampleVector)
print(prediction)

[5]
[5 5]
