Predict the rating of a dress from online reviews: classify each dress review as positive (more than 3 stars) or neutral/negative (3 stars or fewer).

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

Explain how the bag-of-words model and Naïve Bayes work, and how they work together.

The bag-of-words model turns arbitrary text into a fixed-length numeric vector by counting how many times each vocabulary word appears in the text. Word order is discarded; only the counts are kept.
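The counting step can be sketched in plain Python. This is a minimal illustration only: the two reviews are invented, and the sketch skips the lowercasing and stop-word removal that `CountVectorizer` performs in this notebook.

```python
from collections import Counter

# Toy reviews, invented for illustration (not taken from the dataset).
docs = [
    "love this dress love the fit",
    "the dress runs small",
]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One fixed-length vector per document; position i belongs to vocab[i].
vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Both vectors have the same length as the vocabulary, regardless of how long each review is.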

Naive Bayes is a classification technique that assumes that, within a class, the presence of a particular feature is unrelated to the presence of any other feature (conditional independence given the class).
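For word counts, the independence assumption leads to the standard multinomial form below. The symbols are introduced here for illustration and do not appear elsewhere in the notebook: $c$ is a class, $d$ a document, $V$ the vocabulary, $n_w(d)$ the count of word $w$ in $d$, $N_{wc}$ the count of $w$ over all training documents of class $c$, $N_c$ the total word count of class $c$, and $\alpha$ a smoothing constant.

```latex
P(c \mid d) \;\propto\; P(c) \prod_{w \in V} P(w \mid c)^{\,n_w(d)},
\qquad
\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha\,\lvert V \rvert}
```

The predicted class is the one with the largest value of the left-hand expression. With $\alpha = 1$ the estimate is Laplace smoothing, which keeps a word that never occurred in a class from forcing the whole product to zero.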

To classify text with Naive Bayes, each document must first be represented as a feature vector, which is what the bag-of-words model provides. The classifier then learns from these word counts how likely each word is in each class.
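How the two pieces fit together can be sketched with a small hand-written multinomial Naive Bayes on word counts. This is an illustrative sketch, not the scikit-learn implementation: the training reviews, the helper names `log_posterior` and `predict`, and the choice `alpha=1.0` are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Toy labelled reviews, invented for illustration (not from the dataset).
train = [
    ("love this dress", "Positive"),
    ("great fit love it", "Positive"),
    ("cheap material poor fit", "Neutral/Negative"),
]

vocab = sorted({w for text, _ in train for w in text.split()})

# Bag-of-words counts aggregated per class, plus documents per class.
word_counts = defaultdict(Counter)
class_docs = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_docs[label] += 1

def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) with additive (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for w in text.split():
        if w not in vocab:
            continue  # words outside the vocabulary carry no information
        score += math.log(
            (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
        )
    return score

def predict(text):
    """Return the class with the highest log posterior."""
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("love the fit"))
print(predict("cheap material"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long reviews.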

# Pre-processing steps

In [13]:
df = pd.read_csv('data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [14]:
df = df.loc[df['Class Name'] == 'Dresses']  # Keep only the dress reviews
df = df.dropna(axis=0)  # Drop every row that has a NaN in any column
df = df.reset_index(drop=True)  # Renumber the rows from 0 after filtering
text = df['Review Text'].values.astype('U')  # Review text as a Unicode string array
vect = CountVectorizer(stop_words='english')  # Vectorizer that ignores English stop words
vect = vect.fit(text)  # Learn the vocabulary from the review text
docu_feat = vect.transform(text)  # Build the sparse document-feature matrix
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
1,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
2,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
3,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses
4,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses


Text pre-processing steps resulting in a document-feature matrix

In [15]:
feature_names = vect.get_feature_names_out().tolist()  # Vocabulary words; get_feature_names() was removed in scikit-learn 1.2

#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


Now that we have the vocabulary, we can count the occurrences of each word in each review. This gives a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.

In [16]:
docu_feat = vect.transform(text)  # transform() maps each review to a row of word counts (sparse matrix)
print(docu_feat[0:100,0:100])

  (1, 7)	1
  (3, 71)	1
  (14, 78)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (18, 54)	1
  (19, 39)	1
  (21, 59)	1
  (26, 11)	2
  (26, 58)	3
  (28, 84)	1
  (29, 30)	1
  (29, 94)	1
  (42, 71)	1
  (43, 0)	1
  (47, 58)	4
  (51, 11)	1
  (51, 58)	1
  (52, 71)	1
  (58, 44)	1
  (59, 81)	1
  (60, 11)	1
  (76, 69)	1
  (77, 11)	1
  (77, 58)	1
  (83, 44)	1
  (86, 24)	1
  (92, 71)	1
  (97, 84)	1


In [17]:
df['Binary Rating'] = np.where(df['Rating'] > 3, 'Positive', 'Neutral/Negative')  # Ratings above 3 are positive, the rest neutral/negative
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Binary Rating
0,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Neutral/Negative
1,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,Neutral/Negative
2,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses,Positive
3,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses,Positive
4,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses,Neutral/Negative


# Building and evaluating model

Split the data into a training and a test set, fit the model on the training set, and evaluate it on the test set.

In [18]:
nb = MultinomialNB()  # Create the model

X = docu_feat  # Feature matrix: one row per review, one column per vocabulary word
y = df['Binary Rating']  # Target vector of class labels (strings)

# Split into training (70%) and test (30%) sets; the order of the four return values matters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = nb.fit(X_train, y_train)  # Estimate class priors and per-class word probabilities

# Evaluate on the held-out test data; scoring on the training data would hide overfitting
y_test_p = nb.predict(X_test)  # predict() takes a feature matrix and returns predicted labels
nb.score(X_test, y_test)  # Mean accuracy on the test set

0.8542183622828784

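The Naive Bayes model reaches an accuracy of 85.4% on the test set. Whether that is high depends on the class balance: a model that always predicts the majority class already scores that class's share of the test set, so the accuracy should be compared against that baseline.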
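An accuracy figure is easier to interpret next to a confusion matrix and a majority-class baseline. The sketch below computes both in plain Python on hypothetical labels invented for illustration; in this notebook the same matrix could be obtained from the already imported `confusion_matrix(y_test, y_test_p)`.

```python
from collections import Counter

# Hypothetical true and predicted labels, invented for illustration.
y_true = ["Positive", "Positive", "Positive",
          "Neutral/Negative", "Neutral/Negative", "Positive"]
y_pred = ["Positive", "Positive", "Neutral/Negative",
          "Neutral/Negative", "Positive", "Positive"]

labels = ["Neutral/Negative", "Positive"]

# confusion[i][j] = number of samples with true label i predicted as label j
pairs = Counter(zip(y_true, y_pred))
confusion = [[pairs[(t, p)] for p in labels] for t in labels]

# Share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy of always predicting the most frequent true class.
baseline = Counter(y_true).most_common(1)[0][1] / len(y_true)

print(confusion)
print(accuracy, baseline)
```

In this toy case the accuracy equals the baseline, which shows why accuracy alone can overstate how much a classifier has learned when one class dominates.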

In [19]:
y_test_p

array(['Positive', 'Neutral/Negative', 'Positive', ..., 'Positive',
       'Positive', 'Positive'], dtype='<U16')