Predict the rating of a dress from its online reviews: classify each dress review as positive (more than 3 stars) or neutral/negative (3 stars or fewer).

In [32]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

Explain how the bag-of-words model and Naïve Bayes work, and how they work together.
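
As a rough illustration of how the two pieces fit together: the bag-of-words model turns each review into word counts and ignores word order, and Naïve Bayes then picks the class with the highest prior probability multiplied by the per-word probabilities, treating words as independent given the class. The sketch below is a minimal pure-Python version with Laplace smoothing. The toy reviews, labels and helper names (`bag_of_words`, `log_posterior`, `predict`) are invented for illustration; the notebook itself relies on CountVectorizer and MultinomialNB instead.

```python
import math
from collections import Counter, defaultdict

# Toy training data (invented for illustration, not taken from the dataset)
train_docs = [
    ("love this dress so pretty", "positive"),
    ("beautiful dress love the fit", "positive"),
    ("poor fit and cheap fabric", "negative"),
    ("cheap fabric returned this dress", "negative"),
]

def bag_of_words(text):
    """Bag of words: map each word to its count; word order is discarded."""
    return Counter(text.lower().split())

# Training: count words per class and documents per class
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train_docs:
    word_counts[label].update(bag_of_words(text))
    class_counts[label] += 1

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) under multinomial Naive Bayes."""
    total_words = sum(word_counts[label].values())
    # Log prior: share of training documents with this label
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word, n in bag_of_words(text).items():
        if word not in vocabulary:
            continue  # words never seen in training are ignored
        # Laplace-smoothed probability of the word given the class
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
        score += n * math.log(p)
    return score

def predict(text):
    """Return the label with the highest log posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(predict("love the pretty fabric"))
```

Working with log probabilities avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term `alpha` keeps a word that never appeared in one class from forcing that class's probability to zero.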

Pre-processing steps (don’t forget to filter out all non-dress reviews)

In [33]:
df = pd.read_csv('data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [41]:
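df = df.loc[df['Class Name'] == 'Dresses'] #Keep only the dress reviews
df.dropna() #Note: the result is not assigned back to df, so no rows are actually dropped here
text = df['Review Text'].values.astype('U') #Take the review text from the df and convert it to Unicode strings (any missing review becomes the string 'nan')
vect = CountVectorizer(stop_words='english') #Create the CountVectorizer object, with English stop words removed
vect = vect.fit(text) #Learn the vocabulary from the words in the review text
docu_feat = vect.transform(text) #Build the document-feature matrix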
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


Text pre-processing steps resulting in a document-feature matrix

In [42]:
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)

#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Now that we have the vocabulary, we can count the occurrences of each word in each review. This gives a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.

In [43]:
docu_feat = vect.transform(text) #Use the transform method of the CountVectorizer object to build the matrix
print(docu_feat[0:100,0:100]) #Show the non-zero counts in the first 100 reviews and first 100 words

  (2, 8)	1
  (4, 72)	1
  (19, 80)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 55)	1
  (26, 40)	1
  (28, 60)	1
  (35, 12)	2
  (35, 59)	3
  (38, 86)	1
  (39, 31)	1
  (39, 96)	1
  (49, 91)	1
  (57, 72)	1
  (58, 0)	1
  (62, 59)	4
  (68, 12)	1
  (68, 59)	1
  (69, 72)	1
  (75, 45)	1
  (77, 33)	1
  (78, 83)	1
  (80, 12)	1
  (98, 70)	1
  (99, 12)	1
  (99, 59)	1
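
The printout above lists only the non-zero cells of the matrix, each as `(row, column)` followed by the count. To make that layout easier to read, here is a small pure-Python sketch that builds a document-feature matrix for three invented reviews and prints it in the same style. The reviews and variable names are assumptions for illustration; the notebook's matrix comes from CountVectorizer, which additionally lowercases, tokenizes and removes stop words.

```python
# Toy reviews (invented for illustration)
reviews = [
    "love dress love fit",
    "fabric feels cheap",
    "dress fit perfect",
]

# Vocabulary: every distinct word, sorted alphabetically, mapped to a column index
all_words = sorted({word for review in reviews for word in review.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Dense document-feature matrix: one row per review, one column per word
matrix = [[0] * len(vocabulary) for _ in reviews]
for row, review in enumerate(reviews):
    for word in review.split():
        matrix[row][vocabulary[word]] += 1

# Sparse view: keep only the non-zero cells as (row, column) -> count
sparse_entries = {
    (row, col): count
    for row, counts in enumerate(matrix)
    for col, count in enumerate(counts)
    if count > 0
}

for (row, col), count in sparse_entries.items():
    print(f"  ({row}, {col})\t{count}")
```

Most reviews use only a handful of the words in the vocabulary, so storing just the non-zero cells takes far less memory than the full matrix.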


# Building and evaluating model

Split the file into a training and a test set.

In [44]:
nb = MultinomialNB() #Create the model

X = docu_feat #The document-feature matrix is the feature matrix X (one column per word)
y = df['Rating'] #The target vector y holds the star rating (1 to 5) of each review

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #Split into 70% training and 30% test data; the order of the four returned objects matters

nb = nb.fit(X_train, y_train) #Fit the model on the training data (X = features, y = ratings) to estimate the class and word probabilities

#Evaluate the model on the test data only, so the score is not inflated by overfitting to the training data
y_test_p = nb.predict(X_test) #predict takes the feature matrix and returns the predicted ratings
nb.score(X_test, y_test) #Accuracy: the share of test reviews whose rating is predicted correctly

0.5912447257383966

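The accuracy of the Naive Bayes model on the test set is 59%. This is the accuracy for predicting the exact star rating (1 to 5), not the positive versus neutral/negative label. For comparison, about 54% of the dress reviews have 5 stars (see the rating distribution below), so always predicting 5 stars would already score roughly 54%.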

In [45]:
y_test_p

array([4, 3, 4, ..., 4, 5, 5])
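
Accuracy alone does not show which ratings get confused with each other. A confusion matrix does: it counts, for every true rating, how often each rating was predicted. The sketch below builds one in plain Python from invented labels (they are not the notebook's predictions). On the real data, `confusion_matrix(y_test, y_test_p)` from `sklearn.metrics`, which is imported at the top but not used, should give the same layout with true ratings in the rows and predicted ratings in the columns.

```python
from collections import Counter

# Hypothetical true and predicted star ratings (invented; not the notebook's results)
y_true = [5, 5, 4, 3, 5, 2, 4, 1, 5, 3]
y_pred = [5, 4, 4, 4, 5, 3, 5, 3, 5, 3]

labels = [1, 2, 3, 4, 5]

# Count how often each (true, predicted) pair occurs
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true ratings, columns are predicted ratings
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Accuracy is the sum of the diagonal divided by the number of reviews
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

for label, row in zip(labels, matrix):
    print(label, row)
print(f"accuracy = {accuracy:.2f}")
```

Off-diagonal counts close to the diagonal (for example true 4, predicted 5) indicate near misses, which matter less once the ratings are collapsed into positive versus neutral/negative.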

In [46]:
df['Rating'].value_counts(normalize=True)

5    0.537585
4    0.220763
3    0.132616
2    0.072955
1    0.036082
Name: Rating, dtype: float64
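
The task stated at the top is binary (positive versus neutral/negative), while the model above predicts the five star ratings. As a hedged sketch of how the target could be collapsed, the snippet below maps ratings to a 0/1 label and uses the shares printed above to work out what always predicting "positive" would score. The helper `to_binary` and the dictionary names are assumptions for illustration, and the shares cover all dress reviews rather than the test set alone.

```python
# Rating shares copied from the value_counts output above
rating_share = {5: 0.537585, 4: 0.220763, 3: 0.132616, 2: 0.072955, 1: 0.036082}

def to_binary(rating):
    """1 = positive (more than 3 stars), 0 = neutral/negative (3 stars or fewer)."""
    return 1 if rating > 3 else 0

# With the notebook's DataFrame, the equivalent target would be:
# y = (df['Rating'] > 3).astype(int)

# Add up the shares of the ratings that fall into each binary class
binary_share = {0: 0.0, 1: 0.0}
for rating, share in rating_share.items():
    binary_share[to_binary(rating)] += share

# Always predicting the larger class gives the majority baseline
majority_baseline = max(binary_share.values())
print(f"positive share: {binary_share[1]:.3f}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
```

About 76% of the dress reviews are positive under this definition, so a binary classifier needs to beat roughly that figure to be more useful than always guessing "positive".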