### Working of the models

The bag-of-words model counts how often words appear in the text and doesn't look at grammer and order. Naïve Bayes determines which group the piece of text belongs to based on the words that appear in the text. If a certain (combination of) words occurs more often in one of the groups, Naïve Bayers will assume the piece of text belongs to that group.

### Pre-processing steps

In [214]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import math
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import normalize
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [215]:
df_subset = df[["Class Name", "Review Text", "Rating"]] # Creating a subset with the needed variables
df_subset.dropna()
df_subset.head()

Unnamed: 0,Class Name,Review Text,Rating
0,Intimates,Absolutely wonderful - silky and sexy and comf...,4
1,Dresses,Love this dress! it's sooo pretty. i happene...,5
2,Dresses,I had such high hopes for this dress and reall...,3
3,Pants,"I love, love, love this jumpsuit. it's fun, fl...",5
4,Blouses,This shirt is very flattering to all due to th...,5


In [216]:
dresses = df_subset[df_subset['Class Name']=="Dresses"] # filters out all the non-dress reviews
dresses.head()

Unnamed: 0,Class Name,Review Text,Rating
1,Dresses,Love this dress! it's sooo pretty. i happene...,5
2,Dresses,I had such high hopes for this dress and reall...,3
5,Dresses,"I love tracy reese dresses, but this one is no...",2
8,Dresses,I love this dress. i usually get an xs but it ...,5
9,Dresses,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [217]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = dresses['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
vect
feature_names = vect.get_feature_names() #Get the words from the vocabulary
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[0:20]}")

There are 8080 words in the vocabulary. A selection: ['00', '000', '00p', '02', '03', '03dd', '04', '06', '0p', '0petite', '0r', '0xs', '10', '100', '100lbs', '101', '102', '102lbs', '103', '103lb']


In [218]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:100,0:100]) # print the first 100

  (2, 8)	1
  (4, 72)	1
  (19, 80)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (25, 55)	1
  (26, 40)	1
  (28, 60)	1
  (35, 12)	2
  (35, 59)	3
  (38, 86)	1
  (39, 31)	1
  (39, 96)	1
  (49, 91)	1
  (57, 72)	1
  (58, 0)	1
  (62, 59)	4
  (68, 12)	1
  (68, 59)	1
  (69, 72)	1
  (75, 45)	1
  (77, 33)	1
  (78, 83)	1
  (80, 12)	1
  (98, 70)	1
  (99, 12)	1
  (99, 59)	1


In [219]:
dresses['positivereviews'] = df['Rating'] > 3 # mark when a review is postive by its rating
dresses.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Class Name,Review Text,Rating,positivereviews
1,Dresses,Love this dress! it's sooo pretty. i happene...,5,True
2,Dresses,I had such high hopes for this dress and reall...,3,False
5,Dresses,"I love tracy reese dresses, but this one is no...",2,False
8,Dresses,I love this dress. i usually get an xs but it ...,5,True
9,Dresses,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,True


### Train the set

In [220]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
X = docu_feat
y = dresses['positivereviews']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) 

nb = nb.fit(X_train, y_train) #fit the model X=features, y=character

### Evaluate the performance of the model 

In [221]:
y_test_p = nb.predict(X_test)
nb.score(X_test,y_test) # shows performance score of the model

0.8665611814345991

###### The model is right in 86 percent of the cases

In [222]:
cm = confusion_matrix(y_test, y_test_p) # creates a "confusion matrix"
cm

array([[ 299,  183],
       [  70, 1344]])

In [223]:
nb.classes_

array([False,  True])

In [224]:
conf_matrix = pd.DataFrame(cm, index=['negative', 'positive'], columns = ['negative_p', 'positive_p']) 
conf_matrix # shows the confusion matrix

Unnamed: 0,negative_p,positive_p
negative,299,183
positive,70,1344


In [225]:
287/(287+173) # recall

0.6239130434782608

In [226]:
287/(287+99) # precission

0.7435233160621761

### 3 cases where your model is off target

In [243]:
for i in range(3): # loop to print 3 cases
    prob = nb.predict_proba(X[i])
    print(f"Review: {dresses.iloc[i,1]}") # showing the 2th row with the review
    print(f"Rating: {dresses.iloc[i,2]}") # showing the 3th row with the rating
    print(f"Positive rating: {dresses.iloc[i,3]}") # showing wether the review is positive or negative
    print(f"True: {prob[0,0]}, False: {prob[0,1]}") # shows probabilities to be true or false
    print("--------------------------------------------------------")

Review: Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Rating: 5
Positive rating: True
True: 1.2054830999719798e-05, False: 0.9999879451689964
--------------------------------------------------------
Review: I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
Rating: 3
Positive rating: False
T

The last review is predicted wrong, what can be caused by the words used in the review. For example the combinations 'i love' and 'very pretty' are common in positive reviews. The first review has a high rate for true and false, because she uses 'i never would have ordered it online' what sounds negative out of context.