## Reviews, ratings and predictions
__Lotte Meijer__
1661695

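The 'bag of words' model represents each text as a count of its individual words, ignoring word order. The Naïve Bayes classifier then learns from labelled collections of words (here: reviews with a rating) which words are typical for each category, and uses that to sort new collections of words into those categories.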
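To make the two steps concrete, here is a minimal sketch on four made-up reviews. The toy sentences, labels and the `toy_` variable names are assumptions for illustration only and are not taken from the clothing review data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews and labels (not from the dataset)
toy_docs = [
    "love this dress love the fit",
    "great dress great color",
    "poor fit cheap fabric",
    "cheap fabric returned it",
]
toy_labels = ["positive", "positive", "negative", "negative"]

# Bag of words: every document becomes a row of word counts
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Show the non-zero counts of the first toy review
vocab = toy_vect.vocabulary_  # maps each word to its column index
first_row = toy_counts[0].toarray()[0]
first_counts = {word: int(first_row[idx]) for word, idx in vocab.items() if first_row[idx] > 0}
print(first_counts)

# Naive Bayes: learn which words are typical for each label
toy_clf = MultinomialNB().fit(toy_counts, toy_labels)
toy_prediction = toy_clf.predict(toy_vect.transform(["love the color"]))
print(toy_prediction)
```

The word "love" appears twice in the first toy review, so its count is 2; the order of the words is not stored anywhere.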

In [107]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

In [108]:
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [109]:
df_subset = df.loc[df['Class Name'].isin(['Dresses'])] #first I subset the reviews that are about a dress

In [110]:
df_subset = df_subset[['Review Text', 'Rating']].rename(columns={"Review Text": "Review_Text"}) #keep only the text and rating, and rename the Review Text column to make it more convenient
df_subset.head()

Unnamed: 0,Review_Text,Rating
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
5,"I love tracy reese dresses, but this one is no...",2
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [111]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df_subset['Review_Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the vectorizer so it learns the vocabulary from the review text
feature_names = vect.get_feature_names_out() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)

In [112]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:500,0:500]) #Let's print a little part of the matrix: the first 500 documents (rows) & the first 500 words (columns)

  (2, 8)	1
  (4, 72)	1
  (4, 224)	1
  (18, 210)	1
  (19, 80)	1
  (19, 435)	1
  (19, 447)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (22, 248)	1
  (25, 55)	1
  (26, 40)	1
  (28, 60)	1
  (28, 226)	1
  (28, 237)	1
  (29, 436)	1
  (31, 241)	1
  (32, 104)	1
  (32, 186)	1
  (32, 233)	1
  (32, 255)	1
  (33, 428)	1
  (35, 12)	2
  :	:
  (479, 461)	1
  (480, 484)	1
  (482, 471)	1
  (483, 355)	1
  (484, 12)	1
  (484, 225)	1
  (484, 348)	1
  (485, 447)	1
  (487, 258)	1
  (487, 447)	1
  (489, 211)	1
  (489, 304)	1
  (489, 316)	1
  (491, 104)	1
  (491, 243)	1
  (491, 422)	1
  (492, 84)	1
  (492, 96)	1
  (492, 186)	1
  (492, 233)	1
  (492, 242)	1
  (492, 248)	1
  (494, 86)	1
  (494, 485)	1
  (497, 37)	1


In [113]:
from sklearn.model_selection import train_test_split #the function to split the data

df_subset["Rating"] = df_subset["Rating"].astype('category')
df_subset.dtypes

cleanup = {"Rating": {1: "negative", 2: "negative", 3: "negative", 4: "positive", 5: "positive"}} #group the ratings into negative (1-3) and positive (4-5)
df_subset.replace(cleanup, inplace=True)
df_subset.head()

Unnamed: 0,Review_Text,Rating
1,Love this dress! it's sooo pretty. i happene...,positive
2,I had such high hopes for this dress and reall...,negative
5,"I love tracy reese dresses, but this one is no...",negative
8,I love this dress. i usually get an xs but it ...,positive
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",positive


In [114]:
#Setting up the data and model
X = docu_feat #selecting the variables to go into my X matrix
y = df_subset['Rating'] #creating the y vector

#Split the data. test_size = 0.3, so I'm splitting the data into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [115]:
clf = MultinomialNB()
clf = clf.fit(X_train, y_train) #this fits the Multinomial model with the train data
clf.score(X_test, y_test) #calculate the accuracy on the test data

0.8486286919831224

In [116]:
df_subset['Rating'].value_counts()

positive    4792
negative    1527
Name: Rating, dtype: int64
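The classes are clearly unbalanced, so it helps to know how well a model would do by always predicting the most common class. A small sketch of that arithmetic, using only the two counts printed above (the variable names are my own):

```python
# Class counts copied from the value_counts() output above
counts = {"positive": 4792, "negative": 1527}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = counts[majority_label] / total
print(f"{majority_label}: {baseline_accuracy:.3f}")
```

This baseline is computed on the whole dress subset, not on the test split, so it is only a rough reference point for the accuracy of the classifier.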

In [117]:
from sklearn.metrics import confusion_matrix
y_test_pred = clf.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
cm

array([[ 284,  182],
       [ 105, 1325]], dtype=int64)
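The confusion matrix has the true classes as rows and the predicted classes as columns, with the labels in sorted order (negative, positive). As a worked example, the accuracy and the per-class precision and recall can be recomputed by hand from the four numbers above. The matrix is typed in again here so the sketch runs on its own.

```python
import numpy as np

# Confusion matrix copied from the output above
# rows = true class, columns = predicted class, order: negative, positive
cm_values = np.array([[284, 182],
                      [105, 1325]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm_values) / cm_values.sum()

# Recall per class: correct predictions divided by the row total (true class size)
recall = np.diag(cm_values) / cm_values.sum(axis=1)

# Precision per class: correct predictions divided by the column total (predicted class size)
precision = np.diag(cm_values) / cm_values.sum(axis=0)

print("accuracy:", round(accuracy, 4))
print("recall (negative, positive):", recall.round(2))
print("precision (negative, positive):", precision.round(2))
```

Only 284 of the 466 negative reviews in the test set are recognized as negative, so the model is noticeably weaker on the smaller class.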

In [118]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

    negative       0.73      0.61      0.66       466
    positive       0.88      0.93      0.90      1430

    accuracy                           0.85      1896
   macro avg       0.80      0.77      0.78      1896
weighted avg       0.84      0.85      0.84      1896



In [119]:
for i in range(0,20):
    prob = clf.predict_proba(X[i])
    print(i, "This is the Rating: ", df_subset.Rating.iloc[i])
    print(i, "This is the Review: ", df_subset.Review_Text.iloc[i])
    print(f"Negative: {prob[0,0]}, Positive {prob[0,1]}")

0 This is the Rating:  positive
0 This is the Review:  Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.
Negative: 8.598612681280654e-05, Positive 0.9999140138731744
1 This is the Rating:  negative
1 This is the Review:  I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
Negative: 0.998678722983

## The predictions that are off

10 This is the Rating:  positive
10 This is the Review:  I'm upset because for the price of the dress, i thought it was embroidered! no, that is a print on the fabric. i think i cried a little when i opened the box. it is still ver pretty. i would say it is true to size, it is a tad bit big on me, but i am very tiny, but i can still get away with it. the color is vibrant. the style is unique. skirt portion is pretty poofy. i keep going back and forth on it mainly because of the price, although the quality is definitely there. except i wish it were emb
Negative: 0.6878964008940213, Positive 0.31210359910598057

*This prediction is probably off because the review uses words like: upset, cried and away. It took me some time as well to figure out that this review describes a good experience in a roundabout way, so I don't blame the algorithm.*

12 This is the Rating:  negative
12 This is the Review:  Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.
Negative: 0.17361326828378917, Positive 0.8263867317162178

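*This prediction is probably off because the review uses words like: cute, good, like, love. Counted individually these are all positive, even though the reviewer is "just not in love with it".*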

13 This is the Rating:  negative
13 This is the Review:  Love the color and style, but material snags easily
Negative: 0.296231362562037, Positive 0.7037686374379594

*This review is very short, but it talks about loving the color and style. I think the other words aren't very common in the other negative reviews, so the algorithm doesn't recognize it as negative.*

__The most important aspect of the off predictions is that the algorithm can't see combinations of words. It looks at words individually, and therefore loses context.__
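One possible way to give the model some of that context back is to count short word combinations (n-grams) next to the single words. The sketch below is not part of the analysis above and has not been run on the review data. It only shows, on a phrase from review 12, which features `CountVectorizer` produces with `ngram_range=(1, 3)`. It also checks that "not" is part of scikit-learn's English stop word list, which would mean the vectorizer used above drops it before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Phrase taken from review 12
phrase = ["i like the dress, i'm just not in love with it"]

# Single words only (what the notebook uses, here without stop word removal)
single_vect = CountVectorizer(ngram_range=(1, 1)).fit(phrase)
single_words = set(single_vect.vocabulary_)

# Single words plus combinations of two and three consecutive words
ngram_vect = CountVectorizer(ngram_range=(1, 3)).fit(phrase)
combinations = set(ngram_vect.vocabulary_) - single_words

print(sorted(single_words))
print(sorted(combinations))

# With stop_words='english' the word "not" is removed before counting
print("not" in ENGLISH_STOP_WORDS)
```

With single words the model sees "love" and "like" separately, while the three-word feature "not in love" keeps the negation attached. Whether this actually improves the predictions on the dress reviews would have to be tested.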