In [12]:
import pandas as pd

# Bag-of-words model and Naïve Bayes

The <b>bag-of-words model</b> puts all the words of a text together and counts them. It ignores the order and the meaning of the words, so words can be taken out of context. It is a simple yet effective model.
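
To make the idea concrete, here is a minimal sketch that uses only the standard library. The two example sentences are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Two made-up reviews (hypothetical examples, not from the dataset)
docs = ["I love this dress", "not good I do not love this dress"]

# A bag of words keeps only how often each word occurs, not the order
bags = [Counter(doc.lower().split()) for doc in docs]

print(bags[0]["love"], bags[1]["love"])  # count of 'love' in each review
print(bags[1]["not"])                    # the negation is just another count
```

Both bags contain 'love' the same number of times, although the second review says the opposite. That is what "taken out of context" means here.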

The <b>Naïve Bayes model</b> assumes that the features used for the prediction are independent of each other. Usually that is not the case, which is why it is called 'naïve'. Despite that, it often works well in practice for text classification.

With the Naïve Bayes model, we can predict whether a text (based on its words) belongs to category A or category B. In this case, we are going to predict whether a review gets a positive star rating (>3) or a neutral/negative star rating (<4).
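
The sketch below shows the calculation behind such a prediction on a tiny scale. It is a hand-written illustration, not the scikit-learn implementation: the training sentences are made up, and add-one (Laplace) smoothing is an assumption chosen so that unseen word/class combinations do not get probability zero.

```python
import math
from collections import Counter

# Made-up training data (hypothetical): (text, label) pairs
train = [
    ("love this dress", "positive"),
    ("great fit love it", "positive"),
    ("cheap fabric poor fit", "negative"),
]

# Count words per class and how many documents each class has
word_counts = {"positive": Counter(), "negative": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return score

def predict(text):
    # Pick the class with the highest score
    return max(word_counts, key=lambda label: log_score(text, label))

print(predict("love the fit"))
print(predict("poor cheap dress"))
```

The 'naïve' independence assumption is visible in log_score: every word simply adds its own term to the score, regardless of the other words in the text.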

# Pre-processing steps

In [13]:
#importing the data and putting it in a dataframe with Pandas.
reviews = pd.read_csv('Assignment text mining - data clothing reviews.csv')
reviews = reviews.dropna()
reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
6,6,858,39,Cagrcoal shimmer fun,I aded this in my basket at hte last mintue to...,5,1,1,General Petite,Tops,Knits


Getting the reviews for the dresses 

In [14]:
dress_df = reviews.loc[(reviews['Class Name'] == 'Dresses')]

In [15]:
dress_df.head(5)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses
10,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses


In [16]:
dress_df['Rating'].value_counts()

5    2832
4    1213
3     737
2     408
1     181
Name: Rating, dtype: int64

A lot of the dress reviews have 4 or 5 stars (4,045 of 5,371, about 75%). Only 589 have 1 or 2 stars.

# Text pre-processing steps resulting in a document-feature matrix

Now we can convert the review texts to Unicode strings, so that we can analyze them. The <b>CountVectorizer</b> builds a vocabulary of all the words in the reviews.
With the <b>stop_words</b> parameter, we remove the most common English words, like 'the' or 'and'.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = dress_df['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #Learn the vocabulary from the words in the review text
feature_names = vect.get_feature_names() #Get the words from the vocabulary (in scikit-learn 1.0 and later: get_feature_names_out())
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:590]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi', 'amalfi', 'amaz', 'amaze', 'amazed', 'amazement', 'amazing', 'amazingly', 'amazon', 'amd', 'amenable', 'american', 'amking', 'amoret', 'amorphous', 'amp', 'ample', 'amsterdam', 'amterial', 'analogy', 'andrews', 'angel', 'angela', 'angle', 'angles', 'angora', 'animal', 'animals', 'anita', 'ankl', 'ankle', 'ankles', 'anna', 'annie', 'anniversary', 'annoyed', 'annoying', 'annoys', 'ans', 'answer', 'ansy', 'antebellum', 'antheropologie', 'anthletic', 'antho', 'anthopology', 'anti', 'anticipated', 'anticipating', 'anticipation', 'anticpated', 'antique', 'antrho', 'antrhopologie', 'antro', 'antropologie', 'anwen', 'anxious', 'anymore', 'anyou', 'anyt', 'anytime', 'anyw', 'anyways', 'aottern', 'apart', 'apex', 'aplique

Now all the words are in the vocabulary. To see how many times each word occurs in each review, I am going to use the <b>transform</b> method.

In [18]:
docu_feat = vect.transform(text) 
print(docu_feat[0:90,0:90]) 

  (1, 7)	1
  (3, 71)	1
  (14, 78)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (18, 54)	1
  (19, 39)	1
  (21, 59)	1
  (26, 11)	2
  (26, 58)	3
  (28, 84)	1
  (29, 30)	1
  (42, 71)	1
  (43, 0)	1
  (47, 58)	4
  (51, 11)	1
  (51, 58)	1
  (52, 71)	1
  (58, 44)	1
  (59, 81)	1
  (60, 11)	1
  (76, 69)	1
  (77, 11)	1
  (77, 58)	1
  (83, 44)	1
  (86, 24)	1


With <b>docu_feat</b> we can create a <b>document-feature matrix</b>.

In [19]:
#Create a regular matrix out of docu_feat, make it into a DataFrame and concatenate it along the columns
#We need to reset the index because otherwise we end up with a bunch of NA's
df_review_words = pd.concat([reviews, pd.DataFrame(docu_feat.toarray()).reset_index()], axis=1)
df_review_words.head(5)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,...,7737,7738,7739,7740,7741,7742,7743,7744,7745,7746
0,,,,,,,,,,,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.0,1077.0,60.0,Some major design flaws,I had such high hopes for this dress and reall...,3.0,0.0,0.0,General,Dresses,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.0,1049.0,50.0,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5.0,1.0,0.0,General Petite,Bottoms,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4.0,847.0,47.0,Flattering shirt,This shirt is very flattering to all due to th...,5.0,1.0,6.0,General,Tops,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


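Each cell contains the number of times the word occurs in the review, so a word that is not in the review gives 0. Note that the rows above are not aligned correctly: reviews contains all product classes and keeps its original index labels, while the word counts only cover the dress reviews and are numbered from 0. pd.concat matches rows on those labels, which is why rows 0 and 1 have NaN in the review columns and why, for example, the counts in row 3 do not belong to the jumpsuit review shown there.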
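
Because pd.concat with axis=1 aligns rows on their index labels, the two frames only line up when they share the same index. Below is a sketch of one possible way to keep each dress review next to its own word counts. It uses a made-up three-row frame in place of reviews; the column names mirror the notebook, while the frame names, the counts and the words frame are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for `reviews` after dropna(): the index has gaps
toy_reviews = pd.DataFrame(
    {"Class Name": ["Dresses", "Pants", "Dresses"], "Rating": [3, 5, 2]},
    index=[2, 3, 5],
)
toy_dresses = toy_reviews.loc[toy_reviews["Class Name"] == "Dresses"]

# Made-up word counts, one row per dress review, numbered 0..n-1
words = pd.DataFrame({"cheap": [1, 0], "love": [0, 2]})

# Joining on mismatched labels creates rows filled with NaN
misaligned = pd.concat([toy_reviews, words], axis=1)

# Resetting the index of the *filtered* frame makes both sides 0..n-1
aligned = pd.concat([toy_dresses.reset_index(drop=True), words], axis=1)

print(misaligned)
print(aligned)
```

In the notebook this would correspond to concatenating dress_df (with its index reset) instead of reviews, under the assumption that the rows of docu_feat are in the same order as dress_df.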

# Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars).

In [20]:
dress_df['rating2'] = dress_df['Rating'] > 3
#rating2 is a boolean label: True = positive (4 or 5 stars), False = neutral/negative (1 to 3 stars)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

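The warning above appears because dress_df was created as a filtered slice of reviews. One possible way to avoid it, and to get readable 'positive'/'negative' labels instead of True/False, is sketched below on a made-up frame. The column name sentiment and the frame names are assumptions, not part of the notebook.

```python
import pandas as pd

# Made-up stand-in for `reviews` (hypothetical values)
toy_reviews = pd.DataFrame(
    {"Class Name": ["Dresses", "Pants", "Dresses", "Dresses"],
     "Rating": [3, 5, 5, 4]}
)

# .copy() gives an independent frame, so adding a column raises no warning
toy_dresses = toy_reviews.loc[toy_reviews["Class Name"] == "Dresses"].copy()

# Boolean label as in the notebook, plus a readable version of it
toy_dresses["rating2"] = toy_dresses["Rating"] > 3
toy_dresses["sentiment"] = toy_dresses["rating2"].map(
    {True: "positive", False: "negative"}
)

print(toy_dresses)
```

The rest of the notebook keeps using the boolean rating2 column, so the outputs below stay labelled False and True.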

# Splitting the data into a training and a test set.

In [21]:
from sklearn.model_selection import train_test_split
y = dress_df['rating2'] #The positive/negative label is our y-variable
X = docu_feat #The document-feature matrix (word counts per dress review) holds our features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #Split the data 70/30 and store it in different variables

In [22]:
dress_df

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,rating2
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,False
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,False
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses,True
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses,True
10,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses,False
...,...,...,...,...,...,...,...,...,...,...,...,...
23478,23478,1104,32,Unflattering,I was surprised at the positive reviews for th...,1,0,0,General Petite,Dresses,Dresses,False
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,True
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses,False
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,False


Now, we will use the <b>Naïve Bayes classifier</b> from sklearn.

In [23]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB() #create the model

nb = nb.fit(X_train, y_train) #fit the model: X=features (word counts), y=label (positive or not)

# Evaluate the performance of the model on the test set.

In [24]:
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8542183622828784

The accuracy is about 85%, which looks good!
I'll make a confusion matrix to see which kinds of errors the model makes.

In [25]:
# to see the labels
nb.classes_

array([False,  True])

In [26]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Negative', 'Positive'], columns=['Negative pred', 'Positive pred'])
cm

Unnamed: 0,Negative pred,Positive pred
Negative,238,168
Positive,67,1139


In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #labels= sets which classes appear in the report and in which order

              precision    recall  f1-score   support

       False       0.78      0.59      0.67       406
        True       0.87      0.94      0.91      1206

   micro avg       0.85      0.85      0.85      1612
   macro avg       0.83      0.77      0.79      1612
weighted avg       0.85      0.85      0.85      1612
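
The numbers in the report follow directly from the confusion matrix. The sketch below recomputes them by hand, using the four counts copied from the matrix above; the variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix above
tn, fp = 238, 168   # actual negative: predicted negative / predicted positive
fn, tp = 67, 1139   # actual positive: predicted negative / predicted positive

accuracy = (tn + tp) / (tn + fp + fn + tp)

precision_pos = tp / (tp + fp)   # of all predicted positive, how many are positive
recall_pos = tp / (tp + fn)      # of all actual positive, how many were found
precision_neg = tn / (tn + fn)   # of all predicted negative, how many are negative
recall_neg = tn / (tn + fp)      # of all actual negative, how many were found

print(f"accuracy {accuracy:.2f}")
print(f"False  precision {precision_neg:.2f}  recall {recall_neg:.2f}")
print(f"True   precision {precision_pos:.2f}  recall {recall_pos:.2f}")
```

The low recall for the negative class comes from the 168 of 406 negative reviews that the model labels as positive.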



In [28]:
dress_df['rating2'].value_counts(normalize=True)

True     0.753119
False    0.246881
Name: rating2, dtype: float64

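This shows that the classes are imbalanced: about 75% of the dress reviews are positive. A model that always predicts 'positive' would therefore already reach an accuracy of roughly 75%, so the 85% of the Naïve Bayes model is an improvement of about 10 percentage points. The imbalance probably also helps explain why the positive class has a much higher recall (0.94) than the negative class (0.59).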
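
The class shares can also serve as a baseline for the accuracy. A short sketch, using the support values from the classification report above and the rounded score from nb.score; the always-positive model is hypothetical and only serves as a point of comparison.

```python
# Support values copied from the classification report above
n_negative, n_positive = 406, 1206
model_accuracy = 0.8542  # nb.score(X_test, y_test), rounded

# Accuracy of a hypothetical model that always predicts "positive"
baseline_accuracy = n_positive / (n_negative + n_positive)

print(f"baseline: {baseline_accuracy:.3f}")
print(f"gain over baseline: {model_accuracy - baseline_accuracy:.3f}")
```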

In [29]:
print(reviews.iloc[0,4])
print(nb.predict_proba(X[0]))

I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
[[9.99999936e-01 6.43229512e-08]]


In [30]:
for i in range(3):
    prob = nb.predict_proba(X[i])
    print(f"Review: {i}. {reviews.iloc[i,4]}")
    print(f"False: {prob[0,0]}, Positive: {prob[0,1]}")

Review: 0. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c
False: 0.9999999356770694, Positive: 6.432295123973546e-08
Review: 1. I love, love, love this jumpsuit. it's fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
False: 0.6489362653628143, Positive: 0.3510637346371588
Review: 2. This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!
False: 0.0002889064395324252, Pos

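In this example, only review 0 is printed next to its own probabilities: the texts come from reviews, while the rows of X come from dress_df. Rows 1 and 2 of X are the dress reviews with index 5 (2 stars) and index 8 (5 stars), not the jumpsuit and shirt reviews printed above.
Compared with the right reviews, all 3 predictions match the star rating: row 0 is predicted negative (3 stars), row 1 is predicted negative with a probability of about 0.65 (2 stars), and row 2 is predicted positive (5 stars).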
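
Since X was built from dress_df, row i of X belongs to dress_df.iloc[i], not to reviews.iloc[i]. A small sketch with a made-up frame (hypothetical texts and frame names) shows how the two positional lookups diverge once the non-dress rows are filtered out:

```python
import pandas as pd

# Made-up stand-in for `reviews` (hypothetical texts)
toy_reviews = pd.DataFrame(
    {"Review Text": ["dress A", "jumpsuit", "shirt", "dress B"],
     "Class Name": ["Dresses", "Pants", "Blouses", "Dresses"]},
    index=[2, 3, 4, 5],
)
toy_dresses = toy_reviews.loc[toy_reviews["Class Name"] == "Dresses"]

# The feature matrix has one row per *dress* review, in this order
texts_for_X = toy_dresses["Review Text"].values

for i in range(len(texts_for_X)):
    from_reviews = toy_reviews.iloc[i]["Review Text"]
    from_dresses = toy_dresses.iloc[i]["Review Text"]
    print(i, "| reviews:", from_reviews, "| dress_df:", from_dresses)
```

Printing the text from the same source as the features (for example the text array used to fit the vectorizer) keeps each review next to its own probabilities.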