# Text Mining - Naive Bayes | Bag-of-Words

Bag of words - A model that treats a document as an unordered collection of words and counts each occurrence of a word. The model is deliberately simple: it does not take morphology, syntax or pragmatics into account and only counts how often each word occurs. <br><br>
Naive Bayes - Naive Bayes uses the frequency of specific words to calculate the probability that a text belongs to a certain category. However, it treats every word as independent of the others, so it ignores dependencies between words. For example, "crime" and "police" could be strongly correlated, but Naive Bayes does not consider such relations.
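
To make the "collection of words" idea concrete, here is a minimal sketch on two made-up sentences (the sentences and variable names are illustrative assumptions, not part of the dataset). Both sentences contain the same words in a different order, so their bag-of-words vectors come out identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical sentences with the same words in a different order
toy_docs = ["dog bites man", "man bites dog"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs).toarray()

# Each row holds the word counts of one sentence; word order is lost
print(sorted(toy_vect.vocabulary_))
print(toy_counts)
same_vector = (toy_counts[0] == toy_counts[1]).all()
print(same_vector)
```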

In [9]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [10]:
df = pd.read_csv('clothing_reviews.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


To work with the rating system, we classify each review as either positive (rating > 3) or not positive (rating < 4). <br>
We choose positive versus not positive because a two-class label is easy to turn into a dummy variable. We need this classification to train our model.

In [11]:
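# Label each review: ratings above 3 are 'Positive', the rest 'Not Positive'
df['Classification'] = np.where(df['Rating'] > 3, 'Positive', 'Not Positive')
# dtype=int keeps the dummy columns as 0/1
df = pd.get_dummies(df, columns=['Classification'], dtype=int)
# Note: an earlier filter on df['Class Name'] == 'Dresses' was overwritten by the selection
# below, so despite its name df_dresses contains the reviews of all classes
df_dresses = df[['Review Text', 'Classification_Positive']]
df_dresses.head()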


Unnamed: 0,Review Text,Classification_Positive
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1
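
The same 0/1 label can also be built without dummy variables. A minimal sketch, assuming a toy frame with made-up ratings (`toy` is a hypothetical name): comparing the rating with the threshold gives the label directly. Depending on the pandas version, `get_dummies` may return `True`/`False` instead of 1/0 unless a dtype is given, which this approach sidesteps.

```python
import pandas as pd

# Hypothetical ratings, one per review
toy = pd.DataFrame({"Rating": [5, 4, 3, 1]})

# 1 for ratings above 3, 0 for all other ratings
toy["Classification_Positive"] = (toy["Rating"] > 3).astype(int)
print(toy)
```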


We will now turn the review text into features. Here the features are the words found in the reviews; they are called features because the model uses them as input variables for training and testing. <br><br>
We use a stop word filter to clean up our data so that the model trains only on relevant words. The words are first filtered through an English stop word list, which drops "filler" words such as "and" and "or".
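
The effect of the stop word list can be seen on a single made-up sentence. This is only a sketch; the sentence and the names `with_stop` and `without_stop` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up review sentence
toy_review = ["This dress is pretty and the fabric is soft"]

# Same sentence, once with every word and once with English stop words removed
with_stop = CountVectorizer().fit(toy_review)
without_stop = CountVectorizer(stop_words="english").fit(toy_review)

print(sorted(with_stop.vocabulary_))
print(sorted(without_stop.vocabulary_))
```

The second vocabulary is shorter because filler words such as "and" and "the" are dropped before counting.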

In [12]:
text = df_dresses['Review Text'].values.astype('U')  # Take the text from the df and convert it to Unicode
vect = CountVectorizer(stop_words='english')  # Create the CountVectorizer object with English stop words
vect = vect.fit(text)  # Learn the vocabulary from the review text
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vect.get_feature_names_out().tolist()
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[10:20]}")

There are 13856 words in the vocabulary. A selection: ['0p', '0petite', '0r', '0verall', '0xs', '10', '100', '1000', '100lb', '100lbs']
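
One thing to be aware of: `Review Text` has 845 missing values (22,641 non-null out of 23,486 rows), and casting a missing value to Unicode produces the literal string `nan`. The sketch below shows this on a hypothetical two-row column; filling missing values with an empty string first is one possible alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing review
toy_text = pd.Series(["great dress", np.nan], dtype=object)

# Casting to Unicode turns the missing value into the literal string 'nan'
as_unicode = toy_text.values.astype("U")
print(as_unicode)

# Filling missing values first gives an empty document instead
filled = toy_text.fillna("").values.astype("U")
print(filled)
```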


In [13]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat)

  (0, 581)	1
  (0, 2788)	1
  (0, 10684)	1
  (0, 10929)	1
  (0, 13631)	1
  (1, 1446)	2
  (1, 1845)	1
  (1, 3537)	1
  (1, 3701)	1
  (1, 4035)	1
  (1, 5421)	1
  (1, 5725)	1
  (1, 5930)	1
  (1, 6667)	1
  (1, 6754)	1
  (1, 6986)	1
  (1, 7137)	1
  (1, 7257)	2
  (1, 7671)	1
  (1, 8364)	1
  (1, 8432)	1
  (1, 8889)	3
  (1, 9340)	1
  (1, 11293)	1
  (1, 11631)	1
  :	:
  (23484, 8206)	1
  (23484, 8839)	1
  (23484, 8842)	1
  (23484, 10864)	1
  (23484, 11385)	1
  (23484, 11864)	1
  (23484, 12082)	1
  (23484, 12091)	1
  (23484, 12939)	1
  (23484, 13281)	1
  (23484, 13316)	1
  (23484, 13380)	1
  (23484, 13414)	1
  (23484, 13685)	1
  (23485, 2796)	1
  (23485, 4035)	1
  (23485, 4163)	1
  (23485, 4796)	1
  (23485, 4884)	1
  (23485, 5893)	1
  (23485, 7264)	1
  (23485, 8842)	1
  (23485, 9058)	1
  (23485, 9774)	1
  (23485, 13390)	1


The sparse matrix output is hard for humans to read: we cannot tell which index stands for which word. That is why we concatenate the matrix with the reviews into a readable dataframe. Each value is the number of times a specific word occurs in a review; a zero means the word does not occur. <br><br>
The reason we see so many zeros in the dataframe is simple: each review uses only a handful of the almost 14,000 words in the vocabulary. <br>
One exception could be the word "dress", but the columns are labelled by index rather than by word, so we cannot see directly which column represents it.

In [14]:
rev_words = pd.concat([df_dresses, pd.DataFrame(docu_feat.toarray())], axis=1)
rev_words.head(5)

Unnamed: 0,Review Text,Classification_Positive,0,1,2,3,4,5,6,7,...,13846,13847,13848,13849,13850,13851,13852,13853,13854,13855
0,Absolutely wonderful - silky and sexy and comf...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Love this dress! it's sooo pretty. i happene...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,I had such high hopes for this dress and reall...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,This shirt is very flattering to all due to th...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The next step is to assign our X and y, after which we fit our model and check how accurate it is.<br> For the X we use the document features, that is, the matrix we made previously. Every single word will be used as a separate variable. <br><br>
Naive Bayes is "naive" because it assumes that, given the class, the variables/features are independent of each other.<br>
Our y is the classification of the review: positive (1) or not positive (0).
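
For reference, the independence assumption can be written out as follows. This is the standard multinomial Naive Bayes decision rule with additive smoothing, given as a sketch; the symbols are introduced here and do not appear elsewhere in the notebook. $x_i$ is the count of word $w_i$ in the review, $N_{c,i}$ is the total count of word $w_i$ in the training reviews of class $c$, $N_c = \sum_i N_{c,i}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (1 by default in `MultinomialNB`).

```latex
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log P(w_i \mid c) \right],
\qquad
P(w_i \mid c) = \frac{N_{c,i} + \alpha}{N_c + \alpha\,|V|}
```

Each word contributes its own term to the sum, independently of which other words appear next to it.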

In [15]:
X = docu_feat 
y = df_dresses['Classification_Positive']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Below we create our Naive Bayes model and fit it on the training data. Let's see how well it can predict our test data set!

In [16]:
nb = MultinomialNB()
nb = nb.fit(X_train, y_train)
y_test_p = nb.predict(X_test)

We use the score method that is built into the Naive Bayes model, which returns the accuracy on the test set. It shows that our model correctly predicts almost 87% of the test cases, which is pretty impressive. <br><br>
If we wanted to, we could add several extra features to see whether our predictions become more accurate. <br> Some possible extra features are word length, number of characters and punctuation.

In [17]:
nb.score(X_test, y_test, sample_weight=None)

0.8685779165483962
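
An accuracy figure is easier to judge next to a baseline that always predicts the most frequent class. The sketch below uses made-up labels and scikit-learn's `DummyClassifier`; in the notebook, the equivalent would be fitting it on `X_train`, `y_train` and scoring it on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 3 of 4 reviews are positive
toy_X = np.zeros((4, 1))  # the features are ignored by the dummy model
toy_y = np.array([1, 1, 1, 0])

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_acc = baseline.score(toy_X, toy_y)
print(baseline_acc)
```

If the classes are imbalanced, this baseline can already be fairly high, so the gap between it and the model's accuracy is the more informative number.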

We can create a quick confusion matrix to see how the prediction model holds up. Let's also print the results of the precision and recall for both labels.

In [18]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_p)
# confusion_matrix sorts the labels: row/column 0 is class 0 (Not Positive), row/column 1 is class 1 (Positive)
cm = pd.DataFrame(cm, index=['Not Positive', 'Positive'], columns=['Pred Not Positive', 'Pred Positive'])
cm

Unnamed: 0,Pred Not Positive,Pred Positive
Not Positive,1097,518
Positive,408,5023


In [19]:
print(f"The precision for a not positive review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}")
print(f"The recall for a not positive review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")

The precision for a not positive review is: 0.7289036544850498
The recall for a not positive review is: 0.6792569659442724
The precision for a positive review is: 0.906515069482043
The recall for a positive review is: 0.9248757134965936


The model is better at predicting positive reviews than it is at predicting not positive reviews. An explanation could be class imbalance: 5,431 of the 7,046 test reviews (about 77%) are positive, so, assuming the training split has a similar balance, the model has far more positive examples to learn from. More examples of a class often means better predictions for that class. <br><br>
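
Because the row and column order of a confusion matrix is easy to mix up, it can help to fix the order explicitly and let scikit-learn compute the per-class metrics. A minimal sketch on made-up labels (the `toy_` names and values are assumptions): `labels=[0, 1]` pins the order, and `pos_label` selects the class a metric is computed for.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = positive, 0 = not positive
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# labels=[0, 1] fixes the order: row/column 0 is class 0, row/column 1 is class 1
# Rows are the true classes, columns are the predicted classes
toy_cm = confusion_matrix(toy_true, toy_pred, labels=[0, 1])
print(toy_cm)

# pos_label selects the class the metric is computed for
prec_pos = precision_score(toy_true, toy_pred, pos_label=1)
rec_pos = recall_score(toy_true, toy_pred, pos_label=1)
prec_neg = precision_score(toy_true, toy_pred, pos_label=0)
print(prec_pos, rec_pos, prec_neg)
```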
We can dive further into the cases that the model does not predict correctly: what could be the reason for the errors? <br>
We create a new column with the predicted value and compare it to the known classification. Let's see what that gives us.

In [31]:
df["Rating_Prediction"] = nb.predict(X)
df_dresses = df[['Review Text', 'Classification_Positive', 'Rating_Prediction']]
print(df_dresses)


                                             Review Text  \
0      Absolutely wonderful - silky and sexy and comf...   
1      Love this dress!  it's sooo pretty.  i happene...   
2      I had such high hopes for this dress and reall...   
3      I love, love, love this jumpsuit. it's fun, fl...   
4      This shirt is very flattering to all due to th...   
...                                                  ...   
23481  I was very happy to snag this dress at such a ...   
23482  It reminds me of maternity clothes. soft, stre...   
23483  This fit well, but the top was very see throug...   
23484  I bought this dress for a wedding i have this ...   
23485  This dress in a lovely platinum is feminine an...   

       Classification_Positive  Rating_Prediction  
0                            1                  1  
1                            1                  1  
2                            0                  0  
3                            1                  1  
4                  
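
To look only at the cases where the prediction and the known classification disagree, the two columns can be compared directly. A minimal sketch on a hypothetical frame with the same column names as above (the rows are made up):

```python
import pandas as pd

# Hypothetical frame with the same columns as in the notebook
toy = pd.DataFrame({
    "Review Text": ["love it", "too small", "would have been great"],
    "Classification_Positive": [1, 0, 0],
    "Rating_Prediction": [1, 0, 1],
})

# Keep only the rows where the prediction differs from the known label
wrong = toy[toy["Classification_Positive"] != toy["Rating_Prediction"]]
print(wrong)
```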

In [33]:
df_dresses['Review Text'].iloc[23483]

"This fit well, but the top was very see through. this never would have worked for me. i'm glad i was able to try it on in the store and didn't order it online. with different fabric, it would have been great."

As we can see, although this review is negative, it uses several words that usually signal a positive review, such as "great", "worked" and "glad". Since Naive Bayes is naive, it does not look at how these words combine with the words around them, as in "never would have worked" or "would have been great". As a result, it classifies the review as positive.
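
One possible way to give the model some word context, which this notebook does not do, is to count pairs of adjacent words (bigrams) in addition to single words. The sketch below uses made-up phrases and the `ngram_range` parameter of `CountVectorizer`; whether this would actually improve the score on the review data is untested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up phrases where the surrounding words change the meaning of "great"
toy_docs = ["would have been great", "not great"]

# ngram_range=(1, 2) counts single words and pairs of adjacent words
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)
print(sorted(bigram_vect.vocabulary_))
```

With bigrams, "not great" and "been great" become features of their own, so the classifier can weigh them differently from the single word "great".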