# Week 6: Text mining

The task for this notebook is to train a Naïve Bayes classifier predicting whether a dress review is positive (>3 stars) or neutral/negative (<4 stars).

#### Imports

In [1]:
import sklearn as sk
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

## Naive Bayes and Bag-of-words model

The bag-of-words model turns a text document into word counts. After removing stop words and, optionally, using lemmatization to reduce each word to its base form (lemma), the remaining words in the document are counted. It is called a bag of words because word order is discarded and only the counts are kept. The result is a table of words and their frequencies. (The CountVectorizer used below removes stop words but does not lemmatize.)
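
To make the counting step concrete, here is a minimal sketch using only the standard library. The stop-word list and the two sentences are made up for illustration; the notebook itself uses scikit-learn's `CountVectorizer` instead.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only
STOP_WORDS = {"this", "is", "a", "the", "and", "it", "i"}

def bag_of_words(document):
    """Lower-case, tokenize, drop stop words and count the remaining words."""
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag_a = bag_of_words("I love this dress and I love the fabric")
bag_b = bag_of_words("The fabric I love and this dress I love")

print(bag_a)
print(bag_a == bag_b)  # word order is ignored, so both bags are equal
```

Both sentences produce the same bag, which is exactly the information the model loses about word order.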

We can use this quantification of text in statistical models, which is exactly what Naive Bayes does. The algorithm predicts one of several categories (for example spam or not spam). For each category it combines the prior probability of the category with the probability of each observed word given that category, treating the words as independent of one another. The category with the highest resulting probability is the prediction.
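
The calculation can be sketched from scratch on a toy example. The training data, the function names `fit_naive_bayes` and `predict`, and the smoothing value are assumptions for illustration, not the notebook's model. Add-one (Laplace) smoothing keeps unseen words from getting probability zero; `alpha=1.0` is also the default in scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter

# Toy training data (made up for illustration): tokenized reviews and labels
train = [
    (["love", "great", "dress"], "positive"),
    (["great", "fit"], "positive"),
    (["poor", "fit", "returned"], "not positive"),
]

def fit_naive_bayes(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods."""
    class_docs = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in examples:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in examples for w in tokens}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_docs.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(tokens, log_prior, log_lik):
    """Return the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = fit_naive_bayes(train)
print(predict(["great", "dress"], log_prior, log_lik))
print(predict(["poor", "fit"], log_prior, log_lik))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long reviews.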

In [2]:
df = pd.read_csv('Assignment text mining - data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### Data pre-processing
Naive Bayes predicts a categorical target. Because the task is to separate positive from non-positive reviews rather than to predict the 1-5 score, we create a binary dummy variable for positive/not positive.

As we only focus on dresses, we filter the rows whose class name equals 'Dresses'.

In [3]:
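df['Classification'] = np.nan
df.loc[df['Rating'] > 3, 'Classification'] = 'Positive' #4 or 5 stars
df.loc[df['Rating'] < 4, 'Classification'] = 'Not Positive' #3 stars or fewer
df = pd.get_dummies(df, columns=['Classification']) #Creates Classification_Positive and Classification_Not Positive
df_dresses = df[(df['Class Name'] == 'Dresses')] #Keep only the dresses
#NOTE: the next line selects from df again, not from the filtered frame, so the dresses filter above is overwritten and all reviews are kept (see the output below)
df_dresses = df[['Review Text', 'Classification_Positive']]
df_dresses.head()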

Unnamed: 0,Review Text,Classification_Positive
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


Now we have a working dataframe of positive vs negative reviews and the associated review text. 
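
A minimal sketch, on made-up rows, of how the dresses filter and the binary label could be combined so that the filter is still in effect when the columns are selected. The frame `toy` and its contents are hypothetical; only the column names are taken from the data above.

```python
import pandas as pd

# Made-up rows with the same column names as the review data
toy = pd.DataFrame({
    "Class Name": ["Dresses", "Pants", "Dresses", "Blouses"],
    "Review Text": ["Love this dress", "Fun jumpsuit", "Major design flaws", "Flattering shirt"],
    "Rating": [5, 5, 3, 5],
})

# Keep only dresses and select the columns from the filtered rows in one step
toy_dresses = toy.loc[toy["Class Name"] == "Dresses", ["Review Text", "Rating"]].copy()

# 1 for a positive review (more than 3 stars), 0 otherwise
toy_dresses["Classification_Positive"] = (toy_dresses["Rating"] > 3).astype(int)
print(toy_dresses)
```

Selecting rows and columns in a single `.loc` call means there is no second statement that could accidentally start again from the unfiltered frame.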

In [4]:
text = df_dresses['Review Text'].values.astype('U') #Take the text from the df and convert it to Unicode (any missing value becomes the string 'nan')
vect = CountVectorizer(stop_words='english') #Create the CountVectorizer object, with English stop words
vect = vect.fit(text) #Learn the vocabulary from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[10:20]}")


There are 13856 words in the vocabulary. A selection: ['0p', '0petite', '0r', '0verall', '0xs', '10', '100', '1000', '100lb', '100lbs']


Now that we have the vocabulary, we can count the occurrences of each word in each review. This way, we create a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.

In [20]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat) #Let's print the matrix

  (0, 581)	1
  (0, 2788)	1
  (0, 10684)	1
  (0, 10929)	1
  (0, 13631)	1
  (1, 1446)	2
  (1, 1845)	1
  (1, 3537)	1
  (1, 3701)	1
  (1, 4035)	1
  (1, 5421)	1
  (1, 5725)	1
  (1, 5930)	1
  (1, 6667)	1
  (1, 6754)	1
  (1, 6986)	1
  (1, 7137)	1
  (1, 7257)	2
  (1, 7671)	1
  (1, 8364)	1
  (1, 8432)	1
  (1, 8889)	3
  (1, 9340)	1
  (1, 11293)	1
  (1, 11631)	1
  :	:
  (23484, 8206)	1
  (23484, 8839)	1
  (23484, 8842)	1
  (23484, 10864)	1
  (23484, 11385)	1
  (23484, 11864)	1
  (23484, 12082)	1
  (23484, 12091)	1
  (23484, 12939)	1
  (23484, 13281)	1
  (23484, 13316)	1
  (23484, 13380)	1
  (23484, 13414)	1
  (23484, 13685)	1
  (23485, 2796)	1
  (23485, 4035)	1
  (23485, 4163)	1
  (23485, 4796)	1
  (23485, 4884)	1
  (23485, 5893)	1
  (23485, 7264)	1
  (23485, 8842)	1
  (23485, 9058)	1
  (23485, 9774)	1
  (23485, 13390)	1
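
The printed matrix is sparse: each line reads `(row, column)  count`, and all word counts that are zero are simply left out. The following sketch rebuilds that layout by hand for three made-up documents. The alphabetical column order mirrors what `CountVectorizer` does, but the code itself is only an illustration, not scikit-learn's implementation.

```python
from collections import Counter

docs = ["love love dress", "dress fit", "fit poor fit"]

# Vocabulary: each distinct word gets a column index, in alphabetical order
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Sparse representation: only non-zero (row, column) -> count entries are stored
entries = {}
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        entries[(row, vocabulary[word])] = count

for (row, col), count in sorted(entries.items()):
    print(f"({row}, {col})\t{count}")
```

Here only 6 of the 12 cells are stored; with thousands of vocabulary words and short reviews, the saving is far larger.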


In [58]:
#words = pd.concat([df, pd.DataFrame(docu_feat.toarray())], axis=1)
#words.head(5)

## Building and evaluating the model

We now can create and evaluate the model. In order to do so we first determine X & y and create a training and test set.

In [7]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB() #create the model

X = docu_feat 
y = df_dresses['Classification_Positive']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = nb.fit(X_train, y_train) #fit the model: X = features (word counts), y = labels

In [8]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8685779165483962

The model scores 0.87 on the test set. This is its accuracy: it classifies about 87% of the test reviews correctly. It is not a measure of how certain the model is about any single prediction.
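
To put the accuracy in context, it helps to compare it with the share of the larger class, since always predicting that class would already reach this score. The sketch below recomputes both figures from the four test-set counts reported in the confusion matrix section of this notebook; the variable names are my own.

```python
# Counts copied from the confusion matrix reported for the test set
counts = [[1097, 518],
          [408, 5023]]

total = sum(sum(row) for row in counts)
correct = counts[0][0] + counts[1][1]  # diagonal = correct predictions
accuracy = correct / total

# Share of the larger true class: the accuracy of always predicting that class
majority_baseline = max(sum(row) for row in counts) / total

print(f"accuracy: {accuracy:.3f}, majority baseline: {majority_baseline:.3f}")
```

The model is therefore roughly ten percentage points better than always guessing the larger class.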

### Confusion Matrix

In [9]:
cm = confusion_matrix(y_test, y_test_p)
#confusion_matrix orders rows (actual) and columns (predicted) by sorted label: 0 = not positive, 1 = positive
cm = pd.DataFrame(cm, index=['Negative', 'Positive'], columns=['Pred Negative', 'Pred Positive'])
cm

Unnamed: 0,Pred Negative,Pred Positive
Negative,1097,518
Positive,408,5023


I was curious what the model predicts in terms of positive and negative reviews, so I also created a confusion matrix. Most reviews fall on the diagonal, meaning they were correctly identified as either positive or negative. Below I go into detail on precision and recall for each category.

In [12]:
print(f"The precision for a negative review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}") #this uses the coordinates of the confusion matrix
print(f"The recall for a negative review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")

The precision for a negative review is: 0.7289036544850498
The recall for a negative review is: 0.6792569659442724
The precision for a positive review is: 0.906515069482043
The recall for a positive review is: 0.9248757134965936


I am now adding a column 'Rating_Prediction' to the dataframe in order to investigate exactly which cases were not identified correctly and to interpret them.

In [51]:
df["Rating_Prediction"] = nb.predict(X) #Predict for all reviews, including the ones the model was trained on


df_dresses = df[['Review Text', 'Classification_Positive', 'Rating_Prediction']]
df_dresses

Unnamed: 0,Review Text,Classification_Positive,Rating_Prediction
0,Absolutely wonderful - silky and sexy and comf...,1,1
1,Love this dress! it's sooo pretty. i happene...,1,1
2,I had such high hopes for this dress and reall...,0,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1,1
4,This shirt is very flattering to all due to th...,1,1
...,...,...,...
23481,I was very happy to snag this dress at such a ...,1,1
23482,"It reminds me of maternity clothes. soft, stre...",0,0
23483,"This fit well, but the top was very see throug...",0,1
23484,I bought this dress for a wedding i have this ...,0,0


### When does the model trip?

I expect the model to trip when a reviewer uses a word that is usually associated with positive reviews, such as 'love', together with a negation (e.g. "I don't love that dress"), because the bag-of-words representation ignores word order and context.
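
One common way to keep some of this context is to count word pairs (bigrams) in addition to single words; in scikit-learn this is what `CountVectorizer(ngram_range=(1, 2))` does. The sketch below shows the idea with a hand-written helper `ngrams` (my own name, not a library function) on two made-up sentences. Note that stop-word lists often contain negation words such as "not", so stop-word removal can delete the very token that carries the negation.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) in token order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

positive = "i love that dress".split()
negated = "i do not love that dress".split()

# With single words, every word of the positive review also occurs in the negated one
print(set(positive) <= set(negated))

# With word pairs, the negation survives as its own feature
print("not love" in ngrams(negated, 2), "not love" in ngrams(positive, 2))
```

Whether bigrams actually improve this classifier would have to be tested; they also enlarge the vocabulary considerably.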

In [55]:
df_dresses.iloc[1]

Review Text                Love this dress!  it's sooo pretty.  i happene...
Classification_Positive                                                    1
Rating_Prediction                                                          1
Name: 1, dtype: object

In [56]:
df_dresses.iloc[2]

Review Text                I had such high hopes for this dress and reall...
Classification_Positive                                                    0
Rating_Prediction                                                          0
Name: 2, dtype: object

I can see that the reviews at index 1 and 2 were both identified correctly. Let's see what went wrong with an incorrectly identified one.

#### Example Case 23483

In [54]:
df_dresses.iloc[23483]

Review Text                This fit well, but the top was very see throug...
Classification_Positive                                                    0
Rating_Prediction                                                          1
Name: 23483, dtype: object

In [53]:
df_dresses["Review Text"].iloc[23483]

"This fit well, but the top was very see through. this never would have worked for me. i'm glad i was able to try it on in the store and didn't order it online. with different fabric, it would have been great."

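This review contains the word _great_, which pushes the algorithm towards a positive prediction. However, in this context ("with different fabric, it would have been great") it describes what the dress is not, so the review as a whole is negative and the algorithm categorizes it wrongly. This single example is consistent with my theory, although it does not prove it.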