# Text mining, assignment week 6
---


The task for this notebook is to train a Naïve Bayes classifier predicting whether a dress review is positive (>3 stars) or neutral/negative (<4 stars).

First task: import every library we need, and the dataset (I've renamed it to "Assignment_text_mining.csv").

---

## Naive Bayes Bag of Words model

For this assignment I use a Naive Bayes classifier on a bag-of-words representation of the review text. Before the model can be trained, the text has to be cleaned up: stop words are removed, and ideally words are also reduced to their word stem. Note that `CountVectorizer`, as used below, only removes stop words; it does not stem. Time to get down to business!

Once the text is cleaned up, we can count the words. It's called a bag of words because word order is thrown away and only the counts are kept. It's like shaking all the words in a bag: you can still grab each word and count how often it occurs, but you can no longer tell where it stood in the review.
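
To make the "bag" idea concrete, here is a minimal sketch in plain Python. The helper `bag_of_words` and the two example sentences are made up for illustration; they are not part of the assignment data, and the real notebook uses `CountVectorizer` instead.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two made-up sentences with the same words in a different order
a = bag_of_words("love this dress love the fit")
b = bag_of_words("the fit love this love dress")

print(a["love"])  # how often "love" occurs in the first sentence
print(a == b)     # the bags are identical because order is ignored
```

Both sentences end up with exactly the same counts, which is all the classifier gets to see.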

These word counts give us a numeric representation of the text that a statistical model can work with, which is what Naive Bayes does.
From how often each word occurs in each category, the algorithm estimates the probability that a review belongs to one category or the other, and predicts the most probable one.
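
As a rough sketch of that calculation, the snippet below implements a tiny multinomial Naive Bayes by hand. The three training "reviews", the labels `pos`/`neg`, the helper names `fit` and `predict`, and the smoothing value `alpha=1.0` are all assumptions for illustration; this is not how `MultinomialNB` is implemented internally, only the same idea in a few lines.

```python
import math
from collections import Counter

# Toy training data, made up for illustration: (tokens, label)
train = [
    ("love this dress".split(), "pos"),
    ("love the fit".split(), "pos"),
    ("poor fit".split(), "neg"),
]

def fit(docs, alpha=1.0):
    """Estimate a log prior and Laplace-smoothed word log likelihoods per class."""
    vocab = {w for tokens, _ in docs for w in tokens}
    labels = {lab for _, lab in docs}
    model = {}
    for label in labels:
        class_docs = [tokens for tokens, lab in docs if lab == label]
        counts = Counter(w for tokens in class_docs for w in tokens)
        total = sum(counts.values())
        log_prior = math.log(len(class_docs) / len(docs))
        log_like = {
            w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_like)
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in tokens if w in log_like)
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "love the dress".split()))
print(predict(model, "poor fit".split()))
```

The "naive" part is the sum over words: each word contributes to the score independently of the words around it.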

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df = pd.read_csv('Assignment_text_mining.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


---

## Data preprocessing

First we need to create a dummy variable for the positive and negative reviews. I'll use ">3" for positive reviews and "<4" for negative (or here: Not Positive) reviews.

Since this assignment is about dresses (luckily not a gold/blue one), the new DataFrame is named "df_dresses".

In [3]:
df['Classification'] = np.nan
df.loc[df['Rating'] > 3, 'Classification'] = 'Positive' 
df.loc[df['Rating'] < 4, 'Classification'] = 'Not Positive' 
df = pd.get_dummies(df, columns=['Classification'])
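df_dresses = df[(df['Class Name'] == 'Dresses')] #Keep only the reviews of dresses
#Note: the next line selects from df again, so it overwrites the dress filter and every product class is kept.
#Selecting from df_dresses instead would restrict the data to dresses; the outputs in this notebook come from the code as written.
df_dresses = df[['Review Text', 'Classification_Positive']]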
df_dresses.head()

Unnamed: 0,Review Text,Classification_Positive
0,Absolutely wonderful - silky and sexy and comf...,1
1,Love this dress! it's sooo pretty. i happene...,1
2,I had such high hopes for this dress and reall...,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1
4,This shirt is very flattering to all due to th...,1


Now we have a dataframe with positive and not positive reviews. Yay! Let's get busy with the review text.

In [4]:
text = df_dresses['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #Fit the vectorizer: it learns the vocabulary from the review text
feature_names = list(vect.get_feature_names_out()) #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[10:20]}")

There are 13856 words in the vocabulary. A selection: ['0p', '0petite', '0r', '0verall', '0xs', '10', '100', '1000', '100lb', '100lbs']


Now we can start counting words from our reviews.

In [5]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat) #Let's print the matrix

  (0, 581)	1
  (0, 2788)	1
  (0, 10684)	1
  (0, 10929)	1
  (0, 13631)	1
  (1, 1446)	2
  (1, 1845)	1
  (1, 3537)	1
  (1, 3701)	1
  (1, 4035)	1
  (1, 5421)	1
  (1, 5725)	1
  (1, 5930)	1
  (1, 6667)	1
  (1, 6754)	1
  (1, 6986)	1
  (1, 7137)	1
  (1, 7257)	2
  (1, 7671)	1
  (1, 8364)	1
  (1, 8432)	1
  (1, 8889)	3
  (1, 9340)	1
  (1, 11293)	1
  (1, 11631)	1
  :	:
  (23484, 8206)	1
  (23484, 8839)	1
  (23484, 8842)	1
  (23484, 10864)	1
  (23484, 11385)	1
  (23484, 11864)	1
  (23484, 12082)	1
  (23484, 12091)	1
  (23484, 12939)	1
  (23484, 13281)	1
  (23484, 13316)	1
  (23484, 13380)	1
  (23484, 13414)	1
  (23484, 13685)	1
  (23485, 2796)	1
  (23485, 4035)	1
  (23485, 4163)	1
  (23485, 4796)	1
  (23485, 4884)	1
  (23485, 5893)	1
  (23485, 7264)	1
  (23485, 8842)	1
  (23485, 9058)	1
  (23485, 9774)	1
  (23485, 13390)	1


---

## Make a model, review that model.

We need to create a train and a test set, and decide what the X and y values are. The word counts are X, and whether a review is positive is the y variable.

In [6]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB() #create the model

X = docu_feat 
y = df_dresses['Classification_Positive']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = nb.fit(X_train, y_train) #fit the model X=features, y=labels

In [7]:
#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8685779165483962

When running the model, I see an accuracy of about 0.87, so the algorithm is right about 87% of the time. That is pretty high, but not high 'enough' to be completely trusted (I guess).

---

## It used confusion matrix, it hurt itself in its confusion!


In [8]:
cm = confusion_matrix(y_test, y_test_p)
cm = pd.DataFrame(cm, index=['Not Positive', 'Positive'], columns=['Pred Not Positive', 'Pred Positive']) #Rows are the true labels, columns the predictions, both in label order 0, 1
cm

Unnamed: 0,Pred Not Positive,Pred Positive
Not Positive,1097,518
Positive,408,5023


In [9]:
print(f"The precision for a not positive review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[1,0])}") #this uses the coordinates of the confusion matrix
print(f"The recall for a not positive review is: {cm.iloc[0,0]/(cm.iloc[0,0]+cm.iloc[0,1])}")
print(f"The precision for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[0,1])}")
print(f"The recall for a positive review is: {cm.iloc[1,1]/(cm.iloc[1,1]+cm.iloc[1,0])}")

The precision for a not positive review is: 0.7289036544850498
The recall for a not positive review is: 0.6792569659442724
The precision for a positive review is: 0.906515069482043
The recall for a positive review is: 0.9248757134965936
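
As a cross-check, the same figures can be recomputed from the four raw counts alone. scikit-learn's `confusion_matrix` puts the true labels on the rows and the predictions on the columns, sorted by label value, so the first row and column belong to class 0 (not positive). The variable names below (`tn`, `fp`, `fn`, `tp`, `baseline`) are my own; the F1 score and the majority-class baseline are extras that the notebook itself does not compute.

```python
# Counts copied from the confusion matrix above.
# Rows = true class, columns = predicted class, both ordered 0 then 1.
tn, fp = 1097, 518   # true class 0 (not positive)
fn, tp = 408, 5023   # true class 1 (positive)

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of a "model" that always predicts the majority class (positive)
baseline = (tp + fn) / total

print(f"accuracy:               {accuracy:.3f}")
print(f"positive precision:     {precision_pos:.3f}, recall: {recall_pos:.3f}, F1: {f1_pos:.3f}")
print(f"not positive precision: {precision_neg:.3f}, recall: {recall_neg:.3f}")
print(f"majority baseline:      {baseline:.3f}")
```

Comparing the accuracy with the baseline gives a fairer picture than the accuracy alone, because most reviews in the test set are positive.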


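Scikit-learn orders the rows and columns of the confusion matrix by label value, so the first row and column belong to class 0 (not positive). That means the precision of about 0.73 and the recall of about 0.68 belong to the not positive reviews, while positive reviews reach a precision of about 0.91 and a recall of about 0.92. The model is clearly weaker on the minority class: 5431 of the 7046 test reviews are positive, so always predicting "positive" would already be right about 77% of the time. Now that we have this information, let's add another column, Rating_Prediction, so I can investigate and interpret the cases that were not identified correctly.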

In [10]:
df["Rating_Prediction"] = nb.predict(X) #Predict for every review, training rows included


df_dresses = df[['Review Text', 'Classification_Positive', 'Rating_Prediction']]
df_dresses.head(23000) #Review 22998 is labelled 1 (positive) but predicted 0; we look at it in the next step.

Unnamed: 0,Review Text,Classification_Positive,Rating_Prediction
0,Absolutely wonderful - silky and sexy and comf...,1,1
1,Love this dress! it's sooo pretty. i happene...,1,1
2,I had such high hopes for this dress and reall...,0,0
3,"I love, love, love this jumpsuit. it's fun, fl...",1,1
4,This shirt is very flattering to all due to th...,1,1
...,...,...,...
22995,Oh what a disappointment! i was looking forwar...,0,0
22996,Got this in the petite xs in mint; the color i...,1,1
22997,"These socks are soft and comfortable, and they...",1,1
22998,"Beautiful, unique design. it's very flattering...",1,0


---

## So when is the model going to trip?

I think the model is going to trip when a negative word is used in a positive setting ("I love the hell out of this dress", or "I can't really believe I hate myself for loving this dress").
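
A related way the model can trip is that a bag of words loses context entirely, including negation. The sketch below uses a tiny hand-written stop list (`STOP_WORDS`, an assumption for illustration; scikit-learn's built-in English list is much longer and may differ) and two made-up sentences to show how a compliment and a complaint can end up with the same bag.

```python
from collections import Counter

# A tiny hand-written stop list, for illustration only
STOP_WORDS = {"i", "do", "not", "this", "the", "is", "at", "all"}

def bag(text):
    """Strip simple punctuation, lowercase, drop stop words, and count the rest."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

praise = bag("I love this dress")
complaint = bag("I do not love this dress")

print(praise)
print(praise == complaint)  # both sentences reduce to the same counts
```

Once "not" is filtered out, the classifier has no way of telling these two sentences apart.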

In [11]:
df_dresses.iloc[1]

Review Text                Love this dress!  it's sooo pretty.  i happene...
Classification_Positive                                                    1
Rating_Prediction                                                          1
Name: 1, dtype: object

In [12]:
df_dresses.iloc[2]

Review Text                I had such high hopes for this dress and reall...
Classification_Positive                                                    0
Rating_Prediction                                                          0
Name: 2, dtype: object

The first two were identified correctly, so let's look at the review flagged above (number 22998) to see whether it was identified correctly or incorrectly.

In [13]:
df_dresses.iloc[22998]

Review Text                Beautiful, unique design. it's very flattering...
Classification_Positive                                                    1
Rating_Prediction                                                          0
Name: 22998, dtype: object

In [14]:
df_dresses["Review Text"].iloc[22998]

"Beautiful, unique design. it's very flattering the way the fabric hangs.\n\ni usually wear a 10, but because the lining is much smaller than the dress (which many linings are... it's a pet peeve of mine), i sized up to a 12 and took in the shoulders a couple inches so the bodice area would sit correctly on my bust. otherwise in the 10 my stomach would feel squeezed even if it looked like it fit perfectly. that said, i'm very sensitive to that kind of thing.\n\nother reviewers have noted that the emb"

This review uses a lot of positive words, but it also seems to contain negative wording, such as "not". I don't think this is the best example of the model tripping up, but it is odd that it doesn't identify this review as "positive". For now I'll let it be; I think the model is pretty A-okay for this assignment.

### Would I use this model?
Probably, because the chance of a correct prediction is fairly high (about 87% on the test set), although I can see that in a business setting it could be a bit of a gamble.

Just my 2 cents.