# Week 6: text mining

In this assignment, we're going to use text mining to predict the rating of a dress from online reviews.

Objective:

Predict whether dress reviews are positive (>3 stars) or neutral/negative (<4 stars).

- Explain briefly in your own words how the bag-of-words model and Naïve Bayes work, and how they work together.

We take all the words out of the review text and count them, which gives us the probability that a certain word appears in a review with a given rating. This is the bag-of-words model: each review is reduced to a set of word counts. Naïve Bayes then uses those probabilities to rate new reviews: based on the words in a review, it estimates the chance that the review is positive or negative.
The model is called naive because it treats every word as independent of the others, so it only gives value to the words themselves, not to how they are used in a sentence. Because it ignores the relationships among words, the model has a high bias. Its simplicity also gives it a low variance, and it often works well in practice.
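To make the combination of the two concrete, here is a minimal sketch that does the counting and the probability calculation by hand. The three toy reviews are made up for illustration, and I assume add-one (Laplace) smoothing, which matches the default `alpha=1.0` of sklearn's MultinomialNB. Skipping words that never occurred in training is a simplification of mine.

```python
import math
from collections import Counter

# Hypothetical toy reviews, invented for illustration only.
train = [
    ("love this dress", "positive"),
    ("love the fit", "positive"),
    ("too small cheap fabric", "negative"),
]

# Bag of words: count how often each word occurs per class, ignoring word order.
word_counts = {"positive": Counter(), "negative": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}


def log_score(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word not in vocabulary:
            continue  # words never seen in training are skipped
        smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
        score += math.log(smoothed)
    return score


def predict(text):
    """Return the class with the highest score."""
    return max(class_counts, key=lambda label: log_score(text, label))


print(predict("love this cheap dress"))
print(predict("too small"))
```

Because the score is just a sum over the words, shuffling the words of a review leaves the prediction unchanged. That is exactly the "naive" part.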

## Pre-processing steps
- Pre-processing steps (don’t forget to filter out all non-dress reviews).
- The head() of the resulting dataframe.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


In [2]:
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', encoding='ISO-8859-1')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [3]:
# Subsetting only dresses
df = df.loc[df['Class Name'] == 'Dresses'].copy()  # .copy() so later column edits work on an independent dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses


In [4]:
# Checking the overall scores of the dresses
df['Rating'].value_counts()

5    3397
4    1395
3     838
2     461
1     228
Name: Rating, dtype: int64

We see here that there are far more positive (4-5 star) reviews (4792) than neutral/negative (1-3 star) reviews (1527).

Because we only have to distinguish neutral/negative reviews from positive ones, we merge the five star ratings into two classes. For ease of use, we label them with the strings 'positive' and 'negative'.

In [5]:
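# Assign the result back instead of calling replace(..., inplace=True) on a column selection,
# which may not update df in newer pandas versions
df['Rating'] = df['Rating'].replace({1: 'negative',
                                     2: 'negative',
                                     3: 'negative',
                                     4: 'positive',
                                     5: 'positive'})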

df['Rating'].value_counts()

positive    4792
negative    1527
Name: Rating, dtype: int64

In [6]:
# Creating a subset with only the columns we need
df = df[['Rating', 'Review Text']]
df.head()

Unnamed: 0,Rating,Review Text
1,positive,Love this dress! it's sooo pretty. i happene...
2,negative,I had such high hopes for this dress and reall...
5,negative,"I love tracy reese dresses, but this one is no..."
8,positive,I love this dress. i usually get an xs but it ...
9,positive,"I'm 5""5' and 125 lbs. i ordered the s petite t..."


## Reading the text
- Text pre-processing steps resulting in a document-feature matrix

To read the text and use it for our analysis, we need an object from sklearn called a CountVectorizer. Essentially, it builds a vocabulary from a collection of texts. It lowercases the text and tokenizes it, using whitespace and punctuation as separators between words. I use a list of frequent English words ('stop words') that will not be counted, because they are not informative enough.

We will need to convert the text to Unicode, which is a standard text format. We do so by using .values.astype('U').
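A small sketch of what the CountVectorizer does, on two made-up reviews (not from the dataset). The variable names `toy_reviews`, `toy_vect` and `toy_matrix` are just for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, for illustration only.
toy_reviews = [
    "Love this dress! It's SO pretty.",
    "The zipper broke, and the dress is too small.",
]

toy_vect = CountVectorizer(stop_words="english")
toy_matrix = toy_vect.fit_transform(toy_reviews)

# vocabulary_ maps each kept token to its column index in the matrix.
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab))

# Number of times 'dress' occurs in each of the two reviews.
dress_counts = toy_matrix[:, toy_vocab["dress"]].toarray().ravel()
print(dress_counts)
```

Stop words such as 'the' and 'this' do not end up in the vocabulary, and all remaining tokens are lowercase.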

In [7]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = vect.get_feature_names_out().tolist() #Get the words from the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
#feature_names
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

df.head()

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


Unnamed: 0,Rating,Review Text
1,positive,Love this dress! it's sooo pretty. i happene...
2,negative,I had such high hopes for this dress and reall...
5,negative,"I love tracy reese dresses, but this one is no..."
8,positive,I love this dress. i usually get an xs but it ...
9,positive,"I'm 5""5' and 125 lbs. i ordered the s petite t..."


Now that we have the dictionary, we can count the occurrences of each word for each review. This way, we can create a document-feature matrix, with documents (reviews) in the rows and features (words) in the columns.

In [8]:
docu_feat = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(docu_feat[0:50,0:50]) #Let's print a little part of the matrix: the first 50 words & documents

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1
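The printout above is a sparse matrix: it only lists the non-zero cells as `(document, word)  count`. So `(35, 12)  2` means that word 12 occurs twice in document 35, and only 8 of the 2500 cells in this 50 x 50 block are filled. A minimal sketch of the same idea on a made-up 3 x 4 matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 3-document x 4-word count matrix, mostly zeros.
dense = np.array([
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 1],
])
sparse = csr_matrix(dense)

# Walk over the stored entries as (document, word, count) triples.
coo = sparse.tocoo()
entries = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
print(entries)

# Share of cells that are non-zero.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"{sparse.nnz} non-zero cells, density {density:.2f}")
```

Storing only the non-zero cells is what keeps a matrix with thousands of word columns manageable.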


## Using Naive Bayes

- Split the file into a training and a test set.
- Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars).
- Evaluate the performance of your model on the test set.

Now, we will use the Naïve Bayes classifier from sklearn.

In [9]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB() #create the model
X = docu_feat #the document-feature matrix is the X matrix
y = df['Rating'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data and store it

nb = nb.fit(X_train, y_train) #fit the model: X = features, y = labels

#Evaluate the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8554852320675106

In [10]:
y_test_p

array(['positive', 'negative', 'positive', ..., 'positive', 'positive',
       'positive'], dtype='<U8')

In [11]:
df['Rating'].value_counts(normalize=True)

positive    0.758348
negative    0.241652
Name: Rating, dtype: float64

Positive reviews make up about 76% of the dress reviews. If we predicted every review to be positive, we would already reach an accuracy of about 76%.

Our model reaches an accuracy of about 86% on the test set. That's about 10 percentage points above that baseline, but I was hoping to get it above 90%.
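The "always predict positive" baseline can be checked with a few lines. This sketch rebuilds the label list from the class counts printed earlier (4792 positive, 1527 negative) and also uses the test-set support from the classification report (1432 positive, 464 negative). The variable names are my own.

```python
from collections import Counter

# Class counts taken from the value_counts() output of the full dress subset.
labels = ["positive"] * 4792 + ["negative"] * 1527

# Always predicting the most frequent class gives this accuracy.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 4))

# The same baseline on the test set, using the support column of the report.
test_baseline = 1432 / (1432 + 464)
print(round(test_baseline, 4))
```

Any accuracy score should be read against this number rather than against 50%.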

In [12]:
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

    negative       0.75      0.61      0.67       464
    positive       0.88      0.94      0.91      1432

    accuracy                           0.86      1896
   macro avg       0.82      0.77      0.79      1896
weighted avg       0.85      0.86      0.85      1896
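The report shows that the negative class is the weak spot: a recall of 0.61 means that about 39% of the 464 negative test reviews were labelled positive. To show where precision and recall come from, here is a sketch with a confusion matrix on eight made-up labels (not the real predictions).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for eight reviews, for illustration only.
y_true = ["negative", "negative", "negative", "positive",
          "positive", "positive", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "positive",
          "positive", "positive", "negative", "negative"]

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
(neg_as_neg, neg_as_pos), (pos_as_neg, pos_as_pos) = cm

# Precision: of everything predicted negative, how much really is negative?
precision_negative = neg_as_neg / (neg_as_neg + pos_as_neg)
# Recall: of everything that really is negative, how much did we find?
recall_negative = neg_as_neg / (neg_as_neg + neg_as_pos)

print(cm)
print(precision_negative, recall_negative)
```

Running `confusion_matrix(y_test, y_test_p)` on the real test set would give the actual counts behind the report.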



## Evaluation
- Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain.

After building the model, it's good practice to look at examples where your model got it wrong, and to think about which features you could add to "maybe" improve the performance.
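One way to find such cases is to split the review texts together with the matrix, so that every test row can be traced back to its text. This is a minimal sketch on a made-up stand-in dataframe; the `toy` data, the `results` and `mistakes` names and the 50/50 split are assumptions for the illustration, not the assignment's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up stand-in for the dress reviews.
toy = pd.DataFrame({
    "Review Text": [
        "love this dress", "so pretty and comfortable", "perfect fit love it",
        "beautiful fabric great length", "flattering and comfortable",
        "love the color", "too small and cheap", "zipper broke after one wear",
        "fabric feels cheap", "runs small had to return",
        "not flattering at all", "poor quality returned it",
    ],
    "Rating": ["positive"] * 6 + ["negative"] * 6,
})

X_all = CountVectorizer(stop_words="english").fit_transform(toy["Review Text"])

# Split the matrix, the labels and the raw texts in one call so the rows stay aligned.
X_tr, X_te, y_tr, y_te, text_tr, text_te = train_test_split(
    X_all, toy["Rating"], toy["Review Text"],
    test_size=0.5, random_state=1, stratify=toy["Rating"],
)

model = MultinomialNB().fit(X_tr, y_tr)
proba = model.predict_proba(X_te)

results = pd.DataFrame({
    "text": text_te.values,
    "actual": y_te.values,
    "predicted": model.predict(X_te),
    # Look the column up by class name instead of guessing its position.
    "p_positive": proba[:, list(model.classes_).index("positive")],
})

# Keep only the test reviews where the model was off target.
mistakes = results[results["actual"] != results["predicted"]]
print(mistakes)
```

Sorting `mistakes` by `p_positive` would put the most confident errors first, which are usually the most interesting ones to read.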

In [25]:
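# Note: X holds all reviews (training and test), so these 10 rows are not necessarily test cases
# predict_proba columns follow nb.classes_, which is sorted: column 0 = 'negative', column 1 = 'positive'
for i in range(10):
    prob = nb.predict_proba(X[i])
    print(f"\nText: {i}. {df.iloc[i,1]}. ")
    print(f"\nNegative pred.: {prob[0,0]}, Positive pred.: {prob[0,1]} \n")
    print(f"Actual: {df.iloc[i,0]}\n")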


Text: 0. Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.. 

Negative pred.: 5.337412695289174e-05, Positive pred.: 0.9999466258730467 

Actual: positive


Text: 1. I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c. 

Negative pred.: 0.9999990978503296, Positive pred.: 9.021496679830375e-07 


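Attention: at first I wasn't sure which column was the positive prediction and which the negative. First I had it the other way around (the first three mistakes would've been 2, 3 and 5), but then I remembered the columns are probably in alphabetical order. That is indeed the case: predict_proba follows the order of nb.classes_, which sklearn sorts, so column 0 is 'negative' and column 1 is 'positive'.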
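The column order can be checked directly on a fitted model. A minimal sketch on a tiny made-up count matrix (the names `X_toy`, `y_toy` and `toy_nb` are only for this illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up count matrix: 4 documents, 3 words.
X_toy = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]])
y_toy = ["positive", "positive", "negative", "negative"]

toy_nb = MultinomialNB().fit(X_toy, y_toy)

# classes_ is sorted, even though 'positive' came first in y_toy.
print(toy_nb.classes_)

# A document that only contains word 0, which was only seen in positive documents.
proba = toy_nb.predict_proba(np.array([[1, 0, 0]]))

# Pair each probability with its class name instead of relying on position.
by_class = dict(zip(toy_nb.classes_, proba[0]))
print(by_class)
```

Pairing the probabilities with `classes_` like this removes the guesswork about which column is which.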

- Text 0: So with the change of pos./neg. I really wouldn't know why the system thinks this is negative.
- Text 1: The system probably thinks that this is a positive one because she explains why the top half of the dress is good.
- Text 4: So this was a close one for the system. I don't know why, actually. She explains in great detail why she is happy with it and why others should buy it.

Not really a good evaluation, I admit. But at least I made it to the end code-wise!
