# Assignment week 6

The aim of this assignment is to predict the 'Rating' category of dresses from their review text. A rating of 4 or 5 (> 3) is considered positive, and a rating of 3 or less (<= 3) is considered neutral/negative. We will use the Naive Bayes algorithm for this classification.

In [1]:
# importing the libraries we are going to use in our notebook
import pandas as pd
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Data pre-processing steps

First, we will start with cleaning and sorting the data.

In [2]:
# importing dataset and viewing the first 5 items 
reviews = pd.read_csv('data-clothing-reviews.csv')
reviews.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Some important notes after viewing the dataset: 
- We would like to filter only on 'Dresses' (Class Name)
- We would like to drop all the empty rows (or NaN)

In [3]:
# Only keep the rows with dresses; .copy() gives an independent DataFrame
df_subset = reviews[reviews["Class Name"] == "Dresses"].copy()

# Drop every row that has a NaN in any column
df_subset = df_subset.dropna(axis=0, how='any')

df_subset.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses
8,8,1077,24,Flattering,I love this dress. i usually get an xs but it ...,5,1,0,General,Dresses,Dresses
9,9,1077,34,Such a fun dress!,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,1,0,General,Dresses,Dresses
10,10,1077,53,Dress looks like it's made of cheap material,Dress runs small esp where the zipper area run...,3,0,14,General,Dresses,Dresses


In [4]:
# Create a column that labels each review as positive or negative based on its rating
def positive_or_negative(x):
    if x > 3:
        return 'Positive'
    else:
        return 'Negative'

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_subset = df_subset.assign(Rate=df_subset['Rating'].apply(positive_or_negative))


In [5]:
# Keep only the columns needed for the model
df_subset = df_subset[["Clothing ID","Review Text", "Rating", "Class Name", "Rate"]]
df_subset.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name,Rate
2,1077,I had such high hopes for this dress and reall...,3,Dresses,Negative
5,1080,"I love tracy reese dresses, but this one is no...",2,Dresses,Negative
8,1077,I love this dress. i usually get an xs but it ...,5,Dresses,Positive
9,1077,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5,Dresses,Positive
10,1077,Dress runs small esp where the zipper area run...,3,Dresses,Negative


In [6]:
df_subset['Rate'].value_counts()

Positive    4045
Negative    1326
Name: Rate, dtype: int64

## Text Modeling and Document Feature Matrix

Now the data set is ready to use. We are going to create the Document Feature Matrix, a matrix in which each row is a document (here, a review) and each column is a distinct word.

The bag-of-words model treats a document as an unordered collection of words and counts how often each word occurs. To do that, it tokenizes the text, which is the process of breaking a text into words or other units. We are going to use CountVectorizer from the sklearn.feature_extraction.text module. By default, CountVectorizer converts the text to lower case, breaks it into tokens and builds a dictionary (vocabulary) from all the words it finds. Since the reviews are in English, we pass stop_words='english'. Stop words are uninformative words that are used very frequently in English (such as "the" or "is") and are therefore not counted. We also need to convert the text to Unicode strings, which we do with astype('U').
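To make the bag-of-words idea concrete, here is a minimal sketch on two made-up reviews (the sentences and the names `toy_docs`, `toy_vect` and `toy_dfm` are assumptions for illustration only, not part of the assignment data). It shows the vocabulary that CountVectorizer keeps and the resulting count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews, used only for illustration
toy_docs = ["I love this dress", "This dress is not what I expected"]

toy_vect = CountVectorizer(stop_words="english")
toy_dfm = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index in the matrix
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)

# One row per document, one column per word in the vocabulary
print(toy_dfm.toarray())
```

Only "dress", "expected" and "love" survive here. Note that scikit-learn's English stop word list also drops "not", so a negation such as "not what I expected" is invisible to the model.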

In [7]:
# Creating the collection of words (dictionary)
from sklearn.feature_extraction.text import CountVectorizer

text = df_subset['Review Text'].values.astype('U')
vect = CountVectorizer(stop_words='english')
vect = vect.fit(text)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vect.get_feature_names_out().tolist()

print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 7747 words in the vocabulary. A selection: ['allusion', 'allusione', 'almsot', 'alr', 'alright', 'als', 'altar', 'alter', 'alteration', 'alterations', 'altered', 'altering', 'alternate', 'alternations', 'alternative', 'althetic', 'altho', 'altogether', 'am5', 'amadi']


Now we have a dictionary of words. We are going to create a Document Feature Matrix from it with the transform method, which counts how often each word of the dictionary occurs in each document.

In [8]:
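# Creates a document feature matrix (stored as a sparse matrix)
docu_feat = vect.transform(text)
# Print the non-zero entries of the first 500 rows and columns as: (row, column) count
print(docu_feat[0:500,0:500])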

  (1, 7)	1
  (3, 71)	1
  (3, 214)	1
  (12, 480)	1
  (13, 200)	1
  (14, 78)	1
  (14, 408)	1
  (14, 420)	1
  (15, 37)	1
  (16, 3)	1
  (16, 44)	1
  (17, 11)	1
  (17, 236)	1
  (18, 54)	1
  (19, 39)	1
  (21, 59)	1
  (21, 216)	1
  (21, 225)	1
  (22, 229)	1
  (23, 102)	1
  (23, 178)	1
  (23, 222)	1
  (23, 243)	1
  (24, 403)	1
  (26, 11)	2
  :	:
  (475, 96)	1
  (475, 455)	2
  (476, 63)	1
  (476, 200)	1
  (476, 334)	1
  (477, 344)	1
  (478, 156)	1
  (479, 216)	1
  (479, 368)	1
  (482, 309)	2
  (484, 130)	1
  (489, 362)	1
  (490, 112)	1
  (492, 11)	2
  (492, 364)	1
  (493, 50)	1
  (493, 480)	1
  (497, 442)	1
  (498, 334)	1
  (499, 11)	1
  (499, 165)	1
  (499, 187)	1
  (499, 219)	1
  (499, 248)	1
  (499, 451)	1


## Split the file into a training and a test set - Naive Bayes

Now the Document Feature Matrix is ready. We are going to split our dataset into a training set (70%) and a test set (30%) to prepare for building our Naive Bayes model.
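As a quick sanity check on the split sizes, the sketch below applies the same `test_size` and `random_state` to a stand-in array with one entry per dress review (5371 = 4045 positive + 1326 negative, from the counts above). The names `row_ids`, `train_ids` and `test_ids` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_reviews = 4045 + 1326          # positive + negative dress reviews
row_ids = np.arange(n_reviews)   # stand-in for the rows of the matrix

train_ids, test_ids = train_test_split(row_ids, test_size=0.3, random_state=1)

# Every review ends up in exactly one of the two sets
print(len(train_ids), len(test_ids))
```

With these numbers the test set holds 1612 reviews and the training set the remaining 3759.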

In [9]:
from sklearn.naive_bayes import MultinomialNB

# Splitting the data set into a training set and a test set
X = docu_feat
y = df_subset['Rate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Creating the model and fitting it on the training set
nb = MultinomialNB()
nb = nb.fit(X_train, y_train)


In the code above, we created our Naive Bayes model. Naive Bayes is an algorithm based on Bayes' theorem, which calculates the probability of an event given a condition related to that event. For example, the probability of rain given that it is cloudy.

The general formula of Bayes' theorem is as follows: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$

It gives the probability of A given the condition B. The Naive Bayes algorithm computes this probability for every class, with the features as the condition, and assigns the class with the highest probability. It is called "naive" because it assumes that the features are independent of each other given the class.


In this case, it compares $P(\text{Positive} | \text{Words})$ with $P(\text{Neutral/Negative} | \text{Words})$, where the features are the words in the review, and it classifies the review into the category (positive or negative) with the higher probability.
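The comparison of the two probabilities can be sketched by hand on a tiny, made-up count matrix and checked against scikit-learn. Everything in this example (the three words, the counts, and the names `X_toy`, `y_toy`, `nb_toy`) is an assumption for illustration. The smoothing value alpha = 1 is the MultinomialNB default.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document feature matrix: columns count the words ["love", "cheap", "dress"]
X_toy = np.array([
    [2, 0, 1],   # positive review
    [1, 0, 1],   # positive review
    [1, 0, 2],   # positive review
    [0, 2, 1],   # negative review
])
y_toy = np.array(["Positive", "Positive", "Positive", "Negative"])

new_review = np.array([[1, 1, 1]])  # contains each word once

# By hand, with Laplace smoothing (alpha = 1)
alpha = 1.0
log_scores = []
for label in ["Negative", "Positive"]:
    rows = X_toy[y_toy == label]
    prior = len(rows) / len(X_toy)                  # P(class)
    word_counts = rows.sum(axis=0) + alpha
    word_probs = word_counts / word_counts.sum()    # P(word | class)
    # log P(class) + sum over words of count * log P(word | class)
    log_scores.append(np.log(prior) + (new_review[0] * np.log(word_probs)).sum())

# Normalise the two scores into posterior probabilities
log_scores = np.array(log_scores)
posterior = np.exp(log_scores - log_scores.max())
posterior /= posterior.sum()

# Same calculation with scikit-learn
nb_toy = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)
print("by hand:", posterior)
print("sklearn:", nb_toy.predict_proba(new_review)[0])
print("predicted:", nb_toy.predict(new_review)[0])
```

Both calculations give the same probabilities, in the order Negative, Positive. The class prior matters: positive reviews are three times as common in the toy data, which is enough to outweigh the single occurrence of "cheap".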


## Evaluate the performance of the model 

In this last section, we are going to evaluate the performance of our model on the test set.

In [10]:
# Creating prediction array with the test set
y_test_p = nb.predict(X_test)

# Calculating the accuracy on the test set
nb.score(X_test, y_test)

0.8542183622828784

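The accuracy of our model is 85%. For comparison, always predicting 'Positive' would already give roughly 75%, since 4045 of the 5371 dress reviews are positive, so the model is a clear improvement over that baseline. Now we are going to create a confusion matrix to analyze the classification in more detail.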

In [11]:
# Creating a confusion matrix with the test set
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_test_p)
cm

array([[ 238,  168],
       [  67, 1139]])

In [12]:
# Check the order of the classes so we can label the rows and columns of the matrix correctly
nb.classes_

array(['Negative', 'Positive'], dtype='<U8')

In [13]:
# Creating a labelled confusion matrix as a DataFrame
conf_matrix = pd.DataFrame(cm, index = ["Negative (actual)", "Positive (actual)"], columns = ["Negative (predicted)", "Positive (predicted)"])
conf_matrix

Unnamed: 0,Negative (predicted),Positive (predicted)
Negative (actual),238,168
Positive (actual),67,1139


A calculation of the metrics from the confusion matrix:

$accuracy = \frac{238 + 1139}{238 + 168 + 67 + 1139} = 0.85$

The accuracy is exactly the same as the score returned by nb.score.


How much of the actual ‘negative’ is predicted as negative?

recall negative = $\frac{238}{238 + 168} = 0.59$

How much of the actual ‘positive’ is predicted as positive?

recall positive = $\frac{1139}{1139 + 67} = 0.94$

How much of the predicted ‘positive’ is actually positive?

precision positive = $\frac{1139}{1139 + 168} = 0.87$

How much of the predicted ‘negative’ is actually negative?

precision negative = $\frac{238}{238 + 67} = 0.78$
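The same metrics can be computed directly from the confusion matrix with NumPy. This is a minimal sketch that copies the four counts from the matrix above into a hypothetical array `cm_values` (rows are actual classes, columns are predicted classes, in the order Negative, Positive).

```python
import numpy as np

# Rows = actual class, columns = predicted class, order: Negative, Positive
cm_values = np.array([[238, 168],
                      [67, 1139]])

accuracy = np.trace(cm_values) / cm_values.sum()
recall = np.diag(cm_values) / cm_values.sum(axis=1)      # divide by row totals (actual)
precision = np.diag(cm_values) / cm_values.sum(axis=0)   # divide by column totals (predicted)
f1 = 2 * precision * recall / (precision + recall)

for i, label in enumerate(["Negative", "Positive"]):
    print(f"{label}: precision={precision[i]:.2f} recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Recall divides by the row total (everything that is actually in the class), while precision divides by the column total (everything that was predicted as the class). The values should agree with scikit-learn's classification_report for the same predictions.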

In [14]:
# Calculating the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p))

              precision    recall  f1-score   support

    Negative       0.78      0.59      0.67       406
    Positive       0.87      0.94      0.91      1206

    accuracy                           0.85      1612
   macro avg       0.83      0.77      0.79      1612
weighted avg       0.85      0.85      0.85      1612



We are going to apply our trained model to the whole dataset (training and test rows together), take the subset where the prediction was wrong, and try to analyze the reason.

In [15]:
# Predict a category for every review and keep the rows where the prediction is wrong
df_subset["Rate_Prediction"] = nb.predict(X)
df_contradictions = df_subset[df_subset["Rate"] != df_subset["Rate_Prediction"]]
df_contradictions.head()

Unnamed: 0,Clothing ID,Review Text,Rating,Class Name,Rate,Rate_Prediction
12,1095,More and more i find myself reliant on the rev...,5,Dresses,Positive,Negative
23,1077,Cute little dress fits tts. it is a little hig...,3,Dresses,Negative,Positive
311,1089,Looks beautiful online but has too much materi...,3,Dresses,Negative,Positive
383,1104,This dress is not what i expected. the bottom ...,3,Dresses,Negative,Positive
417,1083,"I love byron lars dresses, and this design is ...",2,Dresses,Negative,Positive


In [17]:
# Print the first three misclassified reviews with their actual and predicted category
for i in range(3):
    row = df_contradictions.iloc[i]
    print(f"For item number: {i+1}")
    print(f"The review text we have is:\n{row['Review Text']}\n")
    print(f"The actual Score for this review is:\n{row['Rating']}\n")
    print(f"The actual Rating Category for this review is:\n{row['Rate']}\n")
    print(f"The Predicted Category for this review is:\n{row['Rate_Prediction']}\n\n")

For item number: 1
The review text we have is:
More and more i find myself reliant on the reviews written by savvy shoppers before me and for the most past, they are right on in their estimation of the product. in the case of this dress-if it had not been for the reveiws-i doubt i would have even tried this. the dress is beautifully made, lined and reminiscent of the old retailer quality. it is lined in the solid periwinkle-colored fabric that matches the outer fabric print. tts and very form-fitting. falls just above the knee and does not rid

The actual Score for this review is:
5

The actual Rating Category for this review is:
Positive

The Predicted Category for this review is:
Negative


For item number: 2
The review text we have is:
Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.

The actual Score for this review is:
3

The act

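Looking at the first 3 cases, we can see that the prediction was probably off because those reviews contain words such as "love" and "beautiful". But at the same time, w 

The model has learned that these words occur far more often in positive reviews, so their presence pushes the prediction towards 'Positive', even when the context is negative (for example "i'm just not in love with it"). A bag-of-words model ignores word order, so it cannot pick up on such negations.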
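One way to check which words pull a prediction towards 'Positive' is to compare the per-class word probabilities that MultinomialNB stores in `feature_log_prob_`. The sketch below does this on a made-up mini corpus (the texts, labels and the names `toy_texts`, `toy_labels`, `word_vect` and `word_nb` are assumptions for illustration, not the assignment data). The same idea could be applied to the fitted model and vectorizer from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up mini corpus, only to illustrate the idea
toy_texts = [
    "love this beautiful dress",
    "beautiful fabric, love the fit",
    "love it",
    "cheap material, returned it",
    "looks beautiful online but cheap in person",
]
toy_labels = ["Positive", "Positive", "Positive", "Negative", "Negative"]

word_vect = CountVectorizer(stop_words="english")
word_nb = MultinomialNB().fit(word_vect.fit_transform(toy_texts), toy_labels)

# Words in column order, and the row index of each class in feature_log_prob_
words = sorted(word_vect.vocabulary_, key=word_vect.vocabulary_.get)
neg_idx = list(word_nb.classes_).index("Negative")
pos_idx = list(word_nb.classes_).index("Positive")

# log P(word | Positive) - log P(word | Negative); > 0 pushes towards 'Positive'
log_ratio = word_nb.feature_log_prob_[pos_idx] - word_nb.feature_log_prob_[neg_idx]
word_ratio = dict(zip(words, log_ratio))

for word, ratio in sorted(word_ratio.items(), key=lambda item: item[1]):
    print(f"{word:>10s} {ratio:+.2f}")
```

In this toy corpus "love" gets a positive log ratio and "cheap" a negative one. A review that says "not in love" still counts "love" as evidence for 'Positive', which is the kind of mistake described above.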