# Explain briefly in your own words how the bag-of-words model and Naïve Bayes work, and how they work together.
Bag of Words (BOW): a way of representing a text as the collection of words it contains, together with their counts, while disregarding the order in which they appear. It is a method used to extract features from text documents, and these features can then be used to train machine learning algorithms. BOW builds a vocabulary of all the unique words occurring across the documents in the training set, and each document becomes a vector of counts over that vocabulary.
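
As a minimal sketch of the idea, assuming two made-up sentences (not taken from the dataset) and plain whitespace tokenization, the vocabulary and the count vectors can be built by hand:

```python
from collections import Counter

# Two made-up example sentences (illustrative, not from the review data).
docs = ["love this dress love the fit", "this dress did not fit"]

# Tokenize on whitespace; a real vectorizer also lowercases and strips punctuation.
tokenized = [doc.split() for doc in docs]

# Vocabulary: every unique word across all documents, in sorted order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One count vector per document; word order within the sentence is discarded.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]

print(vocab)
for row in bow:
    print(row)
```

Each row has one entry per vocabulary word, so both sentences end up as vectors of the same length regardless of how many words they contain.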

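Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
BOW and Naive Bayes work together as follows: BOW supplies the word counts, and Naive Bayes uses them to estimate the prior probability of each class (the share of training documents in that class) and the likelihood of each word given a class (how often the word occurs in that class). For a new document, the prior is multiplied by the likelihoods of the words it contains, which gives a score proportional to the posterior probability of each class. The class with the highest score is the prediction.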
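
To make the combination concrete, here is a small hand-worked sketch. The four training sentences and their labels are invented for illustration, and add-one (Laplace) smoothing is assumed so that a word unseen in one class does not zero out the whole product. The score for each class is the log prior plus the sum of the log word likelihoods, which is proportional to the log posterior.

```python
import math
from collections import Counter

# Invented toy training data (not from the review dataset).
train = [
    ("love this dress", "Positive"),
    ("great fit love it", "Positive"),
    ("poor fit", "Negative"),
    ("not great returned it", "Negative"),
]

vocab = sorted({w for text, _ in train for w in text.split()})
class_docs = Counter(label for _, label in train)         # documents per class
word_counts = {label: Counter() for label in class_docs}  # word counts per class
for text, label in train:
    word_counts[label].update(text.split())

def log_score(text, label, alpha=1.0):
    """Log prior plus smoothed log likelihoods (unnormalized log posterior)."""
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
    return score

def predict(text):
    # Pick the class with the highest unnormalized log posterior.
    return max(class_docs, key=lambda label: log_score(text, label))

print(predict("love the fit"))
```

Working in logs turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.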


In [20]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

First, let's read in the data file.

In [2]:
df = pd.read_csv('data clothing reviews.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


# Data Cleaning

In [5]:
#Removing quotation marks and punctuation from the review text to get a cleaned dataset.

def clean(input_df):
    ret_text = input_df['Review Text']
    #regex=False treats each symbol literally; read as regular expressions,
    #'?' is invalid on its own and '.' matches every character
    for symbol in ['"', u'\u2019', '!', '-', '?', '.']:
        ret_text = ret_text.str.replace(symbol, '', regex=False)
    ret_text = ret_text.str.replace(',', ' ', regex=False) #commas become spaces

    input_df['Review Text'] = ret_text #note: this modifies input_df in place
    return input_df

In [6]:
df_new = clean(df)
df_new.head()
#df_subset = df[['Review Text']]
#df_subset.head()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful silky and sexy and comfo...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress it's sooo pretty i happened ...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,I love love love this jumpsuit it's fun fli...,5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


Creating a new column with a textual interpretation of the Rating column: ratings of 3 or below are labelled Negative, and ratings of 4 or above are labelled Positive. I am doing this to be able to get the percentage of negative and positive ratings.

In [13]:
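#df_new and df are the same object, because clean() returns the DataFrame it was given
df_new['Rating_text'] = 'x' #placeholder, overwritten below
df_new.loc[(df_new['Rating'] <= 3), 'Rating_text'] = 'Negative'
df_new.loc[(df_new['Rating'] >= 4), 'Rating_text'] = 'Positive'
df_new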


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,PosNeg,Rating_text
0,0,767,33,,Absolutely wonderful silky and sexy and comfo...,4,1,0,Initmates,Intimate,Intimates,Positive,Positive
1,1,1080,34,,Love this dress it's sooo pretty i happened ...,5,1,4,General,Dresses,Dresses,Positive,Positive
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,Negative,Negative
3,3,1049,50,My favorite buy!,I love love love this jumpsuit it's fun fli...,5,1,0,General Petite,Bottoms,Pants,Positive,Positive
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses,Positive,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23481,23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses,Positive,Positive
23482,23482,862,48,Wish it was made of cotton,It reminds me of maternity clothes soft stret...,3,1,0,General Petite,Tops,Knits,Negative,Negative
23483,23483,1104,31,"Cute, but see through",This fit well but the top was very see throug...,3,0,1,General Petite,Dresses,Dresses,Negative,Negative
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses,Negative,Negative


In [14]:
df['Rating_text'].value_counts(normalize=True)

Positive    0.77527
Negative    0.22473
Name: Rating_text, dtype: float64

From the value counts, 77.5% of the reviews are positive and 22.5% are negative.

# Generating a document-feature matrix



In [17]:
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object

text = df['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode
vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
docu_feat = vect.transform(text) # make a matrix
#print(docu_feat)

# Building the model using the Naïve Bayes classifier from sklearn.




In [28]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()  
X = docu_feat #the document-feature matrix is the x matrix
y = df_new['Rating_text'] #creating the y vector

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) #no random_state, so the split changes on every run

nb = nb.fit(X_train, y_train) #fit the model: X=features, y=rating label





In [29]:
#Evaluating the model
y_test_p = nb.predict(X_test)
nb.score(X_test, y_test)

0.8732614249219415

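We have an accuracy of about 87%. For comparison, always predicting Positive would score about 78% on this test split, since 5479 of the 7046 test reviews are positive.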
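
The vocabulary above was fit on every review before the split, so words from the test reviews influence the features. A hedged alternative, sketched here on a handful of invented reviews rather than the real data, wraps `CountVectorizer` and `MultinomialNB` in a scikit-learn pipeline so that the vocabulary is learned from the training portion only. The toy texts, the labels, and `random_state=0` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy reviews and labels (stand-ins for 'Review Text' and 'Rating_text').
texts = [
    "love this dress", "great fit and quality", "so pretty and comfortable",
    "perfect for summer", "poor quality fabric", "too small and itchy",
    "returned it very disappointed", "cheap material bad fit",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Split the raw text first, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vocabulary on X_train only, then trains Naive Bayes.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
```

With the real data, the same pattern would take the review text column and the rating labels in place of the toy lists. The score on eight invented sentences says nothing about the real model.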

Creating a confusion matrix 

In [31]:
cm = confusion_matrix(y_test, y_test_p, labels=nb.classes_) #rows = actual, columns = predicted
cm = pd.DataFrame(cm, index=nb.classes_, columns=[c + '-pred' for c in nb.classes_]) #classes_ is sorted: Negative, Positive
cm

Unnamed: 0,Negative-pred,Positive-pred
Negative,1038,529
Positive,364,5115


Calculating the precision and recall

In [33]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_p, labels=nb.classes_)) #labels must be passed by keyword

              precision    recall  f1-score   support

    Negative       0.74      0.66      0.70      1567
    Positive       0.91      0.93      0.92      5479

    accuracy                           0.87      7046
   macro avg       0.82      0.80      0.81      7046
weighted avg       0.87      0.87      0.87      7046





The precision for positive feedback is really good at 0.91 (91%), and the recall is also good at 0.93 (93%), meaning only about 7% of the actual positive reviews were wrongly predicted as negative.
The precision for negative feedback is fairly good at 0.74 (74%). The recall is lower at 0.66 (66%), so about a third of the actual negative reviews were predicted as positive.
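
These figures can be reproduced directly from the confusion-matrix counts reported above. The sketch below only restates that arithmetic. The variable names are chosen for illustration, and the counts are read with rows as actual classes in the order Negative, Positive, which is the order that matches the support column of the report (1567 and 5479).

```python
# Counts from the confusion matrix (rows = actual, columns = predicted,
# both in the order Negative, Positive, matching the report's support column).
neg_as_neg, neg_as_pos = 1038, 529
pos_as_neg, pos_as_pos = 364, 5115

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

# Precision: of everything predicted as a class, how much truly belongs to it.
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)

# Recall: of everything truly in a class, how much the model found.
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

accuracy = (neg_as_neg + pos_as_pos) / total

print(f"Negative: precision={precision_neg:.2f} recall={recall_neg:.2f}")
print(f"Positive: precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"accuracy={accuracy:.4f}")
```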

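I looked at the predictions in two ways: against the ratings (predicted probabilities) and against the rating text (predicted labels).
For the first three reviews, I compared the predicted probability with the actual rating. The printed prediction is the first column of `predict_proba`, which is the probability of the Negative class. In all three cases the prediction agrees with the actual rating. These rows may have been part of the training split, so this is an illustration and not an independent check.
In Case 1: The predicted probability of Negative was 0.0009 and the actual rating was 4 (Positive). Both the prediction and the actual label were positive.
In Case 2: The predicted probability of Negative was 9.0011993957604e-05 and the actual rating was 5 (Positive). Both were positive.
In Case 3: The predicted probability of Negative was 0.9999696895769902 and the actual rating was 3 (Negative). Both were negative.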


In [49]:
for i in range(3):
    prob = nb.predict_proba(X[i]) #columns follow nb.classes_: [Negative, Positive]
    print(df['Review Text'][i])
    print(f"actual: {df['Rating_text'][i]} (rating {df['Rating'][i]})")
    print(f"prediction (probability of Negative): {prob[0,0]}")
    print('\n')


Absolutely wonderful  silky and sexy and comfortable
actual: Positive (rating 4)
prediction (probability of Negative): 0.0009082636166812295


Love this dress  it's sooo pretty  i happened to find it in a store  and i'm glad i did bc i never would have ordered it online bc it's petite  i bought a petite and am 5'8  i love the length on me hits just a little below the knee  would definitely be a true midi on someone who is truly petite
actual: Positive (rating 5)
prediction (probability of Negative): 9.0011993957604e-05


I had such high hopes for this dress and really wanted it to work for me i initially ordered the petite small (my usual size) but i found this to be outrageously small so small in fact that i could not zip it up i reordered it in petite medium  which was just ok overall  the top half was comfortable and fit nicely  but the bottom half had a very tight under layer and several somewhat cheap (net) over layers imo  a major design flaw was the net over layer sewn directly into the zipper  it c
actual: Negative (rating 3)
prediction (probability of Negative): 0.9999696895769902
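
The column order of `predict_proba` follows `nb.classes_`, which scikit-learn stores in sorted order, so column 0 is the probability of `Negative` here. A small sketch on invented toy data (the texts, labels, and the `toy_` names are assumptions, not dataset rows) shows how to pair each probability with its class name instead of relying on the column position:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data for illustration only.
toy_texts = ["love this dress", "great fit", "poor quality", "very disappointed"]
toy_labels = ["Positive", "Positive", "Negative", "Negative"]

toy_vect = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vect.fit_transform(toy_texts), toy_labels)

# classes_ is sorted, so the probability columns are Negative, then Positive.
print(toy_nb.classes_)

probs = toy_nb.predict_proba(toy_vect.transform(["love the fit"]))[0]
by_class = dict(zip(toy_nb.classes_, probs))  # class name -> probability
print(by_class)
```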




In [46]:
j = 0
for i, pred in zip(y_test.index, y_test_p): #y_test_p is in the same order as y_test
    if y_test[i] != pred and j < 3: #show the first three misclassified reviews
        print(df['Review Text'][i])
        print('Actual: ' + str(y_test[i]))
        print('Prediction: ' + str(pred))
        print('\n')
        j += 1

Although the design and quality were great  this did not work for me it was boxy  square and overwhelmed my petite frame if you are taller  or have a longer torso  this may be a great basic staple for your wardrobe
Actual: Negative
Prediction: Positive


I absolutely love the retro look of this swimsuit i first saw it on blogger amber fillerupclark (barefoot blonde) and i knew i had to have it this is the first one piece suit i've purchased in about six years i've avoided one pieces because most of the ones i tried made me feel frumpy  and the monokini look just looked odd on me i have a smaller frame and a larger bust (32ddd)  so finding swimsuits that fit properly is a challenge i am a size 4 but i ordered a size 6 after reading reviews
Actual: Positive
Prediction: Negative


I'm 5 ft 7 125 lbs true size 4 and these pants made me look like i had boy parts super disappointed
Actual: Negative
Prediction: Positive




I also looked at reviews where the predicted label differs from the actual rating text. The original version of this cell printed the two labels swapped (the prediction under "Actual" and the actual label under "Prediction"), which is corrected above. These are the first three misclassified reviews in the test set, and this is how the prediction can be off:

Case 1: The prediction was Positive but the actual was Negative. This might be because of the word "great", which appears twice, while the negation in "did not work" is lost: bag of words ignores word order, and "not" is dropped as an English stop word.

Case 2: The prediction was Negative but the actual was Positive. Although the review opens with "I absolutely love the retro look", it also contains words such as "avoided", "frumpy", "odd" and "challenge", which might have pulled the prediction towards negative.

Case 3: The prediction was Positive but the actual was Negative. The review is short, and apart from "super disappointed" it consists mostly of sizing words such as "true size", which might be more common in positive reviews.
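
One reason such errors occur is that single-word counts cannot tell `work` from `not work`. As a hedged sketch of one possible mitigation (not something this analysis tried), `CountVectorizer` can also count word pairs via `ngram_range=(1, 2)`. The example sentence is invented, and stop-word removal is left off here so that `not` stays available. Whether this would improve the results on the real data is untested.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentence for illustration.
doc = ["the quality was great but it did not work for me"]

# Unigrams only: 'not' and 'work' are counted separately, so the negation is lost.
unigrams = CountVectorizer().fit(doc)

# Unigrams plus bigrams: pairs such as 'not work' become features of their own.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(doc)

print(sorted(unigrams.vocabulary_))
print("not work" in bigrams.vocabulary_)
```

The trade-off is a much larger vocabulary, since every adjacent word pair in the corpus becomes a candidate feature.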