In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer #The CountVectorizer object
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

In [2]:
reviews = pd.read_csv("Assignment 5 - data clothing reviews.csv")
reviews.head(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses


In [3]:
reviews["Class Name"].value_counts()

Dresses           6319
Knits             4843
Blouses           3097
Sweaters          1428
Pants             1388
Jeans             1147
Fine gauge        1100
Skirts             945
Jackets            704
Lounge             691
Swim               350
Outerwear          328
Shorts             317
Sleep              228
Legwear            165
Intimates          154
Layering           146
Trend              119
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

In [4]:
#keep only the dress reviews
dresses = reviews[reviews['Class Name'] == 'Dresses']
dresses.tail(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
23483,23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses
23485,23485,1104,52,Please make more like this one!,This dress in a lovely platinum is feminine an...,5,1,22,General Petite,Dresses,Dresses


In [5]:
#split ratings into positive (>3 stars) and neutral/negative (<4 stars)
dresses = dresses.copy() #explicit copy of the filtered rows, so the assignment below does not trigger SettingWithCopyWarning
dresses["Positive"] = dresses['Rating'] > 3
dresses.head(3)


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Positive
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,True
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,False
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,False


In [6]:
#get dummies
dummies = pd.get_dummies(dresses['Positive'])
dresses = pd.concat([dresses, dummies], axis=1) 
dresses.head(3)

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,Positive,False,True
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses,True,0,1
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses,False,1,0
5,5,1080,49,Not for the very petite,"I love tracy reese dresses, but this one is no...",2,0,4,General,Dresses,Dresses,False,1,0


In [7]:
#using the true and false values is giving an error. Can't figure my way around it
df = dresses[["Review Text", "Rating"]]
df.head()

Unnamed: 0,Review Text,Rating
1,Love this dress! it's sooo pretty. i happene...,5
2,I had such high hopes for this dress and reall...,3
5,"I love tracy reese dresses, but this one is no...",2
8,I love this dress. i usually get an xs but it ...,5
9,"I'm 5""5' and 125 lbs. i ordered the s petite t...",5


In [8]:
#Text pre-processing steps such as tokenizing, removing stopwords, building a document-feature matrix
text = dresses['Review Text'].values.astype('U') #Taking the text from the df. We need to convert it to Unicode

vect = CountVectorizer(stop_words='english') #Create the CV object, with English stop words
vect = vect.fit(text) #We fit the model with the words from the review text
feature_names = list(vect.get_feature_names_out()) #Get the words from the vocabulary (scikit-learn >= 1.0; older versions used get_feature_names(), which was removed in 1.2)
print(f"There are {len(feature_names)} words in the vocabulary. A selection: {feature_names[500:520]}")

There are 8080 words in the vocabulary. A selection: ['airier', 'airiness', 'airism', 'airline', 'airplane', 'airplanes', 'airport', 'airy', 'aize', 'aka', 'akward', 'al', 'alas', 'albeit', 'alerations', 'alert', 'alexandria', 'align', 'aligned', 'alignment']


In [9]:
matrix = vect.transform(text) #The transform method from the CountVectorizer object creates the matrix
print(matrix[0:50,0:50])

  (2, 8)	1
  (20, 38)	1
  (21, 4)	1
  (21, 45)	1
  (22, 12)	1
  (26, 40)	1
  (35, 12)	2
  (39, 31)	1


In [10]:
#getting the document feature matrix
docu_feat = pd.DataFrame(matrix.toarray()) #make a regular matrix, then put in Dataframe
docu_feat.index = dresses['Review Text'] #Give the rows names (text of the review)
docu_feat.columns = feature_names

In [28]:
#Split the file into a training and a test set.
y = dresses['Positive'] #the target: True if the review has more than 3 stars
X = matrix

In [29]:
#random state is 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) 

Explain briefly in your own words how the bag-of-words model and Naïve Bayes work together.
The bag-of-words model does what it says on the tin. A body of text is cut up into individual words, and each document is represented only by how often each word occurs in it, without word order, relations between words or weighted importance. Each word is treated as independent and equal. This is different from other n-gram models, where sequences of adjacent words are kept together, so that the surrounding words affect how a word contributes to the model. Bag-of-words is a type of n-gram model with n=1.
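
To make the loss of word order concrete, here is a minimal sketch on two made-up sentences (they are not from the review data). It assumes a plain `CountVectorizer` without stop-word removal; the names `docs`, `toy_vect` and `toy_matrix` are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order
docs = ["the dress is not cheap but pretty",
        "the dress is not pretty but cheap"]

toy_vect = CountVectorizer()  # unigrams only, no stop-word removal
toy_matrix = toy_vect.fit_transform(docs).toarray()

print(sorted(toy_vect.vocabulary_))  # the vocabulary, in column order
print(toy_matrix)                    # one row of word counts per sentence

# Word order is discarded, so both sentences get exactly the same row
same_rows = (toy_matrix[0] == toy_matrix[1]).all()
print(same_rows)
```

The two sentences mean opposite things, but a unigram bag-of-words model cannot tell them apart.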

The Naive Bayes classifier is a conditional probabilistic model based on Bayes' theorem. Similar to bag-of-words, Naive Bayes treats each condition (feature) as independent of the others, given the class. That is why it is called naive. For example, to predict the sales of an ice-cream van, this classifier considers the weather and the location of the van as independent conditions that each act on their own to influence the probability. A more sophisticated algorithm would see that an ice-cream van located in Ghana is more likely to have hot weather, and thus more likely to sell more ice cream, than one located in Norway.
Bayes' theorem combines the different conditions in a logical way and calculates the probability of a certain outcome from the probabilities associated with each condition. In this notebook the conditions are the word counts from the bag-of-words matrix: for each class, the classifier combines the class prior with the per-word probabilities and predicts the most probable class.
$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$
 P(A|B) is the probability of A if we already know that B has occurred, and is known as the likelihood.
 An example of Bayes' theorem at work is estimating the probability that a person is seeking a mortgage to buy a house. If we know that the person has already signed up with a realtor, then the probability that they are seeking a mortgage is much higher.
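
A small worked example of the formula, using the realtor and mortgage situation. All three input probabilities are hypothetical numbers picked only to show the arithmetic. Here A stands for "has signed up with a realtor" and B for "is seeking a mortgage".

```python
# Hypothetical numbers, chosen only to illustrate the formula
p_mortgage = 0.10                # P(B): prior probability that a person seeks a mortgage
p_realtor_given_mortgage = 0.80  # P(A|B): probability of having a realtor, given a mortgage search
p_realtor = 0.20                 # P(A): overall probability of having signed up with a realtor

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A)
p_mortgage_given_realtor = p_realtor_given_mortgage * p_mortgage / p_realtor
print(f"P(mortgage | realtor) = {p_mortgage_given_realtor:.2f}")
```

With these assumed inputs, knowing about the realtor raises the probability of a mortgage search from the prior of 0.10 to 0.40.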


In [30]:
# Train a NB model on the training set
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [31]:
#to find out the classes being used in the model
#the classes are the two values of the boolean Positive column: False (neutral/negative) and True (positive)
clf.classes_

array([False,  True])
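
The boolean classes can be turned into readable names after prediction. The sketch below uses small stand-in arrays instead of the fitted `clf`, so `classes`, `predictions` and `label_names` are assumptions for illustration. The same lookup also shows which column of `predict_proba` belongs to the positive class, since the columns follow the order of `classes_`.

```python
import numpy as np

# Stand-ins for clf.classes_ and a few predictions (assumed values)
classes = np.array([False, True])
predictions = np.array([True, False, True])

# Map the boolean classes to readable names
label_names = {False: "negative", True: "positive"}
readable = [label_names[bool(p)] for p in predictions]
print(readable)

# Column j of predict_proba corresponds to classes[j], so look the column up
# instead of assuming an order
positive_column = int(np.flatnonzero(classes)[0])
print(positive_column)
```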

In [33]:
y_predict = clf.predict(X_test) #the predicted values
y_predict

array([ True, False,  True, ...,  True,  True,  True])

In [19]:
#Train a Naïve Bayes classifier predicting whether a review is positive (>3 stars) or neutral/negative (<4 stars)
#a second way to get the positive/negative target: a boolean column, True for positive and False for neutral/negative
df = df.copy() #explicit copy, so the assignment below does not trigger SettingWithCopyWarning
df['positive'] = df['Rating'] > 3
df.head(3)


Unnamed: 0,Review Text,Rating,positive
1,Love this dress! it's sooo pretty. i happene...,5,True
2,I had such high hopes for this dress and reall...,3,False
5,"I love tracy reese dresses, but this one is no...",2,False


In [17]:
clf.score(X_test, y_test) #calculate the fit on the test data, i.e. the accuracy

0.5912447257383966

The accuracy of this model is 59%.
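
Accuracy alone does not show which kind of mistake the model makes. `confusion_matrix` and `precision_recall_fscore_support` are imported at the top of the notebook but not used yet; a minimal sketch of how they could be applied follows. The labels below are made up and only stand in for `y_test` and `y_predict`.

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up labels standing in for y_test and y_predict
y_true_toy = [True, True, True, False, False, True]
y_pred_toy = [True, True, False, True, False, True]

# Rows are the actual classes, columns the predicted classes, in the order of labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[False, True])
print(cm)

# One value per class, again in the order [False, True]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, labels=[False, True])
print(f"Precision: {precision}, recall: {recall}, support: {support}")
```

On the real data, the off-diagonal cells would show how many negative reviews are predicted as positive and the other way round.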


In [35]:
#Check out 3 cases where your model is off target. Inspect the associated texts. Do you understand why your model trips up? Explain.
for i in range(5):
    C = clf.predict(X[i])
    words = df['Review Text'].iloc[i]
    prob = clf.predict_proba(X[i])
    p = prob.max()  #get the maximum value from the array
    print(f"The sentence:'{words}' was predicted as  {C[0]} based on {p:.2f} prediction.")
    print(f"Positive: {prob[0][1]}, Negative:  {prob[0][0]}")  #columns follow clf.classes_, i.e. [False, True]

The sentence:'Love this dress!  it's sooo pretty.  i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite.  i bought a petite and am 5'8".  i love the length on me- hits just a little below the knee.  would definitely be a true midi on someone who is truly petite.' was predicted as  True based on 1.00 prediction.
Positive: 0.9999466258730467, Negative:  5.337412695289174e-05
The sentence:'I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was comfortable and fit nicely, but the bottom half had a very tight under layer and several somewhat cheap (net) over layers. imo, a major design flaw was the net over layer sewn directly into the zipper - it c' was predicted as  False based on 1.00 prediction.
Posit

In [55]:
#tried a different way to get and compare the values; this worked, but the range had to be increased to find enough mistakes
for i in range(100):
    C = clf.predict(X[i])
    rate = dresses["Positive"].iloc[i]
    if C[0] != rate: #compare the single predicted label with the actual label
        words = df['Review Text'].iloc[i]
        print(f'Actual review: {rate}')
        print(f'Predicted review: {C[0]}')
        print(f'Number of stars: {df["Rating"].iloc[i]}')
        print(words)

Actual review: False
Predicted review: True
Number of stars: 3
Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured.
Actual review: False
Predicted review: True
Number of stars: 3
Love the color and style, but material snags easily
Actual review: False
Predicted review: True
Number of stars: 3
Looks beautiful online but has too much material and the zipper catches on the lace. also runs very large, i am normally a small but would need and xs in this dress
Actual review: False
Predicted review: True
Number of stars: 2
I love byron lars dresses, and this design is on-point. the ruffle at the neckline is so pretty, and the dress fits like a dream. however -- the fabric!!! i would have loved it if this dress had a heavier feel. this is, sadly, going back today.
Actual review: True
Predicted review: False
Number of stars: 4
I loved this dress a
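
The loop above runs over the first rows of the full matrix `X`, which also contains reviews the model was trained on. A possible alternative is to pass the row numbers through `train_test_split` together with `X` and `y`, so that mistakes can be looked up on the test set only. The sketch below does this on a tiny made-up data set; `toy`, `row_ids`, `toy_clf` and the review texts are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up data set standing in for the dress reviews
toy = pd.DataFrame({
    "Review Text": ["love this dress", "pretty and comfortable", "great fit love it",
                    "beautiful color", "too small and cheap", "poor quality fabric",
                    "returned it bad fit", "cheap and scratchy"],
    "Positive": [True, True, True, True, False, False, False, False],
})
X_toy = CountVectorizer().fit_transform(toy["Review Text"])
row_ids = np.arange(len(toy))

# Splitting the row numbers together with X and y keeps track of which review is which
X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X_toy, toy["Positive"], row_ids, test_size=0.25, random_state=1)

toy_clf = MultinomialNB().fit(X_tr, y_tr)
pred = toy_clf.predict(X_te)

# Boolean mask of the test rows where prediction and actual label disagree
wrong = pred != y_te.to_numpy()
mistakes = toy.iloc[id_te[wrong]].assign(Predicted=pred[wrong])
print(mistakes)
```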

Some reviews were predicted to be positive but were actually negative. Looking at the text it is quite easy to imagine why the classifier got these reviews wrong. By and large they contain many positive words about the product. The example below is typical of this:
    "Actual review: False
    Predicted review: True
    Cute little dress fits tts. it is a little high waisted. good length for my 5'9 height. 
    i like the dress, i'm just not in love with it. i dont think it looks or feels cheap. it appears just as pictured."

The reviewer actually gave the dress 3 stars. She said she liked the dress but did not love it. A lot of the reviews that contain this mix of emotions end up misclassified. They use more words to point out the good aspects of the dress, and just a few to voice their dissatisfaction. However, this dissatisfaction is likely to make the reviewer give a lower rating, no matter how many good things they found to say about the dress.
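
One more factor may contribute here: scikit-learn's built-in English stop-word list contains "not", so with `stop_words='english'` the phrase "not in love" loses its negation before the classifier sees it. The sketch below shows what the analyzer keeps for part of the review quoted above. The bigram variant is an assumption for comparison and is not used anywhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "i like the dress, i'm just not in love with it"

# Analyzer as configured in this notebook: unigrams, English stop words removed
with_stop = CountVectorizer(stop_words='english').build_analyzer()
# Alternative (an assumption, not used above): keep stop words and add bigrams
with_bigrams = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

tokens_stop = with_stop(sentence)
tokens_bigrams = with_bigrams(sentence)
print(tokens_stop)     # "love" survives, "not" is removed
print(tokens_bigrams)  # contains "not in" and "in love" next to the single words
```

Whether bigrams or a custom stop-word list would actually improve the accuracy on this data has not been tested here.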
