# Sentiment Analysis of Amazon Baby Products

This project develops a model that predicts the sentiment of a customer review for an Amazon baby product. Turi Create (`turicreate`) is used for this work. The data set is heavily biased toward positive sentiment, so the model is trained on only a selected set of features (the counts of a few sentiment-bearing words).

## Exploring the Data

In [1]:
import turicreate as tc
import pandas as pd

In [2]:
products = tc.SFrame.read_csv('amazon_baby.csv')
products.head(5)

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5


### Examining the most reviewed product

In [3]:
products.groupby('name',operations={'count':tc.aggregate.COUNT()}).sort('count',ascending=False)

name,count
Vulli Sophie the Giraffe Teether ...,785
"Simple Wishes Hands-Free Breastpump Bra, Pink, ...",562
Infant Optics DXR-5 2.4 GHz Digital Video Baby ...,561
Baby Einstein Take Along Tunes ...,547
Cloud b Twilight Constellation Night ...,520
"Fisher-Price Booster Seat, Blue/Green/Gray ...",489
Fisher-Price Rainforest Jumperoo ...,450
"Graco Nautilus 3-in-1 Car Seat, Matrix ...",419
Leachco Snoogle Total Body Pillow ...,388
"Regalo Easy Step Walk Thru Gate, White ...",374


In [4]:
giraffe = products[products['name'] == 'Vulli Sophie the Giraffe Teether']
giraffe.head(3)

name,review,rating
Vulli Sophie the Giraffe Teether ...,He likes chewing on all the parts especially the ...,5
Vulli Sophie the Giraffe Teether ...,My son loves this toy and fits great in the diaper ...,5
Vulli Sophie the Giraffe Teether ...,There really should be a large warning on the ...,1


In [5]:
giraffe['rating'].show()

In [6]:
products['rating'].show()


## Modeling



In [7]:
products = products[products['rating'] != 3]  # ignore 3-star ratings
products['sentiment'] = products['rating'] >= 4  # 4 or 5 stars: positive, otherwise negative
products.head(3)

name,review,rating,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5,1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5,1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5,1


In [8]:
products['sentiment'].show()

In [9]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
products['word_count'] = tc.text_analytics.count_words(products['review'])

In [10]:
def count_words(wc_dict, word):
    # Return how often `word` occurs in the word-count dict, or 0 if it is absent.
    return wc_dict.get(word, 0)
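
The helper above turns a per-review word-count dictionary into one numeric feature per selected word. A minimal plain-Python sketch of that idea follows. It assumes a simple tokenizer that lowercases and splits on whitespace, which may differ from what `tc.text_analytics.count_words` does; `tokenize_counts` and `toy_review` are hypothetical names used only for illustration.

```python
from collections import Counter

# A subset of the selected words, for illustration only.
selected_words = ['awesome', 'great', 'love', 'horrible', 'bad']

def tokenize_counts(text):
    # Assumed stand-in for the word-count step: lowercase, split on whitespace.
    return Counter(text.lower().split())

def count_words(wc_dict, word):
    # Same contract as the notebook helper: the count if present, otherwise 0.
    return wc_dict.get(word, 0)

toy_review = "Great stroller and my son does love it but the cup holder is bad"
wc = tokenize_counts(toy_review)
features = {word: count_words(wc, word) for word in selected_words}
print(features)
```

Because the lookup is an exact key match, inflected forms such as "loved" or "loves" do not contribute to the `love` feature in this sketch.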

In [11]:
for word in selected_words:
    products[word]= products['word_count'].apply(lambda x: count_words(x,word))

In [12]:
products.head(3)

name,review,rating,sentiment,word_count,awesome,great
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5,1,"{'recommend': 1.0, 'disappointed': 1.0, ...",0.0,0.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5,1,"{'quilt': 1.0, 'the': 1.0, 'than': 1.0, 'fu ...",0.0,0.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5,1,"{'tool': 1.0, 'clever': 1.0, 'binky': 2.0, ...",0.0,0.0

fantastic,amazing,love,horrible,bad,terrible,awful,wow,hate
0.0,0.0,1.0,0,0,0.0,0,0,0
0.0,0.0,0.0,0,0,0.0,0,0,0
0.0,0.0,2.0,0,0,0.0,0,0,0


In [13]:
for word in selected_words:
    print(f'{word} = {products[word].sum()}')

awesome = 3892.0
great = 55791.0
fantastic = 1664.0
amazing = 2628.0
love = 41994.0
horrible = 1110
bad = 4183
terrible = 1146.0
awful = 687
wow = 425
hate = 1107


In [14]:
train_data,test_data = products.random_split(.8, seed=0)
features=selected_words

In [15]:
model = tc.logistic_classifier.create(train_data,target='sentiment',features=features, validation_set = test_data)

**Weights of the words selected for modeling**

In [16]:
model.coefficients

name,index,class,value,stderr
(intercept),,1,1.3365913848877649,0.008929969787657
awesome,,1,1.133534666034138,0.0839964398318752
great,,1,0.8630655001196574,0.0189550524443772
fantastic,,1,0.8858047568814237,0.1116759129339967
amazing,,1,1.1000933113660225,0.0995477626046599
love,,1,1.3592688669225097,0.0280683001520988
horrible,,1,-2.251335236759098,0.0802024938878843
bad,,1,-0.9914778800650624,0.0384842866469906
terrible,,1,-2.2236614360851323,0.0773173620378575
awful,,1,-2.052908204031357,0.1009973543525925
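
In a logistic classifier, the predicted probability of the positive class is the sigmoid of the intercept plus the weighted word counts. The sketch below applies that formula to the coefficients listed above, rounded to four decimals. `positive_probability` is a hypothetical helper, and the weights for `wow` and `hate` are not shown in the output above, so they are left out and the examples avoid those words.

```python
import math

# Coefficients copied from the table above (rounded to four decimals).
intercept = 1.3366
weights = {
    'awesome': 1.1335, 'great': 0.8631, 'fantastic': 0.8858,
    'amazing': 1.1001, 'love': 1.3593, 'horrible': -2.2513,
    'bad': -0.9915, 'terrible': -2.2237, 'awful': -2.0529,
}

def positive_probability(counts):
    # Sigmoid of intercept + sum(weight * count) over the listed words.
    z = intercept + sum(weights[w] * c for w, c in counts.items())
    return 1.0 / (1.0 + math.exp(-z))

p_none = positive_probability({})                  # no selected word present
p_horrible = positive_probability({'horrible': 1})
p_love = positive_probability({'love': 1})
print(round(p_none, 3), round(p_horrible, 3), round(p_love, 3))
```

Under these weights, a review that contains none of the selected words still scores about 0.79, so it is classified as positive. That is a plausible reason for the large number of negative reviews predicted as positive in the evaluation.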


**Evaluation of the model**

In [17]:
model.evaluate(test_data)

{'accuracy': 0.8463848186404036,
 'auc': 0.6935096220934976,
 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 4
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        0        |  159  |
 |      0       |        0        |  371  |
 |      0       |        1        |  4957 |
 |      1       |        1        | 27817 |
 +--------------+-----------------+-------+
 [4 rows x 3 columns],
 'f1_score': 0.9157860082304526,
 'log_loss': 0.39622654670874996,
 'precision': 0.8487520595594068,
 'recall': 0.9943165570488991,
 'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 1001
 
 Data:
 +-----------+--------------------+-----+-------+------+
 | threshold |        fpr         | tpr |   p   |  n   |
 +-----------+--------------------+-----+-------+------+
 |    0.0    |        1.0         | 1.0 | 27976 | 5328
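
The headline metrics can be recomputed from the confusion matrix counts above, and the same counts give the accuracy of a baseline that always predicts the positive class. The following is a plain-Python sketch; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above (positive class = 1).
tp = 27817  # target 1, predicted 1
fn = 159    # target 1, predicted 0
fp = 4957   # target 0, predicted 1
tn = 371    # target 0, predicted 0

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the majority (positive) class.
majority_accuracy = (tp + fn) / total
# Share of negative reviews that the model labels as negative.
true_negative_rate = tn / (tn + fp)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f}')
print(f'majority baseline={majority_accuracy:.4f} '
      f'true negative rate={true_negative_rate:.4f}')
```

The accuracy of roughly 0.846 is only slightly above the roughly 0.840 that the always-positive baseline reaches, and only about 7% of the negative reviews are labeled negative. On data this imbalanced, accuracy and recall alone overstate how well the model separates the two classes.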

## Testing with New Reviews

In [18]:
reviews = {'review1' : " I am very happy to buy this product. My baby loves this, this is pretty much awesome. Once I bought this for my elder son and he loved it very much. Amazing product!!!",
           'review2' : "this product is very bad, I hate this one. This is horrible",
           'review3' : " I loved this product once and I bought this for my neice. But this time the package was damaged. my fiance also said the same opinion \
                that the packaging is bad now a days. "
          }

In [19]:
# Build a two-column DataFrame (review_num, review) from the dict, then convert it to an SFrame
df = pd.DataFrame(list(reviews.items()), columns=['review_num', 'review'])
sf = tc.SFrame(df)

In [20]:
sf['word_count'] = tc.text_analytics.count_words(sf['review'])
for word in selected_words:
    sf[word]= sf['word_count'].apply(lambda x: count_words(x,word))
sentiment = model.predict(sf,output_type='class')
sentiment

dtype: int
Rows: 3
[1, 0, 1]
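
The positive label for the third review, which is mostly a complaint, can be checked by hand. The sketch below assumes lowercase whitespace tokenization (Turi Create's tokenizer may behave differently) and uses the values printed by `model.coefficients`. The weights for `wow` and `hate` do not appear in that output, so the check only holds because neither word occurs in this review.

```python
import math

review3 = ("I loved this product once and I bought this for my neice. "
           "But this time the package was damaged. my fiance also said the same opinion "
           "that the packaging is bad now a days.")

# Values copied from the model.coefficients output.
intercept = 1.3365913848877649
weights = {
    'awesome': 1.133534666034138, 'great': 0.8630655001196574,
    'fantastic': 0.8858047568814237, 'amazing': 1.1000933113660225,
    'love': 1.3592688669225097, 'horrible': -2.251335236759098,
    'bad': -0.9914778800650624, 'terrible': -2.2236614360851323,
    'awful': -2.052908204031357,
}

tokens = review3.lower().split()  # assumed tokenizer, not Turi Create's
# The two unlisted weights are irrelevant only if these words are absent.
assert 'wow' not in tokens and 'hate' not in tokens

counts = {w: tokens.count(w) for w in weights}
z = intercept + sum(weights[w] * counts[w] for w in weights)
p_positive = 1.0 / (1.0 + math.exp(-z))
print(counts['bad'], round(p_positive, 3))
```

Under these assumptions the only exact match is a single "bad" ("loved" is not counted as "love"), and the positive intercept outweighs it. The probability lands just above 0.5, which is consistent with the predicted class of 1.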