# Project 3 - Sentiment Analysis for E-Commerce Store
### by Azubuogu Peace Udoka

### Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#data">Understanding the Dataset</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#ques">Calculating Metrics</a></li>
<li><a href="#conc">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Background
As a data analyst at an e-commerce store, I have access to a dataset of product reviews and their associated ratings. The task is to build a model that classifies each review as positive, negative, or neutral based on its text.

In [1]:
# import all packages and set plots to be embedded inline
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk import word_tokenize

%matplotlib inline
# set general style of plots
sns.set(rc={'figure.figsize': (20, 8)}, style="white", font_scale=1.5)

# suppress warnings to keep the notebook output clean
warnings.simplefilter("ignore")

### Exploring the Dataset

In [2]:
data = pd.read_csv('Amazon Product Review.txt')
# view 5 random rows of data
data.sample(5)

| | marketplace | customer_id | review_id | product_id | product_parent | product_title | product_category | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25076 | US | 34643897 | R30E8KWLFES6AC | B00IKPYKWG | 2693241 | Fire HD 7, 7" HD Display, Wi-Fi, 8 GB | PC | 5 | 0 | 0 | N | Y | Five Stars | Aabsolutely love my Kindle! | 2014-12-31 | 1 |
| 24791 | US | 48047713 | R1AK5U1770DL7O | B00IKPYKWG | 2693241 | Fire HD 7, 7" HD Display, Wi-Fi, 8 GB | PC | 5 | 1 | 1 | N | Y | Five Stars | She loves it | 2015-01-01 | 1 |
| 407 | US | 40870906 | R2B9PETJDW8115 | B00IKPYKWG | 2693241 | Fire HD 7, 7" HD Display, Wi-Fi, 8 GB | PC | 4 | 0 | 0 | N | Y | This new one the screen is perfect for colors ... | I was looking for a replacement for my first g... | 2015-08-21 | 1 |
| 4576 | US | 1104762 | RNC4EBYUM7B4Q | B00LCHRQL6 | 2693241 | Fire HD 7, 7" HD Display, Wi-Fi, 8 GB | PC | 5 | 0 | 0 | N | Y | Five Stars | Very easy to carry and use. Fits in pocket of ... | 2015-06-20 | 1 |
| 9795 | US | 15679880 | R2UB0F0RYQCTH2 | B00IKPYKWG | 2693241 | Fire HD 7, 7" HD Display, Wi-Fi, 8 GB | PC | 5 | 1 | 1 | N | Y | Great All Purpose Pad | Great internet pad for not a wholen lot of mon... | 2015-05-02 | 1 |


In [3]:
# size of dataset
data.shape

(30846, 16)

The dataset has 30,846 rows and 16 columns.

In [4]:
# checking for missing values
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
sentiment            0
dtype: int64

There are no missing values.

In [5]:
# checking for duplicates
data.duplicated().sum()

0

There are no duplicates.
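Note that `duplicated()` with no arguments only flags rows that are identical in every column. As an optional extra check, the sketch below compares full-row duplicates with duplicates on a single key column. It uses a made-up three-row frame, and it assumes `review_id` is meant to be unique per review, which the dataset description does not state.

```python
import pandas as pd

# Hypothetical toy frame: the second and third rows share a review_id
toy = pd.DataFrame({
    "review_id": ["R1", "R2", "R2"],
    "star_rating": [5, 4, 3],
})

# Rows identical across all columns (what data.duplicated() checks)
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an earlier review_id, regardless of other columns
id_dupes = toy.duplicated(subset=["review_id"]).sum()

print(full_row_dupes, id_dupes)
```

If the key-based count is non-zero on the real data, the repeated rows are worth inspecting before modelling.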

In [6]:
data.dtypes

marketplace          object
customer_id           int64
review_id            object
product_id           object
product_parent        int64
product_title        object
product_category     object
star_rating           int64
helpful_votes         int64
total_votes           int64
vine                 object
verified_purchase    object
review_headline      object
review_body          object
review_date          object
sentiment             int64
dtype: object

In [7]:
# convert customer_id and product_parent to strings and review_date to datetime
data['customer_id'] = data['customer_id'].astype('str')
data['product_parent'] = data['product_parent'].astype('str')
data['review_date'] = pd.to_datetime(data['review_date'])
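The conversions above can be sanity-checked on a small frame. The sketch below is illustrative only: it builds a two-row frame from values in the sample rows shown earlier and assumes the dates are in ISO `YYYY-MM-DD` form, as they appear in that sample. The explicit `format` argument is an addition, not part of the original cell.

```python
import pandas as pd

# Toy frame; column names mirror the dataset
toy = pd.DataFrame({
    "customer_id": [34643897, 48047713],
    "product_parent": [2693241, 2693241],
    "review_date": ["2014-12-31", "2015-01-01"],
})

# Identifiers are labels, not quantities, so store them as strings
toy["customer_id"] = toy["customer_id"].astype(str)
toy["product_parent"] = toy["product_parent"].astype(str)

# An explicit format makes malformed dates raise instead of being guessed
toy["review_date"] = pd.to_datetime(toy["review_date"], format="%Y-%m-%d")

print(toy.dtypes)
```

Storing the IDs as strings keeps them out of numeric summaries such as `describe()`, where a mean customer ID would be meaningless.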

In [8]:
# fraction of positive and negative reviews
data.sentiment.value_counts()/len(data)

1    0.835343
0    0.164657
Name: sentiment, dtype: float64

The sentiment column contains two values:

0 - negative sentiment

1 - positive sentiment

The majority of the reviews are positive (83.5%), while 16.5% are negative.
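The same class shares can also be computed with `value_counts(normalize=True)`, which returns fractions directly. A minimal sketch on a toy Series (the labels are invented for illustration, not drawn from the dataset):

```python
import pandas as pd

# Hypothetical labels: 1 = positive, 0 = negative
labels = pd.Series([1, 1, 1, 1, 0], name="sentiment")

# normalize=True returns fractions instead of raw counts
shares = labels.value_counts(normalize=True)

# Express the fractions as percentages rounded to one decimal place
percentages = (shares * 100).round(1)

print(percentages)
```

With a split as uneven as the one in this dataset, accuracy alone can be misleading: a model that always predicts the positive class would already be right for about 84% of the reviews.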

In [9]:
# create a new column with the number of tokens in each review_body
# (word_tokenize also counts punctuation marks as separate tokens)
data['no_of_words'] = [len(word_tokenize(review)) for review in data.review_body]
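The word-count step can be illustrated on a toy frame. To stay runnable without NLTK's tokenizer data, this sketch swaps in a plain whitespace split instead of `word_tokenize`. The two are not equivalent: `word_tokenize` emits punctuation as separate tokens, so its counts can come out higher for text that contains punctuation.

```python
import pandas as pd

# Two review texts borrowed from the sample rows shown earlier
toy = pd.DataFrame(
    {"review_body": ["She loves it", "Aabsolutely love my Kindle!"]}
)

# Whitespace split: a simplification, not the nltk tokenizer used in the notebook
toy["no_of_words"] = toy["review_body"].str.split().str.len()

print(toy)
```

The vectorised `.str` accessor avoids building an intermediate list of token lists, which can matter for memory on larger review sets.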