### **AMAZON PRODUCTS SENTIMENT ANALYSIS**

#### **PROBLEM STATEMENT**
With the rise of e-commerce and the wide use of social media platforms, there is a large volume of user-generated data, ranging from current affairs and topical issues to product reviews. Companies and organizations are gradually recognizing the impact of social media comments on their brands, and many now use sentiment analysis to monitor their brand reputation across social media platforms and the web in general.

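This project uses the Amazon US Reviews dataset from the TensorFlow Datasets catalog, https://www.tensorflow.org/datasets/catalog/amazon_us_reviews#amazon_us_reviewswireless_v1_00_default_config, to create three models that predict the sentiment of a review. The code below loads the `Mobile_Electronics_v1_00` configuration of that dataset.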

#### **DEFINING THE METRIC FOR SUCCESS** 
The project is considered successful if it produces three models (Naive Bayes, Support Vector Machine and Logistic Regression), each with an accuracy of at least 80%, a precision of at least 85% and a recall of at least 80%.
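The targets above can be checked mechanically once a model has produced predictions. The following is a minimal sketch, not part of the original notebook: `meets_targets` and `TARGETS` are hypothetical names, the labels are toy data, and weighted averaging is assumed because that is what the modeling cells use.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Targets taken from the success metric stated above
TARGETS = {"accuracy": 0.80, "precision": 0.85, "recall": 0.80}

def meets_targets(y_true, y_pred, average="weighted"):
    """Return each score and whether it reaches its target (hypothetical helper)."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
    }
    return {name: (value, bool(value >= TARGETS[name])) for name, value in scores.items()}

# Toy labels for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
result = meets_targets(y_true, y_pred)
for name, (value, ok) in result.items():
    print(f"{name}: {value:.4f} (target met: {ok})")
```

Passing `average=None`-style per-class scores would need a different comparison, so this sketch sticks to a single averaged number per metric.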

### **EXPERIMENTAL DESIGN**
1. Loading libraries
2. Loading data
3. Reading data
4. Cleaning data
5. Feature engineering and preprocessing
6. Modeling
7. Optimization and model evaluation
8. Conclusions and recommendations



#### Loading the required libraries 

In [28]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import tweepy
import warnings
import re

#### Web Scraping Twitter using Tweepy

In [29]:


# Set up authentication credentials
#consumer_key = 'xxxx'
#consumer_secret = 'xxxx'
#access_token = 'xxxx'
#access_token_secret = 'xxxxxx'

# Authenticate to Twitter
#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_token_secret)


# Create API object
#api = tweepy.API(auth)

# Specify the search query and number of tweets to fetch
#query = "Amazon"
#num_tweets = 1500

# Fetch tweets containing the specified search query
#tweets = api.search_tweets(q=query, lang="en", tweet_mode="extended", count=num_tweets)

# Extract relevant information, such as tweet text and user information
#for tweet in tweets:
    #tweet_text = tweet.full_text
    #user_info = tweet.user.screen_name
    #print(f'Tweet: {tweet_text}')
    #print(f'User: {user_info}')
    #print('---')


#### Loading an existing Amazon review dataset from TensorFlow Datasets

In [30]:
import tensorflow as tf
import tensorflow_datasets as tfds
ds = tfds.load('amazon_us_reviews/Mobile_Electronics_v1_00', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)
#convert the dataset into a pandas dataframe
df = tfds.as_dataframe(ds)


<_PrefetchDataset element_spec={'data': {'customer_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'helpful_votes': TensorSpec(shape=(), dtype=tf.int32, name=None), 'marketplace': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_category': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_parent': TensorSpec(shape=(), dtype=tf.string, name=None), 'product_title': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_body': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_date': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_headline': TensorSpec(shape=(), dtype=tf.string, name=None), 'review_id': TensorSpec(shape=(), dtype=tf.string, name=None), 'star_rating': TensorSpec(shape=(), dtype=tf.int32, name=None), 'total_votes': TensorSpec(shape=(), dtype=tf.int32, name=None), 'verified_purchase': TensorSpec(shape=(), dtype=tf.int64, name=None), 'vine': TensorSpec(shape=(), dtype

In [31]:
# Viewing the top records
df.head()

Unnamed: 0,data/customer_id,data/helpful_votes,data/marketplace,data/product_category,data/product_id,data/product_parent,data/product_title,data/review_body,data/review_date,data/review_headline,data/review_id,data/star_rating,data/total_votes,data/verified_purchase,data/vine
0,b'20980074',0,b'US',b'Mobile_Electronics',b'B00D1847NE',b'274617424',b'Teenage Mutant Ninja Turtles Boombox CD Play...,b'Does not work',b'2015-01-09',b'One Star',b'R1OVS0D6SEXPW7',1,0,0,1
1,b'779273',0,b'US',b'Mobile_Electronics',b'B00KMO6DYG',b'397452138',b'4 Gauge Amp Kit Amplifier Install Wiring Com...,b'This is a great wiring kit i used it to set ...,b'2015-08-06',b'Great kit',b'R9VSD0ET8FERB',4,0,0,1
2,b'15410531',0,b'US',b'Mobile_Electronics',b'B000GWLL0K',b'948304826',b'Travel Wall Charger fits Creative Zen Vision...,b'It works great so much faster than USB charg...,b'2007-03-15',b'A/C Charger for Creative Zen Vision M',b'R3ISXCZHWLJLBH',5,0,0,1
3,b'27389005',0,b'US',b'Mobile_Electronics',b'B008L3JE6Y',b'466340015',b'High Grade Robust 360\xc2\xb0 Adjustable Car...,b'This product was purchased to hold a monitor...,b'2013-07-30',b'camera stand',b'R1TWVUDOFJSQAW',5,0,0,1
4,b'2663569',0,b'US',b'Mobile_Electronics',b'B00GHZS4SC',b'350592810',b'HDE Multifunctional Bluetooth FM Audio Car K...,"b""it works but it has really bad sound quality...",b'2014-12-31',b'bad sound quality',b'R2PEOEUR1LP0GH',3,0,0,1


In [32]:
# Viewing the bottom details
df.tail()

Unnamed: 0,data/customer_id,data/helpful_votes,data/marketplace,data/product_category,data/product_id,data/product_parent,data/product_title,data/review_body,data/review_date,data/review_headline,data/review_id,data/star_rating,data/total_votes,data/verified_purchase,data/vine
104970,b'16433874',0,b'US',b'Mobile_Electronics',b'B003SH571E',b'816208213',b'BlueAnt S4 Bluetooth Car Speakerphone Kit [U...,"b""It's a wonderful invention. You don't need t...",b'2012-09-17',b'excellent gadget',b'RCDFUCZ20BO1Y',4,0,0,1
104971,b'11714515',3,b'US',b'Mobile_Electronics',b'B00HK6CPVY',b'383636318',"b'Szstudio US 5"" Car GPS Navigation Sat Nav Bu...","b""This is not good item,I can even maket work,...",b'2014-04-26',b'Do not waste your money',b'RU0PWIV6N2OW5',1,5,0,1
104972,b'51380565',0,b'US',b'Mobile_Electronics',b'B005CJ769C',b'698252499',b'New Barnes Noble Nook 2 2nd Edition Generati...,b'The cover and skin were both exactly like th...,b'2012-01-16',b'Great product!',b'R3R5T9X5WW8C25',5,0,0,1
104973,b'24019237',0,b'US',b'Mobile_Electronics',b'B004911E9M',b'423996186',b'Wall AC Charger USB Sync Data Cable for iPho...,b'I ordered 2 of these cords for both mine and...,b'2011-12-13',b'Horrible. DOES NOT WORK with iPhone 3GS',b'R300CMC3CKOQRO',1,0,0,1
104974,b'7627794',0,b'US',b'Mobile_Electronics',b'B00ATWD880',b'866011377',b'Covert Acoustic Tube Earpiece 2 PIN for ICOM...,b'Item works better than I had hoped for. Very...,b'2013-06-25',b'Makes wearing an earpiece comfortable',b'R1KL85UT0HAW24',5,0,0,1


In [33]:
# Checking the shape of our dataset
df.shape

(104975, 15)

#### Data Pre-processing

In [34]:
# Selecting the two features needed: the review text and the star rating

df1 = df[['data/review_body', 'data/star_rating']]

df1.head()

Unnamed: 0,data/review_body,data/star_rating
0,b'Does not work',1
1,b'This is a great wiring kit i used it to set ...,4
2,b'It works great so much faster than USB charg...,5
3,b'This product was purchased to hold a monitor...,5
4,"b""it works but it has really bad sound quality...",3


In [35]:
# Converting the star rating to a binary sentiment label
# If the star rating is greater than or equal to 3, the review is positive (1); otherwise it is negative (0).
df1["Sentiment"] = df1["data/star_rating"].apply(lambda score: "positive" if score >= 3 else "negative")
df1['Sentiment'] = df1['Sentiment'].map({'positive':1, 'negative':0})

# Decoding the review text from bytes to UTF-8 strings
df1['short_review'] = df1['data/review_body'].str.decode("utf-8")
df1 = df1[["short_review", "Sentiment"]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1["Sentiment"] = df1["data/star_rating"].apply(lambda score: "positive" if score >= 3 else "negative")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['Sentiment'] = df1['Sentiment'].map({'positive':1, 'negative':0})
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['short_review'] =df1['data
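The `SettingWithCopyWarning` messages above appear because `df1` was created by selecting columns from `df`, so pandas cannot tell whether the later assignments are meant to affect `df` as well. A minimal sketch of one common way to avoid the warning, taking an explicit `.copy()`, is shown here on a toy frame; the rows are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy frame standing in for the TFDS dataframe (values are illustrative)
df = pd.DataFrame({
    "data/review_body": [b"Does not work", b"Great kit"],
    "data/star_rating": [1, 4],
    "data/total_votes": [0, 0],
})

# .copy() makes df1 an independent frame, so later assignments are unambiguous
df1 = df[["data/review_body", "data/star_rating"]].copy()

# Same labelling rule as the notebook: 3 stars or more is positive (1)
df1["Sentiment"] = (df1["data/star_rating"] >= 3).astype(int)

# Decode the byte strings into regular Python strings
df1["short_review"] = df1["data/review_body"].str.decode("utf-8")
df1 = df1[["short_review", "Sentiment"]]
print(df1)
```

The original `df` is left untouched, which is usually what is wanted when a working subset is carved out for preprocessing.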

In [36]:
# Preview of the top records from the cleaned dataset
df1.head()

Unnamed: 0,short_review,Sentiment
0,Does not work,0
1,This is a great wiring kit i used it to set up...,1
2,It works great so much faster than USB charger...,1
3,This product was purchased to hold a monitor o...,1
4,it works but it has really bad sound quality. ...,1


In [37]:
# Checking the distribution of positive and negative comments

df1["Sentiment"].value_counts()

1    80077
0    24898
Name: Sentiment, dtype: int64
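The counts above show a clear class imbalance, which matters when reading the accuracy figures later on. A short sketch of the arithmetic, using only the two counts printed above, gives the accuracy that a model would reach by always predicting the positive class.

```python
# Class counts copied from the value_counts() output above
positive, negative = 80077, 24898
total = positive + negative

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(positive, negative) / total

print(f"Positive share: {positive / total:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model accuracy is best judged against this baseline rather than against 50%.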

In [38]:
# Removing links
df1["short_review"] = df1["short_review"].apply(lambda s: ' '.join(re.sub(r"(\w+://\S+)", " ", s).split()))

# Changing all the letters to lower case
df1["short_review"] = df1.short_review.map(lambda x: x.lower())

# Loading the English stop words
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Named stop_words so that the imported stopwords module is not overwritten
stop_words = set(stopwords.words("english"))

# Removing the punctuation
df1["short_review"] = df1["short_review"].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '' , x))
# Removing the literal substring 'user' wherever it occurs (this also alters words that contain it)
df1["short_review"] = df1["short_review"].str.replace('user','')

# Removing emojis and any other non-ASCII characters
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
df1["short_review"] = df1["short_review"].apply(deEmojify)

# Removing stop words
df1["short_review"] = df1["short_review"].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nelly\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### **Tokenization**
We use word tokenization to split each review into individual words.

Lemmatization was chosen over stemming because of the nature of the analysis: lemmatization converts each word to its base form. For example, "runs", "ran" and "running" are all converted to "run", which is preferable to stemming, which can leave words in an incomplete state.

In [39]:
df1.head()

Unnamed: 0,short_review,Sentiment
0,work,0
1,great wiring kit used set pyle 2000 watt amp 2...,1
2,works great much faster usb chargerbuy glad,1
3,product purchased hold monitor desk connected ...,1
4,works really bad sound quality bass doesnt wor...,1
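The preview above shows the review "Does not work" reduced to just "work", because "not" is itself an English stop word. A minimal sketch of one possible adjustment, keeping negation words out of the stop-word set, follows. It is not what the notebook does: `STOP_WORDS` here is a small illustrative subset rather than the NLTK list, and `NEGATIONS` and `clean` are hypothetical names.

```python
import re
import string

# Illustrative subset of English stop words (an assumption, not NLTK's full list)
STOP_WORDS = {"does", "not", "it", "is", "the", "a", "this", "but", "has"}

# Words that flip the meaning of a review and may be worth keeping
NEGATIONS = {"not", "no", "never"}

def clean(text, keep_negations=True):
    """Lower-case, strip punctuation and drop stop words (hypothetical helper)."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return " ".join(word for word in text.split() if word not in stop)

print(clean("Does not work", keep_negations=False))
print(clean("Does not work", keep_negations=True))
```

Whether keeping negations improves the models in this project would need to be measured; with a bag-of-words vectorizer, an n-gram range that includes bigrams may also be needed for "not work" to count as a single feature.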


In [40]:

# Download the required data files
nltk.download('punkt')

# After downloading the data files, you can use the word_tokenize function
from nltk.tokenize import word_tokenize

# word tokenizing
df1["short_review"] = df1["short_review"].apply(word_tokenize)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nelly\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [41]:
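# Lemmatizing the tokens, treating each token as a verb (pos='v')

# Uncomment on the first run to download the WordNet data the lemmatizer needs
#nltk.download('omw-1.4')
#nltk.download('wordnet')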
from nltk.stem import WordNetLemmatizer


lemmatiser = WordNetLemmatizer()

df1["short_review"] = df1["short_review"].apply(lambda tokens: [lemmatiser.lemmatize(token, pos='v') for token in tokens])
df1

Unnamed: 0,short_review,Sentiment
0,[work],0
1,"[great, wire, kit, use, set, pyle, 2000, watt,...",1
2,"[work, great, much, faster, usb, chargerbuy, g...",1
3,"[product, purchase, hold, monitor, desk, conne...",1
4,"[work, really, bad, sound, quality, bass, does...",1
...,...,...
104970,"[wonderful, invention, dont, need, wrap, bluet...",1
104971,"[good, itemi, even, maket, workthe, gps, app, ...",0
104972,"[cover, skin, exactly, like, picture, describe...",1
104973,"[order, 2, cord, mine, husband, iphone, 3gs, n...",0


#### **NAIVE BAYES MODEL**

In [42]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

df1['short_review'] = df1['short_review'].apply(' '.join) 
# Split the data into train and test sets
X = df1['short_review']
y = df1['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_vectorized, y_train)

# Predict the sentiment labels for test data
y_pred = clf.predict(X_test_vectorized)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')

Accuracy: 0.8713
Precision: 0.8661
Recall: 0.8713
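The recall printed above is identical to the accuracy. That is expected: for single-label classification, support-weighted recall is mathematically the same quantity as accuracy, so it says nothing about how each class performs. The sketch below shows this on toy labels and prints the per-class recall alongside; the labels are illustrative only.

```python
import math
from sklearn.metrics import accuracy_score, recall_score

# Toy labels for illustration only
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# average=None returns one recall value per class, ordered by label (0, then 1)
per_class_recall = recall_score(y_true, y_pred, average=None)

print(f"Accuracy:        {accuracy:.4f}")
print(f"Weighted recall: {weighted_recall:.4f}")
print(f"Recall per class: {per_class_recall}")
```

For an imbalanced dataset such as this one, the per-class values (or a `classification_report`) are the more informative check against the recall target.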


In [55]:
# Validating using a new review 

import pickle

# Save the trained model as a .pkl file
with open('naive_bayes_model.pkl', 'wb') as file:
    pickle.dump(clf, file)

# Load the trained Naive Bayes model
with open('naive_bayes_model.pkl', 'rb') as file:
    naive_model = pickle.load(file)

new_review = input("Enter a review: ")

# Preprocess the new review
# Note: only lower-casing and tokenization are applied here; the punctuation,
# stop-word and lemmatization steps used on the training data are not repeated
new_review = new_review.lower()
new_review = word_tokenize(new_review)
new_review = ' '.join(new_review)

# Reuse the CountVectorizer already fitted on X_train in the Naive Bayes cell
# (fitting a new one on the same data would only rebuild the same vocabulary)

# Vectorize the new review
new_review_vectorized = vectorizer.transform([new_review])

# Predict the sentiment label of the new review
naive_model.predict(new_review_vectorized)[0]


1
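The cell above pickles only the classifier, so the fitted vectorizer has to be available separately whenever the model is reloaded. One way to keep the two together, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable for this project, is to fit and pickle them as a single object. The training texts are toy examples, and the model is serialized in memory rather than to `naive_bayes_model.pkl`.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reviews and labels for illustration only
texts = ["great product works well", "love it great sound",
         "does not work", "terrible waste of money"]
labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Serialize and restore the whole pipeline (vocabulary included)
restored = pickle.loads(pickle.dumps(pipeline))

# Raw text can be passed directly; the pipeline vectorizes it internally
prediction = restored.predict(["great sound"])[0]
print(prediction)
```

With this arrangement a new review only needs the same text cleaning as the training data before being passed to `predict`.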

#### **SUPPORT VECTOR MACHINE**

In [23]:
from sklearn.svm import LinearSVC 
model = LinearSVC().fit(X_train_vectorized, y_train)
predicted = model.predict(X_test_vectorized)
report = classification_report( y_test, predicted )
print(report)
acc=accuracy_score(y_test,predicted)


              precision    recall  f1-score   support

           0       0.74      0.70      0.72      4906
           1       0.91      0.93      0.92     16089

    accuracy                           0.87     20995
   macro avg       0.83      0.81      0.82     20995
weighted avg       0.87      0.87      0.87     20995





#### **LOGISTIC REGRESSION**

In [17]:


# Train the logistic regression model
clf = LogisticRegression()
clf.fit(X_train_vectorized, y_train)

# Predict on the testing set
y_pred = clf.predict(X_test_vectorized)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
# Stored under a different name so the imported classification_report function is not overwritten
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.8837818528221005
Classification Report:
              precision    recall  f1-score   support

           0       0.78      0.69      0.74      4906
           1       0.91      0.94      0.93     16089

    accuracy                           0.88     20995
   macro avg       0.85      0.82      0.83     20995
weighted avg       0.88      0.88      0.88     20995



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
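The warning above reports that the solver stopped at its iteration limit before converging. A minimal sketch of the first remedy the message suggests, raising `max_iter`, is shown below on toy data. The value 1000 is an assumption, and its effect on the scores reported in this notebook has not been measured.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews and labels for illustration only
texts = ["great product works well", "love it great sound",
         "does not work", "terrible waste of money"]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(texts)

# max_iter=1000 is an assumed value; the scikit-learn default is 100
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# n_iter_ holds the iterations actually used; a value below max_iter means the solver converged
print(model.n_iter_)
```

On the full review data the number of iterations needed may be much larger than on this toy example, so `n_iter_` is worth checking after fitting.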


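#### **CONCLUSION AND RECOMMENDATION**

There were more positive than negative reviews for the Amazon mobile electronics products (80,077 positive versus 24,898 negative).

Logistic Regression had the highest accuracy (0.88), followed by Naive Bayes and the Support Vector Machine (both about 0.87).

The Support Vector Machine and Logistic Regression reports show lower recall on negative reviews (0.70 and 0.69) than on positive reviews (0.93 and 0.94).

More data should be collected, with a larger share of negative reviews, to give the models a larger and more balanced training set, which may improve accuracy.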
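Collecting more negative reviews is one route. A complementary option, not used in this notebook, is to weight the classes during training so that errors on the smaller class count for more. The sketch below computes scikit-learn's "balanced" weights from the class counts reported earlier; whether such weights would improve the results here is untested.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild the label vector from the class counts reported earlier
y = np.array([0] * 24898 + [1] * 80077)

# 'balanced' weights each class by n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

A dictionary like this can be passed as the `class_weight` argument of `LinearSVC` or `LogisticRegression`; both also accept the string `"balanced"` directly.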