<a href="https://colab.research.google.com/github/Arya1790/NLP/blob/main/Kindle_Review_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dataset:**

This is a small subset of dataset of Book reviews from Amazon Kindle Store category. Dataset of product reviews from Amazon Kindle Store category from May 1996 - July 2014. Contains total of 982619 entries. Each reviewer has at least 5 reviews and each product has at least 5 reviews in this dataset.

Columns

asin - ID of the product, like B000FA64PK

helpful - helpfulness rating of the review - example: 2/3.

overall - rating of the product.

reviewText - text of the review (heading).

reviewTime - time of the review (raw).

reviewerID - ID of the reviewer, like A3SPTOKDG7WBLN

reviewerName - name of the reviewer.

summary - summary of the review (description).

unixReviewTime - unix timestamp.

Acknowledgements This dataset is taken from Amazon product data, Julian McAuley, UCSD website. http://jmcauley.ucsd.edu/data/amazon/

License to the data files belong to them.



In [23]:
!pip install gensim



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Sentiment analysis on reviews.**

Understanding how people rate usefulness of a review/ What factors influence helpfulness of a review.

Fake reviews/ outliers.

Best rated product IDs, or similarity between products based on reviews alone (not the best idea ikr).

**Best Practises**

Preprocessing And Cleaning

Train Test Split

BOW,TFIDF,Word2vec

Train ML algorithms

In [2]:
# Load the dataset
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [27]:
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
data=pd.read_csv('/content/drive/MyDrive/Dataset/all_kindle_review.csv')
data.head(2)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,asin,helpful,rating,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,0,11539,B0033UV8HI,"[8, 10]",3,"Jace Rankin may be short, but he's nothing to ...","09 2, 2010",A3HHXRELK8BHQG,Ridley,Entertaining But Average,1283385600
1,1,5957,B002HJV4DE,"[1, 1]",5,Great short read. I didn't want to put it dow...,"10 8, 2013",A2RGNZ0TRF578I,Holly Butler,Terrific menage scenes!,1381190400


In [5]:
df = data[['reviewText','rating']]
df.shape

(12000, 2)

In [6]:
# null value check
df.isnull().sum()

Unnamed: 0,0
reviewText,0
rating,0


In [7]:
# target column
df['rating'].unique()

array([3, 5, 4, 2, 1])

In [8]:
# check if data is imbalanced
df['rating'].value_counts()

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
5,3000
4,3000
3,2000
2,2000
1,2000


Data preprocessing and cleaning

In [9]:
# positive review is 1 and negative review is 0

In [10]:
df.loc[:,'rating'] = df['rating'].apply(lambda x:0 if x<3 else 1)

In [11]:
df['rating'].value_counts()

Unnamed: 0_level_0,count
rating,Unnamed: 1_level_1
1,8000
0,4000


In [12]:
# preprocessing
# 1. lowerall the cases
df.loc[:,'reviewText'] = df['reviewText'].str.lower()

In [13]:
# 2. clean the data
## Removing special characters
df.loc[:,'reviewText'] = df['reviewText'].apply(lambda x: re.sub('[^a-zA-Z0-9 ]+','',str(x)))
## Remove the stopswords
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:" ".join([y for y in x.split() if y not in stopwords.words('english')]))
## Remove url
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , str(x)))
## Remove html tags
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
## Remove any additional spaces
df.loc[:,'reviewText']=df['reviewText'].apply(lambda x: " ".join(x.split()))

In [14]:
df.head(2)

Unnamed: 0,reviewText,rating
0,jace rankin may short hes nothing mess man hau...,1
1,great short read didnt want put read one sitti...,1


In [15]:
lemmatizer=WordNetLemmatizer()

def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df.loc[:,'reviewText']=df['reviewText'].apply(lambda x:lemmatize_words(x))

In [16]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(df['reviewText'],df['rating'],
                                              test_size=0.20,stratify=df['rating'],random_state=1)

In [17]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

In [18]:
# BOW
bow=CountVectorizer()
X_train_bow=bow.fit_transform(X_train).toarray()
X_test_bow=bow.transform(X_test).toarray()

nb_model_bow=GaussianNB().fit(X_train_bow,y_train)
y_pred_bow=nb_model_bow.predict(X_test_bow)
print("BOW accuracy: ",accuracy_score(y_test,y_pred_bow))

BOW accuracy:  0.5875


In [None]:
confusion_matrix(y_test,y_pred_bow)

In [22]:
# TF-IDF
tfidf=TfidfVectorizer()
X_train_tfidf=tfidf.fit_transform(X_train).toarray()
X_test_tfidf=tfidf.transform(X_test).toarray()

nb_model_tfidf=GaussianNB().fit(X_train_tfidf,y_train)
y_pred_tfidf=nb_model_tfidf.predict(X_test_tfidf)
print("TFIDF accuracy: ",accuracy_score(y_test,y_pred_tfidf))

TFIDF accuracy:  0.59125


In [21]:
confusion_matrix(y_test,y_pred_tfidf)

array([[522, 278],
       [703, 897]])

In [34]:
# word2Vec
import gensim
from gensim.models import Word2Vec, KeyedVectors
from nltk import sent_tokenize
from gensim.utils import simple_preprocess
import numpy as np

In [28]:
words = []
def preprocess(corpus):
  for sent in corpus:
    sent_tokens = sent_tokenize(sent)
    for sent_token in sent_tokens:
      words.append(simple_preprocess(sent_token))
  return words

words = preprocess(df['reviewText'])

In [30]:
print(words[0])
print(words[1])

['jace', 'rankin', 'may', 'short', 'he', 'nothing', 'mess', 'man', 'hauled', 'saloon', 'undertaker', 'know', 'he', 'famous', 'bounty', 'hunter', 'oregon', 'shot', 'man', 'saloon', 'finished', 'year', 'long', 'quest', 'avenge', 'sister', 'murder', 'trying', 'figure', 'next', 'snottynosed', 'farm', 'boy', 'rescued', 'gang', 'bully', 'offer', 'money', 'kill', 'man', 'forced', 'ranch', 'reluctantly', 'agrees', 'bring', 'man', 'justice', 'kill', 'outright', 'first', 'need', 'tell', 'sister', 'widower', 'newskyla', 'kyle', 'springer', 'bailey', 'riding', 'trail', 'sleeping', 'ground', 'past', 'month', 'trying', 'find', 'jace', 'want', 'revenge', 'man', 'killed', 'husband', 'took', 'ranch', 'amongst', 'crime', 'shes', 'keen', 'detour', 'jace', 'want', 'take', 'realizes', 'shes', 'option', 'hide', 'behind', 'boy', 'persona', 'best', 'try', 'keep', 'pace', 'confrontation', 'along', 'way', 'get', 'shot', 'jace', 'discovers', 'kyles', 'kyla', 'come', 'clean', 'whole', 'reason', 'need', 'scoundrel

In [31]:
# word2vec from scratch
# vector_size=100 dimensions
model = gensim.models.Word2Vec(words,vector_size=100,epochs=50)

In [32]:
def avg_word2vec(doc):
    return np.mean([model.wv[word] for word in doc if word in model.wv.index_to_key],axis=0)

In [38]:
#apply for the entire sentences
X=[]
for i in range(len(words)):
    X.append(avg_word2vec(words[i]))

In [39]:
## this is the final independent features
df1=pd.DataFrame()
df_list = []
for i in range(0,len(X)):
    df_list.append(pd.DataFrame(X[i].reshape(1,-1)))

df1 = pd.concat(df_list, ignore_index=True)
## Independent Feature
X=df1.iloc[:,:-1]
## Dependent Feature
y=df['rating']
## Train Test Split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,stratify=y,random_state=1)

In [40]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train,y_train)

y_pred=classifier.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.8054166666666667


In [41]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.77      0.59      0.67       800
           1       0.82      0.91      0.86      1600

    accuracy                           0.81      2400
   macro avg       0.79      0.75      0.77      2400
weighted avg       0.80      0.81      0.80      2400

