# Sentiment Analysis

## About the Dataset:

Amazon Review Full Score Dataset

Version 3, Updated 09/09/2015

ORIGIN

The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. For more information, please refer to the following paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

The Amazon reviews full score dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the above dataset. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).


DESCRIPTION

The Amazon reviews full score dataset is constructed by randomly taking 600,000 training samples and 130,000 testing samples for each review score from 1 to 5. In total there are 3,000,000 training samples and 650,000 testing samples.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. Each file has 3 columns: class index (1 to 5), review title and review text. The review title and text are enclosed in double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
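
To make the escaping rules concrete, here is a minimal sketch that parses one made-up row in this format with Python's standard `csv` module. The sample row and the `parse_row` helper are illustrative assumptions, not part of the dataset tooling.

```python
import csv
import io

def parse_row(line):
    """Parse one CSV line into (rating, title, text) following the escaping rules above."""
    # csv.reader handles the quoted fields and turns "" back into a single ".
    rating, title, text = next(csv.reader(io.StringIO(line)))
    # New lines are stored as a literal backslash followed by "n"; convert them back.
    text = text.replace("\\n", "\n")
    return int(rating), title, text

# Made-up example row (not taken from the dataset).
sample = '"3","A ""fine"" gift","First line.\\nSecond line."'
rating, title, text = parse_row(sample)
print(rating)  # 3
print(title)   # A "fine" gift
print(text)    # the review text, now spanning two lines
```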


In [1]:
#warnings :)
import warnings
warnings.filterwarnings('ignore')



import pandas as pd
import re
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Embedding, Dropout, Dense, Flatten
from tensorflow.keras.models import Sequential



#the CSV files have no header row, so the column names are supplied explicitly
train=pd.read_csv(r'C:\Users\Malyaj Mishra\Desktop\DS_IT\Projects\NLP\amazon_review_full_csv\train.csv',
                  names=['Rating','Review_title','Review'])

test=pd.read_csv(r'C:\Users\Malyaj Mishra\Desktop\DS_IT\Projects\NLP\amazon_review_full_csv\test.csv',
                 names=['Rating','Review_title','Review'])

In [2]:
train.head()

Unnamed: 0,Rating,Review_title,Review
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...


In [3]:
test.head()

Unnamed: 0,Rating,Review_title,Review
0,1,mens ultrasheer,"This model may be ok for sedentary types, but ..."
1,4,Surprisingly delightful,This is a fast read filled with unexpected hum...
2,2,"Works, but not as advertised",I bought one of these chargers..the instructio...
3,2,Oh dear,I was excited to find a book ostensibly about ...
4,2,Incorrect disc!,"I am a big JVC fan, but I do not like this mod..."


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 3 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   Rating        int64 
 1   Review_title  object
 2   Review        object
dtypes: int64(1), object(2)
memory usage: 68.7+ MB


In [5]:
#determining null values in train dataset
train.isna().sum()

Rating           0
Review_title    76
Review           0
dtype: int64

In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 650000 entries, 0 to 649999
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   Rating        650000 non-null  int64 
 1   Review_title  649988 non-null  object
 2   Review        650000 non-null  object
dtypes: int64(1), object(2)
memory usage: 14.9+ MB


In [7]:
train.shape

(3000000, 3)

In [8]:
test.shape

(650000, 3)

### Taking a sample out of the data & performing some EDA:
As shown above, the training set alone has 3 million entries, and the test set has 650,000.
So we'll work with a sample of it to keep things fast.

The dataset was constructed by randomly taking 600,000 training samples and 130,000 testing samples for each review score from 1 to 5, so every rating has the same number of samples.

For now, we take n=6000 reviews per rating from the training data and n=1300 per rating from the test data (a 1/100th sample).

NOTE: We will ignore the 'Review_title' feature for now and use only 'Review' for sentiment analysis.

In [9]:
#taking sample from train & test dataset.

train_g=train.groupby("Rating").sample(n=6000, random_state=1)
test_g=test.groupby("Rating").sample(n=1300, random_state=1)

train_g=train_g.reset_index(drop=True)
test_g=test_g.reset_index(drop=True)

In [10]:
print(train_g.shape)
print(test_g.shape)

(30000, 3)
(6500, 3)


In [45]:
train_g.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Rating        30000 non-null  int64 
 1   Review_title  30000 non-null  object
 2   Review        30000 non-null  object
dtypes: int64(1), object(2)
memory usage: 703.2+ KB


In [46]:
test_g.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Rating        6500 non-null   int64 
 1   Review_title  6500 non-null   object
 2   Review        6500 non-null   object
dtypes: int64(1), object(2)
memory usage: 152.5+ KB


In [11]:
train_g.describe()

Unnamed: 0,Rating
count,30000.0
mean,3.0
std,1.414237
min,1.0
25%,2.0
50%,3.0
75%,4.0
max,5.0


In [12]:
train_g.head()

Unnamed: 0,Rating,Review_title,Review
0,1,"An amateur prank, if not a scam...",For all serious film buffs and film profession...
1,1,Excelent,"Good price, excelent optical zom and video, ve..."
2,1,Very Basic Overview,The Inductor Handbook was a real disappointmen...
3,1,Low price but...,"How about they tried to ship this to me four, ..."
4,1,HUH!,WHAT THE H##ll DID I buy?!?! this is the worst...


In [13]:
test_g.head()

Unnamed: 0,Rating,Review_title,Review
0,1,cheap item. not happy with.,bought for 99c. shipped for 5.00. seller will ...
1,1,yawn!,I would like to echo the thoughts of the other...
2,1,Schenker lost in hollywood,This is without a doubt the worst record of Sc...
3,1,Poor Quality in a NEW DVD,I paid for a NEW copy of the Da Vinci Code and...
4,1,Stay away from the director's cut,Cinema Paradiso is one of my favorite movies e...


In [43]:
train_g.Rating.unique()

array([1, 2, 3, 4, 5], dtype=int64)

In [44]:
test_g.Rating.unique()

array([1, 2, 3, 4, 5], dtype=int64)

In [47]:
train_g.Rating.value_counts()

1    6000
2    6000
3    6000
4    6000
5    6000
Name: Rating, dtype: int64

In [48]:
test_g.Rating.value_counts()

1    1300
2    1300
3    1300
4    1300
5    1300
Name: Rating, dtype: int64

### Observations:
- All ratings have an equal number of samples in both the train and test data ===> the dataset is balanced. Everything's fine!!

In [None]:
#to make plots to check for imbalanced data:

import matplotlib.pyplot as plt
%matplotlib inline
print('Percentage of reviews per rating\n')
#share of each rating in the sampled training data, in percent
rating_pct=round(train_g.Rating.value_counts(normalize=True)*100,2)
print(rating_pct)
rating_pct.plot(kind='bar')
plt.title('Percentage distribution by rating')
plt.show()

# Pre-Processing:

### Observations:
- Need to bring the text into the same form (e.g. all lower case)
- Remove punctuation
- Convert every review to a string, as some entries are integers.


In [14]:
#converting to lower case & removing punctuation:


train_g['Review']=train_g['Review'].transform(lambda value:re.sub(r'[^\w\s]','',value.lower()))
test_g['Review']=test_g['Review'].transform(lambda value:re.sub(r'[^\w\s]','',value.lower()))


#Here, re.sub('expression_to_replace','new_replacement','sentence') removes every character that is
#neither a word character (letters, digits, underscore) nor whitespace, using the expression r'[^\w\s]'.
#So punctuation is removed, while digits and underscores are kept.
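
As a quick check of what this pattern does, the sketch below applies the same substitution to a made-up string. Since `\w` matches letters, digits and the underscore, only punctuation and other symbols are stripped; the `clean` helper is a name chosen for illustration.

```python
import re

def clean(text):
    """Lower-case the text and drop every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())

# Made-up input for illustration.
print(clean("WHAT did I buy?!?! Paid 5.00 for a 99c_item."))
# -> what did i buy paid 500 for a 99c_item
```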

In [15]:
train_g.head()


Unnamed: 0,Rating,Review_title,Review
0,1,"An amateur prank, if not a scam...",for all serious film buffs and film profession...
1,1,Excelent,good price excelent optical zom and video very...
2,1,Very Basic Overview,the inductor handbook was a real disappointmen...
3,1,Low price but...,how about they tried to ship this to me four y...
4,1,HUH!,what the hll did i buy this is the worst mya c...


In [16]:
test_g.head()

Unnamed: 0,Rating,Review_title,Review
0,1,cheap item. not happy with.,bought for 99c shipped for 500 seller will ref...
1,1,yawn!,i would like to echo the thoughts of the other...
2,1,Schenker lost in hollywood,this is without a doubt the worst record of sc...
3,1,Poor Quality in a NEW DVD,i paid for a new copy of the da vinci code and...
4,1,Stay away from the director's cut,cinema paradiso is one of my favorite movies e...


### Normalization:
- Word normalization using stemming: reducing each word to its stem (e.g. "shipped" becomes "ship")

In [17]:
#Word Normalization:

from nltk.stem import PorterStemmer


ps=PorterStemmer()
train_g['Review']=train_g['Review'].transform(lambda value:' '.join( [ps.stem(word) for word in value.split(' ')]))
test_g['Review']=test_g['Review'].transform(lambda value:' '.join( [ps.stem(word) for word in value.split(' ')]))


#The lambda function splits the string on a space (' '), applies the Porter stemmer to each word using a list comprehension,
#and finally joins the words back into a sentence using ' '.join()
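
One detail of splitting on a single space is worth illustrating: consecutive spaces produce empty-string tokens, whereas `str.split()` with no argument collapses any run of whitespace. The sketch below uses a made-up sentence and an identity function in place of the stemmer so that it runs without NLTK; `normalize` and its parameters are hypothetical names.

```python
def normalize(text, stem=lambda word: word, sep=' '):
    """Split text and apply `stem` to each token, returning the list of tokens.

    `stem` stands in for ps.stem; the identity default is an assumption made
    so the sketch has no external dependencies.
    """
    tokens = text.split(sep) if sep is not None else text.split()
    return [stem(token) for token in tokens]

sentence = "great price  but slow shipping"   # note the double space

print(normalize(sentence))            # ['great', 'price', '', 'but', 'slow', 'shipping']
print(normalize(sentence, sep=None))  # ['great', 'price', 'but', 'slow', 'shipping']
```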


In [18]:
train_g.head()

Unnamed: 0,Rating,Review_title,Review
0,1,"An amateur prank, if not a scam...",for all seriou film buff and film profession o...
1,1,Excelent,good price excel optic zom and video veri easi...
2,1,Very Basic Overview,the inductor handbook wa a real disappoint the...
3,1,Low price but...,how about they tri to ship thi to me four ye f...
4,1,HUH!,what the hll did i buy thi is the worst mya cd...


In [19]:
test_g.head()

Unnamed: 0,Rating,Review_title,Review
0,1,cheap item. not happy with.,bought for 99c ship for 500 seller will refund...
1,1,yawn!,i would like to echo the thought of the other ...
2,1,Schenker lost in hollywood,thi is without a doubt the worst record of sch...
3,1,Poor Quality in a NEW DVD,i paid for a new copi of the da vinci code and...
4,1,Stay away from the director's cut,cinema paradiso is one of my favorit movi ever...


In [52]:
#Finding the max length (in tokens) of a review in the train set. This can be used later on to choose the padded sequence length (maxlength).

from nltk.tokenize import word_tokenize


r_len=[]
for text in train_g.Review:
    word=word_tokenize(str(text))
    length=len(word)
    r_len.append(length)
    
MAX_REVIEW_LEN=np.max(r_len)
MAX_REVIEW_LEN

407
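
The maximum is sensitive to a single very long review, so it can help to look at a few percentiles of the length distribution before fixing the padded length. The following is a minimal sketch on made-up lengths (the numbers are not from the dataset); `length_at_percentile` is a hypothetical helper.

```python
import math

def length_at_percentile(lengths, pct):
    """Return the smallest length that covers at least `pct` percent of the reviews (nearest-rank method)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up review lengths in tokens, for illustration only.
lengths = [12, 25, 31, 40, 48, 55, 63, 80, 120, 407]

print(max(lengths))                        # 407
print(length_at_percentile(lengths, 90))   # 120
print(length_at_percentile(lengths, 50))   # 48
```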

In [20]:
#some of our reviews have integers which Keras Tokenizer can't deal with, so we will convert everything to strings.

train_g['Review']=[str (item) for item in train_g['Review']]
test_g['Review']=[str (item) for item in test_g['Review']]

## Word Embedding:

In [21]:
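#word embedding using gensim's Word2Vec

from gensim.models import Word2Vec
#each review becomes a list of tokens; train and test reviews are combined to build the vocabulary
all_reviews=[sentence.split(' ') for sentence in train_g['Review']]+[sentence.split(' ') for sentence in test_g['Review']]
all_reviews=[x for x in all_reviews if str(x)!='nan']
#32-dimensional vectors, keeping every word (min_count=1), trained for 20 epochs
w2v=Word2Vec(all_reviews,vector_size=32,min_count=1,epochs=20)
vector=w2v.wv.vectors  #embedding matrix of shape (vocabulary size, 32)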


In [22]:
vector

array([[ 9.4312632e-01,  2.9378893e+00,  1.2585325e+00, ...,
         2.0502758e+00,  9.0726238e-01,  5.4350939e+00],
       [ 1.6552215e+00,  1.8092480e+00,  1.3605517e-01, ...,
        -3.4785914e-01, -3.9675288e-02,  2.4053090e+00],
       [ 3.1529710e+00,  2.2829568e+00, -9.6757567e-01, ...,
         2.6887906e+00,  3.1965210e+00,  2.4357827e+00],
       ...,
       [-8.9254584e-03,  9.1037102e-02, -4.3973498e-02, ...,
         5.2044421e-02, -1.0588109e-01, -8.8877436e-03],
       [-8.9719594e-02, -7.5723141e-02, -6.6624209e-02, ...,
        -8.7609492e-02,  5.4001058e-03,  7.8578684e-03],
       [-4.2727839e-02, -2.1533245e-01,  5.1889542e-02, ...,
         1.7253259e-02, -3.5328679e-02, -6.7412998e-03]], dtype=float32)

In [23]:
vector.shape

(75634, 32)

In [24]:
#Tokenization

maxlength=128 
#maxlength=MAX_REVIEW_LEN
#Ideally, we should decide this length after figuring out the max/average length of our reviews,
#which we have done above: the max length was 407. But we will use 128 for now to save space and time!!! :) 


from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence,text


t=Tokenizer(split=" ")
t.fit_on_texts(list(train_g['Review'])+list(test_g['Review']))

train_g['Review']=t.texts_to_sequences(train_g['Review'])
test_g['Review']=t.texts_to_sequences(test_g['Review'])

train_f=sequence.pad_sequences(train_g['Review'],maxlen=maxlength)
test_f=sequence.pad_sequences(test_g['Review'],maxlen=maxlength)


print(train_f.shape)
print(test_f.shape)

print(train_f)
print(test_f)

(30000, 128)
(6500, 128)
[[    7   742 11576 ...   379   294   732]
 [    0     0     0 ...     5   531   234]
 [    0     0     0 ...     2  1388   545]
 ...
 [    0     0     0 ...    76     5   462]
 [    0     0     0 ...    20    46  3339]
 [    0     0     0 ...   204     1   190]]
[[    0     0     0 ...     6    78   418]
 [    0     0     0 ... 20270    13   388]
 [    7 18988  1281 ...     1  1239  3207]
 ...
 [    0     0     0 ...    38     9   100]
 [    0     0     0 ... 75568    68   237]
 [    0     0     0 ...     4    46   433]]
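
The leading zeros in the arrays above come from the padding step. By default, `pad_sequences` pads at the front and, for sequences longer than `maxlen`, also truncates from the front, so only the last `maxlen` tokens of a long review are kept. The pure-Python sketch below mimics that default behaviour for illustration; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of seq and left-pad with `value` up to `maxlen`."""
    kept = seq[-maxlen:]                            # 'pre' truncation drops the beginning
    return [value] * (maxlen - len(kept)) + kept    # 'pre' padding adds zeros in front

print(pad_pre([5, 531, 234], maxlen=5))                   # [0, 0, 5, 531, 234]
print(pad_pre([7, 742, 11, 379, 294, 732], maxlen=5))     # [742, 11, 379, 294, 732]
```

If the beginning of a review should be preserved instead, `pad_sequences` also accepts `padding='post'` and `truncating='post'`.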


In [25]:
#creating training and validation dataset

from sklearn.model_selection import train_test_split as tts

#'Rating' takes the values (1,2,3,4,5), but the model needs 0,1,2,3,4 as label inputs, so we subtract 1.
#Later on, we will add 1 back to the predicted values of 'Rating'.
label=train_g['Rating']-1


xtrain,xvalid,ztrain,zvalid=tts(train_f,label,train_size=0.8, random_state = 1)

In [26]:
label

0        0
1        0
2        0
3        0
4        0
        ..
29995    4
29996    4
29997    4
29998    4
29999    4
Name: Rating, Length: 30000, dtype: int64

In [38]:
print(xtrain.shape)


(24000, 128)


In [27]:
xtrain

array([[    0,     0,     0, ...,    24,  4949, 34305],
       [   10,     1,   279, ...,   622,    15,   148],
       [    0,     0,     0, ...,    11,  1240,  1081],
       ...,
       [    0,     0,     0, ...,    45,   540,   708],
       [    0,     0,     0, ...,     1,   737,  3019],
       [    0,     0,     0, ...,     5,  1019,    41]])

In [28]:
xvalid

array([[   0,    0,    0, ...,  146,    7, 1759],
       [   0,    0,    0, ...,  910,   23,   73],
       [   0,    0,    0, ...,  535,    2, 3620],
       ...,
       [   0,    0,    0, ...,   51,    3,  193],
       [   0,    0,    0, ...,  828,    7,   45],
       [   0,    0,    0, ..., 1987,    1,  517]])

In [29]:
ztrain

5851     0
22473    3
8078     1
14589    2
15383    2
        ..
2380     0
13011    2
3900     0
5726     0
261      0
Name: Rating, Length: 24000, dtype: int64

In [30]:
zvalid

14634    2
24078    4
27171    4
5865     0
20099    3
        ..
28481    4
6960     1
11914    1
14171    2
16384    2
Name: Rating, Length: 6000, dtype: int64

In [31]:
print(type(xtrain))
print(type(ztrain))
print(type(np.array(ztrain)))

print(type(xvalid))
print(type(zvalid))
print(type(np.array(zvalid)))

<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>


Since xtrain and xvalid are NumPy arrays while ztrain and zvalid are pandas Series, we convert the labels to arrays and keep their shapes aligned with the features when we feed them into our model for training.

## Building our model:

In [32]:
#Neural network

model = Sequential()
model.add(Embedding(vector.shape[0], vector.shape[1], weights=[vector], input_length=maxlength)) #maxlength=128
model.add(LSTM(64,dropout=0.2,recurrent_dropout=0.2,return_sequences=True))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(5, activation='softmax'))  #using softmax since it is a multiclass classification problem.

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 128, 32)           2420288   
                                                                 
 lstm (LSTM)                 (None, 128, 64)           24832     
                                                                 
 flatten (Flatten)           (None, 8192)              0         
                                                                 
 dense (Dense)               (None, 256)               2097408   
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 64)                16448     
                                                                 
 dropout_1 (Dropout)         (None, 64)                0
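
One thing worth checking with this setup is that the rows of the pretrained matrix line up with the token ids. Row `i` of `w2v.wv.vectors` belongs to gensim's `i`-th vocabulary word, while the Keras `Tokenizer` assigns its own ids starting at 1 and the padding uses id 0. A common remedy is to build the matrix from the tokenizer's `word_index`, looking up each word's vector by name. The sketch below shows the idea with plain Python containers and made-up vectors; `build_embedding_matrix` is a hypothetical helper, and `word_index` and `word_vectors` are stand-ins for `t.word_index` and a word-to-vector lookup such as `w2v.wv`.

```python
def build_embedding_matrix(word_index, word_vectors, dim):
    """Return a list of rows where row i holds the vector of the word whose token id is i.

    Row 0 is left as zeros for the padding id; words without a vector also stay zero.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in word_vectors:
            matrix[idx] = list(word_vectors[word])
    return matrix

# Made-up stand-ins: tokenizer ids start at 1, vectors are keyed by word.
word_index = {"good": 1, "bad": 2, "movi": 3}
word_vectors = {"bad": [0.1, -0.2], "good": [0.3, 0.4]}

matrix = build_embedding_matrix(word_index, word_vectors, dim=2)
print(matrix)  # [[0.0, 0.0], [0.3, 0.4], [0.1, -0.2], [0.0, 0.0]]
```

With the real objects, the matrix (converted with `np.array`) would have `len(t.word_index) + 1` rows, and that row count would then be the first argument of `Embedding`.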

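In [None]:
#scratch: an attempt to pass reshaped inputs to model.fit, i.e. xtrain.reshape(-1,256) with np.array(ztrain).reshape(-1,1)
#and validation_data=(xvalid.reshape(-1,256), np.array(zvalid).reshape(-1,1)).
#As the shapes below show, reshaping to width 256 halves the number of rows, so features and labels no longer line up.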

In [34]:
xtrain.reshape(-1,256).shape

(12000, 256)

In [35]:
np.array(ztrain).reshape(-1,1).shape

(24000, 1)

In [36]:
xvalid.reshape(-1,256).shape

(3000, 256)

In [41]:
#checking shape before feeding them into our neural network model
print(xtrain.shape)
print(np.array(ztrain).reshape(-1,1).shape)

print(xvalid.shape)
print(np.array(zvalid).reshape(-1,1).shape)

(24000, 128)
(24000, 1)
(6000, 128)
(6000, 1)


# Training the model:

In [42]:
%%time
#training the model (the %%time cell magic has to be the first line of the cell):

checkpointer = tf.keras.callbacks.ModelCheckpoint(filepath='weights.best.hdf5', verbose = 1, save_best_only = True)

#Alternative input pipeline using tf.data (not used here). from_tensor_slices takes a (features, labels) tuple:
#train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, np.array(ztrain))).batch(64)
#model.fit(train_dataset, epochs=5, verbose=1)

#Reshaping xtrain to (-1,256) would break the row alignment with the labels, so we keep the (24000, 128) shape.

model.fit(xtrain,np.array(ztrain).reshape(-1,1),
          validation_data=(xvalid,np.array(zvalid).reshape(-1,1)),
          epochs=10,batch_size=64,verbose=1,callbacks=[checkpointer])



Epoch 1/10
Epoch 1: val_loss improved from inf to 1.48110, saving model to weights.best.hdf5
Epoch 2/10
Epoch 2: val_loss improved from 1.48110 to 1.36944, saving model to weights.best.hdf5
Epoch 3/10
Epoch 3: val_loss improved from 1.36944 to 1.36421, saving model to weights.best.hdf5
Epoch 4/10
Epoch 4: val_loss did not improve from 1.36421
Epoch 5/10
Epoch 5: val_loss did not improve from 1.36421
Epoch 6/10
Epoch 6: val_loss did not improve from 1.36421
Epoch 7/10
Epoch 7: val_loss did not improve from 1.36421
Epoch 8/10
Epoch 8: val_loss did not improve from 1.36421
Epoch 9/10
Epoch 9: val_loss did not improve from 1.36421
Epoch 10/10
Epoch 10: val_loss did not improve from 1.36421


<keras.callbacks.History at 0x1d42e3c7850>

## Prediction:

In [80]:
#z_pred=model.predict(test_f)
z_pred=np.argmax(model.predict(test_f), axis=-1)
z_pred=z_pred+1

print("Prediction array shape:",z_pred.shape)
z_pred

Prediction array shape: (6500,)


array([1, 2, 3, ..., 5, 5, 3], dtype=int64)

In [81]:
#calculating metrics:


from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score

z_test=test_g['Rating']


print("Confusion matrix:\n", confusion_matrix(z_test,z_pred))
print("Accuracy : ", accuracy_score(z_test,z_pred))
print("Precision : ", precision_score(z_test, z_pred, average = 'weighted'))
print("Recall : ", recall_score(z_test, z_pred, average = 'weighted'))

Confusion matrix:
 [[590 340 180  90 100]
 [334 396 292 156 122]
 [183 328 324 255 210]
 [132 181 250 338 399]
 [102 118 136 272 672]]
Accuracy :  0.3569230769230769
Precision :  0.35119072827567016
Recall :  0.3569230769230769
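
The figures above can be recomputed directly from the confusion matrix, which also makes the per-rating picture easier to read. The sketch below derives the accuracy and the recall for each rating and, as an extra measure that the notebook itself does not compute, the share of predictions that land within one star of the true rating.

```python
# Confusion matrix reported above: rows are true ratings 1-5, columns are predicted ratings 1-5.
cm = [
    [590, 340, 180,  90, 100],
    [334, 396, 292, 156, 122],
    [183, 328, 324, 255, 210],
    [132, 181, 250, 338, 399],
    [102, 118, 136, 272, 672],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(5))
accuracy = correct / total

# Recall per rating: correct predictions divided by the number of true samples of that rating.
recall = [cm[i][i] / sum(cm[i]) for i in range(5)]

# Extra metric (not in the notebook): predictions at most one star away from the truth.
within_one = sum(cm[i][j] for i in range(5) for j in range(5) if abs(i - j) <= 1) / total

print(round(accuracy, 4))                 # 0.3569
print([round(r, 3) for r in recall])      # [0.454, 0.305, 0.249, 0.26, 0.517]
print(round(within_one, 4))               # 0.7369
```

The support-weighted recall is the same quantity as the accuracy, which is why those two numbers are identical above. Ratings 1 and 5 are recognised best, and most of the errors fall on a neighbouring rating.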


In [74]:
#trying on new random review
'''
example = ["I'm not happy"]

result = model.predict(example)


print(result)'''



ValueError: in user code:

    File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 2041, in predict_function  *
        return step_function(self, iterator)
    File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 2027, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 2015, in run_step  **
        outputs = model.predict_step(data)
    File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 1983, in predict_step
        return self(x, training=False)
    File "C:\Anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "C:\Anaconda3\lib\site-packages\keras\engine\input_spec.py", line 232, in assert_input_compatibility
        raise ValueError(

    ValueError: Exception encountered when calling layer "sequential" "                 f"(type Sequential).
    
    Input 0 of layer "lstm" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 32)
    
    Call arguments received by layer "sequential" "                 f"(type Sequential):
      • inputs=tf.Tensor(shape=(None,), dtype=string)
      • training=False
      • mask=None
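
The error above arises because the raw string is passed straight to the model, which was trained on padded sequences of integer token ids. A new review needs to go through the same steps as the training data first: lower-casing, punctuation removal, stemming, tokenization and padding. The sketch below illustrates those steps in plain Python with a made-up vocabulary; `encode_review`, `word_index` and the identity `stem` default are assumptions for illustration. In the notebook, the fitted tokenizer `t`, `ps.stem` and `sequence.pad_sequences` would play these roles.

```python
import re

def encode_review(text, word_index, maxlen, stem=lambda word: word):
    """Turn a raw review into a fixed-length list of token ids (0 = padding)."""
    cleaned = re.sub(r'[^\w\s]', '', text.lower())             # same cleaning as the training data
    tokens = [stem(word) for word in cleaned.split()]          # stemming step (identity by default)
    ids = [word_index[w] for w in tokens if w in word_index]   # unknown words are skipped
    ids = ids[-maxlen:]                                        # keep the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids                     # pad at the front

# Made-up vocabulary for illustration.
word_index = {"im": 1, "not": 2, "happy": 3}

encoded = encode_review("I'm not happy", word_index, maxlen=5)
print(encoded)  # [0, 0, 1, 2, 3]

# With the trained objects, the padded ids would then be passed as a 2-D integer array,
# e.g. model.predict(np.array([encoded])), and 1 added to the argmax to get the rating.
```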


In [None]:
#word cloud:
#from wordcloud import WordCloud
