# Natural Language Processing: Amazon Reviews

In this project we build models that predict whether an Amazon book review is positive or negative. We train four algorithms (linear SVM, decision tree, Naive Bayes, and logistic regression) and compare their accuracy on the original imbalanced dataset and on a dataset balanced with SMOTE.

In [1]:
import json

In [2]:
# Path to the reviews file (one JSON object per line)
file = "./NLP/Books_small_10000.json"

In [3]:
# Print the first raw line to inspect the record format
with open(file) as f:
    for line in f:
        print(line)
        break

{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}



In [4]:
# Parse the first line as JSON and print its review text and star rating
with open(file) as f:
    for line in f:
        review=json.loads(line)
        print(review["reviewText"])
        print(review["overall"])
        break

I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.
5.0


In [5]:
# Collect (review text, star rating) tuples for every review
reviews=[]
with open (file) as f:
    for line in f:
        review=json.loads(line)
        reviews.append((review["reviewText"],review["overall"]))
        
reviews[3]

('I really enjoyed this adventure and look forward to reading more of Robert Spire. I especially liked all the info on global warming. You did a good job on the research.',
 4.0)

In [6]:
# Classes that hold a review's text, star score, and derived sentiment label

class Sentiment:
    Negative = "Negative"
    Positive = "Positive"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # Ratings of 3 stars or fewer are labeled negative; 4 and 5 are positive
        if self.score <= 3:
            return Sentiment.Negative
        else:
            return Sentiment.Positive


In [7]:
# Rebuild the list as Review objects and inspect one of them
reviews=[]
with open (file) as f:
    for line in f:
        review=json.loads(line)
        reviews.append(Review(review["reviewText"],review["overall"]))
        
        
print("Review text:",reviews[5].text)
print("Review score:",reviews[5].score)
print("Review sentiment:",reviews[5].sentiment)

Review text: I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
Review score: 5.0
Review sentiment: Positive


In [8]:
# Number of reviews loaded
len(reviews)

10000
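
Before modeling, it can help to see how the star-rating rule splits the data into classes. The following is a minimal sketch, assuming a hypothetical `sample_scores` list in place of `[r.score for r in reviews]` and a hypothetical helper `score_to_sentiment` that mirrors `Review.get_sentiment`.

```python
from collections import Counter

def score_to_sentiment(score):
    # Same rule as Review.get_sentiment: 3 stars or fewer counts as negative
    return "Negative" if score <= 3 else "Positive"

# Hypothetical scores standing in for [r.score for r in reviews]
sample_scores = [5.0, 3.0, 4.0, 1.0, 5.0, 2.0, 4.0, 5.0]

# Count how many reviews fall into each sentiment class
distribution = Counter(score_to_sentiment(s) for s in sample_scores)
print(distribution)
```

Running the same count on the full list shows how skewed the labels are before any split is made.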

## Preparing data

In [9]:
#importing library
from sklearn.model_selection import train_test_split

In [10]:
# Features are the review texts; the target is the sentiment label
X=[x.text for x in reviews]
y=[x.sentiment for x in reviews]

In [11]:
X[1]

'I enjoyed this short book. But it was way way to short ....I can see how easily it would have been to add several chapters.'

In [12]:
y[1]

'Negative'

In [13]:
# Split the data into training (70%) and test (30%) sets
train_X,test_X,train_y,test_y=train_test_split(X,y,test_size=0.3,random_state=42)

In [14]:
train_X[2]

"Awakening. Honesty. Action. Separate from each other, and there's not much to go with. Together, however, and there's a moment there, of connection with HaShem, that's supposed to change it all. Or at least Idleman asserts. There are plenty of reviewers that have tackled this very question. Additionally, I'm not reviewing the book, but rather the audiobook. When reviewing an audiobook, the purpose of the review is to ascertain if the audio is one worth listening to. So, on my end, I will assume that you will have decided on the content material, but now you are at a point where you need to decide on traditional reading format or auditory listening format.Heyborne's narration is clear and easy to understand. He reads at a medium pace and pauses slightly in-between sentences to allow readers to keep up. A softer voice, Heyborne's is not one I would mind listening to in the background or while on a drive. Overall, the audiobook edition is a quality edition to consider, breaking out of th

In [15]:
train_y[2]

'Positive'

In [16]:
len(train_X)

7000

In [17]:
len(test_X)

3000
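
With an imbalanced target, a plain random split can give the training and test sets slightly different class ratios. The sketch below is an optional variation, not what the notebook does: it passes `stratify` to `train_test_split` so both sets keep the same ratio. The toy `texts` and `labels` are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 positive and 20 negative labels
texts = [f"review {i}" for i in range(100)]
labels = ["Positive"] * 80 + ["Negative"] * 20

# stratify=labels keeps the 80/20 class ratio in both splits
tr_X, te_X, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)
print(Counter(tr_y), Counter(te_y))
```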

In [18]:
# Convert the texts to bag-of-words count vectors
from sklearn.feature_extraction.text import CountVectorizer

vectorizer=CountVectorizer()

train_X_vector=vectorizer.fit_transform(train_X)
test_X_vector=vectorizer.transform(test_X)


print(train_X[4])
print(train_X_vector[4])

Super awesome and super sexy prequel.... I loved every bit of it and cannot wait to start into the series:)
  (0, 24685)	1
  (0, 1357)	2
  (0, 24340)	1
  (0, 13002)	1
  (0, 23051)	1
  (0, 17003)	1
  (0, 2926)	1
  (0, 21619)	1
  (0, 23628)	2
  (0, 2205)	1
  (0, 21701)	1
  (0, 18740)	1
  (0, 14656)	1
  (0, 8635)	1
  (0, 3961)	1
  (0, 26265)	1
  (0, 12824)	1


## Building Linear SVM model

In [19]:
from sklearn import svm

In [20]:
# Note: gamma only affects the rbf, poly and sigmoid kernels, so it has no effect here
sv=svm.SVC(kernel="linear",gamma="auto")
sv.fit(train_X_vector,train_y)

SVC(gamma='auto', kernel='linear')

In [21]:
prediction_sv=sv.predict(test_X_vector)

In [22]:
sv.score(test_X_vector,test_y)

0.8386666666666667

In [23]:
#Importing classification_report
from sklearn.metrics import classification_report

In [24]:
print(classification_report(test_y,prediction_sv))

              precision    recall  f1-score   support

    Negative       0.49      0.53      0.51       476
    Positive       0.91      0.90      0.90      2524

    accuracy                           0.84      3000
   macro avg       0.70      0.71      0.71      3000
weighted avg       0.84      0.84      0.84      3000
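
The `macro avg` and `weighted avg` rows differ only in how the per-class values are combined. The sketch below recomputes both for the F1 column from the rounded numbers in the report above, so the results can differ from the printed ones in the last digit.

```python
# Per-class F1 and support copied from the linear SVM report above
f1 = {"Negative": 0.51, "Positive": 0.90}
support = {"Negative": 476, "Positive": 2524}

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(round(macro_f1, 3), round(weighted_f1, 3))
```

Because the positive class is about five times larger, the weighted average stays close to the positive-class score, while the macro average exposes the weak negative class.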



## Building Decision Tree model

In [25]:
from sklearn.tree import DecisionTreeClassifier

In [26]:
dtc=DecisionTreeClassifier()
dtc.fit(train_X_vector,train_y)

DecisionTreeClassifier()

In [27]:
prediction_dtc=dtc.predict(test_X_vector)

In [28]:
dtc.score(test_X_vector,test_y)

0.7993333333333333

In [29]:
print(classification_report(test_y,prediction_dtc))

              precision    recall  f1-score   support

    Negative       0.34      0.29      0.32       476
    Positive       0.87      0.90      0.88      2524

    accuracy                           0.80      3000
   macro avg       0.61      0.59      0.60      3000
weighted avg       0.79      0.80      0.79      3000



## Building Naive Bayes model

In [30]:
from sklearn.naive_bayes import GaussianNB

In [31]:
gnb=GaussianNB()
# GaussianNB does not accept sparse input, so the matrix is converted to a dense array
gnb.fit(train_X_vector.toarray(),train_y)

GaussianNB()

In [32]:
predictions_gnb=gnb.predict(test_X_vector.toarray())

In [33]:
gnb.score(test_X_vector.toarray(),test_y)

0.6453333333333333

In [34]:
print(classification_report(test_y,predictions_gnb))

              precision    recall  f1-score   support

    Negative       0.19      0.38      0.25       476
    Positive       0.86      0.70      0.77      2524

    accuracy                           0.65      3000
   macro avg       0.52      0.54      0.51      3000
weighted avg       0.75      0.65      0.69      3000
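
`GaussianNB` assumes continuous, normally distributed features, which is a loose fit for word counts. A variant often used for count data is `MultinomialNB`, which also accepts the sparse matrix directly. The sketch below uses hypothetical toy reviews and is not part of the notebook's results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews, not taken from the dataset
toy_X = ["great book loved it", "wonderful story great characters",
         "boring book hated it", "terrible story boring characters"]
toy_y = ["Positive", "Positive", "Negative", "Negative"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_X)  # stays sparse, no toarray() needed

mnb = MultinomialNB()
mnb.fit(toy_counts, toy_y)

toy_pred = mnb.predict(toy_vec.transform(["great wonderful", "boring terrible"]))
print(toy_pred)
```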



## Building LogisticRegression model

In [35]:
from sklearn.linear_model import LogisticRegression

In [36]:
lr=LogisticRegression(max_iter=500)
lr.fit(train_X_vector,train_y)

LogisticRegression(max_iter=500)

In [37]:
lr.score(test_X_vector,test_y)

0.8703333333333333

In [38]:
prediction_lr=lr.predict(test_X_vector)

In [39]:
print(classification_report(test_y,prediction_lr))

              precision    recall  f1-score   support

    Negative       0.62      0.48      0.54       476
    Positive       0.91      0.94      0.92      2524

    accuracy                           0.87      3000
   macro avg       0.76      0.71      0.73      3000
weighted avg       0.86      0.87      0.86      3000
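
The four models above follow the same fit, predict, and score steps, so they can be evaluated in one loop. This is a minimal sketch on hypothetical toy data with two of the classifiers; the names `toy_train_X`, `models` and `results` are illustrative. It also records the F1 of the negative class, which is more informative than accuracy when one class dominates.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for train_X/test_X and train_y/test_y
toy_train_X = ["great book", "loved the story", "wonderful read",
               "boring book", "hated the story", "terrible read"]
toy_train_y = ["Positive"] * 3 + ["Negative"] * 3
toy_test_X = ["great story", "boring read"]
toy_test_y = ["Positive", "Negative"]

toy_vec = CountVectorizer()
toy_train_vec = toy_vec.fit_transform(toy_train_X)
toy_test_vec = toy_vec.transform(toy_test_X)

models = {
    "LogisticRegression": LogisticRegression(max_iter=500),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and collect accuracy plus negative-class F1
results = {}
for name, model in models.items():
    model.fit(toy_train_vec, toy_train_y)
    pred = model.predict(toy_test_vec)
    results[name] = {
        "accuracy": accuracy_score(toy_test_y, pred),
        "f1_negative": f1_score(toy_test_y, pred, pos_label="Negative",
                                zero_division=0),
    }
print(results)
```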



## Balancing dataset

In [40]:
from sklearn.feature_extraction.text import  TfidfVectorizer

In [41]:
# Re-vectorize the texts with TF-IDF weights (TfidfVectorizer) instead of raw counts
vectorizer = TfidfVectorizer()
train_X_vectors = vectorizer.fit_transform(train_X)

test_X_vectors = vectorizer.transform(test_X)


In [42]:
import numpy as np
values, counts = np.unique(train_y, return_counts=True)
print(values)
print(counts)

['Negative' 'Positive']
[1146 5854]


In [43]:
from imblearn.over_sampling import SMOTE

In [44]:
smote=SMOTE(sampling_strategy='minority')

In [45]:
train_X_smote,train_y_smote=smote.fit_resample(train_X_vectors,train_y)

In [46]:
counts = np.unique(train_y_smote, return_counts=True)
print(counts)

(array(['Negative', 'Positive'], dtype='<U8'), array([5854, 5854]))
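
SMOTE creates new minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step with NumPy, on two hypothetical points. It is a simplification for illustration, not the `imblearn` implementation, which also handles the neighbor search and sampling counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbor, rng):
    # Pick a random point on the line segment between x and its neighbor
    lam = rng.random()  # uniform in [0, 1)
    return x + lam * (neighbor - x)

# Hypothetical minority sample and one of its minority-class neighbors
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Applied to TF-IDF vectors, this means the synthetic negative reviews are blends of real negative reviews in feature space rather than new texts.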


## SVM model 

In [47]:
sv_s=svm.SVC(kernel="linear",gamma="auto")
sv_s.fit(train_X_smote,train_y_smote)

SVC(gamma='auto', kernel='linear')

In [48]:
predict_sv=sv_s.predict(test_X_vectors)

In [49]:
sv_s.score(test_X_vectors,test_y)

0.865

In [50]:
print(classification_report(test_y,predict_sv))

              precision    recall  f1-score   support

    Negative       0.57      0.61      0.59       476
    Positive       0.92      0.91      0.92      2524

    accuracy                           0.86      3000
   macro avg       0.75      0.76      0.75      3000
weighted avg       0.87      0.86      0.87      3000



## Decision Tree

In [51]:
dtc_s=DecisionTreeClassifier()
dtc_s.fit(train_X_smote,train_y_smote)

DecisionTreeClassifier()

In [52]:
predict_dtc=dtc_s.predict(test_X_vectors)

In [53]:
dtc_s.score(test_X_vectors,test_y)

0.7576666666666667

In [54]:
print(classification_report(test_y,predict_dtc))

              precision    recall  f1-score   support

    Negative       0.31      0.43      0.36       476
    Positive       0.88      0.82      0.85      2524

    accuracy                           0.76      3000
   macro avg       0.60      0.62      0.60      3000
weighted avg       0.79      0.76      0.77      3000



## Naive Bayes

In [55]:
gnb_s=GaussianNB()
gnb_s.fit(train_X_smote.toarray(),train_y_smote)

GaussianNB()

In [56]:
predict_gnb=gnb_s.predict(test_X_vectors.toarray())

In [57]:
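# Score on the TF-IDF test matrix (test_X_vectors), matching the features used for training.
# The 0.7203 recorded below came from scoring the CountVectorizer matrix (test_X_vector);
# the classification report in the next cell shows 0.65 accuracy on the TF-IDF features.
gnb_s.score(test_X_vectors.toarray(),test_y)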

0.7203333333333334

In [58]:
print(classification_report(test_y,predict_gnb))

              precision    recall  f1-score   support

    Negative       0.19      0.37      0.25       476
    Positive       0.86      0.70      0.77      2524

    accuracy                           0.65      3000
   macro avg       0.52      0.54      0.51      3000
weighted avg       0.75      0.65      0.69      3000



## Logistic Regression

In [59]:
lr_s=LogisticRegression(max_iter=500)
lr_s.fit(train_X_smote,train_y_smote)

LogisticRegression(max_iter=500)

In [60]:
lr_s.score(test_X_vectors,test_y)

0.8603333333333333

In [63]:
predict_lr=lr_s.predict(test_X_vectors)

In [64]:
print(classification_report(test_y,predict_lr))


              precision    recall  f1-score   support

    Negative       0.55      0.67      0.60       476
    Positive       0.93      0.90      0.92      2524

    accuracy                           0.86      3000
   macro avg       0.74      0.78      0.76      3000
weighted avg       0.87      0.86      0.87      3000



# Results

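| Model              | Imbalanced, CountVectorizer (accuracy) | SMOTE-balanced, TF-IDF (accuracy) |
|--------------------|----------------------------------------|-----------------------------------|
| SVM                | 83.87 %                                | 86.50 %                           |
| DecisionTree       | 79.93 %                                | 75.77 %                           |
| Naive Bayes        | 64.53 %                                | 72.03 % (see note)                |
| LogisticRegression | 87.03 %                                | 86.03 %                           |

Note: the 72.03 % figure for balanced Naive Bayes was scored on the CountVectorizer test matrix (`test_X_vector`) rather than the TF-IDF one; the classification report for that model shows an accuracy of 0.65.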
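
Accuracy alone hides how the minority class is handled. As a complementary view, the sketch below takes the negative-class F1 values from the classification reports above (imbalanced first, balanced second) and prints the change for each model.

```python
# Negative-class F1 copied from the classification reports above:
# (imbalanced, balanced)
f1_negative = {
    "SVM": (0.51, 0.59),
    "DecisionTree": (0.32, 0.36),
    "Naive Bayes": (0.25, 0.25),
    "LogisticRegression": (0.54, 0.60),
}

# Change in negative-class F1 after balancing
f1_change = {m: round(b - a, 2) for m, (a, b) in f1_negative.items()}

for m, d in f1_change.items():
    print(f"{m:<20}{d:+.2f}")
```

By this measure the SVM, decision tree, and logistic regression models all detect negative reviews somewhat better after balancing, even where overall accuracy dropped.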