## Plan of Action


1. We use the **Amazon Alexa Reviews dataset (3150 reviews)**, which contains **customer reviews, a rating out of 5**, the review date, and the Alexa variant.
2. First, we **generate sentiment labels (positive/negative)** by marking *reviews with a rating > 3 as positive and all remaining reviews as negative*.
3. Then, we **clean the review text and vectorize it with TF-IDF**, a popular feature-engineering technique for text.
4. After that, we **fit a Support Vector Classifier** and check model performance (*about 91% accuracy on the held-out test set*).
5. Last, we use the model to **predict the sentiment of new Amazon Alexa reviews**, first in a simple way and then in a fancy, interactive way.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import dataset

df1 = pd.read_csv('AmazonAlexa_Reviews_Dataset.tsv',delimiter='\t')
df1

|  | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |
| ... | ... | ... | ... | ... | ... |
| 3145 | 5 | 30-Jul-18 | Black Dot | Perfect for kids, adults and everyone in betwe... | 1 |
| 3146 | 5 | 30-Jul-18 | Black Dot | Listening to music, searching locations, check... | 1 |
| 3147 | 5 | 30-Jul-18 | Black Dot | I do love these things, i have them running my... | 1 |
| 3148 | 5 | 30-Jul-18 | White Dot | Only complaint I have is that the sound qualit... | 1 |


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


# Data pre-processing

In [4]:
df = df1[['verified_reviews','rating']]
df.columns = ['Reviews', 'Rating']
df.head()

|  | Reviews | Rating |
|---|---|---|
| 0 | Love my Echo! | 5 |
| 1 | Loved it! | 5 |
| 2 | Sometimes while playing a game, you can answer... | 4 |
| 3 | I have had a lot of fun with this thing. My 4 ... | 5 |
| 4 | Music | 5 |


In [5]:
# create sentiment labels from the overall ratings

def compute_sentiments(labels):
    """Map each rating to 1 (positive, rating > 3) or 0 (negative, rating <= 3)."""
    sentiments = []
    for label in labels:
        if label > 3.0:
            sentiment = 1
        else:
            sentiment = 0
        sentiments.append(sentiment)
    return sentiments
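
The same labels can also be produced without an explicit loop. Below is a minimal sketch, assuming a small stand-in column; the `ratings` Series is illustrative and not the real dataset.

```python
import pandas as pd

# Illustrative stand-in for the Rating column (not the real dataset)
ratings = pd.Series([5, 4, 3, 2, 1])

# Comparing the whole column at once gives booleans; astype(int) turns them into 1/0
sentiments = (ratings > 3).astype(int)
print(sentiments.tolist())  # [1, 1, 0, 0, 0]
```

On a column of 3150 ratings either approach is fast; the vectorized form is mainly shorter and harder to get wrong.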

In [6]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning that
# direct column assignment raises on a slice of df1
df = df.assign(Rating=compute_sentiments(df.Rating))


In [7]:
df.head()

|  | Reviews | Rating |
|---|---|---|
| 0 | Love my Echo! | 1 |
| 1 | Loved it! | 1 |
| 2 | Sometimes while playing a game, you can answer... | 1 |
| 3 | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | Music | 1 |


In [8]:
# check the distribution of Rating
df['Rating'].value_counts()

1    2741
0     409
Name: Rating, dtype: int64
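
These counts show a strong class imbalance, which matters when we read the accuracy score later. As a rough sketch, we can compute the accuracy of a trivial model that always predicts the majority class; the counts are copied from the output above and the variable names are ours.

```python
# Class counts copied from the value_counts() output above
counts = {1: 2741, 0: 409}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = counts[majority_label] / total
print(f"Always predicting {majority_label} scores {baseline_accuracy:.3f} accuracy")
```

Any classifier trained on this data should be judged against that baseline of roughly 87%, not against 50%.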

In [9]:
# check for null values

df.isnull().sum()

Reviews    0
Rating     0
dtype: int64

# Data transformation

In [10]:
x = df['Reviews']
y = df['Rating']

In [13]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [14]:
# import tokenizer
from tokenizer_input import CustomTokenizerExample

In [15]:
import spacy
nlp = spacy.load('en_core_web_sm')

import string
punct = string.punctuation
# punct

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS) # list of stopwords

class CustomTokenizerExample():
    def __init__(self):
        pass

    def text_data_cleaning(self,sentence):
        doc = nlp(sentence)  # spaCy tokenizes the text and runs its pipeline components in order

        tokens = [] # list of tokens
        for token in doc:
            if token.lemma_ != "-PRON-":
                temp = token.lemma_.lower().strip()
            else:
                temp = token.lower_
            tokens.append(temp)

        cleaned_tokens = []
        for token in tokens:
            if token not in stopwords and token not in punct:
                cleaned_tokens.append(token)
        return cleaned_tokens

In [None]:
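# if the lemma of a token is not "-PRON-" (the placeholder spaCy v2 returns for every pronoun),
# we take the lower-cased lemma;
# if the lemma is "-PRON-", we take the lower-cased token text instead,
# because the placeholder says nothing about the word itself
# (spaCy v3, which is installed above, no longer returns "-PRON-", so the lemma is always used)

# stopwords and punctuation are then removed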

In [16]:
# testing
token = CustomTokenizerExample()
token.text_data_cleaning("Those were the best days of my life!")

['good', 'day', 'life']

# Feature Engineering

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
tfidf = TfidfVectorizer(tokenizer=token.text_data_cleaning)
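
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` default (smoothed idf, then L2 normalization). The toy reviews and the helper `tfidf_vector` are hypothetical; the real vectorizer additionally builds a vocabulary and returns a sparse matrix.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy reviews (not from the dataset)
docs = [["love", "echo"], ["love", "music"], ["bad", "sound"]]

n_docs = len(docs)
# Document frequency: in how many reviews each token appears
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf_vector(doc):
    """TF-IDF weights for one tokenized review, L2-normalized."""
    term_freq = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        token: tf * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
        for token, tf in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {token: w / norm for token, w in weights.items()}

vector = tfidf_vector(docs[0])
print(vector)
```

In the first toy review, "echo" appears in fewer documents than "love", so it receives the larger weight. That is the intuition behind TF-IDF: tokens that are rare across reviews are treated as more informative.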

# Training of the model

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(x, y , test_size=0.2, stratify=df.Rating, random_state=0)

In [21]:
X_train.shape, X_test.shape

((2520,), (630,))
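
Because the classes are imbalanced, the `stratify` argument is doing real work here. The sketch below uses synthetic labels with the same class counts as our data (an assumption made only for illustration; the `_demo` names are ours) to check that the test split keeps roughly the same share of negative reviews as the full dataset.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (2741 positive, 409 negative)
labels = [1] * 2741 + [0] * 409
indices = list(range(len(labels)))

train_idx, test_idx, y_train_demo, y_test_demo = train_test_split(
    indices, labels, test_size=0.2, stratify=labels, random_state=0
)

test_counts = Counter(y_test_demo)
negative_share_all = 409 / len(labels)
negative_share_test = test_counts[0] / len(y_test_demo)
print(test_counts, round(negative_share_all, 3), round(negative_share_test, 3))
```

Without stratification, a random 20% sample could end up with noticeably more or fewer negative reviews, which would make the per-class metrics less reliable.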

In [22]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

In [23]:
classifier = LinearSVC()

In [24]:
# the pipeline first vectorizes the text with TF-IDF, then passes the vectors to the classifier
pipeline = Pipeline([('tfidf', tfidf),('clf',classifier)])

In [25]:
pipeline.fit(X_train, y_train)

# Checking Model Performance

In [26]:
y_pred_svc = pipeline.predict(X_test)

In [27]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [33]:
# confusion_matrix
cnf_svc = confusion_matrix(y_test, y_pred_svc)
print(cnf_svc)

# accuracy score 
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print(accuracy_svc)

# classification_report
clf_report_svc = classification_report(y_test, y_pred_svc)
print(clf_report_svc)

[[ 37  45]
 [  9 539]]
0.9142857142857143
              precision    recall  f1-score   support

           0       0.80      0.45      0.58        82
           1       0.92      0.98      0.95       548

    accuracy                           0.91       630
   macro avg       0.86      0.72      0.77       630
weighted avg       0.91      0.91      0.90       630
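
The numbers in the report can be derived by hand from the confusion matrix, which helps when interpreting them. The sketch below recomputes them in plain Python; the matrix is copied from the output above, and the helper `class_metrics` is our own name.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = [[37, 45],
      [9, 539]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def class_metrics(cm, k):
    """Precision, recall and F1 for class k of a 2x2 confusion matrix."""
    true_pos = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as k
    actual_k = sum(cm[k])               # row sum: everything that really is k
    precision = true_pos / predicted_k
    recall = true_pos / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Note that 45 of the 82 negative reviews in the test set are predicted as positive, so recall for the negative class is only 0.45 even though overall accuracy is above 91%. The accuracy figure is dominated by the much larger positive class.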



# Model Serialization

In [34]:
import joblib
joblib.dump(pipeline,'SentimentAnalysis_Model_Pipeline.pkl')

['SentimentAnalysis_Model_Pipeline.pkl']
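
A saved pipeline is only useful if it can be loaded back and gives the same predictions. Here is a minimal round-trip sketch with `joblib.dump` and `joblib.load`, assuming a toy pipeline with scikit-learn's default tokenizer; the texts, labels and file name are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy training data (not from the dataset)
texts = ["love it", "works great", "stopped working", "terrible sound"]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
toy_pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)   # serialize the fitted pipeline
    restored = joblib.load(path)      # deserialize it again

samples = ["love the sound", "it stopped working"]
original_pred = toy_pipeline.predict(samples).tolist()
restored_pred = restored.predict(samples).tolist()
print(original_pred == restored_pred)
```

For our own pipeline the equivalent call is `joblib.load('SentimentAnalysis_Model_Pipeline.pkl')`. One caveat: pickling stores a reference to the custom tokenizer's class rather than its code, so that class must be importable wherever the model is loaded. Keeping it in a separate module, such as the `tokenizer_input` module imported earlier, is one way to arrange that.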

# Predict Sentiments using Model

## Simple way

In [35]:
prediction = pipeline.predict(["Alexa is good"])

# predict() returns an array with one label per input; take the first (and only) one
if prediction[0] == 1:
    print("Result: This review is positive")
else:
    print("Result: This review is negative")

Result: This review is positive


## Fancy way

In [36]:
new_review = []
pred_sentiment = []

while True:
    # ask for a new Amazon Alexa review (typing 'skip' ends the loop)
    review = input("Please type an Alexa review - ")
    if review == 'skip':
        print("See you soon!")
        break

    # predict() returns an array with one label per input review
    prediction = pipeline.predict([review])
    if prediction[0] == 1:
        result = 'Positive'
        print("Result: This review is positive\n")
    else:
        result = 'Negative'
        print("Result: This review is negative\n")

    # record the review and its predicted sentiment for the summary table
    new_review.append(review)
    pred_sentiment.append(result)

Please type an Alexa review - alexa is good
Result: This review is positive

Please type an Alexa review - alexa is negative
Result: This review is positive

Please type an Alexa review - alexa is bad
Result: This review is negative

Please type an Alexa review - skip
See you soon!


In [37]:
Results_Summary = pd.DataFrame(
    {'New Review': new_review,
     'Sentiment': pred_sentiment,
    })

Results_Summary.to_csv("Predicted_Sentiments.tsv", sep='\t', encoding='UTF-8', index=False)
Results_Summary

|  | New Review | Sentiment |
|---|---|---|
| 0 | alexa is good | Positive |
| 1 | alexa is negative | Positive |
| 2 | alexa is bad | Negative |


# Model Deployment

In [39]:
joblib.__version__

'1.1.0'