## Data Loading

Import the required libraries: pandas and numpy for data handling, and matplotlib and seaborn for plotting.

In [1]:
# data handling libraries
import pandas as pd
import numpy as np
# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# warnings module
import warnings
# suppress warnings to keep the output readable
warnings.filterwarnings('ignore')

Read the data into a pandas DataFrame named `df`.

In [2]:
# reading the dataset from csv file
df = pd.read_csv('Musical_instruments_reviews.csv')

In [3]:
# preview the first three rows of the dataset
df.head(3)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A2IBPI20UZIR0U | 1384719342 | cassandra tu "Yeah, well, that's just like, u... | [0, 0] | Not much to write about here, but it does exac... | 5.0 | good | 1393545600 | 02 28, 2014 |
| 1 | A14VAT5EAX3D9S | 1384719342 | Jake | [13, 14] | The product does exactly as it should and is q... | 5.0 | Jake | 1363392000 | 03 16, 2013 |
| 2 | A195EZSQDW3E21 | 1384719342 | Rick Bennette "Rick Bennette" | [1, 1] | The primary job of this device is to block the... | 5.0 | It Does The Job Well | 1377648000 | 08 28, 2013 |


In [4]:
# printing the size of the dataset
df.shape

(10261, 9)

In [5]:
# count the unique reviewers
df['reviewerID'].nunique()

1429

In [6]:
# count the unique products
df['asin'].nunique()

900

**About Data:**

*   reviewerID - ID of the reviewer
*   asin - ID of the product
*   reviewerName - name of the reviewer
*   helpful - helpfulness rating of the review, e.g. 2/3
*   reviewText - text of the review
*   overall - rating of the product
*   summary - summary of the review
*   unixReviewTime - time of the review (unix)
*   reviewTime - time of the review
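In the preview above, `helpful` shows up as a bracketed pair such as `[13, 14]`. A minimal sketch of turning such a value into a ratio follows, assuming the pair means "helpful votes, total votes" (consistent with the 2/3 example, but still an assumption) and that the column is stored as text after being read from CSV. The helper name `helpful_ratio` is hypothetical.

```python
import ast

def helpful_ratio(raw):
    """Parse a string like "[13, 14]" and return helpful votes / total votes.

    Returns None when there are no votes, so 0/0 is not reported as 0.0.
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return None
    return helpful_votes / total_votes

print(helpful_ratio("[13, 14]"))
print(helpful_ratio("[0, 0]"))
```

The column is dropped later in this notebook, so this is only relevant if helpfulness is to be used as a feature or a filter.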

**Insights:**

*   The data contains many columns, and none of them is clearly marked as the model input or as the output.
*   reviewTime, unixReviewTime and asin do not appear to be useful for this task.

## Data Exploration and Cleaning

In [7]:
# Count the number of null values in each column
df.isna().sum()

reviewerID         0
asin               0
reviewerName      27
helpful            0
reviewText         7
overall            0
summary            0
unixReviewTime     0
reviewTime         0
dtype: int64

There are 27 null values in **reviewerName** and 7 null values in **reviewText**. The **reviewerName** column is not useful for the model, so it is dropped. Because only a few rows lack reviewText, those rows are dropped as well.

In [8]:
# drop the specified columns
df.drop('reviewerName', axis=1, inplace=True)
# dropping of nan values
df.dropna(inplace=True)

It is useful to combine reviewText and summary into a single column.

In [9]:
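# combine the two text columns into one
# note: no separator is inserted, so the last word of reviewText is joined directly to the first word of summary
df['reviews'] = df['reviewText']+df['summary']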
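A minimal sketch of the same combination with a space between the two parts, so that the last word of the review text and the first word of the summary stay separate tokens. The toy frame and its values are made up for illustration. This is an alternative, not what the cell above runs.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "reviewText": ["Works great.", "Too noisy"],
    "summary": ["Five stars", "Disappointed"],
})

# str.cat with sep=" " concatenates element-wise with a space in between
toy["reviews"] = toy["reviewText"].str.cat(toy["summary"], sep=" ")
print(toy["reviews"].tolist())
```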

In [10]:
# dropping the reviewText and summary column
df.drop(['reviewText','summary'], axis=1, inplace=True)

In [11]:
# Reviewing the dataset
df.head(3)

| | reviewerID | asin | helpful | overall | unixReviewTime | reviewTime | reviews |
|---|---|---|---|---|---|---|---|
| 0 | A2IBPI20UZIR0U | 1384719342 | [0, 0] | 5.0 | 1393545600 | 02 28, 2014 | Not much to write about here, but it does exac... |
| 1 | A14VAT5EAX3D9S | 1384719342 | [13, 14] | 5.0 | 1363392000 | 03 16, 2013 | The product does exactly as it should and is q... |
| 2 | A195EZSQDW3E21 | 1384719342 | [1, 1] | 5.0 | 1377648000 | 08 28, 2013 | The primary job of this device is to block the... |


In [12]:
# Count the number of reviews for each rating
df['overall'].value_counts()

5.0    6932
4.0    2083
3.0     772
2.0     250
1.0     217
Name: overall, dtype: int64

I will use the overall column to create the target label of the dataset.
*   Having overall **5 or 4** - labeled as positive
*   Having overall equal to **3** - labeled as neutral
*   Having overall **2 or 1** - labeled as negative.



In [13]:
# function that derives the target label from the overall rating

def function(row):
    # Neutral if the rating is 3
    if row['overall'] == 3.0:
        val = 'Neutral'
    # Negative if the rating is 1 or 2
    elif row['overall'] == 1.0 or row['overall'] == 2.0:
        val = 'Negative'
    # Positive if the rating is 4 or 5
    elif row['overall'] == 4.0 or row['overall'] == 5.0:
        val = 'Positive'
    # otherwise -1
    else:
        val = -1
    return val

In [14]:
# apply the function to every row and store the result in a new column named sentiment
df['sentiment'] = df.apply(function, axis=1)
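The same labelling rule can also be written as a lookup table applied with `Series.map`, which avoids a row-by-row `apply`. This is a sketch on a made-up series, and the name `RATING_TO_SENTIMENT` is hypothetical. One difference from the function above is that a rating missing from the table becomes `NaN` rather than `-1`.

```python
import pandas as pd

# Hypothetical lookup table mirroring the labelling rules above
RATING_TO_SENTIMENT = {
    1.0: "Negative", 2.0: "Negative",
    3.0: "Neutral",
    4.0: "Positive", 5.0: "Positive",
}

# Made-up ratings standing in for df['overall']
ratings = pd.Series([5.0, 3.0, 1.0, 4.0, 2.0], name="overall")
sentiment = ratings.map(RATING_TO_SENTIMENT)
print(sentiment.tolist())
```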

In [15]:
# reviewing a dataset
df.head(3)

| | reviewerID | asin | helpful | overall | unixReviewTime | reviewTime | reviews | sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | A2IBPI20UZIR0U | 1384719342 | [0, 0] | 5.0 | 1393545600 | 02 28, 2014 | Not much to write about here, but it does exac... | Positive |
| 1 | A14VAT5EAX3D9S | 1384719342 | [13, 14] | 5.0 | 1363392000 | 03 16, 2013 | The product does exactly as it should and is q... | Positive |
| 2 | A195EZSQDW3E21 | 1384719342 | [1, 1] | 5.0 | 1377648000 | 08 28, 2013 | The primary job of this device is to block the... | Positive |


In [16]:
# Counting the number of each sentiment in the dataset
df['sentiment'].value_counts()

Positive    9015
Neutral      772
Negative     467
Name: sentiment, dtype: int64

Now drop the columns that are no longer needed: **asin, helpful, overall, unixReviewTime, reviewTime**.

In [17]:
# deleting the columns
df.drop(['asin', 'helpful', 'unixReviewTime', 'reviewTime', 'overall'], axis = 1, inplace = True)

In [18]:
# reviewing dataset
df.head(3)

| | reviewerID | reviews | sentiment |
|---|---|---|---|
| 0 | A2IBPI20UZIR0U | Not much to write about here, but it does exac... | Positive |
| 1 | A14VAT5EAX3D9S | The product does exactly as it should and is q... | Positive |
| 2 | A195EZSQDW3E21 | The primary job of this device is to block the... | Positive |


#### Cleaning reviews
<br> Now we have only 3 columns for the model - **reviewerID, reviews, sentiment**.
<br> Before building the model, we have to clean the reviews in the following ways -
1. Convert all the text to lowercase.
2. Remove text enclosed in square brackets.
3. Remove links.
4. Remove HTML tags.
5. Remove punctuation, including * and ?.
6. Remove newline characters.
7. Remove tokens that contain digits, such as dates and numbers.

In [19]:
import re # regular expression library
import string # provides string.punctuation

# function to clean a single review
def review_cleaning(text):
    # convert to lowercase
    text = str(text).lower()
    # remove text enclosed in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # remove links
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # remove HTML tags
    text = re.sub(r'<.*?>+', '', text)
    # remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # remove newline characters
    text = re.sub('\n', '', text)
    # remove tokens that contain a digit
    text = re.sub(r'\w*\d\w*', '', text)
    # return the cleaned text
    return text
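Because the function above substitutes the empty string, words separated only by punctuation or a newline end up fused together. A sketch of a variant that substitutes a space and then collapses repeated whitespace is shown below. The name `clean_review_spaced` and the sample sentence are made up for illustration, and this is not the function used in the rest of the notebook.

```python
import re
import string

def clean_review_spaced(text):
    """Same cleaning steps, but each removed span becomes a space."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", " ", text)                 # bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # links
    text = re.sub(r"<.*?>+", " ", text)                  # HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # punctuation
    text = re.sub(r"\w*\d\w*", " ", text)                # tokens containing a digit
    return " ".join(text.split())                        # newlines and extra spaces

sample = "Quite affordable.I bought 2 of them!\nSee https://example.com <br> [edit]"
print(clean_review_spaced(sample))
```

With this variant, "affordable.I" becomes two tokens instead of the single token "affordablei".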

In [20]:
# apply the cleaning function to the reviews column
df['reviews'] = df['reviews'].apply(review_cleaning)
# preview the dataset
df.head(3)

| | reviewerID | reviews | sentiment |
|---|---|---|---|
| 0 | A2IBPI20UZIR0U | not much to write about here but it does exact... | Positive |
| 1 | A14VAT5EAX3D9S | the product does exactly as it should and is q... | Positive |
| 2 | A195EZSQDW3E21 | the primary job of this device is to block the... | Positive |


Many frequently repeated words (stopwords) are not useful for predicting sentiment, so they are removed.

In [21]:
# import nltk
import nltk
# download the stopword lists from nltk
nltk.download('stopwords')
# import stopwords
from nltk.corpus import stopwords
# build the English stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
# remove the stopwords from each review and rejoin the remaining words
df['reviews'] = df['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# preview the dataset
df.head(3)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


| | reviewerID | reviews | sentiment |
|---|---|---|---|
| 0 | A2IBPI20UZIR0U | much write exactly supposed filters pop sounds... | Positive |
| 1 | A14VAT5EAX3D9S | product exactly quite affordablei realized dou... | Positive |
| 2 | A195EZSQDW3E21 | primary job device block breath would otherwis... | Positive |
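The stopword step can be illustrated without downloading anything. The sketch below uses a tiny hand-picked set as a stand-in for NLTK's English list, so the set contents and the helper name `remove_stopwords` are assumptions made for the example.

```python
# Tiny hand-picked stopword set, a stand-in for NLTK's English list
STOP_WORDS = {"the", "is", "a", "to", "of", "and", "it", "this"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(remove_stopwords("this is the primary job of the device"))
```

Membership tests on a set take constant time on average, which matters when the check runs once per word across thousands of reviews.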


In [22]:
# copy the dataset so that the original DataFrame is not modified
review_features = df.copy()
# keep only the reviews column and reset the index
review_features = review_features[['reviews']].reset_index(drop=True)
# preview the features
review_features.head(3)

| | reviews |
|---|---|
| 0 | much write exactly supposed filters pop sounds... |
| 1 | product exactly quite affordablei realized dou... |
| 2 | primary job device block breath would otherwis... |


Many words are not in their base form. Stemming reduces related forms to a common stem; for example, "historical" and "historically" are mapped to the same stem.

In [23]:
# use the Porter stemmer from nltk to perform the stemming
from nltk.stem.porter import PorterStemmer
# create the stemmer object
ps = PorterStemmer()
# build the English stopword set once
stop_words = set(stopwords.words('english'))
# list that stores the stemmed version of each review
corpus = []
# loop over all the reviews
for i in range(len(review_features)):
    # replace every non-alphabetic character with a space
    review = re.sub('[^a-zA-Z]', ' ', review_features['reviews'][i])
    # split the review into words
    review = review.split()
    # stem each word that is not a stopword
    review = [ps.stem(word) for word in review if word not in stop_words]
    # join the stemmed words with spaces
    review = ' '.join(review)
    # add the stemmed review to the corpus
    corpus.append(review)

## Vectorization

The reviews are now cleaned and ready for model preparation, but a machine learning model cannot work with text directly, so the text must be converted into numbers.

<br> Vectorization is done here in two ways - 
*   ngram = 1 (mono-gram) means each feature is a single word
*   ngram = 2 (bi-gram) means each feature is a pair of adjacent words
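The difference between the two settings can be seen on a short phrase. The helper `word_ngrams` below is a hypothetical illustration in plain Python, not the tokenizer used by the vectorizer.

```python
def word_ngrams(text, n):
    """Return the list of n-word sequences in text, joined by spaces."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "filters pop sounds well"
print(word_ngrams(sentence, 1))  # single words
print(word_ngrams(sentence, 2))  # adjacent word pairs
```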

In [24]:
# import the tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# create a tf-idf vectorizer with ngram = 1
tfidf1 = TfidfVectorizer(max_features=5000, ngram_range=(1, 1))
# fit the vectorizer on the reviews and transform them
# note: this uses review_features['reviews'], not the stemmed corpus built above
X1 = tfidf1.fit_transform(review_features['reviews'])
# create a tf-idf vectorizer with ngram = 2
tfidf2 = TfidfVectorizer(max_features=5000, ngram_range=(2, 2))
# fit the vectorizer on the reviews and transform them
X2 = tfidf2.fit_transform(review_features['reviews'])
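To see what the two vectorizers learn, the sketch below fits them on three made-up documents and inspects the vocabulary each one builds. The documents are illustrative only, and `max_features` is left at its default because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents standing in for the cleaned reviews
docs = ["good pop filter", "good cheap cable", "cheap pop filter"]

unigram = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
bigram = TfidfVectorizer(ngram_range=(2, 2)).fit(docs)

# vocabulary_ maps each learned term to its column index
print(sorted(unigram.vocabulary_))
print(sorted(bigram.vocabulary_))
# one row per document, one column per term
print(unigram.transform(docs).shape)
```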

## Encoding

The target column is **sentiment**. It is categorical, so a label encoder is used to convert it into numerical form.

In [25]:
# import the label encoder
from sklearn.preprocessing import LabelEncoder
# create a label encoder object
label_encoder = LabelEncoder()
# fit the encoder on the sentiment column and replace the labels with integer codes
df['sentiment'] = label_encoder.fit_transform(df['sentiment'])
# print the unique values of sentiment
df['sentiment'].unique()

array([2, 1, 0])
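`LabelEncoder` assigns codes in sorted label order, which is how the integers above relate to the sentiment names. The sketch below recovers that mapping from the fitted encoder's `classes_` attribute, using a short made-up list of labels.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["Positive", "Neutral", "Negative", "Positive"])

# classes_ is sorted, and each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

So in the outputs that follow, 0 is Negative, 1 is Neutral and 2 is Positive.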

In [26]:
# Creating a variable for target column
y = df['sentiment']

Now we have X1 (with n_gram = 1) and X2 (with n_gram = 2) as input features and y as the target column.

In [27]:
# counting the values of each class
y.value_counts()

2    9015
1     772
0     467
Name: sentiment, dtype: int64

The three classes are not present in equal proportions: about 88% of the reviews are Positive. Because of this, the model can become biased towards predicting a single class. Hence I used the **SMOTE oversampling** method, which adds synthetic observations to the smaller classes until all three classes are the same size.

## SMOTE Oversampling

In [28]:
# import SMOTE oversampling from imblearn
from imblearn.over_sampling import SMOTE
# create a SMOTE object
smote = SMOTE(random_state=42)
# resample the mono-gram features
X1_res, y_res = smote.fit_resample(X1, y)
# resample the bi-gram features (y_res is overwritten with the bi-gram result)
X2_res, y_res = smote.fit_resample(X2, y)

In [29]:
# checking the count of values in each class
y_res.value_counts()

2    9015
1    9015
0    9015
Name: sentiment, dtype: int64

All three classes now have the same number of samples.
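SMOTE creates each new minority sample by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step in NumPy on made-up 2-D points. It is a simplification, not imblearn's implementation: it always uses the single nearest neighbour rather than a random one of the k nearest, and the helper name `interpolate_sample` is hypothetical.

```python
import numpy as np

def interpolate_sample(x, neighbour, rng):
    """Create one synthetic point on the segment between x and neighbour."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbour - x)

rng = np.random.default_rng(42)
# Made-up minority-class points
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])

# Nearest other minority point to minority[0], by Euclidean distance
distances = np.linalg.norm(minority[1:] - minority[0], axis=1)
nearest = minority[1:][np.argmin(distances)]

synthetic = interpolate_sample(minority[0], nearest, rng)
print(synthetic)
```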

## Train-Test Split

In [30]:
#import the train test split
from sklearn.model_selection import train_test_split

Split the mono-gram dataset (ngram 1) into training and test sets in the ratio of 3 : 1

In [31]:
# divide the dataset into train and test
X1_train, X1_test, y1_train, y1_test = train_test_split(X1_res, y_res, test_size=0.25, random_state=0)

Split the bi-gram dataset (ngram 2) into training and test sets in the ratio of 3 : 1

In [32]:
# divide the dataset into train and test
X2_train, X2_test, y2_train, y2_test = train_test_split(X2_res, y_res, test_size=0.25, random_state=0)
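In this notebook the data is oversampled before it is split, so the test sets contain synthetic points interpolated from the same reviews the models train on, which tends to make test scores look better than they would on unseen reviews. One common alternative, which the cells above do not use, is to split first with stratification and oversample only the training part. The sketch below shows just the stratified split, on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made-up sizes): 8 negative, 20 neutral, 72 positive
y_toy = np.array([0] * 8 + [1] * 20 + [2] * 72)
X_toy = np.arange(len(y_toy) * 2).reshape(len(y_toy), 2)

# stratify=y_toy keeps roughly the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(np.bincount(y_tr), np.bincount(y_te))
# Any oversampling would then be fitted on (X_tr, y_tr) only,
# leaving (X_te, y_te) made of real, untouched samples.
```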

## Models

Now we will train several models on the data:

1.   Support Vector Classifier
2.   K-Nearest Neighbour Classifier
3.   Logistic Regression

**Support Vector Classifier**

In [33]:
# import the svm classifier model
from sklearn.svm import SVC
# creating an object for SVM classifier
svc1 = SVC() # for mono gram
svc2 = SVC() # for bi gram

Prediction through a **Mono-Gram** training set

In [34]:
# fit the model on the mono-gram training set
svc1.fit(X1_train, y1_train)
# Printing the score of training and test dataset
print("Support Vector Classifier Train Accuracy: {}".format(svc1.score(X1_train, y1_train)))
print("Support Vector Classifier Test Accuracy: {}".format(svc1.score(X1_test, y1_test)))

Support Vector Classifier Train Accuracy: 0.9984716264852339
Support Vector Classifier Test Accuracy: 0.9900916888494529


Prediction through a **Bi-Gram** training set

In [35]:
# fit the model on the bi-gram training set
svc2.fit(X2_train, y2_train)
# Printing the score of training and test dataset
print("Support Vector Classifier Train Accuracy: {}".format(svc2.score(X2_train, y2_train)))
print("Support Vector Classifier Test Accuracy: {}".format(svc2.score(X2_test, y2_test)))

Support Vector Classifier Train Accuracy: 0.9691367154760144
Support Vector Classifier Test Accuracy: 0.9609582963620231


**K Nearest Neighbour**

In [36]:
# import the KNN classifier model
from sklearn.neighbors import KNeighborsClassifier
# creating an object for KNN classifier
knn1 = KNeighborsClassifier() # for mono gram
knn2 = KNeighborsClassifier() # for bi gram

Prediction through a **Mono-Gram** training set

In [37]:
# fit the model on the mono-gram training set
knn1.fit(X1_train, y1_train)
# Printing the score of training and test dataset
print("K Nearest Neighbour Train Accuracy: {}".format(knn1.score(X1_train, y1_train)))
print("K Nearest Neighbour Test Accuracy: {}".format(knn1.score(X1_test, y1_test)))

K Nearest Neighbour Train Accuracy: 0.686190405758517
K Nearest Neighbour Test Accuracy: 0.6821946169772257


Prediction through a **Bi-Gram** training set

In [38]:
# fit the model on the bi-gram training set
knn2.fit(X2_train, y2_train)
# Printing the score of training and test dataset
print("K Nearest Neighbour Train Accuracy: {}".format(knn2.score(X2_train, y2_train)))
print("K Nearest Neighbour Test Accuracy: {}".format(knn2.score(X2_test, y2_test)))

K Nearest Neighbour Train Accuracy: 0.6563624710348568
K Nearest Neighbour Test Accuracy: 0.637385388938184


**Logistic Regression**

In [39]:
# import a logistic regression model
from sklearn.linear_model import LogisticRegression
# creating an object for Logistic Regression classifier
logreg1 = LogisticRegression(random_state=0) # for mono gram
logreg2 = LogisticRegression(random_state=0) # for bi gram

Prediction through a **Mono-Gram** training set

In [40]:
# fit the model on the mono-gram training set
logreg1.fit(X1_train, y1_train)
# Printing the score of training and test dataset
print("Logistic Regression Train Accuracy: {}".format(logreg1.score(X1_train, y1_train)))
print("Logistic Regression Test Accuracy: {}".format(logreg1.score(X1_test, y1_test)))

Logistic Regression Train Accuracy: 0.9662278755608145
Logistic Regression Test Accuracy: 0.9418811002661934


Prediction through a **Bi-Gram** training set

In [41]:
# fit the model on the bi-gram training set
logreg2.fit(X2_train, y2_train)
# Printing the score of training and test dataset
print("Logistic Regression Train Accuracy: {}".format(logreg2.score(X2_train, y2_train)))
print("Logistic Regression Test Accuracy: {}".format(logreg2.score(X2_test, y2_test)))

Logistic Regression Train Accuracy: 0.9225952768328156
Logistic Regression Test Accuracy: 0.8713398402839396


**Insights:**

*   SVC gives a test accuracy of 99% with the Mono-Gram model (96% with Bi-Gram)
*   KNN gives a test accuracy of 68% with the Mono-Gram model (64% with Bi-Gram)
*   Logistic Regression gives a test accuracy of 94% with the Mono-Gram model (87% with Bi-Gram)

## Evaluation

#### Evaluation of SVC

In [42]:
# make predictions with the SVC mono-gram model
pred_svc = svc1.predict(X1_test)
# import the classification report
from sklearn.metrics import classification_report
# print the classification report
print("Classification Report:\n", classification_report(y1_test, pred_svc))
# import the ROC-AUC score from sklearn.metrics
from sklearn.metrics import roc_auc_score
# note: this line scores the bi-gram KNN model (knn2), not svc1;
# SVC() does not provide predict_proba unless it is created with probability=True
roc_auc_score(y2_test, knn2.predict_proba(X2_test), multi_class='ovr')

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2302
           1       1.00      0.97      0.99      2246
           2       0.97      1.00      0.99      2214

    accuracy                           0.99      6762
   macro avg       0.99      0.99      0.99      6762
weighted avg       0.99      0.99      0.99      6762



0.797114884787912

#### Evaluation of KNN

In [43]:
# making prediction using KNN mono gram model
pred_knn = knn1.predict(X1_test)
# print the classification report
print("Classification Report:\n", classification_report(y1_test, pred_knn))
# print the roc auc score
roc_auc_score(y1_test, knn1.predict_proba(X1_test), multi_class='ovr')

Classification Report:
               precision    recall  f1-score   support

           0       0.73      1.00      0.85      2302
           1       0.63      1.00      0.77      2246
           2       1.00      0.03      0.06      2214

    accuracy                           0.68      6762
   macro avg       0.79      0.68      0.56      6762
weighted avg       0.79      0.68      0.56      6762



0.8271179750533415
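The KNN report shows why accuracy alone can mislead: class 2 has a precision of 1.00 but a recall of only 0.03, meaning the model rarely predicts that class. The sketch below computes both quantities for one class from paired label lists. The helper `precision_recall` and the labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: the model predicts class 2 only once, and is right that time
y_true = [2, 2, 2, 2, 1, 1, 0, 0]
y_pred = [2, 1, 1, 0, 1, 1, 0, 0]
print(precision_recall(y_true, y_pred, label=2))
```

Every prediction of class 2 is correct, giving perfect precision, yet three of the four true class 2 samples are missed, giving low recall.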

#### Evaluation of Logistic Regression

In [44]:
# making prediction using logistic regression mono gram model
pred_logreg = logreg1.predict(X1_test)
# print the classification report
print("Classification Report:\n", classification_report(y1_test, pred_logreg))
# print the roc auc score
roc_auc_score(y1_test, logreg1.predict_proba(X1_test), multi_class='ovr')

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98      2302
           1       0.90      0.97      0.93      2246
           2       0.97      0.86      0.91      2214

    accuracy                           0.94      6762
   macro avg       0.94      0.94      0.94      6762
weighted avg       0.94      0.94      0.94      6762



0.9880553967940177