<a href="https://colab.research.google.com/github/SimonielMusyoki/Data-Science/blob/master/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Task 1**

In [2]:
# Importing Libraries
import pandas as pd
import numpy as np
import spacy
import re

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from bs4 import BeautifulSoup


In [3]:
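# Set up
# "en_core_web_sm" replaces the "en" shortcut, which spaCy 3 no longer accepts
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# Use a separate name so the imported `stopwords` corpus module is not shadowed
stop_words = nltk.corpus.stopwords.words('english')
stopwords_lower = [s.lower() for s in stop_words]
# np.warnings was removed in NumPy 1.24; use the standard library instead
import warnings
warnings.filterwarnings('ignore')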

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [4]:
# Read data from CSV
df = pd.read_csv('data.csv', index_col=0)
df.head()

title,stars,review,helpful_votes,total_votes
The herbs were great...but the cherry tomatoes...not so great,2,The herb kit that came with my Aerogarden was ...,15,17
Even more useful than regular parchment paper,5,I originally bought this just because it was c...,19,19
Shake it before you bake it,2,"If you do it in reverse (bake before shaking),...",2,13
Not what the picture describes,2,I bought this steak for my father in law for C...,7,14
What a ripe off - GIVE ME A BREAK,2,Sorry but I had these noodles and they are no ...,10,34


##### Data Cleaning

In [5]:
# Remove Null values
df=df.dropna()
df = df.reset_index(drop=True)

In [6]:
# Convert star and votes to integers
df['stars'] = df['stars'].astype(int)
df['helpful_votes'] = df['helpful_votes'].astype(int)
df['total_votes'] = df['total_votes'].astype(int)

In [7]:
# Assign a class label to each review: 4 or 5 stars is positive, otherwise negative
df['label'] = np.where(df["stars"] >= 4, 1, 0)  # 1 = positive, 0 = negative
df

Unnamed: 0,stars,review,helpful_votes,total_votes,label
0,2,The herb kit that came with my Aerogarden was ...,15,17,0
1,5,I originally bought this just because it was c...,19,19,1
2,2,"If you do it in reverse (bake before shaking),...",2,13,0
3,2,I bought this steak for my father in law for C...,7,14,0
4,2,Sorry but I had these noodles and they are no ...,10,34,0
...,...,...,...,...,...
8992,1,"The product description claims ""It contains no...",24,38,0
8993,1,What a disappointment!!!!!!!!!!!! I bought the...,4,12,0
8994,3,The jury is still out on this item. Perhaps it...,2,21,0
8995,1,I hope this review helps others save their mon...,8,15,0


In [8]:
df['stars'].value_counts()

5    4278
1    3084
2     936
4     360
3     339
Name: stars, dtype: int64
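The star counts above fix the class balance of the positive/negative label (4 or 5 stars is positive, otherwise negative). A quick arithmetic check using only the printed counts is sketched below; the variable names are illustrative and not part of the notebook.

```python
# Star-rating counts copied from the value_counts() output above.
star_counts = {5: 4278, 1: 3084, 2: 936, 4: 360, 3: 339}

# Same rule as the label column: stars >= 4 is positive (1), otherwise negative (0).
positive = sum(n for stars, n in star_counts.items() if stars >= 4)
negative = sum(n for stars, n in star_counts.items() if stars < 4)
total = positive + negative

print(positive, negative, total)
print(f"positive share: {positive / total:.3f}")
```

The two classes come out close to balanced, so accuracy is a reasonable headline metric for this task.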

##### Data Preprocessing

In [9]:
# The first step is to convert all reviews to lower case and normalize whitespace.
df['pre_process'] = df['review'].apply(lambda x: ' '.join(x.lower() for x in str(x).split()))

In [10]:
# Remove the HTML tags and URLs from the reviews.
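# Name the parser explicitly so the result does not depend on which parsers are installed
df['pre_process']=df['pre_process'].apply(lambda x: BeautifulSoup(x, "html.parser").get_text())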
df['pre_process']=df['pre_process'].apply(lambda x: re.sub(r"http\S+", "", x))
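To see what this step does to a single review, here is a small standard-library sketch. It uses `html.parser.HTMLParser` as a stand-in for BeautifulSoup, so it is not the notebook's exact method. The helper `strip_html_and_urls` and the sample text are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_and_urls(text):
    # 1) Drop tags, keeping the text between them.
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    no_tags = "".join(parser.parts)
    # 2) Drop anything that starts with "http" up to the next whitespace.
    no_urls = re.sub(r"http\S+", "", no_tags)
    # 3) Collapse the double spaces left behind by the removals.
    return " ".join(no_urls.split())


sample = "great <b>tea</b>, see http://example.com/tea for details"
cleaned = strip_html_and_urls(sample)
print(cleaned)
```

The pattern `http\S+` only catches links that start with `http`. Bare `www.` links would survive this step.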

In [11]:
# Expand contractions in the reviews.
# Example: "it won’t be" is converted to "it will not be"
# [’'] matches both the curly and the straight apostrophe;
# the trailing \b keeps quoted words such as 'so' intact.
def contractions(s):
  s = re.sub(r"won[’']t\b", "will not", s)
  s = re.sub(r"wouldn[’']t\b", "would not", s)
  s = re.sub(r"couldn[’']t\b", "could not", s)
  s = re.sub(r"[’']d\b", " would", s)
  s = re.sub(r"can[’']t\b", "can not", s)
  s = re.sub(r"n[’']t\b", " not", s)
  s = re.sub(r"[’']re\b", " are", s)
  s = re.sub(r"[’']s\b", " is", s)  # note: this also rewrites possessives
  s = re.sub(r"[’']ll\b", " will", s)
  s = re.sub(r"[’']t\b", " not", s)
  s = re.sub(r"[’']ve\b", " have", s)
  s = re.sub(r"[’']m\b", " am", s)
  return s
df['pre_process'] = df['pre_process'].apply(contractions)
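The order of the substitutions matters: specific forms such as "won't" have to be handled before the generic "n't" rule. A table-driven sketch of the same idea follows. The `RULES` list and the `expand` helper are illustrative names, and the list is shorter than the one in the cell above.

```python
import re

# Ordered (pattern, replacement) pairs: specific forms come before the generic n't rule.
# [’'] accepts both the curly and the straight apostrophe.
RULES = [
    (r"won[’']t\b", "will not"),
    (r"can[’']t\b", "can not"),
    (r"n[’']t\b", " not"),
    (r"[’']re\b", " are"),
    (r"[’']ll\b", " will"),
    (r"[’']ve\b", " have"),
    (r"[’']m\b", " am"),
]


def expand(text):
    """Apply every rule in order to an already lower-cased string."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(expand("it won’t be long"))
print(expand("i can't say they're bad"))
print(expand("we didn’t think you’ll mind"))

# Applying the generic rule first shows why the order matters.
wrong_order = re.sub(r"n[’']t\b", " not", "won’t")
print(wrong_order)
```

Keeping the rules in one list makes the ordering explicit and lets new contractions be added without touching the function body.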


In [12]:
# Remove non-alphabetic characters from each token
df['pre_process'] = df['pre_process'].apply(lambda text: " ".join([re.sub("[^A-Za-z]+", "", tok) for tok in nltk.word_tokenize(text)]))

In [14]:
# Remove the stop words using the NLTK package
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership test fast
df['pre_process'] = df['pre_process'].apply(lambda x: " ".join([w for w in x.split() if w not in stop]))
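One side effect of this step is worth checking: NLTK's English stop list contains negations such as "not" and "no", so the "not" produced by the contraction step is dropped again here. The sketch below illustrates this with a tiny hypothetical stop list and plain `str.split` in place of `nltk.word_tokenize`.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's full English list.
STOP = {"the", "was", "and", "it", "a", "is", "not"}


def clean(text, stop_words):
    """Lower-case, strip non-letters from each token, then drop stop words."""
    tokens = [re.sub(r"[^a-z]+", "", tok) for tok in text.lower().split()]
    return " ".join(tok for tok in tokens if tok and tok not in stop_words)


review = "The tea was not good, 2 stars!"
with_default_list = clean(review, STOP)
keeping_negation = clean(review, STOP - {"not"})

print(with_default_list)
print(keeping_negation)
```

If negations turn out to matter for the sentiment label, removing them from the stop list before filtering is one option to try.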

In [15]:
# Perform lemmatization using the wordnet lemmatizer
lemmatizer = WordNetLemmatizer()
df['pre_process'] = df['pre_process'].apply(lambda x: " ".join([lemmatizer.lemmatize(w) for w in nltk.word_tokenize(x)]))
df

Unnamed: 0,stars,review,helpful_votes,total_votes,label,pre_process
0,2,The herb kit that came with my Aerogarden was ...,15,17,0,herb kit came aerogarden superb enjoyed caring...
1,5,I originally bought this just because it was c...,19,19,1,originally bought cheaper regular parchment pa...
2,2,"If you do it in reverse (bake before shaking),...",2,13,0,reverse bake shaking going get mess parmesan w...
3,2,I bought this steak for my father in law for C...,7,14,0,bought steak father law christmas always wante...
4,2,Sorry but I had these noodles and they are no ...,10,34,0,sorry noodle better cent version difference sp...
...,...,...,...,...,...,...
8992,1,"The product description claims ""It contains no...",24,38,0,product description claim contains highfructos...
8993,1,What a disappointment!!!!!!!!!!!! I bought the...,4,12,0,disappointment bought grocery store sale notic...
8994,3,The jury is still out on this item. Perhaps it...,2,21,0,jury still item perhaps take little time feel ...
8995,1,I hope this review helps others save their mon...,8,15,0,hope review help others save money least bette...


##### Feature Extraction using TF-IDF

In [16]:
X_train,X_test,Y_train, Y_test = train_test_split(df['pre_process'], df['label'], test_size=0.25, random_state=30)
print("Train: ",X_train.shape,Y_train.shape,"Test: ",(X_test.shape,Y_test.shape))

Train:  (6747,) (6747,) Test:  ((2250,), (2250,))
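The split sizes printed above follow from the row count and `test_size=0.25`. As far as I know, scikit-learn rounds the test share up and gives the remainder to the training set, which reproduces the numbers shown. A minimal arithmetic sketch:

```python
import math

n_samples = 8997   # total of the star counts printed earlier
test_size = 0.25   # same value as passed to train_test_split

# Test share is rounded up; the training set receives the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```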


In [17]:
# Using TFIDF Vectorizer

vectorizer= TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)
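For intuition about what the vectorizer computes, the sketch below reproduces the default weighting by hand in plain Python: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents are made up, and the code illustrates the formula rather than replacing `TfidfVectorizer`.

```python
import math
from collections import Counter

docs = ["good tea", "bad tea", "good good coffee"]  # toy corpus, not from the dataset
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one document over `vocab`."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Fitting on the training split only, as the cell above does, means the idf weights are learned without looking at the test reviews.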

###### First Classifier - SVM

In [18]:
# Implementing SVM with sklearn for classification
clf = LinearSVC(random_state=0)
# Fitting the Training data into model
clf.fit(tf_x_train,Y_train)
# Predicting the Test data
y_test_pred=clf.predict(tf_x_test)
report = classification_report(Y_test, y_test_pred,output_dict=True)
report

{'0': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1082},
 '1': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1168},
 'accuracy': 1.0,
 'macro avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250},
 'weighted avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250}}

***Using the SVM classifier, we obtained 100% accuracy on the held-out test set.***
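The precision, recall, and F1 values in the report are derived from counts of true positives, false positives, and false negatives. The sketch below computes them by hand for a made-up set of predictions; the helper `prf` is illustrative and is not a scikit-learn function.

```python
def prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: two true positives, one false positive, one false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

precision, recall, f1 = prf(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

A value of 1.0 for all three metrics means there is not a single false positive or false negative in the test set.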

###### Second Classifier - Logistic Regression

In [19]:
clf = LogisticRegression(max_iter=1000,solver='saga')
clf.fit(tf_x_train,Y_train)
y_test_pred=clf.predict(tf_x_test)
report=classification_report(Y_test, y_test_pred,output_dict=True)
report

{'0': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1082},
 '1': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1168},
 'accuracy': 1.0,
 'macro avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250},
 'weighted avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250}}

***Using Logistic Regression, we also obtained 100% accuracy on the held-out test set.***
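A perfect score on held-out text is unusual, so it may be worth confirming that the same reviews do not appear in both splits. The sketch below counts exact duplicates between two lists of texts. `overlap_report` is a hypothetical helper and the toy lists stand in for `X_train` and `X_test`.

```python
def overlap_report(train_texts, test_texts):
    """Return how many test texts also occur verbatim in the training texts."""
    train_set = set(train_texts)
    shared = [t for t in test_texts if t in train_set]
    return len(shared), len(shared) / len(test_texts)


# Toy data; in the notebook one would pass list(X_train) and list(X_test).
train_texts = ["good tea", "bad coffee", "stale noodle"]
test_texts = ["good tea", "fresh herb"]

count, share = overlap_report(train_texts, test_texts)
print(f"{count} of {len(test_texts)} test texts also appear in training ({share:.0%})")
```

A high overlap would suggest deduplicating the reviews before splitting. A low overlap would point to looking for other causes of the perfect score.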

## **Task 2**

In [20]:
# Copy the Dataframe
df1 = df.copy()
# Drop the old label column which is based on stars
del df1['label']
df1



Unnamed: 0,stars,review,helpful_votes,total_votes,pre_process
0,2,The herb kit that came with my Aerogarden was ...,15,17,herb kit came aerogarden superb enjoyed caring...
1,5,I originally bought this just because it was c...,19,19,originally bought cheaper regular parchment pa...
2,2,"If you do it in reverse (bake before shaking),...",2,13,reverse bake shaking going get mess parmesan w...
3,2,I bought this steak for my father in law for C...,7,14,bought steak father law christmas always wante...
4,2,Sorry but I had these noodles and they are no ...,10,34,sorry noodle better cent version difference sp...
...,...,...,...,...,...
8992,1,"The product description claims ""It contains no...",24,38,product description claim contains highfructos...
8993,1,What a disappointment!!!!!!!!!!!! I bought the...,4,12,disappointment bought grocery store sale notic...
8994,3,The jury is still out on this item. Perhaps it...,2,21,jury still item perhaps take little time feel ...
8995,1,I hope this review helps others save their mon...,8,15,hope review help others save money least bette...


In [21]:
# If at least 80% of the total votes are helpful, we label the review as helpful (1), otherwise 0
df1['label']=np.where((df1["helpful_votes"]/df1["total_votes"])>=0.8,1,0)
df1

Unnamed: 0,stars,review,helpful_votes,total_votes,pre_process,label
0,2,The herb kit that came with my Aerogarden was ...,15,17,herb kit came aerogarden superb enjoyed caring...,1
1,5,I originally bought this just because it was c...,19,19,originally bought cheaper regular parchment pa...,1
2,2,"If you do it in reverse (bake before shaking),...",2,13,reverse bake shaking going get mess parmesan w...,0
3,2,I bought this steak for my father in law for C...,7,14,bought steak father law christmas always wante...,0
4,2,Sorry but I had these noodles and they are no ...,10,34,sorry noodle better cent version difference sp...,0
...,...,...,...,...,...,...
8992,1,"The product description claims ""It contains no...",24,38,product description claim contains highfructos...,0
8993,1,What a disappointment!!!!!!!!!!!! I bought the...,4,12,disappointment bought grocery store sale notic...,0
8994,3,The jury is still out on this item. Perhaps it...,2,21,jury still item perhaps take little time feel ...,0
8995,1,I hope this review helps others save their mon...,8,15,hope review help others save money least bette...,0
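The labelling rule can be checked by hand on the first five rows shown above. The sketch below applies the same 0.8 threshold in plain Python. Treating a review with zero total votes as not helpful is an assumption made here. It matches what the vectorized comparison yields when the ratio is NaN, because `NaN >= 0.8` is false.

```python
def helpful_label(helpful_votes, total_votes, threshold=0.8):
    """Return 1 if the share of helpful votes reaches the threshold, else 0."""
    if total_votes == 0:
        # Assumption: a review nobody voted on is labelled as not helpful.
        return 0
    return 1 if helpful_votes / total_votes >= threshold else 0


# (helpful_votes, total_votes) for the first five rows of the table above.
first_rows = [(15, 17), (19, 19), (2, 13), (7, 14), (10, 34)]
labels = [helpful_label(h, t) for h, t in first_rows]
print(labels)
```

The result agrees with the `label` column printed for those rows.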


In [28]:
df1['label'].value_counts()

1    5430
0    3567
Name: label, dtype: int64
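These counts also give a reference point for the classifiers: a model that always predicts the majority class would already reach the accuracy computed below. The calculation uses only the numbers printed above.

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 5430, 0: 3567}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"always predicting {majority_label} gives accuracy {baseline_accuracy:.4f}")
```

Any trained model should be compared against this baseline rather than against 50%.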

*This task reuses the cleaned and preprocessed text from Task 1, so we go straight to feature extraction.*

##### Feature Extraction using TF-IDF

In [29]:
X_train,X_test,Y_train, Y_test = train_test_split(df1['pre_process'], df1['label'], test_size=0.25, random_state=30)
print("Train: ",X_train.shape,Y_train.shape,"Test: ",(X_test.shape,Y_test.shape))

Train:  (6747,) (6747,) Test:  ((2250,), (2250,))


In [30]:
# Using TFIDF Vectorizer
vectorizer= TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)

###### First Classifier - SVM

In [31]:
# Implementing SVM with sklearn for classification
clf = LinearSVC(random_state=0)
# Fitting the Training data into model
clf.fit(tf_x_train,Y_train)
# Predicting the Test data
y_test_pred=clf.predict(tf_x_test)
# Analyzing the results
report=classification_report(Y_test, y_test_pred,output_dict=True)
report

{'0': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 890},
 '1': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1360},
 'accuracy': 1.0,
 'macro avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250},
 'weighted avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250}}

***Using the SVM classifier, we again obtained 100% accuracy on the held-out test set.***

###### Second Classifier - Logistic Regression

In [32]:
clf = LogisticRegression(max_iter=1000,solver='saga')
clf.fit(tf_x_train,Y_train)
y_test_pred=clf.predict(tf_x_test)
# Analysing Logistic regression report
report=classification_report(Y_test, y_test_pred,output_dict=True)
report

{'0': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 890},
 '1': {'f1-score': 1.0, 'precision': 1.0, 'recall': 1.0, 'support': 1360},
 'accuracy': 1.0,
 'macro avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250},
 'weighted avg': {'f1-score': 1.0,
  'precision': 1.0,
  'recall': 1.0,
  'support': 2250}}

***Using Logistic Regression, we again obtained 100% accuracy on the held-out test set.***