# Sentiment Analysis on Customer Reviews Using a Linear SVM

Three datasets of customer reviews are available, one from each of three websites (Amazon, IMDb, and Yelp).

The main task is to build a natural language processing model that detects whether a review is positive (1) or negative (0).

A linear SVM model is used for training.

The confusion matrix and F1 score are reported as the evaluation metrics.

## Import Dataset

In [1]:
# import pandas library
import pandas as pd

In [2]:
# read the review text files into separate datasets
# the review text and the label are separated by a tab, hence sep='\t'
dataset1 = pd.read_csv('amazon_cells_labelled.txt', sep='\t', header=None)
dataset1.columns = ['review', 'label']
print (dataset1.shape)
dataset2 = pd.read_csv('imdb_labelled.txt', sep='\t', header=None)
dataset2.columns = ['review', 'label']
print (dataset2.shape)
dataset3 = pd.read_csv('yelp_labelled.txt', sep='\t', header=None)
dataset3.columns = ['review', 'label']
print (dataset3.shape)

(1000, 2)
(748, 2)
(1000, 2)
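
If a file is expected to hold more rows than `shape` reports, one possible cause is a review that begins with a double-quote character: by default `read_csv` treats it as the start of a quoted field, which can swallow the following lines into a single record. The sketch below uses made-up in-memory text rather than the real files, and `quoting=csv.QUOTE_NONE` is a setting worth checking, not something the notebook above uses.

```python
import csv
import io

import pandas as pd

# Made-up sample (an assumption for illustration): three tab-separated reviews.
# The first begins with a double quote and the third contains the next one.
sample = '"Great phone.\t1\nBad battery.\t0\nLoved it" overall.\t1\n'

# Default quoting: the leading quote opens a quoted field that can span lines.
default_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None)

# QUOTE_NONE: quote characters are treated as ordinary text.
plain_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                       quoting=csv.QUOTE_NONE)
plain_df.columns = ['review', 'label']

# Compare how many rows each setting produced.
print(default_df.shape[0], plain_df.shape[0])
```

Comparing the two row counts on a real file is a quick way to see whether quoting affects it.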


In [3]:
# combine the read datasets to a single dataframe
main_dataset = pd.concat([dataset1,dataset2,dataset3]).reset_index(drop=True)
print (main_dataset.shape)
main_dataset.head()

(2748, 2)


                                              review  label
0  So there is no way for me to plug it in here i...      0
1                        Good case, Excellent value.      1
2                             Great for the jawbone.      1
3  Tied to charger for conversations lasting more...      0
4                                  The mic is great.      1
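
Because F1 is sensitive to class balance, it can help to count the labels after combining the datasets. This is a minimal sketch on made-up toy frames; the `source` column and the `frames` dictionary are hypothetical names added for illustration and do not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for the three review frames (made-up rows, not the real data).
frames = {
    'amazon': pd.DataFrame({'review': ['Good case.', 'Bad mic.'], 'label': [1, 0]}),
    'imdb': pd.DataFrame({'review': ['Great film.'], 'label': [1]}),
    'yelp': pd.DataFrame({'review': ['Cold food.', 'Slow service.'], 'label': [0, 0]}),
}

# Tag each row with its source before concatenating, then reset the index.
combined = pd.concat(
    [df.assign(source=name) for name, df in frames.items()]
).reset_index(drop=True)

# Count positive (1) and negative (0) labels overall,
# and compute the share of positive labels per source.
label_counts = combined['label'].value_counts().to_dict()
positive_share = combined.groupby('source')['label'].mean().to_dict()
print(label_counts)
print(positive_share)
```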


## Text Preprocessing (Cleaning)

In [4]:
# clean the texts using NLTK (Natural Language Toolkit)
# remove general English stopwords, then stem the remaining words
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
corpus = []
for text in main_dataset['review']:
    # keep letters only, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # drop stopwords and reduce each remaining word to its stem
    words = [ps.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))
corpus[0]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\igaln\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'way plug us unless go convert'
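
Note that the cleaned first review no longer contains the word "no": NLTK's English stopword list includes negation words such as "no" and "not", which can carry sentiment. The sketch below shows one possible variation that keeps them. It uses a tiny hand-written stopword list and omits stemming so that it runs without NLTK; `STOP_WORDS`, `NEGATIONS`, and `clean_review` are hypothetical names, not part of the notebook.

```python
import re

# Tiny hand-written stopword list (an assumption for illustration; the notebook
# uses NLTK's much longer English list).
STOP_WORDS = {'the', 'is', 'a', 'it', 'this', 'not', 'no'}
# Hypothetical set of negation words to keep in the text.
NEGATIONS = {'not', 'no'}


def clean_review(text, stop_words):
    """Keep letters only, lowercase, split, and drop the given stopwords."""
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return ' '.join(word for word in words if word not in stop_words)


review = 'This is NOT a good phone!!'
without_negation = clean_review(review, STOP_WORDS)
with_negation = clean_review(review, STOP_WORDS - NEGATIONS)
print(without_negation)
print(with_negation)
```

Whether keeping negations improves the score on this data is an open question that would need to be measured.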

## Vectorization

In [5]:
# creating the Bag of Words model (vectorization)
# this bag of words will be the feature dataset
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = main_dataset.iloc[:, 1].values
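
To make the bag-of-words step concrete, here is a minimal sketch on a made-up three-document corpus. Each column corresponds to one vocabulary word and each cell counts how often that word occurs in the document. The names `toy_corpus`, `toy_cv`, and `toy_X` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned documents.
toy_corpus = ['good phone', 'bad phone', 'good good case']

toy_cv = CountVectorizer(max_features=1500)
toy_X = toy_cv.fit_transform(toy_corpus)  # returns a sparse matrix

# Vocabulary words ordered by their column index.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(toy_X.toarray())
```

`max_features=1500` has no effect here because the toy vocabulary is far smaller; on the full corpus it keeps only the 1500 most frequent words.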

## Train Test Split

In [6]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
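
With 2748 rows and `test_size = 0.20`, the split yields 2198 training rows and 550 test rows. The sketch below checks this on stand-in data of the same length. The `stratify` argument is an addition that the notebook does not use; it keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the combined dataset (2748).
X_demo = np.arange(2748).reshape(-1, 1)
y_demo = np.arange(2748) % 2  # alternating 0/1 labels, made up for illustration

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo)

# Sizes of the two parts and the share of positive labels in each.
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```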

## Linear Support Vector Machine Model (Linear SVM)

In [7]:
from sklearn import svm
classifier_linear = svm.SVC(kernel='linear')
classifier_linear.fit(X_train, y_train)

SVC(kernel='linear')
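
In the cells above the vectorizer is fitted on the full corpus before the split, so the vocabulary is also learned from the test reviews. One possible alternative, sketched here on made-up texts, is to wrap both steps in a scikit-learn pipeline so that the vectorizer is fitted on the training texts only. The names `train_texts`, `train_labels`, and `model` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up, already-cleaned training texts and labels (1 = positive, 0 = negative).
train_texts = ['good great', 'great love', 'bad awful', 'awful hate']
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the SVM from the training texts only.
model = make_pipeline(CountVectorizer(max_features=1500), SVC(kernel='linear'))
model.fit(train_texts, train_labels)

# Words unseen during training (here "case") are ignored at prediction time.
predictions = model.predict(['great case', 'awful case'])
print(predictions)
```

With a pipeline, the split would be applied to the cleaned texts rather than to the count matrix.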

## Train Data Prediction 

In [8]:
# Predicting the Train set results
y_pred = classifier_linear.predict(X_train)

### Confusion Matrix

In [9]:
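# Making the Confusion Matrix
# confusion_matrix sorts the labels, so rows and columns are ordered 0 (negative), 1 (positive)
# rows are actual labels, columns are predicted labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train, y_pred)
df_cm = pd.DataFrame(cm)
df_cm.index = [0, 1]
df_cm.columns = [0, 1]
print ('Confusion Matrix')
print (df_cm.head())

Confusion Matrix
      0     1
0  1009    72
1    59  1058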
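
scikit-learn's `confusion_matrix` puts actual labels on the rows and predicted labels on the columns, both in sorted label order (0 first, then 1). The sketch below, on made-up labels, shows how the four cells map to true negatives, false positives, false negatives, and true positives. `y_true`, `y_hat`, and `toy_cm` are illustrative names.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

# Rows are actual labels, columns are predicted labels, both in the order [0, 1].
toy_cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Flattening row by row gives the four counts in this order.
tn, fp, fn, tp = toy_cm.ravel()
print(toy_cm)
print(tn, fp, fn, tp)
```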


### F1 Score

In [10]:
# Higher F1 Score Means Better Model
from sklearn.metrics import f1_score

f1 = f1_score(y_train, y_pred)
print ('F1 Score for SVM Model : ' + str(f1))

F1 Score for SVM Model : 0.9417000445037829
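
The F1 score is the harmonic mean of precision and recall for the positive class. With class 1 as the positive class, the training confusion matrix gives 1058 true positives, 72 false positives, and 59 false negatives, and the short sketch below reproduces the reported score from those counts.

```python
# Counts read from the training confusion matrix, with class 1 as positive.
tp, fp, fn = 1058, 72, 59

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)
print(precision, recall, f1_manual)
```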


## Test Data Prediction

In [11]:
# Predicting the Test set results
y_pred = classifier_linear.predict(X_test)

### Confusion Matrix

In [12]:
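# Making the Confusion Matrix
# confusion_matrix sorts the labels, so rows and columns are ordered 0 (negative), 1 (positive)
# rows are actual labels, columns are predicted labels
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm)
df_cm.index = [0, 1]
df_cm.columns = [0, 1]
print ('Confusion Matrix')
print (df_cm.head())

Confusion Matrix
     0    1
0  224   57
1   60  209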


### F1 Score

In [13]:
# Higher F1 Score Means Better Model
from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
print ('F1 Score for SVM Model : ' + str(f1))

F1 Score for SVM Model : 0.7813084112149532
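
The training F1 (about 0.94) is noticeably higher than the test F1 (about 0.78), which may indicate some overfitting. One common way to investigate is to compare cross-validated F1 for several values of the SVM regularization parameter `C`. The sketch below runs on synthetic stand-in data, not the review data, and the candidate `C` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in features and labels (not the review data).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Mean 5-fold F1 for a few hypothetical values of C.
cv_f1 = {}
for c in (0.01, 0.1, 1.0):
    scores = cross_val_score(SVC(kernel='linear', C=c), X_demo, y_demo,
                             cv=5, scoring='f1')
    cv_f1[c] = scores.mean()
print(cv_f1)
```

On the real data, the same loop would be run on the training split only, keeping the test split for the final evaluation.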
