# Sentiment Analysis


The data used in this notebook could be found at https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews.


The task is to make a model that classifies if a given statement is "Positive" or "Negative".


Rodmap:

1. Preprocessing.
2. Model Building and Evaluation.

In [1]:
#Importing dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import random

In [2]:
data= pd.read_csv("D:/Data/IMDB/data.csv") #Reading the data
data= data.sample(frac= 1) #Shuffling the rows

In [3]:
# checking for null values
data.isnull().sum()

review       0
sentiment    0
dtype: int64

## Preprocessing

In [4]:
data.head()

Unnamed: 0,review,sentiment
11302,I can't believe I'm wasting my time with a com...,negative
49401,Many of us who went through high school probab...,positive
9016,I watched SCARECROWS because of the buzz surro...,negative
6634,This show is unbelievable in that . . . what i...,negative
31507,This is truly one of the worst films I have ev...,negative


In [5]:
x= data["review"].copy()
y= data["sentiment"].copy()

In [6]:
# Removing punctuation using regular expressions
no_space = re.compile("[.;:!\'?,\"()\[\]]")
with_space = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")

def cleaning(reviews):
    reviews = [no_space.sub("", line.lower()) for line in reviews]
    reviews = [with_space.sub(" ", line) for line in reviews]
    return reviews

In [7]:
x= cleaning(x)

In [8]:
x= pd.DataFrame(x, columns= ["text"])

In [9]:
from nltk.corpus import stopwords
stopwords= stopwords.words('english')

In [10]:
#Removing stopwords
x["text"]= x["text"].apply(lambda x: " ".join([word for word in x.split() if x not in stopwords]))

In [11]:
x

Unnamed: 0,text
0,i cant believe im wasting my time with a comme...
1,many of us who went through high school probab...
2,i watched scarecrows because of the buzz surro...
3,this show is unbelievable in that what it repr...
4,this is truly one of the worst films i have ev...
...,...
49995,an updated version of a theme which has been d...
49996,if i was british i would be embarrassed by thi...
49997,i usually steer clear of tv movies because of ...
49998,finally watched this shocking movie last night...


In [12]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
stemmer= PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [13]:
x["text"]= x["text"].apply(lambda x: " ".join([stemmer.stem(word) for word in x.split()])) #Stemming

In [14]:
x["text"]= x["text"].apply(lambda x: " ".join([lemmatizer.lemmatize(word) for word in x.split()])) #lemmatization 

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
vectorizer= CountVectorizer()

In [17]:
x= vectorizer.fit_transform(x["text"]) #Converting text to features

## Model Building and Evaluation

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2) #Train test split

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

In [21]:
model= LinearSVC(C= 0.01) 
model.fit(x_train, y_train) #training the model

LinearSVC(C=0.01, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [22]:
pred= model.predict(x_test)

In [23]:
accuracy_score(y_test, pred) #checking the accuracy

0.8899