# IMDB Sentiment Analysis

This project uses:
- Kaggle IMDB sentiment dataset

Libraries Used:

- Pandas

- Numpy

- NLTK

- Seaborn

- Matplotlib

- Scikit-Learn

# 0-Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# 1-Data Exploration and Dataset Initialization
We will:
- 1-Load the dataset into df
- 2-Explore the first 5 rows using head()
- 3-Use describe() to get summary information about the dataset
- 4-Count how many reviews are labeled positive or negative

In [2]:
#1
df=pd.read_csv("https://raw.githubusercontent.com/meghjoshii/NSDC_DataScienceProjects_SentimentAnalysis/main/IMDB%20Dataset.csv")

In [3]:
#2
print(df.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


Format: index - review text - sentiment

In [4]:
#3
print(df.describe())

                                                   review sentiment
count                                               50000     50000
unique                                              49582         2
top     Loved today's show!!! It was a variety and not...  positive
freq                                                    5     25000


Information gathered: 50k reviews in total, 49,582 of them unique, so some reviews are duplicated (the most frequent one appears 5 times).

In [5]:
#4
print(df['sentiment'].value_counts())

sentiment
positive    25000
negative    25000
Name: count, dtype: int64


Data imbalance? No: 25k positive and 25k negative reviews, so the classes are balanced.
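
The `describe()` output above reports 49,582 unique reviews out of 50,000, so some reviews appear more than once. A minimal sketch of how duplicates could be counted and dropped, using a tiny stand-in DataFrame (the `toy` frame and its rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for df; the rows are invented for illustration.
toy = pd.DataFrame({
    "review": ["great film", "bad film", "great film", "fine"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# duplicated() marks every repeat of an earlier review as True.
n_duplicates = int(toy.duplicated(subset="review").sum())

# drop_duplicates() keeps the first occurrence of each review.
deduplicated = toy.drop_duplicates(subset="review").reset_index(drop=True)

print(n_duplicates)
print(len(deduplicated))
```

Whether to drop duplicates is a judgment call; the rest of this notebook keeps them.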

# 2-Tokenization

In [6]:
df['review'] = df['review'].apply(word_tokenize)
print(df['review'][1])

['A', 'wonderful', 'little', 'production', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'The', 'filming', 'technique', 'is', 'very', 'unassuming-', 'very', 'old-time-BBC', 'fashion', 'and', 'gives', 'a', 'comforting', ',', 'and', 'sometimes', 'discomforting', ',', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'The', 'actors', 'are', 'extremely', 'well', 'chosen-', 'Michael', 'Sheen', 'not', 'only', '``', 'has', 'got', 'all', 'the', 'polari', "''", 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', '!', 'You', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'Williams', "'", 'diary', 'entries', ',', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', '.', 'A', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'master', "'s", 'of', 'comedy', 'and', 'his', 'life', '
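
The token list above shows that the `<br />` tags in the raw reviews are split into `<`, `br`, `/` and `>`, and the `br` pieces survive the later cleaning steps. One possible fix, which is not part of the original pipeline, is to strip HTML tags before tokenizing. A sketch using only the standard library (`strip_html` is a hypothetical helper name):

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <i>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html(text: str) -> str:
    """Replace HTML tags with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())

sample = "A wonderful little production. <br /><br />The filming technique is very unassuming."
cleaned = strip_html(sample)
print(cleaned)
```

Under this assumption, the helper would be applied with `df['review'].apply(strip_html)` before the `word_tokenize` step.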

# 3-Cleaning Data
We will:
- 1-Eliminate special characters
- 2-Eliminate stop words
- 3-Lowercase all the words

In [7]:
#1- Keep only purely alphabetic tokens (drops punctuation and numbers)
df['review'] = df['review'].apply(lambda x: [item for item in x if item.isalpha()])

In [8]:
#2- Eliminate stop words (the list is lowercase, so capitalized tokens such as "The" are not removed here)
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: [item for item in x if item not in stop_words])

In [9]:
#3-LowerCase all words
df['review'] = df['review'].apply(lambda x: [item.lower() for item in x])

- Preview After Cleaning

In [10]:
print(df['review'])

0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, br, br, the...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object


- Complete example after cleaning

In [11]:
print(df['review'][1])

['a', 'wonderful', 'little', 'production', 'br', 'br', 'the', 'filming', 'technique', 'fashion', 'gives', 'comforting', 'sometimes', 'discomforting', 'sense', 'realism', 'entire', 'piece', 'br', 'br', 'the', 'actors', 'extremely', 'well', 'michael', 'sheen', 'got', 'polari', 'voices', 'pat', 'you', 'truly', 'see', 'seamless', 'editing', 'guided', 'references', 'williams', 'diary', 'entries', 'well', 'worth', 'watching', 'terrificly', 'written', 'performed', 'piece', 'a', 'masterful', 'production', 'one', 'great', 'master', 'comedy', 'life', 'br', 'br', 'the', 'realism', 'really', 'comes', 'home', 'little', 'things', 'fantasy', 'guard', 'rather', 'use', 'traditional', 'techniques', 'remains', 'solid', 'disappears', 'it', 'plays', 'knowledge', 'senses', 'particularly', 'scenes', 'concerning', 'orton', 'halliwell', 'sets', 'particularly', 'flat', 'halliwell', 'murals', 'decorating', 'every', 'surface', 'terribly', 'well', 'done']
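
In the cleaned example above, words such as 'a', 'the', 'you' and 'it' are still present: the NLTK stop-word list is lowercase and the filter ran before lowercasing, so capitalized forms slipped through. A sketch of the reordered steps follows. It uses a small hand-written stop-word set (`STOP_WORDS`, an assumption standing in for `stopwords.words('english')`) so that it runs without NLTK data, and `clean_tokens` is a hypothetical helper name.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS = {"a", "the", "is", "and", "you", "it"}

def clean_tokens(tokens):
    """Keep alphabetic tokens, lowercase them, then drop stop words."""
    alphabetic = [tok for tok in tokens if tok.isalpha()]
    lowered = [tok.lower() for tok in alphabetic]
    return [tok for tok in lowered if tok not in STOP_WORDS]

tokens = ["A", "wonderful", "little", "production", ".", "The",
          "filming", "technique", "is", "very", "unassuming-"]
cleaned_tokens = clean_tokens(tokens)
print(cleaned_tokens)
```

Lowercasing first means "A" and "The" match the lowercase stop-word entries and are removed along with "is".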


# 4-Stemming 

In [12]:
ps = PorterStemmer()
df['review'] = df['review'].apply(lambda x: [ps.stem(item) for item in x])

In [13]:
print(df['review'][1])

['a', 'wonder', 'littl', 'product', 'br', 'br', 'the', 'film', 'techniqu', 'fashion', 'give', 'comfort', 'sometim', 'discomfort', 'sens', 'realism', 'entir', 'piec', 'br', 'br', 'the', 'actor', 'extrem', 'well', 'michael', 'sheen', 'got', 'polari', 'voic', 'pat', 'you', 'truli', 'see', 'seamless', 'edit', 'guid', 'refer', 'william', 'diari', 'entri', 'well', 'worth', 'watch', 'terrificli', 'written', 'perform', 'piec', 'a', 'master', 'product', 'one', 'great', 'master', 'comedi', 'life', 'br', 'br', 'the', 'realism', 'realli', 'come', 'home', 'littl', 'thing', 'fantasi', 'guard', 'rather', 'use', 'tradit', 'techniqu', 'remain', 'solid', 'disappear', 'it', 'play', 'knowledg', 'sens', 'particularli', 'scene', 'concern', 'orton', 'halliwel', 'set', 'particularli', 'flat', 'halliwel', 'mural', 'decor', 'everi', 'surfac', 'terribl', 'well', 'done']


- Joining Tokens Back into Strings

In [14]:
df['review'] = df['review'].apply(lambda x: " ".join(x))
print(df['review'][1])

a wonder littl product br br the film techniqu fashion give comfort sometim discomfort sens realism entir piec br br the actor extrem well michael sheen got polari voic pat you truli see seamless edit guid refer william diari entri well worth watch terrificli written perform piec a master product one great master comedi life br br the realism realli come home littl thing fantasi guard rather use tradit techniqu remain solid disappear it play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done


# 5- Setting Training Data and Test Data

In [15]:
#Training and Testing reviews
train_reviews = df.review[:40000]
test_reviews = df.review[40000:]
#Training and Testing Sentiments
train_sentiments=df.sentiment[:40000]
test_sentiments=df.sentiment[40000:]
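
The slices above take the first 40,000 rows for training and the last 10,000 for testing, which relies on the row order of the file being unrelated to sentiment. An alternative, shown here as a sketch rather than what this notebook does, is scikit-learn's `train_test_split` with shuffling and stratification. The toy lists are invented for illustration.

```python
from sklearn.model_selection import train_test_split

# Invented toy data: 10 reviews with perfectly balanced labels.
reviews = [f"review {i}" for i in range(10)]
sentiments = ["positive", "negative"] * 5

# stratify keeps the positive/negative ratio the same in both parts;
# random_state makes the shuffle reproducible.
train_X, test_X, train_y, test_y = train_test_split(
    reviews, sentiments, test_size=0.2, stratify=sentiments, random_state=42
)

print(len(train_X), len(test_X))
print(sorted(test_y))
```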

# 6- CountVectorizer and LabelBinarizer

Why use CountVectorizer?
- CountVectorizer builds a vocabulary of the terms in the corpus (here 1- to 3-word n-grams) and counts how many times each term appears in each document
- We fit the vectorizer on the training reviews, then transform both the training and test reviews into matrices of token counts
- LabelBinarizer converts the string labels into numerical values (1/0)

In [16]:
#Count vectorizer for bag of words
cv = CountVectorizer(min_df=1, max_df=1.0, ngram_range=(1, 3), stop_words=None)

In [17]:
#fit/transform training reviews, then transform test reviews with the same vocabulary
cv_TrainReviews = cv.fit_transform(train_reviews)
cv_TestReviews = cv.transform(test_reviews)

In [20]:
#LabelBinarizer for sentiments: fit on training labels, reuse the same mapping for test labels
lb = LabelBinarizer()
lb_train_sentiments = lb.fit_transform(train_sentiments)
lb_test_sentiments = lb.transform(test_sentiments)
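
Fitting the binarizer once on the training labels and calling `transform` on the test labels keeps the label-to-number mapping identical across both sets. A sketch on invented labels (`toy_lb` and the label lists are illustrative):

```python
from sklearn.preprocessing import LabelBinarizer

toy_lb = LabelBinarizer()

# fit_transform learns the classes (sorted alphabetically) and encodes them.
train_labels = toy_lb.fit_transform(["positive", "negative", "positive"])

# transform reuses the mapping learned above.
test_labels = toy_lb.transform(["negative", "positive"])

classes = list(toy_lb.classes_)
print(classes)                 # index 0 maps to 0, index 1 maps to 1
print(train_labels.tolist())   # a column vector of shape (3, 1)
print(train_labels.ravel().tolist())
```

With two classes the output is a single column, so `ravel()` is a convenient way to obtain the 1-D array that most scikit-learn estimators expect for `y`.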

# 7- Naive Bayes Classifier

In [21]:
#Naive Bayes Classifier
mnb = MultinomialNB()
#Fit the model on the training counts; ravel() flattens the (n, 1) label column to 1-D
mnb_bow = mnb.fit(cv_TrainReviews, lb_train_sentiments.ravel())
#Predict on the test counts and measure accuracy
mnb_bow_predict = mnb.predict(cv_TestReviews)
mnb_bow_score = accuracy_score(lb_test_sentiments, mnb_bow_predict)
print("Accuracy :", mnb_bow_score)

Accuracy : 0.8875
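
Accuracy alone does not show whether the errors fall on positive or negative reviews. As a sketch of how the evaluation could be extended, the toy example below trains the same kind of model on a few invented sentences and prints a confusion matrix (rows are true labels, columns are predicted labels). All data and names here are illustrative and are not results from the IMDB dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = positive, 0 = negative.
train_texts = ["good great film", "great fun", "bad boring film", "awful bad"]
train_labels = [1, 1, 0, 0]
test_texts = ["great film", "boring bad"]
test_labels = [1, 0]

# Fit the vectorizer and the classifier on the training data only.
toy_cv = CountVectorizer()
toy_mnb = MultinomialNB()
toy_mnb.fit(toy_cv.fit_transform(train_texts), train_labels)

# Reuse the fitted vocabulary for the test texts.
predictions = toy_mnb.predict(toy_cv.transform(test_texts)).tolist()
accuracy = accuracy_score(test_labels, predictions)
matrix = confusion_matrix(test_labels, predictions).tolist()

print(predictions)
print(accuracy)
print(matrix)
```

On the real data, the same two metric calls could be applied to `lb_test_sentiments` and `mnb_bow_predict`.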
