## Sentiment Analysis on IMDB Dataset

#### The dataset contains IMDB movie reviews. The task is to classify each review as expressing positive or negative sentiment.

In [54]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from bs4 import BeautifulSoup  # BeautifulSoup (from the bs4 package), used to strip HTML tags
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer # bag-of-words counts; n-grams via ngram_range
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier , GradientBoostingClassifier 
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score , confusion_matrix , classification_report , roc_auc_score , roc_curve
import warnings
warnings.filterwarnings('ignore')

#### Load the data into the local Jupyter notebook

In [2]:
df = pd.read_csv('C:\\Users\\Lenovo\\Downloads\\9th July 2024\\IMDB Dataset.csv')

In [3]:
pd.set_option('display.max_colwidth', None)

In [4]:
df.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


#### Shape

In [5]:
df.shape

(50000, 2)

#### Info

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


#### The review column holds the review text. The sentiment column holds a categorical value, positive or negative. It is the target variable: it tells us whether a review is positive or negative.

### Data Preprocessing

#### 1.Handling the missing values

In [7]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [8]:
#### The data contains numbers as well as HTML tags such as <br /> tags

#### First we remove the numeric values from the review text

In [9]:
def remove_numbers(text):
    # re.sub replaces every run of digits with an empty string
    clean_text = re.sub(r'[0-9]+', '', text)
    return clean_text
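A small check of what this substitution does to a phrase from the first review. Deleting the digits leaves the surrounding spaces behind, which is why the cleaned text later shows `just  Oz` with two spaces. The `remove_numbers_tidy` variant is a hypothetical addition, not part of the notebook, that also collapses the leftover whitespace.

```python
import re

def remove_numbers(text):
    """Delete every run of digits, as the notebook's function does."""
    return re.sub(r'[0-9]+', '', text)

def remove_numbers_tidy(text):
    """Hypothetical variant: also collapse the whitespace left behind."""
    no_digits = re.sub(r'[0-9]+', '', text)
    return re.sub(r'\s{2,}', ' ', no_digits).strip()

sample = "watching just 1 Oz episode"
print(repr(remove_numbers(sample)))       # 'watching just  Oz episode'
print(repr(remove_numbers_tidy(sample)))  # 'watching just Oz episode'
```

For TF-IDF features the extra space is harmless, because the tokenizer splits on word boundaries anyway.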

#### Apply the function to the review column

In [10]:
df['review'] = df['review'].apply(remove_numbers)

In [11]:
df.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


#### We can see that the numbers have been removed from the text

#### The data still contains HTML tags, so we need to remove them. For that we use the BeautifulSoup library, which is built for parsing HTML.

In [12]:
# pip install beautifulsoup4

#### 'html.parser' selects Python's built-in HTML parser. BeautifulSoup uses it to parse the review into HTML tags and the text between them, and get_text() then extracts only the text.

#### In other words, the parsed document (tags plus text) is stored in the clean_text variable, and the function returns only the text part. The HTML tags are discarded.

In [13]:
def remove_html_tags(text):
    
    clean_text = BeautifulSoup(text , 'html.parser')
    return clean_text.get_text()
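To make the idea of "keep the text, drop the tags" concrete, here is a minimal sketch that uses only the standard library's `html.parser` module directly instead of BeautifulSoup. The class `TextExtractor` and the helper `strip_tags` are hypothetical names introduced for illustration. The sketch also shows why sentences end up glued together (`me.The`) when tags are removed without a separator.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each piece of text between tags.
        self.parts.append(data)

def strip_tags(text, separator=''):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return separator.join(parser.parts)

sample = "happened with me.<br /><br />The first thing"
print(strip_tags(sample))        # happened with me.The first thing
print(strip_tags(sample, ' '))   # happened with me. The first thing
```

BeautifulSoup's `get_text()` accepts a `separator` argument as well, so `clean_text.get_text(separator=' ')` would be one way to keep the sentences apart, if that matters for the later tokenization.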

#### apply takes another function as its argument. It loops implicitly over the values of the DataFrame column, calls the function on each value, and collects the returned values into a new Series. Assigning that result back to df['review'] overwrites the existing column; it does not create a new one.
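As a rough mental model, and not pandas' actual implementation, `apply` on a column behaves like a list comprehension over its values. The sketch below uses a plain Python list in place of the column, with made-up review strings.

```python
def remove_numbers(text):
    # Same idea as the notebook's cleaning function: drop digit characters.
    return ''.join(ch for ch in text if not ch.isdigit())

reviews = ["Rated 10 out of 10", "Watched it 3 times"]

# Conceptually, df['review'].apply(remove_numbers) does this:
cleaned = [remove_numbers(text) for text in reviews]

print(cleaned)   # ['Rated  out of ', 'Watched it  times']
print(reviews)   # the original values are untouched until reassigned
```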

In [14]:
df['review'] = df['review'].apply(remove_html_tags)

#### Data after the removal of html tags

In [15]:
df.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",positive
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",positive


#### The data now looks quite clean, but we still have to check other things, such as class balance

#### Data Balancing

In [16]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

#### The data is perfectly balanced (25,000 reviews per class), so no resampling is needed

#### Next the text must be converted to vectors, that is, to numbers. We will build a pipeline so that vectorizing the text and training the model happen in a single step.
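The pipelines in this notebook use `TfidfVectorizer` for the text-to-vector step. The following pure-Python sketch shows the kind of computation involved, assuming what I understand to be the scikit-learn defaults: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The function names and the three toy documents are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep words of two or more characters.
    return re.findall(r'\b\w\w+\b', text.lower())

def tfidf(docs):
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])   # L2-normalise each document
    return vocab, rows

vocab, rows = tfidf(["good movie", "bad movie", "good good acting"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document the rarer word `bad` receives a larger weight than `movie`, which appears in two of the three documents. That is the point of the IDF factor: words that occur everywhere carry less signal.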

### Encode the target variable

#### The target variable holds two categories, positive and negative, so it is binary categorical data. We use LabelEncoder to map the two labels to 0 and 1.

In [17]:
label = LabelEncoder()

In [18]:
df['sentiment'] = label.fit_transform(df['sentiment'])

In [19]:
df.head(2)

Unnamed: 0,review,sentiment
0,"One of the other reviewers has mentioned that after watching just Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.",1
1,"A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",1


##### The target variable is now encoded (positive became 1)
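The reason `positive` ends up as 1 is that `LabelEncoder` sorts the class names before numbering them, so `negative` comes first. A minimal pure-Python sketch of that mapping, with a made-up list of labels:

```python
labels = ["positive", "negative", "positive", "negative"]

# Classes are sorted, then numbered in that order.
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)   # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 0, 1, 0]
```

On the fitted encoder, `label.classes_` lists the classes in the same order, and `label.inverse_transform` turns the numbers back into the original strings.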

#### We split the data into the independent variable (x) and the dependent variable (y)

In [20]:
x = df['review']
print(x.shape)
y = df['sentiment']
print(y.shape)

(50000,)
(50000,)


In [21]:
x.head(2)

0    One of the other reviewers has mentioned that after watching just  Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows

In [22]:
y.head(2)

0    1
1    1
Name: sentiment, dtype: int32

#### Split the data into train and test

In [23]:
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size = 0.25 , random_state = 1 ,  stratify = y )

In [24]:
print(x_train.shape , x_test.shape , y_train.shape , y_test.shape)

(37500,) (12500,) (37500,) (12500,)
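Because the split uses `stratify = y` and the classes are perfectly balanced, each class should contribute 18,750 training and 6,250 test reviews. The sketch below illustrates the idea of a stratified split in plain Python on a toy target. It is not scikit-learn's algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=1):
    """Return (train_idx, test_idx) with the class ratio kept in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # same share for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [1] * 20 + [0] * 20            # toy balanced target
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))  # 15 of each class
print(Counter(labels[i] for i in test_idx))   # 5 of each class
```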


### The data is now ready for model building

### Pipeline

In [25]:
model = Pipeline([('TFIDF' , TfidfVectorizer()) , ('XGBOOST' , XGBClassifier())])

In [26]:
model

#### How does the Pipeline work?

#### Ans : The data first goes to TfidfVectorizer, which converts the text into numeric vectors. The numeric data is then passed to the model given as the last step of the pipeline, and the ML model is trained on it.
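The chaining described above can be sketched with a toy pipeline. Everything here (`WordCounter`, `SignClassifier`, `MiniPipeline`) is a hypothetical stand-in written for illustration, not scikit-learn code. It only mimics the contract: on `fit` each transformer is fitted and its output passed on, while on `predict` the already-fitted transformers only transform.

```python
class WordCounter:
    """Toy transformer: turns each text into [count of 'good', count of 'bad']."""

    def fit(self, texts, y=None):
        return self          # nothing to learn in this toy step

    def transform(self, texts):
        return [[t.lower().split().count('good'),
                 t.lower().split().count('bad')] for t in texts]

class SignClassifier:
    """Toy model: predicts 1 when 'good' appears more often than 'bad'."""

    def fit(self, features, y):
        return self

    def predict(self, features):
        return [1 if good > bad else 0 for good, bad in features]

class MiniPipeline:
    def __init__(self, steps):
        self.transformers = steps[:-1]
        self.model = steps[-1]

    def fit(self, x, y):
        for step in self.transformers:
            x = step.fit(x, y).transform(x)   # fit, then pass the output on
        self.model.fit(x, y)
        return self

    def predict(self, x):
        for step in self.transformers:
            x = step.transform(x)             # only transform, never refit
        return self.model.predict(x)

pipe = MiniPipeline([WordCounter(), SignClassifier()])
pipe.fit(["good good film", "bad film"], [1, 0])
print(pipe.predict(["a good story", "bad bad acting"]))   # [1, 0]
```

The practical benefit is that the vectorizer's vocabulary is learned from the training reviews only, so nothing from the test set leaks into the features.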

### Fit the data or Train the model

In [27]:
model.fit(x_train , y_train)

#### Training the model means the model learns from the data: it identifies patterns that link the review text to the sentiment

#### Training Prediction 

In [29]:
y_train_pred = model.predict(x_train)

In [30]:
print(y_train_pred)

[1 1 1 ... 0 0 0]


#### Testing Prediction

In [33]:
y_test_pred = model.predict(x_test)

In [34]:
print(y_test_pred)

[1 0 1 ... 1 1 0]


#### Training Accuracy

In [37]:
training_accuracy = accuracy_score(y_train ,y_train_pred)

In [38]:
training_accuracy

0.9424266666666666

#### Testing accuracy

In [39]:
testing_accuracy = accuracy_score(y_test ,y_test_pred)

In [40]:
testing_accuracy 

0.85816

#### With XGBoost (Extreme Gradient Boosting) the training accuracy is 0.9424 and the testing accuracy is 0.8582, which is quite good. The gap between train and test accuracy is about 8.4 percentage points, so the model is only mildly overfitted.
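Accuracy alone hides which kind of mistake the model makes. The notebook imports `confusion_matrix` and `classification_report` but does not call them, so here is a small pure-Python sketch of what accuracy and the confusion counts measure. The labels below are made up, not model output.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    # Keys are (actual, predicted) pairs, e.g. (1, 0) is a missed positive.
    return Counter(zip(y_true, y_pred))

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))            # 0.6666666666666666
print(confusion_counts(y_true, y_pred))
```

With the notebook's variables, `confusion_matrix(y_test, y_test_pred)` and `classification_report(y_test, y_test_pred)` should give the same breakdown for the real predictions.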

#### We will try some other models
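Since every model below goes through the same fit, predict and score steps, those steps could be wrapped in one helper. This is a sketch under the assumption that the model exposes `fit` and `predict`, as scikit-learn pipelines do. `evaluate` and `MajorityModel` are hypothetical names, and the tiny dataset exists only so the sketch runs on its own.

```python
def evaluate(name, model, x_train, y_train, x_test, y_test):
    """Fit any object exposing fit/predict and report both accuracies."""
    model.fit(x_train, y_train)
    scores = {}
    for split, (x, y) in {'train': (x_train, y_train),
                          'test': (x_test, y_test)}.items():
        pred = model.predict(x)
        scores[split] = sum(int(p == t) for p, t in zip(pred, y)) / len(y)
    print(f"{name}: train={scores['train']:.4f} test={scores['test']:.4f}")
    return scores

class MajorityModel:
    """Hypothetical stand-in so the sketch runs without scikit-learn."""

    def fit(self, x, y):
        self.label = max(set(y), key=list(y).count)
        return self

    def predict(self, x):
        return [self.label] * len(x)

scores = evaluate('majority', MajorityModel(),
                  ['a', 'b', 'c'], [1, 1, 0], ['d', 'e'], [1, 0])
```

With the notebook's objects the call would look like `evaluate('gb', model_gb, x_train, y_train, x_test, y_test)`.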

### Pipeline with GradientBoostingClassifier

In [49]:
model_gb = Pipeline([('TFIDF',TfidfVectorizer()),('GradientBoost',GradientBoostingClassifier())])

print(model_gb)

# fit the model

model_gb.fit(x_train , y_train)

# Training prediction
y_train_pred_gb = model_gb.predict(x_train)
print('Training Prediction by gb' , y_train_pred_gb)

# Testing Prediction

y_test_pred_gb = model_gb.predict(x_test)
print('Testing Prediction by gb' , y_test_pred_gb)

# Training Accuracy

training_accuracy_gb = accuracy_score(y_train , y_train_pred_gb)
print('Training Accuracy of gb is :' ,training_accuracy_gb)

# Testing Accuracy

testing_accuracy_gb = accuracy_score(y_test , y_test_pred_gb)
print('Testing Accuracy of gb is :' , testing_accuracy_gb)

Pipeline(steps=[('TFIDF', TfidfVectorizer()),
                ('GradientBoost', GradientBoostingClassifier())])
Training Prediction by gb [1 1 1 ... 0 0 0]
Testing Prediction by gb [1 0 1 ... 1 1 0]
Training Accuracy of gb is : 0.8261333333333334
Testing Accuracy of gb is : 0.81416


#### The training and testing accuracy of GradientBoostingClassifier are 0.8261 and 0.8142 respectively

### Pipeline with AdaBoostClassifier

In [50]:
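# Note: the step label 'GradientBoost' is only a name carried over from the previous cell; the estimator is AdaBoostClassifier
model_ada = Pipeline([('TFIDF',TfidfVectorizer()),('GradientBoost',AdaBoostClassifier())])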

print(model_ada)

# fit the model

model_ada.fit(x_train , y_train)

# Training prediction
y_train_pred_ada = model_ada.predict(x_train)
print('Training Prediction by ada' , y_train_pred_ada)

# Testing Prediction

y_test_pred_ada = model_ada.predict(x_test)
print('Testing Prediction by ada' , y_test_pred_ada)

# Training Accuracy

training_accuracy_ada = accuracy_score(y_train , y_train_pred_ada)
print('Training Accuracy of ada is :' ,training_accuracy_ada)

# Testing Accuracy

testing_accuracy_ada = accuracy_score(y_test , y_test_pred_ada)
print('Testing Accuracy of ada is :' , testing_accuracy_ada)

Pipeline(steps=[('TFIDF', TfidfVectorizer()),
                ('GradientBoost', AdaBoostClassifier())])
Training Prediction by ada [1 1 1 ... 0 0 0]
Testing Prediction by ada [1 0 0 ... 1 1 0]
Training Accuracy of ada is : 0.8082133333333333
Testing Accuracy of ada is : 0.80408


#### The training and testing accuracy of AdaBoost are 0.8082 and 0.8041 respectively

### Pipeline with RandomForest

In [56]:
model_rf = Pipeline([('TFIDF',TfidfVectorizer()),('RF',RandomForestClassifier())])

print(model_rf)

# fit the model

model_rf.fit(x_train , y_train)

# Training prediction
y_train_pred_rf = model_rf.predict(x_train)
print('Training Prediction by rf' , y_train_pred_rf)

# Testing Prediction

y_test_pred_rf = model_rf.predict(x_test)
print('Testing Prediction by rf' , y_test_pred_rf)

# Training Accuracy

training_accuracy_rf = accuracy_score(y_train , y_train_pred_rf)
print('Training Accuracy of rf is :' ,training_accuracy_rf)

# Testing Accuracy

testing_accuracy_rf = accuracy_score(y_test , y_test_pred_rf)
print('Testing Accuracy of rf is :' , testing_accuracy_rf)

Pipeline(steps=[('TFIDF', TfidfVectorizer()), ('RF', RandomForestClassifier())])
Training Prediction by rf [1 1 1 ... 0 0 0]
Testing Prediction by rf [1 0 1 ... 1 1 0]
Training Accuracy of rf is : 1.0
Testing Accuracy of rf is : 0.84432


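#### The training and testing accuracy of RandomForest are 1.0 and 0.8443 respectively. The model is overfitted: it fits the training data perfectly but loses about 15.6 percentage points on the test data.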
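Putting the four reported results side by side makes the overfitting comparison easier to read. The numbers below are copied from the outputs above; only the `results` and `gaps` names are introduced here.

```python
# (train accuracy, test accuracy) as printed in the notebook outputs
results = {
    'XGBoost':          (0.9424266666666666, 0.85816),
    'GradientBoosting': (0.8261333333333334, 0.81416),
    'AdaBoost':         (0.8082133333333333, 0.80408),
    'RandomForest':     (1.0,                0.84432),
}

# Train-test gap: a larger value suggests stronger overfitting.
gaps = {name: train - test for name, (train, test) in results.items()}

# List the models from best to worst test accuracy.
for name in sorted(results, key=lambda n: results[n][1], reverse=True):
    train, test = results[name]
    print(f"{name:<17} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}")
```

Sorted this way, XGBoost comes first on test accuracy with a gap of roughly 0.084, RandomForest has by far the largest gap (about 0.156), and the two remaining boosting models generalize most consistently, with gaps of about 0.012 and 0.004.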

### As per my understanding and based on these results, XGBoost performs best of the four models, with the highest testing accuracy (0.858). As a next step we could build a stacking or voting classifier.
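To illustrate the voting idea, here is a minimal sketch of hard (majority) voting over the label predictions of several models. The three prediction lists are made up for the example; in practice they would be the `y_test_pred` arrays of the fitted pipelines. `majority_vote` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per row."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Made-up predictions from three models on four reviews
model_a = [1, 0, 1, 1]
model_b = [1, 0, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))   # [1, 0, 1, 1]
```

Using an odd number of models avoids ties for a binary target. scikit-learn ships `VotingClassifier` and `StackingClassifier` in `sklearn.ensemble`, which would be the natural tools for a full implementation.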