
# **MAJOR PROJECT - SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS**

## **Gathering Data**

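*   Kaggle dataset of 50,000 IMDB movie reviews for sentiment analysis
*   This notebook uses the labeled training file, `labeledTrainData.tsv`, which holds 25,000 reviews split evenly between positive (1) and negative (0)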




In [1]:
import pandas as pd

df = pd.read_table('/content/drive/MyDrive/datasets/major project/labeledTrainData.tsv')
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [2]:
df['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

## **Preprocessing - Cleaning of Data**



In [3]:
import re

def cleanTxt(text):
  """Strip mentions, hashtags, retweet markers, URLs, punctuation and digits."""
  text = re.sub(r'@[A-Za-z0-9]+', '', text)  # remove @mentions
  text = re.sub(r'#', '', text)              # remove the # symbol
  text = re.sub(r'RT[\s]+', '', text)        # remove retweet markers ("RT ")
  text = re.sub(r'https?://\S+', '', text)   # remove hyperlinks
  text = re.sub(r'[^\w\s]', '', text)        # remove punctuation
  text = re.sub(r'[0-9]', '', text)          # remove digits
  return text

# clean the reviews
df['review'] = df['review'].apply(cleanTxt)

# show the cleaned reviews
df

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,The Classic War of the Worlds by Timothy Hines...
2,7759_3,0,The film starts with a manager Nicholas Bell g...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious s...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I dont believe they made this film Completely ...
24997,10905_3,0,Guy is a loser Cant get girls needs to build u...
24998,10194_3,0,This minute documentary Buñuel made in the ea...
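
The raw IMDB reviews contain HTML line breaks (`<br />`), and removing only the punctuation leaves fragments such as `br` in the cleaned text. A minimal sketch of one way to avoid this, assuming it is acceptable to drop every `<...>` tag before the other substitutions run. `strip_html` is a hypothetical helper and the sample sentence is illustrative; neither is part of the notebook.

```python
import re

def strip_html(text):
    """Replace HTML tags such as <br /> with a space (hypothetical helper)."""
    return re.sub(r'<[^>]+>', ' ', text)

# Illustrative sample containing the kind of line breaks found in the raw reviews
sample = 'an older woman.<br /><br />Im so glad this is a British film'

cleaned = re.sub(r'[^\w\s]', '', strip_html(sample))  # then drop punctuation as before
cleaned = re.sub(r'\s+', ' ', cleaned).strip()        # collapse repeated whitespace
print(cleaned)
```

Calling such a helper as the first step inside `cleanTxt` would keep the tag names from being glued onto neighbouring words.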


In [4]:
#assigning x and y

x = df['review']     # features: the cleaned review text
y = df['sentiment']  # target: 1 = positive, 0 = negative

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state = 0, stratify = y)

In [6]:
print(x_train.shape)
print(x_test.shape)

(17500,)
(7500,)


In [7]:
import numpy as np
np.unique(y_train, return_counts=True)

(array([0, 1]), array([8750, 8750]))

In [8]:
np.unique(y_test, return_counts=True)

(array([0, 1]), array([3750, 3750]))

## **Vectorization and Creation of Model**

*   TF-IDF vectorizer to turn each review into a weighted bag-of-words vector
*   Logistic regression as the classification algorithm
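
To make the two steps concrete, here is a minimal sketch on a made-up four-sentence corpus (the sentences and labels are illustrative, not taken from the IMDB data). It fits the same kind of pipeline and reads the learned weight of each word from the logistic regression step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (assumed for illustration): 1 = positive, 0 = negative
docs = ['great film great acting', 'wonderful story',
        'terrible film', 'boring terrible plot']
labels = [1, 1, 0, 0]

toy_model = Pipeline([('Tfidf', TfidfVectorizer()),
                      ('model', LogisticRegression())])
toy_model.fit(docs, labels)

# Map each vocabulary term to its coefficient;
# positive weights push a review towards class 1
vocab = toy_model.named_steps['Tfidf'].vocabulary_   # term -> column index
coefs = toy_model.named_steps['model'].coef_[0]
weights = {term: coefs[idx] for term, idx in vocab.items()}

# Terms ordered from most negative to most positive weight
print(sorted(weights, key=weights.get))
print(toy_model.predict(['great story', 'terrible plot']))
```

Words that occur only in positive sentences receive positive weights and words that occur only in negative sentences receive negative ones, which is what the classifier relies on at prediction time.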




In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

text_model = Pipeline([('Tfidf',TfidfVectorizer()),('model',LogisticRegression())])

In [13]:
text_model.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('Tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('model',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scali

In [14]:
y_pred = text_model.predict(x_test)
y_pred

array([0, 1, 0, ..., 0, 0, 1])

In [25]:
positive = df[df['sentiment']==1]
positive.iloc[16].values


array(['4005_10', 1,
       'Although at one point I thought this was going to turn into The Graduate I have to say that The Mother does an excellent job of explaining the sexual desires of an older womanbr br Im so glad this is a British film because Hollywood never would have done it and even if they had they would have ruined it by not taking the time to develop the charactersbr br The story is revealed slowly and realistically The acting is superb the characters are believably flawed and the dialogue is sensitive I tried many times to predict what was going to happen and I was always wrong so I was very intrigued by the storybr br I highly recommend this movie And I must confess Ill forever look at my mom in a different light'],
      dtype=object)

## **Evaluation**

*   Accuracy Score
*   Confusion Matrix

*   Classification Report
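
scikit-learn's metric functions take the true labels first and the predictions second (`y_true, y_pred`). Accuracy is symmetric in its arguments, but the confusion matrix and the per-class precision and recall depend on that order. A minimal sketch on made-up labels (not the notebook's results) showing how precision and recall for the positive class follow from the matrix:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the items predicted positive, the share that is positive
recall = tp / (tp + fn)     # of the truly positive items, the share that was found

print(cm)
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```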





In [15]:
# Evaluation
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [16]:
accuracy_score(y_test, y_pred)*100

88.49333333333334

In [17]:
# rows are true labels, columns are predicted labels
confusion_matrix(y_test, y_pred)

array([[3284,  466],
       [ 397, 3353]])

In [18]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.89      0.88      0.88      3750
           1       0.88      0.89      0.89      3750

    accuracy                           0.88      7500
   macro avg       0.89      0.88      0.88      7500
weighted avg       0.89      0.88      0.88      7500



## **Creation of Web App using Streamlit**

In [19]:
# Save the fitted pipeline to disk with joblib (pickle is an alternative)
import joblib
joblib.dump(text_model,'sentiment_analysis')

['sentiment_analysis']
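
A saved pipeline can be reloaded with `joblib.load` and used for prediction without refitting. A minimal sketch of that round trip, using a toy pipeline and a temporary file (both assumed for illustration; the notebook's own file is `sentiment_analysis`):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy pipeline standing in for text_model (illustrative data)
toy = Pipeline([('Tfidf', TfidfVectorizer()),
                ('model', LogisticRegression())])
toy.fit(['good movie', 'bad movie', 'good plot', 'bad plot'], [1, 0, 1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_model.joblib')  # hypothetical file name
    joblib.dump(toy, path)
    restored = joblib.load(path)

# Check that the restored pipeline predicts exactly like the original
samples = ['good movie', 'bad plot']
same = bool((restored.predict(samples) == toy.predict(samples)).all())
print(same)
```

Because the vectorizer and the classifier are stored together in one pipeline object, the app only needs this single file to go from raw text to a prediction.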

In [20]:
!pip install -q --upgrade ipython
!pip install -q --upgrade ipykernel

In [21]:
!pip install streamlit --quiet
!pip install pyngrok==4.1.1 --quiet
from pyngrok import ngrok

In [23]:
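%%writefile app.py
import joblib
import streamlit as st

# Load the pipeline saved earlier with joblib.dump
model = joblib.load('sentiment_analysis')

st.title('Sentiment Classifier')
ip = st.text_input('Enter your message : ')

# Predict only after the button is clicked and the input is not empty
if st.button('Predict') and ip.strip():
  op = model.predict([ip])
  # In the training data, sentiment 1 is positive and 0 is negative
  st.title('Positive' if op[0] == 1 else 'Negative')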

Writing app.py


In [24]:
!nohup streamlit run app.py &
url=ngrok.connect(port='8501')
url

nohup: appending output to 'nohup.out'


2021-04-17 14:16:25.709 INFO    pyngrok.process: ngrok process starting: 772
2021-04-17 14:16:25.736 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="no configuration paths supplied"

2021-04-17 14:16:25.739 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="using configuration at default config path" path=/root/.ngrok2/ngrok.yml

2021-04-17 14:16:25.742 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="open config file" path=/root/.ngrok2/ngrok.yml err=nil

2021-04-17 14:16:25.756 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="starting web service" obj=web addr=127.0.0.1:4040

2021-04-17 14:16:25.846 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="tunnel session started" obj=tunnels.session

2021-04-17 14:16:25.849 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg="client session established" obj=csess id=8970f68a1e07

2021-04-17 14:16:25.852 INFO    pyngrok.process: ngrok process has s

'http://aacf443e4c00.ngrok.io'

2021-04-17 14:16:25.928 INFO    pyngrok.process: t=2021-04-17T14:16:25+0000 lvl=info msg=end pg=/api/tunnels id=f9182b24224d9edc status=201 dur=53.694333ms



## **Deployment on Heroku**

*   Link: https://sentiment-analysis-ankita.herokuapp.com/

