# <center> Movie-Review-Sentiment-Analysis </center>

## Business Value
Sentiment analysis, or opinion mining, analyzes textual documents and predicts their sentiment or opinion from their content. It is one of the most popular applications of natural language processing and text analytics, and a vast number of websites, books, and tutorials cover it. Sentiment analysis typically works best on subjective text, where people express opinions, feelings, and moods. In industry, it is widely used to analyze corporate surveys, feedback surveys, social media data, and reviews of movies, places, commodities, and more. The idea is to understand how people react to a specific entity and to act on that sentiment.

## Problem Statement
Build a model that analyzes the sentiment of unstructured movie review data and segregates the reviews by sentiment.

## Data
Each row of the data represents one movie review. One column holds the review text and the other holds its predefined sentiment.

review : Movie reviews, unstructured textual data

sentiment : predefined sentiment

## Approach
* Importing Necessary Libraries

* Loading Data

* Cleaning Textual Data
    * Removing HTML Tags
    * Removing Accented Characters
    * Expanding Contractions
    * Removing Special Characters
    * Lemmatizing Text
    * Removing Stopwords

* Sentiment Analysis
    * Analysis with <b>AFINN</b>
    * Analysis with <b>SentiWordNet</b>

To measure the performance of each model, we use the classification report, which includes accuracy, precision, recall, and F1 score.

# Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import string
import re
import nltk

# Loading Data

In [2]:
dataset = pd.read_csv('movie_reviews.csv')

reviews = np.array(dataset['Summary'])
sentiments = np.array(dataset['Sentiment'])

In [3]:
reviews

array(['rock destined st century new conan going make splash even greater arnold schwarzenegger jean claud van damme steven segal',
       'gorgeously elaborate continuation lord ring trilogy huge column word cannot adequately describe co writer director peter jackson expanded vision j r r tolkien middle earth',
       'effective tepid biopic', ...,
       'stand crocodile hunter hurried badly cobbled look godzilla combined scene japanese monster flick canned shot raymond burr commenting monster path destruction',
       'thing look like made home video quickie',
       'enigma well made dry placid'], dtype=object)

In [4]:
sentiments

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

# Cleaning Text

#### Removing HTML Tags

In [5]:
from bs4 import BeautifulSoup

In [6]:
def strip_html_tag(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

#### Remove Accented Characters

In [7]:
import unicodedata

In [8]:
def strip_accents(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
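
As a quick check of the function above, the sketch below applies the same normalize-then-drop-non-ASCII approach to two made-up strings (the strings are illustrative assumptions, not taken from the dataset).

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into its base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining marks.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

sample = 'Amélie is a naïve, café-loving heroine'
stripped = strip_accents(sample)
print(stripped)  # Amelie is a naive, cafe-loving heroine

# Letters with no decomposition (such as 'ø') are dropped rather than mapped.
dropped = strip_accents('Søren')
print(dropped)   # Sren
```

Note that this approach removes, rather than transliterates, characters that have no ASCII base letter.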

#### Expanding Contraction

In [9]:
CONTRACTION_MAP = {"ain't": "is not",
                   "aren't": "are not",
                   "can't": "cannot",
                   "can't've": "cannot have",
                   "'cause": "because",
                   "could've": "could have",
                   "couldn't": "could not",
                   "couldn't've": "could not have",
                   "didn't": "did not",
                   "doesn't": "does not",
                   "don't": "do not",
                   "hadn't": "had not",
                   "hadn't've": "had not have",
                   "hasn't": "has not",
                   "haven't": "have not",
                   "he'd": "he would",
                   "he'd've": "he would have",
                   "he'll": "he will",
                   "he'll've": "he will have",
                   "he's": "he is",
                   "how'd": "how did",
                   "how'd'y": "how do you",
                   "how'll": "how will",
                   "how's": "how is",
                   "I'd": "I would",
                   "I'd've": "I would have",
                   "I'll": "I will",
                   "I'll've": "I will have",
                   "I'm": "I am",
                   "I've": "I have",
                   "i'd": "i would",
                   "i'd've": "i would have",
                   "i'll": "i will",
                   "i'll've": "i will have",
                   "i'm": "i am",
                   "i've": "i have",
                   "isn't": "is not",
                   "it'd": "it would",
                   "it'd've": "it would have",
                   "it'll": "it will",
                   "it'll've": "it will have",
                   "it's": "it is",
                   "let's": "let us",
                   "ma'am": "madam",
                   "mayn't": "may not",
                   "might've": "might have",
                   "mightn't": "might not",
                   "mightn't've": "might not have",
                   "must've": "must have",
                   "mustn't": "must not",
                   "mustn't've": "must not have",
                   "needn't": "need not",
                   "needn't've": "need not have",
                   "o'clock": "of the clock",
                   "oughtn't": "ought not",
                   "oughtn't've": "ought not have",
                   "shan't": "shall not",
                   "sha'n't": "shall not",
                   "shan't've": "shall not have",
                   "she'd": "she would",
                   "she'd've": "she would have",
                   "she'll": "she will",
                   "she'll've": "she will have",
                   "she's": "she is",
                   "should've": "should have",
                   "shouldn't": "should not",
                   "shouldn't've": "should not have",
                   "so've": "so have",
                   "so's": "so as",
                   "that'd": "that would",
                   "that'd've": "that would have",
                   "that's": "that is",
                   "there'd": "there would",
                   "there'd've": "there would have",
                   "there's": "there is",
                   "they'd": "they would",
                   "they'd've": "they would have",
                   "they'll": "they will",
                   "they'll've": "they will have",
                   "they're": "they are",
                   "they've": "they have","to've": "to have",
                   "wasn't": "was not",
                   "we'd": "we would",
                   "we'd've": "we would have",
                   "we'll": "we will",
                   "we'll've": "we will have",
                   "we're": "we are",
                   "we've": "we have",
                   "weren't": "were not",
                   "what'll": "what will",
                   "what'll've": "what will have",
                   "what're": "what are",
                   "what's": "what is",
                   "what've": "what have",
                   "when's": "when is",
                   "when've": "when have",
                   "where'd": "where did",
                   "where's": "where is",
                   "where've": "where have",
                   "who'll": "who will",
                   "who'll've": "who will have",
                   "who's": "who is",
                   "who've": "who have",
                   "why's": "why is",
                   "why've": "why have",
                   "will've": "will have",
                   "won't": "will not",
                   "won't've": "will not have",
                   "would've": "would have",
                   "wouldn't": "would not",
                   "wouldn't've": "would not have",
                   "y'all": "you all",
                   "y'all'd": "you all would",
                   "y'all'd've": "you all would have",
                   "y'all're": "you all are",
                   "y'all've": "you all have",
                   "you'd": "you would",
                   "you'd've": "you would have",
                   "you'll": "you will",
                   "you'll've": "you will have",
                   "you're": "you are",
                   "you've": "you have"}

In [10]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
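    # Try longer contractions first so that "can't've" is not matched as "can't"
    keys = sorted(contraction_mapping.keys(), key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)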
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

In [11]:
expand_contractions("  It's an amazing language which can be used for Scripting")

'  It is an amazing language which can be used for Scripting'
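
Patterns built by joining dictionary keys with `|` are sensitive to key order, because regex alternation takes the first alternative that matches. The sketch below shows the effect on a nested contraction, using a hypothetical two-entry map and hypothetical helper names (`small_map`, `expand`) that are not part of the notebook.

```python
import re

# Hypothetical two-entry map, just large enough to show the ordering effect.
small_map = {"can't": "cannot", "can't've": "cannot have"}

def expand(text, keys):
    # Build an alternation in the given key order and replace each match.
    pattern = re.compile('|'.join(re.escape(k) for k in keys), flags=re.IGNORECASE)
    return pattern.sub(lambda m: small_map[m.group(0).lower()], text)

insertion_order = list(small_map)                         # "can't" comes first
longest_first = sorted(small_map, key=len, reverse=True)  # "can't've" comes first

short_first_result = expand("I can't've known", insertion_order)
longest_first_result = expand("I can't've known", longest_first)
print(short_first_result)    # I cannot've known
print(longest_first_result)  # I cannot have known
```

Sorting the keys by length before joining them is one simple way to make the longer contraction win.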

#### Removing Special Characters

In [12]:
def strip_special_characters(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text
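
A note on the character class used for this step: the ranges `A-Z` and `A-z` are not interchangeable. In ASCII, the span from `Z` to `a` also contains the characters `[`, `\`, `]`, `^`, `_`, and the backtick, so a class written with `A-z` keeps them. The sample string below is a made-up illustration.

```python
import re

sample = "good_movie [2002] ^^ great`"

# 'A-z' also covers [ \ ] ^ _ and the backtick, so nothing is removed here.
wide_range = re.sub(r'[^a-zA-z0-9\s]', '', sample)

# 'A-Z' keeps only letters, digits, and whitespace.
letters_only = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(repr(wide_range))    # 'good_movie [2002] ^^ great`'
print(repr(letters_only))  # 'goodmovie 2002  great'
```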

#### Lemmatizing Text

In [13]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer()

In [14]:
def lemmatize_text(text):
    tokens = nltk.word_tokenize(text)
    text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return text

#### Removing Stopwords

In [15]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

In [16]:
def strip_stopwords(text, is_lower_case=False):
    tokens = nltk.word_tokenize (text)
    #tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
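
Keeping `no` and `not` out of the stopword list matters for sentiment, because negations can flip polarity. The sketch below illustrates this with a hypothetical miniature stopword set and a plain whitespace split standing in for `nltk.word_tokenize`; none of these names come from the notebook.

```python
# Hypothetical miniature stopword sets; the real list comes from NLTK.
full_stopwords = {'the', 'is', 'a', 'this', 'no', 'not'}
kept_negations = full_stopwords - {'no', 'not'}

def drop_stopwords(text, stopwords):
    # Whitespace split is a stand-in for a real tokenizer in this sketch.
    return ' '.join(tok for tok in text.split() if tok.lower() not in stopwords)

review = 'This is not a good movie'
without_negation = drop_stopwords(review, full_stopwords)
with_negation = drop_stopwords(review, kept_negations)
print(without_negation)  # good movie
print(with_negation)     # not good movie
```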

#### Cleaned Text

In [17]:
def clean_text(text,strip_html=True, expand_contraction=True,
               accent_remove=True, text_lower_case=True,text_lemmatize=True, 
               special_char_remove=True, stopword_remove=True):
    
    processed_text=[]
    for doc in text:
        
        # strip HTML tags
        if strip_html:
            doc=strip_html_tag(doc)
        
        # remove accented characters
        if accent_remove:
            doc = strip_accents(doc)
            
        # expand contractions    
        if expand_contraction:
            doc = expand_contractions(doc)
            
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
            
        # replace runs of newlines / carriage returns with a single space
        doc = re.sub(r'[\r\n]+', ' ',doc)
        
        # insert spaces between special characters to isolate them    
        special_char_pattern = re.compile(r'([{.(-)!}])')
        doc = special_char_pattern.sub(" \\1 ", doc)
        
        # lemmatizing text
        if text_lemmatize:
            doc = lemmatize_text(doc)
        
        # remove special characters    
        if special_char_remove:
            doc = strip_special_characters(doc)  
        
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        
        # remove stopwords
        if stopword_remove:
            doc = strip_stopwords(doc, is_lower_case=text_lower_case)
            
        processed_text.append(doc)
        
    return processed_text
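
The special-character isolation pattern used inside `clean_text` can be checked on its own. Inside a character class, `(-)` reads as the range from `(` to `)`, so the hyphen itself is not part of the class and is left untouched. The sample string below is a made-up illustration.

```python
import re

# Same pattern as in clean_text: matches { . ( ) ! } but not '-'
special_char_pattern = re.compile(r'([{.(-)!}])')

sample = "well-made (mostly)!"
isolated = special_char_pattern.sub(" \\1 ", sample)
print(repr(isolated))  # 'well-made  ( mostly )  ! '
```

The doubled spaces this produces are collapsed by the later whitespace-cleanup step.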


In [18]:
clean_reviews=clean_text(reviews)

In [19]:
clean_reviews=np.array(clean_reviews)

# Sentiment Analysis with AFINN

The <b>AFINN</b> lexicon is one of the simplest and most popular lexicons for sentiment analysis. It assigns each word an integer valence score from -5 (most negative) to +5 (most positive).

In [22]:
#!pip install afinn
from afinn import Afinn

afn = Afinn(emoticons=True)

# Predict Sentiment

In [23]:
sample_id= [7,12,257,1066]
sample_reviews= reviews[sample_id]

In [24]:
sample_reviews

array(['perhaps picture ever made literally showed road hell paved good intention',
       'wendigo go cinema fed eye heart mind',
       'movie soft percolating magic deadpan suspense',
       'deep meaningful film'], dtype=object)

In [25]:
sample_reviews[2]

'movie soft percolating magic deadpan suspense'

In [26]:
norm_reviews=clean_text(sample_reviews)

In [27]:
type(np.array(norm_reviews))

numpy.ndarray

In [28]:
for review, sentiment in zip(clean_reviews[sample_id], sentiments[sample_id]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    print('Predicted Sentiment polarity:', afn.score(review))
    print('*'*12)

REVIEW: perhaps picture ever made literally showed road hell paved good intention
Actual Sentiment: 1
Predicted Sentiment polarity: -1.0
************
REVIEW: wendigo go cinema fed eye heart mind
Actual Sentiment: 1
Predicted Sentiment polarity: 0.0
************
REVIEW: movie soft percolating magic deadpan suspense
Actual Sentiment: 1
Predicted Sentiment polarity: 0.0
************
REVIEW: deep meaningful film
Actual Sentiment: 1
Predicted Sentiment polarity: 2.0
************
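
Lexicon scores like the ones above are, roughly, a sum of per-word valences. The sketch below mimics that idea with a toy lexicon; the words, their values, and the helper `toy_score` are illustrative assumptions, not the AFINN word list or the `afinn` package's implementation.

```python
# Toy valence lexicon: words and integer values are illustrative assumptions.
toy_lexicon = {'good': 3, 'hell': -4, 'meaningful': 2}

def toy_score(text, lexicon=toy_lexicon):
    # Sum the valence of every known word; unknown words contribute 0.
    return float(sum(lexicon.get(word, 0) for word in text.lower().split()))

mixed = toy_score('road hell paved good intention')
positive = toy_score('deep meaningful film')
neutral = toy_score('wendigo go cinema fed eye heart mind')
print(mixed, positive, neutral)  # -1.0 2.0 0.0
```

A review whose words are all missing from the lexicon scores 0.0, which is why short or unusual reviews often look neutral to a lexicon-based scorer.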


# Sentiment for Complete Data Set

In [30]:
sent_polarity = [afn.score(review) for review in clean_reviews]
#pred_sentiment = ['positive' if score >= 1.0 else 'negative' for score in sent_polarity]
pred_sentiment = [1 if score >= 1.0 else 0 for score in sent_polarity]
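
The cutoff of 1.0 decides how neutral scores are labeled. As a sketch, the snippet below applies two cutoffs to the four AFINN scores printed earlier for the sample reviews (all four are labeled 1 in the data). The 0.0 cutoff and the helper name `to_label` are assumptions added for comparison, not part of the notebook.

```python
# AFINN scores printed earlier for the four sample reviews.
sample_scores = [-1.0, 0.0, 0.0, 2.0]

def to_label(score, threshold):
    # 1 = positive, 0 = negative
    return 1 if score >= threshold else 0

labels_strict = [to_label(s, 1.0) for s in sample_scores]  # cutoff used above
labels_loose = [to_label(s, 0.0) for s in sample_scores]   # hypothetical alternative
print(labels_strict)  # [0, 0, 0, 1]
print(labels_loose)   # [0, 1, 1, 1]
```

With the 1.0 cutoff, every review that scores exactly 0.0 is counted as negative, so the choice of cutoff trades off errors between the two classes.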

In [32]:
pred_sentiment

[1, 1, 1, 1, 1, 1, 0, 0, 0, 1, ...]


# Evaluate Model Performance

In [33]:
from sklearn import metrics

In [34]:
def display_metric(actual_sent, predicted_sent):
    print("Accuracy:", np.round(metrics.accuracy_score(actual_sent,predicted_sent),4))
    print('Precision:', np.round(metrics.precision_score(actual_sent, predicted_sent, average='weighted'),4))
    print('Recall:', np.round(metrics.recall_score(actual_sent,predicted_sent,average='weighted'),4))
    print('F1 Score:', np.round( metrics.f1_score(actual_sent,predicted_sent,average='weighted'),4))

In [35]:
display_metric(actual_sent=sentiments, predicted_sent=pred_sentiment)

Accuracy: 0.6376
Precision: 0.6377
Recall: 0.6376
F1 Score: 0.6375


In [36]:
def display_confusion_matrix(actual_sent, predicted_sent, classes=[1,0]):
    # Rows are actual labels, columns are predicted labels.
    # `classes` is accepted for a uniform signature but is not used here, so
    # scikit-learn orders rows and columns by sorted label (0, then 1).
    cm = metrics.confusion_matrix(y_true=actual_sent, y_pred=predicted_sent)
    cm_frame = pd.DataFrame(data=cm)
    print(cm_frame)

In [37]:
def display_classification_report(actual_sent, predicted_sent, classes=[1,0]):

    report = metrics.classification_report(y_true=actual_sent,y_pred=predicted_sent, labels=classes) 
    print(report)

In [38]:
def display_model_performance(actual_sent, predicted_sent, classes=[1,0]):
    print('\nPrediction Confusion Matrix:')
    print('*'*50)
    display_confusion_matrix(actual_sent=actual_sent, predicted_sent=predicted_sent,classes=classes)
    print('\nModel Classification report:')
    print('*'*50)
    display_classification_report(actual_sent=actual_sent, predicted_sent=predicted_sent,classes=classes)    
    print('Model Performance metrics:')
    print('*'*50)
    display_metric(actual_sent=actual_sent, predicted_sent=predicted_sent)

In [39]:
display_model_performance(actual_sent=sentiments, predicted_sent=pred_sentiment)


Prediction Confusion Matrix:
**************************************************
      0     1
0  3337  1994
1  1870  3461

Model Classification report:
**************************************************
              precision    recall  f1-score   support

           1       0.63      0.65      0.64      5331
           0       0.64      0.63      0.63      5331

    accuracy                           0.64     10662
   macro avg       0.64      0.64      0.64     10662
weighted avg       0.64      0.64      0.64     10662

Model Performance metrics:
**************************************************
Accuracy: 0.6376
Precision: 0.6377
Recall: 0.6376
F1 Score: 0.6375
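
The figures in the report can be reproduced by hand from the confusion matrix. The sketch below copies the four counts from the matrix above (rows are actual labels, columns are predicted labels, in the order 0 then 1) and computes accuracy plus precision, recall, and F1 for the positive class; the variable names are assumptions.

```python
# Counts copied from the AFINN confusion matrix above.
tn, fp = 3337, 1994   # actual 0: predicted 0, predicted 1
fn, tp = 1870, 3461   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are right
recall_pos = tp / (tp + fn)      # of all actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 4))                                             # 0.6376
print(round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))  # 0.63 0.65 0.64
```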


# Sentiment Analysis with SentiWordNet

In [40]:
import nltk
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
  

In [41]:
awesome = list(swn.senti_synsets('awesome', 'a'))[0]
print('Positive Polarity Score:', awesome.pos_score())
print('Negative Polarity Score:', awesome.neg_score())
print('Objective Score:', awesome.obj_score())

Positive Polarity Score: 0.875
Negative Polarity Score: 0.125
Objective Score: 0.0


In [42]:
awesome

SentiSynset('amazing.s.02')
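
SentiWordNet gives each synset a positive, a negative, and an objective score that together sum to 1, so the objective score is whatever remains after the other two. The sketch below checks this against the values printed above for the first adjective sense of "awesome"; the `polarity` variable is an assumption that mirrors the positive-minus-negative difference used later in the notebook.

```python
# Scores printed above for the first adjective sense of 'awesome'.
pos_score, neg_score = 0.875, 0.125

# Objectivity is what remains after the positive and negative mass.
obj_score = 1.0 - (pos_score + neg_score)

# A simple polarity measure: positive minus negative.
polarity = pos_score - neg_score

print(obj_score)  # 0.0
print(polarity)   # 0.75
```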

In [43]:
def analyze_sentiment_sentiwordnet_lexicon(review,verbose=False):

    # tokenize and POS tag text tokens
    tokens = nltk.word_tokenize (review)
    tokens = [token.strip() for token in tokens]
    tagged_text = nltk.pos_tag(tokens)
    #tagged_text = [(token.text, token.tag_) for token in tokenized_text]
    pos_score = neg_score = obj_score = 0
    # start at 1 rather than 0 so the normalization below never divides by zero
    token_count = 1
    # get wordnet synsets based on POS tags
    # get sentiment scores if synsets are found
    for word, tag in tagged_text:
        ss_set = None
        if 'NN' in tag and list(swn.senti_synsets(word, 'n')):
            ss_set = list(swn.senti_synsets(word, 'n'))[0]
        elif 'VB' in tag and list(swn.senti_synsets(word, 'v')):
            ss_set = list(swn.senti_synsets(word, 'v'))[0]
        elif 'JJ' in tag and list(swn.senti_synsets(word, 'a')):
            ss_set = list(swn.senti_synsets(word, 'a'))[0]
        elif 'RB' in tag and list(swn.senti_synsets(word, 'r')):
            ss_set = list(swn.senti_synsets(word, 'r'))[0]
        # if senti-synset is found        
        if ss_set:
            # add scores for all found synsets
            pos_score += ss_set.pos_score()
            neg_score += ss_set.neg_score()
            obj_score += ss_set.obj_score()
            token_count += 1
    
    # aggregate final scores
    final_score = pos_score - neg_score
    norm_final_score = round(float(final_score) / token_count, 2)
    final_sentiment = 1 if norm_final_score >= 0 else 0
    if verbose:
        norm_obj_score = round(float(obj_score) / token_count, 2)
        norm_pos_score = round(float(pos_score) / token_count, 2)
        norm_neg_score = round(float(neg_score) / token_count, 2)
        # to display results in a nice table
        sentiment_frame = pd.DataFrame([[final_sentiment, norm_obj_score, norm_pos_score, 
                                         norm_neg_score, norm_final_score]])#,
                                       #columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'], 
                                                             #['Predicted Sentiment', 'Objectivity',
                                                              #'Positive', 'Negative', 'Overall']], 
                                                             #labels=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)
        
    return final_sentiment
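
The aggregation step at the end of the function can be looked at separately from tokenizing and tagging. The sketch below reproduces only that arithmetic on hypothetical per-token scores; the helper `aggregate` and the score values are assumptions for illustration, not output from SentiWordNet.

```python
def aggregate(scored_tokens, start_count=1):
    # scored_tokens: (pos_score, neg_score) pairs, one per token that had a synset.
    pos_total = sum(p for p, _ in scored_tokens)
    neg_total = sum(n for _, n in scored_tokens)
    # The counter starts at 1, as in the function above, so it is never zero.
    token_count = start_count + len(scored_tokens)
    norm_score = round((pos_total - neg_total) / token_count, 2)
    return norm_score, (1 if norm_score >= 0 else 0)

hypothetical_scores = [(0.0, 0.25), (0.125, 0.0), (0.0, 0.0)]
slightly_negative = aggregate(hypothetical_scores)
no_synsets = aggregate([])
print(slightly_negative)  # (-0.03, 0)
print(no_synsets)         # (0.0, 1)
```

Because the label test is `>= 0`, a review with no scored tokens, or with balanced scores, lands on the positive side. This may be one reason the positive class shows higher recall than the negative class for this model.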

In [44]:
nltk.download('averaged_perceptron_tagger')
for review, sentiment in zip(clean_reviews[sample_id], sentiments[sample_id]):
    print('REVIEW:', review)
    print('Actual Sentiment:', sentiment)
    pred = analyze_sentiment_sentiwordnet_lexicon(review, verbose=True)    
    #print(pred)
    print('*'*120)

REVIEW: perhaps picture ever made literally showed road hell paved good intention
Actual Sentiment: 1
   0     1     2     3     4
0  1  0.81  0.07  0.03  0.03
************************************************************************************************************************
REVIEW: wendigo go cinema fed eye heart mind
Actual Sentiment: 1
   0     1     2     3    4
0  1  0.82  0.02  0.02  0.0
************************************************************************************************************************
REVIEW: movie soft percolating magic deadpan suspense
Actual Sentiment: 1
   0     1    2     3     4
0  0  0.75  0.0  0.08 -0.08
************************************************************************************************************************
REVIEW: deep meaningful film
Actual Sentiment: 1
   0     1     2     3     4
0  1  0.54  0.08  0.04  0.04
***********************************************************************************************************************

In each table above, columns 0 to 4 are, in order: predicted sentiment, normalized objectivity score, normalized positive score, normalized negative score, and overall normalized score (positive minus negative).

# Predict Sentiment for Complete Dataset

In [45]:
predicted_sentiments_swn = [analyze_sentiment_sentiwordnet_lexicon(review, verbose=False) for review in clean_reviews]

# Evaluate Model Performance

In [46]:
display_model_performance(actual_sent=sentiments, predicted_sent=predicted_sentiments_swn)


Prediction Confusion Matrix:
**************************************************
      0     1
0  2152  3179
1  1375  3956

Model Classification report:
**************************************************
              precision    recall  f1-score   support

           1       0.55      0.74      0.63      5331
           0       0.61      0.40      0.49      5331

    accuracy                           0.57     10662
   macro avg       0.58      0.57      0.56     10662
weighted avg       0.58      0.57      0.56     10662

Model Performance metrics:
**************************************************
Accuracy: 0.5729
Precision: 0.5823
Recall: 0.5729
F1 Score: 0.5603
