
In [None]:
!pip install sagemaker==1.72.0

Collecting sagemaker==1.72.0
  Downloading https://files.pythonhosted.org/packages/d1/3f/75ea837e2bd704b1567bdf55f7e768745da4fcf1e3b3e061e41ba7d7f399/sagemaker-1.72.0.tar.gz (297kB)
Collecting boto3>=1.14.12
  Downloading https://files.pythonhosted.org/packages/57/3d/386cc84db1e57aa7782eed00bcbdb884e496bdb1689c7f4c09a22572846d/boto3-1.17.35-py2.py3-none-any.whl (131kB)
Collecting protobuf3-to-dict>=0.1.5
  Downloading https://files.pythonhosted.org/packages/6b/55/522bb43539fed463275ee803d79851faaebe86d17e7e3dbc89870d0322b9/protobuf3-to-dict-0.1.5.tar.gz
Collecting smdebug-rulesconfig==0.1.4
  Downloading https://files.pythonhosted.org/packages/2c/d7/80252c50e8848101914457d1cf58ef7e20f34406fc660d26108a1fec866d/smdebug_rulesconfig-0.1.4-py2.py3-none-any.whl
Collecting s3transfer<0.4.0,>=0.3.0
  Downloading https://files.pythonhosted.org/packages/98/

In [None]:
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

--2021-03-24 16:54:24--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2021-03-24 16:54:26 (44.2 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [None]:
import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f, encoding='utf-8') as review:  # explicit encoding avoids platform-dependent decoding
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels
    

In [None]:
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [None]:
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, and test labels
    return data_train, data_test, labels_train, labels_test
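
The shuffle step above relies on `sklearn.utils.shuffle` applying the same permutation to every sequence it receives, so each review stays paired with its label. The following is a minimal sketch on made-up toy data (the `toy_*` names are hypothetical, not part of the notebook); it also passes `random_state`, which the notebook does not, to make the order reproducible between runs.

```python
from sklearn.utils import shuffle

# Toy stand-ins for the review texts and their labels (illustrative only)
toy_reviews = ["good", "bad", "great", "awful"]
toy_labels = [1, 0, 1, 0]
expected = dict(zip(toy_reviews, toy_labels))

# Both lists are permuted with the same random order
shuffled_reviews, shuffled_labels = shuffle(toy_reviews, toy_labels, random_state=0)

# Re-running with the same seed gives the same order
again_reviews, again_labels = shuffle(toy_reviews, toy_labels, random_state=0)

# Every review is still next to its original label
pairs_intact = all(expected[r] == l for r, l in zip(shuffled_reviews, shuffled_labels))
print(pairs_intact)
```

Fixing the seed is a design choice: it makes cached preprocessing results and later accuracy figures comparable across runs.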



In [None]:
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000


In [None]:
print(train_X[100])
print(train_y[100])
print(test_X[10])

On the back burner for years (so it was reported) this television reunion of two of the most beloved characters in sitcom history started off badly - and went straight downhill from there. Mary Richards (Mary Tyler Moore) and her best friend Rhoda Morgenstern (Valerie Harper) meet in New York after a long estrangement and catch up on each other's lives. What a novel concept! But, sad to relate, nothing worth talking about (let alone making a movie about) has happened to either of them in the intervening years. So, instead, the script contents itself with throwing out one hoary old plot device after another (most having to do with older women in the workplace), while completely missing the quirky charm and sophistication that made the original show a winner. The supporting cast is instantly forgettable, the humor is nonexistent, and the chemistry which Moore and Harper once had together is gone. Moore allegedly stalled this project for years, waiting for "just the right script" before c

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    """Convert a raw review string into a list of stemmed, stopword-free tokens."""
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per review
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case and replace non-alphanumerics with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word with the shared stemmer
    
    return words

In [None]:
print(review_to_words(train_X[100]))

['back', 'burner', 'year', 'report', 'televis', 'reunion', 'two', 'belov', 'charact', 'sitcom', 'histori', 'start', 'badli', 'went', 'straight', 'downhil', 'mari', 'richard', 'mari', 'tyler', 'moor', 'best', 'friend', 'rhoda', 'morgenstern', 'valeri', 'harper', 'meet', 'new', 'york', 'long', 'estrang', 'catch', 'live', 'novel', 'concept', 'sad', 'relat', 'noth', 'worth', 'talk', 'let', 'alon', 'make', 'movi', 'happen', 'either', 'interven', 'year', 'instead', 'script', 'content', 'throw', 'one', 'hoari', 'old', 'plot', 'devic', 'anoth', 'older', 'women', 'workplac', 'complet', 'miss', 'quirki', 'charm', 'sophist', 'made', 'origin', 'show', 'winner', 'support', 'cast', 'instantli', 'forgett', 'humor', 'nonexist', 'chemistri', 'moor', 'harper', 'togeth', 'gone', 'moor', 'allegedli', 'stall', 'project', 'year', 'wait', 'right', 'script', 'commit', 'one', 'consid', 'right', 'earth', 'one', 'turn', 'like', 'age', 'charact', 'time', 'inevit', 'march', 'almost', 'complet', 'lack', 'imagin', '

In [None]:

import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # cache missing or unreadable, but that's okay: recompute below
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test
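
The read-or-recompute pattern used by `preprocess_data` can be isolated into a small helper. The following is a minimal sketch using only the standard library; `load_or_compute` and `slow_step` are hypothetical names introduced for illustration, and the cached dictionary is toy data standing in for the tokenized reviews.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return (result, cache_hit). Hypothetical helper, not part of the notebook."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True  # cache hit
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # cache missing or unreadable: fall through and recompute

    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result, False  # cache miss; the result is now stored for next time


def slow_step():
    """Stand-in for the expensive tokenization of all reviews."""
    return {"words_train": [["good", "movi"]], "labels_train": [1]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cache.pkl")
    first, first_hit = load_or_compute(path, slow_step)    # computes and writes
    second, second_hit = load_or_compute(path, slow_step)  # reads from disk

print(first_hit, second_hit)
```

Note that a cache keyed only by file name will return stale results if the preprocessing function changes; deleting the cache file forces a recompute.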

In [None]:
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file: preprocessed_data.pkl


In [None]:
test_X[10]

In [None]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    # Count how often each word appears in `data`. Note that `data` is a list of sentences and that a
    # sentence is a list of words.
    word_count = {} # A dict storing the words that appear in the reviews along with how often they occur

    for data_item in data:
        for word in data_item:
            word_count[word] = word_count.get(word, 0) + 1
    
    # Sort the words found in `data` so that sorted_words[0] is the most frequently appearing word and
    # sorted_words[-1] is the least frequently appearing word. The built-in sort runs in O(n log n),
    # unlike a hand-written nested-loop swap sort, which is quadratic in the vocabulary size.
    sorted_words = sorted(word_count, key=word_count.get, reverse=True)
    
    word_dict = {} # This is what we are building, a dictionary that translates words into integers
    for idx, word in enumerate(sorted_words[:vocab_size - 2]): # The -2 is so that we save room for the 'no word'
        word_dict[word] = idx + 2                              # and 'infrequent' labels
        
    return word_dict

In [None]:
word_dict = build_dict(train_X)

In [None]:
print(list(word_dict.items())[:5])
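
`build_dict` starts its indices at 2 so that 0 and 1 stay free for the 'no word' and 'infrequent' labels. The notebook does not show how those two labels are used, so the following is only a sketch under the assumption that 0 pads short reviews and 1 replaces out-of-vocabulary words. `convert_and_pad`, the `pad` length, and `toy_word_dict` are hypothetical names chosen for illustration.

```python
NOWORD = 0   # assumed meaning: padding for reviews shorter than `pad`
INFREQ = 1   # assumed meaning: any word missing from word_dict


def convert_and_pad(word_dict, sentence, pad=10):
    """Encode a list of words as a fixed-length list of integers (illustrative sketch)."""
    encoded = [NOWORD] * pad
    for i, word in enumerate(sentence[:pad]):  # truncate reviews longer than `pad`
        encoded[i] = word_dict.get(word, INFREQ)
    return encoded, min(len(sentence), pad)


# Toy dictionary in the same shape that build_dict returns
toy_word_dict = {"movi": 2, "film": 3, "good": 4}

encoded, length = convert_and_pad(toy_word_dict, ["good", "movi", "zzz"], pad=5)
print(encoded, length)  # [4, 2, 1, 0, 0] 3
```

Returning the unpadded length alongside the encoding lets a downstream model ignore the padding positions.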

In [None]:
import pandas as pd  
from bs4 import BeautifulSoup
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)     

def review_to_words(raw_review):
    removedTags = BeautifulSoup(raw_review, "html.parser")         #1. Remove HTML (explicit parser avoids a bs4 warning)
    upperAndLowerRemains = re.sub('[^a-zA-Z]'," ",removedTags.get_text()) #2. Replace non-letters with spaces
    toLowerAndSplit = upperAndLowerRemains.lower().split() #3. Convert to lowercase and split into words
    stops = set(stopwords.words('english'))
    stopwordsRemoved = [w for w in toLowerAndSplit if w not in stops]  #4. Remove stop words
    complete_review = " ".join(stopwordsRemoved)  #5. Join the words back into one string
    return complete_review
    
CleanedListOfReviews = []
BagOfWords = []  # NOTE: despite its name, this list holds the sentiment label of each review
numReviews = train["review"].size
for iterator in range(0, numReviews):
    if iterator%1000 == 0 or iterator == numReviews - 1:     # Report progress every 1000 reviews and on the last one
        print("Cleaned Reviews: ",iterator)
    complete_review = review_to_words(train["review"][iterator])
    CleanedListOfReviews.append(complete_review)
    BagOfWords.append(train["sentiment"][iterator])




vocabularySize = 5000
smoothingFactor = 5
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = None, max_features = vocabularySize)  # Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.  

data_features = vectorizer.fit_transform(CleanedListOfReviews)
data_features = data_features.toarray()

trainingSet = 0.8
dataSize = train['review'].size
trainSize = dataSize * trainingSet
SentimentsTrainedSet = []
ReviewsTrainedSet = []
ReviewsValidationSet = []
SentimentsValidationSet = []
for i in range(0, dataSize):
    if i < trainSize:
        SentimentsTrainedSet.append(BagOfWords[i])
        ReviewsTrainedSet.append(data_features[i])
    else:
        SentimentsValidationSet.append(BagOfWords[i])
        ReviewsValidationSet.append(data_features[i])



# Fitting the model to Naive Bayes Classifier
clf = MultinomialNB(alpha=smoothingFactor)
clf.fit(np.array(ReviewsTrainedSet), np.array(SentimentsTrainedSet))


#Predicting on Validation set
pred_labels = clf.predict(np.array(ReviewsValidationSet))
val_labels = np.array(SentimentsValidationSet)

# Calculating accuracy
correct = float((pred_labels == val_labels).sum())  # number of correct predictions
total = val_labels.size

acc_perc = (correct/total)*100
print("\nAccuracy on 20% validation set with smoothing factor ",smoothingFactor," and vocabulary size ",vocabularySize," is: ",acc_perc)





[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Cleaned Reviews:  0
Cleaned Reviews:  1000
Cleaned Reviews:  2000
Cleaned Reviews:  3000
Cleaned Reviews:  4000
Cleaned Reviews:  5000
Cleaned Reviews:  6000

Accuracy on 20% validation set with smoothing factor  5  and vocabulary size  5000  is:  86.321094312455
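
The cell above imports `train_test_split` and `accuracy_score` but splits and scores by hand. The following is a minimal sketch of the same bag-of-words Naive Bayes pipeline using those two helpers; it is an alternative formulation, not the notebook's method. The toy corpus, `alpha=1.0`, `test_size=0.25`, and `random_state=0` are assumptions chosen for illustration, and the split is shuffled and stratified rather than sequential.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the cleaned reviews (illustrative only)
toy_reviews = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful cast", "great fun great film",
    "terrible movie hated it", "awful acting boring plot",
    "hated the boring cast", "awful dull terrible film",
]
toy_sentiments = [1, 1, 1, 1, 0, 0, 0, 0]

# MultinomialNB accepts the sparse matrix directly, so .toarray() is not required
X = CountVectorizer().fit_transform(toy_reviews)

# Shuffled split that keeps the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X, toy_sentiments, test_size=0.25, stratify=toy_sentiments, random_state=0)

clf = MultinomialNB(alpha=1.0).fit(X_train, y_train)
acc = accuracy_score(y_val, clf.predict(X_val))  # fraction in [0, 1], not a percentage
print("Validation accuracy:", acc)
```

Keeping the features sparse matters at the notebook's scale, where a dense matrix of all reviews by 5000 features is large.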
