# Sentiment Polarity Prediction with Naive Bayes

This notebook contains a basic implementation of document-level sentiment analysis
for movie reviews, using multinomial Naive Bayes with bag-of-words features,
together with a cross-validation evaluation.
* No special treatment of rare or unknown words. Unknown words in the test data are skipped.
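
The notebook's own model lives in `Classes_and_functions` and is not shown here. As a rough illustration of the technique, a minimal multinomial Naive Bayes sketch might look like the following. The function names `train_nb` and `predict_nb`, the add-one smoothing and the toy data are assumptions made for this example, not the notebook's implementation. Words outside the training vocabulary are skipped at prediction time, matching the note above.

```python
import math
from collections import Counter, defaultdict


def train_nb(labelled_docs):
    """Fit a multinomial Naive Bayes model.

    labelled_docs: iterable of (tokens, label) pairs (hypothetical input format).
    """
    doc_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    for tokens, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)

    vocab = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in doc_counts.items()}

    log_likelihood = {}
    for c in doc_counts:
        # Add-one (Laplace) smoothing: an assumption of this sketch.
        denom = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + 1) / denom) for word in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior score."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][word] for word in tokens if word in vocab)  # skip unknown words
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy usage with made-up documents
toy_model = train_nb([
    ("great great film".split(), "pos"),
    ("boring bad film".split(), "neg"),
])
print(predict_nb(toy_model, "great film".split()))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied for a long review.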

We use version 2.0 of the movie review polarity data set of Pang and Lee (2004), [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](https://www.aclweb.org/anthology/P04-1035/), available from http://www.cs.cornell.edu/People/pabo/movie-review-data (section "Sentiment polarity datasets"). This dataset contains 1000 positive and 1000 negative reviews, each tokenised, sentence-split (one sentence per line) and lowercased. The authors assigned each review to one of 10 cross-validation folds, and this setup should be followed when comparing with published results.
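
As a sketch of how the fold assignment can be recovered, the snippet below assumes the file naming scheme described in the data set's README, where names look like `cv123_45678.txt` and files tagged `cv000` to `cv099` form the first fold, `cv100` to `cv199` the second, and so on. The helper name `fold_of` and the example file names are hypothetical, and the naming assumption is worth checking against the README that ships with the data.

```python
import re


def fold_of(filename):
    """Map a review file name to a 0-based fold number (assumed naming scheme)."""
    match = re.match(r"cv(\d{3})_", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    # The hundreds digit of the three-digit tag selects the fold.
    return int(match.group(1)) // 100


# Illustrative file names only
for name in ["cv000_12345.txt", "cv123_45678.txt", "cv999_00001.txt"]:
    print(name, fold_of(name))
```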


# Setup

### Import packages

In [1]:
import os
import time
import pandas as pd
from tqdm.auto import tqdm

### Import the classes and functions from the '.py' file

In [2]:
%load_ext autoreload
%autoreload 1
%aimport Classes_and_functions

import Classes_and_functions as py

### Define the location of my chromedriver

In [3]:
chromedriver_location = 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver.exe'

# Load in the data

In [4]:
data_dict = py.load_data('data', chromedriver_location)

### Preview the documents in the data

In [5]:
fold = 0
for label in ['pos', 'neg']:
    print('\n====', label, '====')
    
    # Collect one row per document, then build the DataFrame once
    # (avoids calling pd.concat repeatedly inside the loop)
    rows = []
    list_of_documents = py.get_documents(data_dict, fold, label)
    for doc_num, document in enumerate(list_of_documents):
        doc_preview = py.get_document_preview(document, max_length=50)
        rows.append({'doc_num': doc_num, 'sentences': len(document), 'start_of_first_sentence': doc_preview})

    print(pd.DataFrame(rows, columns=['doc_num', 'sentences', 'start_of_first_sentence']))


==== pos ====
   doc_num sentences                            start_of_first_sentence
0        0        25  films|adapted|from|comic|books|have|had|plenty|of
1        1        39      every|now|and|then|a|movie|comes|along|from|a
2        2        19  you've|got|mail|works|alot|better|than|it|dese...
3        3        42  "|jaws|"|is|a|rare|film|that|grabs|your|attention
4        4        25        moviemaking|is|a|lot|like|being|the|general
..     ...       ...                                                ...
95      95        26               "|crazy/beautiful|"|suffers|from|the
96      96        29  everyone|knows|someone|like|giles|de'ath|:|stuffy
97      97        51  for|many|people|,|procrastination|isn't|a|problem
98      98        26         meet|joe|black|(|reviewed|on|nov|.|27/98|)
99      99        48      call|it|touched|by|a|demon|.|gregory|hoblit's

[100 rows x 3 columns]

==== neg ====
   doc_num sentences                            start_of_first_sentence
0        0

# Create training-test splits for Cross-Validation

In [6]:
train_test_splits = py.get_train_test_splits(data_dict)

### Show the splits

In [7]:
# Build the DataFrame of split sizes in one step, one row per split
num_docs_in_split = pd.DataFrame(
    [{"train_set_size": len(train_data), "test_set_size": len(test_data)}
     for train_data, test_data in train_test_splits]
)

num_docs_in_split

   train_set_size  test_set_size
0            1800            200
1            1800            200
2            1800            200
3            1800            200
4            1800            200
5            1800            200
6            1800            200
7            1800            200
8            1800            200
9            1800            200
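
The 1800/200 sizes above follow from holding out one of the 10 folds as test data and training on the other nine. The implementation of `py.get_train_test_splits` is not shown, so the following is only a sketch under the assumption that the documents are available as a dictionary from fold number to a list of `(document, label)` pairs. The name `make_splits` and that input layout are hypothetical.

```python
def make_splits(docs_by_fold):
    """Build one (train, test) pair per fold, holding that fold out as test data.

    docs_by_fold: dict mapping fold number -> list of (document, label) pairs
    (an assumed layout, not necessarily the notebook's data_dict).
    """
    splits = []
    for test_fold in sorted(docs_by_fold):
        test = list(docs_by_fold[test_fold])
        train = [
            item
            for fold, items in sorted(docs_by_fold.items())
            if fold != test_fold
            for item in items
        ]
        splits.append((train, test))
    return splits


# Toy data: 10 folds with 100 'pos' and 100 'neg' placeholder documents each
toy = {
    fold: [(f"doc-{fold}-{i}", "pos" if i < 100 else "neg") for i in range(200)]
    for fold in range(10)
}
toy_splits = make_splits(toy)
print(len(toy_splits), len(toy_splits[0][0]), len(toy_splits[0][1]))
```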


# Define the Naive Bayes model

In [8]:
model = py.Naive_Bayes()

### Measure the model performance using one train-test split

In [9]:
sample_train_data, sample_test_data = train_test_splits[0]
model.train(sample_train_data)
py.print_first_n_predictions(model, sample_test_data, num_predictions=10, len_preview=50)

Accuracy = 0.795
Confusion_matrix:
 [[82 18]
 [23 77]]


   label  prediction                                          documents
0    pos         neg  films|adapted|from|comic|books|have|had|plenty|of
1    pos         pos      every|now|and|then|a|movie|comes|along|from|a
2    pos         pos  you've|got|mail|works|alot|better|than|it|dese...
3    pos         pos  "|jaws|"|is|a|rare|film|that|grabs|your|attention
4    pos         neg        moviemaking|is|a|lot|like|being|the|general
5    pos         pos   on|june|30|,|1960|,|a|self-taught|,|idealistic|,
6    pos         pos  apparently|,|director|tony|kaye|had|a|major|ba...
7    pos         pos  one|of|my|colleagues|was|surprised|when|i|told...
8    pos         pos        after|bloody|clashes|and|independence|won|,
9    pos         pos  the|american|action|film|has|been|slowly|drowning
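
The reported accuracy can be checked against the confusion matrix: the diagonal holds the correctly classified documents, so accuracy is the diagonal sum divided by the total. This check does not depend on whether rows are true labels or predictions, which the output does not state. The variable names below are chosen for this example.

```python
# Confusion matrix copied from the output above
confusion = [[82, 18],
             [23, 77]]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # 82 + 77 = 159
total = sum(sum(row) for row in confusion)                     # 200 test documents
accuracy = correct / total
print(accuracy)  # 0.795
```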


### Measure the model performance with cross-validation across all train-test splits

In [None]:
py.evaluate_model(model, train_test_splits, verbose=True)
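
The body of `py.evaluate_model` is not shown. Judging from how its result is unpacked in the model comparison in this notebook, it returns an average, a standard deviation, a minimum and a maximum over the folds. A minimal sketch of such a summary is given below. The name `summarise_accuracies`, the use of the population standard deviation and the example accuracies are assumptions of this example.

```python
import statistics


def summarise_accuracies(accuracies):
    """Return (mean, standard deviation, min, max) of per-fold accuracies."""
    return (
        statistics.mean(accuracies),
        statistics.pstdev(accuracies),  # population std dev; the notebook may use the sample version
        min(accuracies),
        max(accuracies),
    )


# Made-up per-fold accuracies, for illustration only
avg, std_dev, lowest, highest = summarise_accuracies([0.80, 0.90])
print(avg, std_dev, lowest, highest)
```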

# Compare different models

In [None]:
models_to_compare = {'Naive Bayes with clip_counts': py.Naive_Bayes(clip_counts=True),
                     'Naive Bayes no clip_counts': py.Naive_Bayes(clip_counts=False),
                    }
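
What `clip_counts` does is defined in `Classes_and_functions` and not shown here. One common reading, assumed in this sketch, is that each word's count within a document is clipped to 1, so that the model sees word presence rather than word frequency. The function name `document_counts` is hypothetical.

```python
from collections import Counter


def document_counts(tokens, clip_counts=False):
    """Bag-of-words counts for one document.

    With clip_counts=True every word present counts once (assumed meaning of the flag).
    """
    counts = Counter(tokens)
    if clip_counts:
        return Counter({word: 1 for word in counts})
    return counts


tokens = "good good good bad".split()
print(document_counts(tokens))
print(document_counts(tokens, clip_counts=True))
```

Under this reading, clipping stops a single heavily repeated word from dominating a document's score.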

In [None]:
rows = []
for model_name, model in tqdm(models_to_compare.items()):

    # evaluate the model and time the evaluation
    start = time.time()
    avg, std_dev, min_avg, max_avg = py.evaluate_model(model, train_test_splits)
    duration = time.time() - start

    # store the results as one row per model
    rows.append({'Model': model_name, 'Accuracy': avg, 'Stddev': std_dev, 'Min': min_avg, 'Max': max_avg, 'Duration (s)': duration})

# build the results dataframe once, after the loop
eval_df = pd.DataFrame(rows, columns=['Model', 'Accuracy', 'Stddev', 'Min', 'Max', 'Duration (s)'])

In [None]:
eval_df.reset_index(drop=True)