# IST 664 - Natural Language Processing
**Mark Stiles | Logan Roach**<br>
**March 3, 2024**<br>
**Final Project: Stanford's Large Movie Review Dataset**<br>

# Libraries

In [1]:
import nltk
from nltk import sent_tokenize
from nltk.collocations import *
from bs4 import BeautifulSoup
import pathlib as pathlib
import re
import pandas as pd
import pickle
import random
import time
import statistics
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Data

The data is sourced from a Stanford's large movie review dataset. There are 25,000 reviews. Half are classified as positive and the other half negative. Each review is an open text block written about an unnamed movie and was reviewed and annoted as a research project. 

**URL: https://ai.stanford.edu/~amaas/data/sentiment/**

# Functions

Much of the work is reusable and is contained in the following list of functions that were developed over several iterations. They were separated into three groups: Data Prep, Feature Generation and Training functions.

### Data Prep Functions

**build_review_df**

The build_review_df function takes in a list of file paths and a class value (sentiment value). This is the initial conversion of data files to a dataset. It will load each of the files and build a dataframe. There will be two dataframes, one with positive and one with negative classes. 

In [4]:
def build_review_df(files, class_value):
    
    # build a list of the provided class value for each file in the list
    all_classes = [class_value for pf in pos_files]
    all_texts = []
    
    # loop through all files and get the text
    pos = 1
    for file_path in files:
        
        # read out contents
        f = open(file_path, "r", encoding="latin-1")
        text = f.read()
        f.close()

        # strip out tags
        soup = BeautifulSoup(text, 'html.parser')
        for data in soup(['br']):
            data.decompose()
        text = ' '.join(soup.stripped_strings)
                
        all_texts.append(text)

        pos += 1
    
    # create dataframe with the class and text lists
    new_df = pd.DataFrame({'sentiment': all_classes, 'review': all_texts})
    
    print(f'built review df for class: {class_value} with dimensions: {new_df.shape}')
    
    return new_df

**get_movie_review_text**

The get_movie_review_text function will find all the file names by class type and then use build_review_df to create two dataframes that will then be merged into one dataframe and stored as a .csv file. 

In [5]:
def get_movie_review_text():                
    
    # load files from local directory
    prefix = "./LargeMovieReviewData/train/"
    pos_files = [f for f in pathlib.Path(prefix + "pos").iterdir()]
    neg_files = [f for f in pathlib.Path(prefix + "neg").iterdir()]
    
    # create a dataframe for each list of classification files
    pos_df = build_review_df(pos_files,"pos")
    neg_df = build_review_df(neg_files,"neg")

    # merge the dataframes with positive and negative classes
    merge_df = pd.concat([pos_df, neg_df], axis=0)
    
    return merge_df

**tokenize_and_remove_stopwords**

The tokenize_and_remove_stopwords function is used by the bag of words feature functions to get a trimmed set of words by removing stop words and removing a configurable set of words by parts-of-speech category. Additional symbols representing unknown parts-of-speech will also be removed to help create a more compact and meaningful bag-of-words list. This uses the Penn-Treebank parts-of-speech list. 

**Source: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html**

In [6]:
def tokenize_and_remove_stopwords(word_string, pos_keys):
    
    word_list = nltk.word_tokenize(word_string)
    new_list = []
        
    # get pos tags for all words
    tagged_words = nltk.pos_tag(word_list)
        
    # store characters and parts of speech to be removed
    skip_values = ['#','as','s','ed','ii','al','t','\'the','%','\'m', '\'', '\'s','A','n\'t','do','\'re','\'ve','*','i','I','ca','\'ll','\'d','..','....','wo','Â','$','de','b','la']
    
    # dictionary of parts-of-speech groupings
    pos_dict = {
        'conj' : ['CC','IN'],
        'determ' : ['DT','PDT','WDT'],
        'noun' : ['NN','NNS','NNP','NNPS','PRP','PRP$','WP','WP$'],
        'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ','MD'],
        'adj' : ['JJ','JJR','JJS'],
        'adv' : ['RB','RBR','RBS','WRB'],
        'particle' : ['RP'],
        'marker' : ['LS'],
        'num' : ['CD','LS'],
        'foreign_word' : ['FW'],
        'symbol' : ['SYM'],
        'interj' : ['UH'],
        'to' : ['TO'],
        'ex' : ['EX'],
        'pos' : ['POS']
    }
    
    # gather all the requested pos tags into one list (with additional symbols)
    skip_tags = ['(',')',':','``',"''",'.',',']
    for key in pos_keys:
        skip_tags += pos_dict[key]
    
    # get nltk base stopwords
    nltk_stop_words = nltk.corpus.stopwords.words('english') 
    custom_stop_words = ['the', 'it', 'this', 'and', 'in'] + nltk_stop_words
    
    # filter out characters, POS tags and stop words
    filtered_list = []
    for (word, tag) in tagged_words:
        if(tag not in skip_tags and word not in skip_values and word not in custom_stop_words):
            filtered_list.append((word,tag))
    
    # extract only the remaining words and set them to lowercase
    word_list = [word.lower() for (word, tag) in filtered_list]
        
    return word_list

### Feature Generation Functions

**Bag-of-Words**

**sub_bow_features and generate_bow_features**

The sub_bow_features and generate_bow_features work together to build a bag-of-words feature set for all the reviews. The generate function takes care of the initial tokenization and word frequency distribution so it's not repeatedly being created and then iterates through each review while calling the sub function. The sub function then builds the features for each individual review. 

In [7]:
def sub_bow_features(word_string, word_features):
    
    lower_word_string = word_string.lower()

    features = {}
    
    # top word features
    for word in word_features:
        features[f'V_{word}'] = (word.lower() in lower_word_string)
        
    return features

def generate_bow_features(pos_categories, feature_count):
    
    tokenized_reviews = [tokenize_and_remove_stopwords(row.review,pos_categories) for (pos,row) in reviews_df.iterrows()]
    all_words_list = [word for review in tokenized_reviews for word in review]
    word_dist = nltk.FreqDist(all_words_list)
    top_ranked_words = word_dist.most_common(feature_count)
    top_words = [word for (word,count) in top_ranked_words]
    feature_sets = [(sub_bow_features(row.review, top_words), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Parts-of-Speech**

**count_tags**

The count_tags function will take a list of tags and sum all the counts for each tag in the list.

In [8]:
def count_tags(word_dist, tag_list):
    
    # sum the usage counts for each parts-of-speech tag in the list
    tag_count = sum(word_dist[tag] for tag in tag_list)
    
    return tag_count

**sub_pos_features and generate_pos_features**

The sub_pos_features and generate_pos_features functions work together to build a parts-of-speech feature set for all the reviews. The generate function iterates through each review while calling the sub function. The sub function first tokenizes and tags the review text with parts-of-speech and then builds a frequency distribution of the tags to then pass to the count_tags function to build the features for each individual review.

In [9]:
def sub_pos_features(word_string):
    
    word_list = nltk.word_tokenize(word_string)
    tags = [tag for (word,tag) in nltk.pos_tag(word_list)]
    word_dist = nltk.FreqDist(tags)

    features = {}
    
    # pos counts
    features['V_conj_count'] = count_tags(word_dist, ['CC','IN'])
    features['V_determ_count'] = count_tags(word_dist, ['DT', 'PDT', 'WDT'])
    features['V_noun_count'] = count_tags(word_dist, ['NN', 'NNS', 'NNP', 'NNPS', 'PRP', 'PRP$', 'WP', 'WP$'])
    features['V_verb_count'] = count_tags(word_dist, ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'MD'])
    features['V_adj_count'] = count_tags(word_dist, ['JJ', 'JJR', 'JJS'])
    features['V_adv_count'] = count_tags(word_dist, ['RB', 'RBR', 'RBS', 'WRB'])
    features['V_particle_count'] = count_tags(word_dist, ['RP'])
    features['V_marker_count'] = count_tags(word_dist, ['LS'])
    features['V_num_count'] = count_tags(word_dist, ['CD', 'LS'])
    features['V_foreign_word_count'] = count_tags(word_dist, ['FW'])
    features['V_symbol_count'] = count_tags(word_dist, ['SYM'])
    features['V_interj_count'] = count_tags(word_dist, ['UH'])
        
    return features

def generate_pos_features():
    
    feature_sets = [(sub_pos_features(row.review), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Text Statistics**

**sub_text_stat_features and generate_text_stat_features**

The sub_text_stat_features and generate_text_stat_features functions work together to build a feature set for all the reviews based on basic statistics like the count of capital letters or the length of the review. The generate function iterates through each review while calling the sub function. The sub function first tokenizes the review text and then calculates basic statistics to build the features for each review.

In [10]:
def sub_text_stat_features(word_string):
    
    word_list = nltk.word_tokenize(word_string)
    
    features = {}
    
    # text statistics
    features['V_cap_count'] = sum(1 for c in word_string if c.isupper())
    features['V_review_length'] = len(word_string)
    features['V_avg_word_length'] = int(statistics.mean(len(w) for w in word_list))    
    
    return features

def generate_text_stat_features():
    
    feature_sets = [(sub_text_stat_features(row.review), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Bigrams**

**sub_bigram_features and generate_bigram_features**

The sub_bigram_features and generate_bigram_features functions work together to build a feature set for all the reviews. The generate function creates a bi-gram finder and then finds the top bigrams limited to the value provided. It then iterates through each review while calling the sub function. The sub function first finds the bi-grams for the review text and then creates a feature with a value indicating if that bi-gram is present in the review.

In [11]:
def sub_bigram_features(word_string, bigrams):
    
    review_bigrams = nltk.bigrams(word_string)
    
    features = {}
    for b in bigrams:
        features['B_{}_{}'.format(b[0], b[1])] = (b in review_bigrams)
    
    return features
    
def generate_bigram_features(bigram_count):
    
    finder = BigramCollocationFinder.from_words(all_words_list)
    bigrams = finder.nbest(nltk.collocations.BigramAssocMeasures().chi_sq, bigram_count)
    
    feature_sets = [(sub_bigram_features(row.review, bigrams), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Nots**

**is_negation_word**

The is_negation_word function determines if the word provided is in the negation word list. 

In [12]:
def is_negation_word(word):
    
    negation_words = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather', 'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']

    return word in negation_words or word.endswith("n't")

**sub_not_features and generate_not_features**

The sub_not_features and generate_not_features functions work together to build a feature set for all the reviews. The generate function iterates through each review while calling the sub function. The sub function first tokenizes the review text and then creates a feature with the count of negation words in the review.

In [13]:
def sub_not_features(word_string):
    
    word_list = nltk.word_tokenize(word_string)
    
    features = {}
    features['V_not_count'] = sum(1 for w in word_list if is_negation_word(w))    
    
    return features

def generate_not_features():
    
    feature_sets = [(sub_not_features(row.review), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Negations**

**sub_negation_features and generate_negation_features**

The sub_negation_features and generate_negation_features functions work together to build a feature set for all the reviews. The generate function first uses the global all word distribution to find the top words limited by the value provided. It then iterates through each review while calling the sub function. The sub function uses the top words to build features by determining if each top word is in the review and preceded by a 'not'. This indicates that the value of the word itself is reversed.

In [14]:
def sub_negation_features(word_string, word_features):
    
    # break up the review text by word
    word_list = nltk.word_tokenize(word_string)
    features = {}
    
    # build all features first
    for word in word_features:
        features[f'V_not_{word}'] = False
        
    # unset only the word features matching all conditions
    for i in range(0, len(word_features)):
        
        # get current word feature and determine condition values
        word = word_features[i]
        pos_exists = (i + 1) < len(word_features)
        neg_word = is_negation_word(word)
        
        #if there's another word and the current word is a negation set the next word as not if its in the review text
        if pos_exists and is_negation_word(word):
            i += 1
            next_word = word_features[i]
            next_word_in_list = next_word in word_features
            features[f'V_not_{next_word}'] = next_word_in_list
            
    return features

def generate_negation_features(feature_count):
    
    word_items = word_dist.most_common(feature_count)
    word_features = [word for (word,count) in word_items]
    
    feature_sets = [(sub_negation_features(row.review, word_features), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**TF-IDF**

**tfidf_tokenizer**

The tfidf_tokenizer function is used by the TF-IDF Vectorizer to tokenize the review text. 

In [15]:
def tfidf_tokenizer(text):
    words = nltk.word_tokenize(text)
    return words

**sub_tfidf_features and generate_tfidf_features**

The sub_tfidf_features and generate_tfidf_features functions work together to build a feature set for all the reviews. The generate function first uses the TF-IDF Vectorizer to compute the top terms by term frequency / inverse document frequency value for each review limiting the number of features it tracks to the value provided. It then iterates through each review while calling the sub function. The sub function uses the top tf-idf words and builds features that indicate if the top word is within the review text.

In [1]:
def sub_tfidf_features(word_string, tfidf_values):
    
    word_list = nltk.word_tokenize(word_string)
    
    features = {}
    for word in tfidf_values:
        features['TF_{}'.format(word)] = word in word_list
        
    return features
    
def generate_tfidf_features(max_features):
    
    tfidf = TfidfVectorizer(tokenizer=tfidf_tokenizer, stop_words='english', max_features=max_features)
    sparse_tfidf_texts = tfidf.fit_transform(reviews_df.review)
    tfidf_values = {d:c for (d, c) in zip(tfidf.get_feature_names_out(), tfidf.idf_)}
    
    feature_sets = [(sub_tfidf_features(row.review, tfidf_values), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_sets

**Vader**

**sub_vader_features and generate_vader_features**

The sub_vader_features and generate_vader_features functions work together to build a feature set for all the reviews. The generate function first creates the Vader Sentiment Intensity Analyzer and then iterates through each review while passing the analyzer to the sub function. The sub function uses the polarity scores to build features that indicate if the review text is positive or negative.

In [17]:
def sub_vader_features(word_string, vader_analyzer):

    scores = vader_analyzer.polarity_scores(word_string)
    agg_score = scores['compound']
    features = {}
    features['S_vader'] = True if agg_score >= 0.1 else False
    
    return features
    
def generate_vader_features():
    
    analyzer = SentimentIntensityAnalyzer()
    
    feature_set = [(sub_vader_features(row.review, analyzer), row.sentiment) for (pos,row) in reviews_df.iterrows()]
    
    return feature_set

### Training Functions

**merge_feature_sets**

The merge_feature_sets function takes two feature sets and combines the features while retaining the original sentiment.

In [18]:
def merge_feature_sets(set_1, set_2):
    
    if len(set_1) != len(set_2):
        print('the feature sets are of different lengths')
        return
    
    new_set = []
    for i in range(0,len(set_1)):
        sent = set_1[i][1]
        dict_1 = {}
        dict_2 = set_1[i][0]
        dict_3 = set_2[i][0]
        dict_1.update(dict_2)
        dict_1.update(dict_3)
        tup = (dict_1,sent)
        new_set.append(tup)
        
    return new_set

**eval_measures**

The eval_measures function takes the labels and predictions for a training result and calculates the recall, precision and f1 statistics. It then prints out the statistics into a table.

In [19]:
def eval_measures(labels, predictions):
    
    # get a list of labels
    distinct_labels = list(set(labels))
    
    # these lists have values for each label 
    rec_list = []
    prec_list = []
    f_list = []
    
    # for each label, compare gold and predicted lists and compute values
    for lab in distinct_labels:
        TP = FP = FN = TN = 0
        for i, val in enumerate(labels):
            if val == lab and predictions[i] == lab:  TP += 1
            if val == lab and predictions[i] != lab:  FN += 1
            if val != lab and predictions[i] == lab:  FP += 1
            if val != lab and predictions[i] != lab:  TN += 1
                
        # use these to compute recall, precision, F1
        recall = TP / (TP + FP) if (TP + FP) > 0 else 0
        precision = TP / (TP + FN) if (TP + FN) > 0 else 0
        f1 = 2 * (recall * precision) / (recall + precision) if (recall + precision) > 0 else 0
        
        rec_list.append(recall)
        prec_list.append(precision)
        f_list.append(f1)

    # the evaluation measures in a table with one row per label
    print('\tPrecision\tRecall\t\tF1')
    
    # print measures for each label
    for i, lab in enumerate(distinct_labels):
        print(f'{lab}\t{prec_list[i]:10.3f}\t{rec_list[i]:10.3f}\t{f_list[i]:10.3f}')
    print()

**build_and_test_classifier**

The build_and_test_classifier function takes a list of feature_set and the number of cross validation folds to create and then sequentially partitions the feature dataset to build a model using a different holdout set. It then averages the accuracy of all models to get an overall model accuracy.

In [20]:
def build_and_test_classifier(feature_sets, num_folds):
    
    # calculate cross validation fold size
    fset_length = len(feature_sets)
    print(f'feature set length: {fset_length}')
    fold_size = int(fset_length / num_folds)
    print(f'Fold size: {fold_size}')
    print()
    
    # loop through number of folds and build classifier
    acc_results = []
    for i in range(num_folds):
        
        # calculate the position of elements for test and train set
        fold_start = i * fold_size
        fold_end = fold_start + fold_size
        test_set = feature_sets[fold_start:fold_end]
        train_set = feature_sets[:fold_start] + feature_sets[fold_end:]
        
        # train model
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        
        # calculate and store accuracy
        acc = nltk.classify.accuracy(classifier, test_set)
        acc_results.append(acc)
        
        print(f'Round {i+1} accuracy: {acc}')
        
        labels = []
        predictions = []
        for (features, label) in test_set:
            labels.append(label)
            predictions.append(classifier.classify(features))
        
        eval_measures(labels, predictions)
        
        print(classifier.show_most_informative_features(3))
        print()
            
    # print cross validation accuracy
    print (f'Mean Accuracy: {((sum(acc_results) / num_folds) * 100):0.2f}%', )
    print()

## Load Data

In [2]:
reviews_df = get_movie_review_text()

#print(reviews_df.head())

In [3]:
# save data
#reviews_df.to_csv("reviews.csv")

# load data
#reviews_df = pd.read_csv("reviews.csv")

# Randomize rows
reviews_df = reviews_df.sample(frac = 1)

## Review Tokenization and Distribution

This section tokenizes all the review texts into a single word list and then builds a frequency distribution that can be used globally for other functions to save time. 

In [21]:
# tokenizes all text words and builds a frequency distribution for them
tokenized_reviews = [tokenize_and_remove_stopwords(row.review,[]) for (pos,row) in reviews_df.iterrows()]
all_words_list = [word for review in tokenized_reviews for word in review]
word_dist = nltk.FreqDist(all_words_list)

## Review Corpus Statistics

This sections prints some basic word statistics for the corpus.

In [22]:
print(f'Word Count: {len(all_words_list):,}')
print(f'Unique Word Count: {len(word_dist):,}')
print('Top 50 Words')
top_ranked_words = word_dist.most_common(50)
top_words = [word for (word,count) in top_ranked_words]
print(top_words)

Word Count: 3,188,602
Unique Word Count: 104,013
Top 50 Words
['the', 'movie', 'film', 'one', 'like', 'it', 'this', 'good', 'would', 'even', 'time', 'story', 'really', 'see', 'much', 'well', 'could', 'people', 'get', 'also', 'bad', 'great', 'first', 'made', 'way', 'make', 'movies', 'but', 'think', 'characters', 'and', 'character', 'watch', 'films', 'two', 'many', 'seen', 'never', 'acting', 'little', 'plot', 'best', 'love', 'life', 'show', 'there', 'know', 'in', 'ever', 'better']


# Experiments

## Setup
Each experiment is iterated across five rounds of cross validation to evaluate the model performance and results. For all model evaluations, the model’s performance in the experiments is evaluated using accuracy, precision, recall and F1 score for both positive (pos) and negative (neg) sentiment classes. Accuracy measures the overall correctness of the model. Precision measures the proportion of true positive instances out of all positive predictions. Recall measures the proportion of true positive instances that were correctly predicted out of all actual positive instances and F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics. 

In [23]:
# List of keys for parts-of-speech groups
pos_keys = ['conj','determ','noun','verb','adj','adv','particle','marker','num','foreign_word','symbol','interj','to','ex','pos']

# training parameters
num_folds = 5
bow_feature_count = 1500
bigram_feature_count = 500
negation_feature_count = 2000
tfidf_feature_count = 1000

In [24]:
# Start timer
tic_global = time.perf_counter()

## Experiment 1

Experiment 1 is a baseline for experiment 2 and 3. It generates a bag-of-words feature set from all the unique words in the reviews and would includes all words from all parts-of-speech categories. The further experiments will remove varying amounts of text based on parts-of-speech tags to identify if there are any relationships between part-of-speech and the accuracy. 

In [25]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 1: test with all pos in bag of words')
print()
bow_1 = generate_bow_features([], bow_feature_count)
build_and_test_classifier(bow_1, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 1: test with all pos in bag of words

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.8262
	Precision	Recall		F1
neg	     0.825	     0.827	     0.826
pos	     0.828	     0.825	     0.827

Most Informative Features
             V_pointless = True              neg : pos    =     11.3 : 1.0
             V_laughable = True              neg : pos    =      9.5 : 1.0
                 V_waste = True              neg : pos    =      8.9 : 1.0
None

Round 2 accuracy: 0.8352
	Precision	Recall		F1
neg	     0.820	     0.841	     0.831
pos	     0.849	     0.830	     0.840

Most Informative Features
             V_laughable = True              neg : pos    =      9.7 : 1.0
                 V_worst = True              neg : pos    =      9.1 : 1.0
             V_pointless = True              neg : pos    =      9.1 : 1.0
None

Round 3 accuracy: 0.8186
	Precision	Recall		F1
neg	     0.825	     0.821	     0.823
pos	     0.812	     0.816	     0.814

Most Informative Features
  

### Results

In this experiment, all parts of speech  were included in the bag of words feature set of 25,000 reviews. The performance of each round varies slightly across all five rounds where there is a slight dip in the third round but then improves again in the following round of experimentation. The most indicative features used to distinguish sentiment remain consistent across all rounds where words like "pointless," "laughable," and "waste" are strongly associated with negative sentiment. Overall, the mean accuracy across all rounds was 83.14% suggesting the model is reasonably accurate in sentiment classification, but with room for improvement.  

## Experiment 2

Experiment 2 iterates through each group of parts-of-speech (ex: all noun types are grouped together) and removes them from the list of words used to determine the bag-of-words. Each iteration will determine what effect that group had on the overall performance. A lower score would indicate that group as having a positive impact which caused a deficit when those words were removed. And conversely, a higher score would indicate that group as having a negative impact on prediction. 

In [27]:
# restart timer
tic = time.perf_counter()

bow_2 = {}
print('EXPERIMENT 2: test with one pos removed at a time')
print()
for key in pos_keys:
    bow_2[key] = generate_bow_features([key], bow_feature_count)
    print(f'build model removing {key} pos')
    build_and_test_classifier(bow_2[key], num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 2: test with one pos removed at a time

build model removing conj pos
feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.8262
	Precision	Recall		F1
neg	     0.828	     0.825	     0.826
pos	     0.825	     0.827	     0.826

Most Informative Features
             V_pointless = True              neg : pos    =     11.3 : 1.0
             V_laughable = True              neg : pos    =      9.5 : 1.0
             V_redeeming = True              neg : pos    =      9.3 : 1.0
None

Round 2 accuracy: 0.8368
	Precision	Recall		F1
neg	     0.819	     0.845	     0.832
pos	     0.854	     0.829	     0.842

Most Informative Features
             V_laughable = True              neg : pos    =      9.7 : 1.0
             V_redeeming = True              neg : pos    =      9.6 : 1.0
                 V_worst = True              neg : pos    =      9.1 : 1.0
None

Round 3 accuracy: 0.8208
	Precision	Recall		F1
neg	     0.827	     0.823	     0.825
pos	     0.815	     0.818	     0.8

Round 2 accuracy: 0.812
	Precision	Recall		F1
neg	     0.794	     0.818	     0.806
pos	     0.829	     0.806	     0.817

Most Informative Features
                 V_10/10 = True              pos : neg    =     14.9 : 1.0
             V_pointless = True              neg : pos    =      9.1 : 1.0
                V_poorly = True              neg : pos    =      9.0 : 1.0
None

Round 3 accuracy: 0.799
	Precision	Recall		F1
neg	     0.809	     0.800	     0.805
pos	     0.788	     0.798	     0.793

Most Informative Features
                 V_10/10 = True              pos : neg    =     15.2 : 1.0
             V_pointless = True              neg : pos    =     10.1 : 1.0
                 V_waste = True              neg : pos    =      9.0 : 1.0
None

Round 4 accuracy: 0.8138
	Precision	Recall		F1
neg	     0.812	     0.810	     0.811
pos	     0.816	     0.818	     0.817

Most Informative Features
                 V_10/10 = True              pos : neg    =     10.8 : 1.0
             V_pointl

Round 3 accuracy: 0.8186
	Precision	Recall		F1
neg	     0.828	     0.819	     0.824
pos	     0.809	     0.818	     0.813

Most Informative Features
             V_pointless = True              neg : pos    =     10.1 : 1.0
             V_laughable = True              neg : pos    =      9.2 : 1.0
                 V_waste = True              neg : pos    =      9.0 : 1.0
None

Round 4 accuracy: 0.8372
	Precision	Recall		F1
neg	     0.841	     0.830	     0.835
pos	     0.833	     0.845	     0.839

Most Informative Features
             V_pointless = True              neg : pos    =      9.6 : 1.0
                 V_worst = True              neg : pos    =      9.3 : 1.0
             V_laughable = True              neg : pos    =      8.9 : 1.0
None

Round 5 accuracy: 0.8396
	Precision	Recall		F1
neg	     0.831	     0.849	     0.840
pos	     0.848	     0.831	     0.839

Most Informative Features
             V_pointless = True              neg : pos    =     11.5 : 1.0
                 V_

Round 4 accuracy: 0.8374
	Precision	Recall		F1
neg	     0.839	     0.831	     0.835
pos	     0.836	     0.843	     0.840

Most Informative Features
             V_pointless = True              neg : pos    =      9.6 : 1.0
                 V_worst = True              neg : pos    =      9.3 : 1.0
             V_laughable = True              neg : pos    =      8.9 : 1.0
None

Round 5 accuracy: 0.8402
	Precision	Recall		F1
neg	     0.830	     0.851	     0.840
pos	     0.851	     0.830	     0.840

Most Informative Features
             V_pointless = True              neg : pos    =     11.5 : 1.0
                 V_worst = True              neg : pos    =      9.3 : 1.0
             V_laughable = True              neg : pos    =      8.4 : 1.0
None

Mean Accuracy: 83.14%

build model removing ex pos
feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.825
	Precision	Recall		F1
neg	     0.824	     0.826	     0.825
pos	     0.826	     0.824	     0.825

Most Informative Features
 

### Results

In this experiment, the model’s performance is evaluated with one part of speech removed from the bag of word features during each iteration. The parts of speech removed for each iteration in order are conjugation, determiner, noun, verb, adjective, adverb, particle, marker, numerical, foreign words, symbols, interjection, to, ex (existential there) and pos (posessive endings). 

When removing conjugation parts of speech, accuracy ranges from 82.08% to 84.08% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment "pointless," "laughable," "redeeming," and "worst" are strongly associated with negative sentiment. Overall, the mean accuracy is across all rounds 83.28% suggesting the model is reasonably accurate in sentiment classification when removing conjugation parts of speech but does not show significant signs of improvement.  

When removing the determiner parts of speech, accuracy ranges from 81.74% to 83.96% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment "pointless," "laughable," and "worst" are strongly associated with negative sentiment. Overall, the mean accuracy is across all rounds 83.09% suggesting the model is reasonably accurate in sentiment classification when removing determiner parts of speech with comparable results to the model’s results when including all parts of speech.

When removing noun parts of speech, accuracy ranges from 83.7% to 85.6% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment are "3/10," "4/10," and "7/10”, which are assumed to be ratings out of ten with the lower scoring ratings corresponding to negative sentiment and the higher rating corresponding to positive sentiment. Overall, the mean accuracy is 84.61%, which is the highest in this experiment, suggesting when removing noun parts of speech is reasonably accurate in sentiment classification but only slightly improved from using all parts of speech. 

When removing verb parts of speech, accuracy ranges from 81.16% to 83.5% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment are "unfunny," "pointless," and "10/10"  where the first two are strongly associated with negative sentiment and the last strongly associated with positive sentiment. The mean accuracy is 82.46% following the same trend as before where the model is reasonably good at classifying sentiment but with no improvement when including all parts of speech.

When removing adjective parts of speech, accuracy ranges from 79.9% to 81.58% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment are "10/10," "pointless," and "waste," where the first is associated with positive sentiment and the last two are strongly associated with negative sentiment. Overall, the mean accuracy is 80.86% which is the lowest accuracy in the experiment, still high suggesting a reasonable performance by the model, but slightly lower performance when using all parts of speech.

When removing adverb parts of speech, accuracy ranged from 81.5% to 83.6% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment are "pointless," "laughable," and "redeeming" which are strongly associated with negative sentiment. With an overall mean accuracy of 82.64% suggests the model is reasonably good at classifying sentiment, but still does not perform as well when including all parts of speech. 

When removing particle parts of speech, accuracy ranges from 81.86% to 83.88% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment "pointless," “laughable," and "waste" are strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no improvement when including all parts of speech.

When removing marker parts of speech, accuracy ranges from 81.86% to 83.88% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment "pointless," "laughable," and "waste" are strongly associated with negative sentiment. The mean accuracy is 83.14% again suggesting the model is reasonable at classifying sentiment, but with no improvement when including all parts of speech.

When removing numerical parts of speech, accuracy ranges from 81.86% to 83.96% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.12% showing the model is reasonable at classifying sentiment, but with no improvement when including all parts of speech.

When removing foreign parts of speech, accuracy ranged from 81.86% to 83.88% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no change when including all parts of speech.

When removing symbol parts of speech, accuracy ranges from 81.86% to 83.88% across the five rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no improvement when including all parts of speech.
	
When removing interjection parts of speech, accuracy ranges from 81.84% to 84.0% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no change when including all parts of speech.

When removing to parts of speech, accuracy ranges from 81.84% to 84.02% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no change when including all parts of speech.

When removing ex parts of speech, accuracy ranges from 81.82% to 83.86% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy but vary slightly across rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no change when including all parts of speech.

When removing pos parts of speech, accuracy ranges from 81.84% to 83.92% across the five rounds. The most indicative features used to distinguish sentiment were again “pointless," "laughable," and "waste" which were strongly associated with negative sentiment. The mean accuracy is 83.14% showing the model is reasonable at classifying sentiment, but with no improvement when including all parts of speech.

## Experiment 3

Experiment 3 removes all parts-of-speech before generating the bag-of-words features and what is left is only what it could not define. This is similar to the baseline from experiment 1 but on the other extreme. 

In [28]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 3: test all pos removed')
print()
bow_3 = generate_bow_features(pos_keys, bow_feature_count)
build_and_test_classifier(bow_3, num_folds)
print() 

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 3: test all pos removed

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.5964
	Precision	Recall		F1
neg	     0.552	     0.606	     0.578
pos	     0.641	     0.589	     0.614

Most Informative Features
                V_'zombi = True              neg : pos    =      7.7 : 1.0
            V_winchester = True              pos : neg    =      7.3 : 1.0
                 V_hanzo = True              pos : neg    =      6.3 : 1.0
None

Round 2 accuracy: 0.5886
	Precision	Recall		F1
neg	     0.522	     0.593	     0.556
pos	     0.653	     0.585	     0.617

Most Informative Features
            V_winchester = True              pos : neg    =     16.5 : 1.0
                V_'zombi = True              neg : pos    =      6.9 : 1.0
                 V_hanzo = True              pos : neg    =      5.7 : 1.0
None

Round 3 accuracy: 0.593
	Precision	Recall		F1
neg	     0.543	     0.616	     0.577
pos	     0.645	     0.574	     0.608

Most Informative Features
                

### Results

In this experiment, all tagged parts of speech were removed from the bag of words feature set. The accuracy ranges from 58.86% to 60.26% dropping significantly from the previous experimental results. The most indicative features used to distinguish sentiment were "winchester," "hanzo," and "zombi"  where the first two are associated with positive sentiment and the last associated with negative sentiment. The mean accuracy is 59.53% significantly lower than all parts of speech included in the bag of word features, and lower than removing just one part of speech tag at a time. 

## Experiment 4

Experiment 4 generates features by using counts of parts of speech within a review. The goal is to determine if the volume of parts-of-speech tags are any indication of a positive or negative review. 

In [29]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 4: test pos counts')
print()
poscounts = generate_pos_features()
build_and_test_classifier(poscounts, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 4: test pos counts

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.542
	Precision	Recall		F1
neg	     0.644	     0.535	     0.584
pos	     0.440	     0.553	     0.490

Most Informative Features
            V_verb_count = 173               neg : pos    =      6.3 : 1.0
            V_verb_count = 148               neg : pos    =      5.7 : 1.0
          V_determ_count = 120               neg : pos    =      5.7 : 1.0
None

Round 2 accuracy: 0.5448
	Precision	Recall		F1
neg	     0.657	     0.530	     0.587
pos	     0.436	     0.567	     0.493

Most Informative Features
            V_verb_count = 148               neg : pos    =      6.9 : 1.0
             V_adv_count = 81                neg : pos    =      6.3 : 1.0
            V_noun_count = 205               pos : neg    =      5.4 : 1.0
None

Round 3 accuracy: 0.5536
	Precision	Recall		F1
neg	     0.660	     0.553	     0.602
pos	     0.442	     0.554	     0.492

Most Informative Features
            V_verb_co

### Results

In this experiment, the features set was composed of counts of tagged parts of speech including conjugations, determiner parts of speech, nouns, verbs, adjectives, adverbs, particles, markers, numerical parts of speech, foreign words, symbol parts of speech, and interjection parts of speech. The accuracy ranges from 52.4 % to 55.48% across the five rounds.  The precision, recall, and F1 scores are similar with respect to accuracy varying slightly across rounds, but the F1 score for classifying negative sentiment was slightly higher than classifying positive sentiment. This suggests the model is more precise and has a greater recall classifying negative sentiment rather than positive sentiment. The most indicative features used to distinguish sentiment were "verb_count" associated with positive sentiment,  "determ_count" associated with negative sentiment, “noun_count” associated with positive sentiment. This suggests the model finds reviews with high number of verbs and noun parts of speech to be associated with positive sentiment and higher determiner parts of speech to be associated with negative sentiment. The mean accuracy was accuracy across all rounds is approximately 54.74%, showing that removing noun parts of speech on the bag of words feature set still has a better performance when classifying sentiment. 

## Experiment 5

Experiment 5 generates features by counting the number of capital letters, measuring the length of the review and average word length. There may be a relationship between how lengthy or concise a review is, the overuse of caps or even the word complexity (more syllables). 

In [30]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 5: text statistics features')
print()
ts = generate_text_stat_features()
build_and_test_classifier(ts, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 5: text statistics features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.5138
	Precision	Recall		F1
neg	     0.532	     0.513	     0.522
pos	     0.496	     0.515	     0.505

Most Informative Features
         V_review_length = 241               pos : neg    =      6.3 : 1.0
         V_review_length = 930               pos : neg    =      6.3 : 1.0
         V_review_length = 1081              neg : pos    =      6.3 : 1.0
None

Round 2 accuracy: 0.519
	Precision	Recall		F1
neg	     0.529	     0.511	     0.520
pos	     0.509	     0.527	     0.518

Most Informative Features
         V_review_length = 855               neg : pos    =      7.6 : 1.0
         V_review_length = 1363              pos : neg    =      7.0 : 1.0
         V_review_length = 1104              neg : pos    =      7.0 : 1.0
None

Round 3 accuracy: 0.517
	Precision	Recall		F1
neg	     0.531	     0.527	     0.529
pos	     0.502	     0.506	     0.504

Most Informative Features
         V_re

### Results

In this experiment the feature statistics were used to generate the feature set. These statistics are composed of capitalized word count review length and an average word length. The accuracy ranges from 51.38% to 52.76% across the five rounds. Again, this range drops from the initial set of experiments but the precision, recall, and F1 scores are similar with respect to accuracy. Again, the F1 score for classifying negative sentiment was slightly higher than classifying positive sentiment suggesting the model is more precise and has a greater recall classifying negative sentiment rather than positive sentiment. The most indicative features used to distinguish sentiment was “review_length” followed by “cap_count” where each is not specifically associated with positive or negative sentiment, but the model identified them as determinative when classifying the reviews. The overall accuracy of all rounds is 51.87% decreasing from the previous experiment, and only slightly better than a coin flip. 

## Experiment 6

Experiment 6 uses bi-grams collected from the review corpus to generate features from. If there are common two word phrases that indicate if the review is negative or positive this will identify what those are. 

In [31]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 6: bi-gram features')
print()
bigrams = generate_bigram_features(bigram_feature_count)
build_and_test_classifier(bigrams, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 6: bi-gram features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.4998
	Precision	Recall		F1
neg	     1.000	     0.500	     0.666
pos	     0.000	     0.000	     0.000

Most Informative Features
        B_'achcha_pitaji = False             neg : pos    =      1.0 : 1.0
B_'acts_operetta-structure = False             neg : pos    =      1.0 : 1.0
       B_'airs_12th-14th = False             neg : pos    =      1.0 : 1.0
None

Round 2 accuracy: 0.4924
	Precision	Recall		F1
neg	     1.000	     0.492	     0.660
pos	     0.000	     0.000	     0.000

Most Informative Features
        B_'achcha_pitaji = False             neg : pos    =      1.0 : 1.0
B_'acts_operetta-structure = False             neg : pos    =      1.0 : 1.0
       B_'airs_12th-14th = False             neg : pos    =      1.0 : 1.0
None

Round 3 accuracy: 0.4888
	Precision	Recall		F1
neg	     0.000	     0.000	     0.000
pos	     1.000	     0.489	     0.657

Most Informative Features
        B_'achc

### Results

In this experiment, a bigram finder is used to find the top bigrams iterated across all reviews. The bigrams are then used as features to classify within the model. The accuracy ranges from 48.88% to 49.98% across the five rounds. In an unusual case, one class of sentiment had perfect precision and the other had a precision of 0.0 where the class depended on the round of the experiment. Where the model precision was perfect, the recall hovered around 59% and the F1 score was higher than the accuracy score found usually around 66%. When the precision of one of the sentiment classes was 0.0 the recall and the F1 score were also 0.0. This could be because the bigrams were comprised of only negated bi-gram features or positive bigram features. The most indicative features were found to be these negated bigram features where the model determined them to have a strong association with negative sentiment implying the model with this feature set is biased towards predicting instances at negative. It’s important to remember, the mean accuracy is 49.33%, indicating that the model's performance is not much better than random chance.

## Experiment 7

Experiment 7 generate a feature that counts the number of 'not' words. The outcome should indicate if this statistic is a good barometer of sentiment. 

In [32]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 7: not features')
print()
nots = generate_not_features()
build_and_test_classifier(nots, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 7: not features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.5794
	Precision	Recall		F1
neg	     0.617	     0.574	     0.595
pos	     0.542	     0.586	     0.563

Most Informative Features
             V_not_count = 19                neg : pos    =      3.5 : 1.0
             V_not_count = 20                neg : pos    =      3.4 : 1.0
             V_not_count = 13                neg : pos    =      2.9 : 1.0
None

Round 2 accuracy: 0.5774
	Precision	Recall		F1
neg	     0.600	     0.567	     0.583
pos	     0.555	     0.589	     0.571

Most Informative Features
             V_not_count = 17                neg : pos    =      3.6 : 1.0
             V_not_count = 13                neg : pos    =      2.7 : 1.0
             V_not_count = 19                neg : pos    =      2.4 : 1.0
None

Round 3 accuracy: 0.5826
	Precision	Recall		F1
neg	     0.608	     0.589	     0.598
pos	     0.556	     0.576	     0.565

Most Informative Features
             V_not_coun

### Results

In this experiment, the number instances of the word “not” were counted in each review across both sets of reviews to generate a feature set used to classify sentiment. The accuracy ranges from 57.74% to 58.33% across the five rounds. The precision, recall, and F1 scores are similar with respect to accuracy varying slightly across rounds, but the F1 score for classifying negative sentiment was slightly higher than classifying positive sentiment. This suggests the model is more precise and has a greater recall classifying negative sentiment rather than positive sentiment. This logically follows when creating a list of features comprised of the number of times “not”, a negation word, is used across reviews. The most indicative features were the number of “not_counts” found within the reviews where they were all associated with negative sentiment. The overall mean accuracy 58.33% was higher than the previous experiment, but still not as high as the initial experiments removing noun parts of speech and the bag of words feature set.

## Experiment 8

Experiment 8 is similar to experiment 7 but instead of counting the not words it indicates which words were contradicted. Not all words were indicated. This only used the most common words by frequency to determine if each review contained a contradicted version of that word.

In [33]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 8: negation features')
print()
negate = generate_negation_features(negation_feature_count)
build_and_test_classifier(negate, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 8: negation features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.4998
	Precision	Recall		F1
neg	     1.000	     0.500	     0.666
pos	     0.000	     0.000	     0.000

Most Informative Features
                 V_not_& = False             neg : pos    =      1.0 : 1.0
              V_not_'the = False             neg : pos    =      1.0 : 1.0
             V_not_..... = False             neg : pos    =      1.0 : 1.0
None

Round 2 accuracy: 0.4924
	Precision	Recall		F1
neg	     1.000	     0.492	     0.660
pos	     0.000	     0.000	     0.000

Most Informative Features
                 V_not_& = False             neg : pos    =      1.0 : 1.0
              V_not_'the = False             neg : pos    =      1.0 : 1.0
             V_not_..... = False             neg : pos    =      1.0 : 1.0
None

Round 3 accuracy: 0.4888
	Precision	Recall		F1
neg	     0.000	     0.000	     0.000
pos	     1.000	     0.489	     0.657

Most Informative Features
                 V

### Results

In this experiment, the feature set used to classify sentiment is composed of top words preceded by the word “not” across the global word distribution set. The accuracy ranges from 48.88% to 49.98% across all five rounds. Again, the behavior of the precision, recall and F1 score results are similar to the bi-gram features where only one class of sentiment had perfect precision and the other had a precision of 0.0. Where the model precision was perfect, the recall hovered around 49%, and the F1 score was about 66%. When the precision of one of the sentiment classes was 0.0 the recall and the F1 score were also 0.0. The same phenomenon that occurred when using the bigram feature set suggesting sets could be composed of wholly negated pairs of top word or perhaps double negated words when the positive class precision is 100% accurate. There are no indicative features that include double negation, but they do include features like “not_&” and “not_the” which are associated with negative sentiment. The mean accuracy was found to be 49.33%, the same when using the bigram feature set and not improving from previous experiments. 

## Experiment 9

Experiment 9 is similar to the bag-of-words features in experiments 1-3. The TF-IDF uses frequency number countered by overuse. If a word is used a lot it is probably a good word to compare against but if it's used too much then it loses it's meaning and should be penalized. This test should determine how this compares to the bag-of-words tests. 

In [34]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 9: tf-idf features')
print()
tfidf = generate_tfidf_features(tfidf_feature_count)
build_and_test_classifier(tfidf, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 9: tf-idf features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.8246
	Precision	Recall		F1
neg	     0.799	     0.842	     0.820
pos	     0.850	     0.809	     0.829

Most Informative Features
                TF_waste = True              neg : pos    =     14.1 : 1.0
            TF_pointless = True              neg : pos    =     10.5 : 1.0
                TF_awful = True              neg : pos    =      8.5 : 1.0
None

Round 2 accuracy: 0.8214
	Precision	Recall		F1
neg	     0.782	     0.844	     0.812
pos	     0.859	     0.803	     0.830

Most Informative Features
                TF_waste = True              neg : pos    =     12.3 : 1.0
               TF_poorly = True              neg : pos    =      9.1 : 1.0
                TF_worst = True              neg : pos    =      9.1 : 1.0
None

Round 3 accuracy: 0.8086
	Precision	Recall		F1
neg	     0.787	     0.830	     0.808
pos	     0.831	     0.789	     0.809

Most Informative Features
                TF_w

### Results

In this experiment, the TF_IDF vectorizer is used to compute the top terms while considering their frequency across the entire corpora. This feature set is then used to classify sentiment where the accuracy ranges from 80.86% to 82.68%.  The precision, recall, and F1 scores are similar with respect to accuracy with the F1 scores for positive sentiment classification being slightly higher than the negative sentiment classification. This suggests the slightest favorability when classifying positive sentiment using the model when using this feature set. The indicative features used to distinguish sentiment were "waste," "pointless," and "worst" which are strongly associated with negative sentiment. The mean accuracy across all rounds was 82.14% which is much better than the previous experiments but still slightly worse than the first experiment using all parts of speech and the bag of words feature set. 

## Experiment 10

Experiment 10 uses a 3rd party sentiment library to generate features with a sentiment value. In testing it was never completely accurate but may provide some improved accuracy. 

In [35]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 10: vader sentiment features')
print()
vader = generate_vader_features()
build_and_test_classifier(vader, num_folds)
print()

# check end time
toc = time.perf_counter()
print(f"Completed in {((toc-tic)/60):0.2f} minutes")

EXPERIMENT 10: vader sentiment features

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.7018
	Precision	Recall		F1
neg	     0.547	     0.792	     0.647
pos	     0.856	     0.654	     0.742

Most Informative Features
                 S_vader = False             neg : pos    =      3.6 : 1.0
                 S_vader = True              pos : neg    =      1.9 : 1.0
None

Round 2 accuracy: 0.7054
	Precision	Recall		F1
neg	     0.544	     0.793	     0.645
pos	     0.862	     0.661	     0.748

Most Informative Features
                 S_vader = False             neg : pos    =      3.6 : 1.0
                 S_vader = True              pos : neg    =      1.9 : 1.0
None

Round 3 accuracy: 0.6938
	Precision	Recall		F1
neg	     0.554	     0.783	     0.649
pos	     0.840	     0.643	     0.728

Most Informative Features
                 S_vader = False             neg : pos    =      3.7 : 1.0
                 S_vader = True              pos : neg    =      1.9 : 1.0
None

Roun

### Results

In this experiment, the feature set used to classify across was generated using Vader Sentiment Intensity Analyzer. Everything else being equal, the accuracy ranges from 69.08% to 70.54%. The precision, recall, and F1 scores are similar with respect to accuracy but there were slight differences between classes. For example, for the negative class, precision ranges from approximately 0.544 to 0.554, recall ranges from approximately 0.778 to 0.793, and F1-score ranges from approximately 0.640 to 0.649. For the positive class, precision ranges were higher from approximately 0.840 to 0.862, the recall scores were lower from approximately 0.643 to 0.663, and the F1 scores were thus higher from approximately 0.728 to 0.748. This could indicate a slight favoritism towards classifying positive sentiment when using the model with this feature set. The most informative features where sentiment analyzed features were either present or not. Interestingly, the absence of sentiment (S_vader = False) is more indicative of the negative class, while the presence of sentiment (S_vader = True) is more indicative of the positive class. The mean accuracy across all rounds was found to be 69.92%, showing a moderate performance but not as good as performance from Experiment 1 or Experiment 2. 

## Experiment 11

Experiment 11 combines features from previous tests to create a set that has a higher score than each individual feature set. In this case the features from Vader, TF-IDF and Bag-of-Words (all words) were used. 

In [36]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 11: combined features (1)')
print()
new_1 = merge_feature_sets(vader, tfidf)
new_2 = merge_feature_sets(new_1, bow_1)
build_and_test_classifier(new_2, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 11: combined features (1)

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.8344
	Precision	Recall		F1
neg	     0.823	     0.842	     0.832
pos	     0.846	     0.827	     0.836

Most Informative Features
                TF_waste = True              neg : pos    =     14.1 : 1.0
             V_pointless = True              neg : pos    =     11.3 : 1.0
            TF_pointless = True              neg : pos    =     10.5 : 1.0
None

Round 2 accuracy: 0.8408
	Precision	Recall		F1
neg	     0.812	     0.857	     0.834
pos	     0.868	     0.827	     0.847

Most Informative Features
                TF_waste = True              neg : pos    =     12.3 : 1.0
             V_laughable = True              neg : pos    =      9.7 : 1.0
                 V_worst = True              neg : pos    =      9.1 : 1.0
None

Round 3 accuracy: 0.829
	Precision	Recall		F1
neg	     0.824	     0.839	     0.831
pos	     0.834	     0.819	     0.827

Most Informative Features
              

### Results

In this experiment, multiple feature sets were combined including the Vader feature set, the TF_IDF feature set and the bag of words feature set including all parts of speech. The accuracy ranged from 82.9% to 84.72% across all five rounds. We see similar results to the previous experiment in terms of the precision, recall and F1 scores to be slightly higher in the positive class than the negative class but with less of a difference between the two. The negative class precision ranges from approximately 0.812 to 0.840, recall ranges from approximately 0.839 to 0.866, and F1-score ranges from approximately 0.831 to 0.844. The positive class precision ranges from approximately 0.854 to 0.870, recall ranges from approximately 0.819 to 0.827, and F1-score ranges from approximately 0.836 to 0.850. The most informative features were a combination of features from the merged list of features. For example, TF_DIF vectorized features like "TF_waste" and sentiment features like "V_pointless" were found to be associated with negative sentiment. The mean accuracy across all rounds was found to be 83.92% making this model just slightly outperform the performance seen in Experiment 1.

## Experiment 12

Experiment 12 is similar to experiment 11. This will combine the features from Vader, TF-IDF and Bag-of-Words (with nouns removed). 

In [37]:
# restart timer
tic = time.perf_counter()

print('EXPERIMENT 12: combined features (2)')
print()
new_1 = merge_feature_sets(vader, tfidf)
new_2 = merge_feature_sets(new_1, bow_2['noun'])
build_and_test_classifier(new_2, num_folds)
print()

# check end time
print(f"Completed in {((time.perf_counter()-tic)/60):0.2f} minutes")

EXPERIMENT 12: combined features (2)

feature set length: 25000
Fold size: 5000

Round 1 accuracy: 0.85
	Precision	Recall		F1
neg	     0.834	     0.861	     0.848
pos	     0.866	     0.839	     0.852

Most Informative Features
                  V_4/10 = True              neg : pos    =     37.3 : 1.0
                  V_7/10 = True              pos : neg    =     27.0 : 1.0
                  V_3/10 = True              neg : pos    =     25.5 : 1.0
None

Round 2 accuracy: 0.8462
	Precision	Recall		F1
neg	     0.811	     0.868	     0.838
pos	     0.881	     0.827	     0.853

Most Informative Features
                  V_3/10 = True              neg : pos    =     55.8 : 1.0
                  V_4/10 = True              neg : pos    =     32.3 : 1.0
                  V_7/10 = True              pos : neg    =     28.9 : 1.0
None

Round 3 accuracy: 0.8402
	Precision	Recall		F1
neg	     0.833	     0.852	     0.842
pos	     0.848	     0.829	     0.838

Most Informative Features
               

### Results

In this experiment, multiple feature sets were combined including the Vader feature set, the TF_IDF feature set and the bag of words feature set with the nouns removed from the parts of speech included. This was done to build off the promising results of the previous experiment and using results from experiment 2 where nouns were removed from the bag of word features improved the performance. The accuracy ranged from 84.02% to 85.54% across all five rounds. The same pattern emerges in regard to the previous experiment’s precision, recall and F1 scores being slightly higher in the positive class than the negative class. The negative class had precision ranges from approximately 0.811 to 0.851, recall ranges from approximately 0.852 to 0.877, and F1-score ranges from approximately 0.838 to 0.858, and the positive class had precision ranges from approximately 0.866 to 0.881, recall ranges from approximately 0.827 to 0.858, and F1-score ranges from approximately 0.852 to 0.865. This indicates a slight favoritism towards the positive class sentiment in terms of classifying the reviews, but again very slight. The most informative features appear to be numerical ratings like in the experiment where nouns were removed previously. The same features appear "3/10," "4/10," and "7/10”, with the lower scoring ratings corresponding to negative sentiment and the higher rating corresponding to positive sentiment. The mean accuracy across all rounds is found to be 85.07%, the most promising results from the model across all experiments and iterations. 

In [38]:
# Stop timer
print(f"Start to finish in {((time.perf_counter()-tic_global)/60):0.2f} minutes")

Start to finish in 318.96 minutes


# Conclusion

Through all experiments a prediction accuracy of 85% was the highest achieved. It was through experiment 12 which combined the features sets with the overall highest accuracy individually. These features were Vader, TF-IDF and Bag-of-Words (with nouns removed). The improvement was modest of 1-2 points from using the feature sets as standalone but was relatively high.

With more time other avenues could've been attempted to mix feature logic or spend more time within the review text to identify other aspects of the corpus that indicated movie sentiment with better accuracy. Overall the results were respectable but we wished we could've designed features that reached at least to 90% accuracy. 