## Sentiment Experiment Runner Template
10/23/16 - Basic pipeline to run sentiment experiments on the IMDB movie review corpus.
Uses the IMDB dataset folder (http://ai.stanford.edu/~amaas/data/sentiment/).

In [2]:
import os
import math
import numpy as np
imdb_folder_location = "../aclImdb" # Change this to wherever your imdb folder is located

In [11]:
def imdb_sentiment_reader(dataset_type='train', sentiment='both'):
    """
    Iterator over the IMDB dataset.
    Lowercases every review.
    Args:
        dataset_type: one of ['train', 'val', 'test']. 'train' reads the train folder;
                      'val' is the first 20% of each test subfolder (sorted by file index)
                      and 'test' is the remaining 80%.
        sentiment: one of ['pos', 'neg', 'both']: whether to iterate over just the positive,
                   just the negative, or both (positive reviews first).
    Returns: Iterator over (filename, movie_review, sentiment_score) tuples.
    """
    subfolder = 'train' if dataset_type=='train' else 'test'
    dataset_path = os.path.join(imdb_folder_location, subfolder)
    if sentiment=='pos' or sentiment=='both':
        # Sort by the index
        filenames = sorted(os.listdir(os.path.join(dataset_path, 'pos')), 
                                key=lambda filename: int(filename.split('_')[0]))
        # Take a slice if these are for val/test
        if dataset_type == 'val' or dataset_type == 'test':
            cutoff = int(math.ceil(len(filenames) * .2))
            if dataset_type == 'val':
                filenames = filenames[:cutoff]
            else:
                filenames = filenames[cutoff:]
        for filename in filenames:
            sentiment_score = int(filename.split('_')[1].split('.')[0])
            with open(os.path.join(dataset_path, 'pos', filename)) as f:
                review = f.read()
            yield filename, review.lower(), sentiment_score
    if sentiment=='neg' or sentiment=='both':
        # Sort by the index 
        filenames = sorted(os.listdir(os.path.join(dataset_path, 'neg')), 
                                key=lambda filename: int(filename.split('_')[0]))
        # Take a slice if these are for val/test
        if dataset_type == 'val' or dataset_type == 'test':
            cutoff = int(math.ceil(len(filenames) * .2))
            if dataset_type == 'val':
                filenames = filenames[:cutoff]
            else:
                filenames = filenames[cutoff:]
        for filename in filenames:
            sentiment_score = int(filename.split('_')[1].split('.')[0])
            with open(os.path.join(dataset_path, 'neg', filename)) as f:
                review = f.read()
            yield filename, review.lower(), sentiment_score
    
# Example Usage
for filename, review, score in imdb_sentiment_reader(dataset_type='test', sentiment='both'):
    print filename, review, score
    break

2500_8.txt this movie was a sicky sweet cutesy romantic comedy, just the kind of movie i usually dislike but this one was just cute enough to keep me interested. it was really funny in one moment (probably why i liked it) and then just as serious in the next. plus, it had ellen in it and i've always had a soft spot for her.<br /><br />basically, the owner of a book store, helen (kate capshaw) finds a love letter in one of the old couches in her store. she thinks it is for her and goes crazy trying to figure out who sent it. she has kind of shut herself off from the world, so it really throws her for a loop. eventually, almost everyone connected with her finds this letter and they are all getting mixed signals which creates some really funny moments.<br /><br />like i said, i am usually not one for this type of movie but i really wound up enjoying it and recommend it highly. 8
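
The reader depends on two conventions: file names of the form `<index>_<rating>.txt`, and a val/test split that takes the first 20% of the index-sorted files as val and the rest as test. The sketch below isolates both steps so they can be checked without the dataset on disk. The helper names `split_val_test` and `parse_score` and the file names are made up for illustration and are not part of the notebook.

```python
import math

def split_val_test(filenames, val_fraction=0.2):
    """Sort by numeric index, then cut into (val, test) the way the reader does."""
    ordered = sorted(filenames, key=lambda name: int(name.split('_')[0]))
    cutoff = int(math.ceil(len(ordered) * val_fraction))
    return ordered[:cutoff], ordered[cutoff:]

def parse_score(filename):
    """Extract the rating from a name shaped like '<index>_<rating>.txt'."""
    return int(filename.split('_')[1].split('.')[0])

# Hypothetical file names, deliberately out of order
names = ['10_9.txt', '2_7.txt', '0_8.txt', '1_10.txt', '3_8.txt']
val, test = split_val_test(names)
print(val)                        # ['0_8.txt']
print(test)                       # ['1_10.txt', '2_7.txt', '3_8.txt', '10_9.txt']
print(parse_score('2500_8.txt'))  # 8
```

Sorting on the integer index matters: a plain string sort would place `10_9.txt` before `2_7.txt`. With 12,500 files in a sentiment folder the cutoff is ceil(12500 * 0.2) = 2500, which is consistent with the example output above starting at `2500_8.txt`.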


In [4]:
class ExperimentRunner():
    """
    Runs a sentiment experiment.
    Iterates over the reviews in the test set, transforming each one with transform_func
    and scoring the result with eval_func.
    The train_reader is stored for experiments that need a training step;
    run_experiment itself does not read from it.
    
    Prints the mean score over the test set.
    
    Args:
        train_reader: an iterator over (filename, review, score) tuples.
        test_reader: an iterator over (filename, review, score) tuples.
        transform_func: should take (filename, review, score) and return a transformed string review.
        eval_func: should take (filename, old_review, new_review, old_sentiment_score) and return a score.
        verbose: default = False
    """
    def __init__(self, train_reader, test_reader, transform_func, eval_func, verbose=False):
        self.train_reader = train_reader
        self.test_reader = test_reader
        self.transform_func = transform_func
        self.eval_func = eval_func
        self.verbose = verbose
        self.scores = []

    def run_experiment(self):
        # Iterate over the test set, transforming the reviews and evaluating them
        for index, (filename, review, sent_score) in enumerate(self.test_reader):
            if self.verbose and index % 1000 == 0:
                print "Now evaluating: " + str(index)
            # Transform the review
            transformed_review = self.transform_func(filename, review, sent_score)
            # Evaluate the transformed review
            new_score = self.eval_func(filename, review, transformed_review, sent_score)
            self.scores.append(new_score)
        if self.verbose:
            # Report the number of reviews evaluated, not the last zero-based index
            print "Finished evaluating " + str(len(self.scores)) + " test reviews."
        print "Mean score: " + str(np.mean(self.scores))
    
# Example usage
demo_train = imdb_sentiment_reader(dataset_type='train', sentiment='both')
demo_test = imdb_sentiment_reader(dataset_type='val', sentiment='both')

def demo_transform_func(filename, review, score):
    return review

def demo_eval_func(filename, old_review, new_review, old_score):
    return old_score

demo_runner = ExperimentRunner(demo_train, demo_test, demo_transform_func, 
                               demo_eval_func, verbose=True)
demo_runner.run_experiment()

Now evaluating: 0
Now evaluating: 1000
Now evaluating: 2000
Now evaluating: 3000
Now evaluating: 4000
Finished evaluating 5000 test reviews.
Mean score: 5.5362
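
One thing to keep in mind when reusing a runner: `imdb_sentiment_reader` is a generator function, so each reader it returns can be iterated only once. Calling `run_experiment()` a second time on the same runner would loop over an exhausted reader and add no new scores. The sketch below shows the behavior with a small in-memory stand-in; `toy_reader` and its reviews are made up for illustration.

```python
def toy_reader():
    """Stand-in for imdb_sentiment_reader: yields (filename, review, score) tuples."""
    # Hypothetical reviews, made up for illustration
    yield '0_8.txt', 'a charming film', 8
    yield '1_2.txt', 'a tedious film', 2

reader = toy_reader()
first_pass = [score for _, _, score in reader]
second_pass = [score for _, _, score in reader]  # the generator is already exhausted
print(first_pass)   # [8, 2]
print(second_pass)  # []

# Calling the generator function again creates a fresh iterator
fresh_pass = [score for _, _, score in toy_reader()]
print(fresh_pass)   # [8, 2]
```

In practice this means building new readers (and a new runner, since `self.scores` keeps accumulating) for each experiment.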


## Baseline

In [8]:
import nltk
from nltk import word_tokenize

In [23]:
def baseline_transform_func(filename, review, score):
    """
    Baseline: returns a review with 'not' inserted in front of any identified adjectives/adverbs.
    """
    tagged_review = nltk.pos_tag(word_tokenize(review))
    transformed_review = []
    for tagged_word in tagged_review:
        if tagged_word[1] in ['JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS']:
            transformed_review.append('not')
        transformed_review.append(tagged_word[0])
    return " ".join(transformed_review)

# Example usage:
for (filename, review, score) in imdb_sentiment_reader(dataset_type='val', sentiment='pos'):
    print "Original review: "
    print review
    print "Transformed review:" 
    print baseline_transform_func(filename, review, score)
    break

Original review: 
i went and saw this movie last night after being coaxed to by a few friends of mine. i'll admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy. i was wrong. kutcher played the character of jake fischer very well, and kevin costner played ben randall with such professionalism. the sign of a good movie is that it can toy with our emotions. this one did exactly that. the entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. while exiting the theater i not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. this movie was great, and i suggest that you go see it before you judge.
Transformed review:
not i went and saw this movie not last night after being coaxed to by a not few friends of mine . not i 'll admit that i was not reluctant to see it because from what 