## Sentiment Experiment Runner Template
10/23/16 - Basic pipeline for running sentiment experiments on the IMDB movie review corpus.
Requires a local copy of the IMDB dataset folder (http://ai.stanford.edu/~amaas/data/sentiment/).

In [41]:
import os
import math
imdb_folder_location = "../aclImdb" # Change this to wherever your IMDB folder is located
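
Before running the reader below, it can help to confirm that the folder has the layout the code expects. A minimal sketch, assuming the `train/` and `test/` subfolders each contain `pos/` and `neg/` directories (the paths the reader joins); `missing_subfolders` is a hypothetical helper, not part of the original notebook:

```python
import os

def missing_subfolders(root):
    """Return the expected split/label directories that are absent under root."""
    # Assumed layout: <root>/{train,test}/{pos,neg}
    expected = [os.path.join(split, label)
                for split in ('train', 'test')
                for label in ('pos', 'neg')]
    return [sub for sub in expected
            if not os.path.isdir(os.path.join(root, sub))]
```

An empty list means all four directories were found; anything else names the paths to fix before iterating.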

In [50]:
def imdb_sentiment_reader(dataset_type='train', sentiment='both'):
    """
    Iterator over the IMDB dataset.
    Args:
        dataset_type: ['train', 'val', 'test'] - whether to iterate over the train, val, or test
                      set. 'val' and 'test' are both drawn from the 'test' folder: 'val' is the
                      first 20% of files (sorted by index) and 'test' is the remaining 80%.
        sentiment: ['pos', 'neg', 'both'] - whether to iterate over just the positive, just the
                   negative, or both.
    Returns: Iterator over (filename, sentiment_score, movie_review) tuples.
    """
    subfolder = 'train' if dataset_type=='train' else 'test'
    dataset_path = os.path.join(imdb_folder_location, subfolder)
    if sentiment=='pos' or sentiment=='both':
        # Sort by the index
        filenames = sorted(os.listdir(os.path.join(dataset_path, 'pos')), 
                                key=lambda filename: int(filename.split('_')[0]))
        # Take a slice if these are for val/test
        if dataset_type == 'val' or dataset_type == 'test':
            cutoff = int(math.ceil(len(filenames) * .2))
            if dataset_type == 'val':
                filenames = filenames[:cutoff]
            else:
                filenames = filenames[cutoff:]
        print len(filenames)
        for filename in filenames:
            sentiment_score = int(filename.split('_')[1].split('.')[0])
            with open(os.path.join(dataset_path, 'pos', filename)) as f:
                review = f.read()
            yield filename, sentiment_score, review
    if sentiment=='neg' or sentiment=='both':
        # Sort by the index 
        filenames = sorted(os.listdir(os.path.join(dataset_path, 'neg')), 
                                key=lambda filename: int(filename.split('_')[0]))
        # Take a slice if these are for val/test
        if dataset_type == 'val' or dataset_type == 'test':
            cutoff = int(math.ceil(len(filenames) * .2))
            if dataset_type == 'val':
                filenames = filenames[:cutoff]
            else:
                filenames = filenames[cutoff:]
        for filename in filenames:
            sentiment_score = int(filename.split('_')[1].split('.')[0])
            with open(os.path.join(dataset_path, 'neg', filename)) as f:
                review = f.read()
            yield filename, sentiment_score, review
    
# Example Usage
for filename, score, review in imdb_sentiment_reader(dataset_type='test', sentiment='neg'):
    print filename, score, review
    break

10000
2500_2.txt 2 It's strange, while the film features full X-rated sex scenes and violent murders, it never feels as shocking as it ought to.<br /><br />A group of scientists go to an island in the Caribbean to investigate a radioactive incident. Upon their arrival, a mutated islander goes about the happy business of murdering the men and having his way with the women. Doesn't it always seem to work out that way.<br /><br />Among the sored acts we find a some lesbian encounters, a three-way with male prostitutes, assorted heterosexual couplings and the rape of an already dead body. Even though it's all fully explicit, it fails to ever shock or stir as it is meant to. As soon as the sex goes fully pornographic it just loses it's edge; the suspension of disbelief is broken and we realize we are just watching people having sex.<br /><br />There is some blood and gore with the murders, but given that this is a D'amato flick it's really tame. For a much more rounded experience watch the 
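
The reader above repeats the same filename parsing and val/test slicing for the `pos` and `neg` folders. A sketch of how that logic could be factored out, assuming filenames of the form `<index>_<score>.txt` as in the output above; `parse_filename`, `select_split`, and the `val_fraction` parameter are hypothetical names, not part of the original notebook:

```python
import math

def parse_filename(filename):
    """Split '<index>_<score>.txt' into (index, score) integers."""
    index, score = filename.split('.')[0].split('_')
    return int(index), int(score)

def select_split(filenames, dataset_type, val_fraction=0.2):
    """Sort by review index, then slice out the files for 'val' or 'test'."""
    filenames = sorted(filenames, key=lambda name: parse_filename(name)[0])
    if dataset_type == 'train':
        return filenames
    # 'val' takes the first val_fraction of the files, 'test' takes the rest
    cutoff = int(math.ceil(len(filenames) * val_fraction))
    return filenames[:cutoff] if dataset_type == 'val' else filenames[cutoff:]

# Toy example with ten made-up filenames
names = ['%d_%d.txt' % (i, 1 + i % 4) for i in range(10)]
print(select_split(names, 'val'))
print(len(select_split(names, 'test')))
```

With the same 20% cutoff, a folder of 12,500 files would give 2,500 validation and 10,000 test files, consistent with the `10000` printed above.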

In [None]:
class ExperimentRunner():
    def __init__(self, train_reader, test_reader, transform_func, eval_func):
        self.train_reader = train_reader
        self.test_reader = test_reader