# Movie Reviews

We will implement Turney's algorithm ([paper](http://www.aclweb.org/anthology/P02-1053.pdf)) for classifying movie reviews as positive or negative. Rather than deal with the headache of web search querying, we'll query a local corpus instead. [IMDB reviews](https://www.kaggle.com/utathya/imdb-review-dataset) will stand in for the "web search", and [this polarity dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz) will be used to test the classifier. It's important that no signal leaks between the two data sets, which is why we use data from two different sources rather than a train/test split of a single one.

### Setup Corpus

Load the corpus for querying.

In [1]:
import pandas as pd

# Large corpus of IMDB movie reviews. It comes labeled,
# but we only use the raw text of the reviews,
# which our classifier will "query".
QUERY_FILE = "train/imdb_master.csv"

# Use pandas to make our lives easier
query_corpus = pd.read_csv(QUERY_FILE, encoding = 'cp1252')  # https://en.wikipedia.org/wiki/Windows-1252
query_corpus = query_corpus['review']  # select just the text
print(query_corpus.head())

0    Once again Mr. Costner has dragged out a movie...
1    This is an example of why the majority of acti...
2    First of all I hate those moronic rappers, who...
3    Not even the Beatles could write songs everyon...
4    Brass pictures (movies is not a fitting word f...
Name: review, dtype: object


Tokenize the corpus.

In [3]:
import nltk
from tqdm import tqdm

# Tokenize each document in our corpus
query_corpus_tokens = []
for document in tqdm(query_corpus):
    query_corpus_tokens.append(nltk.word_tokenize(document))

query_corpus_tokens = pd.Series(query_corpus_tokens)
print(query_corpus_tokens.head())

100%|██████████| 100000/100000 [02:32<00:00, 657.04it/s]

0    [Once, again, Mr., Costner, has, dragged, out,...
1    [This, is, an, example, of, why, the, majority...
2    [First, of, all, I, hate, those, moronic, rapp...
3    [Not, even, the, Beatles, could, write, songs,...
4    [Brass, pictures, (, movies, is, not, a, fitti...
dtype: object





Construct helper functions to query our corpus.

In [4]:
def hits(w):
    """
    Query our corpus for the number
    of occurrences of token w.
    """
    count = 0

    # Full scan of every tokenized document in the corpus
    for token_list in query_corpus_tokens:
        for token in token_list:
            if token == w:
                count += 1

    return count + 0.01  # smoothing, so the SO ratio never divides by zero

def hits_near(phrase, w, nearness_threshold=10):
    """
    Query our corpus for how often phrase (a tuple of two tokens)
    occurs within nearness_threshold tokens of w.
    """
    count = 0

    for token_list in query_corpus_tokens:
        for i, token in enumerate(token_list):
            if token == w:
                tokens_to_left = token_list[max(i-nearness_threshold,0):i]
                # Note: this slice covers nearness_threshold - 1 tokens,
                # one fewer than the left context
                tokens_to_right = token_list[i+1:i+nearness_threshold]

                # Count the phrase among the bigrams on each side of w
                for context in [tokens_to_left, tokens_to_right]:
                    for bigram in zip(context[:-1], context[1:]):
                        if bigram == phrase:
                            count += 1

    return count + 0.01  # smoothing, so the SO ratio never divides by zero
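
To make the NEAR query concrete, here is a minimal self-contained sketch on a toy corpus. The names `count_near` and `toy_corpus` are hypothetical and not part of the notebook. The sketch assumes a symmetric window of `window` tokens on each side of the seed word and leaves out the 0.01 smoothing.

```python
def count_near(corpus_tokens, phrase, seed, window=10):
    """Count how often the bigram `phrase` lies entirely within
    `window` tokens to the left or to the right of `seed`."""
    count = 0
    for tokens in corpus_tokens:
        for i, token in enumerate(tokens):
            if token != seed:
                continue
            left = tokens[max(i - window, 0):i]
            right = tokens[i + 1:i + 1 + window]
            # A bigram is counted once per seed occurrence it is near
            for context in (left, right):
                for bigram in zip(context, context[1:]):
                    if bigram == phrase:
                        count += 1
    return count

# Hypothetical three-document corpus, already tokenized
toy_corpus = [
    "an excellent film with truly great acting".split(),
    "poor plot but truly great acting overall".split(),
    "truly great acting is rare".split(),
]

print(count_near(toy_corpus, ("truly", "great"), "excellent"))
print(count_near(toy_corpus, ("truly", "great"), "poor"))
```

Shrinking `window` to 2 drops both counts to zero, which shows how sensitive the query is to the nearness threshold.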

### Construct classifier

Following the paper, there are three steps in the classification algorithm. Let's code up functions to do that now.

"The first step of the algorithm is to extract phrases containing adjectives or adverbs."

We'll use the [NLTK POS tagger](https://www.nltk.org/book/ch05.html) for this.

In [5]:
def matches_tag_pattern(pos1, pos2, pos3):
    """
    Manually check for patterns in Table 1 of the paper
    """
    if pos1 == 'JJ':
        if pos2 == 'JJ' and pos3 not in ['NN', 'NNS']:
            # pattern 3
            return True
        elif pos2 in ['NN', 'NNS']:
            # pattern 1
            return True
    elif pos1 in ['RB', 'RBR', 'RBS']:
        if pos2 == 'JJ' and pos3 not in ['NN', 'NNS']:
            # pattern 2
            return True
        elif pos2 in ['VB', 'VBD', 'VBN', 'VBG']:
            # pattern 5
            return True
    elif pos1 in ['NN', 'NNS']:
        if pos2 == 'JJ' and pos3 not in ['NN', 'NNS']:
            # pattern 4
            return True

    return False

def extract_all_phrases(review):
    """
    Generator which extracts all two-word phrases
    that fit Turney's pattern of tags.
    """
    
    review_tokens = nltk.word_tokenize(review)
    review_pos = nltk.pos_tag(review_tokens)
    
    # Loop through all trigrams checking for phrases
    for (token1, pos1), (token2, pos2), (token3, pos3) in zip(review_pos[:-2],
                                                              review_pos[1:-1],
                                                              review_pos[2:]):
        if matches_tag_pattern(pos1, pos2, pos3):
            yield (token1, token2)

    # Check last two-word phrase in review, with a blank 3rd pos
    # (skipped when the review has fewer than two tokens)
    if len(review_pos) >= 2 and matches_tag_pattern(review_pos[-2][1], review_pos[-1][1], ''):
        yield (review_pos[-2][0], review_pos[-1][0])
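
As an illustration of the tag patterns, the following sketch runs a compact version of the same check over a hand-tagged sentence, so it needs no NLTK models. The tags in `tagged` are assigned by hand for the example and are not actual tagger output. The names `matches` and `extract` are hypothetical.

```python
NOUN = {"NN", "NNS"}
ADVERB = {"RB", "RBR", "RBS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def matches(pos1, pos2, pos3):
    """Compact restatement of the five two-word tag patterns."""
    if pos1 == "JJ" and pos2 in NOUN:
        return True  # pattern 1: the third tag can be anything
    if pos1 in ADVERB and pos2 in VERB:
        return True  # pattern 5: the third tag can be anything
    if (pos1 in (ADVERB | {"JJ"} | NOUN)) and pos2 == "JJ":
        return pos3 not in NOUN  # patterns 2-4: third tag must not be a noun
    return False

def extract(tagged):
    """Yield the two-word phrases whose tags fit a pattern."""
    # Pad with a blank tag so the final bigram is checked too
    padded = tagged + [("", "")]
    for (t1, p1), (t2, p2), (_, p3) in zip(padded, padded[1:], padded[2:]):
        if matches(p1, p2, p3):
            yield (t1, t2)

# Hand-assigned tags (an assumption made for this example)
tagged = [("a", "DT"), ("truly", "RB"), ("great", "JJ"), ("film", "NN"),
          ("with", "IN"), ("very", "RB"), ("poor", "JJ"), ("pacing", "NN")]

print(list(extract(tagged)))
```

Here "truly great" and "very poor" are rejected because a noun follows them, while "great film" and "poor pacing" are extracted through pattern 1.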

"The second step is to estimate the semantic orientation of the extracted phrases, using the PMI-IR algorithm."

The Pointwise Mutual Information-Information Retrieval (PMI-IR) algorithm uses the following formula:

$$
SO(phrase) = \log_2\left(\frac{hits(phrase\ NEAR\ \text{"excellent"}) \cdot hits(\text{"poor"})}{hits(phrase\ NEAR\ \text{"poor"}) \cdot hits(\text{"excellent"})}\right)
$$

where NEAR means "within some nearness threshold, measured in tokens". The words "excellent" and "poor" act as the positive and negative seeds, respectively.
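
A small worked example may help with reading the formula. The counts below are made up for illustration and do not come from our corpus, and `semantic_orientation` is a hypothetical helper name.

```python
import math

def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg):
    """SO = log2((near_pos * hits_neg) / (near_neg * hits_pos))."""
    return math.log2((near_pos * hits_neg) / (near_neg * hits_pos))

# Hypothetical counts: the phrase appears 8 times near "excellent" and
# 2 times near "poor"; "excellent" occurs 1000 times, "poor" 500 times.
so_positive = semantic_orientation(near_pos=8, near_neg=2,
                                   hits_pos=1000, hits_neg=500)

# Swapping the two NEAR counts does not simply flip the sign, because
# the seed frequencies also enter the ratio.
so_negative = semantic_orientation(near_pos=2, near_neg=8,
                                   hits_pos=1000, hits_neg=500)

print(so_positive, so_negative)
```

With these numbers the first ratio is 4000 / 2000 = 2, giving SO = 1, and the second is 1000 / 8000 = 1/8, giving SO = -3. The `hits` terms normalize for how common each seed word is on its own.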

This is where the `hits` and `hits_near` functions we wrote earlier come in handy.

In [6]:
import numpy as np

POSITIVE_SEED = "excellent"
NEGATIVE_SEED = "poor"

hits_positive = hits(POSITIVE_SEED)
hits_negative = hits(NEGATIVE_SEED)

def compute_SO(phrase):
    """
    Compute the semantic orientation for a phrase by querying our corpus.
    """
    hits_phrase_near_positive = hits_near(phrase, POSITIVE_SEED)
    hits_phrase_near_negative = hits_near(phrase, NEGATIVE_SEED)
    
    # Skip phrases that are rare near both seeds
    if hits_phrase_near_positive < 4 and hits_phrase_near_negative < 4:
        return 0
    
    # Ratio inside the logarithm of the SO formula
    OR = (hits_phrase_near_positive * hits_negative) / (hits_phrase_near_negative * hits_positive)
    
    # Natural log here, while the formula uses log base 2: this scales
    # SO by a constant factor and leaves its sign unchanged
    SO = np.log(OR)
    
    return SO

"The third step is to calculate the average semantic orientation of the phrases in the given review..."

In [7]:
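def classify_review(review):
    """
    Return the average semantic orientation (SO)
    of the phrases extracted from a review.
    """
    SO_sum = 0
    n_phrases = 0
    for phrase in extract_all_phrases(review):
        SO_sum += compute_SO(phrase)
        n_phrases += 1
    
    if n_phrases == 0:
        # No phrase matched any pattern; treat the review as neutral
        return 0
    
    SO_avg = SO_sum / n_phrases
    
    return SO_avg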
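
The averaging step and the decision rule can be sketched in isolation as follows. The SO values are made up, and `average_so` and `label_from_average` are hypothetical names. The sketch assumes a review with no extracted phrases averages to 0, and that an average of exactly 0 counts as positive.

```python
def average_so(so_values):
    """Mean semantic orientation; 0.0 when no phrases were extracted."""
    if not so_values:
        return 0.0
    return sum(so_values) / len(so_values)

def label_from_average(so_avg):
    """Map an average SO to a class label: 1 (positive) or -1 (negative)."""
    return 1 if so_avg >= 0 else -1

# Hypothetical SO values for the phrases of one review
phrase_sos = [1.0, -3.0, 0.5]
so_avg = average_so(phrase_sos)

print(so_avg, label_from_average(so_avg))
```

Here one strongly negative phrase outweighs two mildly positive ones, so the review averages to -0.5 and is labeled negative.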

### Classify reviews

Calculate the average SO of each review, and compare to its label.

In [8]:
import os

# These test folders contain labeled movie reviews
TEST_POS_FOLDER = "test/pos"
TEST_NEG_FOLDER = "test/neg"

y = []
y_hat = []

NUM_REVIEWS_TO_TEST = 10  # classifying is very slow, so cap the reviews per class while debugging


for class_folder, label in [(TEST_POS_FOLDER, 1), (TEST_NEG_FOLDER, -1)]:
    review_fns = os.listdir(class_folder)
    test_i = 0
    for review_fn in tqdm(review_fns):
        try:
            with open(os.path.join(class_folder, review_fn), 'r') as f:
                review = f.read()
        except IOError:
            print("Error reading %s" % review_fn)
            continue

        review_SO_avg = classify_review(review)

        y.append(label)
        y_hat.append(review_SO_avg)

        test_i += 1
        # `>` lets NUM_REVIEWS_TO_TEST + 1 reviews per class through
        if test_i > NUM_REVIEWS_TO_TEST:
            break
            
y = np.array(y)
y_hat = np.array(y_hat)
y_hat_categorical = np.where(y_hat >= 0, 1, -1)  # sign of the average SO is the predicted class

print("MSE = %f" % np.mean(np.square(y - y_hat)))
print("ACCURACY = %f" % np.mean(np.equal(y, y_hat_categorical)))

  1%|          | 10/1000 [28:06<55:59:42, 203.62s/it]
  1%|          | 10/1000 [19:16<24:15:31, 88.21s/it]

MSE = 0.917250
ACCURACY = 0.727273


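Since our test set is evenly split between positive and negative reviews (1000 samples in each class), the baseline accuracy is 50%. The run above covers only 22 reviews (11 per class), so its 72.7% accuracy (16 of 22 correct) is a rough debugging estimate rather than a measurement over the full test set.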
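
The 50% figure is the accuracy of always predicting the most common class. A minimal sketch of that calculation, using a hypothetical helper `majority_baseline_accuracy` and a label list built to mirror the class sizes stated above:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# 1000 positive and 1000 negative labels, as in the full test set
labels = [1] * 1000 + [-1] * 1000

print(majority_baseline_accuracy(labels))
```

On an unbalanced sample the baseline rises above 50%, so it is worth recomputing whenever the evaluated subset changes.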