
In this experiment, you will explore the accuracy of sentiment classification using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents. For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.
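
As a rough illustration of the target representation, the sketch below counts tokens that contain at least one alphabetic character and stores only the nonzero counts for each document. The names `TOKEN_RE`, `tokenize`, and `build_sparse_counts`, and the regular expression itself, are assumptions made for this example and are not part of the assignment.

```python
import re
from collections import Counter

# Hypothetical pattern: a run of word characters that contains at least one letter.
TOKEN_RE = re.compile(r"\b\w*[a-zA-Z]\w*\b")

def tokenize(text):
    """Lowercase the text and keep tokens with at least one alphabetic character."""
    return TOKEN_RE.findall(text.lower())

def build_sparse_counts(docs):
    """Return (rows, vocab); rows[i] maps a column index to its count in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts)) if counts else []
    index = {word: j for j, word in enumerate(vocab)}
    rows = [{index[w]: c for w, c in doc.items()} for doc in counts]
    return rows, vocab

docs = ["A great film, great cast!", "2 hours of se7en; 10 / 10"]
rows, vocab = build_sparse_counts(docs)
print(vocab)  # the i-th word labels the i-th column
print(rows)   # only nonzero entries are stored
```

Purely numeric tokens such as `2` and `10` are dropped, while `se7en` is kept because it contains letters. In the notebook itself, `CountVectorizer` from `sklearn` produces a comparable matrix in SciPy's sparse format.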

Then, you will implement `createFancyFeatures`, which can add any other features you choose to help improve performance on the classification task.
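
One possible extra feature, assuming each review is already available as a list of tokens, is the bigram (a pair of adjacent tokens). The helper `unigram_and_bigram_counts` below is a hypothetical name used only for this sketch.

```python
from collections import Counter

def unigram_and_bigram_counts(tokens):
    """Count single tokens plus adjacent token pairs joined by a space."""
    features = Counter(tokens)
    features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features

feats = unigram_and_bigram_counts(["not", "good", "not", "good", "at", "all"])
print(feats["good"], feats["not good"])
```

Bigrams let a linear model tell "not good" apart from "good" on its own. With `sklearn`, passing `ngram_range=(1, 2)` to `CountVectorizer` yields unigram and bigram counts in the same way.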

The two code blocks at the end train and evaluate two models, logistic regression with L1 regularization and with L2 regularization, using your featurization functions. In addition to the held-out classification accuracy from 10-fold cross-validation, they report the features given the highest weights for each class.
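
To make the evaluation procedure concrete, here is a minimal pure-Python sketch of k-fold cross-validation. It is not the code the notebook runs (that uses `KFold` and `cross_validate` from `sklearn`), and its fold assignment differs from `KFold`'s. The names `k_fold_indices` and `cross_validated_accuracy`, and the toy majority-class classifier, are assumptions made for illustration.

```python
import random
from collections import Counter

def k_fold_indices(n_examples, n_splits=10, seed=1):
    """Shuffle example indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def cross_validated_accuracy(fit, predict, X, y, n_splits=10):
    """Average held-out accuracy; `fit` and `predict` are caller-supplied callables."""
    scores = []
    for train, test in k_fold_indices(len(X), n_splits):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Toy data and a toy classifier that always predicts the majority training class.
X = list(range(20))
y = ["pos"] * 14 + ["neg"] * 6
fit = lambda Xs, ys: Counter(ys).most_common(1)[0][0]
predict = lambda model, x: model
acc = cross_validated_accuracy(fit, predict, X, y)
print(acc)
```

Each example is held out exactly once, so the reported score is an average over models that never saw the examples they are tested on.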

In [129]:
import json
import requests
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [130]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus

This is where you will implement two functions to featurize the data.

In [131]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
    #Creating return variables
    classes = []
    texts = []
    vocab = None
    
    #Loop through the reviews in the corpus
    for example in corpus:
        #Save the label for later
        classes.append(example['class'])
        #Clean the text
        text = re.sub(r'\d', "", example['text'].replace("_", ""))
        #Create array from the review text
        text = nltk.word_tokenize(text)

        #Accumulator used to reassemble the cleaned review into a single string
        text_string = ""
        #Loop through all the remaining words in the text and remove punctuation
        for word in text:
            if word.isalpha():
                text_string += word.lower() + " "

        #Reassemble the review text
        texts.append(text_string)
    
    #Create a sparse count matrix from the reviews
    vectorizer = CountVectorizer(binary=False, analyzer = 'word', token_pattern=r'\b\w+\b')
    texts = vectorizer.fit_transform(texts)

    #Get the vocab from the vectorizer
    #(get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
    vocab = list(vectorizer.get_feature_names_out())
      
    return texts,classes,vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createFancyFeatures(corpus):
    #Creating return variables
    classes = []
    texts = []
    vocab = []
    
    #Polarity scores dictionary
    polarity_scores_dict = {}
    
    #Loop through the examples in the corpus
    for example in corpus:
        #Save class label for review
        classes.append(example['class'])
        
        #Clean text and split words into array
        text = re.sub(r'\d', "", example['text'].replace("_", ""))
        text = nltk.word_tokenize(text)

        #Accumulator used to reassemble the cleaned review into a single string
        text_string = ""
        
        #Loop through the words in the text
        for word in text:
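            #Keep only purely alphabetic tokens (this drops punctuation)
            if word.isalpha():
                #Add the word to the reassembled review string
                text_string += word.lower() + " "
                #If the word is not yet in the polarity score dictionary, create
                #its entry with a count of 1 for this review's class;
                #otherwise increment the count for this review's class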
                if word not in polarity_scores_dict:
                    polarity_scores_dict[word] = {}
                    if example['class'] == 'neg':
                        polarity_scores_dict[word]['negative_count'] = 1
                        polarity_scores_dict[word]['positive_count'] = 0
                    else:
                        polarity_scores_dict[word]['positive_count'] = 1
                        polarity_scores_dict[word]['negative_count'] = 0
                else:
                    if example['class'] == 'neg':
                        polarity_scores_dict[word]['negative_count'] += 1
                    else:
                        polarity_scores_dict[word]['positive_count'] += 1

        #Add the single review texts to the texts list
        texts.append(text_string)
        
    #Loop through the words in the polarity score dictionary    
    for key, value in polarity_scores_dict.items():
        word_neg_count = value['negative_count']
        word_pos_count = value['positive_count']
        
        #smoothing for the calculations
        if word_neg_count == 0:
            word_neg_count = 1
        elif word_pos_count == 0:
            word_pos_count = 1
            
        #Calculate the polarity score as the larger count over the smaller, so the ratio is at least 1
        if word_neg_count > word_pos_count:
            polarity_score = word_neg_count/word_pos_count
            value['polarity_score'] = polarity_score
        else:
            polarity_score = word_pos_count/word_neg_count
            value['polarity_score'] = polarity_score

        #Add every word to the vocab list
        vocab.append(key)
    
    #Create an empty dense matrix with zero rows and one column per vocab word;
    #one row per review is appended below
    matrix = np.zeros((0, len(vocab)))
    
    #Loop through all of the reviews in the corpus
    for example in corpus:
        #Clean text and split words into array
        text = re.sub(r'\d', "", example['text'].replace("_", ""))
        text = nltk.word_tokenize(text)
        #Create the matrix row for the sentence
        matrixRow = []
        for vocab_word in vocab:
            matrixRow.append(0)
        #For each word in the text, set the matrix row value to the word 
        #polarity score at the index of the word in the vocab list    
        for word in text:
            if word.isalpha() and vocab.count(word) != 0:
                matrixRow[vocab.index(word)] = polarity_scores_dict[word]['polarity_score']
        
        #Add the row to the final matrix
        matrix = np.append(matrix, [matrixRow], axis=0)
    
    #Set the texts variable to the created matrix
    texts = matrix
    
    return texts,classes,vocab
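
The polarity score computed inside `createFancyFeatures` can be sketched more compactly with `collections.Counter`. The helper name `polarity_scores` and the toy data are assumptions for this example; the scoring rule (zero counts raised to 1, larger count divided by smaller) follows the function above.

```python
from collections import Counter

def polarity_scores(token_lists, labels):
    """Map each word to max(pos, neg) / min(pos, neg), with zero counts raised to 1."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(token_lists, labels):
        (neg if label == "neg" else pos).update(tokens)
    scores = {}
    for word in set(pos) | set(neg):
        p, n = max(pos[word], 1), max(neg[word], 1)
        scores[word] = max(p, n) / min(p, n)
    return scores

# Toy tokenized reviews, for illustration only.
token_lists = [["awful", "plot", "awful"], ["great", "plot"], ["awful", "mess"]]
labels = ["neg", "pos", "neg"]
scores = polarity_scores(token_lists, labels)
print(scores["awful"], scores["plot"])
```

Because these scores are derived from the class labels, one option is to pass only the training portion of each fold to such a helper, so that held-out labels do not influence the features being evaluated.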

In [132]:
#given a numpy matrix representation of the features for the training set, the 
# vector of true classes for each example, and the vocabulary as described 
# above, this computes the accuracy of the model using 10-fold cross 
# validation and reports the most indicative features for each class

def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than one per fold
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
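
The feature ranking in `evaluateModel` relies on the sign of the coefficients: in a binary logistic regression, `coef_` has a single row, where the most negative weights point toward `model.classes_[0]` and the most positive toward `model.classes_[1]`. The sketch below mirrors the two `argsort` calls in plain Python; `top_terms` is a hypothetical helper and the weights are made up.

```python
def top_terms(weights, vocab, k=3):
    """Return the k terms with the most negative weights and the k with the most positive."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    most_negative = [vocab[i] for i in order[:k]]
    most_positive = [vocab[i] for i in order[::-1][:k]]
    return most_negative, most_positive

vocab = ["awful", "fun", "plot", "terrific", "waste"]
weights = [-1.2, 0.8, 0.05, 1.5, -0.9]  # made-up coefficients for illustration
neg_terms, pos_terms = top_terms(weights, vocab, k=2)
print(neg_terms, pos_terms)
```

With the labels `"neg"` and `"pos"`, `sklearn` sorts the classes so that `"neg"` comes first, which is why `evaluateModel` checks `model.classes_[0]` before deciding which list to report as positive.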

In [133]:
corpus = readReviews()

Run the following to train and evaluate two models using basic features:

In [134]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.821000
The most informative terms for pos are: ['flaws', 'terrific', 'memorable', 'excellent', 'using', 'masterpiece', 'command', 'follows', 'perfectly', 'master', 'enjoyable', 'sherri', 'edge', 'strange', 'gas', 'fun', 'experiences', 'fantastic', 'entertaining', 'beavis']
The most informative terms for neg are: ['waste', 'mess', 'worst', 'ridiculous', 'tedious', 'cheap', 'lame', 'awful', 'superior', 'unfortunately', 'write', 'boring', 'flat', 'bad', 'poor', 'terrible', 'jesse', 'designed', 'adam', 'headed']
----------L2 Norm-----------
The model's average accuracy is 0.833000
The most informative terms for pos are: ['fun', 'excellent', 'back', 'great', 'quite', 'overall', 'perfectly', 'job', 'yet', 'well', 'terrific', 'memorable', 'american', 'true', 'seen', 'pulp', 'performances', 'using', 'follows', 'very']
The most informative terms for neg are: ['bad', 'unfortunately', 'worst', 'nothing', 'waste', 'boring', 'only', 're

Run the following to train and evaluate two models using extended features:

In [135]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.879000
The most informative terms for pos are: ['epic', 'fun', 'others', 'today', 'terrific', 'boards', 'frank', 'buffs', 'tool', 'blast', 'sullivan', 'flaws', 'job', 'hamilton', 'infectious', 'identify', 'seen', 'solid', 'succeeds', 'pace']
The most informative terms for neg are: ['only', 'nothing', 'tedious', 'mess', 'should', 'designed', 'have', 'awful', 'anywhere', 'unfortunately', 'supposedly', 'adam', 'poor', 'attempt', 'plot', 'embarrassing', 'looks', 'metro', 'promising', 'cash']
----------L2 Norm-----------
The model's average accuracy is 0.937000
The most informative terms for pos are: ['terrific', 'memorable', 'fun', 'others', 'very', 'lang', 'seen', 'refreshing', 'excellent', 'today', 'job', 'hilarious', 'most', 'wonderfully', 'enjoyed', 'allows', 'definitely', 'class', 'overall', 'breathtaking']
The most informative terms for neg are: ['awful', 'poor', 'nothing', 'waste', 'only', 'mess', 'supposed', 'worst', 's