In this experiment, you will explore the accuracy of sentiment classification using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents. For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.

Then, you will implement `createFancyFeatures`, which can add any other features you choose to improve performance on the classification task.

The two code blocks at the end train and evaluate two models (logistic regression with L1 and with L2 regularization) using your featurization functions. In addition to held-out classification accuracy under 10-fold cross-validation, you will see the features to which the model assigns the highest weights for each class.
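To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data (an assumption for illustration; the notebook itself uses the review corpus). It runs 10-fold cross-validation for each penalty and counts how many coefficients remain nonzero after fitting on all the data. The value `C=0.1` is an arbitrary choice for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data: 50 features, of which only 5 are informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)
folds = KFold(n_splits=10, shuffle=True, random_state=1)

nonzero = {}
for penalty in ("l1", "l2"):
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=0.1)
    scores = cross_validate(model, X, y, cv=folds)["test_score"]
    model.fit(X, y)  # refit on everything to inspect one set of weights
    nonzero[penalty] = int(np.count_nonzero(model.coef_))
    print(penalty, "mean accuracy: %.3f" % scores.mean(),
          "nonzero weights:", nonzero[penalty])
```

L1 regularization tends to drive many weights exactly to zero, which is why its list of informative terms can look quite different from the L2 list.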

In [0]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [5]:
# regular expressions and string utilities for text cleaning
import re
import string
from collections import Counter, OrderedDict

# Natural Language Toolkit and the resources used below
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('punkt')
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WhitespaceTokenizer, word_tokenize

# scikit-learn vectorizers
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus

In [0]:
# raw strings avoid invalid-escape warnings for \s, \[ and similar sequences
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

In [0]:
def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews
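A short usage sketch of the helper above, with the patterns repeated so that the snippet runs on its own. The sample review is made up for illustration.

```python
import re

REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # lowercase and delete punctuation, then turn separators into spaces
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    return [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]

# Hypothetical sample review.
sample = ["Great film!<br /><br />Well-made, isn't it?"]
cleaned = preprocess_reviews(sample)
print(cleaned)  # ['great film well made isnt it']
```

Note that deleting apostrophes merges contractions (`isn't` becomes `isnt`), which may or may not be desirable for sentiment features.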

In [0]:
def joinStrings(stringList):
    return ''.join(string for string in stringList)

In [0]:
def clean_text(text):
    # lowercase the text
    text = text.lower()
    # tokenize on spaces and strip surrounding punctuation
    tokens = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove tokens that contain digits
    tokens = [word for word in tokens if not any(c.isdigit() for c in word)]
    # remove stop words (a set makes membership tests fast)
    stop = set(stopwords.words('english'))
    tokens = [t for t in tokens if t not in stop]
    # remove empty tokens
    tokens = [t for t in tokens if len(t) > 0]
    # POS-tag the tokens so the lemmatizer knows each word's part of speech
    pos_tags = pos_tag(tokens)
    # lemmatize, reusing a single lemmatizer instance
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    # remove single-character tokens
    tokens = [t for t in tokens if len(t) > 1]
    # join the tokens back into one string
    return " ".join(tokens)

In [0]:
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [0]:
def convert(lst): 
    return (lst[0].split(' '))

This is where you will implement two functions to featurize the data.

In [0]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse matrix (e.g. from scipy.sparse) of document features
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
  # collect the raw text and the gold label of every document
  texts = [doc['text'] for doc in corpus]
  classes = [doc['class'] for doc in corpus]

  # lowercase, strip punctuation, drop stop words and lemmatize
  texts = [clean_text(text) for text in texts]

  # unigram counts as a sparse document-term matrix
  cv = CountVectorizer(ngram_range=(1, 1))
  matrix = cv.fit_transform(texts)
  # get_feature_names() was removed in scikit-learn 1.2
  vocab = cv.get_feature_names_out().tolist()

  return matrix, classes, vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createFancyFeatures(corpus):
  # collect the raw text and the gold label of every document
  texts = [doc['text'] for doc in corpus]
  classes = [doc['class'] for doc in corpus]

  # lowercase, strip punctuation, drop stop words and lemmatize
  texts = [clean_text(text) for text in texts]

  # binary presence features over unigrams, bigrams and trigrams
  cv = CountVectorizer(ngram_range=(1, 3), binary=True)
  matrix = cv.fit_transform(texts)
  # get_feature_names() was removed in scikit-learn 1.2
  vocab = cv.get_feature_names_out().tolist()

  return matrix, classes, vocab

In [0]:
# Given a sparse matrix representation of the features for the training set, the
# vector of true classes for each example, and the vocabulary as described
# above, this computes the accuracy of the model using 10-fold cross-validation
# and reports the most indicative features for each class.

def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than one per fold
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
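The way `evaluateModel` picks indicator terms can be seen in isolation with a small sketch. In a binary logistic regression, `coef_` has a single row: negative weights push toward `classes_[0]` and positive weights toward `classes_[1]`, so sorting the weights in each direction yields the strongest terms for each class. The vocabulary and weights below are hypothetical values chosen for the example.

```python
import numpy as np

vocab = ["fine", "awful", "great", "plot"]   # hypothetical vocabulary
coef = np.array([0.5, -2.0, 1.5, -0.1])      # hypothetical weights, one per term

top_k = 2
# most negative weights first: strongest evidence for classes_[0]
class0_terms = [vocab[i] for i in coef.argsort()[:top_k]]
# most positive weights first: strongest evidence for classes_[1]
class1_terms = [vocab[i] for i in (-coef).argsort()[:top_k]]

print(class0_terms)  # ['awful', 'plot']
print(class1_terms)  # ['great', 'fine']
```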

In [0]:
corpus = readReviews()

Run the following to train and evaluate two models using basic features:

In [22]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.816000
The most informative terms for pos are: ['memorable', 'overall', 'equally', 'fun', 'surprising', 'gas', 'enjoyable', 'excellent', 'contrast', 'definitely', 'buy', 'pulp', 'matrix', 'flaw', 'edge', 'quite', 'stun', 'mummy', 'raise', 'bulworth']
The most informative terms for neg are: ['lame', 'waste', 'awful', 'ridiculous', 'unfortunately', 'anyway', 'jesse', 'tedious', 'suppose', 'mess', 'poor', 'dig', 'forward', 'pointless', 'bad', 'embarrass', 'films', 'poorly', 'nothing', 'bore']
----------L2 Norm-----------
The model's average accuracy is 0.841500
The most informative terms for pos are: ['fun', 'great', 'matrix', 'quite', 'overall', 'excellent', 'others', 'performance', 'different', 'enjoyable', 'terrific', 'memorable', 'true', 'see', 'back', 'hilarious', 'carry', 'chemistry', 'definitely', 'perfectly']
The most informative terms for neg are: ['bad', 'waste', 'unfortunately', 'suppose', 'nothing', 'attempt', 'poo

Run the following to train and evaluate two models using extended features:

In [30]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.813000
The most informative terms for pos are: ['equally', 'wonderfully', 'memorable', 'hilarious', 'work well', 'relish', 'one many', 'ideal', 'award', 'edge', 'breathtaking', 'sometimes', 'others', 'ben', 'damon', 'terrific', 'today', 'excellent', 'overall', 'share']
The most informative terms for neg are: ['awful', 'tedious', 'lame', 'waste', 'mess', 'laughable', 'clich', 'pointless', 'poor', 'promise', 'bore', 'ridiculous', 'bad', 'poorly', 'profanity', 'flat', 'nothing', 'neither', 'football', 'blame']
----------L2 Norm-----------
The model's average accuracy is 0.851500
The most informative terms for pos are: ['great', 'performance', 'also', 'hilarious', 'others', 'fun', 'many', 'life', 'excellent', 'different', 'perfect', 'memorable', 'job', 'one best', 'overall', 'perfectly', 'terrific', 'sometimes', 'enjoy', 'quite']
The most informative terms for neg are: ['bad', 'waste', 'plot', 'nothing', 'suppose', 'attempt', '