In this experiment, you will explore the accuracy of sentiment classification using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents. For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.
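
As a rough illustration of the sparse representation described above, the sketch below builds a small document-term matrix with `CountVectorizer`. The toy documents and the custom `token_pattern` (which keeps tokens containing at least one ASCII letter) are assumptions made for this example and are not part of the assignment's starter code.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
toy_docs = ["A great film, 10 out of 10!", "Not great: the 2nd act drags."]

# Assumed pattern: a run of word characters containing at least one ASCII letter.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w*[a-zA-Z]\w*\b")
X_toy = vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix, not a dense array

# Order the vocabulary by column index so that toy_vocab[i] labels column i.
toy_vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(issparse(X_toy), X_toy.shape)
print(toy_vocab)
```

The purely numeric token `10` gets no column, while `2nd` does because it contains letters.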

Then, you will implement `createFancyFeatures`, which can add any other features you choose to help improve performance on the classification task.

The two code blocks at the end train and evaluate two models—logistic regression with L1 and L2 regularization—using your featurization functions. Besides held-out classification accuracy with 10-fold cross-validation, you will also see the features in each class given high weights by the model.
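
The evaluation described above can be tried on synthetic data before touching the review corpus. The sketch below is a minimal example under stated assumptions: the data comes from `make_classification`, and the value `C=0.1` is picked only to make the difference visible. It compares 10-fold accuracy and the number of exactly-zero coefficients for the two penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data (an assumption for this sketch): 5 informative and 45 noise features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=50, n_informative=5, n_redundant=0, random_state=0
)
folds = KFold(n_splits=10, shuffle=True, random_state=1)

summary = {}
for penalty in ("l1", "l2"):
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=0.1)
    scores = cross_validate(model, X_demo, y_demo, cv=folds)["test_score"]
    model.fit(X_demo, y_demo)  # refit on all data to inspect one set of coefficients
    summary[penalty] = (scores.mean(), int(np.sum(model.coef_ == 0)))

for penalty, (accuracy, n_zero) in summary.items():
    print(f"{penalty}: mean accuracy {accuracy:.3f}, zero-valued coefficients {n_zero}")
```

L1 regularization tends to drive many weights to exactly zero, whereas L2 only shrinks them, which is why the two penalties can surface different "most informative" terms.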

# **Observations**

*The highest-weighted features for the pos class are positive words such as fun, great, and excellent, while features such as bad, worst, and awful correspond correctly to the neg class. Some words related to the plot of the movie also appear among the neg class features.*
*The best accuracy achieved was 84.2% (L2 regularization with the extended features).*


1. The TF-IDF representation did not perform as well as the word-count representation.
2. Bigrams did not perform as well as unigrams on this corpus.
3. Lemmatizing with the noun part of speech worked better than with verb or adjective.
4. Only words that appear in at least 3 reviews and in at most 80% of the reviews are kept.
5. Stop words are removed with the built-in English list of `CountVectorizer` rather than the list from the `nltk` package, because the `nltk` list contains negation words (not, shouldn't, ...) that could be useful for identifying the neg class.
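
The stop-word observation above depends on which negation words a given list actually contains. The sketch below is one way to check scikit-learn's built-in list, which is what `stop_words="english"` uses. The candidate words are chosen here for illustration and were not part of the original experiment.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Candidate negation tokens, chosen for illustration. After the notebook's
# cleaning step, "shouldn't" is reduced to "shouldn".
candidates = ["not", "no", "never", "nor", "shouldn", "cannot"]

present = [word for word in candidates if word in ENGLISH_STOP_WORDS]
absent = [word for word in candidates if word not in ENGLISH_STOP_WORDS]

print("removed by stop_words='english':", present)
print("kept as features:", absent)
```

Any word reported as removed never reaches the model, so it is worth running a check like this before relying on negation cues.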

In [0]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [0]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus
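
The featurization functions below read the `text` and `class` keys of each record. As a small offline sketch of the one-JSON-object-per-line layout that `readReviews` parses, the following uses two invented records in place of the downloaded file.

```python
import json

# Two made-up records in the assumed one-JSON-object-per-line layout.
raw_demo = (
    '{"class": "pos", "text": "a fun and memorable film"}\n'
    '{"class": "neg", "text": "a tedious waste of time"}\n'
)

corpus_demo = [json.loads(line) for line in raw_demo.strip().split("\n")]

texts_demo = [review["text"] for review in corpus_demo]
labels_demo = [review["class"] for review in corpus_demo]
print(labels_demo)
```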

This is where you will implement two functions to featurize the data.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer #for lemmatization
import re #regular expression package
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
  X = [review['text'] for review in corpus]  # list of review texts
  classes = [review['class'] for review in corpus]  # list of correct classes

  # Pre-process the data
  docs = []

  for text in X:
    # replace every special character and underscore with a space
    # (\W matches any character that is not a letter, digit, or underscore)
    doc = re.sub(r'[\W_]', ' ', str(text))  # introduces multiple spaces

    # remove single characters (the previous step leaves many stray 's')
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)
    doc = re.sub(r'^[a-zA-Z]\s+', ' ', doc)  # single character at the start of the text

    # substitute multiple spaces with a single space
    doc = re.sub(r'\s+', ' ', doc)

    # remove the prefix 'b' (text decoded from bytes can start with the letter 'b')
    doc = re.sub(r'^b\s+', '', doc)

    # remove the many numbers, as they add little predictive power
    doc = re.sub(r'[0-9]+', '', doc)

    doc = doc.lower()  # convert to lowercase
    docs.append(doc)

  vectorizer = CountVectorizer()  # plain word counts, no stop-word removal
  texts = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per review
  # vocabulary_ maps each word to its column index; sort by index so that
  # vocab[i] is the word counted in column i
  vocab = sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
  vocab = [word for word, _ in vocab]

  return texts, classes, vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createFancyFeatures(corpus):
  X = [review['text'] for review in corpus]  # list of review texts
  classes = [review['class'] for review in corpus]  # list of correct classes

  # Pre-process the data
  docs = []
  lemmatizer = WordNetLemmatizer()

  for text in X:
    doc = re.sub(r'[\W_]', ' ', str(text))  # also removes '_' around words
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)
    doc = re.sub(r'^[a-zA-Z]\s+', ' ', doc)  # single character at the start of the text
    doc = re.sub(r'\s+', ' ', doc)
    doc = re.sub(r'^b\s+', '', doc)
    doc = doc.lower()
    doc = re.sub(r'[0-9]+', '', doc)

    # Lemmatization
    doc = doc.split()
    doc = [lemmatizer.lemmatize(word) for word in doc]  # the default noun PoS worked better than verb or adjective
    doc = ' '.join(doc)

    docs.append(doc)

  # remove English stop words, words that appear in more than 80% of the
  # reviews (max_df), and words that appear in fewer than 3 reviews (min_df)
  vectorizer = CountVectorizer(stop_words="english", max_df=0.8, min_df=3)  # ngram_range=(1,2)
  texts = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per review
  vocab = sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
  vocab = [word for word, _ in vocab]
  return texts, classes, vocab
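
To see what the cleaning steps do to a single review, the sketch below applies the same sequence of substitutions to one invented sentence. `clean_review` is a hypothetical helper written for this illustration. It omits the byte-prefix step and writes the leading-letter pattern with an unescaped `^` so that it anchors at the start of the text.

```python
import re

def clean_review(text):
    """Hypothetical helper that mirrors the cleaning steps used above."""
    doc = re.sub(r"[\W_]", " ", str(text))     # punctuation and '_' become spaces
    doc = re.sub(r"\s+[a-zA-Z]\s+", " ", doc)  # drop isolated single letters
    doc = re.sub(r"^[a-zA-Z]\s+", " ", doc)    # drop a single letter at the start
    doc = re.sub(r"\s+", " ", doc)             # collapse runs of whitespace
    doc = re.sub(r"[0-9]+", "", doc)           # strip digits, even inside tokens
    return doc.lower()

sample = "The director's 2nd film_noir: 10/10!"
tokens = clean_review(sample).split()
print(tokens)  # ['the', 'director', 'nd', 'film', 'noir']
```

Note the side effect on mixed tokens: `2nd` becomes `nd`, because digits are stripped from inside words as well.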

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
#another feature extraction using TF-IDF representation
def createFancyFeatures_1(corpus):
  X = [review['text'] for review in corpus]  # list of review texts
  classes = [review['class'] for review in corpus]  # list of correct classes

  # Pre-process the data
  docs = []
  lemmatizer = WordNetLemmatizer()

  for text in X:
    doc = re.sub(r'[\W_]', ' ', str(text))  # also removes '_' around words
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)
    doc = re.sub(r'^[a-zA-Z]\s+', ' ', doc)  # single character at the start of the text
    doc = re.sub(r'\s+', ' ', doc)
    doc = re.sub(r'^b\s+', '', doc)
    doc = doc.lower()
    doc = re.sub(r'[0-9]+', '', doc)

    # Lemmatization
    doc = doc.split()
    doc = [lemmatizer.lemmatize(word) for word in doc]
    #doc = [lemmatizer.lemmatize(word, pos='v') for word in doc]
    #doc = [lemmatizer.lemmatize(word, pos='a') for word in doc]
    doc = ' '.join(doc)

    docs.append(doc)

  # TF-IDF weights over unigrams and bigrams; remove English stop words,
  # terms that appear in more than 70% of the reviews (max_df), and terms
  # that appear in fewer than 5 reviews (min_df)
  vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1,2), max_df=0.7, min_df=5)
  texts = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per review
  vocab = sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
  vocab = [word for word, _ in vocab]
  return texts, classes, vocab

In [0]:
#stop = stopwords.words('english')
#this list contains negation words such as 'not', which are important for bad reviews in bigrams, so this explicit list is not used

In [0]:
# Given a matrix representation of the features for the training set, the
# vector of true classes for each example, and the vocabulary as described
# above, this computes the accuracy of the model using 10-fold
# cross-validation and reports the most indicative features for each class.

def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than one per fold
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
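
The way `evaluateModel` reads class indicators off a single coefficient row can be checked on a tiny example. The sketch below uses an invented four-document corpus. It relies on the fact that, for a binary problem, `coef_` has one row whose positive weights favour `classes_[1]` and whose negative weights favour `classes_[0]`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus, for illustration only.
docs_demo = ["great fun film", "fun and great", "awful boring film", "boring and awful"]
labels_demo = ["pos", "pos", "neg", "neg"]

vectorizer = CountVectorizer()
X_demo = vectorizer.fit_transform(docs_demo)
vocab_demo = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

model = LogisticRegression(solver="liblinear").fit(X_demo, labels_demo)

# Sort column indices from most negative to most positive weight.
weights = model.coef_[0]
order = np.argsort(weights)
print("classes:", list(model.classes_))
print("strongest for", model.classes_[0], "->", [vocab_demo[i] for i in order[:2]])
print("strongest for", model.classes_[1], "->", [vocab_demo[i] for i in order[-2:]])
```

Because string labels are sorted, `classes_` is `['neg', 'pos']` here, so the most negative weights point at neg terms and the most positive at pos terms.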

In [0]:
corpus = readReviews()
#corpus[1:5]


Run the following to train and evaluate two models using basic features:

In [0]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.826000
The most informative terms for pos are: ['flaws', 'memorable', 'terrific', 'perfectly', 'masterpiece', 'edge', 'enjoyable', 'gas', 'using', 'sherri', 'excellent', 'overall', 'fun', 'command', 'holds', 'quite', 'follows', 'different', 'allows', 'solid']
The most informative terms for neg are: ['waste', 'mess', 'ridiculous', 'lame', 'headed', 'worst', 'cheap', 'unfortunately', 'awful', 'write', 'tedious', 'boring', 'iii', 'jesse', 'superior', 'poor', 'bad', 'terrible', 'flat', 'looks']
----------L2 Norm-----------
The model's average accuracy is 0.832500
The most informative terms for pos are: ['fun', 'great', 'back', 'quite', 'well', 'excellent', 'perfectly', 'memorable', 'overall', 'american', 'job', 'terrific', 'pulp', 'seen', 'yet', 'true', 'performances', 'bit', 'husband', 'others']
The most informative terms for neg are: ['bad', 'unfortunately', 'worst', 'waste', 'nothing', 'script', 'only', 'boring', 'awful', 'p

From the top 20 words/features above, we can see that there are numbers, which do not add any information about a review under a bag-of-words model. There are also stop words such as 'any', 'yet', 'only', and 'should', which should be removed. We can also see a few words that correspond to the plot of the movie, so the most frequent and the very infrequent words will be removed to create the advanced features.
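
Frequency-based filtering of this kind is available through the `max_df` and `min_df` parameters of `CountVectorizer`. The sketch below shows their effect on five invented mini-reviews, using the same thresholds as the extended features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Five invented mini-reviews; "movie" occurs in all of them.
docs_demo = [
    "movie good plot",
    "movie bad plot",
    "movie good acting",
    "movie bad acting rare",
    "movie good plot",
]

# max_df=0.8 drops terms found in more than 80% of the documents;
# min_df=3 drops terms found in fewer than 3 documents.
filtered = CountVectorizer(max_df=0.8, min_df=3).fit(docs_demo)
unfiltered = CountVectorizer().fit(docs_demo)

kept = sorted(filtered.vocabulary_)
dropped = sorted(set(unfiltered.vocabulary_) - set(kept))
print("kept:", kept)
print("dropped:", dropped)
```

Both thresholds count documents, not total occurrences: a word repeated many times in a single review still has a document frequency of 1.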

Run the following to train and evaluate two models using extended features:

In [0]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.816500
The most informative terms for pos are: ['memorable', 'overall', 'excellent', 'terrific', 'fantastic', 'equally', 'daylight', 'deserves', 'succeeds', 'wonderfully', 'hilarious', 'belief', 'perfectly', 'command', 'sullivan', 'different', 'fun', 'flaw', 'definitely', 'pace']
The most informative terms for neg are: ['ridiculous', 'worst', 'waste', 'headed', 'designed', 'supposed', 'awful', 'lame', 'mess', 'wasted', 'poor', 'forward', 'unfortunately', 'guess', 'cheap', 'terrible', 'saved', 'potential', 'bad', 'write']
----------L2 Norm-----------
The model's average accuracy is 0.842000
The most informative terms for pos are: ['fun', 'great', 'overall', 'memorable', 'different', 'excellent', 'hilarious', 'quite', 'perfectly', 'matrix', 'terrific', 'true', 'definitely', 'entertaining', 'enjoyed', 'performance', 'pace', 'follows', 'job', 'enjoyable']
The most informative terms for neg are: ['bad', 'worst', 'unfortunately',

The bigram model does not perform better than the unigram model, so the unigram representation is used.
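
For reference, bigrams are enabled through the `ngram_range` parameter. The sketch below uses two invented documents to show how quickly the vocabulary grows once bigrams are added. With many more columns and the same number of reviews, each feature is observed less often.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs_demo = ["not good at all", "good film"]  # invented examples

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs_demo)
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs_demo)

print(len(unigrams.vocabulary_), sorted(unigrams.vocabulary_))
print(len(uni_and_bi.vocabulary_), sorted(uni_and_bi.vocabulary_))
```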

In [0]:
X,y,vocab = createFancyFeatures_1(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.735500
The most informative terms for pos are: ['great', 'performance', 'life', 'war', 'perfectly', 'seen', 'quite', 'excellent', 'jackie', 'best', 'perfect', 'family', 'world', 'overall', 'fun', 'different', 'titanic', 'american', 'mulan', 'hilarious']
The most informative terms for neg are: ['bad', 'worst', 'boring', 'supposed', 'attempt', 'waste', 'plot', 'minute', 'stupid', 'mess', 'unfortunately', 'dull', 'ridiculous', 'script', 'awful', 'joke', 'wasted', 'tv', 'harry', 'look']
----------L2 Norm-----------
The model's average accuracy is 0.834000
The most informative terms for pos are: ['life', 'great', 'performance', 'war', 'truman', 'family', 'jackie', 'excellent', 'world', 'best', 'quite', 'perfect', 'mulan', 'american', 'cameron', 'perfectly', 'fun', 'hilarious', 'titanic', 'seen']
The most informative terms for neg are: ['bad', 'worst', 'plot', 'boring', 'supposed', 'stupid', 'attempt', 'waste', 'minute', 'script'

The TF-IDF representation (with or without n-grams) does not perform as well as the word-count representation.
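
One difference between the two representations that may be worth keeping in mind: `TfidfVectorizer` rescales every document row to unit Euclidean length by default (`norm='l2'`), so its values are much smaller than raw counts. With a fixed regularization strength, this changes how strongly the penalty acts on the weights. Whether this explains the accuracy gap here was not tested. The sketch below, on three invented documents, only demonstrates the difference in scale.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs_demo = ["good good film", "bad film", "good plot"]  # invented examples

# Dense conversion is fine here only because the toy matrices are tiny.
counts = CountVectorizer().fit_transform(docs_demo).toarray()
tfidf = TfidfVectorizer().fit_transform(docs_demo).toarray()

# Euclidean length of each document row.
count_norms = np.linalg.norm(counts, axis=1)
tfidf_norms = np.linalg.norm(tfidf, axis=1)

print("count row norms:", np.round(count_norms, 3))
print("tf-idf row norms:", np.round(tfidf_norms, 3))
```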