In this experiment, you will train models to distinguish examples of two different genres of Shakespeare's plays: comedies and tragedies. (We'll ignore the histories, sonnets, etc.) Since he died four hundred years ago, Shakespeare has not written any more plays, although scraps of various other works have come to light. We are not, therefore, interested in building models simply to help categorize an unbounded stream of future documents, as we might be in other applications of text classification; rather, we are interested in what a classifier might have to tell us about what we mean by the terms “comedy” and “tragedy”.

You will start by copying and running your `createBasicFeatures` function from the experiment with movie reviews. Do the features the classifier focuses on tell you much about comedy and tragedy in general?

You will then implement another featurization function `createInterestingFeatures`, which will focus on only those features you think are informative for distinguishing between comedy and tragedy. Accuracy on leave-one-out cross-validation may go up, but it is more important to look at the features given the highest weight by the classifier. Interpretability in machine learning, of course, may be harder to define than accuracy, although accuracy at some tasks such as summarization is hard enough.
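
Leave-one-out cross-validation holds out one play at a time, trains on the rest, and averages the per-play results. The sketch below illustrates the procedure in plain Python with a made-up majority-class baseline; `majority_predict`, `leave_one_out_accuracy`, and the toy labels are illustrative assumptions, not part of the assignment code, which relies on scikit-learn's `LeaveOneOut`.

```python
from collections import Counter

def majority_predict(train_labels):
    """Toy 'classifier' (hypothetical): predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def leave_one_out_accuracy(labels, predict=majority_predict):
    """Hold out each example once, predict it from the rest, average the hits."""
    hits = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]  # everything except example i
        hits += int(predict(train) == held_out)
    return hits / len(labels)

# Made-up labels: four comedies and two tragedies.
toy_labels = ["comedy"] * 4 + ["tragedy"] * 2
accuracy = leave_one_out_accuracy(toy_labels)
print(f"{accuracy:.3f}")
```

Because each fold scores either 0 or 1, the average is always a multiple of 1/n. With the 26 folds mentioned in the evaluation code's comments, a single play is worth roughly 3.8 percentage points, so small accuracy differences between feature sets should be read cautiously.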

# **Observations**

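*The top features in the list printed as pos (comedy, judging by the character names in the output) are mostly character names, whereas the top features in the list printed as neg (tragedy) include more common nouns relating to places and roles (such as magic, ghost, witch).*
*The best leave-one-out accuracy achieved was 96% (L2 penalty, interesting features).*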


1. The TF-IDF representation performed worse than the word-count representation.
2. Bigrams did not perform as well as unigrams on this corpus.
3. When lemmatizing, the adjective part of speech worked better than noun or verb.
4. Keeping words that appear in at most 60% of the documents (max_df = 0.6) gave better accuracy than max_df = 0.7 or 0.5.
5. Since L1 drives most of the weights to 0, the words after roughly the top 15 features appear in the lists for both classes; these features (beyond the top 15) should not be interpreted when using the L1 model.
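
The fifth observation can be made concrete. In the L1 output, alphabetically adjacent terms such as 'preserver' and 'preservers' show up in both lists, which is what one would expect if their weights are exactly zero and the sort simply has to place the ties somewhere. A minimal sketch of a ranking that skips zero weights follows; `top_nonzero_terms` and the toy weights are hypothetical and not part of the notebook's `evaluateModel`.

```python
def top_nonzero_terms(weights, vocab, k=20, tol=1e-12):
    """Return (most_negative, most_positive) terms, ignoring ~zero weights."""
    pairs = [(w, term) for w, term in zip(weights, vocab) if abs(w) > tol]
    # Ascending sort puts the most negative weights first.
    most_negative = [term for w, term in sorted(pairs)[:k] if w < 0]
    # Descending sort puts the most positive weights first.
    most_positive = [term for w, term in sorted(pairs, reverse=True)[:k] if w > 0]
    return most_negative, most_positive

# Made-up weights: three of the six terms were zeroed out by the penalty.
toy_vocab = ["helena", "preserve", "press", "iago", "presents", "duke"]
toy_weights = [-0.8, 0.0, 0.0, 0.5, 0.0, -0.1]

most_negative, most_positive = top_nonzero_terms(toy_weights, toy_vocab)
print(most_negative)
print(most_positive)
```

With this filter the lists are shorter than 20 whenever the model keeps fewer than 20 nonzero weights on one side, which makes the sparsity of the L1 model visible instead of hiding it behind filler terms.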

In [0]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut
import numpy as np

In [0]:
#read in the shakespeare corpus
def readShakespeare():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/shakespeare_plays.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  #remove histories from the data, as we're only working with tragedies and comedies
  corpus = [entry for entry in corpus if entry["genre"] != "history"]
  return corpus

In [0]:
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
snow = nltk.stem.SnowballStemmer('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


This is where you will implement two functions to featurize the data:

In [0]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct genre for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
  genres = [play['genre'] for play in corpus]
  X = [play['text'] for play in corpus]
  

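  # Normalize each play: replace punctuation and underscores with spaces,
  # collapse whitespace, drop digits, strip a stray leading "b", and lowercase.
  docs = []
  for text in X:
    doc = re.sub(r'[\W_]', ' ', str(text))
    doc = re.sub(r'\s+', ' ', doc)
    doc = re.sub(r'[0-9]+', '', doc)
    doc = re.sub(r'^b\s+', '', doc)
    doc = doc.lower()
    docs.append(doc)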
    
  vectorizer = CountVectorizer()
  texts = vectorizer.fit_transform(docs).toarray()
  vocab = vectorizer.vocabulary_  
  vocab = sorted(vocab.items(), key= lambda x : x[1])
  vocab = [v[0] for v in vocab]
  
  return texts,genres,vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createInterestingFeatures(corpus):
  genres = [play['genre'] for play in corpus]
  X = [play['text'] for play in corpus]
  stopw = stopwords.words('english')
  lemmatizer = WordNetLemmatizer()

  docs = []
  for text in X:
    # Same normalization as in createBasicFeatures
    doc = re.sub(r'[\W_]', ' ', str(text))
    doc = re.sub(r'\s+', ' ', doc)
    doc = re.sub(r'[0-9]+', '', doc)
    doc = re.sub(r'^b\s+', '', doc)
    doc = doc.lower()
    doc = doc.split()
    # Alternatives tried: verb lemmas, default (noun) lemmas, Snowball stemming
    #doc = [lemmatizer.lemmatize(word, pos = 'v') for word in doc]
    doc = [lemmatizer.lemmatize(word, pos = 'a') for word in doc] # adjective lemmas gave the best accuracy
    #doc = [lemmatizer.lemmatize(word) for word in doc]
    #doc = [snow.stem(word) for word in doc]
    doc = ' '.join(doc)
    docs.append(doc)
    
  # keep terms that appear in at least 3 plays (min_df=3) and in at most 60% of the plays (max_df=0.6)
  vectorizer = CountVectorizer(stop_words=stopw, max_df=0.6, min_df=3)
  #vectorizer = TfidfVectorizer(stop_words=stopw, max_df=0.6, min_df=3)
  #vectorizer = CountVectorizer(stop_words=stopw, ngram_range= (2,2),  max_df=0.6, min_df=3)
  texts = vectorizer.fit_transform(docs).toarray()
  vocab = vectorizer.vocabulary_  
  vocab = sorted(vocab.items(), key= lambda x : x[1])
  vocab = [v[0] for v in vocab] 

  return texts,genres,vocab

In [0]:
#given a numpy matrix representation of the features for the training set, the 
# vector of true classes for each example, and the vocabulary as described 
# above, this computes the accuracy of the model using leave one out cross 
# validation and reports the most indicative features for each class
def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=LeaveOneOut())
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than 26
  model.fit(X,y)
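  # coef_ is oriented toward model.classes_[1]; labels sort alphabetically, so
  # (assuming they are "comedy" and "tragedy") the most negative weights point
  # to comedy and the most positive to tragedy
  neg_class_prob_sorted = model.coef_[0, :].argsort()
  pos_class_prob_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  # NB: despite the names, pos_indicators holds the most negative weights
  # (classes_[0]) and neg_indicators the most positive ones (classes_[1])
  pos_indicators = [vocab[i] for i in neg_class_prob_sorted[:termsToTake]]
  neg_indicators = [vocab[i] for i in pos_class_prob_sorted[:termsToTake]]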

  return avg_score,pos_indicators,neg_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  

In [0]:
corpus = readShakespeare()
#corpus[1:5]

Run the following to train and evaluate two models with basic features:

In [0]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.769231
The most informative terms for pos are: ['helena', 'prospero', 'sir', 'you', 'your', 'for', 'me', 'duke', 'of', 'love', 'preservation', 'presents', 'press', 'preserve', 'preserved', 'preserver', 'presentment', 'preservers', 'preserving', 'presently']
The most informative terms for neg are: ['our', 'him', 'rom', 'iago', 'thy', 'ham', 'imogen', 'what', 'brutus', 'his', 'lear', 'timon', 'premises', 'pressing', 'presses', 'pressed', 'press', 'president', 'preservers', 'preserver']
----------L2 Norm-----------
The model's average accuracy is 0.730769
The most informative terms for pos are: ['you', 'prospero', 'duke', 'helena', 'antonio', 'me', 'for', 'your', 'sir', 'ariel', 'sebastian', 'hermia', 'lysander', 'parolles', 'stephano', 'will', 'leontes', 'caliban', 'demetrius', 'love']
The most informative terms for neg are: ['ham', 'iago', 'him', 'our', 'othello', 'what', 'his', 'lear', 'imogen', 'brutus', 'rom', 'nurse', 'r

The top 20 features contain many inflected forms of the same word (same meaning, different tense or number), along with many stop words such as what, his, thy, him, our, to, of, me, you, your. The next step is therefore to remove stop words and lemmatize. An explicit stop-word list is used this time, since negation words such as "not" add no predictive power for genre classification.
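
The vocabulary pruning that `createInterestingFeatures` delegates to `CountVectorizer` (an explicit stop-word list plus the `min_df` and `max_df` thresholds) can be sketched in plain Python. This is an illustration of the document-frequency idea only: `prune_vocabulary`, the toy plays, and the one-word stop list are made up, and tokenization here is a bare whitespace split rather than scikit-learn's tokenizer.

```python
from collections import Counter

def prune_vocabulary(docs, stop_words, min_df=3, max_df=0.6):
    """Keep terms found in at least min_df docs and at most max_df * n_docs docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not raw count).
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if term not in stop_words and min_df <= df <= max_df * n_docs
    )

# Five made-up "plays"; 'lord' occurs in four of them, 'duke' and 'ghost' in two.
toy_docs = [
    "the lord duke loves helena",
    "the lord duke mocks the clown",
    "the lord ghost haunts the king",
    "the lord ghost warns the moor",
    "the prince weeps",
]
kept = prune_vocabulary(toy_docs, stop_words={"the"}, min_df=2, max_df=0.6)
print(kept)
```

In the toy run, 'lord' is dropped for being too common (4 of 5 documents), the singletons are dropped for being too rare, and 'the' is removed by the stop list. Note that `min_df` counts documents, not total occurrences.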

Run the following to train and evaluate two models with features that are interesting for distinguishing comedy and tragedy:

In [0]:
X,y,vocab = createInterestingFeatures(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.884615
The most informative terms for pos are: ['antonio', 'shepherd', 'kate', 'chain', 'helena', 'angelo', 'princess', 'jaques', 'hero', 'page', 'valentine', 'demetrius', 'maria', 'signior', 'clown', 'prescribe', 'prescience', 'presage', 'pricking', 'prerogative']
The most informative terms for neg are: ['murther', 'lucius', 'castle', 'corn', 'senator', 'rome', 'nurse', 'troilus', 'caesar', 'moor', 'preparation', 'prepar', 'pregnant', 'prefix', 'preferr', 'prefer', 'prepared', 'predominant', 'precisely', 'precise']
----------L2 Norm-----------
The model's average accuracy is 0.961538
The most informative terms for pos are: ['antonio', 'shepherd', 'helena', 'angelo', 'chain', 'clown', 'jaques', 'demetrius', 'kate', 'valentine', 'signior', 'sebastian', 'hero', 'princess', 'merchant', 'claudio', 'page', 'padua', 'costard', 'jew']
The most informative terms for neg are: ['lucius', 'murther', 'ghost', 'castle', 'moor', 'nurse',

In [0]:
X,y,vocab = createInterestingFeatures(corpus)
runEvaluation(X, y, vocab)
print("size of vocabulary: {}".format(len(vocab)))

----------L1 Norm-----------
The model's average accuracy is 0.538462
The most informative terms for pos are: ['abandon', 'porch', 'porpentin', 'porridg', 'port', 'portabl', 'portend', 'portent', 'porter', 'portion', 'posi', 'posit', 'posset', 'poster', 'postern', 'poorest', 'postur', 'potenc', 'potent', 'potion']
The most informative terms for neg are: ['abandon', 'porch', 'porpentin', 'porridg', 'port', 'portabl', 'portend', 'portent', 'porter', 'portion', 'posi', 'posit', 'posset', 'poster', 'postern', 'poorest', 'postur', 'potenc', 'potent', 'potion']
----------L2 Norm-----------
The model's average accuracy is 0.769231
The most informative terms for pos are: ['antonio', 'helena', 'angelo', 'valentin', 'sebastian', 'page', 'clown', 'claudio', 'maria', 'shepherd', 'jaqu', 'hero', 'kate', 'costard', 'princess', 'demetrius', 'jew', 'john', 'moth', 'oliv']
The most informative terms for neg are: ['lucius', 'brutus', 'murther', 'senat', 'antoni', 'caesar', 'rome', 'titus', 'moor', 'nurs

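The iteration above, which switches to TF-IDF (the stemmed terms in the output suggest that Snowball stemming was enabled as well), performs worse than the word-count representation: 0.77 versus 0.96 accuracy with L2. Under L1 the two term lists are identical, which suggests that every weight was driven to zero.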
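
To see what changes when counts are replaced by TF-IDF, here is a small worked sketch of the weighting, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper names and the toy counts are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight one document's raw counts by idf, then scale to unit L2 length."""
    weighted = {
        term: count * smooth_idf(n_docs, doc_freqs[term])
        for term, count in term_counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {t: w / norm for t, w in weighted.items()} if norm else weighted

# Made-up numbers: both terms occur 4 times in this play, but 'ghost'
# appears in 3 of 26 plays while 'duke' appears in 13 of 26.
counts = {"ghost": 4, "duke": 4}
doc_freqs = {"ghost": 3, "duke": 13}
row = tfidf_row(counts, doc_freqs, n_docs=26)
print(row)
```

Equal raw counts end up with different weights (the rarer term is boosted), and every row is rescaled to length 1, so play length no longer matters. The rescaled values are also much smaller than raw counts, which may be one reason a fixed regularization strength shrinks more weights to zero, although that explanation was not tested here.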

Bigrams did not perform as well as unigrams. A mixture of unigrams and bigrams gave the same performance as the unigram-only model.
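
For reference, the three settings compared here correspond to `ngram_range` values of (1, 1), (2, 2), and (1, 2) in `CountVectorizer`. The following plain-Python sketch shows which features each setting produces; `ngrams` is a hypothetical helper written for illustration, not the vectorizer's own implementation.

```python
def ngrams(tokens, n_min=1, n_max=1):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    features = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

tokens = "to be or not".split()
unigrams = ngrams(tokens, 1, 1)   # like ngram_range=(1, 1)
bigrams = ngrams(tokens, 2, 2)    # like ngram_range=(2, 2)
mixed = ngrams(tokens, 1, 2)      # like ngram_range=(1, 2)
print(unigrams)
print(bigrams)
print(mixed)
```

Bigram features are far more numerous and each one is much rarer than its component words, so with only a few dozen plays and `min_df=3`, many potentially informative bigrams are likely to be pruned before the classifier ever sees them.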