<a href="https://colab.research.google.com/github/mutherr/CS6120-PS1/blob/master/PS1_Shakespeare.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this experiment, you will train models to distinguish examples of two different genres of Shakespeare's plays: comedies and tragedies. (We'll ignore the histories, sonnets, etc.) Since he died four hundred years ago, Shakespeare has not written any more plays, although scraps of various other works have come to light. We are therefore not interested in building models simply to help categorize an unbounded stream of future documents, as we might be in other applications of text classification. Rather, we are interested in what a classifier might have to tell us about what we mean by the terms “comedy” and “tragedy”.

You will start by copying and running your `createBasicFeatures` function from the experiment with movie reviews. Do the features the classifier focuses on tell you much about comedy and tragedy in general?

You will then implement another featurization function, `createInterestingFeatures`, which will focus on only those features you think are informative for distinguishing between comedy and tragedy. Accuracy on leave-one-out cross-validation may go up, but it is more important to look at the features given the highest weight by the classifier. Interpretability in machine learning, of course, may be harder to define than accuracy, although accuracy at some tasks, such as summarization, is hard enough to define.
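
As a rough illustration of what leave-one-out cross-validation measures, the sketch below runs it on a small made-up count matrix. The names `X_toy` and `y_toy` and all of the numbers are assumptions invented for this example, not part of the assignment data. Each fold holds out exactly one document, so every per-fold score is either 0 or 1, and the reported accuracy is their mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_validate

# Hypothetical toy data: 6 "plays" described by 3 word-count features.
X_toy = np.array([
    [5, 0, 1],
    [4, 1, 1],
    [6, 0, 2],
    [0, 5, 1],
    [1, 4, 2],
    [0, 6, 1],
])
y_toy = np.array(["comedy", "comedy", "comedy",
                  "tragedy", "tragedy", "tragedy"])

cv = LeaveOneOut()
n_folds = cv.get_n_splits(X_toy)  # one fold per document

# Train on all documents but one, test on the held-out one, and repeat.
model = LogisticRegression(penalty="l2", solver="liblinear")
results = cross_validate(model, X_toy, y_toy, cv=cv)

fold_scores = results["test_score"]  # each entry is 0.0 or 1.0
loo_accuracy = fold_scores.mean()
print(n_folds, fold_scores, loo_accuracy)
```

With only a few dozen plays, a single misclassified play moves this average by several percentage points, which is one reason to read the accuracy figures alongside the learned weights.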

In [243]:
import json
import requests
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [244]:
#read in the shakespeare corpus
def readShakespeare():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/shakespeare_plays.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  #remove histories from the data, as we're only working with tragedies and comedies
  corpus = [entry for entry in corpus if entry["genre"] != "history"]
  return corpus

This is where you will implement two functions to featurize the data:

In [245]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct genre for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
def createBasicFeatures(corpus):
    #Creating return variables
    genres = []
    texts = []
    vocab = None
    
    #Loop through the plays in the corpus
    for example in corpus:
        #Save the label for later
        genres.append(example['genre'])
        #Clean the text
        text = re.sub(r'\d', "", example['text'].replace("_", ""))
        #Tokenize the play text into a list of words
        text = nltk.word_tokenize(text)

        #Accumulator for reassembling the play into a single string
        text_string = ""
        #Loop through all the remaining words in the text and remove punctuation
        for word in text:
            if word.isalpha():
                text_string += word.lower() + " "

        #Store the reassembled play text
        texts.append(text_string)
    
    #Create a dense document-term count matrix from the plays
    vectorizer = CountVectorizer(binary=False, analyzer = 'word', token_pattern=r'\b\w+\b')
    texts = vectorizer.fit_transform(texts).toarray()

    #Get the vocab from the vectorizer as a plain Python list
    #(get_feature_names() was removed in scikit-learn 1.2)
    vocab = vectorizer.get_feature_names_out().tolist()
      
    return texts,genres,vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createInterestingFeatures(corpus):
    #Creating return variables
    genres = []
    texts = []
    vocab = []
    
    #Polarity scores dictionary
    polarity_scores_dict = {}
    #Part of speech dictionary
    pos_dict = {}
    
    #Loop through the examples in the corpus
    for example in corpus:
        #Save the genre label for the play
        genres.append(example['genre'])
        
        #Clean text and split words into array
        text = re.sub(r'\d', "", example['text'].replace("_", ""))
        text = nltk.word_tokenize(text)
        
        #Get the parts of speech for all words 
        #Save them in a dictionary
        pos_tagged = nltk.pos_tag(text)
        for tagged_word in pos_tagged:
            if tagged_word[0] not in pos_dict:
                pos_dict[tagged_word[0]] = tagged_word[1]
                
        #String to reassemble the play
        text_string = ""
        for word in text:
            #If word is not punctuation
            if word.isalpha():
                #Add the lowercased word to the reassembled text
                text_string += word.lower() + " "
                #If the word has not been seen yet, initialize its per-genre counts;
                #otherwise increment the count for this play's genre
                if word not in polarity_scores_dict:
                    polarity_scores_dict[word] = {}
                    if example['genre'] == 'comedy':
                        polarity_scores_dict[word]['negative_count'] = 1
                        polarity_scores_dict[word]['positive_count'] = 0
                    else:
                        polarity_scores_dict[word]['positive_count'] = 1
                        polarity_scores_dict[word]['negative_count'] = 0
                else:
                    if example['genre'] == 'comedy':
                        polarity_scores_dict[word]['negative_count'] += 1
                    else:
                        polarity_scores_dict[word]['positive_count'] += 1
        #Add the single play to the texts list
        texts.append(text_string)
    #Loop through the words in the polarity score dictionary    
    for key, value in polarity_scores_dict.items():
        word_neg_count = value['negative_count']
        word_pos_count = value['positive_count']
        
        #smoothing for the calculations
        if word_neg_count == 0:
            word_neg_count = 1
        elif word_pos_count == 0:
            word_pos_count = 1
        
        #Calculate the polarity score as the larger count over the smaller, so the ratio is at least 1
        if word_neg_count > word_pos_count:
            polarity_score = word_neg_count/word_pos_count
        else:
            polarity_score = word_pos_count/word_neg_count
        
        #Add the word to the vocab list based on polarity score and part of speech (nouns are excluded)
        if polarity_score > 2 and pos_dict[key] != 'NN' and pos_dict[key] != 'NNS' and pos_dict[key] != 'NNP' and pos_dict[key] != 'NNPS':
            vocab.append(key)
            
    #Create the vectorizer with the vocab list and create the matrix
    vectorizer = CountVectorizer(binary=True, analyzer = 'word', token_pattern=r'\b\w+\b', vocabulary=vocab)
    texts = vectorizer.fit_transform(texts).toarray()
    
    #Get the final vocab list from the vectorizer as a plain Python list
    #(get_feature_names() was removed in scikit-learn 1.2)
    vocab = vectorizer.get_feature_names_out().tolist()
      
    return texts,genres,vocab
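
The selection rule used in `createInterestingFeatures` can be seen in isolation on a made-up example. The sketch below is a simplified stand-in, not the function itself: `toy_corpus`, `counts`, and `skew` are hypothetical names, the texts are invented, and the part-of-speech filter is left out. It keeps a word only when its count in one genre is more than twice its count in the other, after replacing a zero count with 1.

```python
from collections import Counter

# Hypothetical toy corpus of (genre, text) pairs.
toy_corpus = [
    ("comedy", "jest jest jest the marry"),
    ("tragedy", "slain slain slain the jest"),
]

# Count how often each word occurs in each genre.
counts = {"comedy": Counter(), "tragedy": Counter()}
for genre, text in toy_corpus:
    counts[genre].update(text.split())

def skew(word):
    """Ratio of the larger genre count to the smaller, with zeros raised to 1."""
    comedy_count = max(counts["comedy"][word], 1)
    tragedy_count = max(counts["tragedy"][word], 1)
    return max(comedy_count, tragedy_count) / min(comedy_count, tragedy_count)

all_words = set(counts["comedy"]) | set(counts["tragedy"])
selected = sorted(word for word in all_words if skew(word) > 2)
print(selected)  # ['jest', 'slain']
```

Here `the` appears equally often in both genres and `marry` appears only once, so neither passes the threshold, while `jest` and `slain` are each three times as frequent in one genre as in the other.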

In [246]:
#given a numpy matrix representation of the features for the training set, the 
# vector of true classes for each example, and the vocabulary as described 
# above, this computes the accuracy of the model using leave one out cross 
# validation and reports the most indicative features for each class
def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=LeaveOneOut())
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than 26
  model.fit(X,y)
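  # coef_ holds one row of weights in favor of the second class in model.classes_;
  # ascending order therefore lists the features that most favor the first class first
  neg_class_prob_sorted = model.coef_[0, :].argsort()
  pos_class_prob_sorted = (-model.coef_[0, :]).argsort()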

  termsToTake = 20
  pos_indicators = [vocab[i] for i in neg_class_prob_sorted[:termsToTake]]
  neg_indicators = [vocab[i] for i in pos_class_prob_sorted[:termsToTake]]

  return avg_score,pos_indicators,neg_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
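
The labels "pos" and "neg" in the printed output are carried over from the movie-review experiment, so it helps to check which genre each one refers to. In scikit-learn's binary `LogisticRegression`, `classes_` is sorted and `coef_` has a single row whose positive weights favor `classes_[1]`. The sketch below checks this on made-up data; `X_toy`, `y_toy`, and `vocab_toy` are hypothetical names and values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: columns are counts of the words in vocab_toy.
vocab_toy = ["jest", "slain", "the"]
X_toy = np.array([
    [5, 0, 1],
    [4, 1, 1],
    [6, 0, 2],
    [0, 5, 1],
    [1, 4, 2],
    [0, 6, 1],
])
y_toy = ["comedy", "comedy", "comedy", "tragedy", "tragedy", "tragedy"]

model = LogisticRegression(penalty="l2", solver="liblinear")
model.fit(X_toy, y_toy)

# String labels are sorted, so classes_ is ['comedy', 'tragedy'] and
# positive weights in coef_[0] push a prediction toward 'tragedy'.
first_class, second_class = model.classes_

order = model.coef_[0].argsort()    # ascending: most negative weight first
most_first = vocab_toy[order[0]]    # strongest indicator of classes_[0]
most_second = vocab_toy[order[-1]]  # strongest indicator of classes_[1]
print(first_class, most_first, second_class, most_second)
```

Under the same ordering, the terms that `runEvaluation` prints as "pos" are the ones with the most negative weights, which correspond to comedy, and the "neg" terms correspond to tragedy.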
  

In [247]:
corpus = readShakespeare()

Run the following to train and evaluate two models with basic features:

In [248]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.615385
The most informative terms for pos are: ['you', 'duke', 'helena', 'prospero', 'i', 'sir', 'leontes', 'a', 'privately', 'president', 'preserving', 'preservers', 'preserver', 'preserved', 'preserve', 'press', 'preservative', 'preserv', 'presents', 'presentment']
The most informative terms for neg are: ['him', 's', 'iago', 'imogen', 'o', 'brutus', 'lear', 'ham', 'and', 'rom', 'the', 'president', 'preserving', 'preservers', 'preserver', 'pretia', 'pretense', 'press', 'preserve', 'preservative']
----------L2 Norm-----------
The model's average accuracy is 0.769231
The most informative terms for pos are: ['i', 'you', 'duke', 'prospero', 'a', 'helena', 'your', 'antonio', 'sir', 'leontes', 'hermia', 'for', 'lysander', 'ariel', 'sebastian', 'demetrius', 'camillo', 'stephano', 'me', 'parolles']
The most informative terms for neg are: ['iago', 'othello', 's', 'him', 'imogen', 'what', 'lear', 'brutus', 'his', 'cassio', 'o', 'ham

Run the following to train and evaluate two models with features that are interesting for distinguishing comedy and tragedy:

In [249]:
X,y,vocab = createInterestingFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.923077
The most informative terms for pos are: ['jest', 'oaths', 'dulcet', 'conclusion', 'carrying', 'ber', 'castle', 'elsinore', 'ophelia', 'norway', 'fortinbras', 'francisco', 'bernardo', 'osric', 'guildenstern', 'nephew', 'lofty', 'forfend', 'abhorred', 'scorns']
The most informative terms for neg are: ['slain', 'warlike', 'rise', 'fierce', 'ope', 'dreamt', 'parolles', 'hor', 'carefully', 'fran', 'ber', 'castle', 'elsinore', 'ophelia', 'norway', 'fortinbras', 'osric', 'bernardo', 'guildenstern', 'nephew']
----------L2 Norm-----------
The model's average accuracy is 1.000000
The most informative terms for pos are: ['jest', 'oaths', 'shallow', 'conclusion', 'marvellous', 'signior', 'forsworn', 'lying', 'dote', 'impossible', 'lower', 'reasonable', 'studied', 'fancy', 'afterward', 'conceal', 'advis', 'adverse', 'forswear', 'sheep']
The most informative terms for neg are: ['slain', 'fierce', 'dreamt', 'warlike', 'domestic', '