CSCI 3832: Lecture 8, investigating classifiers, precision, recall
===========
1/31/2020, Spring 2020, Muzny

Relevant textbook sections: 4.1, 4.3, 4.7

Today, we'll be spending our time investigating some classifiers that we've trained for you.

All three of these classifiers are Naïve Bayes classifiers. For a given new, unlabeled document, they calculate:

$$ P(feature_1, feature_2, feature_3, ..., feature_n | c)P(c)$$

where $c$ is a candidate class. They then select the class for which this product is largest and assign it as the label of the new document.
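
As a rough illustration of that selection step, here is a toy sketch. It assumes the usual Naïve Bayes simplification (features are conditionally independent given the class), uses made-up probabilities, and introduces hypothetical names (`priors`, `likelihoods`, `score`, `predict`). It is not how the pickled classifiers were built.

```python
import math

# Made-up numbers for illustration only; not taken from the trained classifiers.
priors = {'0': 0.5, '1': 0.5}
likelihoods = {
    '0': {'good': 0.1, 'bad': 0.4},
    '1': {'good': 0.4, 'bad': 0.1},
}

def score(features, c, unseen=1e-6):
    # log P(c) + sum of log P(feature_i | c), assuming conditional independence.
    # `unseen` is an assumed fallback probability for features missing from the table.
    total = math.log(priors[c])
    for feature in features:
        total += math.log(likelihoods[c].get(feature, unseen))
    return total

def predict(features):
    # Pick the candidate class with the largest score.
    return max(priors, key=lambda c: score(features, c))

print(predict(['good']))  # prints 1
```

Working in log space avoids multiplying many small probabilities together, and it does not change which class wins.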


Task 1: Which Classifier is Which?
-------------------------
We have given you 3 Naïve Bayes classifiers. All three of these are binary classifiers that choose between the label '0' or '1' (these are strings).

- one of these classifiers is an authorship attributor
- one of these classifiers is a language identifier
- one of these classifiers is a sentiment analyser

Your first job is to conduct experiments to determine two things:
1. Which classifier is which?
2. What specific classes do you believe that they are choosing between? (what are better labels for each classifier than '0' and '1'?)
    1. Note: this is a difficult task. It is of utmost importance that you consider the particular data set that they were trained on. I will tell you that they were trained using some of [nltk's available corpora](http://www.nltk.org/nltk_data/).

Initial guesses:

- Authorship attributor: '0' vs. '1' is probably either "this author or not" or a choice between two specific authors.
- Language identifier: probably '0' for not English and '1' for English.
- Sentiment analyser: probably negative emotion vs. positive emotion.

In [1]:
#TODO: Write your names here
# Names: Kelley Kelley and Jake Swartwout
# Feel free to work in groups of 2 - 3/talk to your neighbors

# You'll be turning this notebook in at the end of lecture today 
# as a pdf
# File -> Download As -> .html -> open in a browser -> print to pdf
# (one submission per group)
# Please make a comment on your submission with your name and the name(s)
# of your partners as well!

In [2]:
# load your trained classifiers from pickled files
# (we've already trained your classifiers for you)
import pickle
import matplotlib.pyplot as plt # for graphing
#import nltk  # not necessary, but you can uncomment if you want

# add more imports here as you would like

In [3]:
# This function converts a list of words so that they are featurized
# for nltk's format for bag-of-words
# params:
# words - list of words where each element is a single word 
# return: dict mapping every word to True
def word_feats(words):
    return dict([(word, True) for word in words])

with open('classifier1.pickle', 'rb') as f:
    classifier1 = pickle.load(f)

with open('classifier2.pickle', 'rb') as f:
    classifier2 = pickle.load(f)

with open('classifier3.pickle', 'rb') as f:
    classifier3 = pickle.load(f)

# in a list, if you find that helpful
classifiers = [classifier1, classifier2, classifier3]

In [4]:
# Here's an example of how to run a test sentence through the classifiers
# edit at your leisure
test = "this is a test sentence"
# you can either split on whitespace or use nltk's word_tokenize
featurized = word_feats(test.split()) 
for classifier in classifiers:
    print(classifier.prob_classify(featurized).samples())  # will tell you what samples are available
    print(classifier.prob_classify(featurized).prob('0'))  # get the probability for class '0'
    print(classifier.prob_classify(featurized).prob('1'))  # get the probability for class '1'
    print(classifier.classify(featurized))  # just get the label that it wants to assign

dict_keys(['0', '1'])
0.6325082240556184
0.3674917759443814
0
dict_keys(['0', '1'])
3.7824855426585115e-08
0.9999999621751425
1
dict_keys(['0', '1'])
0.867841315914037
0.1321586840859627
0


In [5]:
# TODO: put in as many experiments as you'd like here (and feel free to add more cells as needed)
# we recommend testing a variety of sentences. You can make these up or get them from sources
# on the web
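
Since every experiment runs the same loop over sentences and classifiers, a small helper can cut down on repetition. The sketch below is one possible shape for it: `run_experiments`, `StubDist`, and `StubClassifier` are hypothetical names, and the stubs only exist so the example runs without the pickle files. With the real classifiers you would pass the `classifiers` list instead.

```python
def word_feats(words):
    # Same featurization as in the notebook: every word maps to True.
    return {word: True for word in words}

def run_experiments(classifiers, sentences):
    """Collect (sentence, classifier number, P('0'), P('1'), label) rows."""
    rows = []
    for sentence in sentences:
        featurized = word_feats(sentence.split())
        for number, classifier in enumerate(classifiers, start=1):
            dist = classifier.prob_classify(featurized)  # compute once, reuse
            rows.append((sentence, number, dist.prob('0'), dist.prob('1'),
                         classifier.classify(featurized)))
    return rows

# Hypothetical stand-ins so the sketch runs without the pickle files.
class StubDist:
    def __init__(self, p1):
        self.p1 = p1

    def prob(self, label):
        return self.p1 if label == '1' else 1 - self.p1

class StubClassifier:
    def __init__(self, p1):
        self.p1 = p1

    def prob_classify(self, featurized):
        return StubDist(self.p1)

    def classify(self, featurized):
        return '1' if self.p1 >= 0.5 else '0'

stubs = [StubClassifier(0.25), StubClassifier(0.75)]
for row in run_experiments(stubs, ["good amazing"]):
    print(row)
```

Returning rows instead of printing inside the loop makes it easier to compare classifiers side by side or to graph the probabilities later.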

In [21]:
print("emotions tests")
emotionstests = []
emotionstests.append("bad negative")
emotionstests.append("suck worst sad")
emotionstests.append("good amazing")
emotionstests.append("beautiful wonderful happy")
i = 1
for words in emotionstests:
    i = 1
    for classifier in classifiers:
        print("----------------")
        print("Classifier: ", i)
        i += 1
        featurized = word_feats(words.split())
        print(words)
        print("P(0): ", classifier.prob_classify(featurized).prob('0'))
        print("P(1): ", classifier.prob_classify(featurized).prob('1'))
        print("Label: ", classifier.classify(featurized))
        


emotions tests
----------------
Classifier:  1
bad negative
P(0):  0.5604746317512275
P(1):  0.4395253682487722
Label:  0
----------------
Classifier:  2
bad negative
P(0):  0.00039311220731488405
P(1):  0.9996068877926841
Label:  1
----------------
Classifier:  3
bad negative
P(0):  0.8255161545582594
P(1):  0.17448384544174098
Label:  0
----------------
Classifier:  1
suck worst sad
P(0):  0.8340384311244659
P(1):  0.16596156887553373
Label:  0
----------------
Classifier:  2
suck worst sad
P(0):  0.008065476075267777
P(1):  0.9919345239247326
Label:  1
----------------
Classifier:  3
suck worst sad
P(0):  0.3137165678868946
P(1):  0.6862834321131046
Label:  1
----------------
Classifier:  1
good amazing
P(0):  0.3724219126783736
P(1):  0.6275780873216263
Label:  1
----------------
Classifier:  2
good amazing
P(0):  0.0009919164649184365
P(1):  0.9990080835350811
Label:  1
----------------
Classifier:  3
good amazing
P(0):  0.9207732071190304
P(1):  0.07922679288096905
Label:  0
----

In [23]:
print("English tests")
englishtests = []
englishtests.append("This is english")
englishtests.append("This is also english")
englishtests.append("fjdjf djaf fdksjf")
englishtests.append("&#^$^@&!#*&%&#&$")
englishtests.append("Just curious & if $50 this; is it english?!?!")
englishtests.append("你哄我")
for words in englishtests:
    i = 1
    for classifier in classifiers:
        print("----------------")
        print("Classifier: ", i)
        i += 1
        featurized = word_feats(words.split())
        print(words)
        print("P(0): ", classifier.prob_classify(featurized).prob('0'))
        print("P(1): ", classifier.prob_classify(featurized).prob('1'))
        print("Label: ", classifier.classify(featurized))

English tests
----------------
Classifier:  1
This is english
P(0):  0.43136194115751164
P(1):  0.5686380588424879
Label:  1
----------------
Classifier:  2
This is english
P(0):  7.873301860098114e-06
P(1):  0.9999921266981396
Label:  1
----------------
Classifier:  3
This is english
P(0):  0.7461045155824994
P(1):  0.2538954844175007
Label:  0
----------------
Classifier:  1
This is also english
P(0):  0.37721130109440043
P(1):  0.6227886989055992
Label:  1
----------------
Classifier:  2
This is also english
P(0):  9.588705428952318e-09
P(1):  0.9999999904112946
Label:  1
----------------
Classifier:  3
This is also english
P(0):  0.9349539620177195
P(1):  0.06504603798228063
Label:  0
----------------
Classifier:  1
fjdjf djaf fdksjf
P(0):  0.5
P(1):  0.5
Label:  1
----------------
Classifier:  2
fjdjf djaf fdksjf
P(0):  0.5256709451575262
P(1):  0.47432905484247373
Label:  0
----------------
Classifier:  3
fjdjf djaf fdksjf
P(0):  0.8372395833333334
P(1):  0.16276041666666669
Labe

In [20]:
print("Authorship")
authors = []
# these are the two authors you always mention, so it's probably one of these
authors.append("Harry Potter")
authors.append("Where art thou Romeo")
for words in authors:
    i = 1
    for classifier in classifiers:
        print("----------------")
        print("Classifier: ", i)
        i += 1
        featurized = word_feats(words.split())
        print(words)
        print("P(0): ", classifier.prob_classify(featurized).prob('0'))
        print("P(1): ", classifier.prob_classify(featurized).prob('1'))
        print("Label: ", classifier.classify(featurized))

Authorship
----------------
Classifier:  1
Harry Potter
1
----------------
Classifier:  2
Harry Potter
0
----------------
Classifier:  3
Harry Potter
0
----------------
Classifier:  1
Where art thou Romeo
1
----------------
Classifier:  2
Where art thou Romeo
1
----------------
Classifier:  3
Where art thou Romeo
1


TODO: Answer the questions outlined at the beginning of this task here (please keep __bold__ formatting in this notebook):

1. Which classifier is which?
    1. classifier1 is __Emotions. It responded to emotion words.__
    1. classifier2 is __English. It was 1 every time I used English while the others varied. It was also 0 when I didn't use English__
    1. classifier3 is __Authorship. Process of elimination.__
2. What specific classes do you believe that they are choosing between?
    1. classifier1's '0' label should be __NEGATIVE (I think it is more specific, maybe sadness, but I'm not sure; from Jake's graphs we determined it is a negative review, where "angry" and "hurtful" are especially polarizing toward '0')__ and its '1' label should be __POSITIVE (I think happy specifically; from Jake's graphs, good movie reviews)__
    1. classifier2's '0' label should be __NOT ENGLISH (I think it is a specific other language because Chinese was very neutral; Niko says it's Spanish)__ and its '1' label should be __English__
    1. classifier3's '0' label should be __NOT SHAKESPEARE (not sure who)__ and its '1' label should be __SHAKESPEARE__

Task 2: Investigating Accuracy, Precision, and Recall
---------------------------------------------
Textbook: 4.7

When we are determining how well a classifier is doing, we can look at overall accuracy:

$$ accuracy = \frac{true_{pos} + true_{neg}}{true_{pos} + false_{pos} + true_{neg} + false_{neg}} $$



In [None]:
# TODO: implement this accuracy function, 
# then test the accuracy of two of the three classifiers from task 1.

# Params: gold_labels, a list of labels assigned by hand ("truth")
# predicted_labels, a corresponding list of labels predicted by the system
# return: double accuracy (a number from 0 to 1)
def accuracy(gold_labels, predicted_labels):
    pass


# test the accuracy of two of your classifiers.
# Note: this requires knowing what labels your test data should have!
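
A minimal sketch of one way to compute accuracy from two parallel label lists, assuming a prediction counts as correct exactly when it equals the gold label. The name `accuracy_sketch` and the example labels are made up for illustration, and returning 0.0 for empty input is a choice made here, not something the worksheet specifies.

```python
def accuracy_sketch(gold_labels, predicted_labels):
    # Both lists must line up item by item.
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_labels:
        return 0.0  # assumed convention for empty input
    correct = sum(1 for gold, predicted in zip(gold_labels, predicted_labels)
                  if gold == predicted)
    return correct / len(gold_labels)

gold = ['1', '0', '1', '1']
predicted = ['1', '0', '0', '1']
print(accuracy_sketch(gold, predicted))  # prints 0.75
```

Counting matches covers both true positives and true negatives, so the result agrees with the formula above without tallying the four cells separately.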

Next (if you get this far):

Often, however, it is more useful to look at __precision__ and __recall__ to determine how well a classifier is doing. This is especially important if we're dealing with imbalanced classes (one class occurs more frequently than another).

$$ precision = \frac{true_{pos}}{true_{pos} + false_{pos}} $$



$$ recall = \frac{true_{pos}}{true_{pos} + false_{neg}} $$

To make this calculation, we'll need to choose which label is associated with "positive" and which is associated with "negative". For our purposes, we'll choose the label '1' to be our "positive" label.

Answer the following questions:

1. Suppose you wanted a very precise system, but didn't care about recall. How would you achieve this?
    1. __YOUR ANSWER HERE__

2. Suppose you wanted a system with the best recall, but didn't care about precision. How would you achieve this?
    1. __YOUR ANSWER HERE__
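
One way to explore both questions empirically is to vary the probability cutoff at which the system answers '1' and watch how precision and recall move. The sketch below uses made-up P('1') scores and gold labels, and a hypothetical helper `precision_recall_at`. Reporting 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_at(threshold, scores, gold_labels, target_label='1'):
    # Predict the target label whenever the score reaches the threshold.
    predicted_positive = [score >= threshold for score in scores]
    pairs = list(zip(predicted_positive, gold_labels))
    tp = sum(1 for pos, gold in pairs if pos and gold == target_label)
    fp = sum(1 for pos, gold in pairs if pos and gold != target_label)
    fn = sum(1 for pos, gold in pairs if not pos and gold == target_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # assumed convention
    return precision, recall

# Made-up scores (P('1') per document) and gold labels, for illustration only.
scores = [0.95, 0.80, 0.60, 0.40, 0.20]
gold = ['1', '1', '0', '1', '0']

for threshold in (0.9, 0.5, 0.0):
    print(threshold, precision_recall_at(threshold, scores, gold))
```

With the real classifiers, the scores would come from `prob_classify(featurized).prob('1')` on a labeled test set.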


In [None]:
# TODO: implement the precision and recall functions, 
# then test the precision/recall of two of the three classifiers from task 1.

# Params: gold_labels, a list of labels assigned by hand ("truth")
# predicted_labels, a corresponding list of labels predicted by the system
# target_label (default value '1') - the label associated with "positives"
# return: double precision (a number from 0 to 1)
def precision(gold_labels, predicted_labels, target_label = '1'):
    pass

# Params: gold_labels, a list of labels assigned by hand ("truth")
# predicted_labels, a corresponding list of labels predicted by the system
# target_label (default value '1') - the label associated with "positives"
# return: double recall (a number from 0 to 1)
def recall(gold_labels, predicted_labels, target_label = '1'):
    pass
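
A hedged sketch of how precision and recall could be computed from two parallel label lists, following the formulas above. The names `precision_sketch` and `recall_sketch` and the example labels are made up for illustration, and returning 0.0 when the denominator is zero is an assumed convention.

```python
def precision_sketch(gold_labels, predicted_labels, target_label='1'):
    pairs = list(zip(gold_labels, predicted_labels))
    # True positives: predicted the target label and the gold label agrees.
    tp = sum(1 for gold, pred in pairs if pred == target_label and gold == target_label)
    # False positives: predicted the target label but the gold label differs.
    fp = sum(1 for gold, pred in pairs if pred == target_label and gold != target_label)
    return tp / (tp + fp) if (tp + fp) else 0.0  # assumed convention

def recall_sketch(gold_labels, predicted_labels, target_label='1'):
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for gold, pred in pairs if pred == target_label and gold == target_label)
    # False negatives: gold is the target label but the prediction missed it.
    fn = sum(1 for gold, pred in pairs if pred != target_label and gold == target_label)
    return tp / (tp + fn) if (tp + fn) else 0.0  # assumed convention

gold = ['1', '0', '1', '1', '0']
predicted = ['1', '1', '0', '1', '0']
print(precision_sketch(gold, predicted))  # 2 true positives, 1 false positive
print(recall_sketch(gold, predicted))     # 2 true positives, 1 false negative
```

Passing `target_label='0'` treats the other class as "positive", which is useful for checking how the classifier does on each class separately.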