# Week 3: Basic Document Classification (Part 1)

## Overview 
In labs this week (and next), the focus will be on the application of sentiment analysis. You will be using a corpus of **movie reviews**.

You will be exploring various techniques that can be used to classify the sentiment of the movie reviews as either positive or negative. 

You will be developing your own **Word List** and **Naïve Bayes** classifiers and then comparing them to the **NLTK Naïve Bayes** classifier.

First, we need to download the `movie_reviews` corpus (and NLTK's stopword list, which is used later for pre-processing).

In [1]:
import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/juliewe/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juliewe/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The movie_reviews corpus reader provides a number of useful methods:
   * .categories()
   * .fileids()
   * .words()
   
First, we can use `.categories()` to check the set of labels that have been assigned to the reviews.

In [2]:
from nltk.corpus import movie_reviews

print(movie_reviews.categories())

['neg', 'pos']


We can use `.fileids()` to get all of the file names associated with a particular category.

In [3]:
pos_review_ids=movie_reviews.fileids('pos')
neg_review_ids=movie_reviews.fileids('neg')

print("The number of positive reviews is {}".format(len(pos_review_ids)))
print("The number of negative reviews is {}".format(len(neg_review_ids)))

The number of positive reviews is 1000
The number of negative reviews is 1000


We can use `.words()` to get back word-tokenised reviews.  The argument to `.words()` is the file id of an individual review.

In [4]:
print(movie_reviews.words(pos_review_ids[0]))

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]


In [5]:
type(movie_reviews.words(pos_review_ids[0]))

nltk.corpus.reader.util.StreamBackedCorpusView

Note that the object returned by `movie_reviews.words()` looks and behaves a lot like a list, but it is actually a `StreamBackedCorpusView`. This essentially means that it is not necessarily all held in memory: it is read from disk as needed. If you want to see all of the words at once, you can convert it to a list using the `list()` constructor.

In [6]:
print(list(movie_reviews.words(pos_review_ids[0])))

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', "'", 's', 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'", '80s', 'with', 'a', '12', '-', 'part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', 'the', 'book', '(', 'or', '"', 'grap

## Creating training and testing sets
You will be training and testing various document classifiers. It is essential that the data used in the testing phase is not used during the training phase, since this can lead to overestimating performance. 

We now introduce the `split_data` function (defined in the cell below) which can be used to get separate **training** and **testing** sets.

> Look through the code in the following cell, reading the comments and making sure that you understand each line.

In [7]:
import random # have a look at the documentation at https://docs.python.org/3/library/random.html 


def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given collection of items and ratio:
     - partitions the collection into training and testing, where the proportion in training is ratio,

    :param data: A list (or generator) of documents or doc ids
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data)  #Materialise first: len() and indexing do not work on a generator
    n = len(data)  #Find out number of samples present
    train_indices = random.sample(range(n), int(n * ratio))          #Randomly select training indices
    test_indices = list(set(range(n)) - set(train_indices))   #Other items are testing indices
 
    train = [data[i] for i in train_indices]           #Use training indices to select data
    test = [data[i] for i in test_indices]             #Use testing indices to select data
 
    return (train, test)                       #Return split data
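
As a quick sanity check, the sketch below restates `split_data` compactly and runs it on ten made-up document ids. The ids, the seed value and the `list(data)` conversion are assumptions made purely for illustration; the point is that the two parts are disjoint and together cover the whole collection.

```python
import random


def split_data(data, ratio=0.7):
    """Compact restatement of split_data so that this example runs on its own."""
    data = list(data)  # assumption: materialise so that generators work too
    n = len(data)
    train_indices = random.sample(range(n), int(n * ratio))
    test_indices = sorted(set(range(n)) - set(train_indices))
    train = [data[i] for i in train_indices]
    test = [data[i] for i in test_indices]
    return train, test


random.seed(0)  # arbitrary seed, only to make the demo repeatable
doc_ids = ["doc{}".format(i) for i in range(10)]  # hypothetical file ids

train, test = split_data(doc_ids)

print(len(train), len(test))   # 7 3
print(set(train) & set(test))  # set(): no id appears in both parts
```

With the default ratio of 0.7, the same arithmetic applied to 1000 file ids gives 700 for training and 300 for testing.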
 

Now we can use this function to create training and testing data.  First, we need to create 4 lists:

* file ids of positive docs to go in the training data
* file ids of positive docs to go in the testing data
* file ids of negative docs to go in the training data
* file ids of negative docs to go in the testing data

In [8]:
random.seed(41)  #set the random seeds so these random splits are always the same
pos_train_ids, pos_test_ids = split_data(pos_review_ids)
neg_train_ids, neg_test_ids = split_data(neg_review_ids)


Now we want to create our labelled data sets.  We need to associate each review with its label so that later we can shuffle all of the training data (and the testing data).

### Exercise 1
Write some python code which will construct a training set (`training`) and a test set (`testing`) from the data.  Each set should be a list of pairs where each pair is a list of words and a label, as below:

<code>[([list,of,words],'label'),([list,of,words],'label'),...]</code>

Hint:  You can do this with 4 list comprehensions and list concatenation.

Check the size of `training` and `testing`.  Using a 70\% split, how many should be in each?

In [9]:
training = [(movie_reviews.words(f),'pos') for f in pos_train_ids]+[(movie_reviews.words(f),'neg') for f in neg_train_ids]
testing = [(movie_reviews.words(f),'pos') for f in pos_test_ids]+[(movie_reviews.words(f),'neg') for f in neg_test_ids]

In [10]:
print(len(training))
print(len(testing))

1400
600


## Document Representations

Currently, each review / document is represented as a list of tokens.  In many simple applications, the order of words in a document is deemed irrelevant and we use a bag-of-words representation of the document.  We can create a bag-of-words using a dictionary (as we did in Lab_2_2 when considering the size of the vocabulary) or we can use a library class such as `FreqDist` from `nltk.probability` (or `Counter` from `collections`).  In the cell below, I generate the bag-of-words for the first review in the training set using NLTK's `FreqDist`.  You can think of this as being like a dictionary but with extra benefits.  For example, later on in the lab, we will see that it supports operations which allow the document representations to be added and subtracted.
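
To make the idea concrete without needing any corpus data, here is a minimal sketch using `Counter` from the standard library, which is the class that `FreqDist` builds on. The two toy documents are made up for illustration.

```python
from collections import Counter

# Made-up toy document, already tokenised
tokens = "the film was good , the cast was great".split()

bag = Counter(tokens)      # bag-of-words: maps each token to its count
print(bag["the"])          # 2
print(bag["terrible"])     # 0: unseen words count as zero instead of raising KeyError

# Bags can be added together, which is handy for totalling counts over many documents
other = Counter("the plot was thin".split())
total = bag + other
print(total["the"], total["was"])  # 3 3
```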

In [11]:
from nltk.probability import FreqDist

doc1 = FreqDist(training[0][0])
doc1

FreqDist({',': 24, '.': 18, 'and': 11, 'a': 9, 'to': 8, 'the': 7, 'melvin': 6, 'his': 6, "'": 6, 's': 6, ...})

### Exercise 2.1

Write code to use FreqDist to construct a bag-of-words representation for each document in the training and testing sets.  Store the results in two lists, `training_basic` and `testing_basic`.  Don't lose the annotations as to whether each review is positive or negative!

In [12]:
training_basic=[(FreqDist(wordlist),label) for (wordlist,label) in training]
testing_basic=[(FreqDist(wordlist),label) for (wordlist,label) in testing]

#training_basic=[(FreqDist(item[0]),item[1]) for item in training]

In [13]:
training_basic[0]

(FreqDist({',': 24, '.': 18, 'and': 11, 'a': 9, 'to': 8, 'the': 7, 'melvin': 6, 'his': 6, "'": 6, 's': 6, ...}),
 'pos')

You will notice of course that many of the words in your representations of documents are punctuation and stopwords.  This is because we haven't done any pre-processing of the wordlists.

### Exercise 2.2

Decide which of the following pre-processing steps to apply to the word lists:
* case normalisation
* number normalisation
* punctuation removal
* stopword removal
* stemming / lemmatisation


Apply these pre-processing steps to the original wordlist representations (stored in `training` and `testing`).  Then recreate the bag-of-words representations, storing the results in `training_norm` and `testing_norm`.

In [14]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english')) #a set makes the membership test in normalise much faster than a list

def normalise(wordlist):
    lowered=[word.lower() for word in wordlist] #don't actually need this as already lowered
    filtered=[word for word in lowered if word.isalpha() and word not in stop]
    return filtered

normalise(training[0][0])

['melvin',
 'udall',
 'heartless',
 'man',
 'spends',
 'days',
 'inside',
 'spacious',
 'manhattan',
 'apartment',
 'writing',
 'romance',
 'novels',
 'also',
 'seems',
 'melvin',
 'never',
 'change',
 'one',
 'day',
 'dines',
 'ar',
 'favorite',
 'restaurant',
 'little',
 'mean',
 'normal',
 'waitress',
 'waittress',
 'serve',
 'carol',
 'played',
 'perfection',
 'lovely',
 'sexy',
 'helen',
 'hunt',
 'threatens',
 'serve',
 'shut',
 'asthmatic',
 'son',
 'shut',
 'make',
 'matters',
 'considerably',
 'worse',
 'melvin',
 'obsessive',
 'compulsive',
 'disorder',
 'one',
 'day',
 'gay',
 'artist',
 'neighbor',
 'simon',
 'greg',
 'kinear',
 'talk',
 'soup',
 'fame',
 'oscar',
 'worthy',
 'role',
 'dog',
 'threatens',
 'dismiss',
 'melvon',
 'door',
 'dog',
 'meets',
 'garbage',
 'chute',
 'soon',
 'simon',
 'sadly',
 'beaten',
 'thieveing',
 'burglars',
 'ray',
 'cuba',
 'gooding',
 'jr',
 'simon',
 'agent',
 'takes',
 'dog',
 'verdell',
 'melvin',
 'melvin',
 'dogsit',
 'dog',
 'rathe

In [15]:
training_norm=[(FreqDist(normalise(wordlist)),label) for (wordlist,label) in training]
testing_norm=[(FreqDist(normalise(wordlist)),label) for (wordlist,label) in testing]

training_norm[0]

(FreqDist({'melvin': 6, 'simon': 4, 'dog': 4, 'playing': 3, 'one': 2, 'day': 2, 'serve': 2, 'carol': 2, 'threatens': 2, 'shut': 2, ...}),
 'pos')
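
The `normalise` function above discards numbers altogether, because `isalpha()` rejects them. If you would rather apply number normalisation, one possible sketch is shown below. The function name `normalise_keep_numbers`, the `NUM` placeholder and the tiny hand-picked stopword set are all assumptions made for illustration; in the lab you would use the NLTK stopword list instead.

```python
# Assumption: a tiny hand-picked stopword set stands in for NLTK's English list
STOP = {"the", "a", "of", "in", "and", "was", "it"}


def normalise_keep_numbers(wordlist, stop=STOP):
    """Lowercase, map digit-only tokens to 'NUM', drop punctuation and stopwords."""
    normalised = []
    for word in wordlist:
        word = word.lower()
        if word.isdigit():
            normalised.append("NUM")   # number normalisation
        elif word.isalpha() and word not in stop:
            normalised.append(word)    # keep ordinary content words
        # anything else (punctuation, stopwords, mixed tokens) is dropped
    return normalised


tokens = ["In", "1997", ",", "it", "was", "a", "12", "-", "part", "series"]
print(normalise_keep_numbers(tokens))
# ['NUM', 'NUM', 'part', 'series']
```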

## Creating word lists
The next section will explain how to use a sentiment classifier that bases its decisions on word lists. The classifier requires a list of words indicating positive sentiment, and a second list of words indicating negative sentiment. Given positive and negative word lists, a document's overall sentiment is determined based on counts of occurrences of words that occur in the two lists. In this section we are concerned with the creation of the word lists. We will be considering both hand-crafted lists and automatically generated lists.

### Exercise 3.1

- Create a reasonably long hand-crafted list of words that you think indicate positive sentiment.
- Create a reasonably long hand-crafted list of words that indicate negative sentiment.

Use the following cells to store these lists in the variables `my_positive_word_list` and `my_negative_word_list`.

In [16]:
my_positive_word_list = ["good","great","lovely", "excellent"] # extend this one or put your own list here
my_negative_word_list = ["bad", "terrible", "awful", "dreadful"] # extend this one or put your own list here

Now let's see how often each of those words occurs in total in our positive and negative training data.  First, let's create a total of the FreqDists for positive data and for negative data.  As these are FreqDists (rather than simple dictionaries), we can do this as follows:

In [18]:
pos_freq_dist=FreqDist()
neg_freq_dist=FreqDist()

for reviewDist,label in training_norm:
    if label=='pos':
        pos_freq_dist+=reviewDist
    else:
        neg_freq_dist+=reviewDist
        
pos_freq_dist

FreqDist({'film': 3737, 'one': 2127, 'movie': 1721, 'like': 1285, 'story': 893, 'time': 882, 'good': 859, 'also': 848, 'even': 804, 'well': 762, ...})

In [23]:
pos_freq_dist['bad']

236

In [24]:
neg_freq_dist['bad']

709

In [26]:
words=['bad']

for word in words:
    diff=pos_freq_dist[word]-neg_freq_dist[word]
    print(word,diff)

bad -473


### Exercise 3.2
In the blank code cell below write code that uses the total frequency distributions `pos_freq_dist` and `neg_freq_dist` and the word lists `my_positive_word_list` and `my_negative_word_list` created earlier to determine whether or not the review data conforms to your expectations. In particular, whether:
- the words you expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews
- the words you expected to indicate negative sentiment actually occur more frequently in negative reviews than positive reviews.

You could display your findings in a table using pandas.

In [22]:
def check_expectations(a_word_list,expectation,pos=pos_freq_dist,neg=neg_freq_dist):
    #expectation is a positive number if words are expected to be positive
    #expectation is a negative number if words are expected to be negative
    results=[]
    for word in a_word_list:
        pos_freq=pos.get(word,0)
        neg_freq=neg.get(word,0)
        diff=pos_freq-neg_freq
        if diff*expectation>0:
            print("As expected: for {} difference is {}".format(word,diff))
            results.append((word,diff,'yes'))
        else:
            print("Contrary to expectations: for {} difference is {}".format(word,diff))
            results.append((word,diff,'no'))
            
    return results
            
        
        

In [23]:
results=check_expectations(my_positive_word_list,1)

As expected: for good difference is 52
As expected: for great difference is 254
Contrary to expectations: for lovely difference is 0
As expected: for excellent difference is 73


In [24]:
results+=check_expectations(my_negative_word_list,-1)

As expected: for bad difference is -473
As expected: for terrible difference is -54
As expected: for awful difference is -60
As expected: for dreadful difference is -6


In [25]:
import pandas as pd
df=pd.DataFrame(results,columns=['word','diff','conforms to expectation'])
display(df)

| | word | diff | conforms to expectation |
|---|---|---|---|
| 0 | good | 52 | yes |
| 1 | great | 254 | yes |
| 2 | lovely | 0 | no |
| 3 | excellent | 73 | yes |
| 4 | bad | -473 | yes |
| 5 | terrible | -54 | yes |
| 6 | awful | -60 | yes |
| 7 | dreadful | -6 | yes |


### Exercise 3.3
Now, you are going to create positive and negative word lists automatically from the training data. In order to do this:

1. write two new functions to help with automating the process of generating wordlists.

    - `most_frequent_words` - this function should take THREE arguments: 2 frequency distributions and a natural number, k. It should order words by how much more they occur in one frequency distribution than the other.   It should then return the top k highest scoring words. You might want to use the `most_common` method from the `FreqDist` class - this returns a list of (word, frequency) pairs ordered by frequency.  You might also, or alternatively, want to use Python's built-in `sorted` function.
    - `words_above_threshold` - this function also takes three arguments: 2 frequency distributions and a natural number, k. Again, it should order words by how much more they occur in one distribution than the other.  It should return all of the words that have a score greater than k.

2. Using the training data, create two sets of positive and negative word lists using these functions (1 set with each function). 
3.  Display these 4 lists (possibly in a `Pandas` dataframe?)
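
One possible way to approach step 1 with `sorted` is sketched below. It is only an illustration: `Counter` stands in for `FreqDist`, the helper `signed_differences` and the toy counts are made up, and ties are broken alphabetically purely so that the result is repeatable.

```python
from collections import Counter


def signed_differences(freq_a, freq_b):
    """Map each word to (count in freq_a) - (count in freq_b).

    Both arguments are assumed to be Counter-like, so missing words count as 0.
    """
    vocab = set(freq_a) | set(freq_b)
    return {word: freq_a[word] - freq_b[word] for word in vocab}


def most_frequent_words(freq_a, freq_b, k):
    """Return the k words that occur most often in freq_a relative to freq_b."""
    diffs = signed_differences(freq_a, freq_b)
    ranked = sorted(diffs, key=lambda w: (-diffs[w], w))  # largest difference first
    return ranked[:k]


def words_above_threshold(freq_a, freq_b, k):
    """Return every word whose difference is strictly greater than k."""
    diffs = signed_differences(freq_a, freq_b)
    ranked = sorted(diffs, key=lambda w: (-diffs[w], w))
    return [w for w in ranked if diffs[w] > k]


# Made-up counts for illustration
pos = Counter({"great": 9, "film": 20, "bad": 2, "plot": 5})
neg = Counter({"great": 3, "film": 18, "bad": 10, "plot": 6})

print(most_frequent_words(pos, neg, 2))    # ['great', 'film']
print(words_above_threshold(neg, pos, 0))  # ['bad', 'plot']
```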



In [26]:
posdiff=pos_freq_dist-neg_freq_dist
posdiff

FreqDist({'melvin': 19,
          'udall': 3,
          'man': 89,
          'inside': 19,
          'spacious': 4,
          'manhattan': 2,
          'apartment': 30,
          'romance': 15,
          'novels': 5,
          'also': 300,
          'seems': 25,
          'never': 26,
          'change': 58,
          'one': 194,
          'day': 96,
          'dines': 1,
          'ar': 1,
          'favorite': 15,
          'little': 28,
          'normal': 42,
          'waitress': 9,
          'waittress': 1,
          'serve': 2,
          'carol': 21,
          'perfection': 12,
          'hunt': 9,
          'threatens': 9,
          'shut': 3,
          'asthmatic': 1,
          'son': 42,
          'matters': 3,
          'considerably': 4,
          'obsessive': 7,
          'compulsive': 3,
          'disorder': 5,
          'gay': 21,
          'artist': 36,
          'neighbor': 6,
          'simon': 42,
          'kinear': 1,
          'soup': 1,
          'fame': 13,
   

In [27]:
posdiff.get('excellent',0)

73

In [28]:
posdiff.get('good',0)

52
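
A detail worth knowing when reading these outputs: the `-` operator on a standard-library `Counter` keeps only the entries whose result is positive, and `FreqDist` (being built on `Counter`) appears to behave the same way, which would explain why `posdiff` lists only words that are more frequent in the positive reviews. The sketch below illustrates this with made-up counts, using `Counter` as a stand-in.

```python
from collections import Counter

# Made-up counts for illustration
pos = Counter({"good": 5, "bad": 1, "lovely": 2})
neg = Counter({"good": 3, "bad": 4, "lovely": 2})

diff = pos - neg
print(diff)  # Counter({'good': 2}): 'bad' (-3) and 'lovely' (0) are dropped

# If the negative and zero differences are needed, compute them explicitly
signed = {w: pos[w] - neg[w] for w in set(pos) | set(neg)}
print(signed["bad"])     # -3
print(signed["lovely"])  # 0
```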

In [29]:
posdiff.most_common()

[('film', 756),
 ('life', 384),
 ('also', 300),
 ('great', 254),
 ('story', 220),
 ('world', 216),
 ('many', 213),
 ('films', 212),
 ('best', 211),
 ('one', 194),
 ('well', 184),
 ('family', 157),
 ('american', 157),
 ('jackie', 137),
 ('first', 134),
 ('quite', 128),
 ('although', 128),
 ('performance', 125),
 ('war', 125),
 ('young', 112),
 ('way', 111),
 ('men', 111),
 ('however', 110),
 ('new', 109),
 ('see', 109),
 ('mother', 108),
 ('john', 104),
 ('seen', 104),
 ('job', 104),
 ('star', 103),
 ('people', 101),
 ('love', 101),
 ('perfect', 99),
 ('takes', 97),
 ('different', 97),
 ('day', 96),
 ('always', 92),
 ('may', 91),
 ('true', 91),
 ('disney', 91),
 ('yet', 90),
 ('often', 90),
 ('dark', 90),
 ('man', 89),
 ('years', 88),
 ('gives', 86),
 ('especially', 85),
 ('makes', 85),
 ('black', 85),
 ('time', 83),
 ('city', 83),
 ('cameron', 83),
 ('father', 82),
 ('fiction', 82),
 ('performances', 80),
 ('still', 79),
 ('without', 78),
 ('wars', 78),
 ('truman', 78),
 ('horror', 77)

In [30]:

def most_frequent_words(posfreq,negfreq,topk):
    difference=posfreq-negfreq
    sorteddiff=difference.most_common()
    justwords=[word for (word,freq) in sorteddiff[:topk]]
    return justwords

In [31]:
top_pos=most_frequent_words(pos_freq_dist,neg_freq_dist,50)
print(top_pos)

['film', 'life', 'also', 'great', 'story', 'world', 'many', 'films', 'best', 'one', 'well', 'family', 'american', 'jackie', 'first', 'quite', 'although', 'performance', 'war', 'young', 'way', 'men', 'however', 'new', 'see', 'mother', 'john', 'seen', 'job', 'star', 'people', 'love', 'perfect', 'takes', 'different', 'day', 'always', 'may', 'true', 'disney', 'yet', 'often', 'dark', 'man', 'years', 'gives', 'especially', 'makes', 'black', 'time']


In [32]:
top_neg=most_frequent_words(neg_freq_dist,pos_freq_dist,50)
print(top_neg)

['movie', 'bad', 'plot', 'worst', 'even', 'script', 'could', 'nothing', 'supposed', 'reason', 'get', 'boring', 'stupid', 'least', 'unfortunately', 'better', 'godzilla', 'harry', 'tv', 'know', 'big', 'minutes', 'maybe', 'got', 'looks', 'dull', 'tries', 'guy', 'batman', 'robin', 'thing', 'think', 'dialogue', 'west', 'waste', 'trying', 'wasted', 'mess', 'lame', 'seagal', 'minute', 'awful', 'action', 'half', 'give', 'made', 'worse', 'terrible', 'problem', 'oh']


In [33]:
def above_threshold(posfreq,negfreq,threshold):
    difference=posfreq-negfreq
    sorteddiff=difference.most_common()
    filtered=[w for (w,f) in sorteddiff if f>threshold]
    return filtered

In [34]:
above100pos = above_threshold(pos_freq_dist,neg_freq_dist,100)
print(above100pos)

['film', 'life', 'also', 'great', 'story', 'world', 'many', 'films', 'best', 'one', 'well', 'family', 'american', 'jackie', 'first', 'quite', 'although', 'performance', 'war', 'young', 'way', 'men', 'however', 'new', 'see', 'mother', 'john', 'seen', 'job', 'star', 'people', 'love']


In [35]:
above100neg = above_threshold(neg_freq_dist,pos_freq_dist,100)
print(above100neg)

['movie', 'bad', 'plot', 'worst', 'even', 'script', 'could', 'nothing', 'supposed', 'reason', 'get', 'boring', 'stupid']


## Creating a word list based classifier
Now you have a number of word lists for use with a classifier. 
> Make sure you understand the following code, which will be used as the basis for creating a word list based classifier.

In [36]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 

    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, words): 
        score = 0
        
        # add code here that assigns an appropriate value to score
        return "neg" if score < 0 else "pos"

    ##we don't actually need to define the classify_many method as it is provided in ClassifierI
    #def classify_many(self, docs): 
    #    return [self.classify(doc) for doc in docs] 

    def labels(self): 
        return ("pos", "neg")

#Example usage:

classifier = SimpleClassifier(my_positive_word_list, my_negative_word_list)
classifier.classify("This movie was great".split())

'pos'

### Exercise 3.1

- Copy the above code cell and move it to below this one. Then complete the `classify` method in the above code as specified below.
- Test your classifier on several very simple hand-crafted examples to verify that you have implemented `classify` correctly.

The classifier is initialised with a list of positive words, and a list of negative words. The words of a document are passed to the `classify` method (which is partially completed in the above code fragment). The `classify` method should be defined so that each occurrence of a negative word decrements `score`, and each occurrence of a positive word increments `score`. 
- For `score` less than 0, `"neg"` for negative should be returned.
- For `score` greater than 0, `"pos"` for positive should be returned.
- For `score` of 0, the classification decision should be made randomly (see https://docs.python.org/3/library/random.html).


In [37]:

from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 

    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, doc): 
        #doc is a FreqDist
        score = 0
        
        # add code here that assigns an appropriate value to score
        for word,value in doc.items():
            if word in self._pos:
                score+=value
            if word in self._neg:
                score-=value
        
        return "neg" if score < 0 else "pos" 

    ##we don't actually need to define the classify_many method as it is provided in ClassifierI
    #def classify_many(self, docs): 
    #    return [self.classify(doc) for doc in docs] 

    def labels(self): 
        return ("pos", "neg")

#Example usage:

classifier = SimpleClassifier(my_positive_word_list, my_negative_word_list)
classifier.classify(FreqDist("This movie was dreadful".split()))

'neg'
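
Note that the cell above returns `"pos"` whenever `score` is 0, rather than choosing at random as the exercise asks. A minimal sketch of the random tie-break is given below as a stand-alone function. The function name `classify_with_ties`, the `rng` parameter and the toy word lists are assumptions for illustration, and `Counter` stands in for `FreqDist`.

```python
import random
from collections import Counter


def classify_with_ties(doc, pos_words, neg_words, rng=random):
    """doc maps word -> count. Returns 'pos' or 'neg', choosing randomly on a tie."""
    score = 0
    for word, count in doc.items():
        if word in pos_words:
            score += count
        if word in neg_words:
            score -= count
    if score == 0:
        return rng.choice(["pos", "neg"])  # no evidence either way
    return "pos" if score > 0 else "neg"


pos_words = {"good", "great"}  # made-up word lists
neg_words = {"bad", "awful"}

print(classify_with_ties(Counter("a great great film".split()), pos_words, neg_words))  # pos
print(classify_with_ties(Counter("an awful plot".split()), pos_words, neg_words))       # neg

# Passing a seeded generator makes the tie-break repeatable
tie = classify_with_ties(Counter("good but bad".split()), pos_words, neg_words,
                         rng=random.Random(0))
print(tie in ("pos", "neg"))  # True
```

Accepting the random generator as a parameter is a design choice that keeps the function testable; inside a class you could equally call `random.choice` directly.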

### Exercise 3.2
* Extend your SimpleClassifier class so that it has a `train` function which will derive the wordlists from training data.  You could build a separate class for each way of automatically deriving wordlists (which both inherit from SimpleClassifier) OR a single class which takes an extra parameter at training time.
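
As a rough illustration of the "single class with an extra parameter" option, here is a stand-alone sketch. It deliberately does not inherit from NLTK's `ClassifierI` so that it runs without NLTK; the class name `WordListClassifier`, the `mode` parameter and the toy data are hypothetical, and `Counter` stands in for `FreqDist`.

```python
from collections import Counter


class WordListClassifier:
    """Hypothetical sketch: derives its word lists from training data."""

    def __init__(self, k, mode="top"):
        if mode not in ("top", "threshold"):
            raise ValueError("mode must be 'top' or 'threshold'")
        self._k = k
        self._mode = mode
        self._pos = []  # filled in by train()
        self._neg = []

    def _select(self, freq_a, freq_b):
        # Counter subtraction keeps only words more frequent in freq_a
        ranked = (freq_a - freq_b).most_common()
        if self._mode == "top":
            return [w for w, _ in ranked[:self._k]]
        return [w for w, diff in ranked if diff > self._k]

    def train(self, training_data):
        # training_data is a list of (bag-of-words, label) pairs, labels 'pos'/'neg'
        totals = {"pos": Counter(), "neg": Counter()}
        for doc, label in training_data:
            totals[label].update(doc)
        self._pos = self._select(totals["pos"], totals["neg"])
        self._neg = self._select(totals["neg"], totals["pos"])

    def classify(self, doc):
        score = sum(c for w, c in doc.items() if w in self._pos)
        score -= sum(c for w, c in doc.items() if w in self._neg)
        return "neg" if score < 0 else "pos"


# Made-up training data
toy = [
    (Counter({"great": 3, "film": 2}), "pos"),
    (Counter({"awful": 2, "film": 2}), "neg"),
]

clf = WordListClassifier(k=1, mode="top")
clf.train(toy)
print(clf.classify(Counter({"awful": 1})))  # neg
print(clf.classify(Counter({"great": 1})))  # pos
```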

In [38]:
class SimpleClassifier_mf(SimpleClassifier):
    
    def __init__(self,k):
        super().__init__([],[]) #word lists stay empty until train is called
        self._k=k
    
    def train(self,training_data):
        
        pos_freq_dist=FreqDist()
        neg_freq_dist=FreqDist()

        for reviewDist,label in training_data:
            if label=='pos':
                pos_freq_dist+=reviewDist
            else:
                neg_freq_dist+=reviewDist
                
        self._pos=most_frequent_words(pos_freq_dist,neg_freq_dist,self._k)
        self._neg=most_frequent_words(neg_freq_dist,pos_freq_dist,self._k)
    
    

In [39]:
movieclassifier=SimpleClassifier_mf(100)

In [40]:
movieclassifier.train(training_norm)

Try out your classifier on the test data.  We will look at how to evaluate classifiers in the next part, but in an ideal world, most of the positive test items will have been classified as 'pos' and most of the negative test items will have been classified as 'neg'.  Note that the `classify_many` method takes a list of unlabelled documents, so you can't give it a list of pairs (where each pair is a doc and a label).  You can either use a list comprehension or <code>zip(*list_of_pairs)</code> to split a list of pairs into a pair of lists.
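
The sketch below shows both ways of separating documents from labels, followed by a simple tally of (gold label, predicted label) pairs. The document placeholders and the predictions are made up for illustration; in the lab the documents would be FreqDists and the predictions would come from your classifier.

```python
from collections import Counter

# Placeholders standing in for (FreqDist, label) pairs
labelled = [("doc_a", "pos"), ("doc_b", "neg"), ("doc_c", "pos")]

# Option 1: zip(*pairs) turns a list of pairs into a pair of tuples
docs, gold = zip(*labelled)
print(docs)  # ('doc_a', 'doc_b', 'doc_c')
print(gold)  # ('pos', 'neg', 'pos')

# Option 2: a list comprehension that keeps only the documents
docs_list = [doc for doc, _ in labelled]
print(docs_list)  # ['doc_a', 'doc_b', 'doc_c']

# Made-up predictions, tallied against the gold labels
predicted = ["pos", "pos", "pos"]
tally = Counter(zip(gold, predicted))
print(tally[("pos", "pos")], tally[("neg", "pos")])  # 2 1
```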

In [41]:
movieclassifier.classify(FreqDist("I hated this movie".split()))

'neg'

In [42]:
testing,labels=zip(*testing_norm)
movieclassifier.classify_many(testing)

['neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'neg',
 'pos',
 'pos',


In [43]:
class SimpleClassifier_ot(SimpleClassifier):
    
    def __init__(self,k):
        super().__init__([],[]) #word lists stay empty until train is called
        self._k=k
    
    def train(self,training_data):
        
        pos_freq_dist=FreqDist()
        neg_freq_dist=FreqDist()

        for reviewDist,label in training_data:
            if label=='pos':
                pos_freq_dist+=reviewDist
            else:
                neg_freq_dist+=reviewDist
                
        self._pos=above_threshold(pos_freq_dist,neg_freq_dist,self._k)
        self._neg=above_threshold(neg_freq_dist,pos_freq_dist,self._k)
    

In [44]:
movieclassifier2=SimpleClassifier_ot(50)
movieclassifier2.train(training_norm)

In [45]:
movieclassifier2.classify_many(testing)

['pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
 'pos',
