# Week 3: Basic Document Classification (Part 1)

## Overview
In labs this week (and next), the focus will be on the application of sentiment analysis. You will be using a corpus of **movie reviews**.

You will be exploring various techniques that can be used to classify the sentiment of the movie reviews as either positive or negative.

You will be developing your own **Word List** and **Naïve Bayes** classifiers and then comparing them to the **NLTK Naïve Bayes** classifier.

First, we will need to download the movie_review corpus.

In [2]:
import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The movie_reviews corpus reader provides a number of useful methods:
   * .categories()
   * .fileids()
   * .words()
   
First, we can use `.categories()` to check the set of labels with which the reviews have been labelled

In [3]:
from nltk.corpus import movie_reviews

print(movie_reviews.categories())

['neg', 'pos']


We can use `.fileids()` to get all of the file names associated with a particular category.

In [4]:
pos_review_ids=movie_reviews.fileids('pos')
neg_review_ids=movie_reviews.fileids('neg')

print("The number of positive reviews is {}".format(len(pos_review_ids)))
print("The number of negative reviews is {}".format(len(neg_review_ids)))


The number of positive reviews is 1000
The number of negative reviews is 1000


We can use `.words()` to get back word-tokenised reviews.  The argument to `.words()` is the file id of an individual review.

In [5]:
print(movie_reviews.words(pos_review_ids[0]))

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]


In [6]:
type(movie_reviews.words(pos_review_ids[0]))

Note, the object returned by `movie_reviews.words()` looks a lot like a list (and behaves a lot like a list) - but it is actually a `StreamBackedCorpusView`.  This essentially means it is not necessarily all in memory  - it is retrieved from disk as needed.  If you want to see all of the words at once then you can convert it to a list using the `list()` constructor.  

In [7]:
print(list(movie_reviews.words(pos_review_ids[0])))

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',', 'superman', ',', 'spawn', ')', ',', 'or', 'geared', 'toward', 'kids', '(', 'casper', ')', 'or', 'the', 'arthouse', 'crowd', '(', 'ghost', 'world', ')', ',', 'but', 'there', "'", 's', 'never', 'really', 'been', 'a', 'comic', 'book', 'like', 'from', 'hell', 'before', '.', 'for', 'starters', ',', 'it', 'was', 'created', 'by', 'alan', 'moore', '(', 'and', 'eddie', 'campbell', ')', ',', 'who', 'brought', 'the', 'medium', 'to', 'a', 'whole', 'new', 'level', 'in', 'the', 'mid', "'", '80s', 'with', 'a', '12', '-', 'part', 'series', 'called', 'the', 'watchmen', '.', 'to', 'say', 'moore', 'and', 'campbell', 'thoroughly', 'researched', 'the', 'subject', 'of', 'jack', 'the', 'ripper', 'would', 'be', 'like', 'saying', 'michael', 'jackson', 'is', 'starting', 'to', 'look', 'a', 'little', 'odd', '.', 'the', 'book', '(', 'or', '"', 'grap

## Creating training and testing sets
You will be training and testing various document classifiers. It is essential that the data used in the testing phase is not used during the training phase, since this can lead to overestimating performance.

We now introduce the `split_data` function (defined in the cell below) which can be used to get separate **training** and **testing** sets.

> Look through the code in the following cell, reading the comments and making sure that you understand each line.

In [8]:
import random # have a look at the documentation at https://docs.python.org/3/library/random.html


def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given collection of items and ratio:
     - partitions the collection into training and testing, where the proportion in training is ratio,

    :param data: A list (or generator) of documents or doc ids
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the
            pair is a list of the training data and the second is a list of the test data.
    """

    n = len(data)  #Found out number of samples present.  data could be a list or a generator
    train_indices = random.sample(range(n), int(n * ratio))          #Randomly select training indices
    test_indices = list(set(range(n)) - set(train_indices))   #Other items are testing indices

    train = [data[i] for i in train_indices]           #Use training indices to select data
    test = [data[i] for i in test_indices]             #Use testing indices to select data

    return (train, test)                       #Return split data


Now we can use this function to create training and testing data.  First, we need to create 4 lists:
    * file ids  of positive docs to go in the training data
    * file ids of positive docs to go in the testing data
    * file ids of negative docs to go in the training data
    * file ids of negative docs to go in the testing data

In [9]:
random.seed(41)  #set the random seeds so these random splits are always the same
pos_train_ids, pos_test_ids = split_data(pos_review_ids)
neg_train_ids, neg_test_ids = split_data(neg_review_ids)


Now, we want to create our labelled data sets.   We need to associate each review with its label so that later we can shuffle up all of the training data (and the testing data)

### Exercise 1
Write some python code which will construct a training set (`training`) and a test set (`testing`) from the data.  Each set should be a list of pairs where each pair is a list of words and a label, as below:

<code>[([list,of,words],'label'),([list,of,words],'label'),...]</code>

Hint:  You can do this with 4 list comprehensions and list concatenation.

Check the size of `training` and `testing`.  Using a 70\% split, how many should be in each?

In [10]:
pos_train = [(movie_reviews.words(sentence), "pos") for sentence in pos_train_ids]
pos_test = [(movie_reviews.words(sentence), "pos")  for sentence in pos_test_ids]
neg_train = [(movie_reviews.words(sentence), "neg")  for sentence in neg_train_ids]
neg_test = [(movie_reviews.words(sentence), "neg")  for sentence in neg_test_ids]

train_data = pos_train + neg_train
test_data = neg_test + pos_test

train_data[0]

(['melvin', 'udall', 'is', 'a', 'heartless', 'man', '.', ...], 'pos')

In [11]:
def labelled_data(pos_data, neg_data):
  pos_train_data, pos_test_data = split_data(pos_data)
  neg_train_data, neg_test_data = split_data(neg_data)

  pos_train = [(movie_reviews.words(sentence), "pos") for sentence in pos_train_ids]
  pos_test = [(movie_reviews.words(sentence), "pos")  for sentence in pos_test_ids]
  neg_train = [(movie_reviews.words(sentence), "neg")  for sentence in neg_train_ids]
  neg_test = [(movie_reviews.words(sentence), "neg")  for sentence in neg_test_ids]



  return pos_train, pos_test, neg_train, neg_test




In [12]:
pos_train, pos_test, neg_train, neg_test = labelled_data(pos_review_ids, neg_review_ids)

training = pos_train + neg_train
testing = neg_test + pos_test
concatenate_data = training + testing
print(f"Length of training data: {len(training)}, and it should be {int(len(concatenate_data)*0.7)}")
print(f"Length of test data: {len(testing)}, and it should be {int(len(concatenate_data)*0.3)}")

Length of training data: 1400, and it should be 1400
Length of test data: 600, and it should be 600


## Document Representations

*   List item
*   List item



Currently, each review / document is represented as a list of tokens.  In many simple applications, the order of words in a document is deemed irrelevant and we use a bag-of-words representation of the document.  We can create a bag-of-words using a dictionary (as we did in Lab_2_2 when considering the size of the vocabulary) or we can use a library function such as FreqDist from nltk.probability (or Counter from Collections).  In the cell below, I generate the bag-of-words for the first review in the training set using nltk's FreqDist.  You can think of this as like a dictionary but with extra benefits.  For example, later on in the lab, we will see it has useful methods which allow the document representations to be added and subtracted.

In [13]:
training[:2]

[(['melvin', 'udall', 'is', 'a', 'heartless', 'man', '.', ...], 'pos'),
 (['why', 'do', 'people', 'hate', 'the', 'spice', ...], 'pos')]

In [14]:
from nltk.probability import FreqDist

doc1 = FreqDist(training[0][0])
doc1

FreqDist({',': 24, '.': 18, 'and': 11, 'a': 9, 'to': 8, 'the': 7, 'melvin': 6, 'his': 6, "'": 6, 's': 6, ...})

### Exercise 2.1

Write code to use FreqDist to construct a bag-of-words representation for each document in the training and testing sets.  Store the results in two lists, `training_basic` and `testing_basic`.  Don't lose the annotations as to whether each review is positive or negative!  

In [15]:
training_basic = [(FreqDist(words), label) for words, label in training]
testing_basic = [(FreqDist(words), label) for words, label in testing]

In [16]:
print(training_basic[0])
print(testing_basic[0])

(FreqDist({',': 24, '.': 18, 'and': 11, 'a': 9, 'to': 8, 'the': 7, 'melvin': 6, 'his': 6, "'": 6, 's': 6, ...}), 'pos')
(FreqDist({',': 37, '.': 33, 'the': 29, 'a': 22, 'and': 20, "'": 16, 'it': 13, 'is': 11, 'to': 11, 'as': 10, ...}), 'neg')


You will notice of course that many of the words in your representations of documents are punctuation and stopwords.  This is because we haven't done any pre-processing of the wordlists.

### Exercise 2.2

Decide which of the following pre-processing steps to apply to the word lists:-
* case normalisation
* number normalisation
* punctuation removal
* stopword removal
* stemmming / lemmatisation


Apply these preprocessing steps to the original wordlist representations (stored in `training` and `testing`).  Then recreate the bag-of-words representations, storing the results in `training_norm` and `testing_norm`

In [17]:
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
def case_normalisation(data):

  lower_cased = [([token.lower() for token in tokens], label) for tokens, label in data]

  return lower_cased

In [19]:
training_lowered = case_normalisation(training)
testing_lowered = case_normalisation(testing)

In [20]:
training_lowered[:1]

[(['melvin',
   'udall',
   'is',
   'a',
   'heartless',
   'man',
   '.',
   'he',
   'spends',
   'his',
   'days',
   'inside',
   'of',
   'his',
   'spacious',
   'manhattan',
   'apartment',
   'writing',
   'romance',
   'novels',
   '.',
   'it',
   'also',
   'seems',
   'that',
   'melvin',
   'will',
   'never',
   'change',
   '.',
   'one',
   'day',
   'he',
   'dines',
   'ar',
   'his',
   'favorite',
   'restaurant',
   ',',
   'and',
   'is',
   'a',
   'little',
   'too',
   'mean',
   'to',
   'his',
   'normal',
   'waitress',
   '(',
   'the',
   'only',
   'waittress',
   'that',
   'will',
   'serve',
   'him',
   ')',
   ',',
   'carol',
   '(',
   'played',
   'to',
   'perfection',
   'by',
   'a',
   'lovely',
   'and',
   'sexy',
   'helen',
   'hunt',
   '.',
   ')',
   'she',
   'threatens',
   'not',
   'to',
   'serve',
   'him',
   'if',
   'he',
   'doesn',
   "'",
   't',
   'shut',
   'up',
   'about',
   'her',
   'asthmatic',
   'son',
   '.',
  

In [21]:
def stopwords_removal(data, language="english"):

  stop = stopwords.words(language)
  filtered_tokens = [([w for w in tokens if w.isalpha() and w not in stop], label) for tokens, label in data]
  return filtered_tokens

In [22]:
training_stop = stopwords_removal(training_lowered)
testing_stop = stopwords_removal(testing_lowered)

In [23]:
testing_stop[0]

(['plunkett',
  'macleane',
  'marks',
  'directing',
  'debut',
  'jake',
  'scott',
  'brother',
  'ridley',
  'tony',
  'naturally',
  'got',
  'worried',
  'would',
  'jake',
  'talent',
  'inherited',
  'ridley',
  'tony',
  'ridley',
  'movie',
  'would',
  'thoughtful',
  'suspensor',
  'action',
  'thrown',
  'tony',
  'would',
  'wham',
  'bang',
  'drivel',
  'unfortunately',
  'latter',
  'true',
  'worthless',
  'picture',
  'little',
  'charm',
  'carlyle',
  'miller',
  'titular',
  'highwaymen',
  'plunkett',
  'carlyle',
  'poor',
  'unruly',
  'captain',
  'james',
  'macleane',
  'miller',
  'clean',
  'cut',
  'gentleman',
  'tagline',
  'clearly',
  'wants',
  'make',
  'known',
  'rob',
  'rich',
  'nothing',
  'else',
  'film',
  'basically',
  'follows',
  'rowdy',
  'hold',
  'ups',
  'two',
  'stage',
  'along',
  'romantic',
  'interludes',
  'lady',
  'rebecca',
  'tyler',
  'hot',
  'tails',
  'mr',
  'chance',
  'ken',
  'stott',
  'wants',
  'see',
  'dead

In [24]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [25]:
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatizer(data):
  wnl = WordNetLemmatizer()
  lemma_word = [(FreqDist(wnl.lemmatize(token) for token in tokens),label) for tokens, label in data]
  return lemma_word


In [26]:
training_norm = lemmatizer(training_stop)
testing_norm = lemmatizer(testing_stop)

training_norm[0]

(FreqDist({'melvin': 6, 'simon': 4, 'dog': 4, 'day': 3, 'playing': 3, 'one': 2, 'serve': 2, 'carol': 2, 'threatens': 2, 'shut': 2, ...}),
 'pos')

In [27]:
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token]=tok_counts.get(token,0)+1
    return len(tok_counts.keys())

raw_vocab_size = vocabulary_size([[token for token in tokens]for tokens, label in training_stop])
lemma_vocab_size = vocabulary_size([[token for token in tokens]for tokens, label in training_norm])
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - lemma_vocab_size)/raw_vocab_size,raw_vocab_size,lemma_vocab_size))

Normalisation produced a 11.15% reduction in vocabulary size from 33913 to 30133


## Creating word lists
The next section will explain how to use a sentiment classifier that bases its decisions on word lists. The classifier requires a list of words indicating positive sentiment, and a second list of words indicating negative sentiment. Given positive and negative word lists, a document's overall sentiment is determined based on counts of occurrences of words that occur in the two lists. In this section we are concerned with the creation of the word lists. We will be considering both hand-crafted lists and automatically generated lists.

### Exercise 3.1

- Create a reasonably long hand-crafted list of words that you think indicate positive sentiment.
- Create a reasonably long hand-crafted list of words that indicate negative sentiment.

Use the following cells to store these lists in the variables `my_positive_word_list` and `my_negative_word_list`.

In [28]:
my_positive_word_list = ["good","great","lovely", "perfect", "amazing", "admirable", "brilliant", "cheerful", "creative", "delightful", "elegant", "enthusiastic", "excellent", "fantastic", "friendly", "generous", "graceful", "happy", "honest", "hopeful", "incredible", "inspiring", "joyful", "kind", "marvelous", "optimistic", "resilient", "wonderful"] # extend this one or put your own list here
my_negative_word_list = ["bad", "terrible", "awful", "arrogant", "boorish", "callous", "careless", "clumsy", "cowardly", "cruel", "cynical", "deceitful", "dishonest", "dreadful", "foolish", "greedy", "grumpy", "harmful", "hostile", "ignorant", "impatient", "inconsiderate", "irresponsible", "wasty", "selfish", "untrustworth"] # extend this one or put your own list here

Now lets see how often each of those words occurs in total in our positive and negative training data.  First, lets create a total of the FreqDists for positive data and for negative data.  As these are FreqDists (rather than simple dictionaries), we can do this as follows:

In [29]:
pos_freq_dist=FreqDist()
neg_freq_dist=FreqDist()

for reviewDist,label in training_norm:
    if label=='pos':
        pos_freq_dist+=reviewDist
    else:
        neg_freq_dist+=reviewDist

pos_freq_dist

FreqDist({'film': 4385, 'one': 2199, 'movie': 2168, 'character': 1429, 'like': 1317, 'time': 1087, 'story': 982, 'scene': 975, 'make': 927, 'get': 914, ...})

### Exercise 3.2
In the blank code cell below write code that uses the total frequency distributions `pos_freq_dist` and `neg_freq_dist` and the word lists `my_positive_word_list` and `my_negative_word_list` created earlier to determine whether or not the review data conforms to your expectations. In particular, whether:
- the words you expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews
- the words you expected to indicate negative sentiment actually occur more frequently in negative reviews than positive reviews.

You could display your findings in a table using pandas.

In [30]:
pos_word_in_pos = {}
for key, value in pos_freq_dist.items():
  if key in my_positive_word_list:
    pos_word_in_pos[key] = pos_word_in_pos.get(0, value) +1

pos_word_in_neg = {}
for key, value in neg_freq_dist.items():
  if key in my_positive_word_list:
    pos_word_in_neg[key] = pos_word_in_neg.get(0, value) +1


neg_word_in_pos = {}
for key, value in pos_freq_dist.items():
  if key in my_negative_word_list:
    neg_word_in_pos[key] = neg_word_in_pos.get(0, value) +1

neg_word_in_neg = {}
for key, value in neg_freq_dist.items():
  if key in my_negative_word_list:
    neg_word_in_neg[key] = neg_word_in_neg.get(0, value) +1


print(pos_word_in_pos)
print(pos_word_in_neg)

{'lovely': 21, 'creative': 27, 'great': 543, 'honest': 42, 'good': 864, 'amazing': 83, 'happy': 93, 'perfect': 169, 'kind': 226, 'wonderful': 115, 'excellent': 101, 'joyful': 3, 'friendly': 23, 'incredible': 49, 'elegant': 9, 'brilliant': 88, 'inspiring': 13, 'delightful': 34, 'fantastic': 44, 'cheerful': 10, 'hopeful': 6, 'graceful': 4, 'marvelous': 18, 'admirable': 13, 'generous': 6, 'optimistic': 5, 'enthusiastic': 5, 'resilient': 2}
{'good': 816, 'great': 289, 'happy': 60, 'kind': 207, 'perfect': 70, 'incredible': 23, 'excellent': 28, 'marvelous': 5, 'lovely': 21, 'wonderful': 39, 'honest': 20, 'amazing': 37, 'admirable': 10, 'brilliant': 37, 'hopeful': 10, 'friendly': 18, 'enthusiastic': 8, 'creative': 20, 'elegant': 2, 'optimistic': 4, 'fantastic': 10, 'delightful': 11, 'graceful': 4, 'inspiring': 6, 'cheerful': 4, 'generous': 5}


In [31]:
def add_missing(data1, data2):
  for token in data1.keys():
    neg_list = []
    for tokenn in data2.keys():
      neg_list.append(tokenn)
    if token not in neg_list:
      data2[token] = data2.get(token, 0)
  return data2

#Option 2:(Faster)
'''
missing = set(pos_word_in_pos) - set(pos_word_in_neg)
for token in missing:
    pos_word_in_neg[token] = 0
'''
#Option 3:(Faster)
'''
pos_word_in_neg = {token: pos_word_in_neg.get(token, 0) for token in pos_word_in_pos}
'''

pos_word_in_neg = add_missing(pos_word_in_pos, pos_word_in_neg)
pos_word_in_neg

{'good': 816,
 'great': 289,
 'happy': 60,
 'kind': 207,
 'perfect': 70,
 'incredible': 23,
 'excellent': 28,
 'marvelous': 5,
 'lovely': 21,
 'wonderful': 39,
 'honest': 20,
 'amazing': 37,
 'admirable': 10,
 'brilliant': 37,
 'hopeful': 10,
 'friendly': 18,
 'enthusiastic': 8,
 'creative': 20,
 'elegant': 2,
 'optimistic': 4,
 'fantastic': 10,
 'delightful': 11,
 'graceful': 4,
 'inspiring': 6,
 'cheerful': 4,
 'generous': 5,
 'joyful': 0,
 'resilient': 0}

In [32]:
pos_list = sorted(pos_word_in_pos.items())
neg_list = sorted(pos_word_in_neg.items())

print(pos_list)
print(neg_list)

[('admirable', 13), ('amazing', 83), ('brilliant', 88), ('cheerful', 10), ('creative', 27), ('delightful', 34), ('elegant', 9), ('enthusiastic', 5), ('excellent', 101), ('fantastic', 44), ('friendly', 23), ('generous', 6), ('good', 864), ('graceful', 4), ('great', 543), ('happy', 93), ('honest', 42), ('hopeful', 6), ('incredible', 49), ('inspiring', 13), ('joyful', 3), ('kind', 226), ('lovely', 21), ('marvelous', 18), ('optimistic', 5), ('perfect', 169), ('resilient', 2), ('wonderful', 115)]
[('admirable', 10), ('amazing', 37), ('brilliant', 37), ('cheerful', 4), ('creative', 20), ('delightful', 11), ('elegant', 2), ('enthusiastic', 8), ('excellent', 28), ('fantastic', 10), ('friendly', 18), ('generous', 5), ('good', 816), ('graceful', 4), ('great', 289), ('happy', 60), ('honest', 20), ('hopeful', 10), ('incredible', 23), ('inspiring', 6), ('joyful', 0), ('kind', 207), ('lovely', 21), ('marvelous', 5), ('optimistic', 4), ('perfect', 70), ('resilient', 0), ('wonderful', 39)]


In [33]:
from itertools import zip_longest
import pandas as pd

df = pd.DataFrame(list(zip_longest([token for token, count in pos_list], [count for token, count in pos_list], [token for token, count in neg_list], [count for token, count in neg_list])), columns= ["Positive", "count1", "Negative", "count2"])
df

Unnamed: 0,Positive,count1,Negative,count2
0,admirable,13,admirable,10
1,amazing,83,amazing,37
2,brilliant,88,brilliant,37
3,cheerful,10,cheerful,4
4,creative,27,creative,20
5,delightful,34,delightful,11
6,elegant,9,elegant,2
7,enthusiastic,5,enthusiastic,8
8,excellent,101,excellent,28
9,fantastic,44,fantastic,10


In [34]:
exp_pos = df[df["count2"] > df["count1"]]["Positive"].tolist()
exp_pos

['enthusiastic', 'hopeful']

'enthusiastic' and 'hopeful' expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews.

In [35]:
exp_pos2= df[df["count2"] > (df["count1"])*0.8]["Positive"].tolist()
set(exp_pos2) - set(exp_pos)

{'generous', 'good', 'graceful', 'kind', 'lovely'}

Also, there are some more words occured many times more than expected in negative reviews:  'generous', 'good', 'graceful', 'kind', 'lovely'

In [36]:
neg_word_in_pos , neg_word_in_neg

neg_word_in_pos = {token: neg_word_in_pos.get(token,0) for token in neg_word_in_neg.keys()}

pos_list2 = sorted(neg_word_in_pos.items())
neg_list2 = sorted(neg_word_in_neg.items())


In [37]:
df2 = pd.DataFrame(list(zip_longest([token for token, count in pos_list2], [count for token, count in pos_list2], [token for token, count in neg_list2], [count for token, count in neg_list2])), columns= ["Positive", "count1", "Negative", "count2"])
df2

Unnamed: 0,Positive,count1,Negative,count2
0,arrogant,14,arrogant,9
1,awful,14,awful,74
2,bad,237,bad,710
3,boorish,0,boorish,4
4,callous,4,callous,2
5,careless,2,careless,2
6,clumsy,10,clumsy,15
7,cowardly,5,cowardly,2
8,cruel,20,cruel,19
9,cynical,14,cynical,14


In [38]:
exp_neg = df2[df2["count1"] > df2["count2"]]["Positive"].tolist()

exp_neg

['arrogant',
 'callous',
 'cowardly',
 'cruel',
 'deceitful',
 'greedy',
 'harmful',
 'ignorant']

**'arrogant',  'callous', 'cowardly', 'cruel', 'deceitful', 'greedy', 'harmful' and
 'ignorant'**  expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews.

### Exercise 3.3
Now, you are going to create positive and negative word lists automatically from the training data. In order to do this:

1. write two new functions to help with automating the process of generating wordlists.

    - `most_frequent_words` - this function should take THREE arguments: 2 frequency distributions and a natural number, k. It should order words by how much more they occur in one frequency distribution than the other.   It should then return the top k highest scoring words. You might want to use the `most_common` method from the `FreqDist` class - this returns a list of word, frequency pairs ordered by frequency.  You might also or alternatively want to use pythons built-in `sorted` function
    - `words_above_threshold` - this function also takes three arguments: 2 frequency distributions and a natural number, k. Again, it should order words by how much more they occur in one distribution than the other.  It should return all of the words that have a score greater than k.

2. Using the training data, create two sets of positive and negative word lists using these functions (1 set with each function).
3.  Display these 4 lists (possibly in a `Pandas` dataframe?)



In [43]:
pos_freq_dist.most_common(5)

[('film', 4385),
 ('one', 2199),
 ('movie', 2168),
 ('character', 1429),
 ('like', 1317)]

In [62]:
def most_freuent_words(pos_fd, neg_fd, k):
  new_freq = pos_fd - neg_fd
  sorted_freq = new_freq.most_common()
  most_freq_words = [word for word, count in sorted_freq[:k]]

  return most_freq_words


In [64]:
pos_most_freq_word = most_freuent_words(pos_freq_dist,neg_freq_dist,50)

In [65]:
pos_most_freq_word

['film',
 'life',
 'also',
 'story',
 'great',
 'world',
 'many',
 'one',
 'best',
 'performance',
 'war',
 'well',
 'american',
 'family',
 'see',
 'year',
 'character',
 'jackie',
 'first',
 'way',
 'quite',
 'although',
 'love',
 'take',
 'young',
 'child',
 'men',
 'however',
 'job',
 'new',
 'mother',
 'john',
 'seen',
 'time',
 'alien',
 'people',
 'star',
 'toy',
 'friend',
 'perfect',
 'different',
 'u',
 'always',
 'may',
 'true',
 'disney',
 'yet',
 'often',
 'dark',
 'city']

In [67]:
neg_most_freq_word = most_freuent_words(neg_freq_dist, pos_freq_dist,50)
neg_most_freq_word

['movie',
 'bad',
 'plot',
 'worst',
 'even',
 'script',
 'could',
 'minute',
 'nothing',
 'get',
 'supposed',
 'boring',
 'stupid',
 'least',
 'reason',
 'unfortunately',
 'guy',
 'better',
 'godzilla',
 'attempt',
 'look',
 'joke',
 'harry',
 'tv',
 'problem',
 'big',
 'maybe',
 'try',
 'got',
 'dull',
 'think',
 'batman',
 'robin',
 'waste',
 'dialogue',
 'west',
 'mess',
 'trying',
 'wasted',
 'lame',
 'seagal',
 'awful',
 'half',
 'line',
 'action',
 'made',
 'worse',
 'idea',
 'name',
 'terrible']

In [68]:
def words_above_threshold(pos_fd, neg_fd, k):
  new_freq = pos_fd - neg_fd
  sorted_freq = new_freq.most_common()
  above_threshold = [word for word, count in sorted_freq if count>k]

  return above_threshold

In [70]:
neg_thres_word = most_freuent_words(neg_freq_dist, pos_freq_dist,100)

In [71]:
pos_thres_word = most_freuent_words(pos_freq_dist,neg_freq_dist,100)

In [72]:
freq_df = pd.DataFrame(list(zip_longest(pos_most_freq_word, neg_most_freq_word , pos_thres_word, neg_thres_word)), columns= ["Most Frequent Positive", "Most Frequent Negative", "Positive Above Threshold", "Negative Above Threshold"])
freq_df

Unnamed: 0,Most Frequent Positive,Most Frequent Negative,Positive Above Threshold,Negative Above Threshold
0,film,movie,film,movie
1,life,bad,life,bad
2,also,plot,also,plot
3,story,worst,story,worst
4,great,even,great,even
...,...,...,...,...
95,,,heart,project
96,,,voice,eddie
97,,,relationship,thriller
98,,,making,rest


## Creating a word list based classifier
Now you have a number of word lists for use with a classifier.
> Make sure you understand the following code, which will be used as the basis for creating a word list based classifier.

In [41]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI):

    def __init__(self, pos, neg):
        self._pos = pos
        self._neg = neg

    def classify(self, words):
        score = 0

        # add code here that assigns an appropriate value to score
        return "neg" if score < 0 else "pos"

    ##we don't actually need to define the classify_many method as it is provided in ClassifierI
    #def classify_many(self, docs):
    #    return [self.classify(doc) for doc in docs]

    def labels(self):
        return ("pos", "neg")

#Example usage:

classifier = SimpleClassifier(my_positive_word_list, my_negative_word_list)
classifier.classify(FreqDist("This movie was great".split()))

'pos'

### Exercise 3.1

- Copy the above code cell and move it to below this one. Then complete the `classify` method in the above code as specified below.
- Test your classifier on several very simple hand-crafted examples to verify that you have implemented `classify` correctly.

The classifier is initialised with a list of positive words, and a list of negative words. The words of a document are passed to the `classify` method (which is partially completed in the above code fragment). The `classify` method should be defined so that each occurrence of a negative word decrements `score`, and each occurrence of a positive word increments `score`.
- For `score` less than 0, "`neg`" for negative should be returned.
- For `score` greater than 0,  "`pos`" for positive should returned.
- For `score` of 0, the classification decision should be made randomly (see https://docs.python.org/3/library/random.html).


### Exercise 3.2
* Extend your SimpleClassifier class so that it has a `train` function which will derive the wordlists from training data.  You could build a separate class for each way of automatically deriving wordlists (which both inherit from SimpleClassifier) OR a single class which takes an extra parameter at training time.

Try out your classifier on the test data.  We will look at how to evaluate classifiers in the next part, but in an ideal world, most of the positive test items will have been classified as 'P' and most of the negative test items will have been classified as 'N'.  Note that the batch_classify method takes a list of unlabelled documents so you can't give it a list of pairs (where each pair is doc and a label).  You can either use a list comprehension or the <code>zip(*list_of_pairs)</code> function to split a list of pairs into a pair of lists.