# Week 3: Basic Document Classification (Part 1)

## Preliminaries 

In [1]:
#necessary library imports and setup introduced previously

import sys
sys.path.append(r'T:\Departments\Informatics\LanguageEngineering') 
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/resources')

import re
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import zip_longest
from nltk.tokenize import word_tokenize

from sussex_nltk.corpus_readers import ReutersCorpusReader

Sussex NLTK root directory is \\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources


## Overview 
In labs this week (and next), the focus will be on the application of sentiment analysis. You will be using a corpus of **book reviews** within an **Amazon review corpus**.

You will be exploring various techniques that can be used to classify the sentiment of Amazon book reviews as either positive or negative. 

You will be developing your own **Word List** and **Naïve Bayes** classifiers and then comparing them to the **NLTK Naïve Bayes** classifier.

## Creating training and testing sets
You will be training and testing various document classifiers. It is essential that the data used in the testing phase is not used during the training phase, since this can lead to overestimating performance. 

We now introduce the `split_data` function (defined in the cell below) which can be used to get separate **training** and **testing** sets.

> Look through the code in the following cell, reading the comments and making sure that you understand each line.

In [2]:
from random import sample # have a look at https://docs.python.org/3/library/random.html to see what random.sample does
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

 
def split_data(data, ratio=0.7): # when the second argument is not given, it defaults to 0.7
    """
    Given a corpus generator and a ratio, partitions the corpus into training data and test data,
    where the proportion of documents placed in the training set is ratio.

    :param data: A corpus generator.
    :param ratio: The proportion of training documents (default 0.7)
    :return: a pair (tuple) of lists where the first element of the 
            pair is a list of the training data and the second is a list of the test data.
    """
    
    data = list(data) # data is a generator, so this puts all the generated items in a list
 
    n = len(data)  #Find out the number of samples present
    train_indices = sample(range(n), int(n * ratio))          #Randomly select training indices
    test_indices = list(set(range(n)) - set(train_indices))   #Other items are testing indices
 
    train = [data[i] for i in train_indices]           #Use training indices to select data
    test = [data[i] for i in test_indices]             #Use testing indices to select data
 
    return (train, test)                       #Return split data
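
Because `sample` draws different indices on every run, the split above changes each time the cell is executed. If repeatable splits are wanted, one option is to pass a seed. The following is only a sketch: `split_data_seeded` and its `seed` parameter are hypothetical names that are not part of the lab code, and the toy list of strings stands in for a corpus of reviews.

```python
from random import Random


def split_data_seeded(data, ratio=0.7, seed=0):
    """Hypothetical variant of split_data: the same seed gives the same split."""
    data = list(data)                      # materialise the generator
    n = len(data)
    rng = Random(seed)                     # private generator, so the global random state is untouched
    train_indices = set(rng.sample(range(n), int(n * ratio)))
    train = [data[i] for i in range(n) if i in train_indices]
    test = [data[i] for i in range(n) if i not in train_indices]
    return train, test


toy_docs = [f"doc{i}" for i in range(10)]  # stand-in for a list of reviews
train, test = split_data_seeded(toy_docs, ratio=0.8, seed=42)
print(len(train), len(test))               # 8 2
print(set(train) & set(test))              # set(): no document is in both
```

The check on the last line is the property that matters for evaluation: no test document has been seen during training.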
 

Now we can use this function together with a <code>reader</code> object to create training and testing data. Note that <code>AmazonReviewCorpusReader().category("dvd")</code> returns a reader over just the *dvd* reviews. The methods <code>positive()</code> and <code>negative()</code> can be called to create readers over the reviews classified according to their sentiment.

In [3]:
#Create an Amazon corpus reader pointing at only dvd reviews
dvd_reader = AmazonReviewCorpusReader().category("dvd")

#The following two lines use the documents function on the Amazon corpus reader. 
#This returns a generator over reviews in the corpus. 
#Each review is an instance of a Python class called AmazonReview. 
#An AmazonReview object contains all the data about a review.
pos_train, pos_test = split_data(dvd_reader.positive().documents())
neg_train, neg_test = split_data(dvd_reader.negative().documents())

#You can also combine the training data
train = pos_train + neg_train


### Exercise 1.1
* Generate 80:20 *training*:*testing* splits of all 4 categories of reviews (*dvd*, *book*, *kitchen* and *electronics*), containing **positive** and **negative** reviews.  
* Record the number of reviews according to category, sentiment and dataset (training or testing) in a Pandas dataframe
* Answer the following questions:
    1. Regarding the *training* data for *books*, how many are a) **positive**, b) **negative**?
    2. Regarding the **negative** *testing* data, how many reviews are there for each category: a) *dvd*, b) *book*, c) *kitchen* and d) *electronics*? 
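
One way to record the counts is to build one row per (category, sentiment, dataset) combination. The sketch below uses toy lists of strings in place of the lists that `split_data` returns, and the dictionary name `splits` is an assumption made for illustration.

```python
# Toy stand-ins for the (train, test) pairs returned by split_data;
# in the lab each list would hold AmazonReview objects rather than strings.
splits = {
    ("dvd", "positive"): (["review"] * 8, ["review"] * 2),
    ("dvd", "negative"): (["review"] * 8, ["review"] * 2),
    ("book", "positive"): (["review"] * 4, ["review"] * 1),
    ("book", "negative"): (["review"] * 4, ["review"] * 1),
}

# One row per category, sentiment and dataset, holding the number of reviews
rows = []
for (category, sentiment), (train, test) in splits.items():
    for dataset, docs in (("training", train), ("testing", test)):
        rows.append({"category": category, "sentiment": sentiment,
                     "dataset": dataset, "count": len(docs)})

for row in rows:
    print(row)

# In the notebook, pd.DataFrame(rows) would display these rows as a table.
```

With the real data, both questions can then be answered by reading the relevant rows of the table.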

In [4]:
reader_dvd = AmazonReviewCorpusReader().category("dvd")
reader_book = AmazonReviewCorpusReader().category("book")
reader_kitchen = AmazonReviewCorpusReader().category("kitchen")
reader_electronics = AmazonReviewCorpusReader().category("electronics")

pos_train_dvd, pos_test_dvd = split_data(reader_dvd.positive().documents(),0.8)
neg_train_dvd, neg_test_dvd = split_data(reader_dvd.negative().documents(),0.8)

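# NOTE: a ratio of 0 leaves the training lists empty and puts every book review in the test lists
# (which is what the table below shows); pass 0.8 to get the 80:20 split the exercise asks for.
pos_train_book, pos_test_book = split_data(reader_book.positive().documents(),0)
neg_train_book, neg_test_book = split_data(reader_book.negative().documents(),0)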

pos_train_kitchen, pos_test_kitchen = split_data(reader_kitchen.positive().documents(),0.8)
neg_train_kitchen, neg_test_kitchen = split_data(reader_kitchen.negative().documents(),0.8)

pos_train_electronics, pos_test_electronics = split_data(reader_electronics.positive().documents(),0.8)
neg_train_electronics, neg_test_electronics = split_data(reader_electronics.negative().documents(),0.8)

pd.DataFrame(list(zip_longest(pos_test_book,neg_test_book)),columns=["Positive", "Negative"])



Unnamed: 0,Positive,Negative
0,<review>\n<unique_id>\n0785758968:one_of_the_b...,<review>\n<unique_id>\n0312355645:horrible_boo...
1,<review>\n<unique_id>\n0452279550:the_medicine...,<review>\n<unique_id>\n1559278676:shallow_self...
2,<review>\n<unique_id>\n1599620065:beautiful!:s...,<review>\n<unique_id>\n1559278676:horrible_boo...
3,<review>\n<unique_id>\n0743277724:for_lovers_o...,<review>\n<unique_id>\n0425193373:disappointme...
4,<review>\n<unique_id>\n061318114X:excellent_an...,<review>\n<unique_id>\n0142004030:a_disappoint...
...,...,...
995,<review>\n<unique_id>\n0618256288:interesting_...,<review>\n<unique_id>\n0385514573:a_nice_place...
996,<review>\n<unique_id>\n0152024867:best_book_fo...,<review>\n<unique_id>\n0385514573:boring:o._go...
997,<review>\n<unique_id>\n1565843584:fascinating_...,<review>\n<unique_id>\n097160200X:lacking:m._r...
998,<review>\n<unique_id>\n1580131581:it's_shofar_...,<review>\n<unique_id>\n1564147363:okay_for_ide...


In [5]:
# NOTE: no ratio is passed here, so these are the default 70:30 splits rather than 80:20 splits
neg_train_dvd, neg_test_dvd = split_data(reader_dvd.negative().documents())
neg_train_book, neg_test_book = split_data(reader_book.negative().documents())
neg_train_kitchen, neg_test_kitchen = split_data(reader_kitchen.negative().documents())
neg_train_electronics, neg_test_electronics = split_data(reader_electronics.negative().documents())
pd.DataFrame(list(zip_longest(neg_test_dvd,neg_test_book,neg_test_kitchen,neg_test_electronics))
             ,columns=["neg_test_dvd","neg_test_book","neg_test_kitchen","neg_test_electronics"])

Unnamed: 0,neg_test_dvd,neg_test_book,neg_test_kitchen,neg_test_electronics
0,<review>\n<unique_id>\nB00064LJVE:one_of_the_w...,<review>\n<unique_id>\n1559278676:horrible_boo...,"<review>\n<unique_id>\nB00015USLK:ok,_the_shee...",<review>\n<unique_id>\nB0001YFW3K:not_so_great...
1,<review>\n<unique_id>\nB000BYA4F6:fun_to_watch...,<review>\n<unique_id>\n0895261537:save_your_mo...,<review>\n<unique_id>\nB0006N2O0U:save_your_mo...,<review>\n<unique_id>\nB000BD35SK:horribly_noi...
2,<review>\n<unique_id>\nB00005JKG9:i_have_a_dif...,<review>\n<unique_id>\n0380725835:unendurable_...,<review>\n<unique_id>\nB000634GCO:i_am_returni...,<review>\n<unique_id>\nB0002E1HGK:low_quality:...
3,<review>\n<unique_id>\nB000GFLKF8:not_too_impr...,<review>\n<unique_id>\n0805076158:interesting_...,<review>\n<unique_id>\nB000096JFW:poor_negativ...,<review>\n<unique_id>\nB000093IRC:waste_of_mon...
4,<review>\n<unique_id>\nB000BKVQS4:not_much_for...,<review>\n<unique_id>\n0842329277:the_half-way...,<review>\n<unique_id>\nB0001LB9RG:leaky_mess:c...,<review>\n<unique_id>\nB000093IRC:good_when_it...
...,...,...,...,...
295,<review>\n<unique_id>\nB000FILVFA:potentially_...,<review>\n<unique_id>\n006440966X:dragged_down...,<review>\n<unique_id>\nB0000CBI4E:just_a_regul...,<review>\n<unique_id>\nB0001OTBUK:images_heavi...
296,<review>\n<unique_id>\nB000BB96LW:euuuwww........,<review>\n<unique_id>\n158131129X:weak_cases:y...,<review>\n<unique_id>\nB0002XGRCK:don't_buy_th...,<review>\n<unique_id>\nB0000CEPE8:too_many_pro...
297,<review>\n<unique_id>\nB000127Z9G:lacking_alot...,<review>\n<unique_id>\n0394586980:weak_link_in...,<review>\n<unique_id>\nB0002Y5XM4:looks_great_...,<review>\n<unique_id>\nB00004Z0C7:long_cord_di...
298,<review>\n<unique_id>\nB000063VBJ:how_to_destr...,<review>\n<unique_id>\n1419375385:good_start_-...,<review>\n<unique_id>\nB00022HZ04:it_is_an_exp...,<review>\n<unique_id>\nB00004Z6BJ:terrible_net...


## Creating word lists
The next section will explain how to use a sentiment classifier that bases its decisions on word lists. The classifier requires a list of words indicating positive sentiment, and a second list of words indicating negative sentiment. Given positive and negative word lists, a document's overall sentiment is determined based on counts of occurrences of words that occur in the two lists. In this section we are concerned with the creation of the word lists. We will be considering both hand-crafted lists and automatically generated lists.
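
As a small worked example of the counting idea, the sketch below counts how many words of a tokenised document appear in each list. The function name `count_hits` and the two tiny word sets are made up for illustration; they are not the lists used later in the lab.

```python
def count_hits(words, positive_words, negative_words):
    """Return (positive hits, negative hits) for a tokenised document."""
    pos_hits = sum(1 for w in words if w in positive_words)
    neg_hits = sum(1 for w in words if w in negative_words)
    return pos_hits, neg_hits


positive_words = {"good", "great"}   # toy lists, for illustration only
negative_words = {"bad", "awful"}

doc = "great cast but awful script and bad ending".split()
pos_hits, neg_hits = count_hits(doc, positive_words, negative_words)
print(pos_hits, neg_hits)  # 1 2
```

Here the document contains more negative than positive words, so a word list classifier would label it negative.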

### Exercise 2.1

- Create a reasonably long hand-crafted list of words that you think indicate positive sentiment.
- Create a reasonably long hand-crafted list of words that indicate negative sentiment.

Use the following cells to store these lists in the variables `my_positive_word_list` and `my_negative_word_list`.

In [6]:
my_positive_word_list = ["good","great","lovely","best"] # extend this one or put your own list here
my_negative_word_list = ["bad", "terrible", "awful","worst"] # extend this one or put your own list here

Next, you should try to derive word lists from the data. One way to do this is to use the most frequent words in positive reviews as your positive list, and the most frequent words in negative reviews as your negative list. This can be done with the [NLTK <code style="background-color: #F5F5F5;">FreqDist</code>](http://www.nltk.org/api/nltk.html#module-nltk.probability) object. 

> You should make sure you understand the code in the cell below.

In [7]:
from nltk.probability import FreqDist # see http://www.nltk.org/api/nltk.html#module-nltk.probability
#from sussex_nltk.corpus_readers import AmazonReviewCorpusReader
from functools import reduce # see https://docs.python.org/3/library/functools.html

#Helper function. Given a list of reviews, return a list of all the words in those reviews
#To understand this look at the description of functools.reduce in https://docs.python.org/3/library/functools.html
def get_all_words(amazon_reviews):
    return reduce(lambda words,review: words + review.words(), amazon_reviews, [])

#Frequency distributions over all words in the positive and in the negative training reviews
#(pos_train and neg_train were built from the dvd reviews above)
pos_freqdist = FreqDist(get_all_words(pos_train))
neg_freqdist = FreqDist(get_all_words(neg_train))

In [8]:
pos_freqdist


FreqDist({'the': 5869, '.': 5585, ',': 5363, 'and': 3394, 'a': 2990, 'of': 2944, 'to': 2581, 'is': 2299, 'I': 1759, 'in': 1662, ...})

### Exercise 2.2
Explain (in words) how the <code>get_all_words()</code> function works.  Your description should include details about
1. the input
2. the output
3. the algorithm used to generate the output from the input
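
To see the accumulation step by step, it may help to run the function on a couple of toy objects. In this sketch `ToyReview` is a hypothetical stand-in for `AmazonReview`; the only thing it shares with the real class is a `words()` method that returns a list of tokens.

```python
from functools import reduce


class ToyReview:
    """Hypothetical stand-in for AmazonReview: only provides words()."""

    def __init__(self, text):
        self._text = text

    def words(self):
        return self._text.split()


def get_all_words(amazon_reviews):
    # start from [] and append each review's words to the running list
    return reduce(lambda words, review: words + review.words(), amazon_reviews, [])


reviews = [ToyReview("great film"), ToyReview("bad plot")]
# step 0: []  ->  step 1: ['great', 'film']  ->  step 2: ['great', 'film', 'bad', 'plot']
print(get_all_words(reviews))  # ['great', 'film', 'bad', 'plot']
print(get_all_words([]))       # []
```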

### Exercise 2.3
In the blank code cell below write code that uses the frequency lists, `pos_freqdist` and `neg_freqdist`, created in the above cell and `my_positive_word_list` and `my_negative_word_list` that you manually created earlier to determine whether or not the review data conforms to your expectations. In particular, whether:
- the words you expected to indicate positive sentiment actually occur more frequently in positive reviews than negative reviews
- the words you expected to indicate negative sentiment actually occur more frequently in negative reviews than positive reviews.

Display your findings in a table using pandas.

In [9]:
# Naming convention: freq_<word list>_<review set>,
# e.g. freq_neg_pos holds the counts of my negative words in the positive reviews

# counts in the positive reviews
freq_pos_pos = [pos_freqdist[word] for word in my_positive_word_list]
freq_neg_pos = [pos_freqdist[word] for word in my_negative_word_list]

# counts in the negative reviews
freq_pos_neg = [neg_freqdist[word] for word in my_positive_word_list]
freq_neg_neg = [neg_freqdist[word] for word in my_negative_word_list]

In [10]:
pd.DataFrame(zip_longest(my_positive_word_list,freq_pos_pos,freq_pos_neg),columns=["word", "freq in posreview","freq in negreview"])

| | word | freq in posreview | freq in negreview |
|---|---|---|---|
| 0 | good | 253 | 281 |
| 1 | great | 272 | 133 |
| 2 | lovely | 10 | 3 |
| 3 | best | 169 | 75 |


In [11]:
pd.DataFrame(zip_longest(my_negative_word_list,freq_neg_pos,freq_neg_neg),columns=["word", "freq in posreview","freq in negreview"])

| | word | freq in posreview | freq in negreview |
|---|---|---|---|
| 0 | bad | 50 | 177 |
| 1 | terrible | 9 | 41 |
| 2 | awful | 1 | 26 |
| 3 | worst | 8 | 72 |


### Exercise 2.4
Now, you are going to create positive and negative word lists automatically from the training data. In order to do this:

1. write two new functions to help with automating the process of generating wordlists.

    - `most_frequent_words` - this function should take THREE arguments: 2 frequency distributions and a natural number, k. It should order words by how much more often they occur in one frequency distribution than in the other. It should then return the k highest-scoring words. You might want to use the `most_common` method from the `FreqDist` class, which returns a list of (word, frequency) pairs ordered by frequency. Alternatively, or in addition, you might want to use Python's built-in `sorted` function.
    - `words_above_threshold` - this function also takes three arguments: 2 frequency distributions and a natural number, k. Again, it should order words by how much more often they occur in one distribution than in the other. It should return all of the words that have a score greater than k.

2. Remove punctuation and stopwords from consideration. You can re-use code from near the end of Lab_2_2.
3. Using the training data, create two sets of positive and negative word lists using these functions (1 set with each function). 
4.  Display these 4 lists (possibly in a `Pandas` dataframe?)
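
A minimal sketch of the two functions as described in step 1, scoring each word by the difference between its counts in the two distributions. The names `top_k_by_difference` and `above_threshold_by_difference` are hypothetical, chosen so that they do not clash with the functions in the cells that follow. `collections.Counter` is used here as a stand-in for `FreqDist`; indexing a `FreqDist` by word should behave the same way. The stopword and punctuation filtering of step 2 is left out.

```python
from collections import Counter


def top_k_by_difference(freqdist1, freqdist2, k):
    """The k words that occur most often in freqdist1 relative to freqdist2."""
    score = {w: freqdist1[w] - freqdist2[w] for w in freqdist1}
    return sorted(score, key=score.get, reverse=True)[:k]


def above_threshold_by_difference(freqdist1, freqdist2, k):
    """All words whose count in freqdist1 exceeds their count in freqdist2 by more than k."""
    score = {w: freqdist1[w] - freqdist2[w] for w in freqdist1}
    ranked = sorted(score, key=score.get, reverse=True)
    return [w for w in ranked if score[w] > k]


# Toy counts, invented for illustration
pos_counts = Counter({"great": 10, "movie": 20, "bad": 1})
neg_counts = Counter({"great": 3, "movie": 22, "bad": 9})

print(top_k_by_difference(pos_counts, neg_counts, 1))            # ['great']
print(above_threshold_by_difference(neg_counts, pos_counts, 1))  # ['bad', 'movie']
```

Note that "movie" is the most frequent word in both toy distributions but scores low, because the difference rather than the raw frequency is what is ranked.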



In [12]:
from nltk.probability import FreqDist
from functools import reduce
def get_all_words(amazon_reviews):
    return reduce(lambda words,review: words + review.words(), amazon_reviews, [])
pos_freqdist = FreqDist(get_all_words(pos_train))
neg_freqdist = FreqDist(get_all_words(neg_train))

def all_lower(words):  # parameter renamed so that it no longer shadows the built-in list
    return [s.lower() for s in words]

def most_frequent_words(reviews1,reviews2,k):
    # NOTE: this version takes two lists of reviews (not frequency distributions) and returns
    # the k most common words of each list; it does not yet rank words by the difference
    # in frequency between the two lists, as the exercise asks
    from nltk.corpus import stopwords
    stop = stopwords.words('english')
    # lower-case every word, then drop punctuation and numbers (isalpha) and stopwords
    words1 = [w for w in all_lower(get_all_words(reviews1)) if w.isalpha() and w not in stop]
    words2 = [w for w in all_lower(get_all_words(reviews2)) if w.isalpha() and w not in stop]
    return FreqDist(words1).most_common(k),FreqDist(words2).most_common(k)
most_frequent_words(pos_train,neg_train,15)


([('movie', 681),
  ('film', 624),
  ('one', 570),
  ('like', 337),
  ('dvd', 336),
  ('great', 319),
  ('good', 272),
  ('time', 232),
  ('well', 231),
  ('first', 222),
  ('see', 218),
  ('also', 211),
  ('would', 204),
  ('really', 203),
  ('best', 201)],
 [('movie', 863),
  ('film', 589),
  ('one', 453),
  ('like', 402),
  ('dvd', 324),
  ('good', 300),
  ('would', 288),
  ('even', 241),
  ('much', 237),
  ('get', 229),
  ('time', 213),
  ('really', 203),
  ('could', 190),
  ('people', 186),
  ('bad', 185)])

In [13]:
def words_above_threshold(reviews1,reviews2,k):
    # NOTE: here k is a minimum word length (only words longer than k characters are kept),
    # not a threshold on the frequency difference described in the exercise;
    # the function returns the 15 most common remaining words of each list
    from nltk.corpus import stopwords
    stop = stopwords.words('english')
    words1 = [w for w in all_lower(get_all_words(reviews1)) if w.isalpha() and w not in stop and len(w)>k]
    words2 = [w for w in all_lower(get_all_words(reviews2)) if w.isalpha() and w not in stop and len(w)>k]
    return FreqDist(words1).most_common(15),FreqDist(words2).most_common(15)
words_above_threshold(pos_train,neg_train,3)

([('movie', 681),
  ('film', 624),
  ('like', 337),
  ('great', 319),
  ('good', 272),
  ('time', 232),
  ('well', 231),
  ('first', 222),
  ('also', 211),
  ('would', 204),
  ('really', 203),
  ('best', 201),
  ('love', 201),
  ('much', 191),
  ('even', 189)],
 [('movie', 863),
  ('film', 589),
  ('like', 402),
  ('good', 300),
  ('would', 288),
  ('even', 241),
  ('much', 237),
  ('time', 213),
  ('really', 203),
  ('could', 190),
  ('people', 186),
  ('story', 175),
  ('movies', 171),
  ('make', 168),
  ('first', 167)])

In [14]:
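# Compute each pair of lists once, then show the four lists side by side
# Columns: 1 = positive, 2 = negative (most_frequent_words); 3 = positive, 4 = negative (words_above_threshold)
mf_pos, mf_neg = most_frequent_words(pos_train,neg_train,15)
th_pos, th_neg = words_above_threshold(pos_train,neg_train,3)
pd.DataFrame(zip(mf_pos,mf_neg,th_pos,th_neg),columns=["1", "2","3","4"])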

| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | (movie, 681) | (movie, 863) | (movie, 681) | (movie, 863) |
| 1 | (film, 624) | (film, 589) | (film, 624) | (film, 589) |
| 2 | (one, 570) | (one, 453) | (like, 337) | (like, 402) |
| 3 | (like, 337) | (like, 402) | (great, 319) | (good, 300) |
| 4 | (dvd, 336) | (dvd, 324) | (good, 272) | (would, 288) |
| 5 | (great, 319) | (good, 300) | (time, 232) | (even, 241) |
| 6 | (good, 272) | (would, 288) | (well, 231) | (much, 237) |
| 7 | (time, 232) | (even, 241) | (first, 222) | (time, 213) |
| 8 | (well, 231) | (much, 237) | (also, 211) | (really, 203) |
| 9 | (first, 222) | (get, 229) | (would, 204) | (could, 190) |


## Creating a word list based classifier
Now you have a number of word lists for use with a classifier. 
> Make sure you understand the following code, which will be used as the basis for creating a word list based classifier.

In [34]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 
    #look at the documentation for ClassifierI https://www.nltk.org/_modules/nltk/classify/api.html
    
    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

    def classify(self, words): 
        score = 0
        for word in words:
            if word in self._pos:
                score+=1
            if word in self._neg:
                score-=1
        # add code here that assigns an appropriate value to score
        return "N" if score < 0 else "P"

    def classify_many(self, docs): 
        return [self.classify(doc.words() if hasattr(doc, 'words') else doc) for doc in docs] 

    def labels(self): 
        return ("P", "N")

#Example usage:

#classifier = SimpleClassifier(top_pos, top_neg)
classifier = SimpleClassifier(my_positive_word_list,my_negative_word_list)
classifier.classify("I enjoyed this great movie".split())

'P'

### Exercise 3.1

- Copy the above code cell and move it to below this one. Then complete the `classify` method in the above code as specified below.
- Test your classifier on several very simple hand-crafted examples to verify that you have implemented `classify` correctly.

The classifier is initialised with a list of positive words, and a list of negative words. The words of a document are passed to the `classify` method (which is partially completed in the above code fragment). The `classify` method should be defined so that each occurrence of a negative word decrements `score`, and each occurrence of a positive word increments `score`. 
- For `score` less than 0, an "`N`" for negative should be returned.
- For `score` greater than 0, "`P`" for positive should be returned.
- For `score` of 0, the classification decision should be made randomly (see https://docs.python.org/3/library/random.html).


In [15]:
from nltk.classify.api import ClassifierI
import random

class SimpleClassifier(ClassifierI): 
    #look at the documentation for ClassifierI https://www.nltk.org/_modules/nltk/classify/api.html
    
    def __init__(self, pos, neg): 
        self._pos = pos 
        self._neg = neg 

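    def classify(self, words): 
        score = 0
        # each positive word adds 1 to the score, each negative word subtracts 1
        for word in words:
            if word in self._pos:
                score+=1
            if word in self._neg:
                score-=1
        if score == 0:
            return random.choice(self.labels())  # tie: decide randomly, as the exercise specifies
        return "N" if score < 0 else "P"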

    def classify_many(self, docs): 
        return [self.classify(doc.words() if hasattr(doc, 'words') else doc) for doc in docs] 

    def labels(self): 
        return ("P", "N")

### Exercise 3.2
* Extend your SimpleClassifier class so that it has a `train` function which will derive the wordlists from training data.  You could build a separate class for each way of automatically deriving wordlists (which both inherit from SimpleClassifier) OR a single class which takes an extra parameter at training time.
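
The second option, a single class with a parameter supplied at training time, could look roughly like the sketch below. It is deliberately self-contained: `TrainableWordListClassifier` is a hypothetical name, it does not inherit from `SimpleClassifier` or use NLTK, documents are plain lists of tokens, and `collections.Counter` stands in for `FreqDist`. Only the "top k by frequency difference" method is shown.

```python
import random
from collections import Counter


class TrainableWordListClassifier:
    """Hypothetical stand-alone sketch; the lab version would inherit from SimpleClassifier."""

    def __init__(self):
        self._pos, self._neg = set(), set()

    def train(self, pos_docs, neg_docs, k=2):
        # count every word in each set of tokenised documents
        pos_counts = Counter(w for doc in pos_docs for w in doc)
        neg_counts = Counter(w for doc in neg_docs for w in doc)
        self._pos = self._top_k(pos_counts, neg_counts, k)
        self._neg = self._top_k(neg_counts, pos_counts, k)

    @staticmethod
    def _top_k(counts, other, k):
        # rank words by how much more often they occur in counts than in other
        ranked = sorted(counts, key=lambda w: counts[w] - other[w], reverse=True)
        return set(ranked[:k])

    def classify(self, words):
        score = sum((w in self._pos) - (w in self._neg) for w in words)
        if score == 0:
            return random.choice(("P", "N"))  # tie: decide randomly
        return "N" if score < 0 else "P"


pos_docs = [["great", "film"], ["great", "fun", "film"]]      # toy training data
neg_docs = [["awful", "film"], ["awful", "boring", "film"]]
clf = TrainableWordListClassifier()
clf.train(pos_docs, neg_docs, k=1)
print(clf.classify(["a", "great", "evening"]))  # P
print(clf.classify(["an", "awful", "mess"]))    # N
```

Ranking by difference keeps "film", which is equally common in both toy sets, out of both word lists.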

In [15]:
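class SimpleClassifier_mf(SimpleClassifier):
    
    def __init__(self,k):
        super().__init__([], [])  # word lists start empty and are filled in by train
        self.k=k
    
    def train(self,pos_train,neg_train):
        # most_frequent_words (defined above) takes both lists of reviews plus k and
        # returns a pair of lists of (word, count) pairs; keep only the words
        pos_top, neg_top = most_frequent_words(pos_train,neg_train,self.k)
        self._pos=[word for word, count in pos_top]
        self._neg=[word for word, count in neg_top]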