# Project 2 - Document Classifier Application

We will use the **movie_reviews** corpus from NLTK, a collection of film reviews stored one per file.

* The corpus has two folders (**pos** and **neg**), and each review is stored in one of them as a .txt file

In [1]:
from nltk.corpus import movie_reviews
import nltk
import random

Now we build a list of tuples. Each tuple contains:

* one review (as a list of tokens)
* its category (**pos** or **neg**)

tuple: ([review_tokens], category)

Example:

documents[1]=(['edward', 'burns', ... , 'it', '.'], 'neg')

In [2]:
documents = [(list(movie_reviews.words(file_id)), category) for category in movie_reviews.categories() for file_id in
             movie_reviews.fileids(category)]
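
The nested comprehension above can be tried without downloading the corpus. The sketch below uses a small in-memory dictionary as a stand-in; `toy_corpus`, its file ids, and `toy_documents` are hypothetical names invented for this illustration, not part of NLTK.

```python
# Hypothetical in-memory stand-in for the corpus: category -> {file id -> tokens}.
toy_corpus = {
    "neg": {"neg/a.txt": ["a", "slow", ",", "boring", "film", "."]},
    "pos": {"pos/b.txt": ["a", "warm", "and", "funny", "film", "."]},
}

# Same shape as the comprehension above: one (tokens, category) tuple per file,
# built one category at a time.
toy_documents = [
    (list(tokens), category)
    for category, files in toy_corpus.items()
    for tokens in files.values()
]

print(toy_documents[0])
# (['a', 'slow', ',', 'boring', 'film', '.'], 'neg')
```

Note that the resulting list is grouped by category, which is why the order matters when the data is split later.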

In [15]:
print(documents[1])

(['edward', 'burns', 'tackles', 'his', 'third', 'picture', 'with', 'no', 'looking', 'back', ',', 'and', 'like', 'his', 'previous', 'two', ',', 'it', 'is', 'a', 'working', '-', 'class', 'relationship', 'picture', '.', 'however', ',', 'unlike', 'his', 'previous', 'work', ',', 'the', 'film', 'dwells', 'on', 'a', 'more', 'personal', 'story', ',', 'and', 'with', 'a', 'female', 'protagonist', '.', 'and', 'in', 'no', 'looking', 'back', ',', 'he', 'stumbles', ',', 'making', 'a', 'slow', ',', 'boring', 'film', 'without', 'the', 'spark', 'that', 'enlivened', 'his', 'previous', 'work', '.', 'claudia', '(', 'lauren', 'holly', ')', 'is', 'a', 'small', 'town', 'waitress', 'who', 'is', 'feeling', 'stifled', 'by', 'her', 'life', '.', 'she', "'", 's', 'at', 'a', 'turning', 'point', 'in', 'her', 'life', ',', 'and', 'feels', 'as', 'if', 'she', "'", 's', 'going', 'nowhere', '.', 'her', 'boyfriend', ',', 'michael', '(', 'jon', 'bon', 'jovi', ')', ',', 'is', 'broke', 'and', 'in', 'a', 'dead', 'end', 'job', 

The list is built one category at a time, so the reviews are grouped by category. We shuffle them randomly so that the training and test sets we take from the list each contain a mix of both categories:

In [3]:
random.shuffle(documents)
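
An unseeded shuffle gives a different order, and so a different train/test split and accuracy, on every run. A minimal sketch of a repeatable shuffle follows, assuming a fixed seed is acceptable; the seed value 42 and the `reviews` list are arbitrary placeholders.

```python
import random

# Placeholder data: ten labelled items, grouped by category like the real list.
reviews = [("review %d" % i, "neg" if i < 5 else "pos") for i in range(10)]

# A dedicated, seeded generator makes the shuffle repeatable.
rng = random.Random(42)
first = reviews[:]
rng.shuffle(first)

# Re-creating the generator with the same seed reproduces the same order.
rng = random.Random(42)
second = reviews[:]
rng.shuffle(second)

print(first == second)           # True: same seed, same order
print(sorted(first) == reviews)  # True: shuffling only reorders the items
```

In the notebook itself, calling `random.seed(42)` before `random.shuffle(documents)` would have the same effect.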

In [4]:
all_words = nltk.FreqDist([w.lower() for w in movie_reviews.words()]) # FreqDist counts how often each word occurs
#print(all_words.keys())
# Caution: in NLTK 3 the keys come back in first-seen order, not frequency order, so this slice
# keeps the first 2000 distinct words. Use all_words.most_common(2000) to get the 2000 most frequent.
word_features = list(all_words.keys())[:2000]
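
In NLTK 3, `FreqDist` subclasses `collections.Counter`, so the difference between slicing the keys and asking for the most frequent words can be shown with a plain `Counter`. This is a sketch on a made-up token list, not on the corpus.

```python
from collections import Counter

tokens = ["the", "plot", "is", "thin", "but", "the", "cast",
          "is", "good", "and", "the", "end", "works"]
counts = Counter(tokens)

# Slicing the keys follows first-seen order (dict order on Python 3.7+).
first_seen = list(counts.keys())[:2]

# most_common sorts by count, highest first.
most_frequent = [word for word, _ in counts.most_common(2)]

print(first_seen)     # ['the', 'plot']
print(most_frequent)  # ['the', 'is']
```

On the real data, the equivalent selection would be `[w for w, _ in all_words.most_common(2000)]`. The outputs shown in this notebook come from the key slice, so switching would change them.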

Now we need a function that records, for every word in **word_features**, whether it occurs in a given document:

In [5]:
def document_features(document):
    document_words = set(document)
    features = dict()
    for word in word_features:
        features['contains' + word] = word in document_words
    return features

We can check our function with a single .txt file:

In [6]:
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'containsplot': True, 'contains:': True, 'containstwo': True, 'containsteen': False, 'containscouples': False, 'containsgo': False, 'containsto': True, 'containsa': True, 'containschurch': False, 'containsparty': False, 'contains,': True, 'containsdrink': False, 'containsand': True, 'containsthen': True, 'containsdrive': False, 'contains.': True, 'containsthey': True, 'containsget': True, 'containsinto': True, 'containsan': True, 'containsaccident': False, 'containsone': True, 'containsof': True, 'containsthe': True, 'containsguys': False, 'containsdies': False, 'containsbut': True, 'containshis': True, 'containsgirlfriend': True, 'containscontinues': False, 'containssee': False, 'containshim': True, 'containsin': True, 'containsher': False, 'containslife': False, 'containshas': True, 'containsnightmares': False, 'containswhat': True, "contains'": True, 'containss': True, 'containsdeal': False, 'contains?': False, 'containswatch': True, 'containsmovie': True, 'contains"': True, 'conta
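
The full output is hard to read at 2000 entries. The same idea on a three-word vocabulary is sketched below; `toy_word_features` and `toy_document_features` are hypothetical names, and the vocabulary is passed as a parameter here only to keep the example self-contained.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_word_features = ["plot", "boring", "funny"]

def toy_document_features(document, vocabulary):
    """Map each vocabulary word to True/False: does the document contain it?"""
    document_words = set(document)  # a set makes each membership test cheap
    return {"contains" + word: word in document_words for word in vocabulary}

features = toy_document_features(["a", "boring", "plot", "."], toy_word_features)
print(features)
# {'containsplot': True, 'containsboring': True, 'containsfunny': False}
```

Every document is mapped to a dictionary with the same keys, which is what lets the classifier compare documents feature by feature.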

We apply the function to every review, keeping each review's category:

In [7]:
feauture_sets = [(document_features(document), category) for document, category in documents]

In [23]:
print(feauture_sets[1])

({'containsplot': False, 'contains:': False, 'containstwo': True, 'containsteen': False, 'containscouples': False, 'containsgo': False, 'containsto': True, 'containsa': True, 'containschurch': False, 'containsparty': False, 'contains,': True, 'containsdrink': False, 'containsand': True, 'containsthen': False, 'containsdrive': False, 'contains.': True, 'containsthey': True, 'containsget': True, 'containsinto': True, 'containsan': False, 'containsaccident': False, 'containsone': True, 'containsof': True, 'containsthe': True, 'containsguys': False, 'containsdies': False, 'containsbut': True, 'containshis': True, 'containsgirlfriend': False, 'containscontinues': False, 'containssee': True, 'containshim': True, 'containsin': True, 'containsher': True, 'containslife': True, 'containshas': True, 'containsnightmares': False, 'containswhat': True, "contains'": True, 'containss': True, 'containsdeal': False, 'contains?': False, 'containswatch': False, 'containsmovie': True, 'contains"': False, '

We hold out the first 100 shuffled reviews as a test set and train on the rest:

In [8]:
train_set = feauture_sets[100:]
test_set = feauture_sets[:100]
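
A quick sanity check on a split like this is that the two slices do not overlap and together cover the whole list. The sketch below uses integers as stand-ins for the feature sets; the count of 2000 matches the size of the movie_reviews corpus, and `test_size` is a name introduced here.

```python
# Stand-in for the 2000 shuffled (features, category) pairs.
items = list(range(2000))

test_size = 100                  # same hold-out size as above
test_part = items[:test_size]
train_part = items[test_size:]

print(len(train_part), len(test_part))        # 1900 100
print(set(train_part).isdisjoint(test_part))  # True
```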

We train a Naive Bayes classifier on the training set:

In [9]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
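
To give an idea of what training does, here is a minimal pure-Python sketch of a Naive Bayes classifier over boolean features. It is not NLTK's implementation: the function names are invented for this example, and the add-one smoothing is an assumption of the sketch (NLTK uses its own probability estimators).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_features):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> Counter of values
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify(model, features):
    """Return the label maximising log P(label) + sum of log P(value | label)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing over the two boolean values (assumption of this sketch).
            score += math.log((seen + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)

toy_train = [
    ({"containsboring": True, "containsfunny": False}, "neg"),
    ({"containsboring": True, "containsfunny": False}, "neg"),
    ({"containsboring": False, "containsfunny": True}, "pos"),
    ({"containsboring": False, "containsfunny": True}, "pos"),
]
model = train_naive_bayes(toy_train)
print(classify(model, {"containsboring": True, "containsfunny": False}))  # neg
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature log probabilities are simply added.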

In [10]:
print(nltk.classify.accuracy(classifier, test_set))

0.87
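
Accuracy here is the fraction of test reviews whose predicted category matches the true one, so 0.87 on 100 reviews means 87 correct predictions. A minimal sketch of that computation follows; `accuracy` and the rule-based `predict` are hypothetical helpers for illustration, not NLTK functions.

```python
def accuracy(predict, labeled_examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

# Hypothetical rule-based predictor, only there to exercise the function.
def predict(features):
    return "neg" if features["containsboring"] else "pos"

examples = [
    ({"containsboring": True}, "neg"),
    ({"containsboring": False}, "pos"),
    ({"containsboring": True}, "pos"),   # the rule gets this one wrong
    ({"containsboring": False}, "pos"),
]
print(accuracy(predict, examples))  # 0.75
```

With a test set of only 100 reviews, the figure moves noticeably from one shuffle to the next.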


In [11]:
classifier.show_most_informative_features()  # prints the table itself and returns None, so no print() is needed

Most Informative Features
      containsschumacher = True              neg : pos    =     12.3 : 1.0
   containsunimaginative = True              neg : pos    =      8.3 : 1.0
          containswelles = True              neg : pos    =      8.3 : 1.0
       containsatrocious = True              neg : pos    =      7.0 : 1.0
          containsshoddy = True              neg : pos    =      7.0 : 1.0
          containsturkey = True              neg : pos    =      6.5 : 1.0
         containssingers = True              pos : neg    =      6.4 : 1.0
            containsmena = True              neg : pos    =      6.3 : 1.0
          containssuvari = True              neg : pos    =      6.3 : 1.0
         containsunravel = True              pos : neg    =      5.7 : 1.0
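
Each row of the table compares how likely a feature value is under one category versus the other. A ratio of 12.3 : 1.0 for neg : pos means the word is estimated to appear about 12 times more often in negative reviews. The sketch below computes such a ratio from raw counts. The counts are made up, and NLTK works from smoothed probability estimates, so its figures would not match raw counts exactly.

```python
def likelihood_ratio(true_in_a, total_a, true_in_b, total_b):
    """Ratio of P(feature = True | label a) to P(feature = True | label b)."""
    return (true_in_a / total_a) / (true_in_b / total_b)

# Hypothetical counts: the word appears in 25 of 100 negative reviews
# and in 2 of 100 positive reviews.
ratio = likelihood_ratio(25, 100, 2, 100)
print("neg : pos = %.1f : 1.0" % ratio)  # neg : pos = 12.5 : 1.0
```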
