## Text Classification

Text classification covers a broad range of tasks. We might classify a text by topic, such as politics or the military, or by the gender of its author. A fairly popular task is to label a body of text as either spam or not spam, as email filters do. In our case, we're going to try to build a sentiment analysis classifier.

To do this, we'll start with the movie_reviews corpus that ships with NLTK. From there we'll use words as "features" that indicate whether a movie review is positive or negative. The reviews in movie_reviews are already labeled as positive or negative, which means we can both train and test with this data. First, let's wrangle our data.

In [2]:
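import nltk
import random
from nltk.corpus import movie_reviews

# If the corpus is not installed yet, run once: nltk.download("movie_reviews")

# Build one (list_of_words, category) tuple per review.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle so "neg" and "pos" reviews are mixed before any train/test split.
random.shuffle(documents)

print(documents[1])

# Collect every word in the corpus, lowercased.
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

# Count occurrences of each word.
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))  # the 15 most frequent tokens
print(all_words["stupid"])        # how often "stupid" appears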

(['known', 'as', 'the', 'most', 'successful', ',', 'highest', '-', 'grossing', 'romantic', 'comedy', 'in', 'history', ',', 'director', 'garry', 'marshall', 'apparently', 'struck', 'gold', 'with', '"', 'pretty', 'woman', ',', '"', 'which', 'opened', 'quietly', 'during', 'the', 'summer', 'of', '1990', 'but', ',', 'thanks', 'to', 'positive', 'word', '-', 'of', '-', 'mouth', ',', 'was', 'able', 'to', 'reach', 'upwards', 'of', '$', '175', '-', 'million', 'in', 'theaters', 'alone', '.', 'the', 'question', 'of', 'why', 'it', 'worked', 'so', 'well', 'lies', 'directly', 'with', 'the', 'film', "'", 's', 'two', 'charismatic', 'stars', ',', 'richard', 'gere', 'and', 'julia', 'roberts', ',', 'since', 'the', 'story', 'itself', 'is', 'none', 'too', 'original', 'or', 'even', 'believable', '.', 'the', 'other', 'winning', 'element', 'that', 'makes', '"', 'pretty', 'woman', '"', 'so', 'entertaining', 'is', 'its', 'genuine', 'sweetness', 'and', 'innocence', ',', 'which', 'is', 'rarely', 'as', 'palpable', 

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
253


For each category (pos or neg), we take all of the file IDs (each review has its own ID), then store the tokenized review (a list of words) for each file ID together with its positive or negative label, as one tuple in one big list.
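The nested comprehension can be easier to follow on a toy example. The sketch below swaps the NLTK corpus for a small hypothetical dictionary, `toy_corpus`, that maps each category to its tokenized reviews. The names `toy_corpus` and `toy_documents` are illustrative only and are not part of NLTK.

```python
# Hypothetical stand-in for the corpus: category -> list of tokenized reviews.
toy_corpus = {
    "neg": [["a", "dull", "film"], ["stupid", "plot"]],
    "pos": [["a", "great", "film"]],
}

# Same shape as the comprehension used for documents: the outer loop walks
# the categories, the inner loop walks the reviews in that category.
toy_documents = [(list(words), category)
                 for category in toy_corpus
                 for words in toy_corpus[category]]

# Each entry is a (list_of_words, label) tuple.
for words, category in toy_documents:
    print(category, words)
```

Because the outer loop runs over categories, every "neg" entry comes before every "pos" entry in the result.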

Next, we use random to shuffle our documents, because we're going to be training and testing on them. If we left them in order, chances are we'd train on all of the negatives and some positives, and then test only against positives. We don't want that, so we shuffle the data.
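A small sketch of why the order matters, using labels only. It assumes 1000 "neg" entries followed by 1000 "pos" entries and a hypothetical split that trains on the first 1900 items and tests on the last 100; neither number is fixed by the text above.

```python
import random

# Assumed layout: all "neg" labels first, then all "pos" labels,
# mirroring a list built category by category.
labels = ["neg"] * 1000 + ["pos"] * 1000

# Hypothetical split: train on the first 1900 items, test on the last 100.
test_unshuffled = labels[1900:]
print(set(test_unshuffled))  # only one class ends up in the test set

random.seed(0)  # fixed seed only so the sketch is repeatable
shuffled = labels[:]
random.shuffle(shuffled)  # shuffles the copy in place

test_shuffled = shuffled[1900:]
print(set(test_shuffled))  # both classes are now represented
```

Without the shuffle, the test slice contains a single class, so any accuracy measured on it would say little about how the classifier handles the other class.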

Then, just so you can see the data you are working with, we print documents[1]. It is a tuple whose first element is a list of the words and whose second element is the "pos" or "neg" label.

Next, we collect every word we find, so that we have one massive list of words. From there, we build a frequency distribution to find the most common ones. As you will see, the most popular "words" are actually punctuation and function words such as "the" and "a", but we quickly get to legitimate words. We intend to keep a few thousand of the most popular words, so this shouldn't be a problem.
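The counting step can be sketched without the corpus. The example below uses `collections.Counter` from the standard library as a stand-in; in current NLTK releases `FreqDist` is built on `Counter`, so `most_common` and indexing by word behave the same way. The token list and the cutoff `top_n` are made up for illustration.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words().
tokens = ["The", "film", "is", "stupid", ",", "the", "plot", "is", "stupid", "."]

# Lowercase before counting so "The" and "the" share one entry.
freq = Counter(t.lower() for t in tokens)

print(freq.most_common(3))   # the three most frequent tokens with counts
print(freq["stupid"])        # count for a single word
print(freq["brilliant"])     # a word that never occurs has count 0

# Keep the top-N entries as a candidate vocabulary (N is a free choice).
top_n = 4
vocabulary = [word for word, _ in freq.most_common(top_n)]
print(vocabulary)
```

On the real corpus the same idea applies with a much larger cutoff, which is why punctuation and very common words at the top of the list do not crowd out the informative ones.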

The output shows the 15 most common tokens, followed by the number of times the word "stupid" occurs (253).