# Natural Language Processing with `nltk`

This notebook uses `nltk` to import, clean, and pre-process human-language text data, and then applies computational linguistics techniques such as sentiment analysis.

## Inspect the Movie Reviews Dataset

In [None]:
import nltk

In [None]:
# Fetch only the corpus used in this notebook instead of opening the interactive downloader
nltk.download("movie_reviews")

In [None]:
from nltk.corpus import movie_reviews

In [None]:
len(movie_reviews.fileids())

In [None]:
movie_reviews.fileids()[:5]

In [None]:
movie_reviews.fileids()[-5:]

In [None]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')

In [None]:
len(negative_fileids), len(positive_fileids)

We can inspect one of the reviews using the `raw` method of `movie_reviews`. Each file is split into sentences, and the curators of this dataset also removed any direct mention of the movie's rating from each review.

In [None]:
print(movie_reviews.raw(fileids=positive_fileids[0]))

## Tokenize Text into Words

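`nltk` ships a pre-trained English tokenizer model named `punkt`. Strictly speaking it detects sentence boundaries; `nltk.word_tokenize` relies on it to split text into sentences before splitting each sentence into words.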

In [None]:
nltk.download("punkt")

The `movie_reviews` corpus already has direct access to tokenized text with the `words` method:

In [None]:
movie_reviews.words(fileids=positive_fileids[0])
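
To make concrete what a list of tokens looks like, here is a minimal sketch of a regex-based tokenizer that separates runs of word characters from runs of punctuation. The `simple_tokenize` helper is hypothetical and only illustrative; it is a stand-in and not the tokenizer that `nltk` applies to the corpus.

```python
import re

def simple_tokenize(text):
    """Crude tokenizer: runs of word characters, or runs of punctuation."""
    # Hypothetical helper for illustration; not part of nltk.
    return re.findall(r"\w+|[^\w\s]+", text.lower())

tokens = simple_tokenize("It's great, isn't it?")
print(tokens)
```

Note that contractions are split around the apostrophe, so `isn't` becomes three tokens.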

## Build a bag-of-words model

From the bag-of-words model we will build features to be used by a classifier.
We implement this in Python as a dictionary that maps each word in a sentence to `True`; a word that is missing from the dictionary is treated the same as one mapped to `False`.

In [None]:
nltk.download("stopwords")

In [None]:
import string

In [None]:
string.punctuation

Using Python's `string.punctuation` string and the English stopwords, we can build better features by filtering out the words that would not help in the classification:

In [None]:
# A set makes the membership tests below much faster than a list
useless_words = set(nltk.corpus.stopwords.words("english")) | set(string.punctuation)
#useless_words
#type(useless_words)

In [None]:
def build_bag_of_words_features_filtered(words):
    # Map every informative word to True; words left out are implicitly False
    return {
        word: True for word in words
        if word not in useless_words}
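
As a quick check of how such a feature dictionary behaves, the sketch below applies the same idea to a toy sentence. To stay self-contained it uses a made-up miniature stop list (`toy_useless_words`) instead of the `nltk` stopwords, and the `toy_bag_of_words` name is hypothetical.

```python
import string

# Hypothetical miniature stop list, standing in for nltk's English stopwords.
toy_useless_words = {"the", "a", "is", "and"} | set(string.punctuation)

def toy_bag_of_words(words):
    """Map every informative word to True; absent words are implicitly False."""
    return {word: True for word in words if word not in toy_useless_words}

features = toy_bag_of_words(["the", "movie", "is", "great", ",", "great", "fun", "!"])
print(features)
```

Repeated words collapse into a single key, so these features record only whether a word is present, not how often it occurs.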

## Plotting Frequencies of Words


In [None]:
all_words = movie_reviews.words()
len(all_words)/1e6

First we filter out `useless_words`, which reduces the length of the dataset by more than a factor of 2:

In [None]:
filtered_words = [word for word in movie_reviews.words() if word not in useless_words]
type(filtered_words)

In [None]:
len(filtered_words)/1e6

In [None]:
from collections import Counter

word_counter = Counter(filtered_words)

In [None]:
most_common_words = word_counter.most_common(10)

In [None]:
most_common_words
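
For readers unfamiliar with `Counter`, here is a small sketch on made-up tokens showing what `most_common` returns: a list of `(word, count)` pairs in descending order of count. The `toy_tokens` list is illustrative and not taken from the corpus.

```python
from collections import Counter

# Made-up tokens, only to illustrate the shape of the result.
toy_tokens = ["good", "bad", "good", "plot", "good", "bad"]
toy_counter = Counter(toy_tokens)

# The two most frequent tokens with their counts
top_two = toy_counter.most_common(2)
print(top_two)
```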

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

We sort the word counts and plot their values on logarithmic axes to check the shape of the distribution.

In [None]:
sorted_word_counts = sorted(list(word_counter.values()), reverse=True)

plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank");
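
On log-log axes, a power-law relation between count and rank shows up as a straight line whose slope is the exponent. As an optional follow-up that is not part of the original analysis, the sketch below estimates that slope with an ordinary least-squares fit. The `loglog_slope` helper is hypothetical, and the data is synthetic so that the expected slope is known in advance.

```python
import math

def loglog_slope(sorted_counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(sorted_counts) + 1)]
    ys = [math.log(count) for count in sorted_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Synthetic counts that follow count = 1000 / rank exactly (made-up data).
toy_counts = [1000 / rank for rank in range(1, 101)]
slope = loglog_slope(toy_counts)
print(round(slope, 6))
```

Passing `sorted_word_counts` to the same function would give a rough slope for the corpus, although a single straight-line fit is only a coarse summary of the curve.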

We also plot the histogram of `sorted_word_counts`, which displays how many words have a count in a specific range.

The distribution is highly peaked at low counts, i.e. most of the words appear with a low count, so we also display it on semilogarithmic axes to inspect the tail of the distribution.

In [None]:
plt.hist(sorted_word_counts, bins=50);

In [None]:
plt.hist(sorted_word_counts, bins=50, log=True);

## Train a Classifier for Sentiment Analysis

Using the `build_bag_of_words_features_filtered` function, we build the negative and positive features separately.
For each of the 1000 negative and 1000 positive reviews, we create one dictionary of its words and attach the label "neg" or "pos" to it.

In [None]:
negative_features = [
    (build_bag_of_words_features_filtered(movie_reviews.words(fileids=[f])), 'neg')
    for f in negative_fileids
]

In [None]:
print(negative_features[3])

In [None]:
positive_features = [
    (build_bag_of_words_features_filtered(movie_reviews.words(fileids=[f])), 'pos')
    for f in positive_fileids
]

In [None]:
print(positive_features[6])

In [None]:
from nltk.classify import NaiveBayesClassifier

In [None]:
split = 800

In [None]:
sentiment_classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])

Accuracy on the training set:

In [None]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[:split]+negative_features[:split])*100
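
Accuracy is simply the share of examples whose predicted label matches the true label. The sketch below spells that out in plain Python, assuming a hypothetical `toy_classify` function in place of the trained classifier and a made-up four-item test set.

```python
def accuracy_percent(classify, labeled_features):
    """Percentage of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_features if classify(features) == label)
    return 100 * correct / len(labeled_features)

# Hypothetical stand-in classifier: predicts 'pos' when the word "great" is present.
def toy_classify(features):
    return "pos" if features.get("great") else "neg"

# Made-up labeled examples in the same (features, label) format used above.
toy_test_set = [
    ({"great": True, "fun": True}, "pos"),
    ({"boring": True}, "neg"),
    ({"great": True, "waste": True}, "neg"),
    ({"moving": True}, "pos"),
]
toy_accuracy = accuracy_percent(toy_classify, toy_test_set)
print(toy_accuracy)
```

The stand-in gets two of the four examples right, which illustrates how a single keyword rule fails on reviews that use the word in a negative context or avoid it altogether.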

Accuracy on the test set:

In [None]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[split:]+negative_features[split:])*100

Accuracy here is around 70%, which is pretty good for such a simple model if we consider that the estimated accuracy for a person is about 80%.

Printing the most informative words:

In [None]:
sentiment_classifier.show_most_informative_features()