In [39]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer


nltk.download([
     "names",
     "stopwords",
     "state_union",
     "twitter_samples",
     "movie_reviews",
     "averaged_perceptron_tagger",
     "vader_lexicon",
     "punkt",
 ])


[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Denis.kozarenko\AppData\Roaming\nltk_data...
[n

True

Start by loading the State of the Union corpus you downloaded earlier:

In [40]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]


Note that you build a list of individual words with the corpus’s .words() method, but you use str.isalpha() to include only the words that are made up of letters. Otherwise, your word list may end up with “words” that are only punctuation marks.

Have a look at your list. You’ll notice lots of little words like “of,” “a,” “the,” and similar. These common words are called stop words, and they can have a negative effect on your analysis because they occur so often in the text. Thankfully, there’s a convenient way to filter them out.

NLTK provides a small corpus of stop words that you can load into a list:

In [41]:
stopwords = nltk.corpus.stopwords.words("english")

Make sure to specify english as the desired language since this corpus contains stop words in various languages.

Now you can remove stop words from your original word list:

In [42]:
words = [w for w in words if w.lower() not in stopwords]


Since all words in the stopwords list are lowercase, and those in the original list may not be, you use str.lower() to account for any discrepancies. Otherwise, you may end up with mixedCase or capitalized stop words still in your list.
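
To see why the str.lower() call matters, here is a minimal sketch that avoids NLTK entirely. The stop_words set and the tokens list are made-up stand-ins for the stopwords corpus and the State of the Union words, not data from either corpus:

```python
# Hypothetical miniature stop word set; NLTK's English list is much longer.
stop_words = {"of", "a", "the", "and"}

# Hypothetical tokens with mixed capitalization.
tokens = ["The", "state", "of", "the", "Union", "and", "A", "plan"]

# Case-sensitive membership test: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token first catches every variant.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'state', 'Union', 'A', 'plan']
print(case_insensitive)  # ['state', 'Union', 'plan']
```

Storing the stop words in a set rather than a list is a design choice in this sketch: membership tests on a set take constant time on average, which can matter when you filter a large corpus.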

While you’ll use corpora provided by NLTK for this tutorial, it’s possible to build your own text corpora from any source. Building a corpus can be as simple as loading some plain text or as complex as labeling and categorizing each sentence. Refer to NLTK’s documentation for more information on how to work with corpus readers.

For some quick analysis, creating a corpus could be overkill. If all you need is a word list, there are simpler ways to achieve that goal. Beyond Python’s own string manipulation methods, NLTK provides nltk.word_tokenize(), a function that splits raw text into individual words. While tokenization is itself a bigger topic (and likely one of the steps you’ll take when creating a custom corpus), this tokenizer delivers simple word lists really well.

To use it, call word_tokenize() with the raw text you want to split:

In [31]:
>>> from pprint import pprint

>>> text = """
... For some quick analysis, creating a corpus could be overkill.
... If all you need is a word list,
... there are simpler ways to achieve that goal."""
>>> pprint(nltk.word_tokenize(text), width=79, compact=True)
# ['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
#  'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
#  ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']



Now you have a workable word list! Remember that punctuation will be counted as individual words, so use str.isalpha() to filter them out later.
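
As a quick illustration of that filtering step, the sketch below applies str.isalpha() to the first sentence of the tokenizer output. The tokens are copied by hand so the snippet runs without NLTK:

```python
# First sentence of the tokenizer output, copied by hand.
tokens = ["For", "some", "quick", "analysis", ",", "creating", "a",
          "corpus", "could", "be", "overkill", "."]

# str.isalpha() is False for tokens such as "," and ".", so they are dropped.
words_only = [t for t in tokens if t.isalpha()]

print(words_only)
# ['For', 'some', 'quick', 'analysis', 'creating', 'a', 'corpus', 'could',
#  'be', 'overkill']
```

Keep in mind that str.isalpha() also rejects tokens that mix letters with other characters, such as hyphenated words, which may or may not be what you want.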

# Creating Frequency Distributions
Now you’re ready for frequency distributions. A frequency distribution is essentially a table that tells you how many times each word appears within a given text. In NLTK, frequency distributions are a specific object type implemented as a distinct class called FreqDist. This class provides useful operations for word frequency analysis.

To build a frequency distribution with NLTK, instantiate the nltk.FreqDist class with a word list:

In [43]:
# words: list[str] = nltk.word_tokenize(text)
fd = nltk.FreqDist(words)

This will create a frequency distribution object similar to a Python dictionary but with added features.

Note: Type hints with generics, as in words: list[str] = ... above, are a new feature in Python 3.9!

After building the object, you can use methods like .most_common() and .tabulate() to start visualizing information:

In [45]:
fd.most_common(3)
# [('must', 1568), ('people', 1291), ('world', 1128)]
fd.tabulate(3)
#   must people  world
#   1568   1291   1128



These methods allow you to quickly determine frequently used words in a sample. With .most_common(), you get a list of tuples containing each word and how many times it appears in your text. You can get the same information in a more readable format with .tabulate().
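
FreqDist is a subclass of Python’s collections.Counter, so you can get a feel for .most_common() without downloading any corpus. The following sketch uses Counter directly with a made-up word list:

```python
from collections import Counter

# Made-up word list standing in for the filtered State of the Union words.
sample_words = ["must", "people", "must", "world", "people", "must"]

# Counter (the base class of nltk.FreqDist) maps each word to its count.
counts = Counter(sample_words)

# The two most frequent words with their counts, highest first.
top_two = counts.most_common(2)
print(top_two)  # [('must', 3), ('people', 2)]
```

Note that .tabulate() is specific to FreqDist, so it isn’t available on a plain Counter.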

In addition to these two methods, you can use frequency distributions to query particular words. You can also iterate over them to perform some custom analysis on word properties.

For example, to discover differences in case, you can query for different variations of the same word:
>>> fd["America"]
1076

>>> fd["america"]  # Note this doesn't result in a KeyError
0

>>> fd["AMERICA"]
3

These return values indicate the number of times each word occurs exactly as given.
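
The absence of a KeyError comes from behavior that FreqDist inherits from collections.Counter: a missing key reports a count of zero. Here is a minimal sketch with made-up words, using Counter as a stand-in for FreqDist:

```python
from collections import Counter

# Made-up words; only the exact spellings present are counted.
counts = Counter(["America", "America", "AMERICA"])

print(counts["America"])  # 2
print(counts["america"])  # 0, and no KeyError is raised

# A plain dict raises KeyError for the same lookup.
plain = dict(counts)
try:
    plain["america"]
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False

print(lookup_failed)  # True
```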

Since frequency distribution objects are iterable, you can use them within list comprehensions to create subsets of the initial distribution. You can focus these subsets on properties that are useful for your own analysis.

Try creating a new frequency distribution that’s based on the initial one but normalizes all words to lowercase:

In [47]:
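# Iterating over fd yields each distinct word once, so these counts are case
# variants per word; use `for w in words` to count occurrences instead.
lower_fd = nltk.FreqDist([w.lower() for w in fd])
lower_fd.tabulate(10)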

     world       year        new   congress      peace    federal    program government        war   economic 
         3          3          3          3          3          3          3          3          3          3 
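
The counts in this table are all small because iterating over a frequency distribution yields each distinct word once, not once per occurrence. The sketch below contrasts that with lowercasing the full word list, using collections.Counter as a stand-in for FreqDist and a made-up word list:

```python
from collections import Counter

# Made-up word list with case variants.
sample_words = ["World", "world", "WORLD", "world", "peace", "Peace"]
fd = Counter(sample_words)

# Iterating over the counter visits each distinct spelling once,
# so this counts case variants per word.
variants = Counter(w.lower() for w in fd)

# Lowercasing the original word list counts actual occurrences.
occurrences = Counter(w.lower() for w in sample_words)

print(variants)     # Counter({'world': 3, 'peace': 2})
print(occurrences)  # Counter({'world': 4, 'peace': 2})
```

If you want true lowercase word frequencies from the State of the Union corpus, building the distribution from words rather than from fd is the variant to reach for.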


To use VADER, first create an instance of nltk.sentiment.SentimentIntensityAnalyzer, then use .polarity_scores() on a raw string:

In [48]:
>>> from nltk.sentiment import SentimentIntensityAnalyzer
>>> sia = SentimentIntensityAnalyzer()
>>> sia.polarity_scores("Wow, NLTK is really powerful!")
{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}


You’ll get back a dictionary of different scores. The negative, neutral, and positive scores are related: They all add up to 1 and can’t be negative. The compound score is calculated differently. It’s not just an average, and it can range from -1 to 1.
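
You can check those relationships on the example scores. The sketch below hard-codes that dictionary so it runs without NLTK; in practice you would pass the result of sia.polarity_scores() instead:

```python
import math

# Scores copied from the example output.
scores = {"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012}

# The three proportions are non-negative and sum to 1 (up to rounding).
proportions = [scores["neg"], scores["neu"], scores["pos"]]
total = sum(proportions)
print(math.isclose(total, 1.0, abs_tol=0.01))  # True

# The compound score lies between -1 and 1.
print(-1.0 <= scores["compound"] <= 1.0)  # True
```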

Now you’ll put it to the test against real data using two different corpora. First, load the twitter_samples corpus into a list of strings, making a replacement to render URLs inactive to avoid accidental clicks:

In [49]:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

Notice that you use a different corpus method, .strings(), instead of .words(). This gives you a list of raw tweets as strings.

Different corpora have different features, so you may need to use Python’s help(), as in help(nltk.corpus.twitter_samples), or consult NLTK’s documentation to learn how to use a given corpus.

Now use the .polarity_scores() method of your SentimentIntensityAnalyzer instance to classify tweets:

In [50]:
from random import shuffle

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> False @smiffy Sorry to hear about your dogs :(
> True @Yolandy @Veho Post the vid! :)
> True RT @OwenJones84: David Cameron *refuses* to rule out cutting child benefit. When we all march into the polling booths next week, let’s reme…
> True @ellenRstewart where are you off to next? :)
> False RT @thoughtland: EdM welcomes Tory govt over an anti-austerity/end Trident deal w/ SNP. It's pre-Thurs blah. Still brutal to hear.  https:/…
> False RT @twcuddleston: How can you not like Ed Miliband? #Milifandom #VoteLabour http//t.co/HPyIT1k8nc
> False RT @AamerAnwar: 'Vote no 2Indy' lead UK by staying in, bt don't u dare try 2hav a voice unless U do what we tell u 2 do - ED MILL Time up #…
> False RT @craigilynn: .@kdugdalemsp Actually Ed called Scotland's bluff. Let's see what happens when that backfires. Will he let the Tories in or…
> True RT @MirrorPolitics: Come clean over £12bn benefits cuts, Tories begged http//t.co/sAXrwGBR5T http//t.co/wmAwu0hISq
> False Miliband resorts to blatant l

In this case, is_positive() uses only the positivity of the compound score to make the call. You can choose any combination of VADER scores to tweak the classification to your needs.
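
As one possible variation, the sketch below turns the compound score into three labels instead of two by treating values near zero as neutral. The function name label_from_scores() and the 0.05 cutoff are assumptions made for illustration, so tune the threshold against your own data. The function takes a score dictionary as input, so it runs without NLTK:

```python
def label_from_scores(scores: dict[str, float], threshold: float = 0.05) -> str:
    """Map a VADER-style score dictionary to a three-way label.

    The threshold is a hypothetical cutoff, not a value from this tutorial.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The first dictionary is the example output shown earlier; the others are made up.
first = label_from_scores({"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012})
second = label_from_scores({"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.4767})
third = label_from_scores({"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0})
print(first, second, third)  # positive negative neutral
```

With a real analyzer, you would call label_from_scores(sia.polarity_scores(tweet)).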

Now take a look at the second corpus, movie_reviews. As the name implies, this is a collection of movie reviews. The special thing about this corpus is that it’s already been classified. Therefore, you can use it to judge the accuracy of the algorithms you choose when rating similar texts.

Keep in mind that VADER is likely better at rating tweets than it is at rating long movie reviews. To get better results, you’ll set up VADER to rate individual sentences within the review rather than the entire text.

Since VADER needs raw strings for its rating, you can’t use .words() like you did earlier. Instead, make a list of the file IDs that the corpus uses, which you can use later to reference individual reviews:

In [53]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids


.fileids() exists in most, if not all, corpora. In the case of movie_reviews, each file corresponds to a single review. Note also that you’re able to filter the list of file IDs by specifying categories. This categorization is a feature specific to this corpus and others of the same type.

Next, redefine is_positive() to work on an entire review. You’ll need to obtain that specific review using its file ID and then split it into sentences before rating:

In [54]:
from statistics import mean

def is_positive(review_id: str) -> bool:
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)
    ]
    return mean(scores) > 0

.raw() is another method that exists in most corpora. By specifying a file ID or a list of file IDs, you can obtain specific data from the corpus. Here, you get a single review, then use nltk.sent_tokenize() to obtain a list of sentences from the review. Finally, is_positive() calculates the average compound score for all sentences and associates a positive result with a positive review.
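
The averaging step can be sketched on its own with made-up sentence scores. The helper name average_is_positive() is hypothetical, and the empty-list guard is an addition, since statistics.mean() raises StatisticsError when given no data:

```python
from statistics import mean


def average_is_positive(sentence_scores: list[float]) -> bool:
    """True if the mean of the sentence compound scores is above zero."""
    if not sentence_scores:
        # No sentences to rate; treat the text as not positive (an assumption).
        return False
    return mean(sentence_scores) > 0


# Made-up compound scores for the sentences of hypothetical reviews.
print(average_is_positive([0.6, -0.2, 0.1]))   # True
print(average_is_positive([0.3, -0.5, -0.1]))  # False
print(average_is_positive([]))                 # False
```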

You can take the opportunity to rate all the reviews and see how accurate VADER is with this setup:

In [55]:
shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")


64.05% correct



After rating all reviews, you can see that only 64 percent were correctly classified by VADER using the logic defined in is_positive().

A 64 percent accuracy rating isn’t great, but it’s a start. Have a little fun tweaking is_positive() to see if you can increase the accuracy.
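
When you experiment, it can help to see where the errors fall rather than only the overall percentage. The sketch below is one way to do that under stated assumptions: the file IDs and predictions are made up, and accuracy_report() is a hypothetical helper, not part of NLTK. It also stores the positive IDs in a set so each membership test is fast:

```python
def accuracy_report(predictions: dict[str, bool], positive_ids: set[str]) -> dict[str, float]:
    """Compare predicted labels with known labels.

    predictions maps a review ID to True (predicted positive) or False.
    positive_ids holds the IDs known to be positive; all others are negative.
    Assumes at least one review of each class is present.
    """
    correct_pos = sum(1 for rid, pred in predictions.items() if pred and rid in positive_ids)
    correct_neg = sum(1 for rid, pred in predictions.items() if not pred and rid not in positive_ids)
    total_pos = sum(1 for rid in predictions if rid in positive_ids)
    total_neg = len(predictions) - total_pos
    return {
        "accuracy": (correct_pos + correct_neg) / len(predictions),
        "positive_recall": correct_pos / total_pos,
        "negative_recall": correct_neg / total_neg,
    }


# Made-up IDs and predictions for illustration only.
positive_ids = {"pos/a.txt", "pos/b.txt"}
predictions = {"pos/a.txt": True, "pos/b.txt": True, "neg/c.txt": True, "neg/d.txt": False}

report = accuracy_report(predictions, positive_ids)
print(report)
# {'accuracy': 0.75, 'positive_recall': 1.0, 'negative_recall': 0.5}
```

A pattern like this one, where every positive review is found but half of the negative ones are misclassified, would suggest that the rule leans positive and that the cutoff in is_positive() is worth adjusting.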

In the next section, you’ll build a custom classifier that allows you to use additional features for classification and eventually increase its accuracy to an acceptable level.

# Customizing NLTK’s Sentiment Analysis
NLTK offers a few built-in classifiers that are suitable for various types of analyses, including sentiment analysis. The trick is to figure out which properties of your dataset are useful in classifying each piece of data into your desired categories.

In the world of machine learning, these data properties are known as features, which you must reveal and select as you work with your data. While this tutorial won’t dive too deeply into feature selection and feature engineering, you’ll be able to see their effects on the accuracy of classifiers.

## Selecting Useful Features
Since you’ve learned how to use frequency distributions, why not use them as a launching point for an additional feature?

By using the predefined categories in the movie_reviews corpus, you can create sets of positive and negative words, then determine which ones occur most frequently across each set. Begin by excluding unwanted words and building the initial category groups:

In [56]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

This time, you also add words from the names corpus to the unwanted list with unwanted.extend(), since movie reviews are likely to have lots of actor names, which shouldn’t be part of your feature sets. Notice the two calls to nltk.pos_tag(), which tags words by their part of speech.

It’s important to call pos_tag() before filtering your word lists so that NLTK can more accurately tag all words. skip_unwanted() then uses those tags to exclude nouns, whose tags start with NN in NLTK’s default tag set.
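
To make the filter’s effect concrete, here is a sketch that applies the same rules to hand-written (word, tag) pairs, so it runs without the tagger. The tags follow the Penn Treebank style that nltk.pos_tag() uses by default, but the pairs and the small unwanted set are made up, and keep_token() is a hypothetical name:

```python
# Made-up stand-in for the stop words and names gathered in the tutorial.
unwanted = {"the", "was", "tom"}


def keep_token(pos_tuple: tuple[str, str]) -> bool:
    """Return True for alphabetic, wanted, non-noun tokens."""
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    # Noun tags (NN, NNS, NNP, NNPS) all start with "NN".
    if tag.startswith("NN"):
        return False
    return True


# Hand-written tagged tokens for illustration.
tagged = [
    ("the", "DT"), ("movie", "NN"), ("was", "VBD"), ("surprisingly", "RB"),
    ("good", "JJ"), (",", ","), ("tom", "NN"), ("shines", "VBZ"),
]

kept = [word for word, tag in filter(keep_token, tagged)]
print(kept)  # ['surprisingly', 'good', 'shines']
```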

Now you’re ready to create the frequency distributions for your custom feature. Since many words are present in both positive and negative sets, begin by finding the common set so you can remove it from the distribution objects: