Here, we are going to learn about the following topics:

(i) Split and filter text data in preparation for analysis.

(ii) Analyze word frequency.

(iii) Find concordance and collocations using different methods.

(iv) Perform quick sentiment analysis with NLTK’s built-in VADER.

(v) Define features for custom classification.

(vi) Use and compare classifiers from scikit-learn for sentiment analysis within NLTK.

Sentiment analysis helps us determine the ratio of positive to negative engagement about a specific topic. We can analyze bodies of text, such as comments, tweets, and product reviews, to obtain insights from our audience.

First, use !pip to install NLTK.

In [1]:
!pip install nltk



After installing NLTK, we need a few additional resources, which we can fetch with nltk.download().

In [3]:
import nltk


In [4]:
nltk.download([
    "names",
    "stopwords",
    "state_union",
    "twitter_samples",
    "movie_reviews",
    "averaged_perceptron_tagger",
    "vader_lexicon",
    "punkt",
])

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Here is what each of these resources provides:


**names:** A list of common English names compiled by Mark Kantrowitz.

**stopwords:** A list of really common words, like articles, pronouns, prepositions, and conjunctions.

**state_union:** A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens.

**twitter_samples:** A list of social media phrases posted to Twitter.

**movie_reviews:** Two thousand movie reviews categorized by Bo Pang and Lillian Lee.

**averaged_perceptron_tagger:** A data model that NLTK uses to categorize words into their parts of speech.

**vader_lexicon:** A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert.

**punkt:** A data model created by Jan Strunk that NLTK uses to split full texts into sentences, a step that word tokenization builds on.

The **Shakespeare** corpus contains a set of Shakespeare plays, formatted as XML files.

In [7]:
import nltk
w = nltk.download('shakespeare')

[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Package shakespeare is already up-to-date!


**Compiling Data:** Let us start by loading the State of the Union corpus we downloaded earlier.

In [8]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]

NLTK provides a small corpus of stop words that we can load into a list.

In [9]:
stopwords = nltk.corpus.stopwords.words("english")

Now we can remove stop words from our original word list.

In [10]:
words = [w for w in words if w.lower() not in stopwords]
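
The two filtering steps above can be tried without downloading anything. The following is a minimal sketch that uses a tiny hand-written stop-word set in place of the NLTK corpus; `sample_words` and `toy_stopwords` are made-up names and data for illustration only. Storing the stop words in a set is an optional tweak that keeps membership tests fast on large word lists.

```python
# Toy stand-ins for the corpus words and the NLTK stop-word list (illustrative only).
sample_words = ["The", "state", "of", "the", "Union", "is", "strong", "2024", "!"]
toy_stopwords = {"the", "of", "is"}  # a set gives fast membership tests

# Same two filters as above: keep alphabetic tokens, then drop stop words.
alpha_words = [w for w in sample_words if w.isalpha()]
filtered = [w for w in alpha_words if w.lower() not in toy_stopwords]

print(filtered)
```

Comparing `w.lower()` against the stop words matters because the stop-word list is lowercase while the text is not.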

For quick experiments we do not need a corpus at all. To split any raw text into a word list, call nltk.word_tokenize() with the text we want to split.

In [11]:
from pprint import pprint

text = """
For some quick analysis, creating a corpus could be overkill.
If all you need is a word list,
there are simpler ways to achieve that goal."""
pprint(nltk.word_tokenize(text), width=79, compact=True)

['For', 'some', 'quick', 'analysis', ',', 'creating', 'a', 'corpus', 'could',
 'be', 'overkill', '.', 'If', 'all', 'you', 'need', 'is', 'a', 'word', 'list',
 ',', 'there', 'are', 'simpler', 'ways', 'to', 'achieve', 'that', 'goal', '.']

**Creating Frequency Distributions:** To build a frequency distribution with NLTK, pass a word list to the nltk.FreqDist constructor.

In [12]:
words: list[str] = nltk.word_tokenize(text)
fd = nltk.FreqDist(words)

After building the object, we can use methods like .most_common() and .tabulate() to start inspecting the results.

In [13]:
fd.most_common(3)

[(',', 2), ('a', 2), ('.', 2)]

In [14]:
fd.tabulate(3)

, a . 
2 2 2 


These methods allow us to quickly determine frequently used words in a sample. With .most_common(), we get a list of tuples containing each word and how many times it appears in our text. We can get the same information in a more readable format with .tabulate().
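
Since nltk.FreqDist is a subclass of Python's collections.Counter, the same idea can be explored with the standard library alone. The sketch below is an illustration on a made-up sentence, not a replacement for FreqDist; the hand-rolled loop only approximates what .tabulate() prints.

```python
from collections import Counter

# A made-up token list for illustration.
tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)

# Most frequent tokens as (token, count) tuples; ties keep first-seen order.
top_two = counts.most_common(2)
print(top_two)

# A rough, hand-rolled tabular view of the same information.
for word, count in top_two:
    print(f"{word:>5} {count:>3}")
```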

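In addition to these two methods, we can query a frequency distribution for particular words. Lookups are case-sensitive: each returned value is the number of times the word occurs exactly as given. Our short sample text never mentions America, so all three lookups below return 0.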

In [15]:
fd["America"]

0

In [16]:
fd["america"]

0

In [17]:
fd["AMERICA"]

0

Let us create a new frequency distribution that is based on the initial one but normalizes all words to lowercase.

In [18]:
# Lowercase every token (not just the distinct keys of fd) so the counts are preserved.
lower_fd = nltk.FreqDist([w.lower() for w in words])
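
To see what case normalization changes, here is a small standard-library sketch on a made-up token list, with Counter standing in for FreqDist. It also shows a pitfall worth keeping in mind: lowercasing the token list merges the counts of all case variants, whereas lowercasing only the distinct keys of an existing distribution counts each variant once.

```python
from collections import Counter

# Made-up tokens containing three case variants of the same word.
tokens = ["America", "america", "AMERICA", "America", "freedom"]

# Exact-case lookups treat every variant as a different word.
exact = Counter(tokens)
print(exact["America"], exact["america"], exact["AMERICA"])

# Lowercasing every token occurrence merges the variants and keeps their counts.
lowered = Counter(t.lower() for t in tokens)
print(lowered["america"])

# Lowercasing only the distinct keys counts each variant once instead.
from_keys = Counter(k.lower() for k in exact)
print(from_keys["america"])
```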

**Extracting Concordance and Collocations:**

Before invoking .concordance(), build a new word list from the original corpus text so that all the context, even stop words, will be there.

In [19]:
text = nltk.Text(nltk.corpus.state_union.words())
text.concordance("america", lines=5)

Displaying 5 of 1079 matches:
 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace
beyond any shadow of a doubt , that America will continue the fight for freedom
 to make complete victory certain , America will never become a party to any pl
nly in law and in justice . Here in America , we have labored long and hard to 


Note that .concordance() ignores case, so we see the context of every case variant of a word in order of appearance. It does not, however, show the location of each match in the text.

Moreover, since .concordance() only prints information to the console, it’s not ideal for data manipulation. To obtain a usable list that will also give us information about the location of each occurrence, use .concordance_list().




In [20]:
concordance_list = text.concordance_list("america", lines=2)
for entry in concordance_list:
    print(entry.line)

 would want us to do . That is what America will do . So much blood has already
ay , the entire world is looking to America for enlightened leadership to peace


.concordance_list() gives us a list of ConcordanceLine objects, which contain information about where each word occurs as well as a few more properties worth exploring. The list is also sorted in order of appearance.
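
To make the idea of a concordance concrete, the sketch below builds one by hand in plain Python. It is not NLTK's implementation; `simple_concordance`, its `window` parameter, and the sample sentence are hypothetical and only illustrate returning each match together with its position and surrounding words.

```python
def simple_concordance(tokens, target, window=3):
    """Return (offset, context) pairs for case-insensitive matches of target."""
    matches = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = tokens[max(0, i - window):i]    # up to `window` words before
            right = tokens[i + 1:i + 1 + window]   # up to `window` words after
            matches.append((i, " ".join(left + [token] + right)))
    return matches

# A made-up sentence for illustration.
tokens = "Here in America we work hard and america will lead".split()
results = simple_concordance(tokens, "america")
for offset, context in results:
    print(offset, context)
```

Because the function returns a list rather than printing, the matches can be filtered, counted, or sorted like any other Python data.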

The nltk.Text class itself has a few other interesting features. One of them is .vocab(), which is worth mentioning because it creates a frequency distribution for a given text.

Revisiting nltk.word_tokenize(), check out how quickly we can create a custom nltk.Text instance and an accompanying frequency distribution.

In [21]:
words: list[str] = nltk.word_tokenize(
    """Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex."""
)

In [22]:
text = nltk.Text(words)

In [23]:
fd = text.vocab()

In [24]:
fd.tabulate(3)

    is better   than 
     3      3      3 


NLTK provides specific classes for us to find collocations in our text. Following the pattern we have seen so far, these classes are also built from lists of words.

In [25]:
words = [w for w in nltk.corpus.state_union.words() if w.isalpha()]
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

The TrigramCollocationFinder instance will search specifically for trigrams.

One of the most useful tools of a collocation finder is the ngram_fd property. It holds a frequency distribution built over each collocation rather than over individual words.

Using ngram_fd, we can find the most common collocations in the supplied text.

In [26]:
finder.ngram_fd.most_common(2)

[(('the', 'United', 'States'), 294), (('the', 'American', 'people'), 185)]

In [27]:
finder.ngram_fd.tabulate(2)

  ('the', 'United', 'States') ('the', 'American', 'people') 
                          294                           185 
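
What a trigram frequency distribution counts can be sketched with the standard library. The helper below is a hypothetical illustration (`count_ngrams` is not an NLTK function) that slides a window of n tokens over a made-up word list and counts each run.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the list yields each window as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# A made-up word list for illustration.
tokens = "the United States and the United States Senate".split()
trigram_counts = count_ngrams(tokens, 3)
print(trigram_counts.most_common(1))
```

Calling `count_ngrams(tokens, 2)` would count bigrams in the same way.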


**Using NLTK’s Pre-Trained Sentiment Analyzer:** NLTK already has a built-in, pretrained sentiment analyzer called **VADER** (Valence Aware Dictionary and sEntiment Reasoner). To use VADER, first create an instance of nltk.sentiment.SentimentIntensityAnalyzer, then use .polarity_scores() on a raw string.



In [28]:
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [29]:
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

We shall get back a dictionary of different scores. The negative, neutral, and positive scores are related: They all add up to 1 and cannot be negative. The compound score is calculated differently. It is not just an average, and it can range from -1 to 1.
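
One way to turn such a dictionary into a label is to threshold the compound score. The sketch below assumes the commonly used cut-off of 0.05 on either side of zero, which is a convention rather than something this tutorial prescribes; `label_from_scores` and its `threshold` parameter are hypothetical names, and the input is the score dictionary shown above.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'pos', 'neg', or 'neu'.

    The threshold is an assumed convention, not a fixed rule.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# The scores returned for "Wow, NLTK is really powerful!" above.
sample = {"neg": 0.0, "neu": 0.295, "pos": 0.705, "compound": 0.8012}
label = label_from_scores(sample)
print(label)

# The three proportion scores add up to 1 (up to floating-point rounding).
total = sample["neg"] + sample["neu"] + sample["pos"]
print(round(total, 3))
```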

Now we shall put it to the test against real data using two different corpora. First, load the twitter_samples corpus into a list of strings, making a replacement to render URLs inactive to avoid accidental clicks.

In [30]:
tweets = [t.replace("://", "//") for t in nltk.corpus.twitter_samples.strings()]

Now use the .polarity_scores() function of our SentimentIntensityAnalyzer instance to classify tweets.

In [31]:
from random import shuffle

def is_positive(tweet: str) -> bool:
    """True if tweet has positive compound sentiment, False otherwise."""
    return sia.polarity_scores(tweet)["compound"] > 0

shuffle(tweets)
for tweet in tweets[:10]:
    print(">", is_positive(tweet), tweet)

> True @GFuelEnergy i want that but i dont have paypal :(
> False Nicola Sturgeon eats English babies for breakfast
> True @candinam Gals also bend down in one knees to put a men down and bend down in two knees to pleasure a men up to the heaven :)
> False RT @matt_isom: @DavidGWrigley @BBCNews @bbcnickrobinson @Ed_Miliband he slipped on David Cameron's sweat
> True RT @HTScotPol: Good grief. Ed Miliband hands SNP massive gift on #bbcqt by saying he'd rather Tories ran country than do a deal with #SNP
> False RT @Ed_Miliband: Working families can't afford five more years of the Tories, but in seven days time people can vote Labour to put working …
> True RT @OwenJones84: Woohoo! Ed Miliband pointing out that social security bill being fuelled by subsidies for low wages! The facts banished fr…
> True @nathan3205 OMG THAT CAME FAST! You graduate at the end of the year?! I know :( catch ups are a definite (what we did best at uni anyways)
> False RT @bigbuachaille: Miliband fallout:

West

Next, we test VADER against the movie_reviews corpus. Since VADER needs raw strings for its rating, we cannot use .words() like we did earlier. Instead, make a list of the file IDs that the corpus uses, which we can use later to reference individual reviews.

In [32]:
positive_review_ids = nltk.corpus.movie_reviews.fileids(categories=["pos"])
negative_review_ids = nltk.corpus.movie_reviews.fileids(categories=["neg"])
all_review_ids = positive_review_ids + negative_review_ids

Next, redefine is_positive() to work on an entire review. We shall need to obtain that specific review using its file ID and then split it into sentences before rating.

In [33]:
from statistics import mean

def is_positive(review_id: str) -> bool:
    """True if the average of all sentence compound scores is positive."""
    text = nltk.corpus.movie_reviews.raw(review_id)
    scores = [
        sia.polarity_scores(sentence)["compound"]
        for sentence in nltk.sent_tokenize(text)
    ]
    return mean(scores) > 0

.raw() is another method that exists in most corpora. By specifying a file ID or a list of file IDs, we can obtain specific data from the corpus. Here, we get a single review, then use nltk.sent_tokenize() to obtain a list of sentences from the review. Finally, is_positive() calculates the average compound score for all sentences and associates a positive result with a positive review.
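
The averaging step can be examined in isolation. The sketch below works on precomputed compound scores (the numbers are made up) and adds one guard that the function above does not have: statistics.mean() raises an error on an empty list, so a text with no sentences is treated as not positive here. That fallback is an assumption; `is_positive_from_scores` is a hypothetical helper.

```python
from statistics import mean

def is_positive_from_scores(compound_scores):
    """True if the mean compound score is positive; False for empty input."""
    if not compound_scores:
        # Assumed fallback: a text with no sentences is not counted as positive.
        return False
    return mean(compound_scores) > 0

print(is_positive_from_scores([0.6, -0.2, 0.1]))
print(is_positive_from_scores([-0.5, 0.3]))
print(is_positive_from_scores([]))
```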

We can take the opportunity to rate all the reviews and see how accurate VADER is with this setup.

In [35]:
shuffle(all_review_ids)
correct = 0
for review_id in all_review_ids:
    if is_positive(review_id):
        if review_id in positive_review_ids:
            correct += 1
    else:
        if review_id in negative_review_ids:
            correct += 1

print(f"{correct / len(all_review_ids):.2%} correct")

64.00% correct
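
The same accuracy calculation can be written more compactly by looping over each labeled group directly, which avoids searching the ID lists for every review. The sketch below uses made-up file IDs and a lookup table in place of a real predictor; `accuracy`, `toy_positive`, `toy_negative`, and `toy_predictions` are hypothetical.

```python
def accuracy(predict, positive_ids, negative_ids):
    """Fraction of IDs whose prediction matches the group they belong to."""
    correct = sum(bool(predict(i)) for i in positive_ids)
    correct += sum(not predict(i) for i in negative_ids)
    return correct / (len(positive_ids) + len(negative_ids))

# Made-up IDs and predictions standing in for is_positive().
toy_positive = ["pos/a.txt", "pos/b.txt"]
toy_negative = ["neg/c.txt", "neg/d.txt"]
toy_predictions = {
    "pos/a.txt": True,
    "pos/b.txt": False,
    "neg/c.txt": False,
    "neg/d.txt": True,
}

result = accuracy(toy_predictions.get, toy_positive, toy_negative)
print(f"{result:.2%} correct")
```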


**Customizing NLTK’s Sentiment Analysis:**

(a) Selecting Useful Features



By using the predefined categories in the movie_reviews corpus, we can create sets of positive and negative words, then determine which ones occur most frequently across each set. Begin by excluding unwanted words and building the initial category groups.

In [36]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

def skip_unwanted(pos_tuple):
    word, tag = pos_tuple
    if not word.isalpha() or word in unwanted:
        return False
    if tag.startswith("NN"):
        return False
    return True

positive_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["pos"]))
)]
negative_words = [word for word, tag in filter(
    skip_unwanted,
    nltk.pos_tag(nltk.corpus.movie_reviews.words(categories=["neg"]))
)]

It is important to call pos_tag() before filtering our word lists so that NLTK can tag all words more accurately. skip_unwanted() then uses those tags to exclude nouns, according to NLTK’s default tag set.

Now we are ready to create the frequency distributions for our custom feature. Since many words are present in both positive and negative sets, begin by finding the common set so we can remove it from the distribution objects.

In [37]:
positive_fd = nltk.FreqDist(positive_words)
negative_fd = nltk.FreqDist(negative_words)

common_set = set(positive_fd).intersection(negative_fd)

for word in common_set:
    del positive_fd[word]
    del negative_fd[word]

top_100_positive = {word for word, count in positive_fd.most_common(100)}
top_100_negative = {word for word, count in negative_fd.most_common(100)}
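
The effect of removing the shared words is easiest to see on a tiny example. The sketch below uses collections.Counter in place of FreqDist and made-up word lists, keeping the top 2 words instead of the top 100.

```python
from collections import Counter

# Made-up word lists standing in for the positive and negative reviews.
positive_counts = Counter(["great", "great", "fun", "plot", "plot", "moving"])
negative_counts = Counter(["dull", "dull", "plot", "fun", "boring"])

# Words seen in both groups say little about sentiment, so drop them from both.
shared = set(positive_counts) & set(negative_counts)
for word in shared:
    del positive_counts[word]
    del negative_counts[word]

top_positive = {word for word, _ in positive_counts.most_common(2)}
top_negative = {word for word, _ in negative_counts.most_common(2)}
print(sorted(top_positive), sorted(top_negative))
```

Note that "plot" was the joint most frequent positive word before filtering, yet it disappears because it also occurs in the negative list.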

Collocations are another possible source of features. Here is how we can set up the positive and negative bigram finders.

In [38]:
unwanted = nltk.corpus.stopwords.words("english")
unwanted.extend([w.lower() for w in nltk.corpus.names.words()])

positive_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["pos"])
    if w.isalpha() and w not in unwanted
])
negative_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words([
    w for w in nltk.corpus.movie_reviews.words(categories=["neg"])
    if w.isalpha() and w not in unwanted
])

(b) Training and Using a Classifier

 Since we are looking for positive movie reviews, focus on the features that indicate positivity, including VADER scores.

In [39]:
def extract_features(text):
    features = dict()
    wordcount = 0
    compound_scores = list()
    positive_scores = list()

    for sentence in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sentence):
            if word.lower() in top_100_positive:
                wordcount += 1
        # Score each sentence once and reuse the result for both features.
        scores = sia.polarity_scores(sentence)
        compound_scores.append(scores["compound"])
        positive_scores.append(scores["pos"])

    # Adding 1 to the final compound score to always have positive numbers
    # since some classifiers you'll use later don't work with negative numbers.
    features["mean_compound"] = mean(compound_scores) + 1
    features["mean_positive"] = mean(positive_scores)
    features["wordcount"] = wordcount

    return features

extract_features() returns a dictionary containing three features for each piece of text:

1. The average compound score

2. The average positive score

3. The number of words in the text that are also among the 100 most frequent words found only in positive reviews
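
The shape of the resulting dictionary can be checked without VADER by plugging in a stand-in scorer. In the sketch below, `extract_features_sketch` takes pre-split sentences and any callable that returns a dictionary with "compound" and "pos" keys; `fake_scorer` and its fixed scores are invented for illustration and say nothing about what VADER would return.

```python
from statistics import mean

def extract_features_sketch(sentences, score_sentence, positive_words):
    """Build the three features from pre-split sentences and a scoring callable."""
    scores = [score_sentence(s) for s in sentences]
    wordcount = sum(
        word.lower() in positive_words
        for s in sentences
        for word in s.split()
    )
    return {
        # Shifted by 1 so the value is never negative, as in the function above.
        "mean_compound": mean([s["compound"] for s in scores]) + 1,
        "mean_positive": mean([s["pos"] for s in scores]),
        "wordcount": wordcount,
    }

def fake_scorer(sentence):
    """Hypothetical stand-in for a sentiment scorer; the numbers are made up."""
    if "great" in sentence:
        return {"compound": 0.5, "pos": 0.25}
    return {"compound": -0.5, "pos": 0.0}

toy_features = extract_features_sketch(
    ["A great film", "The ending drags"], fake_scorer, {"great"}
)
print(toy_features)
```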

In order to train and evaluate a classifier, we shall need to build a list of features for each text we shall analyze.

In [40]:
features = [
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

Training the classifier involves splitting the feature set so that one portion can be used for training and the other for evaluation, then calling .train().

In [41]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)
classifier = nltk.NaiveBayesClassifier.train(features[:train_count])
classifier.show_most_informative_features(10)

Most Informative Features
               wordcount = 3                 pos : neg    =      4.5 : 1.0
               wordcount = 2                 pos : neg    =      3.8 : 1.0
               wordcount = 5                 pos : neg    =      3.2 : 1.0
               wordcount = 1                 pos : neg    =      2.0 : 1.0
               wordcount = 0                 neg : pos    =      1.8 : 1.0
               wordcount = 4                 pos : neg    =      1.1 : 1.0
           mean_positive = 0.159             pos : neg    =      1.0 : 1.0


In [42]:
nltk.classify.accuracy(classifier, features[train_count:])

0.6553333333333333
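
Because shuffle() reorders the list differently on every run, the accuracy above will vary slightly between runs. If reproducible numbers are wanted, one option is to shuffle with a seeded random generator. The sketch below is a minimal illustration; `split_features`, its `train_fraction` and `seed` parameters, and the toy data are hypothetical.

```python
import random

def split_features(items, train_fraction=0.25, seed=42):
    """Shuffle a copy of items reproducibly and split it into train/test lists."""
    shuffled = list(items)                    # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)     # seeded, so the order is repeatable
    train_count = int(len(shuffled) * train_fraction)
    return shuffled[:train_count], shuffled[train_count:]

# Made-up (features, label) pairs in the same shape as the features list above.
toy_items = [({"wordcount": i}, "pos" if i % 2 else "neg") for i in range(8)]
train, test = split_features(toy_items)
print(len(train), len(test))
```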

**Comparing Additional Classifiers:**

Like NLTK, scikit-learn is a third-party Python library, so we shall have to install it with !pip.

In [44]:
!pip install scikit-learn



The following classifiers are a subset of all classifiers available to us. These will work within NLTK for sentiment analysis.

In [45]:
from sklearn.naive_bayes import (
    BernoulliNB,
    ComplementNB,
    MultinomialNB,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

To aid in accuracy evaluation, it is helpful to have a mapping of classifier names and their instances.

In [46]:
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "ComplementNB": ComplementNB(),
    "MultinomialNB": MultinomialNB(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(),
    "MLPClassifier": MLPClassifier(max_iter=1000),
    "AdaBoostClassifier": AdaBoostClassifier(),
}

**Using scikit-learn Classifiers With NLTK:** We shall also be able to leverage the same features list we built earlier by means of extract_features(). To refresh our memory, here is how we built the features list.

In [47]:
features = [
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "pos")
    for review in nltk.corpus.movie_reviews.fileids(categories=["pos"])
]
features.extend([
    (extract_features(nltk.corpus.movie_reviews.raw(review)), "neg")
    for review in nltk.corpus.movie_reviews.fileids(categories=["neg"])
])

Since the first half of the list contains only positive reviews, begin by shuffling it, then iterate over all classifiers to train and evaluate each one.

In [48]:
# Use 1/4 of the set for training
train_count = len(features) // 4
shuffle(features)
for name, sklearn_classifier in classifiers.items():
    classifier = nltk.classify.SklearnClassifier(sklearn_classifier)
    classifier.train(features[:train_count])
    accuracy = nltk.classify.accuracy(classifier, features[train_count:])
    print(f"{accuracy:.2%} - {name}")

66.20% - BernoulliNB
66.20% - ComplementNB
66.20% - MultinomialNB
68.93% - KNeighborsClassifier
63.33% - DecisionTreeClassifier
67.60% - RandomForestClassifier
71.47% - LogisticRegression
71.73% - MLPClassifier
71.20% - AdaBoostClassifier
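
scikit-learn estimators work on numeric arrays rather than feature dictionaries, so a wrapper such as nltk.classify.SklearnClassifier has to convert one into the other. The sketch below only illustrates that idea in plain Python and is not NLTK's actual conversion code; `vectorize` and the toy values are hypothetical.

```python
def vectorize(feature_dicts):
    """Turn feature dicts into rows of numbers with a fixed column order."""
    columns = sorted({name for d in feature_dicts for name in d})
    rows = [[d.get(name, 0.0) for name in columns] for d in feature_dicts]
    return columns, rows

# Made-up feature dicts in the shape produced by extract_features().
toy = [
    {"mean_compound": 1.2, "mean_positive": 0.15, "wordcount": 3},
    {"mean_compound": 0.9, "mean_positive": 0.05, "wordcount": 0},
]
columns, rows = vectorize(toy)
print(columns)
print(rows)

# Some classifiers reject negative inputs, which is why mean_compound is shifted by 1.
all_non_negative = all(value >= 0 for row in rows for value in row)
print(all_non_negative)
```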


We have now reached 71.73 percent accuracy with only three simple features! While this does not mean that the MLPClassifier will continue to be the best one as we engineer new features, having additional classification algorithms at our disposal is clearly advantageous.

**Conclusion**

We are now quite familiar with the features of NLTK that allow us to process text into objects that we can filter and manipulate, which in turn lets us analyze text data to learn about its properties. We can also use different classifiers to perform sentiment analysis on our data and gain insights into how our audience is responding to content.