### Text Classification

#### Text classification using NLTK

Now that we have covered the basics of preprocessing for Natural Language Processing, we can move on to classifying text with simple machine learning algorithms, using NLTK's `movie_reviews` corpus as the dataset.

In [1]:
import random
import nltk
from nltk.corpus import movie_reviews

In [3]:
# Build list of documents
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
random.shuffle(documents)

print('Number of Documents: {}'.format(len(documents)))
print()
print('First Review: {}'.format(documents[0]))

# Count how often each lowercased token occurs across the whole corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

print('Most common words: {}'.format(all_words.most_common(15)))
print()
print('The word happy: {}'.format(all_words["happy"]))

Number of Documents: 2000

First Review: (['in', '"', 'the', 'sweet', 'hereafter', ',', '"', 'writer', '/', 'director', 'atom', 'egoyan', 'takes', 'us', 'beyond', 'the', 'tragedy', 'of', 'death', 'into', 'the', 'tragedy', 'of', 'living', '.', 'he', 'shows', 'us', 'how', 'it', 'isn', "'", 't', 'dying', 'that', 'hurts', ',', 'but', 'rather', 'the', 'pain', 'of', 'living', 'in', 'the', 'hereafter', 'of', 'death', ',', 'and', 'dealing', 'with', 'the', 'loss', 'and', 'grief', 'that', 'it', 'brings', '.', 'on', 'a', 'cold', 'winter', 'day', 'in', 'a', 'small', ',', 'isolated', 'town', 'in', 'british', 'columbia', ',', 'a', 'school', 'bus', 'full', 'of', 'children', 'slides', 'off', 'the', 'highway', 'and', 'onto', 'a', 'frozen', 'lake', ',', 'where', 'it', 'cracks', 'through', 'the', 'ice', 'and', 'sinks', '.', 'fourteen', 'children', 'die', ',', 'and', 'numerous', 'others', 'are', 'hurt', '.', 'for', 'the', 'residents', 'of', 'this', 'small', 'town', ',', 'all', 'of', 'whose', 'children', '

Most common words: [(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]

The word happy: 215
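
The frequency list above is dominated by punctuation and function words, which say little about sentiment. The following is a minimal sketch of one way to filter such tokens before counting, run on a toy token list. The tiny `STOPWORDS` set and the `content_words` helper are illustrative assumptions, not part of the notebook, and `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it) so the snippet runs without the corpus.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real list would be much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "it", "that", "s"}


def content_words(tokens):
    """Lowercase tokens and keep only alphabetic, non-stopword ones."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t.isalpha() and t not in STOPWORDS]


tokens = ["The", "plot", "is", "thin", ",", "and", "the", "acting", "is",
          "thin", ".", "A", "happy", "ending", "?"]

freq = Counter(content_words(tokens))
print(freq.most_common(1))  # [('thin', 2)]
```

Whether this filtering helps the classifier is an empirical question; the notebook itself keeps all tokens.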


In [4]:
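# We'll use 4000 words as features.
# Note: on NLTK 3 with Python 3.7+, keys() returns words in first-seen order,
# not by frequency, so this slice takes the first 4000 distinct words encountered.
# all_words.most_common(4000) would select the 4000 most frequent words instead.
word_features = list(all_words.keys())[:4000]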
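
Slicing the keys of a frequency distribution and asking for its most common entries are not the same thing. This toy sketch shows the difference, assuming a dictionary that preserves insertion order (Python 3.7+). `collections.Counter` is used as a stand-in for `nltk.FreqDist`, which subclasses it, and the token list is made up for illustration.

```python
from collections import Counter

# Toy token stream; in the notebook this role is played by movie_reviews.words().
tokens = "the plot was bad the acting bad bad".split()
counts = Counter(tokens)  # stand-in for nltk.FreqDist(tokens)

# First-seen order: what slicing the keys gives you.
first_seen = list(counts.keys())[:2]

# Frequency order: what most_common() gives you.
most_frequent = [word for word, _ in counts.most_common(2)]

print(first_seen)     # ['the', 'plot']
print(most_frequent)  # ['bad', 'the']
```

Applied to the notebook, the frequency-ordered variant would be `[w for w, _ in all_words.most_common(4000)]`; the outputs shown in this section were produced with the key slice, so they would change if the selection were swapped.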

The `find_features` function will determine which of the 4000 word features are contained in the review.

In [6]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features


# Let's use an example from a negative review
features = find_features(movie_reviews.words('neg/cv000_29416.txt'))
for key, value in features.items():
    if value:
        print(key)

plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an
accident
one
of
the
guys
dies
but
his
girlfriend
continues
see
him
in
her
life
has
nightmares
what
'
s
deal
?
watch
movie
"
sorta
find
out
critique
mind
-
fuck
for
generation
that
touches
on
very
cool
idea
presents
it
bad
package
which
is
makes
this
review
even
harder
write
since
i
generally
applaud
films
attempt
break
mold
mess
with
your
head
such
(
lost
highway
&
memento
)
there
are
good
ways
making
all
types
these
folks
just
didn
t
snag
correctly
seem
have
taken
pretty
neat
concept
executed
terribly
so
problems
well
its
main
problem
simply
too
jumbled
starts
off
normal
downshifts
fantasy
world
you
as
audience
member
no
going
dreams
characters
coming
back
from
dead
others
who
look
like
strange
apparitions
disappearances
looooot
chase
scenes
tons
weird
things
happen
most
not
explained
now
personally
don
trying
unravel
film
every
when
does
give
me
same
clue
over
again
kind
fed
up
after
while
biggest
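
To make the shape of the feature dictionary concrete, here is a toy version of the same idea with a three-word vocabulary. The helper name `find_features_in` and its explicit `vocabulary` parameter are assumptions made for this sketch; the notebook's `find_features` reads the global `word_features` instead.

```python
def find_features_in(document, vocabulary):
    """Map each vocabulary word to True/False by presence in the document."""
    words = set(document)  # a set makes each membership test cheap
    return {w: (w in words) for w in vocabulary}


# Hypothetical three-word vocabulary and a tokenized toy review.
vocabulary = ["plot", "great", "boring"]
review = ["the", "plot", "was", "boring", ",", "really", "boring"]

features = find_features_in(review, vocabulary)
print(features)  # {'plot': True, 'great': False, 'boring': True}
```

Every review maps to a dictionary with the same keys, and repeated words count only once, so this is a presence/absence (boolean bag-of-words) representation rather than a count-based one.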


In [7]:
# Now let's do it for all the documents
featuresets = [(find_features(rev), category) for (rev, category) in documents]

We can split the featuresets into training and testing datasets using `sklearn`.

In [8]:
from sklearn import model_selection

# Define a seed for reproducibility
seed = 1

# Split the data into training and testing datasets
training, testing = model_selection.train_test_split(featuresets, test_size=0.25, random_state=seed)

In [9]:
print(len(training))
print(len(testing))

1500
500
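
A plain random split does not guarantee the same class balance in both parts. As a hedged sketch, `train_test_split` also accepts a `stratify` argument that keeps label proportions equal across the split. The toy data below is invented for illustration and only mimics the `(features, label)` shape of `featuresets`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy feature sets in the same (features, label) shape as `featuresets`.
toy_featuresets = [({"good": i % 2 == 0}, "pos" if i % 2 == 0 else "neg")
                   for i in range(8)]
labels = [label for _, label in toy_featuresets]

# stratify=labels asks for the same pos/neg ratio in both parts.
train, test = train_test_split(
    toy_featuresets, test_size=0.25, random_state=1, stratify=labels)

print(len(train), len(test))
print(Counter(label for _, label in test))
```

With 8 items and `test_size=0.25`, the split holds out 2 items, one per class when stratified.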


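NLTK's `SklearnClassifier` wrapper lets us train scikit-learn estimators directly on NLTK-style feature sets, that is, lists of `(feature_dict, label)` pairs.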

In [11]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel='linear'))

# Train the model on the training data
model.train(training)

# And test on the testing dataset
accuracy = nltk.classify.accuracy(model, testing) * 100
print("SVC Accuracy: {}".format(accuracy), "%")

SVC Accuracy: 80.0 %
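
To see what the wrapper and the accuracy helper do, the sketch below trains the same kind of model on a tiny hand-made dataset and recomputes accuracy by hand. The data and the names `toy_model` and `manual_accuracy` are assumptions made for illustration; real feature dictionaries in this section have 4000 keys.

```python
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

# Toy (features, label) pairs in the shape SklearnClassifier expects.
train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
    ({"good": False, "bad": True}, "neg"),
]
test = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
]

toy_model = SklearnClassifier(SVC(kernel="linear"))
toy_model.train(train)

# classify() takes a single feature dict and returns its predicted label.
prediction = toy_model.classify({"good": True, "bad": False})
print(prediction)

# Accuracy is the fraction of test items whose prediction matches the label.
predicted = toy_model.classify_many([feats for feats, _ in test])
correct = sum(p == label for p, (_, label) in zip(predicted, test))
manual_accuracy = correct / len(test)
print(manual_accuracy, nltk.classify.accuracy(toy_model, test))
```

Once trained, the full model from this section can label a new tokenized review in the same way, with `model.classify(find_features(tokens))`.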
