By now, we've seen how simple it is to use classifiers out of the box, so it's time to explore further. A widely used Python library for this is scikit-learn (sklearn).

Fortunately, the creators of NLTK recognized the importance of integrating sklearn with the NLTK classifier approach. They developed the SklearnClassifier API for this purpose.

SklearnClassifier: A wrapper that lets scikit-learn classifiers be used within the NLTK framework, so sklearn models can be trained and evaluated on NLTK-style feature sets for NLP tasks.

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews

# to use sklearn api:
from nltk.classify.scikitlearn import SklearnClassifier

# Importing sklearn classifiers:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

<br><br><br>

In [2]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [3]:
random.shuffle(documents)

In [4]:
all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

In [5]:
all_words = nltk.FreqDist(all_words)

In [6]:
len(all_words)

39768

In [17]:
# all_words.items()

In [7]:
word_features = list(all_words.keys())[:3000]
# word_features: the first 3,000 distinct words, in the order the frequency distribution first saw them.
# Note that keys() is not sorted by frequency; all_words.most_common(3000) would give the most frequent words instead.
# These 3,000 words are used as the features to classify whether a review is positive or negative.
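
A note on ordering: `nltk.FreqDist` subclasses `collections.Counter`, so `keys()` yields words in first-seen order rather than by count. The sketch below uses a plain `Counter` and a made-up token list (both are stand-ins, not the notebook's data) to contrast slicing the keys with calling `most_common`.

```python
from collections import Counter

# Hypothetical toy tokens standing in for movie_reviews.words()
tokens = ["plot", "the", "a", "the", "movie", "a", "the"]
freq = Counter(tokens)

# Slicing the keys keeps first-seen (insertion) order
first_seen = list(freq.keys())[:2]

# most_common sorts by count, highest first
most_frequent = [word for word, _count in freq.most_common(2)]

print(first_seen)
print(most_frequent)
```

If frequency-ranked features are what is wanted, the second form is probably the one to use; the accuracies reported in this notebook come from the first.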

In [8]:
for feat in word_features[:20]:
    print(feat)

plot
:
two
teen
couples
go
to
a
church
party
,
drink
and
then
drive
.
they
get
into
an


In [9]:
# Build a quick function that checks each of the 3,000 feature words against a review,
# marking each word as present (True) or absent (False):

def find_features(review):
    words = set(review)
    features = {}
    for w in word_features:
        features[w] = (w in words)     # w in words will be either True or False

    return features
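
To make the shape of the output concrete, here is a minimal sketch of the same idea on a toy vocabulary. The three-word `word_features` list and the `sample_review` tokens are hypothetical, chosen only for illustration.

```python
# Hypothetical three-word vocabulary standing in for the 3,000 feature words
word_features = ["good", "bad", "plot"]

def find_features(review):
    # A set gives constant-time membership checks and ignores repeats
    words = set(review)
    return {w: (w in words) for w in word_features}

sample_review = ["a", "good", "plot", "good"]
features = find_features(sample_review)
print(features)
```

Every review maps to a dictionary with the same keys, which is what lets NLTK (and, through the wrapper, sklearn) treat the reviews as fixed-length feature vectors.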

In [10]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

In [11]:
len(featuresets)

2000

In [12]:
featuresets[:1]

[({'plot': False,
   ':': True,
   'two': False,
   'teen': False,
   'couples': False,
   'go': False,
   'to': True,
   'a': True,
   'church': False,
   'party': False,
   ',': True,
   'drink': False,
   'and': True,
   'then': True,
   'drive': False,
   '.': True,
   'they': True,
   'get': True,
   'into': True,
   'an': True,
   'accident': False,
   'one': True,
   'of': True,
   'the': True,
   'guys': False,
   'dies': False,
   'but': True,
   'his': True,
   'girlfriend': False,
   'continues': False,
   'see': True,
   'him': True,
   'in': True,
   'her': False,
   'life': False,
   'has': True,
   'nightmares': False,
   'what': True,
   "'": True,
   's': True,
   'deal': True,
   '?': False,
   'watch': False,
   'movie': True,
   '"': True,
   'sorta': False,
   'find': False,
   'out': True,
   'critique': False,
   'mind': False,
   '-': True,
   'fuck': False,
   'for': True,
   'generation': False,
   'that': True,
   'touches': False,
   'on': True,
   'very': False,
   ...

<br><br>
### Training set:

In [13]:
training_set = featuresets[:1900]
# training_set[0]

### Testing set:

In [14]:
testing_set = featuresets[1900:]

<br><br><br>
### Creating classifier:

In [15]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

<br><br>
### Testing and accuracy:

In [16]:
print(f"Classifier accuracy: {nltk.classify.accuracy(classifier, testing_set)*100}%")

Classifier accuracy: 78.0%
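
As a rough illustration of what this figure measures: accuracy is the fraction of test items whose predicted label matches the true label. Because the test set here holds 100 reviews (2000 minus 1900), each review accounts for one percentage point. The `accuracy` and `predict` functions and the toy data below are hypothetical stand-ins, not NLTK's implementation.

```python
def accuracy(predict, labelled_featuresets):
    """Fraction of (features, label) pairs where predict(features) == label."""
    correct = sum(1 for feats, label in labelled_featuresets
                  if predict(feats) == label)
    return correct / len(labelled_featuresets)

def predict(feats):
    # Hypothetical rule-based predictor, for illustration only
    return "pos" if feats.get("good") else "neg"

toy_test = [
    ({"good": True}, "pos"),
    ({"good": False}, "neg"),
    ({"good": True}, "neg"),   # this one is misclassified
    ({"good": False}, "neg"),
]

score = accuracy(predict, toy_test)
print(f"Toy accuracy: {score * 100}%")
```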


<br><br>
## Using SklearnClassifier to apply different sklearn classifiers to NLP tasks:

In [19]:
BNB_Classifier = SklearnClassifier(BernoulliNB())
BNB_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(BNB_Classifier, testing_set)*100}%")

Classifier accuracy: 78.0%


In [20]:
MNB_Classifier = SklearnClassifier(MultinomialNB())
MNB_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(MNB_Classifier, testing_set)*100}%")

Classifier accuracy: 79.0%


In [21]:
LR_Classifier = SklearnClassifier(LogisticRegression())
LR_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(LR_Classifier, testing_set)*100}%")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Classifier accuracy: 80.0%
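
The warning above suggests raising `max_iter`. A minimal sketch of that change follows, using a tiny made-up training set instead of the movie reviews; the value 1000 is an assumption chosen for illustration, and whether it changes the accuracy on the real data is not something this sketch shows.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical toy feature sets in the same (dict, label) shape as featuresets
toy_train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
] * 5

# max_iter raises the solver's iteration cap (1000 is an illustrative value)
lr_classifier = SklearnClassifier(LogisticRegression(max_iter=1000))
lr_classifier.train(toy_train)

prediction = lr_classifier.classify({"good": True, "bad": False})
print(prediction)
```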


In [22]:
SGD_Classifier = SklearnClassifier(SGDClassifier())
SGD_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(SGD_Classifier, testing_set)*100}%")

Classifier accuracy: 80.0%


In [23]:
SVC_Classifier = SklearnClassifier(SVC())
SVC_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(SVC_Classifier, testing_set)*100}%")

Classifier accuracy: 84.0%


In [24]:
LinearSVC_Classifier = SklearnClassifier(LinearSVC())
LinearSVC_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(LinearSVC_Classifier, testing_set)*100}%")

Classifier accuracy: 79.0%


In [25]:
NuSVC_Classifier = SklearnClassifier(NuSVC())
NuSVC_Classifier.train(training_set)
print(f"Classifier accuracy: {nltk.classify.accuracy(NuSVC_Classifier, testing_set)*100}%")

Classifier accuracy: 84.0%
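
The cells above repeat the same three steps for each estimator. One way to cut the repetition, sketched here on toy data, is to loop over a dictionary of estimators. The `estimators` and `results` names and the toy feature sets are hypothetical; in the notebook, `training_set` and `testing_set` would take their place.

```python
from nltk.classify import accuracy
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical toy data in the same (dict, label) shape as featuresets
toy_train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
] * 10
toy_test = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
]

estimators = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
}

results = {}
for name, estimator in estimators.items():
    # train() returns the wrapper itself, so the calls can be chained
    clf = SklearnClassifier(estimator).train(toy_train)
    results[name] = accuracy(clf, toy_test)
    print(f"{name} accuracy: {results[name] * 100}%")
```

Keeping the scores in a dictionary also makes it easy to compare or sort the classifiers afterwards.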
