# Text Classification with NLTK

## 1. Reading datasets

In [1]:
import os
import codecs
# Read every file in a folder and return the contents as a list of strings.
def read_in(folder):
    # List the names of the files in the directory
    files = os.listdir(folder)
    a_list = []
    for a_file in files:
        # Skip hidden files
        if not a_file.startswith("."):
            # Read the contents of each file; os.path.join works with or
            # without a trailing slash on the folder name.
            path = os.path.join(folder, a_file)
            with codecs.open(path, "r", encoding="ISO-8859-1", errors="ignore") as f:
                a_list.append(f.read())
    return a_list

Let's read the datasets using this function.

In [2]:
spam_list = read_in("enron1/spam/")
print(f"A number of mails that includes spam: {len(spam_list)}")
print(spam_list[10])
print("-----------------------\n")
ham_list = read_in("enron1/ham/")
print(f"A number of mails that includes ham: {len(ham_list)}")
print(ham_list[10])

A number of mails that includes spam: 1500
Subject: re: rdd, the auxiliary iturean
Free cable@ tv
Dabble bam servomechanism ferret canopy bookcase befog seductive elapse ballard daphne acrylate deride decadent desolate else sequestration condition ligament ornately yaquI giblet emphysematous woodland lie segovia almighty coffey shut china clubroom diagnostician
Cheer leadsman abominate cambric oligarchy mania woodyard quake tetrachloride contiguous welsh depressive synaptic trauma cloister banks canadian byroad alexander gnaw annette charlie

-----------------------

A number of mails that includes ham: 3672
Subject: entex transistion
The purpose of the email is to recap the kickoff meeting held on yesterday
With members from commercial and volume managment concernig the entex account:
Effective january 2000, thu nguyen (x 37159) in the volume managment group,
Will take over the responsibility of allocating the entex contracts. Howard
And thu began some training this month and will con

As you can see, the function reads 1500 emails from `enron1/spam` and 3672 emails from `enron1/ham`.

Let's now combine the data into a single structure and shuffle it.

In [3]:
# Use the random module to shuffle.
import random
# Build the all_emails list of (content, label) pairs with list comprehensions.
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
# Fix the seed so the shuffle is reproducible.
random.seed(42)
# Shuffle in place.
random.shuffle(all_emails)
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails
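
Fixing the seed matters because the train/test split later in this notebook depends on the shuffled order. The following is a minimal sketch of that idea on toy data. The helper `shuffled_copy` and the four toy pairs are hypothetical and not part of the notebook. It uses a private `random.Random` generator, which leaves the global random state untouched.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)   # separate generator; global state is untouched
    result = list(items)        # copy so the input list is not modified
    rng.shuffle(result)
    return result

# Hypothetical toy data standing in for the (email, label) pairs.
toy = [("buy now", "spam"), ("meeting at 10", "ham"),
       ("cheap pills", "spam"), ("see attached", "ham")]

first = shuffled_copy(toy, seed=42)
second = shuffled_copy(toy, seed=42)
print(first == second)                # same seed, same order
print(sorted(first) == sorted(toy))   # shuffling only reorders the items
```

Both checks print `True`: the same seed always yields the same order, and no item is lost or duplicated.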


Let's take a look at the first two entries.

In [4]:
all_emails[:2]

[('Subject: bloodline, ahead of the street microcap alert\r\nWhen living with sheriff is obsequious, blood clot beyond deficit reach an understanding with toward blood clot. [3\r\n',
  'spam'),
 ('Subject: well heads\r\nPhillips has changed there nom at meter 6673. Vance had 119 in his file but\r\nPhillips sent in a nom today for 948. So far it has flowed for april\r\nBetween 1100 and 841.\r\nPrize has some changes.\r\nMeter from\r\nTo march range.\r\n4028 1113\r\n717 1137 to 887\r\n5579 2733\r\n2381 2800 to 2578\r\n5767 115\r\n150 140 to 103\r\n6191 249\r\n154 253 to 217\r\n6675 120\r\n239 78 to 156\r\n9604 109\r\n32 38 to 63\r\n4965 149\r\n180 71 to 288\r\n5121 1163\r\n1135 303 to 703\r\nVintage\r\n989603 0\r\n330 no mom in april, nomed 270\r\nIn march.',
  'ham')]

## 2. Split the text into words

Let's use NLTK's tokenizer. It takes running text as input and returns a list of tokens (words and punctuation marks), splitting the text with a set of regular-expression rules.
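
To get a feel for what regular-expression tokenization means, here is a deliberately simplified sketch that uses only the standard library. The pattern and the name `simple_tokenize` are assumptions made for illustration. They are not NLTK's actual rules, which are more elaborate and, for instance, treat contractions differently.

```python
import re

# Hypothetical, simplified pattern: a run of word characters (optionally with
# internal periods or apostrophes), or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[.']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("I am living in U.S.A!"))
# ['i', 'am', 'living', 'in', 'u.s.a', '!']
```

Even this toy pattern keeps `u.s.a` together and splits off the trailing `!`, which is the kind of decision a tokenizer has to make.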

In [5]:
import nltk
from nltk import word_tokenize

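# Turn a text into a feature dictionary: each lowercased token maps to True.
# Note: word_tokenize relies on NLTK's tokenizer data; if it is missing, NLTK
# raises a LookupError that names the resource to fetch with nltk.download().
def get_features(text):
    features = {}
    for word in word_tokenize(text.lower()):
        features[word] = True
    return features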

In [6]:
print(get_features("I am living in U.S.A!"))

{'i': True, 'am': True, 'living': True, 'in': True, 'u.s.a': True, '!': True}


## 3. Extract and normalize the features

In [7]:
# Extract a feature dictionary from every email in the dataset.
all_features = [(get_features(email), label) for (email, label) in all_emails]

In [8]:
print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[10][0]))

5172
27
39


## 4. Train the classifier

Naïve Bayes is a probabilistic classifier: it estimates the probability that an email is spam and the probability that it is ham, given the words it contains, and then predicts whichever class is more probable. It is called "naïve" because it treats the words as independent of one another given the class.
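
The idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not NLTK's implementation: the function names, the five training examples, and the add-one smoothing are all hypothetical choices made to keep the example small. For each class it multiplies the class prior by the smoothed probability of each word appearing in that class (working in log space), then picks the class with the highest score.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (set_of_words, label). Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)   # label -> word -> number of docs containing it
    for words, label in labeled_docs:
        label_counts[label] += 1
        for word in set(words):
            word_counts[label][word] += 1
    return label_counts, word_counts

def log_scores(words, label_counts, word_counts):
    """Return log(prior) + sum of log P(word present | label) for each label."""
    total = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)              # log prior
        for word in set(words):
            # Add-one smoothing over the two outcomes (present / absent),
            # so unseen words never give a zero probability.
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    return scores

def predict(words, label_counts, word_counts):
    scores = log_scores(words, label_counts, word_counts)
    return max(scores, key=scores.get)

# Hypothetical toy training data.
toy_train = [
    ({"free", "prescription", "now"}, "spam"),
    ({"cheap", "prescription", "pills"}, "spam"),
    ({"meeting", "minutes", "attached"}, "ham"),
    ({"nomination", "for", "meter"}, "ham"),
    ({"meeting", "moved", "to", "monday"}, "ham"),
]
label_counts, word_counts = train_nb(toy_train)
print(predict({"free", "prescription"}, label_counts, word_counts))   # spam
print(predict({"meeting", "monday"}, label_counts, word_counts))      # ham
```

For the first query the spam score is 0.4 × 0.5 × 0.75 = 0.15 against a ham score of 0.6 × 0.2 × 0.2 = 0.024, so spam wins even though ham has the larger prior.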

In [9]:
from nltk import NaiveBayesClassifier, classify

def train(features, proportion):
    train_size = int(len(features) * proportion)
    # Split the features into the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print(f"Training set size = {len(train_set)} emails")
    print(f"Test set size = {len(test_set)} emails")
    # Train the classifier
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails
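
The sizes above follow directly from the slicing arithmetic, and the same arithmetic shows why the data had to be shuffled first. The sketch below is an illustration only: `split_sizes` is a hypothetical helper, and the label list simply mirrors the order in which the spam and ham emails were concatenated before shuffling.

```python
def split_sizes(n_items, proportion):
    """Return (train_size, test_size) for a simple slice-based split."""
    train_size = int(n_items * proportion)   # int() truncates 4137.6 to 4137
    return train_size, n_items - train_size

train_size, test_size = split_sizes(5172, 0.8)
print(train_size, test_size)   # 4137 1035

# Labels in their original, unshuffled order: all spam first, then all ham.
unshuffled_labels = ["spam"] * 1500 + ["ham"] * 3672
print(set(unshuffled_labels[train_size:]))   # {'ham'}
```

Without shuffling, the last 1035 items would all be ham, so the test set would say nothing about how well spam is detected.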


## 5. Evaluate your classifier

In [10]:
def evaluate(train_set, test_set, classifier):
    # Check how the classifier performs on the training and test sets
    print(f"Accuracy on the training set = {classify.accuracy(classifier, train_set)}")
    print(f"Accuracy on the test set = {classify.accuracy(classifier, test_set)}")
    # Check which words are most informative for the classifier
    classifier.show_most_informative_features(20)

evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.961082910321489
Accuracy on the test set = 0.9420289855072463
Most Informative Features
               forwarded = True              ham : spam   =    198.3 : 1.0
                    2004 = True             spam : ham    =    143.8 : 1.0
                     nom = True              ham : spam   =    126.0 : 1.0
            prescription = True             spam : ham    =    122.9 : 1.0
                    pain = True             spam : ham    =     98.8 : 1.0
                  health = True             spam : ham    =     82.7 : 1.0
                     ect = True              ham : spam   =     76.8 : 1.0
                    2001 = True              ham : spam   =     75.8 : 1.0
                featured = True             spam : ham    =     74.7 : 1.0
              nomination = True              ham : spam   =     72.1 : 1.0
             medications = True             spam : ham    =     69.9 : 1.0
                  differ = True             spam : ham
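
Each row in the list above can be read as a likelihood ratio: how much more likely a word is to appear in an email of one class than in an email of the other. The sketch below shows the idea with made-up counts. The function name, the counts, and the smoothing constant of 0.5 are all assumptions for illustration, and NLTK's own estimates differ in detail, so this is not expected to reproduce the numbers above.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """Ratio of smoothed P(word present | class A) to P(word present | class B)."""
    # Smoothing keeps the ratio finite when a word never occurs in one class.
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b

# Hypothetical counts, not taken from the Enron data: the word appears in
# 300 of 3000 ham emails and in 0 of 1200 spam emails.
ratio = likelihood_ratio(count_a=300, total_a=3000, count_b=0, total_b=1200)
print(f"ham : spam = {ratio:.1f} : 1.0")
```

A word that is common in one class and almost absent from the other gets a large ratio, which is why domain terms such as `nom` and `forwarded` rank so highly for ham.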

Let's check the contexts in which specific words occur. For example, let's explore the word "subscribers".

In [11]:
from nltk.text import Text
def concordance(data_list, search_word):
    for email in data_list:
        word_list = word_tokenize(email.lower())
        # Only build a Text object for emails that contain the word
        if search_word in word_list:
            Text(word_list).concordance(search_word)

In [12]:
print("subscribers in HAM:")
# The search word must not have trailing whitespace, or it never matches a token
concordance(ham_list, "subscribers")
print("\n---------------------------")
print("\nsubscribers in SPAM:")
concordance(spam_list, "subscribers")

subscribers in HAM:

---------------------------

subscribers in SPAM:
Displaying 1 of 1 matches:
tment decisions by its readers or subscribers . s 2 p is not a registered broke
Displaying 1 of 1 matches:
k on the otcbb to our millions of subscribers for substantial profits immediate
Displaying 1 of 1 matches:
tment decisions by its readers or subscribers . ddsr is not a registered broker
Displaying 1 of 1 matches:
ity alert advises all readers and subscribers to seek advice from a registered 
Displaying 3 of 3 matches:
 through the roof . as one of our subscribers you already probably reaped the b
0 % in only 3 days and all of our subscribers turned a quick buck ! don ' t del
tment decisions by its readers or subscribers . it is strongly recommended that
Displaying 1 of 1 matches:
dvice . we advise all readers and subscribers to seek advice from a registered 
Displaying 1 of 1 matches:
dvice . we advise all readers and subscribers to seek advice from a registered 
Displaying 2 of 2 ma

Let's create a few new messages to use as input.

In [13]:
test_spam_list = ["Participate in our new lottery!", 
                  "Try out this new medicine"]
test_ham_list = ["See the minutes from the last meeting attached", 
                 "Investors are coming to our office on Monday"]

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

new_test_set = [(get_features(email), label) for (email, label) in test_emails]

evaluate(train_set, new_test_set, classifier)

Accuracy on the training set = 0.961082910321489
Accuracy on the test set = 1.0
Most Informative Features
               forwarded = True              ham : spam   =    198.3 : 1.0
                    2004 = True             spam : ham    =    143.8 : 1.0
                     nom = True              ham : spam   =    126.0 : 1.0
            prescription = True             spam : ham    =    122.9 : 1.0
                    pain = True             spam : ham    =     98.8 : 1.0
                  health = True             spam : ham    =     82.7 : 1.0
                     ect = True              ham : spam   =     76.8 : 1.0
                    2001 = True              ham : spam   =     75.8 : 1.0
                featured = True             spam : ham    =     74.7 : 1.0
              nomination = True              ham : spam   =     72.1 : 1.0
             medications = True             spam : ham    =     69.9 : 1.0
                  differ = True             spam : ham    =     66.7 

See how they get classified:

In [14]:
for email in test_spam_list:
    print (email)
    print (classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam


In [15]:
for email in test_ham_list:
    print (email)
    print (classifier.classify(get_features(email)))

See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham


Finally, let's classify emails interactively:

In [17]:
while True:
    email = input("Type in your email here (or press 'Enter'): ")
    if len(email)==0:
        break
    else: 
        prediction = classifier.classify(get_features(email))
        print (f"This email is likely {prediction}\n")

Type in your email here (or press 'Enter'): congrats you make money
This email is likely spam

Type in your email here (or press 'Enter'): I want to work with you
This email is likely ham

Type in your email here (or press 'Enter'): 




## Source
- Getting started with NLP