# 18BCE101
## Practical 7
## Aim:
Classify given text data into predetermined categories. Either (i) spam mail classification or (ii) news article classification may be chosen as the example.
 
Given a set of labeled email documents, classify them as spam or non-spam (ham) using a Naive Bayes classifier.

#### Importing the necessary libraries

In [1]:
import nltk
import os
import random
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from collections import Counter
from nltk import NaiveBayesClassifier, classify

#### 1. Loading the files and categorizing the content as spam/ham

In [2]:
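def read_files(start_path):
    """Return the text of every file found under start_path."""
    a_list = []
    for path, dirs, files in os.walk(start_path):
        for filename in files:
            # The context manager closes each file as soon as it has been read
            with open(os.path.join(path, filename), 'r', encoding="latin1") as f:
                a_list.append(f.read())
    return a_list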

#Loading the data
spam = read_files('data/spam/')
ham = read_files('data/ham/')

In [3]:
all_emails = [(email, 'spam') for email in spam]
all_emails += [(email, 'ham') for email in ham]

In [4]:
all_emails[:3]

[('Subject: dobmeos with hgh my energy level has gone up ! stukm\nintroducing\ndoctor - formulated\nhgh\nhuman growth hormone - also called hgh\nis referred to in medical science as the master hormone . it is very plentiful\nwhen we are young , but near the age of twenty - one our bodies begin to produce\nless of it . by the time we are forty nearly everyone is deficient in hgh ,\nand at eighty our production has normally diminished at least 90 - 95 % .\nadvantages of hgh :\n- increased muscle strength\n- loss in body fat\n- increased bone density\n- lower blood pressure\n- quickens wound healing\n- reduces cellulite\n- improved vision\n- wrinkle disappearance\n- increased skin thickness texture\n- increased energy levels\n- improved sleep and emotional stability\n- improved memory and mental alertness\n- increased sexual potency\n- resistance to common illness\n- strengthened heart muscle\n- controlled cholesterol\n- controlled mood swings\n- new hair growth and color restore\nread\nm

In [5]:
len(all_emails)

5172

In [8]:
#randomly shuffle the spam and ham examples
random.shuffle(all_emails)
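
Because `random.shuffle` produces a different order on every run, the train/test split and the reported accuracies will vary between runs. The following is a minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this experiment. The seed value, the toy `emails` list and the helper `shuffled_copy` are illustrative and not part of the original notebook.

```python
import random

# Toy stand-in for all_emails: (text, label) pairs
emails = [("cheap pills", "spam"), ("meeting at noon", "ham"),
          ("win money now", "spam"), ("gas nomination", "ham")]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    copy = list(items)
    rng.shuffle(copy)
    return copy

first = shuffled_copy(emails, seed=42)
second = shuffled_copy(emails, seed=42)
print(first == second)  # the same seed gives the same order
```

In the notebook itself, calling `random.seed(...)` once before `random.shuffle(all_emails)` would have a similar effect.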

#### 2. Preprocessing

In [9]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\labdh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [10]:
lemmatizer = WordNetLemmatizer()
def preprocess(sentence):
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

#### 3. Extracting features

In [11]:
stoplist = stopwords.words('english')

In [12]:
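def get_features(text, setting):
    #bow - bag of words: map each non-stopword token to its count in the email
    if setting=='bow':
        return {word: count for word, count in Counter(preprocess(text)).items() if word not in stoplist}
    # Fail loudly instead of silently returning None for an unsupported setting
    raise ValueError('Unknown feature setting: ' + str(setting))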
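
To make the bag-of-words representation concrete, here is a small standalone sketch that uses only the standard library. It assumes plain whitespace tokenization and a hypothetical four-word stoplist (`toy_stoplist`) instead of the NLTK tokenizer, lemmatizer and English stopword list used in the notebook.

```python
from collections import Counter

# Hypothetical mini stoplist; the notebook uses stopwords.words('english')
toy_stoplist = {"the", "is", "a", "to"}

def toy_bow(text):
    """Count lowercase whitespace-separated tokens, skipping stopwords."""
    tokens = text.lower().split()
    return {word: count for word, count in Counter(tokens).items()
            if word not in toy_stoplist}

features = toy_bow("The offer is free free free to a winner")
print(features)
```

Note that the notebook filters stopwords after lemmatization. This is probably why tokens such as 'wa' and 'ha' appear in the feature output: the lemmatized forms of "was" and "has" no longer match the stoplist entries.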

In [13]:
all_features = [(get_features(email, 'bow'), label) for (email, label) in all_emails]

In [14]:
all_features[:3]

[({'subject': 1,
   ':': 3,
   'txu': 9,
   'lone': 4,
   'star': 4,
   'working': 1,
   'clearing': 1,
   'old': 1,
   '/': 7,
   'gas': 5,
   'distribution': 1,
   'balance': 1,
   '8': 4,
   '99': 4,
   '9': 2,
   '.': 13,
   'originally': 2,
   'billed': 1,
   'nominated': 1,
   'quantity': 1,
   'shortpaid': 1,
   'u': 1,
   'due': 1,
   'meter': 2,
   'adjustment': 1,
   'brokered': 2,
   'deal': 10,
   'sold': 2,
   'purchased': 2,
   'exxonmobil': 5,
   'highland': 5,
   'ha': 1,
   'provided': 1,
   'support': 1,
   'since': 2,
   'pipeline': 1,
   ',': 12,
   'however': 1,
   'unsure': 1,
   'volume': 1,
   'management': 1,
   'reallocate': 1,
   'example': 1,
   '10': 1,
   '000': 1,
   'mmbtu': 4,
   '3': 2,
   '1': 1,
   '200': 1,
   'energy': 3,
   '800': 1,
   'show': 1,
   'total': 1,
   'amount': 1,
   'wa': 2,
   '240': 1,
   'purchase': 3,
   'piece': 1,
   'allocated': 1,
   '?': 1,
   'august': 1,
   '1999': 2,
   'sale': 2,
   'electric': 2,
   '&': 2,
   '#': 7,


#### 4. Training the classifier

In [15]:
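def train(features, samples_proportion):
    # Despite its name, this function only splits the data; training happens in the next cell.
    # The first samples_proportion of the (already shuffled) examples form the training set.
    train_size = int(len(features) * samples_proportion)
    train_set, test_set = features[:train_size], features[train_size:]
    print ('Training set size = ' + str(len(train_set)) + ' emails')
    print ('Test set size = ' + str(len(test_set)) + ' emails')
    return train_set, test_set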
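
The split above simply slices the list, so it relies on the earlier shuffle to mix the two classes. The sketch below, on made-up data, shows what can happen when the examples are still ordered by label. The helper names `split_features` and `label_counts` are illustrative and do not appear in the notebook.

```python
from collections import Counter

def split_features(features, proportion):
    """Slice a list of (featureset, label) pairs into train and test parts."""
    train_size = int(len(features) * proportion)
    return features[:train_size], features[train_size:]

def label_counts(dataset):
    """Count how many examples carry each label."""
    return Counter(label for _, label in dataset)

# Toy data, deliberately left unshuffled: 3 spam examples followed by 7 ham
toy = [({}, "spam")] * 3 + [({}, "ham")] * 7
toy_train, toy_test = split_features(toy, 0.8)
print(label_counts(toy_train))
print(label_counts(toy_test))
```

Here the test part contains only ham, which would make any accuracy figure measured on it uninformative about spam detection.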

In [16]:
train_set, test_set = train(all_features, 0.8)
classifier = NaiveBayesClassifier.train(train_set)

Training set size = 4137 emails
Test set size = 1035 emails
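
As a rough illustration of how a Naive Bayes model chooses a label, the sketch below scores one email by adding the log prior of each class to the log likelihoods of the features it contains. All probabilities are made up for the example, and the function `log_scores` is hypothetical. NLTK estimates these quantities from the training set and its internals differ in detail.

```python
import math

def log_scores(features, priors, likelihoods):
    """Return log P(label) + sum of log P(feature | label) for each label."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[label][feature])
        scores[label] = score
    return scores

# Made-up probabilities, for illustration only
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {
    "spam": {"prescription": 0.20, "meter": 0.01},
    "ham": {"prescription": 0.002, "meter": 0.15},
}

scores = log_scores(["prescription"], priors, likelihoods)
best = max(scores, key=scores.get)
print(best)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.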


#### 5. Evaluating the results

In [17]:
def evaluate(train_set, test_set, classifier):
    print ('Accuracy on the training set = ' + str(classify.accuracy(classifier, train_set)))
    print ('Accuracy of the test set = ' + str(classify.accuracy(classifier, test_set)))

In [18]:
evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.9584239787285472
Accuracy of the test set = 0.9314009661835749
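
Accuracy alone does not show which kind of error the classifier makes. A minimal sketch of precision and recall for the spam class follows, assuming lists of gold and predicted labels are available. In the notebook the predictions could be collected by calling `classifier.classify(featureset)` on each test item. The toy labels and the helper `precision_recall` are illustrative.

```python
def precision_recall(gold, predicted, positive="spam"):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels: one ham email flagged as spam, one spam email missed
gold = ["spam", "spam", "ham", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "ham", "spam"]
precision, recall = precision_recall(gold, pred)
print(precision, recall)
```

For a spam filter, low precision means legitimate mail is being flagged, which is usually the more costly mistake.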


In [19]:
classifier.show_most_informative_features(20)

Most Informative Features
               forwarded = 1                 ham : spam   =    145.6 : 1.0
            prescription = 1                spam : ham    =     94.3 : 1.0
                     nom = 1                 ham : spam   =     92.8 : 1.0
                      xl = 2                 ham : spam   =     89.0 : 1.0
                    pain = 1                spam : ham    =     87.9 : 1.0
                   brand = 1                spam : ham    =     79.8 : 1.0
                    2005 = 1                spam : ham    =     73.4 : 1.0
                   meter = 1                 ham : spam   =     71.6 : 1.0
                      ex = 1                spam : ham    =     65.4 : 1.0
                     ibm = 1                spam : ham    =     62.1 : 1.0
                creative = 1                spam : ham    =     62.1 : 1.0
               trademark = 1                spam : ham    =     60.5 : 1.0
                     sex = 1                spam : ham    =     57.3 : 1.0
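
Each row in this list reports a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below reproduces that kind of calculation on made-up counts. The additive smoothing constants (`gamma`, `bins`) and the function `likelihood_ratio` are assumptions chosen for illustration and are not necessarily what NLTK uses, so the figures will not match the notebook output.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio of smoothed P(value | label a) to smoothed P(value | label b)."""
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Made-up counts: the value occurs in 300 of 1000 ham emails
# and in 2 of 1000 spam emails
ratio = likelihood_ratio(300, 1000, 2, 1000)
print(round(ratio, 1))
```

A large ratio means that observing the feature value shifts the decision strongly towards the first label.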

# Conclusion
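Classification underlies many machine learning tasks.
Spam filtering is a binary classification task: each email must be assigned to either the “spam” or the “ham” class. We used NLTK's `NaiveBayesClassifier` for this, with bag-of-words counts as features. Strictly speaking, this is not a Bernoulli Naive Bayes model, because each (word, count) pair is treated as a categorical feature value and not as a binary present/absent indicator.
Word occurrence and frequency are among the most informative features for spam detection, as the list of most informative features shows.
We preprocess the data (lowercasing, tokenization, lemmatization and stopword removal) before using words as features. With an 80/20 split, the classifier reached about 95.8% accuracy on the training set and 93.1% on the test set.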