<b>Text Classification Tutorial with Naive Bayes</b>

The challenge of text classification is to attach labels to bodies of text (e.g., tax
document, medical form) based on the text itself. For example, think of the spam folder in
your email. How does your email provider know whether a particular message is spam or "ham" (not
spam)? We'll take a look at one natural language processing technique for text classification
called Naive Bayes.

The classic example used to illustrate Bayes' Theorem involves medical testing. Let's
suppose that we are getting tested for the flu. When we get the results of a medical test back,
there are really four cases to consider:
● True Positive: The test says we have the flu and we actually have the flu
● True Negative: The test says we don't have the flu and we actually don't have the flu
● False Positive: The test says we have the flu and we actually don't have the flu
● False Negative: The test says we don't have the flu and we actually do have the flu
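
To make the four cases concrete, here is a minimal sketch of Bayes' Theorem applied to the flu test. The function name and all three numbers (prior, sensitivity, false positive rate) are hypothetical values chosen only for illustration; they do not come from any real test.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Return P(flu | positive test) using Bayes' Theorem."""
    # Evidence: overall probability of a positive result,
    # i.e. true positives plus false positives.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


# Hypothetical inputs: 5% of people have the flu, the test catches 90% of
# real cases, and it wrongly flags 10% of healthy people.
p_flu = posterior_given_positive(prior=0.05, sensitivity=0.90,
                                 false_positive_rate=0.10)
print("P(flu | positive) = {0:.4f}".format(p_flu))
```

With these assumed numbers the posterior is only about one in three, because false positives from the large healthy group outnumber the true positives.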

The key assumption is that each
word is independent of all other words, given the class of the message. In reality, this is not
true! Knowing which words come before or after does influence the next or previous word. However,
making this assumption greatly simplifies the math and, in practice, works well. This assumption
is why the technique is called Naive Bayes.
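
The simplification can be sketched as follows: under the independence assumption, the probability of a whole message given a class is just the product of per-word probabilities. The dictionary and its values below are made up for illustration, and `naive_likelihood` is a hypothetical helper, not part of the tutorial's class.

```python
import math

# Hypothetical per-word likelihoods P(word | spam); the numbers are made up.
p_word_given_spam = {"free": 0.05, "money": 0.03, "meeting": 0.001}


def naive_likelihood(words, p_word):
    """P(words | class) under the naive assumption: a product of per-word terms."""
    return math.prod(p_word[w] for w in words)


joint = naive_likelihood(["free", "money"], p_word_given_spam)
print(joint)
```

Without the assumption we would need a probability for every possible word sequence; with it, one number per word and class is enough. Note that `math.prod` requires Python 3.8 or later.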

In [None]:
import os
import re
import string
import math

DATA_DIR = 'enron'
target_names = ['ham', 'spam']

def get_data(data_dir):
    # Return the e-mail texts and their labels as two parallel lists.
    # data is a list of e-mail texts
    # target is a list of 0/1 labels (1 = spam, 0 = ham)
    subfolders = ['enron%d' % i for i in range(1, 7)]
    data = []
    target = []
    for subfolder in subfolders:
        # spam
        spam_files = os.listdir(os.path.join(data_dir, subfolder, 'spam'))
        for spam_file in spam_files:
            with open(os.path.join(data_dir, subfolder, 'spam', spam_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(1)
        # ham
        ham_files = os.listdir(os.path.join(data_dir, subfolder, 'ham'))
        for ham_file in ham_files:
            with open(os.path.join(data_dir, subfolder, 'ham', ham_file), encoding="latin-1") as f:
                data.append(f.read())
                target.append(0)
    return data, target

This will produce two lists: the data list, where each element is the text of an email, and
the target list, whose elements are binary labels (1 meaning spam and 0 meaning ham). Now let's
create a class and add some helper functions for string manipulation.

In [None]:
class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""
    def clean(self, s):
        # Remove punctuation from the string s.
        # str.maketrans():
        # If only one argument is supplied, it must be a dictionary.
        # If two arguments are passed, they must be two strings of equal length;
        # each character in the first string is mapped to the character at the
        # same position in the second string.
        # If three arguments are passed, each character in the third argument is mapped to None.
        translator = str.maketrans("", "", string.punctuation)  # string.punctuation is a string of ASCII punctuation characters
        return s.translate(translator)

    def tokenize(self, text):
        # Return a list of lowercase words from a text.
        text = self.clean(text).lower()
        # Raw string avoids an invalid-escape warning; drop empty tokens from leading/trailing separators.
        return [word for word in re.split(r"\W+", text) if word]

    def get_word_counts(self, words):
        # Given a list of words, return a dict mapping each word to its count.
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

    def fit(self, X, Y):
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()

        # n is the number of data points (e-mails)
        n = len(X)

        # calculate log(P(spam)) and log(P(ham))
        self.log_class_priors['spam'] = math.log(sum(1 for label in Y if label == 1) / n)
        self.log_class_priors['ham'] = math.log(sum(1 for label in Y if label == 0) / n)

        self.word_counts['spam'] = {}
        self.word_counts['ham'] = {}

        # loop over all data points and get the word counts for each e-mail
        for x, y in zip(X, Y):
            c = 'spam' if y == 1 else 'ham'
            counts = self.get_word_counts(self.tokenize(x))
            # loop over every (word, count) item in the counts of this e-mail
            for word, count in counts.items():
                # build the global vocabulary (a set, so no duplicates)
                self.vocab.add(word)
                # accumulate the word's count for this e-mail's class
                if word not in self.word_counts[c]:
                    self.word_counts[c][word] = 0.0
                self.word_counts[c][word] += count
    
        
    def predict(self, X):
        result = []
        # Per-class totals do not depend on the message, so compute them once.
        spam_total = sum(self.word_counts['spam'].values())
        ham_total = sum(self.word_counts['ham'].values())
        vocab_size = len(self.vocab)
        for x in X:
            counts = self.get_word_counts(self.tokenize(x))
            # start each score at the log class prior (added once per message)
            spam_score = self.log_class_priors['spam']
            ham_score = self.log_class_priors['ham']
            for word, _ in counts.items():
                # skip words never seen during training
                if word not in self.vocab:
                    continue
                # Laplace (+1) smoothing so we never take log(0), which is undefined
                # log(P(x=word | spam)), the likelihood term
                log_w_given_spam = math.log(
                    (self.word_counts['spam'].get(word, 0.0) + 1) /
                    (spam_total + vocab_size))

                log_w_given_ham = math.log(
                    (self.word_counts['ham'].get(word, 0.0) + 1) /
                    (ham_total + vocab_size))

                # accumulate inside the loop so every distinct word contributes
                spam_score += log_w_given_spam
                ham_score += log_w_given_ham

            if spam_score > ham_score:
                result.append(1)
            else:
                result.append(0)
        return result

In fit, we're keeping track of the frequency of each word as it appears in either a spam or a ham
message. For example, we expect the word "free" to appear in both kinds of messages, but we expect
it to be more frequent in the "spam" vocabulary than in the "ham" vocabulary.
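
As a small illustration of per-class word frequencies, here is a sketch using `collections.Counter` on four made-up messages. The messages and the helper `class_word_counts` are hypothetical and stand in for the Enron data and the class's own counting logic.

```python
from collections import Counter

# Made-up messages, used only to illustrate per-class counting.
spam_messages = ["free money now", "claim your free prize"]
ham_messages = ["meeting at noon", "feel free to call"]


def class_word_counts(messages):
    """Count how often each word occurs across all messages of one class."""
    counts = Counter()
    for message in messages:
        counts.update(message.lower().split())
    return counts


spam_counts = class_word_counts(spam_messages)
ham_counts = class_word_counts(ham_messages)
print("free in spam:", spam_counts["free"], "| free in ham:", ham_counts["free"])
```

In this toy data "free" shows up in both classes but more often in spam, which is the kind of difference the classifier relies on.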

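posterior = prior x likelihood / evidence => log(posterior) = log(prior) + log(likelihood) - log(evidence)

The evidence term is the same for both classes, so it can be dropped when we only need to compare the spam and ham scores.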
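
One practical reason to work with logarithms, sketched below under assumed numbers: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays finite. The per-word likelihood and the message length are hypothetical.

```python
import math

p = 1e-5        # hypothetical per-word likelihood
n_words = 100   # hypothetical number of words in a message

# Direct product: 1e-500 is far below the smallest positive double,
# so the running product collapses to 0.0.
product = 1.0
for _ in range(n_words):
    product *= p

# Sum of logs: the same quantity on a log scale, which stays finite.
log_sum = n_words * math.log(p)

print(product, log_sum)
```

Once both class scores underflow to zero they can no longer be compared, whereas the two log sums remain distinguishable.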

Now that we've extracted everything we need from the training data, the predict method
outputs the class label for new data. To do this classification, we apply Naive
Bayes directly. For example, given a document, we iterate over each of its words,
compute log p(w_i | spam) and sum them all up, and likewise compute log p(w_i | ham) and sum
them all up. Then we add the log class priors and check which of the two scores is larger.
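
The scoring step can be worked through on a tiny example. This is a standalone sketch, not the class above: the word counts, priors, and the helpers `score` and `classify` are hypothetical stand-ins for what training would collect, using the same add-one (Laplace) smoothing.

```python
import math

# Hypothetical toy statistics standing in for the results of training.
word_counts = {
    "spam": {"free": 3.0, "money": 2.0},
    "ham": {"meeting": 3.0, "free": 1.0},
}
log_priors = {"spam": math.log(0.5), "ham": math.log(0.5)}
vocab = {"free", "money", "meeting"}


def score(words, label):
    """Log prior plus the summed, Laplace-smoothed log likelihoods."""
    total = sum(word_counts[label].values())
    result = log_priors[label]
    for word in words:
        if word not in vocab:
            continue  # ignore words never seen in training
        count = word_counts[label].get(word, 0.0)
        result += math.log((count + 1.0) / (total + len(vocab)))
    return result


def classify(words):
    """Return the label with the larger score."""
    return "spam" if score(words, "spam") > score(words, "ham") else "ham"


print(classify(["free", "money"]), classify(["meeting"]))
```

Here "money" never appears in the toy ham counts, yet smoothing still gives it a small nonzero probability, so the ham score stays finite.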

In [None]:
if __name__ == '__main__':
    X, y = get_data(DATA_DIR)
    print("Number of files:", len(X))
    # initialize the classifier
    MNB = SpamDetector()
    # train on everything except the first 100 e-mails, which are held out for testing
    MNB.fit(X[100:], y[100:])
    pred = MNB.predict(X[:100])
    true = y[:100]
    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("accuracy {0:.4f}".format(accuracy))
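
Accuracy alone hides which kind of mistake the classifier makes. As a possible follow-up, the sketch below tallies the four cases from the flu example (true/false positives and negatives) for a pair of label lists. `confusion_counts` is a hypothetical helper and the example labels are made up; with the script above you would pass `true` and `pred` instead.

```python
def confusion_counts(true_labels, predicted_labels):
    """Tally TP, TN, FP, FN, treating label 1 (spam) as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for t, p in zip(true_labels, predicted_labels):
        if p == 1:
            counts["TP" if t == 1 else "FP"] += 1
        else:
            counts["TN" if t == 0 else "FN"] += 1
    return counts


# Made-up labels for illustration only.
example = confusion_counts([1, 0, 1, 0, 1], [1, 0, 0, 1, 1])
print(example)
```

For a spam filter, a false positive (a real message sent to the spam folder) is usually the more costly error, so it is worth looking at these counts separately.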