![alt text](https://www.msengineering.ch/typo3conf/ext/nm_theme_msengineering/Resources/Public/Template/img/mse_logo.jpg "MSE Logo") 

# AnTeDe Lab 2: Text Classification - Part A

## Session goal
The goal of this session is to implement a Multinomial Naive Bayes classifier from scratch.

## Data collection
We are going to use a small toy dataset. Each document is a single sentence. The training data contains three documents, each from a different class.

In [1]:
import pandas as pd
import nltk

# these 3 lines are here for compatibility purposes
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
#

training_corpus=["The Limmat flows out of the lake.", 
           "The bears are in the bear pit near the river.",
           "The Rhône flows out of Lake Geneva.",
          ]
training_labels=["zurich", 
         "bern",
         "geneva",
        ]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Daniele/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /Users/Daniele/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/Daniele/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


We are also going to need a helper function that can normalize a string.

In [2]:
def normalize(document, keep_punctuation=False,
              keep_stop_words=False, keep_inflected=True, keep_numbers=False):
    import string

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    word_tokens = word_tokenize(document)

    wl = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)

    # Lowercase every token, then drop punctuation, stop words,
    # non-alphanumeric tokens and numbers unless the matching flag keeps them.
    normalized = [w.lower() for w in word_tokens
                  if (w.lower() not in punctuation or keep_punctuation)
                  and (w.lower() not in stop_words or keep_stop_words)
                  and (w.lower().isalnum() or keep_punctuation)
                  and (not w.lower().isdigit() or keep_numbers)]

    # Lemmatize only when inflected forms should not be kept.
    if not keep_inflected:
        normalized = [wl.lemmatize(w) for w in normalized]

    return normalized

How does *keep_inflected* affect the output of __normalize__?





In [3]:
# BEGIN_REMOVE
normalized_training_corpus = [normalize(item, keep_inflected=False) for item in training_corpus]    
inflected_training_corpus = [normalize(item, keep_inflected=True) for item in training_corpus] 

df=pd.DataFrame(columns=['original', 'normalized', 'inflected'])
df['original']=training_corpus
df['normalized']=normalized_training_corpus
df['inflected']=inflected_training_corpus
print (df.to_string())
# keep_inflected=True keeps inflected forms such as 'flows' and 'bears';
# keep_inflected=False lemmatizes them to 'flow' and 'bear'
# END_REMOVE

                                        original                      normalized                        inflected
0              The Limmat flows out of the lake.            [limmat, flow, lake]            [limmat, flows, lake]
1  The bears are in the bear pit near the river.  [bear, bear, pit, near, river]  [bears, bear, pit, near, river]
2            The Rhône flows out of Lake Geneva.     [rhône, flow, lake, geneva]     [rhône, flows, lake, geneva]




Next, we define a __get_vocabulary__ function that returns all the unique tokens that appear in the normalized documents.

In [4]:
def get_vocabulary (data):
    return list(set(sum(data,[])))    
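
As a side note, sum(data, []) builds a new list at every step, so its cost grows quadratically with the number of documents. That does not matter for three sentences, but for a larger corpus a minimal sketch of an equivalent function could rely on set union instead. The name __get_vocabulary_fast__ is hypothetical and not part of the lab; like the original, it returns the tokens in no particular order.

```python
def get_vocabulary_fast(data):
    # Hypothetical alternative to get_vocabulary: take the union of the
    # token sets of all documents instead of concatenating lists.
    return list(set().union(*data))

toy_data = [["limmat", "flow", "lake"], ["bear", "bear", "pit"]]
vocabulary = get_vocabulary_fast(toy_data)
print(sorted(vocabulary))
```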

Print the vocabulary

In [5]:
# BEGIN_REMOVE
print(get_vocabulary(normalized_training_corpus))  
# END_REMOVE

['flow', 'lake', 'geneva', 'bear', 'rhône', 'pit', 'river', 'limmat', 'near']


We define a class __ms_timer__ that helps us time snippets of code. Its definition follows a special syntax that implements what is known as a context manager.

Each code snippet that we wish to time is placed in an indented block following a __with__ statement. Once the indented block has finished, the instance method __get_elapsed_time__ returns the run time of the snippet in milliseconds.

(You can do the same thing effortlessly in an IDE with a profiler, but this is a convenient way to do it in a Jupyter notebook.)

In [6]:
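import time
class ms_timer:
    """Context manager that measures the run time of a with-block."""

    def __enter__(self):
        # perf_counter is a monotonic, high-resolution clock meant for timing
        self.start=time.perf_counter()
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.stop=time.perf_counter()
    def get_elapsed_time(self):
        # elapsed time in milliseconds
        return 1000*(self.stop-self.start)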

Here's an example of how to time code snippets using the context manager trick.

In [7]:
my_data = range(1, 10)

with ms_timer() as timer:
    prod=1
    for item in my_data:
        prod=prod*item
print ("Elapsed time for the loop: "+str(round(timer.get_elapsed_time(), 4))+" ms")   

Elapsed time for the loop: 0.0041 ms
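
For comparison, the same idea can be sketched with the standard library's contextlib.contextmanager decorator, which turns a generator function into a context manager. The name __ms_timer_sketch__ and the dictionary used to hand back the result are choices made for this illustration, not part of the lab.

```python
import time
from contextlib import contextmanager

@contextmanager
def ms_timer_sketch():
    # Hypothetical generator-based variant of ms_timer.
    # The code before `yield` runs on entering the with-block,
    # the code in `finally` runs on leaving it.
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["ms"] = 1000 * (time.perf_counter() - start)

with ms_timer_sketch() as elapsed:
    total = sum(range(1000))

print("Elapsed time: %.4f ms" % elapsed["ms"])
```

The class-based version used in this lab keeps the start and stop times as attributes, which makes it easier to add further methods later.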


## MNB from scratch

We are now ready to implement our MNB from scratch. Our implementation is contained in a class called __naive_bayes__. We can define our class across multiple cells simply by defining a derived class with exactly the same name in the following cells.
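
The trick can be seen in isolation with a toy class (the name __greeter__ is made up for this illustration). The second definition inherits from the first and then takes over its name, so instances created afterwards have the methods of both.

```python
class greeter:
    def hello(self):
        return "hello"

# Same name: the new class inherits from the old one and replaces it
class greeter(greeter):
    def goodbye(self):
        return "goodbye"

g = greeter()
print(g.hello(), g.goodbye())
```

Note that re-running such a cell adds one more level to the inheritance chain each time, which is harmless here.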

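First we compute what this notebook calls the posterior probabilities: the probability of each token given a class, P(token | class), estimated with add-one (Laplace) smoothing.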

In [8]:
class naive_bayes:


    @staticmethod
    def get_posterior_probabilities (training_data, verbose=False):

        import logging, re

        posterior = {}

        vocabulary = get_vocabulary(training_data['documents'])
        lw = len(vocabulary)

        # take the classes from the data frame passed in, not from a global variable
        classes = list(set(training_data['labels']))

        for c in classes:

            # concatenate the tokens of all the documents tagged with class c
            tokens = sum(training_data['documents']\
                                         [training_data['labels']==c], [])

            den = len(tokens)

            for w in vocabulary:

                num = tokens.count(w)
                # add-one smoothing: no (token, class) pair gets zero probability
                posterior[(w,c)]=(1+num)/(den+lw)

                if verbose:

                    message = 'Token '+w+' appears '+str(num)+' times in class '+c
                    # \b keeps counts such as 11 or 21 from being rewritten
                    message=re.sub(r'\b1 times', 'once', message)
                    logging.warning (message)

                    message = 'There are '+str(den)+' tokens in class '+c
                    message=re.sub('are 1 tokens', 'is 1 token', message)
                    logging.warning (message)

                    logging.warning (tokens)
                    logging.warning ('Vocab size: '+str(lw))
                    # the unsmoothed estimate is undefined for a class without tokens
                    if den > 0:
                        logging.warning ('Vanilla posterior: '+str(round(num/den, 2)))
                    logging.warning ('Posterior: '+str(round(posterior[(w,c)], 2)))

        return posterior
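
In formula form, the estimate computed in the inner loop is the add-one (Laplace) smoothed relative frequency of a token within a class. The symbols below are our own notation rather than the notebook's: count(w, c) corresponds to num, N_c to den, and |V| to lw.

```latex
\hat{P}(w \mid c) = \frac{1 + \mathrm{count}(w, c)}{N_c + |V|}
```

Without smoothing the estimate would be count(w, c) / N_c, which is the value that the verbose output reports as the "Vanilla posterior".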


What is the posterior probability of finding 'limmat' given that the document is tagged as 'zurich'? Complete the following code snippet to find out. Use *verbose* to see what's going on under the hood.

In [9]:
# The method get_posterior_probabilities expects the training data in the form of a data frame
training_data = pd.DataFrame(columns=['documents', 'labels'])
training_data['documents']=normalized_training_corpus
training_data['labels']=training_labels

# BEGIN_REMOVE
posterior=naive_bayes.get_posterior_probabilities(training_data, verbose=True)
print ("%.2f"%posterior['limmat', 'zurich'])
# END_REMOVE



0.17
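
The value 0.17 can be checked by hand. The sketch below hard-codes the normalized training documents from the table printed earlier and applies the same add-one formula; the helper name __smoothed_probability__ is ours and is not part of the lab's class.

```python
from collections import Counter

# Normalized training documents, copied from the table printed earlier
docs = {
    "zurich": ["limmat", "flow", "lake"],
    "bern": ["bear", "bear", "pit", "near", "river"],
    "geneva": ["rhône", "flow", "lake", "geneva"],
}
vocabulary = {token for tokens in docs.values() for token in tokens}

def smoothed_probability(token, label):
    # (1 + occurrences of the token in the class)
    # / (tokens in the class + vocabulary size)
    counts = Counter(docs[label])
    return (1 + counts[token]) / (len(docs[label]) + len(vocabulary))

# (1 + 1) / (3 + 9) = 2/12
print(round(smoothed_probability("limmat", "zurich"), 2))
```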


Complete the following code so we can train the classifier.

In [10]:
class naive_bayes(naive_bayes):

    def train(self, training_data, timing=False):

        import logging

        # one entry per class (not per document), built with the same
        # expression that classify_document uses, so that P_c[index] lines up
        classes = list(set(training_data['labels']))

        # BEGIN_REMOVE
        with ms_timer() as timer:

            # prior of a class = fraction of training documents tagged with it
            P_c = \
            [(training_data['labels']==tagged_class).sum()/len(training_data) \
             for tagged_class in classes]
        if timing:
            logging.warning('Prior probabilities computed in '+"%.2f"%timer.get_elapsed_time()+" ms")

        with ms_timer() as timer:

            posterior_p=self.get_posterior_probabilities(training_data)
        if timing:
            logging.warning('Posterior probabilities computed in '+"%.2f"%timer.get_elapsed_time()+" ms")
        # END_REMOVE

        return P_c, posterior_p

Now we get to train the classifier. 

Print out the prior probabilities and the posterior probabilities and answer the following questions:

a) What is the lowest posterior probability that you observe and why?

b) What is the highest posterior probability that you observe and why?

c) Why are the prior probabilities all 1/3?

In [11]:
nb=naive_bayes()
P_c, posterior_p=nb.train(training_data, timing=True) 

# BEGIN_REMOVE
print ('Prior probabilities:')
print ([str(round(x, 2)) for x in P_c])

print ('Posterior probabilities:')
df=pd.DataFrame()
df['(token, class)']=[x for x in posterior_p.keys()]
df['post_p']=list(map(lambda x:round(x, 2), posterior_p.values()))
print (df.to_string())
# END_REMOVE



Prior probabilities:
['0.33', '0.33', '0.33']
Posterior probabilities:
      (token, class)  post_p
0       (flow, bern)    0.07
1       (lake, bern)    0.07
2     (geneva, bern)    0.07
3       (bear, bern)    0.21
4      (rhône, bern)    0.07
5        (pit, bern)    0.14
6      (river, bern)    0.14
7     (limmat, bern)    0.07
8       (near, bern)    0.14
9     (flow, zurich)    0.17
10    (lake, zurich)    0.17
11  (geneva, zurich)    0.08
12    (bear, zurich)    0.08
13   (rhône, zurich)    0.08
14     (pit, zurich)    0.08
15   (river, zurich)    0.08
16  (limmat, zurich)    0.17
17    (near, zurich)    0.08
18    (flow, geneva)    0.15
19    (lake, geneva)    0.15
20  (geneva, geneva)    0.15
21    (bear, geneva)    0.08
22   (rhône, geneva)    0.15
23     (pit, geneva)    0.08
24   (river, geneva)    0.08
25  (limmat, geneva)    0.08
26    (near, geneva)    0.08
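
As for the priors, each one is the fraction of training documents tagged with that class. A minimal sketch with the labels hard-coded, returning a dictionary keyed by class rather than the list that __train__ returns (a choice made for this illustration only):

```python
from collections import Counter

labels = ["zurich", "bern", "geneva"]

# prior of a class = documents tagged with that class / all documents
label_counts = Counter(labels)
priors = {label: count / len(labels) for label, count in label_counts.items()}

print({label: round(p, 2) for label, p in priors.items()})
```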


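Finally, we classify a test document: for each class, we add the logarithm of the prior to the logarithms of the posterior probabilities of the tokens in the document, and we pick the class with the highest score.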

In [12]:
class naive_bayes(naive_bayes):


    # logarithmic is currently unused: scores are always computed in log space
    def classify_document (self, training_data, test_document, logarithmic=True):

        import logging
        import math

        classes = list(set(training_data['labels']))

        P_c, posterior_p=self.train(training_data)

        NB=dict()

        normalized_test_document = normalize(test_document, keep_inflected=False)

        for index, c in enumerate(classes):

            posterior_logsum=0

            for token in normalized_test_document:

                try:
                    posterior_logsum=posterior_logsum+math.log(posterior_p[token, c], 10)
                except KeyError:
                    # tokens that never appeared in the training data are skipped
                    pass

            # the sum is still 0 if none of the tokens was found in the vocabulary
            if posterior_logsum==0:
                logging.error ('Classification failure: insufficient info')

            NB[c]=round(posterior_logsum+math.log(P_c[index], 10), 2)

        # return the class with the highest score, together with all the scores
        return max(NB, key=NB.get), NB
            



Test your classifier with the test document *The name of the city comes from the word 'bear'.* What goes wrong? Can you fix it?

In [13]:
# BEGIN_REMOVE

import logging
nb=naive_bayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

print (test_corpus)

with ms_timer() as timer:
    result = nb.classify_document(training_data, test_corpus)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print (result)    

# END_REMOVE

The name of the city comes from the word 'bear'


ERROR:root:Classification failure: insufficient info
ERROR:root:Classification failure: insufficient info
ERROR:root:Classification failure: insufficient info


('bern', {'bern': -0.48, 'zurich': -0.48, 'geneva': -0.48})
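
One way to see what goes wrong, assuming the tokenizer leaves the opening quotation mark attached to the word (which the failure above suggests, but which we have not checked against every NLTK version): a token such as 'bear is not alphanumeric, so the isalnum filter in __normalize__ drops it before it can be matched against the vocabulary.

```python
# Hypothetical tokens: we assume the tokenizer yields "'bear" rather than "bear"
tokens = ["'bear", "bear"]

# str.isalnum() is the same check that normalize applies to each token
for token in tokens:
    print(repr(token), token.isalnum())
```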


In [14]:
# BEGIN_REMOVE

import logging
nb=naive_bayes()
test_corpus = "The name of the city comes from the word 'bear'"
test_labels = "bern"

test_corpus=test_corpus.replace('\'', '')

print (test_corpus)

with ms_timer() as timer:
    result = nb.classify_document(training_data, test_corpus)
    
logging.warning('Classification completed in '+"%.2f"%timer.get_elapsed_time()+" ms")    
print (result)    

# END_REMOVE



The name of the city comes from the word bear
('bern', {'bern': -1.15, 'zurich': -1.56, 'geneva': -1.59})
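
These scores can be checked by hand. After normalization the only token left is bear, so each score is log10 of the posterior of bear in that class plus log10 of the prior. The sketch below hard-codes the unrounded values behind the table of posterior probabilities: 3/14, 1/12 and 1/13 follow from the token counts of the three training documents (5, 3 and 4) and the vocabulary size of 9.

```python
import math

prior = 1 / 3
# (1 + count of 'bear' in the class) / (tokens in the class + vocabulary size)
posterior_bear = {"bern": 3 / 14, "zurich": 1 / 12, "geneva": 1 / 13}

scores = {label: round(math.log10(p) + math.log10(prior), 2)
          for label, p in posterior_bear.items()}
print(max(scores, key=scores.get), scores)
```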


Can you explain the performance of your classifier on the following test corpus?

In [15]:
test_corpus = ['We saw the bears there.', 
               'We crossed the Rhône.', 
               'There is no lake.',
              ]
test_labels = ['bern',
               'geneva',
               'bern',
              ]

nb=naive_bayes() 



for item, label in zip(test_corpus, test_labels):
    print ('\n Classifying: '+item)
    with ms_timer() as timer:
        result = nb.classify_document(training_data, item)
    logging.warning('Classification of "'+item+'" completed in '+"%.2f"%timer.get_elapsed_time()+" ms")
    print (result)
    print ('correct label: '+label)




 Classifying: We saw the bears there.
('bern', {'bern': -1.15, 'zurich': -1.56, 'geneva': -1.59})
correct label: bern

 Classifying: We crossed the Rhône.
('geneva', {'bern': -1.62, 'zurich': -1.56, 'geneva': -1.29})
correct label: geneva

 Classifying: There is no lake.
('zurich', {'bern': -1.62, 'zurich': -1.26, 'geneva': -1.29})
correct label: bern


Now test your classifier with the one-sentence document "The federal capital is pretty." What happens?

In [16]:
# BEGIN_REMOVE
test_corpus = "The federal capital is pretty."
test_labels = "bern"

# Your classifier fails because none of the tokens in the test document appears
# in the training vocabulary: every token is skipped and only the (equal) priors remain.
print (nb.classify_document(training_data, test_corpus))
# END_REMOVE

ERROR:root:Classification failure: insufficient info
ERROR:root:Classification failure: insufficient info
ERROR:root:Classification failure: insufficient info


('bern', {'bern': -0.48, 'zurich': -0.48, 'geneva': -0.48})
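
A possible way to make this case explicit, sketched under our own assumptions: check which tokens of the normalized test document are in the training vocabulary before scoring, and abstain when none are. The helper name __known_tokens__ is hypothetical, and 'federal', 'capital' and 'pretty' stand for the tokens that we assume survive normalization.

```python
vocabulary = {"flow", "lake", "geneva", "bear", "rhône",
              "pit", "river", "limmat", "near"}

def known_tokens(tokens, vocabulary):
    # keep only the tokens that were seen during training
    return [token for token in tokens if token in vocabulary]

test_tokens = ["federal", "capital", "pretty"]  # assumed output of normalize
if not known_tokens(test_tokens, vocabulary):
    print("No known tokens: the scores reduce to the priors, so abstain")
```

With equal priors, as here, the three scores are identical and the returned label merely reflects the order in which the classes happen to be stored.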
