This exercise will help us get started with tools (pandas, sklearn, etc.) from the Python ecosystem.

**Problem statement:** classify SMS messages as *HAM* or *SPAM* using **naive Bayes** in a supervised setting.
See this link for an overview of the supervised learning workflow: [supervised learning workflow](http://www.allprogrammingtutorials.com/tutorials/introduction-to-machine-learning.php)

**Dataset:** We will use [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) from UCI machine learning repository.

This notebook borrows its text processing ideas from
https://radimrehurek.com/data_science_python/

Running this notebook with an Anaconda installation requires installing the TextBlob library for text processing.

Run the following command to install TextBlob **from the command prompt under Anaconda**:

**conda install -c conda-forge textblob**

In [None]:
# Required for inline plots
%matplotlib inline
import requests
import pprint # for pretty printing
import os # listing and managing file paths
import zipfile # for zip and unzip utilities
import pandas # for data analysis
import csv
import matplotlib.pyplot as plt # for plotting
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer # for converting documents into word counts


In [None]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
r = requests.get(data_url)
r.raise_for_status() # stop here with an error if the download failed

Let's download and save the zip file

In [None]:
sms_zip_file = 'smsspamcollection.zip'
#http = urllib3.PoolManager()
with open(sms_zip_file, 'wb') as out_file:
    out_file.write(r.content)

# Let's verify it. Note that in a notebook you can also run shell commands using !
**Make sure the output of the following command contains the smsspamcollection.zip file**

In [None]:
# Verify the download by listing the files next to this notebook
dir_listing = os.listdir('.') # list content of current directory
print(dir_listing)

# Q1: Can you complete the following code to check that sms_zip_file is present in the output above?

In [None]:
assert sms_zip_file ? , "directory doesn't contain {}".format(sms_zip_file) # hint: look at the `in` operator

In [None]:
with zipfile.ZipFile(sms_zip_file,"r") as zip_ref:
    zip_ref.extractall("data")

# Let's list the content of the new data folder

In [None]:
print(os.listdir('./data'))

The SMSSpamCollection file contains around 5k SMS messages. Check out the readme file for details.

**Let's open this file and store its lines in a Python list**

In [None]:
with open('./data/SMSSpamCollection', 'r', encoding='utf-8') as f:  # Python 3: read the file as UTF-8 text
    sms_messages = f.readlines()

In [None]:
print(sms_messages[0:10]) # printing 10 messages

This is a tab-separated (\t) file in which every line ends with a newline character. Each line holds the label (**ham** or **spam**) and the actual message, separated by a tab. Let's remove the trailing newline.
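
As a small illustration of this layout, a single line can be split into its label and its message at the first tab. The sample line below is made up for the example and is not taken from the data set.

```python
# A hypothetical line in the same "label<TAB>message<NEWLINE>" layout
sample_line = "ham\tSee you at five\n"

# rstrip("\n") drops the trailing newline; split("\t", 1) splits at the first tab only
label, text = sample_line.rstrip("\n").split("\t", 1)

print(label)  # ham
print(text)   # See you at five
```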

In [None]:
# The following code shows how to write a list comprehension. We could have done this with a for loop too.
# [<some_func>(x) for x in <something> if <some_condition_is_true>]
sms_messages = [m.rstrip() for m in sms_messages] # we are not using the optional if-condition part
print('Number of sms messages is {}'.format(len(sms_messages)))
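
The optional `if` part of a list comprehension filters the elements. A minimal sketch on made-up lines (not from the data set) that keeps only the non-empty lines after stripping:

```python
# Hypothetical raw lines, including one blank line
raw_lines = ["ham\thello there\n", "\n", "spam\tWIN a prize\n"]

# line.strip() is an empty string (falsy) for blank lines, so they are skipped
cleaned = [line.rstrip() for line in raw_lines if line.strip()]

print(cleaned)
```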

# Let's check a couple of messages again

In [None]:
for idx, msg in enumerate(sms_messages[0:20]): # see how we can slice list using : operator
    print('message id {}  {}'.format(idx, msg))

**This is our training data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N=5574$.** Using it, we will train a model (i.e. learn the parameters $\theta$ of a model such as naive Bayes or discriminant analysis) and use the trained model to classify new messages as ham or spam.

The first step, before jumping into any machine learning model, is understanding the data by **describing its statistical attributes and visualizing samples or sample properties**.
We could use a CSV file reader to accomplish this task. But, as they say, Python is a language with **batteries (libraries) included**. Let's use the **pandas and matplotlib** libraries to do this task as cleanly as possible. Knowing what to describe and what to plot is an essential skill that we will build as we do various data science and machine learning tasks. Over time you will also build up a repertoire of packages available for different domains in the Python ecosystem. Most of the time, reading blogs and searching the web does the job of finding the right libraries. Packages for download and installation are available at [PyPI - the Python Package Index](https://pypi.python.org/pypi)

**Optional**
See [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)

If you have more time, look into this link: [Pandas Tutorial: DataFrames in Python](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.dEdNuDM)

In [None]:
# You will see how loading the file with pandas simplifies a lot of tasks
messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
print(messages)

# Let's try to understand various attributes of the data

*How many messages are in each group, etc.*

In [None]:
messages.groupby('label').describe()

*How long is each message?* See how we can attach a new column (a pandas Series) to the pandas object.

It uses a lambda (anonymous function), and `map` tells pandas what to do with each entry in the **message** column.
One can also write a regular Python function and pass it there, like

    def get_length(msg):
        return len(msg)

    messages['length'] = messages['message'].map(get_length)

But we will take the shorter lambda route

In [None]:
messages['length'] = messages['message'].map(lambda text: len(text))

In [None]:
messages.head()

Now we have a length attribute.
**To see the whole picture, let's plot the length distribution**

In [None]:
messages.length.plot(bins=20, kind='hist')

It looks like there are plenty of messages of length up to 150, but very few messages are very long (>400).

Let's summarize this distribution (histogram) of lengths

In [None]:
messages.length.describe()

In fact, there is a message of length 910. What is this message?

In [None]:
print('Longest message is {}'.format(list(messages.message[messages.length > 900])))

Is there any difference in message length between spam and ham?

In [None]:
messages.hist(column='length', by='label', bins=50)

**It looks like** spam messages are longer on average.

**Q2:** Can you write code to summarize the above per-class distribution of message length (to be more precise about this observation)? I.e., can you group the messages by label and describe their length?

In [None]:
# Write your answer here

**Computers only understand scalars, vectors, or matrices**. We need to convert text to vectors (features).


# Feature engineering

We'll use the [Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) approach to create features
representing our SMS messages.

Bag of words is a feature generation idea in which only the counts of words matter, not their order. Later we will see
that there are better models for sentence or document representation in which word order matters, such as [N-gram](https://en.wikipedia.org/wiki/N-gram) models.
In fact, deep learning has enabled us to learn better embeddings of words using the context of words (co-occurrence).
We will try to use them in the **deep learning section** [**optional**: see [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)]
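
To make the point about word order concrete, here is a minimal sketch in plain Python. The two sentences and the helper names `bag_of_words` and `bigrams` are made up for this illustration; they are not part of the notebook's pipeline.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count each lowercased, whitespace-separated word; order is discarded."""
    return Counter(sentence.lower().split())

def bigrams(sentence):
    """Return the list of adjacent word pairs, which keeps local word order."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))

a = "dog bites man"
b = "man bites dog"

print(bag_of_words(a) == bag_of_words(b))  # True: same words, same counts
print(bigrams(a) == bigrams(b))            # False: the word pairs differ
```

The two sentences mean different things, yet the bag-of-words representation cannot tell them apart, while the bigram (2-gram) representation can.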

Converting text to vectors is a bit involved and requires a good understanding of NLP (natural language processing).

But as we can imagine, to convert a message into a vector we need to
1. Convert a sentence into word tokens
2. Normalize the words, i.e. decide whether we care about (whether they carry information) capitalization (Cow vs. cow) and inflected forms ("goes" vs. "go")
3. Build a dictionary of words and map the messages into vectors using this dictionary
4. Finally train a model
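
Steps 1 to 3 above can be sketched in plain Python. This is only an illustration under simplifying assumptions: whitespace splitting and lowercasing stand in for proper tokenization and normalization, and the toy messages and the helper names `tokenize` and `to_count_vector` are hypothetical.

```python
def tokenize(message):
    """Steps 1 and 2 (simplified): lowercase and split on whitespace."""
    return message.lower().split()

toy_messages = ["Free prize now", "Call me now", "free FREE call"]

# Step 3a: build a dictionary mapping each unique word to a vector index
vocabulary = {}
for message in toy_messages:
    for word in tokenize(message):
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

def to_count_vector(message):
    """Step 3b: map a message to a vector of word counts."""
    vector = [0] * len(vocabulary)
    for word in tokenize(message):
        if word in vocabulary:  # words outside the dictionary are ignored
            vector[vocabulary[word]] += 1
    return vector

print(vocabulary)
print(to_count_vector("free FREE call"))
```

Every message ends up as a vector of the same length (the dictionary size), no matter how many words it contains.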

**Again we will use a Python library, [TextBlob](http://textblob.readthedocs.io/en/dev/quickstart.html), to do the heavy lifting for us.**

Write a function that will split a message into its individual words

In [None]:
def split_into_tokens(message):
    # In Python 3, str is already unicode, so no explicit decoding is needed
    return TextBlob(message).words

Here are some of the original texts again:

In [None]:
messages.message.head()

The same messages, tokenized:

In [None]:
messages.message.head().apply(split_into_tokens)

With TextBlob, we can normalize words into their base form ([lemmas](https://en.wikipedia.org/wiki/Lemmatisation)) with

In [None]:
def split_into_lemmas(message):
    message = message.lower()  # Python 3 str is already unicode
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

# see how head portion changes
messages.message.head().apply(split_into_lemmas)

We will use CountVectorizer from **sklearn** to convert each message into a **count vector**. Stacking these vectors gives a matrix in which each row represents one example (the counts of the various words in that message).
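
Before applying it to the SMS data, here is a minimal sketch of `CountVectorizer` on three made-up messages. It assumes the default analyzer (lowercasing, tokens of two or more word characters) rather than a custom lemmatizing function, so that it runs without TextBlob.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_messages = ["free prize now", "call me now", "free free call"]

# fit() builds the dictionary: vocabulary_ maps each word to a column index
vectorizer = CountVectorizer().fit(toy_messages)
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# transform() returns a sparse matrix with one row per message
counts = vectorizer.transform(toy_messages)

print(columns)
print(counts.toarray())
```

Each row is the count vector of one message; the third row contains a 2 in the "free" column because that word occurs twice in the third message.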

**Let's create the transformation class first**

In [None]:
# You may have to uncomment the following two lines once if running this cell produces an error
#import nltk
#nltk.download()
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print('Number of unique words in our dictionary is {}'.format(len(bow_transformer.vocabulary_)))

Using this transformer we can convert any message into its word-count representation.

**Let's see how message 4 gets transformed**

In [None]:
message4 = messages['message'][3]
print(message4)

In [None]:
# Let's convert it and check the converted message (bag-of-words representation) and its shape
bow4 = bow_transformer.transform([message4])
print(bow4)
print(bow4.shape)

The count-vector representation of message 4 has length 8859 (the size of our vocabulary), and we are only printing the indices where the count is not zero.
So, there are nine unique words in message 4; two of them appear twice, the rest only once. Sanity check: what are the words that appear twice in this message?
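
To see why only a few entries get printed, here is a minimal sketch with a made-up count vector (not message 4). It assumes SciPy is importable, which is the case wherever sklearn is installed.

```python
from scipy.sparse import csr_matrix

# A hypothetical 1 x 10 count vector with three non-zero counts
dense_row = [[0, 2, 0, 0, 1, 0, 0, 0, 1, 0]]
sparse_row = csr_matrix(dense_row)

print(sparse_row.shape)  # (1, 10)
print(sparse_row.nnz)    # 3 stored non-zero entries

# Walk over the stored entries as (row, column, count) triples
coo = sparse_row.tocoo()
entries = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
print(entries)

# Column indices whose count is 2, i.e. the words that appear twice
twice = [c for _, c, v in entries if v == 2]
print(twice)
```

A sparse matrix stores only the non-zero counts together with their positions, which is what makes a vocabulary-sized vector per message affordable.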


In [None]:
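print(message4)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = bow_transformer.get_feature_names_out()
print(feature_names[6726])
print(feature_names[8002])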

Let's convert the whole SMS corpus

In [None]:
messages_bow = bow_transformer.transform(messages['message'])
print('sparse matrix shape:', messages_bow.shape)
print('number of non-zeros:', messages_bow.nnz)
# percentage of entries that are non-zero
print('sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1])))

# Training the model, or estimating the parameters $\theta$ of the model
Now we have a vector feature representation $x_i$ of our SMS samples.

Let's review some theory and see what parameters we need to estimate for the naive Bayes model.

We classify an SMS $x_i$ into the class $c$ = HAM or $c$ = SPAM that has the maximum value of $P(c|x_i)$. Using Bayes' rule we have $P(c|x_i) = \frac{P(x_i|c) P(c)}{P(x_i)} \propto P(x_i|c) P(c)$, as the normalization $P(x_i)$ doesn't depend on the class label.
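
A small numeric sketch of why the normalization can be dropped when we only want the most probable class. The two scores below are hypothetical values of $P(x_i|c) P(c)$, chosen only for illustration.

```python
# Hypothetical values of P(x|c) * P(c) for one message
unnormalized = {"ham": 0.0006, "spam": 0.0024}

# P(x) is the sum over classes, so dividing by it yields proper posteriors
evidence = sum(unnormalized.values())
posterior = {c: score / evidence for c, score in unnormalized.items()}

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

print(posterior)
print(best_unnormalized, best_posterior)
```

Dividing both scores by the same positive number changes their values but not their order, so both versions pick the same class.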

Under the naive Bayes assumption for modelling the class-conditional densities we have $P(x_i|c) = \prod_{j=1}^D P(x_{ij}|c)$, assuming $x_i \in \mathbb{R}^D$.

**Note: $D$ is the size of our vocabulary ($|V|$) built from the SMS document corpus, i.e. $D = |V|$**

**What probability distribution should we choose for $P(x_{i}|c)$?**

Each value $x_{ij}$ is an integer (a word count), and there are $D$ different unique words in total. This definitely suits a $D$-sided die situation.

**In fact, ham or spam document generation in the bag-of-words model is nothing but rolling this die repeatedly: at each roll, pick the word dictated by the side the die lands on.**
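
The die-rolling picture can be sketched with the standard library. The five-word vocabulary and the per-class word probabilities below are made up for illustration; in the exercise they would be estimated from the data.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical vocabulary and per-class word probabilities (each list sums to 1)
vocabulary = ["free", "prize", "call", "hello", "tonight"]
word_probs = {
    "spam": [0.4, 0.3, 0.2, 0.05, 0.05],
    "ham": [0.05, 0.05, 0.2, 0.4, 0.3],
}

def generate_document(label, n_words):
    """Roll the class-specific D-sided die n_words times."""
    return random.choices(vocabulary, weights=word_probs[label], k=n_words)

document = generate_document("spam", 8)
counts = Counter(document)
count_vector = [counts[w] for w in vocabulary]  # the bag-of-words vector x_i

print(document)
print(count_vector)
```

The order of the generated words is thrown away when the count vector is built, which is exactly the bag-of-words assumption.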

Now we know that we can use the multinomial distribution for such a situation. Hence

$P(x_i|c) = \frac{n_i!}{\prod_{j=1}^D x_{ij}!} \prod_{j=1}^{D} P(w_j|c)^{x_{ij}} \propto \prod_{j=1}^{D} P(w_j|c)^{x_{ij}}$, where $n_i = \sum_{j=1}^D x_{ij}$ is the number of words in message $i$, as the normalization doesn't depend on the class label.
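
A minimal numeric check that the multinomial coefficient cancels when two classes are compared. The count vector and both probability vectors are hypothetical, and the helper names are made up for this sketch (`math.prod` needs Python 3.8 or later).

```python
from math import factorial, prod

def multinomial_coefficient(counts):
    """n! / (x_1! * ... * x_D!) with n = sum of counts."""
    return factorial(sum(counts)) // prod(factorial(x) for x in counts)

def multinomial_likelihood(counts, word_probs):
    """Full multinomial probability of the count vector."""
    kernel = prod(p ** x for p, x in zip(word_probs, counts))
    return multinomial_coefficient(counts) * kernel

x = [2, 1, 0]  # hypothetical count vector over D = 3 words
p_spam = [0.5, 0.3, 0.2]
p_ham = [0.2, 0.3, 0.5]

print(multinomial_coefficient(x))  # 3

# Ratio of the full likelihoods vs. ratio of the products without the coefficient
full_ratio = multinomial_likelihood(x, p_spam) / multinomial_likelihood(x, p_ham)
kernel_ratio = (0.5 ** 2 * 0.3) / (0.2 ** 2 * 0.3)
print(full_ratio, kernel_ratio)
```

The coefficient depends only on the counts in the message, not on the class, so it multiplies both likelihoods by the same factor.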

Using the maximum likelihood estimate (MLE) we have $P(w_j|c) = \frac{\sum_{i=1}^N x_{ij}\mathbb{1}(y_i=c)}{\sum_{k=1}^{D} \sum_{i=1}^N x_{ik}\mathbb{1}(y_i=c)}$, where $\mathbb{1}$ is the indicator function.

This is nothing but the relative frequency of $w_j$ in documents of class $c$ = SPAM or $c$ = HAM
with respect to the total number of words in documents of that class.
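
As a worked illustration with made-up numbers: suppose the vocabulary has $D=2$ words, $w_1$ = "free" and $w_2$ = "hello", and the spam class contains two messages with count vectors $(2, 0)$ and $(1, 1)$. Then

$$P(w_1|\text{spam}) = \frac{2 + 1}{(2 + 1) + (0 + 1)} = \frac{3}{4}, \qquad P(w_2|\text{spam}) = \frac{0 + 1}{(2 + 1) + (0 + 1)} = \frac{1}{4}.$$

The two estimates sum to one, as they must for the probabilities of the sides of a die.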

The prior class probabilities are estimated as $P(c) = \frac{N_c}{N}$, where $N_c$ is the number of documents in class $c$.


# Q3. Finish the following function

In [None]:
def estimate_class_probability_of_words(messages_bow, messages_label, class_label):
    '''This function estimates the parameters of the multinomial distribution in the BOW model for class_label
    args:
    messages_bow = BOW encoded messages
    messages_label = labels (ham or spam) of the messages. (Note: this is a supervised setting, so we need class labels)
    class_label = 'ham' or 'spam'
    returns:
    the list of estimated parameters, of size D
    '''
    # Write your code here. You can write code in multiple cells for different parts of the function and finally
    # merge them into one cell as shown in the class.
    


# Q4. Write a function, with full signature, for estimating the prior class probabilities.
# Your function should return a Python list of size 2. The first entry should be the estimate for ham and the second entry for spam


In [None]:
# Write your function here

# Q5. Using the above functions, predict the class of the training messages in messages_bow. Return a list predicting 'ham' or 'spam' for the messages.
# Use the classification_report, f1_score, accuracy_score, and confusion_matrix functions from sklearn.metrics to show your results.

See this blog, https://radimrehurek.com/data_science_python/, for how to use these functions (cells 27, 28 and 29).
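
For quick orientation, here is a minimal sketch of these four functions on made-up labels. The `y_true` and `y_pred` lists are hypothetical and unrelated to the SMS data; in Q5 they would be the true labels and your predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

# Hypothetical true and predicted labels for six messages
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

print(accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes, in the order given by labels
print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))

# With string labels, f1_score needs to know which class counts as positive
print(f1_score(y_true, y_pred, pos_label="spam"))

print(classification_report(y_true, y_pred))
```

Passing `labels` to `confusion_matrix` fixes the row and column order, which makes the matrix easier to read when the classes are strings.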

# Q6. Use the MultinomialNB class from sklearn on the BOW model and use the same metrics as in Q5. Look at cells 24 to 28 in the blog notebook.