### NLP Classification Project

##### Building a spam filter with the kernel framework

Messages are classed as either undesired "spam" or legitimate "ham". This classification task applies variants of edit distance within a kernel framework; the LibSVM algorithm is then used to classify each message as "ham" or "spam". The data come from the <a href = "https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection">SMS Spam Collection Data Set</a> in the UCI Machine Learning Repository. The workflow is:

* Clean the data
* Carry out data exploration
* Split, tokenize, remove punctuation
* Extract features
* Construct kernels
* Run classification 
* Compare and analyse
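
The "Construct kernels" step is not shown in this part of the notebook, so the following is only a minimal sketch of how an edit-distance kernel might look. The names `levenshtein`, `edit_kernel` and `gram_matrix`, the exponential form and the `gamma` parameter are illustrative assumptions, not the project's actual implementation.

```python
import math

def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences (strings or token lists)."""
    # prev[j] is the distance between the first i-1 items of a and the first j items of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def edit_kernel(a, b, gamma=0.1):
    """Hypothetical similarity: exp(-gamma * edit distance). gamma is an assumed parameter."""
    return math.exp(-gamma * levenshtein(a, b))

def gram_matrix(items, gamma=0.1):
    """Pairwise similarity matrix that could be passed to an SVM as a precomputed kernel."""
    return [[edit_kernel(a, b, gamma) for b in items] for a in items]
```

A matrix built this way is symmetric with ones on the diagonal, but it is not guaranteed to be positive semidefinite, so it should be treated as an approximation to a true kernel. LibSVM accepts precomputed kernel matrices, which is one way such a similarity could be plugged in.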

In [63]:
# Import required libraries
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import re
import string
%matplotlib inline
pd.set_option('display.max_colwidth', 100)

In [1]:
pwd

'/Users/osita/Documents/GitHub/NLP-classification'

In [32]:
pd.read_csv?

### Explore the Data

In [88]:

data2 = pd.read_csv("/Users/osita/Documents/GitHub/NLP-classification/data/SMSSpamCollection", sep='\t', header=None,
                   names=['label','Text'])

In [89]:
data2.head()

Unnamed: 0,label,Text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [90]:
print('The data is made up of {} examples'.format(data2.shape[0]))

The data is made up of 5572 examples


In [91]:
len_ham = (data2['label'] == 'ham').sum()  # count rows labelled ham
print('{} examples classed as ham'.format(len_ham))

4825 examples classed as ham


In [92]:
len_spam = (data2['label'] == 'spam').sum()  # count rows labelled spam
print('{} examples classed as spam'.format(len_spam))

747 examples classed as spam
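
The two counts above show that the classes are imbalanced. As a small worked calculation (using only the counts printed above), the spam share and the accuracy of a trivial classifier that always predicts "ham" can be computed as follows. That accuracy is a useful baseline when the classifiers are compared later.

```python
n_ham, n_spam = 4825, 747          # counts printed in the cells above
n_total = n_ham + n_spam           # 5572, matching data2.shape[0]

spam_share = n_spam / n_total
majority_baseline = max(n_ham, n_spam) / n_total  # always predict the larger class

print(f"spam share: {spam_share:.3f}")                                 # 0.134
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")    # 0.866
```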


In [94]:
# View an example of the text
data2['Text'][2]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

### To do
* Add the plots
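
One plot that could fill this gap is a bar chart of the class distribution. The sketch below hard-codes the counts reported earlier so that it runs on its own; the figure size, labels and title are arbitrary choices, not taken from the original project.

```python
import matplotlib.pyplot as plt

# Class counts as printed in the exploration cells above
counts = {"ham": 4825, "spam": 747}

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.bar(list(counts.keys()), list(counts.values()))
ax.set_xlabel("label")
ax.set_ylabel("number of messages")
ax.set_title("Class distribution")
```

In the notebook itself, `data2['label'].value_counts()` would give the same counts directly from the dataframe.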

### Helper Functions

### Remove punctuation from the text

In [64]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [65]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation]) # joins the character with no spaces in between
    return text_nopunct
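
As an alternative sketch, the same filtering can be done with a translation table, which avoids building a Python list character by character. The name `remove_punct_translate` is hypothetical and is not used elsewhere in this notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (i.e. deletes it)
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(_PUNCT_TABLE)

print(remove_punct_translate("Ok lar... Joking wif u oni..."))  # Ok lar Joking wif u oni
```

Like the versions in this notebook, this only strips the ASCII characters listed in `string.punctuation`, so symbols such as `£` are left in place.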

In [83]:
def remove_punctuation(word):
    # Loop-based equivalent of remove_punct: keep every character that is not punctuation
    no_punctuation = []
    for letter in word:
        if letter not in string.punctuation:
            no_punctuation.append(letter)
    no_punctuation = "".join(no_punctuation)

    return no_punctuation

In [97]:
# Use apply to run this function on every row of the Text column
data2['Text_no_punct'] = data2['Text'].apply(remove_punctuation)
data2.head(10)

Unnamed: 0,label,Text,Text_no_punct
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though
5,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it s...
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertu...
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c...,WINNER As a valued network customer you have been selected to receivea £900 prize reward To clai...
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camer...


### Tokenize the string of words

In [98]:
import re

def tokenize(words):
    tokens = re.split(r'\W+', words)  # split on runs of non-word characters (raw string avoids an invalid-escape warning)
    return tokens
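
One caveat with splitting on `\W+` is that text with leading or trailing whitespace produces empty-string tokens. The sketch below shows the effect and one possible alternative that matches word characters instead; `tokenize_nonempty` is a hypothetical name, not part of the original notebook.

```python
import re

def tokenize_nonempty(words):
    """Return runs of word characters, so no empty tokens are produced."""
    return re.findall(r'\w+', words)

print(re.split(r'\W+', "Ok lar "))    # ['Ok', 'lar', '']
print(tokenize_nonempty("Ok lar "))   # ['Ok', 'lar']
```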

In [101]:
data2['Text_tokenize'] = data2['Text_no_punct'].apply(lambda x: tokenize(x)) #Applies the function to each row
data2.head(10)

Unnamed: 0,label,Text,Text_no_punct,Text_tokenize
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[Free, entry, in, 2, a, wkly, comp, to, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[Nah, I, dont, think, he, goes, to, usf, he, lives, around, here, though]"
5,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it s...,"[FreeMsg, Hey, there, darling, its, been, 3, weeks, now, and, no, word, back, Id, like, some, fu..."
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[Even, my, brother, is, not, like, to, speak, with, me, They, treat, me, like, aids, patent]"
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertu...,"[As, per, your, request, Melle, Melle, Oru, Minnaminunginte, Nurungu, Vettam, has, been, set, as..."
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c...,WINNER As a valued network customer you have been selected to receivea £900 prize reward To clai...,"[WINNER, As, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, ..."
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camer...,"[Had, your, mobile, 11, months, or, more, U, R, entitled, to, Update, to, the, latest, colour, m..."


### Remove stop words

In [102]:
import nltk

# nltk.download('stopwords')  # run once if the stop word corpus is not installed
stopword = nltk.corpus.stopwords.words('english')

In [103]:
# Define a helper that drops stop words from a list of tokens
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text
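
Note that the NLTK English stop word list is lowercase and the comparison above is case-sensitive, so capitalised tokens such as "They", "As", "To" and "I" are kept, as the output table for this step shows. If that is not intended, a case-insensitive variant could look like the sketch below. `STOPWORDS_SAMPLE` is a small hand-written stand-in for the NLTK list so that the example runs on its own, and `remove_stopwords_ci` is a hypothetical name.

```python
# Small stand-in for nltk.corpus.stopwords.words('english'); all of these words appear in that list
STOPWORDS_SAMPLE = {"i", "they", "as", "to", "he", "my", "is", "not", "with", "me"}

def remove_stopwords_ci(tokenized_list, stopwords=STOPWORDS_SAMPLE):
    """Drop tokens whose lowercase form is a stop word; a set gives fast membership tests."""
    return [word for word in tokenized_list if word.lower() not in stopwords]

tokens = ["Even", "my", "brother", "is", "not", "like", "to", "speak",
          "with", "me", "They", "treat", "me", "like", "aids", "patent"]
print(remove_stopwords_ci(tokens))
# ['Even', 'brother', 'like', 'speak', 'treat', 'like', 'aids', 'patent']
```

In the notebook, passing `set(stopword)` as the second argument would apply the same idea with the full NLTK list.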

In [104]:
data2['Text_nostop_words'] = data2['Text_tokenize'].apply(lambda x: remove_stopwords(x)) #Applies the function to each row
data2.head(10)

Unnamed: 0,label,Text,Text_no_punct,Text_tokenize,Text_nostop_words
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Ci...","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[Free, entry, in, 2, a, wkly, comp, to, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, to...","[Free, entry, 2, wkly, comp, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]","[U, dun, say, early, hor, U, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[Nah, I, dont, think, he, goes, to, usf, he, lives, around, here, though]","[Nah, I, dont, think, goes, usf, lives, around, though]"
5,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...,FreeMsg Hey there darling its been 3 weeks now and no word back Id like some fun you up for it s...,"[FreeMsg, Hey, there, darling, its, been, 3, weeks, now, and, no, word, back, Id, like, some, fu...","[FreeMsg, Hey, darling, 3, weeks, word, back, Id, like, fun, still, Tb, ok, XxX, std, chgs, send..."
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.,Even my brother is not like to speak with me They treat me like aids patent,"[Even, my, brother, is, not, like, to, speak, with, me, They, treat, me, like, aids, patent]","[Even, brother, like, speak, They, treat, like, aids, patent]"
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,As per your request Melle Melle Oru Minnaminunginte Nurungu Vettam has been set as your callertu...,"[As, per, your, request, Melle, Melle, Oru, Minnaminunginte, Nurungu, Vettam, has, been, set, as...","[As, per, request, Melle, Melle, Oru, Minnaminunginte, Nurungu, Vettam, set, callertune, Callers..."
8,spam,WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To c...,WINNER As a valued network customer you have been selected to receivea £900 prize reward To clai...,"[WINNER, As, a, valued, network, customer, you, have, been, selected, to, receivea, 900, prize, ...","[WINNER, As, valued, network, customer, selected, receivea, 900, prize, reward, To, claim, call,..."
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...,Had your mobile 11 months or more U R entitled to Update to the latest colour mobiles with camer...,"[Had, your, mobile, 11, months, or, more, U, R, entitled, to, Update, to, the, latest, colour, m...","[Had, mobile, 11, months, U, R, entitled, Update, latest, colour, mobiles, camera, Free, Call, T..."
