# Overview

**Context**

The SMS Spam Collection is a set of SMS tagged messages that have been collected for SMS Spam research. 
It contains one set of SMS messages in English of 5,574 messages, tagged acording being ham (legitimate) or spam.

**Content**

The files contain one message per line. Each line is composed by two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

# Approach

- Loading Data

- Input and Output Data

- Applying Regular Expression

- Each word to lower case

- Splitting words to Tokenize

- Stemming with PorterStemmer handling Stop Words

- Preparing Messages with Remaining Tokens

- Preparing WordVector Corpus

- Applying Classification

In [24]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

In [2]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

# DATA

In [3]:
df = pd.read_csv('../input/spam.csv', encoding='latin-1')
df = df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1)

In [4]:
df.head()

In [5]:
# Replace ham with 0 and spam with 1
df = df.replace(['ham','spam'],[0, 1]) 

In [6]:
df.head()

#### Count the number of words in each Text

In [7]:
df['Count']=0
for i in np.arange(0,len(df.v2)):
    df.loc[i,'Count'] = len(df.loc[i,'v2'])

In [8]:
df.head()

In [9]:
# Total ham(0) and spam(1) messages
df['v1'].value_counts()

In [10]:
df.info()

In [11]:
corpus = []
ps = PorterStemmer()

In [12]:
# Original Messages

print (df['v2'][0])
print (df['v2'][1])

## Processing Messages

In [13]:
for i in range(0, 5572):

    # Applying Regular Expression
    
    '''
    Replace email addresses with 'emailaddr'
    Replace URLs with 'httpaddr'
    Replace money symbols with 'moneysymb'
    Replace phone numbers with 'phonenumbr'
    Replace numbers with 'numbr'
    '''
    msg = df['v2'][i]
    msg = re.sub('\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', df['v2'][i])
    msg = re.sub('(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', df['v2'][i])
    msg = re.sub('£|\$', 'moneysymb', df['v2'][i])
    msg = re.sub('\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b', 'phonenumbr', df['v2'][i])
    msg = re.sub('\d+(\.\d+)?', 'numbr', df['v2'][i])
    
    ''' Remove all punctuations '''
    msg = re.sub('[^\w\d\s]', ' ', df['v2'][i])
    
    if i<2:
        print("\t\t\t\t MESSAGE ", i)
    
    if i<2:
        print("\n After Regular Expression - Message ", i, " : ", msg)
    
    # Each word to lower case
    msg = msg.lower()    
    if i<2:
        print("\n Lower case Message ", i, " : ", msg)
    
    # Splitting words to Tokenize
    msg = msg.split()    
    if i<2:
        print("\n After Splitting - Message ", i, " : ", msg)
    
    # Stemming with PorterStemmer handling Stop Words
    msg = [ps.stem(word) for word in msg if not word in set(stopwords.words('english'))]
    if i<2:
        print("\n After Stemming - Message ", i, " : ", msg)
    
    # preparing Messages with Remaining Tokens
    msg = ' '.join(msg)
    if i<2:
        print("\n Final Prepared - Message ", i, " : ", msg, "\n\n")
    
    # Preparing WordVector Corpus
    corpus.append(msg)

In [14]:
cv = CountVectorizer()
x = cv.fit_transform(corpus).toarray()

# Applying Classification

- Input : Prepared Sparse Matrix
- Ouput : Labels (Spam or Ham)

In [15]:
y = df['v1']
print (y.value_counts())

print(y[0])
print(y[1])

### Encoding Labels

In [17]:
le = LabelEncoder()
y = le.fit_transform(y)

print(y[0])
print(y[1])

### Splitting to Training and Testing DATA

In [18]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y,test_size= 0.20, random_state = 0)

# Applying Guassian Naive Bayes

In [19]:
bayes_classifier = GaussianNB()
bayes_classifier.fit(xtrain, ytrain)

In [20]:
# Predicting
y_pred = bayes_classifier.predict(xtest)

## Results

In [21]:
# Evaluating
cm = confusion_matrix(ytest, y_pred)

In [22]:
cm

In [23]:
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, bayes_classifier.predict(xtest)))
print (classification_report(ytest, bayes_classifier.predict(xtest)))

# Applying Decision Tree

In [25]:
dt = DecisionTreeClassifier(random_state=50)
dt.fit(xtrain, ytrain)

In [31]:
# Predicting
y_pred_dt = dt.predict(xtest)

## Results

In [33]:
# Evaluating
cm = confusion_matrix(ytest, y_pred_dt)

print(cm)

In [30]:
print ("Accuracy : %0.5f \n\n" % accuracy_score(ytest, dt.predict(xtest)))
print (classification_report(ytest, dt.predict(xtest)))

# Final Accuracy

- **Decision Tree : 96.861%**
- **Guassian NB   : 87.085%**   

Thanks for having a look :) ....Please give my kernel an **UPVOTE** 