# Analysing SMS Content to Detect Spam From Ham

In this project, we follow the machine learning pipeline below to predict whether a text message is spam or ham (a legitimate message).

1. Read in raw text.
2. Clean text and tokenize.
3. Feature engineering.
4. Fit a simple model.
5. Tune hyperparameters and evaluate with GridSearchCV.
6. Final model selection.
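
As a rough illustration of step 2, the sketch below lowercases a message, strips punctuation, and splits it into tokens using only `re` and `string`. The helper name `clean_and_tokenize` is hypothetical and not defined in this notebook; the actual cleaning step may differ (for example, it could also remove stopwords or apply stemming with NLTK).

```python
import re
import string

def clean_and_tokenize(text):
    """Hypothetical helper: lowercase, drop punctuation, split into tokens."""
    # Remove every character listed in string.punctuation
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters and discard empty strings
    return [tok for tok in re.split(r"\W+", no_punct.lower()) if tok]

tokens = clean_and_tokenize("I HAVE A DATE ON SUNDAY WITH WILL!!")
print(tokens)
# ['i', 'have', 'a', 'date', 'on', 'sunday', 'with', 'will']
```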

## Load the important libraries and dataset

In [1]:
import pandas as pd
import nltk # Natural Language Toolkit library
import re # Regular Expression library
import string
# nltk.download()  # uncomment to open the NLTK downloader and fetch the required data

In [2]:
pd.set_option('display.max_colwidth', 100)

fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep="\t", header=None)  # tab-separated file with no header row

fullCorpus.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


## Exploratory Data Analysis (EDA)

In [3]:
fullCorpus.columns = ['label', 'body_text']  # assign header names to the columns
fullCorpus.head()

|   | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to thank you for this breather. I promise i wont take yo... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 2 | ham | Nah I don't think he goes to usf, he lives around here though |
| 3 | ham | Even my brother is not like to speak with me. They treat me like aids patent. |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


In [4]:
fullCorpus.shape

(5568, 2)

In [5]:
print("Input data has {} rows and {} columns".format(len(fullCorpus), len(fullCorpus.columns)))

Input data has 5568 rows and 2 columns


In [6]:
fullCorpus.dtypes

label        object
body_text    object
dtype: object

In [7]:
# How many spam/ham are there?

n_spam = (fullCorpus['label'] == 'spam').sum()  # rows labelled spam
n_ham = (fullCorpus['label'] == 'ham').sum()    # rows labelled ham
print("Out of {} rows, {} are spam, {} are ham".format(len(fullCorpus), n_spam, n_ham))

Out of 5568 rows, 746 are spam, 4822 are ham
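
The counts above show that the two classes are far from balanced. The short sketch below turns them into proportions; the numbers are copied from the output above rather than recomputed from the file. It suggests that a model that always predicts ham would already be right about 86.6% of the time, so accuracy alone may be a misleading metric when evaluating later models.

```python
# Counts copied from the notebook output above
n_total, n_spam, n_ham = 5568, 746, 4822

spam_share = n_spam / n_total
ham_share = n_ham / n_total
print("spam: {:.1%}, ham: {:.1%}".format(spam_share, ham_share))
# spam: 13.4%, ham: 86.6%

# On the full DataFrame, fullCorpus['label'].value_counts(normalize=True)
# should give the same proportions.
```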


In [8]:
# How much missing data is there?

print("Number of null values in label: {}".format(fullCorpus['label'].isnull().sum()))
print("Number of null values in body_text: {}".format(fullCorpus['body_text'].isnull().sum()))

Number of null values in label: 0
Number of null values in body_text: 0
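
Note that `isnull()` only counts missing values; a message made up entirely of whitespace would not be flagged. The sketch below shows one possible extra check on a small made-up DataFrame (`demo` is illustrative data, not taken from the corpus). Whether the real dataset contains any blank messages is not checked here.

```python
import pandas as pd

# Made-up rows for illustration only: one normal, one blank, one missing
demo = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'body_text': ['ok', '   ', None],
})

null_count = int(demo['body_text'].isnull().sum())
# Strip whitespace, then count strings that end up empty;
# missing values stay missing and are not counted as blank
blank_count = int(demo['body_text'].str.strip().eq('').sum())

print("null: {}, blank: {}".format(null_count, blank_count))
```

The same two expressions could be applied to `fullCorpus['body_text']` in place of `demo['body_text']`.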


### Write out Clean Data

In [9]:
fullCorpus.to_csv('fullCorpus.csv', index=False, header=True)
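
Because the output is comma-separated and many messages contain commas, it may be worth confirming that the file reads back unchanged. The sketch below does a round trip through an in-memory buffer on two messages from the sample above; `sample` and `restored` are illustrative names, and the buffer stands in for `fullCorpus.csv`.

```python
import io
import pandas as pd

# Two rows taken from the sample output above; the first contains a comma
sample = pd.DataFrame({
    'label': ['ham', 'ham'],
    'body_text': [
        "Nah I don't think he goes to usf, he lives around here though",
        "I HAVE A DATE ON SUNDAY WITH WILL!!",
    ],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False, header=True)  # fields with commas are quoted
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored['body_text'].tolist() == sample['body_text'].tolist())
```

For the real file, the equivalent check would read `fullCorpus.csv` back with `pd.read_csv` and compare its shape and columns with `fullCorpus`.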