# Introduction to Notebooks

Notebooks are interactive consoles with annotation features. In this one, you can run Python code or Unix shell commands.

In [1]:
print('Hello, world!')

Hello, world!


You can display a variable without `print()`: the notebook shows the value of the last expression in a cell.

In [2]:
a = 2
a

2

Previously defined variables still exist in the scope of the notebook. Try deleting the cell above, and run the cell below:

In [None]:
a

In [3]:
ls

MyFirstMLProject.ipynb


In [11]:
pip list

Package                   Version
------------------------- --------------
anyio                     4.8.0
appnope                   0.1.4
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.4
attrs                     25.1.0
babel                     2.16.0
beautifulsoup4            4.12.3
bleach                    6.2.0
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
comm                      0.2.2
contourpy                 1.3.1
cycler                    0.12.1
debugpy                   1.8.12
decorator                 5.1.1
defusedxml                0.7.1
executing                 2.2.0
fastjsonschema            2.21.1
fonttools                 4.56.0
fqdn                      1.5.1
h11                       0.14.0
hatch-jupyter-builder     0.9.1
hatch-nodejs-version      0.3.2
hatchling                 1.27.0
httpcore     

# ML Pipeline

Build a spam filter using supervised Machine Learning.

This notebook goes through the steps of a standard ML project. The steps are:

1. Load data
2. Clean data
3. Feature Engineering
4. Model Selection
5. Training
6. Testing
7. Validation
8. Machine Learning pipeline
9. Reporting

# Requirements & Resources

## ML Python APIs
- [`sklearn`](https://scikit-learn.org/stable/index.html) (also called scikit-learn)
- [`nltk`](https://www.nltk.org/) (Natural Language Tool Kit)

## ML Datasets
- [UCI datasets](https://archive.ics.uci.edu/)
- [Kaggle](https://www.kaggle.com/datasets)

## Python Plotting APIs
- `matplotlib`
- `seaborn` (matplotlib simplified)
- `plotly`

In [3]:
pip install scikit-learn


[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: /usr/local/Cellar/jupyterlab/4.3.5/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install nltk


[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: /usr/local/Cellar/jupyterlab/4.3.5/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install matplotlib


[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: /usr/local/Cellar/jupyterlab/4.3.5/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install seaborn


[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: /usr/local/Cellar/jupyterlab/4.3.5/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install pandas


[notice] A new release of pip is available: 25.0 -> 25.0.1
[notice] To update, run: /usr/local/Cellar/jupyterlab/4.3.5/libexec/bin/python -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


# Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

import sklearn as sk
from sklearn.feature_extraction      import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection         import train_test_split
from sklearn.naive_bayes             import MultinomialNB
from sklearn.metrics                 import accuracy_score, precision_score, recall_score, confusion_matrix



import nltk
from nltk.corpus   import stopwords
from nltk.stem     import WordNetLemmatizer
from nltk.tokenize import word_tokenize

pd.set_option('display.max_colwidth', None) # Increase column width

In [2]:
# NLTK is a HUGE package, and sometimes you will be prompted to download specific parts separately.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/beatricemoissinac/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/beatricemoissinac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/beatricemoissinac/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/beatricemoissinac/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

# Load Data 

In [22]:
file_path = 'SMSSpamCollection.txt'
df = pd.read_csv(file_path,
                 sep   = '\t', 
                 header= None,
                 names = ['label', 'text'])

df['y'] = df['label']=='ham' # Boolean label: True for ham, False for spam
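
The label comparison in this cell yields booleans rather than integers. As a minimal sketch, assuming a two-row in-memory sample in place of `SMSSpamCollection.txt` (the sample rows, the `demo` frame, and the `y_bool`/`y_int` column names are illustrative assumptions, not part of the notebook), the same `read_csv` arguments and an optional cast to {0,1} could look like this:

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the real file
sample = io.StringIO(
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tFree entry in 2 a wkly comp\n"
)

# Same arguments as the notebook cell: tab separator, no header row, explicit column names
demo = pd.read_csv(sample, sep='\t', header=None, names=['label', 'text'])

demo['y_bool'] = demo['label'] == 'ham'     # True for ham, False for spam
demo['y_int'] = demo['y_bool'].astype(int)  # 1 for ham, 0 for spam
print(demo[['label', 'y_bool', 'y_int']])
```

Both encodings work with scikit-learn classifiers; the integer form is only a readability choice.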

In [23]:
# What does the data look like? 
df

Unnamed: 0,label,text,y
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True
1,ham,Ok lar... Joking wif u oni...,True
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False
3,ham,U dun say so early hor... U c already then say...,True
4,ham,"Nah I don't think he goes to usf, he lives around here though",True
...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False
5568,ham,Will ü b going to esplanade fr home?,True
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True


# Clean data
"Cleaning" (or wrangling) the data is an important and time-consuming step that often takes up most of a project's time. For this project, cleaning the data means making the text consumable by the algorithms.

In [24]:
df['text_sans_punkt'] = df['text'].str.replace(r'[^\w\s]', ' ', regex=True)
df

Unnamed: 0,label,text,y,text_sans_punkt
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive entry question std txt rate T C s apply 08452810075over18 s
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though
...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the 2nd time we have tried 2 contact u U have won the 750 Pound prize 2 claim is easy call 087187272008 NOW1 Only 10p per minute BT national rate
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free


In [25]:
df['text_sans_punkt'] = df['text'].str.replace(r'[^\w\s]|\d+', ' ', regex = True)
df

Unnamed: 0,label,text,y,text_sans_punkt
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though
...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the nd time we have tried contact u U have won the Pound prize claim is easy call NOW Only p per minute BT national rate
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free
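
To see what the two patterns above do to a single message, here is a minimal sketch using Python's built-in `re` module on two messages taken from the dataset. The variable names are illustrative, and the final whitespace-squeezing step is an optional extra that the notebook cells do not perform.

```python
import re

samples = [
    "Ok lar... Joking wif u oni...",
    "U have won the £750 Pound prize.",
]

# Pattern 1: replace anything that is neither a word character nor whitespace
punct_only = [re.sub(r'[^\w\s]', ' ', s) for s in samples]

# Pattern 2: additionally replace runs of digits
punct_and_digits = [re.sub(r'[^\w\s]|\d+', ' ', s) for s in samples]

# Optional extra step: collapse the repeated spaces left behind by the replacements
squeezed = [' '.join(s.split()) for s in punct_and_digits]

for before, after in zip(samples, squeezed):
    print(repr(before), '->', repr(after))
```

Replacing with a space rather than an empty string keeps words such as "So...any" from being glued together.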


## Tokenizing
Tokenizing means breaking down a sentence or corpus into its individual words (tokens). Tokens are the inputs to the next step, so they need to be as informative as possible. If your tokens are full of garbage data, your model will not be able to tell spam from ham.

In [26]:
df['token'] = df["text_sans_punkt"].apply(nltk.word_tokenize)
df

Unnamed: 0,label,text,y,text_sans_punkt,token
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Cine, there, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s,"[Free, entry, in, a, wkly, comp, to, win, FA, Cup, final, tkts, st, May, Text, FA, to, to, receive, entry, question, std, txt, rate, T, C, s, apply, over, s]"
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though,"[Nah, I, don, t, think, he, goes, to, usf, he, lives, around, here, though]"
...,...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the nd time we have tried contact u U have won the Pound prize claim is easy call NOW Only p per minute BT national rate,"[This, is, the, nd, time, we, have, tried, contact, u, U, have, won, the, Pound, prize, claim, is, easy, call, NOW, Only, p, per, minute, BT, national, rate]"
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home,"[Will, ü, b, going, to, esplanade, fr, home]"
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions,"[Pity, was, in, mood, for, that, So, any, other, suggestions]"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free,"[The, guy, did, some, bitching, but, I, acted, like, i, d, be, interested, in, buying, something, else, next, week, and, he, gave, it, to, us, for, free]"


## Stop-words

Stop-words are very common words (e.g., 'a', 'the', 'with') that do not bring a lot of meaning (signal) for this project. NLTK provides you with a starting list, but you may use whatever list you think is appropriate to your project.


In [27]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [28]:
nltk_stop = set(stopwords.words('english'))  # build the set once instead of once per token
df['clean_token'] = df['token'].apply(lambda x: [t for t in x if t not in nltk_stop])
df

Unnamed: 0,label,text,y,text_sans_punkt,token,clean_token
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Cine, there, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s,"[Free, entry, in, a, wkly, comp, to, win, FA, Cup, final, tkts, st, May, Text, FA, to, to, receive, entry, question, std, txt, rate, T, C, s, apply, over, s]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]"
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]","[U, dun, say, early, hor, U, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though,"[Nah, I, don, t, think, he, goes, to, usf, he, lives, around, here, though]","[Nah, I, think, goes, usf, lives, around, though]"
...,...,...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the nd time we have tried contact u U have won the Pound prize claim is easy call NOW Only p per minute BT national rate,"[This, is, the, nd, time, we, have, tried, contact, u, U, have, won, the, Pound, prize, claim, is, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]"
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home,"[Will, ü, b, going, to, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]"
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions,"[Pity, was, in, mood, for, that, So, any, other, suggestions]","[Pity, mood, So, suggestions]"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free,"[The, guy, did, some, bitching, but, I, acted, like, i, d, be, interested, in, buying, something, else, next, week, and, he, gave, it, to, us, for, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, us, free]"
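
Notice that capitalized tokens such as `The` and `I` survive the filter above: the stop list is lowercase and the comparison is case-sensitive. A minimal sketch of the difference, assuming a small hand-picked stop set (an illustration only; NLTK's list is much longer):

```python
# Small hand-picked stop list for illustration (an assumption, not NLTK's full list)
stop_words = {'the', 'i', 'did', 'some', 'but'}

tokens = ['The', 'guy', 'did', 'some', 'bitching', 'but', 'I', 'acted']

# Case-sensitive comparison: 'The' and 'I' are kept because only lowercase forms are listed
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token before the lookup removes them as well
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase is a project decision: capitalization can itself be a signal in spam (for example "FREE" or "NOW").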


In [29]:
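# Exercise: can you remove stop words using a different list?
# `text` was imported above: from sklearn.feature_extraction import text

stop = text.ENGLISH_STOP_WORDS  # scikit-learn's built-in English stop-word list
# As written, the next line still filters with the NLTK list, so clean_token_2 equals clean_token.
# Replace stopwords.words('english') with `stop` to compare the two lists.
df['clean_token_2'] = df['token'].apply(lambda x: [t for t in x if t not in stopwords.words('english')])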
df

Unnamed: 0,label,text,y,text_sans_punkt,token,clean_token,clean_token_2
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Cine, there, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s,"[Free, entry, in, a, wkly, comp, to, win, FA, Cup, final, tkts, st, May, Text, FA, to, to, receive, entry, question, std, txt, rate, T, C, s, apply, over, s]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]"
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]","[U, dun, say, early, hor, U, c, already, say]","[U, dun, say, early, hor, U, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though,"[Nah, I, don, t, think, he, goes, to, usf, he, lives, around, here, though]","[Nah, I, think, goes, usf, lives, around, though]","[Nah, I, think, goes, usf, lives, around, though]"
...,...,...,...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the nd time we have tried contact u U have won the Pound prize claim is easy call NOW Only p per minute BT national rate,"[This, is, the, nd, time, we, have, tried, contact, u, U, have, won, the, Pound, prize, claim, is, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]"
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home,"[Will, ü, b, going, to, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]"
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions,"[Pity, was, in, mood, for, that, So, any, other, suggestions]","[Pity, mood, So, suggestions]","[Pity, mood, So, suggestions]"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free,"[The, guy, did, some, bitching, but, I, acted, like, i, d, be, interested, in, buying, something, else, next, week, and, he, gave, it, to, us, for, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, us, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, us, free]"


## Lemmatization

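Lemmatization is the process of transforming a word into its dictionary form (its lemma). It should not be confused with stemming, a simpler technique that chops off word endings and can produce strings that are not real words.
- Lemmatization: cats => cat, goes => go, suggestions => suggestion
- Stemming: trouble => troubl, troubling => troubl, troubled => troubl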
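
Truncated forms like `troubl` are what a stemmer produces. As a minimal sketch, NLTK's `PorterStemmer` (used here only for contrast; the notebook itself uses `WordNetLemmatizer`) can be run on the words above without downloading any extra NLTK data:

```python
from nltk.stem import PorterStemmer  # rule-based stemmer, no nltk.download() needed

stemmer = PorterStemmer()
words = ['cats', 'trouble', 'troubling', 'troubled']

# Map each word to its stem; stems are not guaranteed to be dictionary words
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```

A lemmatizer, by contrast, looks words up in a dictionary (WordNet here), so its results are real words.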

In [30]:
# WordNetLemmatizer was imported above: from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
df['lemma_token'] = df['clean_token'].apply(lambda x: [lemmatizer.lemmatize(w) for w in x])
df

Unnamed: 0,label,text,y,text_sans_punkt,token,clean_token,clean_token_2,lemma_token
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...",True,Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amore wat,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Cine, there, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,True,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's,False,Free entry in a wkly comp to win FA Cup final tkts st May Text FA to to receive entry question std txt rate T C s apply over s,"[Free, entry, in, a, wkly, comp, to, win, FA, Cup, final, tkts, st, May, Text, FA, to, to, receive, entry, question, std, txt, rate, T, C, s, apply, over, s]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]","[Free, entry, wkly, comp, win, FA, Cup, final, tkts, st, May, Text, FA, receive, entry, question, std, txt, rate, T, C, apply]"
3,ham,U dun say so early hor... U c already then say...,True,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]","[U, dun, say, early, hor, U, c, already, say]","[U, dun, say, early, hor, U, c, already, say]","[U, dun, say, early, hor, U, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",True,Nah I don t think he goes to usf he lives around here though,"[Nah, I, don, t, think, he, goes, to, usf, he, lives, around, here, though]","[Nah, I, think, goes, usf, lives, around, though]","[Nah, I, think, goes, usf, lives, around, though]","[Nah, I, think, go, usf, life, around, though]"
...,...,...,...,...,...,...,...,...
5567,spam,"This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.",False,This is the nd time we have tried contact u U have won the Pound prize claim is easy call NOW Only p per minute BT national rate,"[This, is, the, nd, time, we, have, tried, contact, u, U, have, won, the, Pound, prize, claim, is, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]","[This, nd, time, tried, contact, u, U, Pound, prize, claim, easy, call, NOW, Only, p, per, minute, BT, national, rate]"
5568,ham,Will ü b going to esplanade fr home?,True,Will ü b going to esplanade fr home,"[Will, ü, b, going, to, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]","[Will, ü, b, going, esplanade, fr, home]"
5569,ham,"Pity, * was in mood for that. So...any other suggestions?",True,Pity was in mood for that So any other suggestions,"[Pity, was, in, mood, for, that, So, any, other, suggestions]","[Pity, mood, So, suggestions]","[Pity, mood, So, suggestions]","[Pity, mood, So, suggestion]"
5570,ham,The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free,True,The guy did some bitching but I acted like i d be interested in buying something else next week and he gave it to us for free,"[The, guy, did, some, bitching, but, I, acted, like, i, d, be, interested, in, buying, something, else, next, week, and, he, gave, it, to, us, for, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, us, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, us, free]","[The, guy, bitching, I, acted, like, interested, buying, something, else, next, week, gave, u, free]"


# Feature Engineering
A feature is a parameter, or variable, that gives your model a signal about the question you are asking. Features are extracted from the cleaned data in a form that a model can use. Words in themselves aren't features: we need to transform them into something 'mathematical' that a mathematical model can understand.

## Word Count
A simple feature is the word count, or how many times a word was used in a document.

In [31]:
# Instantiate your feature engineering pipeline
class LemmaTokenizer:
    """Callable tokenizer: splits a document into tokens and lemmatizes each one."""
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):  # double underscores make the instance callable
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

countVec = CountVectorizer(tokenizer  = LemmaTokenizer(),
                           stop_words = stopwords.words('english'))

In [32]:
vectorizer = CountVectorizer()
# Transform your words into numerical features
X = vectorizer.fit_transform(df.text)

In [50]:
X.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], shape=(5572, 8713))

In [43]:
# Split your data
X_train, X_test, y_train, y_test = train_test_split(X, df.y, test_size=0.2, random_state=41)
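
The `random_state` argument fixes the shuffle, so the same seed always yields the same split. A minimal sketch on a toy list of ten items (the data and variable names are illustrative):

```python
from sklearn.model_selection import train_test_split

items = list(range(10))  # toy data standing in for the feature matrix

# Same seed twice, then a different seed
train_a, test_a = train_test_split(items, test_size=0.2, random_state=41)
train_b, test_b = train_test_split(items, test_size=0.2, random_state=41)
train_c, test_c = train_test_split(items, test_size=0.2, random_state=28)

print(test_a == test_b)                   # identical seeds give identical splits
print(sorted(train_a + test_a) == items)  # every item lands in exactly one part
print(test_a, test_c)                     # a different seed may select different items
```

With `test_size=0.2`, two of the ten items are held out for testing and eight are used for training.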

In [44]:
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

y_pred = nb_classifier.predict(X_test)

In [45]:
# Evaluate the model (y is True for ham, so precision and recall are reported for the ham class)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

In [46]:
# Display evaluation metrics
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 98.39%
Precision: 98.95%
Recall: 99.16%

Confusion Matrix:
[[156  10]
 [  8 941]]
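
The three percentages can be recomputed by hand from the confusion matrix above. A minimal sketch in plain Python, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, ordered `False` then `True`, so "positive" here means ham):

```python
# Confusion matrix from the output above: [[tn, fp], [fn, tp]]
tn, fp = 156, 10   # actual spam: predicted spam / predicted ham
fn, tp = 8, 941    # actual ham:  predicted spam / predicted ham

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all messages classified correctly
precision = tp / (tp + fp)     # of messages predicted ham, share that really are ham
recall = tp / (tp + fn)        # of real ham messages, share predicted as ham

print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
```

Read this way, the 10 false positives are spam messages that slipped through as ham, and the 8 false negatives are legitimate messages flagged as spam.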


## What happens when we change the seed?

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, df.y, test_size=0.2, random_state=28)

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

y_pred = nb_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Display evaluation metrics
print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}%".format(precision * 100))
print("Recall: {:.2f}%".format(recall * 100))
print("\nConfusion Matrix:")
print(conf_matrix)

Accuracy: 98.39%
Precision: 99.49%
Recall: 98.68%

Confusion Matrix:
[[127   5]
 [ 13 970]]
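
Since a single split gives a single, somewhat lucky or unlucky estimate, one option is to repeat the split over several seeds and look at the spread. A minimal sketch on a small made-up corpus (the texts, labels, and seed values are assumptions for illustration; on the real data you would reuse `X` and `df.y`):

```python
from statistics import mean, stdev

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 4 spam-like and 4 ham-like messages, repeated to get 40 rows
texts = ["win a free prize now", "free entry call now",
         "claim your cash prize", "urgent call to win cash",
         "see you at lunch", "are you coming home",
         "ok talk to you later", "thanks for the lift home"] * 5
labels = ([False] * 4 + [True] * 4) * 5   # True for ham, as in the notebook

X_toy = CountVectorizer().fit_transform(texts)

scores = []
for seed in [41, 28, 7, 3, 99]:           # arbitrary seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, labels, test_size=0.2, random_state=seed, stratify=labels)
    model = MultinomialNB().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

print("mean accuracy:", mean(scores))
print("std deviation:", stdev(scores))
```

Reporting the mean together with the spread gives a more honest picture than any single seed; scikit-learn's cross-validation helpers follow the same idea.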
