# Data Preprocessing

## Objective

The objective of this notebook is to prepare the SMS Spam Collection dataset.<br>
This includes:<br>
- Handling missing values;
- Removing duplicate messages;
- Encoding class labels;
- Text preprocessing.

In [16]:
import pandas as pd
import numpy as np
import re


## Load Dataset

In [19]:
data_path = "../datasets/raw/SMSSpamCollection"
df = pd.read_csv(data_path, sep="\t", header=None, names=["label", "message"])

df.head()


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


## Handling Missing Values

Although the exploratory data analysis showed that the selected dataset has no missing values, this step is included for academic purposes.

Messages without a valid class label or without message content provide no useful information for supervised learning and are removed.

In [21]:
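# Single quotes inside the f-string keep the code valid on Python versions before 3.12
print(f"df count: {df['label'].count()}")

df = df.dropna(subset=["label", "message"])   # drop rows missing a label or a message
df.isnull().sum()

print(f"df count: {df['label'].count()}")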

df count: 5572
df count: 5572


The row counts printed before and after dropping rows show that no row had missing values.
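
Because the real dataset has no missing values, the effect of this step is not visible in the output above. The following is a minimal sketch on a small, made-up frame (the name `toy` and its contents are hypothetical, not taken from the dataset) that shows how `dropna` with `subset` behaves when a label or a message is missing.

```python
import pandas as pd

# Hypothetical toy data: one row lacks a label, another lacks a message.
toy = pd.DataFrame(
    {
        "label": ["ham", None, "spam", "ham"],
        "message": ["see you later", "no label here", None, "ok"],
    }
)

# Count missing entries per column before dropping anything.
missing_per_column = toy.isnull().sum()

# Keep only the rows where both the label and the message are present.
toy_clean = toy.dropna(subset=["label", "message"])

print(missing_per_column)
print(f"rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

Only the two complete rows survive, which is the behaviour the notebook relies on.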

## Duplicate Removal

Duplicate messages may cause issues during the training phase, such as bias from over-representing specific samples. Duplicated rows are removed to prevent this effect.

In [22]:
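# Single quotes inside the f-string keep the code valid on Python versions before 3.12
print(f"df count: {df['label'].count()}")

df = df.drop_duplicates()   # keeps the first occurrence of each repeated row
df.shape

print(f"df count: {df['label'].count()}")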

df count: 5572
df count: 5169


This step removed 403 rows.
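
As an illustration of how the duplicates are counted and removed, here is a small sketch on made-up data (the frame `toy` is hypothetical). It assumes the default behaviour of `drop_duplicates()`, which compares all columns and keeps the first occurrence of each repeated row.

```python
import pandas as pd

# Hypothetical toy data with two repeated rows.
toy = pd.DataFrame(
    {
        "label": ["ham", "ham", "spam", "spam", "ham"],
        "message": ["ok", "ok", "win a prize", "win a prize", "see you"],
    }
)

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default (keep="first").
toy_unique = toy.drop_duplicates()

print(f"duplicates: {n_duplicates}, rows kept: {len(toy_unique)}")
```

Since all columns are compared, a message that appears once as ham and once as spam would not count as a duplicate; passing `subset=["message"]` would be one way to treat it as such, if that were desired.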

## Label Encoding

Machine learning algorithms require numerical labels. The original categorical labels (ham, spam) are mapped to numerical values:<br>
0 -> ham<br>
1 -> spam

In [23]:
df["label_encoded"] = df["label"].map({"ham": 0, "spam": 1})

df[["label", "label_encoded"]].head()

| | label | label_encoded |
|---|---|---|
| 0 | ham | 0 |
| 1 | ham | 0 |
| 2 | spam | 1 |
| 3 | ham | 0 |
| 4 | ham | 0 |
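
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses made-up labels (the names `label_map`, `labels`, and `unmapped` are hypothetical) to show one possible way to detect such cases; it is not part of the original notebook.

```python
import pandas as pd

label_map = {"ham": 0, "spam": 1}

# Hypothetical labels; "Spam" is deliberately malformed (wrong capitalisation).
labels = pd.Series(["ham", "spam", "Spam", "ham"])

# Values missing from the dictionary are mapped to NaN.
encoded = labels.map(label_map)

# Collect the original labels that could not be encoded.
unmapped = labels[encoded.isna()].unique()

print(encoded.tolist())
print(f"unmapped labels: {list(unmapped)}")
```

In this notebook the final `df.info()` output reports `label_encoded` as `int64` with no nulls, which indicates that every label was mapped.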


## Text Preprocessing

Basic text preprocessing is applied to standardise the message content. This step reduces noise in the dataset, for example by preventing "Win!" from being interpreted differently from "win".

In [24]:
def clean_text(text):
    text = text.lower()                      # convert all text to lower case
    text = re.sub(r"\d+", "", text)          # remove digits
    text = re.sub(r"[^\w\s]", "", text)      # remove punctuation (underscores are kept, since \w matches them)
    text = text.strip()                      # trim leading and trailing whitespace
    return text

df["clean_message"] = df["message"].apply(clean_text)

df[["message", "clean_message"]].head()


| | message | clean_message |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | go until jurong point crazy available only in ... |
| 1 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry in a wkly comp to win fa cup final... |
| 3 | U dun say so early hor... U c already then say... | u dun say so early hor u c already then say |
| 4 | Nah I don't think he goes to usf, he lives aro... | nah i dont think he goes to usf he lives aroun... |
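
To make the effect of each cleaning step concrete, the sketch below repeats the same four steps on a made-up sample string. It also shows that removing a digit between two spaces leaves a double space, because `strip()` only trims the ends of the string. The function `clean_text_collapsed` is a hypothetical variant, not part of the notebook, that additionally collapses runs of whitespace.

```python
import re


def clean_text(text):
    """Same four steps as the notebook's cleaning function."""
    text = text.lower()                   # lower case
    text = re.sub(r"\d+", "", text)       # remove digits
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    return text.strip()                   # trim leading/trailing whitespace


def clean_text_collapsed(text):
    """Hypothetical variant: also collapse repeated whitespace into one space."""
    return re.sub(r"\s+", " ", clean_text(text))


sample = "Free entry in 2 a wkly comp... WIN!"

print(repr(clean_text(sample)))            # 'free entry in  a wkly comp win'
print(repr(clean_text_collapsed(sample)))  # 'free entry in a wkly comp win'
print(clean_text("Win!") == clean_text("win"))  # True
```

Whether the extra spaces matter depends on the tokenizer used later; many vectorizers split on any run of whitespace, in which case the two variants should produce the same tokens.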


## Final Dataset Overview

After preprocessing, the dataset consists of cleaned text messages and numerical labels, ready for feature extraction and model training.

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5169 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   label          5169 non-null   object
 1   message        5169 non-null   object
 2   label_encoded  5169 non-null   int64 
 3   clean_message  5169 non-null   object
dtypes: int64(1), object(3)
memory usage: 201.9+ KB


In [28]:
df.head()

| | label | message | label_encoded | clean_message |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | go until jurong point crazy available only in ... |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | free entry in a wkly comp to win fa cup final... |
| 3 | ham | U dun say so early hor... U c already then say... | 0 | u dun say so early hor u c already then say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 | nah i dont think he goes to usf he lives aroun... |


### Store dataset

Store the preprocessed dataset to be used during the training phase.

In [None]:
df.to_csv("../datasets/processed/sms_spam_preprocessed.csv", index=False)
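
When the stored file is read back, one detail may need attention: a message consisting only of digits and punctuation is cleaned to an empty string, and `pd.read_csv` treats empty fields as `NaN` by default. The sketch below demonstrates this round trip on made-up data written to an in-memory buffer (the frame `toy` and the buffer are hypothetical; the real file path is not touched). Whether the actual dataset contains such messages is not verified here.

```python
import io

import pandas as pd

# Hypothetical toy data: the first message cleans to an empty string.
toy = pd.DataFrame(
    {
        "label": ["ham", "spam"],
        "message": [":)", "WIN 100!!!"],
        "label_encoded": [0, 1],
        "clean_message": ["", "win"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty clean_message comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves the empty string.
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read["clean_message"].isna().sum())
print(kept_empty["clean_message"].tolist())
```

Either reading with `keep_default_na=False` or dropping rows with an empty `clean_message` before training would avoid unexpected `NaN` values in the text column.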

## Preprocessing Summary

The preprocessing stage included the handling of missing values, removal of duplicate messages, encoding of class labels, and basic text cleaning.

These steps ensure that the dataset is consistent, that noise is reduced, and that the data is suitable for the feature extraction and model training phases.