### Introduction
Email use has grown steadily in the digital era, and unwanted emails such as advertisements and scams have grown in tandem. These messages not only clutter the inbox; some are scam attempts that, when successful, can lead to financial loss or the leaking of sensitive data. To counter this, machine learning is used to classify emails as spam or not spam and to move spam to a dedicated folder where it can be deleted.
We implement a machine learning classifier, based on logistic regression, that can be used to filter emails.
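
To make the idea concrete, here is a minimal, standard-library-only sketch of logistic regression over word counts. It is an illustration under stated assumptions rather than the model built in this notebook: the six training messages, the helper names `tokenize`, `predict_proba` and `fit`, and the learning rate and epoch count are all made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy data for illustration only: 1 = spam, 0 = ham.
TRAIN = [
    ("win a free prize now", 1),
    ("free entry to win cash", 1),
    ("claim your free cash prize", 1),
    ("are we meeting for lunch today", 0),
    ("see you at home tonight", 0),
    ("call me when you get home", 0),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def predict_proba(text, weights, bias):
    """Return P(spam | text) as sigmoid(bias + sum of count * weight)."""
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(word, 0.0) * n for word, n in counts.items())
    return 1.0 / (1.0 + math.exp(-z))


def fit(data, learning_rate=0.5, epochs=200):
    """Learn one weight per word with stochastic gradient descent on log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in data:
            # For log loss, the gradient with respect to z is (p - y).
            error = predict_proba(text, weights, bias) - label
            for word, n in Counter(tokenize(text)).items():
                weights[word] = weights.get(word, 0.0) - learning_rate * error * n
            bias -= learning_rate * error
    return weights, bias


weights, bias = fit(TRAIN)
for message in ["free prize", "see you at lunch"]:
    print(message, round(predict_proba(message, weights, bias), 3))
```

Words never seen during training contribute nothing to the score, so the prediction rests only on the known words in a message. A probability above 0.5 would be read as spam in this sketch.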

### Objective
The objective of this exercise is to develop a machine learning model that classifies an email as spam or not spam. We use the Spam Emails Dataset for Classification and Filtering, downloaded from Kaggle (https://www.kaggle.com/datasets/abdallahwagih/spam-emails).
This dataset contains a collection of emails, categorized into two classes: "Spam" and "Non-Spam" (often referred to as "Ham"). These emails have been carefully curated and labeled to aid in the development of spam email detection models.

In [3]:
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
# Load the dataset
df = pd.read_csv("../data/dataset.csv")
# Show the first five rows of the dataset
df.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


The dataset includes the following features:
Category: Categorizes each email as either "Spam" or "Ham" (Non-Spam).
Message: The content of the email, including the subject line and message body.
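
Logistic regression expects a numeric target, so the text labels in `Category` eventually need to become numbers. The snippet below is one possible way to do that, shown on a small stand-in frame built from rows of the preview above. The `label` column name and the 0/1 coding are assumptions made for illustration.

```python
import pandas as pd

# Stand-in frame using three rows from the preview (messages truncated as displayed).
sample = pd.DataFrame(
    {
        "Category": ["ham", "ham", "spam"],
        "Message": [
            "Go until jurong point, crazy.. Available only ...",
            "Ok lar... Joking wif u oni...",
            "Free entry in 2 a wkly comp to win FA Cup fina...",
        ],
    }
)

# Map the text classes to integers: ham -> 0, spam -> 1 (hypothetical coding).
sample["label"] = sample["Category"].map({"ham": 0, "spam": 1})
print(sample[["Category", "label"]])
```

Any value in `Category` other than `ham` or `spam` would become `NaN` under this mapping, which makes unexpected labels easy to spot.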

### Data Exploration and Cleaning
Next, we will have a look at the dataset and clean it.

In [8]:
# Check the data shape
df.shape

(5572, 2)

The dataset has 5,572 rows and 2 columns as indicated above.

In [9]:
# Check the data types and missing values
df.info()  # prints its summary directly and returns None, so print() is not needed
print(df.isnull().sum())  # Count missing values per column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
Category    0
Message     0
dtype: int64


Both columns have the object data type, and neither contains missing values.
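
One caveat: `isnull()` only flags true missing values, so an empty or whitespace-only string would still count as present. The sketch below shows one way such rows could be counted. It runs on a made-up three-row frame (not the real dataset), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up frame for illustration: the second message is whitespace only.
sample = pd.DataFrame(
    {
        "Category": ["ham", "spam", "ham"],
        "Message": ["Ok lar...", "   ", "See you later"],
    }
)

# isnull() reports no missing values here, because "   " is a valid string.
missing_count = int(sample["Message"].isnull().sum())

# Stripping whitespace and comparing with "" finds the blank message.
blank_count = int(sample["Message"].str.strip().eq("").sum())

print("missing:", missing_count, "blank:", blank_count)
```

The same two expressions could be applied to `df["Message"]` to check whether the real data contains any blank messages.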

In [10]:
# Count the messages in each class (spam vs. ham)
print(df['Category'].value_counts())


Category
ham     4825
spam     747
Name: count, dtype: int64


There are 4,825 non-spam (ham) emails and 747 spam emails in our dataset.
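
The `value_counts()` output above (4,825 ham, 747 spam) shows that the classes are imbalanced. The short calculation below turns those counts into class shares and into the accuracy a trivial "always predict ham" rule would reach. The variable names are illustrative, and the counts are copied from the output rather than read from the file.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print(f"total messages: {total}")
print(f"spam share: {shares['spam']:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Spam makes up roughly 13% of the messages, so accuracy alone can look good even for a model that never flags spam. That is a reason to also look at precision and recall when evaluating the classifier.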