<a href="https://colab.research.google.com/github/PeshalaPerera/E-mail-and-spam-filtering/blob/main/Cw1_w1810821_PeshalaPerera.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Part A – Application area review**

## E-mail and spam filtering



---



Email correspondence is now a necessary component of our everyday lives in both the personal and professional domains. But the growth of spam presents a serious threat to email systems' security and effectiveness. The incorporation of Artificial Intelligence (AI) has been instrumental in augmenting the efficacy of email filtering systems over time. This literature review explores the evolution of AI applications within e-mail and spam filtering, highlighting key advancements, challenges, and future directions.


Early attempts at e-mail filtering relied on rule-based systems that employed predefined criteria to identify and block spam. These systems, however, were not very flexible and found it difficult to keep up with the way spam was changing so quickly. An important change came with the introduction of machine learning (ML), which allowed systems to pick up on and adjust to new spam patterns without the need for explicit programming.
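
To make the rule-based idea concrete, here is a minimal sketch of a filter built from predefined criteria. The three rules and the function names are hypothetical and chosen only for illustration; they are not taken from any real filter.

```python
import re

# Hypothetical rules for illustration only; real filters used far larger rule sets.
RULES = [
    ("contains 'free entry'", re.compile(r"\bfree entry\b", re.IGNORECASE)),
    ("contains 'winner'", re.compile(r"\bwinner\b", re.IGNORECASE)),
    ("three or more '!' in a row", re.compile(r"!{3,}")),
]

def matched_rules(message):
    """Return the names of all rules that fire on the message."""
    return [name for name, pattern in RULES if pattern.search(message)]

def is_spam_by_rules(message):
    """Flag a message as spam if at least one rule fires."""
    return bool(matched_rules(message))

print(is_spam_by_rules("Free entry in 2 a wkly comp"))  # a rule fires
print(is_spam_by_rules("Fr33 entry in 2 a wkly comp"))  # a small spelling change evades every rule
```

The second call shows the inflexibility described above: a trivial obfuscation slips past fixed rules, and someone has to write a new rule by hand to catch it.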


Email filtering has made extensive use of machine learning algorithms, especially supervised learning techniques. Large datasets of labeled emails are analyzed by these algorithms, which then use characteristics like content, sender information, and user behavior to distinguish between legitimate and spam emails. Naive Bayes, Support Vector Machines, and, more recently, deep learning techniques like neural networks are examples of notable algorithms.

The success of ML-based e-mail filtering systems relies heavily on the extraction and selection of relevant features. Word frequencies, sender reputation, and email structure are examples of features. By determining which features are the most informative, feature selection methods like Information Gain and Recursive Feature Elimination aid in improving the performance of the model.
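
As an illustration of feature selection, the sketch below computes Information Gain for a single word, treating "the word occurs in the message" as a binary feature. The toy messages, the labels, and the whitespace tokenization are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(messages, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a message."""
    with_word = [y for m, y in zip(messages, labels) if word in m.lower().split()]
    without_word = [y for m, y in zip(messages, labels) if word not in m.lower().split()]
    gain = entropy(labels)
    for subset in (with_word, without_word):
        if subset:
            # Subtract each subset's entropy, weighted by the subset's share of the data.
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical toy data: "free" separates the classes, "now" does not.
messages = ["free prize now", "call me now", "free entry today", "see you later"]
labels = ["spam", "ham", "spam", "ham"]

for word in ("free", "now"):
    print(word, information_gain(messages, labels, word))
```

A word with high gain tells the classifier a lot about the label, so it is a good candidate to keep. A word with gain near zero can usually be dropped.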


Despite the effectiveness of ML-based systems, problems remain. Adversarial attacks are a serious risk: spammers deliberately alter message features to slip past filters. Furthermore, legitimate messages typically far outnumber spam in email datasets, so the datasets are imbalanced. Models trained on them can become biased toward the majority (legitimate) class and let spam through as false negatives. Ongoing research aims to develop more resilient and adaptable filtering mechanisms to address these issues.


Deep learning, specifically convolutional neural networks (CNNs) and recurrent neural networks (RNNs), is increasingly used in email filtering. CNNs are good at extracting local features, such as short runs of adjacent words or characters, and RNNs are good at identifying sequential patterns in email content. Deep learning integration has shown promise in raising spam detection accuracy and lowering false positives.


Unsupervised learning techniques such as clustering and anomaly detection offer alternative approaches to e-mail filtering. Clustering algorithms group emails by similarity, while anomaly detection flags departures from typical patterns. These techniques are especially helpful for identifying previously unseen spam patterns, which makes email filtering systems more flexible.


Recent advances in AI go beyond content analysis to include user-centric methods. Behavioral analysis looks at past user behavior, including how users interact with emails, to find anomalies that could be signs of spam. This personalized approach improves accuracy by tailoring the filtering mechanisms to each user's preferences.


Hybrid approaches, which integrate the advantages of various AI techniques, have become popular in email filtering. A more complete and flexible solution can be achieved by combining rule-based systems with deep learning and machine learning models. The effectiveness of rule-based filters for well-known threats is leveraged by hybrid systems, while the adaptability of ML and deep learning models is utilized for newly emerging spam patterns.


There are ethical questions raised by the increasing use of AI in email filtering, especially with regard to user privacy. Analyzing user activity and content could violate people's right to privacy. For researchers and practitioners, finding a balance between efficient spam filtering and user privacy continues to be a significant challenge.

The development of AI promises new directions for spam and email filtering. Integrating explainable AI (XAI) with advances in natural language processing (NLP) can make filtering models more interpretable and offer insight into how they reach decisions. Furthermore, research into decentralized and federated learning models seeks to address privacy concerns by enabling email clients to learn locally without exposing sensitive data.


The use of AI in spam and email filtering has advanced significantly over the years, moving from simple rule-based systems to complex deep learning models. Adversarial attacks and privacy concerns remain ongoing challenges, and both are active areas of research. Further innovation is anticipated, with AI technologies serving as a key component in the development of more precise, adaptive, and privacy-conscious email filtering systems.



# **Part B – Compare and evaluate AI techniques**

Emails are used in practically every industry these days, including education and business. There are two subcategories of emails, spam and ham. Email spam is a type of email that can be used to harm any user by stealing important information, wasting computer resources, and wasting time. It is also known as junk email or unwanted email.

Every day, the percentage of spam emails is rising quickly. Detecting and filtering spam is one of the biggest issues facing email and Internet of Things service providers today. Email filtering is one of the most important and well-known methods among all the strategies created for identifying and stopping spam. For this, a variety of deep learning and machine learning approaches have been employed.

Naive Bayes, decision trees, and neural networks are among them.


## **Naive Bayes**

---



## Goal:

> The main goal of employing Naive Bayes for email and spam filtering is to effectively and precisely categorize incoming emails as either legitimate or spam by using a predetermined set of features.

## Objective:
> Determine whether an email is more likely to belong in the legitimate or spam class by applying the feature independence assumption. Making classifications quickly and accurately while utilizing the least amount of computational power is the aim.

## Strengths:

*   **Efficiency:** Naive Bayes is suited to real-time applications because it is computationally efficient and uses few resources.

*   **Easy Interpretation and Implementation:** Its simplicity makes interpretation and implementation simple.

*   **Robust to Irrelevant Features:** Because Naive Bayes assumes feature independence, it can handle irrelevant features with robustness.


## Weaknesses:
*   **Assumption of Independence:** In practical situations, the assumption of feature independence might not hold true, which could have an impact on prediction accuracy.

*   **Limited Expressiveness:** Naive Bayes may not be able to handle intricate feature relationships.

## Use Case (Application Example):
> Naive Bayes can be used to filter emails by examining the sender's details, the frequency of words in the message, and other characteristics. For example, the Naive Bayes algorithm can identify an email as spam if it contains words that are commonly linked to spam.

## Input Data:
> To use Naive Bayes, a dataset containing labeled examples is necessary, along with features like sender information, word frequencies, and potentially metadata.

## Expected Output:
> The result is a probability score that shows how likely it is that an email is spam or not.
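
To show how the independence assumption turns word counts into a spam probability, here is a from-scratch sketch of a Naive Bayes classifier. The toy messages, the whitespace tokenization, and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Collect class counts and per-class word counts from labeled messages."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for message, label in zip(messages, labels):
        word_counts[label].update(message.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary

def spam_probability(message, model):
    """P(spam | message) under the independence assumption, with add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        # Start from the log prior, then add one log likelihood per known word.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in message.lower().split():
            if word in vocabulary:
                score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the per-class joint scores into a posterior probability.
    peak = max(log_scores.values())
    exp_scores = {label: math.exp(s - peak) for label, s in log_scores.items()}
    return exp_scores["spam"] / sum(exp_scores.values())

# Hypothetical toy training data.
messages = ["win a free prize", "free entry to win", "are you free tonight", "call me later"]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)

print(spam_probability("win a prize", model))      # above 0.5: words seen mostly in spam
print(spam_probability("call me tonight", model))  # below 0.5: words seen only in ham
```

Each word contributes its own factor to the score regardless of the other words in the message. That is the independence assumption, and it is also why training and prediction are so cheap.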


## **Decision Trees**

---




## Goal:

> The goal of using decision trees in spam and email filtering is to build an efficient and comprehensible model that can categorize emails according to a hierarchical set of decision rules.

## Objective:
>  Create a tree structure that recursively divides the data into subsets and guides the classification process with features like sender information, content attributes, and metadata. Establishing a transparent and comprehensible process for determining whether an email is spam or legitimate is the objective.

## Strengths:

*   **Interpretability:** Decision trees are very interpretable, making it possible for users to comprehend the process of making decisions.

*   **Handling Non-linearity:** They can model intricate, nonlinear relationships in the data.


## Weaknesses:
*   **Overfitting:** Decision trees are prone to overfitting, which can lead to inadequate generalization, particularly with deep trees.

*   **Instability:** Subtle modifications to the data can result in notably distinct tree structures.

## Use Case (Application Example):
> Decision trees are a useful tool for email filtering because they can evaluate attributes like sender reputation, content characteristics, and keyword presence. The decision nodes in the tree represent the criteria that determine whether an item is considered legitimate or spam.

## Input Data:
> Labeled training data with attributes such as sender information, content attributes, and past user interactions are necessary for decision trees.

## Expected Output:
> The result is a binary classification that shows if an email is considered spam or not.


## **Neural Networks**

---




## Goal:

> Leveraging neural networks' ability to recognize intricate patterns and relationships within sizable and varied datasets is the main goal of using them for spam and email filtering.

## Objective:
>  Learn to automatically extract pertinent features from raw data, like email content, sender behavior, and structural information, using a neural network, such as a convolutional neural network (CNN) or recurrent neural network (RNN). The objective is to create a model with high classification accuracy for both known and unknown threats that can adjust to changing spam patterns.

## Strengths:

*   **Learning Complex Patterns:** Neural networks are particularly good at deciphering complex relationships and patterns in data.

*   **Adaptability:** They are appropriate for dynamic spam filtering because they can adjust to evolving patterns over time.

*   **Feature Extraction:** From unprocessed data, neural networks automatically identify pertinent features.

## Weaknesses:
*   **Computational Intensity:** Deep neural network training can require a lot of processing power.

*   **Black-Box Nature:** It can be difficult to understand how complex neural networks make decisions.

## Use Case (Application Example):
> Neural networks are capable of analyzing an email's entire content, taking sender behavior, structural information, and word embeddings into account. To detect disguised spam, for example, a recurrent neural network (RNN) can identify sequential patterns in the content.

## Input Data:
> Large labeled datasets containing raw input data, like email content, sender information, and user behavior, are necessary for neural networks to function.

## Expected Output:
> The result is a binary classification or probability indicating how likely it is that an email is spam.
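
As a rough illustration of the input and output described above, the sketch below trains a small feed-forward network with scikit-learn's 'MLPClassifier'. This is a deliberately simple stand-in for the CNN or RNN architectures discussed. The toy messages, the layer size, and the solver settings are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: 1 = spam, 0 = ham.
messages = [
    "free prize now", "free entry today", "claim free gift",
    "call me now", "see you today", "lunch at noon",
]
labels = [1, 1, 1, 0, 0, 0]

# Bag-of-words counts stand in for the richer inputs (embeddings, metadata) a real system would use.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(messages)

# One small hidden layer; lbfgs is a reasonable solver for very small datasets.
network = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs", max_iter=500, random_state=0)
network.fit(features, labels)

# predict_proba returns one row per message with columns [P(ham), P(spam)].
probabilities = network.predict_proba(vectorizer.transform(["free gift today"]))
print(probabilities[0][1])
```

Unlike the decision tree, the learned weights cannot be read off as rules, which is the black-box weakness noted above.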


# **Part C – Implementation**

## High-level diagram


---


## Dependencies

---





*   Import necessary libraries for data manipulation, numerical operations, and machine learning.

*   Import modules for file uploads and handling in Google Colab.


*   Import modules for machine learning tasks, including text processing and Naive Bayes classifier.



In [38]:
#import packages
import pandas as pd
import numpy as np
from google.colab import files
import io

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


## Loading dataset

---



*   Use the 'files.upload()' method to interactively upload files in Google Colab.

*   This allows you to select and upload files from your local machine to the Colab environment.

*   'files.upload()' returns a dictionary that maps each uploaded filename to its contents as bytes. Here it is stored in the 'data' variable, making the files accessible for further processing.


In [40]:
#upload data
data = files.upload()

Saving spam.csv to spam.csv




*   Read the CSV file named "spam.csv" from the 'data' dictionary using Pandas and io.BytesIO.

*   Specify the encoding as 'ISO-8859-1' to ensure proper interpretation of the file.

*   The resulting DataFrame, 'mails', now contains the data from the CSV file for further analysis.



In [60]:
mails = pd.read_csv(io.BytesIO(data["spam.csv"]), encoding='ISO-8859-1')
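
The same loading pattern can be tried outside Colab with an in-memory file. The sketch below is only an illustration: the 'uploaded' dictionary and the two sample rows are hypothetical stand-ins for what 'files.upload()' returns.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dictionary returned by files.upload():
# one filename mapped to the file's raw bytes, encoded as ISO-8859-1.
csv_text = "v1,v2\nham,Caf\u00e9 at 5?\nspam,WIN \u00a3100 now\n"
uploaded = {"sample.csv": csv_text.encode("ISO-8859-1")}

# Bytes such as 0xE9 (e-acute) and 0xA3 (pound sign) are not valid UTF-8 on their own,
# so decoding these bytes as UTF-8 fails.
try:
    uploaded["sample.csv"].decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Wrapping the bytes in BytesIO gives pandas a file-like object to read from.
sample = pd.read_csv(io.BytesIO(uploaded["sample.csv"]), encoding="ISO-8859-1")
print(utf8_ok)
print(sample)
```

Passing the encoding explicitly matters whenever a file contains such bytes, because 'pd.read_csv' assumes UTF-8 by default.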



*   Print the first few rows of the DataFrame.



In [61]:
#inspect data
mails.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


## Preparing the dataset.

---



The data needs to be prepared in a specific format, and as a first step, we drop the unnamed columns.

In [62]:
mails.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace = True)

In [63]:
mails.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Additionally, we can appropriately rename these columns.

In [64]:
mails.rename(columns = {'v1': 'Category', 'v2': 'Message'}, inplace = True)

In [65]:
mails.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Distribution Insights: Comparing Spam and Non-Spam Email

Use the pandas 'groupby' method to group the rows of the DataFrame 'mails' by the 'Category' column, then generate descriptive statistics for each group.

In [68]:
mails.groupby('Category').describe()

| Category | Message (count) | Message (unique) | Message (top) | Message (freq) |
|---|---|---|---|---|
| ham | 4825 | 4516 | Sorry, I'll call later | 30 |
| spam | 747 | 653 | Please call our customer service representativ... | 4 |


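This summary shows how the messages are distributed between the spam and non-spam categories. Because 'Message' is a text column, 'describe()' reports counts and the most frequent value rather than numeric statistics.


*   Within each category, 'count' is the total number of messages, 'unique' is the number of distinct messages, 'top' is the most frequent message, and 'freq' is how many times that message occurs.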


*   For 'ham' (non-spam) messages:

        Count: 4825
        Unique messages: 4516

*   For 'spam' messages:

        Count: 747
        Unique messages: 653
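
The counts above also show how imbalanced the two classes are. The short calculation below uses only the numbers from the summary table; the variable names are illustrative.

```python
# Counts taken from the groupby summary above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
# Accuracy of a trivial baseline that labels every message as ham.
always_ham_accuracy = ham_count / total

print(f"spam share: {spam_share:.1%}")                    # spam share: 13.4%
print(f"always-ham accuracy: {always_ham_accuracy:.1%}")  # always-ham accuracy: 86.6%
```

Because a classifier that never flags anything would already be right about 86.6% of the time, accuracy alone says little here. Precision and recall on the spam class are more informative.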

## Binary Encoding: Creating a 'spam' Column to Represent Spam and Non-Spam Labels

In [69]:
#turn spam/ham into numerical data, creating a new column called  'spam'
mails['spam'] = mails['Category'].apply(lambda x: 1 if x == 'spam' else 0)

In [70]:
mails

| | Category | Message | spam |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | 1 |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | 0 |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | 0 |
| 5570 | ham | The guy did some bitching but I acted like i'd... | 0 |
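
The notebook imports 'train_test_split', 'CountVectorizer', and 'MultinomialNB' but does not use them yet. The sketch below shows one plausible way they could fit together. It runs on a tiny hand-made stand-in for 'mails', so the messages, the split ratio, and the random seed are all assumptions, not results from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical stand-in for the `mails` DataFrame (same column names).
toy_mails = pd.DataFrame({
    "Message": [
        "free prize claim now", "win free entry today", "urgent free cash offer",
        "claim your free gift", "are we meeting for lunch", "call me when you are home",
        "see you at the office", "thanks for the notes",
    ],
    "spam": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Hold out a quarter of the rows, keeping the spam/ham ratio equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    toy_mails["Message"], toy_mails["spam"],
    test_size=0.25, random_state=42, stratify=toy_mails["spam"],
)

# Learn the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

model = MultinomialNB()
model.fit(X_train_counts, y_train)

accuracy = model.score(X_test_counts, y_test)
new_message_spam_probability = model.predict_proba(
    vectorizer.transform(["free prize waiting"])
)[:, 1]
print(accuracy)
print(new_message_spam_probability)
```

On the real data, the same steps would take 'mails["Message"]' and 'mails["spam"]' in place of the toy columns.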


# **Part D – Testing**

# **Part E – Evaluate results**