# <u>Applied AI Coursework</u>


Name: Safir-Ul-Islam Bukhari  
ID: w1826293

# Video Demonstration Link

https://www.youtube.com/watch?v=rpLm6Jc1qhY

# <u>Application Area Overview</u>

Chosen Problem Domain – Email and spam filtering


# <u>Literature Review – Email and spam filtering</u>

Email spam filtering is a well-established application of artificial intelligence. It aims to distinguish legitimate emails, known as ‘ham’, from unwanted or malicious emails, known as ‘spam’. Over the years, multiple AI techniques have been developed and tailored to combat spam. This review explores several AI methods that can be applied to email and spam filtering.

# Early Approaches to spam filtering

Before AI was widely adopted, spam filtering relied on rule-based systems that used predefined patterns and keywords to identify spam. These systems were simple to build, but for the same reason they were easy to work around, especially as spam tactics kept evolving. This limitation showed that a more adaptive solution, such as machine learning, was needed.

# Machine learning techniques

Machine learning introduced adaptability into spam filtering by letting models learn from data the patterns that indicate an email is spam. Key machine learning techniques used in this domain include:  
<u>Naïve Bayes Classifier:</u> This model calculates the probability of an email being spam based on word occurrences. It assumes that all features of the email are ‘conditionally independent’ given the class label. In other words, once the class is known, the presence of a word such as ‘free’ is treated as telling us nothing about whether ‘win’ also appears, even though the two often occur together in spam. This assumption makes the model fast to train and use, but it rarely holds for real text, where words influence one another in context. Even so, Naïve Bayes often performs well in practice because it still captures the key word patterns in the data.  
During training, the model is given a labelled dataset in which each email is already tagged as spam or ham. It then estimates, for each class, the probability of each word occurring in an email of that class.  
<u>Decision trees:</u> Decision trees are a popular supervised learning technique used in spam filtering. The model classifies emails by making a series of decisions based on features extracted from the email, such as word frequencies, the presence of certain keywords, or metadata such as sender information.  
It works by repeatedly splitting the dataset on feature values, building a ‘tree’ made up of the following components:  
The Root Node – Represents the entire dataset and splits it into subsets based on the most informative feature.  
Decision Nodes – Represent further decisions on features, splitting the data into smaller subsets.  
Leaf Nodes – Represent the final classification, that is, whether the email is spam or ham.  
The tree is built from a training dataset, and each split is chosen to maximise information gain. A major advantage of decision trees is that they are highly interpretable: the path from root to leaf gives clear reasoning for each classification. They are also flexible, since they can handle many data types. Once built, a decision tree can classify new emails very quickly, which makes it useful for real-time spam filtering.  
Both techniques have their advantages, and either can be used for filtering spam emails. Decision trees are harder to implement but are very transparent and can capture complex relationships between features, whereas Naïve Bayes trains faster and copes well with smaller datasets but does not offer the same transparency.  
Overall, for this problem domain, the size and complexity of the dataset will mainly determine which technique should be applied. Decision trees are better suited to larger, more complex datasets, while Naïve Bayes is better suited to smaller, less complex ones.
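
To make the information-gain criterion mentioned for decision trees concrete, the following is a minimal sketch rather than the method any particular library uses. It measures how much a candidate split reduces entropy. The eight labels and the "contains the word free" split are hypothetical values chosen for illustration.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical labels for 8 emails: 1 = spam, 0 = ham
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical split on "does the email contain the word 'free'?"
contains_free = [1, 1, 1, 0]
no_free = [1, 0, 0, 0]

gain = information_gain(parent, contains_free, no_free)
print(f"Information gain of the split: {gain:.3f}")
```

A split that separated the two classes perfectly would have a gain of 1 bit here, so a tree builder following this criterion would prefer such a feature over the one shown.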

# <u>Compare and evaluate AI techniques</u>

This section compares Naïve Bayes, decision trees and neural networks for filtering spam emails.  
1. <u>Naïve Bayes</u> – Naïve Bayes calculates the likelihood of an email being spam or ham based on word occurrences. It is a supervised learning method and therefore requires a labelled dataset.  
- <u>Strengths</u> – Efficient, so it is useful for real-time filtering.  
- Simple to implement and interpret.
- Performs well with high-dimensional data, meaning the model can handle datasets where each email has a very large number of features.
- <u>Weaknesses</u> – Treating all features of an email as ‘conditionally independent’ means the model cannot fully capture relationships between words in real-world data.  
- Can struggle with rare words or words not seen in the training set.  
- <u>Data Requirements</u> – Works best with smaller or moderately sized datasets.  
- Dataset must be labelled, as this is a supervised learning method.  
- Text must be vectorised before training.  
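
The calculation behind Naïve Bayes, and one common remedy for the rare-word weakness, can be sketched in a few lines. This is a simplified illustration, not the scikit-learn implementation: the four training messages are made up, words are split on whitespace, and add-one (Laplace) smoothing is assumed so that a word never seen in one class does not force that class's probability to zero.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 0 = ham, 1 = spam
training = [
    ("win a free prize now", 1),
    ("free entry win cash", 1),
    ("are we still meeting for lunch", 0),
    ("see you at the meeting", 0),
]


def train(examples):
    """Count words per class and how many messages each class has."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary


def predict(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log-probability for the message."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior: how common the class is overall
        log_score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps the likelihood above zero
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            log_score += math.log(likelihood)
        scores[label] = log_score
    return max(scores, key=scores.get)


model = train(training)
print(predict("free cash prize", *model))  # 1 (spam)
print(predict("lunch meeting", *model))    # 0 (ham)
```

Working in log-probabilities rather than multiplying raw probabilities is a design choice that avoids numerical underflow on long emails.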

---
2. <u>Decision Trees</u> - Decision trees classify emails by making decisions based on features such as word frequencies, metadata or the structure of the email. Each node represents a decision based on a feature, and the leaf node represents the classification result, either spam or ham in this case.  
- <u>Strengths</u> – Very transparent and interpretable. Easy to understand and visualise the decision-making process of the AI.  
- Can handle a wide variety of data types.  
- Good at capturing non-linear relationships between different features.  
- <u>Weaknesses</u> – Prone to overfitting.  
- May struggle with high-dimensional datasets.  
- <u>Data Requirements</u> – Requires well-structured input features.  
- Requires labelled data, as it is a supervised learning method.  

---
3. <u>Neural Networks</u> – Neural networks are loosely inspired by the human brain and learn patterns within the data. A feedforward neural network can be used for spam filtering.  
- <u>Strengths</u> – Capable of modelling complex relationships between features.  
- Performance tends to improve with larger datasets, meaning it scales well.  
- <u>Weaknesses</u> – Computationally intensive, requiring more resources to train than the other models.  
- Can be prone to overfitting if not regularised properly.  
- Not as interpretable as models such as decision trees.  
- <u>Data Requirements</u> – Requires a large dataset to perform well.  
- Pre-processing steps such as vectorisation need to be carried out.



# <u>Implementation</u>  
# <u>High Level Diagram</u>

![Final_Diagram.png](attachment:70456abe-7659-4be4-84c4-ede792e20706.png)

# <u>Type Of Input Data Needed/How it will be prepared</u>

The email data will be taken from Kaggle, a site that hosts datasets shared by its users. This particular dataset is already labelled as spam or ham, so it is quick and easy to use with supervised learning models like Naive Bayes. Here is the link: https://www.kaggle.com/datasets/abdallahwagih/spam-emails  
***
Before training the model, a small number of pre-processing steps need to be carried out.  
Firstly, the labels have to be encoded as numbers for the model to read. Ham will be coded as 0 and spam as 1.  
The messages then have to be vectorised, since machine learning models cannot process raw text. The raw text will be transformed into numerical data with TF-IDF vectorisation, which scores each word by how often it appears in a given email, weighted down by how common the word is across the whole dataset. The code will also limit the vocabulary to the 5000 most frequent terms, which keeps the feature space small and helps reduce the risk of overfitting.  
After that, the data will be split into training and testing sets. The training set, 80% of the data, will be used to train the model. The testing set, the remaining 20%, will be used to evaluate its performance.  
To evaluate the model, a confusion matrix will be used, together with accuracy, precision and recall, to show how accurate the predictions were.
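
As a rough illustration of what TF-IDF does, here is a minimal sketch of the textbook weighting on three made-up messages. It is not what the notebook runs: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalises each row by default, so its numbers will differ from these. The function name `tf_idf` and the sample messages are assumptions for this example.

```python
import math
from collections import Counter

# Hypothetical messages used only for illustration
documents = [
    "free prize win free",
    "meeting at noon",
    "free lunch meeting",
]


def tf_idf(docs):
    """Return one {word: score} dictionary per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        doc_scores = {}
        for word, count in counts.items():
            tf = count / len(tokens)                 # how frequent in this message
            idf = math.log(n_docs / doc_freq[word])  # how rare across all messages
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores


scores = tf_idf(documents)
for word, score in sorted(scores[0].items()):
    print(f"{word}: {score:.3f}")
```

In the first message, "free" occurs twice but also appears in another message, so it ends up with a lower score than "prize", which occurs once and nowhere else. This is the sense in which TF-IDF favours words that are distinctive for a message.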

# 1. Load the dataset

In [6]:
# Loading Library
import pandas as pd

# Load the dataset
dataset = pd.read_csv('Cw1_w1826293_SafirUlIslamBukhari_spam.csv')

# 2. Pre-Process the data

In [8]:
# Loading Library
from sklearn.feature_extraction.text import TfidfVectorizer

# code the labels so that ham is 0 and spam is 1
dataset['Category'] = dataset['Category'].map({'ham': 0, 'spam': 1})

# Use Vectorizer to transform the email messages
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)  # Limit to 5000 features
X = vectorizer.fit_transform(dataset['Message'])
y = dataset['Category']

print("data vectorised")


data vectorised


# 3. Splitting data

In [10]:
# Loading Library
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tells us the sample size of training and testing data
print("Data split")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")


Data split
Training set size: 4457 samples
Testing set size: 1115 samples
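
One caveat with the steps above is that the vectoriser is fitted on every message before the split, so word statistics from the test emails influence the features used in training. A hedged alternative, sketched below on a handful of made-up messages, is to split the raw text first and wrap the vectoriser and classifier in a scikit-learn `Pipeline`, so that TF-IDF is fitted on the training portion only. The `stratify` argument is an addition not used in the original code; it keeps the spam/ham ratio the same in both sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical messages standing in for the real dataset (0 = ham, 1 = spam)
messages = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "urgent prize waiting call now", "cash bonus win today",
    "are we still meeting for lunch", "see you at the office tomorrow",
    "can you send the report", "happy birthday see you tonight",
    "running late for the meeting",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first so the test messages stay unseen
train_msgs, test_msgs, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training messages only, then trains Naive Bayes
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("nb", MultinomialNB()),
])
pipeline.fit(train_msgs, y_train)

# At prediction time the fitted vocabulary is reused; nothing is refitted
predictions = pipeline.predict(test_msgs)
print(f"Test messages: {len(test_msgs)}, predictions: {list(predictions)}")
```

On a dataset of this size the effect on the reported scores is likely to be small, but the pipeline form also makes it harder to forget the vectorisation step when classifying new emails.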


# 4. Training Naive Bayes Model

In [12]:
# Loading Library
from sklearn.naive_bayes import MultinomialNB

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

print("Model Trained")


Model Trained


# 5. Evaluating the model

In [14]:
# Loading Library
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions based on the test set
y_pred = nb_model.predict(X_test)

# Check the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Print the metrics
print(f"Accuracy: {accuracy:.2f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(classification_rep)


Accuracy: 0.98

Confusion Matrix:
[[966   0]
 [ 20 129]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.87      0.93       149

    accuracy                           0.98      1115
   macro avg       0.99      0.93      0.96      1115
weighted avg       0.98      0.98      0.98      1115



# <u>Software Testing</u>

The code above was tested by splitting the data into training and testing sets.  
With this, the model's performance can be evaluated after training by checking how accurately it predicts the held-out test data, using a confusion matrix to summarise the results.  
The expected outcome was that the model would correctly identify spam emails most of the time after training.  
From the evaluation output we can see that the accuracy of the Naive Bayes model is 98%. This shows that the model is very accurate at classifying the test emails as spam or ham.  
It may not reach 100% for a number of reasons. For example, increasing the number of allowed features may increase accuracy further, though it also increases the risk of overfitting and could negatively impact the results.  
Overall, the results show that the model predicts accurately, which aligns with the expected results.

# <u>Evaluate Results</u>

To evaluate the results of the model, a confusion matrix was computed in the code alongside the accuracy score and a classification report. Together these give several metrics for evaluating the model, most importantly its overall accuracy.  
The accuracy of the model was 0.98, meaning it classified 98% of the test emails correctly.  
Treating spam as the positive class, the matrix shows 966 true negatives and 0 false positives, meaning every ham email in the test set was correctly classified and none was flagged as spam.  
There are 20 false negatives, meaning 20 spam emails were misclassified as ham. This shows there is some inaccuracy when classifying spam emails: the model sometimes fails to recognise an email as spam.  
Finally, there are 129 true positives, meaning 129 of the 149 spam emails were correctly identified as spam, a recall of 0.87.  
The inaccuracy with the spam emails could be due to a number of reasons.  
One of these may be class imbalance: there were far more ham emails than spam emails in the dataset. This could leave the model with fewer spam examples to learn from, so it learns the spam class less well.
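
The figures discussed above can be recomputed directly from the confusion matrix, which is a useful check on how each metric is defined. The sketch below hard-codes the matrix printed earlier and treats spam (label 1) as the positive class. The majority-class baseline is an addition for context: it shows the accuracy a filter would get by labelling every email as ham.

```python
# Confusion matrix from the evaluation output: rows are actual, columns are predicted
conf_matrix = [[966, 0],
               [20, 129]]

tn, fp = conf_matrix[0]  # actual ham: predicted ham, predicted spam
fn, tp = conf_matrix[1]  # actual spam: predicted ham, predicted spam
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of emails flagged as spam, how many really were
recall = tp / (tp + fn)      # of real spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting ham, the majority class
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.2f}")           # 0.98
print(f"Precision: {precision:.2f}")          # 1.00
print(f"Recall:    {recall:.2f}")             # 0.87
print(f"F1 score:  {f1:.2f}")                 # 0.93
print(f"Baseline:  {baseline_accuracy:.2f}")  # 0.87
```

Because the baseline is already 0.87, accuracy alone overstates how well the spam class is handled, which is why recall for spam is the more informative figure here.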

# Strength of results within domain

From the evaluation of the model there are a few takeaways.  
Strengths:  
- High overall accuracy with no false positives on the test set, which suggests that important emails are unlikely to be misclassified as spam.
- The overall speed of the model makes it well suited to real-time spam filtering.
- Most spam emails are detected and correctly classified (87% recall on the test set).

Weaknesses:
- Spam recall is not 100%, meaning users are not fully protected from spam emails.
- Naive Bayes could struggle to adapt to more sophisticated emails. As spam tactics evolve, it may be hard for the model to keep up.

# Conclusion

Overall, the evaluation shows that the Naive Bayes model is effective at distinguishing spam from ham emails. In this particular case it is more effective at classifying ham, which may be due to the dataset imbalance among other factors.  
Improvements are needed to its spam detection so that fewer spam emails leak into the main inbox. However, given the speed of this model, it is still suitable for real-time spam detection, with minimal risk of mistakenly classifying legitimate and important ham emails as spam while also providing adequate spam detection.

# <u>References</u>

Blanzieri, E., Bryl, A. A survey of learning-based techniques of email spam filtering.  

Al-Mailem, M.A. and Al-Azmi, M.A., (2018). Comparison of decision tree algorithms for spam e-mail filtering. Proceedings of the 2018 International Conference on Computer and Applications  

https://www.kaggle.com/datasets/abdallahwagih/spam-emails - Dataset used  

Ajani, T. and Ferrante, T. (2024). Cyber-analytics: An Examination of Machine Learning Algorithms for Spam Filtering.  

ArXiv (2010). Modeling Spammer Behavior: Naïve Bayes vs. Artificial Neural Networks.  

Bhatnagar, P. and Degadwala, S. (2023). A Comprehensive Review on Email Spam Classification with Machine Learning Methods  

Cui, J. and Li, X. (2022). Content Based Spam Email Classification using Supervised SVM, Decision Trees and Naive Bayes.