In [3]:
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

# COGS 118A - Project Checkpoint

# Names

- Cristian Antonio-Hernandez
- Rahul Puranam
- Ricardo Sedano
- Jason Shao

# Abstract 
Our goal is to develop an effective email spam detection system. We will be using a dataset consisting of labeled emails, where each email is classified as either spam or non-spam. The data represents various features extracted from the emails, such as the presence of certain keywords, email header information, and textual content. We will employ machine learning algorithms to train a model on this labeled data to accurately classify incoming emails as spam or non-spam. The performance of the system will be measured by evaluating its ability to correctly classify a new set of emails, using metrics such as accuracy, precision, recall, and F1-score. Our objective is to achieve high accuracy and minimize false positives (legitimate emails classified as spam) and false negatives (spam emails classified as legitimate), ensuring an efficient and reliable spam detection system.

# Background

The field of email spam detection has been extensively researched, with significant prior work paving the way for advancements in this area. Various techniques and approaches have been explored to tackle the problem of email classification, aiming to accurately distinguish between spam and non-spam emails.

Researchers have employed machine learning algorithms, such as support vector machines (SVMs), naive Bayes classifiers, and ensemble methods, to build effective spam detection models<a name="solanki"></a>[<sup>[1]</sup>](#solanki)<a name="dada"></a>[<sup>[2]</sup>](#dada). These models utilize features extracted from email content, such as keyword presence, textual analysis, and email header information, to make predictions on the classification of incoming emails.

Feature engineering has been a crucial aspect of this research, as it involves selecting and extracting relevant features that contribute to the identification of spam emails<a name="dada"></a>[<sup>[2]</sup>](#dada). Additionally, researchers have explored the use of natural language processing techniques to analyze the text of emails, including techniques like text tokenization, stemming, and TF-IDF weighting<a name="sahami"></a>[<sup>[3]</sup>](#sahami)<a name="yang"></a>[<sup>[4]</sup>](#yang).

# Problem Statement

Our goal is to develop an effective email spam classification system by leveraging features such as keyword presence, textual analysis, and email header information. This research aims to address the following questions:

- How can we accurately predict whether an email is spam or not spam based on its content, utilizing features such as keyword presence, textual analysis, and email header information?

- Which machine learning algorithms, such as Support Vector Machines, Decision Trees, Naive Bayes classifiers, or ensemble methods like Random Forest, are most suitable for solving this email spam classification problem?

We will collect a dataset consisting of a representative sample of emails, including both spam and non-spam examples. The dataset will be preprocessed to extract relevant features and labeled accordingly. We will then train and evaluate different machine learning models using appropriate performance metrics, such as accuracy.

By conducting this research, we aim to develop a robust and reliable email spam classification system that can effectively differentiate between spam and non-spam emails, while minimizing false positives and false negatives.

# Data

The dataset selected for the project is the UCI Machine Learning Repository's 'Spambase Data Set', which comprises 4601 observations with 57 continuous variables and 1 nominal variable. Each observation consists of the percentage word frequency for 48 words, the percentage character frequency for 6 special characters, the average length of uninterrupted sequences of capital letters, the length of the longest uninterrupted sequence of capital letters, the total number of capital letters in the e-mail, and a nominal indicator of whether the email is spam or not.
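To make the feature definitions above concrete, here is a minimal sketch of how Spambase-style features could be computed from raw email text. It is not the code used to build the dataset: the function name `spambase_style_features`, the short word and character lists, and the example email are all hypothetical, and the tokenization rule (a word is a run of alphanumeric characters) is our assumption based on the dataset description.

```python
import re

def spambase_style_features(text, words=("free", "money"), chars=("!", "$")):
    """Compute Spambase-style features for one email body (illustrative only)."""
    # Assumption: a "word" is a maximal run of alphanumeric characters.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    n_tokens = max(len(tokens), 1)   # avoid division by zero on empty text
    n_chars = max(len(text), 1)

    features = {}
    for w in words:
        # Percentage of words in the email that match w.
        features[f"word_freq_{w}"] = 100.0 * tokens.count(w) / n_tokens
    for c in chars:
        # Percentage of characters in the email that match c.
        features[f"char_freq_{c}"] = 100.0 * text.count(c) / n_chars

    # Lengths of uninterrupted sequences of capital letters.
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    features["capital_run_length_average"] = sum(runs) / len(runs) if runs else 0.0
    features["capital_run_length_longest"] = max(runs, default=0)
    features["capital_run_length_total"] = sum(runs)
    return features

# Hypothetical email text, for illustration only.
example = spambase_style_features("FREE money!! Claim your free prize NOW")
print(example)
```

The real dataset uses 48 words and 6 characters rather than the two of each shown here; the structure of the computation would be the same.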

Because our project deals with a classification problem, the critical variables are the ones most predictive of the spam label, and all of them are represented as continuous numeric values. For this dataset, we hypothesize that the critical variables are tied to special character, word, and capital letter usage.

Beyond obtaining the data, our team handled it no more than necessary in order to preserve its integrity and reduce bias throughout; the dataset needed very little cleaning. The dataset and names file came separately, so to prepare our dataset we assigned a list of names to the `df.columns` attribute, labeling each column with its respective variable.

In order to build an effective classification model and help mitigate overfitting, L2 regularization, pruning, and cross-validation can be valuable processes that slightly transform the data and balance the bias-variance tradeoff.



In [2]:
# Load the raw Spambase data; the file has no header row.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header=None)
# Name the columns after the variables listed in the separate names file.
df.columns = ['make','address','all','3d','our','over','remove','internet','order','mail','receive','will','people','report','addresses','free','business','email','you','credit','your','font','000','money','hp','hpl','george','650','lab','labs','telnet','857','data','415','85','technology','1999','parts','pm','direct','cs','meeting','original','project','re','edu','table','conference',';','(','[','!','$','#','Capital Run Length Average','Capital Run Length Longest','Capital Run Length Total','Spam']
df

Unnamed: 0,make,address,all,3d,our,over,remove,internet,order,mail,...,;,(,[,!,$,#,Capital Run Length Average,Capital Run Length Longest,Capital Run Length Total,Spam
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


# Proposed Solution

To address the email spam classification problem, we decided to use Support Vector Machines (SVMs). SVMs separate data points into different classes by finding an optimal hyperplane that maximizes the margin between the classes. In our case, we aim to create a decision boundary that accurately distinguishes between spam and non-spam emails. By constructing a margin around this decision boundary, we can achieve better generalization and potentially reduce both false positives and false negatives. We will implement gradient descent to minimize the loss, and we may use kernels to map our data into a higher-dimensional space. Kernels can help an SVM find a linearly separable representation of the data even if the original features are not linearly separable. By applying appropriate kernels, we can potentially improve the separation between spam and non-spam emails, enhancing the accuracy of our classification model.

To evaluate the performance of our solution, we will split our dataset into training and testing sets. During training, we will fit the SVM model on the labeled training data and optimize the hyperparameters using techniques like cross-validation. Then, we will use the testing dataset to measure the accuracy of our classifier. Accuracy provides an overall indication of how well the model performs in terms of correct predictions. We will also use the F1 score, which considers both precision and recall, to assess the model's performance specifically in terms of false positives and false negatives.
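
The workflow described above (train/test split, cross-validated hyperparameter search, then accuracy and F1 on the held-out set) could be sketched as follows. This is an illustration under stated assumptions, not our final model: the data is a synthetic stand-in generated with `make_classification`, and the grid values for `C` and `kernel` are placeholders. Note also that scikit-learn's `SVC` is fit by a libsvm-based solver rather than by gradient descent, so this sketch covers the evaluation workflow only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 57 features (NOT the real Spambase data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=57, n_informative=10, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)

# Scale features before the SVM so no single feature dominates the margin.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Placeholder grid: C controls regularization strength, kernel the feature map.
param_grid = {"svm__kernel": ["linear", "rbf"], "svm__C": [0.1, 1, 10]}

# 5-fold cross-validation on the training set only, selecting by F1.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

# Evaluate the refit best model once on the held-out test set.
y_pred = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)
test_f1 = f1_score(y_te, y_pred)
print(search.best_params_, test_accuracy, test_f1)
```

Keeping the scaler inside the pipeline means it is re-fit on each training fold, so no information from the validation folds leaks into preprocessing.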

# Evaluation Metrics

One evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model is the F1-score, $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, which accounts for both precision and recall. Precision represents the ability of the model to correctly identify spam emails among the emails it classifies as spam, and recall measures the ability of the model to identify all the actual spam emails.
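
As a worked example of the formula, the following sketch computes precision, recall, and F1 from confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration; they are not results from our model.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 80 spam emails caught, 10 legitimate emails flagged
# as spam (false positives), and 20 spam emails missed (false negatives).
precision, recall, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these counts, precision is 80/90, recall is 80/100, and F1 works out to 16/19, which sits between the two and closer to the lower value.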

# Preliminary results



In [5]:
x = df.drop(columns=['Spam'])
y = df['Spam']
X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size=0.25, random_state=101)
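
One possible next step after a split like the one above is to characterize a base model against a trivial benchmark. The sketch below compares a majority-class `DummyClassifier` with a default `SVC` using F1. It is an assumption about how such a comparison could look, not a result we have obtained: the data is synthetic (generated with `make_classification`), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in data (NOT the real Spambase data).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=57, n_informative=10,
    weights=[0.6, 0.4], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=101
)

models = {
    # Always predicts the most frequent training class.
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    # Default hyperparameters, with feature scaling.
    "default SVC": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # zero_division=0 avoids a warning when a model predicts no positives.
    scores[name] = f1_score(y_te, model.predict(X_te), zero_division=0)

print(scores)
```

A benchmark like this gives the F1-score a reference point: a model is only useful if it clearly outperforms the trivial predictor.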

# Ethics & Privacy

The most obvious ethical concern is access to emails, which for the average recipient may include personal information such as a residential address, close contacts, and calendar events. Going back a step, people may not even want their email address to be leaked, so using this data may raise concern. As for the message contents, a detection service would have to scan a large portion of each email, or every line of text, to properly make a decision. Having an algorithm read every email automatically could also cause concern.

To address these issues, a step toward the correct solution would be to never store, and to automatically discard, all text that the model takes in. From a user standpoint, it may also be helpful to let the user restrict the model to filtering spam only from senders outside of the recipient's most frequent contacts, or to turn off the spam detection completely if the user feels that the detection service is too invasive or unethical.

# Team Expectations 

* *Team members will actively contribute to group discussions and decision-making processes.*
* *Any questions or concerns should be brought up in a timely manner, so that they can be addressed as soon as possible.*
* *All work will be submitted on time and meet the project's standards and guidelines.*
* *Regular meeting times should be scheduled in advance to check progress and ensure the project is on task.*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/16  |  3 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 5/17  | Before 11:59 PM  | Edit, finalize, and submit proposal; Search / finalize datasets; Wrangling/EDA | Discuss Wrangling and possible analytical approaches|
| 5/31  | Before 11:59 PM  | Begin programming for project; Complete Project Checkpoint; Preliminary analysis | Discuss/edit project code and analysis |
| 6/10  | 12 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 6/14  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="solanki"></a>1.[^](#solanki): Solanki, Rohit Kumar, et al. Spam Filtering Using Hybrid Local-Global Naive Bayes Classifier, 28 Sept. 2015, www.semanticscholar.org/paper/Spam-filtering-using-hybrid-local-global-Naive-Solanki-Verma/978a7972210e2d771dac6f92e17594100ea1a8a6. <br> 
<a name="dada"></a>2.[^](#dada): Dada, Emmanuel Gbenga, et al. “Machine Learning for Email Spam Filtering: Review, Approaches and Open Research Problems.” Heliyon, vol. 5, no. 6, June 2019, p. e01802, https://doi.org/10.1016/j.heliyon.2019.e01802. <br>
<a name="sahami"></a>3.[^](#sahami): Sahami, Mehran, et al. A Bayesian Approach to Filtering Junk E-Mail. http://erichorvitz.com/ftp/junkfilter.pdf. <br>
<a name="yang"></a>4.[^](#yang): Yang, Yiming, and Xin Liu. “A Re-Examination of Text Categorization Methods.” Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’99, 1999, https://doi.org/10.1145/312624.312647. <br>
<a name="uci"></a>5.[^](#uci): “Spambase Data Set.” UCI Machine Learning Repository Archives, https://archive.ics.uci.edu/ml/datasets/spambase. <br>