<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->

<h1 style="text-align: center;">Spam/Ham Classification Part 1 $\rightarrow$ E.D.A. & Feature Engineering</h1>

<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />

## Introduction
**Goal**: Create a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular non-spam) emails.

Part 1 includes the following:

- Feature engineering with text data.
- Using the `sklearn` library to process data and fit models.
- Validating the performance of our model and minimizing overfitting.

Part 1 focuses on initial data analysis, feature engineering, and a first logistic regression model.   
In Part 2 of this project, I build the full spam/ham classifier.  

***Warning*** This is a **real-world** dataset so the emails are actual spam and legitimate emails. As a result, some of the spam emails may be in poor taste or be considered inappropriate.


## Importing Modules

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style = "whitegrid", 
        color_codes = True,
        font_scale = 1.5)

<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />

# The Data

In email classification, our goal is to classify emails as spam or not spam (referred to as "ham") using features generated from the text in the email. The dataset is from [SpamAssassin](https://spamassassin.apache.org/old/publiccorpus/). It consists of email messages and their labels (0 for ham, 1 for spam). The labeled training dataset contains 8,348 examples, and the unlabeled test set contains 1,000 examples.

**Note:** The dataset is from 2004, so the contents of emails might be very different from those in 2024.

Run the following cells to load the data into a `DataFrame`.

The `train` `DataFrame` contains labeled data you will use to train your model. It has four columns:

1. `id`: An identifier for the training example.
1. `subject`: The subject of the email.
1. `email`: The text of the email.
1. `spam`: 1 if the email is spam, 0 if the email is ham (not spam).

The `test` `DataFrame` contains 1,000 unlabeled emails. In Part 2, you will predict labels for these emails and submit your predictions to the autograder for evaluation.

In [2]:
import zipfile

# Loading training and test datasets
with zipfile.ZipFile('../spam_ham_data.zip') as item:
    with item.open("train.csv") as f:
        original_training_data = pd.read_csv(f)
    with item.open("test.csv") as f:
        test = pd.read_csv(f)

In [3]:
# Convert the emails to lowercase as the first step of text processing.
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()

original_training_data.head()

| | id | subject | email | spam |
|---|---|---|---|---|
| 0 | 0 | Subject: A&L Daily to be auctioned in bankrupt... | `url: http://boingboing.net/#85534171\n date: n...` | 0 |
| 1 | 1 | Subject: Wired: "Stronger ties between ISPs an... | `url: http://scriptingnews.userland.com/backiss...` | 0 |
| 2 | 2 | Subject: It's just too small ... | `<html>\n <head>\n </head>\n <body>\n <font siz...` | 1 |
| 3 | 3 | Subject: liberal defnitions\n | `depends on how much over spending vs. how much...` | 0 |
| 4 | 4 | Subject: RE: [ILUG] Newbie seeks advice - Suse... | `hehe sorry but if you hit caps lock twice the ...` | 0 |


<br/>

First, let's check if our data contains any missing values. We have filled in the cell below to print the number of `NaN` values in each column. If there are `NaN` values, we replace them with appropriate filler values (i.e., `NaN` values in the `subject` or `email` columns will be replaced with empty strings). Finally, we print the number of `NaN` values in each column after this modification to verify that there are no `NaN` values left.

**Note:** While there are no `NaN` values in the `spam` column, we should be careful when replacing `NaN` labels. Doing so without consideration may introduce significant bias into our model.

In [4]:
print('Before imputation:')
print(original_training_data.isnull().sum())
original_training_data = original_training_data.fillna('')
print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())

Before imputation:
id         0
subject    6
email      0
spam       0
dtype: int64
------------
After imputation:
id         0
subject    0
email      0
spam       0
dtype: int64
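
The cell above fills every column at once. As a minimal sketch of the more cautious approach described in the note, the snippet below fills only the text columns and leaves the label column alone. The `toy` frame and the `text_cols` list are hypothetical stand-ins, not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "subject": ["Subject: hi", np.nan, "Subject: sale"],
    "email": ["hello", "buy now", np.nan],
    "spam": [0, 1, 1],
})

# Fill only the text columns with empty strings; labels are never imputed.
text_cols = ["subject", "email"]
toy[text_cols] = toy[text_cols].fillna("")

missing_after = toy.isnull().sum()
print(missing_after)
```

Restricting the fill to named columns means a missing label would still show up in the `isnull` count instead of silently becoming an empty string.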


<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />

# Part 1: Initial Analysis

In the cell below, we have printed the text of the `email` field for the first ham and the first spam email in the original training set.

In [5]:
first_ham = original_training_data.loc[original_training_data['spam'] == 0, 'email'].iloc[0]
first_spam = original_training_data.loc[original_training_data['spam'] == 1, 'email'].iloc[0]
print("Ham Email:")
print(first_ham)
print("--------------------------------------------------------------------------------")
print("Spam Email:")
print(first_spam)

Ham Email:
url: http://boingboing.net/#85534171
 date: not supplied
 
 arts and letters daily, a wonderful and dense blog, has folded up its tent due 
 to the bankruptcy of its parent company. a&l daily will be auctioned off by the 
 receivers. link[1] discuss[2] (_thanks, misha!_)
 
 [1] http://www.aldaily.com/
 [2] http://www.quicktopic.com/boing/h/zlfterjnd6jf
 
 

--------------------------------------------------------------------------------
Spam Email:
<html>
 <head>
 </head>
 <body>
 <font size=3d"4"><b> a man endowed with a 7-8" hammer is simply<br>
  better equipped than a man with a 5-6"hammer. <br>
 <br>would you rather have<br>more than enough to get the job done or fall =
 short. it's totally up<br>to you. our methods are guaranteed to increase y=
 our size by 1-3"<br> <a href=3d"http://209.163.187.47/cgi-bin/index.php?10=
 004">come in here and see how</a>
 </body>
 </html>
 
 
 



## Training-Validation Split
The training data we downloaded is all the data we have available for both training models and **validating** the models that we train. We, therefore, need to split the training data into separate training and validation datasets. You will need this **validation data** to assess the performance of your classifier once you are finished training. Note that we set the seed (`random_state`) to 42.

In [6]:
# This creates a 90/10 train-validation split on our labeled data.
from sklearn.model_selection import train_test_split

train, val = train_test_split(original_training_data, test_size = 0.1,
                              random_state = 42)
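
The split above is a plain random split. One optional variation, not used in this project, is to pass `stratify` so that the spam proportion is the same in both pieces. The sketch below illustrates this on a hypothetical toy frame; the names `toy`, `toy_train`, and `toy_val` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 emails, 25% of them spam.
toy = pd.DataFrame({
    "email": [f"message {i}" for i in range(200)],
    "spam": [1] * 50 + [0] * 150,
})

# Same 90/10 split and seed as above, but stratified on the label.
toy_train, toy_val = train_test_split(
    toy, test_size=0.1, random_state=42, stratify=toy["spam"]
)

print(len(toy_train), len(toy_val))
print(toy_train["spam"].mean(), toy_val["spam"].mean())
```

Stratifying matters most when one class is rare, since a small validation set could otherwise end up with very few spam emails.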

<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />

# Part 2: Feature Engineering

We want to take the text of an email and predict whether the email is ham or spam. This is a **binary classification** problem, so we can use logistic regression to train a classifier. Recall that to train a logistic regression model, we need a numeric feature matrix $\mathbb{X}$ and a vector of corresponding binary labels $Y$. Unfortunately, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression.

Each row of $\mathbb{X}$ is an email. Each column of $\mathbb{X}$ contains one feature for all the emails. We'll guide you through creating a simple feature, and you'll create more interesting ones as you try to increase the accuracy of your model.


## 2.a

Create a function `words_in_texts` that takes in a list of interesting words (`words`) and a `Series` of emails (`texts`). Our goal is to check if each word in `words` is contained in the emails in `texts`.

The `words_in_texts` function should output a **2-dimensional `NumPy` array** that contains one row for each email in `texts` and one column for each word in `words`. If the $j$-th word in `words` is present at least once in the $i$-th email in `texts`, the output array should have a value of 1 at the position $(i, j)$. Otherwise, if the $j$-th word is not present in the $i$-th email, the value at $(i, j)$ should be 0.

In Part 2, we will be applying `words_in_texts` to some large datasets, so implementing some form of vectorization (for example, using `NumPy` arrays, `Series.str` functions, etc.) is highly recommended. **You are allowed to use only *one* list comprehension or for loop**, and you should look into how you could combine that with the vectorized functions discussed above. **Do not use a double for loop, or you will run into issues later on in Part 2.**

For example:
```
>>> words_in_texts(['hello', 'bye', 'world'], 
                   pd.Series(['hello', 'hello worldhello']))

array([[1, 0, 0],
       [1, 0, 1]])
```

Importantly, we **do not** calculate the *number of occurrences* of each word; we only check whether the word is present at least *once*. Take a moment to work through the example on your own if need be; understanding what the function does is a critical first step in implementing it.

In [7]:
def words_in_texts(words, texts):
    """
    Args:
        words (list): Words to find.
        texts (Series): Strings to search in.
    
    Returns:
        A 2D NumPy array of 0s and 1s with shape (n, d) where 
        n is the number of texts, and d is the number of words.
    """
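    # regex=False treats each word as a literal substring rather than a regular expression.
    indicator_array = np.array([texts.str.contains(word, regex=False).astype(int) for word in words]).T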
    return indicator_array

In [8]:
# Run this cell to see what your function outputs. Compare the results to the example provided above.
words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello worldhello']))

array([[1, 0, 0],
       [1, 0, 1]])
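
Note that `'world'` is counted as present in `'hello worldhello'` because the function matches substrings, not whole words. For comparison only, here is a sketch of a whole-word variant using regex word boundaries. The function `whole_words_in_texts` is hypothetical and is not what the project uses.

```python
import re

import numpy as np
import pandas as pd


def whole_words_in_texts(words, texts):
    """Hypothetical variant: 1 only if the word appears as a standalone token."""
    return np.array([
        # re.escape keeps any special characters in the word literal;
        # \b requires a word boundary on both sides of the match.
        texts.str.contains(rf"\b{re.escape(word)}\b", regex=True).astype(int)
        for word in words
    ]).T


texts = pd.Series(['hello', 'hello worldhello'])
result = whole_words_in_texts(['hello', 'bye', 'world'], texts)
print(result)
```

With word boundaries, `'world'` is no longer found inside `'worldhello'`, so the last entry of the second row becomes 0. Substring matching is simpler and catches obfuscated spam tokens, which is one reason to keep it as the default here.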

<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />

# Part 3: EDA

We need to identify some features that allow us to distinguish spam emails from ham emails. One idea is to compare the distribution of a single feature in spam emails to the distribution of the same feature in ham emails. Suppose the feature is a binary indicator, such as whether a particular word occurs in the text. In that case, this compares the proportion of spam emails with the word to the proportion of ham emails with the word.

The following plot (created using `sns.barplot`) compares the proportion of emails in each class containing a particular set of words. The bars colored by email class were generated by setting the `hue` parameter of `sns.barplot` to a column containing the class (spam or ham) of each data point. An example of how this class column was created is shown below:

![training conditional proportions](./images/training_conditional_proportions.png)

You can use `DataFrame`'s `.melt` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)) method to "unpivot" a `DataFrame`. See the following code cell for an example.
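
As a rough sketch of the idea, the snippet below melts a small indicator table into long form and computes per-class proportions. The toy data and the column names `type`, `word`, and `present` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: two word-indicator columns plus the class label.
toy = pd.DataFrame({
    "free": [1, 1, 0, 0, 1, 0],
    "memo": [0, 0, 1, 1, 0, 0],
    "type": ["spam", "spam", "ham", "ham", "spam", "ham"],
})

# Unpivot: one row per (email, word) pair, keeping the class column.
long_form = toy.melt(id_vars="type", var_name="word", value_name="present")

# Proportion of emails in each class that contain each word.
proportions = (
    long_form.groupby(["type", "word"])["present"].mean().reset_index()
)
print(proportions)
```

Because `sns.barplot` plots the mean of `y` within each group by default, passing the long-form frame as `sns.barplot(data=long_form, x="word", y="present", hue="type")` should draw these same proportions as bars.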

When the feature is binary, it makes sense to compare its proportions across classes (as in the plot above). Otherwise, if the feature can take on numeric values, we can compare the distributions of these values for different classes. 

<hr style="border: 5px solid hsl(200, 100%, 50%);" />   <!-- bright blue -->
<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />


# Part 4: Basic Classification

Notice that the output of `words_in_texts(words, train['email'])` is a numeric matrix containing features for each email. This means we can use it directly to train a classifier!

Using 5 words that might be useful as features differentiating spam from ham emails and the `train` `DataFrame`, we create two `NumPy` arrays: `X_train` and `Y_train`. `X_train` should be a 2D array of 0s and 1s created using the `words_in_texts` function on all the emails in the training set. `Y_train` should be a vector of the correct labels for each email in the training set.

In [9]:
some_words = ['drug', 'bank', 'prescription', 'memo', 'private']

X_train = words_in_texts(some_words, train['email'])
Y_train = train['spam'].to_numpy()

X_train[:5], Y_train[:5]

(array([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0]]),
 array([0, 0, 0, 0, 0]))


## 4.a

Now that we have matrices, we can build a model with `sklearn`! Using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, train a logistic regression model using `X_train` and `Y_train`. Then, output the model's training accuracy below. You should get an accuracy of around $0.76$.


In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

my_model = LogisticRegression()
my_model.fit(X_train, Y_train)
Y_pred = my_model.predict(X_train)

# using built-in function
score = accuracy_score(Y_train,Y_pred)

# using definition of accuracy
training_accuracy = np.mean(Y_pred == Y_train)

print("Training Accuracy: ", training_accuracy)

Training Accuracy:  0.7576201251164648


<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />
<hr style="border: 5px solid hsl(200, 100%, 50%);" />   <!-- bright blue -->

# Part 5: Evaluating Classifiers  

That doesn't seem too shabby! But the classifier you made above isn't as good as the accuracy would make you believe. First, we are evaluating the accuracy of the model on the training set, which may be a misleading measure. Accuracy on the training set doesn't always translate to accuracy in the real world (on the test set). In future parts of this analysis, we will make use of the data we held out for model validation and comparison.
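
To illustrate the difference between training and held-out accuracy, here is a minimal sketch on synthetic data. The feature matrix, labels, and variable names are all hypothetical; the real project would use `words_in_texts` on the `train` and `val` emails instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical binary features, with labels loosely tied to the first column.
X = rng.integers(0, 2, size=(500, 5))
y = ((X[:, 0] == 1) & (rng.random(500) < 0.8)).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.1, random_state=42
)

model = LogisticRegression().fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not.
train_acc = np.mean(model.predict(X_tr) == y_tr)
val_acc = np.mean(model.predict(X_va) == y_va)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

The validation number is the more honest estimate of how the model would do on new emails, since none of those rows influenced the fitted coefficients.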

Presumably, our classifier will be used for **filtering**, or preventing messages labeled `spam` from reaching someone's inbox. There are two kinds of errors we can make:
- **False positive (FP)**: A ham email gets flagged as spam and filtered out of the inbox.
- **False negative (FN)**: A spam email gets mislabeled as ham and ends up in the inbox.

To be clear, we label spam emails as 1 and ham emails as 0. These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance, leading us to consider more ways of evaluating a classifier in addition to overall accuracy:

**Precision**: Measures the proportion of emails flagged as spam that are actually spam. Mathematically, $\frac{\text{TP}}{\text{TP} + \text{FP}}$.

**Recall**: Measures the proportion  of spam emails that were correctly flagged as spam. Mathematically, $\frac{\text{TP}}{\text{TP} + \text{FN}}$.

**False positive rate**: Measures the proportion  of ham emails that were incorrectly flagged as spam. Mathematically, $\frac{\text{FP}}{\text{FP} + \text{TN}}$.

One quick mnemonic to remember the formulas is that **P**recision involves T**P** and F**P**, while recall involves TP and FN instead. In the final, the reference sheet will also contain the formulas shown above, but you should be able to interpret what they mean and their importance depending on the context.

The below graphic (modified slightly from [Wikipedia](https://en.wikipedia.org/wiki/Precision_and_recall)) may help you understand precision and recall visually:<br />
<center>
<img alt="precision_recall" src="./images/precision_recall.png" width="600px" />
</center>

Note that a True Positive (TP) is a spam email that is classified as spam, and a True Negative (TN) is a ham email that is classified as ham.
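
The three definitions above can be sketched directly from the four confusion counts. The helper `classification_rates` and the toy labels are hypothetical, and returning `nan` for an undefined ratio is one possible convention, not a requirement.

```python
import numpy as np


def classification_rates(y_true, y_pred):
    """Hypothetical helper: precision, recall, and FPR from 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))

    def ratio(num, den):
        # A ratio with a zero denominator is undefined; report NaN.
        return num / den if den > 0 else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
    }


# Toy example: 3 spam and 5 ham emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
rates = classification_rates(y_true, y_pred)
print(rates)
```

In this toy example, one of the two flagged emails is truly spam (precision 1/2), one of three spam emails is caught (recall 1/3), and one of five ham emails is wrongly flagged (false positive rate 1/5).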



## 5.a

Suppose we have a hypothetical classifier called the “zero predictor.” For any inputted email, the zero predictor *always* predicts 0 (it never makes a prediction of 1 for any email). How many false positives and false negatives would this classifier have if it were evaluated on the training set and its results were compared to `Y_train`? Assign `zero_predictor_fp` to the number of false positives and `zero_predictor_fn` to the number of false negatives for the hypothetical zero predictor on the training data.

In [11]:
Y_zero_pred = np.zeros(train.shape[0])

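# A false positive is a ham email (0) predicted as spam (1).
# The zero predictor never predicts 1, so it has no false positives.
zero_predictor_fp = 0
# Every mismatch is a spam email predicted as ham, i.e., a false negative.
zero_predictor_fn = sum(Y_zero_pred != Y_train)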
zero_predictor_fp, zero_predictor_fn

(0, 1918)


## 5.b

What are the accuracy and recall of the zero predictor on the training data? You don't need to use any `sklearn` functions to compute these performance metrics, but they are available.

In [12]:
Y_zero_pred = np.zeros(train.shape[0])
True_pos = 0
zero_predictor_acc = np.mean(Y_zero_pred == Y_train)
zero_predictor_recall = True_pos / (True_pos + zero_predictor_fn)
zero_predictor_acc, zero_predictor_recall

(0.7447091707706642, 0.0)
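
For reference, the same quantities can be obtained from `sklearn.metrics`. This is a small sketch on hypothetical toy labels; `zero_division=0` is passed to `precision_score` because the zero predictor never predicts spam, which leaves precision undefined (0/0).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical toy labels: 3 ham emails and 1 spam email.
y_true = np.array([0, 0, 0, 1])
y_zero = np.zeros_like(y_true)  # the zero predictor's output

acc = accuracy_score(y_true, y_zero)
rec = recall_score(y_true, y_zero)
# Precision is 0/0 here; zero_division=0 returns 0 instead of warning.
prec = precision_score(y_true, y_zero, zero_division=0)

print(acc, rec, prec)
```

The accuracy of the zero predictor is simply the fraction of ham emails, which is why it can look respectable on an imbalanced dataset while catching no spam at all.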


## 5.c

Explain your results in 5.a and 5.b. How did you know what to assign to `zero_predictor_fp`, `zero_predictor_fn`, `zero_predictor_acc`, and `zero_predictor_recall`?

### For 5.a:  

1. `Y_zero_pred = np.zeros(train.shape[0])`:
   - This creates a prediction array of all zeros (i.e., all emails are predicted as ham). Perfect for the zero predictor.

2. `zero_predictor_fp = 0`:
   - Since the zero predictor always predicts 0 (ham), there can be no false positives (no ham email is incorrectly flagged as spam).

3. `zero_predictor_fn = sum(Y_zero_pred != Y_train)`:
   - This counts all cases where the true label (`Y_train`) is 1 (spam), but the prediction is 0 (ham). In this case, it's equal to the total number of spam emails because all are misclassified as ham by the zero predictor.

4. Output `zero_predictor_fp, zero_predictor_fn = (0, 1918)`:
   - FP = 0 (logic explained in 2.)
   - FN = 1918: must be right, as there are 1918 spam emails that the zero predictor fails to classify correctly.


### For 5.b:

1. `Y_zero_pred = np.zeros(train.shape[0])`:
   - This generates an array of zeros (predicting all emails as ham), which aligns with the behavior of a zero predictor.

2. True Positives (`True_pos`):
   - This must be set to 0 because the zero predictor never predicts 1 (spam).

3. Accuracy (`zero_predictor_acc`):
   - `np.mean(Y_zero_pred == Y_train)`:
   - This calculates the proportion of emails that the zero predictor classified correctly.
   - It includes all true negatives (ham emails correctly classified as ham) divided by the total number of emails.

4. Recall (`zero_predictor_recall`):
   - `True_pos / (True_pos + zero_predictor_fn)`:
   - Since `True_pos = 0`, recall will also be 0 due to the formula: $\frac{\text{TP}}{\text{TP} + \text{FN}}$
   - This reflects the zero predictor's inability to identify any spam emails.



---

## 5.d

Compute the precision, recall, and false positive rate of the `LogisticRegression` classifier `my_model` from 4.a.

In [13]:
Y_train_hat =  my_model.predict(X_train)

TP = sum((Y_train_hat == 1) & (Y_train == 1))
TN = sum((Y_train_hat == 0) & (Y_train == 0))
FP = sum((Y_train_hat == 1) & (Y_train == 0))
FN = sum((Y_train_hat == 0) & (Y_train == 1))
logistic_predictor_precision = TP/(FP+TP)
# Recall is TP / (TP + FN): the share of actual spam that was flagged.
logistic_predictor_recall = TP/(TP+FN)
logistic_predictor_fpr = FP/(FP+TN)

print(f"{TP=}, {TN=}, {FP=}, {FN=}\n")
print(f"{logistic_predictor_precision=:.2f} \n{logistic_predictor_recall=:.2f}",
      f"\n{logistic_predictor_fpr=:.2f}")

TP=219, TN=5473, FP=122, FN=1699

logistic_predictor_precision=0.64 
logistic_predictor_recall=0.11 
logistic_predictor_fpr=0.02
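
As a cross-check, the metrics can be recomputed with `sklearn.metrics` from the four counts printed above. The label arrays below are reconstructed from those counts purely for illustration; in the notebook one would pass `Y_train` and `Y_train_hat` directly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Counts taken from the printed output above.
TP, TN, FP, FN = 219, 5473, 122, 1699

# Rebuild label arrays that produce exactly these counts.
y_true = np.concatenate(
    [np.ones(TP), np.zeros(TN), np.zeros(FP), np.ones(FN)]
).astype(int)
y_pred = np.concatenate(
    [np.ones(TP), np.zeros(TN), np.ones(FP), np.zeros(FN)]
).astype(int)

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
fpr = fp / (fp + tn)

print(f"{precision=:.2f}, {recall=:.2f}, {fpr=:.2f}")
```

With these counts, recall is 219 / 1918, or roughly 0.11, precision is 219 / 341, and the false positive rate is 122 / 5595.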


<br>

---

## 5.e

Is the number of false positives produced by the logistic regression classifier `my_model` strictly greater than the number of false negatives produced? 

In [14]:
Answer = False
Answer

False

#### For the logistic regression classifier `my_model`:
$FN \gg FP$ (1,699 false negatives versus 122 false positives)

---

## 5.f

How does the accuracy of the logistic regression classifier `my_model` compare to the accuracy of the zero predictor?


The logistic regression classifier `my_model` has a training accuracy of 0.7576, only about 0.013 higher than the zero predictor's 0.7447. Its recall is also low: it catches 219 of the 1,918 spam emails in the training set (about 0.11), compared with a recall of 0 for the zero predictor.

---

## 5.g

Given the word features provided in Part 4, discuss why the logistic regression classifier `my_model` may be performing poorly.   



The logistic regression classifier may be performing poorly for the following reasons:

1. Word Prevalence and Imbalance:
   - If the words used as features are not strongly indicative of spam or ham emails, the classifier may struggle to distinguish between the two classes. For example, some words might occur frequently in both spam and ham emails, making them less useful for classification.
2. Sparse or Non-discriminative Features:
   - Logistic regression heavily relies on the features being discriminative. If the word features provided are sparse (rarely occur) or appear in both classes (spam and ham) with similar frequencies, the model will fail to learn meaningful distinctions.
3. Class Imbalance:
   - In our dataset, spam emails are outnumbered by ham emails, so the model may be biased toward predicting the majority class (ham) to achieve a higher accuracy, resulting in poor recall for spam detection.
4. Feature Representation:
   - The words may not fully capture the nuanced patterns in the data. For instance, logistic regression may not handle cases where spam detection requires understanding word combinations, positions, or semantics, as it treats each word independently.
5. Overfitting to Training Data:
   - If the features are too specific or not generalized enough (e.g., certain words only appear in a small subset of the data), the model might overfit to these patterns and perform poorly on unseen data.
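
One way to check the first two points is to compare how often each word appears in spam versus ham. The sketch below does this on a hypothetical handful of emails; the texts, labels, and word list are made up for illustration, and `words_in_texts` is restated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def words_in_texts(words, texts):
    """Indicator matrix: 1 if the word is a substring of the text."""
    return np.array(
        [texts.str.contains(w, regex=False).astype(int) for w in words]
    ).T


# Hypothetical toy emails and labels (1 = spam, 0 = ham).
toy_emails = pd.Series([
    "cheap drug offer",
    "meeting memo attached",
    "private bank offer",
    "lunch memo",
    "bank statement enclosed",
    "drug prices slashed",
])
toy_labels = np.array([1, 0, 1, 0, 0, 1])
words = ["drug", "bank", "memo"]

X = words_in_texts(words, toy_emails)

# Share of spam emails and of ham emails that contain each word.
prevalence = pd.DataFrame(
    {
        "spam": X[toy_labels == 1].mean(axis=0),
        "ham": X[toy_labels == 0].mean(axis=0),
    },
    index=words,
)
print(prevalence)
```

In this toy example `bank` appears in one third of both classes, so it carries no signal, while `drug` and `memo` each lean clearly toward one class. Running the same comparison on the real training set would show which of the five chosen words are pulling their weight.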

Possible Ways to Improve Performance:

1. Use Additional Features:
   - Consider adding features like word combinations (n-grams), email metadata, or word frequencies to improve discriminative power.
2. Handle Class Imbalance:
   - Use techniques like oversampling the minority class (spam) or applying class weights in logistic regression to balance the importance of spam and ham classifications.
3. Feature Selection:
   - Evaluate the usefulness of each word feature and remove those that are not strongly correlated with spam or ham.

The model’s performance indicates that the provided features may not be sufficiently informative for the spam classification task.  
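
As a rough illustration of the class-weight idea, the sketch below fits `LogisticRegression` with and without `class_weight="balanced"` on a hypothetical imbalanced dataset with a single word feature. The data are invented so that the effect is easy to see; results on the real emails may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical toy data: 20 spam and 80 ham emails, one binary word feature.
# The word appears in 12 of the 20 spam emails and in 20 of the 80 ham emails.
X = np.array([1] * 12 + [0] * 8 + [1] * 20 + [0] * 60).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# Default fit: every email counts equally, so ham dominates.
plain = LogisticRegression().fit(X, y)

# Balanced fit: each class is reweighted inversely to its frequency.
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
print(f"{recall_plain=:.2f}, {recall_balanced=:.2f}")
```

On this toy data the unweighted model predicts ham for everything, because even emails containing the word are more often ham than spam. The balanced model flags the emails that contain the word, which raises recall at the cost of more false positives.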



---

## 5.h

Would you prefer to use the logistic regression classifier `my_model` or the zero predictor classifier for a spam filter? Why? Describe your reasoning and relate it to at least one of the evaluation metrics you have computed so far.

I would prefer to use the logistic regression classifier (`my_model`) over the zero predictor for a spam filter, despite the logistic regression model's poor recall. Here's why:

1. Zero Predictor Recall = 0:
   - The zero predictor has zero recall, meaning it fails to identify any spam emails. This is unacceptable for a spam filter, as the primary goal is to catch as many spam emails as possible.
2. Logistic Regression Recall of About 11%:
   - While the recall for `my_model` is low (219 of 1,918 spam emails, about 11%), it is still better than the zero predictor's recall (0%). The logistic regression model can at least identify some spam emails.
3. Improved Precision:
   - The logistic regression model achieves a precision of 64%, meaning that when it predicts an email as spam, it is correct 64% of the time. This is a reasonable starting point for spam detection.
4. Zero Predictor Fails to Act as a Filter:
   - The zero predictor would classify all emails as ham, allowing spam emails to flood the inbox. This completely undermines the purpose of a spam filter.

That said, while using the logistic regression classifier, I would try to implement some of the improvements I described in 5.g.

**In Part 2**, we'll focus on using logistic regression to build a spam/ham email classifier. Now that we have explored what the data looks like and how it can be used, and have engineered some features that are predictive of our target classes, we can transition to building our model.

<hr style="border: 4.5px solid teal;" />
<hr style="border: 3.3px solid #10b981;" />   <!-- Emerald -->
<hr style="border: 2.2px solid teal;" />
<hr style="border: 2px solid cornflowerblue;" />