# Email Spam Classification

- Generate your dataset and get familiar with the target variable 'Spam/Non-Spam' and the explanatory variables (for example, the scale level of each covariate).

- Construct a simple predictive model to classify future emails

- Motivate the steps you take and discuss your model results

- What are the pros/cons of your selected algorithm?

- How could you improve it? 

In [1]:
from utils import SpamDatasetGenerator

### Imbalanced supervised learning

In [11]:
generator = SpamDatasetGenerator(n_samples=10000, random_state=43, spam_ratio=0.1)

X, y = generator.generate_dataset(na_ratio_for_non_spam=0)

generator.print_description()


    A class to generate a synthetic dataset for a spam classification task.

    The dataset contains emails with features that are commonly used
    in spam classification. The dataset includes 9 covariates and a binary response.

    Feature Descriptions:
      - num_links: The number of hyperlinks in the email. Spam emails
                   often contain several links to redirect the user.
      - num_exclamations: The count of exclamation marks in the email.
                          An unusually high count may indicate spam.
      - email_length: The length of the email in characters. This can
                      help differentiate short, potentially templated spam emails
                      from longer legitimate communications.
      - sender_domain: The domain from which the email was sent
                       (e.g., gmail.com, yahoo.com, spamdomain.net). Some domains
                       are more frequently associated with spam.
      - email_client: The client or pl

### Semi-supervised learning

In [17]:
generator = SpamDatasetGenerator(n_samples=10000, random_state=42, spam_ratio=0.1)

X, y = generator.generate_dataset(na_ratio_for_non_spam=0.5)
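
With `na_ratio_for_non_spam=0.5`, roughly half of the non-spam labels are presumably missing while the spam labels stay intact. One possible way to use the unlabelled rows is self-training. The sketch below is only an illustration: it uses stand-in data from `make_classification` rather than the generator's output, and it assumes missing labels are recoded to `-1`, which is the marker scikit-learn's `SelfTrainingClassifier` expects. All names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)

# Stand-in data (NOT the generator's output): roughly 10% positives.
X_demo, y_true = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Hide about half of the non-spam labels; -1 marks an unlabelled row.
y_partial = y_true.copy()
hide = (y_true == 0) & (rng.random(len(y_true)) < 0.5)
y_partial[hide] = -1

# Self-training: fit on labelled rows, then repeatedly add unlabelled rows
# whose predicted class probability is at least `threshold`.
self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
self_training.fit(X_demo, y_partial)

pred = self_training.predict(X_demo)
print("unlabelled share:", np.mean(y_partial == -1))
print("predicted spam share:", np.mean(pred == 1))
```

On the real data, the missing entries of `y` would first have to be mapped to `-1` (for instance with `y.fillna(-1)`, assuming they are stored as pandas NA values), and the non-numeric covariates would need encoding before a linear model can use them.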

### Positive Unlabelled learning / Anomaly detection

In [19]:
generator = SpamDatasetGenerator(n_samples=10000, random_state=42, spam_ratio=0.1)

X, y = generator.generate_dataset(na_ratio_for_non_spam=1)
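
With `na_ratio_for_non_spam=1`, no non-spam label is presumably available, so a standard classifier cannot be trained directly. One option is to treat spam as the unusual minority and score every email with an unsupervised anomaly detector. The sketch below uses `IsolationForest` on made-up stand-in data (not the generator's output). Whether spam really behaves like an anomaly in the feature space is an assumption that would need checking.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-in data (NOT the generator's output): 900 "typical" rows plus 100 shifted rows.
typical = rng.normal(loc=0.0, scale=1.0, size=(900, 3))
unusual = rng.normal(loc=4.0, scale=1.0, size=(100, 3))
X_demo = np.vstack([typical, unusual])

# contamination is the assumed share of anomalies; 0.1 mirrors spam_ratio=0.1.
detector = IsolationForest(contamination=0.1, random_state=0)
detector.fit(X_demo)

# predict returns -1 for anomalies and +1 for inliers.
flags = detector.predict(X_demo)
flagged_share = float(np.mean(flags == -1))
print("flagged as anomalous:", flagged_share)
```

The labelled spam emails can still be used afterwards to check how many of them the detector flags.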

In [3]:
y.value_counts(dropna=False, normalize=True)

target
0    0.9007
1    0.0993
Name: proportion, dtype: Float64
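
The proportions above mean that a model which always predicts non-spam already reaches about 90% accuracy, so accuracy alone says little here. A common counter-measure is to weight the classes inversely to their frequency. The short calculation below applies the usual "balanced" heuristic, `n_samples / (n_classes * count)`, to the printed proportions. It is a worked example, not part of the original notebook.

```python
# Class proportions copied from the value_counts output above.
n_samples = 10_000
proportions = {0: 0.9007, 1: 0.0993}

# Recover absolute counts from the proportions.
counts = {label: round(p * n_samples) for label, p in proportions.items()}
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Accuracy of a baseline that always predicts the majority class.
majority_baseline_accuracy = max(proportions.values())

print(counts)
print(class_weights)
print(majority_baseline_accuracy)
```

Each spam email then counts roughly nine times as much as a non-spam email in the loss, which tends to raise recall on spam at the cost of precision.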

In [4]:
# y.value_counts(dropna=True, normalize=True).plot(kind='bar', title='Target Distribution'); 

In [12]:
X.describe()

| statistic | num_links | num_exclamations | email_length |
|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 |
| mean | 2.0237 | 1.0143 | 500.4192 |
| std | 1.431762 | 1.002395 | 100.134162 |
| min | 0.0 | 0.0 | 132.0 |
| 25% | 1.0 | 0.0 | 432.0 |
| 50% | 2.0 | 1.0 | 500.0 |
| 75% | 3.0 | 2.0 | 569.0 |
| max | 10.0 | 7.0 | 843.0 |
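
The summary above covers only three columns although the dataset description mentions nine covariates. By default `describe()` summarises numeric columns only, so the remaining covariates are presumably non-numeric (for example `sender_domain`). The sketch below shows one way to separate columns by scale level. It uses a tiny hypothetical frame `X_demo` whose column names follow the printed description and whose values are made up.

```python
import pandas as pd

# Hypothetical miniature of X: names follow the feature description, values are invented.
X_demo = pd.DataFrame({
    "num_links": [0, 2, 5, 1],
    "email_length": [480, 512, 301, 655],
    "sender_domain": ["gmail.com", "yahoo.com", "spamdomain.net", "gmail.com"],
})

# Split columns by scale level: numeric (count/ratio) vs. categorical (nominal).
numeric_cols = X_demo.select_dtypes(include="number").columns.tolist()
categorical_cols = X_demo.select_dtypes(exclude="number").columns.tolist()

# describe() with default arguments summarises numeric columns only.
default_summary = X_demo.describe()

# Non-numeric columns are summarised by count / unique / top / freq.
categorical_summary = X_demo[categorical_cols].describe()

print(numeric_cols, categorical_cols)
print(categorical_summary)
```

Applied to the real `X`, the same two `select_dtypes` calls would show which covariates need encoding before they can enter a linear model.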


### Example: logistic regression

In [10]:
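import warnings
from utils import train_logit

# Silence all warnings to keep the output short; note that this also hides
# convergence warnings from the optimiser.
warnings.filterwarnings('ignore')

# Fit the logistic regression pipeline; the classification reports are printed below.
model_pipeline = train_logit(X, y)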


Classification Report:
               precision    recall  f1-score   support

         0.0       0.99      0.85      0.92      7194
         1.0       0.41      0.91      0.57       806

    accuracy                           0.86      8000
   macro avg       0.70      0.88      0.74      8000
weighted avg       0.93      0.86      0.88      8000


Classification Report:
               precision    recall  f1-score   support

         0.0       0.98      0.85      0.91      1799
         1.0       0.39      0.88      0.54       201

    accuracy                           0.85      2000
   macro avg       0.69      0.86      0.73      2000
weighted avg       0.92      0.85      0.87      2000
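
The supports of 8000 and 2000 suggest that the two reports refer to an 80/20 train/test split, and the pattern of high spam recall (0.88) with low spam precision (0.39) is what one would expect from a class-weighted model. The implementation of `train_logit` is not shown, so the following is only a guess at a comparable pipeline, run on stand-in data from `make_classification` rather than on the generated emails. The scaler, the `class_weight="balanced"` setting, and all `_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (NOT the generator's output): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# Stratified 80/20 split keeps the class ratio identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Scale the features, then fit a logistic regression that up-weights the rare class.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipeline.fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    print(f"Classification report ({name}):")
    print(classification_report(y_part, pipeline.predict(X_part)))

test_recall = recall_score(y_test, pipeline.predict(X_test))
```

Logistic regression is fast and its coefficients are easy to interpret, but it is limited to a linear decision boundary. Possible improvements include tuning the decision threshold on a validation set to trade recall against precision, encoding the categorical covariates explicitly, or trying a tree-based model.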

