# Homework 3: Logistic Regression
Nicholas Thomson

The dataset is available at: https://archive.ics.uci.edu/dataset/94/spambase

The goal of this dataset is to detect whether an email is spam or not spam. The link contains details on how to import the dataset in Python, which I have followed below. The first step was to install ucimlrepo.

In [1]:
!pip install ucimlrepo



### Import packages

In [2]:
from ucimlrepo import fetch_ucirepo
import pandas as pd

### Fetch the dataset
Split the dataframe into independent (X) and dependent (y) variables.

In [3]:
# fetch dataset
spambase = fetch_ucirepo(id=94)

# data (as pandas dataframes)
X = spambase.data.features  # independent variables (features)
y = spambase.data.targets   # dependent variable (target)
y = y['Class']              # keep the target as a Series

# Explore the data
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 57 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   word_freq_make              4601 non-null   float64
 1   word_freq_address           4601 non-null   float64
 2   word_freq_all               4601 non-null   float64
 3   word_freq_3d                4601 non-null   float64
 4   word_freq_our               4601 non-null   float64
 5   word_freq_over              4601 non-null   float64
 6   word_freq_remove            4601 non-null   float64
 7   word_freq_internet          4601 non-null   float64
 8   word_freq_order             4601 non-null   float64
 9   word_freq_mail              4601 non-null   float64
 10  word_freq_receive           4601 non-null   float64
 11  word_freq_will              4601 non-null   float64
 12  word_freq_people            4601 non-null   float64
 13  word_freq_report            4601 

The independent variables in X are mostly word and character frequencies. There are also three variables that describe runs of capital letters: the average run length, the longest run length, and the total run length.

The dependent variable y records whether each email is spam or not spam.

The goal is to learn which words, characters, and capital-letter statistics are most common in spam so that new emails can be classified as spam or not spam.

In [4]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 4601 entries, 0 to 4600
Series name: Class
Non-Null Count  Dtype
--------------  -----
4601 non-null   int64
dtypes: int64(1)
memory usage: 36.1 KB


The y variable holds the class label for each email: 0 is not spam, 1 is spam.

In [5]:
y.value_counts()

Class
0    2788
1    1813
Name: count, dtype: int64

One issue with this dataset is that there are fewer spam emails (1,813) than not-spam emails (2,788). To compensate, we can use random oversampling, which duplicates randomly chosen spam rows until both classes are the same size. This keeps the model from being biased toward the majority class.

In [6]:
from imblearn.over_sampling import RandomOverSampler

In [7]:
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

In [8]:
y_resampled.value_counts()

Class
1    2788
0    2788
Name: count, dtype: int64

Now there are equal numbers of spam and not-spam emails (2,788 each), so neither class dominates training.
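To make the resampling step concrete, here is a minimal pure-Python sketch of the idea behind random oversampling: rows from the smaller class are drawn with replacement until the classes are the same size. The helper name `random_oversample` and the toy row ids are assumptions for illustration; this is not how imblearn implements `RandomOverSampler` internally.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until every class
    is as large as the biggest one (hypothetical helper, not imblearn)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data with the same class counts as Spambase (2788 not spam, 1813 spam).
labels = [0] * 2788 + [1] * 1813
rows = list(range(len(labels)))  # row ids stand in for feature vectors
rows_res, labels_res = random_oversample(rows, labels)
balanced_counts = Counter(labels_res)
print(balanced_counts)
```

The original rows are all kept; only extra copies of minority-class rows are appended.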

### Split Data into Train and Test Set

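20% of the data goes to the test set and 80% to the training set. Because oversampling was applied before the split, copies of the same spam email can end up in both sets, so the test metrics below may be somewhat optimistic.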

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
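The order of oversampling and splitting matters. The sketch below uses row ids as stand-ins for emails and counts how many test rows also appear in the training set under two orderings: oversample then split (as in this notebook), and split then oversample only the training rows. The helpers `oversample` and `split` are hypothetical pure-Python stand-ins, not the imblearn or scikit-learn functions, so the exact counts will differ from the real pipeline.

```python
import random

def oversample(ids, label_of, rng):
    """Return ids plus random duplicates of minority-class ids (balanced classes)."""
    by_class = {}
    for i in ids:
        by_class.setdefault(label_of[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    out = list(ids)
    for members in by_class.values():
        out += [rng.choice(members) for _ in range(target - len(members))]
    return out

def split(items, rng, test_size=0.2):
    """Shuffle a copy of items and return (train, test)."""
    items = list(items)
    rng.shuffle(items)
    n_test = round(len(items) * test_size)
    return items[n_test:], items[:n_test]

rng = random.Random(42)
label_of = [0] * 2788 + [1] * 1813  # same class counts as Spambase
ids = list(range(len(label_of)))

# Ordering A (this notebook): oversample first, then split.
train_a, test_a = split(oversample(ids, label_of, rng), rng)
train_a_set = set(train_a)
leaked_a = sum(i in train_a_set for i in test_a)

# Ordering B: split first, then oversample only the training ids.
train_b, test_b = split(ids, rng)
train_b_set = set(oversample(train_b, label_of, rng))
leaked_b = sum(i in train_b_set for i in test_b)

print("test rows also in training set, oversample then split:", leaked_a)
print("test rows also in training set, split then oversample:", leaked_b)
```

With ordering B the test set contains only original, unseen rows, which gives a more honest estimate of performance on new email.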

### Scale the data

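The features are on very different scales: most word and character frequencies are small, while the capital-run-length columns reach into the hundreds or thousands. I therefore standardized the independent variables (X) to z-scores instead of using the raw values. Without scaling, the logistic regression solver can fail to converge within its default iteration limit, and scikit-learn issues a convergence warning. The scaler is fit on the training set only and then applied to the test set.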

In [10]:
X_test

|  | `word_freq_make` | `word_freq_address` | `word_freq_all` | `word_freq_3d` | `word_freq_our` | `word_freq_over` | `word_freq_remove` | `word_freq_internet` | `word_freq_order` | `word_freq_mail` | ... | `word_freq_conference` | `char_freq_;` | `char_freq_(` | `char_freq_[` | `char_freq_!` | `char_freq_$` | `char_freq_#` | `capital_run_length_average` | `capital_run_length_longest` | `capital_run_length_total` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3690 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 1.400 | 3 | 14 |
| 3527 | 0.09 | 0.09 | 0.28 | 0.0 | 0.28 | 0.00 | 0.00 | 0.28 | 0.00 | 0.00 | ... | 0.0 | 0.014 | 0.084 | 0.0 | 0.042 | 0.000 | 0.042 | 1.877 | 18 | 552 |
| 724 | 0.16 | 0.24 | 1.24 | 0.0 | 0.41 | 0.58 | 0.49 | 0.33 | 0.66 | 0.66 | ... | 0.0 | 0.000 | 0.132 | 0.0 | 0.250 | 0.224 | 0.026 | 5.872 | 581 | 1339 |
| 3370 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.645 | 0.0 | 0.000 | 0.000 | 0.000 | 1.000 | 1 | 9 |
| 468 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.0 | 0.546 | 0.000 | 0.000 | 2.300 | 9 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4864 | 0.08 | 0.00 | 0.32 | 0.0 | 0.24 | 0.32 | 0.00 | 0.16 | 0.16 | 0.00 | ... | 0.0 | 0.000 | 0.045 | 0.0 | 0.360 | 0.030 | 0.000 | 1.420 | 10 | 196 |
| 3227 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.000 | 0.0 | 0.000 | 0.000 | 0.000 | 1.961 | 11 | 51 |
| 3796 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.0 | 0.000 | 0.366 | 0.0 | 0.000 | 0.000 | 0.000 | 1.307 | 3 | 17 |
| 2879 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 3.33 | ... | 0.0 | 0.000 | 0.000 | 0.0 | 1.156 | 0.000 | 0.000 | 2.333 | 10 | 21 |


In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
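As a rough illustration of what the scaling step computes, the sketch below standardizes one column by hand: the mean and population standard deviation are learned from training values only and then reused for the test values. The helper names `fit_scaler` and `transform` are hypothetical, and the numbers are a few `capital_run_length_total` values borrowed from the preview above as toy data, not the real training split.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    mu = fmean(column)
    sigma = pstdev(column)
    # Guard against dividing by zero when a column is constant.
    return mu, (sigma if sigma > 0 else 1.0)

def transform(column, mu, sigma):
    """Convert values to z-scores using previously learned statistics."""
    return [(x - mu) / sigma for x in column]

train_col = [14, 552, 1339, 9, 23]  # toy "training" values
test_col = [196, 51]                # toy "test" values

mu, sigma = fit_scaler(train_col)
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation on the test values is the same pattern as calling `fit_transform` on `X_train` and only `transform` on `X_test`.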

In [13]:
X_test

array([[-3.59057698e-01, -1.71905471e-01, -6.01661243e-01, ...,
        -1.19994723e-01, -2.79756624e-01, -4.96608829e-01],
       [-5.86204430e-02, -9.26601581e-02, -3.89415711e-02, ...,
        -1.07422901e-01, -2.06544824e-01,  3.98496687e-01],
       [ 1.75052978e-01,  3.94153629e-02,  1.89038302e+00, ...,
        -2.13060058e-03,  2.54133807e+00,  1.70787966e+00],
       ...,
       [-3.59057698e-01, -1.71905471e-01, -6.01661243e-01, ...,
        -1.22445833e-01, -2.79756624e-01, -4.91617534e-01],
       [-3.59057698e-01, -1.71905471e-01, -6.01661243e-01, ...,
        -9.54045561e-02, -2.45591117e-01, -4.84962475e-01],
       [-3.59057698e-01, -1.71905471e-01,  1.70950884e+00, ...,
        -7.55994074e-03, -2.10749304e-02, -6.73574847e-02]])

### Build and train a logistic regression model

In [14]:
# import the class
from sklearn.linear_model import LogisticRegression

In [15]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_train, y_train.values)
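For intuition about what the fitted model does at prediction time, here is a minimal sketch of the logistic function: a weighted sum of the standardized features is passed through a sigmoid to give a spam probability, and a 0.5 cut-off turns that into a class label. The weights, bias, and feature values are made up for illustration and are not the coefficients learned by `logreg`.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1 (spam) for one email."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical coefficients for two standardized features (not fitted values).
weights = [1.5, -0.8]
bias = -0.2

p_spam = predict_proba([2.0, 0.5], weights, bias)
label = int(p_spam >= 0.5)  # 1 = spam, 0 = not spam
print(p_spam, label)
```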

### Evaluate the model

In [16]:
# Make predictions on testing set
y_pred = logreg.predict(X_test)

# Calculate confusion matrix
from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

array([[516,  42],
       [ 52, 506]], dtype=int64)

The confusion matrix shows that most emails were classified correctly. Rows are the true classes and columns are the predicted classes, with class 0 (not spam) first, so there were 516 true negatives, 42 false positives, 52 false negatives, and 506 true positives. That is 1,022 correct predictions out of 1,116 test emails.

In [17]:
# Calculate accuracy, precision, recall, and F1 score
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_pred was computed in the previous cell
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", fscore)

Accuracy: 0.9157706093189965
Precision: 0.9233576642335767
Recall: 0.9068100358422939
F1 Score: 0.9150090415913201
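The four scores can be recomputed by hand from the confusion matrix, which is a useful check on how the cells map to true and false positives. The sketch assumes scikit-learn's layout (true labels in rows, predicted labels in columns, class 0 first) and treats class 1 (spam) as the positive class.

```python
# Counts read from the confusion matrix above:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 516, 42
fn, tp = 52, 506

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all emails classified correctly
precision = tp / (tp + fp)                  # share of predicted spam that really is spam
recall = tp / (tp + fn)                     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```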


### Evaluation of Model

Accuracy: 91.6%
Precision: 92.3%
Recall: 90.7%
F1 Score: 91.5%

All four metrics are above 90%, which indicates a reliable model on this test set. It is not perfect, so it will not classify every email correctly. The 42 false positives are legitimate emails that would be sent to spam, which means some important mail could be lost. The 52 false negatives are spam emails that would still reach the inbox. Overall, the model classifies about 91.6% of the test emails correctly, and the F1 score of 91.5% shows that precision and recall are well balanced.

Overall, I would say the model is very good despite not being perfect. I would implement this in an email system.