# 2. Feature size

- Explore how the number of selected features for the email dataset influences accuracy and runtime performance.

In [1]:
# Importing all the required libraries

import os
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 20})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

In [2]:
import time

import email_preprocessor as epp

# Set the number of features to use for the experiment
num_features = [50, 100, 150, 200]

# Count the words in the dataset
word_freq, num_emails = epp.count_words()

# Find the top words based on word frequency
top_words, top_counts = epp.find_top_words(word_freq)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chandrachudgowda/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
# Import once, outside the loop
from sklearn.linear_model import LogisticRegression

for n in num_features:
    # Select the top `n` words
    features = top_words[:n]

    # Convert emails to feature vectors
    feature_vectors, y = epp.make_feature_vectors(features, num_emails)

    # Split the data into training and test sets (fixed seed)
    np.random.seed(0)
    x_train, y_train, inds_train, x_test, y_test, inds_test = epp.make_train_test_sets(feature_vectors, y)

    # Save the data
    np.save(f'data/email_train_x_{n}.npy', x_train)
    np.save(f'data/email_train_y_{n}.npy', y_train)
    np.save(f'data/email_train_inds_{n}.npy', inds_train)
    np.save(f'data/email_test_x_{n}.npy', x_test)
    np.save(f'data/email_test_y_{n}.npy', y_test)
    np.save(f'data/email_test_inds_{n}.npy', inds_test)

    # Train a logistic regression model on the training data and time the fit
    clf = LogisticRegression(max_iter=1000)
    start_time = time.perf_counter()
    clf.fit(x_train, y_train)
    end_time = time.perf_counter()

    # Predict on the test data and compute accuracy
    y_pred = clf.predict(x_test)
    acc = np.mean(y_pred == y_test)

    # Print results
    print(f'Number of features: {n}')
    print(f'Accuracy: {acc}')
    print(f'Runtime: {end_time - start_time:.4f}s\n')

Number of features: 50
Accuracy: 0.8521072796934865
Runtime: 4.2484s

Number of features: 100
Accuracy: 0.8908812260536398
Runtime: 7.5903s

Number of features: 150
Accuracy: 0.9262835249042145
Runtime: 8.4006s

Number of features: 200
Accuracy: 0.9459003831417625
Runtime: 9.5856s



# Report + Results

For my extension, I explored how the number of selected features for the email dataset influences accuracy and runtime. I added a loop over the feature sizes [50, 100, 150, 200] to the existing code. In each iteration, I selected the top `n` most frequent words from the email dataset, converted the emails to feature vectors over those words, split the data into training and test sets (re-seeding NumPy with 0 each time), and saved the splits to files. I then trained a logistic regression model on the training set, timed the call to `fit`, and computed accuracy on the test set.
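To illustrate what changing the feature size does to the input, here is a minimal sketch of a word-count feature vector. It assumes that each feature is the number of times a selected word occurs in an email; `epp.make_feature_vectors` may work differently, and `make_count_vector`, the vocabulary, and the toy email are all hypothetical.

```python
from collections import Counter


def make_count_vector(email_words, features):
    """Return one count per feature word for a single tokenized email."""
    counts = Counter(email_words)
    return [counts[word] for word in features]


# Hypothetical vocabulary, already ranked by frequency, and a toy tokenized email
top_words = ['free', 'meeting', 'money', 'report']
email = ['free', 'money', 'now', 'free', 'offer']

for n in (2, 4):
    features = top_words[:n]  # keep only the n most frequent words
    print(n, make_count_vector(email, features))
# prints:
# 2 [2, 0]
# 4 [2, 0, 1, 0]
```

With only the top 2 words, the email's use of "money" is invisible to the model; with 4 words it appears as an extra column.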

The results of my extension are as follows:

- 50 features: accuracy 0.852, runtime 4.248 s
- 100 features: accuracy 0.891, runtime 7.590 s
- 150 features: accuracy 0.926, runtime 8.401 s
- 200 features: accuracy 0.946, runtime 9.586 s
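To make the rows easier to compare, the change from one feature size to the next can be computed directly from the reported numbers. This is a small sketch using the rounded values above; `step_changes` is an illustrative helper, not part of the notebook.

```python
# Results copied from the list above (accuracy rounded to 3 places, runtime in seconds)
results = [
    (50, 0.852, 4.248),
    (100, 0.891, 7.590),
    (150, 0.926, 8.401),
    (200, 0.946, 9.586),
]


def step_changes(rows):
    """Return (n_prev, n_next, accuracy gain in points, extra seconds) per step."""
    changes = []
    for (n0, acc0, t0), (n1, acc1, t1) in zip(rows, rows[1:]):
        gain_points = round((acc1 - acc0) * 100, 1)
        extra_seconds = round(t1 - t0, 3)
        changes.append((n0, n1, gain_points, extra_seconds))
    return changes


for n0, n1, gain, extra in step_changes(results):
    print(f'{n0} -> {n1} features: +{gain} points accuracy, +{extra} s runtime')
```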

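Accuracy rose with every increase in the number of features, from 0.852 at 50 features to 0.946 at 200. This makes sense, as more words give the model more signals with which to separate the classes. The gains did shrink, though: the first two steps added about 3.9 and 3.5 percentage points, while the last step added about 2.0. Runtime also rose with the number of features, from 4.25 s to 9.59 s, because fitting a model on more features requires more computation. The runtime here is the time for the call to `fit` only, so it does not include prediction. Therefore, there is a trade-off between accuracy and training time when selecting the number of features for a model.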
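Each runtime above comes from a single timed call per feature size, so the numbers can shift between runs. One way to get steadier measurements, as a sketch and not what I ran, is to repeat the timed call and report the median. `time_call` is a hypothetical helper, and the workload below is a stand-in for the model fit.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload; in the notebook this would be something like
# lambda: LogisticRegression(max_iter=1000).fit(x_train, y_train)
median_s = time_call(lambda: sum(i * i for i in range(100_000)), repeats=5)
print(f'Median runtime: {median_s:.4f}s')
```

The median is less sensitive than the mean to a single slow run, for example one delayed by other processes on the machine.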