### Data Loading
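In this section, I load the Spambase dataset from a CSV file using the `csv` module. The dataset contains features extracted from emails and is used to classify them as spam or non-spam. Each row represents one email and holds 57 numeric feature values followed by a class label (1 for spam, 0 for non-spam). At this stage every value is still a string, as the first row printed below shows.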


In [43]:
import csv
with open('spambase.data', 'r', newline='') as f:  # newline='' is what the csv docs recommend
    reader = csv.reader(f)
    data = list(reader)

print(data[0])


['0', '0.64', '0.64', '0', '0.32', '0', '0', '0', '0', '0', '0', '0.64', '0', '0', '0', '0.32', '0', '1.29', '1.93', '0', '0.96', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0.778', '0', '0', '3.756', '61', '278', '1']
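
Before converting anything, it can help to confirm that every row has the expected shape. The following is a minimal sketch, not part of the original notebook: the helper name `read_rows` and the constant `EXPECTED_COLUMNS` are hypothetical, and the value 58 is an assumption based on the row printed above (57 features plus one label).

```python
import csv
import io

# Assumption: 57 features plus one label, matching the row printed above.
EXPECTED_COLUMNS = 58

def read_rows(lines, expected_columns=EXPECTED_COLUMNS):
    """Parse CSV lines, skip blank ones, and fail fast on malformed rows."""
    rows = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if not row:  # csv.reader yields [] for an empty line
            continue
        if len(row) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} columns, got {len(row)}"
            )
        rows.append(row)
    return rows

# Tiny in-memory example with 3 columns instead of 58.
sample = io.StringIO("0.1,0.2,1\n\n0.3,0.4,0\n")
print(read_rows(sample, expected_columns=3))
# prints [['0.1', '0.2', '1'], ['0.3', '0.4', '0']]
```

With the real file, the same helper could be called as `read_rows(open('spambase.data', newline=''))`, ideally inside a `with` block.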


### Data Preprocessing
I convert the raw string data into numerical format. The features include word frequencies, character frequencies, and the lengths of capital letter sequences. Every column except the last is converted to a float, and the last column (the label) is converted to an integer. The two are then combined into a list of tuples, where each tuple contains the list of feature values for one email and its label. The conversion to feature dictionaries happens later, in the preprocessing function.


In [44]:
features = [list(map(float, row[:-1])) for row in data]
labels = [int(row[-1]) for row in data]
dataset = list(zip(features, labels))  # one (feature_list, label) tuple per email

print("First feature set:", features[0])
print("First label:", labels[0])

First feature set: [0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0]
First label: 1


### Data Splitting
The dataset is split into three parts: training set (70%), devtest set (15%), and test set (15%). The split is sequential: the first 70% of the rows form the training set, the next 15% the devtest set, and the remainder the test set, with no shuffling. This gives me separate data for training, for tuning and error analysis, and for a final evaluation on data the model has never seen.


In [45]:
split1 = int(0.7 * len(dataset))  # 70% for training
split2 = int(0.85 * len(dataset)) # next 15% for devtest; the remaining 15% is the final test set

# split the data
train_data = dataset[:split1]
devtest_data = dataset[split1:split2]
test_data = dataset[split2:]

print(f"Training set size: {len(train_data)}")
print(f"Devtest set size: {len(devtest_data)}")
print(f"Test set size: {len(test_data)}")



Training set size: 3220
Devtest set size: 690
Test set size: 691
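
A sequential split only gives representative subsets if the file order is unrelated to the label. The first devtest and test rows printed later in this notebook both carry label 0, which hints that the file may be grouped by class. The sketch below shows one way to check this and to shuffle before splitting. It is an illustration rather than the method used in this notebook: `shuffled_split`, `label_counts`, and the seed value are hypothetical choices.

```python
import random
from collections import Counter

def shuffled_split(dataset, train_pct=70, dev_pct=15, seed=0):
    """Shuffle a copy of the dataset reproducibly, then cut it into three parts."""
    items = list(dataset)                # copy, so the caller's list is untouched
    random.Random(seed).shuffle(items)   # fixed seed keeps the split repeatable
    split1 = len(items) * train_pct // 100
    split2 = len(items) * (train_pct + dev_pct) // 100
    return items[:split1], items[split1:split2], items[split2:]

def label_counts(split):
    """Count how many examples of each label a split contains."""
    return Counter(label for _, label in split)

# Toy data ordered by label, mimicking a file grouped by class.
toy = [([float(i)], 1) for i in range(4)] + [([float(i)], 0) for i in range(6)]
train, dev, test = shuffled_split(toy)
for name, part in (("train", train), ("devtest", dev), ("test", test)):
    print(name, len(part), dict(label_counts(part)))
```

Running `label_counts` on the notebook's own `train_data`, `devtest_data`, and `test_data` would show directly whether each split contains both classes.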


### Preprocessing Function
The `preprocess_data` function converts each split into the format NLTK expects for training and evaluating classifiers: a list of tuples, where each tuple contains a feature dictionary and a label. The dictionary maps each feature's column index (0 to 56) to its value. The function accepts either the `(features, label)` tuples built above or raw CSV rows whose last column is the label, and it prints the first row it receives as a quick sanity check.


In [50]:
import nltk

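# Convert rows to NLTK featuresets: a list of ({feature_index: value}, label) tuples.
# Accepts either (feature_list, label) tuples or raw CSV rows whose last column is the label.
def preprocess_data(data):
    print("First row in data:", data[0])  # sanity check of the input format
    features = [list(map(float, row[0])) if isinstance(row[0], list) else list(map(float, row[:-1])) for row in data]
    labels = [int(row[1]) if isinstance(row[0], list) else int(row[-1]) for row in data]
    # enumerate() assigns each value its column index as the feature name
    return [(dict(enumerate(f)), label) for f, label in zip(features, labels)]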


# Preprocess the training, devtest, and test sets
train_set = preprocess_data(train_data)
devtest_set = preprocess_data(devtest_data)
test_set = preprocess_data(test_data)

print("First processed training sample:", train_set[0])





First row in data: ([0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0], 1)
First row in data: ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.97, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.97, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.491, 0.163, 0.0, 0.0, 0.0, 4.312, 33.0, 138.0], 0)
First row in data: ([0.0, 0.0, 0.3, 0.0, 0.61, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.91, 0.0, 0.3, 0.0, 0.0, 0.0, 2.44, 0.61, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 1.52, 0.0, 0.0, 0.0, 0.0, 0.61, 1.22, 0.0, 0.0, 0.0, 0.0, 0.301, 0.043, 0.043, 0.0, 0.086, 0.0, 2.161, 19.0, 227.0], 0)
First processed training
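
When the input is already a list of `(feature_list, label)` tuples, as the splits above are, the conversion reduces to a single comprehension. The following sketch shows that simpler case on a shortened example; the name `to_featuresets` is hypothetical and the three-value rows are made up for illustration.

```python
def to_featuresets(dataset):
    """Turn (feature_list, label) pairs into ({index: value}, label) pairs."""
    return [(dict(enumerate(values)), label) for values, label in dataset]

# Shortened example rows: three feature values each instead of 57.
example = [([0.0, 0.64, 278.0], 1), ([0.0, 0.0, 138.0], 0)]
print(to_featuresets(example))
# prints [({0: 0.0, 1: 0.64, 2: 278.0}, 1), ({0: 0.0, 1: 0.0, 2: 138.0}, 0)]
```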

### Training the Naive Bayes Classifier
I train a Naive Bayes classifier using the preprocessed training set. The Naive Bayes algorithm is suitable for spam classification because it handles large feature spaces well and is efficient in terms of computation. This classifier will learn to distinguish between spam and non-spam emails based on the features provided.


In [47]:
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)
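
NLTK's `NaiveBayesClassifier` treats every feature value as a discrete symbol, so two emails only share evidence for a feature when their values match exactly. With continuous frequencies, one common workaround is to group values into a few bins before training. The sketch below is one possible way to do that, not what this notebook does: `bin_featureset` is a hypothetical helper and the bin edges in `EDGES` are arbitrary illustrative values that would need tuning on the devtest set.

```python
import bisect

# Assumption: illustrative bin edges, giving the bins
# value <= 0, (0, 1], (1, 10], (10, 100], and > 100.
EDGES = (0.0, 1.0, 10.0, 100.0)

def bin_featureset(featureset, edges=EDGES):
    """Replace each continuous value with the index of the bin it falls into."""
    return {
        name: bisect.bisect_left(edges, value)  # number of edges strictly below value
        for name, value in featureset.items()
    }

example = {0: 0.0, 1: 0.64, 54: 3.756, 55: 61.0, 56: 278.0}
print(bin_featureset(example))
# prints {0: 0, 1: 1, 54: 2, 55: 3, 56: 4}
```

A binned training set could then be built with `[(bin_featureset(f), label) for f, label in train_set]` and passed to `NaiveBayesClassifier.train` in the same way as before, with the same binning applied to the devtest and test sets.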


### Evaluating on Devtest Set
The trained classifier is evaluated on the devtest set to measure its accuracy. This helps me understand how well the model performs on unseen data and identify areas for improvement. I also display the most informative features that have the highest impact on distinguishing between spam and non-spam emails.

The classifier achieved a devtest accuracy of 95.07%. The most informative features are the feature-value pairs whose likelihood differs most between the two classes. Because NLTK treats each distinct value as its own category, `56 = 9.0` means that feature 56 (`capital_run_length_total`) is exactly 9.0. The listed values 9.0, 5.0, 4.0, and 6.0 are all small totals, and emails with these values are far more likely to be non-spam than spam, with ratios of up to 23.7:1. In other words, it is a very low count of capital letters, not a high one, that points to non-spam. Conversely, feature 23 with a value of 0.08 is 13.9 times more likely to occur in spam. With zero-based indexing in the standard Spambase attribute order, index 23 corresponds to `word_freq_money`; `char_freq_$` sits at index 52.

In [48]:
from nltk.classify.util import accuracy

devtest_accuracy = accuracy(classifier, devtest_set)
print(f'Devtest Accuracy: {devtest_accuracy}')

classifier.show_most_informative_features(5)

Devtest Accuracy: 0.9507246376811594
Most Informative Features
                      56 = 9.0                 0 : 1      =     23.7 : 1.0
                      56 = 5.0                 0 : 1      =     21.6 : 1.0
                      56 = 4.0                 0 : 1      =     16.7 : 1.0
                      56 = 6.0                 0 : 1      =     15.1 : 1.0
                      23 = 0.08                1 : 0      =     13.9 : 1.0
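
Since the featuresets use column indices as feature names, the output above is easier to read with a lookup from index to attribute name. The sketch below builds such a list. The names are an assumption transcribed from the attribute order documented in the dataset's `spambase.names` file, so they should be checked against the copy that ships with the data; `FEATURE_NAMES` itself is a hypothetical name.

```python
# Assumption: attribute order as documented in spambase.names (48 words, 6 characters, 3 run lengths).
WORDS = (
    "make address all 3d our over remove internet order mail receive will "
    "people report addresses free business email you credit your font 000 "
    "money hp hpl george 650 lab labs telnet 857 data 415 85 technology "
    "1999 parts pm direct cs meeting original project re edu table conference"
).split()
CHARS = [";", "(", "[", "!", "$", "#"]

FEATURE_NAMES = (
    [f"word_freq_{word}" for word in WORDS]
    + [f"char_freq_{char}" for char in CHARS]
    + ["capital_run_length_average",
       "capital_run_length_longest",
       "capital_run_length_total"]
)

# Look up the two indices that appear in the informative-features output.
for index in (56, 23):
    print(index, FEATURE_NAMES[index])
# prints:
# 56 capital_run_length_total
# 23 word_freq_money
```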


### Evaluating on Test Set
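Finally, I evaluate the classifier on the test set, which was used neither for training nor for inspecting the model, to estimate how well it generalizes to new data. The test accuracy is 84.37%, about 10.7 percentage points below the devtest accuracy of 95.07%, so the devtest figure is an optimistic estimate of performance on unseen emails.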

In [49]:
# Evaluate the classifier on the test set
test_accuracy = accuracy(classifier, test_set)
print(f'Test Accuracy: {test_accuracy}')

Test Accuracy: 0.8437047756874095
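
Accuracy alone does not show which kind of mistake the classifier makes, and for spam filtering a legitimate email flagged as spam is usually worse than a missed spam message. A minimal sketch of per-class metrics follows. The helper names `confusion_counts` and `precision_recall` are hypothetical, and the gold and predicted labels in the example are made up, not results from this notebook.

```python
def confusion_counts(gold, predicted, positive=1):
    """Return (true_pos, false_pos, false_neg, true_neg) for the positive label."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif g == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(gold, predicted, positive=1):
    """Precision and recall for the positive label; None when undefined."""
    tp, fp, fn, _ = confusion_counts(gold, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Made-up labels for illustration (1 = spam, 0 = non-spam).
gold = [1, 1, 0, 0, 0]
pred = [1, 0, 0, 1, 0]
print(confusion_counts(gold, pred))   # prints (1, 1, 1, 2)
print(precision_recall(gold, pred))   # prints (0.5, 0.5)
```

For the classifier above, the inputs could be built as `gold = [label for _, label in test_set]` and `predicted = [classifier.classify(features) for features, _ in test_set]`. If a split contains no spam at all, recall for the spam class is undefined, which is why the sketch returns `None` in that case.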
