# Run 1: Preprocessed Data

The data used here is the preprocessed version of the Enron email dataset. Specifically, it is the data used in the following paper.

[Paper](https://www2.aueb.gr/users/ion/docs/ceas2006_paper.pdf)

[Data](https://www2.aueb.gr/users/ion/data/enron-spam/)

In [112]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [113]:
# import data
data = pd.read_csv('data/data_cleaned.csv')

In [114]:
# Split the data into features and target variable
X = data.drop('spam', axis=1)  # features: every column except the 'spam' label
y = data['spam']               # target: the 'spam' label

# Split the data into training and testing sets (80/20, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [115]:
# Training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8152173913043478
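
A single accuracy figure does not show which class the errors fall on. The sketch below is one way to break the score down into a confusion matrix, precision, and recall. It runs on synthetic data from `make_classification`, not on the Enron features, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a numeric feature table (assumption: NOT the Enron data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same model type as in the cell above
demo_model = GaussianNB().fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_pred)
demo_accuracy = accuracy_score(y_te, demo_pred)
demo_precision = precision_score(y_te, demo_pred)  # of predicted positives, how many are right
demo_recall = recall_score(y_te, demo_pred)        # of true positives, how many are found

print(cm)
print("Accuracy:", demo_accuracy)
print("Precision:", demo_precision)
print("Recall:", demo_recall)
```

Accuracy equals the diagonal of the confusion matrix divided by its total, so the matrix shows where the remaining errors come from.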


# Run 2: Data Processed By Me

In [116]:
# import data
d1 = pd.read_csv('data/enron_proccessed.csv')
d2 = pd.read_csv('data/enron_proccessed_spam.csv')

In [117]:
# combine d1 and d2, update the index
data = pd.concat([d1, d2], ignore_index=True)
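
As a small illustration of what `ignore_index=True` does in the concatenation above, the sketch below combines two toy frames. The frame names, the column values, and the 0/1 encoding of `Target` are assumptions made for the example, not taken from the CSV files.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values)
ham_demo = pd.DataFrame({"Number of Words": [519, 224], "Target": [0, 0]})
spam_demo = pd.DataFrame({"Number of Words": [150, 187], "Target": [1, 1]})

# With ignore_index=True the result gets a fresh 0..n-1 index
combined_demo = pd.concat([ham_demo, spam_demo], ignore_index=True)

# Without it, each frame keeps its own row labels, so labels repeat
kept_index_demo = pd.concat([ham_demo, spam_demo])

print(combined_demo.index.tolist())    # [0, 1, 2, 3]
print(kept_index_demo.index.tolist())  # [0, 1, 0, 1]
```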

In [118]:
data.head().T

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Index | 1 | 2 | 3 | 4 | 5 |
| Message Body | `Message-ID: <15886558.1075840010371.JavaMail.e...` | `Message-ID: <15680772.1075840010630.JavaMail.e...` | `Message-ID: <19004769.1075840009908.JavaMail.e...` | `Message-ID: <20348456.1075839979614.JavaMail.e...` | `Message-ID: <9744175.1075839979640.JavaMail.ev...` |
| Number of Words | 519 | 224 | 150 | 187 | 187 |
| Number of Stop Words | 185 | 77 | 23 | 20 | 20 |
| Number of Unique Words | 312 | 170 | 140 | 107 | 107 |
| Ratio of Lowercase to Uppercase | 0.722543 | 0.754464 | 0.406667 | 0.374332 | 0.374332 |
| Number of Exclamation Points | 0 | 0 | 0 | 0 | 0 |
| Number of Unique Stemmed Words | 233 | 134 | 132 | 110 | 110 |
| Number of Lemmatized Words | 240 | 136 | 133 | 112 | 112 |
| Cleaned Body | `message-id: <15886558.1075840010371.javamail.e...` | `message-id: <15680772.1075840010630.javamail.e...` | `message-id: <19004769.1075840009908.javamail.e...` | `message-id: <20348456.1075839979614.javamail.e...` | `message-id: <9744175.1075839979640.javamail.ev...` |


In [119]:
# drop Message Body column
data.drop('Message Body', axis=1, inplace=True)

# drop the Cleaned Body column
data.drop('Cleaned Body', axis=1, inplace=True)

## Run without the Index column

In [120]:
# drop the index column
data.drop('Index', axis=1, inplace=True)
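
The `data.head().T` output above also lists an `Unnamed: 0` entry. If that is a leftover row-number column from the CSV files (an assumption; it cannot be confirmed from the output alone), it would stay in the features unless it is dropped too. A minimal sketch of removing every index-like column from a toy frame, with hypothetical values:

```python
import pandas as pd

# Toy frame with two index-like columns, one real feature, and the label
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Index": [1, 2, 3],
    "Number of Words": [519, 224, 150],
    "Target": [0, 0, 1],
})

# Treat 'Index' and any 'Unnamed...' column as row identifiers, not features
index_like = [c for c in toy.columns if c.startswith("Unnamed") or c == "Index"]
features_demo = toy.drop(columns=index_like + ["Target"])

print(index_like)                      # ['Unnamed: 0', 'Index']
print(features_demo.columns.tolist())  # ['Number of Words']
```

Row identifiers carry no information about the message itself. When the two source files are stacked one after the other, they can still correlate with the label.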

In [121]:
# Split the data into features and target variable
X = data.drop('Target', axis=1)  # features: every column except the 'Target' label
y = data['Target']               # target: the 'Target' label

# Split the data into training and testing sets (80/20, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [122]:
# Training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7248062015503876


In [123]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier
clf = DecisionTreeClassifier()

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
predictions = clf.predict(X_test)

# Evaluate the model
accuracy = np.mean(predictions == y_test)
print('Accuracy:', accuracy)

Accuracy: 0.9483204134366925
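
The two accuracies above each come from one 80/20 split. One possible way to get a more stable comparison is k-fold cross-validation. The sketch below runs on synthetic data, not the Enron features. The names `X_cv`, `y_cv`, `candidates`, and `cv_scores` are hypothetical, and `random_state=42` on the tree is added here only to make the sketch repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a numeric feature table (assumption: NOT the Enron data)
X_cv, y_cv = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)

# The two model types used in this notebook
candidates = {
    "GaussianNB": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model
cv_scores = {
    name: cross_val_score(est, X_cv, y_cv, cv=5, scoring="accuracy")
    for name, est in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much of the gap between two models could be due to the particular split.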
