# Run 1: Preprocessed Data

The data used here is the preprocessed version of the Enron email dataset, specifically the version used in the following paper.

[Paper](https://www2.aueb.gr/users/ion/docs/ceas2006_paper.pdf)

[Data](https://www2.aueb.gr/users/ion/data/enron-spam/)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [2]:
# import data
data = pd.read_csv('data/data_cleaned.csv')

In [3]:
# Split the data into features and target variable
X = data.drop('spam', axis=1)  # every column except the 'spam' label
y = data['spam']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 split with a fixed seed

In [4]:
# Training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8208469055374593
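Accuracy on its own can hide class imbalance, so it may help to compare the model against a majority-class baseline and to look at the confusion matrix. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, since `data_cleaned.csv` is not reproduced here), so the numbers it prints say nothing about the Enron result above. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): roughly 70% class 0, 30% class 1
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
baseline_acc = accuracy_score(yd_test, baseline.predict(Xd_test))

# Same Gaussian Naive Bayes setup as in the cell above
nb = GaussianNB().fit(Xd_train, yd_train)
nb_pred = nb.predict(Xd_test)
nb_acc = accuracy_score(yd_test, nb_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yd_test, nb_pred)

print("Baseline accuracy:", baseline_acc)
print("GaussianNB accuracy:", nb_acc)
print(cm)
```

If the model's accuracy is close to the baseline, the classifier is adding little beyond predicting the majority class.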


# Run 2: Data Processed By Me

In [5]:
# import data
d1 = pd.read_csv('data/enron_proccessed.csv')
d2 = pd.read_csv('data/enron_proccessed_spam.csv')

In [6]:
# combine d1 and d2, update the index
data = pd.concat([d1, d2], ignore_index=True)
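As a small illustration of what `ignore_index=True` does here, the sketch below stacks two tiny hypothetical frames (stand-ins for the two CSV files, not the real data). It renumbers the row labels but leaves an ordinary column named `Index` unchanged.

```python
import pandas as pd

# Two tiny hypothetical frames standing in for the two input files
ham = pd.DataFrame({"Index": [1, 2], "Target": [0, 0]})
spam = pd.DataFrame({"Index": [1, 2], "Target": [1, 1]})

# Without ignore_index the row labels repeat: 0, 1, 0, 1
stacked = pd.concat([ham, spam])

# With ignore_index=True the row labels are renumbered: 0, 1, 2, 3
combined = pd.concat([ham, spam], ignore_index=True)

print(stacked.index.tolist())
print(combined.index.tolist())

# A regular column named "Index" is left untouched by ignore_index
print(combined["Index"].tolist())
```

Note that the rows keep their input order, so all rows of the first frame come before all rows of the second.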

In [7]:
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Index | 1 | 2 | 3 | 4 | 5 |
| Message Body | Message-ID: <15886558.1075840010371.JavaMail.e... | Message-ID: <15680772.1075840010630.JavaMail.e... | Message-ID: <19004769.1075840009908.JavaMail.e... | Message-ID: <20348456.1075839979614.JavaMail.e... | Message-ID: <9744175.1075839979640.JavaMail.ev... |
| Number of Words | 519 | 224 | 150 | 187 | 187 |
| Number of Stop Words | 185 | 77 | 23 | 20 | 20 |
| Number of Unique Words | 312 | 170 | 140 | 107 | 107 |
| Ratio of Lowercase to Uppercase | 0.722543 | 0.754464 | 0.406667 | 0.374332 | 0.374332 |
| Number of Exclamation Points | 0 | 0 | 0 | 0 | 0 |
| Target | 0 | 0 | 0 | 0 | 0 |


In [8]:
# drop Message Body column
data.drop('Message Body', axis=1, inplace=True)

In [9]:
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Index | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Number of Words | 519.0 | 224.0 | 150.0 | 187.0 | 187.0 |
| Number of Stop Words | 185.0 | 77.0 | 23.0 | 20.0 | 20.0 |
| Number of Unique Words | 312.0 | 170.0 | 140.0 | 107.0 | 107.0 |
| Ratio of Lowercase to Uppercase | 0.722543 | 0.754464 | 0.406667 | 0.374332 | 0.374332 |
| Number of Exclamation Points | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Target | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
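The integer counts now print as floats (519 becomes 519.0). This is most likely an effect of the transpose rather than a change in the stored data. With a text column present, each transposed column falls back to `object` dtype and keeps the original values. Once only integer and float columns remain, the transposed columns are upcast to `float64`. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with one text column, one integer column, one float column
df = pd.DataFrame(
    {
        "Message Body": ["first message", "second message"],
        "Number of Words": [519, 224],
        "Ratio of Lowercase to Uppercase": [0.722543, 0.754464],
    }
)

# Mixed text and numbers: each transposed column falls back to object dtype
mixed_T = df.T

# Numbers only: ints and floats share float64 after transposing
numeric_T = df.drop("Message Body", axis=1).T

print(mixed_T.dtypes.tolist())
print(numeric_T.dtypes.tolist())
```

Checking `data.dtypes` on the untransposed frame would show the dtypes the model actually sees.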


## Run With the Index Column

In [10]:
# Split the data into features and target variable
X = data.drop('Target', axis=1)  # every column except the 'Target' label (still includes 'Index')
y = data['Target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 split with a fixed seed

In [11]:
# Training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8823255813953489


## Run Without the Index Column

In [12]:
# drop the index column
data.drop('Index', axis=1, inplace=True)

In [13]:
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Number of Words | 519.0 | 224.0 | 150.0 | 187.0 | 187.0 |
| Number of Stop Words | 185.0 | 77.0 | 23.0 | 20.0 | 20.0 |
| Number of Unique Words | 312.0 | 170.0 | 140.0 | 107.0 | 107.0 |
| Ratio of Lowercase to Uppercase | 0.722543 | 0.754464 | 0.406667 | 0.374332 | 0.374332 |
| Number of Exclamation Points | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Target | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [14]:
# Split the data into features and target variable
X = data.drop('Target', axis=1)  # every column except the 'Target' label
y = data['Target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80/20 split with a fixed seed

In [15]:
# Training the Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = model.predict(X_test)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.72
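One possible explanation for the drop from 0.88 to 0.72 is that `Index` carries label information. If it is a running counter and the two files were stacked one after the other, its value may correlate with which file a row came from. This is an assumption, as the notebook does not show how `Index` was generated. The sketch below uses synthetic data with pure-noise features and a hypothetical `holdout_accuracy` helper to show how such a counter alone can inflate accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
n_per_class = 500

# Synthetic stand-in (assumption): features are pure noise, unrelated to the label
demo = pd.DataFrame(rng.normal(size=(2 * n_per_class, 3)), columns=["f1", "f2", "f3"])
# Class 0 rows first, class 1 rows second, mimicking one file stacked on another
demo["Target"] = [0] * n_per_class + [1] * n_per_class
# Running row counter, playing the role of the Index column
demo["Index"] = np.arange(1, 2 * n_per_class + 1)


def holdout_accuracy(frame):
    """Fit GaussianNB on an 80/20 split and return the test accuracy."""
    X = frame.drop("Target", axis=1)
    y = frame["Target"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = GaussianNB().fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


acc_with_index = holdout_accuracy(demo)
acc_without_index = holdout_accuracy(demo.drop("Index", axis=1))

print("With Index:", acc_with_index)
print("Without Index:", acc_without_index)
```

If the same pattern holds for the real data, the score without `Index` is the more trustworthy estimate of how well the text features separate spam from ham.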
