# Spam Classification
## Danny Chang, Joey Hernandez

In an era of constant digital communication, the volume of incoming messages and email can be overwhelming. A significant portion of it is unwanted and potentially harmful spam, which disrupts productivity and poses security risks. To address this, we set out to build a spam classification system.

Our goal is to create an intelligent algorithm that can automatically differentiate between legitimate messages and spam, providing users with a clutter-free and secure communication experience. By leveraging the power of machine learning and data analysis, we intend to build a predictive model capable of classifying messages and emails as either "Spam" or "Not Spam" with a high degree of accuracy.

The project covers data collection, preprocessing, feature engineering, model selection, and evaluation. We train and tune our classification model on a labeled collection of spam and non-spam ("ham") emails, and we explore natural language processing (NLP) and machine learning techniques to improve the model's performance and adaptability.

The successful completion of this project will not only help individuals manage their digital communications more effectively but also have broader applications in email filtering, cybersecurity, and information management. By mitigating the impact of spam, we aim to contribute to a safer and more efficient digital communication environment.

## Importing Our Data

In [2]:
import os
import pandas as pd
import email

In [3]:
x = os.listdir("data/easy_ham")

with open(os.path.join("data/easy_ham",x[0]), "r") as file_handler:
    msg = file_handler.read()
    print(msg)

From rssfeeds@jmason.org  Mon Sep 30 13:43:46 2002
Return-Path: <rssfeeds@example.com>
Delivered-To: yyyy@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id AE79816F16
	for <jm@localhost>; Mon, 30 Sep 2002 13:43:46 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Mon, 30 Sep 2002 13:43:46 +0100 (IST)
Received: from dogma.slashnull.org (localhost [127.0.0.1]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g8U81fg21359 for
    <jm@jmason.org>; Mon, 30 Sep 2002 09:01:41 +0100
Message-Id: <200209300801.g8U81fg21359@dogma.slashnull.org>
To: yyyy@example.com
From: gamasutra <rssfeeds@example.com>
Subject: Priceless Rubens works stolen in raid on mansion
Date: Mon, 30 Sep 2002 08:01:41 -0000
Content-Type: text/plain; encoding=utf-8
Lines: 6
X-Spam-Status: No, hits=-527.4 required=5.0
	tests=AWL,DATE_IN_PAST_03_06,T_URI_COUNT_0_1
	version=2.50-cvs
X-Spam

In [4]:
file_name = []
label = []

# Retrieving the data: walk every folder and label a file 1 if its path contains "spam"
for root,dirs,files in os.walk("data/"):
    for f in files:
        if "spam" in root:
            label.append(1)
        else:
            label.append(0)
        file_name.append(os.path.join(root,f))

In [5]:
# Putting data into dataframe
data = pd.DataFrame({"Message":file_name,"Target":label})
data

Unnamed: 0,Message,Target
0,data/spam/00249.5f45607c1bffe89f60ba1ec9f878039a,1
1,data/spam/0355.94ebf637e4bd3db8a81c8ce68ecf681d,1
2,data/spam/0395.bb934e8b4c39d5eab38f828a26f760b4,1
3,data/spam/0485.9021367278833179285091e5201f5854,1
4,data/spam/00373.ebe8670ac56b04125c25100a36ab0510,1
...,...,...
9348,data/easy_ham_2/00609.dd49926ce94a1ea328cce9b6...,0
9349,data/easy_ham_2/00957.e0b56b117f3ec5f85e432a9d...,0
9350,data/easy_ham_2/01127.841233b48eceb74a825417d8...,0
9351,data/easy_ham_2/01178.5c977dff972cd6eef64d4173...,0
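The labels in the table above come from checking whether the substring "spam" appears anywhere in the directory path. As a minimal sketch of a stricter rule, the hypothetical helper below (`label_for_dir` is not part of the original notebook) looks only at the first folder under the data directory, so a nested folder that merely mentions "spam" would not flip the label. The folder names used here are assumptions for illustration.

```python
import os

def label_for_dir(root: str, data_dir: str = "data") -> int:
    """Return 1 if the first folder under data_dir starts with 'spam', else 0."""
    # Path of root relative to the data directory, e.g. "spam_2" or "easy_ham/sub"
    rel = os.path.relpath(root, data_dir)
    top = rel.split(os.sep)[0]
    return 1 if top.startswith("spam") else 0

print(label_for_dir(os.path.join("data", "spam_2")))
print(label_for_dir(os.path.join("data", "easy_ham", "spam_notes")))
```

With the folders shown in this notebook, both rules should agree; the stricter version only matters if the directory layout changes.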


In [6]:
"""
Let's first count the content types of the messages we have
"""
from collections import Counter
types = Counter()
trigger = True  # print only the first multipart/mixed message found
for root,dirs,files in os.walk("data/"):
    for f in files:
        with open(os.path.join(root,f),'r',encoding='latin-1') as file_point:
            msg = email.message_from_file(file_point)
            type_ = msg.get_content_type()
            types[type_]+=1
            if type_ == 'multipart/mixed' and trigger:
                print(root,f)
                print("______________________")
                trigger = False
                SAMPLE = msg.get_payload()  # keep one multipart payload for inspection

print(types) 
print("********************************")
print("WARNING--Remember all the multipart (and html!!) messages!!") 

data/spam 0343.0630afbe4ee1ffd0db0ffb81c6de98de
______________________
Counter({'text/plain': 7413, 'text/html': 1193, 'multipart/alternative': 326, 'multipart/signed': 180, 'multipart/mixed': 179, 'multipart/related': 56, 'multipart/report': 5, 'text/plain charset=us-ascii': 1})
********************************


In [7]:
"""
Read all the messages in
"""
msgs = []
for root,dirs,files in os.walk("data/"):
    for f in files:
        with open(os.path.join(root,f),'r',encoding='latin-1') as file_point:
            msg = email.message_from_file(file_point)
            body = msg.get_payload()
            msgs.append(body)

 
print("WARNING--Remember all the multipart messages!!")
print("You need to address that for Case Study 3")       

You need to address that for Case Study 3


In [8]:
data['messages'] = msgs
data

Unnamed: 0,Message,Target,messages
0,data/spam/00249.5f45607c1bffe89f60ba1ec9f878039a,1,"Dear Homeowner,\n \nInterest Rates are at thei..."
1,data/spam/0355.94ebf637e4bd3db8a81c8ce68ecf681d,1,"[[Content-Type, Content-Transfer-Encoding], [C..."
2,data/spam/0395.bb934e8b4c39d5eab38f828a26f760b4,1,"[[Content-Type, Content-Transfer-Encoding], [C..."
3,data/spam/0485.9021367278833179285091e5201f5854,1,<html><head>\n<title>Congratulations! You Get ...
4,data/spam/00373.ebe8670ac56b04125c25100a36ab0510,1,ATTENTION: This is a MUST for ALL Computer Use...
...,...,...,...
9348,data/easy_ham_2/00609.dd49926ce94a1ea328cce9b6...,0,"I'm one of the 30,000 but it's not working ver..."
9349,data/easy_ham_2/00957.e0b56b117f3ec5f85e432a9d...,0,Damien Morton quoted:\n>W3C approves HTML 4 'e...
9350,data/easy_ham_2/01127.841233b48eceb74a825417d8...,0,"On Mon, 2002-07-22 at 06:50, che wrote:\n\n> t..."
9351,data/easy_ham_2/01178.5c977dff972cd6eef64d4173...,0,"Once upon a time, Manfred wrote :\n\n> I would..."
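Rows 1 and 2 in the table above hold lists of parts rather than text, because `get_payload()` returns a list for multipart messages. One possible way to handle this, sketched below, is to walk every part of the message and keep only the text parts. The helper name `extract_text` and the inline sample message are assumptions made for illustration; the notebook itself does not do this.

```python
import email
from email.message import Message

def extract_text(msg: Message) -> str:
    """Join all text/* parts of a message (plain and HTML) into one string."""
    chunks = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        if payload is None:
            continue
        # latin-1 matches the encoding used to read the files in this notebook
        chunks.append(payload.decode("latin-1", errors="replace"))
    return "\n".join(chunks)

# Small hand-written multipart message, used only to exercise the helper
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "Plain body\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>HTML body</p>\n"
    "--XX--\n"
)
sample = email.message_from_string(raw)
text = extract_text(sample)
print(text)
```

In the reading loop, `msgs.append(extract_text(msg))` could then replace `msgs.append(body)`, so that every row of `data['messages']` is a plain string. HTML tags would still be present and may need separate cleaning.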


In [9]:
#data.to_csv("spam_or_not.csv")

## Data Modeling

In [10]:
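from sklearn.feature_extraction.text import TfidfVectorizer

# Convert each message into a TF-IDF weighted bag-of-words vector.
# Note: multipart payloads are lists of parts, so astype('str') stringifies
# the list itself rather than the text inside each part.
vectorizer = TfidfVectorizer()
out = vectorizer.fit_transform(data['messages'].astype('str'))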
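To make the vectorization step concrete, here is a small sketch on three made-up messages (the toy corpus and the `toy_` names are assumptions, not data from this project). It shows that `fit_transform` returns a sparse matrix with one row per message and one column per vocabulary term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages, for illustration only
toy_messages = [
    "win a free prize now",
    "meeting notes for monday",
    "free prize inside, claim now",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_messages)

# One row per message, one column per distinct token of two or more characters
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

The default tokenizer drops single-character tokens such as "a", which is why the vocabulary is smaller than the raw word count.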

### GaussianNB

In [14]:
%%time
from sklearn.naive_bayes import GaussianNB

ng = GaussianNB()
ng.fit(out.toarray(),data['Target'])

CPU times: user 3.62 s, sys: 11.3 s, total: 15 s
Wall time: 22.7 s


In [16]:
%%time
from sklearn.model_selection import cross_val_score

accuracy_scores = cross_val_score(ng, out.toarray(), data['Target'], cv=5, n_jobs=1, scoring='accuracy')
mean_accuracy = accuracy_scores.mean()

print("Accuracy Scores:", accuracy_scores)
print("Mean Accuracy:", mean_accuracy)

Accuracy Scores: [0.84553715 0.96044896 0.9353287  0.93048128 0.93743316]
Mean Accuracy: 0.921845848683966
CPU times: user 17.2 s, sys: 44.3 s, total: 1min 1s
Wall time: 1min 28s
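`GaussianNB` needs a dense array, which is why the cells above call `out.toarray()` and take over a minute to cross-validate. As an alternative sketch (a different technique from the one used above, not the notebook's method), `MultinomialNB` accepts sparse TF-IDF input directly. The example also wraps the vectorizer and classifier in a pipeline so the vocabulary is learned only from each training fold, shuffles the stratified folds, and scores with F1 because the classes are imbalanced. The toy messages are made up, and whether shuffling would change the lower first-fold score seen above is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages, for illustration only
spam = [
    "win free prize now", "free cash offer click now", "claim your free prize",
    "cheap loans click here", "free offer win cash", "click now for cheap prize",
]
ham = [
    "meeting notes attached", "lunch on monday", "see the patch on the list",
    "notes from the kernel list", "monday meeting moved", "patch attached for review",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Vectorizer and classifier are refit inside every fold
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Shuffled, stratified folds keep the spam/ham ratio similar in each split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1")

print("F1 scores:", scores)
print("Mean F1:", scores.mean())
```

On the real data, `texts` would be `data['messages'].astype('str')` and `labels` would be `data['Target']`.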


### Clustering

In [23]:
from sklearn.cluster import KMeans 
from sklearn.metrics import silhouette_score

num_clusters = 2

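# n_init is set explicitly to the current default of 10
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
# KMeans is unsupervised, so the target labels are not passed to fit
kmeans.fit(out.toarray())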

data['Cluster_Label'] = kmeans.labels_

cluster_sizes = data['Cluster_Label'].value_counts()
print("Cluster Sizes:\n", cluster_sizes)

silhouette_avg = silhouette_score(out, kmeans.labels_)
print(f"Silhouette Score: {silhouette_avg}")


Cluster Sizes:
 Cluster_Label
1    8610
0     743
Name: count, dtype: int64
Silhouette Score: 0.04310823858063843


In [19]:
# Explore the contents of clusters
for cluster_id in range(num_clusters):
    cluster_messages = data[data['Cluster_Label'] == cluster_id]['Message']
    print(f"Cluster {cluster_id} Messages:\n", cluster_messages.head(5))

Cluster 0 Messages:
 1      data/spam/0355.94ebf637e4bd3db8a81c8ce68ecf681d
2      data/spam/0395.bb934e8b4c39d5eab38f828a26f760b4
5      data/spam/0343.0630afbe4ee1ffd0db0ffb81c6de98de
6     data/spam/00214.1367039e50dc6b7adb0f2aa8aba83216
10     data/spam/0112.ec411d26d1f4decc16af7ef73e69a227
Name: Message, dtype: object
Cluster 1 Messages:
 0    data/spam/00249.5f45607c1bffe89f60ba1ec9f878039a
3     data/spam/0485.9021367278833179285091e5201f5854
4    data/spam/00373.ebe8670ac56b04125c25100a36ab0510
7     data/spam/0125.44381546181fc6c5d7ea59e917f232c5
8    data/spam/00210.050ffd105bd4e006771ee63cabc59978
Name: Message, dtype: object
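The first rows of both clusters above are spam files, so the listing alone does not show how well the clusters line up with the labels. A cross-tabulation of cluster against target is one simple way to check this. The sketch below uses a tiny made-up frame with the same column names as `data`; the counts are illustrative only and are not results from this project.

```python
import pandas as pd

# Made-up labels and cluster assignments, for illustration only
toy = pd.DataFrame({
    "Target":        [1, 1, 1, 0, 0, 0, 0, 0],
    "Cluster_Label": [0, 0, 1, 1, 1, 1, 1, 1],
})

# Rows are clusters, columns are true labels (0 = not spam, 1 = spam)
table = pd.crosstab(toy["Cluster_Label"], toy["Target"])
print(table)
```

On the real data the equivalent call would be `pd.crosstab(data['Cluster_Label'], data['Target'])`.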
