<a href="https://colab.research.google.com/github/Octaxx/DLI/blob/main/Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
import pandas as pd
from sklearn.utils import resample


# STEP 1: Load the dataset
url = "https://raw.githubusercontent.com/Octaxx/DLI/main/CEAS_08.csv"
df = pd.read_csv(url)



In [5]:
# STEP 2: Drop every row that has a missing value in any column
df = df.dropna()

# STEP 3: Check for missing values after cleaning
print("Missing values after dropna():\n")
print(df.isnull().sum())
print("\n")

# STEP 4: Print total rows after cleaning
print(f"Total rows after cleaning: {len(df)}\n")

Missing values after dropna():

sender      0
receiver    0
date        0
subject     0
body        0
label       0
urls        0
dtype: int64


Total rows after cleaning: 38669
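
Before dropping rows, it can help to see how many values are missing in each column and how many rows `dropna()` actually removes. The following is a minimal sketch on a toy frame. The helper name `missing_report` and the sample rows are made up for illustration and are not part of the CEAS_08 data.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column missing counts and percentages."""
    counts = frame.isnull().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })


# Toy data standing in for the real email table (illustrative values only).
toy = pd.DataFrame({
    "subject": ["hello", None, "offer", "re: patch"],
    "body": ["text", "text", None, "text"],
    "label": [0, 1, 1, 0],
})

report = missing_report(toy)
print(report)

# Compare row counts before and after dropping incomplete rows.
rows_before = len(toy)
cleaned = toy.dropna()
rows_removed = rows_before - len(cleaned)
print(f"Rows removed by dropna(): {rows_removed} of {rows_before}")
```

Running the same report on `df` before STEP 2 would show which columns account for the dropped rows.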



In [12]:
# STEP 5: Check and print original class counts
label_counts = df['label'].value_counts()  # compute once, reuse for both classes
print("Original Email Type Counts:\n")
print(f"Phishing emails (1): {label_counts.get(1, 0)}")
print(f"Safe emails (0): {label_counts.get(0, 0)}\n")

# STEP 6: Show a few examples
print("Phishing emails examples:\n")
display(df[df['label'] == 1][['subject', 'body']].head(5))
print("\nSafe emails examples:\n")
display(df[df['label'] == 0][['subject', 'body']].head(5))
print("\n")


Original Email Type Counts:

Phishing emails (1): 21842
Safe emails (0): 17312

Phishing emails examples:



index,subject,body
0,Never agree to be a loser,"Buck up, your troubles caused by small dimensi..."
1,Befriend Jenna Jameson,\nUpgrade your sex and pleasures with these te...
2,CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...
4,SpecialPricesPharmMoreinfo,\nWelcomeFastShippingCustomerSupport\nhttp://7...
5,From Caroline Aragon,\n\n\n\n\nYo wu urS mo ou go rc ebo eForM rgi ...



Safe emails examples:



index,subject,body
3,Re: svn commit: r619753 - in /spamassassin/tru...,Would anyone object to removing .so from this ...
8,[Bug 5780] URI processing turns uuencoded stri...,http://issues.apache.org/SpamAssassin/show_bug...
15,RE: Trial IRC Certificate Application,"\nPlelim,\n\nJust to remind you that if a cert..."
18,"Re: [opensuse] Why can't I use ""shutdown now"" ...",Carlos E. R. wrote: > -----BEGIN PGP SIGNED ME...
19,Re: Fwd: [opensuse] Re: openSUSE Boxed Editions,Steve Jacobs wrote: > ---------- Forwarded mes...
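
The class counts printed above can be turned into shares and an imbalance ratio, which makes it easier to judge whether rebalancing is needed. The sketch below uses only the two counts from that output. Note that they sum to 39,154, more than the 38,669 rows reported after `dropna()`. Together with the execution counters (`In [5]` for the cleaning cell, `In [11]` and `In [12]` for loading and counting), this suggests the counts may have been computed on a freshly reloaded, uncleaned frame. Re-running the cells top to bottom should settle which figure applies.

```python
import pandas as pd

# Counts copied from the printed output above; no other data is assumed.
counts = pd.Series({1: 21842, 0: 17312})

total = int(counts.sum())
shares = (counts / total * 100).round(2)
imbalance_ratio = round(counts.max() / counts.min(), 3)

print(f"Total labelled rows: {total}")
print(f"Phishing share: {shares.loc[1]}%")
print(f"Safe share: {shares.loc[0]}%")
print(f"Majority/minority ratio: {imbalance_ratio}")
```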






In [13]:
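# STEP 7: Balance the classes by upsampling
# Phishing (label 1) is the larger class here; safe (label 0) is the smaller one
majority = df[df['label'] == 1]
minority = df[df['label'] == 0]

# Upsample the minority class: draw rows with replacement until it matches the majority size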
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

# Combine majority and upsampled minority
df_balanced = pd.concat([majority, minority_upsampled])

# Shuffle the dataset
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# STEP 8: Check balanced class counts
balanced_counts = df_balanced['label'].value_counts()  # compute once, reuse for both classes
print("✅ Balanced Email Type Counts:\n")
print(f"Phishing emails (1): {balanced_counts.get(1, 0)}")
print(f"Safe emails (0): {balanced_counts.get(0, 0)}\n")


# STEP 9: Show first 100 rows (optional)
df_balanced.head(100)

✅ Balanced Email Type Counts:

Phishing emails (1): 21842
Safe emails (0): 21842



index,sender,receiver,date,subject,body,label,urls
0,Gustavo Harden <dwtelebrokersm@telebrokers.com>,user2.6@gvc.ceas-challenge.cc,"Thu, 07 Aug 2008 09:50:39 +0300",From Gustavo Harden,\n\n\n\n\n\n\nP gik harm qos acy\n\nVis zn it ...,1,1
1,Mariana Mohr <MarianaabramsonTabor@raisingkain...,user2.1@gvc.ceas-challenge.cc,"Wed, 06 Aug 2008 23:27:35 -0100","Produce Stronger, Rock Hard Erections.",\nPut on an average gain of 3.02 inches where ...,1,1
2,Garry Keller <Garry@rd.com>,user2.15@gvc.ceas-challenge.cc,"Thu, 07 Aug 2008 04:17:03 -0500",Cut prices for high-rank accessories,Craftsmanship of the highest level made it pos...,1,1
3,maria sala williams <eetmxpaxq@interfree.it>,pvosgpr@triptracker.net,"Thu, 07 Aug 2008 17:47:28 +0100",help,"\n\n\n\n\n\n\nI really like your slideshow, th...",0,0
4,Jeff Bone <zuutf@place.org>,Friends of Rohit Khare <wsye@xent.com>,"Wed, 06 Aug 2008 02:32:08 -0500","[FoRK] ""Because the internet needs prophylacti...",\nThis has legs:\n\n http://stupidfilter.org...,0,1
...,...,...,...,...,...,...,...
95,Nick Coghlan <uytankmf@gmail.com>,iybz@pobox.com,"Fri, 08 Aug 2008 19:17:14 +1000",Re: [Python-3000] [Python-Dev] Reminder: last ...,iybz@pobox.com wrote:\n> Fred> If user-loc...,0,1
96,Daily Top 10 <unsectio@mhasociados.net>,email1363@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 21:43:55 -0400",CNN.com Daily Top 10,>+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+...,1,1
97,"""Jos I. Boumans"" <hxh@dwim.org>",P5P Porters <qtog3-epoawmh@perl.org>,"Thu, 07 Aug 2008 08:32:57 +0100",[PATCH] Update Term::UI to 0.18,"Greetings,\n\nattached is the patch to update ...",0,1
98,Stanley Dolan <dwteamupm@teamup.com>,user2.6@gvc.ceas-challenge.cc,"Tue, 05 Aug 2008 21:07:12 -0500",From Stanley Dolan,\n\n\n\n\n Vi max si kxy t Our Wi rw de Ra ijr...,1,0
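
Upsampling with replacement duplicates safe emails, so if `df_balanced` is later split into training and test sets, copies of the same email can land on both sides. One common way around this is to hold out the test rows first and upsample only the training partition. The sketch below illustrates that ordering on toy data. It is an assumption about how the balanced set might be used downstream, not something the notebook itself does, and it uses pandas' `DataFrame.sample(replace=True)` in place of `sklearn.utils.resample`. The 50% split fraction and the toy rows are illustrative, and `groupby(...).sample` needs pandas 1.1 or newer.

```python
import pandas as pd

# Toy imbalanced frame (illustrative): 6 phishing rows, 4 safe rows.
toy = pd.DataFrame({
    "body": [f"mail {i}" for i in range(10)],
    "label": [1] * 6 + [0] * 4,
})

# Hold out a stratified test set first (50% of each class here).
test = toy.groupby("label").sample(frac=0.5, random_state=42)
train = toy.drop(test.index)

# Identify the larger and smaller class inside the training partition.
train_counts = train["label"].value_counts()
majority_label = train_counts.idxmax()
minority_label = train_counts.idxmin()

train_major = train[train["label"] == majority_label]
train_minor = train[train["label"] == minority_label]

# Upsample the minority class with replacement, training rows only.
train_minor_up = train_minor.sample(
    n=len(train_major), replace=True, random_state=42
)

# Combine and shuffle, mirroring STEP 7.
train_balanced = (
    pd.concat([train_major, train_minor_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

# No original row should appear in both partitions.
overlap = set(train_balanced["body"]) & set(test["body"])
print(train_balanced["label"].value_counts())
print(f"Rows shared between train and test: {len(overlap)}")
```

The test partition keeps its original class proportions, so evaluation metrics reflect the real distribution rather than the rebalanced one.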
