# Data loading and train/test split
This notebook loads `dataset/PhiUSIIL_Phishing_URL_Dataset.csv`, flips the `label` column so that 1 means phishing and 0 means legitimate, performs a stratified 70/30 train-test split (`random_state=42`), and writes the results to `dataset/train.csv` and `dataset/test.csv`.

In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Seed NumPy's global RNG; the split below is seeded separately via random_state.
np.random.seed(42)

In [22]:
# Load dataset
df = pd.read_csv('dataset/PhiUSIIL_Phishing_URL_Dataset.csv')
print(f'Loaded dataset with {len(df)} rows and {len(df.columns)} columns')

Loaded dataset with 235795 rows and 56 columns


In [23]:
# Flip the labels so that label=1 means phishing and label=0 means legitimate
# (the raw file uses the opposite encoding).
df['label'] = (df['label'] == 0).astype(int)
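
The flip can be sanity-checked on a small frame before trusting it on the full dataset. The sketch below uses a hypothetical toy DataFrame (not rows from the real file) and assumes the labels are strictly 0 or 1; under that assumption the mapping "0 becomes 1, everything else becomes 0" is the same as `1 - label`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({'label': [1, 0, 0, 1, 1]})

# Same mapping as the cell above: 0 becomes 1, any other value becomes 0.
flipped = (toy['label'] == 0).astype(int)

# For strictly binary labels this is equivalent to 1 - label.
assert flipped.tolist() == (1 - toy['label']).tolist()

# Flipping swaps the class counts, so the number of 1s after the flip
# equals the number of 0s before it.
ones_after = int(flipped.sum())
zeros_before = int((toy['label'] == 0).sum())
print(flipped.tolist(), ones_after, zeros_before)
```

If the real column could contain values other than 0 and 1, the two forms would diverge, so a check such as `df['label'].isin([0, 1]).all()` before flipping may be worthwhile.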

In [24]:
train, test = train_test_split(
    df,
    test_size=0.30,
    random_state=42,
    shuffle=True,
    stratify=df['label'],
)
print(f'Train rows: {len(train)}, Test rows: {len(test)}')

Train rows: 165056, Test rows: 70739
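
The printed row counts are consistent with the test size being rounded up, ceil(0.30 × 235795) = 70739, with the remaining rows going to train. The sketch below checks that arithmetic and runs the same kind of stratified split on a toy frame. The toy data and the names `toy`, `toy_train`, and `toy_test` are illustrative assumptions, not part of the dataset.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Row-count arithmetic using the figures printed in this notebook.
n_rows = 235795
n_test = math.ceil(0.30 * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)

# Toy frame (hypothetical data): 60 rows of class 0 and 40 rows of class 1.
toy = pd.DataFrame({'feature': range(100), 'label': [0] * 60 + [1] * 40})

toy_train, toy_test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy['label']
)

# With stratify, both parts keep the 60/40 class ratio of the full frame.
print(len(toy_train), len(toy_test))
print(toy_train['label'].mean(), toy_test['label'].mean())
```

Fixing `random_state` makes the split reproducible across runs, which matters here because the resulting CSV files are written to disk and reused.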


In [25]:
# check class distribution
print('Class distribution in train set:')
print(train['label'].value_counts(normalize=True))
print('Class distribution in test set:')
print(test['label'].value_counts(normalize=True))

Class distribution in train set:
label
0    0.571897
1    0.428103
Name: proportion, dtype: float64
Class distribution in test set:
label
0    0.571891
1    0.428109
Name: proportion, dtype: float64
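
The two distributions above agree to about five decimal places, which is what a stratified split should produce. As a rough sketch, the gap can be computed directly from the printed proportions; the dictionary names and the `1e-5` tolerance are arbitrary choices for illustration.

```python
# Proportions copied from the output above.
train_prop = {0: 0.571897, 1: 0.428103}
test_prop = {0: 0.571891, 1: 0.428109}

# Absolute difference per class between train and test.
gaps = {cls: abs(train_prop[cls] - test_prop[cls]) for cls in train_prop}
max_gap = max(gaps.values())

# Illustrative tolerance; tighten or loosen as needed.
tolerance = 1e-5
print(gaps, max_gap < tolerance)
```

A small gap like this is expected because the class counts cannot always be divided exactly 70/30.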


In [26]:
train.to_csv('dataset/train.csv', index=False)
test.to_csv('dataset/test.csv', index=False)
print('Saved train and test CSV files to dataset/')

Saved train and test CSV files to dataset/
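
A quick round-trip check can confirm that a written CSV reloads with the same shape, columns, and values. The sketch below is a minimal example on a hypothetical toy frame written to a temporary directory, so it does not touch the files under `dataset/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for `train`; columns and values are illustrative only.
toy = pd.DataFrame({'feature': [31, 23, 29], 'label': [1, 0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy.csv'
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_values = reloaded.values.tolist() == toy.values.tolist()
print(same_shape, same_columns, same_values)
```

Without `index=False`, the reloaded frame would gain an extra unnamed index column, so the column check would fail.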
