# Data loading and train/test split
This notebook loads the dataset `dataset/kaggle_phishing_dataset.csv`, reduces it to the `url` and `target` columns, drops duplicate rows, performs a stratified 80/20 train-test split (random_state=42), and writes `dataset/train.csv` and `dataset/test.csv`.

In [16]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)

In [17]:
# Load dataset
df = pd.read_csv('dataset/kaggle_phishing_dataset.csv')
print(f'Loaded dataset with {len(df)} rows and {len(df.columns)} columns')

Loaded dataset with 11430 rows and 89 columns


In [18]:
# Preprocess: encode the status column as a binary target and keep only url and target
def preprocess(df):
    # map status labels to integers: legitimate -> 0, phishing -> 1
    status_mapping = {'legitimate': 0, 'phishing': 1}
    # assign() returns a new frame, so the caller's DataFrame is not modified
    df = df.assign(target=df['status'].map(status_mapping))

    # keep only the 'url' and 'target' columns
    return df[['url', 'target']].copy()

In [19]:
df = preprocess(df)
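
`Series.map` with a dict returns NaN for any value that is not a key, so a misspelled or differently cased status label would silently produce a missing target. The following is a minimal sketch of how that could be checked; the `toy` frame and its mis-cased `'Phishing'` row are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset; the 'Phishing'
# row is a deliberately mis-cased label to show what the check catches.
toy = pd.DataFrame({
    'url': ['http://a.example', 'http://b.example', 'http://c.example'],
    'status': ['legitimate', 'phishing', 'Phishing'],
})

status_mapping = {'legitimate': 0, 'phishing': 1}
mapped = toy['status'].map(status_mapping)

# Series.map yields NaN for any value missing from the dict
unmapped = toy.loc[mapped.isna(), 'status'].unique().tolist()
print(f'Unmapped status values: {unmapped}')
```

An empty list here would mean every row received a 0 or 1 target.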

In [20]:
df.head()

                                                 url  target
0              http://www.crestonwood.com/router.php       0
1  http://shadetreetechnology.com/V4/validation/a...       1
2  https://support-appleld.com.secureupdate.duila...       1
3                                 http://rgipt.ac.in       0
4  http://www.iracing.com/tracks/gateway-motorspo...       0


In [21]:
df.duplicated().sum()

np.int64(1)
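
`duplicated()` only flags rows that match an earlier row on every column, so it counts exact (`url`, `target`) repeats. A possible way to look at the duplicate group itself, and to check for URLs that appear with two different labels (which deduplication would not remove), is sketched below on a made-up `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame: the first and last rows are exact duplicates
toy = pd.DataFrame({
    'url': ['http://a.example', 'http://b.example', 'http://a.example'],
    'target': [0, 1, 0],
})

# keep=False marks every member of a duplicate group, not just the repeats
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Rows that share a URL but disagree on the label would survive
# drop_duplicates(); count the distinct labels per URL to find them
labels_per_url = toy.groupby('url')['target'].nunique()
conflicting = labels_per_url[labels_per_url > 1].index.tolist()
print(f'URLs with conflicting labels: {conflicting}')
```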

In [22]:
# drop duplicated rows
df = df.drop_duplicates().reset_index(drop=True)

In [23]:
train, test = train_test_split(df, test_size=0.20, random_state=42, shuffle=True, stratify=df['target'])
print(f'Train rows: {len(train)}, Test rows: {len(test)}')

Train rows: 9143, Test rows: 2286
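
Since the model is meant to be evaluated on unseen URLs, it can be worth confirming that no URL lands in both splits. The sketch below runs the same `train_test_split` call on a made-up ten-row frame and intersects the URL sets; the `toy_*` names are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with ten unique URLs and balanced labels
toy = pd.DataFrame({
    'url': [f'http://site{i}.example' for i in range(10)],
    'target': [0, 1] * 5,
})

# Same arguments as the split above, applied to the toy data
toy_train, toy_test = train_test_split(
    toy, test_size=0.20, random_state=42, shuffle=True, stratify=toy['target']
)

# Any URL in this set would leak from the training set into the test set
overlap = set(toy_train['url']) & set(toy_test['url'])
print(f'URLs present in both splits: {len(overlap)}')
```

On the real data the same intersection could be taken between `train['url']` and `test['url']`.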


In [24]:
# check class distribution
print('Class distribution in train set:')
print(train['target'].value_counts(normalize=True))
print('Class distribution in test set:')
print(test['target'].value_counts(normalize=True))

Class distribution in train set:
target
0    0.500055
1    0.499945
Name: proportion, dtype: float64
Class distribution in test set:
target
1    0.5
0    0.5
Name: proportion, dtype: float64


In [25]:
train.to_csv('dataset/train.csv', index=False)
test.to_csv('dataset/test.csv', index=False)
print('Saved train and test CSV files to dataset/')

Saved train and test CSV files to dataset/
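
Because the files are written with `index=False`, reading them back should give exactly the `url` and `target` columns with no extra index column. A small round-trip check along these lines is sketched below; it writes a made-up frame to a temporary directory rather than to `dataset/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for the real train split
toy_train = pd.DataFrame({
    'url': ['http://a.example', 'http://b.example'],
    'target': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'train.csv'
    # index=False keeps the row index out of the file
    toy_train.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(f'Reloaded shape: {reloaded.shape}, columns: {list(reloaded.columns)}')
```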
