<a href="https://www.kaggle.com/code/aisuko/build-a-new-dataset-for-binary-classification?scriptVersionId=210702495" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Let's create a new dataset for binary classification tasks. We will load public datasets from Hugging Face and Kaggle, combine them, and shuffle the result.

# Load the first dataset

In [1]:
from datasets import load_dataset

ds = load_dataset("shawhin/phishing-site-classification")
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 2100
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 450
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 450
    })
})

In [2]:
from datasets import concatenate_datasets

train_dataset = ds['train']
validation_dataset = ds['validation']
test_dataset = ds['test']

combined_dataset = concatenate_datasets([train_dataset, validation_dataset, test_dataset])
combined_dataset

Dataset({
    features: ['text', 'labels'],
    num_rows: 3000
})

In [3]:
combined_dataset = combined_dataset.rename_column('text', 'url')
combined_dataset

Dataset({
    features: ['url', 'labels'],
    num_rows: 3000
})

# Load the second dataset

In [4]:
ds2 = load_dataset("csv", data_files="/kaggle/input/phishing-and-legitimate-urls/new_data_urls.csv")
ds2

DatasetDict({
    train: Dataset({
        features: ['url', 'status'],
        num_rows: 822010
    })
})

In [5]:
ds2 = ds2.rename_column('status', 'labels')
ds2

DatasetDict({
    train: Dataset({
        features: ['url', 'labels'],
        num_rows: 822010
    })
})

In [6]:
train_ds = ds2['train']
train_ds

Dataset({
    features: ['url', 'labels'],
    num_rows: 822010
})

# Combine two datasets

In [7]:
final_train_ds = concatenate_datasets([train_ds, combined_dataset])
final_train_ds

Dataset({
    features: ['url', 'labels'],
    num_rows: 825010
})
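
Matching column names do not guarantee that a given label value means the same thing in both sources, and the outputs above do not confirm this either way. The following is a minimal sketch for comparing the label values of the two sources before relying on the merged data. The sample lists are made up and `label_counts` is a hypothetical helper, not part of the `datasets` library.

```python
from collections import Counter

def label_counts(labels):
    """Count how often each label value occurs in an iterable of labels."""
    return dict(Counter(labels))

# Made-up stand-ins for the two label columns; in the notebook these would be
# list(combined_dataset["labels"]) and list(train_ds["labels"]).
first_labels = [0, 1, 1, 0, 1]
second_labels = [1, 0, 0, 0]

first_counts = label_counts(first_labels)
second_counts = label_counts(second_labels)
print(first_counts)
print(second_counts)

# Both sources should use the same set of label values before they are merged.
same_values = set(first_counts) == set(second_counts)
print(same_values)
```

Identical value sets are only a first check. Which value denotes phishing still has to be read from each dataset's own documentation.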

# Shuffle the dataset

Here we use the `shuffle` method to randomly reorder the dataset, then use `train_test_split` to randomly drop 20% of the rows by keeping only the resulting `train` split.

In [8]:
shuffled_dataset = final_train_ds.shuffle(seed=42)

In [9]:
reduced_dataset = shuffled_dataset.train_test_split(test_size=0.2)['train']

# Split the dataset

Let's split the reduced dataset into:
* Training 80%
* Validation 10%
* Testing 10%

In [10]:
train_test_split = reduced_dataset.train_test_split(test_size=0.2)
train_test_split

DatasetDict({
    train: Dataset({
        features: ['url', 'labels'],
        num_rows: 528006
    })
    test: Dataset({
        features: ['url', 'labels'],
        num_rows: 132002
    })
})

In [11]:
# Further split the 'test' split into validation and test sets
val_test_split = train_test_split['test'].train_test_split(test_size=0.5)
val_test_split

DatasetDict({
    train: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
    test: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
})
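
The row counts shown above can be reproduced with a little arithmetic. This is a sketch that assumes the test share is rounded up and the remainder goes to the train split, which is consistent with the outputs in this notebook. `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up (an assumption)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

total = 825010                                  # rows after concatenation
reduced, dropped = split_sizes(total, 0.2)      # cell 9 keeps only the train part
train, holdout = split_sizes(reduced, 0.2)      # cell 10
validation, test = split_sizes(holdout, 0.5)    # cell 11

print(reduced, train, validation, test)
for name, n in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {n / reduced:.1%} of reduced, {n / total:.1%} of combined")
```

The 80/10/10 proportions are relative to the reduced dataset. Relative to the full combined dataset, the splits are smaller because 20% of the rows were dropped first.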

## Combine the splits into a new DatasetDict

In [12]:
from datasets import DatasetDict

# Create a new DatasetDict with train, validation, and test sets
final_splits = DatasetDict({
    'train': train_test_split['train'],
    'validation': val_test_split['train'],
    'test': val_test_split['test']
})
final_splits

DatasetDict({
    train: Dataset({
        features: ['url', 'labels'],
        num_rows: 528006
    })
    validation: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
    test: Dataset({
        features: ['url', 'labels'],
        num_rows: 66001
    })
})

# Publish on Hugging Face

In [13]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

# Read the Hugging Face token from Kaggle's secret store
user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [14]:
repo_id = "aisuko/phishing-binary-classification"

final_splits.push_to_hub(repo_id)

CommitInfo(commit_url='https://huggingface.co/datasets/aisuko/phishing-binary-classification/commit/7f24b4d0a887de03818e96ce621e11c6e0c5808f', commit_message='Upload dataset', commit_description='', oid='7f24b4d0a887de03818e96ce621e11c6e0c5808f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/aisuko/phishing-binary-classification', endpoint='https://huggingface.co', repo_type='dataset', repo_id='aisuko/phishing-binary-classification'), pr_revision=None, pr_num=None)

# References

* https://www.kaggle.com/datasets/harisudhan411/phishing-and-legitimate-urls/data
* https://huggingface.co/datasets/shawhin/phishing-site-classification