# Dataset Collection: Kaggle Download
---

## Kaggle Dataset Collection

This notebook queries and downloads **public Kaggle datasets** in `.csv` format.  
- API authentication is handled via the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables.
    - Create an account on: https://www.kaggle.com/
    - Go to: https://www.kaggle.com/settings
    - Find the `Create New Token` button and click it to download `kaggle.json`.
    - Set the variables `KAGGLE_USERNAME` and `KAGGLE_KEY` to the values in that file, or place the file at `~/.kaggle/kaggle.json`.
- Dataset listings are fetched from pages 400 to 419 of the Kaggle API, restricted to CSV datasets of at most 100,000 bytes, and filtered to include only **public datasets**.  
- Each dataset is downloaded and extracted into the local directory:  
  - `./kaggledatasets`  

⚠️ **Note:** The exact datasets retrieved depend on the **time of download** and Kaggle’s dataset availability. The `kaggledatasets.7z` archive is provided for reproducibility.
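
The credential setup described in the list above can also be done programmatically. The following is a minimal sketch, assuming the token file holds the two fields `username` and `key`; `load_kaggle_credentials` is a hypothetical helper and not part of the `kaggle` package.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(token_path=None):
    """Copy the username and key from a kaggle.json token into the environment.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    if token_path is None:
        # Default location mentioned in the setup steps.
        token_path = Path.home() / ".kaggle" / "kaggle.json"
    with Path(token_path).open(encoding="utf-8") as handle:
        token = json.load(handle)
    # Assumption: the token file stores the fields "username" and "key".
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]
    return token["username"]
```

If used, the helper should be called before `import kaggle`, because the package authenticates at import time.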

---


## Requirements

In [None]:
%pip install kaggle==1.5.12
%pip install pandas==2.2.3

## Start Download 

In [None]:
import os

# Insert the credentials from https://www.kaggle.com/settings here.
# They must be set before `import kaggle`, which authenticates on import.
os.environ['KAGGLE_USERNAME'] = 'see instructions'
os.environ['KAGGLE_KEY'] = 'see instructions'

import kaggle
import pandas as pd

dataset_directory = './kaggledatasets'

# Collect dataset listings (CSV only, at most 100,000 bytes each) from API pages 400-419
datasets_list_csv = []

for page_num in range(400, 420):
    datasets_page = kaggle.api.datasets_list(page=page_num, max_size=100000, filetype='csv')
    datasets_list_csv.extend(datasets_page)

# Build a DataFrame of the listings and keep only the public datasets
datasets_list_df = pd.DataFrame(datasets_list_csv)
public_datasets = datasets_list_df[~datasets_list_df['isPrivate'].astype(bool)]

print(public_datasets)

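# Download and unzip each public dataset into dataset_directory ('./kaggledatasets')
for index, row in public_datasets.iterrows():
    # Prefer the listing's 'ref' field ("owner/dataset-slug") when it is present;
    # rebuilding the slug from display names fails for names containing punctuation.
    dataset_name = row.get('ref')
    if not isinstance(dataset_name, str):
        owner_slug = row['ownerNameNullable'].replace(" ", "-").lower()
        dataset_slug = row['titleNullable'].replace(" ", "-").lower()
        dataset_name = f"{owner_slug}/{dataset_slug}"
    print(f"Downloading dataset: {dataset_name}")
    try:
        kaggle.api.dataset_download_files(dataset_name, path=dataset_directory, unzip=True)
    except Exception as e:
        # Log the failure and continue with the next dataset
        print(f"Error downloading dataset {dataset_name}: {e}")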
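
Because failed downloads are only logged and skipped, it can help to check afterwards which `.csv` files actually ended up in the download directory. The following is a minimal sketch using only the standard library; `summarize_csv_files` is a hypothetical helper, and the default path simply mirrors `dataset_directory` from the cell above.

```python
from pathlib import Path


def summarize_csv_files(directory="./kaggledatasets"):
    """Return one record per .csv file found under `directory`, searched recursively.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    root = Path(directory)
    records = []
    for csv_path in sorted(root.rglob("*.csv")):
        records.append({
            # Path relative to the download directory
            "file": str(csv_path.relative_to(root)),
            "size_bytes": csv_path.stat().st_size,
        })
    return records
```

The returned list of dictionaries can be passed to `pd.DataFrame(...)` for inspection, for example `pd.DataFrame(summarize_csv_files(dataset_directory))`.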