<div style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:2px;margin:0;">
<h3 style="margin:0;padding:5px;">Introduction</h3>
<i style="font-size:12px;">(Last update: Feb 27, 2025)</i>
</div>

<p style="background-color: #fffbe6; border-radius:5px;border:1px solid black;padding:10px;margin:0;">This notebook is a quick reference guide for accessing Kaggle datasets via the Kaggle API.</p>

In [1]:
#pip install kaggle
#!pip install jupyterthemes

In [2]:
from kaggle.api.kaggle_api_extended import KaggleApi

In [3]:
#to load the json files.
import json
#for a neat display of json output
import pprint

<h4 style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:10px;margin:0;display: inline-block;">1. Instantiation and Authentication</h4>

<div style="background-color: #fffbe6; border-radius:5px;border:1px solid black;padding:7px;margin:0;">
<ul>
    <li><b><code>KaggleApi()</code>: </b> creates an instance of the <code>KaggleApi</code> class, basically an empty client for the Kaggle server.</li>
    <li><b><code>api.authenticate()</code>: </b> finds the API-key JSON file at <b>".kaggle/kaggle.json"</b> inside the user's home folder (on Windows, <code>C:\Users\&lt;username&gt;\.kaggle\kaggle.json</code>) and uses it to authenticate with the server.</li>
</ul>
</div>
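
Before calling <code>api.authenticate()</code>, it can help to check where the credentials would come from. The sketch below is not part of the Kaggle client: <code>find_kaggle_credentials</code> is a hypothetical helper, and it assumes the commonly documented lookup (the <code>KAGGLE_USERNAME</code>/<code>KAGGLE_KEY</code> environment variables first, then <code>kaggle.json</code> in <code>KAGGLE_CONFIG_DIR</code> or <code>~/.kaggle</code>). Some client versions may look in other places as well.

```python
import os
from pathlib import Path


def find_kaggle_credentials():
    """Report where Kaggle credentials appear to come from (hypothetical helper).

    Returns "environment variables", the path of kaggle.json, or None.
    The key itself is never read or printed.
    """
    # Assumption: both variables set means the client does not need the file.
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return "environment variables"
    # Assumption: KAGGLE_CONFIG_DIR overrides the default ~/.kaggle folder.
    config_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))
    key_file = config_dir / "kaggle.json"
    return str(key_file) if key_file.is_file() else None


print(find_kaggle_credentials())
```

If this prints <code>None</code>, authentication is likely to fail until the key file is downloaded from the Kaggle account settings page and placed in that folder.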

In [4]:
#make an instance of the api-class, this is an empty client right now
api = KaggleApi()
#start the authentication process.
api.authenticate()

<div style="background-color: #fffbe6; border-radius:5px;border:1px solid black;padding:10px;margin:0;">
<h4>2. See the dataset slugs without visiting the Kaggle website</h4>
<ul>
    <li><b><code>api.dataset_list()</code>: </b>searches Kaggle datasets; several filters can be added to the query along with the keyword.</li>
    <li><b><code>search="customer"</code>: </b>the keyword to search for.</li>
    <li>Returns a list of dataset objects matching the keyword, one page of results per call (20 in the output below). Printing an object shows its slug.</li>
</ul>
</div>
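
Since a single call returns only one page of results, a small loop can gather more. This is a minimal sketch: <code>collect_dataset_refs</code> is a hypothetical helper, and it assumes <code>dataset_list</code> accepts a <code>page</code> argument and returns an empty list once the results run out.

```python
def collect_dataset_refs(api, keyword, max_pages=3):
    """Gather dataset refs ("owner/name") across several result pages."""
    refs = []
    for page in range(1, max_pages + 1):
        # Assumption: `page` is 1-based and selects the result page.
        results = api.dataset_list(search=keyword, page=page)
        if not results:  # an empty page means there is nothing more to fetch
            break
        refs.extend(dataset.ref for dataset in results)
    return refs


# Usage, with the authenticated client from this notebook:
# refs = collect_dataset_refs(api, "customer", max_pages=2)
# print(len(refs))
```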

In [5]:
#search for datasets matching the "customer" keyword (returns dataset objects, not strings)
customer_datasets = api.dataset_list(search="customer")

In [6]:
type(customer_datasets[1])

kaggle.models.kaggle_models_extended.Dataset

In [7]:
#print each slug
for index, dataset in enumerate(customer_datasets):
    print(f"{index+1}. {dataset}")

1. imakash3011/customer-personality-analysis
2. blastchar/telco-customer-churn
3. vetrirah/customer
4. datascientistanna/customers-dataset
5. barun2104/telecom-churn
6. vjchoudhary7/customer-segmentation-tutorial-in-python
7. iamsouravbanerjee/customer-shopping-trends-dataset
8. sakshigoyal7/credit-card-customers
9. abisheksudarshan/customer-segmentation
10. abhinav89/telecom-customer
11. radheshyamkollipara/bank-customer-churn
12. kaushiksuresh147/customer-segmentation
13. mahirahmzh/starbucks-customer-retention-malaysia-survey
14. dev0914sharma/customer-clustering
15. sjleshrac/airlines-customer-satisfaction
16. thoughtvector/customer-support-on-twitter
17. muhammadshahidazeem/customer-churn-dataset
18. hanaksoy/customer-purchasing-behaviors
19. ihormuliar/starbucks-customer-data
20. joebeachcapital/customer-segmentation


<div style="background-color: #fffbe6; border-radius:10px;border:1px solid black;padding:5px;margin:0;">
Each dataset in the list is <b>not</b> of type string.<br>
Each dataset is an object with properties like <code style="background-color:yellow; padding:2px 5px;">dataset.<b>title</b></code>, <code style="background-color:yellow; padding:2px 5px;">dataset.<b>ref</b></code> etc.
</div>
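
Because each item is an object, the slug and the title can be pulled out as plain strings when needed. A minimal sketch, where <code>describe_datasets</code> is a hypothetical helper that reads only the <code>ref</code> and <code>title</code> properties mentioned above:

```python
def describe_datasets(datasets):
    """Turn dataset objects into plain dicts holding their ref and title."""
    rows = []
    for dataset in datasets:
        rows.append({
            # getattr with a default keeps the helper from failing
            # if a property is missing on some client version.
            "ref": getattr(dataset, "ref", None),
            "title": getattr(dataset, "title", None),
        })
    return rows


# Usage, with the search result from this notebook:
# for row in describe_datasets(customer_datasets[:3]):
#     print(row["ref"], "|", row["title"])
```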

<h4 style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:10px;margin:0;display: inline-block;">3. Download the dataset using the slug.</h4>

In [8]:
#download one dataset
api.dataset_download_files('imakash3011/customer-personality-analysis', path='./', unzip=True)

Dataset URL: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis


<div style="background-color: #fffbe6; border-radius:5px;border:1px solid black;padding:6px;margin:0;">
<ul>
    <li><b><code>path="./"</code>: </b> downloads the dataset into the current directory.</li>
    <li><b><code>unzip=True</code>: </b> unzips the dataset after download.</li>
</ul>
</div>
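
Downloading into the current directory mixes the data files with the notebook. One option is to give each download its own folder and list what arrived. This is a sketch under assumptions: <code>download_to_folder</code> and the <code>./data</code> folder name are hypothetical, and only the <code>path</code> and <code>unzip</code> arguments shown above are used.

```python
from pathlib import Path


def download_to_folder(api, dataset_ref, folder="./data"):
    """Download a dataset into its own folder and return the file names found there."""
    target = Path(folder)
    # Create the folder (and any parents) if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    api.dataset_download_files(dataset_ref, path=str(target), unzip=True)
    # List regular files only, in a stable order.
    return sorted(p.name for p in target.iterdir() if p.is_file())


# Usage, with the authenticated client from this notebook:
# print(download_to_folder(api, "imakash3011/customer-personality-analysis"))
```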

<h4 style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:10px;margin:0;display: inline-block;">4. Download the metadata (json file)</h4>

In [9]:
#download a json file with details about the dataset (owner, size, etc.); the call returns the file path
dt_metadata = api.dataset_metadata('imakash3011/customer-personality-analysis', path='./')

In [10]:
print(dt_metadata)

./dataset-metadata.json


<h4 style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:10px;margin:0;display: inline-block;">5. Open the metadata file</h4>

In [11]:
#open the downloaded metadata
with open("./dataset-metadata.json", "r") as file:
    metadata = json.load(file)

In [12]:
#print the metadata (it is a dict, so it cannot be sliced like a list)
#pprint.pprint(metadata)
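
The metadata can be long, so a shortened preview per key is often easier to read than the full dump. A minimal sketch, assuming the file holds a JSON object at the top level; <code>preview_metadata</code> is a hypothetical helper and no particular key names are assumed:

```python
import json


def preview_metadata(path="./dataset-metadata.json", max_len=60):
    """Load the metadata file and return each top-level key with a shortened value."""
    with open(path, "r", encoding="utf-8") as handle:
        metadata = json.load(handle)
    preview = {}
    for key, value in metadata.items():
        # Serialize the value so nested lists and dicts can be truncated too.
        text = json.dumps(value, ensure_ascii=False)
        preview[key] = text if len(text) <= max_len else text[:max_len] + "..."
    return preview


# Usage, after the metadata file has been downloaded:
# for key, value in preview_metadata().items():
#     print(f"{key}: {value}")
```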

<h4 style="background-color: #f5e6a9; border-radius:5px;border:1px solid black;padding:10px;margin:0;display: inline-block;">6. Checking for multiple files within a dataset</h4>

In [13]:
#list all files present in the first 5 datasets
for index, dataset in enumerate(customer_datasets[:5]):
    #only extract the slug from dataset object
    dataset_slug = dataset.ref
    files = api.dataset_list_files(dataset_slug)
    print(f"{index+1}. {dataset}: number of files in it = {len(files.files)}")
    #print name of each file inside them
    for file in files.files:
        print(f"    * {file.name} ({file.size})")


1. imakash3011/customer-personality-analysis: number of files in it = 1
    * marketing_campaign.csv (215KB)
2. blastchar/telco-customer-churn: number of files in it = 1
    * WA_Fn-UseC_-Telco-Customer-Churn.csv (955KB)
3. vetrirah/customer: number of files in it = 3
    * Test.csv (130KB)
    * Train.csv (415KB)
    * sample_submission.csv (23KB)
4. datascientistanna/customers-dataset: number of files in it = 1
    * Customers.csv (73KB)
5. barun2104/telecom-churn: number of files in it = 1
    * telecom_churn.csv (126KB)
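
To keep only the datasets that ship more than one file, the same calls can be collected into a dictionary first. A sketch under assumptions: both helper names are hypothetical, and the result of <code>dataset_list_files</code> is assumed to expose a <code>files</code> list whose items have a <code>name</code>, as in the cell above.

```python
def files_per_dataset(api, dataset_refs):
    """Map each dataset ref to the names of the files it contains."""
    listing = {}
    for ref in dataset_refs:
        result = api.dataset_list_files(ref)
        listing[ref] = [f.name for f in result.files]
    return listing


def multi_file_datasets(listing):
    """Return the refs whose dataset holds more than one file."""
    return [ref for ref, names in listing.items() if len(names) > 1]


# Usage, with the objects from this notebook:
# listing = files_per_dataset(api, [d.ref for d in customer_datasets[:5]])
# print(multi_file_datasets(listing))
```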
