<a href="https://colab.research.google.com/github/TheDataFestAI/Learning_Resources/blob/main/learning_poc/download_data_from_kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download Data From Kaggle

References:
1. Downloading data from Kaggle into Colab:
    1. https://www.kaggle.com/discussions/general/74235

2. Kaggle API:
    1. https://github.com/Kaggle/kaggle-api


## Step 1: Generate a New API Access Token from Your Kaggle Personal Account

1. Sign in to https://kaggle.com/, then click on your profile picture on the top right and select "My Account" from the menu.

2. Scroll down to the "API" section and click "Create New API Token". This downloads a file named `kaggle.json` with the following contents:

    ```json
    {"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}
    ```
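Before uploading the token, it can help to confirm that the downloaded file really has the two fields shown above. The following is a minimal sketch, not part of the Kaggle tooling: the helper name `load_kaggle_credentials` is hypothetical, and it only assumes the file is JSON with `username` and `key` entries.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path):
    """Read a kaggle.json-style file and return (username, key).

    Hypothetical helper: raises ValueError if either field is missing or empty.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [field for field in ("username", "key") if not data.get(field)]
    if missing:
        raise ValueError(f"missing field(s) in {path}: {', '.join(missing)}")
    return data["username"], data["key"]
```

Calling `load_kaggle_credentials("kaggle.json")` on a valid token file returns the username and key as a tuple; avoid printing the key in a shared notebook.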

## Step 2: Install the `kaggle` Python Package

In [None]:
! pip install -q kaggle

## Step 3: Upload the `kaggle.json` file into the Colab local directory

In [None]:
"""
https://github.com/googlecolab/colabtools/blob/main/google/colab/files.py
"""
from google.colab import files

kaggle_filename = "kaggle.json"

# "_upload_file()" is used so the uploaded file is saved in Colab under the given filename
# its return value is stored in "out" so that the file content is not displayed in the cell output
out = files._upload_file(filepath=kaggle_filename)

## Step 4: Create `~/.kaggle/` dir and move `kaggle.json` there

In [None]:
"""
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/

The code below replicates the above Unix commands with Python's 'os' and 'pathlib' modules
(note: it moves the file instead of copying it)
"""
import os
from pathlib import Path


home_dir = os.path.expanduser('~')
kaggle_key_dir = os.path.join(home_dir, ".kaggle")
source_kaggle_json = os.path.abspath(os.path.join(os.getcwd(), kaggle_filename))
dest_kaggle_json = os.path.join(kaggle_key_dir, kaggle_filename)

# # create the `~/.kaggle/` directory
# "os.path.expanduser('~')" returns the home directory path
# alternative check: if ".kaggle" in os.listdir(path=home_dir):
# not used, because listdir() does more work when the home directory has many entries
if os.path.exists(kaggle_key_dir):
    print(f"`{kaggle_key_dir}` dir is already present")
else:
    os.makedirs(name=kaggle_key_dir, exist_ok=True)
    print(f"`{kaggle_key_dir}` dir has been created")

# # move `kaggle.json` into `~/.kaggle/` directory
if os.path.isfile(source_kaggle_json) and not os.path.isfile(dest_kaggle_json):
    Path(source_kaggle_json).rename(dest_kaggle_json)
    print(f"`{source_kaggle_json}` moved to `{dest_kaggle_json}`")
else:
    print(f"{source_kaggle_json} doesn't exist and/or {dest_kaggle_json} is already present")

`/root/.kaggle` dir has been created
`/content/kaggle.json` moved to `/root/.kaggle/kaggle.json`
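The same create-then-move step can also be sketched with `pathlib` and `shutil` only. This is an alternative under stated assumptions, not the notebook's method: the function name `install_credentials` is hypothetical, and `shutil.move` is swapped in for `Path.rename` because a plain rename can fail when source and destination are on different filesystems.

```python
import shutil
from pathlib import Path


def install_credentials(source, dest_dir):
    """Move `source` into `dest_dir`, creating the directory if needed.

    Hypothetical helper: returns the destination path and leaves an
    already existing destination file untouched.
    """
    source = Path(source)
    dest_dir = Path(dest_dir)
    # mkdir with exist_ok=True makes a separate existence check unnecessary
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / source.name
    if dest.exists():
        return dest
    if not source.is_file():
        raise FileNotFoundError(source)
    shutil.move(str(source), str(dest))
    return dest
```

For this notebook the call would look like `install_credentials("kaggle.json", Path.home() / ".kaggle")`.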


## *(Optional)* Step 5: Check the existence of `~/.kaggle/kaggle.json`

In [None]:
# # get list of directories under "/"
# os.listdir("/")

# # get list of directory under `home` directory
# os.listdir(path=os.path.expanduser('~'))

if os.path.isfile(dest_kaggle_json):
    print(f"{dest_kaggle_json} is present")

/root/.kaggle/kaggle.json is present


## Step 6: Change the permission of `~/.kaggle/kaggle.json`

In [None]:
"""
ref check: https://stackoverflow.com/questions/1861836/checking-file-permissions-in-linux-with-python
"""


In [None]:
"""
! chmod 600 ~/.kaggle/kaggle.json
"""
# the mode must be an octal literal: 0o600 means read/write for the owner only
os.chmod(path=dest_kaggle_json, mode=0o600)
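The shell command `chmod 600` reads `600` as an octal number, so the Python equivalent is the literal `0o600`; the decimal integer `600` equals `0o1130` and would set a different set of permission bits. A small sketch for setting and then checking the mode follows. The helper name `restrict_to_owner` is hypothetical, and the result is only meaningful on POSIX systems such as the Colab runtime.

```python
import os
import stat


def restrict_to_owner(path):
    """Set owner-only read/write on `path` and return the resulting mode bits."""
    os.chmod(path, 0o600)
    # S_IMODE strips the file-type bits and keeps only the permission bits
    return stat.S_IMODE(os.stat(path).st_mode)
```

Formatting the returned value with `oct()` makes it easy to compare with the familiar `chmod` notation.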

## *(Optional)* Step 7: Get Kaggle dataset lists

In [None]:
"""
! kaggle datasets list
"""

'\n! kaggle datasets list\n'

## Step 8: Download the dataset from Kaggle

In [None]:
"""
!kaggle datasets download -d sergeymedvedev/customer_segmentation
"""
import zipfile
from kaggle.api.kaggle_api_extended import KaggleApi


# authenticate the kaggle api
api = KaggleApi()
api.authenticate()

dataset_owner = "sergeymedvedev"
dataset_name = "customer_segmentation"
kaggle_dataset = "/".join([dataset_owner, dataset_name])
# kaggle_dataset_path = os.path.join(os.getcwd(), "kaggle_dataset", dataset_name)
local_downloaded_path = os.path.join(os.getcwd(), "downloaded")

print(f"{kaggle_dataset=}")
# print(f"{kaggle_dataset_path=}")
print(f"{local_downloaded_path=}")

api.dataset_download_files(kaggle_dataset,
                            path=local_downloaded_path,
                            force=False,
                            quiet=True,
                            unzip=True)

# manual unzip (not needed here because unzip=True is passed above)
# zip_ref = zipfile.ZipFile(os.path.join(local_downloaded_path, 'customer_segmentation.zip'), 'r')
# zip_ref.extractall(kaggle_dataset_path)
# zip_ref.close()

kaggle_dataset='sergeymedvedev/customer_segmentation'
kaggle_dataset_path='/content/kaggle_dataset/customer_segmentation'
local_downloaded_path='/content/downloaded'
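If the download is done with `unzip=False`, the archive has to be extracted by hand. A minimal sketch of that step with the standard `zipfile` module is shown below; the helper name `extract_archive` is hypothetical, and the archive name and location depend on the dataset that was downloaded.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract `zip_path` into `target_dir` and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # the context manager closes the archive even if extraction fails
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Using `with` replaces the explicit `close()` call, and building the path with `os.path.join` or `pathlib` avoids a missing separator between the directory and the file name.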


## Step 9: List the downloaded Kaggle dataset files

In [None]:
for i in os.listdir(local_downloaded_path):
    print(os.path.join(local_downloaded_path, i))

/content/downloaded/customer_segmentation.csv
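Once the file is listed, a quick look at its header and first rows confirms the download is usable. The sketch below uses only the standard `csv` module; the helper name `preview_csv` is hypothetical, and nothing is assumed about the column names in the dataset.

```python
import csv
from itertools import islice


def preview_csv(path, n=5):
    """Return (header, rows) where rows holds at most the first `n` data rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, n))
    return header, rows
```

For example, `preview_csv(os.path.join(local_downloaded_path, "customer_segmentation.csv"))` would return the column names and the first five records of the file listed above.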


## Another way of getting Kaggle datasets

In [None]:
"""
!pip install -q opendatasets

import opendatasets as od
import pandas as pd

od.download('https://www.kaggle.com/datasets/saurabhbagchi/dish-network-hackathon') # insert your kaggle username and key
pddf = pd.read_csv('/content/dish-network-hackathon/Test_Dataset.csv')
"""


# What you have learnt from this notebook:

id | Topic | Description | Comments
:--- | :---: | :--- | :---
1 | **os.path.isfile()** | Checks that the path is an existing file; returns False for a directory | |
2 | **os.path.exists()** | Checks the existence of a path, whether it is a file or a directory | |
3 | **os.makedirs()** | Creates a new directory, including missing parent directories (like `mkdir -p`) | |
4 | **os.getcwd()** | Gets the current working directory | |
5 | **os.path.join()** | Joins path components with the correct separator | |
6 | **os.path.expanduser('~')** | Gets the home directory | |
7 | **os.chmod()** | Changes the file mode; the mode is given in octal, e.g. `0o600` | |
8 | **pathlib.Path** | `Path.rename()` is used to move the file by renaming it | |