# **Data Collection**

## **1. Introduction**

**Objectives**

* Download data from Kaggle and save it in the workspace

**Inputs**

* Kaggle JSON file (kaggle.json) for authentication

**Outputs**

* Stores the downloaded dataset under **inputs/datasets**

## **2. How To Get Kaggle JSON**

1. Go to [Kaggle](https://www.kaggle.com/) and log in to your account.
2. Click on your profile picture in the top right corner, and select "Account" from the dropdown menu.
3. Scroll down to the "API" section and click on "Create New API Token".
4. This action will download your Kaggle API token in JSON format automatically.
<br>

*You have now downloaded the Kaggle JSON file. Place this file in the project root, which becomes the working directory for this notebook in section 3.*
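
Before continuing, it can help to confirm that the downloaded token is well formed. The sketch below assumes the file is named kaggle.json and holds the username and key fields that Kaggle API tokens normally carry. The helper name check_kaggle_token is hypothetical and is not part of the kaggle package.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return the list of required fields missing from the token file."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"No token file found at {token_path}")
    with token_path.open() as handle:
        token = json.load(handle)
    # Assumption: a Kaggle API token stores these two credentials.
    required = ("username", "key")
    return [field for field in required if not token.get(field)]
```

An empty list from check_kaggle_token() suggests the file is usable. Any names in the list point to fields that are missing or empty.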

## **3. Change working directory**

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
print(f"\x1b[32m{current_dir}\x1b[0m")

/workspaces/Portfolio-Project-5/jupyter_notebooks


We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("\x1b[32mYou set a new current directory\x1b[0m")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
print(f"\x1b[32m{current_dir}\x1b[0m")

/workspaces/Portfolio-Project-5
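
Re-running the cells above from the top after the directory has already changed would move one level further up. A guarded variant might look like the sketch below. It assumes the notebooks live in a subfolder named jupyter_notebooks, as in the output above; move_to_project_root is a hypothetical helper name.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only if we are inside the notebook folder."""
    current = Path.cwd()
    # Assumption: notebooks are kept in a subfolder with this name.
    if current.name == notebook_folder:
        os.chdir(current.parent)
    return Path.cwd()
```

Calling move_to_project_root() a second time leaves the working directory unchanged, because the current folder no longer matches the notebook folder name.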


## **4. Install Kaggle**

In [4]:
# Installing the kaggle package
!pip install kaggle

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip


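<br><br>Executing the cell below tells Kaggle to look for its configuration in the current working directory (through the KAGGLE_CONFIG_DIR environment variable) and restricts the permissions on our 'kaggle.json' file so it can be used for authentication.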

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
print("\x1b[32mTask completed.\x1b[0m")

Task completed.
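
The chmod 600 shell command above relies on a Unix-like shell. A Python equivalent might look like the following sketch, where restrict_token_permissions is a hypothetical helper name. On Windows, os.chmod can only toggle the read-only flag, so the check below is meaningful on Linux and macOS only.

```python
import os
import stat


def restrict_token_permissions(path="kaggle.json"):
    """Make the token readable and writable by its owner only (mode 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the permission bits so the caller can confirm the change.
    return stat.S_IMODE(os.stat(path).st_mode)
```

On a Unix-like system the returned value should equal 0o600.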


---

## **5. Downloading the data**

This cell downloads the dataset from Kaggle and stores it in the workspace under inputs/datasets.

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets/"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
print("\x1b[32mTask completed.\x1b[0m")

Downloading cherry-leaves.zip to inputs/datasets
 89%|█████████████████████████████████▊    | 49.0M/55.0M [00:01<00:00, 36.0MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 30.8MB/s]
Task completed.
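
Note that the cell above prints "Task completed." even if the download fails, because the print statement runs regardless of the command's exit status. One possible alternative, sketched below, runs the same CLI command through subprocess so that a failure raises an exception. The function names are hypothetical, and the sketch assumes the kaggle command is on the PATH.

```python
import subprocess


def build_download_command(dataset, destination):
    """Assemble the same CLI call used above as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset, destination):
    """Run the download and raise CalledProcessError if the CLI reports failure."""
    subprocess.run(build_download_command(dataset, destination), check=True)
```

In this notebook the call would be download_dataset(KaggleDatasetPath, DestinationFolder).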


<br><br>This cell will unpack the data and remove the .zip file from the workspace

In [7]:
import zipfile

# Build the archive path with os.path.join to avoid a doubled slash
zip_path = os.path.join(DestinationFolder, 'cherry-leaves.zip')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
print("\x1b[32mTask completed.\x1b[0m")

Task completed.
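
To confirm that the archive was unpacked as expected, a quick file count per folder can be useful. The sketch below makes no assumption about the folder names inside the dataset; count_files_per_folder is a hypothetical helper name.

```python
import os


def count_files_per_folder(root):
    """Map each folder under root (as a relative path) to the number of files it holds."""
    counts = {}
    for folder, _subfolders, files in os.walk(root):
        if files:
            counts[os.path.relpath(folder, root)] = len(files)
    return counts
```

Calling count_files_per_folder(DestinationFolder) after the extraction gives a quick overview of how many files ended up in each folder.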


---

## **6. Conclusions and Next Steps**

- In this notebook, we downloaded the **cherry-leaves** dataset from Kaggle and stored it in our workspace.
- The next step will be data exploration and data visualization.