# **01_DataCollection**

## Objectives

* Fetch data from Kaggle and save it as raw data.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/

---

# Change working directory

* Notebooks are assumed to live in a subfolder of the project, so when you run a notebook in the editor you need to change the working directory to the parent (project) folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
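
Note that the cells above move up one level every time they run, so re-running them would leave the project folder. The following is a minimal sketch of a guarded alternative. It assumes the notebooks live in a subfolder called `jupyter_notebooks`; that name and the helper `project_root` are hypothetical and not taken from this notebook.

```python
from pathlib import Path


def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd if cwd is the notebooks subfolder, else cwd unchanged.

    `notebooks_dir` is an assumed folder name; adjust it to your own layout.
    """
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Pure path arithmetic on example paths, so this is safe to run repeatedly.
example = project_root(Path("/workspace/project/jupyter_notebooks"))
already_at_root = project_root(Path("/workspace/project"))

# To actually switch directory inside the notebook:
# import os; os.chdir(project_root(Path.cwd()))
```

Because the helper only moves up when the current folder has the expected name, running the cell twice leaves the working directory where it is.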

# Fetch Data from Kaggle

Install Kaggle package to fetch data

In [None]:

%pip install kaggle==1.5.12

To authenticate with Kaggle and download data in this session, store your Kaggle authentication token (a JSON file) in the main project directory.

Once you have placed your kaggle.json file in the main working directory, run the cell below so that the token is recognized in the session.

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
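
Before calling the Kaggle CLI it can help to confirm that the token is actually in place. The sketch below checks that `kaggle.json` exists in a given folder and contains the `username` and `key` fields, then restricts its permissions from Python instead of the shell. The helper name `check_kaggle_token` is hypothetical, and the demonstration uses a throwaway token rather than real credentials.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def check_kaggle_token(config_dir: str) -> bool:
    """Return True if config_dir holds a kaggle.json with 'username' and 'key' fields."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False
    if not isinstance(token, dict) or not {"username", "key"} <= set(token):
        return False
    # Same intent as `chmod 600 kaggle.json`: owner read/write only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


# Demonstration with a throwaway token in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    missing = check_kaggle_token(tmp)  # no file yet
    (Path(tmp) / "kaggle.json").write_text(
        json.dumps({"username": "demo", "key": "not-a-real-key"})
    )
    present = check_kaggle_token(tmp)
```

In the notebook, `check_kaggle_token(os.getcwd())` would be the corresponding call after setting `KAGGLE_CONFIG_DIR`.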

This project uses the [Asthma Disease Dataset](https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset/data)

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "rabieelkharoua/asthma-disease-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
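
The shell commands above rely on `unzip` and `rm`, which may not be available on every platform. A minimal sketch of the same extraction step using only the Python standard library follows. The helper name `extract_and_remove_zips` is hypothetical, and the demonstration runs on a throwaway archive rather than the Kaggle download. It does not delete kaggle.json.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_and_remove_zips(folder: str) -> List[str]:
    """Extract every .zip in folder into that folder, delete the archives,
    and return the names of the extracted members."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        zip_path.unlink()  # remove the archive only after extraction succeeded
    return extracted


# Demonstration on a throwaway archive in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / "demo.zip", "w") as archive:
        archive.writestr("demo.csv", "a,b\n1,2\n")
    names = extract_and_remove_zips(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

In the notebook this would be called as `extract_and_remove_zips(DestinationFolder)`.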

---

# Load and Inspect Kaggle data

Load the raw CSV file into a pandas DataFrame, then check its shape and first rows.

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/asthma_disease_data.csv")
print(df.shape)
df.head()

The dataset contains 2392 rows and 29 columns.

## Data Types
----

In [None]:
df.info()

---

The dataset consists of 29 variables. All predictor variables are stored as either int64 or float64, reflecting binary indicators, ordinal scales, or continuous measurements.

Key numerical variables include:

* Demographic: age, gender

* Lifestyle: smoking, physicalactivity, dietquality, sleepquality

* Environmental & Allergy: airpollution, pollenexposure, dustexposure, petallergy

* Medical History: familyhistoryasthma, allergies, eczema, hayfever, previousrespiratoryinfections
  
* Clinical Factors: LungFunctionFEV1, LungFunctionFVC

* Symptoms: shortnessofbreath, chesttightness, coughing, wheezing, nighttimesymptoms, exerciseinduced

The target variable for correlation analysis is Diagnosis.
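
The statements above about data types and the target can be checked programmatically. The sketch below counts columns per dtype, totals missing values, and tabulates the target classes. The helper name `summarize` is hypothetical, and the toy rows are invented for illustration; they are not values from the dataset.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Diagnosis") -> dict:
    """Return dtype counts, total missing values, and target class counts."""
    return {
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "missing_values": int(df.isna().sum().sum()),
        "target_counts": df[target].value_counts().to_dict(),
    }


# Toy frame with invented values, standing in for the real DataFrame.
toy = pd.DataFrame(
    {
        "age": [34, 51, 27, 63],
        "LungFunctionFEV1": [2.9, 1.8, 3.4, 2.1],
        "Diagnosis": [0, 1, 0, 0],
    }
)
summary = summarize(toy)
```

Running `summarize(df)` on the loaded data would show whether any column has an unexpected dtype and how balanced the Diagnosis classes are.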

---

# Push files to Repo
---

Save the raw data in outputs/datasets/collection

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/asthma_disease_data.csv", index=False)
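
A quick way to confirm the save step worked is to read the file back and compare it with the DataFrame in memory. The sketch below does this round trip in a temporary directory with an invented two-row frame, so it does not touch the project's outputs folder.

```python
import os
import tempfile

import pandas as pd

# Invented rows standing in for the real DataFrame.
toy = pd.DataFrame({"age": [34, 51], "Diagnosis": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)  # creates intermediate folders as needed
    out_path = os.path.join(out_dir, "asthma_disease_data.csv")
    toy.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
```

If `index=False` were omitted, the reloaded frame would gain an extra unnamed column and the shape check would fail.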