# **01_DataCollection**

## Objectives

* Fetch data from Kaggle and save it as raw data.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/asthma_disease_data.csv

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory to the parent (project) folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/AsthmaBurden/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/AsthmaBurden'
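
Re-running the change-directory cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named jupyter_notebooks as in the output above; move_to_project_root is a hypothetical helper and not part of the original notebook:

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while still inside the notebooks folder.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the (possibly unchanged) working directory so it can be checked
    return Path.cwd()


# Safe to call repeatedly: a second call leaves the directory unchanged
project_root = move_to_project_root()
```

Because the function checks the folder name first, running the cell twice does not climb above the project root.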

# Fetch Data from Kaggle

Install Kaggle package to fetch data

In [5]:

%pip install kaggle==1.5.12


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


To authenticate with Kaggle and download data in this session, your Kaggle authentication token (the kaggle.json file) must be stored in the project's root directory.

Once you have placed your kaggle.json file in the root directory, run the cell below so that the token is recognized in this session.

In [7]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
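
If kaggle.json is missing, the chmod command fails with a shell error that is easy to overlook. A sketch of the same setup in pure Python that fails early with a clearer message; configure_kaggle_token is a hypothetical helper name, and the only assumption is that the token file is called kaggle.json as described above:

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(config_dir: str) -> Path:
    """Point the session at kaggle.json in config_dir and restrict its permissions."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    # Same environment variable as in the cell above
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Owner read/write only, the equivalent of `chmod 600`
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

In the notebook this would be called as `configure_kaggle_token(os.getcwd())` once the working directory is the project root.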

This project uses the [Asthma Disease Dataset](https://www.kaggle.com/datasets/rabieelkharoua/asthma-disease-dataset/data)

Define the Kaggle dataset and the destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "rabieelkharoua/asthma-disease-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading asthma-disease-dataset.zip to inputs/datasets/raw
100%|█████████████████████████████████████████| 222k/222k [00:00<00:00, 864kB/s]
100%|█████████████████████████████████████████| 222k/222k [00:00<00:00, 859kB/s]


Unzip the downloaded file, then delete the zip file and the kaggle.json file.

In [11]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/asthma-disease-dataset.zip
  inflating: inputs/datasets/raw/asthma_disease_data.csv  
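
The unzip-and-clean-up step could also be done in Python rather than the shell, which may help on systems without the unzip command. A sketch using the standard-library zipfile module; extract_and_remove is a hypothetical helper name, and unlike the cell above it leaves kaggle.json untouched:

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path: str, destination: str) -> list:
    """Extract every member of zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    # Remove the archive only after extraction finished without raising
    Path(zip_path).unlink()
    return names
```

For this notebook the call would look like `extract_and_remove("inputs/datasets/raw/asthma-disease-dataset.zip", "inputs/datasets/raw")`.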


---

# Load and Inspect Kaggle data

Load the raw CSV file into a pandas DataFrame, then check its shape and first rows.

In [12]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/asthma_disease_data.csv")
print(df.shape)
df.head()

(2392, 29)


|   | PatientID | Age | Gender | Ethnicity | EducationLevel | BMI | Smoking | PhysicalActivity | DietQuality | SleepQuality | ... | LungFunctionFEV1 | LungFunctionFVC | Wheezing | ShortnessOfBreath | ChestTightness | Coughing | NighttimeSymptoms | ExerciseInduced | Diagnosis | DoctorInCharge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5034 | 63 | 0 | 1 | 0 | 15.848744 | 0 | 0.894448 | 5.488696 | 8.701003 | ... | 1.369051 | 4.941206 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | Dr_Confid |
| 1 | 5035 | 26 | 1 | 2 | 2 | 22.757042 | 0 | 5.897329 | 6.341014 | 5.153966 | ... | 2.197767 | 1.702393 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | Dr_Confid |
| 2 | 5036 | 57 | 0 | 2 | 1 | 18.395396 | 0 | 6.739367 | 9.196237 | 6.840647 | ... | 1.698011 | 5.022553 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | Dr_Confid |
| 3 | 5037 | 40 | 1 | 2 | 1 | 38.515278 | 0 | 1.404503 | 5.826532 | 4.253036 | ... | 3.032037 | 2.300159 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | Dr_Confid |
| 4 | 5038 | 61 | 0 | 0 | 3 | 19.283802 | 0 | 4.604493 | 3.127048 | 9.625799 | ... | 3.470589 | 3.067944 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | Dr_Confid |


The dataset contains 2392 rows and 29 columns.

## Data Types
----

In [13]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 2392 entries, 0 to 2391
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PatientID               2392 non-null   int64  
 1   Age                     2392 non-null   int64  
 2   Gender                  2392 non-null   int64  
 3   Ethnicity               2392 non-null   int64  
 4   EducationLevel          2392 non-null   int64  
 5   BMI                     2392 non-null   float64
 6   Smoking                 2392 non-null   int64  
 7   PhysicalActivity        2392 non-null   float64
 8   DietQuality             2392 non-null   float64
 9   SleepQuality            2392 non-null   float64
 10  PollutionExposure       2392 non-null   float64
 11  PollenExposure          2392 non-null   float64
 12  DustExposure            2392 non-null   float64
 13  PetAllergy              2392 non-null   int64  
 14  FamilyHistoryAsthma     2392 non-null   int64  
 15

---

The dataset consists of 29 variables. The predictor variables are stored as either int64 or float64, reflecting binary indicators, ordinal scales, or continuous measurements. DoctorInCharge holds a text value (Dr_Confid in the rows shown above) rather than a number.

Key numerical variables include:

* Demographic: age, gender

* Lifestyle: smoking, physicalactivity, dietquality, sleepquality

* Environmental & Allergy: pollutionexposure, pollenexposure, dustexposure, petallergy

* Medical History: familyhistoryasthma, allergies, eczema, hayfever, previousrespiratoryinfections
  
* Clinical Factors: LungFunctionFEV1, LungFunctionFVC

* Symptoms: shortnessofbreath, chesttightness, coughing, wheezing, nighttimesymptoms, exerciseinduced

The target variable for correlation analysis is Diagnosis.
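
As a rough preview of how Diagnosis could serve as the target in a later correlation analysis, the sketch below ranks numeric columns by their Pearson correlation with the target. The sample frame is synthetic and invented purely for illustration (it is not the Kaggle data), and correlation_with_target is a hypothetical helper:

```python
import pandas as pd

# Synthetic rows for illustration only; the column names mirror the dataset
sample = pd.DataFrame({
    "Wheezing": [1, 0, 1, 0, 1, 0],
    "Coughing": [0, 0, 1, 1, 0, 1],
    "Diagnosis": [1, 0, 1, 0, 1, 0],
    "DoctorInCharge": ["Dr_Confid"] * 6,
})


def correlation_with_target(df: pd.DataFrame, target: str = "Diagnosis") -> pd.Series:
    """Pearson correlation of each numeric column with the target, highest first."""
    # Text columns such as DoctorInCharge are excluded before correlating
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


ranking = correlation_with_target(sample)
```

On the real data the same call would be `correlation_with_target(df)`; identifier columns such as PatientID would likely need to be dropped first.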

---

# Push files to Repo
---

Save the raw data in outputs/datasets/collection

In [14]:
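import os
try:
    # Create the outputs/datasets/collection folder
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    # Reached, for example, when the folder already exists; the message is printed and the save continues
    print(e)

# Save the raw data without the DataFrame index
df.to_csv("outputs/datasets/collection/asthma_disease_data.csv", index=False)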

[Errno 17] File exists: 'outputs/datasets/collection'
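
The `[Errno 17] File exists` message only means the folder was already there from an earlier run. A minimal alternative sketch that avoids the message by passing `exist_ok=True` to `os.makedirs`; ensure_dir is a hypothetical helper name:

```python
import os


def ensure_dir(path: str) -> str:
    """Create path (including parents) if needed; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

With this helper the save step could read `df.to_csv(os.path.join(ensure_dir("outputs/datasets/collection"), "asthma_disease_data.csv"), index=False)`. Other errors, such as missing write permission, would still raise instead of being printed and ignored.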
