# Data collection
---

## Objectives
- Fetch data from Kaggle and display dataset overview

## Inputs
- Kaggle JSON file (authentication token)

## Outputs
- Datasets:
    - inputs/datasets/raw/heart.csv
    - inputs/datasets/raw/o2Saturation.csv

## Additional comments
- The heart.csv dataset will be used to train, validate and test the machine learning model, whereas o2Saturation.csv contains data on oxygenation levels
---


## Setting working Directory
The steps below set Heart_attack_risk as the new working directory

- Get the current directory and print it


In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heart_attack_risk/jupyer_notebooks'

- Set the new working directory to the parent of the current directory
- As a result, Heart_attack_risk becomes the new working directory

In [2]:
os.chdir(os.path.dirname(current_dir))
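The two cells above can also be written with `pathlib`. The sketch below is a minimal variant, assuming the notebook is started from a folder that sits directly under the project root; `parent_directory` is a hypothetical helper name, not part of the original notebook.

```python
import os
from pathlib import Path


def parent_directory(path: str) -> Path:
    """Return the parent of `path` (assumed here to be the project root)."""
    return Path(path).parent


# Where the notebook is currently running
current_dir = os.getcwd()
new_dir = parent_directory(current_dir)
print(f"Switching from {current_dir} to {new_dir}")

# Uncomment to actually change the working directory:
# os.chdir(new_dir)
```

Keeping the `os.chdir` call separate from the path computation makes it harder to move up two levels by accident when the cell is run twice.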


## Fetch Data from Kaggle

The dataset I will use for training, testing and validation is available on Kaggle, so I first install the kaggle library

In [3]:
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.12.tar.gz (79 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.7/79.7 kB 2.0 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting certifi>=2023.7.22 (from kaggle)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting requests (from kaggle)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 6.6 MB/s eta 0:00:00
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting urllib3 (from kaggle)
  D

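To import data from Kaggle, an authentication token (kaggle.json) must be provided. The cell below points KAGGLE_CONFIG_DIR at the current working directory and restricts the token file to owner-only read and write permissions (chmod 600) so that it is recognised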

In [3]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
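Before calling the Kaggle CLI, it can help to confirm that the token file really is restricted to its owner. The following is a minimal sketch using only the standard library; `is_owner_only` is a hypothetical helper, and the check assumes a POSIX file system where `chmod 600` is meaningful.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """Return True if neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


token_path = "kaggle.json"  # assumed to sit in the current working directory
if os.path.exists(token_path):
    print("Token is private:", is_owner_only(token_path))
else:
    print("kaggle.json not found in", os.getcwd())
```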

- The Kaggle dataset is identified by the user who published it and by its name
- A destination folder is defined
- The dataset is downloaded into the destination folder

In [4]:
KaggleDatasetPath = "fedesoriano/heart-failure-prediction"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction
License(s): ODbL-1.0
Downloading heart-failure-prediction.zip to inputs/datasets/raw
  0%|                                               | 0.00/8.56k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.56k/8.56k [00:00<00:00, 21.4MB/s]
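The shell cell above can also be driven from Python, which makes a failed download easier to catch. The sketch below only reuses the `-d` and `-p` flags already shown; `build_download_command` and `download_dataset` are hypothetical helper names, and the call assumes the `kaggle` executable is on the `PATH` and a valid token is configured.

```python
import subprocess
from pathlib import Path


def build_download_command(dataset: str, destination: str) -> list:
    """Return the CLI arguments equivalent to the shell cell above."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", destination]


def download_dataset(dataset: str, destination: str) -> None:
    """Create the destination folder and run the Kaggle CLI."""
    Path(destination).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error
    subprocess.run(build_download_command(dataset, destination), check=True)


# Show the command without running it
command = build_download_command(
    "fedesoriano/heart-failure-prediction", "inputs/datasets/raw"
)
print(" ".join(command))

# Uncomment to perform the download:
# download_dataset("fedesoriano/heart-failure-prediction", "inputs/datasets/raw")
```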


- The dataset is unarchived
- The archive file is deleted
- The kaggle.json file is removed

In [5]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/heart-failure-prediction.zip
  inflating: inputs/datasets/raw/heart.csv  
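Where `unzip` is not available, the extraction and clean-up can be approximated with the standard `zipfile` module. This is a sketch under that assumption rather than the notebook's own method; `extract_and_remove_archives` is a hypothetical helper, and it deliberately leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder: str) -> list:
    """Extract every .zip in `folder` into it, delete the archive,
    and return the names of the extracted members."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after a successful extraction
        archive.unlink()
    return extracted


# Example call (commented out so the sketch has no side effects):
# print(extract_and_remove_archives("inputs/datasets/raw"))
```

Deleting each archive only after its extraction succeeds mirrors the `&&` chaining in the shell cell.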


### Load and Inspect Kaggle dataset

- The .csv file is loaded into a pandas DataFrame
- The first 5 rows are displayed

In [6]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/heart.csv")
df.head()

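|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |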


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
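The `df.info()` summary can be cross-checked without pandas by reading the raw file directly. The sketch below is one possible sanity check using only the standard library; `summarise_csv` is a hypothetical helper, and it treats a cell as missing only when it is empty or whitespace.

```python
import csv
from collections import Counter


def summarise_csv(path: str) -> dict:
    """Count rows, list columns, and count empty cells per column."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        empty_cells = Counter()
        rows = 0
        for row in reader:
            rows += 1
            for column in columns:
                value = row.get(column)
                if value is None or not value.strip():
                    empty_cells[column] += 1
    return {"rows": rows, "columns": columns, "empty_cells": dict(empty_cells)}


# Example call (commented out because it needs the downloaded file):
# print(summarise_csv("inputs/datasets/raw/heart.csv"))
```

The reported row count and column list can then be compared with the entries and columns listed by `df.info()`.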


## Dataset backup

- Before any data manipulation, the dataset is backed up so that it can be restored if an error is introduced in later steps

In [8]:
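import os
# Create the backup folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save a copy of the dataset without the DataFrame index
df.to_csv("outputs/datasets/collection/heart.csv", index=False)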
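An alternative to re-saving the DataFrame is to copy the raw file itself and verify the copy. Note that this is a different technique from the `df.to_csv` call above: it preserves the original bytes instead of re-serialising the data. `backup_file` is a hypothetical helper, sketched with the standard library only.

```python
import filecmp
import shutil
from pathlib import Path


def backup_file(source: str, backup_dir: str) -> Path:
    """Copy `source` into `backup_dir` and confirm the copy matches byte for byte."""
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = Path(shutil.copy2(source, target_dir))
    # shallow=False forces a content comparison rather than a metadata check
    if not filecmp.cmp(source, target, shallow=False):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Example call (commented out because it needs the downloaded file):
# backup_file("inputs/datasets/raw/heart.csv", "outputs/datasets/collection")
```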