# Data collection
---

## Objectives
- Fetch data from Kaggle and display dataset overview

## Inputs
- Kaggle JSON file (authentication token)

## Outputs
- Datasets:
    - inputs/datasets/raw/heart.csv
    - inputs/datasets/raw/o2Saturation.csv

## Additional comments
- The heart.csv dataset will be used to train, validate and test the machine learning model, whereas o2Saturation.csv contains data on oxygen saturation levels
---


## Setting working Directory
The steps below set the Heart_attack_risk folder as the new working directory

- Get the current directory and print it


In [4]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Heart_attack_risk/jupyer_notebooks'

- Set the new working directory to the parent of the current directory
- As a result, Heart_attack_risk is the new working directory

In [7]:
os.chdir(os.path.dirname(current_dir))
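
Re-running the cells above may move the working directory up more than once. The following is a minimal sketch of a guard against that, assuming the notebooks folder name is known; the helper `resolve_project_root` and its `notebooks_dirname` parameter are illustrative names, not part of the original notebook.

```python
from pathlib import PurePosixPath


def resolve_project_root(current_dir, notebooks_dirname):
    """Return the parent of current_dir only if current_dir is the notebooks folder.

    Both arguments are hypothetical inputs for this sketch; any other
    directory is returned unchanged, so calling this twice is harmless.
    """
    current = PurePosixPath(current_dir)
    if current.name == notebooks_dirname:
        return current.parent
    return current


# Path taken from the notebook output shown above.
root = resolve_project_root(
    "/workspace/Heart_attack_risk/jupyer_notebooks", "jupyer_notebooks"
)
print(root)  # /workspace/Heart_attack_risk
```

Passing the returned path to `os.chdir` would then reproduce the step above without climbing further on a second run.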


## Fetch Data from Kaggle

The dataset that I will use for training, validation and testing is available on Kaggle, so I first install the kaggle library

In [3]:
%pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.12.tar.gz (79 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting certifi>=2023.7.22 (from kaggle)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting requests (from kaggle)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.2-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting urllib3 (from kaggle)
  D

To import data from Kaggle, an authentication token (kaggle.json) must be provided. Its permissions are set to 600 (read and write for the owner only) so that the token is not readable by other users

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
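
The `chmod` error above appears when kaggle.json is not in the working directory. As a hedged sketch, the same permission change can be done in Python with a check for the file first; `secure_token` is an illustrative helper name, and the demonstration uses a throwaway file rather than a real token. The resulting mode assumes a POSIX system.

```python
import os
import stat
import tempfile


def secure_token(token_path):
    """Set owner-only read/write (mode 600) on the token; return False if it is missing."""
    if not os.path.isfile(token_path):
        return False
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True


# Demonstration on a temporary stand-in for kaggle.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_token = os.path.join(tmp, "kaggle.json")
    missing = secure_token(demo_token)  # file does not exist yet
    with open(demo_token, "w") as fh:
        fh.write("{}")
    secured = secure_token(demo_token)  # file now exists
    mode = stat.S_IMODE(os.stat(demo_token).st_mode)

print(missing, secured, oct(mode))
```

Returning `False` for a missing file lets the notebook report the problem before any download is attempted.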


- The Kaggle dataset is identified by the user who uploaded it and the dataset name
- A destination folder is defined
- The dataset is downloaded to the destination folder

In [3]:
KaggleDatasetPath = "rashikrahmanpritom/heart-attack-analysis-prediction-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
License(s): CC0-1.0
Downloading heart-attack-analysis-prediction-dataset.zip to inputs/datasets/raw
  0%|                                               | 0.00/4.11k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 4.11k/4.11k [00:00<00:00, 11.9MB/s]
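
To illustrate how the dataset path is composed, here is a small sketch that splits it into owner and dataset name and rebuilds the URL printed by the CLI above. The helpers `split_dataset_path` and `dataset_url` are illustrative names, and the URL pattern is inferred from that output rather than from a documented guarantee.

```python
def split_dataset_path(dataset_path):
    """Split 'owner/dataset-name' into its two parts, rejecting other shapes."""
    owner, _, name = dataset_path.partition("/")
    if not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/dataset-name', got {dataset_path!r}")
    return owner, name


def dataset_url(dataset_path):
    """Build the dataset page URL, following the pattern seen in the CLI output."""
    owner, name = split_dataset_path(dataset_path)
    return f"https://www.kaggle.com/datasets/{owner}/{name}"


KaggleDatasetPath = "rashikrahmanpritom/heart-attack-analysis-prediction-dataset"
owner, name = split_dataset_path(KaggleDatasetPath)
print(owner)
print(name)
print(dataset_url(KaggleDatasetPath))
```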


- The dataset is unzipped
- The archive file is deleted
- The kaggle.json file is removed

In [4]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open {DestinationFolder}/*.zip, {DestinationFolder}/*.zip.zip or {DestinationFolder}/*.zip.ZIP.

No zipfiles found.
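
The output above shows `{DestinationFolder}` passed to `unzip` literally, which may happen when the variable is not defined in the current kernel session. As an alternative sketch that does not depend on shell substitution, the archives can be extracted with Python's standard library; `extract_archives` is an illustrative helper name, and the demonstration builds a small throwaway archive instead of using the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .zip in folder into folder, delete each archive, return member names."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are extracted
    return extracted


# Demonstration with a temporary archive containing a two-line CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("heart.csv", "age,sex\n63,1\n")
    names = extract_archives(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(names, remaining)
```

In the notebook, the call would presumably be `extract_archives("inputs/datasets/raw")`; removing kaggle.json would remain a separate step.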


### Load and Inspect Kaggle dataset

- The .csv file is loaded into a pandas DataFrame
- The first 5 rows are printed

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/heart.csv")
df.head()

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
0   63    1   3     145   233    1        0       150     0      2.3    0    0      1       1
1   37    1   2     130   250    0        1       187     0      3.5    0    0      2       1
2   41    0   1     130   204    0        0       172     0      1.4    2    0      2       1
3   56    1   1     120   236    0        1       178     0      0.8    2    0      2       1
4   57    0   0     120   354    0        1       163     1      0.6    2    0      2       1


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
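
As a complement to `df.info()`, the sketch below gathers a few overview figures into one dictionary. The helper `summarise` is an illustrative name, and the sample frame holds only a few values copied from the first rows shown above, so its numbers describe the sample, not the full dataset.

```python
import pandas as pd


def summarise(df):
    """Collect basic overview figures for a DataFrame."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Small sample built from values in the first three rows of heart.csv.
sample = pd.DataFrame(
    {
        "age": [63, 37, 41],
        "sex": [1, 1, 0],
        "oldpeak": [2.3, 3.5, 1.4],
        "output": [1, 1, 1],
    }
)
summary = summarise(sample)
print(summary)
```

Calling `summarise(df)` on the loaded dataset would report the same figures for all 303 rows.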


## Dataset backup

- Before any data manipulation is performed, the dataset is backed up so that it can be restored if an error is introduced

In [9]:
import os

# Create the backup folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save an untouched copy of the raw dataset
df.to_csv("outputs/datasets/collection/heart.csv", index=False)
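
A backup is only useful if it can be restored unchanged. The sketch below writes a frame to CSV and reads it back to confirm the two match; `backup_dataframe` is an illustrative helper name, the sample values are copied from the first rows of heart.csv, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile

import pandas as pd


def backup_dataframe(df, folder, filename):
    """Write df to folder/filename as CSV (creating folder if needed) and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path


sample = pd.DataFrame({"age": [63, 37], "oldpeak": [2.3, 3.5], "output": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs", "datasets", "collection")
    saved = backup_dataframe(sample, target, "heart.csv")
    # round_trip parsing keeps float values identical to the ones written
    restored = pd.read_csv(saved, float_precision="round_trip")
    matches = sample.equals(restored)

print(matches)
```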