# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle authentication token (the kaggle.json file).

## Outputs

* Generate Dataset: outputs/datasets/collection/Cardiovascular_Disease_Dataset.csv

## Additional Comments

* This data comes from an open, public source and poses no ethical or privacy concerns.


---

# Change working directory

* Install the Python packages required by the notebooks

In [1]:
%pip install -r /workspace/cd-prediction-pp5/requirements.txt


[notice] A new release of pip is available: 24.0 -> 24.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


When running the notebook in the editor, we need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/cd-prediction-pp5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/cd-prediction-pp5'
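
Re-running the cell above a second time would move one level further up, out of the project. A minimal sketch of a guarded variant follows. It assumes the notebooks live in a folder named `jupyter_notebooks`, as the path printed earlier suggests; `move_to_project_root` is a hypothetical helper name, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while still inside the notebooks folder.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Still inside the notebooks folder: step up exactly one level.
        os.chdir(cwd.parent)
    # Otherwise the directory was already changed, so do nothing.
    return Path.cwd()
```

Calling `move_to_project_root()` repeatedly leaves the working directory at the project root, which makes the cell safe to re-run.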

# Data acquisition from Kaggle

Install the Kaggle package to fetch the data.

In [5]:
%pip install kaggle==1.5.12


[notice] A new release of pip is available: 24.0 -> 24.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


* To download the data, Kaggle requires a personal authentication token (a JSON file). You can acquire one by signing up to Kaggle.

* Once you have your token, drag and drop the kaggle.json file into the current working directory (the project root) and then run the following:

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
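
The `chmod` error above means the token file was not in the working directory when the cell ran. A minimal sketch of a check that fails early with a clearer message is shown below. `prepare_kaggle_token` is a hypothetical helper, and the sketch assumes a POSIX file system where permission bits apply.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(config_dir: str = ".") -> Path:
    """Check that kaggle.json exists, then restrict it to owner read/write."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(
            f"{token} not found. Place your Kaggle token there and re-run."
        )
    # Tell the Kaggle client where to look for the token.
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.parent.resolve())
    # Same effect as `chmod 600 kaggle.json`.
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Running `prepare_kaggle_token(os.getcwd())` before the download cell would surface a missing token before the Kaggle client raises its own error.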


* Get the dataset path from the Kaggle URL

* Define the Kaggle dataset and the destination folder, then download the dataset.

In [7]:
KaggleDatasetPath = "jocelyndumlao/cardiovascular-disease-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/cd-prediction-pp5. Or use the environment method.


* Unzip the downloaded file, then delete the zip file and the kaggle.json file
* Remove the PDF file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm {DestinationFolder}/*.pdf \
  && rm kaggle.json
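
For environments where the `unzip` and `rm` shell commands are unavailable, the same cleanup can be sketched with the standard library. `unpack_and_clean` is a hypothetical helper name. Unlike the shell command above, it also removes PDF files inside extracted subfolders, and it leaves kaggle.json untouched.

```python
import zipfile
from pathlib import Path


def unpack_and_clean(folder: str, drop_suffixes=(".pdf",)) -> list:
    """Extract every zip in `folder`, delete the archives and unwanted files.

    Returns the relative paths of the files that remain.
    """
    folder_path = Path(folder)
    for archive in sorted(folder_path.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder_path)
        archive.unlink()  # the archive is no longer needed once extracted
    # Materialise the listing first so files are not deleted mid-iteration.
    for path in list(folder_path.rglob("*")):
        if path.is_file() and path.suffix.lower() in drop_suffixes:
            path.unlink()
    return sorted(
        p.relative_to(folder_path).as_posix()
        for p in folder_path.rglob("*")
        if p.is_file()
    )
```

A call such as `unpack_and_clean(DestinationFolder)` would return the list of remaining files, which is a quick way to confirm where the CSV ended up before moving it.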

* Move the CSV file up into inputs/datasets/raw and remove the redundant subfolder, which is then empty

In [None]:
os.rename(
    "inputs/datasets/raw/Cardiovascular_Disease_Dataset/Cardiovascular_Disease_Dataset.csv",
    "inputs/datasets/raw/Cardiovascular_Disease_Dataset.csv")
os.rmdir("inputs/datasets/raw/Cardiovascular_Disease_Dataset")

---

## Load and Inspect Kaggle Data

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Cardiovascular_Disease_Dataset.csv")
df.head(10)

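| | patientid | age | gender | chestpain | restingBP | serumcholestrol | fastingbloodsugar | restingrelectro | maxheartrate | exerciseangia | oldpeak | slope | noofmajorvessels | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103368 | 53 | 1 | 2 | 171 | 0 | 0 | 1 | 147 | 0 | 5.3 | 3 | 3 | 1 |
| 1 | 119250 | 40 | 1 | 0 | 94 | 229 | 0 | 1 | 115 | 0 | 3.7 | 1 | 1 | 0 |
| 2 | 119372 | 49 | 1 | 2 | 133 | 142 | 0 | 0 | 202 | 1 | 5.0 | 1 | 0 | 0 |
| 3 | 132514 | 43 | 1 | 0 | 138 | 295 | 1 | 1 | 153 | 0 | 3.2 | 2 | 2 | 1 |
| 4 | 146211 | 31 | 1 | 1 | 199 | 0 | 0 | 2 | 136 | 0 | 5.3 | 3 | 2 | 1 |
| 5 | 148462 | 24 | 1 | 1 | 173 | 0 | 0 | 0 | 161 | 0 | 4.7 | 3 | 2 | 1 |
| 6 | 168686 | 79 | 1 | 2 | 130 | 240 | 0 | 2 | 157 | 0 | 2.5 | 2 | 1 | 1 |
| 7 | 170498 | 52 | 1 | 0 | 127 | 345 | 0 | 0 | 192 | 1 | 4.9 | 1 | 0 | 0 |
| 8 | 188225 | 62 | 1 | 0 | 121 | 357 | 0 | 1 | 138 | 0 | 2.8 | 0 | 0 | 0 |
| 9 | 192523 | 61 | 0 | 0 | 190 | 181 | 0 | 1 | 150 | 0 | 2.9 | 2 | 0 | 1 |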


* DataFrame summary

In [None]:
df.info()

* The image below lists the meaning of each variable:

    <img src="../static/images/abbreviations.png" alt="abbreviations for Heart Disease data" height="500" />

* Check for repeated values in the patientid column

In [None]:
df[df.duplicated(subset=['patientid'])]

* Check for rows that are duplicated across all columns

In [None]:
duplicate_count = df.duplicated().sum()

if duplicate_count == 0:
    print("There are no duplicates in the dataset")
else:
    print(f"There are {duplicate_count} duplicates in the dataset")
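
The two duplicate checks above answer different questions. The sketch below uses a made-up toy frame (not the Kaggle data) to show the difference: `duplicated(subset=...)` flags repeated IDs even when other columns differ, while `duplicated()` without arguments flags only rows that match in every column.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
toy = pd.DataFrame({
    "patientid": [101, 102, 102, 103, 103],
    "age": [53, 40, 40, 61, 62],
})

# Second and later occurrences of an ID (the first one is kept as "original").
repeated_ids = toy[toy.duplicated(subset=["patientid"])]

# Rows identical in every column to an earlier row.
full_row_duplicates = int(toy.duplicated().sum())

# keep=False marks every copy of a repeated ID, which helps manual inspection.
all_id_copies = toy[toy.duplicated(subset=["patientid"], keep=False)]

print(len(repeated_ids), full_row_duplicates, len(all_id_copies))
```

Here patient 103 appears twice with different ages, so it shows up in the ID check but not in the full-row check.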

* Check for NaN values in the dataset

In [None]:
df.isna().sum()
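
`isna()` only detects values stored as NaN. The preview above shows `serumcholestrol` equal to 0 in several rows; whether those zeros stand for missing measurements is an assumption that would need checking against the dataset documentation. A minimal sketch of counting zeros alongside NaNs, using the first five preview rows of two columns:

```python
import pandas as pd

# First five preview rows of two columns, copied from the output above.
sample = pd.DataFrame({
    "restingBP": [171, 94, 133, 138, 199],
    "serumcholestrol": [0, 229, 142, 295, 0],
})

nan_counts = sample.isna().sum()    # values stored as NaN, per column
zero_counts = (sample == 0).sum()   # literal zeros, per column

print(nan_counts)
print(zero_counts)
```

Running `(df == 0).sum()` on the full DataFrame gives the same per-column view, though for binary columns such as `target` a zero is a legitimate value.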

---

# Push files to Repo

* We will save the DataFrame without its index

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/Cardiovascular_Disease_Dataset.csv", index=False)
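
To illustrate why `index=False` matters, the sketch below writes a small made-up frame twice to a temporary folder and reads both files back. The file names and the temporary location are assumptions for the demonstration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small illustrative frame; not the full dataset.
frame = pd.DataFrame({"patientid": [103368, 119250], "target": [1, 0]})

out_dir = Path(tempfile.mkdtemp()) / "outputs" / "datasets" / "collection"
out_dir.mkdir(parents=True, exist_ok=True)

with_index = out_dir / "with_index.csv"
without_index = out_dir / "without_index.csv"
frame.to_csv(with_index)                  # index written as an unnamed first column
frame.to_csv(without_index, index=False)  # only the data columns are written

cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)
print(cols_with)
print(cols_without)
```

The file saved with the index gains an extra `Unnamed: 0` column when read back, which later notebooks would have to drop.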
