# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle authentication token (the kaggle.json file).

## Outputs

* Generated dataset: outputs/datasets/collection/Cardiovascular_Disease_Dataset.csv

## Additional Comments

* The data comes from an open, public source and poses no ethical or privacy concerns.


---

# Change working directory

* Install the required Python packages from within the notebook.

In [2]:
%pip install -r /workspace/cd-prediction-pp5/requirements.txt

Note: you may need to restart the kernel to use updated packages.


When running the notebook in the editor, we need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/cd-prediction-pp5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/cd-prediction-pp5'
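
Re-running the cells above can move the working directory up more than one level. A minimal sketch of a guarded version follows. The helper name `move_to_project_root` and its default folder name `jupyter_notebooks` (taken from the path printed earlier) are assumptions for illustration, not part of the original workflow.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebooks folder.

    `notebook_folder` is an assumed name, taken from the path printed earlier.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: once outside the notebooks folder, the call is a no-op
print(move_to_project_root())
print(move_to_project_root())
```

The guard compares only the folder name, so it would need adjusting if the notebooks lived in a differently named folder.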

# Data acquisition from Kaggle

Install Kaggle package to fetch data.

In [None]:
%pip install kaggle==1.5.12

* To download the data, the Kaggle API needs a personal authentication token (a kaggle.json file). You can acquire one by signing up to Kaggle.

* Once you have your token, drag and drop the file into the current working directory (the project root set above) and then run the following:

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
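
The same setup can be sketched in plain Python with a basic sanity check on the token before it is used. The helper name `prepare_kaggle_token` is hypothetical. The check assumes the token is a JSON object with `username` and `key` fields, which is the usual layout of a kaggle.json file.

```python
import json
import os
import stat


def prepare_kaggle_token(token_path="kaggle.json"):
    """Check that the token file looks valid and restrict it to the owner."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"{token_path} not found in {os.getcwd()}")

    with open(token_path) as f:
        token = json.load(f)

    # Assumption: a Kaggle token carries these two fields
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"Token is missing fields: {sorted(missing)}")

    # Equivalent to `chmod 600 kaggle.json`: owner read/write only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)

    # Tell the Kaggle client which folder holds the token
    os.environ["KAGGLE_CONFIG_DIR"] = os.path.dirname(os.path.abspath(token_path))
```

Calling `prepare_kaggle_token()` from the project root would do the same job as the cell above, and it fails early with a readable message if the file is missing or incomplete.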

* Get the dataset path from the Kaggle URL. It is the owner/dataset-name part that follows kaggle.com/datasets/.

* Define the Kaggle dataset and the destination folder, then download the dataset.

In [6]:
KaggleDatasetPath = "jocelyndumlao/cardiovascular-disease-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cardiovascular-disease-dataset.zip to inputs/datasets/raw
100%|████████████████████████████████████████| 411k/411k [00:00<00:00, 1.14MB/s]


* Unzip the downloaded file, then delete the zip file, the PDF description file, and the kaggle.json file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm {DestinationFolder}/Cardiovascular_Disease_Dataset/*.pdf \
  && rm kaggle.json

Archive:  inputs/datasets/raw/cardiovascular-disease-dataset.zip
  inflating: inputs/datasets/raw/Cardiovascular_Disease_Dataset/Cardiovascular_Disease_Dataset.csv  
  inflating: inputs/datasets/raw/Cardiovascular_Disease_Dataset/Cardiovascular_Disease_Dataset_Description.pdf  


* Move the CSV file up into the designated raw-data folder and remove the redundant subfolder.

In [10]:
os.rename(
    "inputs/datasets/raw/Cardiovascular_Disease_Dataset/Cardiovascular_Disease_Dataset.csv",
    "inputs/datasets/raw/Cardiovascular_Disease_Dataset.csv")
os.rmdir("inputs/datasets/raw/Cardiovascular_Disease_Dataset")
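
The unzip, clean-up, and move steps above can also be sketched in pure Python with the standard-library `zipfile` module, which avoids shell commands and the intermediate subfolder. The helper name `extract_csv_files` is hypothetical. The sketch assumes that only the CSV members of the archive are wanted.

```python
import os
import shutil
import zipfile


def extract_csv_files(zip_path, destination):
    """Extract only the CSV members of `zip_path` directly into `destination`."""
    os.makedirs(destination, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".csv"):
                continue  # skip the PDF description and any directory entries
            # Drop the folder prefix so the file lands directly in `destination`
            target = os.path.join(destination, os.path.basename(member))
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    os.remove(zip_path)  # the archive is no longer needed
    return extracted
```

Because each CSV is written straight into the destination folder, no follow-up `os.rename` or `os.rmdir` call is needed. The kaggle.json file would still have to be deleted separately.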

---

## Load and Inspect Kaggle Data

In [11]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Cardiovascular_Disease_Dataset.csv")
df.head(10)

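|   | patientid | age | gender | chestpain | restingBP | serumcholestrol | fastingbloodsugar | restingrelectro | maxheartrate | exerciseangia | oldpeak | slope | noofmajorvessels | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 103368 | 53 | 1 | 2 | 171 | 0 | 0 | 1 | 147 | 0 | 5.3 | 3 | 3 | 1 |
| 1 | 119250 | 40 | 1 | 0 | 94 | 229 | 0 | 1 | 115 | 0 | 3.7 | 1 | 1 | 0 |
| 2 | 119372 | 49 | 1 | 2 | 133 | 142 | 0 | 0 | 202 | 1 | 5.0 | 1 | 0 | 0 |
| 3 | 132514 | 43 | 1 | 0 | 138 | 295 | 1 | 1 | 153 | 0 | 3.2 | 2 | 2 | 1 |
| 4 | 146211 | 31 | 1 | 1 | 199 | 0 | 0 | 2 | 136 | 0 | 5.3 | 3 | 2 | 1 |
| 5 | 148462 | 24 | 1 | 1 | 173 | 0 | 0 | 0 | 161 | 0 | 4.7 | 3 | 2 | 1 |
| 6 | 168686 | 79 | 1 | 2 | 130 | 240 | 0 | 2 | 157 | 0 | 2.5 | 2 | 1 | 1 |
| 7 | 170498 | 52 | 1 | 0 | 127 | 345 | 0 | 0 | 192 | 1 | 4.9 | 1 | 0 | 0 |
| 8 | 188225 | 62 | 1 | 0 | 121 | 357 | 0 | 1 | 138 | 0 | 2.8 | 0 | 0 | 0 |
| 9 | 192523 | 61 | 0 | 0 | 190 | 181 | 0 | 1 | 150 | 0 | 2.9 | 2 | 0 | 1 |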


* DataFrame summary

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patientid          1000 non-null   int64  
 1   age                1000 non-null   int64  
 2   gender             1000 non-null   int64  
 3   chestpain          1000 non-null   int64  
 4   restingBP          1000 non-null   int64  
 5   serumcholestrol    1000 non-null   int64  
 6   fastingbloodsugar  1000 non-null   int64  
 7   restingrelectro    1000 non-null   int64  
 8   maxheartrate       1000 non-null   int64  
 9   exerciseangia      1000 non-null   int64  
 10  oldpeak            1000 non-null   float64
 11  slope              1000 non-null   int64  
 12  noofmajorvessels   1000 non-null   int64  
 13  target             1000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 109.5 KB


* The image below is a key explaining the meaning of each variable:

    <img src="../static/images/abbreviations.png" alt="abbreviations for Heart Disease data" height="500" />

* Check for duplicate values in the patientid column. An empty result means every patient ID is unique.

In [21]:
df[df.duplicated(subset=['patientid'])]

|   | patientid | age | gender | chestpain | restingBP | serumcholestrol | fastingbloodsugar | restingrelectro | maxheartrate | exerciseangia | oldpeak | slope | noofmajorvessels | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


* Check for fully duplicated rows, comparing all columns at once.

In [15]:
duplicate_count = df.duplicated().sum()

if duplicate_count == 0:
    print("There are no duplicates in the dataset")
else:
    print(f"There are {duplicate_count} duplicates in the dataset")

There are no duplicates in the dataset


* Check for missing (NaN) values in each column.

In [20]:
df.isna().sum()

patientid            0
age                  0
gender               0
chestpain            0
restingBP            0
serumcholestrol      0
fastingbloodsugar    0
restingrelectro      0
maxheartrate         0
exerciseangia        0
oldpeak              0
slope                0
noofmajorvessels     0
target               0
dtype: int64
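
The three checks above (duplicate IDs, duplicated rows, missing values) can be bundled into one summary. This is a minimal sketch: the helper name `quality_report` and the small example frame are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def quality_report(df, id_col):
    """Count duplicate IDs, fully duplicated rows, and missing values."""
    return {
        "duplicate_ids": int(df.duplicated(subset=[id_col]).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
    }


# Tiny made-up frame: one repeated row and one missing age
example = pd.DataFrame({"patientid": [1, 2, 2, 3], "age": [30, 40, 40, None]})
print(quality_report(example, "patientid"))
```

For the DataFrame loaded in this notebook, `quality_report(df, "patientid")` would be expected to report zero for all three counts, in line with the outputs shown above.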

---

# Push files to Repo

* Save the DataFrame as a CSV file without its index.

In [22]:
import os

# Create outputs/datasets/collection if it does not already exist
os.makedirs('outputs/datasets/collection', exist_ok=True)

# index=False keeps the row index out of the saved file
df.to_csv("outputs/datasets/collection/Cardiovascular_Disease_Dataset.csv", index=False)
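
To illustrate why the index is dropped when saving, the sketch below writes a small made-up frame both ways and reads each file back. The frame, the file names, and the temporary folder are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up two-row frame, for illustration only
example = pd.DataFrame({"patientid": [101, 102], "target": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    example.to_csv(with_index)                  # default: the index is written
    example.to_csv(without_index, index=False)  # the index is left out

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)     # includes an extra "Unnamed: 0" column
print(cols_without)  # only the original columns
```

Saving with `index=False` keeps that extra unnamed column out of any later step that reloads the file.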
