# **Data Collection**

## Objectives

* Obtain an appropriate diabetes dataset from Kaggle
* Download and save the dataset
* Inspect the dataset and save it to the file path outputs/datasets/collection

## Inputs

* The source data from [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database), downloaded using the Kaggle API
* The Kaggle JSON authentication token (kaggle.json)

## Outputs

* The purpose of this notebook is to collect the pre-processed dataset and save it as a .csv file in the following location:
    * outputs/datasets/collection/diabetes.csv

## Additional Comments

* This notebook covers the Data Collection step of the CRISP-DM process
* The dataset is derived from a publicly available dataset on Kaggle. As it is publicly available, there are no ethical or privacy concerns, and it can therefore be used in the repository.


---

# Change working directory

* As the notebooks are stored in the subfolder 'jupyter_notebooks', we need to change the working directory when running the notebook in the editor.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction'
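
Because `os.chdir()` moves up one level every time it runs, re-running cell [2] would leave the project root. A minimal sketch of a guarded version, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; the helper name `project_root` is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent of cwd only when cwd is the notebooks folder."""
    return cwd.parent if cwd.name == notebooks_dir else cwd

# In the notebook this would be used as: os.chdir(project_root(Path.cwd()))
print(project_root(Path("/workspace/pp5-diabetes-prediction/jupyter_notebooks")))
```

With this guard the cell can be run any number of times and the working directory stays at the project root.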

# Importing the Source Data from Kaggle

* We begin by importing the diabetes dataset from Kaggle.

Install Kaggle package 1.5.12

In [4]:
!pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting python-slugify
  Downloading python_slugify-7.0.0-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=98154b83b1baa99d54d36902c994936d5d7fe0fa2384e42cd5f536527f003abe
  Stored in directory: /home/gitpod/.cache/pip/wheels/03/f3/c7/fc5a63bb33d22177609b06c5b4c714b5eb3f1b195ce9dc5e47
Successfully built kaggle
Installing collected packages: text-unidecode, python-s

* Next we upload the JSON authentication token from Kaggle to the root directory.
* The code below is then run so that the Kaggle API can find the token: it sets `KAGGLE_CONFIG_DIR` to the current directory and restricts the file permissions of kaggle.json.
* Check that kaggle.json is included in the .gitignore so that the token is not pushed to the repository.

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
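
The `.gitignore` check mentioned above can also be done in code. A minimal sketch, assuming `.gitignore` sits in the project root and lists the token by its plain file name; the helper `is_listed` is hypothetical and does a literal line match rather than full gitignore pattern matching:

```python
from pathlib import Path

def is_listed(gitignore_text: str, filename: str = "kaggle.json") -> bool:
    """True if filename appears as its own (non-comment) line."""
    entries = (line.strip() for line in gitignore_text.splitlines())
    return any(entry in (filename, "/" + filename) for entry in entries)

# Warn only when a .gitignore exists and does not list the token
gitignore = Path(".gitignore")
if gitignore.exists() and not is_listed(gitignore.read_text()):
    print("Warning: kaggle.json is not listed in .gitignore")
```

For an authoritative answer that respects wildcard patterns, `git check-ignore kaggle.json` can be run in the terminal instead.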


* The dataset used for the diabetes prediction was sourced from [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

* The dataset path and the destination folder are then defined.

In [6]:
KaggleDatasetPath = "uciml/pima-indians-diabetes-database"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading pima-indians-diabetes-database.zip to inputs/datasets/raw
  0%|                                               | 0.00/8.91k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 8.91k/8.91k [00:00<00:00, 13.2MB/s]


* This downloads a zip file containing the dataset from Kaggle.

* Next we need to extract the contents of the zip file.
* The kaggle.json file would normally be removed at this point. In this case it is listed in the .gitignore, so this step is not necessary and the token remains available if the data needs to be downloaded again.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder}

Archive:  inputs/datasets/raw/pima-indians-diabetes-database.zip
  inflating: inputs/datasets/raw/diabetes.csv  
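
The same extraction can be done without a shell command. A minimal sketch using Python's standard `zipfile` module, assuming every `.zip` file in the destination folder should be unpacked in place; the helper `extract_all` is hypothetical:

```python
import zipfile
from pathlib import Path

def extract_all(folder: str) -> list:
    """Extract every .zip archive in folder into that folder; return member names."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted

# In the notebook this would be called as: extract_all("inputs/datasets/raw")
```

This avoids depending on the `unzip` binary, which may not be installed in every environment.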


* Now that the contents have been extracted, we can move on to inspecting the dataset.

---

# Data Inspection

* The next steps involve loading and inspecting the diabetes dataset.
* A Pandas dataframe is created from the diabetes dataset using `read_csv()`.
* The first five rows are displayed using `head()`.

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/diabetes.csv")
df.head()

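| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |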


In [69]:
# Displays the number of rows and columns in the dataset
df.shape

(768, 9)

* From this we can see that there are 768 rows and 9 columns.

* The data type and non-null count of each column are shown below.

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


* Next we want to see the summary statistics of the dataset using the `describe()` method.

* This gives us information that will be useful later on: the count, mean, standard deviation, minimum and maximum of each feature, as well as its quartiles.

In [71]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


* To see the count of diabetic patients against non-diabetic patients, we use the `value_counts()` method on the 'Outcome' column.

In [72]:
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

* As we can see, there are 500 subjects classed as non-diabetic and 268 subjects classed as diabetic in this dataset.
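
The class balance is easier to judge as a proportion. A minimal sketch, rebuilding the 'Outcome' column from the counts reported above rather than reloading the file; passing `normalize=True` makes `value_counts()` return fractions instead of counts:

```python
import pandas as pd

# Rebuild the Outcome column from the counts reported above (500 zeros, 268 ones)
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Fraction of subjects in each class
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly 65% of subjects are non-diabetic and 35% are diabetic, which agrees with the 'Outcome' mean of 0.348958 reported by `describe()`.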

* Next we want the mean value of each feature for the diabetic and non-diabetic groups identified above. This is achieved by grouping on 'Outcome' with `groupby()` and then calling `mean()`.

In [73]:
df.groupby('Outcome').mean()

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 109.98 | 68.184 | 19.664 | 68.792 | 30.3042 | 0.429734 | 31.19 |
| 1 | 4.865672 | 141.257463 | 70.824627 | 22.164179 | 100.335821 | 35.142537 | 0.5505 | 37.067164 |


* From this we can get a clearer picture of how the features differ between subjects with and without diabetes.

* The means suggest that, on average, subjects with diabetes have higher glucose levels, BMI values and serum insulin levels, and are older than those without diabetes.

## Dataset description

* From this we can see that the dataset contains 768 records and 9 columns (8 features and the 'Outcome' target), which are explained below.

    * **Pregnancies** - *The number of times pregnant*
    * **Glucose** - *Plasma glucose concentration using a glucose tolerance test*
    * **BloodPressure** - *Diastolic blood pressure measured in mm Hg*
    * **SkinThickness** - *Triceps skin fold thickness measured in mm*
    * **Insulin** - *2-hour serum insulin measured in mu U/ml*
    * **BMI** - *Body Mass Index measured in kg/m^2*
    * **DiabetesPedigreeFunction** - *Scores the likelihood of diabetes based on family history*
    * **Age** - *The age of the subject*
    * **Outcome** - *Class variable of 0 (for non-diabetic) and 1 (for diabetic)*<br>
<br>
* 768 records would typically be regarded as a small dataset; however, it should be sufficient for our needs to train a machine learning model.
* From this data we can see that certain features have a minimum value of 0, which is not possible for biomarkers such as skin thickness, glucose, blood pressure and BMI. These zeros are likely to be missing values recorded as zero in the dataset.
* These zero values will need to be either removed from the dataset or replaced by imputing a median value. As the dataset is already small, imputation will be used in the Data Cleaning stage later on.
* To see how many zeros each column contains, exploratory data analysis will be used during the Correlation Study stage: a violin plot to visualise the zeros, supplemented by an exact count.
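
The zero count and the median imputation described above can be sketched as follows. This is only an illustration on the five rows shown by `head()`, not the project's cleaning code. The column list `zero_as_missing` is an assumption (Insulin is included on the assumption that a serum insulin of 0 is also implausible), and medians computed on five rows will differ from those of the full dataset:

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head() earlier in this notebook
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumed list of columns where 0 stands for a missing measurement
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros in each column
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Treat zeros as missing, then fill each column with its own median
imputed = sample[zero_as_missing].replace(0, np.nan)
imputed = imputed.fillna(imputed.median())
print(imputed)
```

Replacing the zeros with `NaN` first matters: otherwise the zeros themselves would pull the median down before it is used as the fill value.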

---

# Push files to Repo

* The dataset has now been inspected and can be saved as a .csv file and pushed to the repository. In order to do so an output directory will need to be created.

In [74]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the collected dataset without the index column
df.to_csv("outputs/datasets/collection/diabetes.csv", index=False)
