# Initial Dataset Exploration and Cleaning


We begin by importing the libraries needed for exploration and cleaning.

### Libraries
> **NumPy** : Library for Numeric Computations in Python  
> **Pandas** : Library for Data Acquisition and Preparation  
> **train_test_split** : Function from scikit-learn (`sklearn.model_selection`) for Random Data Splitting

Next, we load the dataset and take a first look at it.

In [1]:
# Import essential libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
# Import the dataset
diabetes = pd.read_csv('dataset/diabetes_binary.csv')
diabetes

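| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 253675 | 0.0 | 1.0 | 1.0 | 1.0 | 45.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 5.0 | 6.0 | 7.0 |
| 253676 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0 |
| 253677 | 0.0 | 0.0 | 0.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5.0 | 2.0 |
| 253678 | 0.0 | 1.0 | 0.0 | 1.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 | 5.0 | 1.0 |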


<br>

Using the `train_test_split` function, we then split the data randomly into two equal parts. Because the full dataset is larger than we need for this mini-project, we keep only half of it (126,840 rows). After splitting, **diabetes_use** is the half we work on, and **diabetes_throw** is the unused half.

We first explore the working half by checking the data type and the summary statistics of each column, using the `.info()` and `.describe()` methods.

<br>

In [42]:
# Split the data into two random halves: one to use, one to set aside
diabetes_use, diabetes_throw = train_test_split(diabetes, test_size=0.5)
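
Because the split above is random, a different set of rows ends up in **diabetes_use** on each run. The following is a minimal sketch of one way to make such a split repeatable, assuming a small toy DataFrame (`toy`, with made-up values) in place of the real dataset. The `random_state` seed and the `stratify` argument are illustrative choices, not settings used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "Diabetes_binary": [0.0] * 8 + [1.0] * 2,
    "BMI": [22.0, 25.0, 27.0, 30.0, 24.0, 28.0, 31.0, 26.0, 35.0, 40.0],
})

# random_state fixes the shuffle, so the same rows are selected on every run;
# stratify keeps the share of positive labels the same in both halves
toy_use, toy_throw = train_test_split(
    toy, test_size=0.5, random_state=42, stratify=toy["Diabetes_binary"]
)

print(len(toy_use), len(toy_throw))
print(toy_use["Diabetes_binary"].mean(), toy_throw["Diabetes_binary"].mean())
```

Stratifying on the target is optional; it mainly matters when one class is rare and a purely random half could over- or under-represent it.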

In [43]:
# Inspect the data type of each column
diabetes_use.info()

<class 'pandas.core.frame.DataFrame'>
Index: 126840 entries, 88041 to 129291
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_binary       126840 non-null  float64
 1   HighBP                126840 non-null  float64
 2   HighChol              126840 non-null  float64
 3   CholCheck             126840 non-null  float64
 4   BMI                   126840 non-null  float64
 5   Smoker                126840 non-null  float64
 6   Stroke                126840 non-null  float64
 7   HeartDiseaseorAttack  126840 non-null  float64
 8   PhysActivity          126840 non-null  float64
 9   Fruits                126840 non-null  float64
 10  Veggies               126840 non-null  float64
 11  HvyAlcoholConsump     126840 non-null  float64
 12  AnyHealthcare         126840 non-null  float64
 13  NoDocbcCost           126840 non-null  float64
 14  GenHlth               126840 non-null  float64
 15  M

In [44]:
# Inspect the summary statistics of each column
diabetes_use.describe()

| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | ... | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 | 126840.0 |
| mean | 0.138024 | 0.428264 | 0.423289 | 0.963048 | 28.376908 | 0.442069 | 0.039972 | 0.093693 | 0.757411 | 0.634232 | ... | 0.951774 | 0.083341 | 2.510793 | 3.189577 | 4.222698 | 0.168472 | 0.439964 | 8.039144 | 5.05067 | 6.055747 |
| std | 0.344927 | 0.494829 | 0.494082 | 0.188645 | 6.585436 | 0.496635 | 0.195893 | 0.291402 | 0.42865 | 0.481647 | ... | 0.214245 | 0.276398 | 1.068611 | 7.422295 | 8.687329 | 0.374286 | 0.496385 | 3.054725 | 0.988176 | 2.070747 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4.0 | 5.0 |
| 50% | 0.0 | 0.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 7.0 |
| 75% | 0.0 | 1.0 | 1.0 | 1.0 | 31.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 2.0 | 3.0 | 0.0 | 1.0 | 10.0 | 6.0 | 8.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 98.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 1.0 | 13.0 | 6.0 | 8.0 |
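
One detail worth reading off the summary: for a column that only takes the values 0 and 1, the mean equals the fraction of rows with value 1. The `Diabetes_binary` mean of 0.138024 therefore suggests that roughly 13.8% of the working half is labelled positive. The sketch below illustrates the relationship on a tiny, made-up label series (`labels` is hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical 0/1 labels: three negatives and one positive
labels = pd.Series([0.0, 0.0, 0.0, 1.0], name="Diabetes_binary")

# For a 0/1 column, the mean is the share of rows equal to 1
positive_share = labels.mean()

# value_counts(normalize=True) reports the same share per class
shares = labels.value_counts(normalize=True)

print(positive_share)
print(shares)
```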


In [45]:
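# Convert every column except the target and BMI from float64 to int64
float_cols = ['Diabetes_binary', 'BMI']
for col in diabetes_use.columns:
    if col not in float_cols:
        diabetes_use[col] = diabetes_use[col].astype('int64')
diabetes_use.info()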

<class 'pandas.core.frame.DataFrame'>
Index: 126840 entries, 88041 to 129291
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_binary       126840 non-null  float64
 1   HighBP                126840 non-null  int64  
 2   HighChol              126840 non-null  int64  
 3   CholCheck             126840 non-null  int64  
 4   BMI                   126840 non-null  float64
 5   Smoker                126840 non-null  int64  
 6   Stroke                126840 non-null  int64  
 7   HeartDiseaseorAttack  126840 non-null  int64  
 8   PhysActivity          126840 non-null  int64  
 9   Fruits                126840 non-null  int64  
 10  Veggies               126840 non-null  int64  
 11  HvyAlcoholConsump     126840 non-null  int64  
 12  AnyHealthcare         126840 non-null  int64  
 13  NoDocbcCost           126840 non-null  int64  
 14  GenHlth               126840 non-null  int64  
 15  M
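
Casting a float column to `int64` silently drops any fractional part, so it can be worth confirming first that the columns only hold whole numbers. The following is a minimal sketch of such a check, assuming a small toy DataFrame with made-up values; `keep_float`, `to_cast` and `is_whole` are hypothetical names introduced for illustration and do not appear in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in with float-encoded integer codes (hypothetical values)
toy = pd.DataFrame({
    "Diabetes_binary": [0.0, 1.0, 0.0],
    "BMI": [40.0, 25.0, 28.0],
    "HighBP": [1.0, 0.0, 1.0],
    "Age": [9.0, 7.0, 11.0],
})

keep_float = ["Diabetes_binary", "BMI"]
to_cast = [c for c in toy.columns if c not in keep_float]

# A column is safe to cast if every value is finite and equals its rounded form
is_whole = toy[to_cast].apply(
    lambda s: bool(np.isfinite(s).all() and (s == s.round()).all())
)
if not is_whole.all():
    raise ValueError(f"Non-integer values in: {list(is_whole[~is_whole].index)}")

# astype accepts a column-to-dtype mapping, so no explicit loop is needed
toy = toy.astype({c: "int64" for c in to_cast})
print(toy.dtypes)
```

Passing a mapping to `astype` returns a new DataFrame rather than assigning column by column, which is one way to keep the conversion in a single step.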

In [46]:
# Save the working half to a new CSV file, leaving out the row index
file_path = "dataset/diabetes_use.csv"

diabetes_use.to_csv(path_or_buf=file_path, index=False)
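
The `index=False` argument matters when the file is read back later. The sketch below shows the difference on a tiny, made-up DataFrame, writing to in-memory buffers (`io.StringIO`) instead of files so that it runs on its own; the buffer names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Toy frame with a shuffled-looking index (hypothetical values)
toy = pd.DataFrame({"BMI": [40.0, 25.0]}, index=[88041, 129291])

# Default behaviour: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

With the default setting, the reloaded frame gains an `Unnamed: 0` column holding the old row labels; with `index=False` the column list matches the original.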