# Importing Important Libraries

In [1]:
# for numerical operations
import numpy as np

# for data analysis
import pandas as pd


# Data Ingestion

In [2]:
# Specify the dataset path (relative to the notebook's working directory)
dataset_path = "../dataset/diabetes.csv"

In [3]:
# load the dataset into a DataFrame for analysis
df = pd.read_csv(dataset_path)
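
Because the path above is relative, `pd.read_csv` resolves it against the current working directory, which can differ between machines. The sketch below shows one way to fail early with a clearer message. The helper name `load_dataset` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, failing early if the file is missing.

    `load_dataset` is an illustrative helper, not part of pandas.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        # Show the resolved path so a wrong working directory is easy to spot
        raise FileNotFoundError(f"Dataset not found at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

With this helper in place, the loading step could be written as `df = load_dataset(dataset_path)`.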

# About Dataset:
#### Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

#### Content
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

#### Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

#### Inspiration
Can you build a machine learning model that accurately predicts whether or not the patients in the dataset have diabetes?

### Dataset Reference: [Pima Indians Diabetes Database (Kaggle)](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)

# Data Profiling and Inspection

# 1. Data Size

In [4]:
# get the shape of the dataset as (rows, columns)
df.shape

(768, 9)

* This indicates that the dataset is two-dimensional, with 768 rows and 9 columns.
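
Since `.shape` is a plain tuple, it can be unpacked into named values for later use. The sketch below uses a tiny stand-in frame (three rows and three columns taken from the dataset), not the full 768-row table.

```python
import pandas as pd

# Tiny stand-in frame for illustration, not the full dataset
sample = pd.DataFrame(
    {"Pregnancies": [6, 1, 8], "Glucose": [148, 85, 183], "Outcome": [1, 0, 1]}
)

# .shape is a (rows, columns) tuple; .ndim is the number of dimensions
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns, ndim={sample.ndim}")
```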

# 2. Data Preview

In [5]:
# to preview the data, the .head() or .sample() method is used
df.head(3)   # return the first 3 rows of the dataset

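|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |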


In [6]:
df.sample(3)   # return 3 randomly selected rows of the dataset

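|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 633 | 1 | 128 | 82 | 17 | 183 | 27.5 | 0.115 | 22 | 0 |
| 603 | 7 | 150 | 78 | 29 | 126 | 35.2 | 0.692 | 54 | 1 |
| 550 | 1 | 116 | 70 | 28 | 0 | 27.4 | 0.204 | 21 | 0 |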
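
Because `.sample()` draws rows at random, rerunning the cell returns different rows each time. If a reproducible preview is wanted, passing `random_state` fixes the draw. The sketch below uses six rows from the previews as stand-in data, and the seed value 42 is an arbitrary choice.

```python
import pandas as pd

# Stand-in data: six rows copied from the previews, three columns only
sample_df = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 128, 150, 116],
        "BMI": [33.6, 26.6, 23.3, 27.5, 35.2, 27.4],
        "Outcome": [1, 0, 1, 0, 1, 0],
    }
)

# The same seed yields the same rows on every run
first_draw = sample_df.sample(3, random_state=42)
second_draw = sample_df.sample(3, random_state=42)
print(first_draw.equals(second_draw))
```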


# 3. Data Types

In [7]:
# to view the data types in the dataset, .dtypes or .info() is used
df.dtypes   # return only the data type of each column

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [8]:
df.info()   # print an overall summary of the dataset, including the data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


#### Based on the information provided above about the dataset, we can conclude that:
* The total number of rows (entries) is 768.
* The total number of columns is 9.
* Seven columns have an integer data type and two have a float data type.
* Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age and Outcome are of integer type (int64).
* BMI and DiabetesPedigreeFunction are of float type (float64).
* Every column has 768 non-null entries, so there are no null values.
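
The counts per data type listed above can also be obtained programmatically instead of being read off the `.info()` output. The sketch below uses a small stand-in frame with explicitly set dtypes. The column subset is chosen only for illustration.

```python
import pandas as pd

# Stand-in frame with explicit dtypes so the result is platform-independent
sample = pd.DataFrame(
    {"Glucose": [148, 85, 183], "BMI": [33.6, 26.6, 23.3], "Age": [50, 31, 32]}
).astype({"Glucose": "int64", "BMI": "float64", "Age": "int64"})

# How many columns there are of each dtype
dtype_counts = sample.dtypes.value_counts()

# Column names grouped by dtype
int_cols = sample.select_dtypes(include="int64").columns.tolist()
float_cols = sample.select_dtypes(include="float64").columns.tolist()
print(dtype_counts)
print(int_cols, float_cols)
```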



# 4. Missing values

In [9]:
# to check for missing values, the .isnull() or .isna() method is used
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

#### As we can see, there are no missing (null) values in this dataset.
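
Note that `.isna()` only detects null entries. The preview rows already contain zeros for SkinThickness and Insulin, and a zero there may be a placeholder for a measurement that was not recorded. That reading is an assumption, since the dataset description does not say so. The sketch below counts zeros in columns assumed to be implausible at zero, using stand-in rows from the preview.

```python
import numpy as np
import pandas as pd

# Stand-in data: first three preview rows, four columns only
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183],
        "SkinThickness": [35, 29, 0],
        "Insulin": [0, 0, 0],
        "BMI": [33.6, 26.6, 23.3],
    }
)

# Assumption: a value of 0 is not physiologically plausible in these columns
suspect_cols = ["Glucose", "SkinThickness", "Insulin", "BMI"]
zero_counts = (sample[suspect_cols] == 0).sum()

# Optionally re-encode zeros as NaN so that isna() can detect them
cleaned = sample.copy()
cleaned[suspect_cols] = sample[suspect_cols].replace(0, np.nan)
print(zero_counts)
print(cleaned.isna().sum())
```

Whether such zeros should be treated as missing is a modelling decision and should be checked against the data source before applying it to the full dataset.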

# 5. Statistical Overview

In [10]:
# to get a statistical overview, .describe() is used
df.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


#### Based on the statistical overview above, we can conclude that:
* Every column has the same count (768).
* Pregnancies:
    1. Number of times pregnant: the minimum is 0 and the maximum is 17.
    2. The mean is 3.85 with a standard deviation of 3.37.
* Glucose:
    1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test: the minimum is 0 and the maximum is 199.
    2. The mean is 120.89 and the standard deviation is 31.97.
* Insulin:
    1. 2-hour serum insulin (mu U/ml): the minimum is 0 and the maximum is 846.
    2. The mean is 79.80 and the standard deviation is 115.24.
* Outcome:
    1. Outcome is an integer column whose minimum is 0 and maximum is 1, so every value is either "0" or "1". Its mean of 0.349 means that about 34.9% of the rows are labelled "1". It is therefore better to treat it as a categorical value rather than a numerical value. Let's perform categorical analysis on Outcome.
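
A minimal sketch of such a categorical analysis is shown here, assuming that class counts and class shares are what is wanted. It uses the six Outcome values from the preview rows as stand-in data, so the printed proportions do not reflect the full dataset.

```python
import pandas as pd

# Stand-in data: Outcome values of the six preview rows
outcome = pd.Series([1, 0, 1, 0, 1, 0], name="Outcome")

# Absolute and relative frequency of each class
class_counts = outcome.value_counts()
class_share = outcome.value_counts(normalize=True)

# Explicit categorical dtype, so the column is no longer treated as numeric
outcome_cat = outcome.astype("category")
print(class_counts)
print(class_share)
print(outcome_cat.cat.categories.tolist())
```

On the full DataFrame, the same calls would be applied to `df["Outcome"]`.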




# 6. Data Skewness

In [11]:
# to check the skewness of numerical features, the .skew() method is used
df.skew()

Pregnancies                 0.901674
Glucose                     0.173754
BloodPressure              -1.843608
SkinThickness               0.109372
Insulin                     2.272251
BMI                        -0.428982
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

#### "BloodPressure" (-1.84) and "BMI" (-0.43) exhibit negative skewness, which is consistent with their medians (72.0 and 32.0) exceeding their means (69.11 and 31.99) in the statistical overview.
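
The link between the sign of the skewness and the order of mean and median is a common tendency rather than a strict rule, so it is worth checking both directly. The sketch below does this on hypothetical values with one low outlier, chosen only for illustration and not taken from the dataset.

```python
import pandas as pd

# Hypothetical left-skewed values: one low outlier pulls the mean down
values = pd.Series([0, 60, 68, 70, 72, 74, 76, 80])

skewness = values.skew()
mean_minus_median = values.mean() - values.median()

# Both quantities are negative here: long left tail, mean below median
print(f"skew={skewness:.3f}, mean - median={mean_minus_median:.2f}")
```

On the full DataFrame, `df.mean() - df.median()` can be compared against `df.skew()` column by column in the same way.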

# 7. Correlation analysis

In [12]:
# to compute the pairwise correlation of numerical features, the .corr() method is used
df.corr()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |


#### From the correlation matrix, we can deduce that:
* Glucose has the strongest correlation with Outcome among the features, with a coefficient of 0.4666 (a moderate positive correlation).
* All other features are also positively correlated with Outcome, though more weakly, ranging from 0.0651 (BloodPressure) to 0.2927 (BMI).
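
To read the relationship with the target more easily, the Outcome column of the correlation matrix can be extracted and sorted. The sketch below uses six preview rows and three features as stand-in data, so its coefficients differ from those of the full dataset.

```python
import pandas as pd

# Stand-in data: six rows copied from the previews, selected columns only
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 128, 150, 116],
        "BMI": [33.6, 26.6, 23.3, 27.5, 35.2, 27.4],
        "Age": [50, 31, 32, 22, 54, 21],
        "Outcome": [1, 0, 1, 0, 1, 0],
    }
)

# Pearson correlation of each feature with Outcome, strongest first
outcome_corr = (
    sample.corr()["Outcome"]
    .drop("Outcome")  # drop the trivial self-correlation of 1.0
    .sort_values(ascending=False)
)
print(outcome_corr)
```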

# 8. Data duplication

In [13]:
# to check for duplicate rows, .duplicated() is used
df.duplicated().sum()

0

#### This indicates that there are no duplicate rows in the dataset.
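
`.duplicated()` marks a row as a duplicate only when every column matches an earlier row. The sketch below shows how the check behaves when a duplicate is present and how such rows could be removed. The third row is an artificial copy added for illustration and does not occur in the dataset.

```python
import pandas as pd

# Stand-in frame: the third row is an artificial copy of the first
sample = pd.DataFrame({"Glucose": [148, 85, 148], "Age": [50, 31, 50]})

# By default the first occurrence is kept and later copies are flagged
n_dupes = sample.duplicated().sum()

# Remove the flagged rows and rebuild a clean 0..n-1 index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```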