<a href="https://colab.research.google.com/github/Kiran-Pokhrel-91/Data-Analyst-Projects/blob/main/diabetes_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Details of Writer and Notebook
*   Author: Kiran Pokhrel
*   Date: 6/2/2025
*   email: kiranpokhrel912@gmail.com


# Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Data Collection and Analysis

In [2]:
# load the PIMA diabetes dataset into a pandas DataFrame
diabetes_dataset = pd.read_csv('/content/drive/MyDrive/Data Analyst Project/diabetes.csv')
diabetes_dataset  # display the first and last five rows

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


In this dataset, `Outcome` is the value we are interested in predicting, and it depends on the other parameters. Thus:

Features: `Pregnancies`, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`

Labels: `Outcome` (0 or 1)
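
As a minimal sketch of the feature/label split described above, the snippet below separates `Outcome` from the remaining columns. The frame `toy` is a hypothetical stand-in that reuses three columns from the first three rows shown earlier; it is not the full dataset, and the names `toy`, `X`, and `y` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in frame: three columns from the first three rows shown above.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

X = toy.drop(columns="Outcome")  # feature matrix: every column except the label
y = toy["Outcome"]               # label vector: 0 or 1

print(X.shape, y.shape)
```

The same two lines would apply to the full DataFrame once its missing values have been handled.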


In [3]:
# learning about columns and their datatypes
diabetes_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Note: There are no null entries and every column is numeric, which is convenient for analysis and machine learning. However,

--------------------------

the 0 values in `diabetes_dataset` are most likely missing values, so we can use

e.g. `df['Glucose'] = df['Glucose'].replace(0, np.nan)` to replace 0 with NaN and then impute the missing values using different measures.

-----------------------------
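
The snippet below is a small sketch of the zero-to-NaN replacement on made-up data. It illustrates one possible variation: converting zeros only in columns where a zero cannot be a real measurement. Whether a zero in `Pregnancies` is a genuine count or a missing entry is a judgment call, and this sketch assumes it is genuine. The list `zero_means_missing` and the frame `toy` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Outcome": [1, 0, 1],
})

# Assumption: a zero is not a plausible reading for these measurements,
# while it is a legitimate value for Pregnancies and Outcome.
zero_means_missing = ["Glucose", "Insulin"]

cleaned = toy.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

print(cleaned.isna().sum())
```

Limiting the replacement to selected columns keeps legitimate zeros, such as the `Outcome` label, from being treated as missing.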


In [4]:
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for col in columns:
    diabetes_dataset[col] = diabetes_dataset[col].replace(0,np.nan)

In [5]:
diabetes_dataset.isnull().sum()

Unnamed: 0,0
Pregnancies,111
Glucose,5
BloodPressure,35
SkinThickness,227
Insulin,374
BMI,11
DiabetesPedigreeFunction,0
Age,0
Outcome,0


Now we can see the actual number of missing values in the dataset, which had been recorded as zeros for convenience.

In [6]:
missing_percent = diabetes_dataset.isna().mean() * 100
for col, val in missing_percent.items():
    print(f"{col}: {val:.2f}%")

Pregnancies: 14.45%
Glucose: 0.65%
BloodPressure: 4.56%
SkinThickness: 29.56%
Insulin: 48.70%
BMI: 1.43%
DiabetesPedigreeFunction: 0.00%
Age: 0.00%
Outcome: 0.00%


Here we see the percentage of missing values in each column. No column is missing more than half of its values, so there is no need to drop any column. Insulin has the highest share of missing values (48.70%).

Note: dropping columns is generally not recommended, especially in medical data analysis.
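
The keep-or-drop decision above can be sketched as a simple threshold check. The 50% cut-off and the names `toy`, `threshold`, and `too_sparse` are assumptions made for this illustration; the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],     # 1 of 4 missing
    "Insulin": [np.nan, np.nan, np.nan, 168.0],  # 3 of 4 missing
    "Age": [50, 31, 32, 21],                     # none missing
})

missing_percent = toy.isna().mean() * 100

threshold = 50.0  # hypothetical cut-off; choose one that suits the analysis
too_sparse = missing_percent[missing_percent > threshold].index.tolist()

print(missing_percent.round(2))
print("Columns above threshold:", too_sparse)
```

Columns flagged this way are only candidates for review; imputing them may still be preferable to dropping them.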

In [7]:
# Getting statistical measures of the data before handling missing values
stat_df = diabetes_dataset.describe()
stat_df

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,657.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0,768.0
mean,4.494673,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885,0.348958
std,3.217291,30.535641,12.382158,10.476982,118.775855,6.924988,0.331329,11.760232,0.476951
min,1.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,2.0,99.0,64.0,22.0,76.25,27.5,0.24375,24.0,0.0
50%,4.0,117.0,72.0,29.0,125.0,32.3,0.3725,29.0,0.0
75%,7.0,141.0,80.0,36.0,190.0,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [8]:
def mice_impute(df, predictor_cols, max_iter=10, scale=False):
    """Impute missing values in predictor_cols with IterativeImputer."""
    df = df.copy()  # leave the caller's DataFrame untouched
    if not predictor_cols:
        raise ValueError("predictor_cols must be specified.")
    values = df[predictor_cols]
    if scale:
        # StandardScaler ignores NaN when fitting and keeps it when transforming
        scaler = StandardScaler()
        values = scaler.fit_transform(values)
    imputer = IterativeImputer(max_iter=max_iter, random_state=0)
    imputed_array = imputer.fit_transform(values)
    if scale:
        # convert the imputed values back to the original units
        imputed_array = scaler.inverse_transform(imputed_array)
    # assign in both cases, so scale=False also returns the imputed values
    df[predictor_cols] = imputed_array
    return df
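
The helper above relies on scikit-learn's `IterativeImputer`, which estimates each column's missing entries from the other columns over several rounds. As a rough sketch of that behaviour, the snippet below applies it directly to synthetic data in which column `b` is roughly twice column `a`. The data, the column names, and the choice to hide every fifth value are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before the next import
from sklearn.impute import IterativeImputer

# Synthetic data: b depends on a, so a carries information about missing b values.
rng = np.random.default_rng(0)
a = rng.normal(100, 15, size=50)
toy = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(0, 1, size=50)})
toy.loc[::5, "b"] = np.nan  # hide every fifth value of b

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # keep the original row labels
)

print(filled.isna().sum())
```

Passing `index=toy.index` when rebuilding the DataFrame keeps the imputed rows aligned with the original ones, which matters if the input does not use a default integer index.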

In [9]:
predict = ['Pregnancies','Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_df = mice_impute(diabetes_dataset, predictor_cols=predict, scale=True)

In [10]:
diabetes_df.isnull().sum()

Unnamed: 0,0
Pregnancies,0
Glucose,0
BloodPressure,0
SkinThickness,0
Insulin,0
BMI,0
DiabetesPedigreeFunction,0
Age,0
Outcome,0
