Instructions

Dataset description:

 This dataset includes medical measurements and indicators that can help in diagnosing diabetes. The goal is to explore the dataset and uncover any patterns, correlations, and potential issues such as missing values or outliers.
 

Dataset Columns Description

Pregnancies: Number of times the patient was pregnant.
Glucose: Plasma glucose concentration (2 hours in an oral glucose tolerance test).
BloodPressure: Diastolic blood pressure (mm Hg).
SkinThickness: Triceps skinfold thickness (mm).
Insulin: 2-Hour serum insulin (mu U/ml).
BMI: Body Mass Index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction: Diabetes pedigree function (a measure of genetic influence).
Age: Age of the patient (years).
Outcome: Diabetes diagnosis outcome (0 = No, 1 = Yes).

Step 1: Data Exploration with Pandas

Load the Dataset
 
General Information:
 
Display the dataset's structure, including column names, data types, and memory usage.
Identify the number of missing values or zeros in the dataset.
 
Descriptive Analysis:

 
Use the describe() function to analyze:
 
Summary statistics for each column (mean, min, max, quartiles).
Look for irregularities, such as columns with unrealistic minimum or maximum values.
 
Step 2: Data Exploration with ydata-profiling

Generate a Profiling Report:
 
Use ydata-profiling to create an interactive report that includes:
 
Column descriptions (type, unique values, missing values).
Distributions for numerical columns.
Correlation matrices to identify relationships between variables.
Highlighted outliers or anomalies.
 
Analyze the Report:
 
Identify missing values in key columns such as Glucose, Insulin, and BMI.
Examine correlations between columns like Age, Glucose, and Outcome.
Note any interesting insights or patterns (e.g., higher glucose levels correlated with diabetes diagnosis).

Step 3: Summary

Document Findings:
 
Write a summary of key observations from both Pandas and the ydata-profiling report.
Mention:
 
Patterns or trends in glucose, BMI, or pregnancies.
Any notable correlations between variables.
Issues such as missing or zero values in critical columns.
 
Suggestions:
 
Recommend next steps, such as handling missing values, addressing outliers, or exploring predictive modeling with the data.

In [2]:
import pandas as pd

In [3]:
filepath =  r'C:\Users\Special User\Downloads\kaggle_diabetes.csv'
df = pd.read_csv(filepath)

In [4]:
# Preview the dataset
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,2,138,62,35,0,33.6,0.127,47,1
1,0,84,82,31,125,38.2,0.233,23,0
2,0,145,0,0,0,44.2,0.63,31,1
3,0,135,68,42,250,42.3,0.365,24,1
4,1,139,62,41,480,40.7,0.536,21,0


In [5]:
#Display the dataset's structure, including column names, data types, and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               2000 non-null   int64  
 1   Glucose                   2000 non-null   int64  
 2   BloodPressure             2000 non-null   int64  
 3   SkinThickness             2000 non-null   int64  
 4   Insulin                   2000 non-null   int64  
 5   BMI                       2000 non-null   float64
 6   DiabetesPedigreeFunction  2000 non-null   float64
 7   Age                       2000 non-null   int64  
 8   Outcome                   2000 non-null   int64  
dtypes: float64(2), int64(7)
memory usage: 140.8 KB


In [6]:
#Identify the number of missing values or zeros in the dataset.
number_missing_values = df.isnull().sum()
number_missing_values

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
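The check above only counts nulls, while the task also asks about zeros. A minimal sketch of counting both side by side, using a small toy frame with made-up values (the name `toy` and its contents are illustrative, not rows from the real file):

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset.
toy = pd.DataFrame({
    "Glucose": [138, 0, 145, 135],
    "Insulin": [0, 125, 0, 250],
    "BMI": [33.6, 38.2, 0.0, 42.3],
})

# Count nulls and zeros per column in one table.
summary = pd.DataFrame({
    "nulls": toy.isnull().sum(),
    "zeros": (toy == 0).sum(),
})
print(summary)
```

On the real data the same two expressions can be applied to `df` directly; a column with no nulls but many zeros is a hint that zero was used as a placeholder for a missing measurement.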

In [None]:
#Descriptive Analysis:
#Use the describe() function to analyze:
#Summary statistics for each column (mean, min, max, quartiles).
#Look for irregularities, such as columns with unrealistic minimum or maximum values.

In [7]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,3.7035,121.1825,69.1455,20.935,80.254,32.193,0.47093,33.0905,0.342
std,3.306063,32.068636,19.188315,16.103243,111.180534,8.149901,0.323553,11.786423,0.474498
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,63.5,0.0,0.0,27.375,0.244,24.0,0.0
50%,3.0,117.0,72.0,23.0,40.0,32.3,0.376,29.0,0.0
75%,6.0,141.0,80.0,32.0,130.0,36.8,0.624,40.0,1.0
max,17.0,199.0,122.0,110.0,744.0,80.6,2.42,81.0,1.0
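Besides the zero minimums, some maximums in the table above (for example Insulin at 744) sit far from the upper quartile. One common rule of thumb for flagging such values is the 1.5 × IQR fence. The sketch below applies it to a toy series with made-up values; the threshold of 1.5 is a convention, not something prescribed by the dataset, and the rule is best applied after placeholder zeros have been treated as missing, since they distort the quartiles.

```python
import pandas as pd

# Toy insulin-like values (made up for illustration).
toy = pd.Series([15, 40, 94, 125, 130, 168, 250, 744], name="Insulin")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not removed automatically.
outliers = toy[(toy < lower) | (toy > upper)]
print(outliers)
```

Flagged values are candidates for a closer look; whether they are errors or genuine extreme measurements is a domain question.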


In [None]:
# Several columns have a minimum of zero, which is not plausible for these measurements.
# Count the zeros in each of those columns to explore further:

In [8]:
columns_to_check = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in columns_to_check:
    zero_count = (df[column] == 0).sum()
    print(f"{column}: {zero_count} zeros")


Glucose: 13 zeros
BloodPressure: 90 zeros
SkinThickness: 573 zeros
Insulin: 956 zeros
BMI: 28 zeros


In [9]:
import numpy as np

In [None]:
# Replace the zeros with NaN, since zero is not a medically possible value for these measurements.

In [10]:
columns_to_impute = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[columns_to_impute] = df[columns_to_impute].replace(0, np.nan)

In [11]:
columns_to_check = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for column in columns_to_check:
    zero_count = (df[column] == 0).sum()
    print(f"{column}: {zero_count} zeros")


Glucose: 0 zeros
BloodPressure: 0 zeros
SkinThickness: 0 zeros
Insulin: 0 zeros
BMI: 0 zeros
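The zero count above confirms the zeros are gone, but not where they went. A complementary check is that the number of NaN values after the replacement equals the number of zeros before it. A minimal sketch on a toy frame with made-up values (the names `toy` and `cols` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Glucose": [138, 0, 145],
    "Insulin": [0, 125, 0],
    "Pregnancies": [2, 0, 0],  # zero is a valid value here, so this column is left alone
})

cols = ["Glucose", "Insulin"]

# Record zero counts, replace zeros with NaN, then count the NaNs.
zeros_before = (toy[cols] == 0).sum()
toy[cols] = toy[cols].replace(0, np.nan)
nans_after = toy[cols].isna().sum()

print(nans_after)
```

Restricting the replacement to an explicit column list matters: columns such as Pregnancies and Outcome contain legitimate zeros.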


In [None]:
# Drop the rows where Insulin is NaN (originally recorded as zero).

In [12]:
df = df.dropna(subset = ['Insulin'])

In [13]:
df.isnull().sum()

Pregnancies                 0
Glucose                     2
BloodPressure               2
SkinThickness               2
Insulin                     0
BMI                         3
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [15]:
df = df.dropna(subset=['Glucose', 'BloodPressure', 'SkinThickness', 'BMI'])


In [16]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
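Dropping rows removes the NaN values, but it is worth recording how much data survives and whether the share of positive outcomes shifts. A minimal sketch on a toy frame with made-up values (all names and numbers here are illustrative, not results from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Glucose": [138.0, np.nan, 145.0, 135.0, 139.0],
    "Insulin": [np.nan, 125.0, np.nan, 250.0, 480.0],
    "Outcome": [1, 0, 1, 1, 0],
})

rows_before = len(toy)
cleaned = toy.dropna(subset=["Glucose", "Insulin"])
rows_after = len(cleaned)

# Share of rows kept after dropping incomplete records.
retained = rows_after / rows_before
print(f"{rows_after} of {rows_before} rows kept ({retained:.0%})")

# Share of positive outcomes before and after, to spot a shift in class balance.
balance_before = toy["Outcome"].mean()
balance_after = cleaned["Outcome"].mean()
print(balance_before, balance_after)
```

If the class balance changes noticeably, the dropped rows were not missing at random, which is relevant for any later modeling.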

In [17]:
from ydata_profiling import ProfileReport


In [19]:
my_profile = ProfileReport(df, title='Diabetes Dataset Profile Report', explorative=True)

In [20]:
my_profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
#Identify missing values in key columns such as Glucose, Insulin, and BMI.
#Examine correlations between columns like Age, Glucose, and Outcome.
#Note any interesting insights or patterns (e.g., higher glucose levels correlated with diabetes diagnosis).


# There are no missing values in the report, because rows with NaN were dropped with pandas in the earlier steps.
#
# CORRELATIONS BETWEEN COLUMNS
# Age and Glucose: 0.318
# Age and Outcome: 0.398
# Glucose and Outcome: 0.547
# Outcome and DiabetesPedigreeFunction: 0.210
#
# Age and Glucose (0.318): older individuals tend to have slightly higher glucose levels, but the relationship is not very strong.
# Age and Outcome (0.398): as age increases, the likelihood of a diabetes diagnosis (Outcome = 1) also increases.
# Outcome and DiabetesPedigreeFunction (0.210): a slight positive relationship. Individuals with a higher
# diabetes pedigree function (greater genetic influence) have a slightly increased likelihood of a diabetes diagnosis.
# Glucose and Outcome (0.547): a strong positive correlation and the largest of the four listed. Higher glucose levels
# are strongly associated with a higher likelihood of a diabetes diagnosis.
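The coefficients quoted above were read from the profiling report. They can be cross-checked in pandas with `DataFrame.corr`. The sketch below uses a toy frame with made-up values, so its numbers are not the ones from the report; the report may also use a different correlation measure depending on its configuration, so small differences on the real data would not be surprising.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Age": [21, 23, 24, 31, 47, 52],
    "Glucose": [84, 99, 135, 117, 138, 160],
    "Outcome": [0, 0, 1, 0, 1, 1],
})

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")
print(pearson.round(3))

# Rank the other columns by their correlation with Outcome.
outcome_rank = pearson["Outcome"].drop("Outcome").sort_values(ascending=False)
print(outcome_rank)
```

On the real data, `df[['Age', 'Glucose', 'Outcome', 'DiabetesPedigreeFunction']].corr()` would give the corresponding matrix.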

In [None]:
# Write a summary of key observations from both Pandas and the ydata-profiling report.
# Mention: patterns or trends in glucose, BMI, or pregnancies.
# Any notable correlations between variables. Issues such as missing or zero values in critical columns.



# Initial zeros identified as missing values:
# The columns Glucose, BloodPressure, SkinThickness, BMI, and Insulin contained zeros, which is not medically possible.
# I replaced the zeros with NaN to treat them as missing values.
# Number of missing values (post-replacement):
# Glucose: 13 missing values.
# BloodPressure: 90 missing values.
# SkinThickness: 573 missing values.
# BMI: 28 missing values.
# Insulin: 956 missing values.
# Action taken:
# Since Insulin is crucial in the dataset and has by far the most invalid zeros, I first dropped the rows where Insulin was NaN.
# That left only a few NaN values in the other columns (Glucose 2, BloodPressure 2, SkinThickness 2, BMI 3).
# Every patient has different conditions and health situations, so I felt it was wrong to replace missing values with the
# mean in medical data, and I dropped all remaining rows with NaN values.

# Key correlations observed in the dataset:
# Glucose and Outcome: 0.547 (strong positive)
# Higher glucose levels are strongly associated with a diabetes diagnosis.
# Age and Outcome: 0.398 (moderate positive)
# Older individuals are more likely to be diagnosed with diabetes.
# Age and Glucose: 0.318 (moderate positive)
# Older individuals tend to have slightly higher glucose levels.
# Glucose and Insulin: 0.600 (strong positive)
# Glucose levels and insulin levels are strongly related.
# Outcome and DiabetesPedigreeFunction: 0.210 (weak positive)
# A weak association between the genetic-influence measure and diabetes diagnosis.
#
# The strong correlations between Glucose and Outcome, and between Glucose and Insulin, underscore their importance in modeling diabetes diagnosis.
# RECOMMENDATIONS
# Dropping rows with NaN values gives clean data, but it significantly reduces the dataset size (956 of the 2000 rows had
# no valid Insulin value), which may limit model performance. Alternatives to consider:
# Domain-specific imputation: collaborate with medical experts to develop imputation strategies (e.g., stratified imputation
# based on similar groups such as age or BMI).
# Predictive imputation: use machine learning models to predict missing values from related features (e.g., predicting
# BloodPressure from Age, BMI, and Glucose).
# Based on the correlations:
# Prioritize Glucose, Insulin, BMI, and Age as key predictors for modeling diabetes diagnosis (Outcome).
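As an illustration of the stratified imputation idea above, the sketch below fills missing Insulin values with the median of the patient's age band. It is only a sketch under assumptions: the toy values are made up, the two age bands and the choice of the median are hypothetical, and any real strategy would need input from medical experts.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Age": [22, 25, 28, 45, 50, 55],
    "Insulin": [100.0, np.nan, 120.0, 200.0, np.nan, 240.0],
})

# Hypothetical age bands: (20, 40] and (40, 90].
toy["AgeBand"] = pd.cut(toy["Age"], bins=[20, 40, 90], labels=["21-40", "41+"])

# Median Insulin within each band, aligned back to the original rows.
band_median = toy.groupby("AgeBand", observed=True)["Insulin"].transform("median")

# Fill only the missing entries; observed values are kept unchanged.
toy["Insulin_imputed"] = toy["Insulin"].fillna(band_median)
print(toy)
```

Writing the result to a new column keeps the original values available, so imputed and observed measurements can still be told apart later.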