# **Analysis of Diabetes Risk in India**

## Objectives

* Build a data analysis tool that helps medical professionals explore, analyse, and visualise how behavioural and lifestyle factors relate to diabetes risk in India.

## Inputs

* Data source: https://www.kaggle.com/datasets/ankushpanday1/diabetes-in-youth-vs-adult-in-india


## Outputs

* A Jupyter notebook, Diabetes Risk Analysis (Hackathon1.ipynb), that showcases the data analysis.

## Additional Comments

* The project starts with simple, step-by-step code that showcases the data cleaning process, then compiles that process into an imputer pipeline called Diabetes_ETL.



---

# Working directory

Changed the working directory from its current folder to its parent folder
* Accessing the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Analysis_of_diabetes_risk-/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm new directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Analysis_of_diabetes_risk-'
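The cells above move up one level every time they are executed, so re-running them would leave the project root. A minimal sketch of a guarded variant is shown here; the helper name `ensure_project_root` is hypothetical, and it assumes the notebooks live in a folder called `jupyter_notebooks`, as the path printed above suggests.

```python
import os
from pathlib import Path


def ensure_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only when currently inside the notebook folder.

    `notebook_dir_name` is an assumption taken from the path shown above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    # Return the directory that is current after the (possible) change
    return Path.cwd()
```

Calling `ensure_project_root()` a second time is a no-op, because the current folder is then no longer named `jupyter_notebooks`.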

# Section 1: Data Extraction, Transformation, and Loading (ETL)

In [4]:
# Setting up & Importing packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_style('whitegrid')
from sklearn.pipeline import Pipeline
import plotly.express as px

In [5]:
# Loading the dataset into the diabetes_df dataframe
diabetes_df = pd.read_csv('diabetes young adults india.csv')

In [6]:
#Preview of top 15 entries in dataset
top_rows= diabetes_df.head(15)

In [7]:
# Making a list of all Column names
Df_cols= [diabetes_df.columns]
Df_cols

[Index(['ID', 'Age', 'Gender', 'Region', 'Family_Income',
        'Family_History_Diabetes', 'Parent_Diabetes_Type', 'Genetic_Risk_Score',
        'BMI', 'Physical_Activity_Level', 'Dietary_Habits', 'Fast_Food_Intake',
        'Smoking', 'Alcohol_Consumption', 'Fasting_Blood_Sugar', 'HbA1c',
        'Cholesterol_Level', 'Prediabetes', 'Diabetes_Type', 'Sleep_Hours',
        'Stress_Level', 'Screen_Time'],
       dtype='object')]

In [8]:
# Total number of rows and columns in Diabetes dataframe
diabetes_df.shape
print(f"There are {diabetes_df.shape[0]} rows and {diabetes_df.shape[1]} columns in the Diabetes Dataframe.")

There are 100000 rows and 22 columns in the Diabetes Dataframe.


In [9]:
# Checking the value counts in each column
def col_value_check(diabetes_df):
    if isinstance(diabetes_df, pd.DataFrame):
        for col in diabetes_df.columns:
            print(f"{col} value counts:")
            print(diabetes_df[col].value_counts())


In [10]:
col_value_check(diabetes_df)

ID value counts:
1         1
66651     1
66673     1
66672     1
66671     1
         ..
33332     1
33331     1
33330     1
33329     1
100000    1
Name: ID, Length: 100000, dtype: int64
Age value counts:
24    9259
21    9205
20    9194
18    9127
23    9123
19    9089
15    9031
22    9030
17    9006
16    9005
25    8931
Name: Age, dtype: int64
Gender value counts:
Female    48073
Male      47964
Other      3963
Name: Gender, dtype: int64
Region value counts:
North        16768
East         16751
Northeast    16677
South        16650
West         16607
Central      16547
Name: Region, dtype: int64
Family_Income value counts:
652308     3
929543     3
1745785    3
1121039    3
2071659    3
          ..
300996     1
307472     1
1686855    1
497882     1
1972687    1
Name: Family_Income, Length: 98018, dtype: int64
Family_History_Diabetes value counts:
No     64912
Yes    35088
Name: Family_History_Diabetes, dtype: int64
Parent_Diabetes_Type value counts:
None      65097
Type 2    25
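Printing full value counts is informative for low-cardinality columns such as Gender or Region, but columns like ID (100000 distinct values) and Family_Income (98018 distinct values) mostly produce noise. One possible complement, sketched here on a toy frame, is to tabulate the number of distinct values per column first. The helper `summarise_columns` and its `max_levels` cut-off are hypothetical.

```python
import pandas as pd


def summarise_columns(df: pd.DataFrame, max_levels: int = 10) -> pd.DataFrame:
    """Return dtype and distinct-value count for every column.

    `show_counts` flags columns small enough for value_counts() to be readable;
    the cut-off `max_levels` is an arbitrary, illustrative choice.
    """
    rows = []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=False)
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "n_unique": n_unique,
            "show_counts": n_unique <= max_levels,
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Gender": ["Female", "Male", "Female", "Other"],
})
summary = summarise_columns(toy, max_levels=3)
print(summary)
```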

In [11]:
# Checking initial information on Index, Datatypes and Memory used
# (info() prints its report and returns None, so there is nothing to assign)
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   ID                       100000 non-null  int64  
 1   Age                      100000 non-null  int64  
 2   Gender                   100000 non-null  object 
 3   Region                   100000 non-null  object 
 4   Family_Income            100000 non-null  int64  
 5   Family_History_Diabetes  100000 non-null  object 
 6   Parent_Diabetes_Type     100000 non-null  object 
 7   Genetic_Risk_Score       100000 non-null  int64  
 8   BMI                      100000 non-null  float64
 9   Physical_Activity_Level  100000 non-null  object 
 10  Dietary_Habits           100000 non-null  object 
 11  Fast_Food_Intake         100000 non-null  int64  
 12  Smoking                  100000 non-null  object 
 13  Alcohol_Consumption      100000 non-null  object 
 14  Fasti

In [12]:
# Checking for any duplicate rows
duplicates_check = diabetes_df.duplicated().any()
print('Any duplicate values:', duplicates_check)

Any duplicate values: False
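Because every row carries a unique ID, a full-row duplicate check can never flag two records as identical. A sketch of a check that ignores identifier columns follows; the helper `count_duplicates` and the toy data are illustrative only.

```python
import pandas as pd


def count_duplicates(df: pd.DataFrame, ignore=("ID",)) -> int:
    """Count rows that repeat an earlier row once the `ignore` columns are left out."""
    cols = [c for c in df.columns if c not in ignore]
    return int(df.duplicated(subset=cols).sum())


# Toy data: rows 1 and 2 differ only by ID
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [20, 20, 21],
    "Gender": ["Male", "Male", "Female"],
})
full_row_duplicates = bool(toy.duplicated().any())
duplicates_without_id = count_duplicates(toy)
print(full_row_duplicates, duplicates_without_id)
```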


In [13]:
# Checking for missing values in the dataset
missingvalues_check = diabetes_df.isnull().sum()
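The missing-value counts are stored but not displayed. A small sketch of how they could be turned into a readable table, keeping only columns that actually have gaps, is given here; `missing_summary` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [21, 18, 25, 22],
    "BMI": [31.4, None, 20.0, None],
    "Smoking": ["Yes", "No", None, "No"],
})
gaps = missing_summary(toy)
print(gaps)
```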


In [14]:
# Dropping columns that are not used in further analysis
col_dropped = ['ID', 'Family_Income', 'Family_History_Diabetes', 'Parent_Diabetes_Type', 'Fasting_Blood_Sugar', 'Genetic_Risk_Score']
diabetes_df = diabetes_df.drop(columns=col_dropped)

In [15]:
diabetes_df.head()

Unnamed: 0,Age,Gender,Region,BMI,Physical_Activity_Level,Dietary_Habits,Fast_Food_Intake,Smoking,Alcohol_Consumption,HbA1c,Cholesterol_Level,Prediabetes,Diabetes_Type,Sleep_Hours,Stress_Level,Screen_Time
0,21,Male,North,31.4,Sedentary,Moderate,1,Yes,No,9.5,163.3,Yes,,7.7,7,6.8
1,18,Female,Central,24.4,Active,Unhealthy,5,No,No,5.0,169.1,Yes,,7.9,8,6.0
2,25,Male,North,20.0,Moderate,Moderate,2,No,No,8.3,296.3,Yes,Type 1,7.6,8,4.6
3,22,Male,Northeast,39.8,Moderate,Unhealthy,4,No,Yes,4.6,252.8,No,,9.5,2,10.9
4,19,Male,Central,19.2,Moderate,Moderate,0,No,Yes,5.3,252.3,No,,6.4,2,1.3


In [16]:
# Descriptive Statistics Overview
Summary_stats = diabetes_df.describe()
Summary_stats

Unnamed: 0,Age,BMI,Fast_Food_Intake,HbA1c,Cholesterol_Level,Sleep_Hours,Stress_Level,Screen_Time
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,20.00789,28.028089,4.98858,7.006461,209.903952,6.988082,5.50681,6.503842
std,3.154934,6.924196,3.169762,1.735327,52.049374,1.734122,2.87943,3.17021
min,15.0,16.0,0.0,4.0,120.0,4.0,1.0,1.0
25%,17.0,22.1,2.0,5.5,164.8,5.5,3.0,3.8
50%,20.0,28.0,5.0,7.0,209.8,7.0,6.0,6.5
75%,23.0,34.0,8.0,8.5,255.0,8.5,8.0,9.3
max,25.0,40.0,10.0,10.0,300.0,10.0,10.0,12.0
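By default `describe()` covers only the numeric columns, so the categorical lifestyle factors are absent from the overview above. A minimal sketch of the matching summary for non-numeric columns, run on toy data rather than the real dataset:

```python
import pandas as pd

# Toy data standing in for the categorical columns of the dataset
toy = pd.DataFrame({
    "Age": [21, 18, 25],
    "Gender": ["Female", "Male", "Female"],
    "Region": ["North", "Central", "North"],
})

# Selecting the non-numeric columns first makes describe() report
# count, unique, top (most frequent label) and freq for each of them
categorical_summary = toy.select_dtypes(exclude="number").describe()
print(categorical_summary)
```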


# Section 2

Pipeline functions

In [17]:
# Function to impute missing values and transform categorical variables to numeric datatypes for analysis
from sklearn.impute import SimpleImputer

def transform_data(diabetes_df):

    # Define the imputer (mean imputation only applies to numeric columns)
    imputer = SimpleImputer(strategy='mean')

    # Create a pipeline for imputation
    imputer_pipeline = Pipeline(steps=[('imputer', imputer)])

    # Specify the numeric columns to be imputed
    columns_to_impute = ['BMI', 'Fast_Food_Intake']

    # Apply the imputer pipeline to the specified columns
    diabetes_df[columns_to_impute] = imputer_pipeline.fit_transform(diabetes_df[columns_to_impute])

    # Convert categorical variables to numerical
    # The labels below are the ones visible in the head() preview; 'Healthy' is assumed.
    # Any label missing from a mapping becomes NaN, so check value_counts() first.
    yes_no = {'Yes': 1, 'No': 0}
    diabetes_df['Smoking'] = diabetes_df['Smoking'].map(yes_no)
    diabetes_df['Alcohol_Consumption'] = diabetes_df['Alcohol_Consumption'].map(yes_no)
    diabetes_df['Physical_Activity_Level'] = diabetes_df['Physical_Activity_Level'].map({'Active': 2, 'Moderate': 1, 'Sedentary': 0})
    diabetes_df['Dietary_Habits'] = diabetes_df['Dietary_Habits'].map({'Healthy': 2, 'Moderate': 1, 'Unhealthy': 0})
    return diabetes_df
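The two kinds of step that `transform_data` combines, filling numeric gaps and mapping labels to numbers, can be illustrated on a toy frame. This sketch uses plain pandas `fillna` in place of `SimpleImputer` to stay dependency-free, and it also shows the main pitfall of `map`: any label missing from the dictionary silently becomes NaN.

```python
import pandas as pd

# Toy data using labels seen in the dataset preview
toy = pd.DataFrame({
    "BMI": [31.4, None, 20.0],
    "Smoking": ["Yes", "No", "Yes"],
    "Physical_Activity_Level": ["Sedentary", "Active", "Moderate"],
})

# Mean imputation for a numeric column (equivalent in effect to strategy='mean')
toy["BMI"] = toy["BMI"].fillna(toy["BMI"].mean())

# Binary label mapping
toy["Smoking"] = toy["Smoking"].map({"Yes": 1, "No": 0})

# An incomplete mapping: 'Moderate' is not listed, so it turns into NaN
incomplete = toy["Physical_Activity_Level"].map({"Active": 1, "Sedentary": 0})
print(toy)
print(incomplete.isna().sum(), "label(s) left unmapped")
```

Checking `isna().sum()` after each mapping is a cheap way to confirm that every label in the column was covered.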

In [24]:
# Correlation analysis of the numeric columns
diabetes_df.select_dtypes(include='number').corr()

Unnamed: 0,Age,BMI,Fast_Food_Intake,HbA1c,Cholesterol_Level,Sleep_Hours,Stress_Level,Screen_Time
Age,1.0,-0.000482,0.006896,0.006189,-0.000582,-0.000743,0.004517,0.000513
BMI,-0.000482,1.0,-0.002043,0.0056,-0.002994,-0.002895,-0.000203,-0.005156
Fast_Food_Intake,0.006896,-0.002043,1.0,0.001768,0.000207,-0.008323,-0.001666,-0.001891
HbA1c,0.006189,0.0056,0.001768,1.0,-0.001143,-0.00324,0.002687,0.002138
Cholesterol_Level,-0.000582,-0.002994,0.000207,-0.001143,1.0,0.001186,-0.006368,0.004065
Sleep_Hours,-0.000743,-0.002895,-0.008323,-0.00324,0.001186,1.0,-0.002203,0.002309
Stress_Level,0.004517,-0.000203,-0.001666,0.002687,-0.006368,-0.002203,1.0,0.001587
Screen_Time,0.000513,-0.005156,-0.001891,0.002138,0.004065,0.002309,0.001587,1.0
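Every off-diagonal coefficient in the matrix above is close to zero, which is easier to see when the pairs are ranked by absolute strength instead of scanned by eye. A sketch under the assumption that only numeric columns are of interest; `top_correlations` is a hypothetical helper, demonstrated on toy data.

```python
import numpy as np
import pandas as pd


def top_correlations(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


# Toy data: y is an exact multiple of x, z runs roughly opposite to both
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
strongest = top_correlations(toy, n=3)
print(strongest)
```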


In [None]:
# Creating a Diabetes_Outcome column based on HbA1c levels (6.5 or above is labelled 'Positive')
diabetes_df['Diabetes_Outcome'] = diabetes_df['HbA1c'].apply(lambda x: 'Positive' if x >= 6.5 else 'Negative')
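The same labelling rule can also be written without a row-by-row `apply`. The sketch below uses `numpy.where` on a toy column; the constant name `HBA1C_THRESHOLD` is introduced here for illustration, and the 6.5 cut-off is the one used in the cell above.

```python
import numpy as np
import pandas as pd

HBA1C_THRESHOLD = 6.5  # same cut-off as in the notebook cell

# Toy HbA1c readings, including one exactly on the threshold
toy = pd.DataFrame({"HbA1c": [9.5, 5.0, 6.5, 4.6]})

# Vectorised labelling: values at or above the threshold are 'Positive'
toy["Diabetes_Outcome"] = np.where(
    toy["HbA1c"] >= HBA1C_THRESHOLD, "Positive", "Negative"
)
print(toy)
```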

---

# Data Visualisation

In [26]:
# The colour column must match the 'Diabetes_Outcome' column created above (column names are case-sensitive)

# Fast Food Intake vs. Diabetes Outcome
fig_fast_food = px.histogram(diabetes_df,
                             x='Fast_Food_Intake',
                             color='Diabetes_Outcome',
                             title='Fast Food Intake vs. Diabetes Outcome',
                             labels={'Fast_Food_Intake': 'Fast Food Intake',
                                     'Diabetes_Outcome': 'Diabetes Outcome'})
fig_fast_food.show()

# Smoking vs. Diabetes Outcome
fig_smoking = px.histogram(diabetes_df,
                           x='Smoking',
                           color='Diabetes_Outcome',
                           title='Smoking vs. Diabetes Outcome',
                           labels={'Smoking': 'Smoking', 'Diabetes_Outcome': 'Diabetes Outcome'})
fig_smoking.show()

# Alcohol Consumption vs. Diabetes Outcome
fig_alcohol = px.histogram(diabetes_df,
                           x='Alcohol_Consumption',
                           color='Diabetes_Outcome',
                           title='Alcohol Consumption vs. Diabetes Outcome',
                           labels={'Alcohol_Consumption': 'Alcohol Consumption',
                                   'Diabetes_Outcome': 'Diabetes Outcome'})
fig_alcohol.show()
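Stacked histograms show raw counts, which makes groups of different sizes hard to compare. A possible numeric companion is a row-normalised cross-tabulation, sketched here on toy data with the same column names; the values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the real columns
toy = pd.DataFrame({
    "Smoking": ["Yes", "No", "No", "Yes", "No"],
    "Diabetes_Outcome": ["Positive", "Negative", "Positive", "Positive", "Negative"],
})

# Share of each outcome within every Smoking group (each row sums to 1)
shares = pd.crosstab(toy["Smoking"], toy["Diabetes_Outcome"], normalize="index")
print(shares)
```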


---

# Push files to Repo


In [18]:
import os
try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block stays valid until a folder name is supplied
except Exception as e:
    print(e)
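Once a folder name has been chosen, the creation step could look like the sketch below. The helper `ensure_folder` is hypothetical, and any path passed to it (for example `outputs/cleaned`) is an illustrative choice, not a location defined by this project.

```python
import os


def ensure_folder(path: str) -> str:
    """Create `path` (and any missing parents) and return its absolute location.

    exist_ok=True means re-running the cell does not raise if the folder exists.
    """
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

With `exist_ok=True` the surrounding `try`/`except` is only needed for genuine failures such as missing write permissions.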