# Prediabetes Risk Assessment
---

### Introduction
---
This project assesses the risk of prediabetes among Indian youngsters aged 15 to 25, based on a range of health and lifestyle factors. The dataset contains information on age, gender, region, family income, family history of diabetes, genetic risk score, BMI, physical activity level, dietary habits, fast food intake, smoking status, alcohol consumption, fasting blood sugar, HbA1c, cholesterol level, sleep hours, stress level, and screen time.

By analyzing this data, I aim to identify key factors that contribute to the risk of prediabetes and develop predictive models to assess individual risk levels. This can help in early detection and preventive measures to reduce the incidence of diabetes.

---
## 1.) Import Required Packages

#### Importing pandas, NumPy, Matplotlib, seaborn, and the warnings and os modules.

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 
import warnings
import os
warnings.filterwarnings('ignore')

---
## 2.) Data Collection
- Dataset source: https://www.kaggle.com/datasets/ankushpanday1/diabetes-in-youth-vs-adult-in-india
- The data consists of 22 columns and 100,000 rows.

#### Import the CSV Data as Pandas DataFrame

In [3]:
df=pd.read_csv('../data/raw_data/indian_youngsters_health_data.csv')

#### Show Top 5 Records

In [4]:
df.head()

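| | ID | Age | Gender | Region | Family_Income | Family_History_Diabetes | Parent_Diabetes_Type | Genetic_Risk_Score | BMI | Physical_Activity_Level | ... | Smoking | Alcohol_Consumption | Fasting_Blood_Sugar | HbA1c | Cholesterol_Level | Prediabetes | Diabetes_Type | Sleep_Hours | Stress_Level | Screen_Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 21 | Male | North | 2209393 | No | NaN | 6 | 31.4 | Sedentary | ... | Yes | No | 95.6 | 9.5 | 163.3 | Yes | NaN | 7.7 | 7 | 6.8 |
| 1 | 2 | 18 | Female | Central | 387650 | No | NaN | 5 | 24.4 | Active | ... | No | No | 164.9 | 5.0 | 169.1 | Yes | NaN | 7.9 | 8 | 6.0 |
| 2 | 3 | 25 | Male | North | 383333 | No | NaN | 6 | 20.0 | Moderate | ... | No | No | 110.5 | 8.3 | 296.3 | Yes | Type 1 | 7.6 | 8 | 4.6 |
| 3 | 4 | 22 | Male | Northeast | 2443733 | No | NaN | 4 | 39.8 | Moderate | ... | No | Yes | 160.7 | 4.6 | 252.8 | No | NaN | 9.5 | 2 | 10.9 |
| 4 | 5 | 19 | Male | Central | 1449463 | No | NaN | 4 | 19.2 | Moderate | ... | No | Yes | 73.7 | 5.3 | 252.3 | No | NaN | 6.4 | 2 | 1.3 |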


#### Shape of the dataset

In [5]:
df.shape

(100000, 22)

### 2.1 Dataset information

- ID: Unique identifier for each record.
- Age: Age of the individual (15-25 years).
- Gender: Gender of the individual: Male, Female, or Other.
- Region: Region in India where the individual resides.
- Family_Income: Annual family income in INR.
- Family_History_Diabetes: Indicates if there is a family history of diabetes: Yes or No.
- Parent_Diabetes_Type: Indicates if parents have diabetes and its type: Type 1, Type 2, or None.
- Genetic_Risk_Score: A numeric score (1-10) indicating genetic predisposition to diabetes.
- BMI: Body Mass Index, a measure of body fat based on height and weight.
- Physical_Activity_Level: Level of physical activity: Sedentary, Moderate, or Active.
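- Dietary_Habits: Dietary habits, one of three categories. The source description lists Healthy, Unhealthy, or Mixed, but the summary in section 3.5 shows Moderate as the most frequent value.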
- Fast_Food_Intake: Frequency of fast food intake, stored as an integer from 0 to 10.
- Smoking: Smoking status: Yes or No.
- Alcohol_Consumption: Alcohol consumption: Yes or No.
- Fasting_Blood_Sugar: Fasting blood sugar level in mg/dL.
- HbA1c: Glycated hemoglobin level in %.
- Cholesterol_Level: Cholesterol level in mg/dL.
- Prediabetes: Indicates if the individual has prediabetes: Yes or No.
- Diabetes_Type: Indicates the type of diabetes: Type 1, Type 2, or None.
- Sleep_Hours: Average hours of sleep per night.
- Stress_Level: A numeric score (1-10) indicating the level of stress.
- Screen_Time: Average hours of screen time per day.
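
The documented ranges above can be spot-checked in code. The following is a minimal sketch, assuming a small hypothetical `sample` frame in place of the real `df`; the helper `out_of_range_counts` and the `expected_ranges` mapping are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (assumption).
sample = pd.DataFrame({
    "Age": [21, 18, 25],
    "Genetic_Risk_Score": [6, 5, 6],
    "Stress_Level": [7, 8, 8],
})

# Inclusive ranges copied from the column descriptions above.
expected_ranges = {
    "Age": (15, 25),
    "Genetic_Risk_Score": (1, 10),
    "Stress_Level": (1, 10),
}


def out_of_range_counts(frame, ranges):
    """Count, per column, the values that fall outside [low, high]."""
    counts = {}
    for column, (low, high) in ranges.items():
        # Series.between is inclusive on both ends by default.
        counts[column] = int((~frame[column].between(low, high)).sum())
    return counts


violations = out_of_range_counts(sample, expected_ranges)
print(violations)
```

On the real data, the same call would be `out_of_range_counts(df, expected_ranges)`; any non-zero count would point to a mismatch between the description and the data.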

---
## 3.) Data Checks to perform

- Check Missing values
- Check Duplicates
- Check data type
- Check the number of unique values of each column
- Check statistics of data set

### 3.1 Check Missing values

In [6]:
df.isna().sum()

ID                             0
Age                            0
Gender                         0
Region                         0
Family_Income                  0
Family_History_Diabetes        0
Parent_Diabetes_Type       65097
Genetic_Risk_Score             0
BMI                            0
Physical_Activity_Level        0
Dietary_Habits                 0
Fast_Food_Intake               0
Smoking                        0
Alcohol_Consumption            0
Fasting_Blood_Sugar            0
HbA1c                          0
Cholesterol_Level              0
Prediabetes                    0
Diabetes_Type              74776
Sleep_Hours                    0
Stress_Level                   0
Screen_Time                    0
dtype: int64
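
Only `Parent_Diabetes_Type` and `Diabetes_Type` contain missing entries. One way to understand them is to cross-tabulate missingness against a related column. The sketch below uses a hypothetical toy frame (`toy`) rather than the real `df`, and the labels `missing` and `present` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; in the notebook this would be df (assumption).
toy = pd.DataFrame({
    "Family_History_Diabetes": ["No", "No", "Yes", "Yes", "No"],
    "Parent_Diabetes_Type": [np.nan, np.nan, "Type 2", np.nan, "Type 1"],
})

# Turn the missingness flag into readable labels.
parent_status = toy["Parent_Diabetes_Type"].isna().map(
    {True: "missing", False: "present"}
)

# Rows: family history; columns: whether the parent type is missing.
missing_by_history = pd.crosstab(toy["Family_History_Diabetes"], parent_status)
print(missing_by_history)
```

If missing parent types simply meant "no family history", all missing rows would fall in the `No` row of such a table.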

### Handle the Missing Values Found in Parent_Diabetes_Type and Diabetes_Type

In [7]:
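# Label missing entries as 'No Diabetes'.
# Assign the result back rather than calling fillna(inplace=True) on a column selection, which newer pandas versions warn about.
df['Parent_Diabetes_Type'] = df['Parent_Diabetes_Type'].fillna('No Diabetes')
df['Diabetes_Type'] = df['Diabetes_Type'].fillna('No Diabetes')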

### Confirm No Missing Values Remain

In [8]:
df.isna().sum()

ID                         0
Age                        0
Gender                     0
Region                     0
Family_Income              0
Family_History_Diabetes    0
Parent_Diabetes_Type       0
Genetic_Risk_Score         0
BMI                        0
Physical_Activity_Level    0
Dietary_Habits             0
Fast_Food_Intake           0
Smoking                    0
Alcohol_Consumption        0
Fasting_Blood_Sugar        0
HbA1c                      0
Cholesterol_Level          0
Prediabetes                0
Diabetes_Type              0
Sleep_Hours                0
Stress_Level               0
Screen_Time                0
dtype: int64

#### No missing values remain in the dataset

### 3.2 Check Duplicates

In [9]:
df.duplicated().sum()

0

#### There are no duplicate rows in the dataset
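
Because `ID` is unique for every record, a full-row duplicate check cannot find repeats even if two records are otherwise identical. The sketch below illustrates the difference on a hypothetical toy frame (`toy`); it is an optional extra check, not something the original notebook performs.

```python
import pandas as pd

# Hypothetical toy rows: records 1 and 3 match in everything except ID.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Age": [21, 18, 21],
    "Gender": ["Male", "Female", "Male"],
})

# Full-row check: the unique ID makes every row distinct.
full_row_duplicates = int(toy.duplicated().sum())

# Same check with the identifier column dropped first.
duplicates_ignoring_id = int(toy.drop(columns="ID").duplicated().sum())

print(full_row_duplicates, duplicates_ignoring_id)
```

On the real data, the equivalent call would be `df.drop(columns='ID').duplicated().sum()`.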

### 3.3 Check data types

In [10]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   ID                       100000 non-null  int64  
 1   Age                      100000 non-null  int64  
 2   Gender                   100000 non-null  object 
 3   Region                   100000 non-null  object 
 4   Family_Income            100000 non-null  int64  
 5   Family_History_Diabetes  100000 non-null  object 
 6   Parent_Diabetes_Type     100000 non-null  object 
 7   Genetic_Risk_Score       100000 non-null  int64  
 8   BMI                      100000 non-null  float64
 9   Physical_Activity_Level  100000 non-null  object 
 10  Dietary_Habits           100000 non-null  object 
 11  Fast_Food_Intake         100000 non-null  int64  
 12  Smoking                  100000 non-null  object 
 13  Alcohol_Consumption      100000 non-null  object 
 14  Fasti

### 3.4 Checking the number of unique values of each column

In [11]:
df.nunique()

ID                         100000
Age                            11
Gender                          3
Region                          6
Family_Income               98018
Family_History_Diabetes         2
Parent_Diabetes_Type            3
Genetic_Risk_Score             10
BMI                           241
Physical_Activity_Level         3
Dietary_Habits                  3
Fast_Food_Intake               11
Smoking                         2
Alcohol_Consumption             2
Fasting_Blood_Sugar          1101
HbA1c                          61
Cholesterol_Level            1801
Prediabetes                     2
Diabetes_Type                   3
Sleep_Hours                    61
Stress_Level                   10
Screen_Time                   111
dtype: int64
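
The unique-value counts suggest which columns are categorical (a handful of levels) and which are numeric. A minimal sketch of separating the two groups by dtype follows; it assumes a hypothetical toy frame (`toy`) with a few of the dataset's columns, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy rows using a few of the dataset's column names.
toy = pd.DataFrame({
    "Age": [21, 18, 25, 22],
    "BMI": [31.4, 24.4, 20.0, 39.8],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Smoking": ["Yes", "No", "No", "No"],
})

# Numeric columns by dtype; everything else is treated as categorical.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_columns)
print("Categorical:", categorical_columns)

# Frequency of each level for one categorical column.
gender_counts = toy["Gender"].value_counts()
print(gender_counts)
```

Note that integer-coded columns such as `Fast_Food_Intake` land in the numeric group under this split, even though they have few distinct values.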

### 3.5 Check statistics of data set

In [18]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 100000.0 | 50000.5 | 28867.657797 | 1.0 | 25000.75 | 50000.5 | 75000.25 | 100000.0 |
| Age | 100000.0 | 20.00789 | 3.154934 | 15.0 | 17.0 | 20.0 | 23.0 | 25.0 |
| Family_Income | 100000.0 | 1299440.0 | 691940.317226 | 100004.0 | 702202.75 | 1299989.5 | 1898915.75 | 2499974.0 |
| Genetic_Risk_Score | 100000.0 | 5.50534 | 2.872218 | 1.0 | 3.0 | 6.0 | 8.0 | 10.0 |
| BMI | 100000.0 | 28.02809 | 6.924196 | 16.0 | 22.1 | 28.0 | 34.0 | 40.0 |
| Fast_Food_Intake | 100000.0 | 4.98858 | 3.169762 | 0.0 | 2.0 | 5.0 | 8.0 | 10.0 |
| Fasting_Blood_Sugar | 100000.0 | 125.0722 | 31.788613 | 70.0 | 97.6 | 125.2 | 152.6 | 180.0 |
| HbA1c | 100000.0 | 7.006461 | 1.735327 | 4.0 | 5.5 | 7.0 | 8.5 | 10.0 |
| Cholesterol_Level | 100000.0 | 209.904 | 52.049374 | 120.0 | 164.8 | 209.8 | 255.0 | 300.0 |
| Sleep_Hours | 100000.0 | 6.988082 | 1.734122 | 4.0 | 5.5 | 7.0 | 8.5 | 10.0 |


In [19]:
df.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Gender | 100000 | 3 | Female | 48073 |
| Region | 100000 | 6 | North | 16768 |
| Family_History_Diabetes | 100000 | 2 | No | 64912 |
| Parent_Diabetes_Type | 100000 | 3 | No Diabetes | 65097 |
| Physical_Activity_Level | 100000 | 3 | Sedentary | 50087 |
| Dietary_Habits | 100000 | 3 | Moderate | 39786 |
| Smoking | 100000 | 2 | No | 69786 |
| Alcohol_Consumption | 100000 | 2 | No | 80214 |
| Prediabetes | 100000 | 2 | No | 69979 |
| Diabetes_Type | 100000 | 3 | No Diabetes | 74776 |
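
The summary shows 69,979 of the 100,000 records with `Prediabetes` equal to `No`, so roughly 30% of records are positive. Class shares like this can be computed directly with `value_counts(normalize=True)`. The sketch below uses a hypothetical ten-row series (`toy_target`) with a similar 70/30 split, not the real column.

```python
import pandas as pd

# Hypothetical target column with a 70/30 split (assumption, not real data).
toy_target = pd.Series(
    ["Yes"] * 3 + ["No"] * 7,
    name="Prediabetes",
)

# Share of each class as a fraction of all rows.
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

On the real data, the equivalent call would be `df['Prediabetes'].value_counts(normalize=True)`.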


In [15]:
df.columns

Index(['ID', 'Age', 'Gender', 'Region', 'Family_Income',
       'Family_History_Diabetes', 'Parent_Diabetes_Type', 'Genetic_Risk_Score',
       'BMI', 'Physical_Activity_Level', 'Dietary_Habits', 'Fast_Food_Intake',
       'Smoking', 'Alcohol_Consumption', 'Fasting_Blood_Sugar', 'HbA1c',
       'Cholesterol_Level', 'Prediabetes', 'Diabetes_Type', 'Sleep_Hours',
       'Stress_Level', 'Screen_Time'],
      dtype='object')

---
## 4.) Saving the Processed data

In [12]:
save_location = '../data/processed_data/indian_youngsters_health_data.csv'
os.makedirs(os.path.dirname(save_location), exist_ok=True)  # Ensure the directory exists
df.to_csv(save_location, index=False)
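
After saving, a quick round trip can confirm that the file reloads with the same shape and without new missing values. This is a minimal sketch on a hypothetical toy frame (`toy`) written to a temporary directory; the file name `toy.csv` is illustrative and is not a path used by the notebook.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy rows that mimic the filled-in processed data.
toy = pd.DataFrame({
    "ID": [1, 2],
    "Diabetes_Type": ["No Diabetes", "Type 1"],
    "BMI": [31.4, 24.4],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_data", "toy.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # Ensure the directory exists
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and no missing values.
same_shape = reloaded.shape == toy.shape
missing_after_reload = int(reloaded.isna().sum().sum())
print(same_shape, missing_after_reload)
```

The same pattern applies to the real file by reading `save_location` back with `pd.read_csv` and comparing it against `df`.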