# Project-4-Group-7_Dm_Prediction
Diabetes Prediction Dataset by Mohammed Mustafa, retrieved from Kaggle: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

- Gender - the biological sex of the individual, which can affect susceptibility to diabetes. The dataset has 3 values: Female, Male and Other.
- Age - an important factor, as diabetes is more commonly diagnosed in older adults. Age ranges from 0.08 to 80 in this dataset.
- Hypertension - medical condition in which the blood pressure in the arteries is persistently elevated. It takes the value 0 or 1, where 0 indicates no hypertension and 1 indicates hypertension.
- Heart disease - medical condition that is associated with an increased risk of developing diabetes. It takes the value 0 or 1, where 0 indicates no heart disease and 1 indicates heart disease.
- Smoking history - considered a risk factor for diabetes that can also exacerbate its complications. The dataset has 6 categories: not current, former, No Info, current, never and ever.
- BMI (Body Mass Index) - measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. BMI in the dataset ranges from 10.01 to 95.69.
- HbA1c (Hemoglobin A1c) level - measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. Values in the dataset range from 3.5 to 9.0.
- Blood glucose level - the amount of glucose in the bloodstream at a given point in time. Higher levels are a key indicator of diabetes. Values in the dataset range from 80 to 300.
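The value constraints in the list above can be checked in code. The following is a minimal sketch and not part of the original notebook: the five rows are copied from the `head()` output shown later in this README, and the helper name `check_schema` is hypothetical.

```python
import pandas as pd

# Five rows copied from the head() output in this README (not the full dataset)
sample_df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female", "Male"],
    "age": [80.0, 54.0, 28.0, 36.0, 76.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 0, 0, 1],
    "smoking_history": ["never", "No Info", "never", "current", "current"],
    "bmi": [25.19, 27.32, 27.32, 23.45, 20.14],
    "hba1c_level": [6.6, 6.6, 5.7, 5.0, 4.8],
    "blood_glucose_level": [140, 80, 158, 155, 155],
    "diabetes": [0, 0, 0, 0, 0],
})

SMOKING_CATEGORIES = ["not current", "former", "No Info", "current", "never", "ever"]
BINARY_COLUMNS = ["hypertension", "heart_disease", "diabetes"]


def check_schema(df):
    """Hypothetical helper: return a list of human-readable schema problems."""
    problems = []
    for col in BINARY_COLUMNS:
        if not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0 or 1")
    if not df["smoking_history"].isin(SMOKING_CATEGORIES).all():
        problems.append("smoking_history has an unexpected category")
    if not df["age"].between(0, 80).all():
        problems.append("age is outside 0-80")
    return problems


problems = check_schema(sample_df)
print(problems or "all checks passed")
```

Running the same helper on the full DataFrame after loading would flag any rows that fall outside the documented ranges.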

# Retrieve the dataset from the PostgreSQL database

In [1]:
# Import modules
from sqlalchemy import create_engine, MetaData, Table
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
import pandas as pd

In [2]:
# Define the connection string
## engine = create_engine('postgresql+psycopg2://user:password@hostname/database_name')
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost/Dm_Prediction")

In [3]:
# Reflect the database schema
metadata = MetaData()
metadata.reflect(bind=engine)

In [4]:
# Select the table
dm_prediction_table = Table('dm_prediction', metadata, autoload_with=engine)

In [5]:
# Use pandas to query the table and load it into a DataFrame
dm_prediction_df = pd.read_sql(dm_prediction_table.select(), engine)

# Display the first few rows of the DataFrame
dm_prediction_df.head()

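|   | gender | age | hypertension | heart_disease | smoking_history | bmi | hba1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |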
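For readers without a running PostgreSQL instance, the same load pattern can be tried against an in-memory SQLite database. This is only a sketch under that assumption: it swaps in the standard-library `sqlite3` driver instead of the project's PostgreSQL setup, and the two-row table is a cut-down illustration, not the real data.

```python
import sqlite3

import pandas as pd

# Tiny stand-in table with a subset of the real columns (illustrative only)
rows = pd.DataFrame({
    "gender": ["Female", "Male"],
    "age": [80.0, 28.0],
    "diabetes": [0, 0],
})

# Assumption: an in-memory SQLite database replaces the PostgreSQL server
conn = sqlite3.connect(":memory:")
rows.to_sql("dm_prediction", conn, index=False)

# Same idea as the notebook cell: run a SELECT and load it into a DataFrame
loaded = pd.read_sql("SELECT * FROM dm_prediction", conn)
conn.close()

print(loaded)
```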


In [6]:
# Save the DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_dataset.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)

# Data exploration and preparation 

In [7]:
# View column dtypes, non-null counts and memory usage
dm_prediction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   hba1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [8]:
# Check for null values
dm_prediction_df.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
hba1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64
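A zero null count does not mean every value is informative: `smoking_history` uses the category `No Info`, which `isnull()` does not detect. The sketch below, using the five `smoking_history` values from the `head()` output, shows one possible way to count such placeholders; treating `No Info` as missing is an assumption, not something the notebook does.

```python
import pandas as pd

# smoking_history values copied from the head() output in this README
smoking = pd.Series(
    ["never", "No Info", "never", "current", "current"],
    name="smoking_history",
)

# isnull() sees no missing values because "No Info" is an ordinary string
null_count = int(smoking.isnull().sum())

# Count the placeholder category explicitly
no_info_count = int((smoking == "No Info").sum())

# Assumption: treat "No Info" as missing by masking it to NaN
as_missing = smoking.mask(smoking == "No Info")

print(null_count, no_info_count, int(as_missing.isnull().sum()))
```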

In [9]:
# Determine the number of unique values in each column.
dm_prediction_df.nunique()

gender                    3
age                     102
hypertension              2
heart_disease             2
smoking_history           6
bmi                    4247
hba1c_level              18
blood_glucose_level      18
diabetes                  2
dtype: int64

In [10]:
# Change diabetes column name to diabetes_status
dm_prediction_df.rename(columns={'diabetes': 'diabetes_status'}, inplace=True)
dm_prediction_df.head()

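|   | gender | age | hypertension | heart_disease | smoking_history | bmi | hba1c_level | blood_glucose_level | diabetes_status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |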


In [11]:
# Summary statistics for the numeric columns (transposed: one row per column)
dm_prediction_df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 100000.0 | 41.885856 | 22.51684 | 0.08 | 24.0 | 43.0 | 60.0 | 80.0 |
| hypertension | 100000.0 | 0.07485 | 0.26315 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| heart_disease | 100000.0 | 0.03942 | 0.194593 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| hba1c_level | 100000.0 | 5.527507 | 1.070672 | 3.5 | 4.8 | 5.8 | 6.2 | 9.0 |
| blood_glucose_level | 100000.0 | 138.05806 | 40.708136 | 80.0 | 100.0 | 140.0 | 159.0 | 300.0 |
| diabetes_status | 100000.0 | 0.085 | 0.278883 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
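Because `diabetes_status` only takes the values 0 and 1, its mean of 0.085 is the share of positive cases: about 8,500 of the 100,000 rows, so the classes are imbalanced. The sketch below rebuilds a synthetic label vector with that split (an illustration, not the real column) to show that it gives the same mean and standard deviation as the `describe()` row.

```python
import pandas as pd

n_rows = 100_000
positive_share = 0.085  # mean of diabetes_status from describe()

# A 0/1 column with this mean must contain n_rows * mean ones
n_positive = round(n_rows * positive_share)

# Synthetic labels with the same class split (not the real data)
labels = pd.Series([1] * n_positive + [0] * (n_rows - n_positive))

print(labels.value_counts())
print(labels.mean(), labels.std())
```

On the real DataFrame, `dm_prediction_df['diabetes_status'].value_counts()` would give the class counts directly.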


# One-Hot Encoding to convert the categorical columns

In [12]:
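# Convert categorical data to numeric with `pd.get_dummies`
# Note: `.astype(int)` casts every column, so age, bmi and hba1c_level are
# truncated to whole numbers (visible in the output that follows). Passing
# `dtype=int` to `pd.get_dummies` instead would leave the float columns intact.
dm_prediction_df = pd.get_dummies(dm_prediction_df, columns=['gender', 'smoking_history']).astype(int)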
dm_prediction_df.head()

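|   | age | hypertension | heart_disease | bmi | hba1c_level | blood_glucose_level | diabetes_status | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 80 | 0 | 1 | 25 | 6 | 140 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 54 | 0 | 0 | 27 | 6 | 80 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 28 | 0 | 0 | 27 | 5 | 158 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 36 | 0 | 0 | 23 | 5 | 155 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 76 | 1 | 1 | 20 | 4 | 155 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |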


# Standardize the numerical columns using StandardScaler

In [13]:
# Standardize numerical columns
columns_to_standardize = ['age', 'bmi', 'hba1c_level', 'blood_glucose_level']
scaler = StandardScaler()

dm_prediction_df[columns_to_standardize] = scaler.fit_transform(dm_prediction_df[columns_to_standardize])

# Display the resulting DataFrame
dm_prediction_df.head()

|   | age | hypertension | heart_disease | bmi | hba1c_level | blood_glucose_level | diabetes_status | gender_Female | gender_Male | gender_Other | smoking_history_No Info | smoking_history_current | smoking_history_ever | smoking_history_former | smoking_history_never | smoking_history_not current |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.691761 | 0 | 1 | -0.282234 | 0.787897 | 0.047704 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.538015 | 0 | 0 | 0.018883 | 0.787897 | -1.42621 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | -0.61573 | 0 | 0 | 0.018883 | -0.120279 | 0.489878 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | -0.260731 | 0 | 0 | -0.58335 | -0.120279 | 0.416183 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1.514261 | 1 | 1 | -1.035025 | -1.028455 | 0.416183 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [14]:
# Save the DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_cleaned_one_hot_encoding.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)