# Project-4-Group-7_Dm_Prediction
Diabetes Prediction Dataset, retrieved from Kaggle (by Mohammed Mustafa): https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

- Gender - refers to the biological sex of the individual, which can impact their susceptibility to diabetes.

- Age - is an important factor, as diabetes is more commonly diagnosed in older adults. The age range in this dataset is 0 to 80.

- Hypertension - is a medical condition in which the blood pressure in the arteries is persistently elevated. It has values of 0 or 1, where 0 indicates no hypertension and 1 means they have hypertension.

- Heart disease - is a medical condition associated with an increased risk of developing diabetes. It has values of 0 or 1, where 0 indicates no heart disease and 1 means they have heart disease.

- Smoking history - is considered a risk factor for diabetes and can exacerbate the complications associated with it. The dataset has six categories: not current, former, No Info, current, never, and ever.

- BMI (Body Mass Index) - is a measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. The range of BMI in the dataset is from 10.16 to 71.55.

- HbA1c (Hemoglobin A1c or Glycated hemoglobin) level - is a measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes.

- Blood glucose level - is the amount of glucose in the bloodstream at a given point in time (in mg/dL), unlike HbA1c, which reflects a 2-3 month average. High blood glucose levels are a key indicator of diabetes.

- Diabetes - (the target) is a chronic medical disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces. It has values of 0 or 1, where 0 indicates no diabetes and 1 means they have diabetes.
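
The value ranges listed above can be turned into a quick sanity check on the loaded data. The following is a minimal sketch, not part of the original notebook: the function name `find_schema_problems` is hypothetical, the column names are taken from the DataFrame shown later in this README, and the two sample rows are the first two rows of that DataFrame.

```python
import pandas as pd

# Expected values, copied from the column descriptions above.
BINARY_COLUMNS = ["hypertension", "heart_disease", "diabetes"]
SMOKING_CATEGORIES = ["not current", "former", "No Info", "current", "never", "ever"]


def find_schema_problems(df):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration only.
    """
    problems = []
    for column in BINARY_COLUMNS:
        if not df[column].isin([0, 1]).all():
            problems.append(f"{column}: expected only 0 or 1")
    if not df["age"].between(0, 80).all():
        problems.append("age: expected values from 0 to 80")
    if not df["bmi"].between(10.16, 71.55).all():
        problems.append("bmi: expected values from 10.16 to 71.55")
    if not df["smoking_history"].isin(SMOKING_CATEGORIES).all():
        problems.append("smoking_history: unexpected category")
    return problems


# First two rows of the dataset as displayed in this README.
sample = pd.DataFrame({
    "age": [80.0, 54.0],
    "hypertension": [0, 0],
    "heart_disease": [1, 0],
    "smoking_history": ["never", "No Info"],
    "bmi": [25.19, 27.32],
    "diabetes": [0, 0],
})
print(find_schema_problems(sample))
```

Run against the full DataFrame before any encoding, a non-empty result would point to rows that do not match the description above.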

# Retrieve the dataset from the PostgreSQL database

In [1]:
# Import modules
from sqlalchemy import create_engine, MetaData, Table
import pandas as pd

In [2]:
# Define the connection string
## engine = create_engine('postgresql+psycopg2://user:password@hostname/database_name')
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost/Dm_Prediction")
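
The connection string above hard-codes the local credentials. One possible alternative, sketched here under assumptions, is to build the same URL from environment variables. The helper `build_postgres_url` and the `DM_DB_*` variable names are hypothetical; the defaults mirror the local setup used in this notebook. Percent-encoding the user and password with `quote_plus` keeps characters such as `@` from breaking the URL.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, database, driver="postgresql+psycopg2"):
    """Build a SQLAlchemy-style database URL, percent-encoding user and password.

    Hypothetical helper for illustration only.
    """
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# The DM_DB_* names are assumptions; defaults match the connection string above.
db_url = build_postgres_url(
    user=os.environ.get("DM_DB_USER", "postgres"),
    password=os.environ.get("DM_DB_PASSWORD", "postgres"),
    host=os.environ.get("DM_DB_HOST", "localhost"),
    database=os.environ.get("DM_DB_NAME", "Dm_Prediction"),
)

# The result would be passed to create_engine exactly like the literal string:
# engine = create_engine(db_url)
```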

In [3]:
# Reflect the database schema
metadata = MetaData()
metadata.reflect(bind=engine)

In [4]:
# Select the table
dm_prediction_table = Table('dm_prediction', metadata, autoload_with=engine)

In [5]:
# Use pandas to query the table and load it into a DataFrame
dm_prediction_df = pd.read_sql(dm_prediction_table.select(), engine)

# Display the first few rows of the DataFrame
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [6]:
# Save the DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_dataset.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)

# Data exploration and preparation using Pandas

In [7]:
# View the column names, dtypes, and non-null counts
dm_prediction_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   hba1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [8]:
# Check for null values
dm_prediction_df.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
hba1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [9]:
# Determine the number of unique values in each column.
dm_prediction_df.nunique()

gender                    3
age                     102
hypertension              2
heart_disease             2
smoking_history           6
bmi                    4247
hba1c_level              18
blood_glucose_level      18
diabetes                  2
dtype: int64

# Label encoding for the categorical columns (using Pandas)

In [10]:
# View unique values for Gender
dm_prediction_df['gender'].unique()

array(['Female', 'Male', 'Other'], dtype=object)

In [11]:
# Convert Gender to numeric values
dm_prediction_df['gender'] = dm_prediction_df['gender'].replace({'Male': 0, 'Female': 1, 'Other': 2})
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,never,25.19,6.6,140,0
1,1,54.0,0,0,No Info,27.32,6.6,80,0
2,0,28.0,0,0,never,27.32,5.7,158,0
3,1,36.0,0,0,current,23.45,5.0,155,0
4,0,76.0,1,1,current,20.14,4.8,155,0
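
`replace` silently leaves any value that is missing from the dictionary unchanged, and on object columns it may emit a downcasting warning in recent pandas versions. A minimal alternative sketch uses `Series.map` and fails loudly on unmapped categories. The helper name `encode_with_mapping` is hypothetical; the codes are the same ones used in the cell above.

```python
import pandas as pd

# Same codes as in the notebook cell above.
GENDER_CODES = {"Male": 0, "Female": 1, "Other": 2}


def encode_with_mapping(series, mapping):
    """Map categories to integer codes and raise if any value has no code.

    Hypothetical helper for illustration only.
    """
    encoded = series.map(mapping)
    # map() yields NaN for values that are not keys of the mapping.
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped categories: {list(unmapped)}")
    return encoded.astype("int64")


# Gender values of the first five rows shown above.
sample = pd.Series(["Female", "Female", "Male", "Female", "Male"])
codes = encode_with_mapping(sample, GENDER_CODES)
print(codes.tolist())
```

The same helper would work for `smoking_history` with its own dictionary.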


In [12]:
# View unique values for Smoking History
dm_prediction_df['smoking_history'].unique()

array(['never', 'No Info', 'current', 'former', 'ever', 'not current'],
      dtype=object)

In [13]:
# Convert Smoking History to numeric values
dm_prediction_df['smoking_history'] = dm_prediction_df['smoking_history'].replace({'never': 0, 'No Info': 1, 'current': 2, 'former': 3, 'ever': 4, 'not current': 5})
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,25.19,6.6,140,0
1,1,54.0,0,0,1,27.32,6.6,80,0
2,0,28.0,0,0,0,27.32,5.7,158,0
3,1,36.0,0,0,2,23.45,5.0,155,0
4,0,76.0,1,1,2,20.14,4.8,155,0
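
The integer codes 0 to 5 assigned above imply an order among the smoking categories that the data itself does not define (for example, "No Info" sits between "never" and "current"). For models that are sensitive to that ordering, one-hot encoding is a common alternative. The sketch below is an illustration under that assumption, not the approach used in this project, and the `smoking` column prefix is an arbitrary choice.

```python
import pandas as pd

# One row per category found in the smoking_history column.
sample = pd.DataFrame({
    "smoking_history": ["never", "No Info", "current", "former", "ever", "not current"],
})

# Each category becomes its own 0/1 indicator column.
one_hot = pd.get_dummies(sample, columns=["smoking_history"], prefix="smoking", dtype=int)
print(one_hot.columns.tolist())
```

This would need to run on the original string column, before the `replace` step above.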


# Ordinal encoding for the numeric columns (using Pandas)

In [14]:
# Convert BMI to numeric values according to the ranges
## Underweight: < 18.5 
## Healthy Weight: 18.5 to 24.9 
## Overweight: 25.0 to 29.9 
## Obese: >= 30.0 

# Function to categorize BMI with numeric values
def categorize_bmi_numeric(bmi):
    if bmi < 18.5:
        return 0  # Underweight
    elif bmi < 25.0:
        return 1  # Healthy Weight
    elif bmi < 30.0:
        return 2  # Overweight
    else:
        return 3  # Obese

# Apply the function to the BMI column
dm_prediction_df['bmi'] = dm_prediction_df['bmi'].apply(categorize_bmi_numeric)
dm_prediction_df.head()


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,6.6,140,0
1,1,54.0,0,0,1,2,6.6,80,0
2,0,28.0,0,0,0,2,5.7,158,0
3,1,36.0,0,0,2,1,5.0,155,0
4,0,76.0,1,1,2,1,4.8,155,0
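
The row-by-row `apply` above can also be written as a single vectorized call. The following sketch uses `pd.cut` with the same thresholds. `right=False` makes each bin include its left edge, which matches the `<` comparisons in `categorize_bmi_numeric`. The function name `categorize_bmi_vectorized` is hypothetical, and the sketch assumes the column contains no missing values, as the null check above shows.

```python
import pandas as pd

# Bin edges: [-inf, 18.5), [18.5, 25.0), [25.0, 30.0), [30.0, inf)
BMI_EDGES = [float("-inf"), 18.5, 25.0, 30.0, float("inf")]
BMI_CODES = [0, 1, 2, 3]  # Underweight, Healthy Weight, Overweight, Obese


def categorize_bmi_vectorized(bmi):
    """Bin a Series of BMI values into the codes 0 to 3 (assumes no NaN values)."""
    binned = pd.cut(bmi, bins=BMI_EDGES, labels=BMI_CODES, right=False)
    return binned.astype("int64")


# BMI values of the first five rows, before categorization.
sample = pd.Series([25.19, 27.32, 27.32, 23.45, 20.14])
print(categorize_bmi_vectorized(sample).tolist())
```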


In [15]:
# Convert HbA1c level to numeric values according to the ranges
## Normal: < 5.7% 
## PreDiabetes: 5.7% to 6.4% 
## Diagnosis of Diabetes: >= 6.5% 

# Function to categorize HbA1c
def categorize_hba1c(hba1c):
    if hba1c < 5.7:
        return 0  # Normal
    elif hba1c < 6.5:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the HbA1c column
dm_prediction_df['hba1c_level'] = dm_prediction_df['hba1c_level'].apply(categorize_hba1c)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,2,140,0
1,1,54.0,0,0,1,2,2,80,0
2,0,28.0,0,0,0,2,1,158,0
3,1,36.0,0,0,2,1,0,155,0
4,0,76.0,1,1,2,1,0,155,0


In [16]:
# Convert Blood Glucose level to numeric values according to the ranges
## Normal: 99 mg/dL or below
## Prediabetes: 100–125 mg/dL
## Diabetes: 126 mg/dL or above

# Function to categorize Blood Glucose levels
def categorize_blood_glucose(blood_glucose):
    if blood_glucose <= 99:
        return 0  # Normal
    elif blood_glucose <= 125:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the Blood Glucose column
dm_prediction_df['blood_glucose_level'] = dm_prediction_df['blood_glucose_level'].apply(categorize_blood_glucose)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,2,2,0
1,1,54.0,0,0,1,2,2,0,0
2,0,28.0,0,0,0,2,1,2,0
3,1,36.0,0,0,2,1,0,2,0
4,0,76.0,1,1,2,1,0,2,0
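
The three categorization functions above follow the same pattern: count how many thresholds a value has reached. As a sketch of that idea, a single helper built on `bisect_right` from the standard library can cover all three columns. The name `categorize` and the `*_CUTS` constants are hypothetical. The glucose cut points are written as 100 and 126, which matches the `<= 99` and `<= 125` comparisons only because `blood_glucose_level` holds integers.

```python
from bisect import bisect_right

# Lower bounds of each category after the first one.
BMI_CUTS = [18.5, 25.0, 30.0]   # Healthy Weight, Overweight, Obese
HBA1C_CUTS = [5.7, 6.5]         # Prediabetes, Diabetes
GLUCOSE_CUTS = [100, 126]       # Prediabetes, Diabetes (integer mg/dL)


def categorize(value, cuts):
    """Return how many cut points are less than or equal to the value."""
    return bisect_right(cuts, value)


# First row of the dataset: bmi 25.19, hba1c_level 6.6, blood_glucose_level 140.
first_row_codes = (
    categorize(25.19, BMI_CUTS),
    categorize(6.6, HBA1C_CUTS),
    categorize(140, GLUCOSE_CUTS),
)
print(first_row_codes)
```

It could be applied to a column with, for example, `df['bmi'].apply(categorize, cuts=BMI_CUTS)`.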


In [17]:
# Change diabetes column name to diabetes_status
dm_prediction_df.rename(columns={'diabetes': 'diabetes_status'}, inplace=True)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes_status
0,1,80.0,0,1,0,2,2,2,0
1,1,54.0,0,0,1,2,2,0,0
2,0,28.0,0,0,0,2,1,2,0
3,1,36.0,0,0,2,1,0,2,0
4,0,76.0,1,1,2,1,0,2,0


In [18]:
# Save the DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_cleaned_label_encoding.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)