# Project-4-Group-7_Dm_Prediction
Diabetes Prediction Dataset retrieved from Kaggle, by Mohammed Mustafa: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

- Gender - refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes.
- Age - is an important factor as diabetes is more commonly diagnosed in older adults. Age ranges from 0-80 in this dataset.
- Hypertension - medical condition in which the blood pressure in the arteries is persistently elevated. It takes the value 0 or 1, where 0 indicates no hypertension and 1 indicates hypertension.
- Heart disease - medical condition that is associated with an increased risk of developing diabetes. It takes the value 0 or 1, where 0 indicates no heart disease and 1 indicates heart disease.
- Smoking history - considered a risk factor for diabetes and can exacerbate the complications associated with diabetes. The dataset has 6 categories: not current, former, No Info, current, never, and ever.
- BMI (Body Mass Index) - measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. The range of BMI in the dataset is from 10.16 to 71.55. 
- HbA1c (Hemoglobin A1c) level - measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. 
- Blood glucose level - amount of glucose in the bloodstream at a given time. High blood glucose levels are a key indicator of diabetes.

# Retrieve the dataset from the PostgreSQL database

In [None]:
# Import modules
from sqlalchemy import create_engine, MetaData, Table
import pandas as pd


In [None]:
# Define the connection string
## engine = create_engine('postgresql+psycopg2://user:password@hostname/database_name')

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost/Dm_Prediction")
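
The connection string above hard-codes the user name and password. One possible alternative, shown here only as a sketch, is to assemble the URL from environment variables. The variable names (`DM_DB_USER`, `DM_DB_PASSWORD`, `DM_DB_HOST`, `DM_DB_NAME`) and the helper `build_postgres_url` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL connection URL from environment variables.

    The variable names below are illustrative assumptions; the defaults
    mirror the values used in the notebook's connection string.
    """
    user = env.get("DM_DB_USER", "postgres")
    password = env.get("DM_DB_PASSWORD", "postgres")
    host = env.get("DM_DB_HOST", "localhost")
    database = env.get("DM_DB_NAME", "Dm_Prediction")
    # Percent-encode the credentials so characters such as "@" or "/"
    # do not break the URL structure
    user = quote(user, safe="")
    password = quote(password, safe="")
    return f"postgresql+psycopg2://{user}:{password}@{host}/{database}"


db_url = build_postgres_url()
# The resulting string can then be passed to create_engine(db_url)
```

With no variables set, the helper falls back to the same local defaults as the cell above, so the rest of the notebook would behave the same way.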

In [None]:
# Reflect the database schema
metadata = MetaData()
metadata.reflect(bind=engine)

In [None]:
# Select the table
dm_prediction_table = Table('dm_prediction', metadata, autoload_with=engine)

In [None]:
# Use pandas to query the table and load it into a DataFrame
dm_prediction_df = pd.read_sql(dm_prediction_table.select(), engine)

# Display the first few rows of the DataFrame
dm_prediction_df.head()

In [None]:
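# Save the DataFrame to a CSV file (create the output folder first if it is missing)
import os
csv_file_path = 'Resources/diabetes_prediction_dataset.csv'
os.makedirs(os.path.dirname(csv_file_path), exist_ok=True)
dm_prediction_df.to_csv(csv_file_path, index=False)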

# Data cleaning and preparation using Pandas

In [None]:
# View the shape of the dataset
dm_prediction_df.shape

In [None]:
# Check for null values
dm_prediction_df.isnull().sum()

In [None]:
# View unique values for Gender
dm_prediction_df['gender'].unique()

In [None]:
# Convert Gender to numeric values
dm_prediction_df['gender'] = dm_prediction_df['gender'].replace({'Male': 0, 'Female': 1, 'Other': 2})
dm_prediction_df.head()

In [None]:
# View unique values for Smoking History
dm_prediction_df['smoking_history'].unique()

In [None]:
# Convert Smoking History to numeric values
dm_prediction_df['smoking_history'] = dm_prediction_df['smoking_history'].replace({'never': 0, 'No Info': 1, 'current': 2, 'former': 3, 'ever': 4, 'not current': 5})
dm_prediction_df.head()
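
`replace` leaves any value that is missing from the dictionary untouched, so an unexpected category would silently remain a string. The sketch below shows one way to guard against that with `Series.map` plus an explicit check. The helper `encode_categories` and the sample data are hypothetical; only the code dictionary is taken from the cell above.

```python
import pandas as pd

# Same codes as in the smoking history cell above
SMOKING_CODES = {'never': 0, 'No Info': 1, 'current': 2,
                 'former': 3, 'ever': 4, 'not current': 5}


def encode_categories(series, codes):
    """Map categories to integer codes, failing loudly on unmapped values."""
    unknown = set(series.dropna().unique()) - set(codes)
    if unknown:
        raise ValueError(f"Unmapped categories: {sorted(unknown)}")
    return series.map(codes)


# Small illustrative sample, not rows from the dataset
sample = pd.Series(['never', 'No Info', 'current', 'former', 'ever', 'not current'])
encoded = encode_categories(sample, SMOKING_CODES)
print(encoded.tolist())
```

Note that these integer codes are arbitrary labels. They do not express an ordering of smoking risk.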

In [None]:
# Convert BMI to numeric values according to the ranges
## Underweight: < 18.5 
## Healthy Weight: 18.5 to 24.9 
## Overweight: 25.0 to 29.9 
## Obese: >= 30.0 

# Function to categorize BMI with numeric values
def categorize_bmi_numeric(bmi):
    if bmi < 18.5:
        return 0  # Underweight
    elif bmi < 25.0:
        return 1  # Healthy Weight
    elif bmi < 30.0:
        return 2  # Overweight
    else:
        return 3  # Obese

# Apply the function to the BMI column
dm_prediction_df['bmi'] = dm_prediction_df['bmi'].apply(categorize_bmi_numeric)
dm_prediction_df.head()
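
As an alternative to applying a Python function row by row, the same BMI ranges can be expressed with `pd.cut`. This is a minimal sketch on made-up sample values, assuming the column contains no missing values. The names `bmi_bins`, `bmi_codes`, and `sample_bmi` are illustrative.

```python
import numpy as np
import pandas as pd

# Bin edges matching the ranges above; right=False makes each bin
# include its lower edge and exclude its upper edge, e.g. [18.5, 25.0)
bmi_bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]
bmi_codes = [0, 1, 2, 3]  # Underweight, Healthy Weight, Overweight, Obese

# Illustrative values chosen to sit on and around the boundaries
sample_bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 45.0])

bmi_category = pd.cut(sample_bmi, bins=bmi_bins, labels=bmi_codes, right=False).astype(int)
print(bmi_category.tolist())
```

The `astype(int)` call converts the categorical result to plain integers. It would raise an error if any value were missing, which is why the sketch assumes a column without nulls.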


In [None]:
# Convert HbA1c level to numeric values according to the ranges
## Normal: < 5.7% 
## PreDiabetes: 5.7% to 6.4% 
## Diagnosis of Diabetes: >= 6.5% 

# Function to categorize HbA1c
def categorize_hba1c(hba1c):
    if hba1c < 5.7:
        return 0  # Normal
    elif hba1c < 6.5:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the HbA1c column
dm_prediction_df['hba1c_level'] = dm_prediction_df['hba1c_level'].apply(categorize_hba1c)
dm_prediction_df.head()

In [None]:
# Convert Blood Glucose level to numeric values according to the ranges
## Normal: 99 mg/dL or below
## Prediabetes: 100–125 mg/dL
## Diabetes: 126 mg/dL or above

# Function to categorize Blood Glucose levels
def categorize_blood_glucose(blood_glucose):
    if blood_glucose <= 99:
        return 0  # Normal
    elif blood_glucose <= 125:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the Blood Glucose column
dm_prediction_df['blood_glucose_level'] = dm_prediction_df['blood_glucose_level'].apply(categorize_blood_glucose)
dm_prediction_df.head()
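
The three categorization functions above share one pattern: compare a value against a sorted list of cut points and return the index of the range it falls in. The sketch below expresses that pattern once, using `bisect_right` from the standard library. The helper `categorize` and the constant names are hypothetical. For blood glucose it matches the function above on whole-number readings, but a fractional reading between 99 and 100 would be labeled Normal here and Prediabetes above.

```python
from bisect import bisect_right

# Each list holds the lower bound of every category after the first
BMI_CUTS = [18.5, 25.0, 30.0]   # Underweight / Healthy Weight / Overweight / Obese
HBA1C_CUTS = [5.7, 6.5]         # Normal / Prediabetes / Diabetes
GLUCOSE_CUTS = [100, 126]       # Normal / Prediabetes / Diabetes


def categorize(value, cuts):
    """Return the number of cut points that are <= value, i.e. the category code."""
    return bisect_right(cuts, value)


# Boundary checks against the ranges listed in the cells above
print(categorize(18.5, BMI_CUTS))     # healthy weight starts at 18.5
print(categorize(6.4, HBA1C_CUTS))    # still prediabetes
print(categorize(126, GLUCOSE_CUTS))  # diabetes starts at 126
```

It could be applied to a column in the same way as the functions above, for example `dm_prediction_df['hba1c_level'].apply(categorize, cuts=HBA1C_CUTS)`.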

In [None]:
# Change diabetes column name to diabetes_status
dm_prediction_df.rename(columns={'diabetes': 'diabetes_status'}, inplace=True)
dm_prediction_df.head()

In [None]:
# Save the cleaned DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_cleaned.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)