# Project-4-Group-7_Dm_Prediction
Diabetes Prediction Dataset retrieved from Kaggle, by Mohammed Mustafa: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

- Gender - refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes.
- Age - is an important factor as diabetes is more commonly diagnosed in older adults. Age ranges from 0-80 in this dataset.
- Hypertension - medical condition in which the blood pressure in the arteries is persistently elevated. It takes the value 0 or 1, where 0 indicates no hypertension and 1 indicates hypertension.
- Heart disease - medical condition that is associated with an increased risk of developing diabetes. It takes the value 0 or 1, where 0 indicates no heart disease and 1 indicates heart disease.
- Smoking history - considered a risk factor for diabetes and can exacerbate the complications associated with diabetes. The dataset has 6 categories: not current, former, No Info, current, never, and ever.
- BMI (Body Mass Index) - measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. The range of BMI in the dataset is from 10.16 to 71.55. 
- HbA1c (Hemoglobin A1c) level - measure of a person's average blood sugar level over the past 2-3 months. Higher levels indicate a greater risk of developing diabetes. 
- Blood glucose level - measure of the amount of glucose in the blood at the time of testing. Higher levels indicate a greater risk of diabetes.

# Retrieve the dataset from the PostgreSQL database

In [1]:
# Import modules
import psycopg2
import pandas as pd

In [2]:
# Define the connection string
conn_str = "host='localhost' dbname='Dm_Prediction' user='postgres' password='postgres'"

# Connect to the PostgreSQL database
conn = psycopg2.connect(conn_str)

# Create a cursor object
cur = conn.cursor()
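
Hard-coding the password is convenient for a local demo, but the same connection string can be assembled from environment variables. The sketch below shows one way to do that. The `DM_DB_*` variable names and the `build_conn_str` helper are hypothetical and not part of this project; the defaults simply mirror the values used in the cell above.

```python
import os


def build_conn_str(env=os.environ):
    """Build a libpq-style connection string from environment variables.

    The DM_DB_* names are hypothetical; the defaults mirror the values
    hard-coded in the notebook cell.
    """
    host = env.get("DM_DB_HOST", "localhost")
    dbname = env.get("DM_DB_NAME", "Dm_Prediction")
    user = env.get("DM_DB_USER", "postgres")
    password = env.get("DM_DB_PASSWORD", "postgres")
    return f"host='{host}' dbname='{dbname}' user='{user}' password='{password}'"


conn_str = build_conn_str()
# conn = psycopg2.connect(conn_str)  # same call as in the cell, without a literal password
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.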

In [3]:
# Execute a query to select all rows from the dm_prediction table
cur.execute("SELECT * FROM dm_prediction")

# Fetch all rows from the executed query
rows = cur.fetchall()

# Print each row (optional, can be commented out if not needed)
for row in rows:
    print(row)

('Female', Decimal('80.0'), 0, 1, 'never', Decimal('25.19'), Decimal('6.6'), 140, 0)
('Female', Decimal('54.0'), 0, 0, 'No Info', Decimal('27.32'), Decimal('6.6'), 80, 0)
('Male', Decimal('28.0'), 0, 0, 'never', Decimal('27.32'), Decimal('5.7'), 158, 0)
('Female', Decimal('36.0'), 0, 0, 'current', Decimal('23.45'), Decimal('5.0'), 155, 0)
('Male', Decimal('76.0'), 1, 1, 'current', Decimal('20.14'), Decimal('4.8'), 155, 0)
('Female', Decimal('20.0'), 0, 0, 'never', Decimal('27.32'), Decimal('6.6'), 85, 0)
('Female', Decimal('44.0'), 0, 0, 'never', Decimal('19.31'), Decimal('6.5'), 200, 1)
('Female', Decimal('79.0'), 0, 0, 'No Info', Decimal('23.86'), Decimal('5.7'), 85, 0)
('Male', Decimal('42.0'), 0, 0, 'never', Decimal('33.64'), Decimal('4.8'), 145, 0)
('Female', Decimal('32.0'), 0, 0, 'never', Decimal('27.32'), Decimal('5.0'), 100, 0)
('Female', Decimal('53.0'), 0, 0, 'never', Decimal('27.32'), Decimal('6.1'), 85, 0)
('Female', Decimal('54.0'), 0, 0, 'former', Decimal('54.7'), Decima

In [4]:
# Get column names from the cursor
colnames = [desc[0] for desc in cur.description]

# Load data into a Pandas DataFrame
dm_prediction_df = pd.DataFrame(rows, columns=colnames)

# Close the cursor and the connection
cur.close()
conn.close()

# Display the first few rows of the DataFrame
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
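
The printed rows show that psycopg2 returns the NUMERIC columns (`age`, `bmi`, `hba1c_level`) as `Decimal` objects, which pandas stores with the generic `object` dtype. The sketch below, built on two illustrative rows with a subset of the columns, shows one way to cast such columns to `float64` before numeric work. The `sample_df` and `numeric_cols` names are assumptions made for this example.

```python
from decimal import Decimal

import pandas as pd

# Two rows shaped like the query output (illustrative subset of columns).
rows = [
    ("Female", Decimal("80.0"), Decimal("25.19"), Decimal("6.6"), 140),
    ("Male", Decimal("28.0"), Decimal("27.32"), Decimal("5.7"), 158),
]
colnames = ["gender", "age", "bmi", "hba1c_level", "blood_glucose_level"]
sample_df = pd.DataFrame(rows, columns=colnames)

# Decimal values are held as Python objects rather than floats.
print(sample_df.dtypes)

# Cast the Decimal-backed columns to float64; astype returns a new DataFrame.
numeric_cols = ["age", "bmi", "hba1c_level"]
sample_df = sample_df.astype({col: float for col in numeric_cols})
print(sample_df.dtypes)
```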


In [5]:
# Save the DataFrame to a CSV file
csv_file_path = 'Resources/diabetes_prediction_dataset.csv'
dm_prediction_df.to_csv(csv_file_path, index=False)

# Data cleaning and preparation using Pandas

In [6]:
# Import the modules
import numpy as np
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

In [7]:
# Display the first few rows of the DataFrame 
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [8]:
# View the shape of the dataset
dm_prediction_df.shape

(100000, 9)

In [9]:
# Check for null values
dm_prediction_df.isnull().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
hba1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [10]:
# View unique values for Gender
dm_prediction_df['gender'].unique()

array(['Female', 'Male', 'Other'], dtype=object)

In [11]:
# Convert Gender to numeric values
dm_prediction_df['gender'] = dm_prediction_df['gender'].replace({'Male': 0, 'Female': 1, 'Other': 2})
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,never,25.19,6.6,140,0
1,1,54.0,0,0,No Info,27.32,6.6,80,0
2,0,28.0,0,0,never,27.32,5.7,158,0
3,1,36.0,0,0,current,23.45,5.0,155,0
4,0,76.0,1,1,current,20.14,4.8,155,0


In [12]:
# View unique values for Smoking History
dm_prediction_df['smoking_history'].unique()

array(['never', 'No Info', 'current', 'former', 'ever', 'not current'],
      dtype=object)

In [13]:
# Convert Smoking History to numeric values
dm_prediction_df['smoking_history'] = dm_prediction_df['smoking_history'].replace({'never': 0, 'No Info': 1, 'current': 2, 'former': 3, 'ever': 4, 'not current': 5})
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,25.19,6.6,140,0
1,1,54.0,0,0,1,27.32,6.6,80,0
2,0,28.0,0,0,0,27.32,5.7,158,0
3,1,36.0,0,0,2,23.45,5.0,155,0
4,0,76.0,1,1,2,20.14,4.8,155,0
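
Integer codes such as 0 to 5 imply an ordering (never < No Info < current, and so on) that the smoking categories do not really have, and a linear model will treat the codes as magnitudes. One common alternative is one-hot encoding. The sketch below is not what this notebook does; it uses a small illustrative frame (`sample_df`, an assumption) to show the idea with `pd.get_dummies`.

```python
import pandas as pd

# Small illustrative frame using the category labels found in the dataset.
sample_df = pd.DataFrame({
    "smoking_history": ["never", "No Info", "current", "former", "ever", "not current"],
    "age": [80.0, 54.0, 28.0, 36.0, 76.0, 20.0],
})

# One indicator column per category; the other columns pass through unchanged.
encoded_df = pd.get_dummies(sample_df, columns=["smoking_history"], dtype=int)
print(encoded_df.columns.tolist())
```

Each row then has exactly one indicator set to 1, so no category is treated as larger or smaller than another.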


In [14]:
# Convert BMI to numeric categories according to the following ranges
## Underweight: < 18.5 
## Healthy Weight: 18.5 to 24.9 
## Overweight: 25.0 to 29.9 
## Obese: >= 30.0 

# Function to categorize BMI with numeric values
def categorize_bmi_numeric(bmi):
    if bmi < 18.5:
        return 0  # Underweight
    elif bmi < 25.0:
        return 1  # Healthy Weight
    elif bmi < 30.0:
        return 2  # Overweight
    else:
        return 3  # Obese

# Apply the function to the BMI column
dm_prediction_df['bmi'] = dm_prediction_df['bmi'].apply(categorize_bmi_numeric)
dm_prediction_df.head()


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,6.6,140,0
1,1,54.0,0,0,1,2,6.6,80,0
2,0,28.0,0,0,0,2,5.7,158,0
3,1,36.0,0,0,2,1,5.0,155,0
4,0,76.0,1,1,2,1,4.8,155,0
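
The same binning can be expressed without a row-by-row function. The sketch below uses `pd.cut` on a few illustrative values and assumes the column has already been cast to float; the `bmi_bins` name is an assumption made for this example.

```python
import numpy as np
import pandas as pd

bmi = pd.Series([10.16, 18.5, 24.99, 25.0, 29.9, 30.0, 71.55])

# Bin edges match the ranges listed in the cell; right=False makes each
# interval closed on the left, e.g. [18.5, 25.0) for Healthy Weight.
bmi_bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]

# labels=False returns the integer bin index (0 to 3) instead of interval labels.
bmi_category = pd.cut(bmi, bins=bmi_bins, right=False, labels=False)
print(bmi_category.tolist())
```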


In [15]:
# Convert HbA1c level to numeric categories according to the following ranges
## Normal: < 5.7% 
## PreDiabetes: 5.7% to 6.4% 
## Diagnosis of Diabetes: >= 6.5% 

# Function to categorize HbA1c
def categorize_hba1c(hba1c):
    if hba1c < 5.7:
        return 0  # Normal
    elif hba1c < 6.5:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the HbA1c column
dm_prediction_df['hba1c_level'] = dm_prediction_df['hba1c_level'].apply(categorize_hba1c)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,2,140,0
1,1,54.0,0,0,1,2,2,80,0
2,0,28.0,0,0,0,2,0,158,0
3,1,36.0,0,0,2,1,0,155,0
4,0,76.0,1,1,2,1,0,155,0


In [16]:
# Convert Blood Glucose level to numeric categories according to the following ranges
## Normal: 99 mg/dL or below
## Prediabetes: 100–125 mg/dL
## Diabetes: 126 mg/dL or above

# Function to categorize Blood Glucose levels
def categorize_blood_glucose(blood_glucose):
    if blood_glucose <= 99:
        return 0  # Normal
    elif blood_glucose <= 125:
        return 1  # Prediabetes
    else:
        return 2  # Diabetes

# Apply the function to the Blood Glucose column
dm_prediction_df['blood_glucose_level'] = dm_prediction_df['blood_glucose_level'].apply(categorize_blood_glucose)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes
0,1,80.0,0,1,0,2,2,2,0
1,1,54.0,0,0,1,2,2,0,0
2,0,28.0,0,0,0,2,0,2,0
3,1,36.0,0,0,2,1,0,2,0
4,0,76.0,1,1,2,1,0,2,0
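
The HbA1c and blood glucose rules differ only in which side of each threshold is inclusive. As a rough sketch, `np.digitize` can express both; the sample arrays below are illustrative values, not rows from the dataset.

```python
import numpy as np

hba1c = np.array([4.8, 5.7, 6.4, 6.5, 9.0])
glucose = np.array([80, 99, 100, 125, 126, 200])

# right=False (the default): bins[i-1] <= x < bins[i],
# so 5.7 and 6.5 each start a new category.
hba1c_category = np.digitize(hba1c, bins=[5.7, 6.5])

# right=True: bins[i-1] < x <= bins[i],
# so 99 and 125 are the last values of their categories.
glucose_category = np.digitize(glucose, bins=[99, 125], right=True)

print(hba1c_category.tolist())
print(glucose_category.tolist())
```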


In [17]:
# Change diabetes column name to diabetes_status
dm_prediction_df.rename(columns={'diabetes': 'diabetes_status'}, inplace=True)
dm_prediction_df.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level,diabetes_status
0,1,80.0,0,1,0,2,2,2,0
1,1,54.0,0,0,1,2,2,0,0
2,0,28.0,0,0,0,2,0,2,0
3,1,36.0,0,0,2,1,0,2,0
4,0,76.0,1,1,2,1,0,2,0


# Split the Data into Training and Testing Sets
## Create the labels set (y) from the “diabetes_status” column, and then create the features (X) DataFrame from the remaining columns.

In [18]:
# Separate the data into labels and features
# Separate the y variable, the labels
y = dm_prediction_df['diabetes_status']

# Separate the X variable, the features
X = dm_prediction_df.drop(columns=['diabetes_status'])

In [19]:
# Review the y variable Series
display(y.head())
display(y.tail())

0    0
1    0
2    0
3    0
4    0
Name: diabetes_status, dtype: int64

99995    0
99996    0
99997    0
99998    0
99999    0
Name: diabetes_status, dtype: int64

In [20]:
# Review the X variable DataFrame
display(X.head())
display(X.tail())

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level
0,1,80.0,0,1,0,2,2,2
1,1,54.0,0,0,1,2,2,0
2,0,28.0,0,0,0,2,0,2
3,1,36.0,0,0,2,1,0,2
4,0,76.0,1,1,2,1,0,2


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hba1c_level,blood_glucose_level
99995,1,80.0,0,0,1,2,1,0
99996,1,2.0,0,0,1,0,2,1
99997,0,66.0,0,0,3,2,0,2
99998,1,24.0,0,0,0,3,0,1
99999,1,57.0,0,0,2,1,2,0


## Split the data into training and testing datasets by using train_test_split.

In [21]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split. Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    random_state=1, 
                                                    stratify=y)
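
Passing `stratify=y` keeps the share of positive labels nearly the same in the training and testing sets, which matters when one class is rare. The sketch below illustrates this on synthetic labels with roughly 8.5% positives; `X_demo` and `y_demo` are stand-ins invented for this example, not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the diabetes_status column.
rng = np.random.default_rng(1)
y_demo = pd.Series((rng.random(10_000) < 0.085).astype(int), name="diabetes_status")
X_demo = pd.DataFrame({"feature": rng.normal(size=10_000)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# With stratify, the positive rate is nearly identical in both splits.
print(f"train positive rate: {y_tr.mean():.4f}")
print(f"test positive rate:  {y_te.mean():.4f}")
print(f"test share of rows:  {len(y_te) / len(y_demo):.2f}")  # default test_size is 0.25
```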

# Create a Logistic Regression Model with the Original Data
## Fit a logistic regression model by using the training data (X_train and y_train).

In [22]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model. Assign a random_state parameter of 1 to the model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using training data
logistic_regression_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
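
The warning means the solver stopped at its iteration limit before converging. It names two remedies: scale the features or raise `max_iter`. The sketch below combines both in a pipeline, using synthetic data from `make_classification` as a stand-in for the diabetes features. The value `max_iter=1000` is an arbitrary choice for illustration, and results on the real data would differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 feature columns and imbalanced classes.
X_demo, y_demo = make_classification(
    n_samples=2_000, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Standardize each feature, then fit logistic regression with a higher iteration cap.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=1),
)
scaled_model.fit(X_tr, y_tr)
print(scaled_model.score(X_te, y_te))
```

Because the scaler sits inside the pipeline, it is fitted on the training data only and then reused when predicting on the test data.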


## Save the predictions on the testing data labels by using the testing feature data (X_test) and the fitted model.

In [23]:
# Make a prediction using the testing data
testing_predictions = logistic_regression_model.predict(X_test)

## Evaluate the model’s performance by generating a confusion matrix and a classification report.

In [24]:
# Generate a confusion matrix for the model
test_matrix = confusion_matrix(y_test, testing_predictions)
print(test_matrix)

[[22445   430]
 [ 1386   739]]
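
In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, so the four cells are true negatives, false positives, false negatives, and true positives. As a worked check, the metrics for class 1 can be recomputed from the printed counts:

```python
# Counts from the confusion matrix printed for the test set.
tn, fp = 22445, 430
fn, tp = 1386, 739

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted diabetes cases, the share that were correct
recall = tp / (tp + fn)     # of the actual diabetes cases, the share that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.93 precision=0.63 recall=0.35 f1=0.45
```

The 1386 false negatives are what hold recall down: the model misses roughly two thirds of the 2125 actual diabetes cases in the test set.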


In [25]:
# Generate the classification report for the model
testing_report = classification_report(y_test, testing_predictions)

# Print the testing classification report
print(testing_report)

              precision    recall  f1-score   support

           0       0.94      0.98      0.96     22875
           1       0.63      0.35      0.45      2125

    accuracy                           0.93     25000
   macro avg       0.79      0.66      0.70     25000
weighted avg       0.92      0.93      0.92     25000
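
The report shows high accuracy but a recall of only 0.35 for class 1, a common pattern when the positive class is rare. One lever is to lower the decision threshold applied to `predict_proba`, trading precision for recall. The sketch below illustrates the effect on synthetic data; the 0.3 threshold is an arbitrary value chosen for this example, and none of this reflects a tuned result for the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=5_000, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

model = LogisticRegression(max_iter=1000, random_state=1).fit(X_tr, y_tr)

# Probability of the positive class (column 1) for each test row.
positive_proba = model.predict_proba(X_te)[:, 1]

# Compare the usual 0.5 cut-off with a lower, arbitrarily chosen one.
for threshold in (0.5, 0.3):
    predictions = (positive_proba >= threshold).astype(int)
    print(threshold, recall_score(y_te, predictions))
```

A lower threshold can only add positive predictions, so recall for class 1 stays the same or rises, while precision usually falls.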

