# **Logistic Regression: Predicting Diabetes Diagnosis**

## Scenario
You are a data scientist at a healthcare organisation aiming to develop a predictive model to identify individuals at risk of diabetes. This model will help clinicians prioritise patients for early interventions and preventive care, reducing the burden on healthcare systems and improving patient outcomes.

## Dataset Overview
The dataset includes the following columns:
- `gender`: Gender of the individual (`1` for male, `0` for female).
- `age`: Age of the individual (in years).
- `hypertension`: Whether the individual has hypertension (`1` for yes, `0` for no).
- `heart_disease`: Whether the individual has heart disease (`1` for yes, `0` for no).
- `bmi`: Body Mass Index (a measure of body fat based on height and weight).
- `HbA1c_level`: Hemoglobin A1c level (a marker of average blood glucose over the past 3 months, expressed as a percentage).
- `blood_glucose_level`: Current blood glucose level (measured in mg/dL).
- `diabetes`: Whether the individual has diabetes (`1` for yes, `0` for no). This is the target variable.
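
A quick sanity check on these columns can catch encoding problems before any modelling. The sketch below assumes the lowercase column names that the CSV itself uses, as the `data.head()` output in Step 2 shows. The `sample` frame is a hypothetical stand-in built from those five printed rows, and `check_schema` is an illustrative helper that is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: the five rows printed in Step 2.
sample = pd.DataFrame({
    'gender': [0, 0, 1, 0, 1],
    'age': [80, 54, 28, 36, 76],
    'hypertension': [0, 0, 0, 0, 1],
    'heart_disease': [1, 0, 0, 0, 1],
    'bmi': [25.19, 27.32, 27.32, 23.45, 20.14],
    'HbA1c_level': [6.6, 6.6, 5.7, 5.0, 4.8],
    'blood_glucose_level': [140, 80, 158, 155, 155],
    'diabetes': [0, 0, 0, 0, 0],
})

BINARY_COLUMNS = ['gender', 'hypertension', 'heart_disease', 'diabetes']
NUMERIC_COLUMNS = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']


def check_schema(df):
    """Return a list of human-readable problems found in df (empty if none)."""
    problems = []
    # Every expected column must be present.
    for col in BINARY_COLUMNS + NUMERIC_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    # Flag columns may only contain 0 or 1.
    for col in BINARY_COLUMNS:
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0/1")
    # Measurements must not contain missing values.
    for col in NUMERIC_COLUMNS:
        if col in df.columns and df[col].isna().any():
            problems.append(f"{col} has missing values")
    return problems


print(check_schema(sample))
```

Running the same helper on the full `data` frame after loading it would report any column that does not follow the encoding described above.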

## Your Challenge
Build a logistic regression model to predict whether an individual has diabetes (the `diabetes` column).

## Important Instructions
- Split the dataset into training (70%) and testing (30%) subsets.
- Set the `random_state` parameter to `42` to ensure reproducibility.

## Step 1: Import Libraries

In [95]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # For splitting the dataset
from sklearn.linear_model import LogisticRegression  # Logistic regression model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # Evaluation metrics

## Step 2: Load the Dataset

In [97]:
# Load the dataset
file_path = 'diabetes_prediction.csv'
data = pd.read_csv(file_path)

# Display the first few rows
print(data.head())


   gender  age  hypertension  heart_disease    bmi  HbA1c_level  \
0       0   80             0              1  25.19          6.6   
1       0   54             0              0  27.32          6.6   
2       1   28             0              0  27.32          5.7   
3       0   36             0              0  23.45          5.0   
4       1   76             1              1  20.14          4.8   

   blood_glucose_level  diabetes  
0                  140         0  
1                   80         0  
2                  158         0  
3                  155         0  
4                  155         0  


## Step 3: Data Preparation

In [99]:
# Separate input features and target variable
X = data[['gender', 'age', 'hypertension', 'heart_disease', 'bmi', 'HbA1c_level', 'blood_glucose_level']]  # Input variables
y = data['diabetes']  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
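
The split above is purely random. Because only about 10% of the test rows are positive (2,512 of 24,831 in the Step 5 report), one optional variation is a stratified split, which keeps the class proportions the same in both subsets. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`) rather than the real data. The exercise itself only asks for `test_size=0.3` and `random_state=42`, and adding `stratify` to the real split would change the results reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, of which 10% belong to the positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy preserves the 10% positive rate in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("Train size:", len(y_tr), "positives:", int(y_tr.sum()))
print("Test size:", len(y_te), "positives:", int(y_te.sum()))
```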


## Step 4: Train the Model

In [101]:
# Initialise and train the logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Display coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Coefficients: [[0.25786638 0.04614606 0.79375316 0.76579868 0.09333817 2.31777501
  0.03266737]]
Intercept: [-27.2448112]
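
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies the printed coefficients by hand (so the values are rounded) and assumes they are in the same order as the feature columns in Step 3. The names `feature_names`, `coefficients` and `odds_ratios` are illustrative.

```python
import math

# Coefficients copied from the output above, in feature-column order.
feature_names = ['gender', 'age', 'hypertension', 'heart_disease',
                 'bmi', 'HbA1c_level', 'blood_glucose_level']
coefficients = [0.25786638, 0.04614606, 0.79375316, 0.76579868,
                0.09333817, 2.31777501, 0.03266737]

# exp(coefficient) is the multiplicative change in the odds of diabetes
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(c) for name, c in zip(feature_names, coefficients)}

for name, ratio in odds_ratios.items():
    print(f"{name:>20}: {ratio:.2f}")
```

Read this way, a one-point rise in `HbA1c_level` multiplies the odds by roughly 10, while one extra year of `age` multiplies them by about 1.05. The features are on different scales, so the raw magnitudes are not directly comparable, and scikit-learn's `LogisticRegression` applies L2 regularisation by default, so these are regularised estimates.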


## Step 5: Evaluate the Model

In [103]:
# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Confusion matrix and classification report
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.95
Confusion Matrix:
 [[22099   220]
 [  965  1547]]
Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.97     22319
           1       0.88      0.62      0.72      2512

    accuracy                           0.95     24831
   macro avg       0.92      0.80      0.85     24831
weighted avg       0.95      0.95      0.95     24831
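
The figures in the report can be reproduced by hand from the confusion matrix, which also makes the effect of class imbalance visible. The sketch below copies the four counts from the output above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 22099, 220
fn, tp = 965, 1547

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting class 0, the majority class.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={baseline_accuracy:.2f}")
```

A model that always predicts `0` would already reach about 0.90 accuracy on this test set, so the 0.95 figure is less impressive than it looks. For a screening use case, the recall of 0.62 on class `1` (965 of 2,512 positive cases missed) is arguably the more informative number.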



### **Question - Predicting Diabetes Diagnosis**
You are provided with the following data for an individual:

- **Gender**: Male (`1`)
- **Age**: 69 years
- **Hypertension**: No (`0`)
- **Heart Disease**: No (`0`)
- **BMI**: 28
- **HbA1c Level**: 6.5
- **Blood Glucose Level**: 200 mg/dL

## Question
Based on the provided characteristics, **predict whether this individual has diabetes (`1`) or not (`0`).**

In [106]:
# New individual to score; the column names must match the training features
new_data = pd.DataFrame({
    'gender': [1],
    'age': [69],
    'hypertension': [0],
    'heart_disease': [0],
    'bmi': [28],
    'HbA1c_level': [6.5],
    'blood_glucose_level': [200]
})
new_data


   gender  age  hypertension  heart_disease  bmi  HbA1c_level  blood_glucose_level
0       1   69             0              0   28          6.5                  200


In [107]:
# Predict the diabetes status for the new data
predicted_classes = model.predict(new_data)
print("Predicted Classes:", predicted_classes)

Predicted Classes: [1]
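
To see where this prediction comes from, the linear score and probability can be recomputed by hand. The sketch below copies the intercept and coefficients printed in Step 4 (rounded, so the result is approximate) and uses the feature values passed to `model.predict` above. The names `intercept`, `coefficients` and `individual` are illustrative.

```python
import math

# Intercept and coefficients copied from the Step 4 output.
intercept = -27.2448112
coefficients = {
    'gender': 0.25786638,
    'age': 0.04614606,
    'hypertension': 0.79375316,
    'heart_disease': 0.76579868,
    'bmi': 0.09333817,
    'HbA1c_level': 2.31777501,
    'blood_glucose_level': 0.03266737,
}

# Feature values from the new_data frame scored above.
individual = {
    'gender': 1,
    'age': 69,
    'hypertension': 0,
    'heart_disease': 0,
    'bmi': 28,
    'HbA1c_level': 6.5,
    'blood_glucose_level': 200,
}

# Linear score (log-odds), then the sigmoid to turn it into a probability.
logit = intercept + sum(coefficients[k] * individual[k] for k in coefficients)
probability = 1 / (1 + math.exp(-logit))
predicted_class = int(probability >= 0.5)

print(f"logit={logit:.3f} probability={probability:.3f} class={predicted_class}")
```

The hand-computed probability comes out at roughly 0.60, only modestly above the 0.5 threshold that `predict` uses, so this is a borderline case rather than a confident one. With the fitted model available, `model.predict_proba(new_data)` returns the same quantity without the rounding.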
