In [3]:
import urllib.request, zipfile
import pandas as pd

# Download the zip file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip'
urllib.request.urlretrieve(url, 'diabetes.zip')

# Extract it
with zipfile.ZipFile('diabetes.zip', 'r') as z:
    z.extractall()



In [6]:
import urllib.request
import zipfile
import pandas as pd
import os

# Step 1: Download the zip file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip'
urllib.request.urlretrieve(url, 'dataset_diabetes.zip')

# Step 2: Extract the zip file
with zipfile.ZipFile('dataset_diabetes.zip', 'r') as zip_ref:
    zip_ref.extractall('diabetes_data')  # extract into folder

# Step 3: List the files to find the correct CSV filename
print("Extracted files:", os.listdir('diabetes_data'))




Extracted files: ['dataset_diabetes']


In [9]:
ls

dataset_diabetes/     diabetes_data/  sample_data/
dataset_diabetes.zip  diabetes.zip


In [10]:
import pandas as pd

# Load the CSV from the correct extracted folder
df = pd.read_csv('dataset_diabetes/diabetic_data.csv')

# Confirm it's loaded
print(df.shape)
df.head()


(101766, 50)


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,6,25,1,1,...,No,No,No,No,No,No,No,No,No,NO
1,149190,55629189,Caucasian,Female,[10-20),?,1,1,7,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
2,64410,86047875,AfricanAmerican,Female,[20-30),?,1,1,7,2,...,No,No,No,No,No,No,No,No,Yes,NO
3,500364,82442376,Caucasian,Male,[30-40),?,1,1,7,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
4,16680,42519267,Caucasian,Male,[40-50),?,1,1,7,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO


In [11]:
# Check all columns and their data types
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101766 entries, 0 to 101765
Data columns (total 50 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   encounter_id              101766 non-null  int64 
 1   patient_nbr               101766 non-null  int64 
 2   race                      101766 non-null  object
 3   gender                    101766 non-null  object
 4   age                       101766 non-null  object
 5   weight                    101766 non-null  object
 6   admission_type_id         101766 non-null  int64 
 7   discharge_disposition_id  101766 non-null  int64 
 8   admission_source_id       101766 non-null  int64 
 9   time_in_hospital          101766 non-null  int64 
 10  payer_code                101766 non-null  object
 11  medical_specialty         101766 non-null  object
 12  num_lab_procedures        101766 non-null  int64 
 13  num_procedures            101766 non-null  int64 
 14  num_

In [12]:
# Find columns with missing or unknown values.
# The UCI dataset marks missing data with '?' rather than NaN, so count
# '?', empty strings, and 'Unknown' in every column.
missing_summary = df.isin(['?', '', 'Unknown']).sum()
missing_summary = missing_summary[missing_summary > 0].sort_values(ascending=False)
missing_summary
# The result shows which columns have ambiguous or missing values,
# and therefore which ones we may want to drop, impute, or encode in the next step.

weight,98569
medical_specialty,49949
payer_code,40256
race,2273
diag_3,1423
diag_2,358
diag_1,21
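
Because the placeholder is a known string, it can also be converted to NaN while the file is being read. The sketch below uses the `na_values` parameter of `pd.read_csv` on a small in-memory CSV; the rows are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Tiny made-up sample that mimics the '?' placeholder (not the real data)
sample_csv = io.StringIO(
    "race,weight,diag_1\n"
    "Caucasian,?,250.83\n"
    "?,?,276\n"
    "AfricanAmerican,[75-100),?\n"
)

# na_values='?' tells pandas to treat '?' as missing while parsing
sample = pd.read_csv(sample_csv, na_values='?')

# Count missing values per column
missing_counts = sample.isnull().sum()
print(missing_counts)
```

With this approach `isnull()` reports the placeholders directly, so no separate `replace('?', np.nan)` step is needed afterwards.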


In [13]:
import numpy as np

# Drop 'weight', 'payer_code', 'medical_specialty' due to high missingness
df_cleaned = df.drop(['weight', 'payer_code', 'medical_specialty'], axis=1)

# Replace '?' with np.nan for clarity
df_cleaned = df_cleaned.replace('?', np.nan)

# Impute missing 'race' with mode (most common value).
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which raises a FutureWarning and stops working in pandas 3.0.
df_cleaned['race'] = df_cleaned['race'].fillna(df_cleaned['race'].mode()[0])

# Impute diagnosis columns with 'Unknown'
for col in ['diag_1', 'diag_2', 'diag_3']:
    df_cleaned[col] = df_cleaned[col].fillna('Unknown')

# Count the missing values that remain; a nonzero total means
# other columns still contain NaN
df_cleaned.isnull().sum().sum()


np.int64(181168)
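
The total is not zero, so some columns other than `race` and the diagnosis columns still hold NaN. A per-column count shows where they are. The sketch below runs on a made-up frame with hypothetical column names (`lab_a`, `lab_b`); on the real data the same two lines would be applied to `df_cleaned`. One plausible cause, stated here as an assumption rather than something the output above confirms, is that recent pandas versions parse the literal string `None` as missing in `read_csv`, which affects columns that use `None` as a category.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for df_cleaned (hypothetical columns and values)
demo = pd.DataFrame({
    'race': ['Caucasian', 'Asian', 'Caucasian'],
    'lab_a': [np.nan, '>8', np.nan],
    'lab_b': [np.nan, np.nan, np.nan],
})

# Per-column NaN counts, keeping only columns that still have gaps
remaining = demo.isnull().sum()
remaining = remaining[remaining > 0].sort_values(ascending=False)
print(remaining)
```

If the cause is confirmed, passing `keep_default_na=False` together with an explicit `na_values` list to `pd.read_csv` keeps such strings as ordinary categories.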

In [14]:
# Map target values
df_cleaned['readmit_30'] = df_cleaned['readmitted'].apply(lambda x: 1 if x == '<30' else 0)

# Drop original 'readmitted' column
df_cleaned.drop('readmitted', axis=1, inplace=True)

# Check new class distribution
df_cleaned['readmit_30'].value_counts(normalize=True)


readmit_30,proportion
0,0.888401
1,0.111599


Created a binary target column readmit_30, where:

<30 → 1 (readmitted within 30 days, the positive class)

NO and >30 → 0 (negative class)

CLASS DISTRIBUTION SUMMARY:


| Class | Meaning                                    | Proportion |
| ----- | ------------------------------------------ | ---------- |
| 0     | Not readmitted within 30 days (NO or >30)  | **88.8%**  |
| 1     | Readmitted within 30 days (<30)            | **11.2%**  |


This is a classic imbalanced classification problem. We'll handle it later with class weights during modeling.
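
To make the idea of class weights concrete, the sketch below computes scikit-learn's `'balanced'` weights for a synthetic label vector with roughly the same 89/11 split. The labels are made up; the formula, n_samples / (n_classes × class count), is the one scikit-learn documents for `class_weight='balanced'`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with roughly the same 89/11 split as readmit_30
y_demo = np.array([0] * 888 + [1] * 112)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_demo
)
class_weights = dict(zip([0, 1], weights))
print(class_weights)
```

The minority class receives a weight about eight times larger than the majority class, so each missed readmission costs the model correspondingly more during training.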


In [15]:
# Drop IDs — not useful for prediction
df_model = df_cleaned.drop(['encounter_id', 'patient_nbr'], axis=1)


In [16]:
# Get column types
num_cols = df_model.select_dtypes(include=['int64', 'float64']).columns.tolist()
cat_cols = df_model.select_dtypes(include='object').columns.tolist()

# Exclude the label from feature lists
num_cols.remove('readmit_30')

# Output feature counts
print("Numerical features:", len(num_cols))
print("Categorical features:", len(cat_cols))


Numerical features: 11
Categorical features: 33


11 numerical features → e.g., time_in_hospital, num_lab_procedures, etc.

33 categorical features → many of these are drug-related (e.g., metformin, insulin), and string-type codes (like diag_1)
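
The cells that create `X_train`, `X_test`, `y_train`, and `y_test` are not shown in this notebook export. A minimal sketch of how such a split might be produced follows, assuming a stratified 80/20 split with `random_state=42` (both are assumptions) and using a made-up frame in place of `df_model`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for df_model (hypothetical values, 10% positives)
demo_model = pd.DataFrame({
    'time_in_hospital': list(range(100)),
    'gender': ['Female', 'Male'] * 50,
    'readmit_30': [1] * 10 + [0] * 90,
})

# Separate features from the target
X = demo_model.drop('readmit_30', axis=1)
y = demo_model['readmit_30']

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape, y_test.mean())
```

Stratifying matters with an imbalanced target: without it, the share of positive cases in the test set can drift away from the share in the full data.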

In [19]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)
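
The `handle_unknown='ignore'` setting decides what happens when the test set contains a category that was absent during fitting. The small sketch below, on made-up values, shows that such a category is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories (made-up values)
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(np.array([['No'], ['Steady'], ['Up']]))

# 'Down' was never seen during fit, so it encodes to an all-zero row
encoded = encoder.transform(np.array([['Steady'], ['Down']])).toarray()
print(encoded)
```

This is useful for high-cardinality columns such as the diagnosis codes, where rare values may appear only in the test split.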



In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Create the pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train the model
logreg_pipeline.fit(X_train, y_train)

# Predict on test set
y_pred_logreg = logreg_pipeline.predict(X_test)

# Evaluate performance
conf_matrix = confusion_matrix(y_test, y_pred_logreg)
print("Confusion Matrix:\n", conf_matrix)

# Classification report
class_report = classification_report(y_test, y_pred_logreg, output_dict=True)
print("\nClassification Report:\n", pd.DataFrame(class_report).transpose())


Confusion Matrix:
 [[18037    46]
 [ 2226    45]]

Classification Report:
               precision    recall  f1-score       support
0              0.890145  0.997456  0.940750  18083.000000
1              0.494505  0.019815  0.038103   2271.000000
accuracy       0.888376  0.888376  0.888376      0.888376
macro avg      0.692325  0.508636  0.489427  20354.000000
weighted avg   0.846001  0.888376  0.840037  20354.000000


KEY OBSERVATIONS:


| Metric                  | Value                               | Interpretation                                                 |
| ----------------------- | ----------------------------------- | -------------------------------------------------------------- |
| **Accuracy**            | 88.8%                               | High, but misleading due to imbalance.                         |
| **Recall (Class 1)**    | 1.98%                               | Very low: the model misses most readmissions.                  |
| **Precision (Class 1)** | 49.5%                               | When it predicts readmission, it is right about half the time. |
| **Class Imbalance**     | Strong imbalance (class `1` ≈ 11%)  | Model biased toward predicting `0` (no readmission).           |



Conclusion: The model is heavily biased toward the majority class. Although accuracy looks good, recall for patients who will be readmitted is extremely low, which is dangerous in a hospital setting.
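
The class 1 figures can be rederived from the logistic regression confusion matrix printed earlier, which makes it clear why accuracy and recall diverge. The sketch below only redoes that arithmetic; the variable names are illustrative.

```python
import numpy as np

# Logistic regression confusion matrix from the output above:
# rows are actual classes, columns are predicted classes
cm = np.array([[18037, 46],
               [2226, 45]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of actual readmissions that were caught
precision = tp / (tp + fp)       # share of predicted readmissions that were right
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right

print(f"recall={recall:.4f} precision={precision:.4f} accuracy={accuracy:.4f}")
# → recall=0.0198 precision=0.4945 accuracy=0.8884
```

Accuracy is dominated by the 18,037 true negatives, while recall depends only on the 2,271 actual readmissions, of which just 45 were flagged.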


To see whether a different model improves performance, especially recall for class 1, we next train a Random Forest classifier.

In [21]:
from sklearn.ensemble import RandomForestClassifier

# Create a pipeline with the same preprocessor and a Random Forest classifier
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train the Random Forest model
rf_pipeline.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf_pipeline.predict(X_test)

# Evaluate the model
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd

conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
class_report_rf = classification_report(y_test, y_pred_rf, output_dict=True)

print("Confusion Matrix:\n", conf_matrix_rf)
print("\nClassification Report:\n", pd.DataFrame(class_report_rf).transpose())


Confusion Matrix:
 [[18078     5]
 [ 2257    14]]

Classification Report:
               precision    recall  f1-score       support
0              0.889009  0.999723  0.941121  18083.000000
1              0.736842  0.006165  0.012227   2271.000000
accuracy       0.888867  0.888867  0.888867      0.888867
macro avg      0.812926  0.502944  0.476674  20354.000000
weighted avg   0.872031  0.888867  0.837480  20354.000000


| Metric         | Logistic Regression | Random Forest | Notes                                                       |
| -------------- | ------------------- | ------------- | ----------------------------------------------------------- |
| Accuracy       | 88.8%               | 88.9%         | Similar                                                     |
| **Recall (1)** | **1.98%**           | **0.6%**      | Lower; the model still misses most readmissions             |
| Precision (1)  | 49.5%               | 73.7%         | Higher precision, but only 19 positives predicted           |
| F1-Score (1)   | 3.8%                | 1.2%          | Still low; neither model identifies the minority class well |





Conclusion
Random Forest is even more conservative than Logistic Regression in predicting the positive class (readmitted within 30 days): it flags only 19 test patients and catches 14 of the 2,271 actual readmissions.

It is still not usable in real-world hospital scenarios, where catching readmissions is critical, even at the cost of more false positives.





Recommendation Before Deployment
Before integrating this model into the hospital’s system (Part 2: Deployment), we must improve recall for class 1.

To Improve Recall:
Apply Class Weights to Random Forest or Logistic Regression

In [30]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# 1. Identify column types
# Get all categorical features
categorical_features = df_cleaned.select_dtypes(include=['object']).columns.tolist()

# Remove target if present
if 'readmit_30' in categorical_features:
    categorical_features.remove('readmit_30')

# Get numerical features
numerical_features = df_cleaned.select_dtypes(include=['int64', 'float64']).columns.tolist()

# 2. Define preprocessing
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 3. Build pipeline with weighted Logistic Regression
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

# 4. Fit model
model_pipeline.fit(X_train, y_train)

# 5. Predict and evaluate
y_pred_weighted = model_pipeline.predict(X_test)

print("Confusion Matrix (Weighted):")
print(confusion_matrix(y_test, y_pred_weighted))

print("\nClassification Report (Weighted):")
print(classification_report(y_test, y_pred_weighted))



ValueError: A given column is not a column of the dataframe
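
This error usually means the `ColumnTransformer` was given column names that do not exist in the frame passed to `fit`. Here the feature lists were built from `df_cleaned`, which still contains the ID columns and the target, while `X_train` presumably does not (its construction is not shown). A quick way to find the offending names, sketched on made-up frames:

```python
import pandas as pd

# Made-up stand-ins for df_cleaned and X_train (hypothetical values)
demo_cleaned = pd.DataFrame({
    'encounter_id': [1, 2],
    'time_in_hospital': [3, 5],
    'gender': ['Female', 'Male'],
    'readmit_30': [0, 1],
})
demo_X_train = demo_cleaned.drop(['encounter_id', 'readmit_30'], axis=1)

# Feature list built from the wider frame, as in the failing cell
requested = demo_cleaned.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Names requested by the transformer but absent from the training frame
missing = [col for col in requested if col not in demo_X_train.columns]
print(missing)
```

Building the feature lists from the training frame itself (for example from `X_train.select_dtypes(...)`) avoids this mismatch altogether.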

In [31]:
# Columns to exclude (IDs or non-predictive)
exclude_cols = ['encounter_id', 'patient_nbr', 'readmit_30']  # also exclude target here

# Get categorical features excluding IDs and target
categorical_features = [col for col in df_cleaned.select_dtypes(include=['object']).columns if col not in exclude_cols]

# Get numerical features excluding IDs and target
numerical_features = [col for col in df_cleaned.select_dtypes(include=['int64', 'float64']).columns if col not in exclude_cols]


In [32]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42))
])

model_pipeline.fit(X_train, y_train)

y_pred_weighted = model_pipeline.predict(X_test)

print("Confusion Matrix (Weighted):")
print(confusion_matrix(y_test, y_pred_weighted))

print("\nClassification Report (Weighted):")
print(classification_report(y_test, y_pred_weighted))


Confusion Matrix (Weighted):
[[11882  6201]
 [ 1024  1247]]

Classification Report (Weighted):
              precision    recall  f1-score   support

           0       0.92      0.66      0.77     18083
           1       0.17      0.55      0.26      2271

    accuracy                           0.65     20354
   macro avg       0.54      0.60      0.51     20354
weighted avg       0.84      0.65      0.71     20354



Interpretation of Metrics


| Metric        | Class 0 (No Readmit) | Class 1 (Readmit <30 days) |
| ------------- | -------------------- | -------------------------- |
| **Precision** | 0.92                 | 0.17                       |
| **Recall**    | 0.66                 | 0.55                       |
| **F1-score**  | 0.77                 | 0.26                       |




Key Takeaways:
Recall for the minority class (1) rose from about 2% to 55% (1,247 of 2,271 readmissions caught), a large improvement that comes from class weighting.

Precision for class 1 dropped to 0.17 because false positives rose from 46 to 6,201. This is often acceptable in healthcare, where detecting true positives is more important.

Overall accuracy fell to 65%, but because the classes are imbalanced, recall and F1 for the minority class are the more informative metrics.

Summary of AI Model Development (Hospital Readmission Prediction)
Objective
The goal of this project is to predict whether a diabetic patient will be readmitted to the hospital within 30 days, using the Diabetes 130-US hospitals dataset. Early identification of high-risk patients can help reduce hospital burden and improve patient care.

Workflow Steps Completed
1. Data Loading & Exploration
Dataset: diabetic_data.csv with 101,766 rows and 50 columns.

Target variable: readmitted → Transformed into binary readmit_30 (1 if <30, else 0).

2. Data Cleaning
Dropped high-missing columns: 'weight', 'payer_code', 'medical_specialty'.

Replaced the placeholder value '?' with NaN.

Filled missing values in 'race' with the mode and in 'diag_1', 'diag_2', 'diag_3' with 'Unknown'.

3. Feature Engineering
Dropped identifier columns: 'encounter_id', 'patient_nbr'.

Transformed readmitted to binary target (readmit_30).

Identified:

11 numerical features

33 categorical features

4. Preprocessing
Numerical: StandardScaler

Categorical: OneHotEncoder with handle_unknown='ignore'

Combined via ColumnTransformer and wrapped in a pipeline.

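5. Modeling
Model 1: Logistic Regression (Baseline)
Used default settings apart from max_iter=1000, trained on an 80/20 split.

Recall for class 1 (readmitted) was ~2%, a consequence of the strong class imbalance.

Model 2: Random Forest
Used 100 trees without class weighting. Recall for class 1 fell to 0.6%.

Model 3: Logistic Regression with Class Weighting
Set class_weight='balanced' to address imbalance.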

Results:

| Metric    | Class 0 | Class 1 |
| --------- | ------- | ------- |
| Precision | 0.92    | 0.17    |
| Recall    | 0.66    | 0.55    |
| F1-score  | 0.77    | 0.26    |



Accuracy: 65%, lower than the baseline's 88.8% but with far better minority-class recall.

Key Improvement: Recall for the critical minority class (readmitted within 30 days) increased from 2% to 55%, a major gain in clinical relevance.

Next Step: Model Deployment
