# Predicting 30-Day Patient Readmission Rates: A Data-Driven Approach to Improve Hospital Efficiency


### Can we predict which patients are at risk of being readmitted within 30 days of discharge based on their initial treatment and follow-up care data?  


|Analytic Approach| 
|-----------------|
|Given that our problem is a binary classification problem (predicting whether a patient will be readmitted within 30 days or not), we can use supervised machine-learning techniques. These techniques will allow us to train a model on historical data and then use this model to make predictions on new data.|


|Data Requirements|
|-----------------|
|We would need patient data that includes demographics (age, gender, etc.), medical history, details of the current hospital stay (length of stay, treatments received, etc.), and whether the patient was readmitted within 30 days. Other useful data might include socioeconomic factors, lifestyle factors (smoking, alcohol use, etc.), and details about the patient’s follow-up care.|

|Data Collection|
|---------------|
|This data could be collected from the hospital’s electronic health record (EHR) system. We might also need to integrate data from other sources, such as national health databases or surveys, to gather information on socioeconomic and lifestyle factors.|

|Data Understanding and Preparation|
|----------------------------------|
|This stage involves exploring the data to understand its structure, quality, and distribution. We would also need to clean the data (handle missing values, remove outliers, etc.), transform variables if necessary (e.g., standardizing or normalizing variables), and create new variables (e.g., a binary variable indicating whether a patient was readmitted within 30 days).|
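
The preparation steps described above can be sketched on a tiny, hypothetical extract. The column `days_to_readmission` and all values below are assumptions made for illustration; they are not part of the project dataset.

```python
import pandas as pd

# Hypothetical raw extract; column names and values are illustrative only.
raw = pd.DataFrame({
    "age": [69, 32, None, 78],
    "length_of_stay": [2, 27, 9, 29],
    "days_to_readmission": [12, None, 45, 30],  # None = never readmitted
})

# Impute the missing age with the column median.
raw["age"] = raw["age"].fillna(raw["age"].median())

# Derive the binary target: readmitted within 30 days of discharge.
# A missing value compares as False, so "never readmitted" becomes 0.
raw["readmitted_30_days"] = (raw["days_to_readmission"] <= 30).astype(int)

# Standardize a numeric feature to zero mean and unit variance.
los = raw["length_of_stay"]
raw["length_of_stay_std"] = (los - los.mean()) / los.std(ddof=0)

print(raw[["age", "readmitted_30_days", "length_of_stay_std"]])
```

In a real pipeline, the imputation and scaling statistics would normally be computed on the training split only and then applied to the test split.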

|Modelling and Evaluation|
|------------------------|
|We would split our data into a training set and a test set: the training set is used to fit the model, and the test set is used to evaluate it on unseen data. We could compare several machine-learning models (e.g., logistic regression, decision trees, random forests, gradient boosting) and select the one that performs best on a chosen metric (e.g., accuracy, precision, recall, AUC-ROC). Because readmitted patients are usually the minority class, accuracy alone can be misleading, so recall and AUC-ROC deserve particular attention. Cross-validation on the training data gives a more reliable estimate of how well the model generalizes to new data.|

# Dataset

### Based on our project requirements, the dataset should include:

- Demographics: age, gender
- Medical History: previous conditions, number of previous admissions
- Current Hospital Stay: length of stay, treatments received, cost
- Socioeconomic Factors: income level, employment status
- Lifestyle Factors: smoking, alcohol use
- Follow-Up Care: number of follow-up appointments, compliance with medication
- Target Variable: readmitted within 30 days (binary)

### Data Collection

In [4]:
import pandas as pd

data = pd.read_csv('patient_data.csv')
print(data.head())

   age  gender  previous_conditions  previous_admissions  length_of_stay  \
0   69    Male                    0                    0               2   
1   32    Male                    3                    3              27   
2   89  Female                    1                    1               9   
3   78    Male                    0                    1              29   
4   38    Male                    2                    2               9   

   treatment_cost income_level employment_status  smoking  alcohol_use  \
0     3817.318619         High          Employed        0            0   
1    48779.485107          Low        Unemployed        1            1   
2    40940.208410         High          Employed        1            1   
3    42701.097352       Medium        Unemployed        0            0   
4    46929.278956          Low          Employed        0            0   

   follow_up_appointments  medication_compliance  readmitted_30_days  
0                       3  

### Load and Explore the Dataset

In [5]:
print(data.info())
print(data.describe())
print(data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   age                     1000 non-null   int64  
 1   gender                  1000 non-null   object 
 2   previous_conditions     1000 non-null   int64  
 3   previous_admissions     1000 non-null   int64  
 4   length_of_stay          1000 non-null   int64  
 5   treatment_cost          1000 non-null   float64
 6   income_level            1000 non-null   object 
 7   employment_status       1000 non-null   object 
 8   smoking                 1000 non-null   int64  
 9   alcohol_use             1000 non-null   int64  
 10  follow_up_appointments  1000 non-null   int64  
 11  medication_compliance   1000 non-null   int64  
 12  readmitted_30_days      1000 non-null   int64  
dtypes: float64(1), int64(9), object(3)
memory usage: 101.7+ KB
None
               age  previous_c
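
Alongside `info()` and `describe()`, it can help to check how balanced the target is before modelling. The sketch below uses a toy stand-in for `data['readmitted_30_days']`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for data['readmitted_30_days']; the values are illustrative.
target = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="readmitted_30_days")

# Share of each class. The majority share equals the accuracy of a model
# that always predicts the majority class.
class_share = target.value_counts(normalize=True)
print(class_share)

majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model we train later should beat this baseline accuracy to be considered useful.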

### Encode Categorical Variables

In [8]:
data = pd.get_dummies(data, columns=['gender', 'income_level', 'employment_status'], drop_first=True)
print(data.head())

   age  previous_conditions  previous_admissions  length_of_stay  \
0   69                    0                    0               2   
1   32                    3                    3              27   
2   89                    1                    1               9   
3   78                    0                    1              29   
4   38                    2                    2               9   

   treatment_cost  smoking  alcohol_use  follow_up_appointments  \
0     3817.318619        0            0                       3   
1    48779.485107        1            1                       0   
2    40940.208410        1            1                       4   
3    42701.097352        0            0                       1   
4    46929.278956        0            0                       4   

   medication_compliance  readmitted_30_days  gender_Male  income_level_Low  \
0                      1                   1            1                 0   
1                      1      
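
Since `income_level` has a natural order (Low, Medium, High), an ordinal encoding is a possible alternative to one-hot encoding. The sketch below shows both on a toy frame; the integer mapping is an assumption chosen for illustration, not the encoding used in this notebook.

```python
import pandas as pd

# Toy frame using the income_level categories seen in the output above.
df = pd.DataFrame({"income_level": ["High", "Low", "High", "Medium", "Low"]})

# Map the ordered categories to integers so the ordering is preserved.
income_order = {"Low": 0, "Medium": 1, "High": 2}  # assumed ordering
df["income_level_ord"] = df["income_level"].map(income_order)

# One-hot alternative. dtype=int keeps 0/1 columns on pandas versions
# where get_dummies would otherwise return booleans.
one_hot = pd.get_dummies(
    df["income_level"], prefix="income_level", drop_first=True, dtype=int
)
print(df.join(one_hot))
```

With `drop_first=True`, the alphabetically first category (`High`) is dropped and becomes the reference level.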

### Split the Data into Training and Test Sets

In [9]:
from sklearn.model_selection import train_test_split

X = data.drop(columns=['readmitted_30_days'])
y = data['readmitted_30_days']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(800, 14) (200, 14) (800,) (200,)
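
With an imbalanced target, a purely random split can leave the training and test sets with different readmission rates. A minimal sketch of a stratified split follows, using synthetic stand-in data; the array names and the positive rate of about 24% are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with roughly 24% positives (illustrative).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.24).astype(int)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall: {y_demo.mean():.3f}")
print(f"Positive rate train:   {y_tr.mean():.3f}")
print(f"Positive rate test:    {y_te.mean():.3f}")
```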


### Train and Evaluate Machine Learning Models

**Logistic Regression**

In [14]:
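from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Fit a logistic regression baseline with default settings
model = LogisticRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # hard 0/1 labels at the default 0.5 threshold

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
# Note: AUC-ROC is computed from hard labels here, so it reflects a single threshold;
# model.predict_proba(X_test)[:, 1] would score the full ranking instead.
roc_auc = roc_auc_score(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'AUC-ROC: {roc_auc}')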

Accuracy: 0.765
Precision: 0.0
Recall: 0.0
AUC-ROC: 0.5


  _warn_prf(average, modifier, msg_start, len(result))
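
The figures reported above are what we would see if the model predicted "not readmitted" for every test patient. The sketch below reproduces them from confusion-matrix counts. The counts (153 negatives, 47 positives) are inferred from the reported accuracy on 200 test rows; they are an assumption, not logged output.

```python
def summarize(tp, fp, fn, tn):
    """Compute metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined with no positive predictions; report 0.0,
    # mirroring scikit-learn's default behaviour.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # AUC computed from hard 0/1 labels reduces to balanced accuracy.
    hard_label_auc = (recall + specificity) / 2
    return accuracy, precision, recall, hard_label_auc


# Inferred counts: 200 test rows, every one predicted as "not readmitted".
print(summarize(tp=0, fp=0, fn=47, tn=153))
# → (0.765, 0.0, 0.0, 0.5)
```

An accuracy of 0.765 therefore says only that about 76.5% of test patients were not readmitted; the model identified none of the readmissions.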


**Random Forest and Gradient Boosting**

In [12]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Fix random_state so the results are reproducible
rf_model = RandomForestClassifier(random_state=42)
gb_model = GradientBoostingClassifier(random_state=42)

# 5-fold cross-validated AUC-ROC on the full dataset (X, y);
# note that the held-out test rows are part of these folds
rf_scores = cross_val_score(rf_model, X, y, cv=5, scoring='roc_auc')
gb_scores = cross_val_score(gb_model, X, y, cv=5, scoring='roc_auc')

print(f'Random Forest AUC-ROC Scores: {rf_scores}')
print(f'Mean Random Forest AUC-ROC Score: {rf_scores.mean()}')
print(f'Gradient Boosting AUC-ROC Scores: {gb_scores}')
print(f'Mean Gradient Boosting AUC-ROC Score: {gb_scores.mean()}')

Random Forest AUC-ROC Scores: [0.50571172 0.50616095 0.52348864 0.56879733 0.46010915]
Mean Random Forest AUC-ROC Score: 0.5128535584398117
Gradient Boosting AUC-ROC Scores: [0.48568862 0.45757926 0.42112694 0.57386728 0.54378898]
Mean Gradient Boosting AUC-ROC Score: 0.4964102157161457
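
AUC-ROC values near 0.5, like the scores above, are what a model with no discriminative power produces. One way to make that reference point explicit is a trivial baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic stand-in data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with no real signal (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (rng.random(500) < 0.25).astype(int)

# Always predicts the most frequent class, ignoring the features.
baseline = DummyClassifier(strategy="most_frequent")
baseline_auc = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="roc_auc")
baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Baseline AUC-ROC per fold: {baseline_auc}")
print(f"Baseline mean accuracy:    {baseline_acc.mean():.3f}")
```

Because the baseline gives every patient the same score, its AUC-ROC is 0.5 in each fold. A real model should clear that bar consistently across folds.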


**Hyperparameter Tuning with GridSearchCV**

In [13]:
from sklearn.model_selection import GridSearchCV

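# Candidate hyperparameters for gradient boosting (2 x 3 x 3 = 18 combinations)
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7]
}

# Evaluate every combination with 5-fold cross-validation on the training set only
grid_search = GridSearchCV(gb_model, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)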

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best AUC-ROC Score: {grid_search.best_score_}')

Best Parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100}
Best AUC-ROC Score: 0.5460505563001552
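
To get a feel for the cost of this search, we can count the model fits an exhaustive grid implies. The sketch below redefines the same grid and assumes the default `refit=True`, which adds one final fit on the whole training set.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
}
cv_folds = 5

# Every combination of the listed values is tried once per fold.
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds + 1  # +1 for the final refit

print(f"Parameter combinations: {len(combinations)}")  # → 18
print(f"Total model fits: {n_fits}")                   # → 91
```

Adding one more value to any list multiplies the number of combinations, so the grid is best kept small or replaced by a randomized search when it grows.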


**Finalize and Save the Model**

In [16]:
# Refit gradient boosting on the training set with the best parameters found
final_model = GradientBoostingClassifier(**grid_search.best_params_, random_state=42)
final_model.fit(X_train, y_train)

final_y_pred = final_model.predict(X_test)  # hard 0/1 labels at the default 0.5 threshold

final_accuracy = accuracy_score(y_test, final_y_pred)
final_precision = precision_score(y_test, final_y_pred)
final_recall = recall_score(y_test, final_y_pred)
# Note: computed from hard labels; final_model.predict_proba(X_test)[:, 1]
# would give a threshold-independent AUC-ROC
final_roc_auc = roc_auc_score(y_test, final_y_pred)

print(f'Final Accuracy: {final_accuracy}')
print(f'Final Precision: {final_precision}')
print(f'Final Recall: {final_recall}')
print(f'Final AUC-ROC: {final_roc_auc}')

Final Accuracy: 0.76
Final Precision: 0.3333333333333333
Final Recall: 0.02127659574468085
Final AUC-ROC: 0.5041023501599221
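
The AUC-ROC above is computed from hard class labels, which ties it to a single decision threshold. A minimal sketch of the probability-based alternative follows, run on synthetic stand-in data rather than on the patient dataset. The data, the variable names, and the 0.3 threshold are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 25% positives); illustrative only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# Probability of the positive class, used for a threshold-independent AUC.
proba = clf.predict_proba(X_te)[:, 1]
auc_from_proba = roc_auc_score(y_te, proba)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# Lowering the decision threshold trades precision for recall.
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.3).astype(int))

print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Recall at 0.5 / 0.3:    {recall_default:.3f} / {recall_lowered:.3f}")
```

For a readmission model, where missing an at-risk patient is usually costlier than a false alarm, choosing the threshold deliberately may matter more than the default of 0.5.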


In [17]:
import joblib

joblib.dump(final_model, 'final_model.joblib')

['final_model.joblib']
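
Before relying on a saved model, it is worth confirming that it reloads and predicts identically. The sketch below shows that round trip with a small synthetic model and a temporary, hypothetical file name, so it does not touch `final_model.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic model purely to demonstrate the save/load round trip.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model_demo = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model_demo, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same = np.array_equal(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(f"Predictions identical after reload: {same}")
```

A saved model should generally be loaded with the same scikit-learn version that produced it, and new data must be encoded with the same columns, in the same order, as the training data.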

# Summary

|What we have done|
|-----------------------------------------------------------------------------|
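|We have created a synthetic dataset, loaded and explored the data, prepared it for modeling, trained and evaluated multiple machine-learning models, tuned a gradient boosting model, and saved the final model for deployment. On this synthetic data every model scored close to chance (AUC-ROC of roughly 0.5), so the saved model demonstrates the workflow rather than usable predictive performance. The same workflow can be applied to real data once it is available.|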