
# Predictive Model for Cardiac Surgery Outcomes

This project uses the `CompleteDataExample_OperationsFor20232024.xlsx` dataset to build a Gradient Boosting model that predicts cardiac surgery outcomes.
The goal is to identify which clinical factors are most strongly associated with patient outcomes.

### Project Steps:
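1. **Library Imports and Data Loading**: Importing libraries and reading the Excel dataset.
2. **Data Cleaning and Preprocessing**: Handling missing values and encoding categorical features.
3. **Data Exploration and Visualization**: Plotting key clinical outcomes and their correlations.
4. **Model Training and Hyperparameter Tuning**: Training a Gradient Boosting model tuned with randomized search.
5. **Model Evaluation**: Assessing performance with MSE and R-squared, and inspecting feature importance.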



In [None]:

# Step 1: Import Libraries and Load Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Set visualization style
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.2)

# Load dataset
file_path = 'CompleteDataExample_OperationsFor20232024.xlsx'
data = pd.read_excel(file_path)
data.head()



## Step 2: Data Cleaning and Preprocessing

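This section drops numeric columns with more than 50% missing values, fills the remaining numeric gaps with the column median, and integer-encodes categorical features after replacing missing values with `'Unknown'`.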
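
The cleaning rules used in this step (drop numeric columns that are more than half missing, median-fill the rest, integer-encode categoricals) can be tried on a small table before running them on the real workbook. The sketch below is illustrative only: the column names (`age`, `mostly_missing`, `sex`) and all values are hypothetical and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
toy = pd.DataFrame({
    "age": [61.0, np.nan, 70.0, 55.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "sex": ["F", None, "M", "F"],
})

numeric = toy.select_dtypes(include=[np.number])
categorical = toy.select_dtypes(exclude=[np.number])

# Keep numeric columns with at most 50% missing values, then median-impute.
numeric = numeric.loc[:, numeric.isnull().mean() <= 0.5]
numeric = numeric.fillna(numeric.median())

# Replace missing categories with 'Unknown', then map each label to an integer code.
categorical = categorical.fillna("Unknown")
categorical = categorical.apply(lambda s: s.astype("category").cat.codes)

toy_cleaned = pd.concat([numeric, categorical], axis=1)
print(toy_cleaned)
```

The integer codes follow the sorted order of the labels, so they carry no clinical meaning; they are only a compact way to pass categories to a tree-based model.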


In [None]:

# Separate numeric and categorical columns
numeric_data = data.select_dtypes(include=[np.number])
categorical_data = data.select_dtypes(exclude=[np.number])

# Drop columns in numeric data with >50% missing data, then fill remaining with median
numeric_data = numeric_data.drop(columns=numeric_data.columns[numeric_data.isnull().mean() > 0.5])
numeric_data = numeric_data.fillna(numeric_data.median())

# Impute categorical columns by filling missing values with a placeholder 'Unknown' and encoding
categorical_data = categorical_data.fillna('Unknown')
for column in categorical_data.columns:
    categorical_data[column] = categorical_data[column].astype('category').cat.codes

# Combine cleaned numeric and categorical data
cleaned_data = pd.concat([numeric_data, categorical_data], axis=1)
cleaned_data.info()  # Display cleaned data structure



## Step 3: Data Exploration and Visualization

This section plots the distribution of each key clinical outcome and the correlation matrix between them.
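
If the outcome columns are 0/1 indicators (an assumption; the source does not state how they are encoded), event rates and pairwise correlations summarize them compactly alongside the plots. A minimal sketch on synthetic data, with hypothetical column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic 0/1 outcomes; names and rates are invented for illustration.
n = 500
outcomes = pd.DataFrame({
    "event_a": rng.binomial(1, 0.10, size=n),
    "event_b": rng.binomial(1, 0.30, size=n),
})
# A third outcome defined as "either event occurred", so it correlates with both.
outcomes["any_event"] = outcomes[["event_a", "event_b"]].max(axis=1)

event_rates = outcomes.mean()  # fraction of rows in which each event occurred
corr = outcomes.corr()         # Pearson correlation (the phi coefficient for 0/1 data)

print(event_rates.round(3))
print(corr.round(2))
```

A composite column such as `any_event` is correlated with its components by construction, which is worth keeping in mind when reading a heatmap of related outcomes.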


In [None]:

# Visualize distributions of key clinical outcomes
outcome_features = ['OPERATIVE MORTALITY', 'MORBIDITY & MORTALITY', 'STROKE', 'RENAL FAILURE', 'REOPERATION',
                    'PROLONGED VENTILATION', 'LONG HOSPITAL STAY', 'SHORT HOSPITAL STAY']

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
fig.suptitle('Distributions of Key Clinical Outcomes')

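for i, feature in enumerate(outcome_features):
    row, col = divmod(i, 3)
    sns.histplot(cleaned_data[feature], kde=True, ax=axes[row, col])
    axes[row, col].set_title(f'Distribution of {feature}')

# Hide the unused panel (the 3x3 grid has 9 slots but there are only 8 outcomes)
for ax in axes.flat[len(outcome_features):]:
    ax.set_visible(False)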

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cleaned_data[outcome_features].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Clinical Outcomes')
plt.show()



## Step 4: Model Training and Hyperparameter Tuning

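This section trains a `GradientBoostingRegressor` and tunes its hyperparameters with `RandomizedSearchCV` (10 sampled configurations, 5-fold cross-validation, scored by negative mean squared error).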
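
To show how the randomized search behaves independently of the surgical data, here is a minimal sketch on synthetic regression data. The arrays, the reduced parameter grid, and the `_demo` names are assumptions chosen so the example runs quickly; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic regression data; not derived from the surgical dataset.
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A deliberately small grid so the sketch finishes in a few seconds.
param_dist_demo = {
    "n_estimators": [25, 50],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_dist_demo,
    n_iter=4,  # sample 4 of the 8 possible combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

print(search.best_params_)  # sampled combination with the best mean CV score
print(-search.best_score_)  # corresponding cross-validated MSE
```

Because the scorer is the negated MSE, higher is better during the search and `best_score_` is never positive; negating it recovers the MSE.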


In [None]:

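# Define target (y) and features (X); the target is excluded from X to avoid leakage
target = 'OPERATIVE MORTALITY'
feature_columns = [col for col in outcome_features if col != target]
X = cleaned_data[feature_columns]
y = cleaned_data[target]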

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Gradient Boosting model
gbr = GradientBoostingRegressor(random_state=42)

# Define hyperparameters for RandomizedSearchCV
param_dist = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 3, 5]
}

# Hyperparameter tuning
random_search = RandomizedSearchCV(
    gbr,
    param_distributions=param_dist,
    n_iter=10,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
)
random_search.fit(X_train, y_train)
best_gbr = random_search.best_estimator_

# Output the best hyperparameters found by the search
random_search.best_params_



## Step 5: Model Evaluation

This section evaluates the tuned model on the held-out test set using Mean Squared Error (MSE) and R-squared, then plots the model's feature importances.
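
For reference, MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes both by hand on a few invented numbers so they can be compared with scikit-learn's `mean_squared_error` and `r2_score`; the values are hypothetical and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 3.0, 6.5])

residuals = y_true - y_hat
mse_manual = np.mean(residuals ** 2)            # average squared error
ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An R-squared of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible on held-out data.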


In [None]:

# Predict on test set
y_pred = best_gbr.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Feature importance visualization
plt.figure(figsize=(10, 6))
sns.barplot(x=best_gbr.feature_importances_, y=X.columns)
plt.title('Feature Importance in Gradient Boosting Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
