# SEPSIS PREDICTION USING FASTAPI


## Business Understanding

Sepsis is a life-threatening condition caused by the body's response to an infection, which can lead to tissue damage, organ failure, and death if not treated promptly. Early detection and intervention are critical for improving patient outcomes and reducing mortality rates associated with sepsis. In this context, healthcare organizations are constantly seeking ways to improve their sepsis detection and management protocols.

# Hypothesis

Null Hypothesis (H0):
There is no significant difference in patient outcomes and mortality rates associated with sepsis between healthcare organizations that implement advanced machine learning-based predictive models for sepsis detection and those that do not.
Alternative Hypothesis (H1):
Healthcare organizations that implement advanced machine learning-based predictive models for sepsis detection achieve significantly better patient outcomes and lower sepsis-related mortality rates than organizations that do not use such models.

# Analytical Questions

1. What is the distribution of plasma glucose levels among patients who develop sepsis compared to those who don't?

2. Is there a correlation between blood pressure and body mass index (BMI)?

3. How does the age distribution differ between patients with and without valid insurance cards?

4. What is the proportion of patients with valid insurance cards among those who develop sepsis compared to those who don't?

5. What is the average value of Blood Work Result-1 among patients who develop sepsis?

# Data Understanding

Importation

In [1]:
# Data manipulation packages
import pandas as pd
import numpy as np
from dotenv import dotenv_values


# Data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# Standard library
import os

# Machine learning Packages
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler
from sklearn.preprocessing import OneHotEncoder , LabelEncoder , OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from scipy import stats
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler,SMOTE
from sklearn.feature_selection import SelectKBest,mutual_info_classif
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score,roc_curve,auc
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from joblib import dump

# Database connection package
import pyodbc

# Ignore warnings (optional)
import warnings
warnings.filterwarnings("ignore")



# Data Loading

In [2]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df_train = pd.read_csv(r"D:\Azubi lp5\Building-a-FastAPI-for-Sepsis-Prediction-\Paitients_Files_Train.csv")
df_train.head(3)

Unnamed: 0,ID\tPRG\tPL\tPR\tSK\tTS\tM11\tBD2\tAge\tInsurance\tSepssis
0,ICU200010\t6\t148\t72\t35\t0\t33.6\t0.627\t50\...
1,ICU200011\t1\t85\t66\t29\t0\t26.6\t0.351\t31\t...
2,ICU200012\t8\t183\t64\t0\t0\t23.3\t0.672\t32\t...
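
The single fused column header above, with `\t` between the field names, suggests that the training file is tab-delimited rather than comma-delimited. A minimal sketch of reading such a file with `sep="\t"` follows. It uses an in-memory sample built from the three rows shown above rather than the real file; the `sample`, `fused`, and `parsed` names are illustrative, and the Insurance and Sepssis fields are left out because the displayed output truncates them.

```python
import io

import pandas as pd

# In-memory stand-in for the tab-delimited training file, built from the rows
# shown above. Insurance and Sepssis are omitted because the displayed output
# truncates them.
sample = (
    "ID\tPRG\tPL\tPR\tSK\tTS\tM11\tBD2\tAge\n"
    "ICU200010\t6\t148\t72\t35\t0\t33.6\t0.627\t50\n"
    "ICU200011\t1\t85\t66\t29\t0\t26.6\t0.351\t31\n"
    "ICU200012\t8\t183\t64\t0\t0\t23.3\t0.672\t32\n"
)

# Default comma separator: every row collapses into one fused column
fused = pd.read_csv(io.StringIO(sample))

# Tab separator: each field becomes its own column
parsed = pd.read_csv(io.StringIO(sample), sep="\t")

print(fused.shape)
print(parsed.shape)
print(parsed["Age"].tolist())
```

With the real file, the same `sep="\t"` argument would be passed to the `pd.read_csv` call that loads `df_train`.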


In [3]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
df_test = pd.read_csv(r"D:\Azubi lp5\Building-a-FastAPI-for-Sepsis-Prediction-\Paitients_Files_Test.csv")
df_test.head(3)

Unnamed: 0,ID,PRG,PL,PR,SK,TS,M11,BD2,Age,Insurance
0,ICU200609,1,109,38,18,120,23.1,0.407,26,1
1,ICU200610,1,108,88,19,0,27.1,0.4,24,1
2,ICU200611,6,96,0,0,0,23.7,0.19,28,1


# Data Fields

| Column | Attribute | Description |
|---|---|---|
| ID | N/A | Unique number to represent patient ID |
| PRG | Attribute 1 | Plasma glucose |
| PL | Attribute 2 | Blood Work Result-1 (mu U/ml) |
| PR | Attribute 3 | Blood Pressure (mm Hg) |
| SK | Attribute 4 | Blood Work Result-2 (mm) |
| TS | Attribute 5 | Blood Work Result-3 (mu U/ml) |
| M11 | Attribute 6 | Body mass index (weight in kg/(height in m)^2) |
| BD2 | Attribute 7 | Blood Work Result-4 (mu U/ml) |
| Age | Attribute 8 | Patient's age (years) |
| Insurance | N/A | Whether the patient holds a valid insurance card |
| Sepssis | Target | Positive if a patient in the ICU will develop sepsis, Negative otherwise |

# EDA

In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 1 columns):
 #   Column                                            Non-Null Count  Dtype 
---  ------                                            --------------  ----- 
 0   ID	PRG	PL	PR	SK	TS	M11	BD2	Age	Insurance	Sepssis  599 non-null    object
dtypes: object(1)
memory usage: 4.8+ KB


In [5]:
df_train.duplicated().sum()

0

In [6]:
df_train.describe()

Unnamed: 0,ID\tPRG\tPL\tPR\tSK\tTS\tM11\tBD2\tAge\tInsurance\tSepssis
count,599
unique,599
top,ICU200010\t6\t148\t72\t35\t0\t33.6\t0.627\t50\...
freq,1


In [7]:
df_train.shape

(599, 1)

In [8]:
df_test.describe()

Unnamed: 0,PRG,PL,PR,SK,TS,M11,BD2,Age,Insurance
count,169.0,169.0,169.0,169.0,169.0,169.0,169.0,169.0,169.0
mean,3.91716,123.52071,70.426036,20.443787,81.0,32.249704,0.438876,33.065089,0.727811
std,3.402415,29.259123,19.426805,15.764962,110.720852,7.444886,0.306935,11.54811,0.44641
min,0.0,56.0,0.0,0.0,0.0,0.0,0.1,21.0,0.0
25%,1.0,102.0,62.0,0.0,0.0,27.6,0.223,24.0,0.0
50%,3.0,120.0,74.0,23.0,0.0,32.4,0.343,28.0,1.0
75%,6.0,141.0,80.0,32.0,135.0,36.6,0.587,42.0,1.0
max,13.0,199.0,114.0,49.0,540.0,57.3,1.698,70.0,1.0
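
The summary shows a minimum of 0 for PR (blood pressure) and M11 (BMI), values that are not physiologically plausible and may be placeholders for missing measurements. The sketch below counts zeros per column on the three test rows displayed earlier. Treating zeros as missing is an assumption, not something the dataset description states, and the `suspect_columns` list is an illustrative choice.

```python
import numpy as np
import pandas as pd

# The three test rows displayed earlier in this notebook
sample = pd.DataFrame({
    "ID": ["ICU200609", "ICU200610", "ICU200611"],
    "PRG": [1, 1, 6],
    "PL": [109, 108, 96],
    "PR": [38, 88, 0],
    "SK": [18, 19, 0],
    "TS": [120, 0, 0],
    "M11": [23.1, 27.1, 23.7],
    "BD2": [0.407, 0.4, 0.19],
    "Age": [26, 24, 28],
    "Insurance": [1, 1, 1],
})

# Columns where a zero is assumed to mean "not measured" (an assumption)
suspect_columns = ["PL", "PR", "SK", "TS", "M11"]

# Count zeros per suspect column
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Optionally mark those zeros as missing so an imputer can handle them later
cleaned = sample.copy()
cleaned[suspect_columns] = cleaned[suspect_columns].replace(0, np.nan)
print(cleaned[suspect_columns].isna().sum())
```

If zeros are recoded this way, a median imputer in the preprocessing step would then fill them in.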


## Univariate Analysis

In [11]:
# Visualize distributions of numerical features
plt.figure(figsize=(10, 6))
sns.histplot(df_train['Age'], kde=True, bins=20, color='blue')
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

KeyError: 'Age'

<Figure size 1000x600 with 0 Axes>

In [None]:
# Visualize relationships between numerical features using pairplot
sns.pairplot(df_train, diag_kind='kde')
plt.show()


In [None]:
# Visualize categorical features
plt.figure(figsize=(10, 6))
sns.countplot(x ='Insurance', data=df_train)
plt.title('Count of Patients with Insurance')
plt.xlabel('Insurance')
plt.ylabel('Count')
plt.show()


In [None]:
# Explore relationship between categorical target variable (Sepssis) and numerical features
plt.figure(figsize=(10, 6))
sns.boxplot(x='Sepssis', y='Age', data=df_train)
plt.title('Age Distribution by Sepsis')
plt.xlabel('Sepsis')
plt.ylabel('Age')
plt.show()


In [None]:
# Explore relationships between categorical and numerical features
plt.figure(figsize=(10, 6))
sns.boxplot(x='Insurance', y='Age', data=df_train)
plt.title('Age Distribution by Insurance')
plt.xlabel('Insurance')
plt.ylabel('Age')
plt.show()


In [None]:
# Display histograms for all numerical features
df_train.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()


## Bivariate Analysis

In [None]:
# Calculate correlation matrix
corr_matrix = df_train.drop(columns=['ID','Sepssis']).corr()

# Plot heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()


## Multivariate Analysis (PCA)

In [None]:
# Separate features and target variable
X = df_train.drop(columns=['ID', 'Sepssis'])  # Exclude the identifier and target columns
y = df_train['Sepssis']  # Target comes from the training data, not from df_test

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Plot explained variance ratio
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, marker='o', linestyle='-')
plt.title('Explained Variance Ratio')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.xticks(range(1, len(pca.explained_variance_ratio_) + 1))
plt.grid(True)
plt.show()
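
A common follow-up to the per-component plot is the cumulative explained variance, which indicates how many components are needed to retain a chosen share of the total variance. The sketch below uses synthetic data from `make_classification` in place of the sepsis features, and the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not the sepsis data)
X_demo, _ = make_classification(n_samples=200, n_features=9, n_informative=5, random_state=0)

# Standardize, then fit PCA with all components
X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo_scaled)

# Running total of the variance explained by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of components reaching the (illustrative) 95% threshold
threshold = 0.95
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3))
print(n_components)
```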


### Analytical questions and answers

In [None]:
df_train.columns


In [None]:
# Question 1: Distribution of plasma glucose levels among patients who develop sepsis vs. those who don't
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=df_train, x='PRG', hue='Sepssis', bins=20, kde=True)
plt.title('Distribution of Plasma Glucose Levels by Sepsis Status')
plt.xlabel('Plasma Glucose')
plt.ylabel('Frequency')
ax.get_legend().set_title('Sepsis')  # Retitle the legend that seaborn creates for hue
plt.show()


In [None]:
# Question 2: Correlation between blood pressure and body mass index (BMI)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_train, x='PR', y='M11')
plt.title('Correlation between Blood Pressure and BMI')
plt.xlabel('Blood Pressure (mm Hg)')
plt.ylabel('Body Mass Index')
plt.show()
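
A scatter plot shows the shape of the relationship, while a correlation coefficient quantifies it. The sketch below applies `scipy.stats.pearsonr` to small made-up arrays standing in for the PR and M11 columns; the numbers are illustrative and not taken from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up values standing in for blood pressure (PR) and BMI (M11)
blood_pressure = np.array([72, 66, 64, 80, 90, 70, 85, 60])
bmi = np.array([33.6, 26.6, 23.3, 30.1, 35.0, 28.4, 31.2, 24.8])

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(blood_pressure, bmi)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")
```

Once the training columns are parsed, the equivalent call would be `pearsonr(df_train['PR'], df_train['M11'])`.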


In [None]:
# Question 3: Age distribution by insurance status
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_train, x='Insurance', y='Age')
plt.title('Age Distribution by Insurance Status')
plt.xlabel('Insurance')
plt.ylabel('Age')
plt.show()


In [None]:

# Question 4: Proportion of patients with valid insurance cards among those who develop sepsis vs. those who don't
insurance_sepsis = df_train.groupby(['Sepssis', 'Insurance']).size().unstack()
insurance_sepsis.plot(kind='bar', stacked=True)
plt.title('Proportion of Patients by Sepsis Status and Insurance')
plt.xlabel('Sepsis')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['No Sepsis', 'Sepsis'], rotation=0)
plt.legend(title='Insurance', loc='upper right')
plt.show()
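
The stacked bars above show counts; the proportions asked about in Question 4 can be read from a row-normalized cross-tabulation. The sketch uses a small made-up frame with the same column names, so the resulting numbers are illustrative only.

```python
import pandas as pd

# Made-up records with the same column names as the training data
demo = pd.DataFrame({
    "Sepssis": ["Positive", "Positive", "Positive", "Positive",
                "Negative", "Negative", "Negative", "Negative", "Negative"],
    "Insurance": [1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# normalize="index" divides each row by its total, giving the share of
# insured and uninsured patients within each sepsis group
proportions = pd.crosstab(demo["Sepssis"], demo["Insurance"], normalize="index")
print(proportions)
```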



In [None]:
# Question 5: Average value of Blood Work Result-1 among patients who develop sepsis
avg_blood_work_result_1 = df_train[df_train['Sepssis'] == 'Positive']['PL'].mean()
print("Average Blood Work Result-1 among patients who develop sepsis:", avg_blood_work_result_1)


Split dataset and encode y target

In [None]:
# Separate features and target variable
X = df_train.drop(columns=['Sepssis','ID'])  # Exclude non-numeric and target columns
y = df_train['Sepssis']


In [None]:
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=25)


In [None]:
# Encode y labels: fit on the training labels, then reuse the same mapping for the test labels
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
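
A label encoder is normally fitted once on the training labels and then reused, unchanged, on any other split, so that every split shares the same class-to-integer mapping. A minimal sketch with made-up labels follows; the `demo_` names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up label splits using the same class names as the Sepssis column
demo_y_train = pd.Series(["Negative", "Positive", "Negative", "Negative", "Positive"])
demo_y_test = pd.Series(["Positive", "Negative"])

demo_encoder = LabelEncoder()
demo_train_encoded = demo_encoder.fit_transform(demo_y_train)  # learn the mapping
demo_test_encoded = demo_encoder.transform(demo_y_test)        # reuse it, no refit

print(list(demo_encoder.classes_))
print(demo_train_encoded.tolist())
print(demo_test_encoded.tolist())
print(list(demo_encoder.inverse_transform([0, 1])))
```

`inverse_transform` is what turns a model's integer prediction back into a readable label when responses are returned to a client.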


In [None]:
input_features = X.columns
input_features


# Preprocessor

In [None]:
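# Define preprocessing steps. They are chained in a Pipeline so that each step receives
# the output of the previous one; separate ColumnTransformer entries would each run on
# the raw columns in parallel and have their outputs concatenated.
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Impute missing values using median
    ('log_transformation', FunctionTransformer(np.log1p)),  # Log-transform before scaling; log1p needs values above -1
    ('scaler', RobustScaler()),  # Scale features using RobustScaler
])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_pipeline, input_features),
    ]
)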
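
A `ColumnTransformer` applies each listed transformer to the original columns independently and concatenates the outputs, whereas a `Pipeline` feeds each step's output into the next. The sketch below illustrates the difference in output shape on a toy array; the data and the `parallel` and `chained` names are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler

# Toy data: 4 rows, 2 non-negative features
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 400.0]])
columns = [0, 1]

# Three transformers on the same columns: outputs are stacked side by side
parallel = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="median"), columns),
    ("scaler", RobustScaler(), columns),
    ("log", FunctionTransformer(np.log1p), columns),
])

# One pipeline: impute, then log-transform, then scale
chained = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("log", FunctionTransformer(np.log1p)),
    ("scaler", RobustScaler()),
])

parallel_shape = parallel.fit_transform(X_toy).shape
chained_shape = chained.fit_transform(X_toy).shape
print(parallel_shape)
print(chained_shape)
```

The parallel version triples the number of columns, while the chained version keeps one transformed column per input feature.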


# Modelling and Evaluation

In [None]:
# Define the models
logistic_regression_model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

random_forest_model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

gradient_boosting_model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', GradientBoostingClassifier())
])

# Train each model on the training split and evaluate its accuracy on the held-out split

# Train and evaluate Logistic Regression model
logistic_regression_model.fit(X_train, y_train)
logistic_regression_accuracy = logistic_regression_model.score(X_test, y_test)
print("Logistic Regression Model Accuracy:", logistic_regression_accuracy)

# Train and evaluate Random Forest model
random_forest_model.fit(X_train, y_train)
random_forest_accuracy = random_forest_model.score(X_test, y_test)
print("Random Forest Model Accuracy:", random_forest_accuracy)

# Train and evaluate Gradient Boosting model
gradient_boosting_model.fit(X_train, y_train)
gradient_boosting_accuracy = gradient_boosting_model.score(X_test, y_test)
print("Gradient Boosting Model Accuracy:", gradient_boosting_accuracy)
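
A single train/test split gives one accuracy figure that can vary with the split. Cross-validation averages over several splits; the sketch below uses `cross_val_score` with ROC AUC on synthetic, mildly imbalanced data. The dataset, the 5-fold setting, and the choice of ROC AUC are assumptions for illustration, not this notebook's recorded setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the sepsis features, with roughly a 65/35 class split
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, weights=[0.65, 0.35], random_state=25
)

demo_model = Pipeline([
    ("scaler", RobustScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=25)
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(scores.round(3))
print(f"Mean ROC AUC: {scores.mean():.3f}")
```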


In [None]:
# Define a function to generate metrics dictionary
def get_metrics(model, X_test, y_test):
    # Predict on the test set
    y_pred = model.predict(X_test)
    # Generate classification report
    report = classification_report(y_test, y_pred, output_dict=True)
    # Extract metrics dictionary from the classification report
    metrics = {
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-score': report['weighted avg']['f1-score'],
        'Support': report['weighted avg']['support']
    }
    return metrics

# Get metrics dictionary for each model
logistic_regression_metrics = get_metrics(logistic_regression_model, X_test, y_test)
random_forest_metrics = get_metrics(random_forest_model, X_test, y_test)
gradient_boosting_metrics = get_metrics(gradient_boosting_model, X_test, y_test)

# Create a DataFrame with metrics dictionaries
metrics_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'Precision': [logistic_regression_metrics['Precision'], random_forest_metrics['Precision'], gradient_boosting_metrics['Precision']],
    'Recall': [logistic_regression_metrics['Recall'], random_forest_metrics['Recall'], gradient_boosting_metrics['Recall']],
    'F1-score': [logistic_regression_metrics['F1-score'], random_forest_metrics['F1-score'], gradient_boosting_metrics['F1-score']],
    'Support': [logistic_regression_metrics['Support'], random_forest_metrics['Support'], gradient_boosting_metrics['Support']]
})

# Display the DataFrame as a table
print(metrics_df)
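
Weighted averages can hide weak performance on the minority class. For a screening task such as sepsis detection, recall on the Positive class is often examined separately. The sketch below pulls per-class figures from `classification_report` using made-up labels and predictions, not model output from this notebook.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions (not model output from this notebook)
y_true = ["Negative"] * 6 + ["Positive"] * 4
y_pred = ["Negative"] * 5 + ["Positive"] + ["Positive"] * 2 + ["Negative"] * 2

report = classification_report(y_true, y_pred, output_dict=True)

# Per-class recall: share of actual Positive cases that were flagged
positive_recall = report["Positive"]["recall"]
positive_precision = report["Positive"]["precision"]
print(f"Positive recall: {positive_recall:.2f}")
print(f"Positive precision: {positive_precision:.2f}")

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=["Negative", "Positive"])
print(matrix)
```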


In [None]:

# Create the models directory if it doesn't exist
models_dir = r"D:\Azubi lp5\Building-a-FastAPI-for-Sepsis-Prediction-\models"
os.makedirs(models_dir, exist_ok=True)

# Dump models into the directory created above, independent of the current working directory
dump(logistic_regression_model, os.path.join(models_dir, 'logistic_regression_pipeline.joblib'))
dump(random_forest_model, os.path.join(models_dir, 'random_forest_pipeline.joblib'))
dump(gradient_boosting_model, os.path.join(models_dir, 'gradient_boosting_pipeline.joblib'))

# Dump LabelEncoder (assuming it's already fitted and named 'encoder')
dump(encoder, os.path.join(models_dir, 'encoder.joblib'))
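
Before serving a saved pipeline from an API, it is worth checking that it reloads and predicts identically. The sketch below does a round trip through a temporary directory with a small pipeline trained on synthetic data; the file name, the data, and the `demo_pipeline` name are placeholders rather than the notebook's actual artifacts.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Small pipeline trained on synthetic data (a stand-in for the real models)
X_demo, y_demo = make_classification(n_samples=100, n_features=9, random_state=0)
demo_pipeline = Pipeline([
    ("scaler", RobustScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    dump(demo_pipeline, model_path)   # save to disk
    reloaded = load(model_path)       # load it back

# The reloaded pipeline should reproduce the original predictions
same_predictions = np.array_equal(demo_pipeline.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```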
