This notebook presents the development of a classification model aimed at predicting in-hospital mortality for admitted patients. It is structured into five sections:

1. Data Loading
2. Exploratory Data Analysis & Data Cleaning
3. Data Pre-processing
4. Logistic Regression Model
5. Conclusion

In [None]:
# import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

import warnings as w
w.filterwarnings('ignore')

## 1. Data Loading

In [None]:
# Function for Importing and analyzing raw data
def load_data(data_path, dataset_name):
    
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None) 
    print("\n"+"*"*100+"\n")
    print(f"DATASET NAME: {dataset_name}\n")
    raw_data = pd.read_csv(data_path + dataset_name + '.csv')
    display(raw_data.head())
    print("\n"+"-"*100+"\n")
    print(f"SHAPE: {raw_data.shape}")
    print("\n"+"-"*100+"\n")
    raw_data.info()  # info() prints its own report and returns None
    print("\n"+"-"*100+"\n")
    print("Duplicate rows in Data:")
    display(raw_data[raw_data.duplicated()])
    print("\n"+"-"*100+"\n")
    print("Data Summary:\n")
    display(raw_data.describe(include = 'all'))
    print("\n"+"-"*100+"\n")
    return raw_data


raw_data = load_data('../input/patient/', 'dataset')

Observations from the data summary above:

* The dataset has 91,713 rows and 85 columns, including the dependent variable 'hospital_death'.
* There are three types of columns in data: Float, Integer and Object.
* No duplicate rows are present in the dataset.
* The column named 'Unnamed: 83' contains solely null values, suggesting it should be dropped from the analysis.
* 'encounter_id', 'patient_id' and 'hospital_id' are unlikely to be relevant to patient survival and should be removed before further analysis.
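
The checks behind these observations can be sketched in a few lines. This is a minimal illustration on a made-up toy frame, not the patient dataset: only the column names are taken from the notebook, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_data (values are invented for illustration)
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "age": [65.0, np.nan, 71.0, 40.0],
    "Unnamed: 83": [np.nan] * 4,
})

# Columns in which every value is missing
all_null_cols = toy.columns[toy.isnull().all()].tolist()

# Number of rows that exactly duplicate an earlier row
n_duplicates = int(toy.duplicated().sum())

print(f"All-null columns: {all_null_cols}")
print(f"Duplicate rows: {n_duplicates}")
```

The same two expressions can be applied to `raw_data` to confirm which columns are safe to drop.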

## 2. Exploratory Data Analysis & Data Cleaning

In [None]:
# Drop columns 'encounter_id' ,'patient_id' ,'hospital_id' and 'Unnamed: 83'
raw_data.drop(['encounter_id' ,'patient_id' ,'hospital_id','Unnamed: 83'],axis =1, inplace=True)

In [None]:
# List numerical and categorical variables separately for exploratory data analysis.
# select_dtypes avoids comparing against the built-in int/float types, whose
# mapping to NumPy dtypes can differ between platforms.
numerical_col = raw_data.select_dtypes(include='number').columns.tolist()
categorical_col = raw_data.select_dtypes(include='object').columns.tolist()
print (f'Numerical columns:\n\n {numerical_col}')
print("\n"+"-"*150+"\n")
print (f'Categorical columns:\n\n {categorical_col}')

In [None]:
#Plot Numerical columns
for col in numerical_col:
    plt.figure(figsize=(8,4))
    sns.histplot(raw_data[col])
    plt.show()

In [None]:
#Plot Categorical columns
for col in categorical_col:
    plt.figure(figsize=(6,4))
    sns.countplot(x=raw_data[col])
    # rotate x-axis labels
    plt.xticks(rotation=45)
    plt.show()

Most of the major numerical variables are approximately normally distributed, and the gender column shows no marked imbalance. The other categorical variables are expected to be imbalanced, likely because they capture clinically important patient characteristics.

## 3. Data Pre-processing

In [None]:
#Check null data percentage
raw_data.isnull().sum()*100/len(raw_data)

Several variables have missing values, most likely because of errors in data collection. To preserve information, missing values are imputed rather than discarded wherever a sensible replacement exists. Each variable is handled as follows:

**Age:** Age variable has 4.61% missing values, most likely due to error. Since the age distribution is left-skewed, replacing missing values with the median age is recommended.

**BMI:** BMI variable has 3.73% missing values, likely due to error as with age. Its distribution is right-skewed, so missing values are replaced with the median BMI.

**Height:** Height variable has 1.45% missing values, likely due to oversight. As it approximates a normal distribution, filling missing values with the mean is recommended.

**Weight:** Weight variable has 2.96% missing values. Considering its right-skewed distribution, it's advisable to replace missing values with the median.

For the remaining variables with missing values, such as medical test readings, simple imputation may not be suitable because readings vary widely between patients. Rows with missing values in these variables are therefore dropped to maintain the integrity of the analysis.
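
To illustrate why the median is preferred for skewed variables, the sketch below imputes a single missing value in a small, made-up weight column with both strategies. The numbers are invented; only `SimpleImputer` and its `strategy` parameter correspond to what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up right-skewed sample: one large value pulls the mean upward
toy = pd.DataFrame({"weight": [60.0, 62.0, 65.0, 70.0, 150.0, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[["weight"]])
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["weight"]])

# The last row held the missing value, so it now holds the imputed one
mean_value = float(mean_filled[-1, 0])
median_value = float(median_filled[-1, 0])

print(f"Mean imputation:   {mean_value:.1f}")
print(f"Median imputation: {median_value:.1f}")
```

The mean lands well above four of the five observed values, while the median stays close to the bulk of the data, which is the behaviour wanted for skewed columns such as BMI and weight.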

In [None]:
#Imputing missing values 'height' column
imputer_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
raw_data['height'] = imputer_mean.fit_transform(raw_data[['height']])

#Imputing missing values for 'age', 'bmi', and 'weight' columns
imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
raw_data['age'] = imputer_median.fit_transform(raw_data[['age']])
raw_data['bmi'] = imputer_median.fit_transform(raw_data[['bmi']])
raw_data['weight'] = imputer_median.fit_transform(raw_data[['weight']])

# Removing all the null values for other remaining variables
raw_data.dropna(inplace = True)

# Create a copy of the cleaned data (plain assignment would only add a second name for the same object)
data = raw_data.copy()
data.isnull().sum()

In [None]:
# Convert categorical variables to dummy variables
data = pd.get_dummies(data, columns=categorical_col, drop_first=True, dtype=int)

#Print updated data
data.head()

In [None]:
# Defining independent variables and dependent variable
y = data['hospital_death'].values
x = data.drop(['hospital_death'], axis=1).values
print(data.shape)

In [None]:
# Perform train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
# Feature Scaling
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

## 4. Logistic Regression Model

Logistic regression is a simple, interpretable, and widely used method for binary classification, which makes it a reasonable baseline for this problem. A basic logistic regression model is implemented below.

In [None]:
# Training Logistic Regression Model using x_train
classifier = LogisticRegression(random_state = 42)
classifier.fit(x_train, y_train)

In [None]:
# Predicting y value for x_test data using above model
y_pred = classifier.predict(x_test)

In [None]:
#Dataframe creation for actual and predicted values of dependent variable
predicted_data = { 'hospital_death': y_test,'predicted_hospital_death':y_pred}
predicted_data = pd.DataFrame(predicted_data)

In [None]:
# Making confusion matrix 
confusion_matrix_logistic_reg = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(4,3))
sns.heatmap(confusion_matrix_logistic_reg, annot=True, fmt='d', cmap='Blues')
plt.ylabel('True label')
plt.xlabel('Predicted label')
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred))
print('\nConfusion matrix:\n')


The model correctly classifies 93% of the test examples. However, its precision of 0.68 means that of 440 predicted hospital deaths, only 299 were correct. Similarly, its recall of 0.28 means that it identified only 299 of the 1066 actual hospital deaths. This gap between high accuracy and low recall stems from class imbalance: the model performs well on the majority class but struggles with the minority class.

Adjusting the probability threshold can help in such cases. Lowering the threshold makes the model more sensitive to hospital deaths, improving recall at the expense of precision. Lowering it too far, however, increases false positives and reduces precision further. Finding a suitable threshold therefore requires trying a range of values and evaluating their effect on precision, recall, and combined metrics such as the F1 score.
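
The precision and recall quoted above follow directly from the confusion-matrix counts. The short worked example below uses only the three counts reported in the text; the F1 value is derived here from those counts and is not taken from the notebook output.

```python
# Counts reported in the text
true_positives = 299       # predicted deaths that were actual deaths
predicted_positives = 440  # all predicted deaths
actual_positives = 1066    # all actual deaths

# Derived counts
false_positives = predicted_positives - true_positives
false_negatives = actual_positives - true_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```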

In [None]:
#Testing different threshold probabilities to calculate F-score

def label(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')

# predict probabilities for x_test
y_pred_prob = classifier.predict_proba(x_test)
# keep probabilities for the positive outcome only
probs = y_pred_prob[:, 1]
# define thresholds
thresholds = np.arange(0, 1, 0.01)
# evaluate each threshold
scores = [f1_score(y_test, label(probs, t)) for t in thresholds]
# get best threshold
ix = np.argmax(scores)
print(f'Threshold={thresholds[ix]:.3f}, F-Score= {scores[ix]:.5f}')

The results above show that the logistic regression model achieves its highest F-score at a threshold of 0.21. Accuracy, precision, and recall are calculated below for this threshold.
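
Because the threshold above is chosen on the same test set used to report the final metrics, the reported F-score may be somewhat optimistic. One possible variant, sketched here on synthetic data from `make_classification` rather than the patient dataset, selects the threshold on a separate validation split and evaluates it once on the test split. The split sizes, the validation set, and the names `X_val`, `best_threshold`, and `test_f1` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives), for illustration only
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)

# Hold out a test set, then carve a validation set out of the remainder
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Sweep thresholds on the validation set only
thresholds = np.arange(0.01, 1, 0.01)
val_probs = model.predict_proba(X_val)[:, 1]
val_scores = [f1_score(y_val, (val_probs >= t).astype(int), zero_division=0)
              for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(val_scores))])

# Evaluate the chosen threshold once on the untouched test set
test_probs = model.predict_proba(X_test)[:, 1]
test_f1 = f1_score(y_test, (test_probs >= best_threshold).astype(int),
                   zero_division=0)

print(f"Validation-selected threshold={best_threshold:.2f}, test F-score={test_f1:.3f}")
```

With this arrangement the test set plays no part in choosing the threshold, so the test F-score is a fairer estimate of performance on unseen patients.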

In [None]:
# Predicting y with threshold value 0.21
y_pred_new = label(probs, 0.21)

# Making confusion matrix
confusion_matrix_logistic_reg_new = confusion_matrix(y_test, y_pred_new)
fig, ax = plt.subplots(figsize=(4,3))
sns.heatmap(confusion_matrix_logistic_reg_new, annot=True, fmt='d', cmap='Blues')
plt.ylabel('True label')
plt.xlabel('Predicted label')
print('\nClassification Report:\n')
print(classification_report(y_test, y_pred_new))
print('\nConfusion matrix:\n')

## 5. Conclusion

In conclusion, this notebook develops a classification model for predicting in-hospital mortality among admitted patients. After data loading, exploratory data analysis, and preprocessing (dropping irrelevant columns and handling missing values), a logistic regression model was implemented. The model achieved an accuracy of 93%, but its precision (0.68) and especially its recall (0.28) were low because of class imbalance. Adjusting the probability threshold improved the balance between precision and recall, with the highest F-score obtained at a threshold of 0.21.