# CLABSI_Modeling_and_Prediction

Author: Santhosh Botcha, Academic-Business Analysis Project

# Reflection

**Strategic Data Analysis and Visualization:**

Our team's effort in conducting an extensive Exploratory Data Analysis (EDA) was crucial. It enabled us to understand the underlying patterns and relationships within the CLABSI dataset effectively. We identified several factors influencing infection risks, which were instrumental in guiding the development of predictive models.

The use of visuals was particularly effective in conveying complex statistical relationships and insights to both the team and potential stakeholders, enhancing the comprehensibility and engagement of our analysis.


**Data Preprocessing and Feature Engineering:**

We strategically selected features based on insights gained from the EDA, focusing on those most relevant for model optimization. This not only improved model accuracy but also streamlined the predictive process by reducing computational demands.

# Zoom-in Analysis

**Oversampling with imbalanced datasets:**
CLABSI is a relatively rare event, so the number of CLABSI cases (positive class) is much lower than the number of non-CLABSI cases (negative class). Oversampling the minority class makes the training data more balanced and can improve the model's sensitivity (true positive rate) for that class without collecting more data. The trade-off is a risk of overfitting, because the model may learn to recognize the repeated minority points too well instead of patterns that generalize. This technique is particularly useful for logistic regression and _____ algorithms, which can be sensitive to variations in the data.
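The idea behind random oversampling can be sketched in a few lines of plain Python. This is a minimal illustration on made-up toy data, not the `RandomOverSampler` implementation used later; the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows (sampling with replacement) until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data (illustrative only): 8 negatives and 2 positives
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Every added row is an exact copy of an existing positive row, which is why the overfitting risk mentioned above arises.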

**Hyperparameter Tuning:**

Datasets have unique characteristics, and no single model configuration is likely to be optimal across all datasets or problems. Our dataset includes a diverse range of features, from patient demographics to detailed clinical parameters, so the challenge is to build a model that accurately captures the important predictors of CLABSI without fitting to noise or peculiarities in the data. We opted for hyperparameter tuning to find the model settings that maximize performance on the chosen metrics, for example by adjusting the regularization strength in logistic regression to prevent overfitting given the potentially high dimensionality of clinical data. We used hyperparameter tuning along with SMOTE for the KNN model.
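As a rough sketch of what an exhaustive grid search does, the snippet below tries every combination in a parameter grid and keeps the best-scoring one. The scoring function `toy_score` is a hypothetical stand-in for a cross-validated model score; it is not derived from the CLABSI data.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(n_neighbors, weights):
    # Hypothetical score surface that peaks at n_neighbors=5 with distance weighting
    return -(n_neighbors - 5) ** 2 + (0.5 if weights == "distance" else 0.0)

param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

In practice the score would come from cross-validation on the training data, so the cost grows with the number of combinations multiplied by the number of folds.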

**Bootstrapping for Model Reliability:**

We used bootstrapping, which resamples the data with replacement, to improve the model's accuracy and to estimate the uncertainty of its estimates. Repeated resampling lets us assess how changes in the data affect our models, which gives insight into how stable the model predictions are across different samples and helps ensure the findings are not a result of random variation in the data. We used bootstrapping with the decision tree algorithm in this case.
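One common use of bootstrapping is to put an interval around a metric. The sketch below resamples toy predictions with replacement and reports a percentile interval for recall. The data and the helper name `bootstrap_recall_interval` are illustrative assumptions, not part of the notebook.

```python
import random

def bootstrap_recall_interval(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for recall on the positive class."""
    rng = random.Random(seed)
    n = len(y_true)
    recalls = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        if tp + fn:  # skip resamples that contain no positives
            recalls.append(tp / (tp + fn))
    recalls.sort()
    lower = recalls[int((alpha / 2) * len(recalls))]
    upper = recalls[int((1 - alpha / 2) * len(recalls)) - 1]
    return lower, upper

# Toy data: 20 positives (15 detected) and 80 negatives (all predicted negative)
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 15 + [0] * 5 + [0] * 80
lower, upper = bootstrap_recall_interval(y_true, y_pred)
print(f"recall = 0.75, 95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

A wide interval signals that the metric depends heavily on which samples happen to be in the data, which matters when the positive class is very small.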

# Model Development

We imported the final dataset (after cleansing and preprocessing) from Assignment 1.

In [33]:
import pandas as pd

In [37]:
clabsi_new = pd.read_csv('clabsi_new.csv')

**Developed Models**

• Logistic Regression

• KNN

• Decision Trees

• Neural Networks

• XGBoost

### **LOGISTIC REGRESSION**

Logistic regression is a fundamental statistical model used for binary classification tasks, where the target variable has two possible outcomes, typically labeled as 0 and 1. Despite its name, logistic regression is a classification algorithm rather than a regression algorithm. It's widely used in various fields such as healthcare, finance, and marketing for predicting outcomes like whether a customer will churn or not, whether a patient has a disease or not, etc.

The core idea behind logistic regression is to model the probability of the target variable belonging to a particular class based on one or more predictor variables. It assumes a linear relationship between the predictors and the log odds of the target variable, which is then transformed using the logistic function (sigmoid function) to obtain the predicted probabilities.
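The mapping from a linear score (the log odds) to a probability can be illustrated as follows. The coefficients and feature values are hypothetical and chosen only to show the arithmetic.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued log odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)

# Hypothetical coefficients for two predictors
weights = [0.8, -0.5]
bias = -3.0
p = predict_proba([2.0, 1.0], weights, bias)  # log odds = -3.0 + 1.6 - 0.5 = -1.9
print(round(p, 3))
```

Taking `log(p / (1 - p))` recovers the linear score, which is what "linear in the log odds" means.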

**Benefits**

• Simple and easy to understand

• Computationally efficient, especially for large datasets

• Provides probabilities for predictions, allowing for a more nuanced interpretation of the results

**Limitations**

• Assumes a linear relationship between the features and the log-odds of the target variable

• May not perform well with highly non-linear relationships in the data

• Prone to overfitting if the number of features is large relative to the number of observations

In [38]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Assuming 'clabsi_new' is your DataFrame and you want to one-hot encode all string columns
X_train, X_test, y_train, y_test = train_test_split(clabsi_new, clabsi['HasCLABSI'], test_size=0.2, random_state=42)

# Concatenate training and test data
combined_data = pd.concat([X_train, X_test], axis=0)

# One-hot encode
combined_data_encoded = pd.get_dummies(combined_data)

# Split back into training and test data
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Initialize the logistic regression model
log_reg = LogisticRegression()

# Fit the model on the training data
log_reg.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred = log_reg.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.9957865168539326
Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2836
        True       0.00      0.00      0.00        12

    accuracy                           1.00      2848
   macro avg       0.50      0.50      0.50      2848
weighted avg       0.99      1.00      0.99      2848

Confusion Matrix:
[[2836    0]
 [  12    0]]


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The warnings above (`UndefinedMetricWarning`) occur because precision and F1-score for the 'True' class are undefined when no samples are predicted as that class. The confusion matrix confirms this: the model predicted 'False' for every test sample, so all 12 CLABSI cases were missed despite the 99.6% accuracy. We therefore modified the code to oversample the minority class, as follows.

pip install -U imbalanced-learn

In [39]:
from imblearn.over_sampling import RandomOverSampler

from sklearn.metrics import roc_auc_score

# Assuming 'clabsi_new' is your DataFrame and you want to one-hot encode all string columns
X_train, X_test, y_train, y_test = train_test_split(clabsi_new, clabsi['HasCLABSI'], test_size=0.2, random_state=42)

# Concatenate training and test data
combined_data = pd.concat([X_train, X_test], axis=0)

# One-hot encode
combined_data_encoded = pd.get_dummies(combined_data)

# Split back into training and test data
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Apply oversampling
oversampler = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train_scaled, y_train)

# Initialize the logistic regression model
log_reg = LogisticRegression()

# Fit the model on the resampled training data
log_reg.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data
y_pred = log_reg.predict(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc}")


Accuracy: 0.9964887640449438
Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2836
        True       1.00      0.17      0.29        12

    accuracy                           1.00      2848
   macro avg       1.00      0.58      0.64      2848
weighted avg       1.00      1.00      1.00      2848

Confusion Matrix:
[[2836    0]
 [  10    2]]
AUC: 0.5833333333333334


This output is the result of evaluating our logistic regression model after applying oversampling to address the class imbalance. Here's a breakdown of the output:

1. **Accuracy:** 0.9965 means that the model correctly predicted the outcome for approximately 99.65% of the test samples. Because only 12 of the 2848 test samples are positive, this figure is dominated by the majority class and says little about how well CLABSI cases are detected.

2. **Classification Report:** 
   - For the 'False' class (no CLABSI), precision, recall, and F1-score all round to 1.00. Every actual 'False' instance was predicted correctly.
   - For the 'True' class (CLABSI present), precision is 1.00, recall is 0.17, and F1-score is 0.29. Both 'True' predictions were correct, which gives perfect precision, but the model found only 2 of the 12 actual CLABSI cases. The low recall pulls the F1-score down and shows that the model struggles to identify the 'True' class.

3. **Confusion Matrix:** 
   - The confusion matrix shows that the model correctly predicted 2836 instances of the 'False' class (true negatives) and 2 instances of the 'True' class (true positives), with no false positives. However, it incorrectly predicted 10 instances of the 'True' class as 'False' (false negatives).

Overall, while the model performs very well in predicting the majority class ('False'), it struggles with the minority class ('True'). This imbalance in performance between the two classes is reflected in the low recall and F1-score for the 'True' class. The AUC of 0.58 is only slightly above 0.5, the value at which a test is no better than chance at distinguishing between diseased and nondiseased individuals. Note that this AUC is computed from hard class predictions rather than predicted probabilities, so it equals the average of the true positive rate and the true negative rate.
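The figures in the report can be recomputed directly from the four cells of the confusion matrix. The helper below is a small sketch (the name `binary_metrics` is ours); it also shows that an AUC computed from hard 0/1 predictions is the mean of recall and specificity.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and balanced accuracy for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Equals roc_auc_score when it is given hard labels instead of scores
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Cells taken from the confusion matrix printed above: [[2836, 0], [10, 2]]
m = binary_metrics(tn=2836, fp=0, fn=10, tp=2)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

The zero-division guards mirror what happens when a model predicts no positives at all: precision has no defined value, so a default of 0 is reported.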

### **KNN MODEL**

The k-Nearest Neighbors (KNN) algorithm is a versatile and intuitive machine learning model used for classification and regression tasks. KNN is a non-parametric, lazy learning algorithm, meaning it doesn't make assumptions about the underlying data distribution and defers processing until a prediction is needed. This makes KNN straightforward to implement and understand, making it a popular choice for beginners and as a baseline model for comparison in more complex tasks.

In KNN, the prediction for a new data point is determined by the majority class (in classification) or the mean value (in regression) of its k nearest neighbors in the feature space. The "k" in KNN represents the number of neighbors to consider, and it's a hyperparameter that needs to be tuned based on the dataset and problem at hand. Generally, odd values of k are chosen to avoid ties in classification.
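The majority-vote rule can be written out in a few lines. This is a bare-bones sketch using Euclidean distance on made-up points, not the scikit-learn implementation used below.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [False, False, False, True, True, True]

print(knn_classify(points, labels, (0.5, 0.5), k=3))
print(knn_classify(points, labels, (5.5, 5.5), k=3))
```

Because every prediction sorts all training points by distance, the cost of prediction grows with the size of the training set, and features on larger scales dominate the distance unless the data is standardized first.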

**Benefits**

• Simple and easy to implement

• No training phase, which makes the training process very fast

• Versatile, as it can be used for classification and regression tasks.


**Limitations**

• Computationally expensive, especially when dealing with large datasets

• Sensitivity to irrelevant features and the choice of distance metric

• Requires a meaningful distance metric for the data

In [40]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Fit the model on the resampled training data
knn.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data
y_pred_knn = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn}")

# Print classification report for KNN
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Print confusion matrix for KNN
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

auc = roc_auc_score(y_test, y_pred_knn)
print(f"AUC: {auc}")

KNN Accuracy: 0.9957865168539326
KNN Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2836
        True       0.00      0.00      0.00        12

    accuracy                           1.00      2848
   macro avg       0.50      0.50      0.50      2848
weighted avg       0.99      1.00      0.99      2848

KNN Confusion Matrix:
[[2836    0]
 [  12    0]]
AUC: 0.5


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The `UndefinedMetricWarning` messages indicate that precision and F1-score are undefined for the 'True' class because there were no predicted samples for that class, which is why they are set to 0.0. This suggests that the KNN model struggles to correctly identify instances of the 'True' class, possibly due to the imbalance in class distribution or the nature of the data.
We then tried to compensate for the imbalance within KNN itself. `KNeighborsClassifier` has no `class_weight` parameter, so the code below uses `weights='distance'` instead, which gives closer neighbors more influence on the vote.

In [41]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with distance-weighted voting
# (KNeighborsClassifier has no class_weight parameter)
knn = KNeighborsClassifier(weights='distance')

# Fit the model on the resampled training data
knn.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test data
y_pred_knn = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy with balanced class weights: {accuracy_knn}")

# Print classification report for KNN
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Print confusion matrix for KNN
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

auc = roc_auc_score(y_test, y_pred_knn)
print(f"AUC: {auc}")

KNN Accuracy with balanced class weights: 0.9957865168539326
KNN Classification Report:
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      2836
        True       0.00      0.00      0.00        12

    accuracy                           1.00      2848
   macro avg       0.50      0.50      0.50      2848
weighted avg       0.99      1.00      0.99      2848

KNN Confusion Matrix:
[[2836    0]
 [  12    0]]
AUC: 0.5


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Overall, distance weighting did not improve the model's ability to predict the minority class: the model still predicted no positive cases. We needed to try other approaches to address the imbalance issue and improve the model's performance on the minority class.
So we tried another method to improve the model's performance: the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples for the minority class. SMOTE creates each synthetic sample by interpolating between an existing minority sample and one of its nearest minority-class neighbors, thereby balancing the class distribution.
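The core SMOTE step, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as below. This is a simplified illustration on made-up points. The function name `smote_point` is hypothetical, and the real `imblearn` implementation handles neighbor search and sample counts differently.

```python
import math
import random

def smote_point(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a random minority
    sample and one of its k nearest minority-class neighbors."""
    rng = rng or random.Random(0)
    i = rng.randrange(len(minority))
    base = minority[i]
    others = [p for j, p in enumerate(minority) if j != i]
    neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

# Toy minority-class samples with two features each
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
rng = random.Random(42)
synthetic = [smote_point(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Unlike random oversampling, the new points are not exact copies, but they all lie between existing minority samples, so they add no information from outside the region the minority class already occupies.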

In [42]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming 'clabsi_new' is your DataFrame and you want to one-hot encode all string columns
X_train, X_test, y_train, y_test = train_test_split(clabsi_new, clabsi['HasCLABSI'], test_size=0.2, random_state=42)

# Concatenate training and test data
combined_data = pd.concat([X_train, X_test], axis=0)

# One-hot encode
combined_data_encoded = pd.get_dummies(combined_data)

# Split back into training and test data
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Apply SMOTE to resample the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

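# Standardize the features again
# (note: X_train_resampled was already standardized above, while X_test_imputed is unscaled,
#  so after this step the training and test features are on different scales)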
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test_imputed)

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Fit the model on the resampled training data
knn.fit(X_train_scaled, y_train_resampled)

# Make predictions on the test data
y_pred_knn = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy with SMOTE: {accuracy_knn}")

# Print classification report for KNN
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Print confusion matrix for KNN
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

auc = roc_auc_score(y_test, y_pred_knn)
print(f"AUC: {auc}")


KNN Accuracy with SMOTE: 0.1931179775280899
KNN Classification Report:
              precision    recall  f1-score   support

       False       1.00      0.19      0.32      2836
        True       0.01      1.00      0.01        12

    accuracy                           0.19      2848
   macro avg       0.50      0.59      0.16      2848
weighted avg       1.00      0.19      0.32      2848

KNN Confusion Matrix:
[[ 538 2298]
 [   0   12]]
AUC: 0.594851904090268


The output above indicates that our KNN model with SMOTE oversampling has an accuracy of approximately 19.31%. Recall for the minority class (True) is perfect, but precision and F1-score are very low, because the model labels most of the test set as positive.

Precision: The proportion of correctly identified positive cases out of all predicted positive cases is extremely low for the minority class (12 of 2310 predicted positives, under 1%), indicating that the model is incorrectly classifying many negative cases as positive.

Recall: The proportion of correctly identified positive cases out of all actual positive cases is 100% for the minority class (True), indicating that the model is able to correctly identify all positive cases but at the cost of incorrectly labeling many negative cases as positive.

F1-score: The harmonic mean of precision and recall for the minority class (True) is very low, indicating overall poor performance in correctly identifying positive cases while minimizing false positives.

The confusion matrix shows that the model correctly identified all 12 positive cases (True) but misclassified 2298 of the 2836 negative cases (False) as positive, leading to the low precision and high recall for the minority class.

This suggests that while the model is able to identify all positive cases, it does so at the cost of a high number of false positives. One likely contributor is visible in the code: the second `StandardScaler` is fitted on training data that was already standardized and is then applied to the unscaled `X_test_imputed`, so the model sees test features on a different scale from the one it was trained on. Improving the model's performance may also require further tuning of hyperparameters, feature selection, or trying different algorithms.

In [1]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
import pandas as pd

# Assuming 'clabsi_new' is your DataFrame and you want to one-hot encode all string columns
X_train, X_test, y_train, y_test = train_test_split(clabsi_new, clabsi['HasCLABSI'], test_size=0.2, random_state=42)

# Concatenate training and test data
combined_data = pd.concat([X_train, X_test], axis=0)

# One-hot encode
combined_data_encoded = pd.get_dummies(combined_data)

# Split back into training and test data
X_train_encoded = combined_data_encoded[:len(X_train)]
X_test_encoded = combined_data_encoded[len(X_train):]

# Impute missing values with the mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_test_imputed = imputer.transform(X_test_encoded)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

# Apply SMOTE to resample the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train_resampled, y_train_resampled)
X_test_selected = selector.transform(X_test_scaled)

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Define hyperparameters for tuning
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Perform grid search cross-validation for hyperparameter tuning
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train_resampled)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Fit the model on the best hyperparameters
best_knn = KNeighborsClassifier(**best_params)
best_knn.fit(X_train_selected, y_train_resampled)

# Make predictions on the test data
y_pred_knn = best_knn.predict(X_test_selected)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy with SMOTE: {accuracy_knn}")

# Print classification report for KNN
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Print confusion matrix for KNN
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))


NameError: name 'clabsi_new' is not defined

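The `NameError` shown above appears to come from rerunning this cell in a fresh session (note the `In [1]` prompt) before `clabsi_new` was loaded. When the cell ran with the data loaded, it produced a warning indicating that some features in our dataset are constant, which means they have the same value across all samples. Constant features carry no information for separating the classes, and they can cause issues with certain machine learning methods, which cannot learn from them.

To address this warning, we removed the constant features before training our model. Here's how we modified our code to do this: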

In [44]:
from sklearn.feature_selection import VarianceThreshold

# Identify constant features
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X_train_imputed)

# Get non-constant feature indices
non_constant_indices = constant_filter.get_support(indices=True)

# Filter the training and test data to include only non-constant features
X_train_non_constant = X_train_imputed[:, non_constant_indices]
X_test_non_constant = X_test_imputed[:, non_constant_indices]

# Apply SMOTE to resample the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_non_constant, y_train)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test_non_constant)

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Fit the model on the resampled training data
knn.fit(X_train_scaled, y_train_resampled)

# Make predictions on the test data
y_pred_knn = knn.predict(X_test_scaled)

# Calculate accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy with SMOTE: {accuracy_knn}")

# Print classification report for KNN
print("KNN Classification Report:")
print(classification_report(y_test, y_pred_knn))

# Print confusion matrix for KNN
print("KNN Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_knn))

# y_pred_knn holds hard class labels (not predicted probabilities), so this AUC
# equals the average of the true positive rate and the true negative rate
auc = roc_auc_score(y_test, y_pred_knn)
print(f"AUC: {auc}")


KNN Accuracy with SMOTE: 0.9554073033707865
KNN Classification Report:
              precision    recall  f1-score   support

       False       1.00      0.96      0.98      2836
        True       0.08      0.92      0.15        12

    accuracy                           0.96      2848
   macro avg       0.54      0.94      0.56      2848
weighted avg       1.00      0.96      0.97      2848

KNN Confusion Matrix:
[[2710  126]
 [   1   11]]
AUC: 0.9361189468735307


The output we received is from evaluating the KNN model after performing SMOTE resampling and removing constant features. Here's an explanation of the key metrics:

1. **Accuracy**: Our model achieved an accuracy of approximately 95.54%. This means that it correctly classified about 95.54% of the samples in the test set.

2. **Precision**: Precision is the ratio of true positive predictions to the total number of positive predictions made by the model. For the positive class (True), the precision is quite low at approximately 8%. This indicates that when the model predicts a positive outcome, it is correct only about 8% of the time.

3. **Recall (Sensitivity)**: Recall is the ratio of true positive predictions to the total number of actual positive instances. The model has a high recall of approximately 92% for the positive class, indicating that it correctly identifies about 92% of the actual positive instances.

4. **F1-Score**: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. For the positive class, the F1-score is approximately 0.15, which is low. This suggests that there is room for improvement in balancing precision and recall for the positive class.

5. **Support**: Support is the number of actual occurrences of the class in the specified dataset. For the positive class, there are only 12 instances in the test set.

6. **Confusion Matrix**: The confusion matrix provides a summary of the predictions made by the model. It shows that the model correctly predicted 2710 instances of the negative class (False) and 11 instances of the positive class (True). However, it incorrectly classified 126 instances of the negative class as positive (false positives) and missed 1 instance of the positive class (a false negative).

Overall, the model shows high accuracy and high recall for the positive class but low precision. The model is good at identifying positive cases, but because positives are so rare, its 126 false alarms far outnumber its 11 true positives, even though only about 4% of negative cases are misclassified.
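The AUC values reported in this notebook are computed from the output of `predict`, which returns hard class labels. AUC is normally computed from scores such as the positive-class column of `predict_proba`, and the two can differ. The sketch below uses a small pairwise definition of AUC on made-up scores to show the difference. The helper `auc_from_scores` is ours, not a library function.

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive is scored above a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: the scores rank both positives above every negative
y_true = [0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholding loses the ranking

auc_probs = auc_from_scores(y_true, probs)
auc_hard = auc_from_scores(y_true, hard)
print(auc_probs, auc_hard)
```

With hard labels the result collapses to the mean of the true positive rate and the true negative rate at one threshold, so it cannot reward a model whose probabilities rank cases well.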

### **DECISION TREES**

Decision trees are versatile and powerful machine learning models used for both classification and regression tasks. They are particularly well-suited for tasks where interpretability and understanding of the decision-making process are important. The fundamental concept behind decision trees is to split the data based on feature values into subsets that are as homogenous as possible with respect to the target variable.

In a decision tree, each internal node represents a decision based on a feature, and each branch represents the outcome of that decision leading to a leaf node, which corresponds to a predicted class (in classification) or a predicted value (in regression). The decision-making process starts at the root node and follows the branches based on feature values until a leaf node is reached, providing a straightforward and interpretable path to the prediction.
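How a single split is chosen can be sketched with Gini impurity, one common homogeneity measure. The feature name `line_days` and its values are hypothetical. For simplicity this sketch tests thresholds at observed values, whereas library implementations typically use midpoints between them.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1 - p) ** 2

def best_threshold(values, labels):
    """Find the threshold t that minimizes the weighted impurity of the two
    subsets produced by the rule `value <= t`."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical feature and outcome
line_days = [1, 2, 3, 8, 9, 10]
infected = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(line_days, infected)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node and recurses on the resulting subsets until a stopping rule is met.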

**Benefits**

• Interpretability: Decision trees are easy to interpret and visualize. You can easily understand how decisions are made at each node of the tree, making them particularly useful for explaining the logic behind predictions to stakeholders or non-technical users.

• Handling Non-Linearity: Decision trees can handle non-linear relationships between features and the target variable. They can capture complex decision boundaries without requiring feature engineering to transform data into a linear form.

• Handles Both Numerical and Categorical Data: In principle, decision trees can split on both numerical and categorical features. Note that scikit-learn's `DecisionTreeClassifier`, used below, expects numeric input, so categorical variables still need to be encoded (or dropped, as in our code).

• Feature Importance: Decision trees provide a measure of feature importance, indicating which features are most influential in making decisions within the model. This can help in feature selection and understanding the impact of different variables on predictions.

• Robust to Outliers: Decision trees are robust to outliers and can handle data with mixed types of distributions without significantly affecting their performance.

**Limitations**

• Overfitting: Decision trees are prone to overfitting, especially with deep trees and complex datasets. They may learn intricate details of the training data, including noise, which can lead to poor generalization on unseen data.

• High Variance: Decision trees have high variance, meaning small changes in the training data can lead to different tree structures and potentially different predictions. This can make them unstable compared to other models like ensemble methods.

• Limited Expressiveness: Each split tests a single feature against a threshold, so decision boundaries are axis-aligned. A tree may need many splits to approximate smooth, linear, or additive relationships that other models capture directly.

• Bias Toward Dominant Classes: In classification tasks with imbalanced classes, decision trees may have a bias toward predicting the majority class, especially when not properly balanced or weighted.

In [57]:
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.utils import resample

# Keep only the integer and float columns
def filter_numeric_columns(X):
    return X.select_dtypes(include=[np.number])

# Build the feature matrix and target from the preprocessed dataset

X = filter_numeric_columns(clabsi_new.copy())

y = clabsi_new['HasCLABSI'].copy()

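# Bootstrapping for class balancing: resample the positive class with replacement
# (note: this runs before the train/test split, so copies of the same positive rows can land in both sets)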
X_resampled, y_resampled = resample(X[y == 1], y[y == 1], n_samples=X[y == 0].shape[0], random_state=42)
X = np.vstack((X, X_resampled))
y = np.hstack((y, y_resampled))

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fill nulls with median
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

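# Define the decision tree base estimator
tree = DecisionTreeClassifier()
# Initialize a BaggingClassifier (the argument is named `estimator` in scikit-learn 1.2 and later)
bagging_clf = BaggingClassifier(base_estimator = tree,random_state=42)

# Estimate performance with 5-fold cross-validation
cv_scores = cross_val_score(bagging_clf, X_train, y_train, cv=5)

# Record the highest fold score
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
# Fit the final ensemble on the full training set
# (BaggingClassifier uses a decision tree as its default base estimator)
best_model = BaggingClassifier(random_state=42)
best_model.fit(X_train, y_train)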

# Predict on the test set using the best model
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Best cross-validation score:", best_score)
print("Best model:", best_model)
# Calculate precision
precision = precision_score(y_test, y_pred)
# Calculate recall
recall = recall_score(y_test, y_pred)
# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
# Calculate AUC
auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc}")

from sklearn.metrics import confusion_matrix

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.9998240675580577
Best cross-validation score: 1.0
Best model: BaggingClassifier(random_state=42)
Precision: 0.9996465182043125
Recall: 1.0
F1 Score: 0.9998232278592893
AUC: 0.9998249299719888
Confusion Matrix:
[[2855    1]
 [   0 2828]]


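**Accuracy**: The model achieved an accuracy of approximately 99.98%, misclassifying only 1 of the 5,684 samples in the test set. Because the minority class was oversampled before the train/test split, copies of the same positive rows appear in both sets, so this figure likely overstates performance on new patients.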

**Precision**: Precision, which measures the accuracy of positive predictions, is remarkably high at approximately 99.96%. This indicates that when the model predicts a positive outcome, it is correct nearly 100% of the time.

**Recall (Sensitivity)**: The model exhibits perfect recall of 100% for the positive class, suggesting that it correctly identifies all actual positive instances without any false negatives.

**F1-Score**: The F1-score, representing the balance between precision and recall, is nearly 99.98% for the positive class. This high value indicates an excellent balance between precision and recall.

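**AUC (Area Under the Curve)**: The AUC score, which measures the model's ability to distinguish between positive and negative classes, is approximately 99.98%. It was computed from hard 0/1 predictions rather than predicted probabilities, so it reflects a single operating point (equivalent to balanced accuracy) rather than the full ROC curve.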
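One way to check how much of these scores comes from duplicated rows is to split first and oversample only the training portion. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for `clabsi_new`; the data, sizes, and variable names are assumptions for illustration, not the project's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (roughly 2% positives); not the CLABSI dataset
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98], random_state=42)

# 1. Split first, so the test set holds only original, untouched rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Oversample the minority class inside the training set only
pos = y_train == 1
n_extra = int((~pos).sum() - pos.sum())
X_extra, y_extra = resample(X_train[pos], y_train[pos], n_samples=n_extra, random_state=42)
X_bal = np.vstack((X_train, X_extra))
y_bal = np.hstack((y_train, y_extra))

# 3. Fit on the balanced training data, score on the untouched test set
clf = BaggingClassifier(random_state=42).fit(X_bal, y_bal)
proba = clf.predict_proba(X_test)[:, 1]  # probabilities give a full ROC curve
recall = recall_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, proba)
print("Recall:", recall)
print("AUC:", auc)
```

Scores obtained this way are usually lower than those from resampling before the split, but they are a fairer estimate of how the model would behave on unseen patients.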

### **NEURAL NETWORKS**

Neural networks (NNs) are sophisticated machine learning models inspired by the structure and functioning of the human brain's neural networks. They are highly versatile and widely used across various domains, including computer vision, natural language processing, and time series analysis, among others. NNs are particularly renowned for their ability to learn complex patterns and relationships in data, making them a cornerstone of deep learning.

At their core, neural networks consist of interconnected layers of artificial neurons, each performing computations on input data and passing the results to the next layer. The neurons are organized into an input layer, one or more hidden layers, and an output layer, allowing the network to learn hierarchical representations of the data. The connections between neurons are associated with weights that are adjusted during training to minimize the error between predicted and actual outputs.
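The layer-by-layer computation described above can be sketched in plain NumPy. The example below is a minimal forward pass through one hidden layer with ReLU and a sigmoid output; the layer sizes, random weights, and labels are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 samples, 8 input features, 20 hidden units, 1 output
X = rng.normal(size=(4, 8))
W1, b1 = rng.normal(scale=0.1, size=(8, 20)), np.zeros(20)
W2, b2 = rng.normal(scale=0.1, size=(20, 1)), np.zeros(1)

hidden = relu(X @ W1 + b1)         # hidden layer: weighted sum, then non-linearity
y_hat = sigmoid(hidden @ W2 + b2)  # output layer: probability of the positive class

# Binary cross-entropy against made-up labels; training adjusts W1 and W2 to lower it
y_true = np.array([[0.0], [1.0], [0.0], [1.0]])
loss = -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))
print(y_hat.shape, float(loss))
```

Training repeats this forward pass, measures the loss, and nudges the weights in the direction that reduces it.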

**Benefits**

• Complex Pattern Recognition: Neural networks excel at learning complex patterns and relationships in data, making them suitable for tasks such as image recognition, natural language processing, and speech recognition.

• Feature Learning: NNs can automatically learn relevant features from raw data, reducing the need for manual feature engineering and making them effective for tasks with high-dimensional data.

• Non-Linearity: Neural networks can model non-linear relationships between input features and the target variable, allowing them to capture intricate structures in the data that linear models may miss.

• Scalability: NNs can scale well to large datasets and computational resources, especially when using deep learning architectures like convolutional neural networks (CNNs) or recurrent neural networks (RNNs).

• Parallel Processing: Neural networks can leverage parallel processing capabilities of modern hardware, such as GPUs and TPUs, to accelerate training and inference tasks.

**Limitations**

• Complexity: Neural networks are complex models with many parameters and hyperparameters to tune. This complexity can make them challenging to interpret and debug, requiring specialized knowledge and expertise.

• Large Data Requirements: NNs typically require large amounts of data for training to generalize well and avoid overfitting. Insufficient data can lead to poor performance and generalization on unseen data.

• Computational Resources: Training deep neural networks can be computationally intensive, requiring significant computational resources, memory, and time, especially for deep architectures and large datasets.

• Overfitting: Neural networks are susceptible to overfitting, especially with limited data or overly complex architectures. Regularization techniques and careful hyperparameter tuning are necessary to mitigate overfitting.

• Black Box Nature: Despite their impressive performance, NNs are often seen as "black box" models, meaning it can be challenging to understand and interpret how they arrive at their predictions, particularly for deep architectures with many layers.

• Hyperparameter Sensitivity: Neural networks are sensitive to hyperparameter choices, such as learning rate, batch size, and network architecture. Finding optimal hyperparameters can be time-consuming and require extensive experimentation.

In [62]:
import numpy as np
import tensorflow
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
from sklearn import metrics
from sklearn.metrics import recall_score, roc_curve, precision_score, f1_score, roc_auc_score

# Class weights: errors on the positive class (1) count 250 times more than errors on the negative class (0)
weights_assigned = {0: 1, 1: 250}
# Create a sequential model
model = Sequential()

# Add the first dense layer with 20 units, ReLU activation, and he_uniform kernel initializer
model.add(Dense(20, input_dim=X_train.shape[1], activation='relu', kernel_initializer='he_uniform'))

# Add the output layer with 1 unit and sigmoid activation
model.add(Dense(1, activation='sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics=['accuracy'])
model.fit(X_train, y_train, class_weight = weights_assigned, epochs=15)

y_pred = model.predict(X_test)
y_pred = np.round(y_pred)

precision = precision_score(y_test, y_pred)
print('Precision score:', precision)

roc_auc = roc_auc_score(y_test, y_pred)
print('ROC_AUC score:', roc_auc)

recall = recall_score(y_test, y_pred)
print('Recall:', recall)

auc = roc_auc_score(y_test, y_pred)
print(f"AUC: {auc}")

from sklearn.metrics import confusion_matrix

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

Epoch 1/15 through Epoch 15/15 (per-epoch progress output not captured)
Precision score: 0.9698216735253772
ROC_AUC score: 0.9845938375350141
Recall: 1.0
AUC: 0.9845938375350141
Confusion Matrix:
[[2768   88]
 [   0 2828]]


**Precision**: Our model's precision is approximately 96.98%: when the model predicts CLABSI, it is correct roughly 97% of the time, which matters for limiting false positives in a clinical setting.

**AUC (Area Under the Curve)**: The AUC measures the area underneath the ROC curve from (0,0) to (1,1). The AUC score of 98.46% indicates that the model effectively differentiates between positive (having CLABSI) and negative (not having CLABSI) cases. Because it was computed from rounded 0/1 predictions, it describes a single threshold rather than the whole curve.

**Recall (Sensitivity)**: The recall is 100% for our model, denoting that it successfully identifies all patients who are actual cases of CLABSI. This high sensitivity is essential in medical diagnostics to avoid missing any true cases that require intervention.

**F1-Score**: The F1-score was not printed, but from the precision and recall above it works out to approximately 98.47%, reflecting the model's balanced performance. This balance is important for maintaining a model that neither overlooks true cases nor raises unnecessary alarms.

**Support**: In the confusion matrix above, the support is 2856 instances for the negative class (no CLABSI) and 2828 instances for the positive class (has CLABSI).

**Confusion Matrix**: The confusion matrix summarizes the performance of the model. It shows that the model correctly predicted 2768 non-CLABSI cases and 2828 CLABSI cases, with no false negatives (FN), which is pivotal for patient safety. However, there are 88 instances where the model predicted CLABSI where there was none (false positives), indicating room to reduce unnecessary treatments.

**Considerations**: Our model performs exceptionally well in terms of sensitivity, which is critical for CLABSI prediction, as missing a true case could be detrimental. However, the false positives suggest the model is tuned toward sensitivity: the class weight of 250 on the positive class, applied to training data that had already been oversampled, likely pushes it to flag borderline cases as CLABSI.

Overall, the model displays excellent sensitivity and high precision in identifying CLABSI cases. The remaining false positives mean some patients would be flagged unnecessarily, which could lead to over-treatment, so there is room to refine the model to reduce them.
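As a cross-check, the reported metrics can be recomputed by hand from the confusion matrix printed in the cell output above. The short sketch below does only that arithmetic; it also shows why the printed "AUC" equals the balanced accuracy when `roc_auc_score` is given hard 0/1 labels.

```python
import numpy as np

# Confusion matrix as printed above: rows = actual class, columns = predicted class
conf_matrix = np.array([[2768, 88],
                        [0, 2828]])
tn, fp, fn, tp = conf_matrix.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # sensitivity / true positive rate
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"Precision:         {precision:.4f}")
print(f"Recall:            {recall:.4f}")
print(f"F1 score:          {f1:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

To obtain an AUC that describes the whole ROC curve, the unrounded output of `model.predict` would need to be passed to `roc_auc_score` instead of the rounded labels.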


### **XGBOOST MODEL**

XGBoost (Extreme Gradient Boosting) is a powerful ensemble learning technique that has gained significant popularity in the machine learning community due to its high predictive accuracy and efficiency. It belongs to the family of gradient boosting algorithms, which are designed to combine the predictions of multiple weak learners (often decision trees) to create a strong predictive model. XGBoost specifically focuses on optimizing the gradient boosting process for improved performance and scalability.

XGBoost builds a series of decision trees sequentially, with each new tree learning from the errors (residuals) of the previous trees. It uses a gradient descent optimization technique to minimize a loss function, such as mean squared error for regression or log loss for classification, during the training process. This iterative approach allows XGBoost to continuously improve and refine its predictions, leading to higher accuracy compared to individual decision trees.

XGBoost provides several hyperparameters that can be tuned to optimize model performance and prevent overfitting. Common hyperparameters include the learning rate, tree depth, number of trees (boosting rounds), regularization parameters (such as lambda and alpha), and subsampling parameters.
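The sequential "fit the next tree to the current errors" idea can be sketched with plain gradient boosting for regression. This is a simplified illustration on synthetic data using scikit-learn trees; it omits the second-order terms, regularization, and subsampling that XGBoost adds, and all values are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

prediction = np.full_like(y, y.mean())  # start from a constant model
trees, losses = [], []
for _ in range(n_rounds):
    residuals = y - prediction          # negative gradient of (1/2) * squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small step toward the residuals
    trees.append(tree)
    losses.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE after round 1: {losses[0]:.4f}, after round {n_rounds}: {losses[-1]:.4f}")
```

The learning rate and tree depth here play the same role as the `learning_rate` and `max_depth` hyperparameters mentioned above: smaller steps and shallower trees learn more slowly but are less prone to overfitting.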

**Benefits**

• High Performance: XGBoost is known for its speed and performance compared to traditional machine learning algorithms.

• Scalability: Handles large datasets efficiently due to parallel processing capabilities.

• Regularization: Built-in regularization techniques prevent overfitting and improve generalization.

• Feature Importance: Provides insights into feature importance, aiding in feature selection and understanding model behavior.

• Flexibility: Supports various objectives and evaluation criteria, making it adaptable to different tasks.

**Limitations**

• Complexity: Requires understanding of hyperparameters and tuning for optimal performance.

• Computational Resources: Can be resource-intensive, especially for large datasets and complex models.

• Black Box Nature: Interpretability may be challenging due to the ensemble nature of boosting and complex decision trees.

• Data Preparation: Categorical features generally need to be encoded before training. XGBoost can handle missing numeric values natively, but other steps in the pipeline, such as SMOTE, cannot, so cleaning and preprocessing are still required.

In [54]:
#Importing Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

#Separating the features (X) and the target variable (y)

X = clabsi_new.drop('HasCLABSI', axis=1)
y = clabsi_new['HasCLABSI']

#Identifying and encoding categorical columns using OneHotEncoder within a preprocessing pipeline

categorical_cols = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

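#Note: ColumnTransformer defaults to remainder='drop', so only the encoded
#categorical columns reach the model; numeric columns are discarded here
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols)
    ])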

#Splitting the data into training and testing sets, with a test size of 20%

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#Fitting the Preprocessor to Training Data 

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

#Using Synthetic Minority Over-sampling Technique (SMOTE) to handle class imbalance by oversampling the minority class

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_processed, y_train)

#Creating and training an XGBoost classifier with fixed hyperparameters; scale_pos_weight adds extra weight to the positive class on top of SMOTE

best_xgb_model = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42, scale_pos_weight=10)
best_xgb_model.fit(X_resampled, y_resampled)

#Making predictions on the test set and evaluating the model's performance using accuracy, confusion matrix, classification report, and Area Under the Curve (AUC) score

threshold = 0.3  # Adjust this threshold based on your preference
y_pred_proba = best_xgb_model.predict_proba(X_test_processed)[:, 1]
y_pred = (y_pred_proba > threshold).astype(int)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

print('Classification Report:')
print(classification_report(y_test, y_pred))

auc_score = roc_auc_score(y_test, y_pred_proba)
print(f'AUC: {auc_score:.2f}')



Accuracy: 0.9729634831460674
Confusion Matrix:
[[2770   68]
 [   9    1]]
Classification Report:
              precision    recall  f1-score   support

       False       1.00      0.98      0.99      2838
        True       0.01      0.10      0.03        10

    accuracy                           0.97      2848
   macro avg       0.51      0.54      0.51      2848
weighted avg       0.99      0.97      0.98      2848

AUC: 0.66


**Accuracy**: 97%, which means the model classified 97% of the test samples correctly. With only 10 positive cases among 2,848 samples, this figure is dominated by the negative class.

**Confusion Matrix:** The model correctly predicted 2770 instances of the negative class and 1 instance of the positive class. However, it misclassified 68 negative instances as positive and missed 9 of the 10 positive instances.

**Classification Report:**

Precision: How many of the predicted positives were actually positive. This is low at 1.4% (1 of 69 predicted positives).

Recall: How many of the actual positives were identified by the model. This is 10%, which means the model captured only 1 of the 10 actual positive cases.

F1-Score: The harmonic mean of precision and recall, which is 0.03 in this case.

Support: The number of actual occurrences of the class in the dataset. There are 10 positive instances.

**Overall**: The model shows high accuracy but very low precision and recall for the positive class. The high accuracy comes almost entirely from the majority class, and the model is not good at identifying positive cases, which the AUC of 0.66 also reflects.
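A quick calculation shows why accuracy is a weak yardstick here. Using only the counts from the output above, the sketch compares the XGBoost result with a trivial baseline that always predicts "no CLABSI".

```python
# Class counts from the support column of the classification report above
n_negative, n_positive = 2838, 10

# A trivial baseline that predicts "no CLABSI" for every patient
baseline_accuracy = n_negative / (n_negative + n_positive)

# XGBoost confusion matrix from the output above: [[2770, 68], [9, 1]]
tn, fp, fn, tp = 2770, 68, 9, 1
model_accuracy = (tn + tp) / (tn + fp + fn + tp)
model_precision = tp / (tp + fp)
model_recall = tp / (tp + fn)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Model accuracy:    {model_accuracy:.4f}")
print(f"Model precision:   {model_precision:.4f}")
print(f"Model recall:      {model_recall:.4f}")
```

The baseline reaches a higher accuracy than the model while detecting no cases at all, which is why recall, precision, and AUC are the more informative metrics for this dataset.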

# Evaluation Plan

•	Evaluation of the overfitting issue

•	Evaluation of the imbalanced outcome issue

•	Evaluation of the classification error

•	Evaluation of the models

### **Evaluation of the overfitting issue**

In our research on machine learning models, achieving robust generalizability is a primary focus. Overfitting, where a model becomes overly reliant on the intricacies of the training data, hinders this goal. It occurs when the model learns noise or erratic fluctuations within the training data, leading to poor performance on unseen data. Therefore, effectively evaluating and mitigating overfitting is essential for successful model development.

We utilized hyperparameter tuning as our main approach to address overfitting. By adjusting these parameters, we observed improvements in several model performance metrics, including a reduction in overfitting and better generalization. We assessed these improvements through established metrics such as accuracy, precision, recall, and F1-score, comparing the model's predictions on the training data with those on unseen data.

Visualization of model performance before and after the hyperparameter tuning process played a crucial role in determining the impact of these adjustments. Techniques such as ROC curves and confusion matrices visually illustrated the model's discriminatory ability and error patterns. These visualizations gave us valuable insights into how the tuning efforts have influenced the model's predictive ability and robustness.

In conclusion, by rigorously evaluating and addressing overfitting through strategic hyperparameter tuning, we have created more reliable and effective machine learning models. This facilitates the creation of more accurate predictions and fosters informed decision-making processes.
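A common way to quantify overfitting is the gap between training and test scores, before and after tuning. The sketch below illustrates this on synthetic data with a decision tree and `GridSearchCV`; the dataset, parameter grid, and variable names are assumptions for illustration and are not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so memorizing the training set does not generalize
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Untuned tree: grows until it fits the training set perfectly
untuned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
gap_untuned = untuned.score(X_train, y_train) - untuned.score(X_test, y_test)

# Tuned tree: depth and leaf size chosen by 5-fold cross-validation on the training set
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 10, 50]},
    cv=5,
)
grid.fit(X_train, y_train)
gap_tuned = grid.score(X_train, y_train) - grid.score(X_test, y_test)

print("Best parameters:", grid.best_params_)
print(f"Train/test gap, untuned: {gap_untuned:.3f}")
print(f"Train/test gap, tuned:   {gap_tuned:.3f}")
```

A large gap signals that the model has memorized the training data; a smaller gap after tuning suggests better generalization, provided the test set was never used during tuning.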


### **Evaluation of the imbalanced outcome issue**

A significant issue arises when datasets exhibit class imbalance, where one class is significantly underrepresented compared to another. In medical diagnosis, for instance, positive cases of a rare disease may be significantly outnumbered by negative cases representing healthy individuals. This imbalance may generate models that are biased towards the majority class, potentially overlooking patterns associated with the minority class.

The techniques we used to address class imbalance are oversampling and undersampling. Oversampling balances the class distribution by increasing the number of instances in the minority class. This was achieved through techniques such as SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic data points to boost the minority class. Undersampling, on the other hand, reduces the size of the majority class by randomly removing data points until it matches the number of instances in the minority class.

Both oversampling and undersampling techniques significantly enhanced model training. By balancing the class distribution, these methods alleviated bias towards the majority class, leading to improved performance metrics such as accuracy, precision, and recall. Furthermore, they enhanced model generalizability to the minority class, reducing the likelihood of overfitting to the majority class data.

However, it is essential to acknowledge the potential consequences associated with the resampling techniques. Oversampling can lead to artificial biases or overfitting issues due to the generation of synthetic data points that are too similar to existing ones while undersampling disregards potentially valuable data from the majority class. Therefore, it is essential to analyze the trade-offs between model performance and data representativeness. The optimal approach will ultimately depend on the specific situation and objectives of the machine learning task.
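The core idea behind SMOTE, and the reason its synthetic points can end up very close to existing ones, is linear interpolation between a minority point and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea, not the `imblearn` implementation; `smote_like` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)         # random minority points
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of their k neighbors
    lam = rng.random((n_new, 1))                           # position along the segment
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Toy minority class: 20 points with 3 features
X_min = np.random.default_rng(1).normal(size=(20, 3))
X_syn = smote_like(X_min, n_new=100)
print(X_syn.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new data never leaves the region the minority class already occupies, which is both the strength of the method and the source of the overfitting risk noted above.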


### **Evaluation of classification error**

When evaluating classification errors in medical outcomes, it's crucial to differentiate between False Positive (Type I error) and False Negative (Type II error) occurrences. Several factors can be taken into consideration when assessing and judging these classification errors, namely patient care, costs, and stakeholder perspectives.

One key consideration is the impact on patient outcomes. It is essential to understand which error type poses the greater risk to patient health and well-being. In addition, we should consider the cost implications associated with treatment. This involves comparing the costs of false positives, such as unnecessary treatments, with those of false negatives, such as delayed or missed treatments. Incorporating stakeholder perspectives, including input from medical professionals, patients, and healthcare administrators, provides a comprehensive view of the implications of classification errors in medical contexts.

In terms of which error is more significant or costly, False Negatives (Type II Error) often take precedence in medical settings, because missed diagnoses and delayed treatments tend to have more severe consequences. While False Positives (Type I Error) can lead to unnecessary interventions, they may be deemed less harmful than missing a critical diagnosis or delaying crucial treatments that could significantly impact patient outcomes. Hence, reducing False Negatives is usually the priority in medical decision-making and evaluation processes.
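One way to turn this priority into a concrete decision rule is to assign a cost to each error type and choose the classification threshold that minimizes total cost. The sketch below uses made-up scores and a hypothetical 10:1 cost ratio; real costs would have to come from clinicians and administrators.

```python
import numpy as np

def pick_threshold(y_true, proba, cost_fn=10.0, cost_fp=1.0):
    """Return the threshold (and its cost) with the lowest total misclassification cost."""
    thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = proba >= t
        fn = np.sum((y_true == 1) & ~pred)  # missed cases
        fp = np.sum((y_true == 0) & pred)   # false alarms
        costs.append(cost_fn * fn + cost_fp * fp)
    best = int(np.argmin(costs))
    return thresholds[best], costs[best]

# Toy scores: positives tend to score higher than negatives
rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(900, dtype=int), np.ones(100, dtype=int)]
proba = np.clip(np.r_[rng.normal(0.3, 0.15, 900), rng.normal(0.6, 0.15, 100)], 0, 1)

t_fn_heavy, _ = pick_threshold(y_true, proba, cost_fn=10.0, cost_fp=1.0)
t_equal, _ = pick_threshold(y_true, proba, cost_fn=1.0, cost_fp=1.0)
print(f"Threshold when a miss costs 10x a false alarm: {t_fn_heavy:.2f}")
print(f"Threshold when both errors cost the same:      {t_equal:.2f}")
```

Raising the cost of a false negative can only keep the chosen threshold the same or lower it, so more patients are flagged and fewer true cases are missed, at the price of more false alarms.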

# Conclusion

In this research project, I employed four distinct models, each yielding different results. KNN with SMOTE and neural networks demonstrated superior performance compared to the other models, whereas XGBoost and decision trees did not perform as well. I suspect that decision trees may have suffered from overfitting and data leakage issues. Despite my attempt to mitigate these issues by using SMOTE instead of bootstrapping, the results were not satisfactory. To address this, I propose requesting a holdout dataset from the client that was not used during model training or hyperparameter tuning. This would allow us to evaluate the model's performance on unseen data, providing a more realistic estimate of its generalization performance.