## Canadian Hospital Readmittance Challenge 

This notebook is part of the Machine Learning (AI-511) project. It was prepared by the following students:

1. Siddharth Kothari (IMT2021019)
2. Sankalp Kothari (IMT2021028)
3. M Srinivasan (IMT2021058)


In [276]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import xgboost as xgb
import optuna
from math import floor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,ConfusionMatrixDisplay
import re
pd.options.display.max_rows = 4000

#### Model fitting 

We tried several models on the data. Before fitting them, we carried out the following preprocessing steps:

1. We standard-scaled the numerical columns.
2. We one-hot encoded all categorical columns with 3 or more categories. Because some categories appear in only one of the two datasets, this produced some columns that exist only in the test data and others that exist only in the training data.
3. To handle this mismatch, we add every column that is present in the training data but missing from the test data to the test data, filled with zeros, and drop the columns in the test data that are not present in the training data. We then sort the columns by name so that the training and test data share the same column order.
4. We then perform a train-test split with a test size of 0.2 and fit the models.
5. We also tried adding polynomial features, but they reduced model performance, so we removed them.
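
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing code: the toy frames and the column names `age` and `admission_type` are made up, and `DataFrame.reindex` is used here as one way to add the missing columns as zeros and drop the extra ones in a single call.

```python
import pandas as pd

# Toy data (hypothetical column names): "C" appears only in train, "D" only in test.
train_raw = pd.DataFrame({"age": [30, 45, 60], "admission_type": ["A", "B", "C"]})
test_raw = pd.DataFrame({"age": [50, 70], "admission_type": ["A", "D"]})

# One-hot encode the categorical column in each frame separately.
train_enc = pd.get_dummies(train_raw, columns=["admission_type"], dtype=int)
test_enc = pd.get_dummies(test_raw, columns=["admission_type"], dtype=int)

# Sort the training columns by name, then force the test frame onto the same
# columns: columns missing from test are added as zeros, extra ones are dropped.
train_enc = train_enc.reindex(columns=sorted(train_enc.columns))
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

After this step both frames have identical columns in identical order, which is what a fitted model expects at prediction time.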

In [252]:
input_encoded = pd.read_csv("./processed_data.csv")
test_encoded = pd.read_csv("./test_data.csv")

In [None]:
labels = input_encoded.iloc[:, -1]
input_encoded = input_encoded.iloc[:, 1:-1]

In [None]:
X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.2, random_state=42)

We then tried the following ensemble models, using Optuna for hyperparameter tuning. The best validation accuracy we obtained for each model, measured on the held-out split, is shown below:

| Model | Validation accuracy |
| ----- | ------------------- |
| Random Forest | 0.7114682762492981 |
| XGBoost (Gradient Boosting) | 0.7120297585626053 |

Based on these scores, we chose XGBoost as the best model, although its margin over Random Forest is small (about 0.0006), and used it to make predictions on the test data.
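
A k-fold cross-validated accuracy for a candidate model can be computed with scikit-learn's `cross_val_score`, as in the sketch below. It is illustrative only: it uses synthetic three-class data from `make_classification` and an untuned Random Forest, not the hospital dataset or the tuned hyperparameters, so its score says nothing about the results in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the encoded features and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.mean())
```

Averaging over folds makes the comparison between two models less sensitive to how one particular split happens to fall.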

In [None]:
# def objective(trial):
#     criterion = trial.suggest_categorical("criterion", ["gini", "entropy"])
#     max_depth = trial.suggest_int("max_depth", 2, 32, log=True)
#     n_estimators = trial.suggest_int("n_estimators", 100,500)
#     random_state = trial.suggest_int("random_state",42,42)
#     rf = RandomForestClassifier(criterion =criterion,
#             max_depth=max_depth, 
#             n_estimators=n_estimators,
#             random_state=random_state
#         )
#     X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.2, random_state=42)
#     rf.fit(X_train,Y_train)
#     y_pred = rf.predict(X_test)
#     score = accuracy_score(y_pred, Y_test)
#     return score


# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=15)


# def objective2(trial):
#     # data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
#     X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.3, random_state=42)
#     regex = re.compile(r"\[|\]|<", re.IGNORECASE)
#     dict ={0:1.45,2:1,1:1.4}
#     X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
#     X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]

#     max_depth = trial.suggest_int("max_depth", 3, 10)
#     n_estimators = trial.suggest_int("n_estimators", 200,500)
#     learning_rate = trial.suggest_int("learning_rate",0,1)
#     # gamma = trial.suggest_int("gamma",0,5)
#     reg_lambda = trial.suggest_int("reg_lambda",0,5)
#     class_weight = trial.suggest_int("class_weight",0,3)
#     rf = xgb.XGBClassifier(
#             max_depth=max_depth, 
#             n_estimators=n_estimators,
#             learning_rate=learning_rate,
#             reg_lambda=reg_lambda,
#             class_weight = class_weight
#         )
#     rf.fit(X_train,Y_train)
#     preds = rf.predict(X_test)
#     pred_labels = np.rint(preds)
#     accuracy = accuracy_score(Y_test, pred_labels)
#     return accuracy

# study = optuna.create_study(direction="maximize")
# study.optimize(objective2, n_trials=15)


In [None]:
# lr = LogisticRegression(random_state=42, multi_class="multinomial")
# lr.fit(X_train,Y_train)

# y_pred = lr.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

# nb = GaussianNB()
# nb.fit(X_train,Y_train)

# y_pred = nb.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

# tree = DecisionTreeClassifier(max_depth=20,random_state=42)
# tree.fit(X_train,Y_train)

# y_pred = tree.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

# rf = RandomForestClassifier(random_state=42, criterion='entropy', max_depth=30, n_estimators=440)
# rf.fit(X_train,Y_train)

# y_pred = rf.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

In [None]:
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
 
X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]
test_encoded.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in test_encoded.columns.values]


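# Final XGBoost model. `class_weight` is not an XGBClassifier parameter and is ignored (see the warning below).
# Note: this rebinds `xgb` from the module to the model, so the cell cannot be re-run without re-importing xgboost.
xgb = xgb.XGBClassifier(max_depth=3, n_estimators=208, learning_rate=1, reg_lambda=3, class_weight=2)
xgb.fit(X_train, Y_train)

# Accuracy on the held-out 20% validation split.
y_pred = xgb.predict(X_test)
print(accuracy_score(Y_test, y_pred))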

Parameters: { "class_weight" } are not used.

0.7120297585626053
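
The column renaming at the start of the cell above replaces `[`, `]` and `<` in feature names, characters that XGBoost has historically rejected in feature names. The same substitution can be written once as a small helper; `sanitize_columns` is a hypothetical name introduced here for illustration, and the example column names are made up.

```python
import re

# Character class matching any one of '[', ']' or '<'.
_FORBIDDEN = re.compile(r"[\[\]<]")

def sanitize_columns(columns):
    """Replace '[', ']' and '<' in each column name with '_'."""
    return [_FORBIDDEN.sub("_", str(col)) for col in columns]

cleaned = sanitize_columns(["age_[70-80)", "weight_<50", "gender"])
print(cleaned)  # ['age__70-80)', 'weight__50', 'gender']
```

Applying one helper to `X_train.columns`, `X_test.columns` and `test_encoded.columns` would keep the three frames consistent without repeating the list comprehension.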


In [None]:
test_Y = xgb.predict(test_encoded)

df_output = pd.read_csv("../canadian-hospital-re-admittance-challenge/sample_submission.csv")
df_output["readmission_id"] = test_Y
df_output.to_csv("submission_xg.csv", index=False)

Finally, we plot the confusion matrix for the XGBoost predictions on the validation split.

In [None]:
# cm = confusion_matrix(Y_test, y_pred, labels=xgb.classes_)
# disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=xgb.classes_)
# disp.plot()
# plt.show()
# print(accuracy_score(y_pred, Y_test))

![Confusion_matrix](../images/confusion_matrix.png)
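
Beyond overall accuracy, a confusion matrix gives per-class recall: divide each diagonal entry by its row sum. The sketch below uses small made-up label arrays, not this notebook's predictions, to show the calculation.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a three-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

recall = cm.diagonal() / cm.sum(axis=1)  # fraction of each true class recovered
accuracy = cm.trace() / cm.sum()         # overall fraction correct

print(cm)
print(recall)    # [0.5  0.5  0.75]
print(accuracy)  # 0.625
```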

### References
1. List of ICD-9 codes: https://en.wikipedia.org/wiki/List_of_ICD-9_codes
2. Optuna documentation: https://optuna.readthedocs.io/en/stable/
3. Pandas documentation: https://pandas.pydata.org/docs/
4. NumPy documentation: https://numpy.org/doc/1.26/user/index.html
5. Scikit-learn documentation: https://scikit-learn.org/stable/
