# Binary Location Classification 
The goal of this notebook is to find the important lab tests per location for (positive) uveitis patients. 
The hypothesis is that an anterior inflammation can be identified by a different subset of lab tests than, for example, a posterior inflammation. 
Once the location of the inflammation is known, this would make it possible to order only a subset of all possible lab tests to identify uveitis. One approach would be to train a model per location.

Steps:

1. Get Subset of Data (Target Feature: Location, Input Features: Lab Results)
2. Define suitable Algorithms for Binary Classification (e.g. Logistic Regression, etc.)
3. Call preprocessing pipe with appropriate parameters for the current algorithm
4. Fit Model
5. Extract and Discuss important Features
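
The "one model per location" idea can be sketched as a one-vs-rest setup: for each location, build a binary target and fit a separate model, then compare which features each model relies on. The snippet below is only an illustration on synthetic data; the column names (`lab_a`, `lab_b`, `loc`) and the location labels are assumptions, not the real schema of this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab data (assumed columns, not the real dataset)
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "lab_a": rng.normal(size=n),
    "lab_b": rng.normal(size=n),
})
# Made-up rule: high lab_a -> anterior, otherwise high lab_b -> posterior
toy["loc"] = np.where(toy["lab_a"] > 0.5, "anterior",
                      np.where(toy["lab_b"] > 0.5, "posterior", "intermediate"))

features = ["lab_a", "lab_b"]
importances = {}
for location in sorted(toy["loc"].unique()):
    # Binary target: this location vs. every other location
    y_bin = toy["loc"] == location
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(toy[features], y_bin)
    importances[location] = dict(zip(features, model.feature_importances_))

# One row per location, one column per lab test
importance_table = pd.DataFrame(importances).T
print(importance_table.round(2))
```

Each row of `importance_table` then shows which lab tests the model for that location depends on most.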

In [None]:
# global Variables
RANDOM = 43

In [None]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn standard imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, Binarizer, LabelEncoder, Normalizer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# import decision tree
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

# import of pipe module
os.chdir('../preprocessing/')
import pipe

## Data Preparation
To predict the location of an inflammation, we need to drop all columns that contain information about the location. Meta-Information about the patient will also be dropped.

In [None]:
# calling preprocessing function

# num_to_cat = True: Range Date is now dtype Category 
# drop_filter: Drop every column that is not a lab test

df = pipe.preprocessing_pipe(num_to_cat   = True,
                             drop_filter  = ['hla', 'ac_', 'vit_', 'gender', 'race', 'cat','specific_diagnosis'],
                             loc_approach = 'multi',
                             binary_cat   = True) 
df.head()
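
As a rough illustration of what a filter like `drop_filter` does, the sketch below drops every column whose name contains one of the given substrings. The helper `drop_columns_containing` and the toy column names are hypothetical; the actual `pipe.preprocessing_pipe` may implement the filter differently.

```python
import pandas as pd

def drop_columns_containing(frame, patterns):
    """Return a copy of `frame` without columns whose name contains any pattern."""
    to_drop = [col for col in frame.columns
               if any(pattern in col for pattern in patterns)]
    return frame.drop(columns=to_drop)

# Toy frame with assumed column names
toy = pd.DataFrame({
    "gender": ["f", "m"],
    "ac_cells": [1, 0],
    "hla_b27": [True, False],
    "calcium": [2.3, 2.5],
    "uveitis": [True, False],
})
filtered = drop_columns_containing(toy, ["hla", "ac_", "gender"])
print(filtered.columns.tolist())  # ['calcium', 'uveitis']
```

With plain substring matching, short patterns can remove more than intended: in this sketch a pattern such as `'cat'` would also match a column named `location`.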

### Split Data into uveitis and not_uveitis data

In [None]:
df_uv_pos = df[df.uveitis == True]
df_uv_neg = df[df.uveitis != True]

## Decision Tree and Random Forest
**Note:** decision trees can in principle handle missing values, although the cells below drop rows with missing values before fitting.

One of the simplest and easiest to understand models is a decision tree. Such a model tries to classify a dataset based on a series of yes-or-no questions arranged as a tree (see the visualization of the tree later on). At first, we train a decision tree to identify whether a patient is uveitis positive or negative based purely on lab test results. (More sophisticated methods will be applied later on.)
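
To make the "series of yes-or-no questions" concrete, the following sketch fits a shallow tree on synthetic data and prints its rules as text. The feature names `crp` and `ace` and the data are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = X_toy[:, 0] > 0  # synthetic label depends only on the first feature

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Each printed line is one threshold question on a feature
rules = export_text(toy_tree, feature_names=["crp", "ace"])
print(rules)
```

Because the synthetic label is determined by a single threshold on the first feature, the tree needs only one question on `crp` to separate the classes.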

### Decision Tree: Uveitis or Not Uveitis, binary Classification

**Problem** in binary classification of uveitis: The dataset is extremely unbalanced. On an extremely unbalanced dataset, a decision tree tends to always predict the majority class, because this alone already yields a high accuracy.
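
The effect can be seen with a baseline that ignores the features entirely. The sketch below assumes a synthetic 95:5 class ratio (not the real ratio in this dataset) and compares plain accuracy with balanced accuracy, which averages the recall of both classes.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic 95:5 class ratio, assumed for illustration only
y_toy = np.array([True] * 95 + [False] * 5)
X_toy = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_hat = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_hat)
bal_acc = balanced_accuracy_score(y_toy, y_hat)
print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
# accuracy: 0.95, balanced accuracy: 0.50
```

A high accuracy is therefore not enough on its own here; the per-class values in `classification_report` are more informative.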

In [None]:
# train_test_split
df_t = df.copy().dropna()
X = df_t.drop(columns=['loc','uveitis'])
y = df_t.uveitis
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=RANDOM, stratify = None)

In [None]:
# filter for numeric and categorical features
numerics = ['Int64','float64']
category = ['category','bool']

# select list of numeric and categorical features
numeric_features = X.select_dtypes(include=numerics).columns.tolist()
categorical_features = X.select_dtypes(include=category).columns.tolist()

# define imputer strategy (consult sklearn SimpleImputer and StandardScaler documentation for options)
imputer = {'categorical':{'strategy':'most_frequent','fill_value':'most_frequent'}, 'numerical':{'strategy':'median', 'fill_value':'mean'}}
imputer_encoder = pipe.impute_and_encode(categorical_features, numeric_features, imputer)
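
The internals of `pipe.impute_and_encode` are not shown in this notebook. A plausible equivalent built from the scikit-learn classes imported above might look like the following; the function name `build_preprocessor`, the toy columns, and the choice of scaling and one-hot encoding are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocessor(categorical_cols, numeric_cols):
    """Hypothetical stand-in: impute and scale numerics, impute and encode categoricals."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ])

# Toy data with one missing value per column
toy = pd.DataFrame({
    "crp": [1.0, np.nan, 3.0, 5.0],
    "ana": ["pos", "neg", np.nan, "neg"],
})
pre = build_preprocessor(["ana"], ["crp"])
transformed = pre.fit_transform(toy)
print(transformed.shape)  # one scaled numeric column plus two one-hot columns
```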

In [None]:
dectree = DecisionTreeClassifier()

pipeline = Pipeline(steps=[('preprocessor', imputer_encoder),
                      ('classifier', dectree)])

# Specify the hyperparameter space
# n_components = list(range(1,X.shape[1]+1,1)) # for pca if needed
criterion = ['gini', 'entropy']
max_depth = [2,4,6,8,10,12]
class_weight = [{True:0.2, False:1}]

parameters = {'criterion':criterion,
             'max_depth':max_depth,
             'class_weight':class_weight}

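# Instantiate the GridSearchCV object: cv
# Note: the search runs on the bare classifier `dectree`; the `pipeline` defined above is not used here
cv = GridSearchCV(dectree, parameters, cv = 10)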

# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred)).plot();
print(cv.best_params_)

plt.figure(figsize=(20,20))
# plot_tree expects class names in ascending order of the labels (False, True)
class_names = ['Not Uveitis', 'Uveitis']
feature_names = X_test.columns.tolist()
plot_tree(cv.best_estimator_, fontsize=15, class_names=class_names, feature_names=feature_names, filled=True)
plt.show()
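
One way to cover step 5, and to run the grid search over a full pipeline rather than the bare classifier, is sketched below on synthetic data. Parameters of a pipeline step are addressed as `<step name>__<parameter>`. The feature names and data here are illustrative assumptions, and a plain `SimpleImputer` stands in for the preprocessing step.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(43)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)),
                     columns=["lab_a", "lab_b", "lab_c"])
y_toy = X_toy["lab_b"] > 0  # synthetic target: only lab_b carries signal

toy_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="median")),
    ("classifier", DecisionTreeClassifier(random_state=43)),
])
# Hyperparameters are routed to the step named "classifier"
toy_grid = {
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_depth": [2, 4],
}
search = GridSearchCV(toy_pipeline, toy_grid, cv=5).fit(X_toy, y_toy)

# Pull the fitted tree out of the best pipeline and rank the features
best_tree = search.best_estimator_.named_steps["classifier"]
ranking = pd.Series(best_tree.feature_importances_, index=X_toy.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

If the preprocessing step changes the number or order of columns (for example through one-hot encoding), the importances have to be matched against the transformed feature names instead of the input columns.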