# Introduction Tree Learning: Exercise 2



In this exercise, you will have the chance to investigate the informative data from the same publication.

To make this task a bit more realistic, we have given you the original data. You will have to standardize the data (as shown in the pre-processing lecture and previous exercises), potentially using the pipeline approach of sklearn.


```
'biomarkers_raw.csv'
``` 

This file contains those genomic biomarkers.

Use Random Forests to establish which are the most informative features (attributes/variables/predictors, etc.).
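
As a reminder of the standardisation step mentioned above, here is a minimal sketch on a made-up table. The column names `marker_a` and `marker_b` and the sample labels are hypothetical and are not taken from `biomarkers_raw.csv`; the sketch only illustrates that a `StandardScaler` inside a `Pipeline` rescales each column to mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: two made-up numeric biomarkers for four samples
toy = pd.DataFrame(
    {"marker_a": [1.0, 2.0, 3.0, 4.0], "marker_b": [10.0, 20.0, 30.0, 40.0]},
    index=["s1", "s2", "s3", "s4"],
)

# A one-step pipeline; further steps could be appended later
scaling_pipeline = Pipeline(steps=[("scaler", StandardScaler())])

# fit_transform returns a NumPy array, so we restore column names and index
toy_scaled = pd.DataFrame(
    scaling_pipeline.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# After scaling, each column has mean 0 and (population) standard deviation 1
column_means = toy_scaled.mean()
column_stds = toy_scaled.std(ddof=0)
```

Note that `StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed to `std` when checking the result.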





In [None]:
import os
import sys
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt # plotting and visualisation
import seaborn as sns # nicer (easier) visualisation
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import LeaveOneOut, GridSearchCV, KFold, StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier



# own mini-library
import session_helpers
import IPython.display



## Loading in the file and setting the first column to be the index

In [None]:
biomarkers_file_csv = 'biomarkers_raw.csv'

df = pd.read_csv(biomarkers_file_csv)
df = df.set_index(['Sample'])


In [None]:
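# a cell only shows its last expression, so display the summary explicitly
IPython.display.display(df.describe())
df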


In [None]:
# Establish which columns are numerical
numeric_features = list(df.select_dtypes(float).columns)

# scaling using the pipeline approach
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

# put the preprocessor together:
preprocessor = ColumnTransformer(transformers=[ ('num',numeric_transformer,numeric_features), ])
clf = Pipeline(steps=[('preprocessor', preprocessor)])

# create a new dataframe, using the same column names and indices as in df
df_norm = pd.DataFrame(clf.fit_transform(df[numeric_features]),columns=numeric_features,index=df.index)

# add 'response' back into the new dataframe as 'target' and directly do the mapping
target_mapper            = {
                             #'C.':'negative',
                             #'C.R.':'negative',
                             'Low':'negative',
                             'Int. I.':'negative',
                             'Int. II.':'negative',
                             'Int. II. R.':'negative',
                             'High':'positive',
                             'High R.':'positive',
                            }

target_mapper_multiclass = {
                             'C.':'C.',
                             'C. R.':'C. R.',
                             'Low':'Low',
                             'Int. I.':'Int. I.',
                             'Int. II.':'Int. II.',
                             'Int. II. R.':'Int. II. R.',
                             'High':'High',
                             'High R.':'High R.',
                            }

df_norm['target'] = df['Response'].map(target_mapper_multiclass)


# drop entries which do not have a class label (labels missing from the mapper are mapped to NaN)
# filter on the column 'target', keeping only entries which are not None or NaN
df_norm = df_norm[df_norm['target'].notna()]

# to be deleted
#df_norm['Response'] = df['Response']#.map(target_mapper)
#df_norm[['Response']+numeric_features].to_csv('clinical_biomarkers_new.csv')




## For consistency

We use X for the feature matrix and y for the target column.

In [None]:
# target column
y = df_norm['target']
# this drops the column 'target' for the dataframe and stores it in X
X = df_norm.drop(['target'],axis=1)




## Plotting the values of all columns

Here we use the `melt` function of pandas. It reshapes the dataframe from wide to long format (one row per sample and feature), which allows the values of all features to be drawn in a single plot. Just click on Run and see.

Are you able to spot an attribute or two that separates the response classes (for example, positive from negative if you switch to the binary `target_mapper`)?
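
To see what `melt` does before running it on the full dataframe, here is a small sketch on a made-up wide table. The feature names and values are hypothetical; only the call signature mirrors the cell below.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per feature
wide = pd.DataFrame(
    {
        "target": ["positive", "negative"],
        "marker_a": [0.5, -0.5],
        "marker_b": [1.2, -1.2],
    }
)

# Long format: one row per (sample, feature) pair, with 'target' repeated
long = pd.melt(wide, id_vars="target", var_name="features", value_name="value")
```

The long table has the columns `target`, `features` and `value`, which is the layout that `sns.boxplot` expects for its `hue`, `x` and `y` arguments.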


In [None]:
plot_data_melt = pd.melt(df_norm,id_vars='target',
                    var_name='features',
                    value_name='value')
plt.figure(figsize=(60,10))
ax = sns.boxplot(x='features', y='value', hue='target', data=plot_data_melt)
ticks_information = plt.xticks(rotation=65)

## Random Forest Classifier

Now establish the feature importance using a grid search and Random Forests.
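
The overall idea can be sketched on synthetic data first. Everything in this sketch is an assumption made for illustration: the data come from `make_classification` (not from the biomarker file), the grid `toy_grid` is deliberately tiny, and `random_state=0` is only set so that repeated runs give the same result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (NOT the biomarker data): 3 classes, 6 features
X_toy, y_toy = make_classification(
    n_samples=150, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_toy = pd.DataFrame(X_toy, columns=[f"feature_{i}" for i in range(6)])

# Small hypothetical grid; the exercise grid below is larger
toy_grid = {"n_estimators": [5, 10], "max_depth": [2, 3]}

# 'roc_auc_ovo' scores multiclass problems via one-vs-one ROC AUC
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    toy_grid,
    cv=5,
    scoring="roc_auc_ovo",
)
search.fit(X_toy, y_toy)

# Importances of the refitted best model, sorted from most to least informative
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
```

The impurity-based importances of a fitted forest are normalised to sum to 1, so each value can be read as a share of the total. Without a fixed `random_state`, the ranking may change between runs, especially with few trees.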

### Grid search

In [None]:
parameters = {
#    'criterion': ['gini','entropy'], 
    'n_estimators': [2,3,5,10], 
    'max_depth':[1,2,3,4,5],
    'min_samples_leaf':[2,5,7,10],
}

random_f_model = RandomForestClassifier() 

# possible values for scoring:
# 

rf_grid_search = GridSearchCV(random_f_model, parameters, cv=5,scoring='roc_auc_ovo') 
grid_search = rf_grid_search.fit(X, y)



### Best model

In [None]:
best_random_f_model = rf_grid_search.best_estimator_ # best model according to grid search 

best_random_f_model.get_params()

In [None]:
# using dataframes
df_importance = pd.DataFrame(list(zip(X.columns.values,best_random_f_model.feature_importances_)),columns=['column_name','feature_importance'])
df_importance = df_importance.set_index(['column_name'])
df_importance.sort_values(['feature_importance'],ascending=False,inplace=True)

df_importance[df_importance['feature_importance']>0.0]

In [None]:
plt.figure(figsize=(20,10))

sns.barplot(x='column_name',y='feature_importance',data=df_importance.reset_index(),palette='muted')
ticks_information = plt.xticks(rotation=65)