# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (these would be considered labels in Machine Learning terms) to describe the worthiness of a personal loan: "Good" or "Bad". There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people being liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (a.k.a. they have been one-hot-encoded).
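To illustrate what this expansion looks like, here is a minimal sketch using `pandas.get_dummies` on a made-up `Housing` column. The toy frame and the `prefix_sep="."` setting are assumptions chosen to mimic column names such as `Housing.Rent`; the provided CSV already ships in encoded form, so you do not need to run this on the real data.

```python
import pandas as pd

# Hypothetical toy data: one categorical predictor with three levels.
toy = pd.DataFrame({"Housing": ["Rent", "Own", "ForFree", "Own"]})

# Expand the categorical column into one 0/1 indicator column per level.
# prefix_sep="." is an assumption that mimics names like "Housing.Rent".
encoded = pd.get_dummies(toy, columns=["Housing"], prefix_sep=".", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which is what "one-hot" refers to.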

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful algorithms that can produce models with higher predictive accuracy than linear models such as Linear or Logistic Regression. This is primarily because DTs can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization can become very large and complex, making it harder to interpret.
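As a rough illustration of the nonlinearity point, the sketch below compares Logistic Regression and a Decision Tree on a synthetic XOR-style problem, where the label depends on the sign of the product of two features. The data, variable names, and seeds are all made up for this example and have nothing to do with the German Credit dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: the class is 1 when both features share a sign.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(1000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# A single linear boundary cannot separate this pattern; a tree can carve it up.
linear_acc = LogisticRegression().fit(Xtr, ytr).score(Xte, yte)
tree_acc = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
print(f"logistic regression: {linear_acc:.2f}, decision tree: {tree_acc:.2f}")
```

This is a deliberately extreme case; on real data the gap between the two model families is usually much smaller and can go either way.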

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one of them holds the labels, and the other 61 hold the potential features for the model).

For this task, you will be using the scikit-learn library, which comes pre-installed with the Anaconda Python distribution. If you're not using Anaconda, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
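To make the relationships between these metrics concrete, here is a minimal sketch that computes them by hand from the four cells of a confusion matrix. The two label lists are made up for illustration and are not results from the German Credit data.

```python
# Hypothetical true labels and predictions (1 = positive class, 0 = negative class).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_toy = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# Count the four cells of the confusion matrix.
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

# Derive the metrics reported by classification_report from those counts.
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The per-class rows in scikit-learn's `classification_report` follow the same formulas, with each class taking its turn as the "positive" one.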

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is hyperparameter tuning through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion here is that you split the data into 70% training and 30% testing. Then, do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [3]:
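# Uncomment to extract the CSV into the working directory
# (assumes the archive holds GermanCredit.csv at its top level)
# import zipfile
# with zipfile.ZipFile("GermanCredit.csv.zip", "r") as zip_ref:
#     zip_ref.extractall(".")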

df = pd.read_csv('GermanCredit.csv')
df.head()

Unnamed: 0,Duration,Amount,InstallmentRatePercentage,ResidenceDuration,Age,NumberExistingCredits,NumberPeopleMaintenance,Telephone,ForeignWorker,Class,...,OtherInstallmentPlans.Bank,OtherInstallmentPlans.Stores,OtherInstallmentPlans.None,Housing.Rent,Housing.Own,Housing.ForFree,Job.UnemployedUnskilled,Job.UnskilledResident,Job.SkilledEmployee,Job.Management.SelfEmp.HighlyQualified
0,6,1169,4,4,67,2,1,0,1,Good,...,0,0,1,0,1,0,0,0,1,0
1,48,5951,2,2,22,1,1,1,1,Bad,...,0,0,1,0,1,0,0,0,1,0
2,12,2096,2,3,49,1,2,1,1,Good,...,0,0,1,0,1,0,0,1,0,0
3,42,7882,2,4,45,1,2,1,1,Good,...,0,0,1,0,0,1,0,0,1,0
4,24,4870,3,4,53,2,2,1,1,Bad,...,0,0,1,0,0,1,0,0,1,0


In [14]:
# Construct data X, label y
X = df.drop('Class',axis=1)
y = df.Class == "Good"
y

0       True
1      False
2       True
3       True
4      False
       ...  
995     True
996     True
997     True
998    False
999     True
Name: Class, Length: 1000, dtype: bool

In [19]:
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 40, test_size = 0.3)
clf = DecisionTreeClassifier()


# Parameter settings
# max_depth controls the size of the tree to prevent overfitting; 3 is the shallowest depth tried
max_depth = [3, 6, 8, 10, 12]
# min_samples_split and min_samples_leaf ensure that multiple samples inform every decision in the tree
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 5, 8, 12, 15]

param = {"max_depth": max_depth,"min_samples_split": min_samples_split, "min_samples_leaf": min_samples_leaf}

# Grid Search & K-fold
from sklearn.model_selection import KFold
kf = KFold(n_splits = 10, shuffle = True)

tm = GridSearchCV(clf, param, cv = kf)
tm.fit(X_train, y_train)
y_pred = tm.predict(X_test)

In [20]:
tm.best_params_

{'max_depth': 3, 'min_samples_leaf': 15, 'min_samples_split': 2}

In [21]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.51      0.40      0.45        82
        True       0.79      0.85      0.82       218

    accuracy                           0.73       300
   macro avg       0.65      0.63      0.64       300
weighted avg       0.71      0.73      0.72       300



In [22]:
tm.score(X_test, y_test)

0.73

In [26]:
# Best Tree model
best_clf = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 15, min_samples_split = 2)
best_clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3, min_samples_leaf=15)

### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago that demonstrated how to visualize and interpret the results of your Decision Tree model? We've seen that this can work very well, but let's see how it does on the "German Credit" dataset that we're working on, which is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [23]:
! pip install dtreeviz

Collecting dtreeviz
  Downloading dtreeviz-1.3.tar.gz (60 kB)
Collecting graphviz>=0.9
  Downloading graphviz-0.17-py3-none-any.whl (18 kB)
Collecting colour
  Downloading colour-0.1.5-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: dtreeviz
  Building wheel for dtreeviz (setup.py) ... done
  Created wheel for dtreeviz: filename=dtreeviz-1.3-py3-none-any.whl size=66638 sha256=f05d753fa296af585866e797df303ada884dbd5970a86c5487f64709426fecb1
  Stored in directory: /Users/cyx/Library/Caches/pip/wheels/9e/37/2c/3b30269ca762b6bb992fd0abb640f3e384c290e719597fddbc
Successfully built dtreeviz
Installing collected packages: graphviz, colour, dtreeviz
Successfully installed colour-0.1.5 dtreeviz-1.3 graphviz-0.17
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.


If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!

In [27]:
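# Note: pip installs only the Python packages; rendering also needs the
# Graphviz `dot` executable to be available on your system PATH.
from dtreeviz.trees import dtreeviz

viz = dtreeviz(best_clf,
               X_train,
               y_train,
               target_name = 'Class',
               feature_names = list(X_train.columns),
               # y is boolean (True = "Good"), and classes are ordered False, True
               class_names = ['Bad', 'Good']
               )
viz.view()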

ExecutableNotFound: failed to execute 'dot', make sure the Graphviz executables are on your systems' PATH

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.
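One way to see why averaging many trees reduces variance: if B estimators each have variance σ² and pairwise correlation ρ, their average has variance ρσ² + (1 − ρ)σ²/B. The sketch below checks this standard result with a small simulation. The "tree predictions" here are just correlated random numbers, not real trees, and the values of σ², ρ, and B are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, B, trials = 1.0, 0.3, 50, 20000

# Simulate B equicorrelated "tree predictions" per trial:
# a component shared by all trees plus a component unique to each tree.
shared = rng.normal(size=(trials, 1)) * np.sqrt(rho * sigma2)
individual = rng.normal(size=(trials, B)) * np.sqrt((1 - rho) * sigma2)
trees = shared + individual  # each column: variance sigma2, pairwise correlation rho

single_var = trees[:, 0].var()         # variance of one "tree"
forest_var = trees.mean(axis=1).var()  # variance of the averaged "forest"
theory = rho * sigma2 + (1 - rho) * sigma2 / B

print(f"single: {single_var:.3f}, averaged: {forest_var:.3f}, theory: {theory:.3f}")
```

The averaged variance cannot drop below ρσ², which is one motivation for Random Forests sampling a random subset of features at each split: it lowers the correlation between trees.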

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see that this model, even though it's a bit of a "black box", provides certain functionalities that allow for some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf_clf = RandomForestClassifier()

# parameters setting
max_features = ["sqrt", "log2", 10]
n_estimators = [10, 20, 50, 100]
rf_param = {"max_depth": max_depth,"min_samples_split": min_samples_split, "min_samples_leaf": min_samples_leaf, 
             "max_features": max_features, "n_estimators": n_estimators}

rf_tm = GridSearchCV(rf_clf, rf_param, cv = kf)
rf_tm.fit(X_train, y_train)

In [None]:
rf_tm.best_params_

In [None]:
rf_tm.score(X_test, y_test)

In [None]:
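# Predict with the tuned Random Forest (y_pred still holds the Decision Tree predictions)
print(classification_report(y_test, rf_tm.predict(X_test)))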

As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an ordered table of not directly interpretable numeric values. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

Now you try! Let's visualize the importance of features from your Random Forests model!

In [None]:
# Best model for Random Forest: GridSearchCV refits the best parameter
# combination on the full training set and exposes it as best_estimator_
best_rf = rf_tm.best_estimator_

# Sort the features by importance so the bar chart is easier to read
importance = pd.Series(best_rf.feature_importances_, index = X_train.columns)
importance.sort_values(ascending = False).plot(kind = "bar")


A final method for gaining some insight into the inner workings of your Random Forests models is the so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The chosen features are fixed at a few values, and the model's predictions are averaged over the other features. A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic or more complex.
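The "fix one feature, average over the rest" idea can be sketched in a few lines of NumPy. This is only an illustration of the computation, not how PDPbox or scikit-learn implement it. The helper `partial_dependence_1d` and the stand-in model `toy_predict` are hypothetical names introduced for this example.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature_idx, grid):
    """Average prediction when column `feature_idx` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value           # fix the chosen feature for every row
        averages.append(predict(X_mod).mean())  # average over the other features
    return np.array(averages)

# Hypothetical stand-in for a fitted model's prediction function.
def toy_predict(X):
    return X[:, 0] ** 2 + X[:, 1]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
grid = np.linspace(-2, 2, 5)

pd_values = partial_dependence_1d(toy_predict, X_demo, feature_idx=0, grid=grid)
print(pd_values)
```

For this toy model the curve is the squared grid value plus a constant (the mean of the second feature), so the plot would reveal the quadratic effect of the first feature. With a real classifier you would pass something like a function returning the predicted probability of the positive class.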

In scikit-learn, PDPs are implemented and available for certain algorithms, but at the time of writing (version 0.20.0) they are not yet implemented for Random Forests (later releases add model-agnostic partial dependence tools in `sklearn.inspection`). Thankfully, there is an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which adds this functionality to Random Forests. The package is easy to install through pip.

In [None]:
! pip install pdpbox

Collecting pdpbox
  Downloading PDPbox-0.2.1.tar.gz (34.0 MB)
Collecting matplotlib==3.1.1
  Downloading matplotlib-3.1.1-cp37-cp37m-manylinux1_x86_64.whl (13.1 MB)
Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Building wheels for collected packages: pdpbox, sklearn
  Building wheel for pdpbox (setup.py) ... done
  Created wheel for pdpbox: filename=PDPbox-0.2.1-py3-none-any.whl size=35758225 sha256=16cb03810006cda0c9cb55ab33655c3a741752ad391a6be8f30953edcb7a6a27
  Stored in directory: /root/.cache/pip/wheels/f4/d0/1a/b80035625c53131f52906a6fc4dd690d8efd2bf8af6a4015eb
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0-py2.p

While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [None]:
# Partial Dependence Plots
import matplotlib.pyplot as plt
from pdpbox import pdp

# Use the feature matrix the model was trained on (it has no 'Class' column)
pdp_df = pdp.pdp_interact(model = best_rf, dataset = X_train,
                          model_features = list(X_train.columns), features = ['Age', 'Amount'])
pdp.pdp_interact_plot(pdp_df, ['Age', 'Amount'])
plt.show()

## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (which use Bagging, a.k.a. Bootstrap Aggregation) was developed using Boosting. The first of these were Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
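As a starting point, here is a minimal sketch of scikit-learn's `GradientBoostingClassifier` on synthetic data. It uses `staged_predict` to show how test accuracy evolves as trees are added one after another, which is the key difference from Bagging, where trees are built independently. The dataset, hyperparameter values, and variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in your own train/test split for the real task.
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Each new tree is fit to correct the errors of the ensemble built so far.
gbm = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(Xtr, ytr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees.
staged_acc = [accuracy_score(yte, pred) for pred in gbm.staged_predict(Xte)]
print(f"after 1 tree: {staged_acc[0]:.2f}, after 50 trees: {staged_acc[-1]:.2f}")
```

The same Grid Search and Cross Validation workflow used for the Decision Tree applies here, with `n_estimators` and `learning_rate` as natural additions to the parameter grid.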

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy, and functionality over the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
! conda install -c anaconda py-xgboost

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/springboard

  added / updated specs:
    - py-xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           9 KB  anaconda
    ca-certificates-2020.10.14 |                0         128 KB  anaconda
    certifi-2020.6.20          |           py37_0         159 KB  anaconda
    conda-4.9.0                |           py37_0         3.1 MB  anaconda
    joblib-0.17.0              |             py_0         205 KB  anaconda
    libxgboost-0.90            |       he6710b0_1         3.8 MB  anaconda
    openssl-1.1.1k             |       h27cfd23_0         2.5 MB
    py-xgboost-0.90            |   py37he6710b0_1          77 KB  anaconda
    scikit-learn-0.23.2        |   py37h0573a6f_0         6.9 MB

In [None]:
! conda install -c conda-forge catboost

In [None]:
! conda install -c conda-forge lightgbm

Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

The final deliverable of this section should be a table (can be a pandas DataFrame) which shows the accuracy of all five algorithms taught in this mini project in one place.
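One possible way to assemble such a table is sketched below, using synthetic data and only two scikit-learn models so that it runs without the optional libraries. The dictionary of models, the column names, and the hyperparameter values are assumptions for illustration; you would substitute your own tuned models and your existing train/test split.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the German Credit train/test split.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

# Any estimator with fit/score can be added, including the boosting
# libraries' scikit-learn-style classifiers once they are installed.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
rows = [{"model": name, "test_accuracy": model.fit(Xtr, ytr).score(Xte, yte)}
        for name, model in models.items()]
summary = (pd.DataFrame(rows)
           .sort_values("test_accuracy", ascending=False)
           .reset_index(drop=True))
print(summary)
```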

Happy modeling! :)