# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (these would be considered labels in Machine Learning terms) to describe the worthiness of a personal loan: "Good" or "Bad". There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people being liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (that is, they have been one-hot encoded).
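To make the idea of 0/1 indicator variables concrete, here is a minimal sketch in plain Python. The records, the `Housing` column, and its category names are hypothetical and only illustrate the transformation; they are not taken from the actual CSV file.

```python
def one_hot_encode(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Keep every column except the categorical one being expanded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}.{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical toy records; column and category names are illustrative only.
loans = [
    {"Amount": 1200, "Housing": "Rent"},
    {"Amount": 5400, "Housing": "Own"},
    {"Amount": 800, "Housing": "ForFree"},
]

encoded_loans = one_hot_encode(loans, "Housing")
for row in encoded_loans:
    print(row)
```

Each record ends up with exactly one indicator set to 1 for the expanded column, which is why a dataset with about 20 original attributes can grow to 61 feature columns.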

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful algorithms that can produce models with higher predictive accuracy than linear models, such as Linear or Logistic Regression. This is primarily because Decision Trees can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model has produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization might be very large and complex, which makes it harder to interpret.
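To see where the nonlinearity comes from, it helps to look at a single split. The sketch below, in plain Python, searches one numeric feature for the threshold that minimizes the weighted Gini impurity. This is a simplified illustration of the kind of split search a tree performs, not scikit-learn's actual implementation, and the toy durations and outcomes are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: loan duration in months and the loan outcome.
durations = [6, 12, 12, 18, 24, 36, 48, 60]
outcomes = ["Good", "Good", "Good", "Good", "Bad", "Bad", "Bad", "Bad"]

threshold, impurity = best_threshold(durations, outcomes)
print(threshold, impurity)
```

A full tree repeats this search recursively on each resulting subset and across all features, which yields a piecewise-constant decision function rather than a single straight line.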

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one of them holds the labels, and the other 61 hold the potential features for the model).

For this task, you will be using the scikit-learn library, which comes pre-installed with the Anaconda Python distribution. If you're not using Anaconda, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
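As a quick illustration of how these metrics relate to each other, here is a minimal sketch in plain Python that derives precision, recall, F1 and accuracy from the four confusion-matrix counts. The labels are invented, and treating "Bad" as the positive class is an assumption made only for this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1), guarding against 0/0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels and predictions, for illustration only.
y_true_demo = ["Bad", "Bad", "Bad", "Good", "Good", "Good", "Good", "Good"]
y_pred_demo = ["Bad", "Bad", "Good", "Bad", "Good", "Good", "Good", "Good"]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo, positive="Bad")
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn)
print(precision, recall, f1, accuracy)
```

Here accuracy is 0.75 while precision and recall for the "Bad" class are both about 0.67, a small reminder that accuracy alone can hide how the minority class is handled.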

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is to use hyperparameter tuning, through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion here is that you split the data into 70% training and 30% testing. Then, do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)
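The workflow in the note above could be sketched roughly as follows. This is only an illustration on synthetic data from `make_classification`, not a solution for the "German Credit" dataset. The names `X_demo`, `y_demo`, `param_grid` and `search`, as well as the grid values, are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for real data: 1,000 rows, 20 features, two classes.
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70% training, 30% testing; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=0
)

# Illustrative grid; these values are assumptions, not recommended settings.
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

# 5-fold cross-validation runs on the training set only.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_tr, y_tr)

# The held-out test set is used once, at the very end.
print(search.best_params_)
print(classification_report(y_te, search.predict(X_te)))
```

Keeping the test set out of the tuning loop is the design choice that matters here: the cross-validated score guides the choice of hyperparameters, and the test score remains an unbiased final check.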

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [1]:
!pip install py4j==0.10.7

Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Installing collected packages: py4j
Successfully installed py4j-0.10.7
You are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
!pip install --user dtreeviz

Collecting dtreeviz
  Downloading https://files.pythonhosted.org/packages/00/b9/50676d3cdfee12a25c517f8ba761a56044168fb76e7acb7dce453279d444/dtreeviz-0.6.tar.gz
Collecting graphviz>=0.9 (from dtreeviz)
  Downloading https://files.pythonhosted.org/packages/94/cd/7b37f2b658995033879719e1ea4c9f171bf7a14c16b79220bd19f9eda3fe/graphviz-0.13-py2.py3-none-any.whl
Collecting colour (from dtreeviz)
  Downloading https://files.pythonhosted.org/packages/74/46/e81907704ab203206769dee1385dc77e1407576ff8f50a0681d0a6b541be/colour-0.1.5-py2.py3-none-any.whl
Building wheels for collected packages: dtreeviz
  Running setup.py bdist_wheel for dtreeviz ... done
  Stored in directory: /home/ubuntu/.cache/pip/wheels/39/a3/1d/6b650e1dc7dee16d8385e11f4d6fff1d37e12f697d1dee5260
Successfully built dtreeviz
mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.13 which is incompatible.
Installing collected packages: graphviz, colour, dtreeviz
Successfully install

In [5]:
import sys
sys.path.insert(0, "/home/ubuntu/.local/lib/python3.6/site-packages")

In [6]:
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *

In [9]:
! pip install --user graphviz==0.8.3

Collecting graphviz==0.8.3
  Downloading https://files.pythonhosted.org/packages/84/44/21a7fdd50841aaaef224b943f7d10df87e476e181bb926ccf859bcb53d48/graphviz-0.8.3-py2.py3-none-any.whl
dtreeviz 0.6 has requirement graphviz>=0.9, but you'll have graphviz 0.8.3 which is incompatible.
Installing collected packages: graphviz
  Found existing installation: graphviz 0.13
    Uninstalling graphviz-0.13:
      Successfully uninstalled graphviz-0.13
Successfully installed graphviz-0.8.3
You are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [15]:
!pip install --target="/home/ubuntu/.local/lib/python3.6/site-packages" graphviz==0.9

Collecting graphviz==0.9
  Downloading https://files.pythonhosted.org/packages/47/87/313cd4ea4f75472826acb74c57f94fc83e04ba93e4ccf35656f6b7f502e2/graphviz-0.9-py2.py3-none-any.whl
mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.9 which is incompatible.
Installing collected packages: graphviz
Successfully installed graphviz-0.9
Target directory /home/ubuntu/.local/lib/python3.6/site-packages/graphviz already exists. Specify --upgrade to force replacement.
You are using pip version 10.0.1, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [16]:

regr = tree.DecisionTreeRegressor(max_depth=2)
boston = load_boston()
regr.fit(boston.data, boston.target)

viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
              
viz.view() 

  (prop.get_family(), self.defaultFamily[fontext]))


ExecutableNotFound: failed to execute ['dot', '-Tsvg', '-o', '/tmp/DTreeViz_47.svg', '/tmp/DTreeViz_47'], make sure the Graphviz executables are on your systems' PATH

In [21]:
!conda install graphviz

Solving environment: done


  current version: 4.5.4
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base conda



NotWritableError: The current user does not have write permissions to a required path.
  path: /usr/local/anaconda/pkgs/conda-4.5.4-py36_0/info/repodata_record.json
  uid: 12574
  gid: 12574

If you feel that permissions on this path are set incorrectly, you can manually
change them by executing

  $ sudo chown 12574:12574 /usr/local/anaconda/pkgs/conda-4.5.4-py36_0/info/repodata_record.json

In general, it's not advisable to use 'sudo conda'.




In [21]:
import os
os.system('sudo ls')

256

In [19]:
!conda update conda


Solving environment: | ^C
failed

CondaError: KeyboardInterrupt



In [15]:
!sudo -i sudo -s.

[sudo] password for ubuntu: 


In [None]:
1+1

In [18]:
!sudo adduser tempuser

[sudo] password for ubuntu: 


In [2]:
!conda uninstall graphviz


Solving environment: failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - graphviz




In [11]:
!pip install py4j==0.10.7

Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Installing collected packages: py4j
Successfully installed py4j-0.10.7
You are using pip version 10.0.1, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [13]:
!pip install --user dtreeviz

Collecting dtreeviz
Collecting graphviz>=0.9 (from dtreeviz)
  Using cached https://files.pythonhosted.org/packages/94/cd/7b37f2b658995033879719e1ea4c9f171bf7a14c16b79220bd19f9eda3fe/graphviz-0.13-py2.py3-none-any.whl
Collecting colour (from dtreeviz)
  Using cached https://files.pythonhosted.org/packages/74/46/e81907704ab203206769dee1385dc77e1407576ff8f50a0681d0a6b541be/colour-0.1.5-py2.py3-none-any.whl
mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.13 which is incompatible.
Installing collected packages: graphviz, colour, dtreeviz
Successfully installed colour-0.1.5 dtreeviz-0.6 graphviz-0.13
You are using pip version 10.0.1, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [41]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [26]:
german_credit = pd.read_csv('./GermanCredit.csv')
X = german_credit.drop('Class', axis = 1)
y = german_credit.Class

In [43]:
# Your code here! :)
tree_classifier = DecisionTreeClassifier(random_state = 0, class_weight = 'balanced')
parameters = {'splitter': ('best','random')}
grid = GridSearchCV(tree_classifier, parameters, cv=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 0)
regr = grid.fit(X_train, y_train)

In [59]:
from sklearn.datasets import load_boston
from sklearn import tree
from dtreeviz.trees import dtreeviz

In [None]:
import sys
sys.path.insert(0, "/home/ubuntu/.local/lib/python3.6/site-packages")

In [68]:
!pwd

/mnt/aic-8_2_8_tree-based-algorithms-mini-project


In [64]:
!/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

/bin/sh: 1: /usr/bin/ruby: not found


In [None]:
!sudo chown 12574:12574 /usr/local/anaconda/pkgs/conda-4.5.4-py36_0/info/repodata_record.json;

In [69]:

regr = tree.DecisionTreeRegressor(max_depth=2)
boston = load_boston()
regr.fit(boston.data, boston.target)

viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
              
viz.view()   

  (prop.get_family(), self.defaultFamily[fontext]))


ExecutableNotFound: failed to execute ['dot', '-Tsvg', '-o', '/tmp/DTreeViz_46.svg', '/tmp/DTreeViz_46'], make sure the Graphviz executables are on your systems' PATH

In [60]:
dtreeviz

<function dtreeviz.trees.dtreeviz>

In [35]:
# Your code here! :)
tree = DecisionTreeClassifier(random_state = 0, class_weight = 'balanced')
parameters = {'splitter': ('best','random')}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 0)
regr = tree.fit(X_train, y_train)

In [34]:
grid.cv_results_



{'mean_fit_time': array([ 0.00565395,  0.00456481]),
 'mean_score_time': array([ 0.00063343,  0.00061398]),
 'mean_test_score': array([ 0.71      ,  0.67714286]),
 'mean_train_score': array([ 1.,  1.]),
 'param_splitter': masked_array(data = ['best' 'random'],
              mask = [False False],
        fill_value = ?),
 'params': [{'splitter': 'best'}, {'splitter': 'random'}],
 'rank_test_score': array([1, 2], dtype=int32),
 'split0_test_score': array([ 0.75177305,  0.63829787]),
 'split0_train_score': array([ 1.,  1.]),
 'split1_test_score': array([ 0.75      ,  0.71428571]),
 'split1_train_score': array([ 1.,  1.]),
 'split2_test_score': array([ 0.67142857,  0.7       ]),
 'split2_train_score': array([ 1.,  1.]),
 'split3_test_score': array([ 0.71428571,  0.61428571]),
 'split3_train_score': array([ 1.,  1.]),
 'split4_test_score': array([ 0.6618705 ,  0.71942446]),
 'split4_train_score': array([ 1.,  1.]),
 'std_fit_time': array([  1.77144241e-04,   9.57930604e-05]),
 'std_score_ti

In [27]:
grid.best_estimator_

DecisionTreeClassifier(class_weight='balanced', criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago, which demonstrated how to visualize and interpret the results of your Decision Tree model? We've seen that this approach can work very well, but let's see how it does on the "German Credit" dataset that we're working on, which is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [29]:
! pip install py4j==0.10.7

mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.13 which is incompatible.
You are using pip version 10.0.1, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [31]:
! pip install --user graphviz==0.8.3

dtreeviz 0.6 has requirement graphviz>=0.9, but you'll have graphviz 0.8.3 which is incompatible.
You are using pip version 10.0.1, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [32]:
! pip3 install --user dtreeviz

Collecting graphviz>=0.9 (from dtreeviz)
  Using cached https://files.pythonhosted.org/packages/94/cd/7b37f2b658995033879719e1ea4c9f171bf7a14c16b79220bd19f9eda3fe/graphviz-0.13-py2.py3-none-any.whl
mxnet-cu80 1.2.0 has requirement graphviz<0.9.0,>=0.8.1, but you'll have graphviz 0.13 which is incompatible.
Installing collected packages: graphviz
  Found existing installation: graphviz 0.8.3
    Uninstalling graphviz-0.8.3:
      Successfully uninstalled graphviz-0.8.3
Successfully installed graphviz-0.13
You are using pip version 10.0.1, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [61]:
! python3 -m pip install --user --upgrade dtreeviz

Requirement already up-to-date: dtreeviz in /home/ubuntu/.local/lib/python3.6/site-packages (0.6)
Requirement not upgraded as not directly required: scikit-learn in /usr/local/anaconda/lib/python3.6/site-packages (from dtreeviz) (0.19.1)
Requirement not upgraded as not directly required: pandas in /usr/local/anaconda/lib/python3.6/site-packages (from dtreeviz) (0.20.3)
Requirement not upgraded as not directly required: matplotlib in /usr/local/anaconda/lib/python3.6/site-packages (from dtreeviz) (2.2.2)
Requirement not upgraded as not directly required: graphviz>=0.9 in /home/ubuntu/.local/lib/python3.6/site-packages (from dtreeviz) (0.13)
Requirement not upgraded as not directly required: numpy in /usr/local/anaconda/lib/python3.6/site-packages (from dtreeviz) (1.13.3)
Requirement not upgraded as not directly required: colour in /home/ubuntu/.local/lib/python3.6/site-packages (from dtreeviz) (0.1.5)
Requirement not upgraded as not directly required: python-dateutil>=2 in /usr/local/an

If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!

In [21]:
import sys

In [33]:
import os
print(os.getenv('Path'))

None


In [57]:
! pip list

Package                            Version          
---------------------------------- -----------------
absl-py                            0.2.2            
alabaster                          0.7.10           
anaconda-client                    1.6.3            
anaconda-navigator                 1.6.4            
anaconda-project                   0.6.0            
appdirs                            1.4.3            
asn1crypto                         0.22.0           
astor                              0.6.2            
astroid                            1.5.3            
astropy                            2.0.1            
Babel                              2.5.0            
backports.shutil-get-terminal-size 1.0.0            
bcrypt                             3.1.4            
beautifulsoup4                     4.6.0            
bitarray                           0.8.1            
bkcharts                           0.2              
blaze

You are using pip version 10.0.1, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [None]:
from sklearn.datasets import *
boston = load_boston()


In [None]:

# class_weight is a parameter of the classifier, not of GridSearchCV.
# The estimator is named tree_classifier so it does not shadow the sklearn `tree` module.
tree_classifier = DecisionTreeClassifier(random_state = 0, class_weight = 'balanced')
parameters = {'splitter': ('best','random')}
grid = GridSearchCV(tree_classifier, parameters, cv=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 0)
regr = grid.fit(X_train, y_train)

In [None]:
viz = dtreeviz()

In [103]:
# Your code here! :)
#  from dtreeviz.trees import *

viz = dtreeviz(regr, X_train, y_train, target_name='Class',\
feature_names = np.array(german_credit.drop('Class', axis = 1)),\
class_names = dict{'Good', 'Bad'})

SyntaxError: invalid syntax (<ipython-input-103-845727bf8111>, line 4)

In [129]:
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *
dict({0:'Good',1:'Bad'})

{0: 'Good', 1: 'Bad'}

In [133]:


regr = tree.DecisionTreeRegressor(max_depth=2)
boston = load_boston()
regr.fit(boston.data, boston.target)

viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
              
viz.view()       

AttributeError: 'DecisionTreeClassifier' object has no attribute 'DecisionTreeRegressor'

In [130]:
viz = dtreeviz(regr, X_train, y_train, target_name = 'Class', feature_names = np.array(X.columns), class_names = dict({0:'Good',1:'Bad'}))
              

KeyError: 'Bad'

In [3]:
import os
print(os.environ['HOME'])

/home/ubuntu


In [77]:
! python3 -m dtreeviz

/usr/local/anaconda/bin/python3: No module named dtreeviz.__main__; 'dtreeviz' is a package and cannot be directly executed


In [78]:
from sklearn.datasets import *
from sklearn import tree
from dtreeviz.trees import *

ModuleNotFoundError: No module named 'dtreeviz'

In [89]:
! conda uninstall graphviz


Solving environment: failed

PackagesNotFoundError: The following packages are missing from the target environment:
  - graphviz




In [105]:
from dtreeviz.trees import *

ModuleNotFoundError: No module named 'dtreeviz'

In [107]:
import graphviz
print(graphviz.__file__)

/usr/local/anaconda/lib/python3.6/site-packages/graphviz/__init__.py


In [53]:
! cd /usr/local/anaconda/lib/python3.6/site-packages; ls

absl
absl_py-0.2.2.dist-info
alabaster
alabaster-0.7.10-py3.6.egg-info
anaconda_client-1.6.3-py3.6.egg-info
anaconda_navigator
anaconda_navigator-1.6.4-py3.6.egg-info
anaconda_project
anaconda_project-0.6.0-py3.6.egg-info
appdirs-1.4.3.dist-info
appdirs.py
asn1crypto
asn1crypto-0.22.0-py3.6.egg-info
astor
astor-0.6.2.dist-info
astroid
astroid-1.5.3-py3.6.egg-info
astropy
astropy-2.0.1-py3.6.egg-info
babel
Babel-2.5.0-py3.6.egg-info
backports
backports.shutil_get_terminal_size-1.0.0-py3.6.egg-info
bcrypt
bcrypt-3.1.4.dist-info
beautifulsoup4-4.6.0-py3.6.egg-info
binstar_client
bitarray
bitarray-0.8.1-py3.6.egg-info
bkcharts
bkcharts-0.2-py3.6.egg-info
blaze
blaze-0.10.1-py3.6.egg-info
bleach
bleach-1.5.0-py3.6.egg-info
bokeh
bokeh-0.12.7-py3.6.egg-info
boto
boto-2.48.0-py3.6.egg-info
boto3
boto3-1.7.26.dist-info
botocore
botocore-1.10.26.dist-info
bottleneck
Bottleneck-1.2.1-py3.6.egg-info
brewer2mpl
brewer2mpl-1.4.1.dist-info
bs4
bson
bs

In [9]:
import sys
print(sys.path)

sys.path.insert(0, "/home/myname/pythonfiles")

['', '/mnt/aic-8_2_8_tree-based-algorithms-mini-project', '/opt/spark-2.2.1-bin-hadoop2.7/python', '/opt/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip', '/opt/spark-2.4.0-bin-hadoop2.7/python', '/opt/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip', '/usr/local/anaconda/lib/python36.zip', '/usr/local/anaconda/lib/python3.6', '/usr/local/anaconda/lib/python3.6/lib-dynload', '/usr/local/anaconda/lib/python3.6/site-packages', '/usr/local/anaconda/lib/python3.6/site-packages/Sphinx-1.6.3-py3.6.egg', '/usr/local/anaconda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg', '/usr/local/anaconda/lib/python3.6/site-packages/IPython/extensions', '/home/ubuntu/.ipython']


In [56]:
import sys
sys.path.insert(0, "/home/ubuntu/.local/lib/python3.6/site-packages")

In [57]:
import dtreeviz

In [36]:
viz = dtreeviz(regr,
               boston.data,
               boston.target,
               target_name='price',
               feature_names=boston.feature_names)
              
viz.view() 

NameError: name 'boston' is not defined

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.
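One way to build intuition for why averaging many high-variance models helps is a small simulation. The sketch below is in plain Python and does not train any trees. Each "model" is just an unbiased but noisy estimate, and the true value, noise level and ensemble size are arbitrary assumptions. It also assumes the estimates are independent, which real trees in a forest are not.

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is repeatable


def noisy_estimate(true_value=0.7, noise=0.2):
    """Stand-in for one high-variance model: unbiased, but noisy."""
    return random.gauss(true_value, noise)


def averaged_estimate(n_models):
    """Stand-in for an ensemble: the mean of several independent noisy estimates."""
    return statistics.mean(noisy_estimate() for _ in range(n_models))


# Repeat each experiment many times to measure the spread of the estimates.
single = [noisy_estimate() for _ in range(2000)]
ensemble = [averaged_estimate(25) for _ in range(2000)]

var_single = statistics.variance(single)
var_ensemble = statistics.variance(ensemble)
print(var_single, var_ensemble)
```

With independent estimates, the variance of the average shrinks roughly in proportion to the number of models. Trees trained on bootstrap samples of the same data are correlated, so the reduction in a real forest is smaller, which is why Random Forests also randomize the features considered at each split.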

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see certain functionalities that this model, even though it's a bit of a "black box", provides for some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Your code here! :)

As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an array of numeric values, one per feature, which is hard to interpret directly. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
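As a rough sketch of the mechanics, the example below fits a forest on synthetic data and prints the importances sorted from largest to smallest, with a crude text bar next to each value. The data from `make_classification`, the `feature_{i}` names, and the text bars are assumptions made for illustration. A real plot would typically use matplotlib, as in the linked example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_toy.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

importances = forest.feature_importances_  # one value per feature
order = np.argsort(importances)[::-1]      # indices from most to least important

for idx in order:
    # Crude text bar chart: one '#' per percentage point of importance.
    bar = "#" * int(round(importances[idx] * 100))
    print(f"{feature_names[idx]:<10} {importances[idx]:.3f} {bar}")
```

Sorting before displaying is the main point: the raw array follows the column order of the training data, so the ranking only becomes visible once the indices are ordered by importance.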

Now you try! Let's visualize the importance of features from your Random Forests model!

In [None]:
# Your code here

A final method for gaining some insight into the inner workings of your Random Forests models is a so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The prediction function is fixed at a few values of the chosen features and averaged over the other features. A partial dependence plot can show if the relationship between the target and a feature is linear, monotonic or more complex.
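The averaging step described above can be written out in a few lines of plain Python. This is a minimal sketch of the definition, not how any particular library implements it. The `toy_model` function stands in for a fitted model's prediction function, and the rows and feature names are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction with `feature` forced to each grid value in turn."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)
            modified[feature] = value  # overwrite only the chosen feature
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical stand-in for a fitted model's prediction function.
def toy_model(row):
    return 2.0 * row["duration"] + row["age"] ** 2


# Hypothetical data rows, for illustration only.
rows = [
    {"duration": 1.0, "age": 1.0},
    {"duration": 2.0, "age": 2.0},
    {"duration": 3.0, "age": 3.0},
]

pd_curve = partial_dependence(toy_model, rows, "duration", [0.0, 1.0, 2.0])
print(pd_curve)
```

Because the toy model is linear in `duration`, the resulting curve rises by 2 for every unit step on the grid. A fitted Random Forest would usually produce a step-like or otherwise more complex curve.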

In scikit-learn, PDPs are implemented and available for certain algorithms, but at this point (version 0.20.0) they are not yet implemented for Random Forests. Thankfully, there is an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which adds this functionality to Random Forests. The package is easy to install through pip.

In [None]:
! pip install pdpbox

While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [None]:
# Your code here!

## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (which use Bagging, a.k.a. Bootstrap Aggregation) was developed using Boosting. The first of these was Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
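For intuition about how boosting differs from bagging, here is a deliberately simplified sketch in plain Python. It handles squared-error regression on a single feature, with depth-one trees (stumps) as the weak learners. It is not the scikit-learn implementation, and the toy data, the number of rounds and the learning rate are arbitrary assumptions. Each round fits a stump to the current residuals, which for squared error are the negative gradient of the loss, and adds a damped version of it to the running prediction.

```python
def fit_stump(xs, residuals):
    """Find the threshold split of one feature that best fits the residuals."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return the training mean squared error after each boosting round."""
    base = sum(ys) / len(ys)            # start from a constant prediction
    predictions = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return history


# Hypothetical toy regression problem: y = x^2 on a small grid.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x ** 2 for x in xs]

mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The contrast with Random Forests is that the learners here are built sequentially, each one correcting the errors left by the previous ones, instead of being trained independently and averaged.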

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy, and functionality to the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
! conda install -c anaconda py-xgboost

In [None]:
! conda install -c conda-forge catboost

In [None]:
! conda install -c conda-forge lightgbm

Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

The final deliverable of this section should be a table (can be a pandas DataFrame) which shows the accuracy of all five algorithms taught in this mini project in one place.

Happy modeling! :)