<img src="https://rhyme.com/assets/img/logo-dark.png" align=center></img>
<h2 align="center">Predict Employee Churn with Decision Trees and Random Forests</h2>

### Task 1: Import Libraries
---

In [3]:
from __future__ import print_function
%matplotlib inline
import os
import warnings
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as image
import pandas as pd
import pandas_profiling
plt.style.use("ggplot")
warnings.simplefilter("ignore")

In [4]:
plt.rcParams['figure.figsize'] = (12,8)

### Task 2: Exploratory Data Analysis
---

In [5]:
df = pd.read_csv("data/employee_data.csv")
df.head()

ImportError: cannot import name 'is_url' from 'pandas.io.common' (/home/sam/anaconda3/lib/python3.7/site-packages/pandas/io/common.py)

   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157   
1                0.80             0.86               5                   262   
2                0.11             0.88               7                   272   
3                0.72             0.87               5                   223   
4                0.37             0.52               2                   159   

   time_spend_company  Work_accident  quit  promotion_last_5years department  \
0                   3              0     1                      0      sales   
1                   6              0     1                      0      sales   
2                   4              0     1                      0      sales   
3                   5              0     1                      0      sales   
4                   3              0     1                      0      sales   

   salary  
0     low  
1  medium  
2 

In [None]:
df.profile_report(title="Data Report")

### Task 3: Encode Categorical Features
---
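
One possible way to approach this task is one-hot encoding with `pd.get_dummies`. The sketch below runs on a small toy frame rather than the real data. The `hr` department and `high` salary values are assumptions made for illustration, since the preview above only shows `sales`, `low`, and `medium`.

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset.
# "hr" and "high" are hypothetical category values.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "department": ["sales", "sales", "hr"],
    "salary": ["low", "medium", "high"],
})

# One-hot encode both categorical columns; dtype=int yields 0/1 columns.
encoded = pd.get_dummies(toy, columns=["department", "salary"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, which suits tree-based models in scikit-learn, since they expect numeric inputs.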

### Task 4: Visualize Class Imbalance
---

In [None]:
from yellowbrick.target import ClassBalance
plt.style.use("ggplot")
plt.rcParams['figure.figsize'] = (12,8)
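
Before plotting, it can help to look at the raw class shares. The following is a minimal sketch using plain pandas instead of the `ClassBalance` visualizer imported above. The `quit_flags` series is toy data standing in for the `quit` column.

```python
import pandas as pd

# Toy target column; in the notebook this role would be played by df["quit"].
quit_flags = pd.Series([1, 0, 0, 0, 1, 0, 0, 0], name="quit")

counts = quit_flags.value_counts().sort_index()  # absolute count per class
shares = counts / counts.sum()                   # relative frequency per class
print(shares)
```

A strongly uneven split between the two classes is a hint that plain accuracy may be misleading and that a stratified train/test split is worth considering.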

### Task 5: Create Training and Test Sets
---
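
A minimal sketch of a stratified split with scikit-learn's `train_test_split`, using synthetic arrays in place of the encoded features and the `quit` target. The 80/20 ratio and `random_state=0` are assumptions, not values prescribed by the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 rows, 2 features, 25% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 75 + [1] * 25)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```

Stratifying matters when the classes are imbalanced, because a purely random split could leave the test set with too few examples of the minority class.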

### Tasks 6 & 7: Build an Interactive Decision Tree Classifier
---

Supervised learning:
- The inputs are random variables $X = (X_1, ..., X_p)$;
- The output is a random variable $Y$.

- The data is a finite learning set $$\mathbb{L}=\{(x_i,y_i) \mid i=0, ..., N-1\}$$
where $x_i \in \mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_p$ and $y_i \in \mathcal{Y}$ are randomly drawn from $P_{X,Y}$.

E.g., $(x_i,y_i)=((\text{salary = low, department = sales, ...}),\text{quit = 1})$

- The goal is to find a model $\varphi_\mathbb{L}: \mathcal{X} \to \mathcal{Y}$ minimizing the expected loss $$\text{Err}(\varphi_\mathbb{L}) = \mathbb{E}_{X,Y}\{L(Y, \varphi_\mathbb{L}(X))\},$$ where $L$ is a loss function such as the 0-1 loss.
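
The expectation in the error above cannot be computed directly, because $P_{X,Y}$ is unknown. In practice it is usually estimated by averaging the loss over held-out examples. The sketch below assumes the 0-1 loss; `empirical_error` is a hypothetical helper name, and the labels are toy values.

```python
def empirical_error(y_true, y_pred):
    """Average 0-1 loss: the fraction of examples where prediction != label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


y_true = [1, 0, 0, 1, 0]  # toy labels (1 = quit)
y_pred = [1, 0, 1, 1, 0]  # toy predictions with a single mistake
err = empirical_error(y_true, y_pred)
print(err)
```

Under the 0-1 loss this estimate is simply one minus the accuracy.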

About:
 
- Decision trees are non-parametric models that can capture arbitrarily complex relations between inputs and outputs without any a priori assumptions.

- They handle both numeric and categorical variables.

- They implement feature selection, making them robust to noisy features (to an extent).

- They are robust to outliers and to errors in labels.

- They are easily interpretable, even by non-ML practitioners.

#### Decision trees: partitioning the feature space

![partition](assets/images/partition-feature-space.png)

- Decision trees generally have low bias but have high variance.
- We will solve the high variance problem in Task 8.
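
To make the partitioning idea concrete, here is a minimal sketch of how a single split on one numeric feature could be chosen by minimizing weighted Gini impurity. It assumes binary 0/1 labels, and the helper names `gini` and `best_threshold` are hypothetical. The feature values and labels are toy data. This is an illustration of the principle, not scikit-learn's actual implementation.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n          # share of class 1
    return 2.0 * p1 * (1.0 - p1)  # equals 1 - p0**2 - p1**2 for two classes


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: no split at all
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2.0, score)  # midpoint between neighbours
    return best


satisfaction = [0.11, 0.37, 0.38, 0.72, 0.80]  # toy feature values
quit_label = [1, 1, 1, 0, 0]                   # toy labels
threshold, impurity = best_threshold(satisfaction, quit_label)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the two resulting regions, which is what produces the rectangular partition shown in the figure.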

In [None]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import export_graphviz # display the tree within a Jupyter notebook
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
from ipywidgets import interactive, IntSlider, FloatSlider, interact
import ipywidgets
from IPython.display import Image
from subprocess import call
import matplotlib.image as mpimg

### Task 8: Build an Interactive Random Forest Classifier
---

Although randomization increases the bias of individual trees, averaging over many randomized trees can reduce the variance of the ensemble enough to more than compensate. Random forests are among the most robust machine learning algorithms across a variety of problems.

- Randomization and averaging lead to a reduction in variance and improve accuracy
- The implementations are parallelizable
- Memory consumption and training time can be reduced by bootstrapping
- Sampling features and not solely sampling examples is crucial to improving accuracy
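
The points about averaging and feature sampling can be illustrated with a standard textbook result, which is not part of the original notebook: for identically distributed predictors with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The sketch below just evaluates that formula; `ensemble_variance` is a hypothetical helper name.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors.

    Assumes every predictor has variance sigma2 and every pair of
    predictors has the same correlation rho.
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# More trees shrink only the second term; lower correlation shrinks the first.
for rho in (0.9, 0.5, 0.1):
    print(rho, ensemble_variance(1.0, rho, 100))
```

Adding trees cannot push the variance below $\rho\sigma^2$, which is one way to see why decorrelating the trees through feature sampling matters in addition to bootstrapping the examples.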

In [None]:
@interact
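def plot_tree_rf(crit=["gini", "entropy"],
                 bootstrap=[True, False],
                 depth=IntSlider(min=1, max=30, value=3, continuous_update=False),
                 forests=IntSlider(min=1, max=200, value=100, continuous_update=False),
                 min_split=IntSlider(min=2, max=5, value=2, continuous_update=False),
                 min_leaf=IntSlider(min=1, max=5, value=1, continuous_update=False)):
    # The widget options and slider ranges above are example values; adjust as needed.
    # Assumes X_train, X_test, y_train, y_test were created in Task 5.
    estimator = RandomForestClassifier(random_state=1, criterion=crit, bootstrap=bootstrap,
                                       n_estimators=forests, max_depth=depth,
                                       min_samples_split=min_split, min_samples_leaf=min_leaf,
                                       n_jobs=-1)
    estimator.fit(X_train, y_train)
    # Report accuracy on both sets to make overfitting visible.
    print('Random Forest Training Accuracy: {:.3f}'.format(accuracy_score(y_train, estimator.predict(X_train))))
    print('Random Forest Test Accuracy: {:.3f}'.format(accuracy_score(y_test, estimator.predict(X_test))))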

### Task 9: Feature Importance and Evaluation Metrics
---

In [None]:
from yellowbrick.model_selection import FeatureImportances
plt.rcParams['figure.figsize'] = (12,8)
plt.style.use("ggplot")

In [None]:
### Type code below this line ###





In [None]:
## Plotting Code ##
from yellowbrick.classifier import ROCAUC

visualizer = ROCAUC(rf, classes=["stayed", "quit"])

visualizer.fit(X_train, y_train)        # Fit the training data to the visualizer
visualizer.score(X_test, y_test)        # Evaluate the model on the test data
visualizer.poof();
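
As a reminder of what the area under the ROC curve measures: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs. `pairwise_auc` is a hypothetical helper, and the scores are toy values, not output from the model above.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]            # toy labels (1 = quit)
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted probabilities of quitting
auc = pairwise_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the value is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.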

In [None]:
## Plotting Code ##
# Only the non-default settings are passed. `presort` and `min_impurity_split`
# are omitted because recent scikit-learn releases no longer accept them.
dt = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=0)

visualizer = ROCAUC(dt, classes=["stayed", "quit"])

visualizer.fit(X_train, y_train)        # Fit the training data to the visualizer
visualizer.score(X_test, y_test)        # Evaluate the model on the test data
visualizer.poof();

### Optional: Comparison with Logistic Regression Classifier
---

In [None]:
from sklearn.linear_model import LogisticRegressionCV

logit = LogisticRegressionCV(random_state=1, n_jobs=-1,max_iter=500,
                             cv=10)

lr = logit.fit(X_train, y_train)

print('Logistic Regression Accuracy: {:.3f}'.format(accuracy_score(y_test, lr.predict(X_test))))

visualizer = ROCAUC(lr, classes=["stayed", "quit"])

visualizer.fit(X_train, y_train)        # Fit the training data to the visualizer
visualizer.score(X_test, y_test)        # Evaluate the model on the test data
visualizer.poof();

In [2]:
! pip install pandas_profiling

Collecting pandas_profiling
  Downloading https://files.pythonhosted.org/packages/b9/94/ef8ef4517540d13406fcc0b8adfd75336e014242c69bd4162ab46931f36a/pandas_profiling-2.8.0-py2.py3-none-any.whl (259kB)
Collecting tangled-up-in-unicode>=0.0.6 (from pandas_profiling)
  Downloading https://files.pythonhosted.org/packages/4a/e2/e588ab9298d4989ce7fdb2b97d18aac878d99dbdc379a4476a09d9271b68/tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1MB)
Collecting requests>=2.23.0 (from pandas_profiling)
  Downloading https://files.pythonhosted.org/packages/45/1e/0c169c6a5381e241ba7404532c16a21d86ab872c9bed8bdcd4c423954103/requests-2.24.0-py2.py3-none-any.whl (61kB)
Collecting matplotlib>=3.2.0 (from pandas_profiling)
  Downloading https://files.pythonhosted.org/pack

Building wheels for collected packages: confuse, htmlmin, imagehash
  Building wheel for confuse (setup.py) ... done
  Stored in directory: /home/sam/.cache/pip/wheels/f6/8b/23/41a1b516f6d8d4cc81f5bdb55394a47cdbe9659c53668d3c9e
  Building wheel for htmlmin (setup.py) ... done
  Stored in directory: /home/sam/.cache/pip/wheels/43/07/ac/7c5a9d708d65247ac1f94066cf1db075540b85716c30255459
  Building wheel for imagehash (setup.py) ... done
  Stored in directory: /home/sam/.cache/pip/wheels/07/1c/dc/6831446f09feb8cc199ec73a0f2f0703253f6ae013a22f4be9
Successfully built confuse htmlmin imagehash
ERROR: phik 0.10.0 has requirement joblib>=0.14.1, but you'll have joblib 0.13.2 which is incompatible.
Installing collected packages: tangled-up-in-unicode, requests, matplotlib, pandas, scipy, phik, confuse, htmlmin, jinja2, missingno, ipywidgets, networkx, attrs, imagehash, visions, tqdm, astropy, pandas-profiling
  Found existing installation: requests 2

    Uninstalling tqdm-4.32.1:
      Successfully uninstalled tqdm-4.32.1
  Found existing installation: astropy 3.2.1
    Uninstalling astropy-3.2.1:
      Successfully uninstalled astropy-3.2.1
Successfully installed astropy-4.0.1.post1 attrs-19.3.0 confuse-1.1.0 htmlmin-0.1.12 imagehash-4.1.0 ipywidgets-7.5.1 jinja2-2.11.2 matplotlib-3.2.2 missingno-0.4.2 networkx-2.4 pandas-1.0.5 pandas-profiling-2.8.0 phik-0.10.0 requests-2.24.0 scipy-1.4.1 tangled-up-in-unicode-0.0.6 tqdm-4.46.1 visions-0.4.4

