# Intro (10 mins)
- introduction
- how it is related to Datathon tasks
- training objectives and agenda

# Environment (20 mins)
How to set up:
- beginners: https://colab.research.google.com
- advanced: [instructions to set up **Docker** container](https://github.com/AntonP84/ml-intro-2019-07/blob/master/README.md)

[Jupyter](https://jupyter.org/) UI overview:
- cells and cell types (code/markdown), output
- kernels (Python3)
- hotkeys (Ctrl+Enter)
- navigation, add/remove cells

In [None]:
# install packages that are not available in the standard Google Colab environment
# restart the runtime after the installation finishes
!pip install -q --user shap pdpbox

In [None]:
import numpy as np
import pandas as pd

from pandas_profiling import ProfileReport

import matplotlib.pyplot as plt
%matplotlib inline

# Data (30 mins)
- all the data is merged into a single flat table
- one column is a **target**, other columns are **features**
- data profiling/exploration step helps to understand the data and shape further efforts

## Read data
source: `https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-HR-Employee-Attrition.csv`

In [None]:
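try:
    # read the dataset from the workshop repository
    url = 'https://github.com/AntonP84/ml-intro-2019-07/raw/master/data/WA_Fn-UseC_-HR-Employee-Attrition.csv'
    df = pd.read_csv(url)
except Exception:
    # fall back to the local copy if the download fails
    df = pd.read_csv('./data/WA_Fn-UseC_-HR-Employee-Attrition.csv')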

df['target'] = (df['Attrition'] == 'Yes').astype(int)

df.head()

In [None]:
print('The dataframe consists of:')
print(f'- {df.shape[0]} rows ')
print(f'- and {df.shape[1]} columns:')
df.columns

**to read later**: 
- [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) tutorial

## Explore
- profiling: check values and distributions of columns
- graphs and plots
- suggest ideas for data cleaning and transformations

### Manual

In [None]:
df_attrition = df['Attrition'].value_counts(normalize=True)
df_attrition

In [None]:
# plots with Pandas API
df_attrition.plot.bar(title='Attrition rate is %.2f' % df_attrition["Yes"]);

In [None]:
# plots with Pandas API
df['Age'].plot.hist()

# you can use matplotlib for customization
plt.xlabel('Age')
plt.title('Histogram for Age');

In [None]:
# one line to get a simple plot
df['StockOptionLevel'].value_counts().plot.bar();

... but you are not expected to do it manually for every single column

**to read later**:
- pandas Visualization [Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)

### Automated with pandas_profiling

- basic plots and summaries for every column
- warnings about potential issues


In [None]:
ProfileReport(df)

**for discussion**:
- some columns (e.g., `Over18` and `StandardHours`) are constant and useless for analysis. Check it. Should we drop them?
- some columns (e.g., `StockOptionLevel` and `YearsSinceLastPromotion`) have many zeroes. Check it. Should we drop them?
- some columns (e.g., `MonthlyIncome` and `JobLevel`) are highly correlated. Should we drop them?
- `EmployeeNumber`, is it employee_id? Should we keep it?

- any ideas for Feature Engineering?
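
The checks suggested above can be sketched on a small toy table. The column names mirror the HR dataset, but the values below are made up for illustration, and `toy`, `constant_cols`, `zero_share` and `corr` are hypothetical names that are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for the HR table; the values are made up for illustration
toy = pd.DataFrame({
    'Over18': ['Y', 'Y', 'Y', 'Y', 'Y'],
    'StockOptionLevel': [0, 0, 1, 0, 2],
    'MonthlyIncome': [2000, 3000, 5000, 9000, 15000],
    'JobLevel': [1, 1, 2, 3, 5],
})

# Constant columns have a single distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Share of zeroes in each numeric column
zero_share = (toy.select_dtypes('number') == 0).mean()

# Pearson correlation between two numeric columns
corr = toy['MonthlyIncome'].corr(toy['JobLevel'])

print('Constant columns:', constant_cols)
print('Share of zeroes:')
print(zero_share)
print(f'Correlation MonthlyIncome vs JobLevel: {corr:.2f}')
```

The same three expressions should work on `df` itself; whether a column is then dropped remains a judgment call for the discussion.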

In [None]:
df = df.set_index('EmployeeNumber')

## Reminder: Business case slide

## Feature Engineering

Encode your domain expertise, intuition and ideas into new columns to help ML perform better / faster.

In [None]:
# new columns: years as a share (fraction) of YearsAtCompany
# note: rows with YearsAtCompany == 0 produce NaN or inf because of division by zero
cols_years = ('YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager')
cols_new = [v + '_pct' for v in cols_years]

for col_years, col_new in zip(cols_years, cols_new):
    df[col_new] = df[col_years] / df['YearsAtCompany']
    
print('Added derived columns, with percentage of years:')
print(cols_new)

In [None]:
# to check what we got
df[cols_new].describe().round(2)

In [None]:
df.shape

## Transform
for ML you need:
- **target**: column you want to predict, denoted as *y*
- **features**: columns you use for prediction, denoted as *X*

In [None]:
y = df['target']
X = df.drop(columns=['Attrition', 'target'])

### Categorical columns
for ML you need *numerical* features. Apply `LabelEncoder` to:
- columns with textual names of categories
- columns with numeric codes of categories

In [None]:
print('Before')
X[['BusinessTravel', 'StockOptionLevel']].head()

In [None]:
from sklearn.preprocessing import LabelEncoder


cols_str = X.select_dtypes('object').columns.tolist()
cols_coded = ['JobLevel', 'StockOptionLevel', 'Education', 
              'EnvironmentSatisfaction', 'JobInvolvement', 'JobSatisfaction', 
              'PerformanceRating', 'RelationshipSatisfaction', 'WorkLifeBalance'
             ]

cols_categorical = cols_str + cols_coded

for col in cols_categorical:
    X[col] = LabelEncoder().fit_transform(X[col])

In [None]:
print('After')
X[['BusinessTravel', 'StockOptionLevel']].head()

**to read later**:
- examples in `LabelEncoder` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

**for discussion**:
- should we apply [other preprocessing transformations](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) from `sklearn` package (e.g., scaling, normalization, one-hot-encoding)?
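
As a starting point for this discussion, here is a minimal sketch of one-hot encoding and scaling on toy data. It uses `pd.get_dummies` as a stand-in for a one-hot encoder and `StandardScaler` from `sklearn`. The frame `toy` and its values are made up, and this is only one possible way to apply these transformations.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; the values are made up for illustration
toy = pd.DataFrame({
    'BusinessTravel': ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel', 'Travel_Rarely'],
    'MonthlyIncome': [2000.0, 4000.0, 6000.0, 8000.0],
})

# One-hot encoding: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=['BusinessTravel'], dtype=int)

# Scaling: shift the column to zero mean and unit variance
scaled = StandardScaler().fit_transform(encoded[['MonthlyIncome']])
encoded['MonthlyIncome'] = scaled.ravel()

print(encoded)
```

For tree-based models, scaling usually does not change the result, because splits depend only on the ordering of values; it matters more for linear models and distance-based methods.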

# Machine Learning model (40 mins)

## Sampling
split the data into two parts:
- **train** sample is used to train the model
- **test** sample is used to test model performance

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2019)

print('Size of the train sample:', len(X_train))
print('Size of the test sample:', len(X_test))
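
When the target is imbalanced, a stratified split may be worth considering: the `stratify` argument of `train_test_split` keeps the class ratio the same in both samples. The sketch below uses made-up toy data (`X_toy`, `y_toy` are hypothetical names) and is not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positive class (made-up values)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the class ratio the same in both samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2019, stratify=y_toy
)

print('Positive share in train:', y_tr.mean())
print('Positive share in test:', y_te.mean())
```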

## Train model
- **Gradient Boosting** is a fast, accurate and flexible ML model
- details are not on the agenda

the workflow is:
1. **train** the model on the train sample
1. **test** the model on the test sample
1. **use** the model to get predictions and other info

In [None]:
%%time
from lightgbm import LGBMClassifier

model = LGBMClassifier()  # with default parameters
model.fit(X_train, y_train, categorical_feature=cols_categorical);  # trailing ";" suppresses the cell output

**for discussion**:
- can we get a good model without knowing what is inside using default parameters?
- ML model training was fast, is it ok?

## Evaluate model
Select evaluation metric according to the use case.

There are many metrics (cf. [metrics available in `sklearn`](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)), but **precision** and **recall** are the ones used most often.

**Precision at k** for the current problem:
1. score the employees
1. select top 5% most likely to leave the company
1. among those, calculate the share of employees who actually left the company
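
The three steps above can be sketched on toy data as follows. The helper `precision_at_k` and the arrays `scores` and `left` are hypothetical names with made-up values; the sketch selects the top rows by sorting, which is one possible way to implement the idea.

```python
import numpy as np

def precision_at_k(y_true, scores, share=0.05):
    """Share of actual positives among the top `share` highest-scored rows."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(round(len(scores) * share)))  # select at least one row
    top_k = np.argsort(scores)[::-1][:k]         # indices of the k highest scores
    return y_true[top_k].mean()

# Toy example: 20 employees with made-up scores and outcomes
scores = np.linspace(0.05, 1.0, 20)   # the last employee has the highest score
left = np.zeros(20, dtype=int)
left[[19, 17, 10]] = 1                # three employees actually left

print(precision_at_k(left, scores, share=0.05))  # top 1 employee
print(precision_at_k(left, scores, share=0.10))  # top 2 employees
```

The notebook cells apply the same idea through a percentile threshold on the scores instead of an explicit sort.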

In [None]:
# score the employees
predictions_test = model.predict_proba(X_test)[:, 1]
predictions_test[:10]

In [None]:
from sklearn.metrics import precision_score, recall_score

threshold = np.percentile(predictions_test, 95)
precision = precision_score(y_test, predictions_test > threshold)
print(f'Model precision is {precision:.1%}')

In [None]:
print('Employees most likely to leave the company:')

(
    X_test
    .assign(confidence=predictions_test)
    .sort_values(by='confidence', ascending=False)['confidence']
    .head()
)

## Reveal driving factors

In [None]:
import shap

# load JS visualization code to notebook
shap.initjs()

# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

### global
aka **driving factors** - which factors the model considers to be the most important for the whole dataset.

In [None]:
shap.summary_plot(shap_values, X_test, plot_type="bar")

### local
which factors drive the outcome *for each particular data point*

In [None]:
# visualize the prediction with the highest attrition score
id_ = predictions_test.argmax()

employee_number = X_test.iloc[id_].name
features = df.drop(columns=['target', 'Attrition']).loc[employee_number, :]

print('Model output:')
print(predictions_test[id_])
print()

print('Feature values:')
print(features)

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[id_,:], features)  

In [None]:
# visualize the prediction with the lowest attrition score
id_ = predictions_test.argmin()

employee_number = X_test.iloc[id_].name
features = df.drop(columns=['target', 'Attrition']).loc[employee_number, :]

print('Model output:')
print(predictions_test[id_])
print()

print('Feature values:')
print(features)

shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[id_,:], features)  

### explore relationships
to gain insights and understand the relationship between attrition and key factors

In [None]:
from pdpbox import pdp, info_plots

In [None]:
info_plots.target_plot(
    df=df, feature='MonthlyIncome', feature_name='MonthlyIncome', target='target'
);

In [None]:
info_plots.target_plot(
    df=df, feature='Age', feature_name='Age', target='target'
);

In [None]:
df_test = X_test.copy()
df_test['predictions'] = predictions_test
df_test['target'] = y_test

info_plots.target_plot(
    df=df_test, feature='predictions', feature_name='predictions', target='target'
);

**for discussion:**
- we calculated *Precision@k*, but never used it. Why?
- we have an ML model. What is the next step?

**exercise**:
- can you make the ML model more precise by changing its parameters? The parameters are described [in the documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html#core-parameters), and there are many of them. Hint: as the model is based on decision *trees*, you can try `num_trees`, `num_leaves` and `min_data_in_leaf` as a starting point.

# Deployment of the ML model (10 mins)
**Warning**: Colab is not the best place for model deployment. Switch back to your environment.

## [MLFlow](https://mlflow.org/) Intro

## Track experiments

In [None]:
import mlflow
import mlflow.sklearn


class Classifier(LGBMClassifier):
    def predict(self, X):
        return self.predict_proba(X)[:, 1]
    

def make_experiment(num_leaves=31):
    """create an experiment for different values of num_leaves"""
    with mlflow.start_run():
        # train model
        model = Classifier(num_leaves=num_leaves)       # try non-default parameter values
        model.fit(X_train, y_train, categorical_feature=cols_categorical)

        # evaluate model
        predictions_test = model.predict(X_test)
        threshold = np.percentile(predictions_test, 95)
        y_pred = predictions_test > threshold
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)           # add recall metric

        # log results and save artifacts
        mlflow.log_param("num_leaves", num_leaves)
        mlflow.log_metric("precision", precision)
        mlflow.log_metric("recall", recall)
        mlflow.sklearn.log_model(model, "model")
    return

In [None]:
# num_leaves=31 by default
for num_leaves in [2, 4, 8, 16, 31, 64]:
    make_experiment(num_leaves)

In [None]:
# run and go to MLFlow UI http://localhost:5000
# record the id of default model
!mlflow server --host 0.0.0.0

## Serving the model

In [None]:
%env MLFLOW_CONDA_HOME=/home/user/anaconda3

In [None]:
!mlflow models serve -m ./mlruns/0/a93fe5e7501e4f9dbd21ebafcc94bdbe/artifacts/model -p 1234 --no-conda --host 0.0.0.0

**exercise**:
- verify predictions in the notebook vs. in the served model

In [None]:
id_ = predictions_test.argmin()
features = X_test.iloc[[id_], :]

# values of features for the request
request_data = features.to_json(orient='split')

In [None]:
command =  'curl -X POST -H "Content-Type:application/json; format=pandas-split" '
command += f"--data '{request_data}' "
command += 'http://127.0.0.1:1234/invocations'

print('Command to execute:\n')
print(command)
print()

print('Verify predictions. You are expected to get the following value from the served model:')
print(predictions_test[id_])
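
To see what the request body looks like, the sketch below builds the same kind of payload from a toy single-row frame and parses it back. The frame `features` here is a made-up stand-in for `X_test.iloc[[id_], :]`, with hypothetical values.

```python
import json
import pandas as pd

# Toy single-row frame standing in for X_test.iloc[[id_], :]; values are made up
features = pd.DataFrame({'Age': [35], 'MonthlyIncome': [5000]}, index=[1001])

# orient='split' stores column names, index and data rows under separate keys
payload = features.to_json(orient='split')
parsed = json.loads(payload)

print(payload)
print(sorted(parsed.keys()))
```

The `columns` and `data` keys carry the feature names and values that the served model receives.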

# Q&A (10 mins)