In [None]:
import os
import pandas as pd
import numpy as np
import hashlib
import io
import json
import pickle
import requests
import joblib
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp
from plotchecker import LinePlotChecker, ScatterPlotChecker, BarPlotChecker

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import precision_score, recall_score, precision_recall_curve
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

plt.rcParams['figure.figsize']=(4.8, 3.6)    

In the previous BLU, we worked with the "Velho Banco" data set, with the goal of designing a system that predicts whether a given individual earns more than 50K a year.

As a reminder, each row in the dataset is about a client, and here's the attribute information:

    1) age - client's age
    2) workclass - type of work performed by the client (eg. `Private`)
    3) fnlwgt - final weight assigned by the Census Bureau: if two samples have the same (or similar) fnlwgt they have similar characteristics, demographically speaking
    4) education - level of education of client (eg. `Bachelors`)
    5) education-num - numerically encoded level of education
    6) marital-status - client's marital status (eg. `Widowed`)
    7) occupation - type of job held by the client (eg. `Craft-repair`)
    8) relationship - family position of the client
    9) race - client's race
    10) sex - "male"/"female"
    11) capital-gain - total capital gain in the previous year
    12) capital-loss - total capital loss in previous year
    13) hours-per-week - number of hours the client works per week
    14) native-country - client's original nationality (eg. `Portugal`)

The original data set is located in `data/bank.csv`.

Recently your client has provided you with a new set of observations, located in `data/bank_new_observations.csv`. This is your goal:

- Assess how your model performs with the new dataset
- Assess if there have been any changes to the data 
- Deploy a new model

Start by loading the old and new data and have a quick look at it.

In [None]:
def load_data(file):
    df = pd.read_csv(os.path.join("data", file))
    return df
df_original = load_data("bank.csv")
df_new = load_data("bank_new_observations.csv")
target='salary'
df_new.head()

## Exercise 1 - Data drift

One of the most important things is to check if the data has changed over time. This data is not timestamped, so the approach will be to check if the distribution has changed from one data set to the other. Let's begin!

**Important note about the grading**: Grading plots is difficult! We are using [`plotchecker`](https://github.com/jhamrick/plotchecker) to grade the plots with `nbgrader`. For `plotchecker` to work with nbgrader, we need to have the following line in the solution cell, after the code required to do the plot:

```
axis = plt.gca();
```

### Exercise 1.1 - Feature distributions

Start by building a function to plot the histogram of the given feature and data set.

In [None]:
def generate_distribution_histogram(dataframe, 
                                    column_name, 
                                    title, x_axis_label, y_axis_label,
                                    label_name,
                                    number_bins = 15):
    """
    This function generates a histogram for the column_name feature from the dataframe.
    Args:
        dataframe (pd.DataFrame): Input dataframe
        column_name (str): Name of the column to plot
        title (str): Title of the histogram
        x_axis_label (str): X-axis label
        y_axis_label (str): Y-axis label
        label_name (str): plot label in the legend
        number_bins (int): number of bins of the histogram
    Returns:
        Histogram of the column_name column from dataframe
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    plt.legend(loc='upper right')
    axis = plt.gca();
    return axis

Plot the histogram:

In [None]:
axis = generate_distribution_histogram(df_new, 'age',
                                title = 'Age Distribution: Velho Banco',
                                x_axis_label = 'Age (years)',
                                y_axis_label = 'Frequency',
                                label_name = 'Age')

In [None]:
pc = BarPlotChecker(axis)
l = [pc.xlabel] + [pc.ylabel]

assert hashlib.sha256(json.dumps(''.join(l)).encode()).hexdigest() == \
'96587ec8128cfd651ee75e3659b87264b9140ed983af5982a367780577004501', 'Did you set the correct axis labels?'
assert hashlib.sha256(json.dumps(pc.title).encode()).hexdigest() == \
'6f3ef4e639562eaec813bcbe1260e2a0d059446a7dc0f852eb996f48bfe6c0d1', 'Did you set the correct plot title?'
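try:
    pc.assert_num_bars(15)
except Exception as err:
    # Surface the hint instead of silently discarding it
    raise AssertionError("Did you set the right number of bins?") from err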
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in pc.heights])).encode()).hexdigest() == \
'b274d2f78fd524e2123991f138d515ad7a6ee24f9317eaa935625b0a7121a75e', 'Did you choose the correct variable and plot type?'
assert hashlib.sha256(json.dumps(' '.join([str(round(i,2)) for i in pc.centers])).encode()).hexdigest() == \
'3fcabfacf502b2c70af55c6cd38ca0dd7e108e0772f782ece147a502540f8abc', 'Did you choose the correct variable and plot type?'

### Exercise 1.2 - Old and new age
Let's use your function to look at the "Age" variable of the original and new data. Examine the plot and answer the question below.

In [None]:
axis = generate_distribution_histogram(df_original, 'age',
                                title = 'Age Distribution: Velho Banco',
                                x_axis_label = 'Age (years)',
                                y_axis_label = 'Frequency',
                                label_name = 'Original data')
axis = generate_distribution_histogram(df_new, 'age',
                                title = 'Age Distribution: Velho Banco',
                                x_axis_label = 'Age (years)',
                                y_axis_label = 'Frequency',
                                label_name = 'New data')

Do you think there was a change in the distribution? Answer with `"yes"` or `"no"`.

In [None]:
#answer_1_2 = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert hashlib.sha256(json.dumps(answer_1_2).encode()).hexdigest() == \
'04a06452677210a3cdaec376fd5ebbca1714cb7af9e62bf5cce1644310a9086a', "Look again!"

### Exercise 1.3 K-S statistic

How sure are you about your answer to the previous question? That's what statistics is for! 

As we covered in the learning notebook, the [K-S statistic](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html) is a useful tool to check whether the new data belongs to the same distribution as the original data. Let's use it to test our hypothesis. Our null hypothesis is that the two distributions are the same, and we test it at a 95% confidence level: if the p-value is below 0.05, we reject the null hypothesis; otherwise, we keep it.
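
As a minimal sketch of how the test behaves, the snippet below runs `ks_2samp` on synthetic samples. The data, seed, and variable names are made up for illustration and have nothing to do with the Velho Banco data: one pair of samples is drawn from the same distribution, the other pair from distributions with different means.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
reference = rng.normal(loc=40, scale=10, size=1000)
same_dist = rng.normal(loc=40, scale=10, size=1000)
shifted = rng.normal(loc=50, scale=10, size=1000)

# ks_2samp returns the K-S statistic and the p-value
p_same = float(ks_2samp(reference, same_dist).pvalue)
p_shifted = float(ks_2samp(reference, shifted).pvalue)

alpha = 0.05  # significance level matching a 95% confidence level
print(f"same distribution:    p={p_same:.4f}, reject H0: {p_same < alpha}")
print(f"shifted distribution: p={p_shifted:.4g}, reject H0: {p_shifted < alpha}")
```

The shifted sample should give a p-value far below 0.05, while the p-value for two draws from the same distribution varies with the seed.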

Implement the function below to return the p-value of the K-S statistic for the distribution of the selected feature from the new and original data sets.

In [None]:
def get_ks_test(feature, training_df, new_df):
    """
    Returns the p-value of the K-S test for the distribution of the same feature in two datasets.
    Args:
        feature (str): feature name
        training_df (pd.DataFrame): dataframe used to train the model
        new_df (pd.DataFrame): dataframe with the new data
    Returns:
        p-value of the K-S statistic for feature in training_df and new_df
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return pvalue

In [None]:
pvalue = get_ks_test('age', df_original, df_new)
assert isinstance(pvalue, float), "Are you returning just the p-value?"
np.testing.assert_almost_equal(pvalue, 0.9331, decimal=3, 
                err_msg="The p-value is not correct.")
print(f'p-value: {pvalue}')

With a p-value as high as this, it's pretty safe to say that for the age, there is no difference in the distribution of values. What about the other features?

### Exercise 1.4 - Statistics for all features

Using the function above, check if there is any significant change in any feature distribution for the new and training data sets.

Calculate the p-values for all the features in the data sets. Save the results in a dictionary called `ks_test_dict` and don't forget to exclude the target variable.

In [None]:
ks_test_dict = {}
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(ks_test_dict, dict), 'The result should be a dictionary.'
assert hashlib.sha256(json.dumps(' '.join(sorted(ks_test_dict.keys()))).encode()).hexdigest() == \
'5af9f4d5e6510a7564b1938b29cf39e500c237e2477f3d9025f4e93dd3578c57', "Have you excluded the target variable?"
np.testing.assert_almost_equal(sum(ks_test_dict.values()), 11.41, decimal=2,
            err_msg="The p-values are not correct.")
ks_test_dict

Let's check if any of the features has changed significantly:

In [None]:
for k, v in ks_test_dict.items():
    # Same 0.05 significance level as in exercise 1.3
    if v < 0.05:
        print(f"Feature {k} has changed significantly!")

Now we can say that, at least according to the very basic tests that we've done, we do not have a case of data drift on our hands. Knowing this, we can combine all the data we have and see if the model improves!

## Exercise 2 - Prepare data
Our tests did not show any change in the distributions, so we'll merge the old and new data and prepare the train and test sets.

### Exercise 2.1 - Combine data sets

Combine the two datasets into one. Add a new column called `is_new` that is going to have `False` values for the old data and `True` values for the new observations.
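
One possible way to tag and stack two dataframes is sketched below on toy frames. The frames `old` and `new` and their contents are invented for illustration; the same pattern is assumed to carry over to the real data.

```python
import pandas as pd

# Toy frames standing in for the old and new observations
old = pd.DataFrame({"age": [25, 40], "salary": ["<=50K", ">50K"]})
new = pd.DataFrame({"age": [31], "salary": ["<=50K"]})

# Tag each frame before stacking them; ignore_index gives a clean 0..n-1 index
combined = pd.concat(
    [old.assign(is_new=False), new.assign(is_new=True)],
    ignore_index=True,
)
print(combined)
```

Because `assign` returns a copy, the source frames are left without the extra column.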

In [None]:
# df_combined = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df_combined.shape == (32561, 16), 'The shape of the combined dataframe is wrong.'
assert 'is_new' in df_combined.columns, 'Did you add the is_new column?'
assert sum(df_combined['is_new']) == 14561, 'The is_new column has a wrong number of True values.'

In total we now have 32561 observations. This dataset is still small enough so that we do not have to do any data selection to retrain our model, but the same might not be true for future iterations.

### Exercise 2.2 - Prepare the train and test sets

Split the combined data into train and test sets in the following way:
- Create train and test dataframes. Call them `df_train` and `df_test`. We'll need them in the following exercises.
    - Test set should be 25% of `df_combined`.
    - Make sure that the test set contains 25% of the new observations.
    - Use random state 42 while splitting the data sets.
- Binarize the target into `True` and `False` values:
    - `False`: client has a salary less than or equal to 50K
    - `True`: client has a salary higher than 50K
- Separate the train and test dataframes into features and target - `X_train`, `X_test`, `y_train` and `y_test`.
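
The steps above can be sketched on toy data as follows. The frame, the `">50K"` label spelling, and the `toy_` names are assumptions made for this example; check the actual label values in the real data before comparing against them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40 of them flagged as new
toy = pd.DataFrame({
    "feature": range(100),
    "is_new": [True] * 40 + [False] * 60,
    "salary": ["<=50K", ">50K"] * 50,
})

# stratify keeps the share of is_new rows equal in both splits
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["is_new"], random_state=42
)

# Binarize the target: True when the label is ">50K"
toy_train = toy_train.assign(salary=toy_train["salary"] == ">50K")
toy_test = toy_test.assign(salary=toy_test["salary"] == ">50K")

# Separate features and target
toy_X_test = toy_test.drop(columns="salary")
toy_y_test = toy_test["salary"]
print(toy_test.shape, int(toy_test["is_new"].sum()))
```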

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df_train.shape == (24420, 16), 'The shape of df_train is wrong.'
assert df_train[target].dtype == 'bool', "Have you changed the target variable to binary?"
assert df_test.shape == (8141, 16), 'The shape of df_test is wrong.'
assert df_test[target].dtype == 'bool', "Have you changed the target variable to binary?"
assert sum(X_train['is_new']) == 10920, 'The is_new column in the training dataset set has a wrong number of True values.'
assert sum(X_test['is_new']) == 3641, 'The is_new column in test dataframe has a wrong number of True values.'
assert X_train.shape == (24420, 15), 'The shape of X_train is wrong.'
assert X_test.shape == (8141, 15), 'The shape of X_test is wrong.'
assert y_train.shape == (24420,), 'The shape of y_train is wrong.'
assert y_test.shape == (8141,), 'The shape of y_test is wrong.'

## Exercise 3 - Retrain the original model with all data
Now we need to retrain the model. Load the original model pipeline provided by the client from the `pipeline_blu14.pickle` file in the `data` directory. Train the pipeline with the training data created in exercise 2.2, then calculate the probabilities for the positive class and the predictions on the test set. Calculate the precision and recall for the predictions, as well as the retrieval rates for male and female clients, which were the important parameters for the client.

In [None]:
# pipeline = ...
# proba_pos = 
# preds = 
# precision_baseline = 
# recall_baseline = 
# retrieval_male_baseline = 
# retrieval_female_baseline = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(pipeline, Pipeline)
assert len(preds)==8141, 'The predictions are not correct.'
np.testing.assert_almost_equal(sum(preds), 3542, decimal=0, err_msg='The predictions are not correct.')
assert proba_pos.shape==(8141,), 'The probabilities are not correct.'
assert preds[0] == (proba_pos[0]>0.5), 'The probabilities should be of the positive class.'
np.testing.assert_almost_equal(sum(proba_pos), 3609.23, decimal=2, err_msg='The probabilities are not correct.')
assert isinstance(precision_baseline, float), 'The precision should be a float.'
np.testing.assert_almost_equal(precision_baseline, 0.47, decimal=2, err_msg='The precision is not correct.')
assert isinstance(recall_baseline, float), 'The recall should be a float.'
np.testing.assert_almost_equal(recall_baseline, 0.86, decimal=2, err_msg='The recall is not correct.')
assert isinstance(retrieval_male_baseline, float), 'The male retrieval rate should be a float.'
np.testing.assert_almost_equal(retrieval_male_baseline, 0.91, decimal=2, err_msg='The male retrieval rate is not correct.')
assert isinstance(retrieval_female_baseline, float), 'The female retrieval rate should be a float.'
np.testing.assert_almost_equal(retrieval_female_baseline, 0.55, decimal=2, err_msg='The female retrieval rate is not correct.')
print(f'Precision: {precision_baseline}, recall: {recall_baseline}')
print(f'Male retrieval rate {retrieval_male_baseline}, female retrieval rate {retrieval_female_baseline}')

## Exercise 4 - Probability threshold
Our recall is pretty good, but the precision is not even 50%. The difference between the male and female retrieval rates is also quite large. 

Let's try to improve the precision to at least 50% by selecting another threshold for the probabilities `proba_pos` calculated in ex. 3. Use the precision-recall curve to find the threshold where the precision is at least 0.5. Calculate the corresponding recall and the male and female retrieval rates.
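
A minimal sketch of picking a threshold from the precision-recall curve is shown below. The labels, scores, and the 0.75 target are toy values chosen for this example, not results from the exercise.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, for illustration only
toy_y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
toy_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
target_precision = 0.75  # hypothetical target for this toy example

precision, recall, thresholds = precision_recall_curve(toy_y, toy_scores)

# precision and recall have one more entry than thresholds,
# so drop the last point before matching them up.
# argmax returns the first True; it returns 0 if no threshold qualifies.
idx = int(np.argmax(precision[:-1] >= target_precision))
toy_threshold = float(thresholds[idx])

# Predict positive when the score reaches the threshold
toy_preds = toy_scores >= toy_threshold
print(toy_threshold, precision[idx], recall[idx])
```

Taking the first qualifying threshold keeps recall as high as possible among the thresholds that meet the precision target.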

In [None]:
# threshold = ...
# precision_threshold = 
# recall_threshold =
# retrieval_male_threshold = 
# retrieval_female_threshold = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(threshold, float), 'Threshold should be a float.'
assert isinstance(precision_threshold, float), 'The precision should be a float.'
assert isinstance(recall_threshold, float), 'The recall should be a float.'
assert isinstance(retrieval_male_threshold, float), 'The male retrieval rate should be a float.'
assert isinstance(retrieval_female_threshold, float), 'The female retrieval rate should be a float.'
np.testing.assert_almost_equal(threshold + precision_threshold + recall_threshold + retrieval_male_threshold + retrieval_female_threshold,
                               2.946778566399038, decimal=3, err_msg='The calculated values are not correct.')
np.testing.assert_almost_equal(np.var([threshold,precision_threshold,recall_threshold,retrieval_male_threshold,retrieval_female_threshold]),
                               0.0, decimal=1, err_msg='The calculated values are not correct.')
print(f'Precision: {precision_threshold}, recall: {recall_threshold}')
print(f'Male retrieval rate {retrieval_male_threshold}, female retrieval rate {retrieval_female_threshold}')

## Exercise 5 - Remove rare values

The female retrieval rate is much worse for just a small gain in precision, so that is not the way to go. Let's see if we can work on the data set. We will try to remove rare values for some of the features.

Filter rare values for the following features in `df_train`:
- Remove rows with `education` that appear <= 150 times.
- Remove rows with `marital-status` that appear <= 30 times.

The filtered data should be stored in a new dataframe `train_filtered`.
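
One way to drop rare categories is sketched below on a toy column. The frame `toy_edu` and the cutoff of 2 are made up for this example.

```python
import pandas as pd

# Toy column: "Preschool" appears only once
toy_edu = pd.DataFrame(
    {"education": ["HS-grad"] * 5 + ["Bachelors"] * 3 + ["Preschool"]}
)
min_count = 2  # hypothetical cutoff: drop values that appear <= 2 times

# Count how often each category occurs and keep those above the cutoff
counts = toy_edu["education"].value_counts()
frequent = counts[counts > min_count].index
toy_edu_filtered = toy_edu[toy_edu["education"].isin(frequent)]

print(sorted(toy_edu_filtered["education"].unique()))
```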

In [None]:
train_filtered = df_train.copy()
# train_filtered = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert train_filtered.shape == (24236, 16), 'The shape of the filtered dataframe is not correct.'
assert hashlib.sha256(json.dumps(' '.join(sorted(train_filtered.education.unique()))).encode()).hexdigest() == \
'48405cefe80fabb9e11337caa0b8e770f6bc76d131259ffd4edfd378791fcb96', 'The education feature is not filtered correctly.'
assert hashlib.sha256(json.dumps(' '.join(sorted(train_filtered['marital-status'].unique()))).encode()).hexdigest() == \
'2d65f0c31a2d8bfde9f97d036ac0b46a0f32b52538ec0ac1af013e6c84fe6c69', 'The marital-status column is not filtered correctly.'

## Exercise 6 - Merge similar values

The `workclass` feature has several values that represent the same information. It might be beneficial to merge them.

Implement the function below that merges the `?`, `Without-pay`, and `Never-worked` values in the `workclass` column into a single `No-salary` category.
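
As an illustration of the general technique, the sketch below merges several values of a toy column with `Series.replace`. The toy frame is invented, and the example assumes the values match exactly (no leading or trailing whitespace), which may not hold for the real data.

```python
import pandas as pd

toy_work = pd.DataFrame(
    {"workclass": ["Private", "?", "Never-worked", "Without-pay", "Private"]}
)

# Map each value to merge onto the shared category; other values stay untouched
to_merge = {
    "?": "No-salary",
    "Without-pay": "No-salary",
    "Never-worked": "No-salary",
}
toy_merged = toy_work.copy()
toy_merged["workclass"] = toy_merged["workclass"].replace(to_merge)
print(toy_merged["workclass"].value_counts())
```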

In [None]:
def merge_workclass_values(df):
    """
    Merges the ?, Without-pay, and Never-worked values in the workclass column of df into a single No-salary category.

    Args:
        df (pd.DataFrame): dataframe with a workclass column
    Returns:
        df_merged_workclass (pd.DataFrame): dataframe with merged values in the workclass column
    """
    df_merged_workclass = df.copy()
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_merged_workclass

In [None]:
train_filtered = merge_workclass_values(train_filtered)
assert train_filtered.shape == (24236, 16), 'The shape of the filtered dataframe is not correct.'
assert sum(train_filtered.workclass=='No-salary') == 1411, 'Did you merge the categories correctly?'

## Exercise 7 - Predict on the filtered data

Let's split `train_filtered` into features and target and repeat the process:
- Fit the pipeline.
- Merge the workclass values in the test set.
- Calculate predictions on the processed test set.
- Calculate precision and recall.
- Calculate male and female retrieval rates.

In [None]:
# preds_filtered = 
# precision_filtered = 
# recall_filtered = 
# retrieval_male_filtered = 
# retrieval_female_filtered = 
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(preds_filtered)==8141, 'The predictions are not correct.'
np.testing.assert_almost_equal(sum(preds_filtered), 3649, decimal=0, err_msg='The predictions are not correct.')
assert isinstance(precision_filtered, float), 'The precision should be a float.'
np.testing.assert_almost_equal(precision_filtered, 0.46, decimal=2, err_msg='The precision is not correct.')
assert isinstance(recall_filtered, float), 'The recall should be a float.'
np.testing.assert_almost_equal(recall_filtered, 0.87, decimal=2, err_msg='The recall is not correct.')
assert isinstance(retrieval_male_filtered, float), 'The male retrieval rate should be a float.'
np.testing.assert_almost_equal(retrieval_male_filtered, 0.92, decimal=2, err_msg='The male retrieval rate is not correct.')
assert isinstance(retrieval_female_filtered, float), 'The female retrieval rate should be a float.'
np.testing.assert_almost_equal(retrieval_female_filtered, 0.56, decimal=2, err_msg='The female retrieval rate is not correct.')
print(f'Precision: {precision_filtered}, recall: {recall_filtered}')
print(f'Male retrieval rate {retrieval_male_filtered}, female retrieval rate {retrieval_female_filtered}')

## Exercise 8 - Retrain the model on all data
Our retrained model has a slightly worse precision and better recall, but still a large difference in the male and female retrieval rates. It's usually a good idea to retrain the model on the whole dataset, so now I want you to:
- Apply the filters from exercise 5 to `df_combined`
- Apply the transformation of the `workclass` column from exercise 6 to `df_combined`
- Train the pipeline on the combined data
- Export the trained pipeline, column names, and data types to the `/tmp` directory, as the files `new_pipeline.pickle`, `new_columns.json`, and `new_dtypes.pickle`, respectively.
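
The export step can be sketched as follows, using a toy pipeline and a temporary directory instead of `/tmp`. The toy data, the `LogisticRegression` model, and the `toy_` names are assumptions for this example only; the client's pipeline is a different one.

```python
import json
import os
import pickle
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data and pipeline, for illustration only
toy_X = pd.DataFrame({"age": [25, 40, 31, 58], "hours-per-week": [40, 50, 35, 45]})
toy_y = [False, True, False, True]
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipeline.fit(toy_X, toy_y)

out_dir = tempfile.mkdtemp()  # the exercise asks for /tmp instead

# Column names as JSON
with open(os.path.join(out_dir, "new_columns.json"), "w") as fh:
    json.dump(toy_X.columns.tolist(), fh)

# Data types as a pickled pandas Series
with open(os.path.join(out_dir, "new_dtypes.pickle"), "wb") as fh:
    pickle.dump(toy_X.dtypes, fh)

# Fitted pipeline via joblib
joblib.dump(toy_pipeline, os.path.join(out_dir, "new_pipeline.pickle"))
```

Saving only the feature columns and their dtypes (without the target) lets the server rebuild an input dataframe with the layout the pipeline was trained on.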

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
with open(os.path.join('/tmp', 'new_columns.json')) as fh:
    columns = json.load(fh)

with open(os.path.join('/tmp', 'new_pipeline.pickle'), 'rb') as fh:
    pipeline = joblib.load(fh)

with open(os.path.join('/tmp', 'new_dtypes.pickle'), 'rb') as fh:
    dtypes = pickle.load(fh)

assert isinstance(columns, list), 'The columns should be a list of training features.'
assert 'salary' not in columns, 'There should be only training features in columns. You got target there.'
assert 'is_new' in columns, "Your columns don't contain the is_new feature."
assert isinstance(pipeline, Pipeline), 'new_pipeline.pickle does not seem to be an instance of the Pipeline class.'
assert isinstance(dtypes, pd.core.series.Series), 'new_dtypes.pickle is not pickled well'
assert all([column in dtypes.index for column in columns]), 'Some columns from new_columns file are not in the new_dtypes file.'
assert all([dtype in columns for dtype in dtypes.index]), 'Some dtypes from new_dtypes file are not in the new_columns file.'

And now it's time to change the server! I know you missed this part :)

Before we do it, I want to remind you that in this exercise notebook we didn't cover the ethics topic. Our model is trained on sensitive features like race and sex. In a real situation you'd need to make sure that your model is not discriminating against anyone. Maybe the gap between the female and male retrieval rates is due to this, or maybe it is due to the imbalance of the classes.

Now go and create a copy of the `protected_server.py` file. Call it `new_server.py`.

In that file:
- Change the `check_valid_column` function to include the newly added column. You can also automate it by reading the columns file, which is even better!
- Change the `check_categorical_values` function: make sure the values there still make sense!
- We also added one more categorical feature to the dataframe, `is_new`. Go and add possible values to the check.

As soon as it's done, go ahead and start the server and play with the predictions. Make sure that the server checks the `is_new` feature values. Try to send requests without `is_new` or with a different value (not True or False).
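
A check for the `is_new` field could look roughly like the sketch below. The helper `check_is_new` and its `(ok, error)` return convention are hypothetical and are not taken from `protected_server.py`; adapt the idea to the structure of the existing check functions.

```python
def check_is_new(observation):
    """Hypothetical helper: validate the is_new field of a request payload."""
    if "is_new" not in observation:
        return False, "Missing field: is_new"
    value = observation["is_new"]
    # Only real booleans pass; strings such as "True" or numbers such as 1 do not
    if not isinstance(value, bool):
        return False, f"Invalid value for is_new: {value!r}. Allowed values: True, False"
    return True, ""


print(check_is_new({"age": 30, "is_new": True}))
print(check_is_new({"age": 30}))
print(check_is_new({"age": 30, "is_new": "maybe"}))
```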

There's much more we can do with this dataset to train a better model, and more importantly, a model that doesn't discriminate. So if you're willing, go crazy! Make the best model you can! It's great preparation for the hackathon! ;) See you there!

And now take a moment to be very proud of yourself! You have mastered all the exercise notebooks in the course.

<img src="media/proud_girl.png" width="300" />