# Capstone Week 7
---

# Index
- [Capstone Objectives](#Capstone-Objectives)
- [Read in Data](#Read-in-Data)
    - [Make advisor dictionary mapper](#Make-advisor-dictionary-mapper)
- [EDA](#EDA)
- [Data Cleaning](#Data-Cleaning)
    - [Train-Test-Split](#Train-Test-Split)
    - [Custom Cleaning Functions](#Custom-Cleaning-Functions)
    - [Create Cleaning Pipeline](#Create-Cleaning-Pipeline)
- [Model building](#Model-building)
    - [Regression](#Regression)
        - [Calculate Baseline](#Calculate-Baseline)
        - [`sklearn` Feature Selection](#sklearn-Feature-Selection)
        - [Make function to output deciles](#Make-function-to-output-deciles)
    - [Classification](#Classification)
        - [Calculate Baseline-Classification](#Calculate-Baseline-Classification)
        - [Classification Feature Selection](#Classification-Feature-Selection)
- [Fairness and Bias](#Fairness-and-Bias)

# Capstone Objectives
- Assist sales and marketing by improving their targeting
- Predict sales for 2019 using the data for 2018
- Estimate the probability of adding a new fund in 2019

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', 50)

[Back to Top](#Index)
# Read in Data

In [2]:
df = pd.read_csv("../Transactions.csv", parse_dates=['refresh_date'])

In [3]:
df.head()

Unnamed: 0,CONTACT_ID,no_of_sales_12M_1,no_of_Redemption_12M_1,no_of_sales_12M_10K,no_of_Redemption_12M_10K,no_of_funds_sold_12M_1,no_of_funds_redeemed_12M_1,no_of_fund_sales_12M_10K,no_of_funds_Redemption_12M_10K,no_of_assetclass_sold_12M_1,no_of_assetclass_redeemed_12M_1,no_of_assetclass_sales_12M_10K,no_of_assetclass_Redemption_12M_10K,No_of_fund_curr,No_of_asset_curr,AUM,sales_curr,sales_12M,redemption_curr,redemption_12M,new_Fund_added_12M,redemption_rate,aum_AC_EQUITY,aum_AC_FIXED_INCOME_MUNI,aum_AC_FIXED_INCOME_TAXABLE,aum_AC_MONEY,aum_AC_MULTIPLE,aum_AC_PHYSICAL_COMMODITY,aum_AC_REAL_ESTATE,aum_AC_TARGET,aum_P_529,aum_P_ALT,aum_P_CEF,aum_P_ETF,aum_P_MF,aum_P_SMA,aum_P_UCITS,aum_P_UIT,refresh_date
0,85102111664960504040,3096,6592,302,157,8,13,7,7,2,3,2,2,9,2,19097020.0,399995.834888,12599930.0,-231714.43334,-6557185.0,0,-0.012133,9386941.0,9743856.0,-9655.913728,0.0,-24116.993988,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8431248.0,10665780.0,0.0,0.0,2017-12-31
1,4492101,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-14685.74,0.0,0.0,0.0,0.0,0,0.0,-7102.1,0.0,-7583.64,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-14685.74,0.0,0.0,0.0,2017-12-31
2,85102140943881291064,0,1,0,0,0,1,0,0,0,1,0,0,0,0,-71640.47,0.0,0.0,0.0,-195.0,0,0.0,-71640.47,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-71640.47,0.0,0.0,0.0,2017-12-31
3,85202121774856516280,1,0,0,0,1,0,0,0,1,0,0,0,2,2,342546.2,0.0,1164.76,0.0,0.0,1,0.0,0.0,70301.51,272244.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,342546.2,0.0,0.0,0.0,2017-12-31
4,0360380,7,0,0,0,1,0,0,0,1,0,0,0,2,0,-226272.1,0.0,3278.145,0.0,0.0,0,0.0,-111356.6,-20185.66,0.0,0.0,-94729.89,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-131542.3,-94729.89,0.0,0.0,2017-12-31


In [4]:
# df['refresh_date'].dt.month

# df['year'] = df['refresh_date'].dt.year
# df['month'] = df['refresh_date'].dt.month

# filt = (df['year'] == 2020) & (df['month'] == 11)
# df.loc[filt, :]

## Make advisor dictionary mapper

In [5]:
adviser_lookup = {
    idx: contact_id 
        for idx, contact_id in enumerate(df['CONTACT_ID'])
}

In [6]:
adviser_lookup[10]

'0082583'

# Combine `sales_curr` and `sales_12M`

In [7]:
df['total_sales'] = df['sales_curr'] + df['sales_12M']

[Back to Top](#Index)
# EDA

In [8]:
!conda install -yc conda-forge pandas-profiling

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [9]:
# from pandas_profiling import ProfileReport

# missing_diagrams = {
#     'heatmap': True, 'dendrogram': True, 'matrix':True, 'bar': True,
# }

# profile = ProfileReport(df, title='Nuveen Profile Report', missing_diagrams=missing_diagrams)

# profile.to_file(output_file="nuveen_profiling.html")

# Data Cleaning

Before you change ANYTHING with the data - besides the above :) - do your train-test split

In [10]:
FEATURES = [
    'CONTACT_ID', 'no_of_sales_12M_1', 'no_of_Redemption_12M_1',
    'no_of_sales_12M_10K', 'no_of_Redemption_12M_10K',
    'no_of_funds_sold_12M_1', 'no_of_funds_redeemed_12M_1',
    'no_of_fund_sales_12M_10K', 'no_of_funds_Redemption_12M_10K',
    'no_of_assetclass_sold_12M_1', 'no_of_assetclass_redeemed_12M_1',
    'no_of_assetclass_sales_12M_10K', 'no_of_assetclass_Redemption_12M_10K',
    'No_of_fund_curr', 'No_of_asset_curr', 'AUM', 'sales_curr', 'sales_12M',
    'redemption_curr', 'redemption_12M', 'new_Fund_added_12M',
    'redemption_rate', 'aum_AC_EQUITY', 'aum_AC_FIXED_INCOME_MUNI',
    'aum_AC_FIXED_INCOME_TAXABLE', 'aum_AC_MONEY', 'aum_AC_MULTIPLE',
    'aum_AC_PHYSICAL_COMMODITY', 'aum_AC_REAL_ESTATE', 'aum_AC_TARGET',
    'aum_P_529', 'aum_P_ALT', 'aum_P_CEF', 'aum_P_ETF', 'aum_P_MF',
    'aum_P_SMA', 'aum_P_UCITS', 'aum_P_UIT', 'refresh_date',
]
TARGETS = 'total_sales'

In [11]:
# make a variable to keep all of the columns we want to drop
COLS_TO_DROP = [
    'CONTACT_ID', 'sales_curr', 'sales_12M', 
    'refresh_date', 'new_Fund_added_12M','no_of_Redemption_12M_1',
]

COLS_TO_KEEP = [
    'no_of_sales_12M_1', 
    'no_of_sales_12M_10K', 'no_of_Redemption_12M_10K',
    'no_of_funds_sold_12M_1', 'no_of_funds_redeemed_12M_1',
    'no_of_fund_sales_12M_10K', 'no_of_funds_Redemption_12M_10K',
    'no_of_assetclass_sold_12M_1', 'no_of_assetclass_redeemed_12M_1',
    'no_of_assetclass_sales_12M_10K', 'no_of_assetclass_Redemption_12M_10K',
    'No_of_fund_curr', 'No_of_asset_curr', 'AUM', 'redemption_curr', 
    'redemption_12M', 'redemption_rate', 'aum_AC_EQUITY', 
    'aum_AC_FIXED_INCOME_MUNI', 'aum_AC_FIXED_INCOME_TAXABLE', 'aum_AC_MONEY', 
    'aum_AC_MULTIPLE', 'aum_AC_PHYSICAL_COMMODITY', 'aum_AC_REAL_ESTATE', 
    'aum_AC_TARGET', 'aum_P_529', 'aum_P_ALT', 'aum_P_CEF', 'aum_P_ETF', 
    'aum_P_MF', 'aum_P_SMA', 'aum_P_UCITS', 'aum_P_UIT',
]

## Partition training and testing

In [12]:
training_rows = df['refresh_date'].dt.year.isin([2017, 2018, 2019])
testing_rows = df['refresh_date'].dt.year.isin([2020])

X = df.loc[training_rows, FEATURES].copy()
y_reg = df.loc[training_rows, TARGETS].copy()
y_cl = df.loc[training_rows, 'new_Fund_added_12M'].copy()

y_holdout_test = df.loc[testing_rows, TARGETS].copy() # forget about this for now

## Custom Cleaning Functions

Let's create functions that do some basic housekeeping

In [13]:
def extract_columns(df):
    '''return a copy of df without the columns listed in COLS_TO_DROP'''
    cols_to_keep = [col for col in df.columns if col not in COLS_TO_DROP]
    return df.loc[:, cols_to_keep].copy()


def fillna_values(df):
    '''fill nan values with zero (numeric columns only for a DataFrame)'''
    if isinstance(df, pd.Series):
        return df.fillna(0)
    num_df = df.select_dtypes(include=['number']).fillna(0)
    non_num_df = df.select_dtypes(exclude=['number'])
    # note: numeric columns come first in the result, then the non-numeric ones
    return pd.concat([num_df, non_num_df], axis=1)


def negative_to_zero(series):
    '''replace negative values in a Series with zero; return other inputs unchanged'''
    if isinstance(series, pd.Series):
        return series.apply(lambda x: max(0, x))
    return series
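
To make the effect of these helpers concrete, here is a minimal sketch on a made-up three-row frame. The toy values and the names `toy`, `filled`, and `clipped` are illustrative assumptions, not part of the capstone data; the sketch uses the equivalent built-in pandas operations rather than the helpers themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one date column
toy = pd.DataFrame({
    "AUM": [100.0, np.nan, -50.0],
    "refresh_date": pd.to_datetime(["2017-12-31", "2018-12-31", None]),
})

# Fill NaN with zero in numeric columns only; the date column is left alone
num_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(0)

# Replace negative values with zero, as negative_to_zero does for a Series
clipped = filled["AUM"].clip(lower=0)

print(filled)
print(clipped)
```

The missing date stays missing, which is why the numeric and non-numeric columns are handled separately.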

## Train Test Split

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train_reg, y_test_reg = train_test_split(
    X, y_reg, test_size=0.3, random_state=24
)
y_train_cl, y_test_cl = y_cl[y_train_reg.index], y_cl[y_test_reg.index]

[Back to Top](#Index)
## Create Cleaning Pipeline

- Pipeline for target variable
- Pipeline for features

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler

In [17]:
extract_columns_trans = FunctionTransformer(extract_columns)
fillna_values_trans = FunctionTransformer(fillna_values)
negative_to_zero_trans = FunctionTransformer(negative_to_zero)

Make pipeline for regression target variable

In [18]:
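def extract_redemption(df):
    '''keep only the columns whose name contains "redemption"'''
    redemp_cols = [col for col in df.columns if 'redemption' in col.lower()]
    return df[redemp_cols].copy()

def replace_with_zero(df):
    '''cap values at zero: positive values become 0, negative values are kept'''
    df = df.copy()  # avoid mutating the caller's DataFrame
    for col in df.columns:
        df[col] = df[col].apply(lambda x: min(0, x))
    return df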

In [19]:
extract_redemption_trans = FunctionTransformer(extract_redemption)
replace_with_zero_trans = FunctionTransformer(replace_with_zero)

In [20]:
redemption_pipe = Pipeline([
    ('extract_redemption_trans', extract_redemption_trans),
    ('replace_with_zero_trans', replace_with_zero_trans),
    ('StandardScaler', StandardScaler())
])

In [21]:
pd.DataFrame(
    redemption_pipe.fit_transform(X_train),
    index=X_train.index,
    columns=[col for col in X_train.columns if 'redemption' in col.lower()]
)

Unnamed: 0,no_of_Redemption_12M_1,no_of_Redemption_12M_10K,no_of_funds_Redemption_12M_10K,no_of_assetclass_Redemption_12M_10K,redemption_curr,redemption_12M,redemption_rate
122101,0.0,0.0,0.0,0.0,0.068414,0.116706,0.004086
46186,0.0,0.0,0.0,0.0,0.068414,0.116706,0.004086
41126,0.0,0.0,0.0,0.0,0.045743,-0.994636,0.004086
30070,0.0,0.0,0.0,0.0,0.067219,0.091516,0.004086
232410,0.0,0.0,0.0,0.0,0.068414,0.116706,0.004086
...,...,...,...,...,...,...,...
190609,0.0,0.0,0.0,0.0,0.068414,0.116706,0.004086
216465,0.0,0.0,0.0,0.0,0.068414,0.116706,0.004086
211136,0.0,0.0,0.0,0.0,0.068414,0.082220,0.004086
899,0.0,0.0,0.0,0.0,0.066797,0.116706,0.004086


In [22]:
targ_pipe_reg = Pipeline([
    ('fillna_values_trans', fillna_values_trans),
    ('negative_to_zero_trans', negative_to_zero_trans)
])

y_train_reg = targ_pipe_reg.fit_transform(y_train_reg)
y_test_reg = targ_pipe_reg.transform(y_test_reg)

Transform the classification target

In [23]:
from sklearn.preprocessing import Binarizer

targ_pipe_cl = Pipeline([
    ('fillna_values_trans', fillna_values_trans),
    ('Binarizer', Binarizer(threshold=0))
])

y_train_cl = pd.Series(
    targ_pipe_cl
        .fit_transform(y_train_cl.to_frame())
        .reshape(-1), index=y_train_cl.index)
y_test_cl = pd.Series(
    targ_pipe_cl
        .transform(y_test_cl.to_frame())
        .reshape(-1), index=y_test_cl.index)
y_test_cl

228198    1
240133    1
163658    1
176954    0
69498     1
         ..
239268    0
116033    0
238773    1
7527      1
11875     0
Length: 75075, dtype: int64
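
As a quick illustration of what `Binarizer(threshold=0)` does to a count column, here is a small sketch on made-up values. The array `counts` is a hypothetical stand-in for `new_Fund_added_12M`.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical counts of new funds added (one column, four advisors)
counts = np.array([[0.0], [1.0], [3.0], [-2.0]])

# Values strictly greater than the threshold become 1, everything else 0
flags = Binarizer(threshold=0).fit_transform(counts).ravel()
print(flags)
```

Any advisor with at least one new fund ends up in the positive class, and zero or negative values fall into the negative class.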

Create the pipeline for the features

In [24]:
from sklearn.preprocessing import PowerTransformer

In [25]:
feat_pipe = Pipeline([
    ('extract_columns_trans', extract_columns_trans),
    ('fillna_values_trans', fillna_values_trans),
    ('StandardScaler', StandardScaler()),
    ('power_transformer', PowerTransformer())
])

X_train_prepared = feat_pipe.fit(X_train).transform(X_train)
X_test_prepared = feat_pipe.transform(X_test)

**TRANSFORM** Test set

In [26]:
X_train_prepared = pd.DataFrame(
    X_train_prepared,
    index=X_train.index,
    columns=COLS_TO_KEEP
)

X_test_prepared = pd.DataFrame(
    feat_pipe.transform(X_test),
    index=X_test.index,
    columns=COLS_TO_KEEP
)
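
The pattern used above, fit on the training set and only transform the test set, can be sketched with a single scaler on made-up numbers. The arrays `train` and `test` are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(train)

# Apply the same learned parameters to both sets
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_)   # learned from train only
print(test_scaled)    # test value expressed in training units
```

Because the scaler never sees the test rows during `fit`, the test score is not influenced by information from the test set.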

[Back to Top](#Index)
# Model building
- Evaluate baseline model
- Create new models
- Create evaluation function and cross validate

[Back to Top](#Index)
## Regression

Predict the sales of an advisor

In [27]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFE

### Calculate Baseline

In [28]:
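# use the mean as prediction; note this is the test-set mean, and the training
# mean (y_train_reg.mean()) would avoid peeking at the test target
y_baseline = y_test_reg.mean() * np.ones(y_test_reg.shape, dtype='float')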
print(np.sqrt(mean_squared_error(y_test_reg, y_baseline)))

1093150.643517532


In [29]:
y_baseline

array([179451.46304811, 179451.46304811, 179451.46304811, ...,
       179451.46304811, 179451.46304811, 179451.46304811])
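
One alternative way to build a mean baseline is `DummyRegressor`, which learns the mean from the training target and reuses it on the test set. The sketch below uses made-up targets; `y_train_toy`, `y_test_toy`, and the placeholder feature arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical targets; the features are ignored by the dummy model
y_train_toy = np.array([0.0, 10.0, 20.0])
y_test_toy = np.array([5.0, 15.0])
X_train_toy = np.zeros((3, 1))
X_test_toy = np.zeros((2, 1))

# The baseline always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_preds = baseline.predict(X_test_toy)

baseline_rmse = np.sqrt(mean_squared_error(y_test_toy, baseline_preds))
print(baseline_preds, baseline_rmse)
```

Any real model should beat this RMSE on the same test rows to be worth using.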

### `sklearn` Feature Selection

In [30]:
rfe = RandomForestRegressor(n_estimators=10, max_depth=8)
rfe.fit(X_train_prepared, y_train_reg)

RandomForestRegressor(max_depth=8, n_estimators=10)

In [32]:
subset = [
    'no_of_sales_12M_1', 'no_of_sales_12M_10K', 
    'no_of_Redemption_12M_10K', 'no_of_funds_sold_12M_1', 
    'no_of_funds_redeemed_12M_1', 'no_of_fund_sales_12M_10K', 
    'no_of_funds_Redemption_12M_10K',
]
X_train_prepared = X_train_prepared.loc[:, subset].copy()

In [33]:
rf = RandomForestRegressor(n_estimators=10, max_depth=8)
rfe = RFE(rf)
rfe.fit(X_train_prepared, y_train_reg)

RFE(estimator=RandomForestRegressor(max_depth=8, n_estimators=10))

This is a boolean mask indicating which columns were selected from the `RFE` fitting

In [None]:
rfe.support_

In [None]:
X_train_prepared.columns[rfe.support_]
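
To show how `support_` lines up with column names, here is a small sketch on synthetic data. It swaps in `LinearRegression` for speed, and the data, the column names `f0` to `f5`, and `n_features_to_select=3` are illustrative assumptions (when that argument is omitted, `RFE` keeps half of the features).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 features, 3 of which carry signal
X_arr, y_toy = make_regression(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_toy, y_toy)

mask = selector.support_          # one boolean per input column
selected = X_toy.columns[mask]    # names of the columns that were kept

# Selecting by name keeps any other frame aligned with the training columns
X_selected = X_toy.loc[:, selected]
print(mask, list(selected))
```

The mask has one entry per column of the frame passed to `fit`, so it only applies to frames with exactly those columns in the same order; selecting by name avoids that constraint.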

Select columns from RFE

In [None]:
X_train_reg_rfe = X_train_prepared.loc[:, rfe.support_]
# X_test_prepared still has every column in COLS_TO_KEEP, so select by name
X_test_reg_rfe = X_test_prepared.loc[:, X_train_reg_rfe.columns]

In [None]:
rf2 = RandomForestRegressor()
rf2.fit(X_train_reg_rfe, y_train_reg)

In [None]:
y_test_preds = rf2.predict(X_test_reg_rfe)

In [None]:
print(np.sqrt(mean_squared_error(y_test_reg, y_test_preds)))

Create a function to evaluate regression models

In [None]:
def evaluate_regression(model, X, y, training=False):
    '''print RMSE scores and plot actual vs predicted values'''
    preds = model.predict(X)
    if training:
        print("Training Cross Validation Scores:")
        print(-cross_validate(model, X, y, scoring='neg_root_mean_squared_error')['test_score'])
        print('-'*55)
    else:
        # score the data passed in, not the global test variables
        rmse = np.sqrt(mean_squared_error(y, preds))
        print("Testing Data Performance:")
        print('-'*55)
        print(f"RMSE:\t{rmse}")
    # the plot is the same for training and testing data
    lim = max(preds.max(), y.max())
    fig, ax = plt.subplots(1, 1, figsize=(7, 5))
    ax.scatter(x=y, y=preds, alpha=0.4)
    ax.plot([0, lim], [0, lim])
    ax.set_xlim([0, lim])
    ax.set_ylim([0, lim])
    ax.set_title("Actual vs Predicted - Regression")
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")

In [None]:
# evaluate_regression(rf2, X_train_reg_rfe, y_train_reg, training=True)

In [None]:
# evaluate_regression(rf2, X_test_reg_rfe, y_test_reg)

[Back to Top](#Index)
### Make function to output deciles

In [None]:
y_test_preds = pd.Series(rf2.predict(X_test_reg_rfe), index=y_test_reg.index)

In [None]:
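# targ_pipe_reg as defined above has no 'PowerTransformer' step, so only
# invert the transform when the pipeline actually contains one
if 'PowerTransformer' in targ_pipe_reg.named_steps:
    y_test_preds = (
        targ_pipe_reg
            .named_steps['PowerTransformer']
            .inverse_transform(y_test_preds.to_frame())
            .squeeze())
y_test_preds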

In [None]:
def output_deciles(model, X, y, transform=False):
    '''rank predictions into deciles (0 = lowest predictions, 9 = highest)'''
    results = pd.DataFrame(model.predict(X), index=X.index, columns=['predictions'])
    results['actual'] = y.values
    results['deciles'] = pd.qcut(results['predictions'], 10, labels=False)
    # undo the target transform only if the pipeline has that step;
    # targ_pipe_reg as defined above has no 'PowerTransformer' step
    if transform and 'PowerTransformer' in targ_pipe_reg.named_steps:
        power = targ_pipe_reg.named_steps['PowerTransformer']
        for col in ['predictions', 'actual']:
            results[col] = power.inverse_transform(results[col].to_frame()).squeeze()
    results['contact_id'] = results.index.map(adviser_lookup)
    return results
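
The decile logic can be checked on a small made-up frame. In this sketch the 100 predictions, the `actual` column, and the name `decile_rank` are illustrative assumptions; `pd.qcut(..., labels=False)` numbers the bins 0 to 9 from lowest to highest prediction, and `10 - decile` flips them so that rank 1 is the top group.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions: the integers 0..99 in random order
rng = np.random.default_rng(0)
toy = pd.DataFrame({"predictions": rng.permutation(100).astype(float)})
toy["actual"] = toy["predictions"] * 2  # made-up "actual" values

# Ten equal-sized bins by predicted value (0 = lowest, 9 = highest)
toy["deciles"] = pd.qcut(toy["predictions"], 10, labels=False)

# Flip so that rank 1 holds the highest predictions
toy["decile_rank"] = 10 - toy["deciles"]

summary = toy.groupby("decile_rank")["actual"].agg(["mean", "count"])
print(summary)
```

If the model ranks advisors well, the mean of `actual` should fall steadily from rank 1 to rank 10.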

In [None]:
regression_deciles = output_deciles(rf2, X_test_reg_rfe, y_test_reg, transform=True)

In [None]:
regression_deciles

In [None]:
reg_chart = (regression_deciles
    .groupby('deciles')
    .agg({'actual': ['mean', 'count']})
    .droplevel(0, axis=1)
    .reset_index()
)

In [None]:
reg_chart['deciles'] = reg_chart['deciles'].apply(lambda x: (x-10)*-1)

In [None]:
fig, axes = plt.subplots(figsize=(10,6))
reg_chart.plot(kind='bar', x='deciles', y='mean', ax=axes, legend=None)
axes.set_xlabel("Deciles", fontsize=14)
axes.set_ylabel("Average Actual Sales", fontsize=14)
axes.set_title("Average Actual Sales by Predicted Decile", fontsize=16)
axes.spines['top'].set_visible(False);
axes.spines['right'].set_visible(False);

In [None]:
regression_deciles.sort_values(by='deciles')

[Back to Top](#Index)
## Classification

Predict whether an advisor will add at least one new fund

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

### Calculate Baseline-Classification

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy_cl = DummyClassifier(strategy='most_frequent') # use majority

In [None]:
dummy_cl.fit(X_train_prepared, y_train_cl)
y_baseline = dummy_cl.predict(X_test_prepared)  # the dummy model ignores feature values
print(classification_report(y_test_cl, y_baseline, zero_division=0))

### Classification Feature Selection

In [None]:
from sklearn.feature_selection import SelectFromModel

In [None]:
gbc = GradientBoostingClassifier()

# find subset of features
sfm = SelectFromModel(gbc, threshold='median')
sfm.fit(X_train_prepared, y_train_cl)
X_train_cl_sfm = pd.DataFrame(
    sfm.transform(X_train_prepared),
    index=X_train_prepared.index,
    columns=X_train_prepared.columns[sfm.get_support()])
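# X_train_prepared was reduced to `subset` earlier, so align the test columns first
X_test_subset = X_test_prepared.loc[:, X_train_prepared.columns]
X_test_cl_sfm = pd.DataFrame(
    sfm.transform(X_test_subset),
    index=X_test_subset.index,
    columns=X_test_subset.columns[sfm.get_support()])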

# fit model with selected features
gbc.fit(X_train_cl_sfm, y_train_cl)

Create predictions

In [None]:
y_test_preds = gbc.predict(X_test_cl_sfm)

In [None]:
print(classification_report(y_test_cl, y_test_preds))

Create function to evaluate model

In [None]:
def evaluate_classifier(cl_model, X, y, training=False):
    if training:
        print("Training Cross Validation Scores:")
        print(cross_validate(cl_model, X, y, scoring='f1')['test_score'])  # f1 is already positive
        print('-'*55)
        preds = cl_model.predict(X)
        print(classification_report(y, preds))
    else:
        print("Testing Data Performance:")
        print('-'*55)
        preds = cl_model.predict(X)
        print(classification_report(y, preds))

In [None]:
evaluate_classifier(gbc, X_train_cl_sfm, y_train_cl, training=True)

In [None]:
evaluate_classifier(gbc, X_test_cl_sfm, y_test_cl)

In [None]:
gbc.predict_proba(X_test_cl_sfm)[:, 1]

In [None]:
def output_deciles_class(model, X, y):
    results = pd.DataFrame(model.predict_proba(X)[:, 1], index=X.index, columns=['predictions'])
    results['actual'] = y.values
    results['deciles'] = pd.qcut(results['predictions'], 10, labels=False)
    results['contact_id'] = results.index.map(adviser_lookup)
    return results

In [None]:
class_results = output_deciles_class(gbc, X_test_cl_sfm, y_test_cl)

In [None]:
class_results

In [None]:
class_res1 = (class_results
                  .groupby('deciles')
                  .agg({'actual': 'sum', 'contact_id': 'count'})
                  .rename(columns={'contact_id': 'count'})
                  .reset_index()
             )
class_res1['deciles'] = class_res1['deciles'].apply(lambda x: (x - 10)*-1)
class_res1 = class_res1.sort_values(by='deciles')
class_res1.to_csv('class_lift.csv', index=False)

In [None]:
cl_preds = pd.Series(gbc.predict(X_train_cl_sfm), index=X_train_cl_sfm.index)
cl_preds.value_counts()

In [None]:
# !pip install scikit-plot

In [None]:
import scikitplot as skplt

In [None]:
y_test_cl_preds = gbc.predict_proba(X_test_cl_sfm)

In [None]:
skplt.metrics.plot_lift_curve(y_test_cl, y_test_cl_preds);
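
To make the lift curve easier to read, here is a hand-computed sketch of the same idea on ten made-up advisors: sort by predicted probability, then compare the positive rate among the top-k to the overall positive rate. The arrays `y_true` and `scores` are illustrative assumptions, and this is a simplified calculation rather than the `scikitplot` implementation.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for ten advisors
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

# Sort from highest to lowest score
order = np.argsort(-scores)
sorted_true = y_true[order]

# Positive rate among the top-k advisors, for k = 1..n
top_k_rate = np.cumsum(sorted_true) / np.arange(1, len(y_true) + 1)

# Lift: how much better the top-k is than picking advisors at random
lift = top_k_rate / y_true.mean()
print(lift)
```

A lift of 2 at the top 50% means that targeting that half reaches positives at twice the rate of random targeting; the curve always ends at 1 when everyone is included.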

[Back to Top](#Index)
## Fairness and Bias
1. [Visit the Aequitas project website](http://www.datasciencepublicpolicy.org/projects/aequitas/)
2. [Aequitas Fairness GitHub](https://github.com/dssg/aequitas)
3. [Aequitas API Docs](https://dssg.github.io/aequitas/api/aequitas.html)
4. [Aequitas Example](https://dssg.github.io/aequitas/examples/compas_demo.html)

In [None]:
# !pip install aequitas

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

from aequitas.group import Group
from aequitas.bias import Bias
from aequitas.fairness import Fairness
from aequitas.plotting import Plot

import warnings; warnings.simplefilter('ignore')

In [None]:
plt.rcParams["figure.figsize"] = [6.4, 4.8]

### Load sample data

In [None]:
RAW_DATA = 'https://raw.githubusercontent.com/dssg/aequitas/master/examples/data/compas_for_aequitas.csv'
df = pd.read_csv(RAW_DATA)

In [None]:
df.head()

### About the data
Risk assessment by race

COMPAS produces a risk score that predicts a person’s likelihood of committing a crime in the next two years. The output is a score between 1 and 10 that maps to low, medium or high. For Aequitas, we collapse this to a binary prediction. A score of 0 indicates a prediction of “low” risk according to COMPAS, while a 1 indicates “high” or “medium” risk.

In [None]:
aq_palette = sns.diverging_palette(225, 35, n=2)

Look at the prediction distributions along the race, sex, and age attributes.

In [None]:
by_race = sns.countplot(
            x="race", hue="score", 
            data=df[df.race.isin(['African-American', 'Caucasian', 'Hispanic'])],
            palette=aq_palette
)

Race by label

In [None]:
label_by_race = sns.countplot(
    x="race", hue="label_value", 
    data=df[df.race.isin(['African-American', 'Caucasian', 'Hispanic'])], 
    palette=aq_palette
)

Predictions by Sex

In [None]:
by_sex = sns.countplot(x="sex", hue="score", data=df, palette=aq_palette)

Labels by Sex

In [None]:
label_by_sex = sns.countplot(
    x="sex", hue="label_value", 
    data=df, palette=aq_palette
)

Predictions by Age

In [None]:
by_age = sns.countplot(x="age_cat", hue="score", data=df, palette=aq_palette)

Labels by Age

In [None]:
label_by_age = sns.countplot(
    x="age_cat", hue="label_value", 
    data=df, palette=aq_palette
)

The graphs above show the base rates for recidivism are higher for black defendants compared to white defendants (.51 vs .39), though the predictions do not match the base rates.

#### Initialize a `Group()` instance to see metrics in cross tabulations

In [None]:
g = Group()
xtab, _ = g.get_crosstabs(df)

In [None]:
xtab

#### Plot the false negative rates

These show, for each group, how often the model misses someone who does go on to commit another crime.

In [None]:
aqp = Plot()

In [None]:
fnr = aqp.plot_group_metric(xtab, 'fnr', min_group_size=0.05)
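
The false negative rate per group can also be computed by hand, which helps when reading the plot. The sketch below uses a made-up frame with the same `score` and `label_value` column names as the sample data; the `group` column and its values are illustrative assumptions, and this is not the Aequitas implementation.

```python
import pandas as pd

# Hypothetical predictions (score) and outcomes (label_value) for two groups
toy = pd.DataFrame({
    "group":       ["a", "a", "a", "a", "b", "b", "b", "b"],
    "score":       [1, 0, 0, 1, 1, 1, 0, 0],
    "label_value": [1, 1, 1, 0, 1, 1, 1, 0],
})

# FNR = share of actual positives that the model predicted as negative
positives = toy[toy["label_value"] == 1]
fnr_by_group = (positives["score"] == 0).groupby(positives["group"]).mean()
print(fnr_by_group)
```

In this toy example the model misses two of the three actual positives in group `a` but only one of three in group `b`, the kind of gap the group metric plot is meant to surface.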

In [None]:
a = aqp.plot_group_metric_all(xtab, ncols=3)