## OPR RECOMMENDATION ENGINE

### Structure

- Deductive component (ML)
- Inductive components
    - Rule-based heuristic (RBH)
    - PDI (relevant for managers only)
    
OPR rating for employee = __*d1*\*ML + *i1*\*RBH + *i2*\*PDI__
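
As a minimal sketch of the weighted combination above: it assumes the ratings are first mapped to the ordinal scale that `dep_dict` defines later in this notebook. The weights `d1`, `i1`, `i2`, the function name `combine_opr`, and the renormalisation when PDI is missing are all hypothetical choices for illustration, not values or logic from this project.

```python
import numpy as np

# Ordinal encoding of OPR classes (same mapping as dep_dict later in this notebook)
DEP_DICT = {'4A': 5, '4B': 4, '3A': 3, '3B': 2, '1A': 1, '1B': 0}
REV_DEP_DICT = {v: k for k, v in DEP_DICT.items()}

def combine_opr(ml, rbh, pdi=None, d1=0.5, i1=0.3, i2=0.2):
    """Weighted blend of component ratings. Weights are hypothetical placeholders."""
    parts = [(d1, ml), (i1, rbh)]
    if pdi is not None:  # PDI is only relevant for managers
        parts.append((i2, pdi))
    # Assumption: renormalise so the weights in use always sum to 1
    total_weight = sum(w for w, _ in parts)
    score = sum(w * DEP_DICT[r] for w, r in parts) / total_weight
    # Round to the nearest class and keep the result inside the valid range
    return REV_DEP_DICT[int(np.clip(round(score), 0, 5))]

print(combine_opr('3A', '3B'))        # non-manager: ML and RBH only
print(combine_opr('4B', '4B', '4A'))  # manager: all three components
```

Rounding the blended score back to a class is one possible design; keeping the continuous score would preserve more information for ranking.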

In [1]:
%reset -f

import sys
import pandas as pd, numpy as np
import pickle

pd.options.display.max_rows = 10 # specify if you want the full output in cells rather than the truncated list
pd.options.display.max_columns = 100

# to display multiple outputs in a cell without using print/display
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# ignore warnings (only if you are the kind that would code when the world is burning)
import warnings
warnings.filterwarnings('ignore')

In [2]:
# global function to flatten columns after a grouped operation and aggregation
# outside all classes since it is added as an attribute to pandas DataFrames

def __my_flatten_cols(self, how="_".join, reset_index=True):
    how = (lambda iter: list(iter)[-1]) if how == "last" else how
    self.columns = [how(filter(None, map(str, levels))) for levels in self.columns.values] \
    if isinstance(self.columns, pd.MultiIndex) else self.columns
    return self.reset_index(drop=True) if reset_index else self
pd.DataFrame.my_flatten_cols = __my_flatten_cols
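
To illustrate what the flattening does, here is a small sketch on toy data. `flatten_cols` is a standalone, hypothetical equivalent of `my_flatten_cols` with its default arguments, written so the example runs on its own; the toy column names are made up.

```python
import pandas as pd

def flatten_cols(df, sep="_"):
    """Standalone equivalent of my_flatten_cols with the default arguments."""
    if isinstance(df.columns, pd.MultiIndex):
        # Join the non-empty levels of each column tuple, e.g. ('hit', 'sum') -> 'hit_sum'
        df.columns = [sep.join(filter(None, map(str, levels))) for levels in df.columns.values]
    return df.reset_index(drop=True)

toy = pd.DataFrame({'country': ['AR', 'AR', 'BO'], 'hit': [1, 0, 1]})
# Aggregating with a list of functions produces two-level (MultiIndex) columns
agg = toy.groupby('country').agg({'hit': ['mean', 'sum']}).reset_index()
flat = flatten_cols(agg)
print(list(flat.columns))
```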

In [3]:
# HELPER FUNCTIONS CLASS #


class helper_funcs():

    def __init__(self):
        """ list down the various functions defined here """
    
    def csv_read(self, file_path, cols_to_keep=None, dtype=None, drop_dup=None):
        self.cols_to_keep = cols_to_keep
        if dtype is None:
            x=pd.read_csv(file_path, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'], encoding='latin-1', low_memory=False)
        else:
            x=pd.read_csv(file_path, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'], encoding='latin-1', low_memory=False, dtype=dtype)
        chars_to_remove = [' ', '.', '(', ')', '__', '-', '/', '\'', ':']
        for i in chars_to_remove:
            # regex=False: treat '.', '(' and ')' as literal characters, not regex syntax
            x.columns = x.columns.str.strip().str.lower().str.replace(i, '_', regex=False)
        if cols_to_keep is not None: x = x[cols_to_keep]
        if drop_dup is not None: x.drop_duplicates(inplace=True)
        print(x.shape)
        return x
    
    def txt_read(self, file_path, cols_to_keep=None, sep='|', skiprows=1, dtype=None, drop_dup=None):
        # currently only supports salary files with the default values (need to implement dynamic programming for any generic txt)
        self.cols_to_keep = cols_to_keep
        if dtype is None:
            x=pd.read_table(file_path, sep=sep, skiprows=skiprows, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'])
        else:
            x=pd.read_table(file_path, sep=sep, skiprows=skiprows, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'], dtype=dtype)
        chars_to_remove = [' ', '.', '(', ')', '__', '-', '/', '\'', ':']
        for i in chars_to_remove:
            # regex=False: treat '.', '(' and ')' as literal characters, not regex syntax
            x.columns = x.columns.str.strip().str.lower().str.replace(i, '_', regex=False)
        if cols_to_keep is not None: x = x[cols_to_keep]
        if drop_dup is not None: x.drop_duplicates(inplace=True)
        print(x.shape)
        return x

    def xlsx_read(self, file_path, cols_to_keep=None, sheet_name=0, dtype=None, drop_dup=None):
        self.cols_to_keep = cols_to_keep
        if dtype is None:
            x=pd.read_excel(file_path, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'], sheet_name=sheet_name)
        else:
            x=pd.read_excel(file_path, na_values=['No Data', ' ', 'UNKNOWN', '', 'Not Rated', 'Not Applicable'], sheet_name=sheet_name, dtype=dtype)
        chars_to_remove = [' ', '.', '(', ')', '__', '-', '/', '\'', ':']
        for i in chars_to_remove:
            # regex=False: treat '.', '(' and ')' as literal characters, not regex syntax
            x.columns = x.columns.str.strip().str.lower().str.replace(i, '_', regex=False)
        if cols_to_keep is not None: x = x[cols_to_keep]
        if drop_dup is not None: x.drop_duplicates(inplace=True)
        print(x.shape)
        return x
    
    def process_columns(self, df, cols=None):
        if cols is None:
            df = df.apply(lambda x: x.str.lower() if (x.dtype == 'object') else x)
            df = df.apply(lambda x: x.str.strip() if (x.dtype == 'object') else x)
            # raw strings avoid invalid-escape warnings for \s and \w
            df = df.apply(lambda x: x.str.replace(r'\s+', '_', regex=True) if (x.dtype == 'object') else x)
            df = df.apply(lambda x: x.str.replace(r'[^\w+\s+]', '_', regex=True) if (x.dtype == 'object') else x)
            df = df.apply(lambda x: x.str.replace(r'_+', '_', regex=True) if (x.dtype == 'object') else x)
        else:
            df = df.apply(lambda x: x.str.lower() if x.name in cols else x)
            df = df.apply(lambda x: x.str.strip() if x.name in cols else x)
            # raw strings avoid invalid-escape warnings for \s and \w
            df = df.apply(lambda x: x.str.replace(r'\s+', '_', regex=True) if x.name in cols else x)
            df = df.apply(lambda x: x.str.replace(r'[^\w+\s+]', '_', regex=True) if x.name in cols else x)
            df = df.apply(lambda x: x.str.replace(r'_+', '_', regex=True) if x.name in cols else x)
        return df
  
    def nlp_process_columns(self, df, nlp_cols):
        df = df.apply(lambda x: x.str.replace('_', ' ') if x.name in nlp_cols else x)
        df = df.apply(lambda x: x.str.replace(r'\s+', ' ', regex=True) if x.name in nlp_cols else x)
        df = df.apply(lambda x: x.str.replace('crft', 'craft') if x.name in nlp_cols else x)
        return df
    
    @staticmethod
    def retrieve_name(var):
        """
        Gets the name of var, searching from the outermost frame inwards.
        :param var: variable to get name from.
        :return: string
        """
        import inspect  # imported here because the notebook's import cell does not include it
        for fi in reversed(inspect.stack()):
            names = [var_name for var_name, var_val in fi.frame.f_locals.items() if var_val is var]
            if len(names) > 0:
                return names[0]

helpers = helper_funcs()
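
The three readers above repeat the same column clean-up. A possible way to factor it out is sketched below; `clean_column_names` is a hypothetical helper name and the sample column names are made up for illustration.

```python
import pandas as pd

def clean_column_names(columns):
    """Hypothetical helper mirroring the column clean-up in csv_read/txt_read/xlsx_read."""
    cols = pd.Index(columns).str.strip().str.lower()
    for ch in [' ', '.', '(', ')', '__', '-', '/', '\'', ':']:
        # Literal replacement, so '.', '(' and ')' are not interpreted as regex syntax
        cols = cols.str.replace(ch, '_', regex=False)
    return cols

cleaned = list(clean_column_names(['Employee Global ID', 'Position-Title', ' Time in Band (days) ']))
print(cleaned)
```

Note that a closing parenthesis at the end of a name leaves a trailing underscore, because only double underscores are collapsed.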

## SAZ specific

In [4]:
# predictions input

predictions_from_mod_beforeshape = helpers.csv_read('../working/saz_pred_aftermod.csv')
predictions_from_ml_aftershape = helpers.csv_read('../working/saz_pred_aftershape.csv')
predictions_from_rulebased = helpers.csv_read('../input/Rule_based_module_output/saz_pred_rulebased.csv', cols_to_keep=['employee_id', 'suggested_opr'])

predictions_from_ml_aftershape = pd.concat([predictions_from_mod_beforeshape.reset_index(drop=True), predictions_from_ml_aftershape], axis=1)
predictions_from_ml_aftershape = predictions_from_ml_aftershape[['global_id', 'saz_shape_class']]
predictions_from_rulebased.columns = ['global_id', 'saz_rulebased_class']

(5000, 2)
(5000, 1)
(3971, 2)


In [5]:
# opr files

required_cols = ['employee_global_id', 'year', 'opr_rating_scale']
opr_2015 = helpers.csv_read(file_path='../input/OPR/global_opr_2015.csv', drop_dup='yes', cols_to_keep=required_cols)
opr_2016 = helpers.csv_read(file_path='../input/OPR/global_opr_2016.csv', drop_dup='yes', cols_to_keep=required_cols)
opr_2017 = helpers.csv_read(file_path='../input/OPR/global_opr_2017.csv', drop_dup='yes', cols_to_keep=required_cols)
opr_2018 = helpers.csv_read(file_path='../input/OPR/global_opr_2018.csv', drop_dup='yes', cols_to_keep=required_cols)
#opr_2018.head(2)

## create the full set
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
opr_full = pd.concat([opr_2015, opr_2016, opr_2017, opr_2018], ignore_index=True)
opr_full.columns = ['global_id', 'year', 'opr']
opr_full.drop_duplicates(inplace=True, subset=['global_id', 'year'])
opr_full.dropna(how='any', inplace=True)
opr_full.reset_index(inplace=True, drop=True)
opr_full = opr_full[opr_full['opr']!='2']
#opr_full['opr'] = opr_full['opr'].map(dep_dict)

## reshaping and creating the pivot version
opr_reshaped = opr_full.pivot(index='global_id', columns='year', values=['opr']).reset_index().my_flatten_cols()
opr_reshaped.columns.name = None
#opr_reshaped[['opr_2015', 'opr_2016', 'opr_2017', 'opr_2018']] = opr_reshaped[['opr_2015', 'opr_2016', 'opr_2017', 'opr_2018']].apply(pd.to_numeric, errors='coerce')
#opr_reshaped = helpers.process_columns(df=opr_reshaped)
opr_reshaped.head(2)

(27867, 3)
(34839, 3)
(39203, 3)
(42860, 3)


Unnamed: 0,global_id,opr_2015,opr_2016,opr_2017,opr_2018
0,1001406,3B,3B,3B,3B
1,1001477,1B,,,
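
For reference, the long-to-wide reshape can be reproduced on toy data as below. This sketch uses `add_prefix` instead of `my_flatten_cols` to build the `opr_<year>` names, which is an alternative route rather than the notebook's own; the toy IDs and ratings are made up.

```python
import pandas as pd

opr_long = pd.DataFrame({
    'global_id': [1, 1, 2],
    'year':      [2017, 2018, 2018],
    'opr':       ['3B', '3A', '1B'],
})

# One row per employee, one column per year; missing years become NaN
wide = (opr_long
        .pivot(index='global_id', columns='year', values='opr')
        .add_prefix('opr_')
        .reset_index())
wide.columns.name = None
print(wide)
```

`pivot` raises an error if a (`global_id`, `year`) pair appears more than once, which is why duplicates are dropped on those two columns before reshaping.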


In [6]:
# bp files

# load backup
# objects are read back in the same order they were dumped
with open('../working/bp_backup.pkl', 'rb') as bp_files:
    bp_2016 = pickle.load(bp_files)
    bp_2017 = pickle.load(bp_files)
    bp_2018 = pickle.load(bp_files)
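
The backup file holds several objects written one after another, so they have to be loaded in the same order they were dumped. A minimal round-trip sketch on toy frames follows; the temporary path and the frame contents are assumptions for illustration only.

```python
import os
import pickle
import tempfile

import pandas as pd

frames = [pd.DataFrame({'year': [y]}) for y in (2016, 2017, 2018)]
demo_path = os.path.join(tempfile.mkdtemp(), 'bp_backup_demo.pkl')

# Write: one pickle.dump call per object, all into the same file handle
with open(demo_path, 'wb') as fh:
    for frame in frames:
        pickle.dump(frame, fh)

# Read: successive pickle.load calls return the objects in dump order
with open(demo_path, 'rb') as fh:
    demo_2016 = pickle.load(fh)
    demo_2017 = pickle.load(fh)
    demo_2018 = pickle.load(fh)

print(demo_2016['year'].iloc[0], demo_2017['year'].iloc[0], demo_2018['year'].iloc[0])
```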

In [7]:
bp_2016 = bp_2016[bp_2016['employment_status']=='Active']
bp_2017 = bp_2017[bp_2017['employment_status']=='Active']
bp_2018 = bp_2018[bp_2018['employment_status']=='Active']

In [357]:
# sar_temp = bp_2018[bp_2018['macro_entity_l2_code']=='10002169'].copy()
# #sar_temp = bp_2018[bp_2018['employee_band'].isin(['0-A', '0-B', 'I-A', 'I-B', 'EBM'])].copy()
# sar_temp = sar_temp[sar_temp['employee_band'].isin(['0-A', '0-B', 'I-A', 'I-B', 'EBM', 'II-A', 'II-B'])]
# sar_temp = sar_temp[['global_id', 'employee_name', 'employee_band', 'position_title', 'macro_entity_l2_desc']].reset_index(drop=True)
# sar_temp = pd.merge(sar_temp, saz_preds, how='left', on='global_id')
# sar_temp
# sar_temp.to_csv('saz_toplevelpeople.csv', index=False)

In [8]:
# bp files

bp_cols_to_keep = ['country_code', 'employee_band', 'global_id','analysis_block_l1_code', 
                   'analysis_block_l2_code', 'macro_entity_l2_code', 'position_id', 'position_title']

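# .copy() so the column assignments below do not trigger SettingWithCopyWarning
bp_2016 = bp_2016[bp_cols_to_keep].copy()
bp_2017 = bp_2017[bp_cols_to_keep].copy()
bp_2018 = bp_2018[bp_cols_to_keep].copy()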

bp_2016['year'] = 2016
bp_2017['year'] = 2017
bp_2018['year'] = 2018

bp_2016.drop_duplicates(subset=['global_id'], inplace=True)
bp_2017.drop_duplicates(subset=['global_id'], inplace=True)
bp_2018.drop_duplicates(subset=['global_id'], inplace=True)

bp_2018.head(2)

Unnamed: 0,country_code,employee_band,global_id,analysis_block_l1_code,analysis_block_l2_code,macro_entity_l2_code,position_id,position_title,year
0,DE,BWG 08,1000069,14000000,14000001,10000651,50016964,Area Manager,2018
1,DE,BWG 04,1000279,14000032,14000035,10000651,90049946,Machine Operator,2018


In [9]:
bp_2016 = bp_2016.add_prefix('2016_')
bp_2017 = bp_2017.add_prefix('2017_')
bp_2018 = bp_2018.add_prefix('2018_')

bp_2016.rename(columns={'2016_global_id':'global_id'}, inplace=True)
bp_2016.drop('2016_year', axis=1, inplace=True)
bp_2017.rename(columns={'2017_global_id':'global_id'}, inplace=True)
bp_2017.drop('2017_year', axis=1, inplace=True)
bp_2018.rename(columns={'2018_global_id':'global_id'}, inplace=True)
bp_2018.drop('2018_year', axis=1, inplace=True)

bp_2018.head(2)

Unnamed: 0,2018_country_code,2018_employee_band,global_id,2018_analysis_block_l1_code,2018_analysis_block_l2_code,2018_macro_entity_l2_code,2018_position_id,2018_position_title
0,DE,BWG 08,1000069,14000000,14000001,10000651,50016964,Area Manager
1,DE,BWG 04,1000279,14000032,14000035,10000651,90049946,Machine Operator


In [10]:
bp_full = pd.merge(bp_2018.reset_index(drop=True), bp_2017.reset_index(drop=True), how='left', on='global_id')
bp_full = pd.merge(bp_full.reset_index(drop=True), bp_2016.reset_index(drop=True), how='left', on='global_id')
bp_full.head(2)

Unnamed: 0,2018_country_code,2018_employee_band,global_id,2018_analysis_block_l1_code,2018_analysis_block_l2_code,2018_macro_entity_l2_code,2018_position_id,2018_position_title,2017_country_code,2017_employee_band,2017_analysis_block_l1_code,2017_analysis_block_l2_code,2017_macro_entity_l2_code,2017_position_id,2017_position_title,2016_country_code,2016_employee_band,2016_analysis_block_l1_code,2016_analysis_block_l2_code,2016_macro_entity_l2_code,2016_position_id,2016_position_title
0,DE,BWG 08,1000069,14000000,14000001,10000651,50016964,Area Manager,DE,BWG 08-V,14000000,14000001,10000651,50016964,Area Manager,DE,BWG 08-V,0,0,10000651,50016964,Area Manager
1,DE,BWG 04,1000279,14000032,14000035,10000651,90049946,Machine Operator,DE,BWG 04,14000032,14000035,10000651,90049946,Machine Operator,DE,BWG 04,0,0,10000651,90049946,Machine Operator


## ENSEMBLE Module

- Takes the individual component predictions
- Combines the separate predictions using heuristics based on how well each component performs on each class
- For instance:
    - the model predictions were brilliant for the 3B/3As **[p0]**
    - the model predictions post shape improved the recall for the 1B/1A/4B/4As **[p1]**
    - the rule based predictions performed well for the 4B/4As **[p2]**
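
One way the heuristics above could be turned into a combination rule is sketched below. The priority order (rule-based first for 4A/4B, then post-shape ML for the tail classes, otherwise the base model) is an assumption made for illustration, and `ensemble_sketch` and the toy column names `p0`, `p1`, `p2` are hypothetical; this is not the notebook's final `ensemble` logic.

```python
import pandas as pd

def ensemble_sketch(df, p0='p0', p1='p1', p2='p2'):
    """Hypothetical priority rule over three prediction columns."""
    out = df[p0].copy()  # default: base model prediction (strong on 3B/3A)
    # Post-shape ML improved recall on the tails, so let it override there
    tail_mask = df[p1].isin(['1A', '1B', '4A', '4B'])
    out = out.mask(tail_mask, df[p1])
    # Rule-based output did well on 4B/4A, so it gets the final say for those
    rule_mask = df[p2].isin(['4A', '4B'])
    out = out.mask(rule_mask, df[p2])
    return out

toy_preds = pd.DataFrame({
    'p0': ['3B', '3A', '3B'],
    'p1': ['3B', '1A', '4B'],
    'p2': ['3B', '3B', '4A'],
})
combined = ensemble_sketch(toy_preds)
print(combined.tolist())
```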

In [282]:
def ensemble(p0, p1, p2, opr, bp):
    return None

In [283]:
# ensem = ensemble(p0=predictions_from_mod_beforeshape, p1=predictions_from_ml_aftershape, p2=predictions_from_rulebased, opr=, bp=)

In [68]:
# dummy
ensem = predictions_from_rulebased[['global_id', 'saz_rulebased_class']]

In [12]:
dep_dict = {'4A': 5, '4B': 4, '3A': 3, '3B': 2, '1A': 1, '1B': 0}
rev_dep_dict = {5:'4A', 4:'4B', 3:'3A', 2:'3B', 1:'1A', 0:'1B'}

# make a single dataframe
ads = pd.merge(predictions_from_ml_aftershape, predictions_from_rulebased, how='left', on=['global_id'])
ads['saz_rulebased_class'] = ads['saz_rulebased_class'].map(dep_dict)
ads = pd.merge(ads, opr_reshaped, how='left', on='global_id')
ads['saz_shape_class'] = ads['saz_shape_class'].map(rev_dep_dict)
ads['saz_rulebased_class'] = ads['saz_rulebased_class'].map(rev_dep_dict)
ads = pd.merge(ads, bp_full, how='left', on='global_id')
ads.shape
ads.head(2)

(5000, 28)

Unnamed: 0,global_id,saz_shape_class,saz_rulebased_class,opr_2015,opr_2016,opr_2017,opr_2018,2018_country_code,2018_employee_band,2018_analysis_block_l1_code,2018_analysis_block_l2_code,2018_macro_entity_l2_code,2018_position_id,2018_position_title,2017_country_code,2017_employee_band,2017_analysis_block_l1_code,2017_analysis_block_l2_code,2017_macro_entity_l2_code,2017_position_id,2017_position_title,2016_country_code,2016_employee_band,2016_analysis_block_l1_code,2016_analysis_block_l2_code,2016_macro_entity_l2_code,2016_position_id,2016_position_title
0,12025062,3B,3B,3A,3A,3B,3B,AR,III-B,14000187,14000201,10002169,10370374,Solutions Latam Serv & Ops Director-LAS,AR,III-B,0,0,10002169,10370374,Solutions Latam Serv & Ops Director-LAS,IN,IV-A,0.0,0.0,90000176.0,10519275.0,Director GCC Projects
1,28023327,1B,,,4B,3B,1A,BR,III-B,14000175,14000178,10002169,10527321,Digital Connection Director,BR,III-B,14000175,14000178,10002169,10527321,Digital Connection Director,,,,,,,


In [363]:
import pickle

# load backup
with open('../working/ads_backup.pkl', 'rb') as ads_backup:
    train = pickle.load(ads_backup)
    valid = pickle.load(ads_backup)
    #ytrain = pickle.load(ads_backup)
    #yvalid = pickle.load(ads_backup)

In [364]:
req_cols = ['global_id', 'pers_year_comp_score_mean', 'pers_compgroup_year_comp_score_mean_leadership_competencies', 'net_target', 
            'cummean_global_id_net_target', 'prev_net_target', 'position_tenure', 'time_in_band']

train = train[req_cols]
valid = valid[req_cols]

train = train.add_prefix(prefix='2017_')
valid = valid.add_prefix(prefix='2018_')

train.rename(columns={'2017_global_id':'global_id'}, inplace=True)
valid.rename(columns={'2018_global_id':'global_id'}, inplace=True)

In [366]:
ads = pd.merge(ads, train, how='left', on='global_id')
ads = pd.merge(ads, valid, how='left', on='global_id')

In [368]:
## the PIVOT ##

ads['saz_shape_class_true'] = np.where(ads['saz_shape_class']==ads['opr_2018'], 1, 0)
ads['saz_rulebased_class_true'] = np.where(ads['saz_rulebased_class']==ads['opr_2018'], 1, 0)
ads.head(1)

Unnamed: 0,global_id,saz_shape_class,saz_rulebased_class,opr_2015,opr_2016,opr_2017,opr_2018,2018_country_code,2018_employee_band,2018_analysis_block_l1_code,2018_analysis_block_l2_code,2018_macro_entity_l2_code,2018_position_id,2018_position_title,2017_country_code,2017_employee_band,2017_analysis_block_l1_code,2017_analysis_block_l2_code,2017_macro_entity_l2_code,2017_position_id,2017_position_title,2016_country_code,2016_employee_band,2016_analysis_block_l1_code,2016_analysis_block_l2_code,2016_macro_entity_l2_code,2016_position_id,2016_position_title,2017_pers_year_comp_score_mean,2017_pers_compgroup_year_comp_score_mean_leadership_competencies,2017_net_target,2017_cummean_global_id_net_target,2017_prev_net_target,2017_position_tenure,2017_time_in_band,2018_pers_year_comp_score_mean,2018_pers_compgroup_year_comp_score_mean_leadership_competencies,2018_net_target,2018_cummean_global_id_net_target,2018_prev_net_target,2018_position_tenure,2018_time_in_band,saz_shape_class_true,saz_rulebased_class_true
0,12025062,3B,3B,3A,3A,3B,3B,AR,III-B,14000187,14000201,10002169,10370374,Solutions Latam Serv & Ops Director-LAS,AR,III-B,0,0,10002169,10370374,Solutions Latam Serv & Ops Director-LAS,IN,IV-A,0,0,90000176,10519275,Director GCC Projects,3.49,3.537,45.0,81.667,100.0,212.0,0.0,3.109,3.109,86.0,82.75,45.0,577.0,365.0,1,1


In [335]:
## function to groupby one/multiple levels and check the class recall/precision

def mean_perc(x):
    return round((x.mean())*100,2)
mean_perc.__name__ = 'mean_percentage'

def group_results(df, group_cols):
    # accept a single column name or a list of column names
    group_cols = [group_cols] if isinstance(group_cols, str) else list(group_cols)
    dummy = df[['global_id', 'saz_shape_class', 'saz_rulebased_class', 'opr_2018',
                'saz_shape_class_true', 'saz_rulebased_class_true'] + group_cols].copy()
    # size of each group before splitting by the true OPR class
    dummy['count'] = dummy.groupby(group_cols)['global_id'].transform(len)
    dummy = (dummy
             .groupby(group_cols + ['opr_2018'])
             .agg({'saz_shape_class_true':[mean_perc, 'sum', 'count'], 
                   'saz_rulebased_class_true':[mean_perc, 'sum'],
                  'count':'first',
                  })
             .reset_index()
             .my_flatten_cols())
    # share of each true class within its group
    dummy['count_first'] = dummy['saz_shape_class_true_count']/dummy['count_first']
    dummy.rename(columns={'count_first':'population_proportion_of_selected_groups'}, inplace=True)
    return dummy

In [341]:
dummy = ads[['global_id', 'saz_shape_class', 'saz_rulebased_class', 'opr_2018', '2018_country_code', 
             'saz_shape_class_true', 'saz_rulebased_class_true']].copy()
dummy['count'] = dummy.groupby('2018_country_code')['global_id'].transform(len)
dummy = (dummy
         .groupby(['2018_country_code', 'opr_2018'])
         .agg({'saz_shape_class_true':[mean_perc, 'sum', 'count'], 
               'saz_rulebased_class_true':[mean_perc, 'sum'],
              'count':'first',
              })
         .reset_index()
         .my_flatten_cols())
dummy['count_first'] = dummy['saz_shape_class_true_count']/dummy['count_first']
dummy.rename(columns={'count_first':'population_proportion_of_selected_groups'}, inplace=True)

dummy.head(10)

Unnamed: 0,2018_country_code,opr_2018,saz_shape_class_true_mean_percentage,saz_shape_class_true_sum,saz_shape_class_true_count,saz_rulebased_class_true_mean_percentage,saz_rulebased_class_true_sum,population_proportion_of_selected_groups
0,AR,1A,15.56,7,45,4.44,2,0.062937
1,AR,1B,50.0,13,26,0.0,0,0.036364
2,AR,3A,61.42,164,267,22.47,60,0.373427
3,AR,3B,60.08,149,248,32.26,80,0.346853
4,AR,4A,20.51,8,39,7.69,3,0.054545
5,AR,4B,36.67,33,90,15.56,14,0.125874
6,BO,1A,28.57,2,7,0.0,0,0.058333
7,BO,1B,50.0,1,2,0.0,0,0.016667
8,BO,3A,53.19,25,47,27.66,13,0.391667
9,BO,3B,63.64,28,44,40.91,18,0.366667
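
Because the table above groups by the true class (`opr_2018`), the `mean_percentage` columns are per-class recall. A sketch of how precision could be computed alongside it, by grouping the same hit flag on the predicted class instead, is shown below; `recall_and_precision` and the toy data are hypothetical.

```python
import pandas as pd

def recall_and_precision(df, pred_col, true_col):
    """Per-class recall (grouped by true class) and precision (grouped by predicted class)."""
    hit = (df[pred_col] == df[true_col]).astype(int)
    recall = hit.groupby(df[true_col]).mean().rename('recall')
    precision = hit.groupby(df[pred_col]).mean().rename('precision')
    # Classes that were never predicted get NaN precision
    return pd.concat([recall, precision], axis=1)

toy_eval = pd.DataFrame({
    'true': ['3B', '3B', '3A', '1A'],
    'pred': ['3B', '3A', '3A', '3A'],
})
metrics = recall_and_precision(toy_eval, pred_col='pred', true_col='true')
print(metrics)
```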


In [102]:
## RULE MODULE

- If OPR 2016 = OPR 2017 = 4B on the same position, then OPR 2018 = 4A
- If OPR 2017 = 4A on the same position, then OPR 2018 = 4A
- If OPR 2016 = OPR 2017 = 3A on the same band and results are in the first quartile, then OPR 2018 = 4B; otherwise OPR 2018 = 3B
- If the suggested OPR rating = 3B and results + people KPIs are in the bottom 10%, then OPR 2018 = 1A
- If the previous year's OPR = 1A and the suggested OPR = 1A, then OPR = 1B
- If time in band < 1 year, then OPR < 3A
- If the culture question average in the leadership competency appraisal < 2, then OPR 2018 = 1B
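
A few of these rules can be sketched as vectorised overrides on a suggested rating. The sketch assumes "same position" means an unchanged `position_id` between years, that the column names match the `ads` frame built earlier, and that later rules take precedence over earlier ones; `apply_rules_sketch`, `suggested_opr` as a column of `ads`, and the toy rows are hypothetical.

```python
import pandas as pd

def apply_rules_sketch(df, suggested_col='suggested_opr'):
    """Hypothetical implementation of three of the listed rules."""
    out = df[suggested_col].copy()
    same_pos_17_18 = df['2017_position_id'] == df['2018_position_id']
    same_pos_16_18 = same_pos_17_18 & (df['2016_position_id'] == df['2017_position_id'])

    # Previous year 1A and suggestion 1A -> 1B
    out = out.mask((df['opr_2017'] == '1A') & (out == '1A'), '1B')
    # 4A in 2017 on the same position -> 4A
    out = out.mask((df['opr_2017'] == '4A') & same_pos_17_18, '4A')
    # 4B in both 2016 and 2017 on the same position -> 4A
    out = out.mask((df['opr_2016'] == '4B') & (df['opr_2017'] == '4B') & same_pos_16_18, '4A')
    return out

toy_rules = pd.DataFrame({
    'opr_2016':         ['4B', '3A', '3B', '4B'],
    'opr_2017':         ['4B', '4A', '1A', '4B'],
    '2016_position_id': [1, 2, 3, 4],
    '2017_position_id': [1, 2, 3, 5],
    '2018_position_id': [1, 2, 3, 5],
    'suggested_opr':    ['4B', '3A', '1A', '4B'],
})
ruled = apply_rules_sketch(toy_rules)
print(ruled.tolist())
```

The last toy row keeps its suggested 4B because the position changed between 2016 and 2017.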


- Is the person a GMT?
- Is the person a GMBA?
- Has the person been a People Bet?
- Is the person in a talent pool?
- Does the person have a PDP (personal development plan)? Maybe only a few people do, but my hypothesis is that this could differentiate a 3B from a 3A: someone who is working on closing their gaps should be more likely to grow.
- Does the person have a Green Belt or a Black Belt?
- Has the person filled in the career aspirations field in Navigate? Someone who has not filled it in is likely to be less ambitious.
- Career speed (we could create a KPI for this, for instance average time in band throughout the career). I know this can become a “self-fulfilling prophecy”, since saying that a person who is growing faster has more potential might influence the classification, but I would test it and discuss the results.
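
The career speed KPI suggested above could be sketched as the average number of days spent per band. The `band_history` table, its column names, and its values are hypothetical; the notebook only carries a single `time_in_band` column, so a per-band history would need to be sourced separately.

```python
import pandas as pd

# Hypothetical band history: one row per employee per band held
band_history = pd.DataFrame({
    'global_id':    [1, 1, 1, 2, 2],
    'band':         ['VII', 'VI', 'V', 'VII', 'VI'],
    'days_in_band': [400, 500, 300, 900, 1100],
})

# Lower average days per band means faster progression
career_speed = (band_history
                .groupby('global_id')['days_in_band']
                .mean()
                .rename('avg_days_per_band')
                .reset_index())
print(career_speed)
```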

In [58]:
ensem = pd.merge(ensem.reset_index(drop=True), opr_reshaped, how='left', on=['global_id'])
ensem

Unnamed: 0,global_id,saz_rulebased_class,opr_2015,opr_2016,opr_2017,opr_2018
0,12025062,3B,3A,3A,3B,3B
...,...,...,...,...,...,...
3970,99803931,3B,,,,3B


In [None]:
from pivottablejs import pivot_ui
pivot_ui(ads)