# Industry Accelerators - Customer Life Event Prediction Models

## Introduction
In this notebook we walk through an end-to-end project that loads long-form transactional data and prepares it in a wide format. The long-form data contains a feature named `EVENT_TYPE_ID`, which records the events each customer has experienced. We are specifically interested in events with the prefix `LFE_`: these are significant <b>Life Events</b> experienced by the client. This project targets two of them, `LFE_RELOCATION` and `LFE_HOME_PURCHASE`. The machine learning models that predict the likelihood of these two events for our clients are built in the next notebook (`2-model-training.ipynb`).

Additionally, the user has the option to incorporate `Census`-derived probabilities for migration, birth, marriage and divorce into the dataset that is used for modeling.

Before executing this notebook on IBM Cloud :<br>
1) When you import this project on an IBM Cloud environment, a project access token should be inserted at the top of this notebook as a code cell. <br>
If you do not see the cell above, insert a project token: click **More -> Insert project token** in the top-right menu section and run the cell. <br>

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)
2) You can then step through the notebook execution cell by cell, by selecting Shift-Enter. Or you can execute the entire notebook by selecting **Cell -> Run All** from the menu.<br>


In [3]:
try:
    project
except NameError:
    # READING AND WRITING PROJECT ASSETS
    import project_lib
    project = project_lib.Project() 

## Load Event Data
For this project, we will be loading in the long form data called `event.csv`. You can execute this notebook with the sample data provided with the project. We will use `project_lib` to read the data.

**Sample Materials, provided under license. <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2019, 2020. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.<br>**

In [4]:
!pip install chart_studio
import pandas as pd
import time
pd.set_option('display.max_columns', 500)
from chart_studio.plotly import iplot
import plotly as py
py.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import pickle

Collecting chart_studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0


In [5]:
# Read event data from a CSV file.
events = pd.read_csv(project.get_file('event.csv'), parse_dates=['EVENT_DATE'], dayfirst=True)
    
# display the event data
print('\nEvent Data:')
display(events.head())
print("{} rows, {} columns".format(*events.shape))


Event Data:


| | CUSTOMER_ID | EVENT_DATE | EVENT_TYPE_ID |
|---|---|---|---|
| 0 | 1103 | 2013-02-02 | INT_OTHER_PHYSICAL |
| 1 | 1103 | 2013-02-02 | XCT_MORTGAGE_NEW |
| 2 | 1769 | 2013-02-03 | INT_OTHER_PHYSICAL |
| 3 | 1769 | 2013-02-03 | XCT_MORTGAGE_NEW |
| 4 | 1879 | 2013-02-15 | XCT_MORTGAGE_NEW |


103426 rows, 3 columns


The `event.csv` data that we loaded has only 3 columns and 103,426 rows. The three columns are `CUSTOMER_ID`, `EVENT_DATE`, and `EVENT_TYPE_ID`. Each row records a client experiencing a specific event on a given date. We'll take this data and count the events per client while focusing on the two target life events: <b>home purchase and relocation</b>.

### Display Distinct Event Types in Data

All of the unique events can be found in the dataset under the feature `EVENT_TYPE_ID`. The wide format dataset is created from this list of unique events, with flags and counts that indicate whether, and how often, a client experienced each event.
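
As a rough illustration of the long-to-wide idea, the sketch below counts events per customer and derives 0/1 flags from those counts. The sample rows and the names `sample_events`, `event_counts` and `event_flags` are made up for this example; the actual transformation is done later by `life_event_prep.py` and may differ in detail.

```python
import pandas as pd

# Hypothetical long-form sample with the same three columns as event.csv.
sample_events = pd.DataFrame({
    "CUSTOMER_ID": [1103, 1103, 1103, 1769, 1769],
    "EVENT_DATE": pd.to_datetime(
        ["2013-02-02", "2013-02-02", "2013-03-10", "2013-02-03", "2013-02-03"]
    ),
    "EVENT_TYPE_ID": [
        "INT_OTHER_PHYSICAL", "XCT_MORTGAGE_NEW", "INT_OTHER_PHYSICAL",
        "INT_OTHER_PHYSICAL", "LFE_HOME_PURCHASE",
    ],
})

# One row per customer, one column per distinct event type; values are counts.
event_counts = pd.crosstab(sample_events["CUSTOMER_ID"], sample_events["EVENT_TYPE_ID"])

# Flags: 1 if the customer experienced the event at least once, otherwise 0.
event_flags = (event_counts > 0).astype(int)

print(event_counts)
print(event_flags)
```

Each distinct value of `EVENT_TYPE_ID` becomes a column, which is why the list of unique events determines the shape of the wide dataset.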

In [6]:
events['EVENT_TYPE_ID'].unique()

array(['INT_OTHER_PHYSICAL', 'XCT_MORTGAGE_NEW',
       'MENTION_LFE_HOME_PURCHASE', 'XCT_EQ_SELL', 'XFER_FUNDS_OUT_LARGE',
       'LFE_RELOCATION', 'LFE_HOME_PURCHASE', 'ACNT_SEC_OPEN_*',
       'BIRTHDAY30', 'INT_LOGIN_WEB', 'BIRTHDAY77', 'BIRTHDAY36',
       'BIRTHDAY51', 'BIRTHDAY59', 'BIRTHDAY22', 'XCT_EQ_BUY',
       'BIRTHDAY55', 'BIRTHDAY47', 'BIRTHDAY58', 'BIRTHDAY43',
       'BIRTHDAY35', 'BIRTHDAY67', 'BIRTHDAY27', 'BIRTHDAY32',
       'BIRTHDAY63', 'BIRTHDAY24', 'BIRTHDAY34', 'BIRTHDAY80',
       'BIRTHDAY21', 'BIRTHDAY82', 'BIRTHDAY29', 'BIRTHDAY81',
       'BIRTHDAY73', 'BIRTHDAY42', 'BIRTHDAY53', 'BIRTHDAY70',
       'BIRTHDAY28', 'BIRTHDAY40', 'BIRTHDAY62', 'BIRTHDAY41',
       'BIRTHDAY74', 'BIRTHDAY69', 'BIRTHDAY64', 'BIRTHDAY44',
       'BIRTHDAY71', 'BIRTHDAY76', 'BIRTHDAY23', 'BIRTHDAY68',
       'BIRTHDAY38', 'BIRTHDAY79', 'BIRTHDAY78', 'BIRTHDAY39',
       'BIRTHDAY37', 'BIRTHDAY45', 'BIRTHDAY54', 'BIRTHDAY25',
       'BIRTHDAY56', 'BIRTHDAY60', 'BIRTHDAY49', 'BI

### Select Life Event Types to Predict

As mentioned above, the focus will be on predicting the two life events `LFE_RELOCATION` and `LFE_HOME_PURCHASE` so we'll filter the `EVENT_TYPE_ID` feature to only those events with the life event prefix of `LFE_`.

In [7]:
# prediction_types = ['LFE_HOME_PURCHASE','LFE_RELOCATION']
prediction_types = [event_type for event_type in list(events['EVENT_TYPE_ID'].unique()) if event_type[:4] == 'LFE_']

print("\nEvent Types to Predict:")
display(prediction_types)
print()


Event Types to Predict:


['LFE_RELOCATION', 'LFE_HOME_PURCHASE']




## User Inputs and Data Prep

See `/project_data/data_asset/life_event_prep.py` for details of the data preparation.

This script generates the dataset that is used as input for model training and scoring. Given a list of events that customers experienced, the script transforms this long-form dataset into a wide format that can be used for modeling.

#### Data Cleaning
A number of functions are carried out throughout the code for cleaning the data. <br>

•	Any customer who does not have at least `observation_window` + `forecast_horizon` consecutive months of historical data is filtered out of the training dataset <br>
•	Customers who don’t have any events in their observation window are also removed from the training dataset <br>
•	Similarly, when scoring, any customer who had no events in the observation window is filtered out <br>
•	Any life event that doesn’t have at least `life_event_minimum_target_count` (default 100) unique customers experiencing it is removed and not used as a target variable <br>
•	Any column with more than 10% nulls is dropped from the training dataset <br>
•	Any columns with a constant value are removed from the training dataset <br>
•	The final scoring dataset is engineered to ensure it has the same columns and order as the training dataset <br>
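
The two column-level rules above (more than 10% nulls, constant values) can be sketched as follows. This is a minimal illustration under assumptions, not the code from `life_event_prep.py`: the helper name `drop_sparse_and_constant_columns`, its `max_null_fraction` parameter and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_sparse_and_constant_columns(df, max_null_fraction=0.10):
    """Drop columns with too many nulls, then columns holding a single value."""
    null_fraction = df.isnull().mean()
    kept = df.loc[:, null_fraction <= max_null_fraction]
    # nunique(dropna=False) counts NaN as a value, so only truly constant columns match.
    is_constant = kept.nunique(dropna=False) <= 1
    return kept.loc[:, ~is_constant]

# Hypothetical training frame: one useful column, one sparse, one constant.
training = pd.DataFrame({
    "E_INT_LOGIN_WEB_OW": [3, 0, 1, 2, 5, 0, 1, 4, 2, 1],
    "E_MOSTLY_NULL_OW": [np.nan, np.nan, 1, np.nan, 2, np.nan, np.nan, 1, np.nan, np.nan],
    "E_CONSTANT_OW": [0] * 10,
    "TARGET": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})

cleaned = drop_sparse_and_constant_columns(training)
print(list(cleaned.columns))
```

When scoring, the same column list would then be reapplied to the scoring frame, for example with `DataFrame.reindex(columns=...)`, so that both datasets share columns and order.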

#### User Inputs for `LifeEventPrep`:
- **target_event_type_ids :**  A list of life events which we are trying to predict. Single or multiple events can be handled.
- **train_or_score :** Specifies whether we are prepping the data for training or for scoring. The training dataset includes the target variable while the scoring dataset does not.
- **training_start_date :** The start date that we start counting events from for training. Variable is a string.
- **training_end_date :** Cut off date for training events. Any events occurring after this date will not be included in training data. Again, the variable is a string.
- **forecast_horizon :** The window of time that we want to predict in. This is the number of months after the observation month in which the event can occur.
- **observation_window :** The lookback period from the observation month. We use the count of number of events which occurred in this window as input variables. Again, this variable is in months.
- **life_event_minimum_target_count :** To include a particular life event target variable, the variable must have at least this number of unique customers associated with it.
- **cols_to_drop**: Columns to be dropped due to known irrelevance to target (e.g. ID column)
- **b_use_census_data**: Boolean variable to allow the user to specify whether they would like to use the supplied census data or not.
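
To make `observation_window` and `forecast_horizon` more concrete, here is a small sketch of the month arithmetic around a single observation month in `YYYYMM` format. The helper names `add_months` and `window_bounds` are hypothetical. The assumption that the observation window includes the observation month itself is ours and may not match the exact boundaries used in `life_event_prep.py`.

```python
import datetime
from dateutil.relativedelta import relativedelta

def add_months(year_month, months_to_add):
    """Shift a YYYYMM integer by a number of months (may be negative)."""
    shifted = (datetime.datetime.strptime(str(year_month), "%Y%m")
               + relativedelta(months=months_to_add))
    return int(shifted.strftime("%Y%m"))

def window_bounds(obs_month, observation_window=4, forecast_horizon=3):
    # Assumption: the observation window ends at, and includes, the observation
    # month; the forecast horizon starts the month after it.
    return {
        "observation_start": add_months(obs_month, -(observation_window - 1)),
        "observation_end": obs_month,
        "forecast_start": add_months(obs_month, 1),
        "forecast_end": add_months(obs_month, forecast_horizon),
    }

print(window_bounds(201701))
```

With the default values, events counted as inputs come from the months up to the observation month, and the target event is looked for in the following `forecast_horizon` months.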


#### Census Data (optional)

If you would like to use probabilities generated from US Census data along with the life events, set `b_use_census_data` to `True` so that the `prep_census_data.py` script in the `/project_data/data_asset/` folder is called. This script maps each customer in the events data to their most similar customer type in the census data, based on customer gender, age range, income, marital status, location, education and employment status. The census data can be found in the `/project_data/data_asset/census_probabilities.csv` file and contains the following fields:

- **LOCATION :** Location (state) of the customer
- **MARITAL_STATUS :** Marital status of the customer
- **EDUCATION :** Customer's highest level of education
- **GENDER :** Gender of the customer
- **EMPLOYMENT :** Customer's employment status
- **INCOME :** Customer's annual income
- **AGE :** Customer's age
- **MIGRATION_PROB :** Probability of a customer to relocate
- **BIRTH_PROB :** Probability of a customer to give birth
- **MARRIAGE_PROB :** Probability of a customer to get married
- **DIVORCE_PROB :** Probability of a customer to get divorced

If you use different customer data, you need to edit the `prep_census_data.py` script and generate the relevant mappings between the new customer details and census data.

*Disclaimer: The census data used in the accelerator is for showcasing the capability of the accelerator. It might not reflect the actual census data.*
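
The mapping idea can be sketched with a simplified exact-match join on a few demographic keys. Everything here is illustrative: the sample values, the `AGE_YEARS` column and the age buckets are assumptions, and `prep_census_data.py` matches on more fields and finds the most similar customer type rather than requiring an exact match.

```python
import pandas as pd

# Hypothetical census table keyed by a reduced set of demographic buckets.
census = pd.DataFrame({
    "LOCATION": ["NY", "NY", "CA"],
    "GENDER": ["F", "M", "F"],
    "AGE": ["25-34", "35-44", "25-34"],
    "MIGRATION_PROB": [0.12, 0.08, 0.15],
    "MARRIAGE_PROB": [0.06, 0.03, 0.05],
})

# Hypothetical customer details with a raw age in years.
customers = pd.DataFrame({
    "CUSTOMER_ID": [1103, 1769, 1879],
    "LOCATION": ["NY", "CA", "TX"],
    "GENDER": ["F", "F", "M"],
    "AGE_YEARS": [29, 31, 40],
})

# Bucket the raw age so that it lines up with the census age ranges.
customers["AGE"] = pd.cut(
    customers["AGE_YEARS"], bins=[24, 34, 44], labels=["25-34", "35-44"]
).astype(str)

# A left join keeps every customer; those without a census match get NaN probabilities.
with_census = customers.merge(census, on=["LOCATION", "GENDER", "AGE"], how="left")
print(with_census[["CUSTOMER_ID", "MIGRATION_PROB", "MARRIAGE_PROB"]])
```

Customers without a matching census row end up with missing probabilities, which is one reason a similarity-based mapping is preferable to an exact join when customer attributes do not line up perfectly with the census categories.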

In [8]:
b_use_census_data = False

In [9]:
%%writefile life_event_prep.py

"""
Sample Materials, provided under license.
Licensed Materials - Property of IBM
© Copyright IBM Corp. 2019. All Rights Reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
"""

import pandas as pd
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta
import sys
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import json
import os

class LifeEventPrep():

    def __init__(self, target_event_type_ids, b_use_census_data, train_or_score='train', training_start_date="2010-01-01", training_end_date="2017-08-01", forecast_horizon=3,
                 observation_window=4, scoring_end_date=datetime.datetime.today(), life_event_minimum_target_count=100, norepeat_months=4, cols_to_drop=['CUSTOMER_ID']):

        self.train_or_score = train_or_score
        self.target_event_type_ids = target_event_type_ids
        self.b_use_census_data = b_use_census_data
        self.forecast_horizon = forecast_horizon
        self.observation_window = observation_window
        self.training_start_date = training_start_date
        self.training_end_date = training_end_date
        self.scoring_end_date = scoring_end_date
        self.life_event_minimum_target_count = life_event_minimum_target_count
        self.norepeat_months = norepeat_months
        self.cols_to_drop = cols_to_drop
        self.latency_start = 1
        self.perc_positive_cutoff = 1.0

        # if a string for a particular date for end of scoring(obs month) is passed, convert to datetime
        if self.train_or_score == 'score':
            if isinstance(self.scoring_end_date, str):
                self.scoring_end_date = datetime.datetime.strptime(self.scoring_end_date, '%Y-%m-%d')


        if self.train_or_score == 'train':
            # create a dictionary with all values for user inputs. We will save this out and use it for scoring
            # to ensure that the user inputs are consistent across train and score notebooks
            # exclude variables that won't be used for scoring
            self.user_inputs_dict = { 'target_event_type_ids' : target_event_type_ids, 'b_use_census_data' : b_use_census_data, 'forecast_horizon' : forecast_horizon,
                'observation_window' : observation_window, 'life_event_minimum_target_count' : life_event_minimum_target_count, 'norepeat_months' : norepeat_months, 'cols_to_drop' : cols_to_drop}

    # Some functions to handle adding, subtracting, randomly selecting dates that are in YYYYMM format
    def udf_n_months(self, dateMin, dateMax):
        # number of whole months between two YYYYMM dates
        delta = relativedelta(datetime.datetime.strptime(str(dateMax), '%Y%m'),
                              datetime.datetime.strptime(str(dateMin), '%Y%m'))
        return delta.months + delta.years * 12

    # function returns a random date in [dateMin, dateMax]
    def udf_rand_date_in_range(self, dateMin, dateMax, randnumber):
        # randnumber is expected in [0, 1); scale it to an offset of 0..n_months months from dateMin
        months_offset = int(np.floor(randnumber * (self.udf_n_months(dateMin, dateMax) + 1)))
        rand_month_in_range = datetime.datetime.strptime(str(dateMin), '%Y%m') + relativedelta(months=months_offset)

        return rand_month_in_range.strftime('%Y%m')

    # function to add a specified number of months to a date in YYYYMM format
    def udf_add_months(self, YEAR_MONTH, months_to_add):
        new_date = int((datetime.datetime.strptime(str(YEAR_MONTH), '%Y%m') + relativedelta(months=months_to_add)).strftime('%Y%m'))
        return new_date

    # selects the observation month
    def udf_sub_rand_latency(self, YEAR_MONTH, randnumber, latency_start, forecast_horizon):
        rand_obs_month = (datetime.datetime.strptime(str(YEAR_MONTH), '%Y%m')
                            - relativedelta(months=latency_start + int(randnumber * (forecast_horizon - latency_start + 1)))).strftime('%Y%m')
        return rand_obs_month

    # Prep functions
    def prepare_single_event_type(self, target_event_type_id, events, cym_cnt, train_or_score):
        # ensure the dataframe has data in it - can be empty if scoring low number of cases and they were filtered out based on dates
        if cym_cnt.shape[0] > 0:
            # This function preps the data for one specified target event
            # create a temporary target variable, which is 1 if the specific life event passed to the function
            # happened in that month, otherwise 0
            print('\nPrepping data for ' + target_event_type_id)
            target_col = 'E_' + target_event_type_id
            # make sure that the target event exists as a column in the dataset, if not, add it with all 0's (should only really happen when scoring)
            if target_col not in list(cym_cnt.columns):
                cym_cnt[target_col] = 0

            cym_cnt['TARGET'] = 0
            cym_cnt.loc[cym_cnt[target_col]>0, 'TARGET'] = 1

            # Call the select_customer_observation_month function - returns 1 record per customer
            # with the target variable, the observation month, the start month of the observation period
            # and the end month of the forecast horizon month
            df_cust_target = self.select_customer_observation_month(cym_cnt, train_or_score)

            # filter out customers who have no event data in the observation window
            # join the target df with cym_cnt, the df with customer and YYYYMM and column per event
            print('Number of customers before removing those with no event data in observation window : ' + str(df_cust_target.shape[0]))
            if train_or_score=='train':
                print('Number of customers target=1 before above filtering : ' + str(df_cust_target[df_cust_target['TARGET']==1].shape[0]))

            # remove the temp target var in cym_cnt as we now have our final target
            cym_cnt.drop('TARGET', axis=1, inplace=True)
            df_cust_target_cym_cnt = cym_cnt.merge(df_cust_target, on='CUSTOMER_ID', how='inner')
            df_cust_target_cym_cnt = df_cust_target_cym_cnt[(df_cust_target_cym_cnt['YEAR_MONTH'].astype(int)>=df_cust_target_cym_cnt['OBS_MONTH_MIN_OW'].astype(int))
                                    & (df_cust_target_cym_cnt['YEAR_MONTH'].astype(int)<=df_cust_target_cym_cnt['OBS_MONTH'].astype(int))]

            # update the target df to include only those customers who have had events in the observation window
            df_cust_target = df_cust_target[df_cust_target['CUSTOMER_ID'].isin(list(df_cust_target_cym_cnt['CUSTOMER_ID'].unique()))]

            print('Number of customers after removing those with no event data in observation window : ' + str(df_cust_target.shape[0]))

            if df_cust_target.shape[0] == 0:
                print('Customers had no events in observation window.', file=sys.stderr)
                return None

            if train_or_score=='train':
                print('Number of customers target=1 after above filtering : ' + str(df_cust_target[df_cust_target['TARGET']==1].shape[0]))

            # not all customers have had an event in their observation month
            # Therefore they wouldn't have a record in the data for the observation month
            # We add the record in and fill all events with 0
            # This guarantees every customer has an observation month row when the '_OM' counts are built later
            # Get the observation month for each customer, join to the wide df (cym_cnt) on customer and YYYYMM,
            # Use a right outer join so if the observation month isn't in the wide df it will be included after join

            df_temp = cym_cnt.merge(df_cust_target[['CUSTOMER_ID', 'OBS_MONTH']], left_on=['CUSTOMER_ID', 'YEAR_MONTH'],
                                            right_on=['CUSTOMER_ID', 'OBS_MONTH'], how='right')
            # only take where the result is null, i.e. where the observed month wasn't in the wide (cym_cnt) table
            df_temp = df_temp[df_temp['YEAR_MONTH'].isnull()]
            # update YEAR_MONTH to be the OBS_MONTH and drop OBS_MONTH, fill na's with 0
            df_temp['YEAR_MONTH'] = df_temp['OBS_MONTH']
            df_temp.drop('OBS_MONTH', axis=1, inplace=True)
            df_temp.fillna(0, inplace=True)

            # join back to target df so we can add the relevant columns from that df
            # then append onto the df_cust_target_cym_cnt df
            # we then just have a df with customer, YYYYMM and column for each event,
            # but always including a record for the observation month for a customer
            df_temp = df_temp.merge(df_cust_target, on='CUSTOMER_ID', how='inner')
            df_cust_target_cym_cnt = pd.concat([df_cust_target_cym_cnt, df_temp], sort=False)

            # more filtering to remove edge cases
            # for scoring, remove customers who have experienced the life event in the previous norepeat_months months
            # where start of the observation window is before the start of our data, remove the records
            # where the end of the forecast period is after the end of our data, remove the records
            YM_min = df_cust_target_cym_cnt['YEAR_MONTH'].min()
            YM_max = df_cust_target_cym_cnt['YEAR_MONTH'].max()

            #print('Number of customers before filtering : ' + str(df_cust_target_cym_cnt['CUSTOMER_ID'].nunique()))
            #print('Number of customers target=1 before filtering : ' + str(df_cust_target_cym_cnt[df_cust_target_cym_cnt['TARGET']==1]['CUSTOMER_ID'].nunique()))

            df_cust_target_cym_cnt = df_cust_target_cym_cnt[df_cust_target_cym_cnt['OBS_MONTH_MIN_OW']>=YM_min]
            if train_or_score == 'train':
                df_cust_target_cym_cnt = df_cust_target_cym_cnt[df_cust_target_cym_cnt['OBS_MONTH_PLS_LATEND']<=YM_max]
            elif train_or_score == 'score':
                norepeat_months = self.norepeat_months
                # we don't want to score customers who have experienced the life event in the previous norepeat_months
                start_norepeat_period = self.udf_add_months(YM_max, 1-norepeat_months)

                customers_norepeat = df_cust_target_cym_cnt[(df_cust_target_cym_cnt['TARGET']>0) &
                                                (df_cust_target_cym_cnt['TARGET_MONTH']<start_norepeat_period)]

                customers_norepeat = pd.DataFrame(customers_norepeat['CUSTOMER_ID'].drop_duplicates())
                customers_norepeat['LIFE_EVENT_B4_NOREP_PERIOD'] = 1

                df_cust_target_cym_cnt = df_cust_target_cym_cnt.merge(customers_norepeat, on='CUSTOMER_ID', how='left')

                # Keep records where the target is 0 (haven't experienced the life event),
                # or target is 1 and life_event_b4_norep_period is 1 (the customer experienced the life event but it was
                # more than norepeat_months ago)
                df_cust_target_cym_cnt = df_cust_target_cym_cnt[(df_cust_target_cym_cnt['TARGET']==0) | ((df_cust_target_cym_cnt['TARGET']==1) &
                                                                        (df_cust_target_cym_cnt['LIFE_EVENT_B4_NOREP_PERIOD']==1))]

                df_cust_target_cym_cnt.drop(['LIFE_EVENT_B4_NOREP_PERIOD'], axis=1, inplace=True)
                if df_cust_target_cym_cnt.shape[0] == 0:
                    print('Note: All customers filtered out as they experienced the life event within ' +
                        str(norepeat_months) + ' months (norepeat_months) of the observation date', file=sys.stderr)


            #print('Number of customers after filtering : ' + str(df_cust_target_cym_cnt['CUSTOMER_ID'].nunique()))
            #print('Number of customers target=1 after filtering : ' + str(df_cust_target_cym_cnt[df_cust_target_cym_cnt['TARGET']==1]['CUSTOMER_ID'].nunique()))

            # get data into AMT format, one line of data per customer
            # we will create variables that are a count of each event per customer over their observation window (end with '_OW')
            # We also create variables for the count of each event in the actual observation month

            # remove columns we don't need anymore
            df_cust_target_cym_cnt.drop(['TARGET_MONTH', 'OBS_MONTH_MIN_OW', 'OBS_MONTH_PLS_LATEND'], axis=1, inplace=True)

            # count the number of occurrences of each event over the observation window
            # the summed TARGET is dropped here; the correct target comes from the observation month records below
            df_per_cust_ow = df_cust_target_cym_cnt.groupby(['CUSTOMER_ID', 'OBS_MONTH']).sum().reset_index()
            df_per_cust_ow.drop(['YEAR_MONTH', 'TARGET'], axis=1, inplace=True)
            # suffix the event count columns with '_OW' to mark them as observation window counts

            for col in df_per_cust_ow.columns:
                if col.startswith('E_'):
                    new_col_name = col + '_OW'
                    df_per_cust_ow.rename(columns={col:new_col_name}, inplace=True)

            # get the number of occurrences of each event in the observation month
            df_per_cust_om = df_cust_target_cym_cnt[df_cust_target_cym_cnt['OBS_MONTH']==df_cust_target_cym_cnt['YEAR_MONTH']].copy()
            df_per_cust_om.drop('YEAR_MONTH', axis=1, inplace=True)

            for col in df_per_cust_om.columns:
                if col.startswith('E_'):
                    new_col_name = col + '_OM'
                    df_per_cust_om.rename(columns={col:new_col_name}, inplace=True)

            df_per_cust = df_per_cust_ow.merge(df_per_cust_om, on=['CUSTOMER_ID', 'OBS_MONTH'], how='inner')

            # add a variable for observation month.
            df_per_cust['MONTH'] = df_per_cust['OBS_MONTH'].astype(str).str[4:].astype(int)

            # get total number of events per customers in observation window and in observation month
            events_ow_cols = list(df_per_cust.loc[:, df_per_cust.columns.str.endswith('_OW')].columns)
            events_om_cols = list(df_per_cust.loc[:, (~(df_per_cust.columns.str.endswith('_OW')) & (df_per_cust.columns.str.startswith('E_')))].columns)

            df_per_cust['TOT_NB_OF_EVENTS_OW'] = df_per_cust[events_ow_cols].sum(axis=1)
            df_per_cust['TOT_NB_OF_EVENTS_OM'] = df_per_cust[events_om_cols].sum(axis=1)

            if train_or_score == 'train':
                # cleaning - move target variable to end
                cols = list(df_per_cust)
                cols.insert(len(cols), cols.pop(cols.index('TARGET')))
                df_per_cust = df_per_cust.loc[:, cols]
            elif train_or_score == 'score':
                df_per_cust.drop('TARGET', axis=1, inplace=True)

            return df_per_cust

        else:
            print('Error: No customers were passed to the function. Stopping.', file=sys.stderr)
            #sys.exit('Error: No customers were passed to the function. Stopping.')
            return None

    def select_customer_observation_month(self, cym_cnt,  train_or_score):
        # Creates the observation month for each customer and the target variable.
        # Also works out first month in observation window and end of forecast horizon period
        # First creates a target month for each customer, TARGET_MONTH, the month that the life event occurs
        # for the customer. If a customer hasn't experienced a life event, they are given a random TARGET_MONTH,
        # which is a month between the first event and last event month for that customer
        # For scoring, the observation month is the 'scoring_end_date' variable

        observation_window = self.observation_window
        forecast_horizon = self.forecast_horizon
        latency_start = self.latency_start

        print('Using observation_window = ' + str(observation_window))
        print('Using forecast_horizon = ' + str(forecast_horizon))

        # create df with just customer_id, year_month and target(as created in prepare_single_event_type function)
        # df_cust_month_target was cym_tgt in original code
        df_cust_month_target = cym_cnt[['CUSTOMER_ID', 'YEAR_MONTH', 'TARGET']]

        if train_or_score == 'train':
            # Directly from codebase:
            ######################################
            # Definition of the TARGET_MONTH:
            # occurrence of target over historical months
            # C1: 0 ------------------------- 0
            # C2: ----------------- 0 1 0 -----
            # C3: ------- 0 1 0 --- 0 1 0 -----
            #
            # C1 has no occurrence of the event,
            # C2 has exactly one
            # C3 had two occurrences

            # C1. For customers who didn't experience a life event give them a random TARGET_MONTH
            # within the period that they were active
            # Note this can select a target_month that is the first month a customer is seen
            # that customer will have no historical event data and will be removed later in the code
            df_negatives = df_cust_month_target.groupby(['CUSTOMER_ID']).agg({'YEAR_MONTH':['min', 'max'], 'TARGET': 'sum'}).reset_index()
            df_negatives.columns = df_negatives.columns.get_level_values(1)
            df_negatives.columns = ['CUSTOMER_ID', 'CUST_YM_MIN', 'CUST_YM_MAX', 'CUST_TARGET']
            df_negatives = df_negatives[df_negatives['CUST_TARGET']==0]
            df_negatives['RAND1'] = np.random.rand(df_negatives.shape[0])
            df_negatives['TARGET_MONTH'] = df_negatives.apply(lambda x: self.udf_rand_date_in_range(int(x['CUST_YM_MIN']), int(x['CUST_YM_MAX']), x['RAND1']), axis=1)
            df_negatives = df_negatives[['CUSTOMER_ID', 'TARGET_MONTH']]
            df_negatives['TARGET'] = 0

            # C2. Customers who experienced exactly 1 life event
            # assign a specified % as positive examples
            # should this % parameter be configurable?
            # the remaining records are set to negative, selecting a random month at least norepeat_months after the event

            perc_positive_cutoff = self.perc_positive_cutoff

            # Create a df with target month for each customer and a random number column which is used
            # to specify if the record should be used as a positive or negative example
            df_cust_target_one_occ = df_cust_month_target[df_cust_month_target['TARGET']>0].groupby('CUSTOMER_ID').agg({'YEAR_MONTH':'min', 'TARGET': 'sum'}).reset_index()
            df_cust_target_one_occ = df_cust_target_one_occ[df_cust_target_one_occ['TARGET']==1]
            df_cust_target_one_occ.rename(columns={'YEAR_MONTH':'TARGET_MONTH', 'TARGET':'TARGET_COUNT'}, inplace=True)
            df_cust_target_one_occ['RAND1'] = np.random.rand(df_cust_target_one_occ.shape[0])

            # Take all records with random number less than cutoff as positive examples
            df_cust_target_one_occ_pos = df_cust_target_one_occ[df_cust_target_one_occ['RAND1']<=perc_positive_cutoff][['CUSTOMER_ID', 'TARGET_MONTH']]
            df_cust_target_one_occ_pos['TARGET'] = 1

            # All records greater than the cutoff are negative examples
            # For each customer, find the starting point for their TARGET_MONTH
            # This has to be between norepeat_months after the event and the date of their last event
            # if the cutoff is set to 1 it means that we don't set any of these records to 0

            if perc_positive_cutoff < 1.0:
                df_cust_target_one_occ_neg = df_cust_target_one_occ[df_cust_target_one_occ['RAND1']>perc_positive_cutoff][['CUSTOMER_ID', 'TARGET_MONTH']]
                # specify a new (temp) TARGET_MONTH that is norepeat_months after the event occurred
                df_cust_target_one_occ_neg['TARGET_MONTH'] = df_cust_target_one_occ_neg.apply(lambda x: self.udf_add_months(int(x['TARGET_MONTH']), self.norepeat_months), axis=1)
                # Select a random month between the new TARGET_MONTH and the last time the customer is seen
                # join back to df_cust_month_target which has a record for every customer and month
                df_cust_target_one_occ_neg = df_cust_target_one_occ_neg.merge(df_cust_month_target, on='CUSTOMER_ID', how='inner')
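                # filter to include only months >= the new target month
                # (the filtered result must be assigned back, otherwise the filter has no effect)
                df_cust_target_one_occ_neg = df_cust_target_one_occ_neg[df_cust_target_one_occ_neg['YEAR_MONTH'].astype(int)>=df_cust_target_one_occ_neg['TARGET_MONTH'].astype(int)]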
                # I changed this, original called for a random month between first event after new target_month and last event
                # I'm changing to say the customer can have a random target month between new target month and last event
                df_cust_target_one_occ_neg = df_cust_target_one_occ_neg.groupby(['CUSTOMER_ID', 'TARGET_MONTH'])['YEAR_MONTH'].max().reset_index()
                df_cust_target_one_occ_neg.rename(columns={'TARGET_MONTH':'CUST_YM_MIN', 'YEAR_MONTH':'CUST_YM_MAX'}, inplace=True)
                df_cust_target_one_occ_neg['RAND2'] = np.random.rand(df_cust_target_one_occ_neg.shape[0])
                # Call the function to select a random target month between TARGET_MONTH and last event month
                df_cust_target_one_occ_neg['TARGET_MONTH'] = df_cust_target_one_occ_neg.apply(lambda x: self.udf_rand_date_in_range(int(x['CUST_YM_MIN']), int(x['CUST_YM_MAX']), x['RAND2']), axis=1)
                # select relevant columns and set the target value to 0
                df_cust_target_one_occ_neg = df_cust_target_one_occ_neg[['CUSTOMER_ID', 'TARGET_MONTH']]
                df_cust_target_one_occ_neg['TARGET'] = 0

            # C3. Customers who experienced the life event multiple times
            # We just take the first time they experienced the event as the TARGET_MONTH

            # filter to only include months where target=1
            df_cust_target_multi_occ = df_cust_month_target[df_cust_month_target['TARGET']>=1]
            # Get the min of YEAR_MONTH per customer to find the month of the first life event occurrence
            # Sum up the target so we can filter to include only customers who have had multiple life events
            df_cust_target_multi_occ = df_cust_target_multi_occ.groupby('CUSTOMER_ID').agg({'YEAR_MONTH':'min', 'TARGET':'sum'}).reset_index()
            df_cust_target_multi_occ.rename(columns={'YEAR_MONTH':'TARGET_MONTH'}, inplace=True)
            df_cust_target_multi_occ = df_cust_target_multi_occ[df_cust_target_multi_occ['TARGET']>1]
            # Filter to only include columns we need and set target to 1
            df_cust_target_multi_occ = df_cust_target_multi_occ[['CUSTOMER_ID', 'TARGET_MONTH']]
            df_cust_target_multi_occ['TARGET'] = 1

            print('Training data has #Target> 1 customers: ' + str(df_cust_target_multi_occ.shape[0]))
            print('Training data has #Target==1 customers: ' + str(df_cust_target_one_occ.shape[0]))
            print('   Of those, we set ' + str(df_cust_target_one_occ_pos.shape[0]) + ' to positive')
            if perc_positive_cutoff < 1.0:
                print('   and we set ' + str(df_cust_target_one_occ_neg.shape[0]) + ' to negative')
            else:
                print('   and we set 0 to negative')
            print('Training data has #Target==0 customers: ' + str(df_negatives.shape[0]))

            # this was cdates variable in original
            if perc_positive_cutoff < 1.0:
                df_cust_target = pd.concat([df_negatives, df_cust_target_one_occ_neg, df_cust_target_one_occ_pos, df_cust_target_multi_occ])
            else:
                df_cust_target = pd.concat([df_negatives, df_cust_target_one_occ_pos, df_cust_target_multi_occ])
            print('Number of records : ' + str(df_cust_target.shape[0]))
            print('Number of unique customers : ' + str(df_cust_target['CUSTOMER_ID'].nunique()))

            if df_cust_target['CUSTOMER_ID'].nunique() != df_cust_target.shape[0]:
                print('Something went wrong. We have more than 1 row per customer')

            # I haven't included hidden feature about boosting minority class

            # select the observation month for each customer
            # the month must be within the forecast_horizon of the target_month
            # For example, if the life event occurred in 201809, the observation month must be between
            # 201806 and 201809 if the forecast_horizon is 3 months
            df_cust_target['RAND2'] = np.random.rand(df_cust_target.shape[0])
            df_cust_target['OBS_MONTH'] = df_cust_target.apply(lambda x: self.udf_sub_rand_latency(int(x['TARGET_MONTH']), x['RAND2'], latency_start, forecast_horizon), axis=1)

            # Now that we have the observation month we get the first month of our observation window (OBS_MONTH_MIN_OW)
            # Events which occurred over the observation window will be used as variables in AMT
            # Note that the observation month is included in the observation window
            df_cust_target['OBS_MONTH_MIN_OW'] = df_cust_target.apply(lambda x: self.udf_add_months(int(x['OBS_MONTH']), (1-observation_window)), axis=1)

            # We also calculate the end month in the forecasting period (OBS_MONTH_PLS_LATEND)
            df_cust_target['OBS_MONTH_PLS_LATEND'] = df_cust_target.apply(lambda x: self.udf_add_months(int(x['OBS_MONTH']), forecast_horizon), axis=1)

            df_cust_target.drop('RAND2', axis=1, inplace=True)

            # set the months to ints instead of objects
            df_cust_target['OBS_MONTH'] = df_cust_target['OBS_MONTH'].astype(int)
            df_cust_target['TARGET_MONTH'] = df_cust_target['TARGET_MONTH'].astype(int)

        elif train_or_score =='score':

            df_cust_target = df_cust_month_target.groupby('CUSTOMER_ID')['TARGET'].max().reset_index()
            df_cust_target['OBS_MONTH'] = pd.to_datetime(self.scoring_end_date.date())
            df_cust_target['OBS_MONTH'] = df_cust_target['OBS_MONTH'].dt.strftime('%Y%m').astype(int)
            df_cust_target['OBS_MONTH_MIN_OW'] = df_cust_target.apply(lambda x: self.udf_add_months(int(x['OBS_MONTH']), (1-observation_window)), axis=1)
            df_cust_target['OBS_MONTH_PLS_LATEND'] = df_cust_target.apply(lambda x: self.udf_add_months(int(x['OBS_MONTH']), forecast_horizon), axis=1)

            # for those customers who have experienced the life event, we want to know when they last experienced it
            df_month_last_lfe_event = df_cust_month_target[df_cust_month_target['TARGET']>0].groupby('CUSTOMER_ID')['YEAR_MONTH'].max().reset_index()
            df_month_last_lfe_event.rename(columns={'YEAR_MONTH':'TARGET_MONTH'}, inplace=True)
            df_cust_target = df_cust_target.merge(df_month_last_lfe_event, on='CUSTOMER_ID', how='left')
            df_cust_target['TARGET_MONTH'] = df_cust_target['TARGET_MONTH'].fillna(0)
            df_cust_target['TARGET_MONTH'] = df_cust_target['TARGET_MONTH'].astype(int)

        return df_cust_target

    def prep_data(self, df_raw, train_or_score):
        np.random.seed(42)
        # just in case any caps are used
        train_or_score = train_or_score.lower()
        # hidden inputs
        latency_start = self.latency_start

        print('Before removing dates that are not in training period : ' + str(df_raw.shape))
        # remove any dates that are not in our training period
        if train_or_score == 'train':
            df_raw = df_raw[(df_raw['EVENT_DATE']>=datetime.datetime.strptime(self.training_start_date, '%Y-%m-%d'))
                    & (df_raw['EVENT_DATE']<=datetime.datetime.strptime(self.training_end_date, '%Y-%m-%d'))]
        else:
            # otherwise use same start period but all data to end of scoring period
            df_raw = df_raw[(df_raw['EVENT_DATE']>=datetime.datetime.strptime(self.training_start_date, '%Y-%m-%d'))
                    & (df_raw['EVENT_DATE']<=self.scoring_end_date)]
        print('After removing dates that are not in training period : ' + str(df_raw.shape) + '\n')

        # create a df with 1 record per customer, get date of first and last event
        # filter to include only those who have enough months of data
        # enough months = (observation + forecast) for training data
        # enough months = observation window for scoring data
        # For scoring, if we haven't seen the customer in the observation window, we filter them out here

        print('Number of customers before checking for enough history : ' + str(df_raw['CUSTOMER_ID'].nunique()))

        # prevent the code from going further if all customers have been filtered out
        if df_raw['CUSTOMER_ID'].nunique() == 0:
          print('Customer had no events in the effective date period', file=sys.stderr)

        else:
          if train_or_score == 'train':
              n_months = self.forecast_horizon + self.observation_window

              customers_with_enough_history = df_raw.groupby('CUSTOMER_ID')['EVENT_DATE'].agg(['max', 'min']).reset_index()
              customers_with_enough_history.columns = ['CUSTOMER_ID', 'MAX_DATE', 'MIN_DATE']

              # Convert to yyyymm and add new column for number of months
              # filter to exclude customers who don't have enough months of data
              customers_with_enough_history['MAX_DATE'] = customers_with_enough_history['MAX_DATE'].dt.strftime('%Y%m').astype(int)
              customers_with_enough_history['MIN_DATE'] = customers_with_enough_history['MIN_DATE'].dt.strftime('%Y%m').astype(int)
              customers_with_enough_history['N_MONTHS'] = customers_with_enough_history.apply(lambda x: self.udf_n_months(x['MIN_DATE'], x['MAX_DATE']), axis=1)
              customers_with_enough_history = customers_with_enough_history[customers_with_enough_history['N_MONTHS']>n_months]

          elif train_or_score == 'score':
              n_months = self.observation_window

              customers_with_enough_history = df_raw.groupby('CUSTOMER_ID')['EVENT_DATE'].max().reset_index()
              customers_with_enough_history.columns = ['CUSTOMER_ID', 'MAX_DATE']
              # add a new column for effective date
              customers_with_enough_history['EFF_DATE_LATEST'] = pd.to_datetime(self.scoring_end_date.date())
              # Convert to yyyymm and add new column for number of months
              # filter to exclude customers who haven't had an event in the observation periods
              customers_with_enough_history['MAX_DATE'] = customers_with_enough_history['MAX_DATE'].dt.strftime('%Y%m').astype(int)
              customers_with_enough_history['EFF_DATE_LATEST'] = customers_with_enough_history['EFF_DATE_LATEST'].dt.strftime('%Y%m').astype(int)
              customers_with_enough_history['N_MONTHS'] = customers_with_enough_history.apply(lambda x: self.udf_n_months(x['MAX_DATE'], x['EFF_DATE_LATEST']), axis=1)
              customers_with_enough_history = customers_with_enough_history[customers_with_enough_history['N_MONTHS']<=n_months]
              if customers_with_enough_history.shape[0] == 0:
                  print('Note: No customer for scoring had any event within the observation window and all have been filtered out', file=sys.stderr)

          print('Number of customers after  checking for enough history : ' + str(customers_with_enough_history.shape[0]) + '\n')

          df_events = df_raw.merge(customers_with_enough_history, on='CUSTOMER_ID', how='inner')
          print('Total number of events in the data : ' + str(df_events.shape[0]) + '\n')
          # get a list of distinct events
          events = list(df_events['EVENT_TYPE_ID'].unique())

          # get a count of the number of occurrences of each event by customer and month (yyyymm)
          # pivot to give one record per customer and month (yyyymm) with each event having a column
          df_events['YEAR_MONTH'] = df_events['EVENT_DATE'].dt.strftime('%Y%m').astype(int)
          df_events = df_events.groupby(['CUSTOMER_ID', 'YEAR_MONTH', 'EVENT_TYPE_ID']).size().reset_index()
          df_events.rename(columns={0:'count'}, inplace=True)

          cym_cnt = pd.pivot_table(df_events, index=['CUSTOMER_ID', 'YEAR_MONTH'], columns='EVENT_TYPE_ID', values='count').reset_index()
          cym_cnt.fillna(0, inplace=True)

          if cym_cnt.shape[0] == 0:
              print('Note: All customers were filtered out\n', file=sys.stderr)

          # check to make sure target events are in the events table
          # any target event that doesn't appear in the events table is removed
          # This should only be carried out for training
          if train_or_score == 'train':
              # iterate over a copy because items are removed from the list inside the loop
              for target_event in list(self.target_event_type_ids):
                  if target_event not in events:
                      self.target_event_type_ids.remove(target_event)
                      print(target_event + ' does not appear in events table and has been removed')

              if len(self.target_event_type_ids) == 0:
                  print('Note: event_type_ids from target_event_type_ids not found in events table', file=sys.stderr)

              # if there are less than the threshold number of customers associated with the target event, remove the event
              # get a count of number of unique customers associated with each target event
              # any below the threshold are removed from the target_event_type_ids list
              df_target_cust_count = pd.DataFrame(df_events[df_events['EVENT_TYPE_ID'].isin(self.target_event_type_ids)].groupby('EVENT_TYPE_ID')['CUSTOMER_ID'].nunique().reset_index())
              df_target_cust_count.rename(columns={'CUSTOMER_ID':'customer_count'}, inplace=True)
              events_below_threshold = list(df_target_cust_count[df_target_cust_count['customer_count']<self.life_event_minimum_target_count]['EVENT_TYPE_ID'])
              self.target_event_type_ids = [x for x in self.target_event_type_ids if x not in events_below_threshold]
              print('\n' + str(len(self.target_event_type_ids)) + ' Target ID(s) left after removing target events below threshold (' + str(self.life_event_minimum_target_count) + ' customers)')

          # rename event columns to include 'E_'
          for e in events:
              cym_cnt.rename(columns={e:'E_' + e}, inplace=True)

        result_map = {}
        for event_type_id in self.target_event_type_ids:
            if df_raw['CUSTOMER_ID'].nunique() > 0:
              #Call the prepare_single_event_type function
              result_map[event_type_id] = self.prepare_single_event_type(event_type_id, events, cym_cnt, train_or_score)
            else:
              result_map[event_type_id] = None

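        # only continue when at least one target event was prepared and none of the results is empty
        if result_map and all(df is not None for df in result_map.values()):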
          # store the columns names that are used as input into each model
          training_cols = {}
          for event_type_id in self.target_event_type_ids:
              # prep training data, remove columns where nulls make up over 10%
              # drop constant columns (eg all 0's)

              # drop obs_month column
              result_map[event_type_id].drop('OBS_MONTH', axis=1, inplace=True)

              if train_or_score == 'train':
                  columns_required = ['CUSTOMER_ID', 'TARGET', 'MONTH']
                  numeric_cols = []
                  for col in result_map[event_type_id].columns:
                      if is_numeric_dtype(result_map[event_type_id][col].dtype):
                          numeric_cols.append(col)

                  numeric_cols = set(numeric_cols) - set(columns_required)
                  print(result_map[event_type_id].shape)
                  # loop through columns and check for constants or missing vals
                  for col in numeric_cols:
                      # drop cols where min=max ie constants
                      curr_col = result_map[event_type_id][col]
                      if curr_col.min() == curr_col.max():
                          result_map[event_type_id].drop(col, axis=1, inplace=True)
                      # drop the column if more than 10% of its values are null
                      elif (curr_col.isna().sum()/curr_col.shape[0]) > 0.1:
                          result_map[event_type_id].drop(col, axis=1, inplace=True)

              for col in self.cols_to_drop:
                  result_map[event_type_id].drop(col, axis=1, inplace=True)

              training_cols[event_type_id] = list(result_map[event_type_id].columns)

          # if training, use json to save out the columns that were used for training
          if train_or_score == 'train':
              self.user_inputs_dict['cols_used_for_training'] = training_cols

              # save the user inputs and the columns used for building models
              with open('training_user_inputs_and_prepped_column_names.json', 'w') as f:
                  json.dump(self.user_inputs_dict, f)

        return result_map

Writing life_event_prep.py
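
The preparation code above positions each customer's observation window and forecast period using integer `yyyymm` months. The sketch below illustrates that arithmetic. `add_months` is a hypothetical stand-in for the class's `udf_add_months` helper, whose actual implementation is not shown in this section, so treat the function body as an assumption.

```python
def add_months(yyyymm: int, n: int) -> int:
    """Shift an integer yyyymm month by n months (n may be negative).

    Hypothetical stand-in for udf_add_months; the real helper may differ.
    """
    year, month = divmod(yyyymm, 100)
    total = year * 12 + (month - 1) + n
    return (total // 12) * 100 + (total % 12) + 1


observation_window = 4
forecast_horizon = 3
obs_month = 201806

# first month of the observation window (the observation month itself is included)
obs_month_min_ow = add_months(obs_month, 1 - observation_window)
# last month of the forecast period
obs_month_pls_latend = add_months(obs_month, forecast_horizon)

print(obs_month_min_ow, obs_month, obs_month_pls_latend)
```

With these settings the observation window covers 201803 to 201806 and the forecast period ends in 201809, which matches the example given in the code comments.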


In [10]:
%%writefile prep_census_data.py

# Copyright 2017, 2018 IBM. IPLA licensed Sample Materials.
# this function is called if the user selects to use the supplied census data (b_use_census_data variable)
# it reads in the census and customer data
# the census and customer data are matched based on age, marital status, education, employment status,
# income, location and gender
# the function does cleaning to align category names in each column between customer and census datasets
# returns the prepped dataset along with marriage, migration, birth and divorce probabilities from census data
import pandas as pd
import os 
import numpy as np

class census_data():
  def prep_census(self,census_data,customer_data,prepped_data,train_or_score):

      # read in the census data and the customer data
      df_census_probabilities = census_data.copy()#pd.read_csv('/project_data/data_asset/Census Migration Birth Marriage and Divorce Probabilities.csv')
      
      df_customers = customer_data.copy()#pd.read_csv('/project_data/data_asset/customer.csv')

      # to join the census data to customer data we map our customer categories to their most similar category in the census data 
  
      # age ranges in customer data: 23 to 30, 30 to 40, 40 to 55, 55 to 65, 65 and over   
      # age ranges in census data: 18-24, 25-29, 30-34, 35-39, 40-44, 45-54, 55-64, 65-74, 75+, unknown
      # update the census categories to be the same as in the customer data
      age_dict = {'18-24':'18-24', '25-29':'23 to 30', '30-34':'30 to 40', '35-39':'30 to 40', '40-44':'40 to 55', '45-54':'40 to 55', '55-64':'55 to 65', '65-74':'65 and over', '75+':'65 and over', 'Unknown':'Unknown'}
      df_census_probabilities['AGE'] = df_census_probabilities['AGE'].map(age_dict)
  
      # marital status in customer data: Married, Divorced, Single
      # marital status in census data: Married, Divorced or Separated, Single, Widowed, Unknown
      # update census 'Divorced or Separated' category to 'Divorced'
      # All other categories can remain the same
      df_census_probabilities['MARITAL_STATUS'] = df_census_probabilities['MARITAL_STATUS'].replace({'Divorced or Separated':'Divorced'})
  
      # education in customer data: High School, College, Professional, University, PhD
      # education in census data: Grade 11 or Lower, High School, University, Professional Degree, Doctorate Degree, Unknown
      # update 'Professional Degree' category in census data to 'Professional' 
      # update 'Doctorate Degree' category in census data to 'PhD'
      df_census_probabilities['EDUCATION'] = df_census_probabilities['EDUCATION'].replace({'Professional Degree':'Professional',
                                                                                    'Doctorate Degree':'PhD'})
      # update the 'College' category in customer data to 'University'
      df_customers['EDUCATION_LEVEL'] = df_customers['EDUCATION_LEVEL'].replace({'College':'University'})
  
      # employment status in customer data: Employed, Selfemployed, Homemaker, Retired, Unemployed
      # employment status in census data: Employed, 'Not in Labor Force', Unemployed, Unknown
      # update selfemployed category in customer data to employed
      # update homemaker category in customer data to Not in Labor Force
      # update retired category in customer data to Not in Labor Force
      df_customers['EMPLOYMENT_STATUS'] = df_customers['EMPLOYMENT_STATUS'].replace({'Selfemployed':'Employed'})
      df_customers['EMPLOYMENT_STATUS'] = df_customers['EMPLOYMENT_STATUS'].replace({'Homemaker':'Not in Labor Force'})
      df_customers['EMPLOYMENT_STATUS'] = df_customers['EMPLOYMENT_STATUS'].replace({'Retired':'Not in Labor Force'})
  
      # income in customer data is numerical
      # income in census data: Under 15k, 15k-35k, 35k-75k, 75k-125k, 125k-200k, 200K+, Unknown,
      # bin income up in the customer data
      bins = [0, 15000, 35000, 75000, 125000, 200000, 9999999999]
      labels = ['Under 15k', '15k-35k', '35k-75k', '75k-125k', '125k-200k', '200K+']
      df_customers['ANNUAL_INCOME'] = pd.cut(df_customers['ANNUAL_INCOME'], bins, labels=labels)
  
      # states should match between customer and census (where customer data is in USA)
      # set everything else in customer data to 'Unknown' to align with census
      df_customers['LOCATION'] = df_customers['ADDRESS_HOME_STATE']
      df_customers.loc[(~df_customers['ADDRESS_HOME_STATE'].isin(df_census_probabilities['LOCATION'].unique())), 'LOCATION'] = 'Unknown'
  
      # gender ranges in our customer data are the same as census (ex 'unknown')
  
      # because of how we grouped above, we can have duplicate records over location, age, marital status, education, employment,
      # gender and income, but with different probabilities
      # to combat this we group by these factors and take an average of the probabilities
      df_census_probabilities = df_census_probabilities.groupby(['LOCATION', 'MARITAL_STATUS', 'EDUCATION', 'GENDER', 'EMPLOYMENT',
             'INCOME', 'AGE'])[['MIGRATION_PROB', 'BIRTH_PROB', 'MARRIAGE_PROB', 'DIVORCE_PROB']].mean().reset_index()
      
      
      ##########  Code to plot census Data on training notebook 
      if train_or_score=='train':

        cols_to_plot=["LOCATION","MARITAL_STATUS","EDUCATION","GENDER","EMPLOYMENT","INCOME","AGE"]
        for col in cols_to_plot:
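          # average only the probability columns; the remaining columns are categorical
          df_to_plot=df_census_probabilities.groupby(col)[['MIGRATION_PROB', 'BIRTH_PROB', 'MARRIAGE_PROB', 'DIVORCE_PROB']].mean().reset_index()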
      
          #df_to_plot["BIRTH_PROB"]=df_to_plot["BIRTH_PROB"]*10
          #df_to_plot["MARRIAGE_PROB"]=df_to_plot["MARRIAGE_PROB"]*10
          #df_to_plot["DIVORCE_PROB"]=df_to_plot["DIVORCE_PROB"]*10
          self.plot_census(df_to_plot,col)
        
        
      # join customer and census data on all 7 fields to get the probabilities from census
      # first filter the customer data to return one record per customer
      df_customers = pd.merge(df_customers, df_customers.groupby('CUSTOMER_ID')['EFFECTIVE_DATE'].max().reset_index(), how='inner', on=['CUSTOMER_ID', 'EFFECTIVE_DATE'])
  
      # get the records that we can match on all 7 criteria
      # mapping above should ensure that all customer records will get a match so we can use an inner join
      df_census_probabilities = pd.merge(df_customers, df_census_probabilities, how='inner', left_on=['AGE_RANGE', 'MARITAL_STATUS', 
                                                                      'EDUCATION_LEVEL', 'EMPLOYMENT_STATUS',
                                                                      'LOCATION', 'ANNUAL_INCOME', 'GENDER'],
                                                          right_on=['AGE', 'MARITAL_STATUS', 
                                                                      'EDUCATION', 'EMPLOYMENT',
                                                                      'LOCATION', 'INCOME', 'GENDER'])
  
      df_census_probabilities = df_census_probabilities[['CUSTOMER_ID', 'MIGRATION_PROB', 'BIRTH_PROB', 'MARRIAGE_PROB', 'DIVORCE_PROB']]
      
      # replace any missing values with the mean for that column
      df_census_probabilities = df_census_probabilities.fillna(df_census_probabilities.mean(numeric_only=True))
      
      # loop through the dictionary of prepped data and append the probabilities to the prepped data
      for event_type, df in prepped_data.items():
          prepped_data[event_type] = pd.merge(prepped_data[event_type], df_census_probabilities, on='CUSTOMER_ID')
          prepped_data[event_type].drop('CUSTOMER_ID', axis=1, inplace=True)
  
      return prepped_data
  
  def plot_census(self,s,column):
    
    # plotly's offline mode draws the charts inline, so chart_studio is not required
    import plotly as py
    import plotly.offline
    import plotly.graph_objs as go
    py.offline.init_notebook_mode(connected=True)

    data = [
        go.Bar(
            x=s[column],
            y=s["MIGRATION_PROB"],
            name="MIGRATION_PROB"
        ),
        go.Bar(
            x=s[column],
            y=s["BIRTH_PROB"],
            name="BIRTH_PROB"
        ),
        go.Bar(
            x=s[column],
            y=s["MARRIAGE_PROB"],
            name="MARRIAGE_PROB"
        ),
        go.Bar(
            x=s[column],
            y=s["DIVORCE_PROB"],
            name="DIVORCE_PROB"
        )

    ]
    
    layout = go.Layout(
        barmode='group',
        title='Average Census data Probabilities by '+column
    )

    fig = dict(data = data, layout = layout)
    py.offline.iplot(fig)

Overwriting prep_census_data.py
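
The category alignment performed in `prep_census` can be illustrated on a toy example. The sketch below uses only two of the seven matching fields (age and income), and all values in the two small frames are made up for illustration. It maps the census age bands onto the customer bands, bins the numeric income, averages census rows that become duplicates after mapping, and then joins.

```python
import pandas as pd

# made-up census and customer rows, for illustration only
census = pd.DataFrame({
    'AGE': ['30-34', '35-39', '65-74'],
    'INCOME': ['35k-75k', '35k-75k', 'Under 15k'],
    'MIGRATION_PROB': [0.10, 0.20, 0.05],
})
customers = pd.DataFrame({
    'CUSTOMER_ID': [1, 2],
    'AGE_RANGE': ['30 to 40', '65 and over'],
    'ANNUAL_INCOME': [50000, 12000],
})

# map census age bands onto the customer age bands
age_dict = {'30-34': '30 to 40', '35-39': '30 to 40', '65-74': '65 and over'}
census['AGE'] = census['AGE'].map(age_dict)

# bin the numeric customer income into the census income bands
bins = [0, 15000, 35000, 75000, 125000, 200000, 9999999999]
labels = ['Under 15k', '15k-35k', '35k-75k', '75k-125k', '125k-200k', '200K+']
customers['INCOME'] = pd.cut(customers['ANNUAL_INCOME'], bins, labels=labels).astype(str)

# two census bands now share the '30 to 40' label, so average their probabilities
census = census.groupby(['AGE', 'INCOME'])['MIGRATION_PROB'].mean().reset_index()

matched = customers.merge(census, left_on=['AGE_RANGE', 'INCOME'],
                          right_on=['AGE', 'INCOME'], how='inner')
print(matched[['CUSTOMER_ID', 'MIGRATION_PROB']])
```

Note that `pd.cut` excludes the left edge of the first bin by default, so an income of exactly 0 would not fall into any band in this sketch.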


In [11]:
# Copy files into the Notebook filesystem
files = ['training_user_inputs_and_prepped_column_names.json', 'event.csv', 'census_probabilities.csv', 'customer.csv']
for item in files:
    # write each project asset to the notebook's local filesystem
    with open(item, 'wb') as f:
        f.write(project.get_file(item).getbuffer())
    
if b_use_census_data:
    cols_to_drop=[]
else:
    cols_to_drop=['CUSTOMER_ID']
    
from life_event_prep import LifeEventPrep

lfe_prep = LifeEventPrep(target_event_type_ids=prediction_types,
                         b_use_census_data=b_use_census_data,
                         train_or_score='train',
                         training_start_date="2010-01-01",
                         training_end_date="2017-08-01",
                         forecast_horizon=3,
                         observation_window=4,
                         life_event_minimum_target_count=100,
                         cols_to_drop=cols_to_drop)

# Prepare Home Purchase and Relocation Data
prepped_data = lfe_prep.prep_data(events, 'train')

Before removing dates that are not in training period : (103426, 3)
After removing dates that are not in training period : (45776, 3)

Number of customers before checking for enough history : 960
Number of customers after  checking for enough history : 789

Total number of events in the data : 44849


2 Target ID(s) left after removing target events below threshold (100 customers)

Prepping data for LFE_RELOCATION
Using observation_window = 4
Using forecast_horizon = 3
Training data has #Target> 1 customers: 0
Training data has #Target==1 customers: 575
   Of those, we set 575 to positive
   and we set 0 to negative
Training data has #Target==0 customers: 214
Number of records : 789
Number of unique customers : 789
Number of customers before removing those with no event data in observation window : 789
Number of customers target=1 before above filtering : 575
Number of customers after removing those with no event data in observation window : 489
Number of customers target=1 after above

To use the census data, call the prep function below. It adds the census probabilities to the prepared data and plots the average probabilities by `Location`, `Marital_status`, `Education`, `Gender`, `Employment`, `Income` and `Age`.

In [12]:
census_df = pd.read_csv(project.get_file('census_probabilities.csv'))

customer_data = pd.read_csv(project.get_file('customer.csv'))

# if the user has selected to use the census data call the function to prep the census data and add the probabilities to the prepped data

if b_use_census_data:
    
    from prep_census_data import census_data
    census=census_data()
    prepped_data=census.prep_census(census_df,customer_data,prepped_data,'train')

### Display Prepared Data

The final dataset contains one record per customer, with variables based on counts of the number of events that a customer had within a specified timeframe, the observation window. All event-related variables are prefixed with 'E_'. 

For each event, we get a count of the number of times the customer experienced the event in the observation window. These variables are suffixed with "\_OW". We also create variables for the number of times each customer experienced each event in the observation month (end month in observation window). These variables are suffixed with "\_OM". 

If `b_use_census_data` is set to True, each customer is matched to census data on **Location, Age, Marital Status, Education, Employment Status, Income and Gender**, which adds the `MIGRATION_PROB`, `BIRTH_PROB`, `MARRIAGE_PROB` and `DIVORCE_PROB` columns to the prepped dataset.

The dataset also contains a target variable indicating whether that customer experienced the life event or not within a particular timeframe, the forecast horizon.
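
As a rough illustration of how the `_OW` and `_OM` variables relate to the raw events, the sketch below counts events per customer over an observation window and over the observation month. The event rows are made up, and for simplicity every customer shares the same observation month, whereas the preparation code picks one per customer. `pd.crosstab` is used here as a compact substitute for the group-and-pivot steps in `life_event_prep.py`.

```python
import pandas as pd

# made-up events, one row per event occurrence
events = pd.DataFrame({
    'CUSTOMER_ID': [1, 1, 1, 1, 2],
    'YEAR_MONTH': [201803, 201805, 201806, 201806, 201806],
    'EVENT_TYPE_ID': ['INT_LOGIN_WEB', 'INT_LOGIN_WEB', 'INT_LOGIN_WEB',
                      'XCT_EQ_BUY', 'XCT_EQ_BUY'],
})

# assumed shared observation month and first month of a 4-month window
obs_month, obs_month_min_ow = 201806, 201803

# counts over the whole observation window (both end months included)
in_window = events[events['YEAR_MONTH'].between(obs_month_min_ow, obs_month)]
ow = pd.crosstab(in_window['CUSTOMER_ID'], in_window['EVENT_TYPE_ID'])
ow = ow.add_prefix('E_').add_suffix('_OW')

# counts for the observation month only
in_month = events[events['YEAR_MONTH'] == obs_month]
om = pd.crosstab(in_month['CUSTOMER_ID'], in_month['EVENT_TYPE_ID'])
om = om.add_prefix('E_').add_suffix('_OM')

# one row per customer, with both sets of counts side by side
features = ow.join(om, how='left').fillna(0)
print(features)
```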


In [13]:
for event_type, df in prepped_data.items():
    print('\nTraining Data for '+event_type+':')
    display(df.head())
    print("{} rows, {} columns\n".format(*df.shape))
    


Training Data for LFE_RELOCATION:


Unnamed: 0,E_ACNT_SEC_OPEN_*_OW,E_BIRTHDAY24_OW,E_BIRTHDAY26_OW,E_BIRTHDAY27_OW,E_BIRTHDAY28_OW,E_BIRTHDAY29_OW,E_BIRTHDAY32_OW,E_BIRTHDAY35_OW,E_BIRTHDAY36_OW,E_BIRTHDAY39_OW,E_BIRTHDAY40_OW,E_BIRTHDAY42_OW,E_BIRTHDAY51_OW,E_BIRTHDAY52_OW,E_BIRTHDAY54_OW,E_BIRTHDAY55_OW,E_BIRTHDAY59_OW,E_BIRTHDAY68_OW,E_BIRTHDAY71_OW,E_BIRTHDAY73_OW,E_BIRTHDAY75_OW,E_BIRTHDAY76_OW,E_BIRTHDAY77_OW,E_BIRTHDAY79_OW,E_BIRTHDAY80_OW,E_BIRTHDAY83_OW,E_INT_LOGIN_WEB_OW,E_INT_OTHER_PHYSICAL_OW,E_MENTION_LFE_HOME_PURCHASE_OW,E_XCT_EQ_BUY_OW,E_XCT_EQ_SELL_OW,E_XCT_MORTGAGE_NEW_OW,E_XFER_FUNDS_OUT_LARGE_OW,E_ACNT_SEC_OPEN_*_OM,E_BIRTHDAY24_OM,E_BIRTHDAY27_OM,E_BIRTHDAY28_OM,E_BIRTHDAY39_OM,E_BIRTHDAY55_OM,E_BIRTHDAY68_OM,E_BIRTHDAY73_OM,E_BIRTHDAY77_OM,E_INT_LOGIN_WEB_OM,E_INT_OTHER_PHYSICAL_OM,E_MENTION_LFE_HOME_PURCHASE_OM,E_XCT_EQ_BUY_OM,E_XCT_EQ_SELL_OM,E_XCT_MORTGAGE_NEW_OM,E_XFER_FUNDS_OUT_LARGE_OM,MONTH,TOT_NB_OF_EVENTS_OW,TOT_NB_OF_EVENTS_OM,TARGET
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9,1.0,1.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1,16.0,2.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,8,2.0,1.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,2.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,9,1.0,1.0,1


452 rows, 53 columns


Training Data for LFE_HOME_PURCHASE:


Unnamed: 0,E_ACNT_SEC_OPEN_*_OW,E_BIRTHDAY26_OW,E_BIRTHDAY27_OW,E_BIRTHDAY28_OW,E_BIRTHDAY32_OW,E_BIRTHDAY35_OW,E_BIRTHDAY36_OW,E_BIRTHDAY39_OW,E_BIRTHDAY40_OW,E_BIRTHDAY42_OW,E_BIRTHDAY43_OW,E_BIRTHDAY44_OW,E_BIRTHDAY50_OW,E_BIRTHDAY52_OW,E_BIRTHDAY55_OW,E_BIRTHDAY59_OW,E_BIRTHDAY61_OW,E_BIRTHDAY62_OW,E_BIRTHDAY68_OW,E_BIRTHDAY69_OW,E_BIRTHDAY70_OW,E_BIRTHDAY71_OW,E_BIRTHDAY72_OW,E_BIRTHDAY73_OW,E_BIRTHDAY74_OW,E_BIRTHDAY75_OW,E_BIRTHDAY77_OW,E_BIRTHDAY79_OW,E_BIRTHDAY80_OW,E_BIRTHDAY81_OW,E_BIRTHDAY82_OW,E_INT_LOGIN_WEB_OW,E_INT_OTHER_PHYSICAL_OW,E_LFE_RELOCATION_OW,E_MENTION_LFE_HOME_PURCHASE_OW,E_XCT_EQ_BUY_OW,E_XCT_EQ_SELL_OW,E_XCT_MORTGAGE_NEW_OW,E_XFER_FUNDS_OUT_LARGE_OW,E_ACNT_SEC_OPEN_*_OM,E_BIRTHDAY27_OM,E_BIRTHDAY35_OM,E_BIRTHDAY39_OM,E_BIRTHDAY44_OM,E_BIRTHDAY55_OM,E_BIRTHDAY62_OM,E_BIRTHDAY69_OM,E_BIRTHDAY70_OM,E_BIRTHDAY72_OM,E_BIRTHDAY74_OM,E_BIRTHDAY75_OM,E_BIRTHDAY77_OM,E_BIRTHDAY79_OM,E_BIRTHDAY82_OM,E_INT_LOGIN_WEB_OM,E_INT_OTHER_PHYSICAL_OM,E_LFE_RELOCATION_OM,E_MENTION_LFE_HOME_PURCHASE_OM,E_XCT_EQ_BUY_OM,E_XCT_EQ_SELL_OM,E_XCT_MORTGAGE_NEW_OM,E_XFER_FUNDS_OUT_LARGE_OM,MONTH,TOT_NB_OF_EVENTS_OW,TOT_NB_OF_EVENTS_OM,TARGET
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,4.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,5,15.0,3.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7,1.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,7,18.0,4.0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9,2.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10,1.0,0.0,1


479 rows, 66 columns



### Save Prepared Dictionary as Pickle File
Because the prepped data is a dictionary of DataFrames, we save it as a pickle file so that it can be passed to the model training notebook. The file is named `prepared_data.pkl`.

In [14]:
# save a dictionary to a pickle file on cp4d
def save_dict(obj, name):
    # the with block closes the file automatically
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    print('SUCCESSFULLY SAVED')

save_dict(prepped_data, 'prepared_data')

SUCCESSFULLY SAVED
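
For reference, a minimal sketch of reading such a file back is shown below. `load_dict` is a hypothetical counterpart to `save_dict` and is not necessarily how the training notebook loads the data. The round trip uses a throwaway file in a temporary directory rather than the real `prepared_data.pkl`.

```python
import os
import pickle
import tempfile


def load_dict(name):
    """Read back a dictionary that was written as name + '.pkl' (hypothetical helper)."""
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)


# round trip on a small stand-in dictionary
sample = {'LFE_RELOCATION': [1, 0, 1], 'LFE_HOME_PURCHASE': [0, 0, 1]}
path = os.path.join(tempfile.mkdtemp(), 'prepared_data_demo')
with open(path + '.pkl', 'wb') as f:
    pickle.dump(sample, f, pickle.HIGHEST_PROTOCOL)

restored = load_dict(path)
print(sorted(restored.keys()))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.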


We have now finished preparing the dataset and saved out the prepped data for modelling. See notebook `2-model-training` for the next step.

<hr>

Sample Materials, provided under <a href="https://github.com/IBM/Industry-Accelerators/blob/master/CPD%20SaaS/LICENSE" target="_blank" rel="noopener noreferrer">license.</a> <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2019, 2021. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. <br>