# Industry Accelerators - Financial Markets Customer Segmentation Model

## Data Preprocessing

## Introduction


In this notebook we go through an end-to-end project that loads long-form transactional data and prepares it in a wide format. The summary and demographic information is analyzed at the client level. The model input is a wide-form data structure organized by **`Customer ID`** as the key field; where the input holds multiple rows per client, the most recent record is kept. We will use the class **`CustomerSegmentationPrep`** to prepare the data.
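As a rough illustration of what moving from long form to wide form means, the sketch below pivots a small, made-up transaction table into one row per customer. The column names (`CUSTOMER_ID`, `TRANSACTION_TYPE`, `AMOUNT`) and the values are hypothetical and are not taken from the accelerator's data.

```python
import pandas as pd

# Hypothetical long-form data: one row per customer transaction.
long_df = pd.DataFrame({
    "CUSTOMER_ID": [1000, 1000, 1001, 1001, 1001],
    "TRANSACTION_TYPE": ["BUY", "SELL", "BUY", "BUY", "SELL"],
    "AMOUNT": [250.0, 100.0, 75.0, 25.0, 300.0],
})

# Aggregate to one row per customer, with one column per transaction type.
wide_df = (
    long_df.pivot_table(index="CUSTOMER_ID", columns="TRANSACTION_TYPE",
                        values="AMOUNT", aggfunc="sum", fill_value=0)
    .add_prefix("TOTAL_AMOUNT_")
    .reset_index()
)
wide_df.columns.name = None  # remove the leftover 'TRANSACTION_TYPE' axis label

print(wide_df)
```

Each customer now occupies a single row, which is the shape a clustering algorithm expects.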

Before executing this notebook on IBM Cloud:<br>
1) When you import this project into an IBM Cloud environment, a project access token should be inserted at the top of this notebook as a code cell. <br>
If you do not see that cell, insert a project token: click **More -> Insert project token** in the top-right menu section and run the cell. <br>

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)
2) You can then step through the notebook cell by cell by pressing Shift-Enter, or execute the entire notebook by selecting **Cell -> Run All** from the menu.<br>


In [7]:
try:
    project
except NameError:
    # READING AND WRITING PROJECT ASSETS
    import project_lib
    project = project_lib.Project() 

## Load Customer Segmentation Data

For this project we will be loading the CSV file called **customer_full_summary_latest.csv**. The file is located in the project's `data_assets`. We use the **project_lib** library to fetch and save the files associated with the project.

The easiest way to load in data is to use the <b>Find and Add Data</b> icon in the upper right hand corner. Once selected you will see a sidebar come out with options to load from either Files or Connections.

If you loaded your dataset, such as a CSV file, into a Watson Studio analytics project, select Files and you should be able to find your dataset name. From there you can click <b>Insert to code</b> and choose to insert either a pandas dataframe or a Spark dataframe. Once you make the selection, Python code for reading in your data with either pandas or PySpark is inserted into the notebook cell. Now you're ready to explore and manipulate your dataset. 

In the cell below we import the python libraries that we will use throughout the notebook.

In [8]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 1000)
import json
import importlib
import warnings
import sys
import time
import os
import pickle 
import shutil
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings('ignore')

np.random.seed(0)

## User Inputs and Data Prep


### Data Prep

The **`CustomerSegmentationPrep`** class is used for data preparation.

It generates the dataset that is used for clustering. It takes a wide-form dataset with customer details, filters it to include only the relevant columns, cleans the data and produces a dataframe suitable for clustering. 


### User Inputs

**effective_date :**  This is the date that the segmentation is computed. All input data should be before this date.<br>
**train_or_score :**  Specify whether we are prepping the data for training or scoring. Should always be 'train' in this notebook.<br>

**granularity_key :** Specifies the customer ID column.<br>
**customer_start_date :** Column with the start of the summary month of customer data.<br> 
**customer_end_date :** As above, but last day of the summary month.<br>
**status_attribute :** Column which indicates whether the customer is active or inactive and is used to define churn. Churned customers are removed from the dataset.<br>
**status_flag_active :** The value in the status_attribute column that indicates that the customer is active, in this case 'Active'. Customers whose most recent record has any other value (for example 'Inactive') are treated as churned and removed.<br>
**date_customer_joined :** Specifies the column where the customer join date is recorded. This variable is used to calculate customer tenure.<br>
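To make the tenure calculation concrete, here is a minimal sketch of counting whole months between a join date and the effective date with `dateutil.relativedelta`. The helper name `months_between` and the join dates are illustrative assumptions.

```python
import datetime
from dateutil.relativedelta import relativedelta


def months_between(later, earlier):
    """Return the number of whole months elapsed from `earlier` to `later`."""
    delta = relativedelta(later, earlier)
    return delta.years * 12 + delta.months


effective = datetime.datetime.strptime("2018-09-30", "%Y-%m-%d")

# Illustrative join dates, not taken from the dataset.
print(months_between(effective, datetime.datetime(2018, 1, 2)))    # → 8
print(months_between(effective, datetime.datetime(2017, 8, 28)))   # → 13
```

Partial months are not counted, so a customer who joined 8 months and 28 days before the effective date has a tenure of 8.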

**columns_required :** A list of default columns required, includes ID column and date columns.<br>
**default_attributes :** A list of the variables that we would like to use for the segmentation.<br>
**risk_tolerance_list :** A list of the risk categories for the customer's accounts. 'High', 'Low' etc.<br> 
**investment_objective_list :** A list of the investment objective categories for the customer's accounts. 'Security', 'Income' etc.<br>

The last three user input variables are used for data cleaning.<br>
**std_multiplier :** This variable is used to identify outlier values. This number is multiplied by the variable's standard deviation and added to the variable's mean. Any value at or above this upper bound is defined as an outlier and is capped at the bound.<br>
**max_num_cat_cardinality :** This variable defines the maximum cardinality for categorical variables. Any categorical variable with this many categories or more is removed from the dataset.<br> 
**nulls_threshold :** This threshold is used to identify columns with many null values. Any column with percentage of nulls greater than this threshold will be removed from the dataset.<br>
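The outlier rule can be sketched on a toy column as follows. This is an illustration under stated assumptions: the helper name `cap_upper_outliers` is hypothetical, it uses `Series.clip` rather than the `.loc` assignment in the class, and the multiplier is set to 2 only because a ten-value sample cannot contain a point five standard deviations above its own mean.

```python
import pandas as pd


def cap_upper_outliers(series, multiplier=5):
    """Cap values above mean + multiplier * std at that upper bound."""
    upper_bound = series.mean() + multiplier * series.std()
    return series.clip(upper=upper_bound)


# Hypothetical fee amounts with one extreme value.
fees = pd.Series([10.0] * 9 + [1000.0])

capped = cap_upper_outliers(fees, multiplier=2)
print(capped.max() < fees.max())   # → True
```

Only the upper tail is capped; unusually low values are left unchanged.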

The user can use the default inputs as listed below or can choose their own. The user inputs will be stored and the same inputs will be applied automatically at scoring time. 
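A minimal sketch of how stored inputs could be round-tripped through JSON so that scoring reuses the training settings. The file name `training_data_metadata.json` is the one the prep class writes; the reduced dictionary, the temporary directory and the reload step are assumptions made for illustration.

```python
import json
import os
import tempfile

# A reduced, illustrative set of user inputs.
user_inputs = {
    "granularity_key": "CUSTOMER_CUSTOMER_ID",
    "effective_date": "2018-09-30",
    "std_multiplier": 5,
    "max_num_cat_cardinality": 15,
    "nulls_threshold": 0.1,
}

# Write to a temporary directory so the sketch does not overwrite project files.
metadata_path = os.path.join(tempfile.mkdtemp(), "training_data_metadata.json")
with open(metadata_path, "w") as f:
    json.dump(user_inputs, f)

# At scoring time the same file can be read back and reused.
with open(metadata_path) as f:
    restored_inputs = json.load(f)

print(restored_inputs == user_inputs)   # → True
```

Keeping dates as strings, as the class does for `effective_date`, avoids JSON serialization errors for datetime objects.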


### Data Cleaning

•	Any customer who attrited in the dataset is removed. Only active customers are used for clustering.<br>
•	We take the most recent record for each customer.<br>
•	Any columns in the dataset that have a single constant value are removed.<br>
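•	Any column whose fraction of null values exceeds `nulls_threshold` (10% by default) is removed.<br>
•	Categorical columns with `max_num_cat_cardinality` or more categories are removed.<br>
•	Numerical outliers are capped at the column mean plus `std_multiplier` standard deviations. <br>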
•	Remaining missing values are filled with 'Unknown' for categorical and the average of the column for numerical. 
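The cleaning steps above can be sketched on a small, made-up dataframe. All column names and values here are hypothetical, and the null threshold is raised to 0.5 so that the tiny sample keeps some columns to fill. The sketch uses `drop_duplicates(keep="last")` to pick the latest record, whereas the class uses `groupby(...).last()`, which additionally takes the last non-null value per column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

raw = pd.DataFrame({
    "CUSTOMER_ID": [1, 1, 2, 2, 3, 4, 5],
    "END_DATE": pd.to_datetime(["2018-07-31", "2018-08-31", "2018-07-31",
                                "2018-08-31", "2018-08-31", "2018-08-31",
                                "2018-08-31"]),
    "STATUS": ["Active", "Active", "Active", "Inactive", "Active", "Active", "Active"],
    "INCOME": [50000.0, 52000.0, 80000.0, 80000.0, np.nan, 60000.0, 70000.0],
    "GENDER": ["Female", "Female", "Male", "Male", None, "Male", "Female"],
    "NOTES": [None, None, None, None, "x", None, None],
    "REGION": ["EU"] * 7,
})
nulls_threshold = 0.5  # illustrative; the notebook default is 0.1

# 1) Keep the most recent record per customer, then only active customers.
latest = raw.sort_values(["CUSTOMER_ID", "END_DATE"]).drop_duplicates("CUSTOMER_ID", keep="last")
cleaned = latest[latest["STATUS"] == "Active"].copy()

# 2) Drop columns whose fraction of nulls exceeds the threshold.
null_fraction = cleaned.isna().mean()
cleaned = cleaned.drop(columns=null_fraction[null_fraction > nulls_threshold].index)

# 3) Drop columns that hold a single constant value.
constant_cols = [col for col in cleaned.columns if cleaned[col].nunique() == 1]
cleaned = cleaned.drop(columns=constant_cols)

# 4) Fill remaining nulls: column mean for numeric, 'Unknown' otherwise.
for col in cleaned.columns:
    if is_numeric_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
    else:
        cleaned[col] = cleaned[col].fillna("Unknown")

print(cleaned)
```

In this toy example customer 2 is removed as inactive, `NOTES` is dropped for nulls, and `END_DATE`, `STATUS` and `REGION` are dropped as constant.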

In [9]:
%%writefile customer_segmentation_prep.py

import pandas as pd
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
import sys
import json
import os

class CustomerSegmentationPrep():
    
    def __init__(self, train_or_score, 
        granularity_key='CUSTOMER_CUSTOMER_ID',
        customer_start_date='CUSTOMER_SUMMARY_START_DATE',
        customer_end_date='CUSTOMER_SUMMARY_END_DATE',
        status_attribute='CUSTOMER_STATUS',
        status_flag_active='Active',
        date_customer_joined='CUSTOMER_RELATIONSHIP_START_DATE',
        columns_required=['CUSTOMER_CUSTOMER_ID', 'CUSTOMER_STATUS', 'CUSTOMER_SUMMARY_START_DATE', 'CUSTOMER_SUMMARY_END_DATE',
                        'CUSTOMER_EFFECTIVE_DATE',  'CUSTOMER_SYSTEM_LOAD_TIMESTAMP'], 
        default_attributes=['CUSTOMER_GENDER', 'CUSTOMER_AGE_RANGE', 'CUSTOMER_EDUCATION_LEVEL',
                                'CUSTOMER_EMPLOYMENT_STATUS', 'CUSTOMER_MARITAL_STATUS', 'CUSTOMER_NUMBER_OF_DEPENDENT_CHILDREN',
                                'CUSTOMER_URBAN_CODE', 'CUSTOMER_ANNUAL_INCOME', 'CUSTOMER_RELATIONSHIP_START_DATE',
                                'CUSTOMER_SUMMARY_FUNDS_UNDER_MANAGEMENT', 'CUSTOMER_SUMMARY_RETURN_SINCE_INCEPTION',
                                'CUSTOMER_SUMMARY_RETURN_LAST_QUARTER', 'CUSTOMER_SUMMARY_ASSETS',
                                'CUSTOMER_SUMMARY_NUMBER_OF_ACTIVE_ACCOUNTS', 'CUSTOMER_SUMMARY_NUMBER_OF_EMAILS',
                                'CUSTOMER_SUMMARY_NUMBER_OF_LOGINS', 'CUSTOMER_SUMMARY_NUMBER_OF_CALLS',
                                'CUSTOMER_SUMMARY_TOTAL_NUMBER_OF_BUY_TRADES', 'CUSTOMER_SUMMARY_TOTAL_NUMBER_OF_SELL_TRADES',
                                'CUSTOMER_SUMMARY_TOTAL_AMOUNT_OF_ALL_FEES'] , 
        risk_tolerance_list = [], investment_objective_list = [], effective_date = '2018-09-30', std_multiplier=5, max_num_cat_cardinality=10,
        nulls_threshold=0.1):
            
        self.train_or_score = train_or_score
        self.columns_required = columns_required
        self.default_attributes = default_attributes 
        self.granularity_key = granularity_key
        self.date_customer_joined = date_customer_joined 
        self.customer_end_date = customer_end_date  
        self.customer_start_date = customer_start_date 
        self.risk_tolerance_list = risk_tolerance_list
        self.investment_objective_list = investment_objective_list 
        self.effective_date = effective_date
        self.status_attribute = status_attribute
        self.status_flag_active = status_flag_active
        self.std_multiplier = std_multiplier
        self.max_num_cat_cardinality = max_num_cat_cardinality
        self.nulls_threshold = nulls_threshold

        # if effective date is a date convert it to a string for consistency
        if isinstance(self.effective_date, datetime.datetime):
            self.effective_date = datetime.datetime.strftime(self.effective_date, '%Y-%m-%d')
            
        if self.train_or_score == 'train':
            # create a dictionary with all values for user inputs. We will save this out and use it for scoring
            # to ensure that the user inputs are consistent across train and score notebooks
            # exclude variables that won't be used for scoring
            self.user_inputs_dict = { 'columns_required' : columns_required, 'default_attributes' : default_attributes,
                'granularity_key' : granularity_key, 'date_customer_joined' : date_customer_joined, 
                'customer_end_date' : customer_end_date, 'customer_start_date' : customer_start_date,
                'risk_tolerance_list' : risk_tolerance_list, 'investment_objective_list' : investment_objective_list,
                'effective_date' : effective_date, 'status_attribute' : status_attribute,
                'status_flag_active' : status_flag_active, 'std_multiplier' : std_multiplier,
                'max_num_cat_cardinality' : max_num_cat_cardinality, 'nulls_threshold' : nulls_threshold }

    # function to get the difference between 2 dates returned in months
    def udf_n_months(self, date1, date2):
        month_dif = (relativedelta(date1, date2).months + 
                relativedelta(date1, date2).years * 12)
        return month_dif

    # function to fill in any missing data for customer join date
    # if only some records are missing for the customer and we have the join date in other records use that
    # Otherwise, use the earliest customer summary start date
    def fill_date_customer_joined(self, df):
        nb_cust_date_customer_joined_filled = df[df[self.date_customer_joined].isnull()][self.granularity_key].nunique()

        if nb_cust_date_customer_joined_filled > 0:
            print('Filling date_customer_joined for ' + str(nb_cust_date_customer_joined_filled) + ' customers')
            # get a list of the customers who are missing start dates
            cust_date_cust_joined_missing = list(df[df[self.date_customer_joined].isnull()][self.granularity_key].unique())

            # check to see if any of the start date records for the customer are filled in
            # use this if it's available
            df_new_start_date = df[df[self.granularity_key].isin(cust_date_cust_joined_missing)].groupby(self.granularity_key)[self.date_customer_joined].min().reset_index()
            df_new_start_date = df_new_start_date[df_new_start_date[self.date_customer_joined].notnull()]
            df_new_start_date.rename(columns={self.date_customer_joined: 'MIN_START_DATE'}, inplace=True)
            if df_new_start_date.shape[0] > 0:
                df = df.merge(df_new_start_date, on=self.granularity_key, how='left')
                df[self.date_customer_joined].fillna(df['MIN_START_DATE'], inplace=True)
                # since these customers are not now missing start dates, remove them from the list
                cust_date_cust_joined_missing = list(set(cust_date_cust_joined_missing) - set(df_new_start_date[self.granularity_key].unique()))
                # drop the min_start_date var
                df.drop('MIN_START_DATE', axis=1, inplace=True)

            if len(cust_date_cust_joined_missing) > 0:
                # get the earliest customer summary start date for each customer who is missing a start date
                df_new_start_date = df[df[self.granularity_key].isin(cust_date_cust_joined_missing)].groupby(self.granularity_key)[self.customer_start_date].min().reset_index()
                df_new_start_date.rename(columns={self.customer_start_date:'NEW_START_DATE'}, inplace=True)
                # join back to original df and update 
                df = df.merge(df_new_start_date, on=self.granularity_key, how='left')
                df[self.date_customer_joined].fillna(df['NEW_START_DATE'], inplace=True)
                df.drop('NEW_START_DATE', axis=1, inplace=True)

        return df

    # this function returns a list of dynamic attributes from lists provided
    # User provides a list of risk and investment objective types
    # the function gets the column names for the counts of accounts of each type
    def dynamic_attributes_from_list(self):
        dynamic_attributes = []
    
        if len(self.risk_tolerance_list) > 0:
            for risk in self.risk_tolerance_list:
                col_name = 'NUM_ACCOUNTS_WITH_RISK_TOLERANCE_' + risk.upper().replace(" ", "_")
                dynamic_attributes.append(col_name)
    
        if len(self.investment_objective_list) > 0:
            for objective in self.investment_objective_list:
                col_name = 'NUM_ACCOUNTS_WITH_INVESTMENT_OBJECTIVE_' + objective.upper().replace(" ", "_")
                dynamic_attributes.append(col_name)
    
        return dynamic_attributes

    # this function filters the dataframe to only include the columns that are specified
    def filter_attributes(self, df, columns_required, default_attributes):
    
        # the attributes we will use are the required ones plus the ones specified in default_attributes
        working_attributes = columns_required + default_attributes
        # check to make sure we don't have duplicate columns names
        working_attributes = list(set(working_attributes))
        # check to make sure that at least some of the attributes are in the original dataframe
        if len(set(working_attributes) & set(df.columns)) == 0:
            print('Invalid column names, no column names in columns_required or default_attributes lists are contained in the dataframe')
    
        # check to see if any columns passed in the list are not actually in the dataframe, print them to screen
        # and remove from the list of working_attributes
        cols_passed_but_not_in_df = [attribute for attribute in working_attributes if attribute not in df.columns]
        if len(cols_passed_but_not_in_df) > 0:
            print(str(len(cols_passed_but_not_in_df)) + ' columns were passed but are not contained in the data. :' + str(cols_passed_but_not_in_df))
            working_attributes = [col for col in working_attributes if col not in cols_passed_but_not_in_df]
        
        # take a copy so that later in-place edits do not act on a view of the caller's dataframe
        df = df[working_attributes].copy()
        return df

    # This function does some data cleaning by removing columns that have constant or missing values
    # All numeric data that has only 1 value is removed
    # For categorical variables, we drop columns that have only 1 unique value
    # For categoricals, we drop columns that have a cardinality greater than or equal to max_num_cat_cardinality
    # If drop_count_column_distinct is True, we drop columns that have null values above the specified threshold, nulls_threshold
    def drop_dataframe_columns(self, df, max_num_cat_cardinality = 10, nulls_threshold = 0.1, keep=[], drop_count_column_distinct=False):

        print('Before cleaning, we had ' + str(df.shape[1]) + ' columns.')
        # get the numeric columns
        numeric_cols = list(df.select_dtypes(include=[np.number]).columns)
        # remove the columns that are required from the list
        numeric_cols = list(set(numeric_cols) - set(keep))

        # drop all numeric columns that just contain a constant value, min=max
        # record cols that we are dropping and remove after iterating over the list
        # don't remove in list as I think it causes issues when iterating over it
        cols_to_remove = []
        for col in numeric_cols:
            curr_col = df[col]
            if curr_col.max() == curr_col.min():
                df.drop(col, axis=1, inplace=True)
                # remove the column from our list of numerical variables
                cols_to_remove.append(col)

        numeric_cols = list(set(numeric_cols) - set(cols_to_remove))

        # get the string and datetime columns
        string_cols = list(df.select_dtypes(include=[object]).columns)
        # remove the columns that are required from the list
        string_cols = list(set(string_cols) - set(keep))
        datetime_cols = list(df.select_dtypes(include=[np.datetime64]).columns)
        # remove the columns that are required from the list
        datetime_cols = list(set(datetime_cols) - set(keep))

        # treat string and datetime cols the same for below
        not_num_cols = string_cols + datetime_cols

        # get a count of number of null values in each column,
        # if the number of nulls is greater than a threshold percentage, drop the column
        if drop_count_column_distinct:
            cols_to_remove = []
            for col in numeric_cols:
                curr_col = df[col]
                if (curr_col.isna().sum()/curr_col.shape[0]) > nulls_threshold:
                    df.drop(col, axis=1, inplace=True)
                    # add the column name to the list of attributes to remove
                    cols_to_remove.append(col)

            numeric_cols = list(set(numeric_cols) - set(cols_to_remove))  

            # do the same for non-numerical columns
            cols_to_remove = []
            for col in not_num_cols:
                curr_col = df[col]
                if (curr_col.isna().sum()/curr_col.shape[0]) > nulls_threshold:
                    df.drop(col, axis=1, inplace=True)
                    # add the column name to the list of attributes to remove
                    cols_to_remove.append(col)

            # update the non-numeric lists so that dropped columns are not revisited below
            not_num_cols = list(set(not_num_cols) - set(cols_to_remove))
            string_cols = list(set(string_cols) - set(cols_to_remove))

        # drop categorical variables that are constant or have max_num_cat_cardinality or more categories
        for col in string_cols:
            col_cardinality = df[col].nunique()
            if col_cardinality == 1 or col_cardinality >= max_num_cat_cardinality:
                df.drop(col, axis=1, inplace=True)

        print('After cleaning, we have ' + str(df.shape[1]) + ' columns.')

        return df

    # This function takes a dataframe, a list of columns and a multiplier
    # and caps values that are multiplier * standard deviations or more above the mean at that upper bound
    def clean_outliers(self, df, column_list, multiplier=5):
        for col in column_list:
            col_std = df[col].std()
            col_mean = df[col].mean()
            df.loc[df[col] >= col_mean + (multiplier * col_std), col] = col_mean + (multiplier * col_std)

        return df

    def prep_data(self, df_raw, train_or_score):
        # just in case any caps are used
        train_or_score = train_or_score.lower()

        # find the columns that are used for risk and investment objective
        dynamic_attributes = self.dynamic_attributes_from_list()
        # add the dynamic attributes to the already defined default attribute list
        self.default_attributes = self.default_attributes + dynamic_attributes

        # filter the dataframe to only include attributes that have been specified
        df_prep = self.filter_attributes(df_raw, self.columns_required, self.default_attributes)

        # fill missing customer join dates with the customer summary start date
        if self.date_customer_joined in df_prep.columns:
            df_prep = self.fill_date_customer_joined(df_prep)
        
        # filter to only include customers whose most recent record is active. All customers who churned are removed
        # sort by customer ID and summary END_DATE, take the latest record
        df_prep = df_prep.sort_values(by=[self.granularity_key, self.customer_end_date])
        df_prep = df_prep.groupby(self.granularity_key).last().reset_index()

        print('Before removing inactive customers we have ' + str(df_prep[self.granularity_key].nunique()) + ' customers')
        df_prep = df_prep[df_prep[self.status_attribute]==self.status_flag_active]
        print('After removing inactive customers we have ' + str(df_prep[self.granularity_key].nunique()) + ' customers')

        # drop some columns that we don't need (use the configured status column rather than a hard-coded name)
        df_prep.drop([self.status_attribute, 'CUSTOMER_SYSTEM_LOAD_TIMESTAMP'], axis=1, inplace=True, errors='ignore')

        if train_or_score == 'train':
            # drop more columns
            # we only do this for training, as when scoring, we already know the columns dropped from training
            df_prep = self.drop_dataframe_columns(df_prep, self.max_num_cat_cardinality, self.nulls_threshold, keep=self.columns_required, drop_count_column_distinct=True)

        # Calculate the customer tenure
        if self.date_customer_joined in df_prep.columns:
            df_prep = df_prep[df_prep[self.date_customer_joined]<=datetime.datetime.strptime(self.effective_date, '%Y-%m-%d')]
            if df_prep.shape[0] == 0:
                print('Error: No data to train with', file=sys.stderr)
            else:
                print('Add a column for customer tenure')
                df_prep['CUSTOMER_TENURE_IN_MONTHS'] = df_prep.apply(lambda x: self.udf_n_months(datetime.datetime.strptime(self.effective_date, '%Y-%m-%d'), x[self.date_customer_joined]), axis=1)        
        
        if df_prep.shape[0] == 0:
            return None
          
        # drop any column that looks like a date
        if train_or_score == 'train':
            for col in df_prep.columns:
                if df_prep[col].dtype == 'datetime64[ns]':
                    df_prep.drop(col, axis=1, inplace=True)

        print('Prepped data has ' + str(df_prep.shape[0]) + ' rows and ' + str(df_prep.shape[1]) + ' columns.')
        print('Prep has data for ' + str(df_prep[self.granularity_key].nunique()) + ' customers')
        
        if train_or_score == 'train':
            # get a list of columns that we would like to remove outliers for
            # we use only float valued columns (the np.float alias was removed in NumPy 1.24, so use the builtin float)
            float_cols = list(df_prep.select_dtypes(include=[float]).columns)
            # call the function to remove outliers
            df_prep = self.clean_outliers(df_prep, float_cols, self.std_multiplier) 

        # for string columns replace nulls with 'Unknown'
        # for numerical replace with mean. If there are no values for the column to calculate a mean (can happen in scoring),
        # fill with 0 instead
        string_cols = list(df_prep.select_dtypes(include=[object]).columns)
        numeric_cols = list(df_prep.select_dtypes(include=[np.number]).columns)

        for col in string_cols:
            df_prep[col].fillna('Unknown', inplace=True)

        for col in numeric_cols:
            col_mean = df_prep[col].mean()
            # if the whole column is null (can happen when scoring, esp if just 1 customer), fill the value with 0
            if pd.isnull(col_mean):
                df_prep[col].fillna(0, inplace=True)
            else:
                df_prep[col].fillna(col_mean, inplace=True)
        
        if train_or_score == 'train':
            with open('training_data_metadata.json', 'w') as f:
                json.dump(self.user_inputs_dict, f)
        return df_prep


Writing customer_segmentation_prep.py


#### Prep Variables 

In [10]:
# User input variables
effective_date = '2018-09-30'  # date at which the segmentation is computed 
train_or_score = 'train'

granularity_key='CUSTOMER_CUSTOMER_ID'
customer_start_date='CUSTOMER_SUMMARY_START_DATE'
customer_end_date='CUSTOMER_SUMMARY_END_DATE'
status_attribute='CUSTOMER_STATUS'
status_flag_active='Active'
date_customer_joined='CUSTOMER_RELATIONSHIP_START_DATE'

columns_required=['CUSTOMER_CUSTOMER_ID', 'CUSTOMER_STATUS', 'CUSTOMER_SUMMARY_START_DATE', 'CUSTOMER_SUMMARY_END_DATE',
                    'CUSTOMER_EFFECTIVE_DATE',  'CUSTOMER_SYSTEM_LOAD_TIMESTAMP']

default_attributes=['CUSTOMER_GENDER', 'CUSTOMER_AGE_RANGE', 'CUSTOMER_EDUCATION_LEVEL',
                            'CUSTOMER_EMPLOYMENT_STATUS', 'CUSTOMER_MARITAL_STATUS', 
                            'CUSTOMER_URBAN_CODE', 'CUSTOMER_ANNUAL_INCOME', 'CUSTOMER_RELATIONSHIP_START_DATE', 
                            'CUSTOMER_SUMMARY_RETURN_LAST_QUARTER', 
                            'CUSTOMER_SUMMARY_NUMBER_OF_EMAILS',
                            'CUSTOMER_SUMMARY_NUMBER_OF_LOGINS',
                    'CUSTOMER_SUMMARY_AMOUNT_OF_MANAGEMENT_FEES',
                           'CUSTOMER_SUMMARY_TOP_SPENDING_CATEGORY', 'CUSTOMER_CREDIT_AUTHORITY_LEVEL', 'CUSTOMER_CUSTOMER_BEHAVIOR', 'CUSTOMER_IMPORTANCE_LEVEL_CODE',
                           'CUSTOMER_MARKET_GROUP',
                           'CUSTOMER_PURSUIT']
risk_tolerance_list = []
investment_objective_list = []

std_multiplier=5
max_num_cat_cardinality=15
nulls_threshold=0.1

In [11]:
customer_full_summary_latest_file = project.get_file("customer_full_summary_latest.csv")
customer_full_summary_latest_file.seek(0)

df_raw = pd.read_csv(customer_full_summary_latest_file,
                     parse_dates=['CUSTOMER_RELATIONSHIP_START_DATE',
                                 'CUSTOMER_SUMMARY_END_DATE', 'CUSTOMER_SUMMARY_START_DATE'], infer_datetime_format=True)

from customer_segmentation_prep import CustomerSegmentationPrep

data_prep = CustomerSegmentationPrep(train_or_score=train_or_score, effective_date=effective_date, granularity_key=granularity_key, customer_start_date=customer_start_date, customer_end_date=customer_end_date,
                                        status_attribute=status_attribute, status_flag_active=status_flag_active, date_customer_joined=date_customer_joined, columns_required=columns_required, default_attributes=default_attributes,
                                        risk_tolerance_list=risk_tolerance_list, investment_objective_list=investment_objective_list, std_multiplier=std_multiplier, max_num_cat_cardinality=max_num_cat_cardinality, nulls_threshold=nulls_threshold)

df_prepped = data_prep.prep_data(df_raw, train_or_score)

Before removing inactive customers we have 1000 customers
After removing inactive customers we have 838 customers
Before cleaning, we had 22 columns.
After cleaning, we have 19 columns.
Add a column for customer tenure
Prepped data has 838 rows and 17 columns.
Prep has data for 838 customers


In [12]:
# Preview prepped data
df_prepped.head()

Unnamed: 0,CUSTOMER_CUSTOMER_ID,CUSTOMER_MARITAL_STATUS,CUSTOMER_URBAN_CODE,CUSTOMER_SUMMARY_AMOUNT_OF_MANAGEMENT_FEES,CUSTOMER_GENDER,CUSTOMER_SUMMARY_TOP_SPENDING_CATEGORY,CUSTOMER_EMPLOYMENT_STATUS,CUSTOMER_CUSTOMER_BEHAVIOR,CUSTOMER_MARKET_GROUP,CUSTOMER_CREDIT_AUTHORITY_LEVEL,CUSTOMER_AGE_RANGE,CUSTOMER_PURSUIT,CUSTOMER_IMPORTANCE_LEVEL_CODE,CUSTOMER_EFFECTIVE_DATE,CUSTOMER_EDUCATION_LEVEL,CUSTOMER_ANNUAL_INCOME,CUSTOMER_TENURE_IN_MONTHS
0,1000,Married,City,1757.13,Male,Recreation,Employed,Moderate,Accumulating,Medium,30 to 40,Capital Acquisition,Low priority,2018-01-02,College,325000.0,8
1,1001,Divorced,Urban,17935.79,Female,Uncategorized,Selfemployed,Aggressive,Gifting,Very High,65 and over,Retirement Planning,Normal priority,2017-11-29,Professional,280000.0,10
2,1002,Married,Urban,1221.06,Female,Travel,Homemaker,Growth,Accumulating,Very Low,55 to 65,Increase Net Worth,High priority,2017-08-28,PhD,130000.0,13
3,1003,Married,Urban,1176.59,Female,Travel,Homemaker,Growth,Accumulating,Very Low,65 and over,Increase Net Worth,High priority,2018-01-17,PhD,120000.0,8
4,1004,Married,City,14452.36,Male,Food,Employed,Moderate,Accumulating,Medium,40 to 55,Estate Planning,Low priority,2018-01-03,College,350000.0,8


Now that the data is prepared we need to continue with a few more data preparation steps before we can do clustering. First is to simply remove the columns `CUSTOMER_CUSTOMER_ID` and `CUSTOMER_EFFECTIVE_DATE` since they're not needed for segmentation.

In [13]:
# Drop columns not needed for segmentation
df_prepped.drop(['CUSTOMER_CUSTOMER_ID', 'CUSTOMER_EFFECTIVE_DATE'], axis=1, inplace=True)

### Display Prepared Data

Now that the data is ready for analysis, we will take a quick look at the dataset to ensure that everything is as expected.

In [14]:
# Preview prepped data after dropping the ID and effective date columns
print('\nTraining Data for Customer Segmentation use case:')
display(df_prepped.head())
print("{} rows, {} columns\n".format(*df_prepped.shape))


Training Data for Customer Segmentation use case:


Unnamed: 0,CUSTOMER_MARITAL_STATUS,CUSTOMER_URBAN_CODE,CUSTOMER_SUMMARY_AMOUNT_OF_MANAGEMENT_FEES,CUSTOMER_GENDER,CUSTOMER_SUMMARY_TOP_SPENDING_CATEGORY,CUSTOMER_EMPLOYMENT_STATUS,CUSTOMER_CUSTOMER_BEHAVIOR,CUSTOMER_MARKET_GROUP,CUSTOMER_CREDIT_AUTHORITY_LEVEL,CUSTOMER_AGE_RANGE,CUSTOMER_PURSUIT,CUSTOMER_IMPORTANCE_LEVEL_CODE,CUSTOMER_EDUCATION_LEVEL,CUSTOMER_ANNUAL_INCOME,CUSTOMER_TENURE_IN_MONTHS
0,Married,City,1757.13,Male,Recreation,Employed,Moderate,Accumulating,Medium,30 to 40,Capital Acquisition,Low priority,College,325000.0,8
1,Divorced,Urban,17935.79,Female,Uncategorized,Selfemployed,Aggressive,Gifting,Very High,65 and over,Retirement Planning,Normal priority,Professional,280000.0,10
2,Married,Urban,1221.06,Female,Travel,Homemaker,Growth,Accumulating,Very Low,55 to 65,Increase Net Worth,High priority,PhD,130000.0,13
3,Married,Urban,1176.59,Female,Travel,Homemaker,Growth,Accumulating,Very Low,65 and over,Increase Net Worth,High priority,PhD,120000.0,8
4,Married,City,14452.36,Male,Food,Employed,Moderate,Accumulating,Medium,40 to 55,Estate Planning,Low priority,College,350000.0,8


838 rows, 15 columns



### Save Prepared data

We can save the prepared data in order to transfer the data to the model training notebook. This data will be called `training_data.csv`

In [15]:
project.save_data('training_data.csv', df_prepped.to_csv(index=False), overwrite=True)

{'file_name': 'training_data.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'financialmarketscustomersegmentat-donotdelete-pr-pufwjklnffdfmt',
 'asset_id': '07eb2d32-ed72-4232-9ea1-b19dd1fa69fc'}

Now we have finished preparing the dataset and saved out the prepped data for modelling. See notebook **`2-model-training`** for the next step.

<hr>
This project contains Sample Materials, provided under this <a href="https://github.com/IBM/Industry-Accelerators/blob/master/CPD%20SaaS/LICENSE" target="_blank" rel="noopener noreferrer">license</a>. <br/>
Licensed Materials - Property of IBM. <br/>
© Copyright IBM Corp. 2019, 2020, 2021. All Rights Reserved. <br/>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.<br/>