# Absa Corporate Client Activity Forecasting Challenge! 

<br> By Pieter Cawood

This notebook implements weighted model averaging of 2 gradient boosting models. The first model is trained on quantile values  of selected transaction descriptions in the transaction data, along with the customer data. The second model is trained on quantile values of unique channel and product codes, along with the customer data. Both sets of features are supplemented with features of: the customer's transaction counts as well as quantile values of the amounts flowing into the customer's account. All features that are quantile values are determined for both the positive and negative amounts, that way the model should be able to learn from the user's earning and spending instead of simple average features.


The models are partially trained on multiple cross validation sets, with overfit detection on each run. The predictions are finally made as a weighted sum of both models.

In [1]:
!pip install -r requirements.txt



In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import KFold, GroupShuffleSplit
from catboost import CatBoostRegressor, CatBoostClassifier
from sklearn.linear_model import Lasso
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Concatenate
from tensorflow.keras.optimizers import Adam
np.set_printoptions(suppress=True)
# This seed is used by some data splitting and the models initial weights
seed = 1

In [3]:
# Load all of the data from file
customer_df = pd.read_csv("Data/customer.csv")
income_group_df = pd.read_csv("Data/income_group.csv")
transactions_df = pd.read_csv("Data/transactions.csv")
train_incomes_df = pd.read_csv("Data/Train.csv")
test_incomes_df = pd.read_csv("Data/Test.csv")

In [4]:
# Update dates to np.datetime64
transactions_df.RECORD_DATE = transactions_df.RECORD_DATE.astype(np.datetime64)
# Make the comma seperated values as whole numbers
train_incomes_df.DECLARED_NET_INCOME = train_incomes_df.DECLARED_NET_INCOME.str.replace(',','').astype(int)

In [5]:
# Sorted lists of customer identifiers
train_customer_identifiers = sorted(train_incomes_df.CUSTOMER_IDENTIFIER.unique())
test_customer_identifiers = sorted(test_incomes_df.CUSTOMER_IDENTIFIER.unique())

## Threshold transaction period
Since the incomes might have changed over time, only transactions for the last 3 months within the data are considered

In [6]:
truncate_transactions = False

if truncate_transactions:
    transactions_horizon_days = 241
    transaction_horizon_start = pd.Timestamp(transactions_df.RECORD_DATE.values.max() -\
                                             pd.Timedelta(transactions_horizon_days, 'd'))
    print('Transactions horizon from ', transaction_horizon_start, ' to ', transactions_df.RECORD_DATE.values.max())
    transactions_df = transactions_df[transactions_df.RECORD_DATE >= transaction_horizon_start]

# This section handles customers with transaction data

## Transactions features
Lets first analyse the top $n$ most common transaction types, to avoid using scarce transaction types. <br>
The plan is to later use the lower quantile, median, upper quantile and max values of each transaction types for both <br>
positive and negative amounts seperately, thus producing 8 features per $n$ defined transaction types.

In [7]:
from collections import Counter

number_of_transaction_types = 40

transaction_counter = Counter(transactions_df.TRANSACTION_DESCRIPTION.values)
transaction_feature_names = [name for name, count in transaction_counter.most_common(number_of_transaction_types)]
transaction_feature_names

[nan,
 'POS PURCHASE',
 'ATM WITHDRAWAL',
 'AIRTIME DEBIT',
 'ACB DEBIT:EXTERNAL',
 'DIGITAL PAYMENT DT',
 'ACB CREDIT',
 'IBANK TRANSFER',
 'LOTTO PURCHASE',
 'CASHSEND DIGITAL',
 'MANAGEMENT FEE',
 'CHARGES',
 'IMDTE DIGITAL PMT',
 'DC TRACK INTERNAL',
 'IBANK PAYMENT FROM',
 'DC TRACK EXTERNAL',
 'DIGITAL TRANSF DT',
 'DIGITAL TRAN FEES',
 'IMMEDIATE TRF CR',
 'POS PUR & CASH',
 'PREPAID DEBIT',
 'INT DEBIT ORDER TO',
 'REWARDS FEE',
 'OVERSEAS PURCHASE',
 'LOTTO WINNINGS',
 'STOP ORDER TO',
 'UNPAID DEBIT',
 'CARDLESS CASH DEP',
 'POS CASH WDL',
 'TRI ATM WITHDRAWAL',
 'OPENED-FROM SAV',
 'TRANSFER FROM',
 'DIGITAL TRANSF CR',
 'CASH ACCEPTOR DEP',
 'IBANK PAYMENT TO',
 'EXT STOP ORDER TO',
 'NPF CREDIT',
 'CAN CASHSEND IB',
 'INTEREST',
 'ATM TRANSFER']

### Feature extraction

<u> GBM model types #1</u> <br>
Per customer count $n$ it takes the following shape $n \ \times \ t \ \times \ $ (pos amounts lower quantile, pos amounts median, pos amounts upper quantile and pos amounts max value, neg amounts lower quantile, neg amounts median, neg amounts upper quantile and neg amounts max value.) where $t$ is the number of used transaction types.

<br> <u> GBM model types #2</u> <br>
Per customer count $n$ it takes the following shape $n \ \times \ f \ \times \ $ (pos amounts lower quantile, pos amounts median, pos amounts upper quantile and pos amounts max value, neg amounts lower quantile, neg amounts median, neg amounts upper quantile and neg amounts max value.) where $f$ is the number of (unique channels + product codes.)


In [8]:
def get_transaction_features_1(transcation_types, transactions_df_in, customer_identifiers, 
                               pos_quantiles=[0.25, 0.5, 0.75, 1.0],
                               neg_quantiles=[0, 0.25, 0.5, 0.75]):
    '''Gets the quantile values for each specified transaction type, and lastly an aggregation of 
    for all the other transaction types. The Positive and Negative values are kept as individual features.'''
    all_user_features = []
    for customer_id in tqdm(customer_identifiers):
        user_transactions_df = transactions_df_in[transactions_df_in.CUSTOMER_IDENTIFIER == customer_id]
        # Lets make the features for all of the selected transaction types 
        user_features = []
        for transcation_type in transcation_types:
            if type(transcation_type) is float:
                # NaN transaction type
                transaction_type_df = user_transactions_df[user_transactions_df.TRANSACTION_DESCRIPTION.isnull()]
            else:
                # Non-NaN transaction types
                transaction_type_df = user_transactions_df[user_transactions_df.TRANSACTION_DESCRIPTION == transcation_type]
            # First let's consider the positive amounts 
            pos_values = transaction_type_df[transaction_type_df.AMT > 0].AMT.values
            # Now let's get the lower quantile, median, upper quantile and max for the postive amounts 
            if len(pos_values) == 0:
                #No transactions, make zeros
                user_features += [0] * len(pos_quantiles)
            else:
                #Get the quantile values
                pos_qvals = np.quantile(pos_values, q=pos_quantiles).tolist()
                user_features += pos_qvals
            # Now let's consider the negative amounts
            neg_values = transaction_type_df[transaction_type_df.AMT < 0].AMT.values
            if len(neg_values) == 0:
                #No transactions, make zeros
                user_features += [0] * len(neg_quantiles)
            else:
                #Get the quantile values
                neg_qvals = np.quantile(neg_values, q=neg_quantiles).tolist()
                user_features += neg_qvals
                
        # Other transactions
        other_transactions_df = user_transactions_df[~user_transactions_df.TRANSACTION_DESCRIPTION.isin(transcation_types)]
        # First let's consider the positive amounts 
        pos_values = other_transactions_df[other_transactions_df.AMT > 0].AMT.values
         # Now let's get the lower quantile, median, upper quantile and max for the postive amounts 
        if len(pos_values) == 0:
            #No transactions, make zeros
            user_features += [0] * len(pos_quantiles)
        else:
            #Get the quantile values
            pos_qvals = np.quantile(pos_values, q=pos_quantiles).tolist()
            user_features += pos_qvals
        # Now let's consider the negative amounts
        neg_values = other_transactions_df[other_transactions_df.AMT < 0].AMT.values
        if len(neg_values) == 0:
            #No transactions, make zeros
            user_features += [0] * len(neg_quantiles)
        else:
            #Get the quantile values
            neg_qvals = np.quantile(neg_values, q=neg_quantiles).tolist()
            user_features += neg_qvals
            
        # Store all the features created for this user
        all_user_features.append(user_features) 
        
    return np.array(all_user_features).astype(float)     

def get_transaction_features_2(transactions_df_in, customer_identifiers, 
                                  pos_quantiles=[0.25, 0.5, 0.75, 1.0],
                                  neg_quantiles=[0, 0.25, 0.5, 0.75]):
    '''Gets the quantile values for each unique channel and product code. The Positive and 
    Negative values are kept as individual features.'''
    all_user_features = []
    for customer_id in tqdm(customer_identifiers):
        user_transactions_df = transactions_df_in[transactions_df_in.CUSTOMER_IDENTIFIER == customer_id]
        user_features = []
        for feature in (transactions_df.CHANNEL.unique().tolist() + transactions_df.PRODUCT_CODE.unique().tolist()):
            pos_values = user_transactions_df[(user_transactions_df.CHANNEL == feature) &\
                                             (user_transactions_df.AMT > 0)].AMT.values
            # No transactions of this type
            if len(pos_values) == 0:
                user_features += [0] * len(pos_quantiles)
            else:
                pos_qvals = np.quantile(pos_values, q=pos_quantiles).tolist()
                user_features += pos_qvals
                
            neg_values = user_transactions_df[(user_transactions_df.CHANNEL == feature) &\
                                             (user_transactions_df.AMT < 0)].AMT.values
            if len(neg_values) == 0:
                user_features += [0] * len(neg_quantiles)
            else:
                neg_qvals = np.quantile(neg_values, q=neg_quantiles).tolist()
                user_features += neg_qvals
        # Store all the features created for this user
        all_user_features.append(user_features)         
    return np.array(all_user_features).astype(float)   

<u> Transaction counts </u> <br>
Per customer count $n$ it prodives the features of positive and negative transaction counts.

In [9]:
def get_transaction_count_feature(transactions_df_in, customer_identifiers, 
                                  low_transaction_count_bins=[0, 5, 10, 20]):
    '''Gets the number of transactions per customer,
    This feature should help improve the modelling from quantiles. It returns the counts for 
    both the positive and negative transaction amounts.'''
    pos_transaction_count = []
    neg_transaction_count = []
    transaction_count_category = []
    for customer_id in tqdm(customer_identifiers):
        pos_transaction_count.append(len(transactions_df_in[(transactions_df_in.CUSTOMER_IDENTIFIER == customer_id) &\
                                                            (transactions_df_in.AMT > 0)]))
        neg_transaction_count.append(len(transactions_df_in[(transactions_df_in.CUSTOMER_IDENTIFIER == customer_id) &\
                                                            (transactions_df_in.AMT < 0)]))
        
        # Assign the user's transaction count to a binned feature
        transaction_bin = 0
        bin_assigned = False
        for transaction_count in low_transaction_count_bins:
            if (pos_transaction_count[-1] <= transaction_count)  or\
               (neg_transaction_count[-1] <= transaction_count):
                    transaction_count_category.append(transaction_bin)
                    bin_assigned = True
                    break
            transaction_bin += 1
        # No bin assinged
        if not bin_assigned:
            transaction_count_category.append(len(low_transaction_count_bins))
            
    return np.array(pos_transaction_count), np.array(neg_transaction_count), np.array(transaction_count_category)  

<u> Account balance </u> <br>
The customer's account balance qauntiles

In [10]:
def get_account_balance_features(transactions_df_in, customer_identifiers, 
                                 quantiles=[0.25, 0.5, 0.75, 1.0]):
    '''Gets the quantile values for each customer's account balance.'''
    all_user_features = []
    for customer_id in tqdm(customer_identifiers):
        user_features = []
        user_transactions_df = transactions_df_in[transactions_df_in.CUSTOMER_IDENTIFIER == customer_id]
        # No transactions
        if len(user_transactions_df) == 0:
            user_features = [0] * len(quantiles)
        else:
            user_features = np.quantile(user_transactions_df.ACCOUNT_BALANCE.values, q=quantiles).tolist()
        all_user_features.append(user_features)
    return np.array(all_user_features).astype(float)   

##### Training transaction features
This takes around 10 minutes

In [11]:
train_balance_features = get_account_balance_features(transactions_df_in=transactions_df,
                                                      customer_identifiers=train_customer_identifiers)

100%|██████████████████████████████████████████████████████████████████████████████| 3600/3600 [03:04<00:00, 19.46it/s]


In [12]:
train_1_transaction_features = get_transaction_features_1(transcation_types=transaction_feature_names,
                                                          transactions_df_in=transactions_df,
                                                          customer_identifiers=train_customer_identifiers)

train_1_transaction_features = np.hstack((train_1_transaction_features,
                                         train_balance_features))

100%|██████████████████████████████████████████████████████████████████████████████| 3600/3600 [06:28<00:00,  9.27it/s]


In [13]:
train_2_transaction_features = get_transaction_features_2(transactions_df_in=transactions_df,
                                                          customer_identifiers=train_customer_identifiers)

train_2_transaction_features = np.hstack((train_2_transaction_features,
                                         train_balance_features))

100%|██████████████████████████████████████████████████████████████████████████████| 3600/3600 [06:59<00:00,  8.58it/s]


In [14]:
train_pos_transaction_counts, train_neg_transaction_counts,\
train_transaction_count_category = get_transaction_count_feature(transactions_df_in=transactions_df,
                                                                 customer_identifiers=train_customer_identifiers)

100%|██████████████████████████████████████████████████████████████████████████████| 3600/3600 [05:19<00:00, 11.28it/s]


##### Testing transaction features
This takes around 5 minutes

In [15]:
test_balance_features = get_account_balance_features(transactions_df_in=transactions_df,
                                                     customer_identifiers=test_customer_identifiers)

100%|██████████████████████████████████████████████████████████████████████████████| 1544/1544 [01:10<00:00, 21.80it/s]


In [16]:
test_1_transaction_features = get_transaction_features_1(transcation_types=transaction_feature_names,
                                                         transactions_df_in=transactions_df,
                                                         customer_identifiers=test_customer_identifiers)

test_1_transaction_features = np.hstack((test_1_transaction_features,
                                        test_balance_features))

100%|██████████████████████████████████████████████████████████████████████████████| 1544/1544 [02:22<00:00, 10.85it/s]


In [17]:
test_2_transaction_features = get_transaction_features_2(transactions_df_in=transactions_df,
                                                         customer_identifiers=test_customer_identifiers)

test_2_transaction_features = np.hstack((test_2_transaction_features,
                                        test_balance_features))

100%|██████████████████████████████████████████████████████████████████████████████| 1544/1544 [02:54<00:00,  8.87it/s]


In [18]:
test_pos_transaction_counts, test_neg_transaction_counts,\
test_transaction_count_category = get_transaction_count_feature(transactions_df_in=transactions_df,
                                                                customer_identifiers=test_customer_identifiers)

100%|██████████████████████████████████████████████████████████████████████████████| 1544/1544 [02:23<00:00, 10.74it/s]


## Create training features

#### Update the customer group codes if outdated

In [19]:
def income_to_group(value):
    '''Helper function to return correct group code for a specified income.'''
    if value < 500:
        return 1
    elif value < 1000:
        return 2
    elif value < 2000:
        return 3
    elif value < 3000:
        return 4
    elif value < 4000:
        return 5
    elif value < 5000:
        return 6
    elif value < 6000:
        return 7
    elif value < 7000:
        return 8
    elif value < 8000:
        return 9
    elif value < 9000:
        return 10
    elif value < 10000:
        return 11
    elif value < 12000:
        return 12
    elif value < 15000:
        return 13
    elif value < 20000:
        return 14
    elif value < 25000:
        return 15
    elif value < 34000:
        return 16
    elif value < 42000:
        return 17
    elif value < 63000:
        return 18
    elif value < 85000:
        return 19
    elif value < 125000:
        return 20
    else:
        return 21

def group_to_income(value):
    '''Helper function to return average income for a specified group code.'''
    if value < 2:
        return 250
    elif value < 3:
        return 750
    elif value < 4:
        return 1500
    elif value < 5:
        return 2500
    elif value < 6:
        return 3500
    elif value < 7:
        return 4500
    elif value < 8:
        return 5500
    elif value < 9:
        return 6500
    elif value < 10:
        return 7500
    elif value < 11:
        return 8500
    elif value < 12:
        return 9500
    elif value < 13:
        return 11000
    elif value < 14:
        return 13500
    elif value < 15:
        return 17500
    elif value < 16:
        return 22500
    elif value < 17:
        return 29500
    elif value < 18:
        return 38000
    elif value < 19:
        return 52500
    elif value < 20:
        return 74000
    else:
        return 90000


def get_customer_groups(customers_df_in, train_incomes_df_in):
    '''Get updated user incomes according to the income codes csv.'''
    
    new_income_groups = []
    new_update_dates = []
    old_df = customers_df_in.copy()
    train_incomes_df_ = train_incomes_df_in.copy()
    old_df['RECORD_DATE'] = train_incomes_df_.RECORD_DATE
    old_df['NET_INCOME'] = train_incomes_df_.DECLARED_NET_INCOME
    for income_code, net_income, record_date, last_update in\
        zip(old_df.INCOME_GROUP_CODE.values, old_df.NET_INCOME.values, 
            old_df.RECORD_DATE.values, old_df.DATE_LAST_UPDATED.values):
        if record_date > last_update:
            # Outdates user, get an updated income group code.
            new_income_groups.append(income_to_group(net_income))
            new_update_dates.append(record_date)
        else:
            # User is up to date, use the known income group code.
            new_income_groups.append(income_code)
            new_update_dates.append(last_update)  
    return new_income_groups, new_update_dates

from sklearn.model_selection import train_test_split

def make_dataset(customer_features, transaction_features, categorical_indices):
    '''Combines features and formats them for modelling. If targets are given the data is split into two subsets of data,
    to train two different models to ensemble.'''
    # Combine customer features and transactional ones
    stacked_features = np.hstack((customer_features, transaction_features))
    # No targets were given, this is probably test data
    x = pd.DataFrame(stacked_features)
    # Make the categorical features of type int
    if categorical_indices is not None:
        for index in categorical_indices: 
            x[index] = x[index].astype(int)
    return x
def get_customer_groups(customers_df_in, train_incomes_df_in):
    '''Get updated user incomes according to the income codes csv.'''
    
    new_income_groups = []
    new_update_dates = []
    old_df = customers_df_in.copy()
    train_incomes_df_ = train_incomes_df_in.copy()
    old_df['RECORD_DATE'] = train_incomes_df_.RECORD_DATE
    old_df['NET_INCOME'] = train_incomes_df_.DECLARED_NET_INCOME
    for income_code, net_income, record_date, last_update in\
        zip(old_df.INCOME_GROUP_CODE.values, old_df.NET_INCOME.values, 
            old_df.RECORD_DATE.values, old_df.DATE_LAST_UPDATED.values):
        if record_date > last_update:
            # Outdates user, get an updated income group code.
            new_income_groups.append(income_to_group(net_income))
            new_update_dates.append(record_date)
        else:
            # User is up to date, use the known income group code.
            new_income_groups.append(income_code)
            new_update_dates.append(last_update)  
    return new_income_groups, new_update_dates

from sklearn.model_selection import train_test_split

def make_dataset(customer_features, transaction_features, categorical_indices):
    '''Combines features and formats them for modelling. If targets are given the data is split into two subsets of data,
    to train two different models to ensemble.'''
    # Combine customer features and transactional ones
    stacked_features = np.hstack((customer_features, transaction_features))
    # No targets were given, this is probably test data
    x = pd.DataFrame(stacked_features)
    # Make the categorical features of type int
    if categorical_indices is not None:
        for index in categorical_indices: 
            x[index] = x[index].astype(int)
    return x

### Wrangle the customer dataframes for the modelling

In [20]:
train_customer_features = customer_df.copy()
train_customer_features = train_customer_features[train_customer_features.\
                                                 CUSTOMER_IDENTIFIER.isin(train_customer_identifiers)]
train_customer_features = train_customer_features.sort_values(by="CUSTOMER_IDENTIFIER")
train_customer_features["TRANSACTION_POS_COUNTS"] = train_pos_transaction_counts
train_customer_features["TRANSACTION_NEG_COUNTS"] = train_neg_transaction_counts
train_customer_features["TRANSACTION_COUNT_CATEGORY"] = train_transaction_count_category
train_customer_features.DATE_LAST_UPDATED = train_customer_features.DATE_LAST_UPDATED.astype(np.datetime64)

# Update the income groups and last updated
train_incomes_df.RECORD_DATE = train_incomes_df.RECORD_DATE.astype(np.datetime64)
train_incomes_df = train_incomes_df.sort_values(by="CUSTOMER_IDENTIFIER")
new_train_income_groups, _ = get_customer_groups(train_customer_features, train_incomes_df)
train_customer_features['INCOME_GROUP_CODE'] = new_train_income_groups

In [21]:
test_customer_features = customer_df.copy()
test_customer_features = test_customer_features[test_customer_features.CUSTOMER_IDENTIFIER.isin(test_customer_identifiers)]
test_customer_features = test_customer_features.sort_values(by="CUSTOMER_IDENTIFIER")
test_customer_features["TRANSACTION_POS_COUNTS"] = test_pos_transaction_counts
test_customer_features["TRANSACTION_NEG_COUNTS"] = test_neg_transaction_counts
test_customer_features["TRANSACTION_COUNT_CATEGORY"] = test_transaction_count_category
test_customer_features.DATE_LAST_UPDATED = test_customer_features.DATE_LAST_UPDATED.astype(np.datetime64)

In [22]:
# Drop high cardinality and unused features
columns_to_drop = ["CUSTOMER_IDENTIFIER", "DATE_LAST_UPDATED", "SEX_CODE", "OCCUPATIONAL_STATUS_CODE"]
train_customer_features.drop(columns=columns_to_drop, inplace=True)
test_customer_features.drop(columns=columns_to_drop, inplace=True)

In [23]:
categorical_features = ["TRANSACTION_COUNT_CATEGORY"]
categorical_feature_indices = [train_customer_features.columns.get_loc(feature) for feature in categorical_features]

categorical_feature_indices = None

#### Make the final train and test ready data sets

There are 2 types of data structures used. GBM type 1 and GBM type 2. The following code just joins the customer and transaction features and optionally splits them so that we can train an ensemble of GBMs instead of a single model.

In [24]:
train_y = train_incomes_df[train_incomes_df.CUSTOMER_IDENTIFIER.isin(train_customer_identifiers)].\
                            sort_values(by="CUSTOMER_IDENTIFIER").DECLARED_NET_INCOME.values

train_x_1 = make_dataset(train_customer_features.values,
                         train_1_transaction_features, 
                         categorical_feature_indices)

train_x_2 = make_dataset(train_customer_features.values,
                         train_2_transaction_features, 
                         categorical_feature_indices)

test_x_1 = make_dataset(test_customer_features.values, 
                        test_1_transaction_features, 
                        categorical_feature_indices)

test_x_2 = make_dataset(test_customer_features.values, 
                        test_2_transaction_features, 
                        categorical_feature_indices)

## Gradient Boost Models
CatBoost seems to work best, there are 3 models per data structure, the models utilising the structure type #1 are referred to as mdl1 and the models that utilise the structure type #2 are referred to as mdl2.
Each mdlx_0 learns from both the subsets of data, where the mdlx_1 & mdlx_2 models are trained using a single subset of data.

In [25]:
skfold = KFold(n_splits=10, shuffle=True, random_state=seed)

model_1 = CatBoostRegressor(random_seed=seed*10,
                            random_strength=3.0,
                            l2_leaf_reg=1,
                            n_estimators=3000,
                            learning_rate=0.02,
                            cat_features=categorical_feature_indices)
    
for fold, (trn_idx, val_idx) in enumerate(skfold.split(train_x_1, train_y)):
        print(f'Model 1: Fold {fold + 1}')
        X_train, y_train = train_x_1.iloc[trn_idx], train_y[trn_idx]
        X_valid, y_valid = train_x_1.iloc[val_idx], train_y[val_idx]
            
        if fold == 0:
            model_1.fit(
                X_train, y_train,
                eval_set=(X_valid, y_valid),
                verbose=100,
                use_best_model=True,
                early_stopping_rounds=100)    
        else:
            model_1.fit(
                X_train, y_train,
                eval_set=(X_valid, y_valid),
                verbose=100,
                use_best_model=True,
                early_stopping_rounds=100,
                init_model=model_1)    
        
        
model_2 = CatBoostRegressor(random_seed=seed*10,
                            random_strength=3.0,
                            l2_leaf_reg=1,
                            n_estimators=3000,
                            learning_rate=0.022,
                            cat_features=categorical_feature_indices)
    
for fold, (trn_idx, val_idx) in enumerate(skfold.split(train_x_2, train_y)):
        print(f'Model 2: Fold {fold + 1}')
        X_train, y_train = train_x_2.iloc[trn_idx], train_y[trn_idx]
        X_valid, y_valid = train_x_2.iloc[val_idx], train_y[val_idx]
            
        if fold == 0:
            model_2.fit(
                X_train, y_train,
                eval_set=(X_valid, y_valid),
                verbose=100,
                use_best_model=True,
                early_stopping_rounds=100)    
        else:
            model_2.fit(
                X_train, y_train,
                eval_set=(X_valid, y_valid),
                verbose=100,
                use_best_model=True,
                early_stopping_rounds=100,
                init_model=model_2)  

Model 1: Fold 1
0:	learn: 10242.7847738	test: 9298.9161984	best: 9298.9161984 (0)	total: 171ms	remaining: 8m 31s
100:	learn: 7087.3481089	test: 6625.6398699	best: 6625.6398699 (100)	total: 1.13s	remaining: 32.5s
200:	learn: 6499.2693253	test: 6365.2688913	best: 6361.9977010 (198)	total: 2.3s	remaining: 32.1s
300:	learn: 6238.7586918	test: 6266.5385236	best: 6266.5385236 (300)	total: 3.73s	remaining: 33.4s
400:	learn: 6020.9125420	test: 6217.8387983	best: 6217.8259973 (391)	total: 4.98s	remaining: 32.3s
500:	learn: 5735.7042481	test: 6168.4244534	best: 6167.0223603 (496)	total: 6.17s	remaining: 30.8s
600:	learn: 5420.2279354	test: 6131.7948344	best: 6131.7948344 (600)	total: 7.49s	remaining: 29.9s
700:	learn: 5153.1199635	test: 6114.2430578	best: 6107.7755645 (677)	total: 8.78s	remaining: 28.8s
800:	learn: 4944.9595800	test: 6096.0340620	best: 6096.0340620 (800)	total: 10.2s	remaining: 28s
900:	learn: 4749.4231776	test: 6063.5393145	best: 6060.7402755 (881)	total: 11.5s	remaining: 26.9s

100:	learn: 5287.2984330	test: 3988.4433922	best: 3958.0276107 (0)	total: 466ms	remaining: 13.4s
Stopped by overfitting detector  (100 iterations wait)

bestTest = 3958.027611
bestIteration = 0

Shrink model to first 1 iterations.


## Make submission csv

The first group of models are given higher weights since they make use of the more detailed features of the different transaction types as apposed to the other that only look at the product codes and channels.

In [26]:
def weighted_model_averaging(mdl1_x, mdl2_x, weights=[1.5/2, 0.5/2]):
    ''' Combines the models to make a weighted model average prediction.'''
    md1_yhat = model_1.predict(mdl1_x)
    md2_yhat = model_2.predict(mdl2_x)

    return ((md1_yhat*weights[0]) +
            (md2_yhat*weights[1]))

In [27]:
all_predictions = []

# Make the predictions
y_hat = weighted_model_averaging(test_x_1, test_x_2)

# Make an array ready for submission csv
for prediction, userid in zip(y_hat, test_customer_identifiers):
    pred = f'{prediction:}'
    all_predictions.append(np.array([userid, pred]))

all_predictions = np.asarray(all_predictions)  

### Create submission.csv in the root code directory

In [28]:
try:
    np.savetxt("submission.csv", all_predictions, delimiter=",", fmt='%s', 
               header='CUSTOMER_IDENTIFIER,DECLARED_NET_INCOME',comments='')
except:
    print("Something went wrong.. Close the submissions.csv file ?")