# Mansa loans

## EDA

We start with an exploratory data analysis (EDA) of the two datasets.

In [1]:
import pandas as pd
from src.data_tools import get_data

In [2]:
# read the csv files
df_accounts = get_data("data/accounts.csv")
df_transactions = get_data("data/transactions.csv")

In [3]:
df_accounts.head()

| | id | balance | update_date |
|---|---|---|---|
| 0 | 0 | 13.63 | 2021-07-02 |
| 1 | 1 | 12.91 | 2021-07-02 |
| 2 | 2 | 19.84 | 2021-07-02 |
| 3 | 3 | 130.0 | 2021-07-02 |
| 4 | 4 | 2806.75 | 2021-07-02 |


In [4]:
df_transactions.head()

| | account_id | date | amount |
|---|---|---|---|
| 0 | 0 | 2020-10-16 | 200.0 |
| 1 | 0 | 2020-10-16 | -192.0 |
| 2 | 0 | 2020-10-16 | 200.0 |
| 3 | 0 | 2020-10-16 | -24.0 |
| 4 | 0 | 2020-10-16 | -50.0 |


#### Checking for null values

In [5]:
df_accounts.isna().any()

id             False
balance        False
update_date    False
dtype: bool

In [6]:
df_transactions.isna().any()

account_id    False
date          False
amount        False
dtype: bool

There are no null values in either of the two datasets.


#### Checking for duplicates

In [7]:
df_accounts.duplicated().any()

False

In [8]:
df_transactions.duplicated().any()

True

There are duplicates in the transactions. Let's look at some of them.

In [9]:
n_duplicated = df_transactions.duplicated().sum()
dup_percent = int(n_duplicated / df_transactions.shape[0]*100)
print(f'there are {n_duplicated} duplicates in transactions.csv, representing {dup_percent}% of the data.')

there are 28661 duplicates in transactions.csv, representing 5% of the data.


In [10]:
df_dup = df_transactions[df_transactions.duplicated()]
df_dup.head()

| | account_id | date | amount |
|---|---|---|---|
| 2 | 0 | 2020-10-16 | 200.0 |
| 11 | 0 | 2020-10-17 | 0.0 |
| 14 | 0 | 2020-10-19 | 0.0 |
| 23 | 0 | 2020-10-21 | 0.0 |
| 45 | 0 | 2020-10-29 | -5.5 |


In [11]:
df_transactions[
    (df_transactions.account_id == 0) 
    & (df_transactions.date == '2020-10-16') 
    &(df_transactions.amount == 200.0)]

| | account_id | date | amount |
|---|---|---|---|
| 0 | 0 | 2020-10-16 | 200.0 |
| 2 | 0 | 2020-10-16 | 200.0 |


Without further information, it is difficult to tell whether the above transactions are legitimate similar transactions or duplicates introduced by an error when the data were processed. We decide to keep these duplicates. It would be useful to enrich the transactions data with a transaction id: we could then be certain whether two identical transactions on a given account and date are duplicates or legitimate.
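To inspect every copy of a repeated row rather than only the later occurrences, duplicated(keep=False) can be combined with a group count. The sketch below runs on toy data that mirrors the columns of transactions.csv; the values are made up for illustration.

```python
import pandas as pd

# Toy data with the same columns as transactions.csv (illustrative values).
df = pd.DataFrame(
    {
        "account_id": [0, 0, 0, 0, 1],
        "date": ["2020-10-16", "2020-10-16", "2020-10-16", "2020-10-17", "2020-10-16"],
        "amount": [200.0, -192.0, 200.0, 0.0, 200.0],
    }
)

# keep=False flags every member of a duplicated group, not only the repeats.
all_copies = df[df.duplicated(keep=False)]

# Count how many times each (account_id, date, amount) triple occurs.
counts = (
    df.groupby(["account_id", "date", "amount"])
    .size()
    .rename("n_rows")
    .reset_index()
)
repeated = counts[counts["n_rows"] > 1]
print(all_copies)
print(repeated)
```

A triple that occurs many times on the same day is more likely to be a processing artefact than one that occurs twice, so the n_rows column gives a rough way to prioritise which groups to review.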

#### Account history distribution

Let's check the history distribution of the different accounts in our data.

In [12]:
# we change the date column from str to datetime
df_transactions['date'] = pd.to_datetime(df_transactions['date'])
group = df_transactions.groupby('account_id')['date']

In [13]:
history_df = group.agg(["min", "max"])
history = history_df['max'] - history_df['min']
history.head()

account_id
0   257 days
1   231 days
2   100 days
3   176 days
4   293 days
dtype: timedelta64[ns]

In [14]:
trigger = pd.Timedelta(180, "d")
n_accounts = (history > trigger).sum()
frac_long_account = int(n_accounts / len(history) * 100)
print(f'{n_accounts} accounts, which represent {frac_long_account}% of all the accounts, have a history of more than 6 months')

823 accounts, which represent 65% of all the accounts, have a history of more than 6 months


We'll focus on the accounts with more than 6 months of history. For that purpose we use get_df_with_history.

In [15]:
from src.data_tools import get_df_with_history
df_accounts, df_transactions = get_df_with_history(df_accounts, df_transactions)

In [16]:
n_transac = len(set(df_transactions['account_id']))
print(f'We are left with {len(df_accounts)} accounts in df_accounts and {n_transac} in df_transactions which is as expected.')

We are left with 860 accounts in df_accounts and 860 in df_transactions which is as expected.
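The implementation of get_df_with_history is not shown in this notebook. A minimal sketch of such a filter, assuming it keeps accounts whose first and last transactions are at least 180 days apart, could look as follows. The function name filter_accounts_by_history, the min_days parameter and the >= comparison are all assumptions. The real function returns 860 accounts, whereas the strict 180-day count above gives 823, so its actual rule differs from that strict cut-off.

```python
import pandas as pd


def filter_accounts_by_history(df_accounts, df_transactions, min_days=180):
    """Keep accounts whose first and last transactions are >= min_days apart.

    Hypothetical stand-in for get_df_with_history, whose source is not shown.
    """
    dates = pd.to_datetime(df_transactions["date"])
    span = dates.groupby(df_transactions["account_id"]).agg(["min", "max"])
    history = span["max"] - span["min"]
    keep_ids = history[history >= pd.Timedelta(days=min_days)].index
    accounts = df_accounts[df_accounts["id"].isin(keep_ids)]
    transactions = df_transactions[df_transactions["account_id"].isin(keep_ids)]
    return accounts, transactions


# Toy data: account 0 spans 213 days, account 1 spans 31 days.
accounts = pd.DataFrame({"id": [0, 1], "balance": [10.0, 20.0]})
transactions = pd.DataFrame(
    {
        "account_id": [0, 0, 1, 1],
        "date": ["2020-01-01", "2020-08-01", "2020-01-01", "2020-02-01"],
        "amount": [5.0, -5.0, 1.0, -1.0],
    }
)
kept_accounts, kept_transactions = filter_accounts_by_history(accounts, transactions)
print(kept_accounts["id"].tolist())
```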


## Processing the data for training

We are looking to predict the next month's outgoings given the last 6 months of transactions. For that purpose it makes sense to divide the transaction history into 30-day buckets and to calculate the total inflow and outflow for each bucket.  
When processing the data, we keep the last 2 months for testing and the previous 2 months for validation, so that there is no data leakage between training and testing.
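One way to picture such a chronological split is to slide a 6-bucket window over each account's history and assign every sample by the position of its target bucket. The sketch below is an assumption about how this could be done, not the code of get_training_data; split_windows and its parameters are hypothetical names.

```python
def split_windows(n_buckets, history=6, val_size=2, test_size=2):
    """Assign each (window start, target bucket) pair to train, val or test.

    Buckets are indexed 0..n_buckets-1 in chronological order. A sample uses
    buckets [target - history, target) as input and bucket `target` as label.
    """
    splits = {"train": [], "val": [], "test": []}
    first_test = n_buckets - test_size
    first_val = first_test - val_size
    for target in range(history, n_buckets):
        if target >= first_test:
            name = "test"
        elif target >= first_val:
            name = "val"
        else:
            name = "train"
        splits[name].append((target - history, target))
    return splits


# An account with 12 buckets yields 6 samples, with targets 6 to 11.
splits = split_windows(12)
print(splits)
```

With this scheme every validation target is later than every training target, and every test target is later than every validation target. Input windows can still overlap between splits, which is worth keeping in mind when judging how independent the splits are.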

In [17]:
from src.data_tools import get_training_data
# When processing the data, we keep the last 2 months for testing and the previous 2 months for validation
training_data = get_training_data(test_size=2)

In [18]:
training_data.keys()

dict_keys(['train', 'val', 'test'])

In [19]:
training_data['train'].keys()

dict_keys(['X', 'y'])

The data are split into training, validation and test sets. Each of these splits is a dictionary holding the model input 'X' and the target variable 'y'. Let's check the relative proportions of the splits:

In [20]:
n_train = training_data['train']['X'].shape[0]
n_val = training_data['val']['X'].shape[0]
n_test = training_data['test']['X'].shape[0]
n = n_train + n_val + n_test
print(f'There is a total of {n} samples')
print(f'{n_train/n*100:.1f}% for training, {n_val/n*100:.1f}% for validation and {n_test/n*100:.1f}% for testing')

There is a total of 13520 samples
75.5% for training, 12.3% for validation and 12.2% for testing


This is an acceptable split of the data for training our model. Let's look at how the data were processed:

In [21]:
training_data['train']['X'].head()

| | 1M inflow | 2M inflow | 3M inflow | 4M inflow | 5M inflow | 6M inflow | 1M outflow | 2M outflow | 3M outflow | 4M outflow | 5M outflow | 6M outflow | initial_balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | -50.0 | -50.0 | -100.0 | -50.0 | -50.0 | 2525.0 |
| 1 | 4138.47 | 5376.46 | 3833.0 | 3600.0 | 3097.2 | 2000.0 | -4946.87 | -4929.89 | -3876.41 | -4195.21 | -2071.26 | -1680.31 | 70.2 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1182.0 | 4034.0 | 0.0 | 0.0 | 0.0 | 0.0 | -21.27 | -2088.17 | -80.94 |
| 3 | 1705.0 | 646.0 | 231.85 | 312.35 | 455.07 | 4787.98 | -1351.2 | -1039.65 | -217.8 | -255.07 | -271.26 | -754.99 | 280.66 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1040.0 | 714.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -68.65 | 123.69 |


These data represent 6 months of transaction history. For example, '1M inflow' gives the total amount of positive transactions on the account during the first month of the history considered, and '6M outflow' gives the total amount of negative transactions during the last month of that history. initial_balance gives the balance of the account at the beginning of the 6-month period considered.
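The inflow and outflow columns described above can be sketched as a per-bucket aggregation. The function below is an illustration under assumptions, not the code in src.data_tools: monthly_flows, start, n_buckets and bucket_days are hypothetical names, and initial_balance is left out because its derivation is not shown here.

```python
import pandas as pd


def monthly_flows(transactions, start, n_buckets=6, bucket_days=30):
    """Sum positive and negative amounts per 30-day bucket counted from `start`."""
    tx = transactions.copy()
    tx["date"] = pd.to_datetime(tx["date"])
    # Bucket 0 covers the first 30 days after `start`, bucket 1 the next 30, ...
    tx["bucket"] = (tx["date"] - pd.Timestamp(start)).dt.days // bucket_days
    inflow, outflow = {}, {}
    for b in range(n_buckets):
        amounts = tx.loc[tx["bucket"] == b, "amount"]
        inflow[f"{b + 1}M inflow"] = amounts[amounts > 0].sum()
        outflow[f"{b + 1}M outflow"] = amounts[amounts < 0].sum()
    # Same column order as the table above: all inflows, then all outflows.
    return pd.Series({**inflow, **outflow})


# Toy transactions: three fall in the first bucket, one in the second.
tx = pd.DataFrame(
    {
        "date": ["2020-10-16", "2020-10-16", "2020-10-20", "2020-11-20"],
        "amount": [200.0, -192.0, -24.0, 50.0],
    }
)
features = monthly_flows(tx, start="2020-10-16")
print(features)
```

Buckets without transactions sum to zero, which matches the many 0.0 entries in the table above.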

We have organised the data as needed for the next step:

In [22]:
X_train = training_data['train']['X']
y_train = training_data['train']['y']
X_val = training_data['val']['X']
y_val = training_data['val']['y']
X_test = training_data['test']['X']
y_test = training_data['test']['y']


## Model choice

Given that time is limited, we'll compare the performance of different models with their default settings to choose our model.

In [23]:
# we perform some scaling on the data. It is not necessary for tree based models but we'll keep a common 
# groundwork for our analysis
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

### Regression

Linear regression gives us a benchmark. It is unlikely to be the best-performing model; we should be able to do better with ensemble models.

In [24]:
from sklearn.linear_model import LinearRegression

# we start with linear regression
reg = LinearRegression()
reg.fit(X_train, y_train)
r2 = reg.score(X_val, y_val)
print(f'Linear regression r2: {r2}')

Linear regression r2: 0.26590342126231004


In [25]:
from sklearn.linear_model import Ridge

# we impose a penalty using Ridge regression
reg = Ridge()
reg.fit(X_train, y_train)
r2 = reg.score(X_val, y_val)
print(f'Ridge regression r2: {r2}')

Ridge regression r2: 0.26592745106054116


In [26]:
from sklearn.linear_model import Lasso

# we now impose a penalty using Lasso regression
reg = Lasso()
reg.fit(X_train, y_train)
r2 = reg.score(X_val, y_val)
print(f'Lasso regression r2: {r2}')

Lasso regression r2: 0.2658387256598487


### Random Forests

In [27]:
from sklearn.ensemble import RandomForestRegressor
reg_rf = RandomForestRegressor(n_estimators=100, random_state=48)
reg_rf.fit(X_train, y_train)
r2 = reg_rf.score(X_val, y_val)
print(f'Random forest r2: {r2}')

Random forest r2: 0.1799939470557319


### Light GBM

In [28]:
from lightgbm import LGBMRegressor
reg_lg = LGBMRegressor(n_estimators=100, random_state=48)
reg_lg.fit(X_train, y_train)
r2 = reg_lg.score(X_val, y_val)
print(f'Light gbm r2: {r2}')

Light gbm r2: 0.28423212148743804


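With a validation r2 of 0.284, Light GBM is the most promising model: the three linear models reach about 0.266 and the random forest 0.180. We'll use Light GBM from now on.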
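The models above are compared with score, which returns R² for scikit-learn style regressors, while the tuning logs that follow report RMSE. On a fixed validation set the two metrics rank models the same way, because R² = 1 - n * RMSE² / SS_tot, where SS_tot depends only on the targets. The sketch below checks this on made-up numbers; rmse and r_squared are hypothetical helper names.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up monthly outflows and predictions.
y_true = [-100.0, -200.0, -300.0, -400.0]
y_pred = [-150.0, -150.0, -350.0, -350.0]

error = rmse(y_true, y_pred)
score = r_squared(y_true, y_pred)

# Recover R² from RMSE and the total sum of squares of the targets.
mean_true = sum(y_true) / len(y_true)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
score_from_rmse = 1.0 - len(y_true) * error ** 2 / ss_tot
print(error, score, score_from_rmse)
```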

### Hyperparameter tuning for Light GBM

We use Optuna for hyperparameter tuning and tune the following parameters:
- parameters that control the tree structure: num_leaves and max_depth (3 to 15 in the trials logged below). The LightGBM documentation indicates that num_leaves should be < 2^(max_depth)
- parameters for better accuracy: n_estimators and learning_rate (0.006 to 0.02 in the trials logged below)
- parameters to control overfitting: regularization (l1 via reg_alpha, l2 via reg_lambda) and the bagging fraction (subsample)
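The num_leaves guideline can be made concrete: a binary tree of depth d holds at most 2^d leaves, so larger values of num_leaves have no additional effect once max_depth is fixed. The helper below is a small sketch of that arithmetic; cap_num_leaves is a hypothetical name and is not part of src.hyper_params.

```python
def cap_num_leaves(num_leaves, max_depth):
    """Clamp num_leaves to the number of leaves a tree of max_depth can hold."""
    if max_depth <= 0:
        # LightGBM treats max_depth <= 0 as "no depth limit".
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


# Parameter pairs of the kind sampled during tuning.
print(cap_num_leaves(718, 7))    # depth 7 allows at most 2**7 leaves
print(cap_num_leaves(793, 3))    # depth 3 allows at most 2**3 leaves
print(cap_num_leaves(423, 12))   # already below 2**12
```

Several logged trials pair a small max_depth with a num_leaves in the hundreds (for example max_depth 7 with num_leaves 718), so in those trials the depth limit, not num_leaves, is what bounds the tree size.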


In [29]:
import optuna
from src.hyper_params import get_objective


In [30]:
study = optuna.create_study(direction='minimize')
objective = get_objective(X_train, X_val, y_train, y_val)
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

[I 2022-04-22 14:04:30,452] A new study created in memory with name: no-name-7c9d3e30-7753-4e05-9d04-f2768c9a000b


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7660.11
[200]	valid_0's rmse: 7386.87


[I 2022-04-22 14:04:30,804] Trial 0 finished with value: 7381.3500106699985 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.718567010048906, 'reg_lambda': 0.30879783323044596, 'colsample_bytree': 0.5, 'subsample': 0.7, 'learning_rate': 0.01, 'max_depth': 12, 'num_leaves': 423, 'min_child_samples': 139, 'min_data_per_groups': 36}. Best is trial 0 with value: 7381.3500106699985.


[300]	valid_0's rmse: 7420.32
Early stopping, best iteration is:
[228]	valid_0's rmse: 7381.35
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7959.14


[I 2022-04-22 14:04:31,131] Trial 1 finished with value: 7521.91418835897 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.5010106557156396, 'reg_lambda': 0.020814518035434534, 'colsample_bytree': 0.7, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 12, 'num_leaves': 233, 'min_child_samples': 201, 'min_data_per_groups': 83}. Best is trial 0 with value: 7381.3500106699985.


[200]	valid_0's rmse: 7552.56
[300]	valid_0's rmse: 7527.63
Early stopping, best iteration is:
[257]	valid_0's rmse: 7521.91
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:31,230] Trial 2 finished with value: 7644.776428623257 and parameters: {'n_estimators': 100, 'reg_alpha': 0.010712661149237492, 'reg_lambda': 0.6433364511553987, 'colsample_bytree': 1.0, 'subsample': 1.0, 'learning_rate': 0.017, 'max_depth': 15, 'num_leaves': 269, 'min_child_samples': 292, 'min_data_per_groups': 11}. Best is trial 0 with value: 7381.3500106699985.


[100]	valid_0's rmse: 7644.78
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7644.78
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7777.8
[200]	valid_0's rmse: 7376.84


[I 2022-04-22 14:04:31,454] Trial 3 finished with value: 7368.851759414185 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.00841886435939294, 'reg_lambda': 0.002184959277473596, 'colsample_bytree': 0.5, 'subsample': 0.5, 'learning_rate': 0.008, 'max_depth': 7, 'num_leaves': 718, 'min_child_samples': 92, 'min_data_per_groups': 78}. Best is trial 3 with value: 7368.851759414185.


[300]	valid_0's rmse: 7382.36
Early stopping, best iteration is:
[214]	valid_0's rmse: 7368.85
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7511.53


[I 2022-04-22 14:04:31,702] Trial 4 finished with value: 7489.878283720449 and parameters: {'n_estimators': 250, 'reg_alpha': 1.9145996590327772, 'reg_lambda': 0.001951846550796919, 'colsample_bytree': 0.4, 'subsample': 0.5, 'learning_rate': 0.017, 'max_depth': 15, 'num_leaves': 754, 'min_child_samples': 205, 'min_data_per_groups': 50}. Best is trial 3 with value: 7368.851759414185.


[200]	valid_0's rmse: 7554.43
Early stopping, best iteration is:
[126]	valid_0's rmse: 7489.88
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7574.57
[200]	valid_0's rmse: 7512.99


[I 2022-04-22 14:04:31,943] Trial 5 finished with value: 7481.823144604938 and parameters: {'n_estimators': 2000, 'reg_alpha': 6.347305425889957, 'reg_lambda': 0.03317199830894514, 'colsample_bytree': 0.4, 'subsample': 1.0, 'learning_rate': 0.014, 'max_depth': 12, 'num_leaves': 625, 'min_child_samples': 204, 'min_data_per_groups': 58}. Best is trial 3 with value: 7368.851759414185.
[I 2022-04-22 14:04:32,090] Trial 6 finished with value: 7521.038795879433 and parameters: {'n_estimators': 250, 'reg_alpha': 0.5149635000155278, 'reg_lambda': 0.10000139313301608, 'colsample_bytree': 0.6, 'subsample': 0.5, 'learning_rate': 0.008, 'max_depth': 7, 'num_leaves': 363, 'min_child_samples': 201, 'min_data_per_groups': 70}. Best is trial 3 with value: 7368.851759414185.


Early stopping, best iteration is:
[167]	valid_0's rmse: 7481.82
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7967.09
[200]	valid_0's rmse: 7551.08
Did not meet early stopping. Best iteration is:
[250]	valid_0's rmse: 7521.04
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8109.95
[200]	valid_0's rmse: 7551.62
[300]	valid_0's rmse: 7406.05


[I 2022-04-22 14:04:32,533] Trial 7 finished with value: 7392.601595111302 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.23074913273853306, 'reg_lambda': 0.024395450061034274, 'colsample_bytree': 1.0, 'subsample': 1.0, 'learning_rate': 0.006, 'max_depth': 12, 'num_leaves': 111, 'min_child_samples': 139, 'min_data_per_groups': 70}. Best is trial 3 with value: 7368.851759414185.


[400]	valid_0's rmse: 7400.39
Early stopping, best iteration is:
[348]	valid_0's rmse: 7392.6
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7938.4


[I 2022-04-22 14:04:32,859] Trial 8 finished with value: 7263.847092796182 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.0016119218181259286, 'reg_lambda': 0.002509094868592898, 'colsample_bytree': 0.9, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 436, 'min_child_samples': 23, 'min_data_per_groups': 62}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7424.75
[300]	valid_0's rmse: 7290.18
[400]	valid_0's rmse: 7270.39
Early stopping, best iteration is:
[359]	valid_0's rmse: 7263.85
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8365.59


[I 2022-04-22 14:04:32,987] Trial 9 finished with value: 7597.130239605568 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.0015897816838100183, 'reg_lambda': 0.0015137425816443828, 'colsample_bytree': 0.4, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 3, 'num_leaves': 793, 'min_child_samples': 274, 'min_data_per_groups': 76}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7780.61
[300]	valid_0's rmse: 7613.66
[400]	valid_0's rmse: 7602.4
Early stopping, best iteration is:
[361]	valid_0's rmse: 7597.13
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7816.8


[I 2022-04-22 14:04:33,151] Trial 10 finished with value: 7814.7537257641 and parameters: {'n_estimators': 500, 'reg_alpha': 0.04070595442418664, 'reg_lambda': 9.177371878394373, 'colsample_bytree': 0.9, 'subsample': 0.4, 'learning_rate': 0.02, 'max_depth': 5, 'num_leaves': 982, 'min_child_samples': 2, 'min_data_per_groups': 28}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 8213.37
Early stopping, best iteration is:
[103]	valid_0's rmse: 7814.75
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7955.91
[200]	valid_0's rmse: 7389.8


[I 2022-04-22 14:04:33,454] Trial 11 finished with value: 7281.575288410799 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.001301003215391624, 'reg_lambda': 0.004064309099831402, 'colsample_bytree': 0.5, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 587, 'min_child_samples': 36, 'min_data_per_groups': 97}. Best is trial 8 with value: 7263.847092796182.


[300]	valid_0's rmse: 7285.38
[400]	valid_0's rmse: 7305.02
Early stopping, best iteration is:
[316]	valid_0's rmse: 7281.58
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8596.73
[200]	valid_0's rmse: 8505.95
[300]	valid_0's rmse: 8494.22


[I 2022-04-22 14:04:33,874] Trial 12 finished with value: 8472.535364172008 and parameters: {'n_estimators': 750, 'reg_alpha': 0.0018365823008690974, 'reg_lambda': 0.006917795573275732, 'colsample_bytree': 0.8, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 582, 'min_child_samples': 3, 'min_data_per_groups': 100}. Best is trial 8 with value: 7263.847092796182.


Early stopping, best iteration is:
[263]	valid_0's rmse: 8472.54
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8006.13
[200]	valid_0's rmse: 7385.27
[300]	valid_0's rmse: 7297.32


[I 2022-04-22 14:04:34,418] Trial 13 finished with value: 7294.173304221844 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.007640336865087608, 'reg_lambda': 0.012036051467734501, 'colsample_bytree': 0.3, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 10, 'num_leaves': 495, 'min_child_samples': 62, 'min_data_per_groups': 91}. Best is trial 8 with value: 7263.847092796182.


Early stopping, best iteration is:
[278]	valid_0's rmse: 7294.17
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7914.56
[200]	valid_0's rmse: 7422.2


[I 2022-04-22 14:04:34,679] Trial 14 finished with value: 7366.479407490431 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.0011363610238240545, 'reg_lambda': 0.004986622663313889, 'colsample_bytree': 0.9, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 43, 'min_child_samples': 48, 'min_data_per_groups': 51}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:34,823] Trial 15 finished with value: 7365.02906616742 and parameters: {'n_estimators': 750, 'reg_alpha': 0.047034607071285976, 'reg_lambda': 0.0011180194013982734, 'colsample_bytree': 0.9, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 931, 'min_child_samples': 53, 'min_data_per_groups': 100}. Best is trial 8 with value: 7263.847092796182.


[300]	valid_0's rmse: 7368.59
Early stopping, best iteration is:
[287]	valid_0's rmse: 7366.48
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7383.47
[200]	valid_0's rmse: 7502.39
Early stopping, best iteration is:
[116]	valid_0's rmse: 7365.03


[I 2022-04-22 14:04:34,959] Trial 16 finished with value: 7360.569689714738 and parameters: {'n_estimators': 100, 'reg_alpha': 0.004598573741195137, 'reg_lambda': 0.09722050428966661, 'colsample_bytree': 0.5, 'subsample': 0.4, 'learning_rate': 0.02, 'max_depth': 10, 'num_leaves': 564, 'min_child_samples': 92, 'min_data_per_groups': 6}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7364.14
Did not meet early stopping. Best iteration is:
[89]	valid_0's rmse: 7360.57
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7622.15
[200]	valid_0's rmse: 7287.31
[300]	valid_0's rmse: 7313.63
Early stopping, best iteration is:
[223]	valid_0's rmse: 7278.78


[I 2022-04-22 14:04:35,059] Trial 17 finished with value: 7278.775306268422 and parameters: {'n_estimators': 500, 'reg_alpha': 0.02679175556396939, 'reg_lambda': 0.005821002048417379, 'colsample_bytree': 0.3, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 3, 'num_leaves': 395, 'min_child_samples': 29, 'min_data_per_groups': 32}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:35,155] Trial 18 finished with value: 7394.958259556604 and parameters: {'n_estimators': 500, 'reg_alpha': 0.028100579104821225, 'reg_lambda': 3.5793848520421006, 'colsample_bytree': 0.3, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 3, 'num_leaves': 320, 'min_child_samples': 97, 'min_data_per_groups': 25}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:35,253] Trial 19 finished with value: 7309.800424845902 and parameters: {'n_estimators': 500, 'reg_alpha': 0.12881616751475908, 'reg_lambda': 0.05494314363816614, 'colsample_by

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7707.37
[200]	valid_0's rmse: 7398.25
[300]	valid_0's rmse: 7415.46
Early stopping, best iteration is:
[212]	valid_0's rmse: 7394.96
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7659.7
[200]	valid_0's rmse: 7320.27
[300]	valid_0's rmse: 7342.47
Early stopping, best iteration is:
[217]	valid_0's rmse: 7309.8
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:35,394] Trial 20 finished with value: 7414.369594607533 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.01805456385853269, 'reg_lambda': 0.00999986374296881, 'colsample_bytree': 0.7, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 5, 'num_leaves': 416, 'min_child_samples': 115, 'min_data_per_groups': 19}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7623.85
[200]	valid_0's rmse: 7414.37
[300]	valid_0's rmse: 7482.06
Early stopping, best iteration is:
[200]	valid_0's rmse: 7414.37
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8105.69
[200]	valid_0's rmse: 7500.92
[300]	valid_0's rmse: 7365.81
[400]	valid_0's rmse: 7340.01
Early stopping, best iteration is:
[393]	valid_0's rmse: 7338.86


[I 2022-04-22 14:04:35,531] Trial 21 finished with value: 7338.858966167534 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.002420438141306747, 'reg_lambda': 0.002947090862027898, 'colsample_bytree': 0.6, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 3, 'num_leaves': 487, 'min_child_samples': 32, 'min_data_per_groups': 61}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7932.93
[200]	valid_0's rmse: 7410.21
[300]	valid_0's rmse: 7367.97
Early stopping, best iteration is:
[285]	valid_0's rmse: 7363.93


[I 2022-04-22 14:04:35,779] Trial 22 finished with value: 7363.93054192049 and parameters: {'n_estimators': 500, 'reg_alpha': 0.004343756313312238, 'reg_lambda': 0.0037298431768576367, 'colsample_bytree': 0.8, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 623, 'min_child_samples': 67, 'min_data_per_groups': 39}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:35,983] Trial 23 finished with value: 7287.968069697077 and parameters: {'n_estimators': 500, 'reg_alpha': 0.001030253687925956, 'reg_lambda': 0.009436247540371352, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 465, 'min_child_samples': 27, 'min_data_per_groups': 89}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7512.58
[200]	valid_0's rmse: 7290.78
Early stopping, best iteration is:
[174]	valid_0's rmse: 7287.97


[I 2022-04-22 14:04:36,113] Trial 24 finished with value: 7413.333486370669 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.004791076471739091, 'reg_lambda': 0.0012315144503803198, 'colsample_bytree': 0.9, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 3, 'num_leaves': 366, 'min_child_samples': 68, 'min_data_per_groups': 48}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8060.99
[200]	valid_0's rmse: 7527.45
[300]	valid_0's rmse: 7418.07
[400]	valid_0's rmse: 7419.89
Early stopping, best iteration is:
[344]	valid_0's rmse: 7413.33
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7426.69


[I 2022-04-22 14:04:36,259] Trial 25 finished with value: 7401.693308445371 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.0033526650465610167, 'reg_lambda': 0.004654131956168804, 'colsample_bytree': 0.3, 'subsample': 0.7, 'learning_rate': 0.017, 'max_depth': 7, 'num_leaves': 843, 'min_child_samples': 167, 'min_data_per_groups': 63}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:36,329] Trial 26 finished with value: 7402.379385387006 and parameters: {'n_estimators': 100, 'reg_alpha': 0.08410664860668715, 'reg_lambda': 0.013783646047116427, 'colsample_bytree': 0.5, 'subsample': 0.4, 'learning_rate': 0.014, 'max_depth': 5, 'num_leaves': 676, 'min_child_samples': 31, 'min_data_per_groups': 29}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7456.48
Early stopping, best iteration is:
[122]	valid_0's rmse: 7401.69
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7402.38
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7402.38
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:36,749] Trial 27 finished with value: 9006.636183954603 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.01506841565334068, 'reg_lambda': 0.44051717706482096, 'colsample_bytree': 0.9, 'subsample': 0.6, 'learning_rate': 0.02, 'max_depth': 10, 'num_leaves': 552, 'min_child_samples': 2, 'min_data_per_groups': 14}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 10337.6
Early stopping, best iteration is:
[32]	valid_0's rmse: 9006.64
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7556.87
[200]	valid_0's rmse: 7319.56


[I 2022-04-22 14:04:37,180] Trial 28 finished with value: 7310.305778355652 and parameters: {'n_estimators': 250, 'reg_alpha': 0.023212510126666374, 'reg_lambda': 0.0033666806170297814, 'colsample_bytree': 0.3, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 15, 'num_leaves': 284, 'min_child_samples': 79, 'min_data_per_groups': 92}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:37,277] Trial 29 finished with value: 7395.388316426382 and parameters: {'n_estimators': 1250, 'reg_alpha': 1.916492969137763, 'reg_lambda': 0.19801804476304294, 'colsample_bytree': 0.5, 'subsample': 0.7, 'learning_rate': 0.01, 'max_depth': 3, 'num_leaves': 416, 'min_child_samples': 120, 'min_data_per_groups': 34}. Best is trial 8 with value: 7263.847092796182.


Did not meet early stopping. Best iteration is:
[176]	valid_0's rmse: 7310.31
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7701.95
[200]	valid_0's rmse: 7402.96
[300]	valid_0's rmse: 7421.94
Early stopping, best iteration is:
[231]	valid_0's rmse: 7395.39
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8162.24
[200]	valid_0's rmse: 7609.8
[300]	valid_0's rmse: 7476.25


[I 2022-04-22 14:04:37,679] Trial 30 finished with value: 7454.931202187149 and parameters: {'n_estimators': 750, 'reg_alpha': 0.0028674208086978056, 'reg_lambda': 1.3511594423265314, 'colsample_bytree': 1.0, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 206, 'min_child_samples': 161, 'min_data_per_groups': 2}. Best is trial 8 with value: 7263.847092796182.


[400]	valid_0's rmse: 7455.47
[500]	valid_0's rmse: 7465.34
Early stopping, best iteration is:
[406]	valid_0's rmse: 7454.93
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7524.62


[I 2022-04-22 14:04:37,987] Trial 31 finished with value: 7281.018046762286 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0010324632187601354, 'reg_lambda': 0.007661195248786861, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 467, 'min_child_samples': 25, 'min_data_per_groups': 89}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7282.63
Early stopping, best iteration is:
[197]	valid_0's rmse: 7281.02
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7499.84


[I 2022-04-22 14:04:38,219] Trial 32 finished with value: 7341.6748593155835 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0015355546112040264, 'reg_lambda': 0.0058674449293293195, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 442, 'min_child_samples': 45, 'min_data_per_groups': 85}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7366.54
Early stopping, best iteration is:
[157]	valid_0's rmse: 7341.67
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:38,550] Trial 33 finished with value: 7369.959367974427 and parameters: {'n_estimators': 500, 'reg_alpha': 0.006972414695781994, 'reg_lambda': 0.018934308899345897, 'colsample_bytree': 0.7, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 529, 'min_child_samples': 15, 'min_data_per_groups': 96}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7594.17
[200]	valid_0's rmse: 7371.91
[300]	valid_0's rmse: 7403.29
Early stopping, best iteration is:
[207]	valid_0's rmse: 7369.96
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7674.26
[200]	valid_0's rmse: 7341.19
[300]	valid_0's rmse: 7430.5
Early stopping, best iteration is:
[200]	valid_0's rmse: 7341.19


[I 2022-04-22 14:04:38,862] Trial 34 finished with value: 7341.190574184246 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.0023314009710856034, 'reg_lambda': 0.0021868623114460724, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 7, 'num_leaves': 353, 'min_child_samples': 45, 'min_data_per_groups': 81}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:39,407] Trial 35 finished with value: 7297.402025677863 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0010734409281992442, 'reg_lambda': 0.03772776266576703, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.017, 'max_depth': 15, 'num_leaves': 645, 'min_child_samples': 37, 'min_data_per_groups': 74}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7299.46
[200]	valid_0's rmse: 7427.6
Early stopping, best iteration is:
[113]	valid_0's rmse: 7297.4
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7551.75


[I 2022-04-22 14:04:40,219] Trial 36 finished with value: 7358.579725204465 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.010076320046214596, 'reg_lambda': 0.0022324098911393098, 'colsample_bytree': 0.8, 'subsample': 1.0, 'learning_rate': 0.01, 'max_depth': 12, 'num_leaves': 403, 'min_child_samples': 16, 'min_data_per_groups': 84}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7359.97
[300]	valid_0's rmse: 7412.96
Early stopping, best iteration is:
[203]	valid_0's rmse: 7358.58
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7772.42
[200]	valid_0's rmse: 7390.37
[300]	valid_0's rmse: 7363.42
Early stopping, best iteration is:
[254]	valid_0's rmse: 7356.6


[I 2022-04-22 14:04:40,472] Trial 37 finished with value: 7356.60331794936 and parameters: {'n_estimators': 500, 'reg_alpha': 0.005519855880839055, 'reg_lambda': 0.0062577214029190375, 'colsample_bytree': 0.6, 'subsample': 0.5, 'learning_rate': 0.008, 'max_depth': 7, 'num_leaves': 248, 'min_child_samples': 111, 'min_data_per_groups': 44}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7970.02
[200]	valid_0's rmse: 7401.62
[300]	valid_0's rmse: 7346.9


[I 2022-04-22 14:04:40,739] Trial 38 finished with value: 7343.378167446537 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.24590040863143486, 'reg_lambda': 0.016735729887660606, 'colsample_bytree': 0.5, 'subsample': 0.5, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 698, 'min_child_samples': 70, 'min_data_per_groups': 56}. Best is trial 8 with value: 7263.847092796182.


Early stopping, best iteration is:
[292]	valid_0's rmse: 7343.38
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7618.17
[200]	valid_0's rmse: 7546.98
Did not meet early stopping. Best iteration is:
[167]	valid_0's rmse: 7529.34


[I 2022-04-22 14:04:40,935] Trial 39 finished with value: 7529.340134102238 and parameters: {'n_estimators': 250, 'reg_alpha': 0.002830288732560486, 'reg_lambda': 0.001031655061840231, 'colsample_bytree': 0.4, 'subsample': 1.0, 'learning_rate': 0.014, 'max_depth': 12, 'num_leaves': 298, 'min_child_samples': 250, 'min_data_per_groups': 70}. Best is trial 8 with value: 7263.847092796182.
[I 2022-04-22 14:04:41,131] Trial 40 finished with value: 7366.70866793358 and parameters: {'n_estimators': 100, 'reg_alpha': 1.2033882704420829, 'reg_lambda': 0.03431212273907242, 'colsample_bytree': 1.0, 'subsample': 0.7, 'learning_rate': 0.017, 'max_depth': 15, 'num_leaves': 514, 'min_child_samples': 81, 'min_data_per_groups': 87}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7366.71
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7366.71
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:41,414] Trial 41 finished with value: 7380.297788878749 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0012643640580754921, 'reg_lambda': 0.008798511542435649, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 470, 'min_child_samples': 17, 'min_data_per_groups': 89}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7638.85
[200]	valid_0's rmse: 7396.66
[300]	valid_0's rmse: 7436.31
Early stopping, best iteration is:
[224]	valid_0's rmse: 7380.3
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:41,672] Trial 42 finished with value: 7285.6703249465145 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0010164611580631745, 'reg_lambda': 0.002805852568011058, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 610, 'min_child_samples': 22, 'min_data_per_groups': 96}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7529.14
[200]	valid_0's rmse: 7286.6
Early stopping, best iteration is:
[199]	valid_0's rmse: 7285.67
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:41,906] Trial 43 finished with value: 7347.65027428241 and parameters: {'n_estimators': 500, 'reg_alpha': 0.00237619918276773, 'reg_lambda': 0.0018269273511573607, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 754, 'min_child_samples': 55, 'min_data_per_groups': 95}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7496.55
[200]	valid_0's rmse: 7366.34
Early stopping, best iteration is:
[156]	valid_0's rmse: 7347.65
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:42,385] Trial 44 finished with value: 7580.386226166961 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.0016918038535250904, 'reg_lambda': 0.002973879381999548, 'colsample_bytree': 0.9, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 7, 'num_leaves': 595, 'min_child_samples': 10, 'min_data_per_groups': 95}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7752.19
[200]	valid_0's rmse: 7622.62
Early stopping, best iteration is:
[167]	valid_0's rmse: 7580.39


[I 2022-04-22 14:04:42,516] Trial 45 finished with value: 7297.498012207293 and parameters: {'n_estimators': 500, 'reg_alpha': 8.046398506920204, 'reg_lambda': 0.0017123410536480433, 'colsample_bytree': 0.3, 'subsample': 0.4, 'learning_rate': 0.006, 'max_depth': 3, 'num_leaves': 364, 'min_child_samples': 39, 'min_data_per_groups': 80}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8105.9
[200]	valid_0's rmse: 7448.86
[300]	valid_0's rmse: 7304.57
[400]	valid_0's rmse: 7322.6
Early stopping, best iteration is:
[331]	valid_0's rmse: 7297.5
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7487.91
[200]	valid_0's rmse: 7508.69
Early stopping, best iteration is:
[129]	valid_0's rmse: 7462.64


[I 2022-04-22 14:04:42,626] Trial 46 finished with value: 7462.643418894582 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.0015213182501488199, 'reg_lambda': 0.0038555295308327644, 'colsample_bytree': 0.7, 'subsample': 0.8, 'learning_rate': 0.02, 'max_depth': 5, 'num_leaves': 591, 'min_child_samples': 188, 'min_data_per_groups': 100}. Best is trial 8 with value: 7263.847092796182.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7504.99


[I 2022-04-22 14:04:43,055] Trial 47 finished with value: 7295.006707619093 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.01332378831566878, 'reg_lambda': 0.006628585551407009, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 652, 'min_child_samples': 26, 'min_data_per_groups': 76}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7300.72
Early stopping, best iteration is:
[175]	valid_0's rmse: 7295.01
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8009.9


[I 2022-04-22 14:04:43,398] Trial 48 finished with value: 7313.983923411407 and parameters: {'n_estimators': 500, 'reg_alpha': 0.0010005607086849629, 'reg_lambda': 0.06283525699766657, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 532, 'min_child_samples': 57, 'min_data_per_groups': 66}. Best is trial 8 with value: 7263.847092796182.


[200]	valid_0's rmse: 7410.99
[300]	valid_0's rmse: 7316.4
Early stopping, best iteration is:
[282]	valid_0's rmse: 7313.98
Training until validation scores don't improve for 100 rounds


[I 2022-04-22 14:04:43,666] Trial 49 finished with value: 7370.233440136633 and parameters: {'n_estimators': 750, 'reg_alpha': 0.003589946351202035, 'reg_lambda': 0.024026758117131126, 'colsample_bytree': 0.9, 'subsample': 0.5, 'learning_rate': 0.008, 'max_depth': 7, 'num_leaves': 736, 'min_child_samples': 81, 'min_data_per_groups': 34}. Best is trial 8 with value: 7263.847092796182.


[100]	valid_0's rmse: 7736.7
[200]	valid_0's rmse: 7386.77
[300]	valid_0's rmse: 7400.8
Early stopping, best iteration is:
[240]	valid_0's rmse: 7370.23
Number of finished trials: 50
Best trial: {'n_estimators': 1000, 'reg_alpha': 0.0016119218181259286, 'reg_lambda': 0.002509094868592898, 'colsample_bytree': 0.9, 'subsample': 0.7, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 436, 'min_child_samples': 23, 'min_data_per_groups': 62}
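
The log lines "Training until validation scores don't improve for 100 rounds" and "Early stopping, best iteration is: [...]" describe a patience rule: training stops once the validation RMSE has not improved for a fixed number of rounds, and the best round seen so far is reported. The sketch below illustrates that rule in plain Python on made-up scores. The function name `best_iteration_with_patience` and the toy numbers are illustrative assumptions, not part of LightGBM or of this notebook.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_index, best_score, stopped_early) for a list of
    validation scores where lower is better (e.g. RMSE)."""
    best_idx, best_score = 0, float("inf")
    for idx, score in enumerate(scores):
        if score < best_score:
            # New best validation score: remember where it happened.
            best_idx, best_score = idx, score
        elif idx - best_idx >= patience:
            # No improvement for `patience` rounds: stop training here.
            return best_idx, best_score, True
    # Ran out of rounds before the patience budget was exhausted.
    return best_idx, best_score, False


# Toy validation curve (made-up values, not taken from the study above).
toy_scores = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 3.3]
early = best_iteration_with_patience(toy_scores, patience=3)
late = best_iteration_with_patience(toy_scores, patience=10)
print(early, late)
```

With a patience of 3 the loop stops early and reports the third round as best. With a patience of 10 it runs through all rounds, which corresponds to the "Did not meet early stopping" message in the log.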


Let's create an LGBMRegressor with the best hyperparameters found in the previous study.

In [31]:
params = study.best_params
params['random_state'] = 48
params['metric'] = 'rmse'
params

{'n_estimators': 1000,
 'reg_alpha': 0.0016119218181259286,
 'reg_lambda': 0.002509094868592898,
 'colsample_bytree': 0.9,
 'subsample': 0.7,
 'learning_rate': 0.006,
 'max_depth': 7,
 'num_leaves': 436,
 'min_child_samples': 23,
 'min_data_per_groups': 62,
 'random_state': 48,
 'metric': 'rmse'}

In [32]:
# 'min_data_per_groups' is not a LightGBM parameter name
# (the library spells it 'min_data_per_group'), so we drop it
_ = params.pop('min_data_per_groups')
model = LGBMRegressor(**params)
model.fit(X_train, y_train)


In [33]:
r2 = model.score(X_val, y_val)
print(f'LightGBM R2 after hyperparameter tuning: {r2}')

LightGBM R2 after hyperparameter tuning: 0.39505408879971415
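
The value returned by `model.score` is the coefficient of determination R2, while the tuning study above reports RMSE, so the two numbers are on different scales. As a reminder of how they relate to the residuals, here is a minimal pure-Python sketch on made-up values. The helper names `r2_and_rmse`, `y_true_toy` and `y_pred_toy` are illustrative assumptions and do not come from the notebook.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R2 = 1 - SS_res / SS_tot and RMSE = sqrt(SS_res / n)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n)


# Toy targets and predictions (made-up, unrelated to the loans data).
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.5]
r2_toy, rmse_toy = r2_and_rmse(y_true_toy, y_pred_toy)
print(r2_toy, rmse_toy)
```

An R2 of about 0.40, as obtained above, means the model explains roughly 40% of the variance of the validation target relative to a constant prediction of its mean.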


In [34]:
from src.model import save_model, load_model

In [38]:
save_model(model,'lgb2')

In [39]:
m = load_model('lgb2')

In [40]:
r2 = m.score(X_val, y_val)

In [41]:
r2

0.39505408879971415
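
The reloaded model gives exactly the same R2 as the original one, which confirms that the save/load round trip preserves the fitted model. The code of `src.model` is not shown here, so the following is only a sketch of what such helpers might look like, assuming they serialize with `pickle`. The names `save_model_sketch` and `load_model_sketch`, the `.pkl` extension and the `directory` argument are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_sketch(model, name, directory):
    """Hypothetical helper: pickle `model` to <directory>/<name>.pkl."""
    path = Path(directory) / f"{name}.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    return path


def load_model_sketch(name, directory):
    """Hypothetical helper: load the object saved under `name`."""
    path = Path(directory) / f"{name}.pkl"
    with path.open("rb") as fh:
        return pickle.load(fh)


# A plain dict stands in for a fitted model so the sketch runs without LightGBM.
toy_model = {"coef": [0.5, -1.0], "intercept": 2.0}
with tempfile.TemporaryDirectory() as tmp:
    save_model_sketch(toy_model, "toy", tmp)
    restored = load_model_sketch("toy", tmp)

round_trip_ok = restored == toy_model
print(round_trip_ok)
```

Pickled files should only be loaded from trusted sources, since unpickling can execute arbitrary code.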