# Mansa loans

## EDA

We start with an exploratory data analysis (EDA) of the datasets.

In [1]:
import pandas as pd
from src.data_tools import get_data

In [2]:
# read the csv files
df_accounts = get_data("data/accounts.csv")
df_transactions = get_data("data/transactions.csv")

In [3]:
df_accounts.head()

|   | id | balance | update_date |
|---|----|---------|-------------|
| 0 | 0 | 13.63 | 2021-07-02 |
| 1 | 1 | 12.91 | 2021-07-02 |
| 2 | 2 | 19.84 | 2021-07-02 |
| 3 | 3 | 130.0 | 2021-07-02 |
| 4 | 4 | 2806.75 | 2021-07-02 |


In [4]:
df_transactions.head()

|   | account_id | date | amount |
|---|------------|------|--------|
| 0 | 0 | 2020-10-16 | 200.0 |
| 1 | 0 | 2020-10-16 | -192.0 |
| 2 | 0 | 2020-10-16 | 200.0 |
| 3 | 0 | 2020-10-16 | -24.0 |
| 4 | 0 | 2020-10-16 | -50.0 |


#### Checking for null values

In [5]:
df_accounts.isna().any()

id             False
balance        False
update_date    False
dtype: bool

In [6]:
df_transactions.isna().any()

account_id    False
date          False
amount        False
dtype: bool

Neither dataset contains null values.


#### Checking for duplicates

In [7]:
df_accounts.duplicated().any()

False

In [8]:
df_transactions.duplicated().any()

True

There are duplicates in the transactions. Let's look at some of them.

In [9]:
n_duplicated = df_transactions.duplicated().sum()
dup_percent = int(n_duplicated / df_transactions.shape[0]*100)
print(f'there are {n_duplicated} duplicates in transactions.csv, representing {dup_percent}% of the data.')

there are 28661 duplicates in transactions.csv, representing 5% of the data.


In [10]:
df_dup = df_transactions[df_transactions.duplicated()]
df_dup.head()

|   | account_id | date | amount |
|---|------------|------|--------|
| 2 | 0 | 2020-10-16 | 200.0 |
| 11 | 0 | 2020-10-17 | 0.0 |
| 14 | 0 | 2020-10-19 | 0.0 |
| 23 | 0 | 2020-10-21 | 0.0 |
| 45 | 0 | 2020-10-29 | -5.5 |


In [11]:
df_transactions[
    (df_transactions.account_id == 0)
    & (df_transactions.date == '2020-10-16')
    & (df_transactions.amount == 200.0)]

|   | account_id | date | amount |
|---|------------|------|--------|
| 0 | 0 | 2020-10-16 | 200.0 |
| 2 | 0 | 2020-10-16 | 200.0 |


Without further information, it is difficult to know whether the transactions above are legitimate identical transactions or duplicates introduced by an error during data processing. We decide to keep these duplicates. It would be useful to enrich the transactions data with a transaction id: with such an id, we could decide with certainty whether identical transactions on a given account and date are duplicates or legitimate.
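
To make the ambiguity concrete, here is a minimal sketch on toy data. The toy frame and the transaction_id column are assumptions made for illustration; transactions.csv has no such column.

```python
import pandas as pd

# Toy transactions (illustrative values, not taken from transactions.csv)
transactions = pd.DataFrame({
    "account_id": [0, 0, 0, 0, 1],
    "date": ["2020-10-16", "2020-10-16", "2020-10-16", "2020-10-17", "2020-10-16"],
    "amount": [200.0, -192.0, 200.0, 0.0, 200.0],
})

# keep=False flags every member of a duplicate group, not only the repeats
dup_groups = (
    transactions[transactions.duplicated(keep=False)]
    .groupby(["account_id", "date", "amount"])
    .size()
    .rename("n_occurrences")
    .reset_index()
)
print(dup_groups)

# Hypothetical transaction_id: a repeated id marks a processing duplicate,
# while distinct ids would mark legitimate identical transactions
transactions["transaction_id"] = ["t1", "t2", "t1", "t4", "t5"]
deduplicated = transactions.drop_duplicates(subset="transaction_id")
print(len(deduplicated))
```

In this toy case the repeated id t1 identifies the second 200.0 transaction as a processing duplicate. Without such an id, the two rows are indistinguishable.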

#### Account history distribution

Let's check the distribution of history length across the accounts in our data.

In [12]:
# we change the date column from str to datetime
df_transactions['date'] = pd.to_datetime(df_transactions['date'])
group = df_transactions.groupby('account_id')['date']

In [13]:
history_df = group.agg(["min", "max"])
history = history_df['max'] - history_df['min']
history.head()

account_id
0   257 days
1   231 days
2   100 days
3   176 days
4   293 days
dtype: timedelta64[ns]

In [14]:
trigger = pd.Timedelta(180, "d")
n_accounts = (history > trigger).sum()
frac_long_account = int(n_accounts / len(history) * 100)
print(f'{n_accounts} accounts, which represent {frac_long_account}% of all the accounts, have a history of more than 6 months')

823 accounts, which represent 65% of all the accounts, have a history of more than 6 months


We want to build a model that makes predictions from 6 months of data, so we focus on accounts with more than 6 months of history. We use get_df_with_history to select these accounts.

In [15]:
from src.data_tools import get_df_with_history
df_accounts, df_transactions = get_df_with_history(df_accounts, df_transactions)

In [16]:
n_transac = len(set(df_transactions['account_id']))
print(f'We are left with {len(df_accounts)} accounts in df_accounts and {n_transac} in df_transactions which is as expected.')

We are left with 860 accounts in df_accounts and 860 in df_transactions which is as expected.
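
The implementation of get_df_with_history is not shown in this notebook. A minimal sketch of how such a filter might work is given here. The function name filter_by_history, the 180-day default and the toy data are assumptions, not the actual helper.

```python
import pandas as pd


def filter_by_history(accounts, transactions, min_days=180):
    """Keep accounts whose first and last transactions are at least min_days apart."""
    dates = pd.to_datetime(transactions["date"])
    per_account = dates.groupby(transactions["account_id"])
    span = per_account.max() - per_account.min()
    keep = span[span >= pd.Timedelta(days=min_days)].index
    return (
        accounts[accounts["id"].isin(keep)],
        transactions[transactions["account_id"].isin(keep)],
    )


# Toy data: account 0 spans 257 days, account 1 spans 59 days
accounts = pd.DataFrame({"id": [0, 1], "balance": [13.63, 12.91]})
transactions = pd.DataFrame({
    "account_id": [0, 0, 1, 1],
    "date": ["2020-10-16", "2021-06-30", "2021-01-01", "2021-03-01"],
    "amount": [200.0, -50.0, 20.0, -5.0],
})

long_accounts, long_transactions = filter_by_history(accounts, transactions)
print(long_accounts["id"].tolist())
```

Filtering both frames with the same set of ids keeps the account count identical in the two tables, which is the property checked in the cell above.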


## Processing the data for training

We want to predict next month's outgoings given the last 6 months of transactions. For that purpose, it makes sense to divide the transaction history into 30-day buckets and to compute the total inflow and outflow for each bucket.

When processing the data, we keep the last 2 months for testing and the previous 2 months for validation so that we are sure there is no data leakage when training and testing. 
 
For some accounts there are more than 11 months of data. In this case we get several training samples from the account. For example, if an account has 12 months of data, the last 4 months are used for validation and testing, so we are left with 8 months of data for training. We then have 2 training samples from this account (2 rows in X_train):
  - one sample with the first 6 months for predicting the 7th month
  - a second sample with the 2nd to the 7th month for predicting the 8th month.
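
The windowing described above can be sketched in plain Python. This is an illustration only, not the code of get_training_data: the name split_windows is hypothetical, and it assumes that validation and test samples also use the 6 months preceding their target month as input.

```python
def split_windows(n_months, history=6, holdout=2):
    """Return (input_months, target_month) pairs per split, months numbered from 1."""
    splits = {"train": [], "val": [], "test": []}
    for target in range(history + 1, n_months + 1):
        window = list(range(target - history, target))
        if target > n_months - holdout:
            # The last `holdout` months are test targets
            splits["test"].append((window, target))
        elif target > n_months - 2 * holdout:
            # The `holdout` months before them are validation targets
            splits["val"].append((window, target))
        else:
            splits["train"].append((window, target))
    return splits


splits = split_windows(12)
for name, samples in splits.items():
    for window, target in samples:
        print(f"{name}: months {window[0]}-{window[-1]} -> month {target}")
```

With 12 months this yields the two training samples listed above (months 1-6 for month 7, months 2-7 for month 8). With 11 months only one training sample remains.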

In [17]:
from src.data_tools import get_training_data
# When processing the data, we keep the last 2 months for testing and the previous 2 months for validation
training_data = get_training_data(test_size=2)

In [18]:
training_data.keys()

dict_keys(['train', 'val', 'test'])

In [19]:
training_data['train'].keys()

dict_keys(['X', 'y'])

The data are split into training, validation and test sets. Each split is a dictionary with the model input 'X' and the target variable 'y'. Let's check the relative proportions of these splits:

In [20]:
n_train = training_data['train']['X'].shape[0]
n_val = training_data['val']['X'].shape[0]
n_test = training_data['test']['X'].shape[0]
n = n_train + n_val + n_test
print(f'There is a total of {n} data')
print(f'{n_train/n*100:.1f}% for training, {n_val/n*100:.1f}% for validation and {n_test/n*100:.1f}% for testing')

There is a total of 13520 data
75.5% for training, 12.3% for validation and 12.2% for testing


This is an acceptable split of the data for training our model. Let's look at how the data were processed:

In [21]:
training_data['train']['X'].head()

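|   | 1M inflow | 2M inflow | 3M inflow | 4M inflow | 5M inflow | 6M inflow | 1M outflow | 2M outflow | 3M outflow | 4M outflow | 5M outflow | 6M outflow | initial_balance |
|---|-----------|-----------|-----------|-----------|-----------|-----------|------------|------------|------------|------------|------------|------------|-----------------|
| 0 | 90.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | -50.0 | -50.0 | -100.0 | -50.0 | -50.0 | 2525.0 |
| 1 | 4138.47 | 5376.46 | 3833.0 | 3600.0 | 3097.2 | 2000.0 | -4946.87 | -4929.89 | -3876.41 | -4195.21 | -2071.26 | -1680.31 | 70.2 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1182.0 | 4034.0 | 0.0 | 0.0 | 0.0 | 0.0 | -21.27 | -2088.17 | -80.94 |
| 3 | 1705.0 | 646.0 | 231.85 | 312.35 | 455.07 | 4787.98 | -1351.2 | -1039.65 | -217.8 | -255.07 | -271.26 | -754.99 | 280.66 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1040.0 | 714.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -68.65 | 123.69 |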


These data represent 6 months of transaction history. For example, '1M inflow' gives the total amount of positive transactions on the account during the first month of the history considered, and '6M outflow' gives the total amount of negative transactions on the account during the last month of the history considered. initial_balance gives the balance of the account at the beginning of the 6-month period considered.
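
As an illustration of how such flow columns could be derived for a single account, here is a minimal pandas sketch. The function name flow_features, the choice of window start and the toy transactions are assumptions; the actual processing lives in src.data_tools and is not shown here. The initial_balance column is left out of the sketch.

```python
import pandas as pd


def flow_features(transactions, start, n_buckets=6, bucket_days=30):
    """Total inflow and outflow per 30-day bucket, oldest bucket first."""
    end = start + pd.Timedelta(days=n_buckets * bucket_days)
    in_window = (transactions["date"] >= start) & (transactions["date"] < end)
    window = transactions[in_window]
    # Bucket 0 covers days 0-29 after start, bucket 1 covers days 30-59, and so on
    bucket = (window["date"] - start).dt.days // bucket_days
    inflow, outflow = {}, {}
    for i in range(n_buckets):
        amounts = window.loc[bucket == i, "amount"]
        inflow[f"{i + 1}M inflow"] = amounts[amounts > 0].sum()
        outflow[f"{i + 1}M outflow"] = amounts[amounts < 0].sum()
    # Same column order as X: all inflow columns, then all outflow columns
    return pd.Series({**inflow, **outflow})


# Toy transactions for one account (illustrative values)
transactions = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-15", "2021-06-20", "2021-06-25"]
    ),
    "amount": [100.0, -40.0, -25.0, 300.0, -10.5],
})
features = flow_features(transactions, start=pd.Timestamp("2021-01-01"))
print(features)
```

Buckets without any transaction sum to zero, which matches the 0.0 entries visible in the table above.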

We have organised the data as needed for the next step:

In [22]:
X_train = training_data['train']['X']
y_train = training_data['train']['y']
X_val = training_data['val']['X']
y_val = training_data['val']['y']
X_test = training_data['test']['X']
y_test = training_data['test']['y']


## Model choice

Given that time is limited, we compare the performance of different models with their default settings to choose our model.

In [23]:
# we perform some scaling on the data. It is not necessary for tree based models but we'll keep a common 
# groundwork for our analysis
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)


### Regression

Regression gives us a benchmark. It is unlikely to be the best-performing model; we should be able to do better with ensemble models.
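
The models in this section are compared with the coefficient of determination R² on the validation set. As a reminder of what the number means, here is a minimal NumPy sketch of the standard single-target formula. The helper name r2 and the toy values are illustrative.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variance around the mean
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0, 400.0]
baseline = r2(y_true, [250.0] * 4)  # always predicting the mean
perfect = r2(y_true, y_true)        # predicting the targets exactly
print(baseline, perfect)
```

A model that always predicts the mean of the targets scores 0 and a perfect model scores 1, so a value around 0.27 means the model removes roughly a quarter of the squared error of that mean baseline.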

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# we start with linear regression
reg = LinearRegression()
reg.fit(X_train, y_train)
y_predict = reg.predict(X_val)
r2 = r2_score(y_val, y_predict)
print(f'Linear regression r2: {r2}')

Linear regression r2: 0.26590342126231004


In [25]:
from sklearn.linear_model import Ridge

# we impose a penalty using Ridge regression
ridge = Ridge()
ridge.fit(X_train, y_train)
y_predict = ridge.predict(X_val)
r2 = r2_score(y_val, y_predict)
print(f'Ridge regression r2: {r2}')

Ridge regression r2: 0.26592745106054116


In [26]:
from sklearn.linear_model import Lasso

# we now impose a penalty using Lasso regression
lasso = Lasso()
lasso.fit(X_train, y_train)
y_predict = lasso.predict(X_val)
r2 = r2_score(y_val, y_predict)
print(f'Lasso regression r2: {r2}')

Lasso regression r2: 0.2658387256598487


### Random Forests

In [27]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=48)
rf.fit(X_train, y_train)
y_predict = rf.predict(X_val)
r2 = r2_score(y_val, y_predict)
print(f'Random forest r2: {r2}')

Random forest r2: 0.1799939470557319


### LightGBM

In [28]:
from lightgbm import LGBMRegressor
reg_lg = LGBMRegressor(n_estimators=100, random_state=48)
reg_lg.fit(X_train, y_train)
y_predict = reg_lg.predict(X_val)
r2 = r2_score(y_val, y_predict)
print(f'Light gbm r2: {r2}')

Light gbm r2: 0.28423212148743804


LightGBM seems the most promising model, so we'll use it from now on. We don't need to scale the data for tree-based models, so we go back to the unscaled features:

In [29]:
X_train = training_data['train']['X']
X_val = training_data['val']['X']

### Hyperparameter tuning for LightGBM

We use Optuna for hyperparameter tuning. The search covers:
- parameters that control the tree structure: num_leaves and max_depth (between 3 and 12). The LightGBM documentation indicates that num_leaves should be smaller than 2^(max_depth)
- parameters for better accuracy: n_estimators and learning_rate
- parameters to control overfitting: the bagging fraction (subsample)
- some other parameters: reg_alpha, reg_lambda, colsample_bytree and min_child_samples
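
The num_leaves guideline from the first bullet can be expressed as a small helper. This is a sketch under assumptions: the names max_num_leaves and cap_num_leaves are hypothetical and are not part of src.hyper_params.

```python
def max_num_leaves(max_depth: int) -> int:
    """Largest num_leaves that stays strictly below 2 ** max_depth."""
    if max_depth < 2:
        raise ValueError("max_depth must be at least 2 to allow more than one leaf")
    return 2 ** max_depth - 1


def cap_num_leaves(num_leaves: int, max_depth: int) -> int:
    """Clip a sampled num_leaves so that it respects the depth limit."""
    return min(num_leaves, max_num_leaves(max_depth))


for depth in (3, 7, 12):
    print(f"max_depth={depth}: num_leaves <= {max_num_leaves(depth)}")

# A combination such as max_depth=2 with num_leaves=50 would be clipped
capped = cap_num_leaves(50, 2)
print(capped)
```

Inside an Optuna objective, the same bound could be used as the upper limit when sampling, for example trial.suggest_int('num_leaves', 2, max_num_leaves(depth)) after max_depth has been sampled.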


In [30]:
import optuna
from src.hyper_params import get_objective


In [31]:
study = optuna.create_study(direction='minimize')
objective = get_objective(X_train, X_val, y_train, y_val)
study.optimize(objective, n_trials=50)
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

[I 2022-04-24 08:08:33,169] A new study created in memory with name: no-name-156652e0-be1c-4cf4-976a-81255775dbc1
[I 2022-04-24 08:08:33,228] Trial 0 finished with value: 7729.432309399014 and parameters: {'n_estimators': 100, 'reg_alpha': 5.4547207424579955, 'reg_lambda': 0.006695853706927359, 'colsample_bytree': 0.8, 'subsample': 0.6, 'learning_rate': 0.014, 'max_depth': 2, 'num_leaves': 50, 'min_child_samples': 219}. Best is trial 0 with value: 7729.432309399014.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7729.43
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7729.43
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8069.26
[200]	valid_0's rmse: 7464.32
[300]	valid_0's rmse: 7355.88


[I 2022-04-24 08:08:33,444] Trial 1 finished with value: 7349.144280399479 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.02336708435466059, 'reg_lambda': 0.0034719329416884066, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.006, 'max_depth': 5, 'num_leaves': 82, 'min_child_samples': 100}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:33,506] Trial 2 finished with value: 7448.456518798932 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.01466992340477327, 'reg_lambda': 0.022722547964770613, 'colsample_bytree': 0.9, 'subsample': 0.7, 'learning_rate': 0.02, 'max_depth': 2, 'num_leaves': 92, 'min_child_samples': 107}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:33,557] Trial 3 finished with value: 8173.121825378745 and parameters: {'n_estimators': 100, 'reg_alpha': 0.010565164519354448, 'reg_lambda': 1.4835949734666976, 'colsample_bytree': 0.3, 'subsample': 0.4, 'learning_rate': 0.006

[400]	valid_0's rmse: 7367.41
Early stopping, best iteration is:
[344]	valid_0's rmse: 7349.14
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7460.43
[200]	valid_0's rmse: 7524.91
Early stopping, best iteration is:
[118]	valid_0's rmse: 7448.46
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8173.12
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 8173.12
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7495.72
[200]	valid_0's rmse: 7504.87


[I 2022-04-24 08:08:33,643] Trial 4 finished with value: 7477.373364611033 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.001951693968452124, 'reg_lambda': 0.5789276147325875, 'colsample_bytree': 0.7, 'subsample': 0.7, 'learning_rate': 0.02, 'max_depth': 3, 'num_leaves': 56, 'min_child_samples': 184}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:33,705] Trial 5 finished with value: 7578.339480793068 and parameters: {'n_estimators': 250, 'reg_alpha': 5.6837038890475275, 'reg_lambda': 0.1796889699124674, 'colsample_bytree': 0.3, 'subsample': 1.0, 'learning_rate': 0.017, 'max_depth': 2, 'num_leaves': 48, 'min_child_samples': 227}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:33,789] Trial 6 finished with value: 7444.223233450637 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.004200953536753077, 'reg_lambda': 0.0037161481573136975, 'colsample_bytree': 1.0, 'subsample': 0.4, 'learning_rate': 0.01, '

Early stopping, best iteration is:
[128]	valid_0's rmse: 7477.37
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7618.32
[200]	valid_0's rmse: 7624.34
Early stopping, best iteration is:
[140]	valid_0's rmse: 7578.34
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7856.3
[200]	valid_0's rmse: 7485.05
[300]	valid_0's rmse: 7458.13
Early stopping, best iteration is:
[257]	valid_0's rmse: 7444.22
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7838.78


[I 2022-04-24 08:08:33,877] Trial 7 finished with value: 7554.00734737884 and parameters: {'n_estimators': 250, 'reg_alpha': 0.002921711615588487, 'reg_lambda': 0.2805777500906671, 'colsample_bytree': 1.0, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 2, 'num_leaves': 84, 'min_child_samples': 15}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:33,995] Trial 8 finished with value: 7420.82309761631 and parameters: {'n_estimators': 750, 'reg_alpha': 0.36302263267172286, 'reg_lambda': 0.16165978693723276, 'colsample_bytree': 0.3, 'subsample': 0.4, 'learning_rate': 0.02, 'max_depth': 7, 'num_leaves': 46, 'min_child_samples': 204}. Best is trial 1 with value: 7349.144280399479.
[I 2022-04-24 08:08:34,041] Trial 9 finished with value: 7615.925835074988 and parameters: {'n_estimators': 100, 'reg_alpha': 0.15720503438698366, 'reg_lambda': 0.2839186742382533, 'colsample_bytree': 0.8, 'subsample': 0.5, 'learning_rate': 0.01, 'max_dept

[200]	valid_0's rmse: 7562.91
Did not meet early stopping. Best iteration is:
[230]	valid_0's rmse: 7554.01
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7424.31
[200]	valid_0's rmse: 7514.52
Early stopping, best iteration is:
[103]	valid_0's rmse: 7420.82
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7615.93
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7615.93


[I 2022-04-24 08:08:34,164] Trial 10 finished with value: 7425.650953858175 and parameters: {'n_estimators': 500, 'reg_alpha': 0.03596591760068141, 'reg_lambda': 0.0011658878963447718, 'colsample_bytree': 0.5, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 5, 'num_leaves': 4, 'min_child_samples': 144}. Best is trial 1 with value: 7349.144280399479.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7982.44
[200]	valid_0's rmse: 7496.62
[300]	valid_0's rmse: 7426.94
[400]	valid_0's rmse: 7458.43
Early stopping, best iteration is:
[306]	valid_0's rmse: 7425.65
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:34,530] Trial 11 finished with value: 7544.117084220341 and parameters: {'n_estimators': 750, 'reg_alpha': 0.351068105871921, 'reg_lambda': 6.4549695286662585, 'colsample_bytree': 0.5, 'subsample': 0.4, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 28, 'min_child_samples': 255}. Best is trial 1 with value: 7349.144280399479.


[100]	valid_0's rmse: 8300.6
[200]	valid_0's rmse: 7716.01
[300]	valid_0's rmse: 7564.2
[400]	valid_0's rmse: 7547.81
Early stopping, best iteration is:
[375]	valid_0's rmse: 7544.12


[I 2022-04-24 08:08:34,640] Trial 12 finished with value: 7579.542482287944 and parameters: {'n_estimators': 750, 'reg_alpha': 0.6297865282374893, 'reg_lambda': 0.0285110198263229, 'colsample_bytree': 0.4, 'subsample': 0.6, 'learning_rate': 0.02, 'max_depth': 5, 'num_leaves': 60, 'min_child_samples': 300}. Best is trial 1 with value: 7349.144280399479.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7579.56
[200]	valid_0's rmse: 7656.53
Early stopping, best iteration is:
[102]	valid_0's rmse: 7579.54
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8097.04


[I 2022-04-24 08:08:34,919] Trial 13 finished with value: 7338.604818694006 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.9484190840449388, 'reg_lambda': 0.06817182543933409, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 29, 'min_child_samples': 134}. Best is trial 13 with value: 7338.604818694006.


[200]	valid_0's rmse: 7493.08
[300]	valid_0's rmse: 7355.55
[400]	valid_0's rmse: 7343.52
Early stopping, best iteration is:
[342]	valid_0's rmse: 7338.6
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8128.8
[200]	valid_0's rmse: 7539.74
[300]	valid_0's rmse: 7404.91
[400]	valid_0's rmse: 7391.98


[I 2022-04-24 08:08:35,269] Trial 14 finished with value: 7386.914623643734 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.990900000147986, 'reg_lambda': 0.034036440167315024, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.006, 'max_depth': 10, 'num_leaves': 24, 'min_child_samples': 142}. Best is trial 13 with value: 7338.604818694006.


Early stopping, best iteration is:
[372]	valid_0's rmse: 7386.91
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8041.61
[200]	valid_0's rmse: 7454.28
[300]	valid_0's rmse: 7349.52
[400]	valid_0's rmse: 7350.6
Early stopping, best iteration is:
[342]	valid_0's rmse: 7339.02


[I 2022-04-24 08:08:35,467] Trial 15 finished with value: 7339.018883609527 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.053238479308453796, 'reg_lambda': 0.006556560987422812, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.006, 'max_depth': 5, 'num_leaves': 70, 'min_child_samples': 87}. Best is trial 13 with value: 7338.604818694006.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7948.57
[200]	valid_0's rmse: 7394.52
[300]	valid_0's rmse: 7323.13


[I 2022-04-24 08:08:35,762] Trial 16 finished with value: 7321.353446191256 and parameters: {'n_estimators': 1000, 'reg_alpha': 0.0728978583670885, 'reg_lambda': 0.0558453981353075, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.006, 'max_depth': 7, 'num_leaves': 68, 'min_child_samples': 44}. Best is trial 16 with value: 7321.353446191256.
[I 2022-04-24 08:08:35,918] Trial 17 finished with value: 7310.961795605074 and parameters: {'n_estimators': 1000, 'reg_alpha': 1.8478254645686907, 'reg_lambda': 0.05786341265444248, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 33, 'min_child_samples': 62}. Best is trial 17 with value: 7310.961795605074.


Early stopping, best iteration is:
[294]	valid_0's rmse: 7321.35
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7330.54
[200]	valid_0's rmse: 7412.73
Early stopping, best iteration is:
[112]	valid_0's rmse: 7310.96
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:36,056] Trial 18 finished with value: 7319.9534172962085 and parameters: {'n_estimators': 1000, 'reg_alpha': 2.48230715919321, 'reg_lambda': 1.3594090809855217, 'colsample_bytree': 0.6, 'subsample': 1.0, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 14, 'min_child_samples': 44}. Best is trial 17 with value: 7310.961795605074.
[I 2022-04-24 08:08:36,149] Trial 19 finished with value: 7414.884489074778 and parameters: {'n_estimators': 1000, 'reg_alpha': 2.2640746806571364, 'reg_lambda': 9.60942817044461, 'colsample_bytree': 0.6, 'subsample': 0.5, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 4, 'min_child_samples': 6}. Best is trial 17 with value: 7310.961795605074.


[100]	valid_0's rmse: 7374.01
[200]	valid_0's rmse: 7340.5
Early stopping, best iteration is:
[146]	valid_0's rmse: 7319.95
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7584.09
[200]	valid_0's rmse: 7422.02
[300]	valid_0's rmse: 7434.41
Early stopping, best iteration is:
[249]	valid_0's rmse: 7414.88
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:36,300] Trial 20 finished with value: 7319.821193745511 and parameters: {'n_estimators': 2000, 'reg_alpha': 9.103035417143731, 'reg_lambda': 2.0979853805367172, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 17, 'min_child_samples': 52}. Best is trial 17 with value: 7310.961795605074.
[I 2022-04-24 08:08:36,437] Trial 21 finished with value: 7321.514022217337 and parameters: {'n_estimators': 2000, 'reg_alpha': 9.521344048292763, 'reg_lambda': 2.470294084407735, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 16, 'min_child_samples': 59}. Best is trial 17 with value: 7310.961795605074.


[100]	valid_0's rmse: 7364.76
[200]	valid_0's rmse: 7404.45
Early stopping, best iteration is:
[141]	valid_0's rmse: 7319.82
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7366.61
[200]	valid_0's rmse: 7423.37
Early stopping, best iteration is:
[141]	valid_0's rmse: 7321.51


[I 2022-04-24 08:08:36,603] Trial 22 finished with value: 7319.291395159231 and parameters: {'n_estimators': 2000, 'reg_alpha': 2.431079538320238, 'reg_lambda': 1.4596730382414442, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 7, 'num_leaves': 38, 'min_child_samples': 51}. Best is trial 17 with value: 7310.961795605074.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7354.93
[200]	valid_0's rmse: 7410.4
Early stopping, best iteration is:
[141]	valid_0's rmse: 7319.29
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:36,845] Trial 23 finished with value: 7295.7965560751 and parameters: {'n_estimators': 2000, 'reg_alpha': 2.1932499270244104, 'reg_lambda': 3.1834449128431275, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 39, 'min_child_samples': 116}. Best is trial 23 with value: 7295.7965560751.


[100]	valid_0's rmse: 7398.29
[200]	valid_0's rmse: 7322.5
Early stopping, best iteration is:
[158]	valid_0's rmse: 7295.8
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:37,081] Trial 24 finished with value: 7301.216077220878 and parameters: {'n_estimators': 2000, 'reg_alpha': 2.321112350875795, 'reg_lambda': 0.790014226040559, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 39, 'min_child_samples': 119}. Best is trial 23 with value: 7295.7965560751.


[100]	valid_0's rmse: 7393.86
[200]	valid_0's rmse: 7331.7
Early stopping, best iteration is:
[152]	valid_0's rmse: 7301.22
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:37,324] Trial 25 finished with value: 7301.2971664362885 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.1765233543676026, 'reg_lambda': 0.6129915402823156, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 41, 'min_child_samples': 118}. Best is trial 23 with value: 7295.7965560751.


[100]	valid_0's rmse: 7392.47
[200]	valid_0's rmse: 7331.28
Early stopping, best iteration is:
[152]	valid_0's rmse: 7301.3
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7918.87
[200]	valid_0's rmse: 7440.13
[300]	valid_0's rmse: 7406.76
Early stopping, best iteration is:
[263]	valid_0's rmse: 7397.15


[I 2022-04-24 08:08:37,638] Trial 26 finished with value: 7397.150666623873 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.20002712549705773, 'reg_lambda': 0.6063287645235116, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.008, 'max_depth': 10, 'num_leaves': 38, 'min_child_samples': 167}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:37,842] Trial 27 finished with value: 7303.343025421724 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.13269647426972606, 'reg_lambda': 5.0155037221864145, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 41, 'min_child_samples': 121}. Best is trial 23 with value: 7295.7965560751.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7326.62
[200]	valid_0's rmse: 7371.32
Early stopping, best iteration is:
[126]	valid_0's rmse: 7303.34


[I 2022-04-24 08:08:38,065] Trial 28 finished with value: 7404.042679327236 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.38424294373605256, 'reg_lambda': 0.6691161967449184, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 56, 'min_child_samples': 167}. Best is trial 23 with value: 7295.7965560751.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7491.56
[200]	valid_0's rmse: 7432.68
Early stopping, best iteration is:
[149]	valid_0's rmse: 7404.04


[I 2022-04-24 08:08:38,275] Trial 29 finished with value: 7342.563785169407 and parameters: {'n_estimators': 2000, 'reg_alpha': 1.163528197331439, 'reg_lambda': 4.83313506555188, 'colsample_bytree': 0.9, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 66, 'min_child_samples': 117}. Best is trial 23 with value: 7295.7965560751.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7428.92
[200]	valid_0's rmse: 7375.4
Early stopping, best iteration is:
[146]	valid_0's rmse: 7342.56
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7552.09
[200]	valid_0's rmse: 7456.51
Early stopping, best iteration is:
[186]	valid_0's rmse: 7450.35


[I 2022-04-24 08:08:38,508] Trial 30 finished with value: 7450.34842550885 and parameters: {'n_estimators': 500, 'reg_alpha': 5.777266782779044, 'reg_lambda': 3.2063309429644673, 'colsample_bytree': 0.8, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 44, 'min_child_samples': 169}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:38,708] Trial 31 finished with value: 7307.337367118984 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.11094376382195714, 'reg_lambda': 0.8792533915542928, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 41, 'min_child_samples': 119}. Best is trial 23 with value: 7295.7965560751.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7327.56
[200]	valid_0's rmse: 7376.68
Early stopping, best iteration is:
[123]	valid_0's rmse: 7307.34
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:38,924] Trial 32 finished with value: 7314.509668577416 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.2536135241235588, 'reg_lambda': 4.937637036046647, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 52, 'min_child_samples': 127}. Best is trial 23 with value: 7295.7965560751.


[100]	valid_0's rmse: 7340.27
[200]	valid_0's rmse: 7381.92
Early stopping, best iteration is:
[126]	valid_0's rmse: 7314.51
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7322.33


[I 2022-04-24 08:08:39,130] Trial 33 finished with value: 7314.69320324692 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.6180660040139837, 'reg_lambda': 3.0760932371347804, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 35, 'min_child_samples': 101}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:39,292] Trial 34 finished with value: 7317.582008471749 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.015425934691496149, 'reg_lambda': 0.327214516218526, 'colsample_bytree': 0.7, 'subsample': 0.7, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 29, 'min_child_samples': 85}. Best is trial 23 with value: 7295.7965560751.


[200]	valid_0's rmse: 7405.69
Early stopping, best iteration is:
[122]	valid_0's rmse: 7314.69
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7337.23
[200]	valid_0's rmse: 7419.64
Early stopping, best iteration is:
[112]	valid_0's rmse: 7317.58


[I 2022-04-24 08:08:39,479] Trial 35 finished with value: 7302.8702404260475 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.031615809427483633, 'reg_lambda': 8.726726587215822, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 21, 'min_child_samples': 109}. Best is trial 23 with value: 7295.7965560751.


Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7411.83
[200]	valid_0's rmse: 7325.11
Early stopping, best iteration is:
[157]	valid_0's rmse: 7302.87
Training until validation scores don't improve for 100 rounds


[I 2022-04-24 08:08:39,587] Trial 36 finished with value: 7391.87625162968 and parameters: {'n_estimators': 100, 'reg_alpha': 0.031171297219441833, 'reg_lambda': 9.618007340346216, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 22, 'min_child_samples': 95}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:39,718] Trial 37 finished with value: 7420.565494903042 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.0068735566691764385, 'reg_lambda': 1.1804381224827543, 'colsample_bytree': 0.7, 'subsample': 0.7, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 10, 'min_child_samples': 158}. Best is trial 23 with value: 7295.7965560751.


[100]	valid_0's rmse: 7391.88
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7391.88
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7517.29
[200]	valid_0's rmse: 7430.34
Early stopping, best iteration is:
[178]	valid_0's rmse: 7420.57
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7602.06


[I 2022-04-24 08:08:39,810] Trial 38 finished with value: 7480.505267317099 and parameters: {'n_estimators': 250, 'reg_alpha': 0.018621930966212463, 'reg_lambda': 0.4490851999794524, 'colsample_bytree': 0.9, 'subsample': 0.5, 'learning_rate': 0.014, 'max_depth': 3, 'num_leaves': 21, 'min_child_samples': 186}. Best is trial 23 with value: 7295.7965560751.


[200]	valid_0's rmse: 7484.19
Did not meet early stopping. Best iteration is:
[184]	valid_0's rmse: 7480.51
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7365.01


[I 2022-04-24 08:08:40,036] Trial 39 finished with value: 7311.875771141511 and parameters: {'n_estimators': 2000, 'reg_alpha': 4.048000479331015, 'reg_lambda': 2.313826415651575, 'colsample_bytree': 1.0, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 48, 'min_child_samples': 76}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:40,146] Trial 40 finished with value: 7393.032781893001 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.04787821638083725, 'reg_lambda': 0.14351346682566057, 'colsample_bytree': 0.4, 'subsample': 0.6, 'learning_rate': 0.008, 'max_depth': 2, 'num_leaves': 94, 'min_child_samples': 107}. Best is trial 23 with value: 7295.7965560751.


[200]	valid_0's rmse: 7365.02
Early stopping, best iteration is:
[130]	valid_0's rmse: 7311.88
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 8023.33
[200]	valid_0's rmse: 7495.51
[300]	valid_0's rmse: 7395.22
[400]	valid_0's rmse: 7413.3
Early stopping, best iteration is:
[305]	valid_0's rmse: 7393.03
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7326.85


[I 2022-04-24 08:08:40,672] Trial 41 finished with value: 7303.159572873923 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.09542911843987667, 'reg_lambda': 5.382409872856089, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.017, 'max_depth': 10, 'num_leaves': 42, 'min_child_samples': 116}. Best is trial 23 with value: 7295.7965560751.


[200]	valid_0's rmse: 7366.14
Early stopping, best iteration is:
[126]	valid_0's rmse: 7303.16
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7654.41
[200]	valid_0's rmse: 7306.89


[I 2022-04-24 08:08:41,223] Trial 42 finished with value: 7299.038319475483 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.09324489948030219, 'reg_lambda': 7.743683471157935, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 52, 'min_child_samples': 109}. Best is trial 23 with value: 7295.7965560751.


[300]	valid_0's rmse: 7351.95
Early stopping, best iteration is:
[221]	valid_0's rmse: 7299.04
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7674.1


[I 2022-04-24 08:08:41,673] Trial 43 finished with value: 7336.38061371937 and parameters: {'n_estimators': 500, 'reg_alpha': 0.01201633379149778, 'reg_lambda': 7.807686612683449, 'colsample_bytree': 0.3, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 54, 'min_child_samples': 137}. Best is trial 23 with value: 7295.7965560751.


[200]	valid_0's rmse: 7339.87
[300]	valid_0's rmse: 7374.2
Early stopping, best iteration is:
[216]	valid_0's rmse: 7336.38
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7630.08
[200]	valid_0's rmse: 7305.52
[300]	valid_0's rmse: 7360.41


[I 2022-04-24 08:08:42,101] Trial 44 finished with value: 7301.861771880741 and parameters: {'n_estimators': 2000, 'reg_alpha': 0.001154114203031602, 'reg_lambda': 3.6379934249995283, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 50, 'min_child_samples': 103}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:42,241] Trial 45 finished with value: 7339.243201846077 and parameters: {'n_estimators': 1250, 'reg_alpha': 0.001144660867818594, 'reg_lambda': 1.009002548539752, 'colsample_bytree': 0.8, 'subsample': 0.4, 'learning_rate': 0.01, 'max_depth': 3, 'num_leaves': 60, 'min_child_samples': 73}. Best is trial 23 with value: 7295.7965560751.


Early stopping, best iteration is:
[221]	valid_0's rmse: 7301.86
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7609.82
[200]	valid_0's rmse: 7342.37
[300]	valid_0's rmse: 7372.52
Early stopping, best iteration is:
[218]	valid_0's rmse: 7339.24
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7681.6
[200]	valid_0's rmse: 7403.7


[I 2022-04-24 08:08:42,593] Trial 46 finished with value: 7393.837484248751 and parameters: {'n_estimators': 250, 'reg_alpha': 0.004956210962547387, 'reg_lambda': 3.539695177530633, 'colsample_bytree': 0.5, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 75, 'min_child_samples': 149}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:42,681] Trial 47 finished with value: 7777.019276656346 and parameters: {'n_estimators': 100, 'reg_alpha': 0.002025152148772397, 'reg_lambda': 1.9254707128909907, 'colsample_bytree': 0.4, 'subsample': 0.7, 'learning_rate': 0.01, 'max_depth': 2, 'num_leaves': 100, 'min_child_samples': 93}. Best is trial 23 with value: 7295.7965560751.


Did not meet early stopping. Best iteration is:
[228]	valid_0's rmse: 7393.84
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7777.02
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 7777.02
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7744.85
[200]	valid_0's rmse: 7477.71


[I 2022-04-24 08:08:43,178] Trial 48 finished with value: 7475.498985933856 and parameters: {'n_estimators': 750, 'reg_alpha': 1.49056651431573, 'reg_lambda': 0.44324882420774714, 'colsample_bytree': 1.0, 'subsample': 0.6, 'learning_rate': 0.01, 'max_depth': 10, 'num_leaves': 49, 'min_child_samples': 192}. Best is trial 23 with value: 7295.7965560751.
[I 2022-04-24 08:08:43,342] Trial 49 finished with value: 7327.090551163841 and parameters: {'n_estimators': 2000, 'reg_alpha': 4.288200799443621, 'reg_lambda': 0.20097179526944384, 'colsample_bytree': 0.3, 'subsample': 0.5, 'learning_rate': 0.02, 'max_depth': 5, 'num_leaves': 34, 'min_child_samples': 129}. Best is trial 23 with value: 7295.7965560751.


[300]	valid_0's rmse: 7493.71
Early stopping, best iteration is:
[216]	valid_0's rmse: 7475.5
Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 7330.3
[200]	valid_0's rmse: 7446.07
Early stopping, best iteration is:
[103]	valid_0's rmse: 7327.09
Number of finished trials: 50
Best trial: {'n_estimators': 2000, 'reg_alpha': 2.1932499270244104, 'reg_lambda': 3.1834449128431275, 'colsample_bytree': 0.4, 'subsample': 0.8, 'learning_rate': 0.014, 'max_depth': 10, 'num_leaves': 39, 'min_child_samples': 116}
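
The training logs above follow an early-stopping rule: training halts once the validation rmse has not improved for 100 consecutive rounds, and the best round seen so far is reported. The snippet below is a minimal pure-Python sketch of that rule, not LightGBM's actual implementation; the function name `best_iteration_with_patience` and the toy score list are illustrative assumptions.

```python
def best_iteration_with_patience(scores, patience):
    """Return (best_iteration, stopped_early) for a sequence of validation scores.

    Iterations are 1-indexed and lower scores are better (as with rmse).
    """
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop and keep the best round.
            return best_iter, True
    # Every round was used without triggering the stop.
    return best_iter, False


# Toy validation curve: improves until round 3, then degrades.
toy_scores = [10.0, 9.0, 8.0, 8.5, 8.7, 9.1, 9.4]
print(best_iteration_with_patience(toy_scores, patience=3))      # (3, True)
print(best_iteration_with_patience(toy_scores[:4], patience=3))  # (3, False)
```

The second call mirrors the "Did not meet early stopping" lines in the log, where the round budget (`n_estimators`) ran out before the patience was exhausted.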


Let's create an LGBMRegressor with the best hyperparameters found in the previous study.

In [32]:
# start from the best hyperparameters of the study and add the fixed settings
params = study.best_params
params['random_state'] = 48
params['metric'] = 'rmse'
params

{'n_estimators': 2000,
 'reg_alpha': 2.1932499270244104,
 'reg_lambda': 3.1834449128431275,
 'colsample_bytree': 0.4,
 'subsample': 0.8,
 'learning_rate': 0.014,
 'max_depth': 10,
 'num_leaves': 39,
 'min_child_samples': 116,
 'random_state': 48,
 'metric': 'rmse'}

In [34]:
# refit on the training set with the tuned hyperparameters
model = LGBMRegressor(**params)
model.fit(X_train, y_train)


In [35]:
r2 = model.score(X_val, y_val)
print(f'Light gbm r2 after hyper parameters tuning: {r2}')

Light gbm r2 after hyper parameters tuning: 0.34828934506153364


Fine-tuning the hyperparameters improved the model's validation r2 from 0.28 to 0.35, a relative gain of about 25%. We save the model so that it can be used by the API and move on to the test part.
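
The saving step itself is not shown in this notebook. One possible way to persist and reload a fitted estimator is the standard library's `pickle` module, sketched below. The helper names, the temporary directory and the stand-in object are assumptions for illustration only; this is not the project's `src.model` code.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_sketch(model, name, directory):
    """Pickle `model` to <directory>/<name>.pkl and return the path (hypothetical helper)."""
    path = Path(directory) / f"{name}.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return path


def load_model_sketch(name, directory):
    """Load the object previously saved under `name` (hypothetical helper)."""
    with (Path(directory) / f"{name}.pkl").open("rb") as f:
        return pickle.load(f)


# Stand-in for a fitted estimator; any picklable object is handled the same way.
stand_in_model = {"learning_rate": 0.014, "max_depth": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_model_sketch(stand_in_model, "lgb", tmp_dir)
    restored = load_model_sketch("lgb", tmp_dir)

print(restored == stand_in_model)
```

In the notebook, the fitted `model` would take the place of `stand_in_model`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.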

### Tests

In [36]:
from src.model import load_model
from src.train_model import eval_model

In [37]:
model = load_model('lgb')

In [39]:
result = eval_model(model)
result

{'r2': 0.33630406990008166, 'rmse': 11606.626269252638, 'rmse_percent': 188.3}

The test r2 (0.34) is close to the validation r2 (0.35), which is positive. On the other hand, the rmse is still too large, at 188% of the average value of the target variable.
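
For reference, the three reported metrics can be computed as sketched below. This is a pure-Python illustration on made-up numbers; it assumes that `rmse_percent` is the rmse divided by the mean of the target, as described above, and it is not the project's `eval_model` implementation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return r2, rmse and rmse as a percentage of the mean target value."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": rmse,
        "rmse_percent": round(rmse / mean_true * 100, 1),
    }


# Made-up targets and predictions, for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [150.0, 150.0, 350.0, 350.0]
print(regression_metrics(y_true, y_pred))
```

In this toy case every prediction is off by 50 while the mean target is 250, so `rmse_percent` is 20.0. Under the same assumption, the reported test rmse of about 11,607 at 188.3% would imply a mean target of roughly 6,160.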