<h1> Fine Tuning </h1>

<h2> Gradient Boosting Regressor (XGBRegressor) </h2>

<h3> Fix learning rate and number of estimators </h3>

We will start tuning from a set of commonly used parameter values. Additionally, let's fix learning_rate at 0.1 and search for a good n_estimators value.

In [44]:
import numpy as np
from scipy.stats import uniform, truncnorm, randint
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, cross_val_score


In [47]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed; 0.1 is a good starting value, typically something in 0.05-0.3
                  'n_estimators': randint(20,80), # 40-70
                  'max_depth': [5], # 5-8
                  'min_child_weight': [1], # dataset is very small, so start with 1
                  'gamma': [0], # 0 is a good starting value; 0.1-0.2 is also common
                  'subsample': [0.8], # starting value should be in 0.5-0.9
                  'colsample_bytree': [0.8], # starting value should be in 0.5-0.9
                 
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=61, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

In [48]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 31, 'subsample': 0.8}
0.44889668523513393
1.6578379026849261


Let's set n_estimators to 31.
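
A note on how the cross-validated figure is computed: `np.sqrt(-scores).mean()` is the mean of the per-fold RMSE values, which is generally not the same number as the square root of the pooled mean MSE. A minimal sketch with made-up fold scores (hypothetical values, not taken from this dataset) shows the difference:

```python
import numpy as np

# Hypothetical neg-MSE scores for 4 folds, in the form cross_val_score returns
# with scoring="neg_mean_squared_error" (made-up values for illustration).
scores = np.array([-1.0, -4.0, -9.0, -16.0])

# What the cell above computes: RMSE per fold, then averaged over folds.
rmse_mean_of_folds = np.sqrt(-scores).mean()

# Alternative: average the MSE over folds first, then take the square root.
rmse_of_mean_mse = np.sqrt((-scores).mean())

print(rmse_mean_of_folds)
print(rmse_of_mean_mse)
```

The first value is never larger than the second, so the two conventions should not be mixed when comparing stages.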

<h4> Tune max_depth and min_child_weight </h4>

In [51]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': [31], # tuned
                  'max_depth': randint(1, 5), # tuning
                  'min_child_weight': randint(1, 6), # tuning
                  'gamma': [0], # good value to start is 0, also in 0.1 to 0.2
                  'subsample': [0.8], # value to start should be in 0.5-0.9
                  'colsample_bytree': [0.8], # value to start should be in 0.5-0.9
                 
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=48, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [52]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 31, 'subsample': 0.8}
1.005481113685155
1.7072821790650718


As we can see, the search selects max_depth = 2 and min_child_weight = 5 as the best pair for these parameters.
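
One caveat about how candidates are selected: when `scoring` is not given, RandomizedSearchCV ranks candidates with the estimator's own `score` method (R² for regressors) and uses 5-fold CV by default, while the RMSE reported here comes from a separate 4-fold run. The sketch below passes both explicitly so that selection and reporting use the same metric. It is an illustration under assumptions: it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, and `min_samples_leaf` is only a loose analogue of `min_child_weight`.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); the notebook's training set is not used here.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

param_distributions = {
    "max_depth": randint(1, 5),         # integers 1..4
    "min_samples_leaf": randint(1, 6),  # integers 1..5
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, random_state=0),
    param_distributions,
    n_iter=8,
    scoring="neg_root_mean_squared_error",  # rank candidates by RMSE instead of R^2
    cv=4,                                   # same fold count as the reported RMSE
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_score_ is a negated RMSE; flip the sign to get a positive error.
best_rmse_cv = -search.best_score_
print(search.best_params_)
print(best_rmse_cv)
```

With this setup `best_rmse_cv` is the cross-validated RMSE of the selected candidate itself, so no second `cross_val_score` call on the search object is needed to obtain it.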

<h4> Tune gamma </h4>

In [55]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': [31], # tuned
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': uniform(loc=0.4, scale=0.3), # tuning
                  'subsample': [0.8], # value to start should be in 0.5-0.9
                  'colsample_bytree': [0.8], # value to start should be in 0.5-0.9
                 
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=48, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [56]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.8, 'gamma': 0.5123620356542088, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 31, 'subsample': 0.8}
1.005481113685155
1.716340428112405
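
For reference, scipy's `uniform(loc, scale)` is parameterised by a start and a width, not by two endpoints, so `uniform(loc=0.4, scale=0.3)` covers the interval from 0.4 to 0.7. A short sketch to check the range that was actually searched:

```python
from scipy.stats import uniform

gamma_dist = uniform(loc=0.4, scale=0.3)  # same distribution as in the search above

# The support runs from loc to loc + scale.
lower = gamma_dist.ppf(0.0)
upper = gamma_dist.ppf(1.0)
print(lower, upper)

# Draw samples and confirm they stay inside that interval.
samples = gamma_dist.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())
```

This means the starting value 0 and the 0.1 to 0.2 range mentioned in the comments were not part of this search.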


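The training RMSE is identical to the previous stage (1.005481113685155), which suggests that gamma in this range does not change the fitted trees. Let's fix gamma at 0.51 anyway and tune n_estimators again.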

In [57]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': randint(20,80), # tuning
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': [0.51], # tuned
                  'subsample': [0.8], # value to start should be in 0.5-0.9
                  'colsample_bytree': [0.8], # value to start should be in 0.5-0.9
                 
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=48, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [58]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.8, 'gamma': 0.51, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 27, 'subsample': 0.8}
1.0708516382646263
1.7259549117908763


Now, we set n_estimators to 27.
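
A detail worth keeping in mind for the n_estimators search: scipy's `randint(low, high)` draws integers from the half-open interval [low, high), so `randint(20, 80)` can return 20 up to 79 but never 80. The sketch below checks this empirically:

```python
from scipy.stats import randint

n_estimators_dist = randint(20, 80)  # same distribution as in the search above

samples = n_estimators_dist.rvs(size=5000, random_state=42)

print(samples.min(), samples.max())  # stays within 20..79
print(len(set(samples.tolist())))    # at most 60 distinct values
```

Because at least one parameter is given as a distribution, RandomizedSearchCV samples candidates with replacement, so the 48 iterations may evaluate the same n_estimators value more than once.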

<h4> Tune subsample and colsample_bytree </h4>

In [62]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': [27], # tuned x2
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': [0.51], # tuned
                  'subsample': uniform(loc=0.6, scale=0.3), # tuning
                  'colsample_bytree': uniform(loc=0.6, scale=0.3), # tuning
                 
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=80, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [63]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.6712912631977199, 'gamma': 0.51, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 27, 'subsample': 0.8184649045835578}
1.0815125594541606
1.7131871604597633


The best pair found is subsample = 0.82 and colsample_bytree = 0.67.

<h4> Tuning Regularization Parameters </h4>

In [65]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': [27], # tuned x2
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': [0.51], # tuned
                  'subsample': [0.82], # tuned
                  'colsample_bytree': [0.67], # tuned
                  'reg_alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 0, 1, 10 ,100, 1000], # tuning
                  'reg_lambda': [0.00001, 0.0001, 0.001, 0.01, 0.1, 0, 1, 10 ,100, 1000], # tuning
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=81, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [66]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'subsample': 0.82, 'reg_lambda': 1, 'reg_alpha': 0, 'n_estimators': 27, 'min_child_weight': 5, 'max_depth': 2, 'learning_rate': 0.1, 'gamma': 0.51, 'colsample_bytree': 0.67}
1.0820393098139645
1.6786847206496238


Let's narrow down the search to ranges around reg_alpha = 0 and reg_lambda = 1.

In [71]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.1], # fixed
                  'n_estimators': [27], # tuned x2
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': [0.51], # tuned
                  'subsample': [0.82], # tuned
                  'colsample_bytree': [0.67], # tuned
                  'reg_alpha': uniform(loc=0, scale=0.0001), # tuning x2
                  'reg_lambda': uniform(loc=0.01, scale=0.99), # tuning x2
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=81, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

In [72]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.67, 'gamma': 0.51, 'learning_rate': 0.1, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 27, 'reg_alpha': 7.455064367977083e-06, 'reg_lambda': 0.9870180672345121, 'subsample': 0.82}
1.0817639448336969
1.6541790592172911


The best values found are reg_alpha = 0.000007 and reg_lambda = 0.99.
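
As a possible alternative to a hand-written grid of powers of ten followed by a narrowed uniform range, regularization strengths can be sampled on a log scale. This is a sketch of that alternative, not the procedure used above; the range 1e-5 to 1e3 is an assumption chosen to match the non-zero values of the earlier grid.

```python
from scipy.stats import loguniform

# Assumed range, matching the non-zero grid values 1e-5 .. 1e3 used earlier.
reg_dist = loguniform(1e-5, 1e3)

samples = reg_dist.rvs(size=1000, random_state=42)
print(samples.min(), samples.max())

# On a log scale each decade is equally likely, so roughly half of the
# draws fall below the geometric midpoint of the range, which is 0.1.
share_below_midpoint = (samples < 0.1).mean()
print(share_below_midpoint)
```

Such a distribution could be passed as `'reg_alpha'` or `'reg_lambda'` in the `distributions` dictionary. Note that it cannot produce an exact 0, which the grid above does include.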

<h4> Reducing Learning Rate </h4>

In [164]:
X, y = train_set_new_ready, train_set_labels

distributions = { 'learning_rate':[0.01], # fixed x2
                  'n_estimators': randint(340,370), # tuning
                  'max_depth': [2], # tuned
                  'min_child_weight': [5], # tuned
                  'gamma': [0.51], # tuned
                  'subsample': [0.82], # tuned
                  'colsample_bytree': [0.67], # tuned
                  'reg_alpha': [0.000007], # tuned x2
                  'reg_lambda': [0.99], # tuned x2
                }

model = XGBRegressor()
reg = RandomizedSearchCV(model, distributions, n_iter=31, random_state=42)

reg.fit(X, y)

predictions = reg.predict(X)
train_mse = mean_squared_error(predictions, y)
rmse_training = np.sqrt(train_mse)

scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
                              scoring="neg_mean_squared_error", cv=4)

rmse_cv = np.sqrt(-scores).mean()

Audio(sound_file, autoplay=True)

KeyboardInterrupt: 

In [93]:
print(reg.best_params_)
print(rmse_training)
print(rmse_cv)

{'colsample_bytree': 0.67, 'gamma': 0.51, 'learning_rate': 0.01, 'max_depth': 2, 'min_child_weight': 5, 'n_estimators': 351, 'reg_alpha': 7e-06, 'reg_lambda': 0.99, 'subsample': 0.82}
0.9437116683335078
1.6742276803166876


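That was the last stage, and it did not help: with learning_rate at 0.01 the training RMSE dropped to 0.944, but the cross-validated RMSE rose from 1.654 to 1.674.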
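
To compare the stages side by side, the numbers printed above can be collected in one place. The sketch below only copies the reported training and cross-validated RMSE values; the stage labels are informal names introduced here for readability.

```python
# (stage label, training RMSE, cross-validated RMSE), copied from the outputs above.
stages = [
    ("n_estimators (first pass)",      0.44889668523513393, 1.6578379026849261),
    ("max_depth, min_child_weight",    1.005481113685155,   1.7072821790650718),
    ("gamma",                          1.005481113685155,   1.716340428112405),
    ("n_estimators (second pass)",     1.0708516382646263,  1.7259549117908763),
    ("subsample, colsample_bytree",    1.0815125594541606,  1.7131871604597633),
    ("reg_alpha, reg_lambda (grid)",   1.0820393098139645,  1.6786847206496238),
    ("reg_alpha, reg_lambda (narrow)", 1.0817639448336969,  1.6541790592172911),
    ("learning_rate 0.01",             0.9437116683335078,  1.6742276803166876),
]

# Print each stage with the gap between CV and training error.
for name, rmse_train, rmse_cv in stages:
    gap = rmse_cv - rmse_train
    print(f"{name:32s} train={rmse_train:.3f} cv={rmse_cv:.3f} gap={gap:.3f}")

# The stage with the lowest cross-validated RMSE.
best_stage = min(stages, key=lambda s: s[2])
print("lowest CV RMSE:", best_stage[0])
```

The cross-validated RMSE stays between roughly 1.65 and 1.73 across all stages, so the differences between them are small compared with the gap to the training error.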

<h4> Measure model accuracy </h4>

In [170]:
# X, y = train_set_new_ready, train_set_labels

# X_test = train_set.drop("age", axis=1)
# y_test = train_set["age"].copy()

# print(y_test.isnull().values.any())

# num_pipeline = Pipeline([
#     ('imputer', SimpleImputer(strategy='median')),
#     ('attribs_adder', CombinedAttributesAdder())
#     ])

# X_test_ready = pipeline.fit_transform(X_test)

# y_test = y_test.to_numpy()
# imputer = SimpleImputer(strategy='median')
# imputer.fit(y_test.reshape(-1, 1))
# y_test = imputer.transform(y_test.reshape(-1, 1))



# distributions = { 'learning_rate':[0.1], # fixed
#                   'n_estimators': [27], # tuning
#                   'max_depth': [2], # tuned
#                   'min_child_weight': [5], # tuned
#                   'gamma': [0.51], # tuned
#                   'subsample': [0.82], # tuned
#                   'colsample_bytree': [0.67], # tuned
#                   'reg_alpha': [0.000007], # tuning x2
#                   'reg_lambda': [0.99], # tuning x2
#                 }

# model = XGBRegressor()
# reg = RandomizedSearchCV(model, distributions, n_iter=1, random_state=42)

# reg.fit(X, y)

# print(X.shape)

# predictions = reg.predict(X_test_ready)
# train_mse = mean_squared_error(predictions, y_test)
# rmse_training = np.sqrt(train_mse)

# scores = cross_val_score(reg, train_set_new_ready, np.ravel(train_set_labels),
#                               scoring="neg_mean_squared_error", cv=4)

# rmse_cv = np.sqrt(-scores).mean()

# Audio(sound_file, autoplay=True)

In [171]:
# from sklearn.metrics import accuracy_score
# print(reg.best_params_)
# print(rmse_training)
# print(rmse_cv)