This notebook uses Optuna to compare 12 regressors and then tunes a few of them (mainly CatBoost, BayesianRidge and LGBM).

As before (see optuna.ipynb), the input varies by vectorization method (count, TF-IDF, doc vectors, scaled doc vectors). Surprisingly, scaled doc vectors perform worse than unscaled doc vectors.

The first experiments use a single output, the male share of applicants. The unknown gender category is removed in this setting.<br>
The best result was obtained with GradientBoosting with default hyperparameters and doc vectors, although it varies somewhat from run to run: RMSE: 0.117, MAE: 0.091, R2: 0.766.<br>
The best robust result (CatBoost) was RMSE: 0.12, MAE: 0.095, R2: 0.751.

A second test uses multiple outputs (male share and female share; the unknown gender category is left in the data but not predicted). It gives promising results with CatBoost and doc vectors: RMSE: 0.117, MAE: 0.091, R2: 0.756.

In [16]:
import os
os.chdir(r"c:\Users\britt\Desktop\YH\Applicerad AI\job_discrimination_sandbox")
import re
import warnings

import gensim
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import optuna
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression, ElasticNet, SGDRegressor, BayesianRidge
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_recall_fscore_support, classification_report, accuracy_score, recall_score, precision_score, f1_score, mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, cross_val_predict
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from unidecode import unidecode
from xgboost.sklearn import XGBRegressor

In [17]:
#warnings.simplefilter("ignore")

In [21]:
df = pd.read_csv("data/cleaned_data/bulletins_w_labels_and_content.csv", dtype={'ID': object})  
df = df[["ID", "Job Description", "Apps Received", "Female", "Male", "Unknown_Gender", "Cleaned text"]]
df

Unnamed: 0,ID,Job Description,Apps Received,Female,Male,Unknown_Gender,Cleaned text
0,9206,311 DIRECTOR,54,20,31,3,director class code open date annual salary du...
1,1223,ACCOUNTING CLERK,648,488,152,8,accounting clerk class code open date exam ope...
2,7260,AIRPORT MANAGER,51,13,37,1,airport manager class code open date exam open...
3,3227,AIRPORT POLICE LIEUTENANT,48,9,38,1,airport police lieutenant class code open date...
4,2400,AQUARIST,40,15,24,1,aquarist class code open date annual salary ca...
...,...,...,...,...,...,...,...
172,7840,WASTEWATER TREATMENT LABORATORY MANAGER,16,6,9,1,wastewater treatment laboratory manager class ...
173,4123,WASTEWATER TREATMENT OPERATOR,125,9,113,3,wastewater treatment operator class code open ...
174,7857,WATER MICROBIOLOGIST,179,89,82,8,water microbiologist class code open date revi...
175,3912,WATER UTILITY WORKER,96,2,92,2,water utility worker class code open date exam...


In [22]:
df["Apps Received (unknown gender removed)"] = df["Male"] + df["Female"]
df["Male share"] = round(df["Male"] / df["Apps Received (unknown gender removed)"], 3)
df["Female share"] = round(df["Female"] / df["Apps Received (unknown gender removed)"], 3)
df["Male share (unknown gender included)"] =  round(df["Male"] / df["Apps Received"], 3)
df["Female share (unknown gender included)"] =  round(df["Female"] / df["Apps Received"], 3)
df

Unnamed: 0,ID,Job Description,Apps Received,Female,Male,Unknown_Gender,Cleaned text,Apps Received (unknown gender removed),Male share,Female share,Male share (unknown gender included),Female share (unknown gender included)
0,9206,311 DIRECTOR,54,20,31,3,director class code open date annual salary du...,51,0.608,0.392,0.574,0.370
1,1223,ACCOUNTING CLERK,648,488,152,8,accounting clerk class code open date exam ope...,640,0.238,0.762,0.235,0.753
2,7260,AIRPORT MANAGER,51,13,37,1,airport manager class code open date exam open...,50,0.740,0.260,0.725,0.255
3,3227,AIRPORT POLICE LIEUTENANT,48,9,38,1,airport police lieutenant class code open date...,47,0.809,0.191,0.792,0.188
4,2400,AQUARIST,40,15,24,1,aquarist class code open date annual salary ca...,39,0.615,0.385,0.600,0.375
...,...,...,...,...,...,...,...,...,...,...,...,...
172,7840,WASTEWATER TREATMENT LABORATORY MANAGER,16,6,9,1,wastewater treatment laboratory manager class ...,15,0.600,0.400,0.562,0.375
173,4123,WASTEWATER TREATMENT OPERATOR,125,9,113,3,wastewater treatment operator class code open ...,122,0.926,0.074,0.904,0.072
174,7857,WATER MICROBIOLOGIST,179,89,82,8,water microbiologist class code open date revi...,171,0.480,0.520,0.458,0.497
175,3912,WATER UTILITY WORKER,96,2,92,2,water utility worker class code open date exam...,94,0.979,0.021,0.958,0.021
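
The share columns above can be checked by hand for a single row. The sketch below recomputes them in plain Python for the first row of the table (311 DIRECTOR). The helper name `gender_shares` and its keys are made up for this illustration, and it assumes that `Apps Received` equals male + female + unknown, as it does for the rows shown.

```python
def gender_shares(male, female, unknown):
    """Recompute the four share columns for one bulletin (hypothetical helper)."""
    known = male + female      # "Apps Received (unknown gender removed)"
    total = known + unknown    # assumed to equal "Apps Received"
    return {
        "male_share": round(male / known, 3),
        "female_share": round(female / known, 3),
        "male_share_incl_unknown": round(male / total, 3),
        "female_share_incl_unknown": round(female / total, 3),
    }


# First row of the table: 54 applications, 20 female, 31 male, 3 unknown
shares = gender_shares(male=31, female=20, unknown=3)
print(shares)
```

With the unknown category removed, the two shares sum to 1 (up to rounding), which is why a single output is enough in that setting. A bulletin with no male and no female applicants would raise a `ZeroDivisionError` here, so such rows would need separate handling.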


In [23]:
# Save distributional data to csv
# df.to_csv("data/cleaned_data/bulletins_labels_share_content.csv")

In [23]:
# Shuffle samples
df = df.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0,ID,Job Description,Apps Received,Female,Male,Unknown_Gender,Cleaned text,Apps Received (unknown gender removed),Male share,Female share,Male share (unknown gender included),Female share (unknown gender included)
0,3190,BUILDING MAINTENANCE DISTRICT SUPERVISOR,47,1,45,1,build maintenance district supervisor class co...,46,0.978,0.022,0.957,0.021
1,3860,ELEVATOR MECHANIC HELPER,203,2,195,6,elevator mechanic helper class code open date ...,197,0.990,0.010,0.961,0.010
2,3987,WATERWORKS MECHANIC SUPERVISOR,30,1,29,0,waterworks mechanic supervisor class code open...,30,0.967,0.033,0.967,0.033
3,2434,RECREATION FACILITY DIRECTOR,443,206,230,7,recreation facility director class code open d...,436,0.528,0.472,0.519,0.465
4,1775,WORKERS COMPENSATION CLAIMS ASSISTANT,116,95,19,2,worker compensation claim assistant class code...,114,0.167,0.833,0.164,0.819
...,...,...,...,...,...,...,...,...,...,...,...,...
172,1861,UTILITY BUYER,126,64,58,4,utility buyer class code open date exam open c...,122,0.475,0.525,0.460,0.508
173,3586,TRUCK AND EQUIPMENT DISPATCHER,75,0,75,0,truck equipment dispatcher class code open dat...,75,1.000,0.000,1.000,0.000
174,1769,SENIOR WORKERS COMPENSATION ANALYST,44,26,18,0,senior worker compensation analyst class code ...,44,0.409,0.591,0.409,0.591
175,1336,UTILITY EXECUTIVE SECRETARY,430,395,31,4,utility executive secretary class code open da...,426,0.073,0.927,0.072,0.919


In [162]:
X = df["Cleaned text"]
# Single output, female share can be derived from male share (1 - male share)
y = df["Male share"]

In [163]:
vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(X).toarray()

In [164]:
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X).toarray()

In [165]:
X_count_train, X_count_test, y_count_train, y_count_test = train_test_split(X_count, y, random_state=1001)

In [166]:
X_tfidf_train, X_tfidf_test, y_tfidf_train, y_tfidf_test = train_test_split(X_tfidf, y, random_state=1000)
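
Note that the two splits above use different `random_state` values (1001 and 1000), so the count and TF-IDF models are scored on different test rows and their metrics are not strictly comparable. One option, which is not what this notebook does, is to draw a single set of row indices and reuse it for every representation. A minimal sketch of that idea follows; the function name `shared_split_indices` and the default seed are assumptions made for this example.

```python
import random


def shared_split_indices(n_samples, test_fraction=0.25, seed=1000):
    """Return (train_idx, test_idx) that can be reused for every feature matrix."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# The DataFrame above has 176 rows (index 0..175)
train_idx, test_idx = shared_split_indices(n_samples=176)

# With NumPy arrays, the same rows could then be selected from each matrix:
#   X_count_train, X_count_test = X_count[train_idx], X_count[test_idx]
#   X_tfidf_train, X_tfidf_test = X_tfidf[train_idx], X_tfidf[test_idx]
print(len(train_idx), len(test_idx))
```

scikit-learn's `train_test_split` also accepts several arrays in one call, for example `train_test_split(X_count, X_tfidf, y, random_state=1000)`, which splits all of them on the same rows.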

In [27]:
lin_reg = LinearRegression()

In [175]:
lin_reg.fit(X_count_train, y_count_train)

In [29]:
score = lin_reg.score(X_count_test, y_count_test)
score

In [31]:
scores = cross_val_score(lin_reg, X_count, y, cv=10)

In [32]:
scores

array([ 0.46149942,  0.40056629,  0.36142392,  0.41468927,  0.37584931,
        0.72916511,  0.40369832,  0.46350851, -0.46917048,  0.61741993])
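
For a regressor, `cross_val_score` with its default scoring returns R2 per fold, so the array above is a list of ten R2 values. The short sketch below summarizes them with the standard library; the numbers are copied from the output above, and the variable names are only illustrative.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above
fold_scores = [0.46149942, 0.40056629, 0.36142392, 0.41468927, 0.37584931,
               0.72916511, 0.40369832, 0.46350851, -0.46917048, 0.61741993]

mean_r2 = mean(fold_scores)
spread = stdev(fold_scores)
# Index of the fold with the lowest score
worst_fold = min(range(len(fold_scores)), key=fold_scores.__getitem__)

print(f"mean R2 = {mean_r2:.3f}, std = {spread:.3f}, worst fold = {worst_fold}")
```

The negative score in one fold means the model did worse there than predicting the mean, which is a reminder of how much a single small fold can move the average with only 176 samples.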

In [176]:
def objective(trial):

    classifier_name = trial.suggest_categorical("classifier", ["LinReg", "RandomForest", "DecTree", "SVR", "ElasticNet", "SGD", "BayesRidge", "CatBoost", "KernelRidge", "XGBoost", "LGBM", "GradientBoost"])
    # classifier_name = trial.suggest_categorical("classifier", ["RandomForest", "CatBoost", "LGBM"])
    
    # Step 2. Setup values for the hyperparameters:
    if classifier_name == 'LinReg':
        classifier_obj = LinearRegression()
    elif classifier_name == "RandomForest":
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)
    elif classifier_name == "DecTree":
        classifier_obj = DecisionTreeRegressor()
    elif classifier_name == "SVR":
        #svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        classifier_obj = SVR(gamma='auto')
    elif classifier_name == "GradientBoost":
    #     gb_n_estimators = trial.suggest_int("gb_n_estimators", 10, 1000)
    #     gb_lr = trial.suggest_float("gb_lr", 1e-10, 1e10, log=True)
        classifier_obj = GradientBoostingRegressor()
    elif classifier_name == "ElasticNet":
        classifier_obj = ElasticNet()
    elif classifier_name == "SGD":
        classifier_obj = SGDRegressor()
    elif classifier_name == "BayesRidge":
        classifier_obj = BayesianRidge()
    elif classifier_name == "CatBoost":
        #cb_lr = trial.suggest_float("cb_lr", 1e-10, 1e10, log=True)
        classifier_obj = CatBoostRegressor()
    elif classifier_name == "KernelRidge":
        classifier_obj = KernelRidge()
    elif classifier_name == "XGBoost":
        classifier_obj = XGBRegressor()
    elif classifier_name == "LGBM":
        lgbm_n_estimators = trial.suggest_int("lgbm_n_estimators", 10, 1000)
        lgbm_max_depth = trial.suggest_int("lgbm_max_depth", 2, 32, log=True)
        classifier_obj = LGBMRegressor(n_estimators=lgbm_n_estimators, max_depth=lgbm_max_depth)

    # Step 3: Scoring method:
    # score = cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    # accuracy = score.mean()
    # return accuracy

    classifier_obj.fit(X_count_train, y_count_train)
    loss_rmse = mean_squared_error(y_count_test, classifier_obj.predict(X_count_test), squared=False)
    return loss_rmse

In [120]:
study = optuna.create_study()
study.optimize(objective, n_trials=50)

[I 2023-01-11 08:51:52,590] A new study created in memory with name: no-name-3a4c4a6b-9d57-4016-a876-3d982a18000b
[I 2023-01-11 08:51:52,966] Trial 0 finished with value: 0.1785932638411691 and parameters: {'classifier': 'LGBM', 'lgbm_n_estimators': 780, 'lgbm_max_depth': 18}. Best is trial 0 with value: 0.1785932638411691.
[I 2023-01-11 08:51:53,321] Trial 1 finished with value: 0.18949685272498415 and parameters: {'classifier': 'XGBoost'}. Best is trial 0 with value: 0.1785932638411691.
[I 2023-01-11 08:51:56,553] Trial 2 finished with value: 0.18292682486390363 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 507, 'rf_max_depth': 3}. Best is trial 0 with value: 0.1785932638411691.
[I 2023-01-11 08:52:08,142] Trial 3 finished with value: 0.1758212884250743 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 920, 'rf_max_depth': 25}. Best is trial 3 with value: 0.1758212884250743.
[I 2

Learning rate set to 0.029732
0:	learn: 0.2362336	total: 14ms	remaining: 14s
1:	learn: 0.2343479	total: 26ms	remaining: 13s
2:	learn: 0.2319374	total: 35.1ms	remaining: 11.7s
3:	learn: 0.2298585	total: 43.6ms	remaining: 10.9s
4:	learn: 0.2275216	total: 52.2ms	remaining: 10.4s
5:	learn: 0.2253184	total: 78.8ms	remaining: 13.1s
6:	learn: 0.2233103	total: 87.5ms	remaining: 12.4s
7:	learn: 0.2212585	total: 98.4ms	remaining: 12.2s
8:	learn: 0.2188183	total: 107ms	remaining: 11.8s
9:	learn: 0.2175220	total: 116ms	remaining: 11.5s
10:	learn: 0.2157635	total: 124ms	remaining: 11.2s
11:	learn: 0.2137554	total: 133ms	remaining: 10.9s
12:	learn: 0.2116367	total: 142ms	remaining: 10.8s
13:	learn: 0.2099282	total: 151ms	remaining: 10.7s
14:	learn: 0.2080965	total: 161ms	remaining: 10.6s
15:	learn: 0.2064178	total: 171ms	remaining: 10.5s
16:	learn: 0.2048262	total: 181ms	remaining: 10.5s
17:	learn: 0.2028967	total: 190ms	remaining: 10.4s
18:	learn: 0.2014289	total: 200ms	remaining: 10.3s
19:	learn: 

[I 2023-01-11 08:52:31,617] Trial 14 finished with value: 0.1726764445829648 and parameters: {'classifier': 'CatBoost'}. Best is trial 6 with value: 0.16695339977277046.
[I 2023-01-11 08:52:31,898] Trial 15 finished with value: 0.16693325222418817 and parameters: {'classifier': 'BayesRidge'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:32,162] Trial 16 finished with value: 0.16693325222418817 and parameters: {'classifier': 'BayesRidge'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:32,398] Trial 17 finished with value: 0.16693325222418817 and parameters: {'classifier': 'BayesRidge'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:32,634] Trial 18 finished with value: 0.16693325222418817 and parameters: {'classifier': 'BayesRidge'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:32,870] Trial 19 finished with value: 0.

Learning rate set to 0.029732
0:	learn: 0.2362336	total: 9.22ms	remaining: 9.21s
1:	learn: 0.2343479	total: 18.3ms	remaining: 9.12s
2:	learn: 0.2319374	total: 27.1ms	remaining: 9.02s
3:	learn: 0.2298585	total: 36.5ms	remaining: 9.1s
4:	learn: 0.2275216	total: 46.7ms	remaining: 9.3s
5:	learn: 0.2253184	total: 56.3ms	remaining: 9.32s
6:	learn: 0.2233103	total: 65ms	remaining: 9.21s
7:	learn: 0.2212585	total: 73.5ms	remaining: 9.11s
8:	learn: 0.2188183	total: 82.4ms	remaining: 9.07s
9:	learn: 0.2175220	total: 92.1ms	remaining: 9.12s
10:	learn: 0.2157635	total: 103ms	remaining: 9.22s
11:	learn: 0.2137554	total: 112ms	remaining: 9.21s
12:	learn: 0.2116367	total: 121ms	remaining: 9.16s
13:	learn: 0.2099282	total: 130ms	remaining: 9.15s
14:	learn: 0.2080965	total: 140ms	remaining: 9.2s
15:	learn: 0.2064178	total: 149ms	remaining: 9.18s
16:	learn: 0.2048262	total: 158ms	remaining: 9.11s
17:	learn: 0.2028967	total: 166ms	remaining: 9.06s
18:	learn: 0.2014289	total: 175ms	remaining: 9.03s
19:	le

[I 2023-01-11 08:52:43,865] Trial 28 finished with value: 0.1726764445829648 and parameters: {'classifier': 'CatBoost'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:43,885] Trial 29 finished with value: 0.18595819229125723 and parameters: {'classifier': 'SVR'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:43,899] Trial 30 finished with value: 0.26343531471797227 and parameters: {'classifier': 'ElasticNet'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:44,138] Trial 31 finished with value: 0.16693325222418817 and parameters: {'classifier': 'BayesRidge'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:44,434] Trial 32 finished with value: 0.18949685272498415 and parameters: {'classifier': 'XGBoost'}. Best is trial 15 with value: 0.16693325222418817.
[I 2023-01-11 08:52:44,678] Trial 33 finished with value: 0.166933252

In [116]:
study.best_trials

[FrozenTrial(number=9, values=[0.16693325222418817], datetime_start=datetime.datetime(2023, 1, 11, 8, 47, 2, 173205), datetime_complete=datetime.datetime(2023, 1, 11, 8, 47, 2, 537318), params={'classifier': 'BayesRidge'}, distributions={'classifier': CategoricalDistribution(choices=('LinReg', 'RandomForest', 'DecTree', 'SVR', 'ElasticNet', 'SGD', 'BayesRidge', 'CatBoost', 'KernelRidge', 'XGBoost', 'LGBM'))}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=9, state=TrialState.COMPLETE, value=None),
 FrozenTrial(number=10, values=[0.16693325222418817], datetime_start=datetime.datetime(2023, 1, 11, 8, 47, 2, 539362), datetime_complete=datetime.datetime(2023, 1, 11, 8, 47, 2, 825572), params={'classifier': 'BayesRidge'}, distributions={'classifier': CategoricalDistribution(choices=('LinReg', 'RandomForest', 'DecTree', 'SVR', 'ElasticNet', 'SGD', 'BayesRidge', 'CatBoost', 'KernelRidge', 'XGBoost', 'LGBM'))}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=

In [88]:
study.best_params

{'classifier': 'BayesRidge'}
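
Most branches of `objective` have no tunable hyperparameters, so repeated trials of the same choice return the same value, as the repeated `BayesRidge` entries in the log show. For a fixed set of untuned candidates, a plain loop that evaluates each one once gives the same ranking with fewer fits. The sketch below shows only the loop pattern. The two candidates are constant-prediction stand-ins rather than the notebook's models, and the targets are a few male shares taken from the table earlier, used purely as toy data.

```python
import math
from statistics import mean, median


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


# Stand-in "models": each maps the training targets to one constant prediction.
candidates = {"MeanBaseline": mean, "MedianBaseline": median}

y_train = [0.608, 0.238, 0.740, 0.809, 0.615]  # toy data
y_test = [0.600, 0.926, 0.480, 0.979]          # toy data

results = {}
for name, fit in candidates.items():
    constant = fit(y_train)
    results[name] = rmse(y_test, [constant] * len(y_test))

# Candidate names ordered from lowest to highest RMSE
ranking = sorted(results, key=results.get)
print(ranking, results)
```

In the notebook's setting, the dictionary values would be the untuned estimators and the loop body would call `fit` and `predict` on the train and test splits. Optuna would then only be needed for the candidates that actually have a search space.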

In [177]:
def objective(trial):

    classifier_name = trial.suggest_categorical("classifier", ["LinReg", "RandomForest", "DecTree", "SVR", "ElasticNet", "SGD", "BayesRidge", "CatBoost", "KernelRidge", "XGBoost", "LGBM", "GradientBoost"])
    # classifier_name = trial.suggest_categorical("classifier", ["RandomForest", "CatBoost", "LGBM"])
    
    # Step 2. Setup values for the hyperparameters:
    if classifier_name == 'LinReg':
        classifier_obj = LinearRegression()
    elif classifier_name == "RandomForest":
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)
    elif classifier_name == "DecTree":
        classifier_obj = DecisionTreeRegressor()
    elif classifier_name == "SVR":
        #svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        classifier_obj = SVR(gamma='auto')
    elif classifier_name == "GradientBoost":
    #     gb_n_estimators = trial.suggest_int("gb_n_estimators", 10, 1000)
    #     gb_lr = trial.suggest_float("gb_lr", 1e-10, 1e10, log=True)
        classifier_obj = GradientBoostingRegressor()
    elif classifier_name == "ElasticNet":
        classifier_obj = ElasticNet()
    elif classifier_name == "SGD":
        classifier_obj = SGDRegressor()
    elif classifier_name == "BayesRidge":
        classifier_obj = BayesianRidge()
    elif classifier_name == "CatBoost":
        #cb_lr = trial.suggest_float("cb_lr", 1e-10, 1e10, log=True)
        classifier_obj = CatBoostRegressor()
    elif classifier_name == "KernelRidge":
        classifier_obj = KernelRidge()
    elif classifier_name == "XGBoost":
        classifier_obj = XGBRegressor()
    elif classifier_name == "LGBM":
        lgbm_n_estimators = trial.suggest_int("lgbm_n_estimators", 10, 1000)
        lgbm_max_depth = trial.suggest_int("lgbm_max_depth", 2, 32, log=True)
        classifier_obj = LGBMRegressor(n_estimators=lgbm_n_estimators, max_depth=lgbm_max_depth)

    # Step 3: Scoring method:
    # score = cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    # accuracy = score.mean()
    # return accuracy

    classifier_obj.fit(X_tfidf_train, y_tfidf_train)
    loss_rmse = mean_squared_error(y_tfidf_test, classifier_obj.predict(X_tfidf_test), squared=False)
    return loss_rmse

In [178]:
study = optuna.create_study()
study.optimize(objective, n_trials=50)

[I 2023-01-11 12:58:05,664] A new study created in memory with name: no-name-6a37474f-ca53-4fe6-bc95-b516738ba60e
[I 2023-01-11 12:58:05,951] Trial 0 finished with value: 0.14185805818966826 and parameters: {'classifier': 'BayesRidge'}. Best is trial 0 with value: 0.14185805818966826.
[I 2023-01-11 12:58:05,958] Trial 1 finished with value: 0.2369972553581393 and parameters: {'classifier': 'ElasticNet'}. Best is trial 0 with value: 0.14185805818966826.
[I 2023-01-11 12:58:06,430] Trial 2 finished with value: 0.1658561790867664 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 51, 'rf_max_depth': 3}. Best is trial 0 with value: 0.14185805818966826.
[I 2023-01-11 12:58:06,445] Trial 3 finished with value: 0.22985678926322325 and parameters: {'classifier': 'SGD'}. Best is trial 0 with value: 0.14185805818966826.
[I 2023-01-11 12:58:12,170] Trial 4 finished with value: 0.15772467999820575 and paramete

Learning rate set to 0.029732
0:	learn: 0.2455460	total: 25.8ms	remaining: 25.8s
1:	learn: 0.2431550	total: 51.6ms	remaining: 25.8s
2:	learn: 0.2406414	total: 98.1ms	remaining: 32.6s
3:	learn: 0.2381147	total: 119ms	remaining: 29.7s
4:	learn: 0.2363735	total: 144ms	remaining: 28.6s
5:	learn: 0.2341562	total: 163ms	remaining: 27.1s
6:	learn: 0.2322488	total: 184ms	remaining: 26.1s
7:	learn: 0.2299518	total: 210ms	remaining: 26s
8:	learn: 0.2277158	total: 244ms	remaining: 26.9s
9:	learn: 0.2256573	total: 268ms	remaining: 26.5s
10:	learn: 0.2241205	total: 295ms	remaining: 26.5s
11:	learn: 0.2221271	total: 316ms	remaining: 26.1s
12:	learn: 0.2201485	total: 339ms	remaining: 25.7s
13:	learn: 0.2179338	total: 364ms	remaining: 25.6s
14:	learn: 0.2154029	total: 386ms	remaining: 25.3s
15:	learn: 0.2138432	total: 410ms	remaining: 25.2s
16:	learn: 0.2119817	total: 432ms	remaining: 25s
17:	learn: 0.2100591	total: 463ms	remaining: 25.3s
18:	learn: 0.2082731	total: 488ms	remaining: 25.2s
19:	learn: 0

[I 2023-01-11 12:58:42,737] Trial 24 finished with value: 0.14638582848761725 and parameters: {'classifier': 'CatBoost'}. Best is trial 14 with value: 0.1414445509493152.


998:	learn: 0.0001223	total: 24.9s	remaining: 24.9ms
999:	learn: 0.0001216	total: 24.9s	remaining: 0us


[I 2023-01-11 12:58:42,821] Trial 25 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:58:42,903] Trial 26 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:58:42,985] Trial 27 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:58:43,021] Trial 28 finished with value: 0.22534930909836645 and parameters: {'classifier': 'DecTree'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:58:43,089] Trial 29 finished with value: 0.1429435161620852 and parameters: {'classifier': 'LGBM', 'lgbm_n_estimators': 120, 'lgbm_max_depth': 2}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:58:43,114] Trial 30 fi

Learning rate set to 0.029732
0:	learn: 0.2455460	total: 24.4ms	remaining: 24.4s
1:	learn: 0.2431550	total: 50.7ms	remaining: 25.3s
2:	learn: 0.2406414	total: 70.9ms	remaining: 23.6s
3:	learn: 0.2381147	total: 95.4ms	remaining: 23.8s
4:	learn: 0.2363735	total: 117ms	remaining: 23.3s
5:	learn: 0.2341562	total: 143ms	remaining: 23.7s
6:	learn: 0.2322488	total: 164ms	remaining: 23.2s
7:	learn: 0.2299518	total: 186ms	remaining: 23s
8:	learn: 0.2277158	total: 209ms	remaining: 23s
9:	learn: 0.2256573	total: 232ms	remaining: 23s
10:	learn: 0.2241205	total: 258ms	remaining: 23.2s
11:	learn: 0.2221271	total: 281ms	remaining: 23.2s
12:	learn: 0.2201485	total: 305ms	remaining: 23.1s
13:	learn: 0.2179338	total: 332ms	remaining: 23.4s
14:	learn: 0.2154029	total: 359ms	remaining: 23.5s
15:	learn: 0.2138432	total: 380ms	remaining: 23.4s
16:	learn: 0.2119817	total: 406ms	remaining: 23.4s
17:	learn: 0.2100591	total: 427ms	remaining: 23.3s
18:	learn: 0.2082731	total: 454ms	remaining: 23.5s
19:	learn: 0.

[I 2023-01-11 12:59:09,581] Trial 38 finished with value: 0.14638582848761725 and parameters: {'classifier': 'CatBoost'}. Best is trial 14 with value: 0.1414445509493152.


995:	learn: 0.0001257	total: 24.3s	remaining: 97.5ms
996:	learn: 0.0001248	total: 24.3s	remaining: 73.1ms
997:	learn: 0.0001235	total: 24.3s	remaining: 48.7ms
998:	learn: 0.0001223	total: 24.3s	remaining: 24.4ms
999:	learn: 0.0001216	total: 24.4s	remaining: 0us


[I 2023-01-11 12:59:09,662] Trial 39 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:59:09,677] Trial 40 finished with value: 0.2369972553581393 and parameters: {'classifier': 'ElasticNet'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:59:09,760] Trial 41 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:59:09,839] Trial 42 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:59:09,918] Trial 43 finished with value: 0.1414445509493152 and parameters: {'classifier': 'LinReg'}. Best is trial 14 with value: 0.1414445509493152.
[I 2023-01-11 12:59:26,042] Trial 44 finished with value: 0.15278921612344204 and 

In [89]:
regressor = BayesianRidge()
regressor.fit(X_count_train, y_count_train)
# squared=False returns the square root of the MSE, so this value is an RMSE
rmse = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
rmse

0.16693325222418817

In [90]:
mae = mean_absolute_error(y_count_test, regressor.predict(X_count_test))
mae

0.12361076300701394

In [73]:
r2 = regressor.score(X_count_test, y_count_test)
r2

0.5883187263648963

In [196]:
regressor = BayesianRidge()
regressor.fit(X_tfidf_train, y_tfidf_train)

In [197]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [198]:
rmse

0.14185805818966826

In [199]:
mae

0.09992012731715566

In [200]:
r2

0.6202062325779809
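
The three numbers reported throughout this notebook (RMSE, MAE, R2) follow directly from the prediction errors. As a reference for reading them, here is a from-scratch sketch in plain Python. The function name `regression_metrics` and the four toy values are made up for the illustration; the notebook itself uses scikit-learn's `mean_squared_error`, `mean_absolute_error` and `r2_score`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)    # total sum of squares
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": 1.0 - ss_res / ss_tot}


# Hypothetical male shares and predictions, for illustration only
metrics = regression_metrics([0.2, 0.4, 0.6, 0.8], [0.3, 0.4, 0.5, 0.8])
print(metrics)
```

Because the target is a share between 0 and 1, RMSE and MAE can be read directly as share points: an MAE of about 0.10 means the predicted male share is off by roughly ten percentage points on average.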

In [34]:
def objective(trial):
    param = {}
    param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    param['depth'] = trial.suggest_int('depth', 9, 15)
    param['l2_leaf_reg'] = trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5)
    param['min_child_samples'] = trial.suggest_categorical('min_child_samples', [1, 4, 8, 16, 32])
    param['grow_policy'] = 'Depthwise'
    #param['iterations'] = 10000
    #param['use_best_model'] = True
    param['eval_metric'] = 'RMSE'
    param['od_type'] = 'Iter'
    param['od_wait'] = 20
    param['random_state'] = 1
    param['logging_level'] = 'Silent'
    
    regressor = CatBoostRegressor(**param)

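    # Note: no eval_set is passed, so the overfitting detector (od_type, od_wait,
    # early_stopping_rounds) most likely has no validation data to monitor here.
    regressor.fit(X_count_train, y_count_train, early_stopping_rounds=100)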
    loss = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
    return loss
    

In [35]:
study = optuna.create_study()
study.optimize(objective, n_trials=200, timeout=24000)

[I 2023-01-10 08:58:28,874] A new study created in memory with name: catboost-seed1
[I 2023-01-10 09:00:06,943] Trial 0 finished with value: 0.031040337805151554 and parameters: {'learning_rate': 0.005, 'depth': 9, 'l2_leaf_reg': 5.5, 'min_child_samples': 1}. Best is trial 0 with value: 0.031040337805151554.
[I 2023-01-10 09:01:28,512] Trial 1 finished with value: 0.040813299781956615 and parameters: {'learning_rate': 0.001, 'depth': 15, 'l2_leaf_reg': 1.5, 'min_child_samples': 8}. Best is trial 0 with value: 0.031040337805151554.
[I 2023-01-10 09:02:39,974] Trial 2 finished with value: 0.03245324351548526 and parameters: {'learning_rate': 0.002, 'depth': 13, 'l2_leaf_reg': 1.0, 'min_child_samples': 8}. Best is trial 0 with value: 0.031040337805151554.
[I 2023-01-10 09:03:11,248] Trial 3 finished with value: 0.02709879194730593 and parameters: {'learning_rate': 0.007, 'depth': 11, 'l2_leaf_reg': 1.5, 'min_child_samples': 32}.

KeyboardInterrupt: 

In [27]:
study.best_params

{'learning_rate': 0.009000000000000001,
 'depth': 15,
 'l2_leaf_reg': 3.0,
 'min_child_samples': 32}
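
The trailing digits in `learning_rate` are most likely floating-point noise from the stepped search space (`step=0.001`), not a meaningful difference from 0.009. A small sketch for tidying such a dictionary before copying the values into a new cell follows; the helper name `tidy` is an assumption for this example, and three decimals matches the step size used above.

```python
# Values copied from the study.best_params output above
best_params = {
    "learning_rate": 0.009000000000000001,
    "depth": 15,
    "l2_leaf_reg": 3.0,
    "min_child_samples": 32,
}


def tidy(params, ndigits=3):
    """Round float values to the precision of the search grid; leave ints as-is."""
    return {
        key: round(value, ndigits) if isinstance(value, float) else value
        for key, value in params.items()
    }


clean_params = tidy(best_params)
print(clean_params)
```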

In [76]:
param = {}
param['learning_rate'] = 0.009
param['depth'] = 15
param['l2_leaf_reg'] = 3.0
param['min_child_samples'] = 32
param['grow_policy'] = 'Depthwise'
#param['iterations'] = 10000
param['eval_metric'] = 'RMSE'
param['od_type'] = 'Iter'
param['od_wait'] = 20
param['random_state'] = 1
param['logging_level'] = 'Silent'
    
regressor = CatBoostRegressor(**param)

# Cross-validated score; for a regressor the default scoring is R2, not accuracy
#score = cross_val_score(regressor, X, y, n_jobs=-1, cv=3)
#accuracy = score.mean()

In [35]:
accuracy

0.5617213761636992

In [77]:
regressor.fit(X_count_train, y_count_train, early_stopping_rounds=100)

<catboost.core.CatBoostRegressor at 0x271946fc310>

In [78]:
loss_mse = mean_squared_error(y_count_test, regressor.predict(X_count_test))
loss_mse

0.027751175038021048

In [79]:
np.sqrt(loss_mse)

0.16658683933018553

In [80]:
loss_rmse = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
loss_rmse

0.16658683933018553
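
The three cells above compute the same quantity in two ways: the RMSE is the square root of the MSE. The check below reproduces that with the values printed above, using only the standard library.

```python
import math

mse = 0.027751175038021048   # mean_squared_error output from the cell above
rmse = math.sqrt(mse)        # what squared=False returns directly

print(rmse)
```

Since the square root is monotonic, minimizing either metric in an Optuna study ranks trials identically. RMSE is simply easier to read because it is in the same units as the target.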

In [81]:
r2 = regressor.score(X_count_test, y_count_test)
r2

0.5900255610259757

In [82]:
0.167*0.167

0.027889000000000004

In [83]:
np.sqrt(4000000)

2000.0

In [85]:
mae = mean_absolute_error(y_count_test, regressor.predict(X_count_test))
mae

0.12576820719986984

In [179]:
def objective(trial):
    param = {}
    param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    param['depth'] = trial.suggest_int('depth', 9, 15)
    param['l2_leaf_reg'] = trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5)
    param['min_child_samples'] = trial.suggest_categorical('min_child_samples', [1, 4, 8, 16, 32])
    param['grow_policy'] = 'Depthwise'
    #param['iterations'] = 10000
    #param['use_best_model'] = True
    param['eval_metric'] = 'RMSE'
    param['od_type'] = 'Iter'
    param['od_wait'] = 20
    param['random_state'] = 1
    param['logging_level'] = 'Silent'
    
    regressor = CatBoostRegressor(**param)

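    # Note: no eval_set is passed, so the overfitting detector (od_type, od_wait,
    # early_stopping_rounds) most likely has no validation data to monitor here.
    regressor.fit(X_tfidf_train, y_tfidf_train, early_stopping_rounds=100)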
    loss = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
    return loss

In [180]:
study = optuna.create_study()
study.optimize(objective, n_trials=200, timeout=24000)

[I 2023-01-11 13:00:09,574] A new study created in memory with name: no-name-12f7d3d9-469f-41f9-8a0f-36c3d4ab7918
[I 2023-01-11 13:00:34,837] Trial 0 finished with value: 0.14256510008149767 and parameters: {'learning_rate': 0.016, 'depth': 10, 'l2_leaf_reg': 3.5, 'min_child_samples': 16}. Best is trial 0 with value: 0.14256510008149767.
[I 2023-01-11 13:01:06,550] Trial 1 finished with value: 0.14484825572638105 and parameters: {'learning_rate': 0.014000000000000002, 'depth': 14, 'l2_leaf_reg': 5.0, 'min_child_samples': 16}. Best is trial 0 with value: 0.14256510008149767.
[I 2023-01-11 13:01:34,739] Trial 2 finished with value: 0.14867807272026493 and parameters: {'learning_rate': 0.011, 'depth': 12, 'l2_leaf_reg': 4.5, 'min_child_samples': 16}. Best is trial 0 with value: 0.14256510008149767.
[I 2023-01-11 13:02:01,534] Trial 3 finished with value: 0.14257557345876928 and parameters: {'learning_rate': 0.017, 'depth': 15, '

KeyboardInterrupt: 

In [181]:
study.best_params

{'learning_rate': 0.009000000000000001,
 'depth': 13,
 'l2_leaf_reg': 1.5,
 'min_child_samples': 32}

In [188]:
param = {}
param['learning_rate'] = 0.008
param['depth'] = 9
param['l2_leaf_reg'] = 2.5
param['min_child_samples'] = 32
param['grow_policy'] = 'Depthwise'
#param['iterations'] = 10000
param['eval_metric'] = 'RMSE'
param['od_type'] = 'Iter'
param['od_wait'] = 20
param['random_state'] = 1
param['logging_level'] = 'Silent'
    
regressor = CatBoostRegressor(**param)

In [189]:
regressor.fit(X_tfidf_train, y_tfidf_train, early_stopping_rounds=100)

<catboost.core.CatBoostRegressor at 0x271946758d0>

In [190]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [191]:
rmse

0.13784796190431803

In [192]:
mae

0.09835777308043474

In [193]:
r2

0.6413750416660888

In [91]:
def objective(trial):
    n_iter = trial.suggest_int("n_iter", 1, 1000)
    tol = trial.suggest_float("tol", 1e-10, 1e10, log=True)
    alpha_1 = trial.suggest_float("alpha_1", 1e-10, 1e10, log=True)
    alpha_2 = trial.suggest_float("alpha_2", 1e-10, 1e10, log=True)
    lambda_1 = trial.suggest_float("lambda_1", 1e-10, 1e10, log=True)
    lambda_2 = trial.suggest_float("lambda_2", 1e-10, 1e10, log=True)
    comp_score = trial.suggest_categorical("cpm_score", [True, False])

    regressor = BayesianRidge(n_iter=n_iter, tol=tol, alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1, lambda_2=lambda_2, compute_score=comp_score)

    regressor.fit(X_count_train, y_count_train)
    loss = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
    return loss
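
Each `suggest_float(..., 1e-10, 1e10, log=True)` call draws from a range spanning twenty orders of magnitude. Sampling on a log scale makes every decade equally likely. A standard-library sketch of that idea (not Optuna's actual sampler; `sample_log_uniform` is a hypothetical helper):

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent

rng = random.Random(0)  # fixed seed so the draws are reproducible
samples = [sample_log_uniform(1e-10, 1e10, rng) for _ in range(1000)]

# Share of draws below 1.0: roughly half, because half of the decades lie below 1.
share_below_one = sum(s < 1.0 for s in samples) / len(samples)
```

A uniform draw on the raw scale would instead put about 90% of the samples above 1e9, which is why the log scale matters for a range this wide.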

In [93]:
study = optuna.create_study()
study.optimize(objective, n_trials=1000)

[I 2023-01-10 15:37:04,880] A new study created in memory with name: no-name-4f2e4768-22b4-470f-abdc-6d8ce0eb20b7
[I 2023-01-10 15:37:05,215] Trial 0 finished with value: 0.12361267988079416 and parameters: {'n_iter': 54, 'tol': 807.1874618467465, 'alpha_1': 990885.9255777951, 'alpha_2': 3.069267177233713e-06, 'lambda_1': 2.5961399843404798, 'lambda_2': 8.529896790115403e-06, 'cpm_score': True}. Best is trial 0 with value: 0.12361267988079416.
[I 2023-01-10 15:37:05,481] Trial 1 finished with value: 0.12361267988384103 and parameters: {'n_iter': 208, 'tol': 0.06362846827939381, 'alpha_1': 752752095.323246, 'alpha_2': 0.001144360345159972, 'lambda_1': 137.46434589716748, 'lambda_2': 22.85950702069754, 'cpm_score': False}. Best is trial 0 with value: 0.12361267988079416.
[I 2023-01-10 15:37:05,768] Trial 2 finished with value: 0.12361225812134372 and parameters: {'n_iter': 917, 'tol': 4.98232519334111e-07, 'alpha_1': 7.411598160083413e-10, 

KeyboardInterrupt: 

In [201]:
def objective(trial):
    n_iter = trial.suggest_int("n_iter", 1, 1000)
    tol = trial.suggest_float("tol", 1e-10, 1e10, log=True)
    alpha_1 = trial.suggest_float("alpha_1", 1e-10, 1e10, log=True)
    alpha_2 = trial.suggest_float("alpha_2", 1e-10, 1e10, log=True)
    lambda_1 = trial.suggest_float("lambda_1", 1e-10, 1e10, log=True)
    lambda_2 = trial.suggest_float("lambda_2", 1e-10, 1e10, log=True)
    comp_score = trial.suggest_categorical("cpm_score", [True, False])

    regressor = BayesianRidge(n_iter=n_iter, tol=tol, alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1, lambda_2=lambda_2, compute_score=comp_score)

    regressor.fit(X_tfidf_train, y_tfidf_train)
    loss = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
    return loss

In [202]:
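# The objective returns RMSE (lower is better), so the study should minimize.
# The log below was recorded with direction="maximize", which is why its "best" value goes up.
study = optuna.create_study(direction="minimize")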
study.optimize(objective, n_trials=200)

[I 2023-01-11 14:01:59,135] A new study created in memory with name: no-name-6048f704-f23f-444e-96b2-7196890170c1
[I 2023-01-11 14:01:59,477] Trial 0 finished with value: 0.2369970532346252 and parameters: {'n_iter': 546, 'tol': 5245147.645030035, 'alpha_1': 0.0073847957236279765, 'alpha_2': 40473.755419945366, 'lambda_1': 0.015122796795486042, 'lambda_2': 5.286478395096773e-06, 'cpm_score': False}. Best is trial 0 with value: 0.2369970532346252.
[I 2023-01-11 14:01:59,741] Trial 1 finished with value: 0.23699725521713436 and parameters: {'n_iter': 711, 'tol': 4.3312777001474634e-07, 'alpha_1': 3.8519191623248386, 'alpha_2': 1.753254630537205e-09, 'lambda_1': 1861561582.4078724, 'lambda_2': 0.04166658754354329, 'cpm_score': False}. Best is trial 1 with value: 0.23699725521713436.
[I 2023-01-11 14:02:00,039] Trial 2 finished with value: 0.14185955993119237 and parameters: {'n_iter': 71, 'tol': 3.0781362975912553e-06, 'alpha_1': 0.000491216
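
Because the objective returns RMSE, the study direction decides which trial counts as best. In the run logged above the study was maximizing, so the higher RMSE of trial 1 replaced trial 0 as "best". A small sketch using the three logged values (`best_trial_index` is a hypothetical helper, not an Optuna function):

```python
def best_trial_index(values, direction):
    """Return the index of the best value under 'minimize' or 'maximize'."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    pick = min if direction == "minimize" else max
    return pick(range(len(values)), key=lambda i: values[i])

# RMSE values of trials 0, 1 and 2 from the log above.
rmse_by_trial = [0.2369970532346252, 0.23699725521713436, 0.14185955993119237]

best_when_maximizing = best_trial_index(rmse_by_trial, "maximize")  # highest RMSE
best_when_minimizing = best_trial_index(rmse_by_trial, "minimize")  # lowest RMSE
```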

In [107]:
study.best_params

{'n_iter': 600,
 'tol': 5.17101634042113e-08,
 'alpha_1': 0.009728344955292616,
 'alpha_2': 377.76335518777535,
 'lambda_1': 0.08678407411192426,
 'lambda_2': 588.1972168565699,
 'cpm_score': True}

In [99]:
study.best_trial

FrozenTrial(number=193, values=[0.1645068501561882], datetime_start=datetime.datetime(2023, 1, 10, 15, 42, 6, 357061), datetime_complete=datetime.datetime(2023, 1, 10, 15, 42, 6, 600888), params={'n_iter': 23, 'tol': 3012581181.6955023, 'alpha_1': 1.0137498748134246e-09, 'alpha_2': 0.001489310174937195, 'lambda_1': 3.640274044983589e-07, 'lambda_2': 4.356449095344607e-09, 'cpm_score': False}, distributions={'n_iter': IntDistribution(high=1000, log=False, low=1, step=1), 'tol': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'alpha_1': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'alpha_2': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'lambda_1': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'lambda_2': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'cpm_score': CategoricalDistribution(choices=(True, False))}, user_attrs={}, system_attrs={}, intermediate_values={},

In [100]:
regressor = BayesianRidge(n_iter=23, tol=3012581181.696, alpha_1=1.0137e-09, alpha_2=0.0015, lambda_1=3.6403e-07, lambda_2=4.3564e-09)

regressor.fit(X_tfidf_train, y_tfidf_train)

In [101]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [102]:
rmse

0.16450687365057737

In [103]:
mae

0.1236098488511423

In [104]:
r2

0.6001993448804249

In [108]:
regressor = BayesianRidge(n_iter=600, tol=5.171e-08, alpha_1=0.001, alpha_2=377.763, lambda_1=0.0868, lambda_2=588.197)

regressor.fit(X_count_train, y_count_train)

In [109]:
rmse = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
mae = mean_absolute_error(y_count_test, regressor.predict(X_count_test))
r2 = r2_score(y_count_test, regressor.predict(X_count_test))

In [110]:
rmse

0.16562615212474965

In [111]:
mae

0.122579991536417

In [112]:
r2

0.5947404777024883

In [133]:
def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-10, 1e10, log=True)
    gamma = trial.suggest_float("gamma", 1e-10, 1e10, log=True)
    degree = trial.suggest_int("degree", 1, 50)
    coef0 = trial.suggest_float("coef0", 1e-10, 1e10, log=True)

    regressor = KernelRidge(alpha=alpha, gamma=gamma, degree=degree, coef0=coef0)

    regressor.fit(X_count_train, y_count_train)
    loss = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
    return loss

In [134]:
study = optuna.create_study()
study.optimize(objective, n_trials=200)

[I 2023-01-11 09:52:18,977] A new study created in memory with name: no-name-8905772e-4cde-400c-8bff-5a4ad446089a
[I 2023-01-11 09:52:18,997] Trial 0 finished with value: 0.17709108783414076 and parameters: {'alpha': 4.0955829194958054e-06, 'gamma': 9.020146596360546e-06, 'degree': 48, 'coef0': 0.07051865796789863}. Best is trial 0 with value: 0.17709108783414076.
[I 2023-01-11 09:52:19,018] Trial 1 finished with value: 0.5668660657408372 and parameters: {'alpha': 769619.4851209878, 'gamma': 56616.39040591696, 'degree': 20, 'coef0': 2437608.554030387}. Best is trial 0 with value: 0.17709108783414076.
[I 2023-01-11 09:52:19,033] Trial 2 finished with value: 0.17709108822301123 and parameters: {'alpha': 1.3156845206170776e-07, 'gamma': 2750.49293649512, 'degree': 3, 'coef0': 9322200810.756687}. Best is trial 0 with value: 0.17709108783414076.
[I 2023-01-11 09:52:19,048] Trial 3 finished with value: 0.17707724345784623 and param

In [135]:
study.best_params

{'alpha': 39.96115967073431,
 'gamma': 0.10967409337459422,
 'degree': 36,
 'coef0': 2.2829765469089713e-09}

In [128]:
regressor = KernelRidge(alpha=39.7, gamma=4.4, degree=8, coef0=0.014)

regressor.fit(X_count_train, y_count_train)

In [129]:
rmse = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
mae = mean_absolute_error(y_count_test, regressor.predict(X_count_test))
r2 = r2_score(y_count_test, regressor.predict(X_count_test))

In [130]:
rmse

0.17570077933854597

In [131]:
mae

0.13348630701225442

In [132]:
r2

0.543939163767521

In [203]:
def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-10, 1e10, log=True)
    gamma = trial.suggest_float("gamma", 1e-10, 1e10, log=True)
    degree = trial.suggest_int("degree", 1, 50)
    coef0 = trial.suggest_float("coef0", 1e-10, 1e10, log=True)

    regressor = KernelRidge(alpha=alpha, gamma=gamma, degree=degree, coef0=coef0)

    regressor.fit(X_tfidf_train, y_tfidf_train)
    loss = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
    return loss

In [204]:
study = optuna.create_study()
study.optimize(objective, n_trials=200)

[I 2023-01-11 14:02:53,870] A new study created in memory with name: no-name-7dcbaa7c-5795-4451-acbc-66d7386ee71b
[I 2023-01-11 14:02:53,885] Trial 0 finished with value: 0.17618962802763904 and parameters: {'alpha': 3.271729180698815e-07, 'gamma': 1832.753167054094, 'degree': 44, 'coef0': 0.057609215975537406}. Best is trial 0 with value: 0.17618962802763904.
[I 2023-01-11 14:02:53,895] Trial 1 finished with value: 0.4406644387389253 and parameters: {'alpha': 30.47012022281175, 'gamma': 0.0019979987763834705, 'degree': 33, 'coef0': 24019.101569130184}. Best is trial 0 with value: 0.17618962802763904.
[I 2023-01-11 14:02:53,906] Trial 2 finished with value: 0.1688054229155813 and parameters: {'alpha': 0.1531401926063681, 'gamma': 1162728.9944773873, 'degree': 35, 'coef0': 9.734388401442095}. Best is trial 2 with value: 0.1688054229155813.
[I 2023-01-11 14:02:53,915] Trial 3 finished with value: 0.8040128264693659 and paramete

In [205]:
study.best_params

{'alpha': 0.1951867218485564,
 'gamma': 271398.99552393524,
 'degree': 11,
 'coef0': 0.06350329482011649}

In [206]:
regressor = KernelRidge(alpha=0.195, gamma=271398.996, degree=11, coef0=0.063)

regressor.fit(X_tfidf_train, y_tfidf_train)

In [207]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [208]:
rmse

0.16861473476157282

In [209]:
mae

0.11805750779306966

In [210]:
r2

0.46342447471032944

In [136]:
regressor = lin_reg
rmse = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
mae = mean_absolute_error(y_count_test, regressor.predict(X_count_test))
r2 = r2_score(y_count_test, regressor.predict(X_count_test))

In [137]:
rmse

0.16695339977277046

In [138]:
mae

0.12361447535721265

In [139]:
r2

0.5882193468957562

In [212]:
regressor = LinearRegression()
regressor.fit(X_tfidf_train, y_tfidf_train)

In [213]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [214]:
rmse

0.1414445509493152

In [215]:
mae

0.09966589990998029

In [216]:
r2

0.6224171550022863

In [167]:
def objective(trial):
    rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
    rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
    regressor = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)

    regressor.fit(X_count_train, y_count_train)
    loss = mean_squared_error(y_count_test, regressor.predict(X_count_test), squared=False)
    return loss

In [168]:
study = optuna.create_study()
study.optimize(objective, n_trials=200)

[I 2023-01-11 12:32:55,465] A new study created in memory with name: no-name-db5e2600-5dd1-4d48-8473-fcc363cccfe4
[I 2023-01-11 12:33:04,026] Trial 0 finished with value: 0.17471633289221458 and parameters: {'rf_n_estimators': 650, 'rf_max_depth': 27}. Best is trial 0 with value: 0.17471633289221458.
[I 2023-01-11 12:33:04,689] Trial 1 finished with value: 0.18101788500596225 and parameters: {'rf_n_estimators': 78, 'rf_max_depth': 5}. Best is trial 0 with value: 0.17471633289221458.
[I 2023-01-11 12:33:06,763] Trial 2 finished with value: 0.19379657458990637 and parameters: {'rf_n_estimators': 506, 'rf_max_depth': 2}. Best is trial 0 with value: 0.17471633289221458.
[I 2023-01-11 12:33:13,731] Trial 3 finished with value: 0.17345767971047482 and parameters: {'rf_n_estimators': 551, 'rf_max_depth': 12}. Best is trial 3 with value: 0.17345767971047482.
[I 2023-01-11 12:33:14,939] Trial 4 finished with value: 0.1773

In [169]:
study.best_params

{'rf_n_estimators': 574, 'rf_max_depth': 18}

In [170]:
regressor = RandomForestRegressor(n_estimators=574, max_depth=18)

regressor.fit(X_count_train, y_count_train)

In [171]:
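# Score the fitted random forest. The cell as originally run scored `lin_reg`,
# so the values recorded below are the linear-regression results.
y_pred = regressor.predict(X_count_test)
rmse = mean_squared_error(y_count_test, y_pred, squared=False)
mae = mean_absolute_error(y_count_test, y_pred)
r2 = r2_score(y_count_test, y_pred)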

In [172]:
rmse

0.16695339977277046

In [173]:
mae

0.12361447535721265

In [174]:
r2

0.5882193468957562

In [217]:
def objective(trial):
    rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
    rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
    regressor = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)

    regressor.fit(X_tfidf_train, y_tfidf_train)
    loss = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
    return loss

In [218]:
study = optuna.create_study()
study.optimize(objective, n_trials=200)

[I 2023-01-11 14:11:19,017] A new study created in memory with name: no-name-cd580a15-21c2-490c-be68-f19fd558131d
[I 2023-01-11 14:11:23,719] Trial 0 finished with value: 0.162359006500605 and parameters: {'rf_n_estimators': 800, 'rf_max_depth': 2}. Best is trial 0 with value: 0.162359006500605.
[I 2023-01-11 14:11:32,450] Trial 1 finished with value: 0.1525309756907177 and parameters: {'rf_n_estimators': 594, 'rf_max_depth': 7}. Best is trial 1 with value: 0.1525309756907177.
[I 2023-01-11 14:11:33,689] Trial 2 finished with value: 0.1608860089684409 and parameters: {'rf_n_estimators': 211, 'rf_max_depth': 2}. Best is trial 1 with value: 0.1525309756907177.
[I 2023-01-11 14:11:36,979] Trial 3 finished with value: 0.15417447488409639 and parameters: {'rf_n_estimators': 185, 'rf_max_depth': 26}. Best is trial 1 with value: 0.1525309756907177.
[I 2023-01-11 14:11:41,159] Trial 4 finished with value: 0.1543370071674

In [220]:
regressor = RandomForestRegressor(n_estimators=115, max_depth=14)
regressor.fit(X_tfidf_train, y_tfidf_train)

In [221]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [222]:
rmse

0.15396299774162933

In [223]:
mae

0.11746317920404882

In [224]:
r2

0.5526241578841027

In [290]:
def objective(trial):
    # boosting_type = trial.suggest_categorical("boosting_type", ["gbdt", "dart", "rf"])
    num_leaves = trial.suggest_int("num_leaves", 2, 100)
    max_depth = trial.suggest_int("max_depth", -1, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    subsample_for_bin = trial.suggest_int("subsample_for_bin", 100000, 500000)
    min_split_gain = trial.suggest_float("min_split_gain", 1e-10, 1e10, log=True)
    # min_child_weight = trial.suggest_float("min_child_weight", 1e-10, 1e10, log=True)
    # min_child_samples = trial.suggest_int("min_child_samples", 5, 50)
    # subsample = trial.suggest_float("subsample", 1e-10, 1e10, log=True)
    # subsample_freq = trial.suggest_int("subsamples_freq", 0, 50)
    # colsample_bytree = trial.suggest_float("colsample_bytree", 1e-10, 1e10, log=True)
    # reg_alpha = trial.suggest_float("reg_alpha", 1e-10, 1e10, log=True)
    # reg_lambda = trial.suggest_float("reg_lambda", 1e-10, 1e10, log=True)
    
    # regressor = LGBMRegressor(num_leaves=num_leaves, max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators, 
    #                           subsample_for_bin=subsample_for_bin, min_split_gain=min_split_gain, min_child_weight=min_child_weight, 
    #                           min_child_samples=min_child_samples, subsample=subsample, subsample_freq=subsample_freq, 
    #                           colsample_bytree=colsample_bytree, reg_alpha=reg_alpha, reg_lambda=reg_lambda)

    regressor = LGBMRegressor(num_leaves=num_leaves, max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators, subsample_for_bin=subsample_for_bin, min_split_gain=min_split_gain)

    # regressor = LGBMRegressor()


    regressor.fit(X_tfidf_train, y_tfidf_train)
    loss = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
    return loss

In [291]:
study = optuna.create_study()
study.optimize(objective, n_trials=1000)

[I 2023-01-11 14:53:44,329] A new study created in memory with name: no-name-b23fd04e-30eb-491f-9ad7-54c50a9d31aa
[I 2023-01-11 14:53:44,416] Trial 0 finished with value: 0.1702752514926859 and parameters: {'num_leaves': 83, 'max_depth': 88, 'learning_rate': 0.014000000000000002, 'n_estimators': 70, 'subsample_for_bin': 192482, 'min_split_gain': 1.9390064033171897e-10}. Best is trial 0 with value: 0.1702752514926859.
[I 2023-01-11 14:53:44,495] Trial 1 finished with value: 0.23699725543659475 and parameters: {'num_leaves': 37, 'max_depth': 498, 'learning_rate': 0.019000000000000003, 'n_estimators': 156, 'subsample_for_bin': 307777, 'min_split_gain': 35.06608886983671}. Best is trial 0 with value: 0.1702752514926859.
[I 2023-01-11 14:53:44,810] Trial 2 finished with value: 0.1873542956119839 and parameters: {'num_leaves': 45, 'max_depth': 392, 'learning_rate': 0.001, 'n_estimators': 604, 'subsample_for_bin': 225148, 'min_split_gain': 0.068

In [292]:
study.best_params

{'num_leaves': 4,
 'max_depth': 245,
 'learning_rate': 0.016,
 'n_estimators': 808,
 'subsample_for_bin': 292633,
 'min_split_gain': 0.0054370582165134415}

In [298]:
study.best_trial

FrozenTrial(number=578, values=[0.13522194139875388], datetime_start=datetime.datetime(2023, 1, 11, 14, 57, 32, 102764), datetime_complete=datetime.datetime(2023, 1, 11, 14, 57, 32, 349879), params={'num_leaves': 4, 'max_depth': 245, 'learning_rate': 0.016, 'n_estimators': 808, 'subsample_for_bin': 292633, 'min_split_gain': 0.0054370582165134415}, distributions={'num_leaves': IntDistribution(high=100, log=False, low=2, step=1), 'max_depth': IntDistribution(high=500, log=False, low=-1, step=1), 'learning_rate': FloatDistribution(high=0.02, log=False, low=0.001, step=0.001), 'n_estimators': IntDistribution(high=1000, log=False, low=10, step=1), 'subsample_for_bin': IntDistribution(high=500000, log=False, low=100000, step=1), 'min_split_gain': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=578, state=TrialState.COMPLETE, value=None)
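
One reading of the best parameters: with `num_leaves=4` each tree has at most three splits, so it can be at most three levels deep and `max_depth=245` never becomes the binding limit. The arithmetic as a sketch (both helper names are hypothetical):

```python
def max_possible_depth(num_leaves):
    """Deepest a binary tree with `num_leaves` leaves can be (a fully lopsided tree)."""
    if num_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    return num_leaves - 1

def depth_limit_is_binding(num_leaves, max_depth):
    """True if max_depth can actually restrict a tree with this many leaves.

    Follows LightGBM's convention that max_depth <= 0 means no limit.
    """
    return 0 < max_depth < max_possible_depth(num_leaves)

binding = depth_limit_is_binding(num_leaves=4, max_depth=245)
```

Under this reading the search range of -1 to 500 for `max_depth` mostly explores values that cannot change the model once `num_leaves` is small.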

In [299]:
regressor = LGBMRegressor(num_leaves=4, max_depth=245, learning_rate=0.016, n_estimators=808, subsample_for_bin=292633, min_split_gain=0.0054370582165134415)
regressor.fit(X_tfidf_train, y_tfidf_train)

In [300]:
rmse = mean_squared_error(y_tfidf_test, regressor.predict(X_tfidf_test), squared=False)
mae = mean_absolute_error(y_tfidf_test, regressor.predict(X_tfidf_test))
r2 = r2_score(y_tfidf_test, regressor.predict(X_tfidf_test))

In [301]:
rmse

0.13522194139875388

In [302]:
mae

0.09334979992727398

In [303]:
r2

0.6549085925928108

In [306]:
corpus = list(df["Cleaned text"])
google_model = gensim.models.KeyedVectors.load_word2vec_format("c:/Users/britt/Downloads/GoogleNews-vectors-negative300.bin.gz", binary=True)

In [307]:
tfidf_vectorizer2 = TfidfVectorizer()
tfidf_vectorizer2.fit_transform(corpus)

<177x3835 sparse matrix of type '<class 'numpy.float64'>'
	with 68003 stored elements in Compressed Sparse Row format>

In [308]:
vocabulary = tfidf_vectorizer2.get_feature_names_out()
documents_embeddings = []
documents_scaled_embeddings = []
for doc in corpus:
    word_embeddings = []
    scaled_embeddings = []
    doc_list = doc.split()
    for word in doc_list:
        # Keep only words that have a pretrained word2vec vector.
        if word in google_model.key_to_index:
            embedding = google_model[word]
            word_embeddings.append(embedding)
            # Position of the word in the TF-IDF vocabulary (empty if absent).
            index = np.where(vocabulary == word)[0]
            try:
                # Scale the word vector by the word's IDF weight.
                scaled_embeddings.append(embedding * tfidf_vectorizer2.idf_[index])
            except ValueError:
                # Word is not in the TF-IDF vocabulary, so it has no IDF weight.
                pass
    documents_embeddings.append(word_embeddings)
    documents_scaled_embeddings.append(scaled_embeddings)

In [309]:
df["Embeddings"] = documents_embeddings
df["Scaled embeddings"] = documents_scaled_embeddings
df

Unnamed: 0,ID,Job Description,Apps Received,Female,Male,Unknown_Gender,Cleaned text,Apps Received (unknown gender removed),Male share,Female share,Male share (unknown gender included),Female share (unknown gender included),Embeddings,Scaled embeddings
0,3190,BUILDING MAINTENANCE DISTRICT SUPERVISOR,47,1,45,1,build maintenance district supervisor class co...,46,0.978,0.022,0.957,0.021,"[[-0.14355469, 0.21679688, 0.03881836, 0.08984...","[[-0.4436903723137345, 0.6700630112493133, 0.1..."
1,3860,ELEVATOR MECHANIC HELPER,203,2,195,6,elevator mechanic helper class code open date ...,197,0.990,0.010,0.961,0.010,"[[-0.05810547, -0.22949219, -0.26757812, -0.01...","[[-0.2550844070540135, -1.0074762295410618, -1..."
2,3987,WATERWORKS MECHANIC SUPERVISOR,30,1,29,0,waterworks mechanic supervisor class code open...,30,0.967,0.033,0.967,0.033,"[[0.024902344, 0.03515625, 0.31054688, -0.0625...","[[0.11386212281775257, 0.1607465263309448, 1.4..."
3,2434,RECREATION FACILITY DIRECTOR,443,206,230,7,recreation facility director class code open d...,436,0.528,0.472,0.519,0.465,"[[0.06298828, -0.095214844, 0.23339844, 0.0217...","[[0.25839947222215826, -0.39060385335907644, 0..."
4,1775,WORKERS COMPENSATION CLAIMS ASSISTANT,116,95,19,2,worker compensation claim assistant class code...,114,0.167,0.833,0.164,0.819,"[[0.047607422, -0.13476562, 0.030151367, 0.018...","[[0.12369667951855838, -0.3501567543294576, 0...."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172,1861,UTILITY BUYER,126,64,58,4,utility buyer class code open date exam open c...,122,0.475,0.525,0.460,0.508,"[[0.087890625, 0.15820312, 0.05053711, -0.1767...","[[0.241505538205688, 0.4347099687702384, 0.138..."
173,3586,TRUCK AND EQUIPMENT DISPATCHER,75,0,75,0,truck equipment dispatcher class code open dat...,75,1.000,0.000,1.000,0.000,"[[0.1328125, -0.019165039, -0.12695312, -0.017...","[[0.5448422980188143, -0.07862154484278847, -0..."
174,1769,SENIOR WORKERS COMPENSATION ANALYST,44,26,18,0,senior worker compensation analyst class code ...,44,0.409,0.591,0.409,0.591,"[[0.076171875, 0.140625, -0.022705078, -0.0135...","[[0.15774507374732846, 0.29122167461045256, -0..."
175,1336,UTILITY EXECUTIVE SECRETARY,430,395,31,4,utility executive secretary class code open da...,426,0.073,0.927,0.072,0.919,"[[0.087890625, 0.15820312, 0.05053711, -0.1767...","[[0.241505538205688, 0.4347099687702384, 0.138..."


In [310]:
doc_vectors = [np.average(doc, axis=0) for doc in df["Embeddings"]]
len(doc_vectors)

177
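
A document vector here is the element-wise mean of the document's word vectors. The sketch below shows that average on toy 2-dimensional vectors, next to an IDF-weighted average that divides by the sum of the weights. Whether the notebook's scaled variant normalizes by the weight sum is not shown in this section, so the second function is an assumption about one common way to do it. All names and numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def weighted_mean_vector(vectors, weights):
    """Element-wise average of vectors, weighted and normalized by the weight sum."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, component)) / total
        for component in zip(*vectors)
    ]

# Toy 2-dimensional "word vectors" and IDF weights for a three-word document.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idf_weights = [1.0, 1.0, 2.0]

doc_vector = mean_vector(word_vectors)
weighted_doc_vector = weighted_mean_vector(word_vectors, idf_weights)
```

Without the division by the weight sum, documents with many rare words get vectors of larger magnitude, which may matter for models that are sensitive to feature scale.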

In [311]:
X_embeddings = np.array(doc_vectors)
X_embeddings.shape

(177, 300)

In [312]:
X_emb_train, X_emb_test, y_emb_train, y_emb_test = train_test_split(X_embeddings, y, test_size=0.3, random_state=428)

In [313]:
def objective(trial):

    classifier_name = trial.suggest_categorical("classifier", ["LinReg", "RandomForest", "DecTree", "SVR", "ElasticNet", "SGD", "BayesRidge", "CatBoost", "KernelRidge", "XGBoost", "LGBM", "GradientBoost"])
    # classifier_name = trial.suggest_categorical("classifier", ["RandomForest", "CatBoost", "LGBM"])
    
    # Step 2. Setup values for the hyperparameters:
    if classifier_name == 'LinReg':
        classifier_obj = LinearRegression()
    elif classifier_name == "RandomForest":
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)
    elif classifier_name == "DecTree":
        classifier_obj = DecisionTreeRegressor()
    elif classifier_name == "SVR":
        #svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        classifier_obj = SVR(gamma='auto')
    elif classifier_name == "GradientBoost":
    #     gb_n_estimators = trial.suggest_int("gb_n_estimators", 10, 1000)
    #     gb_lr = trial.suggest_float("gb_lr", 1e-10, 1e10, log=True)
        classifier_obj = GradientBoostingRegressor()
    elif classifier_name == "ElasticNet":
        classifier_obj = ElasticNet()
    elif classifier_name == "SGD":
        classifier_obj = SGDRegressor()
    elif classifier_name == "BayesRidge":
        classifier_obj = BayesianRidge()
    elif classifier_name == "CatBoost":
        #cb_lr = trial.suggest_float("cb_lr", 1e-10, 1e10, log=True)
        classifier_obj = CatBoostRegressor()
    elif classifier_name == "KernelRidge":
        classifier_obj = KernelRidge()
    elif classifier_name == "XGBoost":
        classifier_obj = XGBRegressor()
    elif classifier_name == "LGBM":
        lgbm_n_estimators = trial.suggest_int("lgbm_n_estimators", 10, 1000)
        lgbm_max_depth = trial.suggest_int("lgbm_max_depth", 2, 32, log=True)
        classifier_obj = LGBMRegressor(n_estimators=lgbm_n_estimators, max_depth=lgbm_max_depth)

    # Step 3: Scoring method:
    # score = cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    # accuracy = score.mean()
    # return accuracy

    classifier_obj.fit(X_emb_train, y_emb_train)
    loss_rmse = mean_squared_error(y_emb_test, classifier_obj.predict(X_emb_test), squared=False)
    return loss_rmse
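
The commented-out Step 3 points at cross-validated scoring. Scoring every trial on the same held-out split means the reported test RMSE is also the quantity being tuned, which can make it optimistic. A plain-Python sketch of how k-fold index splitting works, with a mean predictor standing in for the regressor; the helper names and toy targets are hypothetical, and this is not scikit-learn's implementation.

```python
import math

def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def cross_validated_rmse(y, n_splits):
    """Mean validation RMSE of a predictor that always outputs the training mean."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)
        mse = sum((y[i] - prediction) ** 2 for i in val_idx) / len(val_idx)
        scores.append(math.sqrt(mse))
    return sum(scores) / len(scores)

# Toy targets (hypothetical male shares).
y = [0.9, 0.1, 0.5, 0.7, 0.3, 0.6]
folds = list(kfold_indices(len(y), n_splits=3))
cv_rmse = cross_validated_rmse(y, n_splits=3)
```

Inside an Optuna objective, a cross-validated mean computed on the training data would replace the single test-set RMSE, leaving the test split for one final check after tuning.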

In [314]:
study = optuna.create_study()
study.optimize(objective, n_trials=50)

[I 2023-01-11 15:07:45,010] A new study created in memory with name: no-name-ac7dba95-6c75-47e4-b3e5-f72e2bc811c1


Learning rate set to 0.029403
0:	learn: 0.2446522	total: 34.7ms	remaining: 34.6s
1:	learn: 0.2416834	total: 45.8ms	remaining: 22.9s
2:	learn: 0.2391995	total: 59.1ms	remaining: 19.6s
3:	learn: 0.2366515	total: 69.5ms	remaining: 17.3s
4:	learn: 0.2345606	total: 80.7ms	remaining: 16.1s
5:	learn: 0.2322182	total: 90.9ms	remaining: 15.1s
6:	learn: 0.2291443	total: 102ms	remaining: 14.4s
7:	learn: 0.2267429	total: 117ms	remaining: 14.5s
8:	learn: 0.2244345	total: 129ms	remaining: 14.2s
9:	learn: 0.2216986	total: 139ms	remaining: 13.8s
10:	learn: 0.2193035	total: 149ms	remaining: 13.4s
11:	learn: 0.2167790	total: 161ms	remaining: 13.2s
12:	learn: 0.2142846	total: 176ms	remaining: 13.4s
13:	learn: 0.2110141	total: 188ms	remaining: 13.2s
14:	learn: 0.2083821	total: 199ms	remaining: 13.1s
15:	learn: 0.2064751	total: 211ms	remaining: 13s
16:	learn: 0.2037921	total: 225ms	remaining: 13s
17:	learn: 0.2019035	total: 238ms	remaining: 13s
18:	learn: 0.1999003	total: 250ms	remaining: 12.9s
19:	learn: 

[I 2023-01-11 15:07:58,359] Trial 0 finished with value: 0.13478579642531957 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.13478579642531957.
[I 2023-01-11 15:07:58,385] Trial 1 finished with value: 0.201355591116994 and parameters: {'classifier': 'DecTree'}. Best is trial 0 with value: 0.13478579642531957.


993:	learn: 0.0000580	total: 12.8s	remaining: 77.3ms
994:	learn: 0.0000575	total: 12.8s	remaining: 64.4ms
995:	learn: 0.0000570	total: 12.8s	remaining: 51.6ms
996:	learn: 0.0000564	total: 12.8s	remaining: 38.7ms
997:	learn: 0.0000559	total: 12.9s	remaining: 25.8ms
998:	learn: 0.0000554	total: 12.9s	remaining: 12.9ms
999:	learn: 0.0000549	total: 12.9s	remaining: 0us


[I 2023-01-11 15:07:58,629] Trial 2 finished with value: 0.13066366045353808 and parameters: {'classifier': 'LGBM', 'lgbm_n_estimators': 952, 'lgbm_max_depth': 2}. Best is trial 2 with value: 0.13066366045353808.
[I 2023-01-11 15:07:58,654] Trial 3 finished with value: 0.23917814529171597 and parameters: {'classifier': 'DecTree'}. Best is trial 2 with value: 0.13066366045353808.
[I 2023-01-11 15:07:58,774] Trial 4 finished with value: 0.1339425712608728 and parameters: {'classifier': 'BayesRidge'}. Best is trial 2 with value: 0.13066366045353808.
[I 2023-01-11 15:07:58,795] Trial 5 finished with value: 0.2028003944769339 and parameters: {'classifier': 'DecTree'}. Best is trial 2 with value: 0.13066366045353808.
[I 2023-01-11 15:07:58,803] Trial 6 finished with value: 0.2407105843604681 and parameters: {'classifier': 'ElasticNet'}. Best is trial 2 with value: 0.13066366045353808.
[I 2023-01-11 15:08:02,045] Trial 

[I 2023-01-11 15:08:20,823] Trial 29 finished with value: 0.13478579642531957 and parameters: {'classifier': 'CatBoost'}. Best is trial 22 with value: 0.11575383259932427.


[I 2023-01-11 15:08:23,653] Trial 30 finished with value: 0.15059522323441837 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 987, 'rf_max_depth': 2}. Best is trial 22 with value: 0.11575383259932427.
[I 2023-01-11 15:08:24,184] Trial 31 finished with value: 0.11813240248720833 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.
[I 2023-01-11 15:08:24,711] Trial 32 finished with value: 0.12248260653197979 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.
[I 2023-01-11 15:08:25,234] Trial 33 finished with value: 0.12267063342415196 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.
[I 2023-01-11 15:08:25,309] Trial 34 finished with value: 0.1339425712608728 and parameters: {'classifier': 'BayesRidge'}. Best is trial 22 with value: 0.11575383259932427.
[I 2023

[32m[I 2023-01-11 15:08:38,001][0m Trial 40 finished with value: 0.13478579642531957 and parameters: {'classifier': 'CatBoost'}. Best is trial 22 with value: 0.11575383259932427.[0m


998:	learn: 0.0000554	total: 12.2s	remaining: 12.2ms
999:	learn: 0.0000549	total: 12.2s	remaining: 0us


[32m[I 2023-01-11 15:08:38,548][0m Trial 41 finished with value: 0.12055810448243354 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.[0m
[32m[I 2023-01-11 15:08:39,190][0m Trial 42 finished with value: 0.12484859132040112 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.[0m
[32m[I 2023-01-11 15:08:39,708][0m Trial 43 finished with value: 0.12481180491923174 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.[0m
[32m[I 2023-01-11 15:08:39,716][0m Trial 44 finished with value: 0.24609940693277896 and parameters: {'classifier': 'SVR'}. Best is trial 22 with value: 0.11575383259932427.[0m
[32m[I 2023-01-11 15:08:40,235][0m Trial 45 finished with value: 0.12412646626583827 and parameters: {'classifier': 'GradientBoost'}. Best is trial 22 with value: 0.11575383259932427.[0m
[32m[I 2023-01-11 15:08:40,311][0m Trial 46 finished with 

In [315]:
scaled_doc_vectors = [np.average(doc, axis=0) for doc in df["Scaled embeddings"]]
len(scaled_doc_vectors)

177

In [316]:
X_scaled_embeddings = np.array(scaled_doc_vectors)
X_scaled_embeddings.shape

(177, 300)

In [318]:
X_scal_emb_train, X_scal_emb_test, y_scal_emb_train, y_scal_emb_test = train_test_split(X_scaled_embeddings, y, test_size=0.3, random_state=429)

In [321]:
def objective(trial):
    # Step 1. Let Optuna pick which regressor to evaluate in this trial:
    classifier_name = trial.suggest_categorical("classifier", ["LinReg", "RandomForest", "DecTree", "SVR", "ElasticNet", "SGD", "BayesRidge", "CatBoost", "KernelRidge", "XGBoost", "LGBM", "GradientBoost"])
    # classifier_name = trial.suggest_categorical("classifier", ["RandomForest", "CatBoost", "LGBM"])

    # Step 2. Setup values for the hyperparameters:
    if classifier_name == "LinReg":
        classifier_obj = LinearRegression()
    elif classifier_name == "RandomForest":
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)
    elif classifier_name == "DecTree":
        classifier_obj = DecisionTreeRegressor()
    elif classifier_name == "SVR":
        # svr_c = trial.suggest_float('svr_c', 1e-10, 1e10, log=True)
        classifier_obj = SVR(gamma='auto')
    elif classifier_name == "GradientBoost":
        # gb_n_estimators = trial.suggest_int("gb_n_estimators", 10, 1000)
        # gb_lr = trial.suggest_float("gb_lr", 1e-10, 1e10, log=True)
        classifier_obj = GradientBoostingRegressor()
    elif classifier_name == "ElasticNet":
        classifier_obj = ElasticNet()
    elif classifier_name == "SGD":
        classifier_obj = SGDRegressor()
    elif classifier_name == "BayesRidge":
        classifier_obj = BayesianRidge()
    elif classifier_name == "CatBoost":
        # cb_lr = trial.suggest_float("cb_lr", 1e-10, 1e10, log=True)
        classifier_obj = CatBoostRegressor()
    elif classifier_name == "KernelRidge":
        classifier_obj = KernelRidge()
    elif classifier_name == "XGBoost":
        classifier_obj = XGBRegressor()
    elif classifier_name == "LGBM":
        lgbm_n_estimators = trial.suggest_int("lgbm_n_estimators", 10, 1000)
        lgbm_max_depth = trial.suggest_int("lgbm_max_depth", 2, 32, log=True)
        classifier_obj = LGBMRegressor(n_estimators=lgbm_n_estimators, max_depth=lgbm_max_depth)

    # Step 3. Scoring method (cross-validation alternative, currently unused):
    # score = cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    # accuracy = score.mean()
    # return accuracy

    # Fit on the scaled doc vectors and return the test-set RMSE, which Optuna minimizes.
    # Note: scikit-learn 1.4+ offers root_mean_squared_error as a replacement for squared=False.
    classifier_obj.fit(X_scal_emb_train, y_scal_emb_train)
    loss_rmse = mean_squared_error(y_scal_emb_test, classifier_obj.predict(X_scal_emb_test), squared=False)
    return loss_rmse

In [322]:
study = optuna.create_study()
study.optimize(objective, n_trials=50)

[I 2023-01-11 15:18:37,787] A new study created in memory with name: no-name-7fdf4a33-1013-4ea4-bad4-7f79fe779b11


Learning rate set to 0.029403
0:	learn: 0.2366744	total: 15.3ms	remaining: 15.3s
1:	learn: 0.2335283	total: 27ms	remaining: 13.5s
2:	learn: 0.2305875	total: 38.4ms	remaining: 12.8s
3:	learn: 0.2282271	total: 50.5ms	remaining: 12.6s
4:	learn: 0.2255329	total: 63.5ms	remaining: 12.6s
5:	learn: 0.2227201	total: 76.2ms	remaining: 12.6s
6:	learn: 0.2195469	total: 97.7ms	remaining: 13.9s
7:	learn: 0.2170205	total: 118ms	remaining: 14.6s
8:	learn: 0.2144726	total: 147ms	remaining: 16.2s
9:	learn: 0.2123150	total: 163ms	remaining: 16.1s
10:	learn: 0.2103379	total: 178ms	remaining: 16s
11:	learn: 0.2082409	total: 199ms	remaining: 16.4s
12:	learn: 0.2053252	total: 212ms	remaining: 16.1s
13:	learn: 0.2029725	total: 223ms	remaining: 15.7s
14:	learn: 0.2007924	total: 236ms	remaining: 15.5s
15:	learn: 0.1985794	total: 253ms	remaining: 15.5s
16:	learn: 0.1958777	total: 266ms	remaining: 15.4s
17:	learn: 0.1937542	total: 279ms	remaining: 15.2s
18:	learn: 0.1911913	total: 309ms	remaining: 16s
19:	learn:

[I 2023-01-11 15:18:50,313] Trial 0 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.


987:	learn: 0.0000635	total: 12s	remaining: 146ms
988:	learn: 0.0000631	total: 12s	remaining: 134ms
989:	learn: 0.0000626	total: 12.1s	remaining: 122ms
990:	learn: 0.0000622	total: 12.1s	remaining: 110ms
991:	learn: 0.0000617	total: 12.1s	remaining: 97.4ms
992:	learn: 0.0000612	total: 12.1s	remaining: 85.2ms
993:	learn: 0.0000606	total: 12.1s	remaining: 73ms
994:	learn: 0.0000603	total: 12.1s	remaining: 60.9ms
995:	learn: 0.0000598	total: 12.1s	remaining: 48.7ms
996:	learn: 0.0000594	total: 12.1s	remaining: 36.5ms
997:	learn: 0.0000589	total: 12.1s	remaining: 24.3ms
998:	learn: 0.0000588	total: 12.2s	remaining: 12.2ms
999:	learn: 0.0000584	total: 12.2s	remaining: 0us


[I 2023-01-11 15:18:50,328] Trial 1 finished with value: 0.2568814757677022 and parameters: {'classifier': 'DecTree'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:18:50,829] Trial 2 finished with value: 0.17522415089315815 and parameters: {'classifier': 'GradientBoost'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:19:03,396] Trial 3 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:06,764] Trial 4 finished with value: 0.17511369892159168 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 605, 'rf_max_depth': 6}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:06,770] Trial 5 finished with value: 0.19108903552471038 and parameters: {'classifier': 'KernelRidge'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:09,651] Trial 6 finished with value: 0.1747276744766537 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 422, 'rf_max_depth': 13}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:09,684] Trial 7 finished with value: 0.252210984409186 and parameters: {'classifier': 'LinReg'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:19:22,280] Trial 8 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:22,357] Trial 9 finished with value: 0.18541529076469904 and parameters: {'classifier': 'BayesRidge'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:22,468] Trial 10 finished with value: 0.18541383551948765 and parameters: {'classifier': 'LGBM', 'lgbm_n_estimators': 319, 'lgbm_max_depth': 3}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:19:35,231] Trial 11 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:48,262] Trial 12 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:48,269] Trial 13 finished with value: 0.2584963170766097 and parameters: {'classifier': 'SVR'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:48,410] Trial 14 finished with value: 0.19388379531731945 and parameters: {'classifier': 'XGBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:48,419] Trial 15 finished with value: 0.25803767165635105 and parameters: {'classifier': 'ElasticNet'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:48,428] Trial 16 finished with value: 0.24728111454891297 and parameters: {'classifier': 'SGD'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:20:02,756] Trial 17 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:16,348] Trial 18 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:16,620] Trial 19 finished with value: 0.1835007486841977 and parameters: {'classifier': 'LGBM', 'lgbm_n_estimators': 940, 'lgbm_max_depth': 30}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:17,213] Trial 20 finished with value: 0.17820549757276377 and parameters: {'classifier': 'GradientBoost'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:20:30,596] Trial 21 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:44,561] Trial 22 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:20:59,558] Trial 23 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:59,598] Trial 24 finished with value: 0.252210984409186 and parameters: {'classifier': 'LinReg'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:59,605] Trial 25 finished with value: 0.19108903552471038 and parameters: {'classifier': 'KernelRidge'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:20:59,614] Trial 26 finished with value: 0.2584963170766097 and parameters: {'classifier': 'SVR'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:59,619] Trial 27 finished with value: 0.25803767165635105 and parameters: {'classifier': 'ElasticNet'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:59,704] Trial 28 finished with value: 0.18541529076469904 and parameters: {'classifier': 'BayesRidge'}. Best is trial 0 with value: 0.169228228676532.


[I 2023-01-11 15:21:13,882] Trial 29 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:21:13,901] Trial 30 finished with value: 0.24231117123175253 and parameters: {'classifier': 'DecTree'}. Best is trial 0 with value: 0.169228228676532.


[W 2023-01-11 15:21:20,517] Trial 31 failed because of the following error: KeyboardInterrupt('')
Traceback (most recent call last):
  File "c:\Users\britt\Desktop\YH\Applicerad AI\job_discrimination_sandbox\venv\lib\site-packages\optuna\study\_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "C:\Users\britt\AppData\Local\Temp\ipykernel_42288\465896623.py", line 45, in objective
    classifier_obj.fit(X_scal_emb_train, y_scal_emb_train)
  File "c:\Users\britt\Desktop\YH\Applicerad AI\job_discrimination_sandbox\venv\lib\site-packages\catboost\core.py", line 5730, in fit
    return self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline,
  File "c:\Users\britt\Desktop\YH\Applicerad AI\job_discrimination_sandbox\venv\lib\site-packages\catboost\core.py", line 2355, in _fit
    self._train(
  File "c:\Users\britt\Desktop\YH\Applicerad AI\job_discrimination_sandbox\venv\lib\site-packages

KeyboardInterrupt: 
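
Because the same regressor is sampled many times, the trial log is easier to read when reduced to the best RMSE per regressor. The following is a minimal sketch, assuming the log text has been saved as a plain string; the helper name `best_rmse_per_regressor` and the regular expression are illustrative and not part of the notebook. The sample lines are copied from the scaled doc vector study above.

```python
import re

# Matches "Trial N finished with value: V and parameters: {'classifier': 'NAME'"
# in an Optuna log line; any timestamp prefix is ignored.
TRIAL_RE = re.compile(
    r"Trial (\d+) finished with value: ([0-9.eE+-]+) "
    r"and parameters: \{'classifier': '(\w+)'"
)


def best_rmse_per_regressor(log_text):
    """Return {regressor name: lowest RMSE seen} for all trials found in log_text."""
    best = {}
    for match in TRIAL_RE.finditer(log_text):
        value = float(match.group(2))
        name = match.group(3)
        if name not in best or value < best[name]:
            best[name] = value
    return best


# Sample lines copied from the study output above (scaled doc vectors).
sample_log = """
[I 2023-01-11 15:18:50,313] Trial 0 finished with value: 0.169228228676532 and parameters: {'classifier': 'CatBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:18:50,328] Trial 1 finished with value: 0.2568814757677022 and parameters: {'classifier': 'DecTree'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:18:50,829] Trial 2 finished with value: 0.17522415089315815 and parameters: {'classifier': 'GradientBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:19:06,764] Trial 4 finished with value: 0.17511369892159168 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 605, 'rf_max_depth': 6}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:20:17,213] Trial 20 finished with value: 0.17820549757276377 and parameters: {'classifier': 'GradientBoost'}. Best is trial 0 with value: 0.169228228676532.
[I 2023-01-11 15:21:13,901] Trial 30 finished with value: 0.24231117123175253 and parameters: {'classifier': 'DecTree'}. Best is trial 0 with value: 0.169228228676532.
"""

best = best_rmse_per_regressor(sample_log)
# Print regressors from lowest to highest RMSE.
for name, value in sorted(best.items(), key=lambda item: item[1]):
    print(f"{name:15s} {value:.4f}")
```

If the `study` object is still in memory, `study.trials_dataframe()` should give the same information without parsing text.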

In [323]:
def objective(trial):
    """Tune only the learning rate of GradientBoostingRegressor on the doc-vector split (X_emb_*)."""
    # The commented-out suggestions below are leftover LightGBM hyperparameters, kept for reference.
    # boosting_type = trial.suggest_categorical("boosting_type", ["gbdt", "dart", "rf"])
    # num_leaves = trial.suggest_int("num_leaves", 2, 100)
    # max_depth = trial.suggest_int("max_depth", -1, 500)
    # Note: scikit-learn's default learning_rate is 0.1, which lies outside this search range.
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    # n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    # subsample_for_bin = trial.suggest_int("subsample_for_bin", 100000, 500000)
    # min_split_gain = trial.suggest_float("min_split_gain", 1e-10, 1e10, log=True)
    # min_child_weight = trial.suggest_float("min_child_weight", 1e-10, 1e10, log=True)
    # min_child_samples = trial.suggest_int("min_child_samples", 5, 50)
    # subsample = trial.suggest_float("subsample", 1e-10, 1e10, log=True)
    # subsample_freq = trial.suggest_int("subsamples_freq", 0, 50)
    # colsample_bytree = trial.suggest_float("colsample_bytree", 1e-10, 1e10, log=True)
    # reg_alpha = trial.suggest_float("reg_alpha", 1e-10, 1e10, log=True)
    # reg_lambda = trial.suggest_float("reg_lambda", 1e-10, 1e10, log=True)

    # regressor = LGBMRegressor(num_leaves=num_leaves, max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators,
    #                           subsample_for_bin=subsample_for_bin, min_split_gain=min_split_gain, min_child_weight=min_child_weight,
    #                           min_child_samples=min_child_samples, subsample=subsample, subsample_freq=subsample_freq,
    #                           colsample_bytree=colsample_bytree, reg_alpha=reg_alpha, reg_lambda=reg_lambda)

    regressor = GradientBoostingRegressor(learning_rate=learning_rate)

    # Fit on the training split and return the test-set RMSE, which Optuna minimizes.
    regressor.fit(X_emb_train, y_emb_train)
    loss = mean_squared_error(y_emb_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [330]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-11 15:24:18,888] A new study created in memory with name: no-name-84398bf1-51a2-426c-bed3-cd9752130fdc
[I 2023-01-11 15:24:19,444] Trial 0 finished with value: 0.15486539423060675 and parameters: {'learning_rate': 0.013000000000000001}. Best is trial 0 with value: 0.15486539423060675.
[I 2023-01-11 15:24:20,018] Trial 1 finished with value: 0.15541916833922326 and parameters: {'learning_rate': 0.013000000000000001}. Best is trial 0 with value: 0.15486539423060675.
[I 2023-01-11 15:24:20,597] Trial 2 finished with value: 0.21811697300882568 and parameters: {'learning_rate': 0.002}. Best is trial 0 with value: 0.15486539423060675.
[I 2023-01-11 15:24:21,123] Trial 3 finished with value: 0.1654416588533347 and parameters: {'learning_rate': 0.010000000000000002}. Best is trial 0 with value: 0.15486539423060675.
[I 2023-01-11 15:24:21,637] Trial 4 finished with value: 0.18381883214395744 and parameters: {'l

In [348]:
regressor = GradientBoostingRegressor()
regressor.fit(X_emb_train, y_emb_train)

In [349]:
y_emb_pred = regressor.predict(X_emb_test)  # predict once, reuse for all three metrics
rmse = mean_squared_error(y_emb_test, y_emb_pred, squared=False)
mae = mean_absolute_error(y_emb_test, y_emb_pred)
r2 = r2_score(y_emb_test, y_emb_pred)

In [350]:
rmse

0.11996961109211347

In [351]:
mae

0.09193783302023194

In [352]:
r2

0.7515661242013418
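
For reference, the three metrics reported above can be written out by hand. This is a minimal sketch in plain Python of the standard definitions of RMSE, MAE and R2; the function names and the toy numbers are made up for illustration and are not taken from the notebook's data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy example (hypothetical shares, not notebook data).
y_true_toy = [0.2, 0.5, 0.8]
y_pred_toy = [0.3, 0.5, 0.7]

print(rmse(y_true_toy, y_pred_toy))
print(mae(y_true_toy, y_pred_toy))
print(r2(y_true_toy, y_pred_toy))
```

Since the target is a share between 0 and 1, RMSE and MAE can be read directly as errors in share units; an MAE of about 0.09 means predictions are off by roughly nine percentage points on average.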

In [355]:
gradient_boost_results = []
# No random_state is set, so results vary between fits; repeat 20 times to see the spread.
for _ in range(20):
    regressor = GradientBoostingRegressor()
    regressor.fit(X_emb_train, y_emb_train)
    y_emb_pred = regressor.predict(X_emb_test)  # predict once, reuse for all three metrics
    rmse = mean_squared_error(y_emb_test, y_emb_pred, squared=False)
    mae = mean_absolute_error(y_emb_test, y_emb_pred)
    r2 = r2_score(y_emb_test, y_emb_pred)
    gradient_boost_result = {"RMSE": rmse, "MAE": mae, "R2": r2}
    gradient_boost_results.append(gradient_boost_result)

In [356]:
gradient_boost_results

[{'RMSE': 0.12452314603770737,
  'MAE': 0.09679948099756877,
  'R2': 0.7323492394853949},
 {'RMSE': 0.1165603674012914,
  'MAE': 0.09059773997755426,
  'R2': 0.7654852689304208},
 {'RMSE': 0.11781244192704424,
  'MAE': 0.09094108919761311,
  'R2': 0.7604199625619946},
 {'RMSE': 0.12224856892876307,
  'MAE': 0.09411846465705304,
  'R2': 0.7420379137761326},
 {'RMSE': 0.11780716387183848,
  'MAE': 0.09049623240399017,
  'R2': 0.7604414286876028},
 {'RMSE': 0.1212372626472072,
  'MAE': 0.09512538719779261,
  'R2': 0.7462882639555342},
 {'RMSE': 0.11981437120555001,
  'MAE': 0.09309908145824233,
  'R2': 0.752208651816858},
 {'RMSE': 0.12250918006957837,
  'MAE': 0.0945264058947273,
  'R2': 0.7409368873555509},
 {'RMSE': 0.11963902149272648,
  'MAE': 0.0909150097516516,
  'R2': 0.7529334120674965},
 {'RMSE': 0.12452185925500375,
  'MAE': 0.09680418117558516,
  'R2': 0.7323547710929826},
 {'RMSE': 0.11772301895875273,
  'MAE': 0.0922026815695091,
  'R2': 0.760783520553772},
 {'RMSE': 0.12551

In [357]:
gb_df = pd.DataFrame(gradient_boost_results)

In [361]:
gb_df = gb_df.sort_values(by=["RMSE"], ignore_index=True)

In [362]:
gb_df

       RMSE       MAE        R2
0  0.116527  0.091067  0.765618
1  0.116560  0.090598  0.765485
2  0.116913  0.091649  0.764065
3  0.117723  0.092203  0.760784
4  0.117807  0.090496  0.760441
5  0.117812  0.090941  0.760420
6  0.118967  0.092872  0.755703
7  0.119639  0.090915  0.752933
8  0.119814  0.093099  0.752209
9  0.121075  0.093496  0.746967
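
Since the score changes from fit to fit, a mean and spread describe the model better than the single best run. A minimal sketch using only the standard library follows; the RMSE values are copied from the ten rows displayed above, which are only the best half of the 20 sorted runs, so the mean computed here is optimistic compared with the full set.

```python
from statistics import mean, stdev

# RMSE values copied from the ten rows shown in the sorted gb_df output above.
rmse_runs = [0.116527, 0.11656, 0.116913, 0.117723, 0.117807,
             0.117812, 0.118967, 0.119639, 0.119814, 0.121075]

rmse_mean = mean(rmse_runs)
rmse_std = stdev(rmse_runs)                      # sample standard deviation
rmse_spread = max(rmse_runs) - min(rmse_runs)    # best-to-worst gap among these rows

print(f"mean RMSE:   {rmse_mean:.4f}")
print(f"std RMSE:    {rmse_std:.4f}")
print(f"spread RMSE: {rmse_spread:.4f}")
```

With the full DataFrame in memory, `gb_df.describe()` would give the same kind of summary over all 20 runs.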


In [331]:
def objective(trial):
    # boosting_type = trial.suggest_categorical("boosting_type", ["gbdt", "dart", "rf"])
    num_leaves = trial.suggest_int("num_leaves", 2, 100)
    max_depth = trial.suggest_int("max_depth", -1, 500)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    subsample_for_bin = trial.suggest_int("subsample_for_bin", 100000, 500000)
    min_split_gain = trial.suggest_float("min_split_gain", 1e-10, 1e10, log=True)
    # min_child_weight = trial.suggest_float("min_child_weight", 1e-10, 1e10, log=True)
    # min_child_samples = trial.suggest_int("min_child_samples", 5, 50)
    # subsample = trial.suggest_float("subsample", 1e-10, 1e10, log=True)
    # subsample_freq = trial.suggest_int("subsamples_freq", 0, 50)
    # colsample_bytree = trial.suggest_float("colsample_bytree", 1e-10, 1e10, log=True)
    # reg_alpha = trial.suggest_float("reg_alpha", 1e-10, 1e10, log=True)
    # reg_lambda = trial.suggest_float("reg_lambda", 1e-10, 1e10, log=True)
    
    # regressor = LGBMRegressor(num_leaves=num_leaves, max_depth=max_depth, learning_rate=learning_rate, n_estimators=n_estimators, 
    #                           subsample_for_bin=subsample_for_bin, min_split_gain=min_split_gain, min_child_weight=min_child_weight, 
    #                           min_child_samples=min_child_samples, subsample=subsample, subsample_freq=subsample_freq, 
    #                           colsample_bytree=colsample_bytree, reg_alpha=reg_alpha, reg_lambda=reg_lambda)

    regressor = LGBMRegressor(
        num_leaves=num_leaves,
        max_depth=max_depth,
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        subsample_for_bin=subsample_for_bin,
        min_split_gain=min_split_gain,
    )

    # regressor = LGBMRegressor()


    regressor.fit(X_emb_train, y_emb_train)
    loss = mean_squared_error(y_emb_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [332]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-11 15:30:07,152] A new study created in memory with name: no-name-c7b68909-fe4a-4ac5-80c6-a7f503443db9
[I 2023-01-11 15:30:07,336] Trial 0 finished with value: 0.153380682821916 and parameters: {'num_leaves': 23, 'max_depth': 65, 'learning_rate': 0.004, 'n_estimators': 448, 'subsample_for_bin': 427524, 'min_split_gain': 6.237759850869866e-08}. Best is trial 0 with value: 0.153380682821916.
[I 2023-01-11 15:30:07,435] Trial 1 finished with value: 0.14090888908448823 and parameters: {'num_leaves': 13, 'max_depth': 101, 'learning_rate': 0.011, 'n_estimators': 466, 'subsample_for_bin': 409295, 'min_split_gain': 0.03763823250904023}. Best is trial 1 with value: 0.14090888908448823.
[I 2023-01-11 15:30:07,555] Trial 2 finished with value: 0.14677337368181245 and parameters: {'num_leaves': 83, 'max_depth': 160, 'learning_rate': 0.008, 'n_estimators': 663, 'subsample_for_bin': 190202, 'min_split_gain': 0.107740219388794}. Best is trial
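
The objective above scores every trial on the test split, so the best trial value is likely an optimistic estimate. One alternative is to score each trial with cross-validation on the training portion only. A minimal sketch, assuming a hypothetical helper `cv_rmse` and synthetic data standing in for `X_emb_train` / `y_emb_train`; GradientBoostingRegressor is used here only to keep the example dependent on scikit-learn alone:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score


def cv_rmse(regressor, X_train, y_train, folds=5):
    """Mean RMSE over k folds of the training data only (hypothetical helper)."""
    scores = cross_val_score(
        regressor, X_train, y_train,
        scoring="neg_root_mean_squared_error", cv=folds,
    )
    # The scorer returns negated RMSE, so flip the sign back.
    return -scores.mean()


# Synthetic stand-in for the training embeddings (assumption, not the notebook's data).
X_demo, y_demo = make_regression(n_samples=120, n_features=20, noise=0.1, random_state=0)

loss = cv_rmse(GradientBoostingRegressor(n_estimators=50, random_state=0), X_demo, y_demo)
print(loss)
```

Inside `objective`, `return cv_rmse(regressor, X_emb_train, y_emb_train)` would then replace the fit and predict on the test split, leaving `X_emb_test` for a single final evaluation.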

In [368]:
def objective(trial):
    n_iter = trial.suggest_int("n_iter", 1, 1000)
    tol = trial.suggest_float("tol", 1e-10, 1e10, log=True)
    alpha_1 = trial.suggest_float("alpha_1", 1e-10, 1e10, log=True)
    alpha_2 = trial.suggest_float("alpha_2", 1e-10, 1e10, log=True)
    lambda_1 = trial.suggest_float("lambda_1", 1e-10, 1e10, log=True)
    lambda_2 = trial.suggest_float("lambda_2", 1e-10, 1e10, log=True)
    comp_score = trial.suggest_categorical("cpm_score", [True, False])

    regressor = BayesianRidge(n_iter=n_iter, tol=tol, alpha_1=alpha_1, alpha_2=alpha_2, lambda_1=lambda_1, lambda_2=lambda_2, compute_score=comp_score)

    regressor.fit(X_emb_train, y_emb_train)
    loss = mean_squared_error(y_emb_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [369]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-11 15:49:21,772] A new study created in memory with name: no-name-3650365e-e1cf-4740-afc7-a32cfbbf13d7
[I 2023-01-11 15:49:22,251] Trial 0 finished with value: 0.19231289909063734 and parameters: {'n_iter': 569, 'tol': 6.609163796184268e-07, 'alpha_1': 59492963.29376776, 'alpha_2': 1.6903753239804305e-05, 'lambda_1': 1.732900267088442e-06, 'lambda_2': 9.305152591012232e-07, 'cpm_score': True}. Best is trial 0 with value: 0.19231289909063734.
[I 2023-01-11 15:49:22,307] Trial 1 finished with value: 0.19423129061985966 and parameters: {'n_iter': 793, 'tol': 13732731.79320285, 'alpha_1': 0.2623913040533534, 'alpha_2': 0.00014578825705604845, 'lambda_1': 1.3370739185271898e-08, 'lambda_2': 96802.4844609885, 'cpm_score': True}. Best is trial 0 with value: 0.19231289909063734.
[I 2023-01-11 15:49:22,399] Trial 2 finished with value: 0.17863534588425034 and parameters: {'n_iter': 680, 'tol': 0.010744192131956098, 'alpha_1': 5.65315320
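
Note on rerunning this cell: as far as I know, later scikit-learn releases rename BayesianRidge's `n_iter` argument to `max_iter`, so the constructor call above may fail on a newer install. A small sketch of a version-tolerant constructor, assuming a hypothetical helper `make_bayesian_ridge`:

```python
from sklearn.linear_model import BayesianRidge


def make_bayesian_ridge(iterations, **kwargs):
    """Build a BayesianRidge using whichever iteration argument this install exposes."""
    available = BayesianRidge().get_params()
    # Prefer the newer name when it exists, otherwise fall back to the older one.
    name = "max_iter" if "max_iter" in available else "n_iter"
    return BayesianRidge(**{name: iterations}, **kwargs)


# Values taken from the best trial printed in this notebook.
regressor = make_bayesian_ridge(406, tol=0.5736883991262077, compute_score=True)
print(regressor)
```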

In [365]:
study.best_trial

FrozenTrial(number=83, values=[0.13393849059593912], datetime_start=datetime.datetime(2023, 1, 11, 15, 47, 15, 947368), datetime_complete=datetime.datetime(2023, 1, 11, 15, 47, 16, 21492), params={'n_iter': 406, 'tol': 0.5736883991262077, 'alpha_1': 2.4867833891504907e-05, 'alpha_2': 0.0067510304118338195, 'lambda_1': 7.682387720294421e-05, 'lambda_2': 5.479293529614971e-06, 'cpm_score': True}, distributions={'n_iter': IntDistribution(high=1000, log=False, low=1, step=1), 'tol': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'alpha_1': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'alpha_2': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'lambda_1': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'lambda_2': FloatDistribution(high=10000000000.0, log=True, low=1e-10, step=None), 'cpm_score': CategoricalDistribution(choices=(True, False))}, user_attrs={}, system_attrs={}, intermediate_values={

In [370]:
def objective(trial):
    param = {}
    param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    param['depth'] = trial.suggest_int('depth', 9, 15)
    param['l2_leaf_reg'] = trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5)
    param['min_child_samples'] = trial.suggest_categorical('min_child_samples', [1, 4, 8, 16, 32])
    param['grow_policy'] = 'Depthwise'
    #param['iterations'] = 10000
    #param['use_best_model'] = True
    param['eval_metric'] = 'RMSE'
    param['od_type'] = 'iter'
    param['od_wait'] = 20
    param['random_state'] = 1
    param['logging_level'] = 'Silent'
    
    regressor = CatBoostRegressor(**param)

    regressor.fit(X_emb_train, y_emb_train, early_stopping_rounds=100)
    loss = mean_squared_error(y_emb_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [371]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-11 15:51:57,640] A new study created in memory with name: no-name-0f8b9358-4b9d-48a2-9f38-1153d564b19c
[I 2023-01-11 15:52:04,042] Trial 0 finished with value: 0.12925771615448145 and parameters: {'learning_rate': 0.011, 'depth': 9, 'l2_leaf_reg': 5.5, 'min_child_samples': 32}. Best is trial 0 with value: 0.12925771615448145.
[I 2023-01-11 15:52:20,105] Trial 1 finished with value: 0.12985634696498496 and parameters: {'learning_rate': 0.009000000000000001, 'depth': 10, 'l2_leaf_reg': 3.0, 'min_child_samples': 8}. Best is trial 0 with value: 0.12925771615448145.
[I 2023-01-11 15:52:29,607] Trial 2 finished with value: 0.1451802864741166 and parameters: {'learning_rate': 0.003, 'depth': 10, 'l2_leaf_reg': 5.5, 'min_child_samples': 16}. Best is trial 0 with value: 0.12925771615448145.
[I 2023-01-11 15:56:00,912] Trial 3 finished with value: 0.16628859240595353 and parameters: {'learning_rate': 0.015, 'depth': 11, 'l2_

In [372]:
study.best_params

{'learning_rate': 0.011,
 'depth': 13,
 'l2_leaf_reg': 3.0,
 'min_child_samples': 32}

In [381]:
param = {}
param['learning_rate'] = 0.011
param['depth'] = 13
param['l2_leaf_reg'] = 3.0
param['min_child_samples'] = 32
param['grow_policy'] = 'Depthwise'
param['eval_metric'] = 'RMSE'
param['od_type'] = 'iter'
param['od_wait'] = 20
param['random_state'] = 1
param['logging_level'] = 'Silent'

regressor = CatBoostRegressor(**param)

regressor.fit(X_emb_train, y_emb_train, early_stopping_rounds=100)

<catboost.core.CatBoostRegressor at 0x271a13562c0>
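
The `fit` call above passes `early_stopping_rounds` but no `eval_set`, so as far as I can tell the overfitting detector has no validation metric to monitor and training runs for the full number of iterations. One way to give it something to monitor is to hold out part of the training data. A minimal sketch of that split, using synthetic arrays as a stand-in for the training embeddings (the shapes are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_emb_train / y_emb_train (assumed: 123 rows, 300 dimensions).
rng = np.random.default_rng(1)
X_train_demo = rng.normal(size=(123, 300))
y_train_demo = rng.uniform(0, 1, size=123)

# Hold out 20 % of the training rows for validation; the test split stays untouched.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train_demo, y_train_demo, test_size=0.2, random_state=1
)
print(X_fit.shape, X_val.shape)
```

The held-out pair would then go into the CatBoost call as `regressor.fit(X_fit, y_fit, eval_set=(X_val, y_val), early_stopping_rounds=100)`.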

In [382]:
y_pred = regressor.predict(X_emb_test)  # predict once and reuse for all three metrics
rmse = mean_squared_error(y_emb_test, y_pred, squared=False)
mae = mean_absolute_error(y_emb_test, y_pred)
r2 = r2_score(y_emb_test, y_pred)

In [383]:
rmse

0.12017730152530944

In [384]:
mae

0.09486313130428797

In [385]:
r2

0.7507052061527454
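
The three metrics are computed the same way after every model in this notebook, so they can be bundled into one helper that returns the same `{'RMSE', 'MAE', 'R2'}` dictionary layout as the GradientBoosting results. A minimal sketch for the single-output case, assuming a hypothetical helper `regression_report`; the targets are the first five male shares from `df`, the predictions are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return RMSE, MAE and R2 for one target column (hypothetical helper)."""
    return {
        # Taking the square root of the MSE avoids relying on the `squared` argument.
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2_score(y_true, y_pred)),
    }


y_true = [0.978, 0.990, 0.967, 0.528, 0.167]  # male share, first five rows of df
y_pred = [0.900, 0.950, 0.900, 0.600, 0.250]  # hypothetical predictions
report = regression_report(y_true, y_pred)
print(report)
```

With the fitted model this becomes `regression_report(y_emb_test, regressor.predict(X_emb_test))`.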

In [386]:
def objective(trial):
    rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
    rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
    regressor = RandomForestRegressor(n_estimators=rf_n_estimators, max_depth=rf_max_depth)

    regressor.fit(X_emb_train, y_emb_train)
    loss = mean_squared_error(y_emb_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [387]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-12 08:56:30,417] A new study created in memory with name: no-name-e8a994c4-e71f-4421-aa99-b24b58b24bbf
[I 2023-01-12 08:56:32,198] Trial 0 finished with value: 0.14365404774518467 and parameters: {'rf_n_estimators': 371, 'rf_max_depth': 3}. Best is trial 0 with value: 0.14365404774518467.
[I 2023-01-12 08:56:32,954] Trial 1 finished with value: 0.1439542972496639 and parameters: {'rf_n_estimators': 154, 'rf_max_depth': 4}. Best is trial 0 with value: 0.14365404774518467.
[I 2023-01-12 08:56:35,587] Trial 2 finished with value: 0.1433167242593273 and parameters: {'rf_n_estimators': 623, 'rf_max_depth': 3}. Best is trial 2 with value: 0.1433167242593273.
[I 2023-01-12 08:56:42,689] Trial 3 finished with value: 0.14095000926548462 and parameters: {'rf_n_estimators': 982, 'rf_max_depth': 15}. Best is trial 3 with value: 0.14095000926548462.
[I 2023-01-12 08:56:45,335] Trial 4 finished with value: 0.1426696

In [388]:
df

Unnamed: 0,ID,Job Description,Apps Received,Female,Male,Unknown_Gender,Cleaned text,Apps Received (unknown gender removed),Male share,Female share,Male share (unknown gender included),Female share (unknown gender included),Embeddings,Scaled embeddings
0,3190,BUILDING MAINTENANCE DISTRICT SUPERVISOR,47,1,45,1,build maintenance district supervisor class co...,46,0.978,0.022,0.957,0.021,"[[-0.14355469, 0.21679688, 0.03881836, 0.08984...","[[-0.4436903723137345, 0.6700630112493133, 0.1..."
1,3860,ELEVATOR MECHANIC HELPER,203,2,195,6,elevator mechanic helper class code open date ...,197,0.990,0.010,0.961,0.010,"[[-0.05810547, -0.22949219, -0.26757812, -0.01...","[[-0.2550844070540135, -1.0074762295410618, -1..."
2,3987,WATERWORKS MECHANIC SUPERVISOR,30,1,29,0,waterworks mechanic supervisor class code open...,30,0.967,0.033,0.967,0.033,"[[0.024902344, 0.03515625, 0.31054688, -0.0625...","[[0.11386212281775257, 0.1607465263309448, 1.4..."
3,2434,RECREATION FACILITY DIRECTOR,443,206,230,7,recreation facility director class code open d...,436,0.528,0.472,0.519,0.465,"[[0.06298828, -0.095214844, 0.23339844, 0.0217...","[[0.25839947222215826, -0.39060385335907644, 0..."
4,1775,WORKERS COMPENSATION CLAIMS ASSISTANT,116,95,19,2,worker compensation claim assistant class code...,114,0.167,0.833,0.164,0.819,"[[0.047607422, -0.13476562, 0.030151367, 0.018...","[[0.12369667951855838, -0.3501567543294576, 0...."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
172,1861,UTILITY BUYER,126,64,58,4,utility buyer class code open date exam open c...,122,0.475,0.525,0.460,0.508,"[[0.087890625, 0.15820312, 0.05053711, -0.1767...","[[0.241505538205688, 0.4347099687702384, 0.138..."
173,3586,TRUCK AND EQUIPMENT DISPATCHER,75,0,75,0,truck equipment dispatcher class code open dat...,75,1.000,0.000,1.000,0.000,"[[0.1328125, -0.019165039, -0.12695312, -0.017...","[[0.5448422980188143, -0.07862154484278847, -0..."
174,1769,SENIOR WORKERS COMPENSATION ANALYST,44,26,18,0,senior worker compensation analyst class code ...,44,0.409,0.591,0.409,0.591,"[[0.076171875, 0.140625, -0.022705078, -0.0135...","[[0.15774507374732846, 0.29122167461045256, -0..."
175,1336,UTILITY EXECUTIVE SECRETARY,430,395,31,4,utility executive secretary class code open da...,426,0.073,0.927,0.072,0.919,"[[0.087890625, 0.15820312, 0.05053711, -0.1767...","[[0.241505538205688, 0.4347099687702384, 0.138..."


In [391]:
male_share = list(df["Male share (unknown gender included)"])
female_share = list(df["Female share (unknown gender included)"])
y_multi = list(zip(male_share, female_share))
y_multi

[(0.957, 0.021),
 (0.961, 0.01),
 (0.967, 0.033),
 (0.519, 0.465),
 (0.164, 0.819),
 (1.0, 0.0),
 (1.0, 0.0),
 (0.268, 0.723),
 (0.643, 0.357),
 (0.615, 0.369),
 (0.914, 0.057),
 (0.514, 0.448),
 (0.952, 0.0),
 (0.585, 0.366),
 (0.512, 0.462),
 (0.399, 0.563),
 (0.305, 0.678),
 (0.94, 0.02),
 (0.957, 0.032),
 (0.725, 0.255),
 (0.721, 0.26),
 (0.378, 0.545),
 (0.639, 0.329),
 (1.0, 0.0),
 (0.953, 0.016),
 (0.342, 0.613),
 (0.739, 0.253),
 (0.4, 0.6),
 (0.492, 0.462),
 (0.982, 0.005),
 (0.933, 0.0),
 (0.746, 0.211),
 (0.595, 0.381),
 (0.84, 0.15),
 (0.768, 0.232),
 (0.429, 0.571),
 (0.891, 0.096),
 (0.962, 0.029),
 (0.574, 0.37),
 (0.907, 0.088),
 (0.441, 0.456),
 (0.938, 0.0),
 (0.985, 0.004),
 (0.952, 0.014),
 (0.972, 0.028),
 (0.96, 0.035),
 (0.663, 0.327),
 (0.5, 0.5),
 (0.464, 0.536),
 (0.955, 0.023),
 (0.979, 0.018),
 (0.754, 0.231),
 (0.756, 0.234),
 (0.815, 0.148),
 (0.95, 0.031),
 (0.929, 0.071),
 (0.536, 0.422),
 (0.929, 0.071),
 (0.566, 0.402),
 (0.471, 0.52),
 (0.818, 0.152),
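
In these targets the male and female shares do not have to sum to 1; the remainder is the unknown-gender share, which is not predicted. A minimal sketch that rebuilds the first five targets from the applicant counts shown in `df`, assuming the "(unknown gender included)" columns were computed as count divided by `Apps Received` and rounded to three decimals (this matches the rows shown but is not confirmed by the notebook):

```python
import numpy as np
import pandas as pd

# Applicant counts for the first five rows of df shown above.
counts = pd.DataFrame(
    {
        "Apps Received": [47, 203, 30, 443, 116],
        "Female": [1, 2, 1, 206, 95],
        "Male": [45, 195, 29, 230, 19],
        "Unknown_Gender": [1, 6, 0, 7, 2],
    }
)

shares = pd.DataFrame(
    {
        "male": (counts["Male"] / counts["Apps Received"]).round(3),
        "female": (counts["Female"] / counts["Apps Received"]).round(3),
        # The part of the applicants not covered by the two targets.
        "unknown": (counts["Unknown_Gender"] / counts["Apps Received"]).round(3),
    }
)

# Two-column target array of shape (n_samples, 2), equivalent to the list of tuples above.
y_multi_demo = shares[["male", "female"]].to_numpy()
print(y_multi_demo)
```

On the real frame, `df[["Male share (unknown gender included)", "Female share (unknown gender included)"]].to_numpy()` gives the same kind of array directly, without going through `list(zip(...))`.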

In [392]:
# Re-split with the two-column target; this overwrites the earlier X_emb_train / X_emb_test.
X_emb_train, X_emb_test, y_multi_train, y_multi_test = train_test_split(X_embeddings, y_multi, test_size=0.3, random_state=428)

In [398]:
param = {}
param['loss_function'] = 'MultiRMSE'
param['eval_metric'] = 'MultiRMSE'

regressor = CatBoostRegressor(**param)

regressor.fit(X_emb_train, y_multi_train, early_stopping_rounds=100)

0:	learn: 0.3401025	total: 47.7ms	remaining: 47.6s
1:	learn: 0.3356286	total: 83.2ms	remaining: 41.5s
2:	learn: 0.3319746	total: 106ms	remaining: 35.3s
3:	learn: 0.3284051	total: 132ms	remaining: 33s
4:	learn: 0.3256876	total: 169ms	remaining: 33.7s
5:	learn: 0.3227613	total: 195ms	remaining: 32.3s
6:	learn: 0.3196724	total: 219ms	remaining: 31s
7:	learn: 0.3166404	total: 241ms	remaining: 29.9s
8:	learn: 0.3129524	total: 270ms	remaining: 29.7s
9:	learn: 0.3097605	total: 307ms	remaining: 30.4s
10:	learn: 0.3069651	total: 341ms	remaining: 30.6s
11:	learn: 0.3040737	total: 366ms	remaining: 30.1s
12:	learn: 0.3007079	total: 398ms	remaining: 30.2s
13:	learn: 0.2980190	total: 424ms	remaining: 29.8s
14:	learn: 0.2947877	total: 450ms	remaining: 29.5s
15:	learn: 0.2912032	total: 471ms	remaining: 28.9s
16:	learn: 0.2878331	total: 496ms	remaining: 28.7s
17:	learn: 0.2848374	total: 523ms	remaining: 28.5s
18:	learn: 0.2817934	total: 554ms	remaining: 28.6s
19:	learn: 0.2784021	total: 579ms	remaining

<catboost.core.CatBoostRegressor at 0x2719c8929b0>

In [399]:
y_pred = regressor.predict(X_emb_test)  # predict once and reuse for all three metrics
rmse = mean_squared_error(y_multi_test, y_pred, squared=False)
mae = mean_absolute_error(y_multi_test, y_pred)
r2 = r2_score(y_multi_test, y_pred)

In [400]:
rmse

0.13480111811283998

In [401]:
mae

0.10568842047889285

In [402]:
r2

0.6736126142636137
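
With two targets, each single number above is an average over the male and female columns (scikit-learn's default `multioutput="uniform_average"`). Passing `multioutput="raw_values"` returns one value per target instead, which shows whether one share is predicted worse than the other. A minimal sketch using the first five `(male, female)` targets from `y_multi`; the predictions are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five (male share, female share) targets from y_multi.
y_true = np.array([[0.957, 0.021], [0.961, 0.010], [0.967, 0.033],
                   [0.519, 0.465], [0.164, 0.819]])
# Hypothetical predictions, for illustration only.
y_pred = np.array([[0.900, 0.050], [0.930, 0.040], [0.900, 0.080],
                   [0.600, 0.380], [0.250, 0.700]])

# One value per target column instead of the uniform average.
rmse_per_target = np.sqrt(mean_squared_error(y_true, y_pred, multioutput="raw_values"))
mae_per_target = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
r2_per_target = r2_score(y_true, y_pred, multioutput="raw_values")

for name, rmse, mae, r2 in zip(["male", "female"], rmse_per_target, mae_per_target, r2_per_target):
    print(f"{name}: RMSE={rmse:.3f} MAE={mae:.3f} R2={r2:.3f}")
```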

In [403]:
def objective(trial):
    param = {}
    param['learning_rate'] = trial.suggest_float("learning_rate", 0.001, 0.02, step=0.001)
    param['depth'] = trial.suggest_int('depth', 9, 15)
    param['l2_leaf_reg'] = trial.suggest_float('l2_leaf_reg', 1.0, 5.5, step=0.5)
    param['min_child_samples'] = trial.suggest_categorical('min_child_samples', [1, 4, 8, 16, 32])
    param['grow_policy'] = 'Depthwise'
    param['eval_metric'] = 'MultiRMSE'
    param['loss_function'] = 'MultiRMSE'
    param['od_type'] = 'iter'
    param['od_wait'] = 20
    param['random_state'] = 1
    param['logging_level'] = 'Silent'
    
    regressor = CatBoostRegressor(**param)

    regressor.fit(X_emb_train, y_multi_train, early_stopping_rounds=100)
    loss = mean_squared_error(y_multi_test, regressor.predict(X_emb_test), squared=False)
    return loss

In [404]:
study = optuna.create_study()
study.optimize(objective, n_trials=500)

[I 2023-01-12 12:05:42,487] A new study created in memory with name: no-name-8a56347c-339b-42d9-b3ad-ae97774867f1
[I 2023-01-12 12:05:48,835] Trial 0 finished with value: 0.11846610100766909 and parameters: {'learning_rate': 0.016, 'depth': 10, 'l2_leaf_reg': 4.5, 'min_child_samples': 32}. Best is trial 0 with value: 0.11846610100766909.
[I 2023-01-12 12:06:30,391] Trial 1 finished with value: 0.14055955975524015 and parameters: {'learning_rate': 0.011, 'depth': 15, 'l2_leaf_reg': 2.5, 'min_child_samples': 4}. Best is trial 0 with value: 0.11846610100766909.
[I 2023-01-12 12:09:33,932] Trial 2 finished with value: 0.16798274834767685 and parameters: {'learning_rate': 0.010000000000000002, 'depth': 11, 'l2_leaf_reg': 2.5, 'min_child_samples': 1}. Best is trial 0 with value: 0.11846610100766909.
[I 2023-01-12 12:15:56,193] Trial 3 finished with value: 0.22385588844515186 and parameters: {'learning_rate': 0.002, 'depth': 15, 'l2

In [405]:
study.best_params

{'learning_rate': 0.017,
 'depth': 10,
 'l2_leaf_reg': 3.0,
 'min_child_samples': 32}

In [406]:
param = {}
param['learning_rate'] = 0.017
param['depth'] = 10
param['l2_leaf_reg'] = 3.0
param['min_child_samples'] = 32
param['grow_policy'] = 'Depthwise'
param['eval_metric'] = 'MultiRMSE'
param['loss_function'] = 'MultiRMSE'
param['od_type'] = 'iter'
param['od_wait'] = 20
param['random_state'] = 1
param['logging_level'] = 'Silent'

regressor = CatBoostRegressor(**param)

regressor.fit(X_emb_train, y_multi_train, early_stopping_rounds=100)

<catboost.core.CatBoostRegressor at 0x271946fd8a0>

In [407]:
y_pred = regressor.predict(X_emb_test)  # predict once and reuse for all three metrics
rmse = mean_squared_error(y_multi_test, y_pred, squared=False)
mae = mean_absolute_error(y_multi_test, y_pred)
r2 = r2_score(y_multi_test, y_pred)

In [408]:
rmse

0.11651528573100299

In [409]:
mae

0.09121499189375501

In [410]:
r2

0.7561451085003495