# Case Intro
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans to sell term deposits to their customers such as email marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns still remain one of the most effective way to reach out to people. However, they require huge investment as large call centers are hired to actually execute these campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand so that they can be specifically targeted via call.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed by the customer or not. The data folder contains two datasets:-

Bank.csv: 45,211 rows and 18 columns ordered by date (from May 2008 to November 2010)

Detailed Column Descriptions
bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: "yes","no")

Missing Attribute Values: None


In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/Bank.csv',sep = ';')
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [35]:
print(df.shape)
df.info()
df.isnull().sum()

(45211, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [36]:
#For object check the data
for cn in df.columns:
  if(df[cn].dtype==object):
    print(df[cn].value_counts())


job
blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: count, dtype: int64
marital
married     27214
single      12790
divorced     5207
Name: count, dtype: int64
education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64
default
no     44396
yes      815
Name: count, dtype: int64
housing
yes    25130
no     20081
Name: count, dtype: int64
loan
no     37967
yes     7244
Name: count, dtype: int64
contact
cellular     29285
unknown      13020
telephone     2906
Name: count, dtype: int64
month
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: count, dtype: int64
poutcome
unknown    36959
failure     4901
other  

In [37]:
def Encoder(df):
          from sklearn import preprocessing
          columnsToEncode = list(df.select_dtypes(include=['category','object']))
          le = preprocessing.LabelEncoder()
          for feature in columnsToEncode:
              try:
                  df[feature] = le.fit_transform(df[feature])
              except:
                  print('Error encoding '+feature)
          return df


In [38]:
df=Encoder(df)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3,0
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3,0
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3,0
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3,0
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3,1
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3,1
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2,1
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3,0


In [39]:
y = df['y'] #Output
X = df.drop('y',axis=1)
X

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3


In [40]:
from sklearn.model_selection import  train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)




Q1)Using  Random Forest,XGBoost, Light GBM and Gradient Boosting Classifier with default parameters (no parameter specifications except random_state) calculate Accuracy on Test data. Which method gives the best accuracy on test data

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
models = {
    'RandomForest': RandomForestClassifier(random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'XGBoost': xgb.XGBClassifier(random_state=42),
    'LightGBM': LGBMClassifier(random_state=42)
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, predictions)

# Print the accuracies
for name, accuracy in accuracies.items():
    print(f'{name} accuracy on test data: {accuracy:.4f}')

[LightGBM] [Info] Number of positive: 3940, number of negative: 29968
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004307 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 988
[LightGBM] [Info] Number of data points in the train set: 33908, number of used features: 16
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.116197 -> initscore=-2.028949
[LightGBM] [Info] Start training from score -2.028949
RandomForest accuracy on test data: 0.9026
GradientBoosting accuracy on test data: 0.9019
XGBoost accuracy on test data: 0.9037
LightGBM accuracy on test data: 0.9074


# LightGBM classifier gives the best accuracy on test data (0.9074). However, the difference among classifiers is very insignificant

Q2) Using optuna hyperparmeter optimization technique and 100 trial

 a)find best methods with  parameters  using Cross validation (CV=3) technique for the range of   parameters below. What are the best parameters for the method with highest cross validation accuracy?

"max_depth": range(2, 16), "max_features": range(2, 16)

 b)Evaluate the performance of the  method with highest cross validation accuracy on test data.What is the accuracy value?


In [11]:
!pip install optuna



In [21]:
import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
def objective(trial):
    # Define the hyperparameters to tune
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 15),
        "max_features": trial.suggest_int("max_features", 2, 15),}
    model = xgb.XGBClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=3)
    score = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy", error_score='raise').mean()
    return score
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best parameters:", study.best_params)
print("Highest cross-validation accuracy:", study.best_value)
xgb_pred = model.predict(X_test)
accuracy_score(y_test, xgb_pred)
print(accuracy_score)

[I 2024-04-17 18:32:26,032] A new study created in memory with name: no-name-664e8474-104b-422f-8c85-83f99290555e
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:32:28,080] Trial 0 finished with value: 0.9023534545829378 and parameters: {'max_depth': 10, 'max_features': 2}. Best is trial 0 with value: 0.9023534545829378.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:32:30,365] Trial 1 finished with value: 0.9025303883197139 and parameters: {'max_depth': 11, 'max_features': 5}. Best is trial 1 with value: 0.9025303883197139.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:32:33,374] Trial 2 finished with value: 0.9048602706989106 and parameters: {'max_depth': 6, 'max_fe

Best parameters: {'max_depth': 4, 'max_features': 10}
Highest cross-validation accuracy: 0.9062463777548118


0.9073697248518092

In [20]:
def objective(trial2):
    # Define the hyperparameters to tune
    params = {
        "max_depth": trial2.suggest_int("max_depth", 2, 15),
        "max_features": trial2.suggest_int("max_features", 2, 15),}
    model2 = RandomForestClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=3)
    score = cross_val_score(model2, X_train, y_train, cv=cv, scoring="accuracy", error_score='raise').mean()
    return score
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best parameters:", study.best_params)
print("Highest cross-validation accuracy:", study.best_value)
RandomClassifier = model.predict(X_test)
accuracy_score(y_test, RandomClassifier)
print(accuracy_score)

[I 2024-04-17 18:11:59,319] A new study created in memory with name: no-name-323bd469-72ff-41a5-8e40-53fc6c8e0ade
[I 2024-04-17 18:12:10,986] Trial 0 finished with value: 0.9027663895558913 and parameters: {'max_depth': 6, 'max_features': 13}. Best is trial 0 with value: 0.9027663895558913.
[I 2024-04-17 18:12:19,675] Trial 1 finished with value: 0.9043588766855719 and parameters: {'max_depth': 10, 'max_features': 5}. Best is trial 1 with value: 0.9043588766855719.
[I 2024-04-17 18:12:22,548] Trial 2 finished with value: 0.8867818610501437 and parameters: {'max_depth': 4, 'max_features': 3}. Best is trial 1 with value: 0.9043588766855719.
[I 2024-04-17 18:12:37,739] Trial 3 finished with value: 0.9044473722566367 and parameters: {'max_depth': 11, 'max_features': 8}. Best is trial 3 with value: 0.9044473722566367.
[I 2024-04-17 18:12:45,342] Trial 4 finished with value: 0.8979591817564074 and parameters: {'max_depth': 11, 'max_features': 2}. Best is trial 3 with value: 0.904447372256636

Best parameters: {'max_depth': 14, 'max_features': 4}
Highest cross-validation accuracy: 0.9056270157400625


In [None]:
def objective(trial3):
    # Define the hyperparameters to tune
    params = {
        "max_depth": trial3.suggest_int("max_depth", 2, 15),
        "max_features": trial3.suggest_int("max_features", 2, 15),}
    model3 = LGBMClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=3)
    score = cross_val_score(model3, X_train, y_train, cv=cv, scoring="accuracy", error_score='raise').mean()
    return score
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best parameters:", study.best_params)
print("Highest cross-validation accuracy:", study.best_value)
lgbm_pred = model.predict(X_test)
accuracy_score(y_test, lgbm_pred)
print(accuracy_score)

In [42]:
def objective(trial4):
    # Define the hyperparameters to tune
    params = {
        "max_depth": trial4.suggest_int("max_depth", 2, 15),
        "max_features": trial4.suggest_int("max_features", 2, 15),}
    model4 = GradientBoostingClassifier(**params, random_state=42)
    cv = StratifiedKFold(n_splits=3)
    score = cross_val_score(model4, X_train, y_train, cv=cv, scoring="accuracy", error_score='raise').mean()
    return score
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("Best parameters:", study.best_params)
print("Highest cross-validation accuracy:", study.best_value)
gradientboostingclassifier = model.predict(X_test)
accuracy_score(y_test, gradientboostingclassifier)
print(accuracy_score)

[I 2024-04-17 18:42:01,958] A new study created in memory with name: no-name-5bd866fc-4828-4962-9018-4ee53d177c0c
[I 2024-04-17 18:42:07,922] Trial 0 finished with value: 0.9016161871908001 and parameters: {'max_depth': 2, 'max_features': 13}. Best is trial 0 with value: 0.9016161871908001.
[I 2024-04-17 18:42:14,461] Trial 1 finished with value: 0.9037986030448978 and parameters: {'max_depth': 3, 'max_features': 8}. Best is trial 1 with value: 0.9037986030448978.
[I 2024-04-17 18:42:24,158] Trial 2 finished with value: 0.9053911032212497 and parameters: {'max_depth': 4, 'max_features': 11}. Best is trial 2 with value: 0.9053911032212497.
[I 2024-04-17 18:42:35,759] Trial 3 finished with value: 0.9058629647895549 and parameters: {'max_depth': 4, 'max_features': 13}. Best is trial 3 with value: 0.9058629647895549.
[I 2024-04-17 18:42:41,087] Trial 4 finished with value: 0.9053911084399182 and parameters: {'max_depth': 4, 'max_features': 4}. Best is trial 3 with value: 0.9058629647895549

KeyboardInterrupt: 

For Q3 and Q4 ,use the following data.

In [23]:
dr=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/diamond.csv')
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171
...,...,...,...,...,...,...,...,...
5995,1.03,Ideal,D,SI1,EX,EX,GIA,6250
5996,1.00,Very Good,D,SI1,VG,VG,GIA,5328
5997,1.02,Ideal,D,SI1,EX,EX,GIA,6157
5998,1.27,Signature-Ideal,G,VS1,EX,EX,GIA,11206


In [24]:
def Encoder(df):
          from sklearn import preprocessing
          columnsToEncode = list(df.select_dtypes(include=['category','object']))
          le = preprocessing.LabelEncoder()
          for feature in columnsToEncode:
              try:
                  df[feature] = le.fit_transform(df[feature])
              except:
                  print('Error encoding '+feature)
          return df


In [25]:
dr=Encoder(dr)
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,2,4,2,3,0,1,5169
1,0.83,2,4,3,2,2,0,3470
2,0.85,2,4,2,0,0,1,3183
3,0.91,2,1,2,3,3,1,4370
4,0.83,2,3,2,0,0,1,3171
...,...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1,6250
5996,1.00,4,0,2,3,3,1,5328
5997,1.02,2,0,2,0,0,1,6157
5998,1.27,3,3,3,0,0,1,11206


In [26]:
y = dr['Price'] #Output
X = dr.drop('Price',axis=1)
X

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report
0,1.10,2,4,2,3,0,1
1,0.83,2,4,3,2,2,0
2,0.85,2,4,2,0,0,1
3,0.91,2,1,2,3,3,1
4,0.83,2,3,2,0,0,1
...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1
5996,1.00,4,0,2,3,3,1
5997,1.02,2,0,2,0,0,1
5998,1.27,3,3,3,0,0,1


In [27]:
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

Q3)Using Linear Regression,Decison Tree Random Forest,XGBoost, Light GBM and Gradient Boosting Classifier with default parameters (no parameter specifications except random_state) calculate R2 statistics on test data. Which method gives the best accuracy on test data

In [28]:
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
models = {
    'LinearRegression': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(random_state=17),
    'RandomForest': RandomForestRegressor(random_state=17),
    'XGBoost': xgb.XGBRegressor(random_state=17),
    'LightGBM': LGBMRegressor(random_state=17),
    'GradientBoosting': GradientBoostingRegressor(random_state=17)
}
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2_scores[name] = r2_score(y_test, predictions)
for name, score in r2_scores.items():
    print(f'{name} R2 score on test data: {score:}')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000950 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 193
[LightGBM] [Info] Number of data points in the train set: 4500, number of used features: 7
[LightGBM] [Info] Start training from score 11827.946667
LinearRegression R2 score on test data: 0.823384582969696
DecisionTree R2 score on test data: 0.95923297527452
RandomForest R2 score on test data: 0.9799660276771226
XGBoost R2 score on test data: 0.9808734679798924
LightGBM R2 score on test data: 0.9812609895578275
GradientBoosting R2 score on test data: 0.9739035655494326


# LightBGM again provides best score (0.9812). Others, except linear regression performs very close.

Q4) Using optuna hyperparmeter optimization technique (100 trial)  with Random Forest,XGBoost, Light GBM and Gradient Boosting Regressor

a)find best methods with  parameters  using Cross validation (CV=3) technique for the range of   parameters below. What are the best parameters for the method with highest cross validation R2?

"max_depth": range(2, 7), "max_features": range(2, 7)

 b)Evaluate the performance of the  method with highest cross validation R2 on test data. What is the R2 value?


In [29]:
import optuna
from sklearn.model_selection import StratifiedKFold

# Import the desired models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# Define the objective function:
def objective(trial: optuna.Trial):
    # Define the hyperparameters
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 7),
        "max_features": trial.suggest_int("max_features", 2, 7),
    }

    # Choose the model based on the trial suggestion
    model_name = trial.suggest_categorical("model", ["RandomForestRegressor", "XGBRegressor", "LGBMRegressor", "GradientBoostingRegressor"])
    model = globals()[model_name](**params, random_state=42)

    # Perform cross-validation and calculate the average R2 score
    cv = StratifiedKFold(n_splits=3)
    score = cross_val_score(model, X_train, y_train, cv=cv, scoring="r2").mean()

    return score

In [30]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Get the best hyperparameters, the best model, and the best score
best_params = study.best_params
best_model_name = study.best_params["model"]
best_model = globals()[best_model_name](**best_params, random_state=42)
best_score = study.best_value

[I 2024-04-17 18:36:15,509] A new study created in memory with name: no-name-abe55a5a-c40e-4014-88d8-1f3cbe323690
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:15,775] Trial 0 finished with value: 0.9840747630102852 and parameters: {'max_depth': 5, 'max_features': 6, 'model': 'XGBRegressor'}. Best is trial 0 with value: 0.9840747630102852.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:15,950] Trial 1 finished with value: 0.9695884370050969 and parameters: {'max_depth': 2, 'max_features': 7, 'model': 'XGBRegressor'}. Best is trial 0 with value: 0.9840747630102852.
[I 2024-04-17 18:36:16,930] Trial 2 finished with value: 0.9621918179852774 and parameters: {'max_depth': 6, 'max_features': 6, 'model': 'RandomForestRegressor'}. Best is trial 0 with value: 0.9840747630102

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000525 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000173 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667


[I 2024-04-17 18:36:18,092] Trial 6 finished with value: 0.9778924930869556 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000483 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11901.064667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000528 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667


[I 2024-04-17 18:36:18,351] Trial 7 finished with value: 0.972289552642699 and parameters: {'max_depth': 3, 'max_features': 4, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000517 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000509 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11901.064667


Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:18,588] Trial 8 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 3, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
[I 2024-04-17 18:36:19,351] Trial 9 finished with value: 0.9159233921490536 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'RandomForestRegressor'}. Best is trial 4 with value: 0.9854563058794632.
[I 2024-04-17 18:36:20,142] Trial 10 finished with value: 0.9777429033618869 and parameters: {'max_depth': 6, 'max_features': 2, 'model': 'GradientBoostingRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:21,832] Trial 11 finished with value: 0.9840747630102852 and parameters: {'

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000387 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000542 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667


[I 2024-04-17 18:36:23,952] Trial 17 finished with value: 0.9779367628743253 and parameters: {'max_depth': 6, 'max_features': 3, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000150 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11901.064667


[I 2024-04-17 18:36:24,792] Trial 18 finished with value: 0.9105032880407906 and parameters: {'max_depth': 4, 'max_features': 7, 'model': 'RandomForestRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:25,056] Trial 19 finished with value: 0.9840747630102852 and parameters: {'max_depth': 5, 'max_features': 2, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:25,472] Trial 20 finished with value: 0.9821763755220677 and parameters: {'max_depth': 7, 'max_features': 4, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { 

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000539 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000511 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000144 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train se

[I 2024-04-17 18:36:27,775] Trial 27 finished with value: 0.9165346984714801 and parameters: {'max_depth': 4, 'max_features': 6, 'model': 'RandomForestRegressor'}. Best is trial 4 with value: 0.9854563058794632.
[I 2024-04-17 18:36:28,156] Trial 28 finished with value: 0.9440109555506293 and parameters: {'max_depth': 2, 'max_features': 3, 'model': 'GradientBoostingRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:28,448] Trial 29 finished with value: 0.9840747630102852 and parameters: {'max_depth': 5, 'max_features': 6, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:28,814] Trial 30 finished with value: 0.9844646375618451 and parameters: 

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000509 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000115 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667


[I 2024-04-17 18:36:31,660] Trial 39 finished with value: 0.9778924930869556 and parameters: {'max_depth': 4, 'max_features': 4, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000581 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11901.064667


Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:33,543] Trial 40 finished with value: 0.9829669903199679 and parameters: {'max_depth': 3, 'max_features': 3, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:34,025] Trial 41 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:34,275] Trial 42 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'XGBRegressor'}. Best is trial 4 with valu

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000500 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000384 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000520 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start trai

[I 2024-04-17 18:36:36,639] Trial 48 finished with value: 0.9778924930869556 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.




Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:36,832] Trial 49 finished with value: 0.9695884370050969 and parameters: {'max_depth': 2, 'max_features': 6, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
[I 2024-04-17 18:36:37,344] Trial 50 finished with value: 0.9748194061943654 and parameters: {'max_depth': 3, 'max_features': 4, 'model': 'GradientBoostingRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:37,572] Trial 51 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 3, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000556 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000614 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667


[I 2024-04-17 18:36:39,524] Trial 58 finished with value: 0.9780511058253353 and parameters: {'max_depth': 5, 'max_features': 2, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000367 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11901.064667


[I 2024-04-17 18:36:40,285] Trial 59 finished with value: 0.9159233921490536 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'RandomForestRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:40,484] Trial 60 finished with value: 0.9829669903199679 and parameters: {'max_depth': 3, 'max_features': 4, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:40,724] Trial 61 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 4, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { 

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000522 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000537 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000520 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start trai

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:47,994] Trial 78 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:48,232] Trial 79 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 4, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
[I 2024-04-17 18:36:48,647] Trial 80 finished with value: 0.9641718045682447 and parameters: {'max_depth': 3, 'max_features': 2, 'model': 'GradientBoostingRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000506 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 171
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11670.998667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000494 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 178
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start training from score 11911.776667
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000511 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 177
[LightGBM] [Info] Number of data points in the train set: 3000, number of used features: 7
[LightGBM] [Info] Start trai

[I 2024-04-17 18:36:51,036] Trial 89 finished with value: 0.9778924930869556 and parameters: {'max_depth': 4, 'max_features': 4, 'model': 'LGBMRegressor'}. Best is trial 4 with value: 0.9854563058794632.




[I 2024-04-17 18:36:51,937] Trial 90 finished with value: 0.9463463812378564 and parameters: {'max_depth': 5, 'max_features': 6, 'model': 'RandomForestRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:52,173] Trial 91 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

[I 2024-04-17 18:36:52,419] Trial 92 finished with value: 0.9854563058794632 and parameters: {'max_depth': 4, 'max_features': 5, 'model': 'XGBRegressor'}. Best is trial 4 with value: 0.9854563058794632.
Parameters: { "max_features" } are not used.

Parameters: { "max_features" } are not used.

Parameters: { 

In [31]:
best_model.fit(X_train, y_train)

Parameters: { "max_features", "model" } are not used.



In [32]:
y_pred = best_model.predict(X_test)
r2_score = r2_score(y_test, y_pred)

print(f"Test R2 Score: {r2_score:.4f}")

Test R2 Score: 0.9834
