# Machine Learning assignment
# Dataracing 2022: export revenue forecasting (1st corrected version)
#### Király Márk (AX83OL)

### Import the required modules.

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

### Load the data, set the target variable, and split the dataset. Only AFTER THAT do we standardize the data.
## THE FIX: split first, then standardize.

In [5]:
# Load the data
df = pd.read_csv("https://raw.githubusercontent.com/karsarobert/Machine_Learning_2024/main/train.csv")

# Set the target variable and the predictors
y = df['target_reg']
corr_col = ['arbevexp_2014', 'arbevexp_2015', 'arbevexp_2016', 'arbevert_2014', 'arbevert_2015', 'arbevert_2016', 'ranyag_2014', 'ranyag_2015', 'ranyag_2016', 'rszem_2016']
X = df[corr_col]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Create the StandardScaler, fit it on the training data only, then apply it to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
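The split-then-standardize order can be checked on a small example. The sketch below is only an illustration on synthetic data (the `X_demo`, `y_demo` and `demo_scaler` names are made up here, not part of the notebook): the scaler learns its mean and standard deviation from the training rows alone, so the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (assumption: 200 rows, 3 numeric columns)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

# 1) Split first, with the same test_size and random_state as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

# 2) Fit the scaler on the training part only, then reuse its statistics on the test part
demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
test_means = X_te_s.mean(axis=0)

print("train means:", train_means)
print("train stds: ", train_stds)
print("test means: ", test_means)
```

Tree-based models such as random forests are largely insensitive to feature scaling, so the order matters here mainly as good practice and for any scale-sensitive model tried later.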

## I skip exploring and testing the dataset for now.

# Model fitting

## Random Forest Regressor + Randomized Search over a hyperparameter set (with the whole procedure brute-forced by repetition).

In [6]:
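# Create the Random Forest Regressor
rf_reg = RandomForestRegressor()

# Set up the hyperparameter candidates
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9, None],
    # NOTE: 'auto' is most likely what triggers the InvalidParameterError in the output:
    # recent scikit-learn versions no longer accept it (1.0 is the equivalent for regressors).
    'max_features': ['auto', 'sqrt'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}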

best_mae = float('inf')
best_model = None

# Run RandomizedSearchCV several times
# (random_state is fixed, so every iteration samples the same 20 candidates;
#  only the forests' own randomness differs between iterations)
for i in range(10):
    print(f"Iteration {i+1}/10")
    rs_cv = RandomizedSearchCV(
        estimator=rf_reg,
        param_distributions=param_grid,
        n_iter=20,
        scoring='neg_mean_absolute_error',
        cv=5,
        random_state=42,
        n_jobs=-1
    )
    rs_cv.fit(X_train_scaled, y_train)

    # Print the best hyperparameters and the best cross-validated MAE
    print('Best Hyperparameters:', rs_cv.best_params_)
    print('Best Score:', -rs_cv.best_score_)

    # Refit the best model on the full training data
    current_model = rs_cv.best_estimator_
    current_model.fit(X_train_scaled, y_train)

    # Predict on the test set
    y_pred = current_model.predict(X_test_scaled)

    # Evaluate the error metric
    current_mae = mean_absolute_error(y_test, y_pred)
    print('Current Mean Absolute Error:', current_mae)

    # Keep the best model so far
    # (it is chosen by test-set MAE, so the final test MAE is an optimistic estimate)
    if current_mae < best_mae:
        best_mae = current_mae
        best_model = current_model
        joblib.dump(best_model, 'best_rf_model.pkl')
        print(f"New best model saved with MAE = {best_mae}")

# Load the best model
best_model = joblib.load('best_rf_model.pkl')

# Predict on the test data with the best model
y_pred = best_model.predict(X_test_scaled)

# Evaluate the error metrics on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print('Final Model Performance on Test Set:')
print('Mean Squared Error:', mse)
print('R-squared Score:', r2)
print('Mean Absolute Error:', mae)

Iteration 1/10


55 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
22 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/lib/python3.12/site-packages/sklearn/base.py", line 1467, in wrapper
    estimator._validate_params()
  File "/usr/lib/python3.12/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/usr/lib/python3.12/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.utils._param_validat

Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60601.076020060435
Current Mean Absolute Error: 54557.151428022735
New best model saved with MAE = 54557.151428022735
Iteration 2/10



Best Hyperparameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 59720.82337493418
Current Mean Absolute Error: 54166.35052233439
New best model saved with MAE = 54166.35052233439
Iteration 3/10



Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 61528.58580871219
Current Mean Absolute Error: 54462.84642293433
Iteration 4/10



Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60826.34232465023
Current Mean Absolute Error: 56043.04626731162
Iteration 5/10



Best Hyperparameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60552.92076738349
Current Mean Absolute Error: 54808.24011463721
Iteration 6/10



Best Hyperparameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 59885.001302016375
Current Mean Absolute Error: 54606.966068201626
Iteration 7/10



Best Hyperparameters: {'n_estimators': 50, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 61395.83781056087
Current Mean Absolute Error: 55782.52716303128
Iteration 8/10



Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60737.921267522404
Current Mean Absolute Error: 53815.395353831496
New best model saved with MAE = 53815.395353831496
Iteration 9/10



Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60422.22097845237
Current Mean Absolute Error: 55098.92534676471
Iteration 10/10



Best Hyperparameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': None, 'bootstrap': False}
Best Score: 60988.56538840762
Current Mean Absolute Error: 54061.73586207858
Final Model Performance on Test Set:
Mean Squared Error: 251440569632.98523
R-squared Score: 0.9590107076706736
Mean Absolute Error: 53815.395353831496
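
The `55 fits failed out of a total of 100` warning in the output is consistent with 11 of the 20 sampled candidates (11 × 5 folds) being rejected during parameter validation. A likely cause, assuming a recent scikit-learn release, is `'max_features': 'auto'`, which newer versions no longer accept. The sketch below is not a rerun of the notebook: it uses synthetic `make_regression` data and a smaller, hypothetical grid (`fixed_grid`) to show a search completing without failed fits once `'auto'` is replaced by `1.0` (all features, which is what `'auto'` used to mean for regressors).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: the real train.csv is not used here)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Hypothetical reduced grid; 1.0 replaces the rejected 'auto' value
fixed_grid = {
    'n_estimators': [10, 20],
    'max_depth': [3, None],
    'max_features': [1.0, 'sqrt'],
    'bootstrap': [True, False],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=fixed_grid,
    n_iter=6,
    scoring='neg_mean_absolute_error',
    cv=3,
    random_state=42,
    error_score='raise',  # stop at the first invalid candidate instead of scoring it as nan
)
search.fit(X_demo, y_demo)

# Count candidates whose mean CV score is nan (i.e. whose fits failed)
n_failed = int(np.isnan(search.cv_results_['mean_test_score']).sum())
print("Failed candidates:", n_failed)
print("Best Hyperparameters:", search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

Setting `error_score='raise'` is the debugging step the warning itself suggests; it surfaces an invalid value immediately rather than after the whole search.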


### The result: MAE = 53815.395353831496 (in plain terms, not good)
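
One possible reason the ten iterations barely move the result: `RandomizedSearchCV` is created with `random_state=42` every time, so each iteration draws the same 20 candidates and only the forests' internal randomness changes. A small sketch with `ParameterSampler`, the sampler scikit-learn's randomized search builds on, illustrates this. The variable names are illustrative, and `1.0` stands in for `'auto'` in the grid.

```python
from sklearn.model_selection import ParameterSampler

# Same shape as the notebook's grid, with 1.0 in place of 'auto'
grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 6, 9, None],
    'max_features': [1.0, 'sqrt'],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Two samplers with the same seed, and one with a different seed
first = list(ParameterSampler(grid, n_iter=20, random_state=42))
second = list(ParameterSampler(grid, n_iter=20, random_state=42))
other = list(ParameterSampler(grid, n_iter=20, random_state=43))

same_candidates = first == second
different_candidates = first != other

print("Same seed gives identical candidates:", same_candidates)
print("Different seed gives different candidates:", different_candidates)
```

Under this reading, passing a different seed per iteration (for example `random_state=i`) would let the loop explore new candidates instead of re-evaluating the same ones.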

## Random Forest Regressor + Bayes Optimizer (again first over a hyperparameter space, then finally brute force with the best hyperparameters found).

In [7]:
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Define the hyperparameter search space
param_space = {
    'n_estimators': (40, 60),    # Number of trees
    'max_depth': (1, 10),        # Maximum tree depth
    'min_samples_split': (2, 6), # Min. number of samples required to split a node
    'min_samples_leaf': (1, 3),  # Min. number of samples required at a leaf
}

# Find the best hyperparameters with Bayesian optimization
# Note: no `scoring` is passed, so candidates are ranked by the estimator's default score (R^2), not by MAE
opt = BayesSearchCV(
    estimator=RandomForestRegressor(),
    search_spaces=param_space,
    n_iter=10, # Number of iterations
    random_state=42, # Left unchanged (the requested value)!
    #n_jobs=-1 # In my experience parallelization also affects accuracy (though I don't fully understand why)
)

# Fit the model to the data
opt.fit(X_train, y_train)

# Predict with the best model
y_pred = opt.predict(X_test)

# Compute the MAE
mae = mean_absolute_error(y_test, y_pred)

# Print the best parameters
print("Best parameters:", opt.best_params_)
print("MAE", mae)

Best parameters: OrderedDict({'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 2, 'n_estimators': 49})
MAE 60374.07103216211
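
One difference from the randomized search is easy to miss: the Bayesian search above is created without a `scoring` argument. Assuming `BayesSearchCV` follows the scikit-learn search interface it is modeled on, it then ranks candidates by the regressor's default `score` (R²) rather than by MAE. The sketch below shows the same effect with scikit-learn's own `GridSearchCV` on synthetic data, to avoid depending on scikit-optimize. The data and the tiny grid are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=20.0, random_state=0)
grid = {'max_depth': [2, 6]}

# No `scoring`: the search uses the estimator's own score method (R^2 for regressors)
default_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0), grid, cv=3
)
default_search.fit(X_demo, y_demo)

# Explicit MAE scoring, as in the randomized search earlier in the notebook
mae_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0), grid, cv=3,
    scoring='neg_mean_absolute_error',
)
mae_search.fit(X_demo, y_demo)

print("Default best_score_ (R^2):", default_search.best_score_)
print("MAE-based best_score_ (negated MAE):", mae_search.best_score_)
```

If MAE is the metric being judged, passing `scoring='neg_mean_absolute_error'` to the search keeps the optimization target and the reported number aligned.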


### That did not work out...
### Now I try setting the parameters manually:
#### I started from the previous parameters, combined them with ones I had found to work well earlier, and then modified those as well.

In [26]:
# Hyperparameters
#hyperparameters = {'max_depth': 8, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 56}
hyperparameters = {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 49}

# Initialize the model with the given hyperparameters
model = RandomForestRegressor(**hyperparameters)

# Fit the model to the data
model.fit(X_train, y_train)

# Predict with the model
y_pred = model.predict(X_test)

# Compute the MAE
mae = mean_absolute_error(y_test, y_pred)

# Print the results
print("MAE:", mae)

MAE: 50799.56813550819
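
Because `RandomForestRegressor` is created without a `random_state`, a single run's MAE includes run-to-run noise. One way to judge whether a manual setting is really better is to refit it with several seeds and look at the spread. The sketch below does this on synthetic data, reusing the hyperparameters from the cell above. The data, the seed range, and the variable names are assumptions for illustration, and its numbers say nothing about the competition data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split the same way as in the notebook
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

hyperparameters = {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 49}

# Refit the same configuration with different seeds and collect the test MAE of each run
maes = []
for seed in range(5):
    seeded_model = RandomForestRegressor(random_state=seed, **hyperparameters)
    seeded_model.fit(X_tr, y_tr)
    maes.append(mean_absolute_error(y_te, seeded_model.predict(X_te)))

maes = np.array(maes)
print("MAE per seed:", maes)
print("Mean MAE:", maes.mean(), "Std:", maes.std())
```

If the gap between two configurations is smaller than this spread, the apparent improvement may be noise rather than a better setting.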


# MAE: 50799.56813550819