## This notebook trains a regression model with PyCaret, a package built on top of the popular scikit-learn library. The data comes from the Wrocław half marathons held in 2023 and 2024. We train the model on the following data:
- runner's age,
- sex,
- pace over the first 5 km,
- finish time for the whole half marathon.
The target value is the finish time for the whole half marathon; the other three columns are the features.

## First, I download the data from the cloud (DigitalOcean)

In [48]:
import boto3
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()
s3 = boto3.client(
    "s3",
    # aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
    # aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
    # The endpoint is needed because we use a cloud other than AWS
    # endpoint_url=os.getenv("AWS_ENDPOINT_URL_S3"),
    # If we used AWS, we would have to provide the region instead
    # region_name='eu-central-1'
)
BUCKET_NAME = "pawelsbucket"

In [49]:
df2023 = pd.read_csv(f"s3://{BUCKET_NAME}/data/dane_maraton2023.csv")
df2024 = pd.read_csv(f"s3://{BUCKET_NAME}/data/dane_maraton2024.csv")

## The data contains each runner's birth year. To the model, values such as 1996 and 1980 look very similar, while in reality those runners are 16 years apart, which is significant given an average age of about 42 in both half marathons. Age at race time is also directly comparable between the 2023 and 2024 editions. Let's therefore convert birth year to age.

In [50]:
df2023['Wiek'] = 2023 - df2023['Rocznik']
df2024['Wiek'] = 2024 - df2024['Rocznik']

## The data from both half marathons is combined so that the model is as accurate and general as possible.

In [51]:
df = pd.concat([df2023, df2024], ignore_index=True)


## In df we keep only the columns used to train the model: sex (Płeć), 5 km pace (5 km Tempo), finish time (Czas) and age (Wiek)

In [52]:
df = df[["Płeć","5 km Tempo","Czas", "Wiek"]]
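## To make the meaning of the feature columns concrete, here is a minimal sketch of how a single input row could be assembled. The helper `make_runner_row` is hypothetical and not part of the notebook. It assumes that `5 km Tempo` is a pace in minutes per kilometre (the values around 3 in the data suggest this, but it is not stated) and that `Płeć` takes the values 'K' and 'M'.

```python
def make_runner_row(birth_year, race_year, sex, five_km_seconds):
    """Build one feature row using the notebook's column names (hypothetical helper)."""
    if sex not in ("K", "M"):
        raise ValueError("sex must be 'K' or 'M'")
    return {
        "Płeć": sex,
        # Assumption: pace is expressed in minutes per kilometre.
        "5 km Tempo": five_km_seconds / 60 / 5,
        # Age is computed relative to the race year, as in the notebook.
        "Wiek": race_year - birth_year,
    }


# A runner born in 1990 who covers the first 5 km in 25 minutes during the 2024 race.
row = make_runner_row(birth_year=1990, race_year=2024, sex="M", five_km_seconds=1500)
print(row)
```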

In [53]:
df

Unnamed: 0,Płeć,5 km Tempo,Czas,Wiek
0,M,2.923333,3899.0,31.0
1,M,2.960000,3983.0,37.0
2,M,3.153333,4104.0,27.0
3,M,3.236667,4216.0,35.0
4,M,3.240000,4227.0,28.0
...,...,...,...,...
21952,K,,,42.0
21953,K,,,26.0
21954,M,,,29.0
21955,K,,,33.0


## We remove rows with missing values

In [54]:
df.isna().sum()

Płeć            11
5 km Tempo    3546
Czas          3507
Wiek           485
dtype: int64

In [55]:
df = df.dropna()

## With the data prepared this way, let's train a model

In [75]:
from pycaret.regression import setup, compare_models, tune_model, pull, save_model

In [57]:
exp = setup(data = df, target="Czas", session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Czas
2,Target type,Regression
3,Original data shape,"(17927, 4)"
4,Transformed data shape,"(17927, 4)"
5,Transformed train set shape,"(12548, 4)"
6,Transformed test set shape,"(5379, 4)"
7,Numeric features,2
8,Categorical features,1
9,Preprocess,True
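
## The shapes reported above are consistent with a 70/30 train/test split. A minimal sketch of the arithmetic, assuming a training fraction of 0.7 (PyCaret's default `train_size`, which this notebook does not override) and floor rounding for the training set (an assumption):

```python
import math

n_rows = 17927          # "Original data shape" from the setup summary
train_fraction = 0.7    # assumption: default train_size, not set explicitly in the notebook

# Assumption: the training set size is rounded down, the test set gets the remainder.
n_train = math.floor(n_rows * train_fraction)
n_test = n_rows - n_train

print(n_train, n_test)
```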


In [58]:
best = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,299.4187,170713.4911,412.7915,0.8833,0.0523,0.0391,0.083
lightgbm,Light Gradient Boosting Machine,302.5669,174331.017,417.137,0.8808,0.0528,0.0396,0.127
lasso,Lasso Regression,300.929,177226.7584,420.4153,0.879,0.0526,0.0393,0.012
ridge,Ridge Regression,300.9323,177228.9407,420.4171,0.879,0.0526,0.0393,0.013
lar,Least Angle Regression,300.9289,177228.9696,420.4169,0.879,0.0526,0.0393,0.012
llar,Lasso Least Angle Regression,300.929,177226.7678,420.4153,0.879,0.0526,0.0393,0.011
omp,Orthogonal Matching Pursuit,300.8259,177175.7351,420.3513,0.879,0.0526,0.0393,0.011
br,Bayesian Ridge,300.9298,177228.9746,420.4169,0.879,0.0526,0.0393,0.01
lr,Linear Regression,300.9289,177228.9696,420.4169,0.879,0.0526,0.0393,0.011
huber,Huber Regressor,296.5484,179609.0325,423.2407,0.8773,0.0528,0.0384,0.018


## Even with the data prepared this simply, the results are satisfactory: R2 of about 0.88 and MAE of about 299. R2 has a maximum value of 1. MAE is the mean absolute error, expressed in the same unit as the target, here seconds, so the model is off by roughly 5 minutes on average. Let's check whether the model can be refined to get better results.
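
## As a reminder of what these two metrics measure, here is a minimal sketch that computes MAE and R2 by hand. The finish times are made up for illustration and are not taken from the dataset; the two functions are plain-Python stand-ins, not the PyCaret implementation.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the target (seconds here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot; equals 1.0 for perfect predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up finish times in seconds (illustrative only).
y_true = [3900, 5400, 6600, 7500]
y_pred = [4200, 5100, 6900, 7200]

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
minutes, seconds = divmod(round(mae), 60)
print(f"MAE = {mae:.0f} s ({minutes} min {seconds} s), R2 = {r2:.3f}")
```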

In [59]:
dfcopy = df.copy()

## On a copy of the data, I remove the outliers

In [60]:
Q1 = dfcopy["Czas"].quantile(0.25)
Q3 = dfcopy["Czas"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

fake_no_outliers_df = dfcopy[~((dfcopy["Czas"] < lower_bound) | (dfcopy["Czas"] > upper_bound))]
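
## The 1.5 × IQR rule used above can be checked on a small example. This is a sketch on made-up finish times, not on the race data; it uses `Series.between`, which for non-missing values keeps the same rows as the mask in the cell above.

```python
import pandas as pd

# Made-up finish times in seconds; the last value is an obvious outlier.
times = pd.Series([5400, 5700, 6000, 6300, 6600, 6900, 7200, 14000], name="Czas")

q1 = times.quantile(0.25)
q3 = times.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is inclusive on both ends, so values exactly on a bound are kept.
kept = times[times.between(lower_bound, upper_bound)]
print(f"bounds: [{lower_bound}, {upper_bound}], kept {len(kept)} of {len(times)} values")
```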

In [61]:
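# NOTE: setup() is called on `df` here, not on `fake_no_outliers_df`,
# so this experiment repeats the first one and reports identical metrics.
exp2 = setup(data = df, target="Czas", session_id=123)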
best2 = compare_models()

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Czas
2,Target type,Regression
3,Original data shape,"(17927, 4)"
4,Transformed data shape,"(17927, 4)"
5,Transformed train set shape,"(12548, 4)"
6,Transformed test set shape,"(5379, 4)"
7,Numeric features,2
8,Categorical features,1
9,Preprocess,True


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,299.4187,170713.4911,412.7915,0.8833,0.0523,0.0391,0.074
lightgbm,Light Gradient Boosting Machine,302.5669,174331.017,417.137,0.8808,0.0528,0.0396,0.121
lasso,Lasso Regression,300.929,177226.7584,420.4153,0.879,0.0526,0.0393,0.012
ridge,Ridge Regression,300.9323,177228.9407,420.4171,0.879,0.0526,0.0393,0.011
lar,Least Angle Regression,300.9289,177228.9696,420.4169,0.879,0.0526,0.0393,0.01
llar,Lasso Least Angle Regression,300.929,177226.7678,420.4153,0.879,0.0526,0.0393,0.012
omp,Orthogonal Matching Pursuit,300.8259,177175.7351,420.3513,0.879,0.0526,0.0393,0.01
br,Bayesian Ridge,300.9298,177228.9746,420.4169,0.879,0.0526,0.0393,0.01
lr,Linear Regression,300.9289,177228.9696,420.4169,0.879,0.0526,0.0393,0.013
huber,Huber Regressor,296.5484,179609.0325,423.2407,0.8773,0.0528,0.0384,0.018


In [70]:
tuned_gbr1 = tune_model(best, optimize='R2')



Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,311.5047,196493.1394,443.2755,0.8633,0.0562,0.041
1,308.8866,172634.7737,415.4934,0.8794,0.0528,0.0406
2,323.3367,197275.8338,444.1574,0.8532,0.056,0.0421
3,312.1908,181615.9831,426.1643,0.8664,0.0538,0.0409
4,316.3259,192065.942,438.2533,0.8769,0.0551,0.0411
5,334.2168,228089.8193,477.5875,0.8421,0.0603,0.0436
6,338.0277,222818.1686,472.0362,0.8571,0.0584,0.0434
7,318.9876,204283.9811,451.9779,0.8606,0.0565,0.0416
8,313.8711,177113.5079,420.8486,0.8814,0.0537,0.0411
9,309.4343,179973.7534,424.2331,0.8848,0.0537,0.0404


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [71]:
results1 = pull()

In [73]:
results1

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,311.5047,196493.1394,443.2755,0.8633,0.0562,0.041
1,308.8866,172634.7737,415.4934,0.8794,0.0528,0.0406
2,323.3367,197275.8338,444.1574,0.8532,0.056,0.0421
3,312.1908,181615.9831,426.1643,0.8664,0.0538,0.0409
4,316.3259,192065.942,438.2533,0.8769,0.0551,0.0411
5,334.2168,228089.8193,477.5875,0.8421,0.0603,0.0436
6,338.0277,222818.1686,472.0362,0.8571,0.0584,0.0434
7,318.9876,204283.9811,451.9779,0.8606,0.0565,0.0416
8,313.8711,177113.5079,420.8486,0.8814,0.0537,0.0411
9,309.4343,179973.7534,424.2331,0.8848,0.0537,0.0404


In [72]:
tuned_gbr2 = tune_model(best2, optimize='R2')
results2 = pull()


Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,311.5047,196493.1394,443.2755,0.8633,0.0562,0.041
1,308.8866,172634.7737,415.4934,0.8794,0.0528,0.0406
2,323.3367,197275.8338,444.1574,0.8532,0.056,0.0421
3,312.1908,181615.9831,426.1643,0.8664,0.0538,0.0409
4,316.3259,192065.942,438.2533,0.8769,0.0551,0.0411
5,334.2168,228089.8193,477.5875,0.8421,0.0603,0.0436
6,338.0277,222818.1686,472.0362,0.8571,0.0584,0.0434
7,318.9876,204283.9811,451.9779,0.8606,0.0565,0.0416
8,313.8711,177113.5079,420.8486,0.8814,0.0537,0.0411
9,309.4343,179973.7534,424.2331,0.8848,0.0537,0.0404


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


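## Tuning the hyperparameters of both models did not improve R2: in each case PyCaret reported that the original model was better than the tuned one. The comparison between data with and without outliers is inconclusive, however, because the second `setup` call received `df` rather than `fake_no_outliers_df`, so both experiments ran on the same data. We therefore keep the model obtained at the very beginning as the final model.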

In [76]:
save_model(best, 'best_model')


Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['5 km Tempo', 'Wiek'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['Płeć'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordinal_encoding',
                  TransformerWrapper(include=['Płeć'],
                                     transformer=OrdinalEncoder(cols=['Płeć'],
                                                                handle_missing='return_nan',
                                                                mapping=[{'col': 'Płeć',
                                                                          'data_type': dtype('O'),
                                                                          'mapping': K      0
 M      1
 NaN   -1
 dtype: int64}]))),
         