## Base model

I started with a neural network model, whose best prediction score was 52%. After a chat with Frigyes in the Heller dormitory, I turned to boosting. On ChatGPT's suggestion, I chose extreme gradient boosting (XGBoost); the results are presented below.

### Predicting binary outcome

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
import xgboost as xgb

# Load data
test = pd.read_csv('D:\\Corvinus\\23_24_1\\ML\\test.csv')
train = pd.read_csv('D:\\Corvinus\\23_24_1\\ML\\train.csv')

# Prepare data
X_train = train.iloc[:, :59]
y_train = train.iloc[:, 59]
X_test = test.iloc[:, :59]

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hold out 20% of the training data as a validation set (used for early stopping)
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_scaled, y_train, test_size=0.2, random_state=42
)

# Define XGBoost model
model = xgb.XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    use_label_encoder=False,  # To avoid a warning message in newer versions
    early_stopping_rounds=10  # Stop training if no improvement in 10 rounds
)

# Train the model
model.fit(
    X_train_split, y_train_split,
    eval_set=[(X_val_split, y_val_split)]
)

# Predict hard class labels (0/1) on the test set
y_pred = model.predict(X_test_scaled)

# Create a DataFrame for final scores
FinalScores = pd.DataFrame({'article_id': test.iloc[:, 59].astype('int32'), 'score': y_pred.astype('int32')})

# Save the results to a CSV file
FinalScores.to_csv('XGBoost_output.csv', index=False)


[0]	validation_0-logloss:0.35647
[1]	validation_0-logloss:0.34536
[2]	validation_0-logloss:0.33892
[3]	validation_0-logloss:0.33595
[4]	validation_0-logloss:0.33285
[5]	validation_0-logloss:0.33149
[6]	validation_0-logloss:0.33004
[7]	validation_0-logloss:0.32925
[8]	validation_0-logloss:0.32889
[9]	validation_0-logloss:0.32843
[10]	validation_0-logloss:0.32805
[11]	validation_0-logloss:0.32780
[12]	validation_0-logloss:0.32853
[13]	validation_0-logloss:0.32902
[14]	validation_0-logloss:0.32988
[15]	validation_0-logloss:0.33001
[16]	validation_0-logloss:0.32990
[17]	validation_0-logloss:0.33023
[18]	validation_0-logloss:0.33077
[19]	validation_0-logloss:0.33089
[20]	validation_0-logloss:0.33171
[21]	validation_0-logloss:0.33200
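
The log stops at round 21 because of `early_stopping_rounds=10`: the validation log loss last improved at round 11, and ten further rounds brought no improvement. The sketch below replays that rule on the values printed above. The helper `replay_early_stopping` is a hypothetical name and a simplified patience rule, not xgboost's internal implementation.

```python
# Validation log loss per boosting round, copied from the training log above.
val_logloss = [
    0.35647, 0.34536, 0.33892, 0.33595, 0.33285, 0.33149, 0.33004, 0.32925,
    0.32889, 0.32843, 0.32805, 0.32780, 0.32853, 0.32902, 0.32988, 0.33001,
    0.32990, 0.33023, 0.33077, 0.33089, 0.33171, 0.33200,
]


def replay_early_stopping(losses, patience):
    """Return (best_round, best_loss, stop_round) for a simple patience rule.

    Hypothetical helper for illustration; not xgboost's internal code.
    stop_round is None if the patience is never exhausted.
    """
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(losses):
        if loss < best_loss:
            # New best score: remember it and reset the patience window.
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, best_loss, rnd
    return best_round, best_loss, None


best_round, best_loss, stop_round = replay_early_stopping(val_logloss, patience=10)
print(f"best round: {best_round}, log loss: {best_loss:.5f}, stopped after round: {stop_round}")
```

On the fitted classifier, the `best_iteration` attribute should report the same best round.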


First I tried to predict the binary outcome directly, but this gave quite poor results, so I switched to predicting class probabilities instead. The train-validation split and the model parameters were suggested by ChatGPT.
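
One possible reason hard labels scored worse is sketched below. It assumes the submissions are scored with a ranking metric such as ROC AUC, which this notebook does not state. The labels and probabilities are made up for illustration, and `roc_auc` is a hypothetical helper written out in plain Python.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.30, 0.45]

# Thresholding at 0.5 mimics the hard labels a binary classifier's predict() returns.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_proba = roc_auc(y_true, proba)
auc_hard = roc_auc(y_true, hard)
print(f"AUC with probabilities: {auc_proba:.3f}")
print(f"AUC with hard labels:   {auc_hard:.3f}")
```

Thresholding collapses most scores into ties, so the ordering among articles within the same predicted class is lost. If the actual metric is not rank-based, this argument may not apply.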

### Predicting probabilities

In [8]:
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]  # Probability of the positive class

# Create a DataFrame for final scores
FinalScores = pd.DataFrame({'article_id': test.iloc[:, 59].astype('int32'), 'score': y_pred_proba})

# Save the results to a CSV file
FinalScores.to_csv('XGBoost_output_proba.csv', index=False)

Predicting probabilities improved the model's performance significantly, and thanks to the developers of the xgboost package it was easy to implement. Next, I applied hyperparameter optimization to improve the predictive power of the model.

## Hyperparameter optimization

To do so, I defined a grid over four parameters with three values each (81 combinations in total), varying the learning rate of the model, the maximum depth of the decision trees, the number of estimators, and the fraction of features sampled for building each tree.
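
As a quick check on the search size, the grid can be enumerated with the standard library. The sketch copies the parameter values from the tuning cell and counts the candidates and the total number of fits under 3-fold cross-validation.

```python
from itertools import product

# Same grid as in the tuning cell: four parameters, three values each.
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

cv_folds = 3  # matches cv=3 in the GridSearchCV call
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

By default `GridSearchCV` also refits the best candidate on the full training data (`refit=True`), so the tuned model should be available as `grid_search.best_estimator_` without a separate refit.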

In [5]:
model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False)

# Define the hyperparameter grid for tuning
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='neg_log_loss',  # Use log loss as the scoring metric for binary classification
    cv=3,  # Number of cross-validation folds
    verbose=2
)

# Perform the grid search
grid_search.fit(X_train_scaled, y_train)

# Get the best parameters from the grid search
best_params = grid_search.best_params_

# Train the model with the best parameters
best_model = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, **best_params)
best_model.fit(X_train_scaled, y_train)

# Predict probabilities on the test set
y_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]  # Probability of the positive class

# Create a DataFrame for final scores
FinalScores = pd.DataFrame({'article_id': test.iloc[:, 59].astype('int32'), 'score': y_pred_proba})

# Save the results to a CSV file
FinalScores.to_csv('XGBoost_output_proba_tuned.csv', index=False)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.2s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.6s
[CV] END colsample_bytree=0.6, l

[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.1s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.2s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.6s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, n_estimators=50; total time=   0.3s
[CV] END colsample_bytree=0.8, learning_rate=0.01, max_depth=5, n_estimators=50; total time=  

[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=50; total time=   0.2s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=100; total time=   0.4s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.8s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.7s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=3, n_estimators=200; total time=   0.7s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=5, n_estimators=50; total time=   0.3s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=5, n_estimators=50; total time=   0.3s
[CV] END colsample_bytree=1.0, learning_rate=0.01, max_depth=5, n_estimators=50; total time=  

In [6]:
best_params

{'colsample_bytree': 0.8,
 'learning_rate': 0.1,
 'max_depth': 3,
 'n_estimators': 100}

Hyperparameter tuning improved the model further. Among the 81 candidates, the best model has the following parameters: 80% of the features are sampled for each tree, the learning rate is 0.1, the maximum depth of the decision trees is 3, and the number of estimators is 100.
Looking for other ways to build a better model, I thought of regularization. Again, the great work of the xgboost developers made it very easy to implement. The regularization parameters were again suggested by ChatGPT.
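
To illustrate what the two regularization parameters do, here is a minimal sketch of how, as I understand XGBoost's formulation, the optimal weight of a single leaf depends on `reg_alpha` (L1) and `reg_lambda` (L2). The function `leaf_weight` and the gradient and hessian sums are hypothetical values chosen for illustration, not taken from the fitted model.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Optimal leaf weight: -soft_threshold(G, alpha) / (H + lambda).

    Hypothetical helper; the defaults mirror xgboost's documented defaults
    (reg_alpha=0, reg_lambda=1).
    """
    # L1 penalty: soft-threshold the gradient sum towards zero.
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2 penalty: enlarge the denominator, which shrinks the weight.
    return -shrunk / (hess_sum + reg_lambda)


G, H = -4.0, 10.0  # made-up gradient and hessian sums for one leaf

w_none = leaf_weight(G, H, reg_alpha=0.0, reg_lambda=0.0)  # no regularization
w_cell = leaf_weight(G, H, reg_alpha=0.1, reg_lambda=0.1)  # values used in this notebook
w_default = leaf_weight(G, H)                              # library defaults

print(f"no regularization: {w_none:.4f}")
print(f"alpha=0.1, lambda=0.1: {w_cell:.4f}")
print(f"defaults (alpha=0, lambda=1): {w_default:.4f}")
```

Note that xgboost's default is `reg_lambda=1`, so setting `reg_lambda=0.1` applies a weaker L2 penalty than the default, while `reg_alpha=0.1` adds a small L1 penalty that the default does not have.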

### Regularization

In [9]:
# Define XGBoost model with regularization
model = xgb.XGBClassifier(
    objective='binary:logistic',
    use_label_encoder=False,  # To avoid a warning message in newer versions
    learning_rate=0.1,
    max_depth=3,
    n_estimators=100,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=0.1
)

# Train the model
model.fit(
    X_train_split, y_train_split,
    eval_set=[(X_val_split, y_val_split)]
)

# Predict probabilities on the test set
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]  # Probability of the positive class

# Create a DataFrame for final scores
FinalScores = pd.DataFrame({'article_id': test.iloc[:, 59].astype('int32'), 'score': y_pred_proba})

# Save the results to a CSV file
FinalScores.to_csv('XGBoost_output_proba_regularized.csv', index=False)

[0]	validation_0-logloss:0.36670
[1]	validation_0-logloss:0.36184
[2]	validation_0-logloss:0.35760
[3]	validation_0-logloss:0.35399
[4]	validation_0-logloss:0.35091
[5]	validation_0-logloss:0.34820
[6]	validation_0-logloss:0.34596
[7]	validation_0-logloss:0.34406
[8]	validation_0-logloss:0.34211
[9]	validation_0-logloss:0.34057
[10]	validation_0-logloss:0.33929
[11]	validation_0-logloss:0.33816
[12]	validation_0-logloss:0.33700
[13]	validation_0-logloss:0.33618
[14]	validation_0-logloss:0.33503
[15]	validation_0-logloss:0.33430
[16]	validation_0-logloss:0.33344
[17]	validation_0-logloss:0.33276
[18]	validation_0-logloss:0.33200
[19]	validation_0-logloss:0.33149
[20]	validation_0-logloss:0.33101
[21]	validation_0-logloss:0.33061
[22]	validation_0-logloss:0.33005
[23]	validation_0-logloss:0.32958
[24]	validation_0-logloss:0.32923
[25]	validation_0-logloss:0.32898
[26]	validation_0-logloss:0.32871
[27]	validation_0-logloss:0.32852
[28]	validation_0-logloss:0.32825
[29]	validation_0-loglos