# Feature Selection
<br>
Based on https://www.sciencedirect.com/science/article/pii/S1877050922007955 by Fátima Rodrigues and Ângelo Pinto.

- The Boruta algorithm was used. Boruta is a heuristic variable selection algorithm built on random forests that aims to find the most relevant variables in a dataset.
- Boruta labels each variable as relevant or non-relevant. It was chosen because it does not settle for a suboptimal solution; instead, it tries to find all variables that carry relevant information, which makes it possible to eliminate variables that would negatively affect the forecast models.
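
The core idea behind Boruta can be sketched in a few lines. This is a simplified illustration, not the implementation in models.py and not the full algorithm: it only counts how often each real feature beats the best shuffled ("shadow") copy, and it omits the statistical test Boruta uses to confirm or reject features. The function name `shadow_feature_hits` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_hits(X, y, n_rounds=5, seed=0):
    """Count how often each real feature beats the best shuffled ("shadow") copy."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow features: every column permuted independently, which
        # destroys any relationship with the target.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(
            n_estimators=100, random_state=int(rng.integers(1_000_000))
        )
        forest.fit(np.hstack([X, shadows]), y)
        real = forest.feature_importances_[:n_features]
        shadow = forest.feature_importances_[n_features:]
        # A real feature scores a "hit" when it is more important than
        # the most important shadow feature in this round.
        hits += (real > shadow.max()).astype(int)
    return hits


# Synthetic data (hypothetical): only the first two columns drive the target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

demo_hits = shadow_feature_hits(X_demo, y_demo, n_rounds=5)
print(demo_hits)  # hits per feature out of 5 rounds
```

Features that collect hits in most rounds are candidates for "relevant"; features that rarely beat the shadows are candidates for rejection.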

In [1]:
!pip install boruta



In [2]:
import pandas as pd
import numpy as np
import pickle
import json
from sklearn.impute import SimpleImputer
from constants.constants import *
from models import Boruta, PreProcessor # see models.py for implementation

pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_rows', 20) # show at most 20 rows
pd.set_option('display.max_seq_items', None)

To keep it simple for now, we use only the match outcome as the target when selecting the most important features. We train only on data from 2019 to 2022, since we are predicting 2023.

In [3]:
dfs = []

for year in range(2019, 2023):
    df = pd.read_csv(f'./data/machine_learning/outcome/{year}.csv')
    dfs.append(df)

train_df = pd.concat(dfs, ignore_index=True)

In [4]:
train_df = train_df.drop(['form', 'opponent_form'], axis=1) # form is too complicated to account for when predicting

Unnamed: 0,season_start_year,year,weekday,month,day,hour,minute,team,opponent,is_home,form,opponent_form,team_minutes_90s,team_country,...,outcome
0,2019,2019,1,6,18,1,0,Aston Villa,Sheffield Utd,1,,,46.0,ENG,...,D
1,2019,2019,1,6,18,3,15,Manchester City,Arsenal,1,,,38.0,ENG,...,W
2,2019,2019,3,6,20,1,0,Norwich,Southampton,1,,,46.0,ENG,...,L
3,2019,2019,3,6,20,3,15,Tottenham,Manchester Utd,1,,,38.0,ENG,...,D
4,2019,2019,3,6,20,19,30,Watford,Leicester,1,,,38.0,ENG,...,D
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14403,2022,2023,6,5,28,3,0,Monaco,Rennes,0,0.8,1.8,38.0,FRA,...,L
14404,2022,2023,6,5,28,3,0,PSG,Strasbourg,0,2.4,2.0,38.0,FRA,...,D
14405,2022,2023,6,5,28,3,0,Auxerre,Toulouse,0,0.4,1.2,38.0,FRA,...,D
14406,2022,2023,6,5,28,3,0,Nice,Montpellier,0,1.4,2.0,38.0,FRA,...,W


We create a PreProcessor class and a Boruta child class in models.py so that Boruta has access to the preprocessor.

Normalize the data

In [5]:
train_df_clean = train_df.dropna()
X = train_df_clean.drop(columns=['outcome'])
y = train_df_clean['outcome']
preprocessor = PreProcessor(X, y)
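
The actual PreProcessor lives in models.py and is not shown here. As a rough idea of what such a normalization step could look like, the sketch below standardizes numeric columns, one-hot encodes categorical ones, and label-encodes the target. The function name `preprocess_sketch` and the toy data are hypothetical, and the real class may differ in every detail.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def preprocess_sketch(X, y):
    """Hypothetical stand-in for a preprocessor: scale, one-hot encode, label-encode."""
    numeric_cols = X.select_dtypes(include="number").columns
    categorical_cols = X.columns.difference(numeric_cols)

    # Standardize numeric features to zero mean and unit variance.
    scaler = StandardScaler()
    X_num = pd.DataFrame(
        scaler.fit_transform(X[numeric_cols]),
        columns=numeric_cols,
        index=X.index,
    )

    # One-hot encode categorical features, e.g. team_country -> team_country_ENG.
    X_cat = pd.get_dummies(X[categorical_cols], dtype=float)

    # Encode the string outcome (D / L / W) as integers.
    target_encoder = LabelEncoder()
    y_encoded = target_encoder.fit_transform(y)

    return pd.concat([X_num, X_cat], axis=1), y_encoded, target_encoder


# Toy data (hypothetical values) to show the shape of the result.
X_toy = pd.DataFrame(
    {
        "team_country": ["ENG", "FRA", "ENG", "GER"],
        "team_shots_per90": [14.0, 11.7, 17.8, 12.0],
        "is_home": [1, 0, 1, 0],
    }
)
y_toy = pd.Series(["D", "W", "L", "W"])

X_toy_prep, y_toy_enc, toy_encoder = preprocess_sketch(X_toy, y_toy)
print(X_toy_prep.columns.tolist())
```

Keeping the fitted target encoder around matters later, because it is what maps the integer classes of a model back to the D / L / W labels.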

# Logistic Regression

We use L1-regularized logistic regression to reduce the number of features before running Boruta.

In [6]:
with open('data/machine_learning/pkl/processor.pkl', 'rb') as f:
    preprocessor = pickle.load(f)

In [7]:
from sklearn.linear_model import LogisticRegression

# Initialize logistic regression with an L1 penalty, which can shrink weak coefficients to exactly zero
log_reg = LogisticRegression(penalty='l1', solver='saga', C=1.0, max_iter=10000, multi_class='multinomial', random_state=42)
# Fit on the preprocessed training features and the encoded target
log_reg.fit(preprocessor.df_train_preprocessed, preprocessor.y_train_encoded)
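
To see why an L1 penalty is useful for feature reduction, here is a minimal sketch on synthetic data (not the football dataset). The dataset, the value `C=0.05`, and all variable names are assumptions chosen for illustration; a smaller `C` means stronger regularization and typically more coefficients set exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem: 3 informative features, 17 pure-noise features.
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)
X_syn = StandardScaler().fit_transform(X_syn)

# Strong L1 regularization (small C) with the same solver as the notebook.
sparse_model = LogisticRegression(
    penalty="l1", solver="saga", C=0.05, max_iter=5000, random_state=0
)
sparse_model.fit(X_syn, y_syn)

n_zero = int(np.sum(sparse_model.coef_ == 0))
n_nonzero = int(np.sum(sparse_model.coef_ != 0))
print(f"{n_zero} of {sparse_model.coef_.size} coefficients are exactly zero")
```

Features whose coefficients are zero for the chosen `C` contribute nothing to the linear model, which is what makes them candidates for removal.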

In [8]:
feature_names = preprocessor.df_train_preprocessed.columns
coefficients = log_reg.coef_
class_labels = preprocessor.target_encoder.inverse_transform(log_reg.classes_)

coef_df = pd.DataFrame(coefficients.T,  # Transpose so that features are rows
                       columns=[f'Coeff_Class_{cls}' for cls in class_labels],
                       index=feature_names)

coef_df

Unnamed: 0,Coeff_Class_D,Coeff_Class_L,Coeff_Class_W
day,0.021551,0.000000,-0.025881
form,0.018492,-0.061269,0.000000
hour,0.004605,-0.007555,0.000000
is_home,0.000000,-0.168583,0.194414
minute,0.016961,-0.022936,0.000000
...,...,...,...
team_country_ENG,0.001083,0.000000,-0.073859
team_country_FRA,-0.031006,0.044002,0.000000
team_country_GER,-0.150584,0.000000,0.000000
team_country_ITA,0.000000,-0.050712,0.001613


In [9]:
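# Flag a feature as weak when its coefficient is below the threshold for every class.
# Note: the comparison is signed, so large negative coefficients also count as weak.
threshold = 0.1
weak_features_mask = (coef_df['Coeff_Class_D'] < threshold) & (coef_df['Coeff_Class_L'] < threshold) & (coef_df['Coeff_Class_W'] < threshold)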
strong_features_mask = ~weak_features_mask
coef_df[weak_features_mask]

Unnamed: 0,Coeff_Class_D,Coeff_Class_L,Coeff_Class_W
day,0.021551,0.000000,-0.025881
form,0.018492,-0.061269,0.000000
hour,0.004605,-0.007555,0.000000
minute,0.016961,-0.022936,0.000000
month,-0.013872,0.001974,0.000000
...,...,...,...
team_country_ENG,0.001083,0.000000,-0.073859
team_country_FRA,-0.031006,0.044002,0.000000
team_country_GER,-0.150584,0.000000,0.000000
team_country_ITA,0.000000,-0.050712,0.001613
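
The mask above compares signed coefficients, which is why a row such as team_country_GER (coefficient -0.150584 for class D) is listed as weak. If the intent is to judge features by coefficient magnitude, one possible variant compares absolute values instead. The sketch below uses three rows copied from the output above; `coef_demo` and the mask names are hypothetical, and switching to this variant would change the selected features downstream.

```python
import pandas as pd

# Three rows copied from the coefficient table above.
coef_demo = pd.DataFrame(
    {
        "Coeff_Class_D": [0.021551, -0.150584, 0.000000],
        "Coeff_Class_L": [0.000000, 0.000000, -0.168583],
        "Coeff_Class_W": [-0.025881, 0.000000, 0.194414],
    },
    index=["day", "team_country_GER", "is_home"],
)

threshold = 0.1

# Signed comparison, equivalent to the mask used in the notebook.
signed_weak_mask = (coef_demo < threshold).all(axis=1)

# Magnitude-based alternative: weak only if |coefficient| < threshold for every class.
abs_weak_mask = (coef_demo.abs() < threshold).all(axis=1)

print(pd.DataFrame({"signed_weak": signed_weak_mask, "abs_weak": abs_weak_mask}))
```

With the magnitude-based mask, team_country_GER is no longer classified as weak, while day stays weak and is_home stays strong under both definitions.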


In [10]:
strong_features_df = coef_df[strong_features_mask]
strong_features_df

Unnamed: 0,Coeff_Class_D,Coeff_Class_L,Coeff_Class_W
is_home,0.000000,-0.168583,0.194414
opponent_aerials_won_pct,0.118672,0.000000,-0.039114
opponent_aerials_won_per90,-0.044429,0.000000,0.173660
opponent_assisted_shots_per90,0.167085,0.000000,-0.252517
opponent_carries_per90,0.000000,0.000000,0.174517
...,...,...,...
team_shots_per90,0.000000,0.000000,0.701233
team_squad_value_avg,0.000000,-0.243025,0.157051
team_touches_att_3rd_per90,0.000000,0.113958,0.000000
team_xg_assist_per90,0.000000,0.246370,-0.511695


In [11]:
LOG_REG_COLS = strong_features_df.index.tolist()

In [12]:
boruta = Boruta()

boruta.fit(preprocessor.df_train_preprocessed[LOG_REG_COLS], preprocessor.y_train)
BORUTA_FEATURES = boruta.get_selected_features()

with open(BORUTA_FEATURES_PATH, 'w') as file:
    json.dump(BORUTA_FEATURES, file, indent=4)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	54
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	38
Tentative: 	6
Rejected: 	10
Iteration: 	9 / 100
Confirmed: 	38
Tentative: 	6
Rejected: 	10
Iteration: 	10 / 100
Confirmed: 	38
Tentative: 	6
Rejected: 	10
Iteration: 	11 / 100
Confirmed: 	38
Tentative: 	6
Rejected: 	10
Iteration: 	12 / 100
Confirmed: 	38
Tentative: 	3
Rejected: 	13
Iteration: 	13 / 100
Confirmed: 	38
Tentative: 	3
Rejected: 	13
Iteration: 	14 / 100
Confirmed: 	38
Tentative: 	3
Rejected: 	13
Iteration: 	15 / 100
Confirmed: 	38
Tentative: 	3
Rejected: 	13
Iteration: 	16 / 100
Confirmed: 	38
Tentative: 	3
Reject

In [13]:
print(BORUTA_FEATURES)

['is_home', 'opponent_assisted_shots_per90', 'opponent_carries_per90', 'opponent_gd_per90', 'opponent_gk_pct_passes_launched', 'opponent_gk_psxg', 'opponent_goals_pens_per90', 'opponent_goals_per90', 'opponent_passes_dead_per90', 'opponent_passes_live_per90', 'opponent_passes_pct', 'opponent_passes_pct_short', 'opponent_progressive_passes_per90', 'opponent_shots_on_target_per90', 'opponent_shots_per90', 'opponent_squad_value_avg', 'opponent_touches_att_3rd_per90', 'opponent_touches_mid_3rd_per90', 'opponent_xg_assist_per90', 'opponent_xg_per90', 'team_assisted_shots_per90', 'team_carries_distance_per90', 'team_carries_per90', 'team_carries_progressive_distance_per90', 'team_gca_per90', 'team_gd_per90', 'team_goals_per90', 'team_passes_dead_per90', 'team_passes_live_per90', 'team_passes_pct', 'team_passes_pct_short', 'team_passes_total_distance_per90', 'team_progressive_passes_per90', 'team_shots_on_target_per90', 'team_shots_per90', 'team_squad_value_avg', 'team_touches_att_3rd_per90',