The data will undergo feature selection, that is, choosing the subset of features that best predicts the target variable. Two techniques are used: forward selection and backward selection.

## Read in data

In [None]:
import pandas as pd
X_train = pd.read_csv('X_train.csv')
y_train = pd.read_csv('y_train.csv')
X_val = pd.read_csv('X_val.csv')
y_val = pd.read_csv('y_val.csv')

In [None]:
X_train

Unnamed: 0,const,FIFA Rank,Manager_Age,Contract until,Titles,Months_installed,Age,Height,Caps,Goals,MarketValue,Win Percentage,Q_GF,Q_GA,Q_GD,Q_PPG_Last_5,Q_Clean_Sheets%,Q_xGF,K_meansCluster,H_clustering
0,1.0,0.876712,0.482759,0.0,0.0,0.086957,0.684229,0.3692,0.366782,0.0,0.013687,0.0,0.0,0.125,0.315789,0.377778,0.384615,0.082803,2,4
1,1.0,0.0,0.655172,1.0,0.666667,1.0,0.537804,0.5342,0.621733,0.863125,0.837203,0.693827,0.708333,0.0625,0.789474,0.861111,0.769231,1.0,0,1
2,1.0,0.630137,0.655172,0.5,0.333333,0.130435,0.670886,0.5308,0.611778,0.146009,0.045383,0.177778,0.208333,0.375,0.342105,0.555556,0.384615,0.235669,2,4
3,1.0,0.60274,0.310345,0.0,0.0,0.166667,0.578857,0.4154,0.247642,0.124368,0.0,0.088889,0.166667,0.1875,0.394737,0.555556,0.538462,0.350318,2,4
4,1.0,0.520548,0.448276,1.0,0.0,0.021739,0.171057,0.4308,0.303888,0.16765,0.163033,0.355556,0.083333,0.3125,0.289474,0.516667,0.2,0.503185,1,3
5,1.0,0.068493,0.862069,1.0,0.333333,0.086957,0.420801,0.6846,0.579653,0.545953,0.529122,0.797386,0.208333,0.3125,0.368421,0.583333,0.584615,0.643312,0,3
6,1.0,0.465753,0.827586,0.5,0.333333,0.0,0.0,0.7538,0.0,0.135188,0.071528,0.266667,0.0,0.25,0.263158,0.377778,0.384615,0.528662,1,3
7,1.0,0.09589,1.0,1.0,0.666667,0.028986,0.4078,0.5154,0.165988,0.102726,0.431081,0.225,0.166667,0.4375,0.289474,0.305556,0.2,0.420382,1,3
8,1.0,0.013699,0.068966,1.0,0.0,0.07971,0.537804,0.5822,0.779054,0.778808,0.362474,0.519865,0.416667,0.125,0.578947,0.722222,0.584615,0.464968,0,3
9,1.0,0.054795,0.482759,1.0,0.333333,0.086957,0.565857,0.1232,0.947792,1.0,0.676359,1.0,1.0,0.0,1.0,1.0,1.0,0.929936,0,1


## Forward Selection

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import itertools
import time

A forward selection algorithm is run. It starts with an empty (intercept-only) model and, on each iteration, adds the remaining feature that lowers the model's AIC the most. The process stops as soon as no additional feature improves the AIC.

In [None]:
# Forward selection function
def forward_selection(response, predictors):
    selected_predictors = []
    remaining_predictors = list(predictors.columns)

    # Fit null model
    X_null = sm.add_constant(pd.DataFrame(index=response.index))
    model_null = sm.OLS(response, X_null).fit()
    best_aic = model_null.aic

    while remaining_predictors:
        temp_results = {}

        for predictor in remaining_predictors:
            predictors_with_new = selected_predictors + [predictor]
            X = sm.add_constant(predictors[predictors_with_new])
            model = sm.OLS(response, X).fit()
            temp_results[predictor] = model.aic

        best_predictor = min(temp_results, key=temp_results.get)
        best_aic_new = temp_results[best_predictor]

        if best_aic_new < best_aic:
            selected_predictors.append(best_predictor)
            remaining_predictors.remove(best_predictor)
            best_aic = best_aic_new
        else:
            break

    return selected_predictors

In [None]:
# Perform forward selection
selected_predictors = forward_selection(y_train, X_train)
selected_predictors

['FIFA Rank', 'Caps', 'Titles', 'H_clustering', 'Q_Clean_Sheets%', 'Q_xGF']

The forward selection technique identifies 'FIFA Rank', 'Caps', 'Titles', 'H_clustering', 'Q_Clean_Sheets%' and 'Q_xGF' as the best predictor features.

## Backward Selection

A backward selection algorithm is run. It starts with the full model containing every feature and refits it with each feature left out in turn. The feature whose removal lowers the AIC the most is dropped, and the process is repeated until no removal improves the AIC or only `top_n` features remain.

In [None]:
# Backward elimination function returning at most top_n predictors
def backward_elimination(response, predictors, top_n=7):
    selected_predictors = list(predictors.columns)

    while len(selected_predictors) > top_n:
        temp_results = {}

        for predictor in selected_predictors:
            predictors_with_removed = [p for p in selected_predictors if p != predictor]
            if not predictors_with_removed:
                continue

            X_temp = sm.add_constant(predictors[predictors_with_removed])
            model_temp = sm.OLS(response, X_temp).fit()
            temp_results[predictor] = model_temp.aic

        if not temp_results:
            break

        # Find the predictor whose removal lowers the AIC the most
        worst_predictor = min(temp_results, key=temp_results.get)
        worst_aic = temp_results[worst_predictor]

        if worst_aic < sm.OLS(response, sm.add_constant(predictors[selected_predictors])).fit().aic:
            selected_predictors.remove(worst_predictor)
        else:
            break

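    # Keep at most top_n predictors. If the loop stopped early on the AIC check,
    # this slice keeps the first top_n survivors in column order, not a ranking.
    top_predictors = selected_predictors[:top_n]
    return top_predictors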

# Perform backward elimination and keep at most 7 predictors ('const' included)
top_predictors = backward_elimination(y_train, X_train, top_n=7)
top_predictors

['const',
 'FIFA Rank',
 'Manager_Age',
 'Titles',
 'Months_installed',
 'Age',
 'Height']

After excluding 'const', the six features returned by backward selection are 'FIFA Rank', 'Manager_Age', 'Titles', 'Months_installed', 'Age' and 'Height'.

In [None]:
top_predictors.remove('const')

In [None]:
df = pd.DataFrame(list(zip(selected_predictors, top_predictors)), columns = ['Forward Selection', 'Backward Selection'])
df

Unnamed: 0,Forward Selection,Backward Selection
0,FIFA Rank,FIFA Rank
1,Caps,Manager_Age
2,Titles,Titles
3,H_clustering,Months_installed
4,Q_Clean_Sheets%,Age
5,Q_xGF,Height


Both forward and backward selection have identified 'FIFA Rank' and 'Titles' as key features.

However, the two methods disagree on the other four features in their top six.
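
The overlap between the two lists can be checked directly. The short sketch below copies the two feature lists from the outputs above; the variable names `forward`, `backward`, `shared`, `forward_only` and `backward_only` are introduced here for illustration.

```python
# Feature lists copied from the forward and backward selection outputs above
forward = ['FIFA Rank', 'Caps', 'Titles', 'H_clustering', 'Q_Clean_Sheets%', 'Q_xGF']
backward = ['FIFA Rank', 'Manager_Age', 'Titles', 'Months_installed', 'Age', 'Height']

# List comprehensions keep the original ordering, unlike set operations
shared = [f for f in forward if f in backward]
forward_only = [f for f in forward if f not in backward]
backward_only = [f for f in backward if f not in forward]

print("Shared:       ", shared)
print("Forward only: ", forward_only)
print("Backward only:", backward_only)
```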

Models built on the 'optimum' feature sets identified by forward and backward selection will be fitted and compared with models trained on the full feature set.