# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods covered in the lecture, your ability to apply them to a real-world dataset to select the best features, and your ability to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selection methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [2]:
player_df = pd.read_csv("fifa19.csv")

In [3]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [4]:
player_df = player_df[numcols+catcols]

In [5]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [6]:
traindf = pd.DataFrame(traindf,columns=features)

In [7]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [8]:
X.head()

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Uganda,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,False,False,False,False,False,False,False,False,False,False
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,False,False,False,False,False,False,False,False,False,False
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,False,False,False,False,False,False,False,False,False,False
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,False,False,False,False,False,False,False,False,False,False
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,False,False,False,False,False,False,False,False,False,False


In [9]:
len(X.columns)

223

### Set the feature names and the number of features to select

In [10]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [11]:
def cor_selector(X, y, num_feats):
    cor_list = []
    for col in X.columns:
        cor = np.corrcoef(X[col], y)[0, 1]
        if np.isnan(cor):
            cor = 0
        cor_list.append(cor)
    cor_list = [abs(i) for i in cor_list]

    cor_support = [False] * len(cor_list)
    cor_feature = X.iloc[:, np.argsort(cor_list)[-num_feats:]].columns.tolist()
    for i in np.argsort(cor_list)[-num_feats:]:
        cor_support[i] = True

    return cor_support, cor_feature

In [12]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features
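
The ranking idea behind `cor_selector` can be illustrated on a tiny, made-up dataset. The sketch below is not the notebook's implementation: it uses plain Python instead of `np.corrcoef`, and the feature names and values are hypothetical. It shows two properties of the approach: ranking by absolute value treats a strongly negative correlation as being as useful as a strongly positive one, and a constant column (whose correlation is undefined) is scored as 0, mirroring the `np.isnan` guard.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: score it as uncorrelated
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: three features and a binary target.
toy_features = {
    "strong_pos": [1, 2, 3, 4, 5, 6],
    "strong_neg": [6, 5, 4, 3, 2, 1],
    "constant": [3, 3, 3, 3, 3, 3],
}
toy_target = [0, 0, 0, 1, 1, 1]

# Rank by absolute correlation and keep the two highest-scoring features.
scores = {name: abs(pearson_r(col, toy_target)) for name, col in toy_features.items()}
top_two = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_two)
```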


### List the selected features from Pearson Correlation

In [13]:
cor_feature

['Nationality_Costa Rica',
 'Position_LAM',
 'Nationality_Uruguay',
 'Acceleration',
 'SprintSpeed',
 'Strength',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Stamina',
 'Weak Foot',
 'Agility',
 'Crossing',
 'Nationality_Belgium',
 'Dribbling',
 'ShotPower',
 'LongShots',
 'Finishing',
 'BallControl',
 'FKAccuracy',
 'LongPassing',
 'Volleys',
 'ShortPassing',
 'Position_RF',
 'Position_LF',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Courtois',
 'Body Type_Neymar',
 'Body Type_Messi',
 'Body Type_C. Ronaldo',
 'Reactions']

## Filter Feature Selection - Chi-Square

In [14]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [15]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

def chi_squared_selector(X, y, num_feats):
    X_norm = X.copy().fillna(0)
    X_norm = MinMaxScaler().fit_transform(X_norm)

    k = min(num_feats, X_norm.shape[1])

    chi_selector = SelectKBest(score_func=chi2, k=k)
    chi_selector.fit(X_norm, y)

    chi_support = chi_selector.get_support()
    chi_feature = X.columns[chi_support].tolist()

    return chi_support, chi_feature


In [16]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features
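
The chi-square score requires non-negative feature values, which is why the selector rescales the data with `MinMaxScaler` first. As a rough illustration of what is being scored, the sketch below computes an observed-versus-expected chi-square statistic for a single non-negative feature against a binary target. It is a simplified, pure-Python approximation written for this example; the function name and data are hypothetical, and scikit-learn's `chi2` may differ in detail.

```python
def chi2_score(feature, target):
    """Chi-square statistic between one non-negative feature and a binary target."""
    n = len(feature)
    total = sum(feature)  # total "mass" of the feature across all rows
    score = 0.0
    for cls in (False, True):
        # Observed: feature mass that falls into this class.
        observed = sum(x for x, t in zip(feature, target) if t == cls)
        # Expected: mass this class would receive if feature and target were independent.
        class_prob = sum(1 for t in target if t == cls) / n
        expected = class_prob * total
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

toy_target = [True, True, False, False]
informative = [1, 1, 0, 0]    # concentrated in the positive class
uninformative = [1, 0, 1, 0]  # spread evenly across both classes

print(chi2_score(informative, toy_target), chi2_score(uninformative, toy_target))
```

A feature whose mass is distributed across classes in proportion to class frequency scores 0, while one concentrated in a single class scores higher.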


### List the selected features from Chi-Square 

In [17]:
chi_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'LongShots',
 'Position_CM',
 'Position_LAM',
 'Position_LF',
 'Position_LW',
 'Position_RB',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Spain',
 'Nationality_Uruguay']

## Wrapper Feature Selection - Recursive Feature Elimination

In [18]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

### RFE Selector function

In [19]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def rfe_selector(X, y, num_feats):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    model = LogisticRegression(max_iter=5000, solver="liblinear")
    rfe = RFE(estimator=model, n_features_to_select=min(num_feats, X.shape[1]))
    rfe.fit(X_scaled, y)

    rfe_support = rfe.get_support()
    rfe_feature = X.columns[rfe_support].tolist()

    return rfe_support, rfe_feature


In [20]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

30 selected features
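
To make the elimination loop behind RFE concrete, here is a minimal sketch that swaps the logistic regression used above for an ordinary least-squares fit, so that it needs only NumPy. The function name `rfe_least_squares` and the synthetic data are assumptions made for illustration. At each step the model is refit on the remaining columns and the column with the smallest absolute coefficient is dropped.

```python
import numpy as np

def rfe_least_squares(X, y, n_keep):
    """Drop the weakest feature one at a time until n_keep columns remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the surviving columns only.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Eliminate the column whose coefficient has the smallest magnitude.
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 2] + 0.1 * rng.normal(size=200)

kept = rfe_least_squares(X_toy, y_toy, n_keep=2)
print(kept)
```

Comparing coefficient magnitudes only makes sense when the columns are on a similar scale, which is one reason the selector above standardizes `X` before fitting.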


### List the selected features from RFE

In [21]:
rfe_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'FKAccuracy',
 'Reactions',
 'Strength',
 'Aggression',
 'Position_CAM',
 'Position_CB',
 'Position_CF',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_RB',
 'Position_RCB',
 'Position_RM',
 'Position_RW',
 'Body Type_Lean',
 'Nationality_China PR',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Saudi Arabia',
 'Nationality_Switzerland']

## Embedded Selection - Lasso: SelectFromModel

In [22]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [23]:
def embedded_log_reg_selector(X, y, num_feats):
    model = LogisticRegression(penalty="l1", solver="liblinear", max_iter=500)
    selector = SelectFromModel(model, max_features=num_feats)
    selector.fit(X, y)
    embedded_lr_support = selector.get_support()
    embedded_lr_feature = X.loc[:, embedded_lr_support].columns.tolist()
    return embedded_lr_support, embedded_lr_feature

In [24]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [25]:
embedded_lr_feature

['Crossing',
 'Finishing',
 'Dribbling',
 'LongPassing',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RW',
 'Body Type_Lean',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Uruguay']

## Tree-based (Random Forest): SelectFromModel

In [26]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [27]:
def embedded_rf_selector(X, y, num_feats):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    selector = SelectFromModel(model, max_features=num_feats)
    selector.fit(X, y)
    embedded_rf_support = selector.get_support()
    embedded_rf_feature = X.loc[:, embedded_rf_support].columns.tolist()
    return embedded_rf_support, embedded_rf_feature

In [28]:

embedder_rf_support, embedder_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedder_rf_feature)), 'selected features')


24 selected features
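
The count is below `num_feats` because `max_features` in `SelectFromModel` is only an upper bound: the selector also applies its `threshold`, and features whose importance falls below it are dropped. According to the scikit-learn documentation, setting `threshold=-np.inf` selects on `max_features` alone. The sketch below contrasts the two settings on synthetic data; the dataset and parameter values are illustrative assumptions, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 3 of them informative.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

# Default threshold: max_features acts as a cap, so fewer features may be kept.
default_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0), max_features=10
).fit(X_toy, y_toy)

# threshold=-np.inf: keep exactly the max_features highest-ranked features.
topk_sel = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=10,
    threshold=-np.inf,
).fit(X_toy, y_toy)

n_default = int(default_sel.get_support().sum())
n_topk = int(topk_sel.get_support().sum())
print(n_default, n_topk)
```

Whether an exact count is desirable depends on the goal: the default behaviour avoids padding the list with low-importance features, while the fixed count keeps every method's vote the same size.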


In [29]:
embedder_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Slovenia']

## Tree-based (LightGBM): SelectFromModel

In [30]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [31]:
def embedded_lgbm_selector(X, y, num_feats):

    model = LGBMClassifier(n_estimators=200, random_state=42)

    selector = SelectFromModel(model, max_features=num_feats)
    selector.fit(X, y)
    embedded_lgbm_support = selector.get_support()
    embedded_lgbm_feature = X.loc[:, embedded_lgbm_support].columns.tolist()
    return embedded_lgbm_support, embedded_lgbm_feature

In [32]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')


[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001108 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1812
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
23 selected features


In [33]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LCB',
 'Body Type_Lean',
 'Nationality_France',
 'Nationality_Italy']

## Putting all of it together: AutoFeatureSelector Tool

In [34]:
pd.set_option('display.max_rows', None)

# Put all selection together
feature_selection_df = pd.DataFrame({
    'Feature': feature_name,
    'Pearson': cor_support,
    'Chi-2': chi_support,
    'RFE': rfe_support,
    'Logistics': embedded_lr_support,      
    'Random Forest': embedder_rf_support, 
    'LightGBM': embedded_lgbm_support      
})

# Count the selected times for each feature
feature_selection_df['Total'] = np.sum(feature_selection_df.drop(columns=['Feature']), axis=1)

# Sort by Total
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=[False, True])
feature_selection_df.reset_index(drop=True, inplace=True)

feature_selection_df


Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistics,Random Forest,LightGBM,Total
0,FKAccuracy,True,True,True,True,True,True,6
1,Finishing,True,True,True,True,True,True,6
2,LongPassing,True,True,True,True,True,True,6
3,Reactions,True,True,True,True,True,True,6
4,Acceleration,True,False,True,True,True,True,5
5,Agility,True,False,True,True,True,True,5
6,BallControl,True,True,True,False,True,True,5
7,Crossing,True,False,True,True,True,True,5
8,ShortPassing,True,True,True,False,True,True,5
9,SprintSpeed,True,False,True,True,True,True,5
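
The `Total` column is a vote count: each method contributes one vote per feature it selected. The same tally can be sketched with `collections.Counter` on made-up selections; the lists below are hypothetical and are not the notebook's results. Sorting by descending votes and then by name mirrors the `sort_values(['Total', 'Feature'], ascending=[False, True])` call above.

```python
from collections import Counter

# Hypothetical selections from three methods (illustrative only).
selections = {
    "Pearson": ["Reactions", "Finishing", "Agility"],
    "Chi-2": ["Reactions", "Finishing"],
    "RFE": ["Reactions", "Agility", "Crossing"],
}

# One vote per (method, feature) pair.
votes = Counter(feature for feats in selections.values() for feature in feats)

# Highest vote count first; ties broken alphabetically.
ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))

# Optional stricter cut-off: keep features chosen by at least two methods.
min_votes = 2
majority = [name for name, count in ranked if count >= min_votes]
print(ranked, majority)
```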


## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best (maximum votes) features across all methods?

In [35]:
def preprocess_dataset(dataset_path):
    player_df = pd.read_csv(dataset_path)

    numcols = ['Overall', 'Crossing','Finishing','ShortPassing','Dribbling','LongPassing',
               'BallControl','Acceleration','SprintSpeed','Agility','Stamina','Volleys',
               'FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots',
               'Aggression','Interceptions']
    catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

    player_df = player_df[numcols+catcols]

    traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])], axis=1)
    traindf = traindf.dropna()
    features = traindf.columns
    traindf = pd.DataFrame(traindf, columns=features)

    y = traindf['Overall'] >= 87
    X = traindf.copy()
    del X['Overall']

    num_feats = 30
    return X, y, num_feats

In [36]:
def autoFeatureSelector(dataset_path, methods, num_feats=None):
    import pandas as pd
    import numpy as np

    # Use your preprocessing function to get data
    X, y, default_num_feats = preprocess_dataset(dataset_path)

    # If user didn’t specify num_feats, fall back to preprocess default
    if num_feats is None:
        num_feats = default_num_feats

    feature_name = X.columns
    results = {}

    # Run each method requested
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y, num_feats)
        results['Pearson'] = cor_support
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
        results['Chi-2'] = chi_support
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
        results['RFE'] = rfe_support
    if 'log-reg' in methods:
        lr_support, lr_feature = embedded_log_reg_selector(X, y, num_feats)
        results['Logistic'] = lr_support
    if 'rf' in methods:
        rf_support, rf_feature = embedded_rf_selector(X, y, num_feats)
        results['Random Forest'] = rf_support
    if 'lgbm' in methods:
        lgbm_support, lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        results['LightGBM'] = lgbm_support

    # Combine into DataFrame
    feature_selection_df = pd.DataFrame({'Feature': feature_name})
    for method, support in results.items():
        feature_selection_df[method] = support

    # Count how many times a feature was selected
    feature_selection_df['Total'] = np.sum(
        feature_selection_df.drop(columns=['Feature']), axis=1
    )
    feature_selection_df = feature_selection_df.sort_values(
        ['Total', 'Feature'], ascending=[False, True]
    ).reset_index(drop=True)

    # Best features are those selected at least once
    best_features = feature_selection_df[
        feature_selection_df['Total'] > 0
    ]['Feature'].tolist()

    return best_features, feature_selection_df
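
One possible variation on the chain of `if` checks is a dictionary that maps method names to selector functions. A side benefit is that a misspelled method name raises an error instead of being silently skipped. The sketch below uses stub selectors so that it runs on its own; `run_selectors`, `select_first_k`, and `select_last_k` are hypothetical names. In the notebook the registry would map keys such as `'pearson'` and `'rf'` to `cor_selector`, `embedded_rf_selector`, and so on.

```python
def run_selectors(methods, registry, X, y, num_feats):
    """Run each requested selector and return {method name: boolean support mask}."""
    unknown = [m for m in methods if m not in registry]
    if unknown:
        raise ValueError(f"Unknown feature selection methods: {unknown}")
    # Each selector returns (support, feature_names); only the mask is kept here.
    return {m: registry[m](X, y, num_feats)[0] for m in methods}

# Stub selectors standing in for the real ones (illustration only).
def select_first_k(X, y, num_feats):
    n_cols = len(X[0])
    return [i < num_feats for i in range(n_cols)], None

def select_last_k(X, y, num_feats):
    n_cols = len(X[0])
    return [i >= n_cols - num_feats for i in range(n_cols)], None

toy_registry = {"first": select_first_k, "last": select_last_k}
toy_X = [[0, 1, 2, 3], [4, 5, 6, 7]]
toy_y = [0, 1]

toy_results = run_selectors(["first", "last"], toy_registry, toy_X, toy_y, num_feats=2)
print(toy_results)
```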


In [37]:
best_features, summary = autoFeatureSelector(
    dataset_path="fifa19.csv",
    methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'],
    num_feats=30
)

print("Best features selected:", best_features)
summary


[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000775 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1812
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
Best features selected: ['FKAccuracy', 'Finishing', 'LongPassing', 'Reactions', 'Acceleration', 'Agility', 'BallControl', 'Crossing', 'ShortPassing', 'SprintSpeed', 'Aggression', 'Dribbling', 'LongShots', 'Stamina', 'Strength', 'Volleys', 'Balance', 'Body Type_Courtois', 'Body Type_Lean', 'Interceptions', 'Nationality_France', 'Nationality_Slovenia', 'Nationality_Uruguay', 'Position_CM', 'Position_LCB', 'Position_RB', 'ShotPower', 'Weak Foo

Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistic,Random Forest,LightGBM,Total
0,FKAccuracy,True,True,True,True,True,True,6
1,Finishing,True,True,True,True,True,True,6
2,LongPassing,True,True,True,True,True,True,6
3,Reactions,True,True,True,True,True,True,6
4,Acceleration,True,False,True,True,True,True,5
5,Agility,True,False,True,True,True,True,5
6,BallControl,True,True,True,False,True,True,5
7,Crossing,True,False,True,True,True,True,5
8,ShortPassing,True,True,True,False,True,True,5
9,SprintSpeed,True,False,True,True,True,True,5


### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?
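
One way the script's command-line interface might look is sketched below with `argparse`. The flag names `--methods` and `--num-feats` and the default values are assumptions made for illustration; the method keys are the ones accepted by `autoFeatureSelector` above. The example parses a hard-coded argument list so that it runs anywhere; a real script would call `parse_args()` without arguments and pass the results to `autoFeatureSelector`.

```python
import argparse

# Method keys recognised by autoFeatureSelector in the notebook.
METHOD_CHOICES = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def build_parser():
    """Build a command-line parser for a feature selection script (sketch)."""
    parser = argparse.ArgumentParser(
        description="Vote-based automatic feature selection."
    )
    parser.add_argument("dataset_path", help="Path to the CSV file, e.g. fifa19.csv")
    parser.add_argument(
        "--methods",
        nargs="+",
        choices=METHOD_CHOICES,
        default=METHOD_CHOICES,
        help="Feature selection methods to run (default: all).",
    )
    parser.add_argument(
        "--num-feats",
        type=int,
        default=30,
        help="Maximum number of features each method may select.",
    )
    return parser

# Demonstration with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["fifa19.csv", "--methods", "pearson", "rf", "--num-feats", "30"]
)
print(args.dataset_path, args.methods, args.num_feats)
```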