# Task 7: AutoFeatureSelector Tool
## This task is to test your understanding of various Feature Selection methods outlined in the lecture and the ability to apply this knowledge in a real-world dataset to select best features and also to build an automated feature selection tool as your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 players attributes like Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, Heading, Accuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [255]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [256]:
player_df = pd.read_csv("fifa19.csv")

In [257]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [258]:
player_df = player_df[numcols+catcols]

In [259]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [260]:
traindf = pd.DataFrame(traindf,columns=features)

In [261]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [262]:
X.head()

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Uganda,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,0,0,0,0,0,0,0,0,0,0
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,0,0,0,0,0,0,0,0,0,0
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,0,0,0,0,0,0,0,0,0,0
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,0,0,0,0,0,0,0,0,0,0
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,0,0,0,0,0,0,0,0,0,0


In [263]:
len(X.columns)

223

### Set some fixed set of features

In [264]:
feature_name = list(X.columns)
# no of maximum features we need to select
num_feats=30

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [265]:
def cor_selector(X, y,num_feats):
    # Your code goes here (Multiple lines)
    # Calculate Pearson correlation coefficients between features and target variable
    cor_target = []
    for feature in X.columns:
        cor = np.corrcoef(X[feature], y)[0, 1]
        cor_target.append((feature, abs(cor)))

    # Sort features by correlation in descending order
    cor_target.sort(key=lambda x: x[1], reverse=True)

    # Select the top 'num_feats' features based on correlation
    cor_feature = [feature[0] for feature in cor_target[:num_feats]]

    # Create a boolean mask for selected features
    cor_support = [True if feature in cor_feature else False for feature in X.columns]
    # Your code ends here
    return cor_support, cor_feature

In [266]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

30 selected features


### List the selected features from Pearson Correlation

In [267]:
cor_feature

['Reactions',
 'Body Type_C. Ronaldo',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Position_LF',
 'Position_RF',
 'ShortPassing',
 'Volleys',
 'LongPassing',
 'FKAccuracy',
 'BallControl',
 'Finishing',
 'LongShots',
 'ShotPower',
 'Dribbling',
 'Nationality_Belgium',
 'Crossing',
 'Agility',
 'Weak Foot',
 'Stamina',
 'Nationality_Slovenia',
 'Nationality_Gabon',
 'Strength',
 'SprintSpeed',
 'Acceleration',
 'Nationality_Uruguay',
 'Position_LAM',
 'Nationality_Costa Rica']

## Filter Feature Selection - Chi-Sqaure

In [268]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [269]:
def chi_squared_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Initialize the SelectKBest feature selector with the chi-squared test
    chi2_selector = SelectKBest(score_func=chi2, k=num_feats)
    
    # Fit the selector to the data and transform the features
    X_new = chi2_selector.fit_transform(X, y)
    
    # Get the boolean mask of selected features
    chi_support = chi2_selector.get_support()
    
    # Get the names of selected features
    chi_feature = X.columns[chi_support].tolist()
    # Your code ends here
    return chi_support, chi_feature

In [270]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

30 selected features


### List the selected features from Chi-Square 

In [271]:
chi_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LF',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Uruguay']

## Wrapper Feature Selection - Recursive Feature Elimination

In [272]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [273]:
def rfe_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Initialize the RFE feature selector with an estimator
    estimator = LogisticRegression(solver='liblinear', max_iter=1000)  # You can choose an appropriate estimator for your problem
    rfe_selector = RFE(estimator)
    
    # Set the number of features to select
    rfe_selector = rfe_selector.set_params(n_features_to_select=num_feats)
    
    # Fit the selector to the data and transform the features
    X_new = rfe_selector.fit_transform(X, y)
    
    # Get the boolean mask of selected features
    rfe_support = rfe_selector.support_
    
    # Get the names of selected features
    rfe_feature = X.columns[rfe_support].tolist()
    
    # Your code ends here
    return rfe_support, rfe_feature

In [274]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

30 selected features


### List the selected features from RFE

In [275]:
rfe_feature

['Weak Foot',
 'Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CM',
 'Position_LAM',
 'Position_LF',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RF',
 'Position_RW',
 'Body Type_C. Ronaldo',
 'Body Type_Lean',
 'Body Type_Normal',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Chile',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Japan',
 'Nationality_Korea Republic',
 'Nationality_Netherlands',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Sweden',
 'Nationality_Uruguay']

## Embedded Selection - Lasso: SelectFromModel

In [276]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [277]:
def embedded_log_reg_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Step 1: Initialize the logistic regression model
    model = LogisticRegression(solver='liblinear', max_iter=1000)  # Adjust max_iter as needed

    # Step 2: Initialize the feature selection method
    feature_selector = SelectFromModel(estimator=model, max_features=num_feats)

    # Step 3: Fit the feature selector to your data
    feature_selector.fit(X, y)  # Replace X and y with your data

    # Step 4: Get the boolean mask of selected features
    embedded_lr_support = feature_selector.get_support()

    # Step 5: Get the names of selected features
    embedded_lr_feature = X.columns[embedded_lr_support].tolist()

    
    # Your code ends here
    return embedded_lr_support, embedded_lr_feature

In [278]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

30 selected features


In [279]:
embedded_lr_feature

['Preferred Foot_Left',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LF',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RM',
 'Position_RW',
 'Body Type_Lean',
 'Body Type_Normal',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Brazil',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Senegal',
 'Nationality_Slovenia',
 'Nationality_Sweden',
 'Nationality_Ukraine',
 'Nationality_Uruguay',
 'Nationality_Wales']

## Tree based(Random Forest): SelectFromModel

In [280]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [281]:
def embedded_rf_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Step 1: Initialize the Random Forest classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42)

    # Step 2: Initialize the feature selection method
    feature_selector = SelectFromModel(estimator=model, max_features=num_feats)

    # Step 3: Fit the feature selector to your data
    feature_selector.fit(X, y)

    # Step 4: Get the boolean mask of selected features
    embedded_rf_support = feature_selector.get_support()

    # Step 5: Get the names of selected features
    embedded_rf_feature = X.columns[embedded_rf_support].tolist()
    
    # Your code ends here
    return embedded_rf_support, embedded_rf_feature

In [282]:
embedder_rf_support, embedder_rf_feature = embedded_rf_selector(X, y, num_feats=24)
print(str(len(embedder_rf_feature)), 'selected features')

24 selected features


In [283]:
embedder_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Preferred Foot_Right',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Slovenia']

## Tree based(Light GBM): SelectFromModel

In [284]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [285]:
def embedded_lgbm_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Step 1: Initialize the LightGBM classifier
    model = LGBMClassifier(n_estimators=100, random_state=42)

    # Step 2: Initialize the feature selection method
    feature_selector = SelectFromModel(estimator=model, max_features=num_feats)

    # Step 3: Fit the feature selector to your data
    feature_selector.fit(X, y)

    # Step 4: Get the boolean mask of selected features
    embedded_lgbm_support = feature_selector.get_support()

    # Step 5: Get the names of selected features
    embedded_lgbm_feature = X.columns[embedded_lgbm_support].tolist()
    
    # Your code ends here
    return embedded_lgbm_support, embedded_lgbm_feature

In [286]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats=30)
print(str(len(embedded_lgbm_feature)), 'selected features')

[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002359 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1812
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
22 selected features


In [287]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LCB',
 'Body Type_Lean',
 'Nationality_Italy']

## Putting all of it together: AutoFeatureSelector Tool

In [289]:
# Put all selection together
feature_selection_df = pd.DataFrame({'Feature': feature_name, 'Pearson': cor_support, 'Chi-2': chi_support, 'RFE': rfe_support, 'Logistic': embedded_lr_support,
                                    'Random Forest': embedder_rf_support, 'LightGBM': embedded_lgbm_support})

# Count the selected times for each feature
feature_selection_df['Total'] = np.sum(feature_selection_df[['Pearson', 'Chi-2', 'RFE', 'Logistic', 'Random Forest', 'LightGBM']], axis=1)

# Display the top 'num_feats'
feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df) + 1)
feature_selection_df.head(num_feats)

Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistic,Random Forest,LightGBM,Total
1,Nationality_Slovenia,True,True,True,True,True,False,5
2,Volleys,True,True,False,False,True,True,4
3,Strength,True,True,False,False,True,True,4
4,Stamina,True,True,False,False,True,True,4
5,SprintSpeed,True,True,False,False,True,True,4
6,ShotPower,True,True,False,False,True,True,4
7,ShortPassing,True,True,False,False,True,True,4
8,Reactions,True,True,False,False,True,True,4
9,Position_LF,True,True,True,True,False,False,4
10,Nationality_Uruguay,True,True,True,True,False,False,4


## Can you build a Python script that takes dataset and a list of different feature selection methods that you want to try and output the best (maximum votes) features from all methods?

In [315]:
def preprocess_dataset(dataset_path):
    # Load the dataset
    player_df = pd.read_csv(dataset_path)

    numcols = ['Overall', 'Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions']
    catcols = ['Preferred Foot', 'Position', 'Body Type', 'Weak Foot']

    # Exclude features with 'Nationality' in their names
    exclude_features = [col for col in player_df.columns if 'Nationality' in col]
    selected_numcols = [col for col in numcols if col not in exclude_features]

    # Select relevant columns
    player_df = player_df[selected_numcols + catcols]

    # Concatenate numerical features and one-hot encoded categorical features
    traindf = pd.concat([player_df[selected_numcols], pd.get_dummies(player_df[catcols])], axis=1)
    features = traindf.columns

    # Drop rows with missing values
    traindf = traindf.dropna()

    # Create X (features) and y (target)
    y = traindf['Overall'] >= 87
    X = traindf.copy()
    del X['Overall']

    # Number of maximum features to select
    num_feats = 30

    return X, y, num_feats

In [316]:
def autoFeatureSelector(dataset_path, methods=[]):
    # Preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Initialize empty lists to store selected features from each method
    selected_features = []

    # Run feature selection methods and collect selected features
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y, num_feats)
        selected_features.extend(cor_feature)
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
        selected_features.extend(chi_feature)
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
        selected_features.extend(rfe_feature)
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        selected_features.extend(embedded_lr_feature)
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        selected_features.extend(embedded_rf_feature)
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        selected_features.extend(embedded_lgbm_feature)

    # Combine all the selected features into a single list
    combined_features = list(set(selected_features))

    # Count the number of selected features
    num_selected_features = len(combined_features)

    return combined_features, num_selected_features

In [318]:
best_features, num_best_features = autoFeatureSelector(dataset_path="fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
print("Best Features:", best_features)
print("Number of Best Features:", num_best_features)

[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001978 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1664
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 50
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
Best Features: ['SprintSpeed', 'Body Type_Messi', 'Position_LW', 'Position_CAM', 'ShotPower', 'Position_LM', 'Volleys', 'Body Type_Courtois', 'Dribbling', 'Position_CF', 'Position_RW', 'Position_LS', 'Interceptions', 'Position_LF', 'Acceleration', 'Preferred Foot_Right', 'Position_LB', 'Balance', 'Body Type_Lean', 'ShortPassing', 'Finishing', 'Reactions', 'Position_CM', 'Body Type_C. Ronaldo', 'Position_LWB', 'Strength', 'Position_LCB', 'Body Type_PLAYER_BODY_TYPE_25', 'Position_RWB', 'BallControl', 'Pos

### Last, Can you turn this notebook into a python script, run it and submit the python (.py) file that takes dataset and list of methods as inputs and outputs the best features