# Task 7: AutoFeatureSelector Tool
## This task tests your understanding of the feature selection methods covered in the lecture, your ability to apply them to a real-world dataset to select the best features, and your ability to build an automated feature selection tool for your toolkit

### Use your knowledge of different feature selector methods to build an Automatic Feature Selection tool
- Pearson Correlation
- Chi-Square
- RFE
- Embedded
- Tree (Random Forest)
- Tree (Light GBM)

### Dataset: FIFA 19 Player Skills
#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [2]:
url = 'https://raw.githubusercontent.com/BhavyaDivecha/ML_Task/4fb6c5b3ae575443b8592f7dfc102bc9d6cb24b6/Task%206/fifa19.csv'

player_df = pd.read_csv(url)

In [3]:
numcols = ['Overall', 'Crossing','Finishing',  'ShortPassing',  'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility',  'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']
catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']

In [4]:
player_df = player_df[numcols+catcols]

In [5]:
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)
features = traindf.columns

traindf = traindf.dropna()

In [6]:
traindf = pd.DataFrame(traindf,columns=features)

In [7]:
y = traindf['Overall']>=87
X = traindf.copy()
del X['Overall']

In [8]:
X.head()

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Uganda,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,False,False,False,False,False,False,False,False,False,False
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,False,False,False,False,False,False,False,False,False,False
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,False,False,False,False,False,False,False,False,False,False
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,False,False,False,False,False,False,False,False,False,False
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,False,False,False,False,False,False,False,False,False,False


In [9]:
len(X.columns)

223

### Set the maximum number of features to select

In [70]:
feature_name = list(X.columns)
# maximum number of features to select
num_feats=35

## Filter Feature Selection - Pearson Correlation

### Pearson Correlation function

In [71]:

from scipy.stats import pearsonr

def cor_selector(X, y, num_feats):
    # Create a DataFrame combining X and y
    df = pd.concat([X, y], axis=1)

    # Calculate the Pearson correlation coefficients for each feature
    corr_values = df.corr().iloc[:-1, -1]

    # Sort the features based on their absolute correlation with the target variable
    sorted_features = corr_values.abs().sort_values(ascending=False).index

    # Select the top num_feats features
    selected_features = sorted_features[:num_feats]

    # Print the correlation coefficients for the selected features
    print("Correlation coefficients for selected features:")
    for feature in selected_features:
        corr = corr_values[feature]
        print(f"{feature}: {corr:.2f}")

    # Define the correlation support and feature
    cor_support = [feature in selected_features for feature in X.columns]
    cor_feature = selected_features

    # Return the correlation support and feature
    return cor_support, cor_feature


In [72]:
cor_support, cor_feature = cor_selector(X, y,num_feats)
print(str(len(cor_feature)), 'selected features')

Correlation coefficients for selected features:
Reactions: 0.15
Body Type_C. Ronaldo: 0.13
Body Type_Neymar: 0.13
Body Type_Messi: 0.13
Body Type_Courtois: 0.13
Body Type_PLAYER_BODY_TYPE_25: 0.13
Position_LF: 0.07
Position_RF: 0.07
ShortPassing: 0.06
Volleys: 0.06
LongPassing: 0.06
FKAccuracy: 0.06
BallControl: 0.06
Finishing: 0.06
LongShots: 0.05
ShotPower: 0.05
Dribbling: 0.05
Nationality_Belgium: 0.04
Crossing: 0.04
Agility: 0.04
Weak Foot: 0.04
Stamina: 0.04
Nationality_Slovenia: 0.03
Nationality_Gabon: 0.03
Strength: 0.03
SprintSpeed: 0.03
Acceleration: 0.03
Nationality_Uruguay: 0.03
Position_LAM: 0.03
Nationality_Costa Rica: 0.02
Nationality_Egypt: 0.02
Aggression: 0.02
Balance: 0.02
Position_LW: 0.02
Nationality_Croatia: 0.02
35 selected features


### List the selected features from Pearson Correlation

In [13]:
cor_feature

Index(['Reactions', 'Body Type_C. Ronaldo', 'Body Type_Neymar',
       'Body Type_Messi', 'Body Type_Courtois',
       'Body Type_PLAYER_BODY_TYPE_25', 'Position_LF', 'Position_RF',
       'ShortPassing', 'Volleys', 'LongPassing', 'FKAccuracy', 'BallControl',
       'Finishing', 'LongShots', 'ShotPower', 'Dribbling',
       'Nationality_Belgium', 'Crossing', 'Agility', 'Weak Foot', 'Stamina',
       'Nationality_Slovenia', 'Nationality_Gabon', 'Strength', 'SprintSpeed',
       'Acceleration', 'Nationality_Uruguay', 'Position_LAM',
       'Nationality_Costa Rica', 'Nationality_Egypt', 'Aggression', 'Balance',
       'Position_LW', 'Nationality_Croatia'],
      dtype='object')
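The ranking done by `cor_selector` can also be sketched with `DataFrame.corrwith`, which correlates every column with the target directly instead of building the full correlation matrix. The helper name `pearson_top_k` and the toy data are assumptions made for illustration; they are not part of the notebook. Constant columns produce a `NaN` correlation, so the sketch treats them as uncorrelated.

```python
import numpy as np
import pandas as pd

def pearson_top_k(X, y, k):
    """Hypothetical helper: rank columns of X by |Pearson r| with y and keep the top k."""
    # Cast to float so boolean dummies and a boolean target are handled uniformly
    corr = X.astype(float).corrwith(y.astype(float)).fillna(0.0)
    top = corr.abs().nlargest(k).index
    support = X.columns.isin(top)  # boolean mask aligned with X.columns
    return support, list(top)

# Toy data (assumption): one informative column, one noise column, one constant column
rng = np.random.default_rng(0)
toy_y = pd.Series(rng.integers(0, 2, size=200).astype(bool))
toy_X = pd.DataFrame({
    "signal": toy_y.astype(float) + rng.normal(0, 0.1, 200),
    "noise": rng.normal(0, 1, 200),
    "constant": np.ones(200),
})

support, names = pearson_top_k(toy_X, toy_y, 1)
print(names)
```

The returned mask has the same role as `cor_support`, so it could be dropped into the voting table in the same way.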

## Filter Feature Selection - Chi-Square

In [14]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

### Chi-Squared Selector function

In [15]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

def chi_squared_selector(X, y, num_feats):
    # Use SelectKBest to apply the chi-squared test
    chi2_selector = SelectKBest(chi2, k=num_feats)
    chi2_selector.fit(X, y)

    # Get the selected features
    chi_support = chi2_selector.get_support()
    chi_feature = X.columns[chi_support].tolist()

    # Print the selected features
    print("Selected features:")
    print(chi_feature)

    return chi_support, chi_feature


In [16]:
chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
print(str(len(chi_feature)), 'selected features')

Selected features:
['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Position_LAM', 'Position_LF', 'Position_LW', 'Position_RF', 'Body Type_C. Ronaldo', 'Body Type_Courtois', 'Body Type_Messi', 'Body Type_Neymar', 'Body Type_PLAYER_BODY_TYPE_25', 'Nationality_Belgium', 'Nationality_Costa Rica', 'Nationality_Croatia', 'Nationality_Egypt', 'Nationality_Gabon', 'Nationality_Slovenia', 'Nationality_Uruguay']
35 selected features


### List the selected features from Chi-Square 

In [17]:
chi_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LAM',
 'Position_LF',
 'Position_LW',
 'Position_RF',
 'Body Type_C. Ronaldo',
 'Body Type_Courtois',
 'Body Type_Messi',
 'Body Type_Neymar',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_Gabon',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
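scikit-learn's `chi2` scorer rejects negative feature values. The FIFA columns used here are all non-negative, so `chi_squared_selector` works on the raw data, but a dataset with negative values would need rescaling first. The sketch below shows one way to do that with the `MinMaxScaler` imported in this section. The helper name `chi2_top_k` and the toy data are assumptions, and rescaling does change the chi-squared scores.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def chi2_top_k(X, y, k):
    """Hypothetical helper: min-max scale X to [0, 1], then keep the k best chi-squared features."""
    X_scaled = MinMaxScaler().fit_transform(X)
    selector = SelectKBest(chi2, k=k).fit(X_scaled, y)
    support = selector.get_support()
    return support, X.columns[support].tolist()

# Toy data (assumption): the informative column contains negative values
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=300)
toy_X = pd.DataFrame({
    "shifted_signal": toy_y * 2.0 - 5.0 + rng.normal(0, 0.2, 300),
    "noise": rng.normal(0, 1, 300),
})

# Without scaling, chi2 raises ValueError because of the negative values
try:
    SelectKBest(chi2, k=1).fit(toy_X, toy_y)
    raw_ok = True
except ValueError:
    raw_ok = False

chi_mask, chi_names = chi2_top_k(toy_X, toy_y, 1)
print(raw_ok, chi_names)
```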

## Wrapper Feature Selection - Recursive Feature Elimination

In [18]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

### RFE Selector function

In [19]:
def rfe_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Scale the features using MinMaxScaler
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Choose Logistic Regression as the estimator
    estimator = LogisticRegression()

    # Initialize RFE
    rfe = RFE(estimator, n_features_to_select=num_feats)

    # Fit RFE and get the selected features
    rfe.fit(X_scaled, y)
    rfe_support = rfe.support_
    rfe_feature = X.columns[rfe_support].tolist()

    # Print the selected features
    print("Selected features:")
    print(rfe_feature)

    # Your code ends here
    return rfe_support, rfe_feature

In [20]:
rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
print(str(len(rfe_feature)), 'selected features')

Selected features:
['Finishing', 'ShortPassing', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Volleys', 'FKAccuracy', 'Reactions', 'Strength', 'Weak Foot', 'Position_CAM', 'Position_CM', 'Position_GK', 'Position_LCB', 'Position_LF', 'Position_LM', 'Position_LW', 'Position_RB', 'Position_RCB', 'Position_RF', 'Position_RM', 'Position_RW', 'Body Type_Courtois', 'Body Type_PLAYER_BODY_TYPE_25', 'Nationality_Belgium', 'Nationality_Costa Rica', 'Nationality_Croatia', 'Nationality_Egypt', 'Nationality_Gabon', 'Nationality_Netherlands', 'Nationality_Slovakia', 'Nationality_Slovenia', 'Nationality_Uruguay']
35 selected features


### List the selected features from RFE

In [21]:
rfe_feature

['Finishing',
 'ShortPassing',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Strength',
 'Weak Foot',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LF',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RCB',
 'Position_RF',
 'Position_RM',
 'Position_RW',
 'Body Type_Courtois',
 'Body Type_PLAYER_BODY_TYPE_25',
 'Nationality_Belgium',
 'Nationality_Costa Rica',
 'Nationality_Croatia',
 'Nationality_Egypt',
 'Nationality_Gabon',
 'Nationality_Netherlands',
 'Nationality_Slovakia',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
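Besides the boolean `support_` mask used in `rfe_selector`, a fitted `RFE` object exposes `ranking_`, which records the order in which features were eliminated (rank 1 means the feature was kept). The sketch below is a minimal illustration on toy data; the helper name `rfe_ranking`, the raised `max_iter`, and the data are assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

def rfe_ranking(X, y, k, step=1):
    """Hypothetical helper: run RFE and return each feature's elimination rank."""
    X_scaled = MinMaxScaler().fit_transform(X)
    # max_iter is raised only to reduce the chance of convergence warnings
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k, step=step)
    rfe.fit(X_scaled, y)
    return pd.Series(rfe.ranking_, index=X.columns).sort_values()

# Toy data (assumption): one informative column and two noise columns
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=300)
toy_X = pd.DataFrame({
    "signal": toy_y + rng.normal(0, 0.3, 300),
    "noise_a": rng.normal(0, 1, 300),
    "noise_b": rng.normal(0, 1, 300),
})

ranking = rfe_ranking(toy_X, toy_y, k=1)
print(ranking)
```

A larger `step` removes several features per iteration, which trades ranking granularity for speed on wide one-hot encoded data such as this notebook's 223 columns.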

## Embedded Selection - Lasso: SelectFromModel

In [56]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

In [57]:
def embedded_log_reg_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Scale features to [0, 1] using MinMaxScaler
    print("total number of features set by us is :",num_feats)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)

    # Choose Logistic Regression with L1 penalty (Lasso) as the estimator
    estimator = LogisticRegression(penalty='l1', solver='liblinear')

    # Use SelectFromModel to perform feature selection
    embedded_lr_selector = SelectFromModel(estimator, max_features=num_feats)
    embedded_lr_selector.fit(X_scaled, y)

    # Get the selected features
    embedded_lr_support = embedded_lr_selector.get_support()
    embedded_lr_feature = X.columns[embedded_lr_support].tolist()

    # Print the selected features
    print("Selected features:")
    print(embedded_lr_feature)

    return embedded_lr_support, embedded_lr_feature
    

In [58]:
embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
print(str(len(embedded_lr_feature)), 'selected features')

total number of features set by us is : 35
Selected features:
['LongPassing', 'Reactions', 'Balance', 'Aggression', 'Preferred Foot_Right', 'Position_CAM', 'Position_CM', 'Position_GK', 'Position_LCB', 'Position_LM', 'Position_LW', 'Position_RB', 'Position_RCB', 'Position_RW', 'Body Type_Lean', 'Body Type_Stocky', 'Nationality_Belgium', 'Nationality_Brazil', 'Nationality_Croatia', 'Nationality_England', 'Nationality_France', 'Nationality_Germany', 'Nationality_Italy', 'Nationality_Netherlands', 'Nationality_Portugal', 'Nationality_Slovenia', 'Nationality_Uruguay']
27 selected features


In [59]:
embedded_lr_feature

['LongPassing',
 'Reactions',
 'Balance',
 'Aggression',
 'Preferred Foot_Right',
 'Position_CAM',
 'Position_CM',
 'Position_GK',
 'Position_LCB',
 'Position_LM',
 'Position_LW',
 'Position_RB',
 'Position_RCB',
 'Position_RW',
 'Body Type_Lean',
 'Body Type_Stocky',
 'Nationality_Belgium',
 'Nationality_Brazil',
 'Nationality_Croatia',
 'Nationality_England',
 'Nationality_France',
 'Nationality_Germany',
 'Nationality_Italy',
 'Nationality_Netherlands',
 'Nationality_Portugal',
 'Nationality_Slovenia',
 'Nationality_Uruguay']
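The selector above returned 27 features even though `num_feats` is 35. A likely explanation is that `SelectFromModel` applies its `threshold` in addition to `max_features`: for an L1-penalised estimator the default threshold is `1e-5`, so coefficients shrunk to zero are dropped regardless of the cap. The scikit-learn documentation suggests `threshold=-np.inf` when selection should depend on `max_features` alone. The sketch below compares the two settings on toy data; the data and the value of `C` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data (assumption): six columns, only the first is informative
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=400)
toy_X = rng.normal(size=(400, 6))
toy_X[:, 0] += 2.0 * toy_y

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)

# Default threshold: features with near-zero coefficients are dropped, so the
# result can contain fewer than max_features columns
default_sel = SelectFromModel(l1, max_features=4).fit(toy_X, toy_y)

# threshold=-np.inf: only max_features limits the selection
exact_sel = SelectFromModel(l1, max_features=4, threshold=-np.inf).fit(toy_X, toy_y)

n_default = int(default_sel.get_support().sum())
n_exact = int(exact_sel.get_support().sum())
print(n_default, n_exact)
```

Forcing an exact count can pull in features whose coefficient is zero, so whether to do it depends on whether a fixed number of votes per method matters more than sparsity.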

## Tree-based (Random Forest): SelectFromModel

In [26]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

In [27]:
def embedded_rf_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    
    # Choosing RandomForestClassifier as the estimator
    estimator = RandomForestClassifier()

    # Using SelectFromModel to perform feature selection
    embedded_rf_selector = SelectFromModel(estimator, max_features=num_feats)
    embedded_rf_selector.fit(X, y)

    # Get the selected features
    embedded_rf_support = embedded_rf_selector.get_support()
    embedded_rf_feature = X.columns[embedded_rf_support].tolist()

    # Print the selected features
    print("Selected features:")
    print(embedded_rf_feature)
    # Your code ends here
    return embedded_rf_support, embedded_rf_feature

In [52]:
embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
print(str(len(embedded_rf_feature)), 'selected features')


Selected features:
['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Weak Foot', 'Body Type_Courtois', 'Body Type_Normal', 'Nationality_Belgium']
23 selected features


In [55]:
embedded_rf_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Weak Foot',
 'Body Type_Courtois',
 'Body Type_Normal',
 'Nationality_Belgium']
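For estimators without an L1 penalty, such as a random forest, `SelectFromModel` defaults to `threshold="mean"`, keeping only features whose importance is at least the mean importance. That is probably why fewer than `num_feats` features come back here. The sketch below reproduces the mask by hand on toy data; the data, `n_estimators`, and `random_state` are assumptions chosen to make the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data (assumption): five columns, only the first is informative
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=300)
toy_X = rng.normal(size=(300, 5))
toy_X[:, 0] += 2.0 * toy_y

# random_state fixes the forest so repeated runs select the same features
rf = RandomForestClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(rf).fit(toy_X, toy_y)

# Rebuild the mask manually from the fitted estimator's importances
importances = selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()
print(selector.threshold_, manual_mask)
```

Because `RandomForestClassifier()` is unseeded in the notebook, the selected set may vary slightly from run to run; passing a `random_state` as above is one way to make it reproducible.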

## Tree-based (Light GBM): SelectFromModel

In [40]:
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

In [41]:
def embedded_lgbm_selector(X, y, num_feats):
    # Your code goes here (Multiple lines)
    # Choosing LGBMClassifier as the estimator
    estimator = LGBMClassifier()

    # Using SelectFromModel to perform feature selection
    embedded_lgbm_selector = SelectFromModel(estimator, max_features=num_feats)
    embedded_lgbm_selector.fit(X, y)

    # Get the selected features
    embedded_lgbm_support = embedded_lgbm_selector.get_support()
    embedded_lgbm_feature = X.columns[embedded_lgbm_support].tolist()

    # Print the selected features
    print("Selected features:")
    print(embedded_lgbm_feature)

    return embedded_lgbm_support, embedded_lgbm_feature

In [42]:
embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
print(str(len(embedded_lgbm_feature)), 'selected features')

[LightGBM] [Info] Number of positive: 55, number of negative: 18104
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001652 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1812
[LightGBM] [Info] Number of data points in the train set: 18159, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.003029 -> initscore=-5.796555
[LightGBM] [Info] Start training from score -5.796555
Selected features:
['Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys', 'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Position_LCB', 'Body Type_Lean', 'Nationality_Italy']
22 selected features


In [43]:
embedded_lgbm_feature

['Crossing',
 'Finishing',
 'ShortPassing',
 'Dribbling',
 'LongPassing',
 'BallControl',
 'Acceleration',
 'SprintSpeed',
 'Agility',
 'Stamina',
 'Volleys',
 'FKAccuracy',
 'Reactions',
 'Balance',
 'ShotPower',
 'Strength',
 'LongShots',
 'Aggression',
 'Interceptions',
 'Position_LCB',
 'Body Type_Lean',
 'Nationality_Italy']

## Putting all of it together: AutoFeatureSelector Tool

In [65]:
pd.set_option('display.max_rows', None)
# put all selection together
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,
                                    'Random Forest':embedded_rf_support, 'LightGBM':embedded_lgbm_support})
# Count how many methods selected each feature
feature_selection_df['Total'] = np.sum(feature_selection_df.iloc[:, 1:], axis=1)

# Rank features by vote count and keep the top num_feats
top_features_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False).head(num_feats)
top_features_df.index = range(1, len(top_features_df) + 1)

# Display the top features
print(top_features_df)


                          Feature  Pearson  Chi-2    RFE  Logistics  \
1                       Reactions     True   True   True       True   
2                     LongPassing     True   True   True       True   
3                         Volleys     True   True   True      False   
4                        Strength     True   True   True      False   
5                     SprintSpeed     True   True   True      False   
6                    ShortPassing     True   True   True      False   
7             Nationality_Belgium     True   True   True       True   
8                       Finishing     True   True   True      False   
9                      FKAccuracy     True   True   True      False   
10                    BallControl     True   True   True      False   
11                        Balance     True   True  False       True   
12                        Agility     True   True   True      False   
13                     Aggression     True   True  False       True   
14    
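The voting step can be seen in isolation on a handful of made-up masks: each method contributes one boolean column, the row sum is the vote count, and sorting by that count ranks the features. The masks and feature subset below are illustrative assumptions, not results from the notebook.

```python
import pandas as pd

# Made-up support masks for four features (assumption, for illustration only)
toy_features = ["Reactions", "LongPassing", "Balance", "Jumping"]
toy_masks = {
    "Pearson": [True, True, False, False],
    "Chi-2": [True, False, True, False],
    "RFE": [True, True, False, False],
}

votes = pd.DataFrame(toy_masks, index=toy_features)
# Booleans sum as 0/1, so the row sum is the number of methods that chose the feature
votes["Total"] = votes.sum(axis=1)
ranked = votes.sort_values("Total", ascending=False, kind="stable")
print(ranked)
```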

## Can you build a Python script that takes a dataset and a list of feature selection methods to try, and outputs the best (maximum votes) features across all methods?

In [93]:
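def preprocess_dataset(dataset_path):
    # Your code starts here (Multiple lines)
    # Load the CSV; the last column is treated as the target, the rest as features
    df = pd.read_csv(dataset_path)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]
    # Every feature is a candidate, so the cap equals the number of columns
    num_feats = len(X.columns)
    # Your code ends here
    return X, y, num_feats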
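Treating the last CSV column as the target only works for datasets laid out that way. For the FIFA data, the earlier cells instead pick numeric and categorical columns, one-hot encode the categoricals, drop rows with missing values, and threshold `Overall`. A minimal sketch of those same steps as a reusable function follows; the name `preprocess_frame`, its parameters, and the toy frame are assumptions for illustration.

```python
import pandas as pd

def preprocess_frame(df, numcols, catcols, target="Overall", cutoff=87):
    """Hypothetical helper mirroring the notebook's manual preprocessing cells."""
    df = df[numcols + catcols]
    # One-hot encode the categorical columns and drop rows with missing values
    encoded = pd.concat([df[numcols], pd.get_dummies(df[catcols])], axis=1).dropna()
    # Binary target: is the player's overall rating at or above the cutoff?
    y = encoded[target] >= cutoff
    X = encoded.drop(columns=[target])
    return X, y

# Toy frame (assumption) with one missing value
toy = pd.DataFrame({
    "Overall": [90, 70, 88, 60],
    "Crossing": [80.0, None, 75.0, 40.0],
    "Preferred Foot": ["Left", "Right", "Right", "Left"],
})
X_toy, y_toy = preprocess_frame(toy, ["Overall", "Crossing"], ["Preferred Foot"])
print(X_toy.columns.tolist(), y_toy.tolist())
```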

In [66]:
def autoFeatureSelector(dataset_path, methods=None):
    # Parameters
    # dataset_path - path to the dataset to be analyzed (csv file)
    # methods - names of the feature selection methods to run (list)
    methods = methods or []

    # preprocessing
    X, y, num_feats = preprocess_dataset(dataset_path)

    # Run every requested method and collect the boolean support mask it returns
    supports = {}
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y,num_feats)
        supports['Pearson'] = cor_support
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y,num_feats)
        supports['Chi-2'] = chi_support
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y,num_feats)
        supports['RFE'] = rfe_support
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
        supports['Logistics'] = embedded_lr_support
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
        supports['Random Forest'] = embedded_rf_support
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)
        supports['LightGBM'] = embedded_lgbm_support

    # Combine the masks and count how many methods selected each feature
    #### Your Code starts here (Multiple lines)
    # Assumed voting rule: keep features picked by at least one method, most-voted first
    votes = pd.DataFrame(supports, index=X.columns).sum(axis=1)
    best_features = votes[votes > 0].sort_values(ascending=False).index.tolist()
    #### Your Code ends here
    return best_features

In [67]:
best_features = autoFeatureSelector(dataset_path="data/fifa19.csv", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])
best_features

NameError: name 'preprocess_dataset' is not defined

### Last, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?
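One possible starting point for the script's command-line interface is sketched below with `argparse`. The function name `parse_args`, the `--methods` flag, and the script name in the comment are assumptions; only the method names come from `autoFeatureSelector` in this notebook.

```python
import argparse

# Method names accepted by autoFeatureSelector in this notebook
KNOWN_METHODS = ["pearson", "chi-square", "rfe", "log-reg", "rf", "lgbm"]

def parse_args(argv=None):
    """Parse the dataset path and the list of methods (hypothetical CLI layout)."""
    parser = argparse.ArgumentParser(description="Vote-based automatic feature selection")
    parser.add_argument("dataset_path", help="path to the CSV file")
    parser.add_argument("--methods", nargs="+", choices=KNOWN_METHODS,
                        default=KNOWN_METHODS, help="feature selection methods to run")
    return parser.parse_args(argv)

# Equivalent to: python auto_feature_selector.py data/fifa19.csv --methods pearson rf
args = parse_args(["data/fifa19.csv", "--methods", "pearson", "rf"])
print(args.dataset_path, args.methods)
```

In the finished script, the parsed values would be passed on as `autoFeatureSelector(args.dataset_path, args.methods)`, with the selector functions defined in the same file.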