## Pre-processing sample for the FIFA 19 data set

#### To use the FeatureSelectorTool, first prepare a CSV file containing the candidate features being considered for training an algorithm.

#### The columns of this CSV file must be the features used for the prediction, and the last column must be the target (Y). All features and the target must be pre-processed beforehand. The minimum requirement is to map all categorical features to numeric values. For better performance, also consider normalizing or scaling the numeric features. You can also customize the code for your specific problem or add more correlation measures.
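
As an illustration of the optional scaling step mentioned above, here is a minimal sketch of min-max scaling with pandas. The helper name `min_max_scale` and the toy values are assumptions made for this example; they are not part of FeatureSelectorTool, and other scalers (for example standardization) may suit your data better.

```python
import pandas as pd


def min_max_scale(df, columns):
    """Return a copy of df with the given numeric columns rescaled to [0, 1]."""
    scaled = df.copy()
    for col in columns:
        col_min = scaled[col].min()
        col_range = scaled[col].max() - col_min
        if col_range == 0:
            # A constant column has zero range; map it to 0.0 instead of dividing by zero.
            scaled[col] = 0.0
        else:
            scaled[col] = (scaled[col] - col_min) / col_range
    return scaled


# Toy data for illustration only (not taken from the FIFA 19 file).
toy = pd.DataFrame({
    "Crossing": [20, 60, 100],
    "Stamina": [50, 50, 50],
    "Target": [0, 0, 1],
})

# Scale only the feature columns; the target column is left untouched.
scaled_toy = min_max_scale(toy, ["Crossing", "Stamina"])
print(scaled_toy)
```

Only the feature columns are passed to the helper, so the target stays as the last, unscaled column, which matches the layout described above.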

#### The following cells pre-process an example data set used to test FeatureSelectorTool.

#### The FIFA 19 data set contains 89 columns describing each player, many of them skill ratings. Let's use these skills to train an algorithm that predicts whether a player is good or bad (good player: Overall score > 87, otherwise bad player). To do this, we want to know which skills are most relevant to the prediction; we find them by measuring the correlation of each feature with the target label.
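
To make the idea of ranking features by their correlation with the target concrete, here is a small sketch using Pearson correlation (which, for a 0/1 target, is the point-biserial correlation). The helper `rank_features_by_correlation` and the toy values are hypothetical; the measures that FeatureSelectorTool itself computes are not shown in this notebook and may differ.

```python
import pandas as pd


def rank_features_by_correlation(df, target_col):
    """Rank features by absolute Pearson correlation with the target column."""
    features = df.drop(columns=[target_col])
    # corrwith computes the correlation of every feature column with the target.
    corr = features.corrwith(df[target_col])
    # Sort by magnitude: a strong negative correlation is as informative as a positive one.
    return corr.abs().sort_values(ascending=False)


# Toy data for illustration only (not taken from the FIFA 19 file).
toy = pd.DataFrame({
    "Reactions": [90, 85, 60, 55, 50],
    "Balance": [70, 60, 65, 72, 61],
    "Overall_Target": [1, 1, 0, 0, 0],
})

ranking = rank_features_by_correlation(toy, "Overall_Target")
print(ranking)
```

In this toy example `Reactions` separates the two classes cleanly, so it ranks above `Balance`.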

### Data Pre-processing

In [1]:
# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats

In [2]:
#read df
player_df = pd.read_csv("fifa_19_raw.csv")

#Check shape of the df
player_df.shape

(18207, 89)

In [3]:
#check the df info
player_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer

In [17]:
# Split the columns of interest into numerical and categorical groups
numcols = ['Overall', 'Crossing', 'Finishing', 'ShortPassing', 'Dribbling', 'LongPassing',
           'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Stamina', 'Volleys',
           'FKAccuracy', 'Reactions', 'Balance', 'ShotPower', 'Strength', 'LongShots',
           'Aggression', 'Interceptions']
catcols = ['Preferred Foot', 'Position', 'Body Type', 'Nationality', 'Weak Foot']

# Keep only the selected columns
player_df = player_df[numcols + catcols]

# One-hot encode the categorical features and build the DataFrame used for training.
# dtype=int keeps the dummy columns as 0/1 on pandas versions that default to booleans.
traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols], dtype=int)], axis=1)

# Drop rows with missing values. In a real application, imputing the missing
# values would be a better approach, because it avoids losing data.
traindf = traindf.dropna()

# Generate the binary target y: 1 if Overall > 87, otherwise 0
traindf['Overall_Target'] = (traindf['Overall'] > 87).astype(int)

# Remove the raw score so it cannot leak into the features; the target stays as the last column
traindf = traindf.drop(columns=['Overall'])

# Finally, save the DataFrame as a CSV file that can later be used with FeatureSelectorTool
traindf.to_csv('fifa_19_FeatureSelector.csv', index=False)
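
The comment in the cell above suggests imputing missing values instead of dropping rows. A minimal sketch of one option, median imputation, follows. The helper `impute_median` and the toy values are assumptions for illustration only; which imputation strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd


def impute_median(df, columns):
    """Return a copy of df with NaNs in the given columns replaced by each column's median."""
    filled = df.copy()
    # Per-column medians; pandas skips NaN values when computing them.
    medians = filled[columns].median()
    # fillna with a Series fills each column with the value under its own label.
    filled[columns] = filled[columns].fillna(medians)
    return filled


# Toy data for illustration only (not taken from the FIFA 19 file).
toy = pd.DataFrame({
    "Crossing": [40.0, np.nan, 60.0, 80.0],
    "Finishing": [10.0, 20.0, np.nan, 30.0],
})

imputed = impute_median(toy, ["Crossing", "Finishing"])
print(imputed)
```

Applied to the notebook's data, a call such as `impute_median(traindf, numcols)` before building the target would stand in for `dropna()` and keep every row.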

In [18]:
traindf

Unnamed: 0,Crossing,Finishing,ShortPassing,Dribbling,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Stamina,...,Nationality_Ukraine,Nationality_United Arab Emirates,Nationality_United States,Nationality_Uruguay,Nationality_Uzbekistan,Nationality_Venezuela,Nationality_Wales,Nationality_Zambia,Nationality_Zimbabwe,Overall_Target
0,84.0,95.0,90.0,97.0,87.0,96.0,91.0,86.0,91.0,72.0,...,0,0,0,0,0,0,0,0,0,1
1,84.0,94.0,81.0,88.0,77.0,94.0,89.0,91.0,87.0,88.0,...,0,0,0,0,0,0,0,0,0,1
2,79.0,87.0,84.0,96.0,78.0,95.0,94.0,90.0,96.0,81.0,...,0,0,0,0,0,0,0,0,0,1
3,17.0,13.0,50.0,18.0,51.0,42.0,57.0,58.0,60.0,43.0,...,0,0,0,0,0,0,0,0,0,1
4,93.0,82.0,92.0,86.0,91.0,91.0,78.0,76.0,79.0,90.0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18202,34.0,38.0,49.0,42.0,45.0,43.0,54.0,57.0,60.0,40.0,...,0,0,0,0,0,0,0,0,0,0
18203,23.0,52.0,43.0,39.0,25.0,40.0,41.0,39.0,38.0,43.0,...,0,0,0,0,0,0,0,0,0,0
18204,25.0,40.0,38.0,45.0,28.0,44.0,70.0,69.0,50.0,55.0,...,0,0,0,0,0,0,0,0,0,0
18205,44.0,50.0,42.0,51.0,32.0,52.0,61.0,60.0,52.0,40.0,...,0,0,0,0,0,0,0,0,0,0
