# Data imputation

This dataset was downloaded from Kaggle: https://www.kaggle.com/karangadiya/fifa19. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In this notebook we preprocess the dataset and impute missing values based on the data that is present.

## Step 1: Import libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, minmax_scale, scale

import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk

## Step 2: Load data

First, we define where our data is located and where we will store the imputed files.

In [2]:
DATA = "../Data"
INPUT_FILE_NAME = f"{DATA}/FootballPlayerRawDataset.csv"

ATT_FILE_NAME = f"{DATA}/FootballPlayerPreparedCleanAttributes.csv"
IMPUTED_ATT_FILE_NAME = f"{DATA}/ImputedFootballPlayerPreparedCleanAttributes.csv"

ONE_HOT_ENCODED_CLASSES_FILE_NAME = f"{DATA}/FootballPlayerOneHotEncodedClasses.csv"
IMPUTED_ONE_HOT_ENCODED_CLASSES_FILE_NAME = f"{DATA}/ImputedFootballPlayerOneHotEncodedClasses.csv"

Now we load the data and show its info

In [3]:
dataset = pd.read_csv(INPUT_FILE_NAME, sep=",")

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer
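
Because the `info()` listing is long, a per-column count of missing values can be easier to scan when deciding what needs imputing. The following is a minimal sketch on a toy frame; the `toy` data and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Overall": [94, 91, 88],
    "Club": ["FC Example", np.nan, "Sample United"],
    "Crossing": [84.0, np.nan, np.nan],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

The same two lines could be applied to `dataset` to list only the columns that contain gaps.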

## Step 3: Data cleaning

We remove all goalkeepers, since we will also remove the columns that hold their statistics.

In [5]:
dataset.drop(dataset[dataset.Position=='GK'].index, inplace=True)
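
The same row filter can also be written as a boolean mask, which avoids the in-place drop. This is only a sketch on invented data (the `toy` frame is hypothetical); it also shows that rows with a missing `Position` are kept, because a missing value never compares equal to `'GK'`.

```python
import numpy as np
import pandas as pd

# Toy data; names and positions are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Position": ["GK", "ST", np.nan, "CB"],
})

# Keep every row whose Position is not 'GK'. A missing Position compares
# unequal to 'GK', so that row is kept as well.
outfield = toy[toy["Position"] != "GK"]
print(outfield)
```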

Next, we remove unnecessary columns that we think won't affect the overall score of a player:
- Unnamed: 0 and ID
- Name
- Photo
- Nationality and Flag
- Team
- Club and Club Logo
- Preferred Foot
- Work Rate
- Body Type
- Real Face
- Position
- Jersey Number
- Joined
- Loaned From
- Contract Valid Until
- Height
- Weight
- From LS to RB
- From GKDiving to GKReflexes


In [6]:
dataset.drop(dataset.loc[:, 'Unnamed: 0':'Name'].columns, inplace=True, axis=1)
dataset.drop(dataset.loc[:, 'Photo':'Flag'].columns, inplace=True, axis=1)
dataset.drop(dataset.loc[:, 'Club':'Club Logo'].columns, inplace=True, axis=1)
dataset.drop(columns='Preferred Foot', inplace=True)
dataset.drop(dataset.loc[:, 'Work Rate':'RB'].columns, inplace=True, axis=1)
dataset.drop(dataset.loc[:, 'GKDiving':'GKReflexes'].columns, inplace=True, axis=1)
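
The drops above rely on label-based column slices, which include both endpoints and every column stored between them. A small sketch on a hypothetical frame (the `toy` columns and values are chosen only for illustration) makes that behaviour visible:

```python
import pandas as pd

# One-row toy frame; the column order matters for label slicing.
toy = pd.DataFrame(
    [[1, "A", 30, "a.png", "f.png", 90]],
    columns=["ID", "Name", "Age", "Photo", "Flag", "Overall"],
)

# Label slices with .loc include both the start and the end label.
first = list(toy.loc[:, "ID":"Name"].columns)
second = list(toy.loc[:, "Photo":"Flag"].columns)

# Drop both ranges in one call and return a new frame.
trimmed = toy.drop(columns=first + second)
print(list(trimmed.columns))
```

Since the slice depends on column order, it is worth checking the selected labels before dropping them if the input file layout might change.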



In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16182 entries, 0 to 18206
Data columns (total 39 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       16182 non-null  int64  
 1   Overall                   16182 non-null  int64  
 2   Potential                 16182 non-null  int64  
 3   Value                     16182 non-null  object 
 4   Wage                      16182 non-null  object 
 5   Special                   16182 non-null  int64  
 6   International Reputation  16134 non-null  float64
 7   Weak Foot                 16134 non-null  float64
 8   Skill Moves               16134 non-null  float64
 9   Crossing                  16134 non-null  float64
 10  Finishing                 16134 non-null  float64
 11  HeadingAccuracy           16134 non-null  float64
 12  ShortPassing              16134 non-null  float64
 13  Volleys                   16134 non-null  float64
 14  Dribbl

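Now we need to take care of the columns whose dtype is object.

- Let's start with the **Value** column: we remove the '€', 'K' and 'M' characters and then change its type to float. Note that dropping the suffix without rescaling leaves amounts given in thousands ('K') and in millions ('M') on different scales.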

In [8]:
dataset["Value"] = dataset["Value"].str.replace('€','')
dataset["Value"] = dataset["Value"].str.replace('M','')
dataset["Value"] = dataset["Value"].str.replace('K','')
dataset["Value"] = dataset["Value"].astype(float)
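
If the magnitude of the amounts matters, the suffix can be turned into a multiplier instead of being discarded. The sketch below is one possible approach, not the method used in this notebook; `parse_euro_amount` and `SUFFIX_MULTIPLIER` are hypothetical names, and the sample strings assume values shaped like `€110.5M` or `€565K`.

```python
import numpy as np
import pandas as pd

# Multipliers for the suffixes assumed to appear in the money columns.
SUFFIX_MULTIPLIER = {"K": 1_000.0, "M": 1_000_000.0}

def parse_euro_amount(text):
    """Convert strings such as '€110.5M' or '€565K' to a float in euros."""
    if pd.isna(text):
        return np.nan
    cleaned = str(text).replace("€", "").strip()
    if not cleaned:
        return np.nan
    # Use the trailing suffix as a multiplier; plain numbers get 1.0.
    multiplier = SUFFIX_MULTIPLIER.get(cleaned[-1], 1.0)
    if cleaned[-1] in SUFFIX_MULTIPLIER:
        cleaned = cleaned[:-1]
    return float(cleaned) * multiplier

# Invented sample values, including a missing entry.
sample = pd.Series(["€110.5M", "€565K", "€0", np.nan])
parsed = sample.map(parse_euro_amount)
print(parsed)
```

Applied to a column, this would look like `dataset["Value"].map(parse_euro_amount)`, leaving every amount in plain euros.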

- The next column to take care of is **Wage**: we remove the '€' and 'K' characters and then change its type to float

In [9]:
dataset["Wage"] = dataset["Wage"].str.replace('€','')
dataset["Wage"] = dataset["Wage"].str.replace('K','')
dataset["Wage"] = dataset["Wage"].astype(float)

- The last column that needs processing is **Release Clause**: We will remove the '€' , 'K' and 'M' characters and then change its type to float

In [10]:
dataset["Release Clause"] = dataset["Release Clause"].str.replace('€','')
dataset["Release Clause"] = dataset["Release Clause"].str.replace('M','')
dataset["Release Clause"] = dataset["Release Clause"].str.replace('K','')
dataset["Release Clause"] = dataset["Release Clause"].astype(float)
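
The string replacements and the float cast leave missing entries untouched, so any gaps remain available for imputation later. The sketch below illustrates this on invented values; whether **Release Clause** actually contains missing entries is an assumption, since the truncated output does not show that column.

```python
import numpy as np
import pandas as pd

# Invented sample with one missing entry.
clause = pd.Series(["€226.5M", np.nan, "€700K"])

# Same cleaning steps as in the cell above, applied to the toy Series.
stripped = (
    clause.str.replace("€", "", regex=False)
    .str.replace("M", "", regex=False)
    .str.replace("K", "", regex=False)
)
as_float = stripped.astype(float)

# The missing entry survives the conversion as NaN.
print(as_float)
```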

Finally, let's check the dataset info after the changes.

In [11]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16182 entries, 0 to 18206
Data columns (total 39 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       16182 non-null  int64  
 1   Overall                   16182 non-null  int64  
 2   Potential                 16182 non-null  int64  
 3   Value                     16182 non-null  float64
 4   Wage                      16182 non-null  float64
 5   Special                   16182 non-null  int64  
 6   International Reputation  16134 non-null  float64
 7   Weak Foot                 16134 non-null  float64
 8   Skill Moves               16134 non-null  float64
 9   Crossing                  16134 non-null  float64
 10  Finishing                 16134 non-null  float64
 11  HeadingAccuracy           16134 non-null  float64
 12  ShortPassing              16134 non-null  float64
 13  Volleys                   16134 non-null  float64
 14  Dribbl