# Data imputation

This dataset was downloaded from Kaggle: https://www.kaggle.com/karangadiya/fifa19. License: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

In this notebook we preprocess the dataset and impute its missing values from the data that is present.
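As a generic illustration of what imputing from present data means (not necessarily the strategy this notebook ends up using), the sketch below fills missing entries with the median of the observed values in the same column. The toy frame and its `Stamina` column are assumptions made up for the example; they are not taken from the FIFA file.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; this is NOT the FIFA dataset.
toy = pd.DataFrame({
    "Age": [21, 25, 30, 34],
    "Stamina": [70.0, np.nan, 64.0, np.nan],
})

# Median of the observed (non-missing) values in the column.
median_stamina = toy["Stamina"].median()

# Replace only the missing entries of "Stamina"; other columns are untouched.
imputed = toy.fillna({"Stamina": median_stamina})
print(imputed)
```

The median is used here only because it is robust to outliers; a mean, a mode, or a model-based estimate would follow the same pattern.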

## Step 1: Import libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, minmax_scale, scale

import matplotlib.pyplot as plt
import seaborn as sns
import bokeh as bk

## Step 2: Load data

First, we define where the data lives and where the imputed files will be stored:

In [3]:
DATA = "../Data"
INPUT_FILE_NAME = f"{DATA}/FootballPlayerRawDataset.csv"

ATT_FILE_NAME = f"{DATA}/FootballPlayerPreparedCleanAttributes.csv"
IMPUTED_ATT_FILE_NAME = f"{DATA}/ImputedFootballPlayerPreparedCleanAttributes.csv"

ONE_HOT_ENCODED_CLASSES_FILE_NAME = f"{DATA}/FootballPlayerOneHotEncodedClasses.csv"
IMPUTED_ONE_HOT_ENCODED_CLASSES_FILE_NAME = f"{DATA}/ImputedFootballPlayerOneHotEncodedClasses.csv"

Now we load the data and show its info:

In [4]:
dataset = pd.read_csv(INPUT_FILE_NAME, sep=",")

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 89 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                18207 non-null  int64  
 1   ID                        18207 non-null  int64  
 2   Name                      18207 non-null  object 
 3   Age                       18207 non-null  int64  
 4   Photo                     18207 non-null  object 
 5   Nationality               18207 non-null  object 
 6   Flag                      18207 non-null  object 
 7   Overall                   18207 non-null  int64  
 8   Potential                 18207 non-null  int64  
 9   Club                      17966 non-null  object 
 10  Club Logo                 18207 non-null  object 
 11  Value                     18207 non-null  object 
 12  Wage                      18207 non-null  object 
 13  Special                   18207 non-null  int64  
 14  Prefer
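The `info()` output above already hints at missing data: `Club` has 17966 non-null values out of 18207 rows, so 241 entries are missing. A compact per-column summary can make this easier to scan. The helper below is a minimal sketch; `missing_summary` is a hypothetical name, and the toy frame only stands in for `dataset`.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with gaps, most affected first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for `dataset`, with made-up values.
toy = pd.DataFrame({
    "Age": [21, 25, 30, 34],
    "Club": ["A", None, "B", "C"],
    "Wage": [np.nan, np.nan, "€10K", "€5K"],
})
print(missing_summary(toy))
```

In the notebook the same call would be `missing_summary(dataset)`.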

## Step 3: Data cleaning

First we remove unnecessary columns that we think won't affect the overall score of a player:
- Unnamed: 0 and ID
- Name
- Photo
- Nationality and Flag
- Team
- Club and Club Logo
- Preferred Foot
- Work Rate
- Body Type
- Real Face
- Position
- Jersey Number
- Joined
- Loaned From
- Contract Valid Until
- Height
- Weight
- From LS to RB


In [None]:
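# Each .loc label slice includes both endpoints and follows the CSV's column order.
dataset.drop(columns=dataset.loc[:, 'Unnamed: 0':'Name'].columns, inplace=True)  # Unnamed: 0, ID, Name
dataset.drop(columns=dataset.loc[:, 'Photo':'Flag'].columns, inplace=True)       # Photo, Nationality, Flag
dataset.drop(columns=dataset.loc[:, 'Club':'Club Logo'].columns, inplace=True)   # Club, Club Logo
# The column is named 'Preferred Foot' (with a space); 'PreferredFoot' raises a KeyError.
dataset.drop(columns=['Preferred Foot'], inplace=True)
dataset.drop(columns=dataset.loc[:, 'Work Rate':'RB'].columns, inplace=True)     # Work Rate through RB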
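The drops above rely on label slices with `.loc`, which include both endpoints and depend on the order of the columns in the file. The sketch below shows that behaviour on a one-row toy frame; the values are made up, and only the column names mirror the first few columns of the dataset.

```python
import pandas as pd

# One-row toy frame; values are invented for illustration.
toy = pd.DataFrame(
    [[1, "A. Player", 20, "p.png", "X", "x.png", 80]],
    columns=["ID", "Name", "Age", "Photo", "Nationality", "Flag", "Overall"],
)

# A label slice selects every column from 'Photo' to 'Flag', endpoints included.
to_drop = toy.loc[:, "Photo":"Flag"].columns
print(list(to_drop))

# Dropping without inplace=True returns a new frame and leaves `toy` unchanged.
reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```

Because the slice follows column order, reordering the CSV would change which columns are removed; listing the names explicitly is a more defensive alternative.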

