# Importance of Data Preparation 

- **data** refers to examples or cases from the domain that characterize the problem you want to solve

- predictive modeling projects involve learning from **data**
    - all machine learning algorithms use some **input data** to create **outputs**

- this input data consists of *features*
    - features are usually stored as columns

- predictive model algorithms require features to have specific characteristics to work properly

- according to a survey in Forbes, data scientists spend 60% of their time on data preparation

![60% of Data Scientist's Time](https://miro.medium.com/max/700/0*-dn9U8gMVWjDahQV.jpg)

- on a predictive modeling project, such as *classification* or *regression*, **raw data** typically cannot be used directly

- there are four main reasons why this is the case:
    - **data types**: most machine learning algorithms require numeric input
    - **data requirements**: some machine learning algorithms impose requirements on the data
    - **data errors**: statistical noise and errors in the data may need to be corrected
    - **data complexity**: complex nonlinear relationships may need to be teased out of the data

- the **raw data** must be *pre-processed* prior to being used to fit and evaluate a machine learning model
    - this step in a predictive modeling project is referred to as "**data preparation**"



# Business Goals 

- The FIFA '19 dataset can support several business cases
    - building a new club
    - choosing players for awards
    - analysis for betting on certain players

- Here we will build a Dream Team of 11 players
    - we will base it on the data that remains after cleaning
    - we will also explore some statistics concepts along the way

# Load Python Libraries 

- load `numpy`
- load `pandas`
- load `sklearn`

In [None]:
# load numpy 
import numpy as np

# load pandas
import pandas as pd

# configure pandas display settings
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# load sklearn
import sklearn


# Import and Review FIFA '19 Data Set

### Load Dataset

In [None]:
# read the csv dataset from file, using the first (zeroth) column as the index
# adjust the path to point to your copy of the dataset
dataset = pd.read_csv('fifa19.csv', index_col=0)

### Preliminary Checks

In [None]:
# check dataset head
dataset.head()

In [None]:
# check the number of rows and columns (i.e. number of players and features)
dataset.shape # (rows, columns)


In [None]:
# get an overview of the dataset
dataset.info()

# Dropping Columns


### Meaningless Columns

- datasets usually have columns that aren't meaningful inputs to a prediction model

- in our FIFA dataset, we have a few columns like that:
    - `Photo`, `Flag` and `Club Logo` can be removed as they are simply URLs to images
    - the `ID` column is an arbitrary identifier and won't meaningfully inform a prediction model
    - the `Real Face` column also carries no particular meaning for our purposes

- so let's drop all of those columns


In [None]:
# drop meaningless columns
dataset.drop(columns = ['Photo','Flag','Club Logo','ID', 'Real Face'], inplace=True)

In [None]:
# check head to confirm meaningless columns have been dropped
dataset.head()

### Null Value Majority Columns

- a common rule of thumb is to drop a column once roughly 10% - 15% of its values are missing
    - here, we will drop all columns with more than 10% missing data
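
The cutoff rule above can also be sketched in a vectorized form. This is only an illustration on a made-up toy frame (the names `toy`, `missing_fraction`, `threshold` and `kept` are hypothetical), not the notebook's own cells:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): column "b" is 50% missing, column "c" is 25% missing
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_fraction = toy.isnull().mean()

threshold = 0.10  # the 10% cutoff used in this notebook
# keep only the columns whose missing fraction does not exceed the cutoff
kept = toy.loc[:, missing_fraction <= threshold]
```

Taking the mean of the boolean mask avoids dividing by the row count by hand; the choice of cutoff remains a judgment call.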

In [None]:
# check the number of null values for each column in the dataset 
dataset.isnull().sum()

In [None]:
# percentage of missing values in each column
round(dataset.isnull().sum()/dataset.shape[0] * 100)

In [None]:
# extract the columns names that have more than 10% missing values 
drop_cols = [col_name for col_name in dataset.columns if dataset[col_name].isnull().sum()/dataset.shape[0]*100 > 10.0]

# list the columns to be dropped
drop_cols


In [None]:
# drop the columns which have more than 10% missing values 
dataset.drop(columns = drop_cols, inplace=True)


In [None]:
# recheck the percentage of missing values in each column
round(dataset.isnull().sum()/dataset.shape[0] * 100)

In [None]:
# check dataset after dropping columns
dataset.head()

# Business Goals

- we've defined our business goal as coming up with a dream team of 11 players
- there can be several strategies to pick this dream team
- we will base our selection on some main features
    - we will pick a dream team based on the best normalized average for:
        - Crossing	
        - Finishing	
        - HeadingAccuracy	
        - ShortPassing	
        - Volleys	
        - Dribbling	
        - Curve	
        - FKAccuracy	
        - LongPassing	
        - BallControl	
        - Acceleration	
        - SprintSpeed	
        - Agility	
        - Reactions	
        - Balance	
        - ShotPower	
        - Jumping	
        - Stamina	
        - Strength	
        - LongShots	
        - Aggression	
        - Interceptions	
        - Positioning	
        - Vision	
        - Penalties	
        - Composure	
        - Marking	
        - StandingTackle	
        - SlidingTackle	
        - GKDiving	
        - GKHandling	
        - GKKicking	
        - GKPositioning	
        - GKReflexes
    - we will then find the total price of the dream team 

# Inspect Data 

### Release Clause Column

- this column holds monetary values,
    - but the entries are strings
    - in the format "€xxx.xM"

- this column needs to be converted from `str` to a numeric type
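
As a rough sketch of that conversion, the helper below strips the currency symbol and applies a suffix multiplier. It assumes, beyond the "€xxx.xM" format stated above, that some entries may end in "K" for thousands or carry no suffix at all; `parse_euro_amount` and `SUFFIX_MULTIPLIERS` are hypothetical names used only for illustration:

```python
import math

# Hypothetical helper table: multipliers for the suffixes assumed to appear in the data
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_euro_amount(text):
    """Convert strings such as '€226.5M' or '€565K' to a float number of euros."""
    if not isinstance(text, str):
        return math.nan  # missing values arrive as NaN (a float), not as str
    cleaned = text.strip().lstrip("€")
    if not cleaned:
        return math.nan
    multiplier = SUFFIX_MULTIPLIERS.get(cleaned[-1].upper())
    if multiplier is not None:
        cleaned = cleaned[:-1]  # drop the suffix before converting
    else:
        multiplier = 1  # no recognised suffix: treat the number as plain euros
    try:
        return float(cleaned) * multiplier
    except ValueError:
        return math.nan

# made-up sample entries, including a missing value
examples = ["€226.5M", "€565K", "€0", None]
parsed = [parse_euro_amount(value) for value in examples]
```

Looking the suffix up in a table, rather than always multiplying by one million, keeps thousands-scale entries from being inflated.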

In [None]:
dataset['Release Clause'].head()

In [None]:
dataset['Release Clause'].dtypes

In [None]:
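# the release clause has to be converted to a number first

# define the strip and clean up function
def str_to_int_num(rcn):
    # map the trailing suffix to its multiplier (M = millions, K = thousands)
    multipliers = {'M': 1_000_000, 'K': 1_000}
    try:
        # drop the leading currency symbol and the trailing suffix
        return float(rcn[1:-1]) * multipliers[rcn[-1]]
    except (TypeError, ValueError, KeyError, IndexError):
        # missing or unexpected entries become NaN
        return np.nan

# do the actual clean up of the Release Clause column
dataset['Release Clause'] = dataset['Release Clause'].apply(str_to_int_num)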

In [None]:
dataset['Release Clause'].dtypes

In [None]:
dataset['Release Clause'].isnull().sum()

# Extract Relevant Data

In [None]:
extracted_dataset_cols = ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes','Release Clause']

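# take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
extracted_datset = dataset[extracted_dataset_cols].copy()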


In [None]:
extracted_datset.head()

# Imputation 

- we will use the median imputation method to fill missing values
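
A small illustration of why the median is a reasonable fill value, using made-up numbers (the `scores` series is hypothetical and not taken from the FIFA data):

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry and one large outlier
scores = pd.Series([50.0, 60.0, np.nan, 70.0, 1000.0])

median_value = scores.median()  # NaN is skipped: median of 50, 60, 70, 1000
mean_value = scores.mean()      # pulled upward by the outlier

# replace the missing entry with the median of the observed values
imputed = scores.fillna(median_value)
```

In this toy case the median stays close to the typical values while the mean is dragged toward the outlier, which is one common argument for median imputation on skewed columns.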

In [None]:
# check the number of missing values in each column
extracted_datset.isnull().sum()

In [None]:
# each column's missing values are filled with that column's median
for col in extracted_dataset_cols:
    extracted_datset[col] = extracted_datset[col].fillna(extracted_datset[col].median())

In [None]:
# recheck the number of missing values in each column
extracted_datset.isnull().sum()

# Scaling Data

- for some learning algorithms, the magnitude of an input feature affects its influence
    - features with bigger numbers influence the model more
    - features with smaller numbers influence the model less

- this happens because the inputs have different ranges
    - to avoid this range effect on the prediction algorithm, all inputs are scaled to comparable ranges

- normalization and standardization are two common methods of scaling input data

- we shall use a `MinMaxScaler` to normalize the features
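
To make the two scaling methods concrete, here is a minimal pure-Python sketch of the usual textbook formulas: min-max normalization maps each value to (x - min) / (max - min), and standardization maps it to (x - mean) / std. The function names and the `sprint_speed` values are hypothetical, and this is not how `sklearn` is implemented internally:

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

sprint_speed = [40, 60, 80, 100]  # made-up ratings
normalized = min_max_normalize(sprint_speed)
standardized = standardize(sprint_speed)
```

Normalization bounds every feature to the same 0 to 1 interval, which suits an average across features; standardization leaves values unbounded but centered on zero.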

In [None]:
extracted_datset.shape

- separate out the input features (X) and the target label (y)

In [None]:
# X is the set of input features 
X = extracted_datset.drop(['Release Clause'], axis=1)
X.shape

In [None]:
# y is the label, in this case, the Release Clause column
y = extracted_datset['Release Clause']
y.shape


### Scale only the input features 


In [None]:
# import the MinMax Scaler from sklearn 
from sklearn.preprocessing import MinMaxScaler

# initialize the min-max-scaler 
min_max_scaler = MinMaxScaler()

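# fit the scaler to X (learns the min and max of each feature)
min_max_scaler.fit(X)

# perform the actual scaling on X, keeping the original row index and column names
extracted_datset_scaled = pd.DataFrame(min_max_scaler.transform(X), index=X.index, columns=X.columns)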

# check preliminary stats to verify that scaling was successfully applied
extracted_datset_scaled.describe()

In [None]:
# check the head of the extracted, scaled dataset
extracted_datset_scaled.head()


In [None]:
# column names of the scaled features
scaled_model_names = ['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']

# assign the column names to the scaled DataFrame
extracted_datset_scaled.columns = scaled_model_names

# read the head of the extracted dataset after the column labels have been applied
extracted_datset_scaled.head()

# Assemble Data for Business Goal

In [None]:
# Create a new dataframe from the average of the Feature Set, Names of Players, and Release Clause
three_set = pd.DataFrame()

In [None]:
# capture the names of the players from the original dataset
three_set['Name'] = dataset['Name']

In [None]:
# get the average score for each player
three_set['Average Score'] = extracted_datset_scaled.sum(axis=1)/len(scaled_model_names)

In [None]:
# check the head of the dataframe currently
three_set.head()

In [None]:
# get the release clause for each person 
three_set['Release Clause'] = extracted_datset['Release Clause']

In [None]:
# check the head of the assembled dataframe
three_set.head()

In [None]:
# sort the dataframe by average score in descending order
three_set.sort_values(by=['Average Score'], ascending=False, inplace=True)

In [None]:
# extract the top 11 players with the best average scores 
three_set.head(11)

# Exercise

- compute the total Release Clause for this Dream Team of 11
- compute the average wage that will be paid to the Dream Team


# Processing DateTime

- when working with date-time components in datasets, it is important to sanitize them to a common format
- our FIFA dataset has a couple of date-time columns; let's examine and sanitize them as needed
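
As a small sketch of what sanitizing to a common format can look like, the helper below parses one text format into a `datetime` object and signals anything else. The `'%b %d, %Y'` pattern matches the one used for the `Joined` column in this notebook; `parse_joined` and the sample strings are hypothetical:

```python
from datetime import datetime

def parse_joined(text):
    """Parse a date like 'Jul 1, 2004'; return None if the text does not match."""
    try:
        return datetime.strptime(text, "%b %d, %Y")
    except (TypeError, ValueError):
        # TypeError: the entry is not a string (e.g. a missing value)
        # ValueError: the string is not in the expected format
        return None

joined = parse_joined("Jul 1, 2004")  # illustrative input
bad = parse_joined("2021")            # year-only text does not match the pattern
```

Catching only `TypeError` and `ValueError` keeps unrelated bugs visible, which a bare `except` would hide.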


In [None]:
dataset.head()

### Joined Column

In [None]:
# check the datatype of 'Joined'
dataset['Joined'].dtypes

In [None]:
# check the head before clean up
dataset['Joined'].head()

In [None]:
# check value counts of 'Joined'
# dataset['Joined'].value_counts()

In [None]:
# import the datetime module
import datetime

# define the datetime converter function
def data_str_to_datetime_1(date_str):
    try:
        return datetime.datetime.strptime(date_str, '%b %d, %Y')
    except (TypeError, ValueError):
        # missing or unparseable entries become NaT (the datetime equivalent of NaN)
        return pd.NaT

# do the actual clean up of the Joined column
dataset['Joined'] = dataset['Joined'].apply(data_str_to_datetime_1)

In [None]:
# check the head after clean up
dataset['Joined'].head()

### Contract Valid Until

In [None]:
# check the datatype of 'Contract Valid Until'
dataset['Contract Valid Until'].dtypes

In [None]:
# check unique value counts in 'Contract Valid Until'
dataset['Contract Valid Until'].value_counts()

In [None]:
# define the datetime cleanup function
def data_str_to_datetime_2(date_str):
    try:
        if ',' in date_str:
            # full dates in the '%b %d, %Y' format
            curr_date = datetime.datetime.strptime(date_str, '%b %d, %Y')
        else:
            # year-only entries
            curr_date = datetime.datetime.strptime(date_str, '%Y')
        # keep only the year, as a string
        return curr_date.strftime("%Y")
    except (TypeError, ValueError):
        # missing or unparseable entries become NaN
        return np.nan

# do the actual clean up of the Contract Valid Until column
dataset['Contract Valid Until'] = dataset['Contract Valid Until'].apply(data_str_to_datetime_2)

In [None]:
# check the datatype of 'Contract Valid Until'
dataset['Contract Valid Until'].dtypes

In [None]:
# check unique value counts in 'Contract Valid Until'
dataset['Contract Valid Until'].value_counts()

# Encoding Categorical Data

- categorical data can be converted to numerical data using encoding 
- this makes it possible to create a numerical input even for categorical, string-like features
- there are two popular kinds of encoding categorical data
    - Label Encoding:
        - assigns an integer to each distinct value of the categorical variable
    - One-Hot Encoding:
        - creates a new column for each possible value of the categorical variable
        - uses a binary value in each new column to mark whether the row has that value

- [Relevant Reading - Choosing the right Encoding method-Label vs OneHot Encoder](https://towardsdatascience.com/choosing-the-right-encoding-method-label-vs-onehot-encoder-a4434493149b)
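
The difference between the two encodings can be sketched in plain Python on a made-up sample. The names `positions`, `label_of`, `label_encoded` and `one_hot_encoded` are hypothetical; the notebook itself relies on `sklearn` and `pandas` for this step:

```python
positions = ["ST", "GK", "CB", "ST", "GK"]  # made-up sample of a categorical column

# Label encoding: one integer per distinct category (sorted for a stable mapping)
categories = sorted(set(positions))
label_of = {category: index for index, category in enumerate(categories)}
label_encoded = [label_of[p] for p in positions]

# One-hot encoding: one 0/1 column per category, exactly one 1 per row
one_hot_encoded = [
    [1 if p == category else 0 for category in categories]
    for p in positions
]
```

Label encoding keeps a single column but implies an ordering between categories that may not exist; one-hot encoding avoids that at the cost of one extra column per category.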


In [None]:
dataset.head()

- `Club` and `Position` are both categorical variables
- let's explore Label Encoding for `Club` and One-Hot Encoding for `Position`

### Label Encoding 

- we will apply Label Encoding for `Club`

In [None]:
# check how many missing values exist in the Club column
dataset.Club.isnull().sum()

In [None]:
# fill missing Club values with 'Unknown'
# (assigning back avoids calling fillna with inplace=True on a selected column)
dataset['Club'] = dataset['Club'].fillna('Unknown')

In [None]:
# recheck how many missing values exist in the Club column
dataset.Club.isnull().sum()

In [None]:
# import preprocessing library from sklearn
from sklearn import preprocessing

# initialize a label encoder 
label_encoder = preprocessing.LabelEncoder()

# fit the data to the label encoder 
label_encoder.fit(dataset['Club'])

In [None]:
# transform the column and save it back into the main DataFrame
dataset['Club'] = label_encoder.transform(dataset['Club'])

In [None]:
# check the label encoded Club column in the main DataFrame
dataset.head()

### One Hot Encoding 

- we will apply One Hot encoding for `Position`

In [None]:
# check how many players there are for each value of Position
dataset['Position'].value_counts()

In [None]:
# check number of null values 
dataset['Position'].isnull().sum()

In [None]:
# replace null values with 'Unknown'
# (assigning back avoids calling fillna with inplace=True on a selected column)
dataset['Position'] = dataset['Position'].fillna('Unknown')

In [None]:
# recheck number of null values 
dataset['Position'].isnull().sum()

In [None]:
# number of unique values in Position
len(dataset['Position'].value_counts())

- so one new column per unique value (28 here) will be added as a result of one-hot encoding
- but we will remove the original column after one-hot encoding

In [None]:
# check the current shape of dataframe 
dataset.shape

In [None]:
# create dummy variables 
encoded_columns = pd.get_dummies(dataset.Position)

encoded_columns.shape

In [None]:
# recreate the data set with the encoded columns
dataset = dataset.join(encoded_columns).drop('Position',axis=1)

In [None]:
# check the first five rows of the dataframe
dataset.head()

In [None]:
# check the shape of one-hot encoded data frame
dataset.shape

# Train-Test Split

- one of the most common issues when developing Machine Learning systems is *overfitting*

- to detect it and guard against it, the available data is split into two parts
    - a training part
    - a test/validation part

- the training part is used to fit the actual model
- the testing/validation part is used to provide an unbiased evaluation of the model fit on the training part

- the model never learns from the testing/validation part

- below is a demonstration of doing a train-test split for a dataset
    - this is often done at the very end of the clean up process
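
To show what such a split amounts to, here is a hand-rolled sketch in plain Python. It only illustrates the idea of shuffling and holding out a share of the rows; `simple_train_test_split` is a hypothetical helper and makes no claim to match how `sklearn` selects rows internally:

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then hold out a test_size share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = round(len(rows) * test_size)
    n_train = len(rows) - n_test
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

rows = list(range(10))  # stand-in for 10 player records
train_rows, test_rows = simple_train_test_split(rows, test_size=0.2)
```

Every row lands in exactly one of the two parts, so nothing the model is evaluated on was seen during fitting.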

In [None]:
# import the train_test_split from sklearn
from sklearn.model_selection import train_test_split

# do an 80% train - 20% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# check X_train shape 
X_train.shape



In [None]:
# check X_test shape 
X_test.shape



In [None]:
# check y_train shape
y_train.shape



In [None]:
# check y_test shape
y_test.shape