In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [33]:
plt.style.use("seaborn-notebook")

# Football Players Overall Predictor For FIFA 18
## Author: Kostadin Kostadinov

![alternative text](logo.jpg)

## Abstract
This project predicts the overall rating of football players in FIFA 18. Our data contains almost 18,000 players (17,981 rows) with their characteristics, weak sides, and strong sides. Based on this data, the project will predict each player's overall rating (it can be from 1 to 100).

__We will:__
- Apply some preprocessing steps to prepare the data.
- Perform a descriptive analysis of the data to better understand its main characteristics.
- Train different machine learning models using scikit-learn.
- Iterate on and evaluate the trained models using unseen data, comparing them until we find a model that meets our expectations.
- Use the chosen model to make predictions and to create a simple web application that consumes it.

### Loading and cleaning our data

We will load the dataset with the players and apply some cleaning and transformation.


In [3]:
col_types = {'Overall': np.int32, 'Age': np.int32}

players_data = pd.read_csv("data/football_players_data.csv", low_memory=False, dtype=col_types)
players_data = players_data.drop("Unnamed: 0", axis=1)
print(players_data.shape)
players_data.head()

(17981, 74)


Unnamed: 0,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,Value,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,Cristiano Ronaldo,32,https://cdn.sofifa.org/48/18/players/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Real Madrid CF,https://cdn.sofifa.org/24/18/teams/243.png,€95.5M,...,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,92.0
1,L. Messi,30,https://cdn.sofifa.org/48/18/players/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,93,93,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,€105M,...,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,88.0
2,Neymar,25,https://cdn.sofifa.org/48/18/players/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,94,Paris Saint-Germain,https://cdn.sofifa.org/24/18/teams/73.png,€123M,...,59.0,46.0,79.0,59.0,88.0,87.0,84.0,89.0,64.0,84.0
3,L. Suárez,30,https://cdn.sofifa.org/48/18/players/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,92,92,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,€97M,...,64.0,58.0,80.0,65.0,88.0,85.0,88.0,87.0,68.0,88.0
4,M. Neuer,31,https://cdn.sofifa.org/48/18/players/167495.png,Germany,https://cdn.sofifa.org/flags/21.png,92,92,FC Bayern Munich,https://cdn.sofifa.org/24/18/teams/21.png,€61M,...,,,,,,,,,,


This dataset has a lot of columns.

In [7]:
players_data.columns

Index(['Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall', 'Potential',
       'Club', 'Club Logo', 'Value', 'Wage', 'Special', 'Acceleration',
       'Aggression', 'Agility', 'Balance', 'Ball control', 'Composure',
       'Crossing', 'Curve', 'Dribbling', 'Finishing', 'Free kick accuracy',
       'GK diving', 'GK handling', 'GK kicking', 'GK positioning',
       'GK reflexes', 'Heading accuracy', 'Interceptions', 'Jumping',
       'Long passing', 'Long shots', 'Marking', 'Penalties', 'Positioning',
       'Reactions', 'Short passing', 'Shot power', 'Sliding tackle',
       'Sprint speed', 'Stamina', 'Standing tackle', 'Strength', 'Vision',
       'Volleys', 'CAM', 'CB', 'CDM', 'CF', 'CM', 'ID', 'LAM', 'LB', 'LCB',
       'LCM', 'LDM', 'LF', 'LM', 'LS', 'LW', 'LWB', 'Preferred Positions',
       'RAM', 'RB', 'RCB', 'RCM', 'RDM', 'RF', 'RM', 'RS', 'RW', 'RWB', 'ST'],
      dtype='object')

The Value column contains characters as well as digits. We will remove the euro sign and convert the M (millions) and K (thousands) suffixes into plain numbers.

In [4]:
# Strip the euro sign so only the number and its suffix remain
players_data['Value'] = players_data['Value'].str.replace('€', '', regex=False)

def parseValue(strVal):
    """
    Parse a string with an M (millions) or K (thousands) suffix into an integer
    """
    if 'M' in strVal:
        return int(float(strVal.replace('M', '')) * 1000000)
    elif 'K' in strVal:
        return int(float(strVal.replace('K', '')) * 1000)
    else:
        return int(strVal)

players_data['Value'] = players_data['Value'].apply(parseValue)
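
The parsing rule above can be checked in isolation on a few strings. The following is a minimal stand-alone sketch, not the notebook's own function: the name `parse_money` and the sample strings `€750K` and `€0` are assumptions made for illustration. It uses `round()` instead of `int()`, since `int()` truncates and a floating-point product that lands slightly below the intended value would lose one euro.

```python
def parse_money(text):
    """Convert strings such as '€95.5M' or '€750K' to a whole number of euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    cleaned = text.replace("€", "").strip()
    suffix = cleaned[-1:]  # last character, or '' for an empty string
    if suffix in multipliers:
        # round() guards against products that fall a hair below the target
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


# Hand-written samples; the first two appear in the table above.
samples = ["€95.5M", "€105M", "€750K", "€0"]
parsed = [parse_money(s) for s in samples]
print(parsed)
```

In the notebook, a function like this could be passed to `players_data['Value'].apply(...)` in place of the two-step replace and parse.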

In [34]:
players_data.head()

Unnamed: 0,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,Value,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,Cristiano Ronaldo,32,https://cdn.sofifa.org/48/18/players/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Real Madrid CF,https://cdn.sofifa.org/24/18/teams/243.png,95500000,...,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,92.0
1,L. Messi,30,https://cdn.sofifa.org/48/18/players/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,93,93,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,105000000,...,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,88.0
2,Neymar,25,https://cdn.sofifa.org/48/18/players/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,94,Paris Saint-Germain,https://cdn.sofifa.org/24/18/teams/73.png,123000000,...,59.0,46.0,79.0,59.0,88.0,87.0,84.0,89.0,64.0,84.0
3,L. Suárez,30,https://cdn.sofifa.org/48/18/players/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,92,92,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,97000000,...,64.0,58.0,80.0,65.0,88.0,85.0,88.0,87.0,68.0,88.0
4,M. Neuer,31,https://cdn.sofifa.org/48/18/players/167495.png,Germany,https://cdn.sofifa.org/flags/21.png,92,92,FC Bayern Munich,https://cdn.sofifa.org/24/18/teams/21.png,61000000,...,,,,,,,,,,


That looks better. Now let's check for invalid values. For our purposes, an invalid value is one smaller than or equal to 0. If we don't remove such rows they can distort the models we train, so we keep only the players with a positive value.

In [5]:
players_data = players_data.loc[players_data.Value > 0]
players_data.shape

(17725, 74)

We have removed over 200 entries that are invalid for our purposes (17,981 - 17,725 = 256 rows). Now let's do the same for the players' Overall rating.

In [6]:
players_data = players_data.loc[players_data.Overall > 0]
players_data.shape

(17725, 74)

The row count is unchanged, so every Overall value is valid. Next, let's remove all non-numeric entries from the skills columns.

In [18]:
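def remove_non_numeric_entries(dataset, column):
    """
    Keep only the rows whose value in the column is an integer between 1 and 99
    """
    def between_1_and_99(s):
        try:
            n = int(s)
            return 1 <= n <= 99
        except (ValueError, TypeError):
            # Strings that are not plain integers, NaN, and None all end up here
            return False

    # Filter on the dataset passed in (not the global players_data) and copy
    # the result so the assignment below does not write to a view
    dataset = dataset.loc[dataset[column].apply(between_1_and_99)].copy()

    dataset[column] = dataset[column].astype('int')

    return dataset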

In [28]:
skills = ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Composure',
          'Crossing', 'Curve', 'Dribbling', 'Finishing', 'Ball control',
          'Free kick accuracy', 'GK diving', 'GK handling', 'GK kicking',
          'GK positioning', 'GK reflexes', 'Heading accuracy', 'Interceptions',
          'Jumping', 'Long passing', 'Long shots', 'Marking', 'Penalties',
          'Positioning', 'Reactions', 'Short passing', 'Shot power',
          'Sliding tackle', 'Sprint speed', 'Stamina', 'Standing tackle',
          'Strength', 'Vision', 'Volleys']

for skill in skills:
    players_data = remove_non_numeric_entries(players_data, skill)

Now all skill columns have an integer type. This cleaning is enough for now, so we can move on to the data analysis.
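
To see what this kind of filter does, it helps to run it on a tiny frame. The sketch below is a vectorized alternative built on `pd.to_numeric`, not the function used in this notebook. The frame `toy` and its string entries are hypothetical examples of bad data, not values taken from the dataset.

```python
import pandas as pd

# Toy frame; the string entries are hypothetical examples of bad data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Finishing": ["85", "70+2", "n/a", "0", "99"],
})

# Unparseable strings become NaN instead of raising an exception.
numeric = pd.to_numeric(toy["Finishing"], errors="coerce")

# Keep whole numbers in the inclusive range 1..99; NaN fails both tests.
valid = numeric.between(1, 99) & (numeric % 1 == 0)

cleaned = toy.loc[valid].copy()  # copy() so the assignment hits a real frame
cleaned["Finishing"] = numeric[valid].astype("int64")

print(cleaned)
```

Only the rows with plain integers inside the range survive, and the column ends up with an integer dtype. The whole-number test is there because `pd.to_numeric` would otherwise accept a string such as `"70.5"`, which `int()` rejects.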

### Data analysis (Descriptive analysis)

## Conclusion

## Resources

Dataset: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset#CompleteDataset.csv