# Finding the Next World-Class Soccer Players

#### Joshua Chen

<img src = "background.jpg" style="width: 300px;">
image from https://wallpaperbro.com/liverpool-players-phone

## Introduction

I've followed sports ever since I was little: collecting cards, reading up on the news, and watching as many games as I could, as most kids do. One thing that is constantly discussed in sports is data, even more so in recent years as technology and models have advanced. The prime example is the famous "Moneyball" Oakland A's, but these days almost every sports article mentions statistics. Take the 2017 Super Bowl between the Patriots and the Falcons, where the Patriots came back from a less than 1% chance of winning (according to ESPN's win-probability graph). The models, the statistics, and the sheer unlikelihood of the comeback were talked about for months, especially in the highly data-driven NFL, where every play can be broken down and analyzed thoroughly. The same can be said for basketball, baseball, and tennis. The one sport where this has not held is soccer.

Soccer has been the "problem child" of sports data science, as the game was long considered too complicated and too fluid to analyze. Many managers and coaches relied on instinct and feel for the game, and many still do. Slowly, though, this has been changing. I'm a huge Liverpool fan, and earlier this year I read an article about how Liverpool went from mediocre to completely dominant over the past few years with the help of their analytics department (https://www.nytimes.com/2019/05/22/magazine/soccer-data-liverpool.html). Many of Liverpool's world-class bargain signings came from analyzing data and statistics that most people never see.

This is the inspiration and the idea behind this tutorial. Can a model be created to predict which players will become world-class? In this tutorial, I'll take FIFA's assessments of players over the past 5 years and create a model that tries to predict each player's current level of play. I'll compare my model with FIFA's most recent assessment as well as the player's current in-game form. I hope this tutorial shows fans that data can help assess players, and perhaps gets data-driven people who aren't soccer fans to look into cracking one of the hardest sports to analyze through data.

### Set-up

To start, we will use several libraries to help us retrieve, visualize, and analyze the data. To name a few: Pandas and NumPy will help process the data, Matplotlib and Seaborn will visualize it, and scikit-learn will be used to create and test our model.

In [288]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## Getting the Data

The first step is to retrieve our data and process it so that it can be used for our model. As stated before, this tutorial uses the FIFA rating data, which can be found <a href = "https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset">here</a>.

I decided to use this data because it is a very comprehensive list of players and one of the most easily obtainable. It provides the same metrics across the last few years, as FIFA hasn't changed the metrics it collects on players. Another reason is that soccer data isn't readily available. Most of it is either collected by individuals who want to experiment with it and shows up as graphs on Twitter, or it is professional-grade and very expensive (<a href = "https://www.optasports.com/">Opta</a> being one of the only distributors). Thus, I settled for the best I could get, which is this FIFA data.

### Data Wrangling

The following datasets are stored in the GitHub repository, which can be found here.

In [289]:
fifa15 = pd.read_csv("players_15.csv")
fifa16 = pd.read_csv("players_16.csv")
fifa17 = pd.read_csv("players_17.csv")
fifa18 = pd.read_csv("players_18.csv")
fifa19 = pd.read_csv("players_19.csv")

fifa15.head()

| Unnamed: 0 | sofifa_id | player_url | short_name | long_name | age | dob | height_cm | weight_kg | nationality | club | ... | lwb | ldm | cdm | rdm | rwb | lb | lcb | cb | rcb | rb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 158023 | https://sofifa.com/player/158023/lionel-messi/... | L. Messi | Lionel Andrés Messi Cuccittini | 27 | 1987-06-24 | 169 | 67 | Argentina | FC Barcelona | ... | 62+3 | 62+3 | 62+3 | 62+3 | 62+3 | 54+3 | 45+3 | 45+3 | 45+3 | 54+3 |
| 1 | 20801 | https://sofifa.com/player/20801/c-ronaldo-dos-... | Cristiano Ronaldo | Cristiano Ronaldo dos Santos Aveiro | 29 | 1985-02-05 | 185 | 80 | Portugal | Real Madrid | ... | 63+3 | 63+3 | 63+3 | 63+3 | 63+3 | 57+3 | 52+3 | 52+3 | 52+3 | 57+3 |
| 2 | 9014 | https://sofifa.com/player/9014/arjen-robben/15... | A. Robben | Arjen Robben | 30 | 1984-01-23 | 180 | 80 | Netherlands | FC Bayern München | ... | 64+3 | 64+3 | 64+3 | 64+3 | 64+3 | 55+3 | 46+3 | 46+3 | 46+3 | 55+3 |
| 3 | 41236 | https://sofifa.com/player/41236/zlatan-ibrahim... | Z. Ibrahimović | Zlatan Ibrahimović | 32 | 1981-10-03 | 195 | 95 | Sweden | Paris Saint-Germain | ... | 61+3 | 65+3 | 65+3 | 65+3 | 61+3 | 56+3 | 55+3 | 55+3 | 55+3 | 56+3 |
| 4 | 167495 | https://sofifa.com/player/167495/manuel-neuer/... | M. Neuer | Manuel Neuer | 28 | 1986-03-27 | 193 | 92 | Germany | FC Bayern München | ... |  |  |  |  |  |  |  |  |  |  |


Above is a sample of the 2015 data (all of the yearly datasets share the same layout, so there is not much to worry about there). There is a lot of information, and most of it we don't need. The most important fields are the player's name, age, club, overall rating, and the rating for each skill.

The following code keeps only those columns and then resolves the in-season rating changes. For example, if a player started with an 80 in passing but improved over the course of that year/season, FIFA would update the rating by adding +1, so the data lists the passing rating as 80+1. That format isn't convenient for our analysis, so the code turns 80+1 into simply 81. It also keeps only the first position listed for each player.
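
As a small illustration of that conversion, here is a minimal sketch that applies the same idea to a whole column at once. The `raw` series is a made-up sample rather than real dataset values, and the regular expression assumes ratings always look like a whole number with an optional `+n` or `-n` suffix.

```python
import pandas as pd

# Hypothetical sample of raw ratings as they might appear in one skill column.
raw = pd.Series(["80+1", "75", "68-2", 90])

# Split each entry into a base rating and an optional signed change.
parts = raw.astype(str).str.extract(r"^(\d+)([+-]\d+)?$")
base = parts[0].astype(int)
change = parts[1].fillna("0").astype(int)

# Add the change to the base so "80+1" becomes 81 and "68-2" becomes 66.
clean = base + change
print(clean.tolist())  # [81, 75, 66, 90]
```

Working on the whole column avoids evaluating the text as Python code, which matters when the input comes from a downloaded file.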

In [None]:
# Columns kept from every yearly dataset: identifying fields first, then the skill ratings.
keep_columns = ["short_name","age","club","overall","player_positions","attacking_crossing",
                "attacking_finishing","attacking_heading_accuracy","attacking_short_passing",
                "attacking_volleys","skill_dribbling","skill_curve","skill_fk_accuracy",
                "skill_long_passing","skill_ball_control","movement_acceleration","movement_sprint_speed",
                "movement_agility","movement_reactions","movement_balance","power_shot_power",
                "power_jumping","power_stamina","power_strength","power_long_shots","mentality_aggression",
                "mentality_interceptions","mentality_positioning","mentality_vision","mentality_penalties",
                "mentality_composure","defending_marking","defending_standing_tackle",
                "defending_sliding_tackle","goalkeeping_diving","goalkeeping_handling","goalkeeping_kicking",
                "goalkeeping_positioning","goalkeeping_reflexes"]

# Apply the same column selection to each year.
fifa15 = fifa15.filter(keep_columns)
fifa16 = fifa16.filter(keep_columns)
fifa17 = fifa17.filter(keep_columns)
fifa18 = fifa18.filter(keep_columns)
fifa19 = fifa19.filter(keep_columns)

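import re

def resolve_rating(value):
    """Turn a rating such as "80+1" or "80-2" into a number without using eval."""
    if isinstance(value, str):
        match = re.fullmatch(r"\s*(\d+)\s*([+-])\s*(\d+)\s*", value)
        if match:
            base, sign, change = match.groups()
            return int(base) + int(change) if sign == "+" else int(base) - int(change)
    return value  # already numeric, missing, or not in the "base+change" form

def filtering(df):
    """Tidy one yearly dataset in place."""
    # Keep only the first listed position, e.g. "ST, CF" becomes "ST".
    df["player_positions"] = df["player_positions"].str.split(",").str[0]
    # Resolve ratings such as "80+1" in every skill column that still holds text.
    for column in df.columns[5:]:
        if not pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].map(resolve_rating)

filtering(fifa15)
filtering(fifa16)
filtering(fifa17)
filtering(fifa18)
filtering(fifa19)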

fifa15.head()


The table above shows the final set of columns that we want from each year's dataset. Although there are 39 columns, what we need from each is fairly simple:
- short_name - the name of the player and, in this case, the identifier for each player (the primary key)
- age - the player's age, which will come in handy later when analyzing up-and-coming players
- club - the club the player plays for, which can be used to identify the league and country the player plays in
- overall - the overall rating FIFA gave the player that year, with 99 being the highest and best score
- player_positions - the primary position the player plays. We'll go over these in a bit
- skills - the remaining columns are the player's ratings for different soccer skills, from 0 to 99 with 99 being the best
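
Since every skill rating is expected to fall between 0 and 99, a quick sanity check can catch values that slipped through tidying. The sketch below is only an illustration: `check_ratings` is a hypothetical helper, the `sample` table is made up, and it assumes (as in the tables above) that the rating columns start at the sixth column.

```python
import pandas as pd

def check_ratings(df, first_skill_column=5):
    """Return the names of rating columns holding values outside 0-99."""
    # Non-numeric leftovers become NaN, which never counts as out of range here.
    ratings = df.iloc[:, first_skill_column:].apply(pd.to_numeric, errors="coerce")
    out_of_range = (ratings < 0) | (ratings > 99)
    return [col for col in ratings.columns if out_of_range[col].any()]

# Hypothetical two-player sample with the same leading columns as the tables above.
sample = pd.DataFrame({
    "short_name": ["Player A", "Player B"],
    "age": [20, 31],
    "club": ["Club X", "Club Y"],
    "overall": [78, 84],
    "player_positions": ["ST", "CB"],
    "attacking_finishing": [82, 40],
    "defending_marking": [35, 120],  # 120 is deliberately invalid
})

print(check_ratings(sample))  # ['defending_marking']
```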


In [None]:
fifa20 = pd.read_csv("players_20.csv")

fifa20 = (fifa20.filter(["short_name","age","club","overall","player_positions","attacking_crossing",
                         "attacking_finishing","attacking_heading_accuracy","attacking_short_passing",
                         "attacking_volleys","skill_dribbling","skill_curve","skill_fk_accuracy",
                         "skill_long_passing","skill_ball_control","movement_acceleration","movement_sprint_speed",
                         "movement_agility","movement_reactions","movement_balance","power_shot_power",
                         "power_jumping","power_stamina","power_strength","power_long_shots","mentality_aggression",
                         "mentality_interceptions","mentality_positioning","mentality_vision","mentality_penalties",
                         "mentality_composure","defending_marking","defending_standing_tackle",
                         "defending_sliding_tackle","goalkeeping_diving","goalkeeping_handling","goalkeeping_kicking"
                         ,"goalkeeping_positioning","goalkeeping_reflexes"]))


filtering(fifa20)

fifa20.head()

This cell bundles the previous processing and tidying steps into one. It also produces the final table that we will be comparing against: the FIFA 20 data, which FIFA is still updating week by week.

### Data Processing and Tidying

Although we've already done some processing and tidying, the next step splits the data up so that we can later build a more accurate model with far fewer random variables to account for.

#### Standardizing the Data

With so many skills listed in FIFA, it would seem impossible for any one player to master them all, and that is correct. To arrive at a player's overall rating, FIFA assesses only the skills it deems necessary for the player's position: each position is assigned its own set of skills, and each skill is weighted by importance. The 2015-2018 editions calculated the overall rating differently from 2019 and 2020, so to compare ratings across years on a common standard I will recompute the 2015-2018 overall ratings using the 2019/2020 weights. Many people have experimented with player values to find the exact coefficients, which is explained in more detail <a href = "https://www.fifauteam.com/player-ratings-guide-fifa-19/">here</a>. These are the standards that will be used.

As a few examples:
<table>
<tr>
<td><img src = "coef1.png" style="height: 300px;"/></td>        <td><img src = "coef2.png" style="height: 300px;"/></td>
</tr>
</table>
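
To make the weighting concrete, here is a minimal worked example for a goalkeeper. The weights are the goalkeeper coefficients used in this tutorial's recalculation; the `ratings` values are invented for an imaginary player and are not taken from the dataset.

```python
# Goalkeeper weights as used in this tutorial's recalculation; they sum to 1.
gk_weights = {
    "goalkeeping_diving": 0.24,
    "goalkeeping_handling": 0.22,
    "goalkeeping_positioning": 0.22,
    "goalkeeping_reflexes": 0.22,
    "movement_reactions": 0.06,
    "goalkeeping_kicking": 0.04,
}

# Hypothetical skill ratings for an imaginary goalkeeper.
ratings = {
    "goalkeeping_diving": 85,
    "goalkeeping_handling": 80,
    "goalkeeping_positioning": 82,
    "goalkeeping_reflexes": 88,
    "movement_reactions": 78,
    "goalkeeping_kicking": 70,
}

# The overall rating is the weighted sum of the position's skills.
overall = sum(ratings[skill] * weight for skill, weight in gk_weights.items())
print(round(overall, 2))  # 82.88
```

Because the weights sum to 1, the result stays on the same 0-99 scale as the individual skills.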

In [None]:
# Skill weights used to recompute `overall` for each position.
# Note: as written here, the CDM weights sum to 0.99 and the RW/LW weights sum to 1.04.
POSITION_WEIGHTS = {
    "GK": {"goalkeeping_diving": .24, "goalkeeping_handling": .22, "goalkeeping_positioning": .22,
           "goalkeeping_reflexes": .22, "movement_reactions": .06, "goalkeeping_kicking": .04},
    "CB": {"defending_marking": .15, "defending_standing_tackle": .15, "defending_sliding_tackle": .15,
           "attacking_heading_accuracy": .10, "power_strength": .10, "mentality_aggression": .08,
           "mentality_interceptions": .08, "attacking_short_passing": .05, "movement_reactions": .05,
           "power_jumping": .04, "skill_ball_control": .05},
    "RB": {"defending_marking": .10, "defending_standing_tackle": .12, "defending_sliding_tackle": .13,
           "attacking_heading_accuracy": .07, "power_stamina": .08, "mentality_aggression": .05,
           "attacking_crossing": .07, "mentality_interceptions": .12, "attacking_short_passing": .06,
           "movement_sprint_speed": .05, "movement_reactions": .08, "skill_ball_control": .07},
    "RWB": {"defending_marking": .09, "defending_standing_tackle": .11, "defending_sliding_tackle": .10,
            "power_stamina": .08, "attacking_crossing": .10, "skill_dribbling": .07,
            "movement_agility": .03, "mentality_interceptions": .10, "attacking_short_passing": .10,
            "movement_sprint_speed": .04, "movement_reactions": .08, "skill_ball_control": .10},
    "CM": {"skill_long_passing": .13, "power_stamina": .06, "mentality_vision": .12,
           "power_long_shots": .05, "skill_dribbling": .09, "defending_standing_tackle": .06,
           "mentality_interceptions": .08, "attacking_short_passing": .15, "movement_reactions": .08,
           "mentality_positioning": .08, "skill_ball_control": .10},
    "CDM": {"skill_long_passing": .11, "power_stamina": .06, "defending_marking": .10,
            "power_strength": .06, "defending_standing_tackle": .10, "mentality_interceptions": .12,
            "attacking_short_passing": .13, "mentality_vision": .08, "movement_reactions": .09,
            "mentality_aggression": .05, "skill_ball_control": .09},
    "CAM": {"movement_agility": .04, "movement_acceleration": .04, "mentality_vision": .16,
            "power_long_shots": .06, "skill_dribbling": .11, "attacking_finishing": .05,
            "attacking_short_passing": .16, "power_shot_power": .05, "movement_reactions": .08,
            "mentality_positioning": .12, "skill_ball_control": .13},
    "RM": {"movement_agility": .03, "movement_acceleration": .05, "mentality_vision": .08,
           "skill_long_passing": .08, "skill_dribbling": .14, "power_stamina": .05,
           "attacking_crossing": .14, "attacking_short_passing": .12, "movement_sprint_speed": .05,
           "movement_reactions": .07, "mentality_positioning": .07, "skill_ball_control": .12},
    "RW": {"power_shot_power": .10, "movement_acceleration": .04, "mentality_vision": .05,
           "power_long_shots": .10, "skill_dribbling": .11, "attacking_heading_accuracy": .05,
           "attacking_crossing": .16, "attacking_short_passing": .06, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .11},
    "CF": {"power_shot_power": .10, "movement_acceleration": .04, "mentality_vision": .05,
           "power_long_shots": .10, "skill_dribbling": .11, "attacking_heading_accuracy": .05,
           "attacking_finishing": .12, "attacking_short_passing": .06, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .11},
    "ST": {"power_shot_power": .10, "movement_acceleration": .05, "attacking_volleys": .05,
           "power_long_shots": .05, "skill_dribbling": .08, "attacking_heading_accuracy": .10,
           "attacking_finishing": .20, "power_strength": .03, "movement_sprint_speed": .04,
           "movement_reactions": .10, "mentality_positioning": .12, "skill_ball_control": .08},
}
# Mirrored positions share the weights of their counterpart.
for alias, base in [("LB", "RB"), ("LWB", "RWB"), ("LM", "RM"), ("LW", "RW"), ("RF", "CF"), ("LF", "CF")]:
    POSITION_WEIGHTS[alias] = POSITION_WEIGHTS[base]

def SortPositions(df):
    """Recompute each player's `overall` in place as a weighted sum of position-specific skills."""
    df["overall"] = df["overall"].astype(float)
    for position, weights in POSITION_WEIGHTS.items():
        mask = df["player_positions"] == position
        skills = df.loc[mask, list(weights)].astype(float)
        # skipna=False keeps a missing skill from silently lowering the result.
        df.loc[mask, "overall"] = (skills * pd.Series(weights)).sum(axis=1, skipna=False)

SortPositions(fifa15)
SortPositions(fifa16)
SortPositions(fifa17)
SortPositions(fifa18)

fifa15 = fifa15.sort_values(by="overall",ascending = False)
fifa16 = fifa16.sort_values(by="overall",ascending = False)
fifa17 = fifa17.sort_values(by="overall",ascending = False)
fifa18 = fifa18.sort_values(by="overall",ascending = False)

fifa15.head()

#### Splitting By Roles

In soccer, each of the 11 players on the field has a position. The following image shows where each position sits on the field. As a quick guide, L is left, C is center, and R is right; vertically, GK is goalkeeper, B is back, M is midfielder, W is wing, and F is forward.

<img src = "positions2.jpg" style="height: 300px;">

The second image shows how the positions are grouped: a position ending in B is a defender, one ending in M is a midfielder, and W, F, or ST is an attacker. This is how we will split the players for our analysis, since each group has a different focus. Although each position needs different skills to be successful, players of the same type (defender, midfielder, attacker, or goalkeeper) have enough in common that we won't create a separate list for every individual position.

<img src = "positions.jpg" style="width: 500px;">
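
The grouping rule described above can be sketched as a small function. The name `role_of` is hypothetical, and the sketch assumes position codes are exactly the abbreviations used in this tutorial (GK, CB, LWB, CDM, RW, CF, ST, and so on).

```python
def role_of(position):
    """Map a position code such as "LWB" or "ST" to one of four broad roles."""
    if position == "GK":
        return "goalkeeper"
    if position.endswith("B"):
        return "defender"
    if position.endswith("M"):
        return "midfielder"
    if position.endswith(("W", "F")) or position == "ST":
        return "attacker"
    # Anything else is not a code this tutorial knows about.
    raise ValueError(f"Unknown position code: {position}")

for code in ["GK", "CB", "LWB", "CDM", "RW", "CF", "ST"]:
    print(code, "->", role_of(code))
```

Checking the suffix rather than listing every code keeps the rule close to how it is described in prose, at the cost of rejecting any code that does not follow the naming pattern.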

In [None]:
yearly_data = [fifa15, fifa16, fifa17, fifa18, fifa19]

def aggregation(name):
    """Gather one player's 2015-2019 values: one list per year for every column."""
    history = [year.loc[year["short_name"] == name] for year in yearly_data]
    record = {"short_name": name}
    for column in fifa20.columns:
        if column != "short_name":
            record[column] = [list(year[column]) for year in history]
    return record

# Position codes grouped into the four broad roles described above.
roles = {
    "attackers": {"ST", "CF", "LF", "RF", "LW", "RW"},
    "midfielders": {"CAM", "CM", "CDM", "RM", "LM"},
    "defenders": {"LWB", "RWB", "LB", "RB", "CB"},
    "goalkeepers": {"GK"},
}
records = {role: [] for role in roles}

# Only the first 100 players in the FIFA 20 table are processed for now.
for _, player in fifa20.head(100).iterrows():
    for role, positions in roles.items():
        if player["player_positions"] in positions:
            records[role].append(aggregation(player["short_name"]))

# Build each table once from the collected rows (DataFrame.append was removed in pandas 2.0).
attackers = pd.DataFrame(records["attackers"], columns=fifa20.columns)
midfielders = pd.DataFrame(records["midfielders"], columns=fifa20.columns)
defenders = pd.DataFrame(records["defenders"], columns=fifa20.columns)
goalkeepers = pd.DataFrame(records["goalkeepers"], columns=fifa20.columns)

attackers.head()
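
An alternative way to hold each player's history, sketched here under stated assumptions, is to stack the yearly tables into one long table with a `year` column instead of storing lists inside cells. `stack_years` is a hypothetical helper and the two `demo` tables are made-up stand-ins for the real yearly datasets.

```python
import pandas as pd

def stack_years(frames_by_year):
    """Concatenate yearly tables into one long table with a `year` column."""
    # A dict passed to pd.concat uses its keys as the outer index level.
    stacked = pd.concat(frames_by_year, names=["year"])
    # Move the year out of the index and discard the old row numbers.
    return stacked.reset_index(level="year").reset_index(drop=True)

# Hypothetical miniature stand-ins for two of the yearly datasets.
demo15 = pd.DataFrame({"short_name": ["Player A", "Player B"], "overall": [70, 81]})
demo16 = pd.DataFrame({"short_name": ["Player A", "Player B"], "overall": [74, 80]})

history = stack_years({2015: demo15, 2016: demo16})
print(history[history["short_name"] == "Player A"])
```

With one row per player per year, a player's progression can be selected or plotted with ordinary filtering and grouping rather than by unpacking nested lists.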
        

## Exploratory Data Analysis

### Attackers Analysis

#### Best Attackers

#### Best Attackers Under 21

### Midfielders Analysis

#### Best Midfielders

#### Best Midfielders under 21

### Defenders Analysis

#### Best Defenders

#### Best Defenders under 21

### Goalkeepers Analysis

#### Best Goalkeepers

#### Best Goalkeepers under 21

## Building the Data Model

## Assessing the Validity of the Model

## Conclusion