Preview of the data:

In [1]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
from io import StringIO

In [2]:
df = pd.read_csv('data/data.csv')

In [3]:
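# Keep df as the raw table; later cells still need its full set of columns.
position_counts = df['Position'].value_counts().sort_index()

In [4]:
dft = pd.DataFrame(data=position_counts)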

In [5]:
df_trans = dft.T

In [6]:
df_trans

Unnamed: 0,CAM,CB,CDM,CF,CM,GK,LAM,LB,LCB,LCM,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
Position,958,1778,948,74,1394,2025,21,1322,648,395,...,1291,662,391,248,16,1124,203,370,87,2152


In [7]:
Attacker = df_trans.drop(['GK','LWB', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LDM', 'CDM', 'RDM'], axis=1)
Defender = df_trans.drop(['GK','LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LDM', 'CDM', 'RDM','LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW'], axis=1)
Midfielder = df_trans.drop(['GK','LWB', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB','LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW'], axis=1)
Goalkeeper = df_trans.drop(['LWB', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LDM', 'CDM', 'RDM','LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW'], axis=1)

In [8]:
Attacker['Attacker'] = Attacker.sum(axis=1)
Defender['Defender']= Defender.sum(axis=1)
Midfielder['Midfielder'] = Midfielder.sum(axis=1)
Goalkeeper['Goalkeeper'] = Goalkeeper.sum(axis=1)

In [9]:
Attacker = Attacker.drop(['ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW','LS'], axis=1)
Midfielder = Midfielder.drop(['LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LDM', 'CDM', 'RDM'], axis=1)
Defender = Defender.drop(['RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB','LWB'], axis=1)
Goalkeeper = Goalkeeper.drop(['GK'], axis=1)

In [10]:
Position = pd.concat([Attacker,Midfielder,Defender,Goalkeeper], axis=1)

In [40]:
Position

Unnamed: 0,Attacker,Midfielder,Defender,Goalkeeper
Position,3418,6838,5866,2025
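
The drop-and-sum steps in cells 7 to 10 can also be sketched more compactly as a lookup from position to role. This is a minimal sketch, not the notebook's own method: the role groupings are copied from the drop lists above, while `ROLE_POSITIONS`, `count_roles`, and the toy input are names and values introduced here for illustration.

```python
import pandas as pd

# Role groupings taken from the drop lists in cells 7 to 9.
ROLE_POSITIONS = {
    'Attacker': ['LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW'],
    'Midfielder': ['LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM',
                   'LDM', 'CDM', 'RDM'],
    'Defender': ['LWB', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB'],
    'Goalkeeper': ['GK'],
}

# Invert the grouping into a position -> role lookup.
POSITION_TO_ROLE = {
    position: role
    for role, positions in ROLE_POSITIONS.items()
    for position in positions
}


def count_roles(positions: pd.Series) -> pd.Series:
    """Count players per role; missing or unknown positions are ignored."""
    roles = positions.map(POSITION_TO_ROLE)
    return roles.value_counts().reindex(list(ROLE_POSITIONS), fill_value=0)


# Toy input (hypothetical values, not rows from data.csv).
sample_positions = pd.Series(['ST', 'GK', 'CB', 'CM', 'RW', 'LB', None])
role_counts = count_roles(sample_positions)
print(role_counts)
```

On the real table, the same function would be applied to the `Position` column of the raw data. Unlike `drop`, it does not raise an error when a position code is absent from the data.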


In [44]:
PT = Position.T

In [50]:
plt.style.use('dark_background')

PT.plot(kind='bar', figsize=(10, 5), y='Position')
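
As a self-contained sketch, the bar chart can be reproduced from the four totals shown in the `Position` output, with `figsize` passed as a keyword argument. The `Agg` backend and the axis label are assumptions added here so the snippet runs outside a notebook; they are not part of the original cells.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Totals copied from the `Position` output above.
PT = pd.DataFrame(
    {'Position': [3418, 6838, 5866, 2025]},
    index=['Attacker', 'Midfielder', 'Defender', 'Goalkeeper'],
)

plt.style.use('dark_background')

# figsize is a keyword argument and takes a (width, height) tuple in inches.
ax = PT.plot(kind='bar', y='Position', figsize=(10, 5), legend=False)
ax.set_ylabel('Number of players')  # label chosen for this sketch

# Collect the bar heights to confirm what was drawn.
bar_heights = [patch.get_height() for patch in ax.patches]
```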

# Description of the dataset

This dataset has a total of 89 columns. The first, unnamed column is not relevant, so we can concentrate on the other 88. We can divide them into three categories:

### 1. Basic attributes

These are the basic attributes of the players and the game. They are mostly self-explanatory:
```
	* ID
	* Name
	* Age
	* Photo
	* Nationality
	* Flag
	* Overall
	* Potential
	* Club
	* Club Logo
	* Value
	* Wage
	* Special
	* Preferred Foot
	* International Reputation
	* Weak Foot
	* Skill Moves
	* Work Rate
	* Body Type
	* Real Face
	* Position
	* Jersey Number
	* Joined
	* Loaned From
	* Contract Valid Until
	* Height
	* Weight
	* Release Clause
```

Most of these features are categorical, with a few exceptions such as Height, Weight, Value, and Wage. The attribute `Overall` is the player's rating. **It's an important feature**, as we'll try to predict it later.
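
One way to check which columns pandas treats as numeric or categorical is `select_dtypes`. The sketch below uses a toy CSV with made-up rows (not data from `data.csv`) and reads the unnamed first column as the index. Which columns come out numeric in the real file depends on how the values are stored there, so treat the split as something to verify rather than assume.

```python
from io import StringIO

import pandas as pd

# Toy CSV (hypothetical rows) with an unnamed first column, like the real file.
toy_csv = StringIO(
    ",Name,Age,Nationality,Overall,Preferred Foot\n"
    "0,Player A,31,Country X,94,Left\n"
    "1,Player B,27,Country Y,92,Right\n"
)

# index_col=0 reads the unnamed first column as the index, not as a feature.
players = pd.read_csv(toy_csv, index_col=0)

# Split the columns by the dtype pandas inferred for them.
numeric_cols = players.select_dtypes(include='number').columns.tolist()
categorical_cols = players.select_dtypes(exclude='number').columns.tolist()

# `Overall` is the prediction target, so keep it out of the feature list.
target = 'Overall'
numeric_features = [col for col in numeric_cols if col != target]
```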

In [None]:
DataFrame(df, columns=[
    'Crossing',
    'Finishing',
    'HeadingAccuracy',
    'ShortPassing',
    'Volleys',
    'Dribbling',
    'Curve',
    'FKAccuracy',
    'LongPassing',
    'BallControl',
    'Acceleration',
    'SprintSpeed',
    'Agility',
    'Reactions',
    'Balance',
    'ShotPower',
    'Jumping',
    'Stamina',
    'Strength',
    'LongShots',
    'Aggression',
    'Interceptions',
    'Positioning',
    'Vision',
    'Penalties',
    'Composure',
    'Marking',
    'StandingTackle',
    'SlidingTackle',
    'GKDiving',
    'GKHandling',
    'GKKicking',
    'GKPositioning',
    'GKReflexes',
])


### 2. Skills and player attributes

These features rate each player on a specific skill. They are generally numerical, ranging from 0 (worst) to 100 (best).

```
	* Crossing
	* Finishing
	* HeadingAccuracy
	* ShortPassing
	* Volleys
	* Dribbling
	* Curve
	* FKAccuracy
	* LongPassing
	* BallControl
	* Acceleration
	* SprintSpeed
	* Agility
	* Reactions
	* Balance
	* ShotPower
	* Jumping
	* Stamina
	* Strength
	* LongShots
	* Aggression
	* Interceptions
	* Positioning
	* Vision
	* Penalties
	* Composure
	* Marking
	* StandingTackle
	* SlidingTackle
	* GKDiving
	* GKHandling
	* GKKicking
	* GKPositioning
	* GKReflexes
```
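
The stated 0 to 100 range is easy to verify before modelling. The following is a minimal sketch on a toy frame with made-up ratings; `out_of_range_mask` is a helper introduced here for illustration, and only three of the skill columns are used for brevity.

```python
import pandas as pd

# Subset of the skill columns listed above, for brevity.
SKILL_COLUMNS = ['Crossing', 'Finishing', 'GKReflexes']

# Toy ratings (hypothetical); the last row has an invalid value and a gap.
toy_skills = pd.DataFrame({
    'Crossing': [84, 12, 105],
    'Finishing': [95, 10, 60],
    'GKReflexes': [8, 88, None],
})


def out_of_range_mask(frame, columns, low=0, high=100):
    """Flag rows where any rating falls outside [low, high].

    Missing values compare as False, so they are not flagged here.
    """
    values = frame[columns]
    outside = (values < low) | (values > high)
    return outside.any(axis=1)


bad_rows = out_of_range_mask(toy_skills, SKILL_COLUMNS)
print(toy_skills[bad_rows])
```

Missing ratings are left to a separate check (for example `isna()`), since a gap is a different problem from an out-of-range value.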

### 3. Performance of the player in different positions

Players are generally identified with one playing role:
* Goalkeepers
* Defenders
* Midfielders
* Attackers

Some players have mixed roles, such as "attacking midfielder" or "defensive midfielder". A player's performance varies with the position they play. For example, Messi is rated 94 overall, but only when playing as an attacker. If you play Messi as a left back (defender), his rating drops to 64:

![positions](img/sofifa-2.png)

Those values are captured in the columns:

```
	* LS
	* ST
	* RS
	* LW
	* LF
	* CF
	* RF
	* RW
	* LAM
	* CAM
	* RAM
	* LM
	* LCM
	* CM
	* RCM
	* RM
	* LWB
	* LDM
	* CDM
	* RDM
	* RWB
	* LB
	* LCB
	* CB
	* RCB
	* RB
```
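
Given these per-position ratings, a player's strongest position and the gap between best and worst position can be read off row by row. The sketch below assumes the position columns hold plain integers, which may not match how they are stored in the file, and uses two hypothetical players with made-up ratings.

```python
import pandas as pd

# Subset of the position columns listed above, for brevity.
POSITION_COLUMNS = ['ST', 'CM', 'LB']

# Hypothetical ratings: one row per player, one column per position.
ratings = pd.DataFrame(
    {'ST': [90, 60], 'CM': [84, 70], 'LB': [64, 82]},
    index=['Player A', 'Player B'],
)

position_ratings = ratings[POSITION_COLUMNS]

# Column label of the highest rating in each row.
best_position = position_ratings.idxmax(axis=1)

# How much a player loses when moved from the best to the worst position.
rating_spread = position_ratings.max(axis=1) - position_ratings.min(axis=1)
```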