# Predicting Overall Score for FIFA Players using Support Vector Machines
A friend of mine asked me to do this. My thesis is about SVM implementations, so of course I went with a basic SVM implementation using sklearn's SVM estimators (SVC and its regression counterpart, SVR). I did some preprocessing in R to keep only the numeric columns and tidy up the data.

⚠️ Note: This notebook relies on a dataset that is no longer available. Do not re-run unless you provide your own dataset.

In [None]:
# Imports
from sklearn import datasets, metrics
from sklearn.svm import SVC, SVR  # SVR is the estimator fitted below
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [None]:
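# Numeric-only version of the data (preprocessed in R)
df = pd.read_csv("data/path")
# Full dataset, including names and other non-numeric columns
df_full = pd.read_csv("data/path")

# Make sure every column has a numeric dtype and keep the result
# (the original call discarded its return value)
df = df.apply(pd.to_numeric)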
scaler = StandardScaler()
cols = df.columns

X = df.drop(columns = "overall")
y = df["overall"]
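
The notebook imports `StandardScaler` and `make_pipeline`, but the model further down is fitted on unscaled features. SVMs are generally sensitive to feature scale, so a scaled variant may be worth comparing. The following is only a minimal sketch on synthetic data; `X_demo`, `y_demo` and `scaled_svr` are hypothetical names, and the numbers have nothing to do with the FIFA dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the player table (hypothetical data):
# three features on very different scales, like age, height and wage.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 10_000.0])
y_demo = 60 + 5 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# The pipeline standardises the features, then fits the same kind of SVR.
# The scaler is fitted only on the data passed to fit(), so no test-set
# statistics leak in when this is used with a train/test split.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="poly", degree=3, epsilon=0.3))
scaled_svr.fit(X_demo, y_demo)

demo_pred = scaled_svr.predict(X_demo[:5])
print(demo_pred)
```

With the real data, the same pipeline object could be fitted on `X_train, y_train` in place of the bare estimator.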

In [None]:
# Hold out 20% of the players for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y.values, train_size = 0.8, random_state = 1)

# Support vector regression with a cubic polynomial kernel
svm_model = SVR(kernel = "poly", degree = 3, epsilon = 0.3)
# Scaled alternative (not used here):
#svm_model = make_pipeline(StandardScaler(), SVR(kernel = "poly", degree = 3,
#                                                epsilon = 0.2, gamma = "scale"))
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)
print("MinPred: ", min(y_pred), "MaxPred: ", max(y_pred))

print("MSE: ", metrics.mean_squared_error(y_test, y_pred))
print("R^2: ", metrics.r2_score(y_test, y_pred))

MinPred:  47 MaxPred:  90
MSE:  4.126946423858538
R^2:  0.9130695927051868


To be honest, these numbers alone don't tell me how good the individual predictions are. Let's inspect some players to get an idea of how well this model performs.
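
One quick way to make the MSE easier to read is to take its square root, which puts the error back in units of overall points. This small sketch only reuses the MSE value printed above; `mse` and `rmse` are names introduced here for illustration.

```python
import math

mse = 4.126946423858538   # MSE printed by the evaluation cell above
rmse = math.sqrt(mse)     # root mean squared error, in "overall" points

print(f"RMSE: {rmse:.2f} overall points")
# prints "RMSE: 2.03 overall points"
```

Read loosely, the predictions are off by roughly 2 overall points on the held-out players, although RMSE weights large misses more heavily than small ones.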

In [None]:
# Messi has index 0
print(df_full.iloc[0])

sofifa_id                                                158023
player_url    https://sofifa.com/player/158023/lionel-messi/...
short_name                                             L. Messi
long_name                        Lionel Andrés Messi Cuccittini
age                                                          33
                                    ...                        
lb                                                         62+3
lcb                                                        52+3
cb                                                         52+3
rcb                                                        52+3
rb                                                         62+3
Name: 0, Length: 106, dtype: object


Let's make a prediction from Messi's stats and compare it to his real overall.

In [None]:
print("Messi's overall: ", df.iloc[0].overall)
# X already excludes "overall"; the double brackets keep a one-row DataFrame,
# so the model receives the same feature names it was fitted with
messi = X.iloc[[0]]

messi_pred = svm_model.predict(messi)
print("Messi's predicted overall: ", *messi_pred)

Messi's overall:  93
Messi's predicted overall:  93




Our prediction for Messi is spot on. That's great news! What about a mid-overall player and a player at the low end of the tail?

In [None]:
# Let's do the worst player first
print(df_full[['short_name', 'overall']].sort_values(by=['overall'], ascending=True).head(5))

        short_name  overall
18943     Song Yue       47
18928  V. Da Silva       47
18929     B. Hough       47
18930  R. McKinley       47
18931    M. Flores       47


In [None]:
# Let's do Song Yue, index 18943
print(df_full.iloc[18943])

sofifa_id                                               257936
player_url    https://sofifa.com/player/257936/yue-song/210002
short_name                                            Song Yue
long_name                                             Yue Song
age                                                         28
                                    ...                       
lb                                                        47+0
lcb                                                       46+1
cb                                                        46+1
rcb                                                       46+1
rb                                                        47+0
Name: 18943, Length: 106, dtype: object


In [None]:
print("Song Yue overall: ", df.iloc[18943].overall)
# X already excludes "overall"; keep a one-row DataFrame for predict()
yue = X.iloc[[18943]]

yue_pred = svm_model.predict(yue)
print("Song Yue predicted overall: ", *yue_pred)

Song Yue overall:  47
Song Yue predicted overall:  47




The prediction is spot on again. Let's test an average-overall player, and then test only on indexes that are not in the training set.
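
Whether a hand-picked player was part of the training set can be checked through the index that `train_test_split` preserves on the returned DataFrames. The sketch below uses a toy frame with made-up values; `toy`, `X_tr`, `X_te` and the helper `seen_in_training` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the numeric player table (hypothetical values).
toy = pd.DataFrame({"pace": range(10), "overall": range(50, 60)})
X_toy = toy.drop(columns="overall")
y_toy = toy["overall"]

# Same split settings as the notebook: 80% train, fixed random_state.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.8, random_state=1)

def seen_in_training(row_label):
    """Return True if the row with this index label was used to fit the model."""
    return row_label in X_tr.index

# Index labels of the held-out rows; only these give an honest check.
held_out = sorted(X_te.index)
print(held_out, [seen_in_training(i) for i in held_out])
```

Applied to the notebook, `0 in X_train.index` would show whether Messi's row was seen during fitting, in which case a perfect prediction for him says less about generalisation.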

In [None]:
print(df_full['overall'].mean())
avg_players = df_full[df_full['overall'] == 66]
print(avg_players[['short_name', 'overall']].head(5))

65.67778716216216
         short_name  overall
8426    S. Esposito       66
8427     A. Vranckx       66
8428     T. Lamptey       66
8429  Félix Correia       66
8430     C. Olivera       66


In [None]:
# Let's do Esposito, index 8426
print("S. Esposito overall: ", df.iloc[8426].overall)
# X already excludes "overall"; keep a one-row DataFrame for predict()
esposito = X.iloc[[8426]]

esposito_pred = svm_model.predict(esposito)
print("S. Esposito predicted overall: ", *esposito_pred)

S. Esposito overall:  66
S. Esposito predicted overall:  68




In [None]:
# Let's check 3 more players from the test set and call it a day.
print(X_test.head(4))
print(df_full[['short_name', 'overall']].iloc[[18650, 3833, 554]])

       age  height_cm  weight_kg  potential  wage_eur  pace  shooting   
18650   20        182         69         61      2000    68        39  \
3833    20        185         85         82     13000    81        68   
11148   25        191         83         71      2000    43        37   
554     25        180         73         82     42000    73        75   

       passing  dribbling  defending  physic  attacking_crossing   
18650       45         50         45      52                  36  \
3833        65         74         32      73                  69   
11148       44         43         64      73                  32   
554         79         78         72      78                  77   

       attacking_finishing  attacking_heading_accuracy   
18650                   38                          46  \
3833                    68                          63   
11148                   26                          65   
554                     75                          69   

  

In [None]:
print(svm_model.predict(X.iloc[[18650, 3833, 554]]))

[51 72 79]
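
To read predictions like these next to the real values, the names, actual overalls and predictions can be put in one table. This is a minimal sketch with made-up players and numbers; `full_toy`, `test_labels` and `toy_pred` are hypothetical stand-ins for `df_full`, `X_test.index` and the model output.

```python
import pandas as pd

# Toy stand-ins (hypothetical names and values, not rows from the FIFA data).
full_toy = pd.DataFrame(
    {
        "short_name": ["A. Alpha", "B. Beta", "C. Gamma", "D. Delta"],
        "overall": [51, 72, 80, 66],
    }
)
test_labels = [3, 0, 2]        # e.g. list(X_test.index[:3])
toy_pred = [68.0, 51.0, 79.0]  # e.g. svm_model.predict(X_test.iloc[:3])

# .loc selects by index label, which is what X_test.index contains.
comparison = full_toy.loc[test_labels, ["short_name", "overall"]].copy()
comparison["predicted"] = toy_pred
comparison["error"] = comparison["predicted"] - comparison["overall"]
print(comparison)
```

The notebook uses `.iloc` with the labels shown by `X_test.head()`, which gives the same rows only as long as the frames keep the default 0..n-1 index; `.loc` avoids relying on that.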


I'm already convinced that this model works pretty well, but if you're not, take another look at the MSE and $R^2$ scores on the test set:

In [None]:
print("MSE: ", metrics.mean_squared_error(y_test, y_pred))
print("R^2: ", metrics.r2_score(y_test, y_pred))

MSE:  4.126946423858538
R^2:  0.9130695927051868
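
A complementary view to MSE and $R^2$ is the share of test players whose prediction lands within a few points of the real overall. The helper below is a hypothetical sketch (`share_within` is not an sklearn function) and the example arrays are made up; with the notebook's data it would be called as `share_within(y_test, y_pred)`.

```python
import numpy as np

def share_within(y_true, y_hat, tol=2):
    """Fraction of predictions whose absolute error is at most `tol` points."""
    abs_err = np.abs(np.asarray(y_true) - np.asarray(y_hat))
    return float(np.mean(abs_err <= tol))

# Made-up values for illustration only.
y_true_demo = [93, 47, 66, 70, 81]
y_hat_demo = [93, 47, 68, 74, 80]

share = share_within(y_true_demo, y_hat_demo, tol=2)
print(f"Within 2 points: {share:.0%}")
# prints "Within 2 points: 80%"
```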
