# Project Overview

As my friend and I prepared for our Fantacalcio 21/22 season, I wondered whether I could do some further analysis using the Python libraries I am familiar with.

If you are not familiar, Fantacalcio is the Italian Serie A draft fantasy football competition. Our league usually consists of 8 to 10 people, but calcio lovers all over Italy have their own private leagues and it is a very popular game. Essentially it is a head-to-head draft league, but you bid for the players you want in an auction with all the other competitors present. You each have 1000 credits to fill a 23-man squad, of which 11 players can play at a time, so you want to fill these positions as well as you can to maximize your points. After the auction there is a match for each gameweek of the season, where you face one of your rivals head to head. Whoever takes the most points from their designated 11 players that week (players earn points for scoring, assisting, clean sheets, etc.) wins 3 points and potentially moves up the table. Whoever has the most points at the end of the season is the winner.

It is fairly simple (perhaps my explanation is not the best, so do Google it if necessary) but, as you can imagine, there are a lot of moving parts, so there is an opportunity for analysis. Of course, you want to use your budget as wisely as possible in order to maximize your potential points. Initially I had the idea to use ML techniques to perform a regression predicting player prices, so that we could pick up undervalued players and avoid overpaying for others.

We will start here, and I will comment as we go along.

# Part 1: ML Regression

Below are the imports required to perform the analysis. The import of scikit-learn's mean_poisson_deviance metric is currently commented out, as I ran into errors when I attempted to use it.

In [1]:
import csv
import pandas as pd
import numpy as np
from patsy import dmatrices
from pulp import *
import matplotlib.pyplot as plt


from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
#from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import cross_val_score

import seaborn as sns
from sklearn.cluster import KMeans

import statsmodels.api as sm

# Data Wrangling


The data file we will use is "calcio.csv", which you can find in the code repository. It gives various information on the players we are analyzing: most importantly their name, which will be the unique identifier, their position, and their Fantacalcio scores for the last 2 years. We are looking at 149 players that we have sufficient data for, and we omit goalkeepers for now, as their scoring system is completely different from that of outfield players and does not fit the regression techniques in use.

Below are a few lines used to get the dataset "calcio.csv" into a format we can begin to work from. Some work has already been done on the raw csv file in Excel in order to prepare it without using any code. Namely, I removed some rows that had insufficient information and might affect the regression negatively. Also, a weighted average was taken using 1/3 of the 19/20 score and 2/3 of the 20/21 score. We welcome this recency bias, as it places more emphasis on the more recent form of the players.
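
The weighted average described above was computed in Excel, so the snippet below is only a sketch of the same calculation in pandas. The two players and their scores are made up for illustration; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Made-up scores for two hypothetical players (not taken from calcio.csv)
scores = pd.DataFrame({
    "Nome": ["PLAYER_A", "PLAYER_B"],
    "FV 19/20": [9.0, 6.0],
    "FV 20/21": [6.0, 7.5],
})

# 1/3 weight on the older season, 2/3 on the more recent one
W_OLD, W_NEW = 1 / 3, 2 / 3
scores["2 YR AVG"] = W_OLD * scores["FV 19/20"] + W_NEW * scores["FV 20/21"]

print(scores)
```

Both hypothetical players end up with the same average here even though their seasons were very different, which shows how the 2/3 weight pulls the figure towards the more recent score.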



We choose to scale the bid data, as bids can differ by more than an order of magnitude. We define the scaler using scikit-learn's MinMaxScaler class and create a new column holding the scaled bid values.

Finally, we split the data set into Defenders (D), Midfielders (C) and Attackers (A). We will analyze these data sets separately, as the bidding tends to be different for each category due to various game factors.
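
As a quick illustration of what the scaling does, here is a small sketch on made-up bids (the numbers and the name scaler_demo are assumptions, not values from the data set). With feature_range=(0, 1), MinMaxScaler maps each value x to (x - min) / (max - min), and inverse_transform undoes the mapping.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up bids in credits, shaped as a single feature column
bids = np.array([2.0, 65.0, 200.0, 441.0]).reshape(-1, 1)

scaler_demo = MinMaxScaler(feature_range=(0, 1))
scaled = scaler_demo.fit_transform(bids)

# The same transformation written out by hand
manual = (bids - bids.min()) / (bids.max() - bids.min())

# inverse_transform maps scaled values back to credits
restored = scaler_demo.inverse_transform(scaled)

print(scaled.ravel())
print(restored.ravel())
```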

In [2]:
calcio = pd.read_csv("calcio.csv")

calcio = calcio.reset_index()

calcio = calcio.drop(range(147,535))
calcio = calcio.drop(columns = ["PREDICTED VALUE"])



scaler = MinMaxScaler(feature_range=(0,1))


scaled1 = scaler.fit_transform((np.array(calcio["BID"])).reshape(-1,1))
calcio["scaled_bid"] = scaled1.reshape(-1,1)


calcio = calcio.sample(frac=1)


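# Take explicit copies so that adding columns to the subsets later
# does not trigger pandas' SettingWithCopyWarning
calcio_D = calcio.loc[calcio["R"]=="D"].copy()
calcio_C = calcio.loc[calcio["R"]=="C"].copy()
calcio_A = calcio.loc[calcio["R"]=="A"].copy()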







# Fitting Data

Next we define a fitting function which we will call on when calculating the regression scores. 

We define our X and y variables: X is the 19/20 score of each player, and y is the "scaled_bid" we computed earlier, i.e. how many credits the player was bought for the following year, normalized against the other bids. The data is then split into training and testing sets. A z variable is also added as the average of the last 2 years' scores, which we will use later to predict the bids for the upcoming 21/22 auction.

The objective is to find out how many credits we can expect a player to be bought for, judging by their past performance.

In [4]:
def fitting(data):
    global X_train
    global X_test
    global y_train
    global y_test
    
    global X
    global y
    global z
    
    X = np.array(data["FV 19/20"])
    y = np.array(data["scaled_bid"])
    
    z = np.array(data["2 YR AVG"])



    X_train, X_test, y_train, y_test = train_test_split(X.reshape(-1,1), y.reshape(-1,1), random_state=0)
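
The function above passes its results around through global variables. As an alternative, here is a sketch of a variant that returns them instead; the name fitting_returning and the synthetic data frame are assumptions made for illustration, and the rest of the notebook keeps using the global version.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def fitting_returning(data):
    """Hypothetical variant of fitting() that returns its arrays."""
    X = data["FV 19/20"].to_numpy().reshape(-1, 1)
    y = data["scaled_bid"].to_numpy()
    z = data["2 YR AVG"].to_numpy().reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    return X, y, z, X_train, X_test, y_train, y_test

# Synthetic stand-in for one of the position subsets (20 made-up players)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "FV 19/20": rng.uniform(5, 9, 20),
    "2 YR AVG": rng.uniform(5, 9, 20),
    "scaled_bid": rng.uniform(0, 1, 20),
})

X, y, z, X_train, X_test, y_train, y_test = fitting_returning(demo)
print(X_train.shape, X_test.shape)
```

With the default test_size of train_test_split, a quarter of the rows are held out for testing.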




A score estimator function is defined so that we can look at our errors once the regression is performed. It predicts the y values from the test data for X. However, we have to inverse transform these to get them back to the original scale, so that our results are easier to interpret.

We print the Mean Squared Error, the Mean Absolute Error and also the cross-validated Mean Absolute Error. The score we are most interested in is the CV MAE: our data sets are quite small and we don't want any bias from how the data was cut by train_test_split, so we use 4-fold cross-validation and average the scores to get a truer value.

In [5]:
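def score_estimator(estimator, X_test):
  # Relies on the globals y_test (set by fitting) and scaler (fitted on the bids)

  y_pred = estimator.predict(X_test)

  # Convert predictions and targets back to credits before computing the errors
  y_pred_credits = scaler.inverse_transform(y_pred)
  y_test_credits = scaler.inverse_transform(y_test)

  print("MSE: %.3f" % mean_squared_error(y_test_credits, y_pred_credits))
  print("MAE: %.3f" % mean_absolute_error(y_test_credits, y_pred_credits))

  # Note: this cross-validation refits the estimator on (predicted bid, actual bid)
  # pairs from the test split, not on the original scores
  scores = -1 * cross_val_score(estimator, y_pred_credits, y_test_credits, cv=4, scoring='neg_mean_absolute_error')
  print("CV MAE: %.3f" % scores.mean())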
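
The function above cross-validates on the predicted and actual bids of the test split. A more conventional setup cross-validates the estimator on the features and target directly. The sketch below does this on made-up data; the score and bid arrays are synthetic assumptions, and the target is kept in credits so that no inverse transform is needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic scores and bids with a roughly linear relationship plus noise
rng = np.random.default_rng(0)
score = rng.uniform(5.5, 9.5, size=40)
bid = 60.0 * (score - 5.0) + rng.normal(0.0, 20.0, size=40)

X_demo = score.reshape(-1, 1)

# 4-fold cross-validation; scikit-learn returns negated MAE, so flip the sign
cv_mae = -cross_val_score(
    LinearRegression(), X_demo, bid, cv=4, scoring="neg_mean_absolute_error"
)

print("MAE per fold:", np.round(cv_mae, 3))
print("CV MAE: %.3f" % cv_mae.mean())
```

Each fold fits on three quarters of the players and is scored on the remaining quarter, so every player is used for testing exactly once.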
  

# Performing the Regression

Now we can define our pipeline. We have already performed our feature engineering in the form of the bid normalization, and our columns are correctly defined, so there is no need for any further modification for now.
Hence our pipeline consists only of our regression framework: Linear Regression. We begin with this because scatter plots made before this project commenced showed a roughly linear relationship between performance and bid.

We iterate across the 3 categories we have split our data into, and hence perform 3 separate regressions. The "y_hat" and "z_hat" columns contain the predicted values, but these are on the scaled [0, 1] range, so we inverse transform them using the scaler defined earlier to get predictions in credits. We then define our residuals as the predicted 20/21 value minus the actual bid. Finally, the predicted 21/22 value is increased by 10%. This is an approximation to compensate for the extra player added to the league this year, which will increase the final bid values as there will be more competition for players.

In [6]:


linear = Pipeline(steps=[("regressor", LinearRegression())])

# Label each subset so the printed evaluation identifies the position
players = {"A": calcio_A, "C": calcio_C, "D": calcio_D}
for name, i in players.items():

    fitting(i)

    linear.fit(X_train, y_train)

    # Predictions on the scaled [0, 1] range (predict returns shape (n, 1), so flatten)
    i["y_hat"] = linear.predict(X.reshape(-1,1)).ravel()
    i["z_hat"] = linear.predict(z.reshape(-1,1)).ravel()

    # Convert the predictions back to credits
    i["20/21_Value"] = scaler.inverse_transform(np.array(i["y_hat"]).reshape(-1,1)).ravel()
    i["21/22_Value"] = scaler.inverse_transform(np.array(i["z_hat"]).reshape(-1,1)).ravel()

    #adding price compensation for extra 20% of people in league
    i["21/22_Value"] = 1.1*i["21/22_Value"]

    # Positive residual: the model predicted more than the player was bought for
    i["residual"] = i["20/21_Value"] - i["BID"]

    print(name + " linear evaluation:")
    score_estimator(linear, X_test)

A linear evaluation:
MSE: 7773.155
MAE: 65.763
CV MAE: 78.791
C linear evaluation:
MSE: 342.041
MAE: 12.696
CV MAE: 14.191
D linear evaluation:
MSE: 562.786
MAE: 18.340
CV MAE: 20.499



# Analysis

Let us now look at these figures. I initially also ran a Poisson regression to explore any non-linear possibilities, but the errors were larger than for the Linear Regression, and I had trouble with one of the Poisson libraries as mentioned earlier, so I have omitted it.

Looking at the cross-validated scores, the error for the strikers is large: more than 50 credits, which is over 5% of the total budget. The error for midfielders is smaller; we anticipated this, as bids for the best midfielders are lower than bids for the best strikers.

For defenders there is more predictability, with the average prediction being only 6 credits away from the actual bid. This is a more acceptable value, so we may be able to use the model to value defenders fairly accurately, but the large errors for midfielders and strikers are slightly concerning.

Now let's bring up the strikers data set to investigate further.

In [8]:
calcio_A.sort_values("21/22_Value", ascending=False)

| Unnamed: 0 | index | Id | R | Nome | Squadra | Pg | Mv | Mf | Gf | Gs | ... | ln(Bid) | FV 19/20 | FV 20/21 | 2 YR AVG | scaled_bid | y_hat | z_hat | 20/21_Value | 21/22_Value | residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2610.0 | A | RONALDO | Juventus | 33.0 | 6.32 | 8.79 | 23.0 | 0.0 | ... | 6.089045 | 9.33 | 8.79 | 8.969938 | 1.0 | 0.807928 | 0.692231 | 356.680489 | 336.478236 | -84.319511 |
| 1 | 1 | 2531.0 | A | LUKAKU | Inter | 36.0 | 6.63 | 8.89 | 18.0 | 0.0 | ... | 6.006353 | 8.33 | 8.89 | 8.703271 | 0.920273 | 0.486601 | 0.606544 | 215.618049 | 295.099869 | -190.381951 |
| 15 | 15 | 507.0 | A | MURIEL | Atalanta | 36.0 | 6.67 | 8.92 | 20.0 | 0.0 | ... | 4.174387 | 7.98 | 8.92 | 8.606605 | 0.143508 | 0.374137 | 0.575482 | 166.246196 | 280.100217 | 101.246196 |
| 2 | 2 | 785.0 | A | IMMOBILE | Lazio | 35.0 | 6.23 | 7.7 | 16.0 | 0.0 | ... | 5.993961 | 9.49 | 7.7 | 8.296612 | 0.908884 | 0.85934 | 0.475873 | 379.250479 | 231.999109 | -21.749521 |
| 4 | 4 | 2530.0 | A | IBRAHIMOVIC | Milan | 19.0 | 6.28 | 8.28 | 12.0 | 0.0 | ... | 5.361292 | 8.17 | 8.28 | 8.243275 | 0.480638 | 0.435189 | 0.458735 | 193.048059 | 223.722914 | -19.951941 |
| 16 | 16 | 531.0 | A | BERARDI | Sassuolo | 30.0 | 6.6 | 8.37 | 10.0 | 0.0 | ... | 3.688879 | 7.85 | 8.37 | 8.196608 | 0.08656 | 0.332365 | 0.443739 | 147.908079 | 216.481632 | 107.908079 |
| 3 | 3 | 608.0 | A | ZAPATA D. | Atalanta | 37.0 | 6.42 | 7.88 | 14.0 | 0.0 | ... | 5.703782 | 8.66 | 7.88 | 8.139945 | 0.678815 | 0.592639 | 0.425532 | 262.168654 | 207.689205 | -37.831346 |
| 8 | 8 | 2819.0 | A | CAPUTO | Sassuolo | 25.0 | 6.12 | 7.65 | 8.0 | 0.0 | ... | 5.062595 | 8.31 | 7.65 | 7.869946 | 0.355353 | 0.480175 | 0.338774 | 212.796801 | 165.793917 | 54.796801 |
| 12 | 12 | 409.0 | A | INSIGNE | Napoli | 35.0 | 6.53 | 8.21 | 12.0 | 0.0 | ... | 4.406719 | 6.93 | 8.21 | 7.783276 | 0.182232 | 0.036744 | 0.310925 | 18.130634 | 152.345456 | -63.869366 |
| 5 | 5 | 2764.0 | A | MARTINEZ L. | Inter | 38.0 | 6.46 | 7.87 | 15.0 | 0.0 | ... | 5.198497 | 7.29 | 7.87 | 7.676612 | 0.407745 | 0.152422 | 0.27665 | 68.913112 | 135.794463 | -112.086888 |


Sorting by the predicted 21/22 value, we can immediately see where the issues lie. Several of the highest-value players have residuals of over 100 credits, some positive and some negative. There is an argument for excluding some of these outliers, but there are perhaps too many to exclude. Plus, it could be said that these are the players we are most interested in, as they are where large portions of our budget will go. Excluding them would render the study somewhat futile.
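
To make the size of these misses concrete, the sketch below takes the residual column of the ten strikers shown above and flags the ones that are more than 100 credits off. The threshold of 100 and the column names large_miss and direction are choices made for this illustration only.

```python
import pandas as pd

# Residuals (predicted 20/21 value minus actual bid) copied from the table above
top = pd.DataFrame({
    "Nome": ["RONALDO", "LUKAKU", "MURIEL", "IMMOBILE", "IBRAHIMOVIC",
             "BERARDI", "ZAPATA D.", "CAPUTO", "INSIGNE", "MARTINEZ L."],
    "residual": [-84.319511, -190.381951, 101.246196, -21.749521, -19.951941,
                 107.908079, -37.831346, 54.796801, -63.869366, -112.086888],
})

THRESHOLD = 100  # credits; an arbitrary cut-off for a "large" miss

top["large_miss"] = top["residual"].abs() > THRESHOLD
# Positive residual: the model predicted more than the player was bought for
top["direction"] = top["residual"].apply(
    lambda r: "over-predicted" if r > 0 else "under-predicted"
)

print(top[top["large_miss"]])
```

Four of the ten players are flagged, two in each direction, so the large errors are not a simple bias that a constant correction would remove.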

Another approach would be to introduce an exponential term to help explain this stratum of the relationship, but I judged that there are also a lot of outside variables affecting scores, such as injuries, potential transfers, and the discovery of unknown gems that often go for the minimum bid, and these affect a lot of the data points. By this point I had become more interested in a different question: optimizing the transfer budget by finding how much to spend on each position, as opposed to looking at individual players.
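
One way to read the exponential idea is to regress the logarithm of the bid on the score (the data set already carries an ln(Bid) column) and exponentiate the predictions. This was not pursued in the project, so the following is only a sketch on synthetic data; the coefficients and noise level used to generate the bids are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: bids grow exponentially with score, with multiplicative noise
rng = np.random.default_rng(1)
score = rng.uniform(5.5, 9.5, size=50)
bid = np.exp(0.9 * score - 2.0 + rng.normal(0.0, 0.2, size=50))

# Fit a straight line to log(bid), i.e. bid ~ exp(a * score + b)
model = LinearRegression().fit(score.reshape(-1, 1), np.log(bid))

# Predictions in credits are recovered by exponentiating
pred_bid = np.exp(model.predict(score.reshape(-1, 1)))

print("slope: %.3f, intercept: %.3f" % (model.coef_[0], model.intercept_))
```

Because the fit is done on the log scale, the predicted bids are always positive and the top end of the market is allowed to rise faster than a straight line would permit.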

We will save our dataframes as pickle files to carry across to Part 2.

In [8]:
calcio_A.to_pickle("calcio_A.pk1")
calcio_C.to_pickle("calcio_C.pk1")
calcio_D.to_pickle("calcio_D.pk1")
calcio.to_pickle("calcio.pk1")
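
For reference, a pickled DataFrame can be loaded back with pd.read_pickle. The sketch below round-trips a tiny made-up frame through a temporary directory; the file name calcio_demo.pk1 and its contents are assumptions for illustration, not files from the repository.

```python
import os
import tempfile

import pandas as pd

# Tiny made-up frame standing in for one of the saved data sets
demo = pd.DataFrame({"Nome": ["PLAYER_A"], "21/22_Value": [123.4]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calcio_demo.pk1")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(demo))
```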