# Preprocessing

## Obtaining Data

The following block will not be part of the final pipeline; it is a short version of the ETL for quick experimentation.

In [6]:
### DATA PREPROCESSING

import pandas as pd
import yaml
import numpy as np

### User defined
import variables_n_functions as vnf

# Load the configuration; the context manager closes the file afterwards
with open('config.yaml', 'r') as config_file:
    config = yaml.safe_load(config_file)

teams = config['teams']


### We initialize the df with the appropriate column names
df = pd.DataFrame(columns = vnf.columnas_df)

### The following auxiliary variable avoids duplicate requests
teams_aux = list(teams.keys())

### We recover the match history between every unique team - team combination, and store it in the df
for team_1 in teams.keys():
    teams_aux.remove(team_1)
    for team_2 in teams_aux:
        h2h = vnf.head2head(team_1, team_2, config['sports_token'])
        df = pd.concat([df] + [pd.DataFrame(pd.Series(h2h[k])).transpose() for k in range(len(h2h))])        

### Define columns to drop or that will be added to other tables
dropped_columns = ['details']
to_other_tables = ['weather_report', 'formations', 'scores', 'time', 'coaches', 'standings', 'assistants', 'colors']

# Keep only the general match columns (config was already loaded above)
df_general = df.copy().drop(columns=dropped_columns + to_other_tables)

df_scores = pd.DataFrame(columns = ['id', 'localteam_score', 'visitorteam_score', 'localteam_pen_score',
                                    'visitorteam_pen_score', 'ht_score', 'ft_score', 'et_score', 'ps_score'])
for k in range(df.shape[0]):
#     temp = pd.DataFrame({key:[value] for key,value in eval(df['scores'].iloc[k]).items()})
    temp = pd.DataFrame({key:[value] for key,value in df['scores'].iloc[k].items()})
    temp['id'] = df.iloc[k]['id']
    df_scores = pd.concat([df_scores, temp])
    
#################### Transformations ####################
    
###### h2h.general

### ID variables
df_general['id'] = df_general['id']
df_general['league_id'] = df_general['league_id'].fillna(-1)
df_general['season_id'] = df_general['season_id'].fillna(-1)
df_general['stage_id'] = df_general['stage_id'].fillna(-1)
df_general['round_id'] = df_general['round_id'].fillna(-1)
df_general['group_id'] = df_general['group_id'].fillna(-1)
df_general['aggregate_id'] = df_general['aggregate_id'].fillna(-1)
df_general['venue_id'] = df_general['venue_id'].fillna(-1)
df_general['referee_id'] = df_general['referee_id'].fillna(-1)
df_general['localteam_id'] = df_general['localteam_id'].fillna(-1)
df_general['visitorteam_id'] = df_general['visitorteam_id'].fillna(-1)
df_general['winner_team_id'] = df_general['winner_team_id'].fillna(-1)

### Other variables
df_general['commentaries'] = df_general['commentaries'].apply(vnf.booleanize)# Boolean
df_general['attendance'] = df_general['attendance'].fillna(-1)# Integer
df_general['pitch'] = df_general['pitch'].apply(lambda x: "None" if x is None else x) # Categorical
df_general['neutral_venue'] = df_general['neutral_venue'].apply(vnf.booleanize)# Boolean
df_general['winning_odds_calculated'] = df_general['winning_odds_calculated'].apply(vnf.booleanize)# Boolean
df_general['deleted'] = df_general['deleted'].apply(vnf.booleanize)# Boolean
df_general['is_placeholder'] = df_general['is_placeholder'].apply(vnf.booleanize)# Boolean
df_general['leg'] = df_general['leg'].fillna(-1)

###### h2h.scores

df_scores['id'] = df_scores['id']
df_scores['localteam_score'] = df_scores['localteam_score'].fillna(-1)# Integer
df_scores['visitorteam_score'] = df_scores['visitorteam_score'].fillna(-1)# Integer
df_scores['localteam_pen_score'] = df_scores['localteam_pen_score'].fillna(-1)# Integer
df_scores['visitorteam_pen_score'] = df_scores['visitorteam_pen_score'].fillna(-1)# Integer
df_scores['ht_score'] = df_scores['ht_score'].fillna(-1) # String
df_scores['ft_score'] = df_scores['ft_score'].fillna(-1) # String
df_scores['et_score'] = df_scores['et_score'].fillna(-1) # String
df_scores['ps_score'] = df_scores['ps_score'].fillna(-1) # String
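
The nested loop in the cell above visits every unordered pair of teams exactly once. A minimal sketch of the same pairing with `itertools.combinations` follows. It assumes a hypothetical `fake_head2head` stand-in for `vnf.head2head` (the real function needs an API token) and made-up team names; it is an illustration, not the pipeline's code.

```python
from itertools import combinations

import pandas as pd


# Hypothetical stand-in for vnf.head2head: returns a list of match dicts.
def fake_head2head(team_1, team_2):
    return [{"id": f"{team_1}-{team_2}", "localteam_id": team_1, "visitorteam_id": team_2}]


teams = {"A": 1, "B": 2, "C": 3}  # made-up teams, same shape as config['teams']

# combinations() yields every unordered pair exactly once, so no auxiliary list is needed.
pairs = list(combinations(teams.keys(), 2))

# Collect all rows first and build the DataFrame once, instead of concatenating inside the loop.
rows = [match for t1, t2 in pairs for match in fake_head2head(t1, t2)]
df_pairs = pd.DataFrame(rows)
print(pairs)
```

Building the frame once from a list of dicts avoids repeated `pd.concat` calls, which copy the accumulated data on every iteration.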





## Defining the Y

The next chunk is still preprocessing; in this section we adapt our dataframe, in a very simple way, so that it is ready to be used by different models. To do so, we will:

+ select only a few variables that we will use in this first iteration
+ give proper format to the dataframe
+ propose different Ys for different experiments

We have two kinds of models: overall match predictors and winner-only predictors. The former estimate the number of goals for each team. This is normally done parametrically in more classical models; however, for the sake of the experiment, we tried naive regressions to estimate goals. This kind of model is useful because it can predict winners, losers, ties, goal counts, and other scenarios; the drawback is that our design is pretty naive (designing a proper model could be a whole thesis). The latter models, less ambitious but more powerful, only predict whether the home team will win or whether it will lose or tie.
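
To make the two families of targets concrete, here is a small sketch on made-up scores. It derives the labels from the scores themselves, which is an assumption made for illustration; the notebook derives them from `winner_team_id`.

```python
import numpy as np
import pandas as pd

# Made-up final scores for four matches (not real data).
toy = pd.DataFrame({"localteam_score": [2, 0, 1, 3], "visitorteam_score": [1, 0, 2, 3]})

# Goal targets for the "overall match" models: one regression target per team.
y_goals_local = toy["localteam_score"]
y_goals_visitor = toy["visitorteam_score"]

# Binary target for the "winner only" models: 1 if the home team wins, 0 otherwise.
y_binary = (y_goals_local > y_goals_visitor).astype(int)

# Three-class label that goal predictions can recover: home win, draw, or away win.
y_three = np.select(
    [y_goals_local > y_goals_visitor, y_goals_local < y_goals_visitor],
    ["home", "away"],
    default="draw",
)
print(y_binary.tolist(), list(y_three))
```

Note that the binary target collapses draws and away wins into the same class, which is why only the goal-based models can recover the draw category.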

In [24]:
#Y definition

# Add index to dataframe
new_index = [i for i in range(len(df_general))]
df_general.index = new_index

# Create Y variable: 1 if local wins, 0 in any other case
df_general['Y'] = np.where(df_general['winner_team_id']==df_general['localteam_id'], 1, 0)

# Filter only simpler columns
dat=df_general[["league_id","season_id","venue_id","referee_id","localteam_id",'visitorteam_id']]

#dat1=pd.json_normalize(df["formations"])
#dat2=pd.json_normalize(df["scores"])
#dat3=pd.json_normalize(df["time"])
#dat4=pd.json_normalize(df["coaches"])
dat5=pd.json_normalize(df["standings"])
#dat6=pd.json_normalize(df["assistants"])
data=pd.concat([dat,dat5], axis=1)
X = pd.get_dummies(data)

# Naive cleaning: turn infinities into NaN first, then fill every NaN with 0
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(0)
X = X.reset_index()
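
One detail worth keeping in mind: when ID columns have a numeric dtype, `pd.get_dummies` leaves them untouched and only expands object, string, or category columns. The sketch below uses made-up rows to show the difference; whether the IDs should be one-hot encoded is a modeling choice that this notebook does not settle.

```python
import pandas as pd

# Made-up rows; the column names only mimic the notebook's.
toy = pd.DataFrame({"league_id": [501, 501, 502], "pitch": ["Good", "None", "Good"]})

# Integer columns pass through unchanged; only the text column is expanded.
default_dummies = pd.get_dummies(toy)

# Naming the columns explicitly forces one-hot encoding of the numeric ID as well.
forced_dummies = pd.get_dummies(toy, columns=["league_id", "pitch"])

print(list(default_dummies.columns))
print(list(forced_dummies.columns))
```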

In [25]:
# Ys definition

# Canonical Y variable: 1 if local wins, 0 in any other case
y = df_general['Y']

# 2 alternative Ys, amount of goals for local and visitor

df_scores.index = new_index
y_goals_local = df_scores['localteam_score']
y_goals_visitor = df_scores['visitorteam_score']

# Finally, win, lose, or tie labels ("Local" = home win, "Visitante" = away win, "Empate" = tie).
y_multi = np.where(df_general['winner_team_id']==-1, "Empate", 
                  np.where(df_general['winner_team_id']==df_general['localteam_id'], "Local", "Visitante"))



As you can see, our X is pretty simple and pretty small. We learned that many of the variables were really noisy and had a negative impact on our models.

In [10]:
X

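| | index | league_id | season_id | venue_id | referee_id | localteam_id | visitorteam_id | localteam_position | visitorteam_position |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 501 | 18369 | 8914 | 14859 | 62 | 53 | 2.0 | 1.0 |
| 1 | 1 | 501 | 18369 | 8909 | 14853 | 53 | 62 | 2.0 | 1.0 |
| 2 | 2 | 501 | 18369 | 8914 | 18748 | 62 | 53 | 6.0 | 5.0 |
| 3 | 3 | 501 | 17141 | 8914 | 14468 | 62 | 53 | 1.0 | 2.0 |
| 4 | 4 | 501 | 17141 | 8909 | 14859 | 53 | 62 | 2.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2086 | 2086 | 501 | 1932 | 219 | 17252 | 734 | 496 | 0.0 | 0.0 |
| 2087 | 2087 | 501 | 1931 | 219 | 70310 | 734 | 496 | 0.0 | 0.0 |
| 2088 | 2088 | 501 | 1931 | 281425 | 70311 | 496 | 734 | 0.0 | 0.0 |
| 2089 | 2089 | 501 | 1931 | 219 | 70311 | 734 | 496 | 0.0 | 0.0 |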


# Modeling Experiments

## Train/test split

Since we have different Ys, we are going to split the data once for all the models, which makes the comparison between models fairer.
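
A minimal sketch, on made-up arrays, of why a single `train_test_split` call keeps several targets aligned: every array passed in the same call receives the same shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, with two targets that share the same row order.
X_toy = np.arange(20).reshape(10, 2)
y_a = np.arange(10)        # first target
y_b = np.arange(10) * 10   # second target, aligned with y_a row by row

# Passing all arrays in one call applies the same shuffle to each of them.
X_tr, X_te, ya_tr, ya_te, yb_tr, yb_te = train_test_split(
    X_toy, y_a, y_b, test_size=0.4, random_state=10
)

# Each row keeps its own targets on both sides of the split.
print(np.array_equal(yb_tr, ya_tr * 10), np.array_equal(yb_te, ya_te * 10))
```

Calling `train_test_split` separately for each target with different seeds would break this alignment, so one call (or one fixed `random_state`) is the safer pattern.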


In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test, y_local_train, y_local_test,y_visitor_train, y_visitor_test = train_test_split(X,y, y_goals_local,y_goals_visitor, test_size=0.4, random_state=10)


## Model 1: Lasso with binary Y

This is the simplest model with a binary y.
As you can see, the $R^2$ is pretty low; however, the share of correctly predicted outcomes is modest (about 61% on the test set).
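
$R^2$ and the hit rate measure different things, so a low $R^2$ can coexist with a reasonable hit rate. The sketch below, on made-up numbers, turns regression scores into class predictions and compares the hit rate with a majority-class baseline. The 0.5 threshold and the baseline are assumptions added here for illustration, not something the notebook computes.

```python
import numpy as np

# Made-up binary outcomes and continuous scores from a regression model.
y_true = np.array([1, 0, 0, 1, 0, 0])
y_score = np.array([0.62, 0.41, 0.55, 0.48, 0.30, 0.44])

# Turn scores into class predictions by thresholding at 0.5.
y_pred = (y_score > 0.5).astype(int)
hit_rate = np.mean(y_pred == y_true)

# Baseline: always predict the most frequent class.
majority_class = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority_class)

print(round(hit_rate, 2), round(baseline, 2))
```

In this toy case the hit rate merely equals the baseline, which is the kind of comparison that helps judge whether a hit rate is actually informative.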


In [15]:
# MODEL 1

# Lasso BINARY MODEL

from sklearn.linear_model import Lasso

reg = Lasso(alpha=1)
reg.fit(X_train, y_train)
print('R squared training set', round(reg.score(X_train, y_train)*100, 2))
print('R squared test set', round(reg.score(X_test, y_test)*100, 2))

xtrain=reg.predict(X_train)
xtest=reg.predict(X_test)
partidos_train=sum(np.rint(np.nextafter(xtrain, xtrain+1))==y_train)/len(y_train)
partidos_test=sum(np.rint(np.nextafter(xtest, xtest+1))==y_test)/len(y_test)
print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))


R squared training set 6.27
R squared test set 4.53
% of successfully predicted matches train 63.48
% of successfully predicted matches test 61.29


## Model 2: Lasso with numeric Y

This is a pair of models that, instead of the binary Y, use as Y the number of goals scored by the local and the visitor team, respectively. The design consists of fitting one linear model per team to estimate its goals and then comparing the goals of team 1 against those of team 2. This lets us estimate the result rather than only the winner, which is important because it lets us recover the "Draw" category.

We know that the home Y and the away Y are probably correlated, so estimating the two models independently does not make much mathematical sense, but we did it as an experiment. The results were surprisingly good.

In the future, we can explore how to build this type of model correctly.
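
The two-model design can be sketched on synthetic data as follows. The data, the `alpha=0.1` value, and the variable names are assumptions chosen for this toy example; only the structure (one Lasso per team, then a comparison of rounded goals) mirrors the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data (not the notebook's): one feature drives home goals up and away goals down.
X_toy = rng.normal(size=(200, 1))
goals_home = np.clip(np.rint(1.5 + X_toy[:, 0] + rng.normal(scale=0.5, size=200)), 0, None)
goals_away = np.clip(np.rint(1.5 - X_toy[:, 0] + rng.normal(scale=0.5, size=200)), 0, None)

# One independent Lasso per team; alpha=0.1 is an arbitrary choice for this toy data.
model_home = Lasso(alpha=0.1).fit(X_toy, goals_home)
model_away = Lasso(alpha=0.1).fit(X_toy, goals_away)

# Predict a home win when the rounded home goals exceed the rounded away goals.
pred_home_win = np.round(model_home.predict(X_toy)) > np.round(model_away.predict(X_toy))
true_home_win = goals_home > goals_away
print("hit rate:", np.mean(pred_home_win == true_home_win))
```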





In [17]:
# Model 2

reg2 = Lasso(alpha=1)
model_local=reg2.fit(X_train, y_local_train)

print('R squared training set', round(reg2.score(X_train, y_local_train)*100, 2))
print('R squared test set', round(reg2.score(X_test, y_local_test)*100, 2))

reg3 = Lasso(alpha=1)
model_visitor=reg3.fit(X_train, y_visitor_train)
print('R squared training set', round(reg3.score(X_train, y_visitor_train)*100, 2))
print('R squared test set', round(reg3.score(X_test, y_visitor_test)*100, 2))


partidos_test=sum((np.round(model_local.predict(X_test)) > np.round(model_visitor.predict(X_test)))==y_test)/len(y_test)
partidos_train=sum((np.round(model_local.predict(X_train)) > np.round(model_visitor.predict(X_train)))==y_train)/len(y_train)
print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))




R squared training set 6.78
R squared test set 7.44
R squared training set 4.92
R squared test set 5.13
% of successfully predicted matches train 60.85
% of successfully predicted matches test 58.18


## Model 2.5:  Lasso Extension with ties

Since the previous model lets us estimate the expected goals, we can choose a threshold alpha (unrelated to the Lasso penalty of the same name) and declare a tie whenever the expected goals of both teams differ by at most alpha.

For example, if alpha is .11 and the expected goals are 1.95 and 2.05 for home and away, we would be proposing a 2-2 tie.

The share of correctly predicted matches decreases; however, this is now a three-category problem, so the benchmark to beat would be 1/3 of the matches (a uniform random guess). From this point of view, the model performs well.
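
The tie rule can be written as a small helper. The function name `classify_with_margin` and the extra example matches are hypothetical; the first match reproduces the 1.95 versus 2.05 example with a margin of .11.

```python
import numpy as np


def classify_with_margin(goals_home, goals_away, margin):
    """Label each match "Local", "Visitante", or "Empate" from expected goals."""
    goals_home = np.asarray(goals_home, dtype=float)
    goals_away = np.asarray(goals_away, dtype=float)
    diff = goals_home - goals_away
    # Home win if ahead by more than the margin, away win if behind by more, tie otherwise.
    return np.where(diff > margin, "Local", np.where(diff < -margin, "Visitante", "Empate"))


# 1.95 vs 2.05 differ by 0.10, inside a margin of 0.11, so the match is called a tie.
labels = classify_with_margin([1.95, 2.60, 0.80], [2.05, 1.10, 1.90], margin=0.11)
print(labels)
```

A larger margin produces more ties, so the margin is effectively a tuning parameter that could be chosen on the training set.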

In [18]:
# Evaluation with ties
indices_train=y_train.index[0:len(y_train)]
indices_test=y_test.index[0:len(y_test)]
pd_y_multi=pd.DataFrame(y_multi)
pd_y_multi.index = new_index
y_multi_train = pd_y_multi.filter(items = indices_train, axis=0)
y_multi_test = pd_y_multi.filter(items = indices_test, axis=0)

alpha = .05

predict_multi_train = np.where(model_local.predict(X_train) > alpha + (model_visitor.predict(X_train)), "Local", 
                  np.where(model_local.predict(X_train) + alpha < model_visitor.predict(X_train), "Visitante", "Empate"))

predict_multi_test = np.where(model_local.predict(X_test) > alpha + (model_visitor.predict(X_test)), "Local", 
                  np.where(model_local.predict(X_test) + alpha < model_visitor.predict(X_test), "Visitante", "Empate"))

partidos_train=sum(y_multi_train[0]==predict_multi_train)/len(predict_multi_train)
partidos_test=sum(y_multi_test[0]==predict_multi_test)/len(predict_multi_test)

print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))


% of successfully predicted matches train 47.05
% of successfully predicted matches test 47.19


## Model 3: Lasso CV

This is an extension of Model 1 in which the Lasso penalty is selected by 5-fold cross-validation instead of being fixed by hand.
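
A minimal sketch of the selection step on synthetic data (the data and variable names are assumptions for illustration). `LassoCV` stores the selected penalty in `alpha_` and is itself refit on the full training data with that value, so refitting a separate `Lasso` should give essentially the same coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(0)

# Synthetic regression data (not the notebook's): only the first of five features matters.
X_toy = rng.normal(size=(150, 5))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=150)

# LassoCV tries a path of alphas with 5-fold CV and keeps the best one in `alpha_`.
cv_model = LassoCV(cv=5, random_state=0, max_iter=10000).fit(X_toy, y_toy)

# A separate Lasso fitted with the selected alpha, as done in the notebook cell.
refit = Lasso(alpha=cv_model.alpha_, max_iter=10000).fit(X_toy, y_toy)
print(cv_model.alpha_, np.allclose(cv_model.coef_, refit.coef_, atol=1e-3))
```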

In [20]:
# Model 3: Lasso CV

from sklearn.linear_model import LassoCV

# Lasso with 5 fold cross-validation
model = LassoCV(cv=5, random_state=0, max_iter=10000)

# Fit model
model.fit(X_train, y_train)

# Set best alpha
lasso_best = Lasso(alpha=model.alpha_)
lasso_best.fit(X_train, y_train)
print('R squared training set', round(lasso_best.score(X_train, y_train)*100, 2))
print('R squared test set', round(lasso_best.score(X_test, y_test)*100, 2))

xtrain=lasso_best.predict(X_train)
xtest=lasso_best.predict(X_test)
partidos_train=sum(np.rint(np.nextafter(xtrain, xtrain+1))==y_train)/len(y_train)
partidos_test=sum(np.rint(np.nextafter(xtest, xtest+1))==y_test)/len(y_test)
print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))


R squared training set 5.74
R squared test set 4.17
% of successfully predicted matches train 63.96
% of successfully predicted matches test 60.22


## Model 4: Logistic regression

This version is similar to the previous ones, with a binary Y.
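
L1-penalised logistic regression is sensitive to feature scale, and a very small `C` means a very strong penalty. The sketch below, on synthetic data, standardises the features before fitting. The scaling step, the `liblinear` solver, and `C=1.0` are assumptions made for this illustration and are not the settings used in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (not the notebook's): the first feature has ID-like large magnitudes.
X_toy = rng.normal(size=(300, 3)) * np.array([10000.0, 1.0, 1.0])
y_toy = (X_toy[:, 0] / 10000.0 + X_toy[:, 1] > 0).astype(int)

# Standardise first, then fit an L1-penalised model; C=1.0 is an arbitrary, milder penalty.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)
clf.fit(X_toy, y_toy)
print("training accuracy:", clf.score(X_toy, y_toy))
```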

In [21]:
# Logistic model

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    penalty='l1',
    solver='saga',  # or 'liblinear'
    C=.001)

model=model.fit(X_train, y_train)
partidos_train=model.score(X_train,y_train)
partidos_test=model.score(X_test,y_test)

print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))

% of successfully predicted matches train 56.46
% of successfully predicted matches test 56.27




## Model 5: RF

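This version is similar to the previous ones, with a binary Y. The advantage of this model is that it is very powerful: with so little data and preprocessing, it achieves the best accuracy of all the models tried (64.01% on the test set). Note that this cell re-splits the data with `test_size=0.3`, so its scores are not strictly comparable with those of the models above, which use the 0.4 split. This is our champion model.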
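
Because a single split can be lucky or unlucky, a cross-validated score is one way to check that an accuracy figure is stable. The sketch below uses synthetic data and the same shallow forest settings; the use of `cross_val_score` is an addition for illustration, not part of the notebook's evaluation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic binary problem (not the notebook's data): the label depends on two of four features.
X_toy = rng.normal(size=(400, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Same shallow forest settings as in the notebook cell.
clf = RandomForestClassifier(max_depth=2, random_state=0)

# 5-fold cross-validated accuracy gives a mean and a spread instead of a single split's score.
scores = cross_val_score(clf, X_toy, y_toy, cv=5, scoring="accuracy")
print(round(scores.mean(), 3), round(scores.std(), 3))
```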

In [23]:
### RF model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Re-split with a 30% test set (the models above use 40%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)


partidos_train=clf.score(X_train,y_train)
partidos_test=clf.score(X_test,y_test)

print('% of successfully predicted matches train', round(partidos_train*100,2))
print('% of successfully predicted matches test', round(partidos_test*100,2))

% of successfully predicted matches train 67.46
% of successfully predicted matches test 64.01


# Conclusion

For now we will stick with the RF model (and later we will probably experiment with XGBoost) because of its flexibility regarding inputs and its predictive power. However, this exercise has helped us to propose ad hoc metrics for our problem, to find model extensions that we could add to whichever version we choose, to better select the data we are going to use, and to think about how the models will be used in production, that is, what kind of data will they have before a match?

After reading some blogs by other data scientists and running these experiments, we learned that:
+ there is extreme noise in the data for this kind of model
+ bad variables are very punishing, which is why the simpler models were the most powerful
+ a simple Random Forest can outperform regressions and even boosting methods and deep learning models
    

# References

[Lessons from experienced people in this kind of models](https://towardsdatascience.com/what-ive-learnt-predicting-soccer-matches-with-machine-learning-b3f8b445149d)

[Different approaches to Y in this kind of models](https://medium.com/geekculture/building-a-simple-football-prediction-model-using-machine-learning-f061e607bec5)