### Introduction 🤗

#### This notebook explains how a basic machine learning model works. Made by [Pol Jaimejuan Caubet](https://www.linkedin.com/in/pol-jaimejuan-caubet/)

We demonstrate the process of building and evaluating a basic machine learning model for **predicting the results of football matches**. We use a **decision tree regressor** to analyze historical data from La Liga, transforming categorical data into numerical features, **training the model**, and **evaluating its performance using Mean Absolute Error (MAE)**. The notebook covers data preprocessing, model training, and evaluation, providing a clear example of how machine learning techniques can be applied to sports analytics. The datasets in this project do not contain many of the statistical features that would allow us to analyze matches in depth and make accurate predictions. Instead, they serve as a basic example for learning how to apply predictive models. The main point is to practice and understand the process of building and evaluating machine learning models, rather than making highly accurate predictions :)

**First of all, we import the necessary libraries for our project**
- `os`: for working with file paths.
- `pandas`: for data manipulation (columns, rows, etc.).
- `train_test_split` from `sklearn.model_selection`: to divide our data into training and validation sets.
- `DecisionTreeRegressor` from `sklearn.tree`: to create and train our model.
- `mean_absolute_error` from `sklearn.metrics`: the error metric we use to evaluate the model.
- `LabelEncoder` from `sklearn.preprocessing`: to convert categorical variables to numerical values.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder

**Loading dataset path**

We load the dataset of La Liga matches with `pandas`. We use `os.path.join` to build the path to the .csv file, which is located in the `LeaguesDataset` folder, one level above the current working directory.

In [7]:
current_dir = os.getcwd()
league_dataset_path = os.path.join(current_dir, '..', 'LeaguesDataset', 'LaLiga.csv')
league_dataset = pd.read_csv(league_dataset_path)


**Remove null values**

We delete all dataset rows that contain null values, using `dropna()`, to ensure that our model is not affected by incomplete data.

In [8]:
league_dataset.dropna(inplace=True)
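
Before dropping rows, it can help to see how many values are actually missing. The following is a minimal sketch on a toy frame: the rows and values are made up for illustration, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy frame with hypothetical values; the second row is missing AwayGoals.
matches = pd.DataFrame({
    "HomeTeam": ["Betis", "Celta", "Sevilla"],
    "AwayTeam": ["Girona", "Alaves", "Elche"],
    "HomeGoals": [2, 1, 0],
    "AwayGoals": [1, None, 0],
})

# Count the missing values per column before dropping anything.
missing_per_column = matches.isna().sum()

# Without inplace=True, dropna() returns a new frame and leaves `matches` untouched.
cleaned = matches.dropna()

print(missing_per_column)
print(len(matches), "rows before,", len(cleaned), "rows after")
```

Comparing the row counts before and after shows how much data the cleaning step costs.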

**Convert the 'Date' column to numerical features**

The 'Date' column contains information about the date of each match, so we convert it into numerical features (day, month, and year). This is necessary because machine learning models cannot work directly with dates.

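First, we convert the 'Date' column to a datetime format. Values that do not match the day/month/year pattern become missing (`errors='coerce'`), and we drop those rows. Then we extract the day, month, and year of each match and save them as new columns. Finally, we delete the 'Date' column with `drop()`.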


In [None]:
league_dataset['Date'] = pd.to_datetime(league_dataset['Date'], format='%d/%m/%Y', errors='coerce')
league_dataset.dropna(subset=['Date'], inplace=True)
league_dataset['Day'] = league_dataset['Date'].dt.day
league_dataset['Month'] = league_dataset['Date'].dt.month
league_dataset['Year'] = league_dataset['Date'].dt.year
league_dataset.drop('Date', axis=1, inplace=True)
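
The effect of `errors='coerce'` is easier to see on a tiny example. This sketch uses hypothetical date strings, not rows from the dataset, and assumes the same `%d/%m/%Y` format as the cell above.

```python
import pandas as pd

# Hypothetical date strings in day/month/year order; the last one is invalid on purpose.
raw_dates = pd.Series(["25/12/2022", "01/02/2023", "not a date"])

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
unparsed_count = int(parsed.isna().sum())

# Drop the NaT entries, then split each date into integer components.
valid = parsed.dropna()
parts = pd.DataFrame({
    "Day": valid.dt.day,
    "Month": valid.dt.month,
    "Year": valid.dt.year,
})

print(unparsed_count)
print(parts)
```

Dropping the unparsed rows before extracting the components keeps the new columns as integers rather than floats with missing values.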

**Encode categorical columns 'HomeTeam', 'AwayTeam' and 'Result'**

The 'HomeTeam' and 'AwayTeam' columns contain team names, which are categorical values. To make the model understand them, we convert them into numerical values using `LabelEncoder`. Each team is assigned a unique number.

The 'Result' column contains the match result: 'H' (home win), 'D' (draw), or 'A' (away win). As with the team columns, we convert these results into numerical values using `LabelEncoder`.


In [None]:
label_encoder_home = LabelEncoder()
label_encoder_away = LabelEncoder()
label_encoder_result = LabelEncoder()
league_dataset['HomeTeam'] = label_encoder_home.fit_transform(league_dataset['HomeTeam'])
league_dataset['AwayTeam'] = label_encoder_away.fit_transform(league_dataset['AwayTeam'])
league_dataset['Result'] = label_encoder_result.fit_transform(league_dataset['Result'])
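
To see which number each label receives, we can inspect the encoder after fitting. This is a small sketch on a hypothetical list of results, assuming the 'H', 'D', 'A' labels that appear in the printed predictions at the end of this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sequence of match results.
results = ["H", "A", "D", "H", "A"]

encoder = LabelEncoder()
codes = encoder.fit_transform(results)

# classes_ holds the sorted unique labels; the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(codes), list(decoded))
```

The codes follow the sorted order of the labels, so the numeric ordering (A < D < H) is an artifact of the encoding, not a property of the matches.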

**Define features (X) and target (y)**

We define our features (`X`) and the target (`y`). The features are the columns we use to predict the match results, and the target is the 'Result' column, which contains the actual match outcome.

In [12]:
features = ['HomeTeam', 'AwayTeam', 'HomeGoals', 'AwayGoals', 'Day', 'Month', 'Year']
X = league_dataset[features]
y = league_dataset['Result']
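
Note that `HomeGoals` and `AwayGoals` are part of the features, and the final score already determines the outcome. The sketch below illustrates this with hypothetical scores, assuming 'Result' follows the usual 'H'/'D'/'A' convention. The `outcome` helper is made up for this illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical final scores.
matches = pd.DataFrame({
    "HomeGoals": [2, 0, 1, 3],
    "AwayGoals": [1, 0, 2, 3],
})


def outcome(home_goals, away_goals):
    """Return 'H', 'A' or 'D' from the final score (illustrative helper)."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


# The result can be computed from the two goal columns alone, without any model.
matches["DerivedResult"] = [
    outcome(h, a) for h, a in zip(matches["HomeGoals"], matches["AwayGoals"])
]

print(matches)
```

A model that sees both goal columns only has to learn this rule, which is worth keeping in mind when reading the error values later on.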

**Split the data into training and validation sets**

We split the data into two sets: one for training the model and the other for validating its performance. We use `train_test_split` from `sklearn.model_selection` with its default split, which keeps 75% of the data for training and the remaining 25% for validation. Setting `random_state` makes the split reproducible.

In [13]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
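
As a quick check of the default proportions, here is a sketch on 20 synthetic rows (the arrays are placeholders, not match data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic rows with two placeholder features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# No test_size is given, so the default 25% validation share applies.
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

# The same random_state reproduces exactly the same split.
_, val_X_again, _, _ = train_test_split(X_demo, y_demo, random_state=1)

print(len(train_X_demo), len(val_X_demo))
```

With 20 rows this gives 15 training rows and 5 validation rows.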

**Specify and train the model (DecisionTreeRegressor)**
We use a decision tree regression model (`DecisionTreeRegressor`) to predict the match results. We train the model with the training dataset.

In [14]:
football_model = DecisionTreeRegressor(random_state=1)
football_model.fit(train_X, train_y)

**Make predictions on the validation set**

With the trained model, we now make predictions on the validation set.

In [15]:
validation_predictions = football_model.predict(val_X)

**Calculate the Mean Absolute Error (MAE)**

We calculate the Mean Absolute Error (MAE) between the model's predictions and the actual results in the validation set using `mean_absolute_error()` from `sklearn.metrics`. The MAE is the average absolute difference between predicted and actual values, so lower values mean more accurate predictions.

In [16]:
# Calculate the mean absolute error (MAE); the signature is (y_true, y_pred)
val_mae = mean_absolute_error(val_y, validation_predictions)
print(f"Mean Absolute Error: {val_mae}")

Mean Absolute Error: 0.0
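
To make the metric concrete, the sketch below computes the MAE by hand on four hypothetical encoded results and compares it with `mean_absolute_error`. The values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical encoded results (0 = A, 1 = D, 2 = H) and model outputs.
actual = np.array([2, 0, 1, 2])
predicted = np.array([2.0, 1.0, 1.0, 0.0])

# MAE is the mean of the absolute differences: (0 + 1 + 0 + 2) / 4.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

Because the labels are encoded as 0, 1 and 2, predicting 'A' for an 'H' match counts as an error of 2, while predicting 'D' counts as 1. This weighting comes from the encoding rather than from the matches themselves.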


**Function to calculate MAE with different values of max_leaf_nodes**

We see that the MAE is 0, which may indicate overfitting. It may also reflect that `HomeGoals` and `AwayGoals` are among the features, since together they determine the result. We define a function that trains the model with different values for the `max_leaf_nodes` parameter (the maximum number of leaf nodes in the decision tree) and calculates the MAE for each one. This will help us choose the best tree size.

In [17]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
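
The sketch below illustrates what `max_leaf_nodes` controls, using synthetic data (a noiseless sine curve, unrelated to the football dataset) and `get_n_leaves()` to count the leaves of each fitted tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: 200 points on a sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])

# Fit one tree per cap and record how many leaves it ends up with.
leaf_counts = {}
for cap in (2, 5, 20):
    model = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    model.fit(X_demo, y_demo)
    leaf_counts[cap] = int(model.get_n_leaves())

# Without a cap, the tree keeps splitting until it fits the training data.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
unrestricted_leaves = int(unrestricted.get_n_leaves())

print(leaf_counts, unrestricted_leaves)
```

A smaller cap gives a simpler tree that may underfit, while a very large or missing cap lets the tree memorize the training set.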

**Test different values of max_leaf_nodes**

We test different values for the `max_leaf_nodes` parameter and calculate the MAE for each value. This will allow us to find the value of `max_leaf_nodes` that gives the best performance in our model.

In [18]:
candidate_max_leaf_nodes = [2, 3, 4, 5, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20]
scores = {}

for leaf_size in candidate_max_leaf_nodes:
    mae = get_mae(leaf_size, train_X, val_X, train_y, val_y)
    scores[leaf_size] = mae

print(scores)

{2: 0.5703409697663293, 3: 0.4553377152847998, 4: 0.38947089694139425, 5: 0.29088366289884104, 8: 0.07062969325153375, 10: 0.013901298701298703, 11: 0.005755844155844157, 12: 0.0018285714285714292, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 20: 0.0}


**Finding the best tree size**

We look for the value of `max_leaf_nodes` that gives the lowest MAE, indicating the best tree size for our model.

In [23]:
best_tree_size = min(scores, key=scores.get)
print(f"Best tree size: {best_tree_size}")

Best tree size: 13
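
Several tree sizes reach an MAE of 0.0, and `min` returns the first of them in the dictionary's insertion order. A minimal sketch, using a subset of the scores printed above (rounded):

```python
# Subset of the scores printed above, rounded for readability.
scores_demo = {10: 0.0139, 12: 0.0018, 13: 0.0, 14: 0.0, 20: 0.0}

# min() iterates over the keys in insertion order and keeps the first minimum.
best = min(scores_demo, key=scores_demo.get)

# List every size that ties for the lowest MAE.
tied = [size for size, mae in scores_demo.items() if mae == scores_demo[best]]

print(best, tied)
```

Since the candidates were listed in increasing order, the selected value is the smallest tree that reaches the lowest error.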


**Train final model**

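We train the final model on all the data using the best tree size found in the previous cell, and then predict the results for the validation rows. Because the final model is fit on the full dataset, the validation rows are no longer unseen data for it.

In [28]:
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)
# Predict with the final model (not the earlier football_model)
final_validation_predictions = final_model.predict(val_X)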

print(final_validation_predictions)

[2. 2. 2. 0. 2. 2. 1. 1. 1. 2. 1. 2. 2. 0. 0. 2. 1. 0. 2. 2. 2. 2. 2. 1.
 0. 2. 2. 2. 2. 2. 1. 0. 2. 0. 2. 0. 2. 2. 1. 0. 0. 0. 1. 2. 1. 0. 2. 0.
 2. 2. 0. 2. 0. 0. 2. 0. 2. 0. 1. 2. 2. 1. 1. 1. 2. 1. 0. 0. 2. 1. 2. 2.
 2. 1. 0. 2. 2. 1. 1. 1. 0. 0. 0. 2. 2. 0. 0. 2. 2. 0. 0. 1. 2. 1. 1. 2.
 0. 0. 1. 1. 2. 2. 1. 0. 0. 2. 1. 2. 0. 0. 2. 1. 2. 0. 2. 0. 2. 1. 0. 1.
 1. 0. 1. 1. 1. 1. 0. 0. 2. 0. 2. 2. 2. 2. 2. 2. 0. 0. 2. 2. 2. 2. 1. 1.
 2. 2. 2. 1. 2. 1. 1. 2. 2. 0. 2. 2. 0. 2. 1. 2. 0. 1. 2. 1. 1. 1. 2. 2.
 2. 2. 1. 2. 1. 2. 2. 2. 1. 0. 2. 2. 2. 0. 0. 2. 2. 1. 0. 0. 2. 2. 2. 0.
 0. 1. 2. 2. 1. 0. 0. 0. 2. 0. 2. 0. 2. 0. 1. 0. 1. 0. 2. 0. 0. 1. 2. 1.
 2. 1. 1. 2. 2. 1. 0. 2. 1. 1. 2. 1. 2. 2. 2. 1. 2. 2. 1. 0. 2. 1. 1. 2.
 2. 2. 2. 0. 1. 1. 2. 2. 2. 2. 1. 2. 2. 1. 2. 1. 0. 2. 0. 1. 0. 2. 0. 2.
 0. 1. 1. 2. 2. 1. 1. 1. 2. 1. 1. 2. 2. 2. 1. 1. 1. 0. 1. 2. 0. 2. 2. 0.
 0. 2. 0. 2. 0. 2. 2. 1. 1. 0. 2. 0. 1. 2. 1. 1. 1. 2. 2. 1. 2. 1. 0. 0.
 2. 1. 2. 2. 2. 2. 2. 2. 0. 0. 0. 2. 1. 0. 1. 2. 2.

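**Convert numerical predictions back to categorical values**

The model's predictions are numbers, so we convert them back to the original result labels ('H', 'D', 'A') using `inverse_transform`. Because the regressor returns floats, we cast them to integers first.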

In [29]:
predicted_results = label_encoder_result.inverse_transform(final_validation_predictions.astype(int))
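
A regressor predicts the average target of each leaf, so in general its outputs can be fractional. In this notebook the predictions are whole numbers, so the cast is harmless. The sketch below, with hypothetical values, shows how `astype(int)` and rounding differ when they are not.

```python
import numpy as np

# Hypothetical regressor outputs; the last two are fractional.
raw_predictions = np.array([0.0, 1.0, 2.0, 1.6, 0.4])

# astype(int) truncates toward zero, so 1.6 becomes 1.
truncated = raw_predictions.astype(int)

# Rounding first maps each value to the nearest class code, so 1.6 becomes 2.
rounded = np.rint(raw_predictions).astype(int)

print(truncated, rounded)
```

If the tree were pruned further and produced fractional outputs, rounding before the cast would be the safer choice.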

**Convert the encoded team names back to text**

The 'HomeTeam' and 'AwayTeam' columns of the validation set still hold the numbers assigned by `LabelEncoder`. We use `inverse_transform` on each team encoder to recover the original team names.

In [30]:
home_teams = label_encoder_home.inverse_transform(val_X['HomeTeam'])
away_teams = label_encoder_away.inverse_transform(val_X['AwayTeam'])

**Print the teams and predicted results**

Finally, we print the teams and their predicted results, giving us an insight into how our model predicts the matches.

In [31]:
for i in range(len(predicted_results)):
    print(f"Match: {home_teams[i]} vs {away_teams[i]} - Predicted Result: {predicted_results[i]}")

Match: Mallorca vs Girona - Predicted Result: H
Match: Celta vs Alaves - Predicted Result: H
Match: Betis vs Las Palmas - Predicted Result: H
Match: Valladolid vs Betis - Predicted Result: A
Match: Sociedad vs Espanol - Predicted Result: H
Match: Sevilla vs Elche - Predicted Result: H
Match: Ath Bilbao vs Getafe - Predicted Result: D
Match: Valladolid vs Mallorca - Predicted Result: D
Match: Almeria vs Valencia - Predicted Result: D
Match: Ath Madrid vs Girona - Predicted Result: H
Match: Valencia vs Vallecano - Predicted Result: D
Match: Betis vs Almeria - Predicted Result: H
Match: Almeria vs Barcelona - Predicted Result: H
Match: Real Madrid vs Betis - Predicted Result: A
Match: Espanol vs Barcelona - Predicted Result: A
Match: Real Madrid vs Elche - Predicted Result: H
Match: Mallorca vs Osasuna - Predicted Result: D
Match: Almeria vs Sociedad - Predicted Result: A
Match: Barcelona vs Valladolid - Predicted Result: H
Match: Barcelona vs Alaves - Predicted Result: H
Match: Real Madr