# Premier League Historical Match Result Prediction

If you do not already have a DOXA account, [sign up](https://doxaai.com/sign-up) before proceeding, then make sure you are enrolled in the challenge on the [DOXA challenge page](https://doxaai.com/competition/epl).

## Installing and Importing Useful Packages

In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Import potentially useful scikit-learn modules
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, cross_val_predict, train_test_split
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    PolynomialFeatures,
    StandardScaler,
)
from sklearn.svm import SVC, SVR, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

pd.set_option("display.max_colwidth", None)

# Interactive figures; this backend needs the ipympl package (use `%matplotlib inline` otherwise)
%matplotlib widget

## Data Loading

In [None]:
# Download the dataset if we don't already have it!
if not os.path.exists("data"):
    os.makedirs("data", exist_ok=True)

    !curl https://raw.githubusercontent.com/DoxaAI/epl-getting-started/data/train.csv --output data/train.csv
    !curl https://raw.githubusercontent.com/DoxaAI/epl-getting-started/data/test.csv --output data/test.csv

In [None]:
# Import the training dataset
train_df_original = pd.read_csv(
    "./data/train.csv", parse_dates=["date"]
)  # Change the path accordingly

# Import the testing dataset
test_df = pd.read_csv(
    "./data/test.csv", parse_dates=["date"]
)  # Change the path accordingly

In [None]:
# Make an in-memory copy of the training set to experiment with
train_df = train_df_original.copy()

## Data Understanding

### The training set

In [None]:
# Examine the first 5 entries of our dataset
train_df.head()

In [None]:
# Display information about our training dataframe
train_df.info()

In [None]:
# View some statistical information about the features we have
train_df.describe()

In [None]:
# Tally up the number of home team wins, away team wins and draws
train_df["full_time_result"].value_counts()
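
The tally above also gives a useful reference point: a model that always predicts the most common outcome scores an accuracy equal to that outcome's share of matches. Here is a minimal sketch on a small made-up label series. The values `H`, `D` and `A` are assumptions for illustration, so check what `full_time_result` actually contains.

```python
import pandas as pd

# Hypothetical labels for illustration only; the real column may use different values
results = pd.Series(["H", "H", "A", "D", "H", "A", "H", "D"], name="full_time_result")

# Share of each outcome rather than raw counts
class_shares = results.value_counts(normalize=True)

# Accuracy of a classifier that always predicts the most common outcome
majority_label = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Always predicting {majority_label!r} scores {baseline_accuracy:.2f} accuracy")
```

In the notebook, `train_df["full_time_result"].value_counts(normalize=True)` gives the same shares for the real data. Any model worth submitting should beat that baseline.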

### The test set

In [None]:
# View the first 5 rows of the test set
test_df.head()

In [None]:
# Examine the columns of the test dataframe a bit more closely!
test_df.info()

## Data Visualisation

In [None]:
# TODO: produce a correlation matrix for the features in the training set
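
One possible way to tackle the TODO above is sketched below, under a few assumptions. `plot_correlation_matrix` is a hypothetical helper name, the toy frame holds made-up values, and a seaborn heatmap is only one of several reasonable ways to display the result.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_matrix(df: pd.DataFrame):
    """Plot pairwise Pearson correlations between the numeric columns of df."""
    # Non-numeric columns (e.g. team names) are excluded before computing correlations
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Feature correlation matrix")
    return corr, ax


# Toy frame with made-up values; in the notebook you would pass train_df instead
toy_df = pd.DataFrame(
    {
        "home_team": ["A", "B", "C", "D"],
        "home_shots": [10, 14, 8, 20],
        "home_shots_on_target": [4, 6, 3, 9],
        "home_fouls": [12, 9, 15, 7],
    }
)
corr, ax = plot_correlation_matrix(toy_df)
```

Strongly correlated pairs (for example, shots and shots on target) carry overlapping information, which is worth bearing in mind when choosing or engineering features.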

## Data Preprocessing

In [None]:
# Drop unneeded columns
train_df.drop(
    columns=[
        "date",
        "full_time_home_goals",
        "full_time_away_goals",
        "half_time_home_goals",
        "half_time_away_goals",
        "half_time_result",
        "referee",
    ],
    inplace=True,
)

train_df.columns

In [None]:
# TODO: engineer some of your own features!
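
As a starting point for the TODO above, here is a hedged sketch of two simple derived features: the difference in shots between the sides and each side's shot accuracy. `add_shot_features` and the new column names are hypothetical, and the toy values are made up. Whether these features actually help is something to test rather than assume.

```python
import pandas as pd


def add_shot_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few illustrative derived features."""
    out = df.copy()
    # Difference in attacking volume between the two sides
    out["shots_diff"] = out["home_shots"] - out["away_shots"]
    # Fraction of shots that were on target; set to 0 where a team had no shots
    for side in ("home", "away"):
        shots = out[f"{side}_shots"]
        ratio = out[f"{side}_shots_on_target"] / shots
        out[f"{side}_shot_accuracy"] = ratio.where(shots > 0, 0.0)
    return out


# Toy rows with made-up values; the second row checks the zero-shots case
toy_df = pd.DataFrame(
    {
        "home_shots": [10, 0],
        "away_shots": [5, 8],
        "home_shots_on_target": [4, 0],
        "away_shots_on_target": [1, 2],
    }
)
features = add_shot_features(toy_df)
print(features)
```

If you adopt features like these, apply the same function to `test_df` before transforming it, and add the new column names to `numeric_features` so that they are scaled along with the rest.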

In [None]:
# Transform the data
numeric_features = [
    "home_shots",
    "away_shots",
    "home_shots_on_target",
    "away_shots_on_target",
    "home_fouls",
    "away_fouls",
    "home_corners",
    "away_corners",
    "home_yellow_cards",
    "away_yellow_cards",
    "home_red_cards",
    "away_red_cards",
]

transformer = make_column_transformer(
    (MinMaxScaler(), numeric_features),
    # Ignore teams that only appear in the test set instead of raising an error
    (OneHotEncoder(handle_unknown="ignore"), ["home_team", "away_team"]),
    # OPTIONAL EXERCISE: add PCA
)

X = transformer.fit_transform(train_df.drop(columns=["full_time_result"]))
y = train_df["full_time_result"]
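
For the optional PCA exercise, one possible arrangement is sketched below. It applies PCA to the scaled numeric columns only, which keeps it away from the sparse one-hot output. The toy frame, the choice of `StandardScaler` and `n_components=2` are all assumptions for illustration, not recommendations.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with made-up values standing in for the training set
toy_df = pd.DataFrame(
    {
        "home_team": ["Arsenal", "Chelsea", "Everton", "Arsenal", "Chelsea", "Everton"],
        "away_team": ["Chelsea", "Everton", "Arsenal", "Everton", "Arsenal", "Chelsea"],
        "home_shots": [12, 9, 15, 7, 11, 14],
        "away_shots": [8, 13, 6, 10, 9, 5],
        "home_corners": [5, 3, 7, 2, 6, 4],
    }
)
numeric = ["home_shots", "away_shots", "home_corners"]

pca_transformer = make_column_transformer(
    # Scale the numeric columns, then project them onto two principal components
    (make_pipeline(StandardScaler(), PCA(n_components=2)), numeric),
    # One-hot encode the team names as before
    (OneHotEncoder(handle_unknown="ignore"), ["home_team", "away_team"]),
)

X_toy = pca_transformer.fit_transform(toy_df)
print(X_toy.shape)
```

The number of components is a hyperparameter like any other, so it is worth comparing cross-validated scores with and without PCA before committing to it.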

## Model Selection, Training & Evaluation

In [None]:
# Perform a hyperparameter search
parameter_grid = {
    "C": [0.1, 1, 10],
    # you can add more parameters here!
}

classifier = GridSearchCV(LinearSVC(max_iter=2000), parameter_grid, scoring="f1_micro")
classifier.fit(X, y)

print("Best parameters:", classifier.best_params_)
print("Best micro-averaged F1 score:", classifier.best_score_)

In [None]:
# Plot a confusion matrix (computed on the training data, so it will look optimistic)
ConfusionMatrixDisplay.from_predictions(y_true=y, y_pred=classifier.predict(X))
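
Because the matrix above is computed on the same data the classifier was fitted on, it tends to flatter the model. A hedged alternative is to build it from out-of-fold predictions using `cross_val_predict`. The sketch below uses synthetic three-class data in place of `X` and `y`, and the fixed `C=1` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for X and y from the notebook
X_demo, y_demo = make_classification(
    n_samples=150,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

# Each prediction comes from a model that did not see that sample during fitting
y_pred = cross_val_predict(LinearSVC(C=1, max_iter=5000), X_demo, y_demo, cv=5)

cm = confusion_matrix(y_demo, y_pred)
print(cm)
```

In the notebook, the equivalent would be to pass `cross_val_predict(classifier.best_estimator_, X, y, cv=5)` as `y_pred` to `ConfusionMatrixDisplay.from_predictions`.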

## Preparing your DOXA Submission

In [None]:
# Drop columns we do not need
test_df.drop(columns=["date", "referee"], inplace=True)

# Transform the test set
X_test = transformer.transform(test_df)

# Use our trained classifier to make predictions
predictions = classifier.predict(X_test)

assert predictions.shape == (736,)

# Take a look at the first 20 predictions
predictions[:20]

In [None]:
# Prepare our submission package
os.makedirs("submission", exist_ok=True)

with open("submission/y.txt", "w") as f:
    f.writelines([f"{prediction}\n" for prediction in predictions])

with open("submission/doxa.yaml", "w") as f:
    f.write("competition: epl\nenvironment: cpu\nlanguage: python\nentrypoint: run.py")

with open("submission/run.py", "w") as f:
    f.write("with open('y.txt', 'r') as f: print(f.read().strip())")
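
Before uploading, it can be worth checking that the predictions file has the expected number of rows and contains only known labels. The helper below is a hypothetical sketch: `check_predictions_file` is not part of the DOXA CLI, and the labels in the demonstration are made up.

```python
import os
import tempfile


def check_predictions_file(path, expected_rows, allowed_labels):
    """Raise ValueError if the file has the wrong row count or unexpected labels."""
    with open(path, "r") as f:
        lines = f.read().splitlines()
    if len(lines) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(lines)}")
    unexpected = sorted(set(lines) - set(allowed_labels))
    if unexpected:
        raise ValueError(f"unexpected labels: {unexpected}")
    return True


# Demonstration on a throwaway file with made-up labels
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "y.txt")
    with open(demo_path, "w") as f:
        f.writelines(f"{label}\n" for label in ["H", "D", "A", "H"])
    ok = check_predictions_file(demo_path, expected_rows=4, allowed_labels={"H", "D", "A"})
```

For the real submission, a call such as `check_predictions_file("submission/y.txt", 736, set(y.unique()))` would compare the file against the row count asserted earlier and the labels seen in training.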

## Submitting to DOXA

Before you can submit to DOXA, make sure you are enrolled in the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/epl) and click "Enrol" in the top-right corner if you have not done so already.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

Finally, you can submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Wooo! 🥳 You have (probably) just uploaded your English Premier League match result predictions to DOXA. Well done! Take a moment to see how you have done on the [scoreboard](https://doxaai.com/competition/epl).