## Web scraping Spanish La-Liga fixtures and results: Cleaning and analyzing the data to predict match outcomes using RandomForestClassifier



In [None]:
# ** Dependencies **

# Importing necessary libraries for web scraping, data manipulation, and machine learning

import requests  # To handle HTTP requests to fetch web page content
from bs4 import BeautifulSoup  # To parse HTML and extract data from web pages
import pandas as pd  # For data manipulation and analysis, especially for reading HTML tables and managing dataframes
from io import StringIO  # To treat strings as file-like objects, useful for reading HTML content with pandas
import time  # To handle time-related tasks, such as pauses between requests to avoid overloading servers

# Importing necessary libraries for machine learning and model evaluation
from sklearn.ensemble import RandomForestClassifier  # To build a random forest model for classification tasks
from sklearn.metrics import accuracy_score  # To calculate the accuracy of the classification model
from sklearn.metrics import classification_report  # To generate a detailed report of classification metrics (precision, recall, f1-score, etc.)


# Importing necessary libraries for data visualization
import matplotlib.pyplot as plt  # To create static, animated, and interactive visualizations in Python
import seaborn as sns  # To create attractive and informative statistical graphics



## **Stage 1: Scraping the Data**

In [None]:
# URL of the La-Liga statistics page to scrape
# url = "https://fbref.com/en/comps/12/La-Liga-Stats"

url = "https://fbref.com/en/comps/12/2023-2024/2023-2024-La-Liga-Stats"

# Send a GET request to the main page to retrieve the HTML content
data_1 = requests.get(url)

data_1  # This will show the status code, e.g., <Response [200]> if successful

In [None]:
# Create an instance of BeautifulSoup with the HTML content from the GET request
soup = BeautifulSoup(data_1.text, "html.parser")

# Display the parsed HTML content (optional, for debugging)
# print(soup.prettify()[:1000])  # Print the first 1000 characters of the parsed HTML

In [None]:
# Select the second 'table.stats_table' element from the HTML
league_table = soup.select('table.stats_table')[1]

In [None]:
# Find all 'a' (anchor) tags within the league table
links = league_table.find_all('a')

# Extract the href attributes from the links 
links = [l.get("href") for l in links]

# Filter to include only those containing '/squads/' (anchors without an href are skipped)
links = [l for l in links if l and '/squads/' in l]

# Print the first few links directly
for link in links[:3]:  # Adjust the number of links you want to print
    print(link)


In [None]:
# Create full URLs for each team by prepending the base URL to each link
team_urls = [f"https://fbref.com{l}" for l in links]
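
The URL-building step above can also be written with `urljoin`, and the same URLs can be turned back into readable team names. This is a minimal sketch: the two hrefs are hypothetical placeholders in the shape the squad links are assumed to have, and `team_name_from_url` is an illustrative helper, not part of the scraped site or any library.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"

# Hypothetical hrefs, shaped like the '/squads/' links filtered earlier
sample_links = [
    "/en/squads/00000000/2023-2024/Example-Club-Stats",
    "/en/squads/11111111/2023-2024/Another-Team-Stats",
]

# urljoin takes care of the slash between the base URL and the relative path
sample_team_urls = [urljoin(BASE_URL, link) for link in sample_links]


def team_name_from_url(url):
    """Take the last path segment, drop the '-Stats' suffix, and turn hyphens into spaces."""
    return url.split("/")[-1].replace("-Stats", "").replace("-", " ")


sample_team_names = [team_name_from_url(u) for u in sample_team_urls]

print(sample_team_urls[0])
print(sample_team_names)
```

The name extraction relies on the team slug being the last path segment, so it is worth printing a few names before trusting them as labels.
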

In [None]:
# List of full URLs for each team, created by prepending the base URL to each filtered link
team_urls

In [None]:
# Select one team URL for example purposes
team_url = team_urls[1] 

In [None]:
# Print the selected team URL
print(team_url)

In [None]:
# Send a GET request to the selected team's page
# (described here as Barcelona; confirm against the URL printed above)
data_2 = requests.get(team_url)

# Display the response object for the team page (this shows the status of the request, optional)
data_2

In [None]:
# Parse the HTML content with BeautifulSoup
# Useful for investigating the structure of the HTML or extracting additional data if needed
soup_2 = BeautifulSoup(data_2.text, "html.parser")

# Find all tables in the parsed HTML
tables = soup_2.find_all('table')

# Print the captions of the tables to identify the correct one ("Scores & Fixtures")
for idx, table in enumerate(tables):
    caption = table.find('caption')
    if caption:
        print(f"Table {idx}: {caption.text}")  # Print the index and caption text to identify the tables
    else:
        print(f"Table {idx} has no caption")

In [None]:
# use pandas to read in the tables 

# Wrap the HTML content of the team page (data_2.text) with the StringIO function 
# This allows the HTML data to be read by pandas as if it were a file
html_data = StringIO(data_2.text)

# Use pandas to read the HTML data and find the table that matches "Scores & Fixtures"
# This extracts the relevant table from the HTML content
matches = pd.read_html(html_data, match="Scores & Fixtures")


In [None]:
# Display the first few rows of the extracted table for "Scores & Fixtures" for the team
# [0] is used here to select the first table from a list of tables (even if there is only 1)
matches[0].head()

In [None]:
# This script scrapes the fixtures and results from multiple team URLs,
# processes the data, and combines it into a single CSV file. The script includes
# error handling and retries with exponential backoff for failed requests.

all_matches = []
skipped_urls = []

# Function to send a GET request with retries and exponential backoff
def get(url, max_retries=5, max_wait_time=60):
    wait_time = 1  # Initial wait time in seconds
    for i in range(max_retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response
        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", wait_time))
            print(f"Rate limit exceeded. Retrying after {retry_after} seconds.")
            if retry_after > max_wait_time:
                print(f"Retry after {retry_after} seconds is too long. Skipping {url}.")
                return None
            time.sleep(retry_after)
            wait_time = min(max_wait_time, wait_time * 2)  # Exponential backoff, capped at max_wait_time
        else:
            print(f"Failed to get page, status code: {response.status_code}")
            return None
    return None

# Loop through each team URL to scrape data
for team_url in team_urls:
    # Extract team name from the URL
    team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")
    response = get(team_url)

    if not response:
        print(f"Skipping {team_name} due to repeated get failures.")
        skipped_urls.append(team_url)
        continue

    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Print the title to confirm it's the right page
    print(soup.title.string)

    # Find all tables with the class 'stats_table'
    tables = soup.find_all('table', {'class': 'stats_table'})
    print(f"Found {len(tables)} tables on {team_name} page")


    # Look for the table whose caption contains 'Scores & Fixtures'.
    # When it is found, convert the table to a string, wrap it in StringIO, and read it with pandas.read_html.
    # If no such table is found, 'matches' stays None and this team is skipped below.
    matches = None
    for idx, table in enumerate(tables):
        caption = table.find('caption')
        if caption and "Scores & Fixtures" in caption.text:
            print(f"Found 'Scores & Fixtures' table at index {idx}")
            matches = pd.read_html(StringIO(str(table)))[0]
            break

    if matches is None:
        print(f"Did not find 'Scores & Fixtures' table for {team_name}")
        continue

    # Assuming 'matches' contains the required table, display the first few rows
    print(matches.head())

    try:
        # Filter the matches for La Liga competition and add season and team name columns
        # (.copy() avoids pandas' SettingWithCopyWarning when adding columns to a filtered frame)
        matches = matches[matches["Comp"] == "La Liga"].copy()
        matches["Season"] = '2023-2024'
        matches["Team"] = team_name
        all_matches.append(matches)
    except Exception as e:
        print(f"Error processing matches for {team_name}: {e}")

    # Add a delay between requests to avoid rate limiting
    print("Waiting for 5 seconds before the next request...")
    time.sleep(5)  # 5-second delay between loop iterations, added after running into issues scraping the data

# Combine all dataframes into one
if all_matches:
    all_matches_la_liga_2324 = pd.concat(all_matches, ignore_index=True)
    # Save the dataframe to a CSV file
    all_matches_la_liga_2324.to_csv('all_matches_la_liga_2324.csv', index=False)
    print("Data saved to all_matches_la_liga_2324.csv")
else:
    print("No data collected")

# Log the skipped URLs
if skipped_urls:
    print("The following URLs were skipped due to rate limits or other issues:")
    for url in skipped_urls:
        print(url)
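
The retry logic above doubles its fallback wait after each rate-limited attempt. As a rough illustration, the sequence of fallback waits can be computed without making any requests. `backoff_schedule` is a hypothetical helper written for this note; it only mirrors the doubling-with-a-cap idea and ignores the `Retry-After` header that the real function prefers when the server sends one.

```python
def backoff_schedule(max_retries=5, initial_wait=1, max_wait_time=60):
    """Return the fallback wait (in seconds) before each retry, doubling up to a cap."""
    waits = []
    wait = initial_wait
    for _ in range(max_retries):
        waits.append(wait)
        wait = min(max_wait_time, wait * 2)  # double the wait, but never exceed the cap
    return waits


print(backoff_schedule())              # [1, 2, 4, 8, 16]
print(backoff_schedule(max_retries=8)) # [1, 2, 4, 8, 16, 32, 60, 60]
```

With the defaults, five retries wait at most 31 seconds in total, which is worth keeping in mind when choosing `max_retries` for a loop over 20 team pages.
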


In [None]:
# The data has been collected and saved previously, so there is no need to scrape it again for this project.
# We will load the data from the CSV file into a new DataFrame and work with that.
la_liga_2324 = pd.read_csv('all_matches_la_liga_2324.csv')


## **Stage 2: Data Integrity and Preparing for ML**

In [None]:
# Let's have a look at the first few rows of our DataFrame to understand its structure and content.
la_liga_2324.head()


In [None]:
# Let's have a look at rearranging the columns, and maybe drop some columns if they are not needed
# Get the list of current columns in the DataFrame
current_columns = la_liga_2324.columns.tolist()

# Display the list of current columns
current_columns


In [None]:
# Take out 'Match Report' and 'Notes' columns and define the new order of columns
new_order = [
    'Comp',
    'Season',
    'Round',
    'Date',
    'Time',
    'Day',
    'Team',
    'Opponent',
    'Venue',
    'Result',
    'GF',
    'GA',
    'Poss',
    'Attendance',
    'Captain',
    'Formation',
    'Referee'
]


In [None]:
# Apply the new column order to the DataFrame
la_liga_2324 = la_liga_2324[new_order]

# Display the DataFrame with the new column order
la_liga_2324.head()


In [None]:
# I like to work with most data in lowercase, so let's convert the column titles to lowercase
la_liga_2324.columns = la_liga_2324.columns.str.lower()

# Display the first few rows of the DataFrame to verify the column title changes
la_liga_2324.head()


In [None]:
# Check that each team has played 38 games

# Count the number of instances (games) for each team in the 'team' column
team_counts = la_liga_2324['team'].value_counts()

# Display the results to verify that each team has played 38 games
print("Number of instances of each team:")
print(team_counts)


In [None]:
# Sanity check - check the right number of rows
# There are 20 teams, and each team plays the other 19 teams twice (home and away),
# so each team plays 2 * 19 = 38 games.
# The DataFrame holds one row per team per match, so we expect 20 * 38 = 760 rows
# (380 unique matches, each appearing once from each team's perspective).

expected_rows = 20 * 38

# Print the expected number of rows and the actual number of rows in the DataFrame
print(f"Expected rows (20 teams x 38 games) = {expected_rows} versus the rows in the DataFrame = {la_liga_2324.shape[0]}")
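
The arithmetic behind this check follows from the double round-robin format, where every team plays every other team home and away. A small sketch of the counts (`double_round_robin` is an illustrative helper, not part of the notebook):

```python
def double_round_robin(n_teams):
    """Counts for a league where every team plays every other team home and away."""
    games_per_team = 2 * (n_teams - 1)           # 19 opponents, twice each
    unique_matches = n_teams * (n_teams - 1)     # each ordered (home, away) pair is one match
    team_rows = n_teams * games_per_team         # one row per team per match
    return games_per_team, unique_matches, team_rows


games_per_team, unique_matches, team_rows = double_round_robin(20)
print(games_per_team, unique_matches, team_rows)  # 38 380 760
```

Because each match appears twice in the scraped data (once per team), the row count is double the number of unique matches.
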


In [None]:
# Sanity checking to ensure the matches played per match week (round) are all the same.

# Count the number of entries for each round
round_counts = la_liga_2324["round"].value_counts()

# Sort the round counts by extracting the numerical part of the matchweek
round_counts = round_counts.sort_index(key=lambda x: x.str.extract(r'(\d+)').astype(int)[0])


# Initialize a flag to check if all rounds have 20 entries
all_rounds_have_20 = True

# Iterate through the round counts to verify each round has 20 matches
for round_number, count in round_counts.items():
    if count != 20:
        print(f"Issue found: Round {round_number} has {count} entries instead of 20.")
        all_rounds_have_20 = False

# Check if all rounds have 20 entries and print the result
if all_rounds_have_20:
    print("All rounds have 20 entries.")
else:
    print("Not all rounds have 20 entries.")

# Display the round counts for reference
print("\nRound counts:")
print(round_counts)


In [None]:
# Check the data types of the DataFrame columns to understand the structure of the data
print(la_liga_2324.dtypes)  # Display the data types of each column in the DataFrame

In [None]:
# Convert the necessary columns to numerical formats for the ML algorithm.
# The cells below parse the dates and encode the categorical variables (venue, opponent) as integer codes.

In [None]:
# Convert the 'date' column to datetime format to ensure it is in the correct format for analysis
la_liga_2324["date"] = pd.to_datetime(la_liga_2324["date"])

In [None]:
# Quick confirmation of the data types of each column in the DataFrame
la_liga_2324.dtypes  

## Converting 'venue' Column to Numerical Codes for Machine Learning

Convert the 'venue' column to categorical type and then to numerical codes for machine learning. The venue code is important for football match predictions for several reasons:

## Importance of Venue Code:

1. **Home Advantage**: Teams playing at their home venue often have a significant advantage.
2. **Travel Fatigue**: Away teams usually experience travel fatigue, affecting their performance.
3. **Crowd Support**: Local fans can boost the home team's morale and put pressure on the visiting team.
4. **Pitch Conditions**: Teams are familiar with their home pitch conditions, providing a tactical advantage.
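
To see what the category codes look like before applying them to the full DataFrame, here is a small sketch on made-up venue values. pandas orders string categories alphabetically, so 'Away' is assumed to become 0 and 'Home' 1, as long as those are the only two values in the column.

```python
import pandas as pd

# Toy venue values, not taken from the scraped data
venue = pd.Series(["Home", "Away", "Away", "Home"])

venue_cat = venue.astype("category")
codes = venue_cat.cat.codes

# Recover which label each integer code stands for
code_to_label = dict(enumerate(venue_cat.cat.categories))

print(codes.tolist())   # [1, 0, 0, 1]
print(code_to_label)    # {0: 'Away', 1: 'Home'}
```

Keeping the `code_to_label` mapping around makes it easier to interpret the model's inputs later.
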


In [None]:
la_liga_2324["venue_code"] = la_liga_2324["venue"].astype("category").cat.codes

# Display the DataFrame to verify that the 'venue_code' column has been added correctly
la_liga_2324.head()


## Converting 'opponent' Column to Numerical Codes for Machine Learning

Convert the 'opponent' column to categorical type and then to numerical codes for machine learning. Using 'opp_code' lets the model take the opponent's identity into account, so it can pick up patterns such as some opponents being consistently harder to beat. Note that the codes are arbitrary labels assigned in alphabetical order; their numeric order carries no meaning about team strength.

In [None]:
la_liga_2324["opp_code"] = la_liga_2324["opponent"].astype("category").cat.codes

# Display the DataFrame to verify that the 'opp_code' column has been added correctly
la_liga_2324.head()


### Steps to Enhance Predictions with Match Hour Data

1. **Extract Hour from 'Time' Column:**
   - Extract the hour from the 'time' column by removing everything after the colon.
   - Convert the extracted hour to an integer type for machine learning.

2. **Why Hour of the Match May Be a Useful Predictor:**
   - **Player Performance:** Players may perform differently at various times of the day due to factors like energy levels and routines.
   - **Weather Conditions:** Weather can vary throughout the day, potentially impacting match conditions and player performance.
   - **Audience and Atmosphere:** Matches held at prime times may have larger audiences and more intense atmospheres, influencing team performance.
   - **Travel and Preparation:** The time of day can affect teams' travel schedules and preparation routines, impacting performance.
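
The extraction step can be tried on a few made-up kick-off times first. This sketch takes a slightly different route from the cell below: instead of filling missing times with midnight, it keeps them as missing values using pandas' nullable `Int64` type. That is an alternative shown for comparison, not the notebook's method.

```python
import pandas as pd

# Toy kick-off times, including one missing value
times = pd.Series(["21:30", "19:00", None, "14:00"])

# Capture the digits before the colon, then convert to a nullable integer column
hours = pd.to_numeric(times.str.extract(r"^(\d{1,2}):", expand=False)).astype("Int64")

print(hours)
```

Whether a missing time should become 0 or stay missing depends on the model: scikit-learn's random forest has historically required complete numeric inputs, which is one reason to fill the gap as the cell below does.
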


In [None]:
# Fill NaN values with a default hour, e.g., '00:00' (midnight)
la_liga_2324["time"] = la_liga_2324["time"].fillna("00:00")  # Assign back instead of inplace=True on a column


la_liga_2324["hour"] = la_liga_2324["time"].str.replace(":.+", "", regex=True).astype("int")

# Display the DataFrame to verify that the 'hour' column has been added correctly
la_liga_2324.head()


### Steps to Enhance Predictions with Day of the Week Data

1. **Convert Date to Day of the Week:**
   - Convert the 'date' column to the day of the week and store it in a new 'day_code' column.
   - Convert the 'day of the week' column to a suitable format for machine learning.

2. **Why Day of the Week May Be a Useful Predictor:**
   - **Player Recovery:** Players may have different levels of rest and recovery depending on the match schedule and training routines.
   - **Team Strategy:** Teams might adopt different strategies based on the day, such as rotation policies for midweek vs. weekend games.
   - **Audience and Atmosphere:** Weekend games may attract larger crowds and more vibrant atmospheres, influencing player performance.
   - **Match Importance:** Certain days may be associated with more significant matches (e.g., Sunday league fixtures, midweek cup games).
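
A quick check of the day-of-week encoding on three sample dates (chosen for illustration, not read from the dataset) shows the Monday = 0 to Sunday = 6 convention:

```python
import pandas as pd

# Sample dates for illustration
dates = pd.to_datetime(pd.Series(["2023-08-11", "2023-08-13", "2023-08-14"]))

day_codes = dates.dt.dayofweek   # Monday is 0, Sunday is 6
day_names = dates.dt.day_name()  # Human-readable names for cross-checking

print(day_codes.tolist())  # [4, 6, 0]
print(day_names.tolist())  # ['Friday', 'Sunday', 'Monday']
```
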


In [None]:
# The day of the week is represented as an integer where Monday is 0 and Sunday is 6
la_liga_2324["day_code"] = la_liga_2324["date"].dt.dayofweek

# Display the DataFrame to verify that the 'day_code' column has been added correctly
la_liga_2324.head()


In [None]:
# Map the 'result' column to numerical values for machine learning
# 'D' (Draw) is mapped to 0, 'W' (Win) is mapped to 1, and 'L' (Loss) is mapped to 2
la_liga_2324['target'] = la_liga_2324['result'].apply(lambda x: 0 if x == 'D' else (1 if x == 'W' else 2))

# Display the DataFrame to verify that the 'target' column has been added correctly
la_liga_2324.head()
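
One detail of the mapping above is that any value other than 'D' or 'W', including a missing result, ends up as 2 (loss). A dictionary-based `map` is an alternative that leaves unexpected values as NaN so they can be spotted. The sketch below compares the two on toy data; `result_codes` is a name introduced here for illustration.

```python
import pandas as pd

# Toy results, including one missing value
results = pd.Series(["W", "D", "L", None])

# Same encoding as the notebook: Draw = 0, Win = 1, Loss = 2
result_codes = {"D": 0, "W": 1, "L": 2}

# map() leaves anything outside the dictionary as NaN
target_map = results.map(result_codes)

# The lambda version silently treats the missing value as a loss
target_lambda = results.apply(lambda x: 0 if x == "D" else (1 if x == "W" else 2))

print(target_map.tolist())
print(target_lambda.tolist())  # [1, 0, 2, 2]
```

For a completed season with no missing results, both approaches give the same target column.
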


In [None]:
# Quick checks now to save any issues later
# Check the data types of the DataFrame columns

print(la_liga_2324.dtypes)  # Display the data types to ensure all columns are correctly formatted


In [None]:
# Now we are ready for some machine learning, but let's save the work we've done so far.
# This way, we can simply load the saved CSV into a DataFrame from here onward.

# Save the current DataFrame to a CSV file
la_liga_2324.to_csv('la_liga_2324.csv', index=False)


# DataFrame after saving to confirm everything is in lowercase and predictors are added
la_liga_2324.head()


## Loading the Prepared DataFrame from a Saved CSV File

Instead of rerunning all the previous steps, you can start from here and load the saved CSV file. This skips the data preparation if it has already been done, saving time and ensuring consistency.


In [None]:
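# parse_dates restores the 'date' column as datetime, since CSV files store dates as plain text
la_liga_2324 = pd.read_csv('la_liga_2324.csv', parse_dates=['date'])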

## **Stage 3: Machine Learning with RandomForestClassifier**

## RandomForestClassifier Overview

RandomForestClassifier is an ensemble learning method that constructs multiple decision trees during training.

## Decision Trees

Decision trees are models that split the data into subsets based on feature values to make predictions.


In [None]:
from sklearn.ensemble import RandomForestClassifier

## Initializing a RandomForestClassifier with Specified Parameters

- **n_estimators**: Controls the number of decision trees in the forest (here set to 50).
- **min_samples_split**: Specifies the minimum number of samples required to split an internal node (set to 10).
- **random_state**: Sets the seed for random number generation to ensure reproducibility.


In [None]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

## Decide on the Predictors for the Model - Starting Very Basic

## Initial Considerations

- **Initial Predictors**: Included 'gf' (goals for) and 'ga' (goals against).
- **Issue**: These are results and should not be used as predictors.
- **Action**: Remove 'gf' and 'ga' from the predictors list.

## Defining the Predictors for the Model

- The predictors should be variables that influence the outcomes but are not direct results.
- Carefully select features that are relevant for making accurate predictions.


In [None]:
predictors_v1 = ["venue_code", "opp_code", "hour", "day_code"]

## Splitting Data into Training and Testing Sets

Since this is time series data, the test data must come after the training data: a model should only be evaluated on matches played after the ones it learned from, otherwise information from the future leaks into the predictions. We will split the data 75/25 for training/testing.

## Key Points:

- **Time Series Consideration**: Ensure that the test data comes after the train data.
- **Historical Data**: Future data cannot be used to predict past events.
- **Split Ratio**: The data will be split into 75% for training and 25% for testing.
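
The key points above can be sketched on a toy DataFrame: sort by date, cut at 75%, and confirm that every training row is no later than every test row. The `toy_` names are placeholders so that nothing here overwrites the notebook's own variables.

```python
import pandas as pd

# Four toy matches, deliberately out of chronological order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2023-09-01", "2024-01-15", "2023-11-20"]),
    "target": [1, 0, 2, 1],
})

# Sort chronologically before splitting
toy = toy.sort_values("date").reset_index(drop=True)

# First 75% of rows for training, remaining 25% for testing
toy_split_index = int(len(toy) * 0.75)
toy_train = toy.iloc[:toy_split_index]
toy_test = toy.iloc[toy_split_index:]

# The latest training date should not be later than the earliest test date
print(toy_train["date"].max(), toy_test["date"].min())
```
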


In [None]:
# Sort the DataFrame by 'date' so that the train/test split below is in chronological order.
# Resetting the index ensures the DataFrame index reflects the new chronological order.

la_liga_2324 = la_liga_2324.sort_values(by='date').reset_index(drop=True)


In [None]:
# Get the start and end date of the dataset
season_start_date = la_liga_2324['date'].min()
season_end_date = la_liga_2324['date'].max()

# Display the start and end date of the dataset
print(f"Start Date: {season_start_date}")
print(f"End Date: {season_end_date}")

# Filter the DataFrame to the chosen date range (here the full span of the dataset, so every row is kept;
# narrow the dates above to restrict the window)
season_data = la_liga_2324[(la_liga_2324['date'] >= season_start_date) & (la_liga_2324['date'] <= season_end_date)]

# Calculate the index for the 75% training and 25% testing split
split_index = int(len(season_data) * 0.75)

# Split the DataFrame into training and testing sets
train = season_data.iloc[:split_index]
test = season_data.iloc[split_index:]

# Define the predictors and target variables for training and testing
X_train = train[predictors_v1]
y_train = train["target"]
X_test = test[predictors_v1]
y_test = test["target"]

# Fit the RandomForestClassifier model with the training data
rf.fit(X_train, y_train)

# Make predictions using the test data
preds_v1 = rf.predict(X_test)

# accuracy_score is a function provided by the sklearn.metrics module in scikit-learn, a popular machine learning library 
# in Python. It is used to evaluate the accuracy of classification models.
# from sklearn.metrics import accuracy_score


# Calculate the accuracy score by comparing the predicted values with the actual target values
acc_v1 = accuracy_score(test["target"], preds_v1)

# Print the predictions, accuracy score, and length of predictions
print(preds_v1)
print(f"Accuracy: {acc_v1:.4f}")
print(f"Length of predictions: {len(preds_v1)}")
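
An accuracy figure is easier to judge next to a simple baseline. One common choice, sketched here on toy labels, is to always predict the most frequent class seen in training. `majority_baseline_accuracy` is a hypothetical helper and the labels are made up; with the notebook's data you would pass `y_train` and `y_test` instead.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training class."""
    most_common_class, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == most_common_class)
    return most_common_class, hits / len(y_test)


# Toy labels using the same encoding (0 = draw, 1 = win, 2 = loss)
toy_y_train = [1, 1, 2, 0, 1, 2]
toy_y_test = [1, 2, 1, 0]

baseline_class, baseline_acc = majority_baseline_accuracy(toy_y_train, toy_y_test)
print(baseline_class, baseline_acc)  # 1 0.5
```

A model whose accuracy is not clearly above this baseline has learned little beyond the class frequencies.
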


In [None]:
# Create a DataFrame to compare actual vs. predicted values
combined_v1 = pd.DataFrame(dict(actual=test["target"], prediction=preds_v1))

# Create a cross-tabulation (crosstab) of actual vs. predicted values
# Cross-tabulation helps us understand the frequency distribution of predictions
# It shows how many predictions fall into each combination of actual and predicted values
cross_tabulation = pd.crosstab(index=combined_v1['actual'], columns=combined_v1['prediction'])

# Display the cross-tabulation
cross_tabulation



## Classification Report

The classification report provides a detailed breakdown of how well the machine learning model performs across different classes (categories), including precision (accuracy of positive predictions), recall (sensitivity), and F1-score (harmonic mean of precision and recall). These metrics are crucial for understanding the model's predictive accuracy and its ability to correctly identify each class, offering valuable insights into its overall performance.

## Key Metrics:

- **Precision**: Measures the accuracy of positive predictions, focusing on how many selected items are relevant.

- **Recall (Sensitivity)**: Measures the ability of the model to find all positive instances, focusing on how many relevant items are selected.

- **F1-score**: Combines precision and recall into a single metric, providing a balanced measure useful for comparing models with varying precision and recall performances.
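
To make the three metrics concrete, they can be computed by hand for a single class from counts of true positives, false positives, and false negatives. This is a plain-Python sketch on made-up labels; `per_class_metrics` is an illustrative helper, not a scikit-learn function.

```python
def per_class_metrics(actual, predicted, label):
    """Precision, recall, and F1 for one class, computed from raw counts."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)  # correctly predicted as label
    fp = sum(1 for a, p in pairs if a != label and p == label)  # predicted label, but wrong
    fn = sum(1 for a, p in pairs if a == label and p != label)  # label missed by the model
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy labels using the same encoding (0 = draw, 1 = win, 2 = loss)
toy_actual = [1, 1, 0, 2, 1, 0]
toy_predicted = [1, 0, 0, 1, 1, 2]

win_precision, win_recall, win_f1 = per_class_metrics(toy_actual, toy_predicted, label=1)
print(win_precision, win_recall, win_f1)
```

For class 1 in this toy example there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3.
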


In [None]:
from sklearn.metrics import classification_report

# Extract actual and predicted values from the combined DataFrame
actual = combined_v1['actual']
prediction = combined_v1['prediction']

# Generate the classification report
# Classification report provides metrics like precision, recall, and F1-score for each class.
report_v1 = classification_report(actual, prediction, target_names=['Class 0', 'Class 1', 'Class 2'])

# Print the classification report
print("\nClassification Report:")
print(report_v1)

# Length of preds is the number of predictions made
length_preds = len(preds_v1)
print("Length of preds:", length_preds)



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Build the report as a dictionary and convert it to a DataFrame
# (more robust than parsing the printed report string, whose layout is meant for reading)
report_dict = classification_report(actual, prediction, target_names=['Class 0', 'Class 1', 'Class 2'], output_dict=True)
report_df = pd.DataFrame(report_dict).transpose()

# Extracting precision, recall, and F1-score for each class
# (this leaves out the 'accuracy', 'macro avg', and 'weighted avg' rows)
metrics_df = report_df.loc[['Class 0', 'Class 1', 'Class 2'], ['precision', 'recall', 'f1-score']]

# Reset index for better plotting with seaborn
metrics_df = metrics_df.reset_index().rename(columns={'index': 'Class'})

# Melt the DataFrame to long format for easier plotting with seaborn
metrics_melted = metrics_df.melt(id_vars='Class', var_name='Metric', value_name='Score')

# Plotting the metrics using seaborn with a gold, orange, and lime-green palette
plt.figure(figsize=(12, 8))
sns.barplot(data=metrics_melted, x='Class', y='Score', hue='Metric', palette=['gold', 'orange', 'limegreen'])

# Classes 0, 1, 2 correspond to Draw, Win, Loss from the perspective of the row's team
plt.gca().set_xticks(range(3))
plt.gca().set_xticklabels(['Draw', 'Win', 'Loss'])



# Enhancing the plot aesthetics
plt.title('Classification Report Metrics', fontsize=16)
plt.xlabel('Class', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.ylim(0, 1)  # Setting y-axis limit for better comparison
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')  # Adding legend outside the plot
plt.xticks(rotation=0)  # Ensuring class labels are readable
plt.grid(True, linestyle='--', alpha=0.7)  # Adding grid lines for clarity

# Displaying the plot
plt.tight_layout()  # Ensuring all elements fit within the figure area
plt.show()



In [None]:
# Create a DataFrame to display test data, true targets, predictions, and result of predictions
la_liga_2324_results = test.copy()

# Add a column for predictions made by the Random Forest Classifier
la_liga_2324_results['pred_rfc'] = preds_v1

# Create a column to indicate the outcome of each prediction
la_liga_2324_results['result_rfc'] = la_liga_2324_results.apply(lambda row: 'success' if row['target'] == row['pred_rfc'] else 'failure', axis=1)

# Select and rearrange columns as desired
la_liga_2324_results = la_liga_2324_results[['date', 'team', 'opponent', 'target', 'pred_rfc', 'result_rfc']]

# Display the first few rows of the results DataFrame
la_liga_2324_results.head()


In [None]:
# Save the results DataFrame to a CSV file if you wish to inspect further
la_liga_2324_results.to_csv('la_liga_2324_results.csv', index=False)

## **Stage 4: Adding Additional Data/Features for ML to Improve Predictions**

## Enhancing Predictive Power by Incorporating New Features Specific to La Liga Matches


In [None]:
# Create a copy of la_liga_2324 under a new version name (la_liga_2324_v1) before adding new features.
# This ensures that any modifications do not affect the original DataFrame.
la_liga_2324_v1 = la_liga_2324.copy()
la_liga_2324_v1.head()

## Calculate Rolling Averages for Goals For, Against, and Possession

Rolling averages help capture trends over time, smoothing out short-term fluctuations. This is useful in machine learning models as it highlights the form of a team over recent matches, providing a better indication of future performance compared to individual match results.

## Key Points:

- **Goals For**: Calculate the rolling average of goals scored by a team over a specified number of matches.
- **Goals Against**: Calculate the rolling average of goals conceded by a team over a specified number of matches.
- **Possession**: Calculate the rolling average of possession percentage (the share of the match in which a team controls the ball) over a specified number of matches.
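
A rolling average used as a predictor should only look at matches played before the one being predicted. The sketch below, on made-up goal counts, shows that `closed='left'` excludes the current row from the window, and that it gives the same values as shifting the series by one match first.

```python
import pandas as pd

# Toy goals-for values for five consecutive matches of one team
gf = pd.Series([1, 2, 3, 4, 5], name="gf")

# closed='left' averages the three matches BEFORE each row, excluding the row itself
gf_rolling = gf.rolling(window=3, closed="left").mean()

# Equivalent formulation: shift by one match, then take an ordinary 3-match rolling mean
gf_shifted = gf.shift(1).rolling(window=3).mean()

print(gf_rolling.tolist())  # [nan, nan, nan, 2.0, 3.0]
```

The first three rows are NaN because fewer than three earlier matches exist, which is why three rows per team are dropped once the rolling features are added.
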


In [None]:

# Before calculating rolling averages, group the matches by team to perform calculations per team.
grouped_matches = la_liga_2324_v1.groupby("team")


In [None]:
# Retrieve and display the matches data for the team "Barcelona" from the grouped DataFrame.
group = grouped_matches.get_group("Barcelona")
group.head()

In [None]:
def rolling_averages(group, cols, new_cols):
    """
    Compute rolling averages for specified columns within each group (by team).

    Parameters:
    - group: DataFrame group for a specific team.
    - cols: List of columns to compute rolling averages for.
    - new_cols: List of new column names to assign to the rolling average results.

    Returns:
    - DataFrame with calculated rolling averages added as new columns.
    """
    group = group.sort_values("date")  # Sort the group data by 'date' for chronological order
    rolling_stats = group[cols].rolling(window=3, closed='left').mean()  # Calculate rolling averages for selected columns
    group[new_cols] = rolling_stats  # Assign calculated rolling averages to new columns in the DataFrame
    group = group.dropna(subset=new_cols)  # Drop rows with NaN values in the newly created columns
    return group


In [None]:
# Select columns for which rolling averages will be calculated
cols = ["gf", "ga", "poss"]

# Create new column names with '_rolling' suffix using list comprehension and f-string
new_cols = [f"{c}_rolling" for c in cols]

# Display the new column names
new_cols


In [None]:
# Call the rolling_averages function to calculate rolling averages for selected columns
# Assign the results to new columns with '_rolling' suffix
group1 = rolling_averages(group, cols, new_cols)

# Display the first few rows of the updated DataFrame for selected team
group1.head()


In [None]:
# Apply the rolling_averages function to each team's data in la_liga_2324_v1 DataFrame
# This function calculates rolling averages for specified columns (cols) and assigns the results to new columns (new_cols)
la_liga_2324_v1 = la_liga_2324_v1.groupby("team").apply(lambda x: rolling_averages(x, cols, new_cols))


In [None]:
# Drop the extra index level ('team') if multiple levels exist
la_liga_2324_v1 = la_liga_2324_v1.droplevel('team')


In [None]:
# Resetting the index to ensure unique values
la_liga_2324_v1.index = range(la_liga_2324_v1.shape[0])

# Note: The DataFrame now contains only 700 rows because the rolling average function 
# dropped the NA values from the first 3 games of each team. This is because we needed 
# at least 3 matches to calculate a 3-period rolling average.


In [None]:
la_liga_2324_v1.shape

In [None]:
la_liga_2324_v1.head()

In [None]:
# Now, let's use Random Forest to see if adding the additional features
# (rolling averages) has improved the model's performance.

# First, save the DataFrame with all the rolling averages to a CSV file.
la_liga_2324_v1.to_csv('la_liga_2324_v1.csv', index=False)


In [None]:
# Load the DataFrame containing rolling averages from the CSV file back into la_liga_2324_v1
# (parse_dates restores the 'date' column as datetime)
la_liga_2324_v1 = pd.read_csv('la_liga_2324_v1.csv', parse_dates=['date'])
la_liga_2324_v1.head()

## Importing the RandomForestClassifier from the sklearn.ensemble module

Random Forest is a versatile and widely used machine learning algorithm known for its robustness and high performance in various predictive tasks. It operates by constructing multiple decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

## Key Features:

- **Ensemble Method**: Aggregates predictions from multiple decision trees to improve accuracy and reduce overfitting.
- **Versatility**: Suitable for both classification and regression tasks.
- **Robustness**: Less prone to overfitting compared to individual decision trees.
- **Feature Importance**: Provides insights into which features are most influential for making predictions.

Random Forest is effective for complex datasets and can handle large amounts of data with high dimensionality, making it a popular choice in various domains.
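
The feature-importance point can be illustrated on synthetic data. In this sketch the label depends only on a column called `signal`, while `noise_a` and `noise_b` are random; all three names and the data are invented for the example. The classifier settings mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 'signal' is related to the label
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "signal": rng.random(300),
    "noise_a": rng.random(300),
    "noise_b": rng.random(300),
})
y_demo = (X_demo["signal"] > 0.5).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based score per column; the scores sum to 1
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Applied to the match data, the same attribute on the fitted `rf` would indicate which predictors the forest leans on most, although impurity-based importances are only a rough guide.
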


In [None]:
from sklearn.ensemble import RandomForestClassifier

## Initializing a RandomForestClassifier with Specified Parameters

- **n_estimators**: Controls the number of decision trees in the forest (here set to 50).
- **min_samples_split**: Specifies the minimum number of samples required to split an internal node (set to 10).
- **random_state**: Sets the seed for random number generation to ensure reproducibility.


In [None]:
rf = RandomForestClassifier(n_estimators=50, min_samples_split=10, random_state=1)

In [None]:
# Selecting predictors for the RandomForestClassifier model, including new rolling average features:
predictors_v2 = ["venue_code", "opp_code", "hour", "day_code", 'gf_rolling', 'ga_rolling', 'poss_rolling']


In [None]:
# Get the start and end date of the DataFrame
# Note that the dates are dynamic and will change as the DataFrame is filtered

season_start_date = la_liga_2324_v1['date'].min()
season_end_date = la_liga_2324_v1['date'].max()

# Display the start and end date of the season
print(f"Start Date: {season_start_date}")
print(f"End Date: {season_end_date}")

# Filter the DataFrame to the date range above. With the min/max bounds this keeps
# every row; narrow the two dates to restrict the range.
season_data = la_liga_2324_v1[(la_liga_2324_v1['date'] >= season_start_date) & (la_liga_2324_v1['date'] <= season_end_date)]

# Calculate the index for the 75% split for training and testing data
split_index = int(len(season_data) * 0.75)

# Split the DataFrame into training and testing sets
train = season_data.iloc[:split_index]
test = season_data.iloc[split_index:]

# Define predictors and target variables
X_train = train[predictors_v2]
y_train = train["target"]
X_test = test[predictors_v2]
y_test = test["target"]

# Fit the RandomForestClassifier model on the training data
rf.fit(X_train, y_train)

# Make predictions using the trained model on the test data
preds_v2 = rf.predict(X_test)

from sklearn.metrics import accuracy_score

# Calculate accuracy score to evaluate the model performance
acc_v2 = accuracy_score(test["target"], preds_v2)

# Print predictions and accuracy score
print(preds_v2)
print(f"Accuracy Score: {acc_v2:.4f}")

# Display the length of predictions
length_preds_v2 = len(preds_v2)
print("Length of preds_v2:", length_preds_v2)
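
The positional 75/25 split above follows whatever row order the CSV happens to have. If the rows are grouped by team rather than sorted by date (an assumption, not something confirmed here), the test set would not be the most recent matches. A minimal sketch of a chronological split, using a hypothetical toy frame `toy` instead of the real data:

```python
import pandas as pd

# Hypothetical toy frame: rows grouped by team, so they are not in date order
toy = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date": ["2023-08-12", "2023-10-01", "2024-01-20", "2024-05-25",
             "2023-08-13", "2023-10-02", "2024-01-21", "2024-05-26"],
    "target": [0, 1, 2, 1, 1, 0, 2, 2],
})

# Dates read back from a CSV are plain strings; parse them before sorting
toy["date"] = pd.to_datetime(toy["date"])
toy_sorted = toy.sort_values("date").reset_index(drop=True)

# First 75% of matches (by date) for training, the rest for testing
split_at = int(len(toy_sorted) * 0.75)
toy_train = toy_sorted.iloc[:split_at]
toy_test = toy_sorted.iloc[split_at:]

print(toy_train["date"].max(), "<", toy_test["date"].min())
```

Sorting first guarantees that every training match is older than every test match, which avoids evaluating the model on games played before the ones it learned from.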


In [None]:
# Create a DataFrame to compare actual and predicted values
combined_v2 = pd.DataFrame(dict(actual=test["target"], prediction=preds_v2))

# Generate a cross-tabulation to analyze the classification results
cross_tab_v2 = pd.crosstab(index=combined_v2['actual'], columns=combined_v2['prediction'])
cross_tab_v2


In [None]:
# classification report v2

# Extract actual and predicted values from combined_v2 DataFrame
actual_v2 = combined_v2['actual']
prediction_v2 = combined_v2['prediction']

# Generate the classification report with specified target names for clarity
report_v2 = classification_report(actual_v2, prediction_v2, target_names=['Class 0', 'Class 1', 'Class 2'])

# Print the classification report header
print("\nClassification Report:")

# Print the detailed classification report showing precision, recall, F1-score, and support for each class
print(report_v2)
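
When the report values are needed for plotting, `classification_report` can also return them as a dictionary via `output_dict=True`, which avoids parsing the printed string. A minimal sketch, assuming hypothetical label lists `y_true_demo` and `y_pred_demo` in place of `test["target"]` and `preds_v2`:

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical labels standing in for test["target"] and preds_v2
y_true_demo = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2, 1, 0, 0, 2]

report_dict = classification_report(
    y_true_demo, y_pred_demo,
    target_names=["Class 0", "Class 1", "Class 2"],
    output_dict=True,
)

# After transposing, rows are the three classes plus
# 'accuracy', 'macro avg', and 'weighted avg'
report_table = pd.DataFrame(report_dict).T
per_class = report_table.loc[["Class 0", "Class 1", "Class 2"],
                             ["precision", "recall", "f1-score"]]
print(per_class)
```

The resulting `per_class` frame has the same shape as the per-class metrics table used for plotting, without depending on the whitespace layout of the text report.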



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO

# report_v2 is the classification report string

# Parse the string to a DataFrame
# sep=r'\s+' is the documented equivalent of the deprecated delim_whitespace=True
report_df = pd.read_csv(StringIO(report_v2), sep=r'\s+')

# Correcting the index to properly filter out 'accuracy', 'macro avg', and 'weighted avg'
report_df.index = ['Class 0', 'Class 1', 'Class 2', 'accuracy', 'macro avg', 'weighted avg']
metrics_df = report_df.loc[['Class 0', 'Class 1', 'Class 2'], ['precision', 'recall', 'f1-score']]

# Reset index for better plotting with seaborn
metrics_df = metrics_df.reset_index().rename(columns={'index': 'Class'})

# Melt the DataFrame for seaborn
metrics_melted = metrics_df.melt(id_vars='Class', var_name='Metric', value_name='Score')

# Plot the metrics using seaborn with custom palette
plt.figure(figsize=(12, 8))
sns.barplot(data=metrics_melted, x='Class', y='Score', hue='Metric', palette=['gold', 'orange', 'limegreen'])

plt.gca().set_xticks(range(len(['Draw', 'Home Win', 'Away Win'])))
plt.gca().set_xticklabels(['Draw', 'Home Win', 'Away Win'])


# Enhance the plot with titles, labels, and limits
plt.title('Classification Report Metrics Enhanced Features', fontsize=16)
plt.xlabel('Class', fontsize=14)
plt.ylabel('Score', fontsize=14)
plt.ylim(0, 1)
plt.legend(title='Metric', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=0)
plt.grid(True, linestyle='--', alpha=0.7)

# Display the plot
plt.tight_layout()
plt.show()



In [None]:
# Combine the two classification reports (v1 and v2) into a single comparison plot

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from io import StringIO

# Assuming report_v1 and report_v2 are your classification report strings

# Parse the strings to DataFrames
# sep=r'\s+' is the documented equivalent of the deprecated delim_whitespace=True
report_df_v1 = pd.read_csv(StringIO(report_v1), sep=r'\s+')
report_df_v2 = pd.read_csv(StringIO(report_v2), sep=r'\s+')

# Correcting the index to properly filter out 'accuracy', 'macro avg', and 'weighted avg'
report_df_v1.index = ['Class 0', 'Class 1', 'Class 2', 'accuracy', 'macro avg', 'weighted avg']
report_df_v2.index = ['Class 0', 'Class 1', 'Class 2', 'accuracy', 'macro avg', 'weighted avg']

# Keep only the class-specific rows
metrics_df_v1 = report_df_v1.loc[['Class 0', 'Class 1', 'Class 2'], ['precision', 'recall', 'f1-score']]
metrics_df_v2 = report_df_v2.loc[['Class 0', 'Class 1', 'Class 2'], ['precision', 'recall', 'f1-score']]

# Add a column to distinguish between the reports
metrics_df_v1['Report'] = 'Report 1'
metrics_df_v2['Report'] = 'Report 2'

# Combine the DataFrames
combined_metrics_df = pd.concat([metrics_df_v1, metrics_df_v2])

# Reset index for better plotting with seaborn
combined_metrics_df = combined_metrics_df.reset_index().rename(columns={'index': 'Class'})

# Melt the DataFrame for seaborn
combined_metrics_melted = combined_metrics_df.melt(id_vars=['Class', 'Report'], var_name='Metric', value_name='Score')

# Define a custom palette with one color per metric (gold, light salmon, pale green)
custom_palette = ["#FFD700", "#FFA07A", "#98FB98"]

# Plot the metrics; catplot creates its own figure, so no plt.figure() call is needed
g = sns.catplot(data=combined_metrics_melted, x='Class', y='Score', hue='Metric', col='Report',
                kind='bar', height=5, aspect=1.2, palette=custom_palette)

# Enhance the plot
g.fig.subplots_adjust(top=0.9)  # Adjust the top to fit the title
g.fig.suptitle('Comparison of Classification Report Metrics Standard v Enhanced Features', fontsize=16)
for ax in g.axes.flat:
    # Relabel the classes on every facet, not only the last one
    ax.set_xticks(range(3))
    ax.set_xticklabels(['Draw', 'Home Win', 'Away Win'])
    ax.set_ylim(0, 1)  # y-axis shows the score values ranging from 0 to 1

plt.show()



In [None]:
# all the hard work comes down to this:
percent_increase = ((acc_v2 - acc_v1) / acc_v1) * 100
print(f"The original model accuracy score is {acc_v1:.4f}, while the model with enhanced features achieved {acc_v2:.4f}. This represents a {percent_increase:.2f}% increase from the original model.")

In [None]:
# The increase is only about 5% relative to the original model, which is hard to see on a
# plot with a full 0-1 axis. Narrowing the y-axis limits zooms in on the two scores and
# makes the difference visible. Note that a truncated axis visually exaggerates the gap.


import matplotlib.pyplot as plt

# Data
models = ['Original Model', 'Enhanced Features Model']
scores = [acc_v1, acc_v2]
percent_increase = ((acc_v2 - acc_v1) / acc_v1) * 100

# Plotting
plt.figure(figsize=(10, 6))
plt.bar(models, scores, color=['blue', 'orange'])
plt.text(1, acc_v2 + 0.002, f'{acc_v2:.4f}', ha='center', va='bottom', fontsize=12, color='blue')
plt.text(0, acc_v1 + 0.002, f'{acc_v1:.4f}', ha='center', va='bottom', fontsize=12, color='blue')

# Highlight the increase with annotations
plt.annotate(f'{percent_increase:.2f}% Increase', xy=(0.5, (acc_v1 + acc_v2) / 2), xytext=(0.5, (acc_v1 + acc_v2) / 2 + 0.005),
             ha='center', va='bottom', fontsize=14, color='red',
             arrowprops=dict(facecolor='red', arrowstyle='->'))

# Labels and title
plt.xlabel('Models')
plt.ylabel('Accuracy Score')
plt.title('Comparison of Model Accuracy Scores')
plt.ylim(min(scores) - 0.01, max(scores) + 0.01)  # Adjusted y-axis limit to emphasize the difference
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()


## Evaluating Football Match Outcome Predictions with a 44.6% Accuracy in a 3-Class Scenario (Win, Lose, Draw)

## 1. Above Random Chance
In a 3-class scenario (win, lose, draw), random guessing yields around 33% accuracy if the three outcomes are equally likely. In football they rarely are, so a naive model that always predicts the most frequent outcome can score above 33%. The share of the most frequent class in the test set is therefore the more demanding baseline to beat.
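
A quick way to put a number on the baseline is to measure how often the most frequent outcome occurs in the test set. The sketch below uses a hypothetical label series `y_test_demo`; in this notebook the real input would be `test["target"]`.

```python
import pandas as pd

# Hypothetical outcome labels; in the notebook these would be test["target"]
y_test_demo = pd.Series([1, 1, 0, 2, 1, 2, 1, 0, 1, 2])

# Share of each class in the test set
class_share = y_test_demo.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline: {majority_baseline:.3f}")
```

Comparing the model's accuracy with `majority_baseline` as well as with 1/3 shows whether it has learned more than the class frequencies.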

## 2. Contextual Evaluation
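Match outcomes depend on many factors that the current features do not capture, such as injuries, squad rotation, and chance. Against that background, 44.6% accuracy from seven simple predictors is a meaningful starting point.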

## 3. Room for Improvement
The result is promising, but better features or a different algorithm could raise accuracy further.

## 4. Benchmarking
Comparing against other methods helps gauge competitiveness and identifies areas to improve.

## Conclusion
While 44.6% accuracy shows predictive capability, ongoing refinement is essential for reliable predictions.


## Improving the Model by Ensuring All Features Are Converted into Numerical Data for Effective Machine Learning Integration

This includes transforming features such as shots on target, weather data (temperature, humidity, precipitation), cup game participation, player injuries/suspensions, home/away performance, and managerial changes into numerical formats.

## Key Improvements:

### 1. Incorporate Shots on Target Data
Including statistics like shots on target per match could provide insights into offensive efficiency and potentially enhance predictive accuracy.

### 2. Integrate Weather Data
Weather conditions during matches can impact player performance and game outcomes. Incorporating weather data such as temperature, humidity, and precipitation may improve model robustness.

### 3. Consider Cup Game Participation
Teams participating in additional cup competitions may experience different levels of fatigue or prioritize differently, influencing their league performance. Including data on cup game participation could capture these dynamics.

### 4. Account for Player Injuries and Suspensions
Player availability due to injuries or suspensions significantly affects team performance. Tracking player injuries and suspensions and incorporating this data into the model could refine predictions.

### 5. Evaluate Home and Away Performance
Analyzing team performance differences between home and away matches can provide valuable insights. Incorporating home and away statistics may better capture the nuances of team dynamics.

### 6. Include Managerial Changes
Changes in coaching staff or managerial strategies can impact team performance. Monitoring managerial changes and their effects on team dynamics could contribute to more accurate predictions.
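
All of the improvements above depend on turning new columns into numbers before they reach the model. A minimal sketch of that encoding step with pandas follows; the frame `raw` and every column in it are hypothetical, since none of this data has been scraped yet.

```python
import pandas as pd

# Hypothetical raw columns; none of these exist in the scraped data yet
raw = pd.DataFrame({
    "weather": ["rain", "clear", "clear", "cloudy"],
    "new_manager": [False, True, False, False],
    "players_out": ["2", "0", "1", "3"],  # counts stored as text
    "temperature_c": [14.5, 22.0, 19.5, 11.0],
})

encoded = pd.DataFrame({
    # Text categories become integer codes (categories are ordered alphabetically)
    "weather_code": raw["weather"].astype("category").cat.codes,
    # Booleans become 0/1
    "new_manager": raw["new_manager"].astype(int),
    # Numeric strings become integers
    "players_out": pd.to_numeric(raw["players_out"]),
    # Already numeric, passed through unchanged
    "temperature_c": raw["temperature_c"],
})
print(encoded.dtypes)
```

Integer codes are usually adequate for tree-based models such as Random Forest; linear models and SVMs tend to work better with one-hot encoding, for example via `pd.get_dummies`.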


## Try Alternative Models

## 1. Support Vector Machines (SVM)
- **Explanation**: SVMs find the hyperplane that best separates classes in feature space and use kernel functions to model non-linear decision boundaries. With a suitable regularization setting they resist overfitting, but they are sensitive to feature scale, so the predictors should be standardized first.

## 2. Multinomial Logistic Regression
- **Explanation**: Logistic Regression extends to multi-class problems via multinomial variants. It models class probabilities using the softmax function and predicts the class with the highest probability.

## 3. Gradient Boosting Classifier (e.g., XGBoost, LightGBM)
- **Explanation**: Gradient Boosting builds an ensemble of decision trees sequentially, where each corrects errors of its predecessor. Models like XGBoost and LightGBM are known for high accuracy and handling complex data interactions.
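
The three alternatives can be compared with one loop. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the match data, scikit-learn's own `GradientBoostingClassifier` as a stand-in for XGBoost or LightGBM, and hypothetical names such as `candidates` and `alt_scores`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the match features (assumption)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=7, n_informative=4, n_redundant=0,
    n_classes=3, random_state=1,
)

# Same positional 75/25 split as used for the Random Forest
split = int(len(X_syn) * 0.75)
X_tr, X_te = X_syn[:split], X_syn[split:]
y_tr, y_te = y_syn[:split], y_syn[split:]

candidates = {
    # SVM and logistic regression are scale-sensitive, so they get a scaler
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
}

alt_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    alt_scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, score in alt_scores.items():
    print(f"{name}: {score:.4f}")
```

To run this on the real data, `X_tr`, `X_te`, `y_tr`, and `y_te` would be replaced by `X_train`, `X_test`, `y_train`, and `y_test` from the Random Forest cell, so that every model is scored on the same matches.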
