## What this notebook does

This project implements a large-scale, fully automated Python pipeline that scrapes football match statistics from **SoccerStats.com** across **60+ leagues**, processes the collected information, and computes probability indicators for matches finishing **over 2.5 goals**.

The script extracts:
- Previous results; 
- Upcoming fixtures;  
- Team scoring profiles;  
- Over/under performance metrics;  
- Expected goals distributions.  

It builds combined probability estimates using home/away tendencies and statistical variance.

All processed information is exported as Excel files with multiple sheets.

---

### Output Files

**1. Full Data**  
*Example filename:* `FullDatabase+2.5Goals_10-11-2025.xlsx`

**2. Treated Data**  
*Example filename:* `Treated_+2.5Goals_10-11-2025.xlsx`

# 0. Importing libraries

In [None]:
import requests, warnings
import pandas as pd
from bs4 import BeautifulSoup
import re, os
import statistics
from datetime import datetime, timedelta
from IPython.display import HTML
from io import BytesIO

In [None]:
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# 1. Full Data database

# 1.1. Web scraping

## 1.1.1. Scraping DataFrames with OverUnderGoalsTotalFullTime, OverUnderGoalsLast8FullTime, OverUnderGoalsHomeFullTime and OverUnderGoalsAwayFullTime, sorted by percentage of games with +2.5 goals

### 1.1.1.1. Scraping DataFrames from the web
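
The scraping cell in this section relies on each team appearing once per table, with the tables read in the order season total, last 8, home, away. The sketch below isolates that grouping step in plain Python. The team names, the shortened rows and the helper `split_by_occurrence` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical scraped rows, in page order: [team, games played, '2.5+' share]
rows = [
    ["Team A", "12", "67%"], ["Team B", "12", "50%"],   # first table
    ["Team A", "8", "75%"],  ["Team B", "8", "38%"],    # second table
    ["Team A", "6", "83%"],  ["Team B", "6", "50%"],    # third table
    ["Team A", "6", "50%"],  ["Team B", "6", "50%"],    # fourth table
]

def split_by_occurrence(rows, n_groups=4):
    """Send the n-th row seen for each team to the n-th group."""
    seen = defaultdict(int)
    groups = [[] for _ in range(n_groups)]
    for row in rows:
        team = row[0]
        seen[team] += 1
        if seen[team] <= n_groups:  # ignore any further repeats
            groups[seen[team] - 1].append(row)
    return groups

total, last8, home, away = split_by_occurrence(rows)
print(last8)
```

Grouping by occurrence count only works if every table lists each team exactly once, so a team missing from one table would shift its later rows into the wrong group.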

In [None]:
url = "https://raw.githubusercontent.com/FabioATMonteiro92/WebScrappingFootballProbabilityOver2.5GoalsPerGame/main/URLs.xlsx"

# Get file content from GitHub
response = requests.get(url)
response.raise_for_status()  # ensure the request succeeded

# Load it into pandas
dfURLs = pd.read_excel(BytesIO(response.content))

# Display all columns when printing the DataFrame
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 10)
# Show the table of league URLs
dfURLs

In [None]:
ListUrls   = dfURLs["TotalMatchGoalStats"].tolist()
Continent  = dfURLs["Continent"].tolist()
League     = dfURLs["League"].tolist()

rowsTotalFullTime = []
rowsLast8FullTime = []
rowsHomeFullTime = []
rowsAwayFullTime = []

IndexContLeague = 0
for league_url in ListUrls:
    # Perform the GET request and report any URL that does not return HTTP 200
    response = requests.get(league_url)
    if response.status_code != 200:
        print(league_url)

    # Parse the page content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the tables with the desired stats
    tables = soup.find_all('table', {'id': 'btable'})

    # Collect the cell text of every table row
    rows = []
    for table in tables:
        for tr in table.find_all('tr')[1:]:  # Skip the header row
            cells = tr.find_all('td')
            row = [cell.text.strip() for cell in cells]
            rows.append(row)
    rows = [sublist for sublist in rows if any('%' in item for item in sublist)]
    rows = [sublist for sublist in rows if 'League average' not in sublist]

    # Dictionary to count occurrences of each team
    team_counts = {}

    # Iterate through each sublist in the data
    for sublist in rows:
        team = sublist[0]

        # Initialize the team count if not already in dictionary
        if team not in team_counts:
            team_counts[team] = 0

        # Increment the count for the team
        team_counts[team] += 1

        # Place the sublist in the correct list based on the count
        if team_counts[team] == 1:
            sublist.insert(0,Continent[IndexContLeague])
            sublist.insert(1, League[IndexContLeague])
            rowsTotalFullTime.append(sublist)
        elif team_counts[team] == 2:
            sublist.insert(0,Continent[IndexContLeague])
            sublist.insert(1, League[IndexContLeague])
            rowsLast8FullTime.append(sublist)
        elif team_counts[team] == 3:
            sublist.insert(0,Continent[IndexContLeague])
            sublist.insert(1, League[IndexContLeague])
            rowsHomeFullTime.append(sublist)
        elif team_counts[team] == 4:
            sublist.insert(0,Continent[IndexContLeague])
            sublist.insert(1, League[IndexContLeague])
            rowsAwayFullTime.append(sublist)
    IndexContLeague += 1

headers = ["Continent","League","Team","GP","Avg","0.5+","1.5+","2.5+","3.5+","4.5+","5.5+","BTS","CS","FTS","WTN","LTN"]

dfOverUnderGoalsTotalFullTime = pd.DataFrame(rowsTotalFullTime, columns=headers)
dfOverUnderGoalsLast8FullTime = pd.DataFrame(rowsLast8FullTime, columns=headers)
dfOverUnderGoalsHomeFullTime = pd.DataFrame(rowsHomeFullTime, columns=headers)
dfOverUnderGoalsAwayFullTime = pd.DataFrame(rowsAwayFullTime, columns=headers)

### 1.1.1.2. Converting percentage columns and sorting DataFrames by percentage of games with +2.5 goals
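
As a minimal plain-Python sketch of this step (not the pandas code used in the notebook), the example below strips the `%` sign, converts the values to integers and sorts by `2.5+` and then `GP`, both descending. The teams and values are hypothetical. Converting `GP` to an integer is an assumption of this sketch: the cell that follows converts only the percentage columns, and text values sort lexicographically, so `'9'` would rank above `'12'`.

```python
# Hypothetical scraped values, all still text
rows = [
    {"Team": "Team A", "GP": "9", "2.5+": "60%"},
    {"Team": "Team B", "GP": "12", "2.5+": "60%"},
    {"Team": "Team C", "GP": "10", "2.5+": "75%"},
]

def to_int_percent(text):
    """Turn a string such as '60%' into the integer 60."""
    return int(text.replace("%", ""))

for row in rows:
    row["2.5+"] = to_int_percent(row["2.5+"])
    row["GP"] = int(row["GP"])  # assumption: games played is compared as a number

# Highest '2.5+' first; ties broken by the larger number of games played
ranked = sorted(rows, key=lambda r: (r["2.5+"], r["GP"]), reverse=True)
order = [r["Team"] for r in ranked]
print(order)
```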

In [None]:
# Percentage columns, scraped as text ending in '%'
columns_to_convert = ['0.5+', '1.5+', '2.5+', '3.5+', '4.5+', '5.5+', 'BTS', 'CS', 'FTS', 'WTN', 'LTN']
# Remove the '%' character and convert to integer
for column in columns_to_convert:
    dfOverUnderGoalsTotalFullTime[column] = dfOverUnderGoalsTotalFullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsLast8FullTime[column] = dfOverUnderGoalsLast8FullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsHomeFullTime[column] = dfOverUnderGoalsHomeFullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsAwayFullTime[column] = dfOverUnderGoalsAwayFullTime[column].str.replace('%', '').astype(int)

# Sort each DataFrame by '2.5+' and then by 'GP', both in descending order
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsTotalFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsLast8FullTime = dfOverUnderGoalsLast8FullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsLast8FullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsHomeFullTime = dfOverUnderGoalsHomeFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsHomeFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsAwayFullTime = dfOverUnderGoalsAwayFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsAwayFullTime.reset_index(drop=True,inplace=True)

In [None]:
# Render the table in a scrollable block (only the last expression of a cell is displayed)
HTML(
    dfOverUnderGoalsTotalFullTime.to_html()
    .replace('<table border="1" class="dataframe">', 
             '<table border="1" class="dataframe" style="display:block; height:300px; overflow-y:scroll;">')
)

## 1.1.2. Generating DataFrames with the previous results and fixtures

### 1.1.2.1. Scraping DataFrames from the web
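
The cell in this section keeps only rows whose first cell matches a date-like pattern, then treats a row as a played result when its third cell contains `-` and as a fixture when it contains `:`. The sketch below reproduces that filter on hand-written rows. The sample dates, team names and the helper `split_results_and_fixtures` are hypothetical; the regular expression is the one used in the notebook.

```python
import re

# Three letters, one or two digits, three letters
DATE_PATTERN = re.compile(r'^[A-Za-z]{3} [0-9]{1,2} [A-Za-z]{3}$')

# Hypothetical scraped rows, including noise that must be dropped
rows = [
    ["Sat 8 Nov", "Team A", "2 - 1", "Team B", "extra cell"],
    ["Sun 9 Nov", "Team C", "15:00", "Team D", "extra cell"],
    ["Round 12"],
    [],
]

def split_results_and_fixtures(rows):
    """Separate played matches (score cell) from fixtures (time cell)."""
    results, fixtures = [], []
    for row in rows:
        # Skip empty rows, short rows and rows that do not start with a date
        if len(row) < 4 or not DATE_PATTERN.match(row[0]):
            continue
        date, home, middle, away = row[:4]
        if "-" in middle:
            home_goals, away_goals = (int(part) for part in middle.split(" - "))
            results.append((date, home, home_goals, away_goals, away))
        elif ":" in middle:
            fixtures.append((date, home, middle, away))
    return results, fixtures

results, fixtures = split_results_and_fixtures(rows)
print(results, fixtures)
```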

In [None]:
ListUrlsbyDate = dfURLs["ResultsByDate"].tolist()

rows_results = []
rows_fixtures = []
IndexContLeague = 0
for league_url in ListUrlsbyDate:
    # Perform the GET request and report any URL that does not return HTTP 200
    response = requests.get(league_url)
    if response.status_code != 200:
        print(league_url)

    # Parse the page content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the tables with the desired stats
    tables = soup.find_all('table', {'id': 'btable'})

    # Collect the cell text of every table row
    rows = []
    for table in tables:
        for tr in table.find_all('tr')[1:]:  # Skip the header row
            cells = tr.find_all('td')
            row = [cell.text.strip() for cell in cells]
            rows.append(row)

    # Regular expression for the date cell: three letters, one or two digits, three letters
    pattern = re.compile(r'^[A-Za-z]{3} [0-9]{1,2} [A-Za-z]{3}$')

    # Keep only non-empty rows whose first cell matches the pattern, truncated to their first four cells
    rows = [sublist[:4] for sublist in rows if sublist and pattern.match(sublist[0])]

    for match_row in rows:
        if "-" in match_row[2]:    # score present: the match has been played
            match_row.insert(0, Continent[IndexContLeague])
            match_row.insert(1, League[IndexContLeague])
            rows_results.append(match_row)
        elif ":" in match_row[2]:  # schedule time present: upcoming fixture
            match_row.insert(0, Continent[IndexContLeague])
            match_row.insert(1, League[IndexContLeague])
            rows_fixtures.append(match_row)
    IndexContLeague += 1

for i in range(0,len(rows_results)):
    score = rows_results[i][4].split(' - ')
    temp = [rows_results[i][0], rows_results[i][1], rows_results[i][2], rows_results[i][3], int(score[0]), '-', int(score[1]), rows_results[i][5]]
    rows_results[i] = temp

headers_results = ["Continent","League","Date","Home team","Goals Home","Hifen","Goals Away","Away Team"]
headers_fixtures = ["Continent","League","Date","Home team","Schedule","Away Team"]

dfresults = pd.DataFrame(rows_results, columns=headers_results)
dffixtures = pd.DataFrame(rows_fixtures, columns=headers_fixtures)

# Generate the list of unique team names (taken from the home-team column)
listTeamNamesUnique =  dfresults["Home team"].drop_duplicates().tolist()

## 1.2. Calculating average goals scored and conceded (and SD)

### 1.2.1. Full season
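
As a compact sketch of what this section computes, the example below collects one team's goals scored and conceded across all of its matches and summarises each list as a rounded mean and sample standard deviation (`statistics.stdev`, which divides by n - 1). The results and the helpers `summarise` and `goals_summary` are hypothetical. Returning `'-'` when fewer than two values exist follows the placeholder convention used in the notebook.

```python
import statistics

# Hypothetical results: (home team, home goals, away goals, away team)
results = [
    ("Team A", 2, 1, "Team B"),
    ("Team B", 0, 0, "Team A"),
    ("Team A", 3, 2, "Team C"),
]

def summarise(values):
    """Return [mean, sample SD]; '-' marks a statistic that cannot be computed."""
    if len(values) < 2:
        return [values[0] if values else '-', '-']
    return [round(statistics.mean(values), 2), round(statistics.stdev(values), 2)]

def goals_summary(results, team):
    """Summarise goals scored and conceded by one team, home and away combined."""
    scored, conceded = [], []
    for home, home_goals, away_goals, away in results:
        if home == team:
            scored.append(home_goals)
            conceded.append(away_goals)
        if away == team:
            scored.append(away_goals)
            conceded.append(home_goals)
    return summarise(scored), summarise(conceded)

print(goals_summary(results, "Team A"))
print(goals_summary(results, "Team C"))
```

The standard deviation needs at least two matches, which is why a team with a single game keeps its one value as the average and gets `'-'` for the SD.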

In [None]:
##########################################################################################################################################################################
#Dict 1.1: GoalsbyTeamTotalDict for dfOverUnderGoalsTotalFullTime
GoalsbyTeamTotalDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the DataFrame
for index, row in dfresults.iterrows():
    # Home team case
    home_team = row['Home team']
    home_goals = row['Goals Home']
    if home_team in GoalsbyTeamTotalDict:
        GoalsbyTeamTotalDict[home_team].append(home_goals)

    # Away team case
    away_team = row['Away Team']
    away_goals = row['Goals Away']
    if away_team in GoalsbyTeamTotalDict:
        GoalsbyTeamTotalDict[away_team].append(away_goals)

# GoalsbyTeamTotalDict now contains the list of goals scored by each team
AverageSDGoalsbyTeamTotalDict = GoalsbyTeamTotalDict.copy()

# Replace each list with the average and standard deviation
for team, goals in AverageSDGoalsbyTeamTotalDict.items():
    if len(goals) == 1:
        AverageSDGoalsbyTeamTotalDict[team] = [goals[0], '-']
    else:
        average = round(statistics.mean(goals), 2)
        stdev = round(statistics.stdev(goals), 2)
        AverageSDGoalsbyTeamTotalDict[team] = [average, stdev]
        
#################################################################
#Dict 1.2: GoalsConcededbyTeamTotalDict for dfOverUnderGoalsTotalFullTime
# Create a new dictionary to store goals conceded by each team
GoalsConcededbyTeamTotalDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the dfresults DataFrame to calculate goals conceded
for index, row in dfresults.iterrows():
    # Home team case: Home team concedes the away team's goals
    home_team = row['Home team']
    away_goals = row['Goals Away']
    if home_team in GoalsConcededbyTeamTotalDict:
        GoalsConcededbyTeamTotalDict[home_team].append(away_goals)

    # Away team case: Away team concedes the home team's goals
    away_team = row['Away Team']
    home_goals = row['Goals Home']
    if away_team in GoalsConcededbyTeamTotalDict:
        GoalsConcededbyTeamTotalDict[away_team].append(home_goals)

# Calculate the average and standard deviation for goals conceded
AverageSDGoalsConcededbyTeamTotalDict = {}
for team, goals_conceded in GoalsConcededbyTeamTotalDict.items():
    if len(goals_conceded) == 1:  # A single game: no standard deviation can be computed
        AverageSDGoalsConcededbyTeamTotalDict[team] = [goals_conceded[0], '-']
    else:
        average_gc = round(statistics.mean(goals_conceded), 2)
        stdev_gc = round(statistics.stdev(goals_conceded), 2)
        AverageSDGoalsConcededbyTeamTotalDict[team] = [average_gc, stdev_gc]

for i, (key, value) in enumerate(AverageSDGoalsbyTeamTotalDict.items()):
    if i == 10:
        break
    print(key, ":", value)

### 1.2.2. Last 8 games
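
The rule applied in this section can be sketched as follows: a team with fewer than eight recorded games gets `'-'` placeholders, otherwise only its last eight values are kept. The helper name `last_n_or_placeholder` is hypothetical, and the sketch assumes each list is ordered from oldest to newest game.

```python
def last_n_or_placeholder(goals, n=8):
    """Keep the last n values, or return '-' placeholders if fewer than n exist."""
    if len(goals) < n:
        return ['-'] * len(goals)
    return goals[-n:]

short_run = last_n_or_placeholder([1, 2, 0])
long_run = last_n_or_placeholder([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(short_run, long_run)
```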

In [None]:
#Dict 2.1: GoalsbyTeamLast8Dict for dfOverUnderGoalsLast8FullTime
GoalsbyTeamLast8Dict = GoalsbyTeamTotalDict.copy()

# Update the dictionary by replacing lists with length < 8 with '-'
for team in GoalsbyTeamLast8Dict:
    if len(GoalsbyTeamLast8Dict[team]) < 8:
        GoalsbyTeamLast8Dict[team] = ['-'] * len(GoalsbyTeamLast8Dict[team])
    else:
        GoalsbyTeamLast8Dict[team] = GoalsbyTeamLast8Dict[team][-8:]

AverageSDGoalsbyTeamLast8Dict = GoalsbyTeamLast8Dict.copy()

# Replace each list with the average and standard deviation
for team, goals in AverageSDGoalsbyTeamLast8Dict.items():
    if len(goals) == 1:
        AverageSDGoalsbyTeamLast8Dict[team] = [goals[0], '-']
    elif "-" in goals:
        AverageSDGoalsbyTeamLast8Dict[team] = ['-', '-']
    else:
        average = round(statistics.mean(goals), 2)
        stdev = round(statistics.stdev(goals), 2)
        AverageSDGoalsbyTeamLast8Dict[team] = [average, stdev]

#################################################################
#Dict 2.2: GoalsConcededbyTeamLast8Dict for dfOverUnderGoalsLast8FullTime
# Create a new dictionary to store goals conceded by each team
GoalsConcededbyTeamLast8Dict = GoalsConcededbyTeamTotalDict.copy()

# Update the dictionary by replacing lists with length < 8 with '-'
for team in GoalsConcededbyTeamLast8Dict:
    if len(GoalsConcededbyTeamLast8Dict[team]) < 8:
        GoalsConcededbyTeamLast8Dict[team] = ['-'] * len(GoalsConcededbyTeamLast8Dict[team])
    else:
        GoalsConcededbyTeamLast8Dict[team] = GoalsConcededbyTeamLast8Dict[team][-8:]

# Calculate the average and standard deviation for goals conceded
AverageSDGoalsConcededbyTeamLast8Dict = {}
for team, goals_conceded in GoalsConcededbyTeamLast8Dict.items():
    if len(goals_conceded) == 1:
        AverageSDGoalsConcededbyTeamLast8Dict[team] = [goals_conceded[0], '-']
    elif "-" in goals_conceded:
        AverageSDGoalsConcededbyTeamLast8Dict[team] = ['-', '-']
    else:
        average_gc = round(statistics.mean(goals_conceded), 2)
        stdev_gc = round(statistics.stdev(goals_conceded), 2)
        AverageSDGoalsConcededbyTeamLast8Dict[team] = [average_gc, stdev_gc]
        
for i, (key, value) in enumerate(AverageSDGoalsConcededbyTeamLast8Dict.items()):
    if i == 10:
        break
    print(key, ":", value)

### 1.2.3. Home Games

In [None]:
#Dict 3.1: GoalsbyTeamHomeDict for dfOverUnderGoalsHomeFullTime
GoalsbyTeamHomeDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the DataFrame
for index, row in dfresults.iterrows():
    # Home team case
    home_team = row['Home team']
    home_goals = row['Goals Home']
    if home_team in GoalsbyTeamHomeDict:
        GoalsbyTeamHomeDict[home_team].append(home_goals)

AverageSDGoalsbyTeamHomeDict = GoalsbyTeamHomeDict.copy()

# Replace each list with the average and standard deviation
for team, goals in AverageSDGoalsbyTeamHomeDict.items():
    if len(goals) == 1:
        AverageSDGoalsbyTeamHomeDict[team] = [goals[0], '-']
    else:
        average = round(statistics.mean(goals), 2)
        stdev = round(statistics.stdev(goals), 2)
        AverageSDGoalsbyTeamHomeDict[team] = [average, stdev]

#################################################################
#Dict 3.2: GoalsConcededbyTeamHomeDict for dfOverUnderGoalsHomeFullTime
# Create a new dictionary to store goals conceded by each team
GoalsConcededbyTeamHomeDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the dfresults DataFrame to calculate goals conceded
for index, row in dfresults.iterrows():
    # Home team case: Home team concedes the away team's goals
    home_team = row['Home team']
    away_goals = row['Goals Away']
    if home_team in GoalsConcededbyTeamHomeDict:
        GoalsConcededbyTeamHomeDict[home_team].append(away_goals)

# Calculate the average and standard deviation for goals conceded
AverageSDGoalsConcededbyTeamHomeDict = {}
for team, goals_conceded in GoalsConcededbyTeamHomeDict.items():
    if len(goals_conceded) == 1:
        AverageSDGoalsConcededbyTeamHomeDict[team] = [goals_conceded[0], '-']
    else:
        average_gc = round(statistics.mean(goals_conceded), 2)
        stdev_gc = round(statistics.stdev(goals_conceded), 2)
        AverageSDGoalsConcededbyTeamHomeDict[team] = [average_gc, stdev_gc]
        
for i, (key, value) in enumerate(AverageSDGoalsConcededbyTeamHomeDict.items()):
    if i == 10:
        break
    print(key, ":", value)

### 1.2.4. Away Games

In [None]:
#Dict 4.1: GoalsbyTeamAwaylDict for dfOverUnderGoalsAwayFullTime
GoalsbyTeamAwaylDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the DataFrame
for index, row in dfresults.iterrows():
    # Away team case
    away_team = row['Away Team']
    away_goals = row['Goals Away']
    if away_team in GoalsbyTeamAwaylDict:
        GoalsbyTeamAwaylDict[away_team].append(away_goals)

AverageSDGoalsbyTeamAwaylDict = GoalsbyTeamAwaylDict.copy()

# Replace each list with the average and standard deviation
for team, goals in AverageSDGoalsbyTeamAwaylDict.items():
    if len(goals) < 2:
        # Fewer than two away games: no standard deviation (and no average if there are none)
        AverageSDGoalsbyTeamAwaylDict[team] = [goals[0] if goals else '-', '-']
    else:
        average = round(statistics.mean(goals), 2)
        stdev = round(statistics.stdev(goals), 2)
        AverageSDGoalsbyTeamAwaylDict[team] = [average, stdev]

#################################################################
#Dict 4.2: GoalsConcededbyTeamAwayDict for dfOverUnderGoalsAwayFullTime
# Create a new dictionary to store goals conceded by each team
GoalsConcededbyTeamAwayDict = {team: [] for team in listTeamNamesUnique}

# Iterate through the dfresults DataFrame to calculate goals conceded
for index, row in dfresults.iterrows():
    # Away team case: Away team concedes the home team's goals
    away_team = row['Away Team']
    home_goals = row['Goals Home']
    if away_team in GoalsConcededbyTeamAwayDict:
        GoalsConcededbyTeamAwayDict[away_team].append(home_goals)

# Calculate the average and standard deviation for goals conceded
AverageSDGoalsConcededbyTeamAwayDict = {}
for team, goals_conceded in GoalsConcededbyTeamAwayDict.items():
    if len(goals_conceded) < 2:
        # Fewer than two away games: no standard deviation (and no average if there are none)
        AverageSDGoalsConcededbyTeamAwayDict[team] = [goals_conceded[0] if goals_conceded else '-', '-']
    else:
        average_gc = round(statistics.mean(goals_conceded), 2)
        stdev_gc = round(statistics.stdev(goals_conceded), 2)
        AverageSDGoalsConcededbyTeamAwayDict[team] = [average_gc, stdev_gc]
        
for i, (key, value) in enumerate(AverageSDGoalsConcededbyTeamAwayDict.items()):
    if i == 10:
        break
    print(key, ":", value)

### 1.2.5. Adding the averages and SDs from the dictionaries to the DataFrames created in section 1.1.1

In [None]:
#########################################################################################################################################################################
# Add each team's average and SD of goals scored to the over/under DataFrames
# (.get returns None for a team missing from the results instead of raising KeyError)
dfOverUnderGoalsTotalFullTime['average_GS'] = dfOverUnderGoalsTotalFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamTotalDict.get(x, [None, None])[0])
dfOverUnderGoalsTotalFullTime['SD_GS'] = dfOverUnderGoalsTotalFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamTotalDict.get(x, [None, None])[1])

dfOverUnderGoalsLast8FullTime['average_GS'] = dfOverUnderGoalsLast8FullTime['Team'].map(lambda x: AverageSDGoalsbyTeamLast8Dict.get(x, [None, None])[0])
dfOverUnderGoalsLast8FullTime['SD_GS'] = dfOverUnderGoalsLast8FullTime['Team'].map(lambda x: AverageSDGoalsbyTeamLast8Dict.get(x, [None, None])[1])

dfOverUnderGoalsHomeFullTime['average_GS'] = dfOverUnderGoalsHomeFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamHomeDict.get(x, [None, None])[0])
dfOverUnderGoalsHomeFullTime['SD_GS'] = dfOverUnderGoalsHomeFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamHomeDict.get(x, [None, None])[1])

dfOverUnderGoalsAwayFullTime['average_GS'] = dfOverUnderGoalsAwayFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamAwaylDict.get(x, [None, None])[0])
dfOverUnderGoalsAwayFullTime['SD_GS'] = dfOverUnderGoalsAwayFullTime['Team'].map(lambda x: AverageSDGoalsbyTeamAwaylDict.get(x, [None, None])[1])

# Add the averages and standard deviations for goals conceded to the dataframe
dfOverUnderGoalsTotalFullTime['average_GC'] = dfOverUnderGoalsTotalFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamTotalDict.get(x, [None, None])[0])
dfOverUnderGoalsTotalFullTime['SD_GC'] = dfOverUnderGoalsTotalFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamTotalDict.get(x, [None, None])[1])

dfOverUnderGoalsLast8FullTime['average_GC'] = dfOverUnderGoalsLast8FullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamLast8Dict.get(x, [None, None])[0])
dfOverUnderGoalsLast8FullTime['SD_GC'] = dfOverUnderGoalsLast8FullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamLast8Dict.get(x, [None, None])[1])

dfOverUnderGoalsHomeFullTime['average_GC'] = dfOverUnderGoalsHomeFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamHomeDict.get(x, [None, None])[0])
dfOverUnderGoalsHomeFullTime['SD_GC'] = dfOverUnderGoalsHomeFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamHomeDict.get(x, [None, None])[1])

dfOverUnderGoalsAwayFullTime['average_GC'] = dfOverUnderGoalsAwayFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamAwayDict.get(x, [None, None])[0])
dfOverUnderGoalsAwayFullTime['SD_GC'] = dfOverUnderGoalsAwayFullTime['Team'].map(lambda x: AverageSDGoalsConcededbyTeamAwayDict.get(x, [None, None])[1])

# Display all columns when printing the DataFrame
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 6)

# Show the updated DataFrame
dfOverUnderGoalsTotalFullTime

## 1.3. Calculating the probability of a game finishing over 2.5 goals and adding it to the DataFrames
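
The three probability columns built in this section reduce to a simple rule: for a fixture, the lower bound is the smaller of the two teams' season `2.5+` percentages, the upper bound is the larger, and the average is their midpoint rounded to two decimals. The sketch below shows that rule in isolation. The function name `over_2_5_bounds` and the input values are hypothetical.

```python
def over_2_5_bounds(home_pct, away_pct):
    """Return (low bound, high bound, midpoint) for two '2.5+' percentages."""
    low = min(home_pct, away_pct)
    high = max(home_pct, away_pct)
    return low, high, round((low + high) / 2, 2)

bounds = over_2_5_bounds(75, 60)
print(bounds)
```

These figures are descriptive indicators derived from past frequencies, not calibrated probabilities of the match outcome.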

In [None]:
dffixtures_forExcel = dffixtures.copy()

# Initialize an empty DataFrame to store the first appearances
first_appearances = pd.DataFrame()

# Track teams that have already appeared
home_appeared_teams = set()
away_appeared_teams = set()

# Iterate through the fixtures to find the first appearance of each team
for index, row in dffixtures_forExcel.iterrows():
    home_team = row['Home team']
    away_team = row['Away Team']

    # Check if the home team has already appeared as either home or away
    if home_team not in home_appeared_teams and home_team not in away_appeared_teams:
        first_appearances = pd.concat([first_appearances, pd.DataFrame([row])])
        home_appeared_teams.add(home_team)

    # Check if the away team has already appeared as either home or away
    if away_team not in home_appeared_teams and away_team not in away_appeared_teams:
        first_appearances = pd.concat([first_appearances, pd.DataFrame([row])])
        away_appeared_teams.add(away_team)

first_appearances.reset_index(drop=True, inplace=True)
first_appearances = first_appearances.drop(columns=['Continent', 'League'])

# Extract the '2.5+' column from the season-total DataFrame, keyed by team name
df_relevant = dfOverUnderGoalsTotalFullTime[['Team', '2.5+']]

# Create a dictionary for quick lookup of '2.5+' values
team_2_5_plus = df_relevant.set_index('Team')['2.5+'].to_dict()

# Add the lower and upper bound columns to first_appearances
first_appearances['low bound probability'] = first_appearances.apply(
    lambda row: min(team_2_5_plus.get(row['Home team'], float('inf')),
                    team_2_5_plus.get(row['Away Team'], float('inf'))), axis=1)

first_appearances['high bound probability'] = first_appearances.apply(
    lambda row: max(team_2_5_plus.get(row['Home team'], float('-inf')),
                    team_2_5_plus.get(row['Away Team'], float('-inf'))), axis=1)

# Midpoint of the two bounds, rounded to 2 decimal places
first_appearances['average probability'] = (
    (first_appearances['low bound probability'] + first_appearances['high bound probability']) / 2
).round(2)

# Ensure the fixture and probability columns exist in dfOverUnderGoalsTotalFullTime
required_columns = ['Date', 'Home team', 'Schedule', 'Away Team',
                    'low bound probability', 'high bound probability', 'average probability']

for col in required_columns:
    if col not in dfOverUnderGoalsTotalFullTime.columns:
        dfOverUnderGoalsTotalFullTime[col] = None  # Add the column if it doesn't exist

for index, row in first_appearances.iterrows():
    # Rows of the team-level DataFrame that belong to either side of this fixture
    matching_teams = dfOverUnderGoalsTotalFullTime['Team'].isin([row['Home team'], row['Away Team']])

    # Copy the fixture details and probabilities from first_appearances onto those rows
    dfOverUnderGoalsTotalFullTime.loc[matching_teams, required_columns] = row[required_columns].values

# Map each team to its '2.5+' percentage in home games and in away games
home_team_2_5_map = dfOverUnderGoalsHomeFullTime.set_index('Team')['2.5+'].to_dict()
away_team_2_5_map = dfOverUnderGoalsAwayFullTime.set_index('Team')['2.5+'].to_dict()

# Look up the home side's home-game rate and the away side's away-game rate for each row's next match
dfOverUnderGoalsTotalFullTime['probability home team'] = dfOverUnderGoalsTotalFullTime['Home team'].map(home_team_2_5_map)
dfOverUnderGoalsTotalFullTime['probability away team'] = dfOverUnderGoalsTotalFullTime['Away Team'].map(away_team_2_5_map)

# Display all columns when printing the DataFrame
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 6)

# Show the updated DataFrame
dfOverUnderGoalsTotalFullTime

## 1.4. Adding date and schedules of the games to the dataframes
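
The `Next match` label built in this section joins the date, schedule and team names into one string, and later cells split it again to recover the two teams. The sketch below shows that round trip. The helper names `build_next_match` and `parse_teams` and the sample values are hypothetical; the separators `', '` and `' - '` are the ones used in the notebook.

```python
def build_next_match(date, schedule, home, away):
    """Join fixture details into a single label."""
    return f"{date}, {schedule}, {home} - {away}"

def parse_teams(label):
    """Recover (home team, away team) from a label built as above."""
    home, away = label.split(', ')[-1].split(' - ')
    return home.strip(), away.strip()

label = build_next_match("Sat 8 Nov", "15:00", "Team A", "Team B")
teams = parse_teams(label)
print(label, teams)
```

The round trip assumes that no team name contains `', '` or `' - '`, since either would make the split ambiguous.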

In [None]:
# First, create the new "next match" column by concatenating the values from "Date", "Home team", "Schedule", and "Away Team".
dfOverUnderGoalsTotalFullTime["Date1"] = dfOverUnderGoalsTotalFullTime["Date"]
dfOverUnderGoalsTotalFullTime['Next match'] = dfOverUnderGoalsTotalFullTime.apply(lambda row: f"{row['Date']}, {row['Schedule']}, {row['Home team']} - {row['Away Team']}", axis=1)

# Drop the original four columns
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime.drop(columns=['Date', 'Home team', 'Schedule', 'Away Team'])

# Move "Next match" to the 21st position (index 20)
columns = dfOverUnderGoalsTotalFullTime.columns.tolist()  # Get list of columns
columns.insert(20, columns.pop(columns.index('Next match')))
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime[columns]  # Reorder the dataframe

# Lower bound of the home/away probabilities, moved to the 27th position (index 26)
dfOverUnderGoalsTotalFullTime['low bound home/away probability'] = dfOverUnderGoalsTotalFullTime[['probability home team', 'probability away team']].min(axis=1)
columns = dfOverUnderGoalsTotalFullTime.columns.tolist()  # Get list of columns
columns.insert(26, columns.pop(columns.index('low bound home/away probability')))
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime[columns]  # Reorder the dataframe

# Average of the home/away probabilities, moved to the 28th position (index 27)
dfOverUnderGoalsTotalFullTime['average home/away probability'] = round(dfOverUnderGoalsTotalFullTime[['probability home team', 'probability away team']].mean(axis=1),2)
columns = dfOverUnderGoalsTotalFullTime.columns.tolist()  # Get list of columns
columns.insert(27, columns.pop(columns.index('average home/away probability')))
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime[columns]  # Reorder the dataframe

# Display all columns when printing the DataFrame
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 6)

# Show the updated DataFrame
dfOverUnderGoalsTotalFullTime

## 1.5. Adding average conceded and scored goals by home and away team

In [None]:
# Function to extract home and away team names
def extract_teams(match_info):
    match_details = match_info.split(', ')[-1]  # Get the "Landskrona - Brage" part
    home_team, away_team = match_details.split(' - ')  # Split into home and away teams
    return home_team.strip(), away_team.strip()

# Populate the columns based on the extracted home and away teams
dfOverUnderGoalsTotalFullTime['average_GS_Home'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsbyTeamHomeDict.get(extract_teams(x)[0], [None, None])[0]
)
dfOverUnderGoalsTotalFullTime['SD_GS_Home'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsbyTeamHomeDict.get(extract_teams(x)[0], [None, None])[1]
)
dfOverUnderGoalsTotalFullTime['average_GC_Home'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsConcededbyTeamHomeDict.get(extract_teams(x)[0], [None, None])[0]
)
dfOverUnderGoalsTotalFullTime['SD_GC_Home'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsConcededbyTeamHomeDict.get(extract_teams(x)[0], [None, None])[1]
)
dfOverUnderGoalsTotalFullTime['average_GS_Away'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsbyTeamAwaylDict.get(extract_teams(x)[1], [None, None])[0]
)
dfOverUnderGoalsTotalFullTime['SD_GS_Away'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsbyTeamAwaylDict.get(extract_teams(x)[1], [None, None])[1]
)
dfOverUnderGoalsTotalFullTime['average_GC_Away'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsConcededbyTeamAwayDict.get(extract_teams(x)[1], [None, None])[0]
)
dfOverUnderGoalsTotalFullTime['SD_GC_Away'] = dfOverUnderGoalsTotalFullTime['Next match'].map(
    lambda x: AverageSDGoalsConcededbyTeamAwayDict.get(extract_teams(x)[1], [None, None])[1]
)

date1_column = dfOverUnderGoalsTotalFullTime.pop('Date1')

# Add the "Date1" column back at the end of the DataFrame
dfOverUnderGoalsTotalFullTime['Date1'] = date1_column

# Display all columns when printing the DataFrame
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 6)

# Show the updated DataFrame
dfOverUnderGoalsTotalFullTime

## 1.6. Sum avg Goals Scored/Conceded

In [None]:
dfOverUnderGoalsTotalFullTime['Sum avg Goals Scored/Conceded'] = dfOverUnderGoalsTotalFullTime[['average_GS_Home', 'average_GC_Home','average_GS_Away', 'average_GC_Away']].sum(axis=1)
dfOverUnderGoalsTotalFullTime['Sum SD Goals Scored/Conceded'] = dfOverUnderGoalsTotalFullTime[['SD_GS_Home', 'SD_GC_Home','SD_GS_Away', 'SD_GC_Away']].sum(axis=1)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 6)

# Show the updated DataFrame
dfOverUnderGoalsTotalFullTime

## 1.7. Sorting the DataFrame by the probability of a game having more than 2.5 goals
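
The ranking in this section mixes sort directions: the probability columns are sorted in descending order and the summed goal columns in ascending order. A minimal plain-Python sketch of such a mixed-direction sort is shown below, using negated values for the descending keys. The matches, the shortened key names and the values are hypothetical, and only three of the notebook's five sort keys are represented.

```python
# Hypothetical fixtures with a lower bound, an average and a summed goal average
matches = [
    {"match": "A - B", "low": 60, "avg": 65.0, "sum_avg": 5.1},
    {"match": "C - D", "low": 70, "avg": 72.5, "sum_avg": 6.0},
    {"match": "E - F", "low": 60, "avg": 65.0, "sum_avg": 4.8},
]

# Negate the keys that should be descending; leave ascending keys as they are
ranked = sorted(matches, key=lambda m: (-m["low"], -m["avg"], m["sum_avg"]))
ranking = [m["match"] for m in ranked]
print(ranking)
```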

In [None]:
# Sort by the home/away probability bounds (descending), then by the summed goal averages and SDs (ascending), then by games played (descending)
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime.sort_values(by=['low bound home/away probability','average home/away probability','Sum avg Goals Scored/Conceded','Sum SD Goals Scored/Conceded','GP'], ascending=[False,False,True,True,False])
# Drop the last two columns (the 'Sum ...' helper columns, needed only for sorting)
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime.iloc[:, :-2]
dfOverUnderGoalsTotalFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsLast8FullTime = dfOverUnderGoalsLast8FullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsLast8FullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsHomeFullTime = dfOverUnderGoalsHomeFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsHomeFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsAwayFullTime = dfOverUnderGoalsAwayFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsAwayFullTime.reset_index(drop=True,inplace=True)

dfOverUnderGoalsTotalFullTime_v2_0 = dfOverUnderGoalsTotalFullTime.drop_duplicates(subset='Next match', keep='first')
dfOverUnderGoalsTotalFullTime_v2_0 = dfOverUnderGoalsTotalFullTime_v2_0.drop(columns=['Team', 'Avg', '0.5+', '1.5+', '2.5+', '3.5+', '4.5+', '5.5+', 'BTS', 'CS', 'FTS', 'WTN', 'LTN', 'average_GS', 'SD_GS', 'average_GC', 'SD_GC'])
dfOverUnderGoalsTotalFullTime_v2_0.reset_index(drop=True, inplace=True)

# Render the table in a scrollable 300px-high box
# (only the last expression of a cell is displayed, so a single HTML call is enough)
HTML(
    dfOverUnderGoalsTotalFullTime_v2_0.to_html()
    .replace('<table border="1" class="dataframe">', 
             '<table border="1" class="dataframe" style="display:block; height:300px; overflow-y:scroll;">')
)
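
The sort-then-deduplicate step above can be sketched on toy data. The sketch assumes that each fixture appears once per team in the table (so twice in total), which is what the `drop_duplicates(subset='Next match')` call suggests. The team names and numbers are made up.

```python
import pandas as pd

# Toy table: one row per team, so each fixture is listed twice (assumption).
toy = pd.DataFrame({
    "Team": ["A", "B", "C", "D"],
    "Next match": ["A - B", "A - B", "C - D", "C - D"],
    "low bound home/away probability": [60, 60, 75, 75],
    "GP": [14, 15, 13, 12],
})

# Highest probability first; ties are broken by the larger number of games played.
ranked = toy.sort_values(
    by=["low bound home/away probability", "GP"],
    ascending=[False, False],
)

# keep="first" retains the best-ranked row of each fixture after sorting.
unique_fixtures = (
    ranked.drop_duplicates(subset="Next match", keep="first")
    .reset_index(drop=True)
)
```

Because `keep="first"` depends on row order, the deduplication has to come after the sort, as it does in the cell above.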

## 1.8. Downloading the full-data database 

In [None]:
from IPython.display import FileLink

DateForFileName = "FullDatabase+2.5Goals_" + datetime.now().strftime("%d-%m-%Y") + ".xlsx"

with pd.ExcelWriter(DateForFileName) as writer:
    dfOverUnderGoalsTotalFullTime_v2_0.to_excel(writer,sheet_name='OverUnderGoalsTotalFullTime', index=False)
    dfresults.to_excel(writer,sheet_name='Results', index=False)
    dffixtures.to_excel(writer,sheet_name='Fixtures', index=False)

# 2. Treated database 

## 2.1. Keeping only cases with at least twelve games played and a lower-bound probability of at least 50

In [None]:
# Treated file: work on a copy so the full-data DataFrame is left untouched
Treated_dfOverUnderGoalsTotalFullTime_v2_0 = dfOverUnderGoalsTotalFullTime_v2_0.copy()

# Keep only rows with GP >= 12 (non-numeric values become NaN and are dropped)
Treated_dfOverUnderGoalsTotalFullTime_v2_0['GP'] = pd.to_numeric(Treated_dfOverUnderGoalsTotalFullTime_v2_0['GP'], errors='coerce')
Treated_dfOverUnderGoalsTotalFullTime_v2_0 = Treated_dfOverUnderGoalsTotalFullTime_v2_0[Treated_dfOverUnderGoalsTotalFullTime_v2_0['GP'] >= 12].copy()

# Keep only rows with low bound home/away probability >= 50
Treated_dfOverUnderGoalsTotalFullTime_v2_0['low bound home/away probability'] = pd.to_numeric(Treated_dfOverUnderGoalsTotalFullTime_v2_0['low bound home/away probability'], errors='coerce')
Treated_dfOverUnderGoalsTotalFullTime_v2_0 = Treated_dfOverUnderGoalsTotalFullTime_v2_0[Treated_dfOverUnderGoalsTotalFullTime_v2_0['low bound home/away probability'] >= 50].copy()
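
The two filters can be sketched on toy data as follows. The values are made up, and combining both conditions into one boolean mask is a choice made for this sketch rather than the notebook's own approach. Note that `errors="coerce"` turns unparseable entries into `NaN`, and a `NaN` never satisfies `>=`, so such rows are dropped silently.

```python
import pandas as pd

# Toy data stored as strings, as scraped tables often are (made-up values).
toy = pd.DataFrame({
    "GP": ["14", "11", "n/a", "20"],
    "low bound home/away probability": ["55", "80", "90", "45"],
})

# Work on a copy so the source table keeps its original values.
filtered = toy.copy()
for col in ["GP", "low bound home/away probability"]:
    filtered[col] = pd.to_numeric(filtered[col], errors="coerce")

# Both thresholds in a single mask: at least 12 games and probability of at least 50.
mask = (filtered["GP"] >= 12) & (filtered["low bound home/away probability"] >= 50)
filtered = filtered[mask].reset_index(drop=True)
```

Only the first toy row survives: the second has too few games, the third has an unparseable `GP`, and the fourth falls below the probability threshold.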

## 2.2. Computing Worst-Case and Best-Case Expected Goals

In [None]:
# Columns with blank names act as empty visual separators in the exported sheet
Treated_dfOverUnderGoalsTotalFullTime_v2_0[''] = None

# Worst case: each side's expected goals is the lower of its own scoring average
# and the opponent's conceding average
Treated_dfOverUnderGoalsTotalFullTime_v2_0['WorstCaseExpGS_Home'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0[['average_GS_Home','average_GC_Away']].min(axis=1)

Treated_dfOverUnderGoalsTotalFullTime_v2_0['WorstCaseExpGS_Away'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0[['average_GS_Away','average_GC_Home']].min(axis=1)

# Worst-case expected total goals in the match
Treated_dfOverUnderGoalsTotalFullTime_v2_0['WorstCaseExpResult'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0['WorstCaseExpGS_Home'] + Treated_dfOverUnderGoalsTotalFullTime_v2_0['WorstCaseExpGS_Away']

Treated_dfOverUnderGoalsTotalFullTime_v2_0['       '] = None

# Best case: the higher of the same two averages for each side
Treated_dfOverUnderGoalsTotalFullTime_v2_0['BestCaseExpGS_Home'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0[['average_GS_Home','average_GC_Away']].max(axis=1)

Treated_dfOverUnderGoalsTotalFullTime_v2_0['BestCaseExpGS_Away'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0[['average_GS_Away','average_GC_Home']].max(axis=1)

# Best-case expected total goals in the match
Treated_dfOverUnderGoalsTotalFullTime_v2_0['BestCaseExpResult'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0['BestCaseExpGS_Home'] + Treated_dfOverUnderGoalsTotalFullTime_v2_0['BestCaseExpGS_Away']

Treated_dfOverUnderGoalsTotalFullTime_v2_0['            '] = None
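
A worked example with made-up averages may help read these columns. Suppose the home team scores 1.8 goals per home game and the away team concedes 1.2 per away game, while the away team scores 1.0 and the home team concedes 1.4. The worst case takes the lower value of each pair and the best case the higher one.

```python
import pandas as pd

# One toy fixture with made-up averages.
toy = pd.DataFrame({
    "average_GS_Home": [1.8],
    "average_GC_Home": [1.4],
    "average_GS_Away": [1.0],
    "average_GC_Away": [1.2],
})

# Goals expected for the home side: its scoring average vs. the away side's conceding average.
home_pair = toy[["average_GS_Home", "average_GC_Away"]]
# Goals expected for the away side: its scoring average vs. the home side's conceding average.
away_pair = toy[["average_GS_Away", "average_GC_Home"]]

worst_total = home_pair.min(axis=1) + away_pair.min(axis=1)  # 1.2 + 1.0
best_total = home_pair.max(axis=1) + away_pair.max(axis=1)   # 1.8 + 1.4
```

Here the worst case totals 2.2 goals and the best case 3.2, so only the best-case scenario clears the 2.5 line.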

## 2.3. Best/Worst-Case Expected Result Evaluation

In [None]:
def evaluate_cases(row):
    if row['WorstCaseExpResult'] > 2.5 and row['BestCaseExpResult'] > 2.5:
        return "Both Cases Ok"
    elif row['BestCaseExpResult'] > 2.5:
        return "Best Case Scenario Ok"
    else:
        return "No Case Ok"

Treated_dfOverUnderGoalsTotalFullTime_v2_0['CaseEvaluation'] = Treated_dfOverUnderGoalsTotalFullTime_v2_0.apply(evaluate_cases, axis=1)

order = ["Both Cases Ok", "Best Case Scenario Ok", "No Case Ok"]
Treated_dfOverUnderGoalsTotalFullTime_v2_0['CaseEvaluation'] = pd.Categorical(
    Treated_dfOverUnderGoalsTotalFullTime_v2_0['CaseEvaluation'],
    categories=order,
    ordered=True
)
Treated_dfOverUnderGoalsTotalFullTime_v2_0 = Treated_dfOverUnderGoalsTotalFullTime_v2_0.sort_values(
    by=["CaseEvaluation", "low bound home/away probability","WorstCaseExpResult"],
    ascending=[True, False,False]
)
Treated_dfOverUnderGoalsTotalFullTime_v2_0.reset_index(drop=True, inplace=True)
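
The ordered `pd.Categorical` is what makes the sort follow the intended ranking instead of alphabetical order. A small sketch with made-up rows shows the difference; the match labels `M1` to `M3` are placeholders.

```python
import pandas as pd

order = ["Both Cases Ok", "Best Case Scenario Ok", "No Case Ok"]
toy = pd.DataFrame({
    "Next match": ["M1", "M2", "M3"],
    "CaseEvaluation": ["No Case Ok", "Both Cases Ok", "Best Case Scenario Ok"],
})

# Plain strings sort alphabetically, which puts "Best Case Scenario Ok" first.
alphabetical = toy.sort_values("CaseEvaluation")["Next match"].tolist()

# An ordered categorical sorts by position in `order` instead.
toy["CaseEvaluation"] = pd.Categorical(
    toy["CaseEvaluation"], categories=order, ordered=True
)
by_category = toy.sort_values("CaseEvaluation")["Next match"].tolist()
```

With `ascending=True` on the categorical column, "Both Cases Ok" rows come first, matching the sort in the cell above.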

## 2.4. View Final Treated Database 

In [None]:
# Render the treated table in a scrollable 300px-high box
# (only the last expression of a cell is displayed, so a single HTML call is enough)
HTML(
    Treated_dfOverUnderGoalsTotalFullTime_v2_0.to_html()
    .replace('<table border="1" class="dataframe">', 
             '<table border="1" class="dataframe" style="display:block; height:300px; overflow-y:scroll;">')
)

## 2.5. Downloading the treated database 

In [None]:
from IPython.display import FileLink

DateForTreatedFileName = "TreatedDatabase_+2.5Goals_" + datetime.now().strftime("%d-%m-%Y") + ".xlsx"
with pd.ExcelWriter(DateForTreatedFileName) as writer:
    Treated_dfOverUnderGoalsTotalFullTime_v2_0.to_excel(writer,sheet_name='TreatedData', index=False)