## What this notebook does

This project implements a large-scale, fully automated Python pipeline that scrapes football match statistics from **SoccerStats.com** across **60+ leagues**, processes the collected information, and computes probability indicators for matches finishing **under 2.5 goals**.

The script extracts:
- Previous results
- Upcoming fixtures
- Team scoring profiles
- Over/under performance metrics
- Expected goals distributions

It builds combined probability estimates using home/away tendencies and statistical variance.

All processed information is exported as Excel files with multiple sheets.

---

### Output Files

**1. Full Data**  
*Example filename:* `-2.5Goals_10-11-2025.xlsx`

**2. Treated Data**  
*Example filename:* `Treated_-2.5Goals_10-11-2025.xlsx`
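The example filenames suggest a date-stamped naming scheme. A minimal sketch of how such names could be built, assuming the stamp is day-month-year (the source does not say whether `10-11-2025` is day-first or month-first) and using a hypothetical helper `output_filenames` that is not part of the notebook:

```python
from datetime import date


def output_filenames(run_date, date_format="%d-%m-%Y"):
    """Return (full, treated) Excel filenames for a run date.

    Hypothetical helper; the day-first date_format is an assumption.
    """
    stamp = run_date.strftime(date_format)
    full = f"-2.5Goals_{stamp}.xlsx"
    treated = f"Treated_{full}"
    return full, treated


full_name, treated_name = output_filenames(date(2025, 11, 10))
print(full_name)     # -2.5Goals_10-11-2025.xlsx
print(treated_name)  # Treated_-2.5Goals_10-11-2025.xlsx
```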

# 0. Import libraries

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re, os
import statistics
from datetime import datetime, timedelta

# 1. Web scraping

## 1.1. Scraping DataFrames OverUnderGoalsTotalFullTime, OverUnderGoalsLast8FullTime, OverUnderGoalsHomeFullTime, and OverUnderGoalsAwayFullTime, sorted by percentage of games with under 2.5 goals

### 1.1.1. Scraping the DataFrames from the web

In [2]:
##################################################################################################################
# Builds four DataFrames from the SoccerStats over/under tables:
# OverUnderGoalsTotalFullTime, OverUnderGoalsLast8FullTime,
# OverUnderGoalsHomeFullTime and OverUnderGoalsAwayFullTime,
# later sorted by the percentage of games with under 2.5 goals.
#
# URLs of the pages to scrape, one per league.
# Only the URLs listed in ListUrls further down are actually requested.
argentina_url = 'https://www.soccerstats.com/table.asp?league=argentina&tid=c'
argentina5_url = 'https://www.soccerstats.com/table.asp?league=argentina5&tid=c'
austria_url = 'https://www.soccerstats.com/table.asp?league=austria&tid=c'
austria2_url = 'https://www.soccerstats.com/table.asp?league=austria2&tid=c'
belgium_url = 'https://www.soccerstats.com/table.asp?league=belgium&tid=c'
belgium2_url = 'https://www.soccerstats.com/table.asp?league=belgium2&tid=c'
brazil_url = 'https://www.soccerstats.com/table.asp?league=brazil&tid=c'
brazil2_url = 'https://www.soccerstats.com/table.asp?league=brazil2&tid=c'
bulgaria_url = 'https://www.soccerstats.com/table.asp?league=bulgaria&tid=c'
canada_url = 'https://www.soccerstats.com/table.asp?league=canada&tid=c'
chile_url = 'https://www.soccerstats.com/table.asp?league=chile&tid=c'
chile2_url = 'https://www.soccerstats.com/table.asp?league=chile2&tid=c'
colombia2_url = 'https://www.soccerstats.com/table.asp?league=colombia2&tid=c'
costarica_url = 'https://www.soccerstats.com/table.asp?league=costarica&tid=c'
croatia_url = 'https://www.soccerstats.com/table.asp?league=croatia&tid=c'
croatia2_url = 'https://www.soccerstats.com/table.asp?league=croatia2&tid=c'
cyprus_url = 'https://www.soccerstats.com/table.asp?league=cyprus&tid=c'
czechrepublic_url = 'https://www.soccerstats.com/table.asp?league=czechrepublic&tid=c'
czechrepublic2_url = 'https://www.soccerstats.com/table.asp?league=czechrepublic2&tid=c'
denmark_url = 'https://www.soccerstats.com/table.asp?league=denmark&tid=c'
denmark2_url = 'https://www.soccerstats.com/table.asp?league=denmark2&tid=c'
ecuador3_url = 'https://www.soccerstats.com/table.asp?league=ecuador3&tid=c'
england_url = 'https://www.soccerstats.com/table.asp?league=england&tid=c'
england2_url = 'https://www.soccerstats.com/table.asp?league=england2&tid=c'
england3_url = 'https://www.soccerstats.com/table.asp?league=england3&tid=c'
finland_url = 'https://www.soccerstats.com/table.asp?league=finland&tid=c'
finland2_url = 'https://www.soccerstats.com/table.asp?league=finland2&tid=c'
france_url = 'https://www.soccerstats.com/table.asp?league=france&tid=c'
france2_url = 'https://www.soccerstats.com/table.asp?league=france2&tid=c'
france3_url = 'https://www.soccerstats.com/table.asp?league=france3&tid=c'
germany_url = 'https://www.soccerstats.com/table.asp?league=germany&tid=c'
germany2_url = 'https://www.soccerstats.com/table.asp?league=germany2&tid=c'
germany3_url = 'https://www.soccerstats.com/table.asp?league=germany3&tid=c'
greece_url = 'https://www.soccerstats.com/table.asp?league=greece&tid=c'
guatemala_url = 'https://www.soccerstats.com/table.asp?league=guatemala&tid=c'
hungary_url = 'https://www.soccerstats.com/table.asp?league=hungary&tid=c'
hungary2_url = 'https://www.soccerstats.com/table.asp?league=hungary2&tid=c'
iceland_url = 'https://www.soccerstats.com/table.asp?league=iceland&tid=c'
ireland_url = 'https://www.soccerstats.com/table.asp?league=ireland&tid=c'
ireland2_url = 'https://www.soccerstats.com/table.asp?league=ireland2&tid=c'
israel_url = 'https://www.soccerstats.com/table.asp?league=israel&tid=c'
italy_url = 'https://www.soccerstats.com/table.asp?league=italy&tid=c'
italy2_url = 'https://www.soccerstats.com/table.asp?league=italy2&tid=c'
italy3_url = 'https://www.soccerstats.com/table.asp?league=italy3&tid=c'
japan_url = 'https://www.soccerstats.com/table.asp?league=japan&tid=c'
japan2_url = 'https://www.soccerstats.com/table.asp?league=japan2&tid=c'
jordan_url = 'https://www.soccerstats.com/table.asp?league=jordan&tid=c'
kuwait_url = 'https://www.soccerstats.com/table.asp?league=kuwait&tid=c'
mexico_url = 'https://www.soccerstats.com/table.asp?league=mexico&tid=c'
mexico2_url = 'https://www.soccerstats.com/table.asp?league=mexico2&tid=c'
netherlands_url = 'https://www.soccerstats.com/table.asp?league=netherlands&tid=c'
netherlands2_url = 'https://www.soccerstats.com/table.asp?league=netherlands2&tid=c'
northernireland_url = 'https://www.soccerstats.com/table.asp?league=northernireland&tid=c'
norway_url = 'https://www.soccerstats.com/table.asp?league=norway&tid=c'
norway2_url = 'https://www.soccerstats.com/table.asp?league=norway2&tid=c'
oman_url = 'https://www.soccerstats.com/table.asp?league=oman&tid=c'
paraguay2_url = 'https://www.soccerstats.com/table.asp?league=paraguay2&tid=c'
peru2_url = 'https://www.soccerstats.com/table.asp?league=peru2&tid=c'
poland_url = 'https://www.soccerstats.com/table.asp?league=poland&tid=c'
poland2_url = 'https://www.soccerstats.com/table.asp?league=poland2&tid=c'
portugal_url = 'https://www.soccerstats.com/table.asp?league=portugal&tid=c'
portugal2_url = 'https://www.soccerstats.com/table.asp?league=portugal2&tid=c'
qatar_url = 'https://www.soccerstats.com/table.asp?league=qatar&tid=c'
romania_url = 'https://www.soccerstats.com/table.asp?league=romania&tid=c'
saudiarabia_url = 'https://www.soccerstats.com/table.asp?league=saudiarabia&tid=c'
scotland_url = 'https://www.soccerstats.com/table.asp?league=scotland&tid=c'
scotland2_url = 'https://www.soccerstats.com/table.asp?league=scotland2&tid=c'
slovakia_url = 'https://www.soccerstats.com/table.asp?league=slovakia&tid=c'
slovenia_url = 'https://www.soccerstats.com/table.asp?league=slovenia&tid=c'
southkorea_url = 'https://www.soccerstats.com/table.asp?league=southkorea&tid=c'
southkorea2_url = 'https://www.soccerstats.com/table.asp?league=southkorea2&tid=c'
spain_url = 'https://www.soccerstats.com/table.asp?league=spain&tid=c'
spain2_url = 'https://www.soccerstats.com/table.asp?league=spain2&tid=c'
sweden_url = 'https://www.soccerstats.com/table.asp?league=sweden&tid=c'
sweden2_url = 'https://www.soccerstats.com/table.asp?league=sweden2&tid=c'
switzerland_url = 'https://www.soccerstats.com/table.asp?league=switzerland&tid=c'
switzerland2_url = 'https://www.soccerstats.com/table.asp?league=switzerland2&tid=c'
thailand_url = 'https://www.soccerstats.com/table.asp?league=thailand&tid=c'
turkey_url = 'https://www.soccerstats.com/table.asp?league=turkey&tid=c'
turkey2_url = 'https://www.soccerstats.com/table.asp?league=turkey2&tid=c'
ukraine_url = 'https://www.soccerstats.com/table.asp?league=ukraine&tid=c'
unitedarabemirates_url = 'https://www.soccerstats.com/table.asp?league=unitedarabemirates&tid=c'
uruguay_url = 'https://www.soccerstats.com/table.asp?league=uruguay&tid=c'
usa_url = 'https://www.soccerstats.com/table.asp?league=usa&tid=c'
usa2_url = 'https://www.soccerstats.com/table.asp?league=usa2&tid=c'
venezuela_url = 'https://www.soccerstats.com/table.asp?league=venezuela&tid=c'
wales_url = 'https://www.soccerstats.com/table.asp?league=wales&tid=c'

ListUrls = [argentina_url,argentina5_url,austria_url,austria2_url,belgium_url,belgium2_url,brazil_url,brazil2_url,bulgaria_url,canada_url,chile_url,chile2_url,colombia2_url,costarica_url,croatia_url,czechrepublic_url,czechrepublic2_url,denmark_url,denmark2_url,ecuador3_url,england_url,england2_url,england3_url,finland2_url,france_url,france2_url,germany2_url,germany3_url,guatemala_url,hungary_url,hungary2_url,iceland_url,ireland_url,ireland2_url,japan_url,japan2_url,mexico_url,netherlands_url,netherlands2_url,northernireland_url,norway_url,norway2_url,paraguay2_url,peru2_url,poland_url,poland2_url,portugal_url,portugal2_url,qatar_url,romania_url,scotland_url,scotland2_url,slovenia_url,southkorea_url,southkorea2_url,sweden_url,sweden2_url,switzerland_url,switzerland2_url,thailand_url,usa_url,usa2_url]
Continent =['America','America','Europe','Europe','Europe','Europe','America','America','Europe','America','America','America','America','America','Europe','Europe','Europe','Europe','Europe','America','Europe','Europe','Europe','Europe','Europe','Europe','Europe','Europe','America','Europe','Europe','Europe','Europe','Europe','Asia','Asia','America','Europe','Europe','Europe','Europe','Europe','America','America','Europe','Europe','Europe','Europe','Asia','Europe','Europe','Europe','Europe','Asia','Asia','Europe','Europe','Europe','Europe','Asia','America','America']
League = ['Argentina (D1)','Argentina (D2)','Austria (D1)','Austria (D2)','Belgium (D1)','Belgium (D2)','Brazil (D1)','Brazil (D2)','Bulgaria (D1)','Canada (D1)','Chile (D1)','Chile (D2)','Colombia (D1, Clausura)','Costa Rica (D1, Apertura)','Croatia (D1)','Czech Republic (D1)','Czech Republic (D2)','Denmark (D1)','Denmark (D2)','Ecuador (D1)','England (D1)','England (D2)','England (D3)','Finland (D2)','France (D1)','France (D2)','Germany (D2)','Germany (D3)','Guatemala (D1, Apertura)','Hungary (D1)','Hungary (D2)','Iceland (D1)','Ireland (D1)','Ireland (D2)','Japan (D1)','Japan (D2)','Mexico (D1)','Netherlands (D1)','Netherlands (D2)','Northern Ireland (D1)','Norway (D1)','Norway (D2)','Paraguay (D1, Apertura)','Peru (D1, Apertura)','Poland (D1)','Poland (D2)','Portugal (D1)','Portugal (D2)','Qatar (D1)','Romania (D1)','Scotland (D1)','Scotland (D2)','Slovenia (D1)','South Korea (D1)','South Korea (D2)','Sweden (D1)','Sweden (D2)','Switzerland (D1)','Switzerland (D2)','Thailand (D1)','USA (D1)','USA (D2)']
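`ListUrls`, `Continent`, and `League` are parallel lists: position *n* in each describes the same league, so a missing or extra entry would silently mislabel every league after it. A minimal sketch of a sanity check, using hypothetical helpers `league_key` and `check_aligned` (neither is part of the notebook) on a three-league excerpt of the lists above:

```python
from urllib.parse import urlparse, parse_qs


def league_key(url):
    """Extract the value of the `league` query parameter from a table URL."""
    return parse_qs(urlparse(url).query)["league"][0]


def check_aligned(urls, continents, leagues):
    """Raise ValueError if the lists differ in length.

    Otherwise return (league key, continent, league label) triples
    so the pairing can be inspected by eye.
    """
    if not (len(urls) == len(continents) == len(leagues)):
        raise ValueError(
            f"length mismatch: {len(urls)} urls, "
            f"{len(continents)} continents, {len(leagues)} leagues"
        )
    return [(league_key(u), c, l) for u, c, l in zip(urls, continents, leagues)]


sample_urls = [
    'https://www.soccerstats.com/table.asp?league=argentina&tid=c',
    'https://www.soccerstats.com/table.asp?league=austria&tid=c',
    'https://www.soccerstats.com/table.asp?league=japan&tid=c',
]
sample_continents = ['America', 'Europe', 'Asia']
sample_leagues = ['Argentina (D1)', 'Austria (D1)', 'Japan (D1)']

for triple in check_aligned(sample_urls, sample_continents, sample_leagues):
    print(triple)
```

Running the same check on the full lists before scraping would surface a length mismatch immediately instead of after 60+ requests.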

In [6]:
# One accumulator per statistics table. Each team appears once per table,
# in this order: total, last 8, home, away.
rowsTotalFullTime = []
rowsLast8FullTime = []
rowsHomeFullTime = []
rowsAwayFullTime = []
rowsByOccurrence = [rowsTotalFullTime, rowsLast8FullTime, rowsHomeFullTime, rowsAwayFullTime]

for url, continent, league in zip(ListUrls, Continent, League):
    # Perform the GET request
    response = requests.get(url)
    if response.status_code != 200:
        # Report the failing URL and skip it instead of parsing an error page
        print(url)
        continue

    # Parse the page content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the tables with the desired stats
    tables = soup.find_all('table', {'id': 'btable'})

    rows = []
    for table in tables:
        for tr in table.find_all('tr')[1:]:  # Skip the header row
            cells = tr.find_all('td')
            rows.append([cell.text.strip() for cell in cells])

    # Keep only data rows (they contain percentages) and drop the league average
    rows = [sublist for sublist in rows if any('%' in item for item in sublist)]
    rows = [sublist for sublist in rows if 'League average' not in sublist]

    # Dictionary to count occurrences of each team on this page
    team_counts = {}

    for sublist in rows:
        team = sublist[0]
        team_counts[team] = team_counts.get(team, 0) + 1

        # The n-th occurrence of a team belongs to the n-th table;
        # occurrences beyond the fourth are ignored
        if team_counts[team] <= len(rowsByOccurrence):
            rowsByOccurrence[team_counts[team] - 1].append([continent, league] + sublist)

headers = ["Continent","League","Team","GP","Avg","0.5+","1.5+","2.5+","3.5+","4.5+","5.5+","BTS","CS","FTS","WTN","LTN"]

dfOverUnderGoalsTotalFullTime = pd.DataFrame(rowsTotalFullTime, columns=headers)
dfOverUnderGoalsLast8FullTime = pd.DataFrame(rowsLast8FullTime, columns=headers)
dfOverUnderGoalsHomeFullTime = pd.DataFrame(rowsHomeFullTime, columns=headers)
dfOverUnderGoalsAwayFullTime = pd.DataFrame(rowsAwayFullTime, columns=headers)

https://www.soccerstats.com/table.asp?league=ecuador3&tid=c


| Unnamed: 0 | Continent | League | Team | GP | Avg | 0.5+ | 1.5+ | 2.5+ | 3.5+ | 4.5+ | 5.5+ | BTS | CS | FTS | WTN | LTN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | America | Argentina (D1) | Aldosivi | 16 | 2.88 | 100% | 81% | 56% | 31% | 12% | 6% | 44% | 12% | 44% | 12% | 44% |
| 1 | America | Argentina (D1) | Central Cordoba | 16 | 2.69 | 94% | 75% | 56% | 25% | 12% | 6% | 50% | 31% | 25% | 25% | 19% |
| 2 | America | Argentina (D1) | Racing Club | 16 | 2.63 | 100% | 69% | 44% | 31% | 19% | 0% | 44% | 38% | 19% | 38% | 19% |
| 3 | America | Argentina (D1) | Defensa y J. | 16 | 2.50 | 94% | 69% | 44% | 25% | 19% | 0% | 50% | 31% | 25% | 25% | 19% |
| 4 | America | Argentina (D1) | A. Tucuman | 16 | 2.38 | 100% | 62% | 44% | 19% | 12% | 0% | 38% | 25% | 38% | 25% | 38% |
| 5 | America | Argentina (D1) | Barracas C. | 16 | 2.38 | 94% | 69% | 44% | 19% | 6% | 6% | 56% | 31% | 19% | 25% | 12% |
| 6 | America | Argentina (D1) | Estudiantes | 16 | 2.31 | 88% | 69% | 38% | 31% | 6% | 0% | 44% | 31% | 38% | 19% | 25% |
| 7 | America | Argentina (D1) | I. Rivadavia | 16 | 2.31 | 88% | 75% | 38% | 19% | 12% | 0% | 50% | 38% | 25% | 25% | 12% |
| 8 | America | Argentina (D1) | Belgrano | 16 | 2.25 | 88% | 81% | 31% | 25% | 0% | 0% | 50% | 31% | 31% | 19% | 19% |
| 9 | America | Argentina (D1) | Instituto | 16 | 2.25 | 94% | 69% | 38% | 12% | 12% | 0% | 31% | 31% | 44% | 25% | 38% |
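
The scraping loop relies on each team appearing once per statistics table, in a fixed order: the first occurrence is treated as the full-season row, the second as last 8, the third as home, and the fourth as away. A minimal sketch of that routing on made-up rows (the team names and labels below are placeholders, not scraped data, and `split_by_occurrence` is a hypothetical helper):

```python
from collections import defaultdict


def split_by_occurrence(rows, n_buckets=4):
    """Route the k-th row seen for each team (first cell) into bucket k.

    Rows beyond the n_buckets-th occurrence of a team are ignored.
    """
    buckets = [[] for _ in range(n_buckets)]
    seen = defaultdict(int)
    for row in rows:
        team = row[0]
        if seen[team] < n_buckets:
            buckets[seen[team]].append(row)
        seen[team] += 1
    return buckets


rows = [
    ["Team A", "total"], ["Team B", "total"],
    ["Team A", "last8"], ["Team B", "last8"],
    ["Team A", "home"], ["Team B", "home"],
    ["Team A", "away"], ["Team B", "away"],
]

total, last8, home, away = split_by_occurrence(rows)
print(home)  # [['Team A', 'home'], ['Team B', 'home']]
```

If a page lists its tables in a different order, or a team name shows up in an extra table, rows land in the wrong bucket, so this assumption is worth rechecking whenever the site layout changes.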


### 1.1.2. Proper column names and DataFrames sorted by percentage of games with under 2.5 goals

In [None]:
# Percentage columns are scraped as strings such as '56%'
columns_to_convert = ['0.5+', '1.5+', '2.5+', '3.5+', '4.5+', '5.5+', 'BTS', 'CS', 'FTS', 'WTN', 'LTN']
# Remove the '%' character and convert to integer
for column in columns_to_convert:
    dfOverUnderGoalsTotalFullTime[column] = dfOverUnderGoalsTotalFullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsLast8FullTime[column] = dfOverUnderGoalsLast8FullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsHomeFullTime[column] = dfOverUnderGoalsHomeFullTime[column].str.replace('%', '').astype(int)
    dfOverUnderGoalsAwayFullTime[column] = dfOverUnderGoalsAwayFullTime[column].str.replace('%', '').astype(int)

# GP (games played) is also scraped as text; convert it so that the tie-break
# below sorts numerically ('9' would otherwise sort above '16')
dfOverUnderGoalsTotalFullTime['GP'] = dfOverUnderGoalsTotalFullTime['GP'].astype(int)
dfOverUnderGoalsLast8FullTime['GP'] = dfOverUnderGoalsLast8FullTime['GP'].astype(int)
dfOverUnderGoalsHomeFullTime['GP'] = dfOverUnderGoalsHomeFullTime['GP'].astype(int)
dfOverUnderGoalsAwayFullTime['GP'] = dfOverUnderGoalsAwayFullTime['GP'].astype(int)

# Sort each DataFrame by the '2.5+' column, then by 'GP', both in descending order
dfOverUnderGoalsTotalFullTime = dfOverUnderGoalsTotalFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsTotalFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsLast8FullTime = dfOverUnderGoalsLast8FullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsLast8FullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsHomeFullTime = dfOverUnderGoalsHomeFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsHomeFullTime.reset_index(drop=True,inplace=True)
dfOverUnderGoalsAwayFullTime = dfOverUnderGoalsAwayFullTime.sort_values(by=['2.5+','GP'], ascending=[False,False])
dfOverUnderGoalsAwayFullTime.reset_index(drop=True,inplace=True)
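
Because match goal totals are whole numbers, every game is either over or under 2.5 goals, so the under-2.5 share is simply 100 minus the `2.5+` percentage. A minimal sketch in plain Python, using three rows from the sample output above; the helper names `pct_to_int` and `under_25` are illustrative and not part of the notebook:

```python
def pct_to_int(text):
    """Convert a scraped percentage such as '56%' to the integer 56."""
    return int(text.replace('%', '').strip())


def under_25(over_25_pct):
    """Share of games with under 2.5 goals, given the over-2.5 share."""
    return 100 - over_25_pct


# (team, '2.5+' value) pairs taken from the sample output
sample = [
    ("Aldosivi", "56%"),
    ("Racing Club", "44%"),
    ("Belgrano", "31%"),
]

# Rank teams by their under-2.5 share, highest first
ranked = sorted(
    ((team, under_25(pct_to_int(pct))) for team, pct in sample),
    key=lambda item: item[1],
    reverse=True,
)

for team, under in ranked:
    print(team, under)
# Belgrano 69
# Racing Club 56
# Aldosivi 44
```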