# **📊 FBref Multi-League Multi-Season Scraper**  
**Using Selenium (Firefox + GeckoDriverManager)**


----
### **📦📌🛠️ Python & Selenium Imports Overview**
---
| Import | Purpose / Use |
|--------|---------------|
| `from selenium import webdriver` | Provides the main interface to control web browsers via Selenium. Used to open, navigate, and interact with webpages. |
| `from selenium.webdriver.firefox.service import Service` | Used to specify the Firefox driver executable (GeckoDriver) service when initializing the Firefox WebDriver. |
| `from selenium.webdriver.firefox.options import Options` | Allows configuring Firefox browser options such as window size, headless mode, and other preferences. |
| `from selenium.webdriver.common.by import By` | Provides the locator strategies (e.g., `By.ID`, `By.CLASS_NAME`, `By.XPATH`) that are passed to `find_element()` to locate elements on a webpage. |
| `from webdriver_manager.firefox import GeckoDriverManager` | Automatically downloads and manages the correct version of GeckoDriver (Firefox WebDriver) for your system. |
| `from io import StringIO` | Allows treating a string as a file-like object. Used here to read HTML tables into pandas DataFrames using `pd.read_html()`. |
| `import pandas as pd` | The main data manipulation library in Python. Used for creating, cleaning, and merging DataFrames. |
| `import time` | Provides sleep and time functions, used here to add delays between requests for safer web scraping. |
| `import random` | Provides random number functions, used here to vary sleep times and mimic human-like browsing behavior. |


In [1]:
# ---------------- IMPORTS ---------------- #

# Selenium imports → used for automating browser actions
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By

# Automatically installs the correct GeckoDriver (Firefox driver)
from webdriver_manager.firefox import GeckoDriverManager

# Other Python imports
from io import StringIO
import pandas as pd
import time
import random

### **📘 1) Understanding FBref Website Structure for Data Scraping**

**FBref.com** is one of the most detailed and reliable sources for football statistics.  
It provides rich data across multiple leagues, seasons, teams, and individual players.

This section explains **how FBref URLs work**, how **league IDs** are assigned, and how **HTML tables** are structured, so you can scrape them correctly.

---

##### **🏆 What FBref Provides**

FBref contains an enormous collection of football data:

- League & team standings  
- Player match logs  
- Detailed shooting, passing, possession, and defensive metrics  
- Goalkeeping analytics  
- Advanced xG/xAG metrics  
- Progressive passing and carrying stats  

---

##### **🔢 FBref League IDs (Competition IDs)**

Each league on FBref has a unique **competition ID**, used inside the URLs.

| League | Competition ID |
|--------|-----------------|
| Premier League **(England)** | `9` |
| La Liga **(Spain)** | `12` |
| Serie A **(Italy)** | `11` |
| Bundesliga **(Germany)** | `20` |
| Ligue 1 **(France)** | `13` |
| Primeira Liga **(Portugal)** | `32` |
| Eredivisie **(Netherlands)** | `23` |

---

##### **🌐 FBref URL Structure**

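The league stats pages scraped in this notebook follow this structure:

`https://fbref.com/en/comps/<LEAGUE_ID>/<SEASON>/<CATEGORY>/<SEASON>-<LEAGUE-NAME>-Stats`

Here `<SEASON>` is the season string (e.g. `2023-2024`), `<CATEGORY>` is the stats page (`stats`, `shooting`, `passing`, or `gca`), and `<LEAGUE-NAME>` is the league name with spaces replaced by hyphens.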
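
As a minimal sketch of how these parts combine, the helper below assembles one URL the same way the scraping loop later in this notebook does: the season string fills the year slot and is repeated at the start of the final slug. The name `build_fbref_url` is hypothetical and only used for illustration.

```python
def build_fbref_url(league_id: int, season: str, category: str, league_name: str) -> str:
    """Assemble a league stats URL from its four parts (hypothetical helper)."""
    # The final slug repeats the season and uses hyphens instead of spaces
    slug = f"{season}-{league_name.replace(' ', '-')}-Stats"
    return f"https://fbref.com/en/comps/{league_id}/{season}/{category}/{slug}"


# Premier League (competition ID 9), shooting page, 2023-2024 season
example_url = build_fbref_url(9, "2023-2024", "shooting", "Premier League")
print(example_url)
```

Keeping URL construction in one place makes it easier to adjust if a league's slug ever differs from its display name.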

---
### **🖥️ 2) Selenium Browser Setup**  
Using **Firefox** + **GeckoDriverManager** (auto-install)


In [2]:
# ------------------------------------------------------------
#  1. Configure Firefox Browser Options
# ------------------------------------------------------------
options = Options()

# Set browser window width (useful for consistent rendering)
options.add_argument("--width=1280")

# Set browser height (ensures full table visibility)
options.add_argument("--height=800")

# Optional: Run browser in headless mode (no visible UI)
# This is useful for servers / background scraping
# options.add_argument("--headless")


# ------------------------------------------------------------
#  2. Create Firefox WebDriver using GeckoDriver
# ------------------------------------------------------------
driver = webdriver.Firefox(
    service=Service(GeckoDriverManager().install()),  # Auto-installs correct driver version
    options=options                                   # Apply browser settings above
)


---
### **🌍 3) Set Leagues and Seasons**  
- **Competition IDs** + **Season Folders**
---
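
To get a feel for the size of the job, the sketch below enumerates every league, season, and table combination using small stand-in dictionaries. The names `demo_leagues`, `demo_seasons`, and `demo_tables` are illustrative and not part of the notebook. With the full configuration of 7 leagues, 6 seasons, and the 4 stats pages scraped later, the same product gives 168 page loads.

```python
from itertools import product

# Small stand-ins for the notebook's `leagues` and `seasons` dictionaries
demo_leagues = {"Premier League": 9, "La Liga": 12}
demo_seasons = {"2022-2023": 2023, "2023-2024": 2024}
demo_tables = ["stats", "shooting", "passing", "gca"]

# Every (league, season, table) combination corresponds to one page request
jobs = list(product(demo_leagues, demo_seasons, demo_tables))
print(len(jobs))  # 2 leagues x 2 seasons x 4 tables = 16
```

Since each page load is followed by a randomized 4 to 7 second pause, a full run of 168 pages spends roughly 11 to 20 minutes waiting, before counting page load time and retries.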

In [3]:
# ------------------------------------------------------------
# 1. Set the League IDs used in URL
# ------------------------------------------------------------
leagues = {
    "Primeira Liga": 32,
    "La Liga": 12,
    "Serie A": 11,
    "Bundesliga": 20,
    "Ligue 1": 13,
    "Premier League": 9,
    "Eredivisie": 23
}

# ------------------------------------------------------------
# 2. Map each season string (as used in the URL) to the end year
#    that is stored in the "Season" column
# ------------------------------------------------------------
seasons = {
    "2018-2019": 2019,
    "2019-2020": 2020,
    "2020-2021": 2021,
    "2021-2022": 2022,
    "2022-2023": 2023,
    "2023-2024": 2024
}

# ------------------------------------------------------------------------
# 3. Collect one merged DataFrame per league-season combination
# ------------------------------------------------------------------------
all_data = []


---
### **📓 4) Function that scrapes tabular data from the Website**  
- Based on **URL**, **table_id** + **season** + **league_name**
---
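
The cleaning steps inside this function are easier to follow on a tiny example. The sketch below builds a hand-made DataFrame with two header levels and a repeated header row, which is an assumption about how a parsed FBref table looks, then applies the same flattening and filtering ideas. The player names and values are placeholders.

```python
import pandas as pd

# Hand-built stand-in for a parsed FBref table: two header levels,
# plus a repeated header row in the body (an assumed layout)
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Player"),
    ("Performance", "Gls"),
])
raw = pd.DataFrame(
    [["1", "Player A", "5"], ["Rk", "Player", "Gls"], ["2", "Player B", "3"]],
    columns=columns,
)

# Flatten the two header levels into single strings such as "Performance_Gls"
raw.columns = ["_".join(col).strip() for col in raw.columns.values]

# Strip the placeholder prefix pandas assigns to blank top-level headers
raw.columns = raw.columns.str.replace(r"Unnamed: \d+_level_0_", "", regex=True)

# Drop rows where the header text reappears inside the table body
clean = raw[raw["Player"].str.lower() != "player"].reset_index(drop=True)
print(clean)
```

Because the repeated header rows are text, every column is read as strings; numeric conversion is left to a later cleaning step.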

In [4]:
def get_fbref_table(url, table_id, season, league_name, max_retries=3):
    """Open an FBref page with Selenium, extract one stats table, clean it,
    and return it as a pandas DataFrame. Failed attempts are retried.

    PARAMETERS:
    url (str)         -> full URL of the page to scrape
    table_id (str)    -> HTML table ID (e.g. 'stats_standard')
    season (str)      -> season key such as '2023-2024' (must exist in `seasons`)
    league_name (str) -> league name such as 'Premier League'
    max_retries (int) -> maximum number of attempts before giving up (default 3)

    Returns an empty DataFrame if every attempt fails.
    """

    attempt = 1
    # ------------------------------------------------------------
    # Scraping loop that automatically retries whenever a table fails due to:
    # 1) Timeouts
    # 2) Slow Selenium load
    # 3) Temporary FBref blocking
    # 4) Partial page loads
    # It retries up to `max_retries` times (3 by default) before skipping the table.
    # ------------------------------------------------------------
    while attempt <= max_retries:
        try:
            # ------------------------------------------------------------
            # 1. Load the webpage using Selenium
            # ------------------------------------------------------------
            driver.get(url)

            # Random wait to mimic human browsing & avoid FBref blocking
            time.sleep(random.uniform(4, 7))

            # ------------------------------------------------------------
            # 2. Locate the table using its HTML ID
            # ------------------------------------------------------------
            table = driver.find_element(By.ID, table_id)

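            # ------------------------------------------------------------
            # 3. Parse the table's HTML into a DataFrame
            # ------------------------------------------------------------
            html = table.get_attribute("outerHTML")
            df = pd.read_html(StringIO(html))[0]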

            # ------------------------------------------------------------
            # 4. Fix MultiIndex headers
            # ------------------------------------------------------------
            if isinstance(df.columns, pd.MultiIndex):
                df.columns = ['_'.join(col).strip() for col in df.columns.values]

            # ------------------------------------------------------------
            # 5. Clean column names
            # ------------------------------------------------------------
            df.columns = (
                df.columns
                .str.replace("Unnamed: \\d+_level_0_", "", regex=True)
                .str.replace("Standard Stats_", "")
                .str.replace("Shooting_", "")
                .str.replace("Passing_", "")
                .str.replace("GCA_", "")
                .str.replace(" ", "_")
                .str.replace("-", "_")
                .str.strip("_")
            )

            # ------------------------------------------------------------
            # 6. Normalize Player column
            # ------------------------------------------------------------
            player_col = [c for c in df.columns if "Player" in c or "player" in c][0]
            df.rename(columns={player_col: "Player"}, inplace=True)

            # ------------------------------------------------------------
            # 7. Drop repeated header rows
            # ------------------------------------------------------------
            df = df[df["Player"].notna()]
            df = df[df["Player"].str.lower() != "player"]
            df = df[df["Player"].str.lower() != "matches"]

            # ------------------------------------------------------------
            # 8. Add metadata
            # ------------------------------------------------------------
            df["Season"] = seasons[season]
            df["League"] = league_name

            return df

        except Exception as e:
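            # Back-off schedule; the index is clamped so that a max_retries
            # value above 3 cannot raise an IndexError
            wait_time = [3, 6, 10][min(attempt - 1, 2)]

            print(f"⚠️ Error on attempt {attempt}/{max_retries} scraping {league_name} {season}")
            print(f"   Error: {e}")
            print(f"⏳ Retrying in {wait_time} seconds...\n")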

            time.sleep(wait_time)
            attempt += 1

    # ------------------------------------------------------------
    # ❌ After all retries failed - return empty df
    # ------------------------------------------------------------
    print(f"❌ Failed to scrape {league_name} {season} after {max_retries} attempts.")
    return pd.DataFrame()
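
The retry logic above can also be looked at in isolation. The following is a rough sketch of the same idea as a standalone helper, under the assumption that any exception counts as a retryable failure. The name `retry_with_backoff` and its `sleep` parameter are hypothetical; the `sleep` hook exists only so the example can run without real delays.

```python
import time


def retry_with_backoff(action, max_retries=3, waits=(3, 6, 10), sleep=time.sleep):
    """Call `action` until it succeeds or `max_retries` attempts are used.

    Hypothetical helper: returns the action's result, or None if every attempt fails.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_retries:
                print(f"Giving up after {attempt} attempts: {exc}")
                return None
            # Clamp the index so a larger max_retries reuses the longest wait
            wait_time = waits[min(attempt - 1, len(waits) - 1)]
            print(f"Attempt {attempt} failed, retrying in {wait_time}s")
            sleep(wait_time)


# Usage with a fake action that fails twice, then succeeds (no real sleeping)
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result)
```

Unlike the function above, this sketch does not sleep after the final failed attempt, since no retry follows it.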


---
### **🔁 5) Main Scraping Loop**
- Scrapes **Standard** + **Shooting** + **Passing** + **GCA** Tables  
---
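
Before reading the loop, it may help to see what an inner merge with suffixes does on a toy example. The two frames below are made-up stand-ins for a standard table and a shooting table; the values are placeholders.

```python
import pandas as pd

# Tiny stand-ins for two stats tables sharing key columns and a "Gls" column
std = pd.DataFrame({
    "Season": [2024, 2024], "League": ["Premier League"] * 2,
    "Rk": ["1", "2"], "Player": ["Player A", "Player B"], "Gls": ["5", "3"],
})
shot = pd.DataFrame({
    "Season": [2024, 2024], "League": ["Premier League"] * 2,
    "Rk": ["1", "2"], "Player": ["Player A", "Player B"],
    "Gls": ["5", "3"], "Sh": ["20", "11"],
})

merged_demo = std.merge(
    shot,
    on=["Season", "League", "Rk"],   # the same keys the loop merges on
    how="inner",                     # keep only rows present in both tables
    suffixes=("", "_shot"),          # overlapping non-key columns get "_shot"
)
print(merged_demo.columns.tolist())
```

Columns that exist in both tables but are not merge keys, such as `Player` and `Gls` here, come back twice, once unchanged and once with the suffix. Merging on `Rk` assumes a player has the same row rank in every table; pandas does not check this, so it is worth verifying on real data.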

In [5]:
# ------------------------------------------------------------
# 🔁 LOOP OVER ALL LEAGUES & SEASONS
# ------------------------------------------------------------
for league_name, league_code in leagues.items():
    for season in seasons:

        print(f"\n⚽ Scraping {league_name} - {season}")

        # ------------------------------------------------------------
        #  1. Build FBref slug used in every stats URL
        # ------------------------------------------------------------
        # Example: "2023-2024-Premier-League-Stats"
        slug = f"{season}-{league_name.replace(' ', '-')}-Stats"

        # Base URL portion for the league + season
        base_url = f"https://fbref.com/en/comps/{league_code}/{season}"

        # ------------------------------------------------------------
        #  2. Dictionary of stats pages to scrape
        # This allows dynamic URL generation
        # ------------------------------------------------------------
        stat_types = {
            "standard": "stats",
            "shooting": "shooting",
            "passing":  "passing",
            "gca":      "gca"
        }

        # ------------------------------------------------------------
        #  3. Dynamically create final URLs for each stats table
        # ------------------------------------------------------------
        urls = {
            stat_name: f"{base_url}/{path}/{slug}"
            for stat_name, path in stat_types.items()
        }

        # ------------------------------------------------------------
        #  4. Download all FBref tables
        # get_fbref_table() is the scraping function defined above
        # ------------------------------------------------------------
        df_std  = get_fbref_table(urls["standard"], "stats_standard", season, league_name)
        df_shot = get_fbref_table(urls["shooting"], "stats_shooting", season, league_name)
        df_pass = get_fbref_table(urls["passing"],  "stats_passing",  season, league_name)
        df_gca  = get_fbref_table(urls["gca"],      "stats_gca",      season, league_name)

        # ------------------------------------------------------------
        #  5. Skip this league-season if the standard table is missing.
        #     It is the base of every merge below, so an empty base
        #     would raise a KeyError on the merge keys.
        # ------------------------------------------------------------
        if df_std.empty:
            print(f"⚠️ No standard stats for {league_name} {season}, skipping")
            continue

        # ------------------------------------------------------------
        #  6. Begin merging with STANDARD table as the base
        # ------------------------------------------------------------
        merged = df_std

        # ------------------------------------------------------------
        #  7. Merge Shooting table 
        # ------------------------------------------------------------
        if not df_shot.empty:
            merged = merged.merge(
                df_shot,
                left_on=["Season", "League", "Rk"],      # keys in standard table
                right_on=["Season", "League", "Rk"],     # keys in shooting table
                how="inner",
                suffixes=("", "_shot")                   # suffix for shooting stats
            )

        # ------------------------------------------------------------
        #  8. Merge Passing table 
        # ------------------------------------------------------------
        if not df_pass.empty:
            merged = merged.merge(
                df_pass,
                left_on=["Season", "League", "Rk"],
                right_on=["Season", "League", "Rk"],
                how="inner",
                suffixes=("", "_pass")
            )

        # ------------------------------------------------------------
        #  9. Merge GCA table 
        # ------------------------------------------------------------
        if not df_gca.empty:
            merged = merged.merge(
                df_gca,
                left_on=["Season", "League", "Rk"],
                right_on=["Season", "League", "Rk"],
                how="inner",
                suffixes=("", "_gca")
            )

       
        # ------------------------------------------------------------
        #  10. Make sure the League label is set on the merged frame
        #      (get_fbref_table() already adds it; this is a safeguard)
        # ------------------------------------------------------------
        merged["League"] = league_name

        # ------------------------------------------------------------
        #  11. Store merged dataset for this league & season
        # ------------------------------------------------------------
        all_data.append(merged)
        print(f"✅ Finished {league_name} {season}: {len(merged)} players")



⚽ Scraping Primeira Liga - 2018-2019
✅ Finished Primeira Liga 2018-2019: 540 players

⚽ Scraping Primeira Liga - 2019-2020
✅ Finished Primeira Liga 2019-2020: 562 players

⚽ Scraping Primeira Liga - 2020-2021
✅ Finished Primeira Liga 2020-2021: 543 players

⚽ Scraping Primeira Liga - 2021-2022
✅ Finished Primeira Liga 2021-2022: 581 players

⚽ Scraping Primeira Liga - 2022-2023
✅ Finished Primeira Liga 2022-2023: 580 players

⚽ Scraping Primeira Liga - 2023-2024
✅ Finished Primeira Liga 2023-2024: 536 players

⚽ Scraping La Liga - 2018-2019
✅ Finished La Liga 2018-2019: 544 players

⚽ Scraping La Liga - 2019-2020
✅ Finished La Liga 2019-2020: 570 players

⚽ Scraping La Liga - 2020-2021
✅ Finished La Liga 2020-2021: 582 players

⚽ Scraping La Liga - 2021-2022
✅ Finished La Liga 2021-2022: 617 players

⚽ Scraping La Liga - 2022-2023
✅ Finished La Liga 2022-2023: 596 players

⚽ Scraping La Liga - 2023-2024
✅ Finished

---
### **💾 6) Save Final CSV**
- Merged across all leagues and seasons.
---
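
The save step relies on `pd.concat` to stack the per-league, per-season frames. As a small illustration with made-up frames, the sketch below shows how columns are aligned by name and how `ignore_index=True` renumbers the rows.

```python
import pandas as pd

# Two small frames standing in for per-league, per-season results
part_a = pd.DataFrame({"Player": ["Player A"], "Gls": ["5"]})
part_b = pd.DataFrame({"Player": ["Player B"], "Sh": ["11"]})

# Columns are aligned by name; cells missing from a frame become NaN
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined.shape)
```

If one league-season is missing a stats table, its columns are simply absent from that frame, so the combined dataset will contain NaN in those cells for the affected rows.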

In [None]:
# ------------------------------------------------------------
#  1. Check if we have any scraped data
# ------------------------------------------------------------
if all_data:

    # ------------------------------------------------------------
    #  2. Merge all league-season DataFrames into one big DataFrame
    #    - ignore_index=True resets row index after concatenation
    # ------------------------------------------------------------
    final_df = pd.concat(all_data, ignore_index=True)

    # ------------------------------------------------------------
    #  3. Save the combined data to CSV
    #    - index=False prevents writing row numbers
    # ------------------------------------------------------------
    final_df.to_csv(
        "../Data/FBREF_Top7LeaguesEurope_Season(2019-2024)_Uncleaned.csv",
        index=False
    )

    # ------------------------------------------------------------
    #  4. Print confirmation + shape of final dataset
    # ------------------------------------------------------------
    print("\n🎯 Saved successfully!")
    print("Shape:", final_df.shape)

else:
    # ------------------------------------------------------------
    #  5. If no data was scraped at all
    # ------------------------------------------------------------
    print("\n❌ No data scraped.")



🎯 Saved successfully!
Shape: (23283, 119)


---
### **🚪 7) Close Browser**
---
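
If a scraping cell raises an error partway through, a final cleanup cell may never run and the browser process can stay open. One possible way to guard against that, sketched here with a hypothetical `managed_driver` helper and a `FakeDriver` stand-in rather than a real browser, is to wrap the session in a context manager that always calls `quit()`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver):
    """Yield `driver` and always call its quit() method afterwards (hypothetical helper)."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a Selenium WebDriver, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(fake) as d:
        raise RuntimeError("scrape failed midway")
except RuntimeError:
    pass

print(fake.closed)
```

In a notebook split across cells this pattern is less convenient than in a script, so an explicit `driver.quit()` cell as used here remains a reasonable choice.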

In [7]:
# -----------------------------------------------------------------
#  Completely closes the browser AND kills the WebDriver session.
# -----------------------------------------------------------------
driver.quit()
