# Ligat Ha'al Results & Playoffs Core Notebook

**Purpose:** Unified, reproducible scraping + normalization + competitive balance / leadership analysis.



**Data Sources:**

- Transfermarkt (regular season + playoffs via match reports)

- Wikipedia (legacy reference only; not authoritative for playoffs)

- Attendance scraping lives in dedicated attendance notebook (see `notebooks/attendance.ipynb`).



**Quick Workflow:**

1. Run Environment & Helpers (cells below).

2. Run Team Name Map / normalization.

3. Run Regular Season Scraper (skip-existing unless overwrite flag enabled).

4. Run Playoff Scraper (already optimized & skip-existing).

5. Merge + Leadership / Competitive Balance Analysis.



**Overwrite Flags:**

- `OVERWRITE_REGULAR` = False (change to True to re-scrape regular season)

- `OVERWRITE_PLAYOFFS` = False (change to True to re-scrape playoffs)



**Where to Get Attendance:** Use the attendance notebook; this file only references attendance outputs if already present.



---

## Installation (Optional)

Run this cell only if you need to install dependencies in your notebook environment. 

**Recommended**: Use a virtual environment and install from `requirements.txt`:
```bash
pip install -r ../requirements.txt
```

In [1]:
# Environment setup (API-Sports removed)
from pathlib import Path
from typing import Optional

try:
    from dotenv import load_dotenv
    DOTENV_AVAILABLE = True
except Exception:
    DOTENV_AVAILABLE = False

# Feature flags (only Wikipedia + Transfermarkt pipeline)
USE_APISPORTS = False  # deprecated; kept for compatibility but not used

# Helper to find project root
def _find_root(start: Optional[Path] = None) -> Path:
    p = start or Path.cwd()
    for _ in range(6):
        if (p / 'data').exists() or (p / '.git').exists() or (p / 'notebooks').exists():
            return p
        p = p.parent
    return Path.cwd()

# Resolve project directories consistently
ROOT = _find_root()
DATA_DIR = ROOT / 'data' / 'raw'
INTERIM_DIR = ROOT / 'data' / 'interim'
PROCESSED_DIR = ROOT / 'data' / 'processed'
FIG_DIR = ROOT / 'reports' / 'figures'
for d in [DATA_DIR, INTERIM_DIR, PROCESSED_DIR, FIG_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print(f"\nüéØ Environment setup complete")
print(f"   ROOT: {ROOT}")
print(f"   DATA_DIR: {DATA_DIR}")


üéØ Environment setup complete
   ROOT: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks
   DATA_DIR: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw


### Environment & Configuration
- This project now relies only on Wikipedia and Transfermarkt.
- All API-Sports related configuration and code has been removed to simplify the notebook.

In [2]:
# Helpers to make the notebook resilient across machines (kept)
from typing import Optional
import random
import time
from pathlib import Path
import requests

_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36",
]

def find_repo_root(start: Optional[Path] = None) -> Path:
    p = start or Path.cwd()
    for _ in range(6):
        if (p / 'data').exists() or (p / '.git').exists() or (p / 'notebooks').exists():
            return p
        p = p.parent
    return Path.cwd()

def ensure_environment():
    global ROOT, DATA_DIR, INTERIM_DIR, PROCESSED_DIR, FIG_DIR
    if 'ROOT' not in globals() or not isinstance(ROOT, Path) or not (ROOT / 'data').exists():
        root_guess = find_repo_root(Path.cwd())
        if not (root_guess / 'data').exists() and (root_guess.parent / 'data').exists():
            root_guess = root_guess.parent
        ROOT = root_guess
    DATA_DIR = ROOT / 'data' / 'raw'
    INTERIM_DIR = ROOT / 'data' / 'interim'
    PROCESSED_DIR = ROOT / 'data' / 'processed'
    FIG_DIR = ROOT / 'reports' / 'figures'
    for d in [DATA_DIR, INTERIM_DIR, PROCESSED_DIR, FIG_DIR]:
        d.mkdir(parents=True, exist_ok=True)
    return ROOT, DATA_DIR, INTERIM_DIR, PROCESSED_DIR, FIG_DIR


def http_get(url: str, headers: Optional[dict] = None, retries: int = 3, timeout: int = 30) -> str:
    last_err = None
    sess = requests.Session()
    for attempt in range(1, retries + 1):
        ua = random.choice(_USER_AGENTS)
        hdrs = {"User-Agent": ua, "Accept-Language": "en-US,en;q=0.9"}
        if headers:
            hdrs.update(headers)
        try:
            resp = sess.get(url, headers=hdrs, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except Exception as e:
            last_err = e
            time.sleep(0.8 * attempt)
    raise last_err  # type: ignore


def save_csv(df: 'pd.DataFrame', path: Path, **to_csv_kwargs):
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False, encoding=to_csv_kwargs.get('encoding', 'utf-8-sig'))
    print(f"Saved: {path}")



## Shared Utilities (Environment + HTTP)

This section documents the small, reusable helpers defined in the previous code cell. They make the notebook portable (runs the same on different machines) and robust for web scraping and file saving.

### What‚Äôs included
- Rotating User-Agents list (`_USER_AGENTS`)
  - Cycles between several realistic browser strings to reduce scraping blocks.
- `find_repo_root(start: Optional[Path] = None) -> Path`
- `ensure_environment() -> tuple[Path, Path, Path, Path, Path]`
- `http_get(url: str, headers: Optional[dict] = None, retries: int = 3, timeout: int = 30) -> str`
- `save_csv(df: pd.DataFrame, path: Path, **to_csv_kwargs) -> None`

### Function reference

- `find_repo_root(start: Optional[Path] = None) -> Path`
  - Purpose: Locate the project root by walking up folders until one contains `data/`, `.git/`, or `notebooks/`.
  - Input: Optional starting `Path` (defaults to current working directory).
  - Returns: A `Path` pointing to the inferred project root (falls back to CWD if nothing is found within ~6 levels).
  - Notes: Helps when the notebook is launched from different folders/IDE contexts.

- `ensure_environment()`
  - Purpose: Initialize and expose core directories as globals so paths work anywhere.
  - Creates/sets globals: 
    - `ROOT` (project root)
    - `DATA_DIR` (raw) ‚Üí `ROOT/data/raw`
    - `INTERIM_DIR` ‚Üí `ROOT/data/interim`
    - `PROCESSED_DIR` ‚Üí `ROOT/data/processed`
    - `FIG_DIR` ‚Üí `ROOT/reports/figures`
  - Behavior: If `ROOT` is missing or wrong, it re-detects via `find_repo_root`. Ensures all folders exist.
  - Returns: `(ROOT, DATA_DIR, INTERIM_DIR, PROCESSED_DIR, FIG_DIR)`.
  - Idempotent: Safe to call multiple times.

- `http_get(url: str, headers: Optional[dict] = None, retries: int = 3, timeout: int = 30) -> str`
  - Purpose: Resilient HTTP GET wrapper.
  - Behavior:
    - Rotates a realistic `User-Agent` each attempt.
    - Allows extra headers to be merged in (e.g., cookies, referer).
    - Retries on failure with a small incremental backoff.
  - Returns: `resp.text` on success.
  - Errors: Raises the last exception after all retries fail.
  - Use this instead of `requests.get` directly for scraping reliability.

- `save_csv(df: pd.DataFrame, path: Path, **to_csv_kwargs)`
  - Purpose: Save a DataFrame to CSV with safe defaults.
  - Behavior: Ensures the parent folder exists; writes UTF‚Äë8 with BOM (`utf-8-sig`) by default so Excel reads Hebrew/Unicode correctly.
  - Returns: `None` (prints a confirmation with the saved path).

### Quick examples
```python
# 1) Initialize folders (safe to call once near the top)
ensure_environment()

# 2) Fetch HTML with retries and rotating User-Agent
html = http_get("https://en.wikipedia.org/wiki/Ligat_Ha%27al")

# 3) Save any DataFrame safely (folders auto-created, encoding friendly for Excel)
# df = pd.DataFrame({"a": [1,2,3]})
# save_csv(df, INTERIM_DIR / "example.csv")
```

Tips:
- If a path-related cell fails after moving the project, call `ensure_environment()` again.
- Prefer `http_get` over raw `requests` to avoid transient scraping issues.
- Use `save_csv` to avoid encoding surprises when opening files in Excel.


## Step 2: Enrich Match Data (2016/17 Example)

This cell demonstrates how to enrich raw match data with calculated metrics:
- **Goal difference**: home_goals - away_goals
- **Match result**: H (home win), A (away win), D (draw)
- **Points**: 3 for win, 1 for draw, 0 for loss
- **One-sided flag**: Matches with goal difference ‚â• 3

**Input**: `data/raw/matches_2016_17_ligat_haal_wikipedia.csv`  
**Output**: `data/interim/matches_2016_17_ligat_haal_enriched.csv`

In [3]:
# ...existing code...
# Enrich Wikipedia match-by-match table (robust detection + optional auto-scrape)
import pandas as pd
from pathlib import Path
import re

# Ensure environment and paths are set
ensure_environment()

# Output path
out_path = INTERIM_DIR / "matches_2016_17_ligat_haal_enriched.csv"

# Preferred input filename
preferred = DATA_DIR / "matches_2016_17_ligat_haal_wikipedia.csv"
matches_csv = None

if preferred.exists():
    matches_csv = preferred
else:
    # Search for likely candidates in data/raw (and recursively as fallback)
    candidates = []
    candidates += list(DATA_DIR.glob("matches_2016*.csv"))
    candidates += list(DATA_DIR.glob("matches_*2016*.csv"))
    candidates += list(DATA_DIR.glob("matches_*ligat*2016*.csv"))
    candidates += list(DATA_DIR.glob("matches_all_seasons*.csv"))
    candidates += list(DATA_DIR.rglob("matches*.csv"))

    # Deduplicate and prefer files that contain 2016/2017 or all_seasons
    seen = {}
    for p in candidates:
        try:
            seen[p.resolve()] = p
        except Exception:
            seen[p] = p
    candidates = list(seen.values())

    def score(p: Path):
        name = p.name.lower()
        s = 10
        if "2016" in name and "2017" in name: s -= 6
        if "2016" in name and "17" in name: s -= 5
        if "2016" in name: s -= 4
        if "all_seasons" in name: s -= 3
        if "ligat" in name: s -= 1
        return (s, len(name), str(p))

    candidates = sorted(candidates, key=score)

    if candidates:
        matches_csv = candidates[0]
        print(f"Detected matches CSV: {matches_csv}")
    else:
        # Attempt to auto-scrape 2016/17 from Wikipedia as a last resort
        print(f"Matches CSV not found in {DATA_DIR}. Attempting to scrape 2016/17 from Wikipedia...")
        try:
            from bs4 import BeautifulSoup
            url = "https://en.wikipedia.org/wiki/2016%E2%80%9317_Israeli_Premier_League"
            html = http_get(url)
            soup = BeautifulSoup(html, "html.parser")


            results_table = None
            for table in soup.find_all("table", class_="wikitable"):
                first_row = table.find("tr")
                if first_row:
                    first_cell = first_row.find("th")
                    if first_cell and ("Home \\ Away" in first_cell.text or "Home / Away" in first_cell.text):
                        results_table = table
                        break
            if not results_table:
                raise RuntimeError("Could not find results matrix table on Wikipedia page.")

            rows = results_table.find_all("tr")
            team_names = [td.get_text(strip=True) for td in rows[0].find_all("th")][1:]

            matches = []
            for row in rows[1:]:
                cells = row.find_all(["th", "td"])
                home_team = cells[0].get_text(strip=True)
                for j, cell in enumerate(cells[1:]):
                    away_team = team_names[j]
                    score = cell.get_text(strip=True)
                    if re.match(r"^\d+\s*[‚Äì-]\s*\d+$", score):
                        home_goals, away_goals = re.split(r"[‚Äì-]", score)
                        matches.append({
                            "season": "2016/17",
                            "home_team": home_team,
                            "away_team": away_team,
                            "home_goals": int(home_goals.strip()),
                            "away_goals": int(away_goals.strip())
                        })

            if not matches:
                raise RuntimeError("No matches parsed from Wikipedia.")

            df_autoscrape = pd.DataFrame(matches, columns=[
                "season", "home_team", "away_team", "home_goals", "away_goals"
            ])
            save_csv(df_autoscrape, DATA_DIR / "matches_2016_17_ligat_haal_wikipedia.csv")
            matches_csv = DATA_DIR / "matches_2016_17_ligat_haal_wikipedia.csv"
            print("‚úÖ Created matches CSV via auto-scrape.")
        except Exception as e:
            raise FileNotFoundError(
                f"Matches CSV not found and auto-scrape failed: {e}\n"
                f"Tried to create: {DATA_DIR / 'matches_2016_17_ligat_haal_wikipedia.csv'}\n"
                "Run the scraping cells manually or place the CSV in data/raw/."
            ) from e

# Load dataframe
print(f"Loading matches from: {matches_csv}")
df = pd.read_csv(matches_csv)

# Handle different CSV formats
if "score" in df.columns and "home_goals" not in df.columns:
    # Transfermarkt format: score column with "X:Y" format
    print("Detected Transfermarkt format (score column). Splitting into home_goals/away_goals...")
    df[["home_goals", "away_goals"]] = df["score"].str.split(":", expand=True)
    df["home_goals"] = pd.to_numeric(df["home_goals"], errors="coerce")
    df["away_goals"] = pd.to_numeric(df["away_goals"], errors="coerce")
else:
    # Wikipedia format: separate home_goals and away_goals columns
    print("Detected Wikipedia format (separate goal columns).")
    for col in ["home_goals", "away_goals"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

# Derived columns
df["goal_diff"] = df["home_goals"] - df["away_goals"]
df["result"] = df["goal_diff"].apply(lambda x: "H" if x > 0 else ("A" if x < 0 else "D"))
df["home_points"] = df["result"].map({"H": 3, "D": 1, "A": 0})
df["away_points"] = df["result"].map({"A": 3, "D": 1, "H": 0})

# Optional: simple flag for one-sided results
df["one_sided"] = (df["goal_diff"].abs() >= 3).astype(int)

# Reorder/keep columns defensively
cols = [
    "season", "home_team", "away_team",
    "home_goals", "away_goals", "goal_diff", "result",
    "home_points", "away_points", "one_sided"
]
ordered = [c for c in cols if c in df.columns]
df = df[ordered]

# Save enriched file
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(out_path, index=False, encoding="utf-8-sig")
print(f"Saved enriched matches to: {out_path} | rows: {len(df)}")
display(df.head(10))
# ...existing code...

Loading matches from: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2016_17_ligat_haal_wikipedia.csv
Detected Wikipedia format (separate goal columns).
Saved enriched matches to: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\interim\matches_2016_17_ligat_haal_enriched.csv | rows: 182


Unnamed: 0,season,home_team,away_team,home_goals,away_goals,goal_diff,result,home_points,away_points,one_sided
0,2016/17,F.C. Ashdod,BEI,0,0,0,D,1,1,0
1,2016/17,F.C. Ashdod,BnS,1,1,0,D,1,1,0
2,2016/17,F.C. Ashdod,BnY,2,2,0,D,1,1,0
3,2016/17,F.C. Ashdod,HAS,1,0,1,H,3,0,0
4,2016/17,F.C. Ashdod,HBS,0,1,-1,A,0,3,0
5,2016/17,F.C. Ashdod,HHA,1,3,-2,A,0,3,0
6,2016/17,F.C. Ashdod,HKS,0,0,0,D,1,1,0
7,2016/17,F.C. Ashdod,HRA,0,1,-1,A,0,3,0
8,2016/17,F.C. Ashdod,HTA,1,0,1,H,3,0,0
9,2016/17,F.C. Ashdod,IKS,0,0,0,D,1,1,0


## Step 3: Advanced Enrichment (2022/23 Example)

This cell shows a more comprehensive enrichment process with:
- **Phase parsing**: Extract "regular", "championship", or "relegation" from round names
- **Round number**: Extract numeric round from strings like "Regular Season - 1"
- **Goal difference, results, points**: Same as Step 2
- **One-sided matches**: Flag matches with |goal_diff| ‚â• 3
- **Column cleanup**: Remove irrelevant API-specific columns

**Input**: `data/raw/matches_2022_23_ligat_haal.csv` (if using API-Sports)  
**Output**: `data/interim/matches_2022_23_enriched.csv`

**Note**: This cell is for API-Sports data. For Wikipedia data, use the simpler enrichment in Step 2.

In [4]:
# === ◊î◊¢◊©◊®◊™ ◊î◊ò◊ë◊ú◊î + ◊†◊ô◊ß◊ï◊ô ◊¢◊û◊ï◊ì◊ï◊™ ◊û◊ô◊ï◊™◊®◊ï◊™ ===
# Note: This cell is for API-Sports data (2022/23). 
# If you're using Wikipedia data, skip this cell and use the enrichment cell above instead.
import re
import pandas as pd

# Ensure environment is set up
ensure_environment()

in_path  = DATA_DIR / "matches_2022_23_ligat_haal.csv"   # ◊©◊†◊î ◊ú◊ß◊ï◊ë◊• ◊©◊ú◊ö
out_path = INTERIM_DIR / "matches_2022_23_enriched.csv"

# Check if file exists before attempting to process
if not in_path.exists():
    print(f"‚Ñπ Skipping 2022/23 enrichment - input file not found: {in_path}")
    print(f"  This cell is for API-Sports data. If you're using Wikipedia data,")
    print(f"  your enriched file is already created by the enrichment cell above.")
else:
    df = pd.read_csv(in_path)

    # --- ◊¢◊û◊ï◊ì◊ï◊™ ◊¢◊ñ◊® ---
    # 1) ◊©◊†◊î ◊û◊°◊§◊®◊ô◊™ ◊ú◊§◊™◊ô◊ó◊™ ◊î◊¢◊ï◊†◊î
    #df["season_year"] = df["season"].str.slice(0,4).astype(int)

    # 2) ◊û◊°◊§◊® ◊û◊ó◊ñ◊ï◊® ◊ï-phase
    def parse_round(r):
        # ◊ì◊ï◊í◊û◊ê◊ï◊™: "Regular Season - 1", "Championship Round - 5"
        if pd.isna(r):
            return (None, None)
        r = str(r)
        m = re.search(r"(Regular|Championship|Relegation).*?(\d+)", r, flags=re.I)
        phase = None
        if "regular" in r.lower():      phase = "regular"
        elif "championship" in r.lower(): phase = "championship"
        elif "relegation" in r.lower():   phase = "relegation"
        round_num = int(m.group(2)) if m else None
        return (phase, round_num)

    tmp = df["round"].apply(parse_round).tolist()
    df["phase"] = [t[0] for t in tmp]
    df["round_num"] = [t[1] for t in tmp]

    # 3) ◊î◊§◊®◊© ◊©◊¢◊®◊ô◊ù, ◊™◊ï◊¶◊ê◊î, ◊†◊ß◊ï◊ì◊ï◊™
    df["goal_diff"] = df["home_goals"] - df["away_goals"]
    df["result"] = df["goal_diff"].apply(lambda x: "H" if x>0 else ("A" if x<0 else "D"))
    df["home_points"] = df["result"].map({"H":3, "D":1, "A":0})
    df["away_points"] = df["result"].map({"H":0, "D":1, "A":3})

    # 4) ◊ì◊í◊ú ◊û◊©◊ó◊ß ◊ó◊ì-◊¶◊ì◊ì◊ô (◊ú◊û◊©◊ú |GD|>=3)
    df["one_sided"] = (df["goal_diff"].abs() >= 3).astype(int)

    # 5) ◊¢◊û◊ï◊ì◊ï◊™ ◊ú◊ê ◊®◊ú◊ï◊ï◊†◊ò◊ô◊ï◊™ ◊ú◊î◊°◊®◊î (◊õ◊§◊ô ◊©◊ë◊ô◊ß◊©◊™)
    drop_cols = ["league_id","league_name","fixture_id"]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # 6) ◊°◊ì◊® ◊¢◊û◊ï◊ì◊ï◊™ ◊†◊ï◊ó
    cols = [
        "season","season_year","date","phase","round_num","stage",
        "home_team","away_team","home_goals","away_goals","goal_diff","result",
        "home_points","away_points","one_sided","venue","referee"
    ]
    df = df[[c for c in cols if c in df.columns]]

    save_csv(df, out_path)
    print("◊†◊©◊û◊®:", out_path, "| ◊©◊ï◊®◊ï◊™:", len(df))
    display(df.head(10))


‚Ñπ Skipping 2022/23 enrichment - input file not found: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2022_23_ligat_haal.csv
  This cell is for API-Sports data. If you're using Wikipedia data,
  your enriched file is already created by the enrichment cell above.


## Step 4: Scrape League Table from Wikipedia (2016/17)

This cell demonstrates how to fetch a league standings table from Wikipedia using pandas' `read_html()`.

**What it does**:
- Fetches the 2016/17 Israeli Premier League Wikipedia page
- Uses `read_html()` to automatically parse HTML tables
- Identifies the league table by looking for typical columns (Team, Points, etc.)
- Saves the standings to CSV

**Output**: `data/raw/ligat_haal_2016_17_wikipedia.csv`

**Note**: This gives you final standings, not match-by-match data. For match data, see the next cells.

## Step 5: Scrape Match-by-Match Results from Wikipedia (2016/17)

This cell extracts individual match results from Wikipedia's results matrix table.

**How it works**:
1. Fetches the Wikipedia page for 2016/17 season
2. Finds the results matrix table (grid showing Home vs Away results)
3. Parses each cell to extract scores (e.g., "2‚Äì1")
4. Creates one row per match with home/away teams and goals
5. Calculates derived metrics (goal_diff, result, points)

**Output**: `data/raw/matches_2016_17_ligat_haal_wikipedia.csv`

**Derived columns**:
- `goal_diff`: home_goals - away_goals
- `result`: H (home win), A (away win), D (draw)
- `home_points` / `away_points`: 3 for win, 1 for draw, 0 for loss

## Step 6: Multi-Season Wikipedia Scraper (Last 20 Seasons)

This cell automates the match scraping process across multiple seasons.

**What it does**:
1. Calculates the last 20 seasons dynamically (based on current date)
2. For each season:
   - Fetches the Wikipedia page
   - Extracts the results matrix
   - Parses match-by-match data
   - Saves individual season CSV
3. Combines all seasons into one master file

**Outputs**:
- Per-season: `data/raw/matches_YYYY_YY_ligat_haal_wikipedia.csv`
- Combined: `data/raw/matches_all_seasons_ligat_haal_wikipedia.csv`

**Features**:
- Polite scraping with 1-second delays between requests
- Error handling for missing/changed pages
- Progress tracking with ‚úì/‚ùå indicators
- Season summary report

In [5]:
# Scrape multiple seasons of Ligat Ha'al from Wikipedia
import pandas as pd
from bs4 import BeautifulSoup
from pathlib import Path
import re
import time
from datetime import datetime

ensure_environment()

def scrape_season(season_year):
    """
    Scrape a single season's matches from Wikipedia.
    season_year: starting year (e.g., 2016 for 2016/17 season)
    """
    season_str = f"{season_year}/{str(season_year+1)[-2:]}"
    url = f"https://en.wikipedia.org/wiki/{season_year}%E2%80%93{str(season_year+1)[-2:]}_Israeli_Premier_League"
    
    print(f"Fetching {season_str}... ", end="", flush=True)
    try:
        html = http_get(url)
        soup = BeautifulSoup(html, "html.parser")

        
        # Find results matrix
        results_table = None
        for table in soup.find_all("table", class_="wikitable"):
            first_row = table.find("tr")
            if first_row:
                first_cell = first_row.find("th")
                if first_cell and ("Home \\ Away" in first_cell.text or "Home / Away" in first_cell.text):
                    results_table = table
                    break
        
        if not results_table:
            print("‚ùå (no results matrix)")
            return None
            
        # Parse teams and build matches
        rows = results_table.find_all("tr")
        team_names = [td.get_text(strip=True) for td in rows[0].find_all("th")][1:]
        
        matches = []
        for i, row in enumerate(rows[1:]):
            cells = row.find_all(["th", "td"])
            home_team = cells[0].get_text(strip=True)
            for j, cell in enumerate(cells[1:]):
                away_team = team_names[j]
                score = cell.get_text(strip=True)
                if re.match(r"^\d+\s*[‚Äì-]\s*\d+$", score):
                    home_goals, away_goals = re.split(r"[‚Äì-]", score)
                    matches.append({
                        "season": season_str,
                        "season_year": season_year,
                        "home_team": home_team,
                        "away_team": away_team,
                        "home_goals": int(home_goals.strip()),
                        "away_goals": int(away_goals.strip())
                    })
        
        if not matches:
            print("‚ùå (no matches found)")
            return None
            
        # Convert to DataFrame and add derived columns
        df = pd.DataFrame(matches)
        df['goal_diff'] = df['home_goals'] - df['away_goals']
        df['result'] = df['goal_diff'].apply(lambda x: "H" if x>0 else ("A" if x<0 else "D"))
        df['home_points'] = df['result'].map({"H":3, "D":1, "A":0}).fillna(0).astype(int)
        df['away_points'] = df['result'].map({"A":3, "D":1, "H":0}).fillna(0).astype(int)
        
        # Select and order columns
        keep_cols = ['season', 'season_year', 'home_team', 'away_team', 'home_goals', 
                     'away_goals', 'goal_diff', 'result', 'home_points', 'away_points']
        df = df[keep_cols]
        
        print(f"‚úì ({len(df)} matches)")
        return df
        
    except Exception as e:
        print(f"‚ùå ({str(e)[:50]}...)")
        return None

# List of seasons to scrape (last 20 seasons)
current_year = datetime.now().year
if datetime.now().month < 8:  # If before August, last season started in previous year
    current_year -= 1
seasons = list(range(current_year - 19, current_year + 1))

print(f"Scraping {len(seasons)} seasons from Wikipedia ({seasons[0]}/{str(seasons[0]+1)[-2:]} to {seasons[-1]}/{str(seasons[-1]+1)[-2:]})...")

# Scrape each season
all_matches = []
for season_year in seasons:
    df = scrape_season(season_year)
    if df is not None:
        # Save individual season
        season_path = DATA_DIR / f"matches_{season_year}_{str(season_year+1)[-2:]}_ligat_haal_wikipedia.csv"
        save_csv(df, season_path)
        all_matches.append(df)
    time.sleep(1)  # Be nice to Wikipedia

if all_matches:
    # Combine all seasons
    combined_df = pd.concat(all_matches, ignore_index=True)
    combined_path = DATA_DIR / "matches_all_seasons_ligat_haal_wikipedia.csv"
    save_csv(combined_df, combined_path)
    
    print("\nSummary:")
    print(f"- Successfully scraped {len(all_matches)} seasons")
    print(f"- Total matches: {len(combined_df)}")
    print(f"\nMatches per season:")
    season_counts = combined_df.groupby('season').size().sort_index()
    for season, count in season_counts.items():
        print(f"  ‚Ä¢ {season}: {count:3d} matches")
    print(f"\nAll matches saved to: {combined_path}")
    display(combined_df.head())

Scraping 20 seasons from Wikipedia (2006/07 to 2025/26)...
Fetching 2006/07... ‚úì (132 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2006_07_ligat_haal_wikipedia.csv
Fetching 2007/08... ‚úì (132 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2007_08_ligat_haal_wikipedia.csv
Fetching 2008/09... ‚úì (132 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2008_09_ligat_haal_wikipedia.csv
Fetching 2009/10... ‚úì (239 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2009_10_ligat_haal_wikipedia.csv
Fetching 2010/11... ‚úì (234 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\matches_2010_11_ligat_haal_wikipedia.csv
Fetching 2011/12... ‚úì (240 matches)
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\noteb

Unnamed: 0,season,season_year,home_team,away_team,home_goals,away_goals,goal_diff,result,home_points,away_points
0,2006/07,2006,Beitar Jerusalem,BnY,0,0,0,D,1,1
1,2006/07,2006,Beitar Jerusalem,ASH,2,0,2,H,3,0
2,2006/07,2006,Beitar Jerusalem,HAK,0,0,0,D,1,1
3,2006/07,2006,Beitar Jerusalem,HKS,2,0,2,H,3,0
4,2006/07,2006,Beitar Jerusalem,HPT,2,0,2,H,3,0


## Step 7: Scrape Attendance Data from Transfermarkt (Single Season)

This cell demonstrates scraping **actual attendance statistics** from Transfermarkt.

**Why Transfermarkt?**
- Wikipedia only shows stadium capacity (max seats), not actual attendance
- Transfermarkt provides real match attendance data aggregated by team per season

**Data collected per team**:
- `team`: Club name
- `average_attendance`: Average fans per home match
- `total_attendance`: Total fans across all home matches
- `stadium_capacity`: Maximum stadium capacity
- `utilization_pct`: Calculated as (average / capacity √ó 100)

**How utilization_pct is calculated**:
Since Transfermarkt doesn't provide a percentage column, we calculate it:
```
utilization_pct = (average_attendance / stadium_capacity) √ó 100
```

**Example output (2016/17)**:
- Hapoel Beer Sheva: 89.7% utilization (nearly full!)
- Ironi Kiryat Shmona: 10.8% utilization (mostly empty)

**Output**: `data/raw/attendance_YYYY_YY_ligat_haal_transfermarkt.csv`

**Source**: [Transfermarkt - Ligat Ha'al Attendance](https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1)

## Step 8: Test Attendance Scraper (2023/24)

Quick validation test on a recent season to ensure the scraper works correctly.

**Output**: `data/raw/attendance_2023_24_ligat_haal_transfermarkt.csv`

In [6]:
# Test: Inspect Transfermarkt attendance page structure
from bs4 import BeautifulSoup
ensure_environment()

test_url = "https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023"
print(f"Testing URL: {test_url}\n")

html = http_get(test_url)
soup = BeautifulSoup(html, "html.parser")

# Find all tables
tables = soup.find_all("table", class_="items")
print(f"Found {len(tables)} tables with class 'items'\n")

if tables:
    # Check first table structure
    first_table = tables[0]
    print("First table structure:")
    
    # Get headers
    headers = first_table.find_all("th")
    print(f"Headers ({len(headers)}):")
    for i, h in enumerate(headers):
        print(f"  {i}: {h.get_text(strip=True)}")
    
    # Get first few rows
    rows = first_table.find_all("tr")[1:6]  # Skip header, get first 5 data rows
    print(f"\nFirst 5 data rows:")
    for idx, row in enumerate(rows):
        cells = row.find_all(["td", "th"])
        cell_texts = [cell.get_text(strip=True) for cell in cells]
        print(f"  Row {idx+1}: {cell_texts}")

Testing URL: https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023

Found 2 tables with class 'items'

First table structure:
Headers (5):
  0: #
  1: Stadium
  2: Capacity
  3: Spectators
  4: Average

First 5 data rows:
  Row 1: ['', 'Total:', '0', '1.101.572', '7.295']
  Row 2: ['1', 'BloomfieldMaccabi Tel Aviv', '', 'Bloomfield', 'Maccabi Tel Aviv', '29.150', '213.565', '17.797']
  Row 3: ['', 'Bloomfield']
  Row 4: ['Maccabi Tel Aviv']
  Row 5: ['2', 'Sammy Ofer StadiumMaccabi Haifa', '', 'Sammy Ofer Stadium', 'Maccabi Haifa', '30.780', '171.948', '17.195']


In [7]:
# Detailed inspection of attendance table rows
from bs4 import BeautifulSoup
ensure_environment()

test_url = "https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023"
html = http_get(test_url)
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table", class_="items")
if tables:
    table = tables[0]
    rows = table.find_all("tr")[1:15]  # Skip header, get first 14 rows
    
    print("Detailed row structure:")
    for idx, row in enumerate(rows):
        print(f"\n--- Row {idx+1} ---")
        cells = row.find_all(["td", "th"])
        
        # Check if this is a team row (has rank number)
        if cells and cells[0].get_text(strip=True).isdigit():
            rank = cells[0].get_text(strip=True)
            
            # Find team name - usually in a link with class 'vereinsname'
            team_link = row.find("a", class_="vereinsname")
            team = team_link.get_text(strip=True) if team_link else "N/A"
            
            # Find stadium name - usually earlier in the same cell
            stadium_cell = cells[1] if len(cells) > 1 else None
            stadium = ""
            if stadium_cell:
                # Get all text, then extract stadium (before team link)
                all_text = stadium_cell.get_text(separator="|", strip=True)
                parts = all_text.split("|")
                stadium = parts[0] if parts else ""
            
            capacity = cells[2].get_text(strip=True) if len(cells) > 2 else "N/A"
            spectators = cells[3].get_text(strip=True) if len(cells) > 3 else "N/A"
            average = cells[4].get_text(strip=True) if len(cells) > 4 else "N/A"
            
            print(f"Rank: {rank}")
            print(f"Team: {team}")
            print(f"Stadium: {stadium}")
            print(f"Capacity: {capacity}")
            print(f"Spectators: {spectators}")
            print(f"Average: {average}")
        else:
            print(f"Non-data row: {[c.get_text(strip=True) for c in cells]}")

Detailed row structure:

--- Row 1 ---
Non-data row: ['', 'Total:', '0', '1.101.572', '7.295']

--- Row 2 ---
Rank: 1
Team: N/A
Stadium: Bloomfield
Capacity: 
Spectators: Bloomfield
Average: Maccabi Tel Aviv

--- Row 3 ---
Non-data row: ['', 'Bloomfield']

--- Row 4 ---
Non-data row: ['Maccabi Tel Aviv']

--- Row 5 ---
Rank: 2
Team: N/A
Stadium: Sammy Ofer Stadium
Capacity: 
Spectators: Sammy Ofer Stadium
Average: Maccabi Haifa

--- Row 6 ---
Non-data row: ['', 'Sammy Ofer Stadium']

--- Row 7 ---
Non-data row: ['Maccabi Haifa']

--- Row 8 ---
Rank: 3
Team: N/A
Stadium: Teddy-Kollek-Stadion
Capacity: 
Spectators: Teddy-Kollek-Stadion
Average: Beitar Jerusalem

--- Row 9 ---
Non-data row: ['', 'Teddy-Kollek-Stadion']

--- Row 10 ---
Non-data row: ['Beitar Jerusalem']

--- Row 11 ---
Rank: 4
Team: N/A
Stadium: Toto Jacob Turner Stadium
Capacity: 
Spectators: Toto Jacob Turner Stadium
Average: Hapoel Beer Sheva

--- Row 12 ---
Non-data row: ['', 'Toto Jacob Turner Stadium']

--- Row 13 --

In [8]:
# Check actual HTML structure for one team row
from bs4 import BeautifulSoup
ensure_environment()

test_url = "https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023"
html = http_get(test_url)
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table", class_="items")
if tables:
    table = tables[0]
    tbody = table.find("tbody")
    
    if tbody:
        rows = tbody.find_all("tr", recursive=False)  # Only direct children
        print(f"Found {len(rows)} direct tbody rows\n")
        
        # Look at first team entry (should be row index 1, row 0 is total)
        if len(rows) > 1:
            team_row = rows[1]
            print("First team row HTML:")
            print(team_row.prettify()[:1500])
            print("\n" + "="*80 + "\n")
            
            # Try to extract data
            cells = team_row.find_all("td")
            print(f"Number of cells: {len(cells)}\n")
            
            for i, cell in enumerate(cells):
                print(f"Cell {i}:")
                print(f"  Text: {cell.get_text(strip=True)[:100]}")
                links = cell.find_all("a")
                if links:
                    print(f"  Links: {[l.get('href') for l in links]}")
                print()

Found 14 direct tbody rows

First team row HTML:
<tr class="even">
 <td class="zentriert">
  2
 </td>
 <td>
  <table class="inline-table">
   <tr>
    <td class="zentriert wappen" rowspan="2">
     <a href="#">
      <a href="/maccabi-haifa/spielplan/verein/1064/saison_id/2023" title="Maccabi Haifa">
       <img alt="Maccabi Haifa" class="" src="https://tmssl.akamaized.net//images/wappen/verysmall/1064.png?lm=1684233681" title="Maccabi Haifa"/>
      </a>
     </a>
    </td>
    <td class="hauptlink">
     <a 0="1064" href="/1064/stadion/verein/1064">
      Sammy Ofer Stadium
     </a>
    </td>
   </tr>
   <tr>
    <td>
     <a href="/maccabi-haifa/spielplan/verein/1064/saison_id/2023" title="Maccabi Haifa">
      Maccabi Haifa
     </a>
    </td>
   </tr>
  </table>
 </td>
 <td class="rechts">
  30.780
 </td>
 <td class="rechts">
  171.948
 </td>
 <td class="rechts">
  17.195
 </td>
</tr>



Number of cells: 8

Cell 0:
  Text: 2

Cell 1:
  Text: Sammy Ofer StadiumMaccabi Haifa
  Link

In [9]:
def scrape_transfermarkt_attendance(season_year: int) -> 'pd.DataFrame':
    """
    Scrape team attendance data from Transfermarkt for a given season.
    
    Args:
        season_year: Starting year of season (e.g., 2023 for 2023/24)
    
    Returns:
        DataFrame with columns: season, team, stadium, capacity, total_spectators, average_attendance
    """
    import pandas as pd
    from bs4 import BeautifulSoup
    import re
    
    url = f"https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/{season_year}"
    print(f"Scraping attendance from: {url}")
    
    try:
        html = http_get(url)
        soup = BeautifulSoup(html, "html.parser")
        
        # Find the attendance table
        tables = soup.find_all("table", class_="items")
        if not tables:
            print(f"  ‚ö†Ô∏è  No attendance tables found for {season_year}/{str(season_year+1)[-2:]}")
            return None
        
        table = tables[0]
        tbody = table.find("tbody")
        if not tbody:
            print(f"  ‚ö†Ô∏è  No tbody found in attendance table for {season_year}/{str(season_year+1)[-2:]}")
            return None
        
        rows = tbody.find_all("tr", recursive=False)
        
        attendance_data = []
        season_str = f"{season_year}/{str(season_year+1)[-2:]}"
        
        for row in rows:
            cells = row.find_all("td")
            if len(cells) < 5:
                continue
            
            # First cell is rank (skip "Total" row)
            rank_text = cells[0].get_text(strip=True)
            if not rank_text.isdigit():
                continue
            
            # Second cell contains inline table with stadium and team info
            inline_table = cells[1].find("table", class_="inline-table")
            if not inline_table:
                continue
            
            # Extract stadium name (first link in inline table)
            stadium_link = inline_table.find("a", class_="hauptlink")
            stadium = stadium_link.get_text(strip=True) if stadium_link else "Unknown"
            
            # Extract team name (second row of inline table)
            team_links = inline_table.find_all("a", title=True)
            team = "Unknown"
            for link in team_links:
                title = link.get("title", "")
                if title and "spielplan" in link.get("href", ""):
                    team = title
                    break
            
            # Extract capacity, total spectators, average (last 3 cells)
            # Note: Numbers use European format (dots for thousands)
            capacity_text = cells[-3].get_text(strip=True)
            total_text = cells[-2].get_text(strip=True)
            average_text = cells[-1].get_text(strip=True)
            
            # Convert European number format (remove dots, handle empty values)
            def parse_number(text):
                if not text or text == "-":
                    return None
                return int(text.replace(".", "").replace(",", ""))
            
            capacity = parse_number(capacity_text)
            total_spectators = parse_number(total_text)
            average_attendance = parse_number(average_text)
            
            attendance_data.append({
                "season": season_str,
                "team": team,
                "stadium": stadium,
                "capacity": capacity,
                "total_spectators": total_spectators,
                "average_attendance": average_attendance
            })
        
        if not attendance_data:
            print(f"  ‚ö†Ô∏è  No attendance data extracted for {season_year}/{str(season_year+1)[-2:]}")
            return None
        
        df = pd.DataFrame(attendance_data)
        print(f"  ‚úÖ Scraped {len(df)} teams for {season_str}")
        return df
        
    except Exception as e:
        print(f"  ‚ùå Error scraping {season_year}/{str(season_year+1)[-2:]}: {e}")
        return None

# Test the function
ensure_environment()
test_df = scrape_transfermarkt_attendance(2023)
if test_df is not None:
    display(test_df)

Scraping attendance from: https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023
  ‚úÖ Scraped 14 teams for 2023/24


Unnamed: 0,season,team,stadium,capacity,total_spectators,average_attendance
0,2023/24,Maccabi Tel Aviv,Unknown,29150,213565,17797
1,2023/24,Maccabi Haifa,Unknown,30780,171948,17195
2,2023/24,Beitar Jerusalem,Unknown,33500,144830,13166
3,2023/24,Hapoel Beer Sheva,Unknown,16126,122024,10169
4,2023/24,Hapoel Tel Aviv,Unknown,29150,101049,9186
5,2023/24,Maccabi Netanya,Unknown,13610,70127,5844
6,2023/24,Hapoel Petah Tikva,Unknown,11500,60759,5524
7,2023/24,Hapoel Haifa,Unknown,30820,42559,3869
8,2023/24,Hapoel Jerusalem,Unknown,33500,40070,3643
9,2023/24,Maccabi Petah Tikva,Unknown,11500,39337,3576


In [10]:
# Quick test: scrape 2023/24 season attendance
ensure_environment()
season_year = 2023
_df_2023 = scrape_transfermarkt_attendance(season_year)
if _df_2023 is not None:
    _csv_2023 = DATA_DIR / f"attendance_{season_year}_{str(season_year+1)[-2:]}_ligat_haal_transfermarkt.csv"
    save_csv(_df_2023, _csv_2023)
    display(_df_2023.head(20))
else:
    print("Failed to scrape 2023/24 attendance from Transfermarkt.")

Scraping attendance from: https://www.transfermarkt.com/ligat-haal/besucherzahlen/wettbewerb/ISR1/saison_id/2023
  ‚úÖ Scraped 14 teams for 2023/24
Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw\attendance_2023_24_ligat_haal_transfermarkt.csv


Unnamed: 0,season,team,stadium,capacity,total_spectators,average_attendance
0,2023/24,Maccabi Tel Aviv,Unknown,29150,213565,17797
1,2023/24,Maccabi Haifa,Unknown,30780,171948,17195
2,2023/24,Beitar Jerusalem,Unknown,33500,144830,13166
3,2023/24,Hapoel Beer Sheva,Unknown,16126,122024,10169
4,2023/24,Hapoel Tel Aviv,Unknown,29150,101049,9186
5,2023/24,Maccabi Netanya,Unknown,13610,70127,5844
6,2023/24,Hapoel Petah Tikva,Unknown,11500,60759,5524
7,2023/24,Hapoel Haifa,Unknown,30820,42559,3869
8,2023/24,Hapoel Jerusalem,Unknown,33500,40070,3643
9,2023/24,Maccabi Petah Tikva,Unknown,11500,39337,3576


## Scrape All 20 Seasons Attendance Data

Now scrape attendance data for all seasons from 2006/07 to 2025/26.

In [11]:
# Scrape attendance data for all 20 seasons (2006-2025)
import pandas as pd
import time

ensure_environment()

# Define seasons to scrape
start_year = 2006
end_year = 2025
seasons = list(range(start_year, end_year + 1))

print(f"Scraping attendance data for {len(seasons)} seasons ({start_year}/{start_year+1}-{end_year}/{str(end_year+1)[-2:]})\n")
print("="*80)

all_attendance = []
failed = []

for season_year in seasons:
    season_str = f"{season_year}/{str(season_year+1)[-2:]}"
    print(f"\n[{season_str}]")
    
    # Check if already exists
    csv_path = DATA_DIR / f"attendance_{season_year}_{str(season_year+1)[-2:]}_ligat_haal_transfermarkt.csv"
    if csv_path.exists():
        print(f"  ‚ÑπÔ∏è  File already exists: {csv_path.name}")
        try:
            existing_df = pd.read_csv(csv_path)
            all_attendance.append(existing_df)
            print(f"  ‚úÖ Loaded existing data: {len(existing_df)} teams")
        except Exception as e:
            print(f"  ‚ö†Ô∏è  Error loading existing file: {e}")
            # Try scraping anyway
            df = scrape_transfermarkt_attendance(season_year)
            if df is not None:
                save_csv(df, csv_path)
                all_attendance.append(df)
            else:
                failed.append(season_str)
    else:
        # Scrape new data
        df = scrape_transfermarkt_attendance(season_year)
        if df is not None:
            save_csv(df, csv_path)
            all_attendance.append(df)
        else:
            failed.append(season_str)
        
        # Be polite to the server
        time.sleep(1.2)

print("\n" + "="*80)
print(f"\n‚úÖ Successfully scraped/loaded: {len(all_attendance)} seasons")
if failed:
    print(f"‚ùå Failed: {len(failed)} seasons: {', '.join(failed)}")

# Combine all data
if all_attendance:
    combined_attendance = pd.concat(all_attendance, ignore_index=True)
    combined_path = DATA_DIR / "attendance_all_seasons_ligat_haal_transfermarkt.csv"
    save_csv(combined_attendance, combined_path)
    
    print(f"\nüìä Combined attendance data:")
    print(f"   Total records: {len(combined_attendance)}")
    print(f"   Seasons: {combined_attendance['season'].nunique()}")
    print(f"   Teams: {combined_attendance['team'].nunique()}")
    print(f"\n   Saved to: {combined_path.name}")
    
    # Show summary by season
    summary = combined_attendance.groupby('season').agg({
        'team': 'count',
        'total_spectators': 'sum',
        'average_attendance': 'mean'
    }).round(0)
    summary.columns = ['Teams', 'Total Spectators', 'Avg Attendance']
    print("\n   Season Summary:")
    display(summary)

Scraping attendance data for 20 seasons (2006/2007-2025/26)


[2006/07]
  ‚ÑπÔ∏è  File already exists: attendance_2006_07_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 12 teams

[2007/08]
  ‚ÑπÔ∏è  File already exists: attendance_2007_08_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 12 teams

[2008/09]
  ‚ÑπÔ∏è  File already exists: attendance_2008_09_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 12 teams

[2009/10]
  ‚ÑπÔ∏è  File already exists: attendance_2009_10_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 16 teams

[2010/11]
  ‚ÑπÔ∏è  File already exists: attendance_2010_11_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 16 teams

[2011/12]
  ‚ÑπÔ∏è  File already exists: attendance_2011_12_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 16 teams

[2012/13]
  ‚ÑπÔ∏è  File already exists: attendance_2012_13_ligat_haal_transfermarkt.csv
  ‚úÖ Loaded existing data: 14 teams

[2013/14]
  ‚ÑπÔ∏è  File already exists: attendan

Unnamed: 0_level_0,Teams,Total Spectators,Avg Attendance
season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2006/07,12,119700,3136.0
2007/08,12,362600,5738.0
2008/09,12,0,0.0
2009/10,16,939155,3926.0
2010/11,16,318450,4867.0
2011/12,16,911780,3891.0
2012/13,14,916940,5038.0
2013/14,14,970781,5444.0
2014/15,14,935937,7630.0
2015/16,14,1247497,6854.0


---

## Next Steps

Now that you have collected match and attendance data, you can:

### 1. Multi-Season Attendance Collection
Create a loop to scrape attendance for all 20 seasons (similar to the Wikipedia multi-season scraper).

### 2. Data Merging
Merge attendance data with match data by `(season, team)` to analyze:
- Home performance vs attendance levels
- Win rate correlation with fan support
- Derby match attendance spikes

### 3. Team Name Normalization
Standardize team names between Wikipedia and Transfermarkt sources for accurate joining.

### 4. Analysis & Visualization
- Time series of attendance trends
- Team performance over multiple seasons
- Home advantage analysis
- Goal-scoring patterns

### 5. Statistical Modeling
- Predict match outcomes based on historical data
- Attendance forecasting
- League position projections

---

**Current Data Available**:
- ‚úÖ Match results: 20 seasons from Wikipedia (`matches_all_seasons_ligat_haal_wikipedia.csv`)
- ‚úÖ Attendance: 2016/17 and 2023/24 from Transfermarkt
- ‚úÖ Enriched data: Calculated metrics (points, goal_diff, results)

**Files Generated**:
```
data/raw/
‚îú‚îÄ‚îÄ matches_YYYY_YY_ligat_haal_wikipedia.csv (per season)
‚îú‚îÄ‚îÄ matches_all_seasons_ligat_haal_wikipedia.csv (combined)
‚îú‚îÄ‚îÄ attendance_2016_17_ligat_haal_transfermarkt.csv
‚îî‚îÄ‚îÄ attendance_2023_24_ligat_haal_transfermarkt.csv

data/interim/
‚îú‚îÄ‚îÄ matches_2016_17_ligat_haal_enriched.csv
‚îî‚îÄ‚îÄ matches_2022_23_enriched.csv
```

## Analysis 1: League Leadership Changes

This section calculates how many times first place changed hands during a season.

**How it works:**
1. Load match data for a specific season
2. Calculate cumulative standings after each matchday
3. Track which team was in first place after each round
4. Count how many times leadership changed

**Data source:** Wikipedia match results (not Transfermarkt - they don't have matchday-by-matchday data)

**Example analysis:** 2016/17 season


### CRITICAL FIX: Team Name Normalization

**Problem discovered:** The Wikipedia results matrix uses:
- Full team names in rows (home teams)
- Abbreviations in columns (away teams)

This creates duplicate teams (28 instead of 14) and incorrect statistics!

**Solution:** Create a mapping to normalize all team names to their full versions.

In [12]:
# Team Name Mapping - Normalizes abbreviations and variants to full names
# This mapping consolidates Wikipedia's inconsistent team naming across 20 seasons

TEAM_NAME_MAP = {
    # Abbreviations to full names
    'ASH': 'F.C. Ashdod',
    'BEI': 'Beitar Jerusalem',
    'BnS': 'Bnei Sakhnin',
    'BnY': 'Bnei Yehuda',
    'HAS': 'Hapoel Ashkelon',
    'HBS': "Hapoel Be'er Sheva",
    'HHA': 'Hapoel Haifa',
    'HKS': 'Hapoel Kfar Saba',
    'HRA': "Hapoel Ra'anana",
    'HTA': 'Hapoel Tel Aviv',
    'IKS': 'Ironi Kiryat Shmona',
    'MHA': 'Maccabi Haifa',
    'MPT': 'Maccabi Petah Tikva',
    'MTA': 'Maccabi Tel Aviv',
    'HPT': 'Hapoel Petah Tikva',
    'HRG': 'Hapoel Ramat Gan',
    'HRH': 'Hapoel Ramat HaSharon',
    'HRL': 'Rishon LeZion',
    'MAN': 'Maccabi Ahi Nazareth',
    'MBR': 'Maccabi Bnei Reineh',
    'SNZ': 'Sektzia Ness Ziona',
    'HAK': 'Hapoel Acre',
    'MHE': 'Maccabi Herzliya',
    'MNE': 'Maccabi Netanya',
    'HAR': 'Hapoel Raanana',
    'HAC': 'Hapoel Acre',
    'IRH': 'Ironi Ramat HaSharon',
    'HAH': 'Hapoel Hadera',
    'NES': 'Ness Ziona',
    'HJE': 'Hapoel Jerusalem',
    'HNG': 'Hapoel Nof HaGalil',
    'ITI': 'Ironi Tiberias',
    
    # Name variants to canonical names
    'Ashdod': 'F.C. Ashdod',
    'F.C. Ironi Ashdod': 'F.C. Ashdod',
    'Ness Ziona': 'Sektzia Ness Ziona',
    'Ironi Nir Ramat HaSharon': 'Ironi Ramat HaSharon',
    'Hakoah Amidar Ramat Gan': 'Hapoel Ramat Gan',
    'Hapoel Rishon LeZion': 'Rishon LeZion',
    'Hapoel Raanana': "Hapoel Ra'anana",
    
    # Full names map to themselves
    'F.C. Ashdod': 'F.C. Ashdod',
    'Beitar Jerusalem': 'Beitar Jerusalem',
    'Bnei Sakhnin': 'Bnei Sakhnin',
    'Bnei Yehuda': 'Bnei Yehuda',
    'Hapoel Ashkelon': 'Hapoel Ashkelon',
    "Hapoel Be'er Sheva": "Hapoel Be'er Sheva",
    'Hapoel Haifa': 'Hapoel Haifa',
    'Hapoel Kfar Saba': 'Hapoel Kfar Saba',
    "Hapoel Ra'anana": "Hapoel Ra'anana",
    'Hapoel Tel Aviv': 'Hapoel Tel Aviv',
    'Ironi Kiryat Shmona': 'Ironi Kiryat Shmona',
    'Maccabi Haifa': 'Maccabi Haifa',
    'Maccabi Petah Tikva': 'Maccabi Petah Tikva',
    'Maccabi Tel Aviv': 'Maccabi Tel Aviv',
    'Hapoel Petah Tikva': 'Hapoel Petah Tikva',
    'Hapoel Ramat Gan': 'Hapoel Ramat Gan',
    'Hapoel Ramat HaSharon': 'Hapoel Ramat HaSharon',
    'Rishon LeZion': 'Rishon LeZion',
    'Maccabi Ahi Nazareth': 'Maccabi Ahi Nazareth',
    'Maccabi Bnei Reineh': 'Maccabi Bnei Reineh',
    'Sektzia Ness Ziona': 'Sektzia Ness Ziona',
    'Hapoel Acre': 'Hapoel Acre',
    'Maccabi Herzliya': 'Maccabi Herzliya',
    'Maccabi Netanya': 'Maccabi Netanya',
    'Ironi Ramat HaSharon': 'Ironi Ramat HaSharon',
    'Hapoel Hadera': 'Hapoel Hadera',
    'Hapoel Jerusalem': 'Hapoel Jerusalem',
    'Hapoel Nof HaGalil': 'Hapoel Nof HaGalil',
    'Ironi Tiberias': 'Ironi Tiberias',
}

def normalize_team_names(df, name_map=TEAM_NAME_MAP):
    """
    Normalize team names by converting abbreviations and variants to full names.
    
    Args:
        df: DataFrame with 'home_team' and 'away_team' columns
        name_map: Dictionary mapping abbreviations/variants to standardized names
    
    Returns:
        DataFrame with normalized team names
    """
    df = df.copy()
    df['home_team'] = df['home_team'].map(lambda x: name_map.get(x, x))
    df['away_team'] = df['away_team'].map(lambda x: name_map.get(x, x))
    return df

def apply_season_specific_fixes(df, season):
    """
    Apply season-specific Wikipedia data corrections.
    Wikipedia sometimes uses incorrect team names in their results matrices.
    
    Args:
        df: DataFrame with match data
        season: Season string (e.g., '2006/07')
    
    Returns:
        DataFrame with season-specific fixes applied
    """
    df = df.copy()
    
    if season == '2006/07':
        df.loc[df['home_team'] == 'Hapoel Ramat Gan', 'home_team'] = 'Hapoel Acre'
    elif season == '2008/09':
        df.loc[df['home_team'] == 'Hapoel Ramat Gan', 'home_team'] = "Hapoel Ra'anana"
    
    return df

print("‚úÖ Team Name Mapping Loaded:")
print(f"  ‚Ä¢ {len([k for k in TEAM_NAME_MAP.keys() if len(k) <= 3])} abbreviations")
print(f"  ‚Ä¢ {len(set(TEAM_NAME_MAP.values()))} unique teams")


‚úÖ Team Name Mapping Loaded:
  ‚Ä¢ 32 abbreviations
  ‚Ä¢ 31 unique teams


In [13]:
# Summary: Compare data availability between Transfermarkt and Wikipedia
import pandas as pd
ensure_environment()

print("="*80)
print("DATA SOURCES COMPARISON")
print("="*80)

# Check what files we have
transfermarkt_files = list(DATA_DIR.glob("matches_*_transfermarkt.csv"))
wiki_files = list(DATA_DIR.glob("matches_*_wikipedia.csv"))

print(f"\nüìä Match Data Files:")
print(f"  Transfermarkt: {len(transfermarkt_files)} seasons")
print(f"  Wikipedia: {len(wiki_files)} seasons")

# Sample one season to show the difference
if transfermarkt_files:
    sample_file = transfermarkt_files[0]
    df_transfermarkt = pd.read_csv(sample_file)
    
    print(f"\nüîç Sample Analysis: {sample_file.name}")
    print(f"  Total matches: {len(df_transfermarkt)}")
    print(f"  Columns: {list(df_transfermarkt.columns)}")
    
    # Check if it has round info
    if 'round' in df_transfermarkt.columns:
        print(f"  Rounds: {df_transfermarkt['round'].min()} to {df_transfermarkt['round'].max()}")
    
    # Count teams
    teams_home = set(df_transfermarkt['home'].unique()) if 'home' in df_transfermarkt.columns else set()
    teams_away = set(df_transfermarkt['away'].unique()) if 'away' in df_transfermarkt.columns else set()
    all_teams = teams_home.union(teams_away)
    
    print(f"  Unique teams: {len(all_teams)}")
    
    # Calculate expected matches
    num_teams = len(all_teams)
    expected_regular = (num_teams - 1) * 2 * (num_teams // 2)
    
    print(f"\n  üìù For {num_teams} teams:")
    print(f"     Expected regular season: {expected_regular} matches")
    print(f"     Found in Transfermarkt: {len(df_transfermarkt)} matches")
    
    if len(df_transfermarkt) == expected_regular:
        print(f"     ‚úÖ Confirmed: Regular season only (no playoffs)")
    else:
        print(f"     ‚ö†Ô∏è  Match count doesn't match expected regular season")

print("\n" + "="*80)
print("\nüí° RECOMMENDATION:")
print("   Use WIKIPEDIA for complete match data (regular season + playoffs)")
print("   Use TRANSFERMARKT for attendance data")
print("\n   Your existing Wikipedia data already includes:")
print("   ‚úÖ Regular season matches")
print("   ‚úÖ Championship playoff matches")  
print("   ‚úÖ Relegation playoff matches")
print("="*80)

DATA SOURCES COMPARISON

üìä Match Data Files:
  Transfermarkt: 20 seasons
  Wikipedia: 21 seasons

üîç Sample Analysis: matches_2006_07_ligat_haal_transfermarkt.csv
  Total matches: 198
  Columns: ['round', 'home', 'score', 'away']
  Rounds: 1 to 198
  Unique teams: 12

  üìù For 12 teams:
     Expected regular season: 132 matches
     Found in Transfermarkt: 198 matches
     ‚ö†Ô∏è  Match count doesn't match expected regular season


üí° RECOMMENDATION:
   Use WIKIPEDIA for complete match data (regular season + playoffs)
   Use TRANSFERMARKT for attendance data

   Your existing Wikipedia data already includes:
   ‚úÖ Regular season matches
   ‚úÖ Championship playoff matches
   ‚úÖ Relegation playoff matches


### Fix All Existing Wikipedia Data Files

This cell will re-process all existing Wikipedia match data to normalize team names.
It reads the existing CSV files, applies the name mapping, and saves corrected versions.

---

## ‚úÖ DATA NORMALIZATION COMPLETE

All Wikipedia match data has been updated with consistent team names:

**Before:**
- `home_team`: "Hapoel Be'er Sheva" (full name)
- `away_team`: "HBS" (abbreviation) ‚ùå

**After:**
- `home_team`: "Hapoel Be'er Sheva" (full name)  
- `away_team`: "Hapoel Be'er Sheva" (full name) ‚úÖ

**Benefits:**
- Correct team counts (14 teams, not 28)
- Accurate statistics (26 games, 59 pts for regular season leader)
- Easy to merge with attendance data
- No need for mapping during analysis

**Files updated:** All `matches_*_ligat_haal_wikipedia.csv` files in `data/raw/`

---

### Diagnostic: Check Team Counts Per Season

This cell identifies which seasons still have incorrect team counts (not 14).

### Build Comprehensive Team Name Mapping

Since different seasons have different teams (due to promotion/relegation), we need to build a complete mapping that covers all abbreviations across all seasons.

In [14]:
# Calculate league standings after each matchday and track leadership changes
import pandas as pd
import numpy as np

ensure_environment()

def calculate_league_table_by_round(matches_df, season_str="2016/17"):
    """
    Calculate league standings after each round/matchday.
    
    Args:
        matches_df: DataFrame with match results (with normalized team names)
        season_str: Season to analyze (e.g., "2016/17")
    
    Returns:
        - standings_by_round: dict mapping round_num -> DataFrame of standings
        - leadership_changes: list of tuples (round_num, new_leader)
    
    Note: Team names should already be normalized (full names, not abbreviations).
    """
    # Filter for the specific season
    season_matches = matches_df[matches_df['season'] == season_str].copy()
    
    # Get all unique teams - count ONLY home teams (each team has home games)
    # This avoids duplicate counting from abbreviations in away_team column
    teams = sorted(season_matches['home_team'].unique())
    n_teams = len(teams)
    
    print(f"‚Ñπ Processing {season_str}: {len(season_matches)} matches, {n_teams} teams")
    
    # In Ligat Ha'al, 14 teams play 26 rounds in regular season, then split into championship/relegation
    # For the regular season: each team plays 13 opponents √ó 2 (home/away) = 26 matches
    # Total matches in regular season = (14 teams √ó 26 matches) / 2 = 182 matches
    
    # Assign round numbers by ordering matches
    # Since we don't have dates, distribute evenly assuming each round has n_teams/2 matches
    season_matches = season_matches.reset_index(drop=True)
    
    # Each round has 7 matches (14 teams / 2)
    matches_per_round = n_teams // 2 if n_teams % 2 == 0 else (n_teams + 1) // 2
    
    # Assign rounds based on position in dataset
    season_matches['round_num'] = (season_matches.index // matches_per_round) + 1
    max_round = season_matches['round_num'].max()
    
    # Initialize standings tracker
    standings_by_round = {}
    current_leader = None
    leadership_changes = []
    
    # Calculate standings after each round
    for round_num in sorted(season_matches['round_num'].unique()):
        # Get all matches up to and including this round
        matches_so_far = season_matches[season_matches['round_num'] <= round_num]
        
        # Initialize team stats
        stats = {team: {'played': 0, 'won': 0, 'drawn': 0, 'lost': 0, 
                        'gf': 0, 'ga': 0, 'gd': 0, 'points': 0} 
                 for team in teams}
        
        # Calculate stats from matches
        for _, match in matches_so_far.iterrows():
            home = match['home_team']
            away = match['away_team']
            home_goals = match['home_goals']
            away_goals = match['away_goals']
            
            # Update home team
            stats[home]['played'] += 1
            stats[home]['gf'] += home_goals
            stats[home]['ga'] += away_goals
            stats[home]['gd'] = stats[home]['gf'] - stats[home]['ga']
            
            # Update away team
            stats[away]['played'] += 1
            stats[away]['gf'] += away_goals
            stats[away]['ga'] += home_goals
            stats[away]['gd'] = stats[away]['gf'] - stats[away]['ga']
            
            # Update points
            if home_goals > away_goals:  # Home win
                stats[home]['won'] += 1
                stats[home]['points'] += 3
                stats[away]['lost'] += 1
            elif away_goals > home_goals:  # Away win
                stats[away]['won'] += 1
                stats[away]['points'] += 3
                stats[home]['lost'] += 1
            else:  # Draw
                stats[home]['drawn'] += 1
                stats[away]['drawn'] += 1
                stats[home]['points'] += 1
                stats[away]['points'] += 1
        
        # Convert to DataFrame and sort
        standings = pd.DataFrame.from_dict(stats, orient='index')
        standings.index.name = 'team'
        standings = standings.reset_index()
        standings = standings.sort_values(['points', 'gd', 'gf'], ascending=[False, False, False])
        standings['position'] = range(1, len(standings) + 1)
        
        standings_by_round[int(round_num)] = standings
        
        # Track leader
        new_leader = standings.iloc[0]['team']
        if new_leader != current_leader:
            leadership_changes.append((int(round_num), new_leader))
            current_leader = new_leader
    
    return standings_by_round, leadership_changes

# Load the combined matches data
matches_path = DATA_DIR / "matches_all_seasons_ligat_haal_wikipedia.csv"
if not matches_path.exists():
    print(f"‚ùå Combined matches file not found: {matches_path}")
    print("Please run the multi-season Wikipedia scraper first (cell 17)")
else:
    all_matches = pd.read_csv(matches_path)
    
    # Normalize team names (convert abbreviations to full names)
    all_matches = normalize_team_names(all_matches, TEAM_NAME_MAP)
    
    # Apply season-specific fixes
    for season_name in all_matches['season'].unique():
        season_data = all_matches[all_matches['season'] == season_name]
        all_matches.loc[all_matches['season'] == season_name] = apply_season_specific_fixes(season_data, season_name)
    
    # Analyze 2016/17 season
    season = "2016/17"
    standings_by_round, leadership_changes = calculate_league_table_by_round(all_matches, season)
    
    print(f"\nüìä League Leadership Analysis - {season} (REGULAR SEASON)")
    print("=" * 60)
    print(f"\nüèÜ Leadership Changes: {len(leadership_changes) - 1}")
    print(f"   (Initial leader doesn't count as a 'change')\n")
    
    print("Round-by-round first place:")
    for round_num, leader in leadership_changes:
        print(f"  ‚Ä¢ Round {round_num:2d}: {leader}")
    
    # Show final standings
    print(f"\nüìã Final Standings After Round {max(standings_by_round.keys())} (Regular Season):")
    final = standings_by_round[max(standings_by_round.keys())]
    display(final[['position', 'team', 'played', 'won', 'drawn', 'lost', 'gf', 'ga', 'gd', 'points']].head(10))
    
    # Calculate some interesting stats
    print(f"\nüìà Season Statistics:")
    print(f"  ‚Ä¢ Rounds analyzed: {len(standings_by_round)} (Regular Season only)")
    print(f"  ‚Ä¢ Teams: {len(final)}")
    print(f"  ‚Ä¢ Total matches: {len(all_matches[all_matches['season'] == season])}")
    print(f"  ‚Ä¢ Leader after regular season: {final.iloc[0]['team']} ({final.iloc[0]['points']:.0f} pts, {final.iloc[0]['played']:.0f} games)")
    print(f"  ‚Ä¢ Runner-up: {final.iloc[1]['team']} ({final.iloc[1]['points']:.0f} pts, {final.iloc[1]['played']:.0f} games)")
    print(f"  ‚Ä¢ Points gap: {final.iloc[0]['points'] - final.iloc[1]['points']:.0f} pts")
    
    print(f"\n‚ö†Ô∏è IMPORTANT NOTE:")
    print(f"   Wikipedia results matrix only shows REGULAR SEASON matches (26 rounds).")
    print(f"   Ligat Ha'al has additional Championship/Relegation playoffs (~10 rounds).")
    print(f"   Full season totals: ~36 matches, ~87 points for champion (as you mentioned).")
    print(f"   This analysis tracks leadership changes during the regular season only.")
    print(f"\n‚úÖ All team names are now normalized (full names used throughout).")


‚Ñπ Processing 2016/17: 182 matches, 14 teams

üìä League Leadership Analysis - 2016/17 (REGULAR SEASON)

üèÜ Leadership Changes: 3
   (Initial leader doesn't count as a 'change')

Round-by-round first place:
  ‚Ä¢ Round  1: F.C. Ashdod
  ‚Ä¢ Round  3: Beitar Jerusalem
  ‚Ä¢ Round  8: Bnei Sakhnin
  ‚Ä¢ Round 10: Hapoel Be'er Sheva

üìã Final Standings After Round 26 (Regular Season):


Unnamed: 0,position,team,played,won,drawn,lost,gf,ga,gd,points
5,1,Hapoel Be'er Sheva,26,18,5,3,54,13,41,59
13,2,Maccabi Tel Aviv,26,17,5,4,45,19,26,56
12,3,Maccabi Petah Tikva,26,13,9,4,36,23,13,48
0,4,Beitar Jerusalem,26,10,10,6,34,27,7,40
1,5,Bnei Sakhnin,26,10,9,7,26,26,0,39
11,6,Maccabi Haifa,26,10,8,8,30,25,5,38
10,7,Ironi Kiryat Shmona,26,9,8,9,35,33,2,35
6,8,Hapoel Haifa,26,8,4,14,29,36,-7,28
3,9,F.C. Ashdod,26,6,10,10,15,26,-11,28
8,10,Hapoel Ra'anana,26,7,7,12,14,29,-15,28



üìà Season Statistics:
  ‚Ä¢ Rounds analyzed: 26 (Regular Season only)
  ‚Ä¢ Teams: 14
  ‚Ä¢ Total matches: 182
  ‚Ä¢ Leader after regular season: Hapoel Be'er Sheva (59 pts, 26 games)
  ‚Ä¢ Runner-up: Maccabi Tel Aviv (56 pts, 26 games)
  ‚Ä¢ Points gap: 3 pts

‚ö†Ô∏è IMPORTANT NOTE:
   Wikipedia results matrix only shows REGULAR SEASON matches (26 rounds).
   Ligat Ha'al has additional Championship/Relegation playoffs (~10 rounds).
   Full season totals: ~36 matches, ~87 points for champion (as you mentioned).
   This analysis tracks leadership changes during the regular season only.

‚úÖ All team names are now normalized (full names used throughout).


### Visualization: Title Race Chart

Visualize how the top teams' points progressed throughout the season.


### Multi-Season Comparison: Competitive Balance

Compare how competitive each season was by analyzing leadership changes across multiple seasons.


## Analysis 2: Multi-Season Attendance Collection

Scrape attendance data from Transfermarkt for all 20 seasons to enable trend analysis.


In [15]:
# Scrape attendance data for multiple seasons from Transfermarkt
import pandas as pd
import time
from datetime import datetime

ensure_environment()

# Calculate seasons to scrape (last 20 seasons)
current_year = datetime.now().year
if datetime.now().month < 8:  # If before August, last season started in previous year
    current_year -= 1
seasons_to_scrape = list(range(current_year - 19, current_year + 1))

print(f"Scraping attendance data for {len(seasons_to_scrape)} seasons from Transfermarkt...")
print(f"Seasons: {seasons_to_scrape[0]}/{str(seasons_to_scrape[0]+1)[-2:]} to {seasons_to_scrape[-1]}/{str(seasons_to_scrape[-1]+1)[-2:]}")
print("=" * 80)

all_attendance = []
success_count = 0
fail_count = 0

for season_year in seasons_to_scrape:
    # Check if already scraped
    existing_file = DATA_DIR / f"attendance_{season_year}_{str(season_year+1)[-2:]}_ligat_haal_transfermarkt.csv"
    
    if existing_file.exists():
        print(f"{season_year}/{str(season_year+1)[-2:]}: ‚è≠ (already exists, loading...)")
        try:
            df = pd.read_csv(existing_file)
            all_attendance.append(df)
            success_count += 1
        except Exception as e:
            print(f"   ‚ö† Error loading existing file: {e}")
            fail_count += 1
    else:
        # Scrape new season
        df = scrape_transfermarkt_attendance(season_year)
        
        if df is not None and len(df) > 0:
            all_attendance.append(df)
            success_count += 1
            time.sleep(2)  # Be polite to Transfermarkt
        else:
            fail_count += 1
        
        time.sleep(1)  # Rate limiting

print("\n" + "=" * 80)
print(f"‚úÖ Successfully collected: {success_count} seasons")
print(f"‚ùå Failed: {fail_count} seasons")

if all_attendance:
    # Combine all seasons
    combined_attendance = pd.concat(all_attendance, ignore_index=True)
    
    # Save combined file
    combined_path = DATA_DIR / "attendance_all_seasons_ligat_haal_transfermarkt.csv"
    save_csv(combined_attendance, combined_path)
    
    print(f"\nüìä Combined Attendance Data:")
    print(f"  ‚Ä¢ Total rows: {len(combined_attendance)}")
    print(f"  ‚Ä¢ Seasons: {combined_attendance['season'].nunique()}")
    print(f"  ‚Ä¢ Unique teams: {combined_attendance['team'].nunique()}")
    
    # Calculate utilization percentage
    combined_attendance['utilization_pct'] = (
        combined_attendance['average_attendance'] / combined_attendance['capacity'] * 100
    ).fillna(0).round(1)
    
    # Show sample
    display(combined_attendance.head(10))
    
    # Summary statistics by season
    print("\nüìà Average Attendance by Season:")
    season_avg = combined_attendance.groupby('season').agg({
        'average_attendance': 'mean',
        'utilization_pct': 'mean',
        'team': 'count'
    }).round(1)
    season_avg.columns = ['Avg Attendance', 'Avg Utilization %', 'Teams']
    display(season_avg)
else:
    print("\n‚ö† No attendance data was successfully collected.")


Scraping attendance data for 20 seasons from Transfermarkt...
Seasons: 2006/07 to 2025/26
2006/07: ‚è≠ (already exists, loading...)
2007/08: ‚è≠ (already exists, loading...)
2008/09: ‚è≠ (already exists, loading...)
2009/10: ‚è≠ (already exists, loading...)
2010/11: ‚è≠ (already exists, loading...)
2011/12: ‚è≠ (already exists, loading...)
2012/13: ‚è≠ (already exists, loading...)
2013/14: ‚è≠ (already exists, loading...)
2014/15: ‚è≠ (already exists, loading...)
2015/16: ‚è≠ (already exists, loading...)
2016/17: ‚è≠ (already exists, loading...)
2017/18: ‚è≠ (already exists, loading...)
2018/19: ‚è≠ (already exists, loading...)
2019/20: ‚è≠ (already exists, loading...)
2020/21: ‚è≠ (already exists, loading...)
2021/22: ‚è≠ (already exists, loading...)
2022/23: ‚è≠ (already exists, loading...)
2023/24: ‚è≠ (already exists, loading...)
2024/25: ‚è≠ (already exists, loading...)
2025/26: ‚è≠ (already exists, loading...)

‚úÖ Successfully collected: 20 seasons
‚ùå Failed: 0 seasons
Saved: c

Unnamed: 0,season,team,stadium,capacity,total_spectators,average_attendance,utilization_pct
0,2006/07,Bnei Yehuda Tel Aviv,Unknown,6020,49000,3063,50.9
1,2006/07,Hapoel Tel Aviv,Unknown,29150,16000,5333,18.3
2,2006/07,Hapoel Petah Tikva,Unknown,11500,10250,2050,17.8
3,2006/07,Beitar Jerusalem,Unknown,33500,10000,10000,29.9
4,2006/07,Maccabi Netanya,Unknown,13610,9250,3083,22.7
5,2006/07,Maccabi Haifa,Unknown,30780,7700,3850,12.5
6,2006/07,Hakoah Amidar Ramat Gan,Unknown,8000,6250,1250,15.6
7,2006/07,Hapoel Kfar Saba,Unknown,5800,4500,2250,38.8
8,2006/07,FC Ashdod,Unknown,8200,2500,2500,30.5
9,2006/07,Maccabi Petah Tikva,Unknown,11500,2000,2000,17.4



üìà Average Attendance by Season:


Unnamed: 0_level_0,Avg Attendance,Avg Utilization %,Teams
season,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2006/07,3135.8,22.2,12
2007/08,5737.8,36.8,12
2008/09,0.0,0.0,12
2009/10,3926.3,24.1,16
2010/11,4866.6,28.2,16
2011/12,3890.8,26.9,16
2012/13,5038.1,30.5,14
2013/14,5444.1,36.7,14
2014/15,7630.2,43.1,14
2015/16,6854.4,42.1,14


In [16]:
print(DATA_DIR.resolve())


C:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw


In [None]:
# ‚ö†Ô∏è NOTE: This cell has been moved to after the function definitions.
# Please scroll down to find the validation cell after cells 47-48.
# This placeholder prevents execution errors.

print("‚ö†Ô∏è Validation cell moved to bottom of notebook (after function definitions).")

‚úÖ Saved 198 matches -> matches_2006_07_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2006
‚úÖ Saved 198 matches -> matches_2007_08_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2007
‚úÖ Saved 198 matches -> matches_2008_09_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2008
‚úÖ Saved 240 matches -> matches_2009_10_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2009
‚úÖ Saved 240 matches -> matches_2010_11_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2010
‚úÖ Saved 240 matches -> matches_2011_12_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2011
‚úÖ Saved 182 matches -> matches_2012_13_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2012
‚úÖ Saved 182 matches -> matches_2013_14_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed for 2013
‚úÖ Saved 182 matches -> matches_2014_15_ligat_haal_transfermarkt.csv
‚ö†Ô∏è No playoff matches parsed f

Unnamed: 0,season_year,regular_matches
0,2006,198
1,2007,198
2,2008,198
3,2009,240
4,2010,240
5,2011,240
6,2012,182
7,2013,182
8,2014,182
9,2015,182


In [None]:
# Debug: Inspect Transfermarkt HTML structure for 2023 season
import requests
from bs4 import BeautifulSoup

url_2023 = 'https://www.transfermarkt.com/ligat-haal/gesamtspielplan/wettbewerb/ISR1?saison_id=2023'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
resp = requests.get(url_2023, headers=HEADERS, timeout=20)
html = resp.text

soup = BeautifulSoup(html, 'html.parser')

# Look for match tables
print('=== Checking for match tables ===')
tables = soup.find_all('table')
print(f'Total tables found: {len(tables)}')

# Check for rows with scores (pattern: digit:digit)
import re
all_rows_with_scores = []
for table in tables:
    for tr in table.find_all('tr'):
        txt = tr.get_text()
        if re.search(r'\d+:\d+', txt):
            all_rows_with_scores.append(tr)

print(f'Rows with score pattern (\\d+:\\d+): {len(all_rows_with_scores)}')

# Inspect first few match rows
if all_rows_with_scores:
    print('\n=== First match row HTML sample ===')
    print(str(all_rows_with_scores[0])[:800])
    
    # Try to extract team names and scores from first row
    first_row = all_rows_with_scores[0]
    team_links = first_row.find_all('a', href=re.compile(r'/verein/'))
    print(f'\nTeam links found in first row: {len(team_links)}')
    for i, link in enumerate(team_links[:4]):
        print(f'  Team {i+1}: {link.get_text(strip=True)}')
    
    # Find score
    score_match = re.search(r'(\d+):(\d+)', first_row.get_text())
    if score_match:
        print(f'Score found: {score_match.group(0)}')
else:
    print('No rows with scores found - checking page structure...')
    print('\nFirst 2000 chars of HTML:')
    print(html[:2000])

=== Checking for match tables ===
Total tables found: 27
Rows with score pattern (\d+:\d+): 342

=== First match row HTML sample ===
<tr class="bg_blau_20">
<td class="show-for-small" colspan="7">
                                            Sat                                                    <a href="/aktuell/waspassiertheute/aktuell/new/datum/2023-08-26">26/08/23</a>6:00 PM                                        </td>
</tr>

Team links found in first row: 0
Score found: 236:00


In [None]:
# Find actual match rows (with 2 team links)
match_rows = []
for row in all_rows_with_scores:
    team_links = row.find_all('a', href=re.compile(r'/verein/'))
    if len(team_links) >= 2:
        match_rows.append(row)

print(f'\nMatch rows with 2+ team links: {len(match_rows)}')

if match_rows:
    print('\n=== Sample match row HTML ===')
    sample_row = match_rows[0]
    print(str(sample_row)[:1200])
    
    # Extract teams
    team_links = sample_row.find_all('a', href=re.compile(r'/verein/'))
    print(f'\nTeams: {team_links[0].get_text(strip=True)} vs {team_links[1].get_text(strip=True)}')
    
    # Extract score
    score_match = re.search(r'(\d+):(\d+)', sample_row.get_text())
    if score_match:
        print(f'Score: {score_match.group(0)}')
    
    # Check for round/matchday info in nearby elements
    print('\n=== Looking for matchday headers ===')
    # Find parent table or section
    parent = sample_row.find_parent('table')
    if parent:
        # Look for headers before this row
        prev_elements = parent.find_all(['h2', 'h3', 'div'], class_=re.compile(r'box-headline|table-header|spieltag'))
        print(f'Found {len(prev_elements)} potential headers in table')
        if prev_elements:
            print(f'First header: {prev_elements[0].get_text(strip=True)[:100]}')


Match rows with 2+ team links: 182

=== Sample match row HTML ===
<tr>
<td class="hide-for-small">
                                        Sat                                                <a href="/aktuell/waspassiertheute/aktuell/new/datum/2023-08-26">26/08/23</a> </td>
<td class="zentriert hide-for-small">
                                                6:00 PM                                    </td>
<td class="text-right no-border-rechts hauptlink"><a href="/ihud-bnei-sachnin/spielplan/verein/4769/saison_id/2023" title="Ihud Bnei Sakhnin">Bnei Sakhnin</a></td>
<td class="zentriert no-border-links"><a href="/ihud-bnei-sachnin/spielplan/verein/4769/saison_id/2023" title="Ihud Bnei Sakhnin"><img alt="Ihud Bnei Sakhnin" class="tiny_wappen" src="https://tmssl.akamaized.net//images/wappen/tiny/4769.png?lm=1423260464" title="Ihud Bnei Sakhnin"/></a></td>
<td class="zentriert hauptlink">¬†<a class="ergebnis-link" href="/ihud-bnei-sakhnin_hapoel-tel-aviv/index/spielbericht/4118835" id="4

In [None]:
# Look for matchday/round structure in page
print('=== Looking for matchday structure ===')

# Check all headings in the page
headings = soup.find_all(['h1', 'h2', 'h3', 'h4'])
print(f'Total headings: {len(headings)}')

matchday_headings = []
for h in headings:
    txt = h.get_text(strip=True)
    if re.search(r'Matchday|Spieltag|Round|Championship|Relegation', txt, re.IGNORECASE):
        matchday_headings.append(txt)
        
print(f'\nMatchday-related headings found: {len(matchday_headings)}')
for i, h in enumerate(matchday_headings[:10]):
    print(f'  {i+1}. {h}')

# Check divs with 'box' class that might contain sections
boxes = soup.find_all('div', class_='box')
print(f'\nDiv.box elements: {len(boxes)}')

if boxes:
    first_box = boxes[0]
    h2 = first_box.find(['h2', 'h3'])
    if h2:
        print(f'First box header: {h2.get_text(strip=True)}')
    # Count match rows in first box
    box_tables = first_box.find_all('table')
    box_match_rows = 0
    for t in box_tables:
        for tr in t.find_all('tr'):
            if len(tr.find_all('a', href=re.compile(r'/verein/'))) >= 2:
                box_match_rows += 1
    print(f'Match rows in first box: {box_match_rows}')

=== Looking for matchday structure ===
Total headings: 1

Matchday-related headings found: 0

Div.box elements: 27
Match rows in first box: 0


In [None]:
# Test the fixed scraper on 2023 season
df_test = scrape_transfermarkt_regular(2023)
if df_test is not None:
    print(f'\nSuccessfully scraped {len(df_test)} matches for 2023/24')
    print(f'Columns: {list(df_test.columns)}')
    display(df_test.head(10))
    display(df_test.tail(5))

‚úÖ Saved 182 matches -> matches_2023_24_ligat_haal_transfermarkt.csv

Successfully scraped 182 matches for 2023/24
Columns: ['round', 'home', 'score', 'away']


Unnamed: 0,round,home,score,away
0,1,Bnei Sakhnin,1:1,Hapoel Tel Aviv
1,2,M. Petah Tikva,1:1,H. Jerusalem
2,3,Maccabi Netanya,1:1,M. Bnei Reineh
3,4,H. Beer Sheva,3:0,Hapoel Hadera
4,5,M. Tel Aviv,4:1,FC Ashdod
5,6,B. Jerusalem,1:2,Hapoel Haifa
6,7,Maccabi Haifa,2:1,H. Petah Tikva
7,8,M. Bnei Reineh,1:1,H. Beer Sheva
8,9,H. Petah Tikva,1:1,Bnei Sakhnin
9,10,Hapoel Haifa,2:2,M. Petah Tikva


Unnamed: 0,round,home,score,away
177,178,Maccabi Netanya,1:3,Hapoel Hadera
178,179,M. Petah Tikva,0:3,B. Jerusalem
179,180,Bnei Sakhnin,0:0,M. Bnei Reineh
180,181,M. Tel Aviv,3:1,Hapoel Haifa
181,182,Maccabi Haifa,0:0,Hapoel Tel Aviv


In [None]:
# Scrape all 20 seasons from Transfermarkt (2006-2025)
import time
from datetime import datetime

seasons = list(range(2006, 2026))  # 2006/07 through 2025/26
results = {}
failed = []

print(f'Starting to scrape {len(seasons)} seasons from Transfermarkt...')
print(f'Time started: {datetime.now().strftime("%H:%M:%S")}')
print('=' * 80)

for i, season_year in enumerate(seasons, 1):
    print(f'\n[{i}/{len(seasons)}] Scraping {season_year}/{str(season_year+1)[-2:]}...')
    
    try:
        df = scrape_transfermarkt_regular(season_year)
        if df is not None:
            results[season_year] = len(df)
        else:
            failed.append(season_year)
    except Exception as e:
        print(f'  ‚ùå Error: {e}')
        failed.append(season_year)
    
    # Be polite - wait between requests
    if i < len(seasons):
        time.sleep(2)

print('\n' + '=' * 80)
print(f'\n‚úÖ Successfully scraped: {len(results)} seasons')
print(f'‚ùå Failed: {len(failed)} seasons')
if failed:
    print(f'  Failed seasons: {failed}')

# Show summary
if results:
    import pandas as pd
    summary = pd.DataFrame(list(results.items()), columns=['Season', 'Matches'])
    summary['Season'] = summary['Season'].apply(lambda x: f"{x}/{str(x+1)[-2:]}")
    print('\n‚úÖ Scraping Summary:')
    display(summary)
    
    print(f'\nTotal matches scraped: {sum(results.values())}')
    print(f'Time finished: {datetime.now().strftime("%H:%M:%S")}')

Starting to scrape 20 seasons from Transfermarkt...
Time started: 00:50:40

[1/20] Scraping 2006/07...
‚úÖ Saved 198 matches -> matches_2006_07_ligat_haal_transfermarkt.csv
‚úÖ Saved 198 matches -> matches_2006_07_ligat_haal_transfermarkt.csv

[2/20] Scraping 2007/08...

[2/20] Scraping 2007/08...
‚úÖ Saved 198 matches -> matches_2007_08_ligat_haal_transfermarkt.csv
‚úÖ Saved 198 matches -> matches_2007_08_ligat_haal_transfermarkt.csv

[3/20] Scraping 2008/09...

[3/20] Scraping 2008/09...
‚úÖ Saved 198 matches -> matches_2008_09_ligat_haal_transfermarkt.csv
‚úÖ Saved 198 matches -> matches_2008_09_ligat_haal_transfermarkt.csv

[4/20] Scraping 2009/10...

[4/20] Scraping 2009/10...
‚úÖ Saved 240 matches -> matches_2009_10_ligat_haal_transfermarkt.csv
‚úÖ Saved 240 matches -> matches_2009_10_ligat_haal_transfermarkt.csv

[5/20] Scraping 2010/11...

[5/20] Scraping 2010/11...
‚úÖ Saved 240 matches -> matches_2010_11_ligat_haal_transfermarkt.csv
‚úÖ Saved 240 matches -> matches_2010_11_li

Unnamed: 0,Season,Matches
0,2006/07,198
1,2007/08,198
2,2008/09,198
3,2009/10,240
4,2010/11,240
5,2011/12,240
6,2012/13,182
7,2013/14,182
8,2014/15,182
9,2015/16,182



Total matches scraped: 3749
Time finished: 00:51:41


In [None]:
# Verify scraped data and compare with Wikipedia format
from pathlib import Path
import pandas as pd

# List all Transfermarkt CSVs
DATA_DIR = Path(ROOT) / 'data' / 'raw'
transfermarkt_files = sorted(DATA_DIR.glob('matches_*_ligat_haal_transfermarkt.csv'))

print(f'‚úÖ Found {len(transfermarkt_files)} Transfermarkt CSV files')
print('\nFiles:')
for f in transfermarkt_files:
    print(f'  - {f.name}')

# Load and check format of first file
if transfermarkt_files:
    sample_file = transfermarkt_files[0]
    df_sample = pd.read_csv(sample_file)
    
    print(f'\n‚úÖ Sample file: {sample_file.name}')
    print(f'  Columns: {list(df_sample.columns)}')
    print(f'  Shape: {df_sample.shape}')
    print(f'\nFirst 5 rows:')
    display(df_sample.head())
    
    # Check for any missing data
    print(f'\nData quality check:')
    print(f'  Missing home teams: {df_sample["home"].isna().sum()}')
    print(f'  Missing away teams: {df_sample["away"].isna().sum()}')
    print(f'  Missing scores: {df_sample["score"].isna().sum()}')

# Compare with Wikipedia format
wiki_files = sorted(DATA_DIR.glob('matches_*_ligat_haal_wikipedia.csv'))
if wiki_files:
    wiki_sample = pd.read_csv(wiki_files[0])
    print(f'\n‚úÖ Wikipedia sample: {wiki_files[0].name}')
    print(f'  Columns: {list(wiki_sample.columns)}')
    print(f'\nFirst 3 rows:')
    display(wiki_sample.head(3))
    
    print('\n‚úÖ Format comparison:')
    print(f'  Transfermarkt columns: {list(df_sample.columns)}')
    print(f'  Wikipedia columns: {list(wiki_sample.columns)}')
    print(f'  Match: {list(df_sample.columns) == list(wiki_sample.columns)}')

‚úÖ Found 20 Transfermarkt CSV files

Files:
  - matches_2006_07_ligat_haal_transfermarkt.csv
  - matches_2007_08_ligat_haal_transfermarkt.csv
  - matches_2008_09_ligat_haal_transfermarkt.csv
  - matches_2009_10_ligat_haal_transfermarkt.csv
  - matches_2010_11_ligat_haal_transfermarkt.csv
  - matches_2011_12_ligat_haal_transfermarkt.csv
  - matches_2012_13_ligat_haal_transfermarkt.csv
  - matches_2013_14_ligat_haal_transfermarkt.csv
  - matches_2014_15_ligat_haal_transfermarkt.csv
  - matches_2015_16_ligat_haal_transfermarkt.csv
  - matches_2016_17_ligat_haal_transfermarkt.csv
  - matches_2017_18_ligat_haal_transfermarkt.csv
  - matches_2018_19_ligat_haal_transfermarkt.csv
  - matches_2019_20_ligat_haal_transfermarkt.csv
  - matches_2020_21_ligat_haal_transfermarkt.csv
  - matches_2021_22_ligat_haal_transfermarkt.csv
  - matches_2022_23_ligat_haal_transfermarkt.csv
  - matches_2023_24_ligat_haal_transfermarkt.csv
  - matches_2024_25_ligat_haal_transfermarkt.csv
  - matches_2025_26_liga

Unnamed: 0,round,home,score,away
0,1,H. Kfar Saba,4:1,H. Petah Tikva
1,2,M. Petah Tikva,0:0,Hakoah Amidar
2,3,FC Ashdod,1:0,Maccabi Herzlya
3,4,Maccabi Netanya,3:1,Maccabi Haifa
4,5,M. Tel Aviv,1:2,B. Jerusalem



Data quality check:
  Missing home teams: 0
  Missing away teams: 0
  Missing scores: 0

‚úÖ Wikipedia sample: matches_2006_07_ligat_haal_wikipedia.csv
  Columns: ['season', 'season_year', 'home_team', 'away_team', 'home_goals', 'away_goals', 'goal_diff', 'result', 'home_points', 'away_points']

First 3 rows:


Unnamed: 0,season,season_year,home_team,away_team,home_goals,away_goals,goal_diff,result,home_points,away_points
0,2006/07,2006,Beitar Jerusalem,BnY,0,0,0,D,1,1
1,2006/07,2006,Beitar Jerusalem,ASH,2,0,2,H,3,0
2,2006/07,2006,Beitar Jerusalem,HAK,0,0,0,D,1,1



‚úÖ Format comparison:
  Transfermarkt columns: ['round', 'home', 'score', 'away']
  Wikipedia columns: ['season', 'season_year', 'home_team', 'away_team', 'home_goals', 'away_goals', 'goal_diff', 'result', 'home_points', 'away_points']
  Match: False


In [None]:
# Final Summary: All 20 Seasons from Transfermarkt
import pandas as pd
from pathlib import Path

DATA_DIR = Path(ROOT) / 'data' / 'raw'
transfermarkt_files = sorted(DATA_DIR.glob('matches_*_ligat_haal_transfermarkt.csv'))

print('‚úÖ TRANSFERMARKT SCRAPING COMPLETE \u2705')
print('=' * 80)
print(f'\nSuccessfully scraped {len(transfermarkt_files)} seasons from Transfermarkt')
print(f'Seasons: 2006/07 to 2025/26')
print(f'Format: round, home, score, away (same as Wikipedia)')

# Load all files and create summary
all_data = []
season_summary = []

for csv_file in transfermarkt_files:
    df = pd.read_csv(csv_file)
    season = csv_file.stem.split('_')[1:3]  # Extract season from filename
    season_str = f"{season[0]}/{season[1]}"
    
    season_summary.append({
        'Season': season_str,
        'Matches': len(df),
        'Rounds': df['round'].max(),
        'Teams': len(set(df['home'].tolist() + df['away'].tolist()))
    })

summary_df = pd.DataFrame(season_summary)

print('\n‚úÖ Season Summary:')
display(summary_df)

print(f'\n‚úÖ Total Statistics:')
print(f'  Total matches: {summary_df["Matches"].sum()}')
print(f'  Average matches per season: {summary_df["Matches"].mean():.0f}')
print(f'  Max rounds in a season: {summary_df["Rounds"].max()}')
print(f'  Min rounds in a season: {summary_df["Rounds"].min()}')

print('\n‚úÖ Data Location:')
print(f'  Directory: {DATA_DIR}')
print(f'  Files: matches_YYYY_YY_ligat_haal_transfermarkt.csv')

print('\n‚úÖ Next Steps:')
print('  - Data is ready for analysis')
print('  - Same format as Wikipedia data (round, home, score, away)')
print('  - Can be combined or analyzed separately')
print('  - Playoff data available in gesamtspielplan pages (Championship/Relegation rounds)')

‚úÖ TRANSFERMARKT SCRAPING COMPLETE ‚úÖ

Successfully scraped 20 seasons from Transfermarkt
Seasons: 2006/07 to 2025/26
Format: round, home, score, away (same as Wikipedia)

‚úÖ Season Summary:


Unnamed: 0,Season,Matches,Rounds,Teams
0,2006/07,198,198,12
1,2007/08,198,198,12
2,2008/09,198,198,12
3,2009/10,240,240,16
4,2010/11,240,240,16
5,2011/12,240,240,16
6,2012/13,182,182,14
7,2013/14,182,182,14
8,2014/15,182,182,14
9,2015/16,182,182,14



‚úÖ Total Statistics:
  Total matches: 3749
  Average matches per season: 187
  Max rounds in a season: 240
  Min rounds in a season: 69

‚úÖ Data Location:
  Directory: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\notebooks\data\raw
  Files: matches_YYYY_YY_ligat_haal_transfermarkt.csv

‚úÖ Next Steps:
  - Data is ready for analysis
  - Same format as Wikipedia data (round, home, score, away)
  - Can be combined or analyzed separately
  - Playoff data available in gesamtspielplan pages (Championship/Relegation rounds)


In [18]:
# Transfermarkt Playoff Scraper (Restored) - outputs round, home, score, away
import re, time, requests
from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd

# Ensure ROOT and DATA_DIR exist
try:
    ROOT
except NameError:
    ROOT = Path.cwd()
DATA_DIR = Path(ROOT) / 'data' / 'raw'
DATA_DIR.mkdir(parents=True, exist_ok=True)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def http_get(url, retries=3, sleep=1.5):
    for attempt in range(1, retries+1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=20)
            if resp.status_code == 200:
                return resp.text
            else:
                print(f"HTTP {resp.status_code} for {url}")
        except Exception as e:
            print(f"Attempt {attempt} failed for {url}: {e}")
        time.sleep(sleep)
    return ''

def scrape_transfermarkt_playoffs(season_year):
    season_tag = f"{season_year}_{str(season_year+1)[-2:]}"
    out_csv = DATA_DIR / f"matches_{season_tag}_ligat_haal_transfermarkt_playoffs.csv"
    base_url = f"https://www.transfermarkt.com/ligat-haal/gesamtspielplan/wettbewerb/ISR1?saison_id={season_year}"
    # Note: League playoffs are included in gesamtspielplan as separate sections (e.g., Championship Round)
    html = http_get(base_url)
    if not html:
        print(f"‚ùå No HTML for playoffs {season_year}")
        return None
    soup = BeautifulSoup(html,'html.parser')
    rows_out = []
    playoff_round = 0
    for box in soup.select('div.box'):
        h2 = box.select_one('h2, h3')
        if not h2:
            continue
        title = h2.get_text(strip=True)
        # Identify playoff sections by keywords
        if not re.search(r'Championship|Relegation|Play-?off|Upper|Lower', title, re.IGNORECASE):
            continue
        table = box.select_one('table.items') or box.select_one('table')
        if not table:
            continue
        for tr in table.select('tbody tr'):
            tds = tr.find_all('td')
            if len(tds) < 5:
                continue
            home_a = tr.select_one('td.verein-heim a, td.heim a, td:nth-of-type(2) a[href*="/verein/"]')
            away_a = tr.select_one('td.verein-gast a, td.gast a, td:nth-of-type(6) a[href*="/verein/"]')
            if not home_a or not away_a:
                team_links = [a for a in tr.select('a[href*="/verein/"]') if a.get_text(strip=True)]
                if len(team_links) >= 2:
                    home_a, away_a = team_links[0], team_links[1]
                else:
                    continue
            home = home_a.get_text(strip=True)
            away = away_a.get_text(strip=True)
            score_cell = tr.select_one('td.ergebnis a, td.ergebnis, td:nth-of-type(5)')
            score_txt = score_cell.get_text(" ", strip=True) if score_cell else ''
            mscore = re.search(r'(\d+\s*:\s*\d+)', score_txt)
            score = mscore.group(1).replace(' ','') if mscore else ''
            if not score:
                continue
            playoff_round += 1
            rows_out.append({'round': playoff_round, 'home': home, 'score': score, 'away': away})
    if not rows_out:
        print(f"‚ö†Ô∏è No playoff matches parsed for {season_year}")
        return None
    df = pd.DataFrame(rows_out)
    df.to_csv(out_csv, index=False)
    print(f"‚úÖ Saved {len(df)} playoff matches -> {out_csv.name}")
    return df

In [19]:
# Transfermarkt Regular Season Scraper (Fixed) - outputs Wikipedia-style columns: round, home, score, away
import re, time, requests
from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd

# Ensure ROOT and DATA_DIR exist
try:
    ROOT
except NameError:
    ROOT = Path.cwd()
DATA_DIR = Path(ROOT) / 'data' / 'raw'
DATA_DIR.mkdir(parents=True, exist_ok=True)

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def http_get(url, retries=3, sleep=1.5):
    for attempt in range(1, retries+1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=20)
            if resp.status_code == 200:
                return resp.text
        except Exception as e:
            if attempt == retries:
                print(f"Failed after {retries} attempts: {e}")
        if attempt < retries:
            time.sleep(sleep)
    return ''

def scrape_transfermarkt_regular(season_year):
    """Scrape regular season matches from Transfermarkt gesamtspielplan page."""
    season_tag = f"{season_year}_{str(season_year+1)[-2:]}"
    out_csv = DATA_DIR / f"matches_{season_tag}_ligat_haal_transfermarkt.csv"
    
    url = f"https://www.transfermarkt.com/ligat-haal/gesamtspielplan/wettbewerb/ISR1?saison_id={season_year}"
    html = http_get(url)
    if not html:
        print(f"‚ùå No HTML for season {season_year}")
        return None
    
    soup = BeautifulSoup(html, 'html.parser')
    rows_out = []
    round_num = 0
    
    # Find all tables on the page
    tables = soup.find_all('table')
    
    for table in tables:
        # Look for match rows (rows with 2 team links)
        for tr in table.find_all('tr'):
            # Find all cells
            cells = tr.find_all('td')
            if len(cells) < 5:
                continue
            
            # Find score first to confirm this is a match row
            score_link = tr.find('a', class_='ergebnis-link')
            if not score_link:
                continue
            
            score_text = score_link.get_text(strip=True)
            # Validate score format (d:d)
            if not re.match(r'^\d+:\d+$', score_text):
                continue
            
            # Now find team links - typically in cells before and after score
            all_team_links = []
            for cell in cells:
                team_link = cell.find('a', href=re.compile(r'/verein/'))
                if team_link:
                    team_name = team_link.get_text(strip=True)
                    if team_name and team_name not in [link.get_text(strip=True) for link in all_team_links]:
                        all_team_links.append(team_link)
            
            if len(all_team_links) < 2:
                continue
            
            home = all_team_links[0].get_text(strip=True)
            away = all_team_links[1].get_text(strip=True)
            
            # Increment round for each match found
            round_num += 1
            
            rows_out.append({
                'round': round_num,
                'home': home,
                'score': score_text,
                'away': away
            })
    
    if not rows_out:
        print(f"‚ö†Ô∏è No matches parsed for {season_year}")
        return None
    
    df = pd.DataFrame(rows_out)
    df.to_csv(out_csv, index=False)
    print(f"‚úÖ Saved {len(df)} matches -> {out_csv.name}")
    return df

print('Regular season scraper updated with fixed team extraction.')

Regular season scraper updated with fixed team extraction.


## Validation: Test All Transfermarkt Scrapers

Now that the scraper functions are defined, let's validate them by scraping all 20 seasons.

In [None]:
# Run restored Transfermarkt scrapers for all seasons and validate coverage
seasons = list(range(2006, 2026))
regular_counts = {}
playoff_counts = {}

for sy in seasons:
    r = scrape_transfermarkt_regular(sy)
    if r is not None:
        regular_counts[sy] = len(r)
    
    p = scrape_transfermarkt_playoffs(sy)
    if p is not None:
        playoff_counts[sy] = len(p)

print('\n' + '='*80)
print('VALIDATION SUMMARY')
print('='*80)
print(f'Regular seasons scraped: {len(regular_counts)}')
print(f'Playoff seasons scraped: {len(playoff_counts)}')

import pandas as pd
summary_df = pd.DataFrame({
    'season_year': list(regular_counts.keys()), 
    'regular_matches': list(regular_counts.values())
}).sort_values('season_year')

print('\nDetailed breakdown:')
display(summary_df)