# Optional installs: quick note
Small commented pip commands; run only if you need to install dependencies in this environment.

# Notebook cell guide: what each cell does

This top-level cell documents every cell in this notebook (by cell number) and explains its purpose and recommended use. Run cells in the order listed unless you know what you're doing.

1. Optional installs (code)
   - Purpose: Comments with suggested pip install commands (or `pip install -r requirements.txt`) for setting up the environment. Run only if you need to install packages in the notebook environment.

2. Notebook title / description (markdown)
   - Purpose: High-level description of the notebook and its design choices (uses `.env`/`key.env`, stores data under `data/`). No execution.

3. Environment setup (code)
   - Purpose: Loads environment variables (using python-dotenv), prepares `ROOT`, `DATA_DIR`, `INTERIM_DIR`, `PROCESSED_DIR`, `FIG_DIR`, and attempts to load `APISPORTS_KEY` from the environment or several `key.env`/`.env` candidate locations. Creates directories if missing and sets `BASE_URL` and `HEADERS` used for API calls.
   - Run first.

4. Small local test file (code)
   - Purpose: Writes a small `test.txt` into `DATA_DIR` to verify write permissions and that `DATA_DIR` is correctly configured.
   - Optional; run to verify directories.

5. League listing + single-season download (code)
   - Purpose: Uses the `api_get` helper to list leagues in Israel, chooses the Israeli Premier League (`LEAGUE_ID`) and downloads fixtures for a single `SEASON_YEAR` (default 2022). Saves results to a CSV under `DATA_DIR`.
   - Run after cell 3.

6. Enrichment & cleaning (code)
   - Purpose: Reads a matches CSV (e.g., `matches_2022_23_ligat_haal.csv`), parses `round` into `phase` and `round_num`, computes `goal_diff`, `result`, `home_points`/`away_points`, flags one-sided matches, drops irrelevant columns, reorders columns, and writes the enriched CSV to `INTERIM_DIR`.
   - Run after you have the relevant matches CSV from cell 5 or the multi-season download.

7. sync_test utility (code)
   - Purpose: Small helper function `sync_check()` to test that the repo sync/process is working. Not part of the data pipeline.

8. Kernel check (code)
   - Purpose: Simple cell to verify the Python kernel is operational. Prints a confirmation.

9. Multi-season helper (code)
   - Purpose: `get_season_fixtures()` helper and a multi-season downloader (initial implementation). Use this to fetch multiple seasons programmatically. The notebook also contains newer multi-season cells later; prefer the "available seasons" approach.
   - Run after cell 3 and after verifying the API key.

10. API setup check (code)
    - Purpose: `check_api_setup()` verifies `APISPORTS_KEY` exists, `HEADERS` is set, and performs a quick API `status` check. Run this after cell 3 and before any download cells.

11. Historical download (code)
    - Purpose: Attempts to download seasons from 2010-2024. This may return no data for older seasons if the API doesn't provide them. It saves per-season CSVs and attempts to combine them.
    - Use only if you need historical attempts; prefer the "available seasons" downloader below.

12. Available-seasons check (code)
    - Purpose: Queries the API to ask which seasons are actually available for the selected league (uses `/leagues/seasons` or falls back to `/leagues`). Prints the exact seasons the API provides.
    - Run this to confirm what the API can return. This explains why some seasons were missing.

13. Download available seasons (code)
    - Purpose: Downloads only the seasons returned by the available-seasons check, saves per-season CSVs, and combines them into `matches_all_seasons_ligat_haal.csv`.
    - Recommended: run cell 10 (API check) → cell 12 (available seasons) → this cell.

14+. Append / helper cells (code)
    - Purpose: Any additional helper or analysis cells appended by you (e.g., progress improvements, minor fixes). Review and run as needed.

Recommended minimal run order (safe):
- Cell 3 (Environment setup)
- Cell 10 (API setup check)
- Cell 12 (Check available seasons)
- Cell 13 (Download available seasons)
- Cell 6 (Enrich the CSVs you saved)

Notes & troubleshooting:
- If `APISPORTS_KEY` is missing, copy `key.env.example` to `key.env` at the repo root and set `APISPORTS_KEY=your_key_here`.
- Use `setx APISPORTS_KEY "your_key"` (PowerShell) for a persistent Windows user env var (restart VS Code to pick it up).
- Some older seasons may not exist in the API; the `available seasons` cell tells you exactly what years you can download.
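
The key lookup described above (environment variable first, then candidate `key.env`/`.env` files) can be sketched as follows. The notebook itself uses python-dotenv; this is a standard-library approximation for illustration only, and the function name `load_api_key` is an assumption, not something defined in the notebook.

```python
import os
from pathlib import Path


def load_api_key(candidates, env_name="APISPORTS_KEY"):
    """Return the key from the environment, else from the first candidate file defining it.

    Hypothetical helper: a simplified stand-in for what python-dotenv does.
    """
    value = os.environ.get(env_name)
    if value:
        return value.strip()
    for path in map(Path, candidates):
        if not path.is_file():
            continue  # skip candidate locations that do not exist
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # ignore blank lines, comments, and lines without NAME=value
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, raw = line.partition("=")
            if name.strip() == env_name:
                return raw.strip().strip('"').strip("'")
    return None
```

A call such as `load_api_key([Path("key.env"), Path(".env")])` returns `None` when nothing is found, which is the case where copying `key.env.example` to `key.env` applies.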


In [78]:
# Optional: install requirements (recommended to use requirements.txt)
# If you need to install dependencies in the notebook environment, uncomment one of the lines below.
# It's better to run these once in your environment or use a virtualenv and install from requirements.txt.
# pip install -r ../requirements.txt
# or (not recommended to run on every notebook execution):
# pip -q install pandas requests python-dateutil python-dotenv


# Enrich Wikipedia match-by-match table for competitiveness analysis
"""
This cell enriches the Wikipedia match-by-match CSV we created earlier for 2016/17.
Improvements in this revision:
 - Adds fuzzy-name matching (difflib) to map standings team names to match team names when they differ slightly (e.g. punctuation, diacritics, short forms).
 - Emits mapping diagnostics (suggestions and unmatched counts) so you can review and add manual overrides if needed.
 - Preserves prior derived columns (goal_diff, result, points, quartile flags).
Saves output to INTERIM_DIR / 'matches_2016_17_ligat_haal_enriched.csv'.
"""
import pandas as pd
from pathlib import Path
import re
import math
import difflib

# Files created by previous cells
matches_csv = Path(DATA_DIR) / "matches_2016_17_ligat_haal_wikipedia.csv"
standings_csv = Path(DATA_DIR) / "ligat_haal_2016_17_wikipedia.csv"
out_path = INTERIM_DIR / "matches_2016_17_ligat_haal_enriched.csv"

# Safety checks
if not matches_csv.exists():
    raise FileNotFoundError(f"Matches CSV not found: {matches_csv}. Run the scraping cell first.")

# Read matches
matches = pd.read_csv(matches_csv)
# ensure goals are numeric
for col in ["home_goals","away_goals"]:
    if col in matches.columns:
        matches[col] = pd.to_numeric(matches[col], errors='coerce')
    else:
        # try to parse from 'score' column if present
        if 'score' in matches.columns:
            # split on an en dash (U+2013) or a plain hyphen
            matches[['home_goals','away_goals']] = matches['score'].str.split("[\u2013-]", expand=True).apply(lambda s: pd.to_numeric(s.str.strip(), errors='coerce'))
        else:
            matches[col] = pd.NA

# Derived columns
matches['goal_diff'] = matches['home_goals'] - matches['away_goals']
def result_from_diff(d):
    if pd.isna(d):
        return pd.NA
    if d>0:
        return 'H'
    if d<0:
        return 'A'
    return 'D'
matches['result'] = matches['goal_diff'].apply(result_from_diff)
matches['home_points'] = matches['result'].map({'H':3,'D':1,'A':0}).fillna(0).astype(int)
matches['away_points'] = matches['result'].map({'A':3,'D':1,'H':0}).fillna(0).astype(int)

# Normalize team names helper (to increase match probability between standings and matches)
def norm(s):
    if pd.isna(s):
        return ''
    s = str(s)
    s = s.replace('\u2013','-')  # en dash
    s = s.replace('\u2014','-')  # em dash
    s = re.sub(r'[^\w\s\-]','', s)
    s = s.lower().strip()
    s = re.sub(r'\s+',' ', s)
    return s

matches['_home_norm'] = matches['home_team'].apply(norm)
matches['_away_norm'] = matches['away_team'].apply(norm)

# Try to read standings and map final ranks/points
rank_map = {}
points_map = {}
suggestion_map = {}  # when fuzzy matching resolves a name, record it here
if standings_csv.exists():
    std = pd.read_csv(standings_csv)
    # Find possible rank column and points column heuristically
    cols = [c.lower() for c in std.columns.astype(str)]
    # rank candidates: 'pos','position','#','rank'
    rank_col = None
    for c in ('pos','position','rank','#'):
        if c in cols:
            rank_col = std.columns[cols.index(c)]
            break
    # points candidates: 'pts','points'
    points_col = None
    for c in ('pts','points'):
        if c in cols:
            points_col = std.columns[cols.index(c)]
            break
    # team name column candidate: 'team','club'
    team_col = None
    for c in ('team','club','team(s)'):
        if c in cols:
            team_col = std.columns[cols.index(c)]
            break
    if team_col is None:
        # fallback: use first column
        team_col = std.columns[0]
    # Build maps by normalized team name
    for _, r in std.iterrows():
        team = str(r.get(team_col))
        n = norm(team)
        if rank_col and not pd.isna(r.get(rank_col)):
            try:
                rank_map[n] = int(r.get(rank_col))
            except Exception:
                rank_map[n] = None
        if points_col and not pd.isna(r.get(points_col)):
            try:
                points_map[n] = float(r.get(points_col))
            except Exception:
                points_map[n] = None
    # If ranks found, compute quartiles
    ranks = [v for v in rank_map.values() if v is not None]
    n_teams = len(ranks) if ranks else 0
    top_cut = math.ceil(n_teams/4) if n_teams else None
    # Prepare list of normalized names for fuzzy matching
    std_norms = list(rank_map.keys())
else:
    std = None
    team_col = None
    n_teams = 0
    top_cut = None
    std_norms = []

# Fuzzy-resolving lookup functions: exact match first, then difflib.get_close_matches
def lookup_rank_norm(n):
    # n is already normalized (string)
    if not n:
        return None
    if n in rank_map:
        return rank_map.get(n)
    if std_norms:
        best = difflib.get_close_matches(n, std_norms, n=1, cutoff=0.6)
        if best:
            suggestion_map[n] = best[0]
            return rank_map.get(best[0])
    return None

def lookup_points_norm(n):
    if not n:
        return None
    if n in points_map:
        return points_map.get(n)
    if std_norms:
        best = difflib.get_close_matches(n, std_norms, n=1, cutoff=0.6)
        if best:
            suggestion_map[n] = best[0]
            return points_map.get(best[0])
    return None

# Apply lookups
matches['final_rank_home'] = matches['_home_norm'].apply(lookup_rank_norm)
matches['final_rank_away'] = matches['_away_norm'].apply(lookup_rank_norm)
matches['final_points_home'] = matches['_home_norm'].apply(lookup_points_norm)
matches['final_points_away'] = matches['_away_norm'].apply(lookup_points_norm)

# rank_diff when both present
matches['rank_diff'] = matches.apply(lambda r: (r['final_rank_home'] - r['final_rank_away']) if (pd.notna(r['final_rank_home']) and pd.notna(r['final_rank_away'])) else pd.NA, axis=1)

# Flags: top_quartile and bottom_quartile (based on final ranks; lower rank number is better)
if top_cut:
    matches['home_top_quartile'] = matches['final_rank_home'].apply(lambda v: True if (pd.notna(v) and int(v) <= top_cut) else False)
    matches['away_top_quartile'] = matches['final_rank_away'].apply(lambda v: True if (pd.notna(v) and int(v) <= top_cut) else False)
    matches['home_bottom_quartile'] = matches['final_rank_home'].apply(lambda v: True if (pd.notna(v) and int(v) > n_teams - top_cut) else False)
    matches['away_bottom_quartile'] = matches['final_rank_away'].apply(lambda v: True if (pd.notna(v) and int(v) > n_teams - top_cut) else False)
    matches['top_vs_bottom'] = matches.apply(lambda r: (r['home_top_quartile'] and r['away_bottom_quartile']) or (r['away_top_quartile'] and r['home_bottom_quartile']), axis=1)
else:
    matches['home_top_quartile'] = False
    matches['away_top_quartile'] = False
    matches['home_bottom_quartile'] = False
    matches['away_bottom_quartile'] = False
    matches['top_vs_bottom'] = False

# Diagnostics: how many mapped via exact vs fuzzy vs unmatched
mapped_home_exact = matches['_home_norm'].isin(rank_map.keys()).sum()
mapped_away_exact = matches['_away_norm'].isin(rank_map.keys()).sum()
mapped_home_fuzzy = matches['final_rank_home'].notna().sum() - mapped_home_exact
mapped_away_fuzzy = matches['final_rank_away'].notna().sum() - mapped_away_exact
unmapped_home = len(matches) - matches['final_rank_home'].notna().sum()
unmapped_away = len(matches) - matches['final_rank_away'].notna().sum()
print(f"Mapping summary: exact-home={mapped_home_exact}, fuzzy-home={mapped_home_fuzzy}, unmapped-home={unmapped_home}")
print(f"Mapping summary: exact-away={mapped_away_exact}, fuzzy-away={mapped_away_fuzzy}, unmapped-away={unmapped_away}")
if suggestion_map:
    print("\nSample fuzzy suggestions (match_norm -> standings_norm):")
    for k,v in list(suggestion_map.items())[:20]:
        print(f"  {k}  ->  {v}")

# Cleanup helper columns and save
matches = matches.drop(columns=[c for c in matches.columns if c.startswith('_')])
matches.to_csv(out_path, index=False, encoding='utf-8-sig')
print(f"Saved enriched matches to: {out_path} | rows: {len(matches)}")
matches.head(20)

# [...] (the start of the league-listing cell, including the opening of choose_israeli_premier(), is missing from this export)
        if any(p in name_norm for p in PREFERRED_NAMES):
            return int(row['id']), row['name']
    for _, row in df.iterrows():
        if 'ligat' in (row['name'] or '').lower():
            return int(row['id']), row['name']
    if not df.empty:
        r0 = df.iloc[0]
        return int(r0['id']), r0['name']
    return None, None

LEAGUE_ID, LEAGUE_NAME = choose_israeli_premier(israel_leagues_df)
assert LEAGUE_ID is not None, "Ligat Ha'al league ID not found."
print(f'Selected league: {LEAGUE_NAME} (ID={LEAGUE_ID})')

SEASON_YEAR = 2022
fx = api_get('/fixtures', {'league': LEAGUE_ID, 'season': SEASON_YEAR, 'timezone': 'UTC'})
rows = []
for item in fx.get('response', []):
    fixture = item.get('fixture', {})
    league  = item.get('league', {})
    teams   = item.get('teams', {})
    goals   = item.get('goals', {})
    dt = fixture.get('date')
    try:
        dt = dateparser.parse(dt).strftime('%Y-%m-%d') if dt else None
    except Exception:
        dt = None
    rows.append({
        'season': f'{SEASON_YEAR}/{str(SEASON_YEAR+1)[-2:]}',
        'date': dt,
        'round': league.get('round'),
        'stage': league.get('name'),
        'home_team': teams.get('home', {}).get('name'),
        'away_team': teams.get('away', {}).get('name'),
        'home_goals': goals.get('home'),
        'away_goals': goals.get('away'),
        'venue': fixture.get('venue', {}).get('name'),
        'referee': fixture.get('referee'),
        'fixture_id': fixture.get('id'),
        'league_id': league.get('id'),
        'league_name': league.get('name'),
    })
df = pd.DataFrame(rows)
csv_path = DATA_DIR / f"matches_{SEASON_YEAR}_{str(SEASON_YEAR+1)[-2:]}_ligat_haal.csv"
df.to_csv(csv_path, index=False, encoding='utf-8-sig')
print(f'Saved: {csv_path} | rows: {len(df)}')
display(df.head(10))
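
The exact-then-fuzzy lookup used in the Wikipedia enrichment cell can be illustrated in isolation. This is a minimal sketch: the `standings` ranks below are made-up sample values, and `lookup_rank` is a hypothetical condensed version of the notebook's `lookup_rank_norm`.

```python
import difflib

# Made-up sample data: normalized standings name -> final rank
standings = {"maccabi tel aviv": 1, "hapoel beer sheva": 2, "bnei sakhnin": 9}


def lookup_rank(name, rank_map, cutoff=0.6):
    """Return (rank, matched_name); try an exact key first, then the closest fuzzy match."""
    if name in rank_map:
        return rank_map[name], name
    best = difflib.get_close_matches(name, list(rank_map), n=1, cutoff=cutoff)
    if best:
        return rank_map[best[0]], best[0]
    return None, None


print(lookup_rank("maccabi tel-aviv", standings))  # fuzzy hit on "maccabi tel aviv"
print(lookup_rank("xyz", standings))               # no candidate above the cutoff
```

A loose cutoff such as 0.6 may also pair different clubs whose names share a long suffix, which is why reviewing the printed suggestions and adding manual overrides is worthwhile.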

# List leagues & download a single season
Query Israeli leagues, choose Ligat Ha'al, download fixtures for `SEASON_YEAR` and save to CSV.
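
The league-selection step can be sketched on plain dictionaries. The beginning of the notebook's `choose_israeli_premier()` is not available here, so everything below is an assumption made for illustration: the contents of `PREFERRED_NAMES`, the function name `choose_league`, and all sample entries other than `Ligat Ha'al` with ID 383.

```python
# Hypothetical preferred spellings; the notebook's real list is not shown
PREFERRED_NAMES = ("ligat ha'al", "ligat haal")


def choose_league(leagues):
    """Pick (id, name) from a list of {'id', 'name'} dicts, or (None, None) if empty."""
    # 1) prefer an explicit name match
    for lg in leagues:
        name_norm = (lg.get("name") or "").strip().lower()
        if any(p in name_norm for p in PREFERRED_NAMES):
            return int(lg["id"]), lg["name"]
    # 2) fall back to any name containing "ligat"
    for lg in leagues:
        if "ligat" in (lg.get("name") or "").lower():
            return int(lg["id"]), lg["name"]
    # 3) last resort: the first league returned
    if leagues:
        return int(leagues[0]["id"]), leagues[0]["name"]
    return None, None


sample = [{"id": 1, "name": "Some Cup"}, {"id": 383, "name": "Ligat Ha'al"}]
print(choose_league(sample))
```

The three-step fallback mirrors the visible tail of the notebook function; asserting that the returned ID is not `None` before downloading guards against an empty league list.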

In [79]:
# === Enrich the table + clean up unnecessary columns ===
import re
import pandas as pd

in_path  = DATA_DIR / "matches_2022_23_ligat_haal.csv"   # change to your file
out_path = INTERIM_DIR / "matches_2022_23_enriched.csv"

if not in_path.exists():
    raise FileNotFoundError(f"Input matches file not found: {in_path}")

df = pd.read_csv(in_path)

# --- Helper columns ---
# 1) Numeric start year of the season
#df["season_year"] = df["season"].str.slice(0,4).astype(int)

# 2) Round number and phase
def parse_round(r):
    # Examples: "Regular Season - 1", "Championship Round - 5"
    if pd.isna(r):
        return (None, None)
    r = str(r)
    m = re.search(r"(Regular|Championship|Relegation).*?(\d+)", r, flags=re.I)
    phase = None
    if "regular" in r.lower():
        phase = "regular"
    elif "championship" in r.lower():
        phase = "championship"
    elif "relegation" in r.lower():
        phase = "relegation"
    round_num = int(m.group(2)) if m else None
    return (phase, round_num)

tmp = df["round"].apply(parse_round).tolist()
df["phase"] = [t[0] for t in tmp]
df["round_num"] = [t[1] for t in tmp]

# 3) Goal difference, result, points
df["goal_diff"] = df["home_goals"] - df["away_goals"]
# Keep the result missing when goals are missing (NaN comparisons are False and would otherwise yield "D")
df["result"] = df["goal_diff"].apply(lambda x: pd.NA if pd.isna(x) else ("H" if x>0 else ("A" if x<0 else "D")))
df["home_points"] = df["result"].map({"H":3, "D":1, "A":0})
df["away_points"] = df["result"].map({"H":0, "D":1, "A":3})

# 4) One-sided match flag (e.g. |GD| >= 3)
df["one_sided"] = (df["goal_diff"].abs() >= 3).astype(int)

# 5) Irrelevant columns to drop (as requested)
drop_cols = ["league_id","league_name","fixture_id"]
df = df.drop(columns=[c for c in drop_cols if c in df.columns])

# 6) Convenient column order
cols = [
    "season","season_year","date","phase","round_num","stage",
    "home_team","away_team","home_goals","away_goals","goal_diff","result",
    "home_points","away_points","one_sided","venue","referee"
]
df = df[[c for c in cols if c in df.columns]]

df.to_csv(out_path, index=False, encoding="utf-8-sig")
print("Saved:", out_path, "| rows:", len(df))
df.head(10)


Saved: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\data\interim\matches_2022_23_enriched.csv | rows: 240


Unnamed: 0,season,date,phase,round_num,stage,home_team,away_team,home_goals,away_goals,goal_diff,result,home_points,away_points,one_sided,venue,referee
0,2022/23,2022-08-20,regular,1,Ligat Ha'al,Hapoel Haifa,Hapoel Tel Aviv,2,0,2,H,3,0,0,Sammy Ofer Stadium,O. Grinfeeld
1,2022/23,2022-08-20,regular,1,Ligat Ha'al,Hapoel Katamon,Hapoel Hadera,1,1,0,D,1,1,0,HaMoshava Stadium,A. Shiloach
2,2022/23,2022-08-20,regular,1,Ligat Ha'al,Maccabi Netanya,Beitar Jerusalem,4,1,3,H,3,0,1,Netanya Stadium,R. Reinshreiber
3,2022/23,2022-08-21,regular,1,Ligat Ha'al,Maccabi Tel Aviv,Maccabi Bnei Raina,5,0,5,H,3,0,1,Bloomfield Stadium,I. Frid
4,2022/23,2022-08-22,regular,1,Ligat Ha'al,Sektzia Nes Tziona,Ironi Kiryat Shmona,0,2,-2,A,0,3,0,HaMoshava Stadium,Y. Mizrahi
5,2022/23,2022-08-27,regular,2,Ligat Ha'al,Ironi Kiryat Shmona,Hapoel Katamon,1,1,0,D,1,1,0,Kiryat-Shmona Municipal Stadium,O. Na'al
6,2022/23,2022-08-27,regular,2,Ligat Ha'al,Hapoel Tel Aviv,Bnei Sakhnin,0,2,-2,A,0,3,0,Bloomfield Stadium,R. Reinshreiber
7,2022/23,2022-08-27,regular,2,Ligat Ha'al,Maccabi Haifa,Maccabi Netanya,4,1,3,H,3,0,1,Sammy Ofer Stadium,S. Levi
8,2022/23,2022-08-27,regular,2,Ligat Ha'al,Ashdod,Sektzia Nes Tziona,1,0,1,H,3,0,0,Yud-Alef Stadium,O. Asulin
9,2022/23,2022-08-28,regular,2,Ligat Ha'al,Maccabi Bnei Raina,Hapoel Haifa,1,1,0,D,1,1,0,Green Stadium,S. Ben Avraham


# Enrich & clean matches CSV
Read a saved matches CSV, compute goal differences, results, points, clean columns and save an enriched CSV to `interim/`.
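
The two core derivations of the enrichment step, parsing the `round` label and turning a score into a result with points, can be sketched without pandas. This is an illustrative standalone version; `result_and_points` is a hypothetical helper name, and the round parser here takes the trailing number of the label rather than using the notebook's exact regex.

```python
import math
import re


def parse_round(label):
    """Split e.g. 'Regular Season - 1' into ('regular', 1); missing input gives (None, None)."""
    if label is None or (isinstance(label, float) and math.isnan(label)):
        return None, None
    text = str(label).lower()
    phase = next((p for p in ("regular", "championship", "relegation") if p in text), None)
    m = re.search(r"(\d+)\s*$", text)  # trailing round number
    return phase, (int(m.group(1)) if m else None)


def result_and_points(home_goals, away_goals):
    """Return (result, home_points, away_points); all None when a score is missing."""
    if home_goals is None or away_goals is None:
        return None, None, None
    diff = home_goals - away_goals
    if diff > 0:
        return "H", 3, 0
    if diff < 0:
        return "A", 0, 3
    return "D", 1, 1


print(parse_round("Championship Round - 5"))
print(result_and_points(4, 1))
```

Returning `None` for a missing score, instead of defaulting to a draw, keeps unplayed or postponed fixtures from inflating the points totals.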

In [80]:
# sync_test.py
# Test file for sync between

def sync_check():
    print("✅ Git sync test successful! Hello from both sides 😎")

if __name__ == "__main__":
    sync_check()


✅ Git sync test successful! Hello from both sides 😎


# Sync test utility
Simple function to verify git/repo sync (prints confirmation).

In [81]:
import IPython
print("✅ Kernel is working!")


✅ Kernel is working!


# Kernel check
Quick cell to print a message and confirm the Python kernel is running.

In [82]:
# Download multiple seasons of Ligat Ha'al fixtures
import pandas as pd
from pathlib import Path
import time

def get_season_fixtures(season_year: int, league_id: int = None):
    """Get fixtures for a specific season, with progress tracking and error handling."""
    if league_id is None:
        league_id = globals().get('LEAGUE_ID')  # Use the one found earlier
        if league_id is None:
            raise ValueError("No league_id provided or found in globals()")
    
    fx = api_get('/fixtures', {
        'league': league_id,
        'season': season_year,
        'timezone': 'UTC'
    })
    
    rows = []
    for item in fx.get('response', []):
        fixture = item.get('fixture', {})
        league  = item.get('league', {})
        teams   = item.get('teams', {})
        goals   = item.get('goals', {})
        dt = fixture.get('date')
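        # `dateparser` must already be in scope from an earlier cell; it is not imported in this one
        try:
            dt = dateparser.parse(dt).strftime('%Y-%m-%d') if dt else None
        except Exception:  # unparseable date: leave it empty rather than abort the download
            dt = None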
        rows.append({
            'season': f'{season_year}/{str(season_year+1)[-2:]}',
            'date': dt,
            'round': league.get('round'),
            'stage': league.get('name'),
            'home_team': teams.get('home', {}).get('name'),
            'away_team': teams.get('away', {}).get('name'),
            'home_goals': goals.get('home'),
            'away_goals': goals.get('away'),
            'venue': fixture.get('venue', {}).get('name'),
            'referee': fixture.get('referee'),
            'fixture_id': fixture.get('id'),
            'league_id': league.get('id'),
            'league_name': league.get('name'),
        })
    return pd.DataFrame(rows)

# Download seasons from 2018/19 to 2023/24
seasons = list(range(2018, 2024))
results = {}
errors = []

print(f"Downloading {len(seasons)} seasons of Ligat Ha'al fixtures...")
print("Progress: ", end="", flush=True)

for i, season in enumerate(seasons, 1):
    print(f"[{season}/{season+1}] ", end="", flush=True)
    try:
        df = get_season_fixtures(season)
        if len(df) > 0:
            results[season] = df
            # Save each season's data
            csv_path = DATA_DIR / f"matches_{season}_{str(season+1)[-2:]}_ligat_haal.csv"
            df.to_csv(csv_path, index=False, encoding='utf-8-sig')
            print(f"✓ ({len(df)} matches)", end=" ", flush=True)
        else:
            print("⚠ (no matches)", end=" ", flush=True)
    except Exception as e:
        print(f"❌ ({str(e)})", end=" ", flush=True)
        errors.append((season, str(e)))
    time.sleep(1)  # Be nice to the API
    
    # Add a newline every 2 seasons for readability
    if i % 2 == 0:
        print()

print("\n\nSummary:")
print(f"- Successfully downloaded {len(results)} seasons")
print(f"- Total matches: {sum(len(df) for df in results.values())}")
if errors:
    print("\nErrors encountered:")
    for season, error in errors:
        print(f"- {season}/{season+1}: {error}")

# Show the most recent season as a sample
if results:
    latest = max(results.keys())
    print(f"\nMost recent season ({latest}/{latest+1}) preview:")
    display(results[latest].head())

Downloading 6 seasons of Ligat Ha'al fixtures...
Progress: [2018/2019] ⚠ (no matches) [2019/2020] ⚠ (no matches) 
[2020/2021] ⚠ (no matches) [2021/2022] ✓ (240 matches) 
[2022/2023] ✓ (240 matches) [2023/2024] ✓ (240 matches) 


Summary:
- Successfully downloaded 3 seasons
- Total matches: 720

Most recent season (2023/2024) preview:

Unnamed: 0,season,date,round,stage,home_team,away_team,home_goals,away_goals,venue,referee,fixture_id,league_id,league_name
0,2023/24,2023-08-26,Regular Season - 1,Ligat Ha'al,Bnei Sakhnin,Hapoel Tel Aviv,1,1,Doha Stadium,I. Layba,1036999,383,Ligat Ha'al
1,2023/24,2023-08-26,Regular Season - 1,Ligat Ha'al,Maccabi Petah Tikva,Hapoel Katamon,1,1,HaMoshava Stadium,D. Tzino,1036997,383,Ligat Ha'al
2,2023/24,2023-08-26,Regular Season - 1,Ligat Ha'al,Hapoel Beer Sheva,Hapoel Hadera,3,0,Yaakov Turner Toto Stadium,D. Fuxman,1037001,383,Ligat Ha'al
3,2023/24,2023-08-26,Regular Season - 1,Ligat Ha'al,Maccabi Netanya,Maccabi Bnei Raina,1,1,Netanya Stadium,N. Steif,1037000,383,Ligat Ha'al
4,2023/24,2023-08-27,Regular Season - 1,Ligat Ha'al,Maccabi Tel Aviv,Ashdod,4,1,Bloomfield Stadium,G. Laibuvitz,1037002,383,Ligat Ha'al


# Multi-season helper
Defines `get_season_fixtures()` and a basic multi-season loop used to download multiple seasons.
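
What `get_season_fixtures()` does to each API item can be shown on a single hand-written record. This is a sketch under stated assumptions: `flatten_fixture` is a hypothetical name, the date is assumed to be an ISO 8601 string parsed with the standard library rather than the notebook's `dateparser`, and the kickoff time in the sample is invented (the other sample values are taken from the preview output above).

```python
from datetime import datetime


def flatten_fixture(item, season_year):
    """Flatten one nested fixture item into a single flat row (dict)."""
    fixture = item.get("fixture") or {}
    league = item.get("league") or {}
    teams = item.get("teams") or {}
    goals = item.get("goals") or {}
    raw = fixture.get("date")
    try:
        date = datetime.fromisoformat(raw).strftime("%Y-%m-%d") if raw else None
    except (TypeError, ValueError):
        date = None  # keep the row even if the date cannot be parsed
    return {
        "season": f"{season_year}/{str(season_year + 1)[-2:]}",
        "date": date,
        "round": league.get("round"),
        "home_team": (teams.get("home") or {}).get("name"),
        "away_team": (teams.get("away") or {}).get("name"),
        "home_goals": goals.get("home"),
        "away_goals": goals.get("away"),
        "venue": (fixture.get("venue") or {}).get("name"),
        "referee": fixture.get("referee"),
        "fixture_id": fixture.get("id"),
        "league_id": league.get("id"),
    }


sample_item = {
    "fixture": {"id": 1036999, "date": "2023-08-26T17:00:00+00:00",  # time of day is invented
                "referee": "I. Layba", "venue": {"name": "Doha Stadium"}},
    "league": {"id": 383, "name": "Ligat Ha'al", "round": "Regular Season - 1"},
    "teams": {"home": {"name": "Bnei Sakhnin"}, "away": {"name": "Hapoel Tel Aviv"}},
    "goals": {"home": 1, "away": 1},
}
print(flatten_fixture(sample_item, 2023))
```

Using `or {}` instead of a default argument also covers keys that are present but set to `None`, which a plain `.get(key, {})` would not.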

In [83]:
# Quick check that our API key and headers are properly set
def check_api_setup():
    """Verify that APISPORTS_KEY and HEADERS are properly set up."""
    # Check APISPORTS_KEY
    if not APISPORTS_KEY:
        raise RuntimeError("APISPORTS_KEY is not set")
    if len(APISPORTS_KEY) < 8:  # basic sanity check
        raise RuntimeError("APISPORTS_KEY looks too short - check your key")
    
    # Check HEADERS exists and has our key
    if 'HEADERS' not in globals():
        raise RuntimeError("HEADERS not found - run the environment setup cell first")
    if HEADERS.get('x-apisports-key') != APISPORTS_KEY:
        raise RuntimeError("HEADERS['x-apisports-key'] doesn't match APISPORTS_KEY")
    
    # Quick API test
    r = requests.get(f"{BASE_URL}/status", headers=HEADERS)
    if r.status_code != 200:
        raise RuntimeError(f"API test failed with status {r.status_code}")
    
    print("✅ API setup verified:")
    print(f"  • APISPORTS_KEY: {APISPORTS_KEY[:4]}...{APISPORTS_KEY[-4:]}")
    print(f"  • HEADERS: properly set with API key")
    print(f"  • API test: successful")
    return True

# Run the check
check_api_setup()

✅ API setup verified:
  • APISPORTS_KEY: 26eb...faf2
  • HEADERS: properly set with API key
  • API test: successful


True

# API setup check
Verify `APISPORTS_KEY`, `HEADERS`, and a quick API `status` request to ensure connectivity before downloads.
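
The offline part of this check (key sanity, header consistency, masked printing) can be sketched as pure functions that need no network access. Both `mask_key` and `validate_setup` are hypothetical names, not functions from the notebook; the header name `x-apisports-key` and the 8-character minimum are taken from the cell above.

```python
def mask_key(key, visible=4):
    """Show only the first and last few characters of a secret."""
    if not key:
        return "(not set)"
    if len(key) <= 2 * visible:
        return "*" * len(key)  # too short to reveal any part safely
    return f"{key[:visible]}...{key[-visible:]}"


def validate_setup(key, headers):
    """Return a list of human-readable problems; an empty list means the setup looks fine."""
    problems = []
    if not key:
        problems.append("APISPORTS_KEY is not set")
    elif len(key) < 8:
        problems.append("APISPORTS_KEY looks too short")
    if not headers:
        problems.append("HEADERS is missing")
    elif headers.get("x-apisports-key") != key:
        problems.append("HEADERS['x-apisports-key'] does not match APISPORTS_KEY")
    return problems


demo_key = "0123456789abcdef"  # dummy value, not a real key
print(mask_key(demo_key))
print(validate_setup(demo_key, {"x-apisports-key": demo_key}))
```

Collecting all problems in a list, rather than raising on the first one, lets a single run report every misconfiguration at once.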

In [84]:
# Download historical Ligat Ha'al seasons (2010-2024)
import pandas as pd
from pathlib import Path
import time

# Range of seasons to download (2010/11 to 2023/24)
seasons = list(range(2010, 2024))
results = {}
errors = []

print(f"Downloading {len(seasons)} seasons of Ligat Ha'al (2010-2024)...")
print("Progress: ", end="", flush=True)

for i, season in enumerate(seasons, 1):
    print(f"[{season}/{season+1}] ", end="", flush=True)
    try:
        df = get_season_fixtures(season)
        if len(df) > 0:
            results[season] = df
            # Save each season's data
            csv_path = DATA_DIR / f"matches_{season}_{str(season+1)[-2:]}_ligat_haal.csv"
            df.to_csv(csv_path, index=False, encoding='utf-8-sig')
            print(f"✓ ({len(df)} matches)", end=" ", flush=True)
        else:
            print("⚠ (no matches)", end=" ", flush=True)
    except Exception as e:
        print(f"❌ ({str(e)})", end=" ", flush=True)
        errors.append((season, str(e)))
    time.sleep(1.5)  # Be extra nice to the API for historical data
    
    # Add a newline every 2 seasons for readability
    if i % 2 == 0:
        print()

print("\n\nCombining all seasons into one file...")
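# Combine all successful seasons into one DataFrame
# (pd.concat raises ValueError on an empty collection, so fail with a clearer message)
if not results:
    raise RuntimeError("No seasons were downloaded, so there is nothing to combine.")
all_matches = pd.concat(results.values(), axis=0, ignore_index=True)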

# Add season_start_year column for easier filtering
all_matches['season_start_year'] = all_matches['season'].str.slice(0,4).astype(int)

# Sort by date and reset index
all_matches = all_matches.sort_values(['date', 'fixture_id']).reset_index(drop=True)

# Save combined file
combined_path = DATA_DIR / "matches_all_seasons_ligat_haal.csv"
all_matches.to_csv(combined_path, index=False, encoding='utf-8-sig')

print("\nSummary:")
print(f"- Successfully downloaded {len(results)} seasons")
print(f"- Total matches: {len(all_matches)}")
print(f"- Years covered: {min(results.keys())}-{max(results.keys())}")
print(f"\nMatches per season:")
season_counts = all_matches.groupby('season').size().sort_index()
for season, count in season_counts.items():
    print(f"  • {season}: {count:3d} matches")

if errors:
    print("\nErrors encountered:")
    for season, error in errors:
        print(f"- {season}/{season+1}: {error}")

print(f"\nAll matches saved to: {combined_path}")
print("\nPreview of combined data:")
display(all_matches.head())

Downloading 14 seasons of Ligat Ha'al (2010-2024)...
Progress: [2010/2011] ⚠ (no matches) [2011/2012] ⚠ (no matches) 
[2012/2013] ⚠ (no matches) [2013/2014] ⚠ (no matches) 
[2014/2015] ⚠ (no matches) [2015/2016] ⚠ (no matches) 
[2016/2017] ⚠ (no matches) [2017/2018] ⚠ (no matches) 
[2018/2019] ⚠ (no matches) [2019/2020] ⚠ (no matches) 
[2020/2021] ⚠ (no matches) [2021/2022] ✓ (240 matches) 
[2022/2023] ✓ (240 matches) [2023/2024] ✓ (240 matches) 


Combining all seasons into one file...

Summary:
- Successfully downloaded 3 seasons
- Total matches: 72

Unnamed: 0,season,date,round,stage,home_team,away_team,home_goals,away_goals,venue,referee,fixture_id,league_id,league_name,season_start_year
0,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Maccabi Petah Tikva,Ironi Kiryat Shmona,1,1,HaMoshava Stadium,D. Fuxman,708093,383,Ligat Ha'al,2021
1,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Hapoel Tel Aviv,Ashdod,2,1,Bloomfield Stadium,R. Reinshreiber,708094,383,Ligat Ha'al,2021
2,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Hapoel Haifa,Maccabi Netanya,0,0,Sammy Ofer Stadium,O. Grinfeeld,708095,383,Ligat Ha'al,2021
3,2021/22,2021-08-29,Regular Season - 1,Ligat Ha'al,Beitar Jerusalem,Hapoel Beer Sheva,0,2,Teddi Malcha Stadium,L. Liani,708097,383,Ligat Ha'al,2021
4,2021/22,2021-08-29,Regular Season - 1,Ligat Ha'al,Hapoel Hadera,Maccabi Haifa,0,0,Netanya Stadium,E. Shmuelevich,708098,383,Ligat Ha'al,2021


# Historical download (2010-2024)
Attempts to fetch older seasons; may be empty if API has no historical coverage.

In [85]:
# Check available seasons for Ligat Ha'al in API-Sports
print("Checking available seasons for Ligat Ha'al...")

# Try the dedicated seasons endpoint first
seasons_info = api_get("/leagues/seasons", {"league": LEAGUE_ID})
available_seasons = sorted(seasons_info.get("response", []) or [])

# If that returned no data, fall back to the /leagues endpoint which often includes a 'seasons' list
if not available_seasons:
    print("Warning: /leagues/seasons returned no seasons; falling back to /leagues response.")
    league_info = api_get("/leagues", {"id": LEAGUE_ID})
    league_details = league_info.get("response", [{}])[0]
    raw_seasons = league_details.get("seasons", []) or []
    # Normalize seasons to a list of ints if possible
    normalized = []
    for s in raw_seasons:
        if isinstance(s, int):
            normalized.append(s)
        elif isinstance(s, dict):
            # try common keys that might contain a year
            for k in ("season", "year"):
                val = s.get(k)
                if isinstance(val, int):
                    normalized.append(val)
                    break
    available_seasons = sorted(set(normalized))

# Print available seasons (if any)
print(f"\nAvailable seasons for {LEAGUE_NAME}:")
if available_seasons:
    for season in available_seasons:
        print(f"• {season}/{str(season+1)[-2:]}")
else:
    print("• (no seasons found)")

# Ensure league_info and league_details are available for the coverage section
if 'league_info' not in globals():
    league_info = api_get("/leagues", {"id": LEAGUE_ID})
league_details = league_info.get("response", [{}])[0]
league_status = league_details.get("league", {})

print(f"\nLeague Coverage Details:")
print(f"• League: {league_status.get('name')} (ID: {league_status.get('id')})")
print(f"• Type: {league_status.get('type')}")
print(f"• Country: {league_details.get('country', {}).get('name')}")
print(f"• Available Seasons: {len(available_seasons)}")
if available_seasons:
    print(f"• Date Range: {min(available_seasons)} to {max(available_seasons)}")
else:
    print("• Date Range: N/A")

Checking available seasons for Ligat Ha'al...

Available seasons for Ligat Ha'al:
• 2016/17
• 2017/18
• 2018/19
• 2019/20
• 2020/21
• 2021/22
• 2022/23
• 2023/24
• 2024/25
• 2025/26

League Coverage Details:
• League: Ligat Ha'al (ID: 383)
• Type: League
• Country: Israel
• Available Seasons: 10
• Date Range: 2016 to 2025


# Check available seasons from API
Query the API for which seasons are actually available for the selected league; use this before bulk downloads.

In [86]:
# Download all available seasons of Ligat Ha'al
print("Downloading available Ligat Ha'al seasons...")
print("Progress: ", end="", flush=True)

results = {}
errors = []

for i, season in enumerate(available_seasons, 1):
    print(f"[{season}/{season+1}] ", end="", flush=True)
    try:
        df = get_season_fixtures(season)
        if len(df) > 0:
            results[season] = df
            # Save each season's data
            csv_path = DATA_DIR / f"matches_{season}_{str(season+1)[-2:]}_ligat_haal.csv"
            df.to_csv(csv_path, index=False, encoding='utf-8-sig')
            print(f"✓ ({len(df)} matches)", end=" ", flush=True)
        else:
            print("⚠ (no matches)", end=" ", flush=True)
    except Exception as e:
        print(f"❌ ({str(e)})", end=" ", flush=True)
        errors.append((season, str(e)))
    time.sleep(1)  # Be nice to the API
    
    # Add a newline every 2 seasons for readability
    if i % 2 == 0:
        print()

print("\n\nCombining available seasons into one file...")
if results:
    # Combine all successful seasons into one DataFrame
    all_matches = pd.concat(results.values(), axis=0, ignore_index=True)
    
    # Add season_start_year column for easier filtering
    all_matches['season_start_year'] = all_matches['season'].str.slice(0,4).astype(int)
    
    # Sort by date and reset index
    all_matches = all_matches.sort_values(['date', 'fixture_id']).reset_index(drop=True)
    
    # Save combined file
    combined_path = DATA_DIR / "matches_all_seasons_ligat_haal.csv"
    all_matches.to_csv(combined_path, index=False, encoding='utf-8-sig')
    
    print("\nSummary:")
    print(f"- Successfully downloaded {len(results)} seasons")
    print(f"- Total matches: {len(all_matches)}")
    print(f"- Years covered: {min(results.keys())}-{max(results.keys())}")
    print(f"\nMatches per season:")
    season_counts = all_matches.groupby('season').size().sort_index()
    for season, count in season_counts.items():
        print(f"  • {season}: {count:3d} matches")
    
    if errors:
        print("\nErrors encountered:")
        for season, error in errors:
            print(f"- {season}/{season+1}: {error}")
    
    print(f"\nAll matches saved to: {combined_path}")
    print("\nPreview of combined data:")
    display(all_matches.head())

Downloading available Ligat Ha'al seasons...
Progress: [2016/2017] ⚠ (no matches) [2017/2018] ⚠ (no matches) 
[2018/2019] ⚠ (no matches) [2019/2020] ⚠ (no matches) 
[2020/2021] ⚠ (no matches) [2021/2022] ✓ (240 matches) 
[2022/2023] ✓ (240 matches) [2023/2024] ✓ (240 matches) 
[2024/2025] ⚠ (no matches) [2025/2026] ⚠ (no matches) 


Combining available seasons into one file...

Summary:
- Successfully downloaded 3 seasons
- Total matches: 720
- Years covered: 2021-2023

Matches per season:
  • 2021/22: 240 matches
  • 2022/23: 240 matches
  • 2023/24: 240 matches

All matches saved to: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\data\raw\matches_all_se

Unnamed: 0,season,date,round,stage,home_team,away_team,home_goals,away_goals,venue,referee,fixture_id,league_id,league_name,season_start_year
0,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Maccabi Petah Tikva,Ironi Kiryat Shmona,1,1,HaMoshava Stadium,D. Fuxman,708093,383,Ligat Ha'al,2021
1,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Hapoel Tel Aviv,Ashdod,2,1,Bloomfield Stadium,R. Reinshreiber,708094,383,Ligat Ha'al,2021
2,2021/22,2021-08-28,Regular Season - 1,Ligat Ha'al,Hapoel Haifa,Maccabi Netanya,0,0,Sammy Ofer Stadium,O. Grinfeeld,708095,383,Ligat Ha'al,2021
3,2021/22,2021-08-29,Regular Season - 1,Ligat Ha'al,Beitar Jerusalem,Hapoel Beer Sheva,0,2,Teddi Malcha Stadium,L. Liani,708097,383,Ligat Ha'al,2021
4,2021/22,2021-08-29,Regular Season - 1,Ligat Ha'al,Hapoel Hadera,Maccabi Haifa,0,0,Netanya Stadium,E. Shmuelevich,708098,383,Ligat Ha'al,2021


# Download available seasons
Download only the seasons returned by the API, save each season and combine into a single CSV.

# Scrape Ligat Ha'al 2016/17 season from Wikipedia
This cell demonstrates how to fetch the 2016/17 Israeli Premier League (Ligat Ha'al) table from Wikipedia using pandas, and save it as a CSV file in your `data/raw/` directory.

- Source: [Wikipedia – 2016–17 Israeli Premier League](https://en.wikipedia.org/wiki/2016%E2%80%9317_Israeli_Premier_League)
- The table format may change between seasons; this code works for standard Wikipedia league tables.
- You can adapt this for other seasons by changing the URL.
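
Adapting the URL for another season can be sketched as below. `wikipedia_season_url` is a hypothetical helper, not part of the notebook. It assumes the page title follows the same `YYYY–YY_Israeli_Premier_League` pattern as the 2016/17 URL above, which may not hold for every season (for example, a season that spans a century change).

```python
from urllib.parse import quote

def wikipedia_season_url(season_year: int) -> str:
    """Build the English Wikipedia URL for a season, given its starting year.

    Assumption: the title pattern matches the 2016/17 page used in this notebook.
    """
    end_suffix = f"{(season_year + 1) % 100:02d}"  # 2016 -> "17"
    # The two years are joined by an en dash (U+2013), which quote() percent-encodes
    title = f"{season_year}\u2013{end_suffix}_Israeli_Premier_League"
    return "https://en.wikipedia.org/wiki/" + quote(title)

print(wikipedia_season_url(2016))
```

A request to a generated URL can still fail with HTTP 404 if the page is titled differently, so it is worth keeping the `raise_for_status()` check when looping over seasons.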

In [87]:
# Scrape 2016/17 Ligat Ha'al league table from Wikipedia and save as CSV
# Use requests with a browser User-Agent to avoid HTTP 403 from the site
import re
from io import StringIO

url = "https://en.wikipedia.org/wiki/2016%E2%80%9317_Israeli_Premier_League"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117 Safari/537.36"}, timeout=30)
resp.raise_for_status()

# Wrap the HTML in StringIO; pandas 2.1+ deprecates passing a literal HTML string to read_html
tables = pd.read_html(StringIO(resp.text))

# Find the main league table (usually the first or second table)
def find_league_table(tables):
    for df in tables:
        # Look for columns typical of league tables
        cols = [c.lower() for c in df.columns.astype(str)]
        if any(re.search(r"team|club", c) for c in cols) and any(re.search(r"pts|points", c) for c in cols):
            return df
    return tables[0]  # fallback

league_df = find_league_table(tables)
print("Columns:", league_df.columns.tolist())
print("Rows:", len(league_df))

# Save to CSV in data/raw/
csv_path = Path(DATA_DIR) / "ligat_haal_2016_17_wikipedia.csv"
league_df.to_csv(csv_path, index=False, encoding="utf-8-sig")
print(f"Saved Wikipedia league table to: {csv_path}")
league_df.head()


Columns: ['Pos', 'Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA', 'GD', 'Pts', 'Qualification or relegation']
Rows: 14
Saved Wikipedia league table to: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\data\raw\ligat_haal_2016_17_wikipedia.csv


Unnamed: 0,Pos,Team,Pld,W,D,L,GF,GA,GD,Pts,Qualification or relegation
0,1,Hapoel Be'er Sheva,26,18,5,3,54,13,41,59,Qualification for the Championship round
1,2,Maccabi Tel Aviv,26,17,5,4,45,19,26,56,Qualification for the Championship round
2,3,Maccabi Petah Tikva,26,13,9,4,36,23,13,48,Qualification for the Championship round
3,4,Beitar Jerusalem,26,10,10,6,34,27,7,40,Qualification for the Championship round
4,5,Bnei Sakhnin,26,10,9,7,26,26,0,39,Qualification for the Championship round


# Match-level data from Wikipedia for 2016/17 season

This cell processes the Wikipedia results matrix into a match-by-match dataset with:
- Basic match data: home team, away team, goals scored
- Derived columns: goal difference, match result (H/A/D), points earned

We keep only the essential columns that we can reliably calculate from the match data alone.
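
A quick plausibility check on the number of parsed matches can be sketched as follows. It assumes the matrix covers a full home-and-away round robin, so each team hosts every other team exactly once. `expected_matrix_matches` is a hypothetical helper, not part of the notebook. With the 14 teams in the league table above, the expected count is 14 × 13 = 182.

```python
def expected_matrix_matches(n_teams: int) -> int:
    """Number of result cells in a complete home/away matrix (diagonal excluded)."""
    if n_teams < 2:
        raise ValueError("need at least two teams")
    return n_teams * (n_teams - 1)

n_teams = 14  # rows in the 2016/17 league table scraped above
print(expected_matrix_matches(n_teams))  # 182
```

If the parsed count differs from this figure, the matrix may contain unplayed fixtures, or the score pattern may be missing a dash variant. Playoff rounds are not part of this count, which is one plausible reason the API returns more fixtures per season than a results matrix does.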

In [88]:
# Scrape 2016/17 Ligat Ha'al match-by-match results from Wikipedia and save as CSV
import pandas as pd
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import re

# Fetch and parse the Wikipedia page
url = "https://en.wikipedia.org/wiki/2016%E2%80%9317_Israeli_Premier_League"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")

# Find the results matrix table by checking the first header cell
results_table = None
for table in soup.find_all("table", class_="wikitable"):
    first_row = table.find("tr")
    if first_row:
        first_cell = first_row.find("th")
        if first_cell and ("Home \\ Away" in first_cell.text or "Home / Away" in first_cell.text):
            results_table = table
            break

if not results_table:
    raise ValueError("Could not find results matrix table on Wikipedia page.")

# Parse teams from the first column and first row
rows = results_table.find_all("tr")
team_names = [td.get_text(strip=True) for td in rows[0].find_all("th")][1:]

# Build match list
matches = []
for i, row in enumerate(rows[1:]):
    cells = row.find_all(["th", "td"])
    home_team = cells[0].get_text(strip=True)
    for j, cell in enumerate(cells[1:]):
        away_team = team_names[j]
        score = cell.get_text(strip=True)
        # Only add if score looks like a result (e.g., '2–1')
        if re.match(r"^\d+\s*[–-]\s*\d+$", score):
            home_goals, away_goals = re.split(r"[–-]", score)
            matches.append({
                "season": "2016/17",
                "home_team": home_team,
                "away_team": away_team,
                "home_goals": int(home_goals.strip()),
                "away_goals": int(away_goals.strip()),
                "score": score
            })

# Convert to DataFrame and save
df = pd.DataFrame(matches)

# Add simple derived columns
df['goal_diff'] = df['home_goals'] - df['away_goals']
df['result'] = df['goal_diff'].apply(lambda x: "H" if x>0 else ("A" if x<0 else "D"))
df['home_points'] = df['result'].map({"H":3, "D":1, "A":0}).fillna(0).astype(int)
df['away_points'] = df['result'].map({"A":3, "D":1, "H":0}).fillna(0).astype(int)

# Select and order columns
keep_cols = ['season', 'home_team', 'away_team', 'home_goals', 'away_goals', 
             'goal_diff', 'result', 'home_points', 'away_points']
df = df[keep_cols]

# Save to CSV
csv_path = Path(DATA_DIR) / "matches_2016_17_ligat_haal_wikipedia.csv"
df.to_csv(csv_path, index=False, encoding="utf-8-sig")
print(f"Saved match-by-match results to: {csv_path}")
print(f"Total matches: {len(df)}")
df.head()

Saved match-by-match results to: c:\Users\nitib\dev-lab\ligat_haal_project\ligat_haal_project\data\raw\matches_2016_17_ligat_haal_wikipedia.csv
Total matches: 182


Unnamed: 0,season,home_team,away_team,home_goals,away_goals,goal_diff,result,home_points,away_points
0,2016/17,F.C. Ashdod,BEI,0,0,0,D,1,1
1,2016/17,F.C. Ashdod,BnS,1,1,0,D,1,1
2,2016/17,F.C. Ashdod,BnY,2,2,0,D,1,1
3,2016/17,F.C. Ashdod,HAS,1,0,1,H,3,0
4,2016/17,F.C. Ashdod,HBS,0,1,-1,A,0,3
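
In the preview above, `home_team` holds full names while `away_team` holds the short codes from the matrix header (`BEI`, `BnS`, and so on), which makes joins on team name awkward. Below is a minimal sketch of one way to expand the codes, assuming the matrix lists teams in the same order down its rows and across its columns (worth verifying per page). `build_abbrev_map` and the sample values, including the `ASH` code and the code-to-name pairings, are illustrative assumptions rather than values taken from the scraped page.

```python
def build_abbrev_map(header_codes, row_team_names):
    """Pair each column code with the full name of the row at the same position."""
    if len(header_codes) != len(row_team_names):
        # A length mismatch means the same-order assumption cannot hold
        raise ValueError("header and row counts differ")
    return dict(zip(header_codes, row_team_names))

# Toy inputs shaped like the scraper's intermediate values (assumed, for illustration)
header_codes = ["ASH", "BEI", "BnS"]
row_team_names = ["F.C. Ashdod", "Beitar Jerusalem", "Bnei Sakhnin"]

matches = [
    {"home_team": "F.C. Ashdod", "away_team": "BEI"},
    {"home_team": "Bnei Sakhnin", "away_team": "ASH"},
]

code_to_name = build_abbrev_map(header_codes, row_team_names)
for match in matches:
    # Fall back to the original value when a code is unknown
    match["away_team"] = code_to_name.get(match["away_team"], match["away_team"])

print(matches)
```

In the scraper, `header_codes` would correspond to `team_names`, and `row_team_names` to the first cell of each data row.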


# Multi-season Wikipedia Scraper for Ligat Ha'al

This cell scrapes match data from Wikipedia for the 20 most recent seasons of Ligat Ha'al; the range is computed from the current date rather than fixed.
- Fetches each season's Wikipedia page
- Extracts the results matrix (home vs away grid)
- Converts to match-by-match format
- Adds derived columns (goal_diff, result, points)
- Saves individual season CSVs and a combined file
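
The code cell for this scraper derives its list of seasons from the current date. That step can be sketched as a pure function, which makes it easy to test. `recent_season_years` is a hypothetical helper; it assumes, as the cell's own comment does, that a new season starts in August.

```python
from datetime import date

def recent_season_years(today: date, n_seasons: int = 20) -> list:
    """Return the starting years of the n most recent seasons, oldest first."""
    # Before August, the latest season is the one that started the previous year
    latest = today.year if today.month >= 8 else today.year - 1
    return list(range(latest - n_seasons + 1, latest + 1))

years = recent_season_years(date(2024, 3, 15))
print(years[0], years[-1])  # 2004 2023
```

Passing the date in explicitly, instead of calling `datetime.now()` inside the function, keeps reruns reproducible.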

In [None]:
# Scrape multiple seasons of Ligat Ha'al from Wikipedia
import pandas as pd
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import re
import time
from datetime import datetime

def scrape_season(season_year):
    """
    Scrape a single season's matches from Wikipedia.
    season_year: starting year (e.g., 2016 for 2016/17 season)
    """
    season_str = f"{season_year}/{str(season_year+1)[-2:]}"
    url = f"https://en.wikipedia.org/wiki/{season_year}%E2%80%93{str(season_year+1)[-2:]}_Israeli_Premier_League"
    
    print(f"Fetching {season_str}... ", end="", flush=True)
    try:
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        
        # Find results matrix
        results_table = None
        for table in soup.find_all("table", class_="wikitable"):
            first_row = table.find("tr")
            if first_row:
                first_cell = first_row.find("th")
                if first_cell and ("Home \\ Away" in first_cell.text or "Home / Away" in first_cell.text):
                    results_table = table
                    break
        
        if not results_table:
            print("❌ (no results matrix)")
            return None
            
        # Parse teams and build matches
        rows = results_table.find_all("tr")
        team_names = [td.get_text(strip=True) for td in rows[0].find_all("th")][1:]
        
        matches = []
        for i, row in enumerate(rows[1:]):
            cells = row.find_all(["th", "td"])
            home_team = cells[0].get_text(strip=True)
            for j, cell in enumerate(cells[1:]):
                away_team = team_names[j]
                score = cell.get_text(strip=True)
                if re.match(r"^\d+\s*[–-]\s*\d+$", score):
                    home_goals, away_goals = re.split(r"[–-]", score)
                    matches.append({
                        "season": season_str,
                        "season_year": season_year,
                        "home_team": home_team,
                        "away_team": away_team,
                        "home_goals": int(home_goals.strip()),
                        "away_goals": int(away_goals.strip())
                    })
        
        if not matches:
            print("❌ (no matches found)")
            return None
            
        # Convert to DataFrame and add derived columns
        df = pd.DataFrame(matches)
        df['goal_diff'] = df['home_goals'] - df['away_goals']
        df['result'] = df['goal_diff'].apply(lambda x: "H" if x>0 else ("A" if x<0 else "D"))
        df['home_points'] = df['result'].map({"H":3, "D":1, "A":0}).fillna(0).astype(int)
        df['away_points'] = df['result'].map({"A":3, "D":1, "H":0}).fillna(0).astype(int)
        
        # Select and order columns
        keep_cols = ['season', 'season_year', 'home_team', 'away_team', 'home_goals', 
                     'away_goals', 'goal_diff', 'result', 'home_points', 'away_points']
        df = df[keep_cols]
        
        print(f"✓ ({len(df)} matches)")
        return df
        
    except Exception as e:
        print(f"❌ ({str(e)[:50]}...)")
        return None

# List of seasons to scrape (last 20 seasons)
current_year = datetime.now().year
if datetime.now().month < 8:  # If before August, last season started in previous year
    current_year -= 1
seasons = list(range(current_year - 19, current_year + 1))

print(f"Scraping {len(seasons)} seasons from Wikipedia ({seasons[0]}/{str(seasons[0]+1)[-2:]} to {seasons[-1]}/{str(seasons[-1]+1)[-2:]})...")

# Scrape each season
all_matches = []
for season_year in seasons:
    df = scrape_season(season_year)
    if df is not None:
        # Save individual season
        season_path = DATA_DIR / f"matches_{season_year}_{str(season_year+1)[-2:]}_ligat_haal_wikipedia.csv"
        df.to_csv(season_path, index=False, encoding='utf-8-sig')
        all_matches.append(df)
    time.sleep(1)  # Be nice to Wikipedia

if all_matches:
    # Combine all seasons
    combined_df = pd.concat(all_matches, ignore_index=True)
    combined_path = DATA_DIR / "matches_all_seasons_ligat_haal_wikipedia.csv"
    combined_df.to_csv(combined_path, index=False, encoding='utf-8-sig')
    
    print("\nSummary:")
    print(f"- Successfully scraped {len(all_matches)} seasons")
    print(f"- Total matches: {len(combined_df)}")
    print(f"\nMatches per season:")
    season_counts = combined_df.groupby('season').size().sort_index()
    for season, count in season_counts.items():
        print(f"  • {season}: {count:3d} matches")
    print(f"\nAll matches saved to: {combined_path}")
    display(combined_df.head())