# 04 - Injury Features & Dataset Splitting

This notebook carries out the tasks that Ekin is responsible for:

1. **Injury Data Processing**: Clean and merge the raw injury data
2. **Injury Feature Generation**: Generate per-game injury features
3. **Final Dataset**: Build the final dataset from core features + injury features
4. **Train/Val/Test Split**: Split the dataset (section 6 uses a random shuffle split; the time-based split is kept as a commented-out alternative)

## Input Files
- `data_interim/games_with_core_features.csv` (from İbrahim)
- `data_raw/injury_reports_raw/` (CSVs parsed from the raw injury PDFs)
- `data_raw/nbastuffer_2025_2026_player_stats_raw.csv` (player minutes data)

## Output Files
- `data_interim/injury_reports_clean.csv` (cleaned injury data)
- `data_processed/games_with_all_features.csv` (core + injury features)
- `data_processed/train_set.csv`
- `data_processed/val_set.csv`
- `data_processed/test_set.csv`


## 1. Setup & Imports


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Locate the project root directory (the notebook is assumed to live one level below it).
# NOTE: because of the os.chdir below, re-running this cell moves project_root up one more level.
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Set the working directory to the project root
os.chdir(project_root)

print(f"Project root: {project_root}")
print(f"Working directory: {os.getcwd()}")


Project root: c:\Users\Esref\OneDrive\Masaüstü\ann
Working directory: c:\Users\Esref\OneDrive\Masaüstü\ann
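
Taking the parent of the current directory depends on where the kernel was started. As a hedged alternative, the sketch below walks upward until it finds a directory that contains a marker folder. The helper name `find_project_root` and the choice of `src` as the marker are assumptions for illustration, not part of the project code.

```python
from pathlib import Path
import tempfile

def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")

# Demonstration on a throwaway tree: <tmp>/src and <tmp>/notebooks
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    (root / "notebooks").mkdir()
    from_notebooks = find_project_root(root / "notebooks")
    from_root = find_project_root(root)  # same answer when already at the root
    print(from_notebooks == root, from_root == root)
```

Because the search starts at the given directory itself, calling it again after `os.chdir` to the root returns the same path rather than climbing one level higher.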


In [2]:
# Import the project modules
from src.features.injury_features import (
    load_and_clean_injury_reports,
    load_player_minutes,
    add_injury_features,
    build_injury_features
)

from src.data.split_dataset import (
    analyze_date_range,
    split_dataset_by_time,
    split_dataset_random,
    save_splits,
    print_split_stats,
    print_split_stats_random,
    validate_splits,
    create_time_based_split,
    create_random_split,
    suggest_split_dates
)

print("Modules loaded!")


Modules loaded!


## 2. Check the Core Features File

Let's check the output of İbrahim's build_features.py pipeline. If it does not exist, we create it.


In [3]:
# First, run İbrahim's pipeline to produce the core features
from src.features.build_features import build_model_dataset

# If master_merged.csv is missing, it has to be created first
master_csv = Path("data_processed/master_merged.csv")

if not master_csv.exists():
    print("master_merged.csv not found!")
    print("Run the 02_clean_merge.ipynb notebook first.")
else:
    print(f"master_merged.csv found: {master_csv}")
    
    # Build the core features
    core_features_path = Path("data_interim/games_with_core_features.csv")
    if not core_features_path.exists():
        print("\nBuilding core features...")
        build_model_dataset(
            master_csv=master_csv,
            output_csv="data_processed/model_dataset.csv",
            write_interim=True
        )
    else:
        print(f"Core features found: {core_features_path}")


master_merged.csv found: data_processed\master_merged.csv
Core features found: data_interim\games_with_core_features.csv


In [4]:
# Load and inspect the core features file
core_features_path = Path("data_interim/games_with_core_features.csv")

if core_features_path.exists():
    core_df = pd.read_csv(core_features_path, low_memory=False)
    
    print(f"Core Features:")
    print(f"  Row count: {len(core_df):,}")
    print(f"  Column count: {len(core_df.columns)}")
    
    print(f"\nColumns:")
    for i, col in enumerate(core_df.columns):
        print(f"  {i+1}. {col}")
    
    print(f"\nFirst 3 rows:")
    display(core_df.head(3))
    
    # Check the date range
    if 'game_date' in core_df.columns:
        core_df['game_date'] = pd.to_datetime(core_df['game_date'])
        print(f"\nDate range:")
        print(f"  Min: {core_df['game_date'].min()}")
        print(f"  Max: {core_df['game_date'].max()}")
else:
    print("Core features file not found!")


Core Features:
  Row count: 18,226
  Column count: 213

Columns:
  1. gameId
  2. game_date
  3. season_year
  4. home_team
  5. away_team
  6. home_team_code
  7. away_team_code
  8. season_type
  9. matchup
  10. RANK_x
  11. home_team_CONF
  12. home_team_DIVISION
  13. home_team_GP
  14. home_team_PPG
  15. home_team_oPPG
  16. home_team_pDIFF
  17. home_team_PACE
  18. home_team_oEFF
  19. home_team_dEFF
  20. home_team_eDIFF
  21. home_team_SoS
  22. home_team_rSoS
  23. home_team_SAR
  24. home_team_CONS
  25. home_team_A4F
  26. home_team_W
  27. home_team_L
  28. home_team_WINpct
  29. home_team_eWINpct
  30. home_team_pWINpct
  31. home_team_ACH
  32. home_team_STRK
  33. RANK_y
  34. away_team_CONF
  35. away_team_DIVISION
  36. away_team_GP
  37. away_team_PPG
  38. away_team_oPPG
  39. away_team_pDIFF
  40. away_team_PACE
  41. away_team_oEFF
  42. away_team_dEFF
  43. away_team_eDIFF
  44. away_team_SoS
  45. away_team_rSoS
  46. away_team_SAR
  47. away_team_CON

Unnamed: 0,gameId,game_date,season_year,home_team,away_team,home_team_code,away_team_code,season_type,matchup,RANK_x,...,diff_schedule_ALL_B2B,diff_schedule_TOTAL_B2B_ON_THE_ROAD,diff_schedule_TOTAL_B2B_AT_HOME,diff_schedule_3IN4,diff_schedule_1_DAY_REST,diff_schedule_2_DAYS_REST,diff_schedule_3DAYS_REST,diff_schedule_REST_ADVANTAGE,diff_schedule_REST_DISADVANTAGE,diff_schedule_BOTH_TEAMS_RESTED_or_NO_REST
0,21000001,2010-10-26,2010-11,Boston Celtics,Miami Heat,BOS,MIA,regular,BOS vs. MIA,,...,-2,1,-3,0,5,-5,2,1,-2,1
1,21000002,2010-10-26,2010-11,Portland Trail Blazers,Phoenix Suns,POR,PHX,regular,PHX @ POR,,...,-1,2,-3,-3,5,-1,0,3,-3,0
2,21000003,2010-10-26,2010-11,Los Angeles Lakers,Houston Rockets,LAL,HOU,regular,HOU @ LAL,,...,0,-2,2,1,-1,0,0,1,1,-2



Date range:
  Min: 2010-10-26 00:00:00
  Max: 2025-12-19 00:00:00


## 3. Injury Data Processing

Load, clean, and merge the raw injury data.
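
The internals of `load_and_clean_injury_reports` are not shown in this notebook. As a rough sketch of what "clean and merge" could involve, the example below concatenates several parsed reports, normalizes text fields, drops unparseable rows, and keeps the latest entry per player. The function name `clean_injury_frames` and the columns `report_date`, `team`, `player_name`, and `status` are assumptions, and the player names are placeholders.

```python
import pandas as pd

def clean_injury_frames(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Concatenate parsed injury reports and normalize them (hypothetical schema)."""
    cols = ["report_date", "team", "player_name", "status"]
    if not frames:
        return pd.DataFrame(columns=cols)
    df = pd.concat(frames, ignore_index=True)
    # Unparseable dates become NaT and are dropped below
    df["report_date"] = pd.to_datetime(df["report_date"], errors="coerce")
    for col in ("team", "player_name", "status"):
        df[col] = df[col].astype("string").str.strip()
    df["status"] = df["status"].str.title()
    df = df.dropna(subset=cols)
    # Later reports override earlier ones for the same player and day
    df = df.drop_duplicates(subset=["report_date", "team", "player_name"], keep="last")
    return df.reset_index(drop=True)

morning = pd.DataFrame({
    "report_date": ["2025-11-16", "2025-11-16"],
    "team": ["Houston Rockets ", "LA Clippers"],
    "player_name": ["Player A", "Player B"],
    "status": ["out", "QUESTIONABLE"],
})
evening = pd.DataFrame({
    "report_date": ["2025-11-16", "not a date"],
    "team": ["Houston Rockets", "LA Clippers"],
    "player_name": ["Player A", "Player C"],
    "status": ["Questionable", "Out"],
})
injury_clean = clean_injury_frames([morning, evening])
print(injury_clean)
```

Returning an empty frame with the expected columns, rather than raising, matches how the notebook treats missing injury data as optional.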


In [5]:
# Check the raw injury folder
injury_raw_dir = Path("data_raw/injury_reports_raw")

if injury_raw_dir.exists():
    files = list(injury_raw_dir.glob("*"))
    print(f"Raw injury files ({len(files)} file(s)):")
    for f in files:
        print(f"  - {f.name}")
else:
    print("Raw injury folder not found!")


Raw injury files (1 file(s)):
  - Injury-Report_2025-11-16_12PM.parsed.csv


In [6]:
# Load and clean the injury data
# NOTE: injury data is limited (a single day only), so this step is optional
try:
    injury_df = load_and_clean_injury_reports(
        raw_dir="data_raw/injury_reports_raw/",
        output_path="data_interim/injury_reports_clean.csv"
    )
except Exception as e:
    print(f"⚠️ Error while loading injury data: {e}")
    injury_df = pd.DataFrame()

if len(injury_df) > 0:
    print(f"\n✅ Cleaned injury data:")
    print(f"  Row count: {len(injury_df)}")
    print(f"\nColumns: {list(injury_df.columns)}")
    print(f"\nFirst 10 rows:")
    display(injury_df.head(10))

    print(f"\nStatus distribution:")
    print(injury_df['status'].value_counts())
else:
    print("⚠️ Injury data not found or could not be parsed.")
    print("   → İbrahim's approach: continue without injury data (it will be used at inference time)")


  Loading: Injury-Report_2025-11-16_12PM.parsed.csv
⚠️ Injury data not found or could not be parsed.
   → İbrahim's approach: continue without injury data (it will be used at inference time)


## 4. Player Minutes Data


In [7]:
# Load player minutes (optional; needed only for the injury features)
try:
    player_minutes_df = load_player_minutes(
        player_stats_path="data_raw/nbastuffer_2025_2026_player_stats_raw.csv"
    )
except Exception as e:
    print(f"⚠️ Error while loading player minutes: {e}")
    player_minutes_df = pd.DataFrame()

if len(player_minutes_df) > 0:
    print(f"\n✅ Player minutes data:")
    print(f"  Player count: {len(player_minutes_df)}")

    print(f"\nPlayers with the most minutes:")
    top_players = player_minutes_df.nlargest(15, 'avg_minutes_per_game')
    display(top_players)

    print(f"\nMinutes distribution:")
    print(player_minutes_df['avg_minutes_per_game'].describe())

    # Key players (25+ minutes per game)
    key_players = player_minutes_df[player_minutes_df['avg_minutes_per_game'] >= 25]
    print(f"\nKey players (25+ min): {len(key_players)} players")
else:
    print("⚠️ Player minutes data not found; injury features cannot be computed without it")


✅ Player minutes loaded: 503 players

✅ Player minutes data:
  Player count: 503

Players with the most minutes:


Unnamed: 0,player_name,team,avg_minutes_per_game
129,Tyrese Maxey,Philadelphia 76ers,39.8
472,Keegan Murray,Sacramento Kings,37.5
35,Luka Doncic,Los Angeles Lakers,37.4
31,Austin Reaves,Los Angeles Lakers,36.9
4,Amen Thompson,Houston Rockets,36.7
1,Alperen Sengun,Houston Rockets,36.2
149,Cade Cunningham,Detroit Pistons,36.2
3,Kevin Durant,Houston Rockets,36.1
165,Trey Murphy III,New Orleans Pelicans,35.6
220,James Harden,LA Clippers,35.4



Minutes distribution:
count    503.000000
mean      19.362425
std        9.845941
min        0.600000
25%       11.200000
50%       19.300000
75%       28.050000
max       39.800000
Name: avg_minutes_per_game, dtype: float64

Key players (25+ min): 173 players


## 5. Injury Features Generation

Add the injury features to the core features.
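
`add_injury_features` itself lives in `src.features.injury_features` and is not reproduced here. The sketch below shows one plausible way to derive the six columns used in this notebook (`injury_count_*`, `expected_minutes_lost_*`, `any_key_player_out_*`) from an injury table and a minutes table. Counting only players with status `Out`, and the helper name `team_injury_features`, are assumptions made for illustration.

```python
import pandas as pd

def team_injury_features(injury_df, minutes_df, key_threshold=25.0):
    """Aggregate per-team injury features from players listed as 'Out' (assumed rule)."""
    out = injury_df[injury_df["status"] == "Out"]
    merged = out.merge(minutes_df, on=["player_name", "team"], how="left")
    # Players without a minutes record contribute zero expected minutes
    merged["avg_minutes_per_game"] = merged["avg_minutes_per_game"].fillna(0.0)
    merged["is_key"] = merged["avg_minutes_per_game"] >= key_threshold
    return merged.groupby("team").agg(
        injury_count=("player_name", "size"),
        expected_minutes_lost=("avg_minutes_per_game", "sum"),
        any_key_player_out=("is_key", "max"),
    ).astype({"any_key_player_out": int})

minutes = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "team": ["X", "X", "Y"],
    "avg_minutes_per_game": [30.0, 12.5, 20.0],
})
injuries = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "team": ["X", "X", "Y", "Y"],
    "status": ["Out", "Out", "Questionable", "Out"],
})
features = team_injury_features(injuries, minutes)

# Attach the team-level features to each game, once per side
games = pd.DataFrame({"home_team": ["X"], "away_team": ["Z"]})
for side in ("home", "away"):
    games = games.join(features.add_suffix(f"_{side}"), on=f"{side}_team")
games = games.fillna(0)  # teams with no reported injuries get zeros
print(games.T)
```

A real implementation would also have to match each report to the game date, which this toy example leaves out.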


In [8]:
# Add injury features (or fall back to placeholders)
if core_features_path.exists():
    # Use the injury data if it exists and is sufficient
    if len(injury_df) > 0 and 'player_minutes_df' in dir() and len(player_minutes_df) > 0:
        try:
            games_with_injury = add_injury_features(
                games_df=core_df,
                injury_df=injury_df,
                player_minutes_df=player_minutes_df,
                key_player_minutes_threshold=25.0
            )
            print(f"\n✅ Injury features added:")
            new_cols = [col for col in games_with_injury.columns if col not in core_df.columns]
            print(f"  New columns: {new_cols}")
        except Exception as e:
            print(f"⚠️ Error while adding injury features: {e}")
            games_with_injury = core_df.copy()
    else:
        # İbrahim's approach: injury data is insufficient, use placeholders
        print("⚠️ Not enough injury data; applying İbrahim's approach:")
        print("   → Injury features will be set to 0")
        print("   → The model will be trained on core features only")
        print("   → Injury data will be used separately at inference time")
        
        games_with_injury = core_df.copy()
        # Placeholder injury columns (all zeros)
        games_with_injury['injury_count_home'] = 0
        games_with_injury['injury_count_away'] = 0
        games_with_injury['expected_minutes_lost_home'] = 0.0
        games_with_injury['expected_minutes_lost_away'] = 0.0
        games_with_injury['any_key_player_out_home'] = 0
        games_with_injury['any_key_player_out_away'] = 0

    print(f"\n📊 Final dataset:")
    print(f"  Row count: {len(games_with_injury):,}")
    print(f"  Column count: {len(games_with_injury.columns)}")
else:
    print("❌ Core features file not found!")


⚠️ Not enough injury data; applying İbrahim's approach:
   → Injury features will be set to 0
   → The model will be trained on core features only
   → Injury data will be used separately at inference time

📊 Final dataset:
  Row count: 18,226
  Column count: 219


In [9]:
# Save the final dataset
output_path = Path("data_processed/games_with_all_features.csv")
output_path.parent.mkdir(parents=True, exist_ok=True)

games_with_injury.to_csv(output_path, index=False)
print(f"Final dataset saved: {output_path}")
print(f"   {len(games_with_injury):,} rows, {len(games_with_injury.columns)} columns")


Final dataset saved: data_processed\games_with_all_features.csv
   18,226 rows, 219 columns


## 6. Dataset Splitting (Random Shuffle)

Perform a random-shuffle train/val/test split.

**Why a Random Split?**
- The way basketball is played has changed over the years, so a model trained only on older seasons and asked to predict current games can lose performance to distribution drift.
- With a random split, the model is trained on games from every year, including the most recent ones.

**Split Ratios:**
- Train: 70%
- Validation: 15%
- Test: 15%
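
The ratios above can be illustrated with a minimal shuffle-then-slice sketch. This is not the code of `split_dataset_random`, whose implementation is not shown here; the helper name `random_split` and the rule that the test set takes whatever rows remain are assumptions.

```python
import pandas as pd

def random_split(df, train_ratio=0.70, val_ratio=0.15, random_state=42):
    """Shuffle rows once, then cut into train/val/test; test takes the remainder."""
    shuffled = df.sample(frac=1.0, random_state=random_state).reset_index(drop=True)
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test

toy = pd.DataFrame({"gameId": range(200)})
train_part, val_part, test_part = random_split(toy)
print(len(train_part), len(val_part), len(test_part))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the same split on every run.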


In [10]:
# Analyze the date range
print("Date Range Analysis:")
stats = analyze_date_range(games_with_injury, date_col='game_date')

print(f"\nMin date: {stats['min_date']}")
print(f"Max date: {stats['max_date']}")
print(f"Total days: {stats['date_range_days']}")
print(f"Total games: {stats['total_games']:,}")

print(f"\nGames per year:")
for year, count in sorted(stats['games_per_year'].items()):
    print(f"  {year}: {count:,}")


Date Range Analysis:

Min date: 2010-10-26 00:00:00
Max date: 2025-12-19 00:00:00
Total days: 5533
Total games: 18,226

Games per year:
  2010: 482
  2011: 885
  2012: 1,474
  2013: 1,324
  2014: 1,334
  2015: 1,319
  2016: 1,333
  2017: 1,347
  2018: 1,316
  2019: 1,267
  2020: 706
  2021: 1,625
  2022: 1,336
  2023: 1,248
  2024: 824
  2025: 406


In [11]:
# Random split configuration
TRAIN_RATIO = 0.70  # 70% train
VAL_RATIO = 0.15    # 15% validation
TEST_RATIO = 0.15   # 15% test
RANDOM_STATE = 42   # for reproducibility

print(f"Random Split Configuration:")
print(f"  Train:  {TRAIN_RATIO:.0%}")
print(f"  Val:    {VAL_RATIO:.0%}")
print(f"  Test:   {TEST_RATIO:.0%}")
print(f"  Random State: {RANDOM_STATE}")

# Note: for the older time-based split, suggest_split_dates can be used:
# suggested_train_end, suggested_val_end = suggest_split_dates(
#     games_with_injury,
#     date_col='game_date',
#     train_ratio=0.70,
#     val_ratio=0.15
# )


Random Split Configuration:
  Train:  70%
  Val:    15%
  Test:   15%
  Random State: 42


In [12]:
# Data summary
print(f"Total number of games: {len(games_with_injury):,}")
print(f"\nExpected split sizes:")
print(f"  Train: ~{int(len(games_with_injury) * TRAIN_RATIO):,} games")
print(f"  Val:   ~{int(len(games_with_injury) * VAL_RATIO):,} games")
print(f"  Test:  ~{int(len(games_with_injury) * TEST_RATIO):,} games")

# Note: for the older time-based split:
# TRAIN_END = "2021-07-01"
# VAL_END = "2023-07-01"
# TEST_END = None


Total number of games: 18,226

Expected split sizes:
  Train: ~12,758 games
  Val:   ~2,733 games
  Test:  ~2,733 games
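
For comparison, the commented-out time-based alternative with the cutoffs `2021-07-01` and `2023-07-01` could be sketched as below. The helper name `time_split` and the boundary convention (a cutoff date belongs to the later split) are assumptions; `split_dataset_by_time` may treat boundaries differently.

```python
import pandas as pd

def time_split(df, train_end, val_end, date_col="game_date"):
    """Train: dates before train_end; val: [train_end, val_end); test: val_end onward."""
    dates = pd.to_datetime(df[date_col])
    train_end, val_end = pd.Timestamp(train_end), pd.Timestamp(val_end)
    train = df[dates < train_end]
    val = df[(dates >= train_end) & (dates < val_end)]
    test = df[dates >= val_end]
    return train, val, test

toy = pd.DataFrame({
    "gameId": [1, 2, 3, 4],
    "game_date": ["2010-10-26", "2021-06-30", "2021-07-01", "2025-12-19"],
})
train_part, val_part, test_part = time_split(toy, "2021-07-01", "2023-07-01")
print(list(train_part["gameId"]), list(val_part["gameId"]), list(test_part["gameId"]))
```

With this kind of split every validation game is later than every training game, which is the property the date-overlap check in section 7 tests for.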


In [13]:
# Perform the random split
train_df, val_df, test_df = split_dataset_random(
    games_with_injury,
    train_ratio=TRAIN_RATIO,
    val_ratio=VAL_RATIO,
    test_ratio=TEST_RATIO,
    random_state=RANDOM_STATE
)

# Show the statistics (with the per-year distribution)
print_split_stats_random(train_df, val_df, test_df, date_col='game_date', label_col='home_team_win')

# Note: for the older time-based split:
# train_df, val_df, test_df = split_dataset_by_time(
#     games_with_injury,
#     train_end=TRAIN_END,
#     val_end=VAL_END,
#     test_end=TEST_END,
#     date_col='game_date'
# )
# print_split_stats(train_df, val_df, test_df, date_col='game_date', label_col='home_team_win')



RANDOM SPLIT STATISTICS

Train:
  Games: 12,758 (70.0%)
  Date range: 2010-10-26 - 2025-12-19
  Year distribution: 2010: 327, 2011: 642, 2012: 1040, 2013: 927, 2014: 932, 2015: 929, 2016: 910, 2017: 954, 2018: 903, 2019: 881, 2020: 476, 2021: 1155, 2022: 942, 2023: 873, 2024: 589, 2025: 278
  Home win rate: 57.0%

Val:
  Games: 2,734 (15.0%)
  Date range: 2010-10-26 - 2025-12-19
  Year distribution: 2010: 91, 2011: 128, 2012: 224, 2013: 190, 2014: 217, 2015: 183, 2016: 208, 2017: 193, 2018: 212, 2019: 207, 2020: 116, 2021: 219, 2022: 200, 2023: 163, 2024: 122, 2025: 61
  Home win rate: 58.2%

Test:
  Games: 2,734 (15.0%)
  Date range: 2010-10-27 - 2025-12-19
  Year distribution: 2010: 64, 2011: 115, 2012: 210, 2013: 207, 2014: 185, 2015: 207, 2016: 215, 2017: 200, 2018: 201, 2019: 179, 2020: 114, 2021: 251, 2022: 194, 2023: 212, 2024: 113, 2025: 67
  Home win rate: 56.9%



In [14]:
# Validation checks
required_cols = ['home_team_win', 'score_diff', 'game_date', 'home_team', 'away_team']

is_valid = validate_splits(
    train_df, val_df, test_df,
    date_col='game_date',
    required_cols=required_cols,
    check_date_overlap=False  # date-overlap check is disabled for the random split
)



✅ All validation checks passed!
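
What `validate_splits` checks internally is not visible in this notebook. As an illustration of checks that remain meaningful for a random split, the sketch below verifies that required columns exist, that no split is empty, and that no game appears in two splits. The helper `check_splits` and the use of `gameId` as the unique key are assumptions.

```python
import pandas as pd

def check_splits(splits, required_cols, id_col="gameId"):
    """Return a list of problems: missing columns, empty splits, or shared ids."""
    problems = []
    for name, df in splits.items():
        missing = [c for c in required_cols if c not in df.columns]
        if missing:
            problems.append(f"{name}: missing columns {missing}")
        if df.empty:
            problems.append(f"{name}: split is empty")
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if id_col not in splits[a].columns or id_col not in splits[b].columns:
                continue
            shared = set(splits[a][id_col]) & set(splits[b][id_col])
            if shared:
                problems.append(f"{a}/{b}: {len(shared)} shared {id_col} values")
    return problems

good = {
    "train": pd.DataFrame({"gameId": [1, 2], "home_team_win": [1, 0]}),
    "val": pd.DataFrame({"gameId": [3], "home_team_win": [1]}),
    "test": pd.DataFrame({"gameId": [4], "home_team_win": [0]}),
}
# Same splits, but val leaks game 2 from train and lacks the label column
bad = dict(good, val=pd.DataFrame({"gameId": [2]}))

good_problems = check_splits(good, ["home_team_win"])
bad_problems = check_splits(bad, ["home_team_win"])
print(good_problems, bad_problems, sep="\n")
```

Row-level disjointness is the guarantee a random split can offer, since date ranges necessarily overlap.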


In [15]:
# Save the split files
paths = save_splits(
    train_df, val_df, test_df,
    output_dir='data_processed/',
    prefix=''
)

print("\nSaved files:")
for name, path in paths.items():
    df = pd.read_csv(path)
    print(f"  {name}: {path} ({len(df):,} rows)")



Saved files:
  train: data_processed\train_set.csv (12,758 rows)
  val: data_processed\val_set.csv (2,734 rows)
  test: data_processed\test_set.csv (2,734 rows)


## 7. Final Verification


In [16]:
# Check all output files
output_files = [
    "data_interim/injury_reports_clean.csv",
    "data_processed/games_with_all_features.csv",
    "data_processed/train_set.csv",
    "data_processed/val_set.csv",
    "data_processed/test_set.csv"
]

print("=" * 60)
print("OUTPUT FILE CHECK")
print("=" * 60)

for file_path in output_files:
    path = Path(file_path)
    if path.exists():
        df = pd.read_csv(path)
        print(f"\n[OK] {file_path}")
        print(f"     Rows: {len(df):,}, Columns: {len(df.columns)}")
    else:
        print(f"\n[X] {file_path} not found!")


OUTPUT FILE CHECK

[X] data_interim/injury_reports_clean.csv not found!

[OK] data_processed/games_with_all_features.csv
     Rows: 18,226, Columns: 219

[OK] data_processed/train_set.csv
     Rows: 12,758, Columns: 219

[OK] data_processed/val_set.csv
     Rows: 2,734, Columns: 219

[OK] data_processed/test_set.csv
     Rows: 2,734, Columns: 219


In [17]:
# Train/Val/Test date-range overlap check
# NOTE: this check is only meaningful for the time-based split. With the random
# split every set spans the full date range, so the overlap reported below is expected.
print("\n" + "=" * 60)
print("OVERLAP CHECK")
print("=" * 60)

train_dates = pd.to_datetime(train_df['game_date'])
val_dates = pd.to_datetime(val_df['game_date'])
test_dates = pd.to_datetime(test_df['game_date'])

print(f"\nTrain: {train_dates.min()} - {train_dates.max()}")
print(f"Val:   {val_dates.min()} - {val_dates.max()}")
print(f"Test:  {test_dates.min()} - {test_dates.max()}")

# Is there any overlap?
train_val_overlap = train_dates.max() >= val_dates.min()
val_test_overlap = val_dates.max() >= test_dates.min()

if train_val_overlap:
    print("\n[X] Train-Val overlap detected!")
else:
    print("\n[OK] No Train-Val overlap")

if val_test_overlap:
    print("[X] Val-Test overlap detected!")
else:
    print("[OK] No Val-Test overlap")



OVERLAP CHECK

Train: 2010-10-26 00:00:00 - 2025-12-19 00:00:00
Val:   2010-10-26 00:00:00 - 2025-12-19 00:00:00
Test:  2010-10-27 00:00:00 - 2025-12-19 00:00:00

[X] Train-Val overlap detected!
[X] Val-Test overlap detected!


In [18]:
# Label distribution check
print("\n" + "=" * 60)
print("LABEL DISTRIBUTION")
print("=" * 60)

for name, df in [("Train", train_df), ("Val", val_df), ("Test", test_df)]:
    if 'home_team_win' in df.columns:
        win_rate = df['home_team_win'].mean() * 100
        print(f"\n{name}: Home win rate = {win_rate:.1f}%")
    
    if 'score_diff' in df.columns:
        avg_diff = df['score_diff'].mean()
        std_diff = df['score_diff'].std()
        print(f"       Score diff = {avg_diff:.2f} +/- {std_diff:.2f}")



LABEL DISTRIBUTION

Train: Home win rate = 57.0%
       Score diff = 2.32 +/- 14.11

Val: Home win rate = 58.2%
       Score diff = 2.50 +/- 13.82

Test: Home win rate = 56.9%
       Score diff = 2.14 +/- 14.00
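
The same label statistics can also be collected into a single table, which is easier to compare across splits. The following is a small sketch using toy data; the helper name `label_summary` is hypothetical, while the column names `home_team_win` and `score_diff` come from the cells above.

```python
import pandas as pd

def label_summary(splits, win_col="home_team_win", diff_col="score_diff"):
    """One row per split: game count, home win rate, score-diff mean and std."""
    combined = pd.concat(splits, names=["split"]).reset_index(level="split")
    return combined.groupby("split", sort=False).agg(
        games=(win_col, "size"),
        home_win_rate=(win_col, "mean"),
        score_diff_mean=(diff_col, "mean"),
        score_diff_std=(diff_col, "std"),
    )

toy_splits = {
    "train": pd.DataFrame({"home_team_win": [1, 1, 0, 0], "score_diff": [5, 3, -2, -6]}),
    "val": pd.DataFrame({"home_team_win": [1], "score_diff": [4]}),
}
summary = label_summary(toy_splits)
print(summary)
```

Passing a dict to `pd.concat` turns the keys into an index level, so the split name travels with each row without adding a column by hand.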


## Summary

Files produced by this notebook:

| File | Description |
|------|-------------|
| `data_interim/injury_reports_clean.csv` | Cleaned injury data (not written in this run: the injury loader returned no rows) |
| `data_processed/games_with_all_features.csv` | Core + injury features (the injury columns are zero placeholders in this run) |
| `data_processed/train_set.csv` | Training set |
| `data_processed/val_set.csv` | Validation set |
| `data_processed/test_set.csv` | Test set |

İbrahim can use these files for model training:
- `train_set.csv`: model training
- `val_set.csv`: hyperparameter optimization
- `test_set.csv`: final evaluation


In [19]:
print("\n" + "=" * 60)
print("ALL TASKS COMPLETED!")
print("=" * 60)
print("\nFiles ready for İbrahim:")
print("  - data_processed/train_set.csv")
print("  - data_processed/val_set.csv")
print("  - data_processed/test_set.csv")



ALL TASKS COMPLETED!

Files ready for İbrahim:
  - data_processed/train_set.csv
  - data_processed/val_set.csv
  - data_processed/test_set.csv
