# Tennis Match Winner Prediction

## Predicting who wins: Player A vs Player B

The model predicts the winner of a given tennis match based on:
- Current ATP rankings
- Head-to-head history
- Form on different surfaces
- Recent results

## Data:
- ATP matches 2000-2024 - match results
- ATP rankings - historical player rankings
- Head-to-head - direct encounters
- Player form - recent results

1. For each match we create 2 rows of data:
   - Row 1: Player A vs Player B (target = 1 if A won)
   - Row 2: Player B vs Player A (target = 0 if A won, because B is now the "first" player)
2. The model learns to predict the probability that the "first" player wins
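
The two-row scheme above can be sketched in a few lines. This is a minimal illustration only: the helper name `mirror_match` and the example players are hypothetical and are not part of the notebook's pipeline.

```python
def mirror_match(winner, loser):
    """Return two training rows for one match, one per player ordering."""
    def row(a, b, a_won):
        out = {f"player_a_{k}": v for k, v in a.items()}
        out.update({f"player_b_{k}": v for k, v in b.items()})
        out["player_a_won"] = a_won
        return out

    # Row 1 puts the winner first (target 1), row 2 puts the loser first (target 0).
    return [row(winner, loser, 1), row(loser, winner, 0)]


# Made-up example values, not taken from the dataset.
rows = mirror_match({"name": "Player X", "rank": 3}, {"name": "Player Y", "rank": 27})
for r in rows:
    print(r)
```

Because every match appears once in each orientation, the target is balanced by construction and the model cannot learn from the position of the winner.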

In [2]:
# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.utils import shuffle

print("Libraries imported successfully!")

# Settings
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

Libraries imported successfully!


In [3]:
import glob
import os

# 1. ATP MATCHES - years 2000-2024 (main tour events)
print("Main tournament matches loading...")
match_files = []
data_dir = r"c:\Users\patry\OneDrive\Pulpit\Sport\Data"

for year in range(2000, 2025):
    file_path = os.path.join(data_dir, f"atp_matches_{year}.csv")
    if os.path.exists(file_path):
        match_files.append(file_path)

print(f" Matches count: {len(match_files)}")

# Load all matches
all_matches = []
for file_path in match_files:
    year = file_path.split('_')[-1].split('.')[0]
    try:
        df = pd.read_csv(file_path)
        df['year'] = int(year)
        all_matches.append(df)
        print(f"   {year}: {len(df):,} matches")
    except Exception as e:
        print(f" Error in {year}: {e}")

# Combine all matches
matches_combined = pd.concat(all_matches, ignore_index=True)
print(f" Connect all matches: {len(matches_combined):,}")

# 2. ATP RANKINGS - all decades
print(f"\n Loading ATP rankings...")
ranking_files = [
    'atp_rankings_00s.csv',
    'atp_rankings_10s.csv', 
    'atp_rankings_20s.csv',
    'atp_rankings_current.csv'
]

all_rankings = []
for file_name in ranking_files:
    file_path = os.path.join(data_dir, file_name)
    if os.path.exists(file_path):
        try:
            df = pd.read_csv(file_path)
            decade = file_name.split('_')[-1].split('.')[0]
            print(f"   {decade}: {len(df):,} rankings")
            all_rankings.append(df)
        except Exception as e:
            print(f" Error in {file_name}: {e}")

# Combine all rankings (pd.concat raises on an empty list, so fall back to an empty frame)
rankings_combined = pd.concat(all_rankings, ignore_index=True) if all_rankings else pd.DataFrame()
print(f"Connect rankings: {len(rankings_combined):,}")

# 3. PLAYER DATA
print(f"\nLoading player data...")
players_file = os.path.join(data_dir, 'atp_players.csv')
if os.path.exists(players_file):
    players_data = pd.read_csv(players_file)
    print(f"Players: {len(players_data):,}")
else:
    players_data = None
    print("Error: Players data file not found.")

# 4. BASIC STATISTICS
print(f"\nExtended data overview:")
print(f"Matches:")
print(f"Years: {matches_combined['year'].min()} - {matches_combined['year'].max()}")
print(f"Count: {len(matches_combined):,}")
print(f"Columns: {len(matches_combined.columns)}")

if len(rankings_combined) > 0:
    print(f"Rankings:")
    print(f"Date: {rankings_combined['ranking_date'].min()} - {rankings_combined['ranking_date'].max()}")
    print(f"Unique players: {rankings_combined['player'].nunique():,}")

if players_data is not None:
    print(f"Players:")
    print(f"Unique players: {len(players_data):,}")

# 5. MATCH DISTRIBUTION BY YEAR
print(f"\n Matches distribution by year:")
yearly_counts = matches_combined['year'].value_counts().sort_index()
for year, count in yearly_counts.items():
    if year % 5 == 0 or year >= 2020:  # Every 5th year plus the most recent years
        print(f"   {year}: {count:,} matches")

# 6. DATA QUALITY
print(f"\n Data quality overview:")
complete_matches = matches_combined[
    (matches_combined['score'].notna()) &
    (matches_combined['winner_rank'].notna()) & 
    (matches_combined['loser_rank'].notna()) &
    (matches_combined['winner_age'].notna()) &
    (matches_combined['loser_age'].notna())
]
print(f"Complete matches data: {len(complete_matches):,}/{len(matches_combined):,} ({len(complete_matches)/len(matches_combined)*100:.1f}%)")

Main tournament matches loading...
 Matches count: 25
   2000: 3,378 matches
   2001: 3,307 matches
   2002: 3,213 matches
   2003: 3,218 matches
   2004: 3,288 matches
   2005: 3,264 matches
   2006: 3,267 matches
   2007: 3,192 matches
   2008: 3,123 matches
   2009: 3,085 matches
   2010: 3,030 matches
   2011: 3,015 matches
   2012: 3,009 matches
   2013: 2,944 matches
   2014: 2,901 matches
   2015: 2,943 matches
   2016: 2,941 matches
   2017: 2,911 matches
   2018: 2,897 matches
   2019: 2,806 matches
   2020: 1,462 matches
   2021: 2,733 matches
   2022: 2,917 matches
   2023: 2,986 matches
   2024: 3,076 matches
 Connect all matches: 74,906

 Loading ATP rankings...
   00s: 920,907 r
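
The loading loop above can be wrapped in a small reusable helper. The sketch below is one possible shape, assuming the same `<prefix><year>.csv` naming; the function name `load_yearly_csvs` and the temporary demo files are hypothetical. It also fails with a clear message when no file is found, since `pd.concat` raises a `ValueError` on an empty list.

```python
import os
import tempfile

import pandas as pd


def load_yearly_csvs(data_dir, prefix, years):
    """Read <prefix><year>.csv for each year that exists and stack the results."""
    frames = []
    for year in years:
        path = os.path.join(data_dir, f"{prefix}{year}.csv")
        if not os.path.exists(path):
            continue  # silently skip missing years, as the notebook does
        df = pd.read_csv(path)
        df["year"] = year
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"no {prefix}*.csv files found in {data_dir}")
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, n_rows in [(2000, 2), (2001, 3)]:
        pd.DataFrame({"match_num": range(n_rows)}).to_csv(
            os.path.join(tmp, f"atp_matches_{year}.csv"), index=False
        )
    demo = load_yearly_csvs(tmp, "atp_matches_", range(2000, 2003))
    print(demo.groupby("year").size())
```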

In [4]:
# Enhanced data transformation with additional features
print("Data transformation with additional features...")

def create_enhanced_player_vs_player_data(matches_df, rankings_df, players_df):
    # Keep only matches with complete data
    complete_matches = matches_df[
        (matches_df['score'].notna()) &
        (matches_df['winner_rank'].notna()) & 
        (matches_df['loser_rank'].notna()) &
        (matches_df['winner_age'].notna()) &
        (matches_df['loser_age'].notna())
    ].copy()
    
    print(f"Full matches data: {len(complete_matches):,}/{len(matches_df):,}")
    
    # Sort chronologically; matches sharing a date are ordered by tourney_id and
    # match_num (assumed here to follow playing order) so history is built deterministically
    complete_matches['date'] = pd.to_datetime(complete_matches['tourney_date'].astype(str), format='%Y%m%d')
    complete_matches = complete_matches.sort_values(['date', 'tourney_id', 'match_num'], kind='stable')
    
    # Prepare auxiliary player data
    if players_df is not None:
        # Convert height from cm to metres and derive the birth year
        players_df['height_m'] = players_df['height'] / 100 if 'height' in players_df.columns else None
        players_df['birth_year'] = pd.to_datetime(players_df['dob']).dt.year if 'dob' in players_df.columns else None
    
    rows = []
    
    # Variables for tracking history
    player_match_history = {}  # {player_id: [matches]}
    head_to_head = {}  # {(p1, p2): {'wins_p1': int, 'wins_p2': int, 'total': int}}
    
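    print("Matches processing...")
    # Count rows with enumerate: after filtering and sorting, the DataFrame index
    # is no longer a running position, so it cannot drive the progress report
    for i, (_, match) in enumerate(complete_matches.iterrows()):
        if i % 5000 == 0:
            print(f"Processed: {i:,}/{len(complete_matches):,}")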
        
        winner_id = match['winner_id']
        loser_id = match['loser_id']
        match_date = match['date']
        
        # HEAD-TO-HEAD HISTORY
        h2h_key = tuple(sorted([winner_id, loser_id]))
        if h2h_key not in head_to_head:
            head_to_head[h2h_key] = {'wins_p1': 0, 'wins_p2': 0, 'total': 0}
        
        # H2H record before this match
        current_h2h = head_to_head[h2h_key].copy()
        
        # PLAYER FORM (last 10 matches)
        def get_recent_form(player_id, current_date, history, n_matches=10):
            if player_id not in history:
                return {'wins': 0, 'losses': 0, 'win_rate': 0.5}
            
            recent_matches = [m for m in history[player_id] if m['date'] < current_date]
            recent_matches = recent_matches[-n_matches:]  # Last n matches
            
            if not recent_matches:
                return {'wins': 0, 'losses': 0, 'win_rate': 0.5}
            
            wins = sum(1 for m in recent_matches if m['won'])
            losses = len(recent_matches) - wins
            win_rate = wins / len(recent_matches) if recent_matches else 0.5
            
            return {'wins': wins, 'losses': losses, 'win_rate': win_rate}
        
        winner_form = get_recent_form(winner_id, match_date, player_match_history)
        loser_form = get_recent_form(loser_id, match_date, player_match_history)
        
        # ADDITIONAL PLAYER DATA
        winner_height = None
        loser_height = None
        winner_birth_year = None
        loser_birth_year = None
        
        if players_df is not None:
            winner_player = players_df[players_df['player_id'] == winner_id]
            loser_player = players_df[players_df['player_id'] == loser_id]
            
            if not winner_player.empty:
                winner_height = winner_player.iloc[0].get('height_m', None)
                winner_birth_year = winner_player.iloc[0].get('birth_year', None)
            
            if not loser_player.empty:
                loser_height = loser_player.iloc[0].get('height_m', None)
                loser_birth_year = loser_player.iloc[0].get('birth_year', None)
        
        # BUILD DATA ROWS
        base_features = {
            'match_id': f"{match['tourney_id']}_{match['match_num']}",
            'date': match_date,
            'year': match['year'],
            'surface': match['surface'],
            'tourney_level': match['tourney_level'],
            'round': match['round'],
            'tourney_name': match.get('tourney_name', ''),
        }
        
        # Row 1: Winner vs Loser (target = 1)
        row1 = base_features.copy()
        row1.update({
            # Player A (winner)
            'player_a_id': winner_id,
            'player_a_name': match['winner_name'],
            'player_a_rank': match['winner_rank'],
            'player_a_points': match.get('winner_rank_points', 0),
            'player_a_age': match['winner_age'],
            'player_a_hand': match['winner_hand'],
            'player_a_seed': match.get('winner_seed', 0) if pd.notna(match.get('winner_seed')) else 0,
            'player_a_height': winner_height,
            
            # Player B (loser)
            'player_b_id': loser_id,
            'player_b_name': match['loser_name'],
            'player_b_rank': match['loser_rank'],
            'player_b_points': match.get('loser_rank_points', 0),
            'player_b_age': match['loser_age'],
            'player_b_hand': match['loser_hand'],
            'player_b_seed': match.get('loser_seed', 0) if pd.notna(match.get('loser_seed')) else 0,
            'player_b_height': loser_height,
            
            # Head-to-head (before this match)
            'h2h_total': current_h2h['total'],
            'h2h_a_wins': current_h2h['wins_p1'] if h2h_key[0] == winner_id else current_h2h['wins_p2'],
            'h2h_b_wins': current_h2h['wins_p2'] if h2h_key[0] == winner_id else current_h2h['wins_p1'],
            
            # Player form (last 10 matches)
            'player_a_form_wins': winner_form['wins'],
            'player_a_form_losses': winner_form['losses'],
            'player_a_win_rate': winner_form['win_rate'],
            'player_b_form_wins': loser_form['wins'],
            'player_b_form_losses': loser_form['losses'],
            'player_b_win_rate': loser_form['win_rate'],
            
            # Target
            'player_a_won': 1
        })
        
        # Row 2: Loser vs Winner (target = 0)
        row2 = base_features.copy()
        row2.update({
            # Player A (loser)
            'player_a_id': loser_id,
            'player_a_name': match['loser_name'],
            'player_a_rank': match['loser_rank'],
            'player_a_points': match.get('loser_rank_points', 0),
            'player_a_age': match['loser_age'],
            'player_a_hand': match['loser_hand'],
            'player_a_seed': match.get('loser_seed', 0) if pd.notna(match.get('loser_seed')) else 0,
            'player_a_height': loser_height,
            
            # Player B (winner)
            'player_b_id': winner_id,
            'player_b_name': match['winner_name'],
            'player_b_rank': match['winner_rank'],
            'player_b_points': match.get('winner_rank_points', 0),
            'player_b_age': match['winner_age'],
            'player_b_hand': match['winner_hand'],
            'player_b_seed': match.get('winner_seed', 0) if pd.notna(match.get('winner_seed')) else 0,
            'player_b_height': winner_height,
            
            # Head-to-head (before this match)
            'h2h_total': current_h2h['total'],
            'h2h_a_wins': current_h2h['wins_p2'] if h2h_key[0] == winner_id else current_h2h['wins_p1'],
            'h2h_b_wins': current_h2h['wins_p1'] if h2h_key[0] == winner_id else current_h2h['wins_p2'],
            
            # Player form
            'player_a_form_wins': loser_form['wins'],
            'player_a_form_losses': loser_form['losses'],
            'player_a_win_rate': loser_form['win_rate'],
            'player_b_form_wins': winner_form['wins'],
            'player_b_form_losses': winner_form['losses'],
            'player_b_win_rate': winner_form['win_rate'],
            
            # Target
            'player_a_won': 0
        })
        
        rows.extend([row1, row2])
        
        # UPDATE HISTORY
        # Update head-to-head
        if h2h_key[0] == winner_id:
            head_to_head[h2h_key]['wins_p1'] += 1
        else:
            head_to_head[h2h_key]['wins_p2'] += 1
        head_to_head[h2h_key]['total'] += 1
        
        # Update the players' match histories
        winner_match = {'date': match_date, 'won': True, 'surface': match['surface']}
        loser_match = {'date': match_date, 'won': False, 'surface': match['surface']}
        
        if winner_id not in player_match_history:
            player_match_history[winner_id] = []
        if loser_id not in player_match_history:
            player_match_history[loser_id] = []
        
        player_match_history[winner_id].append(winner_match)
        player_match_history[loser_id].append(loser_match)
    
    print(f"\n Updated data created:")
    df = pd.DataFrame(rows)
    print(f"Matches: {len(complete_matches):,}")
    print(f"Player vs player rows: {len(df):,}")
    print(f"Targets: {df['player_a_won'].value_counts().to_dict()}")
    print(f"Years: {df['year'].min()} - {df['year'].max()}")
    
    return df

# Run the enhanced transformation
enhanced_player_matches = create_enhanced_player_vs_player_data(
    matches_combined, 
    rankings_combined, 
    players_data
)

print(f"\nNew features in data:")
new_columns = [col for col in enhanced_player_matches.columns if col not in ['match_id', 'date', 'surface', 'tourney_level', 'round', 'player_a_won']]
for i, col in enumerate(new_columns, 1):
    print(f"{i:2d}. {col}")

# Inspect a few head-to-head examples
h2h_examples = enhanced_player_matches[enhanced_player_matches['h2h_total'] > 0].head(3)
print(f"\nHEAD-TO-HEAD:")
for _, row in h2h_examples.iterrows():
    print(f"{row['player_a_name']} vs {row['player_b_name']}: H2H {row['h2h_a_wins']}-{row['h2h_b_wins']} (of {row['h2h_total']} matches)")

Data transformation with additional features...
Full matches data: 73,140/74,906
Matches processing...
Processed: 0/73,140
Processed: 5,000/73,140
Processed: 10,000/73,140
Processed: 15,000/73,140
Processed: 20,000/73,140
Processed: 25,000/73,140
Processed: 30,000/73,140
Processed: 35,000/73,140
Processed: 40,000/73,140
Processed: 45,000/73,140
Processed: 50,000/73,140
Processed: 55,000/73,140
Processed: 60,000/73,140
Processed: 65,000/73,140
Processed: 70,000/73,140
\n Updated data created:
Matches: 73,140
Player vs player rows: 146,280
Targets: {1: 73140, 0: 73140}
Years: 2000 - 2024
\nNew feature
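
The recent-form logic in the cell above rescans a player's whole history for every match. A bounded `deque` per player gives the same kind of "last n results" window in constant time per lookup. The sketch below is a simplified variant under one stated assumption: it relies on the matches already being in chronological order rather than on the date comparison used in the notebook, and the function name `rolling_form` is hypothetical.

```python
from collections import defaultdict, deque


def rolling_form(matches, n_matches=10):
    """Return (winner_rate, loser_rate) as known before each match.

    `matches` is an iterable of (winner_id, loser_id) in playing order.
    A player with no history gets the neutral value 0.5.
    """
    history = defaultdict(lambda: deque(maxlen=n_matches))
    form = []
    for winner, loser in matches:
        rates = []
        for player in (winner, loser):
            results = history[player]
            rates.append(sum(results) / len(results) if results else 0.5)
        form.append(tuple(rates))
        # Record the result only after the features are read, to avoid leakage.
        history[winner].append(1)
        history[loser].append(0)
    return form


demo_form = rolling_form([("A", "B"), ("A", "C"), ("B", "A")])
print(demo_form)  # [(0.5, 0.5), (1.0, 0.5), (0.0, 1.0)]
```

The key point carried over from the notebook is the ordering: features are computed first, and the history is updated afterwards.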

In [5]:
# Feature engineering with the new features
def add_advanced_features(df):
    df = df.copy()
    
    print("Adding advanced features...")
    # 1. Basic differences
    df['rank_diff'] = df['player_a_rank'] - df['player_b_rank']  # Negative = A is ranked higher
    df['age_diff'] = df['player_a_age'] - df['player_b_age']
    df['points_diff'] = df['player_a_points'] - df['player_b_points']
    df['seed_diff'] = df['player_a_seed'] - df['player_b_seed']
    
    # 2. Height differences
    if 'player_a_height' in df.columns:
        df['height_diff'] = df['player_a_height'] - df['player_b_height']
        df['height_advantage'] = (df['height_diff'] > 0.05).astype(int)  # More than 5 cm advantage
    
    print("Adding rank-based features...")
    # 3. Ranking tiers and their interactions
    def rank_tier(rank):
        if pd.isna(rank): return 'Unranked'
        elif rank <= 5: return 'Top5'
        elif rank <= 10: return 'Top10'
        elif rank <= 20: return 'Top20' 
        elif rank <= 50: return 'Top50'
        elif rank <= 100: return 'Top100'
        else: return 'Below100'
    
    df['player_a_tier'] = df['player_a_rank'].apply(rank_tier)
    df['player_b_tier'] = df['player_b_rank'].apply(rank_tier)
    
    # Tier interactions
    df['both_top10'] = ((df['player_a_rank'] <= 10) & (df['player_b_rank'] <= 10)).astype(int)
    df['both_top50'] = ((df['player_a_rank'] <= 50) & (df['player_b_rank'] <= 50)).astype(int)
    df['rank_gap_large'] = (abs(df['rank_diff']) > 50).astype(int)
    
    print("Adding seeded features...")
    # 4. Seeding features
    df['player_a_seeded'] = (df['player_a_seed'] > 0).astype(int)
    df['player_b_seeded'] = (df['player_b_seed'] > 0).astype(int)
    df['both_seeded'] = (df['player_a_seeded'] & df['player_b_seeded']).astype(int)
    df['unseeded_vs_seeded'] = (df['player_a_seeded'] != df['player_b_seeded']).astype(int)
    
    print("Adding head-to-head features...")
    # 5. Head-to-head features
    df['h2h_rate_a'] = df['h2h_a_wins'] / (df['h2h_total'] + 1)  # +1 avoids division by zero
    df['h2h_rate_b'] = df['h2h_b_wins'] / (df['h2h_total'] + 1)
    df['h2h_advantage_a'] = (df['h2h_a_wins'] > df['h2h_b_wins']).astype(int)
    df['h2h_experienced'] = (df['h2h_total'] >= 3).astype(int)  # Have they met often
    
    print("Adding player form features...")
    # 6. Player form features
    df['form_diff'] = df['player_a_win_rate'] - df['player_b_win_rate']
    df['player_a_hot'] = (df['player_a_win_rate'] > 0.7).astype(int)  # Very good form
    df['player_b_hot'] = (df['player_b_win_rate'] > 0.7).astype(int)
    df['player_a_cold'] = (df['player_a_win_rate'] < 0.3).astype(int)  # Poor form
    df['player_b_cold'] = (df['player_b_win_rate'] < 0.3).astype(int)
    
    print("Adding hand features...")
    # 7. Handedness features
    df['player_a_lefty'] = (df['player_a_hand'] == 'L').astype(int)
    df['player_b_lefty'] = (df['player_b_hand'] == 'L').astype(int)
    df['both_lefty'] = (df['player_a_lefty'] & df['player_b_lefty']).astype(int)
    df['lefty_vs_righty'] = (df['player_a_lefty'] != df['player_b_lefty']).astype(int)
    df['lefty_advantage'] = (df['player_a_lefty'] & ~df['player_b_lefty']).astype(int)
    
    print("Adding match context features...")
    # 8. Context features (surface, tournament)
    surface_mapping = {'Hard': 1, 'Clay': 2, 'Grass': 3, 'Carpet': 4}
    df['surface_encoded'] = df['surface'].map(surface_mapping)
    
    level_mapping = {'G': 4, 'M': 3, 'A': 2, 'D': 1, 'F': 0}  # Grand Slam ranked highest
    df['level_encoded'] = df['tourney_level'].map(level_mapping)
    
    # Round importance
    round_importance = {
        'F': 7, 'SF': 6, 'QF': 5, 'R16': 4, 'R32': 3, 'R64': 2, 'R128': 1, 'RR': 2
    }
    df['round_importance'] = df['round'].map(round_importance).fillna(1)
    
    # Is this an important match?
    df['important_match'] = ((df['level_encoded'] >= 3) & (df['round_importance'] >= 5)).astype(int)
    
    print("Adding time-based features...")
    # 9. Time features
    df['is_recent'] = (df['year'] >= 2020).astype(int)  # Recent years
    df['era_modern'] = (df['year'] >= 2010).astype(int)
    
    # Start or end of the season?
    df['month'] = df['date'].dt.month
    df['early_season'] = (df['month'] <= 3).astype(int)
    df['late_season'] = (df['month'] >= 10).astype(int)
    
    print("Adding advanced interactions...")
    # 10. Advanced interactions
    df['rank_points_consistency'] = abs(df['rank_diff'] * df['points_diff'])  # Product of the ranking gap and the points gap
    df['experience_gap'] = (abs(df['player_a_age'] - df['player_b_age']) > 5).astype(int)  # Large age gap
    df['veteran_vs_young'] = ((df['player_a_age'] > 30) & (df['player_b_age'] < 25)).astype(int)
    df['young_vs_veteran'] = ((df['player_a_age'] < 25) & (df['player_b_age'] > 30)).astype(int)
    
    # Momentum (form * ranking); the ranking factor turns negative for ranks above 101
    df['player_a_momentum'] = df['player_a_win_rate'] * (101 - df['player_a_rank']) / 100
    df['player_b_momentum'] = df['player_b_win_rate'] * (101 - df['player_b_rank']) / 100
    df['momentum_diff'] = df['player_a_momentum'] - df['player_b_momentum']
    
    print(f"Added {len(df.columns) - len(enhanced_player_matches.columns)} new features.")
    
    return df

# Add the advanced features
final_enhanced_data = add_advanced_features(enhanced_player_matches)

print(f"\n FINAL DATA OVERVIEW:")
print(f" Final column count: {len(final_enhanced_data.columns)}")

# Group features by type
feature_groups = {
    'Basic differences': ['rank_diff', 'age_diff', 'points_diff', 'height_diff'],
    'Rankings and tiers': ['player_a_tier', 'both_top10', 'rank_gap_large'],
    'Head-to-head': ['h2h_rate_a', 'h2h_advantage_a', 'h2h_experienced'],
    'Player form': ['form_diff', 'player_a_hot', 'player_b_hot'],
    'Physical features': ['lefty_vs_righty', 'height_advantage'],
    'Match context': ['surface_encoded', 'level_encoded', 'important_match'],
    'Interactions': ['momentum_diff', 'veteran_vs_young', 'rank_points_consistency']
}

for group, features in feature_groups.items():
    available_features = [f for f in features if f in final_enhanced_data.columns]
    if available_features:
        print(f"\n{group}: {len(available_features)} features")
        for f in available_features[:3]:  # Show the first 3
            print(f"   • {f}")

# Check the quality of the new features
print(f"\n DATA QUALITY CHECK:")
if 'h2h_total' in final_enhanced_data.columns:
    h2h_available = (final_enhanced_data['h2h_total'] > 0).sum()
    print(f"Matches with h2h data: {h2h_available:,} ({h2h_available/len(final_enhanced_data)*100:.1f}%)")

if 'height_diff' in final_enhanced_data.columns:
    height_available = final_enhanced_data['height_diff'].notna().sum()
    print(f"Matches with height data: {height_available:,} ({height_available/len(final_enhanced_data)*100:.1f}%)")

form_available = (final_enhanced_data['player_a_win_rate'] > 0).sum()
print(f"Matches with form data: {form_available:,} ({form_available/len(final_enhanced_data)*100:.1f}%)")

print(f"\n DATA DISTRIBUTION BY YEAR:")
yearly_dist = final_enhanced_data['year'].value_counts().sort_index()
for year in [2000, 2005, 2010, 2015, 2020, 2024]:
    if year in yearly_dist.index:
        print(f"   {year}: {yearly_dist[year]:,} rows")

Adding advanced features...
Adding rank-based features...
Adding seeded features...
Adding head-to-head features...
Adding player form features...
Adding hand features...
Adding match context features...
Adding time-based features...
Adding advanced interactions...
Added 45 new features.
\n FINAL DATA OVERVIEW:
 Final column count: 78
\nBasic differences: 4 features
   • rank_diff
   • age_diff
   • points_diff
\nRankings and tiers: 3 features
   • player_a_tier
   • both_top10
   • rank_gap_large
\nHead-to-head: 3 features
   • h2h_rate_a
   • h2h_advantage_a
   • h2h_experienced
\nPlayer form: 3 features
   • form_diff
   • player_a_hot
   • player_b_hot
\nPhysical features: 2 features
   • lefty_vs_righty
   • height_advantage
\nMatch context: 3 features
   • surface_encoded
   • level_encoded
   • important_match
\nInteractions: 3 features
   • momentum_diff
   • veteran_vs_young
   • rank_points_consistency
\n DATA QUALITY CHECK:
Matches with h2h data: 66,210 (45.3%)
Matches with 
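
The `surface_mapping` in the cell above encodes the surface as 1 to 4, which gives a linear model an ordering (Hard < Clay < Grass < Carpet) that has no physical meaning. One alternative, sketched here on a made-up frame rather than on `final_enhanced_data`, is one-hot encoding with `pd.get_dummies`. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Made-up sample; the last row imitates a match with a missing surface.
demo = pd.DataFrame({"surface": ["Hard", "Clay", "Grass", "Hard", None]})

# One indicator column per surface; a missing surface becomes an all-zero row.
surface_dummies = pd.get_dummies(demo["surface"], prefix="surface", dtype=int)
demo = pd.concat([demo, surface_dummies], axis=1)
print(demo)
```

A side effect is that rows with a missing surface no longer produce NaN, so they would not be dropped by a later `dropna()`.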

In [None]:
# 1. Feature selection for modelling
modeling_features = []

# Add all features from the previously defined groups
for group, features in feature_groups.items():
    available_features = [f for f in features if f in final_enhanced_data.columns]
    modeling_features.extend(available_features)
    print(f"   {group}: {len(available_features)} features")

# Add further important features that were not in the groups
additional_features = [
    'h2h_total', 'player_a_win_rate', 'player_b_win_rate',
    'player_a_rank', 'player_b_rank', 'player_a_points', 'player_b_points',
    'round_importance'
]

print(f"   Additional features: ", end="")
added_count = 0
for feature in additional_features:
    if feature in final_enhanced_data.columns and feature not in modeling_features:
        modeling_features.append(feature)
        added_count += 1
print(f"{added_count}")

# Keep only numeric features (drop text categories)
numeric_only_features = [
    feature for feature in modeling_features
    if feature in final_enhanced_data.columns
    and pd.api.types.is_numeric_dtype(final_enhanced_data[feature])
]

available_numerical = numeric_only_features
print(f"Final numeric features count: {len(available_numerical)}")

# Check for missing data
missing_data = final_enhanced_data[available_numerical].isnull().sum()
features_with_missing = missing_data[missing_data > 0]

print(f"\nFeatures with missing data:")
if len(features_with_missing) > 0:
    for feature, missing_count in features_with_missing.items():
        missing_pct = (missing_count / len(final_enhanced_data)) * 100
        print(f"   {feature}: {missing_count:,} ({missing_pct:.1f}%)")
else:
    print("No features with missing data.")

# 2. Prepare the training dataset

# Select the modelling columns
model_data = final_enhanced_data[available_numerical + ['player_a_won', 'year', 'date']].copy()

# Drop rows with missing values (if any)
initial_size = len(model_data)
model_data = model_data.dropna()
final_size = len(model_data)

print(f"\nInitial dataset size: {initial_size:,}")
print(f"Final dataset size after dropping NAs: {final_size:,} ({(initial_size - final_size) / initial_size * 100:.1f}% dropped)")

# Time-based split - the last 2 years as the test set
train_data = model_data[model_data['year'] < 2023].copy()
test_data = model_data[model_data['year'] >= 2023].copy()

print(f"\nTrain data size: {len(train_data):,}")
print(f"Test data size: {len(test_data):,}")

X_train = train_data[available_numerical].copy()
y_train = train_data['player_a_won'].copy()
X_test = test_data[available_numerical].copy() 
y_test = test_data['player_a_won'].copy()


# 3. Feature importance analysis (quick)
print(f"\nAnalysing feature importance...")

# Correlations with the target
correlations = train_data[available_numerical + ['player_a_won']].corr()['player_a_won'].abs().sort_values(ascending=False)
correlations = correlations.drop('player_a_won')  # Drop the target's self-correlation

print(f"\nTop 10 features correlated with target:")
for i, (feature, corr) in enumerate(correlations.head(10).items(), 1):
    print(f"   {i:2d}. {feature}: {corr:.3f}")

   Basic differences: 4 features
   Rankings and tiers: 3 features
   Head-to-head: 3 features
   Player form: 3 features
   Physical features: 2 features
   Match context: 3 features
   Interactions: 3 features
   Additional features: 8
Final numeric features count: 28

Features with missing data:
   height_diff: 3,914 (2.7%)
   surface_encoded: 72 (0.0%)
   level_encoded: 122 (0.1%)

Initial dataset size: 146,280
Final dataset size after dropping NAs: 142,186 (2.8% dropped)

Train data size: 130,512
Test data size: 11,674

Analysing feature importance...

Top 10 features correlated with target:
    1. points_diff: 0.306
    2. form_diff: 0.290
    3. momentum_diff: 0.274
    4. rank_diff: 0.247
    5. player_a_points: 0.197
    6. player_b_points: 0.197
    7. player_b_win_rate: 0.193
    8. player_a_win_rate: 0.193
    9. player_b_rank: 0.155
   10. player_a_rank: 0.155
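
The cell above drops every row with a missing value, which removes about 2.8% of the data, mostly because of `height_diff`. An alternative is to fill the gaps with a statistic learned on the training period only, so that nothing from the test years leaks into the features. The sketch below shows this on made-up numbers; the helper name `impute_with_train_median` is hypothetical.

```python
import pandas as pd


def impute_with_train_median(train, test, columns):
    """Fill NaNs in `columns` using medians computed on the training frame only."""
    medians = train[columns].median()
    # fillna with a Series fills each column by its matching label.
    return train.fillna(medians), test.fillna(medians), medians


# Made-up values standing in for the real train/test frames.
demo_train = pd.DataFrame({"height_diff": [0.10, None, 0.30], "rank_diff": [5, -3, 12]})
demo_test = pd.DataFrame({"height_diff": [None, 0.05], "rank_diff": [1, -8]})

filled_train, filled_test, medians = impute_with_train_median(
    demo_train, demo_test, ["height_diff"]
)
print(filled_test)
```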


In [16]:
# TRAINING ML MODELS

from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.model_selection import cross_val_score
import time

# Feature standardisation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training on {len(X_train):,} samples with {X_train.shape[1]} features")
print(f"Testing on {len(X_test):,} samples from years 2023-2024")

# Models to evaluate
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000)
}

results = {}

# Train and test each model
for name, model in models.items():
    print(f"\n{'='*20} {name} {'='*20}")
    
    start_time = time.time()
    
    if name == 'Logistic Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    training_time = time.time() - start_time
    
    # Metrics
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Cross-validation on the training set
    if name == 'Logistic Regression':
        cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    else:
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    
    results[name] = {
        'model': model,
        'accuracy': accuracy,
        'auc': auc,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'training_time': training_time,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }
    
    print(f"Training time: {training_time:.2f} seconds")
    print(f"Test Accuracy: {accuracy:.4f}")
    print(f"Test AUC: {auc:.4f}")
    print(f"CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")

# Results summary
print(f"{'Model':<20} {'Accuracy':<10} {'AUC':<10} {'CV Score':<15} {'Time (s)':<10}")
print("-" * 70)

for name, result in results.items():
    print(f"{name:<20} {result['accuracy']:.4f}     {result['auc']:.4f}     {result['cv_mean']:.4f}±{result['cv_std']:.3f}    {result['training_time']:.2f}")

# Select the best model
best_model_name = max(results.keys(), key=lambda x: results[x]['auc'])
best_model = results[best_model_name]['model']

print(f"\n Best model: {best_model_name}")
print(f"   Test Accuracy: {results[best_model_name]['accuracy']:.4f}")
print(f"   Test AUC: {results[best_model_name]['auc']:.4f}")
print(f"   CV Score: {results[best_model_name]['cv_mean']:.4f}")

# Feature importance analysis for the best model
if hasattr(best_model, 'feature_importances_'):
    print(f"\n Top 10 features for {best_model_name}:")
    feature_importance = pd.DataFrame({
        'feature': available_numerical,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    for i, (_, row) in enumerate(feature_importance.head(10).iterrows(), 1):
        print(f"   {i:2d}. {row['feature']:<20} {row['importance']:.4f}")

print(f"\n Test accuracy of best model: {results[best_model_name]['accuracy']:.1%}")

Training on 130,512 samples with 28 features
Testing on 11,674 samples from years 2023-2024

Training time: 4.27 seconds
Test Accuracy: 0.6307
Test AUC: 0.6897
CV Accuracy: 0.6498 (+/- 0.0310)

Training time: 27.36 seconds
Test Accuracy: 0.6406
Test AUC: 0.7018
CV Accuracy: 0.6616 (+/- 0.0283)

Training time: 0.10 seconds
Test Accuracy: 0.6400
Test AUC: 0.7001
CV Accuracy: 0.6599 (+/- 0.0286)
Model                Accuracy   AUC        CV Score        Time (s)  
----------------------------------------------------------------------
Random Forest        0.6307     0.6897     0.6498±0.015    4.27
Gradient Boosting    0.6406     0.7018     0.6616±0.014    27.36
Logistic Regression  0.6400     0.7001     0.6599±0.014    0.10

 Best model: Gradient Boosting
   Test Accuracy: 0.6406
   Test AU
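
The cross-validated accuracy (about 0.65 to 0.66) sits above the 2023-2024 test accuracy (about 0.63 to 0.64) for all three models. One possible contributor, given that every match is stored as two mirrored rows, is that a plain 5-fold split can place one mirror in the training fold and the other in the validation fold. The sketch below shows how grouped folds keep both rows of a match together. It runs on synthetic data; `match_id`, `X_demo`, and `y_demo` are hypothetical names, not variables from this notebook, and the effect on the real features has not been measured here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the mirrored dataset: each match yields two rows,
# (A vs B, target) and (B vs A, 1 - target), with negated difference features.
n_matches = 500
base_features = rng.normal(size=(n_matches, 4))
base_target = (base_features[:, 0] + rng.normal(size=n_matches) > 0).astype(int)

X_demo = np.vstack([base_features, -base_features])
y_demo = np.concatenate([base_target, 1 - base_target])

# Hypothetical group key: both mirrored rows share the id of their match
match_id = np.concatenate([np.arange(n_matches), np.arange(n_matches)])

# GroupKFold never splits one group across the training and validation folds
group_cv = GroupKFold(n_splits=5)
grouped_scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    groups=match_id,
    cv=group_cv,
    scoring="accuracy",
)

print(f"Grouped CV accuracy: {grouped_scores.mean():.4f} (std {grouped_scores.std():.4f})")
```

In the notebook, such a group key would have to be created when the two rows per match are built. Because the test set is strictly later than the training data, a time-ordered split is another option worth comparing.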

In [17]:
# Detailed analysis of the best model
best_name = best_model_name
best_pred = results[best_name]['predictions']
best_proba = results[best_name]['probabilities']

print(f"\n Detailed analysis - {best_name}:")
print(f"   Accuracy: {results[best_name]['accuracy']:.4f} ({results[best_name]['accuracy']:.1%})")
print(f"   AUC: {results[best_name]['auc']:.4f} ({results[best_name]['auc']:.1%})")
print(f"   Baseline (random): 50.0%")
print(f"   Improvement: +{(results[best_name]['accuracy']-0.5)*100:.1f} percentage points")

# Classification report
print(f"\n Classification report:")
print(classification_report(y_test, best_pred, target_names=['Player B wins', 'Player A wins']))

# Prediction accuracy by model confidence
print(f"\n Confidence analysis:")
confidence_levels = [0.6, 0.7, 0.8, 0.9]

for confidence in confidence_levels:
    high_conf_mask = (best_proba >= confidence) | (best_proba <= 1-confidence)
    if high_conf_mask.sum() > 0:
        high_conf_acc = accuracy_score(y_test[high_conf_mask], best_pred[high_conf_mask])
        percentage = high_conf_mask.sum() / len(y_test) * 100
        print(f"   Confidence ≥ {confidence:.0%}: {high_conf_acc:.3f} accuracy ({percentage:.1f}% of predictions)")

# Function for predicting a new match
def predict_match(player_a_rank, player_b_rank,
                  player_a_points, player_b_points,
                  player_a_age, player_b_age,
                  player_a_win_rate=0.5, player_b_win_rate=0.5,
                  h2h_a_wins=0, h2h_total=0,
                  surface='Hard', tourney_level='A',
                  height_diff=0, model=best_model, model_name=best_name):
    """Estimate the probability that Player A beats Player B.

    Features that cannot be derived from the arguments (handedness, round)
    fall back to the neutral defaults set below.
    """
    # Encode surface and tournament level
    surface_map = {'Hard': 1, 'Clay': 2, 'Grass': 3, 'Carpet': 4}
    level_map = {'G': 4, 'M': 3, 'A': 2, 'D': 1, 'F': 0}
    
    # Build the features the same way as in training
    features = {
        'rank_diff': player_a_rank - player_b_rank,
        'age_diff': player_a_age - player_b_age,
        'points_diff': player_a_points - player_b_points,
        'height_diff': height_diff,
        'both_top10': int(player_a_rank <= 10 and player_b_rank <= 10),
        'rank_gap_large': int(abs(player_a_rank - player_b_rank) > 50),
        'h2h_rate_a': h2h_a_wins / (h2h_total + 1),
        'h2h_advantage_a': int(h2h_a_wins > (h2h_total - h2h_a_wins)),
        'h2h_experienced': int(h2h_total >= 3),
        'form_diff': player_a_win_rate - player_b_win_rate,
        'player_a_hot': int(player_a_win_rate > 0.7),
        'player_b_hot': int(player_b_win_rate > 0.7),
        'lefty_vs_righty': 0,  # Default: no handedness information
        'height_advantage': int(height_diff > 0.05),
        'surface_encoded': surface_map.get(surface, 1),
        'level_encoded': level_map.get(tourney_level, 2),
        'important_match': int(level_map.get(tourney_level, 2) >= 3),
        'momentum_diff': player_a_win_rate * (101 - player_a_rank) / 100 - player_b_win_rate * (101 - player_b_rank) / 100,
        'veteran_vs_young': int(player_a_age > 30 and player_b_age < 25),
        'rank_points_consistency': abs((player_a_rank - player_b_rank) * (player_a_points - player_b_points)),
        'h2h_total': h2h_total,
        'player_a_win_rate': player_a_win_rate,
        'player_b_win_rate': player_b_win_rate,
        'player_a_rank': player_a_rank,
        'player_b_rank': player_b_rank,
        'player_a_points': player_a_points,
        'player_b_points': player_b_points,
        'round_importance': 3  # Default: medium round importance
    }
    
    # Build the feature vector in the training column order
    # (features missing from the dict default to 0)
    feature_vector = [features.get(feat, 0) for feat in available_numerical]
    feature_vector = np.array(feature_vector).reshape(1, -1)
    
    # Prediction (only Logistic Regression was trained on scaled features)
    if model_name == 'Logistic Regression':
        feature_vector = scaler.transform(feature_vector)
    
    probability = model.predict_proba(feature_vector)[0, 1]
    prediction = model.predict(feature_vector)[0]
    
    return {
        'player_a_win_probability': probability,
        'player_b_win_probability': 1 - probability,
        'predicted_winner': 'Player A' if prediction == 1 else 'Player B',
        'confidence': max(probability, 1-probability)
    }


# Example prediction
print(f"\n Example match prediction:")
example_result = predict_match(
    player_a_rank=10, player_b_rank=25,
    player_a_points=3500, player_b_points=1800,
    player_a_age=28, player_b_age=24,
    player_a_win_rate=0.75, player_b_win_rate=0.62,
    surface='Hard', tourney_level='M'
)

print(f"   Player A win probability: {example_result['player_a_win_probability']:.2%}")
print(f"   Player B win probability: {example_result['player_b_win_probability']:.2%}")
print(f"   Predicted winner: {example_result['predicted_winner']}")
print(f"   Confidence: {example_result['confidence']:.2%}")


 Detailed analysis - Gradient Boosting:
   Accuracy: 0.6406 (64.1%)
   AUC: 0.7018 (70.2%)
   Baseline (random): 50.0%
   Improvement: +14.1 percentage points

 Classification report:
               precision    recall  f1-score   support

Player B wins       0.64      0.64      0.64      5837
Player A wins       0.64      0.64      0.64      5837

     accuracy                           0.64     11674
    macro avg       0.64      0.64      0.64     11674
 weighted avg       0.64      0.64      0.64     11674


 Confidence analysis:
   Confidence ≥ 60%: 0.696 accuracy (66.2% of predictions)
   Confidence ≥ 70%: 0.759 accuracy (38.5% of predictions)
   Confidence ≥ 80%: 0.842 accuracy (16.4% of predictions)
   Confidence ≥ 90%: 0.962 accuracy (2.5% of predictions)

 Example match prediction:
   Player A win probability: 63.64%
   Player B win probability: 36.36%
   Predicted winner: Player A
   Confidence: 63.64%
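
Because the training data contains every match in both orders, a natural sanity check is whether P(A beats B) and P(B beats A) add up to 1. Tree-based models such as Gradient Boosting are not guaranteed to satisfy this exactly. The sketch below shows one option: averaging the direct prediction with the complement of the swapped prediction. It uses a toy model on synthetic difference features; `symmetric_probability`, `toy_model`, `X_toy`, and `y_toy` are hypothetical names and not part of this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy mirrored dataset of difference features (hypothetical stand-in data)
n = 400
diff = rng.normal(size=(n, 3))
won = (diff[:, 0] + rng.normal(size=n) > 0).astype(int)
X_toy = np.vstack([diff, -diff])
y_toy = np.concatenate([won, 1 - won])

toy_model = GradientBoostingClassifier(random_state=42).fit(X_toy, y_toy)


def symmetric_probability(model, features_ab, features_ba):
    """Return P(A wins) from both orderings and their symmetrized average.

    features_ab: feature vector with A listed first
    features_ba: feature vector for the same match with B listed first
    """
    p_ab = model.predict_proba(features_ab.reshape(1, -1))[0, 1]
    p_ba = model.predict_proba(features_ba.reshape(1, -1))[0, 1]
    # Average the direct estimate with the complement of the swapped one
    p_sym = 0.5 * (p_ab + (1.0 - p_ba))
    return p_ab, p_ba, p_sym


x = np.array([0.8, -0.2, 0.1])
p_ab, p_ba, p_sym = symmetric_probability(toy_model, x, -x)

print(f"P(A wins | A first):   {p_ab:.4f}")
print(f"P(B wins | B first):   {p_ba:.4f}")
print(f"Sum of both orderings: {p_ab + p_ba:.4f}")
print(f"Symmetrized P(A wins): {p_sym:.4f}")
```

With the notebook's own function, the same check would mean calling `predict_match` twice with the player arguments swapped (and `h2h_a_wins` replaced by B's head-to-head wins), since its features are not all simple differences.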
