# Premier League Data Integrity Checks
## Using Deepchecks for Data Validation

This notebook runs a comprehensive set of integrity checks on the Premier League data:
- Data quality checks
- Anomaly detection
- Cross-file consistency validation
- Detailed report generation

Cell 2: Imports


In [1]:
# Imports
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Deepchecks
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity
from deepchecks.tabular.checks import (
    MixedNulls,
    StringMismatch,
    MixedDataTypes,
    IsSingleValue,
    SpecialCharacters,
    StringLengthOutOfBounds,
    ConflictingLabels,
    OutlierSampleDetection,
    FeatureLabelCorrelation,
    DataDuplicates,
    CategoryMismatchTrainTest
)

print("✓ Imports loaded successfully")

✓ Imports loaded successfully


Cell 3: Configuration and Data Loading


In [2]:
# Configuration
DATA_DIR = Path('../../data/raw')
SEASONS = ['2015-2016', '2016-2017', '2017-2018', '2018-2019', '2019-2020',
           '2020-2021', '2021-2022', '2022-2023', '2023-2024']

print(f"Data directory: {DATA_DIR}")
print(f"Seasons to check: {len(SEASONS)}")

Data directory: ..\..\data\raw
Seasons to check: 9


Cell 4: Data Loading Function


In [7]:
def load_season_data(season):
    """Load all data files for one season."""
    season_path = DATA_DIR / season

    results = pd.read_csv(season_path / 'results.csv')
    standings = pd.read_csv(season_path / 'standings.csv')

    # Match stats may not exist for every season
    stats_path = season_path / 'match_stats.csv'
    match_stats = pd.read_csv(stats_path) if stats_path.exists() else None

    return {
        'results': results,
        'standings': standings,
        'match_stats': match_stats,
        'season': season
    }

# Load all seasons
all_data = {season: load_season_data(season) for season in SEASONS}

print(f"\n✓ Loaded data for {len(all_data)} seasons")
for season, data in all_data.items():
    print(f"  {season}: {len(data['results'])} matches, {len(data['standings'])} standing records")


✓ Loaded data for 9 seasons
  2015-2016: 380 matches, 760 standing records
  2016-2017: 380 matches, 760 standing records
  2017-2018: 380 matches, 760 standing records
  2018-2019: 380 matches, 760 standing records
  2019-2020: 380 matches, 760 standing records
  2020-2021: 380 matches, 756 standing records
  2021-2022: 380 matches, 760 standing records
  2022-2023: 380 matches, 760 standing records
  2023-2024: 380 matches, 760 standing records


Cell 5: Basic Checks - Match Results


In [8]:
# Combine all results for a global analysis
all_results = pd.concat([data['results'] for data in all_data.values()], ignore_index=True)
print(f"Total matches: {len(all_results)}")
print(f"\nColumns: {list(all_results.columns)}")
print(f"\nData types:\n{all_results.dtypes}")

Total matches: 3420

Columns: ['match_id', 'gameweek', 'kickoff', 'home_team', 'away_team', 'home_goals', 'away_goals', 'result']

Data types:
match_id      float64
gameweek      float64
kickoff        object
home_team      object
away_team      object
home_goals    float64
away_goals    float64
result         object
dtype: object
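
The dtypes above show `match_id`, `gameweek`, `home_goals` and `away_goals` stored as `float64`, although all four are identifiers or counts. One possible cause is missing values somewhere in the concatenated files, but the output does not confirm that. The following is a minimal sketch of a cast to pandas' nullable `Int64` dtype; the helper name `to_nullable_int` and the sample frame are illustrative and not part of the notebook.

```python
import pandas as pd

def to_nullable_int(df, columns):
    """Return a copy of df with the given float columns cast to nullable Int64."""
    out = df.copy()
    for col in columns:
        non_null = out[col].dropna()
        # Refuse to cast if any non-missing value has a fractional part.
        if not (non_null == non_null.round()).all():
            raise ValueError(f"Column {col!r} contains non-integer values")
        out[col] = out[col].astype("Int64")
    return out

# Illustrative data only: the missing gameweek forces float64 on load.
sample = pd.DataFrame({
    "match_id": [1.0, 2.0, 3.0],
    "gameweek": [1.0, 1.0, None],
    "home_goals": [2.0, 0.0, 1.0],
})
converted = to_nullable_int(sample, ["match_id", "gameweek", "home_goals"])
print(converted.dtypes)
```

Casting on a copy keeps the raw frame available for the Deepchecks suite, and the fractional-value guard turns a silent truncation into an explicit error.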


Cell 6: Creating a Deepchecks Dataset for Results


In [9]:
# Create a Deepchecks Dataset
results_dataset = Dataset(all_results, label=None, cat_features=['home_team', 'away_team', 'result'])

# Full data integrity suite
print("Running Data Integrity Suite on Results...")
integrity_suite = data_integrity()
results_integrity = integrity_suite.run(results_dataset)
results_integrity

Running Data Integrity Suite on Results...


Accordion(children=(VBox(children=(HTML(value='\n<h1 id="summary_71W8HUKGUUJB4AB4MLI1RI5ND">Data Integrity Sui…

Cell 7: Mixed Nulls Check


In [10]:
# Check for mixed null values
print("\n=== Mixed Nulls Check ===")
mixed_nulls = MixedNulls()
result = mixed_nulls.run(results_dataset)
result


=== Mixed Nulls Check ===


VBox(children=(HTML(value='<h4><b>Mixed Nulls</b></h4>'), HTML(value='<p>Search for various types of null valu…

Cell 8: Duplicates Check


In [11]:
# Check for duplicates
print("\n=== Duplicates Check ===")
duplicates = DataDuplicates()
result = duplicates.run(results_dataset)
result


=== Duplicates Check ===


VBox(children=(HTML(value='<h4><b>Data Duplicates</b></h4>'), HTML(value='<p>Checks for duplicate samples in t…
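
Assuming the `DataDuplicates` check above compares complete rows in its default configuration, two records that share a `match_id` but disagree on the score would not be reported. A key-based check with plain pandas can complement it. This is a sketch only; `find_duplicate_keys` and the sample rows are hypothetical.

```python
import pandas as pd

def find_duplicate_keys(df, key_columns):
    """Return every row whose key_columns combination occurs more than once."""
    mask = df.duplicated(subset=key_columns, keep=False)
    return df[mask].sort_values(key_columns)

# Illustrative data: match 102 appears twice with conflicting scores.
sample = pd.DataFrame({
    "match_id": [101, 102, 102, 103],
    "home_team": ["Arsenal", "Chelsea", "Chelsea", "Everton"],
    "away_team": ["Liverpool", "Everton", "Everton", "Arsenal"],
    "home_goals": [1, 2, 3, 0],
})
dupes = find_duplicate_keys(sample, ["match_id"])
print(dupes)
```

Using `keep=False` returns all conflicting rows rather than only the later ones, which makes it easier to see which copy is wrong.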

Cell 9: Special Characters Check


In [12]:
# Check for special characters in team names
print("\n=== Special Characters Check ===")
special_chars = SpecialCharacters()
result = special_chars.run(results_dataset)
result


=== Special Characters Check ===


VBox(children=(HTML(value='<h4><b>Special Characters</b></h4>'), HTML(value='<p>Search in column[s] for values…

Cell 10: Outlier Detection


In [13]:
# Check for outliers in scores
print("\n=== Outlier Detection ===")
outliers = OutlierSampleDetection()
result = outliers.run(results_dataset)
result


=== Outlier Detection ===


VBox(children=(HTML(value='<h4><b>Outlier Sample Detection</b></h4>'), HTML(value='<p>Detects outliers in a da…

Cell 11: Match Logic Validation Function


In [14]:
def check_match_logic(df):
    """Logical checks on match records."""
    issues = []

    # 1. Check that home_team != away_team
    same_teams = df[df['home_team'] == df['away_team']]
    if len(same_teams) > 0:
        issues.append(f"❌ {len(same_teams)} matches with same home and away team")
    else:
        issues.append("✓ No matches with identical teams")

    # 2. Check that scores are non-negative
    negative_scores = df[(df['home_goals'] < 0) | (df['away_goals'] < 0)]
    if len(negative_scores) > 0:
        issues.append(f"❌ {len(negative_scores)} matches with negative scores")
    else:
        issues.append("✓ All scores are non-negative")

    # 3. Check that the result code is consistent with the score
    def check_result(row):
        if row['home_goals'] > row['away_goals']:
            return row['result'] == 'H'
        elif row['home_goals'] < row['away_goals']:
            return row['result'] == 'A'
        else:
            return row['result'] == 'D'

    inconsistent = df[~df.apply(check_result, axis=1)]
    if len(inconsistent) > 0:
        issues.append(f"❌ {len(inconsistent)} matches with inconsistent results")
    else:
        issues.append("✓ All results are consistent with scores")

    # 4. Check gameweeks
    invalid_gw = df[(df['gameweek'] < 1) | (df['gameweek'] > 38)]
    if len(invalid_gw) > 0:
        issues.append(f"❌ {len(invalid_gw)} matches with invalid gameweek")
    else:
        issues.append("✓ All gameweeks are valid (1-38)")

    # 5. Check for implausible scores (> 10 goals)
    high_scores = df[(df['home_goals'] > 10) | (df['away_goals'] > 10)]
    if len(high_scores) > 0:
        issues.append(f"⚠️  {len(high_scores)} matches with unusually high scores (>10)")
        print("\nHigh scoring matches:")
        print(high_scores[['home_team', 'away_team', 'home_goals', 'away_goals']])
    else:
        issues.append("✓ No unusually high scores")

    return issues

print("\n=== Match Logic Validation ===")
logic_issues = check_match_logic(all_results)
for issue in logic_issues:
    print(issue)


=== Match Logic Validation ===
✓ No matches with identical teams
✓ All scores are non-negative
✓ All results are consistent with scores
✓ All gameweeks are valid (1-38)
✓ No unusually high scores
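
The result-consistency step in `check_match_logic` uses a row-wise `apply`. The same rule can be expressed in vectorized form, which tends to scale better on larger tables. The sketch below uses `numpy.select`; the helper name `expected_result` and the sample data are illustrative.

```python
import numpy as np
import pandas as pd

def expected_result(df):
    """Derive 'H', 'A' or 'D' from the goal columns without a row-wise apply."""
    diff = df["home_goals"] - df["away_goals"]
    codes = np.select([diff > 0, diff < 0], ["H", "A"], default="D")
    return pd.Series(codes, index=df.index)

# Illustrative data: the last row is deliberately mislabeled.
sample = pd.DataFrame({
    "home_goals": [2, 0, 1, 3],
    "away_goals": [1, 2, 1, 0],
    "result": ["H", "A", "D", "A"],
})
inconsistent = sample[sample["result"] != expected_result(sample)]
print(inconsistent)
```

Note that a row with a missing goal value falls through to the `'D'` default in this sketch, just as it reaches the `else` branch in the row-wise version, so missing scores are best checked separately.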


Cell 12: Standings Checks


In [15]:
# Combine all standings
all_standings = pd.concat([data['standings'] for data in all_data.values()], ignore_index=True)
print(f"Total standing records: {len(all_standings)}")
print(f"\nColumns: {list(all_standings.columns)}")

Total standing records: 6836

Columns: ['team', 'season', 'gameweek', 'played', 'won', 'drawn', 'lost', 'goals_for', 'goals_against', 'goal_difference', 'points', 'position']


Cell 13: Integrity Suite for Standings


In [16]:
# Create a Deepchecks Dataset for the standings
standings_dataset = Dataset(all_standings, label=None, cat_features=['team', 'season'])

print("Running Data Integrity Suite on Standings...")
standings_integrity = data_integrity()
standings_result = standings_integrity.run(standings_dataset)
standings_result

Running Data Integrity Suite on Standings...


Accordion(children=(VBox(children=(HTML(value='\n<h1 id="summary_LWI6RFE1GBQL8EN5YSZV9FQ3Z">Data Integrity Sui…

Cell 14: Standings Logic Validation Function


In [17]:
def check_standings_logic(df):
    """Logical checks on the standings."""
    issues = []

    # Derived values are kept in local Series so the caller's DataFrame
    # is not modified as a side effect.

    # 1. Check that points = won*3 + drawn*1
    calculated_points = df['won'] * 3 + df['drawn']
    incorrect_points = df[df['points'] != calculated_points]
    if len(incorrect_points) > 0:
        issues.append(f"❌ {len(incorrect_points)} records with incorrect points calculation")
    else:
        issues.append("✓ All points calculations are correct")

    # 2. Check that played = won + drawn + lost
    calculated_played = df['won'] + df['drawn'] + df['lost']
    incorrect_played = df[df['played'] != calculated_played]
    if len(incorrect_played) > 0:
        issues.append(f"❌ {len(incorrect_played)} records with incorrect played count")
    else:
        issues.append("✓ All played counts are correct")

    # 3. Check goal_difference
    calculated_gd = df['goals_for'] - df['goals_against']
    incorrect_gd = df[df['goal_difference'] != calculated_gd]
    if len(incorrect_gd) > 0:
        issues.append(f"❌ {len(incorrect_gd)} records with incorrect goal difference")
    else:
        issues.append("✓ All goal differences are correct")

    # 4. Check positions per gameweek
    for season in df['season'].unique():
        for gw in df[df['season'] == season]['gameweek'].unique():
            gw_data = df[(df['season'] == season) & (df['gameweek'] == gw)]
            positions = sorted(gw_data['position'].unique())
            expected = list(range(1, len(positions) + 1))
            if positions != expected:
                issues.append(f"❌ {season} GW{gw}: Invalid positions {positions}")
                # Report only the first invalid gameweek of each season
                break

    if not any('Invalid positions' in issue for issue in issues):
        issues.append("✓ All positions are sequential and valid")

    # 5. Check that there are 20 teams per gameweek
    team_counts = df.groupby(['season', 'gameweek']).size()
    invalid_counts = team_counts[team_counts != 20]
    if len(invalid_counts) > 0:
        issues.append(f"❌ {len(invalid_counts)} gameweeks without exactly 20 teams")
    else:
        issues.append("✓ All gameweeks have exactly 20 teams")

    return issues

print("\n=== Standings Logic Validation ===")
standings_issues = check_standings_logic(all_standings)
for issue in standings_issues:
    print(issue)


=== Standings Logic Validation ===
✓ All points calculations are correct
✓ All played counts are correct
✓ All goal differences are correct
✓ All positions are sequential and valid
❌ 1 gameweeks without exactly 20 teams
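
The last check reports how many gameweeks are incomplete but not which ones. Since the load output shows 756 standing records for 2020-2021 instead of 760, the affected gameweek is most likely in that season. A small helper can list the exact groups. This is a sketch on toy data; `incomplete_gameweeks` is a hypothetical name, and the toy league has three teams rather than twenty.

```python
import pandas as pd

def incomplete_gameweeks(standings, expected_teams=20):
    """Return the (season, gameweek) groups whose row count differs from expected_teams."""
    counts = standings.groupby(["season", "gameweek"]).size()
    return counts[counts != expected_teams]

# Toy data: two gameweeks of a three-team league, with one row removed.
rows = [
    {"season": "2020-2021", "gameweek": gw, "team": team}
    for gw in (1, 2)
    for team in ("A", "B", "C")
]
sample = pd.DataFrame(rows).iloc[:-1]  # drop the last row so gameweek 2 is short

short = incomplete_gameweeks(sample, expected_teams=3)
print(short)
```

On the real data the call would be `incomplete_gameweeks(all_standings)`, after which the missing teams can be found by comparing the group's `team` values with the season's full team list.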


Cell 15: Match Statistics Checks


In [18]:
# Combine all stats (if available)
all_stats_list = [data['match_stats'] for data in all_data.values() if data['match_stats'] is not None]

if all_stats_list:
    all_stats = pd.concat(all_stats_list, ignore_index=True)
    print(f"Total match stats records: {len(all_stats)}")
    print(f"\nColumns: {len(all_stats.columns)} columns")
    print(f"\nFirst few columns: {list(all_stats.columns[:10])}")
    
    # Show the available statistics
    stat_columns = [col for col in all_stats.columns if col.startswith('home_') or col.startswith('away_')]
    unique_stats = set([col.replace('home_', '').replace('away_', '') for col in stat_columns])
    print(f"\nUnique statistics tracked: {len(unique_stats)}")
    print(f"Examples: {list(unique_stats)[:10]}")
else:
    print("⚠️  No match statistics available")

Total match stats records: 3420

Columns: 493 columns

First few columns: ['match_id', 'home_team', 'away_team', 'home_accurate_layoffs', 'home_accurate_keeper_sweeper', 'home_total_yel_card', 'home_aerial_lost', 'home_goals', 'home_attempts_conceded_ibox', 'home_offtarget_att_assist']

Unique statistics tracked: 246
Examples: ['att_obox_own_goal', 'total_keeper_sweeper', 'total_high_claim', 'att_fastbreak', 'total_fastbreak', 'big_chance_missed', 'losses', 'att_hd_total', 'big_chance_created', 'att_sv_high_right']


Cell 16: Match Statistics Quality Analysis


In [19]:
if all_stats_list:
    # Quality analysis of the stats
    print("\n=== Match Stats Quality Check ===")

    # 1. Percentage of missing values per column
    missing_pct = (all_stats.isnull().sum() / len(all_stats)) * 100
    high_missing = missing_pct[missing_pct > 50]

    if len(high_missing) > 0:
        print(f"⚠️  {len(high_missing)} columns with >50% missing values:")
        print(high_missing.head(10))
    else:
        print("✓ No columns with excessive missing values")

    # 2. Summary of the loaded statistics
    print("\n✓ Match stats data loaded successfully")
    print(f"  Matches with stats: {len(all_stats)}")
    print(f"  Missing values: {all_stats.isnull().sum().sum()}")
else:
    print("Skipping match stats checks - no data available")


=== Match Stats Quality Check ===
⚠️  206 columns with >50% missing values:
home_accurate_keeper_sweeper    66.549708
home_att_freekick_miss          85.935673
home_att_sv_low_left            51.315789
home_first_half_goals           50.643275
home_six_yard_block             69.444444
home_wins                       55.087719
home_error_lead_to_shot         76.374269
home_total_keeper_sweeper       65.029240
home_clean_sheet                68.479532
home_own_goal_accrued           94.005848
dtype: float64

✓ Match stats data loaded successfully
  Matches with stats: 3420
  Missing values: 671467
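
With 206 of 493 columns more than half empty, it can help to separate sparse columns from dense ones before any further analysis. Whether a sparse column should be dropped or filled depends on what the missing value means, which the output above does not say. The sketch below only performs the split; `split_by_missing` and the sample frame are illustrative.

```python
import pandas as pd

def split_by_missing(df, threshold=50.0):
    """Split column names into (sparse, dense) by percentage of missing values."""
    missing_pct = df.isnull().mean() * 100
    sparse = missing_pct[missing_pct > threshold].index.tolist()
    dense = missing_pct[missing_pct <= threshold].index.tolist()
    return sparse, dense

# Illustrative data: one column is 75% missing, another exactly 50%.
sample = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "home_goals": [1, 0, 2, 3],
    "home_own_goal_accrued": [None, None, None, 1],
    "home_clean_sheet": [1, None, None, 1],
})
sparse_cols, dense_cols = split_by_missing(sample)
print("sparse:", sparse_cols)
print("dense:", dense_cols)
```

The comparison is strict, matching the `> 50` filter in the cell above, so a column that is exactly 50% missing stays on the dense side.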


Cell 17: Cross-File Consistency Check Function


In [20]:
def check_cross_file_consistency(season_data):
    """Check consistency between results, standings and match_stats."""
    issues = []
    season = season_data['season']
    results = season_data['results']
    standings = season_data['standings']
    match_stats = season_data['match_stats']

    print(f"\n{'='*60}")
    print(f"Season: {season}")
    print(f"{'='*60}")

    # 1. Check that every team in results also appears in standings
    teams_in_results = set(results['home_team'].unique()) | set(results['away_team'].unique())
    teams_in_standings = set(standings['team'].unique())

    missing_from_standings = teams_in_results - teams_in_standings
    if missing_from_standings:
        issues.append(f"❌ Teams in results but not in standings: {missing_from_standings}")
    else:
        issues.append("✓ All teams from results appear in standings")

    # 2. Check the number of matches per gameweek
    matches_per_gw = results.groupby('gameweek').size()
    invalid_gw_counts = matches_per_gw[matches_per_gw != 10]
    if len(invalid_gw_counts) > 0:
        issues.append(f"⚠️  {len(invalid_gw_counts)} gameweeks without exactly 10 matches")
        print(f"  Gameweeks with issues: {invalid_gw_counts.to_dict()}")
    else:
        issues.append("✓ All gameweeks have 10 matches")

    # 3. Check consistency between results and standings for the last gameweek
    max_gw = results['gameweek'].max()
    final_standings = standings[standings['gameweek'] == max_gw]

    # Recompute points from the results
    team_points = {}
    for team in teams_in_results:
        home_wins = len(results[(results['home_team'] == team) & (results['result'] == 'H')])
        away_wins = len(results[(results['away_team'] == team) & (results['result'] == 'A')])
        home_draws = len(results[(results['home_team'] == team) & (results['result'] == 'D')])
        away_draws = len(results[(results['away_team'] == team) & (results['result'] == 'D')])

        team_points[team] = (home_wins + away_wins) * 3 + (home_draws + away_draws)

    # Compare with the standings
    points_mismatch = 0
    for team, calculated_points in team_points.items():
        standing_points = final_standings[final_standings['team'] == team]['points'].values
        if len(standing_points) > 0 and standing_points[0] != calculated_points:
            points_mismatch += 1
            print(f"  ⚠️  {team}: Results={calculated_points} pts, Standings={standing_points[0]} pts")

    if points_mismatch == 0:
        issues.append("✓ Final standings points match results calculations")
    else:
        issues.append(f"❌ {points_mismatch} teams have mismatched points")

    # 4. Check match_stats if available
    if match_stats is not None:
        results_match_ids = set(results['match_id'].values)
        stats_match_ids = set(match_stats['match_id'].values)

        missing_stats = results_match_ids - stats_match_ids
        if missing_stats:
            issues.append(f"⚠️  {len(missing_stats)} matches without statistics")
        else:
            issues.append("✓ All matches have statistics")

    # Print the findings
    for issue in issues:
        print(issue)

    return issues

# Check every season
all_cross_file_issues = {}
for season, data in all_data.items():
    all_cross_file_issues[season] = check_cross_file_consistency(data)


Season: 2015-2016
  Gameweeks with issues: {27.0: 8, 30.0: 5, 33.0: 11, 34.0: 15, 35.0: 7, 37.0: 14}
✓ All teams from results appear in standings
⚠️  6 gameweeks without exactly 10 matches
✓ Final standings points match results calculations
✓ All matches have statistics

Season: 2016-2017
  Gameweeks with issues: {26.0: 8, 27.0: 11, 28.0: 4, 34.0: 11, 36.0: 11, 37.0: 15}
✓ All teams from results appear in standings
⚠️  6 gameweeks without exactly 10 matches
✓ Final standings points match results calculations
✓ All matches have statistics

Season: 2017-2018
  Gameweeks with issues: {21.0: 9, 22.0: 11, 31.0: 4, 34.0: 14, 35.0: 6, 37.0: 16}
✓ All teams from results appear in standings
⚠️  6 gameweeks without exactly 10 matches
✓ Final standings points match results calculations
✓ All matches have statistics

Season: 2018-2019
  Gameweeks with issues: {25.0: 11, 27.0: 8, 31.0: 5, 32.0: 15, 33.0: 6, 34.0: 11, 35.0: 14}
✓ All teams from results appear in stan
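
Step 3 of `check_cross_file_consistency` recomputes points with four filtered `len(...)` calls per team. The same totals can be derived in one pass by mapping each result code to home and away points and summing per team. This is a sketch under the assumption that `result` only contains `'H'`, `'D'` and `'A'`; the helper name `points_from_results` and the three-match sample are illustrative.

```python
import pandas as pd

def points_from_results(results):
    """Recompute each team's points total from H/D/A match results."""
    home_pts = results["result"].map({"H": 3, "D": 1, "A": 0})
    away_pts = results["result"].map({"H": 0, "D": 1, "A": 3})
    home = home_pts.groupby(results["home_team"]).sum()
    away = away_pts.groupby(results["away_team"]).sum()
    # fill_value=0 covers teams that appear only at home or only away
    return home.add(away, fill_value=0).astype(int)

# Illustrative data: Arsenal win at home and away, Chelsea and Everton draw.
sample = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea", "Everton"],
    "away_team": ["Chelsea", "Everton", "Arsenal"],
    "result": ["H", "D", "A"],
})
points = points_from_results(sample)
print(points.to_dict())
```

The resulting Series is indexed by team name, so it can be compared against the final standings with a single aligned subtraction instead of a loop.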

Cell 18: Summary Report Generation Function


In [21]:
def generate_summary_report():
    """Generate a complete summary report."""
    print("\n" + "="*70)
    print("DATA INTEGRITY SUMMARY REPORT")
    print("="*70)

    # 1. Data overview
    print("\n📊 DATA OVERVIEW")
    print("-" * 70)
    print(f"Seasons analyzed: {len(SEASONS)}")
    print(f"Total matches: {len(all_results)}")
    print(f"Total standing records: {len(all_standings)}")
    if all_stats_list:
        print(f"Total match stats: {len(all_stats)}")

    # 2. Per-season statistics
    print("\n📅 PER-SEASON STATISTICS")
    print("-" * 70)
    for season, data in all_data.items():
        results = data['results']
        standings = data['standings']
        stats = data['match_stats']

        print(f"\n{season}:")
        print(f"  Matches: {len(results)}")
        print(f"  Teams: {len(results['home_team'].unique())}")
        print(f"  Gameweeks: {results['gameweek'].max()}")
        print(f"  Standing records: {len(standings)}")
        if stats is not None:
            print(f"  Match stats: {len(stats)} matches")

    # 3. Quality summary
    print("\n✅ QUALITY SUMMARY")
    print("-" * 70)

    total_issues = 0
    total_warnings = 0

    # Count critical issues and warnings from the cross-file checks
    for season, issues in all_cross_file_issues.items():
        for issue in issues:
            if '❌' in issue:
                total_issues += 1
            elif '⚠️' in issue:
                total_warnings += 1

    if total_issues == 0:
        print("✓ No critical data integrity issues found")
    else:
        print(f"❌ {total_issues} critical issues found")

    if total_warnings == 0:
        print("✓ No warnings")
    else:
        print(f"⚠️  {total_warnings} warnings found")

    # 4. Recommendations
    print("\n💡 RECOMMENDATIONS")
    print("-" * 70)

    if total_issues > 0:
        print("1. Review and fix critical issues before using data for modeling")

    if total_warnings > 0:
        print("2. Investigate warnings to ensure data quality")

    print("3. Run this notebook regularly after data updates")
    print("4. Consider adding automated validation in the ingestion pipeline")

    print("\n" + "="*70)
    print("END OF REPORT")
    print("="*70)

generate_summary_report()


DATA INTEGRITY SUMMARY REPORT

📊 DATA OVERVIEW
----------------------------------------------------------------------
Seasons analyzed: 9
Total matches: 3420
Total standing records: 6836
Total match stats: 3420

📅 PER-SEASON STATISTICS
----------------------------------------------------------------------

2015-2016:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 760
  Match stats: 380 matches

2016-2017:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 760
  Match stats: 380 matches

2017-2018:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 760
  Match stats: 380 matches

2018-2019:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 760
  Match stats: 380 matches

2019-2020:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 760
  Match stats: 380 matches

2020-2021:
  Matches: 380
  Teams: 20
  Gameweeks: 38.0
  Standing records: 756
  Match stats: 380 matches

2021-2022:
  Matches: 380
  Teams: 20

Cell 19: Exporting the Results


In [22]:
# Save the Deepchecks reports
OUTPUT_DIR = Path('../../reports/data_quality')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Save the results integrity report
results_integrity.save_as_html(str(OUTPUT_DIR / 'results_integrity_report.html'))
print(f"✓ Results integrity report saved to {OUTPUT_DIR / 'results_integrity_report.html'}")

# Save the standings integrity report
standings_result.save_as_html(str(OUTPUT_DIR / 'standings_integrity_report.html'))
print(f"✓ Standings integrity report saved to {OUTPUT_DIR / 'standings_integrity_report.html'}")

print("\n✅ All reports saved successfully!")

✓ Results integrity report saved to ..\..\reports\data_quality\results_integrity_report.html
✓ Standings integrity report saved to ..\..\reports\data_quality\standings_integrity_report.html

✅ All reports saved successfully!


Cell 20: Statistical Analysis of Scores


In [23]:
# Score analysis
print("\n📈 SCORE ANALYSIS")
print("-" * 70)
print("\nHome Goals Statistics:")
print(all_results['home_goals'].describe())
print("\nAway Goals Statistics:")
print(all_results['away_goals'].describe())

# Result distribution
print("\n📊 RESULT DISTRIBUTION")
print("-" * 70)
result_dist = all_results['result'].value_counts()
result_pct = (result_dist / len(all_results) * 100).round(2)
print(f"Home wins (H): {result_dist.get('H', 0)} ({result_pct.get('H', 0)}%)")
print(f"Draws (D): {result_dist.get('D', 0)} ({result_pct.get('D', 0)}%)")
print(f"Away wins (A): {result_dist.get('A', 0)} ({result_pct.get('A', 0)}%)")


📈 SCORE ANALYSIS
----------------------------------------------------------------------

Home Goals Statistics:
count    3420.000000
mean        1.556140
std         1.326516
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max         9.000000
Name: home_goals, dtype: float64

Away Goals Statistics:
count    3420.000000
mean        1.262281
std         1.215303
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max         9.000000
Name: away_goals, dtype: float64

📊 RESULT DISTRIBUTION
----------------------------------------------------------------------
Home wins (H): 1536 (44.91%)
Draws (D): 793 (23.19%)
Away wins (A): 1091 (31.9%)


Cell 21: Team Analysis


In [24]:
# Most frequent teams
print("\n🏆 TEAM APPEARANCES")
print("-" * 70)
all_teams = pd.concat([all_results['home_team'], all_results['away_team']])
team_counts = all_teams.value_counts()
print("\nTop 10 teams by number of matches:")
print(team_counts.head(10))

print("\nTeams with fewer matches (possibly promoted/relegated):")
print(team_counts.tail(10))


🏆 TEAM APPEARANCES
----------------------------------------------------------------------

Top 10 teams by number of matches:
Manchester United    342
Arsenal              342
Liverpool            342
Manchester City      342
Crystal Palace       342
Tottenham Hotspur    342
West Ham United      342
Everton              342
Chelsea              342
Newcastle United     304
Name: count, dtype: int64

Teams with fewer matches (possibly promoted/relegated):
Swansea City         114
Leeds United         114
Norwich City         114
Nottingham Forest     76
Huddersfield Town     76
Sunderland            76
Hull City             38
Cardiff City          38
Middlesbrough         38
Luton Town            38
Name: count, dtype: int64
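
Because each club plays 38 league matches per season, the match counts above translate directly into seasons played, and any count that is not a multiple of 38 would point to missing or extra matches. A minimal sketch, using three counts taken from the output above; `seasons_from_match_counts` is a hypothetical helper name.

```python
import pandas as pd

MATCHES_PER_SEASON = 38  # 20 clubs, each playing the other 19 home and away

def seasons_from_match_counts(match_counts):
    """Convert per-team match counts into whole seasons plus leftover matches."""
    return pd.DataFrame({
        "seasons": match_counts // MATCHES_PER_SEASON,
        "leftover_matches": match_counts % MATCHES_PER_SEASON,
    })

sample_counts = pd.Series({"Arsenal": 342, "Newcastle United": 304, "Luton Town": 38})
summary = seasons_from_match_counts(sample_counts)
print(summary)
```

A non-zero `leftover_matches` value for any team would be worth investigating before the data is used downstream.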


Cell 22: Conclusion


# Conclusion

This notebook performed a comprehensive integrity analysis of the Premier League data:

✅ **Checks performed:**
- Data quality (missing values, duplicates, data types)
- Business logic (consistency of scores, results, and standings)
- Cross-file consistency (results ↔ standings ↔ match_stats)
- Anomaly and outlier detection

📊 **Reports generated:**
- Detailed Deepchecks HTML reports
- Summary report in the notebook

💡 **Next steps:**
- Fix the critical issues identified
- Integrate these checks into the ingestion pipeline
- Run this notebook after every data update
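
One way to approach the pipeline integration mentioned in the next steps is a validation function that raises on blocking problems, so that a bad file stops the ingestion instead of being printed and ignored. The sketch below reuses three of the match-logic rules from this notebook; the names `DataIntegrityError` and `validate_results` are hypothetical, and the choice of which checks should block is an assumption.

```python
import pandas as pd

class DataIntegrityError(ValueError):
    """Raised when a results table fails a blocking check (hypothetical name)."""

def validate_results(df):
    """Run a few blocking checks and raise if any of them fails."""
    failures = []
    if (df["home_team"] == df["away_team"]).any():
        failures.append("home_team equals away_team")
    if ((df["home_goals"] < 0) | (df["away_goals"] < 0)).any():
        failures.append("negative goals")
    if not df["gameweek"].between(1, 38).all():
        failures.append("gameweek outside 1-38")
    if failures:
        raise DataIntegrityError("; ".join(failures))
    return df

# Illustrative data: a valid frame, then a copy with an impossible gameweek.
good = pd.DataFrame({
    "home_team": ["Arsenal", "Chelsea"],
    "away_team": ["Chelsea", "Arsenal"],
    "home_goals": [2, 0],
    "away_goals": [1, 0],
    "gameweek": [1, 2],
})
validate_results(good)

bad = good.assign(gameweek=[1, 39])
try:
    validate_results(bad)
except DataIntegrityError as exc:
    message = str(exc)
    print(f"Rejected: {message}")
```

Returning the frame on success lets the function be chained in a loading step, for example `validate_results(pd.read_csv(path))`.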