# 🔬 Feature Analysis & Quality Check

This notebook analyzes the engineered features in order to:
1. Verify feature quality
2. Analyze the distributions
3. Identify the most important features
4. Detect correlations
5. Prepare the data for modeling


Cell 2 - Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)

print("✅ Imports successful")


✅ Imports successful


Cell 3 - Markdown
## 1. Load the Features

Cell 4 - Load Features

In [3]:
# Load the features
features_path = Path('../data/processed/v1/features.parquet')
df = pd.read_parquet(features_path)

print(f"📊 Dataset Shape: {df.shape}")
print(f"   Records: {len(df):,}")
print(f"   Features: {len(df.columns)}")
print(f"\n📅 Seasons: {df['season'].unique()}")
print(f"   Total: {df['season'].nunique()} seasons")
print(f"\n⚽ Teams: {df['team'].nunique()} unique teams")


📊 Dataset Shape: (6076, 44)
   Records: 6,076
   Features: 44

📅 Seasons: ['2015-2016' '2016-2017' '2017-2018' '2018-2019' '2019-2020' '2020-2021'
 '2021-2022' '2022-2023']
   Total: 8 seasons

⚽ Teams: 32 unique teams
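
The shape is worth a quick sanity check. Assuming the data covers a 20-team league with 38 gameweeks per season (an assumption; the notebook does not state the competition format), 8 seasons would give 8 × 20 × 38 = 6,080 rows, four more than the 6,076 loaded. Below is a minimal sketch for locating such gaps. It works on plain (season, team) pairs so it can run without the parquet file; `find_incomplete_teams` and `expected_gameweeks` are hypothetical names.

```python
from collections import Counter

def find_incomplete_teams(pairs, expected_gameweeks=38):
    """Return {(season, team): row_count} for every team-season whose
    row count differs from expected_gameweeks.

    expected_gameweeks=38 is an assumption (20-team league), not a value
    taken from the notebook.
    """
    counts = Counter(pairs)
    return {key: n for key, n in counts.items() if n != expected_gameweeks}

# Toy data: one complete team-season and one that is missing two rows.
toy_pairs = (
    [("2015-2016", "Team A")] * 38
    + [("2015-2016", "Team B")] * 36
)
incomplete = find_incomplete_teams(toy_pairs)
print(incomplete)
# {('2015-2016', 'Team B'): 36}
```

In the notebook itself, the pairs can be taken from the DataFrame with `zip(df['season'], df['team'])`.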


Cell 5 - Preview

In [4]:
# Preview the data
df.head(10)


Unnamed: 0,season,team,gameweek,current_position,current_points,matches_played,wins,draws,losses,goals_for,...,target_final_position,form_last_5_goal_diff,avg_total_scoring_att,avg_ontarget_scoring_att,avg_total_pass,avg_accurate_pass,avg_total_tackle,avg_won_contest,avg_saves,avg_total_offside
0,2015-2016,Manchester City,1,1,3,1,1,0,0,3.0,...,4,,,,,,,,,
1,2015-2016,Manchester City,2,1,6,2,2,0,0,6.0,...,4,3.0,20.0,7.0,764.0,693.0,10.0,7.0,2.0,
2,2015-2016,Manchester City,3,1,9,3,3,0,0,8.0,...,4,6.0,19.0,7.5,593.5,519.0,13.5,10.5,2.5,1.0
3,2015-2016,Manchester City,4,1,12,4,4,0,0,10.0,...,4,8.0,18.0,8.0,558.0,484.0,14.666667,10.333333,2.0,1.5
4,2015-2016,Manchester City,5,1,15,5,5,0,0,11.0,...,4,10.0,18.0,7.25,579.75,511.5,14.75,11.0,2.0,1.333333
5,2015-2016,Manchester City,6,1,15,6,5,0,1,12.0,...,4,11.0,18.6,7.2,563.4,489.6,14.8,9.6,2.0,1.333333
6,2015-2016,Manchester City,7,2,15,7,5,0,2,13.0,...,4,7.0,20.0,7.333333,581.0,507.333333,15.5,10.833333,1.8,1.5
7,2015-2016,Manchester City,8,1,18,8,6,0,2,19.0,...,4,1.0,20.285714,7.428571,566.285714,486.714286,15.857143,10.285714,2.166667,1.8
8,2015-2016,Manchester City,9,1,21,9,7,0,2,24.0,...,4,4.0,20.625,7.875,569.25,488.125,16.375,10.125,2.285714,1.666667
9,2015-2016,Manchester City,10,1,22,10,7,1,2,24.0,...,4,6.0,20.111111,8.222222,568.888889,486.111111,16.111111,10.888889,2.285714,1.714286
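
The preview suggests that the `avg_*` columns are running means over matches played before the current gameweek: the first row is empty, and later values change as if one more match were averaged in each week. This is an inference from the output, not something the notebook states. Below is a minimal sketch of such a lagged running mean, using per-match shot counts back-calculated from the `avg_total_scoring_att` column above (the function name `lagged_running_mean` is hypothetical).

```python
def lagged_running_mean(values):
    """For each position i, return the mean of values[:i], i.e. only
    matches played before gameweek i + 1. The first entry is None
    because no earlier match exists."""
    result = []
    total = 0.0
    for i, value in enumerate(values):
        result.append(total / i if i > 0 else None)
        total += value  # added after the mean, so the current match is excluded
    return result

# Per-match shot counts back-calculated from the preview (an assumption).
shots = [20, 18, 16, 18, 21, 27]
lagged = lagged_running_mean(shots)
print(lagged)
# [None, 20.0, 19.0, 18.0, 18.0, 18.6]
```

If the features are indeed built this way, each row only uses information available before that gameweek, which matters when the rows are later used to predict `target_final_position`.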


Cell 6 - Metadata

In [5]:
# Load the metadata
metadata_path = Path('../data/processed/v1/feature_metadata.json')
with open(metadata_path, 'r', encoding='utf-8') as f:
    metadata = json.load(f)

print("📋 Feature Metadata:")
print(f"   Created: {metadata['created_at']}")
print(f"   Form window: {metadata['form_window']}")

print("\n📊 Feature Categories:")

# Classify the features by category
feature_names = [col for col in df.columns if col not in ['season', 'team', 'gameweek', 'target_final_points', 'target_final_position']]

categories = {
    'Cumulative': [f for f in feature_names if f.startswith(('current_', 'matches_', 'wins', 'draws', 'losses', 'goals_'))],
    'Form': [f for f in feature_names if 'form_' in f or 'last_' in f],
    'Home/Away': [f for f in feature_names if f.startswith(('home_', 'away_'))],
    'Ratios': [f for f in feature_names if any(x in f for x in ['_rate', '_per_', '_ppg'])],
    'Stats': [f for f in feature_names if 'avg_' in f],
}

for cat, features in categories.items():
    print(f"   {cat}: {len(features)} features")


📋 Feature Metadata:
   Created: 2025-12-11T00:35:41.440399
   Form window: 5

📊 Feature Categories:
   Cumulative: 11 features
   Form: 5 features
   Home/Away: 6 features
   Ratios: 11 features
   Stats: 8 features
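
The category counts sum to 41, yet the DataFrame has 44 columns and at least four of them (`season`, `team`, `gameweek`, `target_final_position`) are excluded from `feature_names`. At most 40 names remain, so at least one feature must match more than one rule. Below is a minimal sketch for listing overlapping and uncategorized features. The helper `audit_categories` is hypothetical, as are the toy feature names `home_ppg` and `days_rest`.

```python
def audit_categories(feature_names, categories):
    """Return (overlaps, uncategorized).

    overlaps maps each feature found in two or more categories to the
    sorted list of those categories; uncategorized lists the features
    that appear in none.
    """
    membership = {name: [] for name in feature_names}
    for category, members in categories.items():
        for name in members:
            if name in membership:
                membership[name].append(category)
    overlaps = {n: sorted(c) for n, c in membership.items() if len(c) > 1}
    uncategorized = [n for n, c in membership.items() if not c]
    return overlaps, uncategorized

# Toy input; 'home_ppg' and 'days_rest' are made-up names for illustration.
toy_features = ["form_last_5_goal_diff", "avg_saves", "home_ppg", "days_rest"]
toy_categories = {
    "Form": [f for f in toy_features if "form_" in f or "last_" in f],
    "Home/Away": [f for f in toy_features if f.startswith(("home_", "away_"))],
    "Ratios": [f for f in toy_features if any(x in f for x in ["_rate", "_per_", "_ppg"])],
    "Stats": [f for f in toy_features if "avg_" in f],
}
overlaps, uncategorized = audit_categories(toy_features, toy_categories)
print(overlaps)       # {'home_ppg': ['Home/Away', 'Ratios']}
print(uncategorized)  # ['days_rest']
```

In the notebook, `audit_categories(feature_names, categories)` can be called right after the loop that prints the counts.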
