# ETL: FIFA World Cup 2022 Dataset

This notebook walks through an ETL (Extract, Transform, Load) process on the FIFA World Cup 2022 dataset.
The goal is to explore the raw dataset, clean it, and select the columns relevant for later analysis.

In [1]:
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_columns', None)

## 1. Data Extraction and Loading
We download the latest version of the dataset from Kaggle via `kagglehub`.

In [2]:
# Download latest version
path = kagglehub.dataset_download("die9origephit/fifa-world-cup-2022-complete-dataset")

print("Path to dataset files:", path)

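# Locate and read the CSV
csv_path = None
for root, dirs, files in os.walk(path):
    if "Fifa_world_cup_matches.csv" in files:
        csv_path = os.path.join(root, "Fifa_world_cup_matches.csv")
        break  # stop walking once the file is found

if csv_path:
    df = pd.read_csv(csv_path)
    print("Dataset loaded successfully.")
else:
    # Fail here rather than with a NameError on `df` in later cells
    raise FileNotFoundError("Fifa_world_cup_matches.csv not found under " + path)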

Path to dataset files: C:\Users\mabed\.cache\kagglehub\datasets\die9origephit\fifa-world-cup-2022-complete-dataset\versions\7
Dataset loaded successfully.


## 2. Dataset Analysis (Exploration)
In this step, we explore the structure of the data, check the column types, and look for missing values.

In [3]:
# Preview the first rows
df.head()

Unnamed: 0,team1,team2,possession team1,possession team2,possession in contest,number of goals team1,number of goals team2,date,hour,category,total attempts team1,total attempts team2,conceded team1,conceded team2,goal inside the penalty area team1,goal inside the penalty area team2,goal outside the penalty area team1,goal outside the penalty area team2,assists team1,assists team2,on target attempts team1,on target attempts team2,off target attempts team1,off target attempts team2,attempts inside the penalty area team1,attempts inside the penalty area team2,attempts outside the penalty area team1,attempts outside the penalty area team2,left channel team1,left channel team2,left inside channel team1,left inside channel team2,central channel team1,central channel team2,right inside channel team1,right inside channel team2,right channel team1,right channel team2,total offers to receive team1,total offers to receive team2,inbehind offers to receive team1,inbehind offers to receive team2,inbetween offers to receive team1,inbetween offers to receive team2,infront offers to receive team1,infront offers to receive team2,receptions between midfield and defensive lines team1,receptions between midfield and defensive lines team2,attempted line breaks team1,attempted line breaks team2,completed line breaksteam1,completed line breaks team2,attempted defensive line breaks team1,attempted defensive line breaks team2,completed defensive line breaksteam1,completed defensive line breaks team2,yellow cards team1,yellow cards team2,red cards team1,red cards team2,fouls against team1,fouls against team2,offsides team1,offsides team2,passes team1,passes team2,passes completed team1,passes completed team2,crosses team1,crosses team2,crosses completed team1,crosses completed team2,switches of play completed team1,switches of play completed team2,corners team1,corners team2,free kicks team1,free kicks team2,penalties scored team1,penalties scored team2,goal preventions team1,goal preventions team2,own goals team1,own goals team2,forced turnovers team1,forced turnovers team2,defensive pressures applied team1,defensive pressures applied team2
0,QATAR,ECUADOR,42%,50%,8%,0,2,20 NOV 2022,17 : 00,Group A,5,6,2,0,0,2,0,0,0,1,0,3,5,3,2,4,3,2,15,8,0,7,3,6,1,4,9,6,520,532,116,127,235,187,169,218,5,8,136,155,86,99,9,13,4,7,4,2,0,0,15,15,3,4,450,480,381,409,9,14,4,4,9,9,1,3,19,17,0,1,6,5,0,0,52,72,256,279
1,ENGLAND,IRAN,72%,19%,9%,6,2,21 NOV 2022,14 : 00,Group B,13,8,2,6,6,2,0,0,6,1,7,3,3,4,10,6,3,2,11,3,5,0,2,3,3,1,11,0,1061,212,207,53,386,86,468,73,16,4,238,101,178,45,25,7,16,4,0,2,0,0,9,14,2,2,809,224,730,154,23,8,7,1,12,3,8,0,16,10,0,1,8,13,0,0,63,72,139,416
2,SENEGAL,NETHERLANDS,44%,45%,11%,0,2,21 NOV 2022,17 : 00,Group A,14,9,2,0,0,2,0,0,0,1,3,3,8,5,7,5,7,4,12,11,4,2,2,2,4,7,13,20,502,506,123,117,230,191,149,198,15,14,151,162,89,96,22,22,15,10,2,1,0,0,13,13,2,1,383,438,313,374,19,25,7,8,9,6,6,7,14,14,0,0,9,15,0,0,63,73,263,251
3,UNITED STATES,WALES,51%,39%,10%,1,1,21 NOV 2022,20 : 00,Group B,6,7,1,1,1,1,0,0,1,0,1,3,4,3,4,5,2,2,14,7,5,2,4,5,4,2,11,7,725,436,149,100,336,172,240,164,12,9,199,174,146,103,23,17,15,8,4,2,0,0,15,10,1,1,569,409,509,321,31,15,4,6,5,8,5,3,11,15,0,1,7,7,0,0,81,72,242,292
4,ARGENTINA,SAUDI ARABIA,64%,24%,12%,1,2,22 NOV 2022,11 : 00,Group C,14,3,2,1,1,2,0,0,0,1,6,2,5,0,10,3,4,0,12,3,4,2,5,3,8,3,18,8,650,268,157,69,177,131,316,68,26,9,191,137,127,68,39,15,25,7,0,6,0,0,7,21,10,1,610,267,529,190,29,9,12,2,5,7,9,2,22,16,1,0,4,14,0,0,65,80,163,361


In [4]:
# Column names and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 88 columns):
 #   Column                                                 Non-Null Count  Dtype 
---  ------                                                 --------------  ----- 
 0   team1                                                  64 non-null     object
 1   team2                                                  64 non-null     object
 2   possession team1                                       64 non-null     object
 3   possession team2                                       64 non-null     object
 4   possession in contest                                  64 non-null     object
 5   number of goals team1                                  64 non-null     int64 
 6   number of goals team2                                  64 non-null     int64 
 7   date                                                   64 non-null     object
 8   hour                                                   64 non-

In [5]:
# Check for null values
nulls = df.isnull().sum()
print("Columns with null values:")
print(nulls[nulls > 0])

Columns with null values:
Series([], dtype: int64)
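An empty result only means no cell is missing; it does not mean every column is ready for arithmetic. `df.info()` reports the possession columns as `object`, and the preview shows values such as `42%`. The following is a minimal sketch, not part of the original notebook, that flags text columns whose values all end in `%`. The `sample` frame is a hand-built stand-in using a few values copied from the preview above; on the real data you would pass `df` instead.

```python
import pandas as pd

# Stand-in for df: a few values copied from the df.head() preview.
sample = pd.DataFrame({
    "team1": ["QATAR", "ENGLAND"],
    "possession team1": ["42%", "72%"],
    "possession team2": ["50%", "19%"],
    "number of goals team1": [0, 6],
})

# A column qualifies if it holds strings and every value ends with '%'.
percent_cols = [
    col for col in sample.columns
    if pd.api.types.is_string_dtype(sample[col])
    and sample[col].str.strip().str.endswith("%").all()
]
print(percent_cols)
```

Columns flagged this way are candidates for conversion to numbers during the Transform step.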


In [6]:
# Summary descriptive statistics
df.describe()

Unnamed: 0,number of goals team1,number of goals team2,total attempts team1,total attempts team2,conceded team1,conceded team2,goal inside the penalty area team1,goal inside the penalty area team2,goal outside the penalty area team1,goal outside the penalty area team2,assists team1,assists team2,on target attempts team1,on target attempts team2,off target attempts team1,off target attempts team2,attempts inside the penalty area team1,attempts inside the penalty area team2,attempts outside the penalty area team1,attempts outside the penalty area team2,left channel team1,left channel team2,left inside channel team1,left inside channel team2,central channel team1,central channel team2,right inside channel team1,right inside channel team2,right channel team1,right channel team2,total offers to receive team1,total offers to receive team2,inbehind offers to receive team1,inbehind offers to receive team2,inbetween offers to receive team1,inbetween offers to receive team2,infront offers to receive team1,infront offers to receive team2,receptions between midfield and defensive lines team1,receptions between midfield and defensive lines team2,attempted line breaks team1,attempted line breaks team2,completed line breaksteam1,completed line breaks team2,attempted defensive line breaks team1,attempted defensive line breaks team2,completed defensive line breaksteam1,completed defensive line breaks team2,yellow cards team1,yellow cards team2,red cards team1,red cards team2,fouls against team1,fouls against team2,offsides team1,offsides team2,passes team1,passes team2,passes completed team1,passes completed team2,crosses team1,crosses team2,crosses completed team1,crosses completed team2,switches of play completed team1,switches of play completed team2,corners team1,corners team2,free kicks team1,free kicks team2,penalties scored team1,penalties scored team2,goal preventions team1,goal preventions team2,own goals team1,own goals team2,forced turnovers team1,forced turnovers team2,defensive pressures applied team1,defensive pressures applied team2
count,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0
mean,1.578125,1.109375,11.140625,11.28125,1.109375,1.578125,1.46875,0.984375,0.09375,0.109375,1.171875,0.734375,4.203125,3.75,4.703125,5.03125,6.9375,6.953125,4.203125,4.328125,13.625,13.5,4.921875,4.421875,4.65625,5.15625,4.265625,5.03125,11.65625,12.796875,592.375,550.21875,126.375,119.625,231.5,212.859375,234.5,217.734375,11.40625,10.5,173.46875,166.59375,114.25,106.484375,18.484375,18.265625,10.15625,9.71875,1.78125,1.75,0.0625,0.0,12.640625,12.359375,1.96875,1.96875,509.515625,492.109375,437.0,419.890625,18.09375,18.53125,4.59375,4.078125,6.453125,6.15625,4.484375,4.453125,14.09375,14.390625,0.140625,0.125,11.59375,11.359375,0.015625,0.015625,71.96875,70.125,289.75,293.265625
std,1.551289,1.055856,4.972519,5.807682,1.055856,1.551289,1.563155,0.999876,0.293785,0.314576,1.363407,0.895176,2.527184,2.713868,2.394966,2.911219,3.77912,4.459446,2.470009,2.766321,6.550173,7.287737,2.53424,3.201213,2.852004,3.296071,2.685896,3.141978,5.812463,6.544547,170.21084,169.487694,33.776812,36.660822,70.466698,59.487191,85.887893,101.472843,6.920682,5.614607,32.77822,27.965806,33.217895,27.795736,7.144744,6.183034,5.771354,5.202163,1.740906,1.511858,0.243975,0.0,5.247425,3.789573,1.727175,1.727175,156.348511,166.213681,156.9237,165.710028,8.239893,7.195609,3.298478,2.269918,3.749835,3.432888,2.777416,2.794153,4.219075,5.202616,0.350382,0.377964,5.911299,4.990045,0.125,0.125,14.394629,13.531269,88.406888,80.91623
min,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,302.0,212.0,62.0,52.0,99.0,86.0,75.0,68.0,1.0,1.0,104.0,101.0,55.0,45.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,3.0,5.0,0.0,0.0,225.0,224.0,167.0,154.0,4.0,5.0,0.0,0.0,1.0,1.0,0.0,0.0,6.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,38.0,44.0,139.0,141.0
25%,0.0,0.0,8.0,7.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,2.0,3.0,3.0,4.0,4.0,3.0,3.0,8.0,8.0,3.0,2.0,2.0,3.0,2.0,3.0,8.75,8.0,465.75,439.0,100.0,95.75,176.75,176.0,173.5,159.75,8.0,7.0,150.75,149.25,90.5,86.75,13.0,14.75,7.0,7.0,0.0,1.0,0.0,0.0,9.0,10.0,1.0,0.75,392.75,392.25,318.25,317.5,11.75,13.0,2.75,2.75,4.0,3.0,2.0,2.0,11.0,11.0,0.0,0.0,7.75,8.0,0.0,0.0,63.0,60.25,229.0,233.75
50%,1.0,1.0,10.0,10.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,4.0,3.0,4.0,5.0,6.0,6.0,4.0,4.0,12.0,11.5,5.0,4.0,4.0,4.5,4.0,4.0,11.0,12.0,611.5,544.0,125.5,115.5,232.0,207.5,235.5,196.5,10.0,9.0,171.5,167.0,111.0,99.0,17.0,17.0,8.0,9.0,1.5,2.0,0.0,0.0,13.0,12.0,2.0,2.0,508.0,466.0,437.0,396.5,19.0,18.0,4.0,4.0,6.0,6.0,4.5,4.0,14.0,14.5,0.0,0.0,11.0,10.0,0.0,0.0,71.0,72.0,281.0,292.5
75%,2.0,2.0,14.0,14.0,2.0,2.0,2.0,2.0,0.0,0.0,2.0,1.0,6.0,5.0,6.25,6.25,9.25,9.0,5.0,5.0,17.25,18.0,6.25,7.0,6.25,6.25,6.0,6.25,14.25,18.0,696.0,640.75,152.25,137.0,275.25,261.25,286.0,259.0,15.0,14.0,193.0,185.0,134.25,123.25,23.0,22.0,12.25,12.0,3.0,2.0,0.0,0.0,15.0,14.25,3.0,3.0,594.5,571.0,523.0,498.25,23.0,22.25,6.0,5.0,9.0,8.25,6.0,6.0,16.0,17.0,0.0,0.0,14.0,14.0,0.0,0.0,83.5,79.0,328.0,327.5
max,7.0,4.0,25.0,32.0,4.0,7.0,7.0,4.0,1.0,1.0,6.0,4.0,10.0,13.0,11.0,17.0,18.0,24.0,13.0,15.0,30.0,36.0,12.0,13.0,14.0,16.0,11.0,19.0,27.0,29.0,1085.0,1138.0,207.0,217.0,418.0,360.0,487.0,678.0,43.0,28.0,276.0,241.0,233.0,188.0,39.0,37.0,27.0,25.0,8.0,8.0,1.0,0.0,30.0,24.0,10.0,7.0,1061.0,1070.0,1003.0,992.0,46.0,38.0,17.0,12.0,18.0,14.0,12.0,14.0,27.0,30.0,1.0,2.0,32.0,26.0,1.0,1.0,101.0,104.0,637.0,585.0
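In the summary above, `number of goals team1` and `conceded team2` share the same mean (1.578125) and standard deviation (1.551289), which suggests the `conceded` columns simply mirror the opponent's goals. The sketch below shows one way to check that row by row. It is an illustration rather than part of the original notebook, and `sample` holds only the five preview rows; replace it with `df` to test all 64 matches.

```python
import pandas as pd

# The five preview rows from df.head(), limited to the four columns compared.
sample = pd.DataFrame({
    "number of goals team1": [0, 6, 0, 1, 1],
    "number of goals team2": [2, 2, 2, 1, 2],
    "conceded team1": [2, 2, 2, 1, 2],
    "conceded team2": [0, 6, 0, 1, 1],
})

# Goals scored by one side should equal goals conceded by the other.
mirrored = (
    sample["number of goals team1"].eq(sample["conceded team2"])
    & sample["number of goals team2"].eq(sample["conceded team1"])
).all()
print(mirrored)
```

If the check holds on the full dataset, the `conceded` columns carry no extra information and can be dropped without loss.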


## 3. Extraction (Column Selection)
The dataset contains a large number of columns (88). For our analysis, we keep only the most relevant ones.

### Rationale for the selection

We keep the following columns:

1.  **General Information**:
    *   `team1`, `team2`: The two teams playing each other.
    *   `date`, `hour`, `category`: Time context and tournament stage (e.g., Group, Final).

2.  **Match Result**:
    *   `number of goals team1`, `number of goals team2`: Essential for identifying the winner.

3.  **Game Statistics (Performance)**:
    *   `possession team1`, `possession team2`: Key indicator of dominance.
    *   `total attempts team1`, `total attempts team2`: Attacking volume.
    *   `on target attempts team1`, `on target attempts team2`: Accuracy and real threat.

4.  **Discipline**:
    *   `yellow cards team1`, `yellow cards team2`
    *   `red cards team1`, `red cards team2`: Impact on the game and fair play.

These columns let us answer the main questions: Who won? Who dominated? Was the match aggressive?
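As an illustration of the first question, the winner can be derived directly from the two goal columns. This is a minimal sketch and not part of the original notebook: the `winner` column and the `DRAW` label are names introduced here, and `sample` uses three preview rows in place of the full dataframe. A level score is labelled `DRAW`; for knockout matches the score columns alone may not say which team advanced.

```python
import pandas as pd

# Three preview rows: one win for team2, one win for team1, one level score.
sample = pd.DataFrame({
    "team1": ["QATAR", "ENGLAND", "UNITED STATES"],
    "team2": ["ECUADOR", "IRAN", "WALES"],
    "number of goals team1": [0, 6, 1],
    "number of goals team2": [2, 2, 1],
})

g1 = sample["number of goals team1"]
g2 = sample["number of goals team2"]

# Take team1 where it scored more, otherwise team2 ...
winner = sample["team1"].where(g1 > g2, sample["team2"])
# ... then overwrite level scores with a hypothetical "DRAW" label.
sample["winner"] = winner.where(g1 != g2, "DRAW")
print(sample[["team1", "team2", "winner"]])
```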

In [None]:
columns_to_keep = [
    'team1', 'team2', 'date', 'hour', 'category',
    'number of goals team1', 'number of goals team2',
    'possession team1', 'possession team2',
    'total attempts team1', 'total attempts team2',
    'on target attempts team1', 'on target attempts team2',
    'yellow cards team1', 'yellow cards team2',
    'red cards team1', 'red cards team2'
]

df_selected = df[columns_to_keep].copy()
df_selected.head()

Unnamed: 0,team1,team2,date,hour,category,number of goals team1,number of goals team2,possession team1,possession team2,total attempts team1,total attempts team2,on target attempts team1,on target attempts team2,yellow cards team1,yellow cards team2,red cards team1,red cards team2
0,QATAR,ECUADOR,20 NOV 2022,17 : 00,Group A,0,2,42%,50%,5,6,0,3,4,2,0,0
1,ENGLAND,IRAN,21 NOV 2022,14 : 00,Group B,6,2,72%,19%,13,8,7,3,0,2,0,0
2,SENEGAL,NETHERLANDS,21 NOV 2022,17 : 00,Group A,0,2,44%,45%,14,9,3,3,2,1,0,0
3,UNITED STATES,WALES,21 NOV 2022,20 : 00,Group B,1,1,51%,39%,6,7,1,3,4,2,0,0
4,ARGENTINA,SAUDI ARABIA,22 NOV 2022,11 : 00,Group C,1,2,64%,24%,14,3,6,2,0,6,0,0


In [8]:
# Check the new dataframe
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   team1                     64 non-null     object
 1   team2                     64 non-null     object
 2   date                      64 non-null     object
 3   hour                      64 non-null     object
 4   category                  64 non-null     object
 5   number of goals team1     64 non-null     int64 
 6   number of goals team2     64 non-null     int64 
 7   possession team1          64 non-null     object
 8   possession team2          64 non-null     object
 9   total attempts team1      64 non-null     int64 
 10  total attempts team2      64 non-null     int64 
 11  on target attempts team1  64 non-null     int64 
 12  on target attempts team2  64 non-null     int64 
 13  yellow cards team1        64 non-null     int64 
 14  yellow cards team2        64

The dataset is now reduced to the dimensions essential for the intended analysis.
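Several of the retained columns are still stored as text (`possession team1`, `possession team2`, `date`, `hour`), so a cleaning pass is a natural next step. The sketch below shows one possible approach and is not the notebook's own code. The `kickoff` column name is an assumption introduced here, `sample` uses two preview rows in place of `df_selected`, and parsing `%b` month abbreviations assumes an English locale.

```python
import pandas as pd

# Two preview rows, limited to the text columns that need cleaning.
sample = pd.DataFrame({
    "possession team1": ["42%", "72%"],
    "possession team2": ["50%", "19%"],
    "date": ["20 NOV 2022", "21 NOV 2022"],
    "hour": ["17 : 00", "14 : 00"],
})

# "42%" -> 42: drop the percent sign, then convert to a number.
for col in ["possession team1", "possession team2"]:
    sample[col] = pd.to_numeric(sample[col].str.rstrip("%"))

# "17 : 00" -> "17:00" and "20 NOV 2022" -> "20 Nov 2022" before parsing.
clean_hour = sample["hour"].str.replace(" ", "", regex=False)
clean_date = sample["date"].str.title()

# Hypothetical "kickoff" column; no time zone is attached because the
# source does not say which zone the hours refer to.
sample["kickoff"] = pd.to_datetime(
    clean_date + " " + clean_hour, format="%d %b %Y %H:%M"
)
print(sample.dtypes)
```

Passing an explicit `format` keeps the parsing strict, so an unexpected date layout raises an error instead of being guessed.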

In [22]:

# Distinct teams across both columns ("equipes" is French for "teams")
equipes = pd.concat([df_selected['team1'], df_selected['team2']]).unique().tolist()
print(equipes, len(equipes))


['QATAR', 'ENGLAND', 'SENEGAL', 'UNITED STATES', 'ARGENTINA', 'DENMARK', 'MEXICO', 'FRANCE', 'MOROCCO', 'GERMANY', 'SPAIN', 'BELGIUM', 'SWITZERLAND', 'URUGUAY', 'PORTUGAL', 'BRAZIL', 'WALES', 'NETHERLANDS', 'TUNISIA', 'POLAND', 'JAPAN', 'CROATIA', 'CAMEROON', 'KOREA REPUBLIC', 'ECUADOR', 'IRAN', 'AUSTRALIA', 'SAUDI ARABIA', 'CANADA', 'COSTA RICA', 'GHANA', 'SERBIA'] 32
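The same concatenation can also count how many matches each team played, a quick sanity check on the 32 teams listed. This is a sketch under the assumption that each row is one match; `matches_played` is a name introduced here and `sample` contains only the five preview rows, so every count is 1. On the full `df_selected`, the counts should range from 3 for teams eliminated in the group stage up to 7.

```python
import pandas as pd

# Team columns from the five preview rows; stands in for df_selected.
sample = pd.DataFrame({
    "team1": ["QATAR", "ENGLAND", "SENEGAL", "UNITED STATES", "ARGENTINA"],
    "team2": ["ECUADOR", "IRAN", "NETHERLANDS", "WALES", "SAUDI ARABIA"],
})

# Stack both columns so a team is counted whether listed first or second.
matches_played = pd.concat([sample["team1"], sample["team2"]]).value_counts()
print(matches_played)
```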
