# Extraction - World Cup 2014

**Author**: Short Kings Team
**Date**: 16/12/2025
**Dataset**: WorldCupMatches2014 (1).csv

## Objective
Extract the raw data from the file and identify data quality issues.

## 1. Imports and Configuration

In [10]:
import pandas as pd
import numpy as np
from pathlib import Path

# Configuration
DATA_PATH = Path('../data/raw/')
OUTPUT_PATH = Path('../data/processed/')

print("Files in data/raw/:")
if DATA_PATH.exists():
    for f in DATA_PATH.iterdir():
        print(f"  - {f.name}")
else:
    print("The directory does not exist!")

# Create the output directory (and any missing parents) if needed
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)


Files in data/raw/:
  - .gitkeep
  - WorldCupMatches2014 (1).csv


## 2. Loading the raw data

In [11]:
# Load with the correct separator and encoding
# Note: the file uses ';' as its separator
df_raw = pd.read_csv(
    DATA_PATH / 'WorldCupMatches2014 (1).csv',
    sep=';',
    encoding='latin-1'  # Decodes every byte without error; accented characters may still come out garbled (see 3.3)
)

print(f"Shape: {df_raw.shape}")
print(f"Columns: {df_raw.columns.tolist()}")

Shape: (80, 20)
Columns: ['Year', 'Datetime', 'Stage', 'Stadium', 'City', 'Home Team Name', 'Home Team Goals', 'Away Team Goals', 'Away Team Name', 'Win conditions', 'Attendance', 'Half-time Home Goals', 'Half-time Away Goals', 'Referee', 'Assistant 1', 'Assistant 2', 'RoundID', 'MatchID', 'Home Team Initials', 'Away Team Initials']


In [12]:
df_raw.head(10)

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
0,2014,12 Jun 2014 - 17:00,Group A,Arena de Sao Paulo,Sao Paulo,Brazil,3,1,Croatia,,62103.0,1,1,NISHIMURA Yuichi (JPN),SAGARA Toru (JPN),NAGI Toshiyuki (JPN),255931,300186456,BRA,CRO
1,2014,13 Jun 2014 - 13:00,Group A,Estadio das Dunas,Natal,Mexico,1,0,Cameroon,,39216.0,0,0,ROLDAN Wilmar (COL),CLAVIJO Humberto (COL),DIAZ Eduardo (COL),255931,300186492,MEX,CMR
2,2014,13 Jun 2014 - 16:00,Group B,Arena Fonte Nova,Salvador,Spain,1,5,Netherlands,,48173.0,1,1,Nicola RIZZOLI (ITA),Renato FAVERANI (ITA),Andrea STEFANI (ITA),255931,300186510,ESP,NED
3,2014,13 Jun 2014 - 18:00,Group B,Arena Pantanal,Cuiaba,Chile,3,1,Australia,,40275.0,2,1,Noumandiez DOUE (CIV),YEO Songuifolo (CIV),BIRUMUSHAHU Jean Claude (BDI),255931,300186473,CHI,AUS
4,2014,14 Jun 2014 - 13:00,Group C,Estadio Mineirao,Belo Horizonte,Colombia,3,0,Greece,,57174.0,1,0,GEIGER Mark (USA),HURD Sean (USA),FLETCHER Joe (CAN),255931,300186471,COL,GRE
5,2014,14 Jun 2014 - 16:00,Group D,Estadio Castelao,Fortaleza,Uruguay,1,3,Costa Rica,,58679.0,1,0,BRYCH Felix (GER),BORSCH Mark (GER),LUPP Stefan (GER),255931,300186489,URU,CRC
6,2014,14 Jun 2014 - 18:00,Group D,Arena Amazonia,Manaus,England,1,2,Italy,,39800.0,1,1,Bjï¿½rn KUIPERS (NED),Sander VAN ROEKEL (NED),Erwin ZEINSTRA (NED),255931,300186513,ENG,ITA
7,2014,14 Jun 2014 - 22:00,Group C,Arena Pernambuco,Recife,Cï¿½te d'Ivoire,2,1,Japan,,40267.0,0,1,OSSES Enrique (CHI),ASTROZA Carlos (CHI),ROMAN Sergio (CHI),255931,300186507,CIV,JPN
8,2014,15 Jun 2014 - 13:00,Group E,Estadio Nacional,Brasilia,Switzerland,2,1,Ecuador,,68351.0,0,1,Ravshan IRMATOV (UZB),RASULOV Abduxamidullo (UZB),KOCHKAROV Bakhadyr (KGZ),255931,300186494,SUI,ECU
9,2014,15 Jun 2014 - 16:00,Group E,Estadio Beira-Rio,Porto Alegre,France,3,0,Honduras,,43012.0,1,0,RICCI Sandro (BRA),DE CARVALHO Emerson (BRA),VAN GASSE Marcelo (BRA),255931,300186496,FRA,HON


In [13]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Year                  80 non-null     int64  
 1   Datetime              80 non-null     object 
 2   Stage                 80 non-null     object 
 3   Stadium               80 non-null     object 
 4   City                  80 non-null     object 
 5   Home Team Name        80 non-null     object 
 6   Home Team Goals       80 non-null     int64  
 7   Away Team Goals       80 non-null     int64  
 8   Away Team Name        80 non-null     object 
 9   Win conditions        80 non-null     object 
 10  Attendance            78 non-null     float64
 11  Half-time Home Goals  80 non-null     int64  
 12  Half-time Away Goals  80 non-null     int64  
 13  Referee               80 non-null     object 
 14  Assistant 1           80 non-null     object 
 15  Assistant 2           80 non-null     object 
 ...

## 3. Data quality analysis
### 3.1 Missing values

In [14]:
# Count missing values per column
missing = df_raw.isnull().sum()
missing_pct = (missing / len(df_raw) * 100).round(2)

missing_report = pd.DataFrame({
    'Missing values': missing,
    'Percentage (%)': missing_pct
})
missing_report[missing_report['Missing values'] > 0]

Unnamed: 0,Missing values,Percentage (%)
Attendance,2,2.5
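
The report above gives counts only. To see which matches lack an `Attendance` value, the affected rows can be pulled out directly. The following is a minimal sketch on a toy frame: `sample`, its values, and the `rows_with_missing` helper are assumptions for illustration, not part of the notebook or the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw (hypothetical values, not taken from the dataset)
sample = pd.DataFrame({
    "MatchID": [1, 2, 3, 4],
    "Attendance": [62103.0, np.nan, 48173.0, np.nan],
})

def rows_with_missing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the rows of `df` where `column` is null."""
    return df[df[column].isna()]

missing_rows = rows_with_missing(sample, "Attendance")
print(missing_rows)
```

On the real data the equivalent call would be `rows_with_missing(df_raw, 'Attendance')`.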


### 3.2 Duplicate detection

In [16]:
# Exact duplicates (rows identical across all columns)
doublons_exacts = df_raw.duplicated().sum()
print(f"Number of exact duplicates: {doublons_exacts}")

# Duplicates on MatchID (assumed to be a unique identifier)
doublons_matchid = df_raw['MatchID'].duplicated().sum()
print(f"Number of duplicated MatchIDs: {doublons_matchid}")

Number of exact duplicates: 16
Number of duplicated MatchIDs: 16


In [17]:
# Inspect the duplicated matches
if doublons_matchid > 0:
    matchids_dupliques = df_raw[df_raw['MatchID'].duplicated(keep=False)]['MatchID'].unique()
    print(f"Duplicated MatchIDs: {matchids_dupliques}")

    # Show one example of a duplicate
    exemple_id = matchids_dupliques[0]
    print(f"\nExample duplicate (MatchID={exemple_id}):")
    display(df_raw[df_raw['MatchID'] == exemple_id])

Duplicated MatchIDs: [300186487 300186491 300186462 300186460 300186461 300186485 300186474
 300186502 300186501 300186490 300186488 300186504 300186508 300186459
 300186503 300186497]

Example duplicate (MatchID=300186487):


Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
48,2014,28 Jun 2014 - 13:00,Round of 16,Estadio Mineirao,Belo Horizonte,Brazil,1,1,Chile,Brazil win on penalties (3 - 2),57714.0,0,0,WEBB Howard (ENG),MULLARKEY Michael (ENG),Darren CANN (ENG),255951,300186487,BRA,CHI
64,2014,28 Jun 2014 - 13:00,Round of 16,Estadio Mineirao,Belo Horizonte,Brazil,1,1,Chile,Brazil win on penalties (3 - 2),57714.0,0,0,WEBB Howard (ENG),MULLARKEY Michael (ENG),Darren CANN (ENG),255951,300186487,BRA,CHI
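
Both counts equal 16, which is consistent with every duplicated `MatchID` being a fully identical row. Under that assumption, a later cleaning step could drop the exact duplicates and then confirm that `MatchID` is unique. This is only a sketch: the `drop_exact_duplicates` helper and the three-row `sample` frame are illustrative, not the notebook's own code.

```python
import pandas as pd

# Toy frame reproducing the pattern above: one match appears twice
sample = pd.DataFrame({
    "MatchID": [300186456, 300186487, 300186487],
    "Home Team Name": ["Brazil", "Brazil", "Brazil"],
    "Away Team Name": ["Croatia", "Chile", "Chile"],
})

def drop_exact_duplicates(df: pd.DataFrame, key: str = "MatchID") -> pd.DataFrame:
    """Drop fully identical rows, then check that `key` is unique."""
    deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
    if not deduped[key].is_unique:
        # Same key but different content: needs manual review, not a silent drop
        raise ValueError(f"{key} still has duplicates after removing exact copies")
    return deduped

deduped = drop_exact_duplicates(sample)
print(len(sample), "->", len(deduped))
```

Raising instead of deduplicating on `MatchID` alone keeps conflicting rows visible rather than arbitrarily discarding one of them.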


### 3.3 Encoding problems and corrupted data

In [19]:
# Check the unique team names
equipes_home = df_raw['Home Team Name'].unique()
equipes_away = df_raw['Away Team Name'].unique()
toutes_equipes = set(equipes_home) | set(equipes_away)

print(f"Number of unique teams: {len(toutes_equipes)}")
print("\nTeam list:")
for eq in sorted(toutes_equipes):
    print(f"  - '{eq}'")

Number of unique teams: 32

Team list:
  - 'Algeria'
  - 'Argentina'
  - 'Australia'
  - 'Belgium'
  - 'Brazil'
  - 'Cameroon'
  - 'Chile'
  - 'Colombia'
  - 'Costa Rica'
  - 'Croatia'
  - 'Cï¿½te d'Ivoire'
  - 'Ecuador'
  - 'England'
  - 'France'
  - 'Germany'
  - 'Ghana'
  - 'Greece'
  - 'Honduras'
  - 'IR Iran'
  - 'Italy'
  - 'Japan'
  - 'Korea Republic'
  - 'Mexico'
  - 'Netherlands'
  - 'Nigeria'
  - 'Portugal'
  - 'Russia'
  - 'Spain'
  - 'Switzerland'
  - 'USA'
  - 'Uruguay'
  - 'rn">Bosnia and Herzegovina'
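
Two names in this list are visibly corrupted: one contains the sequence `ï¿½` and one starts with what looks like HTML residue. The sketch below shows one way such values could be flagged and corrected by hand. The `SUSPECT` pattern, the `flag_suspect_values` helper and the `NAME_FIXES` mapping are assumptions for illustration. The first lines also check that `ï¿½` is exactly what the UTF-8 bytes of the Unicode replacement character U+FFFD look like when decoded as latin-1, which suggests the accented letter was already lost before this file was written, so changing the encoding alone would probably not restore it.

```python
import re
import pandas as pd

# U+FFFD (the Unicode replacement character) encoded as UTF-8,
# then decoded as latin-1, gives the three-character sequence seen above.
MOJIBAKE = "\ufffd".encode("utf-8").decode("latin-1")

# Assumed heuristic: any non-ASCII character or HTML-like residue is suspect.
SUSPECT = re.compile(r'[^\x00-\x7F]|[<>"]')

def flag_suspect_values(values):
    """Return the sorted unique values that match the SUSPECT pattern."""
    return sorted(v for v in set(values) if SUSPECT.search(v))

# Hypothetical manual corrections for the two names flagged in the list above
NAME_FIXES = {
    "Cï¿½te d'Ivoire": "Côte d'Ivoire",
    'rn">Bosnia and Herzegovina': "Bosnia and Herzegovina",
}

names = pd.Series(["Brazil", "Cï¿½te d'Ivoire", 'rn">Bosnia and Herzegovina', "IR Iran"])
suspects = flag_suspect_values(names)
cleaned = names.replace(NAME_FIXES)
print(suspects)
print(cleaned.tolist())
```

A correctly encoded name such as "Côte d'Ivoire" would also match the non-ASCII rule, so the flags are a list for manual review rather than a list of errors.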


In [21]:
# Check the cities
villes = df_raw['City'].unique()
print(f"Unique cities ({len(villes)}):")
for v in sorted(villes):
    print(f"  - '{v}'")

Unique cities (12):
  - 'Belo Horizonte '
  - 'Brasilia '
  - 'Cuiaba '
  - 'Curitiba '
  - 'Fortaleza '
  - 'Manaus '
  - 'Natal '
  - 'Porto Alegre '
  - 'Recife '
  - 'Rio De Janeiro '
  - 'Salvador '
  - 'Sao Paulo '
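
Every city name carries a trailing space. A possible cleaning step is to strip whitespace from the text columns. The sketch below also converts strings that become empty into missing values, on the assumption that the blank `Win conditions` cells shown by `head()` are whitespace-only strings (which would explain why `info()` reports 80 non-null values for that column). The `clean_text_columns` helper and the toy `sample` frame are hypothetical.

```python
import pandas as pd

def clean_text_columns(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Strip surrounding whitespace and turn empty strings into missing values."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.strip().replace("", pd.NA)
    return out

# Toy rows; the whitespace-only 'Win conditions' value is an assumption
sample = pd.DataFrame({
    "City": ["Sao Paulo ", "Belo Horizonte "],
    "Win conditions": [" ", "Brazil win on penalties (3 - 2) "],
})

cleaned = clean_text_columns(sample, ["City", "Win conditions"])
print(cleaned["City"].tolist())
print(cleaned["Win conditions"].isna().tolist())
```

Passing the column list explicitly avoids relying on dtype detection, which differs between pandas versions for string columns.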


### 3.4 Date format analysis

In [22]:
# Sample of the date formats
print("Date format examples:")
print(df_raw['Datetime'].head(10).tolist())

Date format examples:
['12 Jun 2014 - 17:00 ', '13 Jun 2014 - 13:00 ', '13 Jun 2014 - 16:00 ', '13 Jun 2014 - 18:00 ', '14 Jun 2014 - 13:00 ', '14 Jun 2014 - 16:00 ', '14 Jun 2014 - 18:00 ', '14 Jun 2014 - 22:00 ', '15 Jun 2014 - 13:00 ', '15 Jun 2014 - 16:00 ']
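
The sampled values share one layout, `DD Mon YYYY - HH:MM`, followed by a trailing space. If the whole column follows that layout (only the first ten values are shown here), it could be parsed as sketched below. The `DATETIME_FORMAT` constant is an assumed name, and `%b` expects English month abbreviations.

```python
import pandas as pd

# Assumed format string matching the samples above, e.g. '12 Jun 2014 - 17:00'
DATETIME_FORMAT = "%d %b %Y - %H:%M"

raw = pd.Series(["12 Jun 2014 - 17:00 ", "13 Jun 2014 - 13:00 "])

# Strip the trailing space first, then parse with an explicit format
parsed = pd.to_datetime(raw.str.strip(), format=DATETIME_FORMAT)
print(parsed)
```

With an explicit `format`, a value that does not match raises an error by default, which makes any differently formatted dates visible instead of parsing them silently.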


### 3.5 Checking the competition stages (Stage)

In [23]:
# Competition stages
stages = df_raw['Stage'].value_counts()
print("Competition stages:")
print(stages)

Competition stages:
Stage
Round of 16                 16
Quarter-finals               8
Group A                      6
Group B                      6
Group C                      6
Group D                      6
Group E                      6
Group F                      6
Group G                      6
Group H                      6
Semi-finals                  4
Play-off for third place     2
Final                        2
Name: count, dtype: int64
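
These counts can be compared with the 2014 tournament format: eight groups of six matches, then 8, 4, 2, 1 and 1 matches in the knockout rounds, for 64 matches in total. The sketch below copies the counts printed above into a Series and computes the excess per stage. `EXPECTED` and `stage_excess` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Expected matches per stage for the 32-team format used in 2014
EXPECTED = {
    **{f"Group {g}": 6 for g in "ABCDEFGH"},
    "Round of 16": 8,
    "Quarter-finals": 4,
    "Semi-finals": 2,
    "Play-off for third place": 1,
    "Final": 1,
}

# Counts copied from the value_counts() output above
observed = pd.Series({
    **{f"Group {g}": 6 for g in "ABCDEFGH"},
    "Round of 16": 16,
    "Quarter-finals": 8,
    "Semi-finals": 4,
    "Play-off for third place": 2,
    "Final": 2,
})

def stage_excess(observed: pd.Series, expected: dict) -> dict:
    """Return observed minus expected count for each stage that differs."""
    return {
        stage: int(observed.get(stage, 0)) - n
        for stage, n in expected.items()
        if observed.get(stage, 0) != n
    }

excess = stage_excess(observed, EXPECTED)
print(excess)
print("Total excess rows:", sum(excess.values()))
```

Every knockout stage appears twice as often as expected while the group stages match, and the total excess equals the 16 duplicates found in section 3.2.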
