# **Update Transfermarkt Datasets**

This notebook downloads the Transfermarkt football dataset, published as a Croissant JSON-LD [dataset on Kaggle](https://www.kaggle.com/datasets/davidcariboo/player-scores), and stores it locally. It uses the `mlcroissant` library to fetch the dataset and `pandas` to convert its record sets into DataFrames. The notebook iterates through all available record sets (e.g., `appearances.csv`, `club_games.csv`, `clubs.csv`), converts each one into a DataFrame, and saves it as a local CSV file. The resulting files make the Transfermarkt data available in a local environment for my MCP project, where they feed subsequent processing and analysis.

In [3]:
#!pip install mlcroissant

In [4]:
import mlcroissant as mlc
import pandas as pd
import os

Set the output directory where the datasets will be stored.

In [5]:
output_dir = '/data'
os.makedirs(output_dir, exist_ok=True)
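
On Windows, a path such as `'/data'` is resolved against the root of the current drive, and `os.path.join` then mixes separators (for example `/data\appearances.csv`). A minimal sketch of a more portable alternative with `pathlib` follows; the helper name `resolve_output_file` is hypothetical and not part of the notebook.

```python
import tempfile
from pathlib import Path


def resolve_output_file(output_dir, record_set_name):
    """Create output_dir if needed and return the target path for one record set.

    Hypothetical helper: the notebook itself uses os.makedirs and os.path.join.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # same effect as os.makedirs(..., exist_ok=True)
    return out / record_set_name  # pathlib picks the right separator for the platform


# Demonstration in a temporary directory so nothing is written to a real data folder
with tempfile.TemporaryDirectory() as tmp:
    target = resolve_output_file(Path(tmp) / "data", "appearances.csv")
    print(target.name)
    print(target.parent.is_dir())
```

The returned `Path` can be passed directly to `DataFrame.to_csv`, which accepts path-like objects.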

In [2]:

# Fetch the Croissant JSON-LD
croissant_dataset = mlc.Dataset('https://www.kaggle.com/datasets/davidcariboo/player-scores/croissant/download')

# Check what record sets are in the dataset
record_sets = croissant_dataset.metadata.record_sets
print(record_sets)

# Fetch the records and put them in a DataFrame
record_set_df = pd.DataFrame(croissant_dataset.records(record_set=record_sets[0].uuid))
record_set_df.head() 

  -  [Metadata(Football Data from Transfermarkt)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.


[RecordSet(uuid="appearances.csv"), RecordSet(uuid="club_games.csv"), RecordSet(uuid="clubs.csv"), RecordSet(uuid="competitions.csv"), RecordSet(uuid="game_events.csv"), RecordSet(uuid="game_lineups.csv"), RecordSet(uuid="games.csv"), RecordSet(uuid="player_valuations.csv"), RecordSet(uuid="players.csv"), RecordSet(uuid="transfers.csv")]


Downloading https://www.kaggle.com/api/v1/datasets/download/davidcariboo/player-scores?datasetVersionNumber=602...: 100%|██████████| 165M/165M [00:45<00:00, 3.83MiB/s] 


|  | appearances.csv/appearance_id | appearances.csv/game_id | appearances.csv/player_id | appearances.csv/player_club_id | appearances.csv/player_current_club_id | appearances.csv/date | appearances.csv/player_name | appearances.csv/competition_id | appearances.csv/yellow_cards | appearances.csv/red_cards | appearances.csv/goals | appearances.csv/assists | appearances.csv/minutes_played |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b'2231978_38004' | b'2231978' | b'38004' | b'853' | b'235' | 2012-07-03 | b'Aur\xc3\xa9lien Joachim' | b'CLQ' | 0 | 0 | 2 | 0 | 90 |
| 1 | b'2233748_79232' | b'2233748' | b'79232' | b'8841' | b'2698' | 2012-07-05 | b'Ruslan Abyshov' | b'ELQ' | 0 | 0 | 0 | 0 | 90 |
| 2 | b'2234413_42792' | b'2234413' | b'42792' | b'6251' | b'465' | 2012-07-05 | b'Sander Puri' | b'ELQ' | 0 | 0 | 0 | 0 | 45 |
| 3 | b'2234418_73333' | b'2234418' | b'73333' | b'1274' | b'6646' | 2012-07-05 | b'Vegar Hedenstad' | b'ELQ' | 0 | 0 | 0 | 0 | 90 |
| 4 | b'2234421_122011' | b'2234421' | b'122011' | b'195' | b'3008' | 2012-07-05 | b'Markus Henriksen' | b'ELQ' | 0 | 0 | 0 | 1 | 90 |
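
The preview above shows text values as byte strings (such as `b'2231978'`) and column names prefixed with the record set name (such as `appearances.csv/game_id`). The following is a minimal sketch of one way to tidy such a DataFrame before saving it. The function name `clean_record_set` is hypothetical, UTF-8 encoding is an assumption based on the `\xc3\xa9` sequence in the sample, and the sample rows are copied from the preview rather than fetched from Kaggle.

```python
import pandas as pd


def clean_record_set(df, record_set_name):
    """Return a copy with the record-set prefix stripped and bytes decoded to str.

    Hypothetical helper; assumes byte values are UTF-8 encoded.
    """
    prefix = f"{record_set_name}/"

    def strip_prefix(column):
        # Only touch string column names that actually start with the prefix
        if isinstance(column, str) and column.startswith(prefix):
            return column[len(prefix):]
        return column

    cleaned = df.rename(columns=strip_prefix)  # rename returns a new DataFrame

    for column in cleaned.columns:
        if cleaned[column].dtype == object:
            # Decode bytes cell by cell; leave any other value untouched
            cleaned[column] = cleaned[column].map(
                lambda value: value.decode("utf-8") if isinstance(value, bytes) else value
            )
    return cleaned


# Small sample copied from the preview of appearances.csv
sample = pd.DataFrame({
    "appearances.csv/appearance_id": [b"2231978_38004"],
    "appearances.csv/player_name": [b"Aur\xc3\xa9lien Joachim"],
    "appearances.csv/goals": [2],
})

cleaned = clean_record_set(sample, "appearances.csv")
print(cleaned)
```

Without a step like this, `to_csv` writes the Python representation of each bytes object (including the `b'...'` wrapper) into the file, so decoding first tends to produce CSVs that are easier to reload.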


In [None]:
for record_set in record_sets:
    try:
        # Load every record of this record set into a DataFrame
        record_set_df = pd.DataFrame(croissant_dataset.records(record_set=record_set.uuid))

        # Skip record sets that contain no rows
        if record_set_df.empty:
            print(f"Record set {record_set.uuid} is empty; it will not be saved.")
            continue

        # The uuid doubles as the file name (e.g. appearances.csv)
        output_file = os.path.join(output_dir, record_set.uuid)
        record_set_df.to_csv(output_file, index=False, encoding='utf-8')
        print(f"Saved {output_file} successfully.")

        print(f"First rows of record set {record_set.uuid}:")
        print(record_set_df.head())

    except Exception as e:
        # Report the failure and continue with the next record set
        print(f"Error processing record set {record_set.uuid}: {e}")

Saved /data\appearances.csv successfully.
First rows of record set appearances.csv:
  appearances.csv/appearance_id appearances.csv/game_id  \
0              b'2231978_38004'              b'2231978'   
1              b'2233748_79232'              b'2233748'   
2              b'2234413_42792'              b'2234413'   
3              b'2234418_73333'              b'2234418'   
4             b'2234421_122011'              b'2234421'   

  appearances.csv/player_id appearances.csv/player_club_id  \
0                  b'38004'                         b'853'   
1                  b'79232'                        b'8841'   
2                  b'42792'                        b'6251'   
3                  b'73333'                        b'1274'   
4                 b'122011'                         b'195'   

  appearances.csv/player_current_club_id appearances.csv/date  \
0                                 b'235'           2012-07-03   
1                                b'2698'           20