# `stage` Data

The `stage` data provides general information about a stage, including the stage code (identifier), stage number, stage name, start and end dates, and cancellation status. The stage data also includes `sectors` information, which identifies link sectors and competitive stage sectors. Within a competitive stage sector, a list of `sections` gives the start and end distances of each section into the competitive stage, along with the terrain type for each section.

In [1]:
# Load in the required packages
import pandas as pd
from jupyterlite_simple_cors_proxy import furl, xurl

# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels

In [2]:
# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"

# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"

# Define the URL
stage_url = dakar_api_template.format(path=stage_path)

# Preview the path and the URL
stage_path, stage_url

('stage-2025-A',
 'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')

In [3]:
# Load in data
# Use furl() to handle CORS issues in Jupyterlite
stage_df = pd.read_json(furl(stage_url))
stage_df.columns

Index(['stage', 'sectors', 'updatedAt', 'code', 'mapDisplay', 'isCancelled',
       'isDelayed', 'endDate', 'groupsAsCategory', 'mapCategoryDisplay',
       'lastColumnDisplay', 'stageLangs', 'date', 'type', 'timezone', 'length',
       'stageWithBonus', 'generalDisplay', 'podiumDisplay', 'startDate',
       'marathon', '_bind', '_origin', '_id', '_key', '_updatedAt', '_parent',
       'qualities', '$category', '_gets', 'refueling', 'shortLabel', 'label',
       'promotionalDisplay', 'position', 'reference', 'lastStage',
       'liveDisplay', 'categoryLangs'],
      dtype='object')

Inspecting the column names, I wonder whether the `stageLangs` column defines a set of label language mappings?

In [4]:
# Preview the first row of the stageLangs column
stage_df["stageLangs"][0]

[{'text': 'RIYADH > HARADH', 'variable': 'stage.name.09000', 'locale': 'en'},
 {'text': 'RIYADH > HARADH', 'locale': 'fr', 'variable': 'stage.name.09000'},
 {'locale': 'es', 'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH'},
 {'variable': 'stage.name.09000', 'text': 'الرياض > حرض', 'locale': 'ar'}]

It looks like it does, so we can flatten that data into the dataframe using the function we defined in the previous chapter. But what column do we need to merge against?

In [5]:
# Preview the values in the first row in the dataframe
stage_df.iloc[0]

stage                                                               9.0
sectors               [{'startTime': '2025-01-14T04:05:00+00:00', 't...
updatedAt                                     2025-01-11T14:53:48+01:00
code                                                              09000
mapDisplay                                                         True
isCancelled                                                         0.0
isDelayed                                                           0.0
endDate                                       2025-01-14T00:00:00+03:00
groupsAsCategory                                                   None
mapCategoryDisplay                                                    A
lastColumnDisplay                                                    ce
stageLangs            [{'text': 'RIYADH > HARADH', 'variable': 'stag...
date                                          2025-01-14 00:00:00+03:00
type                                                            

Inspection of the columns in the original dataframe suggests there is no direct mapping to the `variable` value, which takes the form `stage.name.09000`. We do note, however, that there is a `code` value that matches the final part of `variable`, so we can create our own merge column, which we might call `variable`, and then merge the language labels in against it.

In [6]:
# Create a dummy column to match on
stage_df["variable"] = "stage.name." + stage_df["code"]

# Update the dataframe by using our new function to
# merge in the exploded and widened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")

# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()

Unnamed: 0,code,en,ar
0,09000,RIYADH > HARADH,الرياض > حرض
1,07000,AL DUWADIMI > AL DUWADIMI,الدوادمي > الدوادمي
2,06000,HAIL > AL DUWADIMI,حائل > الدوادمي
3,10000,HARADH > SHUBAYTAH,حرض > شبيطة
4,0P000,BISHA,BISHA
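
For readers without the previous chapter to hand, the merge step can be sketched roughly as follows. This is a minimal sketch and not the actual `mergeInLangLabels` implementation from `dakar_utils_2025`: the function name `merge_lang_labels_sketch` is hypothetical, and the sketch assumes that each label is a dict with `variable`, `locale` and `text` keys, as in the `stageLangs` preview above.

```python
import pandas as pd


def merge_lang_labels_sketch(df, col, key="variable"):
    """Hypothetical stand-in for mergeInLangLabels (an assumption, not the real code)."""
    # One row per label dict: explode the lists, then expand each dict into columns
    labels = pd.json_normalize(df[col].explode().dropna().tolist())
    # Reshape to one row per key and one column per locale
    wide = (
        labels.drop_duplicates(subset=[key, "locale"])
        .pivot(index=key, columns="locale", values="text")
        .reset_index()
    )
    wide.columns.name = None
    # Attach the locale columns to the original rows
    return df.merge(wide, on=key, how="left")


# Toy dataframe using two of the labels shown in the stageLangs preview
toy_df = pd.DataFrame({
    "code": ["09000"],
    "stageLangs": [[
        {"text": "RIYADH > HARADH", "variable": "stage.name.09000", "locale": "en"},
        {"text": "الرياض > حرض", "variable": "stage.name.09000", "locale": "ar"},
    ]],
})
toy_df["variable"] = "stage.name." + toy_df["code"]

merged = merge_lang_labels_sketch(toy_df, "stageLangs")
print(merged[["code", "en", "ar"]])
```

The left merge keeps every original row, even where no label matches the key.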


Let's skim the first few rows of the table, also adding a new column to identify the stage code more explicitly.

In [7]:
stage_df['stage_code'] = stage_df['code']
stage_df.sort_values("startDate", inplace=True)
stage_df.reset_index(drop=True, inplace=True)

stage_df.head()

Unnamed: 0,stage,sectors,updatedAt,code,mapDisplay,isCancelled,isDelayed,endDate,groupsAsCategory,mapCategoryDisplay,...,position,reference,lastStage,liveDisplay,categoryLangs,ar,en,es,fr,stage_code
0,0.0,"[{'groupsLength': [], 'powerStage': False, 'ty...",2025-01-08T17:24:08+01:00,0P000,True,0.0,0.0,2025-01-03T00:00:00+03:00,,,...,,,,,,BISHA,BISHA,BISHA,BISHA,0P000
1,1.0,"[{'length': 86, 'arrivalTime': '2025-01-04T06:...",2025-01-08T17:24:08+01:00,01000,True,0.0,0.0,2025-01-04T00:00:00+03:00,,A,...,,,,,,بيشة > بيشة,BISHA > BISHA,BISHA > BISHA,BISHA > BISHA,01000
2,2.0,"[{'type': 'LIA', 'code': '02100', 'groupsLengt...",2025-01-08T17:24:08+01:00,02000,True,0.0,0.0,2025-01-06T00:00:00+03:00,,M,...,,,,,,بيشة > بيشة,BISHA > BISHA,BISHA > BISHA,BISHA > BISHA,02000
3,3.0,"[{'id': 23867, 'length': 314, 'powerStage': Fa...",2025-01-08T17:24:08+01:00,03000,True,0.0,0.0,2025-01-07T00:00:00+03:00,,A,...,,,,,,بيشة > الحناكية,BISHA > AL HENAKIYAH,BISHA > AL HENAKIYAH,BISHA > AL HENAKIYAH,03000
4,4.0,"[{'grounds': [{'percentage': 30, 'sections': [...",2025-01-08T17:24:08+01:00,04000,True,0.0,0.0,2025-01-08T00:00:00+03:00,,A,...,,,,,,الحناكية > العلا,AL HENAKIYAH > ALULA,AL HENAKIYAH > ALULA,AL HENAKIYAH > ALULA,04000


From the date columns, we notice that the dataframe as loaded is not ordered. To provide a more natural ordering, we can order the rows by start date, as in the cell above, or by stage number, which appears to be given by the `stage` column.
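
As a minimal sketch of the two orderings, the toy dataframe below reuses three `code`, `stage` and `endDate` values from the previews above. The name `toy_stages` is made up for illustration, and parsing the dates with `utc=True` is an assumption made here so that the mixed-offset strings compare cleanly.

```python
import pandas as pd

# Three rows copied from the previews above, in their original (unordered) sequence
toy_stages = pd.DataFrame({
    "code": ["09000", "0P000", "01000"],
    "stage": [9.0, 0.0, 1.0],
    "endDate": [
        "2025-01-14T00:00:00+03:00",
        "2025-01-03T00:00:00+03:00",
        "2025-01-04T00:00:00+03:00",
    ],
})

# Parse the ISO strings into timezone-aware datetimes (normalised to UTC)
toy_stages["endDate"] = pd.to_datetime(toy_stages["endDate"], utc=True)

# Order by stage number, or alternatively by date
by_stage = toy_stages.sort_values("stage").reset_index(drop=True)
by_date = toy_stages.sort_values("endDate").reset_index(drop=True)

print(by_stage["code"].tolist())
print(by_date["code"].tolist())
```

For these three rows the two orderings agree; whether that holds for every stage would need checking against the full data.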

Inspection of the data also reveals that the `sectors` column typically appears to contain three elements: a link sector, the competitive stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of `grounds`, or terrain types, with associated language mappings. Within each ground is a list of `sections`, each with a `section` number and `start` and `finish` values that give the start and finish distances, in kilometres, into the stage for that section.

## Parsing each row

Let's get the JSON `sectors` data for a single row and see if we can find a sensible way of parsing that.

I am imagining producing something like two linked dataframes:

- one containing the top-level sector data for each stage;
- one containing the surface type by section; I imagine this dataframe to have columns `stage`, `sector`, `section`, `start`, `finish`, `surface_type` and then perhaps language mappings for the surface type.

Let's start by looking at the metadata for each sector.

In [8]:
sectors_df = pd.json_normalize(stage_df["sectors"].explode())

sectors_df.head()

Unnamed: 0,groupsLength,powerStage,type,id,startTime,arrivalTime,length,code,grounds,groupsLength.A_T5
0,[],False,LIA,23800,2025-01-03T06:45:00+00:00,2025-01-03T07:35:00+00:00,26,0P100,,
1,[],False,SPE,23801,2025-01-03T07:35:00+00:00,2025-01-03T07:54:00+00:00,29,0P200,"[{'groundLangs': [{'text': 'Dirt Track', 'vari...",
2,[],False,LIA,23802,2025-01-03T07:54:00+00:00,2025-01-03T08:24:00+00:00,24,0P300,,
3,[],False,LIA,23803,2025-01-04T04:30:00+00:00,2025-01-04T06:25:00+00:00,86,01100,,
4,[],False,SPE,23804,2025-01-04T06:25:00+00:00,2025-01-04T10:35:00+00:00,413,01200,"[{'groundLangs': [{'text': 'Dirt Track', 'loca...",


The stage number is not explicitly listed, but we can derive it from the first two characters of the `code`.

We can also add a "sector number".

In [9]:
# Generate an appropriate stage code
sectors_df['stage_code'] = sectors_df['code'].str[:2] + '000'

# Generate a sector number
sectors_df['sector_number'] = sectors_df.groupby('stage_code').cumcount() + 1

sectors_df.head()

Unnamed: 0,groupsLength,powerStage,type,id,startTime,arrivalTime,length,code,grounds,groupsLength.A_T5,stage_code,sector_number
0,[],False,LIA,23800,2025-01-03T06:45:00+00:00,2025-01-03T07:35:00+00:00,26,0P100,,,0P000,1
1,[],False,SPE,23801,2025-01-03T07:35:00+00:00,2025-01-03T07:54:00+00:00,29,0P200,"[{'groundLangs': [{'text': 'Dirt Track', 'vari...",,0P000,2
2,[],False,LIA,23802,2025-01-03T07:54:00+00:00,2025-01-03T08:24:00+00:00,24,0P300,,,0P000,3
3,[],False,LIA,23803,2025-01-04T04:30:00+00:00,2025-01-04T06:25:00+00:00,86,01100,,,01000,1
4,[],False,SPE,23804,2025-01-04T06:25:00+00:00,2025-01-04T10:35:00+00:00,413,01200,"[{'groundLangs': [{'text': 'Dirt Track', 'loca...",,01000,2


If the sector is an `SPE` type, we have the `grounds` information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a "percentage" value.

In [10]:
# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
    axis="index").explode('grounds').reset_index(drop=True)

competitive_sectors.head()

Unnamed: 0,grounds,code
0,"{'groundLangs': [{'text': 'Dirt Track', 'varia...",0P200
1,"{'sections': [{'start': 14, 'section': 7, 'fin...",0P200
2,"{'color': '#efc07c', 'name': 'ground.name.3', ...",0P200
3,"{'name': 'ground.name.1', 'color': '#753a05', ...",0P200
4,"{'sections': [{'section': 7, 'start': 14, 'fin...",0P200


Tidy the sectors dataframe: 

In [13]:
# Sort sectors by stage and sector
sectors_df.sort_values("code", inplace=True)
# Simplify the sectors dataframe
sectors_df = sectors_df[["stage_code", "code", "id", "sector_number", "powerStage",
                             "length", "startTime", "type", "arrivalTime"]].reset_index(drop=True)

sectors_df.head()

Unnamed: 0,stage_code,code,id,sector_number,powerStage,length,startTime,type,arrivalTime
0,01000,01100,23803,1,False,86,2025-01-04T04:30:00+00:00,LIA,2025-01-04T06:25:00+00:00
1,01000,01200,23804,2,False,413,2025-01-04T06:25:00+00:00,SPE,2025-01-04T10:35:00+00:00
2,02000,02100,23805,1,False,45,2025-01-05T03:25:00+00:00,LIA,2025-01-05T04:40:00+00:00
3,02000,02200,23806,2,False,967,2025-01-05T04:40:00+00:00,SPE,2025-01-05T13:00:00+00:00
4,02000,02300,23807,3,False,46,2025-01-05T13:00:00+00:00,LIA,2025-01-05T14:00:00+00:00


And simplify and tidy the "top level" stages dataframe:

In [12]:
stage_cols = ['stage_code', 'stage', 'date', 'startDate', 'endDate', 'isCancelled', 'generalDisplay', 'isDelayed', 'marathon',
              'length', 'type',  'timezone', 'stageWithBonus', 'mapCategoryDisplay', 'podiumDisplay', '_bind', 'ar', 'en', 'es', 'fr']
stage_df = stage_df[stage_cols]

Let's see if we can pull out the data for the mixed surface types in a sensible way.

In [13]:
competitive_sectors.iloc[0].to_dict()

{'grounds': {'groundLangs': [{'text': 'Dirt Track',
    'variable': 'ground.name.1',
    'locale': 'en'},
   {'locale': 'fr', 'variable': 'ground.name.1', 'text': 'Piste terre'},
   {'text': 'Pista de tierra', 'variable': 'ground.name.1', 'locale': 'es'},
   {'variable': 'ground.name.1', 'text': 'الحجارة', 'locale': 'ar'}],
  'sections': [{'start': 0, 'finish': 1, 'section': 1},
   {'section': 3, 'start': 2, 'finish': 2},
   {'finish': 6, 'start': 6, 'section': 5}],
  'percentage': 6,
  'color': '#753a05',
  'name': 'ground.name.1'},
 'code': '0P200'}

We can pull the data out into three components:

- the percentage of each surface type by stage
- the surface type for each section
- the surface type language labels

In [14]:
def flatten_grounds_data(df):
    """
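    Flatten nested grounds data into three DataFrames.

    Parameters:
    df (pandas.DataFrame): DataFrame with 'grounds' and 'code' columns, where
        each 'grounds' value is a nested dictionary for one surface type

    Returns:
    tuple of pandas.DataFrame: (section_df, percentage_df, surfaces_df), i.e.
        one row per section, one row per surface percentage record, and
        one row per surface type with its language labels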
    """

    # Create empty lists to store flattened data
    flattened_data = []
    percentage_data = []
    surface_types = []
    _surface_types = []

    # Sort by sector code, without modifying the caller's dataframe in place
    df = df.sort_values("code")

    # Process each grounds record in turn
    for _, row in df.iterrows():
        ground_data = row['grounds']

        # Create a translations dictionary
        translations = {f"text_{lang['locale']}": lang['text']
                        for lang in ground_data['groundLangs']}

        _stype = translations["text_en"].lower()

        percentage_data.append(
            {'code': row['code'],
             'percentage': ground_data['percentage'],
             # 'ground_name': ground_data['name'],
             'color': ground_data['color'], "type": _stype})

        if _stype not in _surface_types:
            _surface_types.append(_stype)
            surface_types.append({"type": _stype, **translations})

        # Create a record for each section
        for section in ground_data['sections']:
            section_record = {
                'code': row['code'],
                'ground_name': ground_data['name'],
                'section': section['section'],
                'start': section['start'],
                'finish': section['finish'],
                'color': ground_data['color'],
                # 'percentage': ground_data['percentage'],
                "type": _stype
            }
            flattened_data.append(section_record)

    # Create DataFrame from flattened data
    section_df = pd.DataFrame(flattened_data)
    percentage_df = pd.DataFrame(percentage_data)
    surfaces_df = pd.DataFrame(surface_types)

    # Sort columns for better organization
    # Ignore: 'percentage', 'ground_name',
    fixed_columns = ['code',  'section', 'start', 'finish',
                     'color', "type"]
    lang_columns = [
        col for col in section_df.columns if col.startswith('text_')]
    section_df = section_df[fixed_columns + sorted(lang_columns)]

    # Drop duplicates if any still exist
    section_df = section_df.drop_duplicates()
    section_df = section_df.sort_values(
        ['code', 'section'])
    section_df.reset_index(drop=True, inplace=True)

    return section_df, percentage_df, surfaces_df

In [15]:
section_surfaces, stage_surfaces, surfaces = flatten_grounds_data(
    competitive_sectors)
section_surfaces.head()

Unnamed: 0,code,section,start,finish,color,type
0,01200,1,0,27,#efc07c,sand
1,01200,2,27,32,#753a05,dirt track
2,01200,3,32,32,#1dc942,gravel track
3,01200,4,32,41,#efc07c,sand
4,01200,5,41,42,#753a05,dirt track
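
As an aside, the per-section flattening for a single grounds record could also be sketched with `pd.json_normalize`, using its `record_path` and `meta` arguments. This is an alternative sketch rather than the approach taken by `flatten_grounds_data()`; the record is copied from the preview of the first `competitive_sectors` row above, and the names `ground` and `sections_sketch` are made up for illustration.

```python
import pandas as pd

# One grounds record, copied from the preview of the first row above
# (the groundLangs labels are omitted for brevity)
ground = {
    "name": "ground.name.1",
    "color": "#753a05",
    "percentage": 6,
    "sections": [
        {"start": 0, "finish": 1, "section": 1},
        {"section": 3, "start": 2, "finish": 2},
        {"finish": 6, "start": 6, "section": 5},
    ],
}

# One row per section, carrying the ground-level fields alongside
sections_sketch = pd.json_normalize(
    [ground], record_path="sections", meta=["name", "color", "percentage"]
)
# Add the sector code that the record belongs to
sections_sketch["code"] = "0P200"

print(sections_sketch)
```

The explicit loop in `flatten_grounds_data()` is longer, but it makes it easier to collect the percentage and language label tables in the same pass.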


In [16]:
stage_surfaces.head(10)

Unnamed: 0,code,percentage,color,type
0,01200,18,#753a05,dirt track
1,01200,18,#753a05,dirt track
2,01200,28,#1dc942,gravel track
3,01200,53,#efc07c,sand
4,01200,18,#753a05,dirt track
5,01200,28,#1dc942,gravel track
6,01200,53,#efc07c,sand
7,01200,18,#753a05,dirt track
8,01200,53,#efc07c,sand
9,01200,28,#1dc942,gravel track
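
The preview contains repeated rows, because the percentage records are not de-duplicated inside `flatten_grounds_data()`. A minimal sketch of tidying them afterwards is shown below, on a toy copy of the ten rows above. The name `preview_rows` is made up, and the sector code is assumed to be the five-character string `01200`.

```python
import pandas as pd

# Toy copy of the ten preview rows shown above
preview_rows = pd.DataFrame({
    "code": ["01200"] * 10,
    "percentage": [18, 18, 28, 53, 18, 28, 53, 18, 53, 28],
    "color": ["#753a05", "#753a05", "#1dc942", "#efc07c", "#753a05",
              "#1dc942", "#efc07c", "#753a05", "#efc07c", "#1dc942"],
    "type": ["dirt track", "dirt track", "gravel track", "sand", "dirt track",
             "gravel track", "sand", "dirt track", "sand", "gravel track"],
})

# Keep the first occurrence of each distinct row
deduped = preview_rows.drop_duplicates().reset_index(drop=True)

print(deduped)
```

On the real data, the same call would be `stage_surfaces.drop_duplicates()`.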


In [17]:
surfaces

Unnamed: 0,type,text_en,text_fr,text_es,text_ar
0,dirt track,Dirt Track,Piste terre,Pista de tierra,الحجارة
1,gravel track,Gravel Track,Piste empierrée,Tierra,التراب
2,sand,Sand,Sable,Arena,الرمال
3,asphalt road,Asphalt Road,Goudron,Asfalto,الزفت
4,dunes,Dunes,Dunes,Dunas,الكثبان


In [18]:
stage_surfaces[stage_surfaces["code"]=="11200"]

Unnamed: 0,code,percentage,color,type
121,11200,60,#efc07c,sand
122,11200,40,#ff7200,dunes
123,11200,38,#efc07c,sand
124,11200,61,#ff7200,dunes
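
The rows for sector `11200` show two different sand and dunes splits, so the percentages recorded for a sector do not always describe a single breakdown. A quick check for such sectors might be sketched as follows, under the assumption (not confirmed by the data source) that one breakdown should total roughly 100 percent. The toy rows are copied from the output above, and the names `rows_11200`, `totals` and `suspect` are made up.

```python
import pandas as pd

# Toy copy of the four rows shown above for sector 11200
rows_11200 = pd.DataFrame({
    "code": ["11200"] * 4,
    "percentage": [60, 40, 38, 61],
    "type": ["sand", "dunes", "sand", "dunes"],
})

# Total percentage per sector, after removing exact duplicate rows
totals = rows_11200.drop_duplicates().groupby("code")["percentage"].sum()

# Sectors whose total is well above 100 probably hold more than one breakdown
suspect = totals[totals > 100]

# Number of distinct percentage values recorded per surface type
variants = rows_11200.groupby(["code", "type"])["percentage"].nunique()

print(suspect)
print(variants)
```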


## Differences Between Stage Data for Different Categories

*Are there actually any differences in the stage data for the different categories?*