# `stage` Data

The `stage` data provides general information about a stage, including the stage code (identifier), stage number, stage name, start and end dates, and cancellation status. It also includes `sectors` information that identifies the link sectors and the competitive sector of each stage. Within a competitive sector, a list of `sections` gives the start and finish distances into the competitive stage, along with the terrain type for each section.

In [1]:
# Load in the required packages
import pandas as pd
from jupyterlite_simple_cors_proxy import furl, xurl

# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels

In [2]:
# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"

# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"

# Define the URL
stage_url = dakar_api_template.format(path=stage_path)

# Preview the path and the URL
stage_path, stage_url

('stage-2025-A',
 'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')

In [3]:
# Load in data
# Use furl() to handle CORS issues in Jupyterlite
stage_df = pd.read_json(furl(stage_url))
stage_df.columns

Index(['groupsAsCategory', 'code', 'endDate', 'isCancelled', 'stageLangs',
       'mapDisplay', 'generalDisplay', 'isDelayed', 'marathon', 'length',
       'type', 'lastColumnDisplay', 'podiumDisplay', 'stage', 'timezone',
       'updatedAt', 'sectors', 'mapCategoryDisplay', 'startDate', 'date',
       'stageWithBonus', '_bind', '_origin', '_id', '_key', '_updatedAt',
       '_parent', 'qualities', '$category', '_gets', 'refueling', 'shortLabel',
       'lastStage', 'categoryLangs', 'label', 'reference', 'position',
       'promotionalDisplay', 'liveDisplay'],
      dtype='object')

Inspecting the column names, I wonder whether the `stageLangs` column defines a set of label language mappings?

In [4]:
# Preview the stageLangs value for the first row
stage_df["stageLangs"][0]

[{'locale': 'en', 'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH'},
 {'variable': 'stage.name.09000', 'locale': 'fr', 'text': 'RIYADH > HARADH'},
 {'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH', 'locale': 'es'},
 {'text': 'الرياض > حرض', 'variable': 'stage.name.09000', 'locale': 'ar'}]
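
Each `stageLangs` value is a list of dictionaries with `locale`, `variable` and `text` keys. As a minimal sketch, assuming that shape holds for every row, a single list can be collapsed into a locale-to-text lookup. The helper name `labels_by_locale` is purely illustrative and is not part of `dakar_utils_2025`.

```python
def labels_by_locale(lang_records):
    """Map each locale code to its label text (illustrative helper)."""
    return {rec["locale"]: rec["text"] for rec in lang_records}


# Sample copied from the first row of the stageLangs column
stage_langs = [
    {"locale": "en", "variable": "stage.name.09000", "text": "RIYADH > HARADH"},
    {"variable": "stage.name.09000", "locale": "fr", "text": "RIYADH > HARADH"},
    {"variable": "stage.name.09000", "text": "RIYADH > HARADH", "locale": "es"},
    {"text": "الرياض > حرض", "variable": "stage.name.09000", "locale": "ar"},
]

labels = labels_by_locale(stage_langs)
print(labels["en"])
```

The key order inside each dictionary varies from record to record, which is why the lookup goes by key name rather than by position.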

It looks like it does, so we can flatten that data using the function we defined in the previous chapter. But which column do we need to merge against?

In [5]:
# Preview the values in the first row in the dataframe
stage_df.iloc[0]

groupsAsCategory                                                   None
code                                                              09000
endDate                                       2025-01-14T00:00:00+03:00
isCancelled                                                         0.0
stageLangs            [{'locale': 'en', 'variable': 'stage.name.0900...
mapDisplay                                                         True
generalDisplay                                                      1.0
isDelayed                                                           0.0
marathon                                                            0.0
length                                                            589.0
type                                                                STA
lastColumnDisplay                                                    ce
podiumDisplay                                                        ce
stage                                                           

Inspection of the columns in the original dataframe suggests there is no direct mapping to the `variable` value, which takes the form `stage.name.09000`. We do note, however, that there is a `code` value that matches the trailing part of `variable`, so we can create our own merge column, which we might call `variable`, and then merge the language labels in against it.

In [6]:
# Create a dummy column to match on
stage_df["variable"] = "stage.name." + stage_df["code"]

# Update the dataframe by using our new function to
# merge in the exploded and widened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")

# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()

Unnamed: 0,code,en,ar
0,09000,RIYADH > HARADH,الرياض > حرض
1,07000,AL DUWADIMI > AL DUWADIMI,الدوادمي > الدوادمي
2,06000,HAIL > AL DUWADIMI,حائل > الدوادمي
3,10000,HARADH > SHUBAYTAH,حرض > شبيطة
4,0P000,BISHA,BISHA
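
The `mergeInLangLabels` function itself was defined in the previous chapter and is not reproduced here. Purely as an assumption about the kind of processing it performs, the following sketch explodes the language records, pivots them to one column per locale, and merges them back on the key column. The name `merge_lang_labels_sketch` is hypothetical, and the real function may differ in its details.

```python
import pandas as pd


def merge_lang_labels_sketch(df, lang_col, key="variable"):
    """Hypothetical stand-in for mergeInLangLabels: explode, widen, merge."""
    # One row per language record
    langs = pd.json_normalize(df[lang_col].explode().dropna().tolist())
    # One row per key value, one column per locale
    wide = (
        langs.drop_duplicates([key, "locale"])
        .pivot(index=key, columns="locale", values="text")
        .reset_index()
    )
    wide.columns.name = None
    # Attach the locale columns and drop the nested source column
    return df.drop(columns=[lang_col]).merge(wide, on=key, how="left")


# Toy data modelled on two of the stages previewed above
toy = pd.DataFrame({
    "code": ["09000", "0P000"],
    "stageLangs": [
        [{"locale": "en", "variable": "stage.name.09000", "text": "RIYADH > HARADH"},
         {"locale": "fr", "variable": "stage.name.09000", "text": "RIYADH > HARADH"}],
        [{"locale": "en", "variable": "stage.name.0P000", "text": "BISHA"},
         {"locale": "fr", "variable": "stage.name.0P000", "text": "BISHA"}],
    ],
})
toy["variable"] = "stage.name." + toy["code"]

merged = merge_lang_labels_sketch(toy, "stageLangs")
print(merged[["code", "en", "fr"]])
```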


Let's skim the first few rows of the table:

In [7]:
stage_df.head(5)

Unnamed: 0,groupsAsCategory,code,endDate,isCancelled,mapDisplay,generalDisplay,isDelayed,marathon,length,type,...,categoryLangs,label,reference,position,promotionalDisplay,liveDisplay,ar,en,es,fr
0,,09000,2025-01-14T00:00:00+03:00,0.0,True,1.0,0.0,0.0,589.0,STA,...,,,,,,,الرياض > حرض,RIYADH > HARADH,RIYADH > HARADH,RIYADH > HARADH
1,,07000,2025-01-12T00:00:00+03:00,0.0,True,1.0,0.0,0.0,716.0,STA,...,,,,,,,الدوادمي > الدوادمي,AL DUWADIMI > AL DUWADIMI,AL DUWADIMI > AL DUWADIMI,AL DUWADIMI > AL DUWADIMI
2,,06000,2025-01-11T00:00:00+03:00,0.0,True,1.0,0.0,0.0,829.0,STA,...,,,,,,,حائل > الدوادمي,HAIL > AL DUWADIMI,HAIL > AL DUWADIMI,HAIL > AL DUWADIMI
3,,10000,2025-01-15T00:00:00+03:00,0.0,True,1.0,0.0,0.0,640.0,STA,...,,,,,,,حرض > شبيطة,HARADH > SHUBAYTAH,HARADH > SHUBAYTAH,HARADH > SHUBAYTAH
4,,0P000,2025-01-03T00:00:00+03:00,0.0,True,0.0,0.0,0.0,77.0,PRO,...,,,,,,,BISHA,BISHA,BISHA,BISHA


From the date columns, we notice that the dataframe is not ordered. To provide a more natural ordering, we could order the rows by start date, or stage number, which appears to be given by the `stage` column, for example.
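
As a minimal sketch of the date-based option, assuming the date columns hold ISO 8601 strings like the `endDate` values in the preview, the rows can be ordered by parsing the dates and sorting on the parsed values. The toy frame below reuses three `code` and `endDate` pairs from the table; `startDate` is assumed to behave the same way.

```python
import pandas as pd

# Three rows copied from the preview, in their original (unordered) sequence
toy = pd.DataFrame({
    "code": ["09000", "07000", "0P000"],
    "endDate": [
        "2025-01-14T00:00:00+03:00",
        "2025-01-12T00:00:00+03:00",
        "2025-01-03T00:00:00+03:00",
    ],
})

# Parse to timezone-aware datetimes, then sort chronologically
ordered = (
    toy.assign(end=pd.to_datetime(toy["endDate"], utc=True))
    .sort_values("end")
    .reset_index(drop=True)
)
print(ordered[["code", "endDate"]])
```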

Inspection of the data also reveals that the `sectors` column typically appears to contain three elements: a link sector, the competitive stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of `grounds`, or terrain types, with associated language mappings. Within each ground is a list of `sections`, each with a `section` number and `start` and `finish` values that give the start and finish distances, in kilometres, into the stage for that section.

## Parsing each row

Let's get the JSON `sectors` data for a single row and see if we can find a sensible way of parsing it.

I am imagining producing something like two linked dataframes:

- one containing the top-level sector data for each stage;
- one containing the surface type by section; I imagine this dataframe to have columns `stage`, `sector`, `section`, `start`, `finish`, `surface_type` and then perhaps language mappings for the surface type.

Let's start by looking at the metadata for each sector.

In [8]:
sectors_df = pd.json_normalize(stage_df["sectors"].explode())

sectors_df.head()

Unnamed: 0,powerStage,code,id,length,startTime,type,arrivalTime,groupsLength,grounds,groupsLength.A_T5
0,False,9100,23937,112,2025-01-14T04:05:00+00:00,LIA,2025-01-14T06:20:00+00:00,[],,
1,False,9200,23938,357,2025-01-14T06:20:00+00:00,SPE,2025-01-14T09:27:00+00:00,[],"[{'name': 'ground.name.2', 'sections': [{'sect...",
2,False,9300,23939,120,2025-01-14T09:27:00+00:00,LIA,2025-01-14T10:47:00+00:00,[],,
3,False,7100,23931,189,2025-01-12T02:10:00+00:00,LIA,2025-01-12T04:35:00+00:00,[],,
4,False,7200,23932,419,2025-01-12T04:35:00+00:00,SPE,2025-01-12T09:45:00+00:00,[],"[{'sections': [{'section': 1, 'finish': 22, 's...",


The stage number is not explicitly listed, but we can derive it from the first two characters of the `code`.
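
A minimal sketch of that derivation, assuming the sector codes are five-character, zero-padded strings such as `09200` (the table preview drops the leading zero). The `stage` and `stage_number` column names are illustrative.

```python
import pandas as pd

# Sector codes as zero-padded strings (assumed format)
sector_codes = pd.DataFrame({"code": ["09100", "09200", "09300", "07100", "07200"]})

# The first two characters identify the stage
sector_codes["stage"] = sector_codes["code"].str[:2]

# Convert to a number where possible; a non-numeric prefix such as
# the "0P" seen in stage code 0P000 would become a missing value
sector_codes["stage_number"] = pd.to_numeric(sector_codes["stage"], errors="coerce")
print(sector_codes)
```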

If the sector is an `SPE` type, we have the `grounds` information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a "percentage" value.

In [9]:
# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
    axis="index").explode('grounds').reset_index(drop=True)

competitive_sectors.head()

Unnamed: 0,grounds,code
0,"{'name': 'ground.name.2', 'sections': [{'secti...",9200
1,"{'sections': [{'section': 2, 'start': 9, 'fini...",9200
2,"{'sections': [{'section': 1, 'finish': 22, 'st...",7200
3,"{'name': 'ground.name.2', 'groundLangs': [{'te...",7200
4,"{'sections': [{'start': 118, 'section': 16, 'f...",7200


Tidy the sectors dataframe: 

In [10]:
# Simplify the sectors dataframe
sectors_df = sectors_df[["powerStage", "code", "id",
                         "length", "startTime", "type", "arrivalTime"]].reset_index(drop=True)

# Sort sectors by stage and sector
sectors_df.sort_values("code", inplace=True)

sectors_df.head()

Unnamed: 0,powerStage,code,id,length,startTime,type,arrivalTime
14,False,1100,23803,86,2025-01-04T04:30:00+00:00,LIA,2025-01-04T06:25:00+00:00
15,False,1200,23804,413,2025-01-04T06:25:00+00:00,SPE,2025-01-04T10:35:00+00:00
29,False,2100,23805,45,2025-01-05T03:25:00+00:00,LIA,2025-01-05T04:40:00+00:00
30,False,2200,23806,967,2025-01-05T04:40:00+00:00,SPE,2025-01-05T13:00:00+00:00
31,False,2300,23807,46,2025-01-05T13:00:00+00:00,LIA,2025-01-05T14:00:00+00:00


Let's see if we can pull out the data for the mixed surface types in a sensible way.

In [11]:
competitive_sectors.iloc[0].to_dict()

{'grounds': {'name': 'ground.name.2',
  'sections': [{'section': 1, 'start': 0, 'finish': 9},
   {'finish': 25, 'section': 3, 'start': 11},
   {'finish': 30, 'section': 5, 'start': 25},
   {'start': 33, 'finish': 35, 'section': 7},
   {'start': 39, 'section': 9, 'finish': 41},
   {'finish': 357, 'start': 43, 'section': 11}],
  'color': '#1dc942',
  'percentage': 96,
  'groundLangs': [{'locale': 'en',
    'text': 'Gravel Track',
    'variable': 'ground.name.2'},
   {'locale': 'fr', 'text': 'Piste empierrée', 'variable': 'ground.name.2'},
   {'locale': 'es', 'variable': 'ground.name.2', 'text': 'Tierra'},
   {'text': 'التراب', 'variable': 'ground.name.2', 'locale': 'ar'}]},
 'code': '09200'}
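
As a rough sanity check on the `percentage` value, the section distances can be summed and compared with the sector length. The sketch below uses the sections from the record above and the length of 357 listed for this sector in the sectors table. It assumes `start` and `finish` are distances in the same units as `length`.

```python
# Sections for the gravel track ground in sector 09200, copied from the record
sections = [
    {"section": 1, "start": 0, "finish": 9},
    {"finish": 25, "section": 3, "start": 11},
    {"finish": 30, "section": 5, "start": 25},
    {"start": 33, "finish": 35, "section": 7},
    {"start": 39, "section": 9, "finish": 41},
    {"finish": 357, "start": 43, "section": 11},
]
sector_length = 357  # from the sectors table

# Total distance covered by this surface type
covered = sum(s["finish"] - s["start"] for s in sections)
share = 100 * covered / sector_length
print(covered, round(share, 1))  # 346 96.9
```

The computed share is close to the reported value of 96. How the API rounds the figure is not something this data tells us.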

We can pull the data out into three components:

- the percentage of each surface type by stage
- the surface type for each section
- the surface type language labels

In [12]:
def flatten_grounds_data(df):
    """
    Flatten nested grounds data into three DataFrames.

    Parameters:
    df (pandas.DataFrame): DataFrame with 'grounds' and 'code' columns, where each 'grounds' value is a nested dictionary

    Returns:
    tuple of pandas.DataFrame: (section_df, percentage_df, surfaces_df), giving one row per section, the surface percentages per sector, and the surface type labels by language
    """

    # Create empty lists to store flattened data
    flattened_data = []
    percentage_data = []
    surface_types = []
    _surface_types = []

    # Sort by stage sector
    df.sort_values("code", inplace=True)

    # Create a mapping for translations
    for _, row in df.iterrows():
        ground_data = row['grounds']

        # Create a translations dictionary
        translations = {f"text_{lang['locale']}": lang['text']
                        for lang in ground_data['groundLangs']}

        _stype = translations["text_en"].lower()

        percentage_data.append(
            {'code': row['code'],
             'percentage': ground_data['percentage'],
             # 'ground_name': ground_data['name'],
             'color': ground_data['color'], "type": _stype})

        if _stype not in _surface_types:
            _surface_types.append(_stype)
            surface_types.append({"type": _stype, **translations})

        # Create a record for each section
        for section in ground_data['sections']:
            section_record = {
                'code': row['code'],
                'ground_name': ground_data['name'],
                'section': section['section'],
                'start': section['start'],
                'finish': section['finish'],
                'color': ground_data['color'],
                # 'percentage': ground_data['percentage'],
                "type": _stype
            }
            flattened_data.append(section_record)

    # Create DataFrame from flattened data
    section_df = pd.DataFrame(flattened_data)
    percentage_df = pd.DataFrame(percentage_data)
    surfaces_df = pd.DataFrame(surface_types)

    # Sort columns for better organization
    # Ignore: 'percentage', 'ground_name',
    fixed_columns = ['code',  'section', 'start', 'finish',
                     'color', "type"]
    lang_columns = [
        col for col in section_df.columns if col.startswith('text_')]
    section_df = section_df[fixed_columns + sorted(lang_columns)]

    # Drop duplicates if any still exist
    section_df = section_df.drop_duplicates()
    section_df = section_df.sort_values(
        ['code', 'section'])
    section_df.reset_index(drop=True, inplace=True)

    return section_df, percentage_df, surfaces_df

In [13]:
section_surfaces, stage_surfaces, surfaces = flatten_grounds_data(
    competitive_sectors)
section_surfaces.head()

Unnamed: 0,code,section,start,finish,color,type
0,1200,1,0,27,#efc07c,sand
1,1200,2,27,32,#753a05,dirt track
2,1200,3,32,32,#1dc942,gravel track
3,1200,4,32,41,#efc07c,sand
4,1200,5,41,42,#753a05,dirt track


In [14]:
stage_surfaces.head(10)

Unnamed: 0,code,percentage,color,type
0,1200,18,#753a05,dirt track
1,1200,28,#1dc942,gravel track
2,1200,53,#efc07c,sand
3,1200,18,#753a05,dirt track
4,1200,28,#1dc942,gravel track
5,1200,53,#efc07c,sand
6,1200,18,#753a05,dirt track
7,1200,28,#1dc942,gravel track
8,1200,53,#efc07c,sand
9,1200,18,#753a05,dirt track
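
The preview shows the same three rows for sector 1200 repeating. The `flatten_grounds_data()` function drops duplicates from the section table but not from the percentage table, so any repeated input rows carry through. A minimal sketch of removing exact duplicates, using toy rows copied from the preview (the code is written here as a zero-padded string, which is an assumption):

```python
import pandas as pd

# Repeated rows modelled on the preview above
toy = pd.DataFrame({
    "code": ["01200"] * 6,
    "percentage": [18, 28, 53, 18, 28, 53],
    "color": ["#753a05", "#1dc942", "#efc07c"] * 2,
    "type": ["dirt track", "gravel track", "sand"] * 2,
})

# Keep the first occurrence of each distinct row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Applied to the real data, the equivalent call would be `stage_surfaces.drop_duplicates()`.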


In [15]:
surfaces

Unnamed: 0,type,text_en,text_fr,text_es,text_ar
0,dirt track,Dirt Track,Piste terre,Pista de tierra,الحجارة
1,gravel track,Gravel Track,Piste empierrée,Tierra,التراب
2,sand,Sand,Sable,Arena,الرمال
3,dunes,Dunes,Dunes,Dunas,الكثبان
4,asphalt road,Asphalt Road,Goudron,Asfalto,الزفت


In [16]:
stage_surfaces[stage_surfaces["code"]=="11200"]

Unnamed: 0,code,percentage,color,type
121,11200,61,#ff7200,dunes
122,11200,38,#efc07c,sand
123,11200,40,#ff7200,dunes
124,11200,60,#efc07c,sand
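
Sector 11200 reports two different percentage splits for the same pair of surface types, and the data shown here does not say why. As a minimal sketch, sectors like this can be flagged by counting the distinct percentage values for each sector and surface type. The toy rows are copied from the output above.

```python
import pandas as pd

# Rows for sector 11200, copied from the output above
rows = pd.DataFrame({
    "code": ["11200"] * 4,
    "percentage": [61, 38, 40, 60],
    "type": ["dunes", "sand", "dunes", "sand"],
})

# Count distinct percentage values per sector and surface type
variants = rows.groupby(["code", "type"])["percentage"].nunique()

# Keep only the combinations that report more than one value
conflicts = variants[variants > 1]
print(conflicts)
```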
