# `stages` Data

In [83]:
# Load in the required pandas package
import pandas as pd

# Also load in our custom function
from dakar_utils_2025 import mergeInLangLabels

In [84]:
# Generate the API URL pattern
dakar_api_template = "https://www.dakar.live.worldrallyraidchampionship.com/api/{path}"

# Define the year
YEAR = 2025
# Define the category
CATEGORY = "A"

# Define the API path to the stage resource
# Use a Python f-string to instantiate variable values directly
stage_path = f"stage-{YEAR}-{CATEGORY}"

# Define the URL
stage_url = dakar_api_template.format(path=stage_path)

# Preview the path and the URL
stage_path, stage_url

('stage-2025-A',
 'https://www.dakar.live.worldrallyraidchampionship.com/api/stage-2025-A')

In [85]:
stage_df = pd.read_json(stage_url)
stage_df.columns

Index(['endDate', 'startDate', 'isCancelled', 'type', 'stage',
       'mapCategoryDisplay', 'isDelayed', 'stageLangs', 'sectors', 'length',
       'updatedAt', 'podiumDisplay', 'generalDisplay', 'timezone', 'marathon',
       'stageWithBonus', 'lastColumnDisplay', 'mapDisplay', 'date',
       'groupsAsCategory', 'code', '_id', '_bind', '_origin', '_updatedAt',
       '_parent', '_key', 'qualities', '$category', '_gets', 'refueling',
       'shortLabel', 'liveDisplay', 'position', 'promotionalDisplay',
       'categoryLangs', 'label', 'reference', 'lastStage'],
      dtype='object')

Looking at the column names, I wonder whether the `stageLangs` column defines a set of language label mappings?

In [87]:
# Preview the first row of the stageLangs column
stage_df["stageLangs"][0]

[{'variable': 'stage.name.09000', 'locale': 'en', 'text': 'RIYADH > HARADH'},
 {'locale': 'fr', 'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH'},
 {'variable': 'stage.name.09000', 'text': 'RIYADH > HARADH', 'locale': 'es'},
 {'variable': 'stage.name.09000', 'text': 'الرياض > حرض', 'locale': 'ar'}]

It looks like it does, so we can flatten that data and merge it in using the function we defined in the previous chapter. But which column do we need to merge against?

In [88]:
# Preview the values in the first row in the dataframe
stage_df.iloc[0]

endDate                                       2025-01-14T00:00:00+03:00
startDate                                     2025-01-14T00:00:00+03:00
isCancelled                                                         0.0
type                                                                STA
stage                                                               9.0
mapCategoryDisplay                                                    A
isDelayed                                                           0.0
stageLangs            [{'variable': 'stage.name.09000', 'locale': 'e...
sectors               [{'type': 'LIA', 'groupsLength': [], 'startTim...
length                                                            589.0
updatedAt                                     2025-01-08T17:24:08+01:00
podiumDisplay                                                        ce
generalDisplay                                                      1.0
timezone                                                    Asia

Inspection of the columns in the original dataframe suggests there is no column that maps directly onto the `variable` value, which takes the form `stage.name.09000`. We do note, however, that there is a `code` value that matches the final part of `variable`, so we can create our own merge column, which we might call `variable`, and then merge the language labels in against that.

In [10]:
# Create a dummy column to match on
stage_df["variable"] = "stage.name." + stage_df["code"]

# Update the dataframe by using our new function to
# merge in the exploded and widened language labels
stage_df = mergeInLangLabels(stage_df, "stageLangs", key="variable")

# Preview the dataframe, limited to a few illustrative columns
stage_df[["code", "en", "ar"]].head()

| | code | en | ar |
|---|---|---|---|
| 0 | 09000 | RIYADH > HARADH | الرياض > حرض |
| 1 | 07000 | AL DUWADIMI > AL DUWADIMI | الدوادمي > الدوادمي |
| 2 | 06000 | HAIL > AL DUWADIMI | حائل > الدوادمي |
| 3 | 10000 | HARADH > SHUBAYTAH | حرض > شبيطة |
| 4 | 0P000 | BISHA | BISHA |
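For readers without the previous chapter to hand, the following is a minimal sketch of the kind of explode, widen and merge step that `mergeInLangLabels` is assumed to perform. The function name `merge_lang_labels_sketch` is hypothetical, and the actual helper may well differ in its details (for example, in whether it drops the original list column). The demo rows reuse two of the label records previewed above.

```python
import pandas as pd


def merge_lang_labels_sketch(df, col, key="variable"):
    """Illustrative only: widen a column of language-label lists and merge it in."""
    # One element per label dictionary; rows with no labels are dropped
    exploded = df[col].explode().dropna()
    # Long table with (at least) the columns: variable, locale, text
    long_df = pd.json_normalize(exploded.tolist())
    # One row per key value, one column per locale
    wide_df = (
        long_df.drop_duplicates(subset=[key, "locale"])
        .pivot(index=key, columns="locale", values="text")
        .rename_axis(columns=None)
        .reset_index()
    )
    # Left merge so that rows without labels are retained
    return df.merge(wide_df, on=key, how="left")


# Small demo frame built from the label records shown earlier
demo_df = pd.DataFrame(
    {
        "code": ["09000"],
        "stageLangs": [
            [
                {"variable": "stage.name.09000", "locale": "en", "text": "RIYADH > HARADH"},
                {"variable": "stage.name.09000", "locale": "ar", "text": "الرياض > حرض"},
            ]
        ],
    }
)
demo_df["variable"] = "stage.name." + demo_df["code"]
demo_wide = merge_lang_labels_sketch(demo_df, "stageLangs", key="variable")
```

The left merge is a deliberate choice in this sketch: a stage with no language labels keeps its row and simply gets missing values in the locale columns.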


Let's skim the first few rows of the table:

In [17]:
stage_df.head(5)

| | length | endDate | isDelayed | marathon | mapCategoryDisplay | generalDisplay | date | updatedAt | mapDisplay | timezone | ... | position | promotionalDisplay | categoryLangs | label | reference | lastStage | ar | en | es | fr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 589.0 | 2025-01-14T00:00:00+03:00 | 0.0 | 0.0 | A | 1.0 | 2025-01-14 00:00:00+03:00 | 2025-01-02T09:24:24+01:00 | True | Asia/Riyadh | ... | | | | | | | الرياض > حرض | RIYADH > HARADH | RIYADH > HARADH | RIYADH > HARADH |
| 1 | 742.0 | 2025-01-12T00:00:00+03:00 | 0.0 | 0.0 | M | 1.0 | 2025-01-12 00:00:00+03:00 | 2025-01-02T09:24:24+01:00 | True | Asia/Riyadh | ... | | | | | | | الدوادمي > الدوادمي | AL DUWADIMI > AL DUWADIMI | AL DUWADIMI > AL DUWADIMI | AL DUWADIMI > AL DUWADIMI |
| 2 | 828.0 | 2025-01-11T00:00:00+03:00 | 0.0 | 0.0 | A | 1.0 | 2025-01-11 00:00:00+03:00 | 2025-01-02T09:24:24+01:00 | True | Asia/Riyadh | ... | | | | | | | حائل > الدوادمي | HAIL > AL DUWADIMI | HAIL > AL DUWADIMI | HAIL > AL DUWADIMI |
| 3 | 640.0 | 2025-01-15T00:00:00+03:00 | 0.0 | 0.0 | M | 1.0 | 2025-01-15 00:00:00+03:00 | 2025-01-02T09:24:24+01:00 | True | Asia/Riyadh | ... | | | | | | | حرض > شبيطة | HARADH > SHUBAYTAH | HARADH > SHUBAYTAH | HARADH > SHUBAYTAH |
| 4 | 79.0 | 2025-01-03T00:00:00+03:00 | 0.0 | 0.0 | | 0.0 | 2025-01-03 00:00:00+03:00 | 2025-01-02T09:24:24+01:00 | True | Asia/Riyadh | ... | | | | | | | BISHA | BISHA | BISHA | BISHA |


From the date columns, we notice that the dataframe is not ordered. To provide a more natural ordering, we could sort the rows by start date or by stage number, which appears to be given by the `stage` column.
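As a minimal sketch of the date-based ordering, the helper below parses a date column and sorts on the parsed values. The name `order_stages` and the toy dataframe are hypothetical; the toy rows reuse the `code` and `endDate` values shown in the outputs above. Sorting on the `stage` column instead would be a one-line `sort_values("stage")`, assuming the prologue is given a sensible stage number, which has not been checked here.

```python
import pandas as pd


def order_stages(df, date_col="startDate"):
    """Return a copy of df ordered by a parsed date column (illustrative helper)."""
    # Parse the ISO 8601 strings; utc=True copes with mixed timezone offsets
    parsed = pd.to_datetime(df[date_col], utc=True)
    return (
        df.assign(_sort_key=parsed)
        .sort_values("_sort_key", kind="stable")
        .drop(columns="_sort_key")
        .reset_index(drop=True)
    )


# Toy stand-in for stage_df, using values visible in the table above
toy_stage_df = pd.DataFrame(
    {
        "code": ["09000", "07000", "06000", "10000", "0P000"],
        "endDate": [
            "2025-01-14T00:00:00+03:00",
            "2025-01-12T00:00:00+03:00",
            "2025-01-11T00:00:00+03:00",
            "2025-01-15T00:00:00+03:00",
            "2025-01-03T00:00:00+03:00",
        ],
    }
)

ordered_toy_df = order_stages(toy_stage_df, date_col="endDate")
```

In the notebook itself, the corresponding call would be something like `order_stages(stage_df)`, which returns a new dataframe rather than modifying `stage_df` in place.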

Inspection of the data also reveals that the `sectors` column typically appears to contain three elements: a link sector, the competitive stage sector, and a second link sector. The competitive stage sector is further subdivided into a list of `grounds`, or terrain types, with associated language mappings. Within each ground is a list of `sections`, each with a `section` number and `start` and `finish` values that give the distances in kilometres into the stage at which that section starts and finishes.

## Parsing each row

Let's get the JSON `sectors` data for a single row and see if we can find a sensible way of parsing it.

I am imagining producing something like two linked dataframes:

- one containing the top-level sector data for each stage;
- one containing the surface type by section; I imagine this dataframe to have columns `stage`, `sector`, `section`, `start`, `finish`, `surface_type` and then perhaps language mappings for the surface type.

In [89]:
_sectors_data = stage_df["sectors"].iloc[0]

Let's start by looking at the metadata for each sector.

In [91]:
sectors_df = pd.json_normalize(stage_df["sectors"].explode())

sectors_df.head()

| | type | groupsLength | startTime | arrivalTime | id | length | powerStage | code | grounds | groupsLength.A_T5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LIA | [] | 2025-01-14T04:05:00+00:00 | 2025-01-14T06:20:00+00:00 | 23905.0 | 112.0 | False | 9100 | | |
| 1 | SPE | [] | 2025-01-14T06:20:00+00:00 | 2025-01-14T09:27:00+00:00 | 23906.0 | 357.0 | False | 9200 | `[{'groundLangs': [{'locale': 'en', 'text': 'Gr...` | |
| 2 | LIA | [] | 2025-01-14T09:27:00+00:00 | 2025-01-14T10:47:00+00:00 | 23907.0 | 120.0 | False | 9300 | | |
| 3 | LIA | [] | 2025-01-12T02:10:00+00:00 | 2025-01-12T04:35:00+00:00 | 23899.0 | 156.0 | False | 7100 | | |
| 4 | SPE | [] | 2025-01-12T04:35:00+00:00 | 2025-01-12T09:45:00+00:00 | 23900.0 | 478.0 | False | 7200 | `[{'sections': [{'start': 0, 'finish': 1, 'sect...` | |


The stage number is not explicitly listed, but we can derive it from the first two characters of the `code`, once the code is zero-padded to five characters (so that `9100` reads as `09100`).
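A minimal sketch of that derivation follows, using the sector codes shown in the table above. It assumes the codes arrive as integers or digit strings (not floats), and that a sector code such as `9100` belongs to the stage whose code is `09000`; the column names `stage_prefix` and `stage_code` are hypothetical.

```python
import pandas as pd

# Toy stand-in for sectors_df, using the sector codes displayed above
toy_sectors_df = pd.DataFrame({"code": [9100, 9200, 9300, 7100, 7200]})

# Zero-pad to five characters so that 9100 is read as "09100"
padded_code = toy_sectors_df["code"].astype(str).str.zfill(5)

# The first two characters identify the stage
toy_sectors_df["stage_prefix"] = padded_code.str[:2]

# Assumption: appending "000" gives the matching `code` in the stages dataframe
toy_sectors_df["stage_code"] = toy_sectors_df["stage_prefix"] + "000"
```

Keeping the prefix as a string, rather than casting it to a number, means a non-numeric stage code such as the `0P000` seen earlier would not cause an error.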

If the sector is an `SPE` type, we have the `grounds` information to play with. This is split over various sections, with start and finish distances identifying each section, as well as a surface type and a "percentage" value.

In [93]:
# Get the sectors with grounds data
competitive_sectors = sectors_df[['grounds', 'code']].dropna(
    axis="index").explode('grounds')

# Cast the long sectors data to a dataframe
#grounds_df = pd.json_normalize(competitive_sectors)

competitive_sectors.head()

| | grounds | code |
|---|---|---|
| 1 | `{'groundLangs': [{'locale': 'en', 'text': 'Gra...` | 9200 |
| 1 | `{'sections': [{'section': 2, 'start': 9, 'fini...` | 9200 |
| 4 | `{'sections': [{'start': 0, 'finish': 1, 'secti...` | 7200 |
| 4 | `{'name': 'ground.name.2', 'percentage': 25, 'c...` | 7200 |
| 4 | `{'name': 'ground.name.3', 'percentage': 16, 'c...` | 7200 |
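As one possible starting point for the "surface type by section" dataframe imagined earlier, the sketch below flattens each ground record into one row per section using `pd.json_normalize` with `record_path` and `meta`. The toy records mimic only the keys that are visible in the truncated output above (`name`, `percentage`, `sections`); the section numbers and distances are invented for illustration, and the column names `sector` and `surface_type` are my own choices. Language mappings and any keys that are truncated in the preview are left out.

```python
import pandas as pd

# Invented records that mimic the visible structure of the exploded grounds data
toy_competitive_sectors = pd.DataFrame(
    {
        "code": [7200, 7200],
        "grounds": [
            {
                "name": "ground.name.2",
                "percentage": 25,
                "sections": [
                    {"section": 1, "start": 0, "finish": 1},
                    {"section": 3, "start": 12, "finish": 20},
                ],
            },
            {
                "name": "ground.name.3",
                "percentage": 16,
                "sections": [{"section": 2, "start": 1, "finish": 12}],
            },
        ],
    }
)

# Attach the sector code to each ground record so it survives the flattening
ground_records = [
    {**ground, "code": code}
    for code, ground in zip(
        toy_competitive_sectors["code"], toy_competitive_sectors["grounds"]
    )
]

# One row per section, carrying the sector code and ground metadata alongside
sections_df = pd.json_normalize(
    ground_records, record_path="sections", meta=["code", "name", "percentage"]
)

# Rename to the imagined column names and order by distance into the stage
sections_df = (
    sections_df.rename(columns={"code": "sector", "name": "surface_type"})
    .sort_values(["sector", "start"])
    .reset_index(drop=True)
)
```

Ordering by `start` within each sector gives a view of how the surface type changes along the stage, which leaves the question of how to read the `percentage` values open.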


It's not immediately clear to me what the best way of representing mixed surface types is, or how to interpret the `percentage` values, so I'm going to postpone my own further exploration of this data for now.

That said, this data might support an interesting view of the stage make-up, particularly when cross-referenced with waypoint locations. We might then be able to identify different performance characteristics of different crews or vehicle types based on the predominant surface type between waypoints.