# Converting numeric data to sequence data: Example based on the Gapminder Child Mortality data

*Author: Yuqi Liang*

*Date: 13 Feb 2026*

In this tutorial, we will explore the Gapminder Child Mortality data and demonstrate how to convert numeric data into sequence data using **global deciles**.

**Data Source:** The data used in this tutorial is sourced from [Gapminder](https://www.gapminder.org/data/). Gapminder provides comprehensive datasets on various global indicators, including child mortality (children under 5 years old dying per 1000 born).

**Data Analysis and Cleaning:** We will transform the numeric dataset into a sequence dataset by:

* Loading the dataset.
* Converting the dataset to long format and handling missing values.
* Computing **global decile** thresholds (dividing all values into 10 approximately equal-sized groups).
* Converting the dataset to wide format with sequence states.

Let's get started!

In [None]:
# Import necessary packages
import pandas as pd

## Child Mortality

The data source for each country's child mortality rate (children under 5 years old dying per 1000 born) can be found [here](https://www.gapminder.org/data/). This dataset contains child mortality values for countries from 1800 to 2100.
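
The file is stored in wide format, with one row per country and one column per year. As a minimal sketch of the wide-to-long reshaping applied in the next cell, the toy table below uses invented country names, codes, and values (illustrative assumptions, not Gapminder data):

```python
import pandas as pd

# Toy wide-format table; names, codes, and numbers are invented for illustration
toy_wide = pd.DataFrame({
    "name": ["Country A", "Country B"],
    "geo": ["aaa", "bbb"],
    "1800": [400.0, 350.0],
    "1801": [398.0, None],  # one missing observation
})

# Melt: 'name' and 'geo' stay as identifiers, each year column becomes a row
toy_long = toy_wide.melt(
    id_vars=["name", "geo"],
    var_name="year",
    value_name="child_mortality",
)

# Year labels come from column names, so they start out as strings
toy_long["year"] = toy_long["year"].astype(int)

print(toy_long)
```

Each (country, year) pair becomes one row, and the missing cell is carried over as NaN rather than dropped.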

In [None]:
# Load the dataset (wide format: countries as rows, years as columns)
file_path = "../data_sources/child_mortality_0_5_year_olds_dying_per_1000_born.csv"
df_wide = pd.read_csv(file_path)

# Convert from wide format to long format
# Melt the dataframe: keep 'name' and 'geo' as identifiers, convert year columns to rows
# Note: 'geo' is the country code column; it is listed in id_vars so that it is not melted into 'year'
df_long = df_wide.melt(
    id_vars=['name', 'geo'],
    var_name='year',
    value_name='child_mortality'
)

# Convert year to integer and child_mortality to numeric
df_long['year'] = df_long['year'].astype(int)
df_long['child_mortality'] = pd.to_numeric(df_long['child_mortality'], errors='coerce')

# Rename 'name' to 'country' for consistency
df_long = df_long.rename(columns={'name': 'country'})

# Reorder columns (drop 'geo' as we don't need it for the analysis)
df_long = df_long[['country', 'year', 'child_mortality']]

df_long

In [None]:
# Missing values in the original dataset (before dropping)
n_total = len(df_long)
n_missing_child_mortality = df_long['child_mortality'].isna().sum()
n_missing_year = df_long['year'].isna().sum()
n_missing_country = df_long['country'].isna().sum()

print("Missing values in the original dataset:")
print(f"  Total rows: {n_total:,}")
print(f"  Missing in child_mortality: {n_missing_child_mortality:,} ({100 * n_missing_child_mortality / n_total:.2f}%)")
print(f"  Missing in year: {n_missing_year:,}")
print(f"  Missing in country: {n_missing_country:,}")

# Drop rows with missing child_mortality for downstream analysis
df_long = df_long.dropna(subset=['child_mortality'])
print(f"\nAfter dropping rows with missing child_mortality: {len(df_long):,} rows remain.")

## Global deciles

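**Global deciles** divide all child mortality values (pooled across all countries and years) into 10 groups of approximately equal size. A single set of thresholds is computed for the entire dataset, so a given state corresponds to the same mortality range in every country and year. D1 is the lowest 10% of values and D10 is the highest 10%.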
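
To make the idea of one shared set of thresholds concrete, here is a small sketch on synthetic values (the numbers 1 to 100, an assumption chosen so that the groups are easy to check by hand). It calls `pd.qcut` with `retbins=True`, which returns the bin edges alongside the assigned labels:

```python
import pandas as pd

# Synthetic values standing in for pooled child mortality observations
values = pd.Series(range(1, 101), dtype=float)
labels = [f"D{i}" for i in range(1, 11)]

# retbins=True also returns the decile thresholds (11 edges for 10 bins)
deciles, edges = pd.qcut(values, q=10, labels=labels, retbins=True)

# Number of observations assigned to each decile
counts = deciles.value_counts().sort_index()

print(edges)
print(counts)
```

On real data, many identical values can make two thresholds coincide, in which case `pd.qcut` raises a `ValueError`. Passing `duplicates='drop'` merges the duplicate edges, but the number of bins then falls below 10 and the label list has to be shortened to match.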

In [None]:
# child_mortality was already coerced to numeric when the data was loaded, and rows
# with missing values were dropped above. The dropna here is only a safeguard;
# .copy() lets us add a new column without modifying df_long.
df_long_numeric = df_long.dropna(subset=['child_mortality']).copy()

# Verify the column is numeric
print(f"Data type of child_mortality: {df_long_numeric['child_mortality'].dtype}")
print(f"Number of rows for decile computation: {len(df_long_numeric):,}")

# Compute global deciles
df_long_numeric['decile_global'] = pd.qcut(
    df_long_numeric['child_mortality'],
    q=10,
    labels=[
        'D1 (Very Low)', 'D2', 'D3', 'D4', 'D5',
        'D6', 'D7', 'D8', 'D9', 'D10 (Very High)'
    ]
)

# Convert to wide format: rows = country, columns = year, values = decile
df_global_deciles = df_long_numeric.pivot(index='country', columns='year', values='decile_global')

# Reset index and clean column names
df_global_deciles = df_global_deciles.reset_index()
df_global_deciles.columns.name = None

df_global_deciles

In [None]:
# Save to CSV
df_global_deciles.to_csv('country_child_mortality_global_deciles.csv', index=False)

## Missing values in the final output

The final wide-format data (`df_global_deciles`) may contain missing values (NaN) where a country has no child mortality record for a given year. Below we report where they are and how many.
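
As a small illustration of where these gaps come from, the sketch below pivots a toy long-format table (invented country names and states, not real results) in which one country-year pair is absent. `pivot` leaves that cell as NaN:

```python
import pandas as pd

# Toy long-format table: Country B has no record for 2001
toy_long = pd.DataFrame({
    "country": ["Country A", "Country A", "Country B"],
    "year": [2000, 2001, 2000],
    "decile_global": ["D3", "D2", "D7"],
})

# Same reshaping as above: rows = country, columns = year, values = state
toy_wide = toy_long.pivot(index="country", columns="year", values="decile_global")
toy_wide = toy_wide.reset_index()
toy_wide.columns.name = None

# Count the cells that have no state
year_cols = [c for c in toy_wide.columns if c != "country"]
n_missing = int(toy_wide[year_cols].isna().sum().sum())

print(toy_wide)
print(f"Missing cells: {n_missing}")
```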

In [None]:
# Count missing values in the final output
year_cols = [c for c in df_global_deciles.columns if c != 'country']
n_total = df_global_deciles.shape[0] * len(year_cols)
n_missing = df_global_deciles[year_cols].isna().sum().sum()

print("Missing values in df_global_deciles (final output):")
print(f"  Total cells (countries × years): {n_total:,}")
print(f"  Missing cells: {n_missing:,} ({100 * n_missing / n_total:.2f}%)")

# Where are the missing values? By country (how many years missing per country)
missing_by_country = df_global_deciles[year_cols].isna().sum(axis=1)
countries_with_missing = missing_by_country[missing_by_country > 0]
if len(countries_with_missing) > 0:
    print(f"\nCountries with missing values ({len(countries_with_missing)} countries):")
    for idx in countries_with_missing.index[:10]:  # show first 10
        c = df_global_deciles.loc[idx, 'country']
        print(f"  - {c}: {countries_with_missing[idx]:,} years missing")
    if len(countries_with_missing) > 10:
        print(f"  ... and {len(countries_with_missing) - 10} more countries")

# By year (how many countries missing per year)
missing_by_year = df_global_deciles[year_cols].isna().sum()
years_with_missing = missing_by_year[missing_by_year > 0]
if len(years_with_missing) > 0:
    print(f"\nYears with missing values ({len(years_with_missing)} years):")
    for y in list(years_with_missing.index)[:10]:
        print(f"  - {y}: {years_with_missing[y]:,} countries missing")
    if len(years_with_missing) > 10:
        print(f"  ... and {len(years_with_missing) - 10} more years")
else:
    print("\nNo missing values in the final output.")