# Converting numeric data to sequence data: Example based on the Gapminder data

*Author: Yuqi Liang*

*Date: 12 Feb 2026*

In this tutorial, we will explore the Gapminder GDP per capita data and demonstrate how to convert numeric data into sequence data using **global deciles**.

**Data Source:** The data used in this tutorial is sourced from [Gapminder](https://www.gapminder.org/data/). Gapminder provides comprehensive datasets on various global indicators, including GDP per capita.

**Data Analysis and Cleaning:** We will transform the numeric dataset into a sequence dataset by:

* Loading the dataset.
* Selecting and renaming the relevant columns (the source file is already in long format, one row per country and year) and handling missing values.
* Computing **global decile** thresholds (dividing all values into 10 equal-sized groups).
* Converting the dataset to wide format with sequence states.
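
As a rough preview, the steps above can be sketched on a tiny made-up dataset. The names `toy_long` and `toy_wide`, the values, and the use of 3 bins instead of 10 are illustrative assumptions only, not part of the Gapminder data.

```python
import pandas as pd

# Hypothetical toy data: 2 countries x 3 years (all values are made up)
toy_long = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp_per_capita": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
})

# Drop missing values, then bin ALL values at once (3 bins here for brevity)
toy_long = toy_long.dropna(subset=["gdp_per_capita"]).copy()
toy_long["state"] = pd.qcut(
    toy_long["gdp_per_capita"], q=3, labels=["Low", "Mid", "High"]
)

# Wide format: one row per country, one column per year, cells hold the state
toy_wide = toy_long.pivot(index="country", columns="year", values="state")
toy_wide = toy_wide.reset_index()
toy_wide.columns.name = None

print(toy_wide)
```

Each row of `toy_wide` is then one sequence: the ordered states a country passes through over time.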

Let's get started!

In [1]:
# Import necessary packages
import pandas as pd

## GDP per capita

The data source for each country's GDP per capita can be found [here](https://docs.google.com/spreadsheets/d/1mbfE9vSQmshpSOsBbicmiaL0Oio7hIJAg0vQKzzl8v0/edit?gid=730262387#gid=730262387). Although this dataset primarily focuses on carbon emissions for each country, it also includes valuable information on GDP per capita.


In [2]:
# Load the dataset (header=1 uses the second row as column names)
file_path = "../data_sources/Output _ CO2 Long Series 1800 - 2022 - data_GM GDP per Cap v30.csv"
df_long = pd.read_csv(file_path, header=1)

# Select relevant columns and rename
df_long = df_long[["name", "time", "Income per person"]].rename(
    columns={"name": "country", "time": "year", "Income per person": "gdp_per_capita"}
)

# Convert types
df_long["year"] = df_long["year"].astype(int)
df_long["gdp_per_capita"] = pd.to_numeric(df_long["gdp_per_capita"], errors="coerce")

df_long

| | country | year | gdp_per_capita |
|---|---|---|---|
| 0 | Afghanistan | 1800 | 476.991347 |
| 1 | Afghanistan | 1801 | 476.991347 |
| 2 | Afghanistan | 1802 | 476.991347 |
| 3 | Afghanistan | 1803 | 476.991347 |
| 4 | Afghanistan | 1804 | 476.991347 |
| ... | ... | ... | ... |
| 59292 | Zimbabwe | 2096 | 10813.961830 |
| 59293 | Zimbabwe | 2097 | 11077.484580 |
| 59294 | Zimbabwe | 2098 | 11347.001950 |
| 59295 | Zimbabwe | 2099 | 11622.589580 |
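
To illustrate the two loading options used above, here is a minimal sketch on an in-memory CSV. The file contents are invented for the example (including the assumption that the first row is a title row); only `header=1` and `errors="coerce"` mirror the cell above.

```python
import io
import pandas as pd

# Hypothetical file contents: a title row first, then the real header row
raw = io.StringIO(
    "Some title row,,\n"
    "name,time,Income per person\n"
    "A,1800,476.99\n"
    "A,1801,unknown\n"
)

# header=1 skips the first row and takes column names from the second one
toy = pd.read_csv(raw, header=1)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["Income per person"] = pd.to_numeric(toy["Income per person"], errors="coerce")

print(toy)
print(toy["Income per person"].isna().sum(), "value(s) became NaN")
```

Because coercion silently produces NaN, it is worth counting missing values right after this step, as the next cell does.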


In [6]:
# Missing values in the original dataset (before dropping)
n_total = len(df_long)
n_missing_gdp = df_long["gdp_per_capita"].isna().sum()
n_missing_year = df_long["year"].isna().sum()
n_missing_country = df_long["country"].isna().sum()

print("Missing values in the original dataset:")
print(f"  Total rows: {n_total:,}")
print(f"  Missing in gdp_per_capita: {n_missing_gdp:,} ({100 * n_missing_gdp / n_total:.2f}%)")
print(f"  Missing in year: {n_missing_year:,}")
print(f"  Missing in country: {n_missing_country:,}")

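# Drop rows with missing gdp_per_capita for downstream analysis
# (.copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning)
df_long = df_long.dropna(subset=["gdp_per_capita"]).copy()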
print(f"\nAfter dropping rows with missing gdp_per_capita: {len(df_long):,} rows remain.")

Missing values in the original dataset:
  Total rows: 58,695
  Missing in gdp_per_capita: 0 (0.00%)
  Missing in year: 0
  Missing in country: 0

After dropping rows with missing gdp_per_capita: 58,695 rows remain.


## Global deciles

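**Global deciles** pool all GDP per capita values across all countries and years and split them into 10 groups of approximately equal size. A single set of thresholds is computed for the entire dataset, so a given state corresponds to the same income range in every year: D1 is the lowest 10% of all country-year values and D10 is the highest 10%.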
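
To see what the thresholds look like, `pd.qcut` can return the bin edges alongside the labels via `retbins=True`. The sketch below uses made-up values from 1 to 100 and shortened labels (`D1` to `D10`); both are assumptions for illustration.

```python
import pandas as pd

# Made-up values 1..100 standing in for pooled GDP per capita
values = pd.Series(range(1, 101), dtype=float)
labels = [f"D{i}" for i in range(1, 11)]

# retbins=True additionally returns the 11 edges that delimit the 10 bins
deciles, edges = pd.qcut(values, q=10, labels=labels, retbins=True)

print(edges)                                # one global set of thresholds
print(deciles.value_counts().sort_index())  # number of values per decile
```

With heavily tied values the groups can be uneven, and `pd.qcut` raises an error if two edges coincide unless `duplicates="drop"` is passed, in which case fewer than 10 bins result and the label list must be shortened to match.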

In [3]:
# Compute global deciles
df_long["decile_global"] = pd.qcut(
    df_long["gdp_per_capita"],
    q=10,
    labels=[
        "D1 (Very Low)", "D2", "D3", "D4", "D5",
        "D6", "D7", "D8", "D9", "D10 (Very High)"
    ]
)

# Convert to wide format: rows = country, columns = year, values = decile
df_global_deciles = df_long.pivot(index="country", columns="year", values="decile_global")

# Reset index and clean column names
df_global_deciles = df_global_deciles.reset_index()
df_global_deciles.columns.name = None

df_global_deciles

| | country | 1800 | 1801 | 1802 | 1803 | 1804 | 1805 | 1806 | 1807 | 1808 | ... | 2091 | 2092 | 2093 | 2094 | 2095 | 2096 | 2097 | 2098 | 2099 | 2100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | ... | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 |
| 1 | Albania | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | ... | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) |
| 2 | Algeria | D2 | D2 | D2 | D2 | D2 | D2 | D2 | D2 | D2 | ... | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 |
| 3 | Andorra | D4 | D4 | D4 | D4 | D4 | D4 | D4 | D4 | D4 | ... | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) |
| 4 | Angola | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | ... | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | Venezuela | D3 | D3 | D3 | D3 | D3 | D3 | D3 | D3 | D3 | ... | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 | D9 |
| 191 | Vietnam | D2 | D2 | D2 | D2 | D2 | D2 | D2 | D2 | D2 | ... | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) | D10 (Very High) |
| 192 | Yemen | D3 | D3 | D3 | D3 | D3 | D3 | D3 | D3 | D3 | ... | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 | D7 |
| 193 | Zambia | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | D1 (Very Low) | ... | D8 | D8 | D8 | D8 | D8 | D8 | D8 | D8 | D8 | D8 |


In [4]:
# Save to CSV
df_global_deciles.to_csv("country_gdp_per_capita_global_deciles.csv", index=False)
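
One thing to keep in mind when reading the saved file back: a CSV stores the decile labels as plain strings, so their order (D1 < D2 < ... < D10) is lost, and the year column names come back as strings rather than integers. A possible way to restore the order is sketched below. The in-memory `csv_text` is a made-up stand-in for the saved file, and `order` simply repeats the labels used above.

```python
import io
import pandas as pd

order = [
    "D1 (Very Low)", "D2", "D3", "D4", "D5",
    "D6", "D7", "D8", "D9", "D10 (Very High)",
]

# Made-up stand-in for the saved CSV (two countries, two years)
csv_text = "country,1800,1801\nA,D2,D10 (Very High)\nB,D1 (Very Low),D2\n"
wide = pd.read_csv(io.StringIO(csv_text))

# After reading, the year columns are the strings "1800" and "1801"
year_cols = [c for c in wide.columns if c != "country"]

# Re-attach the category order so that states can be compared and sorted
decile_dtype = pd.CategoricalDtype(categories=order, ordered=True)
for col in year_cols:
    wide[col] = wide[col].astype(decile_dtype)

# Example comparison: which countries moved up between the two years?
print(wide.loc[wide["1801"] > wide["1800"], "country"].tolist())
```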

## Missing values in the final output

The final wide-format data (`df_global_deciles`) may contain missing values (NaN) where a country has no GDP record for a given year. Below we report where they are and how many.
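
The mechanism is easy to reproduce on a toy example: when a country-year combination is absent from the long table, `pivot` fills the corresponding cell with NaN. The data below is made up for illustration.

```python
import pandas as pd

# Made-up long data: country B has no record for 2001
toy_long = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "state": ["D1", "D2", "D5"],
})

toy_wide = toy_long.pivot(index="country", columns="year", values="state")

print(toy_wide)
print(toy_wide.isna().sum(axis=1))  # missing years per country
```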

In [5]:
# Count missing values in the final output
year_cols = [c for c in df_global_deciles.columns if c != "country"]
n_total = df_global_deciles.shape[0] * len(year_cols)
n_missing = df_global_deciles[year_cols].isna().sum().sum()

print("Missing values in df_global_deciles (final output):")
print(f"  Total cells (countries × years): {n_total:,}")
print(f"  Missing cells: {n_missing:,} ({100 * n_missing / n_total:.2f}%)")

# Where are the missing values? By country (how many years missing per country)
missing_by_country = df_global_deciles[year_cols].isna().sum(axis=1)
countries_with_missing = missing_by_country[missing_by_country > 0]
if len(countries_with_missing) > 0:
    print(f"\nCountries with missing values ({len(countries_with_missing)} countries):")
    for idx in countries_with_missing.index[:10]:  # show first 10
        c = df_global_deciles.loc[idx, "country"]
        print(f"  - {c}: {countries_with_missing[idx]:,} years missing")
    if len(countries_with_missing) > 10:
        print(f"  ... and {len(countries_with_missing) - 10} more countries")

# By year (how many countries missing per year)
missing_by_year = df_global_deciles[year_cols].isna().sum()
years_with_missing = missing_by_year[missing_by_year > 0]
if len(years_with_missing) > 0:
    print(f"\nYears with missing values ({len(years_with_missing)} years):")
    for y in list(years_with_missing.index)[:10]:
        print(f"  - {y}: {years_with_missing[y]:,} countries missing")
    if len(years_with_missing) > 10:
        print(f"  ... and {len(years_with_missing) - 10} more years")
else:
    print("\nNo missing values in the final output.")

Missing values in df_global_deciles (final output):
  Total cells (countries × years): 58,695
  Missing cells: 0 (0.00%)

No missing values in the final output.
