## Add regions (continents)

Fill in set information so we can start to do drilldown/drillup.

For this we add a members column that references other rows by primary key (code).
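
To make the idea concrete, here is a minimal sketch on a toy table. The rows and the `expand` helper are hypothetical and only illustrate how a members column keyed on code could support a drilldown; they are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy table: "members" holds the codes of child rows.
toy = pd.DataFrame(
    {
        "code": ["OWID_WRL", "OWID_EUR", "FRA", "DEU"],
        "name": ["World", "Europe", "France", "Germany"],
        "members": [["OWID_EUR"], ["FRA", "DEU"], [], []],
    }
).set_index("code")


def expand(code):
    """Return all leaf codes reachable from `code` via the members column."""
    members = toy.at[code, "members"]
    if not members:
        # A row without members is a leaf (a single country here).
        return [code]
    leaves = []
    for member in members:
        leaves.extend(expand(member))
    return leaves


world_leaves = expand("OWID_WRL")
```

Drilling up would be the reverse lookup: finding the rows whose members list contains a given code.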

In [None]:
df = pd.read_feather("intermediate/03-countries-with-entitiyids.feather")

In [None]:
df.shape

❔ Do we have any rows that do not have a continent assigned in the continent column?

In [None]:
df[df.continent.isnull()]

💡 It looks like Micronesia (region) is odd - do we have more entries like this?

In [None]:
df[df.name.str.contains("icronesia")]

I would argue that "Rest of the World" and "Micronesia (region)" should be dropped. They have no other identifiers and no entity id, and neither is defined clearly enough to be used reliably across table joins. If we get rid of those, the only entry left without a continent assigned is "World", which sounds good.

In [None]:
df[df.continent.isnull() & df.code.isnull() & df.legacy_entity_id.isnull()]

In [None]:
cleaned = df.drop(
    df.loc[
        df.continent.isnull() & df.code.isnull() & df.legacy_entity_id.isnull()
    ].index
)

In [None]:
cleaned.shape

In [None]:
cleaned[cleaned.code.isnull()]

⚡ Code will be our primary key and will be mandatory for our table. Let's fill in the missing ones with some new, made-up values.

In [None]:
new_codes = {
    "Africa": "OWID_AFR",
    "Asia": "OWID_ASI",
    "Europe": "OWID_EUR",
    "Oceania": "OWID_OCE",
    "North America": "OWID_NAM",
    "Latin America": "OWID_LAM",
}

❔ Are any of these codes already used?

In [None]:
cleaned[cleaned.code.isin(new_codes.values())]

In [None]:
cleaned["code"] = cleaned.code.fillna(cleaned.name.map(new_codes))
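
A small sketch of how this fill behaves, on a hypothetical toy frame (the "Atlantis" row and the one-entry mapping are made up for illustration): existing codes are left alone, and rows whose name is not in the mapping stay null, which is why the null check is repeated afterwards.

```python
import pandas as pd

# Hypothetical toy frame: two rows without a code, one with.
toy = pd.DataFrame(
    {"name": ["Asia", "France", "Atlantis"], "code": [None, "FRA", None]}
)
toy_codes = {"Asia": "OWID_ASI"}  # no entry for "Atlantis"

# name.map(...) yields a code only where the name is a key of the dict;
# fillna then uses those values only where code is currently missing.
toy["code"] = toy["code"].fillna(toy["name"].map(toy_codes))

still_missing = toy[toy["code"].isnull()]
```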

In [None]:
cleaned[cleaned.code.isnull()]

Make sure we don't have any duplicates in the code column.

In [None]:
cleaned.duplicated(subset=["code"]).any()
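
If the check above ever returned True, the offending rows could be listed as in the sketch below. The toy frame is hypothetical; `keep=False` marks every row of a duplicated group rather than only the later occurrences.

```python
import pandas as pd

# Hypothetical frame in which one code is used twice.
toy = pd.DataFrame(
    {
        "name": ["Asia", "Asia (copy)", "Europe"],
        "code": ["OWID_ASI", "OWID_ASI", "OWID_EUR"],
    }
)

has_dupes = toy.duplicated(subset=["code"]).any()

# keep=False flags all rows sharing a code, so both copies are shown.
dupes = toy[toy.duplicated(subset=["code"], keep=False)]
```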

### Now group by continent, create and fill the members column

In [None]:
continents_members = cleaned.groupby("continent")["code"].apply(list)

In [None]:
continents_members

In [None]:
continents_members.dtype
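
As a rough illustration of what the groupby step produces, here is the same call on a hypothetical toy frame. Rows with a missing continent are left out of the groups by default, and because each value is a Python list the resulting Series has object dtype.

```python
import pandas as pd

# Hypothetical toy frame; the last row has no continent, like "World".
toy = pd.DataFrame(
    {
        "continent": [4.0, 4.0, 2.0, None],
        "code": ["FRA", "DEU", "JPN", "OWID_WRL"],
    }
)

# One list of member codes per continent id, indexed by that id.
toy_members = toy.groupby("continent")["code"].apply(list)
```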

In [None]:
# continents_members = continents_members.map(lambda m: pd.array(m, dtype="string"))

In [None]:
continents_members[7.0]

In [None]:
first_items = [members[0] for members in continents_members.values]

In [None]:
cleaned[cleaned.code.isin(first_items)]

In [None]:
continents_map = {
    1.0: "OWID_NAM",
    2.0: "OWID_ASI",
    3.0: "OWID_AFR",
    4.0: "OWID_EUR",
    5.0: "OWID_LAM",
    6.0: "OWID_OCE",
    7.0: "ATA",
}
# Invert the mapping: region code -> continent id
reverse_continents_map = {
    code: continent_id for (continent_id, code) in continents_map.items()
}

In [None]:
cleaned["members"] = cleaned.code.map(
    lambda code: continents_members[reverse_continents_map[code]]
    if code in reverse_continents_map
    else []
)

In [None]:
cleaned[cleaned.code == "ATA"]

In [None]:
antarctica_index = cleaned[cleaned.code == "ATA"].index[0]

In [None]:
antarctica_index

All looks good - let's just remove the self-reference in Antarctica :)

In [None]:
without_ata = [
    m for m in cleaned.loc[cleaned.code == "ATA", "members"].item() if m != "ATA"
]

In [None]:
without_ata

In [None]:
cleaned.loc[cleaned.code == "ATA"]

In [None]:
cleaned.at[antarctica_index, "members"] = without_ata

In [None]:
cleaned.loc[cleaned.code == "ATA"]

Now set the "World" row to have the continents as members.

In [None]:
cleaned.loc[cleaned.code == "OWID_WRL"]

In [None]:
world_index = cleaned.loc[cleaned.code == "OWID_WRL"].index[0]

In [None]:
continents_values = list(continents_map.values())

In [None]:
continents_values

In [None]:
cleaned.at[world_index, "members"] = continents_values

In [None]:
cleaned.loc[cleaned.code == "OWID_WRL"]

## Save the file

In [None]:
cleaned = cleaned.reset_index(drop=True)

In [None]:
cleaned.dtypes

In [None]:
cleaned_modern_dtypes = cleaned.convert_dtypes()

In [None]:
cleaned_modern_dtypes.dtypes

In [None]:
cleaned_modern_dtypes.head()

In [None]:
cleaned_modern_dtypes.drop(["continent"], axis=1, inplace=True)

In [None]:
cleaned_modern_dtypes.head()

In [None]:
cleaned_modern_dtypes.loc[cleaned_modern_dtypes.code == "OWID_WRL"]

In [None]:
import json

In [None]:
cleaned_modern_dtypes.members = cleaned_modern_dtypes.members.map(
    lambda row: json.dumps(row) if row != [] else None
)

In [None]:
cleaned_modern_dtypes.loc[cleaned_modern_dtypes.code == "OWID_WRL"]

In [None]:
cleaned_modern_dtypes.head()

In [None]:
cleaned_modern_dtypes.to_feather("intermediate/04-countries-with-continents.feather")
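
Since the members column is stored as JSON strings (or nulls for rows without members), a later step has to decode it after loading. A minimal sketch of that round trip on a hypothetical in-memory frame, without touching the feather file:

```python
import json

import pandas as pd

# Hypothetical toy frame with one region and one country.
toy = pd.DataFrame(
    {"code": ["OWID_EUR", "FRA"], "members": [["FRA", "DEU"], []]}
)

# Same encoding as above: non-empty lists become JSON strings,
# empty lists become None.
toy["members"] = toy["members"].map(
    lambda row: json.dumps(row) if row != [] else None
)

# Decoding: parse JSON strings, turn missing values back into empty lists.
decoded = toy["members"].map(
    lambda value: json.loads(value) if isinstance(value, str) else []
)
```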