## Add regions (continents)

Fill in set membership information so we can start to drill down and up between regions and countries.

For this we add a members column that references other rows by their primary key (code).
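
To make the idea concrete, here is a minimal sketch of how a members column supports drilldown. The toy_members lookup and the drill_down helper are made up for illustration and are not part of this notebook's data or code.

```python
# Hypothetical toy lookup: code -> list of member codes (not the real table).
toy_members = {
    "OWID_WRL": ["OWID_EUR", "OWID_OCE"],
    "OWID_EUR": ["ALB", "AND"],
    "OWID_OCE": ["AUS", "FJI"],
}


def drill_down(code, members_by_code):
    """Return the leaf codes reachable from `code` by following members lists."""
    children = members_by_code.get(code, [])
    if not children:
        # A row without members is a leaf (a plain country).
        return [code]
    leaves = []
    for child in children:
        leaves.extend(drill_down(child, members_by_code))
    return leaves


world_leaves = drill_down("OWID_WRL", toy_members)
print(world_leaves)  # ['ALB', 'AND', 'AUS', 'FJI']
```

Drillup would be the reverse lookup: find the rows whose members list contains a given code.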

In [1]:
df = pd.read_feather("entities/03-countries-with-entitiyids.feather")

In [2]:
df.shape

(298, 17)

❔ Do we have any rows that do not have a continent assigned in the continent column?

In [3]:
df[df.continent.isnull()]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id
157,,Micronesia (region),,,,,,,,,,,,,,,632.0
201,,Rest of the World,,,,,,,,,,,,,,,560.0
272,OWID_WRL,World,,,,,,,,,,,,,,355.0,559.0
280,OWID_ABK,Abkhazia,,,,,,,,,,,,,,386.0,
281,OWID_AKD,Akrotiri and Dhekelia,,,,,,,,,,,,,,387.0,
282,OWID_ERE,Eritrea and Ethiopia,,,,,,,,,,,,,,388.0,
283,OWID_NAG,Nagorno-Karabakh,,,,,,,,,,,,,,389.0,
284,OWID_SRM,Serbia and Montenegro,,,,,,,,,,,,,,268.0,
285,OWID_SEK,Serbia excluding Kosovo,,,,,,,,,,,,,,392.0,
286,OWID_SML,Somaliland,,,,,,,,,,,,,,393.0,


💡 It looks like Micronesia (region) is odd - do we have more entries like this?

In [4]:
df[df.name.str.contains("icronesia")]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id
156,FSM,Micronesia (country),FM,FSM,868.0,FSM,987.0,,FM,,FSM,FSM,6.0,http://www.wikidata.org/entity/Q702,Federated States of Micronesia,222.0,633.0
157,,Micronesia (region),,,,,,,,,,,,,,,632.0


I would argue that "Rest of the World" and "Micronesia (region)" should be dropped. They have no other identifiers and no entity id, and neither is defined clearly enough to be used reliably across table joins. If we get rid of those, every remaining row without a continent (for example "World") at least has a code or a legacy entity id, which sounds good.

In [5]:
df[df.continent.isnull() & df.code.isnull() & df.legacy_entity_id.isnull()]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id
157,,Micronesia (region),,,,,,,,,,,,,,,632.0
201,,Rest of the World,,,,,,,,,,,,,,,560.0


In [6]:
cleaned = df.drop(df.loc[df.continent.isnull() & df.code.isnull() & df.legacy_entity_id.isnull()].index)

In [7]:
cleaned.shape

(296, 17)

In [8]:
cleaned[cleaned.code.isnull()]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id
128,,Korea,,,,KOR,730.0,,,,,,2.0,,,,832.0
292,,Africa,,,,,,,,,,,,,,273.0,
293,,Asia,,,,,,,,,,,,,,275.0,
294,,Europe,,,,,,,,,,,,,,276.0,
295,,Latin America,,,,,,,,,,,,,,5403.0,
296,,North America,,,,,,,,,,,,,,294.0,
297,,Oceania,,,,,,,,,,,,,,277.0,


⚡ The code column will be our primary key, so it is mandatory for every row in our table. Let's fill in the missing ones with some new, made-up values.

In [9]:
new_codes = {
    "Korea": "OWID_KOR",
    "Africa": "OWID_AFR",
    "Asia": "OWID_ASI",
    "Europe": "OWID_EUR",
    "Oceania": "OWID_OCE",
    "North America": "OWID_NAM",
    "Latin America": "OWID_LAM"    
}

❔ Are any of these codes already used?

In [10]:
cleaned[cleaned.code.isin(new_codes.values())]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id


In [11]:
cleaned["code"] = cleaned.code.fillna(cleaned.name.map(new_codes))

In [12]:
cleaned[cleaned.code.isnull()]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id


Make sure we don't have any duplicates in the code column.

In [13]:
cleaned.duplicated(subset=["code"]).any()

False
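
The two checks above (no missing codes, no duplicated codes) could also be bundled into one helper that fails loudly. This is only a sketch: check_primary_key and the toy frame are hypothetical names, not something this notebook defines.

```python
import pandas as pd


def check_primary_key(frame, column):
    """Raise ValueError if `column` has missing or duplicated values."""
    if frame[column].isnull().any():
        raise ValueError(f"{column} contains missing values")
    if not frame[column].is_unique:
        raise ValueError(f"{column} contains duplicates")


# Hypothetical toy data standing in for the cleaned table.
toy = pd.DataFrame(
    {"code": ["AFG", "ALB", "OWID_EUR"], "name": ["Afghanistan", "Albania", "Europe"]}
)
check_primary_key(toy, "code")  # passes silently for a valid key column
```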

### Now group by continent, create and fill the members column

In [14]:
continents_members = cleaned.groupby("continent")["code"].apply(list)

In [15]:
continents_members

continent
1.0    [AIA, ATG, ABW, BHS, BRB, BLZ, BMU, BES, VGB, ...
2.0    [AFG, ARM, AZE, BHR, BGD, BTN, IOT, BRN, KHM, ...
3.0    [DZA, AGO, BEN, BWA, BFA, BDI, CMR, CPV, CAF, ...
4.0    [ALA, ALB, AND, AUT, OWID_AUH, OWID_BAD, OWID_...
5.0    [ARG, BOL, BRA, OWID_NLC, CHL, COL, ECU, FLK, ...
6.0    [ASM, AUS, COK, FJI, PYF, GUM, KIR, MHL, OWID_...
7.0                            [ATA, BVT, ATF, HMD, SGS]
Name: code, dtype: object
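
A small illustration of what this groupby call does, using made-up toy data rather than the real table. Note that pandas drops rows whose group key is missing by default, so entries without a continent (such as World) do not end up in any member list.

```python
import pandas as pd

# Hypothetical toy data: two European rows, one in Oceania, one without a continent.
toy = pd.DataFrame(
    {
        "code": ["ALB", "AND", "AUS", "OWID_WRL"],
        "continent": [4.0, 4.0, 6.0, None],
    }
)

# Collect the codes of each continent into a Python list.
toy_members = toy.groupby("continent")["code"].apply(list)
print(toy_members.to_dict())  # {4.0: ['ALB', 'AND'], 6.0: ['AUS']}
```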

In [16]:
continents_members.dtype

dtype('O')

In [17]:
#continents_members = continents_members.map(lambda m: pd.array(m, dtype="string"))

In [18]:
continents_members[7.0]

['ATA', 'BVT', 'ATF', 'HMD', 'SGS']

In [19]:
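# Take the first member of each continent to work out which continent each number stands for.
first_items = [members[0] for members in continents_members.values]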

In [20]:
cleaned[cleaned.code.isin(first_items)]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id
0,AFG,Afghanistan,AF,AFG,512.0,AFG,700.0,AFG,AF,AFGN,AFG,AFG,2.0,http://www.wikidata.org/entity/Q889,Afghanistan,15.0,562.0
1,ALA,Aland Islands,AX,ALA,,,,,,,,,4.0,http://www.wikidata.org/entity/Q5689,Åland,296.0,791.0
3,DZA,Algeria,DZ,DZA,612.0,ALG,615.0,ALG,AE,ALGR,DZA,DZA,3.0,http://www.wikidata.org/entity/Q262,Algeria,17.0,619.0
4,ASM,American Samoa,AS,ASM,859.0,,,,AS,,ASM,ASM,6.0,http://www.wikidata.org/entity/Q16641,American Samoa,246.0,571.0
7,AIA,Anguilla,AI,AIA,312.0,,,ANL,AM,,AIA,AIA,1.0,http://www.wikidata.org/entity/Q25228,Anguilla,228.0,564.0
8,ATA,Antarctica,AQ,ATA,,,,,,,,,7.0,http://www.wikidata.org/entity/Q21590062,Antarctic Treaty area,346.0,792.0
10,ARG,Argentina,AR,ARG,213.0,ARG,160.0,ARG,AG,ARGN,ARG,ARG,5.0,http://www.wikidata.org/entity/Q414,Argentina,21.0,569.0


In [21]:
continents_map = {
    1.0: "OWID_NAM",
    2.0: "OWID_ASI",
    3.0: "OWID_AFR",
    4.0: "OWID_EUR",
    5.0: "OWID_LAM",
    6.0: "OWID_OCE",
    7.0: "ATA"
}
reverse_continents_map = {code: continent_id for continent_id, code in continents_map.items()}

In [22]:
# Region rows get the member list of their continent; every other row gets an empty list.
cleaned["members"] = cleaned.code.map(lambda code: continents_members[reverse_continents_map[code]] if code in reverse_continents_map else [])

In [23]:
cleaned[cleaned.code == "ATA"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
8,ATA,Antarctica,AQ,ATA,,,,,,,,,7.0,http://www.wikidata.org/entity/Q21590062,Antarctic Treaty area,346.0,792.0,"[ATA, BVT, ATF, HMD, SGS]"


In [24]:
antarctica_index = cleaned[cleaned.code == "ATA"].index[0]

In [25]:
antarctica_index

8

All looks good - let's just remove the self reference in Antarctica :) It lists itself because ATA is both the code we use for the continent and a row that is assigned to continent 7.

In [26]:
without_ata = [m for m in cleaned.loc[cleaned.code == "ATA", "members"].item() if m != "ATA"]

In [27]:
without_ata

['BVT', 'ATF', 'HMD', 'SGS']

In [28]:
cleaned.loc[cleaned.code == "ATA"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
8,ATA,Antarctica,AQ,ATA,,,,,,,,,7.0,http://www.wikidata.org/entity/Q21590062,Antarctic Treaty area,346.0,792.0,"[ATA, BVT, ATF, HMD, SGS]"


In [29]:
cleaned.at[antarctica_index, "members"] = without_ata

In [30]:
cleaned.loc[cleaned.code == "ATA"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
8,ATA,Antarctica,AQ,ATA,,,,,,,,,7.0,http://www.wikidata.org/entity/Q21590062,Antarctic Treaty area,346.0,792.0,"[BVT, ATF, HMD, SGS]"
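
If more rows ever ended up listing themselves, the same cleanup could be done for the whole column at once instead of for one index. A minimal sketch on hypothetical toy data (the toy frame is made up, and this is not the approach the notebook itself takes):

```python
import pandas as pd

# Hypothetical toy data: ATA lists itself, the other rows do not.
toy = pd.DataFrame(
    {
        "code": ["ATA", "OWID_EUR", "ALB"],
        "members": [["ATA", "BVT"], ["ALB"], []],
    }
)

# Rebuild every members list without the row's own code.
toy["members"] = pd.Series(
    [
        [m for m in members if m != code]
        for code, members in zip(toy["code"], toy["members"])
    ],
    index=toy.index,
)
print(toy["members"].tolist())  # [['BVT'], ['ALB'], []]
```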


Now set the world to have the continents as its members.

In [31]:
cleaned.loc[cleaned.code == "OWID_WRL"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
272,OWID_WRL,World,,,,,,,,,,,,,,355.0,559.0,[]


In [32]:
world_index = cleaned.loc[cleaned.code == "OWID_WRL"].index[0]

In [33]:
continents_values = list(continents_map.values())

In [34]:
continents_values

['OWID_NAM', 'OWID_ASI', 'OWID_AFR', 'OWID_EUR', 'OWID_LAM', 'OWID_OCE', 'ATA']

In [35]:
cleaned.at[world_index, "members"] = continents_values

In [36]:
cleaned.loc[cleaned.code == "OWID_WRL"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
272,OWID_WRL,World,,,,,,,,,,,,,,355.0,559.0,"[OWID_NAM, OWID_ASI, OWID_AFR, OWID_EUR, OWID_..."
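
Since members are references to other rows by code, it may be worth checking that every referenced code really exists in the table. The sketch below does this on hypothetical toy data; the names toy, known_codes and dangling are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the table with its members column.
toy = pd.DataFrame(
    {
        "code": ["OWID_WRL", "OWID_EUR", "ALB", "AND"],
        "members": [["OWID_EUR"], ["ALB", "AND"], [], []],
    }
)

known_codes = set(toy["code"])

# For each row, collect member codes that do not match any row's code.
dangling = {}
for code, members in zip(toy["code"], toy["members"]):
    missing = [m for m in members if m not in known_codes]
    if missing:
        dangling[code] = missing

print(dangling)  # {} means every reference resolves
```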


## Save the file

In [37]:
cleaned = cleaned.reset_index(drop=True)

In [38]:
cleaned.dtypes

code                  object
name                  object
iso_alpha2            object
iso_alpha3            object
imf_code             float64
cow_letter            object
cow_code             float64
unctad_code           object
marc_code             object
ncd_code              object
kansas_code           object
penn_code             object
continent            float64
wikidata_uri          object
wikidata_label        object
legacy_entity_id     float64
legacy_country_id    float64
members               object
dtype: object

In [39]:
cleaned_modern_dtypes = cleaned.convert_dtypes()

In [40]:
cleaned_modern_dtypes.dtypes

code                 string
name                 string
iso_alpha2           string
iso_alpha3           string
imf_code              Int64
cow_letter           string
cow_code              Int64
unctad_code          string
marc_code            string
ncd_code             string
kansas_code          string
penn_code            string
continent             Int64
wikidata_uri         string
wikidata_label       string
legacy_entity_id      Int64
legacy_country_id     Int64
members              object
dtype: object

In [41]:
cleaned_modern_dtypes.head()

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,continent,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
0,AFG,Afghanistan,AF,AFG,512.0,AFG,700.0,AFG,AF,AFGN,AFG,AFG,2,http://www.wikidata.org/entity/Q889,Afghanistan,15,562,[]
1,ALA,Aland Islands,AX,ALA,,,,,,,,,4,http://www.wikidata.org/entity/Q5689,Åland,296,791,[]
2,ALB,Albania,AL,ALB,914.0,ALB,339.0,ALB,AA,ALBN,ALB,ALB,4,http://www.wikidata.org/entity/Q222,Albania,16,565,[]
3,DZA,Algeria,DZ,DZA,612.0,ALG,615.0,ALG,AE,ALGR,DZA,DZA,3,http://www.wikidata.org/entity/Q262,Algeria,17,619,[]
4,ASM,American Samoa,AS,ASM,859.0,,,,AS,,ASM,ASM,6,http://www.wikidata.org/entity/Q16641,American Samoa,246,571,[]


In [42]:
cleaned_modern_dtypes.drop(["continent"], axis=1, inplace=True)

In [43]:
cleaned_modern_dtypes.head()

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
0,AFG,Afghanistan,AF,AFG,512.0,AFG,700.0,AFG,AF,AFGN,AFG,AFG,http://www.wikidata.org/entity/Q889,Afghanistan,15,562,[]
1,ALA,Aland Islands,AX,ALA,,,,,,,,,http://www.wikidata.org/entity/Q5689,Åland,296,791,[]
2,ALB,Albania,AL,ALB,914.0,ALB,339.0,ALB,AA,ALBN,ALB,ALB,http://www.wikidata.org/entity/Q222,Albania,16,565,[]
3,DZA,Algeria,DZ,DZA,612.0,ALG,615.0,ALG,AE,ALGR,DZA,DZA,http://www.wikidata.org/entity/Q262,Algeria,17,619,[]
4,ASM,American Samoa,AS,ASM,859.0,,,,AS,,ASM,ASM,http://www.wikidata.org/entity/Q16641,American Samoa,246,571,[]


In [44]:
cleaned_modern_dtypes.loc[cleaned_modern_dtypes.code == "OWID_WRL"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
270,OWID_WRL,World,,,,,,,,,,,,,355,559,"[OWID_NAM, OWID_ASI, OWID_AFR, OWID_EUR, OWID_..."


In [45]:
import json

In [46]:
cleaned_modern_dtypes.members = cleaned_modern_dtypes.members.map(lambda row: json.dumps(row) if row != [] else None)

In [47]:
cleaned_modern_dtypes.loc[cleaned_modern_dtypes.code == "OWID_WRL"]

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
270,OWID_WRL,World,,,,,,,,,,,,,355,559,"[""OWID_NAM"", ""OWID_ASI"", ""OWID_AFR"", ""OWID_EUR..."


In [48]:
cleaned_modern_dtypes.head()

Unnamed: 0,code,name,iso_alpha2,iso_alpha3,imf_code,cow_letter,cow_code,unctad_code,marc_code,ncd_code,kansas_code,penn_code,wikidata_uri,wikidata_label,legacy_entity_id,legacy_country_id,members
0,AFG,Afghanistan,AF,AFG,512.0,AFG,700.0,AFG,AF,AFGN,AFG,AFG,http://www.wikidata.org/entity/Q889,Afghanistan,15,562,
1,ALA,Aland Islands,AX,ALA,,,,,,,,,http://www.wikidata.org/entity/Q5689,Åland,296,791,
2,ALB,Albania,AL,ALB,914.0,ALB,339.0,ALB,AA,ALBN,ALB,ALB,http://www.wikidata.org/entity/Q222,Albania,16,565,
3,DZA,Algeria,DZ,DZA,612.0,ALG,615.0,ALG,AE,ALGR,DZA,DZA,http://www.wikidata.org/entity/Q262,Algeria,17,619,
4,ASM,American Samoa,AS,ASM,859.0,,,,AS,,ASM,ASM,http://www.wikidata.org/entity/Q16641,American Samoa,246,571,
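
Because members is now stored as a JSON string (or left empty for rows without members), whoever reads the file back has to decode it again. A minimal sketch of the round trip; encode_members and decode_members are hypothetical helper names, and treating every non-string value as "no members" is an assumption about how the empty values come back after loading.

```python
import json


def encode_members(members):
    """Serialize a non-empty list of codes to JSON; empty lists become None."""
    return json.dumps(members) if members else None


def decode_members(value):
    """Turn a stored value back into a list; anything that is not a string means no members."""
    return json.loads(value) if isinstance(value, str) else []


encoded = encode_members(["OWID_NAM", "ATA"])
print(encoded)                  # ["OWID_NAM", "ATA"]
print(decode_members(encoded))  # ['OWID_NAM', 'ATA']
print(decode_members(None))     # []
```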


In [49]:
cleaned_modern_dtypes.to_feather("entities/04-countries-with-continents.feather")