# Country Converter

The country converter (coco) is a Python package to convert country names into different classifications and between different naming versions. Internally it uses regular expressions to match country names.


## Installation

Download the package and add its location to your Python path:

In [12]:
import sys
_fd = r'D:\KST\proj\country_converter'
if _fd not in sys.path:
    sys.path.append(_fd)
del _fd

import country_converter as coco

## Conversion

The country converter provides one main class which is used for the conversion:

In [16]:
cc = coco.CountryConverter()

Given a list of countries in a certain classification:

In [24]:
iso3_codes = ['USA', 'VUT', 'TKL', 'AUT', 'AFG', 'ALB' ]

This list can be converted to any other available classification with:

In [28]:
cc.convert(names = iso3_codes, src = 'ISO3', to = 'name_official')

['United States of America',
 'Republic of Vanuatu',
 'Tokelau',
 'Republic of Austria',
 'Islamic Republic of Afghanistan',
 'Republic of Albania']

or

In [30]:
cc.convert(names = iso3_codes, src = 'ISO3', to = 'continent')

['America', 'Oceania', 'Oceania', 'Europe', 'Asia', 'Europe']

The parameter "src" specifies the input format and "to" the output format. Possible values for both parameters are listed in:

In [22]:
cc.valid_class

['name_short',
 'name_official',
 'regex',
 'ISO2',
 'ISO3',
 'ISOnumeric',
 'UNcode',
 'continent',
 'UNregion',
 'EXIO1',
 'EXIO2',
 'EXIO3',
 'WIOD',
 'OECD',
 'EU',
 'EURO',
 'UNmember']

Internally, these names are the column headers of the underlying pandas dataframe (see below).
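Because "src" and "to" are column headers, a conversion can be pictured as a lookup from one column to another. The following is a minimal sketch of that idea using a plain list of dicts and not the package's own code. The three-row excerpt, the function name `convert_by_lookup` and the `not_found` default are assumptions made for this example.

```python
# Hypothetical three-row excerpt; the real data has many more rows and columns.
ROWS = [
    {"name_short": "Austria", "ISO2": "AT", "ISO3": "AUT", "continent": "Europe"},
    {"name_short": "Afghanistan", "ISO2": "AF", "ISO3": "AFG", "continent": "Asia"},
    {"name_short": "Albania", "ISO2": "AL", "ISO3": "ALB", "continent": "Europe"},
]


def convert_by_lookup(rows, names, src, to, not_found="not found"):
    """Look up each name in column `src` and return the value of column `to`."""
    valid = list(rows[0].keys())  # the column headers act as valid classes
    for column in (src, to):
        if column not in valid:
            raise ValueError(f"{column!r} is not one of {valid}")
    table = {row[src]: row[to] for row in rows}
    return [table.get(name, not_found) for name in names]


print(convert_by_lookup(ROWS, ["AUT", "AFG", "ALB"], src="ISO3", to="continent"))
# ['Europe', 'Asia', 'Europe']
```

Validating both parameters against the headers up front gives a clear error for a misspelled classification instead of a list full of "not found" entries.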

The convert function can also be accessed without instantiating the CountryConverter. This can be useful for one-time usage. For multiple conversions, instantiating the CountryConverter once avoids reading the file with the matching data again for each conversion.

In [58]:
coco.convert(names = iso3_codes, src = 'ISO3', to = 'ISO2')

['US', 'VU', 'TK', 'AT', 'AF', 'AL']
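The trade-off between the two call styles can be illustrated with a small stand-in class. This is only a sketch: `TinyConverter` and its load counter are hypothetical and do not reflect how the package stores or reads its data.

```python
class TinyConverter:
    """Stand-in converter that counts how often its data gets loaded."""

    loads = 0  # class-level counter, for demonstration only

    def __init__(self):
        TinyConverter.loads += 1  # stands in for reading the data file
        self.iso3_to_iso2 = {"USA": "US", "AUT": "AT", "ALB": "AL"}

    def convert(self, names):
        return [self.iso3_to_iso2.get(name, "not found") for name in names]


def convert(names):
    """Module-level shortcut: builds a fresh converter on every call."""
    return TinyConverter().convert(names)


batches = [["USA"], ["AUT"], ["ALB"]]

# One-off style: the data is loaded once per call.
one_off = [convert(batch) for batch in batches]
loads_one_off = TinyConverter.loads

# Shared instance: the data is loaded once in total.
TinyConverter.loads = 0
shared = TinyConverter()
reused = [shared.convert(batch) for batch in batches]
loads_reused = TinyConverter.loads

print(loads_one_off, loads_reused)  # 3 1
```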

Some classifications can also be accessed through shortcuts. For example:

In [60]:
cc.EU27

|    | name_short     |
|---:|:---------------|
| 13 | Austria        |
| 20 | Belgium        |
| 34 | Bulgaria       |
| 59 | Cyprus         |
| 60 | Czech Republic |
| 61 | Denmark        |
| 70 | Estonia        |
| 75 | Finland        |
| 76 | France         |
| 83 | Germany        |


In [67]:
cc.OECDin('ISO2')

|    | ISO2 |
|---:|:-----|
| 12 | AU   |
| 13 | AT   |
| 20 | BE   |
| 39 | CA   |
| 45 | CL   |
| 60 | CZ   |
| 61 | DK   |
| 70 | EE   |
| 75 | FI   |
| 76 | FR   |


## Regular expression matching

The input parameter "src" can be set to "regex" to use regular expression matching for a given country list. For example:

In [17]:
some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma', 'Iran (Islamic Republic of)', 'Korea, Republic of', "Dem. People's Rep. of Korea"]

In [37]:
cc.convert(names = some_names, src = "regex", to = "name_short")

['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']
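The idea behind the regex conversion can be sketched with the standard `re` module: each short name carries one pattern, and an input name is assigned to the short name whose pattern matches. The patterns below are written for this example only and are not the ones shipped with the package. The function name `to_short_name` and the `not_found` default are likewise assumptions.

```python
import re

# Hypothetical patterns for illustration; matching is case-insensitive.
PATTERNS = {
    "Tanzania": r"tanzania",
    "Cabo Verde": r"cabo.?verde|cape.?verde",
    "Myanmar": r"myanmar|burma",
    "Iran": r"\biran\b",
    # Lookaheads separate the two Koreas by qualifiers found anywhere in the name.
    "South Korea": r"^(?!.*(dem|people|north)).*korea",
    "North Korea": r"^(?=.*(dem|people|north)).*korea",
}
COMPILED = {short: re.compile(p, re.IGNORECASE) for short, p in PATTERNS.items()}


def to_short_name(name, not_found="not found"):
    """Return the short name whose pattern matches `name`."""
    hits = [short for short, pattern in COMPILED.items() if pattern.search(name)]
    if not hits:
        return not_found
    # A well-formed pattern set yields exactly one hit; return all hits otherwise.
    return hits[0] if len(hits) == 1 else hits


some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]
print([to_short_name(name) for name in some_names])
# ['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea']
```

The lookahead in the Korea patterns serves the same purpose as in the `^(?=.*americ).*samoa` entry shown in the data excerpt under Internals: it requires (or forbids) a second keyword regardless of word order.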

The regular expressions can also be used to match any list of countries to any other. For example: 

In [41]:
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China' ]

coco.match(match_these, master_list)

{'china': 'Peoples Republic of China',
 'norway': 'Norway is a Kingdom too',
 'taiwan': 'Republic of China',
 'united_states': 'USA'}

If the regular expression matches several times, all results are given as a list and a warning is generated:

In [69]:
match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Taiwan, province of china', 'Republic of China' ]

coco.match(match_these, master_list)



{'china': 'Peoples Republic of China',
 'norway': 'Norway is a Kingdom too',
 'taiwan': ['Taiwan, province of china', 'Republic of China'],
 'united_states': 'USA'}

The parameter "enforce_sublist" can be set to ensure consistent output:

In [46]:
coco.match(match_these, master_list, enforce_sublist = True)



{'china': ['Peoples Republic of China'],
 'norway': ['Norway is a Kingdom too'],
 'taiwan': ['Taiwan, province of china', 'Republic of China'],
 'united_states': ['USA']}

A warning also occurs if one of the names couldn't be found:

In [50]:
match_these = ['norway', 'united_states', 'china', 'taiwan', 'some other country']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China',  'Republic of China' ]
coco.match(match_these, master_list)



{'china': 'Peoples Republic of China',
 'norway': 'Norway is a Kingdom too',
 'some other country': 'not_found',
 'taiwan': 'Republic of China',
 'united_states': 'USA'}

The value returned for countries that are not found can be specified:

In [54]:
coco.match(match_these, master_list, not_found = 'its not there')



{'china': 'Peoples Republic of China',
 'norway': 'Norway is a Kingdom too',
 'some other country': 'its not there',
 'taiwan': 'Republic of China',
 'united_states': 'USA'}

Setting "not_found" to None passes the unmatched name through unchanged to the new classification:

In [56]:
coco.match(match_these, master_list, not_found = None)



{'china': 'Peoples Republic of China',
 'norway': 'Norway is a Kingdom too',
 'some other country': 'some other country',
 'taiwan': 'Republic of China',
 'united_states': 'USA'}
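The behaviour of the three options shown above (several matches, "enforce_sublist" and "not_found") can be summarised in one simplified sketch. It maps both lists to a standard name through regular expressions and pairs the entries that agree. The function `match_lists`, its helper and the patterns are hypothetical and written for this example; the package's own `match` may work differently in detail.

```python
import re
import warnings

# Hypothetical patterns for illustration only.
PATTERNS = {
    "Norway": r"norway",
    "Sweden": r"swed",
    "United States": r"united.?states|\busa\b",
    "China": r"^(?!.*taiwan)(?!republic of china).*china",
    "Taiwan": r"taiwan|^republic of china",
}
COMPILED = {std: re.compile(p, re.IGNORECASE) for std, p in PATTERNS.items()}


def standard_name(name):
    """Return the single standard name whose pattern matches, else None."""
    hits = [std for std, pattern in COMPILED.items() if pattern.search(name)]
    return hits[0] if len(hits) == 1 else None


def match_lists(list_a, list_b, not_found="not_found", enforce_sublist=False):
    """Map each entry of list_a to the entries of list_b naming the same country."""
    by_standard = {}
    for candidate in list_b:
        std = standard_name(candidate)
        if std is not None:
            by_standard.setdefault(std, []).append(candidate)

    result = {}
    for name in list_a:
        hits = by_standard.get(standard_name(name), [])
        if not hits:
            warnings.warn(f"no match for {name!r}")
            # not_found=None passes the original name through unchanged
            hits = [name if not_found is None else not_found]
        elif len(hits) > 1:
            warnings.warn(f"several matches for {name!r}")
        result[name] = hits if enforce_sublist or len(hits) > 1 else hits[0]
    return result


match_these = ['norway', 'united_states', 'china', 'taiwan', 'some other country']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too',
               'Peoples Republic of China', 'Taiwan, province of china',
               'Republic of China']

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demonstration output quiet
    default_result = match_lists(match_these, master_list)
    passthrough_result = match_lists(match_these, master_list, not_found=None)
    sublist_result = match_lists(match_these, master_list, enforce_sublist=True)

print(default_result['taiwan'])
# ['Taiwan, province of china', 'Republic of China']
print(passthrough_result['some other country'])
# some other country
print(sublist_result['norway'])
# ['Norway is a Kingdom too']
```

In this sketch, "enforce_sublist" also wraps the not-found value in a list, so every value in the result has the same type.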

## Internals

Within a CountryConverter instance, the raw data for the conversion is stored in a pandas dataframe.
This dataframe can be accessed directly with:

In [14]:
cc.data.head()

|   | name_short | name_official | regex | ISO2 | ISO3 | ISOnumeric | UNcode | continent | UNregion | EXIO1 | EXIO2 | EXIO3 | WIOD | OECD | EU | EURO | UNmember |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Islamic Republic of Afghanistan | `afghan` | AF | AFG | 4 | 4 | Asia | Southern Asia | WW | WA | WA | ROW | | | | 1946.0 |
| 1 | Albania | Republic of Albania | `albania` | AL | ALB | 8 | 8 | Europe | Southern Europe | WW | WE | WE | ROW | | | | 1955.0 |
| 2 | Algeria | People's Democratic Republic of Algeria | `algeria` | DZ | DZA | 12 | 12 | Africa | Northern Africa | WW | WF | WF | ROW | | | | 1962.0 |
| 3 | American Samoa | American Samoa | `^(?=.*americ).*samoa` | AS | ASM | 16 | 16 | Oceania | Polynesia | WW | WA | WA | ROW | | | | |
| 4 | Andorra | Principality of Andorra | `andorra` | AD | AND | 20 | 20 | Europe | Southern Europe | WW | WE | WE | ROW | | | | 1993.0 |


This dataframe can be extended in both directions, that is, with additional rows (countries) and additional columns (classifications). The only requirement is that the values for name_short, name_official and regex remain unique.

Internally, the data is saved in country_data.txt as tab-separated values (utf-8 encoded).
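When extending the data, the uniqueness requirement can be checked before use. The sketch below parses tab-separated text with the standard `csv` module and reports repeated values. The helper `duplicated_values`, the sample rows and the "Atlantis" entry are made up for this example. For a real file, the text would come from something like `open(path, encoding="utf-8", newline="")`.

```python
import csv
import io
from collections import Counter

UNIQUE_COLUMNS = ("name_short", "name_official", "regex")

# Hypothetical two-row excerpt in the tab-separated layout described above.
SAMPLE_TSV = (
    "name_short\tname_official\tregex\tISO3\n"
    "Albania\tRepublic of Albania\talbania\tALB\n"
    "Andorra\tPrincipality of Andorra\tandorra\tAND\n"
)


def duplicated_values(rows, columns=UNIQUE_COLUMNS):
    """Return {column: [values occurring more than once]} for the given columns."""
    duplicates = {}
    for column in columns:
        counts = Counter(row[column] for row in rows)
        repeated = [value for value, n in counts.items() if n > 1]
        if repeated:
            duplicates[column] = repeated
    return duplicates


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
print(duplicated_values(rows))  # {}

# A new row that reuses an existing regex violates the requirement.
rows.append({"name_short": "Atlantis", "name_official": "Kingdom of Atlantis",
             "regex": "andorra", "ISO3": "ATL"})
print(duplicated_values(rows))  # {'regex': ['andorra']}
```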

Of course, all pandas indexing and matching methods can be used. For example:

In [70]:
some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Romania', 'Russia',  'Turkey', 'United Kingdom', 'United States']
cc.data[(cc.data.OECD >= 1995) & cc.data.name_short.isin(some_countries)].name_short

60    Czech Republic
70           Estonia
99           Hungary
Name: name_short, dtype: object

Further information can be found here: http://pandas.pydata.org/pandas-docs/stable/

## Testing

All regular expressions of the country converter are tested for a unique match to name_short and name_official. 
Test sets for alternative names found in various databases are also available. 

The test sets are stored in the /test subfolder. The tests require py.test.
I recommend rerunning the tests whenever a regular expression is changed.

To specify a new test set, add a tab-separated file with the headers "name\_short" and "name\_test". Each row holds one pair: the short name as used in the main classification file and the alternative name that should be tested. If the file name starts with "test\_regex\_", the test functions recognise it automatically.
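As an illustration of the file layout and of what such a test checks, here is a minimal sketch. The file content, the two regular expressions and the helper `failing_pairs` are hypothetical. The package's own test functions read the files from the test folder and may be organised differently.

```python
import csv
import io
import re

# Hypothetical excerpt of the main data: short name -> regular expression.
REGEXES = {
    "Myanmar": r"myanmar|burma",
    "Cabo Verde": r"cabo.?verde|cape.?verde",
}

# Hypothetical content of a test set file (tab-separated, one pair per row).
TEST_SET = (
    "name_short\tname_test\n"
    "Myanmar\tBurma\n"
    "Cabo Verde\tCape Verde\n"
)


def failing_pairs(test_text, regexes):
    """Return the (name_short, name_test) pairs without exactly one correct match."""
    failures = []
    for row in csv.DictReader(io.StringIO(test_text), delimiter="\t"):
        hits = [short for short, pattern in regexes.items()
                if re.search(pattern, row["name_test"], flags=re.IGNORECASE)]
        if hits != [row["name_short"]]:
            failures.append((row["name_short"], row["name_test"]))
    return failures


def test_alternative_names():
    # py.test collects functions whose names start with "test_".
    assert failing_pairs(TEST_SET, REGEXES) == []
```

Requiring exactly one hit catches both kinds of regression: a pattern that no longer matches an alternative name and a pattern that has become too broad and also matches another country.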

Konstantin Stadler 20150806