# Getting and combining ISO country codes from Wikipedia

Collect the data from two Wikipedia pages, one in English and one in Estonian, and create a translation dictionary.

In [2]:
from bs4 import BeautifulSoup
import re
import requests

## Getting the Estonian ISO List from Wikipedia

In [3]:
url = r'https://et.wikipedia.org/wiki/ISO_maakoodide_loend'

r = requests.get(url)
soup = BeautifulSoup(r.text, features="html.parser")

Collect all the tables and select the correct one.

In [7]:
tables = soup.find_all('table')
table = tables[0] # select the first table
trs = table.find_all('tr') # find all table rows
trs[1:2] # the first row is the header, so the data starts at the second row

[<tr>
 <td><img alt="Afganistan" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/23px-Flag_of_Afghanistan.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/35px-Flag_of_Afghanistan.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/46px-Flag_of_Afghanistan.svg.png 2x" title="Afganistan" width="23"/> <a href="/wiki/Afganistan" title="Afganistan">Afganistan</a>
 </td>
 <td width="30px"><tt>AF</tt></td>
 <td><tt>AFG</tt></td>
 <td><tt>004</tt></td>
 <td><a class="new" href="/w/index.php?title=ISO_3166-2:AF&amp;action=edit&amp;redlink=1" title="ISO 3166-2:AF (pole veel kirjutatud)">ISO 3166-2:AF</a>
 </td></tr>]
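
Selecting a table by position, as in `tables[0]`, only works while the page layout stays the same. Below is a minimal sketch of an alternative, assuming the target table can be recognised by the text of one of its header cells. The helper `find_table_by_header` and the sample HTML (including its header labels) are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup


def find_table_by_header(soup, header_text):
    """Return the first table with a <th> cell containing header_text, else None."""
    for candidate in soup.find_all('table'):
        headers = [th.get_text(strip=True) for th in candidate.find_all('th')]
        if any(header_text in h for h in headers):
            return candidate
    return None


# Tiny stand-in document; the header labels are made up for this example.
sample_html = """
<table><tr><th>Note</th></tr><tr><td>not the one</td></tr></table>
<table>
  <tr><th>Name</th><th>Alpha-2</th></tr>
  <tr><td>Afganistan</td><td>AF</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, features="html.parser")
code_table = find_table_by_header(sample_soup, 'Alpha-2')
code_rows = code_table.find_all('tr')  # header row plus one data row
```

On the real pages, the header text to search for would have to be read off the page first. Index-based selection remains the simpler choice for a one-off run.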

Then, for each data row:
1. Find the table cells.
2. Extract the two-letter (alpha-2) code.
3. Create a dictionary entry, keyed by the alpha-2 code, that holds the country name.

In [9]:
a2_dct = {} # empty dictionary
for tr in trs[1:]: # all table rows except the header row
    tds = tr.find_all('td') # find all table cells in the row
    a2 = tds[1].text # text of the second cell (the alpha-2 code)
    a2_dct[a2] = {'ee' : tds[0].text.rstrip('\n').lstrip()} # entry holding the cleaned Estonian name
dict(list(a2_dct.items())[0:5]) # print first 5 entries

{'AF': {'ee': 'Afganistan'},
 'AX': {'ee': 'Ahvenamaa'},
 'AL': {'ee': 'Albaania'},
 'DZ': {'ee': 'Alžeeria'},
 'AS': {'ee': 'Ameerika Samoa'}}
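
The loop above assumes that every row after the first contains at least two `<td>` cells; a row without them would raise an `IndexError` at `tds[1]`. Below is a minimal sketch of a more defensive variant, assuming the cell texts have already been pulled out of each row. The helper `add_names` and the sample rows are hypothetical, and the `'ee'` key simply follows the notebook.

```python
def add_names(rows, lang, target):
    """Store each row's name under target[code][lang]; skip rows too short to hold a code."""
    for cells in rows:
        if len(cells) < 2:  # e.g. a header row, which has no <td> cells
            continue
        name, code = cells[0].strip(), cells[1].strip()
        target.setdefault(code, {})[lang] = name
    return target


# Stand-in rows: each inner list plays the role of [td.text for td in tr.find_all('td')].
sample_cells = [
    [],                              # header row
    [' Afganistan\n', 'AF', 'AFG'],
    [' Ahvenamaa\n', 'AX', 'ALA'],
]
sample_dct = add_names(sample_cells, 'ee', {})
print(sample_dct)
```

Because `setdefault` creates the inner dictionary only when the code is missing, the same helper can be called again with a different language key without overwriting names that are already stored.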

## Getting the English ISO List from Wikipedia

The approach is the same here: fetch the page, find the tables, and select the right one.

In [10]:
url = r'https://en.wikipedia.org/wiki/ISO_3166-1'

r = requests.get(url)
soup = BeautifulSoup(r.text, features="html.parser")

In [14]:
tables = soup.find_all('table')
table = tables[1] # here it is the second table on the webpage

trs = table.find_all('tr') # find all the rows
trs[1:2] # show first entry

[<tr>
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/23px-Flag_of_Afghanistan.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/35px-Flag_of_Afghanistan.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Flag_of_Afghanistan.svg/45px-Flag_of_Afghanistan.svg.png 2x" width="23"/></span> <a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a>
 </td>
 <td><a href="/wiki/ISO_3166-1_alpha-2#AF" title="ISO 3166-1 alpha-2"><link href="mw-data:TemplateStyles:r886049734" rel="mw-deduplicated-inline-style"/><span class="monospaced">AF</span></a></td>
 <td><link href="mw-data:TemplateStyles:r886049734" rel="mw-deduplicated-inline-style"/><span class="monospaced">AFG</span></td>
 <td><link href="mw-data:TemplateStyles:r886049734" rel="mw-deduplicated-inline-style"/><spa

Add the English names to the dictionary that already holds the Estonian country names.

In [15]:
for tr in trs[1:]:
    tds = tr.find_all('td')
    a2 = tds[1].text
    if a2 not in a2_dct:
        a2_dct[a2] = {}
    a2_dct[a2]['en'] = tds[0].text.rstrip('\n').lstrip() # add an "en" entry under the alpha-2 key
dict(list(a2_dct.items())[0:5]) # print first 5 entries

{'AF': {'ee': 'Afganistan', 'en': 'Afghanistan'},
 'AX': {'ee': 'Ahvenamaa', 'en': 'Åland Islands'},
 'AL': {'ee': 'Albaania', 'en': 'Albania'},
 'DZ': {'ee': 'Alžeeria', 'en': 'Algeria'},
 'AS': {'ee': 'Ameerika Samoa', 'en': 'American Samoa'}}
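
With both names stored under each code, a direct Estonian-to-English lookup can be derived from the combined dictionary. Below is a minimal sketch that uses a small stand-in for `a2_dct`. The names `sample`, `et_to_en` and `incomplete` are hypothetical, and the `'XX'` entry is made up to show what happens when a code appears in only one list.

```python
# Stand-in for a2_dct; 'XX' is a made-up entry that lacks an English name.
sample = {
    'AF': {'ee': 'Afganistan', 'en': 'Afghanistan'},
    'AL': {'ee': 'Albaania', 'en': 'Albania'},
    'XX': {'ee': 'placeholder name'},
}

# Estonian name -> English name, keeping only codes that have both.
et_to_en = {
    names['ee']: names['en']
    for names in sample.values()
    if 'ee' in names and 'en' in names
}

# Codes present in only one of the two lists are worth inspecting by hand.
incomplete = sorted(code for code, names in sample.items() if len(names) < 2)

print(et_to_en)
print(incomplete)
```

Running the same two expressions over `a2_dct` would yield the full mapping and list any codes that one of the two Wikipedia tables is missing.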