# Obtaining World Bank data for each participating country

## Purpose
The purpose of this notebook is to obtain World Bank data for every country that has competed in the Olympic Games. We use the country names in a dictionary file to query the World Bank website, then parse the HTML of each country page with BeautifulSoup to obtain the download link for that country's .csv data. Finally, we download the zip file for each country, unzip it, and store the relevant files in a single folder.


## Datasets
<b>dictionary.csv</b> - A file which contained the name, NOC code and region of every country which has competed in the Olympic games


Imports necessary libraries

In [2]:
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from bs4 import BeautifulSoup
import webbrowser
import urllib.request
from lxml import html
import zipfile
import re

Ensures that the required file exists

In [3]:
# Ensure the file exists
if not os.path.exists( "../../data/raw/dictionary.csv" ):
    print("Missing dataset file")

Reads in the dictionary dataset containing the name of each participating nation

In [4]:
df = pd.read_csv( "../../data/raw/dictionary.csv")
df.head()

Unnamed: 0,Country,Code,Region
0,Afghanistan,AFG,West and Central Asia
1,Albania,ALB,Europe
2,Algeria,ALG,North Africa
3,American Samoa,ASA,Oceania
4,Andorra,AND,Europe


## Downloading the data

The get_links function finds the World Bank webpage for a given country and returns the download link from that page. If the page cannot be loaded or parsed, it prints an error message and returns None.

In [18]:
def get_links(country):
    # Build the URL of the World Bank page for this country
    url = "https://data.worldbank.org/country/{}?year_low_desc=true".format(country)
    try:
        # Load the HTML for the page and parse it to obtain the download link, which is then returned
        with urllib.request.urlopen(url) as response:
            page = response.read()
        soup = BeautifulSoup(page, 'html.parser')
        links = soup.find("div", {"class": "btn-item download"}).find_all('a')
        # The link is the second double-quoted value in the first anchor tag
        download = re.findall('"([^"]*)"', str(links[0]))[1]
        print(download)
        return download
    except Exception:
        # Network errors and unexpected page layouts both end up here
        print("ERROR - " + url)
        return None
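The parsing step can be tried without a network connection. The sketch below is a minimal illustration, assuming the download block looks like the hypothetical `SAMPLE_HTML` fragment; the real page markup may differ. The helper name `extract_download_link` is not part of the notebook, and it reads the `href` attribute directly instead of using the regular expression from get_links.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, assumed to resemble the download button markup.
SAMPLE_HTML = """
<div class="btn-item download">
  <a href="http://api.worldbank.org/v2/en/country/AFG?downloadformat=csv">CSV</a>
  <a href="http://api.worldbank.org/v2/en/country/AFG?downloadformat=xml">XML</a>
</div>
"""

def extract_download_link(page_html):
    """Return the href of the first link in the download block, or None."""
    soup = BeautifulSoup(page_html, "html.parser")
    block = soup.find("div", {"class": "btn-item download"})
    if block is None:
        # The page has no download block (unexpected layout)
        return None
    anchor = block.find("a", href=True)
    return anchor["href"] if anchor else None

link = extract_download_link(SAMPLE_HTML)
```

Returning None for a missing block lets the caller tell "no link found" apart from a network failure, which the single except clause in get_links does not.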


The download_file function takes a country and a URL, downloads the zip file from that URL, unzips its contents, and moves the file we need to a folder called 'Country_Data'.

In [49]:
def download_file(country, url):
    # Parse the World Bank code for the country from within the download link
    code = url[39:42]
    # Put the country's name into a form suitable for a file name
    country = country.replace(" ","_").replace("*","").replace(",","").replace(".","")
    zip_path = "../../data/raw/Zips/{}.zip".format(code)

    # If the zip file doesn't already exist, download it from the supplied link
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
        try:
            # Unzip the archive next to the other raw data so the path below matches
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall("../../data/raw/Unzipped")
            # Move the wanted .csv into Country_Data under the country's name
            fileName = "../../data/raw/Unzipped/API_{}_DS2_en_csv_v2.csv".format(code)
            os.rename(fileName, "../../data/raw/Country_Data/"+country+'.csv')
            print("Created - ../../data/raw/Country_Data/"+country+'.csv')
        except Exception as e:
            print(country+" - ERROR: "+str(e))
    else:
        print("Error - File:"+zip_path+" already exists.")
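As a rough sketch of the unzip-and-pick step, the example below builds a throwaway archive in a temporary directory and copies out only the wanted member. The helpers `code_from_url` and `extract_country_csv` are hypothetical names, and the archive contents are placeholders rather than real World Bank data. Reading the code from the URL path is shown as one alternative to the fixed slice `url[39:42]`, assuming the code is always the last path segment.

```python
import os
import tempfile
import zipfile
from urllib.parse import urlparse

def code_from_url(url):
    """Take the last path segment of the download URL as the country code."""
    return urlparse(url).path.rstrip("/").split("/")[-1]

def extract_country_csv(zip_path, code, out_path):
    """Copy API_<code>_DS2_en_csv_v2.csv out of the archive to out_path."""
    wanted = "API_{}_DS2_en_csv_v2.csv".format(code)
    with zipfile.ZipFile(zip_path) as archive:
        if wanted not in archive.namelist():
            return False
        with archive.open(wanted) as src, open(out_path, "wb") as dst:
            dst.write(src.read())
    return True

# Demonstration on a throwaway archive with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    code = code_from_url("http://api.worldbank.org/v2/en/country/AFG?downloadformat=csv")
    zip_path = os.path.join(tmp, code + ".zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("API_{}_DS2_en_csv_v2.csv".format(code), "placeholder,data\n")
        archive.writestr("other_file.csv", "placeholder\n")
    out_path = os.path.join(tmp, "Afghanistan.csv")
    extracted = extract_country_csv(zip_path, code, out_path)
    with open(out_path) as handle:
        contents = handle.read()
```

Copying a single member avoids leaving the other files of each archive behind in an intermediate folder.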
        

A dictionary is created which holds the download link for each given country within the dataset

In [21]:
download_links = dict()
for country in df['Country']:
    country = country.replace("*","")
    if  ' ' in country:
        country = country.replace(' ','-')
    download_links[country] = get_links(country)

http://api.worldbank.org/v2/en/country/AFG?downloadformat=csv
http://api.worldbank.org/v2/en/country/ALB?downloadformat=csv
http://api.worldbank.org/v2/en/country/DZA?downloadformat=csv
http://api.worldbank.org/v2/en/country/ASM?downloadformat=csv
http://api.worldbank.org/v2/en/country/AND?downloadformat=csv
http://api.worldbank.org/v2/en/country/AGO?downloadformat=csv
http://api.worldbank.org/v2/en/country/ATG?downloadformat=csv
http://api.worldbank.org/v2/en/country/ARG?downloadformat=csv
http://api.worldbank.org/v2/en/country/ARM?downloadformat=csv
http://api.worldbank.org/v2/en/country/ABW?downloadformat=csv
http://api.worldbank.org/v2/en/country/AUS?downloadformat=csv
http://api.worldbank.org/v2/en/country/AUT?downloadformat=csv
http://api.worldbank.org/v2/en/country/AZE?downloadformat=csv
http://api.worldbank.org/v2/en/country/BHS?downloadformat=csv
http://api.worldbank.org/v2/en/country/BHR?downloadformat=csv
http://api.worldbank.org/v2/en/country/BGD?downloadformat=csv
http://a

http://api.worldbank.org/v2/en/country/OMN?downloadformat=csv
http://api.worldbank.org/v2/en/country/PAK?downloadformat=csv
http://api.worldbank.org/v2/en/country/PLW?downloadformat=csv
http://api.worldbank.org/v2/en/country/PAN?downloadformat=csv
http://api.worldbank.org/v2/en/country/PNG?downloadformat=csv
http://api.worldbank.org/v2/en/country/PRY?downloadformat=csv
http://api.worldbank.org/v2/en/country/PER?downloadformat=csv
http://api.worldbank.org/v2/en/country/PHL?downloadformat=csv
http://api.worldbank.org/v2/en/country/POL?downloadformat=csv
http://api.worldbank.org/v2/en/country/PRT?downloadformat=csv
http://api.worldbank.org/v2/en/country/PRI?downloadformat=csv
http://api.worldbank.org/v2/en/country/QAT?downloadformat=csv
http://api.worldbank.org/v2/en/country/ROU?downloadformat=csv
http://api.worldbank.org/v2/en/country/RUS?downloadformat=csv
http://api.worldbank.org/v2/en/country/RWA?downloadformat=csv
http://api.worldbank.org/v2/en/country/KNA?downloadformat=csv
http://a

The contents of this dictionary are then put into a new column within the dataframe

In [33]:
df["Links"] = pd.Series(l for l in download_links.values())
df

Unnamed: 0,Country,Code,Population,GDP per Capita,Links
0,Afghanistan,AFG,32526562.0,594.323081,http://api.worldbank.org/v2/en/country/AFG?dow...
1,Albania,ALB,2889167.0,3945.217582,http://api.worldbank.org/v2/en/country/ALB?dow...
2,Algeria,ALG,39666519.0,4206.031232,http://api.worldbank.org/v2/en/country/DZA?dow...
3,American Samoa*,ASA,55538.0,,http://api.worldbank.org/v2/en/country/ASM?dow...
4,Andorra,AND,70473.0,,http://api.worldbank.org/v2/en/country/AND?dow...
5,Angola,ANG,25021974.0,4101.472152,http://api.worldbank.org/v2/en/country/AGO?dow...
6,Antigua and Barbuda,ANT,91818.0,13714.731960,http://api.worldbank.org/v2/en/country/ATG?dow...
7,Argentina,ARG,43416755.0,13431.878340,http://api.worldbank.org/v2/en/country/ARG?dow...
8,Armenia,ARM,3017712.0,3489.127690,http://api.worldbank.org/v2/en/country/ARM?dow...
9,Aruba*,ARU,103889.0,,http://api.worldbank.org/v2/en/country/ABW?dow...
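Assigning the dictionary values by position relies on the dictionary and the dataframe having the same order and length. A minimal sketch of a keyed alternative follows, assuming the dictionary keys were built as in the loop above (asterisks removed, spaces replaced by hyphens). The names `to_url_name`, `df_demo` and `links_demo` are illustrative only, and "Atlantis" is a made-up row with no link.

```python
import pandas as pd

def to_url_name(country):
    """Apply the same clean-up used when building download_links."""
    return country.replace("*", "").replace(" ", "-")

# Small stand-ins for df and download_links.
df_demo = pd.DataFrame({"Country": ["Afghanistan", "American Samoa*", "Atlantis"]})
links_demo = {
    "Afghanistan": "http://api.worldbank.org/v2/en/country/AFG?downloadformat=csv",
    "American-Samoa": "http://api.worldbank.org/v2/en/country/ASM?downloadformat=csv",
}

# Look each country up by its cleaned name; unknown countries get a missing value.
df_demo["Links"] = df_demo["Country"].map(lambda name: links_demo.get(to_url_name(name)))
```

With this approach a country whose page could not be found ends up with a missing link in its own row instead of shifting the links of later rows.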


Finally, for every row in the dataframe, the download_file function is called, downloading the data for every country.

In [50]:
df[['Country','Links','Code']].apply(lambda row: download_file(row['Country'], row['Links']), axis=1)

Created - Test/Afghanistan.csv
Created - Test/Albania.csv
Created - Test/Algeria.csv
Created - Test/American_Samoa.csv
Created - Test/Andorra.csv
Created - Test/Angola.csv
Created - Test/Antigua_and_Barbuda.csv
Created - Test/Argentina.csv
Created - Test/Armenia.csv
Created - Test/Aruba.csv
Created - Test/Australia.csv
Created - Test/Austria.csv
Created - Test/Azerbaijan.csv
Created - Test/Bahamas.csv
Created - Test/Bahrain.csv
Created - Test/Bangladesh.csv
Created - Test/Barbados.csv
Created - Test/Belarus.csv
Created - Test/Belgium.csv
Created - Test/Belize.csv
Created - Test/Bermuda.csv
Created - Test/Benin.csv
Created - Test/Bhutan.csv
Created - Test/Bolivia.csv
Created - Test/Bosnia_and_Herzegovina.csv
Created - Test/Botswana.csv
Created - Test/Brazil.csv
Created - Test/British_Virgin_Islands.csv
Created - Test/Brunei_Darussalam.csv
Created - Test/Bulgaria.csv
Created - Test/Burkina_Faso.csv
Created - Test/Burundi.csv
Created - Test/Cambodia.csv
Created - Test/Cameroon.csv
Created

0      None
1      None
2      None
3      None
4      None
5      None
6      None
7      None
8      None
9      None
10     None
11     None
12     None
13     None
14     None
15     None
16     None
17     None
18     None
19     None
20     None
21     None
22     None
23     None
24     None
25     None
26     None
27     None
28     None
29     None
       ... 
167    None
168    None
169    None
170    None
171    None
172    None
173    None
174    None
175    None
176    None
177    None
178    None
179    None
180    None
181    None
182    None
183    None
184    None
185    None
186    None
187    None
188    None
189    None
190    None
191    None
192    None
193    None
194    None
195    None
196    None
Length: 197, dtype: object
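Because download_file returns nothing, `apply` produces a Series of None values. One possible alternative, sketched below under the assumption that a short tally is more useful than that Series, is a plain loop that counts outcomes. `run_downloads` and `fake_download` are hypothetical names, and the rows are stand-ins used only to exercise the loop.

```python
from collections import Counter

def run_downloads(rows, download):
    """Call download(country, url) for each row and tally the outcomes."""
    outcomes = Counter()
    for country, url in rows:
        if not url:
            # No link was found for this country, so there is nothing to fetch
            outcomes["skipped"] += 1
            continue
        try:
            download(country, url)
            outcomes["ok"] += 1
        except Exception:
            outcomes["failed"] += 1
    return outcomes

# Stand-in downloader: fails for one URL to show the "failed" branch.
def fake_download(country, url):
    if "bad" in url:
        raise ValueError("simulated failure")

rows = [
    ("Afghanistan", "http://example.invalid/AFG"),
    ("Nowhere", None),
    ("Broken", "http://example.invalid/bad"),
]
summary = run_downloads(rows, fake_download)
```

With the real dataframe, the rows could come from `zip(df['Country'], df['Links'])` and the downloader would be download_file; missing links stored as NaN would need an explicit `pd.isna` check.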