# Lede Summer 2019 Project - Part 6
## Get country names, latitude and longitude

* Download a list of IOC country codes/country names from here:

https://github.com/johnashu/datacamp/blob/master/medals/Summer%20Olympic%20medalists%201896%20to%202008%20-%20IOC%20COUNTRY%20CODES.csv

* Get this table, which has latitude/longitude for all countries (it is an HTML page, so it is scraped below rather than downloaded as a file)

https://developers.google.com/public-data/docs/canonical/countries_csv

* Join the two dataframes

* Merge the dataframe of country names and long/lat with my main dataframe of athlete info

In [1]:
import requests
import pandas as pd
import numpy as np
import os

import itertools

from bs4 import BeautifulSoup
from dotenv import load_dotenv
load_dotenv()

import time

## Manually edit athletes with countries/country codes that don't match the current IOC codes
* replace FRG with GER
* replace RHO with ZIM so I can get the coordinates for ZIM (later, in all_info.csv, turn ZIM back into RHO). Rhodesia was an unrecognised state in southern Africa from 1965 to 1979, equivalent in territory to modern Zimbabwe.
* replace GDR with GER (GDR = East Germany)
* replace URS with RUS or UKR, athlete by athlete:

        Nikolay ANDRIANOV >> RUS
        Boris SHAKHLIN >> RUS
        Viktor CHUKARIN >> UKR
        Aleksandr DITYATIN >> RUS
        Larisa LATYNINA >> UKR
        Polina ASTAKHOVA >> UKR
        Galina KULAKOVA >> RUS

* replace TCH with CZE (Czech Republic)
* replace NPA with RUS (NPA stands for Neutral Paralympic Athletes)
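
The edits above were made by hand. As a rough sketch, the same recoding could be scripted so it is repeatable. This assumes the athlete table has `citizenship` and `full_name` columns (as in the output further down); the names `CODE_FIXES`, `URS_BY_ATHLETE` and `recode_citizenship` are made up for this example, and the sample rows are illustrative.

```python
import pandas as pd

# Historical codes mapped to the current code whose coordinates are wanted.
CODE_FIXES = {"FRG": "GER", "GDR": "GER", "RHO": "ZIM", "TCH": "CZE", "NPA": "RUS"}

# URS athletes are assigned one by one, following the list above.
URS_BY_ATHLETE = {
    "Nikolay ANDRIANOV": "RUS",
    "Boris SHAKHLIN": "RUS",
    "Viktor CHUKARIN": "UKR",
    "Aleksandr DITYATIN": "RUS",
    "Larisa LATYNINA": "UKR",
    "Polina ASTAKHOVA": "UKR",
    "Galina KULAKOVA": "RUS",
}


def recode_citizenship(df):
    """Return a copy of df with historical country codes replaced."""
    out = df.copy()
    is_urs = out["citizenship"] == "URS"
    # Look up URS athletes by name; keep URS if the name is not in the list.
    out.loc[is_urs, "citizenship"] = (
        out.loc[is_urs, "full_name"].map(URS_BY_ATHLETE).fillna("URS")
    )
    # Apply the one-to-one code replacements.
    out["citizenship"] = out["citizenship"].replace(CODE_FIXES)
    return out


# Illustrative rows only; the last athlete is a placeholder, not real data.
sample = pd.DataFrame({
    "full_name": ["Larisa LATYNINA", "Nikolay ANDRIANOV", "Example ATHLETE"],
    "citizenship": ["URS", "URS", "FRG"],
})
recoded = recode_citizenship(sample)
print(recoded)
```

Keeping the per-athlete URS assignments in a dictionary makes it easy to review them against the list and to turn ZIM back into RHO later.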

In [2]:
df = pd.read_csv('medal_info_cleaned.csv')

In [3]:
df.citizenship.value_counts(dropna=False)

USA    23
GER    22
RUS    14
NOR    13
FIN     9
ITA     6
GBR     6
SWE     5
CAN     5
UKR     5
FRA     5
JPN     4
SUI     4
HUN     2
ISR     2
CZE     2
RSA     2
NED     2
BLR     2
AUS     2
NZL     1
DEN     1
BRA     1
ESP     1
KOR     1
POL     1
KAZ     1
AUT     1
SVK     1
ZIM     1
Name: citizenship, dtype: int64

## The main dataframe, with athlete information, has 145 rows

In [4]:
df.shape

(145, 15)

In [5]:
athlete_countries = df.citizenship.value_counts().index.to_list()
# athlete_countries

## The dataframe with IOC codes has 201 rows

In [6]:
df_ioc = pd.read_csv('ioc_country_codes.csv')

In [7]:
df_ioc.shape

(201, 3)

In [8]:
df_ioc.head(3)

Unnamed: 0,Country,NOC,ISO code
0,Afghanistan,AFG,AF
1,Albania,ALB,AL
2,Algeria,ALG,DZ
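
Before merging, it may help to check that every athlete country code actually appears in the IOC table. A minimal sketch, assuming the code column is called `NOC` as shown above; `missing_codes` is a hypothetical helper and the sample data is illustrative.

```python
import pandas as pd


def missing_codes(athlete_codes, ioc_table, code_column="NOC"):
    """Return athlete country codes that do not appear in the IOC table."""
    known = set(ioc_table[code_column].dropna())
    return sorted(set(athlete_codes) - known)


# Illustrative subset of the IOC table shown above.
ioc_sample = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "NOC": ["AFG", "ALB", "ALG"],
    "ISO code": ["AF", "AL", "DZ"],
})

unmatched = missing_codes(["ALG", "URS", "AFG"], ioc_sample)
print(unmatched)
```

With the real data, this would be called as `missing_codes(athlete_countries, df_ioc)`; an empty list means every athlete can be matched.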


In [9]:
url = 'https://developers.google.com/public-data/docs/canonical/countries_csv'
raw_html = requests.get(url)
doc = BeautifulSoup(raw_html.content, "html.parser")
doc



In [10]:
countries = doc.find_all('tr')
rows = []

for country in countries[1:]:
    row = {}
    info = country.find_all('td')
    row['code'] = info[0].text.strip()
    row['latitude'] = info[1].text.strip()
    row['longitude'] = info[2].text.strip()
    row['country_name'] = info[3].text.strip()
    rows.append(row)

## The dataframe with coordinates for each country has 245 rows

In [11]:
df_coord = pd.DataFrame(rows)

In [12]:
df_coord.shape

(245, 4)

In [13]:
df_coord.head(3)

Unnamed: 0,code,country_name,latitude,longitude
0,AD,Andorra,42.546245,1.601554
1,AE,United Arab Emirates,23.424076,53.847818
2,AF,Afghanistan,33.93911,67.709953
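
Because the coordinates were scraped with `.text.strip()`, the `latitude` and `longitude` columns hold strings, not numbers. A minimal sketch of converting them, using `pd.to_numeric`; `coords_to_numeric` is a hypothetical helper and the sample rows are copied from the output above.

```python
import pandas as pd


def coords_to_numeric(df, columns=("latitude", "longitude")):
    """Return a copy of df with the coordinate columns converted to floats."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns anything unparseable into NaN instead of raising.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


# Scraped values arrive as text, as in df_coord.
coord_sample = pd.DataFrame({
    "code": ["AD", "AE"],
    "country_name": ["Andorra", "United Arab Emirates"],
    "latitude": ["42.546245", "23.424076"],
    "longitude": ["1.601554", "53.847818"],
})

numeric = coords_to_numeric(coord_sample)
print(numeric.dtypes)
```

Numeric coordinates matter later for sorting and for any mapping library that expects floats.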


In [14]:
df_ioc = pd.read_csv('ioc_country_codes.csv')
df_ioc.head(3)

Unnamed: 0,Country,NOC,ISO code
0,Afghanistan,AFG,AF
1,Albania,ALB,AL
2,Algeria,ALG,DZ


## Merge the table of coordinates with the table of country codes, and save it as a csv, just in case
The merged dataframe, with coordinates and codes for each country, has 200 rows. The default inner merge keeps only countries that appear in both tables.

In [15]:
merged = df_coord.merge(df_ioc, left_on='code', right_on='ISO code')

In [16]:
merged.shape

(200, 7)
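
To see which countries fall out of the inner merge, an outer merge with `indicator=True` labels every row by where it came from. The sketch below uses illustrative rows, not the real files. It also reads the IOC sample with `keep_default_na=False`, on the assumption that the IOC file contains Namibia: its ISO code is `NA`, which `read_csv` treats as a missing value by default, so that row could silently fail to match.

```python
import io

import pandas as pd

# Illustrative IOC rows; keep_default_na=False keeps "NA" as a string.
ioc_csv = io.StringIO("Country,NOC,ISO code\nAndorra,AND,AD\nNamibia,NAM,NA\n")
ioc_sample = pd.read_csv(ioc_csv, keep_default_na=False)

# Illustrative coordinate rows; Antarctica has no match in the IOC sample.
coord_sample = pd.DataFrame({
    "code": ["AD", "NA", "AQ"],
    "country_name": ["Andorra", "Namibia", "Antarctica"],
})

# The _merge column is "both", "left_only" or "right_only" for each row.
audit = coord_sample.merge(
    ioc_sample, how="outer", left_on="code", right_on="ISO code", indicator=True
)
unmatched = audit.loc[audit["_merge"] != "both", ["code", "ISO code", "_merge"]]
print(unmatched)
```

Rows marked `left_only` exist only in the coordinate table, and rows marked `right_only` exist only in the IOC table.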

In [17]:
merged.to_csv('countries_coord_codes.csv', index=False)

In [18]:
merged.head(1)

Unnamed: 0,code,country_name,latitude,longitude,Country,NOC,ISO code
0,AD,Andorra,42.546245,1.601554,Andorra,AND,AD


In [20]:
df.citizenship.value_counts()

USA    23
GER    22
RUS    14
NOR    13
FIN     9
ITA     6
GBR     6
SWE     5
CAN     5
UKR     5
FRA     5
JPN     4
SUI     4
HUN     2
ISR     2
CZE     2
RSA     2
NED     2
BLR     2
AUS     2
NZL     1
DEN     1
BRA     1
ESP     1
KOR     1
POL     1
KAZ     1
AUT     1
SVK     1
ZIM     1
Name: citizenship, dtype: int64

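## Merge the country coordinate/code dataframe with the dataframe with all athlete info
Using `how='left'` keeps every athlete row, so the result still has 145 rows. A default inner merge would drop any athlete whose country code has no match.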

In [21]:
df_all_info = df.merge(merged, how='left', left_on='citizenship', right_on='NOC')

In [22]:
df_all_info.shape

(145, 22)
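
A left merge keeps the row count, but it does not guarantee that every athlete received coordinates. One way to check, sketched here with illustrative rows: pass `validate="many_to_one"` so pandas raises an error if a country code is duplicated on the right side, then list the rows whose `latitude` is missing. The third athlete is a placeholder, not real data.

```python
import pandas as pd

athletes = pd.DataFrame({
    "full_name": ["Daniel DIAS", "Michael MILTON", "Example ATHLETE"],
    "citizenship": ["BRA", "AUS", "URS"],
})
countries = pd.DataFrame({
    "NOC": ["BRA", "AUS"],
    "latitude": ["-14.235004", "-25.274398"],
    "longitude": ["-51.92528", "133.775136"],
})

# validate="many_to_one" raises MergeError if NOC is not unique in countries.
merged_info = athletes.merge(
    countries, how="left", left_on="citizenship", right_on="NOC",
    validate="many_to_one",
)

# Athletes whose code found no match end up with NaN coordinates.
no_coords = merged_info[merged_info["latitude"].isna()]
print(no_coords[["full_name", "citizenship"]])
```

This check needs to run before any `fillna('')`, since missing coordinates can no longer be found with `isna()` afterwards.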

## Clean up the new dataframe of all info by removing extraneous columns

In [23]:
df_all_info = df_all_info.drop(columns=['ISO code', 'Country'])

In [24]:
df_all_info.citizenship.value_counts(dropna=False)

USA    23
GER    22
RUS    14
NOR    13
FIN     9
ITA     6
GBR     6
SWE     5
CAN     5
UKR     5
FRA     5
JPN     4
SUI     4
HUN     2
ISR     2
CZE     2
RSA     2
NED     2
BLR     2
AUS     2
NZL     1
DEN     1
BRA     1
ESP     1
KOR     1
POL     1
KAZ     1
AUT     1
SVK     1
ZIM     1
Name: citizenship, dtype: int64

In [25]:
df_all_info = df_all_info.fillna('')

In [26]:
df_all_info.sort_values('latitude', na_position='first')

Unnamed: 0,alternate_name,citizenship,event,first_name,full_name,game_type,gender,last_name,medals_bronze,medals_gold,medals_silver,medals_total,other_info,season,years,code,country_name,latitude,longitude,NOC
3,,BRA,Para swimming,Daniel,Daniel DIAS,Paralympic,Men,DIAS,3,14,7,24,,Summer,2008-2016,BR,Brazil,-14.235004,-51.92528,BRA
32,,ZIM,Para archery | dartchery | swimming (Summer Ol...,Margaret,Margaret HARRIMAN,Paralympic,Women,HARRIMAN,4,11,2,17,Also competed representing South Africa,Summer,1960-1996,ZW,Zimbabwe,-19.015438,29.154857,ZIM
57,,AUS,Para alpine skiing | cycling,Michael,Michael MILTON,Paralympic,Men,MILTON,2,6,3,11,,Winter,1992-2006,AU,Australia,-25.274398,133.775136,AUS
7,,AUS,Para swimming,Matthew,Matthew COWDREY,Paralympic,Men,COWDREY,3,13,7,23,,Summer,2004-2012,AU,Australia,-25.274398,133.775136,AUS
29,,RSA,Para swimming,Natalie,Natalie DU TOIT,Paralympic,Women,DU TOIT,0,13,2,15,,Summer,2004-2012,ZA,South Africa,-30.559482,22.937506,RSA
33,,RSA,Para archery | dartchery | lawn bowls,Margaret,Margaret HARRIMAN,Paralympic,Women,HARRIMAN,4,11,2,17,Also competed representing Rhodesia,Summer,1960-1996,ZA,South Africa,-30.559482,22.937506,RSA
39,,NZL,Para swimming,Sophie,Sophie PASCOE,Paralympic,Women,PASCOE,0,9,6,15,,Summer,2008-2016,NZ,New Zealand,-40.900557,174.885971,NZL
24,,ISR,Para athletics | wheelchair basketball | swimm...,Zipora,Zipora RUBIN-ROSENBAUM,Paralympic,Women,RUBIN-ROSENBAUM,6,14,6,26,,Summer,1964-1988,IL,Israel,31.046051,34.851612,ISR
12,,ISR,Para swimming,Uri,Uri BERGMAN,Paralympic,Men,BERGMAN,1,12,1,14,,Summer,1976-1988,IL,Israel,31.046051,34.851612,ISR
105,Hyun-Soo Ahn,KOR,short track,AN,Victor AN,Olympic,Men,Victor,2,6,0,8,Also competed for Russia,Winter,2006-2014,KR,South Korea,35.907757,127.766922,KOR
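
The order above is text order, not numeric order: `-14.235004` comes before `-40.900557` because `latitude` holds strings. A minimal sketch of sorting numerically without changing the stored values, using the `key` argument of `sort_values` (available in pandas 1.1 and later); the sample rows are illustrative.

```python
import pandas as pd

# Latitudes as strings, with one blank left behind by fillna('').
sample = pd.DataFrame({
    "citizenship": ["BRA", "NZL", "ISR", "XXX"],
    "latitude": ["-14.235004", "-40.900557", "31.046051", ""],
})

# Plain sort compares strings character by character.
as_text = sample.sort_values("latitude")

# The key converts values to floats for comparison only; blanks become NaN,
# so na_position="first" puts rows without coordinates at the top.
as_numbers = sample.sort_values(
    "latitude",
    key=lambda s: pd.to_numeric(s, errors="coerce"),
    na_position="first",
)
print(as_numbers)
```

In the numeric version New Zealand sorts ahead of Brazil, which is what a south-to-north ordering would need.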


## Save the new dataframe as a csv

In [27]:
df_all_info.to_csv('athletes_with_coord.csv', index=False)