# Instructions

This programming test is about the Olympic Games. To solve the tasks you need the following three data files:

- **summer.csv** contains information on the results at Olympic Summer Games in the years 2016 and 2020
    - `year`: year of the Olympic Games (e.g. 2016)
    - `location`: host city of the Olympic Games (e.g. Rio de Janeiro)
    - `sport`: e.g. Athletics
    - `discipline`: a sport may comprise multiple disciplines (e.g. 100 metres, Men)
    - `athlete_id`: unique identifier for an athlete (e.g. 105512)
    - `country`: country the athlete competes for (e.g. Jamaica)
    - `position`: position in the final ranking, if available (e.g. 1)
    - `medal`: medal won, if any (e.g. Gold)
    - `height`: registered height in meters (e.g. 1.96)
    - `weight`: registered weight in kilograms (e.g. 95.0)
- **winter.csv** contains the same information on the results at Winter Olympic Games in the years 2014 and 2018
- **athletes.txt** contains additional information on the athletes
    - `id`: unique identifier for an athlete (e.g. 105512)
    - `name`: full name of the athlete (e.g. Bolt, Usain)
    - `sex`: registered sex of the athlete (e.g. Male)

**Note**: If you do not manage to prepare the full data set in exercises 2 and 3, then you may use the file **backup.csv** in exercises 4 and 5.

# Exercise 1 (10 points)

Import the required packages here.

In [2]:
import pandas as pd

Read in the data set `summer.csv`. Print the number of rows and columns and display the first five rows. 

Read in the data set `winter.csv`. Print the number of rows and columns and display the first five rows. 
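The two reading tasks above follow the same pattern, so one sketch covers both. Because the real files are not available here, the snippet reads a hypothetical two-row stand-in built from rows shown elsewhere in this notebook; `sample_csv` and `df_sample` are illustrative names, and the real file layout (for example, whether it carries an extra index column) is an assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for summer.csv: two rows copied from the sample output
# in this notebook, so the snippet runs without the real file.
sample_csv = io.StringIO(
    "year,location,sport,discipline,athlete_id,country,position,medal,height,weight\n"
    '2016,Rio de Janeiro,Archery,"Individual, Men",134917,Republic of Korea,1,Gold,1.81,84.0\n'
    '2016,Rio de Janeiro,Archery,"Individual, Men",135226,Netherlands,4,,1.83,75.0\n'
)

# With the real data this would be pd.read_csv('summer.csv') or pd.read_csv('winter.csv').
df_sample = pd.read_csv(sample_csv)

n_rows, n_cols = df_sample.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
print(df_sample.head())  # head() returns the first five rows by default
```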

# Exercise 2 (20 points) 
Combine the two data sets (summer and winter) suitably into one large data set containing the results of all Olympic Games. Then print the number of rows and columns of this combined data set.

In [12]:
df_summer = pd.read_csv('summer.csv')
df_winter = pd.read_csv('winter.csv')

df_results = pd.concat(objs=[df_summer,df_winter])
print(df_results.shape)
df_results.head()

(37823, 10)


| | year | location | sport | discipline | athlete_id | country | position | medal | height | weight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | Rio de Janeiro | Archery | Individual, Men | 134917 | Republic of Korea | 1.0 | Gold | 1.81 | 84.0 |
| 1 | 2016 | Rio de Janeiro | Archery | Individual, Men | 111548 | France | 2.0 | Silver | 1.8 | 83.0 |
| 2 | 2016 | Rio de Janeiro | Archery | Individual, Men | 111530 | United States | 3.0 | Bronze | 1.81 | 86.0 |
| 3 | 2016 | Rio de Janeiro | Archery | Individual, Men | 135226 | Netherlands | 4.0 | | 1.83 | 75.0 |
| 4 | 2016 | Rio de Janeiro | Archery | Individual, Men | 121560 | Australia | 5.0 | | 1.74 | 60.0 |
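One detail of `pd.concat` is worth checking when stacking two frames: by default each frame keeps its own row labels, so the combined index contains duplicates. The sketch below uses two tiny made-up frames (`summer` and `winter` here are hypothetical, not the real files) to show the effect of `ignore_index=True`.

```python
import pandas as pd

# Tiny made-up frames standing in for the summer and winter results.
summer = pd.DataFrame({"year": [2016, 2020], "sport": ["Archery", "Archery"]})
winter = pd.DataFrame({"year": [2014, 2018], "sport": ["Biathlon", "Biathlon"]})

# Default behaviour: each frame keeps its original row labels.
stacked = pd.concat([summer, winter])
print(stacked.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1.
combined = pd.concat([summer, winter], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
print(combined.shape)           # (4, 2)
```

Duplicate labels do not affect the row and column counts, but they can surprise later label-based selections such as `.loc[0]`.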


How many missing values does the column `height` have?

In [7]:
df_results.height.isna().sum()

np.int64(8225)

How many different athletes are contained in the data set?  
**Note**: A given athlete may appear multiple times in the data set if he or she participates in multiple Olympic Games, sports or disciplines. Count each athlete only once. 

In [17]:
df_results.athlete_id.drop_duplicates().notna().sum()

np.int64(23048)
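The same count can be obtained with `Series.nunique()`, which ignores missing values by default. A minimal sketch on a made-up series of ids (the values are illustrative only):

```python
import pandas as pd

# Made-up ids: two athletes appear twice, one entry is missing.
ids = pd.Series([134917, 134917, 111548, None, 127808, 127808])

# Approach used above: drop duplicates, then count the non-missing entries.
via_drop_duplicates = ids.drop_duplicates().notna().sum()

# Equivalent shortcut: nunique() skips NaN unless dropna=False is passed.
via_nunique = ids.nunique()

print(via_drop_duplicates, via_nunique)  # 3 3
```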

# Exercise 3 (30 points)

Read in the file `athletes.txt` and print the first five rows.

In [19]:
df_athletes = pd.read_table('athletes.txt',delimiter=';')
df_athletes.head(5)

| | id | name | sex |
|---|---|---|---|
| 0 | 7 | Chila, Patrick | Male |
| 1 | 15 | Éloi, Damien | Male |
| 2 | 18 | Gatien, Jean-Philippe | Male |
| 3 | 27 | Legoût, Christophe | Male |
| 4 | 35 | Santoro, Fabrice | Male |


Merge the athletes data set with the results data set from above **using an outer merge**. Then report on the sources of your merged dataset:

- how many rows contain information from both data sources?
- how many rows contain only information from the athletes data set?
- how many rows contain only information from the results data set?

In [30]:
df = pd.merge(left=df_results, left_on='athlete_id', right=df_athletes, right_on='id', how='outer')
print(df[(df.athlete_id.notna())&(df.id.notna())].shape[0])
print(df[(df.athlete_id.isna())&(df.id.notna())].shape[0])
print(df[(df.athlete_id.notna())&(df.id.isna())].shape[0])
print(df.shape[0])

37823
31015
0
68838
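An alternative to testing the key columns for missing values is the `indicator=True` option of `merge`, which adds a `_merge` column labelling each row as `both`, `left_only` or `right_only`. The sketch below uses small made-up frames (`results` and `athletes` are hypothetical stand-ins for `df_results` and `df_athletes`).

```python
import pandas as pd

# Made-up stand-ins: athlete 3 has no result rows, athlete 1 has two.
results = pd.DataFrame({"athlete_id": [1, 1, 2],
                        "sport": ["Archery", "Archery", "Biathlon"]})
athletes = pd.DataFrame({"id": [1, 2, 3],
                         "name": ["A, Ann", "B, Bob", "C, Cid"]})

merged = results.merge(athletes, left_on="athlete_id", right_on="id",
                       how="outer", indicator=True)

# _merge is categorical with the categories left_only, right_only and both.
n_both = (merged["_merge"] == "both").sum()
n_results_only = (merged["_merge"] == "left_only").sum()
n_athletes_only = (merged["_merge"] == "right_only").sum()
print(n_both, n_athletes_only, n_results_only)  # 3 1 0
```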


Explain (in the following markdown cell) whether the outer merge or some other merge type is the appropriate choice here.  
**Note:** in the following exercises we want to analyse the results of the Olympic Games between 2014 and 2020, considering also aspects such as the sex and names of the participants.

If applicable, carry out the merge again using a more appropriate merge type. Print the number of rows and columns of the merged data set.

In [31]:
df = pd.merge(left=df_results, left_on='athlete_id', right=df_athletes, right_on='id', how='left')
print(df.shape)  # the task asks for the number of rows and columns
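A left merge keeps every result row and attaches athlete details where an id matches, so athletes without any result are dropped. The sketch below illustrates this on made-up frames and additionally passes `validate="many_to_one"`, an optional safeguard (not part of the cell above) that raises an error if an id occurs more than once in the athletes table.

```python
import pandas as pd

# Made-up stand-ins for df_results and df_athletes.
results = pd.DataFrame({"athlete_id": [1, 1, 2],
                        "position": [1, 4, 2]})
athletes = pd.DataFrame({"id": [1, 2, 3],
                         "sex": ["Male", "Female", "Male"]})

# how="left": one output row per result row; athlete 3 (no results) is not included.
# validate="many_to_one": fails loudly if `id` is not unique in `athletes`.
left = results.merge(athletes, left_on="athlete_id", right_on="id",
                     how="left", validate="many_to_one")

print(left.shape)  # (3, 4)
```

With unique ids on the right-hand side, the row count of the merged frame equals the row count of the results frame.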

# Exercise 4 (20 points)

Create a new column `bmi` that contains the athlete’s body mass index (BMI). The BMI is defined as weight (in kg) divided by the square of the body height (in m). Note that a BMI of 18.5 to 25 is considered normal.

In [36]:
df['bmi'] = df.weight / df.height**2
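The computation is element-wise, and a missing height or weight propagates to a missing BMI. A minimal sketch on a made-up frame (`toy` is a hypothetical name; the first row reuses a height and weight shown earlier in this notebook):

```python
import pandas as pd

# First row: values from the sample output above; second row: missing height.
toy = pd.DataFrame({"weight": [84.0, 75.0],
                    "height": [1.81, None]})

# BMI = weight [kg] / height [m] squared, computed row by row.
toy["bmi"] = toy.weight / toy.height**2

print(toy)
```

The first row evaluates to roughly 25.64, and the second stays `NaN` because its height is missing.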

For the male athletes, calculate the average body mass index per `sport`. Display the 5 sports with the lowest average BMI.  
**Note**: you can ignore missing values for this task.

In [35]:
avg_bmi = df[df.sex=='Male'].groupby('sport').bmi.mean().nsmallest(5)  # average the bmi column (not weight); mean() skips NaN
print(f"The sports are {', '.join(avg_bmi.index.to_list())}")
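The filter, group, average and rank steps can be traced on a small made-up frame. The sports and BMI values below are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Invented values: Curling has only a missing BMI for its single male athlete.
toy = pd.DataFrame({
    "sex":   ["Male", "Male", "Male", "Female", "Male"],
    "sport": ["Archery", "Archery", "Biathlon", "Biathlon", "Curling"],
    "bmi":   [20.0, 22.0, 25.0, 18.0, None],
})

lowest = (
    toy[toy.sex == "Male"]   # keep male athletes only
    .groupby("sport").bmi    # one group of BMI values per sport
    .mean()                  # NaN values are skipped within each group
    .nsmallest(2)            # smallest averages first; all-NaN groups are left out
)
print(lowest)
```

Here Archery averages 21.0 and Biathlon 25.0 (the female Biathlon row is excluded by the filter), while Curling has no usable value and does not appear in the ranking.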


A body mass index between 18.5 and 24.9 is considered normal (neither underweight nor overweight). What percentage of rows exhibit a normal body mass index? 

In [48]:
df[(df.bmi >= 18.5) & (df.bmi <= 24.9)].shape[0] / df[df.bmi.notna()].shape[0]

0.7711265140376672
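The same share can be written more compactly with `Series.between`, which includes both bounds by default. The sketch below uses made-up BMI values and, like the cell above, uses only rows with a known BMI as the denominator; multiplying by 100 turns the fraction into a percentage.

```python
import pandas as pd

# Made-up BMI values, including both boundary values and one missing entry.
bmi = pd.Series([17.0, 18.5, 24.9, 27.0, None])

known = bmi.dropna()                 # rows with a known BMI
is_normal = known.between(18.5, 24.9)  # inclusive on both ends by default
share = is_normal.mean()             # mean of booleans = fraction of True values

print(f"{share * 100:.1f} % of rows with a known BMI are in the normal range")
```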

# Exercise 5 (30 points)

The German biathlete Erik Lesser (`athlete_id` = 127808) appears 12 times in the data set, across multiple Olympic Games and disciplines. Print the average position that he achieved.

In [51]:
df.position = pd.to_numeric(df.position, errors='coerce')
df[df.athlete_id==127808].position.mean()

np.float64(12.25)
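`pd.to_numeric(..., errors='coerce')` turns every entry that cannot be parsed as a number into `NaN`, and `mean()` then skips those entries. A minimal sketch; the entry `"DNF"` is a hypothetical example of a non-numeric position and is not taken from the data.

```python
import pandas as pd

# Hypothetical position column stored as text, with one non-numeric entry.
position = pd.Series(["1", "12", "DNF", None])

numeric = pd.to_numeric(position, errors="coerce")  # "DNF" and None become NaN
print(numeric.tolist())  # [1.0, 12.0, nan, nan]
print(numeric.mean())    # 6.5, the average of the two valid positions
```

The average therefore covers only the rows with a valid ranking, not every appearance of the athlete.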

Based on the `name` column create a `firstname` and a `lastname` column and add them permanently to the data set. Display the first five rows of the dataset.

In [55]:
df[['lastname','firstname']] = df.name.str.split(', ',n=1,expand=True)  # names are stored as "Lastname, Firstname"
df.head(5)

| | year | location | sport | discipline | athlete_id | country | position | medal | height | weight | id | name | sex | bmi | lastname | firstname |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | Rio de Janeiro | Archery | Individual, Men | 134917 | Republic of Korea | 1.0 | Gold | 1.81 | 84.0 | 134917 | Bon-Chan, Gu | Male | 25.640243 | Bon-Chan | Gu |
| 1 | 2016 | Rio de Janeiro | Archery | Individual, Men | 111548 | France | 2.0 | Silver | 1.8 | 83.0 | 111548 | Valladont, Jean-Charles | Male | 25.617284 | Valladont | Jean-Charles |
| 2 | 2016 | Rio de Janeiro | Archery | Individual, Men | 111530 | United States | 3.0 | Bronze | 1.81 | 86.0 | 111530 | Ellison, Brady | Male | 26.250725 | Ellison | Brady |
| 3 | 2016 | Rio de Janeiro | Archery | Individual, Men | 135226 | Netherlands | 4.0 | | 1.83 | 75.0 | 135226 | van den Berg, Sjef | Male | 22.395413 | van den Berg | Sjef |
| 4 | 2016 | Rio de Janeiro | Archery | Individual, Men | 121560 | Australia | 5.0 | | 1.74 | 60.0 | 121560 | Worth, Taylor | Male | 19.817677 | Worth | Taylor |


What are the 10 most common first names in Austria, Denmark, Germany, and Switzerland?  
**Note**: create an overall ranking, not one ranking per country

In [64]:
country_list = ['Austria','Denmark','Germany','Switzerland']
df[df.country.isin(country_list)].value_counts('firstname').head(10)
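The split and the ranking can be traced on a small made-up frame, assuming names follow the `Lastname, Firstname` pattern given for `athletes.txt` (e.g. `Bolt, Usain`). All names and countries below are invented for illustration.

```python
import pandas as pd

# Invented names in "Lastname, Firstname" order.
toy = pd.DataFrame({
    "name": ["Müller, Anna", "Hansen, Anna", "Lesser, Erik",
             "Nielsen, Erik", "Baumann, Anna", "Smith, John"],
    "country": ["Germany", "Denmark", "Germany",
                "Denmark", "Switzerland", "United States"],
})

# Split once at the first ", ": the part before it is the last name.
toy[["lastname", "firstname"]] = toy.name.str.split(", ", n=1, expand=True)

country_list = ["Austria", "Denmark", "Germany", "Switzerland"]
ranking = toy[toy.country.isin(country_list)].value_counts("firstname").head(10)
print(ranking)
```

Note that `value_counts` counts rows, so an athlete with many result rows contributes once per row; deduplicating on the athlete id first would count each person once instead.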