# Instructions

This programming test is all about "Olympic Games". To solve all tasks you need the following 3 data files:

- **summer.csv** contains information on the results at Olympic Summer Games in the years 2016 and 2020
    - `year`: year of the Olympic Games (e.g. 2016)
    - `location`: host city of the Olympic Games (e.g. Rio de Janeiro)
    - `sport`: e.g. Athletics
    - `discipline`: a sport may comprise multiple disciplines (e.g. 100 metres, Men)
    - `athlete_id`: unique identifier for an athlete (e.g. 105512)
    - `country`: country for which the athlete starts (e.g. Jamaica)
    - `position`: position in the final ranking, if available (e.g. 1)
    - `medal`: medal won, if any (e.g. Gold)
    - `height`: registered height in meters (e.g. 1.96)
    - `weight`: registered weight in kilograms (e.g. 95.0)
- **winter.csv** contains the same information on the results at Winter Olympic Games in the years 2014 and 2018
- **athletes.txt** contains additional information on the athletes
    - `id`: unique identifier for an athlete (e.g. 105512)
    - `name`: full name of the athlete (e.g. Bolt, Usain)
    - `sex`: registered sex of the athlete (e.g. Male)

**Note**: If you do not manage to prepare the full data set in exercises 2 and 3, then you may use the file **backup.csv** in exercises 4 and 5.

# Exercise 1 (10 points)

Import Packages here

In [1]:
import pandas as pd
import numpy as np

Read in the data set `summer.csv`. Print the number of rows and columns and display the first five rows. 

In [2]:
df_summer = pd.read_csv("summer.csv")
print(f'Number of rows: {df_summer.shape[0]}, columns: {df_summer.shape[1]}')
df_summer.head(5)

Number of rows: 27729, columns: 10


Unnamed: 0,year,location,sport,discipline,athlete_id,country,position,medal,height,weight
0,2016,Rio de Janeiro,Archery,"Individual, Men",134917,Republic of Korea,1.0,Gold,1.81,84.0
1,2016,Rio de Janeiro,Archery,"Individual, Men",111548,France,2.0,Silver,1.8,83.0
2,2016,Rio de Janeiro,Archery,"Individual, Men",111530,United States,3.0,Bronze,1.81,86.0
3,2016,Rio de Janeiro,Archery,"Individual, Men",135226,Netherlands,4.0,,1.83,75.0
4,2016,Rio de Janeiro,Archery,"Individual, Men",121560,Australia,5.0,,1.74,60.0


Read in the data set `winter.csv`. Print the number of rows and columns and display the first five rows.

In [3]:
df_winter = pd.read_csv("winter.csv")
print(f'Number of rows: {df_winter.shape[0]}, columns: {df_winter.shape[1]}')
df_winter.head(5)

Number of rows: 10094, columns: 10


Unnamed: 0,year,location,sport,discipline,athlete_id,country,position,medal,height,weight
0,2014,Sochi,Alpine Skiing,"Combined, Men",118461,Switzerland,1.0,Gold,1.77,77.0
1,2014,Sochi,Alpine Skiing,"Combined, Men",101791,Croatia,2.0,Silver,1.82,88.0
2,2014,Sochi,Alpine Skiing,"Combined, Men",118341,Italy,3.0,Bronze,1.86,90.0
3,2014,Sochi,Alpine Skiing,"Combined, Men",110097,Norway,4.0,,1.85,83.0
4,2014,Sochi,Alpine Skiing,"Combined, Men",128551,Slovakia,5.0,,1.78,80.0


# Exercise 2 (20 points) 
Combine the two data sets (summer and winter) suitably into one large data set containing the results of all Olympic Games. Then print the number of rows and columns of this combined data set.

In [4]:
df_olympics = pd.concat(objs=[df_summer, df_winter], ignore_index=True)
# No de-duplication is applied: the row count below is the sum of both files (27729 + 10094).
print(f'Number of rows: {df_olympics.shape[0]}, columns: {df_olympics.shape[1]}')

Number of rows: 37823, columns: 10
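
When combining files, fully identical rows can be removed with `drop_duplicates()`. The method returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on made-up toy data (not the Olympic files):

```python
import pandas as pd

# Hypothetical toy data with one fully duplicated row
toy = pd.DataFrame({"athlete_id": [1, 1, 2], "medal": ["Gold", "Gold", None]})

toy.drop_duplicates()                  # result is discarded, toy is unchanged
rows_without_assignment = toy.shape[0]

deduplicated = toy.drop_duplicates()   # keep the returned DataFrame
rows_with_assignment = deduplicated.shape[0]

print(rows_without_assignment, rows_with_assignment)
```

Whether de-duplication is appropriate for the results data depends on whether identical rows can legitimately occur, which the task description does not say.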


How many missing values does the column `height` have?

In [5]:
df_olympics.height.isnull().sum()

np.int64(8225)

How many different athletes are contained in the data set?  
**Note**: A given athlete may be present multiple times in the data set if he or she participates in multiple Olympic Games, sports or disciplines. Count each athlete only once. 

In [6]:
df_olympics.athlete_id.nunique()

23048

# Exercise 3 (25 points)

Read in the file `athletes.txt` and print the first five rows.

In [7]:
df_athletes = pd.read_table("athletes.txt", sep=";")
df_athletes.head(5)

Unnamed: 0,id,name,sex
0,7,"Chila, Patrick",Male
1,15,"Éloi, Damien",Male
2,18,"Gatien, Jean-Philippe",Male
3,27,"Legoût, Christophe",Male
4,35,"Santoro, Fabrice",Male


Merge the athletes data set with the results data set from above **using an outer merge**. Then report on the sources of your merged dataset:

- how many rows contain information from both data sources?
- how many rows contain only information from the athletes data set?
- how many rows contain only information from the results data set?

In [8]:
df_merged_outer = pd.merge(left=df_athletes, left_on="id", right=df_olympics, right_on="athlete_id", how="outer")

number_both         = df_merged_outer[(df_merged_outer.id.notna()) & (df_merged_outer.athlete_id.notna())].shape[0]
number_only_athlete = df_merged_outer[(df_merged_outer.id.notna()) & (df_merged_outer.athlete_id.isna())].shape[0]
number_only_results = df_merged_outer[(df_merged_outer.id.isna()) & (df_merged_outer.athlete_id.notna())].shape[0]
print(f"Number of rows in both: {number_both}")
print(f"Number of rows containing only info about athletes dataset: {number_only_athlete}")
print(f"Number of rows containing only info about results dataset: {number_only_results}")

number_both + number_only_athlete + number_only_results == df_merged_outer.shape[0]

Number of rows in both: 37823
Number of rows containing only info about athletes dataset: 31015
Number of rows containing only info about results dataset: 0


True
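
As an alternative way to report the sources, `pd.merge` accepts `indicator=True`, which adds a `_merge` column labelling each row as `both`, `left_only` or `right_only`. The sketch below uses small hypothetical tables in place of the real files, so the counts are illustrative only:

```python
import pandas as pd

# Hypothetical toy data standing in for athletes.txt and the results table
athletes = pd.DataFrame({"id": [1, 2, 3], "name": ["A, Ann", "B, Bob", "C, Cy"]})
results = pd.DataFrame({"athlete_id": [1, 1, 4], "medal": ["Gold", None, "Silver"]})

merged = pd.merge(
    left=athletes, left_on="id",
    right=results, right_on="athlete_id",
    how="outer", indicator=True,   # adds a categorical "_merge" column
)

# "both", "left_only" (athletes only) and "right_only" (results only)
source_counts = merged["_merge"].value_counts()
print(source_counts)
```

This avoids inferring the source from missing key values, which could be misleading if a key column itself contained missing entries.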

Explain (using the following markdown cell) whether the outer merge or some other merge type is an appropriate choice here.  
**Note:** in the following exercises we want to analyse the results of the Olympic Games between 2014 and 2020, considering also aspects such as the sex and names of the participants.

The appropriate merge type depends on how the resulting DataFrame is going to be used.

If we only want rows that carry both result and athlete information (rows missing either part are not of interest), we should use an inner join. That way we only get rows with both result and athlete information.

If every row matters, whether or not it has a match, we should use an outer join. That way we get rows with only result information (no athlete information), rows with only athlete information (no result information), and rows with both.

If we want to analyse the results, and sex and name merely provide additional metadata about each result (sex and name are less important than the result), we should use a right join. Every result row is kept and filled with sex and name if a match is found; otherwise sex and name are null.

If we want to analyse sex and name, and the results merely provide additional metadata about each athlete (the result is less important than sex and name), we should use a left join. Every athlete row is kept and filled with results if a match is found; otherwise the result columns are null.


In this case, we want to analyse the results of the Olympic Games, also considering information about the athletes. The results are the most important part of the analysis and the athlete information is metadata, so I will do a **right join** (the results data set is the right table) to keep all information about the Olympic Games.

*(In this case right and inner join give the same result, because every result row has a matching athlete. In other cases they may differ, so the general way not to lose information about the Olympic Games is a right join.)*
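
To make the difference between the merge types concrete, here is a small sketch on hypothetical toy tables (one athlete with two results, two athletes without results, and one result without a matching athlete). The data and the resulting counts are illustrative and unrelated to the real files:

```python
import pandas as pd

# Hypothetical toy data: athlete 1 has two results, athletes 2 and 3 have none,
# and the result of athlete 4 has no entry in the athletes table
athletes = pd.DataFrame({"id": [1, 2, 3], "sex": ["Female", "Male", "Male"]})
results = pd.DataFrame({"athlete_id": [1, 1, 4], "position": [1, 5, 2]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left=athletes, left_on="id",
                      right=results, right_on="athlete_id", how=how)
    row_counts[how] = merged.shape[0]

print(row_counts)
```

With the athletes as the left table, only the right and outer joins keep the unmatched result row, and only the left and outer joins keep athletes without results.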

If applicable, carry out the merge again using a more appropriate merge type. Print the rows and columns of the merged data set.

In [9]:
df = pd.merge(left=df_athletes, left_on="id", right=df_olympics, right_on="athlete_id", how="right")
df

Unnamed: 0,id,name,sex,year,location,sport,discipline,athlete_id,country,position,medal,height,weight
0,134917,"Bon-Chan, Gu",Male,2016,Rio de Janeiro,Archery,"Individual, Men",134917,Republic of Korea,1.0,Gold,1.81,84.0
1,111548,"Valladont, Jean-Charles",Male,2016,Rio de Janeiro,Archery,"Individual, Men",111548,France,2.0,Silver,1.80,83.0
2,111530,"Ellison, Brady",Male,2016,Rio de Janeiro,Archery,"Individual, Men",111530,United States,3.0,Bronze,1.81,86.0
3,135226,"van den Berg, Sjef",Male,2016,Rio de Janeiro,Archery,"Individual, Men",135226,Netherlands,4.0,,1.83,75.0
4,121560,"Worth, Taylor",Male,2016,Rio de Janeiro,Archery,"Individual, Men",121560,Australia,5.0,,1.74,60.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37818,119658,"Hansen, Brian",Male,2018,PyeongChang,Speed Skating,"Team Pursuit (8 laps), Men",119658,United States,8.0,,1.83,82.0
37819,128782,"Garcia, Jonathan",Male,2018,PyeongChang,Speed Skating,"Team Pursuit (8 laps), Men",128782,United States,8.0,,1.70,64.0
37820,128783,"Lehman, Emery",Male,2018,PyeongChang,Speed Skating,"Team Pursuit (8 laps), Men",128783,United States,8.0,,1.83,79.0
37821,128784,"Mantia, Joey",Male,2018,PyeongChang,Speed Skating,"Team Pursuit (8 laps), Men",128784,United States,8.0,,1.73,78.0


# Exercise 4 (20 points)

Create a new column `bmi` that contains the athlete’s body mass index (BMI). The BMI is defined as weight (in kg) divided by the square of the body height (in m). Note that a BMI of 18.5 - 25 is considered normal.

In [10]:
df['bmi'] = df.weight / (df.height ** 2)
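
As a quick sanity check of the formula, the example values from the column descriptions (1.96 m, 95.0 kg) give 95.0 / 1.96² ≈ 24.73. The sketch below applies the same expression to a toy frame; the second row, with a missing height, is a made-up case to show that the result is then missing as well:

```python
import pandas as pd

# First row: example values from the column descriptions (1.96 m, 95.0 kg).
# Second row: hypothetical athlete with a missing height.
toy = pd.DataFrame({"height": [1.96, None], "weight": [95.0, 70.0]})
toy["bmi"] = toy.weight / (toy.height ** 2)

print(toy.bmi.round(2).tolist())
```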

For the male athletes, calculate the average body mass index per `sport`. Display the 5 sports with the lowest average bmi.
**Note**: you can ignore missing values for this task

In [11]:
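# Mean BMI per sport for male athletes (mean() skips missing values by default)
bmi_by_sport = df[df.sex == 'Male'].groupby('sport').bmi.mean()
lowest_bmi_sports = bmi_by_sport.nsmallest(5).index.to_list()

print(f"The sports are {', '.join(lowest_bmi_sports)}.")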

The sports are Ski Jumping, Skateboarding, Nordic Combined, Triathlon, Cycling Road.


A body mass index between 18.5 and 24.9 is considered normal (neither underweight nor overweight). What percentage of rows exhibit a normal body mass index? 

In [17]:
number_of_normal = df[(df.bmi >= 18.5) & (df.bmi <= 24.9)].shape[0]
number_of_all = df.bmi.notna().sum()
print(f"The percentage is {(number_of_normal / number_of_all * 100):.2f}%")

The percentage is 77.11%


# Exercise 5 (25 points)

The German Biathlon athlete Erik Lesser (`athlete_id` = 127808) participated 12 times, at multiple Olympic Games and disciplines. Print the average position that he achieved.

In [13]:
df.position = pd.to_numeric(df.position, errors="coerce")

In [14]:
df[df.athlete_id==127808].position.mean()

np.float64(12.25)
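
The `errors="coerce"` conversion matters when a column mixes numbers with non-numeric entries. Whether `position` actually contains such entries is not shown here, so the sketch below assumes hypothetical raw values such as `"DNF"`:

```python
import pandas as pd

# Hypothetical raw values: two numeric strings, one label and one missing entry
positions = pd.Series(["1", "12", "DNF", None])

# Entries that cannot be parsed become NaN instead of raising an error
numeric = pd.to_numeric(positions, errors="coerce")

# mean() skips NaN, so the average is taken over the two valid positions
average = numeric.mean()
print(average)
```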

Based on the `name` column create a `firstname` and a `lastname` column and add them permanently to the data set. Display the first five rows of the dataset.

In [15]:
# Split "Lastname, Firstname" at the first comma and strip the space after it
names = df['name'].str.split(',', n=1, expand=True)
df['firstname'] = names[1].str.strip()
df['lastname'] = names[0].str.strip()

df.head(5)

Unnamed: 0,id,name,sex,year,location,sport,discipline,athlete_id,country,position,medal,height,weight,bmi,firstname,lastname
0,134917,"Bon-Chan, Gu",Male,2016,Rio de Janeiro,Archery,"Individual, Men",134917,Republic of Korea,1.0,Gold,1.81,84.0,25.640243,Gu,Bon-Chan
1,111548,"Valladont, Jean-Charles",Male,2016,Rio de Janeiro,Archery,"Individual, Men",111548,France,2.0,Silver,1.8,83.0,25.617284,Jean-Charles,Valladont
2,111530,"Ellison, Brady",Male,2016,Rio de Janeiro,Archery,"Individual, Men",111530,United States,3.0,Bronze,1.81,86.0,26.250725,Brady,Ellison
3,135226,"van den Berg, Sjef",Male,2016,Rio de Janeiro,Archery,"Individual, Men",135226,Netherlands,4.0,,1.83,75.0,22.395413,Sjef,van den Berg
4,121560,"Worth, Taylor",Male,2016,Rio de Janeiro,Archery,"Individual, Men",121560,Australia,5.0,,1.74,60.0,19.817677,Taylor,Worth


What are the 10 most common first names in Austria, Denmark, Germany, and Switzerland?  
**Note**: create an overall ranking, not one ranking per country

In [16]:
countrylist = ['Austria', 'Denmark', 'Germany', 'Switzerland']
# Count each athlete only once
df_unique_athletes = df[~df['id'].duplicated()]
top_names = df_unique_athletes[df_unique_athletes.country.isin(countrylist)].value_counts('firstname').nlargest(10).index.to_list()
print(f"The most common first names in {', '.join(countrylist)} are:\n{', '.join(top_names)}")

The most common first names in Austria, Denmark, Germany, Switzerland are:
Thomas, Andreas, Lisa, Simon, Christian, Martin, Johannes, Alexander, Daniel, Anna
