# Final Project - Final Utilities
![Olympic Rings](https://idrottsforum.org/wp-content/uploads/2019/02/winter-olympics.jpg)

In [1]:
!pip install pandas matplotlib seaborn



In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Read the dataset

In [31]:
athletes = pd.read_csv('https://raw.githubusercontent.com/ADSLab-Salzburg/DataAnalysiswithPython/main/data/athlete_events.csv')
athletes.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


### Quiz

Here you have space for your work on the quiz! Good luck and have fun!

In [37]:
# YOUR CODE HERE

# How many athlete/event combinations are in the dataset (number of rows)?
num_rows = len(athletes)
print("Number of athlete/event combinations:", num_rows)

# How many countries are in the dataset?
unique_countries = athletes['NOC'].nunique()
print("Number of unique countries in the dataset:", unique_countries)

# How many different events are in the dataset?

unique_events = athletes['Event'].nunique()
print("Number of different events in the dataset:", unique_events)

# How many different sport types are in the dataset?

unique_sports = athletes['Sport'].nunique()
print("Number of different sports in the dataset:", unique_sports)

# How many female athletes are in the dataset?

female_athletes = athletes[athletes['Sex'] == 'F'].drop_duplicates(subset='ID')
num_female_athletes = len(female_athletes)

print("Number of female athletes in the dataset:", num_female_athletes)

# How many hosting cities are in the dataset?

unique_cities = athletes['City'].nunique()
print("Number of hosting cities in the dataset:", unique_cities)

# How many individual athletes are in the dataset? (Athletes can participate in more than one event)
unique_athletes = athletes['ID'].nunique()
print("Number of individual athletes in the dataset:", unique_athletes)

# How many male athletes are in the dataset?

male_athletes = athletes[athletes['Sex'] == 'M'].drop_duplicates(subset='ID')
num_male_athletes = len(male_athletes)
print("Number of male athletes in the dataset:", num_male_athletes)

# How many medalists are in the dataset?
medalists_df = athletes[athletes['Medal'].notna()]
num_medalists = medalists_df['ID'].nunique()

print("The number of medalists in the dataset is:", num_medalists)

# Where did the first modern Olympic Games take place?

earliest_year_df = athletes[athletes['Year'] == athletes['Year'].min()]
first_olympics_city = earliest_year_df.iloc[0]['City']

print(f"The first modern Olympic Games took place in {first_olympics_city}")

# Who is the athlete to compete in the most events? State his/her name.

# Count rows per athlete ID (names may not be unique), then look up the name
event_counts = athletes['ID'].value_counts()
most_events_id = event_counts.idxmax()
most_events_athlete_name = athletes.loc[athletes['ID'] == most_events_id, 'Name'].iloc[0]

print("The athlete who competed in the most events is:", most_events_athlete_name)


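# Who is the most successful athlete (most medals won)?

# count() skips NaN, so this is the number of medals per athlete ID
athlete_medal_counts = athletes.groupby('ID')['Medal'].count()
most_successful_id = athlete_medal_counts.idxmax()
most_successful_athlete = athletes.loc[athletes['ID'] == most_successful_id, 'Name'].iloc[0]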

print(f"The most successful athlete (with the most medals) is: {most_successful_athlete}")

Number of athlete/event combinations: 271116
Number of unique countries in the dataset: 230
Number of different events in the dataset: 765
Number of different sports in the dataset: 66
Number of female athletes in the dataset: 33981
Number of hosting cities in the dataset: 42
Number of individual athletes in the dataset: 135571
Number of male athletes in the dataset: 101590
The number of medalists in the dataset is: 28251
The first modern Olympic Games took place in Athina
The athlete who competed in the most events is: Robert Tait McKenzie
The most successful athlete (with the most medals) is: Michael Fred Phelps, II
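
Athlete names may not be unique, whereas `ID` identifies a single athlete, so per-athlete counts are safer when grouped by `ID`. The sketch below uses a made-up toy frame (the names and IDs are hypothetical, not taken from the dataset) to show how the two approaches can disagree.

```python
import pandas as pd

# Toy rows: two different athletes (IDs 1 and 2) share the name "Kim Lee"
toy = pd.DataFrame({
    'ID':   [1, 1, 2, 2, 2, 3, 3, 3, 3],
    'Name': ['Kim Lee'] * 5 + ['Ana Ruiz'] * 4,
})

# Counting by name merges the two "Kim Lee" athletes into one
by_name = toy['Name'].value_counts().idxmax()

# Counting by ID keeps them apart; the name is looked up afterwards
by_id = toy['ID'].value_counts().idxmax()
name_for_id = toy.loc[toy['ID'] == by_id, 'Name'].iloc[0]

print("Top by name:", by_name)
print("Top by ID:  ", name_for_id)
```

On the toy data the name-based count picks the shared name, while the ID-based count picks the single athlete with the most rows.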


### Extract host cities

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
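
The later cells expect a `host_cities` data frame with a `City` column. One possible way to build it is sketched below, assuming one row per city plus the first year it hosted. The helper name `extract_host_cities`, the `First_Year` column, and the miniature sample frame are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Hypothetical miniature of the `athletes` frame, only to make the sketch runnable
sample = pd.DataFrame({
    'Year': [1992, 2012, 1920, 1900, 1988, 2012],
    'City': ['Barcelona', 'London', 'Antwerpen', 'Paris', 'Calgary', 'London'],
})

def extract_host_cities(df):
    """Return one row per host city with the first year it hosted (assumed layout)."""
    return (
        df.groupby('City', as_index=False)['Year'].min()
          .rename(columns={'Year': 'First_Year'})
          .sort_values('First_Year')
          .reset_index(drop=True)
    )

host_cities_sample = extract_host_cities(sample)
print(host_cities_sample)
```

On the full dataset, `host_cities = extract_host_cities(athletes)` should yield one row per distinct value of `City`, which can be checked against the city count from the quiz.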

## Maps - Primer
This section is optional; maps are covered in a separate lab.


### "Reverse-locate" latitude and longitude

We are using GeoPandas to display maps. See [these instructions](https://geopandas.org/install.html) on how to install GeoPandas.

In [None]:
!pip install geopandas geopy descartes

In [None]:
import geopandas as gpd
from geopy.geocoders import Nominatim

In [None]:
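geolocator = Nominatim(user_agent='TestForOlympic', timeout=100)  # set agent name according to your project
latitudes = []
longitudes = []

for c in host_cities['City']:
    loc = geolocator.geocode(c, timeout=100)  # wait up to 100 s for a response before timing out
    print(loc)
    # geocode() returns None when nothing is found; store None so the lists stay aligned with the rows
    latitudes.append(loc.latitude if loc is not None else None)
    longitudes.append(loc.longitude if loc is not None else None)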
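
The geocoding loop can also be wrapped in a small helper that pauses between requests and tolerates failed lookups. This is a minimal sketch, not the notebook's own method: `geocode_cities`, `FakeLocation`, and `fake_lookup` are hypothetical names, and the fake geocoder (with approximate, illustrative coordinates) only exists so the example runs without network access.

```python
import time
from collections import namedtuple

def geocode_cities(cities, geocode, delay_seconds=1.0):
    """Geocode each city name; returns (latitudes, longitudes) aligned with `cities`.

    `geocode` is any callable that returns an object with .latitude and
    .longitude attributes, or None when the lookup fails.
    """
    latitudes, longitudes = [], []
    for i, city in enumerate(cities):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        loc = geocode(city)
        latitudes.append(loc.latitude if loc is not None else None)
        longitudes.append(loc.longitude if loc is not None else None)
    return latitudes, longitudes

# Offline stand-in for a real geocoder (assumption, for illustration only)
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])
fake_lookup = {'Paris': FakeLocation(48.86, 2.35)}

lats, lons = geocode_cities(['Paris', 'Nowhereville'], fake_lookup.get, delay_seconds=0)
print(lats, lons)
```

With geopy, `geolocator.geocode` could be passed as the `geocode` argument. The default one-second delay reflects the usage policy of the public Nominatim service, which asks for at most one request per second; geopy's `geopy.extra.rate_limiter.RateLimiter` offers a ready-made alternative.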

Add the geocoded latitudes and longitudes to the data frame.

In [None]:
host_cities['latitude'] = latitudes
host_cities['longitude'] = longitudes
host_cities.head()

### Define geometry points
Let's do that by means of a GeoDataFrame. Here we use the latitude and longitude columns of the DataFrame filled previously.

In [None]:
host_cities = gpd.GeoDataFrame(host_cities, geometry=gpd.points_from_xy(host_cities.longitude, host_cities.latitude))
host_cities.head()

### Draw and save map
With this piece of code you can create your own map. For more on maps, stay tuned to the lab on maps.

In [None]:
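# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions,
# download the Natural Earth 110m countries file and pass its path to gpd.read_file() instead
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))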
base = world.plot(color='lightgrey', edgecolor='black', figsize=(20,10))
host_cities.plot(ax=base, marker='*', color='red', markersize=75)

# annotation - but it is not useful for this example
#for x, y, label in zip(host_cities.geometry.x, host_cities.geometry.y, host_cities.City):
#    base.annotate(label, xy=(x, y), xytext=(3, 3), textcoords="offset points")

plt.title('Olympic Games Hosts Since 1896', fontsize=20)
#plt.savefig('olympic_hosts.png', dpi=100)  # increase dpi for poster version
