# Olympic Data Analysis
by Daniel Walker, 28th June 2018

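This notebook explores `athlete_events.csv`, a dataset of 271,116 Olympic entries covering the Games from 1896 to 2016, and answers a few questions about events, sports, and participation.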

### Package Imports


In [131]:
import numpy as np
import pandas as pd

import plotly.graph_objs as go
from plotly.offline import iplot
import colorlover as cl

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [132]:
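# cl.scales[n][type] is a dict of named scales, so pick one to get a list of colour strings
colors = cl.scales['10']['div']['RdYlBu']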

### Importing Data and Quick Test/Check

In [133]:
df = pd.read_csv('athlete_events.csv')
df.name = 'Olympics 1896 - 2016'

In [134]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
ID        271116 non-null int64
Name      271116 non-null object
Sex       271116 non-null object
Age       261642 non-null float64
Height    210945 non-null float64
Weight    208241 non-null float64
Team      271116 non-null object
NOC       271116 non-null object
Games     271116 non-null object
Year      271116 non-null int64
Season    271116 non-null object
City      271116 non-null object
Sport     271116 non-null object
Event     271116 non-null object
Medal     39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
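
The `df.info()` output above shows that `Age`, `Height`, `Weight`, and `Medal` contain missing values (only 39,783 of 271,116 rows have a medal, so roughly 85% of `Medal` is empty). As a minimal sketch, the share of missing values per column could be summarised as follows. The helper name `missing_share` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy frame with made-up values, standing in for the real `df`
toy = pd.DataFrame({
    "Age": [24.0, None, 30.0, 21.0],
    "Height": [180.0, None, None, 185.0],
    "Medal": [None, None, None, "Gold"],
})

shares = missing_share(toy)
print(shares)
```

On the real data this would be called as `missing_share(df)`.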


In [135]:
df.head(5)

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |


## Index

#### Questions
1. How many unique events are there?
2. What are the top five sports by number of competitors?
3. What are the bottom five sports by number of competitors?
4. Which year had the most competitors?
5. What are the top five years by number of competitors?
6. Which year had the most competitors for a Winter Games?

## Questions

**1) How many unique events are there?**

In [136]:
df['Event'].nunique()

765

In [137]:
# Count entries per event first, so the chart receives 765 slices instead of 271,116 raw rows
event_counts = df['Event'].value_counts()

pie_chart = go.Pie(
    labels=event_counts.index,
    values=event_counts.values,
    hoverinfo='label+percent',
    textinfo='none',  # too many slices to label individually; details appear on hover
    marker=dict(colors=colors, line=dict(color="#000000", width=1)))

layout = go.Layout(
    title='{} Pie Chart'.format(df.name))

fig = go.Figure(data=[pie_chart], layout=layout)
iplot(fig)
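
With 765 distinct events, a pie chart has far too many slices to read. One possible approach, sketched here under the assumption that only the most frequent events matter for the chart, is to keep the top `n` categories and pool the rest into a single "Other" slice. The helper `top_n_with_other` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd


def top_n_with_other(values: pd.Series, n: int = 10) -> pd.Series:
    """Count categories, keep the n most frequent, and pool the rest as 'Other'."""
    counts = values.value_counts()
    top = counts.head(n).copy()
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top.loc["Other"] = remainder
    return top


# Toy series with made-up entries, standing in for df['Event']
toy_events = pd.Series(
    ["100m", "100m", "100m", "Judo", "Judo", "Roque", "Tug-Of-War"]
)

slices = top_n_with_other(toy_events, n=2)
print(slices)
```

The index and values of the result could then be passed as `labels` and `values` to `go.Pie`.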


**2) What are the top 5 sports by number of competitors?**

In [138]:
df['Sport'].value_counts().head(5)

Athletics     38624
Gymnastics    26707
Swimming      23195
Shooting      11448
Cycling       10859
Name: Sport, dtype: int64
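
Note that `value_counts()` counts rows, and the `ID` values in this dataset repeat across rows (row 214105 carries ID 107506, for instance), so an athlete who enters several events is counted several times. A minimal sketch of counting distinct athletes instead, assuming `ID` identifies one athlete, might look like this. The helper `athletes_per_group` and the toy frame are illustrative only.

```python
import pandas as pd


def athletes_per_group(frame: pd.DataFrame, group_col: str, id_col: str = "ID") -> pd.Series:
    """Count distinct athlete IDs within each group, largest first."""
    return frame.groupby(group_col)[id_col].nunique().sort_values(ascending=False)


# Toy frame with made-up values: athlete 1 swims three events, athlete 3 does both sports
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 3],
    "Sport": ["Swimming", "Swimming", "Swimming", "Judo", "Judo", "Swimming"],
})

rows_per_sport = toy["Sport"].value_counts()       # counts entries (rows)
athletes = athletes_per_group(toy, "Sport")        # counts distinct athletes

print(rows_per_sport)
print(athletes)
```

On the real data, `athletes_per_group(df, 'Sport').head(5)` would give the top five sports by distinct athletes, which may rank differently from the row counts.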

**3) What are the bottom 5 sports by number of competitors?**

In [139]:
df['Sport'].value_counts().tail(5)

Racquets         12
Jeu De Paume     11
Roque             4
Basque Pelota     2
Aeronautics       1
Name: Sport, dtype: int64

**Follow-up to the previous question: who competed in Aeronautics?**

In [140]:
df.loc[df['Sport'] == 'Aeronautics']

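| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 214105 | 107506 | Hermann Schreiber | M | 26.0 | NaN | NaN | Switzerland | SUI | 1936 Summer | 1936 | Summer | Berlin | Aeronautics | Aeronautics Mixed Aeronautics | Gold |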


**4) Which year had the most competitors?**

In [141]:
df_by_year = df[['Year', 'ID']]
df_by_year = df_by_year.rename(columns={'ID' : 'Competitors'})
groupby_year = df_by_year.groupby('Year')
groupby_year.count().sort_values(by='Competitors', ascending=False).head(1)

| Year | Competitors |
|---|---|
| 1992 | 16413 |


**5) What are the top 5 years by number of competitors?**

In [142]:
groupby_year.count().sort_values(by='Competitors', ascending=False).head(5)

| Year | Competitors |
|---|---|
| 1992 | 16413 |
| 1988 | 14676 |
| 2000 | 13821 |
| 1996 | 13780 |
| 2016 | 13688 |


After some research I found that the Summer and Winter Games were held in the same year up to and including 1992, which is why 1992 tops the list: its count combines two Games, as does the count for 1988. Of the years in this table that hosted a single Games, 2000 is the largest, with 13,821 competitors.

From 1994 onward the Games were staggered on a two-year cycle: Winter Games in 1994, Summer Games in 1996, Winter Games in 1998, and so on.
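
Because a single year can contain two Games, grouping by `Year` alone can overstate the size of one edition. A minimal sketch of separating them, assuming the `Year` and `Season` columns shown in `df.info()`, is to group on both columns. The toy frame below uses made-up row counts purely to show the effect.

```python
import pandas as pd

# Toy frame with made-up rows: 1992 hosts both seasons, 2000 hosts only Summer
toy = pd.DataFrame({
    "Year":   [1992, 1992, 1992, 1992, 1992, 2000, 2000, 2000, 2000],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter",
               "Summer", "Summer", "Summer", "Summer"],
})

# Grouping by year merges the two 1992 Games into one count
by_year = toy.groupby("Year").size()

# Grouping by year and season keeps each Games separate
by_games = toy.groupby(["Year", "Season"]).size().sort_values(ascending=False)

print(by_year)
print(by_games)
```

In the toy data 1992 is the largest year, but the 2000 Summer Games is the largest single Games. On the real data, `df.groupby(['Year', 'Season']).size()` would allow the same check.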

With this in mind, I decided to also find the top Winter Games by number of competitors.

**6) Which Winter Games had the most competitors?**

In [143]:
df_wg = df[['Games', 'ID']]
df_wg = df_wg[df_wg['Games'].str.contains("Winter")]
df_wg = df_wg.rename(columns={'Games' : 'Year', 'ID' : 'Competitors'})
groupby_wg = df_wg.groupby('Year')
groupby_wg.count().sort_values(by='Competitors', ascending=False).head(1)

| Year | Competitors |
|---|---|
| 2014 Winter | 4891 |
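
As an alternative sketch, the `Season` column could be used for the filter instead of a string match on `Games`, and the entry count could be reported alongside the number of distinct athletes. This assumes `ID` identifies one athlete. The helper `games_summary` and the toy frame are hypothetical.

```python
import pandas as pd


def games_summary(frame: pd.DataFrame, season: str) -> pd.DataFrame:
    """Per-Games entry count and distinct athlete count for one season."""
    subset = frame[frame["Season"] == season]
    return (
        subset.groupby("Games")["ID"]
        .agg(entries="count", athletes="nunique")
        .sort_values("entries", ascending=False)
    )


# Toy frame with made-up values: athlete 1 enters two events in 2014
toy = pd.DataFrame({
    "ID":     [1, 1, 2, 3, 4, 5],
    "Games":  ["2014 Winter", "2014 Winter", "2014 Winter",
               "2010 Winter", "2010 Winter", "2012 Summer"],
    "Season": ["Winter", "Winter", "Winter", "Winter", "Winter", "Summer"],
})

winter = games_summary(toy, "Winter")
print(winter)
```

On the real data this would be called as `games_summary(df, 'Winter').head(1)`.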
