# Olympics

This is a historical dataset on the modern Olympic Games, from Athens 1896 to Rio 2016. Each row represents an individual athlete competing in a single Olympic event, along with the medal won (if any).

Not sure where to begin? Scroll to the bottom to find challenges!

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [21]:
data = pd.read_csv("data/athlete_events.csv.gz")
data.head()

|   |id |name                     |sex |age  |height |weight |team           |noc |games       |year |season |city      |sport         |event                            |medal |
| - | - | ----------------------- | -- | --- | ----- | ----- | ------------- | -- | ---------- | --- | ----- | -------- | ------------ | ------------------------------- | ---- |
|0  |1  |A Dijiang                |M   |24.0 |180.0  |80.0   |China          |CHN |1992 Summer |1992 |Summer |Barcelona |Basketball    |Basketball Men's Basketball      |NaN   |
|1  |2  |A Lamusi                 |M   |23.0 |170.0  |60.0   |China          |CHN |2012 Summer |2012 |Summer |London    |Judo          |Judo Men's Extra-Lightweight     |NaN   |
|2  |3  |Gunnar Nielsen Aaby      |M   |24.0 |NaN    |NaN    |Denmark        |DEN |1920 Summer |1920 |Summer |Antwerpen |Football      |Football Men's Football          |NaN   |
|3  |4  |Edgar Lindenau Aabye     |M   |34.0 |NaN    |NaN    |Denmark/Sweden |DEN |1900 Summer |1900 |Summer |Paris     |Tug-Of-War    |Tug-Of-War Men's Tug-Of-War      |Gold  |
|4  |5  |Christine Jacoba Aaftink |F   |21.0 |185.0  |82.0   |Netherlands    |NED |1988 Winter |1988 |Winter |Calgary   |Speed Skating |Speed Skating Women's 500 metres |NaN   |


## Data Dictionary

|Column   |Explanation                              |
| ------- | --------------------------------------- |
|id       |Unique number for each athlete           |
|name     |Athlete's name                           |
|sex      |M or F                                   |
|age      |Age of the athlete in years              |
|height   |In centimeters                           |
|weight   |In kilograms                             |
|team     |Team name                                |
|noc      |National Olympic Committee 3-letter code |
|games    |Year and season                          |
|year     |Year of the Games (integer)              |
|season   |Summer or Winter                         |
|city     |Host city                                |
|sport    |Sport                                    |
|event    |Event within the sport                   |
|medal    |Gold, Silver, Bronze, or NA              |

[Source](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) and [license](https://creativecommons.org/publicdomain/zero/1.0/) of the dataset.
The dataset is a consolidated version of data from www.sports-reference.com. 

## Data Cleaning

In [22]:
# make a copy of the data
clean_data = data.copy()

In [23]:
# check the shape of the data
print("the data has {} rows and {} columns".format(clean_data.shape[0], clean_data.shape[1]))

the data has 271116 rows and 15 columns


In [24]:
# check the number of unique values in id column
print("the data has {} unique values in id column".format(clean_data["id"].nunique()))

the data has 135571 unique values in id column


In [25]:
# check for missing values
clean_data.isnull().sum()

id             0
name           0
sex            0
age         9474
height     60171
weight     62875
team           0
noc            0
games          0
year           0
season         0
city           0
sport          0
event          0
medal     231333
dtype: int64
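
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch of that conversion. The helper `missing_share` is hypothetical (it is not part of the original notebook), and the small frame is made up purely for illustration.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)

# Tiny made-up frame for illustration; in the notebook, pass clean_data instead.
demo = pd.DataFrame({
    "age": [24.0, np.nan, 34.0, 21.0],
    "medal": [np.nan, np.nan, "Gold", np.nan],
    "sex": ["M", "M", "M", "F"],
})
shares = missing_share(demo)
print(shares)
```

On the full data, 231333 of the 271116 rows have no medal (about 85%). Per the data dictionary, a missing medal means the athlete did not win one, so it is not a data-quality gap in the way missing age, height, or weight values are.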

In [26]:
# Fill missing values in the age column
def fill_missing_age(df):
    """
    Fill missing age values with the mean of the known ages for the same
    athlete ('id') and 'sport'. Groups with no known age are left unchanged.
    """
    # Work on a copy so the input DataFrame is not modified
    filled_df = df.copy()

    # Mean age per (id, sport) group, aligned to the original rows.
    # transform keeps the original index and row order, unlike groupby.apply.
    group_mean = filled_df.groupby(['id', 'sport'])['age'].transform('mean')

    # Fill only the missing ages; the mean is NaN where a group has no known age
    filled_df['age'] = filled_df['age'].fillna(group_mean)

    return filled_df

# call the function to fill missing values
clean_data = fill_missing_age(clean_data)

In [27]:
clean_data.isnull().sum()

id             0
name           0
sex            0
age         9474
height     60171
weight     62875
team           0
noc            0
games          0
year           0
season         0
city           0
sport          0
event          0
medal     231333
dtype: int64
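
The age count is still 9474 after this step, which suggests that the rows with a missing age belong to athlete and sport groups with no recorded age at all, so there is no group value to borrow. One possible fallback, sketched here under that assumption, is to fill the remaining gaps with the median age of the sport and then with the overall median. The helper `fill_age_with_fallback` is hypothetical and not part of the original notebook, and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def fill_age_with_fallback(df):
    """Fill missing ages: athlete+sport mean, then sport median, then overall median."""
    out = df.copy()
    # Both transforms return a Series aligned to the original rows
    athlete_mean = out.groupby(["id", "sport"])["age"].transform("mean")
    sport_median = out.groupby("sport")["age"].transform("median")
    overall_median = out["age"].median()
    out["age"] = (
        out["age"]
        .fillna(athlete_mean)    # same athlete, same sport
        .fillna(sport_median)    # any athlete in the same sport
        .fillna(overall_median)  # last resort: all athletes
    )
    return out

# Made-up rows for illustration; in the notebook, pass clean_data instead.
demo = pd.DataFrame({
    "id": [1, 1, 2, 3, 4],
    "sport": ["Judo", "Judo", "Judo", "Judo", "Golf"],
    "age": [20.0, np.nan, 30.0, np.nan, np.nan],
})
filled = fill_age_with_fallback(demo)
print(filled)
```

Ages filled this way are estimates rather than observations, so it may be worth recording which rows were imputed (for example in a boolean column) before using age in any model.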

## Don't know where to start?

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: In which year and city did the Netherlands win the highest number of medals in their history?
- 📊 **Visualize**: Create a plot visualizing the relationship between the number of athletes countries send to an event and the number of medals they receive.
- 🔎 **Analyze**: In which sports does the height of an athlete increase their chances of earning a medal?
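
As a possible starting point for the Explore challenge, the sketch below counts medal rows per Games for one team. It rests on a few assumptions: the helper `medals_by_games` is hypothetical, the demo rows are made up and are not real results, and every row with a non-missing `medal` is counted. Since each row is one athlete in one event, a team-event medal is counted once per team member, so you may want to deduplicate by `event` first.

```python
import numpy as np
import pandas as pd

def medals_by_games(df, team):
    """Count medal rows per (year, city) for one team, most medals first."""
    won = df[(df["team"] == team) & df["medal"].notnull()]
    return (
        won.groupby(["year", "city"])
        .size()
        .sort_values(ascending=False)
        .rename("medals")
        .reset_index()
    )

# Made-up rows for illustration only; in the notebook, pass clean_data instead.
demo = pd.DataFrame({
    "team": ["Netherlands", "Netherlands", "Netherlands", "China", "Netherlands"],
    "year": [2000, 2000, 1988, 2000, 1988],
    "city": ["Sydney", "Sydney", "Calgary", "Sydney", "Calgary"],
    "medal": ["Gold", "Bronze", "Silver", "Gold", np.nan],
})
top = medals_by_games(demo, "Netherlands")
print(top.head(1))
```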

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

You are working as a data analyst for an international judo club. The owner of the club is looking for new ways to leverage data for competition. One idea they have had is to use past competition data to estimate the threat of future opponents. They have provided you with a dataset of past Olympic data and want to know whether you can use information such as the height, weight, age, and national origin of a judo competitor to estimate the probability that they will earn a medal.

You will need to prepare a report that is accessible to a broad audience. It should outline your steps, findings, and conclusions.
