![](https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/IOC_Logo.svg/2341px-IOC_Logo.svg.png)

# Introduction

Welcome to your first day as an analyst working for the IOC! The IOC is at the very heart of world sport, supporting every Olympic Movement stakeholder, promoting Olympism worldwide, and overseeing the regular celebration of the Olympic Games.

For a moment of glory on the medalist podium, elite athletes dedicate *everything* to their sport. The dataset you'll be working with covers Olympics medalists from 1896 through 2016. Who are the youngest and oldest medalists of all time? Are there physical differences between Summer Olympics medalists and Winter Olympics medalists? You're about to use your data coding chops to find out!

You'll start this Milestone assignment by inspecting and cleaning the data, then move on to analysis. Many of the Python skills you've learned so far will come into play. Are you up for it? Let's go!

### Dataset Description

The dataset is stored in a .csv file named `olympics.csv`. It contains the following columns:

* **ID**: A unique identifying number for each athlete
* **Name**: The name of each athlete
* **Sex**: M or F
* **Age**: The age of an athlete, in years, at the time they competed
* **Height**: The height of an athlete, in centimeters
* **Weight**: The weight of an athlete, in kilograms
* **Team**: The name of the athlete’s team; not always the name of a country
* **NOC**: National Olympic Committee’s three-letter code
* **Games**: Year and season
* **Year**: The year of the Games
* **Season**: Summer or Winter
* **City**: Host city
* **Sport**: The sport or category of an Olympic event/activity
* **Event**: Specific event within a sport, e.g. Men’s 400 meters breaststroke
* **Medal**: Gold, Silver, or Bronze
* **region**: Name of the athlete’s country (note the lowercase column name)



# Task 1: Data Inspection

![](https://media.giphy.com/media/42wQXwITfQbDGKqUP7/giphy.gif)

In [2]:
# import the pandas library
import pandas as pd

In [4]:
# Load in the data
df = pd.read_csv('datasets/olympics.csv')

In [5]:
# Preview DataFrame
print(df.head())

   ID                      Name Sex   Age  Height  Weight            Team  \
0   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
1  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
2  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
3  16  Juhamatti Tapio Aaltonen   M  28.0   184.0    85.0         Finland   
4  17   Paavo Johannes Aaltonen   M  28.0   175.0    64.0         Finland   

   NOC        Games  Year  Season       City       Sport  \
0  DEN  1900 Summer  1900  Summer      Paris  Tug-Of-War   
1  FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
2  FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
3  FIN  2014 Winter  2014  Winter      Sochi  Ice Hockey   
4  FIN  1948 Summer  1948  Summer     London  Gymnastics   

                                    Event   Medal   region  
0             Tug-Of-War Men's Tug-Of-War    Gold  Denmark  
1  Swimming Men's 200 metres Breaststroke  Bronze  Fin

In [6]:
# Inspect the numbers of rows and columns
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Number of rows: 39783
Number of columns: 16


In [7]:
# Inspect column names
print("Column names:", df.columns)

Column names: Index(['ID', 'Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region'],
      dtype='object')


In [8]:
# Inspect column data types, memory usage, etc.
# info() prints its report directly and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39783 entries, 0 to 39782
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      39783 non-null  int64  
 1   Name    39783 non-null  object 
 2   Sex     39783 non-null  object 
 3   Age     39051 non-null  float64
 4   Height  31072 non-null  float64
 5   Weight  30456 non-null  float64
 6   Team    39783 non-null  object 
 7   NOC     39783 non-null  object 
 8   Games   39783 non-null  object 
 9   Year    39783 non-null  int64  
 10  Season  39783 non-null  object 
 11  City    39783 non-null  object 
 12  Sport   39783 non-null  object 
 13  Event   39783 non-null  object 
 14  Medal   39783 non-null  object 
 15  region  39774 non-null  object 
dtypes: float64(3), int64(2), object(11)
memory usage: 4.9+ MB
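
The non-null counts above show that `Age`, `Height`, `Weight`, and `region` all have gaps. Here is a minimal sketch of counting missing values per column. It runs on a small made-up frame (`toy` and its values are hypothetical, not taken from `olympics.csv`), so treat it as a pattern to adapt rather than a result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for olympics.csv (values are made up)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [34.0, np.nan, 28.0, 22.0],
    "Height": [np.nan, np.nan, 184.0, 175.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values in each column, as a percentage
missing_pct = (toy.isna().mean() * 100).round(1)

print(missing_counts)
print(missing_pct)
```

Swapping `toy` for `df` would give the same kind of summary for the real data.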


In [9]:
# Display a statistical summary of the data
print(df.describe())

                  ID           Age        Height        Weight          Year
count   39783.000000  39051.000000  31072.000000  30456.000000  39783.000000
mean    69407.051806     25.925175    177.554197     73.770680   1973.943845
std     38849.980737      5.914026     10.893723     15.016025     33.822857
min         4.000000     10.000000    136.000000     28.000000   1896.000000
25%     36494.000000     22.000000    170.000000     63.000000   1952.000000
50%     68990.000000     25.000000    178.000000     73.000000   1984.000000
75%    103461.500000     29.000000    185.000000     83.000000   2002.000000
max    135563.000000     73.000000    223.000000    182.000000   2016.000000


In [10]:
# What types of medals are there?
print("Types of Medals:", df['Medal'].unique())

Types of Medals: ['Gold' 'Bronze' 'Silver']


# Task 2: Data Cleaning

![](https://media.giphy.com/media/10zsjaH4g0GgmY/giphy.gif)

In [12]:
# Rename 'NOC' column to 'CountryCode'
# Rename 'region' column to 'Country' (the label is lowercase in this dataset)
df.rename(columns={'NOC': 'CountryCode', 'region': 'Country'}, inplace=True)
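
Column labels in pandas are case-sensitive, and by default `rename` quietly skips any key that matches no column. The sketch below uses a made-up two-column frame (`toy` is hypothetical) to show how a label such as `Region` is ignored when the column is actually `region`, and how `errors="raise"` surfaces the mismatch.

```python
import pandas as pd

# Hypothetical frame; only the column labels matter here
toy = pd.DataFrame({"NOC": ["DEN", "FIN"], "region": ["Denmark", "Finland"]})

# 'Region' (capital R) matches no column, so rename() leaves the labels unchanged
unchanged = toy.rename(columns={"Region": "Country"})
print(list(unchanged.columns))

# With errors="raise", the same mismatch fails loudly instead
try:
    toy.rename(columns={"Region": "Country"}, errors="raise")
except KeyError as exc:
    print("KeyError:", exc)

# Matching the exact labels works, and both renames fit in one call
renamed = toy.rename(columns={"NOC": "CountryCode", "region": "Country"})
print(list(renamed.columns))
```

Checking `df.columns` right after a rename is a cheap way to confirm it took effect.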

In [15]:
# Remove the 'Team' column
# errors='ignore' keeps this cell safe to re-run after 'Team' has already been dropped
df.drop(columns='Team', inplace=True, errors='ignore')

# Task 3: Data Analysis

![](https://media.giphy.com/media/MT5UUV1d4CXE2A37Dg/giphy.gif)

In [16]:
# What is the youngest age of an Olympics medalist?
youngest_age = df['Age'].min()
print("Youngest Age of an Olympics Medalist:", youngest_age)

Youngest Age of an Olympics Medalist: 10.0


In [17]:
# What is the oldest age of an Olympics medalist?
oldest_age = df['Age'].max()
print("Oldest Age of an Olympics Medalist:", oldest_age)

Oldest Age of an Olympics Medalist: 73.0
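
`min()` and `max()` give the ages, but not *who* those medalists are. One possible way to pull the matching rows is `idxmin()` / `idxmax()`. The sketch below uses made-up rows (`toy` and the athlete names are hypothetical). Note that these methods return only the first matching row, so ties are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; names and ages are invented for illustration
toy = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Age": [34.0, 13.0, np.nan, 61.0],
    "Sport": ["Sailing", "Swimming", "Archery", "Shooting"],
})

# idxmin()/idxmax() return the index label of the extreme value, skipping NaN
youngest = toy.loc[toy["Age"].idxmin()]
oldest = toy.loc[toy["Age"].idxmax()]

print(youngest["Name"], youngest["Age"])
print(oldest["Name"], oldest["Age"])
```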


In [18]:
# How many of each medal were awarded?
medal_counts = df['Medal'].value_counts()
print("Number of Each Medal Awarded:\n", medal_counts)

Number of Each Medal Awarded:
 Gold      13372
Bronze    13295
Silver    13116
Name: Medal, dtype: int64


In [19]:
# How many events are there?
num_events = df['Event'].nunique()
print("Number of Events:", num_events)

Number of Events: 756


In [20]:
# How many sports are there?
num_sports = df['Sport'].nunique()
print("Number of Sports:", num_sports)

Number of Sports: 66


In [21]:
# What is the average age of an Olympics medalist?
average_age = df['Age'].mean()
print("Average Age of an Olympics Medalist:", average_age)

Average Age of an Olympics Medalist: 25.925174771452717


In [22]:
# Among the 10 oldest medalists, what are the most common sports?
oldest_medalists = df.nlargest(10, 'Age')
common_sports_oldest = oldest_medalists['Sport'].value_counts()
print("Most Common Sports Among the 10 Oldest Medalists:\n", common_sports_oldest)

Most Common Sports Among the 10 Oldest Medalists:
 Art Competitions    5
Sailing             3
Shooting            1
Archery             1
Name: Sport, dtype: int64
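
One thing to keep in mind with `nlargest`: by default it returns exactly `n` rows, so medalists tied at the cutoff age can be left out. A small sketch on made-up data (`toy` is hypothetical) comparing the default with `keep="all"`:

```python
import pandas as pd

# Hypothetical rows with a three-way tie at age 68
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [73.0, 68.0, 68.0, 68.0, 40.0],
    "Sport": ["Art Competitions", "Sailing", "Shooting", "Archery", "Swimming"],
})

# Default keep="first": exactly 2 rows, so two of the tied 68-year-olds are dropped
top2_first = toy.nlargest(2, "Age")

# keep="all": every row tied for the last place is kept, so more than 2 rows can come back
top2_all = toy.nlargest(2, "Age", keep="all")

print(len(top2_first), len(top2_all))
```

Whether ties should be included depends on the question being asked; the sport counts can shift either way.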


In [23]:
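# What are the 10 winningest countries in total medal count?
# Use 'Country' if the Task 2 rename took effect; otherwise fall back to the original 'region' label
country_col = 'Country' if 'Country' in df.columns else 'region'
winningest_countries = df.groupby(country_col)['Medal'].count().nlargest(10)
print("Top 10 Winningest Countries in Total Medal Count:\n", winningest_countries)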
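
Here is a minimal sketch of the group-and-rank step on made-up rows, assuming the country column is labelled `Country` (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical medal rows: one row per medal awarded
toy = pd.DataFrame({
    "Country": ["USA", "USA", "Finland", "USA", "Denmark", "Finland"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold", "Gold"],
})

# Count medal rows per country, then keep the largest counts
top_countries = toy.groupby("Country")["Medal"].count().nlargest(2)
print(top_countries)

# value_counts() sorts by count already, so it gives the same ranking in one step
same_ranking = toy["Country"].value_counts().head(2)
print(same_ranking)
```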

In [24]:
# How many medals have been awarded in the sport of trampolining?
trampolining_medals = df[df['Sport'] == 'Trampolining']['Medal'].count()
print("Number of Medals Awarded in Trampolining:", trampolining_medals)

Number of Medals Awarded in Trampolining: 30


# Level Up

![](https://media.giphy.com/media/YYaapBJ7UAZp9DJS7o/giphy.gif)

Want to Level Up your practice? We love to see it! Take a crack at some of these extra challenges, including visualizing some of this here data.

In [4]:
# How many gold medals were awarded to the United States?
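
One way to approach this is to combine two boolean masks. The sketch below runs on made-up rows and assumes the country column is named `Country` and that the United States appears as `"USA"`. Both are assumptions to check against the real data, for example with `df['Country'].unique()`.

```python
import pandas as pd

# Hypothetical rows; the "USA" label is an assumption about how the country is spelled
toy = pd.DataFrame({
    "Country": ["USA", "USA", "Finland", "USA"],
    "Medal": ["Gold", "Silver", "Gold", "Gold"],
})

# Build one mask per condition, then combine them with &
is_usa = toy["Country"] == "USA"
is_gold = toy["Medal"] == "Gold"

# True counts as 1, so summing the combined mask counts the matching rows
usa_gold_count = int((is_usa & is_gold).sum())
print(usa_gold_count)
```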

In [5]:
# List the Olympics in the dataset, starting with the most recent
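
A possible approach: keep one row per Games, then sort by `Year` in descending order. The sketch uses made-up rows (`toy` is hypothetical) with the same `Games` and `Year` column names as the dataset.

```python
import pandas as pd

# Hypothetical rows; several medalists can share the same Games
toy = pd.DataFrame({
    "Games": ["1900 Summer", "2014 Winter", "1920 Summer", "2014 Winter", "2016 Summer"],
    "Year": [1900, 2014, 1920, 2014, 2016],
})

games_recent_first = (
    toy[["Year", "Games"]]
    .drop_duplicates()                     # one row per Games
    .sort_values("Year", ascending=False)  # most recent first
    ["Games"]
    .tolist()
)
print(games_recent_first)
```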

In [None]:
# Average medalist height in the most recent Winter Olympics


In [None]:
# Average medalist weight in the most recent Winter Olympics


In [None]:
# Average medalist height in the most recent Summer Olympics


In [None]:
# Average medalist weight in the most recent Summer Olympics
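
The four questions above share one pattern: filter to a season, find its latest year, then average a column. Below is a minimal sketch of that pattern on made-up rows. The helper name `latest_games_mean` and the `toy` data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering two Winter and two Summer Games
toy = pd.DataFrame({
    "Season": ["Winter", "Winter", "Winter", "Summer", "Summer", "Summer"],
    "Year": [2010, 2014, 2014, 2012, 2016, 2016],
    "Height": [170.0, 180.0, 184.0, 175.0, 190.0, np.nan],
    "Weight": [60.0, 80.0, 84.0, 70.0, 90.0, 70.0],
})

def latest_games_mean(frame, season, column):
    """Mean of `column` for the most recent Games of the given season."""
    season_rows = frame[frame["Season"] == season]
    latest_year = season_rows["Year"].max()
    latest_rows = season_rows[season_rows["Year"] == latest_year]
    return latest_rows[column].mean()  # mean() skips NaN by default

for season in ("Winter", "Summer"):
    for column in ("Height", "Weight"):
        print(season, column, latest_games_mean(toy, season, column))
```

Because `mean()` ignores missing values, medalists without a recorded height or weight simply drop out of the average.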


In [None]:
# Import plotly express library


In [None]:
# Assign top 10 winningest countries table to a variable
# You did this in task 3


In [None]:
# Visualize the table as a bar chart
