<a href="https://colab.research.google.com/github/MaliheDahmardeh/Olympic-History/blob/main/Olympic_History.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data info

**Context:**

This is a historical dataset on the modern Olympic Games, covering all the Games from Athens 1896 to Rio 2016. The data was scraped from www.sports-reference.com in May 2018.

The Winter and Summer Games were held in the same year up until 1992. After that, they were staggered so that the Winter Games follow a four-year cycle starting in 1994 and the Summer Games a four-year cycle starting in 1996: Winter in 1994, Summer in 1996, Winter in 1998, and so on.

**Content:**

The file athlete_events.csv contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

ID : Unique number for each athlete

Name : Athlete's name

Sex : M or F

Age : Integer

Height : In centimeters

Weight : In kilograms

Team : Team name

NOC : National Olympic Committee 3-letter code

Games : Year and season

Year : Integer

Season : Summer or Winter

City : Host city

Sport : Sport

Event : Event

Medal : Gold, Silver, Bronze, or NA
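
The column list above can double as a quick sanity check right after loading. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and a toy frame stands in for the real CSV.

```python
import pandas as pd

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC",
    "Games", "Year", "Season", "City", "Sport", "Event", "Medal",
]

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame` (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in frame.columns]

# Toy stand-in for pd.read_csv('athlete_events.csv'); it lacks most columns on purpose.
toy = pd.DataFrame({"ID": [1], "Name": ["A Dijiang"], "Sex": ["M"]})
absent = missing_columns(toy)
print(absent)
```

On the real file the returned list should be empty; a non-empty list would suggest a different version of the dataset.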

#Dataset loading

In [79]:
import numpy as np
import pandas as pd
import os

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots


In [80]:
!ls

athlete_events.csv  noc_regions.csv  sample_data


In [81]:
df_events = pd.read_csv('athlete_events.csv')

In [82]:
df_noc = pd.read_csv('noc_regions.csv')

In [83]:
df = pd.merge(df_events,df_noc,on='NOC',how='left')
df

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,region,notes
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,China,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,China,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,,Denmark,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,Denmark,
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,,Netherlands,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271111,135569,Andrzej ya,M,29.0,179.0,89.0,Poland-1,POL,1976 Winter,1976,Winter,Innsbruck,Luge,Luge Mixed (Men)'s Doubles,,Poland,
271112,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Individual",,Poland,
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",,Poland,
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,,Poland,
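
A left merge keeps every event row even when its NOC code has no match in the lookup table, which is where the null `region` values come from. One way to find the unmatched codes is the `indicator` parameter of `pd.merge`. This is a sketch on toy data (the frames `events` and `noc` below are illustrative stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for athlete_events.csv and noc_regions.csv (values are illustrative).
events = pd.DataFrame({"Name": ["A", "B", "C"], "NOC": ["CHN", "DEN", "SGP"]})
noc = pd.DataFrame({"NOC": ["CHN", "DEN", "SIN"],
                    "region": ["China", "Denmark", "Singapore"]})

# indicator=True adds a `_merge` column; for a left join it is 'both' or 'left_only'.
merged = pd.merge(events, noc, on="NOC", how="left", indicator=True)

# NOC codes from the events table that found no region in the lookup table.
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "NOC"].unique())
print(unmatched)
```

Running the same check on the real frames would list every NOC code behind the missing regions before any rows are filtered out.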


In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271116 entries, 0 to 271115
Data columns (total 17 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
 15  region  270746 non-null  object 
 16  notes   5039 non-null    object 
dtypes: float64(3), int64(2), object(12)
memory usage: 37.2+ MB


#Data Cleaning

In [85]:
df.isnull().sum()

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
region       370
notes     266077
dtype: int64

In [86]:
# drop 'notes' and 'ID': neither is used in this analysis, and 'notes' is almost entirely null
df.drop(columns=['notes','ID'],inplace=True)

In [87]:
# check and drop duplicated rows
df.duplicated().sum()

1385

In [88]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0
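
Before dropping duplicates it can help to look at them. The sketch below uses toy data and assumes nothing beyond standard pandas; `keep=False` marks every member of a duplicated group, so the repeated rows can be inspected side by side.

```python
import pandas as pd

# Toy frame with one fully repeated row (values are illustrative).
toy = pd.DataFrame({
    "Name": ["X", "X", "Y"],
    "Event": ["Art", "Art", "Luge"],
    "Year": [1932, 1932, 1976],
})

# keep=False flags every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]
print(len(dupes))

# drop_duplicates keeps the first row of each group by default.
deduped = toy.drop_duplicates()
print(len(deduped))
```

Note that with `ID` already dropped, two different athletes who share a name and every other field would also be treated as duplicates.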

In [89]:
df.isnull().sum()

Name           0
Sex            0
Age         9315
Height     58814
Weight     61527
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     229959
region       370
dtype: int64

Fill the null values in Age, Height, and Weight with each column's mean, rounded to the nearest integer.

In [90]:
#Mean of height
mean_height=df["Height"].mean()
rmh=round(mean_height)
rmh

175

In [91]:
df["Height"]=df["Height"].fillna(rmh)
df["Height"]

0         180.0
1         170.0
2         175.0
3         175.0
4         185.0
          ...  
271111    179.0
271112    176.0
271113    176.0
271114    185.0
271115    185.0
Name: Height, Length: 269731, dtype: float64

In [92]:
#Mean of weight
mean_weight=df["Weight"].mean()
rmw=round(mean_weight)
rmw

71

In [93]:
df["Weight"]=df["Weight"].fillna(rmw)
df["Weight"]

0         80.0
1         60.0
2         71.0
3         71.0
4         82.0
          ... 
271111    89.0
271112    59.0
271113    59.0
271114    96.0
271115    96.0
Name: Weight, Length: 269731, dtype: float64

In [94]:
#Mean of Age
mean_age=df["Age"].mean()
rma=round(mean_age)
rma

25

In [95]:
df["Age"]=df["Age"].fillna(rma)
df["Age"]

0         24.0
1         23.0
2         24.0
3         34.0
4         21.0
          ... 
271111    29.0
271112    27.0
271113    27.0
271114    30.0
271115    34.0
Name: Age, Length: 269731, dtype: float64

In [96]:
#Changing float type data to integer
df["Age"]=df["Age"].astype(int)
df["Height"]=df["Height"].astype(int)
df["Weight"]=df["Weight"].astype(int)
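
The three fill-and-cast steps above follow the same pattern, so they can be folded into one loop. This is a sketch under assumptions: `fill_with_rounded_mean` is a hypothetical helper, and the toy values are chosen so that each mean is a whole number.

```python
import pandas as pd

def fill_with_rounded_mean(frame, columns):
    """Fill NaNs in each column with that column's rounded mean, then cast to int.

    Hypothetical helper; returns a new frame instead of modifying in place.
    """
    out = frame.copy()
    for col in columns:
        fill_value = round(out[col].mean())  # mean() skips NaN by default
        out[col] = out[col].fillna(fill_value).astype(int)
    return out

# Toy data with one missing value per column.
toy = pd.DataFrame({
    "Age": [24.0, None, 30.0],
    "Height": [180.0, 170.0, None],
    "Weight": [None, 60.0, 82.0],
})
filled = fill_with_rounded_mean(toy, ["Age", "Height", "Weight"])
print(filled)
```

Filling with a single global mean is simple but flattens the distribution: every athlete with a missing height becomes 175 cm regardless of sex or sport, which is worth keeping in mind when reading later statistics on these columns.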

In [97]:
df.columns

Index(['Name', 'Sex', 'Age', 'Height', 'Weight', 'Team', 'NOC', 'Games',
       'Year', 'Season', 'City', 'Sport', 'Event', 'Medal', 'region'],
      dtype='object')

In [98]:
# keep only medal-winning rows: drop rows where Medal is null
# (inplace=True returns None, so the result is not assigned)
df.dropna(subset=['Medal'], inplace=True)

In [99]:
df.isnull().sum()

Name      0
Sex       0
Age       0
Height    0
Weight    0
Team      0
NOC       0
Games     0
Year      0
Season    0
City      0
Sport     0
Event     0
Medal     0
region    9
dtype: int64

In [100]:
#fill remaining null values in region with 'unknown'
df["region"]=df["region"].fillna('unknown')
df["region"]

3         Denmark
37        Finland
38        Finland
40        Finland
41        Finland
           ...   
271078     Russia
271080     Russia
271082     Poland
271102     Russia
271103     Russia
Name: region, Length: 39772, dtype: object

In [101]:
#view of "unknown" region
region_unknown=df.loc[(df["region"]=="unknown")]
region_unknown

Unnamed: 0,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,region
67723,Feng Tian Wei,F,21,163,55,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,unknown
67724,Feng Tian Wei,F,25,163,55,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Singles,Bronze,unknown
67725,Feng Tian Wei,F,25,163,55,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,unknown
138095,Li Jia Wei,F,26,170,60,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,unknown
138096,Li Jia Wei,F,30,170,60,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,unknown
213955,Joseph Isaac Schooling,M,21,184,74,Singapore,SGP,2016 Summer,2016,Summer,Rio de Janeiro,Swimming,Swimming Men's 100 metres Butterfly,Gold,unknown
235908,"Howe Liang ""Tiger"" Tan",M,27,160,69,Singapore,SGP,1960 Summer,1960,Summer,Roma,Weightlifting,Weightlifting Men's Lightweight,Silver,unknown
256622,Wang Jue Gu,F,28,155,63,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,unknown
256624,Wang Jue Gu,F,32,155,63,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,unknown


In [102]:
# noc_regions.csv lists Singapore under NOC code SIN, while athlete_events.csv uses SGP,
# so the merge found no region for these rows; replace 'unknown' with 'Singapore'
df["region"]=df["region"].replace('unknown','Singapore')
region_singapore=df.loc[(df["region"]=="Singapore")]
region_singapore

Unnamed: 0,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,region
67723,Feng Tian Wei,F,21,163,55,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,Singapore
67724,Feng Tian Wei,F,25,163,55,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Singles,Bronze,Singapore
67725,Feng Tian Wei,F,25,163,55,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,Singapore
138095,Li Jia Wei,F,26,170,60,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,Singapore
138096,Li Jia Wei,F,30,170,60,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,Singapore
213955,Joseph Isaac Schooling,M,21,184,74,Singapore,SGP,2016 Summer,2016,Summer,Rio de Janeiro,Swimming,Swimming Men's 100 metres Butterfly,Gold,Singapore
235908,"Howe Liang ""Tiger"" Tan",M,27,160,69,Singapore,SGP,1960 Summer,1960,Summer,Roma,Weightlifting,Weightlifting Men's Lightweight,Silver,Singapore
256622,Wang Jue Gu,F,28,155,63,Singapore,SGP,2008 Summer,2008,Summer,Beijing,Table Tennis,Table Tennis Women's Team,Silver,Singapore
256624,Wang Jue Gu,F,32,155,63,Singapore,SGP,2012 Summer,2012,Summer,London,Table Tennis,Table Tennis Women's Team,Bronze,Singapore
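
An alternative to patching `region` after the fact is to align the NOC code in the lookup table before merging. The sketch below is an assumption about how that could look, shown on toy frames rather than the real files:

```python
import pandas as pd

# Toy stand-ins: the events table uses SGP, the lookup table uses SIN.
events = pd.DataFrame({"Name": ["Feng Tian Wei", "A Dijiang"], "NOC": ["SGP", "CHN"]})
noc = pd.DataFrame({"NOC": ["SIN", "CHN"], "region": ["Singapore", "China"]})

# Align the lookup table with the code used in the events table.
noc_fixed = noc.assign(NOC=noc["NOC"].replace({"SIN": "SGP"}))

merged = pd.merge(events, noc_fixed, on="NOC", how="left")
print(merged)
```

Fixing the key targets only the Singapore rows, whereas replacing every `'unknown'` with `'Singapore'` is safe only because all nine remaining rows happen to be Singaporean.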


In [103]:
df.isnull().sum()

Name      0
Sex       0
Age       0
Height    0
Weight    0
Team      0
NOC       0
Games     0
Year      0
Season    0
City      0
Sport     0
Event     0
Medal     0
region    0
dtype: int64

#Data Exploration

In [104]:
df.describe(include=['object']).T

Unnamed: 0,count,unique,top,freq
Name,39772,28202,"Michael Fred Phelps, II",28
Sex,39772,2,M,28519
Team,39772,498,United States,5219
NOC,39772,149,USA,5637
Games,39772,51,2008 Summer,2048
Season,39772,2,Summer,34077
City,39772,42,London,3624
Sport,39772,66,Athletics,3969
Event,39772,756,Football Men's Football,1269
Medal,39772,3,Gold,13369


In [105]:
df.nunique()

Name      28202
Sex           2
Age          61
Height       86
Weight      129
Team        498
NOC         149
Games        51
Year         35
Season        2
City         42
Sport        66
Event       756
Medal         3
region      137
dtype: int64

In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39772 entries, 3 to 271103
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    39772 non-null  object
 1   Sex     39772 non-null  object
 2   Age     39772 non-null  int64 
 3   Height  39772 non-null  int64 
 4   Weight  39772 non-null  int64 
 5   Team    39772 non-null  object
 6   NOC     39772 non-null  object
 7   Games   39772 non-null  object
 8   Year    39772 non-null  int64 
 9   Season  39772 non-null  object
 10  City    39772 non-null  object
 11  Sport   39772 non-null  object
 12  Event   39772 non-null  object
 13  Medal   39772 non-null  object
 14  region  39772 non-null  object
dtypes: int64(4), object(11)
memory usage: 4.9+ MB


In [112]:
# first medal row for each (Year, Sport, region) combination
years_df = df.groupby(['Year', 'Sport', 'region'], as_index=False).first()
years_df.head(15)


Unnamed: 0,Year,Sport,region,Name,Sex,Age,Height,Weight,Team,NOC,Games,Season,City,Event,Medal
0,1896,Athletics,Australia,"Edwin Harold ""Teddy"" Flack",M,22,175,71,Australia,AUS,1896 Summer,Summer,Athina,Athletics Men's 800 metres,Gold
1,1896,Athletics,France,Albin Georges Lermusiaux,M,21,175,71,France,FRA,1896 Summer,Summer,Athina,"Athletics Men's 1,500 metres",Bronze
2,1896,Athletics,Germany,Fritz Hofmann,M,24,167,56,Germany,GER,1896 Summer,Summer,Athina,Athletics Men's 100 metres,Silver
3,1896,Athletics,Greece,Evangelos Damaskos,M,25,175,71,Greece,GRE,1896 Summer,Summer,Athina,Athletics Men's Pole Vault,Bronze
4,1896,Athletics,Hungary,Nndor Dni,M,24,175,71,Hungary,HUN,1896 Summer,Summer,Athina,Athletics Men's 800 metres,Silver
5,1896,Athletics,UK,Charles Henry Stuart Gmelin,M,23,175,71,Great Britain,GBR,1896 Summer,Summer,Athina,Athletics Men's 400 metres,Bronze
6,1896,Athletics,USA,Arthur Charles Blake,M,24,175,71,United States,USA,1896 Summer,Summer,Athina,"Athletics Men's 1,500 metres",Silver
7,1896,Cycling,Austria,Felix Adolf Schmal,M,23,175,71,Austria,AUT,1896 Summer,Summer,Athina,Cycling Men's 333 metres Time Trial,Bronze
8,1896,Cycling,France,Marie Lon Flameng,M,18,175,71,France,FRA,1896 Summer,Summer,Athina,Cycling Men's Sprint,Bronze
9,1896,Cycling,Germany,Anton Gdrich,M,36,175,71,Germany,GER,1896 Summer,Summer,Athina,"Cycling Men's Road Race, Individual",Silver
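
Taking `.first()` of each group returns one sample row per group rather than a count. If the goal is the number of medals per year, a group size does that directly. This is a sketch on toy data; on the real frame the same call would be `df.groupby('Year').size()`.

```python
import pandas as pd

# Toy frame: one row per medal, as in the filtered dataset.
toy = pd.DataFrame({
    "Year": [1896, 1896, 1900, 1900, 1900],
    "Medal": ["Gold", "Bronze", "Gold", "Silver", "Gold"],
})

# One row per medal, so the group size is the medal count for that year.
medals_per_year = toy.groupby("Year").size().rename("medals")
print(medals_per_year)
```

Because each row is an athlete-event, a team medal is counted once per team member.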


In [113]:
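# count medal rows per region and sport, then pivot into a region x Sport table
region_sport = df.groupby(['region', 'Sport']).size().reset_index(name='count')
region_sport.pivot(index='region', columns='Sport', values='count')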

In [114]:
# count medal rows per sport
pivot = df.pivot_table(index=['Sport'], values=['Medal'], aggfunc='count')

print(pivot)

                  Medal
Sport                  
Aeronautics           1
Alpine Skiing       428
Alpinism             25
Archery             353
Art Competitions    156
...                 ...
Tug-Of-War          115
Volleyball          969
Water Polo         1057
Weightlifting       646
Wrestling          1296

[66 rows x 1 columns]


In [115]:
# number of medals of each type per region, 1896 to 2016, largest counts first
number_of_different_medals=df.groupby('region')['Medal'].value_counts().sort_values(ascending=False)
number_of_different_medals.head(15)

region   Medal 
USA      Gold      2638
         Silver    1641
Russia   Gold      1599
USA      Bronze    1358
Germany  Gold      1301
         Bronze    1260
         Silver    1195
Russia   Bronze    1178
         Silver    1170
UK       Silver     739
         Gold       677
France   Bronze     666
UK       Bronze     651
France   Silver     602
Italy    Gold       575
Name: Medal, dtype: int64
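
The stacked output above mixes regions and medal types in one sorted list, which makes it hard to compare regions. A cross-tabulation gives one row per region with a column per medal type. This is a sketch on toy data, with the `Total` column added as an assumption for sorting:

```python
import pandas as pd

# Toy frame: one row per medal (values are illustrative).
toy = pd.DataFrame({
    "region": ["USA", "USA", "USA", "Russia", "Russia", "Italy"],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Bronze", "Gold"],
})

# Rows are regions, columns are medal types, cells are row counts.
table = pd.crosstab(toy["region"], toy["Medal"])
# Put the columns in podium order; absent medal types become 0.
table = table.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
table["Total"] = table.sum(axis=1)
table = table.sort_values("Total", ascending=False)
print(table)
```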

In [116]:
df['Year'].nunique()

35

In [117]:
# number of gold medals won by women (one row per athlete-event,
# so an athlete with several golds is counted once per medal)
women_with_gold=df[(df['Sex'] == 'F') & (df['Medal'] == 'Gold')].count()['Medal']
print('women who have won Gold medals:',women_with_gold)
# number of gold medals won by men, counted the same way
men_with_gold=df[(df['Sex'] == 'M') & (df['Medal'] == 'Gold')].count()['Medal']
print('men who have won Gold medals:',men_with_gold)

women who have won Gold medals: 3747
men who have won Gold medals: 9622
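
The counts above are gold-medal rows, not distinct people: an athlete with several golds contributes several rows. The sketch below separates the two quantities on toy data. It uses `Name` as a stand-in for identity, which is an assumption, since the `ID` column was dropped earlier and different athletes can share a name.

```python
import pandas as pd

# Toy frame: Ann and Carl each have two golds (values are illustrative).
toy = pd.DataFrame({
    "Name": ["Ann", "Ann", "Bea", "Carl", "Carl", "Carl", "Dan"],
    "Sex": ["F", "F", "F", "M", "M", "M", "M"],
    "Medal": ["Gold", "Gold", "Gold", "Gold", "Gold", "Silver", "Bronze"],
})

gold = toy[toy["Medal"] == "Gold"]

# Gold-medal rows per sex (what the cell above counts).
gold_medals = gold.groupby("Sex").size()
# Distinct gold medalists per sex, approximated by unique names.
gold_athletes = gold.groupby("Sex")["Name"].nunique()

print(gold_medals.to_dict())
print(gold_athletes.to_dict())
```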
