**Olympics_study**



This project started with my excitement about watching every competition at the Olympics and my love of playing with data. Knowing how much hardship lies behind every athlete's performance made me curious about how useful past data is for future planning and analysis. I decided to do some number crunching on 124 years of Olympics to see which countries perform best and what makes them great!
I have gathered data on Olympic performances from 1900 to 2020 and related it to each country's GDP, population, infrastructure, and economy.

I use the [olympics dataset](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) and the [2021 olympics report](https://www.kaggle.com/berkayalan/2021-olympics-medals-in-tokyo) from Kaggle, merged with the country-wise [GDP](https://www.kaggle.com/resulcaliskan/countries-gdps) and [population data](https://www.kaggle.com/centurion1986/countries-population).



In [1]:
import os
import numpy as np 
import pandas as pd
#from matplotlib import pyplot as plt
#import seaborn as sns

The first five rows of the Olympics data are shown below. We have 271,116 rows and 15 columns. Variables include the athlete's Name, Sex, Age, Height and Weight, their team name, sport and event, and the year, season and city of the Olympics they took part in. In addition, the data captures the medal won (if any) by the athlete.

In [2]:
# Read in the data set
olympics = pd.read_csv('../input/120-years-of-olympic-history-athletes-and-results/athlete_events.csv')
olympics.head()

|   | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |


**Data Cleaning and Exploration**

# 1) Missing Values
Count the missing values in each column of the 'olympics' dataset and print them.

In [3]:
print(olympics.isnull().sum())

ID             0
Name           0
Sex            0
Age         9474
Height     60171
Weight     62875
Team           0
NOC            0
Games          0
Year           0
Season         0
City           0
Sport          0
Event          0
Medal     231333
dtype: int64


We can see that Height, Weight and Age have a lot of missing values. Medal is NaN in 231,333 rows. This is expected, since not every participating athlete won a medal. Let's replace these missing values with 'DNW' (did not win).

In [4]:
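# Assign the result back; an in-place fill on a selected column may not update the frame in recent pandas
olympics['Medal'] = olympics['Medal'].fillna('DNW')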

In [5]:
# As expected the NaNs in the 'Medal' column disappear!
print(olympics.isnull().sum())

ID            0
Name          0
Sex           0
Age        9474
Height    60171
Weight    62875
Team          0
NOC           0
Games         0
Year          0
Season        0
City          0
Sport         0
Event         0
Medal         0
dtype: int64
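
Raw counts are hard to judge without the row total, so a percentage view can help. Below is a minimal sketch, assuming a small made-up frame (`toy`) in place of the real data; the names `toy`, `missing_pct` and `medal_counts` are illustrative and not part of the original notebook. It shows the share of missing values per column and then tabulates the medal column after the fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `olympics`; the values are made up
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [24.0, np.nan, 31.0, 19.0],
    'Medal': ['Gold', np.nan, np.nan, 'Bronze'],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Fill by assignment, mirroring the 'DNW' replacement above
toy['Medal'] = toy['Medal'].fillna('DNW')

# Tabulate the medal column to confirm the replacement
medal_counts = toy['Medal'].value_counts()
print(medal_counts)
```

On the real data the same two expressions would be applied to `olympics` instead of `toy`.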


# 2) NOC - National Olympic Committee
An NOC is the organization that sends a country's athletes to the Olympics.
Is every NOC linked to a unique team? We can find out by taking the unique combinations of the NOC and Team columns and counting how many times each NOC appears.

In [6]:
print(olympics.loc[:, ['NOC', 'Team']].drop_duplicates()['NOC'].value_counts().head())

FRA    160
USA     97
GBR     96
SWE     52
NOR     46
Name: NOC, dtype: int64


So NOC code 'FRA' is associated with 160 teams? That sounds preposterous! Let's use a master NOC-to-country mapping to correct this.
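
Before correcting anything, it can help to look at the team names behind a single code. The sketch below uses a few made-up rows (`toy`) rather than the real data, and the team names in it are illustrative only. It counts distinct team names per NOC with `groupby(...).nunique()`, which should give the same numbers as the `drop_duplicates` / `value_counts` approach when no team name is missing, and then lists the names recorded under one code.

```python
import pandas as pd

# Hypothetical toy rows; team names are illustrative only
toy = pd.DataFrame({
    'NOC':  ['FRA', 'FRA', 'FRA', 'DEN', 'DEN'],
    'Team': ['France', 'France-1', 'France', 'Denmark', 'Denmark/Sweden'],
})

# Number of distinct team names recorded under each NOC code
teams_per_noc = toy.groupby('NOC')['Team'].nunique().sort_values(ascending=False)
print(teams_per_noc)

# The distinct names behind one code
fra_teams = sorted(toy.loc[toy['NOC'] == 'FRA', 'Team'].unique())
print(fra_teams)
```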

The NOC dataset has the NOC code and the corresponding country name. The first five rows of the data are shown below:

In [7]:
# Let's read in the noc_country mapping first
noc_country = pd.read_csv('../input/120-years-of-olympic-history-athletes-and-results/noc_regions.csv')
noc_country.drop('notes', axis = 1 , inplace = True)
noc_country.rename(columns = {'region':'Country'}, inplace = True)

noc_country.head()

|   | NOC | Country |
| --- | --- | --- |
| 0 | AFG | Afghanistan |
| 1 | AHO | Curacao |
| 2 | ALB | Albania |
| 3 | ALG | Algeria |
| 4 | AND | Andorra |


We will merge the original dataset with the NOC master using the **NOC code as the join key**. This has to be a left join, since we want every participating athlete to remain in the data even if their NOC code is not found in the master. Any unmatched codes can then be corrected manually.
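
The left join described above could look roughly like the sketch below. It runs on two made-up frames (`athletes` and `noc_master`, both hypothetical stand-ins for `olympics` and `noc_country`), and the code 'XXX' is invented to show what an unmatched NOC looks like. `indicator=True` adds a `_merge` column marking rows that found no match, and `validate='many_to_one'` raises an error if a code appears more than once in the master, which would otherwise duplicate athlete rows.

```python
import pandas as pd

# Hypothetical toy stand-ins for `olympics` and `noc_country`
athletes = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'NOC':  ['CHN', 'DEN', 'XXX'],  # 'XXX' is a made-up code absent from the master
})
noc_master = pd.DataFrame({
    'NOC':     ['CHN', 'DEN'],
    'Country': ['China', 'Denmark'],
})

# Left join keeps every athlete row, matched or not
merged = athletes.merge(
    noc_master,
    on='NOC',
    how='left',
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    validate='many_to_one',  # fails if a NOC code is duplicated in the master
)

# Codes with no match in the master, to be corrected by hand
unmatched = merged.loc[merged['_merge'] == 'left_only', 'NOC'].unique().tolist()
print(unmatched)
```

Listing the unmatched codes first keeps the manual corrections limited to the handful of NOCs that actually need them.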