<a href="https://colab.research.google.com/github/AMMLRepos/Data-Analysis-120-years-of-olympic-history/blob/main/data_analysis_120_years_of_olympic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
# Jovian Commit Essentials
# Please retain and execute this cell without modifying the contents for `jovian.commit` to work
!pip install jovian --upgrade -q
import jovian
jovian.set_project('data-analysis-120-years-of-olympic')
jovian.set_colab_id('1HOuo72qgRdDcUkVPtprahYoZeCrU8Hcv')

# Overview
This notebook is an exercise in analysing the Olympic dataset, which is openly available on [Kaggle](https://www.kaggle.com/mysarahmadbhat/120-years-of-olympic-history/). We have the following objectives for this activity -
- Get some interesting insights from the data, such as the person who won the most gold medals in Olympic history or the number of countries that participated each year.
- Learning purpose - Use the pandas, matplotlib and seaborn libraries to analyse the data, which gives us an interesting use case for applying these skills.

# Major Steps
We will perform the following major steps -
- Set up your working environment - Install libraries like pandas, numpy, matplotlib and seaborn
- Download the data from [Kaggle](https://www.kaggle.com/mysarahmadbhat/120-years-of-olympic-history/) using the [opendatasets](https://github.com/JovianML/opendatasets) library, which is developed by [jovian](https://jovian.ai)
- Perform basic analysis and draw seaborn plots
- Summarize your statistics
- Optional - Expose your insights on a webpage

Use the "Run" button to execute the code.

# Step 1 - Set up our working environment

Please note that this notebook is saved in Jovian's environment, so some of the setup is specific to Jovian. If you are not running and saving it in the Jovian environment, you can skip the steps that involve jovian.
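If you run the notebook outside Jovian, the Jovian-specific cells can be made optional. The following is a minimal sketch, assuming you simply want to skip the commit step when the package is not installed. The `JOVIAN_AVAILABLE` flag and the `commit_if_possible` helper are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
# Import jovian only if it is installed; otherwise carry on without it.
try:
    import jovian
    JOVIAN_AVAILABLE = True
except ImportError:
    jovian = None
    JOVIAN_AVAILABLE = False


def commit_if_possible(project_name):
    """Commit the notebook to Jovian when the library is available.

    Returns whatever jovian.commit returns, or None if jovian is missing.
    (Hypothetical helper, for illustration only.)
    """
    if not JOVIAN_AVAILABLE:
        print("jovian is not installed; skipping commit.")
        return None
    return jovian.commit(project=project_name)
```

With this in place, the later commit cell could call `commit_if_possible("data-analysis-120-years-of-olympic")` instead of calling `jovian.commit` directly.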

In [15]:
!pip install jovian pandas numpy matplotlib opendatasets seaborn  --quiet

In [16]:
# Only needed if you are using the Jovian environment
import jovian 

In [17]:
# Execute this to save new versions of the notebook
jovian.commit(project="data-analysis-120-years-of-olympic")

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/aaryaashay1848/data-analysis-120-years-of-olympic


'https://jovian.ai/aaryaashay1848/data-analysis-120-years-of-olympic'

In [18]:
#Import required libraries 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Set up basic configuration for the notebook and seaborn plots

In [19]:
# Ensure that matplotlib and seaborn graphs are displayed inside the notebook
%matplotlib inline

# Set up style and theming for seaborn graphs
sns.set_theme(style="darkgrid")
sns.set_context("paper")
plt.figure(figsize=(8,6))

# Show up to 500 columns and allow wide output when printing dataframes
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

<Figure size 576x432 with 0 Axes>
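To confirm that the display settings took effect, the options can be read back with `pd.get_option`. The sketch below also shows `pd.option_context`, which applies a setting only inside a `with` block; the limit of 5 columns is an arbitrary value chosen for illustration.

```python
import pandas as pd

# Same settings as the configuration cell above.
pd.set_option("display.max_columns", 500)
pd.set_option("display.width", 1000)

# Read the options back to confirm they were applied.
print(pd.get_option("display.max_columns"))
print(pd.get_option("display.width"))

# option_context changes a setting only inside the with-block,
# which is handy when a single table needs different limits.
with pd.option_context("display.max_columns", 5):
    inside = pd.get_option("display.max_columns")

# Outside the block the earlier value is restored.
after = pd.get_option("display.max_columns")
print(inside, after)
```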

# Step 2 - Download data
Download the dataset using the opendatasets library from jovian

In [22]:
import opendatasets as od
dataset_url = "https://www.kaggle.com/mysarahmadbhat/120-years-of-olympic-history/"

# od.download prompts for your Kaggle username and API key.
# Alternatively, download kaggle.json from Kaggle and upload it to the same working directory as this notebook;
# the credentials are then picked up from kaggle.json.
# Here the credentials are entered manually.
od.download(dataset_url)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: aashaymaheshwari
Your Kaggle Key: ··········
Downloading 120-years-of-olympic-history.zip to ./120-years-of-olympic-history


100%|██████████| 5.43M/5.43M [00:00<00:00, 124MB/s]
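Before calling `od.download`, it can be useful to check whether a `kaggle.json` file is already present in the working directory. The following is a minimal sketch, assuming the standard Kaggle token layout with `username` and `key` fields; `read_kaggle_credentials` is a hypothetical helper, not part of opendatasets.

```python
import json
from pathlib import Path


def read_kaggle_credentials(directory="."):
    """Return (username, key) from kaggle.json in `directory`, or None if absent."""
    path = Path(directory) / "kaggle.json"
    if not path.is_file():
        return None
    with path.open() as f:
        creds = json.load(f)
    # Kaggle's API token file stores these two fields.
    return creds["username"], creds["key"]


creds = read_kaggle_credentials()
if creds is None:
    print("No kaggle.json found; expect a prompt for credentials.")
else:
    # Print only the username, never the key.
    print("Found kaggle.json for user:", creds[0])
```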







The dataset is downloaded into a directory named "120-years-of-olympic-history". Let us use the os module to list the files that were downloaded.


In [26]:
import os 
datadir = "./120-years-of-olympic-history"

data_files = os.listdir(datadir)
print("List of downloaded files - ", data_files)

List of downloaded files -  ['country_definitions_data_dictionary.csv', 'country_definitions.csv', 'athlete_events_data_dictionary.csv', 'athlete_events.csv']
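Note that `os.listdir` returns entries in arbitrary order, so the order shown above may differ on another machine. A minimal sketch of a more predictable listing, assuming only the CSV files are of interest (`list_csv_files` is a hypothetical helper name):

```python
import os


def list_csv_files(directory):
    """Return the CSV file names in `directory`, sorted alphabetically."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".csv"))


datadir = "./120-years-of-olympic-history"

# Guard against the directory being absent, e.g. if the download was skipped.
if os.path.isdir(datadir):
    print("CSV files -", list_csv_files(datadir))
```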


The directory contains four files. It is good practice to open the files in a spreadsheet for a quick look at the data you are dealing with, but we will also use pandas for that. Let us import all 4 files into pandas dataframes.

In [29]:
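# Create file paths for all files present in the directory.
# Files are referenced by name because os.listdir does not guarantee any order.
file_country_definition_data_dictionary = os.path.join(datadir, "country_definitions_data_dictionary.csv")
file_country_definition = os.path.join(datadir, "country_definitions.csv")
file_athelete_events_data_dictionary = os.path.join(datadir, "athlete_events_data_dictionary.csv")
file_athelete_events = os.path.join(datadir, "athlete_events.csv")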

Let us now load each file into a dataframe and see whether it is needed for our analysis or is just a data dictionary that helps us understand the actual data

In [31]:
country_definition_data_dictionary_df = pd.read_csv(file_country_definition_data_dictionary)
print("Dataframe for Country Definition data dictionary file")
print(country_definition_data_dictionary_df)

Dataframe for Country Definition data dictionary file
    Field                                        Description
0     NOC           National Olympic Committee 3 letter code
1  region           Country name used for geospatial mapping
2   notes  Real country name if "region" isn't an exact m...
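The last description in the output above is cut off with "..." because pandas truncates long cell values when printing. A minimal sketch of how to print cells in full, using a hypothetical stand-in dataframe (the long text is a placeholder, not the real description):

```python
import pandas as pd

# Hypothetical long description, used only to trigger truncation.
long_text = " ".join(["placeholder"] * 12)
demo_df = pd.DataFrame({
    "Field": ["NOC", "example_field"],
    "Description": ["National Olympic Committee 3 letter code", long_text],
})

# With the default display.max_colwidth (50), long cells end in "...".
default_view = repr(demo_df)

# Lifting the limits inside a with-block shows every cell in full.
with pd.option_context(
    "display.max_colwidth", None,
    "display.max_columns", None,
    "display.expand_frame_repr", False,
):
    full_view = repr(demo_df)

print(default_view)
print(full_view)
```

The same `option_context` block can wrap `print(country_definition_data_dictionary_df)` to read the complete descriptions.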


The above output suggests that this is just a helper file that describes the columns of another file. For the time being we might not need it, and we can refer to it separately in a spreadsheet if needed. 
Let us do the same for the remaining files 

In [33]:
country_definition_df = pd.read_csv(file_country_definition)
print("Dataframe for Country Definition file")
print(country_definition_df)

Dataframe for Country Definition file
     NOC       region                 notes
0    AFG  Afghanistan                   NaN
1    AHO      Curacao  Netherlands Antilles
2    ALB      Albania                   NaN
3    ALG      Algeria                   NaN
4    AND      Andorra                   NaN
..   ...          ...                   ...
225  YEM        Yemen                   NaN
226  YMD        Yemen           South Yemen
227  YUG       Serbia            Yugoslavia
228  ZAM       Zambia                   NaN
229  ZIM     Zimbabwe                   NaN

[230 rows x 3 columns]


The above dataframe is a list of countries with each country's code and name. Notice that the values in the Field column of the first file (NOC, region, notes) are the column names of this second file.
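The country definitions can serve as a lookup from NOC code to region. The sketch below uses a few rows copied from the output above; `sample_df` is a stand-in for `country_definition_df`, and the assumption that NOC codes are unique is only checked for this small sample.

```python
import numpy as np
import pandas as pd

# A few rows copied from the country definitions output above.
sample_df = pd.DataFrame({
    "NOC": ["AFG", "AHO", "YEM", "YMD", "YUG"],
    "region": ["Afghanistan", "Curacao", "Yemen", "Yemen", "Serbia"],
    "notes": [np.nan, "Netherlands Antilles", np.nan, "South Yemen", "Yugoslavia"],
})

# NOC codes are unique in this sample, so they can act as a lookup key.
print("NOC unique:", sample_df["NOC"].is_unique)
noc_to_region = sample_df.set_index("NOC")["region"]
print(noc_to_region["YUG"])

# Several codes can map to the same region.
codes_per_region = sample_df.groupby("region")["NOC"].apply(list)
print(codes_per_region["Yemen"])
```

The `notes` column explains why a region can appear more than once: it holds the historical or alternative name behind a code.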

In [35]:
athelete_definition_data_dictionary_df = pd.read_csv(file_athelete_events_data_dictionary)
print("Dataframe for Athlete events data dictionary file")
print(athelete_definition_data_dictionary_df)

Dataframe for Athlete events data dictionary file
     Field                               Description
0       ID            Unique number for each athlete
1     Name                            Athlete's name
2      Sex                    Male (M) or Female (F)
3      Age                                   Integer
4   Height                            In centimeters
5   Weight                              In kilograms
6     Team                                 Team name
7      NOC  National Olympic Committee 3-letter code
8    Games                           Year and season
9     Year                                   Integer
10  Season                          Summer or Winter
11    City                                 Host city
12   Sport                                     Sport
13   Event                                     Event
14   Medal               Gold, Silver, Bronze, or NA


This is again metadata, but an important one: it gives us details of all the columns that we will see in our main data file. Let us now do the important job of importing the last and most critical data file

In [36]:
athelete_events_df = pd.read_csv(file_athelete_events)
print("Dataframe for Athlete events file")
print(athelete_events_df)

Dataframe for Athlete events file
            ID                      Name Sex   Age  Height  Weight            Team  NOC        Games  Year  Season            City          Sport                                     Event Medal
0            1                 A Dijiang   M  24.0   180.0    80.0           China  CHN  1992 Summer  1992  Summer       Barcelona     Basketball               Basketball Men's Basketball   NaN
1            2                  A Lamusi   M  23.0   170.0    60.0           China  CHN  2012 Summer  2012  Summer          London           Judo              Judo Men's Extra-Lightweight   NaN
2            3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN         Denmark  DEN  1920 Summer  1920  Summer       Antwerpen       Football                   Football Men's Football   NaN
3            4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden  DEN  1900 Summer  1900  Summer           Paris     Tug-Of-War               Tug-Of-War Men's Tug-Of-War  
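With both the data dictionary and the main file loaded, one quick sanity check is to compare the dictionary's Field values with the main file's column names. A minimal sketch, using the names shown in the outputs above (`compare_columns` is a hypothetical helper, and the header-only dataframe stands in for `athelete_events_df`):

```python
import pandas as pd

# Field names as listed in the athlete events data dictionary output.
dictionary_fields = ["ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC",
                     "Games", "Year", "Season", "City", "Sport", "Event", "Medal"]

# Column names copied from the header of the athlete events output.
events_columns = ["ID", "Name", "Sex", "Age", "Height", "Weight", "Team", "NOC",
                  "Games", "Year", "Season", "City", "Sport", "Event", "Medal"]


def compare_columns(df, expected_fields):
    """Return (missing, unexpected) column names relative to expected_fields."""
    actual = list(df.columns)
    missing = [c for c in expected_fields if c not in actual]
    unexpected = [c for c in actual if c not in expected_fields]
    return missing, unexpected


# Stand-in frame with the same header as the main file (no rows needed).
header_only_df = pd.DataFrame(columns=events_columns)
print(compare_columns(header_only_df, dictionary_fields))
```

Two empty lists mean every documented field is present and no undocumented column exists.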

# Step 3 - Working on basic data analysis
Now that we have our data imported into a dataframe, let's observe it and perform basic analysis on the meta information of the data
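As a rough illustration of what "meta information" can cover, the sketch below inspects shape, column types and missing values. It uses the first three rows of the output above with a subset of columns; `sample_events_df` is a stand-in for `athelete_events_df`, and this is one possible starting point rather than the analysis the notebook necessarily follows.

```python
import numpy as np
import pandas as pd

# First three rows from the athlete events output (subset of columns).
sample_events_df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby"],
    "Sex": ["M", "M", "M"],
    "Age": [24.0, 23.0, 24.0],
    "Height": [180.0, 170.0, np.nan],
    "Weight": [80.0, 60.0, np.nan],
    "NOC": ["CHN", "CHN", "DEN"],
    "Year": [1992, 2012, 1920],
    "Medal": [np.nan, np.nan, np.nan],
})

# Number of rows and columns.
print(sample_events_df.shape)

# Data type of each column.
print(sample_events_df.dtypes)

# Count of missing values per column.
missing_counts = sample_events_df.isna().sum()
print(missing_counts)
```

On the full dataframe, `athelete_events_df.info()` gives a similar summary in a single call.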