<h1>Olympic History Analysis</h1>

<h2><font color='#0E2545'>Introduction</font></h2>


Let's pretend we're a team of *risk-takers*, aka gamblers, with a background in data science. We keep an eye on every competition where we might earn some money.

The Tokyo 2020 Summer Olympic Games start in a couple of months and we are really excited, because we recently got a dataset covering about 120 years of Olympic history. We plan to use this dataset to make predictions and find, for each event, the likely winners on whom we can bet and earn the most.

<h2><font color='#0E2545'>Data Analytic Question</font></h2>

Our goal is to earn a lot of money, so we only want to consider the most popular sports. 
To achieve this, we ask ourselves the following question:
* **Which athletes from the most popular sports are going to win gold?**

<h2><font color='#0E2545'>Loading the Data</font></h2>

As we explained at the beginning, we have the [<font color='blue'>120 years of Olympic History</font>](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results) dataset, but we also want to know which sports are considered the most popular. For this task we are going to scrape the Wikipedia article [<font color='blue'>Olympic sports</font>](https://en.wikipedia.org/wiki/Olympic_sports#Current_and_discontinued_summer_program).

The Wikipedia article ranks popularity using the following categories:
* A, B, C, D, E, where category A represents the most popular sports and category E the least popular. For this analysis we are only going to consider categories **A and B**.

In [110]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb
import re
import numpy as np
import pandas as pd
import scipy.stats as stats
import requests as r
from bs4 import BeautifulSoup

In [239]:
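def get_wiki_article(url = "https://en.wikipedia.org/wiki/Olympic_sports"):
    article = r.get(url).text
    #Name the parser explicitly so the result does not depend on which parsers are installed
    return BeautifulSoup(article, "html.parser")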

#Appends the given sports, with their category and type (individual or team), to pss_data
def add_sport_info(pss_data, category, sports, sport_type):
    sport_amount = len(sports)
    pss_data["sport"] += sports
    pss_data["category"] += [category] * sport_amount
    pss_data["type"] += [sport_type] * sport_amount
    return pss_data

#Obtains data about current olympic summer sports by scraping a certain table from a Wikipedia article
def get_summer_sports_popularity(article):
    #article is the parsed page (a BeautifulSoup object), as returned by get_wiki_article
    #The table contains the following fields
    #Category -> Index 0, represents how popular the sport is (A-E)
    #Individual Sport -> Index 1, which sports are individual
    #Team Sport -> Index 2, which sports are team based.
    pss_data = {
        "category": [],
        "sport": [],
        "type": []
    }
    pss_table = article.find_all("table", class_="wikitable")[1]
    pss_rows = pss_table.find_all("tr")
    for index, row in enumerate(pss_rows):
        #Not considering the header
        if index == 0:
            continue
        columns = row.find_all("td")
        category = columns[0].text
        pattern = re.compile(r'((?:\w+)(?:\s*\w+)*)',re.I)
        
        #Add individual sports
        individual_sports = re.findall(pattern,columns[1].text)
        pss_data = add_sport_info(pss_data,category,individual_sports, "individual")
        
        #Add team sports
        team_sports =  re.findall(pattern,columns[2].text)
        pss_data = add_sport_info(pss_data,category,team_sports, "team")
        
    return pd.DataFrame(pss_data)


#Gets the federation for each sport by scraping a certain table from a Wikipedia article
def get_federation_per_sport(sports, 
                             url="https://en.wikipedia.org/wiki/Association_of_Summer_Olympic_International_Federations"):
    soup = get_wiki_article(url)
    sport_x_fed = {}
    fps_table = soup.find_all("table", class_="wikitable")[1]
    fps_rows = fps_table.find_all("tr")
    for index, row in enumerate(fps_rows):
        #Not considering the header
        if index == 0:
            continue
        columns = row.find_all("td")
        
        sport_name = columns[0].text
        sport_fed_acronym = columns[2].text
        for sport in sports:
            #Match the first 3 letters of the sport at the start of the table's sport name.
            #Such a short prefix also matches other sports that start with the same letters.
            pattern = re.compile(re.escape(sport[0:3]), re.I)
            if re.match(pattern, sport_name):
                sport_x_fed[sport_fed_acronym] = sport
                break
                    
    return sport_x_fed

#Obtain each sport's disciplines using its federation
#just consider the disciplines that were official during the past 5 Olympic games
def get_disciplines_per_sport(soup, federation_per_sport):
    disciplines_per_sport = {
        "discipline": [],
        "sport": []
    }
    dps_table = soup.find_all("table", class_="wikitable")[0]
    dps_rows = dps_table.find_all("tr")
    
    #The table contains some rows that act as separators, no need to be considered
    separator_background = "#ddd"
    last_discipline_background = None
    current_federation = ""
    for index, row in enumerate(dps_rows):
        #We know that the first 4 rows are just headers and separators
        if index < 4:
            continue
            
        #Each discipline is identified by its background color    
        if row.get("style") is not None or row.get("bgcolor") is not None:
            background = row.get("bgcolor")
            if background is None:
                background = re.search(r'(?<=background\:)#*[A-Za-z0-9]+',row["style"])
                background = background.group(0)
            #It is a separator, then skip it
            if background.upper() == separator_background.upper():
                continue
                
            columns = row.find_all("td")
            discipline_name = columns[0].text
            
            if last_discipline_background != background:
                last_discipline_background = background
                current_federation = columns[2].text.strip()
            
            if current_federation not in federation_per_sport:
                continue
            
            #If it was in the 5 previous olympic games
            if columns[-5].text == "":
                continue
            
            sport_name = federation_per_sport[current_federation]
            disciplines_per_sport["discipline"].append(discipline_name)
            disciplines_per_sport["sport"].append(sport_name)
    
    #If the sport doesn't have disciplines, add the sport itself as its only discipline
    for fed,sport in federation_per_sport.items():
        if sport not in disciplines_per_sport["sport"]:
            disciplines_per_sport["sport"].append(sport)
            disciplines_per_sport["discipline"].append(sport)
            
    return pd.DataFrame(disciplines_per_sport)
        
        
        
    
    

In [240]:
test = get_disciplines_per_sport(soup,federation_per_sport)
test.head(100)

| | discipline | sport |
|---|---|---|
| 0 | Diving | aquatics |
| 1 | Swimming | aquatics |
| 2 | Synchronized swimming | aquatics |
| 3 | Water polo | aquatics |
| 4 | Basketball | basketball |
| 5 | Mountain biking | cycling |
| 6 | Road cycling | cycling |
| 7 | Track cycling | cycling |
| 8 | Artistic | gymnastics |
| 9 | Rhythmic | gymnastics |


In [173]:
re.search(r'(?<=background\:)#[A-Za-z0-9]+','background:#f0f8ff;').group(0)

'#f0f8ff'
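
The lookbehind tested above can be wrapped in a small helper that also covers rows without any background. This is only a sketch: `row_background` is a hypothetical name, and a plain dict stands in for a row's HTML attributes, so it is not one of the notebook's functions. Unlike a bare `re.search(...).group(0)`, it returns `None` instead of raising when the `style` attribute has no background.

```python
import re

# Same lookbehind idea as above: capture whatever follows "background:" in an inline style
BACKGROUND_PATTERN = re.compile(r'(?<=background:)\s*#?[A-Za-z0-9]+')

def row_background(attrs):
    """Return the row's background color in lower case, or None if it has none.

    attrs is a plain dict standing in for a table row's HTML attributes
    (an assumption made for this sketch).
    """
    color = attrs.get("bgcolor")
    if color is None:
        match = BACKGROUND_PATTERN.search(attrs.get("style", ""))
        if match is None:
            return None
        color = match.group(0)
    return color.strip().lower()

print(row_background({"style": "background:#f0f8ff;"}))  # #f0f8ff
print(row_background({"bgcolor": "#DDD"}))               # #ddd
print(row_background({"style": "text-align:center"}))    # None
```

Lower-casing the result means separator rows can be compared against `"#ddd"` without calling `.upper()` on both sides.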

In [140]:
soup = get_wiki_article("https://en.wikipedia.org/wiki/Olympic_sports")

At this point, we are going to get the Olympic sports' popularity and only consider categories A and B.
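
Each cell of the popularity table holds several sport names, which `get_summer_sports_popularity` separates with a regular expression. The sketch below shows how that kind of pattern behaves on its own. The sample strings are made up for illustration (they are not scraped from the article), and `split_sports` is a hypothetical helper name.

```python
import re

# A sport name is a run of words separated only by whitespace,
# so commas and other punctuation act as separators between names
SPORT_PATTERN = re.compile(r'\w+(?:\s*\w+)*')

def split_sports(cell_text):
    """Split the text of one table cell into a list of sport names."""
    return SPORT_PATTERN.findall(cell_text)

print(split_sports("athletics, aquatics, gymnastics"))
# ['athletics', 'aquatics', 'gymnastics']
print(split_sports("modern pentathlon, table tennis"))
# ['modern pentathlon', 'table tennis']
```

Because `\w` does not match a hyphen, a hyphenated name would be split into two entries, which is worth keeping in mind if the table ever contains one.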

In [141]:
summer_sports_pd = get_summer_sports_popularity(soup)
summer_sports_pd.head(10)

| | category | sport | type |
|---|---|---|---|
| 0 | A | athletics | individual |
| 1 | A | aquatics | individual |
| 2 | A | gymnastics | individual |
| 3 | B | cycling | individual |
| 4 | B | tennis | individual |
| 5 | B | basketball | team |
| 6 | B | football | team |
| 7 | B | volleyball | team |
| 8 | C | archery | individual |
| 9 | C | badminton | individual |


In [142]:
#We just consider the most popular sports, in this case categories A and B
print("All Popularities:",summer_sports_pd.category.unique())
summer_sports_pd = summer_sports_pd[summer_sports_pd["category"].str.match(r'[AB]')]
print("Considered Popularities:",summer_sports_pd.category.unique())
print("Considered Sports:",  summer_sports_pd.sport.unique())

All Popularities: ['A' 'B' 'C' 'D' 'E']
Considered Popularities: ['A' 'B']
Considered Sports: ['athletics' 'aquatics' 'gymnastics' 'cycling' 'tennis' 'basketball'
 'football' 'volleyball']


After getting the most popular sports, we need to know which disciplines each sport includes. The Wikipedia article lists the disciplines per sport, but we need each sport's federation in order to recognize which sport a discipline belongs to.
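
The idea is a two-step lookup: a discipline row names a federation, and the federation tells us the sport. A minimal sketch of that lookup follows. The acronym-to-sport pairs are taken from this notebook's outputs, while the `(discipline, federation)` rows and the function name `sport_for_each_discipline` are stand-ins we made up for illustration.

```python
# Federation acronym -> sport, using a few of the pairs that appear in this notebook
federation_to_sport = {"FINA": "aquatics", "UCI": "cycling", "FIG": "gymnastics"}

# (discipline, federation) pairs; hypothetical stand-ins for the scraped table rows
discipline_rows = [
    ("Diving", "FINA"),
    ("Track cycling", "UCI"),
    ("Rhythmic", "FIG"),
    ("Archery", "WA"),  # federation not in the mapping, so this row is skipped
]

def sport_for_each_discipline(rows, fed_to_sport):
    """Map each discipline to its sport through the federation acronym."""
    return {
        discipline: fed_to_sport[federation]
        for discipline, federation in rows
        if federation in fed_to_sport
    }

print(sport_for_each_discipline(discipline_rows, federation_to_sport))
# {'Diving': 'aquatics', 'Track cycling': 'cycling', 'Rhythmic': 'gymnastics'}
```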

In [159]:
sports = summer_sports_pd.sport.unique()
federation_per_sport = get_federation_per_sport(sports)

In [161]:
federation_per_sport

{'FINA': 'aquatics',
 'IAAF': 'athletics',
 'WBSC': 'basketball',
 'FIBA': 'basketball',
 'UCI': 'cycling',
 'FIFA': 'football',
 'FIG': 'gymnastics',
 'ITF': 'tennis',
 'FIVB': 'volleyball'}
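
Note that `WBSC` is mapped to basketball in the output above. WBSC is the baseball and softball federation, and it gets picked up because we only compare the first three letters of each sport. The sketch below reproduces the effect on made-up rows (not scraped data) and compares it with a stricter full-name match. Both function names are hypothetical, and the exact match only works when the table uses the same sport names as our list.

```python
import re

# Hypothetical (sport name, acronym) rows standing in for the scraped federation table
federation_rows = [("Baseball", "WBSC"), ("Basketball", "FIBA"), ("Cycling", "UCI")]
sports = ["basketball", "cycling"]

def match_by_prefix(rows, sports, prefix_len=3):
    """Assign a sport to every row whose name starts with the sport's first letters."""
    matches = {}
    for name, acronym in rows:
        for sport in sports:
            if re.match(re.escape(sport[:prefix_len]), name, re.I):
                matches[acronym] = sport
                break
    return matches

def match_by_full_name(rows, sports):
    """Assign a sport only when the row's name equals it, ignoring case."""
    wanted = {sport.lower() for sport in sports}
    return {acronym: name.lower() for name, acronym in rows if name.lower() in wanted}

print(match_by_prefix(federation_rows, sports))
# {'WBSC': 'basketball', 'FIBA': 'basketball', 'UCI': 'cycling'}
print(match_by_full_name(federation_rows, sports))
# {'FIBA': 'basketball', 'UCI': 'cycling'}
```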