# Athlete Medal Rating System

# Purpose
Create two ranking fields for the athletes in every Olympic Games: one for an athlete's rank within a given games, and another for the athlete's overall rank across all Olympic Games.

# Datasets
Uses: <br>
**Games-800.csv** from 800-Olympic_NOC_Rankings <br>
Creates: <br>
**Games-TotalRank-900.csv**: CSV of the medal dataframe from 800-Olympic_NOC_Rankings with each athlete's total rank across all of the Olympic Games. <br>
**Games-900.csv**: CSV of the medal dataframe from 800-Olympic_NOC_Rankings with an added field for each athlete's rank within each Olympics. <br>
**Games-AthRank-900.csv**: a copy of Games-900.csv.

In [1]:
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
%matplotlib inline
from bs4 import BeautifulSoup
import webbrowser
import urllib.request
from lxml import html
import zipfile
import re
import string
import sys, os
from IPython.display import Image

In [2]:
# Ensure the file exists
if not os.path.exists(  r"..\..\data\prep\Games\Games-800.csv" ):
    print("Missing dataset file")

In [3]:
# read the medal csv into a dataframe
df = pd.read_csv(  r"..\..\data\prep\Games\Games-800.csv", encoding = "ISO-8859-1")

In [4]:
# this is the medal dataframe 
df.head(3)

Unnamed: 0,Year,Host_City,Host_Country,Total_Males,Total_Females,Total_Athletes,Summer,Winter,Discipline,Sport,...,Gold,Silver,Bronze,Total_Medals,NOC_Gold,NOC_Silver,NOC_Bronze,NOC_Total_Medals,NOC_Rating,NOC_Rank
0,1960,Rome,ITA,4727,611,5338,True,False,Sailing,Sailing,...,0,1,0,1,0,1,1,2,3,27
1,1960,Rome,ITA,4727,611,5338,True,False,Boxing,Boxing,...,0,0,1,1,0,1,1,2,3,27
2,1960,Rome,ITA,4727,611,5338,True,False,Athletics,Athletics,...,0,1,0,1,9,12,8,29,59,5


# Rating Field 
Now we can add the rating field. An athlete's rating uses the same weighted system we created for the countries' Olympic medals: 3 points for gold, 2 for silver, and 1 for bronze. The rating field is simply the athlete's medals summed by weight.
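
As a quick illustration of the weighting, the sketch below computes the rating for a few made-up medal counts. The function name `athlete_rating` and the sample rows are hypothetical and only serve to show the arithmetic.

```python
import pandas as pd

def athlete_rating(gold, silver, bronze):
    """Weighted medal score: 3 per gold, 2 per silver, 1 per bronze."""
    return gold * 3 + silver * 2 + bronze

# Made-up medal counts, not taken from the dataset
sample = pd.DataFrame({
    "Gold":   [1, 0, 2],
    "Silver": [0, 1, 1],
    "Bronze": [0, 0, 3],
})
sample["Ath_Rating"] = athlete_rating(sample["Gold"], sample["Silver"], sample["Bronze"])
print(sample)
```

The same formula reproduces the ratings shown in the outputs of this notebook, for example 19 gold, 1 silver and 2 bronze give 19 * 3 + 1 * 2 + 2 = 61.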

In [5]:
# Work on a copy so the original medal dataframe is left unchanged
dfARate = df.copy()
# Weighted medal score: 3 per gold, 2 per silver, 1 per bronze
dfARate['Ath_Rating'] = dfARate['Gold'] * 3 + dfARate['Silver'] * 2 + dfARate['Bronze']

In [6]:
dfARate.head(3)

Unnamed: 0,Year,Host_City,Host_Country,Total_Males,Total_Females,Total_Athletes,Summer,Winter,Discipline,Sport,...,Silver,Bronze,Total_Medals,NOC_Gold,NOC_Silver,NOC_Bronze,NOC_Total_Medals,NOC_Rating,NOC_Rank,Ath_Rating
0,1960,Rome,ITA,4727,611,5338,True,False,Sailing,Sailing,...,1,0,1,0,1,1,2,3,27,2
1,1960,Rome,ITA,4727,611,5338,True,False,Boxing,Boxing,...,0,1,1,0,1,1,2,3,27,1
2,1960,Rome,ITA,4727,611,5338,True,False,Athletics,Athletics,...,1,0,1,9,12,8,29,59,5,2


# Creating a ranking field 
Now that we have the rating system in place, we want to give a rank to each athlete in each Olympics. First we sort the ratings within each Olympics, then we assign a rank based on this sort. One thing we had to decide is whether to rank females and males separately when looking at the athletes' ratings. For now we rank them solely on the rating, highest to lowest, within each specific Olympics regardless of gender, but this can easily be changed down the line.
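
To make the gender question concrete, here is a minimal sketch on toy data of how the grouping keys decide who is ranked against whom. The column name `Gender` is an assumption for illustration (the source does not show what the gender column is called), and `groupby(...).rank(...)` is used here only to keep the example short; it is not the approach used in this notebook.

```python
import pandas as pd

# Toy data; "Gender" is an assumed column name, not confirmed by the source
toy = pd.DataFrame({
    "Year":       [2000, 2000, 2000, 2000],
    "Host_City":  ["Sydney"] * 4,
    "Gender":     ["F", "M", "F", "M"],
    "Ath_Rating": [5, 6, 3, 2],
})

# Rank everyone in the same games together, regardless of gender
toy["Rank_All"] = (
    toy.groupby(["Year", "Host_City"])["Ath_Rating"]
       .rank(method="first", ascending=False)
       .astype(int)
)

# Rank within each gender by adding the column to the grouping keys
toy["Rank_By_Gender"] = (
    toy.groupby(["Year", "Host_City", "Gender"])["Ath_Rating"]
       .rank(method="first", ascending=False)
       .astype(int)
)
print(toy)
```

With `method="first"`, tied ratings receive consecutive ranks in row order rather than a shared rank.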

In [7]:
# Sort by the year of the games, host city, host country and then each athlete's rating (all descending)
dfARate = dfARate.sort_values(by=['Year', 'Host_City', 'Host_Country', 'Ath_Rating'], ascending=False)
# Replace the old index with a fresh 0..n-1 index
dfARate = dfARate.reset_index(drop=True)

# The rank field population 
So far the rating dataframe is sorted by the year, host city, host country and, importantly, the athlete rating. To populate the rank field correctly we have to consider that a winter and a summer games can take place in the same year. With this in mind, every time we reach a new games while iterating over the rating dataframe we must reset the rank to 1.

In [8]:
# For loop for populating the rank field 
dfARate['Ath_Rank'] = None 

# The lastyear and lastHost variables are needed so we can track when the games change in the iteration
lastyear = dfARate['Year'].iloc[0]
lastHost = dfARate['Host_City'].iloc[0]
rank = 1


for x, row in dfARate.iterrows():
    
    # current year and host to compare with the last years 
    curryear = dfARate['Year'].iloc[x]
    currHost = dfARate['Host_City'].iloc[x]
    
    # as long as the current host and year are the same we're in the same games, so the running rank is assigned
    if(curryear == lastyear and currHost == lastHost):
        dfARate.loc[x, 'Ath_Rank'] = rank
    
    # if the games change then we reset the rank variable
    else:
        rank = 1
        dfARate.loc[x, 'Ath_Rank'] = rank
    
    # give the last year and host variables their new values
    lastyear = curryear
    lastHost = currHost
    # increment rank 
    rank = rank + 1
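
The same counter-with-reset behaviour can also be expressed without an explicit loop. The following is a minimal sketch on toy data, assuming the frame is already sorted by games and then by rating in descending order; it is an alternative for comparison, not the approach used in this notebook.

```python
import pandas as pd

# Toy frame already sorted the same way as dfARate (games first, rating descending)
toy = pd.DataFrame({
    "Year":       [1960, 1960, 1960, 1960, 1960],
    "Host_City":  ["Squaw Valley", "Squaw Valley", "Rome", "Rome", "Rome"],
    "Ath_Rating": [6, 5, 9, 4, 4],
})

# cumcount() numbers the rows of each group from 0 in their current order,
# so adding 1 gives a rank that restarts at 1 for every games
toy["Ath_Rank"] = toy.groupby(["Year", "Host_City"], sort=False).cumcount() + 1
print(toy)
```

Grouping on both `Year` and `Host_City` is what keeps the winter and summer games of the same year apart, just as the loop compares both values.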

In [9]:
# Look at an example of the case we spoke about above and check that our loop worked correctly
dfARate[dfARate['Year'] == 1960].head(5)

Unnamed: 0,Year,Host_City,Host_Country,Total_Males,Total_Females,Total_Athletes,Summer,Winter,Discipline,Sport,...,Bronze,Total_Medals,NOC_Gold,NOC_Silver,NOC_Bronze,NOC_Total_Medals,NOC_Rating,NOC_Rank,Ath_Rating,Ath_Rank
11581,1960,Squaw Valley,USA,521,144,665,False,True,Cross Country Skiing,Skiing,...,1,3,3,3,3,9,18,3,6,1
11582,1960,Squaw Valley,USA,521,144,665,False,True,Speed skating,Skating,...,0,2,5,7,10,22,39,1,6,2
11583,1960,Squaw Valley,USA,521,144,665,False,True,Speed skating,Skating,...,0,2,4,3,1,8,19,2,5,3
11584,1960,Squaw Valley,USA,521,144,665,False,True,Cross Country Skiing,Skiing,...,0,2,3,4,0,7,17,4,5,4
11585,1960,Squaw Valley,USA,521,144,665,False,True,Speed skating,Skating,...,0,2,3,4,0,7,17,4,5,5
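
Eyeballing the first few rows only covers one games. A programmatic check could look like the sketch below, which tests that every games has ranks running from 1 to n with no gaps or repeats. The helper name `ranks_are_contiguous` is hypothetical, and the toy frame merely stands in for `dfARate`.

```python
import pandas as pd

def ranks_are_contiguous(frame, keys=("Year", "Host_City"), rank_col="Ath_Rank"):
    """Return True if every group has ranks exactly 1..n with no gaps or repeats."""
    for _, group in frame.groupby(list(keys)):
        expected = list(range(1, len(group) + 1))
        if sorted(group[rank_col].tolist()) != expected:
            return False
    return True

# Toy data standing in for dfARate
toy = pd.DataFrame({
    "Year":      [1960, 1960, 1960, 1960, 1960],
    "Host_City": ["Squaw Valley", "Squaw Valley", "Rome", "Rome", "Rome"],
    "Ath_Rank":  [1, 2, 1, 2, 3],
})
print(ranks_are_contiguous(toy))
```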


In [11]:
dfARate.to_csv( r"..\..\data\prep\Games\Games-AthRank-900.csv", index=False)

# Overall across all games 
Now we can get the rating and rank of each athlete across all the Olympic Games from 1960 to 2018, keeping the summer and winter games separate.
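
As a small illustration of this aggregation step, the sketch below sums medal counts and ratings per athlete and season on made-up rows. The athlete names and values are invented for the example, and only a subset of the real grouping columns is used.

```python
import pandas as pd

# Made-up rows: one athlete medalling at two summer games, another at one winter games
toy = pd.DataFrame({
    "Ath_Name":   ["DOE, Jane", "DOE, Jane", "ROE, Sam"],
    "Summer":     [True, True, False],
    "Winter":     [False, False, True],
    "Gold":       [1, 2, 0],
    "Silver":     [0, 1, 1],
    "Bronze":     [1, 0, 0],
    "Ath_Rating": [4, 8, 2],
})

# One row per athlete and season, with medals and ratings added up
totals = (
    toy.groupby(["Ath_Name", "Summer", "Winter"])[["Gold", "Silver", "Bronze", "Ath_Rating"]]
       .sum()
       .reset_index()
)
print(totals)
```

Because `Summer` and `Winter` are part of the grouping keys, an athlete who competed in both seasons would get one row per season rather than a single combined total.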

In [12]:
dfATRate = dfARate.groupby(['Ath_Name', 'NOC', 'Sport', 'Summer', 'Winter'])[['Gold', 'Silver', 'Bronze', 'Total_Medals', 'Ath_Rating']].sum().reset_index()

In [13]:
dfATRate = dfATRate.sort_values(by=['Summer', 'Winter', 'Ath_Rating'], ascending=False)
# Replace the old index with a fresh 0..n-1 index
dfATRate = dfATRate.reset_index(drop=True)

In [14]:
dfATRate.head(5)

Unnamed: 0,Ath_Name,NOC,Sport,Summer,Winter,Gold,Silver,Bronze,Total_Medals,Ath_Rating
0,"PHELPS, Michael",USA,Aquatics,True,False,19,1,2,22,61
1,"KATO, Sawao",JPN,Gymnastics,True,False,8,3,1,12,31
2,"CÁSLAVSKÁ, Vera",TCH,Gymnastics,True,False,7,4,0,11,29
3,"LATYNINA, Larisa",URS,Gymnastics,True,False,5,4,3,12,26
4,"WERTH, Isabell",GER,Equestrian,True,False,6,4,0,10,26


# Rank field for total ratings 
Now we'll create a rank field for the new dataframe, which contains each athlete's total rating across all Olympic Games. The for loop below is the same as the one above, except that the rank only restarts when we move from the summer games to the winter games.

In [15]:
# For loop for populating the rank field 
dfATRate['Ath_Rank'] = None 

# Ranks will start at one 
rankS = 1
rankW = 1

for x, row in dfATRate.iterrows():
        
    # keeping track of the olympic games type 
    gameType = dfATRate['Summer'].iloc[x]
    
    # while the iteration is within the summer games the summer rank is assigned
    if(gameType):
        dfATRate.loc[x, 'Ath_Rank'] = rankS
        rankS = rankS + 1
    
    # once the rows switch to the winter games the winter rank variable is used
    else:
        dfATRate.loc[x, 'Ath_Rank'] = rankW
        rankW = rankW + 1

In [16]:
dfATRate.to_csv( r"..\..\data\prep\Games\Games-TotalRank-900.csv", index=False)

In [17]:
dfARate.to_csv( r"..\..\data\prep\Games\Games-900.csv", index=False)