# Introduction
This project is a way to get more familiar with two classical machine learning techniques: decision trees and random forests. The aim is to classify athletes into the sport they compete in, based on their height, weight, sex and age. The dataset is sourced from Kaggle.

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

## 1. Exploratory Data Analysis

In [3]:
raw_data = pd.read_csv("data/athlete_events.csv", index_col=0)
raw_data.head()

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271116 entries, 1 to 135571
Data columns (total 14 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Name    271116 non-null  object 
 1   Sex     271116 non-null  object 
 2   Age     261642 non-null  float64
 3   Height  210945 non-null  float64
 4   Weight  208241 non-null  float64
 5   Team    271116 non-null  object 
 6   NOC     271116 non-null  object 
 7   Games   271116 non-null  object 
 8   Year    271116 non-null  int64  
 9   Season  271116 non-null  object 
 10  City    271116 non-null  object 
 11  Sport   271116 non-null  object 
 12  Event   271116 non-null  object 
 13  Medal   39783 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 31.0+ MB


As expected, the `Medal` column has nulls, since not every athlete wins a medal. There are also nulls in the `Age`, `Height` and `Weight` columns. It would make sense that if `Height` is null, `Weight` is null too. If that is true, we can just drop these rows. If not, we could explore whether the athlete with the missing data has any other entries in the dataset where the data is not missing.

In [18]:
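# Intersect the index labels of the rows with missing Height and missing Weight.
# Caveat: the index is the athlete ID (index_col=0), which repeats across events,
# so this counts distinct IDs that have both kinds of gap, not individual rows.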
same_index_na = len(set(raw_data[raw_data["Height"].isna()].index).intersection(raw_data[raw_data["Weight"].isna()].index))

n_height_na = len(raw_data[raw_data["Height"].isna()])
n_weight_na = len(raw_data[raw_data["Weight"].isna()])

print(f"Out of {n_height_na} rows of missing height data and {n_weight_na} of missing weight data, {same_index_na} are the same row")


Out of 60171 rows of missing height data and 62875 of missing weight data, 32817 are the same row


In more than half of the rows where `Height` is missing, `Weight` is missing too. There is definitely a case for dropping the intersecting rows. As for the non-intersecting rows, while we could search the dataset for other entries for the same athletes, there is no guarantee that their weight (or even their height) stays the same across Olympic Games. Given the large dataset, I am comfortable dropping all rows with NaNs in the `Height` or `Weight` columns.
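One caveat on the count above: because the data was read with `index_col=0`, the index holds the athlete `ID`, which repeats across events, so intersecting index labels counts distinct IDs rather than rows. A minimal sketch of a row-level check using boolean masks follows. The small `toy` frame is invented for illustration and stands in for `raw_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_data: the index is an athlete ID that repeats.
toy = pd.DataFrame(
    {
        "Height": [180.0, np.nan, np.nan, np.nan, 170.0],
        "Weight": [80.0, np.nan, np.nan, 60.0, np.nan],
    },
    index=pd.Index([1, 2, 2, 3, 4], name="ID"),
)

height_na = toy["Height"].isna()
weight_na = toy["Weight"].isna()

# Row-level counts: each mask holds one boolean per row,
# so repeated index labels do not affect the result.
n_both_na = int((height_na & weight_na).sum())
n_either_na = int((height_na | weight_na).sum())

# Label-level count, as produced by intersecting index labels.
n_ids_both_na = len(set(toy.index[height_na]) & set(toy.index[weight_na]))

print(n_both_na, n_either_na, n_ids_both_na)
```

The figures reported in this notebook allow a rough cross-check by inclusion-exclusion. If 206,853 rows survive the drop (the 206,165 final rows plus the 688 with missing `Age`), then 271,116 - 206,853 = 64,263 rows lack at least one of the two values, which would put the rows lacking both at 60,171 + 62,875 - 64,263 = 58,783.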

In [22]:
data_dropped_hw_na = raw_data[(~raw_data["Height"].isna()) & (~raw_data["Weight"].isna())].reset_index(drop=True)
data_dropped_hw_na.head()

,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
3,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
4,Christine Jacoba Aaftink,F,25.0,185.0,82.0,Netherlands,NED,1992 Winter,1992,Winter,Albertville,Speed Skating,Speed Skating Women's 500 metres,


We still have a dataset with more than 200,000 entries. Let's see how many remaining rows are missing `Age` data.

In [26]:
age_na = len(data_dropped_hw_na[data_dropped_hw_na["Age"].isna()])
print(f"{age_na} rows still have missing data for Age")

688 rows still have missing data for Age


Given that such a small number of rows with missing data remain, we could drop them. However, I am interested to see if the athletes with missing age data have populated data elsewhere in the dataset.

In [44]:
# For each athlete with a missing Age, look for other rows under the same name with a known Age
for athlete in data_dropped_hw_na[data_dropped_hw_na["Age"].isna()]["Name"].unique():

    athlete_all_entries = data_dropped_hw_na[data_dropped_hw_na["Name"] == athlete]

    if len(athlete_all_entries[~athlete_all_entries["Age"].isna()]) > 0:
        print(athlete_all_entries[["Name","Age","Games","Team"]])

                        Name   Age        Games    Team
39868  Dimitrios Deligiannis  22.0  1984 Summer  Greece
39869  Dimitrios Deligiannis  26.0  1988 Summer  Greece
39870  Dimitrios Deligiannis  30.0  1992 Summer  Greece
39871  Dimitrios Deligiannis   NaN  1896 Summer  Greece
                 Name   Age        Games        Team
41164  Mamadou Diallo   NaN  1984 Summer  Mauritania
41165  Mamadou Diallo  38.0  1980 Summer      Guinea
41166  Mamadou Diallo  29.0  1984 Summer     Senegal
               Name   Age        Games         Team
90082  Kim Yong-Bae   NaN  1968 Summer  South Korea
90083  Kim Yong-Bae  22.0  1996 Summer  South Korea
90084  Kim Yong-Bae  26.0  2000 Summer  South Korea
90085  Kim Yong-Bae  30.0  2004 Summer  South Korea
90086  Kim Yong-Bae  34.0  2008 Summer  South Korea


For all but 3 athletes, there are no other instances of their age in the dataset (unless they changed their name between Games). Ages are recorded elsewhere for 3 names, but in each case the athlete with populated data is clearly a different person from the athlete with missing data. In the case of Mamadou Diallo, the athlete with populated data competed for Senegal at the 1984 Games, whereas the athlete with missing data competed for Mauritania at those same Games. Similarly, the Dimitrios Deligiannis with a missing age competed in 1896, while the recorded ages belong to entries from 1984 to 1992. All that work to save no rows of data, but interesting to see nonetheless. Let's drop all rows where `Age` is NaN.
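Since different athletes can share a name, matching on `Name` alone can pair up unrelated people. Below is a minimal sketch of the same lookup keyed on the athlete `ID` instead. It assumes the `ID` is kept as a column (for example by calling `reset_index()` rather than `reset_index(drop=True)`), which is not what the notebook does. The `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: IDs 10 and 11 are different people who share a name.
toy = pd.DataFrame(
    {
        "ID": [10, 10, 11, 12, 12],
        "Name": ["A Example", "A Example", "A Example", "B Example", "B Example"],
        "Age": [22.0, 26.0, np.nan, np.nan, 30.0],
    }
)

# Per ID: is any Age missing, and are all Ages missing?
age_missing = toy["Age"].isna().groupby(toy["ID"]).agg(["any", "all"])

# An ID is recoverable if it has a missing Age and at least one known Age.
mask = age_missing["any"] & ~age_missing["all"]
recoverable_ids = mask[mask].index.tolist()

print(recoverable_ids)
```

In this toy data a name-based lookup would flag "A Example" as having a known age elsewhere, while the ID-based check flags only ID 12, where the known and missing ages belong to the same person.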

In [47]:
data_no_na = data_dropped_hw_na[~data_dropped_hw_na["Age"].isna()]
data_no_na.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 206165 entries, 0 to 206852
Data columns (total 14 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Name    206165 non-null  object 
 1   Sex     206165 non-null  object 
 2   Age     206165 non-null  float64
 3   Height  206165 non-null  float64
 4   Weight  206165 non-null  float64
 5   Team    206165 non-null  object 
 6   NOC     206165 non-null  object 
 7   Games   206165 non-null  object 
 8   Year    206165 non-null  int64  
 9   Season  206165 non-null  object 
 10  City    206165 non-null  object 
 11  Sport   206165 non-null  object 
 12  Event   206165 non-null  object 
 13  Medal   30181 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 23.6+ MB


The remaining dataset has 206,165 entries. That is enough to train a Random Forest on. Many rows in this dataset, however, belong to the same athlete. As we are trying to classify the sport an athlete plays based on some metrics, it is acceptable that not all rows in the dataset describe unique athletes. An athlete may play more than one sport, and changes in weight and height over time for the same athlete will add some variance to the dataset.
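To put numbers on the repetition described above, one could count how many rows belong to athletes who appear more than once, and how many athletes appear in more than one sport. The sketch below is keyed on `Name`, because that is the identifier available in `data_no_na`, so the name-collision caveat seen earlier applies. The `toy` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data_no_na with only the columns needed here.
toy = pd.DataFrame(
    {
        "Name": ["A Example", "A Example", "A Example", "B Example", "C Example"],
        "Sport": ["Swimming", "Swimming", "Triathlon", "Judo", "Rowing"],
    }
)

# Rows whose athlete name occurs more than once in the frame.
n_repeat_rows = int(toy["Name"].duplicated(keep=False).sum())

# Number of distinct sports per athlete name.
sports_per_athlete = toy.groupby("Name")["Sport"].nunique()
n_multi_sport = int((sports_per_athlete > 1).sum())

print(n_repeat_rows, n_multi_sport)
```

On the real data the same two counts would show how much of the dataset is repeated athletes and how often one athlete maps to several target classes.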