PREDICTING AGE OF ATHLETES BASED ON OLYMPICS DATA
**INTRODUCTION**

Every four years, nations send their best athletes to the Summer Olympics to compete at the highest level of their respective sports. Staying at the top for many years, however, is no easy feat. We therefore try to predict an athlete's age from several factors: height, weight, number of Summer Olympics appearances, and number of medals won. Along the way, we examine whether each of these factors is related to age. The dataset covers 120 years of Olympic Games, from 1896 to 2016. It records each athlete's name along with attributes such as height, weight, team, medals won, and the Games in which they participated. For our purposes, we use only the Summer Olympics data and the predictors listed above.

**EXPECTED OUTCOMES AND SIGNIFICANCE**

Through our analysis of the athlete dataset, we expect to predict athletes' ages from their height, weight, number of Olympic appearances, and number of medals won. The results open up further questions and debates, particularly around Olympic rules and regulations, and draw attention to how widely ages vary among Olympians regardless of standings and medals. Questions that follow from our results include: Is age an important factor in winning medals? Should the Olympics have age restrictions? Does an older athlete have an advantage over a first-time Olympian? Does greater age correlate with more medals won, or vice versa?

**METHODS**

We predict the age of an athlete from four predictors: height, weight, number of Olympic appearances, and number of medals won. For our exploratory plot, we show the average number of medals won per Olympics against the athlete's age. We use only data from the Summer Olympics. Our visualizations are line plots and scatter plots.
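
The modelling step itself does not appear in this part of the notebook. As a minimal sketch, assuming ordinary linear regression is an acceptable stand-in for whichever model is eventually chosen, the setup could look like the following. The tibble `toy_athletes`, its simulated values, and the column names `appearances` and `medals` are invented for illustration and are not taken from the dataset.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical toy data: one row per athlete (values are simulated, not real)
toy_athletes <- tibble(
  Height      = rnorm(200, mean = 175, sd = 10),
  Weight      = rnorm(200, mean = 70, sd = 12),
  appearances = sample(1:4, 200, replace = TRUE),
  medals      = sample(0:3, 200, replace = TRUE)
) |>
  mutate(Age = 18 + 3 * appearances + rnorm(200, sd = 2))

toy_split <- initial_split(toy_athletes, prop = 0.75)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Recipe: Age is the numeric response, the other four columns are predictors
age_recipe <- recipe(Age ~ Height + Weight + appearances + medals, data = toy_train)

# Assumption: linear regression as a placeholder model
age_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

age_fit <- workflow() |>
  add_recipe(age_recipe) |>
  add_model(age_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and compute RMSE, R-squared and MAE
age_preds <- predict(age_fit, toy_test) |>
  bind_cols(toy_test)

age_metrics <- metrics(age_preds, truth = Age, estimate = .pred)
age_metrics
```

Because age is a number rather than a category, this is framed as a regression problem, and the error metrics are reported in years.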


In [1]:
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [17]:
athlete <- read_csv("https://raw.githubusercontent.com/Mahekbhardwaj/DSCI-100-group38/main/athelete_data.csv")
# keep only the first 1499 rows of the file
athlete <- athlete |>
    slice(1:1499)
#athlete


[1mRows: [22m[34m1800[39m [1mColumns: [22m[34m14[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
[32mdbl[39m  (4): Age, Height, Weight, Year

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [13]:
head(athlete)
#we see that the data is already in a tidy format

Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",


In [28]:
# we only need data from the most recent Summer Olympics in the dataset
# finding the most recent year
recent_year <- athlete |>
    pull(Year) |>
    max()
recent_year

In [29]:
# the most recent year is 2016, so we filter for Year == recent_year and Season == "Summer"
athlete <- athlete |>
    filter(Year == recent_year, Season == "Summer")
head(athlete)

Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
Andreea Aanei,F,22,170,125,Romania,ROU,2016 Summer,2016,Summer,Rio de Janeiro,Weightlifting,Weightlifting Women's Super-Heavyweight,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Individual All-Around,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Floor Exercise,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Parallel Bars,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Horizontal Bar,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Rings,


In [31]:
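#splitting into training and testing data
#note: the mutate() result below is not assigned, so it only prints a copy with Age as a factor;
#`athlete` itself keeps Age as a numeric column, which is what the split uses
mutate(athlete, Age = as_factor(Age))
athlete_split <- initial_split(athlete, prop = 0.75, strata = Age)
athlete_train <- training(athlete_split)
athlete_test <- testing(athlete_split)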

Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<chr>,<chr>,<fct>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
Andreea Aanei,F,22,170,125,Romania,ROU,2016 Summer,2016,Summer,Rio de Janeiro,Weightlifting,Weightlifting Women's Super-Heavyweight,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Individual All-Around,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Floor Exercise,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Parallel Bars,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Horizontal Bar,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Rings,
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Pommelled Horse,
Antonio Abadia Beci,M,26,170,65,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Athletics,"Athletics Men's 5,000 metres",
Giovanni Abagnale,M,21,198,90,Italy,ITA,2016 Summer,2016,Summer,Rio de Janeiro,Rowing,Rowing Men's Coxless Pairs,Bronze
Patimat Abakarova,F,21,165,49,Azerbaijan,AZE,2016 Summer,2016,Summer,Rio de Janeiro,Taekwondo,Taekwondo Women's Flyweight,Bronze
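
The split above is random, so the training and testing rows change on every run unless a seed is fixed first. A minimal sketch of a reproducible stratified split follows. The seed value and the simulated tibble `toy` are assumptions made for illustration, not part of the original notebook.

```r
library(rsample)
library(tibble)

set.seed(2023)  # hypothetical seed value; any fixed integer works

# Simulated stand-in for the filtered athlete table
toy <- tibble(
  Age    = sample(18:40, 400, replace = TRUE),
  Height = rnorm(400, mean = 175, sd = 10)
)

# With a numeric strata column, initial_split() bins the values into
# quantile groups and samples within each group
toy_split <- initial_split(toy, prop = 0.75, strata = Age)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

c(train = nrow(toy_train), test = nrow(toy_test))
```

Keeping `Age` numeric here matches the regression framing: stratifying on it only balances the age distribution between the two sets.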


In [32]:
#exploratory analysis
#counting the rows each athlete has in the training data
#note: n() counts event entries, and the data is already filtered to the 2016 Games,
#so this is not yet the number of Olympics attended
athlete_number <- athlete_train |>
    group_by(Name) |>
    summarize(n = n()) |>
    rename("olympics attended" = "n")

head(athlete_number)

Name,olympics attended
<chr>,<int>
Abdulqdir Abdullayev,1
Adlan Aliyevich Abdurashidov,1
Ahmad Abughaush,1
Ahmed Abdelaal,1
Ahmed Abdelrahman,1
Alaaeldin Ahmad El-Sayyid Abouelkassem,1
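
Counting Olympic appearances needs rows from more than one Games, which the 2016-only table no longer has. As a hedged sketch, assuming the count is taken from data that still spans several Games, distinct values of `Games` per athlete could be counted with `n_distinct()`. The tibble `toy_events` and the athlete names in it are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical rows: one row per athlete per event, across several Games
toy_events <- tibble(
  Name  = c("Athlete A", "Athlete A", "Athlete A", "Athlete B"),
  Games = c("2012 Summer", "2016 Summer", "2016 Summer", "2016 Summer"),
  Event = c("100 m", "100 m", "200 m", "Judo")
)

games_attended <- toy_events |>
  group_by(Name) |>
  summarize(
    rows_in_data   = n(),               # event entries, which is what n() measures
    games_attended = n_distinct(Games)  # distinct Games the athlete appears in
  )
games_attended
```

In this toy table, Athlete A has three event rows but only two distinct Games, which is the difference between the two counts.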


In [33]:
#number of medals won per athlete
#recoding the Medal column: any medal (non-NA) becomes 1, NA becomes 0
#Medal is a character column at this point, so the values are stored as "1" and "0"
athlete_train$Medal <- ifelse(!is.na(athlete_train$Medal), 1, athlete_train$Medal)
athlete_train[["Medal"]][is.na(athlete_train[["Medal"]])] <- 0
head(athlete_train)

Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Individual All-Around,0
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Floor Exercise,0
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Parallel Bars,0
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Rings,0
Nstor Abad Sanjun,M,23,167,64,Spain,ESP,2016 Summer,2016,Summer,Rio de Janeiro,Gymnastics,Gymnastics Men's Pommelled Horse,0
Ilyas Abbadi,M,23,175,75,Algeria,ALG,2016 Summer,2016,Summer,Rio de Janeiro,Boxing,Boxing Men's Middleweight,0
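
The same recoding can be written in one step that yields an integer column directly. This is a sketch of an alternative, not the notebook's own code, and `toy_medals` is a made-up table used only to show the behaviour.

```r
library(dplyr)
library(tibble)

# Hypothetical rows: Medal is a character column with NA for "no medal"
toy_medals <- tibble(
  Name  = c("Athlete A", "Athlete A", "Athlete B"),
  Medal = c("Gold", NA, NA)
)

# 1L if any medal was won in that event, 0L otherwise
recoded <- toy_medals |>
  mutate(Medal = if_else(is.na(Medal), 0L, 1L))
recoded
```

Using `if_else()` with integer outputs avoids the intermediate character values and the later type conversion.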


In [None]:
#exploratory analysis- creating a table to find the number of medals won by each athlete
#converting medal col to int type
athlete_train$Medal <- as.integer(athlete_train$Medal)
#summing the 0/1 Medal indicator per athlete; counting rows with n() would also count events without a medal
athlete_medal <- athlete_train |>
    group_by(Name, Age) |>
    summarize(medalcount = sum(Medal), .groups = "drop")
athlete_medal

In [None]:
#exploratory analysis- visualizing number of medals won against age to see the trend in their relationship.
medal_age_plot <- athlete_medal |>
    ggplot(aes(y = medalcount, x = Age)) +
    geom_point() +
    geom_line() +
    labs(y = "Number of medals won", x = "Age of athletes")
medal_age_plot
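
The Methods section describes plotting the average number of medals against age, which needs one value per age rather than one point per athlete. A minimal sketch of that aggregation follows, assuming a per-athlete table with `Age` and `medalcount` columns. The tibble `toy_counts` and the name `avg_medals` are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical per-athlete table: one row per athlete
toy_counts <- tibble(
  Name       = c("A", "B", "C", "D"),
  Age        = c(21, 21, 25, 25),
  medalcount = c(0, 2, 1, 1)
)

# One row per age, holding the mean medal count of athletes of that age
avg_by_age <- toy_counts |>
  group_by(Age) |>
  summarize(avg_medals = mean(medalcount))

avg_plot <- ggplot(avg_by_age, aes(x = Age, y = avg_medals)) +
  geom_point() +
  geom_line() +
  labs(x = "Age of athletes", y = "Average medals won")
avg_plot
```

With a single value per age, the line drawn by `geom_line()` traces the trend across ages instead of zigzagging between athletes who share the same age.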