# What do I want to end up with?
## Tenniest apparatus (per year + aggregated)
## Tenniest teams (per year + aggregated)
## Top 10 (20?) goats of all time (by average score) (by apparatus?)
## Bubble maps x axis year, y axis team, size = no. 10s (colour/pie apparatus if poss?)
## Avg score over time (colour by team)

# 1 Set up the environment

In [None]:
!pip install -r ../requirements.txt

In [2]:
import os
import json
import requests
import sqlite3
from tqdm.notebook import tqdm, trange
tqdm.pandas()
import numpy as np
import pandas as pd 
from sqlalchemy import create_engine
from lets_plot import * # This imports lets-plot's ggplot2-style API (ggplot, aes, geoms, scales, themes)
LetsPlot.setup_html()
import plotly.express as px

%load_ext sql
%config SqlMagic.autocommit=True

from pprint import pprint

In [None]:
# Not sure I need this?
# from lets_plot.mapping import as_discrete


## 1.1 Connect to the gymternet database

In [3]:
%sql sqlite:///../data/clean/gymternet.db --alias gymternet 
engine = create_engine('sqlite:///../data/clean/gymternet.db')

# 2 Exploratory data analysis

## 2.1 The tenniest apparatus
Which apparatus (vault, uneven bars, balance beam or floor exercise) attracts the most 10s from the judges? Has that changed over time?

Intuitively, one would assume that vault would attract the fewest deductions; gymnasts are only performing one skill, so there are fewer opportunities to make mistakes.

However, my anecdotal observation as a watcher of college gymnastics is that the judges in this competition are fairly lenient; hesitancy on beam or short handstands on bars might not incur the deduction they would in other leagues. They are quite strict on landings, though: if a gymnast doesn't perfectly stick their landing, they will incur a deduction. Given how difficult vault landings are, does this cancel out the advantage of having to perform fewer skills?

### 2.1.1 Retrieve the data from the database

In [6]:
%%sql gymternet

-- Count the 10s on each apparatus per season (LEFT JOIN to meets to get the year)
SELECT 
    SUM(r.vt_score = 10.0) AS 'Vault',
    SUM(r.ub_score = 10.0) AS 'Uneven Bars',
    SUM(r.bb_score = 10.0) AS 'Balance Beam',
    SUM(r.fx_score = 10.0) AS 'Floor Exercise',
    m.year AS 'Season'
FROM gymnast_results AS r
LEFT JOIN meets AS m
ON m.meet_id = r.meet_id
GROUP BY m.year

Vault,Uneven Bars,Balance Beam,Floor Exercise,Season
34,32,2,7,2015
12,8,16,28,2016
22,26,35,16,2017
10,51,53,24,2018
31,38,8,56,2019
28,10,32,4,2020
50,44,20,21,2021
59,46,38,77,2022
88,81,126,64,2023
45,56,69,103,2024
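
The query above relies on SQLite evaluating a comparison such as `r.vt_score = 10.0` to 1 or 0, so that `SUM(...)` ends up counting the matching rows. A minimal sketch of that idiom against a throwaway in-memory database; the table and column names mirror the query, but the rows are invented purely for illustration:

```python
import sqlite3

# Throwaway in-memory database; the rows below are invented for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meets (meet_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE gymnast_results (meet_id INTEGER, vt_score REAL, fx_score REAL);
INSERT INTO meets VALUES (1, 2015), (2, 2016);
INSERT INTO gymnast_results VALUES
    (1, 10.0, 9.9),
    (1, 10.0, 10.0),
    (2, 9.85, 10.0),
    (2, NULL, 9.8);
""")

# A comparison yields 1 (true), 0 (false) or NULL (missing score);
# SUM skips the NULLs, so each SUM counts the 10s in its column
rows = conn.execute("""
SELECT
    m.year,
    SUM(r.vt_score = 10.0) AS vault_tens,
    SUM(r.fx_score = 10.0) AS floor_tens
FROM gymnast_results AS r
LEFT JOIN meets AS m ON m.meet_id = r.meet_id
GROUP BY m.year
ORDER BY m.year
""").fetchall()
conn.close()

print(rows)  # [(2015, 2, 1), (2016, 0, 1)]
```

The last invented row has no vault score; the comparison against NULL is itself NULL, which `SUM` ignores rather than counting as a 10.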


### 2.1.2 Import the data into a dataframe

In [7]:
# Export the above query to a new df
tenniest_apparatus_query = """
SELECT 
    SUM(r.vt_score = 10.0) AS 'Vault',
    SUM(r.ub_score = 10.0) AS 'Uneven Bars',
    SUM(r.bb_score = 10.0) AS 'Balance Beam',
    SUM(r.fx_score = 10.0) AS 'Floor Exercise',
    m.year AS 'Season'
FROM gymnast_results AS r
LEFT JOIN meets AS m
ON m.meet_id = r.meet_id
GROUP BY m.year;
"""

# Execute the query and store the result in a DataFrame
tenniest_apparatus_df = pd.read_sql_query(tenniest_apparatus_query, engine)

# Preview the df
tenniest_apparatus_df

Unnamed: 0,Vault,Uneven Bars,Balance Beam,Floor Exercise,Season
0,34,32,2,7,2015
1,12,8,16,28,2016
2,22,26,35,16,2017
3,10,51,53,24,2018
4,31,38,8,56,2019
5,28,10,32,4,2020
6,50,44,20,21,2021
7,59,46,38,77,2022
8,88,81,126,64,2023
9,45,56,69,103,2024


### 2.1.3 Prepare the data for plotting

We want this table in a long layout, with one row per apparatus per season, so that it's easier for Plotly to read.

The new layout should look like:
| **Apparatus**    | **Number of 10s** | **Season** |
|------------------|-------------------|------------|
| 'Vault'          | 34                | 2015       |
| 'Uneven Bars'    | 32                | 2015       |
| 'Balance Beam'   | 2                 | 2015       |
| 'Floor Exercise' | 7                 | 2015       |
| 'Total'          | 75                | 2015       |

etc.


In [8]:
# Melt the DataFrame
tenniest_apparatus_per_year = pd.melt(tenniest_apparatus_df, id_vars=['Season'], var_name='Apparatus', value_name='No. of Tens')

# Preview the melted DataFrame
tenniest_apparatus_per_year.head()

Unnamed: 0,Season,Apparatus,No. of Tens
0,2015,Vault,34
1,2016,Vault,12
2,2017,Vault,22
3,2018,Vault,10
4,2019,Vault,31
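
The target layout in 2.1.3 includes a 'Total' row for each season, which `pd.melt` on its own does not produce. A minimal sketch of one way to get it, assuming the total is added as an extra column before melting; only the first two seasons are reproduced here, and the variable names are placeholders:

```python
import pandas as pd

# First two seasons of the wide table returned by the query
wide_df = pd.DataFrame({
    "Vault": [34, 12],
    "Uneven Bars": [32, 8],
    "Balance Beam": [2, 16],
    "Floor Exercise": [7, 28],
    "Season": [2015, 2016],
})

apparatus_cols = ["Vault", "Uneven Bars", "Balance Beam", "Floor Exercise"]

# Add the per-season total as a column so that melt turns it into 'Total' rows
with_total = wide_df.assign(Total=wide_df[apparatus_cols].sum(axis=1))

long_df = pd.melt(
    with_total,
    id_vars=["Season"],
    var_name="Apparatus",
    value_name="No. of Tens",
)

# Stable sort by season keeps the apparatus order within each season
long_df = long_df.sort_values("Season", kind="stable").reset_index(drop=True)
print(long_df)
```

If the 'Total' rows are kept, they would need filtering out again before drawing a stacked bar chart, otherwise 'Total' shows up as a fifth bar.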


### 2.1.4 Tenniest ever - delete this section

In [12]:
# Summarise the DataFrame with the sum of the 10s over the years per apparatus
tenniest_apparatus_10y = tenniest_apparatus_per_year.groupby('Apparatus')['No. of Tens'].sum().reset_index()

tenniest_apparatus_10y

Unnamed: 0,Apparatus,No. of Tens
0,Balance Beam,399
1,Floor Exercise,400
2,Uneven Bars,392
3,Vault,379


### 2.1.5 Prepare the plot(s)

I want to explore how many 10s have been awarded across each of the apparatus in total across the last 10 years.

To visualise this, I want a stacked bar chart, with apparatus across the x-axis, number of 10s on the y-axis and for each bar to be segmented by year.

I also want to be fairly specific about my colour, field and font choices, as I'm preparing all the upcoming charts for publication on a website, and I want them to look as though they belong together.

In [18]:
# A bar chart showing the aggregated number of 10s per apparatus across the years

tenniest_apparatus_ever = (
        ggplot(tenniest_apparatus_per_year, aes(x='Apparatus', y='No. of Tens')) + 
            geom_bar(aes(group='Season', fill='Season'), 
                stat='identity', 
                alpha=.8,
                size=0.2) +
            ggtitle('Which apparatus attracts the most 10s in NCAA gymnastics?') +
            scale_fill_viridis() +
            scale_fill_discrete() +
            theme(
                axis_title = element_text(size = 12, family='Helvetica'),
                axis_text = element_text(size = 12, family='Helvetica'),
                legend_position='bottom',
                legend_title = element_text(size = 12, family='Helvetica'),
                legend_text = element_text(size = 10, family = 'Helvetica')
            )      
)

# Export the plot to html file
ggsave(tenniest_apparatus_ever, "../docs/figures/tenniest_apparatus_ever.html")

# Show the plot
tenniest_apparatus_ever

By the looks of the above plot, there isn't much difference in the number of 10s awarded on each apparatus over the ten seasons: the totals run from 379 on Vault to 400 on Floor Exercise, which has a slight edge.

The sizes of the slices, however, tell a different story. They suggest there are trends: in some seasons it is easier to achieve perfection on one apparatus, and in other seasons on another.
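
One way to check this numerically, rather than by eye, is to ask which apparatus collected the most 10s in each season. A minimal sketch, using the counts from the query output in 2.1.1; the variable names are placeholders:

```python
import pandas as pd

# Per-season counts of 10s, copied from the query output in 2.1.1
counts = pd.DataFrame(
    {
        "Vault": [34, 12, 22, 10, 31, 28, 50, 59, 88, 45],
        "Uneven Bars": [32, 8, 26, 51, 38, 10, 44, 46, 81, 56],
        "Balance Beam": [2, 16, 35, 53, 8, 32, 20, 38, 126, 69],
        "Floor Exercise": [7, 28, 16, 24, 56, 4, 21, 77, 64, 103],
    },
    index=pd.Index(range(2015, 2025), name="Season"),
)

# Apparatus with the most 10s in each season
season_leader = counts.idxmax(axis=1)

# Each apparatus's share of that season's 10s, so seasons of different sizes are comparable
season_share = counts.div(counts.sum(axis=1), axis=0).round(2)

print(season_leader)
print(season_share)
```

On these figures, Balance Beam and Floor Exercise each lead four seasons, Vault leads two and Uneven Bars none.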

It would be interesting to explore how these trends change from year to year in some sort of amusing animated plot.

In [11]:
# Making an animated plot to show the number of 10s scored on each apparatus over the years

# Sample distinct colors from the Viridis color scale
num_colors = len(tenniest_apparatus_per_year['Apparatus'].unique())
viridis_colors = px.colors.sample_colorscale(px.colors.sequential.Viridis, [i/num_colors for i in range(num_colors)])

tenniest_apparatus_py = px.bar(tenniest_apparatus_per_year, 
                x="Apparatus", 
                y="No. of Tens", 
                animation_frame="Season",       
                color="Apparatus", 
                hover_name="Apparatus",
                range_y=[0, tenniest_apparatus_per_year["No. of Tens"].max()], # Set the y-axis range
                color_discrete_sequence=viridis_colors,
                opacity=0.8
            )

# Customize the layout
tenniest_apparatus_py.update_layout(
    title="Number of 10s Scored on Each Apparatus Over the Years",
    title_font=dict(size=12, family='Helvetica', color='black'),
    xaxis_title="Apparatus",
    xaxis_title_font=dict(size=12, family='Helvetica', color='black'),
    yaxis_title="No. of Tens",
    yaxis_title_font=dict(size=12, family='Helvetica', color='black'),
    legend_title="Apparatus",
    legend_title_font=dict(size=12, family='Helvetica', color='black'),
    font=dict(size=10, family='Helvetica', color='black'),
    plot_bgcolor='white',  # Set plot background to white
    paper_bgcolor='white',  # Set paper background to white
    xaxis=dict(
        gridcolor='#EEEEEE'  # Set x-axis grid lines to light grey
    ),
    yaxis=dict(
        gridcolor='#EEEEEE'  # Set y-axis grid lines to light grey
    ),
    legend=dict(
        orientation="h",  #horizontal legend
        yanchor="bottom",  
        y=-1,  
        xanchor="center",  
        x=0.5  
    )
)
# Export the plot to html file
tenniest_apparatus_py.write_html("../docs/figures/tenniest_apparatus_per_year.html")

# Show the plot
tenniest_apparatus_py

## 2.2 The tenniest teams

Ok, but this is a competition, isn't it? Which *teams* have been the most successful in achieving tens over the years? Has it changed over time?


In [None]:
%%sql gymternet

SELECT 
    SUM(r.vt_score = 10.0) AS 'Vault',
    SUM(r.ub_score = 10.0) AS 'Uneven Bars',
    SUM(r.bb_score = 10.0) AS 'Balance Beam',
    SUM(r.fx_score = 10.0) AS 'Floor Exercise',
    SUM(r.vt_score = 10.0) + SUM(r.ub_score = 10.0) + SUM(r.bb_score = 10.0) + SUM(r.fx_score = 10.0) AS 'Total 10s',
    g.team_id AS 'team_id',
    t.team_name AS 'Team',
    m.year AS 'Season'
FROM gymnast_results AS r
LEFT JOIN gymnasts AS g
ON g.gymnast_id = r.gymnast_id
LEFT JOIN teams as t
ON t.team_id = g.team_id
LEFT JOIN meets as m
ON m.meet_id = r.meet_id
GROUP BY t.team_name, m.year;

In [None]:
# Export the above query to a new df
tenniest_teams_query = """
SELECT 
    SUM(r.vt_score = 10.0) AS 'Vault',
    SUM(r.ub_score = 10.0) AS 'Uneven Bars',
    SUM(r.bb_score = 10.0) AS 'Balance Beam',
    SUM(r.fx_score = 10.0) AS 'Floor Exercise',
    g.team_id AS 'team_id',
    t.team_name AS 'Team',
    m.year AS 'Season'
FROM gymnast_results AS r
LEFT JOIN gymnasts AS g
ON g.gymnast_id = r.gymnast_id
LEFT JOIN teams as t
ON t.team_id = g.team_id
LEFT JOIN meets as m
ON m.meet_id = r.meet_id
GROUP BY t.team_name, m.year;
"""

# Execute the query and store the result in a DataFrame
tenniest_teams_df = pd.read_sql_query(tenniest_teams_query, engine)

# Preview the df
tenniest_teams_df

In [None]:
# Aggregate each team's 10s across all seasons
grouped_teams_df = tenniest_teams_df.groupby(['Team']).sum().reset_index()

# And let's drop the irrelevant columns (their sums are meaningless)
grouped_teams_df = grouped_teams_df.drop(columns = ['team_id', 'Season'])

# Preview the new df
grouped_teams_df.head()

In [None]:
# Create a column with total tens

grouped_teams_df['total 10s'] = grouped_teams_df[['Vault', 'Uneven Bars', 'Balance Beam', 'Floor Exercise']].sum(axis=1)

#Preview the df
grouped_teams_df.head()

In [None]:
# Drop rows where total 10s == 0

grouped_teams_df = grouped_teams_df[grouped_teams_df['total 10s'] != 0]

# Check how many we have
grouped_teams_df.shape

In [None]:
# Make some subset dfs for easier plotting

total_tens_df = grouped_teams_df.drop(columns=['Vault', 'Uneven Bars', 'Balance Beam', 'Floor Exercise'])
vault_queens_df = grouped_teams_df.drop(columns=['Uneven Bars', 'Balance Beam', 'Floor Exercise', 'total 10s'])
bars_queens_df = grouped_teams_df.drop(columns=['Vault', 'Balance Beam', 'Floor Exercise', 'total 10s'])
beam_queens_df = grouped_teams_df.drop(columns=['Vault', 'Uneven Bars', 'Floor Exercise', 'total 10s'])
floor_queens_df = grouped_teams_df.drop(columns=['Vault', 'Uneven Bars', 'Balance Beam', 'total 10s'])

In [None]:
# A plot that shows the tenniest teams of all time
# A plot that shows the tenniest teams over time

# A plot that shows the tenniest teams across the apparatus
# A plot that shows the tenniest teams, across the apparatus, over time
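
A rough sketch of the data shaping the first two planned plots would need, assuming rows in the same shape as `tenniest_teams_df`. The team names and counts below are invented placeholders, not results:

```python
import pandas as pd

# Placeholder rows in the shape returned by the team query; names and counts are invented
teams_df = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team B", "Team B", "Team C"],
    "Season": [2023, 2024, 2023, 2024, 2024],
    "Vault": [2, 1, 0, 3, 0],
    "Uneven Bars": [1, 0, 0, 2, 0],
    "Balance Beam": [0, 4, 1, 0, 0],
    "Floor Exercise": [3, 2, 0, 1, 0],
})

apparatus_cols = ["Vault", "Uneven Bars", "Balance Beam", "Floor Exercise"]
teams_df["Total 10s"] = teams_df[apparatus_cols].sum(axis=1)

# Tenniest teams of all time: total per team, highest first, teams without a 10 removed
all_time = teams_df.groupby("Team")["Total 10s"].sum().sort_values(ascending=False)
all_time = all_time[all_time > 0]

# Tenniest teams over time: one row per team, one column per season
over_time = teams_df.pivot_table(
    index="Team", columns="Season", values="Total 10s", aggfunc="sum", fill_value=0
)

print(all_time)
print(over_time)
```

The `over_time` table has the team-by-season shape that the bubble map idea from the outline at the top (year on the x-axis, team on the y-axis, size as the number of 10s) would start from.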

## 2.3 Ten of the top 10 GOATs of all time

Do you think I'm only interested in 10s? Children focus on 10s. I'm interested in the truth. I'm interested in what matters. 10s are shiny, certainly, but any good person-who-can-do-basic-mathematics can see that a gymnast who gets a 10 one week and then a 5 the next week is not as useful as a gymnast who gets a 9.9 week after week.

With this in mind, let's find the gymnasts who have the highest average scores across the apparatus and across the seasons.

It's easy to do well if you never compete. For the purposes of this analysis, I am only interested in gymnasts who compete a minimum of 6 times over the course of the season (per apparatus). This threshold will necessarily thin out the data from the 2020 and 2021 seasons, which were heavily disrupted by COVID restrictions.
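
A minimal sketch of how that eligibility rule could be applied in pandas, assuming the scores are first reshaped into one row per routine. The column names (`gymnast`, `season`, `apparatus`, `score`) and all of the rows are hypothetical, not taken from the database:

```python
import pandas as pd

MIN_ROUTINES = 6  # minimum routines per apparatus per season, as stated above

# Invented scores in a long layout; column names and values are assumptions
scores = pd.DataFrame({
    "gymnast": ["Gymnast A"] * 6 + ["Gymnast B"] * 2,
    "season": [2024] * 8,
    "apparatus": ["Vault"] * 8,
    "score": [9.9, 9.9, 9.95, 9.85, 9.9, 9.9, 10.0, 10.0],
})

# Count routines and compute the average and median per gymnast, season and apparatus
summary = (
    scores.groupby(["gymnast", "season", "apparatus"])["score"]
    .agg(routines="count", avg_score="mean", median_score="median")
    .reset_index()
)

# Keep only gymnasts who meet the threshold, best average first
eligible = summary[summary["routines"] >= MIN_ROUTINES].sort_values(
    "avg_score", ascending=False
)

print(eligible)
```

In this toy data, Gymnast B has two perfect scores but only two routines, so the threshold removes them and only Gymnast A remains.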

In [None]:
# Who got the highest average/median of all time
# Who got the highest average/median each year

# Which teams own the most goats?