# Before we start
## Set up Python environment
As per the README file, I set up my conda environment and installed my program libraries.

In [1]:
# Uncomment and run this section if you haven't followed the directions in the README.md file yet.

# !conda init
# !conda create -n gymternet python=3.12 -y
# !conda activate gymternet 
# !conda install pip -y
# !pip install -r ../requirements.txt

## Import libraries and programs

Now that we're operating in Python, import all the libraries called on in the code.

In [2]:
import os
import json
import requests
import datetime

import numpy as np
import pandas as pd 

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy import Selector

from tqdm.notebook import tqdm
from pprint import pprint as print

In [3]:
# Setting program-level variables
driver = webdriver.Chrome()
years = [2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015] # These are the years that we are interested in evaluating

Because I'm fetching a large quantity of data and my internet connection can be patchy, I want a fetch method that handles errors and records unsuccessful attempts to reach a URL, so that I can go back later and retry if necessary (and if I have time).

Much of the error handling code was provided by Copilot, and seems to be unproblematic.

In [4]:
error_logs = []

def fetch_page(url, retries=3, timeout=10):
    for i in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response.text
            else:
                error_logs.append({
                    'url': url,
                    'status_code': response.status_code,
                    'error': 'Non-200 status code',
                    'timestamp': datetime.datetime.now().isoformat()
                })
        except requests.exceptions.Timeout:
            error_logs.append({
                'url': url,
                'status_code': None,
                'error': 'Timeout',
                'timestamp': datetime.datetime.now().isoformat()
            })
        except requests.exceptions.RequestException as e:
            error_logs.append({
                'url': url,
                'status_code': None,
                'error': str(e),
                'timestamp': datetime.datetime.now().isoformat()
            })
    return None
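
To go back and retry later, the logged entries need boiling down to a list of URLs. The sketch below is one way to do that, not part of the original pipeline: `urls_to_retry` and the `sample_logs` entries are hypothetical, with the entries shaped like the dictionaries that `fetch_page` appends to `error_logs`.

```python
def urls_to_retry(logs):
    """Return each logged URL once, in the order it first appeared."""
    # dict.fromkeys keeps insertion order and drops repeats
    return list(dict.fromkeys(entry['url'] for entry in logs))


# Hypothetical log entries in the same shape that fetch_page records
sample_logs = [
    {'url': 'https://example.com/a', 'status_code': 500,
     'error': 'Non-200 status code', 'timestamp': '2024-05-25T14:30:00'},
    {'url': 'https://example.com/a', 'status_code': None,
     'error': 'Timeout', 'timestamp': '2024-05-25T14:30:10'},
    {'url': 'https://example.com/b', 'status_code': None,
     'error': 'Timeout', 'timestamp': '2024-05-25T14:31:00'},
]

failed = urls_to_retry(sample_logs)
```

Note that `fetch_page` logs every failed attempt, including ones where a later retry succeeded, so this list can include URLs that were eventually fetched. Checking whether the output file exists before retrying would filter those out.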


# 1 Get the team info
The first step in the logic is to start to set up the data related to the teams. The teams are the 'base unit' of analysis for these data: all meets comprise teams, all gymnasts belong to teams, all scores either belong to gymnasts who belong to teams, or belong to teams directly.

On the landing page, I have scraped all the information for the past 10 years; teams are relatively static, but occasionally a new team is added to the roster or a team is dropped, so at this stage I'll just grab everything and drop duplicates later.

## 1.1 Scraping the data
The roadtonationals.com website has hidden APIs, which are visible when inspecting the page using the browser's 'Network' tab. With the help of [Postman](https://www.postman.com/), I was able to establish that no specific headers were needed, and to get the code needed to scrape that data.

It was evident from previewing that cycling through the years was as simple as replacing the final component of the cURL slug with the year in question.

In the interest of not having to run this code repeatedly, I saved the data down to a json file that I can easily read to a DataFrame locally.

**NB: The raw data that is written from this is saved in the 'data/raw/teams' folder**

In [5]:
year_url_root = "https://www.roadtonationals.com/api/women/finalresults/"

# For every year, access the website and save the data to a json file
def get_the_teams(url, year):
    # Build the year-specific url from the root that was passed in
    year_url = url + str(year)

    # Code here generated by Postman
    payload = {}
    headers = {
        'Cookie': 'PHPSESSID=c48eb24102c0c45390a5d64809741f95'
    }

    response = requests.request("GET", year_url, headers=headers, data=payload)

    # Save the data to a json file
    with open(f'../data/raw/teams/{year}_teams.json', 'w') as f:
        # pure text
        f.write(response.text)

In [6]:
# Run the function to get the team info for every year
for year in years:
    get_the_teams(year_url_root, year)

# 2 Get the meet info
The next phase in the logic is collecting all the information about all the meets that have happened between teams in the last 10 years. An easy way to find this information is to visit every team's dashboard (e.g. [Florida](https://roadtonationals.com/results/teams/dashboard/2024/22)), and cycle through the years to collect all the data.

## 2.1 Setting up to scrape
Like the team data, the cURL patterns for this process were fairly easy to establish, with the last section of the slug being the team id and the second to last section being the year.

But to cycle through this pattern, I needed to first get a list of the team ids, which are helpfully stored in the teams json files, but not readable in the program yet.

In [7]:
# Read the teams json files into a DataFrame

# Create an empty DataFrame
teams_data_df = pd.DataFrame()

# For every year, load the data from the json file and append to the DataFrame
for year in years:
    filename = f'../data/raw/teams/{year}_teams.json'

    # Read the json file into a temporary df
    temp_df = pd.read_json(filename)
    temp_df['year'] = year

    # Append the temporary df to the main df
    teams_data_df = pd.concat([teams_data_df, temp_df])


teams_data_df = teams_data_df.reset_index(drop=True)
teams_df = pd.json_normalize(teams_data_df['data']).reset_index(drop=True)
teams_df['year'] = teams_data_df['year']

In [8]:
# Preview the df
teams_df.head()

Unnamed: 0,rank,team_name,team_id,ncaa_final,ncaa,nqs,regionals,rqs,division_id,average_score,high_score,year
0,1,LSU,34,198.225,198.113,396.465,198.25,198.215,1,197.908,198.475,2024
1,2,California,15,197.85,197.713,396.455,198.275,198.18,1,197.833,198.55,2024
2,3,Utah,69,197.8,197.938,395.47,197.575,197.895,1,197.704,198.3,2024
3,4,Florida,22,197.438,197.875,396.23,198.325,197.905,1,197.67,198.225,2024
4,5,Stanford,61,,197.075,394.62,197.575,197.045,1,196.563,197.975,2024


I'll do some light data cleaning now to get rid of irrelevant columns and make the DataFrame easier to read.

In [9]:
# Drop the columns that we are not interested in
teams_df = teams_df.drop(columns=['rank', 'ncaa_final', 'nqs', 'regionals', 'rqs', 'division_id', 'average_score', 'high_score', 'ncaa'])

In [10]:
# Preview the df
teams_df.head()

Unnamed: 0,team_name,team_id,year
0,LSU,34,2024
1,California,15,2024
2,Utah,69,2024
3,Florida,22,2024
4,Stanford,61,2024


In [11]:
# Remove duplicates - i.e. if team_id & team_name are identical, keep only the first row
# (this is the most recent year, because the files were loaded in descending year order)
teams_df = teams_df.drop_duplicates(subset=['team_id', 'team_name']).reset_index(drop=True)

In [12]:
# Check the dtypes and make sure team_id is stored as an integer
teams_df.dtypes
teams_df['team_id'] = teams_df['team_id'].astype(int)

In [13]:
# Determine the link to access the team's dashboard
# Note: this accesses the api for the most recent dashboard for each team. If you want to access a specific year, you will need to modify the URL
base_team_url = 'https://www.roadtonationals.com/api/women/dashboard'

# Add the team links to the team_url column
teams_df['team_url'] = teams_df.apply(lambda x: f'{base_team_url}/{str(x["year"])}/{str(x["team_id"])}', axis=1)

In [14]:
# Preview the df - this looks good to work with now
teams_df['team_url']

0     https://www.roadtonationals.com/api/women/dash...
1     https://www.roadtonationals.com/api/women/dash...
2     https://www.roadtonationals.com/api/women/dash...
3     https://www.roadtonationals.com/api/women/dash...
4     https://www.roadtonationals.com/api/women/dash...
                            ...                        
84    https://www.roadtonationals.com/api/women/dash...
85    https://www.roadtonationals.com/api/women/dash...
86    https://www.roadtonationals.com/api/women/dash...
87    https://www.roadtonationals.com/api/women/dash...
88    https://www.roadtonationals.com/api/women/dash...
Name: team_url, Length: 89, dtype: object

In [15]:
# Saving the teams_df for easy access in later notebooks
teams_df.to_pickle('../data/raw/dirty_dfs/teams_df.pkl')

## 2.2 Scraping the meet data

As with the teams data, once the hidden API was located it was simple to establish the cURL pattern, so building the specific link was again folded into the method for fetching, reading and saving the content.

**NB: The raw data that is written from this is saved in the 'data/raw/meets' folder**

In [16]:
# Create a list of all team dashboards across all years and teams 
meet_urls = teams_df['team_url'].tolist()

In [17]:
# Get the meet info for every team in every year
def get_the_meet_info(url):
    year = url.split('/')[-2]
    team = url.split('/')[-1]

    # fetch_page returns the response text, or None if every attempt failed
    # (failures are recorded in error_logs). Reusing its result avoids
    # requesting the same page a second time.
    page_text = fetch_page(url)
    if page_text:
        # Save the data to a json file
        with open(f'../data/raw/meets/{year}_{team}_meets.json', 'w') as f:
            # pure text
            f.write(page_text)



In [18]:
# Batching up the meet_urls to avoid overloading the server
batch_size = 100
batches = [meet_urls[i:i + batch_size] for i in range(0, len(meet_urls), batch_size)]

In [19]:
# Call the method for every url in the list

# #Batch 1 #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[0]):

#     get_the_meet_info(url)

# #Batch 2 #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[1]): 

#     get_the_meet_info(url)

# #Batch 3  #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[2]):

#     get_the_meet_info(url)

# #Batch 4  #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[3]):

#     get_the_meet_info(url)

# #Batch 5 #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[4]):

#     get_the_meet_info(url)

# #Batch 6  #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[5]):

#     get_the_meet_info(url)

# #Batch 7   #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[6]):

#     get_the_meet_info(url)

# #Batch 8   #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[7]):

#     get_the_meet_info(url)

# #Batch 9  #Completed successfully and commented out to avoid re-running
# for url in tqdm(batches[8]):

#     get_the_meet_info(url)
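
The repeated batch loops above could be collapsed into a single loop. This is a sketch rather than what I actually ran: `run_in_batches` and its `pause_seconds` parameter are hypothetical, and the toy run uses a stand-in handler instead of `get_the_meet_info` so that nothing is requested from the server.

```python
import time


def run_in_batches(urls, handler, batch_size=100, pause_seconds=0.0):
    """Call handler(url) for every url, pausing between batches."""
    batches = [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
    for batch_number, batch in enumerate(batches, start=1):
        for url in batch:
            handler(url)
        # Pause between batches (but not after the last one) to go easy on the server
        if batch_number < len(batches):
            time.sleep(pause_seconds)
    return len(batches)


# Toy run: the handler only records which urls it was given
seen = []
n_batches = run_in_batches([f'url-{i}' for i in range(250)], seen.append, batch_size=100)
```

In the notebook this would be called as `run_in_batches(meet_urls, get_the_meet_info, pause_seconds=5)`, where the pause length is a guess and not a documented limit of the site.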


In [20]:
# Read the json files into a DataFrame

team_ids = teams_df['team_id'].tolist()

# Create an empty DataFrame
meets_data_df = pd.DataFrame()

# For every year and team, load the data from the json file and append to the DataFrame
for year in years:
    for team in team_ids:
        filename = f'../data/raw/meets/{year}_{team}_meets.json'

        # Skip combinations with no saved file (a failed fetch, or a team
        # that was not on the roster that year)
        if not os.path.exists(filename):
            continue

        with open(filename) as data_file:
            data = json.load(data_file)

        # Read the json file into a temporary df
        temp_df = pd.json_normalize(data, 'meets')
        temp_df['year'] = year
        temp_df['team_id'] = team

        # Append the temporary df to the main df
        meets_data_df = pd.concat([meets_data_df, temp_df])


meets_data_df = meets_data_df.reset_index(drop=True)

In [21]:
# Preview the df
meets_data_df.sort_values(by='meet_id', ascending=False).head()


Unnamed: 0,team_id,team_name,meet_id,meet_date,team_score,home,opponent,meet_desc,linked_id,jas,year
48,69,Utah,30231,"Sat, Apr-20-2024",197.8,A,"California, Florida, LSU",NCAA Championships Finals,6392,,2024
63,22,Florida,30230,"Sat, Apr-20-2024",197.4375,A,"California, LSU, Utah",NCAA Championships Finals,6392,,2024
32,15,California,30226,"Sat, Apr-20-2024",197.85,A,"Florida, LSU, Utah",NCAA Championships Finals,6392,,2024
15,34,LSU,30225,"Sat, Apr-20-2024",198.225,A,"California, Florida, Utah",NCAA Championships Finals,6392,,2024
139,33,Kentucky,30224,"Thu, Apr-18-2024",19.9,A,"Alabama, Arizona State, Arkansas, Boise State,...",NCAA Championships,6391,,2024


In [22]:
# Add the meet url to the DataFrame
results_url_root = "https://www.roadtonationals.com/api/women/meetresults/"
meets_data_df['meet_url'] = meets_data_df['meet_id'].apply(lambda x: f"{results_url_root}{str(x)}")

# Preview the df
meets_data_df.head()

Unnamed: 0,team_id,team_name,meet_id,meet_date,team_score,home,opponent,meet_desc,linked_id,jas,year,meet_url
0,34,LSU,28977,"Fri, Jan-05-2024",196.975,H,Ohio State,,5986,,2024,https://www.roadtonationals.com/api/women/meet...
1,34,LSU,29040,"Sat, Jan-13-2024",197.15,A,"Oklahoma, UCLA, Utah",Sprouts Farmers Market Collegiate Quad,6011,,2024,https://www.roadtonationals.com/api/women/meet...
2,34,LSU,29098,"Fri, Jan-19-2024",198.125,H,Kentucky,,6030,,2024,https://www.roadtonationals.com/api/women/meet...
3,34,LSU,29215,"Fri, Jan-26-2024",197.225,A,Missouri,,6078,,2024,https://www.roadtonationals.com/api/women/meet...
4,34,LSU,29303,"Fri, Feb-02-2024",198.475,H,Arkansas,,6111,,2024,https://www.roadtonationals.com/api/women/meet...


Unfortunately, the website I'm scraping from allocates a different meet_id for the same meet depending on which team is the originating source, so this df has a lot of duplicates that are difficult to spot. Luckily, there are only some 10,000 to sort through, so this should be no problem.

In [23]:
# Create a new column that stores the team name and the opponent names as a sorted list
meets_data_df['all_teams'] = meets_data_df.apply(lambda x: [x['team_name']] + x['opponent'].split(', '), axis=1)

# Sort the list of team names alphabetically so they can be easily compared
meets_data_df['all_teams'] = meets_data_df['all_teams'].apply(lambda x: sorted(x))

# Convert the list of team names to a tuple so it can be used as a key to identify duplicates
meets_data_df['all_teams'] = meets_data_df['all_teams'].apply(tuple)

# Drop duplicates (rows are duplicates when both the all_teams and meet_date columns are identical)
meets_df = meets_data_df.drop_duplicates(subset=['all_teams', 'meet_date']).reset_index(drop=True)

# Preview the df
meets_df.head()

Unnamed: 0,team_id,team_name,meet_id,meet_date,team_score,home,opponent,meet_desc,linked_id,jas,year,meet_url,all_teams
0,34,LSU,28977,"Fri, Jan-05-2024",196.975,H,Ohio State,,5986,,2024,https://www.roadtonationals.com/api/women/meet...,"(LSU, Ohio State)"
1,34,LSU,29040,"Sat, Jan-13-2024",197.15,A,"Oklahoma, UCLA, Utah",Sprouts Farmers Market Collegiate Quad,6011,,2024,https://www.roadtonationals.com/api/women/meet...,"(LSU, Oklahoma, UCLA, Utah)"
2,34,LSU,29098,"Fri, Jan-19-2024",198.125,H,Kentucky,,6030,,2024,https://www.roadtonationals.com/api/women/meet...,"(Kentucky, LSU)"
3,34,LSU,29215,"Fri, Jan-26-2024",197.225,A,Missouri,,6078,,2024,https://www.roadtonationals.com/api/women/meet...,"(LSU, Missouri)"
4,34,LSU,29303,"Fri, Feb-02-2024",198.475,H,Arkansas,,6111,,2024,https://www.roadtonationals.com/api/women/meet...,"(Arkansas, LSU)"
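
To show why the sorted tuple works as a key, here is a small illustration on made-up rows. The `meet_id` values are hypothetical; the column names match the ones used above. Rows 0 and 1 describe the same meet as seen from each team's dashboard, so they get different `meet_id`s but the same `all_teams` key.

```python
import pandas as pd

toy = pd.DataFrame({
    'meet_id': [101, 102, 103],  # hypothetical ids
    'team_name': ['LSU', 'Ohio State', 'LSU'],
    'opponent': ['Ohio State', 'LSU', 'Kentucky'],
    'meet_date': ['Fri, Jan-05-2024', 'Fri, Jan-05-2024', 'Fri, Jan-19-2024'],
})

# Build an order-independent key: every team at the meet, sorted, as a tuple
toy['all_teams'] = toy.apply(
    lambda row: tuple(sorted([row['team_name']] + row['opponent'].split(', '))),
    axis=1,
)

# Rows sharing the same teams and date collapse to the first occurrence
deduped = toy.drop_duplicates(subset=['all_teams', 'meet_date']).reset_index(drop=True)
```

Only the first `meet_id` of each pair survives (101 and 103 here), which is why the dropped ids have to be recovered from `meets_data_df` if they are ever needed again.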


In [24]:
# Saving the meets_df for easy access in later notebooks
meets_df.to_pickle('../data/raw/dirty_dfs/meets_df.pkl')

In [25]:
results_url_root = "https://www.roadtonationals.com/api/women/meetresults/"
results_links = meets_df['meet_url'].tolist()

# Get the results info for every meet
def get_the_results_info(url):
    meet_id = url.split('/')[-1]

    # fetch_page returns the response text, or None if every attempt failed
    # (failures are recorded in error_logs). Reusing its result avoids
    # requesting the same page a second time.
    page_text = fetch_page(url)
    if page_text:
        # Save the data to a json file
        with open(f'../data/raw/results/{meet_id}_results.json', 'w') as f:
            # pure text
            f.write(page_text)

In [26]:
# Note for players at home - this will take a while to run (approx. 1 hr)
# Complete data as at 2024-05-25 14:30:00 UTC is available in the '../data/raw/results' directory

# Call the method for every url in the list

# for url in results_links: #Commented out to avoid re-running

#     get_the_results_info(url)

In [27]:
# Read the json files into a results DataFrame
meet_ids = meets_df['meet_id'].tolist()

# Create an empty DataFrame
team_results_data_df = pd.DataFrame()
gymnasts_data_df = pd.DataFrame()

# with open(filename) as data_file:    
#     data = json.load(data_file)  


# For every meet, load the data from the results json file and append to the DataFrame
for meet_id in meet_ids:
    filename = f'../data/raw/results/{meet_id}_results.json'

    if os.path.exists(filename):
        if os.path.getsize(filename) == 0:
            print(f"File {filename} is empty.")
            continue

        try:
            with open(filename) as data_file:
                data = json.load(data_file)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON from file {filename}: {e}")
            continue

        # Read the json file into temporary DataFrames
        temp_team_df = pd.json_normalize(data, 'teams')

        # Normalising the scores data
        scores_data = data['scores']

        # Flatten the nested structure
        # Since 'scores' is a list of lists, we need to flatten it first
        flattened_scores = [item for sublist in scores_data for item in sublist]

        # Create DataFrame
        temp_gymnast_df = pd.json_normalize(flattened_scores)
        temp_gymnast_df['meet_id'] = meet_id

        # Append the temporary DataFrames to the main DataFrames
        team_results_data_df = pd.concat([team_results_data_df, temp_team_df])
        gymnasts_data_df = pd.concat([gymnasts_data_df, temp_gymnast_df])
    
    else:
        print(f"File {filename} does not exist.")
        continue

# Reset index for the final DataFrames
team_results_data_df = team_results_data_df.reset_index(drop=True)
gymnasts_data_df = gymnasts_data_df.reset_index(drop=True)

('Error decoding JSON from file ../data/raw/results/27977_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/26843_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/24822_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/21326_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/20001_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/19660_results.json: '
 'Expecting value: line 1 column 1 (char 0)')
('Error decoding JSON from file ../data/raw/results/20016_results.json: '
 'Expecting value: line 1 column 1 (char 0)')


I can see we have a (very manageable) number of failures. For the moment, I have enough data to get meaningful results, but I would like to come back at a later stage and see if the data are available by trying the process above with the (now deleted) meet_ids from the duplicates in the original `meets_data_df`, which I thankfully preserved.

🚨 **TODO** - Come back to these errors and find the corresponding (duplicate) meets in the meets_data_df and check to see if the links work with the other meet_id

✅ **UPDATE** - The error was due to some miscoding in the original source. No further action is required.
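
For reference, the flattening step in the loading cell above can be seen in isolation on a toy payload. The values here are made up, and I am only assuming that `scores` is a list of lists of per-gymnast records, as the comment in that cell says; what each inner list represents is not stated here. The field names are borrowed from the gymnasts DataFrame.

```python
import pandas as pd

# Hypothetical payload: 'scores' is a list of lists of gymnast records
toy_data = {
    'scores': [
        [{'gid': 1, 'last_name': 'A', 'floor': '9.900'},
         {'gid': 2, 'last_name': 'B', 'floor': '9.875'}],
        [{'gid': 3, 'last_name': 'C', 'floor': '9.850'}],
    ]
}

# Flatten the list of lists into one list of records
flattened = [item for sublist in toy_data['scores'] for item in sublist]

# One row per gymnast record, tagged with the meet it came from
toy_gymnasts = pd.json_normalize(flattened)
toy_gymnasts['meet_id'] = 999  # hypothetical meet id
```

Each inner record becomes one row, and `meet_id` is added afterwards because the records themselves don't carry it in this sketch.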

In [28]:
# Preview the team results DataFrame
team_results_data_df.head()

# This one looks ok!

Unnamed: 0,mid,tid,tname,vault,bars,beam,floor,tscore,year,home,lead
0,28977,34,LSU,49.375,49.375,48.7,49.525,196.975,2024,H,0.0
1,28978,46,Ohio State,49.3,49.125,49.05,49.3,196.775,2024,A,0.2
2,29039,47,Oklahoma,49.45,49.45,49.525,49.475,197.9,2024,A,0.0
3,29040,34,LSU,49.225,49.65,48.75,49.525,197.15,2024,A,0.75
4,29042,66,UCLA,49.4,49.25,49.25,49.2,197.1,2024,A,0.8


In [29]:
#Save the team_results_data_df for easy access in later notebooks

team_results_data_df.to_pickle('../data/raw/dirty_dfs/team_results_data_df.pkl')

In [30]:
# Preview the gymnasts DataFrame

gymnasts_data_df.head()

Unnamed: 0,gid,first_name,last_name,vault,bars,beam,floor,all_around,team_name,team_id,yr,vt_url,ub_url,bb_url,fx_url,meet_id
0,30950,Sierra,Ballard,,,9.2,9.9,,LSU,34,2024,,,,,28977
1,30952,Haleigh,Bryant,9.95,9.875,9.925,9.925,39.675,LSU,34,2024,,,,,28977
2,31947,Ashley,Cowan,,9.8,,,,LSU,34,2024,,,,,28977
3,32453,Amari,Drayton,9.925,,,9.925,,LSU,34,2024,,,,,28977
4,30953,Olivia,Dunne,,,,9.875,,LSU,34,2024,,,,,28977


In [31]:
# Save the gymnasts_data_df for easy access in later notebooks

gymnasts_data_df.to_pickle('../data/raw/dirty_dfs/gymnasts_data_df.pkl')