# Using `procyclingstats`
A notebook for trying out this package. By Andrea

## Autoreload

Autoreload lets the notebook reload code dynamically: if we update helper functions *outside* the notebook, we do not need to reload the notebook to pick up the changes.

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
# Import procyclingstats library
import procyclingstats as pcs
import re
import seaborn as sns
import sys
import tqdm



sys.path.append('../dataset/')

In [3]:
cyclist_df = pd.read_csv(os.path.join('dataset','cyclists.csv'))
races_df = pd.read_csv(os.path.join('dataset','races.csv'))
print(f"cyclist_df.shape = {cyclist_df.shape}, races_df.shape = {races_df.shape}")

cyclist_df.shape = (6134, 6), races_df.shape = (589865, 18)


## A simple preprocessing

I noticed that some cyclists appear twice in the same stage, with different positions. Sometimes they are not even listed on procyclingstats! The next cell keeps only the first occurrence of each such row.

In [4]:
races_df.drop_duplicates(keep="first", subset=races_df.columns.difference(['position']), inplace=True)
# I want to reset the indices. I don't know if it's the right thing to do, but it helps with my scraping
#races_df.reset_index(inplace=True, drop=True)
print(f"races_df.shape = {races_df.shape}")

races_df.shape = (589818, 18)
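To inspect such rows before dropping them, one can flag every row that shares all columns except `position` with another row. A minimal sketch on a toy frame (the columns and values are made up for illustration and are not taken from `races.csv`):

```python
import pandas as pd

# Toy data: rider "a" appears twice in the same stage with different positions
toy = pd.DataFrame({
    "_url": ["race/2020/stage-1"] * 3,
    "cyclist": ["a", "a", "b"],
    "position": [3, 7, 4],
})

# Columns that must match for two rows to count as the same result
key_cols = list(toy.columns.difference(["position"]))

# keep=False marks every member of a duplicated group, not only the repeats
suspicious = toy[toy.duplicated(subset=key_cols, keep=False)]

# Same idea as the cell above, without inplace=True
deduplicated = toy.drop_duplicates(subset=key_cols, keep="first")
```

Looking at `suspicious` first makes it easier to judge whether keeping the first occurrence is a reasonable choice.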


## Library usage

Let's try to use the library

### Cyclists

In [5]:
cyclist_df.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


This is an example of an error that might occur when using the scraping tool

In [6]:
# Single quotes inside the f-string keep the cell valid on Python versions before 3.12
pcs.Rider(f"rider/{cyclist_df.loc[0, '_url']}").height()

AttributeError: 'NoneType' object has no attribute 'text'

Let's try to augment the dataset using new information about the cyclists, scraped from [procyclingstats](https://www.procyclingstats.com/). 

The `procyclingstats` library lets us create a `Rider` object whose constructor takes the URL of the desired cyclist, so that the corresponding web page can be scraped. Data such as date of birth, weight, etc. are then retrieved by calling the object's methods.

In the next code cell we retrieve data from the web to augment the dataset.

The `safe_getattr` function is needed because if a feature (e.g. the height) is not present on the web page, the method that retrieves it (`.height()` in the previous example) raises an unhandled exception. We want to return NaN for missing data instead.

We loop through all the rows of the dataframe; for each cyclist we first take the value from the dataset and call the corresponding scraping method only if that value is missing. This reduces the computational burden. The approach is justified because preliminary exploration showed that the data is consistent, so there should be no errors to correct and we can copy over the values that are already there.

In [49]:
new_cyclists = []

TRUE_RANGE = cyclist_df.shape[0]
FALSE_RANGE = 400

# Helper function to handle exceptions
def safe_getattr(obj, attr, fun: callable = lambda x: x):
    try:
        return fun(getattr(obj, attr)())
    # AttributeError: the attribute (e.g. height) is not on the website
    # IndexError: raised when converting the weight into a number, but the weight is missing from the website
    # ValueError: raised because the pcs.Rider object tries to convert the string 'available'
    #             (which is what is scraped from the website) into a month of the year;
    #             since it is not in the list of months, a ValueError is raised
    except (IndexError, AttributeError, ValueError):
        return np.nan


def width_height_mistaken(scraped_weight, height) -> bool:
    # True when the scraped "weight" is actually the height in metres (height is stored in cm).
    # np.isclose avoids relying on exact floating-point equality; it is False if either value is NaN
    return bool(np.isclose(scraped_weight * 100, height))

for i in tqdm.tqdm(range(TRUE_RANGE)):
    url = cyclist_df.loc[i, "_url"]
    # This try block is for actually scraping the rider
    try:
        ciclista = pcs.Rider(f"rider/{url}")

        # New features: total points accumulated in the whole career,
        #               total n° of seasons "ran" by the cyclist (aka length of the "career")
        storico_punti = ciclista.points_per_season_history()
        tot_punti = 0
        n_stagioni = len(storico_punti)
        for dizio in storico_punti:
            tot_punti += dizio['points']


        # If we have the values in our dataframe we use them, otherwise we scrape procyclingstats
        nome = ' '.join(cyclist_df.loc[i,'name'].split()) if not pd.isna(cyclist_df.loc[i,'name']) else ' '.join(ciclista.name().split())
        # Convert the scraped year to float, to match the dtype of the 'birth_year' column
        data_nascita = cyclist_df.loc[i,'birth_year'] if not pd.isna(cyclist_df.loc[i,'birth_year']) else safe_getattr(ciclista, 'birthdate', lambda s: float(s[:4]))
        height = cyclist_df.loc[i,'height'] if not pd.isna(cyclist_df.loc[i,'height']) else safe_getattr(ciclista, 'height', lambda x: x*100)
        # Sometimes the scraper confuses the height for the weight, when it's not there...
        scraped_weight = safe_getattr(ciclista, 'weight')
        weight = cyclist_df.loc[i,'weight'] if not pd.isna(cyclist_df.loc[i,'weight']) else scraped_weight if not width_height_mistaken(scraped_weight, height) else np.nan
        nazionalita = cyclist_df.loc[i,'nationality'] if not pd.isna(cyclist_df.loc[i,'nationality']) else safe_getattr(ciclista, 'nationality')
    except ValueError:
        # If we don't find the cyclist on procyclingstats we basically copy the row from our dataframe
        nome = ' '.join(cyclist_df.loc[i,'name'].split())
        data_nascita = cyclist_df.loc[i,'birth_year']
        weight = cyclist_df.loc[i,'weight']
        height = cyclist_df.loc[i,'height']
        nazionalita = cyclist_df.loc[i,'nationality']
        # No scraped history is available: reset these values so that they are not
        # carried over from the previous iteration
        storico_punti = []
        tot_punti = 0
        n_stagioni = 0

    cyclist_new_data = {
        '_url': url,
        'name': nome,
        'birth_year': data_nascita,
        'weight': weight,
        'height': height,
        'nationality': nazionalita,
        'points_total': tot_punti,
        'tot_seasons_attended': n_stagioni,
        # I'm sorry, I'll do this anyway. It's easier to remove this feature when we see that is unused than to add it back later
        'full_history': storico_punti
    }    
    new_cyclists.append(cyclist_new_data)


100%|██████████| 6134/6134 [26:00<00:00,  3.93it/s]  
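The "dataset value first, scraped value as fallback" rule used in the loop can also be written column-wise once the scraped values sit in a frame of their own. A minimal sketch, assuming a hypothetical `scraped` frame aligned on the same index as the original one (all values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Values already present in the dataset (NaN where missing)
original = pd.DataFrame({
    "weight": [74.0, np.nan, np.nan],
    "height": [182.0, np.nan, 171.0],
})

# Hypothetical scraped values for the same riders
scraped = pd.DataFrame({
    "weight": [70.0, 63.0, np.nan],
    "height": [180.0, 181.0, np.nan],
})

# combine_first keeps the caller's values and fills only its NaNs from the argument
merged = original.combine_first(scraped)
```

Existing values are never overwritten, which matches the assumption that the dataset is consistent where it is filled in.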


We create the new dataframe

In [53]:
new_cyclist_df = pd.DataFrame(new_cyclists)
new_cyclist_df.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality,points_total,tot_seasons_attended,full_history
0,bruno-surra,Bruno Surra,1964.0,,,Italy,15.0,2,"[{'season': 1989, 'points': 14.0, 'rank': 828}..."
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France,4717.0,11,"[{'season': 1997, 'points': 164.0, 'rank': 257..."
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands,315.0,10,"[{'season': 2024, 'points': 30.0, 'rank': 990}..."
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium,953.0,9,"[{'season': 2023, 'points': 298.0, 'rank': 218..."
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain,459.0,5,"[{'season': 2024, 'points': 197.0, 'rank': 317..."


In [54]:
new_cyclist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _url                  6134 non-null   object 
 1   name                  6134 non-null   object 
 2   birth_year            6121 non-null   float64
 3   weight                3080 non-null   float64
 4   height                3143 non-null   float64
 5   nationality           6133 non-null   object 
 6   points_total          6134 non-null   float64
 7   tot_seasons_attended  6134 non-null   int64  
 8   full_history          6134 non-null   object 
dtypes: float64(4), int64(1), object(4)
memory usage: 431.4+ KB


In [55]:
cyclist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


First, let's save the data that we scraped

In [56]:
# Save the data in the new dataframe
new_cyclist_df.to_csv(os.path.join('dataset','cyclists_new.csv'))

***Note***: I *deliberately* (yeah, right...) chose to save the indices as well. One should drop them when reading the .csv.

***Note***: the `full_history` column is handled badly by `pd.read_csv()`: it is read as a string, not as a list of dictionaries. To recover the correct value one has to execute:

<code>
import ast

new_cyclist_df = pd.read_csv(os.path.join('dataset','cyclists_new.csv')) <br>
new_cyclist_df['full_history'] = new_cyclist_df['full_history'].apply(ast.literal_eval)
</code>
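Both notes can be checked with a small round trip. The sketch below writes a toy frame to an in-memory buffer instead of `cyclists_new.csv` (the row is invented for illustration), restores the saved index with `index_col=0`, and parses the history column back into Python objects:

```python
import ast
import io

import pandas as pd

df = pd.DataFrame({
    "_url": ["rider-a"],
    "full_history": [[{"season": 2024, "points": 30.0, "rank": 990}]],
})

# Write to an in-memory buffer, index included (the to_csv default)
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column
restored = pd.read_csv(buffer, index_col=0)

# The list of dictionaries comes back as its string representation...
was_string = isinstance(restored.loc[0, "full_history"], str)

# ...and literal_eval turns it back into a list of dictionaries
restored["full_history"] = restored["full_history"].apply(ast.literal_eval)
```

One caveat: `ast.literal_eval` only accepts Python literals, so a history entry containing a bare `nan` would raise a `ValueError` and would need separate handling.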

At first it looked like scraping had retrieved the weights of 72 cyclists. On further investigation this turned out not to be the case: when the website reports the height but not the weight, the scraper returns the height when asked for the weight. This is why the `width_height_mistaken` function performs its check.

See the following example:

In [57]:
idar_andersen = pcs.Rider("rider/idar-andersen")
print(idar_andersen.weight())
try:
    print(idar_andersen.height())
except AttributeError as exc:
    print(f"Trying to call height method raised the exception: {exc}")

1.82
Trying to call height method raised the exception: 'NoneType' object has no attribute 'text'
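A comparison against the height can only catch the mix-up when a height is available to compare with. A complementary sketch, under the assumption that no rider weighs less than 30 kg (the threshold is arbitrary and `plausible_weight` is a hypothetical helper, not part of `procyclingstats`), rejects values that can only be heights in metres:

```python
import math

# Assumed lower bound for a rider's weight in kg; not derived from the dataset
MIN_PLAUSIBLE_WEIGHT_KG = 30.0


def plausible_weight(value):
    """Return value if it looks like a weight in kg, NaN otherwise."""
    if value is None or math.isnan(value) or value < MIN_PLAUSIBLE_WEIGHT_KG:
        return math.nan
    return value
```

With this helper, a scraped "weight" of 1.82 becomes NaN even when the rider's height is unknown.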


But let's see what we've gained!

In [None]:
# Rows whose weight was missing in the original dataframe and is present in the new one
new_cyclist_df.loc[cyclist_df["weight"].isna() & new_cyclist_df["weight"].notna(), ['_url', 'name', 'weight', 'height']]

Unnamed: 0,_url,name,weight,height
2321,torstein-traeen,Torstein Træen,63.0,181.0
6088,eric-antonio-fagundez,Eric Antonio Fagúndez,67.0,180.0


All this effort for just two weights...

### Races

Let's see if we have more luck with the races

In [5]:
races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


So, consider `race = pcs.Stage(f"race/{races_df.loc[i, '_url']}")`. Let's look at the things that are easiest to parse and retrieve:
- `race.profile_icon()` corresponds to `races_df.loc[i, 'profile']`
- `race.race_startlist_quality_score()` corresponds to `races_df.loc[i, 'startlist_quality']`, but no value is missing in this column
- `race.uci_points_scale()` corresponds to `races_df.loc[i, 'uci_points']`
- `race.avg_temperature()` corresponds to `races_df.loc[i, 'average_temperature']`
- `race.vertical_meters()` corresponds to `races_df.loc[i, 'climb_total']`
- `race.distance()` corresponds to `races_df.loc[i, 'length']`, but the former is in km while the latter is in metres (e.g. `162000.0` in the rows above)
- `race.date()` corresponds to the first part of `races_df.loc[i, 'date']`


What we do:
- We keep the columns of the dataframe that are complete (i.e. the URL, the name, the terrain, ...)
- We use the points and UCI points from procyclingstats and overwrite those of the dataframe, because points *are relative to the cyclist, not to the race!*
- If the points/UCI points value is 0, we set it to NaN, because 0 encodes missing values for the scraper
- For `cyclist_age`, `climb_total`, `profile`, `average_temperature` we first look in the dataframe; if there is a NaN we scrape
- Since the teams on procyclingstats are completely different from those in the dataset (and I trust pcs more), we keep the former.
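The two value-level rules in the list (fall back to the scraped value only on NaN, and treat 0 points as missing) can be sketched as small helpers. The names `fill_from_scrape` and `zero_to_nan` are hypothetical and are not claimed to match what `scrape_stages` does internally:

```python
import numpy as np
import pandas as pd


def fill_from_scrape(dataset_value, scraped_value):
    """Prefer the dataset value; use the scraped one only if the former is NaN."""
    return scraped_value if pd.isna(dataset_value) else dataset_value


def zero_to_nan(points):
    """Treat 0 as a missing value, as described for points and UCI points."""
    return np.nan if points == 0 else points
```

For example, `fill_from_scrape(np.nan, 21.0)` gives `21.0`, while `fill_from_scrape(1101.0, 900.0)` keeps the dataset value `1101.0`.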


The following block is taken from `Andrea_data_understanding.ipynb`. It identifies races that are the same but have different names in the `name` column, by comparing the races' `_url`s.

It also creates a dictionary in which the keys are names of the races (a representative for the equivalence class), and the values are the different names that refer to the same race as the corresponding key.

In [6]:
from utility.data_understanding import check_if_same
from itertools import combinations

# List with all the race names
race_names = np.sort(races_df['name'].unique())

# Initialize a list to store pairs of races that are actually the same
same_races = []
# And a dictionary in which the different values correspond to different names for
# the same race (the race denoted by the key)
same_races_dict = {}

# Iterate through all pairs of race names
for i in range(len(race_names)):
    for j in range(i + 1, len(race_names)):
        race1 = race_names[i]
        race2 = race_names[j]
        # Use the check_if_same function to compare the races
        try:
            same, _, _ = check_if_same(race1, race2, races_df=races_df)
            if same:
                same_races.append((race1, race2))
                
                # Find the representative name
                representative = None
                for key in same_races_dict:
                    if race1 in same_races_dict[key] or race2 in same_races_dict[key]:
                        representative = key
                        break
                
                if representative is None:
                    representative = race1
                
                # Add the races to the dictionary
                if representative not in same_races_dict:
                    same_races_dict[representative] = [race1, race2]
                else:
                    if race1 not in same_races_dict[representative]:
                        same_races_dict[representative].append(race1)
                    if race2 not in same_races_dict[representative]:
                        same_races_dict[representative].append(race2)
        except TypeError:
            print(f"Caught error at races {race_names[i]} and {race_names[j]}")

# Final check
v = True  # stays True if the dictionary is empty, so the assert below is always defined
for key in same_races_dict:
    # Check that every pair of aliases stored in the dictionary is also in the list of pairs of same races
    v = all(pair in same_races for pair in combinations(same_races_dict[key], 2))
    if not v:
        print(f"Error with {key}")
        break

assert v, "There is some problem"
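As a cross-check of the alias dictionary, the names can also be grouped directly. The sketch below assumes that the first path component of `_url` identifies the race, which is an assumption: `check_if_same` may use a different criterion. The alias "Le Tour" is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "_url": [
        "tour-de-france/1978/stage-6",
        "tour-de-france/2019/stage-1",
        "giro-d-italia/2019/stage-14",
    ],
    "name": ["Tour de France", "Le Tour", "Giro d'Italia"],
})

# Race identifier: everything before the first "/" of the URL
slug = toy["_url"].str.split("/").str[0]

# All distinct names used for each race, in order of appearance
names_per_slug = toy.groupby(slug)["name"].unique()

# Keep only races with more than one name; the first name acts as representative
aliases = {names[0]: list(names) for names in names_per_slug if len(names) > 1}
```

This avoids the quadratic loop over all pairs of names, at the cost of trusting the URL structure.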

We partition the indices to iterate over into chunks:
- the chunks can be processed in parallel (strong scaling), so the scraping finishes sooner
- if something goes wrong (e.g. the host refuses the HTTPS requests) we can easily resume the work without processing everything again

In [17]:
# Now is time to scrape!

N_CHUNKS = 100
#TRUE_RANGE = races_df.index[-1]
#FALSE_RANGE = 1000

chunks_list = np.array_split(races_df.index, N_CHUNKS)
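Unlike `np.split`, `np.array_split` does not require the length to be divisible by the number of chunks: the first few chunks simply get one extra element. A small sketch with toy sizes (10 indices, 3 chunks):

```python
import numpy as np
import pandas as pd

# Toy index standing in for races_df.index
index = pd.RangeIndex(10)

# Split into 3 nearly equal chunks; order is preserved
chunks = np.array_split(index.to_numpy(), 3)
sizes = [len(chunk) for chunk in chunks]
```

Here `sizes` is `[4, 3, 3]`, and concatenating the chunks gives back the original index.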

Torquatian vibes in the next few cells (thanks ChatGPT for the valuable help). <br>
The essence is what was written by me.

In [18]:
from concurrent.futures import ProcessPoolExecutor, as_completed
from utility.data_understanding import scrape_stages

#new_races_df = pd.DataFrame(columns=races_df.columns)
# Path to store the progress checkpoint
checkpoint_file = os.path.join('dataset/scraping', 'resume_checkpoint.txt')

def process_chunk(n, indices_arr, races_df, same_races_dict):
    try:
        # Create the new dataframe
        reduced_new_races_df = pd.DataFrame(scrape_stages(indices_arr, races_df, same_races_dict))
        
        # Save the new data in a unique .csv file per chunk
        reduced_new_races_df.to_csv(os.path.join('dataset/scraping/scraped_data', f'races_{n}_new.csv'))
        # Concatenate new to old, to hopefully create a new updated df
        #new_races_df = pd.concat([new_races_df, reduced_new_races_df])

        reduced_races_df = races_df.loc[reduced_new_races_df.index]

        # Save gained values, using a unique log file for this chunk
        log_file_path = os.path.join('dataset/scraping/scraping_logs', f'gained_values_chunk_{n}.txt')
        with open(log_file_path, 'w') as file:
            file.write(f"Chunk {n}, rows from {indices_arr[0]} to {indices_arr[-1]}\n")
            for col in reduced_races_df.columns:
                truth1 = all(reduced_new_races_df[col] == reduced_races_df[col])
                diversi = reduced_new_races_df[col] != reduced_races_df[col]
                truth2_1 = pd.isna(reduced_new_races_df.loc[diversi, col]).all()
                truth2_2 = pd.isna(reduced_races_df.loc[diversi, col]).all()
                # Either the columns have the same values, or they differ because both are NaNs (np.nan == np.nan yields False)                
                truth = truth1 or (truth2_1 and truth2_2)
                #print(f"Columns {'\''+col+'\'':^22} are equal across the two datasets: {truth}")
                # Exclude some columns (not interesting)
                if not truth and col not in ['name', 'points', 'uci_points', 'cyclist_team']:
                    file.write(f"\tGained values for {col}:\n")
                    for idx, val in reduced_new_races_df.loc[diversi, col].items():
                        file.write(f"\t\tRow: {idx},\t value: {val}\n")
        
        # Append the completed chunk index to the checkpoint file
        with open(checkpoint_file, 'a') as checkpoint:
            checkpoint.write(f"{n}\n")

    except ConnectionError as exc:
        print(f"Chunk {n} interrupted due to ConnectionError: {exc}")
        # Do not touch the checkpoint file here: it lists *completed* chunks,
        # so writing n (or truncating the file with 'w') would lose the progress
        # made so far and mark this failed chunk as done
        raise  # Reraise to signal main process to stop
    except Exception as exc:
        print("That's a new one")
        print(f"Chunk {n} interrupted due to exception: {exc}")
        raise  # Reraise for other unanticipated errors
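The log above finds differences with `!=` and then has to account for `NaN != NaN`. A more direct way to list gained values is to look for positions where the old column is missing and the new one is not. A minimal sketch on toy series (values invented for illustration):

```python
import numpy as np
import pandas as pd

old = pd.Series([22.0, np.nan, np.nan, 30.0], name="cyclist_age")
new = pd.Series([22.0, 23.0, np.nan, 30.0], name="cyclist_age")

# A value is "gained" where the old column is NaN and the new one is not
gained = new[old.isna() & new.notna()]
```

Only row 1 is reported; row 2, which is NaN in both series, is ignored without any special case.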


The next cell works out which chunks have already been processed, so that only the remaining ones are scraped.

In [19]:
# Function to get the list of chunks that still need processing
def get_incomplete_chunks():
    # Get all chunks that have been completed
    completed_chunks = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file, 'r') as checkpoint:
            completed_chunks = {int(line.strip()) for line in checkpoint if line.strip().isdigit()}

    # Identify and return the incomplete chunks
    return [(n, indices_arr) for n, indices_arr in enumerate(chunks_list) if n not in completed_chunks]

# Get list of chunks that still need to be processed
chunks_to_process = get_incomplete_chunks()
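The resume logic can be tried in isolation. The sketch below uses a temporary directory and a hypothetical `read_completed` helper that mirrors the parsing done in `get_incomplete_chunks`:

```python
import os
import tempfile


def read_completed(path):
    """Return the set of chunk numbers listed in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {int(line) for line in fh if line.strip().isdigit()}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "resume_checkpoint.txt")
    before = read_completed(path)      # no file yet: nothing completed
    with open(path, "a") as fh:        # append mode, one chunk number per line
        fh.write("0\n3\n")
    after = read_completed(path)
    todo = [n for n in range(5) if n not in after]
```

Appending one line per finished chunk means an interrupted run loses at most the chunks that were in progress.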

And now, the parallel code!

In [20]:
# Execute chunks in parallel, skipping completed ones and terminate on any exception
try:
    with ProcessPoolExecutor() as executor:
        futures = [executor.submit(process_chunk, n, indices_arr, races_df, same_races_dict)
                   for n, indices_arr in chunks_to_process]
        for future in as_completed(futures):
            future.result()  # Raises any exceptions encountered during processing
except Exception as exc:
    print(f"Processing halted due to an error: {exc}")

# Merge log files if all chunks are completed successfully
if not get_incomplete_chunks():  # If no incomplete chunks remain
    with open(os.path.join('dataset', 'gained_values.txt'), 'w') as merged_file:
        for n in range(len(chunks_list)):
            log_file_path = os.path.join('dataset/scraping/scraping_logs', f'gained_values_chunk_{n}.txt')
            if os.path.exists(log_file_path):
                with open(log_file_path, 'r') as individual_log:
                    merged_file.write(individual_log.read())
                #os.remove(log_file_path)

scraping...: 100%|██████████| 5899/5899 [00:59<00:00, 99.88it/s] 
scraping...: 100%|██████████| 5899/5899 [00:58<00:00, 100.75it/s]
scraping...: 100%|██████████| 5899/5899 [00:58<00:00, 100.08it/s]
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 96.36it/s]]
scraping...:  98%|█████████▊| 5758/5899 [01:02<00:01, 83.72it/s] 
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 96.30it/s] 
scraping...: 100%|██████████| 5899/5899 [01:00<00:00, 96.89it/s]]
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 96.64it/s]
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 95.20it/s]
scraping...: 100%|██████████| 5899/5899 [01:03<00:00, 92.68it/s]
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 96.42it/s] 
scraping...: 100%|██████████| 5899/5899 [01:01<00:00, 96.31it/s] 
scraping...: 100%|██████████| 5899/5899 [01:08<00:00, 86.34it/s]]
scraping...: 100%|██████████| 5899/5899 [01:11<00:00, 81.97it/s] 
scraping...: 100%|██████████| 5899/5899 [01:11<00:00, 82.25it/s] 
scraping...: 
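The submit/`as_completed` pattern can be tried on a toy task. This sketch swaps in `ThreadPoolExecutor` for `ProcessPoolExecutor` so that it runs in any environment (the `Future` API is the same), and it collects failures per chunk instead of halting at the first one; `work` is a made-up stand-in for `process_chunk`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(n):
    """Stand-in for process_chunk: chunk 2 fails, the others succeed."""
    if n == 2:
        raise RuntimeError(f"chunk {n} failed")
    return n * n


done, failed = [], []
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to its chunk number
    futures = {executor.submit(work, n): n for n in range(4)}
    for future in as_completed(futures):
        n = futures[future]
        try:
            # result() re-raises any exception raised inside the worker
            done.append((n, future.result()))
        except RuntimeError:
            failed.append(n)
```

Keeping the future-to-chunk mapping makes it possible to report which chunk failed, which the bare list of futures does not allow.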

We merge all the csv files into a new csv file

In [22]:
# Path to save the merged CSV
races_new_csv_path = os.path.join('dataset', 'races_new.csv')

# Open the target file in write mode
with open(races_new_csv_path, 'w') as new_races_file:
    for n in range(N_CHUNKS):
        portion_path = os.path.join('dataset/scraping/scraped_data', f'races_{n}_new.csv')
        
        # Open each CSV chunk
        with open(portion_path, 'r') as portion_csv:
            # Read all lines in the chunk
            lines = portion_csv.readlines()
            
            # Write the header for the first chunk only
            if n == 0:
                new_races_file.writelines(lines)
            else:
                # Skip the header for subsequent chunks
                new_races_file.writelines(lines[1:])
                
print(f"All chunks merged into {races_new_csv_path}")

All chunks merged into dataset/races_new.csv


In [21]:
# Define the path for the final CSV file
races_new_csv_path = os.path.join('dataset', 'races_new.csv')

# Combine the chunks into one file, only writing the header once
with open(races_new_csv_path, 'w') as new_races_file:
    all_lines = []  # Temporary storage for all lines in all chunks
    for n in range(N_CHUNKS):
        portion_path = os.path.join('dataset/scraping/scraped_data', f'races_{n}_new.csv')
        
        # Open each CSV chunk
        with open(portion_path, 'r') as portion_csv:
            lines = portion_csv.readlines()
            
            # Write the header for the first chunk only
            if n == 0:
                all_lines.extend(lines)
            else:
                all_lines.extend(lines[1:]) 

    # Sort lines by the value in first column (original index)
    header = all_lines[0]  # Save the header
    data_lines = all_lines[1:]  # All data lines

    # Sort data lines based on the first column (the index)
    data_lines.sort(key=lambda line: int(line.split(',')[0]))  # Convert index to int for proper sorting

    # Write sorted lines back to the file
    new_races_file.write(header)
    new_races_file.writelines(data_lines)

print(f"Sorted {races_new_csv_path} has been created.")

Sorted dataset/races_new.csv has been created.
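The same merge-and-sort can be sketched with pandas, which parses the CSV properly instead of splitting lines on commas by hand. The chunks below are in-memory stand-ins for the `races_{n}_new.csv` files, with invented contents:

```python
import io

import pandas as pd

# Two toy chunks, deliberately listed out of order
chunk_b = io.StringIO(",_url,points\n2,race/x,50.0\n3,race/x,40.0\n")
chunk_a = io.StringIO(",_url,points\n0,race/x,100.0\n1,race/x,70.0\n")

# index_col=0 reads the first column as the original row index
frames = [pd.read_csv(buf, index_col=0) for buf in (chunk_b, chunk_a)]

# Concatenate and restore the original row order
merged = pd.concat(frames).sort_index()
```

Writing `merged` with `to_csv` then produces a single header and rows sorted by their original index.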


In [23]:
new_races_df = pd.read_csv(os.path.join('dataset','races_new.csv'))
new_races_df.head()

Unnamed: 0.1,Unnamed: 0,_url,name,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,0,tour-de-france/1978/stage-6,Tour de France,RR,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,team/flandria-velda-lano-1978,0.0
1,1,tour-de-france/1978/stage-6,Tour de France,RR,70.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,team/ti-raleigh-mc-gregor-1978,0.0
2,2,tour-de-france/1978/stage-6,Tour de France,RR,50.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,team/flandria-velda-lano-1978,0.0
3,3,tour-de-france/1978/stage-6,Tour de France,RR,40.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,team/c-a-1978,0.0
4,4,tour-de-france/1978/stage-6,Tour de France,RR,32.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,team/miko-mercier-1978,0.0


In [24]:
new_races_df = new_races_df.drop(columns=['Unnamed: 0'])
new_races_df.head()

Unnamed: 0,_url,name,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,RR,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,team/flandria-velda-lano-1978,0.0
1,tour-de-france/1978/stage-6,Tour de France,RR,70.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,team/ti-raleigh-mc-gregor-1978,0.0
2,tour-de-france/1978/stage-6,Tour de France,RR,50.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,team/flandria-velda-lano-1978,0.0
3,tour-de-france/1978/stage-6,Tour de France,RR,40.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,team/c-a-1978,0.0
4,tour-de-france/1978/stage-6,Tour de France,RR,32.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,team/miko-mercier-1978,0.0


In [25]:
new_races_df.shape

(604840, 19)

It looks like the stages are stored in the same order as in the original `races_df` dataframe (the check below compares the order in which the unique `_url`s appear):

In [27]:
np.array_equal(new_races_df['_url'].unique(), races_df['_url'].unique())

True

---

#### Dump:

In [None]:
from utility.data_understanding import scrape_stages

#new_races_df = pd.DataFrame(columns=races_df.columns)
for n, indices_arr in enumerate(chunks_list):
    try:
        # Create the new dataframe
        reduced_new_races_df = pd.DataFrame(scrape_stages(indices_arr, races_df, same_races_dict))
        # Save the new data in a new .csv
        reduced_new_races_df.to_csv(os.path.join('dataset/scraping',f'races_{n}_new.csv'))
        # Concatenate new to old, to hopefully create a new updated df
        #new_races_df = pd.concat([new_races_df, reduced_new_races_df])

        reduced_races_df = races_df.loc[reduced_new_races_df.index]

        # Save gained values
        with open(os.path.join('dataset','gained_values.txt'), 'a+') as file:
            file.write(f"Chunk {n}, rows from {indices_arr[0]} to {indices_arr[-1]}")
            for col in reduced_races_df.columns:
                truth1 = all(reduced_new_races_df[col] == reduced_races_df[col])
                diversi = reduced_new_races_df[col] != reduced_races_df[col]
                truth2_1 = pd.isna(reduced_new_races_df.loc[diversi, col]).all() 
                truth2_2 = pd.isna(reduced_races_df.loc[diversi, col]).all() 
                # Either the columns have the same values, or they differ because both are NaNs (np.nan == np.nan yields False)
                truth = truth1 or (truth2_1 and truth2_2)
                #print(f"Columns {'\''+col+'\'':^22} are equal across the two datasets: {truth}")
                # Exclude some columns (not interesting)
                if not truth and col not in ['name', 'points', 'uci_points', 'cyclist_team']:
                    file.write(f"\tGained values for {col}:\n")
                    for idx, val in reduced_new_races_df.loc[reduced_new_races_df[col] != reduced_races_df[col], col].items():
                        file.write(f"\t\tRow: {idx},\t value: {val}\n")
    except ConnectionError as exc:
        print(f"Loop interrupted at chunk {n} due to exception {exc}")
        break
    except Exception as exc:
        print("That's a new one")
        print(f"Loop interrupted at chunk {n} due to exception {exc}")
        break

That's a new one
Loop interrupted at chunk 0 due to exception name 'same_races_dict' is not defined


Useful pieces of code:

In [309]:

with open('gained_values.txt', 'a+') as file:
    for col in reduced_races_df.columns:
        truth1 = all(reduced_new_races_df[col] == reduced_races_df[col])
        diversi = reduced_new_races_df[col] != reduced_races_df[col]
        truth2_1 = pd.isna(reduced_new_races_df.loc[diversi, col]).all() 
        truth2_2 = pd.isna(reduced_races_df.loc[diversi, col]).all() 
        # Either the columns have the same values, or they differ because both are NaNs (np.nan == np.nan yields False)
        truth = truth1 or (truth2_1 and truth2_2)
        #print(f"Columns {'\''+col+'\'':^22} are equal across the two datasets: {truth}")
        if not truth and col not in ['name', 'points', 'uci_points', 'cyclist_team']:
            file.write(f"\tGained values for {col}:\n")
            for idx, val in reduced_new_races_df.loc[reduced_new_races_df[col] != reduced_races_df[col], col].items():
                file.write(f"\t\tIndex: {idx},\t value: {val}\n")

In [303]:
for idx, val in reduced_new_races_df.loc[reduced_new_races_df[col] != reduced_races_df[col], col].items():
    print(idx, val)

2063 23.0


In [291]:
reduced_new_races_df.loc[2063]

_url                     giro-d-italia/2019/stage-14
name                                   Giro d'Italia
stage_type                                        RR
points                                           NaN
uci_points                                       NaN
length                                      131000.0
climb_total                                   4187.0
profile                                          4.0
startlist_quality                                896
average_temperature                              NaN
date                             2019-05-25 04:31:35
position                                          99
cyclist                                 scott-davies
cyclist_age                                     23.0
is_tarmac                                       True
is_cobbled                                     False
is_gravel                                      False
cyclist_team           team/team-dimension-data-2019
delta                                         

In [298]:
righe = reduced_new_races_df.loc[reduced_new_races_df[col] != reduced_races_df[col], col]

In [301]:
col = 'cyclist_age'
print(f"{col} gained: {righe}")

cyclist_age gained: 2063    23.0
Name: cyclist_age, dtype: float64


In [267]:
from utility.data_understanding import scrape_stages

reduced_new_races_df = pd.DataFrame(scrape_stages(chunks_list[0], races_df, same_races_dict))

Encountered exception 
Encountered exception 
Encountered exception 


In [271]:
all(chunks_list[0] == range(5899)), len(chunks_list), len(chunks_list[0]), reduced_new_races_df.shape

(True, 100, 5899, (5931, 19))

In [270]:
reduced_new_races_df.head()

Unnamed: 0,_url,name,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,RR,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,team/flandria-velda-lano-1978,0.0
1,tour-de-france/1978/stage-6,Tour de France,RR,70.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,team/ti-raleigh-mc-gregor-1978,0.0
2,tour-de-france/1978/stage-6,Tour de France,RR,50.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,team/flandria-velda-lano-1978,0.0
3,tour-de-france/1978/stage-6,Tour de France,RR,40.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,team/c-a-1978,0.0
4,tour-de-france/1978/stage-6,Tour de France,RR,32.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,team/miko-mercier-1978,0.0


---

Besides the new features added by scraping (such as `stage_type`), the two dataframes also differ in `points`, `uci_points` and `cyclist_team`, whose values were replaced by the scraped ones.

In fact, what we've discovered is that the original `points` column assigns the winner's points to every rider in the stage, whereas in reality riders are awarded different points according to their finishing position.
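
A quick way to spot this kind of problem is to count how many distinct `points` values each stage has: a stage with several riders but a single value is suspicious. The sketch below uses made-up rows (stage `a` repeats the winner's points, stage `b` does not); only the column names `_url`, `position` and `points` come from the dataset.

```python
import pandas as pd

# Illustrative rows only: stage "a" repeats the winner's points, stage "b" has per-position points
df = pd.DataFrame({
    "_url": ["a", "a", "a", "b", "b", "b"],
    "position": [0, 1, 2, 0, 1, 2],
    "points": [100.0, 100.0, 100.0, 100.0, 70.0, 50.0],
})

# Number of distinct points values within each stage (NaN is ignored by nunique)
distinct_points = df.groupby("_url")["points"].nunique()
# Number of riders listed in each stage
stage_sizes = df.groupby("_url").size()

# Stages with more than one rider but a single points value
suspicious = distinct_points[(distinct_points == 1) & (stage_sizes > 1)].index.tolist()
print(suspicious)
```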

In [220]:
reduced_new_races_df.head()

Unnamed: 0,_url,name,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,RR,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,team/flandria-velda-lano-1978,0.0
1,tour-de-france/1978/stage-6,Tour de France,RR,70.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,team/ti-raleigh-mc-gregor-1978,0.0
2,tour-de-france/1978/stage-6,Tour de France,RR,50.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,team/flandria-velda-lano-1978,0.0
3,tour-de-france/1978/stage-6,Tour de France,RR,40.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,team/c-a-1978,0.0
4,tour-de-france/1978/stage-6,Tour de France,RR,32.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,team/miko-mercier-1978,0.0


In [221]:
reduced_new_races_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5931 entries, 0 to 5930
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   _url                 5931 non-null   object 
 1   name                 5931 non-null   object 
 2   stage_type           5931 non-null   object 
 3   points               942 non-null    float64
 4   uci_points           75 non-null     float64
 5   length               5931 non-null   float64
 6   climb_total          4060 non-null   float64
 7   profile              4539 non-null   float64
 8   startlist_quality    5931 non-null   int64  
 9   average_temperature  474 non-null    float64
 10  date                 5931 non-null   object 
 11  position             5931 non-null   int64  
 12  cyclist              5931 non-null   object 
 13  cyclist_age          5931 non-null   float64
 14  is_tarmac            5931 non-null   bool   
 15  is_cobbled           5931 non-null   b

In [222]:
reduced_races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


In [223]:
reduced_races_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5931 entries, 0 to 5930
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   _url                 5931 non-null   object 
 1   name                 5931 non-null   object 
 2   points               5931 non-null   float64
 3   uci_points           2701 non-null   float64
 4   length               5931 non-null   float64
 5   climb_total          4060 non-null   float64
 6   profile              4539 non-null   float64
 7   startlist_quality    5931 non-null   int64  
 8   average_temperature  474 non-null    float64
 9   date                 5931 non-null   object 
 10  position             5931 non-null   int64  
 11  cyclist              5931 non-null   object 
 12  cyclist_age          5930 non-null   float64
 13  is_tarmac            5931 non-null   bool   
 14  is_cobbled           5931 non-null   bool   
 15  is_gravel            5931 non-null   bool  
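
Comparing the two `info()` outputs by eye works (`points` goes from 5931 to 942 non-null values, `uci_points` from 2701 to 75, `cyclist_age` from 5930 to 5931), but the same comparison can be put in a single table. A minimal sketch on toy frames; `old` and `new` stand in for `reduced_races_df` and `reduced_new_races_df`, and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the original and the scraped dataframe (illustrative data only)
old = pd.DataFrame({
    "points": [100.0, 100.0, 100.0],
    "cyclist_age": [22.0, np.nan, 30.0],
})
new = pd.DataFrame({
    "stage_type": ["RR", "RR", "RR"],
    "points": [100.0, 70.0, np.nan],
    "cyclist_age": [22.0, 23.0, 30.0],
})

# Non-null counts per column, aligned on the column name
summary = pd.DataFrame({
    "old_non_null": old.notna().sum(),
    "new_non_null": new.notna().sum(),
})
summary["difference"] = summary["new_non_null"] - summary["old_non_null"]
print(summary)
```

Columns that exist in only one dataframe (here `stage_type`) get `NaN` on the other side, which makes the newly added features easy to see.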

In [None]:
# Save the new (scraped) races dataframe to a CSV file
reduced_new_races_df.to_csv(os.path.join('dataset','races_new_pt1.csv'))