# Uso di `procyclingstats`
Notebook per provare a usare questo pacchetto. By Andrea

## Autoreload

Autoreload allows the notebook to dynamically load code: if we update some helper functions *outside* of the notebook, we do not need to reload the notebook.

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm

import re

# Import procyclingstats library
import procyclingstats as pcs

import os
import sys
sys.path.append('../dataset/')

In [3]:
cyclist_df = pd.read_csv(os.path.join('dataset','cyclists.csv'))
races_df = pd.read_csv(os.path.join('dataset','races.csv'))

## Library usage

Let's try to use the library

### Cyclists

In [4]:
cyclist_df.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


This is an example of an error that might occur when using the scraping tool

In [5]:
getattr(pcs.Rider(f"rider/{cyclist_df.loc[0,"_url"]}"),'height')()

AttributeError: 'NoneType' object has no attribute 'text'

Let's try to augment the dataset using new information about the cyclists, scraped from [procyclingstats](https://www.procyclingstats.com/). 

The `procyclingstats` library allows to create a `Rider` object, whose constructor needs the URL of the desired cyclist, so that the corresponding data from the web page can be scraped. One retrieves data such as date of birth, weight etc of the cyclist just by calling method of the object.

With the next cell of code we try to retrieve data from the web, to augment the dataset.

The function `safe_getattr` is needed because if a feature (e.g. weight) is not present in the webpage, the method that retrieves it (in the previous example `.weight()`) raises an unhandled exception. We want to return a NaN for missing data instead.

We loop through all the rows in the dataframe, and for each cyclist we first try to retrieve the data from the dataset. If it's missing we call the corresponding scraping methods. This is to reduce the computational burden. The approach is justified by the fact that on preliminary exploration we saw that the data is consistent, there shouldn't be errors to correct. So we can copy over the values that are there.

In [10]:
new_cyclists = []

# Helper function to handle exceptions
def safe_getattr(obj, attr, fun:callable = lambda x: x):
    try:
        return fun(getattr(obj, attr)())
    # AttributeError: the attribute (e.g. height) is not in the website
    # IndexError: trying to convert the weight into a number, but the weight is nan (because it's not in the website)
    # ValueError: is risen because bascally the pcs.Rider object is trying to convert the string 'available' (which is what is scraped by the website)
    #               into a month of the year. But of course is not in the list of months, so a ValueError is risen
    except (IndexError, AttributeError, ValueError):
        return np.nan


def width_height_mistaken(scraped_weight, height) -> bool:
    return scraped_weight*100 == height

for i in tqdm.tqdm(range(cyclist_df.shape[0])):
    url = cyclist_df.loc[i, "_url"]
    # This try block is for actually scraping the rider
    try:
        ciclista = pcs.Rider(f"rider/{url}")

        # WIP!!!
        storico_punti = ciclista.points_per_season_history()
        tot_punti = 0
        n_gare = len(storico_punti)
        for dizio in storico_punti:
            tot_punti += dizio['points']


        # If we have the values in our dataframe we use that, otherwise we scrape procyclingstats
        nome = ' '.join(cyclist_df.loc[i,'name'].split()) if not pd.isna(cyclist_df.loc[i,'name']) else ' '.join(ciclista.name().split())
        data_nascita = cyclist_df.loc[i,'birth_year'] if not pd.isna(cyclist_df.loc[i,'birth_year']) else safe_getattr(ciclista, 'birthdate', lambda str: str[:4])
        height = cyclist_df.loc[i,'height'] if not pd.isna(cyclist_df.loc[i,'height']) else safe_getattr(ciclista, 'height', lambda x: x*100)
        # Sometimes the scraper confuses the height for the weight, when it's not there...
        scraped_weight = safe_getattr(ciclista, 'weight')
        weight = cyclist_df.loc[i,'weight'] if not pd.isna(cyclist_df.loc[i,'weight']) else scraped_weight if not width_height_mistaken(scraped_weight, height) else np.nan
        nazionalita = cyclist_df.loc[i,'nationality'] if not pd.isna(cyclist_df.loc[i,'nationality']) else safe_getattr(ciclista, 'nationality')
    except ValueError:
        # If we don't find the cyclist on procyclingstats we basically copy the row from our dataframe
        #pass
        nome = ' '.join(cyclist_df.loc[i,'name'].split())
        data_nascita = cyclist_df.loc[i,'birth_year']
        weight = cyclist_df.loc[i,'weight'] 
        height = cyclist_df.loc[i,'height']
        nazionalita = cyclist_df.loc[i,'nationality']

    cyclist_new_data = {
        '_url': url,
        'name': nome,
        'birth_year': data_nascita,
        'weight': weight,
        'height': height,
        'nationality': nazionalita,
        'points_total': tot_punti,
        'tot_seasons_attended': n_gare
    }    
    new_cyclists.append(cyclist_new_data)
    

    



  0%|          | 0/6134 [00:00<?, ?it/s]

100%|██████████| 6134/6134 [24:49<00:00,  4.12it/s]  


We create the new dataframe

In [14]:
new_cyclist_df = pd.DataFrame(new_cyclists)
new_cyclist_df.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


In [15]:
new_cyclist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3150 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


In [16]:
cyclist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


First, let's save the data that we scraped

In [17]:
# Save the data in the new dataframe
new_cyclist_df.to_csv(os.path.join('dataset','cyclists_new.csv'))

At a first sight, it looks like we've been able to retrieve 72 cyclists' weights from scraping. But by investigating further, it's not the case.

Let's compare the new values that we obtained. Let's print some features of samples from which the weight was originally missing, but it's present in the dataframe

In [45]:
new_cyclist_df.loc[cyclist_df["weight"].isna() & ~new_cyclist_df["weight"].isna(), ['_url', 'name', 'weight', 'height']]

Unnamed: 0,_url,name,weight,height
76,idar-andersen,Idar Andersen,1.82,182.0
219,thomas-bonnet,Thomas Bonnet,1.75,175.0
261,syver-waersted,Syver Wærsted,1.93,193.0
355,loe-van-belle,Loe van Belle,1.84,184.0
403,negasi-abreha,Negasi Haylu Abreha,1.86,186.0
...,...,...,...,...
5836,matheo-vercher,Mattéo Vercher,1.71,171.0
6033,erik-nordsaeter-resell,Erik Nordsæter Resell,1.92,192.0
6063,carlos-galarreta,Carlos Galarreta,1.74,174.0
6088,eric-antonio-fagundez,Eric Antonio Fagúndez,67.00,180.0


These values are of course suspect

In [51]:
x = new_cyclist_df[cyclist_df["weight"].isna() & ~new_cyclist_df["weight"].isna()]['weight']*100 == new_cyclist_df[cyclist_df["weight"].isna() & ~new_cyclist_df["weight"].isna()]['height']
x.sum(), x.shape

(70, (72,))

Fof 70 out of 72 cases, the parser just mistook the height for the weight.

In [58]:
idar_andersen = pcs.Rider("rider/idar-andersen")
print(idar_andersen.weight())
try:
    print(idar_andersen.height())
except AttributeError as exc:
    print(f"Trying to call height method raised the exception: {exc}")

1.82
Trying to call height method raised the exception: 'NoneType' object has no attribute 'text'


In [59]:
cyclist_df.iloc[76]

_url            idar-andersen
name           Idar  Andersen
birth_year             1999.0
weight                    NaN
height                  182.0
nationality            Norway
Name: 76, dtype: object

Indeed, it's the height.

### Races

Let's see if we have more luck with the races

In [6]:
races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


Allora, consideriamo `race = pcs.Stage(f"race/{races_df.loc[i, '_url']}")`. Consideriamo le cose più facili da parsare e recuperare:
- `race.profile_icon()` è `races_df.loc[i, 'profile']`
- `race.race_startlist_quality_score()` è `races_df.loc[0, 'startlist_quality']`, ma non manca nessun valore in questa colonna
- `race.uci_points_scale()` è `races_df[i, 'uci_points']`
- `race.avg_temperature()` è `races_df[i, 'average_temperature']`
- `race.vertical_meters()` è `races_df[i, 'climb_total']`
- `race.distance()` è `races_df.loc[i, 'length']`, ma quest'ulltima è in Km mentre la prima è in metri
- `race.date()` è la prima parte di `races_df.loc[i, 'date']` 


What we do:
- We keep the columns of the dataframe which are complete (i.e. the URL, the name, the terrain,...)
- We use the points and UCI points from procyclingstats. We overwrite those of the dataframe. This is because they're *are relative to the cyclist, not to the race!* 
- If the point/UCI point is 0, we set it to NaN, because 0 encodes missing values for the scraper
- for `age`,`climb_total`, `profile`, `average_temperature` we first try to look into the dataframe, if there is a NaN we scrape



**OSS**: I teams di procyclingstats sono completamente diversi da quelli del dataset

The following block is taken from `Andrea_data_understanding.ipynb`, and basically identifies which races are the same, but have different names in the `name` column. This is checked by comparing the races' `_url`s. 

It also creates a dictionary in which the keys are names of the races (a representative for the equivalence class), and the values are the different names that refer to the same race as the corresponding key.

In [7]:
from utility.data_understanding import check_if_same
from itertools import combinations

# List with all the race names
race_names = np.sort(races_df['name'].unique())

# Initialize a list to store pairs of races that are actually the same
same_races = []
# And a dictionary in which the different values correspond to different names for
# the same race (the race denoted by the key)
same_races_dict = {}

# Iterate through all pairs of race names
for i in range(len(race_names)):
    for j in range(i + 1, len(race_names)):
        race1 = race_names[i]
        race2 = race_names[j]
        # Use the check_if_same function to compare the races
        try:
            same, _, _ = check_if_same(race1, race2, races_df=races_df)
            if same:
                same_races.append((race1, race2))
                
                # Find the representative name
                representative = None
                for key in same_races_dict:
                    if race1 in same_races_dict[key] or race2 in same_races_dict[key]:
                        representative = key
                        break
                
                if representative is None:
                    representative = race1
                
                # Add the races to the dictionary
                if representative not in same_races_dict:
                    same_races_dict[representative] = [race1, race2]
                else:
                    if race1 not in same_races_dict[representative]:
                        same_races_dict[representative].append(race1)
                    if race2 not in same_races_dict[representative]:
                        same_races_dict[representative].append(race2)
        except TypeError:
            print(f"Caught error at races {race_names[i]} and {race_names[j]}")

# Final check
for key in same_races_dict.keys():
    # Check if all the aliases stored in the dictionary are also in the list of pairs of same races
    v = all([pair in same_races for pair in combinations(same_races_dict[key], 2)])
    if not v:
        print(f"Error with {key}")
        break

assert v, "There is some problem"

In [8]:
new_races = []

TRUE_RANGE = races_df.shape[0]
FALSE_RANGE = 1000

# Helper function to handle exceptions
def safe_getattr(obj, attr, fun:callable=lambda x: x):
    try:
        return fun(getattr(obj, attr)())
    except (IndexError, AttributeError, ValueError):
        return np.nan
    

def name_returner(dictionary:dict[str,list[str]], val_to_find:str) -> str|float:
    for key, val in dictionary.items():
        if val_to_find in val:
            return key
    # Well... Turns out that in the dictionary there isn't really everything...
    return val_to_find #np.nan

prev_url = None
for i in tqdm.tqdm(range(FALSE_RANGE)):
    url = races_df.loc[i, '_url']

    # We process only the different URLs
    if url == prev_url:
        continue

    try:
        tappa = pcs.Stage(f"race/{url}")
        same_url_races = races_df[races_df['_url'] == url]

        ## Non-cyclist oriented
        elevazione = races_df.loc[i, 'climb_total'] if not pd.isna(races_df.loc[i, 'climb_total']) else safe_getattr(tappa, 'vertical_meters')
        # In our dataset missing profile is treated as NaN, while the scraper treats them as 0
        profilo = races_df.loc[i, 'profile'] if not pd.isna(races_df.loc[i, 'profile']) else safe_getattr(tappa, 'profile_icon', lambda profile: np.float64(profile[1]) if np.float64(profile[1] != 0) else np.nan)
        # If the temperature isn't there then tappa.avg_temperature() is None. None becomes NaN in the dataframe
        avgtemp = races_df.loc[i, 'average_temperature'] if not pd.isna(races_df.loc[i, 'average_temperature']) else safe_getattr(tappa, 'avg_temperature')

        for idx, row in same_url_races.iterrows():
            posizione, url_ciclista = races_df.loc[idx, ['position', 'cyclist']]

            ## Cyclist-oriented
            lista = tappa.results('rider_url', 'age', 'pcs_points', 'uci_points', 'team_url')   
            diz_valori_ciclista = lista[posizione]  # Luckily the values in lista are in the same order as that of the races_df
            # I take the initiative, and correct the points (and UCI-points)!
            # In our dataset missing points are treated as NaN, while the scraper treats them as 0
            punti = diz_valori_ciclista.get('pcs_points', np.nan) if diz_valori_ciclista.get('pcs_points', np.nan) != 0 else np.nan
            # In our dataset missing UCI points are treated as NaN, while the scraper treats them as 0
            punti_uci = diz_valori_ciclista.get('uci_points', np.nan) if diz_valori_ciclista.get('uci_points', np.nan) != 0 else np.nan
            eta = races_df.loc[idx, 'cyclist_age'] if not pd.isna(races_df.loc[idx, 'cyclist_age']) else diz_valori_ciclista.get('age', np.nan)
    
            # The teams in the dataset are completely different from those of pcs...
            team = races_df.loc[idx, 'cyclist_team'] if not races_df.loc[idx, 'cyclist_team'] else diz_valori_ciclista.get('team_url', np.nan)     

            # And now let's create the new entry
            race_new_data = {
                '_url': url,
                'name': ' '.join(name_returner(same_races_dict, races_df.loc[i, 'name']).split()),
                'points': punti,
                'uci_points': punti_uci,
                'length': races_df.loc[i, 'length'],
                'climb_total': elevazione,
                'profile': profilo,
                'startlist_quality': races_df.loc[i, 'startlist_quality'],
                'average_temperature': avgtemp,
                'date': races_df.loc[i, 'date'],
                'position': posizione,
                'cyclist': url_ciclista,
                'cyclist_age': eta,
                # Until we know how to get these...
                'is_tarmac': races_df.loc[i, 'is_tarmac'],
                'is_cobbled': races_df.loc[i, 'is_cobbled'], # ... as if they weren't all False...
                'is_gravel': races_df.loc[i, 'is_gravel'],
                'cyclist_team': team,
                'delta': races_df.loc[i, 'delta']
            }
            new_races.append(race_new_data)       

        # Update url
        prev_url = url
    except ValueError:
        print(f"Encountered error at url {url}, iteration {i}")

  0%|          | 0/1000 [00:00<?, ?it/s]

100%|██████████| 1000/1000 [00:09<00:00, 102.30it/s]


In [13]:
reduced_new_races_df = pd.DataFrame(new_races)
reduced_races_df = races_df[:reduced_new_races_df.shape[0]]
reduced_new_races_df.shape, reduced_races_df.shape

((1075, 18), (1075, 18))

In [14]:
reduced_new_races_df.equals(reduced_races_df)

False

Of course the two dataframes are not equal, because for sure `points`, `uci_points` and `cyclist_team` have been changed

In fact, what we've discovered is that the `points` value gives the same points (those of the winner) to all the contenders of the stage, but in reality they've been awarded different points, according to the order of arrival.

In [103]:
reduced_new_races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,team/flandria-velda-lano-1978,0.0
1,tour-de-france/1978/stage-6,Tour de France,70.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,team/ti-raleigh-mc-gregor-1978,0.0
2,tour-de-france/1978/stage-6,Tour de France,50.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,team/flandria-velda-lano-1978,0.0
3,tour-de-france/1978/stage-6,Tour de France,40.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,team/c-a-1978,0.0
4,tour-de-france/1978/stage-6,Tour de France,32.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,team/miko-mercier-1978,0.0


In [104]:
reduced_new_races_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1075 entries, 0 to 1074
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   _url                 1075 non-null   object 
 1   name                 1075 non-null   object 
 2   points               110 non-null    float64
 3   uci_points           23 non-null     float64
 4   length               1075 non-null   float64
 5   climb_total          956 non-null    float64
 6   profile              1075 non-null   float64
 7   startlist_quality    1075 non-null   int64  
 8   average_temperature  164 non-null    float64
 9   date                 1075 non-null   object 
 10  position             1075 non-null   int64  
 11  cyclist              1075 non-null   object 
 12  cyclist_age          1075 non-null   float64
 13  is_tarmac            1075 non-null   bool   
 14  is_cobbled           1075 non-null   bool   
 15  is_gravel            1075 non-null   b

In [105]:
reduced_races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


In [106]:
reduced_races_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   _url                 1000 non-null   object 
 1   name                 1000 non-null   object 
 2   points               1000 non-null   float64
 3   uci_points           679 non-null    float64
 4   length               1000 non-null   float64
 5   climb_total          881 non-null    float64
 6   profile              881 non-null    float64
 7   startlist_quality    1000 non-null   int64  
 8   average_temperature  164 non-null    float64
 9   date                 1000 non-null   object 
 10  position             1000 non-null   int64  
 11  cyclist              1000 non-null   object 
 12  cyclist_age          1000 non-null   float64
 13  is_tarmac            1000 non-null   bool   
 14  is_cobbled           1000 non-null   bool   
 15  is_gravel            1000 non-null   bo

---

In [22]:
# checking format of the date
all(reduced_new_races_df['date'].apply(lambda data: bool(re.match(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}",data))))

True

In [16]:
for col in reduced_new_races_df.columns:
    truth1 = all(reduced_new_races_df[col] == reduced_races_df[col])
    diversi = reduced_new_races_df[col] != reduced_races_df[col]
    truth2_1 = pd.isna(reduced_new_races_df.loc[diversi, col]).all() 
    truth2_2 = pd.isna(reduced_races_df.loc[diversi, col]).all() 
    truth = truth1 or (truth2_1 and truth2_2)
    print(f"Columns {col} are equal across the two datasets: {truth}")

Columns _url are equal across the two datasets: True
Columns name are equal across the two datasets: False
Columns points are equal across the two datasets: False
Columns uci_points are equal across the two datasets: False
Columns length are equal across the two datasets: True
Columns climb_total are equal across the two datasets: True
Columns profile are equal across the two datasets: True
Columns startlist_quality are equal across the two datasets: True
Columns average_temperature are equal across the two datasets: True
Columns date are equal across the two datasets: False
Columns position are equal across the two datasets: True
Columns cyclist are equal across the two datasets: True
Columns cyclist_age are equal across the two datasets: True
Columns is_tarmac are equal across the two datasets: True
Columns is_cobbled are equal across the two datasets: True
Columns is_gravel are equal across the two datasets: True
Columns cyclist_team are equal across the two datasets: False
Columns 

In [23]:
reduced_new_races_df['profile'].unique(), reduced_races_df['profile'].unique()

(array([ 1.,  5., nan,  3.,  2.]), array([ 1.,  5., nan,  3.,  2.]))

In [24]:
colonna = 'profile'

In [25]:
reduced_races_df[colonna].isna().sum()

119

In [197]:
diversi = reduced_new_races_df[colonna] != reduced_races_df[colonna]
#reduced_new_races_df.loc[diversi, colonna]
diversi

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: profile, Length: 1000, dtype: bool

In [201]:
diversi.sum()

119

In [190]:
reduced_races_df.loc[999]

_url                   volta-a-catalunya/2018/stage-7
name                       Volta Ciclista a Catalunya
points                                           50.0
uci_points                                       50.0
length                                       154800.0
climb_total                                    2008.0
profile                                           2.0
startlist_quality                                 659
average_temperature                               NaN
date                              2018-03-25 03:29:45
position                                           37
cyclist                        sindre-skjoestad-lunke
cyclist_age                                      25.0
is_tarmac                                        True
is_cobbled                                      False
is_gravel                                       False
cyclist_team                             ireland-2005
delta                                           101.0
Name: 999, dtype: object

In [191]:
reduced_new_races_df.loc[999]

_url                   volta-a-catalunya/2018/stage-7
name                       Volta Ciclista a Catalunya
points                                              0
uci_points                                        NaN
length                                       154800.0
climb_total                                    2008.0
profile                                           2.0
startlist_quality                                 659
average_temperature                               NaN
date                              2018-03-25 03:29:45
position                                           37
cyclist                        sindre-skjoestad-lunke
cyclist_age                                      25.0
is_tarmac                                        True
is_cobbled                                      False
is_gravel                                       False
cyclist_team                team/fortuneo-samsic-2018
delta                                           101.0
Name: 999, dtype: object

---