# Recurrent Neural Networks

The 4th task is about classification *on tabular data*. What if the approach was wrong? What would happen if we changed the nature of the data?

This notebook uses time-series models to tackle the classification task of the Data Mining project. However, these models shouldn't be compared directly with those trained on tabular data: the validation set is different, and the data they use is inherently different. It is still part of task 4, in a sense...

Maybe I'll write a better introduction for this later; for now, let's move on.

## Autoreload

Autoreload allows the notebook to dynamically load code: if we update some helper functions *outside* of the notebook, we do not need to reload the notebook.

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

As usual, we import all the packages we need.

In [2]:
import procyclingstats as pcs
# Base libraries
import os
import sys
# Basic data manipulation libraries
import numpy as np
import pandas as pd
import itertools
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns


# Otherwise nothing will be found
sys.path.append(os.path.abspath(os.path.join('..')))

Import from our own utilities

In [3]:
from utility.classification_utility import make_dataset_for_RNN_classification, TO_NOT_USE_COLS, TO_REMOVE_COLS

Other constants, global variables

In [4]:
IMAGES_DIR = os.path.join('Images', 'Clustering_imgs', 'recurrent_models_imgs')

RACES_URL = os.path.join('..', 'dataset', 'races_cleaned.csv')
CYCLISTS_URL = os.path.join('..', 'dataset', 'cyclists_cleaned.csv')

# we define a random state to make the results reproducible
RANDOM_STATE = 42
RUN_SLOW_STUFF = False

## Dataset Creation

As said, the creation of the dataset is not trivial at all.

In [5]:
merged = make_dataset_for_RNN_classification(cyclists_url=CYCLISTS_URL, races_url=RACES_URL)
merged.shape

(523073, 39)

The cell above takes a long time to run. That's because there's a lot going on.

`make_dataset_for_RNN_classification` also removes a few more columns:
- `time_seconds`: just the `time` column converted from a string into a number of seconds (int). With the `timedelta64[s]` datatype, this is not necessary.
- `cyclist_age_cyc`: just `2024 - birth_year` (2024 because it was created last year).
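
To illustrate why a separate `time_seconds` column is redundant, here is a minimal sketch on made-up values (not taken from the dataset). It assumes the times are parsed with `pd.to_timedelta`; the actual parsing inside `make_dataset_for_RNN_classification` is not shown in this notebook and may differ.

```python
import pandas as pd

# Toy finishing times as strings (made-up values, for illustration only)
times = pd.Series(["00:28:18", "04:50:25"])

# Parse the strings into a timedelta column
as_timedelta = pd.to_timedelta(times)

# The number of seconds can be recovered on demand, so storing it is unnecessary
seconds = as_timedelta.dt.total_seconds().astype(int)
print(seconds.tolist())  # [1698, 17425]
```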

In [21]:
merged.head()

| cyclist | | _url_rac | name_rac | stage | stage_type | points | uci_points | length | climb_total | profile | startlist_quality | ... | partecipants_number | target | delta_shifted | stamina_index_shifted | time_shifted | age_performance_index_shifted | points_shifted | average_speed_shifted | target_shifted | position_shifted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aad-van-den-hoek | 72148 | tour-de-suisse/1975/stage-9b | Tour de Suisse | stage-9b | ITT | 7.0 | 0.0 | 20400.0 | | | 549 | ... | 6 | True | NaT | | NaT | | | | | |
| aad-van-den-hoek | 398199 | paris-nice/1976/prologue | Paris - Nice | prologue | RR | 0.0 | 0.0 | 6500.0 | | | 779 | ... | 15 | True | 0 days 00:01:27 | | 0 days 00:28:18 | -21.241584 | 7.0 | 12.014134 | True | 5.0 |
| aad-van-den-hoek | 408882 | omloop-het-nieuwsblad/1977/result | Omloop Het Nieuwsblad ME | | RR | 0.0 | 0.0 | 201000.0 | | | 595 | ... | 38 | False | 0 days 00:00:24 | | 0 days 00:08:56 | | 0.0 | 12.126866 | True | 14.0 |
| aad-van-den-hoek | 441142 | omloop-het-nieuwsblad/1978/result | Omloop Het Nieuwsblad ME | | RR | 0.0 | 0.0 | 218000.0 | | | 493 | ... | 31 | False | 0 days 00:09:25 | | 0 days 04:50:25 | | 0.0 | 11.535151 | False | 37.0 |
| aad-van-den-hoek | 56937 | tour-de-france/1978/prologue | Tour de France | prologue | RR | 0.0 | 0.0 | 5200.0 | 27.0 | 1.0 | 1241 | ... | 110 | False | 0 days 00:04:00 | | 0 days 05:11:30 | | 0.0 | 11.663991 | False | 26.0 |


We notice that the `cyclist` column has now become an index. In fact, `make_dataset_for_RNN_classification`:
- **Groups** the rows by cyclist, in order to put together the records of the same cyclist, to create time series of the cyclists
- **Sorts** the rows by date, inside each group
- **Shifts** some columns
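
The three steps above can be sketched on a toy table as follows. This is only an illustration under assumptions: the data is made up, only `position` is shifted, and the real `make_dataset_for_RNN_classification` may implement the steps differently.

```python
import pandas as pd

# Toy results table (made-up values); column names mirror the notebook's
df = pd.DataFrame({
    "cyclist": ["b", "a", "a", "b", "a"],
    "date": pd.to_datetime(
        ["2001-05-01", "2000-03-01", "1999-04-01", "2000-06-01", "2001-07-01"]
    ),
    "position": [3, 10, 25, 7, 1],
})

# Group + sort: put each cyclist's rows together, ordered by date
ordered = df.sort_values(["cyclist", "date"]).reset_index(drop=True)

# Shift: each row receives the value of the same cyclist's previous race
ordered["position_shifted"] = ordered.groupby("cyclist")["position"].shift(1)

# Make `cyclist` the outer index level, as in the table shown above
ordered = ordered.set_index("cyclist", append=True).swaplevel()
print(ordered)
```

The first race of each cyclist has no predecessor, so its shifted value is missing.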

Let's check that all dates are sorted

In [42]:
all_sorted = True
for cyclist in merged.index.levels[0]:
	truth = merged.loc[cyclist, 'date'].is_monotonic_increasing
	if not truth:
		print(f"For {cyclist}, dates are not sorted")
		all_sorted = False

if all_sorted:
	print("All dates are sorted")

All dates are sorted


A few more words on the column shifting.

For the tabular-data models, we can't use the current values of some columns; we can only use past values to compute the current prediction. For example, we can't use the current value of the `position` feature to predict if the cyclist is in the top 20, but we can use all the past values as we see fit.

In the exploration, we identified features such as `average_position` that make use of these "forbidden values", as they make use of all the data available. Thus, they have to be computed for each timestep. For example, each record of a cyclist has the `average_position` value it had at the point of his career that corresponds to this record, excluding the current `position` value.

In this way, classifiers based on tabular data can make use of these weird features.
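
As a hedged illustration of such a per-timestep feature, the sketch below computes a running `average_position` that excludes the current race. The values are made up, and this is one possible way to do it, not necessarily how the project computes the feature.

```python
import pandas as pd

# Toy data (made-up values), assumed already sorted by date within each cyclist
df = pd.DataFrame({
    "cyclist": ["a", "a", "a", "b", "b"],
    "position": [10, 20, 30, 5, 15],
})

# Shift first, so the current race is excluded, then take the running mean
df["average_position"] = (
    df.groupby("cyclist")["position"]
      .transform(lambda s: s.shift(1).expanding().mean())
)
print(df["average_position"].tolist())  # [nan, 10.0, 15.0, nan, 5.0]
```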

Well, for sequence models we don't need these summarizations. We leave it to the model to use all the values of the sequence seen so far as it sees fit during training. The only thing we have to make sure of is that it does not use the current one. Hence the column shifting.

Let's check that the shift is successful

In [63]:
def inefficient_comparer(l1:pd.Series, l2:pd.Series) -> bool:
	for i, j in zip(l1, l2):
		if (pd.isna(i) and pd.isna(j)) or i == j:
			continue
		else:
			return False
	return True

all_shifted_correctly = True
for cyclist in merged.index.levels[0]:
	for col in set(TO_NOT_USE_COLS) - {'time_seconds'}:
		l1 = merged.loc[cyclist, col][:-1].reset_index(drop=True)
		l2 = merged.loc[cyclist, f"{col}_shifted"][1:].reset_index(drop=True)
		truth = inefficient_comparer(l1, l2)
		if not truth:
			print(f"For {cyclist}, {col} and {col}_shifted are not shifted correctly")
			all_shifted_correctly = False

if all_shifted_correctly:
	print("All columns are shifted correctly")

All columns are shifted correctly
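
The loop above is slow because it iterates over every cyclist in Python. A vectorized alternative could look like the sketch below. The helper name `shifted_correctly` is hypothetical, and it assumes that the cyclist is the first index level and that the shift is by exactly one step. Unlike the loop, it also requires the first shifted value of each cyclist to be missing.

```python
import pandas as pd

def shifted_correctly(df: pd.DataFrame, col: str) -> bool:
    """Check that `{col}_shifted` equals `col` shifted by one row per cyclist."""
    # Recompute the shift within each group of the first index level
    expected = df.groupby(level=0)[col].shift(1)
    actual = df[f"{col}_shifted"]
    # NaN != NaN, so positions missing on both sides are treated as a match
    both_missing = expected.isna() & actual.isna()
    return bool(((expected == actual) | both_missing).all())

# Toy frame (made-up values) with the cyclist as the first index level
toy = pd.DataFrame(
    {"position": [25, 10, 1, 7, 3]},
    index=pd.MultiIndex.from_tuples(
        [("a", 0), ("a", 1), ("a", 2), ("b", 3), ("b", 4)]
    ),
)
toy["position_shifted"] = toy.groupby(level=0)["position"].shift(1)

# A deliberately wrong version: the "shifted" column is not shifted at all
broken = toy.copy()
broken["position_shifted"] = broken["position"]

print(shifted_correctly(toy, "position"))     # True
print(shifted_correctly(broken, "position"))  # False
```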


In [35]:
merged.columns

Index(['_url_rac', 'name_rac', 'stage', 'stage_type', 'points', 'uci_points',
       'length', 'climb_total', 'profile', 'startlist_quality', 'date',
       'position', 'cyclist_age_rac', 'is_tarmac', 'delta', 'time',
       'average_speed', 'steepness', 'season', 'is_staged', 'race_country',
       'age_performance_index', 'quality_adjusted_points', 'stamina_index',
       'birth_year', 'weight', 'height', 'nationality', 'bmi',
       'partecipants_number', 'target', 'delta_shifted',
       'stamina_index_shifted', 'time_shifted',
       'age_performance_index_shifted', 'points_shifted',
       'average_speed_shifted', 'target_shifted', 'position_shifted'],
      dtype='object')

There are a ton of columns now. I don't like it, but it is what it is.

A recap, for myself:
- To **drop for sure**:
	- `_url_rac`
- To **maybe drop**:
	- `stamina_index`, `stamina_index_shifted`, `bmi`, `age_performance_index`, `age_performance_index_shifted`, since they are functions of the other features
	- `name_rac`, `stage`
	- `is_tarmac`, please
- **Not to be used for training the RNNs**:
	- `delta`, `stamina_index`, `points`, `uci_points`, `average_speed`, `position`, `time`

In [69]:
merged['race_country'].unique()

array(['Switzerland', 'France', 'Belgium', 'Spain', 'Italy',
       'Netherlands', nan, 'UAE', 'Canada'], dtype=object)

---

In [7]:
from utility.classification_utility import TO_RECOMPUTE_COLS, TO_NOT_USE_COLS, TO_KEEP_UNCHANGED_COLS
