# Recurrent Neural Networks

Task 4 is about classification *on tabular data*. What if that approach is the wrong one? What would happen if we changed the nature of the data?

This notebook uses time-series models to tackle the classification task of the Data Mining project. These models should not be compared directly with the ones trained on tabular data: the validation set is different, and the data they consume is inherently different. It is still part of task 4, in a sense.

Maybe I'll write a better introduction later; for now, let's move on.

## Autoreload

Autoreload lets the notebook pick up code changes dynamically: if we update some helper functions *outside* of the notebook, we do not need to restart the kernel or re-import them by hand.

In [1]:
%load_ext autoreload
%autoreload 2

## Imports

As usual, we import all the packages we need.

In [2]:
import procyclingstats as pcs
# Base libraries
import os
import sys
# Basic data manipulation libraries
import numpy as np
import pandas as pd
import itertools
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns


# Add the parent directory to the import path, otherwise our own packages (e.g. `utility`) will not be found
sys.path.append(os.path.abspath(os.path.join('..')))

Import from our own utilities

In [3]:
from utility.classification_utility import get_merged_dataset

Other constants and global variables

In [4]:
IMAGES_DIR = os.path.join('Images', 'Clustering_imgs', 'recurrent_models_imgs')

RACES_URL = os.path.join('..', 'dataset', 'races_cleaned.csv')
CYCLISTS_URL = os.path.join('..', 'dataset', 'cyclists_cleaned.csv')

# we define a random state to make the results reproducible
RANDOM_STATE = 42
RUN_SLOW_STUFF = False
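
As a quick illustration of why a fixed seed matters, the sketch below (not part of the original pipeline) shows that two NumPy generators created with the same `RANDOM_STATE` produce identical draws. Using `np.random.default_rng` is an assumption made for the example; the models used later may consume the seed in a different way.

```python
import numpy as np

RANDOM_STATE = 42

# Two independent generators seeded with the same value
rng_a = np.random.default_rng(RANDOM_STATE)
rng_b = np.random.default_rng(RANDOM_STATE)

# Draw five integers in [0, 100) from each generator
draws_a = rng_a.integers(0, 100, size=5)
draws_b = rng_b.integers(0, 100, size=5)

# Same seed, same sequence of draws
same_draws = bool(np.array_equal(draws_a, draws_b))
print(same_draws)
```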

## Dataset Creation

Creating the dataset is not trivial at all. We start by loading the merged cyclists and races data.

In [5]:
merged = get_merged_dataset(cyclists=CYCLISTS_URL, races=RACES_URL)
merged.shape

(523073, 38)

Let's convert some columns to the time-related datatypes of NumPy and pandas.

In [6]:
merged['date'] = merged['date'].astype('datetime64[s]')
merged['delta'] = merged['delta'].astype('timedelta64[s]')
merged['time'] = merged['time'].astype('timedelta64[s]')
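
To see what this kind of conversion gives us, here is a minimal sketch on toy data. The values are made up, and it uses `pd.to_datetime` and `pd.to_timedelta` rather than `astype`, assuming `date` holds ISO date strings and `delta` a number of seconds; the real columns may be stored differently.

```python
import pandas as pd

# Toy frame with made-up values (not taken from the real dataset)
toy = pd.DataFrame({
    'date': ['1999-06-17', '2004-03-20'],
    'delta': [0.0, 95.0],
})

# Parse strings into datetimes and seconds into timedeltas
toy['date'] = pd.to_datetime(toy['date'])
toy['delta'] = pd.to_timedelta(toy['delta'], unit='s')

# With proper time dtypes, the `.dt` accessor exposes the components
years = toy['date'].dt.year.tolist()
gap_seconds = toy['delta'].dt.total_seconds().tolist()
print(toy.dtypes)
```

Once the columns have time dtypes, operations such as sorting by date or subtracting two dates work without any manual parsing.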

Now we can remove some more columns:
- `time_seconds`: just the `time` column converted from a string into a number of seconds (int). With the `timedelta64[s]` datatype, it is no longer necessary
- `cyclist_age_cyc`: just `2024 - birth_year` (2024 because the column was created last year)
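
To make the second point concrete, the sketch below uses toy data (hypothetical riders and dates) to contrast `2024 - birth_year`, which is the same on every row of a cyclist, with an age computed at race time. The `age_at_race` formula is an assumption made for illustration and is not necessarily how `cyclist_age_rac` was built.

```python
import pandas as pd

# Toy data: two races for one rider, one race for another (made-up values)
toy = pd.DataFrame({
    'cyclist': ['rider-a', 'rider-a', 'rider-b'],
    'birth_year': [1967, 1967, 1990],
    'date': pd.to_datetime(['1993-03-08', '2005-03-14', '2015-07-04']),
})

# Fixed reference year: identical on every row of the same cyclist
toy['cyclist_age_cyc'] = 2024 - toy['birth_year']

# Age at race time (year difference only): changes from race to race
toy['age_at_race'] = toy['date'].dt.year - toy['birth_year']

print(toy[['cyclist', 'cyclist_age_cyc', 'age_at_race']])
```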

In [7]:
merged.drop(columns=['time_seconds', 'cyclist_age_cyc'], inplace=True)

In [8]:
merged.columns


Index(['_url_rac', 'name_rac', 'stage', 'stage_type', 'points', 'length',
       'climb_total', 'profile', 'startlist_quality', 'date', 'position',
       'cyclist', 'cyclist_age_rac', 'is_tarmac', 'delta', 'time',
       'average_speed', 'steepness', 'season', 'is_staged', 'race_country',
       'age_performance_index', 'quality_adjusted_points', 'stamina_index',
       'weight', 'height', 'nationality', 'bmi', 'race_count',
       'experience_level', 'total_points', 'avg_points_per_race',
       'average_position', 'avg_speed_cyclist', 'cyclist_age_cyc',
       'mean_stamina_index'],
      dtype='object')

In [9]:
cipollini_df = merged[merged['cyclist'] == 'mario-cipollini']
cipollini_df

| | _url_rac | name_rac | stage | stage_type | points | length | climb_total | profile | startlist_quality | date | ... | nationality | bmi | race_count | experience_level | total_points | avg_points_per_race | average_position | avg_speed_cyclist | cyclist_age_cyc | mean_stamina_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 285 | volta-a-catalunya/1999/prologue | Volta Ciclista a Catalunya | prologue | RR | 0.0 | 8100.0 | | | 804 | 1999-06-17 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 1436 | milano-sanremo/2004/result | Milano-Sanremo | | RR | 5.0 | 294000.0 | | | 1400 | 2004-03-20 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 2502 | giro-d-italia/1999/stage-8 | Giro d'Italia | stage-8 | RR | 0.0 | 253000.0 | | 5.0 | 1057 | 1999-05-22 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 3151 | tirreno-adriatico/2005/stage-6 | Tirreno-Adriatico | stage-6 | RR | 4.0 | 164000.0 | | 1.0 | 1040 | 2005-03-14 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 4753 | tirreno-adriatico/1999/stage-4 | Tirreno-Adriatico | stage-4 | RR | 0.0 | 197000.0 | | | 1150 | 1999-03-13 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 519003 | giro-d-italia/1996/stage-10 | Giro d'Italia | stage-10 | RR | 0.0 | 164000.0 | | 2.0 | 1034 | 1996-05-28 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 519948 | milano-sanremo/1996/result | Milano-Sanremo | | RR | 80.0 | 294000.0 | | | 1580 | 1996-03-23 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 520116 | tirreno-adriatico/2003/stage-1 | Tirreno-Adriatico | stage-1 | RR | 50.0 | 178000.0 | | | 882 | 2003-03-13 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |
| 520274 | paris-nice/1993/stage-1 | Paris - Nice | stage-1 | RR | 50.0 | 208500.0 | | | 1182 | 1993-03-08 | ... | Italy | 21.555947 | 373.0 | pro | 9807.0 | 26.292225 | 47.636842 | 11.017329 | 57.0 | 17.545851 |


In [39]:
cipollini_df['cyclist_age_cyc']

285       57.0
1436      57.0
2502      57.0
3151      57.0
4753      57.0
          ... 
519003    57.0
519948    57.0
520116    57.0
520274    57.0
520570    57.0
Name: cyclist_age_cyc, Length: 344, dtype: float64

In [42]:
cipollini_df['cyclist_age_rac'].unique()

array([32., 37., 38., 30., 31., 35., 29., 34., 28., 33., 25., 36., 24.,
       26., 23., 27.])
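
The rows of a single cyclist are not stored in chronological order, while a time-series model consumes each cyclist's races as an ordered sequence. One possible way to get there, sketched on toy data with made-up values (the actual preprocessing in this project may differ), is to sort by `cyclist` and `date` and then collect one sequence per cyclist:

```python
import pandas as pd

# Toy data with races deliberately out of chronological order
toy = pd.DataFrame({
    'cyclist': ['rider-a', 'rider-b', 'rider-a', 'rider-a'],
    'date': pd.to_datetime(['2004-03-20', '2015-07-04', '1999-06-17', '2005-03-14']),
    'points': [5.0, 20.0, 0.0, 4.0],
})

# Sort so that each cyclist's races appear from oldest to newest
ordered = toy.sort_values(['cyclist', 'date'])

# One chronologically ordered list of points per cyclist
sequences = {
    cyclist: group['points'].tolist()
    for cyclist, group in ordered.groupby('cyclist')
}
print(sequences)
```

Sequences built this way have different lengths per cyclist, so a recurrent model would typically need padding or truncation on top of this step.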