# Transport demand

## `Tutorial session: Artificial neural networks`

**June 2024**<br>
**Gabriel Nova & Sander van Cranenburgh** <br>
**G.N.Nova@tudelft.nl** <br>

### `Application: Estimating the Value of Travel Time`

In this lab session, we will investigate the "Value of Travel Time" (VTT) distribution. The VTT of a traveller reflects the amount of money the traveller is **willing to pay** to reduce their travel time. The VTT is used to determine the benefits of new infrastructure projects. As travel time savings are the dominant and most salient benefits of new infrastructure, accurate inference of the distribution of the VTT is crucial for a rigorous underpinning of policy decisions. <br>

During this lab, we will apply Mixed Logit choice models. We aim to uncover how tastes for travel time and travel cost are distributed in the population. Most of the analyses in this lab session are carried out in the so-called willingness-to-pay space. Willingness-to-pay space facilitates the inference of the VTT distribution.<br>
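
To make the idea of willingness-to-pay space concrete, the short sketch below compares the two ways of writing the same linear utility function. In utility space the VTT is the ratio of the time and cost parameters; in WTP space the VTT appears directly as a parameter. All numbers (`beta_tc`, `beta_tt`, the costs and the times) are hypothetical values chosen for illustration, not estimates from the Norwegian data.

```python
import numpy as np

# Hypothetical taste parameters (illustrative values, not estimates from the data)
beta_tc = -0.4   # marginal utility of cost [utils per euro]
beta_tt = -0.1   # marginal utility of time [utils per minute]

# In utility space, the VTT is the ratio of the two marginal utilities
vtt = beta_tt / beta_tc          # [euro per minute]

# Two hypothetical alternatives: cost in euros, time in minutes
cost = np.array([10.0, 14.0])
time = np.array([60.0, 45.0])

# Utility space: V = beta_tc * cost + beta_tt * time
V_utility_space = beta_tc * cost + beta_tt * time

# WTP space: V = beta_tc * (cost + vtt * time)
V_wtp_space = beta_tc * (cost + vtt * time)

print(f'VTT: {60 * vtt:.2f} euro per hour')
print('Same utilities in both spaces:', np.allclose(V_utility_space, V_wtp_space))
```

Both specifications give identical utilities. The advantage of WTP space is that a distribution can be placed on `vtt` itself, rather than on a ratio of two parameters.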

For this study, we will use Stated Choice (SC) data (`Norway_VTT_2009.csv`) collected in 2009 to compute the Norwegian VTT. In this SC experiment, respondents faced nine choice tasks involving two alternatives and two attributes (travel cost and travel time). The data set consists of 5,832 participants, resulting in a total of 52,488 choice observations. The figure below shows one of the choice tasks. Note that, for the purpose of illustration, we converted the currency unit (Norwegian kroner) into euros.

![SC](data/sc_experiment.png)

**`Learning objectives lab session 02B`**

After completing the following exercises, you will be able to:
* Estimate Mixed Logit models that account for panel data
* Discuss the impact of the number of draws on the modelling outcomes


**`This lab consists of 2 parts and has 2 exercises`**

**Part 1**: The Panel Mixed Logit model

- Exercise 1: "Panel ML model with log-normally distributed VTT"

**Part 2**: Impact of the number of draws on modelling outcomes

- Exercise 2: "Impact of the number of draws"



### `Import packages`

To begin, we will import all the libraries that we will use in this lab.

In [1]:
# Biogeme
import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.biogeme_logging as blog
from biogeme import models
from biogeme.expressions import Beta, Variable, bioDraws, log, MonteCarlo, exp, bioMultSum


# General packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import time
from pathlib import Path
from scipy.stats import norm, lognorm

# Pandas setting to show all columns when displaying a pandas dataframe
pd.set_option('display.max_columns', None)

We invoke a so-called `logger` which enables us to see the progress during estimation.<br>

In [None]:
# Initialize the logger, if it has not been initialized yet
try:
    logger
except NameError:    
    logger = blog.get_screen_logger(level=blog.INFO)
    print('Logger has been initialised')

## `Load and explore the data set` <br>

We will use the same data set as in lab session 2A. So, we load the data and process it in the same way as in lab session 2A.

In [3]:
# 1. Load the data set
data_path = Path('data/Norway_VTT_2009.csv')
df = pd.read_table(data_path, sep=',')

In [4]:
# 2. Keep only entries with Purpose == 5 (long-distance trips) and Mode == 1 (car)
# .copy() makes df an independent dataframe, which avoids a possible SettingWithCopyWarning in the next step
df = df.loc[(df['Purpose'] == 5) & (df['Mode'] == 1)].copy()

In [5]:
# 3. Convert the monetary unit from NOK to euros
NOK2euro_exchange_rate = 9
df[['CostL','CostR']] = df[['CostL','CostR']].div(NOK2euro_exchange_rate)

## `1. The Panel Mixed Logit model`

Thus far, we have worked on the assumption that each choice observation is uncorrelated with all other choice observations. However, this data set contains multiple choices per respondent. In the ML modelling framework, we can also account for correlation in unobserved utility **across observations** of the same individual if we specify it as a panel ML model. In the panel ML model, the likelihood of the sequence of choices *t* = 1..*T* of an individual *n* is given by:  

$L_n(i_1,\ldots,i_T \mid \sigma) = \int_{\beta_n} \prod_{t=1}^{T} P_n(i_t \mid \beta_n)\, f(\beta_n \mid \sigma)\, d\beta_n$

This likelihood does not have a closed-form expression. Therefore, as before, it needs to be approximated using simulation. Let's re-estimate the ML model with a normally distributed VTT while accounting for the panel structure. To do this, we first need to convert the data set into a so-called wide format, in which each row contains all the choices belonging to one individual. Conveniently, Biogeme has a built-in function to do this (but, rather inconveniently, the columns still need to be renamed).
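
The simulation of the panel likelihood can be sketched in a few lines of NumPy. The sketch below is not how Biogeme implements it internally; it only illustrates the principle for a single individual. For each draw of the random VTT, we compute the product of the logit probabilities over all choice tasks, and then average over the draws. The choice tasks, the chosen alternatives and the parameter values (`b_tc`, `vtt_mean`, `vtt_sigma`) are all hypothetical, and we use pseudo-random draws rather than Halton draws to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical choice tasks for one individual: T tasks, two alternatives (L and R)
T = 9
cost_L, cost_R = rng.uniform(5, 20, T), rng.uniform(5, 20, T)      # [euro]
time_L, time_R = rng.uniform(30, 90, T), rng.uniform(30, 90, T)    # [minutes]
chosen = rng.integers(1, 3, T)                                     # 1 = left, 2 = right

# Hypothetical parameters of the model in WTP space
b_tc, vtt_mean, vtt_sigma = -0.4, 0.2, 0.1

def simulated_panel_likelihood(R):
    """Approximate L_n by averaging the product of logit probabilities over R draws."""
    # One draw of the random VTT per replication, held fixed across the T choices
    vtt_draws = vtt_mean + vtt_sigma * rng.standard_normal(R)        # shape (R,)
    V_L = b_tc * (cost_L + vtt_draws[:, None] * time_L)              # shape (R, T)
    V_R = b_tc * (cost_R + vtt_draws[:, None] * time_R)              # shape (R, T)
    P_L = 1.0 / (1.0 + np.exp(V_R - V_L))                            # binary logit
    P_chosen = np.where(chosen == 1, P_L, 1.0 - P_L)
    # Product over the T choices, then average over the R draws
    return P_chosen.prod(axis=1).mean()

L_n = simulated_panel_likelihood(R=1000)
print(f'Simulated likelihood of the choice sequence: {L_n:.6f}')
```

The key difference with the cross-sectional ML model is the order of operations: the product over the choices is taken inside the average over the draws, because the same draw applies to all choices of an individual.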

### `1.1. Preparing a wide Biogeme database for estimating panel ML model`

In this cell we transform our data set into a wide format, and create a new Biogeme database object.

In [None]:
# Create Biogeme database object
biodata = db.Database('Norway2009VTT', df)

# Tell Biogeme which variable is the identifier of the individuals
biodata.panel('RespID')

# Calculate the number of observations per individual
obs_per_ind = biodata.data['RespID'].value_counts().unique()[0]
print(f'Number of observations per individual: {obs_per_ind}')

# Use Biogeme's generate_flat_panel_dataframe to create a wide database in which each row corresponds to one individual
df_wide = biodata.generate_flat_panel_dataframe(identical_columns=None)

# Rename the columns, such that they run from columnname_{0} to columnname_{n} 
renumbered_columns = {col: f'{col.split("_")[1]}_{int(col.split("_")[0])-1}' if len(col.split("_")) == 2 else col for col in df_wide.columns}

# Rename the columns using the dictionary
df_wide.rename(columns=renumbered_columns, inplace=True)

# Create Biogeme database object
biodata_wide = db.Database('Norway2009VTT_wide', df_wide)

# Show the first rows of the wide database
print(f'The wide dataset has a shape of {biodata_wide.data.shape}')
biodata_wide.data.head()
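
To see what the wide format looks like without Biogeme, the sketch below reshapes a tiny, hypothetical long-format data set with plain pandas. It is an illustration of the idea only, not a replacement for the Biogeme function used above. The column names follow the convention of this lab (`CostL_0`, `CostL_1`, and so on); the data values and the helper column `task` are made up.

```python
import pandas as pd

# Hypothetical long-format data: two respondents, two choice tasks each
df_long = pd.DataFrame({
    'RespID': [1, 1, 2, 2],
    'CostL':  [10.0, 12.0, 8.0, 9.0],
    'CostR':  [14.0, 11.0, 7.0, 13.0],
    'Chosen': [1, 2, 2, 1],
})

# Number the choice tasks within each respondent, starting at 0
df_long['task'] = df_long.groupby('RespID').cumcount()

# Pivot so that each respondent becomes one row
df_wide_toy = df_long.pivot(index='RespID', columns='task')

# Flatten the two-level column index into names such as CostL_0 and CostL_1
df_wide_toy.columns = [f'{name}_{task}' for name, task in df_wide_toy.columns]
df_wide_toy = df_wide_toy.reset_index()

print(df_wide_toy)
```

Each respondent now occupies a single row, which is what allows the model to multiply the probabilities of all choices of that respondent.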

### `1.2. Panel ML model with normally distributed VTT`

In [7]:
# Give the model a name
model_name = 'Panel ML WTP space with normally distributed vtt'

# Parameters definition enabling the construction of random parameters
vtt       = Beta('vtt',       0.4, None, None, 0)
B_tc      = Beta('b_tc',     -0.4, None, None, 0)    
sigma_vtt = Beta('sigma_vtt',   2, None, None, 0)

# Construction of random parameters   
vtt_rnd = vtt + sigma_vtt * bioDraws('vtt_rnd', 'NORMAL_HALTON2')

# Definition of the utility functions
# Note that we use list comprehension to create a list of utility functions for all observations of an individual 
V_L = [B_tc * (Variable(f'CostL_{q}') + vtt_rnd * Variable(f'TimeL_{q}')) for q in range(obs_per_ind)]
V_R = [B_tc * (Variable(f'CostR_{q}') + vtt_rnd * Variable(f'TimeR_{q}')) for q in range(obs_per_ind)]

# Create a dictionary to list the utility functions with the numbering of alternatives
# Note that we use list comprehension to create a list of dictionaries
V = [{1: V_L[q], 2: V_R[q]} for q in range(obs_per_ind)]
           
# Create a dictionary to describe the availability conditions of each alternative
av = {1:1, 2:1}
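
The draw type `'NORMAL_HALTON2'` used above asks Biogeme for normal draws based on a Halton sequence in base 2. The sketch below illustrates the principle: Halton points fill the unit interval evenly, and the inverse normal CDF turns them into normal draws. The function `halton` is our own minimal helper, and the parameter values are hypothetical; Biogeme's own generator may differ in its details.

```python
import numpy as np
from scipy.stats import norm

def halton(n, base=2):
    """Return the first n points of the Halton sequence in the given base."""
    points = np.zeros(n)
    for i in range(n):
        f, k = 1.0, i + 1      # start at index 1 to skip the point 0
        while k > 0:
            f /= base
            points[i] += f * (k % base)
            k //= base
    return points

u = halton(8)                  # quasi-random points in (0, 1)
z = norm.ppf(u)                # inverse normal CDF turns them into normal draws

# Hypothetical parameter values, mirroring vtt + sigma_vtt * draw
vtt_mean, vtt_sigma = 0.4, 2.0
vtt_draws = vtt_mean + vtt_sigma * z

print(u)           # starts with 0.5, 0.25, 0.75, 0.125
print(vtt_draws)
```

Because the points are spread evenly rather than randomly, quasi-random draws such as these typically cover the distribution better than the same number of pseudo-random draws.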

In [None]:
# The conditional probability of the chosen alternative is a logit
condProb = [models.loglogit(V[q], av, Variable(f'Chosen_{q}')) for q in range(obs_per_ind)] 

# Take the product of the conditional probabilities
condprobIndiv = exp(bioMultSum(condProb))   # exp to convert from logP to P again

# The unconditional probability is obtained by simulation
uncondProb = MonteCarlo(condprobIndiv)

# The Log-likelihood is the log of the unconditional probability
LL = log(uncondProb)

# Create the Biogeme estimation object containing the data and the model
num_draws = 100
biogeme = bio.BIOGEME(biodata_wide , LL, number_of_draws=num_draws)

# Compute the null loglikelihood for reporting
# Note that we need to compute it manually, as biogeme does not do this for panel data
biogeme.nullLogLike = len(biodata_wide.data)*np.log(1/2)*obs_per_ind

# Set reporting levels
biogeme.generate_pickle = False
biogeme.generate_html = False
biogeme.save_iterations = False
biogeme.modelName = model_name                               
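
The null log-likelihood set above is the log-likelihood of a model that predicts every alternative with equal probability. The sketch below spells out the arithmetic and shows how it would typically feed into a rho-squared statistic. The function names and all numbers (100 individuals, a final log-likelihood of -450) are hypothetical and serve only as an illustration.

```python
import numpy as np

def null_loglikelihood(n_individuals, n_tasks, n_alternatives=2):
    """Log-likelihood of a model that assigns equal probability to every alternative."""
    return n_individuals * n_tasks * np.log(1 / n_alternatives)

def rho_squared(final_ll, null_ll):
    """McFadden's rho-squared: 1 - LL(final) / LL(null)."""
    return 1 - final_ll / null_ll

# Hypothetical numbers, for illustration only
LL0 = null_loglikelihood(n_individuals=100, n_tasks=9)
print(f'Null log-likelihood: {LL0:.2f}')
print(f'Rho-squared: {rho_squared(final_ll=-450.0, null_ll=LL0):.3f}')
```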

In [None]:
# Estimate the parameters and print the results
results = biogeme.estimate()
print(results.print_general_statistics())

# Get the results in a pandas table
beta_hat = results.get_estimated_parameters()
print(beta_hat)

In [None]:
# Compute the value of travel time
VTT_WTP_ML_PANEL_normal = 60 * beta_hat.loc['vtt']['Value']
print(f'Value of travel time Panel ML model in WTP space:  €{VTT_WTP_ML_PANEL_normal:.2f} per hour')

## `Exercise 1: Panel ML with log-normally distributed VTT`

Now, **you** will estimate an ML model under the assumption that the VTT is log-normally distributed, while accounting for panel effects.<br>

To do so, copy the code of the Panel ML model in WTP space with normally distributed VTT, and create the log-normally distributed random parameter (as you did in exercise 1 of lab session 2A).<br>  
Estimate this model and interpret the results.<br>


`A` Compare the log-likelihood of the ML models with the log-normally distributed VTTs, which do and do not account for the panel effect. Which model fits better?<br>

`B` Compute the mean of the VTT for the Panel ML model with the log-normally distributed VTT and compare it with the non-panel model. Has it changed?<br>

`C`  i. Print the recovered mean VTTs of the models we have estimated, one below the other:<br>
* MNL model<br>
* ML model with Normal distribution in utility space<br>
* ML model with Normal distribution in WTP space<br>
* ML model with Log-normal distribution in WTP space<br>
* Panel ML with Normal distribution in WTP space<br>
* Panel ML with Log-normal distribution in WTP space<br>

ii. Compare the VTTs of the models with a normal distribution and a log-normal distribution. Do you see a pattern? <br>

iii. What could explain this pattern?<br> 
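
As a reminder for question `B`: when the VTT is specified as the exponential of a normally distributed variable, the estimated parameters describe log(VTT), so the mean VTT is not simply the exponential of the location parameter. The sketch below computes the mean with the closed-form expression for a log-normal distribution and checks it by simulation. The values of `mu` and `sigma` are hypothetical, not estimates from the data.

```python
import numpy as np

# Hypothetical parameter values (not estimates from the data)
mu, sigma = -1.5, 0.8    # mean and standard deviation of log(VTT), VTT in euro per minute

# Closed-form mean of a log-normal distribution, converted to euro per hour
mean_analytical = 60 * np.exp(mu + sigma**2 / 2)

# Check by simulation with a large number of draws
rng = np.random.default_rng(seed=0)
draws = np.exp(mu + sigma * rng.standard_normal(1_000_000))
mean_simulated = 60 * draws.mean()

# The median ignores the sigma term and is therefore lower than the mean
median_analytical = 60 * np.exp(mu)

print(f'Mean (closed form): {mean_analytical:.2f} euro per hour')
print(f'Mean (simulated):   {mean_simulated:.2f} euro per hour')
print(f'Median:             {median_analytical:.2f} euro per hour')
```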

In [None]:
## Your code here

## `2. Impact of the number of draws on modelling outcomes`

## `Exercise 2: Impact of the number of draws` 

For all the Mixed Logit models that we have estimated, we have used a low number of draws (100 or fewer). We chose a relatively low number of draws to keep estimation times short.  <br>

Next, we analyse how sensitive the modelling outcomes are to the number of draws. To do this, we have estimated a Panel Mixed Logit model using different numbers of draws, ranging from 33 to 2,000, and stored the results. <br>

The following plots show the results. 

![Draws](data/draws_vs_.png)
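
To get a feel for simulation noise without estimating any model, the sketch below simulates a single choice probability many times for different numbers of draws and reports how much the outcome varies. It is a simplified illustration under stated assumptions: one hypothetical binary choice, hypothetical parameter values, and pseudo-random draws instead of the Halton draws used in the estimations.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical binary choice in WTP space: alternative L is cheaper but slower
b_tc, vtt_mean, vtt_sigma = -0.4, 0.2, 0.1
cost_diff, time_diff = -5.0, 20.0        # L minus R, in euros and minutes

def simulated_probability(R):
    """Average logit probability of choosing L over R random draws of the VTT."""
    vtt_draws = vtt_mean + vtt_sigma * rng.standard_normal(R)
    V_diff = b_tc * (cost_diff + vtt_draws * time_diff)
    return np.mean(1.0 / (1.0 + np.exp(-V_diff)))

# Repeat the simulation 200 times per R and measure the spread of the outcome
spread = {}
for R in [50, 200, 800]:
    estimates = [simulated_probability(R) for _ in range(200)]
    spread[R] = np.std(estimates)
    print(f'R = {R:4d}: std of simulated probability = {spread[R]:.5f}')
```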

`Questions:`

`A` The left-hand side plot shows that the VTT estimate gets more stable with an increasing number of draws. Can you explain why the estimate gets more stable? 

`B` What number of draws do you deem sufficient for estimating this model? Explain your answer.

`C` The right-hand side plot shows a linear relation between the number of draws and the estimation time. Explain why a linear relation was to be expected.

`D` Suppose we estimate a model with *K* random parameters. Would the relation between the number of draws and estimation time still be linear? Explain your answer. 

<br>

In [None]:
## Your code here

## END

In [None]:
# Below is the code to create the plot 

# Create a dataframe to store the results
df_out = pd.DataFrame(columns=['num_draws','VTT', 'LL','elapsed_time'])

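# Define the numbers of draws to be used for Monte Carlo simulation
# Note: range(33, 201, 33) yields 33, 66, ..., 198; raise the upper bound to test more draws (estimation takes longer)
num_draws = list(range(33, 201, 33))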

# Parameters definition enabling the construction of random parameters
vtt         = Beta('vtt',       0.4, None, None, 0)
B_tc        = Beta('b_tc',     -0.4, None, None, 0)    
sigma_vtt   = Beta('sigma_vtt',   2, None, None, 0)

# Construction of random parameters   
vtt_rnd = exp(vtt + sigma_vtt * bioDraws('vtt_rnd', 'NORMAL_HALTON2'))

# Definition of the utility functions 
V_L = [B_tc * (Variable(f'CostL_{q}') + vtt_rnd * Variable(f'TimeL_{q}')) for q in range(9)]
V_R = [B_tc * (Variable(f'CostR_{q}') + vtt_rnd * Variable(f'TimeR_{q}')) for q in range(9)]

# Create a dictionary to list the utility functions with the numbering of alternatives
V = [{1: V_L[q], 2: V_R[q]} for q in range(9)]
        
# Create a dictionary to describe the availability conditions of each alternative
av = {1:1, 2:1}

# The conditional probability of the chosen alternative is a logit
condProb = [models.loglogit(V[q], av, Variable(f'Chosen_{q}')) for q in range(9)] 

# Take the product of the conditional probabilities
condprobIndiv = exp(bioMultSum(condProb))   # exp to convert from logP to P again

# The unconditional probability is obtained by simulation
uncondProb = MonteCarlo(condprobIndiv)

# The Log-likelihood is the log of the unconditional probability
LL = log(uncondProb)

# Loop over the number of draws
for R in num_draws:
    
    # Start the timer
    start_time = time.time()

    # Give the model a name
    model_name = f'Panel ML WTP space with log-normally distributed vtt with {R} draws'

    # Create the Biogeme estimation object containing the data and the model
    biogeme = bio.BIOGEME(biodata_wide , LL, number_of_draws=R)
    
    # Set reporting levels
    biogeme.generate_pickle = False
    biogeme.generate_html = False
    biogeme.save_iterations = False
    biogeme.modelName = model_name
                                    
    # Compute the null loglikelihood for reporting
    biogeme.nullLogLike = len(biodata_wide.data)*np.log(1/2)*9

    # Estimate the parameters
    results = biogeme.estimate()
    # print(results.short_summary())

    # Get the results in a pandas table
    beta_hat = results.get_estimated_parameters()
    # print(beta_hat)

    # End the timer
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'Elapsed time: {elapsed_time:.2f} seconds\n\n')

    # Compute the mean value of travel time
    mu = beta_hat.loc['vtt']['Value']
    sigma = beta_hat.loc['sigma_vtt']['Value'] 
    mean_lognormal_panel = np.exp(mu + np.square(sigma)/2) * 60
    
    # Add the results to the dataframe
    df_R = pd.DataFrame({
        'num_draws': [R],
        'VTT': [mean_lognormal_panel],
        'LL': [results.get_general_statistics()['Final log likelihood'][0]],
        'elapsed_time': [elapsed_time],
    })
    # ignore_index=True gives every row its own index label instead of repeating 0
    df_out = pd.concat([df_out, df_R], ignore_index=True)

# Show the results
df_out

In [None]:
# Plot the results in a figure
fig, ax = plt.subplots(1,3, figsize=(15,5), sharex=True)
fig.tight_layout(w_pad=3)

ax[0].plot(df_out['num_draws'], df_out['VTT'], marker='.')
ax[0].set_xlabel('Number of draws')
ax[0].set_ylabel('VTT [euro/hour]')
ax[0].set_title('VTT')

ax[1].plot(df_out['num_draws'], df_out['LL'], marker='.')
ax[1].set_xlabel('Number of draws')
ax[1].set_ylabel('Log-likelihood')
ax[1].set_title('Log-likelihood')

ax[2].plot(df_out['num_draws'], df_out['elapsed_time'], marker='.')
ax[2].set_xlabel('Number of draws')
ax[2].set_ylabel('Elapsed time [s]')
ax[2].set_title('Elapsed time')

plt.show()