## Tutorial 1: Data Generating Process and Bias

The Data Generating Process (DGP) refers to the mechanism that a decision maker uses to make choices. It encompasses all the underlying probabilistic and structural relationships that produce the observed data. In the context of choice models, it comprises:<br>
1. A **decision rule** (e.g. utility maximisation, regret minimisation, or herding behaviour) and<br>
2. **Parameters** that govern the decision making (e.g. $\beta_{cost}$, $\beta_{time}$, $\beta_{quality}$)<br>

This tutorial illustrates: 
1. How to create synthetic data under the assumption of a given DGP
2. How to recover the parameters of the DGP using Maximum Likelihood Estimation (MLE) 
3. How a mismatch between the true DGP and assumed model leads to biased parameters

### 1. Creating synthetic data under the assumption of a given DGP<br>
#### Choice tasks and respondents
Firstly, we need to create the choice tasks and respondents. Let's assume we have data from *N* = 200 respondents. Each respondent was given the following *T* = 5 route choice tasks in a **labelled** stated choice experiment:<br>

| Task | Route 1 (Urban): TC1 (€) | Route 1 (Urban): TT1 (min) | Route 2 (Highway): TC2 (€) | Route 2 (Highway): TT2 (min) |
|------|------|------|-------|------|
| 1    | 4   | 35  | 8   | 25  |
| 2    | 4   | 35  | 8   | 30  |
| 3    | 6   | 40  | 7   | 20  |
| 4    | 5   | 30  | 6   | 25  |
| 5    | 5   | 40  | 7   | 20  |


<br>
<img src="assets/route_choice.png" alt="Route choice" width="400">

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
import biogeme.database as db
import biogeme.biogeme as bio
from biogeme import models
from biogeme.expressions import Beta, Variable, log, exp

In [2]:
# Create np array with the five choice tasks (TC1, TT1, TC2, TT2)
tasks = np.array([[4,35,8,25], [4,35,8,30], [6,40,7,20], [5,30,6,25], [5,40,7,20]])

# Set the number of respondents
N = 200

# Replicate the tasks N times
data = np.tile(tasks, (N, 1))

# Create respondent IDs to add to the dataframe
resp_id = np.expand_dims(np.repeat(np.arange(1, N+1), len(tasks)), axis=1)

# Create a pandas dataframe with the data
df = pd.DataFrame(np.concatenate((resp_id,data), axis=1), columns=['RESP','TC1', 'TT1', 'TC2', 'TT2'])

# Show the first 10 rows of the dataframe
df.head(10)

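|   | RESP | TC1 | TT1 | TC2 | TT2 |
|---|------|-----|-----|-----|-----|
| 0 | 1 | 4 | 35 | 8 | 25 |
| 1 | 1 | 4 | 35 | 8 | 30 |
| 2 | 1 | 6 | 40 | 7 | 20 |
| 3 | 1 | 5 | 30 | 6 | 25 |
| 4 | 1 | 5 | 40 | 7 | 20 |
| 5 | 2 | 4 | 35 | 8 | 25 |
| 6 | 2 | 4 | 35 | 8 | 30 |
| 7 | 2 | 6 | 40 | 7 | 20 |
| 8 | 2 | 5 | 30 | 6 | 25 |
| 9 | 2 | 5 | 40 | 7 | 20 |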


#### Synthetic choices
Next, we need to create the synthetic choices. Let's create these using a linear-additive RUM-MNL DGP. To do so, we assume:
1. All decision makers maximise utility (decision rule)
2. Travel cost, travel time and the label matter to the choice behaviour
3. $\beta_{tc} = -0.4, \beta_{tt} = -0.1$, $ASC_1 = 0$, $ASC_2 = 1$ for all decision makers
4. Unobserved utilities are i.i.d. EV type I distributed<br>

Hence, the utility function with this linear-additive RUM-MNL DGP is:<br><br>
$U_{in} = V_{in} + \varepsilon_{in}$<br>

$U_{in} = ASC_i + \beta_{tc} \cdot x^{tc}_{in} + \beta_{tt} \cdot x^{tt}_{in} + \varepsilon_{in}$<br>
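Under assumption 4, the difference between the two error terms follows a logistic distribution, so the probability of choosing route 1 takes the closed form $P_{1n} = e^{V_{1n}} / (e^{V_{1n}} + e^{V_{2n}})$. The sketch below evaluates this for the five choice tasks defined in the code cell `In [2]`, using the true parameters. It relies on the standard library only, and the helper name `logit_prob_route1` is introduced here purely for illustration.

```python
import math

# True parameters of the DGP (from the assumptions listed above)
beta_tc, beta_tt, asc1, asc2 = -0.4, -0.1, 0.0, 1.0

# The five choice tasks as (TC1, TT1, TC2, TT2), as defined in code cell In [2]
tasks = [(4, 35, 8, 25), (4, 35, 8, 30), (6, 40, 7, 20), (5, 30, 6, 25), (5, 40, 7, 20)]

def logit_prob_route1(tc1, tt1, tc2, tt2):
    """Binary logit probability of choosing route 1 (illustrative helper)."""
    v1 = asc1 + beta_tc * tc1 + beta_tt * tt1
    v2 = asc2 + beta_tc * tc2 + beta_tt * tt2
    # exp(v1) / (exp(v1) + exp(v2)) rewritten in terms of the utility difference
    return 1.0 / (1.0 + math.exp(v2 - v1))

probs = [logit_prob_route1(*task) for task in tasks]
for t, p in enumerate(probs, start=1):
    print(f"Task {t}: P(route 1) = {p:.3f}, P(route 2) = {1 - p:.3f}")
```

With these parameters, route 2 is the more likely choice in four of the five tasks, which is worth keeping in mind when reading the simulated choices.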


In [3]:
# Create the true utility parameters
beta_tt = -0.1
beta_tc = -0.4
asc1 = 0
asc2 = 1

print('True decision rule: Random Utility Maximisation (RUM), with parameters:')
print(f'   beta_TC: {beta_tc}')
print(f'   beta_TT: {beta_tt}')
print(f'   asc1: {asc1}')
print(f'   asc2: {asc2}')
print(f'   --> True Value of Travel Time: {60*beta_tt/beta_tc:0.1f} €/hr.')

True decision rule: Random Utility Maximisation (RUM), with parameters:
   beta_TC: -0.4
   beta_TT: -0.1
   asc1: 0
   asc2: 1
   --> True Value of Travel Time: 15.0 €/hr.
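The factor 60 in the last printed line converts the ratio from €/min to €/hr. As a short worked derivation, assuming the linear-additive utility function specified above, the Value of Travel Time is the ratio of the marginal utilities of travel time and travel cost:<br><br>
$VTT = \frac{\partial V_{in} / \partial x^{tt}_{in}}{\partial V_{in} / \partial x^{tc}_{in}} = \frac{\beta_{tt}}{\beta_{tc}} = \frac{-0.1}{-0.4} = 0.25$ €/min $= 60 \cdot 0.25 = 15$ €/hr<br>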


Next, we compute the choices based on the DGP. 
Note that the choices are still random because of the unobserved part of utility (i.e. the epsilons). To make the results fully replicable, we fix the seed of the random number generator using `np.random.seed(42)`.

In [4]:
# Compute the systematic utilities given the DGP: V_i = ASC_i + beta_tc * TC_i + beta_tt * TT_i
df['V1'] = asc1 + beta_tc * df['TC1'] + beta_tt * df['TT1'] 
df['V2'] = asc2 + beta_tc * df['TC2'] + beta_tt * df['TT2']

# Add the error terms
# Fix the seed to make the results replicable
np.random.seed(42)
df['epsilon1'] = np.random.gumbel(size=len(df))
df['epsilon2'] = np.random.gumbel(size=len(df))

# Compute the total utility
df['U1'] = df['V1'] + df['epsilon1']
df['U2'] = df['V2'] + df['epsilon2']

# Identify the chosen alternative based on the maximum utility
df['CHOICE'] = np.nan
df.loc[df['U1'] > df['U2'], 'CHOICE'] = 1
df.loc[df['U2'] > df['U1'], 'CHOICE'] = 2

# Convert the chosen alternative to an integer (optional)
df['CHOICE'] = df['CHOICE'].astype(int)

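# Save the data in a csv file (create the folder first if it does not exist yet)
data_path = Path('data/synthetic_VTTdata_tutorial1.dat')
data_path.parent.mkdir(parents=True, exist_ok=True)
df[['RESP','TC1', 'TT1', 'TC2', 'TT2','CHOICE']].to_csv(data_path, sep=',', index=False)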

# Show the first rows
df.head()
# df.value_counts('CHOICE')

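|   | RESP | TC1 | TT1 | TC2 | TT2 | V1 | V2 | epsilon1 | epsilon2 | U1 | U2 | CHOICE |
|---|------|-----|-----|-----|-----|----|----|----------|----------|----|----|--------|
| 0 | 1 | 4 | 35 | 8 | 25 | -5.1 | -4.7 | 0.756581 | 1.586062 | -4.343419 | -3.113938 | 2 |
| 1 | 1 | 4 | 35 | 8 | 30 | -5.1 | -5.2 | -1.10198 | 0.247603 | -6.20198 | -4.952397 | 2 |
| 2 | 1 | 6 | 40 | 7 | 20 | -6.4 | -3.8 | -0.275163 | -0.72423 | -6.675163 | -4.52423 | 2 |
| 3 | 1 | 5 | 30 | 6 | 25 | -5.0 | -3.9 | 0.091082 | -0.275818 | -4.908918 | -4.175818 | 2 |
| 4 | 1 | 5 | 40 | 7 | 20 | -6.0 | -3.8 | 1.774166 | -0.496398 | -4.225834 | -4.296398 | 1 |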


`--> We have created a data set in which the DGP is linear-additive RUM-MNL`<br>
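As a sanity check on this way of generating choices, the sketch below repeats the same recipe (systematic utility plus a standard Gumbel draw, then pick the maximum) many times for choice task 1 and compares the simulated share of route 1 with the closed-form logit probability. It is a standalone illustration using the standard library rather than the notebook's dataframe; the helper `gumbel` and the number of draws are assumptions made here.

```python
import math
import random

def gumbel(rng):
    """Draw from the standard Gumbel (EV type I) distribution by inverse transform."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); practically never triggered
        u = rng.random()
    return -math.log(-math.log(u))

rng = random.Random(42)  # fixed seed so the check is repeatable

# Systematic utilities of choice task 1 under the true parameters
v1 = 0 + (-0.4) * 4 + (-0.1) * 35   # route 1
v2 = 1 + (-0.4) * 8 + (-0.1) * 25   # route 2

# Simulate many choices for the same task and count how often route 1 wins
n_draws = 100_000
n_route1 = sum((v1 + gumbel(rng)) > (v2 + gumbel(rng)) for _ in range(n_draws))

simulated_share = n_route1 / n_draws
logit_prob = 1.0 / (1.0 + math.exp(v2 - v1))
print(f"Simulated share of route 1: {simulated_share:.3f}")
print(f"Logit probability:          {logit_prob:.3f}")
```

The two numbers should agree up to simulation noise, which is why an MNL model is the matching estimation model for data generated this way.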

In [5]:
# Save the dataframe for use in the next tutorials
# df.to_pickle('data/synthetic_data_tutorial1.pkl')

### 2. Recovering the parameters of the DGP using Maximum Likelihood Estimation
Now, let's look at what happens if we estimate a discrete choice model that **perfectly** aligns with the DGP. In that case, we expect to be able to **accurately** recover the true parameters, i.e. $\beta_{tc} = -0.4$, $\beta_{tt} = -0.1$ and $ASC_2 = 1$.

We estimate the choice models using the `biogeme` package. <br>
First, we create the database for biogeme. 

In [6]:
# Create a Biogeme database
biodata = db.Database('synthetic_VTTdata', df)

# We create Variable objects for each of the variables in the data set that we want to use in the model
# Attributes of alternative 1
TT1  = Variable('TT1')
TC1  = Variable('TC1')

# Attributes of alternative 2    
TT2  = Variable('TT2')
TC2  = Variable('TC2')

# The choice
CHOICE = Variable('CHOICE')

Next, we define the model specification.

In [7]:
# Give a name to the model    
model_name = 'Linear-additive RUM-MNL with ASC (true model)'

# Define the model parameters, using the function "Beta()", in which you must define:
# the name of the parameter,
# starting value, 
# lower bound,
# upper bound, 
# 0 or 1, indicating if the parameter must be estimated. 0 means estimated, 1 means fixed to the starting value. 
B_TT = Beta('B_TT', 0, None, None, 0)
B_TC = Beta('B_TC', 0, None, None, 0)
ASC1 = Beta('ASC1', 0, None, None, 1)
ASC2 = Beta('ASC2', 0, None, None, 0)

# Define the utility functions
V1 = ASC1 + B_TT * TT1 + B_TC * TC1
V2 = ASC2 + B_TT * TT2 + B_TC * TC2

We create a function to estimate MNL models with two alternatives.

In [8]:
# Create a function to estimate an MNL model with two alternatives
def estimate_mnl(V1,V2,CHOICE,database,model_name):

    V = {1: V1, 2: V2}
        
    # Create a dictionary called av to describe the availability conditions of each alternative, where 1 indicates that the alternative is available, and 0 indicates that the alternative is not available.
    # This shows that all alternatives were available to all respondents. 
    av = {1: 1, 2: 1} 

    # Define the choice model: The function models.logit() computes the MNL choice probabilities of the chosen alternative given the V. 
    prob = models.logit(V, av, CHOICE)

    # Define the log-likelihood   
    LL = log(prob)

    # Create the Biogeme object containing the object database and the formula for the contribution to the log-likelihood of each row using the following syntax:
    biogeme = bio.BIOGEME(database, LL)

    # The following syntax passes the name of the model:
    biogeme.modelName = model_name

    # Some object settings regarding whether to save the results and outputs 
    biogeme.generate_pickle = False
    biogeme.generate_html = False
    biogeme.save_iterations = False

    # Syntax to calculate the null log-likelihood. The null-log-likelihood is used to compute the rho-square 
    biogeme.calculate_null_loglikelihood(av)

    # This line starts the estimation and returns the results object.
    results_MNL = biogeme.estimate()

    return results_MNL
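The null log-likelihood mentioned in the comments of this function is the log-likelihood of a model in which all parameters are zero, so that every available alternative is equally likely. A minimal sketch of the arithmetic, assuming the 200 × 5 = 1000 observations of this tutorial and two alternatives; the helper name `rho_square` and the final log-likelihood of -500 are illustrative only, not results.

```python
import math

n_obs = 200 * 5      # N respondents times T choice tasks
n_alts = 2           # both routes are always available

# With all parameters at zero, each alternative has probability 1 / n_alts
null_ll = n_obs * math.log(1 / n_alts)
print(f"Null log-likelihood: {null_ll:.4f}")   # -693.1472

def rho_square(final_ll, null_ll):
    """McFadden's rho-square: 1 - LL(final) / LL(null) (illustrative helper)."""
    return 1 - final_ll / null_ll

# Hypothetical final log-likelihood, for illustration only
print(f"Rho-square: {rho_square(-500.0, null_ll):.3f}")
```

A rho-square of 0 means the model does no better than equal choice probabilities, and values closer to 1 indicate a better fit.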

Finally, we estimate the model and print the results.

In [9]:
# Estimate the model
results_MNL = estimate_mnl(V1,V2,CHOICE,biodata,model_name)

# Print the estimation statistics
print(results_MNL.short_summary())

# Print model parameters
print(results_MNL.get_estimated_parameters()) 

# Calculate the value of travel time and print it
VTT = 60*(results_MNL.get_beta_values()['B_TT']/results_MNL.get_beta_values()['B_TC'])
print(f'\nThe Value of Travel Time is: {VTT:.2f} €/hr.')

Results for model Linear-additive RUM-MNL with ASC (true model)
Nbr of parameters:		3
Sample size:			1000
Excluded data:			0
Null log likelihood:		-693.1472
Final log likelihood:		-516.5406
Likelihood ratio test (null):		353.2132
Rho square (null):			0.255
Rho bar square (null):			0.25
Akaike Information Criterion:	1039.081
Bayesian Information Criterion:	1053.804

         Value  Rob. Std err  Rob. t-test  Rob. p-value
ASC2  1.099577      0.256444     4.287789  1.804603e-05
B_TC -0.381359      0.058445    -6.525056  6.797607e-11
B_TT -0.082513      0.013774    -5.990257  2.095100e-09

The Value of Travel Time is: 12.98 €/hr.


`--> We see that the recovered parameters are not exactly equal to the true ones, but they are close. More specifically, each true value lies within est ± 1.96 * S.E. (the 95% confidence interval). Likewise, the recovered VTT (12.98 €/hr) is reasonably close to the true VTT of 15 €/hr. Hence, the estimates are consistent with an unbiased estimator; the remaining differences are attributable to sampling error.`
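The interval check can be reproduced by hand from the reported output. The sketch below copies the estimates and robust standard errors from the results table above and tests whether each true value falls inside est ± 1.96 * S.E.; the dictionary names are chosen here for illustration.

```python
# Reported estimates and robust standard errors, copied from the output above
estimates = {
    "ASC2": (1.099577, 0.256444),
    "B_TC": (-0.381359, 0.058445),
    "B_TT": (-0.082513, 0.013774),
}
# True parameter values of the DGP
true_values = {"ASC2": 1.0, "B_TC": -0.4, "B_TT": -0.1}

covered = {}
for name, (est, se) in estimates.items():
    lower, upper = est - 1.96 * se, est + 1.96 * se
    covered[name] = lower <= true_values[name] <= upper
    print(f"{name}: [{lower:.3f}, {upper:.3f}] contains {true_values[name]}: {covered[name]}")
```

All three intervals contain the true value. Note that a single sample cannot prove unbiasedness; repeating the simulation with different seeds would be the more rigorous check.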

### 3. Biased parameters due to a mismatch between the true DGP and assumed model
Now let's see what happens if we estimate a model whose utility function differs from that of the true DGP. More specifically, the model that we will estimate does not have ASCs (e.g. because the researcher fails to see that the alternatives were **labelled** in the stated choice experiment).

We define the model specification.

In [10]:
# Give a name to the model    
model_name = 'Lin-additive RUM-MNL without ASC'

# Define the model parameters, using the function "Beta()":
B_TT = Beta('B_TT', 0, None, None, 0)
B_TC = Beta('B_TC', 0, None, None, 0)

# Define the utility functions
# Note there is no ASC in these utility functions
V1 = B_TT * TT1 + B_TC * TC1
V2 = B_TT * TT2 + B_TC * TC2

We estimate the model and report the results.

In [11]:
# Estimate the model
results_MNL = estimate_mnl(V1,V2,CHOICE,biodata,model_name)

# Print the estimation statistics
print(results_MNL.short_summary())

# Print model parameters
print(results_MNL.get_estimated_parameters())

# Calculate the value of travel time and print it
VTT = 60*(results_MNL.get_beta_values()['B_TT']/results_MNL.get_beta_values()['B_TC'])
print(f'\nThe Value of Travel Time is: {VTT:.2f} €/hr.')

Results for model Lin-additive RUM-MNL without ASC
Nbr of parameters:		2
Sample size:			1000
Excluded data:			0
Null log likelihood:		-693.1472
Final log likelihood:		-526.9033
Likelihood ratio test (null):		332.4877
Rho square (null):			0.24
Rho bar square (null):			0.237
Akaike Information Criterion:	1057.807
Bayesian Information Criterion:	1067.622

         Value  Rob. Std err  Rob. t-test  Rob. p-value
B_TC -0.175644      0.034450    -5.098483  3.423864e-07
B_TT -0.126975      0.009855   -12.884885  0.000000e+00

The Value of Travel Time is: 43.37 €/hr.


`--> We make a couple of observations:`<br><br>
`1. The recovered parameters are biased: the true values of both parameters lie outside est ± 1.96 * S.E.`<br><br>
`2. The recovered VTT (43.37 €/hr) is far from the true value (15 €/hr).`<br><br>
`3. The LL of the biased model is worse than the LL of the model with the correct utility specification (-526.90 versus -516.54). To a researcher who does not know which of the two models is the true model (which is usually the case), this indicates that the model with the ASC makes the data more likely and is statistically preferred.`
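Because the model without the ASC is a restricted version of the model with the ASC (it fixes $ASC_2$ to zero), the two can be compared with a likelihood ratio test. The sketch below uses the two final log-likelihoods reported above; the chi-square p-value for one degree of freedom is computed with `math.erfc` to keep the example free of extra dependencies, and the variable names are chosen here for illustration.

```python
import math

ll_with_asc = -516.5406      # final log-likelihood, model with ASC (3 parameters)
ll_without_asc = -526.9033   # final log-likelihood, model without ASC (2 parameters)

# Likelihood ratio statistic; the models differ by one parameter (ASC2)
lr_stat = 2 * (ll_with_asc - ll_without_asc)

# p-value of a chi-square distribution with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2)), since X is the square of a standard normal
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"LR statistic: {lr_stat:.2f} (5% critical value for 1 d.o.f.: 3.84)")
print(f"p-value: {p_value:.1e}")
```

The statistic of about 20.7 far exceeds the critical value of 3.84, so the restriction $ASC_2 = 0$ is rejected in favour of the model with the ASC.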