# PUM and the Similarity Specification

This notebook introduces the general theory of Perturbed Utility Models and addresses how the important subclass of 'Similarity Models' may be specified in practical applications.

In [1]:
import numpy as np
import pandas as pd
#pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 500)
import os
import sys
from numpy import linalg as la
from scipy import optimize
import scipy.stats as scstat
from matplotlib import pyplot as plt
import itertools as iter
%load_ext line_profiler

# Files
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utilities.Logit_file import estimate_logit, logit_se, logit_t_p, q_logit, logit_score, logit_score_unweighted, logit_ccp, LogitBLP_estimator, LogitBLP_se
from data.Eurocarsdata_file import Eurocars_cleandata

   variable names                                        description
0              cy            cylinder volume or displacement (in cc)
1              hp                                 horsepower (in kW)
2              we                                     weight (in kg)
3              le                                     length (in cm)
4              wi                                      width (in cm)
5              he                                     height (in cm)
6              li          average of li1, li2, li3 (used in papers)
7              sp                            maximum speed (km/hour)
8              ac  time to acceleration (in seconds from 0 to 100...
9              pr   price (in destination currency including V.A.T.)
10          brand                                      name of brand
11           home  domestic car dummy (appropriate interaction of...
12            cla                              class or segment code
x has full rank


In [2]:
# Load dataset and variable names
descr = (pd.read_stata('../data/eurocars.dta', iterator = True)).variable_labels() # Obtain variable descriptions
dat_file = pd.read_csv('../data/eurocars.csv') # reads in the data set as a pandas dataframe.

In [3]:
# Outside option is included if OO == True, otherwise analysis is done on the inside options only.
OO = True

# Choose which variables to include in the analysis, and assign them either as discrete variables or continuous.

x_discretevars = [ 'brand', 'home', 'cla']
x_contvars = ['cy', 'hp', 'we', 'le', 'wi', 'he', 'li', 'sp', 'ac', 'pr']
z_IV_contvars = ['xexr']
z_IV_discretevars = []
x_allvars =  [*x_contvars, *x_discretevars]
z_allvars = [*z_IV_contvars, *z_IV_discretevars]

if OO:
    nest_contvars = [var for var in x_contvars if var != 'pr'] # We nest over all variables other than price, but an alternative list can be specified here if desired.
    nest_discvars = ['in_out', *x_discretevars]
    nest_vars = ['in_out', *nest_contvars, *x_discretevars]
else:
    nest_contvars = [var for var in x_contvars if (var != 'pr')]
    nest_discvars = x_discretevars # See above
    nest_vars = [*nest_contvars, *nest_discvars]

G = len(nest_vars)

# Print list of chosen variables as a dataframe
pd.DataFrame(descr, index=['description'])[x_allvars].transpose().reset_index().rename(columns={'index' : 'variable names'})

   variable names                                        description
0              cy            cylinder volume or displacement (in cc)
1              hp                                 horsepower (in kW)
2              we                                     weight (in kg)
3              le                                     length (in cm)
4              wi                                      width (in cm)
5              he                                     height (in cm)
6              li          average of li1, li2, li3 (used in papers)
7              sp                            maximum speed (km/hour)
8              ac  time to acceleration (in seconds from 0 to 100...
9              pr   price (in destination currency including V.A.T.)


In [4]:
dat, dat_org, x_vars, z_vars, N, pop_share, T, J, K = Eurocars_cleandata(dat_file, x_contvars, x_discretevars, z_IV_contvars, z_IV_discretevars, outside_option=OO)

In [5]:
# Create dictionaries of numpy arrays for each market. This allows the size of the data set to vary over markets.

dat = dat.reset_index(drop = True).sort_values(by = ['market', 'co']) # Sort data so that the reshape is successful

x = {t: dat[dat['market'] == t][x_vars].values.reshape((J[t],K)) for t in np.arange(T)} # Dict of explanatory variables
y = {t: dat[dat['market'] == t]['ms'].to_numpy().reshape((J[t])) for t in np.arange(T)} # Dict of market shares
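
The dictionary-per-market layout can be illustrated on a small toy data set. The sketch below is only an illustration: the data frame `toy` and its values are made up, and it uses `groupby` rather than the boolean masks above, which is an alternative way of splitting the rows by market.

```python
import numpy as np
import pandas as pd

# Toy long-format data (assumption): market 0 has 2 products, market 1 has 3.
toy = pd.DataFrame({
    "market": [1, 0, 0, 1, 1],
    "co":     [11, 2, 1, 10, 12],
    "pr":     [9.0, 5.0, 4.0, 8.0, 7.0],
    "ms":     [0.2, 0.6, 0.4, 0.5, 0.3],
})

# Sort by market and product code so rows line up across all per-market arrays.
toy = toy.sort_values(by=["market", "co"]).reset_index(drop=True)

# One array per market; the number of rows may differ between markets.
x_toy = {t: g[["pr"]].to_numpy() for t, g in toy.groupby("market")}
y_toy = {t: g["ms"].to_numpy() for t, g in toy.groupby("market")}
J_toy = {t: len(g) for t, g in toy.groupby("market")}
```

Because each market has its own array, nothing forces the markets to share the same number of products.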

## Perturbed utility, logit and nested logit

In the following, a vector $z\in \mathbb R^d$ is always a column vector. The Similarity Model is a discrete choice model, where the probability vector over the alternatives is given by the solution to a utility maximization problem of the form
$$
P(u|\theta)=\arg\max_{q\in \Delta} q'u(\theta)-\Omega(q|\theta)
$$
where $\Delta$ is the probability simplex over the set of discrete choices, $u(\theta)$ is the vector of payoffs for the options, $\Omega$ is a convex perturbation function, $\theta$ is a vector of parameters, and $q'$ denotes the transpose of $q$. All Additive Random Utility Models can be represented in this way (Fosgerau and Sørensen (2021)). For example, the logit choice probabilities result from the perturbation function $\Omega(q)=q'\ln q$, where $\ln q$ is the elementwise logarithm.
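
The logit case can be checked numerically. The following is a minimal sketch with made-up payoffs (`u_toy` is an assumption, not taken from the data): it evaluates the objective $q'u - q'\ln q$ at the softmax probabilities, compares the value with the log-sum-exp of the payoffs, and confirms that random points on the simplex do no better.

```python
import numpy as np

def perturbed_objective(q, u):
    """Evaluate q'u - q'ln(q), using the convention 0*ln(0) = 0."""
    q = np.asarray(q, dtype=float)
    log_q = np.log(q, out=np.zeros_like(q), where=(q > 0))
    return q @ u - q @ log_q

def softmax(u):
    """Logit choice probabilities exp(u_j) / sum_k exp(u_k)."""
    e = np.exp(u - np.max(u))  # subtract the max for numerical stability
    return e / e.sum()

u_toy = np.array([1.0, 0.5, -0.3])  # illustrative payoffs (assumption)
p_logit = softmax(u_toy)
value_at_logit = perturbed_objective(p_logit, u_toy)

# The maximized value equals the log-sum-exp of the payoffs.
log_sum_exp = np.log(np.exp(u_toy).sum())

# No random point on the simplex should do better than the logit probabilities.
rng = np.random.default_rng(0)
q_random = rng.dirichlet(np.ones(len(u_toy)), size=1000)
best_random_value = max(perturbed_objective(q, u_toy) for q in q_random)
```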

In the Nested Logit Model, the choice set is divided into a partition $\mathcal C=\left\{C_1,\ldots,C_L\right\}$, and the perturbation function is given by
$$
\Omega(q|\lambda)=(1-\lambda)q'\ln q+\lambda \sum_{\ell =1}^L \left( \sum_{j\in C_\ell}q_j\right)\ln \left( \sum_{j\in C_\ell}q_j\right),
$$
where $\lambda\in [0,1)$ is a parameter. This function can be written equivalently as
$$
\Omega(q|\lambda)=(1-\lambda)q'\ln q+\lambda \left(\psi q\right)'\ln \left( \psi q\right),
$$
where $\psi$ is an $L \times J$ matrix with $\psi_{\ell j}=1$ if option $j$ belongs to nest $C_\ell$ and zero otherwise, so that $\psi q$ is the vector of nest probabilities. This specification generates nested logit choice probabilities.
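
To see that the two expressions for the nested logit perturbation function agree, here is a small sketch on a made-up partition. The names `nest_of`, `psi_toy`, `q_toy` and `lam` are illustrative assumptions, and the nest matrix is stored with one row per nest so that `psi_toy @ q_toy` returns the nest probabilities.

```python
import numpy as np

def neg_entropy(p):
    """Return p'ln(p) with the convention 0*ln(0) = 0."""
    p = np.asarray(p, dtype=float)
    return p @ np.log(p, out=np.zeros_like(p), where=(p > 0))

# Hypothetical example: 5 alternatives partitioned into 2 nests.
nest_of = np.array([0, 0, 1, 1, 1])          # nest label of each alternative
n_nests = nest_of.max() + 1

# psi_toy[l, j] = 1 if alternative j belongs to nest l, and 0 otherwise.
psi_toy = (np.arange(n_nests)[:, None] == nest_of[None, :]).astype(float)

q_toy = np.array([0.1, 0.2, 0.3, 0.25, 0.15])  # a point on the simplex
lam = 0.4                                       # nesting parameter in [0, 1)

# Matrix form: (1 - lam) q'ln q + lam (psi q)'ln(psi q)
omega_matrix = (1 - lam) * neg_entropy(q_toy) + lam * neg_entropy(psi_toy @ q_toy)

# Sum form: add up Q_l ln Q_l over the nests explicitly.
nest_probs = [q_toy[nest_of == l].sum() for l in range(n_nests)]
omega_sum = (1 - lam) * neg_entropy(q_toy) + lam * sum(Q * np.log(Q) for Q in nest_probs)
```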

# The Similarity Model

The Similarity Model generalizes the Nested Logit Model. It allows for multiple nesting structures, and it also allows for 'continuous' nesting by measuring the similarity of products in the space of characteristics. Let $g = 1,\ldots, G$ index a set of distinct nesting structures represented by matrices $\psi^1, \ldots, \psi^G$ as in .... We define the Similarity perturbation function $\Omega$ as:

$$
\Omega(q|\lambda) = \left( 1 - \sum_{g = 1}^G \lambda_g\right) q' \ln (q) + \sum_{g = 1}^G \lambda_g (\psi^g q)' \ln( \psi^g q) - q' \delta
$$

where $\lambda \in \mathbb{R}^G$ is a vector of nesting parameters and $\delta \in \mathbb{R}^{J_t}$ is a normalizing vector. If the sum of the positive nesting parameters, $\sum_{g : \lambda_g > 0} \lambda_g$, is strictly less than $1$, then the perturbation function $\Omega(\cdot|\lambda)$ is strictly convex, so that the Similarity Model is a Perturbed Utility Model.

Note that $q$ and $\psi^g q$ are probability distributions, so the terms $q'\ln(q)$ and $(\psi^g q)' \ln(\psi^g q)$ can be interpreted as the negative entropy of $q$ and of the probability distribution of similarity within characteristic $g$, for $g=1,\ldots,G$, respectively.

When choosing the normalizing vector $\delta$, we want $\Omega(q|\lambda) = 0$ at the corners of the probability simplex $\Delta$, i.e. when $q$ assigns probability $1$ to a single alternative. Let $e_j$ be the $j$'th standard basis vector in $\mathbb{R}^{J_t}$ and let $\psi_{[j]}^g = \psi^g e_j$ denote the $j$'th column of $\psi^g$. With the convention $0 \ln 0 = 0$, the condition $0 = \Omega(e_j | \lambda) = \left( 1 - \sum_{g = 1}^G \lambda_g\right) \cdot 0 + \sum_{g = 1}^G \lambda_g (\psi_{[j]}^g)' \ln( \psi_{[j]}^g) - \delta_j$ implies that we must choose $\delta_j = \sum_{g = 1}^G \lambda_g (\psi_{[j]}^g)' \ln( \psi_{[j]}^g)$ to achieve this normalization.
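
The normalization can be verified numerically. The sketch below uses randomly generated similarity matrices whose columns sum to one (an assumption made purely for illustration, as are all names ending in `_toy`), computes $\delta$ from the columns of each $\psi^g$, and evaluates $\Omega$ at every corner of the simplex.

```python
import numpy as np

def xlogx(a):
    """Elementwise a*ln(a) with the convention 0*ln(0) = 0."""
    a = np.asarray(a, dtype=float)
    return a * np.log(a, out=np.zeros_like(a), where=(a > 0))

def omega(q, lam, psis, delta):
    """Similarity perturbation function for a list of J x J matrices psis."""
    value = (1 - lam.sum()) * xlogx(q).sum()
    for lam_g, psi_g in zip(lam, psis):
        value += lam_g * xlogx(psi_g @ q).sum()
    return value - q @ delta

rng = np.random.default_rng(1)
J_toy, G_toy = 4, 2
# Toy similarity matrices (assumption): positive entries, each column sums to one.
psis_toy = [m / m.sum(axis=0) for m in rng.uniform(0.1, 1.0, size=(G_toy, J_toy, J_toy))]
lam_toy = np.array([0.3, 0.2])

# delta_j = sum_g lambda_g * sum_i psi^g_{ij} ln(psi^g_{ij}), using column j of each psi^g
delta_toy = sum(lam_g * xlogx(psi_g).sum(axis=0) for lam_g, psi_g in zip(lam_toy, psis_toy))

# Evaluate Omega at every corner e_j of the simplex.
omega_at_corners = np.array([omega(e_j, lam_toy, psis_toy, delta_toy) for e_j in np.eye(J_toy)])
```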

Furthermore, if $\lambda = 0$ then the Similarity Model reduces to the Multinomial Logit Model, since $\Omega(q|0) = q' \ln (q)$ is the negative Shannon entropy. The Nested Logit Model, as described above, is obtained if $G = 1$ and $\delta = 0$, and the IPDL Model of Fosgerau et al. (2022) is obtained by setting $\delta = 0$. Hence the Similarity Model allows for greater flexibility than many workhorse models.

In implementations of the Similarity Model, it is useful to define the following matrices for computations. First, we define $\Psi \in \mathbb{R}^{(G + 1)J_t \times J_t}$ as the matrix stacking the identity matrix $I_{J_t} \in \mathbb{R}^{J_t \times J_t}$ on top of the $\psi^g$ matrices:

$$
\Psi = \left(
    \begin{array}{c}
        I_{J_t} \\
        \psi^1 \\
        \vdots \\
        \psi^G
    \end{array}
    \right)
$$

Another useful matrix for carrying out computations is the matrix $\Gamma \in \mathbb{R}^{(G + 1)J_t \times J_t}$ defined by:

$$
\Gamma = \left(
    \begin{array}{c}
        \left(1 - \sum_{g = 1}^G \lambda_g\right) I_{J_t} \\
        \lambda_1 \psi^1 \\
        \vdots \\
        \lambda_G \psi^G
    \end{array}
    \right)
$$

Finally, since the normalizing vector $\delta$ is linear in the parameters $\lambda$, we construct a matrix $\varphi \in \mathbb{R}^{J_t \times G}$ such that $\delta = \varphi \lambda$. For each nesting structure $g$, set $\varphi_{[g]} = (\psi^g \circ \ln (\psi^g))'\iota_{J_t}$, where $\circ$ is the elementwise product and $\iota_{J_t} = (1, \ldots, 1)' \in \mathbb{R}^{J_t}$ is the all-ones vector; then

$$
\varphi = \left(\varphi_{[1]} \ldots \varphi_{[G]}\right)
$$

has the desired property. Using the above matrices, we may compute the Similarity perturbation function as $\Omega(q|\lambda) = (\Gamma q)' \ln (\Psi q) - q' \varphi \lambda$.
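
As a sanity check on the stacked notation, the sketch below builds $\Psi$, $\Gamma$ and $\varphi$ for toy inputs and compares the compact expression with the term-by-term definition. The similarity matrices are random column-stochastic matrices, not estimated ones, and all names ending in `_toy` are illustrative assumptions.

```python
import numpy as np

def safe_log(a):
    """Elementwise ln(a), returning 0 where a == 0 so that 0*ln(0) = 0."""
    return np.log(a, out=np.zeros_like(a), where=(a > 0))

rng = np.random.default_rng(2)
J_toy, G_toy = 4, 2
# Toy column-stochastic similarity matrices (assumption, not estimated from data).
psis_toy = [m / m.sum(axis=0) for m in rng.uniform(0.1, 1.0, size=(G_toy, J_toy, J_toy))]
lam_toy = np.array([0.3, 0.2])
q_toy = rng.dirichlet(np.ones(J_toy))

# Stack Psi and Gamma as in the block matrices above.
Psi_toy = np.vstack([np.eye(J_toy), *psis_toy])
Gamma_toy = np.vstack([(1 - lam_toy.sum()) * np.eye(J_toy),
                       *[l_g * p_g for l_g, p_g in zip(lam_toy, psis_toy)]])

# varphi has one column per nesting structure: the column sums of psi^g * ln(psi^g).
varphi_toy = np.column_stack([(p_g * safe_log(p_g)).sum(axis=0) for p_g in psis_toy])

# Compact form: (Gamma q)' ln(Psi q) - q' varphi lambda
omega_stacked = (Gamma_toy @ q_toy) @ safe_log(Psi_toy @ q_toy) - q_toy @ varphi_toy @ lam_toy

# Term-by-term form from the definition of the Similarity perturbation function.
omega_direct = (1 - lam_toy.sum()) * (q_toy @ safe_log(q_toy))
for l_g, p_g in zip(lam_toy, psis_toy):
    omega_direct += l_g * ((p_g @ q_toy) @ safe_log(p_g @ q_toy))
omega_direct -= q_toy @ (varphi_toy @ lam_toy)
```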

In [6]:
def Create_nests(data, markets_id, products_id, in_out_id, cont_var, disc_var, outside_option = True):
    r'''
    This function creates the nest matrices \psi^{gt} and stacks them over groups g for each market t.

    Args:
        data: a pandas DataFrame
        markets_id: a string denoting the column of 'data' containing an enumeration t=0,1,...,T-1 of markets
        products_id: a string denoting the column of 'data' containing product codes which uniquely identify products
        in_out_id: a string denoting the column of 'data' containing the dummy for being an inside or outside option. If 'outside_option = False' it is not used and may be set to e.g. the empty string ''.
        cont_var: a list of the continuous variables among the covariates
        disc_var: a list of the discrete variables among the covariates
        outside_option: a boolean indicating whether the model is estimated with or without an outside option. Default is 'True', i.e. with an outside option.

    Returns:
        Psi: a dictionary of length T of ((G+1)*J[t], J[t]) numpy arrays: the J[t] by J[t] identity stacked on top of the \psi^g matrices for each market t and each grouping g
        Psi_dim: a dictionary of length T of (G+1,J[t],J[t]) numpy arrays with the topmost array being the J[t] by J[t] identity matrix and the following G matrices being the \psi^g matrices
    '''

    T = data[markets_id].nunique()
    J = np.array([data[data[markets_id] == t][products_id].nunique() for t in np.arange(T)])
    
    # We include a nest on outside vs. inside options. The number of categories varies if the outside option is included in the analysis.
    dat = data.sort_values(by = [markets_id, products_id]) # Sort the data in ascending order, first by market and then by product id
    
    Psi = {}
    Psi_dim = {}

    if outside_option: # Use the function argument rather than the global OO
        in_out_index = [n for n in np.arange(len(disc_var)) if disc_var[n] == in_out_id][0]
        non_in_out_indices = np.array([n for n in np.arange(len(disc_var)) if disc_var[n] != in_out_id])

    # Assign nests for products in each market t
    for t in np.arange(T):
        data_t = dat[dat[markets_id] == t] # Subset data on market t

        # Estimate discrete kernels
        D_disc = len(disc_var)
        K_disc = np.empty((D_disc, J[t], J[t]))
        C = np.array(data_t[disc_var].nunique())

        for d in np.arange(D_disc):
            Indicator = pd.get_dummies(data_t[disc_var[d]]).values.reshape((J[t], C[d]))
            K_disc[d,:,:] = Indicator@(Indicator.T) # Get the indicator kernel function for the discrete variables

        Psidisc_t = np.einsum('djk,dk->djk', K_disc, 1./(K_disc.sum(axis=1)))
            
        # Estimate continuous kernels
        D_cont = len(cont_var)
        IQR = scstat.iqr(data_t[cont_var].values, axis = 0) # Compute interquartile range of each continuous variable
        sd = np.std(data_t[cont_var].values, axis = 0) # Compute empirical standard deviation of each continuous variable
        h = 0.9*np.fmin(sd, IQR/1.34)/(J[t]**(1/5)) # Use Silverman's rule of thumb for bandwidth estimation for each continuous variable
        w = data_t[cont_var].values.transpose()
        diff = w[:,:,None]*np.ones((D_cont, J[t], J[t])) - w[:,None,:] # calculates the differences w_j - w_k for all continuous g = 1, ... , G , and for all alternatives j,k.
        
        # Compute continuous kernel functions
        if outside_option:
            K_cont = np.exp(-(diff**2)/(2*(h[:,None,None]**2)))[:,1:,1:] # Compute continuous kernel function for inside options
            Psicontinner_t = np.einsum('djk,dk->djk', K_cont, 1./K_cont.sum(axis=1))
            Psicont_t = np.zeros((D_cont, J[t], J[t]))
            Psicont_t[:,0,0] = 1 # The outside option is only similar to itself
            Psicont_t[:,1:,1:] = Psicontinner_t # The inside options are only similar to each other
        else:
            K_cont = np.exp(-(diff**2)/(2*(h[:,None,None]**2))) # Compute continuous kernel function for all options
            Psicont_t = np.einsum('djk,dk->djk', K_cont, 1./K_cont.sum(axis=1))

        # Stack Psi
        D = len([*cont_var, *disc_var]) + 1

        if outside_option:
            Psi_dim[t] = np.concatenate((np.eye(J[t]).reshape((1,J[t],J[t])), Psidisc_t[in_out_index,:,:].reshape((1,J[t],J[t])), Psicont_t, Psidisc_t[non_in_out_indices,:,:]), axis = 0)
            Psi[t] = Psi_dim[t].reshape((D*J[t], J[t]))
        else:
            Psi_dim[t] = np.concatenate((np.eye(J[t]).reshape((1,J[t],J[t])), Psicont_t, Psidisc_t), axis = 0)
            Psi[t] = Psi_dim[t].reshape((D*J[t], J[t]))

    return Psi, Psi_dim
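
The continuous-kernel step inside `Create_nests` can be read in isolation. The sketch below reproduces it for a single made-up characteristic without an outside option. The values in `w_toy` are hypothetical, and `np.percentile` is used here in place of `scipy.stats.iqr` only to keep the example free of extra dependencies.

```python
import numpy as np

# Hypothetical characteristic (e.g. weight in kg) for 6 products.
w_toy = np.array([850.0, 900.0, 1100.0, 1150.0, 1400.0, 1450.0])
J_toy = len(w_toy)

# Silverman's rule-of-thumb bandwidth, mirroring the expression in Create_nests.
sd = np.std(w_toy)
q75, q25 = np.percentile(w_toy, [75, 25])
h = 0.9 * min(sd, (q75 - q25) / 1.34) / J_toy ** (1 / 5)

# Gaussian kernel on all pairwise differences w_j - w_k.
diff = w_toy[:, None] - w_toy[None, :]
K = np.exp(-diff**2 / (2 * h**2))

# Normalize each column to sum to one, so psi @ q is again a probability vector.
psi_cont = K / K.sum(axis=0)

q_toy = np.full(J_toy, 1 / J_toy)      # uniform choice probabilities
similarity_probs = psi_cont @ q_toy
```

Column normalization is what makes $\psi^g q$ a probability distribution, which the entropy interpretation of the perturbation function relies on.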

In [7]:
def phi_matrix(psi):
    r'''
    This function computes the \varphi matrix used in e.g. calculating \delta.

    Args:
        psi: a dictionary of length T of ((G+1)*J[t], J[t]) numpy arrays: the J[t] by J[t] identity stacked on top of the \psi^g matrices for each market t and each grouping g, as returned by the 'Create_nests' function

    Returns:
        phi: a dictionary of length T of (J[t],G) numpy arrays containing the \varphi matrix for each market t
    '''
    T = len(psi)
    J = np.array([psi[t].shape[1] for t in np.arange(T)])
    G = psi[0].shape[0] // J[0] - 1 # Number of groupings: the stacked matrix has G+1 blocks of J[0] rows

    phi = {}

    for t in np.arange(T):
        phi_t = np.empty((J[t], G))
        psi_t = psi[t]

        # Compute phi_g = (psi^g * log(psi^g))' iota, i.e. the column sums of the elementwise product psi^g * log(psi^g)
        for g in np.arange(1,G+1):
            psi_g = psi_t[g*J[t]:(g+1)*J[t],:]
            phi_t[:,g-1] = (psi_g*np.log(psi_g, out = np.zeros_like(psi_g), where = (psi_g > 0))).sum(axis=0)
        
        phi[t] = phi_t

    return phi
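
For nests built from a discrete variable, the entries of $\varphi$ have a simple closed form: a product in a category with $n$ members gets $-\ln n$. The sketch below checks this on a made-up category vector (`segment` is hypothetical), using the same column normalization and the same masked logarithm as the functions above.

```python
import numpy as np

# Hypothetical discrete characteristic (e.g. a segment code) for 5 products.
segment = np.array(["a", "a", "b", "b", "b"])

# Indicator kernel: 1 if products j and k share the same category.
same_category = (segment[:, None] == segment[None, :]).astype(float)

# Column-normalize, as Create_nests does for the discrete variables.
psi_disc = same_category / same_category.sum(axis=0)

# phi_j = sum_i psi_ij * ln(psi_ij), with 0*ln(0) treated as 0.
log_psi = np.log(psi_disc, out=np.zeros_like(psi_disc), where=(psi_disc > 0))
phi_disc = (psi_disc * log_psi).sum(axis=0)

# Size of the category that each product belongs to.
category_size = same_category.sum(axis=0)
```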

In [8]:
def Similarity_specification(data, markets_id, products_id, in_out_id, cont_var, disc_var, outside_option = True):
    '''
    This function returns the Similarity Model specification as given by the covariates and the nesting structure.

    Args:
        data: a pandas DataFrame
        markets_id: a string denoting the column of 'data' containing an enumeration t=0,1,...,T-1 of markets
        products_id: a string denoting the column of 'data' containing product codes which uniquely identify products
        in_out_id: a string denoting the column of 'data' containing the dummy for being an inside or outside option. If 'outside_option = False' it is not used and may be set to e.g. the empty string ''.
        cont_var: a list of the continuous variables among the covariates
        disc_var: a list of the discrete variables among the covariates
        outside_option: a boolean indicating whether the model is estimated with or without an outside option. Default is 'True', i.e. with an outside option.

    Returns:
        Model: a dictionary of length 3, containing the stacked Psi, the 3-dimensional Psi, and the Phi matrix as returned by 'Create_nests' and 'phi_matrix', respectively.
    '''

    Psi, Psi_3d = Create_nests(data, markets_id, products_id, in_out_id, cont_var, disc_var, outside_option)
    Phi = phi_matrix(Psi)
    Model = {'psi' : Psi, 'psi_3d' : Psi_3d, 'phi' : Phi}

    return Model

In [9]:
Model = Similarity_specification(dat, 'market', 'co', 'in_out', nest_contvars, nest_discvars, outside_option = OO)
Psi = Model['psi']

In [10]:
def Create_Gamma(Lambda, model):
    r'''
    This function computes the Gamma matrix.

    Args:
        Lambda: a (G,) numpy array of grouping parameters \lambda_g
        model: a dictionary of the Similarity Model specification as returned by 'Similarity_specification'

    Returns:
        Gamma: a dictionary of length T containing the ((G+1)*J[t],J[t]) numpy arrays of the \Gamma matrices for each market t.
    '''

    Psi = model['psi']
    T = len(Psi)
    J = np.array([Psi[t].shape[1] for t in np.arange(T)])
    
    Gamma = {}
    lambda0 = np.array([1 - np.sum(Lambda)]) # Weight on the identity block
    Lambda_full = np.concatenate((lambda0, Lambda)) # create vector (1- sum(lambda), lambda_1, ..., lambda_G)
    D = len(Lambda_full)
    
    for t in np.arange(T):
        Lambda_long =(Lambda_full[:,None]*np.ones((D,J[t]))).reshape((D*J[t],))
        Gamma[t] = Lambda_long[:,None]*Psi[t]

    return Gamma
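
The row scaling that produces $\Gamma$ from the stacked $\Psi$ can be checked on a toy example. The sketch below uses random column-stochastic blocks as an illustrative assumption and compares two equivalent ways of repeating each weight once per product: broadcasting against a matrix of ones, and `np.repeat`.

```python
import numpy as np

J_toy = 3
lam_toy = np.array([0.3, 0.2])                       # illustrative nesting parameters
lam_full = np.concatenate(([1 - lam_toy.sum()], lam_toy))

# Toy stacked Psi (assumption): the identity on top of two column-stochastic blocks.
rng = np.random.default_rng(3)
blocks = [m / m.sum(axis=0) for m in rng.uniform(0.1, 1.0, size=(2, J_toy, J_toy))]
Psi_toy = np.vstack([np.eye(J_toy), *blocks])

# Repeat each weight J_toy times by broadcasting against a matrix of ones ...
lam_long = (lam_full[:, None] * np.ones((len(lam_full), J_toy))).reshape(-1)
Gamma_broadcast = lam_long[:, None] * Psi_toy

# ... or, equivalently, with np.repeat.
Gamma_repeat = np.repeat(lam_full, J_toy)[:, None] * Psi_toy

# Each column of Gamma sums to one because the weights in lam_full sum to one.
column_sums = Gamma_repeat.sum(axis=0)
```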