# Mixed Logit using PyBLP

In [1]:
import pyblp
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# np.set_printoptions(suppress=True)

In [2]:
pyblp.options.digits = 3
pyblp.options.verbose = False
pd.options.display.precision = 3
pd.options.display.max_columns = 50
# pd.options.display.float_format = '{:.5f}'.format

import IPython.display
IPython.display.display(IPython.display.HTML('<style>pre { white-space: pre !important; }</style>'))

## Introduction

In this project, we address the unrealistic substitution patterns implied by the pure logit model. A standard illustration of the problem is the red bus/blue bus paradox.

To resolve this issue, we add preference heterogeneity to the initial model.

The model here is a random coefficients model:

$$
u_{ijt}=x_{jt}^\prime\beta_{it}+\xi_{jt}+\varepsilon_{ijt}
$$

where $x_{jt}$ is a $k\times 1$ vector of product characteristics and

$$
\beta_{it} = \beta + \underbrace{\Pi}_{k\times d} \underbrace{y_{it}}_{d\times 1} + \underbrace{\Sigma}_{k\times k} \underbrace{\nu_{it}}_{k\times 1}, \qquad \nu_{it} \sim N(0,I) \;\Rightarrow\; \beta_{it} \mid y_{it} \sim N(\beta+\Pi y_{it}, \Sigma\Sigma')
$$

Here, $y_{it}$ is a vector of observed demographics of agent $i$ in market $t$. $\Pi$ shifts preferences for product characteristics according to these observed demographics, while $\Sigma$ scales the unobserved taste shocks $\nu_{it}$.

Therefore, the parameter space here is $\left(\beta,\Pi,\Sigma\right)$.
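As a rough illustration of this structure, the sketch below draws $\beta_{it}$ for many simulated agents and checks that, holding the demographic fixed, the draws have mean $\beta+\Pi y_{it}$ and covariance $\Sigma\Sigma'$. All numerical values (`beta`, `Pi`, `Sigma`, the dimensions, and the demographic value 1.5) are made up for illustration and are not estimates from this project.

```python
import numpy as np

rng = np.random.default_rng(0)

k, d, n_agents = 2, 1, 200_000        # hypothetical dimensions

beta = np.array([1.0, -2.0])          # mean tastes (made up)
Pi = np.array([[0.5], [0.0]])         # k x d demographic shifters (made up)
Sigma = np.array([[0.8, 0.0],
                  [0.3, 0.4]])        # k x k scale matrix (made up)

y = np.full((n_agents, d), 1.5)       # hold the demographic fixed at one value
nu = rng.standard_normal((n_agents, k))

# beta_it = beta + Pi y_it + Sigma nu_it, one row per agent
beta_i = beta + y @ Pi.T + nu @ Sigma.T

sample_mean = beta_i.mean(axis=0)
sample_cov = np.cov(beta_i, rowvar=False)

print(sample_mean)   # should be close to beta + Pi @ [1.5]
print(sample_cov)    # should be close to Sigma @ Sigma.T
```

Setting a row of `Sigma` to zero removes unobserved heterogeneity for that characteristic, which is the restriction used later for mushiness.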


So, the model is:

$$
u_{ijt}=\underbrace{x_{jt}^{\prime}\beta+\xi_{jt}}_{\delta_{jt}}+\underbrace{x_{jt}^{\prime}(\Sigma\nu_{it}+\Pi y_{it})}_{\mu_{ijt}}+\varepsilon_{ijt} \\
s_{jt}=\sum_{i\in\mathcal{I}_t}w_{it}\cdot\frac{\exp[\delta_{jt}+\mu_{ijt}(\Sigma,\Pi)]}{1+\sum_{k\in\mathcal{J}_t}\exp[\delta_{kt}+\mu_{ikt}(\Sigma,\Pi)]}\quad\text{for all}\quad j\in\mathcal{J}_t
$$
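The share equation above is a weighted average of individual logit choice probabilities. A minimal NumPy sketch for a single market follows. The function name `simulate_shares`, the toy values of `delta`, the characteristic `x`, and the scale 0.7 are all assumptions for illustration; this is not PyBLP's implementation.

```python
import numpy as np

def simulate_shares(delta, mu, weights):
    """Market shares in one market.

    delta:   (J,) mean utilities
    mu:      (I, J) agent-specific utility deviations
    weights: (I,) agent weights summing to one
    """
    v = delta[None, :] + mu                       # (I, J) utilities net of epsilon
    # Shift by a per-agent constant for numerical stability; the outside good has utility 0
    shift = np.maximum(v.max(axis=1, keepdims=True), 0.0)
    exp_v = np.exp(v - shift)
    denom = np.exp(-shift) + exp_v.sum(axis=1, keepdims=True)
    probs = exp_v / denom                         # (I, J) choice probabilities
    return weights @ probs                        # (J,) weighted average over agents

rng = np.random.default_rng(0)
J, I = 3, 100
delta = np.array([0.5, -0.2, 0.1])                # made-up mean utilities
x = np.array([1.0, 0.0, 1.0])                     # one made-up characteristic
nu = rng.standard_normal(I)
mu = 0.7 * nu[:, None] * x[None, :]               # sigma = 0.7 is a made-up value
weights = np.full(I, 1 / I)                       # equally likely agents

shares = simulate_shares(delta, mu, weights)
print(shares, 1 - shares.sum())                   # inside shares and the outside share
```

With `mu` set to zero, the function collapses to the plain logit share formula, which is a quick sanity check on the code.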

So, we need two datasets. The first is the product data, indexed by $(j,t)$, which is essentially the same as before. The second is new: agent data, indexed by $(i,t)$. To create it, we can do the following:

1. Fix the number of agents per market, e.g. $|\mathcal{I}_t| = 100$.
2. Draw $\nu_{it} \sim N(0,I)$.
3. Draw $y_{it}$ from the demographic data.
4. Treat each agent as equally likely, so $w_{it} = \frac{1}{|\mathcal{I}_t|}$.


Estimation then proceeds by GMM with instruments. First, for a guess of the preference heterogeneity parameters $(\Sigma,\Pi)$, invert the market share equation above to solve for the mean utilities $\delta_{jt}$. Then, recover $\beta$ from a linear IV-GMM regression of $\delta_{jt}$ on $x_{jt}$.

In [18]:
demographic = pd.read_csv('https://raw.githubusercontent.com/Mixtape-Sessions/Demand-Estimation/main/Exercises/Data/demographics.csv')
product = pd.read_csv('https://raw.githubusercontent.com/Mixtape-Sessions/Demand-Estimation/main/Exercises/Data/products.csv')

demographic['log_income'] = np.log(demographic['quarterly_income'])

product['demand_instruments0'] = product['price_instrument']
product['market_size'] = product['city_population'] * 90
product['market_share'] = product['servings_sold']/product['market_size']
product['outside_share'] = 1 - product.groupby('market')['market_share'].transform('sum')

demographic.rename(columns={'market': 'market_ids'}, inplace=True)
product.rename(columns={'market': 'market_ids'}, inplace=True)
product.rename(columns={'product': 'product_ids'}, inplace=True)
product.rename(columns={'market_share': 'shares'}, inplace=True)
product.rename(columns={'price_per_serving': 'prices'}, inplace=True)

The demographic dataset contains information about 20 individuals drawn from the Current Population Survey for each of the 94 markets in the product data. Each row is a different individual.

## 1. Describe cross-market variation

In [4]:
product_number = product.groupby('market_ids').agg({'product_ids': 'count'})
product_agg = product.groupby('market_ids')[['mushy', 'prices']].agg(['mean', 'std'])
demographic_agg = demographic.groupby('market_ids')[['log_income']].agg(['mean', 'std'])

# Flattening the columns in product_agg
product_agg.columns = ['_'.join(col).strip() for col in product_agg.columns.values]
# Flattening the columns in demographic_agg
demographic_agg.columns = ['_'.join(col).strip() for col in demographic_agg.columns.values]

merged = product_number.merge(product_agg, on='market_ids').merge(demographic_agg, on='market_ids')

merged.describe().round(5)

| | product_ids | mushy_mean | mushy_std | prices_mean | prices_std | log_income_mean | log_income_std |
|---|---|---|---|---|---|---|---|
| count | 94.0 | 94.0 | 94.0 | 94.0 | 94.0 | 94.0 | 94.0 |
| mean | 24.0 | 0.333 | 0.482 | 0.126 | 0.029 | 8.091 | 0.885 |
| std | 0.0 | 0.0 | 0.0 | 0.005 | 0.004 | 0.289 | 0.242 |
| min | 24.0 | 0.333 | 0.482 | 0.112 | 0.022 | 7.499 | 0.523 |
| 25% | 24.0 | 0.333 | 0.482 | 0.122 | 0.026 | 7.872 | 0.71 |
| 50% | 24.0 | 0.333 | 0.482 | 0.126 | 0.028 | 8.093 | 0.831 |
| 75% | 24.0 | 0.333 | 0.482 | 0.129 | 0.031 | 8.338 | 1.086 |
| max | 24.0 | 0.333 | 0.482 | 0.138 | 0.038 | 8.622 | 1.539 |


Every market has 24 products, and the mean and standard deviation of mushiness are identical across markets, so there is no cross-market variation in the number of products or in mushiness. There is, however, cross-market variation in prices. Therefore, when it comes to identifying unobserved preference heterogeneity, we may be able to identify heterogeneity in preferences for prices, but not for mushiness or a constant.

There is also a good amount of cross-market demographic variation: mean log income ranges from 7.499 to 8.622 across markets. With this variation, we have a hope of credibly estimating how income shifts preferences for different characteristics.

## 2. Estimate a simple model: Only Heterogeneity in Mushiness preferences

Here, suppose that we only want heterogeneity in the coefficient on mushiness. Since there is no variation in mushiness across markets, we cannot add unobserved preference heterogeneity for it. In other words, the model we want to estimate is:

$$
u_{ijt} = -\alpha p_{jt} + \beta M_{jt} + M_{jt}\left(\pi y_{it} + \underbrace{\sigma}_{= 0} \nu_{it}\right) + \xi_{jt} + \varepsilon_{ijt}
$$

Therefore, we only have three parameters, $\alpha,\beta,\pi$, to estimate, so we need at least three moment conditions. The instruments we use are
$(\text{price instrument}, M_{jt},M_{jt} \times \bar{y}_t)$. The price instrument is already given in the product data, and $M_{jt}$ is exogenous, so it serves as its own instrument. For $\pi$, we take mean log income in each market, $\bar{y}_t$, and multiply it by the mushiness of each product.

To use PyBLP, we need product data and agent data. The product data is already there. For the agent data, we sample with replacement from the demographic data within each market. We also draw the unobserved taste shocks from a standard normal distribution; they are not needed in this part of the problem, but PyBLP expects them in the agent data. This is done in the following code.

### Generating agent_data

In [None]:
n_samples = 1000

# Set the random seed for reproducibility
random_seed = 42

# Within each market, sample n_samples individuals with replacement
agent_data = (
    demographic
    .groupby('market_ids', as_index=False)
    .apply(lambda x: x.sample(n=n_samples, replace=True, random_state=random_seed))
    .reset_index(drop=True)  # Reset index to get a clean DataFrame
)

# Every sampled agent is equally likely within its market
agent_data['weights'] = 1 / n_samples

# Draw one standard normal taste shock per agent
rng = np.random.default_rng(random_seed)
normal_samples = rng.normal(size=(len(agent_data), 1))
agent_data['nodes0'] = normal_samples

# Alternative with three taste shocks per agent (unused here):
# normal_samples = rng.normal(size=(len(agent_data), 3))
# agent_data[['nodes0', 'nodes1', 'nodes2']] = normal_samples

In [20]:
agent_data.sample(5, random_state=random_seed)

| | market_ids | quarterly_income | log_income | weights | nodes0 |
|---|---|---|---|---|---|
| 18625 | C14Q1 | 7360.966 | 8.904 | 0.001 | 0.366 |
| 42611 | C31Q1 | 6383.453 | 8.761 | 0.001 | -0.598 |
| 77885 | C52Q2 | 2552.445 | 7.845 | 0.001 | 1.249 |
| 30309 | C24Q1 | 6474.351 | 8.776 | 0.001 | -1.495 |
| 7519 | C05Q2 | 4900.165 | 8.497 | 0.001 | -1.503 |


Now that we have the weights for each individual, the integral that defines the market shares can be approximated by a weighted sum over the sampled agents:

$$
s_{jt}(\delta_{jt},\Sigma,\Pi)=\sum_{i\in\mathcal{I}_t}w_{it}\cdot\frac{\exp[\delta_{jt}+\mu_{ijt}(\Sigma,\Pi)]}{1+\sum_{k\in\mathcal{J}_t}\exp[\delta_{kt}+\mu_{ikt}(\Sigma,\Pi)]}\quad\text{for all}\quad j\in\mathcal{J}_t
$$

So, given the observed market shares $S_t$ and a guess of the heterogeneity parameters, we can invert this function to find:

$$
\delta_{jt}(S_{t},\Sigma,\Pi) = x_{jt}^\prime\beta+\xi_{jt}
$$
Then, using GMM-IV, we can solve for $\beta(\Sigma,\Pi)$.
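The inversion step can be sketched with the classic contraction $\delta \leftarrow \delta + \log S - \log s(\delta)$. The code below is a toy, single-market illustration under made-up values (`true_delta`, the characteristic `x`, the scale 0.7); the helper names `simulate_shares` and `invert_shares` are hypothetical, and PyBLP's own fixed-point routine may differ in its details.

```python
import numpy as np

def simulate_shares(delta, mu, weights):
    # Weighted average of individual logit probabilities (outside good utility is 0)
    exp_v = np.exp(delta[None, :] + mu)
    probs = exp_v / (1.0 + exp_v.sum(axis=1, keepdims=True))
    return weights @ probs

def invert_shares(observed, mu, weights, tol=1e-12, max_iter=10_000):
    """Solve s(delta) = observed by iterating delta <- delta + log(observed) - log(s(delta))."""
    # Start from the plain-logit inversion
    delta = np.log(observed) - np.log(1.0 - observed.sum())
    for _ in range(max_iter):
        step = np.log(observed) - np.log(simulate_shares(delta, mu, weights))
        delta = delta + step
        if np.max(np.abs(step)) < tol:
            break
    return delta

rng = np.random.default_rng(1)
I = 200
x = np.array([1.0, 0.0, 1.0])                           # one made-up characteristic
mu = 0.7 * rng.standard_normal(I)[:, None] * x[None, :] # made-up heterogeneity
weights = np.full(I, 1 / I)

true_delta = np.array([0.5, -0.2, 0.1])                 # made-up mean utilities
observed = simulate_shares(true_delta, mu, weights)     # shares generated by true_delta
recovered = invert_shares(observed, mu, weights)
print(recovered)                                        # should match true_delta closely
```

Given the recovered $\delta_{jt}$, the linear parameters follow from an IV regression, which is why the optimizer only has to search over the nonlinear parameters.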

### Generating the other instrument

In [21]:
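# Mean log income in each market, merged onto the product data
log_income_mean = (
    demographic.groupby('market_ids')['log_income'].mean()
    .rename('log_income_mean')
    .reset_index()
)
product = product.merge(log_income_mean, on='market_ids')
# Instrument for pi: mushy interacted with market-level mean log income
product['demand_instruments1'] = product['log_income_mean'] * product['mushy']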

Now, we need to specify the formulations for PyBLP. The first formulation gives the linear part ($X_1$: prices, plus the absorbed market and product fixed effects), and the second gives the nonlinear part ($X_2$: mushiness).

In [22]:
product_formulations = (
    pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'),
    pyblp.Formulation('0 + mushy'),
)

We also need an agent formulation, which specifies the demographics that enter the model.

In [23]:
agent_formulation = pyblp.Formulation('0 + log_income')

In [27]:
mushy_problem = pyblp.Problem(product_formulations, product, agent_formulation, agent_data)
mushy_problem

Dimensions:
 T    N      I     K1    K2    D    MD    ED 
---  ----  -----  ----  ----  ---  ----  ----
94   2256  94000   1     1     1    2     2  

Formulations:
       Column Indices:             0     
-----------------------------  ----------
 X1: Linear Characteristics      prices  
X2: Nonlinear Characteristics    mushy   
       d: Demographics         log_income

So, the model has 94 markets ($T$), 2256 product-market pairs ($N$), and 94000 agents ($I$; 1000 agents per market, sampled with replacement from the demographic data). Also, $K_1$ is the number of linear product characteristics, $K_2$ is the number of nonlinear product characteristics, $D$ is the number of demographic variables, $MD$ is the number of demand-side instruments (2 here), and $ED$ is the number of absorbed dimensions of demand-side fixed effects (2 here: market and product).

In [28]:
optimization = pyblp.Optimization('trust-constr', {'gtol': 1e-8, 'xtol': 1e-8})

In PyBLP, setting the initial value of a nonlinear parameter to 0 fixes it at zero, so the optimizer does not estimate it. In the following code, `sigma=0` therefore shuts down unobserved heterogeneity, while `pi=1` is only a starting value for $\pi$. Also, as explained before, the optimizer only searches over the nonlinear parameters, since the linear ones are functions of them.

In [32]:
mushy_results = mushy_problem.solve(sigma=0, pi=1, method='1s', optimization=optimization)
mushy_results

Problem Results Summary:
GMM   Objective  Gradient              Clipped  Weighting Matrix  Covariance Matrix
Step    Value      Norm      Hessian   Shares   Condition Number  Condition Number 
----  ---------  ---------  ---------  -------  ----------------  -----------------
 1    +2.86E-20  +1.69E-09  +4.99E+01     0        +4.39E+01          +3.48E+01    

Cumulative Statistics:
Computation  Optimizer  Optimization   Objective   Fixed Point  Contraction
   Time      Converged   Iterations   Evaluations  Iterations   Evaluations
-----------  ---------  ------------  -----------  -----------  -----------
 00:00:14       Yes          9            10          4661         15036   

Nonlinear Coefficient Estimates (Robust SEs in Parentheses):
Sigma:    mushy    |   Pi:   log_income 
------  ---------  |  -----  -----------
mushy   +0.00E+00  |  mushy   +2.57E-01 
                   |         (+1.64E-01)

Beta Estimates (Robust SEs in Parentheses):
  prices   
-----------
 -3.06E+01 
(+9.

### Explaining Results

As the number of moment conditions is equal to the number of parameters, the model is just identified and the GMM objective converges to (numerically) zero. Also, the Hessian is positive (+4.99E+01), so the second-order condition for a minimum is satisfied.
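To see why a just-identified model drives the objective to zero, consider a linear IV toy example with two parameters and two instruments: the sample moment conditions can be solved exactly, so the quadratic form in the moments vanishes. The data-generating values below are made up, and the weighting matrix is just one common choice, not necessarily the one PyBLP uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

z = rng.standard_normal((n, 2))                  # two instruments
u = rng.standard_normal(n)                       # structural error
# Regressors depend on the instruments and are correlated with the error
x = z @ np.array([[1.0, 0.3], [0.2, 1.0]]) + 0.5 * u[:, None] + rng.standard_normal((n, 2))
theta_true = np.array([-3.0, 0.25])              # made-up parameters
y = x @ theta_true + u

# Just identified: solve the sample moment conditions Z'(y - X theta) = 0 exactly
theta_hat = np.linalg.solve(z.T @ x, z.T @ y)

g = z.T @ (y - x @ theta_hat) / n                # sample moments at the estimate
W = np.linalg.inv(z.T @ z / n)                   # a 2SLS-style weighting matrix
objective = g @ W @ g
print(theta_hat, objective)                      # objective is zero up to rounding error
```

With more instruments than parameters, the moments generally cannot all be set to zero at once, and the objective at the optimum is strictly positive.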

Notice that the estimate of $\pi$ is positive, which means that higher-income agents have a stronger preference for mushiness. Also, the estimated price coefficient is $-30.6$, the same as in the pure logit case with fixed effects and IV (as it should be, since the current model has no preference heterogeneity for prices).

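Moreover, $\frac{\pi}{\alpha} = \frac{0.257}{30.6} = 0.0084$. Since income enters in logs, this is the change in willingness to pay for mushiness (in units of price per serving) per one-unit increase in log income. A one percent increase in income raises log income by about $0.01$, so it raises willingness to pay for mushiness by about $0.0084 \times 0.01 \approx 0.000084$.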
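The arithmetic can be checked in a few lines. Because income enters in logs, $\pi/\alpha$ is the change in willingness to pay per unit of log income, and a one percent income change corresponds to a change of $\log(1.01)\approx 0.01$ in log income. The two estimates are copied from the results above; the variable names are only for illustration.

```python
import math

pi_hat = 0.257      # coefficient on mushy x log_income (from the results above)
alpha_hat = 30.6    # price enters utility as -alpha, so alpha = 30.6

# Change in willingness to pay per one-unit increase in log income
wtp_per_log_point = pi_hat / alpha_hat

# A 1% income increase changes log income by log(1.01), roughly 0.01
wtp_per_one_percent = wtp_per_log_point * math.log(1.01)

print(f'per unit of log income: {wtp_per_log_point:.4f}')
print(f'per 1% of income:       {wtp_per_one_percent:.6f}')
```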

## 3. Make sure you get the same estimate with random starting values

In [39]:
n_seeds = 3

pi_bounds = (-10,10)
for seed in range(n_seeds):
    np.random.seed(seed)
    initial_pi = np.random.uniform(*pi_bounds)
    result = mushy_problem.solve(sigma=0, pi=initial_pi, method='1s', optimization=optimization)
    print(f'initial_pi: {initial_pi:.4f}, pi_estimate: {result.pi[0][0]:.4f}')

initial_pi: 0.9763, pi_estimate: 0.2567
initial_pi: -1.6596, pi_estimate: 0.2567
initial_pi: -1.2801, pi_estimate: 0.2567


So, we get the same estimate for every starting value of $\pi$ that we tried, which gives us more confidence in the estimate. In a more complicated model with many parameters and many instruments, different starting values will often lead to the global minimum but sometimes to a local one. Optimizers aren't perfect and sometimes terminate prematurely, even with tight termination conditions. For the final estimates, select the run with the lowest objective value.
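The selection rule can be sketched on a toy one-dimensional objective that has both a local and a global minimum. Everything here is a stand-in: the polynomial `objective`, the plain gradient-descent `minimize_from`, and the fixed grid of starting values (used instead of random draws so that the example is deterministic) are assumptions for illustration and have nothing to do with the GMM objective of this problem.

```python
import numpy as np

def objective(x):
    # Toy objective with a local minimum near 1.13 and a global minimum near -1.30
    return x**4 - 3 * x**2 + x

def gradient(x):
    return 4 * x**3 - 6 * x + 1

def minimize_from(start, step=0.01, n_iter=5000):
    # Plain gradient descent from a given starting value
    x = start
    for _ in range(n_iter):
        x = x - step * gradient(x)
    return x

starts = np.linspace(-2.0, 2.0, 5)               # grid of starting values
runs = [(minimize_from(s), s) for s in starts]
for x_hat, s in runs:
    print(f'start: {s:+.2f}, estimate: {x_hat:+.4f}, objective: {objective(x_hat):+.4f}')

# Keep the run with the lowest objective value
best_x, best_start = min(runs, key=lambda r: objective(r[0]))
print(f'selected estimate: {best_x:+.4f}')
```

Some starting values end at the local minimum and others at the global one; comparing objective values across runs is what separates the two.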

## 4. Evaluate changes to the price cut counterfactual