## The BLP (95/99) Automobile Problem
In this tutorial, we'll use data from :ref:`references:Berry, Levinsohn, and Pakes (1995)` to solve the paper's automobile problem. This notebook illustrates several features of `pyblp`:
- Incorporating a supply side into demand estimation.
- Allowing for simple $price/income$ _demographic_ effects.
- Calculating clustered standard errors.
- Calculating optimal instruments.

In [3]:
import pyblp
import numpy as np
import pandas as pd 

pyblp.options.digits = 3
pyblp.options.verbose = False
np.set_printoptions(precision=2, threshold=10, linewidth=100)
pyblp.__version__

'0.5.0'

### Loading the Automobile Data
We'll use NumPy to read the data. We load two datasets:
1. The __product data__ with prices, shares, and product characteristics
2. The __agent data__ with draws from the distribution of heterogeneity

In [7]:
product_data = np.recfromcsv(pyblp.data.BLP_PRODUCTS_LOCATION, encoding='utf-8')
pd.DataFrame(product_data).head()

| | market_ids | clustering_ids | car_ids | firm_ids0 | firm_ids1 | domestic | japan | european | shares | prices | ... | demand_instruments2 | demand_instruments3 | demand_instruments4 | demand_instruments5 | supply_instruments0 | supply_instruments1 | supply_instruments2 | supply_instruments3 | supply_instruments4 | supply_instruments5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1971 | AMGREM71 | 129 | 15 | 15 | 1 | 0 | 0 | 0.001051 | 4.935802 | ... | 0.0 | -0.593346 | -0.346601 | -0.198354 | 0.559897 | 0.280253 | 0.0 | 0.783835 | 0.407806 | 0.030212 |
| 1 | 1971 | AMHORN71 | 130 | 15 | 15 | 1 | 0 | 0 | 0.00067 | 5.516049 | ... | 0.0 | -0.619844 | -0.38014 | -0.186345 | 0.566952 | 0.259925 | 0.0 | 0.829513 | 0.448638 | 0.008836 |
| 2 | 1971 | AMJAVL71 | 132 | 15 | 15 | 1 | 0 | 0 | 0.000341 | 7.108642 | ... | 0.0 | -0.532932 | -0.432497 | -0.191577 | 0.565679 | 0.2438 | 0.0 | 0.700153 | 0.525192 | 0.020907 |
| 3 | 1971 | AMMATA71 | 134 | 15 | 15 | 1 | 0 | 0 | 0.000522 | 6.839506 | ... | 0.0 | -0.529112 | -0.468994 | -0.181281 | 0.571697 | 0.220392 | 0.0 | 0.697406 | 0.580992 | 0.002274 |
| 4 | 1971 | AMAMBS71 | 136 | 15 | 15 | 1 | 0 | 0 | 0.000442 | 8.928395 | ... | 0.0 | -0.452174 | -0.487211 | -0.199757 | 0.561929 | 0.233895 | 0.0 | 0.577626 | 0.609559 | 0.037782 |


The product data contains __market IDs__, __product IDs__, two sets of __firm IDs__ (the second set contains IDs after a simple merger, which are used later), __shares__, __prices__, a number of product characteristics, and some pre-computed excluded instruments, __demand_instrumentsX__ and __supply_instrumentsX__. The product IDs are stored as __clustering_ids__ because they will be used to compute clustered standard errors. For more information about the instruments and the example data as a whole, refer to the :mod:`data` module.

The __agent_data__ argument of :class:`Problem` should also be a structured array-like object.

The __agent data__ contains __market IDs__, integration __weights__ $w_{it}$, integration nodes $\nu_{it}$, and demographics $d_{it}$. Here we use $I_t = 200$ equally weighted draws per market.

In non-example problems, it is usually a better idea to use many more draws, or a more sophisticated :class:`Integration` configuration such as sparse grid quadrature.


In [6]:
agent_data = np.recfromcsv(pyblp.data.BLP_AGENTS_LOCATION, encoding='utf-8')
pd.DataFrame(agent_data).head()

| | market_ids | weights | nodes0 | nodes1 | nodes2 | nodes3 | nodes4 | nodes5 | income |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1971 | 0.005 | 0.548814 | 0.292642 | 0.45776 | 0.56469 | 0.395537 | 0.392173 | 9.728478 |
| 1 | 1971 | 0.005 | 0.715189 | 0.566518 | 0.376918 | 0.839746 | 0.844017 | 0.041157 | 7.908957 |
| 2 | 1971 | 0.005 | 0.602763 | 0.137414 | 0.702335 | 0.376884 | 0.150442 | 0.923301 | 11.079404 |
| 3 | 1971 | 0.005 | 0.544883 | 0.349712 | 0.207324 | 0.499676 | 0.306309 | 0.406235 | 17.641671 |
| 4 | 1971 | 0.005 | 0.423655 | 0.053216 | 0.07428 | 0.081302 | 0.09457 | 0.944282 | 12.423995 |
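
The constant weight of 0.005 in the output above is simply $1 / I_t = 1/200$. As a rough sketch of how equally weighted agent data could be assembled by hand, the snippet below builds weights and nodes for two markets. The seed, the list of markets, and the use of standard normal draws are illustrative assumptions, not the procedure that produced the tutorial's file.

```python
import numpy as np

# Illustrative assumptions: two markets, a fixed seed, standard normal nodes.
rng = np.random.default_rng(0)
markets = [1971, 1972]
draws_per_market = 200  # I_t in the text
num_nodes = 6           # one column of nodes per column of X2

# One row per (market, draw) pair.
market_ids = np.repeat(markets, draws_per_market)

# Equal weights: each draw in a market receives weight 1 / I_t.
weights = np.full(market_ids.size, 1 / draws_per_market)

# Integration nodes, one row per agent.
nodes = rng.standard_normal(size=(market_ids.size, num_nodes))

# Sanity check: the weights sum to one within each market.
weight_sums = {t: weights[market_ids == t].sum() for t in markets}
```

Whatever integration rule is used, the weights within each market should sum to one, which is what `weight_sums` checks.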


### Setting up the Automobile Problem
- Unlike the fake cereal problem, we won't absorb any fixed effects in the automobile problem, so the linear part of demand $X_1$ has more components.
- We also need to specify a formula for the random coefficients $X_2$, including a random coefficient on the constant, which captures correlation among all inside goods.
- The new addition relative to the cereal example is a supply-side formula $X_3$ for marginal costs.
- The `patsy`-style formulas support functions of regressors such as $\log(x)$ and $x^2$.

We stack the three product formulations in order.

In [19]:
product_formulations = (
   # linear demand
   pyblp.Formulation('1 + hpwt + air + mpd + space'),
   # nonlinear demand
   pyblp.Formulation('1 + prices + hpwt + air + mpd + space'),
   # supply model
   pyblp.Formulation('1 + log(hpwt) + air + log(mpg) + log(space) + trend')
)
product_formulations

(1 + hpwt + air + mpd + space,
 1 + prices + hpwt + air + mpd + space,
 1 + log(hpwt) + air + log(mpg) + log(space) + trend)

The original specification for the automobile problem includes the term $\log(y_i - p_j)$, in which $y_i$ is income and $p_j$ is price. Instead of including this term, which gives rise to a host of numerical problems, we'll follow :ref:`references:Berry, Levinsohn, and Pakes (1999)` and use its first-order linear approximation, $p_j / y_i$.
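
To see where the approximation comes from, write $\log(y_i - p_j) = \log y_i + \log(1 - p_j / y_i) \approx \log y_i - p_j / y_i$ when $p_j / y_i$ is small. The $\log y_i$ term is constant across products for a given agent, so the product-specific part is $-p_j / y_i$; this is also why the coefficient on $p_j / y_i$ is expected to be negative. The following numerical check is a small sketch with made-up income and price values (not taken from the data) showing how the approximation error grows with $p_j / y_i$.

```python
import numpy as np

def exact_term(income, price):
    # Product-specific part of log(y - p): log(y - p) - log(y) = log(1 - p / y).
    return np.log(income - price) - np.log(income)

def linear_term(income, price):
    # First-order approximation of log(1 - p / y) around p / y = 0.
    return -price / income

# Hypothetical values chosen only to illustrate the approximation.
income = 100.0
prices = np.array([1.0, 5.0, 10.0])

gap = np.abs(exact_term(income, prices) - linear_term(income, prices))
```

The entries of `gap` increase with the price-to-income ratio, so the approximation is best when prices are small relative to income.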

The agent formulation for $d$ includes a column of $1 / y_i$ values, which we'll interact with $p_j$. To do this, we will treat draws of $y_i$ as _demographic_ variables.

In [20]:
agent_formulation = pyblp.Formulation('0 + I(1 / income)')
agent_formulation

I(1 / income)
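
To make the role of this demographic concrete, the sketch below computes the agent-specific part of utility, $\mu_{ij} = x_{2j}' (\Sigma \nu_i + \Pi d_i)$, for one product and two agents. All numbers are hypothetical, $\Sigma$ is set to zero to isolate the demographic term, and $\Pi$ has a single nonzero entry in the row for prices, so the result reduces to a coefficient times $p_j / y_i$.

```python
import numpy as np

# Hypothetical product with columns ordered as in X2:
# [1, prices, hpwt, air, mpd, space]
x2 = np.array([1.0, 5.0, 0.5, 1.0, 2.0, 1.5])

# Switch off unobserved heterogeneity to isolate the demographic term.
sigma = np.zeros((6, 6))

# Only the prices row interacts with the single demographic, 1 / income.
# The value -10 is a hypothetical coefficient.
pi = np.c_[[0.0, -10.0, 0.0, 0.0, 0.0, 0.0]]

income = np.array([10.0, 20.0])  # two hypothetical agents
d = (1 / income)[:, None]        # demographics, shape (agents, 1)
nu = np.zeros((2, 6))            # nodes play no role when sigma is zero

# mu_ij = x2_j' (Sigma nu_i + Pi d_i), computed for all agents at once.
mu = (nu @ sigma.T + d @ pi.T) @ x2
```

Here `mu` equals `-10 * 5.0 / income`, so the agent with the higher income is less sensitive to the same price.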

As in the cereal example, the :class:`Problem` can be constructed by combining the __product formulations__, __product data__, __agent formulation__ and __agent data__. 

We provide a detailed explanation of the problem output, beginning with the _Dimensions_ table:
- $N$ denotes the total number of observations (products and markets).
- $T$ denotes the number of markets.
- $K_1$ is the number of linear demand parameters.
- $K_2$ is the number of nonlinear demand parameters.
- $K_3$ is the number of linear supply parameters.
- $D$ is the number of demographic variables.
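- $M_D$ is the number of demand instruments (the excluded instruments plus the exogenous columns of $X_1$).
- $M_S$ is the number of supply instruments (the excluded instruments plus the exogenous columns of $X_3$).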

The _Formulations_ table describes all four formulas (Linear Characteristics, Nonlinear Characteristics, Cost Characteristics, and Demographics).

In [25]:
problem = pyblp.Problem(product_formulations, product_data, agent_formulation, agent_data)
problem

Dimensions:
 N     T    K1    K2    K3    D    MD    MS 
----  ---  ----  ----  ----  ---  ----  ----
2217  20    5     6     6     1    11    12 

Formulations:
       Column Indices:            0          1        2       3          4         5  
-----------------------------  --------  ---------  -----  --------  ----------  -----
 X1: Linear Characteristics       1        hpwt      air     mpd       space          
X2: Nonlinear Characteristics     1       prices    hpwt     air        mpd      space
  X3: Cost Characteristics        1      log(hpwt)   air   log(mpg)  log(space)  trend
       d: Demographics         1/income                                               
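
As a quick cross-check of the _Dimensions_ table, the sketch below recounts the columns by hand. It relies on two assumptions: the naive term counter only works for formulas that are plain sums of terms, and there are six excluded instruments on each side (columns numbered 0 through 5, as the column names in the product data suggest). The instrument totals then match the table if they are read as excluded instruments plus exogenous regressors.

```python
# Formula strings copied from the product formulations above.
formulas = {
    'K1': '1 + hpwt + air + mpd + space',
    'K2': '1 + prices + hpwt + air + mpd + space',
    'K3': '1 + log(hpwt) + air + log(mpg) + log(space) + trend',
}

def count_terms(formula):
    # Naive count: valid only when the formula is a plain sum of terms.
    return len([term for term in formula.split('+') if term.strip()])

dims = {name: count_terms(formula) for name, formula in formulas.items()}

# Assumption: six excluded instruments on each side (indices 0 through 5).
excluded_demand = 6
excluded_supply = 6

# Prices do not appear in X1 or X3 here, so every column counts as exogenous.
dims['MD'] = excluded_demand + dims['K1']
dims['MS'] = excluded_supply + dims['K3']
```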

### Solving the Automobile Problem

The only decisions remaining are:
- Choosing starting values for $\Sigma_0, \Pi_0$.
- Potentially choosing bounds for $\Sigma, \Pi$.
- Choosing a functional form for $mc_{jt}$ (either _linear_ or _log_)

The decisions we will use are:
- Use published estimates as our starting values for $\Sigma_0$.
- Interact income $1/y_i$ only with prices and use the published estimates on $\log(y_i - p_j)$ as our starting value for $\alpha$.
- Restrict $\Sigma$ to be diagonal (these are standard deviations) and bound each diagonal element in absolute value.
- Bound the $price/income$ coefficient to be negative. Specifically, we'll use an upper bound that's slightly below zero because, when the parameter is exactly zero, there are matrix inversion problems when computing $\eta$.

When using a routine that supports bounds, :class:`Problem` chooses some default bounds to reduce the chance of numerical overflow that happens, for example, when optimization routines try out large parameter values. However, these default bounds are not quite restrictive enough to prevent overflow in the automobile problem, so we'll set our own bounds. 

Choosing reasonable bounds can be very important.
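
The overflow concern is easy to reproduce outside of `pyblp`. The sketch below uses a plain logit share formula, with hypothetical utility values, to show what happens when a trial parameter value pushes a utility far outside the range that double-precision exponentials can represent.

```python
import numpy as np

def logit_shares(utilities):
    # Simple logit shares with an outside good whose utility is zero.
    exp_u = np.exp(utilities)
    return exp_u / (1 + exp_u.sum())

# A very large utility overflows: exp(800) is inf, and inf / inf is nan.
with np.errstate(over='ignore', invalid='ignore'):
    overflowed = logit_shares(np.array([800.0, 1.0]))

# Moderate utilities, as enforced by tighter bounds, stay well-behaved.
bounded = logit_shares(np.array([5.0, 1.0]))
```

Once a single share becomes `nan`, every downstream quantity (objective, gradient, markups) is contaminated, which is why keeping parameters in a plausible range matters.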

In [34]:
initial_sigma = np.diag([3.612, 0, 4.628, 1.818, 1.050, 2.056])
initial_pi = np.c_[[0, -10, 0, 0, 0, 0]]
sigma_bounds = (
   -np.diag([100, 0, 100, 100, 50, 100]),
   np.diag([100, 0, 100, 100, 50, 100])
)
pi_bounds = (
   np.c_[[0, -50, 0, 0, 0, 0]],
   np.c_[[0, -0.1, 0, 0, 0, 0]]
)
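
Before solving, it can be worth confirming that the starting values lie inside the bounds and counting how many parameters are actually free. The helper below, `check_bounds`, is a hypothetical convenience function and not part of `pyblp`; it treats any element whose lower and upper bounds coincide as fixed.

```python
import numpy as np

# Values copied from the cell above.
initial_sigma = np.diag([3.612, 0, 4.628, 1.818, 1.050, 2.056])
initial_pi = np.c_[[0, -10, 0, 0, 0, 0]]
sigma_bounds = (
    -np.diag([100, 0, 100, 100, 50, 100]),
    np.diag([100, 0, 100, 100, 50, 100]),
)
pi_bounds = (
    np.c_[[0, -50, 0, 0, 0, 0]],
    np.c_[[0, -0.1, 0, 0, 0, 0]],
)

def check_bounds(initial, bounds):
    """Validate starting values and return the number of free elements."""
    lower, upper = bounds
    if not np.all((lower <= initial) & (initial <= upper)):
        raise ValueError("Starting values fall outside the bounds.")
    # Elements with identical lower and upper bounds cannot move.
    return int((lower < upper).sum())

free_sigma = check_bounds(initial_sigma, sigma_bounds)
free_pi = check_bounds(initial_pi, pi_bounds)
```

With these bounds, five diagonal elements of $\Sigma$ and one element of $\Pi$ are free; the price entry of $\Sigma$ and all off-diagonal entries are pinned at zero.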

A linear marginal cost specification is the default in `pyblp`, so we'll need to use the __costs_type__ argument to employ the log-linear specification used by :ref:`references:Berry, Levinsohn, and Pakes (1995)`. A downside of this specification is that nonpositive estimated marginal costs can create problems for the optimization routine when computing $\log c(\hat{\theta})$. We'll use the __costs_bounds__ argument to bound marginal costs from below by a small number. 
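
The effect of a lower bound on marginal costs can be illustrated with a few lines of NumPy. This is only a sketch of the idea with hypothetical prices and markups, not `pyblp`'s internal code: costs implied by $c = p - \eta$ are clipped from below before taking logs, and the number of clipped values is counted, much like the "Clipped Marginal Costs" column in the results summary.

```python
import numpy as np

# Hypothetical prices and markups; the last two imply nonpositive costs.
prices = np.array([5.0, 6.0, 7.0])
eta = np.array([2.0, 6.5, 7.0])
costs = prices - eta

# Same lower bound as in the tutorial; None means no upper bound.
lower, upper = 0.001, None
clipped = np.clip(costs, lower, upper)

# Logs are finite only because of the clipping.
log_costs = np.log(clipped)
num_clipped = int((costs < lower).sum())
```

A large count of clipped costs at the optimum would suggest that the bound is driving the estimates, so it is worth checking that this number is small or zero.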

Finally, as in the original paper, we'll use the `W_type` and `se_type` arguments to cluster by product IDs, which were specified as `clustering_ids` in the product data.
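
For intuition about what clustering changes, the sketch below computes a generic clustered covariance of moment contributions: contributions are summed within each cluster before taking outer products, which allows arbitrary correlation among observations that share a cluster ID. This is a textbook construction written for illustration, with a hypothetical function name and simulated inputs; it is not necessarily the exact formula `pyblp` uses.

```python
import numpy as np

def clustered_moment_covariance(g, cluster_ids):
    """Clustered covariance of moments.

    g is an (N, M) array of per-observation moment contributions and
    cluster_ids is a length-N array of cluster labels.
    """
    n, m = g.shape
    s = np.zeros((m, m))
    for c in np.unique(cluster_ids):
        # Sum contributions within the cluster, then take the outer product.
        g_c = g[cluster_ids == c].sum(axis=0)
        s += np.outer(g_c, g_c)
    return s / n

# Simulated example: six observations, two moments, three clusters.
rng = np.random.default_rng(0)
g_example = rng.normal(size=(6, 2))
ids_example = np.array(['a', 'a', 'b', 'b', 'b', 'c'])
s_clustered = clustered_moment_covariance(g_example, ids_example)
```

When every observation is its own cluster, the expression collapses to the usual heteroskedasticity-robust form `g.T @ g / N`.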

In [35]:
results = problem.solve(
   initial_sigma,
   initial_pi,
   sigma_bounds=sigma_bounds,
   pi_bounds=pi_bounds,
   costs_type='log',
   costs_bounds=(0.001, None),
   W_type='clustered',
   se_type='clustered'
)
results

Problem Results Summary:
Cumulative  GMM   Optimization   Objective   Total Fixed Point  Total Contraction  Objective    Gradient        Clipped    
Total Time  Step   Iterations   Evaluations     Iterations         Evaluations       Value    Infinity Norm  Marginal Costs
----------  ----  ------------  -----------  -----------------  -----------------  ---------  -------------  --------------
 0:02:26     2         21           29             21551              65180        +1.68E+05    +1.38E+01          0       

Linear Estimates (Robust SEs Adjusted for 999 Clusters in Parentheses):
Beta:        1          hpwt          air          mpd         space                
------  -----------  -----------  -----------  -----------  -----------             
         -3.92E+00    +1.35E+01    +4.33E+00    -1.46E+00    +4.90E+00              
        (+2.53E+00)  (+4.27E+00)  (+2.35E+00)  (+1.70E+00)  (+1.33E+00)             
Gamma:       1        log(hpwt)       air       log(mpg)    log(sp

There are some discrepancies between our results and those in the original paper. The instruments we constructed are meant to mimic the original instruments, which we were unable to reconstruct perfectly.