## Exercise 1

In [None]:
import pyblp
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

pyblp.options.digits = 3
pyblp.options.verbose = False
pd.options.display.precision = 3
pd.options.display.max_columns = 50

import IPython.display
IPython.display.display(IPython.display.HTML('<style>pre { white-space: pre !important; }</style>'))

### 1. Describe the data

Let's load the data and look at a random sample. It's good practice to set your seed whenever you do something with a random number generator.

In [None]:
product_data = pd.read_csv('../Data/products.csv')
product_data.sample(n=5, random_state=0)

### 2. Compute market shares

Let's compute the market size and market shares.

In [None]:
product_data['market_size'] = product_data['city_population'] * 90
product_data['market_share'] = product_data['servings_sold'] / product_data['market_size']
product_data['outside_share'] = 1 - product_data.groupby('market')['market_share'].transform('sum')
product_data[['market_share', 'outside_share']].describe()
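As a small worked version of the calculation above, here is a sketch on made-up numbers for a single market. The product labels, population, and sales figures are invented. Reading the factor of 90 as one potential serving per person per day over a 90-day quarter is our assumption, not something the data documents.

```python
# Toy numbers for one hypothetical market; nothing here comes from products.csv.
city_population = 1_000
days_per_quarter = 90  # assumption: one potential serving per person per day
servings_sold = {"A": 9_000, "B": 4_500, "C": 1_800}

# Market size is the total number of servings that could have been bought.
market_size = city_population * days_per_quarter

# Each product's share is its servings sold divided by the market size.
market_shares = {name: sold / market_size for name, sold in servings_sold.items()}

# The outside share is whatever is left after summing the inside shares.
outside_share = 1 - sum(market_shares.values())

print(market_shares)
print(outside_share)
```

If the outside share ever came out negative, the assumed market size would be too small for the observed sales.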

### 3. Estimate the pure logit model with OLS

Let's use the R-style formula interface to statsmodels to estimate the pure logit model with an OLS regression. We'll use HC0 standard errors to align with PyBLP's default, which is to adjust for heteroskedasticity.

In [None]:
product_data['logit_delta'] = np.log(product_data['market_share'] / product_data['outside_share'])
statsmodels_ols = smf.ols('logit_delta ~ 1 + mushy + price_per_serving', product_data)
statsmodels_results = statsmodels_ols.fit(cov_type='HC0')
statsmodels_results.summary2().tables[1]

The coefficient on price is negative, so estimated demand slopes downward. We'll compute elasticities later, which are more interpretable than the magnitude of the price coefficient here. To interpret the coefficient on mushy, we can divide it by the magnitude of the coefficient on price: 0.075 / 7.480 ≈ $0.01 per serving is consumers' willingness to pay for a cereal being "mushy." Mushy is likely correlated with a number of other characteristics that we're not including in this regression, so the estimate may partly reflect those omitted characteristics.
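In symbols, the regression above can be sketched as follows under the standard pure logit assumptions. The notation is ours rather than the notebook's: $s_{jt}$ is the share of product $j$ in market $t$, $s_{0t}$ is the outside share, $p_{jt}$ is the price per serving, and $\xi_{jt}$ is unobserved quality.

\begin{align*}
\log\frac{s_{jt}}{s_{0t}} &= \beta_0 + \beta_{\text{mushy}}\,\text{mushy}_j + \alpha\,p_{jt} + \xi_{jt}, \\
\text{WTP}_{\text{mushy}} &= -\frac{\beta_{\text{mushy}}}{\alpha}.
\end{align*}

Because $\alpha < 0$, the willingness to pay has the same sign as $\beta_{\text{mushy}}$, and it is measured in the units of price (dollars per serving).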

### 4. Run the same regression with PyBLP

Let's prep our data for use by PyBLP. We need to rename some columns and then use a similar R-style formula to set up our problem.

In [None]:
product_data = product_data.rename(columns={
    'market': 'market_ids',
    'product': 'product_ids',
    'market_share': 'shares',
    'price_per_serving': 'prices',
})
product_data['demand_instruments0'] = product_data['prices']
ols_problem = pyblp.Problem(pyblp.Formulation('1 + mushy + prices'), product_data)
ols_problem

Let's double-check that PyBLP's instruments matrix is as we expect: a constant, mushy, and prices. The columns are ordered differently, but they contain the same three variables.

In [None]:
pd.DataFrame(ols_problem.products.ZD).sample(n=5, random_state=0)

Now let's run the same OLS regression.

In [None]:
ols_results = ols_problem.solve(method='1s')
ols_results

We can create a quick dataframe to nicely format the estimates in this notebook.

In [None]:
pd.DataFrame(index=ols_results.beta_labels, data={
    ("Estimates", "Statsmodels"): statsmodels_results.params.values,
    ("Estimates", "PyBLP"): ols_results.beta.flat,
    ("SEs", "Statsmodels"): statsmodels_results.bse.values,
    ("SEs", "PyBLP"): ols_results.beta_se.flat,
})

We get the same estimates and the same standard errors.
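The match is expected: setting `demand_instruments0` equal to prices means price instruments for itself, and an IV estimator whose instruments are the regressors reduces to OLS. Below is a minimal sketch of that equivalence for a single regressor plus a constant. The helper functions and the toy numbers are hypothetical and do not use PyBLP.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (dividing by n) between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)


def ols_slope(y, x):
    """Slope from regressing y on x and a constant."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(y, x, z):
    """Just-identified IV slope of y on x and a constant, with instrument z."""
    return covariance(z, y) / covariance(z, x)


# Made-up prices and logit deltas for five products.
toy_prices = [0.10, 0.12, 0.15, 0.11, 0.20]
toy_delta = [-2.1, -2.4, -2.9, -2.2, -3.6]

print(ols_slope(toy_delta, toy_prices))
print(iv_slope(toy_delta, toy_prices, z=toy_prices))  # same number as OLS
```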

### 5. Add market and product fixed effects

It's easiest to add fixed effects by absorbing them. This is done under the hood with iterative de-meaning. We'll drop the constant and the mushy dummy because these are collinear with the fixed effects.

In [None]:
fe_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)
fe_problem

In [None]:
fe_results = fe_problem.solve(method='1s')
fe_results

We get a more negative coefficient on price, suggesting that the OLS coefficient was biased upwards. That is what we would expect if price was positively correlated with the product-specific and market-specific components of unobserved quality.
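To make the absorption step in this section concrete, here is a minimal sketch of iterative de-meaning over two sets of fixed effects. It is an illustration of the idea only, not PyBLP's implementation. The function names, tolerance, and toy panel are all hypothetical.

```python
def demean_by(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return [value - totals[group] / counts[group] for value, group in zip(values, groups)]


def absorb_two_way(values, groups_a, groups_b, tol=1e-12, max_iter=1000):
    """Alternate de-meaning over two groupings until the values stop changing."""
    current = list(values)
    for _ in range(max_iter):
        updated = demean_by(demean_by(current, groups_a), groups_b)
        if max(abs(u - c) for u, c in zip(updated, current)) < tol:
            return updated
        current = updated
    return current


# Made-up unbalanced panel: not every product is sold in every market.
toy_markets = ["m1", "m1", "m1", "m2", "m2", "m3", "m3"]
toy_products = ["p1", "p2", "p3", "p1", "p2", "p2", "p3"]
toy_prices = [0.10, 0.14, 0.12, 0.11, 0.16, 0.15, 0.13]

residual_prices = absorb_two_way(toy_prices, toy_markets, toy_products)
print(residual_prices)
```

After convergence, the residualized prices have (numerically) zero mean within every market and every product, which is why a constant or a product-level dummy such as mushy has no variation left to identify a coefficient.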

### 6. Add an instrument for price

First, let's run a first-stage regression of price on the price instrument in the data to make sure it's relevant.

In [None]:
first_stage = smf.ols('prices ~ 0 + price_instrument + C(market_ids) + C(product_ids)', product_data)
first_stage_results = first_stage.fit(cov_type='HC0')
first_stage_results.summary2().tables[1].sort_index(ascending=False)

It seems relevant, being strongly positively correlated with price even after adjusting for market and product fixed effects. Now we'll use it to instrument for price.

In [None]:
product_data = product_data.drop(columns='demand_instruments0').rename(columns={'price_instrument': 'demand_instruments0'})
iv_problem = pyblp.Problem(pyblp.Formulation('0 + prices', absorb='C(market_ids) + C(product_ids)'), product_data)
iv_problem

In [None]:
iv_results = iv_problem.solve(method='1s')
iv_results

In [None]:
pd.DataFrame(index=fe_results.beta_labels, data={
    ("Estimates", "OLS"): ols_results.beta[-1:].flat,
    ("Estimates", "+FE"): fe_results.beta.flat,
    ("Estimates", "+IV"): iv_results.beta.flat,
    ("SEs", "OLS"): ols_results.beta_se[-1:].flat,
    ("SEs", "+FE"): fe_results.beta_se.flat,
    ("SEs", "+IV"): iv_results.beta_se.flat,
})

Our estimate gets even more negative with an IV, suggesting that the component of unobserved quality that varies within both product *and* market was still positively correlated with price.
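The direction of this bias can be reproduced in a small simulation. The sketch below is entirely hypothetical: the true price coefficient, the data-generating process, and the instrument (called `cost_shifter` here) are all invented, and it only illustrates why OLS is pulled upwards when price is positively correlated with unobserved quality.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (dividing by n) between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
n = 5000
alpha = -2.0  # hypothetical true price coefficient

# Unobserved quality, and an instrument that shifts price but is unrelated to quality.
xi = [rng.gauss(0, 1) for _ in range(n)]
cost_shifter = [rng.gauss(0, 1) for _ in range(n)]

# Price responds positively to both the instrument and unobserved quality.
sim_prices = [z + 0.5 * q + rng.gauss(0, 0.5) for z, q in zip(cost_shifter, xi)]
sim_delta = [alpha * p + q for p, q in zip(sim_prices, xi)]

ols_estimate = covariance(sim_prices, sim_delta) / covariance(sim_prices, sim_prices)
iv_estimate = covariance(cost_shifter, sim_delta) / covariance(cost_shifter, sim_prices)

print(ols_estimate)  # biased towards zero relative to alpha
print(iv_estimate)   # close to alpha
```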

### 7. Cut a price in half and see what happens

Let's select the market in which we'll run the counterfactual and see what choices are available to consumers.

In [None]:
counterfactual_market = 'C01Q2'
counterfactual_data = product_data.loc[product_data['market_ids'] == counterfactual_market, ['product_ids', 'mushy', 'prices', 'shares']]
counterfactual_data

Let's cut the price of the first product in half and use our estimated model to predict how market shares of all products in the market will change.

In [None]:
counterfactual_data['new_prices'] = counterfactual_data['prices']
counterfactual_data.loc[counterfactual_data['product_ids'] == 'F1B04', 'new_prices'] /= 2
counterfactual_data['new_shares'] = iv_results.compute_shares(market_id=counterfactual_market, prices=counterfactual_data['new_prices'])
counterfactual_data['iv_change'] = 100 * (counterfactual_data['new_shares'] - counterfactual_data['shares']) / counterfactual_data['shares']
counterfactual_data

The market share of the product whose price we halved increased by more than 200%, suggesting that consumers are fairly responsive to price changes. The market shares of the other products all decreased, which makes sense (the substitution has to come from somewhere), but they all decreased by the same percentage, which seems unrealistic. We would expect more substitution from more similar products. The cannibalization estimates don't seem reasonable either: we'd expect more cannibalization from the other products of firm one that are more similar to the product whose price is being cut.
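The equal percentage declines are a mechanical feature of the pure logit share formula, not of this dataset. Here is a minimal sketch with invented numbers: the price coefficient, prices, and non-price utilities are hypothetical, and the outside good's utility is normalized to zero.

```python
import math


def logit_shares(deltas):
    """Pure logit shares, with the outside good's mean utility normalized to zero."""
    exp_deltas = [math.exp(d) for d in deltas]
    denominator = 1 + sum(exp_deltas)
    return [e / denominator for e in exp_deltas]


toy_alpha = -30.0                  # hypothetical price coefficient
toy_prices = [0.10, 0.12, 0.15]    # hypothetical prices per serving
other_utility = [1.0, 0.5, 1.5]    # hypothetical non-price part of mean utility

old_shares = logit_shares([u + toy_alpha * p for u, p in zip(other_utility, toy_prices)])

# Halve the first product's price and recompute every share.
halved_prices = [toy_prices[0] / 2] + toy_prices[1:]
new_shares = logit_shares([u + toy_alpha * p for u, p in zip(other_utility, halved_prices)])

percent_changes = [100 * (new - old) / old for new, old in zip(new_shares, old_shares)]
print(percent_changes)
```

Each rival's share is its own exponentiated utility divided by a common denominator, so when only the denominator changes, every rival's share is scaled by the same factor.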

### 8. Compute demand elasticities

To get a sense for what's going on, we can compute demand elasticities.

In [None]:
iv_elasticities = iv_results.compute_elasticities(market_id=counterfactual_market)
pd.DataFrame(iv_elasticities)

The diagonal elements are own-price elasticities, which are useful statistics to report (perhaps as a quantity-weighted average or median) instead of the raw price coefficient, which is a bit hard to interpret on its own. They suggest that consumers are pretty elastic. The off-diagonal elements are cross-price elasticities, which are all fairly small. The unrealistic substitution patterns we saw in our counterfactual also show up here: all cross-price elasticities in each column are the same, even though we'd expect some differences for more similar products.
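The identical columns follow directly from the closed-form pure logit elasticities: the own-price elasticity is `alpha * p_j * (1 - s_j)`, and the elasticity of product j's share with respect to product k's price is `-alpha * p_k * s_k`, which does not depend on j. The sketch below builds the matrix by hand from these formulas. The price coefficient, prices, and shares are made up, and the function is a hypothetical helper rather than part of PyBLP.

```python
def logit_elasticities(alpha, prices, shares):
    """Entry [j][k] is the elasticity of product j's share with respect to product k's price."""
    n = len(prices)
    return [
        [
            alpha * prices[k] * (1 - shares[k]) if j == k else -alpha * prices[k] * shares[k]
            for k in range(n)
        ]
        for j in range(n)
    ]


toy_alpha = -30.0                 # hypothetical price coefficient
toy_prices = [0.10, 0.12, 0.15]   # hypothetical prices per serving
toy_shares = [0.05, 0.02, 0.03]   # hypothetical market shares

toy_elasticities = logit_elasticities(toy_alpha, toy_prices, toy_shares)
for row in toy_elasticities:
    print([round(value, 3) for value in row])
```

Within any column, every off-diagonal entry depends only on the price and share of the product whose price is changing, so more similar products cannot have larger cross-price elasticities in this model.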