# Cars: Getting Started

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns 
import pyblp
sns.set_theme()

pyblp.options.digits = 2
pyblp.options.verbose = False
pyblp.__version__

'1.1.0'

# Read in data

The dataset, `cars.csv`, contains cleaned and processed data. If you want to make changes, the notebook `materialize.ipynb` creates the data from the raw source datasets.

In [2]:
cars = pd.read_csv('cars.csv') # this reads the *balanced* dataset (i.e. J = 40 products per market always)
# cars = pd.read_excel('cars.xlsx') # this reads the *unbalanced* dataset (i.e. J varies over time)

In [3]:
lbl_vars = pd.read_csv('labels_variables.csv', index_col=0)
lbl_vals = pd.read_stata('cars.dta', iterator=True).value_labels() # value labels for coded variables (not every variable has them)

## Overview of the dataset

In [4]:
pd.set_option('display.max_colwidth', None)
tab = cars.mean(numeric_only=True).apply(lambda x: f'{x:.2f}').to_frame('Mean').join(lbl_vars)
tab

| | Mean | label |
|---|---|---|
| ye | 84.5 | year (=first dimension of panel) |
| ma | 3.0 | market (=second dimension of panel) |
| co | 207.5 | model code (=third dimension of panel) |
| zcode | 177.76 | alternative model code (predecessors and successors get same number) |
| brd | 16.79 | brand code |
| org | 2.72 | origin code (demand side, country with which consumers associate model) |
| loc | 5.17 | location code (production side, country where producer produce model) |
| cla | 2.3 | class or segment code |
| home | 0.32 | domestic car dummy (appropriate interaction of org and ma) |
| frm | 14.5 | firm code |


# Set up for analysis

## Price variables 

The price variable can be either price (`pr`), price-to-income (`princ`), or log price (`logp`, created below as the log of whichever variable `price_var` selects).

In [5]:
price_var = 'princ'

In [6]:
cars['logp'] = np.log(cars[price_var])

In [7]:
cars.columns

Index(['ye', 'ma', 'co', 'zcode', 'brd', 'type', 'brand', 'model', 'org',
       'loc', 'cla', 'home', 'frm', 'qu', 'cy', 'hp', 'we', 'pl', 'do', 'le',
       'wi', 'he', 'li1', 'li2', 'li3', 'li', 'sp', 'ac', 'pr', 'princ',
       'eurpr', 'exppr', 'avexr', 'avdexr', 'avcpr', 'avppr', 'avdcpr',
       'avdppr', 'xexr', 'tax', 'pop', 'ngdp', 'rgdp', 'engdp', 'ergdp',
       'engdpc', 'ergdpc', 'inc', 'logp'],
      dtype='object')

## Market share

**Todo:** Decide how to measure the market size and thereby the market share. *Note:* Below is just an example that sets the market size = population / 3. 
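
As a sketch of what the next two cells compute, write $q_{jt}$ for the units sold of car $j$ in market-year $t$ and $M_t$ for the chosen market size (both symbols are notation introduced here for illustration, not taken from the dataset documentation). The market shares and the outside share are then:

$$
s_{jt} = \frac{q_{jt}}{M_t}, \qquad s_{0t} = 1 - \sum_{j=1}^{J} s_{jt}.
$$

Whatever market size you choose, $s_{0t}$ has to lie strictly between 0 and 1 in every market-year, otherwise the log of the share ratio used later is undefined.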

In [8]:
# total quantity of cars sold in market-year (ma, ye)
cars['qu_tot'] = cars.groupby(['ma', 'ye'])['qu'].transform('sum')
cars['market_size'] = cars['pop'] / 3 # TODO: Choose your own market size measure
cars['s'] = cars['qu'] / cars['market_size']

In [9]:
# compute the share of the outside good (will be useful for the demand inversion)
cars['s0'] = 1.0 - cars.groupby(['ma', 'ye'])['s'].transform('sum')
print(f'Outside share is from {cars.s0.min():.1%} to {cars.s0.max():.1%}')

Outside share is from 88.0% to 95.4%


In [10]:
cars.groupby(['ma'])['s'].describe().rename(index=lbl_vals['market']).style.format('{:.3f}')

| ma | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Belgium | 1200.0 | 0.002 | 0.001 | 0.0 | 0.001 | 0.002 | 0.003 | 0.019 |
| France | 1200.0 | 0.002 | 0.002 | 0.0 | 0.001 | 0.001 | 0.003 | 0.017 |
| Germany | 1200.0 | 0.002 | 0.003 | 0.0 | 0.001 | 0.001 | 0.003 | 0.019 |
| Italy | 1200.0 | 0.002 | 0.003 | 0.0 | 0.001 | 0.001 | 0.002 | 0.023 |
| UK | 1200.0 | 0.002 | 0.002 | 0.0 | 0.001 | 0.001 | 0.002 | 0.01 |


# Code for getting started

You have two options:
1. Using canned software (e.g. linearmodels, statsmodels, or pyblp)
2. Using custom-written code (numpy, scipy, etc.)

## 1. Using canned software

In [11]:
from linearmodels.iv import IV2SLS

In [12]:
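cars['delta'] = np.log(cars['s'] / cars['s0']) # log of inside share relative to outside share
formula = 'delta ~ 1 + logp + we + li'
model = IV2SLS.from_formula(formula, cars).fit() # no endogenous regressors or instruments in the formula, so this is OLS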

In [13]:
model.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | delta | R-squared: | 0.0984 |
| Estimator: | OLS | Adj. R-squared: | 0.0980 |
| No. Observations: | 6000 | F-statistic: | 646.63 |
| Date: | Wed, Oct 16 2024 | P-value (F-stat) | 0.0000 |
| Time: | 14:03:11 | Distribution: | chi2(3) |
| Cov. Estimator: | robust | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| Intercept | -7.0781 | 0.1105 | -64.082 | 0.0000 | -7.2946 | -6.8616 |
| logp | -0.8165 | 0.0451 | -18.095 | 0.0000 | -0.9050 | -0.7281 |
| we | 0.0011 | 9.431e-05 | 12.121 | 0.0000 | 0.0010 | 0.0013 |
| li | -0.0939 | 0.0097 | -9.6424 | 0.0000 | -0.1130 | -0.0748 |
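
The dependent variable `delta` above comes from the standard logit inversion. As a sketch, assume mean utility $\delta_{jt} = x_{jt}'\beta + \xi_{jt}$ with the outside good normalized to $\delta_{0t} = 0$ (the symbols $\beta$ and $\xi_{jt}$ are notation introduced here for illustration). The logit shares are then

$$
s_{jt} = \frac{e^{\delta_{jt}}}{1 + \sum_{k=1}^{J} e^{\delta_{kt}}}, \qquad
s_{0t} = \frac{1}{1 + \sum_{k=1}^{J} e^{\delta_{kt}}},
$$

and dividing the first by the second and taking logs gives

$$
\log\!\left(\frac{s_{jt}}{s_{0t}}\right) = \delta_{jt} = x_{jt}'\beta + \xi_{jt},
$$

which is linear in $\beta$ and can therefore be estimated by a linear regression. Because the formula above lists no endogenous regressors or instruments, the summary reports `Estimator: OLS`.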


## 2. Using numpy

***WARNING:*** The code below works *only* with a *balanced* dataset, i.e. one with the same number of products, $J$, in every market (every `(ma, ye)` pair).
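
Before relying on the array code, it may be worth checking the balance condition directly. The snippet below is a minimal sketch on a hypothetical toy panel. The helper `is_balanced` is a name made up for this illustration; the column names `ma`, `ye` and `co` mirror the dataset.

```python
import pandas as pd

# Hypothetical toy panel: 2 markets x 2 years, 3 products in each (ma, ye) cell.
toy = pd.DataFrame({
    'ma': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'ye': [70, 70, 70, 71, 71, 71, 70, 70, 70, 71, 71, 71],
    'co': [10, 11, 12, 10, 11, 12, 10, 11, 12, 10, 11, 12],
})

def is_balanced(df, market_cols=('ma', 'ye')):
    """Return True if every market has the same number of rows (products)."""
    sizes = df.groupby(list(market_cols)).size()
    return bool(sizes.nunique() == 1)

balanced = is_balanced(toy)              # all cells have 3 products
unbalanced = is_balanced(toy.iloc[1:])   # dropping one row breaks the balance
print(balanced, unbalanced)
```

On the real data the same check would be `is_balanced(cars)`, assuming `cars` has been loaded as above.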

### Dummy variables

It can be very important to control for some fixed effects. To do this with matrices, you will have to create dummy variables with one column for each possible value (except one for the reference category). 


In [14]:
categorical_var = 'brand' # name of categorical variable
dummies = pd.get_dummies(cars[categorical_var]) # creates one dummy column for each value of categorical_var
x_vars_dummies = list(dummies.columns[1:].values) # omit a reference category, here it is the first (hence columns[1:])

# add dummies to the dataframe 
assert dummies.columns[0] not in cars.columns, 'It looks like you have already added this dummy to the dataframe. Avoid duplicates!'
cars = pd.concat([cars,dummies], axis=1)
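
As an alternative to slicing off the first column by hand, `pd.get_dummies` can omit the reference category itself via `drop_first=True`. The sketch below uses a hypothetical four-row toy frame, not the real data, and requests float dummies so they can go straight into a numeric matrix.

```python
import pandas as pd

# Hypothetical toy data with three distinct brands.
toy = pd.DataFrame({'brand': ['audi', 'VW', 'audi', 'fiat']})

# drop_first=True omits the first category as the reference category;
# dtype=float gives 0.0/1.0 columns instead of booleans.
d = pd.get_dummies(toy['brand'], drop_first=True, dtype=float)

print(d.shape)          # one column fewer than the number of distinct brands
print(list(d.columns))  # the remaining (non-reference) categories
```

Rows that belong to the reference category have zeros in every remaining column, which is what keeps the dummies from being collinear with a constant term.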

### `x_vars`: List of regressors to be used 

In [15]:
# choose your preferred variables 
x_vars = ['logp', 'home', 'cy', 'hp', 'we', 'li'] + x_vars_dummies # <--- !!! CHOOSE HERE 
print(f'K = {len(x_vars)} variables selected.')

K = 38 variables selected.


In [16]:
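K = len(x_vars)
N = cars.ma.nunique() * cars.ye.nunique() # number of markets, i.e. (ma, ye) pairs
J = 40 # products per market (balanced dataset)
# NOTE: the reshapes assume the rows are ordered so that each market's J products are contiguous
x = cars[x_vars].values.reshape((N,J,K)).astype(float)
cars['outcome'] = cars['s'] / cars['s0']
y = np.log(cars['outcome'].values.reshape((N,J)))

# rescale x: subtract each regressor's pooled mean, then divide by a scale factor
# NOTE: x.std(0).std(0) is the std across product slots of the per-slot stds, not the
# pooled std; use x.std(axis=(0, 1)) instead if you want a conventional z-score
x = ((x - x.mean(0).mean(0))/(x.std(0).std(0)))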
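
Reshaping a flat table into an `(N,J,K)` array only gives the intended layout if each market's rows sit next to each other. The sketch below illustrates this on a hypothetical, deliberately shuffled toy panel (the column names `market`, `product`, `x1`, `x2` are made up for the example), and then applies a conventional z-score using the pooled mean and standard deviation over both the market and product axes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: N = 2 markets, J = 3 products, K = 2 regressors,
# stored in shuffled row order on purpose.
toy = pd.DataFrame({
    'market':  [1, 0, 1, 0, 0, 1],
    'product': [2, 0, 0, 1, 2, 1],
    'x1':      [12., 0., 10., 1., 2., 11.],
    'x2':      [5., 3., 4., 3., 3., 6.],
})
N, J, K = 2, 3, 2

# Sort so that each market's J rows are contiguous before reshaping.
toy = toy.sort_values(['market', 'product']).reset_index(drop=True)
x_toy = toy[['x1', 'x2']].to_numpy().reshape(N, J, K)

# Pooled z-score: mean and std over all N*J observations of each regressor.
mu = x_toy.mean(axis=(0, 1))
sd = x_toy.std(axis=(0, 1))
x_std = (x_toy - mu) / sd

print(x_toy[1, :, 0])  # x1 for the three products of the second market
```

After sorting, `x_toy[i, j, :]` holds the regressors of product `j` in market `i`, which is the layout the rest of the array code relies on.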

# OLS Example

Let's compute the OLS estimator just to test that we can do algebra with the arrays. 

***Note:*** This particular choice of $y$ and $x$ variables might not make sense, it is just to help you get started doing algebra on these arrays. 

In [17]:
Y = y.reshape(N*J,) # Make Y 1-dimensional 
X = np.hstack([x.reshape(N*J,K), np.ones((N*J,1))]) # append a constant term 

In [18]:
# compute the OLS estimator 
bet = np.linalg.inv(X.T @ X) @ X.T @ Y

# print
varnames = x_vars + ['const'] # we added the constant as the K+1'th column 
pd.DataFrame({'Estimate':bet}, index=varnames)

| | Estimate |
|---|---|
| logp | -0.030616 |
| home | 0.024164 |
| cy | -0.007702 |
| hp | -0.010126 |
| we | 0.034253 |
| li | -0.014219 |
| MCC | -0.023267 |
| VW | 0.017017 |
| alfa romeo | -0.055345 |
| audi | -0.007106 |
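
The OLS estimator is $\hat\beta = (X'X)^{-1}X'Y$. Forming the inverse explicitly works, but solving the normal equations or calling a least-squares routine is usually numerically safer when regressors are close to collinear. The sketch below compares the three on simulated data; the coefficients in `beta_true` and the arrays `X_toy` and `Y_toy` are hypothetical and unrelated to the cars data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3

# Simulated regressors with a constant appended as the last column.
X_toy = np.hstack([rng.normal(size=(n, k)), np.ones((n, 1))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical coefficients
Y_toy = X_toy @ beta_true + 0.1 * rng.normal(size=n)

# 1. Explicit inverse.
b_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ Y_toy

# 2. Solve the normal equations (X'X) b = X'Y without forming the inverse.
b_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ Y_toy)

# 3. Least-squares routine, which also copes with rank-deficient X.
b_lstsq, *_ = np.linalg.lstsq(X_toy, Y_toy, rcond=None)

print(np.allclose(b_inv, b_solve), np.allclose(b_solve, b_lstsq))
```

All three should agree up to floating-point error on well-conditioned data like this; they can diverge when `X.T @ X` is nearly singular.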


# Towards non-linear estimation

To work with the logit model, you have to be able to compute the utility indices, which typically take the form of an inner product of an $x$-vector and a $\theta$-vector. This is illustrated below. Since `x` is `(N,J,K)` (i.e. `x[i,j,:]` gives the $K$-vector of regressors for car `j` in market-period `i`), we just have to form the matrix product `x @ theta`: NumPy sums over the last (third) axis of `x` and returns an `(N,J)` array of utility indices. At `theta0 = 0` every index is zero, so each of the $J = 40$ cars gets choice probability $1/40 = 0.025$.

In [21]:
theta0 = np.zeros((K,))
v = x @ theta0 # how to multiply a trial value with the matrix of regressors 
np.exp(v) / np.sum(np.exp(v), 1, keepdims=True) # choice probabilities 

array([[0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025],
       [0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025],
       [0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025],
       ...,
       [0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025],
       [0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025],
       [0.025, 0.025, 0.025, ..., 0.025, 0.025, 0.025]])
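
For trial values of `theta` far from zero, `np.exp(v)` can overflow. A common remedy is to subtract each market's largest index before exponentiating, which leaves the probabilities unchanged. The function below is a minimal sketch of that idea. `choice_probs` is a name introduced here, and the random `x_toy` array is hypothetical. Like the cell above, it normalizes over the $J$ listed cars only, with no outside option in the denominator.

```python
import numpy as np

def choice_probs(x, theta):
    """Logit choice probabilities over the J alternatives in each market.

    x: (N, J, K) array of regressors, theta: (K,) parameter vector.
    Returns an (N, J) array whose rows sum to one.
    """
    v = x @ theta                           # (N, J) utility indices
    v = v - v.max(axis=1, keepdims=True)    # max-rescaling for numerical stability
    ev = np.exp(v)
    return ev / ev.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(5, 40, 3))         # hypothetical (N, J, K) = (5, 40, 3)

p0 = choice_probs(x_toy, np.zeros(3))               # theta = 0: uniform probabilities
p_big = choice_probs(1000.0 * x_toy, np.ones(3))    # very large indices, still finite

print(p0[0, :3], np.isfinite(p_big).all())
```

The rescaling works because adding the same constant to every index within a market multiplies numerator and denominator by the same factor.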