# A Critical Analysis of Engel and Rogers (1996) Using Python Visualizations

**Michal Fabinger and Quentin Batista**  
_The University of Tokyo_

Using CPI data for U.S. and Canadian cities, Engel and Rogers (1996) (henceforth ER) argued that price variation is much higher between two cities located in different countries than between two equidistant cities in the same country. While the paper offered some potential explanations for this border effect, such as nominal price stickiness, the question remained a puzzle. The paper had a sizable impact on the economics literature and has been cited over 1,500 times according to Google Scholar. In a follow-up paper, Gorodnichenko and Tesar (2009) (henceforth GT) argued that the border effect identified by ER was in fact driven by the difference in the distribution of prices within the United States and Canada. Below, we complement GT by carefully examining the patterns in the data. Since ER was published, many data science tools have been developed that allow researchers to extract insights from the patterns in their datasets. Employing these tools, we find that the model proposed by ER might have been inadequate.

## Engel and Rogers' (1996) Model

ER formulated the hypothesis that the volatility of the price of similar goods sold in different locations is related to the distance between locations and other explanatory variables including a dummy variable for whether the cities are in different countries. Formally, they use the following regression:

<center>$V\left(P_{j,k}^{i}\right)=\beta_{1}^{i}r_{j,k}+\beta_{2}^{i}B_{j,k}+\sum_{m=1}^{n}\gamma_{m}^{i}D_{m}+u_{j,k}$</center>

- $P_{j,k}^{i}$ is the log of the price of good $i$ in location $j$ relative to the price of good $i$ in location $k$. ER measure it in two-month differences, i.e., as the change in the log relative price between time $t-2$ and $t$.
- $V\left(P_{j,k}^{i}\right)$ is the standard deviation of the relative price series.
- $r_{j,k}$ is the log distance between location $j$ and $k$.
- $B_{j,k}$ is a dummy variable for whether locations $j$ and $k$ are in different countries.
- $D_{m}$ is a dummy variable for city $m$.
- $u_{j,k}$ is the regression error.
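
To make the definition of the dependent variable concrete, here is a minimal sketch of how $V\left(P_{j,k}^{i}\right)$ could be computed for a single good and a single city pair. The two price series are made-up numbers, and the variable names are ours rather than ER's.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly price indexes for one good in two cities (toy data).
dates = pd.date_range("1990-01-01", periods=8, freq="MS")
price_j = pd.Series([100.0, 101.0, 103.0, 102.0, 104.0, 107.0, 106.0, 108.0],
                    index=dates)
price_k = pd.Series([100.0, 100.5, 101.0, 101.5, 102.0, 102.5, 103.0, 103.5],
                    index=dates)

# Log of the price in city j relative to the price in city k.
log_relative_price = np.log(price_j) - np.log(price_k)

# Two-month difference: change between t-2 and t (first two values are NaN).
two_month_change = log_relative_price.diff(2)

# Volatility measure: sample standard deviation over time (NaNs are skipped).
volatility = two_month_change.std()
print(round(volatility, 4))
```

Each city pair and good yields one such number, which is why the panel collapses to a cross-section.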

All prices are converted into U.S. dollars using a monthly average exchange rate. ER also considered a filtered measure of $P_{j,k}^{i}$ which uses seasonal dummies. While the original data is actually panel data, taking the standard deviation of the price series reduces it to cross-sectional data. After running this regression, ER found that the coefficient on the log of distance was positive and significant.
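
The structure of the regression can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the eight unnamed cities, the random distances, and the "true" coefficients. The sketch uses plain least squares from NumPy and does not reproduce ER's standard errors or their actual estimates.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cities: the first four are "US", the last four "Canada".
is_us = np.array([True, True, True, True, False, False, False, False])
n_cities = len(is_us)

pairs = list(itertools.combinations(range(n_cities), 2))
n_pairs = len(pairs)

# Regressors: log distance, border dummy, one dummy per city in the pair.
log_distance = rng.uniform(np.log(100.0), np.log(3000.0), size=n_pairs)
border = np.array([float(is_us[j] != is_us[k]) for j, k in pairs])
city_dummies = np.zeros((n_pairs, n_cities))
for row, (j, k) in enumerate(pairs):
    city_dummies[row, j] = 1.0
    city_dummies[row, k] = 1.0

# Synthetic volatility generated from assumed coefficients plus small noise.
true_beta_distance, true_beta_border = 0.002, 0.01
true_gamma = rng.uniform(0.005, 0.015, size=n_cities)
noise = rng.normal(0.0, 1e-5, size=n_pairs)
volatility = (true_beta_distance * log_distance + true_beta_border * border
              + city_dummies @ true_gamma + noise)

# OLS without a separate intercept: the city dummies sum to a constant.
X = np.column_stack([log_distance, border, city_dummies])
coef, *_ = np.linalg.lstsq(X, volatility, rcond=None)
beta_distance_hat, beta_border_hat = coef[0], coef[1]
print(beta_distance_hat, beta_border_hat)
```

Because every pair activates exactly two city dummies, the dummies already span a constant, which is why no intercept column is added.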

## The Data

The data was obtained directly from Engel's [website](https://www.ssc.wisc.edu/~cengel/Data/Border/BorderData.htm). ER used consumer price data from 23 North American cities for 14 disaggregated consumer price indexes obtained from the Bureau of Labor Statistics. The cities and goods used are described below. The data covered the period between June 1978 and December 1994. 

## Data Preprocessing

We begin our analysis by downloading the dataset and processing it for visualization.

In [1]:
import pandas as pd
import numpy as np

# Import data: read each workbook once, then build the current and
# two-month-lagged series from the same frame
us_data_url = 'http://www.ssc.wisc.edu/~cengel/Data/Border/USA.xls'
us_raw = pd.read_excel(us_data_url, na_values=np.nan)
us_price_data = us_raw.stack(dropna=False).reset_index()
# l2m stands for lagged by two months
us_l2m_price_data = us_raw.shift(2).stack(dropna=False).reset_index()

can_data_url = 'http://www.ssc.wisc.edu/~cengel/Data/Border/CAN.xls'
can_raw = pd.read_excel(can_data_url, na_values=np.nan)
can_price_data = can_raw.stack(dropna=False).reset_index()
can_l2m_price_data = can_raw.shift(2).stack(dropna=False).reset_index()

# Process US Data
# Create common index to merge price and lagged price series
us_price_data['join_index'] = us_price_data['level_0'] + \
 us_price_data['level_1']
us_l2m_price_data['join_index'] = us_l2m_price_data['level_0'] + \
 us_l2m_price_data['level_1']
us_price_data = us_price_data.merge(us_l2m_price_data[['join_index', 0]],
                                    how='left', on='join_index')

# Add country column
us_price_data['country'] = 'US'

# Split date into two columns
us_price_data['year'], us_price_data['month'] = \
 zip(*us_price_data['level_0'].map(lambda x: x.split(':')))

# Split city and good code into two columns
us_price_data['city_code'], us_price_data['good_code'] = \
 zip(*us_price_data['level_1'].map(lambda x: (x[:2], x[2:])))

# Process Canadian Data
# Create common index to merge price and lagged price series
can_price_data['join_index'] = can_price_data['level_0'] + \
 can_price_data['level_1']
can_l2m_price_data['join_index'] = can_l2m_price_data['level_0'] + \
 can_l2m_price_data['level_1']
can_price_data = can_price_data.merge(can_l2m_price_data[['join_index', 0]],
                                      how='left', on='join_index')

# Add country column
can_price_data['country'] = 'Canada'

# Split date into a month and a year column
can_price_data['year'], can_price_data['month'] = \
 zip(*can_price_data['level_0'].map(lambda x: x.split(':')))

# Split city and good code into two columns
# Explanation: Each series is labeled with a letter(s) and a number.
# The letter designates the city.  Two letters are used for U.S. cities
# (e.g., CH for Chicago), and only one letter for Canadian cities.
# The number corresponds to one of the 14 goods, listed in the same order
# we have them in the paper.  "Good 0" is the city's overall CPI, also used
# in the paper.  Thus, LA2 is "Food away from home" for Los Angeles.
# Source: https://www.ssc.wisc.edu/~cengel/Data/Border/BorderData.htm
can_price_data['city_code'], can_price_data['good_code'] = \
 zip(*can_price_data['level_1'].map(lambda x: (x[:1], x[1:])))

# Merging and cleaning up the dataframe
price_data = pd.concat([us_price_data, can_price_data])
price_data = price_data.drop(['level_1', 'join_index'], axis=1)

# Reformat date column
price_data['level_0'] = pd.to_datetime(price_data['level_0'].str.replace(':',
                                                                         '-'))

# Rename columns
price_data.columns = ['date', 'price', 'pricel2m', 'country', 'year', 'month',
                      'city_code', 'good_code']

# Replace negative values by np.nan
price_data.loc[price_data['price'] < 0, 'price'] = np.nan
price_data.loc[price_data['pricel2m'] < 0, 'pricel2m'] = np.nan

# Reorganize columns
price_data = price_data[['date', 'year', 'month', 'country', 'city_code',
                        'good_code', 'price', 'pricel2m']]

# Reset index
price_data = price_data.reset_index(drop=True)
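
The cell above attaches the two-month lag by shifting the wide table and merging on a concatenated key. As a point of comparison, the same idea can be sketched on long-format data with a grouped shift. The toy frame and its `series` column are hypothetical, and the sketch assumes that each series has one row per month with no gaps.

```python
import pandas as pd

# Toy long-format data: two hypothetical series observed over four months.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1990-01-01", "1990-02-01",
                            "1990-03-01", "1990-04-01"] * 2),
    "series": ["CH1"] * 4 + ["T1"] * 4,
    "price": [100.0, 101.0, 102.0, 103.0, 50.0, 51.0, 52.0, 53.0],
})

# Sort so that two rows back within a series means two months earlier.
toy = toy.sort_values(["series", "date"]).reset_index(drop=True)

# Lag each series separately; the first two rows of each series become NaN.
toy["pricel2m"] = toy.groupby("series")["price"].shift(2)
print(toy)
```

Grouping by series keeps the lag of one series from leaking into the next one.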


In [2]:
# Create dictionaries containing good descriptions and city names

goods_descriptions = {"0": "City CPI",
                      "1": "Food at home",
                      "2": "Food away from home",
                      "3": "Alcoholic beverages",
                      "4": "Shelter",
                      "5": "Fuel and other utilities",
                      "6": "Household furnishings & operations",
                      "7": "Men's and boy's apparel",
                      "8": "Women's and girl's apparel",
                      "9": "Footwear",
                      "10": "Private transportation",
                      "11": "Public transportation",
                      "12": "Medical care",
                      "13": "Personal care",
                      "14": "Entertainment"}

city_names = {"CH": "Chicago",
              "LA": "Los Angeles",
              "NY": "New York City",
              "PH": "Philadelphia",
              "DA": "Dallas",
              "DT": "Detroit",
              "HS": "Houston",
              "PI": "Pittsburgh",
              "SF": "San Francisco",
              "BA": "Baltimore",
              "BO": "Boston",
              "MI": "Miami",
              "ST": "St. Louis",
              "WA": "Washington, DC",
              "Q": "Quebec",
              "M": "Montreal",
              "O": "Ottawa",
              "T": "Toronto",
              "W": "Winnipeg",
              "R": "Regina",
              "E": "Edmonton",
              "C": "Calgary",
              "V": "Vancouver"}

# Inverse mappings
inv_goods_descriptions = {v: k for k, v in goods_descriptions.items()}
inv_city_names = {v: k for k, v in city_names.items()}

price_data['good_description'] = price_data['good_code'].map(goods_descriptions)
price_data['city_name'] = price_data['city_code'].map(city_names)

In [3]:
price_data.good_code.unique()

array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14', 'W', 'APPI', 'ANS', 'ANPPI'], dtype=object)

Note that we do not use all the good codes present in the data. The codes without a description (`W`, `APPI`, `ANS`, and `ANPPI`) correspond to wage levels and producer price indices.
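
If one prefers to drop those extra series explicitly rather than rely on the dictionaries, a filter along the following lines would do. This is only a sketch on a made-up frame; the notebook itself does not need it, since the later loops iterate over the described good codes only.

```python
import pandas as pd

# Hypothetical subset of rows, including non-CPI series codes.
toy = pd.DataFrame({"good_code": ["0", "4", "W", "APPI", "12"],
                    "price": [100.0, 98.5, 12.08, 101.2, 215.0]})

# CPI codes run from "0" (city CPI) through "14"; codes are stored as strings.
cpi_codes = [str(code) for code in range(15)]

cpi_only = toy[toy["good_code"].isin(cpi_codes)].reset_index(drop=True)
print(cpi_only)
```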

## Exploratory Data Analysis

In [4]:
price_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89947 entries, 0 to 89946
Data columns (total 10 columns):
date                89947 non-null datetime64[ns]
year                89947 non-null object
month               89947 non-null object
country             89947 non-null object
city_code           89947 non-null object
good_code           89947 non-null object
price               66541 non-null float64
pricel2m            66245 non-null float64
good_description    83625 non-null object
city_name           89714 non-null object
dtypes: datetime64[ns](1), float64(2), object(7)
memory usage: 6.9+ MB


In [5]:
price_data.describe()

Unnamed: 0,price,pricel2m
count,66541.0,66245.0
mean,95.978108,95.752169
std,36.577398,36.445694
min,0.972233,0.972233
25%,78.2,78.1
50%,100.3,100.2
75%,118.4,118.1
max,265.1,262.7


In [6]:
price_data.sample(n=15)

Unnamed: 0,date,year,month,country,city_code,good_code,price,pricel2m,good_description,city_name
75929,1987-05-01,1987,5,Canada,C,W,12.08,12.08,,Calgary
50412,1994-09-01,1994,9,US,CH,12,215.0,214.5,Medical care,Chicago
42456,1991-09-01,1991,9,US,BO,6,110.9,113.3,Household furnishings & operations,Boston
62033,1979-06-01,1979,6,Canada,M,13,63.4,63.8,Personal care,Montreal
35731,1989-03-01,1989,3,US,ST,1,123.2,120.3,Food at home,St. Louis
50713,1994-10-01,1994,10,US,DT,13,122.3,125.6,Personal care,Detroit
2909,1977-01-01,1977,1,US,WA,14,,70.0,Entertainment,"Washington, DC"
22457,1984-04-01,1984,4,US,ST,2,,,Food away from home,St. Louis
36069,1989-05-01,1989,5,US,DA,9,,,Footwear,Dallas
20173,1983-06-01,1983,6,US,BA,13,,,Personal care,Baltimore


In [7]:
# Start date
price_data.date.head(1)

0   1976-01-01
Name: date, dtype: datetime64[ns]

In [8]:
# End date
price_data.date.tail(1)

89946   1995-05-01
Name: date, dtype: datetime64[ns]
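
The downloaded files run from January 1976 to May 1995, which is wider than the June 1978 to December 1994 window reported for ER's sample. A minimal sketch of restricting a frame to that window is shown below on a toy date column. Whether to apply such a filter to the plots is a judgment call that we leave open.

```python
import pandas as pd

# Toy frame covering the same date range as the downloaded data.
toy = pd.DataFrame(
    {"date": pd.date_range("1976-01-01", "1995-05-01", freq="MS")})

# ER's stated sample window (both endpoints inclusive).
sample_start = pd.Timestamp("1978-06-01")
sample_end = pd.Timestamp("1994-12-01")

sample = toy[toy["date"].between(sample_start, sample_end)]
print(sample["date"].min(), sample["date"].max(), len(sample))
```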

### Raw Data Visualization

Note: Visualizing the following graphs may require increasing `iopub_data_rate_limit` from its default value. This can be achieved by launching Jupyter notebook using `jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000`, for example.

In [9]:
from bokeh.plotting import figure, show, output_notebook, gridplot
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import all_palettes

output_notebook()

TOOLS = "crosshair,pan,wheel_zoom,reset,tap,save"

colors = all_palettes['Category20'][len(goods_descriptions)]

grid = []
grid_width = 3
plot_list = []

for city_code in city_names:
    hover = HoverTool(tooltips=[
        ("index", "$index"),
        ("good type", "@good"),
        ("(x,y)", "($x, $y)"),
    ])

    p = figure(x_axis_type="datetime", tools=[TOOLS, hover], plot_width=300,
               plot_height=300)
    p.title.text = city_names[city_code]
    p.title.align = 'center'

    for good_code in goods_descriptions:
        condition = (price_data['city_code'] == city_code) & \
         (price_data['good_code'] == good_code)
        source = ColumnDataSource(data=dict(
            x=price_data['date'][condition],
            y=price_data['price'][condition],
            good=price_data['good_description'][condition]))

        p.scatter(x='x', y='y', color=colors[int(good_code)], source=source)

    # Append the plot to a list to create the grid
    if len(plot_list) < grid_width:
        plot_list.append(p)
    else:
        grid.append(plot_list)
        plot_list = []
        plot_list.append(p)

# Append remaining plots
if plot_list:
    grid.append(plot_list)

p = gridplot(grid)

show(p)

Before further manipulating the data, we normalize each series by its mean value over 1980-1981.

In [10]:
def mean_init_price_index(price_type, city_code, good_code):
    index = price_data[(price_data['city_code'] == city_code) &
                       (price_data['good_code'] == good_code) &
                       price_data['year'].isin(['1980', '1981'])][
                           price_type].mean()
    return index


def data_normalization(df, col_to_normalize, city_names, goods_descriptions):
    for city_code in city_names:
        for good_code in goods_descriptions:
            condition = (df['city_code'] == city_code) & \
             (df['good_code'] == good_code)
            df.loc[condition, col_to_normalize + 'n'] = \
                df[col_to_normalize][condition] / \
                mean_init_price_index(col_to_normalize, city_code, good_code)

data_normalization(price_data, 'price', city_names, goods_descriptions)
data_normalization(price_data, 'pricel2m', city_names, goods_descriptions)
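
For readers who want to check what the normalization does, the following sketch applies the same base-period idea to a small made-up frame, using a grouped mean and a merge instead of nested loops. The numbers are hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Toy data: one good ("4") in two cities, one observation per year.
toy = pd.DataFrame({
    "city_code": ["CH"] * 4 + ["T"] * 4,
    "good_code": ["4"] * 8,
    "year": ["1980", "1981", "1982", "1983"] * 2,
    "price": [80.0, 120.0, 130.0, 150.0, 45.0, 55.0, 60.0, 75.0],
})

# Mean over the 1980-1981 base period for each (city, good) series.
base = (toy[toy["year"].isin(["1980", "1981"])]
        .groupby(["city_code", "good_code"])["price"].mean()
        .rename("base_mean")
        .reset_index())

# Divide each observation by the base-period mean of its own series.
toy = toy.merge(base, on=["city_code", "good_code"], how="left")
toy["pricen"] = toy["price"] / toy["base_mean"]
print(toy[["city_code", "year", "price", "pricen"]])
```

After this step every series averages to one over 1980-1981, so series with different index bases become directly comparable on one axis.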

Additionally, we interpolate missing values so that the series can be plotted as continuous lines.

In [11]:
def data_interpolation(df, city_names, goods_descriptions):
    for city_code in city_names:
        for good_code in goods_descriptions:
            condition = (df['city_code'] == city_code) & \
                 (df['good_code'] == good_code)
            df.loc[condition, 'pricen'] = \
                df.loc[condition,
                       ['date', 'pricen']
                       ].set_index('date').interpolate(method='cubic').values

data_interpolation(price_data, city_names, goods_descriptions)
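
Because the cell above overwrites `pricen` in place, interpolated points can no longer be told apart from observed ones. The sketch below shows one way to keep track of them on a toy series. Note that it swaps in linear interpolation for the cubic method used above, purely to avoid a SciPy dependency, and that the values are made up.

```python
import numpy as np
import pandas as pd

# Toy normalized series with three missing months.
dates = pd.date_range("1990-01-01", periods=6, freq="MS")
series = pd.Series([1.00, np.nan, 1.04, np.nan, np.nan, 1.10], index=dates)

# Linear interpolation between the nearest observed neighbours
# (a stand-in for the cubic method used in the notebook).
filled = series.interpolate(method="linear")

# Mask of the points that were filled in rather than observed.
was_interpolated = series.isna() & filled.notna()
print(filled)
print(int(was_interpolated.sum()))
```

Keeping such a mask makes it easy to exclude filled values from any volatility calculation, since interpolation smooths the series.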

## Visualization

Our analysis is based on the three plots below. Feel free to interactively explore the data using these plots before we delve into specific aspects.

Note: Visualizing these graphs requires downloading and running the notebook at this point. However, this is not necessary for following our analysis.

In [12]:
from ipywidgets import interact
import flexx
from bokeh.models import Legend


def city_plot_update(city):
    p_cities = figure(x_axis_type="datetime", tools=TOOLS, plot_width=800,
                      plot_height=600, toolbar_location="above",
                      title='Evolution of Prices by Cities')

    colors = all_palettes['Category20'][len(goods_descriptions)]

    lines = []
    legend_it = []

    for good_code in goods_descriptions:
        condition = (price_data['city_code'] == inv_city_names[city]) & \
         (price_data['good_code'] == good_code)
        temp_line = p_cities.line(x=price_data['date'][condition],
                                  y=price_data['pricen'][condition],
                                  color=colors[int(good_code)])
        lines.append(temp_line)
        legend_it.append((goods_descriptions[good_code], [temp_line]))

    legend = Legend(items=legend_it, location=(0, 100))
    legend.click_policy = "hide"

    p_cities.add_layout(legend, 'right')
    p_cities.title.text_font_size = '12pt'
    p_cities.yaxis.axis_label = 'Normalized Price Index'
    p_cities.xaxis.axis_label = 'Year'

    show(p_cities)

interact(city_plot_update, city=city_names.values())

<function __main__.city_plot_update>

In [13]:
from bokeh.palettes import magma


def good_plot_update(good):
    p_goods = figure(x_axis_type="datetime", tools=TOOLS, plot_width=800,
                     plot_height=600, toolbar_location="above",
                     title='Evolution of Prices by Goods')

    lines = []
    legend_it = []
    colors = magma(len(city_names))

    for (i, city) in enumerate(city_names):
        condition = (price_data['city_code'] == city) & \
         (price_data['good_code'] == inv_goods_descriptions[good])
        temp_line = p_goods.line(x=price_data['date'][condition],
                                 y=price_data['pricen'][condition],
                                 color=colors[i])
        lines.append(temp_line)
        legend_it.append((city_names[city], [temp_line]))

    legend = Legend(items=legend_it, location=(0, 25))
    legend.click_policy = "hide"

    p_goods.add_layout(legend, 'right')

    show(p_goods)

interact(good_plot_update, good=goods_descriptions.values())

<function __main__.good_plot_update>

In [14]:
def countries_plot_update(good):
    p_countries = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
                         plot_height=600, toolbar_location="above",
                         title='Evolution of Prices by Countries')

    lines = []

    for (i, city) in enumerate(city_names):
        condition = (price_data['city_code'] == city) & \
         (price_data['good_code'] == inv_goods_descriptions[good])
        if len(city) == 2:
            temp_line = p_countries.line(x=price_data['date'][condition],
                                         y=price_data['pricen'][condition],
                                         color='blue',
                                         legend='US Cities')
        else:
            temp_line = p_countries.line(x=price_data['date'][condition],
                                         y=price_data['pricen'][condition],
                                         color='red',
                                         legend='Canadian Cities')
        lines.append(temp_line)

    p_countries.legend.location = 'bottom_right'
    p_countries.title.text_font_size = '12pt'
    p_countries.yaxis.axis_label = 'Normalized Price Index'
    p_countries.xaxis.axis_label = 'Year'

    show(p_countries)

interact(countries_plot_update, good=goods_descriptions.values())

<function __main__.countries_plot_update>

## Discussion by Good Type

Below, we discuss the various factors that suggest that the model employed by ER may have been inadequate. More precisely, we suspect that the model suffers from omitted-variable bias, endogeneity of some of the regressors and low data quality.

### Shelter

Between 1985 and 1989, the average price of a house in the Greater Toronto Area increased by 113%. At first fueled by low unemployment and a large inflow of immigrants, the price increase subsequently attracted massive speculative investment, thereby creating a housing bubble. While the bubble was mostly concentrated in the Toronto area, it also affected other Canadian cities. In fact, we can observe a sharp increase in shelter prices for Canadian cities. Additionally, the oil glut of the early 1980s led to a deep recession in the Canadian regions whose economies rely heavily on the production and sale of oil, which explains the sharp decline observed in the data for Calgary. During the same period, mortgage rates in the U.S. soared to a [record-high 17-18%](https://seekingalpha.com/article/117783-u-s-housing-market-1982-vs-2009), causing existing-home sales to fall by 50%. Both of these factors contributed to increasing the disparity in prices between U.S. and Canadian cities. Finally, since housing is not subject to international arbitrage, it is natural that the behavior of prices would differ between the two countries.

In [15]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=800,
           plot_height=600, toolbar_location="above",
           title='Evolution of Shelter Prices')

lines = []
legend_it = []
colors = magma(len(city_names))

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] == inv_goods_descriptions['Shelter'])
    temp_line = p.line(x=price_data['date'][condition],
                       y=price_data['pricen'][condition],
                       color=colors[i])
    lines.append(temp_line)
    legend_it.append((city_names[city], [temp_line]))

legend = Legend(items=legend_it, location=(0, 25))
legend.click_policy = "hide"

p.add_layout(legend, 'right')

r = p.circle([pd.Timestamp('1982-06-01'),
              pd.Timestamp('1990-06-01')], [1.125, 1.9])

glyph = r.glyph
glyph.size = 110
glyph.fill_alpha = 0.2
glyph.line_color = "firebrick"
glyph.line_dash = [6, 3]
glyph.line_width = 2

show(p)



Note that by clicking on the city names in the legend, you can hide them in order to clearly visualize different impacts.

In the circle on the lower left, we can see the impact of the oil glut and of the mortgage rate peak. After a quick run-up lasting until 1982, prices in Calgary and Edmonton, both located in Alberta, a province whose economy is deeply reliant on oil, started declining before stagnating for a few years. Note that the price of oil started decreasing in 1980, but shelter prices started to decline only around 1982-1983. This is probably because the impact of lower oil prices took time to propagate through the economy. Right before this, prices in cities such as Boston, Detroit, and Pittsburgh declined sharply, recovering as prices in Canada were declining.

The impact of the Toronto housing bubble is visible in the circle on the upper right. After increasing swiftly until 1991, prices slightly decreased before stagnating until the end of the period.

### Private transportation

In 1981, with the American auto industry mired in recession, Japanese car makers agreed to limit exports of passenger cars to the United States. This "voluntary export restraint" (VER) program allowed only 1.68 million Japanese cars into the U.S. each year. The cap was raised to 1.85 million cars in 1984, and to 2.30 million in 1985 (representing about 20% of the car market at that time), before the program was terminated in 1994. Additionally, the effect of this program interacted with the impact of significant movements in exchange rates. Between 1985 and 1990, the yen appreciated against the U.S. dollar by close to 50%. Meanwhile, the Canadian dollar depreciated significantly against the U.S. dollar between 1980 and 1986, then appreciated until 1991, before depreciating again until 1995. These changes potentially created supply shocks in the car markets of both countries, leading to price disparities that the model does not explicitly take into account.

In [16]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title='Evolution of Private Transportation Prices')

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] ==
      inv_goods_descriptions['Private transportation'])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

r = p.circle([pd.Timestamp('1986-03-01')], [1.15])

glyph = r.glyph
glyph.size = 110
glyph.fill_alpha = 0.2
glyph.line_color = "firebrick"
glyph.line_dash = [6, 3]
glyph.line_width = 2

show(p)

### Public transportation

While changes in the prices of buses and trains are mostly due to market forces, public transportation is generally a highly regulated industry in which prices are set by the government. As such, events such as a budget crisis can push the government to increase prices. Moreover, the cost of public transportation is also related to the distance covered and to weather conditions, because both affect the cost of building and maintaining a public transportation network. None of these factors, however, is directly included in the regression model.

### Alcoholic beverages

Alcoholic beverages are generally subject to significant taxes, which makes their prices easily susceptible to country-specific fluctuations. In fact, the impact of the 1991 U.S. Federal Alcohol Tax increase can be observed in the data. As such, they are not well-suited for this model.

In [17]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title='Evolution of Alcoholic Beverages Prices')

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] == inv_goods_descriptions['Alcoholic beverages'])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

r = p.circle([pd.Timestamp('1991-03-01')], [1.55])

glyph = r.glyph
glyph.size = 140
glyph.fill_alpha = 0.2
glyph.line_color = "firebrick"
glyph.line_dash = [6, 3]
glyph.line_width = 2

show(p)

### Fuel and other utilities

Many utilities are highly regulated, suggesting a similar issue to the public transportation case. Additionally, inspecting the fluctuations in many U.S. cities reveals some unexplained patterns which appear at the beginning of the 1980s, potentially indicating some data quality issues.

In [18]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=800,
           plot_height=600, toolbar_location="above",
           title='Evolution of Fuel & Utilities Prices')

lines = []
legend_it = []
fuel_cities = ['Chicago', 'Los Angeles', 'Dallas', 'San Francisco',
               'Baltimore', 'Boston', 'St. Louis']
colors = all_palettes['Category20'][len(fuel_cities)]

for (i, city) in enumerate(fuel_cities):
    condition = (price_data['city_code'] == inv_city_names[city]) & \
     (price_data['good_code'] ==
      inv_goods_descriptions['Fuel and other utilities'])
    temp_line = p.line(x=price_data['date'][condition],
                       y=price_data['pricen'][condition],
                       color=colors[i])
    lines.append(temp_line)
    legend_it.append((city, [temp_line]))

legend = Legend(items=legend_it, location=(0, 200))
legend.click_policy = "hide"

p.add_layout(legend, 'right')

show(p)

### Food away from home

This type of good has a highly non-tradable component. For example, the dining experience of a restaurant in New York City cannot be enjoyed in Toronto. Thus, this type of good is not well-suited for this model.

### Personal care

Similarly to food away from home, this type of good also has a highly non-tradable component. In most cases, personal care services are location specific, and therefore not subject to international arbitrage.

### Medical care

Medical care tends to be a regulated, non-tradable industry. Prices largely depend on government policies and are therefore susceptible to country-specific shocks. Additionally, prices are set on an annual basis in Canada, which explains why we observe jumps, whereas in the U.S. prices are set on a monthly basis, which is why we observe smooth price increases in the data. As such, factors such as how the data is aggregated among providers partially drive the discrepancies in prices.
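
The effect of price-setting frequency on the volatility measure can be illustrated with a stylized simulation. All numbers below are assumptions: both hypothetical cities share the same 6% yearly growth, but one adjusts its price every month and the other once a year. Cities within a country are assumed to adjust at the same time.

```python
import numpy as np

months = np.arange(48)   # four hypothetical years of monthly data
annual_growth = 0.06     # assumed yearly growth of the log price

# Smooth series: the log price rises a little every month.
log_smooth = np.log(100.0) + annual_growth * months / 12.0
# Stepped series: the same yearly growth, applied once every twelve months.
log_stepped = np.log(100.0) + annual_growth * (months // 12)


def two_month_volatility(log_p_j, log_p_k):
    """Std. dev. of two-month changes in the log relative price."""
    log_relative = log_p_j - log_p_k
    return np.std(log_relative[2:] - log_relative[:-2], ddof=1)


vol_within_us = two_month_volatility(log_smooth, log_smooth)
vol_within_canada = two_month_volatility(log_stepped, log_stepped)
vol_cross_border = two_month_volatility(log_stepped, log_smooth)
print(vol_within_us, vol_within_canada, vol_cross_border)
```

In this toy setting the two within-country pairs show no volatility at all, while the cross-border pair does, even though the long-run price paths are identical. A border dummy would pick up this difference in timing.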

In [19]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title="Evolution of Medical Care Prices")

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] == inv_goods_descriptions['Medical care'])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

show(p)

The city for which prices jumped around 1987 is Regina; it is unclear why there is such a jump around this period in the data.

### Apparel & footwear

U.S. prices are highly volatile starting from the late 1980s, suggesting that they may not have been properly recorded. City-specific measurement error, which naturally arises if data quality differs across cities, would imply that the city dummy variables are correlated with the error term, and therefore that the model suffers from endogeneity issues. Under measurement error, ER's model (introduced above) becomes:

<center>$V\left(P_{j,k}^{i*}\right)=\beta_{1}^{i}r_{j,k}+\beta_{2}^{i}B_{j,k}+\sum_{m=1}^{n}\gamma_{m}^{i}D_{m}+u_{j,k}$</center>

with $P_{j,k}^{i*}=P_{j,k}^{i}+\varepsilon_{j,k}^{i}$, where $\varepsilon_{j,k}^{i}$ is the measurement error. Since $V\left(P_{j,k}^{i*}\right)^{2}=V\left(P_{j,k}^{i}\right)^{2}+\mathrm{var}\left(\varepsilon_{j,k}^{i}\right)+2\mathrm{cov}\left(P_{j,k}^{i},\varepsilon_{j,k}^{i}\right)$, the model can be expressed as:

<center>$V\left(P_{j,k}^{i}\right)=\left[\left(\beta_{1}^{i}r_{j,k}+\beta_{2}^{i}B_{j,k}+\sum_{m=1}^{n}\gamma_{m}^{i}D_{m}+u_{j,k}\right)^{2}-\mathrm{var}\left(\varepsilon_{j,k}^{i}\right)-2\mathrm{cov}\left(P_{j,k}^{i},\varepsilon_{j,k}^{i}\right)\right]^{\frac{1}{2}}$</center>

Since the measurement-error terms vary by city, they are correlated with the city dummy variables, which shows that the model suffers from endogeneity.

Additionally, according to the data, the price of women's and girls' apparel was on average close to 40% higher in New York City than in Philadelphia in the mid-1990s. Given that the two cities are less than a two-hour drive from each other, such a large price difference appears highly implausible. Once again, this observation points to data quality issues. A potential explanation would be that the average quality of goods actually differs across cities, in which case this should be accounted for in the regression.
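
The variance decomposition behind the measurement-error argument above can be checked numerically. The sketch below uses simulated series with assumed standard deviations. It is not based on the actual apparel data and only illustrates how recording noise inflates the measured volatility.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Assumed true two-month changes and assumed recording noise (toy values).
true_change = rng.normal(0.0, 0.02, size=n)
measurement_error = rng.normal(0.0, 0.03, size=n)
observed_change = true_change + measurement_error

var_true = np.var(true_change)
var_error = np.var(measurement_error)
cov_true_error = np.cov(true_change, measurement_error, ddof=0)[0, 1]
var_observed = np.var(observed_change)

# Identity: var(P*) = var(P) + var(eps) + 2 cov(P, eps)
print(var_observed, var_true + var_error + 2 * cov_true_error)

# Backing out the true volatility from the observed one.
recovered_sd = np.sqrt(var_observed - var_error - 2 * cov_true_error)
print(recovered_sd, np.sqrt(var_true))
```

In practice the variance of the measurement error is not observed, so this correction cannot be applied directly. The point is only that the observed volatility mixes true price variation with city-specific noise.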

In [20]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title="Evolution of Men's Apparel Prices")

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] ==
      inv_goods_descriptions["Men's and boy's apparel"])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

show(p)

In [21]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title="Evolution of Women's Apparel Prices")

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] ==
      inv_goods_descriptions["Women's and girl's apparel"])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

show(p)

In [22]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title="Evolution of Footwear Prices")

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] == inv_goods_descriptions['Footwear'])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

show(p)

In [45]:
p = figure(x_axis_type="datetime", plot_width=600,
           plot_height=600, toolbar_location="right",
           title="Comparison of Women's Apparel Prices between New York and Philadelphia")

ny_ph = ['NY', 'PH']

for i, city in enumerate(ny_ph):
    # Select women's apparel so that the data match the plot title
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] ==
      inv_goods_descriptions["Women's and girl's apparel"])

    p.line(x=price_data['date'][condition],
           y=price_data['pricen'][condition],
           color=all_palettes['Category10'][10][i],
           legend=city_names[city])

p.legend.location = 'bottom_right'

show(p)
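
The gap between two cities can also be summarized as a single number. The following is a minimal sketch, not the computation behind the 40% figure quoted earlier: the helper `mean_price_gap` is hypothetical, the toy DataFrame contains made-up values, and only the column names (`date`, `city_code`, `good_code`, `pricen`) follow `price_data`. Whether a ratio of index values can be read as a difference in price levels depends on how `pricen` is normalized.

```python
import pandas as pd


def mean_price_gap(price_data, good_code, city_a, city_b):
    """Average percentage gap of city_a's index over city_b's for one good."""
    subset = price_data[(price_data['good_code'] == good_code)
                        & price_data['city_code'].isin([city_a, city_b])]
    # One column per city, one row per date; keep dates observed in both cities.
    wide = subset.pivot(index='date', columns='city_code', values='pricen')
    wide = wide.dropna()
    return float((wide[city_a] / wide[city_b] - 1.0).mean())


# Toy data with hypothetical values, for illustration only.
toy = pd.DataFrame({
    'date': list(pd.to_datetime(['1995-01-01', '1995-02-01'])) * 2,
    'city_code': ['NY', 'NY', 'PH', 'PH'],
    'good_code': [1, 1, 1, 1],  # hypothetical good code
    'pricen': [1.4, 1.4, 1.0, 1.0],
})

gap = mean_price_gap(toy, 1, 'NY', 'PH')
print(f"Average gap: {gap:.1%}")
```

With the notebook's data, the call would presumably take `price_data` and the good code looked up in `inv_goods_descriptions` in place of the toy inputs.
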

### Tax Increase in Canada

In January 1991, the Canadian government introduced the Goods and Services Tax (GST), a 7% value-added tax. The tax had a visible impact on the prices of goods and services and can be observed in many time series, such as those for apparel, footwear, and food away from home. Since a tax difference cannot be arbitraged away, it naturally contributed to the price dispersion between the U.S. and Canada.

In [23]:
p = figure(x_axis_type="datetime", tools=TOOLS, plot_width=600,
           plot_height=600, toolbar_location="above",
           title='Evolution of City CPI')

lines = []

for (i, city) in enumerate(city_names):
    condition = (price_data['city_code'] == city) & \
     (price_data['good_code'] == inv_goods_descriptions['City CPI'])
    if len(city) == 2:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='blue',
                           legend='US Cities')
    else:
        temp_line = p.line(x=price_data['date'][condition],
                           y=price_data['pricen'][condition],
                           color='red',
                           legend='Canadian Cities')
    lines.append(temp_line)

p.legend.location = 'bottom_right'
p.title.text_font_size = '12pt'
p.yaxis.axis_label = 'Normalized Price Index'
p.xaxis.axis_label = 'Year'

# Mark January 1991 with a large dashed circle to highlight the tax change
r = p.circle([pd.Timestamp('1991-01-01')], [1.7])

glyph = r.glyph
glyph.size = 110
glyph.fill_alpha = 0.2
glyph.line_color = "firebrick"
glyph.line_dash = [6, 3]
glyph.line_width = 2

show(p)
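
To get a rough sense of how a one-time tax jump feeds into ER's volatility measure (the standard deviation of the change in the log relative price between $t$ and $t-2$), the sketch below compares a simulated relative price series with and without a 7% level shift. It is illustrative only: the series is a simulated random walk, the tax is assumed to pass through fully to prices, and all parameter values and names are hypothetical.

```python
import numpy as np

# Simulated data only: every number below is a hypothetical choice.
rng = np.random.default_rng(1)
n_months = 120
tax_month = 60  # hypothetical position of the tax introduction in the sample

# Log relative price between a Canadian and a U.S. city, as a random walk.
log_rel = np.cumsum(rng.normal(0.0, 0.005, n_months))

# Same series with a permanent 7% level shift from the tax month onward
# (full pass-through is a simplifying assumption).
log_rel_taxed = log_rel.copy()
log_rel_taxed[tax_month:] += np.log(1.07)


def two_period_volatility(series):
    """Standard deviation of the difference between t and t-2."""
    diffs = series[2:] - series[:-2]
    return np.std(diffs)


vol_without = two_period_volatility(log_rel)
vol_with = two_period_volatility(log_rel_taxed)

print(f"volatility without tax jump: {vol_without:.4f}")
print(f"volatility with tax jump:    {vol_with:.4f}")
```

The level shift enters only the two differences that span the tax month, yet in this simulation that is enough to raise the measured volatility of the cross-border pair.
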

# References

Engel, Charles, and John H. Rogers. 1996. "How Wide Is the Border?" American Economic Review, 86(5): 1112-25.

Gorodnichenko, Yuriy, and Linda L. Tesar. 2009. "Border Effect or Country Effect? Seattle May Not Be So Far from Vancouver After All." American Economic Journal: Macroeconomics, 1(1): 219-41.

Benjamin, Daniel K. 1999. "Voluntary Export Restraints on Automobiles." September.