In [7]:
import os, sys

#figure display
from IPython.display import HTML, IFrame, display

#files for data munging
file_dir = os.getcwd()
sys.path.append(os.path.join(file_dir, "scripts"))

#inline plotting
from bokeh.resources import INLINE

#set size of figures
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
          'width': 1500,
          'height': 1000,
})


{'height': 1000, 'width': 1500}

# Big Data Team - Research projects

 - <h3> Traffic flow as an early indicator for GDP growth (and other economic measures) </h3>
   <h4> Edward Rowland </h4>
   
 - <h3> Extracting data from online CVs </h3>
   <h4> Hazel Martindale </h4>

In [21]:
# source: https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [1]:
#source: https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython
from IPython.display import HTML

HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>

<form action="javascript:code_toggle_err()"><input type="submit" value="Click here to toggle on/off warnings."></form>''')

# Talk aims

- The work has been largely driven internally
- Promote the work
- Give others the opportunity to collaborate and drive the aims

### If you think the work is relevant to your area, please get in touch!

# Traffic flow as an early indicator for GDP growth (and other economic & labour market measures)


# Talk Structure


1. Background
2. Other work
3. Aims
3. Data 
4. Overall Trends
5. Finding the lag
6. Correlations
7. Time series modelling 
8. Summary
9. Future work - High Frequency traffic counts



<h1>
1. Background
</h1>

### Traffic flow is thought to be elastic to a number of economic factors

- GDP: More stuff being made that needs to be moved about
- Salary: More money means people might travel more for work, business and leisure
- Employment: More people in work may mean more commuting
- Inflation: If the cost of living goes up, that might mean less traffic as people cut back on expenses


### How is this useful?

Traffic could be used as a proxy for these figures. This could provide the following benefits:

1. More localised estimates
2. Potential early indicator
3. Identify impacts of specific events

### Where could this be used? - Some examples

**1. More localised estimates**

Identify areas of low or negative growth that would be lost if looking at the national figures

**2. Potential early indicator**

A recession is defined as two or more consecutive quarters of negative GDP growth.
The UK won't officially know if it is in recession until around 6-7 months after it started.
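
As a minimal sketch of that definition, the function below flags quarters that sit inside a run of two or more consecutive quarters of negative growth. The function name and the growth figures are made up for illustration; they are not real GDP data.

```python
def recession_quarters(growth):
    """Return the indices of quarters that belong to a run of two or more
    consecutive quarters of negative growth."""
    flagged = []
    run = []  # indices of the current unbroken run of negative quarters
    for i, g in enumerate(growth):
        if g < 0:
            run.append(i)
        else:
            if len(run) >= 2:
                flagged.extend(run)
            run = []
    if len(run) >= 2:  # a run that is still open at the end of the series
        flagged.extend(run)
    return flagged


# Hypothetical quarterly growth figures (%), for illustration only
quarterly_growth = [0.5, 0.3, -0.2, 0.4, -0.6, -1.1, -0.3, 0.2]
print(recession_quarters(quarterly_growth))
```

A single negative quarter (index 2 here) is not flagged, which is why a recession can only be confirmed once the second quarter's figure has been published.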

**3. Identify impacts of specific events**

With high frequency data, it could identify the impacts of events (Weather, terrorist attacks, Brexit etc.)

**The aim is to find relationships between traffic and economic indicators to see if this could work**

# 2. Other work

Have others looked at this...?

...Yes!

### Stats Netherlands

Research using data from an extensive road sensor network in The Netherlands shows correlations between traffic flow and a number of economic measures, with a lag of 3 months


| Measure       | Total Traffic | Cat 1 (< 5.6m) Traffic | Cat 2 (5.6m to 12.2m) Traffic | Cat 3 (> 12.2m) Traffic |
| ------------- |:-------------:|:-----:|:-----:|:-----:|
| Inflation     | -0.42 | -0.43 | -0.19 | -0.43 |
| Unemployment  | -0.47 | -0.41 | -0.55 | -0.22 |
| Income        | 0.74  | 0.74  | 0.45  | 0.65  |
| GDP           | 0.54  | 0.63  | -0.01 | 0.70  |


- Income followed by GDP look to be the strongest correlates
- Larger vehicles (HGVs) correlate most strongly with GDP
- Perhaps some weaker correlations with inflation and Unemployment

We can look and see if this matches with the data used here later

Killan, Ros, "Road Traffic Correlations with Economic Variables: The Big Data Perspective", 2017, https://pdfs.semanticscholar.org/1f1d/b563d229bdd4fd8c90ad8dd6c5cd3487f76b.pdf

### Some issues

| Variable      | Normally Distributed? | Method| 
| ------------- |:---------------------:| -----:|
| Daily traffic intensity   | No        | Spearman | 
| Monthly traffic intensity | Yes       | Pearson |
| Quarterly traffic intensity | Yes     | Pearson  |
| GDP           | Yes                   | Pearson | 
| Income        | No                    | Spearman |
| Unemployment  | Yes                   | Pearson |
| Inflation     | No                    | Spearman |


Not convinced by the use of Spearman's:
- The Shapiro–Wilk test for normality is overly sensitive with big data sets
- Income data is likely to be skewed, but it is hard to tell as histograms aren't given
- No indication they tried transforming the data
- Relationships look to be linear
- Spearman's tends to exaggerate coefficients and p-values compared to Pearson's where the latter is valid, so it can be used for p-hacking (I don't think this was deliberate here)


Would this change the conclusions? - Probably not hugely, so the work is still useful
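
One way to see how much the choice of method matters is to compute both coefficients on the same pair of series. The sketch below uses only the standard library; the function names and the toy data are illustrative assumptions, not the Stats Netherlands code or data. Spearman's coefficient is computed here as Pearson's coefficient on the ranks of the values.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: a monotonic but non-linear relationship
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

On this toy example Spearman's coefficient is 1 because the relationship is perfectly monotonic, while Pearson's is below 1 because it is not linear. Running both on the real series would show how far apart the two methods are in practice.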

### Slovenia

- Used traffic flow to estimate GDP figure 45 days before its publication.
- PCA with linear regression to estimate GDP
- Found that Cargo vehicles were the best predictor of GDP, within 1% of the GDP figure

Črt Grahonja, 2018, Use of alternative data sources as flash estimates of economic indicators,
European conference on quality in official statistics June 2018: 
https://www.q2018.pl/papers-presentations/?drawer=Sessions*Session%2022


### Finland

- Uses company-level data and traffic loops to produce two nowcasts and a 16-day backcast of GDP
- Reduces the factors using a two-step PCA and a shrinkage step
- The 16-day backcast was as accurate as the first estimate of GDP when comparing both to the revised figure
- Traffic flow and company data performed similarly when estimating GDP

Henri Luomaranta, 2018, Nowcasting Finnish Real Economic Activity: a Machine Learning Approach,
European conference on Quality in official statistics June 2018: 
https://www.q2018.pl/papers-presentations/?drawer=Sessions*Session%2022

# 3. Aims 

- Examine the feasibility of applying this work to UK economic and labour market measures
- Look at correlations between traffic flow and these measures, mostly replicating the Stats Netherlands approach
- Apply some simple time series models to see if traffic flow improves predictions 

# 4. Data - Traffic flow

#### Department for Transport data
- Annual average daily flow (AADF) for major and minor roads is used as a measure of traffic flow from 2003 to 2015
- Split into different vehicle categories
- Daily flow is the number of vehicles passing a point on a road on a day. This is averaged across the year to produce the average daily flow
- This measure is based upon approximately 10,000 manual counts per year, taken between March and October on days that are not school or public holidays
- These counts are used to estimate AADF figures for major roads
- A representative sample of minor road sites is selected as observation points
- These figures are combined with the change on the previous year to estimate counts for all minor roads
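
A minimal sketch of the two ideas above, using hypothetical numbers. The first function is just the definition of average daily flow at one count point. The second is an assumed reading of how sample sites could be combined with the previous year's figure (scaling by the year-on-year change at the sampled sites); it is not the Department for Transport's actual estimation procedure.

```python
from statistics import mean


def average_daily_flow(daily_counts):
    """Mean number of vehicles per day passing a single count point."""
    return mean(daily_counts)


def scale_by_sample_growth(previous_estimate, sample_last_year, sample_this_year):
    """Assumed approach: apply the year-on-year change observed at the
    sampled sites to last year's estimate for all minor roads."""
    growth = sum(sample_this_year) / sum(sample_last_year)
    return previous_estimate * growth


# Hypothetical daily counts at one site
daily_counts = [1200, 1350, 1100, 1250]
print(average_daily_flow(daily_counts))

# Hypothetical sample-site totals for two consecutive years
print(scale_by_sample_growth(1000, [100, 200], [110, 220]))
```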


# 4. Data - Economic and labour market measures

Annual figures are taken to match with traffic flow

- **GDP:** The measure used here is the National GDP growth figure as contained in the UK National Accounts Blue Book.
- **CPIH:** The annual UK Consumer Prices Index including owner occupiers' housing costs is used here. Note that this time series only dates back to 2005, so no figure is available before this date
- **CPI:** The annual UK Consumer Prices Index is used here as CPIH is not available before 2005
- **Unemployment:** The seasonally adjusted UK unemployment rate for over 16s is used here
- **Earnings:** Average weekly earnings is the figure used that gives the money paid per week, per job before tax and other deductions to employees in the UK


# 4.  <a href="file:///home/eddr/Documents/Projects/report/figures/time_series_multi_metrics.html" target="blank">Overall trends in the data</a>

This shows the time series of the data to give an overview

In [11]:
IFrame("figures/time_series_multi_metrics_small.html", width=1200, height=1000)

# 5. Finding the lag 

### Method

- Compute the cross-correlation function
- The peak gives the point of maximum similarity between traffic flow and the chosen measures
- This is used to find the time lag between the time series
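
The steps above can be sketched as follows, using only the standard library. This is one common definition of the cross-correlation function (standardise both series, then average the products over the overlapping points at each lag), not necessarily the exact one used for the figures. Taking the peak as the largest absolute value is also an assumption, and the two series are made up.

```python
from statistics import mean, pstdev


def standardise(series):
    """Subtract the mean and divide by the population standard deviation."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s for v in series]


def cross_correlation(x, y, max_lag):
    """Return {lag: r}; a positive lag pairs x[t] with y[t + lag],
    i.e. x leads y by that many time steps."""
    zx, zy = standardise(x), standardise(y)
    n = len(zx)
    ccf = {}
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(zx[t], zy[t + lag]) for t in range(n) if 0 <= t + lag < n]
        ccf[lag] = sum(a * b for a, b in pairs) / n
    return ccf


def peak_lag(ccf):
    """Lag with the largest absolute correlation (assumed peak criterion)."""
    return max(ccf, key=lambda lag: abs(ccf[lag]))


# Made-up series: the measure repeats the traffic series one step later
traffic = [3, 5, 2, 8, 6, 1, 7, 4, 9, 5, 2, 6]
measure = [4] + traffic[:-1]

ccf = cross_correlation(traffic, measure, max_lag=3)
print(peak_lag(ccf))
```

With annual data and only 13 points, the number of overlapping pairs shrinks quickly as the lag grows, so peaks at long lags rest on very few observations.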

## Results

| Variable      | Peak/Lag with traffic |
| ------------- |:---------------------:| 
| GDP           | 1 year                | 
| Income        | 1 year                |
| Unemployment  | 5 years*              |
| Inflation     | 0 years               |

\* Little overlap at 1 year indicates negative correlation

## GDP Growth

In [10]:
IFrame("figures/cross_correlation_plots_gdp_growth.html", width=1200, height=1000)

## CPIH

In [12]:
IFrame("figures/cross_correlation_plots_CPIH.html", width=1200, height=1000)

## CPI

In [13]:
IFrame("figures/cross_correlation_plots_CPI.html", width=1200, height=1000)

## Unemployment

In [14]:
IFrame("figures/cross_correlation_plots_unemployment_pct_change.html", width=1200, height=1000)

## Average Earnings

In [15]:
IFrame("figures/cross_correlation_plots_av_weekly_earnings_change.html", width=1200, height=1000)

# 6. Correlations 

In [21]:
IFrame("figures/correlation_heatmap_small.html",
       width=800, 
       height=600)

### Coefficients

<table border="1" class="dataframe">
  <thead>
    <tr><th>Economic indicator</th><th>CPI</th><th>Change in unemployment (% pts)</th><th>Change in weekly earnings (£)</th><th>GDP Growth</th></tr>
    <tr><th>Vehicle type</th><th></th><th></th><th></th><th></th></tr>
  </thead>
  <tbody>
    <tr><th>All HGVs</th><td>0.077045</td><td>-0.145477</td><td>0.946293</td><td>0.507384</td></tr>
    <tr><th>All Motor Vehicles</th><td>0.088426</td><td>-0.494078</td><td>0.826505</td><td>0.757739</td></tr>
    <tr><th>Buses and Coaches</th><td>0.111941</td><td>-0.559256</td><td>0.718989</td><td>0.756430</td></tr>
    <tr><th>Cars and Taxis</th><td>0.104417</td><td>-0.455441</td><td>0.837060</td><td>0.765201</td></tr>
    <tr><th>LGVs</th><td>-0.048951</td><td>-0.670309</td><td>0.183337</td><td>0.447924</td></tr>
    <tr><th>Motorbikes and Scooters</th><td>0.073761</td><td>0.116624</td><td>0.825611</td><td>0.255856</td></tr>
    <tr><th>Pedal Cycles</th><td>0.036957</td><td>-0.570229</td><td>-0.177683</td><td>0.366898</td></tr>
  </tbody>
</table>

### Stats Netherlands



| Measure       | Total Traffic | Cat 1 (< 5.6m) Traffic | Cat 2 (5.6m to 12.2m) Traffic | Cat 3 (> 12.2m) Traffic |
| ------------- |:-------------:|:-----:|:-----:|:-----:|
| Inflation     | -0.42 | -0.43 | -0.19 | -0.43 |
| Unemployment  | -0.47 | -0.41 | -0.55 | -0.22 |
| Income        | 0.74  | 0.74  | 0.45  | 0.65  |
| GDP           | 0.54  | 0.63  | -0.01 | 0.70  |


## Log transformed traffic flow

In [20]:
IFrame("figures/correlation_heatmap_log_small.html",
       width=800, 
       height=600)

# 6. Correlations - summary

- Weekly earnings has the strongest correlation, similar to The Netherlands 
- GDP shows some strong correlations, though not with HGVs like in The Netherlands
- Unemployment shows weaker correlations, similar to The Netherlands
- No correlation with inflation

# 6. Correlations - caveats

1. Small dataset
2. Not clear if variables are normally distributed
3. Correlations for GDP are dependent on outliers (i.e. may not exist)
4. Time series only contains one recession 
5. Not clear how recession events change distribution of variables

# 7. Time series models 

### Approach

1. Try some basic Auto-regressive (AR) models 
    - These contain one variable, where you are trying to predict future values from past (lagged) values
    - These shouldn't work well, otherwise it would be easy to predict GDP etc!
2. Add in All Vehicles variable in a Vector AR (VAR) model to predict the economic variable
    - If traffic flow is a good predictor, then this should give a better estimate of GDP
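
A rough sketch of this comparison, written with the standard library only. It fits an AR(1) equation for the economic variable by least squares, then refits with lagged traffic as an extra regressor. That second fit is just the economic-variable equation of a VAR(1), not a full VAR, and is not necessarily how the models shown here were fitted. All names and numbers are made up for illustration.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, targets):
    """Least-squares coefficients via the normal equations.
    Each row must already contain a leading 1.0 for the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def in_sample_rss(rows, targets):
    """Residual sum of squares of the fitted values on the training data."""
    beta = ols(rows, targets)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((t - f) ** 2 for t, f in zip(targets, fitted))


# Made-up annual series, for illustration only
gdp = [1.0, 2.0, 1.5, 0.5, -1.0, 0.0, 1.2, 1.8, 2.1, 1.6]
traffic = [10.2, 10.1, 9.8, 9.2, 9.5, 10.0, 10.3, 10.4, 10.2, 10.1]

targets = gdp[1:]
ar_rows = [[1.0, g] for g in gdp[:-1]]                               # gdp[t-1]
var_rows = [[1.0, g, t] for g, t in zip(gdp[:-1], traffic[:-1])]     # plus traffic[t-1]

rss_ar = in_sample_rss(ar_rows, targets)
rss_var = in_sample_rss(var_rows, targets)
print("AR(1) RSS:          ", rss_ar)
print("AR(1) + traffic RSS:", rss_var)
```

Note that adding a regressor can never increase the in-sample residual sum of squares, so a lower value for the second model is not evidence on its own that traffic improves predictions.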


## Auto and partial correlations

In [22]:
IFrame("figures/auto_and_partial_correlations.html",
       width=800, 
       height=400
      )

## The models

In [24]:
IFrame("figures/actual_vs_AR_predictions.html",
       width=800, 
       height=600
      )

# 7. Time series models - caveats

1. Short time series
2. In-sample predictions (predicting the data the model was fitted to)
3. The time series are not stationary, due to the recession
4. The results do give some indication that this could work (with better data and methods)
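
With a longer series, the in-sample caveat could be addressed by forecasting each point using only the data before it. The sketch below shows one way to do that with an AR(1) model and an expanding training window; the function names are hypothetical and the series is a toy example that follows y[t] = 1 + 0.5 * y[t-1] exactly.

```python
from statistics import mean


def fit_ar1(series):
    """Least-squares fit of y[t] = c + phi * y[t-1]; returns (c, phi)."""
    x, y = series[:-1], series[1:]
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    phi = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - phi * mx, phi


def expanding_window_forecasts(series, min_train):
    """One-step-ahead forecasts for series[min_train:], refitting the model
    on all observations before each target point (min_train must be >= 3)."""
    forecasts = []
    for t in range(min_train, len(series)):
        c, phi = fit_ar1(series[:t])
        forecasts.append(c + phi * series[t - 1])
    return forecasts


# Toy series generated by y[t] = 1 + 0.5 * y[t-1]
series = [0, 1, 1.5, 1.75, 1.875, 1.9375]
forecasts = expanding_window_forecasts(series, min_train=3)
errors = [actual - f for actual, f in zip(series[3:], forecasts)]
print(forecasts)
print(errors)
```

Because each forecast never sees the value it is predicting, the errors give a fairer picture of predictive performance than in-sample fitted values.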

# 8. Overall Summary

- Evidence that traffic flow can be used as an early indicator for economic measures
- Need more data
- Better data will allow for better methods
- Still, more data will not solve all our problems
- We need to test these methods' robustness even with new data

### Speaking of better data...

# 9. Future work - High frequency traffic flow data

Highways England provides an open dataset containing traffic flow counts at 15 minute intervals for Motorways and Major A-Roads across England.

- The Data Engineering team in DaaS have scraped this and ingested into DAP-E
    - Over 2 million separate files
    - Over 200 GB of data

- Really exciting high frequency dataset - ideally suited for this work
    - Replicate these correlations with daily data
    - Look at the approaches used in Slovenia and Finland
    - Develop some ideas of our own


# Thank you for listening, Questions?

- Contact: edward.rowland@ons.gov.uk
- Location: Newport 1.156

### Analysis 
The analysis is a bit rough, and some visualisations may not work as I update/break things periodically. If you want to look at something that is broken, please email me

- GitHub: https://github.com/ONSBigData/traffic_as_early_indicator
- MyBinder (run code without python/jupyter): https://mybinder.org/v2/gh/ONSBigData/traffic_as_early_indicator/master