*This study was conducted for skills demonstration purposes only*

# **Forecasting the UK Construction Sector with Macroeconomic Indicators**
# Section 5. Modeling

This section builds on the insights from the Exploratory Data Analysis (EDA) to develop predictive models for forecasting UK construction sector trends from macroeconomic indicators. Using the correlations and time lags identified there, we implement lagged regression and Vector Autoregression (VAR) models to capture the dynamic relationships between macroeconomic variables (such as GDP and inflation) and construction variables (such as output, material prices, and new contracts). The models account for black swan events (e.g., the 2008 crisis, Brexit, COVID-19) via dummy variables, with the aim of producing accurate forecasts and answering the research questions on predictive power and lagged effects.

## Research Questions

1. How do construction-related variables (e.g., output, material costs, new contracts) correlate with economic indicators (e.g., GDP growth, interest rates, inflation, employment rates)?

2. Can macroeconomic indicators predict construction trends?

4. Do these macroeconomic indicators impact construction activity immediately, or with a time lag?
<br> If so, what is the typical delay between an economic change and a response in construction output or material prices?

5. Can macroeconomic indicators be used to accurately forecast future construction trends?
<br> How effective are models such as lagged regression or VAR in making such predictions?

6. How have so-called 'black swans' (e.g., Brexit, COVID-19) influenced the construction industry?

## Suitable models and techniques review

To address the research questions, we need models that handle time-series data and lagged relationships. The candidates are:

- **Cross-Correlation Analysis**
<br>Identifies the lag at which two time-series (e.g., GDP and construction output) exhibit the strongest correlation.
<br>Computes the correlation coefficient between a macroeconomic indicator and a construction indicator at various lag lengths (e.g., 0 to 12 months).

- **Engle-Granger cointegration test**
<br>Checks whether two non-stationary time series are linked by a stable long-term relationship.
<br>Regresses one series on the other, then tests the residuals for stationarity (using the ADF test).

- **Granger Causality Test**
<br>Tests whether one time-series (e.g., GDP) can predict another (e.g., construction output) at specific lags.
<br>Assesses whether lagged values of one variable improve predictions of another, indicating predictive (Granger) causality and the relevant lag length. It does not by itself establish a causal mechanism.

- **Lagged Regression**
<br>Suitable for capturing the effect of lagged macroeconomic variables on construction indicators.
<br>Can be used for prediction when the predictors act with a time lag.

- **Vector Autoregression (VAR) with Lag Selection**
<br>Models multivariate time-series and identifies optimal lags for all variables simultaneously.
<br>VAR models each variable as a function of its own lags and lags of other variables, with lag length determined by criteria like AIC or BIC.

- **Distributed Lag Models (DLM)**
<br>Explicitly models the effect of a predictor’s lagged values on the dependent variable.
<br>Regresses a construction indicator (e.g., material prices) on multiple lagged values of a macroeconomic indicator (e.g., CPIH).

- **ARIMAX with Exogenous Lags**
<br>Extends ARIMA to include lagged exogenous variables, identifying their influence on the target variable.
<br>Models a construction indicator (e.g., output) with its own lags and lagged exogenous variables (e.g., GDP, CPIH).

- **Machine Learning Models**
<br>Random Forest or Gradient Boosting (e.g., XGBoost) for non-linear relationships.
<br>Recurrent Neural Networks (RNNs) or LSTMs for complex time-series patterns.
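
As an illustration of the lagged regression and distributed lag ideas listed above, here is a minimal sketch on synthetic data. The series `x` and `y`, the true lag of 3 periods, and the coefficient 0.8 are all assumptions made for the example; they are not taken from the study's dataset. The target is regressed on several lags of the predictor, and the lag with the largest coefficient is reported.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag, max_lag = 400, 3, 6

# Synthetic stand-ins (assumptions, not the study's data):
# x plays the role of a macro indicator, y of a construction indicator
x = rng.normal(size=n)
y = np.zeros(n)
y[true_lag:] = 0.8 * x[:-true_lag]            # y_t depends on x_{t-3}
y += rng.normal(scale=0.3, size=n)

# Design matrix: intercept plus x_t, x_{t-1}, ..., x_{t-max_lag}
rows = np.arange(max_lag, n)
X = np.column_stack(
    [np.ones(len(rows))] + [x[rows - k] for k in range(max_lag + 1)]
)
beta, *_ = np.linalg.lstsq(X, y[rows], rcond=None)

lag_coefs = beta[1:]                          # one coefficient per lag 0..max_lag
best_lag = int(np.argmax(np.abs(lag_coefs)))
print("estimated lag coefficients:", np.round(lag_coefs, 2))
print("lag with the largest effect:", best_lag)
```

With real indicators the lag coefficients are usually less clear-cut, because consecutive lags of a macroeconomic series tend to be strongly correlated with each other.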

## Modeling plan

1. **Cross-Correlation Analysis**
This method will be used to confirm and refine the lags identified in the EDA.

2. **Engle-Granger cointegration test**
This method will be used to check whether two series share a common trend because of a real economic link or whether their correlation is spurious.

4. **Granger Causality**
Pairs with significant correlations will be tested for Granger causality to validate predictive relationships.

5. **VAR Implementation**
Multivariate modeling will be carried out, letting the model select optimal lags via AIC/BIC.
Key variables (e.g., construction output, GDP, CPIH, employment rate) and dummy variables for black swan events will be included.


6. **Verification with DLM or ARIMAX**
DLM will be used to test specific lag structures for individual relationships (e.g., CPIH to material prices).
ARIMAX will be used for single-indicator forecasting with exogenous lags.

Further research and possible extensions of this study may use other methods and models.

## Tools and Libraries

In [660]:
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import ccf
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import coint

## Auxiliary Functions

## Loading data

In [661]:
#df = pd.read_csv('df_normalized_final.csv', index_col='Date', parse_dates=True)

In [662]:
#Reading data
df = pd.read_csv('df_final.csv')
df_normalized = pd.read_csv('df_normalized_final.csv')
df_standardized = pd.read_csv('df_standardized_final.csv')

#Print column names
print(df.columns)
print(df_normalized.columns)
print(df_standardized.columns)

Index(['Date', 'CPIH', 'GDP, £m', 'Employment rate, %', 'BoE Rate, %',
       'GBP/EUR', 'GBP/USD', 'Business Investment, % change',
       'Govt Expenditure, £m', 'Construction output, £m',
       'Construction Material Price Index, 2015 = 100',
       'Small construction companies', 'Medium construction companies',
       'Large construction companies', 'Number of all construction companies',
       'Employees - Small construction companies',
       'Employees - Medium construction companies',
       'Employees - Large construction companies',
       'Employees - All construction companies',
       'New Contracts - Public Housing, £m',
       'New Contracts - Private Housing, £m',
       'New Contracts - Infrastructure, £m', 'New contracts - Other, £m',
       'New Contracts - Private Industrial, £m',
       'New Contracts - Private Commercial, £m',
       'New contracts - All Construction, £m', 'COVID_Lockdown',
       'HS2_contracts', 'Commercial_Construction_Boom_2006',
       'fi

In [663]:
#Setting up a type of column 'Date' as a datetime type
df['Date'] = pd.to_datetime(df['Date'])
df_normalized['Date'] = pd.to_datetime(df_normalized['Date'])
df_standardized['Date'] = pd.to_datetime(df_standardized['Date'])
#check data types: Column 'Date' should be datetime type, other columns - numerical
print(df.dtypes)
print(df_normalized.dtypes)
print(df_standardized.dtypes)

Date                                             datetime64[ns]
CPIH                                                    float64
GDP, £m                                                 float64
Employment rate, %                                      float64
BoE Rate, %                                             float64
GBP/EUR                                                 float64
GBP/USD                                                 float64
Business Investment, % change                           float64
Govt Expenditure, £m                                      int64
Construction output, £m                                 float64
Construction Material Price Index, 2015 = 100           float64
Small construction companies                            float64
Medium construction companies                           float64
Large construction companies                            float64
Number of all construction companies                    float64
Employees - Small construction companies

In [664]:
#Set column 'Date' as an index column for all datasets
df = df.set_index('Date')
df_normalized = df_normalized.set_index('Date')
df_standardized = df_standardized.set_index('Date')
df_normalized.head(3)

Date,CPIH_normalized,"GDP, £m_normalized","Employment rate, %_normalized","BoE Rate, %_normalized",GBP/EUR_normalized,GBP/USD_normalized,"Business Investment, % change_normalized","Govt Expenditure, £m_normalized","Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",...,"New Contracts - Infrastructure, £m_normalized","New contracts - Other, £m_normalized","New Contracts - Private Industrial, £m_normalized","New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized",COVID_Lockdown,HS2_contracts,Commercial_Construction_Boom_2006,financial_crisis_2008,brexit_referendum_2016
2005-01-01,0.0,0.126817,0.484375,0.823009,0.822412,0.793805,0.360902,0.079379,0.632729,0.0,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0
2005-02-01,0.003521,0.126817,0.46875,0.823009,0.862298,0.805195,0.360902,0.057956,0.632729,0.0,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0
2005-03-01,0.008803,0.126817,0.4375,0.823009,0.848291,0.82723,0.360902,0.051451,0.632729,0.0,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0


## 1. Cross-Correlation Analysis

#### Selecting variables with strong correlation (|r| >= 0.6)


In [665]:
# Defining column groups
macro_cols = df_normalized.columns[:8]
construction_cols = df_normalized.columns[8:25]

# Calculating 8x17 correlation matrix
correlation = pd.DataFrame(index=macro_cols, columns=construction_cols)

for macro in macro_cols:
    for constr in construction_cols:
        correlation.loc[macro, constr] = df_normalized[macro].corr(df_normalized[constr])

correlation = correlation.astype(float)
correlation

Unnamed: 0,"Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Small construction companies_normalized,Medium construction companies_normalized,Large construction companies_normalized,Number of all construction companies_normalized,Employees - Small construction companies_normalized,Employees - Medium construction companies_normalized,Employees - Large construction companies_normalized,Employees - All construction companies_normalized,"New Contracts - Public Housing, £m_normalized","New Contracts - Private Housing, £m_normalized","New Contracts - Infrastructure, £m_normalized","New contracts - Other, £m_normalized","New Contracts - Private Industrial, £m_normalized","New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized"
CPIH_normalized,0.704082,0.970216,0.922799,-0.288201,-0.86359,0.922069,0.913239,-0.247769,-0.515117,0.780774,-0.654427,-0.074353,0.106463,-0.686065,0.113121,-0.59312,-0.593206
"GDP, £m_normalized",0.896394,0.843694,0.887295,-0.166982,-0.713097,0.887036,0.883095,-0.065633,-0.274047,0.816295,-0.680002,0.233143,0.21218,-0.676269,0.30458,-0.368943,-0.293248
"Employment rate, %_normalized",0.791122,0.562143,0.810046,0.08477,-0.454556,0.810748,0.798926,0.252159,0.051296,0.833594,-0.732228,0.460144,0.1286,-0.682055,0.466079,-0.142142,-0.08956
"BoE Rate, %_normalized",0.269533,-0.016583,-0.111865,0.594597,0.172649,-0.109669,0.006453,0.710618,0.415945,0.184955,0.268383,0.115554,-0.409928,0.245181,0.412657,0.650041,0.466575
GBP/EUR_normalized,-0.15754,-0.541348,-0.550446,0.258513,0.419186,-0.549714,-0.478499,0.491547,0.436654,-0.324632,0.399797,0.467292,-0.277295,0.283556,0.432158,0.82403,0.753356
GBP/USD_normalized,-0.523338,-0.765306,-0.850719,0.359823,0.673819,-0.849727,-0.781839,0.381513,0.450839,-0.638687,0.721344,0.097887,-0.315911,0.658798,0.048875,0.756131,0.658364
"Business Investment, % change_normalized",0.101027,-0.033363,-0.04924,-0.071246,-0.022916,-0.049543,-0.03524,0.036667,0.009091,-0.027145,0.042422,0.178693,-0.018258,-0.015599,0.141313,0.104031,0.136927
"Govt Expenditure, £m_normalized",0.570013,0.940522,0.891737,-0.209445,-0.818683,0.891294,0.887348,-0.221684,-0.516287,0.759559,-0.604059,-0.207077,0.03426,-0.60718,0.07562,-0.609663,-0.647056


In [666]:
# Leave only pairs with a strong correlation
strong_pairs = correlation[correlation.abs() >= 0.6].stack().reset_index()
strong_pairs.columns = ['Macroeconomic','Construction','Correlation']
# Exclude self-correlations
strong_pairs = strong_pairs[strong_pairs['Macroeconomic'] != strong_pairs['Construction']]
strong_pairs

Unnamed: 0,Macroeconomic,Construction,Correlation
0,CPIH_normalized,"Construction output, £m_normalized",0.704082
1,CPIH_normalized,"Construction Material Price Index, 2015 = 100_...",0.970216
2,CPIH_normalized,Small construction companies_normalized,0.922799
3,CPIH_normalized,Large construction companies_normalized,-0.86359
4,CPIH_normalized,Number of all construction companies_normalized,0.922069
5,CPIH_normalized,Employees - Small construction companies_norma...,0.913239
6,CPIH_normalized,Employees - All construction companies_normalized,0.780774
7,CPIH_normalized,"New Contracts - Public Housing, £m_normalized",-0.654427
8,CPIH_normalized,"New contracts - Other, £m_normalized",-0.686065
9,"GDP, £m_normalized","Construction output, £m_normalized",0.896394


#### Checking data for stationarity. Differencing
Cross-correlation analysis assumes stationary time series.
As noted in the EDA stage, the series show trends and are therefore likely non-stationary, so differencing will be required.
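
The sketch below illustrates why differencing matters before correlating series. It uses two independent random walks generated for the example (synthetic data, not the study's indicators). In levels, such series often show a correlation far from zero purely by chance, while the correlation of their first differences stays close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Two independent random walks (synthetic; there is no real link between them)
a = np.cumsum(rng.normal(size=n))
b = np.cumsum(rng.normal(size=n))

corr_levels = np.corrcoef(a, b)[0, 1]
corr_diffs = np.corrcoef(np.diff(a), np.diff(b))[0, 1]

print(f"correlation in levels:      {corr_levels:.3f}")
print(f"correlation of differences: {corr_diffs:.3f}")
```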

In [667]:
# Build df_diff from the indicators that appear in strongly correlated pairs
selected_cols = list(strong_pairs['Macroeconomic'].unique()) + list(strong_pairs['Construction'].unique())
df_diff = df_normalized[selected_cols].copy()

df_diff.head(5)

Date,CPIH_normalized,"GDP, £m_normalized","Employment rate, %_normalized","BoE Rate, %_normalized",GBP/EUR_normalized,GBP/USD_normalized,"Govt Expenditure, £m_normalized","Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Small construction companies_normalized,Large construction companies_normalized,Number of all construction companies_normalized,Employees - Small construction companies_normalized,Employees - All construction companies_normalized,"New Contracts - Public Housing, £m_normalized","New contracts - Other, £m_normalized",Employees - Medium construction companies_normalized,"New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized"
2005-01-01,0.0,0.126817,0.484375,0.823009,0.822412,0.793805,0.079379,0.632729,0.0,0.0,0.708758,0.0,0.015011,0.167488,0.395604,0.534397,0.682813,0.599912,0.784868
2005-02-01,0.003521,0.126817,0.46875,0.823009,0.862298,0.805195,0.057956,0.632729,0.0,0.0,0.708758,0.0,0.015011,0.167488,0.395604,0.534397,0.682813,0.599912,0.784868
2005-03-01,0.008803,0.126817,0.4375,0.823009,0.848291,0.82723,0.051451,0.632729,0.0,0.0,0.708758,0.0,0.015011,0.167488,0.395604,0.534397,0.682813,0.599912,0.784868
2005-04-01,0.014085,0.152286,0.4375,0.823009,0.898623,0.814669,0.091843,0.620417,0.0,0.0,0.708758,0.0,0.015011,0.167488,0.471319,0.573523,0.682813,0.578837,0.856361
2005-05-01,0.019366,0.152286,0.4375,0.823009,0.888889,0.769747,0.0,0.620417,0.0,0.0,0.708758,0.0,0.015011,0.167488,0.471319,0.573523,0.682813,0.578837,0.856361


In [668]:
# Check for stationarity using the Augmented Dickey-Fuller (ADF) test and difference the series if needed
for i in range(df_diff.shape[1]):
    result = adfuller(df_diff.iloc[:, i].dropna())
    if result[1] > 0.05:
        col_name = df_diff.columns[i]
        df_diff[col_name] = df_diff[col_name].diff()
df_diff.head(3)

Date,CPIH_normalized,"GDP, £m_normalized","Employment rate, %_normalized","BoE Rate, %_normalized",GBP/EUR_normalized,GBP/USD_normalized,"Govt Expenditure, £m_normalized","Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Small construction companies_normalized,Large construction companies_normalized,Number of all construction companies_normalized,Employees - Small construction companies_normalized,Employees - All construction companies_normalized,"New Contracts - Public Housing, £m_normalized","New contracts - Other, £m_normalized",Employees - Medium construction companies_normalized,"New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized"
2005-01-01,,,,,,,,,,,,,,,,,,,
2005-02-01,0.003521,0.0,-0.015625,0.0,0.039886,0.01139,-0.021423,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2005-03-01,0.005282,0.0,-0.03125,0.0,-0.014008,0.022035,-0.006505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [669]:
#Check if all data is stationary after one differencing
for i in range(df_diff.shape[1]):
    result = adfuller(df_diff.iloc[:, i].dropna())
    if result[1] > 0.05:
        print('Non-stationary data: ',df_diff.columns[i], result[1])

Non-stationary data:  CPIH_normalized 0.059024411732616335
Non-stationary data:  Small construction companies_normalized 0.0545738554585426
Non-stationary data:  Number of all construction companies_normalized 0.05446211049315519


In [670]:
# Drop the first row, which contains NaN values after differencing
df_diff = df_diff.drop('2005-01-01')
df_diff

Date,CPIH_normalized,"GDP, £m_normalized","Employment rate, %_normalized","BoE Rate, %_normalized",GBP/EUR_normalized,GBP/USD_normalized,"Govt Expenditure, £m_normalized","Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Small construction companies_normalized,Large construction companies_normalized,Number of all construction companies_normalized,Employees - Small construction companies_normalized,Employees - All construction companies_normalized,"New Contracts - Public Housing, £m_normalized","New contracts - Other, £m_normalized",Employees - Medium construction companies_normalized,"New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized"
2005-02-01,0.003521,0.000000,-0.015625,0.000000,0.039886,0.011390,-0.021423,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
2005-03-01,0.005282,0.000000,-0.031250,0.000000,-0.014008,0.022035,-0.006505,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
2005-04-01,0.005282,0.025469,0.000000,0.000000,0.050332,-0.012561,0.040392,-0.012312,0.000000,0.0,0.0,0.0,0.0,0.0,0.075714,0.039126,0.0,-0.021074,0.071493
2005-05-01,0.005282,0.000000,0.000000,0.000000,-0.009734,-0.044922,-0.091843,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
2005-06-01,0.000000,0.000000,0.015625,0.000000,0.080959,-0.038216,0.004050,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-09-01,0.001761,0.000000,0.000000,0.000000,0.038462,0.030232,0.047087,-0.005541,-0.008680,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
2024-10-01,0.014085,0.003708,0.000000,0.000000,0.016382,-0.018310,0.028828,-0.000923,-0.008680,0.0,0.0,0.0,0.0,0.0,0.009780,0.013371,0.0,0.002356,-0.028808
2024-11-01,0.005282,0.000000,0.015625,-0.035823,0.006173,-0.031403,-0.075342,0.012813,0.007595,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000
2024-12-01,0.008803,0.000000,0.000000,-0.008425,0.019231,-0.010964,0.146225,-0.006695,-0.008680,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000


In [671]:
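# Defining column groups
# Note: df_diff has 19 columns (7 macroeconomic + 12 construction); the slice 7:18 leaves out
# the last one, 'New contracts - All Construction, £m_normalized'
macro_cols = df_diff.columns[:7]
construction_cols = df_diff.columns[7:18]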

# Calculating 7x11 correlation matrix
correlation = pd.DataFrame(index=macro_cols, columns=construction_cols)

for macro in macro_cols:
    for constr in construction_cols:
        correlation.loc[macro, constr] = df_diff[macro].corr(df_diff[constr])

correlation = correlation.astype(float)
correlation

Unnamed: 0,"Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Small construction companies_normalized,Large construction companies_normalized,Number of all construction companies_normalized,Employees - Small construction companies_normalized,Employees - All construction companies_normalized,"New Contracts - Public Housing, £m_normalized","New contracts - Other, £m_normalized",Employees - Medium construction companies_normalized,"New Contracts - Private Commercial, £m_normalized"
CPIH_normalized,0.097908,0.31958,-0.330113,0.028469,-0.329436,-0.206713,-0.123479,0.119489,0.023405,0.021608,0.031021
"GDP, £m_normalized",0.790808,-0.07806,0.024216,0.059549,0.024691,0.051409,0.071913,-0.009754,0.095869,0.075955,0.474782
"Employment rate, %_normalized",0.022834,0.026231,0.098427,0.013417,0.099187,0.07764,0.077204,-0.130119,-0.04755,0.087314,0.009894
"BoE Rate, %_normalized",0.152525,0.101134,0.04327,0.033316,0.044627,0.123216,0.159045,0.040041,0.0482,0.1865,0.090022
GBP/EUR_normalized,-0.023292,0.076403,-0.007553,0.00099,-0.007985,0.079868,0.070779,-0.027095,0.051812,0.051517,0.055885
GBP/USD_normalized,0.035251,-0.011916,-0.057007,-0.050658,-0.057055,0.020076,0.018838,-0.016828,0.07867,0.040203,0.135088
"Govt Expenditure, £m_normalized",-0.435733,0.02639,0.045671,0.030358,0.04642,0.05134,0.062325,-0.047738,-0.012126,0.065675,-0.212124


#### Cross-Correlation Analysis Implementation

In [676]:
strong_pairs['Optimal lag'] = 0
strong_pairs['Max Cross-Correlation'] = strong_pairs['Correlation']
strong_pairs['statistical significance, 95% confidence'] = 0
max_lags = 12  # consider lags of 0 to 12 months
for i in strong_pairs.index:
    # Select the differenced series for this pair and align them on common dates
    x = df_diff[strong_pairs.loc[i, 'Macroeconomic']].dropna()
    y = df_diff[strong_pairs.loc[i, 'Construction']].dropna()
    x, y = x.align(y, join='inner')
    n = len(x)
    # Compute cross-correlation for lags 0..max_lags
    # Note: statsmodels' ccf(x, y)[k] correlates x[t+k] with y[t]; to measure x leading y
    # by k months, the arguments would need to be swapped
    cross_corr = ccf(x, y, adjusted=True)
    cross_corr_12 = cross_corr[:max_lags + 1]
    # Find the lag with the largest absolute cross-correlation
    optimal_index = np.argmax(np.abs(cross_corr_12))
    strong_pairs.loc[i, 'Optimal lag'] = optimal_index
    strong_pairs.loc[i, 'Max Cross-Correlation'] = cross_corr_12[optimal_index]

    # Approximate 95% significance bound for a sample cross-correlation: 1.96 / sqrt(n)
    conf_bound = 1.96 / np.sqrt(n)
    is_significant = np.abs(cross_corr_12[optimal_index]) > conf_bound
    strong_pairs.loc[i, 'statistical significance, 95% confidence'] = int(is_significant)

strong_pairs[strong_pairs['statistical significance, 95% confidence'] == 1]

Unnamed: 0,Macroeconomic,Construction,Correlation,Optimal lag,Max Cross-Correlation,"statistical significance, 95% confidence"
1,CPIH_normalized,"Construction Material Price Index, 2015 = 100_...",0.970216,11,0.386006,1
2,CPIH_normalized,Small construction companies_normalized,0.922799,12,-0.34536,1
4,CPIH_normalized,Number of all construction companies_normalized,0.922069,12,-0.344993,1
5,CPIH_normalized,Employees - Small construction companies_norma...,0.913239,12,-0.232285,1
6,CPIH_normalized,Employees - All construction companies_normalized,0.780774,12,-0.143875,1
9,"GDP, £m_normalized","Construction output, £m_normalized",0.896394,0,0.790808,1
11,"GDP, £m_normalized",Small construction companies_normalized,0.887295,6,0.130309,1
12,"GDP, £m_normalized",Large construction companies_normalized,-0.713097,6,-0.171824,1
13,"GDP, £m_normalized",Number of all construction companies_normalized,0.887036,6,0.129721,1
18,"Employment rate, %_normalized","Construction output, £m_normalized",0.791122,4,0.200666,1


## 2. Cointegration Test (Engle-Granger method)

The cointegration test is a statistical method used in time series analysis to determine whether two or more non-stationary series have a stable long-term equilibrium relationship, despite short-term fluctuations. While non-stationary series (e.g., those with trends) often produce spurious correlations, cointegration helps identify whether their trends are meaningfully connected.

The cointegration test uses the original (undifferenced) normalized data. All series tested should be non-stationary in levels but integrated of order 1 (i.e., stationary after one differencing).

In [673]:
# Series that are not integrated of order 1 (i.e., not stationary after one differencing) can't be used for the cointegration test.
# Build df_normalized_i1 without the columns 'CPIH_normalized', 'Small construction companies_normalized', 'Number of all construction companies_normalized'
df_normalized_i1 = df_normalized.copy().drop(columns=['CPIH_normalized', 'Small construction companies_normalized', 'Number of all construction companies_normalized'])
df_normalized_i1.head(3)

Date,"GDP, £m_normalized","Employment rate, %_normalized","BoE Rate, %_normalized",GBP/EUR_normalized,GBP/USD_normalized,"Business Investment, % change_normalized","Govt Expenditure, £m_normalized","Construction output, £m_normalized","Construction Material Price Index, 2015 = 100_normalized",Medium construction companies_normalized,...,"New Contracts - Infrastructure, £m_normalized","New contracts - Other, £m_normalized","New Contracts - Private Industrial, £m_normalized","New Contracts - Private Commercial, £m_normalized","New contracts - All Construction, £m_normalized",COVID_Lockdown,HS2_contracts,Commercial_Construction_Boom_2006,financial_crisis_2008,brexit_referendum_2016
2005-01-01,0.126817,0.484375,0.823009,0.822412,0.793805,0.360902,0.079379,0.632729,0.0,0.451295,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0
2005-02-01,0.126817,0.46875,0.823009,0.862298,0.805195,0.360902,0.057956,0.632729,0.0,0.451295,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0
2005-03-01,0.126817,0.4375,0.823009,0.848291,0.82723,0.360902,0.051451,0.632729,0.0,0.451295,...,0.111526,0.534397,0.467188,0.599912,0.784868,0,0,0,0,0


In [674]:
# Create a new dataset of strong pairs restricted to series that are integrated of order 1 (stationary after one differencing)

strong_pairs_i1 = strong_pairs.copy()
strong_pairs_i1 = strong_pairs_i1[strong_pairs_i1['Macroeconomic'] != 'CPIH_normalized']
strong_pairs_i1 = strong_pairs_i1[~strong_pairs_i1['Construction'].isin(['Small construction companies_normalized', 'Number of all construction companies_normalized'])].reset_index(drop=True)

strong_pairs_i1.head(3)

Unnamed: 0,Macroeconomic,Construction,Correlation,Optimal lag,Max Cross-Correlation,"statistical significance, 95% confidence"
0,"GDP, £m_normalized","Construction output, £m_normalized",0.896394,0,0.790808,1
1,"GDP, £m_normalized","Construction Material Price Index, 2015 = 100_...",0.843694,5,0.116774,0
2,"GDP, £m_normalized",Large construction companies_normalized,-0.713097,6,-0.171824,1


In [675]:
# Cointegration Test Using the Engle-Granger method
strong_pairs_i1['Cointegration'] = 0
strong_pairs_i1['Coint_pvalue'] = np.nan
for i in strong_pairs_i1.index:
    x = strong_pairs_i1.iloc[i, 0]
    y = strong_pairs_i1.iloc[i, 1]
    series_x = df_normalized_i1[x]
    series_y = df_normalized_i1[y]
    score, pvalue, crit_values = coint(series_x, series_y)
    strong_pairs_i1.loc[i, 'Coint_pvalue'] = pvalue
    if pvalue < 0.05:
        strong_pairs_i1.loc[i, 'Cointegration'] = 1

strong_pairs_i1[strong_pairs_i1['Cointegration'] == 1]

Unnamed: 0,Macroeconomic,Construction,Correlation,Optimal lag,Max Cross-Correlation,"statistical significance, 95% confidence",Cointegration,Coint_pvalue
14,GBP/EUR_normalized,"New Contracts - Private Commercial, £m_normalized",0.82403,8,0.148499,1,1,0.010134
15,GBP/EUR_normalized,"New contracts - All Construction, £m_normalized",0.753356,2,0.080155,0,1,0.04634
24,"Govt Expenditure, £m_normalized","Construction Material Price Index, 2015 = 100_...",0.940522,7,0.119107,0,1,0.000167


The cointegration test was conducted on the selected pairs of macroeconomic and construction indicators to identify long-term equilibrium relationships. Using the Engle-Granger method on series treated as I(1), three pairs were found to be cointegrated at the 5% level: GBP/EUR with new private commercial contracts, GBP/EUR with all new construction contracts, and government expenditure with the construction material price index. These pairs move together over time despite short-term fluctuations and are suitable candidates for further modeling with Vector Error Correction Models (VECM). The remaining pairs show no evidence of a stable long-run relationship, so their strong correlations in levels may be spurious and should be treated with caution in trend-based modeling.