Title: Predicting the Gross National Income of Canada in 2020, 2021 and 2022 from 2013-2019 Data

Introduction

We aim to predict the real Gross National Income (GNI) of Canada in 2020, 2021 and 2022 from three variables: the Consumer Price Index (CPI), the current account balance (CA), and the unemployment rate. We will do this by running a regression analysis on the four variables (including the dependent variable, GNI) over 2013-2019. This project is significant because it relates several macroeconomic variables to one another: if the three predictors track GNI well, then policy-driven changes in them could signal changes in GNI, which is a widely used measure of economic wellbeing. We collected our data from Statistics Canada and filtered each variable to 2013-2022 only. All data are unadjusted annual data.
    

Terminology 
- GNI: Gross National Income, the aggregate income earned by the residents of an economy in a particular year.
- CPI: Consumer Price Index, a measure of an economy's price level, computed as the indexed price of a basket of goods and services that a typical consumer purchases.
- Current account: the balance of an economy's transactions with the rest of the world in a particular year, covering net trade in goods and services plus net primary and secondary income.
- Unemployment rate: the number of unemployed persons divided by the total labour force of an economy in a particular year.

Preliminary exploratory data analysis below:

In [4]:
library(tidyverse)
library(repr)
library(readxl)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [5]:
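# Read the GNI table, skipping the first 7 lines of the file.
# Keep the row holding the years ("Estimates") and the real GNI index row.
gni <- read_csv("gni.csv", skip = 7) |>
  filter(...1 == "Estimates" | ...1 == "Real gross national income, volume index 2012=100")
# Add a row-label column, then keep the labels and the ten yearly columns.
gni$...0 <- c("year", "Real gross national income, volume index 2012=100")
gni_data <- select(gni, ...0, ...2, ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11)
gni_data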

[1m[22mNew names:
[36m•[39m `` -> `...1`
[36m•[39m `` -> `...2`
[36m•[39m `` -> `...3`
[36m•[39m `` -> `...4`
[36m•[39m `` -> `...5`
[36m•[39m `` -> `...6`
[36m•[39m `` -> `...7`
[36m•[39m `` -> `...8`
[36m•[39m `` -> `...9`
[36m•[39m `` -> `...10`
[36m•[39m `` -> `...11`
[1mRows: [22m[34m33[39m [1mColumns: [22m[34m11[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (2): ...1, ...2
[32mdbl[39m (9): ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


...0,...2,...3,...4,...5,...6,...7,...8,...9,...10,...11
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
year,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022.0
"Real gross national income, volume index 2012=100",102.8,105.1,103.5,104.4,108.8,111.4,113.8,107.3,117.6,123.1


In [39]:
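# Read the current account table, skipping the first 9 lines of the file.
# Keep the row holding the years and the "Total current account" rows,
# then drop the Payments and Receipts rows so only the balance remains.
current_account <- read_csv("current_account.csv", skip = 9) |>
  filter(Geography == "Current account and capital account" | Geography == "Total current account") |>
  filter(...1 != "Payments", ...1 != "Receipts")

# Add a row-label column.
current_account$...0 <- c("year", "Total current account balance")
current_account_d <- select(current_account, ...0, Canada, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11, ...12)

# The 2013 column is named `Canada` and was parsed as character,
# so its two values are re-entered by hand as a numeric column `...3`.
current_account_d$...3 <- c(2013, -59759)
current_account_data <- select(current_account_d, ...0, ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11, ...12)
current_account_data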


[1m[22mNew names:
[36m•[39m `` -> `...1`
[36m•[39m `` -> `...4`
[36m•[39m `` -> `...5`
[36m•[39m `` -> `...6`
[36m•[39m `` -> `...7`
[36m•[39m `` -> `...8`
[36m•[39m `` -> `...9`
[36m•[39m `` -> `...10`
[36m•[39m `` -> `...11`
[36m•[39m `` -> `...12`
“One or more parsing issues, see `problems()` for details”
[1mRows: [22m[34m51[39m [1mColumns: [22m[34m12[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (3): ...1, Geography, Canada

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


...0,...3,...4,...5,...6,...7,...8,...9,...10,...11,...12
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
year,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022
Total current account balance,-59759,-46278,-69569,-62553,-59998,-53141,-45183,-47578,-6749,-9105


In [7]:
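# Read the CPI table, skipping the first 7 lines of the file.
# Drop rows with no value in column ...5, drop rows 17 to 45 of what remains,
# then keep the row holding the years and the "All-items" row.
cpi <- read_csv("cpi.csv", skip = 7) |>
  filter(!is.na(...5)) |>
  slice(-c(17:45)) |>
  filter(...1 == "All-items" | ...1 == "Products and product groups 4")

# Add a row-label column, then keep the labels and the ten yearly columns.
cpi$...0 <- c("year", "CPI for all items")
cpi_data <- select(cpi, ...0, ...2, ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11)

cpi_data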

[1m[22mNew names:
[36m•[39m `` -> `...1`
[36m•[39m `` -> `...2`
[36m•[39m `` -> `...3`
[36m•[39m `` -> `...4`
[36m•[39m `` -> `...5`
[36m•[39m `` -> `...6`
[36m•[39m `` -> `...7`
[36m•[39m `` -> `...8`
[36m•[39m `` -> `...9`
[36m•[39m `` -> `...10`
[36m•[39m `` -> `...11`
[36m•[39m `` -> `...12`
[36m•[39m `` -> `...13`
[36m•[39m `` -> `...14`
[36m•[39m `` -> `...15`
[36m•[39m `` -> `...16`
[36m•[39m `` -> `...17`
[36m•[39m `` -> `...18`
[1mRows: [22m[34m37[39m [1mColumns: [22m[34m18[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (18): ...1, ...2, ...3, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11...

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


...0,...2,...3,...4,...5,...6,...7,...8,...9,...10,...11
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
year,2013.0,2014.0,2015.0,2016.0,2017.0,2018.0,2019.0,2020.0,2021.0,2022.0
CPI for all items,122.8,125.2,126.6,128.4,130.4,133.4,136.0,137.0,141.6,151.2


In [20]:
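# Read the unemployment table, skipping the first 8 lines of the file.
# The two rows of interest are picked out by their 2014 column: the year row
# (value 2014) and the unemployment rate row (value 7.0). This is fragile,
# since any other row with the same value in ...4 would also match.
unemployment <- read_csv("unemployment.csv", skip = 8) |>
  filter(...4 == 2014 | ...4 == 7.0) |>
  select(...1:...12)

# Enter the 2013 values by hand as a numeric column, and add a row-label column.
unemployment$...2 <- c(2013, 7.1)
unemployment$...0 <- c("year", "Unemployment rate")
unemployment_data <- select(unemployment, ...0, ...2, ...4, ...5, ...6, ...7, ...8, ...9, ...10, ...11, ...12)

unemployment_data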

[1m[22mNew names:
[36m•[39m `` -> `...1`
[36m•[39m `` -> `...4`
[36m•[39m `` -> `...5`
[36m•[39m `` -> `...6`
[36m•[39m `` -> `...7`
[36m•[39m `` -> `...8`
[36m•[39m `` -> `...9`
[36m•[39m `` -> `...10`
[36m•[39m `` -> `...11`
[36m•[39m `` -> `...12`
[36m•[39m `` -> `...13`
[36m•[39m `` -> `...14`
[36m•[39m `` -> `...15`
[36m•[39m `` -> `...16`
[36m•[39m `` -> `...17`
[36m•[39m `` -> `...18`
[36m•[39m `` -> `...19`
[36m•[39m `` -> `...20`
[36m•[39m `` -> `...21`
[36m•[39m `` -> `...22`
[36m•[39m `` -> `...23`
[36m•[39m `` -> `...24`
[36m•[39m `` -> `...25`
[36m•[39m `` -> `...26`
[36m•[39m `` -> `...27`
[36m•[39m `` -> `...28`
[36m•[39m `` -> `...29`
[36m•[39m `` -> `...30`
[36m•[39m `` -> `...31`
[36m•[39m `` -> `...32`
“One or more parsing issues, see `problems()` for details”
[1mRows: [22m[34m45[39m [1mColumns: [22m[34m32[39m
[36m──[39m [1mColumn specification[22m [36m─────────────────────────────────────────

...0,...2,...4,...5,...6,...7,...8,...9,...10,...11,...12
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
year,2013.0,2014,2015.0,2016,2017.0,2018.0,2019.0,2020.0,2021.0,2022.0
Unemployment rate,7.1,7,6.9,7,6.4,5.8,5.7,9.7,7.5,5.3
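
The four tables above are in wide form, with one column per year, whereas a regression needs one row per year with one column per variable. The following is a minimal sketch of that layout in base R, not the notebook's own wrangling code: the values are copied from the printed tables, and the names `gni_tidy`, `gni_train`, `gni_test` and the column names are illustrative assumptions.

```r
# Values copied from the four printed tables above. The object and column
# names are illustrative and do not appear in the original analysis.
gni_tidy <- data.frame(
  year            = 2013:2022,
  gni             = c(102.8, 105.1, 103.5, 104.4, 108.8, 111.4, 113.8, 107.3, 117.6, 123.1),
  cpi             = c(122.8, 125.2, 126.6, 128.4, 130.4, 133.4, 136.0, 137.0, 141.6, 151.2),
  current_account = c(-59759, -46278, -69569, -62553, -59998, -53141, -45183, -47578, -6749, -9105),
  unemployment    = c(7.1, 7.0, 6.9, 7.0, 6.4, 5.8, 5.7, 9.7, 7.5, 5.3)
)

# Split by year: 2013-2019 for training, 2020-2022 for testing.
gni_train <- subset(gni_tidy, year <= 2019)
gni_test  <- subset(gni_tidy, year >= 2020)

# Inspect the structure of the combined data frame.
str(gni_tidy)
```

With the data in this shape, each row is one observation (a year) and the three predictors sit alongside the response.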


Methods:

We will use publicly available economic data from Statistics Canada, with annual observations for 2013-2022. The variables of interest are real GNI, unadjusted CPI, the current account balance, and the aggregate unemployment rate (all annual data). The predictors are CPI, the current account balance and the unemployment rate; the response variable is real GNI. We will split the data into a training set (2013-2019) and a testing set (2020, 2021 and 2022). We will carry out regression analysis using the k-nearest neighbours algorithm, use 5-fold cross-validation on the training set to pick the optimal value of k, and then predict real GNI in 2020, 2021 and 2022. Finally, we will evaluate the accuracy of the model by computing its root mean squared prediction error (RMSPE) on the testing set, and use it to judge whether the three independent variables are strong predictors of GNI.
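
The procedure described above could be sketched as follows. This is a hand-rolled base R illustration under stated assumptions, not the implementation the project will necessarily use: the values are copied from the printed tables, predictors are standardized using the training rows, neighbours are unweighted, candidate values of k are limited to 1 through 5 because each cross-validation training fold holds only 5 or 6 rows, and the seed is arbitrary. The names `knn_predict`, `rmse`, `train` and `test` are hypothetical.

```r
# Training (2013-2019) and testing (2020-2022) rows, values copied from the
# tables printed above. All object names here are illustrative.
train <- data.frame(
  gni   = c(102.8, 105.1, 103.5, 104.4, 108.8, 111.4, 113.8),
  cpi   = c(122.8, 125.2, 126.6, 128.4, 130.4, 133.4, 136.0),
  ca    = c(-59759, -46278, -69569, -62553, -59998, -53141, -45183),
  unemp = c(7.1, 7.0, 6.9, 7.0, 6.4, 5.8, 5.7)
)
test <- data.frame(
  gni   = c(107.3, 117.6, 123.1),
  cpi   = c(137.0, 141.6, 151.2),
  ca    = c(-47578, -6749, -9105),
  unemp = c(9.7, 7.5, 5.3)
)
predictors <- c("cpi", "ca", "unemp")

# Predict GNI for each row of `new` as the mean GNI of its k nearest rows in
# `fit`, using Euclidean distance on predictors standardized with `fit`.
knn_predict <- function(fit, new, k) {
  mu  <- colMeans(fit[, predictors])
  sdv <- apply(fit[, predictors], 2, sd)
  x_fit <- scale(fit[, predictors], center = mu, scale = sdv)
  x_new <- scale(new[, predictors], center = mu, scale = sdv)
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_fit) - row)^2))
    mean(fit$gni[order(d)[seq_len(k)]])
  })
}

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 5-fold cross-validation on the training set. With 7 rows, two folds hold
# 2 rows and three folds hold 1 row.
set.seed(1)  # arbitrary seed so the fold assignment is reproducible
folds <- sample(rep(1:5, length.out = nrow(train)))
k_values <- 1:5
cv_rmse <- sapply(k_values, function(k) {
  mean(sapply(1:5, function(f) {
    held_out <- train[folds == f, ]
    kept     <- train[folds != f, ]
    rmse(held_out$gni, knn_predict(kept, held_out, k))
  }))
})
best_k <- k_values[which.min(cv_rmse)]

# Refit on the full training set and evaluate on 2020-2022.
test_pred <- knn_predict(train, test, best_k)
rmspe <- rmse(test$gni, test_pred)
```

One point to keep in mind when reading the RMSPE: a k-nearest neighbours prediction is an average of training responses, so it cannot exceed the largest training value of GNI (113.8), while the observed values for 2021 and 2022 are 117.6 and 123.1.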

Expected outcomes and significance
