**Random Forest Model Analysis**

Random Forest is a bagging-based ensemble learning method. It modifies how each decision tree is built: an ordinary decision tree selects the optimal feature among all m features at a node to split it into left and right subtrees, whereas a random forest first draws a random subset of n features (n ≤ m) at the node and then selects the optimal feature within that subset to make the split. This randomization decorrelates the trees and further improves the generalization ability of the model.
We use the Random Forest model to predict changes in the inflation rate and to test the stability of the algorithm.
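
As a minimal sketch of the node-level idea described above, the following base-R snippet scores a median split for each candidate feature by its reduction in squared error and picks the best one among n randomly drawn features. The names `split_gain`, `best_feature_at_node`, `X_toy`, and `y_toy` are hypothetical and used only for illustration; real implementations such as the randomForest package search over all split points, not just the median.

```r
# Reduction in squared error from splitting y at the median of one feature.
split_gain <- function(x, y) {
  left <- x <= median(x)
  if (all(left) || !any(left)) return(0)  # degenerate split: no gain
  sse <- function(v) sum((v - mean(v))^2)
  sse(y) - sse(y[left]) - sse(y[!left])
}

# Draw n candidate features at random, then keep the one with the largest gain.
best_feature_at_node <- function(X, y, n) {
  candidates <- sample(colnames(X), n)
  gains <- vapply(candidates, function(col) split_gain(X[[col]], y), numeric(1))
  list(candidates = candidates, chosen = names(which.max(gains)))
}

# Toy data (assumed for illustration): y depends only on x1.
X_toy <- data.frame(
  x1 = 1:10,
  x2 = rep(c(0, 1), 5),
  x3 = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6)
)
y_toy <- 2 * X_toy$x1

set.seed(1)
node <- best_feature_at_node(X_toy, y_toy, n = 2)      # random-forest style
full <- best_feature_at_node(X_toy, y_toy, n = ncol(X_toy))  # ordinary tree
node$candidates
full$chosen
```

Setting n equal to the number of columns reproduces the ordinary decision-tree behaviour, while a smaller n means that a strong feature is sometimes left out of the candidate set, which is what makes the individual trees differ from one another.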

**1.Feature Selection**

We first set a threshold for the correlation coefficient and a significance level for the correlation test. A feature is kept when the absolute value of its correlation with the dependent variable exceeds the threshold (0.2 by default) and the p-value of the correlation test is below the significance level (0.05 by default).
The features that pass both checks are then used in the analysis.

In [7]:
library(caret)

In [8]:
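# Keep the columns of X whose Pearson correlation with y is both strong
# (|r| > cor_thr) and statistically significant (p-value < p_thr).
select_X <- function(X, y, cor_thr = 0.2, p_thr = 0.05){
  select_cols <- NULL
  for(col in colnames(X)){
    # unlist() turns a one-column data frame or tibble into a plain vector
    cor_result <- cor.test(unlist(X[, col]), y)
    if(abs(cor_result$estimate) > cor_thr && cor_result$p.value < p_thr){
      select_cols <- c(select_cols, col)
    }
  }
  return(select_cols)
}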
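
To make the filter concrete, here is a small self-contained sketch that applies the same two conditions to toy data. The function name `select_by_correlation` and the data `X_toy` and `y_toy` are hypothetical; the `isTRUE()` guard is an added assumption so that a column whose correlation cannot be computed is simply skipped.

```r
# Hypothetical vectorised variant of the correlation filter (base R only).
select_by_correlation <- function(X, y, cor_thr = 0.2, p_thr = 0.05) {
  keep <- vapply(colnames(X), function(col) {
    res <- cor.test(X[[col]], y)
    # isTRUE() treats an NA estimate or p-value as "not selected".
    isTRUE(abs(res$estimate) > cor_thr && res$p.value < p_thr)
  }, logical(1))
  colnames(X)[keep]
}

# Toy data: `trend` drives y, `flip` alternates in sign and is nearly unrelated.
i <- 1:20
X_toy <- data.frame(trend = i, flip = rep(c(1, -1), 10))
y_toy <- 2 * i + sin(i)

selected <- select_by_correlation(X_toy, y_toy)
selected
```

Only the column that moves with the target passes both the strength and the significance check; the alternating column has a correlation well below 0.2 in absolute value and is dropped.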

**2.Parameter Optimization**

In random forest training, two important parameters are ntree and mtry: ntree is the number of trees (base learners) in the forest, and mtry is the number of variables randomly sampled as candidates at each split. We tune them with 10-fold cross-validation. Because the data are a time series, we hold out the most recent observations (test size 30) as the test set and train on the remaining data. A test_rate argument (0.25) is also defined, but the split in the code relies on the fixed test size. After tuning, we take the best parameter setting and judge the stability of the model by its R-squared.
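
The time-ordered hold-out can be sketched in isolation as follows. The helper `split_last_n` and the data frame `toy` are hypothetical names used for illustration. The sketch holds out exactly `test_size` rows, whereas a range written as `(nrow(df)-test_size):(nrow(df))` spans `test_size + 1` rows.

```r
# Hypothetical helper: split a time-ordered data frame into train and test,
# holding out exactly the last `test_size` rows.
split_last_n <- function(df, test_size = 30) {
  stopifnot(test_size > 0, test_size < nrow(df))
  test_index <- seq(nrow(df) - test_size + 1, nrow(df))
  list(
    train = df[-test_index, , drop = FALSE],
    test  = df[test_index, , drop = FALSE]
  )
}

# Toy series with 100 time-ordered observations.
toy <- data.frame(t = 1:100, y = sqrt(1:100))
parts <- split_last_n(toy, test_size = 30)

nrow(parts$train)    # 70
nrow(parts$test)     # 30
range(parts$test$t)  # 71 100
```

Keeping the test rows at the end of the sample means the model is always evaluated on observations that come after everything it was trained on.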

In [9]:
find_best_params <- function(df, X_cols, y_col, test_size=30, test_rate=0.25, cv=10){
  select_cols <- select_X(df[, X_cols], df[, y_col])
  set.seed(100)
  
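  # Hold out the most recent rows as the test set.
  # Note: this range spans test_size + 1 rows (31 with the default).
  test_index <- (nrow(df)-test_size):(nrow(df))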
  train <- df[-test_index, c(y_col, select_cols)]
  test <- df[test_index, c(y_col, select_cols)]
  
  trControl <- trainControl(
    method = "cv",
    number = cv,
    search = "grid"
  )
  
  tuneGrid <- expand.grid(.mtry = c(1: 5))
  
  rf_mtry <- train(
    formula(paste(y_col, "~.")),
    data = train,
    method = "rf",
    metric = "Rsquared",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    ntree = 100
  )

  # Best mtry
  best_mtry <- rf_mtry$bestTune$mtry
  print(paste("best_mtry：", best_mtry))
  
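  # Find the best ntree (the randomForest default is 500).
  # Note: each fit re-tunes over the full mtry grid, and r2 below averages the
  # cross-validated R-squared across all mtry values, not only best_mtry.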
  best_ntree <- -1
  best_r2 <- -1
  for (ntree in c(100, 300, 500, 800, 1000)) {
    set.seed(2020.1315)
    rf_maxtrees <- train(
      formula(paste(y_col, "~.")),
      data = train,
      method = "rf",
      metric = "Rsquared",
      tuneGrid = tuneGrid,
      trControl = trControl,
      importance = TRUE,
      # maxnodes = nodesize,
      ntree = ntree
    )
    r2 <- mean(rf_maxtrees$results$Rsquared)
    if(r2 > best_r2){
      best_r2 <- r2
      best_ntree <- ntree
    }
  }
  # Best ntree
  print(paste("best_ntreee：", best_ntree))
  
  return(list(
    train = train,
    test = test,
    best_mtry = best_mtry,
    best_ntree = best_ntree
  ))
}

In [10]:
get_best_model <- function(best_params, y_col, cv=5){
  tuneGrid <- expand.grid(.mtry = best_params[["best_mtry"]])
  
  trControl <- trainControl(
    method = "cv",
    number = cv,
    search = "grid"
  )
  
  best_rf <- train(
    formula(paste(y_col, "~.")),
    data = best_params[["train"]],
    method = "rf",
    metric = "Rsquared",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    ntree = best_params[["best_ntree"]]
  )
  return(best_rf)
}

In [11]:
cal_r2 <- function(preds, truths){
  ss_res <- sum((preds - truths)^2)
  ss_tot <- sum((truths - mean(truths))^2)
  return(1-ss_res/ss_tot)
}
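
Because this R-squared is computed on held-out data, it is not bounded below by zero. The short sketch below, which uses a hypothetical copy of the formula named `r2_oos` and made-up numbers, shows the three reference cases: perfect predictions, predicting the test-set mean, and predictions that are systematically off.

```r
# Same formula as cal_r2, repeated here so the example is self-contained.
r2_oos <- function(preds, truths) {
  1 - sum((preds - truths)^2) / sum((truths - mean(truths))^2)
}

truths <- c(1, 2, 3, 4, 5)

r2_oos(truths, truths)                 # perfect predictions: 1
r2_oos(rep(mean(truths), 5), truths)   # always predict the mean: 0
r2_oos(truths + 10, truths)            # biased by +10: 1 - 500/10 = -49
```

A strongly negative value therefore signals predictions that are further from the truth than a constant forecast at the test-set mean would be.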

**3.Data Analysis**

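We load the data and use the model to predict CPI and PPI inflation, measured as the 12-month log difference of each index. For each lag setting i from 1 to 12, we rebuild the lagged feature set, tune and refit the model, and record the out-of-sample R-squared on the held-out test set; the results table at the end displays the first 10 of these values.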
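
The lagged design can be illustrated on toy data. The sketch below assumes the intent is to stack lags 1 through i side by side, which is an assumption: the notebook's inner loop indexes each block by i, so it appends the same lag-i slice i times. The helper `make_lagged` and the matrix `X_toy` are hypothetical names.

```r
# Hypothetical helper: stack lagged copies of X side by side.
# `offset` is the number of leading rows lost by the 12-month difference of y.
make_lagged <- function(X, lags, offset = 12) {
  X <- as.matrix(X)
  n <- nrow(X)
  blocks <- lapply(lags, function(j) {
    block <- X[(offset + 1 - j):(n - j), , drop = FALSE]
    colnames(block) <- paste0(colnames(X), "_lag", j)
    block
  })
  do.call(cbind, blocks)
}

# Toy predictors: 20 periods, two columns.
X_toy <- cbind(a = 1:20, b = 101:120)
lagged <- make_lagged(X_toy, lags = 1:2)

dim(lagged)   # 8 rows (periods 13 to 20), 4 columns
lagged[1, ]   # values from period 12 (lag 1) and period 11 (lag 2)
```

Each row of the result lines up with one observation of the 12-month log difference, so the first row pairs the target for period 13 with predictors from periods 12 and 11.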

In [12]:
load(url("https://github.com/zhentaoshi/Econ5821/raw/main/data_example/dataset_inf.Rdata"))

In [13]:
y_cpi <- diff(log(cpi$CPI), 12)
cpi_r2s <- NULL
for(i in 1:12){
  cpi_data <- y_cpi
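  # Note: the slice below is indexed by i (not j), so the same lag-i block
  # of X is appended i times.
  for(j in 1:i){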
    cpi_data <- cbind(cpi_data, X[(13-i):(nrow(X)-i), -1])
  }
  colnames(cpi_data) <- c("CPI", paste("var_", 1:(ncol(cpi_data)-1), sep=""))
  cpi_best_params <- find_best_params(cpi_data, X_cols = colnames(cpi_data)[-1], y_col=colnames(cpi_data)[1])
  best_rf_cpi <- get_best_model(cpi_best_params, y_col=colnames(cpi_data)[1])
  cpi_preds <- predict(best_rf_cpi, cpi_best_params[["test"]])
  cpi_r2 <- cal_r2(cpi_preds, cpi_best_params[["test"]][, 1])
  cpi_r2s <- c(cpi_r2s, cpi_r2)
  print(paste("CPI prediction R2：", cpi_r2))
}

[1] "best_mtry： 5"
[1] "best_ntreee： 800"
[1] "CPI prediction R2： -14.0580513706206"
[1] "best_mtry： 5"
[1] "best_ntreee： 500"
[1] "CPI prediction R2： -12.8304039243986"
[1] "best_mtry： 5"
[1] "best_ntreee： 1000"
[1] "CPI prediction R2： -11.9095027941131"
[1] "best_mtry： 5"
[1] "best_ntreee： 1000"
[1] "CPI prediction R2： -11.2734572348134"
[1] "best_mtry： 5"
[1] "best_ntreee： 800"
[1] "CPI prediction R2： -10.702192044378"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "CPI prediction R2： -8.0362354314748"
[1] "best_mtry： 5"
[1] "best_ntreee： 500"
[1] "CPI prediction R2： -4.75756256672969"
[1] "best_mtry： 3"
[1] "best_ntreee： 800"
[1] "CPI prediction R2： -1.79638357891794"
[1] "best_mtry： 3"
[1] "best_ntreee： 1000"
[1] "CPI prediction R2： -1.29385469173532"
[1] "best_mtry： 4"
[1] "best_ntreee： 800"
[1] "CPI prediction R2： -0.786131483889147"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "CPI prediction R2： -0.616549789500145"
[1] "best_mtry： 5"
[1] "best_ntreee： 800"
[1] "CPI predictio

In [14]:
y_ppi <- diff(log(ppi$PPI), 12)
ppi_r2s <- NULL
for(i in 1:12){
  ppi_data <- y_ppi
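  # Note: the slice below is indexed by i (not j), so the same lag-i block
  # of X is appended i times.
  for(j in 1:i){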
    ppi_data <- cbind(ppi_data, X[(13-i):(nrow(X)-i), -1])
  }
  colnames(ppi_data) <- c("PPI", paste("var_", 1:(ncol(ppi_data)-1), sep=""))
  ppi_best_params <- find_best_params(ppi_data, X_cols = colnames(ppi_data)[-1], y_col=colnames(ppi_data)[1])
  best_rf_ppi <- get_best_model(ppi_best_params, y_col=colnames(ppi_data)[1])
  ppi_preds <- predict(best_rf_ppi, ppi_best_params[["test"]])
  ppi_r2 <- cal_r2(ppi_preds, ppi_best_params[["test"]][, 1])
  ppi_r2s <- c(ppi_r2s, ppi_r2)
  print(paste("PPI prediction R2：", ppi_r2))
}

[1] "best_mtry： 4"
[1] "best_ntreee： 300"
[1] "PPI prediction R2： -1.34777199130405"
[1] "best_mtry： 4"
[1] "best_ntreee： 300"
[1] "PPI prediction R2： -1.31837536918131"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： -1.48824853272104"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： -1.7410999889542"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： -1.87007434883846"
[1] "best_mtry： 2"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： -1.12134565661876"
[1] "best_mtry： 5"
[1] "best_ntreee： 800"
[1] "PPI prediction R2： -1.07945347582219"
[1] "best_mtry： 4"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： -0.56803832181577"
[1] "best_mtry： 4"
[1] "best_ntreee： 800"
[1] "PPI prediction R2： 0.00836631695697088"
[1] "best_mtry： 5"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： 0.29715048795873"
[1] "best_mtry： 5"
[1] "best_ntreee： 800"
[1] "PPI prediction R2： 0.418067166795376"
[1] "best_mtry： 3"
[1] "best_ntreee： 1000"
[1] "PPI predict

In [15]:
ppi_data <- cbind(ppi[, -1], X[, -1])
ppi_best_params <- find_best_params(ppi_data, X_cols = colnames(ppi_data)[-1], colnames(ppi_data)[1])
best_rf_ppi <- get_best_model(ppi_best_params, y_col="PPI")
ppi_preds <- predict(best_rf_ppi, ppi_best_params[["test"]])
ppi_r2 <- cal_r2(ppi_preds, ppi_best_params[["test"]]$PPI)
print(paste("PPI prediction R2：", ppi_r2))

[1] "best_mtry： 5"
[1] "best_ntreee： 1000"
[1] "PPI prediction R2： 0.875107915614231"


In [16]:
results <- data.frame(
  index = 1:12,
  cpi_r2 = cpi_r2s,
  ppi_r2 = ppi_r2s
)
results

| index | cpi_r2 | ppi_r2 |
|---|---|---|
| 1 | -14.0580514 | -1.347771991 |
| 2 | -12.8304039 | -1.318375369 |
| 3 | -11.9095028 | -1.488248533 |
| 4 | -11.2734572 | -1.741099989 |
| 5 | -10.702192 | -1.870074349 |
| 6 | -8.0362354 | -1.121345657 |
| 7 | -4.7575626 | -1.079453476 |
| 8 | -1.7963836 | -0.568038322 |
| 9 | -1.2938547 | 0.008366317 |
| 10 | -0.7861315 | 0.297150488 |


**4.Conclusion**

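Unfortunately, the results show that the Random Forest model is unstable and that its predictions of CPI and PPI fall short of what we hoped for. The out-of-sample R-squared is negative for every CPI setting in the table, improving from about -14.06 at index 1 to -0.79 at index 10. For PPI it is negative through index 8 and only turns positive afterwards (0.008 at index 9 and 0.297 at index 10). A negative R-squared means that the model predicts worse than the test-set mean. We need to pay more attention to parameter optimization and to evaluate the data with other models.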