Group 125 - Christian Algaranaz, Krish Arora, Mike Min, Musa Sayeed

# How to Pick A Top-Performing Investment Portfolio Based On Different Economic Assumptions About the American Economy

## Introduction
___

The investment world is a vast place with countless financial products to choose from! Our project focuses on investment portfolios (bundles) composed of different equity and bond mutual funds; a mutual fund is an investment vehicle that pools money from many investors to buy assets such as stocks or bonds. More specifically, we explore the annual performance of 6 different portfolios, each with a unique mutual fund composition (see diagram 1 below for reference), from 1997-2021. Alongside these returns, we use annual U.S. inflation, unemployment, and real GDP growth rate data for the same time span. These explanatory variables are then used to answer our project question: **Which investment portfolio is expected to earn the highest return under different US inflation, unemployment, and real GDP growth expectations?** We focus on U.S. data since the American stock market accounts for ~60% of the world's total market capitalization.

The rationale behind choosing annual inflation, GDP growth, and unemployment data as predictors is that these indicators are key economic factors that influence any economy. Changes in the economy are reflected in financial markets, which ultimately affects investment portfolios of every type.

#*****A NEW DIAGRAM IS NEEDED IF WE DECIDE TO GO WITH THE NEW PORTFOLIOS (CHANGING THIS TEXT WOULD ALSO BE REQUIRED)*****

<img src="https://i.imgur.com/oOdGM3V.png"/>


### Loading and Merging the Data
___

In [1]:
library(tidyverse) #loads the tidyverse packages, including readr (read_csv) and dplyr (filter, select, summarize)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
#importing inflation rate data
inflation_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/inflation_data.csv"
inflation_path <- "inflation_data.csv"
download.file(inflation_url, destfile = inflation_path)
inflation_data <- read_csv(inflation_path, skip=11, col_names = c("date","annual_inflation"))

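#reformatting the date values to a four-digit year (read_csv already parsed this column as a date) and filtering years to only 1997-2021
#the year is stored as text, so it is converted to an integer for the numeric comparison
inflation_data$date <- format(inflation_data$date, '%Y')
inflation_data_c <- filter(inflation_data, between(as.integer(date), 1997, 2021))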

#renaming date column into year
names(inflation_data_c)[names(inflation_data_c) == 'date'] <- 'year'

#c stands for cleaned
inflation_data_c

Rows: 33 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): annual_inflation
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


year,annual_inflation
<chr>,<dbl>
1997,2.3376899
1998,1.5522791
1999,2.1880272
2000,3.3768573
2001,2.8261711
2002,1.5860316
2003,2.270095
2004,2.6772367
2005,3.3927468
2006,3.2259441


In [3]:
#summarizing our raw inflation data

inflation_parameters <- inflation_data_c |>
summarize(inflation_mean = mean(annual_inflation), inflation_med = median(annual_inflation), inflation_sd = sd(annual_inflation))

inflation_parameters

inflation_mean,inflation_med,inflation_sd
<dbl>,<dbl>,<dbl>
2.216766,2.188027,1.111635


This table suggests that over the 1997-2021 period, mean and median annual inflation in the United States were both ~2.2%, with a standard deviation of ~1.1 percentage points.
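A confidence interval for each mean can be sketched with a t-interval. The block below is a minimal illustration, not part of our pipeline: `mean_ci` is a hypothetical helper, the interval assumes roughly independent and normally distributed yearly values (a strong assumption for economic time series), and the vector holds only the ten inflation values displayed above rather than the full 25-year series.

```r
# Hypothetical helper: t-based confidence interval for the mean of a numeric vector.
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  c(lower = mean(x) - t_crit * se, mean = mean(x), upper = mean(x) + t_crit * se)
}

# Illustration only: the ten inflation values displayed above (1997-2006).
inflation_sample <- c(2.3376899, 1.5522791, 2.1880272, 3.3768573, 2.8261711,
                      1.5860316, 2.270095, 2.6772367, 3.3927468, 3.2259441)
inflation_ci <- mean_ci(inflation_sample)
print(round(inflation_ci, 2))
```

On the cleaned data, the same helper could be called as `mean_ci(inflation_data_c$annual_inflation)`, and likewise for the other variables.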

In [4]:
#importing gdp growth rate data
real_gdp_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/real_gdp_data.csv"
real_gdp_path <- "real_gdp_data.csv"
download.file(real_gdp_url, destfile = real_gdp_path)
real_gdp_data <- read_csv(real_gdp_path, skip=3, col_names = c("year","real_gdp_growth_rate","type"))

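#reformatting the year values to four digits (the format string expects two-digit years with a leading apostrophe) and filtering years to only 1997-2021
#the year is stored as text, so it is converted to an integer for the numeric comparison
real_gdp_data$year <- format(as.Date(real_gdp_data$year, "'%y"),'%Y')
gdp_data_c <- select(real_gdp_data, year, real_gdp_growth_rate) |>
                filter(between(as.integer(year), 1997, 2021))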

#clarification of gdp growth rate unit (percentage)
names(gdp_data_c)[names(gdp_data_c) == 'real_gdp_growth_rate'] <- 'real_gdp_growth_rate_percentage'

#c stands for cleaned
gdp_data_c

Rows: 32 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): year, type
dbl (1): real_gdp_growth_rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


year,real_gdp_growth_rate_percentage
<chr>,<dbl>
1997,4.4
1998,4.5
1999,4.8
2000,4.1
2001,1.0
2002,1.7
2003,2.8
2004,3.9
2005,3.5
2006,2.8


In [5]:
#Summarizing our raw real GDP data

real_gdp_parameters <- gdp_data_c |>
summarize(real_gdp_mean = mean(real_gdp_growth_rate_percentage), real_gdp_med = median(real_gdp_growth_rate_percentage), real_gdp_sd = sd(real_gdp_growth_rate_percentage))

real_gdp_parameters

real_gdp_mean,real_gdp_med,real_gdp_sd
<dbl>,<dbl>,<dbl>
2.312,2.3,2.043347


This table suggests that over the 1997-2021 period, mean and median real GDP growth in the United States were both ~2.3%. This average growth rate is typical for developed economies.

In [None]:
#importing annual unemployment rate data
unemployment_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/unemployment_rates_data.csv"
unemployment_path <- "unemployment_rates_data.csv"
download.file(unemployment_url, destfile = unemployment_path)
unemployment_data <- read_csv(unemployment_path, skip=1)

#filtering for USA unemployment data, selecting for TIME and Value columns, and then filtering years to only 1997-2021
unemployment_data_c <- filter(unemployment_data, LOCATION == "USA") |>
                    select(TIME, Value) |>
                    filter(between(TIME, 1997, 2021))

#renaming TIME column into year and Value column into annual_unemployment_rate
names(unemployment_data_c)[names(unemployment_data_c) == 'TIME'] <- 'year'
names(unemployment_data_c)[names(unemployment_data_c) == 'Value'] <- 'annual_unemployment_rate'

#c stands for cleaned
unemployment_data_c

New names:
• `` -> `...9`
• `` -> `...10`
• `` -> `...11`
• `` -> `...12`
• `` -> `...13`
• `` -> `...14`
• `` -> `...15`
• `` -> `...16`
• `` -> `...17`
• `` -> `...18`
• `` -> `...19`
• `` -> `...20`
• `` -> `...21`
• `` -> `...22`
• `` -> `...23`
• `` -> `...24`
• `` -> `...25`
• `` -> `...26`
• `` -> `...27`
• `` -> `...28`
• `` -> `...29`
• `` -> `...30`
• `` -> `...31`
• `` -> `...32`
• `` -> `...33`
• `` -> `...34`
• `` -> `...35`
• `` -> `...36`
• `` -> `...37`
• `` -> `...38`
• `` -> `...39`
• `` -> `...40`
• `` -> `...41`
• `` -> `...42`
• `` -> `...43`
• `` -> `...44`
• `` -> `...45`
• `` -> `

In [None]:
#Summarizing our raw unemployment data

unemployment_parameters <- unemployment_data_c |>
summarize(unemployment_mean = mean(annual_unemployment_rate), unemployment_med = median(annual_unemployment_rate), unemployment_sd = sd(annual_unemployment_rate))

unemployment_parameters

This table indicates that the mean unemployment rate in the US has been ~5.8%, which is a fairly healthy number: it sits only slightly above the US natural rate of unemployment, which is estimated to be between 4.5-5.5%. The natural rate of unemployment is the rate of unemployment that is consistent with a stable price level and a sustainable level of output (GDP) in the long run. It is often identified with the NAIRU: the non-accelerating inflation rate of unemployment. The differential between the 4.5-5.5% range and the ~5.8% mean can be attributed mainly to cyclical unemployment, since frictional unemployment is already part of the natural rate.

In [None]:
#importing income portfolio return data, selecting for year, income_portfolio_return, and filtering years to only 1997-2021
income_portfolio_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/income_portfolio_data.csv"
income_portfolio_path <- "income_portfolio_data.csv"
download.file(income_portfolio_url, destfile = income_portfolio_path)
income_portfolio_data_c <- read_csv(income_portfolio_path, skip = 1, col_names= c("year","X2","income_portfolio_return")) |>
                            select(year, income_portfolio_return) |>
                            filter(between(year, 1997, 2021))

#deleting the percentage unit in the column income_portfolio_return and renaming the income_portfolio_return into income_portfolio_return_percentage
income_portfolio_data_c$income_portfolio_return = as.numeric(gsub("[\\%,]", "", income_portfolio_data_c$income_portfolio_return))
names(income_portfolio_data_c)[names(income_portfolio_data_c) == 'income_portfolio_return'] <- 'income_portfolio_return_percentage'

#c stands for cleaned
income_portfolio_data_c

In [None]:
#Summarizing our raw income portfolio data

income_portfolio_parameters <- income_portfolio_data_c |>
summarize(income_portfolio_mean = mean(income_portfolio_return_percentage), income_portfolio_med = median(income_portfolio_return_percentage), income_portfolio_sd = sd(income_portfolio_return_percentage))

income_portfolio_parameters

This table indicates that the mean income portfolio return for the 1997-2021 period is ~6%, which is largely consistent with the historical US stock market real (adjusted for inflation) average returns of 6-7% annually, albeit on the lower end of the range.
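The percent-sign clean-up used in the import cells can be isolated into a small function. This is a minimal sketch: `parse_percent` is a hypothetical helper name, and the input strings are made up for illustration.

```r
# Hypothetical helper mirroring the gsub() clean-up used in the import cells:
# drop percent signs and thousands separators, then convert to numeric.
parse_percent <- function(x) {
  as.numeric(gsub("[%,]", "", x))
}

# Made-up return strings, for illustration only.
raw_returns <- c("6.25%", "-3.10%", "1,204.5%", "0%")
clean_returns <- parse_percent(raw_returns)
print(clean_returns)
```

Wrapping the conversion in one function would let each portfolio column be cleaned with a single call instead of repeating the same `gsub()` expression.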

In [None]:
#importing sixty/forty and forty/sixty portfolio data
sixty_forty_and_forty_sixty_portfolio_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/sixty_forty_and_forty_sixty_portfolio_data.csv"
                                              
sf_and_fs_path <- "sixty_forty_and_forty_sixty_portfolio_data.csv"
download.file(sixty_forty_and_forty_sixty_portfolio_url, destfile = sf_and_fs_path)

#selecting for year, 60/40_portfolio_return, 40/60_portfolio_return, and filtering years to only 1997-2021
sf_fs_portfolio_data_c <- read_csv(sf_and_fs_path, skip = 3, col_names= c("year","X2","60/40_portfolio_return","X3","40/60_portfolio_return")) |>
                                                select(year, "60/40_portfolio_return", "40/60_portfolio_return") |>
                                                filter(between(year, 1997, 2021))

#deleting the percentage unit in the column 60/40_portfolio_return and renaming the 60/40_portfolio_return into 60/40_portfolio_return_percentage
sf_fs_portfolio_data_c$'60/40_portfolio_return' = as.numeric(gsub("[\\%,]", "", sf_fs_portfolio_data_c$'60/40_portfolio_return'))
names(sf_fs_portfolio_data_c)[names(sf_fs_portfolio_data_c) == '60/40_portfolio_return'] <- '60/40_portfolio_return_percentage'

#deleting the percentage unit in the column 40/60_portfolio_return and renaming the 40/60_portfolio_return into 40/60_portfolio_return_percentage
sf_fs_portfolio_data_c$'40/60_portfolio_return' = as.numeric(gsub("[\\%,]", "", sf_fs_portfolio_data_c$'40/60_portfolio_return'))
names(sf_fs_portfolio_data_c)[names(sf_fs_portfolio_data_c) == '40/60_portfolio_return'] <- '40/60_portfolio_return_percentage'

#c stands for cleaned
sf_fs_portfolio_data_c

In [None]:
#summarizing our raw 60/40 and 40/60 portfolio data

sf_portfolio_parameters <- sf_fs_portfolio_data_c |>
rename(Sixty_Forty_portfolio = "60/40_portfolio_return_percentage") |>
summarize(sf_portfolio_mean = mean(Sixty_Forty_portfolio), sf_portfolio_med = median(Sixty_Forty_portfolio), sf_portfolio_sd = sd(Sixty_Forty_portfolio))

sf_portfolio_parameters

fs_portfolio_parameters <- sf_fs_portfolio_data_c |>
rename(Forty_Sixty_portfolio = "40/60_portfolio_return_percentage") |>
summarize(fs_portfolio_mean = mean(Forty_Sixty_portfolio), fs_portfolio_med = median(Forty_Sixty_portfolio), fs_portfolio_sd = sd(Forty_Sixty_portfolio))

fs_portfolio_parameters

These tables indicate that the mean 60/40 and 40/60 portfolio returns for the 1997-2021 period are 8.7% and 7.4% respectively. These results indicate that the 60/40 portfolio tends to outperform the historical US stock market real average returns of 6-7% annually. The 40/60 portfolio's average returns also outperform the historical averages, though only slightly.

In [None]:
#downloading growth, moderate, conservative portfolio data
gmc_url <- "https://raw.githubusercontent.com/Arioniums/DSCI_100_125/main/growth_moderate_conservative_portfolios.csv"
gmc_path <- "growth_moderate_conservative_portfolios.csv"
download.file(gmc_url, destfile = gmc_path)

#importing growth, moderate, conservative portfolio data, selecting for year, growth_portfolio_return, moderate_portfolio_return, conservative_portfolio_return, and filter to only 1997-2021
gmc_portfolios_data_c <- read_csv(gmc_path, skip = 3, col_names= c("year","X2","growth_portfolio_return","X3","moderate_portfolio_return","X4","conservative_portfolio_return")) |>
                                                select(year, growth_portfolio_return, moderate_portfolio_return, conservative_portfolio_return) |>
                                                filter(between(year, 1997, 2021))

#deleting the percentage unit in the column growth_portfolio_return and renaming the growth_portfolio_return into growth_portfolio_return_percentage
gmc_portfolios_data_c$'growth_portfolio_return' = as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'growth_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'growth_portfolio_return'] <- 'growth_portfolio_return_percentage'

#deleting the percentage unit in the column moderate_portfolio_return and renaming the moderate_portfolio_return into moderate_portfolio_return_percentage
gmc_portfolios_data_c$'moderate_portfolio_return' = as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'moderate_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'moderate_portfolio_return'] <- 'moderate_portfolio_return_percentage'

#deleting the percentage unit in the column conservative_portfolio_return and renaming the conservative_portfolio_return into conservative_portfolio_return_percentage
gmc_portfolios_data_c$'conservative_portfolio_return' = as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'conservative_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'conservative_portfolio_return'] <- 'conservative_portfolio_return_percentage'

#c stands for cleaned
gmc_portfolios_data_c

In [None]:
#summarizing our raw growth, moderate, and conservative portfolios data

growth_portfolio_parameters <- gmc_portfolios_data_c |>
summarize(growth_portfolio_mean = mean(growth_portfolio_return_percentage), growth_portfolio_med = median(growth_portfolio_return_percentage), growth_portfolio_sd = sd(growth_portfolio_return_percentage))
growth_portfolio_parameters

moderate_portfolio_parameters <- gmc_portfolios_data_c |>
summarize(moderate_portfolio_mean = mean(moderate_portfolio_return_percentage), moderate_portfolio_med = median(moderate_portfolio_return_percentage), moderate_portfolio_sd = sd(moderate_portfolio_return_percentage))
moderate_portfolio_parameters

conservative_portfolio_parameters <- gmc_portfolios_data_c |>
summarize(conservative_portfolio_mean = mean(conservative_portfolio_return_percentage), conservative_portfolio_med = median(conservative_portfolio_return_percentage), conservative_portfolio_sd = sd(conservative_portfolio_return_percentage))
conservative_portfolio_parameters


These tables indicate that the mean growth, moderate, and conservative portfolio returns for the 1997-2021 period are ~9%, ~8%, and ~7%, respectively. These results indicate that the growth portfolio tends to outperform the historical US stock market real average returns of 6-7% annually; the same applies to the moderate portfolio. On the other hand, the conservative portfolio's average returns tend to sit at the top of the historical range.
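The three summaries above repeat the same mean, median, and standard deviation calculation for each column. As a sketch of how this could be done in one pass, the block below uses base R so that it runs without extra packages; `summarize_columns` is a hypothetical helper and the returns are made up.

```r
# Hypothetical helper: mean, median and standard deviation for every numeric column.
summarize_columns <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  t(sapply(numeric_cols, function(col) {
    c(mean = mean(col), median = median(col), sd = sd(col))
  }))
}

# Made-up returns (in percent) for three years, for illustration only.
toy_returns <- data.frame(
  growth_portfolio_return_percentage = c(10, -2, 16),
  moderate_portfolio_return_percentage = c(8, 0, 10),
  conservative_portfolio_return_percentage = c(5, 2, 5)
)
portfolio_summary <- summarize_columns(toy_returns)
print(portfolio_summary)
```

In the tidyverse style used in this notebook, `summarize(across(...))` from dplyr plays a similar role.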

In [None]:
#merging all cleaned datasets
project_ds1 <- merge(
                    x = inflation_data_c, 
                    y = gdp_data_c,
                    by = "year")
project_ds2 <- merge(
                    x = project_ds1, 
                    y = unemployment_data_c,
                    by = "year")
project_ds3 <- merge(
                    x = project_ds2, 
                    y = income_portfolio_data_c,
                    by = "year")
project_ds4 <- merge(
                    x = project_ds3, 
                    y = sf_fs_portfolio_data_c,
                    by = "year")
project_ds_c <- merge(
                    x = project_ds4, 
                    y = gmc_portfolios_data_c,
                    by = "year")

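#finding max portfolio return value and name
#columns 5 to 10 hold the six portfolio returns; ties.method = "first" keeps the result reproducible if two returns tie
project_ds_c$max_portfolio_return_value <- do.call(pmax, project_ds_c[5:10])
project_ds_c$max_portfolio_return_name <- colnames(project_ds_c[5:10])[max.col(project_ds_c[5:10], ties.method = "first")]

#the dataset before the selection step below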
project_ds_c

In [None]:
#select for year, annual_inflation, real_gdp_growth_rate_percentage, annual_unemployment_rate, max_portfolio_return_value, max_portfolio_return_name
project_ds <- select(project_ds_c, year, annual_inflation, real_gdp_growth_rate_percentage, annual_unemployment_rate, max_portfolio_return_value, max_portfolio_return_name)

#Final data set, years will be deleted for training and testing
project_ds

#Our table shows the top-performing investment portfolio for each year from 1997 to 2021. Each name was selected by comparing the returns of the 6 different portfolios.
#The very last column (max_portfolio_return_name) shows the name of the portfolio with the maximum return rate.

#All of the variables are yearly figures, measured in percentages.

#For the classification algorithm, we will use initial_split to split the data into a training set used to build the model and a testing set used to check its predictions.
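The row-wise "best portfolio" step above can be illustrated on a toy table. This is a sketch with made-up returns and only three of the six portfolios; it selects the return columns by name rather than by position, which is an alternative to the `5:10` indexing used in the notebook.

```r
# Made-up returns (in percent) for three years, for illustration only.
toy <- data.frame(
  year = c(2001, 2002, 2003),
  income_portfolio_return_percentage = c(4.0, 6.5, 3.0),
  growth_portfolio_return_percentage = c(9.0, -5.0, 12.0),
  conservative_portfolio_return_percentage = c(5.0, 2.0, 4.0)
)
return_cols <- setdiff(names(toy), "year")

# pmax() takes the row-wise maximum across the return columns.
toy$max_portfolio_return_value <- do.call(pmax, toy[return_cols])
# max.col() gives the position of that maximum; ties.method = "first" avoids random tie-breaking.
toy$max_portfolio_return_name <- return_cols[max.col(toy[return_cols], ties.method = "first")]
print(toy[c("year", "max_portfolio_return_value", "max_portfolio_return_name")])
```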

In [None]:
final_dataframe<-collect(project_ds)
write_csv(final_dataframe,"final_data.csv")


## Methods
___

* We will conduct our data analysis by first splitting our data into **training data** and **testing data**. The training data consists of observations whose class is known and serves as the basis for our classifier. The classifier then predicts the class of each observation in the testing data, whose true classes are withheld and used only to check the predictions.
* We’ll be using the **K-Nearest Neighbor Classification algorithm** from the parsnip R package in tidymodels to make predictions.
    * We will use cross-validation to choose the best value of k.
    * We will define a model specification for K-nearest neighbours and fit the model by passing the model specification and the training data to a fit function.
    * In the same step, we will specify the target variable (i.e., the top-performing investment portfolio) and the predictors we are going to use (real GDP growth, inflation, and unemployment).
* Finally, we’ll use the predict function to predict the best investment portfolio.
    * We will provide two distinct examples to illustrate how our model works.
* We will visualize the result by:
    * Plotting a line graph with returns on the y-axis and time on the x-axis, coloured by portfolio.
        * This step requires us to first create a dataset with three columns: year, return, and portfolio name.
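To make the idea behind K-nearest neighbours concrete, here is a minimal base R sketch of the classification rule: standardize the predictors, find the k closest training observations, and take a majority vote. It is not the kknn implementation used through tidymodels; `knn_predict` is a hypothetical helper and all data values are made up.

```r
# Hypothetical helper: classify one new point by majority vote among its k nearest neighbours.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Standardize predictors with the training mean and standard deviation.
  centers <- colMeans(train_x)
  scales <- apply(train_x, 2, sd)
  train_scaled <- scale(train_x, center = centers, scale = scales)
  new_scaled <- (new_x - centers) / scales
  # Euclidean distance from the new point to every training point.
  distances <- sqrt(rowSums(sweep(train_scaled, 2, new_scaled)^2))
  nearest <- order(distances)[seq_len(k)]
  # Majority vote among the k nearest labels (ties go to the alphabetically first label).
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up training data, for illustration only.
train_x <- cbind(
  annual_inflation = c(2.0, 2.5, 1.5, 3.0, 2.2, 1.8),
  real_gdp_growth_rate_percentage = c(4.5, 4.0, -2.0, -1.0, 3.8, -2.5),
  annual_unemployment_rate = c(4.5, 5.0, 9.0, 8.5, 4.8, 9.5)
)
train_y <- c("growth", "growth", "income", "income", "growth", "income")

prediction <- knn_predict(train_x, train_y, new_x = c(2.1, 4.2, 4.6), k = 3)
print(prediction)
```

Standardizing first matters because the three predictors have different spreads; without it, the variable with the largest range would dominate the distance.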

## Expected outcomes and significance
___

Our findings would demonstrate the relationship each variable has with each portfolio, associating the impact of inflation, real GDP, and unemployment on the aforementioned investment portfolios. Having this association will provide us with the ability to make educated guesses regarding the future performance of a given portfolio.

We expect to find which investment portfolio (out of six different mutual fund portfolios) would provide the highest returns for a given set of assumptions/expectations about annual US unemployment rates, inflation rates, and real GDP growth rates; these are our explanatory variables of choice. After building the KNN classification algorithm that suggests the best portfolio under specific inflation, real GDP, and unemployment assumptions, we will be able to plot the return of each of the six portfolios against the year variable; this process involves effective data visualization. The ability to predict the best-performing investment portfolio raises many further questions. For instance, are there other portfolios that would perform better in different economic environments, and what composition would yield the highest returns? What would our model predict if we used portfolios built from more complex financial products beyond mutual funds? Additionally, even though project constraints won't allow us to pursue this opportunity, performing an OLS regression analysis would have been very insightful and complementary to our project. Performing six different regression analyses, one for each portfolio's returns against our explanatory variables, would have allowed us to estimate the relationships between US real GDP growth, unemployment rate, and inflation rate and our individual portfolios. Having these estimates would have given us the ability to make quantitative predictions about the future returns of any given portfolio, rather than only a qualitative prediction.

# KNN Classification Algorithm

In [None]:
#before continuing, the following packages need to be loaded
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [None]:
#reading our finalized data before implementing KNN classification algorithm
portfolio_data <- read_csv("final_data.csv")|>
mutate(max_portfolio_return_name = as_factor(max_portfolio_return_name))|>
select(-year, -max_portfolio_return_value)
portfolio_data

In [None]:
#exploring our data:

num_obs <- nrow(portfolio_data)

portfolio_data |>
group_by(max_portfolio_return_name) |>
summarize(count = n(), percentage = n() / num_obs * 100)

#Word of caution: the data seems to have too many "growth_portfolio" labels. Thus, rebalancing could be attempted:

In [None]:
#performing rebalancing:
install.packages("themis")
library (themis)

ups_recipe <- recipe(max_portfolio_return_name ~ ., data = portfolio_data) |>
step_upsample(max_portfolio_return_name, over_ratio = 1, skip = FALSE) |>
prep()
ups_recipe

upsampled_portfolio <- bake(ups_recipe, portfolio_data)

upsampled_portfolio |>
group_by(max_portfolio_return_name) |>
summarize(n = n())

In [None]:
#comparing upsampled vs normal data:

print(upsampled_portfolio)
print (portfolio_data)

As seen above, upsampling would distort our data: it balances the labels by replicating existing rows of the less frequent classes, so the added data points are copies rather than new yearly observations. This, in turn, would harm the interpretation of our real-life prediction accuracy, since the added data points do not correspond to any real economic environment. **Therefore, for the rest of the classification, we won't be using upsampled data.**
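To show what random upsampling does to a small data set, here is a minimal base R sketch. It is not the themis implementation; `upsample` is a hypothetical helper, the seed is arbitrary, and the data are made up.

```r
set.seed(1)  # arbitrary seed so the resampling is repeatable

# Made-up labels, for illustration only: one class dominates.
toy <- data.frame(
  annual_inflation = c(2.1, 1.6, 3.4, 2.8, 2.3),
  label = c("growth", "growth", "growth", "growth", "income")
)

# Hypothetical helper: keep every original row and add random copies
# until each class is as large as the largest class.
upsample <- function(df, label_col) {
  target <- max(table(df[[label_col]]))
  groups <- split(df, df[[label_col]])
  resampled <- lapply(groups, function(g) {
    extra <- sample(seq_len(nrow(g)), target - nrow(g), replace = TRUE)
    rbind(g, g[extra, , drop = FALSE])
  })
  do.call(rbind, resampled)
}

upsampled_toy <- upsample(toy, "label")
print(table(upsampled_toy$label))
# The number of distinct rows does not change: every added row is a copy.
print(nrow(unique(upsampled_toy)) == nrow(unique(toy)))
```

With a single "income" year in this toy table, the balanced data set contains that same year four times, which is why the added rows carry no new information.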

In [None]:
set.seed(9999) #ensuring reproducibility
options(repr.plot.height = 5, repr.plot.width = 6)

#splitting with normal data:
portfolio_split <- initial_split(portfolio_data, prop = 0.75, strata = max_portfolio_return_name)
portfolio_train <- training(portfolio_split)
portfolio_test <- testing(portfolio_split) 


#exploring distribution of labels in each split:

#percentages are computed relative to the size of each split
portfolio_train |>
group_by(max_portfolio_return_name) |>
summarize(count = n(), percentage = n() / nrow(portfolio_train) * 100)

portfolio_test |>
group_by(max_portfolio_return_name) |>
summarize(count = n(), percentage = n() / nrow(portfolio_test) * 100)

The label distributions of both the training and testing sets indicate that most of the data labels are "growth_portfolio". This is something to bear in mind and be critical of when interpreting our predictions below. The best way to avoid this would be to collect data for more years since, as mentioned above, upsampling would harm the real-life applicability of the model.
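One way to keep the label imbalance in view is to compare any accuracy figure against a majority-class baseline, i.e. the accuracy of always predicting the most common label. The sketch below uses made-up label counts, not the counts from our data.

```r
# Made-up label counts, for illustration only.
labels <- c(rep("growth_portfolio", 13), rep("income_portfolio", 4),
            rep("conservative_portfolio", 1))

label_share <- prop.table(table(labels))
# Accuracy of a classifier that always predicts the most common label.
baseline_accuracy <- max(label_share)
print(round(baseline_accuracy, 3))
```

A classifier is only informative to the extent that its accuracy exceeds this baseline.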

In [None]:
#recipe with normal data:
knn_recipe <- recipe(max_portfolio_return_name ~ annual_inflation + real_gdp_growth_rate_percentage + annual_unemployment_rate, data = portfolio_train) |>
    step_center(all_predictors()) |>
    step_scale (all_predictors())

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |> 
set_engine("kknn") |>
set_mode("classification") 

In [None]:
set.seed(9999) #ensuring reproducibility

knn_vfold <- vfold_cv(portfolio_train, v = 5, strata = max_portfolio_return_name)
gridvals <- tibble(neighbors = seq(from = 1, to = 12))

knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = knn_vfold, grid = gridvals) |>
  collect_metrics() 

accuracies <- knn_results |> 
       filter(.metric == "accuracy") |>
       arrange(desc(mean))
accuracies


cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean))+
       geom_point() +
       geom_line() +
       labs(x = "Neighbors", y = "Accuracy Estimate") +
       ggtitle("cross-validation plot")
cross_val_plot

In [None]:
# Based on the plot, we will pick K = 8 because it is the median of the range K = 4-12, over which the accuracy estimate is equal.
# Choosing K = 8 ensures that moving a bit to the right or left leaves the accuracy estimate intact.
# We believe that enhancing our model would require collecting more data for each variable.
# Unfortunately, our data was restricted to 1997-2021.

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 8) |>
set_engine("kknn") |>
set_mode("classification")

knn_fit <- workflow() |>
add_recipe(knn_recipe) |>
add_model(knn_spec) |>
fit(data = portfolio_train)
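The rule described in the comments above (take the middle of the range of K values that share the best accuracy) can be written down explicitly. This is a sketch with made-up accuracies; it assumes the best-accuracy range is contiguous, and `tolerance` is an illustrative parameter.

```r
# Made-up cross-validation accuracies for K = 1..12, for illustration only.
cv <- data.frame(
  neighbors = 1:12,
  mean = c(0.55, 0.60, 0.65, 0.70, 0.70, 0.70, 0.70, 0.70, 0.70, 0.70, 0.70, 0.70)
)

# Keep every K whose accuracy is within a small tolerance of the best one,
# then take the middle of that range.
tolerance <- 1e-8
plateau <- cv$neighbors[cv$mean >= max(cv$mean) - tolerance]
# as.integer() truncates when the plateau has an even number of values.
chosen_k <- as.integer(median(plateau))
print(chosen_k)
```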

In [None]:
#testing model predictions with testing set

set.seed(9999) #ensuring reproducibility

test_predictions <- predict(knn_fit,portfolio_test) |>
bind_cols(portfolio_test)

knn_metrics <- test_predictions |>
metrics(truth = max_portfolio_return_name, estimate = .pred_class)
knn_metrics
#accuracy seems to improve a bit (from ~70% in cross-validation up to ~71%) when applying the model to the testing dataset
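An accuracy figure alone does not show which labels the model gets wrong. A confusion matrix makes this visible; the base R sketch below uses made-up true and predicted labels, not our test results.

```r
# Made-up true and predicted labels, for illustration only.
truth <- factor(c("growth", "growth", "growth", "income", "income", "growth", "income"),
                levels = c("growth", "income"))
predicted <- factor(c("growth", "growth", "growth", "growth", "income", "growth", "growth"),
                    levels = c("growth", "income"))

# Rows are predictions, columns are the true labels.
confusion <- table(predicted = predicted, truth = truth)
print(confusion)

# Accuracy is the share of predictions on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(round(accuracy, 3))
```

Within tidymodels, the `conf_mat()` function from yardstick can produce the same kind of table from `test_predictions`.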

In [None]:
#These are the new observations to predict

new_economic_environment_1 <- tibble(annual_inflation = 2, real_gdp_growth_rate_percentage = 5, annual_unemployment_rate = 8)
new_economic_environment_2 <- tibble(annual_inflation = 2, real_gdp_growth_rate_percentage = 0, annual_unemployment_rate = 4)

portfolio_prediction_1 <- predict(knn_fit, new_economic_environment_1)
portfolio_prediction_1

portfolio_prediction_2 <- predict(knn_fit, new_economic_environment_2)
portfolio_prediction_2

# Interpreting Model Predictions

For an economic environment with expected annual US inflation, real GDP growth, and unemployment rates of 2%, 5%, and 8% respectively, our model predicts that the top-performing portfolio (out of the six we incorporated in our model) will be the growth portfolio. On the other hand, when we change our expectations to annual US inflation, real GDP growth, and unemployment rates of 2%, 0%, and 4% respectively, the income portfolio is suggested. The model is very interesting and could have the potential to inform real-world professional investment advice. Nonetheless, the model lacks the data and accuracy needed to be a serious contender. As previously mentioned, data going back only to 1997 is not enough to build a robust model. Furthermore, an accuracy of ~70% is far from ideal. Lastly, the classes used in our model are limited to the six portfolio types we chose. Thus, this model intrinsically ignores a whole other universe of investment portfolios and financial instruments (i.e., not just mutual funds).

Nonetheless, we believe and are hopeful that a model with similar dynamics could be built by professionals with more time and resources on their side.