Group 125 - Christian Algaranaz, Krish Arora, Mike Min, Musa Sayeed

## Introduction
___

The investment world is a vast place with countless financial products to choose from! Our project focuses on investment portfolios (bundles) composed of different mutual funds holding stocks and other securities; an index mutual fund is an investment fund that tracks the performance of a specific benchmark index, such as the S&P 500 (roughly 500 of the largest public U.S. companies). More specifically, we are interested in exploring the annual performance of 6 different portfolios, each with a unique mutual fund composition (see diagram 1 below for reference), from 1997 to 2021. Alongside these returns, we consider annual U.S. inflation, unemployment, and real GDP growth rates over the same time span. These are the explanatory variables that we will use to answer our project question: Which index portfolio is expected to earn the highest return under different U.S. inflation, unemployment, and real GDP growth expectations? We focus on U.S. data since the American stock market accounts for ~60% of the world's total market capitalization.

<img src = "https://i.imgur.com/oOdGM3V.png">

### Loading and Merging the Data
___

In [80]:
library(tidyverse) # loads readr (read_csv) and dplyr (filter, select), which we use to import and clean our CSV files

In [81]:
getwd() # print the current working directory

In [82]:
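# skip the first 11 lines of the file and name the two columns ourselves
inflation_data <- read_csv("data/annual_inflation.csv", skip=11, col_names = c("date","annual_inflation"))

# read_csv already parses `date` as a Date, so we only need to extract the four-digit year
inflation_data$date <- format(inflation_data$date, '%Y')
# the year is now stored as text, so convert it explicitly before the range check
inflation_data_c <- filter(inflation_data, between(as.numeric(date), 1997, 2021))

names(inflation_data_c)[names(inflation_data_c) == 'date'] <- 'year'

inflation_data_c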

Rows: 33 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): annual_inflation
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


year,annual_inflation
<chr>,<dbl>
1997,2.3376899
1998,1.5522791
1999,2.1880272
2000,3.3768573
2001,2.8261711
2002,1.5860316
2003,2.270095
2004,2.6772367
2005,3.3927468
2006,3.2259441


In [83]:
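real_gdp_data <- read_csv("data/real_gdp.csv", skip=3, col_names = c("year","real_gdp_growth_rate","type"))

# the format string expects an apostrophe followed by a two-digit year; expand it to four digits
real_gdp_data$year <- format(as.Date(real_gdp_data$year, "'%y"),'%Y')
# year is stored as text, so convert it explicitly before the range check
gdp_data_c <- select(real_gdp_data, year, real_gdp_growth_rate) |>
                filter(between(as.numeric(year), 1997, 2021))

gdp_data_c

# growth rates are kept in percent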

Rows: 32 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): year, type
dbl (1): real_gdp_growth_rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


year,real_gdp_growth_rate
<chr>,<dbl>
1997,4.4
1998,4.5
1999,4.8
2000,4.1
2001,1.0
2002,1.7
2003,2.8
2004,3.9
2005,3.5
2006,2.8


In [84]:
unemployment_data <- read_csv("data/unemployment_rates.csv", skip=1)
unemployment_data_c <- filter(unemployment_data, LOCATION == "USA") |>
                    select(TIME, Value) |>
                    filter(between(TIME, 1997, 2021))

names(unemployment_data_c)[names(unemployment_data_c) == 'TIME'] <- 'year'
names(unemployment_data_c)[names(unemployment_data_c) == 'Value'] <- 'annual_unemployment_rate'

unemployment_data_c

# keep USA rows only; rename TIME -> year and Value -> annual_unemployment_rate

[1m[22mNew names:
[36m•[39m `` -> `...9`
[36m•[39m `` -> `...10`
[36m•[39m `` -> `...11`
[36m•[39m `` -> `...12`
[36m•[39m `` -> `...13`
[36m•[39m `` -> `...14`
[36m•[39m `` -> `...15`
[36m•[39m `` -> `...16`
[36m•[39m `` -> `...17`
[36m•[39m `` -> `...18`
[36m•[39m `` -> `...19`
[36m•[39m `` -> `...20`
[36m•[39m `` -> `...21`
[36m•[39m `` -> `...22`
[36m•[39m `` -> `...23`
[36m•[39m `` -> `...24`
[36m•[39m `` -> `...25`
[36m•[39m `` -> `...26`
[36m•[39m `` -> `...27`
[36m•[39m `` -> `...28`
[36m•[39m `` -> `...29`
[36m•[39m `` -> `...30`
[36m•[39m `` -> `...31`
[36m•[39m `` -> `...32`
[36m•[39m `` -> `...33`
[36m•[39m `` -> `...34`
[36m•[39m `` -> `...35`
[36m•[39m `` -> `...36`
[36m•[39m `` -> `...37`
[36m•[39m `` -> `...38`
[36m•[39m `` -> `...39`
[36m•[39m `` -> `...40`
[36m•[39m `` -> `...41`
[36m•[39m `` -> `...42`
[36m•[39m `` -> `...43`
[36m•[39m `` -> `...44`
[36m•[39m `` -> `...45`
[36m•[39m `` -> `

year,annual_unemployment_rate
<dbl>,<dbl>
1997,4.95
1998,4.508333
1999,4.216667
2000,3.991667
2001,4.733333
2002,5.775
2003,5.991667
2004,5.533333
2005,5.066667
2006,4.616667


In [85]:
income_portfolio_data_c <- read_csv("data/income_portfolio.csv", skip = 1, col_names= c("year","X2","income_portfolio_return")) |>
                            select(year, income_portfolio_return) |>
                            # year is read as text, so convert it explicitly before the range check
                            filter(between(as.numeric(year), 1997, 2021))

# strip "%" and "," from the return strings, convert them to numbers, and record the unit in the column name
income_portfolio_data_c$income_portfolio_return <- as.numeric(gsub("[\\%,]", "", income_portfolio_data_c$income_portfolio_return))
names(income_portfolio_data_c)[names(income_portfolio_data_c) == 'income_portfolio_return'] <- 'income_portfolio_return_percentage'


income_portfolio_data_c

# we don't need X2-X11; keep only year and the income portfolio return, filtered to 1997-2021

Rows: 598 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): year, X2, income_portfolio_return, X4, X5, X6, X7, X8
lgl (3): X9, X10, X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
“NAs introduced by coercion”


year,income_portfolio_return_percentage
<chr>,<dbl>
1997,10.39
1998,11.3
1999,5.32
2000,4.77
2001,3.38
2002,4.77
2003,12.84
2004,7.36
2005,2.71
2006,7.93
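
The percent-sign cleanup used in the cell above can be tried in isolation. The snippet below is a minimal sketch on a made-up vector (`raw_returns` and its values are illustrative assumptions, not rows from our CSV files). It also shows one way the warning “NAs introduced by coercion” can arise: any entry that is still not numeric after the cleanup becomes `NA`.

```r
# Hypothetical raw values, shaped like the return strings in the portfolio files
raw_returns <- c("10.39%", "-3.21%", "1,204.50%", "Annual Return")

# Remove percent signs and thousands separators, then convert to numeric.
# The last entry is not a number, so it becomes NA; without suppressWarnings()
# R would report "NAs introduced by coercion" here.
clean_returns <- suppressWarnings(as.numeric(gsub("[%,]", "", raw_returns)))

clean_returns
```

Checking `sum(is.na(clean_returns))` after a conversion like this is a quick way to see how many entries were dropped rather than parsed.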


In [86]:
sf_fs_portfolio_data_c <- read_csv("data/sixty_forty_and_forty_sixty_portfolio_data.csv", skip = 3, col_names= c("year","X2","60/40_portfolio_return","X3","40/60_portfolio_return")) |>
                                                select(year, "60/40_portfolio_return", "40/60_portfolio_return") |>
                                                # year is read as text, so convert it explicitly before the range check
                                                filter(between(as.numeric(year), 1997, 2021))

# strip "%" and "," from the return strings, convert them to numbers, and record the unit in the column names
sf_fs_portfolio_data_c$'60/40_portfolio_return' <- as.numeric(gsub("[\\%,]", "", sf_fs_portfolio_data_c$'60/40_portfolio_return'))
names(sf_fs_portfolio_data_c)[names(sf_fs_portfolio_data_c) == '60/40_portfolio_return'] <- '60/40_portfolio_return_percentage'

sf_fs_portfolio_data_c$'40/60_portfolio_return' <- as.numeric(gsub("[\\%,]", "", sf_fs_portfolio_data_c$'40/60_portfolio_return'))
names(sf_fs_portfolio_data_c)[names(sf_fs_portfolio_data_c) == '40/60_portfolio_return'] <- '40/60_portfolio_return_percentage'


sf_fs_portfolio_data_c

# drop X2-X11; keep year and the 60/40 and 40/60 returns, again filtered to the 1997-2021 range

Rows: 769 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): year, X2, 60/40_portfolio_return, X3, 40/60_portfolio_return, X6, X...
lgl (3): X9, X10, X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
“NAs introduced by coercion”


year,60/40_portfolio_return_percentage,40/60_portfolio_return_percentage
<chr>,<dbl>,<dbl>
1997,22.37,18.06
1998,17.39,14.45
1999,13.98,9.07
2000,-1.79,2.61
2001,-3.21,0.67
2002,-9.27,-3.43
2003,20.4,14.93
2004,9.2,7.55
2005,4.55,3.83
2006,11.01,8.77


In [87]:
gmc_portfolios_data_c <- read_csv("data/growth_moderate_conservative_portfolios.csv", skip = 3, col_names= c("year","X2","growth_portfolio_return","X3","moderate_portfolio_return","X4","conservative_portfolio_return")) |>
                                                select(year, growth_portfolio_return, moderate_portfolio_return, conservative_portfolio_return) |>
                                                # year is read as text, so convert it explicitly before the range check
                                                filter(between(as.numeric(year), 1997, 2021))

# strip "%" and "," from the return strings, convert them to numbers, and record the unit in the column names
gmc_portfolios_data_c$'growth_portfolio_return' <- as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'growth_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'growth_portfolio_return'] <- 'growth_portfolio_return_percentage'

gmc_portfolios_data_c$'moderate_portfolio_return' <- as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'moderate_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'moderate_portfolio_return'] <- 'moderate_portfolio_return_percentage'

gmc_portfolios_data_c$'conservative_portfolio_return' <- as.numeric(gsub("[\\%,]", "", gmc_portfolios_data_c$'conservative_portfolio_return'))
names(gmc_portfolios_data_c)[names(gmc_portfolios_data_c) == 'conservative_portfolio_return'] <- 'conservative_portfolio_return_percentage'

gmc_portfolios_data_c

# drop the unused X columns; keep year and the growth, moderate, and conservative portfolio returns, without the "%" unit

Rows: 782 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): year, X2, growth_portfolio_return, X3, moderate_portfolio_return, ...
lgl  (4): X13, X14, X15, X16

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
“NAs introduced by coercion”


year,growth_portfolio_return_percentage,moderate_portfolio_return_percentage,conservative_portfolio_return_percentage
<chr>,<dbl>,<dbl>,<dbl>
1997,17.5,15.13,12.76
1998,18.26,15.94,13.62
1999,20.79,15.63,10.47
2000,-8.06,-3.78,0.49
2001,-9.79,-5.4,-1.01
2002,-12.99,-7.07,-1.15
2003,29.08,23.67,18.26
2004,13.41,11.4,9.38
2005,7.68,6.02,4.36
2006,16.54,13.67,10.8


In [88]:
#merging datasets

project_ds1 <- merge(
                    x = inflation_data_c, 
                    y = gdp_data_c,
                    by = "year")
project_ds2 <- merge(
                    x = project_ds1, 
                    y = unemployment_data_c,
                    by = "year")
project_ds3 <- merge(
                    x = project_ds2, 
                    y = income_portfolio_data_c,
                    by = "year")
project_ds4 <- merge(
                    x = project_ds3, 
                    y = sf_fs_portfolio_data_c,
                    by = "year")
project_ds_c <- merge(
                    x = project_ds4, 
                    y = gmc_portfolios_data_c,
                    by = "year")

#unemployment_data_c
#income_portfolio_data_c
#sf_fs_portfolio_data_c
#gmc_portfolios_data_c

# TODO: the merge happens for each row in inflation_data; correct this.
project_ds_c

year,annual_inflation,real_gdp_growth_rate,annual_unemployment_rate,income_portfolio_return_percentage,60/40_portfolio_return_percentage,40/60_portfolio_return_percentage,growth_portfolio_return_percentage,moderate_portfolio_return_percentage,conservative_portfolio_return_percentage
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1997,2.3376899,4.4,4.95,10.39,22.37,18.06,17.5,15.13,12.76
1998,1.5522791,4.5,4.508333,11.3,17.39,14.45,18.26,15.94,13.62
1999,2.1880272,4.8,4.216667,5.32,13.98,9.07,20.79,15.63,10.47
2000,3.3768573,4.1,3.991667,4.77,-1.79,2.61,-8.06,-3.78,0.49
2001,2.8261711,1.0,4.733333,3.38,-3.21,0.67,-9.79,-5.4,-1.01
2002,1.5860316,1.7,5.775,4.77,-9.27,-3.43,-12.99,-7.07,-1.15
2003,2.270095,2.8,5.991667,12.84,20.4,14.93,29.08,23.67,18.26
2004,2.6772367,3.9,5.533333,7.36,9.2,7.55,13.41,11.4,9.38
2005,3.3927468,3.5,5.066667,2.71,4.55,3.83,7.68,6.02,4.36
2006,3.2259441,2.8,4.616667,7.93,11.01,8.77,16.54,13.67,10.8
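
The chain of pairwise `merge()` calls above could also be written as a single fold over a list of tables. The following is a minimal sketch on a small excerpt of three of our tables (the names `inflation`, `gdp`, `unemployment`, `tables`, and `merged` are illustrative, and the values are rounded). It assumes every table has a `year` column. Because `inner_join()` will not match a character key against a numeric one, and our `year` columns are a mix of `<chr>` and `<dbl>`, the sketch converts `year` to integer first.

```r
library(dplyr)
library(purrr)

# Small excerpts standing in for the cleaned tables (note the mixed year types)
inflation    <- tibble(year = c("1997", "1998"), annual_inflation = c(2.34, 1.55))
gdp          <- tibble(year = c("1997", "1998"), real_gdp_growth_rate = c(4.4, 4.5))
unemployment <- tibble(year = c(1997, 1998, 1999),
                       annual_unemployment_rate = c(4.95, 4.51, 4.22))

# Give every table the same key type so the joins are well defined
tables <- list(inflation, gdp, unemployment) |>
  map(\(df) mutate(df, year = as.integer(year)))

# Fold the list into one table, keeping only years present in every input
merged <- reduce(tables, inner_join, by = "year")

merged
```

Like `merge()` with its default settings, `inner_join()` keeps only the years that appear in every table, so 1999 is dropped in this excerpt.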


In [89]:
# find the best-performing portfolio for each year

# add a new column holding the highest portfolio return in that year

# add a categorical (factor) column naming the portfolio that earned it
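
The three steps listed in the comments above can be sketched as follows. This is a minimal sketch rather than our final implementation: it uses a three-year, two-portfolio excerpt of the merged table, and the names `returns`, `best`, `best_portfolio`, and `best_return_percentage` are illustrative assumptions.

```r
library(dplyr)
library(tidyr)

# Excerpt of the merged table: two of the six portfolio columns, three years
returns <- tibble(
  year = c(1997, 2000, 2002),
  income_portfolio_return_percentage = c(10.39, 4.77, 4.77),
  growth_portfolio_return_percentage = c(17.50, -8.06, -12.99)
)

best <- returns |>
  # one row per (year, portfolio) pair
  pivot_longer(-year, names_to = "portfolio", values_to = "return_percentage") |>
  group_by(year) |>
  # keep the single highest return within each year
  slice_max(return_percentage, n = 1, with_ties = FALSE) |>
  ungroup() |>
  # store the winning portfolio as a categorical variable
  mutate(best_portfolio = as.factor(portfolio)) |>
  select(year, best_portfolio, best_return_percentage = return_percentage)

best
```

Joining `best` back onto the merged table by `year` would give each year its class label alongside the three explanatory variables.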

## Methods
___

* We’ll conduct our data analysis by first dividing our data into training data and testing data. Training data are the observations whose class we already know and that the classifier uses as the basis for its predictions. Using the classifier, we can then predict the class of the testing observations, whose true classes are withheld from the classifier and used only to evaluate it.
* We’ll use the K-nearest neighbors classification algorithm, via the parsnip R package in tidymodels, to make predictions.
  * We’ll use cross-validation to choose the best value of k.
  * We’ll define a model specification for K-nearest neighbors and fit the model by passing the model specification and the data set to a fit function.
  * In the same step, we’ll specify the target variable (i.e., the investment portfolio) and the predictors we are going to use: GDP growth, inflation, and unemployment.
* Finally, we’ll use the predict function to predict the best investment portfolio.
* We’ll visualize the results by:
  * Plotting a line graph with returns on the y-axis and time on the x-axis, grouped by portfolio.
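
The workflow described above could look roughly like the sketch below. It is a sketch under stated assumptions, not our final analysis: the data are synthetic (`toy` and the rule that assigns `best_portfolio` are made up for illustration), the `kknn` engine is assumed to be installed, the 75/25 split, 5 folds, and the grid of k from 1 to 10 are arbitrary choices, and the standardization step is our addition since K-nearest neighbors is sensitive to the scale of the predictors.

```r
library(tidymodels)

set.seed(125)

# Synthetic stand-in for the merged table (values and labels are made up)
toy <- tibble(
  annual_inflation         = runif(60, 0, 5),
  real_gdp_growth_rate     = runif(60, -3, 5),
  annual_unemployment_rate = runif(60, 3, 10)
) |>
  mutate(best_portfolio = factor(if_else(real_gdp_growth_rate > 1, "growth", "income")))

# Split into training and testing data
data_split <- initial_split(toy, prop = 0.75, strata = best_portfolio)
train_data <- training(data_split)
test_data  <- testing(data_split)

# Target variable and predictors, with predictors standardized
knn_recipe <- recipe(best_portfolio ~ ., data = train_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with k left open for tuning
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# Cross-validation on the training data to choose k
folds <- vfold_cv(train_data, v = 5, strata = best_portfolio)

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = tibble(neighbors = 1:10)) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training data with the chosen k
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = train_data)

# Predict the class of the held-out testing observations
predictions <- predict(knn_fit, test_data) |>
  bind_cols(test_data)

predictions
```

With only 25 yearly observations in the real data, the number of folds and the range of k would need to be chosen with the small sample size in mind.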

## Expected outcomes and significance
___

We expect to find the yearly performance of six index portfolios, each with its respective mutual fund composition, alongside the rates of U.S. unemployment, inflation, and real GDP growth (our explanatory variables) over the same period. We then expect to find the relationship between each explanatory variable and each index portfolio. These findings will lead us to a conclusion about which index portfolio performs best under each explanatory variable. They would also let us associate adverse economic conditions with the index portfolios, and the respective mutual fund compositions, that they affect. This association will give us the ability to predict which funds and portfolios will decline, increase, or stay stable under each respective variable. The ability to predict portfolio outcomes leads us to further questions about what foundations these funds are built on and how the American stock market truly functions.
