# Portfolio optimization
**Problem**: Is it possible to use macroeconomic data to "predict" an optimal asset allocation for a portfolio to achieve better risk-adjusted returns?

## Data Collection
**Input**: -

**Output**: Raw data stored in MongoDB

Data sources:
- OECD (https://data.oecd.org/api/)
- FRED (https://fred.stlouisfed.org/docs/api/fred/)
- YahooFinance (library yfinance)
- Investing.com (scraping with BeautifulSoup4)

First, we choose which indexes to use as benchmarks for the different asset classes (Equity, Bond, Real Estate, Commodity, Cash). These will be the **targets** in our model.

Equity (Yahoo finance, with related ticker):
- SP500 ^GSPC
- DowJones ^DJI
- Nasdaq ^IXIC
- Russell2000 ^RUT

Bond:
- Long-term interest rates (OECD https://data.oecd.org/interest/long-term-interest-rates.htm)
- Treasury10Y Yield (Yahoo Finance ^TNX) 

Real Estate:
- All-Transactions House Price Index (FRED series_id = USSTHPI)
- Housing prices (OECD https://data.oecd.org/price/housing-prices.htm)

Commodity (Investing.com):
- GOLD (https://www.investing.com/commodities/gold)
- OIL (https://www.investing.com/commodities/crude-oil)
- WHEAT (https://www.investing.com/commodities/us-wheat)

Cash (OECD):
- Short-term interest rates (OECD https://data.oecd.org/interest/short-term-interest-rates.htm)

As **features** we take every series in the FRED and OECD datasets. These cover GDP, growth, inflation, unemployment, equity market volatility, new housing permits, Fed rates, gold reserves, balance of payments, and much more.

We save raw data as-is in MongoDB Atlas, which we use as a Data Lake.
The alternatives we evaluated were S3 and DocumentDB (AWS).
We chose MongoDB Atlas because it offers a Free Tier with 512MB of storage and, unlike S3, lets us query the documents.

### OECD Data Collection
OECD exposes its data through a REST API that requires no authentication.

https://data.oecd.org/api/sdmx-json-documentation/

Data can be retrieved via HTTP requests that contain a dataset id and a set of filters.
The most recent ("live") data is in the DP_LIVE dataset.
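
To make the request format concrete, here is a minimal sketch of a URL builder. The helper name `build_oecd_url` is hypothetical, and the reading of the five dot-separated filter positions (location, indicator, subject, measure, frequency) is an assumption inferred from the example request in this notebook, not something confirmed by the OECD documentation here.

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.oecd.org/SDMX-JSON/data"


def build_oecd_url(dataset, filters, start_time=None):
    """Build a data URL from a dataset id and a list of per-dimension filters.

    An empty string in `filters` leaves that dimension unfiltered.
    """
    key = ".".join(filters)
    params = {"dimensionAtObservation": "allDimensions"}
    if start_time is not None:
        params["startTime"] = start_time
    return f"{BASE_URL}/{dataset}/{key}/all?{urlencode(params)}"


# Assumed meaning of the positions: location, indicator, subject, measure, frequency.
url = build_oecd_url("DP_LIVE", ["USA", "GDP", "", "MLN_USD", ""], start_time=2010)
print(url)
```

With these arguments the helper reproduces the GDP request used in this section.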

Here we get all the feature data from OECD, plus the 3 OECD targets described above.

Below is an example of a request for GDP data for the USA.

In [1]:
# move to root to simplify imports
%cd ..

C:\Users\marco\PycharmProjects\portfolio-optimization


In [2]:
import requests
# plain string: there are no placeholders, so no f-string is needed
url = 'https://stats.oecd.org/SDMX-JSON/data/DP_LIVE/USA.GDP..MLN_USD./all?dimensionAtObservation=allDimensions&startTime=2010'
r = requests.get(url).json()
print(r["dataSets"])

[{'action': 'Information', 'observations': {'0:0:0:0:0:0': [15048970.0, 0, None], '0:0:0:0:0:1': [15599731.0, 0, None], '0:0:0:0:0:2': [16253970.0, 0, None], '0:0:0:0:0:3': [16843196.0, 0, None], '0:0:0:0:0:4': [17550687.0, 0, None], '0:0:0:0:0:5': [18206023.0, 0, None], '0:0:0:0:0:6': [18695106.0, 0, None], '0:0:0:0:0:7': [19477337.0, 0, None], '0:0:0:0:0:8': [20533058.0, 0, None], '0:0:0:0:0:9': [21380976.0, 0, None], '0:0:0:0:0:10': [21060474.0, 0, None], '0:0:0:0:0:11': [23315081.0, 0, None]}}]


The data format is a bit obscure at this point, but we will address (and explain) it in the Data Cleaning part of the process.
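
As a preview, here is a minimal sketch of how keys such as `0:0:0:0:0:5` can be decoded. It assumes the response follows the usual SDMX-JSON layout, where `structure.dimensions.observation` lists the dimensions in key order and each position in a key indexes into that dimension's `values`. The function name and the sample message are illustrative, and this is not necessarily the approach taken in the Data Cleaning notebook.

```python
def decode_observations(message):
    """Turn SDMX-JSON observations into a list of flat records."""
    dims = message["structure"]["dimensions"]["observation"]
    records = []
    for key, obs in message["dataSets"][0]["observations"].items():
        # Each colon-separated number is an index into one dimension's values.
        positions = [int(p) for p in key.split(":")]
        record = {dim["id"]: dim["values"][pos]["id"] for dim, pos in zip(dims, positions)}
        record["value"] = obs[0]  # the first element is the observed value
        records.append(record)
    return records


# Tiny hand-written message (illustrative only, not a real OECD response).
sample = {
    "structure": {"dimensions": {"observation": [
        {"id": "LOCATION", "values": [{"id": "USA"}]},
        {"id": "TIME_PERIOD", "values": [{"id": "2010"}, {"id": "2011"}]},
    ]}},
    "dataSets": [{"observations": {
        "0:0": [15048970.0, 0, None],
        "0:1": [15599731.0, 0, None],
    }}],
}
records = decode_observations(sample)
print(records)
```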

### FRED Data collection
FRED exposes its data through a REST API that authenticates with an api_key (free to request and use).
https://fred.stlouisfed.org/docs/api/fred/

To retrieve the data of a series you need to specify the corresponding series_id.
We couldn't find a comprehensive list of series_id values, so we decided to traverse the whole tree of categories and series.
We started from the root category, asked for its children, and repeated this for every child. Once we had all the categories, we asked for the series contained in each one. This way we retrieved every available series_id.
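
The traversal can be sketched as a breadth-first walk. This is a minimal illustration, not the project's actual code: `collect_series_ids` and its two callable parameters are hypothetical names, and in real use the callables would wrap the FRED `fred/category/children` and `fred/category/series` endpoints (the root category id is 0). Unlike the two-pass description above, this sketch collects series while it walks. A toy in-memory tree stands in for the API so the example runs offline.

```python
from collections import deque


def collect_series_ids(get_children, get_series_ids, root_id=0):
    """Breadth-first walk over the category tree, gathering every series id."""
    seen_categories = {root_id}
    series_ids = set()
    queue = deque([root_id])
    while queue:
        category_id = queue.popleft()
        series_ids.update(get_series_ids(category_id))
        for child_id in get_children(category_id):
            # Guard against visiting a category twice.
            if child_id not in seen_categories:
                seen_categories.add(child_id)
                queue.append(child_id)
    return series_ids


# Toy tree standing in for the API; the category ids and their contents are made up.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
series = {0: [], 1: ["GDP"], 2: ["UNRATE"], 3: ["GDP", "CPIAUCSL"]}
found = collect_series_ids(children.__getitem__, series.__getitem__)
print(sorted(found))
```

Using a set means a series listed under several categories is kept only once.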

Because the amount of data was higher than expected, we chose to keep only series with "popularity" >= 30. Popularity is a metadata field of each series that represents public interest in that series. (For example, "GDP" data for the USA is "more interesting" than "Employed Persons in Talbot County, GA" data.)
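
A minimal sketch of the popularity filter, assuming each series is available as a metadata dictionary with a `popularity` field. The helper name and the sample records (ids and scores) are made up for illustration.

```python
MIN_POPULARITY = 30  # threshold stated in the text


def keep_popular(series_list, min_popularity=MIN_POPULARITY):
    """Return only the series whose popularity meets the threshold."""
    return [s for s in series_list if int(s.get("popularity", 0)) >= min_popularity]


# Made-up metadata records; these are not real FRED values.
sample = [
    {"id": "SERIES_A", "popularity": 95},
    {"id": "SERIES_B", "popularity": 30},
    {"id": "SERIES_C", "popularity": 2},
]
popular_ids = [s["id"] for s in keep_popular(sample)]
print(popular_ids)
```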

Here we get all the feature data from FRED, plus the 1 FRED target described above.

Below is an example of a request for GDP data for the USA. The api_key is read from a local credentials file and is not shown for privacy reasons; to run the same code you will need to request your own api_key from FRED.

https://fred.stlouisfed.org/docs/api/api_key.html

In [3]:
import fredapi as fa
from configparser import ConfigParser

parser = ConfigParser()
_ = parser.read("credentials.cfg")
fred_api_key = parser.get("fred", "fred_api_key")

fred = fa.Fred(api_key=fred_api_key)
df = fred.get_series("GDP").to_frame()
df.tail()

| | 0 |
|---|---|
| 2021-10-01 | 24349.121 |
| 2022-01-01 | 24740.48 |
| 2022-04-01 | 25248.476 |
| 2022-07-01 | 25723.941 |
| 2022-10-01 | 26137.992 |


### YahooFinance Data Collection
For YahooFinance data we use the yfinance library.

https://pypi.org/project/yfinance/

Here we get the target data we need from YahooFinance, as described above.

Below is an example of a request for S&P 500 price data.

In [4]:
import yfinance as yf

ticker = "^GSPC"
t = yf.Ticker(ticker)
df = t.history(period="max", interval="1mo")
df.tail()

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2023-01-01 | 3853.290039 | 4094.209961 | 3794.330078 | 4076.600098 | 80763810000 | 0 | 0 |
| 2023-02-01 | 4070.070068 | 4195.439941 | 3943.080078 | 3970.149902 | 80392280000 | 0 | 0 |
| 2023-03-01 | 3963.340088 | 4110.75 | 3808.860107 | 4109.310059 | 113094800000 | 0 | 0 |
| 2023-04-01 | 4102.200195 | 4133.129883 | 4069.840088 | 4109.109863 | 19340860000 | 0 | 0 |
| 2023-04-11 | 4110.290039 | 4115.359863 | 4102.890137 | 4102.910156 | 465928982 | 0 | 0 |


### Investing.com Data Collection
For Investing.com we manually downloaded the historical data as .csv files and created a scraper that retrieves subsequent data.

The limitation of the scraper is that data beyond the past month is loaded into the webpage via JavaScript, so it is not present in the static HTML.

Retrieving it would probably be possible with Selenium, but we tried to keep things as simple as possible by using only BeautifulSoup.

### Storing Data in MongoDB
We save raw data as-is in MongoDB Atlas, which we use as a Data Lake.

https://www.mongodb.com/cloud/atlas/register

To store a pandas DataFrame we have to convert it to a dictionary.

By default, MongoDB assigns each document an auto-generated "_id". We can override it with our own value (for example, the ticker) so that each document has a unique, meaningful key within the collection.

Below is an example of how to connect to MongoDB (in this case the Atlas version) and insert a JSON document into a given database and collection. You need a MongoDB Atlas account to run this. Alternatively, you can install MongoDB on your local machine, in which case the connection string would look like: *"mongodb://localhost:27017"*

In [5]:
from pymongo import MongoClient
import json

from configparser import ConfigParser
parser = ConfigParser()
_ = parser.read("credentials.cfg")
username = parser.get("mongo_db", "username")
password = parser.get("mongo_db", "password")

data = {"_id":ticker, "data":json.loads(df.reset_index().to_json(orient="records"))}

connection_string = f"mongodb+srv://{username}:{password}@cluster0.3dxfmjo.mongodb.net/?" \
                    f"retryWrites=true&w=majority"
client = MongoClient(connection_string)

# database = client[db_name]
# collection = database[collection]
# collection.insert_one(data)
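
Because "_id" is set explicitly, running `insert_one` a second time with the same "_id" raises a duplicate key error. One way to make the load repeatable is an upsert with pymongo's `replace_one(..., upsert=True)`. The sketch below is only an illustration under that assumption: the helper names are hypothetical, and `FakeCollection` is an in-memory stand-in so the example runs without a database. With a real connection you would pass a pymongo collection instead.

```python
def to_document(doc_id, records):
    """Wrap a list of row dicts in a document keyed by our own _id."""
    return {"_id": doc_id, "data": records}


def upsert_document(collection, document):
    """Insert the document, or replace the one already stored under the same _id."""
    collection.replace_one({"_id": document["_id"]}, document, upsert=True)


class FakeCollection:
    """In-memory stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self.docs = {}

    def replace_one(self, filter, replacement, upsert=False):
        # Mimics only the part of replace_one that this sketch relies on.
        if filter["_id"] in self.docs or upsert:
            self.docs[filter["_id"]] = replacement


collection = FakeCollection()
# With a DataFrame, the records could come from
# json.loads(df.reset_index().to_json(orient="records")), as in the cell above.
upsert_document(collection, to_document("^GSPC", [{"Date": "2023-03-01", "Close": 4109.31}]))
upsert_document(collection, to_document("^GSPC", [{"Date": "2023-04-01", "Close": 4109.11}]))
print(len(collection.docs))
```

The second call replaces the first document instead of failing, so the collection still holds one document per ticker.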

At this point all the raw data is stored in MongoDB and we are ready to move on to the next phase: wrangling the data to prepare it for analysis.

The transformed data will be saved in PostgreSQL.

[Go to Data Cleaning](data_cleaning.ipynb)