### Demo Notebook For CPO 2.0 API
This notebook shows how to connect to the CPO 2.0 API, manage files, set up parameters, and request predictions. It also covers the requirements for, and the format of, the input data.

In [1]:
# import statement
import os
import datetime
import time
import requests
import json

# user info: user_email is the main identifier, and results and notifications are sent to it
# please change user_email to your own email address
user = "Demo User"
user_email = "me@office.com"

# connect to the VM, a welcome message will display if connected
api_host = 'http://localhost:5005/'      # for debug purposes
resp = requests.get(api_host, verify=False)
resp_json = resp.content
json.loads(resp_json)

{'message': 'Welcome to CPO API 2.0.6'}

### Uploading Data
CPO requires daily returns data and constraints on each portfolio component as inputs. Users can also upload additional features. These data are stored in separate directories under the client's account. If a file is uploaded under an existing file name, the existing file is overwritten. This can be used to update the return file for live predictions.

#### Return File
The return file should contain the daily returns of all components in the portfolio universe, or of a superset of them, with `return(t) = close(t) / close(t-1) - 1.0`. The return file should have a `Date` column (aliases `tradedate`, `time`, `timestamp`, `datetime`) for indexing purposes, and the preferred date format is `yyyy-mm-dd`. After the return file is uploaded, CPO API identifies the `Date` column and reports the start and end dates so the user can check that processing was successful. Users can upload multiple return files, but only one return file can be used for each portfolio.

CPO API treats all other columns as daily returns of portfolio components. NaNs and zeros are treated differently in these return columns. Specifically, NaNs mark historical periods when the ticker *does not exist*, while zeros are taken as *real zero returns* (e.g. due to holidays). *Please contact PredictNow if your return data is sparse.*
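
As an illustration of the format described above, the following is a minimal sketch that turns close prices into a return table with a `Date` column. The tickers `AAA` and `BBB`, the prices, and the choice of writing NaN as an empty CSV cell are assumptions made for this example and are not taken from the CPO API.

```python
import csv
import io
import math

def daily_returns(closes):
    """Simple returns close(t) / close(t-1) - 1.0, aligned to the dates after the first one.

    A missing price (None) on either day gives NaN, i.e. "ticker does not exist yet".
    """
    out = []
    for prev, cur in zip(closes, closes[1:]):
        if prev is None or cur is None:
            out.append(math.nan)
        else:
            out.append(cur / prev - 1.0)
    return out

dates = ["2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"]
closes = {
    "AAA": [100.0, 101.0, 101.0, 99.99],  # hypothetical ticker, unchanged on 2023-01-05
    "BBB": [None, None, 50.0, 51.0],      # hypothetical ticker, first priced on 2023-01-05
}
returns = {ticker: daily_returns(prices) for ticker, prices in closes.items()}

# Write the table in the layout described above: a Date column plus one column per component.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Date", *returns])
for i, day in enumerate(dates[1:]):  # the first date has no previous close
    cells = ["" if math.isnan(r[i]) else f"{r[i]:.6f}" for r in returns.values()]
    writer.writerow([day, *cells])
print(buffer.getvalue())
```

Here `AAA` gets a real zero return on 2023-01-05, while `BBB` stays NaN until it has two consecutive prices.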

The following code uploads the `ETF_return.csv` file from the `Data` directory, then lists all available return files under the account.

In [2]:
# upload return file, validation result will be displayed
print("Uploading return files")
return_filename = "ETF_return.csv"
data = {
    "type": "Returns",
    "email": user_email,
}
# open the file in a context manager so the handle is closed after the upload
with open(os.path.join(".", "Data", return_filename), mode="rb") as upload_file:
    resp = requests.post(f"{api_host}upload-data", files={"file": upload_file}, data=data, verify=False)
print(json.loads(resp.content))

# also list what return files have been uploaded
resp = requests.get(f"{api_host}list-return-files/{user_email}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print("=" * 100)
print(resp_content)

Uploading return files


{'message': 'Date info from date with 16 return columns. Index range between 2010-01-04 and 2023-03-03.'}
{'Uploaded return_data files': ['ETF_return.csv', 'ETF_return2.csv', 'ETF_return99.csv']}


#### Constraint files
The constraint file defines the portfolio universe and the lower and upper bounds (min and max allocation) of each component. There should be a `component` column for the component names, and these names should be a subset of the return columns of the return file to be used. The column names are case sensitive. The lower and upper bounds of each component should be given in the `LB` and `UB` columns; their default values are 0.0 (0%) and 1.0 (100%), respectively, if not provided for a given component. Each user can upload multiple constraint files under different names, but only one constraint file can be used for each portfolio.
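
Before uploading, it can help to check a constraint file locally against the rules above. The following is a rough sketch only: the function name `load_constraints`, the sample rows, and the extra sanity check `0 <= LB <= UB <= 1` are assumptions for illustration and do not describe the validation that CPO API itself performs.

```python
import csv
import io

def load_constraints(constraint_csv, return_columns):
    """Read component bounds, fill in the defaults, and check names against the return columns."""
    bounds = {}
    for row in csv.DictReader(io.StringIO(constraint_csv)):
        name = row["component"]
        if name not in return_columns:          # comparison is case sensitive
            raise ValueError(f"{name!r} is not a column of the return file")
        lb = float(row.get("LB") or 0.0)        # default lower bound: 0.0 (0%)
        ub = float(row.get("UB") or 1.0)        # default upper bound: 1.0 (100%)
        if not 0.0 <= lb <= ub <= 1.0:          # assumed sanity check, not from the API
            raise ValueError(f"invalid bounds for {name!r}: LB={lb}, UB={ub}")
        bounds[name] = (lb, ub)
    return bounds

# Hypothetical file content: TLT has no LB, IEF has neither bound.
sample = "component,LB,UB\nSPY,0.05,0.6\nTLT,,0.5\nIEF,,\n"
return_columns = {"SPY", "QQQ", "TLT", "IEF"}
bounds = load_constraints(sample, return_columns)
print(bounds)
```

In this example `QQQ` appears in the return columns but not in the constraint file, so it is simply left out of the portfolio universe.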

The following code uploads the `ETF_constrain.csv` file from the `Data` directory, then lists all available constraint files under the account. Note that `ETF_constrain.csv` contains fewer components than the `ETF_return.csv` file, in which case the additional ETFs will not be used to construct the portfolio.

In [3]:
# upload constraint file, validation result will be displayed
print("Uploading constraint files")
constraint_filename = "ETF_constrain.csv"
data = {
    "type": "Constraint",
    "email": user_email,
}
# open the file in a context manager so the handle is closed after the upload
with open(os.path.join(".", "Data", constraint_filename), mode="rb") as upload_file:
    resp = requests.post(f"{api_host}upload-data", files={"file": upload_file}, data=data, verify=False)
print(json.loads(resp.content))

# also list what constraints files are uploaded
resp = requests.get(f"{api_host}list-constraint-files/{user_email}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print("=" * 100)
print(resp_content)

Uploading constraint files
{'message': 'Constraint file processed for 6 components'}
{'Uploaded constraint_data files': ['ETF_constrain.csv']}


#### Feature files (Optional)

Users can provide additional feature files to help prediction. Feature files are time-series files just like return files, so the key column is also `Date` (aliases `tradedate`, `time`, `timestamp`, `datetime`). The preferred date format is `yyyy-mm-dd` for ease of auto-processing. The client's features are merged with pre-engineered features at PredictNow by matching the `Date` column, and CPO API assumes all features are available before the market open of the target date. All features are forward filled (using the latest value that is available) before training. Users can upload multiple feature files.

The following code uploads the `Random_Feature.csv` file from the `Data` directory, then lists all available client feature files under the account. The uploaded features are randomly generated and should not provide any predictive power.
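
To make the forward-fill rule concrete, here is a minimal sketch of how a feature value could be looked up for a target date. The function name `forward_fill`, the feature name `f1`, and the sample values are hypothetical, and treating a row dated on the target date as already available is an assumption based on the description above.

```python
import bisect

def forward_fill(feature_rows, target_dates):
    """For each target date, return the latest feature row dated on or before it."""
    feature_dates = sorted(feature_rows)  # yyyy-mm-dd strings sort chronologically
    aligned = {}
    for day in target_dates:
        pos = bisect.bisect_right(feature_dates, day)
        # pos == 0 means no feature row exists yet for this date
        aligned[day] = feature_rows[feature_dates[pos - 1]] if pos else None
    return aligned

# Hypothetical feature file with a gap on 2022-07-04.
features = {
    "2022-07-01": {"f1": 0.12},
    "2022-07-05": {"f1": -0.30},
}
aligned = forward_fill(features, ["2022-06-30", "2022-07-01", "2022-07-04", "2022-07-05"])
print(aligned)
```

The row for 2022-07-04 reuses the values from 2022-07-01, while 2022-06-30 has nothing to fill from.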

In [5]:
# upload feature file, validation result will be displayed
print("Uploading feature files")
feature_filename = "Random_Feature.csv"
data = {
    "type": "features",
    "email": user_email,
}
# open the file in a context manager so the handle is closed after the upload
with open(os.path.join(".", "Data", feature_filename), mode="rb") as upload_file:
    resp = requests.post(f"{api_host}upload-data", files={"file": upload_file}, data=data, verify=False)
print(json.loads(resp.content))

# also list what feature files are uploaded
resp = requests.get(f"{api_host}list-feature-files/{user_email}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print("=" * 100)
print(resp_content)

Uploading feature files
{'message': 'Date info from TradeDate with 4 return columns. Index range between 2015-01-02 and 2022-08-02.'}
{'Uploaded feature_data files': ['Random_Feature.csv']}


### Parameters

Certain parameters are required to run a CPO job, and they generally fall into two categories: a) those that define the properties of the target portfolio, such as which components are included, how often it is rebalanced, and which metric is optimized at each rebalancing, and b) those that define the job, such as which period should be used as in-sample, or what the target date for a live prediction is.

This section focuses on the first category, while the job-related parameters are described in later sections. Before diving into the parameters, we will talk about the important idea of a *project* in CPO API.

#### Project

CPO is designed to determine the allocation of a portfolio at each rebalancing, not to select tickers or to tune the rebalancing frequency. These configurations are like hyper-parameters of the portfolio, and users can use projects to test different portfolio configurations, for example:

- What if I do weekly rebalancing instead of monthly rebalancing?
- What if I lower the max allocation of some high-risk components and allow certain amount of Cash allocation?
- What if I include additional features that I think might be helpful?
etc.

In short, each project corresponds to a specific portfolio set up that is defined by the portfolio parameters.

#### Portfolio parameters
The following parameters define the portfolio and should be fixed for a given project throughout in-sample, out-of-sample, and live prediction.

- `email`, mandatory, user identification, e.g. `me@office.com`
- `project_name`, mandatory, project / portfolio identification, e.g. `Demo_Project`
- `returns_file`, mandatory, daily return filename, should be uploaded first, e.g. `ETF_return.csv`
- `constraint_file`, mandatory, portfolio components and their min and max allocations, should be uploaded first, e.g. `ETF_constrain.csv`. Note that it is the `component` column in the `constraint_file` that defines the current portfolio universe.
- `feature_file`, optional, 'none' or 'feature_file1.csv, feature_file2.csv, ...', client features to be included; these files should be uploaded first.
- `skip_PNow_feature`, optional, if set to 'yes', 'true', or 'skip', PredictNow features are not included. Note that if PredictNow features are skipped and `feature_file` is not provided (`none` or not in the parameter dictionary), there will be no X-columns in the prediction model.
- `max_cash`, mandatory, maximum cash allocation allowed when risk is predicted to be large, float between 0 and 1 where 1 corresponds to 100% (no market exposure).
- `rebalancing_period_unit`, mandatory, 'week' or 'month', used with `rebalancing_period`.
- `rebalancing_period`, mandatory, int. If `rebalancing_period = 2` and `rebalancing_period_unit = 'week'`, the portfolio is rebalanced every other week.
- `rebalance_on`, mandatory, 'first' or 'last', determines whether the portfolio is rebalanced at the close of the first or the last market day of the rebalancing period. Please see the next subsection on how the dates are indexed in CPO.
- `training_data_size`, mandatory, int, in units of years, size of the rolling window that is used to make the prediction for each rebalancing.
- `evaluation_metric`, mandatory, key performance metric to optimize, can be chosen from 'return', 'risk', 'sharpe', 'CAGR', 'UI', 'UPI', or 'MaxDD'. For risk-related metrics, i.e. 'risk', 'UI', and 'MaxDD', `max_cash` will be overridden, since Cash always has zero risk and would therefore always be the minimal-risk choice.

The code below shows how to set up the portfolio parameters, with some optional ones commented out.

In [7]:
portfolio_params = {
    "email": user_email,
    "project_name": "Demo_Project_20231211",
    "returns_file": "ETF_return.csv",
    "constraint_file": "ETF_constrain.csv",
    # "feature_file": "Random_Feature.csv",
    # "skip_PNow_feature": "skip",
    "max_cash": 1.0,
    "rebalancing_period_unit": "month",
    "rebalancing_period": 1,
    "rebalance_on": "first",
    "training_data_size": 3,
    "evaluation_metric": "sharpe",
}
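
Since a typo in a parameter name or value only surfaces after a job is submitted, a quick local check can be useful. The sketch below is an assumption-laden helper, not part of CPO API: the function name `check_portfolio_params` is hypothetical, the key names follow the code cell above, and requiring positive integers is an extra assumption.

```python
MANDATORY_KEYS = {
    "email", "project_name", "returns_file", "constraint_file", "max_cash",
    "rebalancing_period_unit", "rebalancing_period", "rebalance_on",
    "training_data_size", "evaluation_metric",
}
ALLOWED_VALUES = {
    "rebalancing_period_unit": {"week", "month"},
    "rebalance_on": {"first", "last"},
    "evaluation_metric": {"return", "risk", "sharpe", "CAGR", "UI", "UPI", "MaxDD"},
}

def check_portfolio_params(params):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = [f"missing mandatory parameter: {key}"
                for key in sorted(MANDATORY_KEYS - set(params))]
    for key, allowed in ALLOWED_VALUES.items():
        if key in params and params[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    if "max_cash" in params and not 0.0 <= params["max_cash"] <= 1.0:
        problems.append("max_cash must be between 0 and 1")
    for key in ("rebalancing_period", "training_data_size"):
        # positivity is an assumption; the parameter list only says "int"
        if key in params and not (isinstance(params[key], int) and params[key] > 0):
            problems.append(f"{key} must be a positive integer")
    return problems

# Same values as the cell above, repeated here so the sketch runs on its own.
example_params = {
    "email": "me@office.com", "project_name": "Demo_Project_20231211",
    "returns_file": "ETF_return.csv", "constraint_file": "ETF_constrain.csv",
    "max_cash": 1.0, "rebalancing_period_unit": "month", "rebalancing_period": 1,
    "rebalance_on": "first", "training_data_size": 3, "evaluation_metric": "sharpe",
}
print(check_portfolio_params(example_params))
```

Calling the helper on a dictionary with, say, `"rebalancing_period_unit": "day"` returns one message per problem instead of raising, so all issues can be fixed in one pass.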

#### Additional Notes on how CPO Handles Date index
There are two sets of dates that are tightly related to each other: the rebalancing periods and the rebalancing dates.

A rebalancing period is labelled by its first calendar day, regardless of whether that day is a valid market day. The start date of the first rebalancing period is taken from the `training_start_date` when in-sample or out-of-sample backtesting is requested. After that, the next start date is the previous start date plus the date offset of the rebalancing period (e.g. 1 month, 2 weeks, etc.). These dates are for indexing purposes, in particular for CPO API to store and manage data files for each rebalancing period. The backtesting experiment stops when the start date of the rebalancing period is later than the `training_end_date`.

The start date of a rebalancing period can be any calendar day and may not be an effective rebalancing (market) day. To determine the actual rebalancing date of the rebalancing period, CPO needs to know when the rebalancing should take place, i.e. the `rebalance_on` parameter. If `rebalance_on` is set to 'first', CPO uses the first market day within the rebalancing period; if `rebalance_on` is set to 'last', the rebalancing date is taken from the last market day of the PREVIOUS rebalancing period.
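
The indexing rules above can be sketched in a few lines. This is an illustration under stated assumptions, not CPO's implementation: market days are approximated as weekdays that are not in a user-supplied `holidays` set, month offsets are clamped to the month end, and the function names are hypothetical.

```python
import calendar
import datetime as dt

def add_offset(day, period, unit):
    """Shift a date by `period` weeks or months (month shifts clamp to the month end)."""
    if unit == "week":
        return day + dt.timedelta(weeks=period)
    month_index = day.month - 1 + period
    year, month = day.year + month_index // 12, month_index % 12 + 1
    return dt.date(year, month, min(day.day, calendar.monthrange(year, month)[1]))

def is_market_day(day, holidays):
    # assumption: weekdays minus the given holidays stand in for the market calendar
    return day.weekday() < 5 and day not in holidays

def rebalance_schedule(start, end, period, unit, rebalance_on, holidays=frozenset()):
    """Map each rebalancing-period start date to its actual rebalancing date."""
    schedule = {}
    period_start = start
    while period_start <= end:  # stop once the period start is later than the end date
        if rebalance_on == "first":
            day, step = period_start, dt.timedelta(days=1)       # first market day in the period
        else:
            step = dt.timedelta(days=-1)
            day = period_start + step                            # last market day of the previous period
        while not is_market_day(day, holidays):
            day += step
        schedule[period_start] = day
        period_start = add_offset(period_start, period, unit)
    return schedule

holidays = {dt.date(2019, 1, 1)}
first = rebalance_schedule(dt.date(2019, 1, 1), dt.date(2019, 3, 31), 1, "month", "first", holidays)
last = rebalance_schedule(dt.date(2019, 1, 1), dt.date(2019, 3, 31), 1, "month", "last", holidays)
for period_start in first:
    print(period_start, "first:", first[period_start], "last:", last[period_start])
```

With `rebalance_on='first'` the period starting 2019-01-01 is rebalanced on 2019-01-02, because New Year's Day is not a market day; with 'last' the same period is rebalanced on 2018-12-31.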

### Backtesting

Backtesting can be further divided into in-sample and out-of-sample backtesting. CPO uses the in-sample period to tune the hyper-parameters of the prediction system, including the selection of candidate strategies, the type of prediction model to be used, and the aggregation function that turns predictions into final recommendations. The tuned hyper-parameters are saved, and the user can use the out-of-sample period to verify whether CPO continues to add value. Both in-sample and out-of-sample backtesting jobs require some additional parameters beyond those that define the portfolio.

#### In-sample backtesting

In-sample backtesting is used to determine the hyper-parameters of the CPO system. The parameters required for in-sample backtesting include:
- `training_start_date`, mandatory, in the format of 'yyyy-mm-dd', the start date of the first rebalancing period to be included in the experiment. 
- `training_end_date`, mandatory, in the format of 'yyyy-mm-dd', the experiment terminates when the start of the period exceeds the `training_end_date`.
- `sampling_proportion`, mandatory, float between 0 and 1, the fraction of base strategies to be kept. This parameter is usually set to 0.3 or 0.4.
- `debug`, optional, outputs more information in the backend when set to `debug`, and does not affect the performance or prediction.

The following code submits an in-sample backtesting job between Jan and Dec 2019 for the demo portfolio defined earlier.

In [8]:
params = {
    "training_start_date": "2019-01-01",
    "training_end_date": "2019-12-31",
    "sampling_proportion": 0.3,
    "debug": "debug",
}
params.update(portfolio_params)

uri = f"{api_host}run-insample-backtest" 
resp = requests.post(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
cpo_job_id_backtest = resp_content['task_id']
print(resp_content)

{'message': 'job submitted for cpo in-sample backtesting.', 'task_id': '8b728606-1451-4bbb-b338-c691a0c32ccd'}


#### Time cost of in-sample backtesting
It can take 20 to 30 minutes for each rebalancing period within the in-sample period, depending on the number of components in the portfolio, the size of the training data, and the sampling proportion. The time cost can add up quickly when the test includes many rebalancing periods, either due to more frequent rebalancing (e.g. weekly) or a longer testing period. Users can use the `get-cpo-job-status` endpoint to check the current status. CPO API returns the performance metrics of the tuned model after the job is finished, and the corresponding allocations are sent to the user's email. These weights can also be loaded using a separate function; see the later section on loading backtesting results.

It is also possible that, due to the long-running nature of the backtesting jobs (both in-sample and out-of-sample), connection errors interrupt a job partway through. In that case, **simply re-submit the training request without changing any parameters.** CPO API has breakpoint-handling modules that record the progress and continue the job from the most recent breakpoint. This is true for both in-sample and out-of-sample backtesting.

The following code checks the progress of the submitted backtesting job, and outputs the performance if the job is finished.
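
Because jobs run for a long time, checking the status in a loop is often more convenient than re-running a cell by hand. The helper below is a sketch with hypothetical names (`wait_for_job`, `fetch_status`); only the `cpo_job_status` field and the 'SUCCESS' and 'PENDING' values appear in this notebook, and treating 'FAILURE' as a terminal status is an assumption.

```python
import time

def wait_for_job(fetch_status, poll_seconds=300, max_polls=100, sleep=time.sleep):
    """Call fetch_status() until the job reaches a terminal status or the poll budget runs out."""
    for attempt in range(max_polls):
        response = fetch_status()
        if response["cpo_job_status"] in ("SUCCESS", "FAILURE"):  # "FAILURE" is assumed
            return response
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still running after {max_polls} status checks")

# Against the API, fetch_status could wrap the status request used in this notebook, e.g.
#   lambda: requests.get(f"{api_host}get-cpo-job-status/{job_id}", verify=False).json()
# Here canned responses are used so the sketch runs without a server.
canned = iter([
    {"cpo_job_status": "PENDING"},
    {"cpo_job_status": "PENDING"},
    {"cpo_job_status": "SUCCESS"},
])
final = wait_for_job(lambda: next(canned), poll_seconds=0, sleep=lambda seconds: None)
print(final["cpo_job_status"])
```

Given the 20 to 30 minutes per rebalancing period mentioned above, a polling interval of several minutes is usually enough.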

In [10]:
# check status of the in-sample backtesting job progress and output performance if finished.
resp = requests.get(f"{api_host}get-cpo-job-status/{cpo_job_id_backtest}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print(resp_content)

#print("Response:", resp_content)
status = resp_content['cpo_job_status']
print(datetime.datetime.now())
print("Current Status:", status)
if status=='SUCCESS': 
    print("="*50)
    print("CPO RESULTS")
    result = resp_content['cpo_result']
    print(result)

{'cpo_job_id': '8b728606-1451-4bbb-b338-c691a0c32ccd', 'cpo_job_status': 'SUCCESS', 'cpo_result': "{'return': 0.17429690353749647, 'risk': 0.0241256100193871, 'sharpe': 7.224559436939966, 'CAGR': 0.189991246453467, 'UI': 0.09178534874175193, 'UPI': 189.89621538389505, 'MaxDD': 0.0032538487588696543}", 'progress': '{"step": 5, "progress": "In sample backtesting copleted, preparing outputs"}'}
2023-12-11 23:07:51.022386
Current Status: SUCCESS
CPO RESULTS
{'return': 0.17429690353749647, 'risk': 0.0241256100193871, 'sharpe': 7.224559436939966, 'CAGR': 0.189991246453467, 'UI': 0.09178534874175193, 'UPI': 189.89621538389505, 'MaxDD': 0.0032538487588696543}
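
In the output above, `cpo_result` and `progress` are returned as strings rather than nested objects: `cpo_result` looks like a Python dictionary literal (single quotes), while `progress` is JSON. A small sketch for turning both into dictionaries follows; the function name `parse_status` is hypothetical and the sample responses are trimmed copies of the outputs shown in this notebook.

```python
import ast
import json

def parse_status(response):
    """Return (metrics, progress) as dictionaries, or None where nothing is available yet."""
    raw_result = response.get("cpo_result")
    # literal_eval safely parses the single-quoted dict string without executing code
    metrics = None if raw_result in (None, "None") else ast.literal_eval(raw_result)
    raw_progress = response.get("progress")
    progress = json.loads(raw_progress) if raw_progress else None
    return metrics, progress

finished = {
    "cpo_job_status": "SUCCESS",
    "cpo_result": "{'return': 0.17429690353749647, 'sharpe': 7.224559436939966}",
    "progress": '{"step": 5, "progress": "preparing outputs"}',
}
pending = {
    "cpo_job_status": "PENDING",
    "cpo_result": "None",
    "progress": '{"step": 1, "progress": "Training models"}',
}
metrics, progress = parse_status(finished)
print(metrics["sharpe"], progress["step"])
print(parse_status(pending))
```

Once parsed, individual metrics such as `metrics["sharpe"]` can be compared across projects directly.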


#### Out-of-sample backtesting

Out-of-sample backtesting simply applies the hyper-parameters tuned during the in-sample period and tests whether they continue to add value over a different time period. This means an out-of-sample test can only be run after in-sample tuning is finished. The required input parameters are `training_start_date` and `training_end_date`, with the same definition and format as in in-sample backtesting.

It is important that `training_start_date` follows the same alignment in the in-sample and out-of-sample tests. In this example we are working on a portfolio that is rebalanced monthly on the first market day of the month, so we keep `training_start_date` on the 1st of the month in OOS. Similarly, if the in-sample test starts on a Monday for a weekly rebalanced portfolio, the OOS test should start on a Monday as well.

An OOS test generally runs faster than an in-sample test because there are fewer models to be run, but it may still take 15 minutes for each rebalancing period.

The following code submits an out-of-sample job for the demo portfolio between Jan and Dec 2020. The prediction will be made using configurations determined during in-sample testing.
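
A quick way to catch a misaligned OOS start date before submitting a job is to compare it with the in-sample start date. The check below is a simplified sketch with a hypothetical function name: it assumes that "same alignment" means the same weekday for weekly portfolios and the same day of the month for monthly ones, and it ignores rebalancing periods longer than one unit.

```python
import datetime as dt

def same_alignment(in_sample_start, oos_start, unit):
    """Check that two 'yyyy-mm-dd' start dates fall on the same weekday or day of month."""
    first = dt.date.fromisoformat(in_sample_start)
    second = dt.date.fromisoformat(oos_start)
    if unit == "week":
        return first.weekday() == second.weekday()
    if unit == "month":
        return first.day == second.day
    raise ValueError("unit must be 'week' or 'month'")

print(same_alignment("2019-01-01", "2020-01-01", "month"))  # both on the 1st of the month
print(same_alignment("2019-01-07", "2020-01-07", "week"))   # a Monday versus a Tuesday
```

For the demo portfolio, an in-sample start of 2019-01-01 and an OOS start of 2020-01-01 pass the monthly check.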

In [19]:
# oos prediction
params = {
    "training_start_date": "2020-01-01",
    "training_end_date": "2020-12-31",
    "debug": "debug",
}
params.update(portfolio_params)

uri = f"{api_host}run-oos-backtest" 
resp = requests.post(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
cpo_job_id_backtest = resp_content['task_id']
print(resp_content)

{'message': 'job submitted for cpo back-testing.', 'task_id': '3b32a5c7-c098-450a-8014-fe48ae5425cb'}


In [20]:
# check status
resp = requests.get(f"{api_host}get-cpo-job-status/{cpo_job_id_backtest}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print(resp_content)

#print("Response:", resp_content)
status = resp_content['cpo_job_status']
print(datetime.datetime.now())
print("Current Status:", status)

if status=='SUCCESS': 
    print("="*50)
    print("CPO RESULTS")
    result = resp_content['cpo_result']
    print(result)

{'cpo_job_id': '3b32a5c7-c098-450a-8014-fe48ae5425cb', 'cpo_job_status': 'PENDING', 'cpo_result': 'None', 'progress': '{"step": 1, "progress": "Training models"}'}
2023-11-26 22:05:36.031587
Current Status: PENDING


#### Load Backtesting Results

The predictions generated during in-sample and out-of-sample backtesting experiments are stored on CPO API. Users can request these results for in-sample, out-of-sample, or combined. The key parameters are, again, `training_start_date` and `training_end_date`.

The following code requests the CPO allocations and their performance for a period that spans both in-sample and out-of-sample backtesting.

In [19]:
# get backtesting weights
params = {
    "training_start_date": "2019-01-01",
    "training_end_date": "2020-03-31",    # note dates can cover in-sample and OOS at the same time
    "debug": "debug",
}
params.update(portfolio_params)

print("="*50)
print("Loading backtesting weights")
uri = f"{api_host}get-backtest-weights" 
resp = requests.get(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
print(resp_content)

# can also turn output into a pandas dataframe
import pandas as pd
df = pd.DataFrame.from_dict(resp_content).T
print("="*50)
print("Backtesting Weights as Dataframe")
print(df)

print("="*50)
print("Loading backtesting performance")
uri = f"{api_host}get-backtest-performance"     
resp = requests.get(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
print(resp_content)

Loading backtesting weights
{'2019-01-02': {'SPY': 0.0539468147, 'QQQ': 0.0346698355, 'VNQ': 0.0331834347, 'REM': 0.1002935535, 'IEF': 0.2798133149, 'TLT': 0.1904534153}, '2019-02-01': {'SPY': 0.0446511781, 'QQQ': 0.039910024, 'VNQ': 0.0463260017, 'REM': 0.0524557947, 'IEF': 0.1055622018, 'TLT': 0.0640442022}, '2019-03-01': {'SPY': 0.07094082, 'QQQ': 0.0683798591, 'VNQ': 0.0799032846, 'REM': 0.1146294247, 'IEF': 0.2425103942, 'TLT': 0.110000258}}
                 SPY      QQQ       VNQ       REM       IEF       TLT
2019-01-02  0.053947  0.03467  0.033183  0.100294  0.279813  0.190453
2019-02-01  0.044651  0.03991  0.046326  0.052456  0.105562  0.064044
2019-03-01  0.070941  0.06838  0.079903  0.114629  0.242510  0.110000
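
The component weights in the output above do not sum to 1. One plausible reading, given that `max_cash` is 1.0 for this project, is that the remainder is held in cash; this interpretation is an assumption and is not stated by the API output. Under that assumption, the implied cash share can be computed as follows (weights copied from the first two rows of the output above, function name hypothetical).

```python
def implied_cash(weights_by_date):
    """Return 1 minus the sum of component weights for each rebalancing date."""
    return {day: 1.0 - sum(alloc.values()) for day, alloc in weights_by_date.items()}

weights = {
    "2019-01-02": {"SPY": 0.0539468147, "QQQ": 0.0346698355, "VNQ": 0.0331834347,
                   "REM": 0.1002935535, "IEF": 0.2798133149, "TLT": 0.1904534153},
    "2019-02-01": {"SPY": 0.0446511781, "QQQ": 0.039910024, "VNQ": 0.0463260017,
                   "REM": 0.0524557947, "IEF": 0.1055622018, "TLT": 0.0640442022},
}
cash = implied_cash(weights)
for day, share in cash.items():
    print(day, round(share, 4))
```

If the remainder is not cash in your setup, the same sum is still a useful check that no weight exceeds its bound and that the total stays at or below 1.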



### Live Prediction

Live prediction uses the tuned hyper-parameters to make a prediction for an upcoming rebalancing period. It requires a) the target rebalancing date, and b) the prediction horizon, i.e. how many market days there are in the upcoming rebalancing period. The user can provide this information using the following parameters:

- `rebalance_date`, mandatory, in the format of 'yyyy-mm-dd', the target rebalance date.
- `next_rebalance_date`, optional, in the format of 'yyyy-mm-dd', the next rebalance date after the current target date. For example, for a weekly rebalanced portfolio, if the `rebalance_date` is set to '2023-10-02', the `next_rebalance_date` would be Monday '2023-10-09'. If `next_rebalance_date` is passed, CPO uses the US market calendar to determine how many market days there are in the target rebalancing period.
- `n_days`, optional, int, the number of market days in the upcoming rebalancing period. For a weekly rebalanced portfolio, `n_days` is usually 5 unless there is a holiday. This parameter overrides `next_rebalance_date`.

If neither `next_rebalance_date` nor `n_days` is passed, CPO infers the number of market days from the rebalancing frequency by assuming 5 market days a week and 21 market days a month.

The following code submits a live prediction training request, checks the progress, then loads the predicted allocation.
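
The precedence between these parameters can be summarized in a short sketch. This mirrors the description above rather than CPO's actual code: the function name `resolve_n_days` is hypothetical, the US market calendar is approximated by weekdays minus a user-supplied `holidays` set, and counting from `rebalance_date` up to but excluding `next_rebalance_date` is an assumption.

```python
import datetime as dt

def resolve_n_days(rebalance_date, next_rebalance_date=None, n_days=None,
                   period=1, unit="month", holidays=frozenset()):
    """Number of market days in the upcoming rebalancing period."""
    if n_days is not None:                      # explicit n_days overrides everything else
        return n_days
    if next_rebalance_date is not None:         # otherwise count market days in between
        day = dt.date.fromisoformat(rebalance_date)
        end = dt.date.fromisoformat(next_rebalance_date)
        count = 0
        while day < end:
            if day.weekday() < 5 and day not in holidays:
                count += 1
            day += dt.timedelta(days=1)
        return count
    return period * {"week": 5, "month": 21}[unit]   # fallback: 5 per week, 21 per month

print(resolve_n_days("2023-10-02", "2023-10-09"))
print(resolve_n_days("2022-07-01", "2022-08-01", holidays={dt.date(2022, 7, 4)}))
print(resolve_n_days("2022-07-01", "2022-08-01", n_days=19))
print(resolve_n_days("2022-07-01", period=2, unit="week"))
```

The second call shows why passing `next_rebalance_date` or `n_days` can matter: July 2022 has 20 market days once Independence Day is excluded, not the 21 assumed by the fallback.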

In [None]:
# live prediction
params = {
    "rebalance_date": "2022-07-01",
    "next_rebalance_date": "2022-08-01",
    "debug": "debug",
}
params.update(portfolio_params)

uri = f"{api_host}run-live-prediction" 
resp = requests.post(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
cpo_job_id_live = resp_content['task_id']
print(resp_content)

In [None]:
# check status, and if training finished, load weights
resp = requests.get(f"{api_host}get-cpo-job-status/{cpo_job_id_live}", verify=False)
resp_json = resp.content
resp_content = json.loads(resp_json)
print(resp_content)

status = resp_content['cpo_job_status']
print(datetime.datetime.now())
print("Current Status:", status)

# print out weights
if status=='SUCCESS': 
    result = resp_content['cpo_result']
    print("CPO RESULTS")
    print(result)

After the prediction is made, the user can also load the result without redoing the training, using the following code. The key parameter is `rebalance_date`.

In [None]:
# can also load live prediction weights after the training is finished
params = {
    "rebalance_date": "2022-07-01",
    "debug": "debug",
}
params.update(portfolio_params)
uri = f"{api_host}get-live-prediction-weights" 
resp = requests.get(uri, json=params, verify=False)
resp_content = json.loads(resp.content)
print(resp_content)