About the method of preprocessing #1

Closed
LeetaH666 opened this issue Jul 18, 2023 · 3 comments

Comments

@LeetaH666

From data_prepare.py I see that the managed portfolios are constructed using the classical method, i.e., long the first decile and short the last decile. However, the preprocessing method of GKX (2021) is a linear transformation of returns (see equation (16) in the original paper). Was this simply a choice of convenience, or did you find the classical method to work better than the authors' method?

@chuanmx20
Collaborator

Because the data covers a long period, a stock does not appear until it is listed, i.e., the number of stocks varies from month to month. We treat each month's data as one batch, so we built the portfolios this way to keep the batch_size constant. This makes it easier to construct the dataloader.

@LeetaH666
Author

Yes, this kind of construction does form a balanced panel. My point is that your method differs from the authors' method, even though theirs also forms a balanced panel. Both methods give 94 characteristic-based portfolios each month, but the method of construction is different.

@RichardS0268
Owner

RichardS0268 commented Aug 9, 2023

Hello, and thanks for raising this issue. The empirical part of the original paper can be a little confusing at first; I hope I can make things clearer here.

First, I want to share my understanding of the CA models proposed by GKX. The architecture is shown in Figure 2 (Conditional Autoencoder Model) of GKX's paper. Note that the input of the beta network is always the same: the characteristic matrix of individual stocks, Z_{t-1} (dimension N × P, with P fixed and N varying over time). The input of the factor network, in contrast, can be either r_t (dimension N × 1) or x_t (dimension P × 1), corresponding to individual stock returns and managed portfolio returns respectively.

x_t is the OLS solution to the regression r_t = Z_{t-1} @ x_t, so it can be regarded as the vector of coefficients on the characteristics. Such coefficients are also called risk premia, or simply "returns"; these are the so-called managed portfolio returns. You can view this procedure as a more graceful and reasonable fillna operation. As GKX point out in Section 2.2, "Second, the panel is extremely unbalanced—in any given month, we have on average around 6,000 non-missing stocks ...". We can reasonably fill NaNs in the characteristic matrix with the cross-sectional median, but it seems awful to fill individual stock returns with some fixed value, or to simply dropna along each time series. By calculating x_t, the information in r_t is fully used and we get fixed-length inputs without NaN values for every period.

Also note that the explained R² for x_t is much higher than for r_t. Both kinds of models are trained 720 times, but models for r_t need to predict around 30,000 entries each time, while models for x_t only need to predict 94 entries each time.
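
As a minimal sketch of the x_t calculation described above (not code from this repository): the function name `managed_portfolio_returns`, the toy sizes, and the simulated data are all assumptions made for illustration. Missing characteristics are filled with the cross-sectional median, stocks with a missing return are left out of the regression, and `np.linalg.lstsq` gives the least-squares solution of r_t = Z_{t-1} @ x_t.

```python
import numpy as np

def managed_portfolio_returns(Z_prev, r):
    """Return x_t, the OLS coefficients of r_t on Z_{t-1} (hypothetical helper).

    Z_prev : (N, P) characteristics at t-1, may contain NaN
    r      : (N,)   stock returns at t, may contain NaN
    """
    Z = np.array(Z_prev, dtype=float)   # copy, so the caller's array is untouched
    r = np.asarray(r, dtype=float)

    # Fill missing characteristics with the cross-sectional median of each column.
    col_median = np.nanmedian(Z, axis=0)
    rows, cols = np.where(np.isnan(Z))
    Z[rows, cols] = col_median[cols]

    # Use only the stocks whose return is observed this month.
    keep = ~np.isnan(r)

    # Least-squares solution of r = Z @ x, i.e. x = (Z'Z)^{-1} Z'r when Z'Z is invertible.
    x, *_ = np.linalg.lstsq(Z[keep], r[keep], rcond=None)
    return x  # shape (P,): same length every month, no NaN


# Toy data (assumed): N and P are small here; GKX use P = 94.
rng = np.random.default_rng(0)
N, P = 500, 5
Z_prev = rng.uniform(-1, 1, size=(N, P))
true_x = np.arange(1, P + 1) / 100
r = Z_prev @ true_x + 0.01 * rng.standard_normal(N)
r[:50] = np.nan               # some stocks have no return this month
Z_prev[60:70, 2] = np.nan     # some characteristics are missing

x_t = managed_portfolio_returns(Z_prev, r)
print(x_t.shape)
```

However many stocks are present in a given month, the output has length P, which is what makes x_t usable as a fixed-length input to the factor network.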

As for our implementation, the "portfolios" are actually different from those in GKX. You can view them as synthetic tickers: in other words, we assume there are only 94 tickers in the whole market, so N is fixed at 94 for every period. This lets us use batch training, as @chuanmx20 mentioned above. Correspondingly, we calculate the returns of those portfolios with the simple long-short method, which also yields fixed-length inputs without NaN values for the factor network. Strictly speaking, those returns are not x_t but r_t. We made this simplification because we did not have enough time or computing resources to replicate GKX's experiments exactly.
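
The long-short construction can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in data_prepare.py: the helper name `long_short_returns`, the equal weighting, and the sign convention (long the highest-ranked decile, short the lowest) are all assumed, and the repository may sort or weight differently.

```python
import numpy as np

def long_short_returns(Z_prev, r, n_bins=10):
    """One long-short return per characteristic (hypothetical helper).

    For each column of Z_prev, sort stocks by that characteristic, then take the
    equal-weighted return of the top decile minus that of the bottom decile.
    """
    Z = np.asarray(Z_prev, dtype=float)
    r = np.asarray(r, dtype=float)
    n_chars = Z.shape[1]
    out = np.empty(n_chars)
    for j in range(n_chars):
        # Ignore stocks with a missing characteristic or a missing return.
        valid = ~np.isnan(Z[:, j]) & ~np.isnan(r)
        z, ret = Z[valid, j], r[valid]
        order = np.argsort(z, kind="stable")   # ascending by characteristic
        k = max(len(z) // n_bins, 1)           # number of stocks per decile
        short_leg = ret[order[:k]].mean()      # lowest-ranked decile
        long_leg = ret[order[-k:]].mean()      # highest-ranked decile
        out[j] = long_leg - short_leg
    return out


rng = np.random.default_rng(1)
for n_stocks in (300, 450):                    # N changes from month to month
    Z_prev = rng.standard_normal((n_stocks, 4))
    r = 0.05 * rng.standard_normal(n_stocks)
    print(long_short_returns(Z_prev, r).shape)  # always (4,)
```

Each characteristic yields exactly one portfolio return per month, so with 94 characteristics the cross-section always has 94 "tickers", whatever the number of listed stocks.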

From my perspective, I prefer to view GKX's work as an inspiring innovation in factor models, which is why we tried to implement all the models involved by inheriting from modelBase. The specific market (tickers) explained by the factor models, whether the US or another market, whether real names or synthetic portfolios, does not matter as long as it is identical across all factor models.

Note that in our implementation the CA models do not actually perform better than the IPCA model, mainly because, after fixing N = 94, there is not enough data to fit the CA models well. The CA models seem to be underfitted, especially when more hidden layers are added. If you are interested in carrying out GKX's experiment, you can build on our work and replace the inputs of the beta network and the factor network. You can calculate x_t with the method mentioned above. For training, you can either train the networks sequentially without batches, or truncate N to a fixed size with some strategy to form training batches.
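
One possible truncation strategy, sketched here purely as an assumption (the comment above does not specify one), is to draw a random subsample of a fixed number of stocks each month and stack the months into one array. The names `fixed_size_batch` and `n_keep` are hypothetical, and months with fewer than `n_keep` stocks are simply skipped in this sketch.

```python
import numpy as np

def fixed_size_batch(months, n_keep, seed=0):
    """Stack months with varying N_t into arrays with N fixed to n_keep.

    months : list of (Z_prev, r) pairs, Z_prev of shape (N_t, P), r of shape (N_t,)
    Returns arrays of shape (T_kept, n_keep, P) and (T_kept, n_keep).
    """
    rng = np.random.default_rng(seed)
    Z_out, r_out = [], []
    for Z_prev, r in months:
        n_t = len(r)
        if n_t < n_keep:
            continue  # too few stocks this month; padding would be another option
        # Random subsample without replacement (one strategy among many).
        idx = rng.choice(n_t, size=n_keep, replace=False)
        Z_out.append(Z_prev[idx])
        r_out.append(r[idx])
    return np.stack(Z_out), np.stack(r_out)


rng = np.random.default_rng(2)
months = [(rng.standard_normal((n, 3)), rng.standard_normal(n)) for n in (120, 80, 200)]
Z_batch, r_batch = fixed_size_batch(months, n_keep=100)
print(Z_batch.shape, r_batch.shape)   # (2, 100, 3) (2, 100)
```

The stacked arrays can then be sliced along the first axis into training batches. Sampling by market capitalisation or keeping the most liquid names would be alternative strategies; which one suits the CA models best is not something this sketch settles.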

I hope my explanation can be useful to you. I also recommend you read the report for more implementation details. 💡

@RichardS0268 RichardS0268 pinned this issue Aug 9, 2023