# **The Marginal Contribution of Alpha191 Factors on the Non-A-Share Market**

This is the Jupyter notebook for the master's thesis titled *The Marginal Contribution of Alpha191 Factors on the Non-A-Share Market*. For clarity, all main code and the corresponding explanations are integrated into this document. This structure lets readers follow the author’s workflow step by step, which supports both reproducibility and a deeper understanding of the research.

## Environment Setup

**1. Using conda (Recommended)**

In [None]:
conda env create -f env/environment.yml

**2. Using pip**

In [None]:
pip install -r env/requirements.txt

Before proceeding with the next steps, please ensure your IDE and Jupyter Notebook have been switched to the new environment.
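
One way to confirm the switch, sketched here with the standard library only, is to print the interpreter path and environment prefix from inside the notebook and compare them with the environment created above. The environment's name depends on `env/environment.yml` and is not assumed here; the helper name `active_environment` is purely illustrative.

```python
import sys
from pathlib import Path


def active_environment() -> dict:
    """Describe the interpreter that is executing this cell."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # The last path component of the prefix is usually the environment name.
        "name": Path(sys.prefix).name,
        "python": ".".join(str(part) for part in sys.version_info[:3]),
    }


info = active_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the printed prefix still points at the base installation, the kernel has not been switched yet.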

## Dataset Download Guide

This guide describes how to download the datasets from Kaggle, in particular the file "usa.csv".

<span style="color:red"> Please note: the full "usa.csv" file is no longer available. You may skip the download guide. To review the dataset's structure and format, please visit: https://www.kaggle.com/datasets/jindu4928/usa-subset-csv </span>

**Step 1: Install Kaggle**

This step should already be complete if you have finished the **Environment Setup** process. To check, look in your home directory:
- Windows: C:\Users\\\<username>\
- Linux/Mac: ~/

If you find the ".kaggle" folder there, you can skip this step. Otherwise, run the following code manually.

In [None]:
pip install kaggle

**Step 2: Setup API Credentials**

1. Go to Kaggle.com

2. Click on your profile picture → "Settings"

3. Scroll down to "API" section

4. Click "Create New API Token"; this downloads the "kaggle.json" file

5. Place the file in:

- Windows: C:\Users\\\<username>\\.kaggle\kaggle.json

- Linux/Mac: ~/.kaggle/kaggle.json
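
To check that the token landed in the right place, a small helper such as the one below could be used. It is only a sketch: the function name `find_kaggle_token` is made up for this example, and it relies on the token file containing a `username` and a `key` entry.

```python
import json
from pathlib import Path


def find_kaggle_token(home=None):
    """Return the path to kaggle.json if it exists and has both entries, else None."""
    home = Path(home) if home is not None else Path.home()
    token_path = home / ".kaggle" / "kaggle.json"
    if not token_path.is_file():
        return None
    try:
        token = json.loads(token_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    # A Kaggle API token file stores a "username" and a "key" entry.
    if not isinstance(token, dict) or not {"username", "key"} <= token.keys():
        return None
    return token_path


print(find_kaggle_token() or "kaggle.json not found or incomplete")
```

Called without arguments, the helper inspects the current user's home directory, which matches both the Windows and the Linux/Mac locations listed above.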

**Step 3: Download Dataset**

Please execute the following code to download the "usa.csv" dataset. This may take several minutes, as the file is approximately 14 GB. Please ensure you have sufficient storage space.

In [None]:
from download import *
download_usa_csv()
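
Before starting a download of this size, the free disk space could be checked along the following lines. This is a sketch rather than part of the project code: the helper names and the safety margin of 1.5 are assumptions chosen for illustration, and only the 14 GB figure comes from the text above.

```python
import shutil

REQUIRED_GB = 14  # approximate size of "usa.csv" stated above


def free_space_gb(path="."):
    """Free space, in GB (10**9 bytes), on the drive that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=REQUIRED_GB, margin=1.5):
    """True if free space is at least the download size times a safety margin."""
    return free_space_gb(path) >= required_gb * margin


print(f"free: {free_space_gb():.1f} GB, enough for usa.csv: {has_room()}")
```

The margin leaves headroom for the intermediate parquet files written later in the notebook; adjust it as needed.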

## Import

In [None]:
import pandas as pd
from class_AlphaCalculator import *
from class_USACalculator import *
from combine_times_files import combine_times_files
from usa_returns import calculateFactorReturns
from class_Alpha191Portfolios import *
from class_UsaPortfolios import *
from class_DSregression import *
from class_Alpha191Portfolios55 import *
from class_UsaPortfolios55 import *

## Calculate Alpha191 Returns

**Read the S&P 500 data files used to calculate Alpha191 returns**

In [None]:
CP = pd.read_parquet(r'data\SPXConstituentsPrices.parquet')
CD = pd.read_parquet(r'data\SPXConstituentsDaily.parquet')
CM = pd.read_parquet(r'data\CompleteMapping.parquet')
SP = pd.read_parquet(r'data\SPXPrices.parquet')
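
The raw strings above use backslashes, which work as path separators on Windows only. On Linux or Mac, the same relative paths could be built with `pathlib`, as in this minimal sketch; the dictionary name `parquet_files` is introduced here for illustration.

```python
from pathlib import Path

DATA_DIR = Path("data")  # same relative folder as in the cell above

# The keys mirror the variable names used in the notebook.
parquet_files = {
    "CP": DATA_DIR / "SPXConstituentsPrices.parquet",
    "CD": DATA_DIR / "SPXConstituentsDaily.parquet",
    "CM": DATA_DIR / "CompleteMapping.parquet",
    "SP": DATA_DIR / "SPXPrices.parquet",
}

for name, path in parquet_files.items():
    print(name, path, "found" if path.exists() else "missing")
```

`pd.read_parquet` accepts these `Path` objects directly, for example `pd.read_parquet(parquet_files["CP"])`.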

**Create an instance object from 'class_AlphaCalculator.py'**

In [None]:
calculator = AlphaCalculator(CP, CD, CM, SP)

**Preprocess data**

In [None]:
calculator.preprocessData()

**Calculate Alpha191 factor values**

To keep the running time manageable, I split the date range into 6 parts:
1. **period0.py**: _'1996-01-02'--'2001-01-02'_
2. **period1.py**: _'2000-01-02'--'2006-01-02'_
3. **period2.py**: _'2005-01-02'--'2011-01-02'_
4. **period3.py**: _'2010-01-02'--'2016-01-02'_
5. **period4.py**: _'2015-01-02'--'2021-01-02'_
6. **period5.py**: _'2020-01-02'--'2023-08-31'_

_The code files above are in the folder 'period_split'._

Run the 6 scripts concurrently to save time, and store the results in the folder 'period_split' as parquet files.

Because the 6 parts cannot be run at the same time inside a Jupyter notebook, I suggest opening them and running each one manually in its own dedicated terminal. To save time, you can also download the results from kaggle.com by executing the following code.

In [None]:
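from download import download_period_data  # needed if the Dataset Download Guide cells were skipped
download_period_data()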
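
As an alternative to opening six terminals by hand, the six scripts could be started as separate processes from a single Python session. The sketch below assumes that the scripts are named `period0.py` to `period5.py` inside 'period_split', as listed above, and that they can be run from the project root; the helper names are illustrative. The launch itself is left commented out because it starts long-running jobs.

```python
import subprocess
import sys
from pathlib import Path


def period_commands(folder="period_split", n_periods=6):
    """Build one command per script: period0.py ... period5.py."""
    return [
        [sys.executable, str(Path(folder) / f"period{i}.py")]
        for i in range(n_periods)
    ]


def run_concurrently(commands):
    """Start every command at once, wait for all, and return their exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# Uncomment to launch the six scripts; each runs in its own Python process.
# exit_codes = run_concurrently(period_commands())
# print(exit_codes)  # an exit code of 0 means that script finished without error
```

Each process uses the same interpreter as the notebook kernel, so the environment set up earlier applies to all six runs.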

**Merge the 6 files into one parquet file named 'alpha_factor' in the folder 'factor_value'**

In [None]:
combine_times_files()
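
The internals of `combine_times_files` are not shown here. Since consecutive periods listed above share one year, a merge of this kind generally has to handle rows that appear in two files, if the saved files keep those rows. The following sketch shows one way to do that on toy data; the column names `date`, `ticker`, and `alpha001`, the helper name, and the choice to keep the earlier file's row are all assumptions.

```python
import pandas as pd


def combine_periods(frames, keys=("date", "ticker"), keep="first"):
    """Concatenate period frames and drop rows that repeat the same key."""
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates(subset=list(keys), keep=keep)
    return combined.sort_values(list(keys)).reset_index(drop=True)


# Two toy periods that share the row for 2001-01-02.
part0 = pd.DataFrame({
    "date": pd.to_datetime(["2000-12-29", "2001-01-02"]),
    "ticker": ["AAA", "AAA"],
    "alpha001": [0.1, 0.2],
})
part1 = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-02", "2001-01-03"]),
    "ticker": ["AAA", "AAA"],
    "alpha001": [float("nan"), 0.3],
})

merged = combine_periods([part0, part1])
print(merged)
```

With `keep="first"` and frames passed in chronological order, the value from the earlier period wins for any shared date.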

**Calculate Alpha191 returns and save them in the folder 'factor_returns' as a parquet file named 'alpha_returns'**

In [None]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')
factor_returns = calculator.calculateFactorReturns(alpha_factor)
factor_returns['period'] = factor_returns['period'].dt.to_timestamp()
factor_returns.to_parquet(r'factor_returns\alpha_returns.parquet', index=False)
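
The `.dt.to_timestamp()` call above converts a column of pandas periods into ordinary timestamps before saving. The toy example below shows the effect; the monthly frequency and the column `alpha001` are assumptions made for illustration, since the actual frequency of `period` is not stated here.

```python
import pandas as pd

# Toy frame; it only loosely mimics the layout of the factor returns.
toy = pd.DataFrame({
    "period": pd.period_range("2020-01", periods=3, freq="M"),
    "alpha001": [0.01, -0.02, 0.03],
})
print(toy["period"].dtype)  # a Period dtype before the conversion

# Each period is replaced by the timestamp at which it starts.
toy["period"] = toy["period"].dt.to_timestamp()
print(toy["period"].dtype)
print(toy["period"].tolist())
```

By default the conversion uses the start of each period, so January 2020 becomes 2020-01-01.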

## Calculate Jensen Dataset Returns

**Create an instance object from 'class_USACalculator.py'**

In [None]:
usa_calculator = USACalculator(
    csv_file=r"data\usa.csv", 
    output_dir='usa_chunks',
    chunk_size=300000
)

**Split the Jensen dataset file 'usa.csv' into 14 parquet files and save them in the folder 'usa_chunks'**

In [None]:
usa_calculator.csv_to_parquet_chunks()
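
How `csv_to_parquet_chunks` works internally is not shown here. A common way to split a CSV that is too large for memory is the `chunksize` argument of `pd.read_csv`, sketched below on a tiny in-memory file. The helper name and the toy columns are assumptions.

```python
import io

import pandas as pd


def iter_csv_chunks(source, chunk_size):
    """Yield (chunk_number, DataFrame) pairs of at most `chunk_size` rows."""
    reader = pd.read_csv(source, chunksize=chunk_size)
    for number, chunk in enumerate(reader):
        yield number, chunk


# Tiny in-memory stand-in for the real file: 7 data rows, chunks of 3 rows.
toy_csv = io.StringIO("id,ret\n" + "\n".join(f"{i},{i / 100}" for i in range(7)))
sizes = [len(chunk) for _, chunk in iter_csv_chunks(toy_csv, chunk_size=3)]
print(sizes)  # → [3, 3, 1]
```

With the real file, each chunk could then be written out with `chunk.to_parquet(...)` under a file name of your choice, so that only one chunk is held in memory at a time.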

**Filter each chunk and save the filtered parquet files in the same folder**

In [None]:
usa_calculator.filter_chunks()

**Merge all the filtered chunks into one parquet file named 'usa_factor' in the folder 'factor_value'**

In [None]:
usa_calculator.merge_filters()

**Calculate the Jensen dataset returns and save them as a parquet file named 'usa_returns' in the folder 'factor_returns'**

In [None]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')
factor_returns_list = calculateFactorReturns(usa_factor)
factor_returns_list['period'] = factor_returns_list['period'].dt.to_timestamp()
factor_returns_list.to_parquet(r'factor_returns\usa_returns.parquet', index=False)

## Build 3×2 Bivariate-Sorted Portfolios (Alpha191 part)

**Read file**

In [None]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')

**Create an instance object from 'class_Alpha191Portfolios.py'**

In [None]:
alpha_calculator = Alpha191Portfolios(alpha_factor)

**Preprocess the 'alpha_factor.parquet' data for building portfolios and for the calculations that follow**

In [None]:
alpha_calculator.data_preprocess()

**Build 3×2 bivariate-sorted portfolios by sorting on the prior full year's average characteristics and average market_cap**

1. Sort on the full-year average characteristics and average market_cap to build the portfolios used in the following year

In [None]:
alpha_calculator.portfolio_group()

2. Calculate portfolio weighted returns using the prior year's sorting results

In [None]:
alpha_port_ret = alpha_calculator.portfolio_ret()
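
The two steps above can be illustrated on toy data. The sketch below is not the code in 'class_Alpha191Portfolios.py': the median size breakpoint, the 30/70 characteristic breakpoints, the weighting by prior-year average market cap, and all column and function names are assumptions chosen to show the general shape of a 3×2 sort.

```python
import pandas as pd


def assign_3x2(prior_year, char_col="char_avg", size_col="mktcap_avg"):
    """Label each stock with one of 2 size groups and 3 characteristic groups."""
    out = prior_year.copy()
    size_median = out[size_col].median()            # assumed size breakpoint
    low, high = out[char_col].quantile([0.3, 0.7])  # assumed 30/70 breakpoints
    out["size_grp"] = (out[size_col] > size_median).map({False: "small", True: "big"})
    out["char_grp"] = "mid"
    out.loc[out[char_col] <= low, "char_grp"] = "low"
    out.loc[out[char_col] > high, "char_grp"] = "high"
    return out


def value_weighted_returns(groups, next_year, ret_col="ret", size_col="mktcap_avg"):
    """Cap-weighted return of each (size, characteristic) cell in the next year."""
    cols = ["ticker", "size_grp", "char_grp", size_col]
    merged = next_year.merge(groups[cols], on="ticker")
    merged["w_ret"] = merged[ret_col] * merged[size_col]
    sums = merged.groupby(["size_grp", "char_grp"])[["w_ret", size_col]].sum()
    return sums["w_ret"] / sums[size_col]


# Prior-year averages used for sorting, and next-year returns to be weighted.
prior = pd.DataFrame({
    "ticker": list("ABCDEF"),
    "mktcap_avg": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "char_avg": [1.0, 2.0, 5.0, 3.0, 4.0, 6.0],
})
following = pd.DataFrame({
    "ticker": list("ABCDEF"),
    "ret": [0.03, 0.06, 0.02, 0.01, 0.04, -0.01],
})

vw = value_weighted_returns(assign_3x2(prior), following)
print(vw)
```

In this toy example, stocks A and B land in the small/low cell, so its return is (0.03 × 1 + 0.06 × 2) / 3 = 0.05. With only six stocks, two of the six cells stay empty and do not appear in the output.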

**Save the results as a parquet file named 'alpha_port_ret' in the folder 'portfolios'**

In [None]:
alpha_port_ret.to_parquet(r'portfolios\alpha_port_ret.parquet', index=False)

## Build 5×5 Bivariate-Sorted Portfolios (Alpha191 part)

The process is the same as for building the 3×2 bivariate-sorted portfolios (Alpha191 part).

In [None]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')

In [None]:
alpha55_calculator = Alpha191Portfolios55(alpha_factor)

In [None]:
alpha55_calculator.data_preprocess()

In [None]:
alpha55_calculator.portfolio_group()

In [None]:
alpha55_port_ret = alpha55_calculator.portfolio_ret()

In [None]:
alpha55_port_ret.to_parquet(r'portfolios\alpha55_port_ret.parquet', index=False)

## Build 3×2 Bivariate-Sorted Portfolios (Jensen dataset part)

**Read file**

In [None]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')

**Create an instance object from 'class_UsaPortfolios.py'**

In [None]:
usa_calculator = UsaPortfolios(usa_factor)

**Preprocess the usa_factor data for building portfolios and for the calculations that follow**

In [None]:
usa_calculator.data_preprocess()

**Build 3×2 bivariate-sorted portfolios by sorting on the prior full year's average characteristics and average market_cap**

1. Sort on the full-year average characteristics and average market_cap to build the portfolios used in the following year

In [None]:
usa_calculator.portfolio_group()

2. Calculate portfolio weighted returns using the prior year's sorting results

In [None]:
usa_port_ret = usa_calculator.portfolio_ret()

**Save the results as a parquet file named 'usa_port_ret' in the folder 'portfolios'**

In [None]:
usa_port_ret.to_parquet(r'portfolios\usa_port_ret.parquet', index=False)

## Build 5×5 Bivariate-Sorted Portfolios (Jensen dataset part)

The process is the same as for building the 3×2 bivariate-sorted portfolios (Jensen dataset part).

In [None]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')

In [None]:
usa55_calculator = UsaPortfolios55(usa_factor)

In [None]:
usa55_calculator.data_preprocess()

In [None]:
usa55_calculator.portfolio_group()

In [None]:
usa55_port_ret = usa55_calculator.portfolio_ret()

In [None]:
usa55_port_ret.to_parquet(r'portfolios\usa55_port_ret.parquet', index=False)

## Regression of Double-Selection (DS), Single-Selection (SS), Elastic Net (EN), and Principal Component Analysis (PCA) using 3×2 Bivariate-Sorted Portfolios

**Read file**

In [None]:
alpha_port = pd.read_parquet(r'portfolios\alpha_port_ret.parquet')
usa_port = pd.read_parquet(r'portfolios\usa_port_ret.parquet')
alpha_ret = pd.read_parquet(r'factor_returns\alpha_returns.parquet')
usa_ret = pd.read_parquet(r'factor_returns\usa_returns.parquet')

**Create an instance object from 'class_DSregression.py'**

In [None]:
DS_calculator = DSregression(alpha_port, usa_port, alpha_ret, usa_ret)

**Preprocess data**

In [None]:
DS_calculator.preprocessData()

**Run the DS regression and get the first LASSO results "I_1" and the second LASSO results "I_2"**

In [None]:
I_1, I_2 = DS_calculator.DSregression()

In [None]:
I_1

In [None]:
I_2
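
The internals of `DSregression()` are not shown in this notebook. As rough orientation only, the sketch below runs a generic double-selection procedure on synthetic data: a first LASSO selects the controls that explain average returns, a second LASSO selects the controls that explain the tested factor's covariances, and a final OLS uses the union of both sets. Everything in it is an illustrative assumption and may differ from 'class_DSregression.py': the function names, the penalty values, the synthetic inputs, and the small hand-written coordinate-descent LASSO.

```python
import numpy as np


def lasso_select(X, y, lam, n_sweeps=200):
    """Indices of columns with non-zero LASSO coefficients (coordinate descent).

    Minimises (1 / 2n) * ||y - Xb||^2 + lam * ||b||_1 after centring X and y.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    beta = np.zeros(p)
    scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with column j's current contribution added back.
            partial = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial / n
            # Soft-thresholding sets small coefficients exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / scale[j]
    return [j for j in range(p) if beta[j] != 0.0]


def double_selection(avg_ret, cov_g, cov_h, lam1, lam2):
    """Return both selected control sets and the OLS coefficient on cov_g."""
    first = lasso_select(cov_h, avg_ret, lam1)   # controls that explain returns
    second = lasso_select(cov_h, cov_g, lam2)    # controls that explain the factor
    keep = sorted(set(first) | set(second))
    design = np.column_stack([np.ones(len(avg_ret)), cov_g, cov_h[:, keep]])
    coef, *_ = np.linalg.lstsq(design, avg_ret, rcond=None)
    return first, second, coef[1]


# Synthetic inputs: 200 hypothetical test assets and 10 control factors.
rng = np.random.default_rng(0)
cov_h = rng.standard_normal((200, 10))
cov_g = 1.5 * cov_h[:, 3] + 0.5 * rng.standard_normal(200)
# Control 3 matters for returns, but its effect is offset by the tested factor.
avg_ret = 2.0 * cov_h[:, 0] + 1.0 * cov_g - 1.5 * cov_h[:, 3]

first, second, price_g = double_selection(avg_ret, cov_g, cov_h, lam1=0.5, lam2=0.5)
print(first, second, round(price_g, 6))
```

In this synthetic setup, control 3 is invisible to the first pass because its effect on returns is cancelled by the tested factor. The second pass picks it up, which is why the final OLS can recover the coefficient of 1.0 on the tested factor.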

**Run the SS regression and get the LASSO results "I"**

In [None]:
I = DS_calculator.SSregression()

In [None]:
I

**Run the EN regression and get the results "I_3"**

In [None]:
I_3 = DS_calculator.ENregression()

In [None]:
I_3

**PCA regression**

In [None]:
I_4 = DS_calculator.PCAregression()

## Regression of Double-Selection (DS) using 5×5 Bivariate-Sorted Portfolios

**Read file**

In [None]:
alpha55_port = pd.read_parquet(r'portfolios\alpha55_port_ret.parquet')
usa55_port = pd.read_parquet(r'portfolios\usa55_port_ret.parquet')
alpha_ret = pd.read_parquet(r'factor_returns\alpha_returns.parquet')
usa_ret = pd.read_parquet(r'factor_returns\usa_returns.parquet')

**Create an instance object from 'class_DSregression.py'**

In [None]:
DS_calculator = DSregression(alpha55_port, usa55_port, alpha_ret, usa_ret)

**Preprocess data**

In [None]:
DS_calculator.preprocessData()

**Run the DS regression and get the first LASSO results "I_1" and the second LASSO results "I_2"**

In [None]:
I_1, I_2 = DS_calculator.DSregression()

In [None]:
I_1

In [None]:
I_2