# **The Marginal Contribution of Alpha191 Factors on Non-A-Share Market**

This Jupyter notebook accompanies the master's thesis *The Marginal Contribution of Alpha191 Factors on the Non-A-Share Market*. For clarity, all main code and the corresponding explanations are integrated into this document, so readers can follow the author's workflow step by step. This supports both reproducibility and a deeper understanding of the research.

## Environment Setup

**1. Using conda (Recommended)**

In [2]:
conda env create -f env/environment.yml

3 channel Terms of Service accepted
Channels:
 - defaults
 - conda-forge
Platform: win-64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages: ...working... done
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Installing pip dependencies: - Ran pip subprocess with arguments:
['C:\\Users\\49282\\miniconda3\\envs\\env_of_alpha\\python.exe', '-m', 'pip', 'install', '-U', '-r', 'd:\\The Marginal Contribution of Alpha191 Factors on Non-A-Share Market - new\\env\\condaenv.pgpj63gj.requirements.txt', '--exists-action=b']
Pip subprocess output:
Collecting bleach==6.2.0 (from -r d:\The Marginal Contribution of Alpha191 Factors on Non-A-Share Market - new\env\condaenv.pgpj63gj.requirements.txt (line 1))

  Using cached bleach-6.2.0-py3-none-any.whl.metadata (30 kB)

Collecting certifi==2025.10.5 (from -r d:\The Marginal Contribution of Alpha191 Factors on Non-A-Share Market - new\env\condaenv.pgpj63g

**2. Using pip**

In [None]:
pip install -r env/requirements.txt

Before proceeding with the next steps, please ensure your IDE and Jupyter Notebook have been switched to the new environment.
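A quick way to confirm that the notebook kernel really runs inside the new environment is to inspect the interpreter path. The sketch below is only an illustration: the helper `in_environment` is hypothetical, and the environment name `env_of_alpha` is taken from the conda log above, so adjust it if your environment is named differently.

```python
import sys
from pathlib import PureWindowsPath


def in_environment(executable: str, env_name: str) -> bool:
    """Return True if a directory component of the interpreter path equals env_name."""
    # PureWindowsPath splits on both "\" and "/", so Windows and POSIX paths both work.
    parts = PureWindowsPath(executable).parts
    return env_name in parts[:-1]  # ignore the interpreter file name itself


# Path of the interpreter that runs this kernel.
print(sys.executable)
# "env_of_alpha" is assumed from the conda output; change it if needed.
print(in_environment(sys.executable, "env_of_alpha"))
```

If the second line prints `False`, the kernel is probably still attached to another interpreter.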

## Dataset Download Guide

This guide explains how to download the datasets from Kaggle, in particular the file "usa.csv".

**Step 1: Install Kaggle**

This step should already be complete if you have finished the **Environment Setup** process. To confirm, look for a ".kaggle" folder in your home directory:
- Windows: C:\Users\\\<username>\
- Linux/Mac: ~/

If you find the ".kaggle" folder, you can skip this step. Otherwise, run the following command manually.

In [None]:
pip install kaggle

**Step 2: Setup API Credentials**

1. Go to Kaggle.com

2. Click on your profile picture → "Settings"

3. Scroll down to "API" section

4. Click "Create New API Token". This downloads a "kaggle.json" file

5. Place the file in:

- Windows: C:\Users\\\<username>\\.kaggle\kaggle.json

- Linux/Mac: ~/.kaggle/kaggle.json
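To check that the token ended up in the right place, a small stdlib-only sketch like the following may help. The helper names `kaggle_token_path` and `has_kaggle_token` are hypothetical and not part of the Kaggle package; the sketch only tests whether the file exists, not whether the token is valid.

```python
from pathlib import Path


def kaggle_token_path(home: Path) -> Path:
    """Location where the steps above place the API token."""
    return home / ".kaggle" / "kaggle.json"


def has_kaggle_token(home: Path) -> bool:
    """Return True if kaggle.json exists under the given home directory."""
    return kaggle_token_path(home).is_file()


# Path.home() resolves to C:\Users\<username> on Windows and ~ on Linux/Mac.
print(has_kaggle_token(Path.home()))
```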

**Step 3: Download Dataset**

Please execute the following code to download the "usa.csv" dataset. This may take several minutes, as the file is approximately 14GB. Please ensure you have sufficient storage space.

In [1]:
from download import *
download_usa_csv()

Dataset URL: https://www.kaggle.com/datasets/jindu4928/usa-csv
usa.csv dataset download completed. Files have been extracted to the 'data' folder.


## Import

In [2]:
import pandas as pd
from class_AlphaCalculator import *
from class_USACalculator import *
from combine_times_files import combine_times_files
from usa_returns import calculateFactorReturns
from class_Alpha191Portfolios import *
from class_UsaPortfolios import *
from class_DSregression import *
from class_Alpha191Portfolios55 import *
from class_UsaPortfolios55 import *

## Calculate Alpha191 Returns

**Read the S&P 500 data files used to calculate Alpha191 returns**

In [3]:
CP = pd.read_parquet(r'data\SPXConstituentsPrices.parquet')
CD = pd.read_parquet(r'data\SPXConstituentsDaily.parquet')
CM = pd.read_parquet(r'data\CompleteMapping.parquet')
SP = pd.read_parquet(r'data\SPXPrices.parquet')
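The raw strings above use backslashes, which act as path separators on Windows only; on Linux or macOS these paths may not resolve. A portable alternative is to build paths with `pathlib`, as in this sketch. The helper `data_file` and the constant `DATA_DIR` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Same folder name the notebook uses for its input files.
DATA_DIR = Path("data")


def data_file(name: str) -> Path:
    """Build a path inside the data folder that works on Windows, Linux and macOS."""
    return DATA_DIR / name


# as_posix() always renders forward slashes, which is handy for display.
print(data_file("SPXPrices.parquet").as_posix())  # data/SPXPrices.parquet
```

A `Path` object can be passed directly to `pd.read_parquet`.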

**Create an instance object from 'class_AlphaCalculator.py'**

In [4]:
calculator = AlphaCalculator(CP, CD, CM, SP)

**Preprocess data**

In [5]:
calculator.preprocessData()

**Calculate Alpha191 factor values**

To keep the running time manageable, I split the date range into 6 parts:
1. **period0.py**: _'1996-01-02'--'2001-01-02'_
2. **period1.py**: _'2000-01-02'--'2006-01-02'_
3. **period2.py**: _'2005-01-02'--'2011-01-02'_
4. **period3.py**: _'2010-01-02'--'2016-01-02'_
5. **period4.py**: _'2015-01-02'--'2021-01-02'_
6. **period5.py**: _'2020-01-02'--'2023-08-31'_

_The scripts above are in the folder 'period_split'._

Run the 6 scripts concurrently to save time; each stores its results in the folder 'period_split' as a parquet file.

The 6 scripts cannot run at the same time inside a Jupyter notebook, so I suggest opening them and running each one manually in its own dedicated terminal. Alternatively, to save time, you can download the precomputed results from kaggle.com by executing the following code.

In [None]:
download_period_data()
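If you prefer to compute the period files yourself without opening six terminals by hand, the scripts could also be started from one Python process. This is a minimal sketch under assumptions: the helpers `period_commands` and `run_all` are hypothetical, and it assumes each script can be run as `python period_split/periodN.py` from the project root, which the source does not confirm.

```python
import subprocess
import sys
from pathlib import Path


def period_commands(folder="period_split", n_periods=6):
    """Build one command per script: period0.py ... period5.py."""
    return [
        [sys.executable, str(Path(folder) / f"period{i}.py")]
        for i in range(n_periods)
    ]


def run_all(commands):
    """Start every command at once, then wait for all and return the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


# Usage (commented out because it starts six long-running jobs):
# exit_codes = run_all(period_commands())
# print(exit_codes)  # a non-zero entry means that script failed
```

Six parallel jobs each hold their own copy of the data in memory, so this only makes sense on a machine with enough RAM.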

**Merge the 6 files into one parquet file named 'alpha_factor' in the folder 'factor_value'**

In [6]:
combine_times_files()
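Consecutive periods in the list above overlap by about a year, so any merge has to decide which file supplies the values for an overlapping date. How `combine_times_files` resolves this is not shown here. The sketch below illustrates one possible rule, assumed purely for illustration: take each date from the earliest period that contains it, since that period has the longest history before the date. The function `owning_period` is hypothetical.

```python
from datetime import date

# Period boundaries copied from the list above (period0 ... period5).
PERIODS = [
    (date(1996, 1, 2), date(2001, 1, 2)),
    (date(2000, 1, 2), date(2006, 1, 2)),
    (date(2005, 1, 2), date(2011, 1, 2)),
    (date(2010, 1, 2), date(2016, 1, 2)),
    (date(2015, 1, 2), date(2021, 1, 2)),
    (date(2020, 1, 2), date(2023, 8, 31)),
]


def owning_period(day, periods=PERIODS):
    """Index of the earliest period containing day, or None if no period does."""
    for i, (start, end) in enumerate(periods):
        if start <= day <= end:
            return i
    return None


print(owning_period(date(2000, 6, 1)))  # 0: inside the period0/period1 overlap
print(owning_period(date(2001, 1, 3)))  # 1: just after period0 ends
```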

**Calculate Alpha191 returns and save them in the folder 'factor_returns' as a parquet file named 'alpha_returns'**

In [7]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')
factor_returns = calculator.calculateFactorReturns(alpha_factor)
factor_returns['period'] = factor_returns['period'].dt.to_timestamp()
factor_returns.to_parquet(r'factor_returns\alpha_returns.parquet', index=False)

Processing alpha_001: 100%|██████████| 236/236 [00:01<00:00, 190.11it/s]
Processing alpha_002: 100%|██████████| 236/236 [00:01<00:00, 198.51it/s]
Processing alpha_003: 100%|██████████| 236/236 [00:01<00:00, 198.13it/s]
Processing alpha_004: 100%|██████████| 236/236 [00:00<00:00, 331.81it/s]
Processing alpha_006: 100%|██████████| 236/236 [00:00<00:00, 289.66it/s]
Processing alpha_007: 100%|██████████| 236/236 [00:01<00:00, 190.50it/s]
Processing alpha_008: 100%|██████████| 236/236 [00:00<00:00, 246.80it/s]
Processing alpha_009: 100%|██████████| 236/236 [00:01<00:00, 153.06it/s]
Processing alpha_010: 100%|██████████| 236/236 [00:00<00:00, 242.04it/s]
Processing alpha_011: 100%|██████████| 236/236 [00:01<00:00, 166.02it/s]
Processing alpha_012: 100%|██████████| 236/236 [00:01<00:00, 140.43it/s]
Processing alpha_013: 100%|██████████| 236/236 [00:02<00:00, 92.21it/s]
Processing alpha_014: 100%|██████████| 236/236 [00:02<00:00, 92.13it/s]
Processing alpha_015: 100%|██████████| 236/236 [00:02

## Calculate Jensen Dataset Returns

**Create an instance object from 'class_USACalculator.py'**

In [8]:
usa_calculator = USACalculator(
    csv_file=r"data\usa.csv", 
    output_dir='usa_chunks',
    chunk_size=300000
)

**Split the Jensen dataset 'usa.csv' into 14 parquet files and save them in the folder 'usa_chunks'**

In [9]:
usa_calculator.csv_to_parquet_chunks()

Created usa_chunks\chunk_1.parquet with 300000 rows
Created usa_chunks\chunk_2.parquet with 300000 rows
Created usa_chunks\chunk_3.parquet with 300000 rows
Created usa_chunks\chunk_4.parquet with 300000 rows
Created usa_chunks\chunk_5.parquet with 300000 rows
Created usa_chunks\chunk_6.parquet with 300000 rows
Created usa_chunks\chunk_7.parquet with 300000 rows
Created usa_chunks\chunk_8.parquet with 300000 rows
Created usa_chunks\chunk_9.parquet with 300000 rows
Created usa_chunks\chunk_10.parquet with 300000 rows
Created usa_chunks\chunk_11.parquet with 300000 rows
Created usa_chunks\chunk_12.parquet with 300000 rows
Created usa_chunks\chunk_13.parquet with 300000 rows
Created final usa_chunks\chunk_14.parquet with 235225 rows
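The output above follows directly from the chunk size: 13 full chunks of 300000 rows plus a final chunk of 235225 rows. The sketch below reproduces that arithmetic and shows a generic way to cut any row iterator into fixed-size pieces. It is not the code of `csv_to_parquet_chunks`; the helpers `chunk_sizes` and `iter_chunks` are hypothetical, and the total row count is inferred from the log.

```python
from itertools import islice


def chunk_sizes(total_rows, chunk_size):
    """Sizes of the chunks produced when total_rows are cut into chunk_size pieces."""
    full, rest = divmod(total_rows, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size items without loading everything."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Total inferred from the log: 13 chunks of 300000 rows plus 235225 rows.
sizes = chunk_sizes(13 * 300_000 + 235_225, 300_000)
print(len(sizes), sizes[-1])  # 14 235225
```

With pandas, the same streaming idea is available through the `chunksize` argument of `pd.read_csv`.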


**Filter each chunk and save the filtered parquet files in the same folder**

In [10]:
usa_calculator.filter_chunks()

100%|██████████| 14/14 [04:12<00:00, 18.02s/it]


**Merge all the filtered chunks into one parquet file named 'usa_factor' in the folder 'factor_value'**

In [11]:
usa_calculator.merge_filters()

**Calculate the usa returns and save them in the folder 'factor_returns' as a parquet file named 'usa_returns'**

In [12]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')
factor_returns_list = calculateFactorReturns(usa_factor)
factor_returns_list['period'] = factor_returns_list['period'].dt.to_timestamp()
factor_returns_list.to_parquet(r'factor_returns\usa_returns.parquet', index=False)

Processing niq_su: 100%|██████████| 702/702 [00:01<00:00, 538.41it/s] 
Processing ret_6_1: 100%|██████████| 1159/1159 [00:01<00:00, 849.37it/s]
Processing ret_12_1: 100%|██████████| 1153/1153 [00:01<00:00, 795.61it/s]
Processing saleq_su: 100%|██████████| 702/702 [00:01<00:00, 615.95it/s] 
Processing tax_gr1a: 100%|██████████| 855/855 [00:02<00:00, 370.06it/s] 
Processing ni_inc8q: 100%|██████████| 696/696 [00:01<00:00, 398.11it/s] 
Processing prc_highprc_252d: 100%|██████████| 1160/1160 [00:02<00:00, 551.57it/s]
Processing resff3_6_1: 100%|██████████| 837/837 [00:01<00:00, 441.53it/s] 
Processing resff3_12_1: 100%|██████████| 837/837 [00:02<00:00, 417.77it/s] 
Processing be_me: 100%|██████████| 867/867 [00:02<00:00, 402.51it/s] 
Processing debt_me: 100%|██████████| 867/867 [00:02<00:00, 417.56it/s] 
Processing at_me: 100%|██████████| 867/867 [00:02<00:00, 418.44it/s] 
Processing ret_60_12: 100%|██████████| 1105/1105 [00:02<00:00, 543.16it/s]
Processing ni_me: 100%|██████████| 867/867 

## Build 3×2 Bivariate-Sorted Portfolios (Alpha191 part)

**Read file**

In [13]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')

**Create an instance object from 'class_Alpha191Portfolios.py'**

In [14]:
alpha_calculator = Alpha191Portfolios(alpha_factor)

**Preprocess the 'alpha_factor.parquet' data for building portfolios and the subsequent calculations**

In [15]:
alpha_calculator.data_preprocess()

**Build 3×2 bivariate-sorted portfolios by sorting on the prior full-year averages of each characteristic and of market_cap**

1. Sort on the full-year characteristic average and market_cap average to build the portfolios used in the following year

In [16]:
alpha_calculator.portfolio_group()

Processing alpha_001: 100%|██████████| 19/19 [00:00<00:00, 211.58it/s]
Processing alpha_002: 100%|██████████| 19/19 [00:00<00:00, 213.46it/s]
Processing alpha_003: 100%|██████████| 19/19 [00:00<00:00, 183.54it/s]
Processing alpha_007: 100%|██████████| 19/19 [00:00<00:00, 214.59it/s]
Processing alpha_008: 100%|██████████| 19/19 [00:00<00:00, 186.49it/s]
Processing alpha_009: 100%|██████████| 19/19 [00:00<00:00, 225.66it/s]
Processing alpha_010: 100%|██████████| 19/19 [00:00<00:00, 178.21it/s]
Processing alpha_011: 100%|██████████| 19/19 [00:00<00:00, 206.80it/s]
Processing alpha_012: 100%|██████████| 19/19 [00:00<00:00, 187.70it/s]
Processing alpha_013: 100%|██████████| 19/19 [00:00<00:00, 218.94it/s]
Processing alpha_014: 100%|██████████| 19/19 [00:00<00:00, 218.15it/s]
Processing alpha_015: 100%|██████████| 19/19 [00:00<00:00, 172.68it/s]
Processing alpha_017: 100%|██████████| 19/19 [00:00<00:00, 194.78it/s]
Processing alpha_018: 100%|██████████| 19/19 [00:00<00:00, 197.66it/s]
Proces

there is a empty list -> i: alpha_150, period: 2010, key: high1
there is a empty list -> i: alpha_150, period: 2017, key: high1
there is a empty list -> i: alpha_150, period: 2018, key: high1
there is a empty list -> i: alpha_150, period: 2019, key: high1
there is a empty list -> i: alpha_150, period: 2020, key: high1
there is a empty list -> i: alpha_150, period: 2021, key: high1
there is a empty list -> i: alpha_150, period: 2022, key: high1


Processing alpha_151: 100%|██████████| 19/19 [00:00<00:00, 222.48it/s]
Processing alpha_152: 100%|██████████| 19/19 [00:00<00:00, 231.97it/s]
Processing alpha_153: 100%|██████████| 19/19 [00:00<00:00, 229.38it/s]
Processing alpha_155: 100%|██████████| 19/19 [00:00<00:00, 207.59it/s]
Processing alpha_156: 100%|██████████| 19/19 [00:00<00:00, 225.24it/s]
Processing alpha_157: 100%|██████████| 19/19 [00:00<00:00, 177.38it/s]
Processing alpha_158: 100%|██████████| 19/19 [00:00<00:00, 217.54it/s]
Processing alpha_159: 100%|██████████| 19/19 [00:00<00:00, 199.85it/s]
Processing alpha_160: 100%|██████████| 19/19 [00:00<00:00, 216.61it/s]
Processing alpha_161: 100%|██████████| 19/19 [00:00<00:00, 181.66it/s]
Processing alpha_162: 100%|██████████| 19/19 [00:00<00:00, 196.73it/s]
Processing alpha_163: 100%|██████████| 19/19 [00:00<00:00, 197.20it/s]
Processing alpha_164: 100%|██████████| 19/19 [00:00<00:00, 187.98it/s]
Processing alpha_166: 100%|██████████| 19/19 [00:00<00:00, 225.70it/s]
Proces

there is a empty list -> i: alpha_182, period: 2012, key: high1


Processing alpha_184: 100%|██████████| 19/19 [00:00<00:00, 205.52it/s]
Processing alpha_185: 100%|██████████| 19/19 [00:00<00:00, 203.95it/s]
Processing alpha_186: 100%|██████████| 19/19 [00:00<00:00, 194.46it/s]
Processing alpha_187: 100%|██████████| 19/19 [00:00<00:00, 186.91it/s]
Processing alpha_188: 100%|██████████| 19/19 [00:00<00:00, 204.76it/s]
Processing alpha_189: 100%|██████████| 19/19 [00:00<00:00, 189.17it/s]
Processing alpha_190: 100%|██████████| 19/19 [00:00<00:00, 194.16it/s]
Processing alpha_191: 100%|██████████| 19/19 [00:00<00:00, 235.46it/s]


2. Calculate weighted portfolio returns using the prior year's sorting results

In [17]:
alpha_port_ret = alpha_calculator.portfolio_ret()

Processing alpha_001: 100%|██████████| 228/228 [00:00<00:00, 633.39it/s]
Processing alpha_002: 100%|██████████| 228/228 [00:00<00:00, 630.66it/s]
Processing alpha_003: 100%|██████████| 228/228 [00:00<00:00, 648.71it/s]
Processing alpha_007: 100%|██████████| 228/228 [00:00<00:00, 637.56it/s]
Processing alpha_008: 100%|██████████| 228/228 [00:00<00:00, 639.17it/s]
Processing alpha_009: 100%|██████████| 228/228 [00:00<00:00, 639.02it/s]
Processing alpha_010: 100%|██████████| 228/228 [00:00<00:00, 645.59it/s]
Processing alpha_011: 100%|██████████| 228/228 [00:00<00:00, 579.24it/s]
Processing alpha_012: 100%|██████████| 228/228 [00:00<00:00, 652.37it/s]
Processing alpha_013: 100%|██████████| 228/228 [00:00<00:00, 651.96it/s]
Processing alpha_014: 100%|██████████| 228/228 [00:00<00:00, 614.44it/s]
Processing alpha_015: 100%|██████████| 228/228 [00:00<00:00, 593.73it/s]
Processing alpha_017: 100%|██████████| 228/228 [00:00<00:00, 658.76it/s]
Processing alpha_018: 100%|██████████| 228/228 [00:

**Save the results in the folder 'portfolios' as a parquet file named 'alpha_port_ret'**

In [18]:
alpha_port_ret.to_parquet(r'portfolios\alpha_port_ret.parquet', index=False)
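To make the 3×2 sort concrete, here is a minimal sketch of an independent double sort. The breakpoints are an assumption (30th and 70th percentiles for the characteristic, median for size, as in the common Fama-French convention); the source does not state which breakpoints `portfolio_group` uses. The labels such as `low_small` are hypothetical and differ from the keys (`high1`, `low2`, ...) in the output above.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def sort_3x2(stocks):
    """stocks maps ticker -> (prior-year characteristic avg, prior-year market cap avg)."""
    groups = {f"{c}_{s}": [] for c in ("low", "mid", "high") for s in ("small", "big")}
    if not stocks:
        return groups
    chars = sorted(c for c, _ in stocks.values())
    caps = sorted(m for _, m in stocks.values())
    c30, c70 = percentile(chars, 0.3), percentile(chars, 0.7)  # assumed breakpoints
    size_median = percentile(caps, 0.5)                        # assumed breakpoint
    for ticker, (c, m) in stocks.items():
        c_label = "low" if c <= c30 else ("high" if c > c70 else "mid")
        s_label = "small" if m <= size_median else "big"
        groups[f"{c_label}_{s_label}"].append(ticker)
    return groups


# Characteristic perfectly correlated with size: two of the six cells stay empty.
demo = sort_3x2({f"s{i}": (i, i) for i in range(1, 11)})
print(demo["low_small"], demo["low_big"], demo["high_small"])
```

Because the two sorts are independent, a characteristic that is strongly related to size can leave some cells without any stock, which may be what the "empty list" messages in the output reflect.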

## Build 5×5 Bivariate-Sorted Portfolios (Alpha191 part)

The process is the same as building the 3×2 bivariate-sorted portfolios (Alpha191 part).

In [19]:
alpha_factor = pd.read_parquet(r'factor_value\alpha_factor.parquet')

In [20]:
alpha55_calculator = Alpha191Portfolios55(alpha_factor)

In [21]:
alpha55_calculator.data_preprocess()

In [22]:
alpha55_calculator.portfolio_group()

Processing alpha_001: 100%|██████████| 19/19 [00:00<00:00, 102.04it/s]
Processing alpha_002: 100%|██████████| 19/19 [00:00<00:00, 111.96it/s]
Processing alpha_003: 100%|██████████| 19/19 [00:00<00:00, 103.90it/s]
Processing alpha_007: 100%|██████████| 19/19 [00:00<00:00, 106.08it/s]
Processing alpha_008: 100%|██████████| 19/19 [00:00<00:00, 109.12it/s]
Processing alpha_009: 100%|██████████| 19/19 [00:00<00:00, 96.79it/s] 
Processing alpha_010: 100%|██████████| 19/19 [00:00<00:00, 91.97it/s]
Processing alpha_011: 100%|██████████| 19/19 [00:00<00:00, 102.74it/s]
Processing alpha_012: 100%|██████████| 19/19 [00:00<00:00, 96.79it/s] 
Processing alpha_013: 100%|██████████| 19/19 [00:00<00:00, 103.56it/s]
Processing alpha_014: 100%|██████████| 19/19 [00:00<00:00, 107.69it/s]
Processing alpha_015: 100%|██████████| 19/19 [00:00<00:00, 105.76it/s]
Processing alpha_017: 100%|██████████| 19/19 [00:00<00:00, 98.68it/s]
Processing alpha_018: 100%|██████████| 19/19 [00:00<00:00, 94.79it/s]
Processin

In [23]:
alpha55_port_ret = alpha55_calculator.portfolio_ret()

Processing alpha_001: 100%|██████████| 228/228 [00:01<00:00, 166.11it/s]
Processing alpha_002: 100%|██████████| 228/228 [00:01<00:00, 152.59it/s]
Processing alpha_003: 100%|██████████| 228/228 [00:01<00:00, 174.90it/s]
Processing alpha_007: 100%|██████████| 228/228 [00:01<00:00, 167.87it/s]
Processing alpha_008: 100%|██████████| 228/228 [00:01<00:00, 178.52it/s]
Processing alpha_009: 100%|██████████| 228/228 [00:01<00:00, 180.32it/s]
Processing alpha_010: 100%|██████████| 228/228 [00:01<00:00, 155.61it/s]
Processing alpha_011: 100%|██████████| 228/228 [00:02<00:00, 92.37it/s]
Processing alpha_012: 100%|██████████| 228/228 [00:02<00:00, 90.58it/s]
Processing alpha_013: 100%|██████████| 228/228 [00:02<00:00, 95.00it/s]
Processing alpha_014: 100%|██████████| 228/228 [00:02<00:00, 89.90it/s] 
Processing alpha_015: 100%|██████████| 228/228 [00:02<00:00, 105.33it/s]
Processing alpha_017: 100%|██████████| 228/228 [00:02<00:00, 91.69it/s]
Processing alpha_018: 100%|██████████| 228/228 [00:02<0

In [24]:
alpha55_port_ret.to_parquet(r'portfolios\alpha55_port_ret.parquet', index=False)

## Build 3×2 Bivariate-Sorted Portfolios (Jensen dataset part)

**Read file**

In [25]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')

**Create an instance object from 'class_UsaPortfolios.py'**

In [26]:
usa_calculator = UsaPortfolios(usa_factor)

**Preprocess the usa_factor data for building portfolios and the subsequent calculations**

In [27]:
usa_calculator.data_preprocess()

**Build 3×2 bivariate-sorted portfolios by sorting on the prior full-year averages of each characteristic and of market_cap**

1. Sort on the full-year characteristic average and market_cap average to build the portfolios used in the following year

In [28]:
usa_calculator.portfolio_group()

Processing niq_su: 100%|██████████| 19/19 [00:00<00:00, 168.81it/s]
Processing ret_6_1: 100%|██████████| 19/19 [00:00<00:00, 211.76it/s]
Processing ret_12_1: 100%|██████████| 19/19 [00:00<00:00, 224.60it/s]
Processing saleq_su: 100%|██████████| 19/19 [00:00<00:00, 191.73it/s]
Processing tax_gr1a: 100%|██████████| 19/19 [00:00<00:00, 200.10it/s]
Processing prc_highprc_252d: 100%|██████████| 19/19 [00:00<00:00, 206.17it/s]
Processing resff3_6_1: 100%|██████████| 19/19 [00:00<00:00, 238.95it/s]
Processing resff3_12_1: 100%|██████████| 19/19 [00:00<00:00, 194.79it/s]
Processing be_me: 100%|██████████| 19/19 [00:00<00:00, 199.80it/s]
Processing debt_me: 100%|██████████| 19/19 [00:00<00:00, 166.71it/s]
Processing at_me: 100%|██████████| 19/19 [00:00<00:00, 195.30it/s]
Processing ret_60_12: 100%|██████████| 19/19 [00:00<00:00, 196.94it/s]
Processing ni_me: 100%|██████████| 19/19 [00:00<00:00, 189.27it/s]
Processing fcf_me: 100%|██████████| 19/19 [00:00<00:00, 191.18it/s]
Processing div12m_me:

there is a empty list -> i: market_equity, period: 2004, key: low2
there is a empty list -> i: market_equity, period: 2004, key: high1
there is a empty list -> i: market_equity, period: 2005, key: low2
there is a empty list -> i: market_equity, period: 2005, key: high1
there is a empty list -> i: market_equity, period: 2006, key: low2
there is a empty list -> i: market_equity, period: 2006, key: high1
there is a empty list -> i: market_equity, period: 2007, key: low2
there is a empty list -> i: market_equity, period: 2007, key: high1
there is a empty list -> i: market_equity, period: 2008, key: low2
there is a empty list -> i: market_equity, period: 2008, key: high1
there is a empty list -> i: market_equity, period: 2009, key: low2
there is a empty list -> i: market_equity, period: 2009, key: high1
there is a empty list -> i: market_equity, period: 2010, key: low2
there is a empty list -> i: market_equity, period: 2010, key: high1
there is a empty list -> i: market_equity, period: 2011

Processing ivol_ff3_21d: 100%|██████████| 19/19 [00:00<00:00, 187.97it/s]
Processing ivol_capm_252d: 100%|██████████| 19/19 [00:00<00:00, 202.72it/s]
Processing ivol_capm_21d: 100%|██████████| 19/19 [00:00<00:00, 186.29it/s]
Processing ivol_hxz4_21d: 100%|██████████| 19/19 [00:00<00:00, 197.10it/s]
Processing rvol_21d: 100%|██████████| 19/19 [00:00<00:00, 205.19it/s]
Processing beta_60m: 100%|██████████| 19/19 [00:00<00:00, 193.68it/s]
Processing betabab_1260d: 100%|██████████| 19/19 [00:00<00:00, 188.07it/s]
Processing beta_dimson_21d: 100%|██████████| 19/19 [00:00<00:00, 206.96it/s]
Processing turnover_126d: 100%|██████████| 19/19 [00:00<00:00, 187.75it/s]
Processing turnover_var_126d: 100%|██████████| 19/19 [00:00<00:00, 192.75it/s]
Processing dolvol_126d: 100%|██████████| 19/19 [00:00<00:00, 192.09it/s]


there is a empty list -> i: dolvol_126d, period: 2004, key: high1
there is a empty list -> i: dolvol_126d, period: 2005, key: high1
there is a empty list -> i: dolvol_126d, period: 2009, key: high1
there is a empty list -> i: dolvol_126d, period: 2010, key: high1
there is a empty list -> i: dolvol_126d, period: 2013, key: high1
there is a empty list -> i: dolvol_126d, period: 2014, key: high1
there is a empty list -> i: dolvol_126d, period: 2015, key: high1
there is a empty list -> i: dolvol_126d, period: 2016, key: high1
there is a empty list -> i: dolvol_126d, period: 2017, key: high1
there is a empty list -> i: dolvol_126d, period: 2018, key: high1
there is a empty list -> i: dolvol_126d, period: 2019, key: high1
there is a empty list -> i: dolvol_126d, period: 2020, key: high1
there is a empty list -> i: dolvol_126d, period: 2021, key: high1
there is a empty list -> i: dolvol_126d, period: 2022, key: high1


Processing dolvol_var_126d: 100%|██████████| 19/19 [00:00<00:00, 165.13it/s]
Processing prc: 100%|██████████| 19/19 [00:00<00:00, 217.68it/s]
Processing ami_126d: 100%|██████████| 19/19 [00:00<00:00, 206.72it/s]


there is a empty list -> i: ami_126d, period: 2004, key: low1
there is a empty list -> i: ami_126d, period: 2004, key: high2
there is a empty list -> i: ami_126d, period: 2005, key: high2
there is a empty list -> i: ami_126d, period: 2006, key: low1
there is a empty list -> i: ami_126d, period: 2006, key: high2
there is a empty list -> i: ami_126d, period: 2007, key: high2
there is a empty list -> i: ami_126d, period: 2008, key: low1
there is a empty list -> i: ami_126d, period: 2008, key: high2
there is a empty list -> i: ami_126d, period: 2009, key: low1
there is a empty list -> i: ami_126d, period: 2009, key: high2
there is a empty list -> i: ami_126d, period: 2010, key: low1
there is a empty list -> i: ami_126d, period: 2010, key: high2
there is a empty list -> i: ami_126d, period: 2011, key: low1
there is a empty list -> i: ami_126d, period: 2011, key: high2
there is a empty list -> i: ami_126d, period: 2012, key: low1
there is a empty list -> i: ami_126d, period: 2012, key: high2

Processing zero_trades_21d: 100%|██████████| 19/19 [00:00<00:00, 184.85it/s]
Processing zero_trades_126d: 100%|██████████| 19/19 [00:00<00:00, 225.43it/s]
Processing zero_trades_252d: 100%|██████████| 19/19 [00:00<00:00, 217.18it/s]
Processing rmax1_21d: 100%|██████████| 19/19 [00:00<00:00, 187.85it/s]
Processing rskew_21d: 100%|██████████| 19/19 [00:00<00:00, 186.83it/s]
Processing iskew_capm_21d: 100%|██████████| 19/19 [00:00<00:00, 205.53it/s]
Processing iskew_ff3_21d: 100%|██████████| 19/19 [00:00<00:00, 205.50it/s]
Processing iskew_hxz4_21d: 100%|██████████| 19/19 [00:00<00:00, 174.70it/s]
Processing coskew_21d: 100%|██████████| 19/19 [00:00<00:00, 192.90it/s]
Processing ret_1_0: 100%|██████████| 19/19 [00:00<00:00, 193.80it/s]
Processing betadown_252d: 100%|██████████| 19/19 [00:00<00:00, 236.11it/s]
Processing bidaskhl_21d: 100%|██████████| 19/19 [00:00<00:00, 228.27it/s]
Processing ret_3_1: 100%|██████████| 19/19 [00:00<00:00, 202.14it/s]
Processing ret_9_1: 100%|██████████| 19

2. Calculate weighted portfolio returns using the prior year's sorting results

In [29]:
usa_port_ret = usa_calculator.portfolio_ret()

Processing niq_su: 100%|██████████| 228/228 [00:00<00:00, 662.64it/s]
Processing ret_6_1: 100%|██████████| 228/228 [00:00<00:00, 577.80it/s]
Processing ret_12_1: 100%|██████████| 228/228 [00:00<00:00, 659.85it/s]
Processing saleq_su: 100%|██████████| 228/228 [00:00<00:00, 675.95it/s]
Processing tax_gr1a: 100%|██████████| 228/228 [00:00<00:00, 608.80it/s]
Processing prc_highprc_252d: 100%|██████████| 228/228 [00:00<00:00, 671.63it/s]
Processing resff3_6_1: 100%|██████████| 228/228 [00:00<00:00, 653.30it/s]
Processing resff3_12_1: 100%|██████████| 228/228 [00:00<00:00, 661.66it/s]
Processing be_me: 100%|██████████| 228/228 [00:00<00:00, 618.34it/s]
Processing debt_me: 100%|██████████| 228/228 [00:00<00:00, 609.42it/s]
Processing at_me: 100%|██████████| 228/228 [00:00<00:00, 645.20it/s]
Processing ret_60_12: 100%|██████████| 228/228 [00:00<00:00, 630.88it/s]
Processing ni_me: 100%|██████████| 228/228 [00:00<00:00, 635.24it/s]
Processing fcf_me: 100%|██████████| 228/228 [00:00<00:00, 634.9

**Save the results in the folder 'portfolios' as a parquet file named 'usa_port_ret'**

In [30]:
usa_port_ret.to_parquet(r'portfolios\usa_port_ret.parquet', index=False)
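The weighted-return step can be sketched as follows. This assumes that the weights are market capitalisations (a value-weighted return), which the source does not state explicitly; the function `value_weighted_return` is a hypothetical helper, not a method of `UsaPortfolios`.

```python
def value_weighted_return(returns, weights):
    """Weighted average return of one portfolio for one period.

    returns and weights are dicts keyed by ticker; tickers that have a weight
    but no return in this period are skipped.
    """
    common = [t for t in weights if t in returns]
    total = sum(weights[t] for t in common)
    if total == 0:
        return None  # empty portfolio or zero total weight: no return defined
    return sum(weights[t] * returns[t] for t in common) / total


# Two stocks, weights 300 and 100: (300 * 0.10 + 100 * -0.02) / 400
r = value_weighted_return({"A": 0.10, "B": -0.02}, {"A": 300.0, "B": 100.0})
print(round(r, 4))  # 0.07
```

Returning `None` for an empty portfolio keeps such cells visible instead of silently reporting a zero return.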

## Build 5×5 Bivariate-Sorted Portfolios (Jensen dataset part)

The process is the same as building the 3×2 bivariate-sorted portfolios (Jensen dataset part).

In [31]:
usa_factor = pd.read_parquet(r'factor_value\usa_factor.parquet')

In [32]:
usa55_calculator = UsaPortfolios55(usa_factor)

In [33]:
usa55_calculator.data_preprocess()

In [34]:
usa55_calculator.portfolio_group()

Processing niq_su: 100%|██████████| 19/19 [00:00<00:00, 94.19it/s]
Processing ret_6_1: 100%|██████████| 19/19 [00:00<00:00, 95.63it/s]
Processing ret_12_1: 100%|██████████| 19/19 [00:00<00:00, 99.57it/s]
Processing saleq_su: 100%|██████████| 19/19 [00:00<00:00, 95.70it/s]
Processing tax_gr1a: 100%|██████████| 19/19 [00:00<00:00, 100.53it/s]
Processing prc_highprc_252d: 100%|██████████| 19/19 [00:00<00:00, 100.95it/s]
Processing resff3_6_1: 100%|██████████| 19/19 [00:00<00:00, 97.91it/s]
Processing resff3_12_1: 100%|██████████| 19/19 [00:00<00:00, 97.35it/s]
Processing be_me: 100%|██████████| 19/19 [00:00<00:00, 99.02it/s]
Processing debt_me: 100%|██████████| 19/19 [00:00<00:00, 88.01it/s]
Processing at_me: 100%|██████████| 19/19 [00:00<00:00, 103.29it/s]
Processing ret_60_12: 100%|██████████| 19/19 [00:00<00:00, 100.51it/s]
Processing ni_me: 100%|██████████| 19/19 [00:00<00:00, 103.97it/s]
Processing fcf_me: 100%|██████████| 19/19 [00:00<00:00, 99.53it/s] 
Processing div12m_me: 100%|██

In [35]:
usa55_port_ret = usa55_calculator.portfolio_ret()

Processing niq_su: 100%|██████████| 228/228 [00:02<00:00, 90.61it/s]
Processing ret_6_1: 100%|██████████| 228/228 [00:01<00:00, 156.69it/s]
Processing ret_12_1: 100%|██████████| 228/228 [00:01<00:00, 180.46it/s]
Processing saleq_su: 100%|██████████| 228/228 [00:01<00:00, 173.13it/s]
Processing tax_gr1a: 100%|██████████| 228/228 [00:01<00:00, 213.50it/s]
Processing prc_highprc_252d: 100%|██████████| 228/228 [00:01<00:00, 203.65it/s]
Processing resff3_6_1: 100%|██████████| 228/228 [00:02<00:00, 98.42it/s] 
Processing resff3_12_1: 100%|██████████| 228/228 [00:02<00:00, 105.56it/s]
Processing be_me: 100%|██████████| 228/228 [00:02<00:00, 107.41it/s]
Processing debt_me: 100%|██████████| 228/228 [00:02<00:00, 97.32it/s] 
Processing at_me: 100%|██████████| 228/228 [00:02<00:00, 106.52it/s]
Processing ret_60_12: 100%|██████████| 228/228 [00:02<00:00, 99.61it/s] 
Processing ni_me: 100%|██████████| 228/228 [00:02<00:00, 104.70it/s]
Processing fcf_me: 100%|██████████| 228/228 [00:02<00:00, 112.42

In [36]:
usa55_port_ret.to_parquet(r'portfolios\usa55_port_ret.parquet', index=False)

## Regression of Double-Selection (DS), Single-Selection (SS), Elastic Net (EN), and Principal Component Analysis (PCA) using 3×2 Bivariate-Sorted Portfolios

**Read file**

In [37]:
alpha_port = pd.read_parquet(r'portfolios\alpha_port_ret.parquet')
usa_port = pd.read_parquet(r'portfolios\usa_port_ret.parquet')
alpha_ret = pd.read_parquet(r'factor_returns\alpha_returns.parquet')
usa_ret = pd.read_parquet(r'factor_returns\usa_returns.parquet')

**Create an instance object from 'class_DSregression.py'**

In [38]:
DS_calculator = DSregression(alpha_port, usa_port, alpha_ret, usa_ret )

**Preprocess data**

In [39]:
DS_calculator.preprocessData()

**Run the DS regression and get the first LASSO results "I_1" and the second LASSO results "I_2"**

In [40]:
I_1, I_2 = DS_calculator.DSregression()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     1791.
Date:                Mon, 06 Oct 2025   Prob (F-statistic):               0.00
Time:                        12:33:15   Log-Likelihood:                 11740.
No. Observations:                1869   AIC:                        -2.305e+04
Df Residuals:                    1656   BIC:                        -2.187e+04
Df Model:                         212                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -0.0001   2.18e-05  

In [41]:
I_1

{'beta_60m'}

In [42]:
I_2

{'aliq_at',
 'ami_126d',
 'beta_60m',
 'betadown_252d',
 'bev_mev',
 'cash_at',
 'chcsho_12m',
 'corr_1260d',
 'debt_gr3',
 'debt_me',
 'div12m_me',
 'dolvol_126d',
 'dsale_drec',
 'ebit_sale',
 'ebitda_mev',
 'emp_gr1',
 'eqnpo_12m',
 'eqnpo_me',
 'eqpo_me',
 'fcf_me',
 'market_equity',
 'ni_ivol',
 'ni_me',
 'ocf_me',
 'ocfq_saleq_std',
 'op_atl1',
 'prc',
 'prc_highprc_252d',
 'qmj_prof',
 'qmj_safety',
 'rd5_at',
 'ret_1_0',
 'ret_60_12',
 'sale_gr1',
 'sale_gr3',
 'sale_me',
 'saleq_su',
 'seas_11_15na',
 'seas_16_20na',
 'seas_2_5na',
 'tangibility',
 'turnover_126d',
 'z_score',
 'zero_trades_126d',
 'zero_trades_252d'}
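In a double-selection procedure, the controls that enter the final OLS are usually the union of the two LASSO selections. Assuming `DSregression` follows that convention (the source does not show its internals), the bookkeeping looks like this sketch. `final_controls` is a hypothetical helper, and the demo sets contain only a few of the names shown above.

```python
def final_controls(first_pass, second_pass):
    """Union of the controls kept by the first and second LASSO, sorted by name."""
    return sorted(set(first_pass) | set(second_pass))


I_1_demo = {"beta_60m"}
# Only the first three entries of the I_2 output above, for brevity.
I_2_demo = {"aliq_at", "ami_126d", "beta_60m"}

print(final_controls(I_1_demo, I_2_demo))  # ['aliq_at', 'ami_126d', 'beta_60m']
```

A factor selected in both passes, such as `beta_60m` here, is counted once.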

**Run the SS regression and get the LASSO results "I"**

In [43]:
I = DS_calculator.SSregression()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2189.
Date:                Mon, 06 Oct 2025   Prob (F-statistic):               0.00
Time:                        12:33:53   Log-Likelihood:                 11685.
No. Observations:                1869   AIC:                        -2.303e+04
Df Residuals:                    1700   BIC:                        -2.210e+04
Df Model:                         168                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0001   2.07e-05     -6.438      0.0

In [44]:
I

{'beta_60m'}

**Run the EN regression and get the results "I_3"**

In [45]:
I_3 = DS_calculator.ENregression()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2094.
Date:                Mon, 06 Oct 2025   Prob (F-statistic):               0.00
Time:                        12:34:00   Log-Likelihood:                 11709.
No. Observations:                1869   AIC:                        -2.306e+04
Df Residuals:                    1689   BIC:                        -2.206e+04
Df Model:                         179                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.0001 

In [46]:
I_3

{'beta_60m',
 'cop_at',
 'debt_gr3',
 'earnings_variability',
 'eqnetis_at',
 'fcf_me',
 'market_equity',
 'netis_at',
 'prc_highprc_252d',
 'resff3_12_1',
 'sale_me',
 'seas_16_20na'}

**PCA regression**

In [47]:
I_4 = DS_calculator.PCAregression()

The number of original control factors: 153
The number of retaining principal components: 2
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2174.
Date:                Mon, 06 Oct 2025   Prob (F-statistic):               0.00
Time:                        12:34:06   Log-Likelihood:                 11685.
No. Observations:                1869   AIC:                        -2.303e+04
Df Residuals:                    1699   BIC:                        -2.209e+04
Df Model:                         169                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------
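The output above reports that 2 principal components are retained out of 153 control factors. The rule `PCAregression` uses to pick that number is not shown. One common rule, assumed here purely for illustration, keeps the smallest number of components whose cumulative explained variance reaches a threshold. The function `n_components_for`, its default threshold, and the demo values are hypothetical.

```python
def n_components_for(explained_variance, threshold=0.9):
    """Smallest k such that the k largest variances cover at least threshold of the total."""
    total = sum(explained_variance)
    if total <= 0:
        raise ValueError("explained variances must sum to a positive number")
    running = 0.0
    for k, v in enumerate(sorted(explained_variance, reverse=True), start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(explained_variance)


# Demo values are made up: the first two components carry 90% of the variance.
print(n_components_for([6.0, 3.0, 0.5, 0.5], threshold=0.85))  # 2
```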

## Regression of Double-Selection (DS) using 5×5 Bivariate-Sorted Portfolios

**Read file**

In [48]:
alpha55_port = pd.read_parquet(r'portfolios\alpha55_port_ret.parquet')
usa55_port = pd.read_parquet(r'portfolios\usa55_port_ret.parquet')
alpha_ret = pd.read_parquet(r'factor_returns\alpha_returns.parquet')
usa_ret = pd.read_parquet(r'factor_returns\usa_returns.parquet')

**Create an instance object from 'class_DSregression.py'**

In [49]:
DS_calculator = DSregression(alpha55_port, usa55_port, alpha_ret, usa_ret )

**Preprocess data**

In [50]:
DS_calculator.preprocessData()

**Run the DS regression and get the first LASSO results "I_1" and the second LASSO results "I_2"**

In [51]:
I_1, I_2 = DS_calculator.DSregression()

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.977
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     322.0
Date:                Mon, 06 Oct 2025   Prob (F-statistic):               0.00
Time:                        12:35:47   Log-Likelihood:                 10155.
No. Observations:                1864   AIC:                        -1.988e+04
Df Residuals:                    1648   BIC:                        -1.868e+04
Df Model:                         215                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -9.944e-05 

In [52]:
I_1

{'beta_60m'}

In [53]:
I_2

{'age',
 'aliq_at',
 'aliq_mat',
 'ami_126d',
 'at_turnover',
 'be_gr1a',
 'beta_60m',
 'betabab_1260d',
 'betadown_252d',
 'capx_gr2',
 'cash_at',
 'chcsho_12m',
 'coa_gr1a',
 'cop_atl1',
 'corr_1260d',
 'coskew_21d',
 'cowc_gr1a',
 'debt_gr3',
 'div12m_me',
 'dolvol_126d',
 'dolvol_var_126d',
 'dsale_dinv',
 'dsale_drec',
 'dsale_dsga',
 'earnings_variability',
 'ebit_bev',
 'ebit_sale',
 'ebitda_mev',
 'emp_gr1',
 'eqnetis_at',
 'eqnpo_12m',
 'eqnpo_me',
 'eqpo_me',
 'fcf_me',
 'fnl_gr1a',
 'inv_gr1a',
 'iskew_capm_21d',
 'ival_me',
 'ivol_capm_21d',
 'kz_index',
 'lnoa_gr1a',
 'lti_gr1a',
 'market_equity',
 'mispricing_perf',
 'ncoa_gr1a',
 'netdebt_me',
 'netis_at',
 'nfna_gr1a',
 'ni_ar1',
 'ni_be',
 'ni_ivol',
 'ni_me',
 'niq_at_chg1',
 'niq_be',
 'niq_be_chg1',
 'niq_su',
 'noa_at',
 'noa_gr1a',
 'oaccruals_at',
 'oaccruals_ni',
 'ocf_at',
 'ocf_at_chg1',
 'ocf_me',
 'ocfq_saleq_std',
 'pi_nix',
 'prc',
 'prc_highprc_252d',
 'qmj_growth',
 'qmj_safety',
 'rd_me',
 'resff3_12_1'