# Data Exploration & Alpha Factor Demo

This notebook has two parts:
1. **Database sanity check** — verify `stock_data.db` is built correctly and preview sample data.
2. **Alpha computation** — compute Alpha#6, #12, #38, #41, #101 via `Alpha101`.

> **Prerequisites**: run `DataEngine().download_data()` first so that `data/stock_data.db` exists.
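
Before loading anything, it can help to confirm that the database file exists and contains tables. The sketch below assumes `stock_data.db` is a SQLite file (suggested by the extension, not confirmed here); `list_tables` is a hypothetical helper and is not part of `DataEngine`.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the sorted table names in a SQLite file, or [] if the file is missing.

    Hypothetical helper: assumes the database is SQLite.
    """
    path = Path(db_path)
    if not path.exists():
        return []
    # closing() makes sure the connection is released after the query.
    with closing(sqlite3.connect(str(path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables('../data/stock_data.db')` (the relative path is an assumption based on the `../src` import path used below) should return a non-empty list once the download step has run.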

In [8]:
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
from data_loader import DataEngine
from alphas import Alpha101

In [9]:
engine = DataEngine()
data = engine.load_data()

df_price    = data['df_price']
df_mv       = data['df_mv']
df_industry = data['df_industry']

print('Tables loaded.')
print(f'  daily_price : {df_price.shape}  (rows = date×code combinations)')
print(f'  df_mv       : {df_mv.shape}')
print(f'  stock_info  : {df_industry.shape}')

Tables loaded.
  daily_price : (223275, 5)  (rows = date×code combinations)
  df_mv       : (222518, 1)
  stock_info  : (300, 2)


---
## Part 1 — Database Sanity Check

### 1.1 Daily Price (OHLCV)

In [10]:
dates = df_price.index.get_level_values('date')
codes = df_price.index.get_level_values('code')
print(f'Date range   : {dates.min()}  ->  {dates.max()}')
print(f'Unique stocks: {codes.nunique()}')
df_price.head(10)

Date range   : 20220104  ->  20250221
Unique stocks: 299


date,code,open,high,low,close,vol
20220104,000001.SZ,16.48,16.66,16.18,16.66,1169259.33
20220104,000002.SZ,19.49,20.65,19.36,20.49,1947202.02
20220104,000063.SZ,33.58,33.64,33.13,33.42,290034.38
20220104,000100.SZ,6.18,6.26,6.14,6.24,1612641.47
20220104,000157.SZ,7.17,7.22,7.14,7.21,442457.79
20220104,000166.SZ,5.13,5.18,5.09,5.13,714768.81
20220104,000301.SZ,19.35,19.48,18.86,19.11,375623.65
20220104,000333.SZ,74.0,75.5,73.6,75.36,310408.01
20220104,000338.SZ,17.93,18.0,17.52,17.69,1214435.55
20220104,000408.SZ,41.6,41.77,36.99,38.28,401552.96


### 1.2 Market Cap (total_mv)

In [11]:
df_mv.head(10)

date,code,total_mv
20220104,000001.SZ,32330260.0
20220104,000002.SZ,23820410.0
20220104,000063.SZ,15806060.0
20220104,000100.SZ,8755121.0
20220104,000157.SZ,6256832.0
20220104,000166.SZ,12845490.0
20220104,000301.SZ,9239500.0
20220104,000333.SZ,52640910.0
20220104,000338.SZ,15437280.0
20220104,000408.SZ,7544719.0


### 1.3 Industry Distribution

In [12]:
df_industry.head(10)

code,name,industry
000001.SZ,平安银行,银行
000002.SZ,万科Ａ,全国地产
000063.SZ,中兴通讯,通信设备
000100.SZ,TCL科技,元器件
000157.SZ,中联重科,工程机械
000301.SZ,东方盛虹,化纤
000408.SZ,藏格矿业,农药化肥
000425.SZ,徐工机械,工程机械
000538.SZ,云南白药,中成药
000568.SZ,泸州老窖,白酒


In [13]:
df_industry['industry'].value_counts()

industry
银行      24
证券      22
半导体     19
电气设备    16
元器件     16
        ..
广告包装     1
乳制品      1
医药商业     1
供气供热     1
服饰       1
Name: count, Length: 66, dtype: int64

### 1.4 Missing Data Check

In [14]:
print('=== daily_price null counts ===')
print(df_price.isnull().sum())
print()
print('=== df_mv null counts ===')
print(df_mv.isnull().sum())

=== daily_price null counts ===
open     0
high     0
low      0
close    0
vol      0
dtype: int64

=== df_mv null counts ===
total_mv    0
dtype: int64
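
Zero nulls inside the stored rows does not mean every date × code combination is present: a stock that did not trade on a given day simply has no row. A minimal sketch for measuring that gap, assuming a `(date, code)` MultiIndex with unique entries as in `df_price`; `panel_coverage` is a hypothetical helper.

```python
import pandas as pd


def panel_coverage(df):
    """Compare the rows present with a full date x code grid.

    Assumes df has a unique MultiIndex with levels named 'date' and 'code'.
    """
    n_dates = df.index.get_level_values('date').nunique()
    n_codes = df.index.get_level_values('code').nunique()
    expected = n_dates * n_codes          # size of the complete grid
    missing = expected - len(df)          # combinations with no row at all
    return {
        'dates': n_dates,
        'codes': n_codes,
        'expected': expected,
        'missing': missing,
        'missing_pct': 100 * missing / expected,
    }
```

With the shapes printed in this notebook (757 dates and 299 codes in the price matrix, 223,275 price rows), the full grid has 226,343 cells, so 3,068 combinations (about 1.36%) are absent. That is the same share reported as NaN for `alpha041` and `alpha101` in section 2.4.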


---
## Part 2 — Alpha Factor Computation

This part uses `Alpha101` from `src/alphas.py`, which implements formulas from  
*'101 Formulaic Alphas'* (Kakushadze, 2015).  
Raw values (NaN, inf) are kept as computed; data cleaning happens in a later step.

In [15]:
alpha = Alpha101(data)
print('Alpha101 initialized.')
print(f'  Price matrix shape (dates × codes): {alpha.close.shape}')

Alpha101 initialized.
  Price matrix shape (dates × codes): (757, 299)



### 2.1 Individual Alphas (wide form: dates × codes)

In [16]:
# Alpha#6: -1 * correlation(open, volume, 10)
a6 = alpha.alpha006()
print('Alpha#6  shape:', a6.shape)
a6.tail(5).iloc[:, :6]   # last 5 dates, first 6 stocks

Alpha#6  shape: (757, 299)


date,000001.SZ,000002.SZ,000063.SZ,000100.SZ,000157.SZ,000166.SZ
20250217,-0.524517,-0.248834,-0.427437,-0.025062,0.25417,-0.000582
20250218,-0.725675,-0.188596,-0.542344,-0.008392,0.142601,0.051444
20250219,-0.556824,-0.18776,-0.606402,0.194334,0.196593,0.164627
20250220,-0.311493,-0.044974,-0.595116,0.024773,0.197812,0.038002
20250221,-0.309479,-0.213198,-0.146451,0.122155,-0.057652,-0.505223
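
For reference, the formula in the comment above can be written directly with pandas rolling windows. This is a sketch under the assumption that `open_` and `volume` are wide frames (dates × codes) with matching columns; it is not necessarily how `Alpha101.alpha006` is implemented, and `alpha006_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def alpha006_sketch(open_, volume, window=10):
    """-1 * rolling Pearson correlation of open and volume, column by column."""
    # With a DataFrame argument, rolling().corr() pairs up matching columns.
    return -open_.rolling(window).corr(volume)
```

The first `window - 1` rows of every column come out as NaN because the window is not yet full, which is one reason the combined table in section 2.2 starts with empty `alpha006` cells.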


In [17]:
# Alpha#12: sign(delta(volume, 1)) * (-1 * delta(close, 1))
a12 = alpha.alpha012()
print('Alpha#12 shape:', a12.shape)
a12.tail(5).iloc[:, :6]

Alpha#12 shape: (757, 299)


date,000001.SZ,000002.SZ,000063.SZ,000100.SZ,000157.SZ,000166.SZ
20250217,-0.23,-0.13,-1.16,-0.02,0.05,-0.06
20250218,0.03,-0.16,-1.61,0.11,-0.13,-0.11
20250219,-0.1,0.13,-1.1,-0.04,-0.21,0.05
20250220,-0.05,-0.13,-0.57,0.01,0.05,-0.04
20250221,0.02,-0.1,-3.59,-0.02,0.05,-0.09
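
The same formula as a pandas sketch, assuming wide `close` and `volume` frames (dates × codes). `alpha012_sketch` is a hypothetical name and may differ from the actual `Alpha101.alpha012`.

```python
import numpy as np
import pandas as pd


def alpha012_sketch(close, volume):
    """sign(delta(volume, 1)) * (-1 * delta(close, 1)) on wide frames."""
    # diff(1) is the one-day change; the first row is NaN by construction.
    return np.sign(volume.diff(1)) * (-close.diff(1))
```

Because the result is a signed one-day price change, it is expressed in price units rather than as a bounded score, so high-priced stocks produce values far larger in magnitude than the other alphas shown here.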


In [18]:
# Alpha#38: (-1 * rank(ts_rank(close, 10))) * rank(close / open)
a38 = alpha.alpha038()
print('Alpha#38 shape:', a38.shape)
a38.tail(5).iloc[:, :6]

Alpha#38 shape: (757, 299)


date,000001.SZ,000002.SZ,000063.SZ,000100.SZ,000157.SZ,000166.SZ
20250217,-0.751033,-0.142421,-0.240842,-0.220225,-0.479535,-0.314662
20250218,-0.754439,-0.230242,-0.022626,-0.027474,-0.223244,-0.12296
20250219,-0.095172,-0.451168,-0.403038,-0.014725,-0.826648,-0.438632
20250220,-0.281812,-0.135064,-0.121613,-0.206168,-0.760039,-0.148111
20250221,-0.142566,-0.273095,-0.866864,-0.124831,-0.14228,-0.502724
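
Alpha#38 combines a time-series rank with two cross-sectional ranks. The sketch below assumes wide frames and uses percentile ranks in (0, 1] for both operators; the exact rank normalization inside `Alpha101` is not shown in this notebook, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def ts_rank(df, window):
    """Percentile rank of the latest value within its trailing window, per column."""
    return df.rolling(window).apply(
        lambda w: pd.Series(w).rank(pct=True).iloc[-1], raw=True
    )


def alpha038_sketch(close, open_, window=10):
    """(-1 * rank(ts_rank(close, window))) * rank(close / open)."""
    # rank(axis=1) ranks across stocks on each date (cross-sectional rank).
    ts_part = ts_rank(close, window).rank(axis=1, pct=True)
    ratio_part = (close / open_).rank(axis=1, pct=True)
    return -ts_part * ratio_part
```

With percentile ranks, every non-NaN value falls in [-1, 0), which is consistent with the all-negative values in the table above. The `rolling().apply()` call is simple but slow on large panels.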


In [19]:
# Alpha#41: sqrt(high * low) - vwap
a41 = alpha.alpha041()
print('Alpha#41 shape:', a41.shape)
a41.tail(5).iloc[:, :6]

Alpha#41 shape: (757, 299)


date,000001.SZ,000002.SZ,000063.SZ,000100.SZ,000157.SZ,000166.SZ
20250217,-0.035669,0.020658,-0.081462,-0.00197,-0.009412,0.011265
20250218,0.016245,0.030286,0.180518,0.044714,0.02757,0.014591
20250219,0.011487,-0.017614,-0.182166,0.004421,-0.039413,-0.010243
20250220,0.014871,0.008133,0.017347,0.001355,-0.016912,0.003246
20250221,-0.003608,-0.00428,-0.578808,-0.005579,0.015991,-0.017141
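
Alpha#41 is the gap between the geometric mean of the day's high and low and the volume-weighted average price. The price table previewed in Part 1 has no `vwap` column, so how `Alpha101` obtains it is not shown here; the sketch below simply takes `vwap` as an input. `alpha041_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def alpha041_sketch(high, low, vwap):
    """sqrt(high * low) - vwap, element-wise on aligned wide frames."""
    return np.sqrt(high * low) - vwap
```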


In [20]:
# Alpha#101: (close - open) / (high - low + 0.001)
a101 = alpha.alpha101()
print('Alpha#101 shape:', a101.shape)
a101.tail(5).iloc[:, :6]

Alpha#101 shape: (757, 299)


date,000001.SZ,000002.SZ,000063.SZ,000100.SZ,000157.SZ,000166.SZ
20250217,0.717131,-0.278884,0.482426,0.36036,0.079681,-0.305344
20250218,0.248756,-0.412371,-0.735189,-0.391459,-0.616114,-0.763359
20250219,-0.687023,0.622407,0.69361,-0.13245,0.876494,0.594059
20250220,-0.45045,-0.630631,-0.401737,0.18018,0.495868,-0.491803
20250221,-0.310559,0.373444,0.763032,0.0,-0.348259,0.567376
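
Alpha#101 measures where the close sits relative to the open, scaled by the day's range. A sketch on wide frames follows; `alpha101_sketch` is a hypothetical name and may differ from the `Alpha101` method.

```python
import pandas as pd


def alpha101_sketch(close, open_, high, low):
    """(close - open) / (high - low + 0.001), element-wise on wide frames."""
    # The 0.001 offset keeps the ratio finite when high == low.
    return (close - open_) / (high - low + 0.001)
```

Since the close and open both lie within the day's range, the result stays strictly between -1 and 1, and a bar that closes exactly at its open yields 0.0, as in some cells above.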


### 2.2 Combined Alpha DataFrame (MultiIndex: date × code)

In [21]:
df_alphas = alpha.get_all_alphas()
print('Combined alpha DataFrame shape:', df_alphas.shape)
print('Columns:', df_alphas.columns.tolist())
df_alphas.head(10)


Combined alpha DataFrame shape: (226343, 5)
Columns: ['alpha006', 'alpha012', 'alpha038', 'alpha041', 'alpha101']



date,code,alpha006,alpha012,alpha038,alpha041,alpha101
20220104,000001.SZ,,,,-0.081754,0.37422
20220104,000002.SZ,,,,-0.172067,0.774593
20220104,000063.SZ,,,,-0.012641,-0.313112
20220104,000100.SZ,,,,-0.013624,0.495868
20220104,000157.SZ,,,,-0.010111,0.493827
20220104,000166.SZ,,,,0.001469,0.0
20220104,000301.SZ,,,,0.017493,-0.386473
20220104,000333.SZ,,,,-0.276053,0.715413
20220104,000338.SZ,,,,0.021712,-0.49896
20220104,000408.SZ,,,,0.294074,-0.694415
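
One way to build such a long table from several wide frames while keeping NaN cells is sketched below. It assumes each wide frame has an index named `date` and columns named `code`. It uses `DataFrame.unstack()` instead of `stack(dropna=False)`, since recent pandas releases deprecate the `dropna` argument of `stack`. `combine_wide_alphas` is a hypothetical helper, not the actual `get_all_alphas`.

```python
import numpy as np
import pandas as pd


def combine_wide_alphas(wide_by_name):
    """Turn {name: wide frame (dates x codes)} into one (date, code)-indexed frame."""
    long_cols = {
        # unstack() on a flat index yields a Series indexed by (code, date)
        # and keeps NaN cells; swaplevel() reorders it to (date, code).
        name: wide.unstack().swaplevel().sort_index()
        for name, wide in wide_by_name.items()
    }
    # Dict keys become the column names of the combined frame.
    return pd.concat(long_cols, axis=1)
```

Keeping NaN cells means the combined frame has one row per cell of the full date × code grid, which is why it can have more rows than the price table.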


### 2.3 Descriptive Statistics

In [22]:
df_alphas.describe()

statistic,alpha006,alpha012,alpha038,alpha041,alpha101
count,220287.0,222943.0,220287.0,223275.0,223275.0
mean,-0.098789,-0.202029,-0.292565,0.00022,-0.020437
std,0.418794,2.912957,0.264456,0.463766,0.519667
min,-0.994983,-144.1,-0.998316,-23.553738,-0.999981
25%,-0.4239,-0.33,-0.472419,-0.034151,-0.471204
50%,-0.110979,-0.04,-0.208783,0.002605,0.0
75%,0.215456,0.12,-0.065098,0.047036,0.422535
max,0.989055,161.4,-1.1e-05,20.348329,0.999976


### 2.4 NaN Coverage per Alpha

In [23]:
total = len(df_alphas)
null_pct = df_alphas.isnull().sum() / total * 100
null_pct.rename('NaN %').to_frame()

alpha,NaN %
alpha006,2.675585
alpha012,1.502145
alpha038,2.675585
alpha041,1.355465
alpha101,1.355465
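
Since raw values may also contain `inf`, which `isnull()` does not flag, a combined report can be useful before cleaning. A minimal sketch, assuming all columns are numeric; `nonfinite_report` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def nonfinite_report(df):
    """Share of NaN and +/-inf values per column, in percent."""
    values = df.to_numpy(dtype=float)
    return pd.DataFrame(
        {
            'NaN %': np.isnan(values).mean(axis=0) * 100,
            'inf %': np.isinf(values).mean(axis=0) * 100,
        },
        index=df.columns,
    )
```

Calling `nonfinite_report(df_alphas)` would return the NaN shares above alongside an `inf %` column.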


### 2.5 Cross-Sectional Snapshot on the Latest Date

In [24]:
latest_date = df_alphas.index.get_level_values('date').max()
print(f'Latest date: {latest_date}')
df_alphas.loc[latest_date].dropna().head(10)

Latest date: 20250221


code,alpha006,alpha012,alpha038,alpha041,alpha101
000001.SZ,-0.309479,0.02,-0.142566,-0.003608,-0.310559
000002.SZ,-0.213198,-0.1,-0.273095,-0.00428,0.373444
000063.SZ,-0.146451,-3.59,-0.866864,-0.578808,0.763032
000100.SZ,0.122155,-0.02,-0.124831,-0.005579,0.0
000157.SZ,-0.057652,0.05,-0.14228,0.015991,-0.348259
000166.SZ,-0.505223,-0.09,-0.502724,-0.017141,0.567376
000301.SZ,-0.102671,0.17,-0.032774,0.032392,-0.613027
000333.SZ,0.454647,0.64,-0.024865,0.039791,-0.417071
000338.SZ,-0.208885,-0.96,-0.840237,-0.102937,0.828604
000408.SZ,-0.008624,-0.34,-0.064026,0.079879,-0.81042
