# Liquid Universe Factor DNA Analysis

**Objective**: Implement the new "liquid-universe-first" backtesting pipeline and conduct quintile analysis for standalone Quality, Value, and Momentum factors on the ASC-VN-Liquid-150 universe.

**Critical Architecture Change**: Unlike previous notebooks, this pipeline filters the universe BEFORE any factor ranking occurs, ensuring we only evaluate signals on truly investable stocks.

## Strategic Context

This notebook represents the pivot from our previous "liquidity-last" architecture to a new "liquid-universe-first" approach. The original strategy showed phenomenal alpha (~2.1 Sharpe) but was concentrated in untradable micro-cap stocks. This analysis will establish a realistic performance baseline for our existing factors within the investable universe.

**Universe Definition**: Top 200 stocks by 63-day ADTV, refreshed quarterly, with baseline ADTV threshold of 10B VND (ASC-VN-Liquid-150).


## Section 1: Environment Setup & Data Loading

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, date, timedelta
import warnings
import yaml
from pathlib import Path
from sqlalchemy import create_engine, text

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ Environment setup complete")
print(f"Analysis date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Environment setup complete
Analysis date: 2025-07-28 08:22:18


In [2]:
# Database connection setup
def create_db_connection():
    """Create database connection using config file"""
    config_path = Path('../../../config/database.yml')
    
    with open(config_path, 'r') as f:
        db_config = yaml.safe_load(f)
    
    conn_params = db_config['production']
    connection_string = (
        f"mysql+pymysql://{conn_params['username']}:{conn_params['password']}"
        f"@{conn_params['host']}/{conn_params['schema_name']}"
    )
    
    engine = create_engine(connection_string, pool_pre_ping=True)
    return engine

# Create database connection
engine = create_db_connection()
print("✅ Database connection established")

✅ Database connection established


## Section 2: Liquid Universe Constructor

This section implements the core "liquid-universe-first" logic. The universe is constructed BEFORE any factor analysis.

In [3]:
def calculate_adtv_universe(engine, as_of_date: str, lookback_days: int = 63, top_n: int = 200, min_adtv_bn: float = 10.0):
    """
    Calculate liquid universe based on ADTV criteria.
    
    Parameters:
    - as_of_date: Date for universe construction (T-2 to avoid look-ahead bias)
    - lookback_days: Days to calculate ADTV (default 63 = ~3 months)
    - top_n: Number of most liquid stocks to select
    - min_adtv_bn: Minimum ADTV threshold in billion VND
    
    Returns:
    - DataFrame with liquid universe tickers and their ADTV metrics
    """
    # Calculate lookback start date
    as_of_dt = pd.to_datetime(as_of_date)
    start_date = (as_of_dt - timedelta(days=lookback_days)).strftime('%Y-%m-%d')
    
    print(f"🔍 Calculating ADTV universe for {as_of_date}")
    print(f"   Lookback period: {start_date} to {as_of_date} ({lookback_days} days)")
    print(f"   Criteria: Top {top_n} stocks with ADTV >= {min_adtv_bn}B VND")
    
    # Query to calculate ADTV by ticker
    adtv_query = text("""
        SELECT 
            v.ticker,
            m.sector,
            COUNT(v.trading_date) as trading_days,
            AVG(v.total_value / 1e9) as adtv_bn_vnd,
            SUM(v.total_value / 1e9) as total_turnover_bn_vnd,
            AVG(v.market_cap / 1e9) as avg_market_cap_bn_vnd,
            MIN(v.trading_date) as first_date,
            MAX(v.trading_date) as last_date
        FROM vcsc_daily_data_complete v
        INNER JOIN master_info m ON v.ticker = m.ticker
        WHERE v.trading_date BETWEEN :start_date AND :as_of_date
            AND v.total_value > 0
            AND v.market_cap > 0
        GROUP BY v.ticker, m.sector
        HAVING trading_days >= :min_trading_days
            AND adtv_bn_vnd >= :min_adtv_bn
        ORDER BY adtv_bn_vnd DESC
        LIMIT :top_n
    """)
    
    # Require at least 80% of trading days for inclusion
    min_trading_days = int(lookback_days * 0.8)
    
    with engine.connect() as conn:
        universe_df = pd.read_sql_query(
            adtv_query,
            conn,
            params={
                'start_date': start_date,
                'as_of_date': as_of_date,
                'min_trading_days': min_trading_days,
                'min_adtv_bn': min_adtv_bn,
                'top_n': top_n
            }
        )
    
    print(f"✅ Universe calculated: {len(universe_df)} stocks qualify")
    print(f"   ADTV range: {universe_df['adtv_bn_vnd'].min():.1f}B - {universe_df['adtv_bn_vnd'].max():.1f}B VND")
    print(f"   Market cap range: {universe_df['avg_market_cap_bn_vnd'].min():.1f}B - {universe_df['avg_market_cap_bn_vnd'].max():.1f}B VND")
    
    return universe_df

# Test universe construction for Q1 2024
test_date = '2024-03-29'  # End of Q1 2024
liquid_universe = calculate_adtv_universe(engine, test_date)

# Display top 10 most liquid stocks
print("\n📊 Top 10 Most Liquid Stocks:")
display(liquid_universe.head(10)[['ticker', 'sector', 'adtv_bn_vnd', 'avg_market_cap_bn_vnd', 'trading_days']])

🔍 Calculating ADTV universe for 2024-03-29
   Lookback period: 2024-01-26 to 2024-03-29 (63 days)
   Criteria: Top 200 stocks with ADTV >= 10.0B VND


OperationalError: (pymysql.err.OperationalError) (1267, "Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '='")
[SQL: 
        SELECT 
            v.ticker,
            m.sector,
            COUNT(v.trading_date) as trading_days,
            AVG(v.total_value / 1e9) as adtv_bn_vnd,
            SUM(v.total_value / 1e9) as total_turnover_bn_vnd,
            AVG(v.market_cap / 1e9) as avg_market_cap_bn_vnd,
            MIN(v.trading_date) as first_date,
            MAX(v.trading_date) as last_date
        FROM vcsc_daily_data_complete v
        INNER JOIN master_info m ON v.ticker = m.ticker
        WHERE v.trading_date BETWEEN %(start_date)s AND %(as_of_date)s
            AND v.total_value > 0
            AND v.market_cap > 0
        GROUP BY v.ticker, m.sector
        HAVING trading_days >= %(min_trading_days)s
            AND adtv_bn_vnd >= %(min_adtv_bn)s
        ORDER BY adtv_bn_vnd DESC
        LIMIT %(top_n)s
    ]
[parameters: {'start_date': '2024-01-26', 'as_of_date': '2024-03-29', 'min_trading_days': 50, 'min_adtv_bn': 10.0, 'top_n': 200}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

In [None]:
# Analyze universe composition by sector
sector_analysis = liquid_universe.groupby('sector').agg({
    'ticker': 'count',
    'adtv_bn_vnd': ['mean', 'sum'],
    'avg_market_cap_bn_vnd': ['mean', 'sum']
}).round(2)

sector_analysis.columns = ['Count', 'Avg_ADTV_Bn', 'Total_ADTV_Bn', 'Avg_MCap_Bn', 'Total_MCap_Bn']
sector_analysis = sector_analysis.sort_values('Count', ascending=False)

print("🏢 Liquid Universe Composition by Sector:")
display(sector_analysis)

# Plot sector composition
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Count by sector
sector_analysis['Count'].plot(kind='bar', ax=ax1, color='skyblue')
ax1.set_title('Liquid Universe: Stock Count by Sector')
ax1.set_ylabel('Number of Stocks')
ax1.tick_params(axis='x', rotation=45)

# ADTV by sector  
sector_analysis['Total_ADTV_Bn'].plot(kind='bar', ax=ax2, color='lightcoral')
ax2.set_title('Liquid Universe: Total ADTV by Sector (Billion VND)')
ax2.set_ylabel('Total ADTV (Billion VND)')
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"\n📈 Universe Statistics:")
print(f"   Total tickers: {len(liquid_universe)}")
print(f"   Sectors represented: {liquid_universe['sector'].nunique()}")
print(f"   Total market cap: {liquid_universe['avg_market_cap_bn_vnd'].sum():.0f}B VND")
print(f"   Total daily turnover: {liquid_universe['adtv_bn_vnd'].sum():.0f}B VND")

## Section 3: Factor Data Loading for Liquid Universe

Load factor scores ONLY for the tickers in our liquid universe. This is the key architectural change - we filter first, then analyze.

In [None]:
def load_factor_scores_for_universe(engine, universe_tickers, start_date: str, end_date: str, strategy_version: str = 'qvm_v2.0_enhanced'):
    """
    Load factor scores ONLY for stocks in the liquid universe.
    
    This is the core "liquid-universe-first" implementation:
    We filter the universe BEFORE loading any factor data.
    """
    print(f"📊 Loading factor scores for liquid universe")
    print(f"   Universe size: {len(universe_tickers)} tickers")
    print(f"   Date range: {start_date} to {end_date}")
    print(f"   Strategy version: {strategy_version}")
    
    # Convert universe tickers to tuple for SQL IN clause
    ticker_tuple = tuple(universe_tickers)
    
    factor_query = text("""
        SELECT 
            ticker,
            date,
            Quality_Composite,
            Value_Composite,
            Momentum_Composite,
            QVM_Composite
        FROM factor_scores_qvm
        WHERE ticker IN :tickers
            AND date BETWEEN :start_date AND :end_date
            AND strategy_version = :strategy_version
            AND Quality_Composite IS NOT NULL
            AND Value_Composite IS NOT NULL
            AND Momentum_Composite IS NOT NULL
        ORDER BY date, ticker
    """)
    
    with engine.connect() as conn:
        factor_df = pd.read_sql_query(
            factor_query,
            conn,
            params={
                'tickers': ticker_tuple,
                'start_date': start_date,
                'end_date': end_date,
                'strategy_version': strategy_version
            }
        )
    
    factor_df['date'] = pd.to_datetime(factor_df['date'])
    
    print(f"✅ Loaded {len(factor_df):,} factor observations")
    print(f"   Date range: {factor_df['date'].min().date()} to {factor_df['date'].max().date()}")
    print(f"   Unique tickers with data: {factor_df['ticker'].nunique()}")
    print(f"   Unique dates: {factor_df['date'].nunique()}")
    
    return factor_df

# Load factor data for our liquid universe
factor_data = load_factor_scores_for_universe(
    engine=engine,
    universe_tickers=liquid_universe['ticker'].tolist(),
    start_date='2024-01-01',
    end_date='2024-03-29',
    strategy_version='qvm_v2.0_enhanced'
)

# Display sample data
print("\n📋 Sample Factor Data:")
display(factor_data.head(10))

## Section 4: Critical Sanity Checks

Before proceeding with any analysis, we must validate three critical conditions:
1. **Coverage Check**: Sufficient number of stocks with factor data
2. **Liquidity Overlap Check**: Factor universe aligns with liquid universe  
3. **Factor Dispersion Check**: Factors show meaningful variation in liquid universe

In [None]:
def run_sanity_checks(factor_data, liquid_universe, min_coverage=125, min_dispersion=0.10):
    """
    Run critical sanity checks before proceeding with analysis.
    These are mandatory gates that must pass.
    """
    print("🔍 RUNNING CRITICAL SANITY CHECKS")
    print("=" * 50)
    
    results = {}
    
    # 1. Coverage Check
    unique_factor_tickers = factor_data['ticker'].nunique()
    universe_size = len(liquid_universe)
    coverage_ratio = unique_factor_tickers / universe_size
    
    print(f"\n1️⃣ COVERAGE CHECK:")
    print(f"   Liquid universe size: {universe_size}")
    print(f"   Tickers with factor data: {unique_factor_tickers}")
    print(f"   Coverage ratio: {coverage_ratio:.1%}")
    print(f"   Minimum required: {min_coverage} tickers")
    
    coverage_pass = unique_factor_tickers >= min_coverage
    results['coverage'] = {
        'pass': coverage_pass,
        'value': unique_factor_tickers,
        'threshold': min_coverage,
        'ratio': coverage_ratio
    }
    print(f"   Status: {'✅ PASS' if coverage_pass else '❌ FAIL'}")
    
    # 2. Liquidity Overlap Check  
    factor_tickers = set(factor_data['ticker'].unique())
    universe_tickers = set(liquid_universe['ticker'].unique())
    overlap = factor_tickers.intersection(universe_tickers)
    overlap_ratio = len(overlap) / len(universe_tickers)
    
    print(f"\n2️⃣ LIQUIDITY OVERLAP CHECK:")
    print(f"   Universe tickers: {len(universe_tickers)}")
    print(f"   Factor tickers: {len(factor_tickers)}")
    print(f"   Overlap: {len(overlap)} ({overlap_ratio:.1%})")
    
    # Check if any universe tickers are missing factor data
    missing_tickers = universe_tickers - factor_tickers
    if missing_tickers:
        print(f"   Missing factor data for: {sorted(list(missing_tickers))[:10]}...")
    
    overlap_pass = overlap_ratio >= 0.8  # At least 80% overlap
    results['overlap'] = {
        'pass': overlap_pass,
        'ratio': overlap_ratio,
        'missing_count': len(missing_tickers)
    }
    print(f"   Status: {'✅ PASS' if overlap_pass else '❌ FAIL'}")
    
    # 3. Factor Dispersion Check
    print(f"\n3️⃣ FACTOR DISPERSION CHECK:")
    factors = ['Quality_Composite', 'Value_Composite', 'Momentum_Composite']
    dispersion_results = {}
    
    for factor in factors:
        # Calculate cross-sectional standard deviation for each date
        daily_std = factor_data.groupby('date')[factor].std()
        avg_std = daily_std.mean()
        
        dispersion_pass = avg_std >= min_dispersion
        dispersion_results[factor] = {
            'pass': dispersion_pass,
            'avg_std': avg_std,
            'threshold': min_dispersion
        }
        
        print(f"   {factor}: {avg_std:.3f} ({'✅ PASS' if dispersion_pass else '❌ FAIL'})")
    
    results['dispersion'] = dispersion_results
    
    # Overall assessment
    all_dispersion_pass = all(r['pass'] for r in dispersion_results.values())
    overall_pass = coverage_pass and overlap_pass and all_dispersion_pass
    
    print(f"\n{'='*50}")
    print(f"🎯 OVERALL SANITY CHECK: {'✅ ALL PASS' if overall_pass else '❌ SOME FAILED'}")
    
    if not overall_pass:
        print("\n⚠️  WARNING: Some sanity checks failed!")
        print("   This indicates our factors may not be suitable for the liquid universe.")
        print("   Consider this a 'No-Go' decision for current factor definitions.")
    
    results['overall_pass'] = overall_pass
    return results

# Run sanity checks
sanity_results = run_sanity_checks(factor_data, liquid_universe)

## Section 5: Liquid Universe Factor DNA Analysis

If sanity checks pass, proceed with quintile analysis to establish the performance baseline for our factors in the investable universe.

In [None]:
def analyze_factor_dna(factor_data, factor_name='Quality_Composite'):
    """
    Analyze the "DNA" of a factor in the liquid universe:
    - Distribution characteristics
    - Temporal stability
    - Cross-sectional dispersion
    """
    print(f"🧬 FACTOR DNA ANALYSIS: {factor_name}")
    print("=" * 50)
    
    # 1. Distribution Analysis
    factor_values = factor_data[factor_name].dropna()
    
    print(f"\n📊 Distribution Statistics:")
    print(f"   Count: {len(factor_values):,}")
    print(f"   Mean: {factor_values.mean():.4f}")
    print(f"   Std Dev: {factor_values.std():.4f}")
    print(f"   Skewness: {factor_values.skew():.4f}")
    print(f"   Min: {factor_values.min():.4f}")
    print(f"   25th %ile: {factor_values.quantile(0.25):.4f}")
    print(f"   Median: {factor_values.median():.4f}")
    print(f"   75th %ile: {factor_values.quantile(0.75):.4f}")
    print(f"   Max: {factor_values.max():.4f}")
    
    # 2. Temporal Analysis
    daily_stats = factor_data.groupby('date')[factor_name].agg([
        'count', 'mean', 'std', 'min', 'max'
    ]).round(4)
    
    print(f"\n📈 Temporal Stability:")
    print(f"   Avg daily coverage: {daily_stats['count'].mean():.1f} stocks")
    print(f"   Mean stability (std of daily means): {daily_stats['mean'].std():.4f}")
    print(f"   Dispersion stability (std of daily stds): {daily_stats['std'].std():.4f}")
    
    # 3. Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Distribution histogram
    axes[0,0].hist(factor_values, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0,0].axvline(factor_values.mean(), color='red', linestyle='--', label=f'Mean: {factor_values.mean():.3f}')
    axes[0,0].axvline(factor_values.median(), color='orange', linestyle='--', label=f'Median: {factor_values.median():.3f}')
    axes[0,0].set_title(f'{factor_name} Distribution in Liquid Universe')
    axes[0,0].set_xlabel('Factor Value')
    axes[0,0].set_ylabel('Frequency')
    axes[0,0].legend()
    
    # Box plot
    axes[0,1].boxplot(factor_values, patch_artist=True,
                      boxprops=dict(facecolor='lightcoral', alpha=0.7))
    axes[0,1].set_title(f'{factor_name} Box Plot')
    axes[0,1].set_ylabel('Factor Value')
    
    # Time series of daily means
    axes[1,0].plot(daily_stats.index, daily_stats['mean'], marker='o', linewidth=2)
    axes[1,0].set_title(f'{factor_name} Daily Mean Over Time')
    axes[1,0].set_xlabel('Date')
    axes[1,0].set_ylabel('Daily Mean')
    axes[1,0].tick_params(axis='x', rotation=45)
    
    # Time series of daily dispersion
    axes[1,1].plot(daily_stats.index, daily_stats['std'], marker='s', color='green', linewidth=2)
    axes[1,1].set_title(f'{factor_name} Daily Dispersion Over Time')
    axes[1,1].set_xlabel('Date')
    axes[1,1].set_ylabel('Daily Std Dev')
    axes[1,1].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    return {
        'distribution_stats': factor_values.describe(),
        'temporal_stats': daily_stats,
        'mean_stability': daily_stats['mean'].std(),
        'dispersion_stability': daily_stats['std'].std()
    }

# Analyze each factor's DNA if sanity checks passed
if sanity_results['overall_pass']:
    print("✅ Sanity checks passed - proceeding with Factor DNA analysis\n")
    
    # Analyze Quality factor
    quality_dna = analyze_factor_dna(factor_data, 'Quality_Composite')
else:
    print("❌ Sanity checks failed - Factor DNA analysis not recommended")
    print("   Current factors may not be suitable for the liquid universe.")

In [None]:
# Continue DNA analysis for Value and Momentum if Quality passed
if sanity_results['overall_pass']:
    print("\n" + "="*60)
    value_dna = analyze_factor_dna(factor_data, 'Value_Composite')
    
    print("\n" + "="*60)
    momentum_dna = analyze_factor_dna(factor_data, 'Momentum_Composite')
    
    # Summary comparison
    print("\n" + "="*60)
    print("🎯 FACTOR DNA SUMMARY COMPARISON")
    print("=" * 60)
    
    factors = ['Quality_Composite', 'Value_Composite', 'Momentum_Composite']
    dna_results = [quality_dna, value_dna, momentum_dna]
    
    summary_df = pd.DataFrame({
        'Factor': factors,
        'Mean': [factor_data[f].mean() for f in factors],
        'Std_Dev': [factor_data[f].std() for f in factors],
        'Skewness': [factor_data[f].skew() for f in factors],
        'Mean_Stability': [dna['mean_stability'] for dna in dna_results],
        'Dispersion_Stability': [dna['dispersion_stability'] for dna in dna_results]
    }).round(4)
    
    display(summary_df)
    
    # Flag any concerning patterns
    print("\n🚨 DNA Health Check:")
    for i, factor in enumerate(factors):
        std_dev = summary_df.iloc[i]['Std_Dev']
        stability = summary_df.iloc[i]['Mean_Stability']
        
        if std_dev < 0.1:
            print(f"   ⚠️  {factor}: Low dispersion ({std_dev:.3f}) - may lack signal")
        if stability > 0.05:
            print(f"   ⚠️  {factor}: High instability ({stability:.3f}) - may be noisy")
        if std_dev >= 0.1 and stability <= 0.05:
            print(f"   ✅ {factor}: Healthy DNA profile")

## Section 6: Preliminary Quintile Analysis

If Factor DNA is healthy, conduct initial quintile analysis to measure factor efficacy in the liquid universe.

In [None]:
def preliminary_quintile_analysis(factor_data, price_data=None, factor_name='Quality_Composite'):
    """
    Conduct preliminary quintile analysis for a single factor.
    For now, focus on factor distribution across quintiles.
    
    Note: Full performance analysis requires price data loading,
    which will be implemented in subsequent development.
    """
    print(f"📊 PRELIMINARY QUINTILE ANALYSIS: {factor_name}")
    print("=" * 50)
    
    # Create quintile ranks for each date
    factor_ranked = factor_data.copy()
    factor_ranked[f'{factor_name}_quintile'] = factor_ranked.groupby('date')[factor_name].transform(
        lambda x: pd.qcut(x, q=5, labels=[1, 2, 3, 4, 5], duplicates='drop')
    )
    
    # Remove any rows where quintile assignment failed
    factor_ranked = factor_ranked.dropna(subset=[f'{factor_name}_quintile'])
    
    print(f"✅ Quintile ranking complete")
    print(f"   Total observations with quintiles: {len(factor_ranked):,}")
    
    # Analyze quintile characteristics
    quintile_stats = factor_ranked.groupby(f'{factor_name}_quintile')[factor_name].agg([
        'count', 'mean', 'std', 'min', 'max'
    ]).round(4)
    quintile_stats.columns = ['Count', 'Mean', 'Std_Dev', 'Min', 'Max']
    
    print(f"\n📈 Quintile Characteristics:")
    display(quintile_stats)
    
    # Calculate quintile spread
    q5_mean = quintile_stats.loc[5, 'Mean']
    q1_mean = quintile_stats.loc[1, 'Mean']
    quintile_spread = q5_mean - q1_mean
    
    print(f"\n🎯 Key Metrics:")
    print(f"   Quintile 5 (Top) Mean: {q5_mean:.4f}")
    print(f"   Quintile 1 (Bottom) Mean: {q1_mean:.4f}")
    print(f"   Quintile Spread (Q5-Q1): {quintile_spread:.4f}")
    
    # Assess factor efficacy
    if quintile_spread > 0.5:
        efficacy = "Strong"
    elif quintile_spread > 0.2:
        efficacy = "Moderate"
    elif quintile_spread > 0.1:
        efficacy = "Weak"
    else:
        efficacy = "Very Weak"
    
    print(f"   Factor Efficacy: {efficacy}")
    
    # Visualization
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
    
    # Quintile means
    quintile_stats['Mean'].plot(kind='bar', ax=ax1, color='steelblue')
    ax1.set_title(f'{factor_name} Mean by Quintile')
    ax1.set_xlabel('Quintile (1=Worst, 5=Best)')
    ax1.set_ylabel('Factor Value')
    ax1.tick_params(axis='x', rotation=0)
    
    # Box plot by quintile
    factor_ranked.boxplot(column=factor_name, by=f'{factor_name}_quintile', ax=ax2)
    ax2.set_title(f'{factor_name} Distribution by Quintile')
    ax2.set_xlabel('Quintile (1=Worst, 5=Best)')
    ax2.set_ylabel('Factor Value')
    
    plt.tight_layout()
    plt.show()
    
    return {
        'quintile_stats': quintile_stats,
        'quintile_spread': quintile_spread,
        'efficacy': efficacy,
        'ranked_data': factor_ranked
    }

# Run preliminary quintile analysis if DNA is healthy
if sanity_results['overall_pass']:
    quality_quintiles = preliminary_quintile_analysis(factor_data, factor_name='Quality_Composite')
else:
    print("❌ Skipping quintile analysis - sanity checks failed")

In [None]:
# Continue quintile analysis for all factors
if sanity_results['overall_pass']:
    print("\n" + "="*70)
    value_quintiles = preliminary_quintile_analysis(factor_data, factor_name='Value_Composite')
    
    print("\n" + "="*70)
    momentum_quintiles = preliminary_quintile_analysis(factor_data, factor_name='Momentum_Composite')
    
    # Summary comparison of all factors
    print("\n" + "="*70)
    print("🎯 LIQUID UNIVERSE FACTOR EFFICACY SUMMARY")
    print("=" * 70)
    
    efficacy_summary = pd.DataFrame({
        'Factor': ['Quality_Composite', 'Value_Composite', 'Momentum_Composite'],
        'Quintile_Spread': [
            quality_quintiles['quintile_spread'],
            value_quintiles['quintile_spread'],
            momentum_quintiles['quintile_spread']
        ],
        'Efficacy_Rating': [
            quality_quintiles['efficacy'],
            value_quintiles['efficacy'],
            momentum_quintiles['efficacy']
        ]
    })
    
    display(efficacy_summary)
    
    # Determine go/no-go decision
    strong_factors = sum(1 for efficacy in efficacy_summary['Efficacy_Rating'] if efficacy == 'Strong')
    moderate_factors = sum(1 for efficacy in efficacy_summary['Efficacy_Rating'] if efficacy == 'Moderate')
    
    print(f"\n🚦 GO/NO-GO DECISION:")
    print(f"   Strong factors: {strong_factors}/3")
    print(f"   Moderate+ factors: {strong_factors + moderate_factors}/3")
    
    if strong_factors >= 2:
        decision = "✅ GO - Strong factor signals in liquid universe"
        recommendation = "Proceed with full backtesting pipeline development"
    elif strong_factors + moderate_factors >= 2:
        decision = "🟡 CAUTIOUS GO - Moderate factor signals"
        recommendation = "Proceed but consider factor enhancement"
    else:
        decision = "❌ NO-GO - Weak factor signals in liquid universe"
        recommendation = "Pivot to Liquid Alpha Discovery phase for new factor engineering"
    
    print(f"   Decision: {decision}")
    print(f"   Recommendation: {recommendation}")
    
else:
    print("❌ Cannot make go/no-go decision - preliminary analysis failed")

## Section 7: Session Summary & Next Steps

Document findings and establish clear next steps based on the analysis results.

In [None]:
# Generate comprehensive session summary
print("📋 LIQUID UNIVERSE FACTOR DNA - SESSION SUMMARY")
print("=" * 60)

print(f"\n🎯 Analysis Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"📊 Universe Definition: ASC-VN-Liquid-150 (Top 200 by ADTV, 10B+ VND threshold)")
print(f"📈 Test Period: Q1 2024 (2024-01-01 to 2024-03-29)")
print(f"🔧 Strategy Version: qvm_v2.0_enhanced")

if 'liquid_universe' in locals():
    print(f"\n🏢 Universe Composition:")
    print(f"   Total stocks: {len(liquid_universe)}")
    print(f"   ADTV range: {liquid_universe['adtv_bn_vnd'].min():.1f}B - {liquid_universe['adtv_bn_vnd'].max():.1f}B VND")
    print(f"   Sectors represented: {liquid_universe['sector'].nunique()}")

if 'sanity_results' in locals():
    print(f"\n🔍 Sanity Check Results:")
    print(f"   Coverage: {'✅ PASS' if sanity_results['coverage']['pass'] else '❌ FAIL'} ({sanity_results['coverage']['value']} tickers)")
    print(f"   Overlap: {'✅ PASS' if sanity_results['overlap']['pass'] else '❌ FAIL'} ({sanity_results['overlap']['ratio']:.1%} overlap)")
    
    print(f"   Factor Dispersion:")
    for factor, result in sanity_results['dispersion'].items():
        print(f"     {factor}: {'✅ PASS' if result['pass'] else '❌ FAIL'} ({result['avg_std']:.3f})")

if 'efficacy_summary' in locals():
    print(f"\n🧬 Factor DNA Results:")
    for _, row in efficacy_summary.iterrows():
        print(f"   {row['Factor']}: {row['Efficacy_Rating']} (spread: {row['Quintile_Spread']:.3f})")
    
    print(f"\n🚦 Final Decision: {decision}")
    print(f"💡 Recommendation: {recommendation}")

print(f"\n📋 Key Architectural Achievement:")
print(f"   ✅ Successfully implemented 'liquid-universe-first' pipeline")
print(f"   ✅ Universe filtering occurs BEFORE factor analysis")
print(f"   ✅ Eliminated risk of discovering inaccessible alpha")

print(f"\n⏭️  Next Steps:")
if 'sanity_results' in locals() and sanity_results['overall_pass']:
    if 'strong_factors' in locals() and strong_factors >= 2:
        print(f"   1. Load price data for liquid universe")
        print(f"   2. Implement full quintile performance analysis")
        print(f"   3. Calculate returns, Sharpe ratios, and turnover")
        print(f"   4. Build complete liquid-universe backtesting module")
        print(f"   5. Compare liquid vs unrestricted universe performance")
    else:
        print(f"   1. Investigate factor weakness in liquid universe")
        print(f"   2. Consider factor enhancement or new engineering")
        print(f"   3. Analyze sector-specific factor behavior")
        print(f"   4. Potentially pivot to Liquid Alpha Discovery phase")
else:
    print(f"   1. Investigate sanity check failures")
    print(f"   2. Review factor generation process for liquid universe")
    print(f"   3. Consider data quality issues or timing problems")
    print(f"   4. Re-run analysis with different universe parameters")

print(f"\n💾 Session artifacts created:")
print(f"   - Liquid universe definition for Q1 2024")
print(f"   - Factor DNA analysis for Quality, Value, Momentum")
print(f"   - Preliminary quintile efficacy assessment")
print(f"   - Go/No-Go decision framework")

print(f"\n" + "=" * 60)
print(f"✅ LIQUID UNIVERSE FACTOR DNA ANALYSIS COMPLETE")
print(f"=" * 60)