# Constructing the AI Exposure Treatment Variable

This notebook builds the firm-level AI exposure measure following Felten, Raj, & Seamans (2023).

## Approach

1. Load occupation-level AI exposure scores (from academic literature)
2. Map industries to occupational composition (using BLS data)
3. Compute industry-level AI exposure as an employment-weighted average of occupation scores
4. Merge to firm-level data based on industry classification
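
Step 3 can be written compactly. This is a sketch of the weighting only: the symbols are illustrative assumptions rather than notation from the cited paper, with $E_{io}$ denoting employment of occupation $o$ in industry $i$ and $A_o$ the occupation-level exposure score.

```latex
\text{Exposure}_i \;=\; \sum_{o} \frac{E_{io}}{\sum_{o'} E_{io'}} \, A_o
```

The weights sum to one within each industry, so the industry score stays on the same scale as the occupation scores.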

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.4f}'.format)

DATA_PATH = Path('/content/drive/MyDrive/Paper_2')
print("Libraries loaded!")

## 1. Load Your Financial Data

In [None]:
# Load data (adjust based on exploration notebook findings)
df1 = pd.read_excel(DATA_PATH / 'Data_1.xlsx')
df2 = pd.read_excel(DATA_PATH / 'Data_2.xlsx')

# Determine how to combine (update based on notebook 01 results)
common_cols = set(df1.columns) & set(df2.columns)
if common_cols:
    df = pd.merge(df1, df2, on=list(common_cols), how='outer')
elif len(df1) == len(df2):
    df = pd.concat([df1, df2], axis=1)
else:
    df = df1

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few columns: {df.columns[:10].tolist()}")
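
Merging on every shared column can silently duplicate or drop rows when the two files share non-key columns. A minimal sketch of a more explicit merge follows. The toy frames and the column names `firm_id`, `year`, `assets`, and `naics` are hypothetical stand-ins for the real files, which may be structured differently.

```python
import pandas as pd

# Toy frames standing in for Data_1.xlsx and Data_2.xlsx (hypothetical columns)
left = pd.DataFrame({
    "firm_id": [1, 2, 3],
    "year": [2020, 2020, 2020],
    "assets": [10.0, 20.0, 30.0],
})
right = pd.DataFrame({
    "firm_id": [2, 3, 4],
    "year": [2020, 2020, 2020],
    "naics": [5112, 2361, 5221],
})

key_cols = ["firm_id", "year"]  # assumed identifier columns

# validate raises MergeError if keys are duplicated on either side;
# indicator adds a _merge column recording where each row came from
merged = pd.merge(
    left, right,
    on=key_cols,
    how="outer",
    validate="one_to_one",
    indicator=True,
)

match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Inspecting `match_counts` before dropping the `_merge` column shows how many firms appear in only one file.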

In [None]:
# Identify candidate industry columns
# UPDATE THIS BASED ON YOUR ACTUAL COLUMN NAME
industry_col = None
for col in df.columns:
    col_lower = str(col).lower()  # column labels read from Excel are not always strings
    if any(key in col_lower for key in ('industry', 'sector', 'sic', 'naics')):
        print(f"Found potential industry column: {col}")
        print(f"  Unique values: {df[col].nunique()}")
        print(f"  Sample: {df[col].dropna().head(5).tolist()}")
        # The last matching column wins; set industry_col manually if several match
        industry_col = col

## 2. AI Exposure Data

We draw AI exposure scores from academic research. There are three options:

### Option A: Felten et al. (2023) - "AI and Workforce"
- Measures how applicable AI capabilities are to each occupation
- Based on O*NET ability requirements

### Option B: Webb (2020) - Patent-based exposure
- Measures overlap between AI patents and occupational tasks

### Option C: Simple Industry Classification
- Use commonly accepted high-AI vs low-AI industry groupings

In [None]:
# Option C: Simple industry-based AI exposure classification
# This is a starting point - can be refined with Felten et al. data later

# Industries with HIGH AI exposure (based on literature)
HIGH_AI_EXPOSURE_INDUSTRIES = [
    # Technology & Software
    'software', 'technology', 'internet', 'it services', 'computer',
    'semiconductor', 'electronic', 'telecommunications',
    # Professional Services (high knowledge work)
    'consulting', 'professional services', 'business services',
    'advertising', 'marketing', 'media',
    # Financial Services
    'banking', 'financial services', 'insurance', 'asset management',
    'investment', 'fintech',
    # Healthcare (AI applications)
    'healthcare', 'pharmaceuticals', 'biotech', 'medical devices',
    # Customer Service Heavy
    'retail', 'e-commerce', 'customer service'
]

# Industries with LOW AI exposure (based on literature)  
LOW_AI_EXPOSURE_INDUSTRIES = [
    # Physical/Manual work
    'construction', 'mining', 'agriculture', 'forestry',
    'utilities', 'oil', 'gas', 'energy',
    # Manufacturing (physical)
    'manufacturing', 'industrial', 'machinery', 'automotive',
    # Transportation & Logistics
    'transportation', 'logistics', 'shipping', 'trucking', 'airlines',
    # Real Estate & Physical Assets
    'real estate', 'reit', 'hospitality', 'hotels',
    # Food & Beverage
    'food', 'beverage', 'restaurant'
]

def classify_ai_exposure(industry_string):
    """
    Classify industry into high/low AI exposure.
    Returns: 1 (high), 0 (low), or np.nan (unclear)
    """
    if pd.isna(industry_string):
        return np.nan
    
    industry_lower = str(industry_string).lower()
    
    # Check high exposure
    for keyword in HIGH_AI_EXPOSURE_INDUSTRIES:
        if keyword in industry_lower:
            return 1
    
    # Check low exposure
    for keyword in LOW_AI_EXPOSURE_INDUSTRIES:
        if keyword in industry_lower:
            return 0
    
    # Unclear
    return np.nan

print("AI exposure classification function defined.")
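
Plain substring matching can produce false positives: for example, `'media'` occurs inside "intermediate" and `'oil'` inside "soil". The sketch below shows one possible stricter variant using word-boundary regular expressions. The function name `classify_ai_exposure_strict` and the shortened keyword lists are illustrative assumptions, not part of the original classification.

```python
import re

import numpy as np
import pandas as pd

# Small illustrative subsets of the keyword lists above
HIGH_KEYWORDS = ["media", "retail", "software"]
LOW_KEYWORDS = ["oil", "gas", "food"]


def _compile(keywords):
    # \b matches only at word boundaries, so "oil" does not match "soil"
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.compile(pattern, flags=re.IGNORECASE)


HIGH_RE = _compile(HIGH_KEYWORDS)
LOW_RE = _compile(LOW_KEYWORDS)


def classify_ai_exposure_strict(industry_string):
    """Return 1 (high), 0 (low), or np.nan (unclear) using whole-word matches."""
    if pd.isna(industry_string):
        return np.nan
    text = str(industry_string)
    if HIGH_RE.search(text):
        return 1
    if LOW_RE.search(text):
        return 0
    return np.nan


for label in ["Media and Entertainment", "Oil & Gas Extraction", "Intermediate Goods"]:
    print(label, "->", classify_ai_exposure_strict(label))
```

The trade-off is that whole-word matching no longer catches derived forms such as "Retailers", so plurals and variants would need to be listed explicitly.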

In [None]:
# Apply classification if industry column exists
if industry_col:
    df['ai_exposure'] = df[industry_col].apply(classify_ai_exposure)
    
    print("AI Exposure Distribution:")
    print(df['ai_exposure'].value_counts(dropna=False))
    print(f"\nClassified: {df['ai_exposure'].notna().sum()} / {len(df)} firms")
else:
    print("Please identify the industry column from notebook 01 and update this cell.")

In [None]:
# Check which industries couldn't be classified
if industry_col and 'ai_exposure' in df.columns:
    unclassified = df[df['ai_exposure'].isna()][industry_col].value_counts()
    print("Unclassified Industries:")
    print(unclassified.head(20))
    print(f"\nTotal unclassified: {len(unclassified)} unique industries")

## 3. Load Felten et al. AI Exposure Scores (Advanced)

For more rigorous measurement, use the Felten et al. (2023) scores.

In [None]:
# Download Felten et al. AI exposure scores
# Source: https://github.com/MITEcon/AI_Exposure or supplementary materials

# NOTE: the URL below is a placeholder and may not resolve. If loading fails,
# download the scores manually and point this variable at the local file.
FELTEN_AI_SCORES_URL = "https://raw.githubusercontent.com/MITEcon/AI_Exposure/main/data/ai_exposure_by_occupation.csv"

try:
    felten_scores = pd.read_csv(FELTEN_AI_SCORES_URL)
    print(f"Felten AI scores loaded: {felten_scores.shape}")
    display(felten_scores.head())
except Exception as e:
    print(f"Could not load Felten scores automatically: {e}")
    print("\nPlease manually download from:")
    print("  - Felten et al. (2023): 'Occupational, industry, and geographic exposure to artificial intelligence'")
    print("  - NBER Working Paper or journal publication appendix")
    felten_scores = None

In [None]:
# Alternative: Create industry-level AI exposure from O*NET + BLS data
# This requires more setup but is more defensible academically

def create_industry_ai_exposure():
    """
    Create industry-level AI exposure scores.
    
    Steps:
    1. Get occupation-level AI exposure (Felten et al.)
    2. Get industry-occupation employment matrix (BLS OES)
    3. Compute weighted average for each industry
    """
    # Placeholder: a full implementation requires
    # 1. BLS Occupational Employment Statistics data
    # 2. An industry-occupation crosswalk
    
    # For now, use placeholder scores loosely based on the literature
    # (sources include Acemoglu & Restrepo and Webb, 2020).
    # Replace them with computed values before estimation.
    
    industry_scores = {
        # NAICS 2-digit codes with AI exposure scores (0-1 scale)
        '51': 0.85,  # Information
        '52': 0.75,  # Finance and Insurance
        '54': 0.80,  # Professional, Scientific, Technical Services
        '55': 0.70,  # Management of Companies
        '56': 0.65,  # Administrative and Support Services
        '62': 0.60,  # Health Care
        '61': 0.55,  # Educational Services
        '42': 0.50,  # Wholesale Trade
        '44': 0.45,  # Retail Trade
        '45': 0.45,  # Retail Trade
        '31': 0.40,  # Manufacturing
        '32': 0.40,  # Manufacturing
        '33': 0.40,  # Manufacturing
        '48': 0.35,  # Transportation
        '49': 0.35,  # Warehousing
        '72': 0.30,  # Accommodation and Food Services
        '23': 0.25,  # Construction
        '21': 0.20,  # Mining
        '11': 0.15,  # Agriculture
    }
    
    return industry_scores

industry_ai_scores = create_industry_ai_exposure()
print("Industry AI Exposure Scores (NAICS 2-digit):")
for naics, score in sorted(industry_ai_scores.items(), key=lambda x: -x[1]):
    print(f"  {naics}: {score:.2f}")
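
The three steps in the docstring above can be sketched with toy data. Everything here is an assumption for illustration: the function name `compute_industry_exposure`, the column names `soc_code`, `naics2`, `employment`, and `ai_score`, and all numeric values are made up and do not come from BLS or Felten et al.

```python
import pandas as pd

# Hypothetical occupation-level scores (values are made up)
occ_scores = pd.DataFrame({
    "soc_code": ["15-1252", "43-4051", "47-2061"],
    "ai_score": [0.9, 0.6, 0.1],
})

# Hypothetical industry-by-occupation employment counts in long format
ind_occ_emp = pd.DataFrame({
    "naics2": ["51", "51", "23", "23"],
    "soc_code": ["15-1252", "43-4051", "43-4051", "47-2061"],
    "employment": [300, 100, 50, 150],
})


def compute_industry_exposure(ind_occ_emp, occ_scores):
    """Employment-weighted mean of occupation scores within each industry."""
    # Inner join: occupations without a score are dropped, so the weights
    # are renormalized over the scored occupations only
    merged = ind_occ_emp.merge(occ_scores, on="soc_code", how="inner")
    merged["weighted"] = merged["employment"] * merged["ai_score"]
    totals = merged.groupby("naics2")[["weighted", "employment"]].sum()
    return (totals["weighted"] / totals["employment"]).rename("ai_exposure")


industry_exposure = compute_industry_exposure(ind_occ_emp, occ_scores)
print(industry_exposure)
```

The resulting Series is indexed by 2-digit NAICS code, so `industry_exposure.to_dict()` could stand in for the hard-coded dictionary once real inputs are available.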

## 4. Merge AI Exposure to Firm Data

In [None]:
# Check if we have SIC or NAICS codes
sic_col = None
naics_col = None

for col in df.columns:
    col_lower = col.lower()
    if 'sic' in col_lower:
        sic_col = col
        print(f"Found SIC column: {col}")
        print(f"  Sample values: {df[col].dropna().head(5).tolist()}")
    if 'naics' in col_lower:
        naics_col = col
        print(f"Found NAICS column: {col}")
        print(f"  Sample values: {df[col].dropna().head(5).tolist()}")

In [None]:
# Map NAICS to AI exposure if available
if naics_col:
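    def get_naics_ai_score(naics_code):
        """Get AI exposure score from the first two digits of a NAICS code."""
        if pd.isna(naics_code):
            return np.nan
        # Handles codes stored as int, float (541511.0), or string ('541511')
        naics_str = str(naics_code).strip().split('.')[0][:2]
        return industry_ai_scores.get(naics_str, np.nan)
    
    df['ai_exposure_continuous'] = df[naics_col].apply(get_naics_ai_score)
    
    # Median split. NaN > median evaluates to False, which would place unscored
    # firms in the control group, so restore NaN for them explicitly.
    median_exposure = df['ai_exposure_continuous'].median()
    df['ai_exposure_binary'] = (
        (df['ai_exposure_continuous'] > median_exposure)
        .astype(float)
        .where(df['ai_exposure_continuous'].notna())
    )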
    
    print("AI Exposure Score Distribution:")
    print(df['ai_exposure_continuous'].describe())
else:
    print("No NAICS column found. Using text-based classification from above.")

## 5. Validate Treatment Assignment

In [None]:
# Visualize AI exposure distribution
exposure_col = 'ai_exposure_continuous' if 'ai_exposure_continuous' in df.columns else 'ai_exposure'

if exposure_col in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Distribution
    df[exposure_col].hist(bins=30, ax=axes[0], color='steelblue', alpha=0.7)
    axes[0].set_xlabel('AI Exposure Score')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Distribution of AI Exposure')
    axes[0].axvline(df[exposure_col].median(), color='red', linestyle='--', label='Median')
    axes[0].legend()
    
    # By industry (if available)
    if industry_col:
        industry_exposure = df.groupby(industry_col)[exposure_col].mean().sort_values(ascending=False).head(15)
        industry_exposure.plot(kind='barh', ax=axes[1], color='steelblue')
        axes[1].set_xlabel('Mean AI Exposure')
        axes[1].set_title('AI Exposure by Industry (Top 15)')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Check balance between treatment and control groups
binary_exposure = 'ai_exposure_binary' if 'ai_exposure_binary' in df.columns else 'ai_exposure'

if binary_exposure in df.columns:
    # Compare the first 10 numeric columns, excluding the exposure measures themselves
    exposure_cols = {'ai_exposure', 'ai_exposure_binary', 'ai_exposure_continuous'}
    numeric_cols = [c for c in df.select_dtypes(include=[np.number]).columns
                    if c not in exposure_cols][:10]
    
    print("Treatment vs Control Group Comparison:")
    print("=" * 60)
    
    for col in numeric_cols:
        high_exp = df.loc[df[binary_exposure] == 1, col].mean()
        low_exp = df.loc[df[binary_exposure] == 0, col].mean()
        diff = high_exp - low_exp
        print(f"{str(col)[:40]:40} | High: {high_exp:12.2f} | Low: {low_exp:12.2f} | Diff: {diff:12.2f}")
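
Raw mean differences depend on each variable's units, which makes them hard to compare across covariates. One common complement is the standardized mean difference (SMD). The sketch below is a hedged illustration: the helper name `standardized_mean_diff`, the `log_assets` column, and the toy values are assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd


def standardized_mean_diff(data, treat_col, covariate):
    """(mean_treated - mean_control) / pooled standard deviation."""
    treated = data.loc[data[treat_col] == 1, covariate].dropna()
    control = data.loc[data[treat_col] == 0, covariate].dropna()
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    if np.isnan(pooled_sd) or pooled_sd == 0:
        return np.nan  # undefined when a group is too small or has no variation
    return (treated.mean() - control.mean()) / pooled_sd


# Toy data with a hypothetical covariate
toy = pd.DataFrame({
    "ai_exposure_binary": [1, 1, 1, 0, 0, 0],
    "log_assets": [5.0, 6.0, 7.0, 4.0, 5.0, 6.0],
})

smd = standardized_mean_diff(toy, "ai_exposure_binary", "log_assets")
print(f"SMD for log_assets: {smd:.2f}")
```

An often-cited rule of thumb treats absolute SMDs above roughly 0.1 as worth a closer look, though the appropriate threshold depends on the application.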

## 6. Save Processed Data

In [None]:
# Save the dataset with AI exposure measure
output_file = DATA_PATH / 'data_with_ai_exposure.parquet'

# Save all columns for now; narrow this list once the key variables are settled
cols_to_save = list(df.columns)

df[cols_to_save].to_parquet(output_file, index=False)
print(f"Saved processed data to: {output_file}")
print(f"Shape: {df.shape}")

In [None]:
# Summary of treatment assignment
print("\n" + "=" * 60)
print("TREATMENT ASSIGNMENT SUMMARY")
print("=" * 60)

if 'ai_exposure_binary' in df.columns:
    print(f"\nBinary classification:")
    print(f"  High AI exposure (treatment): {(df['ai_exposure_binary'] == 1).sum():,} firms")
    print(f"  Low AI exposure (control): {(df['ai_exposure_binary'] == 0).sum():,} firms")
    print(f"  Unclassified: {df['ai_exposure_binary'].isna().sum():,} firms")

if 'ai_exposure_continuous' in df.columns:
    print(f"\nContinuous measure:")
    print(f"  Mean: {df['ai_exposure_continuous'].mean():.3f}")
    print(f"  Std: {df['ai_exposure_continuous'].std():.3f}")
    print(f"  Min: {df['ai_exposure_continuous'].min():.3f}")
    print(f"  Max: {df['ai_exposure_continuous'].max():.3f}")

## Next Steps

1. **Review classification**: Check if the industry mapping makes sense
2. **Refine if needed**: Add more industry keywords or use NAICS mapping
3. **Proceed to notebook 03**: Construct the panel dataset for DiD analysis