# 🚀 Environment Setup & Validation - Generalized Pipeline

This notebook sets up the environment and validates the configuration for the generalized financial sentiment analysis pipeline.

**Key Features:**
- Centralized configuration management
- Environment validation and dependency checking
- GPU/device detection and optimization
- Pipeline state initialization
- Model availability verification

**Configuration-driven approach:** All settings are loaded from `../config/pipeline_config.json`.
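
To make the configuration-driven approach concrete, here is a minimal sketch of a JSON-backed config object with a dict-style `get`. It is not the project's `ConfigManager`, whose implementation is not shown in this notebook: `SimpleConfig` is a hypothetical stand-in, and the sample file only reuses key names and values that appear in the cells and output further down.

```python
import json
import tempfile
from pathlib import Path


class SimpleConfig:
    """Hypothetical stand-in for a JSON-backed config object."""

    def __init__(self, path):
        self.path = Path(path)
        with self.path.open(encoding="utf-8") as f:
            self._data = json.load(f)

    def get(self, key, default=None):
        # Top-level lookup with a default, mirroring dict.get
        return self._data.get(key, default)


# Illustrative config for this sketch only
sample = {
    "data": {"raw_data_path": "data/FinancialPhraseBank/all-data.csv"},
    "models": {"base_models": [{"name": "finbert-tone",
                                "model_id": "ProsusAI/finbert",
                                "enabled": True}]},
    "training": {"num_epochs": 3, "batch_size": 16},
}

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "pipeline_config.json"
    cfg_path.write_text(json.dumps(sample), encoding="utf-8")
    cfg = SimpleConfig(cfg_path)

training = cfg.get("training", {})
print(training.get("num_epochs", "Not set"))
print(cfg.get("missing_section", {}))
```

Returning a default for absent sections lets later cells chain `.get(...)` calls without guarding every lookup, which is the pattern the validation cells rely on.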

In [1]:
# Import configuration system and utilities
import sys
import os

# Add parent directory to path for imports
sys.path.append('../')

from src.pipeline_utils import ConfigManager, StateManager, LoggingManager
import torch
import transformers
import datasets
import pandas as pd
import numpy as np
from pathlib import Path
import json
from datetime import datetime

# Initialize configuration system
config = ConfigManager('../config/pipeline_config.json')
state = StateManager('../config/pipeline_state.json')
logger_manager = LoggingManager(config, 'setup')
logger = logger_manager.get_logger()

logger.info("🚀 Starting Environment Setup - Generalized Pipeline")
print("📋 Configuration loaded from ../config/pipeline_config.json")

2025-08-08 16:28:04,656 - pipeline.setup - INFO - 🚀 Starting Environment Setup - Generalized Pipeline


📋 Configuration loaded from ../config/pipeline_config.json
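
The log line above follows the layout `timestamp - logger name - level - message`. A minimal sketch of a logger that produces this layout with the standard `logging` module is shown below. `get_pipeline_logger` is a hypothetical helper, not the project's `LoggingManager`, and the format string is inferred from the output above. The handler guard is there because re-running a notebook cell that attaches a new handler each time is one common cause of repeated log lines.

```python
import logging
import sys


def get_pipeline_logger(step, level=logging.INFO):
    """Hypothetical helper: return a logger named 'pipeline.<step>'."""
    logger = logging.getLogger(f"pipeline.{step}")
    logger.setLevel(level)
    # Attach a handler only once so re-running a cell does not duplicate output
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing the record again
    return logger


log = get_pipeline_logger("setup")
log_again = get_pipeline_logger("setup")  # simulates re-running the cell
log.info("Starting environment setup")
```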


In [2]:
# Validate prerequisites and environment
logger.info("🔍 Validating environment and dependencies...")

print("🔍 Environment Validation:")
print(f"   📍 Current working directory: {os.getcwd()}")
print(f"   🐍 Python version: {sys.version}")
print(f"   🤗 Transformers version: {transformers.__version__}")
print(f"   📊 Pandas version: {pd.__version__}")
print(f"   🔢 NumPy version: {np.__version__}")

# Device detection: only CUDA is selected automatically; MPS is reported below but not used as the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"   🔧 Primary device: {device}")

if torch.cuda.is_available():
    print(f"   🚀 CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"   💾 CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("   🍎 Apple Metal Performance Shaders (MPS) available")
else:
    print("   💻 Using CPU")

print("✅ Environment validation completed")

2025-08-08 16:28:04,693 - pipeline.setup - INFO - 🔍 Validating environment and dependencies...


🔍 Environment Validation:
   📍 Current working directory: /Users/matthew/Documents/deepmind_internship/notebooks_generalized
   🐍 Python version: 3.11.13 (main, Jun  3 2025, 18:38:25) [Clang 16.0.0 (clang-1600.0.26.6)]
   🤗 Transformers version: 4.52.4
   📊 Pandas version: 2.3.1
   🔢 NumPy version: 1.26.4
   🔧 Primary device: cpu
   🍎 Apple Metal Performance Shaders (MPS) available
✅ Environment validation completed
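
In the run above the primary device is reported as `cpu` even though MPS is available, because the selection only considers CUDA. A minimal sketch of a selection order that also considers MPS follows. `pick_device_name` and `detect_device_name` are hypothetical helpers, and whether MPS is actually preferable to CPU for this pipeline is an assumption that should be verified per model.

```python
def pick_device_name(cuda_available, mps_available):
    """Hypothetical helper: prefer CUDA, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device_name():
    """Query PyTorch if it is installed; fall back to 'cpu' otherwise."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device_name(torch.cuda.is_available(), mps_ok)


print(f"Primary device: {detect_device_name()}")
```

The returned name can be passed to `torch.device(...)` in place of the two-way `cuda`/`cpu` expression. Keeping the decision in a pure function makes the ordering easy to test without any accelerator present.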


In [3]:
# Validate configuration and directory structure
logger.info("📋 Validating pipeline configuration...")

print("📋 Configuration Validation:")

# Check if configuration files exist
config_file = Path("../config/pipeline_config.json")
utils_file = Path("../src/pipeline_utils.py")

if config_file.exists():
    print(f"   ✅ Configuration file found: {config_file}")
else:
    print(f"   ❌ Configuration file missing: {config_file}")

if utils_file.exists():
    print(f"   ✅ Utilities module found: {utils_file}")
else:
    print(f"   ❌ Utilities module missing: {utils_file}")

# Validate configuration structure
try:
    data_config = config.get('data', {})
    models_config = config.get('models', {})
    training_config = config.get('training', {})
    
    print(f"   📊 Data sources configured: {len(data_config.get('datasets', {}))}")
    print(f"   🤖 Models configured: {len(models_config.get('base_models', []))}")
    print(f"   🏋️ Training epochs: {training_config.get('num_epochs', 'Not set')}")
    print("   ✅ Configuration structure validated")
    
except Exception as e:
    logger.error(f"Configuration validation failed: {str(e)}")
    print(f"   ❌ Configuration validation failed: {str(e)}")

# Check required directories
required_dirs = ["../data", "../models", "../results", "../config", "../src"]
print("\n📁 Directory Structure:")

for dir_path in required_dirs:
    path = Path(dir_path)
    if path.exists():
        print(f"   ✅ {dir_path}")
    else:
        print(f"   ❌ {dir_path} (will be created as needed)")
        
print("✅ Configuration validation completed")

2025-08-08 16:28:04,731 - pipeline.setup - INFO - 📋 Validating pipeline configuration...


📋 Configuration Validation:
   ✅ Configuration file found: ../config/pipeline_config.json
   ✅ Utilities module found: ../src/pipeline_utils.py
   📊 Data sources configured: 0
   🤖 Models configured: 3
   🏋️ Training epochs: 3
   ✅ Configuration structure validated

📁 Directory Structure:
   ✅ ../data
   ✅ ../models
   ✅ ../results
   ✅ ../config
   ✅ ../src
✅ Configuration validation completed
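
The validation cell prints counts but does not stop or record anything when a key is absent. One possible refinement, sketched below, is to collect problems into a list that later cells can act on. `find_config_problems` is a hypothetical helper, and the set of required keys is an assumption based on the keys this notebook reads, not a schema confirmed by the project.

```python
def _section(cfg, name):
    # Return the named section if it is a dict, otherwise an empty dict
    value = cfg.get(name)
    return value if isinstance(value, dict) else {}


def find_config_problems(cfg):
    """Hypothetical check: return a list of human-readable problems."""
    problems = []
    for name in ("data", "models", "training"):
        if not isinstance(cfg.get(name), dict):
            problems.append(f"missing section: {name}")

    if not _section(cfg, "data").get("raw_data_path"):
        problems.append("data.raw_data_path is not set")

    base_models = _section(cfg, "models").get("base_models", [])
    if not base_models:
        problems.append("models.base_models is empty")
    for i, model in enumerate(base_models):
        for key in ("name", "model_id"):
            if key not in model:
                problems.append(f"models.base_models[{i}] lacks '{key}'")

    epochs = _section(cfg, "training").get("num_epochs")
    if not isinstance(epochs, int) or epochs < 1:
        problems.append("training.num_epochs must be a positive integer")
    return problems


example = {
    "data": {"raw_data_path": "data/FinancialPhraseBank/all-data.csv"},
    "models": {"base_models": [{"name": "finbert-tone",
                                "model_id": "ProsusAI/finbert"}]},
    "training": {"num_epochs": 3},
}
print(find_config_problems(example))
```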


In [4]:
# Model availability and base model verification
logger.info("🤖 Checking model availability...")

print("🤖 Model Availability Check:")

# Check configured models
from transformers import AutoTokenizer

models_config = config.get('models', {})
base_models = models_config.get('base_models', [])
available_models = []
missing_models = []

for model_config in base_models:
    if model_config.get('enabled', True):
        model_name = model_config['name']
        model_id = model_config['model_id']
        
        try:
            # Loading only the tokenizer is a cheap proxy for model accessibility
            AutoTokenizer.from_pretrained(model_id)
            available_models.append(model_name)
            print(f"   ✅ {model_name} ({model_id})")
            
        except Exception as e:
            missing_models.append(model_name)
            print(f"   ❌ {model_name} ({model_id}) - {str(e)[:100]}...")

print(f"\n📊 Model Summary:")
print(f"   ✅ Available: {len(available_models)} models")
print(f"   ❌ Missing: {len(missing_models)} models")

if missing_models:
    print("\n⚠️ Missing models could not be loaded now; training will retry the download.")

# Check data availability
print(f"\n📂 Data Availability:")
data_config = config.get('data', {})

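# Check main raw data path (an unset path must not fall back to "../", which always exists)
raw_data_rel = data_config.get('raw_data_path')
raw_data_path = Path("..") / raw_data_rel if raw_data_rel else None
if raw_data_path is not None and raw_data_path.exists():
    print(f"   ✅ Main dataset: {raw_data_path}")
else:
    print(f"   ❌ Main dataset: {raw_data_path or 'not configured'}")

# Check processed data directory
processed_rel = data_config.get('processed_data_dir')
processed_dir = Path("..") / processed_rel if processed_rel else None
if processed_dir is not None and processed_dir.exists():
    print(f"   ✅ Processed data dir: {processed_dir}")
elif processed_dir is not None:
    print(f"   ❌ Processed data dir: {processed_dir} (will be created)")
else:
    print("   ❌ Processed data dir: not configured")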

print("✅ Model availability check completed")

2025-08-08 16:28:04,744 - pipeline.setup - INFO - 🤖 Checking model availability...


🤖 Model Availability Check:
   ✅ tinybert-financial-classifier (huawei-noah/TinyBERT_General_4L_312D)
   ✅ finbert-tone (ProsusAI/finbert)

📊 Model Summary:
   ✅ Available: 2 models
   ❌ Missing: 0 models

📂 Data Availability:
   ✅ Main dataset: ../data/FinancialPhraseBank/all-data.csv
   ❌ Processed data dir: ../data/processed (will be created)
✅ Model availability check completed
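
The availability loop can be factored into a function that receives the loader as an argument, which makes it testable without network access. This is a sketch under that assumption: `check_models` and `fake_loader` are hypothetical, and in the notebook the loader would be `AutoTokenizer.from_pretrained`. Apart from the FinBERT entry, the model entries below are invented for illustration.

```python
def check_models(base_models, loader):
    """Hypothetical helper: split enabled models into available and missing.

    `loader` is any callable that raises an exception on failure.
    """
    available, missing = [], {}
    for entry in base_models:
        if not entry.get("enabled", True):
            continue  # disabled entries are skipped, as in the cell above
        try:
            loader(entry["model_id"])
        except Exception as exc:  # broad on purpose: any load failure counts
            missing[entry["name"]] = str(exc)[:100]
        else:
            available.append(entry["name"])
    return available, missing


def fake_loader(model_id):
    # Stand-in for a real download; only one ID is treated as reachable
    if model_id != "ProsusAI/finbert":
        raise OSError(f"{model_id} is not reachable")


models = [
    {"name": "finbert-tone", "model_id": "ProsusAI/finbert"},
    {"name": "broken", "model_id": "example/does-not-exist"},
    {"name": "skipped", "model_id": "example/disabled", "enabled": False},
]
available, missing = check_models(models, fake_loader)
print(available)
print(missing)
```

Returning the error text alongside each missing model keeps the reason for the failure available to later steps instead of only printing it.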


In [5]:
# Initialize pipeline state and complete setup
logger.info("🔄 Initializing pipeline state...")

print("🔄 Pipeline State Initialization:")

# Initialize or update pipeline state
setup_info = {
    'setup_timestamp': datetime.now().isoformat(),
    'python_version': sys.version,
    'device': str(device),
    'available_models': available_models if 'available_models' in locals() else [],
    'missing_models': missing_models if 'missing_models' in locals() else [],
    # NOTE: both flags are hard-coded; the checks in earlier cells only print their results
    'configuration_valid': True,
    'environment_ready': True
}

# Mark setup as completed
state.mark_step_complete('setup_completed', **setup_info)

print("   ✅ Pipeline state initialized")
print(f"   📅 Setup timestamp: {setup_info['setup_timestamp']}")
print(f"   🔧 Device configured: {setup_info['device']}")
print(f"   🤖 Models available: {len(setup_info['available_models'])}")

# Display next steps
print(f"\n{'='*60}")
print("🎉 SETUP COMPLETED SUCCESSFULLY!")
print(f"{'='*60}")
print("📝 Next Steps:")
print("1. Run 1_data_processing_generalized.ipynb to process and prepare data")
print("2. Run 2_train_models_generalized.ipynb to train the models")
print("3. Continue with the sequential pipeline: 3 → 4 → 5 → 6")

print(f"\n🔧 Configuration Summary:")
print(f"   📊 Data source: {data_config.get('raw_data_path', 'Not configured')}")
print(f"   🤖 Models configured: {len(base_models)}")
print(f"   🏋️ Training epochs: {training_config.get('num_epochs', 'Not configured')}")
print(f"   📈 Batch size: {training_config.get('batch_size', 'Not configured')}")

logger.info("✅ Environment setup completed successfully")

# Save setup report
setup_report = {
    'timestamp': datetime.now().isoformat(),
    'status': 'completed',
    'environment': {
        'python_version': sys.version,
        'device': str(device),
        'torch_version': torch.__version__,
        'transformers_version': transformers.__version__
    },
    'configuration': {
        'config_file_exists': config_file.exists() if 'config_file' in locals() else False,
        'utils_file_exists': utils_file.exists() if 'utils_file' in locals() else False,
        'models_configured': len(base_models),
        'data_source_configured': bool(data_config.get('raw_data_path'))
    },
    'models': {
        'available': available_models if 'available_models' in locals() else [],
        'missing': missing_models if 'missing_models' in locals() else []
    }
}

# Ensure results directory exists (create missing parents as well)
results_dir = Path("../results")
results_dir.mkdir(parents=True, exist_ok=True)

# Save setup report
with open(results_dir / 'setup_report.json', 'w', encoding='utf-8') as f:
    json.dump(setup_report, f, indent=2)

print(f"\n📄 Setup report saved to: {results_dir / 'setup_report.json'}")

2025-08-08 16:28:07,129 - pipeline.setup - INFO - 🔄 Initializing pipeline state...
2025-08-08 16:28:07,132 - pipeline.setup - INFO - ✅ Environment setup completed successfully


🔄 Pipeline State Initialization:
   ✅ Pipeline state initialized
   📅 Setup timestamp: 2025-08-08T16:28:07.131245
   🔧 Device configured: cpu
   🤖 Models available: 2

🎉 SETUP COMPLETED SUCCESSFULLY!
📝 Next Steps:
1. Run 1_data_processing_generalized.ipynb to process and prepare data
2. Run 2_train_models_generalized.ipynb to train the models
3. Continue with the sequential pipeline: 3 → 4 → 5 → 6

🔧 Configuration Summary:
   📊 Data source: data/FinancialPhraseBank/all-data.csv
   🤖 Models configured: 3
   🏋️ Training epochs: 3
   📈 Batch size: 16

📄 Setup report saved to: ../results/setup_report.json
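
The report above records `'status': 'completed'` unconditionally. A sketch of deriving the status from the check results before writing the file is shown below. `build_setup_report` and `write_report` are hypothetical helpers, and the `incomplete` label is an assumption (only `completed` appears in the notebook).

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def build_setup_report(checks, available, missing):
    """Hypothetical helper: `checks` maps a check name to True/False."""
    all_ok = all(checks.values()) and not missing
    return {
        "timestamp": datetime.now().isoformat(),
        "status": "completed" if all_ok else "incomplete",
        "checks": dict(checks),
        "models": {"available": list(available), "missing": list(missing)},
    }


def write_report(report, results_dir):
    """Write the report as JSON, creating the directory if needed."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    target = results_dir / "setup_report.json"
    target.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return target


report = build_setup_report(
    {"config_file_exists": True, "utils_file_exists": True},
    available=["finbert-tone"],
    missing=[],
)
with tempfile.TemporaryDirectory() as tmp:
    saved = write_report(report, Path(tmp) / "results")
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
print(reloaded["status"])
```

With this shape, downstream notebooks can read `status` from the report and refuse to start when setup did not fully succeed.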
