# 🚀 Environment Setup & Validation
## Financial Sentiment Analysis Pipeline - Production Setup

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white)](https://www.python.org/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0%2B-red?logo=pytorch&logoColor=white)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.0%2B-yellow)](https://huggingface.co/transformers/)

---

### 📋 Overview

This notebook initialises and validates the complete environment for the **Financial Sentiment Analysis Pipeline**. It serves as the foundation for all subsequent notebooks in the workflow.

### 🎯 Key Objectives

- **Environment Validation**: Verify all dependencies and system requirements
- **Configuration Management**: Load and validate centralised pipeline configuration
- **Hardware Detection**: Identify and optimise for available compute resources (CPU/GPU/MPS)
- **State Management**: Initialise pipeline state tracking for reproducible runs
- **Model Verification**: Confirm availability of required models and datasets

### 🏗️ Architecture

This pipeline follows a **configuration-driven approach** where all settings are centralised in `../config/pipeline_config.json`. This ensures:

- ✅ **Reproducible Results**: Consistent configuration across environments
- ✅ **Easy Deployment**: Single configuration file for all settings
- ✅ **Flexible Scaling**: Easy to add new models or modify parameters
- ✅ **Production Ready**: Structured logging and state management
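
To make the configuration-driven approach concrete, here is a minimal sketch of what such a file might contain and how its top-level sections could be checked. The section and key names (`data`, `models`, `training`, `raw_data_path`, `base_models`, and so on) mirror the keys this notebook reads later; every value, and the `missing_sections` helper itself, is a hypothetical placeholder rather than the project's real settings or validation logic.

```python
import json

# Hypothetical example values; only the key names are taken from this notebook.
example_config = {
    "data": {
        "raw_data_path": "data/raw/example.csv",
        "processed_data_dir": "data/processed",
        "datasets": {},
    },
    "models": {
        "base_models": [
            {"name": "example-model", "model_id": "org/example-model", "enabled": True},
        ],
    },
    "training": {"num_epochs": 3, "batch_size": 16},
}

# Top-level sections the later cells look up with config.get(...)
REQUIRED_SECTIONS = ("data", "models", "training")


def missing_sections(config: dict) -> list:
    """Return the required top-level sections that are absent from config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


print(json.dumps(example_config, indent=2))
print("Missing sections:", missing_sections(example_config))
```

Keeping every tunable value in one file like this means a run can be reproduced by archiving a single JSON document alongside the results.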

### 📁 Directory Structure

```
deepmind_internship/
├── config/
│   ├── pipeline_config.json    # Main configuration file
│   └── pipeline_state.json     # Runtime state tracking
├── notebooks/                  # Production notebooks
├── src/                        # Core pipeline utilities
├── data/                       # Training and test datasets
├── models/                     # Trained model artifacts
└── results/                    # Setup report and pipeline outputs
```

### 🚀 Quick Start

1. **Prerequisites**: Ensure Python 3.8+ and required dependencies are installed
2. **Configuration**: Review and modify `config/pipeline_config.json` as needed
3. **Run Setup**: Execute all cells in this notebook to validate the environment
4. **Proceed**: Continue to `1_data_processing.ipynb`
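
Step 1 can also be checked programmatically before running the rest of the notebook. The sketch below uses only the standard library; the helper names `python_ok` and `missing_packages` are illustrative, and the package list is an assumption based on the imports used in this notebook rather than the contents of `requirements.txt`.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)
# Assumed package list, taken from the imports in this notebook
REQUIRED_PACKAGES = ["torch", "transformers", "datasets", "pandas", "numpy"]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the interpreter is at least the minimum major.minor version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_packages(names) -> list:
    """Return the top-level packages that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages(REQUIRED_PACKAGES))
```

Because `importlib.util.find_spec` only locates a package, this check stays fast even when heavy libraries such as `torch` are installed.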

---

**⚠️ Important**: Run this notebook before any other pipeline component.

In [None]:
"""
🔧 PIPELINE INITIALISATION
==========================

This cell imports all required dependencies and initialises the configuration system.
The pipeline uses a centralised configuration approach for reproducible results.

Author: DeepMind Internship Project
Date: August 2025
Version: 1.0.0
"""

# Core Python imports
import sys
import os
from pathlib import Path
import json
from datetime import datetime

# Add parent directory to Python path for custom imports
sys.path.append('../')

try:
    # Import custom pipeline utilities
    from src.pipeline_utils import ConfigManager, StateManager, LoggingManager
    print("✅ Custom pipeline utilities imported successfully")
except ImportError as e:
    print(f"❌ Failed to import pipeline utilities: {e}")
    print("💡 Ensure you're running from the notebooks/ directory")
    raise

try:
    # Core ML and data processing libraries
    import torch
    import transformers
    import datasets
    import pandas as pd
    import numpy as np
    print("✅ Core ML libraries imported successfully")
except ImportError as e:
    print(f"❌ Missing required dependencies: {e}")
    print("💡 Run: pip install -r requirements.txt")
    raise

# Initialise configuration system
print("\n🔧 Initialising pipeline configuration...")

# Define file paths
config_path = Path("../config/pipeline_config.json")
state_path = Path("../config/pipeline_state.json")

# Initialise managers with their file paths
config_manager = ConfigManager(config_path=str(config_path))
state_manager = StateManager(state_path=str(state_path))

# Get the loaded configuration (load_config() is called in __init__)
config = config_manager.config

# Initialise logging manager with the config manager and a component name
logging_manager = LoggingManager(config=config_manager, component_name='setup')

# Set up remaining variables
state = state_manager
logger = logging_manager.get_logger()

print("✅ Configuration system initialised")
print(f"   📋 Config loaded: {len(config)} top-level sections")
print(f"   📊 State tracking: {'enabled' if state else 'disabled'}")
print(f"   📝 Logging: {'configured' if logger else 'basic'}")

## 🔧 Initial Setup & Dependencies

### 📦 Required Libraries

The following libraries are essential for the pipeline:

- **Core ML**: `torch`, `transformers`, `datasets`
- **Data Processing**: `pandas`, `numpy`, `scikit-learn`
- **Visualisation**: `matplotlib`, `seaborn`, `plotly`
- **Utilities**: `pathlib`, `json`, `logging`

### ⚙️ Configuration System

This pipeline uses a centralised configuration management system:

- **ConfigManager**: Loads and validates `pipeline_config.json`
- **StateManager**: Tracks pipeline execution state
- **LoggingManager**: Provides structured logging across all components
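
To illustrate what "tracking pipeline execution state" can mean in practice, here is a minimal sketch of a JSON-backed state tracker. `SimpleStateTracker` and its `is_complete` method are hypothetical stand-ins written for this example; they are not the `StateManager` in `src/pipeline_utils.py`, whose actual behaviour may differ. Only the `mark_step_complete(step, **info)` call shape is taken from how this notebook uses it.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


class SimpleStateTracker:
    """Hypothetical stand-in; not the StateManager from src/pipeline_utils.py."""

    def __init__(self, state_path):
        self.state_path = Path(state_path)
        # Reload earlier state if the file already exists, otherwise start empty
        if self.state_path.exists():
            self.state = json.loads(self.state_path.read_text(encoding="utf-8"))
        else:
            self.state = {}

    def mark_step_complete(self, step, **info):
        """Record a finished step with a timestamp and persist it to disk."""
        self.state[step] = {"completed_at": datetime.now().isoformat(), **info}
        self.state_path.parent.mkdir(parents=True, exist_ok=True)
        self.state_path.write_text(json.dumps(self.state, indent=2), encoding="utf-8")

    def is_complete(self, step) -> bool:
        """Return True if the step was recorded in this or an earlier session."""
        return step in self.state


# Usage: state written by one tracker is visible to a fresh one
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pipeline_state.json"
    SimpleStateTracker(path).mark_step_complete("setup_completed", device="cpu")
    print(SimpleStateTracker(path).is_complete("setup_completed"))
```

Persisting each completed step to disk is what lets a later notebook confirm that setup has already run, even in a new kernel session.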

### 🚨 Prerequisites

Before proceeding, ensure:

1. ✅ All dependencies are installed via `pip install -r requirements.txt`
2. ✅ Configuration file `config/pipeline_config.json` exists and is valid
3. ✅ Sufficient disk space for models and datasets (~5 GB recommended)
4. ✅ GPU drivers installed (if using CUDA)
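
The disk-space prerequisite can be verified with the standard library. The sketch below is one possible check; the helper names `free_gb` and `has_enough_space` are illustrative, and the 5 GB default simply reflects the recommendation above.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Return the free space, in gigabytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str = ".", required_gb: float = 5.0) -> bool:
    """Return True if at least required_gb of free space is available at path."""
    return free_gb(path) >= required_gb


print(f"Free space: {free_gb('.'):.1f} GB")
print("Enough space for models and datasets:", has_enough_space("."))
```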

---

In [None]:
# Validate prerequisites and environment
logger.info("🔍 Validating environment and dependencies...")

print("🔍 Environment Validation:")
print(f"   📍 Current working directory: {os.getcwd()}")
print(f"   🐍 Python version: {sys.version}")
print(f"   🤗 Transformers version: {transformers.__version__}")
print(f"   📊 Pandas version: {pd.__version__}")
print(f"   🔢 NumPy version: {np.__version__}")

# Device detection: prefer CUDA, then Apple MPS, then fall back to CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"   🚀 CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"   💾 CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')
    print("   🍎 Apple Metal Performance Shaders (MPS) available")
else:
    device = torch.device('cpu')
    print("   💻 Using CPU")

print(f"   🔧 Primary device: {device}")

print("✅ Environment validation completed")

In [None]:
# Validate configuration and directory structure
logger.info("📋 Validating pipeline configuration...")

print("📋 Configuration Validation:")

# Check if configuration files exist
config_file = Path("../config/pipeline_config.json")
utils_file = Path("../src/pipeline_utils.py")

if config_file.exists():
    print(f"   ✅ Configuration file found: {config_file}")
else:
    print(f"   ❌ Configuration file missing: {config_file}")

if utils_file.exists():
    print(f"   ✅ Utilities module found: {utils_file}")
else:
    print(f"   ❌ Utilities module missing: {utils_file}")

# Validate configuration structure
try:
    data_config = config.get('data', {})
    models_config = config.get('models', {})
    training_config = config.get('training', {})
    
    print(f"   📊 Data sources configured: {len(data_config.get('datasets', {}))}")
    print(f"   🤖 Models configured: {len(models_config.get('base_models', []))}")
    print(f"   🏋️ Training epochs: {training_config.get('num_epochs', 'Not set')}")
    print("   ✅ Configuration structure validated")
    
except Exception as e:
    logger.error(f"Configuration validation failed: {str(e)}")
    print(f"   ❌ Configuration validation failed: {str(e)}")

# Check required directories
required_dirs = ["../data", "../models", "../results", "../config", "../src"]
print("\n📁 Directory Structure:")

for dir_path in required_dirs:
    path = Path(dir_path)
    if path.exists():
        print(f"   ✅ {dir_path}")
    else:
        print(f"   ❌ {dir_path} (will be created as needed)")
        
print("✅ Configuration validation completed")

In [None]:
# Model availability and base model verification
logger.info("🤖 Checking model availability...")

print("🤖 Model Availability Check:")

# Check configured models
models_config = config.get('models', {})
base_models = models_config.get('base_models', [])
available_models = []
missing_models = []

from transformers import AutoTokenizer

for model_config in base_models:
    # Skip models that are explicitly disabled in the configuration
    if not model_config.get('enabled', True):
        continue

    model_name = model_config['name']
    model_id = model_config['model_id']

    try:
        # Loading the tokenizer is a lightweight check that the model is reachable
        AutoTokenizer.from_pretrained(model_id)
        available_models.append(model_name)
        print(f"   ✅ {model_name} ({model_id})")

    except Exception as e:
        missing_models.append(model_name)
        print(f"   ❌ {model_name} ({model_id}) - {str(e)[:100]}...")

print("\n📊 Model Summary:")
print(f"   ✅ Available: {len(available_models)} models")
print(f"   ❌ Missing: {len(missing_models)} models")

if missing_models:
    print("\n⚠️ Missing models could not be loaded now; training will retry the download, "
          "so check the model IDs and network access first.")

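# Check data availability
print("\n📂 Data Availability:")
data_config = config.get('data', {})

# Check main raw data path (an unset path must not resolve to "../" and pass)
raw_data_setting = data_config.get('raw_data_path')
raw_data_path = Path(f"../{raw_data_setting}") if raw_data_setting else None
if raw_data_path is not None and raw_data_path.exists():
    print(f"   ✅ Main dataset: {raw_data_path}")
else:
    print(f"   ❌ Main dataset: {raw_data_path or 'not configured'}")

# Check processed data directory, with the same guard for an unset path
processed_setting = data_config.get('processed_data_dir')
processed_dir = Path(f"../{processed_setting}") if processed_setting else None
if processed_dir is not None and processed_dir.exists():
    print(f"   ✅ Processed data dir: {processed_dir}")
else:
    print(f"   ❌ Processed data dir: {processed_dir or 'not configured'} (will be created)")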

print("✅ Model availability check completed")

In [None]:
# Initialize pipeline state and complete setup
logger.info("🔄 Initializing pipeline state...")

print("🔄 Pipeline State Initialization:")

# Initialize or update pipeline state
setup_info = {
    'setup_timestamp': datetime.now().isoformat(),
    'python_version': sys.version,
    'device': str(device),
    'available_models': available_models if 'available_models' in locals() else [],
    'missing_models': missing_models if 'missing_models' in locals() else [],
    'configuration_valid': True,
    'environment_ready': True
}

# Mark setup as completed
state.mark_step_complete('setup_completed', **setup_info)

print("   ✅ Pipeline state initialized")
print(f"   📅 Setup timestamp: {setup_info['setup_timestamp']}")
print(f"   🔧 Device configured: {setup_info['device']}")
print(f"   🤖 Models available: {len(setup_info['available_models'])}")

# Display next steps
print(f"\n{'='*60}")
print("🎉 SETUP COMPLETED SUCCESSFULLY!")
print(f"{'='*60}")
print("📝 Next Steps:")
print("1. Run 1_data_processing.ipynb to process and prepare data")
print("2. Run 2_train_models.ipynb to train the models")
print("3. Continue with the sequential pipeline: 3 → 4 → 5 → 6")

# Safe variable access for configuration summary
data_config = locals().get('data_config', config.get('data', {}))
models_config = locals().get('models_config', config.get('models', {}))
training_config = locals().get('training_config', config.get('training', {}))
base_models = locals().get('base_models', models_config.get('base_models', []))

print(f"\n🔧 Configuration Summary:")
print(f"   📊 Data source: {data_config.get('raw_data_path', 'Not configured')}")
print(f"   🤖 Models configured: {len(base_models)}")
print(f"   🏋️ Training epochs: {training_config.get('num_epochs', 'Not configured')}")
print(f"   📈 Batch size: {training_config.get('batch_size', 'Not configured')}")

logger.info("✅ Environment setup completed successfully")

# Save setup report
setup_report = {
    'timestamp': datetime.now().isoformat(),
    'status': 'completed',
    'environment': {
        'python_version': sys.version,
        'device': str(device),
        'torch_version': torch.__version__,
        'transformers_version': transformers.__version__
    },
    'configuration': {
        'config_file_exists': locals().get('config_file', Path("../config/pipeline_config.json")).exists(),
        'utils_file_exists': locals().get('utils_file', Path("../src/pipeline_utils.py")).exists(),
        'models_configured': len(base_models),
        'data_source_configured': bool(data_config.get('raw_data_path'))
    },
    'models': {
        'available': available_models if 'available_models' in locals() else [],
        'missing': missing_models if 'missing_models' in locals() else []
    }
}

# Ensure results directory exists
results_dir = Path("../results")
results_dir.mkdir(parents=True, exist_ok=True)

# Save setup report (explicit encoding keeps the file portable across platforms)
with open(results_dir / 'setup_report.json', 'w', encoding='utf-8') as f:
    json.dump(setup_report, f, indent=2)

print(f"\n📄 Setup report saved to: {results_dir / 'setup_report.json'}")