# **Chapter 8: Data Storage and Management**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:
- Make informed decisions about storage architecture for time-series data
- Implement file-based storage solutions (CSV, Parquet, HDF5)
- Design and query relational databases for time-series data
- Utilize specialized time-series databases (InfluxDB, TimescaleDB)
- Choose appropriate NoSQL solutions for different use cases
- Implement data partitioning and archival strategies
- Ensure data security and compliance

---

## **Prerequisites**

- Completed Chapter 7: Exploratory Data Analysis
- Understanding of Python file I/O operations
- Basic knowledge of SQL
- Familiarity with pandas DataFrames

---

## **8.1 Storage Architecture Decisions**

Choosing the right storage architecture is critical for any time-series prediction system. The decision impacts performance, scalability, cost, and maintainability. Let's explore the factors that influence this decision.

### **Understanding Storage Requirements**

Before selecting a storage solution, you need to analyze your specific requirements:

```python
"""
Storage Requirements Analysis for Time-Series Systems

This module helps analyze and document storage requirements
for time-series prediction systems.
"""

from dataclasses import dataclass
from typing import Optional
from enum import Enum
import os

class AccessPattern(Enum):
    """Enumeration of common data access patterns"""
    SEQUENTIAL = "sequential"      # Read data in time order
    RANDOM = "random"              # Access arbitrary time points
    AGGREGATION = "aggregation"    # Frequent aggregations/rollups
    REAL_TIME = "real_time"        # Continuous writes and reads

class DataVolume(Enum):
    """Classification of data volume categories"""
    SMALL = "small"       # < 1 GB
    MEDIUM = "medium"     # 1 GB - 100 GB
    LARGE = "large"       # 100 GB - 10 TB
    VERY_LARGE = "very_large"  # > 10 TB

@dataclass
class StorageRequirements:
    """
    Data class to capture storage requirements for analysis.
    
    Using a dataclass provides automatic __init__, __repr__, and __eq__
    methods, making it easy to create and compare requirement objects.
    """
    estimated_size_gb: float          # Expected data size in gigabytes
    write_frequency: str              # How often data is written
    read_frequency: str               # How often data is read
    access_pattern: AccessPattern     # Primary access pattern
    query_complexity: str             # Simple, moderate, or complex queries
    retention_period_days: int        # How long to keep data
    requires_transaction: bool        # Need for ACID transactions
    concurrent_users: int             # Number of simultaneous users
    budget_constraint: str            # low, medium, high
    
    def analyze_requirements(self) -> dict:
        """
        Analyze the storage requirements and provide recommendations.
        
        Returns a dictionary with analysis results and recommendations.
        """
        # Determine volume category based on size
        if self.estimated_size_gb < 1:
            volume = DataVolume.SMALL
        elif self.estimated_size_gb < 100:
            volume = DataVolume.MEDIUM
        elif self.estimated_size_gb < 10000:
            volume = DataVolume.LARGE
        else:
            volume = DataVolume.VERY_LARGE
        
        # Calculate write-to-read ratio
        # This helps determine if the system is write-heavy or read-heavy
        write_patterns = {
            'continuous': 10,
            'hourly': 5,
            'daily': 1,
            'weekly': 0.5
        }
        read_patterns = {
            'continuous': 10,
            'frequent': 5,
            'moderate': 2,
            'occasional': 1
        }
        
        write_score = write_patterns.get(self.write_frequency, 1)
        read_score = read_patterns.get(self.read_frequency, 1)
        write_read_ratio = write_score / read_score if read_score > 0 else 0
        
        # Determine system type
        if write_read_ratio > 2:
            system_type = "write-heavy"
        elif write_read_ratio < 0.5:
            system_type = "read-heavy"
        else:
            system_type = "balanced"
        
        return {
            'volume_category': volume.value,
            'write_read_ratio': write_read_ratio,
            'system_type': system_type,
            'recommendations': self._get_recommendations(volume, system_type)
        }
    
    def _get_recommendations(self, volume: DataVolume, system_type: str) -> list:
        """
        Generate storage recommendations based on volume and system type.
        
        Args:
            volume: DataVolume enum value
            system_type: String describing the system type
            
        Returns:
            List of recommended storage solutions
        """
        recommendations = []
        
        # File-based storage is good for small to medium datasets
        # especially for development and prototyping
        if volume in [DataVolume.SMALL, DataVolume.MEDIUM]:
            recommendations.extend([
                "Parquet files for efficient columnar storage",
                "HDF5 for large numerical arrays with fast slicing",
                "SQLite for transactional requirements"
            ])
        
        # Relational databases work well for medium datasets with
        # complex queries and transaction requirements
        if volume == DataVolume.MEDIUM and self.requires_transaction:
            recommendations.extend([
                "PostgreSQL with TimescaleDB extension",
                "MySQL with partitioning"
            ])
        
        # Time-series databases are optimal for large datasets
        # with time-based queries
        if volume in [DataVolume.LARGE, DataVolume.VERY_LARGE]:
            recommendations.extend([
                "TimescaleDB for SQL compatibility",
                "InfluxDB for high write throughput",
                "ClickHouse for analytical queries"
            ])
        
        # Cloud solutions for scalability
        if self.concurrent_users > 10 or volume == DataVolume.VERY_LARGE:
            recommendations.extend([
                "Amazon Redshift for data warehousing",
                "Google BigQuery for serverless analytics",
                "Azure Data Explorer for IoT and telemetry scenarios"
            ])
        
        return recommendations


def analyze_nepse_requirements():
    """
    Analyze storage requirements for the NEPSE prediction system.
    
    This function demonstrates how to analyze requirements for a real
    stock market prediction system using the StorageRequirements class.
    """
    # Create a requirements object based on NEPSE characteristics
    # NEPSE data characteristics:
    # - Daily data for ~200 stocks
    # - Approximately 250 trading days per year
    # - Each record has 21 columns
    # - Historical data going back multiple years
    nepse_requirements = StorageRequirements(
        estimated_size_gb=0.5,           # ~500 MB for historical data
        write_frequency='daily',          # Data updated once per trading day
        read_frequency='frequent',        # Model training runs frequently
        access_pattern=AccessPattern.SEQUENTIAL,  # Time-series access
        query_complexity='moderate',      # Date range queries, aggregations
        retention_period_days=3650,       # 10 years of historical data
        requires_transaction=False,       # No complex transactions needed
        concurrent_users=5,               # Multiple analysts/models
        budget_constraint='medium'
    )
    
    # Perform analysis
    analysis = nepse_requirements.analyze_requirements()
    
    print("=" * 60)
    print("NEPSE Time-Series Prediction System - Storage Analysis")
    print("=" * 60)
    print(f"\nData Volume Category: {analysis['volume_category']}")
    print(f"Write/Read Ratio: {analysis['write_read_ratio']:.2f}")
    print(f"System Type: {analysis['system_type']}")
    print("\nRecommended Storage Solutions:")
    for i, rec in enumerate(analysis['recommendations'], 1):
        print(f"  {i}. {rec}")
    
    return analysis


# Example usage
if __name__ == "__main__":
    analyze_nepse_requirements()
```

**Detailed Explanation:**

1. **AccessPattern Enum**: Defines how your application accesses data. Time-series systems typically use sequential access (reading data in time order) or aggregation patterns (computing statistics over time windows).

2. **DataVolume Enum**: Categorizes data size, which heavily influences storage choice. Small datasets can use simple file-based solutions, while large datasets need specialized databases.

3. **StorageRequirements Dataclass**: Captures all relevant factors for storage decisions:
   - `estimated_size_gb`: The expected size of your data. For NEPSE, ~200 stocks over 10 years of ~250 trading days each gives about 500,000 daily records of 21 columns; the 500 MB used here is a deliberately generous estimate that leaves room for growth.
   - `write_frequency`: How often new data arrives. NEPSE updates daily after market close.
   - `read_frequency`: How often data is accessed. Prediction systems read frequently for training.
   - `access_pattern`: The primary way data is accessed. Time-series models need sequential access.
   - `query_complexity`: Simple queries fetch specific records; complex queries involve joins and aggregations.
   - `retention_period_days`: How long to keep historical data. Financial data often requires long retention.
   - `requires_transaction`: Whether you need ACID guarantees. For pure analytics, this is often unnecessary.
   - `concurrent_users`: Number of simultaneous readers/writers.
   - `budget_constraint`: Financial limitations affect cloud vs. on-premise decisions.

4. **analyze_requirements Method**: Computes a write-to-read ratio and classifies the system. A write-heavy system (ratio > 2) needs optimized write paths; a read-heavy system (ratio < 0.5) benefits from read replicas and caching.
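
The size figure above can be sanity-checked with a quick back-of-envelope calculation. The sketch below is only a rough estimate: it assumes every field is stored as an 8-byte number and ignores text columns, indexes, and file-format overhead. The helper name `estimate_raw_size` is introduced here for illustration and is not part of the module above.

```python
from typing import Tuple


def estimate_raw_size(n_series: int,
                      rows_per_year: int,
                      years: int,
                      n_columns: int,
                      bytes_per_value: int = 8) -> Tuple[int, float]:
    """Return (row_count, raw_size_gb) assuming fixed-width numeric values."""
    rows = n_series * rows_per_year * years
    size_gb = rows * n_columns * bytes_per_value / 1e9  # decimal gigabytes
    return rows, size_gb


# NEPSE figures used in this chapter: ~200 stocks, ~250 trading days per year,
# 10 years of history, 21 columns per record
rows, size_gb = estimate_raw_size(n_series=200, rows_per_year=250,
                                  years=10, n_columns=21)
print(f"{rows:,} rows, about {size_gb:.3f} GB of raw values")
# prints: 500,000 rows, about 0.084 GB of raw values
```

The raw values come to well under the 0.5 GB passed to `StorageRequirements`, so the dataset stays in the `SMALL` category even with generous headroom for text columns and format overhead.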

### **Storage Decision Matrix**

```python
"""
Storage Decision Matrix for Time-Series Systems

This module provides a decision matrix to help choose the appropriate
storage solution based on specific requirements.
"""

import pandas as pd
from typing import Dict, List, Tuple

class StorageDecisionMatrix:
    """
    A decision matrix class that evaluates storage options against
    multiple criteria to recommend the best solution.
    """
    
    def __init__(self):
        """Initialize the decision matrix with storage options and criteria."""
        # Define storage options available for time-series data
        self.storage_options = [
            'CSV Files',
            'Parquet Files',
            'HDF5 Files',
            'SQLite',
            'PostgreSQL',
            'TimescaleDB',
            'InfluxDB',
            'MongoDB',
            'Redis'
        ]
        
        # Define evaluation criteria with weights
        # Higher weight means the criterion is more important
        self.criteria = {
            'query_performance': 0.20,      # Speed of data retrieval
            'write_performance': 0.15,      # Speed of data insertion
            'storage_efficiency': 0.15,     # Compression and space usage
            'scalability': 0.15,            # Ability to handle growth
            'ease_of_use': 0.10,            # Developer experience
            'cost': 0.10,                   # Licensing and operational costs
            'ecosystem': 0.10,              # Community support and tools
            'time_series_features': 0.05    # Built-in time-series support
        }
        
        # Score matrix: each storage option scored 1-10 on each criterion
        # These scores are based on typical use cases and may vary
        self.scores = self._initialize_scores()
    
    def _initialize_scores(self) -> pd.DataFrame:
        """
        Initialize the scoring matrix with default values.
        
        Returns a DataFrame with storage options as rows and criteria as columns.
        Each cell contains a score from 1 (poor) to 10 (excellent).
        """
        # Illustrative scores reflecting typical performance characteristics.
        # They are rough judgments, not measured benchmark results.
        data = {
            'query_performance': {
                'CSV Files': 3,      # Slow for large files, no indexing
                'Parquet Files': 7,  # Good columnar access, predicate pushdown
                'HDF5 Files': 8,     # Excellent for numerical data, indexed
                'SQLite': 6,         # Good for moderate datasets
                'PostgreSQL': 8,     # Excellent with proper indexing
                'TimescaleDB': 9,    # Optimized for time-series queries
                'InfluxDB': 9,       # Excellent time-series query engine
                'MongoDB': 7,        # Good for document queries
                'Redis': 10          # In-memory, extremely fast
            },
            'write_performance': {
                'CSV Files': 7,      # Simple append operations
                'Parquet Files': 5,  # Slower writes due to compression
                'HDF5 Files': 6,     # Moderate write performance
                'SQLite': 6,         # Good for moderate write loads
                'PostgreSQL': 7,     # Good with proper configuration
                'TimescaleDB': 9,    # Optimized for high write throughput
                'InfluxDB': 10,      # Designed for high-frequency writes
                'MongoDB': 8,        # Good write performance
                'Redis': 10          # In-memory writes are fast
            },
            'storage_efficiency': {
                'CSV Files': 2,      # No compression, text format
                'Parquet Files': 9,  # Excellent compression ratios
                'HDF5 Files': 8,     # Good compression for numerical data
                'SQLite': 5,         # Moderate, page-based storage
                'PostgreSQL': 6,     # TOAST compression available
                'TimescaleDB': 7,    # Good with columnar compression
                'InfluxDB': 8,       # Built-in compression
                'MongoDB': 5,        # BSON can be verbose
                'Redis': 1           # In-memory; RAM costs far more per GB than disk
            },
            'scalability': {
                'CSV Files': 2,      # Manual sharding, no built-in support
                'Parquet Files': 5,  # Can partition files manually
                'HDF5 Files': 4,     # Limited to single file/node
                'SQLite': 2,         # Single file, no clustering
                'PostgreSQL': 7,     # Read replicas, partitioning
                'TimescaleDB': 8,    # Native time partitioning (hypertables)
                'InfluxDB': 8,       # Clustering in commercial/cloud editions
                'MongoDB': 9,        # Built-in sharding
                'Redis': 7           # Clustering available
            },
            'ease_of_use': {
                'CSV Files': 10,     # Simplest format, universal support
                'Parquet Files': 8,  # Easy with pandas/PyArrow
                'HDF5 Files': 6,     # More complex API
                'SQLite': 9,         # Simple setup, file-based
                'PostgreSQL': 7,     # Requires setup and maintenance
                'TimescaleDB': 7,    # PostgreSQL-based, familiar
                'InfluxDB': 7,       # New query language (InfluxQL/Flux)
                'MongoDB': 8,        # JSON-like documents, intuitive
                'Redis': 6           # Different paradigm (key-value)
            },
            'cost': {
                'CSV Files': 10,     # Free, no infrastructure
                'Parquet Files': 10, # Free, no infrastructure
                'HDF5 Files': 10,    # Free, no infrastructure
                'SQLite': 10,        # Free, embedded
                'PostgreSQL': 9,     # Open source, self-hosted
                'TimescaleDB': 8,    # Open source with enterprise tier
                'InfluxDB': 8,       # Open source with cloud offering
                'MongoDB': 7,        # Open source with cloud costs
                'Redis': 9           # Open source, self-hosted
            },
            'ecosystem': {
                'CSV Files': 10,     # Universal support
                'Parquet Files': 9,  # Strong big data ecosystem
                'HDF5 Files': 7,     # Scientific computing focus
                'SQLite': 9,         # Widely supported
                'PostgreSQL': 9,     # Mature ecosystem
                'TimescaleDB': 7,    # Growing ecosystem
                'InfluxDB': 8,       # Strong monitoring ecosystem
                'MongoDB': 9,        # Large community
                'Redis': 9           # Widely adopted
            },
            'time_series_features': {
                'CSV Files': 1,      # No built-in support
                'Parquet Files': 2,  # Partitioning by time possible
                'HDF5 Files': 3,     # Time dimension support
                'SQLite': 2,         # Basic date functions
                'PostgreSQL': 5,     # Date functions, can optimize
                'TimescaleDB': 10,   # Purpose-built for time-series
                'InfluxDB': 10,      # Purpose-built for time-series
                'MongoDB': 3,        # Can store time-series but not optimized
                'Redis': 2           # Time-series module available
            }
        }
        
        return pd.DataFrame(data)
    
    def calculate_weighted_scores(self, 
                                   custom_weights: Dict[str, float] = None) -> pd.DataFrame:
        """
        Calculate weighted scores for each storage option.
        
        Args:
            custom_weights: Optional dictionary of custom criteria weights.
                           If provided, overrides default weights.
        
        Returns:
            DataFrame with weighted scores for each storage option.
        """
        weights = custom_weights if custom_weights else self.criteria
        
        # Normalize weights to ensure they sum to 1
        total_weight = sum(weights.values())
        if total_weight <= 0:
            raise ValueError("Criteria weights must sum to a positive number")
        normalized_weights = {k: v/total_weight for k, v in weights.items()}
        
        # Calculate weighted scores
        weighted_scores = pd.DataFrame(index=self.storage_options)
        
        for criterion, weight in normalized_weights.items():
            weighted_scores[f'{criterion}_weighted'] = (
                self.scores[criterion] * weight
            )
        
        # Calculate total weighted score
        weighted_scores['total_score'] = weighted_scores.sum(axis=1)
        
        # Sort by total score
        weighted_scores = weighted_scores.sort_values(
            'total_score', ascending=False
        )
        
        return weighted_scores
    
    def get_recommendation(self, 
                           use_case: str = 'general',
                           custom_weights: Dict[str, float] = None) -> Tuple[str, str]:
        """
        Get a storage recommendation based on use case.
        
        Args:
            use_case: The primary use case. Options:
                     'general', 'high_write', 'high_read', 'analytics', 
                     'real_time', 'archival'
            custom_weights: Optional custom weights for criteria.
        
        Returns:
            Tuple of (recommended storage, explanation)
        """
        # Predefined weight configurations for different use cases
        use_case_weights = {
            'general': {
                'query_performance': 0.20,
                'write_performance': 0.15,
                'storage_efficiency': 0.15,
                'scalability': 0.15,
                'ease_of_use': 0.10,
                'cost': 0.10,
                'ecosystem': 0.10,
                'time_series_features': 0.05
            },
            'high_write': {
                'query_performance': 0.10,
                'write_performance': 0.35,  # Prioritize write speed
                'storage_efficiency': 0.15,
                'scalability': 0.20,
                'ease_of_use': 0.05,
                'cost': 0.05,
                'ecosystem': 0.05,
                'time_series_features': 0.05
            },
            'high_read': {
                'query_performance': 0.35,  # Prioritize read speed
                'write_performance': 0.10,
                'storage_efficiency': 0.10,
                'scalability': 0.15,
                'ease_of_use': 0.10,
                'cost': 0.05,
                'ecosystem': 0.10,
                'time_series_features': 0.05
            },
            'analytics': {
                'query_performance': 0.20,
                'write_performance': 0.10,
                'storage_efficiency': 0.25,  # Important for large datasets
                'scalability': 0.15,
                'ease_of_use': 0.10,
                'cost': 0.10,
                'ecosystem': 0.05,
                'time_series_features': 0.05
            },
            'real_time': {
                'query_performance': 0.25,
                'write_performance': 0.25,  # Both important
                'storage_efficiency': 0.10,
                'scalability': 0.15,
                'ease_of_use': 0.05,
                'cost': 0.05,
                'ecosystem': 0.05,
                'time_series_features': 0.10
            },
            'archival': {
                'query_performance': 0.10,
                'write_performance': 0.05,
                'storage_efficiency': 0.35,  # Most important for archival
                'scalability': 0.20,
                'ease_of_use': 0.10,
                'cost': 0.15,
                'ecosystem': 0.05,
                'time_series_features': 0.00
            }
        }
        
        weights = custom_weights if custom_weights else use_case_weights.get(
            use_case, use_case_weights['general']
        )
        
        scores = self.calculate_weighted_scores(weights)
        best_option = scores.index[0]
        best_score = scores.loc[best_option, 'total_score']
        
        explanation = self._generate_explanation(best_option, scores)
        
        return best_option, explanation
    
    def _generate_explanation(self, 
                              best_option: str, 
                              scores: pd.DataFrame) -> str:
        """
        Generate an explanation for why a storage option was recommended.
        
        Args:
            best_option: The recommended storage option
            scores: DataFrame with all scores
        
        Returns:
            Explanation string
        """
        explanations = {
            'CSV Files': "CSV files are simple and universally supported. "
                        "Best for small datasets, prototyping, or when you need "
                        "maximum compatibility with different tools.",
            
            'Parquet Files': "Parquet provides excellent compression and "
                            "columnar access patterns. Ideal for analytical "
                            "workloads on medium to large datasets where "
                            "query performance and storage efficiency matter.",
            
            'HDF5 Files': "HDF5 excels at storing numerical scientific data "
                         "with fast random access. Good for medium-sized "
                         "datasets where you need efficient slicing and "
                         "complex data structures.",
            
            'SQLite': "SQLite is a simple, file-based relational database. "
                     "Great for applications that need SQL capabilities "
                     "without a separate server, ideal for embedded or "
                     "desktop applications.",
            
            'PostgreSQL': "PostgreSQL is a robust relational database with "
                         "excellent query optimization and extensibility. "
                         "Good for applications requiring complex queries, "
                         "transactions, and data integrity.",
            
            'TimescaleDB': "TimescaleDB is PostgreSQL optimized for time-series "
                          "data. It provides automatic partitioning, time-based "
                          "functions, and excellent performance for time-series "
                          "workloads. Ideal for production time-series systems.",
            
            'InfluxDB': "InfluxDB is purpose-built for time-series data with "
                       "excellent write throughput and time-based queries. "
                       "Great for monitoring, IoT, and real-time analytics.",
            
            'MongoDB': "MongoDB offers flexible document storage with good "
                      "scalability. Suitable when your data schema evolves "
                      "frequently or you need horizontal scaling.",
            
            'Redis': "Redis provides in-memory storage for extremely fast "
                    "access. Perfect for caching, real-time analytics, or "
                    "when latency is critical. Data size limited by memory."
        }
        
        return explanations.get(best_option, "No explanation available.")


def demonstrate_decision_matrix():
    """
    Demonstrate the storage decision matrix for NEPSE system.
    """
    matrix = StorageDecisionMatrix()
    
    print("=" * 70)
    print("Storage Decision Matrix - NEPSE Time-Series Prediction System")
    print("=" * 70)
    
    # Get recommendations for different use cases relevant to NEPSE
    use_cases = ['general', 'high_read', 'analytics', 'archival']
    
    for use_case in use_cases:
        recommendation, explanation = matrix.get_recommendation(use_case)
        print(f"\n{'-' * 70}")
        print(f"Use Case: {use_case.upper()}")
        print(f"{'-' * 70}")
        print(f"Recommended Storage: {recommendation}")
        print(f"\nExplanation: {explanation}")
    
    # Show the scoring matrix
    print(f"\n{'=' * 70}")
    print("Complete Scoring Matrix")
    print("=" * 70)
    print(matrix.scores.transpose())
    
    # Calculate and show weighted scores for general use case
    print(f"\n{'=' * 70}")
    print("Weighted Scores (General Use Case)")
    print("=" * 70)
    weighted = matrix.calculate_weighted_scores()
    print(weighted[['total_score']].round(3))


if __name__ == "__main__":
    demonstrate_decision_matrix()
```

**Detailed Explanation:**

1. **StorageDecisionMatrix Class**: A comprehensive decision-making tool that evaluates storage options against multiple criteria. Each criterion has a weight reflecting its importance.

2. **Criteria and Weights**:
   - `query_performance` (0.20): How fast can you retrieve data? Critical for prediction systems that need to load historical data frequently.
   - `write_performance` (0.15): How fast can you write data? Important for systems with high-frequency data ingestion.
   - `storage_efficiency` (0.15): How much disk space does the data occupy? Important for long-term storage of historical data.
   - `scalability` (0.15): Can the system grow with your data? Important for systems that will accumulate data over time.
   - `ease_of_use` (0.10): How quickly can developers become productive?
   - `cost` (0.10): Licensing and operational costs.
   - `ecosystem` (0.10): Community support, documentation, and available tools.
   - `time_series_features` (0.05): Built-in support for time-series operations.

3. **Scoring System**: Each storage option is scored 1-10 on each criterion. The scores are illustrative judgments based on typical use cases, not measured benchmarks, so adjust them for your own workload. For example:
   - CSV scores low on storage_efficiency (2) because it's uncompressed text.
   - InfluxDB scores high on time_series_features (10) because it's purpose-built for time-series.
   - Redis scores high on query_performance (10) because it's in-memory.

4. **Use Case Configurations**: Different use cases have different priorities:
   - `high_write`: Weights write performance higher for ingestion-heavy systems.
   - `high_read`: Weights query performance for analytics-heavy systems.
   - `analytics`: Weights storage efficiency for large analytical datasets.
   - `archival`: Weights storage efficiency and cost for long-term storage.
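
To make the weighting concrete, the following sketch recomputes the total score for two options by hand, using plain dictionaries instead of pandas. The scores and the `general` weights are copied from the matrix above; the function name `weighted_total` is introduced here only for illustration.

```python
# 'general' weights from the decision matrix
GENERAL_WEIGHTS = {
    'query_performance': 0.20,
    'write_performance': 0.15,
    'storage_efficiency': 0.15,
    'scalability': 0.15,
    'ease_of_use': 0.10,
    'cost': 0.10,
    'ecosystem': 0.10,
    'time_series_features': 0.05,
}

# Scores for two options, copied from the scoring matrix
SCORES = {
    'TimescaleDB': {
        'query_performance': 9, 'write_performance': 9,
        'storage_efficiency': 7, 'scalability': 8, 'ease_of_use': 7,
        'cost': 8, 'ecosystem': 7, 'time_series_features': 10,
    },
    'InfluxDB': {
        'query_performance': 9, 'write_performance': 10,
        'storage_efficiency': 8, 'scalability': 8, 'ease_of_use': 7,
        'cost': 8, 'ecosystem': 8, 'time_series_features': 10,
    },
}


def weighted_total(scores: dict, weights: dict) -> float:
    """Weighted sum of scores, with weights normalized to sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * w / total_weight for c, w in weights.items())


totals = {name: round(weighted_total(s, GENERAL_WEIGHTS), 2)
          for name, s in SCORES.items()}
print(totals)  # {'TimescaleDB': 8.1, 'InfluxDB': 8.5}
```

Because the weights are normalized before use, passing `{'query_performance': 2, 'cost': 1}` would have the same effect as `{'query_performance': 0.667, 'cost': 0.333}`: only the relative sizes matter.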

---

## **8.2 File-Based Storage**

File-based storage is the simplest and most portable approach for storing time-series data. While not suitable for all use cases, it's often the best starting point and remains relevant for many applications.

### **8.2.1 CSV and Flat Files**

CSV (Comma-Separated Values) files are the most universal format for tabular data. They're human-readable and supported by virtually every data tool.
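
One consequence of being plain text is that CSV carries no type information: every value is read back as a string and must be converted again. The short sketch below shows this with Python's standard `csv` module on an in-memory buffer. The rows are made-up sample values, and the `Date` column is an assumption added for illustration.

```python
import csv
import io

# Made-up sample rows (illustrative values only)
rows = [
    {"Date": "2024-01-15", "Symbol": "NABIL", "Close": 512.5, "Vol": 10450},
    {"Date": "2024-01-16", "Symbol": "NABIL", "Close": 518.0, "Vol": 9870},
]

# Write the rows as CSV text into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Date", "Symbol", "Close", "Vol"])
writer.writeheader()
writer.writerows(rows)

# Read them back: every field is now a string, including the numbers
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["Close"]).__name__)  # str

# Types have to be restored explicitly after reading
typed = [
    {"Date": r["Date"], "Symbol": r["Symbol"],
     "Close": float(r["Close"]), "Vol": int(r["Vol"])}
    for r in loaded
]
```

pandas performs this conversion automatically when it reads a CSV file by inferring a type for each column, which is convenient but is also why date columns need to be requested explicitly.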

```python
"""
CSV Storage Module for Time-Series Data

This module provides comprehensive CSV file operations for time-series
data, specifically designed for the NEPSE stock prediction system.

CSV Format for NEPSE Data:
S.No,Symbol,Conf.,Open,High,Low,Close,LTP,Close - LTP,Close - LTP %,
VWAP,Vol,Prev. Close,Turnover,Trans.,Diff,Range,Diff %,Range %,
VWAP %,52 Weeks High,52 Weeks Low
"""

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Optional, Dict, Union
from pathlib import Path
import csv
import os


class CSVTimeSeriesStorage:
    """
    A class to handle CSV storage operations for time-series data.
    
    This class provides methods for reading, writing, and managing
    CSV files with a focus on time-series data from the NEPSE system.
    """
    
    def __init__(self, base_path: str = './data'):
        """
        Initialize the CSV storage handler.
        
        Args:
            base_path: The base directory for storing CSV files.
                      Will be created if it doesn't exist.
        """
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)
        
        # Define the NEPSE column structure
        # This maps the CSV column names to internal names and types
        self.nepse_columns = {
            'S.No': 'int64',
            'Symbol': 'str',
            'Conf.': 'str',           # Confirmation status
            'Open': 'float64',        # Opening price
            'High': 'float64',        # High price of the day
            'Low': 'float64',         # Low price of the day
            'Close': 'float64',       # Closing price
            'LTP': 'float64',         # Last Traded Price
            'Close - LTP': 'float64', # Difference between close and LTP
            'Close - LTP %': 'float64', # Percentage difference
            'VWAP': 'float64',        # Volume Weighted Average Price
            'Vol': 'int64',           # Volume (number of shares)
            'Prev. Close': 'float64', # Previous day's closing price
            'Turnover': 'float64',    # Total turnover in NPR
            'Trans.': 'int64',        # Number of transactions
            'Diff': 'float64',        # Price difference from previous close
            'Range': 'float64',       # High - Low
            'Diff %': 'float64',      # Percentage change from previous close
            'Range %': 'float64',     # Range as percentage
            'VWAP %': 'float64',      # VWAP percentage
            '52 Weeks High': 'float64',
            '52 Weeks Low': 'float64'
        }
    
    def write_data(self, 
                   data: pd.DataFrame, 
                   filename: str,
                   mode: str = 'w',
                   include_index: bool = False) -> None:
        """
        Write DataFrame to a CSV file.
        
        Args:
            data: The DataFrame to write
            filename: Name of the CSV file (without path)
            mode: Write mode - 'w' for write (overwrite), 
                  'a' for append
            include_index: Whether to include the DataFrame index
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
            >>> storage.write_data(df, 'test.csv')
        """
        filepath = self.base_path / filename
        
        # Determine if we need to write the header
        # For append mode, check if file exists and has content
        header = True
        if mode == 'a' and filepath.exists():
            # Check if file has content
            if filepath.stat().st_size > 0:
                header = False
        
        # Write to CSV
        # encoding='utf-8' ensures proper handling of Nepali characters
        # if present in stock symbols or company names
        data.to_csv(
            filepath,
            mode=mode,
            header=header,
            index=include_index,
            encoding='utf-8'
        )
        
        print(f"Data written to {filepath}")
    
    def read_data(self, 
                  filename: str,
                  parse_dates: bool = True,
                  date_column: str = None) -> pd.DataFrame:
        """
        Read CSV file into a DataFrame.
        
        Args:
            filename: Name of the CSV file
            parse_dates: Whether to parse date columns
            date_column: Name of the date column to parse
        
        Returns:
            DataFrame containing the CSV data
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> df = storage.read_data('nepse_data.csv')
        """
        filepath = self.base_path / filename
        
        if not filepath.exists():
            raise FileNotFoundError(f"File not found: {filepath}")
        
        # Read CSV with proper handling
        df = pd.read_csv(
            filepath,
            encoding='utf-8',
            parse_dates=[date_column] if date_column and parse_dates else None
        )
        
        return df
    
    def read_nepse_data(self, 
                        filename: str,
                        add_date_column: bool = True) -> pd.DataFrame:
        """
        Read NEPSE stock data with proper data type handling.
        
        This method specifically handles the NEPSE CSV format with
        all its columns and ensures proper data types.
        
        Args:
            filename: Name of the NEPSE CSV file
            add_date_column: Whether to add a date column based on filename
                            (assumes filename contains date like 'nepse_2024-01-15.csv')
        
        Returns:
            DataFrame with NEPSE stock data properly typed
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> df = storage.read_nepse_data('nepse_2024-01-15.csv')
        """
        filepath = self.base_path / filename
        
        # Read the CSV file
        # na_values specifies strings that should be treated as NaN
        # This is important for NEPSE data where missing values might be
        # represented as '-', 'N/A', or empty strings
        df = pd.read_csv(
            filepath,
            encoding='utf-8',
            na_values=['-', 'N/A', 'NA', '', ' '],
            thousands=','  # Handle comma-separated numbers like "1,234,567"
        )
        
        if 'Date' in df.columns:
            # The file already carries dates (e.g. a master file): parse them
            # instead of overwriting every row with a date from the filename
            df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
        elif add_date_column:
            # Otherwise, extract the date from the filename if a pattern matches
            # NEPSE data files are often named with dates like:
            # nepse_2024-01-15.csv or NEPSE_20240115.csv
            import re
            # Each regex is paired with the strptime format that parses it
            date_formats = [
                (r'(\d{4}-\d{2}-\d{2})', '%Y-%m-%d'),   # YYYY-MM-DD
                (r'(\d{8})', '%Y%m%d'),                 # YYYYMMDD
                (r'(\d{2}-\d{2}-\d{4})', '%d-%m-%Y'),   # DD-MM-YYYY
            ]
            
            for pattern, fmt in date_formats:
                match = re.search(pattern, filename)
                if match:
                    try:
                        df['Date'] = datetime.strptime(match.group(1), fmt)
                        break
                    except ValueError:
                        # Digits matched but are not a valid date; try the next pattern
                        continue
        
        # Apply proper data types to known columns
        for col, dtype in self.nepse_columns.items():
            if col in df.columns:
                if dtype == 'float64':
                    df[col] = pd.to_numeric(df[col], errors='coerce')
                elif dtype == 'int64':
                    df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')
                # 'str' type is already handled by default
        
        return df
    
    def append_daily_data(self, 
                          new_data: pd.DataFrame,
                          master_file: str = 'nepse_master.csv') -> None:
        """
        Append new daily data to a master CSV file.
        
        This method is designed for incremental updates to the
        NEPSE historical data file. It handles duplicates by keeping
        the most recent data.
        
        Args:
            new_data: New data to append
            master_file: Name of the master file to append to
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> new_data = pd.read_csv('today_nepse.csv')
            >>> storage.append_daily_data(new_data)
        """
        master_path = self.base_path / master_file
        
        if master_path.exists():
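            # Read existing data
            existing_data = self.read_nepse_data(master_file)
            
            # Normalize 'Date' to datetime in both frames: dates read back
            # from CSV may be strings and would then never compare equal to
            # Timestamp values in new_data during de-duplication
            frames = []
            for frame in (existing_data, new_data):
                if 'Date' in frame.columns:
                    frame = frame.assign(Date=pd.to_datetime(frame['Date']))
                frames.append(frame)
            
            # Combine with new data
            combined = pd.concat(frames, ignore_index=True)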
            
            # Remove duplicates, keeping the last occurrence
            # This ensures that if the same day's data is uploaded twice,
            # we keep the most recent version
            if 'Date' in combined.columns and 'Symbol' in combined.columns:
                combined = combined.drop_duplicates(
                    subset=['Date', 'Symbol'],
                    keep='last'
                )
        else:
            combined = new_data
        
        # Write combined data
        self.write_data(combined, master_file)
    
    def create_partitioned_files(self,
                                  data: pd.DataFrame,
                                  partition_by: str = 'Symbol',
                                  directory: str = 'partitioned') -> Dict[str, str]:
        """
        Create partitioned CSV files based on a column value.
        
        Partitioning is useful for:
        - Faster queries on specific symbols
        - Parallel processing of different stocks
        - Organized data storage
        
        Args:
            data: DataFrame to partition
            partition_by: Column to partition by (e.g., 'Symbol')
            directory: Subdirectory name for partitioned files
        
        Returns:
            Dictionary mapping partition values to file paths
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> df = storage.read_nepse_data('nepse_master.csv')
            >>> partitions = storage.create_partitioned_files(df)
            >>> print(partitions)
            {'NABIL': 'data/partitioned/NABIL.csv', 'NICA': 'data/partitioned/NICA.csv', ...}
        """
        partition_dir = self.base_path / directory
        partition_dir.mkdir(parents=True, exist_ok=True)
        
        partitions = {}
        
        # Group by the partition column
        for partition_value, group in data.groupby(partition_by):
            # Create safe filename (replace invalid characters)
            safe_name = str(partition_value).replace('/', '_').replace('\\', '_')
            filename = f"{safe_name}.csv"
            filepath = partition_dir / filename
            
            # Write partition file
            group.to_csv(filepath, index=False, encoding='utf-8')
            
            partitions[partition_value] = str(filepath)
        
        print(f"Created {len(partitions)} partitioned files in {partition_dir}")
        return partitions
    
    def get_file_info(self, filename: str) -> Dict:
        """
        Get information about a CSV file.
        
        Args:
            filename: Name of the CSV file
        
        Returns:
            Dictionary with file information
        """
        filepath = self.base_path / filename
        
        if not filepath.exists():
            return {'exists': False}
        
        # Get file stats
        stat = filepath.stat()
        
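        # Read the file once with the csv module: take the header and the
        # first data row, then count the remaining records. Counting parsed
        # records (rather than physical lines) stays correct for quoted
        # fields that contain newlines, and an empty file yields an empty
        # header instead of raising StopIteration.
        with open(filepath, 'r', encoding='utf-8', newline='') as f:
            reader = csv.reader(f)
            header = next(reader, [])
            first_row = next(reader, None)
            row_count = (0 if first_row is None else 1) + sum(1 for _ in reader)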
        
        return {
            'exists': True,
            'path': str(filepath),
            'size_bytes': stat.st_size,
            'size_mb': stat.st_size / (1024 * 1024),
            'modified': datetime.fromtimestamp(stat.st_mtime),
            'columns': header,
            'column_count': len(header),
            'row_count': row_count,
            'first_row': first_row
        }
    
    def merge_csv_files(self, 
                        file_pattern: str = '*.csv',
                        output_file: str = 'merged.csv') -> pd.DataFrame:
        """
        Merge multiple CSV files into a single file.
        
        This is useful for combining daily NEPSE data files into
        a single historical file.
        
        Args:
            file_pattern: Glob pattern to match files
            output_file: Name of the output merged file
        
        Returns:
            Merged DataFrame
        
        Example:
            >>> storage = CSVTimeSeriesStorage('./data')
            >>> merged = storage.merge_csv_files('nepse_*.csv', 'all_nepse.csv')
        """
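        # Sort for a deterministic order, and skip a previous merge output so
        # that re-running with a broad pattern does not merge it into itself
        files = sorted(
            f for f in self.base_path.glob(file_pattern)
            if f.name != output_file
        )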
        
        if not files:
            print(f"No files matching pattern: {file_pattern}")
            return pd.DataFrame()
        
        # Read all files
        dfs = []
        for file in files:
            try:
                df = self.read_nepse_data(file.name)
                dfs.append(df)
            except Exception as e:
                print(f"Error reading {file}: {e}")
                continue
        
        if not dfs:
            return pd.DataFrame()
        
        # Merge all DataFrames
        merged = pd.concat(dfs, ignore_index=True)
        
        # Sort by date and symbol if columns exist
        if 'Date' in merged.columns:
            merged = merged.sort_values(['Date', 'Symbol'] 
                                        if 'Symbol' in merged.columns 
                                        else ['Date'])
        
        # Write merged file
        self.write_data(merged, output_file)
        
        # Report the files actually read; unreadable files were skipped above
        print(f"Merged {len(dfs)} files into {output_file}")
        print(f"Total rows: {len(merged)}")
        
        return merged


def demonstrate_csv_operations():
    """
    Demonstrate CSV storage operations with NEPSE example data.
    """
    print("=" * 70)
    print("CSV Storage Operations Demonstration")
    print("=" * 70)
    
    # Initialize storage
    storage = CSVTimeSeriesStorage('./nepse_data')
    
    # Create sample NEPSE data for demonstration
    # This simulates the actual NEPSE data format
    sample_data = pd.DataFrame({
        'S.No': range(1, 6),
        'Symbol': ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL'],
        'Conf.': ['Confirmed'] * 5,
        'Open': [850.0, 780.0, 520.0, 340.0, 290.0],
        'High': [870.0, 795.0, 535.0, 350.0, 305.0],
        'Low': [845.0, 775.0, 515.0, 335.0, 285.0],
        'Close': [865.0, 790.0, 530.0, 345.0, 300.0],
        'LTP': [865.0, 790.0, 530.0, 345.0, 300.0],
        'Close - LTP': [0.0, 0.0, 0.0, 0.0, 0.0],
        'Close - LTP %': [0.0, 0.0, 0.0, 0.0, 0.0],
        'VWAP': [860.5, 785.2, 525.8, 342.3, 295.6],
        'Vol': [125000, 98000, 76000, 145000, 89000],
        'Prev. Close': [850.0, 775.0, 520.0, 335.0, 288.0],
        'Turnover': [107562500.0, 76949600.0, 39960800.0, 49633500.0, 26308400.0],
        'Trans.': [850, 620, 480, 920, 580],
        'Diff': [15.0, 15.0, 10.0, 10.0, 12.0],
        'Range': [25.0, 20.0, 20.0, 15.0, 20.0],
        'Diff %': [1.76, 1.94, 1.92, 2.99, 4.17],
        'Range %': [2.94, 2.58, 3.85, 4.48, 6.90],
        'VWAP %': [1.24, 1.32, 1.12, 2.18, 2.64],
        '52 Weeks High': [920.0, 850.0, 580.0, 380.0, 340.0],
        '52 Weeks Low': [650.0, 580.0, 380.0, 240.0, 195.0]
    })
    
    # Add date column for time-series context
    sample_data['Date'] = pd.Timestamp('2024-01-15')
    
    print("\n1. Writing NEPSE Data to CSV")
    print("-" * 40)
    storage.write_data(sample_data, 'nepse_2024-01-15.csv')
    
    print("\n2. Reading and Displaying File Info")
    print("-" * 40)
    info = storage.get_file_info('nepse_2024-01-15.csv')
    for key, value in info.items():
        print(f"  {key}: {value}")
    
    print("\n3. Reading NEPSE Data")
    print("-" * 40)
    df = storage.read_nepse_data('nepse_2024-01-15.csv')
    print(df.head())
    print(f"\nData types:\n{df.dtypes}")
    
    print("\n4. Creating Partitioned Files by Symbol")
    print("-" * 40)
    partitions = storage.create_partitioned_files(sample_data, partition_by='Symbol')
    print(f"Created partitions: {list(partitions.keys())[:5]}...")
    
    return storage, sample_data


if __name__ == "__main__":
    storage, data = demonstrate_csv_operations()
```

**Detailed Explanation:**

1. **CSVTimeSeriesStorage Class**: A comprehensive class for managing CSV files for time-series data. It provides methods for reading, writing, partitioning, and merging CSV files.

2. **Column Type Handling**: The `nepse_columns` dictionary defines the expected data types for each column in the NEPSE data format. This ensures:
   - Price columns (`Open`, `High`, `Low`, `Close`) are `float64`
   - Volume columns (`Vol`, `Trans.`) are integers, converted to pandas' nullable `Int64` so that missing values do not force a cast to float
   - Symbol column is string type

3. **Date Extraction from Filename**: The `read_nepse_data` method extracts dates from filenames using regular expressions. This handles different date formats:
   - `nepse_2024-01-15.csv` (YYYY-MM-DD)
   - `nepse_20240115.csv` (YYYYMMDD)
   - `nepse_15-01-2024.csv` (DD-MM-YYYY)

4. **Handling Missing Values**: The `na_values` parameter in `pd.read_csv` converts various representations of missing data ('-', 'N/A', empty strings) to NaN values.

5. **Thousands Separator**: The `thousands=','` parameter handles numbers formatted with commas (e.g., "1,234,567").

6. **Partitioning**: The `create_partitioned_files` method creates separate CSV files for each stock symbol. This is useful because:
   - Queries for a single stock only need to read one small file
   - Parallel processing is easier
   - A single symbol's file can be archived, replaced, or deleted without touching the others

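7. **Duplicate Handling**: The `append_daily_data` method combines the master file with the new rows and keeps the last occurrence of each (`Date`, `Symbol`) pair. Re-uploading the same day's data therefore replaces the earlier rows instead of duplicating them. Note that the whole master file is read and rewritten on every call, so the cost grows with the size of the history.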
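
Points 4, 5, and 7 can be checked on a few rows in isolation. The sketch below is illustrative and not part of `CSVTimeSeriesStorage`: the inline CSV text and its values are made up, and only four of the NEPSE columns are used.

```python
import io

import pandas as pd

# Hypothetical extract: '-' marks a missing price, volumes use comma separators,
# and NABIL appears twice for the same date (as if the day was uploaded twice)
raw = io.StringIO(
    'Date,Symbol,Close,Vol\n'
    '2024-01-15,NABIL,865.0,"125,000"\n'
    '2024-01-15,NICA,-,"98,000"\n'
    '2024-01-15,NABIL,866.0,"125,500"\n'
)

df = pd.read_csv(
    raw,
    na_values=['-', 'N/A', 'NA', '', ' '],  # treat these strings as missing
    thousands=',',                          # "125,000" is parsed as 125000
    parse_dates=['Date'],
)

# Nullable integer type: keeps integers even if a value is missing
df['Vol'] = pd.to_numeric(df['Vol'], errors='coerce').astype('Int64')

# Keep the last row for each (Date, Symbol) pair
deduped = df.drop_duplicates(subset=['Date', 'Symbol'], keep='last')

print(df.dtypes)
print(deduped)
```

After de-duplication only the later NABIL row remains, and the missing NICA price is a `NaN` rather than the string `'-'`.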

### **8.2.2 Parquet and Feather**

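Parquet and Feather are columnar storage formats: values from the same column are stored together, and the schema is saved with the data. Compared with CSV, this typically gives smaller files and faster reads, especially for analytical workloads that touch only a few columns.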

```python
"""
Parquet and Feather Storage Module for Time-Series Data

This module demonstrates efficient columnar storage formats for
time-series data, particularly suited for the NEPSE prediction system.

Key advantages over CSV:
- Better compression (smaller files)
- Faster reads (column pruning)
- Preserves data types (no type inference needed)
- Better for analytical queries
"""

import pandas as pd
import numpy as np
from pathlib import Path
from typing import List, Optional, Dict, Any
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.feather as feather
import os


class ColumnarStorage:
    """
    A class to handle Parquet and Feather file operations for time-series data.
    
    Both formats are columnar storage formats optimized for analytical workloads.
    - Parquet: Compressed, good for long-term storage and sharing
    - Feather: Uncompressed (or lightly compressed), good for temporary storage
               and inter-process communication
    """
    
    def __init__(self, base_path: str = './data'):
        """
        Initialize the columnar storage handler.
        
        Args:
            base_path: Base directory for storing files
        """
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)
    
    def write_parquet(self,
                      df: pd.DataFrame,
                      filename: str,
                      compression: str = 'snappy',
                      partition_cols: List[str] = None,
                      engine: str = 'pyarrow') -> None:
        """
        Write DataFrame to Parquet format.
        
        Parquet is a columnar storage format that provides:
        - Efficient compression (files are often several times smaller than
          the equivalent CSV; the exact ratio depends on the data)
        - Column pruning (only reads needed columns)
        - Predicate pushdown (filters data at file level)
        - Schema preservation (data types are stored)
        
        Args:
            df: DataFrame to write
            filename: Output filename (should end with .parquet)
            compression: Compression algorithm
                        'snappy' - Fast compression/decompression (default)
                        'gzip' - Better compression, slower
                        'brotli' - Best compression, slowest
                        'none' - No compression
            partition_cols: Columns to partition by (creates directory structure)
            engine: Parquet engine ('pyarrow' or 'fastparquet')
        
        Example:
            >>> storage = ColumnarStorage('./data')
            >>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
            >>> storage.write_parquet(df, 'data.parquet')
        """
        filepath = self.base_path / filename
        
        if partition_cols:
            # Write partitioned dataset
            # This creates a directory structure like:
            # data.parquet/Symbol=NABIL/<generated-name>.parquet
            # data.parquet/Symbol=NICA/<generated-name>.parquet
            # preserve_index=False matches index=False in the single-file branch
            pq.write_to_dataset(
                pa.Table.from_pandas(df, preserve_index=False),
                root_path=str(filepath),
                partition_cols=partition_cols,
                compression=compression
            )
            print(f"Partitioned Parquet dataset written to {filepath}")
        else:
            # Write single file
            df.to_parquet(
                filepath,
                engine=engine,
                compression=compression,
                index=False
            )
            print(f"Parquet file written to {filepath}")
    
    def read_parquet(self,
                     filename: str,
                     columns: List[str] = None,
                     filters: List[tuple] = None,
                     engine: str = 'pyarrow') -> pd.DataFrame:
        """
        Read DataFrame from Parquet format.
        
        Args:
            filename: Parquet file or directory name
            columns: List of columns to read (column pruning)
                    If None, reads all columns
            filters: Row filters to apply (predicate pushdown)
                    Format: [('column', 'operator', value), ...]
                    Example: [('Symbol', '==', 'NABIL'), ('Close', '>', 800)]
            engine: Parquet engine ('pyarrow' or 'fastparquet')
        
        Returns:
            DataFrame with the requested data
        
        Example:
            >>> storage = ColumnarStorage('./data')
            >>> df = storage.read_parquet('data.parquet', 
            ...                           columns=['Symbol', 'Close'],
            ...                           filters=[('Symbol', '==', 'NABIL')])
        """
        filepath = self.base_path / filename
        
        if filters:
            # pyarrow's read_table accepts the list-of-tuples filter format
            # directly and uses it for predicate pushdown.
            # Note: this branch always uses pyarrow, regardless of `engine`.
            table = pq.read_table(
                filepath,
                columns=columns,
                filters=filters
            )
            return table.to_pandas()
        else:
            return pd.read_parquet(
                filepath,
                engine=engine,
                columns=columns
            )
    
    def write_feather(self,
                      df: pd.DataFrame,
                      filename: str,
                      compression: bool = True) -> None:
        """
        Write DataFrame to Feather format.
        
        Feather (Arrow IPC) is designed for:
        - Fast read/write speeds
        - Minimal CPU overhead
        - Inter-process communication
        - Temporary data storage
        
        Feather is ideal when you need to quickly save data and read it back
        in the same or another Python process. It's faster than Parquet but
        typically produces larger files.
        
        Args:
            df: DataFrame to write
            filename: Output filename (should end with .feather or .arrow)
            compression: Whether to use compression
                        True - Uses LZ4 compression (fast, moderate compression)
                        False - No compression (fastest, largest files)
        
        Example:
            >>> storage = ColumnarStorage('./data')
            >>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
            >>> storage.write_feather(df, 'data.feather')
        """
        filepath = self.base_path / filename
        
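        # Feather format using pyarrow
        # 'lz4' is fast with moderate compression. Compression must be turned
        # off explicitly with 'uncompressed': passing None lets pyarrow pick
        # its default, which is LZ4 when available.
        df.to_feather(
            filepath,
            compression='lz4' if compression else 'uncompressed'
        )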
        print(f"Feather file written to {filepath}")
    
    def read_feather(self,
                     filename: str,
                     columns: List[str] = None) -> pd.DataFrame:
        """
        Read DataFrame from Feather format.
        
        Args:
            filename: Feather file name
            columns: List of columns to read (column pruning)
        
        Returns:
            DataFrame with the requested data
        
        Example:
            >>> storage = ColumnarStorage('./data')
            >>> df = storage.read_feather('data.feather')
        """
        filepath = self.base_path / filename
        
        # Feather supports column pruning for efficient reading
        return pd.read_feather(filepath, columns=columns)
    
    def compare_formats(self,
                        df: pd.DataFrame,
                        name: str = 'data') -> pd.DataFrame:
        """
        Compare CSV, Parquet, and Feather formats for the given DataFrame.
        
        This method saves the same data in all three formats and compares:
        - File size
        - Write time
        - Read time
        
        Args:
            df: DataFrame to use for comparison
            name: Base name for files
        
        Returns:
            DataFrame with comparison results
        """
        # perf_counter is a monotonic, high-resolution clock intended for
        # timing; a single run is still noisy, so treat results as indicative
        from time import perf_counter
        
        # (label, file extension, writer, reader) for each format under test
        formats = [
            ('CSV', 'csv',
             lambda path: df.to_csv(path, index=False),
             pd.read_csv),
            ('Parquet (snappy)', 'parquet',
             lambda path: df.to_parquet(path, compression='snappy', index=False),
             pd.read_parquet),
            ('Feather (lz4)', 'feather',
             lambda path: df.to_feather(path, compression='lz4'),
             pd.read_feather),
        ]
        
        results = []
        for label, ext, write, read in formats:
            path = self.base_path / f"{name}.{ext}"
            
            start = perf_counter()
            write(path)
            write_time = perf_counter() - start
            
            start = perf_counter()
            _ = read(path)
            read_time = perf_counter() - start
            
            results.append({
                'format': label,
                'size_kb': os.path.getsize(path) / 1024,
                'write_time_ms': write_time * 1000,
                'read_time_ms': read_time * 1000
            })
        
        # Compression ratio is relative to the CSV file (the first entry)
        csv_size = results[0]['size_kb']
        for row in results:
            row['compression_ratio'] = csv_size / row['size_kb']
        
        return pd.DataFrame(results)


class NEPSEParquetManager:
    """
    Specialized class for managing NEPSE stock data in Parquet format.
    
    This class provides methods optimized for the NEPSE data structure,
    including efficient querying by symbol, date range, and other criteria.
    """
    
    def __init__(self, base_path: str = './nepse_parquet'):
        """
        Initialize the NEPSE Parquet manager.
        
        Args:
            base_path: Base directory for Parquet files
        """
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)
        self.storage = ColumnarStorage(base_path)
        
        # Define the schema for NEPSE data
        # This ensures consistent data types across all files
        self.schema = pa.schema([
            ('S.No', pa.int64()),
            ('Symbol', pa.string()),
            ('Conf.', pa.string()),
            ('Open', pa.float64()),
            ('High', pa.float64()),
            ('Low', pa.float64()),
            ('Close', pa.float64()),
            ('LTP', pa.float64()),
            ('Close_LTP', pa.float64()),
            ('Close_LTP_Pct', pa.float64()),
            ('VWAP', pa.float64()),
            ('Vol', pa.int64()),
            ('Prev_Close', pa.float64()),
            ('Turnover', pa.float64()),
            ('Trans', pa.int64()),
            ('Diff', pa.float64()),
            ('Range', pa.float64()),
            ('Diff_Pct', pa.float64()),
            ('Range_Pct', pa.float64()),
            ('VWAP_Pct', pa.float64()),
            ('High_52W', pa.float64()),
            ('Low_52W', pa.float64()),
            ('Date', pa.date64())
        ])
    
    def save_daily_data(self,
                        df: pd.DataFrame,
                        date: str,
                        mode: str = 'append') -> None:
        """
        Save daily NEPSE data to partitioned Parquet files.
        
        This method partitions data by:
        - Year (for efficient date range queries)
        - Symbol (for efficient symbol-specific queries)
        
        Args:
            df: DataFrame with daily stock data
            date: Date string (YYYY-MM-DD format)
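            mode: 'append' to add to existing data; 'overwrite' to delete the
                  whole year partition before writing (this also removes every
                  other day already stored for that year)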
        """
        # Parse date and extract year
        date_obj = pd.to_datetime(date)
        year = date_obj.year
        
        # Add date column if not present
        if 'Date' not in df.columns:
            df = df.copy()
            df['Date'] = date_obj
        
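        # Rename columns to the names used in self.schema. An explicit mapping
        # avoids chained str.replace calls, which turn 'Close - LTP' into
        # 'Close___LTP' and, on pandas versions where str.replace defaults to
        # regex=True, treat '.' as a wildcard that matches every character.
        # rename() returns a new DataFrame, so the caller's data is untouched.
        df = df.rename(columns={
            'Close - LTP': 'Close_LTP',
            'Close - LTP %': 'Close_LTP_Pct',
            'Prev. Close': 'Prev_Close',
            'Trans.': 'Trans',
            'Diff %': 'Diff_Pct',
            'Range %': 'Range_Pct',
            'VWAP %': 'VWAP_Pct',
            '52 Weeks High': 'High_52W',
            '52 Weeks Low': 'Low_52W'
        })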
        
        # Define partition path
        partition_path = self.base_path / f"year={year}"
        
        if mode == 'overwrite' and partition_path.exists():
            import shutil
            shutil.rmtree(partition_path)
        
        partition_path.mkdir(parents=True, exist_ok=True)
        
        # Write to Parquet with symbol partitioning
        # This creates files like:
        # year=2024/data.parquet/Symbol=NABIL/<generated-name>.parquet
        self.storage.write_parquet(
            df,
            f"year={year}/data.parquet",
            compression='snappy',
            partition_cols=['Symbol']
        )
    
    def query_by_symbol(self,
                        symbol: str,
                        columns: List[str] = None,
                        start_date: str = None,
                        end_date: str = None) -> pd.DataFrame:
        """
        Query historical data for a specific symbol.
        
        This method efficiently retrieves data for a single stock by:
        1. Using column pruning to read only needed columns
        2. Using predicate pushdown to filter at the file level
        
        Args:
            symbol: Stock symbol (e.g., 'NABIL')
            columns: Columns to retrieve (None for all)
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
        
        Returns:
            DataFrame with historical data for the symbol
        """
        # Build filters
        filters = [('Symbol', '==', symbol)]
        
        if start_date:
            start = pd.to_datetime(start_date)
            filters.append(('Date', '>=', start))
        
        if end_date:
            end = pd.to_datetime(end_date)
            filters.append(('Date', '<=', end))
        
        # Read all parquet files
        dfs = []
        for year_dir in sorted(self.base_path.glob('year=*')):
            try:
                df = self.storage.read_parquet(
                    year_dir.name + '/data.parquet',
                    columns=columns,
                    filters=filters
                )
                dfs.append(df)
            except Exception as e:
                print(f"Warning: Could not read {year_dir}: {e}")
                continue
        
        if not dfs:
            return pd.DataFrame()
        
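        result = pd.concat(dfs, ignore_index=True)
        # 'Date' is absent when the caller's column selection excludes it
        return result.sort_values('Date') if 'Date' in result.columns else result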
    
    def query_by_date_range(self,
                            start_date: str,
                            end_date: str,
                            columns: List[str] = None,
                            symbols: List[str] = None) -> pd.DataFrame:
        """
        Query data for a date range across all or specific symbols.
        
        Args:
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
            columns: Columns to retrieve (None for all)
            symbols: List of symbols to filter (None for all)
        
        Returns:
            DataFrame with data for the specified date range
        """
        start = pd.to_datetime(start_date)
        end = pd.to_datetime(end_date)
        
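        # Determine which year directories to read
        years = range(start.year, end.year + 1)
        
        # 'Date' and 'Symbol' are needed for the filtering and sorting below,
        # so read them even if the caller did not request them
        read_columns = columns
        if columns is not None:
            read_columns = list(dict.fromkeys(list(columns) + ['Date', 'Symbol']))
        
        dfs = []
        for year in years:
            year_dir = self.base_path / f"year={year}"
            if year_dir.exists():
                try:
                    df = self.storage.read_parquet(
                        f"year={year}/data.parquet",
                        columns=read_columns
                    )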
                    
                    # Filter by date range
                    df['Date'] = pd.to_datetime(df['Date'])
                    df = df[(df['Date'] >= start) & (df['Date'] <= end)]
                    
                    # Filter by symbols if specified
                    if symbols:
                        df = df[df['Symbol'].isin(symbols)]
                    
                    dfs.append(df)
                except Exception as e:
                    print(f"Warning: Could not read {year_dir}: {e}")
                    continue
        
        if not dfs:
            return pd.DataFrame()
        
        result = pd.concat(dfs, ignore_index=True)
        return result.sort_values(['Date', 'Symbol'])
    
    def get_all_symbols(self) -> List[str]:
        """
        Get list of all unique symbols in the dataset.
        
        Returns:
            Sorted list of stock symbols
        """
        symbols = set()
        
        # Each match is a Symbol=<name> directory inside a year partition
        for symbol_dir in self.base_path.glob('year=*/Symbol=*'):
            # Extract the symbol from the directory name
            symbols.add(symbol_dir.name.split('=', 1)[1])
        
        return sorted(symbols)
    
    def get_statistics(self) -> Dict[str, Any]:
        """
        Get statistics about the stored data.
        
        Returns:
            Dictionary with statistics
        """
        stats = {
            'years': [],
            'total_files': 0,
            'total_size_mb': 0,
            'symbols': self.get_all_symbols()
        }
        
        for year_dir in sorted(self.base_path.glob('year=*')):
            year = year_dir.name.split('=')[1]
            stats['years'].append(year)
            
            # Count files and calculate size
            for parquet_file in year_dir.rglob('*.parquet'):
                stats['total_files'] += 1
                stats['total_size_mb'] += parquet_file.stat().st_size / (1024 * 1024)
        
        stats['total_size_mb'] = round(stats['total_size_mb'], 2)
        
        return stats


def demonstrate_parquet_operations():
    """
    Demonstrate Parquet and Feather operations with NEPSE data.
    """
    print("=" * 70)
    print("Parquet and Feather Storage Demonstration")
    print("=" * 70)
    
    # Initialize storage
    storage = ColumnarStorage('./nepse_columnar')
    
    # Create sample NEPSE data (larger dataset for meaningful comparison)
    np.random.seed(42)
    
    # Generate 1 year of daily data for 5 stocks
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='B')  # Business days
    symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL']
    
    data = []
    for date in dates:
        for i, symbol in enumerate(symbols):
            base_price = 500 + i * 100 + np.random.randint(-50, 50)
            high = base_price * (1 + np.random.uniform(0, 0.05))
            low = base_price * (1 - np.random.uniform(0, 0.05))
            open_price = np.random.uniform(low, high)
            close = np.random.uniform(low, high)
            # Draw the volume once so that Turnover is consistent with Volume
            volume = np.random.randint(10000, 500000)
            
            data.append({
                'Symbol': symbol,
                'Date': date,
                'Open': round(open_price, 2),
                'High': round(high, 2),
                'Low': round(low, 2),
                'Close': round(close, 2),
                'Volume': volume,
                'Turnover': round(close * volume, 2)
            })
    
    df = pd.DataFrame(data)
    print(f"\nGenerated {len(df)} records for {len(symbols)} stocks over {len(dates)} trading days")
    
    # Compare formats
    print("\n1. Comparing Storage Formats")
    print("-" * 40)
    comparison = storage.compare_formats(df, 'nepse_sample')
    print(comparison.to_string(index=False))
    
    print("\n2. Writing Partitioned Parquet")
    print("-" * 40)
    storage.write_parquet(df, 'nepse_partitioned.parquet', 
                          partition_cols=['Symbol'])
    
    print("\n3. Reading Specific Columns (Column Pruning)")
    print("-" * 40)
    # Only read needed columns - much faster for wide tables
    selected_cols = ['Symbol', 'Date', 'Close', 'Volume']
    df_selected = storage.read_parquet('nepse_sample.parquet', 
                                        columns=selected_cols)
    print(f"Read only {len(selected_cols)} columns:")
    print(df_selected.head())
    
    print("\n4. Reading with Filters (Predicate Pushdown)")
    print("-" * 40)
    # The filter is passed down to the Parquet reader, which can skip
    # row groups whose statistics rule out a match
    filtered = storage.read_parquet(
        'nepse_sample.parquet',
        columns=['Symbol', 'Date', 'Close'],
        filters=[('Symbol', '==', 'NABIL')]
    )
    print(f"Filtered data for NABIL ({len(filtered)} rows):")
    print(filtered.head())
    
    # Demonstrate NEPSE manager
    print("\n5. NEPSE Parquet Manager")
    print("-" * 40)
    manager = NEPSEParquetManager('./nepse_parquet_data')
    
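    # Save data partitioned by year and symbol
    # unique() may return raw numpy datetime64 values, which have no
    # strftime(); pd.to_datetime converts them to pandas Timestamps
    for date in pd.to_datetime(df['Date'].unique()[:5]):  # Save first 5 days as example
        daily_df = df[df['Date'] == date]
        manager.save_daily_data(daily_df, date.strftime('%Y-%m-%d'))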
    
    stats = manager.get_statistics()
    print(f"Statistics: {stats}")
    
    return storage, df


if __name__ == "__main__":
    storage, df = demonstrate_parquet_operations()
```

**Detailed Explanation:**

1. **Parquet Format**: Apache Parquet is an open-source columnar storage format maintained by the Apache Software Foundation. Key features:
   - **Column pruning**: When you need only certain columns, the reader fetches just those column chunks from disk and skips the rest.
   - **Predicate pushdown**: Filters are handed to the reader, which uses the min/max statistics stored for each row group (and the partition directory names) to skip data that cannot match. This reduces I/O most when the data is sorted or partitioned on the filtered column.
   - **Compression**: Parquet supports several codecs (snappy, gzip, brotli). Each column is compressed separately, which usually gives better compression ratios because values within a column tend to be similar.
   - **Schema preservation**: Data types are stored in the file, so no type inference is needed on read.

2. **Feather Format**: Feather (version 2) is the Apache Arrow IPC file format written to disk:
   - Designed for fast read/write speeds
   - Can be memory-mapped for efficient reading when written uncompressed
   - Minimal CPU overhead
   - Ideal for temporary storage and for exchanging data between processes
   - Usually larger on disk than Parquet, but faster to read and write

3. **Partitioning**: The code demonstrates two levels of partitioning:
   - **Year partitioning**: Data is organized by year (e.g., `year=2024/`), making date range queries efficient.
   - **Symbol partitioning**: Within each year, data is partitioned by stock symbol, making symbol-specific queries very fast.

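4. **Comparison Results**: Exact numbers depend on the data and the codec, but you will typically see:
   - CSV is the largest (plain text, no compression)
   - Parquet is the smallest (3-5x smaller than CSV is common, though small or high-entropy datasets compress less)
   - Feather falls in between (light or no compression) and is usually the fastest to read and write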

5. **NEPSEParquetManager**: A specialized class for managing NEPSE data:
   - `save_daily_data`: Saves daily data with automatic partitioning
   - `query_by_symbol`: Efficiently retrieves historical data for one stock
   - `query_by_date_range`: Retrieves data across multiple symbols for a time period
   - `get_statistics`: Provides metadata about stored data
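
The partition pruning described in point 3 can be sketched with the standard library alone. The example assumes a Hive-style `year=YYYY/Symbol=XXX/` directory layout like the one discussed above. The helper name `select_partitions` is hypothetical and is not part of `NEPSEParquetManager`. It shows how a reader can decide which directories to open from their names alone, before touching any data file.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def select_partitions(base: Path,
                      start_year: int,
                      end_year: int,
                      symbols: Optional[List[str]] = None) -> List[Path]:
    """Return the partition directories that can contain matching rows."""
    selected = []
    for year_dir in sorted(base.glob("year=*")):
        year = int(year_dir.name.split("=", 1)[1])
        if not start_year <= year <= end_year:
            continue  # the whole year is skipped without opening a file
        for symbol_dir in sorted(year_dir.glob("Symbol=*")):
            symbol = symbol_dir.name.split("=", 1)[1]
            if symbols is None or symbol in symbols:
                selected.append(symbol_dir)
    return selected


# Build an empty directory tree that stands in for real partitions
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for year in (2021, 2022, 2023):
        for symbol in ("NABIL", "NICA", "SCBL"):
            (base / f"year={year}" / f"Symbol={symbol}").mkdir(parents=True)

    total = len(select_partitions(base, 2021, 2023))
    hits = select_partitions(base, 2022, 2023, symbols=["NABIL"])
    hit_names = [p.relative_to(base).as_posix() for p in hits]

print(f"{len(hit_names)} of {total} partitions selected: {hit_names}")
```

Only the directories that survive both checks need to be read, which is why queries that filter on the partition columns stay fast as the dataset grows.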

---

## **8.2.3 HDF5 and NetCDF**

HDF5 (Hierarchical Data Format version 5) and NetCDF (Network Common Data Form) are binary file formats designed for scientific and numerical data. They excel at storing large multidimensional arrays with metadata.
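
Before the full storage class, the core ideas (groups, datasets, attributes, and partial reads) fit in a few lines. This is a minimal sketch using `h5py`; the group path `/demo/prices/NABIL`, the `currency` attribute, and the sample prices are made up for illustration.

```python
import tempfile
from pathlib import Path

import h5py
import numpy as np

# Illustrative closing prices (not real NEPSE data)
closes = np.array([500.0, 505.5, 498.25, 510.0, 512.75])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.h5"

    # Write: one group per symbol, one dataset per field, metadata as attributes
    with h5py.File(path, "w") as f:
        group = f.require_group("/demo/prices/NABIL")
        group.create_dataset("close", data=closes, chunks=True, compression="gzip")
        group.attrs["currency"] = "NPR"

    # Read: slicing the dataset object loads only the chunks that
    # overlap the requested range, not the whole array
    with h5py.File(path, "r") as f:
        dataset = f["/demo/prices/NABIL/close"]
        stored_shape = dataset.shape
        window = dataset[1:4]
        currency = f["/demo/prices/NABIL"].attrs["currency"]

print(stored_shape, window, currency)
```

The dataset behaves like a NumPy array that lives on disk: `dataset.shape` is available without loading any values, and a slice such as `dataset[1:4]` returns an in-memory NumPy array.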

```python
"""
HDF5 and NetCDF Storage Module for Time-Series Data

HDF5 is particularly well-suited for time-series data because:
- Efficient storage of numerical arrays
- Supports hierarchical organization (like a file system)
- Allows metadata storage alongside data
- Supports chunking and compression
- Fast random access to subsets of data

For NEPSE data, HDF5 provides an excellent balance between:
- File size (good compression)
- Read performance (fast partial reads)
- Query capability (can read specific time ranges)
"""

import pandas as pd
import numpy as np
import h5py
from pathlib import Path
from typing import List, Optional, Dict, Any, Union, Tuple
from datetime import datetime, timedelta
import warnings


class HDF5TimeSeriesStorage:
    """
    A class to handle HDF5 storage operations for time-series data.
    
    HDF5 organizes data in a hierarchical structure similar to a file system:
    - Groups (like directories) - can contain other groups or datasets
    - Datasets (like files) - contain the actual data arrays
    - Attributes (like metadata) - can be attached to groups or datasets
    
    For NEPSE data, the methods in this class create the following layout:
    /nepse/
        /prices/             - Per-stock price and volume data
            /NABIL/          - One group per stock symbol
            /NICA/
            ...
        /aligned_prices/     - Optional 2D close-price matrix (time x stocks)
    """
    
    def __init__(self, filepath: str = './nepse_data.h5'):
        """
        Initialize the HDF5 storage handler.
        
        Args:
            filepath: Path to the HDF5 file (.h5 or .hdf5 extension)
        """
        self.filepath = Path(filepath)
        self.filepath.parent.mkdir(parents=True, exist_ok=True)
        
        # Compression settings for HDF5
        # 'gzip' provides good compression and is widely supported
        # 'lzf' is faster but lower compression
        # 'szip' is fast but requires special installation
        self.compression = 'gzip'
        self.compression_level = 4  # gzip level 0-9; higher = more compression but slower
    
    def _get_file(self, mode: str = 'a') -> h5py.File:
        """
        Get an HDF5 file handle.
        
        Args:
            mode: File mode
                  'r' - read only
                  'a' - read/write, create if doesn't exist (default)
                  'w' - write, overwrite if exists
                  'r+' - read/write, file must exist
        
        Returns:
            HDF5 file object
        """
        return h5py.File(self.filepath, mode)
    
    def store_stock_data(self,
                         df: pd.DataFrame,
                         group_path: str = '/nepse/prices',
                         overwrite: bool = False) -> None:
        """
        Store stock price data in HDF5 format.
        
        This method organizes data by stock symbol, storing each stock's
        time-series as a separate dataset within the group.
        
        Args:
            df: DataFrame with columns including 'Symbol', 'Date', and price data
            group_path: HDF5 group path for storing data
            overwrite: Whether to overwrite existing data
        
        The HDF5 structure created:
        /nepse/
            /prices/
                /NABIL/
                    /open      - 1D array of open prices
                    /high      - 1D array of high prices
                    /low       - 1D array of low prices
                    /close     - 1D array of close prices
                    /volume    - 1D array of volumes
                    /dates     - 1D array of dates (as integers)
                    (attributes: symbol, start_date, end_date, count)
                /NICA/
                    ...
        """
        with self._get_file('a') as f:
            # Create the group structure if it doesn't exist
            # require_group creates the group only if it doesn't exist
            base_group = f.require_group(group_path)
            
            # Get unique symbols
            symbols = df['Symbol'].unique()
            
            for symbol in symbols:
                # Filter data for this symbol
                stock_df = df[df['Symbol'] == symbol].copy()
                stock_df = stock_df.sort_values('Date')
                
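                # Replace an existing group only when overwrite=True;
                # otherwise fail early with a clear message
                if symbol in base_group:
                    if not overwrite:
                        raise ValueError(
                            f"Data for {symbol} already exists in {group_path}; "
                            "pass overwrite=True or use append_data()"
                        )
                    del base_group[symbol]
                
                stock_group = base_group.require_group(symbol)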
                
                # Convert dates to integers for efficient storage
                # We use Unix timestamp (seconds since 1970-01-01)
                # This allows efficient date range queries
                dates = pd.to_datetime(stock_df['Date'])
                date_ints = dates.astype(np.int64) // 10**9  # Convert to seconds
                
                # Store numerical columns as datasets
                # Each dataset is a 1D array
                columns_to_store = ['Open', 'High', 'Low', 'Close', 'Volume']
                
                for col in columns_to_store:
                    if col in stock_df.columns:
                        # Create dataset with compression
                        # chunks=True enables chunked storage for efficient partial reads
                        # maxshape=(None,) keeps the dataset resizable so that
                        # append_data() can extend it later
                        stock_group.create_dataset(
                            col.lower(),
                            data=stock_df[col].values,
                            compression=self.compression,
                            compression_opts=self.compression_level,
                            chunks=True,  # Enable chunking for partial reads
                            maxshape=(None,)
                        )
                
                # Store dates as integers (resizable, like the columns above)
                stock_group.create_dataset(
                    'dates',
                    data=date_ints.values,
                    compression=self.compression,
                    compression_opts=self.compression_level,
                    chunks=True,
                    maxshape=(None,)
                )
                
                # Store metadata as attributes
                # Attributes are small pieces of metadata attached to groups/datasets
                stock_group.attrs['symbol'] = symbol
                stock_group.attrs['start_date'] = str(dates.min().date())
                stock_group.attrs['end_date'] = str(dates.max().date())
                stock_group.attrs['count'] = len(stock_df)
                stock_group.attrs['columns'] = ', '.join(columns_to_store)
            
            # Store global metadata
            base_group.attrs['last_updated'] = datetime.now().isoformat()
            # Count every symbol group in the file, not just those in this call
            base_group.attrs['total_symbols'] = len(base_group)
    
    def store_aligned_data(self,
                           df: pd.DataFrame,
                           dataset_name: str = 'aligned_prices') -> None:
        """
        Store all stock data in a single aligned 2D array.
        
        This approach is more efficient for:
        - Cross-sectional analysis (comparing stocks at the same time)
        - Matrix operations
        - Machine learning models that need consistent array shapes
        
        Structure:
        /nepse/
            /aligned_prices/
                /close_matrix - 2D array of closing prices: (time x stocks)
                /dates        - 1D array of dates (Unix seconds)
                /symbols      - 1D array of symbol names (column order)
                (attributes: shape, num_dates, num_symbols, start_date, end_date)
        
        Args:
            df: DataFrame with Date, Symbol, and price columns
            dataset_name: Name for the dataset group
        """
        with self._get_file('a') as f:
            # Pivot the data to create a 2D matrix
            # Rows = dates, Columns = symbols
            pivot_df = df.pivot_table(
                index='Date',
                columns='Symbol',
                values='Close',
                aggfunc='first'  # Take first value if duplicates
            )
            
            # Sort by date
            pivot_df = pivot_df.sort_index()
            
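            # Create group
            group = f.require_group(f'/nepse/{dataset_name}')
            
            # Remove datasets left by an earlier call; create_dataset raises
            # an error if the name already exists
            for name in ('close_matrix', 'dates', 'symbols'):
                if name in group:
                    del group[name]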
            
            # Store the 2D price matrix
            group.create_dataset(
                'close_matrix',
                data=pivot_df.values,
                compression=self.compression,
                compression_opts=self.compression_level,
                chunks=True
            )
            
            # Store dates
            dates = pd.to_datetime(pivot_df.index)
            date_ints = dates.astype(np.int64) // 10**9
            
            group.create_dataset(
                'dates',
                data=date_ints.values,
                compression=self.compression,
                compression_opts=self.compression_level
            )
            
            # Store symbols
            symbols = pivot_df.columns.tolist()
            # HDF5 can store variable-length strings
            dt = h5py.special_dtype(vlen=str)
            group.create_dataset('symbols', data=symbols, dtype=dt)
            
            # Store metadata
            group.attrs['shape'] = str(pivot_df.shape)
            group.attrs['num_dates'] = len(pivot_df)
            group.attrs['num_symbols'] = len(symbols)
            group.attrs['start_date'] = str(dates.min().date())
            group.attrs['end_date'] = str(dates.max().date())
    
    def read_stock_data(self,
                        symbol: str,
                        start_date: Optional[str] = None,
                        end_date: Optional[str] = None,
                        columns: Optional[List[str]] = None,
                        group_path: str = '/nepse/prices') -> pd.DataFrame:
        """
        Read stock data from HDF5 with optional date filtering.
        
        This method demonstrates the efficiency of HDF5:
        - Only reads the requested columns (like Parquet column pruning)
        - Can filter by date range without reading all data (chunking)
        - Fast random access to any part of the dataset
        
        Args:
            symbol: Stock symbol to read
            start_date: Start date filter (YYYY-MM-DD)
            end_date: End date filter (YYYY-MM-DD)
            columns: List of columns to read (None for all)
            group_path: HDF5 group path where data is stored
        
        Returns:
            DataFrame with the requested data
        """
        with self._get_file('r') as f:
            # Navigate to the stock's group
            if group_path not in f:
                raise ValueError(f"Group {group_path} not found in HDF5 file")
            
            base_group = f[group_path]
            
            if symbol not in base_group:
                raise ValueError(f"Symbol {symbol} not found in {group_path}")
            
            stock_group = base_group[symbol]
            
            # Get dates
            date_ints = stock_group['dates'][:]
            dates = pd.to_datetime(date_ints * 10**9)  # Convert back to nanoseconds
            
            # Apply date filter
            mask = np.ones(len(dates), dtype=bool)
            
            if start_date:
                start = pd.to_datetime(start_date)
                mask &= (dates >= start)
            
            if end_date:
                end = pd.to_datetime(end_date)
                mask &= (dates <= end)
            
            # Determine which columns to read
            all_columns = ['open', 'high', 'low', 'close', 'volume']
            if columns:
                # Convert to lowercase for matching
                columns = [c.lower() for c in columns]
                read_columns = [c for c in all_columns if c in columns]
            else:
                read_columns = all_columns
            
            # Build DataFrame
            data = {'Date': dates[mask]}
            
            for col in read_columns:
                if col in stock_group:
                    # Only read the data that matches our mask
                    data[col.capitalize()] = stock_group[col][mask]
            
            df = pd.DataFrame(data)
            
            # Add symbol column
            df['Symbol'] = symbol
            
            return df
    
    def read_aligned_data(self,
                          dataset_name: str = 'aligned_prices',
                          symbols: Optional[List[str]] = None,
                          start_date: Optional[str] = None,
                          end_date: Optional[str] = None) -> pd.DataFrame:
        """
        Read aligned stock data from HDF5.
        
        Args:
            dataset_name: Name of the dataset group
            symbols: List of symbols to read (None for all)
            start_date: Start date filter
            end_date: End date filter
        
        Returns:
            DataFrame with aligned price data
        """
        with self._get_file('r') as f:
            group_path = f'/nepse/{dataset_name}'
            
            if group_path not in f:
                raise ValueError(f"Dataset {dataset_name} not found")
            
            group = f[group_path]
            
            # Read dates and filter
            date_ints = group['dates'][:]
            dates = pd.to_datetime(date_ints * 10**9)
            
            mask = np.ones(len(dates), dtype=bool)
            
            if start_date:
                start = pd.to_datetime(start_date)
                mask &= (dates >= start)
            
            if end_date:
                end = pd.to_datetime(end_date)
                mask &= (dates <= end)
            
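            # Read price matrix
            # Dates are stored in ascending order, so the selected rows form one
            # contiguous block that can be read with a plain slice
            row_idx = np.flatnonzero(mask)
            if row_idx.size:
                first, last = int(row_idx[0]), int(row_idx[-1]) + 1
            else:
                first, last = 0, 0
            price_matrix = group['close_matrix'][first:last]
            
            # Read symbols (h5py 3.x returns variable-length strings as bytes)
            all_symbols = [
                s.decode('utf-8') if isinstance(s, bytes) else s
                for s in group['symbols'][:]
            ]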
            
            # Filter symbols
            if symbols:
                symbol_mask = [s in symbols for s in all_symbols]
                price_matrix = price_matrix[:, symbol_mask]
                selected_symbols = [s for s in all_symbols if s in symbols]
            else:
                selected_symbols = all_symbols
            
            # Create DataFrame
            df = pd.DataFrame(
                price_matrix,
                index=dates[mask],
                columns=selected_symbols
            )
            df.index.name = 'Date'
            
            return df
    
    def get_metadata(self, symbol: str, group_path: str = '/nepse/prices') -> Dict[str, Any]:
        """
        Get metadata for a specific stock.
        
        Args:
            symbol: Stock symbol
            group_path: HDF5 group path
        
        Returns:
            Dictionary with metadata
        """
        with self._get_file('r') as f:
            if group_path not in f:
                raise ValueError(f"Group {group_path} not found")
            
            stock_group = f[f'{group_path}/{symbol}']
            
            # Read all attributes
            metadata = dict(stock_group.attrs)
            
            return metadata
    
    def list_symbols(self, group_path: str = '/nepse/prices') -> List[str]:
        """
        List all symbols stored in the HDF5 file.
        
        Args:
            group_path: HDF5 group path
        
        Returns:
            List of symbol names
        """
        with self._get_file('r') as f:
            if group_path not in f:
                return []
            
            base_group = f[group_path]
            return list(base_group.keys())
    
    def get_file_info(self) -> Dict[str, Any]:
        """
        Get information about the HDF5 file.
        
        Returns:
            Dictionary with file information
        """
        info = {
            'filepath': str(self.filepath),
            'exists': self.filepath.exists(),
            'size_mb': 0,
            'groups': [],
            'total_stocks': 0,
            'datasets': []
        }
        
        if not self.filepath.exists():
            return info
        
        info['size_mb'] = self.filepath.stat().st_size / (1024 * 1024)
        
        with self._get_file('r') as f:
            # Recursively list all groups and datasets
            def explore_group(group, path=''):
                for key in group.keys():
                    item = group[key]
                    full_path = f'{path}/{key}'
                    
                    if isinstance(item, h5py.Group):
                        info['groups'].append(full_path)
                        explore_group(item, full_path)
                    elif isinstance(item, h5py.Dataset):
                        info['datasets'].append({
                            'path': full_path,
                            'shape': item.shape,
                            'dtype': str(item.dtype),
                            'size_bytes': item.nbytes
                        })
            
            explore_group(f)
            
            # Count stocks
            if '/nepse/prices' in f:
                info['total_stocks'] = len(f['/nepse/prices'].keys())
        
        return info
    
    def append_data(self,
                    df: pd.DataFrame,
                    symbol: str,
                    group_path: str = '/nepse/prices') -> None:
        """
        Append new data to an existing stock's dataset.
        
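        HDF5 supports efficient appending through chunking, but a dataset
        can only be resized if it was created with an unlimited maximum
        shape (maxshape=(None,)); otherwise resize() raises an error.
        New rows are written after the existing ones, so pass data that
        is later than what is already stored to keep the dates sorted.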
        
        Args:
            df: DataFrame with new data to append
            symbol: Stock symbol
            group_path: HDF5 group path
        """
        with self._get_file('a') as f:
            stock_group = f[f'{group_path}/{symbol}']
            
            # Read existing dates
            existing_dates = stock_group['dates'][:]
            existing_set = set(existing_dates)
            
            # Filter out duplicates
            df = df.copy()
            df['Date'] = pd.to_datetime(df['Date'])
            new_date_ints = df['Date'].astype(np.int64) // 10**9
            mask = ~new_date_ints.isin(existing_set)
            df = df[mask]
            
            if len(df) == 0:
                print(f"No new data to append for {symbol}")
                return
            
            # Append to each dataset
            columns = ['open', 'high', 'low', 'close', 'volume']
            
            for col in columns:
                if col in stock_group and col.capitalize() in df.columns:
                    dataset = stock_group[col]
                    
                    # Get current size
                    current_size = dataset.shape[0]
                    new_size = current_size + len(df)
                    
                    # Resize and append
                    dataset.resize(new_size, axis=0)
                    dataset[current_size:] = df[col.capitalize()].values
            
            # Append dates
            dates_dataset = stock_group['dates']
            current_size = dates_dataset.shape[0]
            new_size = current_size + len(df)
            dates_dataset.resize(new_size, axis=0)
            dates_dataset[current_size:] = new_date_ints[mask].values
            
            # Update metadata
            all_dates = stock_group['dates'][:]
            dates_dt = pd.to_datetime(all_dates * 10**9)
            stock_group.attrs['end_date'] = str(dates_dt.max().date())
            stock_group.attrs['count'] = len(all_dates)
            
            print(f"Appended {len(df)} records to {symbol}")


class NetCDFTimeSeriesStorage:
    """
    A class to handle NetCDF storage for time-series data.
    
    NetCDF (Network Common Data Form) is a set of software libraries
    and machine-independent data formats for array-oriented scientific data.
    
    While HDF5 is more general-purpose, NetCDF is particularly well-suited for:
    - Meteorological and climate data
    - Multi-dimensional gridded data
    - Data that needs CF (Climate and Forecast) conventions
    - Interoperability with scientific software (MATLAB, R, etc.)
    
    For NEPSE data, we can use NetCDF to store:
    - Daily prices across multiple stocks (2D: time x stocks)
    - Multiple variables (open, high, low, close, volume)
    - Geographic data if we add location information
    """
    
    def __init__(self, filepath: str = './nepse_data.nc'):
        """
        Initialize NetCDF storage.
        
        Args:
            filepath: Path to the NetCDF file (.nc extension)
        """
        try:
            import netCDF4 as nc
            self.nc = nc
        except ImportError:
            raise ImportError(
                "netCDF4 package is required. Install with: pip install netCDF4"
            )
        
        self.filepath = Path(filepath)
        self.filepath.parent.mkdir(parents=True, exist_ok=True)
    
    def store_time_series(self,
                          df: pd.DataFrame,
                          time_dim: str = 'time',
                          symbol_dim: str = 'symbol') -> None:
        """
        Store time-series data in NetCDF format.
        
        NetCDF uses dimensions and variables:
        - Dimensions define the shape of data (time, symbol)
        - Variables hold the actual data with attributes
        
        Args:
            df: DataFrame with Date, Symbol, and price columns
            time_dim: Name for the time dimension
            symbol_dim: Name for the symbol dimension
        """
        with self.nc.Dataset(self.filepath, 'w') as ds:
            # Get unique dates and symbols
            dates = pd.to_datetime(df['Date'].unique())
            dates = dates.sort_values()
            symbols = sorted(df['Symbol'].unique())
            
            # Create dimensions
            # 'UNLIMITED' allows appending new time steps
            ds.createDimension(time_dim, None)  # Unlimited dimension
            ds.createDimension(symbol_dim, len(symbols))
            
            # Create coordinate variables
            # Time is stored as days since a reference date
            time_var = ds.createVariable(
                time_dim,
                'f8',  # 64-bit float
                (time_dim,)
            )
            time_var.units = 'days since 2020-01-01'
            time_var.calendar = 'standard'
            time_var.long_name = 'Time'
            
            # Calculate days since reference
            ref_date = pd.Timestamp('2020-01-01')
            days_since = (dates - ref_date).days
            time_var[:] = days_since
            
            # Symbols are stored as a fixed-width character array.
            # NetCDF-4 also supports variable-length strings, but character
            # arrays remain readable by tools limited to the classic data model
            max_symbol_len = max(len(s) for s in symbols)
            ds.createDimension('symbol_strlen', max_symbol_len)
            
            symbol_var = ds.createVariable(
                symbol_dim,
                'S1',  # Single character
                (symbol_dim, 'symbol_strlen')
            )
            symbol_var.long_name = 'Stock Symbol'
            
            # stringtochar splits each fixed-width string into single characters,
            # giving an array of shape (num_symbols, max_symbol_len)
            symbol_bytes = np.array(symbols, dtype=f'S{max_symbol_len}')
            symbol_var[:] = self.nc.stringtochar(symbol_bytes)
            
            # Create data variables
            # Each variable has the dimensions (time, symbol)
            variable_configs = [
                ('open', 'f8', 'Opening Price', 'NPR'),
                ('high', 'f8', 'High Price', 'NPR'),
                ('low', 'f8', 'Low Price', 'NPR'),
                ('close', 'f8', 'Closing Price', 'NPR'),
                ('volume', 'i8', 'Trading Volume', 'shares'),
            ]
            
            # Create a mapping from symbol to index
            symbol_to_idx = {s: i for i, s in enumerate(symbols)}
            
            for var_name, dtype, long_name, units in variable_configs:
                var = ds.createVariable(
                    var_name,
                    dtype,
                    (time_dim, symbol_dim),
                    zlib=True,  # Enable compression
                    complevel=4  # Compression level (1-9)
                )
                var.long_name = long_name
                var.units = units
                
                # Initialize with missing value
                var[:] = np.nan if dtype == 'f8' else -9999
            
            # Fill in the data
            # Create a date to index mapping
            date_to_idx = {d: i for i, d in enumerate(dates)}
            
            for _, row in df.iterrows():
                date = pd.to_datetime(row['Date'])
                symbol = row['Symbol']
                
                if date in date_to_idx and symbol in symbol_to_idx:
                    t_idx = date_to_idx[date]
                    s_idx = symbol_to_idx[symbol]
                    
                    # Store each variable
                    if 'Open' in row:
                        ds.variables['open'][t_idx, s_idx] = row['Open']
                    if 'High' in row:
                        ds.variables['high'][t_idx, s_idx] = row['High']
                    if 'Low' in row:
                        ds.variables['low'][t_idx, s_idx] = row['Low']
                    if 'Close' in row:
                        ds.variables['close'][t_idx, s_idx] = row['Close']
                    if 'Volume' in row:
                        ds.variables['volume'][t_idx, s_idx] = row['Volume']
            
            # Add global attributes
            ds.title = 'NEPSE Stock Price Data'
            ds.institution = 'Nepal Stock Exchange'
            ds.source = 'NEPSE Historical Data'
            ds.history = f'Created {datetime.now().isoformat()}'
            ds.Conventions = 'CF-1.8'
    
    def read_time_series(self,
                         variables: Optional[List[str]] = None,
                         symbols: Optional[List[str]] = None,
                         start_date: Optional[str] = None,
                         end_date: Optional[str] = None) -> pd.DataFrame:
        """
        Read time-series data from NetCDF.
        
        Args:
            variables: List of variables to read (None for all)
            symbols: List of symbols to read (None for all)
            start_date: Start date filter
            end_date: End date filter
        
        Returns:
            DataFrame with the requested data
        """
        with self.nc.Dataset(self.filepath, 'r') as ds:
            # Read time coordinate
            time_var = ds.variables['time']
            days_since = time_var[:]
            ref_date = pd.Timestamp('2020-01-01')
            dates = ref_date + pd.to_timedelta(days_since, unit='D')
            
            # Read symbols
            symbol_data = ds.variables['symbol'][:]
            symbols_all = [''.join(s.astype(str)).strip() for s in symbol_data]
            
            # Determine time range
            time_mask = np.ones(len(dates), dtype=bool)
            if start_date:
                start = pd.to_datetime(start_date)
                time_mask &= (dates >= start)
            if end_date:
                end = pd.to_datetime(end_date)
                time_mask &= (dates <= end)
            
            # Determine symbols
            if symbols:
                symbol_mask = [s in symbols for s in symbols_all]
            else:
                symbol_mask = np.ones(len(symbols_all), dtype=bool)
            
            # Determine variables
            if variables is None:
                variables = ['open', 'high', 'low', 'close', 'volume']
            
            # Build data array
            data = []
            for t_idx, date in enumerate(dates):
                if not time_mask[t_idx]:
                    continue
                
                for s_idx, symbol in enumerate(symbols_all):
                    if not symbol_mask[s_idx]:
                        continue
                    
                    row = {'Date': date, 'Symbol': symbol}
                    for var in variables:
                        if var in ds.variables:
                            value = ds.variables[var][t_idx, s_idx]
                            # Convert missing values to NaN
                            if value == -9999 or np.isnan(value):
                                row[var.capitalize()] = np.nan
                            else:
                                row[var.capitalize()] = value
                    
                    data.append(row)
            
            return pd.DataFrame(data)
    
    def get_variable_info(self) -> Dict[str, Any]:
        """
        Get information about variables in the NetCDF file.
        
        Returns:
            Dictionary with variable information
        """
        with self.nc.Dataset(self.filepath, 'r') as ds:
            info = {
                'dimensions': {name: len(dim) for name, dim in ds.dimensions.items()},
                'variables': {},
                'global_attributes': {name: getattr(ds, name) for name in ds.ncattrs()}
            }
            
            for name, var in ds.variables.items():
                info['variables'][name] = {
                    'dimensions': var.dimensions,
                    'shape': var.shape,
                    'dtype': str(var.dtype),
                    'attributes': {attr: var.getncattr(attr) for attr in var.ncattrs()}
                }
            
            return info


def demonstrate_hdf5_storage():
    """
    Demonstrate HDF5 storage operations with NEPSE data.
    """
    print("=" * 70)
    print("HDF5 Storage Demonstration for NEPSE Data")
    print("=" * 70)
    
    # Generate sample data
    np.random.seed(42)
    
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='B')
    symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL', 'GBIME', 'HBL', 'NBL']
    
    data = []
    for date in dates:
        for symbol in symbols:
            base_price = 300 + np.random.randint(0, 600)
            high = base_price * (1 + np.random.uniform(0, 0.05))
            low = base_price * (1 - np.random.uniform(0, 0.05))
            open_price = np.random.uniform(low, high)
            close = np.random.uniform(low, high)
            
            data.append({
                'Date': date,
                'Symbol': symbol,
                'Open': round(open_price, 2),
                'High': round(high, 2),
                'Low': round(low, 2),
                'Close': round(close, 2),
                'Volume': np.random.randint(10000, 500000)
            })
    
    df = pd.DataFrame(data)
    print(f"\nGenerated {len(df)} records for {len(symbols)} stocks")
    
    # Initialize HDF5 storage
    hdf5_storage = HDF5TimeSeriesStorage('./nepse_analysis.h5')
    
    # Store data by symbol (individual time-series)
    print("\n1. Storing Data by Symbol")
    print("-" * 40)
    hdf5_storage.store_stock_data(df, group_path='/nepse/prices')
    
    # Store aligned data (matrix format)
    print("\n2. Storing Aligned Data (Matrix Format)")
    print("-" * 40)
    hdf5_storage.store_aligned_data(df)
    
    # Get file info
    print("\n3. File Information")
    print("-" * 40)
    info = hdf5_storage.get_file_info()
    print(f"File: {info['filepath']}")
    print(f"Size: {info['size_mb']:.2f} MB")
    print(f"Total stocks: {info['total_stocks']}")
    print(f"Groups: {len(info['groups'])}")
    print(f"Datasets: {len(info['datasets'])}")
    
    # Read specific stock data
    print("\n4. Reading Single Stock Data")
    print("-" * 40)
    nabil_data = hdf5_storage.read_stock_data(
        symbol='NABIL',
        start_date='2023-06-01',
        end_date='2023-06-30',
        columns=['open', 'high', 'low', 'close', 'volume']
    )
    print(f"NABIL data for June 2023 ({len(nabil_data)} rows):")
    print(nabil_data.head())
    
    # Read aligned data
    print("\n5. Reading Aligned Data")
    print("-" * 40)
    aligned = hdf5_storage.read_aligned_data(
        symbols=['NABIL', 'NICA', 'SCBL'],
        start_date='2023-06-01',
        end_date='2023-06-30'
    )
    print(f"Aligned data ({aligned.shape}):")
    print(aligned.head())
    
    # Get metadata
    print("\n6. Stock Metadata")
    print("-" * 40)
    metadata = hdf5_storage.get_metadata('NABIL')
    print("NABIL metadata:")
    for key, value in metadata.items():
        print(f"  {key}: {value}")
    
    return hdf5_storage, df


def demonstrate_netcdf_storage():
    """
    Demonstrate NetCDF storage operations.
    """
    print("\n" + "=" * 70)
    print("NetCDF Storage Demonstration")
    print("=" * 70)
    
    # Generate sample data
    np.random.seed(42)
    
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='B')
    symbols = ['NABIL', 'NICA', 'SCBL']
    
    data = []
    for date in dates:
        for symbol in symbols:
            base_price = 300 + np.random.randint(0, 500)
            data.append({
                'Date': date,
                'Symbol': symbol,
                'Open': round(base_price * (1 + np.random.uniform(-0.02, 0.02)), 2),
                'High': round(base_price * 1.02, 2),
                'Low': round(base_price * 0.98, 2),
                'Close': round(base_price * (1 + np.random.uniform(-0.01, 0.01)), 2),
                'Volume': np.random.randint(10000, 100000)
            })
    
    df = pd.DataFrame(data)
    
    try:
        # Initialize NetCDF storage
        netcdf_storage = NetCDFTimeSeriesStorage('./nepse_data.nc')
        
        print("\n1. Storing Data in NetCDF Format")
        print("-" * 40)
        netcdf_storage.store_time_series(df)
        print("Data stored successfully")
        
        print("\n2. Reading NetCDF Data")
        print("-" * 40)
        read_df = netcdf_storage.read_time_series(
            symbols=['NABIL'],
            start_date='2023-06-01',
            end_date='2023-06-30'
        )
        print(f"Read {len(read_df)} rows:")
        print(read_df.head())
        
        print("\n3. Variable Information")
        print("-" * 40)
        var_info = netcdf_storage.get_variable_info()
        print("Dimensions:")
        for name, size in var_info['dimensions'].items():
            print(f"  {name}: {size}")
        print("\nVariables:")
        for name, info in var_info['variables'].items():
            print(f"  {name}: {info['shape']} - {info['attributes'].get('long_name', 'N/A')}")
        
        return netcdf_storage, df
    
    except ImportError as e:
        print(f"\nNetCDF demonstration skipped: {e}")
        print("Install netCDF4 with: pip install netCDF4")
        return None, df


if __name__ == "__main__":
    hdf5_storage, df = demonstrate_hdf5_storage()
    netcdf_storage, df2 = demonstrate_netcdf_storage()
```

**Detailed Explanation:**

1. **HDF5 Structure**: HDF5 organizes data hierarchically:
   - **Groups** act like directories, organizing data into logical collections
   - **Datasets** are multidimensional arrays that store the actual data
   - **Attributes** are small metadata attached to groups or datasets

2. **Storing by Symbol**: The `store_stock_data` method creates a group for each stock symbol, with separate datasets for each price field (open, high, low, close, volume). This structure:
   - Enables fast retrieval of a single stock's data
   - Supports efficient appending of new data
   - Stores metadata like start/end dates

3. **Storing Aligned Data**: The `store_aligned_data` method creates a 2D matrix where rows are dates and columns are symbols. This is ideal for:
   - Cross-sectional analysis (comparing stocks at the same time)
   - Correlation calculations
   - Machine learning models requiring consistent array shapes

4. **Date Storage**: Dates are converted to Unix timestamps (integers) for efficient storage and comparison. This allows:
   - Efficient date range queries using boolean masking
   - Compact storage (8 bytes per date)
   - Fast arithmetic operations

5. **Chunking**: HDF5's chunking feature splits data into fixed-size blocks:
   - Each chunk can be read/written independently
   - Enables efficient partial reads
   - Improves performance for time-range queries

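6. **NetCDF Differences**: HDF5 is a general-purpose container, whereas NetCDF targets array-oriented scientific data (NetCDF-4 files are themselves stored in HDF5):
   - Commonly paired with the CF (Climate and Forecast) metadata conventions
   - Coordinate variables are first-class (for example, time stored as "days since" a reference date)
   - Widely used in climate science, meteorology, and oceanography
   - Interoperates well with scientific software that understands these conventions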

7. **Performance Considerations**:
   - HDF5 excels at random access and partial reads
   - NetCDF is better for gridded scientific data
   - Both support compression (gzip/lzf for HDF5, zlib for NetCDF)
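
The "days since a reference date" encoding mentioned in point 6 can be sketched with the standard library alone. The helper names `encode_days` and `decode_days` are illustrative assumptions and are not part of the classes above. The reference date is chosen to match the `'days since 2020-01-01'` units string used in the listing. In real code, netCDF4's own `date2num` and `num2date` helpers would normally handle this conversion.

```python
from datetime import date, timedelta

# Assumed reference date, matching the 'days since 2020-01-01' units string
REFERENCE_DATE = date(2020, 1, 1)


def encode_days(d: date, ref: date = REFERENCE_DATE) -> float:
    """Encode a calendar date as the number of days since the reference date."""
    return float((d - ref).days)


def decode_days(days: float, ref: date = REFERENCE_DATE) -> date:
    """Invert encode_days: turn an offset in days back into a calendar date."""
    return ref + timedelta(days=days)


trading_days = [date(2023, 6, 1), date(2023, 6, 2), date(2023, 6, 5)]
encoded = [encode_days(d) for d in trading_days]
decoded = [decode_days(x) for x in encoded]

print(encoded)                  # [1247.0, 1248.0, 1251.0]
print(decoded == trading_days)  # True
```

Because the offsets are plain numbers, a date-range filter reduces to a numeric comparison on the time coordinate. This is what the boolean masks in `read_time_series` rely on.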

---

## **8.3 Relational Databases**

Relational databases provide structured storage with ACID guarantees, making them ideal for transactional data and complex queries. For time-series data, they offer SQL's powerful query capabilities.
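
As a small illustration of the atomicity part of ACID, the sketch below loads a batch that contains a duplicate key into an in-memory SQLite database. The table name `prices` and the sample values are assumptions made up for this example. The point is only that a failed batch leaves no partial rows behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        symbol TEXT NOT NULL,
        trade_date TEXT NOT NULL,
        close_price REAL NOT NULL,
        UNIQUE (symbol, trade_date)
    )
""")

# Hypothetical sample batch; the third row repeats an existing key
batch = [
    ("NABIL", "2023-06-01", 612.0),
    ("NABIL", "2023-06-02", 615.5),
    ("NABIL", "2023-06-01", 999.9),  # violates UNIQUE (symbol, trade_date)
]

try:
    # Used as a context manager, the connection commits on success and
    # rolls back the whole transaction if an exception escapes the block
    with conn:
        conn.executemany(
            "INSERT INTO prices (symbol, trade_date, close_price) VALUES (?, ?, ?)",
            batch,
        )
except sqlite3.IntegrityError as exc:
    print(f"Batch rejected: {exc}")

row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(f"Rows stored after failed batch: {row_count}")  # 0
```

The two valid rows are discarded together with the invalid one, so a reader never sees a half-loaded trading day. Whether to reject the whole batch or skip only the offending rows (for example with `INSERT OR IGNORE`) is a design choice that depends on the ingestion pipeline.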

### **8.3.1 Schema Design**

Proper schema design is critical for time-series data in relational databases. Let's explore various approaches.
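
Before the full schema, it may help to see why a composite index on `(stock_id, trade_date)` matters. The following is a minimal sketch, assuming a cut-down `stock_prices` table with only three columns. It asks SQLite for its query plan on a typical "one stock, one date range" query. The exact wording of the plan differs between SQLite versions, so the code only checks that the index name appears in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock_prices (
        stock_id INTEGER NOT NULL,
        trade_date TEXT NOT NULL,   -- ISO-8601 'YYYY-MM-DD' sorts chronologically
        close_price REAL
    );
    CREATE INDEX idx_prices_stock_date
        ON stock_prices (stock_id, trade_date);
""")

query = """
    SELECT trade_date, close_price
    FROM stock_prices
    WHERE stock_id = ? AND trade_date BETWEEN ? AND ?
    ORDER BY trade_date
"""

# EXPLAIN QUERY PLAN returns one row per plan step; the fourth column
# holds the human-readable description of that step
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query, (1, "2023-06-01", "2023-06-30")
).fetchall()
plan_details = [row[3] for row in plan]

for detail in plan_details:
    print(detail)

uses_index = any("idx_prices_stock_date" in d for d in plan_details)
print(f"Composite index used: {uses_index}")
```

Putting the equality column (`stock_id`) first and the range column (`trade_date`) second lets the database jump to one stock and then scan a contiguous date range. The same index also returns rows already sorted by date.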

```python
"""
Relational Database Schema Design for Time-Series Data

This module demonstrates schema designs optimized for time-series data,
specifically for the NEPSE stock prediction system.

Key considerations for time-series schema design:
1. Indexing strategy for time-based queries
2. Partitioning for large datasets
3. Normalization vs. denormalization tradeoffs
4. Handling updates and inserts efficiently
"""

import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Optional, Dict, Any, Tuple
from pathlib import Path
import json


class DatabaseSchema:
    """
    A class to define and manage database schemas for time-series data.
    
    This class provides schema definitions and SQL statements for creating
    tables optimized for time-series data storage.
    """
    
    @staticmethod
    def get_nepse_schema_sqlite() -> Dict[str, str]:
        """
        Get SQLite schema definition for NEPSE data.
        
        SQLite is a lightweight, file-based database that's excellent for:
        - Development and prototyping
        - Embedded applications
        - Desktop applications
        - Small to medium datasets
        
        Returns:
            Dictionary mapping table names to CREATE TABLE statements
        """
        schema = {
            # Stock master table (one row per listed company)
            # The design is normalized: stocks and prices live in
            # separate tables linked by stock_id
            'stocks': '''
                CREATE TABLE IF NOT EXISTS stocks (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    symbol TEXT UNIQUE NOT NULL,
                    company_name TEXT,
                    sector TEXT,
                    listed_date DATE,
                    is_active BOOLEAN DEFAULT 1,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                );
                
                -- Index on symbol for fast lookups
                CREATE INDEX IF NOT EXISTS idx_stocks_symbol ON stocks(symbol);
            ''',
            
            'stock_prices': '''
                CREATE TABLE IF NOT EXISTS stock_prices (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    stock_id INTEGER NOT NULL,
                    trade_date DATE NOT NULL,
                    open_price REAL,
                    high_price REAL,
                    low_price REAL,
                    close_price REAL,
                    ltp REAL,
                    vwap REAL,
                    volume INTEGER,
                    turnover REAL,
                    transactions INTEGER,
                    prev_close REAL,
                    diff REAL,
                    price_range REAL,
                    diff_percent REAL,
                    range_percent REAL,
                    vwap_percent REAL,
                    high_52_week REAL,
                    low_52_week REAL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    
                    -- Foreign key to stocks table
                    FOREIGN KEY (stock_id) REFERENCES stocks(id),
                    
                    -- Unique constraint to prevent duplicate entries
                    -- This ensures one price record per stock per day
                    UNIQUE(stock_id, trade_date)
                );
                
                -- Composite index for time-range queries
                -- This column order serves the most common time-series query:
                -- WHERE stock_id = ? AND trade_date BETWEEN ? AND ?
                -- (the UNIQUE constraint above already creates an equivalent
                -- index, so this explicit one mainly documents intent)
                CREATE INDEX IF NOT EXISTS idx_prices_stock_date 
                    ON stock_prices(stock_id, trade_date);
                
                -- Index for date-only queries (cross-sectional)
                CREATE INDEX IF NOT EXISTS idx_prices_date 
                    ON stock_prices(trade_date);
                
                -- Index for price queries (e.g., finding stocks in price range)
                CREATE INDEX IF NOT EXISTS idx_prices_close 
                    ON stock_prices(close_price);
            ''',
            
            # Daily market summary table
            # Stores aggregate statistics for each trading day
            'market_summary': '''
                CREATE TABLE IF NOT EXISTS market_summary (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    trade_date DATE UNIQUE NOT NULL,
                    total_turnover REAL,
                    total_volume INTEGER,
                    total_transactions INTEGER,
                    advancing_stocks INTEGER,
                    declining_stocks INTEGER,
                    unchanged_stocks INTEGER,
                    index_value REAL,
                    index_change REAL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                );
                
                CREATE INDEX IF NOT EXISTS idx_market_summary_date 
                    ON market_summary(trade_date);
            ''',
            
            # Technical indicators table
            # Pre-calculated indicators for faster queries
            'technical_indicators': '''
                CREATE TABLE IF NOT EXISTS technical_indicators (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    stock_id INTEGER NOT NULL,
                    trade_date DATE NOT NULL,
                    sma_5 REAL,
                    sma_10 REAL,
                    sma_20 REAL,
                    sma_50 REAL,
                    ema_12 REAL,
                    ema_26 REAL,
                    rsi_14 REAL,
                    macd REAL,
                    macd_signal REAL,
                    macd_histogram REAL,
                    bollinger_upper REAL,
                    bollinger_middle REAL,
                    bollinger_lower REAL,
                    atr_14 REAL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    
                    FOREIGN KEY (stock_id) REFERENCES stocks(id),
                    UNIQUE(stock_id, trade_date)
                );
                
                CREATE INDEX IF NOT EXISTS idx_indicators_stock_date 
                    ON technical_indicators(stock_id, trade_date);
            ''',
            
            # Prediction results table
            # Stores model predictions for tracking and analysis
            'predictions': '''
                CREATE TABLE IF NOT EXISTS predictions (
                    id INTEGER PRIMARY KEY AUTOINCREMENT,
                    stock_id INTEGER NOT NULL,
                    prediction_date DATE NOT NULL,
                    target_date DATE NOT NULL,
                    model_name TEXT NOT NULL,
                    model_version TEXT,
                    predicted_close REAL,
                    prediction_interval_lower REAL,
                    prediction_interval_upper REAL,
                    actual_close REAL,
                    prediction_error REAL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    
                    FOREIGN KEY (stock_id) REFERENCES stocks(id),
                    UNIQUE(stock_id, prediction_date, target_date, model_name)
                );
                
                CREATE INDEX IF NOT EXISTS idx_predictions_stock 
                    ON predictions(stock_id);
                CREATE INDEX IF NOT EXISTS idx_predictions_model 
                    ON predictions(model_name);
            '''
        }
        
        return schema
    
    @staticmethod
    def get_nepse_schema_postgresql() -> Dict[str, str]:
        """
        Get PostgreSQL schema definition for NEPSE data.
        
        PostgreSQL offers advanced features for time-series data:
        - Table partitioning for large datasets
        - Advanced indexing (BRIN, GiST)
        - JSONB for flexible data storage
        - Powerful date/time functions
        
        Returns:
            Dictionary mapping table names to CREATE TABLE statements
        """
        schema = {
            'stocks': '''
                CREATE TABLE IF NOT EXISTS stocks (
                    id SERIAL PRIMARY KEY,
                    symbol VARCHAR(20) UNIQUE NOT NULL,
                    company_name VARCHAR(255),
                    sector VARCHAR(100),
                    listed_date DATE,
                    is_active BOOLEAN DEFAULT TRUE,
                    metadata JSONB,
                    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
                    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
                );
                
                CREATE INDEX IF NOT EXISTS idx_stocks_symbol ON stocks(symbol);
                CREATE INDEX IF NOT EXISTS idx_stocks_sector ON stocks(sector);
            ''',
            
            'stock_prices': '''
                -- Partitioned table for stock prices
                -- Partitioning by date range improves query performance
                -- and makes data management easier (e.g., dropping old partitions)
                CREATE TABLE IF NOT EXISTS stock_prices (
                    id BIGSERIAL,
                    stock_id INTEGER NOT NULL REFERENCES stocks(id),
                    trade_date DATE NOT NULL,
                    open_price NUMERIC(12, 4),
                    high_price NUMERIC(12, 4),
                    low_price NUMERIC(12, 4),
                    close_price NUMERIC(12, 4),
                    ltp NUMERIC(12, 4),
                    vwap NUMERIC(12, 4),
                    volume BIGINT,
                    turnover NUMERIC(18, 2),
                    transactions INTEGER,
                    prev_close NUMERIC(12, 4),
                    diff NUMERIC(12, 4),
                    price_range NUMERIC(12, 4),
                    diff_percent NUMERIC(8, 4),
                    range_percent NUMERIC(8, 4),
                    vwap_percent NUMERIC(8, 4),
                    high_52_week NUMERIC(12, 4),
                    low_52_week NUMERIC(12, 4),
                    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
                    PRIMARY KEY (id, trade_date)
                ) PARTITION BY RANGE (trade_date);
                
                -- Create partitions for each year
                -- This should be done dynamically as data grows
                CREATE TABLE IF NOT EXISTS stock_prices_2023 
                    PARTITION OF stock_prices
                    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
                
                CREATE TABLE IF NOT EXISTS stock_prices_2024 
                    PARTITION OF stock_prices
                    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
                
                CREATE TABLE IF NOT EXISTS stock_prices_2025 
                    PARTITION OF stock_prices
                    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
                
                -- BRIN index for time-series data
                -- BRIN (Block Range INdex) is compact and efficient when the
                -- indexed column correlates with physical row order, as
                -- trade_date does for append-only loads
                CREATE INDEX IF NOT EXISTS idx_prices_date_brin 
                    ON stock_prices USING BRIN (trade_date);
                
                -- Standard B-tree index for exact lookups
                CREATE INDEX IF NOT EXISTS idx_prices_stock_date_btree 
                    ON stock_prices(stock_id, trade_date);
            ''',
            
            'stock_prices_denormalized': '''
                -- Denormalized table for faster reads
                -- Includes stock symbol directly for simpler queries
                -- Trade-off: larger table size, redundancy
                CREATE TABLE IF NOT EXISTS stock_prices_denormalized (
                    id BIGSERIAL,
                    symbol VARCHAR(20) NOT NULL,
                    trade_date DATE NOT NULL,
                    open_price NUMERIC(12, 4),
                    high_price NUMERIC(12, 4),
                    low_price NUMERIC(12, 4),
                    close_price NUMERIC(12, 4),
                    volume BIGINT,
                    turnover NUMERIC(18, 2),
                    sector VARCHAR(100),
                    PRIMARY KEY (id, trade_date)
                ) PARTITION BY RANGE (trade_date);
                
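                -- A partitioned table needs at least one partition before it
                -- can accept rows. A DEFAULT partition catches dates that no
                -- explicit range partition covers
                CREATE TABLE IF NOT EXISTS stock_prices_denormalized_default
                    PARTITION OF stock_prices_denormalized DEFAULT;
                
                CREATE INDEX IF NOT EXISTS idx_denorm_symbol_date 
                    ON stock_prices_denormalized(symbol, trade_date);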
            ''',
            
            'ohlc_materialized_view': '''
                -- Materialized view for common OHLC queries
                -- Pre-computes data for faster access
                CREATE MATERIALIZED VIEW IF NOT EXISTS ohlc_daily AS
                SELECT 
                    s.symbol,
                    s.company_name,
                    sp.trade_date,
                    sp.open_price,
                    sp.high_price,
                    sp.low_price,
                    sp.close_price,
                    sp.volume,
                    sp.turnover
                FROM stock_prices sp
                JOIN stocks s ON sp.stock_id = s.id
                ORDER BY s.symbol, sp.trade_date;
                
                -- Refresh the materialized view periodically
                -- REFRESH MATERIALIZED VIEW ohlc_daily;
                
                CREATE INDEX IF NOT EXISTS idx_ohlc_symbol_date 
                    ON ohlc_daily(symbol, trade_date);
            '''
        }
        
        return schema


class SQLiteTimeSeriesDB:
    """
    A class to manage SQLite database operations for time-series data.
    
    SQLite is ideal for:
    - Development and testing
    - Single-user applications
    - Embedded systems
    - Prototyping database schemas
    - Desktop applications
    
    For the NEPSE prediction system, SQLite provides:
    - Zero configuration (no server setup)
    - Single file for easy backup and transfer
    - Full SQL support
    - Good performance for moderate datasets
    """
    
    def __init__(self, db_path: str = './nepse.db'):
        """
        Initialize the SQLite database connection.
        
        Args:
            db_path: Path to the SQLite database file.
                    If it doesn't exist, it will be created.
        """
        self.db_path = Path(db_path)
        self.db_path.parent.mkdir(parents=True, exist_ok=True)
        self.conn = None
        self.cursor = None
        
        # Connect and initialize schema
        self._connect()
        self._initialize_schema()
    
    def _connect(self):
        """
        Establish connection to the SQLite database.
        
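        SQLite connection options:
        - check_same_thread: False lets threads other than the creator use
          the connection (callers must serialize access themselves)
        - isolation_level: Controls how transactions are opened implicitly
        - timeout: Seconds to wait for a lock before raising an error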
        """
        self.conn = sqlite3.connect(
            str(self.db_path),
            check_same_thread=False,
            isolation_level='DEFERRED',
            timeout=30.0
        )
        
        # Enable foreign key support (disabled by default in SQLite)
        self.conn.execute('PRAGMA foreign_keys = ON')
        
        # Enable WAL mode for better concurrent access
        # WAL (Write-Ahead Logging) lets readers proceed while a single
        # writer is active, improving performance for read-heavy workloads
        self.conn.execute('PRAGMA journal_mode = WAL')
        
        # Return sqlite3.Row objects, which allow access by column name
        # (row['id']) as well as by position
        self.conn.row_factory = sqlite3.Row
        
        self.cursor = self.conn.cursor()
    
    def _initialize_schema(self):
        """
        Initialize the database schema by creating all tables.
        """
        schema = DatabaseSchema.get_nepse_schema_sqlite()
        
        for table_name, sql in schema.items():
            try:
                # Execute the SQL (may contain multiple statements)
                # Split by semicolon and execute each statement
                statements = [s.strip() for s in sql.split(';') if s.strip()]
                for stmt in statements:
                    self.cursor.execute(stmt)
                self.conn.commit()
                print(f"Created/verified table: {table_name}")
            except sqlite3.Error as e:
                print(f"Error creating table {table_name}: {e}")
    
    def insert_stock(self, 
                     symbol: str, 
                     company_name: Optional[str] = None,
                     sector: Optional[str] = None,
                     listed_date: Optional[str] = None) -> int:
        """
        Insert a new stock into the stocks table.
        
        Args:
            symbol: Stock symbol (e.g., 'NABIL')
            company_name: Full company name
            sector: Business sector
            listed_date: Date the stock was listed
        
        Returns:
            The stock_id of the inserted or existing stock
        """
        # Check if stock already exists
        self.cursor.execute(
            'SELECT id FROM stocks WHERE symbol = ?',
            (symbol,)
        )
        row = self.cursor.fetchone()
        
        if row:
            return row['id']
        
        # Insert new stock
        self.cursor.execute('''
            INSERT INTO stocks (symbol, company_name, sector, listed_date)
            VALUES (?, ?, ?, ?)
        ''', (symbol, company_name, sector, listed_date))
        
        self.conn.commit()
        return self.cursor.lastrowid
    
    def insert_price_data(self, 
                          df: pd.DataFrame,
                          batch_size: int = 1000) -> int:
        """
        Insert price data from a DataFrame.
        
        This method handles the complete insertion process:
        1. Ensures stocks exist in the stocks table
        2. Inserts price data in batches for efficiency
        3. Handles duplicates using INSERT OR IGNORE
        
        Args:
            df: DataFrame with columns matching the stock_prices table
               Must include 'Symbol' and 'Date' columns
            batch_size: Number of rows to insert per transaction
        
        Returns:
            Number of rows inserted
        """
        # First, ensure all stocks exist
        symbols = df['Symbol'].unique()
        stock_ids = {}
        
        for symbol in symbols:
            stock_ids[symbol] = self.insert_stock(symbol)
        
        # Prepare data for insertion
        rows_inserted = 0
        
        def pick(row, *names):
            """Return the first non-missing value among the given columns.
            
            Chaining lookups with `or` would discard legitimate zeros
            (e.g. Diff = 0.0), so missing values are tested explicitly.
            """
            for name in names:
                value = row.get(name)
                if value is not None and not pd.isna(value):
                    return value
            return None
        
        for i in range(0, len(df), batch_size):
            batch = df.iloc[i:i + batch_size]
            
            for _, row in batch.iterrows():
                try:
                    stock_id = stock_ids.get(row['Symbol'])
                    if stock_id is None:
                        continue
                    
                    # Store dates as ISO 'YYYY-MM-DD' text so they bind cleanly
                    # and match the string dates used by the query methods
                    trade_date = pick(row, 'Date', 'trade_date')
                    if trade_date is not None:
                        trade_date = pd.Timestamp(trade_date).strftime('%Y-%m-%d')
                    
                    # Use INSERT OR IGNORE to handle duplicates
                    # This silently skips rows that would violate unique constraints
                    self.cursor.execute('''
                        INSERT OR IGNORE INTO stock_prices (
                            stock_id, trade_date,
                            open_price, high_price, low_price, close_price,
                            ltp, vwap, volume, turnover, transactions,
                            prev_close, diff, price_range,
                            diff_percent, range_percent, vwap_percent,
                            high_52_week, low_52_week
                        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    ''', (
                        stock_id,
                        trade_date,
                        pick(row, 'Open', 'open_price'),
                        pick(row, 'High', 'high_price'),
                        pick(row, 'Low', 'low_price'),
                        pick(row, 'Close', 'close_price'),
                        pick(row, 'LTP', 'ltp'),
                        pick(row, 'VWAP', 'vwap'),
                        pick(row, 'Vol', 'Volume', 'volume'),
                        pick(row, 'Turnover', 'turnover'),
                        pick(row, 'Trans.', 'transactions'),
                        pick(row, 'Prev. Close', 'prev_close'),
                        pick(row, 'Diff', 'diff'),
                        pick(row, 'Range', 'price_range'),
                        pick(row, 'Diff %', 'diff_percent'),
                        pick(row, 'Range %', 'range_percent'),
                        pick(row, 'VWAP %', 'vwap_percent'),
                        pick(row, '52 Weeks High', 'high_52_week'),
                        pick(row, '52 Weeks Low', 'low_52_week')
                    ))
                    
                    if self.cursor.rowcount > 0:
                        rows_inserted += 1
                        
                except sqlite3.Error as e:
                    print(f"Error inserting row: {e}")
                    continue
            
            # Commit after each batch
            self.conn.commit()
        
        print(f"Inserted {rows_inserted} rows")
        return rows_inserted
    
    def query_stock_data(self,
                         symbol: str,
                         start_date: str = None,
                         end_date: str = None,
                         columns: List[str] = None) -> pd.DataFrame:
        """
        Query stock price data.
        
        Args:
            symbol: Stock symbol to query
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
            columns: Specific columns to retrieve (None for all)
        
        Returns:
            DataFrame with the query results
        """
        # Build the SELECT clause
        if columns:
            # Map common column names
            column_mapping = {
                'date': 'trade_date',
                'open': 'open_price',
                'high': 'high_price',
                'low': 'low_price',
                'close': 'close_price',
                'volume': 'volume'
            }
            select_cols = [column_mapping.get(c.lower(), c) for c in columns]
            select_clause = ', '.join(select_cols)
        else:
            select_clause = '''
                sp.trade_date, sp.open_price, sp.high_price, sp.low_price, 
                sp.close_price, sp.volume, sp.turnover, sp.vwap,
                sp.transactions, sp.diff, sp.price_range,
                sp.diff_percent, sp.range_percent
            '''
        
        # Build the WHERE clause
        conditions = ['s.symbol = ?']
        params = [symbol]
        
        if start_date:
            conditions.append('sp.trade_date >= ?')
            params.append(start_date)
        
        if end_date:
            conditions.append('sp.trade_date <= ?')
            params.append(end_date)
        
        where_clause = ' AND '.join(conditions)
        
        # Execute query
        query = f'''
            SELECT {select_clause}
            FROM stock_prices sp
            JOIN stocks s ON sp.stock_id = s.id
            WHERE {where_clause}
            ORDER BY sp.trade_date
        '''
        
        self.cursor.execute(query, params)
        rows = self.cursor.fetchall()
        
        # Convert to DataFrame
        if rows:
            df = pd.DataFrame([dict(row) for row in rows])
            return df
        else:
            return pd.DataFrame()
    
    def query_cross_sectional(self,
                              date: str,
                              columns: List[str] = None) -> pd.DataFrame:
        """
        Query all stocks' data for a specific date (cross-sectional).
        
        This query retrieves data for all stocks on a single day,
        useful for:
        - Market analysis
        - Relative performance comparison
        - Portfolio construction
        
        Args:
            date: The date to query (YYYY-MM-DD)
            columns: Specific columns to retrieve
        
        Returns:
            DataFrame with all stocks' data for the date
        """
        if columns:
            select_clause = ', '.join(columns)
        else:
            select_clause = '''
                s.symbol, s.company_name, s.sector,
                sp.trade_date, sp.open_price, sp.high_price, 
                sp.low_price, sp.close_price, sp.volume, sp.turnover
            '''
        
        query = f'''
            SELECT {select_clause}
            FROM stock_prices sp
            JOIN stocks s ON sp.stock_id = s.id
            WHERE sp.trade_date = ?
            ORDER BY sp.turnover DESC
        '''
        
        self.cursor.execute(query, (date,))
        rows = self.cursor.fetchall()
        
        if rows:
            return pd.DataFrame([dict(row) for row in rows])
        else:
            return pd.DataFrame()
    
    def execute_query(self, query: str, params: tuple = None) -> pd.DataFrame:
        """
        Execute a raw SQL query and return results as DataFrame.
        
        Args:
            query: SQL query string
            params: Query parameters (for parameterized queries)
        
        Returns:
            DataFrame with query results
        """
        if params:
            self.cursor.execute(query, params)
        else:
            self.cursor.execute(query)
        
        rows = self.cursor.fetchall()
        
        if rows:
            return pd.DataFrame([dict(row) for row in rows])
        else:
            return pd.DataFrame()
    
    def get_table_info(self, table_name: str) -> pd.DataFrame:
        """
        Get information about a table's structure.
        
        Args:
            table_name: Name of the table
        
        Returns:
            DataFrame with column information
        """
        query = f"PRAGMA table_info({table_name})"
        return self.execute_query(query)
    
    def get_index_info(self, table_name: str) -> pd.DataFrame:
        """
        Get information about indexes on a table.
        
        Args:
            table_name: Name of the table
        
        Returns:
            DataFrame with index information
        """
        query = f"PRAGMA index_list({table_name})"
        return self.execute_query(query)
    
    def get_database_stats(self) -> Dict[str, Any]:
        """
        Get statistics about the database.
        
        Returns:
            Dictionary with database statistics
        """
        stats = {
            'file_size_mb': self.db_path.stat().st_size / (1024 * 1024),
            'tables': {}
        }
        
        # Get stats for each table
        tables = ['stocks', 'stock_prices', 'market_summary', 'technical_indicators']
        
        for table in tables:
            try:
                row_count = self.execute_query(
                    f"SELECT COUNT(*) as count FROM {table}"
                )
                stats['tables'][table] = {
                    'row_count': row_count['count'].iloc[0] if len(row_count) > 0 else 0
                }
            except sqlite3.Error:
                stats['tables'][table] = {'row_count': 0}
        
        return stats
    
    def close(self):
        """Close the database connection."""
        if self.conn:
            self.conn.close()


def demonstrate_sqlite_storage():
    """
    Demonstrate SQLite storage operations with NEPSE data.
    """
    print("=" * 70)
    print("SQLite Relational Database Storage Demonstration")
    print("=" * 70)
    
    # Initialize database
    db = SQLiteTimeSeriesDB('./nepse_analysis.db')
    
    # Generate sample data
    np.random.seed(42)
    
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='B')
    symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL']
    sectors = ['Banking', 'Banking', 'Banking', 'Banking', 'Banking']
    
    data = []
    for date in dates:
        for symbol, sector in zip(symbols, sectors):
            base_price = 300 + np.random.randint(0, 500)
            data.append({
                'Date': date,
                'Symbol': symbol,
                'Open': round(base_price * (1 + np.random.uniform(-0.02, 0.02)), 2),
                'High': round(base_price * 1.03, 2),
                'Low': round(base_price * 0.97, 2),
                'Close': round(base_price * (1 + np.random.uniform(-0.01, 0.01)), 2),
                'Volume': np.random.randint(10000, 500000),
                'Turnover': round(base_price * np.random.randint(10000, 500000), 2),
                'VWAP': round(base_price, 2),
                'Trans.': np.random.randint(100, 1000),
                'Diff': round(base_price * np.random.uniform(-0.02, 0.02), 2),
                'Range': round(base_price * 0.06, 2),
                'Diff %': round(np.random.uniform(-2, 2), 2),
                'Range %': round(np.random.uniform(4, 8), 2),
                'VWAP %': round(np.random.uniform(-1, 1), 2),
                '52 Weeks High': round(base_price * 1.2, 2),
                '52 Weeks Low': round(base_price * 0.8, 2)
            })
    
    df = pd.DataFrame(data)
    print(f"\nGenerated {len(df)} records")
    
    # Insert data
    print("\n1. Inserting Data")
    print("-" * 40)
    rows_inserted = db.insert_price_data(df)
    
    # Query data
    print("\n2. Querying Single Stock Data")
    print("-" * 40)
    nabil_data = db.query_stock_data(
        symbol='NABIL',
        start_date='2023-06-01',
        end_date='2023-06-30'
    )
    print(f"NABIL data for June 2023 ({len(nabil_data)} rows):")
    print(nabil_data.head())
    
    # Cross-sectional query
    print("\n3. Cross-Sectional Query (All Stocks for a Date)")
    print("-" * 40)
    cross_section = db.query_cross_sectional('2023-06-15')
    print(f"All stocks on 2023-06-15 ({len(cross_section)} rows):")
    print(cross_section[['symbol', 'close_price', 'volume', 'turnover']])
    
    # Custom query
    print("\n4. Custom SQL Query - Top Performers")
    print("-" * 40)
    top_performers = db.execute_query('''
        SELECT 
            s.symbol,
            COUNT(*) as trading_days,
            AVG(sp.close_price) as avg_close,
            MAX(sp.close_price) as high_close,
            MIN(sp.close_price) as low_close,
            SUM(sp.volume) as total_volume,
            SUM(sp.turnover) as total_turnover
        FROM stock_prices sp
        JOIN stocks s ON sp.stock_id = s.id
        GROUP BY s.symbol
        ORDER BY total_turnover DESC
    ''')
    print(top_performers)
    
    # Database stats
    print("\n5. Database Statistics")
    print("-" * 40)
    stats = db.get_database_stats()
    print(f"Database size: {stats['file_size_mb']:.2f} MB")
    print(f"Stocks table: {stats['tables']['stocks']['row_count']} rows")
    print(f"Prices table: {stats['tables']['stock_prices']['row_count']} rows")
    
    # Table structure
    print("\n6. Table Structure")
    print("-" * 40)
    table_info = db.get_table_info('stock_prices')
    print(table_info[['name', 'type', 'notnull', 'pk']])
    
    # Indexes
    print("\n7. Indexes")
    print("-" * 40)
    index_info = db.get_index_info('stock_prices')
    print(index_info)
    
    db.close()
    return db, df


if __name__ == "__main__":
    db, df = demonstrate_sqlite_storage()
```

**Detailed Explanation:**

1. **Schema Design Choices**:
   - **Normalized Design**: The `stocks` table stores stock metadata separately from price data. This:
     - Reduces redundancy (symbol, company_name stored once)
     - Enables easy updates to stock metadata
     - Requires joins for most queries
   
   - **Foreign Keys**: The `stock_id` in `stock_prices` references `stocks(id)`. This:
     - Ensures referential integrity
     - Prevents orphaned records
     - Enables cascading updates/deletes when the constraint declares `ON UPDATE CASCADE` / `ON DELETE CASCADE`

   - **Unique Constraints**: `UNIQUE(stock_id, trade_date)` prevents duplicate entries for the same stock on the same day.

2. **Indexing Strategy**:
   - **Composite Index** `idx_prices_stock_date`: The most important index for time-series queries. It allows efficient queries filtering by both stock and date range.
   - **Single-column Index** `idx_prices_date`: For cross-sectional queries that need all stocks for a specific date.
   - **Column Index** `idx_prices_close`: For queries filtering by price (e.g., stocks in a price range).

3. **PostgreSQL-Specific Features**:
   - **Table Partitioning**: The `PARTITION BY RANGE (trade_date)` splits the table by year. Benefits:
     - Queries on a date range only scan relevant partitions
     - Old data can be archived by dropping partitions
     - Each partition can have its own storage settings
   
   - **BRIN Index**: Block Range INdex is very efficient for large, naturally ordered datasets like time-series. It stores summary information about blocks of data, making it much smaller than B-tree indexes.

   - **Materialized Views**: Pre-computed queries that can be refreshed periodically. Great for expensive aggregations that are queried frequently.

4. **SQLite Features**:
   - **WAL Mode**: Write-Ahead Logging allows concurrent readers during writes, improving performance for read-heavy workloads.
   - **Foreign Keys**: Must be explicitly enabled with `PRAGMA foreign_keys = ON`.
   - **INSERT OR IGNORE**: Handles duplicates gracefully without raising errors.

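5. **Batch Insertion**: The `insert_price_data` method processes rows in batches and commits once per batch:
   - Reduces transaction overhead compared with committing after every row
   - Reports the total number of inserted rows when it finishes
   - Allows partial success: a failing row is logged and skipped, and batches already committed are kept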
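
The SQLite behaviours in items 4 and 5 can be checked in isolation. The following is a minimal sketch rather than the chapter's `SQLiteTimeSeriesDB` class: it assumes a simplified two-table schema (the table names mirror the ones above, but the reduced column set and the helper names `make_db` and `insert_prices` are illustrative) and runs against an in-memory database.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    # Foreign keys are off by default in SQLite; enable them per connection,
    # before any transaction is open.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE stocks (
            id INTEGER PRIMARY KEY,
            symbol TEXT NOT NULL UNIQUE
        );
        CREATE TABLE stock_prices (
            id INTEGER PRIMARY KEY,
            stock_id INTEGER NOT NULL REFERENCES stocks(id),
            trade_date TEXT NOT NULL,
            close_price REAL,
            UNIQUE (stock_id, trade_date)
        );
    """)
    return conn


def insert_prices(conn: sqlite3.Connection, rows: list) -> int:
    """Insert (stock_id, trade_date, close_price) tuples in one transaction.

    Returns the number of rows actually written. Rows that duplicate an
    existing (stock_id, trade_date) pair are skipped by INSERT OR IGNORE.
    """
    before = conn.total_changes
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO stock_prices "
            "(stock_id, trade_date, close_price) VALUES (?, ?, ?)",
            rows,
        )
    return conn.total_changes - before


conn = make_db()
conn.execute("INSERT INTO stocks (id, symbol) VALUES (1, 'NABIL')")
conn.commit()

rows = [
    (1, "2023-06-14", 601.0),
    (1, "2023-06-15", 605.5),
    (1, "2023-06-15", 999.0),  # duplicate key: silently skipped
]
inserted = insert_prices(conn, rows)
print(f"Inserted {inserted} of {len(rows)} rows")

# With foreign keys enabled, a price row for an unknown stock_id is rejected
try:
    with conn:
        conn.execute(
            "INSERT INTO stock_prices (stock_id, trade_date, close_price) "
            "VALUES (?, ?, ?)",
            (42, "2023-06-15", 100.0),
        )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(f"Orphan row rejected: {orphan_rejected}")
```

Passing the whole batch to `executemany` inside a single transaction is one way to get the reduced overhead described in item 5. Counting via `total_changes` is a design choice for this sketch, since it reflects only the rows that were really written.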

---


### **8.3.2 Indexing Strategies**

Proper indexing is crucial for time-series database performance. Without appropriate indexes, queries on large time-series datasets can take minutes or hours instead of milliseconds.

```python
"""
Database Indexing Strategies for Time-Series Data

This module demonstrates various indexing strategies optimized for
time-series queries, using the NEPSE stock data as an example.

Key concepts:
1. B-Tree indexes (default) - good for equality and range queries
2. BRIN indexes (PostgreSQL) - efficient for large, naturally ordered data
3. Partial indexes - index only relevant data (e.g., active stocks only)
4. Covering indexes - include all columns needed for a query
5. Composite indexes - multi-column indexes for specific query patterns
"""

import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict, Any, Tuple
import time


class IndexingStrategy:
    """
    Demonstrates various indexing strategies for time-series data.
    
    For the NEPSE prediction system, we need indexes that support:
    1. Time-range queries: "Get prices for NABIL from Jan to Mar"
    2. Cross-sectional queries: "Get all stocks for 2024-01-15"
    3. Aggregate queries: "Average closing price by month"
    4. Point queries: "Get specific date's closing price"
    """
    
    def __init__(self, db_path: str = './nepse_indexing.db'):
        self.db_path = db_path
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('PRAGMA foreign_keys = ON')
        self.cursor = self.conn.cursor()
        self._setup_tables()
    
    def _setup_tables(self):
        """Create tables with various indexing strategies."""
        # Table without indexes (baseline for comparison)
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS prices_no_index (
                id INTEGER PRIMARY KEY,
                symbol TEXT,
                trade_date DATE,
                close_price REAL,
                volume INTEGER
            )
        ''')
        
        # Table with basic single-column indexes
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS prices_basic_index (
                id INTEGER PRIMARY KEY,
                symbol TEXT,
                trade_date DATE,
                close_price REAL,
                volume INTEGER
            )
        ''')
        
        # Create basic indexes
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_basic_symbol 
            ON prices_basic_index(symbol)
        ''')
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_basic_date 
            ON prices_basic_index(trade_date)
        ''')
        
        # Table with composite index (optimal for time-series)
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS prices_composite (
                id INTEGER PRIMARY KEY,
                symbol TEXT,
                trade_date DATE,
                close_price REAL,
                volume INTEGER
            )
        ''')
        
        # Composite index: symbol first, then date
        # This is optimal for queries filtering by symbol and date range
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_composite_symbol_date 
            ON prices_composite(symbol, trade_date)
        ''')
        
        # Covering index: includes all columns needed for common queries
        # This allows "index-only scans" where the database never touches the table
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_composite_covering 
            ON prices_composite(symbol, trade_date, close_price, volume)
        ''')
        
        # Partial index: only index high-volume trading days
        # This reduces index size and maintenance cost
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_composite_high_volume 
            ON prices_composite(symbol, trade_date) 
            WHERE volume > 100000
        ''')
        
        self.conn.commit()
    
    def generate_test_data(self, num_records: int = 100000):
        """Generate test data for performance comparison."""
        symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL', 'EBL']
        start_date = datetime(2020, 1, 1)
        
        print(f"Generating {num_records} test records...")
        
        # Generate data
        data = []
        for i in range(num_records):
            symbol = symbols[i % len(symbols)]
            days_offset = i // len(symbols)
            trade_date = start_date + timedelta(days=days_offset)
            
            data.append((
                symbol,
                trade_date.strftime('%Y-%m-%d'),
                100 + np.random.random() * 900,  # Price between 100-1000
                np.random.randint(1000, 1000000)  # Volume
            ))
        
        # Insert into all tables
        for table in ['prices_no_index', 'prices_basic_index', 'prices_composite']:
            self.cursor.executemany(f'''
                INSERT INTO {table} (symbol, trade_date, close_price, volume)
                VALUES (?, ?, ?, ?)
            ''', data)
        
        self.conn.commit()
        print("Data inserted successfully")
    
    def benchmark_query(self, query: str, params: tuple = (), 
                       iterations: int = 10) -> Dict[str, Any]:
        """
        Benchmark a query across different table configurations.
        
        Args:
            query: SQL query template with {table} placeholder
            params: Query parameters
            iterations: Number of times to run each query
        
        Returns:
            Dictionary with timing results
        """
        results = {}
        
        tables = {
            'No Index': 'prices_no_index',
            'Basic Index': 'prices_basic_index',
            'Composite Index': 'prices_composite'
        }
        
        for config_name, table_name in tables.items():
            # Replace table name in query
            actual_query = query.format(table=table_name)
            
            # Warm up
            self.cursor.execute(actual_query, params)
            self.cursor.fetchall()
            
            # Benchmark
            times = []
            for _ in range(iterations):
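                # perf_counter() is a monotonic, high-resolution clock
                # intended for measuring short durations
                start = time.perf_counter()
                self.cursor.execute(actual_query, params)
                self.cursor.fetchall()
                times.append(time.perf_counter() - start)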
            
            results[config_name] = {
                'avg_time_ms': np.mean(times) * 1000,
                'min_time_ms': np.min(times) * 1000,
                'max_time_ms': np.max(times) * 1000
            }
        
        return results
    
    def demonstrate_time_range_query(self):
        """
        Benchmark time-range queries (most common in time-series).
        
        Query pattern: Get all prices for a specific symbol in a date range
        """
        print("\n" + "=" * 70)
        print("Benchmark: Time-Range Query")
        print("Pattern: SELECT * FROM table WHERE symbol = ? AND trade_date BETWEEN ? AND ?")
        print("=" * 70)
        
        query = '''
            SELECT * FROM {table} 
            WHERE symbol = ? AND trade_date BETWEEN ? AND ?
        '''
        
        results = self.benchmark_query(
            query,
            params=('NABIL', '2023-01-01', '2023-12-31'),
            iterations=50
        )
        
        for config, metrics in results.items():
            print(f"{config:20s}: {metrics['avg_time_ms']:.2f} ms "
                  f"(min: {metrics['min_time_ms']:.2f}, "
                  f"max: {metrics['max_time_ms']:.2f})")
        
        # Explain query plans
        print("\nQuery Plans:")
        for config, table in [('No Index', 'prices_no_index'), 
                              ('Composite', 'prices_composite')]:
            print(f"\n{config}:")
            cursor = self.conn.execute(f'EXPLAIN QUERY PLAN {query.format(table=table)}',
                                       ('NABIL', '2023-01-01', '2023-12-31'))
            for row in cursor:
                print(f"  {row}")
    
    def demonstrate_cross_sectional_query(self):
        """
        Benchmark cross-sectional queries (all stocks for one date).
        
        Query pattern: Get all stocks for a specific trading day
        """
        print("\n" + "=" * 70)
        print("Benchmark: Cross-Sectional Query")
        print("Pattern: SELECT * FROM table WHERE trade_date = ?")
        print("=" * 70)
        
        query = 'SELECT * FROM {table} WHERE trade_date = ?'
        
        results = self.benchmark_query(
            query,
            params=('2023-06-15',),
            iterations=50
        )
        
        for config, metrics in results.items():
            print(f"{config:20s}: {metrics['avg_time_ms']:.2f} ms")
    
    def demonstrate_aggregation_query(self):
        """
        Benchmark aggregation queries (analytical workloads).
        
        Query pattern: Closing-price statistics per symbol over a date range
        """
        print("\n" + "=" * 70)
        print("Benchmark: Aggregation Query")
        print("Pattern: SELECT symbol, AVG(close_price) GROUP BY symbol")
        print("=" * 70)
        
        query = '''
            SELECT symbol, AVG(close_price), MAX(close_price), MIN(close_price)
            FROM {table}
            WHERE trade_date BETWEEN ? AND ?
            GROUP BY symbol
        '''
        
        results = self.benchmark_query(
            query,
            params=('2023-01-01', '2023-12-31'),
            iterations=20
        )
        
        for config, metrics in results.items():
            print(f"{config:20s}: {metrics['avg_time_ms']:.2f} ms")
    
    def explain_index_usage(self):
        """
        Demonstrate how to check if indexes are being used.
        """
        print("\n" + "=" * 70)
        print("Index Usage Analysis")
        print("=" * 70)
        
        # SQLite EXPLAIN QUERY PLAN
        print("\n1. Query Plan for Symbol Lookup:")
        cursor = self.conn.execute('''
            EXPLAIN QUERY PLAN
            SELECT * FROM prices_composite 
            WHERE symbol = 'NABIL' AND trade_date > '2023-06-01'
        ''')
        
        for row in cursor:
            print(f"  {row}")
        
        # Show index info
        print("\n2. Available Indexes on prices_composite:")
        cursor = self.conn.execute(
            "SELECT name, sql FROM sqlite_master WHERE type='index' AND tbl_name='prices_composite'"
        )
        # Rows are plain tuples (no row_factory is set), so unpack by position
        for name, sql in cursor:
            print(f"  {name}: {sql}")
    
    def get_index_statistics(self) -> pd.DataFrame:
        """
        Get statistics about indexes (SQLite specific).
        
        Returns:
            DataFrame with index statistics
        """
        # List index definitions from the schema table (sqlite_master)
        try:
            cursor = self.conn.execute(
                "SELECT name, tbl_name, sql FROM sqlite_master WHERE type='index'"
            )
            
            data = []
            # Rows are plain tuples (no row_factory is set), so unpack by position
            for name, tbl_name, sql in cursor.fetchall():
                sql = sql or ''  # automatically created indexes have no SQL text
                data.append({
                    'name': name,
                    'table': tbl_name,
                    'sql': sql[:100] + '...' if len(sql) > 100 else sql
                })
            
            return pd.DataFrame(data)
        except sqlite3.Error:
            return pd.DataFrame()
    
    def close(self):
        self.conn.close()


def demonstrate_indexing_strategies():
    """Run the indexing strategy demonstration."""
    indexer = IndexingStrategy()
    
    # Generate data
    indexer.generate_test_data(50000)
    
    # Run benchmarks
    indexer.demonstrate_time_range_query()
    indexer.demonstrate_cross_sectional_query()
    indexer.demonstrate_aggregation_query()
    indexer.explain_index_usage()
    
    # Show indexes
    print("\nIndex Statistics:")
    stats = indexer.get_index_statistics()
    print(stats)
    
    indexer.close()


if __name__ == "__main__":
    demonstrate_indexing_strategies()
```

**Detailed Explanation:**

1. **B-Tree Indexes**: The default index type in most databases. They balance lookup speed, insertion speed, and storage efficiency. For time-series:
   - Single-column index on `symbol`: Fast for symbol lookups but requires separate index for dates
   - Single-column index on `trade_date`: Fast for date lookups but not for symbol-specific queries

2. **Composite Indexes**: Multi-column indexes where order matters. For `INDEX(symbol, trade_date)`:
   - Optimized for `WHERE symbol = ? AND trade_date BETWEEN ? AND ?`
   - Can also satisfy `WHERE symbol = ?` (leftmost prefix rule)
   - Cannot efficiently satisfy `WHERE trade_date = ?` alone (symbol comes first)

3. **Covering Indexes**: Include all columns needed for a query. For example, if you always query `symbol`, `trade_date`, `close_price`, and `volume`, a covering index on all four columns allows the database to answer the query using only the index, without touching the table (index-only scan).

4. **Partial Indexes**: Index only a subset of data. For example, indexing only high-volume days (`WHERE volume > 100000`):
   - Smaller index size
   - Faster index maintenance
   - Optimized for specific query patterns

5. **Query Plan Analysis**: The `EXPLAIN QUERY PLAN` command shows how SQLite executes a query:
   - `SCAN` (printed as `SCAN TABLE` by older SQLite versions) means a full table scan (slow for large tables)
   - `SEARCH ... USING INDEX` means an index lookup (fast)
   - `USING COVERING INDEX` means an index-only scan (typically the fastest)
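
The leftmost-prefix and covering-index rules above can be observed directly in SQLite's query plans. The sketch below is an illustration under assumptions: the `prices` table, the `idx_sym_date` index, and the `plan` helper are hypothetical names, not part of the `IndexingStrategy` class. Because the exact plan wording differs between SQLite versions, the code only inspects whether each plan line starts with `SEARCH` or `SCAN`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE prices (
        id INTEGER PRIMARY KEY,
        symbol TEXT,
        trade_date TEXT,
        close_price REAL,
        volume INTEGER
    )
""")
# Composite index: symbol is the leading column, trade_date the second
conn.execute("CREATE INDEX idx_sym_date ON prices(symbol, trade_date)")


def plan(sql: str, params: tuple = ()) -> list:
    """Return the 'detail' column of EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}", params).fetchall()
    return [row[3] for row in rows]  # each row is (id, parent, notused, detail)


# 1. Symbol + date range: matches the index column order
range_plan = plan(
    "SELECT * FROM prices WHERE symbol = ? AND trade_date BETWEEN ? AND ?",
    ("NABIL", "2023-01-01", "2023-12-31"),
)

# 2. Symbol only: still served by the index (leftmost prefix)
prefix_plan = plan("SELECT * FROM prices WHERE symbol = ?", ("NABIL",))

# 3. Date only: the leading column is missing, so the table is scanned
date_only_plan = plan(
    "SELECT * FROM prices WHERE trade_date = ?", ("2023-06-15",)
)

# 4. Only indexed columns requested: the index alone answers the query
covering_plan = plan(
    "SELECT symbol, trade_date FROM prices WHERE symbol = ?", ("NABIL",)
)

for label, details in [
    ("symbol + date range", range_plan),
    ("symbol only", prefix_plan),
    ("date only", date_only_plan),
    ("indexed columns only", covering_plan),
]:
    print(f"{label:22s}: {details}")
```

If cross-sectional queries by date matter for your workload, the date-only plan is the signal that a second index led by `trade_date` is worth considering.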

### **8.3.3 Query Optimization**

Optimizing queries for time-series data involves understanding execution plans, using appropriate query patterns, and leveraging database-specific features.

```python
"""
Query Optimization Techniques for Time-Series Data

This module demonstrates optimization techniques for common time-series
queries in the NEPSE prediction system.
"""

import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict, Any


class QueryOptimizer:
    """
    Demonstrates query optimization techniques for time-series data.
    
    Key optimization strategies:
    1. Use appropriate indexes (covered in 8.3.2)
    2. Filter early (WHERE clauses before JOINs)
    3. Use covering indexes for aggregations
    4. Limit result sets when possible
    5. Use window functions for time-series calculations
    6. Materialize expensive calculations
    """
    
    def __init__(self, db_path: str = './nepse_optimized.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('PRAGMA foreign_keys = ON')
        self.cursor = self.conn.cursor()
        self._setup_schema()
    
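    def _setup_schema(self):
        """Set up the schema with an index and a summary table.
        
        SQLite has no materialized views, so a regular table that is
        refreshed by the application plays that role here.
        """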
        # Main prices table
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS stock_prices (
                id INTEGER PRIMARY KEY,
                symbol TEXT,
                trade_date DATE,
                open_price REAL,
                high_price REAL,
                low_price REAL,
                close_price REAL,
                volume INTEGER
            )
        ''')
        
        # Create optimized indexes
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_prices_sym_date_vol 
            ON stock_prices(symbol, trade_date, volume)
        ''')
        
        # Create a summary table (materialized view pattern)
        # This pre-computes daily aggregates for fast querying
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS daily_summary (
                trade_date DATE PRIMARY KEY,
                total_volume INTEGER,
                total_turnover REAL,
                num_stocks INTEGER,
                advancers INTEGER,
                decliners INTEGER
            )
        ''')
        
        self.conn.commit()
    
    def optimized_time_range_query(self, 
                                   symbol: str, 
                                   start_date: str, 
                                   end_date: str) -> pd.DataFrame:
        """
        Optimized query for time-range data retrieval.
        
        Optimization techniques used:
        1. Specific column selection (avoid SELECT *)
        2. Proper index usage (symbol, date)
        3. Parameterized queries (safe against SQL injection, and the
           prepared statement can be reused across calls)
        
        Args:
            symbol: Stock symbol
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
        
        Returns:
            DataFrame with price data
        """
        # OPTIMIZED: Select only needed columns
        # Avoids memory overhead of unused columns
        query = '''
            SELECT 
                trade_date,
                open_price,
                high_price,
                low_price,
                close_price,
                volume
            FROM stock_prices
            WHERE symbol = ?
              AND trade_date >= ?
              AND trade_date <= ?
            ORDER BY trade_date ASC
        '''
        
        # Use a parameterized query for security and performance:
        # values are bound safely, and sqlite3 can reuse the prepared statement
        df = pd.read_sql_query(
            query, 
            self.conn, 
            params=(symbol, start_date, end_date),
            parse_dates=['trade_date']
        )
        
        return df
    
    def optimized_cross_sectional(self, 
                                  trade_date: str,
                                  min_volume: int = 10000) -> pd.DataFrame:
        """
        Optimized cross-sectional query with filtering.
        
        Args:
            trade_date: Date to query
            min_volume: Minimum volume filter
        
        Returns:
            DataFrame with all stocks for the date
        """
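        # Filter to the requested date and volume, then look up each symbol's
        # previous close. LAG() cannot be used directly here: with
        # WHERE trade_date = ?, the window would only see rows from one day.
        # The (symbol, trade_date, volume) index serves the previous-close
        # lookup; the date filter needs an index that leads with trade_date.
        query = '''
            SELECT 
                p.symbol,
                p.close_price,
                p.volume,
                (p.close_price - prev.close_price) / 
                    prev.close_price * 100 AS price_change_pct
            FROM stock_prices AS p
            LEFT JOIN stock_prices AS prev
              ON prev.symbol = p.symbol
             AND prev.trade_date = (
                    SELECT MAX(trade_date)
                    FROM stock_prices
                    WHERE symbol = p.symbol
                      AND trade_date < p.trade_date
                 )
            WHERE p.trade_date = ?
              AND p.volume >= ?
            ORDER BY p.volume DESC
        '''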
        
        return pd.read_sql_query(
            query,
            self.conn,
            params=(trade_date, min_volume)
        )
    
    def optimized_rolling_average(self,
                                   symbol: str,
                                   window: int = 20) -> pd.DataFrame:
        """
        Calculate rolling average using SQL window functions.
        
        Window functions are more efficient than self-joins or
        fetching all data and calculating in Python.
        
        Args:
            symbol: Stock symbol
            window: Rolling window size (days)
        
        Returns:
            DataFrame with rolling averages
        """
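        # Use SQLite window functions (available in SQLite 3.25+).
        # The window size is formatted into the SQL text (it also forms the
        # column alias), so force it to a positive integer first.
        window = int(window)
        if window < 1:
            raise ValueError("window must be at least 1")
        query = f'''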
            SELECT 
                trade_date,
                close_price,
                AVG(close_price) OVER (
                    ORDER BY trade_date 
                    ROWS BETWEEN {window-1} PRECEDING AND CURRENT ROW
                ) as sma_{window},
                AVG(volume) OVER (
                    ORDER BY trade_date 
                    ROWS BETWEEN {window-1} PRECEDING AND CURRENT ROW
                ) as avg_volume
            FROM stock_prices
            WHERE symbol = ?
            ORDER BY trade_date
        '''
        
        return pd.read_sql_query(query, self.conn, params=(symbol,))
    
    def batch_insert_optimization(self, 
                                   df: pd.DataFrame,
                                   batch_size: int = 1000) -> int:
        """
        Optimized batch insertion with transaction management.
        
        Args:
            df: DataFrame to insert
            batch_size: Number of rows per batch
        
        Returns:
            Number of rows inserted
        """
        # Convert DataFrame to list of tuples for executemany
        data = [(
            row['symbol'],
            row['trade_date'],
            row['open_price'],
            row['high_price'],
            row['low_price'],
            row['close_price'],
            row['volume']
        ) for _, row in df.iterrows()]
        
        total_inserted = 0
        
        # Use single transaction for entire batch
        try:
            self.cursor.execute('BEGIN TRANSACTION')
            
            for i in range(0, len(data), batch_size):
                batch = data[i:i+batch_size]
                self.cursor.executemany('''
                    INSERT INTO stock_prices 
                    (symbol, trade_date, open_price, high_price, 
                     low_price, close_price, volume)
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                ''', batch)
                total_inserted += len(batch)
            
            self.conn.commit()
            
        except Exception:
            # Undo every batch so the insert is all-or-nothing,
            # then re-raise with the original traceback
            self.conn.rollback()
            raise
        
        return total_inserted
    
    def close(self):
        self.conn.close()
```

**Detailed Explanation:**

1. **Column Selection**: Always specify columns in SELECT statements. `SELECT *` returns every column, even if you only need a few, which increases memory use and data transfer and prevents the database from answering the query from an index alone.

2. **Parameterized Queries**: Using `?` placeholders instead of string formatting:
   - Prevents SQL injection attacks
   - Allows query plan caching (the database reuses the execution plan)
   - Handles data type conversion automatically

3. **Window Functions**: SQL window functions (`OVER`, `PARTITION BY`, `ROWS BETWEEN`) calculate rolling statistics efficiently in the database:
   - Avoids transferring large datasets to Python
   - Uses optimized algorithms in the database engine
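   - Handles the beginning of the series without errors: the first rows are averaged over fewer than `window` values (pandas `rolling()` returns NaN there by default)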

4. **Transaction Management**: Wrapping batch inserts in explicit transactions:
   - Reduces disk I/O (commit once instead of per row)
   - Ensures atomicity (all or nothing)
   - Typically speeds up bulk inserts considerably, often by an order of magnitude or more compared with committing every row
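
The window-function and transaction points above can be checked with a small, self-contained sketch. It assumes an in-memory SQLite database (SQLite 3.25 or later); the `prices` table and the `rolling_mean_sql` helper are hypothetical names used only for this illustration and are not part of the chapter's storage class.

```python
import sqlite3


def rolling_mean_sql(closes, window=3):
    """Return (close, sma) rows computed by a SQLite window function."""
    window = int(window)  # formatted into the SQL below, so force an integer
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE prices (day INTEGER PRIMARY KEY, close_price REAL)"
        )
        # One transaction for the whole batch: the connection context manager
        # commits on success and rolls back if an exception is raised.
        with conn:
            conn.executemany(
                "INSERT INTO prices (day, close_price) VALUES (?, ?)",
                list(enumerate(closes)),
            )
        query = f"""
            SELECT close_price,
                   AVG(close_price) OVER (
                       ORDER BY day
                       ROWS BETWEEN {window - 1} PRECEDING AND CURRENT ROW
                   ) AS sma
            FROM prices
            ORDER BY day
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


rows = rolling_mean_sql([10.0, 20.0, 30.0, 40.0], window=3)
for close, sma in rows:
    print(f"close={close:5.1f}  sma_3={sma:5.1f}")
```

The first two rows are averaged over one and two values respectively, because the frame simply contains fewer rows at the start of the series. If the NaN behaviour of pandas `rolling(3).mean()` is required, those rows would need to be masked after the query.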

---

## **8.4 Time-Series Databases**

Specialized time-series databases are optimized for time-stamped data. For time-series workloads they typically ingest, compress, and query data more efficiently than general-purpose databases.

### **8.4.1 InfluxDB**

InfluxDB is a purpose-built time-series database designed for high write loads and time-range queries. It's particularly popular for monitoring, IoT, and financial data.

```python
"""
InfluxDB Storage Module for NEPSE Time-Series Data

InfluxDB is optimized for:
- High write throughput (sustained, append-heavy workloads)
- Time-range queries
- Downsampling and retention policies
- Tag-based indexing (inverted index)

Key concepts:
- Measurement: Similar to a table (e.g., "stock_prices")
- Tags: Indexed metadata (e.g., symbol, sector) - use for filtering
- Fields: Non-indexed data (e.g., price, volume) - use for values
- Timestamp: The time field (nanosecond precision)
"""

from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional
import pandas as pd
import numpy as np

# Note: Install influxdb-client: pip install influxdb-client
try:
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS
    INFLUX_AVAILABLE = True
except ImportError:
    INFLUX_AVAILABLE = False
    print("Warning: influxdb-client not installed. Install with: pip install influxdb-client")


class InfluxDBTimeSeriesStorage:
    """
    A class to handle InfluxDB operations for NEPSE stock data.
    
    Schema Design for NEPSE in InfluxDB:
    Measurement: stock_prices
    Tags: symbol, sector (indexed, low cardinality)
    Fields: open, high, low, close, volume, turnover (non-indexed values)
    Timestamp: trade_date
    """
    
    def __init__(self, 
                 url: str = "http://localhost:8086",
                 token: str = "your-token",
                 org: str = "nepse-org",
                 bucket: str = "nepse-bucket"):
        """
        Initialize InfluxDB connection.
        
        Args:
            url: InfluxDB server URL
            token: Authentication token
            org: Organization name
            bucket: Bucket (database) name
        """
        if not INFLUX_AVAILABLE:
            raise ImportError("influxdb-client required")
        
        self.url = url
        self.token = token
        self.org = org
        self.bucket = bucket
        
        self.client = InfluxDBClient(url=url, token=token, org=org)
        self.write_api = self.client.write_api(write_options=SYNCHRONOUS)
        self.query_api = self.client.query_api()
    
    def write_stock_data(self, df: pd.DataFrame, 
                         measurement: str = "stock_prices") -> None:
        """
        Write stock data to InfluxDB.
        
        Data Model:
        - Measurement: stock_prices
        - Tags: symbol (indexed for fast filtering)
        - Fields: open, high, low, close, volume
        - Timestamp: trade_date
        
        Args:
            df: DataFrame with columns: Symbol, Date, Open, High, Low, Close, Volume
            measurement: InfluxDB measurement name
        """
        points = []
        
        for _, row in df.iterrows():
            # Create a Point (InfluxDB data structure)
            point = Point(measurement)\
                .tag("symbol", row['Symbol'])\
                .field("open", float(row['Open']))\
                .field("high", float(row['High']))\
                .field("low", float(row['Low']))\
                .field("close", float(row['Close']))\
                .field("volume", int(row['Volume']))
            
            # Add optional fields if present
            if 'Turnover' in row:
                point = point.field("turnover", float(row['Turnover']))
            if 'VWAP' in row:
                point = point.field("vwap", float(row['VWAP']))
            
            # Set timestamp
            if isinstance(row['Date'], str):
                timestamp = datetime.strptime(row['Date'], '%Y-%m-%d')
            else:
                timestamp = row['Date']
            
            point = point.time(timestamp)
            points.append(point)
        
        # Write in batch
        self.write_api.write(bucket=self.bucket, record=points)
        print(f"Written {len(points)} points to InfluxDB")
    
    def query_time_range(self, 
                        symbol: str,
                        start: str,
                        stop: str,
                        measurement: str = "stock_prices") -> pd.DataFrame:
        """
        Query stock data for a specific time range using Flux query language.
        
        Args:
            symbol: Stock symbol to query
            start: Start time (ISO format or relative like "-30d")
            stop: Stop time (ISO format or relative like "now()")
            measurement: Measurement name
        
        Returns:
            DataFrame with query results
        
        Flux Query Explanation:
        from(): Select the bucket
        |> range(): Filter by time range
        |> filter(): Filter by measurement and tags
        |> pivot(): Turn one row per field into one row per timestamp
        |> sort(): Order the rows by time
        """
        query = f'''
        from(bucket: "{self.bucket}")
            |> range(start: {start}, stop: {stop})
            |> filter(fn: (r) => r._measurement == "{measurement}")
            |> filter(fn: (r) => r.symbol == "{symbol}")
            |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
            |> sort(columns: ["_time"])
        '''
        
        # Execute query and convert to DataFrame
        result = self.query_api.query_data_frame(query)
        
        if result.empty:
            return pd.DataFrame()
        
        # Rename columns for clarity
        result = result.rename(columns={
            '_time': 'timestamp',
            'close': 'close_price',
            'open': 'open_price',
            'high': 'high_price',
            'low': 'low_price'
        })
        
        return result
    
    def query_downsampled(self,
                         symbol: str,
                         start: str,
                         window: str = "1w",
                         aggregate: str = "mean") -> pd.DataFrame:
        """
        Query downsampled (aggregated) data.
        
        InfluxDB is excellent at downsampling on the fly.
        
        Args:
            symbol: Stock symbol
            start: Start time
            window: Window duration (1h, 1d, 1w, 1mo)
            aggregate: Aggregation function (mean, sum, min, max, first, last)
        
        Returns:
            DataFrame with downsampled data
        """
        query = f'''
        from(bucket: "{self.bucket}")
            |> range(start: {start})
            |> filter(fn: (r) => r._measurement == "stock_prices")
            |> filter(fn: (r) => r.symbol == "{symbol}")
            |> aggregateWindow(every: {window}, fn: {aggregate}, createEmpty: false)
            |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")
        '''
        
        return self.query_api.query_data_frame(query)
    
    def query_multiple_symbols(self,
                                symbols: List[str],
                                start: str,
                                field: str = "close") -> pd.DataFrame:
        """
        Query multiple symbols simultaneously (cross-sectional).
        
        Args:
            symbols: List of stock symbols
            start: Start time
            field: Field to retrieve (close, volume, etc.)
        
        Returns:
            DataFrame with data for all symbols
        """
        # Build symbol filter
        symbol_filter = " or ".join([f'r.symbol == "{s}"' for s in symbols])
        
        query = f'''
        from(bucket: "{self.bucket}")
            |> range(start: {start})
            |> filter(fn: (r) => r._measurement == "stock_prices")
            |> filter(fn: (r) => {symbol_filter})
            |> filter(fn: (r) => r._field == "{field}")
            |> pivot(rowKey:["_time"], columnKey: ["symbol"], valueColumn: "_value")
        '''
        
        return self.query_api.query_data_frame(query)
    
    def continuous_query_example(self):
        """
        Example of a downsampling task (InfluxDB 2.x replaces the
        continuous queries of InfluxDB 1.x with tasks).
        
        A task runs on a schedule and can downsample high-frequency data
        into a separate bucket with a longer retention period.
        """
        # This would be set up via the InfluxDB API or CLI
        # Example Flux task for daily aggregation
        # (a plain string, not an f-string, so braces are written once):
        task_flux = '''
        option task = {
            name: "daily_stock_aggregation",
            every: 1d
        }
        
        from(bucket: "nepse-bucket")
            |> range(start: -task.every)
            |> filter(fn: (r) => r._measurement == "stock_prices")
            |> aggregateWindow(every: 1d, fn: mean)
            |> to(bucket: "nepse-daily", org: "nepse-org")
        '''
        
        print("Task Flux script for daily aggregation:")
        print(task_flux)
        return task_flux
    
    def get_measurement_stats(self) -> Dict[str, Any]:
        """
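        Get the number of points per symbol over the last 30 days.
        
        Returns:
            Dictionary mapping each symbol to its point count
        """
        # Count a single field so each point is counted once
        # rather than once per field
        query = f'''
        from(bucket: "{self.bucket}")
            |> range(start: -30d)
            |> filter(fn: (r) => r._measurement == "stock_prices")
            |> filter(fn: (r) => r._field == "close")
            |> group(columns: ["symbol"])
            |> count()
        '''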
        
        result = self.query_api.query(query)
        
        stats = {}
        for table in result:
            for record in table.records:
                symbol = record.values.get('symbol')
                count = record.get_value()
                stats[symbol] = count
        
        return stats
    
    def close(self):
        """Close the InfluxDB client."""
        self.client.close()


class InfluxDBSchemaDesign:
    """
    Documentation and examples of InfluxDB schema design for NEPSE.
    
    Schema Design Best Practices:
    
    1. Tags vs Fields:
       - Tags: Use for metadata that you filter/group by (symbol, sector)
              Tags are indexed (inverted index) - fast filtering
              But high cardinality (many unique values) costs memory
       - Fields: Use for actual data values (price, volume)
                Not indexed - efficient for storage
    
    2. Measurement Design:
       Option A: Single measurement "stock_prices" with symbol as tag
       Option B: Separate measurement per stock (avoid - too many measurements)
       
       Recommended: Option A (single measurement with symbol tag)
    
    3. Retention (set per bucket in InfluxDB 2.x):
       - Raw data: 1 year retention
       - Daily aggregates: 5 years retention
       - Monthly aggregates: Infinite retention
    """
    
    @staticmethod
    def get_schema_recommendations():
        """
        Print schema recommendations for NEPSE data.
        """
        recommendations = """
        InfluxDB Schema Design for NEPSE:
        
        Measurement: stock_prices
        Tags:
          - symbol: Stock ticker (NABIL, NICA, etc.) [cardinality: ~200]
          - sector: Banking, Hydro, etc. [cardinality: ~20]
        
        Fields:
          - open: Opening price (float)
          - high: Daily high (float)
          - low: Daily low (float)
          - close: Closing price (float)
          - volume: Trading volume (integer)
          - turnover: Total turnover (float)
          - vwap: Volume weighted average price (float)
        
        Timestamp: trade_date (midnight UTC or market close time)
        
        Retention (configured per bucket in InfluxDB 2.x):
        - Raw data bucket: 52 weeks
        - Daily aggregates bucket: 260 weeks (5 years)
        - Monthly aggregates bucket: infinite (never expires)
        
        Why this design?
        - Symbol as tag allows fast filtering: filter(fn: (r) => r.symbol == "NABIL")
        - Low cardinality (~200 symbols) fits well in memory
        - Fields store the actual time-series values
        - Timestamp allows efficient time-range queries
        """
        print(recommendations)


def demonstrate_influxdb():
    """
    Demonstrate InfluxDB operations (if available).
    """
    if not INFLUX_AVAILABLE:
        print("InfluxDB demonstration skipped (influxdb-client not installed)")
        print("To install: pip install influxdb-client")
        return
    
    print("=" * 70)
    print("InfluxDB Time-Series Database Demonstration")
    print("=" * 70)
    
    # Show schema design
    InfluxDBSchemaDesign.get_schema_recommendations()
    
    # Note: This requires a running InfluxDB instance
    # For demonstration, we'll show the code structure
    print("\nExample code structure:")
    print("""
    # Initialize
    storage = InfluxDBTimeSeriesStorage(
        url="http://localhost:8086",
        token="your-token",
        org="nepse-org",
        bucket="nepse-bucket"
    )
    
    # Write data
    storage.write_stock_data(df)
    
    # Query time range
    result = storage.query_time_range(
        symbol="NABIL",
        start="-30d",
        stop="now()"
    )
    
    # Query downsampled data
    weekly = storage.query_downsampled(
        symbol="NABIL",
        start="-1y",
        window="1w",
        aggregate="mean"
    )
    """)
```

**Detailed Explanation:**

1. **Data Model**: InfluxDB uses a specific data model:
   - **Measurement**: Like a table name (`stock_prices`)
   - **Tags**: Indexed metadata (symbol, sector) - used in `filter()` predicates (the equivalent of WHERE clauses)
   - **Fields**: Non-indexed values (price, volume) - the actual data
   - **Timestamp**: The time field (nanosecond precision)

2. **Tag vs Field Decision**: 
   - Use **tags** for values you filter by (`r.symbol == "NABIL"`) or group by (`group(columns: ["sector"])`)
   - Use **fields** for values you aggregate (mean of `close`, sum of `volume`)
   - Tags are stored in an inverted index (fast lookup, but every unique tag combination creates a series and costs memory)
   - Fields are not indexed; they are compressed and stored separately (space efficient, but filtering on a field value requires scanning)

3. **Flux Query Language**: InfluxDB 2.x uses Flux (functional query language):
   - Pipe-forward operator `|>` passes data between functions
   - `from()` selects the bucket
   - `range()` filters by time
   - `filter()` filters by tags/fields
   - `aggregateWindow()` downsamples data
   - `pivot()` transforms from long format (one row per field) to wide format (one row per timestamp)

4. **Retention and Downsampling**: InfluxDB expires data automatically once it is older than its bucket's retention period, and scheduled tasks can downsample it into another bucket first:
   - Keep high-resolution data for a limited period (e.g., 1 year of raw data)
   - Keep low-resolution aggregates for longer (e.g., daily aggregates for 5 years, monthly aggregates indefinitely)
   - This reduces storage costs while preserving long-term trends
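
To make the data model and the effect of `pivot()` more concrete, here is a minimal sketch in plain Python that needs no InfluxDB server. Both helpers, `to_line_protocol` and `pivot_long_to_wide`, are hypothetical illustrations rather than part of the `influxdb-client` API: the first formats one point in a simplified form of line protocol (no escaping of spaces or commas, whole-second timestamps), and the second mimics what `pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")` does to long-format rows.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_line_protocol(measurement, tags, fields, ts):
    """Format one point as simplified line protocol (no character escaping)."""
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_items = []
    for key, value in fields.items():
        if isinstance(value, bool):
            field_items.append(f"{key}={'true' if value else 'false'}")
        elif isinstance(value, int):
            field_items.append(f"{key}={value}i")  # integers carry an "i" suffix
        else:
            field_items.append(f"{key}={float(value)}")
    nanoseconds = int(ts.timestamp()) * 1_000_000_000  # whole seconds only
    return f"{measurement},{tag_part} {','.join(field_items)} {nanoseconds}"


def pivot_long_to_wide(records):
    """Turn one row per (time, field) into one row per time."""
    wide = defaultdict(dict)
    for rec in records:
        wide[rec["_time"]][rec["_field"]] = rec["_value"]
    return [{"_time": t, **columns} for t, columns in sorted(wide.items())]


# Tags identify the series; fields carry the measured values.
line = to_line_protocol(
    "stock_prices",
    tags={"symbol": "NABIL", "sector": "Banking"},
    fields={"close": 1250.0, "volume": 15000},
    ts=datetime(2024, 1, 15, tzinfo=timezone.utc),
)
print(line)

# Long format, as a query returns it before pivot(): one row per field.
long_rows = [
    {"_time": "2024-01-15", "_field": "close", "_value": 1250.0},
    {"_time": "2024-01-15", "_field": "volume", "_value": 15000},
    {"_time": "2024-01-16", "_field": "close", "_value": 1262.5},
    {"_time": "2024-01-16", "_field": "volume", "_value": 18200},
]
wide = pivot_long_to_wide(long_rows)
print(wide)
```

The sample prices and volumes are made up for illustration. In real code, the client's `Point` class handles escaping and timestamp precision, so hand-built line protocol like this is mainly useful for understanding what is sent over the wire.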

### **8.4.2 TimescaleDB**

TimescaleDB is a PostgreSQL extension that adds time-series capabilities to PostgreSQL. It combines SQL's flexibility with time-series optimizations.

```python
"""
TimescaleDB Storage Module for NEPSE Time-Series Data

TimescaleDB transforms PostgreSQL into a time-series database by:
1. Automatic partitioning by time (hypertables)
2. Time-series specific functions (time_bucket, first, last, etc.)
3. Compression for historical data
4. Continuous aggregates (materialized views for time-series)

Advantages over vanilla PostgreSQL:
- Automatic partitioning improves query performance
- Better ingestion rates for time-series data
- Built-in time-series functions
- Data retention policies
"""

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional

try:
    import psycopg2
    from psycopg2.extras import execute_values
    POSTGRES_AVAILABLE = True
except ImportError:
    POSTGRES_AVAILABLE = False


class TimescaleDBStorage:
    """
    A class to handle TimescaleDB operations for NEPSE data.
    
    TimescaleDB uses "hypertables" - tables that are automatically
    partitioned by time. They look like regular tables but perform
    like time-series databases.
    """
    
    def __init__(self, 
                 host: str = "localhost",
                 port: int = 5432,
                 database: str = "nepse",
                 user: str = "postgres",
                 password: str = "password"):
        """
        Initialize TimescaleDB connection.
        
        Args:
            host: Database host
            port: Database port
            database: Database name
            user: Username
            password: Password
        """
        if not POSTGRES_AVAILABLE:
            raise ImportError("psycopg2 required. Install: pip install psycopg2-binary")
        
        self.conn = psycopg2.connect(
            host=host,
            port=port,
            database=database,
            user=user,
            password=password
        )
        self.cursor = self.conn.cursor()
        self._initialize_schema()
    
    def _initialize_schema(self):
        """Create hypertables and indexes."""
        # Enable TimescaleDB extension
        self.cursor.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
        
        # Create stock prices table
        # This will be converted to a hypertable
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS stock_prices (
                time TIMESTAMPTZ NOT NULL,
                symbol VARCHAR(20) NOT NULL,
                open_price NUMERIC(12, 4),
                high_price NUMERIC(12, 4),
                low_price NUMERIC(12, 4),
                close_price NUMERIC(12, 4),
                volume BIGINT,
                turnover NUMERIC(18, 2),
                PRIMARY KEY (time, symbol)
            )
        ''')
        
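        # Commit the table first so a failure below cannot roll it back
        self.conn.commit()
        
        # Convert to hypertable
        # chunk_time_interval determines partition size.
        # 7 days is used here; for low-volume daily data a larger interval
        # (for example several months) yields fewer, better-filled chunks
        try:
            self.cursor.execute('''
                SELECT create_hypertable('stock_prices', 'time', 
                                        chunk_time_interval => INTERVAL '7 days',
                                        if_not_exists => TRUE)
            ''')
        except Exception as e:
            # psycopg2 leaves the transaction aborted after an error;
            # roll back so the statements below can still run
            self.conn.rollback()
            print(f"Could not create hypertable: {e}")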
        # Create indexes optimized for time-series queries
        self.cursor.execute('''
            CREATE INDEX IF NOT EXISTS idx_prices_symbol_time 
            ON stock_prices(symbol, time DESC)
        ''')
        
        self.conn.commit()
    
    def insert_data(self, df: pd.DataFrame) -> int:
        """
        Insert stock data into TimescaleDB.
        
        Uses batch insertion for performance.
        
        Args:
            df: DataFrame with stock data
        
        Returns:
            Number of rows submitted (rows that already exist are
            skipped by ON CONFLICT and are still counted here)
        """
        # Prepare data
        data = []
        for _, row in df.iterrows():
            if isinstance(row['Date'], str):
                dt = datetime.strptime(row['Date'], '%Y-%m-%d')
            else:
                dt = row['Date']
            
            data.append((
                dt,
                row['Symbol'],
                row.get('Open'),
                row.get('High'),
                row.get('Low'),
                row.get('Close'),
                row.get('Volume'),
                row.get('Turnover')
            ))
        
        # Use execute_values for efficient batch insertion
        # This is much faster than individual INSERT statements
        execute_values(
            self.cursor,
            '''
            INSERT INTO stock_prices 
            (time, symbol, open_price, high_price, low_price, close_price, volume, turnover)
            VALUES %s
            ON CONFLICT (time, symbol) DO NOTHING
            ''',
            data,
            page_size=1000
        )
        
        self.conn.commit()
        return len(data)
    
    def query_time_range(self,
                        symbol: str,
                        start_date: str,
                        end_date: str) -> pd.DataFrame:
        """
        Query stock prices for a time range.
        
        Args:
            symbol: Stock symbol
            start_date: Start date (YYYY-MM-DD)
            end_date: End date (YYYY-MM-DD)
        
        Returns:
            DataFrame with price data
        """
        query = '''
            SELECT 
                time,
                symbol,
                open_price,
                high_price,
                low_price,
                close_price,
                volume,
                turnover
            FROM stock_prices
            WHERE symbol = %s
              -- Half-open range so intraday timestamps on end_date are included
              AND time >= %s
              AND time < %s::date + INTERVAL '1 day'
            ORDER BY time ASC
        '''
        
        df = pd.read_sql_query(
            query,
            self.conn,
            params=(symbol, start_date, end_date),
            parse_dates=['time']
        )
        
        return df
    
    def query_aggregates(self,
                        symbol: str,
                        bucket_size: str = '1 month',
                        start_date: Optional[str] = None) -> pd.DataFrame:
        """
        Query aggregated data using time_bucket.
        
        TimescaleDB's time_bucket function is like GROUP BY for time.
        
        Args:
            symbol: Stock symbol
            bucket_size: Time bucket (1 day, 1 week, 1 month, etc.)
            start_date: Start date for query
        
        Returns:
            DataFrame with aggregated data (OHLCV)
        """
        query = '''
            SELECT 
                time_bucket(%s, time) as bucket,
                first(open_price, time) as open_price,
                max(high_price) as high_price,
                min(low_price) as low_price,
                last(close_price, time) as close_price,
                sum(volume) as total_volume,
                avg(close_price) as avg_close
            FROM stock_prices
            WHERE symbol = %s
              AND time >= %s
            GROUP BY bucket
            ORDER BY bucket ASC
        '''
        
        if start_date is None:
            start_date = '2020-01-01'
        
        df = pd.read_sql_query(
            query,
            self.conn,
            params=(bucket_size, symbol, start_date)
        )
        
        return df
    
    def create_continuous_aggregate(self):
        """
        Create a continuous aggregate (materialized view for time-series).
        
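        Continuous aggregates store pre-computed aggregations for fast
        querying; a refresh policy keeps them up to date automatically.
        """
        # Create daily aggregates view.
        # WITH NO DATA is used because psycopg2 runs statements inside a
        # transaction, and a continuous aggregate cannot be created
        # WITH DATA in a transaction block. The refresh policy fills it.
        self.cursor.execute('''
            CREATE MATERIALIZED VIEW IF NOT EXISTS daily_stock_stats
            WITH (timescaledb.continuous) AS
            SELECT 
                time_bucket('1 day', time) as day,
                symbol,
                first(open_price, time) as open_price,
                max(high_price) as high_price,
                min(low_price) as low_price,
                last(close_price, time) as close_price,
                sum(volume) as total_volume,
                avg(close_price) as avg_price
            FROM stock_prices
            GROUP BY day, symbol
            WITH NO DATA
        ''')
        
        # Refresh the most recent month of buckets once a day
        self.cursor.execute('''
            SELECT add_continuous_aggregate_policy('daily_stock_stats',
                start_offset => INTERVAL '1 month',
                end_offset => INTERVAL '1 day',
                schedule_interval => INTERVAL '1 day',
                if_not_exists => TRUE)
        ''')
        
        # Add policy to drop raw data after 1 year but keep aggregates.
        # The refresh window (1 month) is shorter than the retention period,
        # so dropping raw chunks does not empty the aggregate.
        self.cursor.execute('''
            SELECT add_retention_policy('stock_prices', INTERVAL '1 year',
                                        if_not_exists => TRUE)
        ''')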
        
        self.conn.commit()
        print("Continuous aggregate created")
    
    def query_with_window_functions(self, symbol: str) -> pd.DataFrame:
        """
        Query using advanced window functions.
        
        TimescaleDB supports all PostgreSQL window functions,
        so indicators can be computed inside the database.
        
        Args:
            symbol: Stock symbol
        
        Returns:
            DataFrame with calculated indicators
        """
        query = '''
            SELECT 
                time,
                close_price,
                volume,
                -- Moving averages
                AVG(close_price) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) as sma_5,
                AVG(close_price) OVER (ORDER BY time ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) as sma_20,
                -- Previous day close (for daily returns)
                LAG(close_price) OVER (ORDER BY time) as prev_close,
                -- Daily return percentage
                (close_price - LAG(close_price) OVER (ORDER BY time)) / 
                    LAG(close_price) OVER (ORDER BY time) * 100 as daily_return_pct,
                -- Rolling 5-day volume sum
                SUM(volume) OVER (ORDER BY time ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) as volume_5d_sum,
                -- Rank by volume across the symbol's full history
                -- (ranking functions ignore frame clauses such as ROWS BETWEEN)
                RANK() OVER (ORDER BY volume DESC) as volume_rank
            FROM stock_prices
            WHERE symbol = %s
            ORDER BY time DESC
            LIMIT 100
        '''
        
        return pd.read_sql_query(query, self.conn, params=(symbol,))
    
    def compress_old_data(self, older_than: str = '1 month'):
        """
        Enable compression for old data.
        
        TimescaleDB can compress chunks (partitions) to save space.
        Compressed chunks are still queryable but use less disk space.
        
        Args:
            older_than: Compress chunks older than this interval
        """
        try:
            self.cursor.execute('''
                ALTER TABLE stock_prices SET (
                    timescaledb.compress,
                    timescaledb.compress_segmentby = 'symbol'
                )
            ''')
            
            # Add compression policy; the interval is passed as a bound
            # parameter instead of being formatted into the SQL string
            self.cursor.execute('''
                SELECT add_compression_policy('stock_prices', %s::interval,
                                              if_not_exists => TRUE)
            ''', (older_than,))
            self.conn.commit()
            print(f"Compression policy added for data older than {older_than}")
        except Exception as e:
            # Roll back so the connection stays usable after a failure
            self.conn.rollback()
            print(f"Could not enable compression: {e}")
    
    def get_hypertable_info(self) -> pd.DataFrame:
        """
        Get information about the hypertable.
        
        Returns:
            DataFrame with hypertable statistics
        """
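        # hypertable_detailed_size() reports only byte counts, so the chunk count and
        # compression status come from the timescaledb_information.hypertables view
        # (column names as in TimescaleDB 2.x)
        query = '''
            SELECT 
                h.hypertable_name,
                h.num_chunks AS chunk_count,
                s.total_bytes,
                s.index_bytes,
                s.toast_bytes,
                h.compression_enabled
            FROM timescaledb_information.hypertables AS h
            CROSS JOIN hypertable_detailed_size('stock_prices') AS s
            WHERE h.hypertable_name = 'stock_prices'
        '''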
        
        return pd.read_sql_query(query, self.conn)
    
    def close(self):
        """Close the database connection."""
        self.cursor.close()
        self.conn.close()


def demonstrate_timescaledb():
    """
    Demonstrate TimescaleDB features.
    """
    if not POSTGRES_AVAILABLE:
        print("TimescaleDB demonstration skipped (psycopg2 not installed)")
        return
    
    print("=" * 70)
    print("TimescaleDB Demonstration")
    print("=" * 70)
    
    print("""
    TimescaleDB Setup for NEPSE:
    
    1. Install TimescaleDB extension in PostgreSQL
    2. Create hypertable from regular table:
       SELECT create_hypertable('stock_prices', 'time');
    
    3. Key benefits for NEPSE data:
       - Automatic partitioning by time (7-day chunks)
       - Efficient time-range queries
       - Built-in time_bucket for aggregations
       - Continuous aggregates for pre-computed indicators
       - Compression for old data
    """)
    
    # Example code structure
    print("""
    Example Usage:
    
    # Initialize
    db = TimescaleDBStorage(
        host="localhost",
        database="nepse",
        user="postgres",
        password="password"
    )
    
    # Insert data
    db.insert_data(df)
    
    # Query with time_bucket aggregation
    monthly = db.query_aggregates(
        symbol="NABIL",
        bucket_size="1 month"
    )
    
    # Create continuous aggregate
    db.create_continuous_aggregate()
    
    # Enable compression
    db.compress_old_data("3 months")
    """)
```

**Detailed Explanation:**

1. **Hypertables**: TimescaleDB's core feature. They look like regular tables but are automatically partitioned by time:
   - **Chunks**: Partitions of the hypertable (e.g., 7 days of data per chunk)
   - **Automatic creation**: New chunks are created automatically as data is inserted
   - **Efficient queries**: Queries that filter by time only scan relevant chunks

2. **time_bucket Function**: Truncates each timestamp to the start of a fixed-width interval, so it can be used in `GROUP BY`:
   - `time_bucket('1 day', time)` groups data into daily buckets
   - `time_bucket('1 week', time)` groups into weekly buckets
   - Works with `first()`, `last()`, `max()`, `min()` for OHLC calculations

3. **Continuous Aggregates**: Materialized views that automatically refresh:
   - Pre-compute daily/weekly aggregates from raw data
   - Queries against the aggregate read pre-computed rows, so they are typically much faster than re-aggregating raw data
   - Raw data can be dropped while keeping aggregates (retention policies)

4. **Compression**: TimescaleDB can compress old chunks:
   - Uses less disk space (the saving depends on the data; reductions around 90% are often cited)
   - Still queryable (decompresses on the fly)
   - Configurable by age (e.g., compress data older than 3 months)
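
The bucketing behind `time_bucket` can be sketched in plain Python. This is a minimal illustration, not TimescaleDB's implementation: `time_bucket`, `ohlc_by_bucket`, and the `ORIGIN` constant are hypothetical helpers written for this example, and the Monday origin (2000-01-03) is an assumption chosen so that weekly buckets start on Mondays.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Assumed alignment point for buckets (a Monday); hypothetical constant
ORIGIN = datetime(2000, 1, 3)


def time_bucket(width: timedelta, ts: datetime, origin: datetime = ORIGIN) -> datetime:
    """Floor a timestamp to the start of its fixed-width bucket."""
    # timedelta // timedelta is floor division and yields an int
    n_buckets = (ts - origin) // width
    return origin + n_buckets * width


def ohlc_by_bucket(rows: Iterable[Tuple[datetime, float]],
                   width: timedelta) -> Dict[datetime, Dict[str, float]]:
    """Aggregate (time, price) rows into open/high/low/close bars per bucket."""
    bars: Dict[datetime, Dict[str, float]] = {}
    for ts, price in sorted(rows):  # time order makes first/last well defined
        bucket = time_bucket(width, ts)
        if bucket not in bars:
            bars[bucket] = {'open': price, 'high': price, 'low': price, 'close': price}
        else:
            bar = bars[bucket]
            bar['high'] = max(bar['high'], price)
            bar['low'] = min(bar['low'], price)
            bar['close'] = price  # last price seen in this bucket
    return bars


rows = [
    (datetime(2024, 1, 1), 850.0),
    (datetime(2024, 1, 2), 870.0),
    (datetime(2024, 1, 5), 845.0),
    (datetime(2024, 1, 8), 865.0),
]
weekly = ohlc_by_bucket(rows, timedelta(weeks=1))
for start, bar in weekly.items():
    print(start.date(), bar)
```

The first three rows fall into the week starting 2024-01-01 and the last row opens a new bucket, which mirrors how `first()`, `max()`, `min()`, and `last()` combine with `time_bucket` in a `GROUP BY`.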

### **8.4.3 Prometheus**

Prometheus is primarily a monitoring system: it scrapes metrics over HTTP from instrumented applications and exporters and stores them in its own time-series database. It is less suitable than InfluxDB or TimescaleDB for storing financial data, but it is a good fit for monitoring the operational metrics of ML models.

```python
"""
Prometheus Integration for NEPSE Model Monitoring

Prometheus is primarily used for monitoring operational metrics
of the prediction system rather than storing the actual stock data.
"""

import time
from datetime import datetime
from typing import Dict, Any

try:
    from prometheus_client import Counter, Histogram, Gauge, start_http_server, Info
    PROMETHEUS_AVAILABLE = True
except ImportError:
    PROMETHEUS_AVAILABLE = False


class PrometheusModelMonitor:
    """
    Monitor NEPSE prediction system metrics using Prometheus.
    
    Use cases:
    - Track prediction latency (how long models take to predict)
    - Monitor prediction accuracy over time
    - Track data ingestion rates
    - Alert on system errors
    - Monitor model drift
    """
    
    def __init__(self, port: int = 8000):
        """
        Initialize Prometheus metrics.
        
        Args:
            port: Port for Prometheus metrics endpoint
        """
        if not PROMETHEUS_AVAILABLE:
            raise ImportError("prometheus-client required. Install: pip install prometheus-client")
        
        self.port = port
        
        # Counter for total predictions made
        # Labels allow us to track by symbol and model version
        self.prediction_counter = Counter(
            'nepse_predictions_total',
            'Total number of predictions made',
            ['symbol', 'model_version', 'prediction_type']
        )
        
        # Histogram for prediction latency
        # Bucket boundaries are in seconds, spanning 1 ms to 1 s for ML inference
        self.prediction_latency = Histogram(
            'nepse_prediction_latency_seconds',
            'Time spent making predictions',
            ['symbol', 'model_version'],
            buckets=[.001, .005, .01, .025, .05, .075, .1, .25, .5, 1.0]
        )
        
        # Gauge for model accuracy (updated periodically)
        self.model_accuracy = Gauge(
            'nepse_model_accuracy',
            'Current model accuracy metrics',
            ['metric_type', 'symbol']
        )
        
        # Gauge for data freshness
        self.data_freshness_hours = Gauge(
            'nepse_data_freshness_hours',
            'Hours since last data update',
            ['data_source']
        )
        
        # Counter for data ingestion
        self.ingestion_counter = Counter(
            'nepse_data_ingestion_total',
            'Total records ingested',
            ['source', 'status']
        )
        
        # Info about the model
        self.model_info = Info('nepse_model', 'Model metadata')
    
    def start_server(self):
        """Start the Prometheus metrics HTTP server."""
        start_http_server(self.port)
        print(f"Prometheus metrics server started on port {self.port}")
        print(f"Access metrics at: http://localhost:{self.port}/metrics")
    
    def record_prediction(self, 
                         symbol: str, 
                         model_version: str,
                         prediction_type: str = 'close_price',
                         latency_seconds: float = 0):
        """
        Record a prediction event.
        
        Args:
            symbol: Stock symbol
            model_version: Model version string
            prediction_type: Type of prediction (close_price, direction, etc.)
            latency_seconds: Time taken to make prediction
        """
        # Increment counter
        self.prediction_counter.labels(
            symbol=symbol,
            model_version=model_version,
            prediction_type=prediction_type
        ).inc()
        
        # Record latency
        self.prediction_latency.labels(
            symbol=symbol,
            model_version=model_version
        ).observe(latency_seconds)
    
    def update_accuracy(self, 
                       metric_type: str,
                       value: float,
                       symbol: str = 'all'):
        """
        Update model accuracy metrics.
        
        Args:
            metric_type: Type of metric (mae, rmse, mape, direction_accuracy)
            value: Metric value
            symbol: Stock symbol or 'all' for aggregate
        """
        self.model_accuracy.labels(
            metric_type=metric_type,
            symbol=symbol
        ).set(value)
    
    def record_ingestion(self, 
                         source: str,
                         count: int,
                         success: bool = True):
        """
        Record data ingestion metrics.
        
        Args:
            source: Data source (nepse_api, csv_import, etc.)
            count: Number of records
            success: Whether ingestion was successful
        """
        status = 'success' if success else 'error'
        self.ingestion_counter.labels(
            source=source,
            status=status
        ).inc(count)
    
    def set_model_info(self, 
                       model_name: str,
                       version: str,
                       training_date: str,
                       features: str):
        """
        Set model metadata.
        
        Args:
            model_name: Name of the model
            version: Version string
            training_date: Date model was trained
            features: Comma-separated list of features
        """
        self.model_info.info({
            'model_name': model_name,
            'version': version,
            'training_date': training_date,
            'features': features
        })


def demonstrate_prometheus():
    """
    Demonstrate Prometheus monitoring setup.
    """
    if not PROMETHEUS_AVAILABLE:
        print("Prometheus demonstration skipped (prometheus-client not installed)")
        return
    
    print("=" * 70)
    print("Prometheus Model Monitoring Setup")
    print("=" * 70)
    
    monitor = PrometheusModelMonitor(port=8000)
    
    # Set model info
    monitor.set_model_info(
        model_name="LSTM_Price_Predictor",
        version="v2.1.0",
        training_date="2024-01-15",
        features="sma_20,rsi,macd,volume"
    )
    
    # Start server
    monitor.start_server()
    
    # Simulate predictions
    print("\nSimulating predictions...")
    for i in range(10):
        symbol = ['NABIL', 'NICA', 'SCBL'][i % 3]
        
        # Simulate prediction latency
        start = time.time()
        time.sleep(0.01)  # Simulate work
        latency = time.time() - start
        
        monitor.record_prediction(
            symbol=symbol,
            model_version="v2.1.0",
            prediction_type="close_price",
            latency_seconds=latency
        )
        
        # Update accuracy periodically
        if i % 5 == 0:
            monitor.update_accuracy('mae', 15.5, symbol)
            monitor.update_accuracy('direction_accuracy', 0.65, symbol)
    
    print("Metrics recorded. Access at http://localhost:8000/metrics")
    
    return monitor
```

**Detailed Explanation:**

1. **Metric Types**:
   - **Counters**: Monotonically increasing values (total predictions, total errors). Used for rates (predictions per second).
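   - **Histograms**: Distribution of values (prediction latency). Observations are counted into configurable buckets; percentiles are estimated at query time, for example with PromQL's `histogram_quantile()`.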
   - **Gauges**: Values that can go up or down (current accuracy, queue size).
   - **Info**: Static metadata (model version, training date).

2. **Labels**: Key-value pairs that add dimensions to metrics:
   - `symbol='NABIL'` allows filtering by stock
   - `model_version='v2.1.0'` allows comparing versions
   - Without labels, you'd need separate metrics for each combination

3. **Use Cases for NEPSE**:
   - Monitor prediction latency against a target (for example, under 100 ms)
   - Track model accuracy degradation, which can indicate model or data drift
   - Alert when data ingestion fails
   - Monitor API rate limits

---

## **8.5 NoSQL Solutions**

NoSQL databases offer flexible schemas and horizontal scalability, making them suitable for certain time-series use cases.

### **8.5.1 Document Stores (MongoDB)**

Document stores like MongoDB are useful when data structure varies or when you need to store complex nested data with time-series.

```python
"""
MongoDB Storage for Time-Series Data

MongoDB is a document database that stores data in JSON-like format.
For time-series, MongoDB 5.0+ introduced specific optimizations:
- Time-series collections
- Automatic bucketing
- Improved compression

Use cases for NEPSE:
- Storing unstructured news/sentiment data alongside prices
- Flexible schema for different data sources
- Storing model metadata and experiment tracking
"""

from datetime import datetime
from typing import List, Dict, Any, Optional
import pandas as pd

try:
    from pymongo import MongoClient, ASCENDING, DESCENDING
    from pymongo.collection import Collection
    MONGODB_AVAILABLE = True
except ImportError:
    MONGODB_AVAILABLE = False


class MongoDBTimeSeriesStorage:
    """
    MongoDB storage handler for NEPSE time-series data.
    
    Two approaches:
    1. Time-series collections (MongoDB 5.0+) - optimized for time-series
    2. Regular collections with proper indexing - more flexible
    
    Schema Design:
    {
        "timestamp": ISODate("2024-01-15T00:00:00Z"),
        "symbol": "NABIL",
        "metadata": {
            "sector": "Banking",
            "market_cap": "Large"
        },
        "prices": {
            "open": 850.0,
            "high": 870.0,
            "low": 845.0,
            "close": 865.0
        },
        "volume": 125000,
        "indicators": {
            "sma_20": 860.5,
            "rsi": 65.3
        }
    }
    """
    
    def __init__(self, 
                 connection_string: str = "mongodb://localhost:27017",
                 database_name: str = "nepse"):
        """
        Initialize MongoDB connection.
        
        Args:
            connection_string: MongoDB connection URI
            database_name: Database name
        """
        if not MONGODB_AVAILABLE:
            raise ImportError("pymongo required. Install: pip install pymongo")
        
        self.client = MongoClient(connection_string)
        self.db = self.client[database_name]
        self._setup_collections()
    
    def _setup_collections(self):
        """Set up time-series collections with proper indexing."""
        # Check if time-series collection exists, create if not
        collections = self.db.list_collection_names()
        
        if 'stock_prices' not in collections:
            try:
                # Create time-series collection (MongoDB 5.0+)
                self.db.create_collection(
                    'stock_prices',
                    timeseries={
                        'timeField': 'timestamp',
                        'metaField': 'metadata',
                        'granularity': 'hours'
                    }
                )
                print("Created time-series collection: stock_prices")
            except Exception as e:
                # Fallback to regular collection for older MongoDB versions
                print(f"Time-series collection not supported ({e}), using regular collection")
                self.db.create_collection('stock_prices')
        
        # Create indexes for efficient querying
        # Compound index: symbol + timestamp (optimal for time-range queries)
        self.db.stock_prices.create_index([
            ('symbol', ASCENDING),
            ('timestamp', DESCENDING)
        ], name='symbol_time_idx')
        
        # Index for cross-sectional queries (specific date)
        self.db.stock_prices.create_index([
            ('timestamp', ASCENDING),
            ('symbol', ASCENDING)
        ], name='time_symbol_idx')
        
        # Single-field index for filtering by sector (a regular index, not a text index)
        self.db.stock_prices.create_index([
            ('metadata.sector', ASCENDING)
        ])
    
    def insert_data(self, df: pd.DataFrame) -> int:
        """
        Insert stock data into MongoDB.
        
        Args:
            df: DataFrame with stock data
        
        Returns:
            Number of documents inserted
        """
        documents = []
        
        for _, row in df.iterrows():
            # Parse date
            if isinstance(row['Date'], str):
                timestamp = datetime.strptime(row['Date'], '%Y-%m-%d')
            else:
                timestamp = row['Date']
            
            # Create document
            doc = {
                'timestamp': timestamp,
                'symbol': row['Symbol'],
                'metadata': {
                    'sector': row.get('Sector', 'Unknown'),
                    'source': 'nepse_api'
                },
                'prices': {
                    'open': float(row.get('Open', 0)),
                    'high': float(row.get('High', 0)),
                    'low': float(row.get('Low', 0)),
                    'close': float(row.get('Close', 0)),
                    'vwap': float(row.get('VWAP', 0))
                },
                'volume': int(row.get('Volume', 0)),
                'turnover': float(row.get('Turnover', 0)),
                'transactions': int(row.get('Trans.', 0))
            }
            
            # Add calculated fields if present
            if 'Diff' in row:
                doc['change'] = {
                    'absolute': float(row['Diff']),
                    'percent': float(row.get('Diff %', 0))
                }
            
            # Add technical indicators if present
            indicators = {}
            for col in ['sma_20', 'sma_50', 'rsi', 'macd']:
                if col in row:
                    indicators[col] = float(row[col])
            
            if indicators:
                doc['indicators'] = indicators
            
            documents.append(doc)
        
        # Insert in bulk
        if documents:
            result = self.db.stock_prices.insert_many(documents, ordered=False)
            return len(result.inserted_ids)
        
        return 0
    
    def query_time_range(self, 
                        symbol: str,
                        start_date: datetime,
                        end_date: datetime) -> pd.DataFrame:
        """
        Query stock data for a time range.
        
        Args:
            symbol: Stock symbol
            start_date: Start datetime
            end_date: End datetime
        
        Returns:
            DataFrame with results
        """
        query = {
            'symbol': symbol,
            'timestamp': {
                '$gte': start_date,
                '$lte': end_date
            }
        }
        
        # Projection to select only needed fields
        projection = {
            'timestamp': 1,
            'symbol': 1,
            'prices.open': 1,
            'prices.high': 1,
            'prices.low': 1,
            'prices.close': 1,
            'volume': 1,
            '_id': 0
        }
        
        cursor = self.db.stock_prices.find(query, projection).sort('timestamp', ASCENDING)
        
        # Convert to DataFrame
        data = list(cursor)
        if not data:
            return pd.DataFrame()
        
        # Flatten nested structure
        flattened = []
        for doc in data:
            flat = {
                'timestamp': doc['timestamp'],
                'symbol': doc['symbol'],
                'open': doc['prices']['open'],
                'high': doc['prices']['high'],
                'low': doc['prices']['low'],
                'close': doc['prices']['close'],
                'volume': doc['volume']
            }
            flattened.append(flat)
        
        return pd.DataFrame(flattened)
    
    def query_with_aggregation(self,
                                symbol: str,
                                start_date: datetime) -> List[Dict]:
        """
        Query with MongoDB aggregation pipeline.
        
        Useful for complex analytics like:
        - Moving averages
        - Resampling
        - Statistical aggregations
        
        Args:
            symbol: Stock symbol
            start_date: Start date
        
        Returns:
            List of aggregated results
        """
        pipeline = [
            # Match stage: Filter documents
            {
                '$match': {
                    'symbol': symbol,
                    'timestamp': {'$gte': start_date}
                }
            },
            # Group stage: Aggregate by month
            {
                '$group': {
                    '_id': {
                        'year': {'$year': '$timestamp'},
                        'month': {'$month': '$timestamp'}
                    },
                    'avg_close': {'$avg': '$prices.close'},
                    'max_high': {'$max': '$prices.high'},
                    'min_low': {'$min': '$prices.low'},
                    'total_volume': {'$sum': '$volume'},
                    'count': {'$sum': 1}
                }
            },
            # Sort stage
            {
                '$sort': {'_id.year': -1, '_id.month': -1}
            }
        ]
        
        results = list(self.db.stock_prices.aggregate(pipeline))
        return results
    
    def update_with_indicators(self,
                               symbol: str,
                               date: datetime,
                               indicators: Dict[str, float]):
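        """
        Update documents with calculated indicators.
        
        Note: MongoDB restricts updates on time-series collections (per the
        MongoDB documentation, typically to metaField values via update_many),
        so this method is intended for the regular-collection fallback.
        
        Args:
            symbol: Stock symbol
            date: Date
            indicators: Dictionary of indicator values
        """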
        self.db.stock_prices.update_one(
            {
                'symbol': symbol,
                'timestamp': date
            },
            {
                '$set': {
                    'indicators': indicators,
                    'updated_at': datetime.now()
                }
            }
        )
    
    def get_distinct_symbols(self) -> List[str]:
        """Get list of all symbols in the collection."""
        return self.db.stock_prices.distinct('symbol')
    
    def close(self):
        """Close MongoDB connection."""
        self.client.close()


def demonstrate_mongodb():
    """
    Demonstrate MongoDB time-series storage.
    """
    if not MONGODB_AVAILABLE:
        print("MongoDB demonstration skipped (pymongo not installed)")
        return
    
    print("=" * 70)
    print("MongoDB Time-Series Storage Demonstration")
    print("=" * 70)
    
    try:
        storage = MongoDBTimeSeriesStorage()
        
        # Generate sample data
        import numpy as np
        dates = pd.date_range('2024-01-01', '2024-01-31', freq='B')
        
        data = []
        for date in dates:
            data.append({
                'Date': date,
                'Symbol': 'NABIL',
                'Sector': 'Banking',
                'Open': 850 + np.random.randint(-20, 20),
                'High': 870 + np.random.randint(-20, 20),
                'Low': 840 + np.random.randint(-20, 20),
                'Close': 860 + np.random.randint(-20, 20),
                'Volume': np.random.randint(100000, 200000),
                'Turnover': 150000000 + np.random.randint(-10000000, 10000000),
                'Trans.': 500 + np.random.randint(-100, 100),
                'Diff': np.random.randint(-10, 10),
                'Diff %': np.random.uniform(-1, 1)
            })
        
        df = pd.DataFrame(data)
        
        print("\n1. Inserting data...")
        count = storage.insert_data(df)
        print(f"Inserted {count} documents")
        
        print("\n2. Querying time range...")
        result = storage.query_time_range(
            'NABIL',
            datetime(2024, 1, 1),
            datetime(2024, 1, 15)
        )
        print(f"Retrieved {len(result)} records")
        print(result.head())
        
        print("\n3. Aggregation query...")
        agg_results = storage.query_with_aggregation(
            'NABIL',
            datetime(2024, 1, 1)
        )
        print(f"Monthly aggregations: {len(agg_results)}")
        
        storage.close()
        
    except Exception as e:
        print(f"MongoDB connection failed (is MongoDB running?): {e}")
```

**Detailed Explanation:**

1. **Time-Series Collections**: MongoDB 5.0+ has native time-series support:
   - `timeField`: The timestamp field (required)
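   - `metaField`: Field holding metadata that rarely changes and identifies the series; the example uses `metadata` (sector, source), and the symbol would normally belong there as well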
   - `granularity`: Hint for bucketing (seconds, minutes, hours)
   - Automatic compression and indexing

2. **Schema Flexibility**: Documents can have varying structures:
   - Some documents have `indicators` field, others don't
   - Can add new fields without schema migration
   - Nested documents (`prices.open`) organize related data

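3. **Aggregation Pipeline**: A sequence of stages, each transforming the documents passed on by the previous stage:
   - `$match`: Filter documents (like WHERE)
   - `$group`: Aggregate (like GROUP BY)
   - `$sort`: Order results
   - Further stages such as `$setWindowFields` (MongoDB 5.0+) support window calculations like moving averages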
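
As a sketch of a window calculation in the aggregation pipeline, the function below builds a simple-moving-average pipeline using the `$setWindowFields` stage, which requires MongoDB 5.0 or later. `build_sma_pipeline` is a hypothetical helper written for this illustration; the field names (`symbol`, `timestamp`, `prices.close`) follow the schema used in this section. The code only constructs the pipeline as plain Python data, so it runs without a database.

```python
from datetime import datetime
from typing import Any, Dict, List


def build_sma_pipeline(symbol: str,
                       start_date: datetime,
                       window: int = 5) -> List[Dict[str, Any]]:
    """Build a pipeline computing a moving average of the close price."""
    if window < 1:
        raise ValueError("window must be at least 1")

    sma_field = f'sma_{window}'
    return [
        # Filter first so later stages see only the relevant documents
        {'$match': {'symbol': symbol, 'timestamp': {'$gte': start_date}}},
        # Average the current document and the (window - 1) preceding ones
        {'$setWindowFields': {
            'partitionBy': '$symbol',
            'sortBy': {'timestamp': 1},
            'output': {
                sma_field: {
                    '$avg': '$prices.close',
                    'window': {'documents': [-(window - 1), 0]},
                }
            },
        }},
        # Return a flat document with only the fields of interest
        {'$project': {
            '_id': 0,
            'timestamp': 1,
            'symbol': 1,
            'close': '$prices.close',
            sma_field: 1,
        }},
    ]


pipeline = build_sma_pipeline('NABIL', datetime(2024, 1, 1), window=5)
for stage in pipeline:
    print(stage)
```

With a live connection, the result would be passed to `collection.aggregate(pipeline)`, in the same way as the monthly `$group` pipeline in `query_with_aggregation`.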

### **8.5.2 Key-Value Stores (Redis)**

Redis is an in-memory data structure store, excellent for caching and real-time applications.

```python
"""
Redis Storage for Time-Series Data

Redis is an in-memory key-value store with several time-series use cases:
1. Caching recent data for fast access
2. Real-time streaming data
3. Session storage for prediction API
4. Rate limiting

The RedisTimeSeries module (included in Redis Stack and Redis Enterprise)
provides native time-series capabilities.

For NEPSE:
- Cache latest prices for quick API responses
- Store real-time model predictions
- Queue system for async processing
"""

from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional, Union
import json
import pandas as pd

try:
    import redis
    REDIS_AVAILABLE = True
except ImportError:
    REDIS_AVAILABLE = False


class RedisTimeSeriesCache:
    """
    Redis cache for NEPSE time-series data.
    
    Key patterns:
    - price:{symbol}:latest -> Latest price data (hash)
    - price:{symbol}:history -> Recent price history (sorted set)
    - prediction:{symbol}:{prediction_date} -> Model prediction (hash)
    - metadata:{symbol} -> Stock metadata (hash)
    - queue:ingestion -> Data ingestion queue (list)
    """
    
    def __init__(self, 
                 host: str = 'localhost',
                 port: int = 6379,
                 db: int = 0,
                 password: Optional[str] = None):
        """
        Initialize Redis connection.
        
        Args:
            host: Redis host
            port: Redis port
            db: Database number (0-15)
            password: Authentication password
        """
        if not REDIS_AVAILABLE:
            raise ImportError("redis required. Install: pip install redis")
        
        self.r = redis.Redis(
            host=host,
            port=port,
            db=db,
            password=password,
            decode_responses=True  # Auto-decode bytes to strings
        )
    
    def cache_latest_price(self, 
                          symbol: str,
                          price_data: Dict[str, Any],
                          ttl_seconds: int = 3600):
        """
        Cache latest price data with TTL (time-to-live).
        
        Args:
            symbol: Stock symbol
            price_data: Dictionary with price information
            ttl_seconds: Time to live in seconds (default 1 hour)
        """
        key = f"price:{symbol}:latest"
        
        # Convert to hash (field-value pairs)
        # Convert non-string values to JSON
        data = {}
        for k, v in price_data.items():
            if isinstance(v, (dict, list)):
                data[k] = json.dumps(v)
            else:
                data[k] = str(v)
        
        self.r.hset(key, mapping=data)
        self.r.expire(key, ttl_seconds)  # Set expiration
    
    def get_latest_price(self, symbol: str) -> Optional[Dict[str, Any]]:
        """
        Get cached latest price for a symbol.
        
        Args:
            symbol: Stock symbol
        
        Returns:
            Dictionary with price data or None if not cached
        """
        key = f"price:{symbol}:latest"
        data = self.r.hgetall(key)
        
        if not data:
            return None
        
        # Convert string values back to appropriate types
        result = {}
        for k, v in data.items():
            # Try to parse as JSON first
            try:
                result[k] = json.loads(v)
            except json.JSONDecodeError:
                # Try to convert to number
                try:
                    if '.' in v:
                        result[k] = float(v)
                    else:
                        result[k] = int(v)
                except ValueError:
                    result[k] = v
        
        return result
    
    def add_to_history(self,
                      symbol: str,
                      timestamp: datetime,
                      close_price: float,
                      max_entries: int = 1000):
        """
        Add price to historical sorted set.
        
        Uses Redis Sorted Sets (ZADD) where score is timestamp.
        Allows efficient range queries by time.
        
        Args:
            symbol: Stock symbol
            timestamp: Price timestamp
            close_price: Closing price
            max_entries: Maximum history entries to keep
        """
        key = f"price:{symbol}:history"
        
        # Convert timestamp to Unix timestamp (float) for scoring
        score = timestamp.timestamp()
        
        # Add to sorted set
        # Member is the price (or can be JSON with more data)
        member = json.dumps({
            'timestamp': timestamp.isoformat(),
            'close': close_price
        })
        
        self.r.zadd(key, {member: score})
        
        # Trim to max entries (keep most recent)
        self.r.zremrangebyrank(key, 0, -(max_entries + 1))
        
        # Set expiration (e.g., 7 days for history)
        self.r.expire(key, 7 * 24 * 3600)
    
    def get_price_history(self,
                         symbol: str,
                         start_date: Optional[datetime] = None,
                         end_date: Optional[datetime] = None) -> List[Dict]:
        """
        Get price history from Redis.
        
        Args:
            symbol: Stock symbol
            start_date: Start date (None for all)
            end_date: End date (None for all)
        
        Returns:
            List of price records
        """
        key = f"price:{symbol}:history"
        
        # Convert dates to Unix timestamps
        min_score = start_date.timestamp() if start_date else '-inf'
        max_score = end_date.timestamp() if end_date else '+inf'
        
        # Get members by score range
        members = self.r.zrangebyscore(
            key,
            min_score,
            max_score,
            withscores=False
        )
        
        # Parse JSON members
        results = []
        for member in members:
            try:
                data = json.loads(member)
                results.append(data)
            except json.JSONDecodeError:
                continue
        
        return results
    
    def cache_prediction(self,
                         symbol: str,
                         prediction_date: str,
                         predicted_value: float,
                         confidence: float,
                         model_version: str,
                         ttl_hours: int = 24):
        """
        Cache model prediction.
        
        Args:
            symbol: Stock symbol
            prediction_date: Target date for prediction
            predicted_value: Predicted price
            confidence: Prediction confidence (0-1)
            model_version: Model version string
            ttl_hours: Cache duration
        """
        key = f"prediction:{symbol}:{prediction_date}"
        
        data = {
            'symbol': symbol,
            'prediction_date': prediction_date,
            'predicted_value': str(predicted_value),
            'confidence': str(confidence),
            'model_version': model_version,
            'created_at': datetime.now().isoformat()
        }
        
        self.r.hset(key, mapping=data)
        self.r.expire(key, ttl_hours * 3600)
    
    def get_cached_prediction(self, 
                              symbol: str, 
                              prediction_date: str) -> Optional[Dict]:
        """Get cached prediction if available."""
        key = f"prediction:{symbol}:{prediction_date}"
        return self.r.hgetall(key) or None
    
    def enqueue_data_ingestion(self, data: Dict[str, Any]):
        """
        Add data to ingestion queue.
        
        Uses Redis List (LPUSH/BRPOP) for queue semantics.
        """
        self.r.lpush('queue:ingestion', json.dumps(data))
    
    def dequeue_data_ingestion(self, timeout: int = 5) -> Optional[Dict]:
        """
        Pop data from ingestion queue (blocking).
        
        Args:
            timeout: Block for up to N seconds
        
        Returns:
            Data dictionary or None if timeout
        """
        result = self.r.brpop('queue:ingestion', timeout=timeout)
        if result:
            _, data = result
            return json.loads(data)
        return None
    
    def increment_counter(self, 
                         metric_name: str,
                         amount: int = 1):
        """
        Increment a counter (for metrics).
        
        Args:
            metric_name: Metric name
            amount: Amount to increment
        """
        self.r.incr(f"counter:{metric_name}", amount)
    
    def get_counter(self, metric_name: str) -> int:
        """Get current counter value."""
        value = self.r.get(f"counter:{metric_name}")
        return int(value) if value else 0
    
    def clear_cache(self, pattern: str = "*"):
        """
        Clear cached data matching pattern.
        
        Args:
            pattern: Key pattern to clear (e.g., "price:NABIL:*")
        """
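        # SCAN iterates incrementally; KEYS would block the server
        # while it walks the entire keyspace.
        keys = list(self.r.scan_iter(match=pattern))
        if keys:
            self.r.delete(*keys)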
    
    def close(self):
        """Close Redis connection."""
        self.r.close()


def demonstrate_redis():
    """
    Demonstrate Redis caching for NEPSE data.
    """
    if not REDIS_AVAILABLE:
        print("Redis demonstration skipped (redis not installed)")
        return
    
    print("=" * 70)
    print("Redis Cache Demonstration")
    print("=" * 70)
    
    try:
        cache = RedisTimeSeriesCache()
        
        print("\n1. Caching latest price...")
        cache.cache_latest_price('NABIL', {
            'symbol': 'NABIL',
            'date': '2024-01-15',
            'close': 865.0,
            'open': 850.0,
            'high': 870.0,
            'low': 845.0,
            'volume': 125000,
            'change_pct': 1.76
        })
        
        latest = cache.get_latest_price('NABIL')
        print(f"Cached price: {latest}")
        
        print("\n2. Adding to price history...")
        import numpy as np
        base_date = datetime(2024, 1, 1)
        for i in range(10):
            date = base_date + timedelta(days=i)
            price = 850 + np.random.randint(-20, 20)
            cache.add_to_history('NABIL', date, float(price))
        
        history = cache.get_price_history('NABIL')
        print(f"History entries: {len(history)}")
        print(f"First entry: {history[0] if history else 'None'}")
        
        print("\n3. Caching prediction...")
        cache.cache_prediction('NABIL', '2024-01-16', 870.5, 0.85, 'v2.1.0')
        pred = cache.get_cached_prediction('NABIL', '2024-01-16')
        print(f"Cached prediction: {pred}")
        
        print("\n4. Queue operations...")
        cache.enqueue_data_ingestion({'symbol': 'NICA', 'price': 780.0})
        item = cache.dequeue_data_ingestion()
        print(f"Dequeued: {item}")
        
        print("\n5. Counters...")
        cache.increment_counter('predictions_made', 5)
        count = cache.get_counter('predictions_made')
        print(f"Predictions made: {count}")
        
        cache.close()
        print("\nRedis demonstration completed successfully")
        
    except redis.ConnectionError as e:
        print(f"Redis connection failed (is Redis running?): {e}")
```

**Detailed Explanation:**

1. **Key Patterns**: Redis keys follow a naming convention:
   - `price:NABIL:latest` - Hash with latest price fields
   - `price:NABIL:history` - Sorted set with historical prices
   - `prediction:NABIL:2024-01-16` - Specific prediction
   - `queue:ingestion` - List for queue operations

2. **Data Structures**:
   - **Hash**: Store object-like data (latest price with multiple fields)
   - **Sorted Set**: Time-series data kept ordered by score (here, the Unix timestamp)
   - **List**: Queue implementation (LPUSH to add, BRPOP to consume)
   - **String**: Simple counters and values

3. **TTL (Time-To-Live)**: Automatic expiration:
   - Latest prices: 1 hour (refresh frequently)
   - History: 7 days (keep recent data)
   - Predictions: 24 hours (stale predictions are useless)

4. **Use Cases**:
   - **Caching**: Store frequently accessed data in memory (typically sub-millisecond access)
   - **Rate Limiting**: Track API usage per client
   - **Pub/Sub**: Real-time updates to connected clients
   - **Session Store**: User sessions for web dashboard
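
The rate-limiting use case can be sketched with the same counter pattern the cache class uses (`INCR` plus a TTL). This is a minimal fixed-window sketch, not part of the class above: the `ratelimit:{client}:{window}` key pattern, the `allow_request` function, and the `InMemoryCounterStore` stand-in are all hypothetical names introduced here so the example runs without a Redis server. A `redis.Redis` client could be passed in place of the stand-in, since it exposes `incr()` and `expire()` with compatible call shapes.

```python
import time
from typing import Dict, Optional


class InMemoryCounterStore:
    """Hypothetical stand-in exposing the two Redis commands the limiter needs."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}

    def incr(self, key: str, amount: int = 1) -> int:
        self._values[key] = self._values.get(key, 0) + amount
        return self._values[key]

    def expire(self, key: str, seconds: int) -> bool:
        # A real server deletes the key after `seconds`; the stand-in keeps it,
        # which is harmless here because every window gets its own key.
        return key in self._values


def allow_request(store,
                  client_id: str,
                  limit: int = 5,
                  window_seconds: int = 60,
                  now: Optional[float] = None) -> bool:
    """Fixed-window limit: at most `limit` calls per window per client."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{client_id}:{window}"  # hypothetical key pattern
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the counter expire on its own later.
        store.expire(key, window_seconds * 2)
    return count <= limit


store = InMemoryCounterStore()
decisions = [allow_request(store, "dashboard", limit=3, now=1000.0)
             for _ in range(5)]
print(decisions)  # [True, True, True, False, False]
```

A fixed window is the simplest variant; it allows short bursts at window boundaries, so a sliding-window or token-bucket scheme may be preferable when limits must be strict.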

### **8.5.3 Wide-Column Stores (Cassandra)**

Wide-column stores like Apache Cassandra excel at handling massive write loads and providing high availability.

```python
"""
Apache Cassandra Storage for Time-Series Data

Cassandra is designed for:
- Massive scale (petabytes of data)
- High write throughput
- Distributed architecture (no single point of failure)
- Time-series data (used by Apple, Netflix for time-series)

Data Model for NEPSE:
- Partition key: symbol (distributes data across nodes)
- Clustering key: trade_date (orders data within partition)
- Columns: open, high, low, close, volume

Trade-offs:
+ Excellent write performance
+ Linear scalability
+ High availability
- No JOINs (must denormalize)
- Tunable consistency, eventually consistent by default (no multi-row ACID transactions)
- Complex to operate
"""

from datetime import datetime
from typing import List, Dict, Any, Optional

try:
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement, BatchStatement
    CASSANDRA_AVAILABLE = True
except ImportError:
    CASSANDRA_AVAILABLE = False


class CassandraTimeSeriesStorage:
    """
    Cassandra storage for NEPSE time-series data.
    
    Schema:
    CREATE KEYSPACE nepse WITH replication = {
        'class': 'SimpleStrategy', 
        'replication_factor': 3
    };
    
    CREATE TABLE stock_prices (
        symbol text,
        trade_date date,
        open_price double,
        high_price double,
        low_price double,
        close_price double,
        volume bigint,
        PRIMARY KEY (symbol, trade_date)
    ) WITH CLUSTERING ORDER BY (trade_date DESC);
    """
    
    def __init__(self, 
                 hosts: Optional[List[str]] = None,
                 keyspace: str = 'nepse'):
        """
        Initialize Cassandra connection.
        
        Args:
            hosts: List of Cassandra node IPs (defaults to ['127.0.0.1'])
            keyspace: Keyspace (database) name
        """
        if not CASSANDRA_AVAILABLE:
            raise ImportError("cassandra-driver required. Install: pip install cassandra-driver")
        
        # Avoid a mutable default argument; fall back to localhost
        self.cluster = Cluster(hosts or ['127.0.0.1'])
        self.session = self.cluster.connect()
        self.keyspace = keyspace
        
        self._setup_schema()
    
    def _setup_schema(self):
        """Create keyspace and tables if they don't exist."""
        # Create keyspace with SimpleStrategy (for single datacenter)
        # NetworkTopologyStrategy for multiple datacenters
        self.session.execute(f"""
            CREATE KEYSPACE IF NOT EXISTS {self.keyspace}
            WITH replication = {{
                'class': 'SimpleStrategy',
                'replication_factor': 3
            }}
        """)
        
        self.session.set_keyspace(self.keyspace)
        
        # Create stock_prices table
        # Partition key: symbol (data distributed by symbol)
        # Clustering key: trade_date (sorted within each symbol partition)
        self.session.execute("""
            CREATE TABLE IF NOT EXISTS stock_prices (
                symbol text,
                trade_date date,
                open_price double,
                high_price double,
                low_price double,
                close_price double,
                volume bigint,
                turnover double,
                PRIMARY KEY (symbol, trade_date)
            ) WITH CLUSTERING ORDER BY (trade_date DESC)
              AND compaction = {
                'class': 'TimeWindowCompactionStrategy',
                'compaction_window_unit': 'DAYS',
                'compaction_window_size': 7
              }
        """)
        
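        # Create materialized view for time-range queries across symbols
        # Cassandra doesn't allow querying by date only (need partition key)
        # Materialized view duplicates data but allows different query patterns
        # Note: materialized views are experimental and disabled by default
        # in Cassandra 4.0+; they must be enabled in cassandra.yaml first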
        self.session.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS stock_prices_by_date AS
            SELECT symbol, trade_date, close_price, volume
            FROM stock_prices
            WHERE trade_date IS NOT NULL AND symbol IS NOT NULL
            PRIMARY KEY (trade_date, symbol)
            WITH CLUSTERING ORDER BY (symbol ASC)
        """)
    
    def insert_data(self, 
                    symbol: str,
                    trade_date: datetime,
                    open_p: float,
                    high_p: float,
                    low_p: float,
                    close_p: float,
                    volume: int,
                    turnover: float = 0.0):
        """
        Insert a single price record.
        
        Args:
            symbol: Stock symbol
            trade_date: Trading date
            open_p: Opening price
            high_p: High price
            low_p: Low price
            close_p: Closing price
            volume: Trading volume
            turnover: Total turnover
        """
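        # '?' bind markers require a prepared statement; plain string
        # queries in cassandra-driver use '%s' placeholders instead.
        # (For repeated calls, prepare once and reuse the statement.)
        query = self.session.prepare("""
            INSERT INTO stock_prices 
            (symbol, trade_date, open_price, high_price, low_price, close_price, volume, turnover)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?)
        """)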
        
        self.session.execute(query, (
            symbol, trade_date.date(), open_p, high_p, low_p, close_p, volume, turnover
        ))
    
    def batch_insert(self, records: List[Dict[str, Any]]):
        """
        Insert multiple records concurrently for throughput.
        
        Note: Cassandra batches are for atomicity within a partition,
        not for performance, so this method issues concurrent
        individual inserts instead of a BatchStatement.
        
        Args:
            records: List of record dictionaries
        """
        from cassandra.concurrent import execute_concurrent_with_args
        
        # '?' bind markers require a prepared statement
        insert = self.session.prepare("""
            INSERT INTO stock_prices 
            (symbol, trade_date, open_price, high_price, low_price, close_price, volume)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """)
        
        params = [
            (
                record['symbol'],
                record['trade_date'],
                record['open'],
                record['high'],
                record['low'],
                record['close'],
                record['volume']
            )
            for record in records
        ]
        
        # Execute concurrently; collect failures instead of raising on the first one
        results = execute_concurrent_with_args(
            self.session, insert, params, raise_on_first_error=False
        )
        
        # Each result is a (success, result_or_exc) named tuple
        errors = [r for r in results if not r.success]
        if errors:
            print(f"Errors during batch insert: {len(errors)}")
    
    def query_by_symbol(self,
                        symbol: str,
                        start_date: Optional[datetime] = None,
                        end_date: Optional[datetime] = None,
                        limit: int = 1000) -> List[Dict]:
        """
        Query stock prices by symbol.
        
        This is efficient because symbol is the partition key.
        
        Args:
            symbol: Stock symbol
            start_date: Start date filter
            end_date: End date filter
            limit: Maximum results
        
        Returns:
            List of price records
        """
        # Build the WHERE clause from whichever bounds were supplied
        clauses = ["symbol = ?"]
        params: List[Any] = [symbol]
        if start_date:
            clauses.append("trade_date >= ?")
            params.append(start_date.date())
        if end_date:
            clauses.append("trade_date <= ?")
            params.append(end_date.date())
        params.append(limit)
        
        # '?' bind markers require a prepared statement
        query = self.session.prepare(
            f"SELECT * FROM stock_prices WHERE {' AND '.join(clauses)} LIMIT ?"
        )
        
        rows = self.session.execute(query, params)
        
        results = []
        for row in rows:
            results.append({
                'symbol': row.symbol,
                'trade_date': row.trade_date,
                'open': row.open_price,
                'high': row.high_price,
                'low': row.low_price,
                'close': row.close_price,
                'volume': row.volume
            })
        
        return results
    
    def query_recent(self, symbol: str, days: int = 30) -> List[Dict]:
        """
        Query recent data for a symbol.
        
        Takes advantage of CLUSTERING ORDER BY (trade_date DESC)
        to efficiently get most recent data.
        
        Args:
            symbol: Stock symbol
            days: Number of days to retrieve
        
        Returns:
            List of recent price records
        """
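        # Since data is ordered DESC by date, we can just LIMIT.
        # '?' bind markers require a prepared statement.
        query = self.session.prepare("""
            SELECT * FROM stock_prices 
            WHERE symbol = ?
            LIMIT ?
        """)
        
        # One row per trading day, so LIMIT `days` returns the most recent
        # `days` trading sessions (not calendar days)
        rows = self.session.execute(query, (symbol, days))
        # Rows are named tuples by default; convert to dicts to match the return type
        return [row._asdict() for row in rows]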
    
    def close(self):
        """Close Cassandra connection."""
        self.cluster.shutdown()


def demonstrate_cassandra():
    """
    Demonstrate Cassandra storage (requires running Cassandra instance).
    """
    if not CASSANDRA_AVAILABLE:
        print("Cassandra demonstration skipped (cassandra-driver not installed)")
        return
    
    print("=" * 70)
    print("Apache Cassandra Storage Demonstration")
    print("=" * 70)
    
    print("""
    Cassandra Setup for NEPSE:
    
    1. Data Model:
       - Partition Key: symbol (distributes across nodes)
       - Clustering Key: trade_date (sorts within partition)
       - Columns: OHLCV data
    
    2. Compaction Strategy:
       - TimeWindowCompactionStrategy (TWCS) optimized for time-series
       - Groups data into 7-day windows
       - Efficient for time-range queries and TTL
    
    3. Replication:
       - replication_factor: 3 (3 copies for high availability)
       - Tunable consistency (ONE, QUORUM, ALL)
    
    4. Use Cases:
       - High-frequency trading data (tick data)
       - Large scale (billions of records)
       - Multi-datacenter replication
    """)
    
    try:
        storage = CassandraTimeSeriesStorage()
        print("\nCassandra connected successfully")
        
        # Example operations
        print("\nExample: Insert and query data")
        print("storage.insert_data('NABIL', datetime.now(), 850.0, 870.0, 845.0, 865.0, 125000)")
        print("results = storage.query_by_symbol('NABIL')")
        
        storage.close()
        
    except Exception as e:
        print(f"Cassandra connection failed (is Cassandra running?): {e}")
```

**Detailed Explanation:**

1. **Partition Key**: `symbol` distributes data across cluster nodes:
   - All data for 'NABIL' is on the same node (or replica set)
   - Allows efficient queries for single symbol
   - Prevents cross-node queries for symbol-specific data

2. **Clustering Key**: `trade_date` orders data within each partition:
   - `WITH CLUSTERING ORDER BY (trade_date DESC)` stores newest first
   - Allows efficient "latest N records" queries
   - Enables time-range filtering within partition

3. **TimeWindowCompactionStrategy**: Optimized compaction for time-series:
   - Groups data into time windows (7 days)
   - Expires entire windows at once (efficient TTL)
   - Optimizes for time-range queries

4. **Materialized Views**: Allow querying by different keys:
   - Main table: Query by symbol (partition key)
   - View: Query by date (for cross-sectional queries)
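   - Cassandra maintains the view automatically, but view updates are asynchronous (eventually consistent), and materialized views are experimental and disabled by default in Cassandra 4.0+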
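
The interaction between the partition key and the clustering key can be illustrated with a small in-memory model. This is only a conceptual sketch, not how Cassandra is implemented: `PartitionedTable` and its methods are hypothetical names, and the model ignores replication, upserts, and on-disk storage. It shows why "latest N for one symbol" touches a single partition, while "all symbols on one date" has to visit every partition unless a second, date-keyed copy of the data exists.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Row = Tuple[date, float]  # (trade_date, close_price)


class PartitionedTable:
    """Toy model: one sorted row list per partition key (symbol)."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, symbol: str, trade_date: date, close: float) -> None:
        # insort keeps each partition ordered by the clustering key.
        insort(self._partitions[symbol], (trade_date, close))

    def latest(self, symbol: str, n: int) -> List[Row]:
        # One partition lookup, then read from the newest end.
        return self._partitions.get(symbol, [])[::-1][:n]

    def date_range(self, symbol: str, start: date, end: date) -> List[Row]:
        # Binary search inside a single partition (inclusive bounds).
        rows = self._partitions.get(symbol, [])
        lo = bisect_left(rows, (start,))
        hi = bisect_right(rows, (end, float("inf")))
        return rows[lo:hi]

    def by_date(self, trade_date: date) -> List[Tuple[str, float]]:
        # No partition key given: every partition has to be scanned.
        hits = [(symbol, close)
                for symbol, rows in self._partitions.items()
                for row_date, close in rows
                if row_date == trade_date]
        return sorted(hits)


table = PartitionedTable()
for day, nabil, nica in [(1, 850.0, 780.0), (2, 855.0, 782.0), (3, 861.0, 779.0)]:
    table.insert("NABIL", date(2024, 1, day), nabil)
    table.insert("NICA", date(2024, 1, day), nica)

print(table.latest("NABIL", 2))
print(table.by_date(date(2024, 1, 2)))
```

In the model, `by_date` is the expensive path; the materialized view in the storage class plays the role of a second table keyed by date so that this query also becomes a single-partition lookup.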

---

## **8.6 Cloud Storage Solutions**

Cloud providers offer managed storage services that remove most of the operational overhead of running storage infrastructure yourself.

```python
"""
Cloud Storage Solutions for NEPSE Time-Series Data

This module demonstrates integration with major cloud storage services:
- AWS S3: Object storage for files
- AWS RDS: Managed relational databases
- Google BigQuery: Analytics data warehouse
- Azure Blob Storage: Object storage

Benefits:
- No infrastructure management
- Automatic scaling
- Built-in durability (S3 is designed for 99.999999999%)
- Pay-as-you-go pricing
- Global availability
"""

from pathlib import Path
from typing import Optional, List, Dict, Any
import pandas as pd
import io


class AWSStorage:
    """
    AWS Storage integration for NEPSE data.
    
    Services:
    - S3: Store CSV, Parquet, HDF5 files
    - RDS: Managed PostgreSQL/TimescaleDB
    - DynamoDB: NoSQL for metadata
    """
    
    def __init__(self, 
                 aws_access_key: Optional[str] = None,
                 aws_secret_key: Optional[str] = None,
                 region: str = 'us-east-1'):
        """
        Initialize AWS storage.
        
        Args:
            aws_access_key: AWS access key (or use IAM role)
            aws_secret_key: AWS secret key
            region: AWS region
        """
        try:
            import boto3
            self.boto3 = boto3
        except ImportError:
            raise ImportError("boto3 required. Install: pip install boto3")
        
        # Initialize sessions
        if aws_access_key and aws_secret_key:
            self.session = self.boto3.Session(
                aws_access_key_id=aws_access_key,
                aws_secret_access_key=aws_secret_key,
                region_name=region
            )
        else:
            # Use default credential chain (IAM role, ~/.aws/credentials, etc.)
            self.session = self.boto3.Session(region_name=region)
        
        self.s3 = self.session.client('s3')
        self.rds = self.session.client('rds')
    
    def upload_to_s3(self,
                     local_path: str,
                     bucket: str,
                     s3_key: Optional[str] = None,
                     storage_class: str = 'STANDARD'):
        """
        Upload file to S3.
        
        Args:
            local_path: Local file path
            bucket: S3 bucket name
            s3_key: S3 object key (path in bucket)
            storage_class: S3 storage class
                          STANDARD, STANDARD_IA, GLACIER, etc.
        """
        if s3_key is None:
            s3_key = Path(local_path).name
        
        self.s3.upload_file(
            local_path,
            bucket,
            s3_key,
            ExtraArgs={'StorageClass': storage_class}
        )
        
        print(f"Uploaded {local_path} to s3://{bucket}/{s3_key}")
    
    def download_from_s3(self,
                        bucket: str,
                        s3_key: str,
                        local_path: str):
        """
        Download file from S3.
        
        Args:
            bucket: S3 bucket name
            s3_key: S3 object key
            local_path: Local destination path
        """
        self.s3.download_file(bucket, s3_key, local_path)
        print(f"Downloaded s3://{bucket}/{s3_key} to {local_path}")
    
    def upload_dataframe_s3(self,
                            df: pd.DataFrame,
                            bucket: str,
                            s3_key: str,
                            format: str = 'parquet'):
        """
        Upload DataFrame directly to S3 without saving locally.
        
        Args:
            df: DataFrame to upload
            bucket: S3 bucket name
            s3_key: S3 object key
            format: 'parquet', 'csv', or 'json'
        """
        buffer = io.BytesIO()
        
        if format == 'parquet':
            df.to_parquet(buffer, index=False)
            content_type = 'application/octet-stream'
        elif format == 'csv':
            df.to_csv(buffer, index=False)
            content_type = 'text/csv'
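        elif format == 'json':
            df.to_json(buffer, orient='records')
            content_type = 'application/json'
        else:
            # Without this branch, content_type would be undefined below
            raise ValueError(f"Unsupported format: {format}")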
        
        buffer.seek(0)
        
        self.s3.put_object(
            Bucket=bucket,
            Key=s3_key,
            Body=buffer.getvalue(),
            ContentType=content_type
        )
        
        print(f"Uploaded DataFrame to s3://{bucket}/{s3_key}")
    
    def list_s3_objects(self,
                         bucket: str,
                         prefix: str = '') -> List[Dict]:
        """
        List objects in S3 bucket.
        
        Args:
            bucket: S3 bucket name
            prefix: Key prefix filter
        
        Returns:
            List of object metadata
        """
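        # list_objects_v2 returns at most 1,000 keys per call,
        # so use a paginator to collect every page
        paginator = self.s3.get_paginator('list_objects_v2')
        
        objects = []
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            objects.extend(page.get('Contents', []))
        
        return objects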
    
    def get_s3_object_url(self,
                          bucket: str,
                          s3_key: str,
                          expiration: int = 3600) -> str:
        """
        Generate presigned URL for temporary access.
        
        Args:
            bucket: S3 bucket name
            s3_key: S3 object key
            expiration: URL expiration in seconds
        
        Returns:
            Presigned URL string
        """
        url = self.s3.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucket, 'Key': s3_key},
            ExpiresIn=expiration
        )
        return url


class GoogleCloudStorage:
    """
    Google Cloud Storage integration.
    
    Services:
    - Cloud Storage: Object storage (like S3)
    - BigQuery: Analytics data warehouse
    - Cloud SQL: Managed PostgreSQL
    """
    
    def __init__(self, project_id: Optional[str] = None):
        """
        Initialize Google Cloud Storage.
        
        Args:
            project_id: GCP project ID
        """
        try:
            from google.cloud import storage, bigquery
            self.storage = storage
            self.bigquery = bigquery
        except ImportError:
            raise ImportError(
                "google-cloud-storage and google-cloud-bigquery required. "
                "Install: pip install google-cloud-storage google-cloud-bigquery"
            )
        
        self.project_id = project_id
        self.storage_client = self.storage.Client(project=project_id)
        self.bq_client = self.bigquery.Client(project=project_id)
    
    def upload_to_gcs(self,
                      local_path: str,
                      bucket_name: str,
                      blob_name: Optional[str] = None):
        """
        Upload file to Google Cloud Storage.
        
        Args:
            local_path: Local file path
            bucket_name: GCS bucket name
            blob_name: Destination blob name
        """
        bucket = self.storage_client.bucket(bucket_name)
        blob = bucket.blob(blob_name or Path(local_path).name)
        
        blob.upload_from_filename(local_path)
        print(f"Uploaded to gs://{bucket_name}/{blob.name}")
    
    def upload_dataframe_bigquery(self,
                                   df: pd.DataFrame,
                                   dataset: str,
                                   table: str,
                                   write_disposition: str = 'WRITE_APPEND'):
        """
        Upload DataFrame to BigQuery.
        
        BigQuery is excellent for analytics queries on large datasets.
        
        Args:
            df: DataFrame to upload
            dataset: BigQuery dataset name
            table: Table name
            write_disposition: 'WRITE_TRUNCATE', 'WRITE_APPEND', 'WRITE_EMPTY'
        """
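        # Fall back to the client's default project when none was given
        project = self.project_id or self.bq_client.project
        table_ref = f"{project}.{dataset}.{table}"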
        
        job = self.bq_client.load_table_from_dataframe(
            df,
            table_ref,
            job_config=self.bigquery.LoadJobConfig(
                write_disposition=write_disposition
            )
        )
        
        job.result()  # Wait for completion
        print(f"Loaded {len(df)} rows to {table_ref}")
    
    def query_bigquery(self, query: str) -> pd.DataFrame:
        """
        Execute BigQuery SQL query.
        
        Args:
            query: SQL query string
        
        Returns:
            DataFrame with results
        """
        query_job = self.bq_client.query(query)
        return query_job.to_dataframe()


def demonstrate_cloud_storage():
    """
    Demonstrate cloud storage concepts.
    """
    print("=" * 70)
    print("Cloud Storage Solutions Overview")
    print("=" * 70)
    
    print("""
    AWS S3 for NEPSE Data:
    
    Storage Classes:
    - S3 Standard: Frequently accessed data
    - S3 Standard-IA: Infrequent access (cheaper storage, higher retrieval)
    - S3 Glacier: Archive (very cheap, slow retrieval)
    - S3 Intelligent-Tiering: Automatic cost optimization
    
    Organization:
    s3://nepse-bucket/
      raw/
        YYYY/MM/DD/
          nepse_data_YYYYMMDD.csv
      processed/
        parquet/
          stock_prices/
            year=2024/
              month=01/
                part-00001.parquet
      models/
        model_v1.0.pkl
        model_v2.0.pkl
    
    Lifecycle Policies:
    - Move to IA after 30 days
    - Move to Glacier after 1 year
    - Delete after 7 years (compliance)
    
    Google BigQuery for Analytics:
    
    Schema:
    Dataset: nepse_analytics
    Table: stock_prices
      - symbol: STRING
      - trade_date: DATE
      - close_price: FLOAT
      - volume: INTEGER
    
    Advantages:
    - Serverless (no infrastructure)
    - Petabyte-scale queries
    - Standard SQL support
    - Integration with ML (BigQuery ML)
    
    Cost Optimization:
    - Partition by date (query only relevant partitions)
    - Cluster by symbol (faster filtering)
    - Use materialized views for common aggregations
    """)
    
    # Example code structure
    print("""
    Example Usage:
    
    # AWS S3
    aws = AWSStorage()
    aws.upload_dataframe_s3(df, 'nepse-bucket', 'data/2024/01/prices.parquet')
    
    # Google BigQuery
    gcp = GoogleCloudStorage(project_id='nepse-project')
    gcp.upload_dataframe_bigquery(df, 'nepse_dataset', 'stock_prices')
    
    # Query in BigQuery
    results = gcp.query_bigquery('''
        SELECT symbol, AVG(close_price) as avg_price
        FROM `nepse-project.nepse_dataset.stock_prices`
        WHERE trade_date >= '2024-01-01'
        GROUP BY symbol
    ''')
    """)


if __name__ == "__main__":
    demonstrate_cloud_storage()
```

**Detailed Explanation:**

1. **AWS S3 Storage Classes**:
   - **Standard**: Hot data, frequent access
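   - **Standard-IA**: Roughly 40% cheaper storage than Standard, plus a per-GB retrieval fee (exact prices vary by region)
   - **Glacier**: Up to roughly 90% cheaper storage, with retrieval taking minutes to hours depending on the tier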
   - **Intelligent-Tiering**: Automatic based on access patterns

2. **BigQuery**: Analytics data warehouse:
   - **Columnar storage**: Fast aggregations
   - **Partitioning**: Divide table by date (query only needed partitions)
   - **Clustering**: Sort within partitions by symbol (faster filtering)
   - **Pricing**: On-demand queries are billed by bytes scanned, and storage is billed separately, so select only the columns and partitions you need

3. **Data Lifecycle**: Automate cost optimization with lifecycle rules:
   - First 30 days: S3 Standard (recent, frequently read data)
   - After 30 days: S3 Standard-IA
   - After 1 year: Glacier
   - Compliance: Delete after the regulatory retention period (7 years in this example)
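
A lifecycle policy like the one listed in the demonstration output (Standard-IA after 30 days, Glacier after 1 year, deletion after 7 years) could be expressed as an S3 lifecycle configuration roughly as follows. This is a sketch under stated assumptions: `build_lifecycle_rule` is a hypothetical helper, the `raw/` prefix and `nepse-bucket` name are taken from the example layout, and the thresholds are illustrative rather than a compliance recommendation. Only the dictionary is built here; the boto3 call that applies it is left commented out because it needs credentials and an existing bucket.

```python
from typing import Any, Dict


def build_lifecycle_rule(prefix: str,
                         ia_after_days: int = 30,
                         glacier_after_days: int = 365,
                         delete_after_days: int = 7 * 365) -> Dict[str, Any]:
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier -> delete."""
    if not ia_after_days < glacier_after_days < delete_after_days:
        raise ValueError("thresholds must be strictly increasing")
    return {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": delete_after_days},
    }


config = {"Rules": [build_lifecycle_rule("raw/")]}
print(config["Rules"][0]["ID"])                   # tiering-raw
print(config["Rules"][0]["Expiration"]["Days"])   # 2555

# Applying the policy (requires boto3, credentials, and an existing bucket):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="nepse-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review and version alongside the rest of the pipeline configuration before it is applied to a bucket.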

---

## **8.7 Data Partitioning Strategies**

Partitioning divides large datasets into smaller, manageable pieces for improved query performance and maintenance.
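
The performance benefit comes from partition pruning: a query for a date range only needs to open the partitions that overlap it. As a minimal, standard-library sketch of that idea (the `month_partitions` and `prune` helpers are hypothetical, and the `year=YYYY/month=MM` layout is assumed to match the directory scheme used in this chapter):

```python
from datetime import date
from typing import List


def month_partitions(start: date, end: date) -> List[str]:
    """Return the month directories that overlap [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        paths.append(f"year={year}/month={month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return paths


def prune(all_partitions: List[str], start: date, end: date) -> List[str]:
    """Keep only the stored partitions a date-range query must read."""
    wanted = set(month_partitions(start, end))
    return [p for p in all_partitions if p in wanted]


# Two years of monthly partitions, queried for a three-month span
stored = [f"year={y}/month={m:02d}" for y in (2023, 2024) for m in range(1, 13)]
selected = prune(stored, date(2023, 11, 15), date(2024, 2, 10))

print(selected)
print(f"{len(selected)} of {len(stored)} partitions scanned")  # 4 of 24 partitions scanned
```

Here only 4 of the 24 stored partitions need to be read. The module below covers the other half of the problem: writing data into such a layout in the first place.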

```python
"""
Data Partitioning Strategies for Time-Series Data

Partitioning improves:
- Query performance (scan only relevant partitions)
- Data management (drop old partitions vs. delete rows)
- Maintenance (rebuild indexes per partition)
- Parallel processing (process partitions concurrently)

Strategies:
1. Time-based partitioning (most common for time-series)
2. Hash partitioning (distribute evenly)
3. List partitioning (by category)
4. Composite partitioning (time + symbol)
"""

from datetime import datetime, timedelta
from typing import List, Dict, Any, Optional, Callable
import pandas as pd
import numpy as np
from pathlib import Path
import shutil


class PartitioningManager:
    """
    Manage data partitioning for NEPSE time-series data.
    
    Partitioning approaches:
    1. File-based: Directory structure (year=2024/month=01/day=15/)
    2. Database: Table partitioning (PostgreSQL/TimescaleDB)
    3. DataFrame: In-memory partitioning
    """
    
    def __init__(self, base_path: str = './partitioned_data'):
        """
        Initialize partitioning manager.
        
        Args:
            base_path: Base directory for partitioned files
        """
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)
    
    def partition_by_time(self,
                          df: pd.DataFrame,
                          date_column: str = 'Date',
                          frequency: str = 'month',
                          output_format: str = 'parquet') -> Dict[str, Path]:
        """
        Partition DataFrame by time periods.
        
        Args:
            df: DataFrame with time-series data
            date_column: Name of date column
            frequency: 'day', 'week', 'month', 'year'
            output_format: 'csv', 'parquet', 'feather'
        
        Returns:
            Dictionary mapping partition keys to file paths
        """
        df = df.copy()
        df[date_column] = pd.to_datetime(df[date_column])
        
        # Create time partition key
        if frequency == 'day':
            df['partition_key'] = df[date_column].dt.strftime('%Y-%m-%d')
            dir_pattern = 'year={}/month={}/day={}'
        elif frequency == 'week':
            df['partition_key'] = df[date_column].dt.strftime('%Y-W%U')
            dir_pattern = 'year={}/week={}'
        elif frequency == 'month':
            df['partition_key'] = df[date_column].dt.strftime('%Y-%m')
            dir_pattern = 'year={}/month={}'
        elif frequency == 'year':
            df['partition_key'] = df[date_column].dt.strftime('%Y')
            dir_pattern = 'year={}'
        else:
            raise ValueError(f"Unknown frequency: {frequency}")
        
        partitions = {}
        
        # Group by partition key
        for partition_key, group in df.groupby('partition_key'):
            # Create directory structure
            date = group[date_column].iloc[0]
            
            if frequency == 'day':
                path = self.base_path / dir_pattern.format(
                    date.year, 
                    f"{date.month:02d}", 
                    f"{date.day:02d}"
                )
            elif frequency == 'week':
                path = self.base_path / dir_pattern.format(
                    date.year,
                    date.strftime('%U')
                )
            elif frequency == 'month':
                path = self.base_path / dir_pattern.format(
                    date.year,
                    f"{date.month:02d}"
                )
            else:
                path = self.base_path / dir_pattern.format(date.year)
            
            path.mkdir(parents=True, exist_ok=True)
            
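            # Write file (drop the helper column so it is not persisted)
            filename = f"data.{output_format}"
            filepath = path / filename
            group = group.drop(columns='partition_key')
            
            if output_format == 'csv':
                group.to_csv(filepath, index=False)
            elif output_format == 'parquet':
                group.to_parquet(filepath, index=False)
            elif output_format == 'feather':
                # Feather may reject a non-default index, so reset it first
                group.reset_index(drop=True).to_feather(filepath)
            else:
                raise ValueError(f"Unknown output format: {output_format}")
            
            partitions[partition_key] = filepath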
        
        print(f"Created {len(partitions)} partitions")
        return partitions
    
    def partition_by_symbol(self,
                            df: pd.DataFrame,
                            symbol_column: str = 'Symbol',
                            chunks_per_symbol: int = 1) -> Dict[str, Path]:
        """
        Partition data by stock symbol.
        
        Args:
            df: DataFrame with stock data
            symbol_column: Symbol column name
            chunks_per_symbol: Split large symbols into chunks
        
        Returns:
            Dictionary mapping symbols to file paths
        """
        partitions = {}
        
        for symbol, group in df.groupby(symbol_column):
            # Clean symbol for filename
            safe_symbol = str(symbol).replace('/', '_')
            
            # Create directory
            symbol_dir = self.base_path / f"symbol={safe_symbol}"
            symbol_dir.mkdir(parents=True, exist_ok=True)
            
            # Split into chunks if needed
            if chunks_per_symbol > 1 and len(group) > 1000:
                chunk_size = len(group) // chunks_per_symbol
                
                for i in range(chunks_per_symbol):
                    start = i * chunk_size
                    end = start + chunk_size if i < chunks_per_symbol - 1 else len(group)
                    chunk = group.iloc[start:end]
                    
                    filepath = symbol_dir / f"part_{i:04d}.parquet"
                    chunk.to_parquet(filepath, index=False)
                    partitions[f"{symbol}_{i}"] = filepath
            else:
                filepath = symbol_dir / "data.parquet"
                group.to_parquet(filepath, index=False)
                partitions[symbol] = filepath
        
        return partitions
    
    def partition_composite(self,
                            df: pd.DataFrame,
                            date_column: str = 'Date',
                            symbol_column: str = 'Symbol',
                            time_freq: str = 'month') -> Dict[str, Path]:
        """
        Composite partitioning: Time + Symbol.
        
        Best for:
        - Large datasets (billions of rows)
        - Queries that filter by both time and symbol
        - Parallel processing by symbol
        
        Structure:
        year=2024/month=01/
          symbol=NABIL/
            data.parquet
          symbol=NICA/
            data.parquet
        """
        df = df.copy()
        df[date_column] = pd.to_datetime(df[date_column])
        
        # Per frequency: (partition key format, Hive-style directory format).
        # The directory layout matches the docstring and read_partitions() filters.
        layouts = {
            'year': ('%Y', 'year=%Y'),
            'month': ('%Y-%m', 'year=%Y/month=%m'),
            'day': ('%Y-%m-%d', 'year=%Y/month=%m/day=%d'),
        }
        if time_freq not in layouts:
            raise ValueError(f"Unknown time_freq: {time_freq}")
        key_format, dir_format = layouts[time_freq]
        df['time_partition'] = df[date_column].dt.strftime(key_format)
        
        partitions = {}
        
        for (time_part, symbol), group in df.groupby(['time_partition', symbol_column]):
            # Create path: year=YYYY/month=MM/symbol=XXX/
            safe_symbol = str(symbol).replace('/', '_')
            time_dir = group[date_column].iloc[0].strftime(dir_format)
            path = self.base_path / time_dir / f"symbol={safe_symbol}"
            path.mkdir(parents=True, exist_ok=True)
            
            filepath = path / "data.parquet"
            # The helper column is already encoded in the directory names
            group.drop(columns='time_partition').to_parquet(filepath, index=False)
            
            partitions[f"{time_part}_{symbol}"] = filepath
        
        return partitions
    
    def read_partition(self, partition_path: Path) -> pd.DataFrame:
        """
        Read a single partition file.
        
        Args:
            partition_path: Path to partition file
        
        Returns:
            DataFrame with partition data
        """
        if partition_path.suffix == '.csv':
            return pd.read_csv(partition_path)
        elif partition_path.suffix == '.parquet':
            return pd.read_parquet(partition_path)
        elif partition_path.suffix == '.feather':
            return pd.read_feather(partition_path)
        else:
            raise ValueError(f"Unknown file type: {partition_path.suffix}")
    
    def read_partitions(self,
                        pattern: str = "**/*.parquet",
                        filters: Optional[Dict[str, Any]] = None) -> pd.DataFrame:
        """
        Read multiple partitions with optional filtering.
        
        Args:
            pattern: Glob pattern for files
            filters: Dictionary of partition key filters
                    e.g., {'year': '2024', 'month': '01'}
        
        Returns:
            Combined DataFrame
        """
        if filters:
            # Build path from filters
            path_parts = []
            for key in ['year', 'month', 'day', 'symbol']:
                if key in filters:
                    path_parts.append(f"{key}={filters[key]}")
            
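            search_path = self.base_path / Path(*path_parts)
            # Search recursively so a partial filter (e.g., year only) still
            # finds files in nested month/day/symbol directories
            files = list(search_path.rglob("*.parquet"))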
        else:
            files = list(self.base_path.glob(pattern))
        
        if not files:
            return pd.DataFrame()
        
        # Read and combine
        dfs = []
        for file in files:
            try:
                df = self.read_partition(file)
                dfs.append(df)
            except Exception as e:
                print(f"Error reading {file}: {e}")
        
        if not dfs:
            return pd.DataFrame()
        
        return pd.concat(dfs, ignore_index=True)
    
    def drop_partition(self, partition_key: str):
        """
        Drop (delete) a specific partition.
        
        Much faster than deleting rows in a large table.
        
        Args:
            partition_key: Key of partition to drop (e.g., '2023-01')
        """
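        # Keys such as '2023', '2023-01' or '2023-01-15' map onto the
        # year=/month=/day= directories written by partition_by_time();
        # a flat time=<key> directory is also checked
        hive_parts = [f"{name}={value}" for name, value
                      in zip(('year', 'month', 'day'), partition_key.split('-'))]
        candidates = [self.base_path.joinpath(*hive_parts),
                      self.base_path / f"time={partition_key}"]
        
        for partition_path in candidates:
            if partition_path.exists():
                shutil.rmtree(partition_path)
                print(f"Dropped partition: {partition_key}")
                return
        
        print(f"Partition not found: {partition_key}")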
    
    def get_partition_statistics(self) -> pd.DataFrame:
        """
        Get statistics about partitions.
        
        Returns:
            DataFrame with partition info
        """
        stats = []
        
        for partition_dir in self.base_path.rglob("*.parquet"):
            stat = partition_dir.stat()
            stats.append({
                'partition': str(partition_dir.parent.relative_to(self.base_path)),
                'filename': partition_dir.name,
                'size_mb': stat.st_size / (1024 * 1024),
                'modified': datetime.fromtimestamp(stat.st_mtime)
            })
        
        return pd.DataFrame(stats)


def demonstrate_partitioning():
    """
    Demonstrate partitioning strategies.
    """
    print("=" * 70)
    print("Data Partitioning Strategies Demonstration")
    print("=" * 70)
    
    # Generate sample data
    np.random.seed(42)
    dates = pd.date_range('2023-01-01', '2023-12-31', freq='B')
    symbols = ['NABIL', 'NICA', 'SCBL', 'ADBL']
    
    data = []
    for date in dates:
        for symbol in symbols:
            data.append({
                'Date': date,
                'Symbol': symbol,
                'Close': np.random.uniform(100, 1000),
                'Volume': np.random.randint(10000, 100000)
            })
    
    df = pd.DataFrame(data)
    print(f"\nGenerated {len(df)} records")
    
    manager = PartitioningManager('./demo_partitions')
    
    # Time partitioning
    print("\n1. Time Partitioning (by month)")
    print("-" * 40)
    time_parts = manager.partition_by_time(df, frequency='month')
    print(f"Created {len(time_parts)} monthly partitions")
    print(f"Example path: {list(time_parts.values())[0]}")
    
    # Symbol partitioning
    print("\n2. Symbol Partitioning")
    print("-" * 40)
    symbol_parts = manager.partition_by_symbol(df)
    print(f"Created {len(symbol_parts)} symbol partitions")
    
    # Composite partitioning
    print("\n3. Composite Partitioning (Month + Symbol)")
    print("-" * 40)
    # Clean up first
    import shutil
    if Path('./demo_partitions').exists():
        shutil.rmtree('./demo_partitions')
    
    manager2 = PartitioningManager('./demo_partitions')
    composite_parts = manager2.partition_composite(df, time_freq='month')
    print(f"Created {len(composite_parts)} composite partitions")
    
    # Read specific partition
    print("\n4. Reading Specific Partition")
    print("-" * 40)
    specific = manager2.read_partitions(filters={'year': '2023', 'month': '01', 'symbol': 'NABIL'})
    print(f"Read {len(specific)} records for 2023-01 NABIL")
    
    # Statistics
    print("\n5. Partition Statistics")
    print("-" * 40)
    stats = manager2.get_partition_statistics()
    print(f"Total partitions: {len(stats)}")
    print(f"Total size: {stats['size_mb'].sum():.2f} MB")
    print(f"Average size: {stats['size_mb'].mean():.2f} MB")
    
    return manager, df


if __name__ == "__main__":
    demonstrate_partitioning()
```

**Detailed Explanation:**

1. **Time Partitioning**: Divide by date (year/month/day)
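   - A query such as `WHERE date >= '2024-01-01'` scans only the 2024 partitions
   - Dropping old data means removing a whole partition (e.g., `DROP PARTITION year=2022`), a metadata operation that is far cheaper than a row-by-row `DELETE`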

2. **Symbol Partitioning**: Divide by stock symbol
   - Query for single symbol only reads one partition
   - Enables parallel processing (process NABIL and NICA concurrently)

3. **Composite Partitioning**: Time + Symbol
   - Best for very large datasets
   - Query for specific symbol in time range: scan one small partition
   - Trade-off: More files, higher metadata overhead

4. **Partition Pruning**: Query engines skip irrelevant partitions:
   - Hive/Spark: Use partition columns in WHERE clause
   - PostgreSQL: Partition pruning for declarative partitions (constraint exclusion for inheritance-based partitioning)
   - Manual: Only read files matching filters
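
The manual pruning case in item 4 can be sketched with the standard library alone. This is a minimal illustration, assuming a Hive-style `year=YYYY/month=MM/symbol=XXX` directory layout; `parse_partition_keys` and `prune_partitions` are hypothetical helper names, not methods of `PartitioningManager`. The point is that the decision is made from directory names only, so files outside the requested range are never opened.

```python
from datetime import date
from pathlib import Path
from typing import Dict, List, Optional


def parse_partition_keys(path: Path, base: Path) -> Dict[str, str]:
    """Extract key=value pairs from the directory names below base."""
    keys = {}
    for part in path.relative_to(base).parts:
        if '=' in part:
            key, value = part.split('=', 1)
            keys[key] = value
    return keys


def prune_partitions(base: Path, start: date, end: date,
                     symbol: Optional[str] = None) -> List[Path]:
    """Return data files whose monthly partition overlaps [start, end]."""
    selected = []
    for file in sorted(base.rglob('*.parquet')):
        keys = parse_partition_keys(file.parent, base)
        year, month = keys.get('year', ''), keys.get('month', '')
        if not (year.isdigit() and month.isdigit()):
            continue  # not a monthly partition; skip rather than guess
        month_key = (int(year), int(month))
        if not (start.year, start.month) <= month_key <= (end.year, end.month):
            continue  # pruned: outside the requested time range
        if symbol is not None and keys.get('symbol') != symbol:
            continue  # pruned: different symbol
        selected.append(file)
    return selected
```

Calling `prune_partitions(Path('./demo_partitions'), date(2023, 3, 1), date(2023, 5, 31), symbol='NABIL')` would return at most three paths, which can then be passed to a reader one at a time. Pruning works at month granularity here, so rows still need a final date filter after loading.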

---

## **8.8 Data Archival and Retention**

Managing the data lifecycle: from hot (frequently accessed) through warm and cold (rarely accessed) to deletion.

```python
"""
Data Archival and Retention Policies

Strategies for managing data lifecycle:
- Hot data: Recent, frequently accessed (SSD, memory)
- Warm data: Occasionally accessed (standard disk)
- Cold data: Rarely accessed (archive storage)
- Deleted: Beyond regulatory requirement

For NEPSE:
- Last 1 year: Hot (fast queries for model training)
- 1-5 years: Warm (occasional backtesting)
- 5+ years: Cold (regulatory compliance)
- 7+ years: Delete (unless required by law)
"""

from datetime import datetime, timedelta
from typing import Dict, Any, List, Optional
from pathlib import Path
import shutil
import pandas as pd


class DataRetentionManager:
    """
    Manage data retention policies for NEPSE time-series data.
    
    Policies:
    - Raw data: Keep 7 years (regulatory)
    - Processed features: Keep 3 years
    - Model predictions: Keep 2 years
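    - Logs: Keep 90 days
    
    Files older than the retention period are archived; files older than
    twice the retention period are deleted (see should_delete).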
    """
    
    def __init__(self, 
                 data_root: str = './data',
                 archive_root: str = './archive'):
        """
        Initialize retention manager.
        
        Args:
            data_root: Root directory for active data
            archive_root: Root directory for archived data
        """
        self.data_root = Path(data_root)
        self.archive_root = Path(archive_root)
        self.archive_root.mkdir(parents=True, exist_ok=True)
        
        # Define retention policies (days)
        self.policies = {
            'raw_data': 7 * 365,        # 7 years
            'processed_features': 3 * 365,  # 3 years
            'predictions': 2 * 365,     # 2 years
            'logs': 90,                 # 90 days
            'temp': 7                   # 7 days
        }
    
    def should_archive(self, 
                       file_path: Path,
                       data_type: str = 'raw_data') -> bool:
        """
        Determine if file should be archived based on age.
        
        Args:
            file_path: Path to file
            data_type: Type of data (determines retention period)
        
        Returns:
            True if file should be archived
        """
        if not file_path.exists():
            return False
        
        # Get file modification time
        stat = file_path.stat()
        file_date = datetime.fromtimestamp(stat.st_mtime)
        age_days = (datetime.now() - file_date).days
        
        threshold = self.policies.get(data_type, 365)
        
        return age_days > threshold
    
    def should_delete(self,
                      file_path: Path,
                      data_type: str = 'raw_data') -> bool:
        """
        Determine if file should be deleted based on retention policy.
        
        Args:
            file_path: Path to file
            data_type: Type of data
        
        Returns:
            True if file should be deleted
        """
        if not file_path.exists():
            return False
        
        stat = file_path.stat()
        file_date = datetime.fromtimestamp(stat.st_mtime)
        age_days = (datetime.now() - file_date).days
        
        # Delete after 2x retention (grace period)
        threshold = self.policies.get(data_type, 365) * 2
        
        return age_days > threshold
    
    def archive_file(self, 
                     file_path: Path,
                     compress: bool = True) -> Path:
        """
        Move file to archive storage.
        
        Args:
            file_path: Original file path
            compress: Whether to compress during archival
        
        Returns:
            Path to archived file
        """
        # Create archive directory structure
        relative_path = file_path.relative_to(self.data_root)
        archive_path = self.archive_root / relative_path
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        
        if compress and file_path.suffix not in ['.gz', '.zip', '.bz2']:
            # Compress to save space
            import gzip
            archive_path = archive_path.with_suffix(file_path.suffix + '.gz')
            
            with open(file_path, 'rb') as f_in:
                with gzip.open(archive_path, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)
            
            # Remove original after successful archive
            file_path.unlink()
        else:
            # Just move
            shutil.move(str(file_path), str(archive_path))
        
        print(f"Archived: {file_path} -> {archive_path}")
        return archive_path
    
    def apply_retention_policy(self, 
                               data_type: str,
                               dry_run: bool = True) -> Dict[str, List[str]]:
        """
        Apply retention policy to a data type.
        
        Args:
            data_type: Type of data (raw_data, predictions, etc.)
            dry_run: If True, only report what would be done
        
        Returns:
            Dictionary with actions taken
        """
        actions = {
            'archived': [],
            'deleted': [],
            'kept': []
        }
        
        # Determine directory based on data type
        if data_type == 'raw_data':
            search_dir = self.data_root / 'raw'
        elif data_type == 'predictions':
            search_dir = self.data_root / 'predictions'
        else:
            search_dir = self.data_root / data_type
        
        if not search_dir.exists():
            return actions
        
        # Find all files
        files = list(search_dir.rglob('*'))
        files = [f for f in files if f.is_file()]
        
        for file_path in files:
            if self.should_delete(file_path, data_type):
                actions['deleted'].append(str(file_path))
                if not dry_run:
                    file_path.unlink()
                    
            elif self.should_archive(file_path, data_type):
                actions['archived'].append(str(file_path))
                if not dry_run:
                    self.archive_file(file_path)
            else:
                actions['kept'].append(str(file_path))
        
        return actions
    
    def get_storage_statistics(self) -> pd.DataFrame:
        """
        Get statistics about data storage.
        
        Returns:
            DataFrame with storage stats by data type
        """
        stats = []
        
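        for data_type in self.policies.keys():
            # Match apply_retention_policy(): raw data lives under 'raw'
            dir_name = 'raw' if data_type == 'raw_data' else data_type
            search_dir = self.data_root / dir_name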
            if not search_dir.exists():
                continue
            
            files = list(search_dir.rglob('*'))
            files = [f for f in files if f.is_file()]
            
            total_size = sum(f.stat().st_size for f in files)
            total_files = len(files)
            
            # Calculate age distribution
            ages = []
            for f in files:
                age = (datetime.now() - datetime.fromtimestamp(f.stat().st_mtime)).days
                ages.append(age)
            
            if ages:
                avg_age = sum(ages) / len(ages)
                max_age = max(ages)
                min_age = min(ages)
            else:
                avg_age = max_age = min_age = 0
            
            stats.append({
                'data_type': data_type,
                'file_count': total_files,
                'total_size_mb': total_size / (1024 * 1024),
                'avg_age_days': avg_age,
                'max_age_days': max_age,
                'min_age_days': min_age,
                'retention_days': self.policies[data_type]
            })
        
        return pd.DataFrame(stats)


def demonstrate_retention():
    """
    Demonstrate retention policies.
    """
    print("=" * 70)
    print("Data Retention and Archival Policies")
    print("=" * 70)
    
    manager = DataRetentionManager('./data', './archive')
    
    print("\nRetention Policies:")
    for data_type, days in manager.policies.items():
        years = days / 365
        print(f"  {data_type:20s}: {days:4d} days ({years:.1f} years)")
    
    # Simulate applying policy
    print("\nApplying retention policy (dry run):")
    # This would normally scan actual files
    print("  - Files older than retention period: archive")
    print("  - Files older than 2x retention period: delete")
    print("  - Recent files: keep in hot storage")
    
    return manager


if __name__ == "__main__":
    demonstrate_retention()
```

**Detailed Explanation:**

1. **Tiered Storage**:
   - **Hot**: Last 1 year, SSD, instant access (active trading)
   - **Warm**: 1-5 years, standard disk, occasional access (backtesting)
   - **Cold**: 5+ years, archive storage (e.g., Amazon S3 Glacier), compliance only
   - **Delete**: 7+ years, unless legally required

2. **Automation**: Scripts run daily/weekly:
   - Identify files exceeding retention thresholds
   - Move to archive (compressed)
   - Delete after grace period
   - Generate compliance reports

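3. **Compliance**: Financial data is often regulated:
   - Retention periods vary by jurisdiction and record type (US SEC Rule 17a-4, for example, requires broker-dealers to keep many records for three to six years), so confirm the rules that apply to your data before fixing a policy
   - GDPR requires deleting personal data once it is no longer needed or when a valid erasure request is received
   - Keep an audit trail of every deletion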
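
The tier boundaries listed above can be expressed as a small classification function. This is a sketch under stated assumptions: it keys on the date the data covers rather than the file modification time, it uses 365-day years, and `storage_tier` and its `legal_hold` flag are hypothetical names introduced for illustration.

```python
from datetime import date
from typing import Optional

# Tier boundaries in days, taken from the tier list (365-day years assumed)
HOT_DAYS = 365
WARM_DAYS = 5 * 365
DELETE_DAYS = 7 * 365


def storage_tier(data_date: date,
                 today: Optional[date] = None,
                 legal_hold: bool = False) -> str:
    """Classify data as 'hot', 'warm', 'cold' or 'delete' by its age."""
    today = today or date.today()
    age_days = (today - data_date).days
    if age_days < HOT_DAYS:
        return 'hot'
    if age_days < WARM_DAYS:
        return 'warm'
    # Data under a legal hold stays in cold storage past the deletion age
    if age_days < DELETE_DAYS or legal_hold:
        return 'cold'
    return 'delete'
```

A scheduled job could call this for each partition and act only when the tier changes. Using the data's own date avoids a pitfall of modification times, which reset whenever a file is copied or rewritten.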

---

## **8.9 Backup and Recovery**

Ensuring data durability and ability to recover from failures.

```python
"""
Backup and Recovery Strategies

Types of backups:
1. Full backup: Complete copy of all data
2. Incremental backup: Only changes since last backup
3. Differential backup: Changes since last full backup

Strategies:
- 3-2-1 rule: 3 copies, 2 different media, 1 offsite
- Daily incremental, weekly full
- Point-in-time recovery
- Disaster recovery testing
"""

from datetime import datetime, timedelta  # timedelta is used by cleanup_old_backups()
from pathlib import Path
import shutil
import hashlib
import json
from typing import List, Dict, Optional


class BackupManager:
    """
    Manage backups for NEPSE data.
    
    Backup schedule:
    - Daily: Incremental (changed files only)
    - Weekly: Full backup (all data)
    - Monthly: Archive to cold storage
    """
    
    def __init__(self, 
                 source_dir: str = './data',
                 backup_dir: str = './backups'):
        """
        Initialize backup manager.
        
        Args:
            source_dir: Directory to backup
            backup_dir: Directory for backups
        """
        self.source_dir = Path(source_dir)
        self.backup_dir = Path(backup_dir)
        self.backup_dir.mkdir(parents=True, exist_ok=True)
        
        # Track backup history
        self.manifest_path = self.backup_dir / 'backup_manifest.json'
        self.manifest = self._load_manifest()
    
    def _load_manifest(self) -> Dict:
        """Load backup manifest."""
        if self.manifest_path.exists():
            with open(self.manifest_path) as f:
                return json.load(f)
        return {'backups': []}
    
    def _save_manifest(self):
        """Save backup manifest."""
        with open(self.manifest_path, 'w') as f:
            json.dump(self.manifest, f, indent=2, default=str)
    
    def _calculate_checksum(self, file_path: Path) -> str:
        """Calculate MD5 checksum of file."""
        hash_md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()
    
    def full_backup(self, name: Optional[str] = None) -> Path:
        """
        Perform full backup of all data.
        
        Args:
            name: Backup name (timestamp if not provided)
        
        Returns:
            Path to backup directory
        """
        if name is None:
            name = f"full_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        backup_path = self.backup_dir / name
        backup_path.mkdir(parents=True, exist_ok=True)
        
        files_backed = 0
        total_size = 0
        
        # Copy all files
        for file_path in self.source_dir.rglob('*'):
            if file_path.is_file():
                relative_path = file_path.relative_to(self.source_dir)
                dest_path = backup_path / relative_path
                dest_path.parent.mkdir(parents=True, exist_ok=True)
                
                shutil.copy2(file_path, dest_path)
                files_backed += 1
                total_size += file_path.stat().st_size
        
        # Record in manifest
        backup_info = {
            'name': name,
            'type': 'full',
            'timestamp': datetime.now().isoformat(),
            'files': files_backed,
            'size_bytes': total_size,
            'path': str(backup_path)
        }
        self.manifest['backups'].append(backup_info)
        self._save_manifest()
        
        print(f"Full backup completed: {name}")
        print(f"  Files: {files_backed}")
        print(f"  Size: {total_size / (1024*1024):.2f} MB")
        
        return backup_path
    
    def incremental_backup(self, name: Optional[str] = None) -> Path:
        """
        Perform incremental backup (only changed files).
        
        Args:
            name: Backup name
        
        Returns:
            Path to backup directory
        """
        if name is None:
            name = f"incr_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        backup_path = self.backup_dir / name
        backup_path.mkdir(parents=True, exist_ok=True)
        
        # An incremental backup needs a full backup to build on
        backups = self.manifest['backups']
        if not any(b['type'] == 'full' for b in backups):
            print("No full backup found, performing full backup instead")
            return self.full_backup(name)
        
        # Incremental = changes since the most recent backup of any type
        # (changes since the last full backup would be a differential backup)
        last_backup = datetime.fromisoformat(backups[-1]['timestamp'])
        
        files_backed = 0
        total_size = 0
        
        # Only copy files modified since the most recent backup
        for file_path in self.source_dir.rglob('*'):
            if file_path.is_file():
                stat = file_path.stat()
                mod_time = datetime.fromtimestamp(stat.st_mtime)
                
                if mod_time > last_backup:
                    relative_path = file_path.relative_to(self.source_dir)
                    dest_path = backup_path / relative_path
                    dest_path.parent.mkdir(parents=True, exist_ok=True)
                    
                    shutil.copy2(file_path, dest_path)
                    files_backed += 1
                    total_size += stat.st_size
        
        # Record in manifest
        backup_info = {
            'name': name,
            'type': 'incremental',
            'timestamp': datetime.now().isoformat(),
            'files': files_backed,
            'size_bytes': total_size,
            'path': str(backup_path),
            'base_backup': last_backup.isoformat()
        }
        self.manifest['backups'].append(backup_info)
        self._save_manifest()
        
        print(f"Incremental backup completed: {name}")
        print(f"  Files: {files_backed}")
        print(f"  Size: {total_size / (1024*1024):.2f} MB")
        
        return backup_path
    
    def verify_backup(self, backup_name: str) -> bool:
        """
        Verify that every file in the backup can be opened and read.
        
        This is a basic readability check; it does not compare checksums.
        
        Args:
            backup_name: Name of backup to verify
        
        Returns:
            True if backup is valid
        """
        backup_path = self.backup_dir / backup_name
        
        if not backup_path.exists():
            print(f"Backup not found: {backup_name}")
            return False
        
        # Simple verification: check all files exist and are readable
        errors = []
        for file_path in backup_path.rglob('*'):
            if file_path.is_file():
                try:
                    with open(file_path, 'rb') as f:
                        f.read(1)  # Try to read first byte
                except Exception as e:
                    errors.append(f"Error reading {file_path}: {e}")
        
        if errors:
            print(f"Backup verification failed with {len(errors)} errors")
            return False
        
        print(f"Backup verified successfully: {backup_name}")
        return True
    
    def restore(self, 
                backup_name: str,
                target_dir: Optional[str] = None,
                dry_run: bool = True):
        """
        Restore from backup.
        
        Args:
            backup_name: Name of backup to restore
            target_dir: Directory to restore to (default: source_dir)
            dry_run: If True, only show what would be restored
        """
        backup_path = self.backup_dir / backup_name
        
        if not backup_path.exists():
            print(f"Backup not found: {backup_name}")
            return
        
        if target_dir is None:
            target_dir = self.source_dir
        else:
            target_dir = Path(target_dir)
        
        print(f"Restoring from {backup_name} to {target_dir}")
        
        # List all files in backup
        files = [f for f in backup_path.rglob('*') if f.is_file()]
        
        if dry_run:
            print(f"Would restore {len(files)} files:")
            for f in files[:10]:  # Show first 10
                rel = f.relative_to(backup_path)
                print(f"  {rel}")
            if len(files) > 10:
                print(f"  ... and {len(files) - 10} more")
        else:
            # Perform restore
            for file_path in files:
                relative_path = file_path.relative_to(backup_path)
                dest_path = target_dir / relative_path
                dest_path.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(file_path, dest_path)
            
            print(f"Restored {len(files)} files")
    
    def list_backups(self) -> List[Dict]:
        """List all available backups."""
        return self.manifest['backups']
    
    def cleanup_old_backups(self, keep_days: int = 30):
        """
        Remove backups older than specified days.
        
        Args:
            keep_days: Number of days to keep backups
        """
        cutoff = datetime.now() - timedelta(days=keep_days)
        
        to_remove = []
        for backup in self.manifest['backups']:
            backup_time = datetime.fromisoformat(backup['timestamp'])
            if backup_time < cutoff:
                backup_path = Path(backup['path'])
                if backup_path.exists():
                    shutil.rmtree(backup_path)
                to_remove.append(backup)
        
        # Update manifest
        self.manifest['backups'] = [b for b in self.manifest['backups'] if b not in to_remove]
        self._save_manifest()
        
        print(f"Removed {len(to_remove)} old backups")


def demonstrate_backup():
    """
    Demonstrate backup operations.
    """
    print("=" * 70)
    print("Backup and Recovery Demonstration")
    print("=" * 70)
    
    manager = BackupManager('./data', './backups')
    
    print("\nBackup Strategy:")
    print("  - Full backup: Weekly (all data)")
    print("  - Incremental: Daily (changes only)")
    print("  - Retention: 30 days")
    print("  - Verification: Checksums on completion")
    
    print("\nExample commands:")
    print("  manager.full_backup('weekly_2024_01_15')")
    print("  manager.incremental_backup('daily_2024_01_16')")
    print("  manager.verify_backup('weekly_2024_01_15')")
    print("  manager.restore('weekly_2024_01_15', dry_run=False)")
    
    return manager


if __name__ == "__main__":
    demonstrate_backup()
```

**Detailed Explanation:**

1. **3-2-1 Rule**:
   - **3 copies**: Original + 2 backups
   - **2 different media**: Local disk + cloud/offsite
   - **1 offsite**: Protect against site disaster

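2. **Backup Types**:
   - **Full**: Complete copy; slowest to create and largest, but restores on its own
   - **Incremental**: Changes since the last backup of any type; fastest and smallest, but a restore needs the last full backup plus every incremental after it
   - **Differential**: Changes since the last full backup; grows until the next full backup, but a restore needs only the full backup plus the latest differential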

3. **Recovery**:
   - **Point-in-time**: Restore to specific moment
   - **Granular**: Restore single file or table
   - **Testing**: Regular restore drills to verify backups work
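
A restore drill is more convincing when it compares file contents rather than file presence. The sketch below records a SHA-256 digest per file at backup time and compares a backup or restored copy against that record. The function names are hypothetical, and storing the resulting dictionary alongside the backup (for example in the JSON manifest) is an assumption of this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def build_checksums(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob('*')) if p.is_file()
    }


def find_mismatches(root: Path, expected: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose content differs."""
    actual = build_checksums(root)
    return sorted(
        rel for rel, digest in expected.items()
        if actual.get(rel) != digest
    )
```

In a drill, `build_checksums` would run on the source directory when the backup is taken, and `find_mismatches` would run on the restored directory; an empty list means every recorded file came back byte-for-byte identical.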

---

## **8.10 Data Security and Compliance**

Protecting sensitive financial data and meeting regulatory requirements.

```python
"""
Data Security and Compliance for NEPSE Data

Security measures:
1. Encryption at rest (disk/files)
2. Encryption in transit (TLS/SSL)
3. Access control (authentication/authorization)
4. Audit logging (who accessed what)
5. Data masking (for non-prod environments)

Compliance:
- Data privacy laws (GDPR, CCPA)
- Financial regulations (SEC, FINRA)
- Audit requirements
"""

from pathlib import Path
from typing import Optional, Dict, Any
import hashlib
import secrets

import pandas as pd  # needed for the pd.DataFrame annotations in mask_data()


class DataSecurityManager:
    """
    Manage data security for NEPSE storage.
    
    Security layers:
    - Network: TLS, VPN, firewalls
    - Application: Authentication, authorization
    - Data: Encryption, masking, tokenization
    - Physical: Data center security
    """
    
    def __init__(self):
        self.audit_log = []
    
    def encrypt_file(self,
                     file_path: Path,
                     password: str,
                     output_path: Optional[Path] = None) -> Path:
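        """
        Encrypt a file with Fernet (symmetric, authenticated encryption).

        The key is derived from the password with PBKDF2 and a random salt.
        The 16-byte salt is written at the start of the output file. To
        decrypt, read the first 16 bytes as the salt, re-derive the key with
        the same PBKDF2 parameters, and pass the rest to Fernet.decrypt().

        Args:
            file_path: File to encrypt
            password: Encryption password
            output_path: Output file path (defaults to file_path + '.encrypted')

        Returns:
            Path to encrypted file
        """
        from cryptography.fernet import Fernet
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
        import base64

        # Derive key from password. Keep the salt: without it the same key
        # cannot be derived again and the file cannot be decrypted.
        salt = secrets.token_bytes(16)
        kdf = PBKDF2HMAC(
            algorithm=hashes.SHA256(),
            length=32,
            salt=salt,
            iterations=480000,
        )
        key = base64.urlsafe_b64encode(kdf.derive(password.encode()))

        f = Fernet(key)

        # Read and encrypt (the whole file is loaded into memory)
        with open(file_path, 'rb') as file:
            file_data = file.read()

        encrypted_data = f.encrypt(file_data)

        # Write the salt followed by the encrypted data
        if output_path is None:
            output_path = file_path.with_suffix(file_path.suffix + '.encrypted')

        with open(output_path, 'wb') as file:
            file.write(salt + encrypted_data)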
        
        self.log_audit('ENCRYPT', str(file_path), str(output_path))
        return output_path
    
    def hash_sensitive_data(self, data: str) -> str:
        """
        One-way hash for sensitive data (PII).
        
        Args:
            data: String to hash
        
        Returns:
            SHA-256 hash
        """
        return hashlib.sha256(data.encode()).hexdigest()
    
    def mask_data(self, 
                  df: pd.DataFrame,
                  columns: Dict[str, str]) -> pd.DataFrame:
        """
        Mask sensitive data for non-production use.
        
        Args:
            df: DataFrame with sensitive data
            columns: Dict of column -> mask_type
                    mask_types: 'hash', 'partial', 'random', 'null'
        
        Returns:
            Masked DataFrame
        """
        df = df.copy()
        
        for col, mask_type in columns.items():
            if col not in df.columns:
                continue
            
            if mask_type == 'hash':
                df[col] = df[col].astype(str).apply(self.hash_sensitive_data)
            
            elif mask_type == 'partial':
                # Show first 2 and last 2 characters only
                df[col] = df[col].astype(str).apply(
                    lambda x: x[:2] + '***' + x[-2:] if len(x) > 4 else '****'
                )
            
            elif mask_type == 'random':
                # Integers get random values in [0, 100); any other type is
                # replaced with a constant placeholder string
                if pd.api.types.is_integer_dtype(df[col]):
                    df[col] = np.random.randint(0, 100, size=len(df))
                else:
                    df[col] = ['RANDOM'] * len(df)
            
            elif mask_type == 'null':
                df[col] = None
        
        return df
    
    def log_audit(self, 
                  action: str, 
                  resource: str,
                  details: str = ''):
        """
        Log security-relevant actions.
        
        Args:
            action: Action performed (READ, WRITE, DELETE, etc.)
            resource: Resource affected
            details: Additional details
        """
        from datetime import datetime, timezone
        import getpass
        
        entry = {
            # Timezone-aware UTC timestamp, so entries compare across hosts
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'user': getpass.getuser(),
            'action': action,
            'resource': resource,
            'details': details
        }
        self.audit_log.append(entry)
    
    def get_audit_log(self) -> pd.DataFrame:
        """Get audit log as DataFrame."""
        return pd.DataFrame(self.audit_log)


def demonstrate_security():
    """
    Demonstrate security features.
    """
    print("=" * 70)
    print("Data Security and Compliance")
    print("=" * 70)
    
    security = DataSecurityManager()
    
    print("\nSecurity Measures:")
    print("  1. Encryption at rest (AES-256)")
    print("  2. Encryption in transit (TLS 1.3)")
    print("  3. Access control (RBAC)")
    print("  4. Audit logging (all access)")
    print("  5. Data masking (non-prod)")
    
    print("\nCompliance Requirements:")
    print("  - Data retention: 7 years")
    print("  - Access logs: Indefinite")
    print("  - Encryption: Required for PII")
    print("  - Backup testing: Quarterly")
    
    return security


if __name__ == "__main__":
    demonstrate_security()
```
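
The `hash_sensitive_data` method above uses a plain SHA-256 hash. For low-entropy values such as account numbers, anyone who knows the format can hash every candidate and reverse the mapping. A common mitigation is a keyed hash (HMAC). The sketch below uses only the standard library; the function name `pseudonymize`, the sample account ID, and the way the key is created are assumptions for illustration.

```python
import hashlib
import hmac
import secrets


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed one-way hash: the same value and key always give the same token."""
    return hmac.new(key, value.encode('utf-8'), hashlib.sha256).hexdigest()


# Assumption: in a real system the key would come from a secrets manager.
# It is generated here only so that the example runs on its own.
key = secrets.token_bytes(32)

token_a = pseudonymize('ACC-000123', key)
token_b = pseudonymize('ACC-000123', key)
token_c = pseudonymize('ACC-000123', secrets.token_bytes(32))

print(token_a == token_b)  # Same key: tokens match, so joins still work
print(token_a == token_c)  # Different key: tokens are unrelated
```

Because the output depends on a secret key, an attacker who obtains only the masked dataset cannot rebuild the mapping by hashing guesses.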

**Detailed Explanation:**

1. **Encryption**:
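   - **At rest**: Files encrypted on disk. The `encrypt_file` example uses Fernet (AES-128-CBC with HMAC-SHA256); disk- and volume-level encryption commonly uses AES-256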
   - **In transit**: TLS for network communication
   - **In use**: Memory encryption (advanced)

2. **Access Control**:
   - **Authentication**: Verify identity (passwords, keys, MFA)
   - **Authorization**: Verify permissions (RBAC - Role-Based Access Control)
   - **Principle of least privilege**: Minimum necessary access

3. **Compliance**:
   - **GDPR**: Right to deletion, data portability
   - **SEC Rules 17a-3 and 17a-4**: Record creation and retention for US broker-dealers
   - **Audit trails**: Immutable logs of all access
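
The in-memory `audit_log` list in `DataSecurityManager` can be edited freely, so on its own it is not an immutable trail. One way to make a log at least tamper-evident is to chain entries by hash, so that each entry commits to the one before it. The following is a minimal standard-library sketch of that idea; `append_entry`, `verify_chain`, and the entry fields are hypothetical names, not part of the class above.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = '0' * 64  # Placeholder "previous hash" for the first entry


def _digest(action: str, resource: str, prev_hash: str) -> str:
    """Hash an entry's content together with the previous entry's hash."""
    payload = json.dumps(
        {'action': action, 'resource': resource, 'prev_hash': prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()


def append_entry(log: List[Dict[str, str]], action: str, resource: str) -> None:
    """Append an entry that is linked to the one before it."""
    prev_hash = log[-1]['hash'] if log else GENESIS_HASH
    log.append({
        'action': action,
        'resource': resource,
        'prev_hash': prev_hash,
        'hash': _digest(action, resource, prev_hash),
    })


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Return True only if every entry matches its hash and its predecessor."""
    prev_hash = GENESIS_HASH
    for entry in log:
        expected = _digest(entry['action'], entry['resource'], prev_hash)
        if entry['prev_hash'] != prev_hash or entry['hash'] != expected:
            return False
        prev_hash = entry['hash']
    return True


audit_log: List[Dict[str, str]] = []
append_entry(audit_log, 'READ', 'prices.parquet')
append_entry(audit_log, 'ENCRYPT', 'prices.parquet')
append_entry(audit_log, 'DELETE', 'prices_old.csv')
print(verify_chain(audit_log))  # Chain is intact at this point

audit_log[1]['resource'] = 'other.parquet'  # Simulate tampering
print(verify_chain(audit_log))  # The edited entry no longer matches its hash
```

This makes edits detectable but does not prevent them: someone who can rewrite the whole log can recompute every hash, and dropping entries from the end goes unnoticed. Stronger guarantees usually need write-once storage or a copy of the latest hash kept in a separate system.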

---

## **8.11 Choosing the Right Storage**

This section presents a decision framework for selecting the appropriate storage solution.

```python
"""
Storage Decision Framework for NEPSE Prediction System

This module provides a scoring system to help choose the right
storage solution based on specific requirements.
"""

from dataclasses import dataclass
from typing import Dict
from enum import Enum


class StorageOption(Enum):
    CSV = "CSV Files"
    PARQUET = "Parquet Files"
    HDF5 = "HDF5 Files"
    SQLITE = "SQLite"
    POSTGRESQL = "PostgreSQL"
    TIMESCALEDB = "TimescaleDB"
    INFLUXDB = "InfluxDB"
    MONGODB = "MongoDB"
    CASSANDRA = "Apache Cassandra"
    REDIS = "Redis"
    S3 = "AWS S3 / Cloud Storage"


@dataclass
class Requirements:
    """System requirements for storage selection."""
    data_size_gb: float
    write_frequency: str  # 'high', 'medium', 'low'
    query_pattern: str    # 'time_range', 'point', 'analytical', 'mixed'
    concurrency: int      # Number of concurrent users
    budget: str          # 'low', 'medium', 'high'
    team_size: int       # For operational complexity
    latency_requirement: str  # 'real_time', 'batch', 'analytical'
    data_retention_years: int
    need_sql: bool
    need_scaling: bool


class StorageRecommender:
    """
    Recommend storage solution based on requirements.
    """
    
    def __init__(self):
        self.scoring = {
            StorageOption.CSV: {
                'small_data': 10, 'large_data': 2,
                'low_write': 10, 'high_write': 3,
                'simple_query': 10, 'complex_query': 3,
                'low_concurrency': 10, 'high_concurrency': 2,
                'low_budget': 10, 'high_budget': 5,
                'small_team': 10, 'large_team': 5,
                'batch_latency': 10, 'realtime_latency': 1,
                'short_retention': 10, 'long_retention': 5,
                'sql_needed': 3, 'sql_not_needed': 10,
                'scaling_needed': 1, 'scaling_not_needed': 10
            },
            StorageOption.PARQUET: {
                'small_data': 8, 'large_data': 9,
                'low_write': 8, 'high_write': 6,
                'simple_query': 7, 'complex_query': 9,
                'low_concurrency': 8, 'high_concurrency': 6,
                'low_budget': 9, 'high_budget': 8,
                'small_team': 9, 'large_team': 7,
                'batch_latency': 9, 'realtime_latency': 3,
                'short_retention': 8, 'long_retention': 9,
                'sql_needed': 5, 'sql_not_needed': 10,
                'scaling_needed': 5, 'scaling_not_needed': 10
            },
            StorageOption.TIMESCALEDB: {
                'small_data': 7, 'large_data': 10,
                'low_write': 7, 'high_write': 9,
                'simple_query': 9, 'complex_query': 10,
                'low_concurrency': 8, 'high_concurrency': 9,
                'low_budget': 6, 'high_budget': 9,
                'small_team': 6, 'large_team': 9,
                'batch_latency': 9, 'realtime_latency': 8,
                'short_retention': 8, 'long_retention': 10,
                'sql_needed': 10, 'sql_not_needed': 6,
                'scaling_needed': 8, 'scaling_not_needed': 9
            },
            StorageOption.INFLUXDB: {
                'small_data': 6, 'large_data': 9,
                'low_write': 6, 'high_write': 10,
                'simple_query': 8, 'complex_query': 7,
                'low_concurrency': 7, 'high_concurrency': 9,
                'low_budget': 5, 'high_budget': 8,
                'small_team': 5, 'large_team': 8,
                'batch_latency': 7, 'realtime_latency': 10,
                'short_retention': 7, 'long_retention': 9,
                'sql_needed': 3, 'sql_not_needed': 10,
                'scaling_needed': 8, 'scaling_not_needed': 8
            }
        }
    
    def recommend(self, req: Requirements) -> Dict[StorageOption, float]:
        """
        Score each storage option based on requirements.
        
        Returns:
            Dictionary mapping options to scores (higher is better)
        """
        scores = {}
        
        for option in StorageOption:
            if option not in self.scoring:
                continue
            
            score = 0
            weights = self.scoring[option]
            
            # Data size
            if req.data_size_gb < 1:
                score += weights.get('small_data', 5)
            else:
                score += weights.get('large_data', 5)
            
            # Write frequency
            if req.write_frequency == 'high':
                score += weights.get('high_write', 5)
            else:
                score += weights.get('low_write', 5)
            
            # Query pattern
            if req.query_pattern in ['analytical', 'mixed']:
                score += weights.get('complex_query', 5)
            else:
                score += weights.get('simple_query', 5)
            
            # Concurrency
            if req.concurrency > 10:
                score += weights.get('high_concurrency', 5)
            else:
                score += weights.get('low_concurrency', 5)
            
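            # Budget
            if req.budget == 'low':
                score += weights.get('low_budget', 5)
            else:
                score += weights.get('high_budget', 5)

            # Team size (the 5-person threshold is an assumption)
            if req.team_size <= 5:
                score += weights.get('small_team', 5)
            else:
                score += weights.get('large_team', 5)

            # Latency
            if req.latency_requirement == 'real_time':
                score += weights.get('realtime_latency', 5)
            else:
                score += weights.get('batch_latency', 5)

            # Retention (the 3-year threshold is an assumption)
            if req.data_retention_years <= 3:
                score += weights.get('short_retention', 5)
            else:
                score += weights.get('long_retention', 5)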
            
            # SQL need
            if req.need_sql:
                score += weights.get('sql_needed', 5)
            else:
                score += weights.get('sql_not_needed', 5)
            
            # Scaling
            if req.need_scaling:
                score += weights.get('scaling_needed', 5)
            else:
                score += weights.get('scaling_not_needed', 5)
            
            scores[option] = score
        
        return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))


def demonstrate_decision_framework():
    """
    Demonstrate the storage decision framework.
    """
    print("=" * 70)
    print("Storage Decision Framework")
    print("=" * 70)
    
    recommender = StorageRecommender()
    
    # Define NEPSE requirements
    nepse_reqs = Requirements(
        data_size_gb=0.5,           # 500 MB historical data
        write_frequency='low',     # Daily updates
        query_pattern='time_range', # Time-series analysis
        concurrency=5,              # Small team
        budget='low',               # Limited budget
        team_size=2,                # Small team
        latency_requirement='batch', # Not real-time
        data_retention_years=10,    # Long history
        need_sql=True,              # SQL familiarity
        need_scaling=False          # Not massive scale
    )
    
    print("\nNEPSE System Requirements:")
    print(f"  Data size: {nepse_reqs.data_size_gb} GB")
    print(f"  Write frequency: {nepse_reqs.write_frequency}")
    print(f"  Query pattern: {nepse_reqs.query_pattern}")
    print(f"  Concurrency: {nepse_reqs.concurrency}")
    print(f"  Budget: {nepse_reqs.budget}")
    print(f"  Team size: {nepse_reqs.team_size}")
    print(f"  Latency: {nepse_reqs.latency_requirement}")
    print(f"  Retention: {nepse_reqs.data_retention_years} years")
    print(f"  SQL needed: {nepse_reqs.need_sql}")
    print(f"  Scaling needed: {nepse_reqs.need_scaling}")
    
    print("\nRecommendations (ranked by score):")
    scores = recommender.recommend(nepse_reqs)
    
    for i, (option, score) in enumerate(scores.items(), 1):
        print(f"  {i}. {option.value:20s} (Score: {score:.1f})")
    
    print("\n" + "=" * 70)
    print("Final Recommendation for NEPSE:")
    print("=" * 70)
    print("""
    Primary: Parquet files with partitioning
      - Cost-effective
      - Good performance for time-series
      - Easy to manage with small team
      - Compatible with pandas/NumPy
    
    Secondary: SQLite for metadata
      - Simple relational data
      - Zero configuration
      - Single file backup
    
    Future (if scaling needed): TimescaleDB
      - Built on PostgreSQL, so standard SQL and tooling carry over
      - Time-series optimizations
      - When data grows beyond 10GB
    """)
    
    return recommender, nepse_reqs


if __name__ == "__main__":
    demonstrate_decision_framework()
```

**Detailed Explanation:**

1. **Decision Criteria**:
   - **Data size**: Small (<1GB) vs Large (>100GB); the scoring code simplifies this to a single 1 GB threshold
   - **Write pattern**: Batch (daily) vs Streaming (real-time)
   - **Query complexity**: Simple lookups vs Analytical aggregations
   - **Team expertise**: SQL familiarity vs NoSQL learning curve
   - **Operational capacity**: Managed services vs Self-hosted

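2. **NEPSE Recommendation**: The scoring table covers only CSV, Parquet, TimescaleDB, and InfluxDB, and for the requirements above CSV gets the highest total. The recommendation is therefore a judgment call informed by the scores rather than the top-ranked option: Parquet is preferred over CSV for its typed, compressed, columnar format.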
   - **Current (Small scale)**: Parquet files + SQLite
     - 500MB data fits in memory
     - Daily batch updates
     - Simple time-range queries
     - Zero operational overhead
   
   - **Future (Scale up)**: TimescaleDB
     - When data exceeds 10GB
     - Need concurrent users
     - Complex SQL analytics
     - Managed service available

3. **Hybrid Approach**: Most production systems use multiple storage types:
   - **Hot data**: Redis (recent prices)
   - **Warm data**: TimescaleDB (historical prices)
   - **Cold data**: Parquet on S3 (archived data)
   - **Metadata**: PostgreSQL (stock info, users)
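
The hot/warm/cold split in the hybrid approach is usually driven by data age. The sketch below shows one way to route a record to a tier. The thresholds `HOT_DAYS` and `WARM_DAYS`, the function `storage_tier`, and the sample dates are assumptions chosen for illustration, not values the chapter prescribes.

```python
from datetime import date

# Assumed thresholds for illustration only; tune them to real access patterns
HOT_DAYS = 7          # Recent prices, served from an in-memory cache
WARM_DAYS = 365 * 2   # Queryable history in the time-series database


def storage_tier(record_date: date, today: date) -> str:
    """Map a record to a storage tier based on its age in days."""
    age_days = (today - record_date).days
    if age_days <= HOT_DAYS:
        return 'hot'    # e.g. Redis
    if age_days <= WARM_DAYS:
        return 'warm'   # e.g. TimescaleDB
    return 'cold'       # e.g. Parquet on S3


today = date(2024, 6, 30)
for record_date in (date(2024, 6, 28), date(2023, 6, 30), date(2019, 1, 1)):
    print(record_date, storage_tier(record_date, today))
```

A scheduled job could apply the same function to decide which partitions to move between tiers.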

---

## **Chapter Summary**

In this chapter, we covered data storage and management strategies for time-series prediction systems:

### **Key Takeaways:**

1. **Storage Architecture Decisions**: Analyze requirements (size, access patterns, latency) before choosing storage.

2. **File-Based Storage**:
   - **CSV**: Universal compatibility, human-readable, poor performance
   - **Parquet**: Columnar, compressed, ideal for analytics
   - **HDF5**: Scientific data, hierarchical organization
   - **Feather**: Fast I/O for temporary storage

3. **Relational Databases**:
   - **SQLite**: Zero-config, embedded, one writer at a time
   - **PostgreSQL**: Advanced features, extensibility
   - **TimescaleDB**: PostgreSQL extension for time-series
   - **Indexing**: Composite indexes on (symbol, date) critical for performance

4. **Time-Series Databases**:
   - **InfluxDB**: High write throughput, purpose-built for time-series
   - **TimescaleDB**: SQL compatibility with time-series optimizations
   - **Prometheus**: Monitoring and metrics (not primary storage)

5. **NoSQL Solutions**:
   - **MongoDB**: Flexible schema, document storage
   - **Redis**: In-memory cache, ultra-low latency
   - **Cassandra**: Massive scale, high availability

6. **Cloud Storage**: S3 for files, BigQuery for analytics, managed databases for operational data.

7. **Partitioning**: Time-based partitioning essential for large datasets; enables efficient querying and data lifecycle management.

8. **Retention & Archival**: Automate the data lifecycle from hot (SSD) to warm (disk) to cold (archive storage such as S3 Glacier) and finally to deletion.

9. **Backup & Recovery**: 3-2-1 rule, regular testing, point-in-time recovery capability.

10. **Security**: Encryption at rest and in transit, access control, audit logging, compliance with financial regulations.

11. **Decision Framework**: For NEPSE (small scale), start with Parquet + SQLite; migrate to TimescaleDB as data grows.

### **Next Steps:**

Chapter 9 will cover **Data Pipelines and Automation**, including:
- Batch and streaming pipeline architectures
- Workflow orchestration with Apache Airflow
- Data quality gates and validation
- Monitoring and error handling
- Building production-ready data pipelines

---

**End of Chapter 8**

