# Bermuda Data Ingestion Workshop

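This notebook demonstrates several ways to ingest data into Bermuda triangles, covering the data formats and ingestion patterns most common in actuarial workflows. Most cells build plain pandas DataFrames of triangle cells to illustrate each conversion; only the trib example calls Bermuda directly.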

## Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [None]:
import bermuda as tri
import pandas as pd
import numpy as np
import altair as alt
from datetime import date, datetime
import io
import os

# Enable HTML rendering for Altair charts
alt.renderers.enable("html")

print(f"Bermuda version: {getattr(tri, '__version__', 'Unknown')}")

## 1. Wide CSV Format Ingestion

Wide CSV format represents the traditional actuarial triangle layout where:
- Rows represent accident/occurrence periods
- Columns represent development periods
- Each cell contains loss values

In [None]:
# Create sample wide CSV data
wide_csv_data = """
accident_year,dev_0,dev_12,dev_24,dev_36,dev_48
2019,1000000,1200000,1250000,1275000,1280000
2020,1100000,1350000,1400000,1420000,
2021,950000,1180000,1220000,,
2022,1200000,1450000,,,
2023,1050000,,,,
""".strip()

print("Wide CSV Format Example:")
print(wide_csv_data)

# Convert to DataFrame for processing
wide_df = pd.read_csv(io.StringIO(wide_csv_data))
print("\nDataFrame representation:")
display(wide_df)

In [None]:
# Convert wide format to triangle cells.
# Note: bermuda is assumed to offer a wide_csv_to_triangle loader; it is not called here.
# The function below is a hand-written stand-in that shows the conversion steps.

def wide_csv_to_triangle_demo(df):
    """Demonstrate converting wide CSV format to bermuda triangle"""
    cells = []
    
    for _, row in df.iterrows():
        # iterrows() upcasts mixed int/float rows to float, so cast the year back to int
        accident_year = int(row['accident_year'])
        
        # Convert each development period to a cell
        for col in df.columns[1:]:  # Skip accident_year column
            if pd.notna(row[col]):
                dev_months = int(col.split('_')[1])  # Extract development months
                
                period_start = date(accident_year, 1, 1)
                period_end = date(accident_year, 12, 31)
                
                # Evaluation date: the month-end dev_months after period_end
                # (MonthEnd(0) leaves a month-end date unchanged)
                eval_date = (pd.Timestamp(period_end) + pd.offsets.MonthEnd(dev_months)).date()
                
                cell_data = {
                    'period_start': period_start,
                    'period_end': period_end,
                    'evaluation_date': eval_date,
                    'reported_loss': float(row[col])
                }
                cells.append(cell_data)
    
    return pd.DataFrame(cells)

# Convert and display
triangle_data = wide_csv_to_triangle_demo(wide_df)
print("Converted to triangle format:")
display(triangle_data.head(10))
print(f"Total cells: {len(triangle_data)}")
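
The row-by-row loop above can also be written as a vectorized reshape. The following is a minimal sketch, assuming the same `dev_<months>` column naming and calendar-year accident periods; the names `wide`, `long_cells`, and `cells` are illustrative only and it does not use any Bermuda API.

```python
import io
import pandas as pd

# Same layout as the wide CSV above, shortened to three accident years
wide_csv = """accident_year,dev_0,dev_12,dev_24
2019,1000000,1200000,1250000
2020,1100000,1350000,
2021,950000,,
"""
wide = pd.read_csv(io.StringIO(wide_csv))

# melt() stacks the dev_* columns into (accident_year, dev_col, reported_loss) rows
long_cells = wide.melt(id_vars="accident_year", var_name="dev_col", value_name="reported_loss")
long_cells = long_cells.dropna(subset=["reported_loss"]).reset_index(drop=True)

# Parse "dev_12" -> 12, then derive period and evaluation dates
long_cells["dev_months"] = long_cells["dev_col"].str.replace("dev_", "", regex=False).astype(int)
year_str = long_cells["accident_year"].astype(str)
long_cells["period_start"] = pd.to_datetime(year_str + "-01-01")
long_cells["period_end"] = pd.to_datetime(year_str + "-12-31")
# Month-end that lies dev_months after the period end
long_cells["evaluation_date"] = [
    end + pd.offsets.MonthEnd(int(n))
    for end, n in zip(long_cells["period_end"], long_cells["dev_months"])
]

cells = long_cells[["period_start", "period_end", "evaluation_date", "reported_loss"]]
print(cells)
```

Empty trailing cells are dropped by `dropna`, so only observed cells survive the reshape, matching the `pd.notna` check in the loop version.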

## 2. Long CSV Format Ingestion

Long CSV format is a database-normalized format where:
- Each row represents a single cell observation
- Columns include period identifiers, evaluation dates, and loss values
- Multiple fields can be included in the same dataset

In [None]:
# Create sample long CSV data
long_csv_data = """
accident_year,evaluation_date,reported_loss,paid_loss,earned_premium,line_of_business
2019,2019-12-31,1000000,800000,5000000,Auto
2019,2020-12-31,1200000,950000,5000000,Auto
2019,2021-12-31,1250000,1100000,5000000,Auto
2019,2022-12-31,1275000,1180000,5000000,Auto
2020,2020-12-31,1100000,850000,5200000,Auto
2020,2021-12-31,1350000,1050000,5200000,Auto
2020,2022-12-31,1400000,1200000,5200000,Auto
2021,2021-12-31,950000,750000,4800000,Auto
2021,2022-12-31,1180000,900000,4800000,Auto
2022,2022-12-31,1200000,900000,5100000,Auto
2019,2019-12-31,800000,600000,3000000,Property
2019,2020-12-31,950000,750000,3000000,Property
2019,2021-12-31,980000,850000,3000000,Property
2020,2020-12-31,850000,650000,3100000,Property
2020,2021-12-31,1050000,800000,3100000,Property
2021,2021-12-31,750000,580000,2900000,Property
""".strip()

print("Long CSV Format Example:")
long_df = pd.read_csv(io.StringIO(long_csv_data))
long_df['evaluation_date'] = pd.to_datetime(long_df['evaluation_date']).dt.date
display(long_df.head(10))
print(f"Total records: {len(long_df)}")

In [None]:
# Demonstrate long format advantages
print("Long format advantages:")
print(f"1. Multiple fields: {[col for col in long_df.columns if 'loss' in col or 'premium' in col]}")
print(f"2. Multiple segments: {long_df['line_of_business'].unique()}")
print(f"3. Metadata preservation: Each record includes line_of_business")

# Show data by segment
print("\nData by line of business:")
for lob in long_df['line_of_business'].unique():
    lob_data = long_df[long_df['line_of_business'] == lob]
    print(f"{lob}: {len(lob_data)} records")

In [None]:
# Convert long format to triangle-ready format
def long_csv_to_triangle_demo(df):
    """Convert long CSV to triangle format with metadata"""
    triangle_cells = []
    
    for _, row in df.iterrows():
        accident_year = row['accident_year']
        eval_date = row['evaluation_date']
        
        cell_data = {
            'period_start': date(accident_year, 1, 1),
            'period_end': date(accident_year, 12, 31),
            'evaluation_date': eval_date,
            'reported_loss': row['reported_loss'],
            'paid_loss': row['paid_loss'],
            'earned_premium': row['earned_premium'],
            'line_of_business': row['line_of_business']
        }
        triangle_cells.append(cell_data)
    
    return pd.DataFrame(triangle_cells)

# Convert long format
long_triangle_data = long_csv_to_triangle_demo(long_df)
print("Converted long format to triangle structure:")
display(long_triangle_data.head())

# Show multi-slice capability
print("\nMulti-slice triangle structure:")
for lob in long_triangle_data['line_of_business'].unique():
    lob_data = long_triangle_data[long_triangle_data['line_of_business'] == lob]
    print(f"{lob} slice: {len(lob_data)} cells")
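
Long data stores evaluation dates rather than development lags, so it can help to derive the lag and split the cells by slice before further processing. The sketch below assumes month-end dates and uses a hypothetical helper `dev_lag_months`; the sample rows are taken from the long CSV above.

```python
from datetime import date
import pandas as pd

cells = pd.DataFrame({
    "period_end": [date(2019, 12, 31)] * 5,
    "evaluation_date": [
        date(2019, 12, 31), date(2020, 12, 31), date(2021, 12, 31),  # Auto
        date(2019, 12, 31), date(2020, 12, 31),                      # Property
    ],
    "reported_loss": [1000000, 1200000, 1250000, 800000, 950000],
    "line_of_business": ["Auto", "Auto", "Auto", "Property", "Property"],
})

def dev_lag_months(period_end, evaluation_date):
    """Whole months between the period end and the evaluation date."""
    return (evaluation_date.year - period_end.year) * 12 + (
        evaluation_date.month - period_end.month
    )

cells["dev_lag"] = [
    dev_lag_months(p, e) for p, e in zip(cells["period_end"], cells["evaluation_date"])
]

# One DataFrame per slice, keyed by line of business
slices = {lob: grp.reset_index(drop=True) for lob, grp in cells.groupby("line_of_business")}
for lob, grp in slices.items():
    print(lob, grp["dev_lag"].tolist())
```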

## 3. Actuarial Triangle-Shaped Format

This format represents the classic actuarial triangle visualization where data is arranged in a triangular matrix.

In [None]:
# Create a triangle-shaped array (classic actuarial format)
triangle_array = np.array([
    [1000000, 1200000, 1250000, 1275000, 1280000],
    [1100000, 1350000, 1400000, 1420000, np.nan],
    [950000,  1180000, 1220000, np.nan,   np.nan],
    [1200000, 1450000, np.nan,   np.nan,   np.nan],
    [1050000, np.nan,   np.nan,   np.nan,   np.nan]
])

accident_years = [2019, 2020, 2021, 2022, 2023]
development_periods = [0, 12, 24, 36, 48]  # months

print("Triangle-shaped array:")
print("Accident Years (rows) vs Development Periods (columns)")
print("Development Periods:", development_periods)
print()

for i, year in enumerate(accident_years):
    row_str = f"{year}: "
    for j, val in enumerate(triangle_array[i]):
        if not np.isnan(val):
            row_str += f"{val:>10.0f} "
        else:
            row_str += f"{'':>10} "
    print(row_str)

In [None]:
# Convert triangle array to bermuda format using array_triangle_builder concept
def array_triangle_builder_demo(triangle_array, accident_years, development_periods):
    """Build triangle from array using bermuda array_triangle_builder concept"""
    cells = []
    
    for i, accident_year in enumerate(accident_years):
        for j, dev_period in enumerate(development_periods):
            value = triangle_array[i, j]
            
            if not np.isnan(value):
                period_start = date(accident_year, 1, 1)
                period_end = date(accident_year, 12, 31)
                
                # Evaluation date: the month-end dev_period months after period_end.
                # MonthEnd handles 30-day months, which a hard-coded day of 31 would not.
                eval_date = (pd.Timestamp(period_end) + pd.offsets.MonthEnd(dev_period)).date()
                
                cell = {
                    'period_start': period_start,
                    'period_end': period_end,
                    'evaluation_date': eval_date,
                    'development_period': dev_period,
                    'reported_loss': value,
                    'array_position': f'[{i},{j}]'
                }
                cells.append(cell)
    
    return pd.DataFrame(cells)

# Build triangle from array
array_triangle_data = array_triangle_builder_demo(triangle_array, accident_years, development_periods)
print("Array-based triangle conversion:")
display(array_triangle_data)

print(f"\nSuccessfully converted {len(array_triangle_data)} cells from triangle array")
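
Before converting an array, it can be worth confirming that it really has a triangle shape. This is a small sketch of such a check; `is_well_formed_triangle` is a hypothetical helper, and it assumes missing cells are encoded as `np.nan` as in the array above.

```python
import numpy as np

def is_well_formed_triangle(arr):
    """Return True if every row is observed values followed only by NaNs,
    and no row has more observed values than the row above it."""
    observed = ~np.isnan(arr)
    counts = observed.sum(axis=1)
    # Left-aligned: the first counts[i] entries of row i are all observed
    left_aligned = all(observed[i, :n].all() for i, n in enumerate(counts))
    # Later accident periods cannot have more development than earlier ones
    non_increasing = bool(np.all(np.diff(counts) <= 0))
    return left_aligned and non_increasing

good = np.array([
    [1000000, 1200000, 1250000],
    [1100000, 1350000, np.nan],
    [950000,  np.nan,  np.nan],
])
gap = np.array([
    [1000000, np.nan,  1250000],  # hole in the middle of a row
    [1100000, 1350000, np.nan],
])

print(is_well_formed_triangle(good))
print(is_well_formed_triangle(gap))
```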

## 4. Excel Spreadsheet Ingestion

Many actuaries work with Excel files. Let's demonstrate creating and reading triangle data from Excel format.

In [None]:
# Create sample Excel-like data structure
excel_triangle_data = {
    'Accident Year': [2019, 2020, 2021, 2022, 2023],
    '0 months': [1000000, 1100000, 950000, 1200000, 1050000],
    '12 months': [1200000, 1350000, 1180000, 1450000, None],
    '24 months': [1250000, 1400000, 1220000, None, None],
    '36 months': [1275000, 1420000, None, None, None],
    '48 months': [1280000, None, None, None, None]
}

excel_df = pd.DataFrame(excel_triangle_data)
print("Excel-style triangle data:")
display(excel_df)

# Save as Excel file for demonstration
excel_filename = 'sample_triangle.xlsx'
try:
    excel_df.to_excel(excel_filename, index=False)
    print(f"\nSaved sample data to {excel_filename}")
except ImportError:
    print("\nNote: openpyxl not available for Excel export, but format demonstrated")

In [None]:
# Convert Excel-style data to triangle format
def excel_to_triangle_demo(df):
    """Convert Excel-style triangle to bermuda format"""
    cells = []
    
    for _, row in df.iterrows():
        # iterrows() upcasts mixed int/float rows to float, so cast the year back to int
        accident_year = int(row['Accident Year'])
        
        # Process development columns
        for col in df.columns[1:]:  # Skip 'Accident Year'
            if pd.notna(row[col]):
                # Extract development period from column name
                dev_months = int(col.split()[0])  # e.g., "12 months" -> 12
                
                period_start = date(accident_year, 1, 1)
                period_end = date(accident_year, 12, 31)
                
                # Evaluation date: the month-end dev_months after period_end
                # (MonthEnd(0) leaves a month-end date unchanged)
                eval_date = (pd.Timestamp(period_end) + pd.offsets.MonthEnd(dev_months)).date()
                
                cell = {
                    'period_start': period_start,
                    'period_end': period_end,
                    'evaluation_date': eval_date,
                    'development_months': dev_months,
                    'reported_loss': float(row[col]),
                    'source_format': 'Excel'
                }
                cells.append(cell)
    
    return pd.DataFrame(cells)

# Convert Excel data
excel_triangle_converted = excel_to_triangle_demo(excel_df)
print("Excel data converted to triangle format:")
display(excel_triangle_converted.head(10))
print(f"\nTotal cells from Excel format: {len(excel_triangle_converted)}")
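
Spreadsheet headers are rarely as uniform as `"12 months"`. The sketch below shows one way to parse a few common variants into months; `parse_dev_header` and the unit table are hypothetical and would need extending for the labels found in real workbooks.

```python
import re

# Assumed unit spellings; extend for the labels in your own workbooks
_UNIT_TO_MONTHS = {
    "month": 1, "months": 1, "mo": 1, "m": 1,
    "quarter": 3, "quarters": 3, "q": 3,
    "year": 12, "years": 12, "yr": 12, "y": 12,
}

def parse_dev_header(label):
    """Parse a header such as '12 months', '24m' or '2 years' into months.

    A bare number is treated as months. Raises ValueError for anything else.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)?\s*", str(label))
    if match is None:
        raise ValueError(f"Unrecognized development header: {label!r}")
    number = int(match.group(1))
    unit = (match.group(2) or "months").lower()
    if unit not in _UNIT_TO_MONTHS:
        raise ValueError(f"Unrecognized unit in development header: {label!r}")
    return number * _UNIT_TO_MONTHS[unit]

for header in ["0 months", "12 months", "24m", "2 years", "36"]:
    print(header, "->", parse_dev_header(header))
```

Raising on unknown labels, instead of skipping them, keeps a stray column such as a totals row from being silently ingested as a development period.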

## 5. Chain-Ladder Package I/O

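Demonstrate interoperability with the popular chainladder package for Python (its R counterpart is the separate ChainLadder package). The cells below simulate the package's data layout rather than importing it.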

In [None]:
# Simulate chainladder package format (since it may not be installed)
# This demonstrates the concept of converting between different actuarial libraries

def simulate_chainladder_triangle():
    """Simulate a triangle in chainladder package format"""
    # Chainladder typically uses arrays with specific indexing
    cl_data = {
        'values': triangle_array,  # The loss triangle array
        'origin': accident_years,   # Accident years
        'development': development_periods,  # Development periods
        'valuation': None  # Calculated from origin + development
    }
    
    # Calculate valuation dates
    valuation_dates = []
    for i, origin_year in enumerate(cl_data['origin']):
        val_row = []
        for j, dev_period in enumerate(cl_data['development']):
            if not np.isnan(cl_data['values'][i, j]):
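                # Valuation date: the month-end dev_period months after the origin year ends.
                # MonthEnd handles 30-day months, which a hard-coded day of 31 would not.
                origin_end = pd.Timestamp(date(origin_year, 12, 31))
                val_row.append((origin_end + pd.offsets.MonthEnd(dev_period)).date())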
            else:
                val_row.append(None)
        valuation_dates.append(val_row)
    
    cl_data['valuation'] = valuation_dates
    return cl_data

# Create simulated chainladder triangle
cl_triangle = simulate_chainladder_triangle()
print("Simulated Chain-Ladder package triangle:")
print(f"Origin periods: {cl_triangle['origin']}")
print(f"Development periods: {cl_triangle['development']}")
print(f"Triangle shape: {cl_triangle['values'].shape}")
print("\nSample valuation dates:")
for i in range(min(3, len(cl_triangle['valuation']))):
    valid_dates = [d for d in cl_triangle['valuation'][i] if d is not None]
    print(f"  Origin {cl_triangle['origin'][i]}: {valid_dates[:3]}...")

In [None]:
# Convert chainladder format to bermuda
def chainladder_to_bermuda_demo(cl_data):
    """Convert chainladder-style data to bermuda triangle format"""
    cells = []
    
    for i, origin_year in enumerate(cl_data['origin']):
        for j, dev_period in enumerate(cl_data['development']):
            value = cl_data['values'][i, j]
            valuation_date = cl_data['valuation'][i][j]
            
            if not np.isnan(value) and valuation_date is not None:
                cell = {
                    'period_start': date(origin_year, 1, 1),
                    'period_end': date(origin_year, 12, 31),
                    'evaluation_date': valuation_date,
                    'development_period': dev_period,
                    'reported_loss': value,
                    'origin_year': origin_year,
                    'source_package': 'chainladder'
                }
                cells.append(cell)
    
    return pd.DataFrame(cells)

# Convert chainladder to bermuda
cl_to_bermuda = chainladder_to_bermuda_demo(cl_triangle)
print("Chain-ladder format converted to Bermuda:")
display(cl_to_bermuda.head())
print(f"\nSuccessfully converted {len(cl_to_bermuda)} cells from chainladder format")

## 6. Trib Files (Binary Triangle Format)

Trib files are Bermuda's native binary format for efficient triangle storage and retrieval.

In [None]:
# Load the existing trib file from our test data
trib_file = "data/solm.gl-general.trib"

if os.path.exists(trib_file):
    print(f"Loading triangle from trib file: {trib_file}")
    trib_triangle = tri.binary_to_triangle(trib_file)
    print("\nTrib file triangle summary:")
    print(trib_triangle)
    
    print(f"\nTrib file advantages:")
    print(f"- Fast binary loading")
    print(f"- Preserves all metadata")
    print(f"- Compact storage")
    print(f"- Native bermuda format")
    
    # Show file size
    file_size = os.path.getsize(trib_file)
    print(f"\nFile size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
    print(f"Cells per byte: {len(trib_triangle)/file_size:.2f}")
    
else:
    print(f"Trib file not found: {trib_file}")
    print("Trib files provide efficient binary storage for bermuda triangles")

In [None]:
# Demonstrate trib file creation concept
def simulate_trib_creation(triangle_df, filename="sample.trib"):
    """Simulate creating a trib file from triangle data"""
    print(f"Would create trib file: {filename}")
    print(f"Input data: {len(triangle_df)} cells")
    
    # Show what would be stored
    required_fields = ['period_start', 'period_end', 'evaluation_date']
    value_fields = [col for col in triangle_df.columns if 'loss' in col.lower() or 'premium' in col.lower()]
    metadata_fields = [col for col in triangle_df.columns if col not in required_fields + value_fields]
    
    print(f"Required fields: {required_fields}")
    print(f"Value fields: {value_fields}")
    print(f"Metadata fields: {metadata_fields}")
    
    return {
        'filename': filename,
        'cell_count': len(triangle_df),
        'fields': required_fields + value_fields,
        'metadata': metadata_fields
    }

# Simulate creating trib from our converted data
if len(long_triangle_data) > 0:
    # Use our long format data
    trib_info = simulate_trib_creation(long_triangle_data, "multi_slice_example.trib")
    print(f"\nWould create trib file with:")
    print(f"  {trib_info['cell_count']} cells")
    print(f"  {len(trib_info['fields'])} stored fields (required + value)")
    print(f"  {len(trib_info['metadata'])} metadata fields")

## 7. Data Validation and Quality Checks

When ingesting data from various sources, it's important to validate the results.

In [None]:
def validate_triangle_data(df, source_name):
    """Validate converted triangle data"""
    print(f"\nValidating {source_name} data:")
    
    # Check required fields
    required_fields = ['period_start', 'period_end', 'evaluation_date']
    missing_fields = [f for f in required_fields if f not in df.columns]
    if missing_fields:
        print(f"  X Missing required fields: {missing_fields}")
    else:
        print(f"  ✓ All required fields present")
    
    # Check for valid dates
    try:
        # Check every date field that is present, including period_start and period_end
        date_columns = [col for col in required_fields if col in df.columns]
        for col in date_columns:
            if df[col].isnull().any():
                print(f"  ! Null dates found in {col}")
            else:
                print(f"  ✓ Valid dates in {col}")
    except Exception as e:
        print(f"  X Date validation error: {e}")
    
    # Check for value fields
    value_fields = [col for col in df.columns if 'loss' in col.lower() or 'premium' in col.lower()]
    if not value_fields:
        print(f"  ! No value fields found")
    else:
        print(f"  ✓ Value fields found: {value_fields}")
    
    # Check for negative values
    for col in value_fields:
        if (df[col] < 0).any():
            print(f"  ! Negative values found in {col}")
    
    # Summary statistics
    print(f"  Total cells: {len(df)}")
    if len(df) > 0 and not missing_fields:  # the ranges below need all date fields
        print(f"  Period range: {df['period_start'].min()} to {df['period_end'].max()}")
        print(f"  Evaluation range: {df['evaluation_date'].min()} to {df['evaluation_date'].max()}")
    
    return len(missing_fields) == 0

# Validate all our converted datasets
datasets = [
    (triangle_data, "Wide CSV"),
    (long_triangle_data, "Long CSV"),
    (array_triangle_data, "Array Format"),
    (excel_triangle_converted, "Excel Format"),
    (cl_to_bermuda, "Chain-ladder Format")
]

valid_datasets = []
for dataset, name in datasets:
    if len(dataset) > 0:
        is_valid = validate_triangle_data(dataset, name)
        if is_valid:
            valid_datasets.append((dataset, name))

print(f"\nSuccessfully validated {len(valid_datasets)} datasets")
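
The checks above look at fields and nulls; two further checks that are often useful are duplicate cells and impossible date orderings. The following is a sketch only, with a hypothetical helper `find_cell_problems`, and it assumes the column names used throughout this notebook.

```python
from datetime import date
import pandas as pd

def find_cell_problems(df, slice_cols=()):
    """Return a dict of problem name -> offending rows (empty frames when clean).

    slice_cols lists columns (e.g. line_of_business) that distinguish slices,
    so the same cell in two different slices is not flagged as a duplicate.
    """
    key = ["period_start", "period_end", "evaluation_date"] + list(slice_cols)
    return {
        # The same cell ingested more than once
        "duplicate_cells": df[df.duplicated(subset=key, keep=False)],
        # A period that ends before it starts
        "inverted_periods": df[df["period_end"] < df["period_start"]],
        # A cell evaluated before its period began
        "early_evaluations": df[df["evaluation_date"] < df["period_start"]],
    }

cells = pd.DataFrame({
    "period_start": [date(2019, 1, 1), date(2019, 1, 1), date(2020, 1, 1), date(2021, 1, 1)],
    "period_end": [date(2019, 12, 31), date(2019, 12, 31), date(2020, 12, 31), date(2021, 12, 31)],
    "evaluation_date": [date(2019, 12, 31), date(2019, 12, 31), date(2019, 6, 30), date(2021, 12, 31)],
    "reported_loss": [100.0, 200.0, 50.0, 75.0],
})

problems = find_cell_problems(cells)
for name, rows in problems.items():
    print(f"{name}: {len(rows)} row(s)")
```

For multi-slice data such as the long CSV example, pass `slice_cols=["line_of_business"]` so that Auto and Property cells sharing the same dates are not reported as duplicates.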

## 8. Visualization of Ingested Data

Let's create visualizations to compare the different ingestion methods.

In [None]:
# Compare data coverage across ingestion methods
coverage_comparison = []

for dataset, name in valid_datasets:
    if len(dataset) > 0:
        period_count = len(dataset['period_start'].unique())
        eval_count = len(dataset['evaluation_date'].unique())
        avg_loss = dataset['reported_loss'].mean() if 'reported_loss' in dataset.columns else 0
        
        coverage_comparison.append({
            'Format': name,
            'Cell Count': len(dataset),
            'Accident Periods': period_count,
            'Evaluation Dates': eval_count,
            'Avg Reported Loss': avg_loss
        })

comparison_df = pd.DataFrame(coverage_comparison)
print("Ingestion Method Comparison:")
display(comparison_df)

# Create a simple visualization
if len(comparison_df) > 0:
    # Cell count comparison
    chart = alt.Chart(comparison_df).mark_bar().encode(
        x=alt.X('Format:N', title='Ingestion Format'),
        y=alt.Y('Cell Count:Q', title='Number of Cells'),
        color=alt.Color('Format:N', legend=None),
        tooltip=['Format', 'Cell Count', 'Accident Periods', 'Evaluation Dates']
    ).properties(
        title='Triangle Cells by Ingestion Format',
        width=400,
        height=300
    )
    
    # A bare expression inside an if-block is not auto-rendered; display it explicitly
    display(chart)
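
Cell counts alone cannot tell us whether two ingestion paths produced the same values. One way to compare them cell by cell is an outer merge with an indicator column. This is a sketch with a hypothetical helper `compare_cells`, assuming both frames use the column names from this notebook.

```python
from datetime import date
import pandas as pd

def compare_cells(left, right, value_col="reported_loss"):
    """Outer-join two cell tables on their dates and report value differences."""
    key = ["period_start", "period_end", "evaluation_date"]
    merged = left[key + [value_col]].merge(
        right[key + [value_col]],
        on=key,
        how="outer",
        suffixes=("_left", "_right"),
        indicator=True,  # adds a _merge column: both / left_only / right_only
    )
    merged["abs_diff"] = (merged[f"{value_col}_left"] - merged[f"{value_col}_right"]).abs()
    return merged

def make_cells(rows):
    """Build a cell table from (accident_year, evaluation_year, loss) tuples."""
    return pd.DataFrame({
        "period_start": [date(ay, 1, 1) for ay, _, _ in rows],
        "period_end": [date(ay, 12, 31) for ay, _, _ in rows],
        "evaluation_date": [date(ey, 12, 31) for _, ey, _ in rows],
        "reported_loss": [float(loss) for _, _, loss in rows],
    })

left = make_cells([(2019, 2019, 100), (2019, 2020, 120), (2020, 2020, 110)])
right = make_cells([(2019, 2019, 100), (2019, 2020, 125), (2021, 2021, 90)])

merged = compare_cells(left, right)
print(merged[["period_start", "evaluation_date", "_merge", "abs_diff"]])
```

Rows marked `left_only` or `right_only` are cells that one path ingested and the other did not; a nonzero `abs_diff` on a `both` row means the two paths disagree on the value.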

## 9. Best Practices and Recommendations

Summary of ingestion best practices based on different use cases.

In [None]:
# Display best practices guide
best_practices = {
    "Wide CSV": {
        "Best for": "Traditional actuarial triangles, simple data",
        "Advantages": ["Familiar format", "Easy to create in Excel", "Visual triangle structure"],
        "Disadvantages": ["Limited metadata", "Single field only", "Sparse for irregular triangles"]
    },
    "Long CSV": {
        "Best for": "Database exports, multi-field data, segmented triangles",
        "Advantages": ["Multiple fields", "Rich metadata", "Database-friendly", "Multi-slice support"],
        "Disadvantages": ["Less intuitive format", "Larger file sizes"]
    },
    "Array Format": {
        "Best for": "Mathematical operations, chainladder compatibility",
        "Advantages": ["Efficient for calculations", "Standard actuarial format", "Easy matrix operations"],
        "Disadvantages": ["Fixed structure", "Limited metadata", "Requires index management"]
    },
    "Excel Format": {
        "Best for": "Actuarial workflows, presentation data",
        "Advantages": ["Familiar to actuaries", "Easy editing", "Good for small datasets"],
        "Disadvantages": ["Version compatibility", "Size limitations", "Manual process"]
    },
    "Trib Files": {
        "Best for": "Production systems, large datasets, archival storage",
        "Advantages": ["Fast loading", "Compact storage", "Full metadata", "Native format"],
        "Disadvantages": ["Binary format", "Requires bermuda to read", "Less portable"]
    }
}

print("BERMUDA INGESTION BEST PRACTICES\n")
print("=" * 50)

for format_name, info in best_practices.items():
    print(f"\n{format_name}")
    print(f"   Best for: {info['Best for']}")
    print(f"   ✓ Advantages: {', '.join(info['Advantages'])}")
    print(f"   ! Disadvantages: {', '.join(info['Disadvantages'])}")

print("\n" + "=" * 50)
print("GENERAL RECOMMENDATIONS:")
print("   • Use Long CSV for complex, multi-dimensional data")
print("   • Use Wide CSV for simple, single-field triangles")
print("   • Use Trib files for production and archival storage")
print("   • Validate all data after ingestion")
print("   • Preserve metadata whenever possible")
print("   • Consider data volume and performance requirements")

## Summary

This notebook walked through the main data formats you may need to ingest into Bermuda triangles, along with validation and guidance on choosing between them:

1. **Wide CSV Format** - Traditional triangle layout for simple data
2. **Long CSV Format** - Database-normalized format for complex, multi-dimensional data
3. **Array Format** - Mathematical matrix format for computational work
4. **Excel Integration** - Working with spreadsheet data common in actuarial workflows
5. **Chain-ladder Compatibility** - Interoperability with other actuarial libraries
6. **Trib Files** - Native binary format for efficient storage and retrieval
7. **Data Validation** - Quality checks and validation procedures
8. **Best Practices** - Guidance for choosing the right ingestion method

Each method has its strengths and appropriate use cases. The key is selecting the right approach based on your data structure, workflow requirements, and performance needs.