# Survey Data Processor with Simple Dictionary Mappings

This notebook processes survey data from SQL query files and applies dictionary-based transformations.

## Overview
- Reads SQL query files for each survey
- Applies dictionary mappings to transform text responses to numeric values
- Uses simple JSON configuration files to specify which columns need which mappings
- Outputs combined files with transformed data for analysis

In [ ]:
import pandas as pd
import numpy as np
import sqlite3
import os
import json
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Add parent directory to path for imports
import sys
sys.path.append('../')

from shared.metrics.common_metrics import *
from shared.utils.data_validator import *
from shared.transformations import SimpleSurveyMapper, quick_transform

## Configuration

### Survey Configuration
- Each survey has a folder in `sql_queries/survey_name/` containing:
  - `get_data.sql` - SQL query to extract survey data
  - `dev_notes.md` - Developer documentation for the SQL query
- Column mappings are defined in `config/survey_column_mappings.json`
- Dictionary mappings are defined in `config/mapping_dictionaries.json`
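
As a rough illustration of how the two config files could work together, the sketch below applies dictionary mappings to configured columns with pandas. It does not reproduce `SimpleSurveyMapper`; the dictionary contents (including the intermediate scale labels), the column names `Overall_Satisfaction` and `Would_Return`, and the `apply_mappings` helper are assumptions made for the example.

```python
import pandas as pd

# Hypothetical contents of config/mapping_dictionaries.json
mapping_dictionaries = {
    "yes_no_binary": {"yes": 1, "no": 0},
    "satisfaction_5_scale": {
        "very dissatisfied": 1, "dissatisfied": 2, "neutral": 3,
        "satisfied": 4, "very satisfied": 5,
    },
}

# Hypothetical per-survey entry from config/survey_column_mappings.json
column_mappings = {
    "Overall_Satisfaction": "satisfaction_5_scale",
    "Would_Return": "yes_no_binary",
}


def apply_mappings(df, column_mappings, mapping_dictionaries):
    """Return a copy of df with each configured column mapped to numbers."""
    out = df.copy()
    for column, dict_name in column_mappings.items():
        if column not in out.columns:
            continue  # column absent from this survey; leave the frame untouched
        lookup = mapping_dictionaries[dict_name]
        # Normalize case and whitespace before the lookup; unmapped text becomes NaN
        out[column] = out[column].str.strip().str.lower().map(lookup)
    return out


raw = pd.DataFrame({
    "Overall_Satisfaction": ["Very Satisfied", "neutral ", "unsure"],
    "Would_Return": ["Yes", "No", "yes"],
})
transformed = apply_mappings(raw, column_mappings, mapping_dictionaries)
print(transformed)
```

Working on a copy keeps the raw responses available for comparison, and mapping unknown text to NaN makes unmapped answers easy to count afterwards.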

In [ ]:
# Survey Configuration - Update this list with your survey names
SURVEYS = [
    "survey_1",
    "survey_2", 
    "survey_3",
    "survey_4",
    "survey_5"
]

# Initialize the mapper
mapper = SimpleSurveyMapper()

# File paths
SQL_QUERIES_DIR = "../sql_queries/"
OUTPUT_DIR = "../data/processed/"
TIMESTAMP = datetime.now().strftime("%Y%m%d_%H%M%S")

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Processing {len(SURVEYS)} surveys")
print(f"\nAvailable mapping dictionaries ({len(mapper.get_available_mappings())}):")
for i, mapping_name in enumerate(mapper.get_available_mappings(), 1):
    print(f"  {i}. {mapping_name}")

# Show example mappings
print("\nExample mappings:")
print("  satisfaction_5_scale: 'very satisfied' → 5, 'satisfied' → 4, ...")
print("  yes_no_binary: 'yes' → 1, 'no' → 0")
print("  recommendation_10_scale: '0' → 0, '1' → 1, ..., '10' → 10")

## Data Loading Functions
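
The functions in this section check incoming columns against `ALL_COLUMNS`, a dictionary of column names to types that this notebook does not define itself; presumably it arrives through one of the wildcard imports. A minimal sketch of the shape it appears to have, assuming the type labels used later in this notebook (`STANDARD`, `LTR`, `DRIVERS`). Apart from `ResponseID`, the column names are hypothetical, as is the `find_invalid_columns` helper:

```python
# Hypothetical column-type dictionary; only the three type labels come from this notebook
example_columns = {
    "ResponseID": "STANDARD",
    "Likelihood_To_Recommend": "LTR",
    "Ease_Of_Use": "DRIVERS",
    "Support_Quality": "DRIVERS",
}

# Columns the pipeline adds itself, so they are exempt from validation
AUTO_ADDED = {"Survey_Name", "Processed_Date"}


def find_invalid_columns(df_columns, allowed):
    """Return the sorted column names that are neither allowed nor auto-added."""
    return sorted(set(df_columns) - AUTO_ADDED - set(allowed))


invalid = find_invalid_columns(
    ["ResponseID", "Ease_Of_Use", "Survey_Name", "Free_Text_Comment"],
    example_columns,
)
driver_columns = [col for col, col_type in example_columns.items() if col_type == "DRIVERS"]
print(invalid)
print(driver_columns)
```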

In [ ]:
def validate_columns(df, survey_name, all_columns_dict):
    """
    Validate that all columns in the DataFrame are in the allowed columns dictionary.
    
    Args:
        df (pd.DataFrame): Survey data
        survey_name (str): Name of the survey
        all_columns_dict (dict): Dictionary of allowed column names and their types
    
    Returns:
        tuple: (is_valid, invalid_columns)
    """
    # Exclude automatically added columns from validation
    auto_added_columns = {'Survey_Name', 'Processed_Date'}
    df_columns = set(df.columns) - auto_added_columns
    allowed_columns = set(all_columns_dict.keys())
    
    invalid_columns = df_columns - allowed_columns
    
    if invalid_columns:
        print(f"  ✗ VALIDATION FAILED for {survey_name}")
        print(f"    Invalid columns: {sorted(invalid_columns)}")
        print(f"    Valid columns must be from: {sorted(allowed_columns)}")
        return False, invalid_columns
    else:
        print(f"  ✓ Column validation passed for {survey_name}")
        return True, set()

def load_sql_query(survey_name):
    """
    Load SQL query from file for a given survey.
    
    Args:
        survey_name (str): Name of the survey
    
    Returns:
        str: SQL query string
    """
    sql_file_path = os.path.join(SQL_QUERIES_DIR, survey_name, "get_data.sql")
    
    if not os.path.exists(sql_file_path):
        raise FileNotFoundError(f"SQL file not found: {sql_file_path}")
    
    with open(sql_file_path, 'r', encoding='utf-8') as file:
        query = file.read().strip()
    
    return query

def execute_query_and_load_data(survey_name, connection=None):
    """
    Execute SQL query and load data for a survey.
    
    Args:
        survey_name (str): Name of the survey
        connection: Database connection (if None, will create SQLite connection)
    
    Returns:
        pd.DataFrame: Survey data
    """
    query = load_sql_query(survey_name)
    
    # If no connection provided, create a simple SQLite connection
    # Note: Update this section based on your actual database configuration
    if connection is None:
        # For demonstration - replace with your actual database connection
        conn = sqlite3.connect(':memory:')
        print(f"Warning: Using in-memory SQLite for {survey_name}. Update connection for production.")
    else:
        conn = connection
    
    try:
        df = pd.read_sql_query(query, conn)
        
        # Validate columns before processing
        is_valid, invalid_cols = validate_columns(df, survey_name, ALL_COLUMNS)
        if not is_valid:
            print(f"  ✗ Skipping {survey_name} due to column validation failure")
            return None
        
        df['Survey_Name'] = survey_name  # Add survey identifier
        print(f"✓ Loaded {len(df)} records from {survey_name}")
        return df
    except Exception as e:
        print(f"✗ Error loading data for {survey_name}: {str(e)}")
        return None
    finally:
        if connection is None and 'conn' in locals():
            conn.close()

## View Survey Mappings

Let's examine what transformations are configured for each survey.

In [ ]:
# Display configured mappings for each survey
for survey_name in SURVEYS[:3]:  # Show first 3 surveys
    print(f"\n{survey_name}:")
    print("-" * 50)
    
    mappings = mapper.get_survey_mappings(survey_name)
    if mappings:
        for column, dict_name in mappings.items():
            print(f"  {column} → {dict_name}")
    else:
        print("  No mappings configured")

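# Only mention remaining surveys when there actually are some
if len(SURVEYS) > 3:
    print(f"\n... and {len(SURVEYS) - 3} more surveys")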

In [ ]:
# Initialize storage for processed data
all_survey_data = []
nps_driver_data = []
processing_summary = []

print("Starting data processing pipeline...\n")

In [ ]:
# Defined before the processing loop, which calls it for every survey
def extract_ltr_and_drivers(df, survey_name, all_columns_dict):
    """
    Extract LTR (Likelihood to Recommend) and driver scores from survey data.
    
    Args:
        df (pd.DataFrame): Survey data
        survey_name (str): Name of the survey
        all_columns_dict (dict): Dictionary mapping column names to types
    
    Returns:
        pd.DataFrame or None: Subset with LTR and driver data
    """
    # Get LTR and DRIVERS columns from the column type mapping
    ltr_columns = [col for col, col_type in all_columns_dict.items() 
                   if col_type == "LTR" and col in df.columns]
    driver_columns = [col for col, col_type in all_columns_dict.items() 
                      if col_type == "DRIVERS" and col in df.columns]
    
    # Standard columns to include
    standard_columns = [col for col, col_type in all_columns_dict.items() 
                       if col_type == "STANDARD" and col in df.columns]
    
    # Auto-added columns to include
    auto_added_columns = ['Survey_Name', 'Processed_Date']
    available_auto_added = [col for col in auto_added_columns if col in df.columns]
    
    # Combine all relevant columns; dict.fromkeys de-duplicates while keeping a stable order
    relevant_columns = list(dict.fromkeys(standard_columns + ltr_columns + driver_columns + available_auto_added))
    
    if len(ltr_columns) > 0 or len(driver_columns) > 0:
        subset_df = df[relevant_columns].copy()
        print(f"    Extracted LTR/Driver data: {len(ltr_columns)} LTR columns, {len(driver_columns)} driver columns")
        print(f"    LTR columns: {ltr_columns}")
        print(f"    Driver columns: {driver_columns}")
        return subset_df
    else:
        print(f"    No LTR or driver columns found in {survey_name}")
        return None

In [ ]:
# Process each survey
for survey_name in SURVEYS:
    print(f"\nProcessing survey: {survey_name}")
    print("=" * 60)
    
    try:
        # Load data from SQL query
        df = execute_query_and_load_data(survey_name)
        
        if df is not None and len(df) > 0:
            # Store original data structure info
            original_shape = df.shape
            print(f"  Original shape: {original_shape}")
            
            # Add survey identifier and timestamp
            df['Survey_Name'] = survey_name
            df['Processed_Date'] = datetime.now()
            
            # Show configured mappings for this survey
            survey_mappings = mapper.get_survey_mappings(survey_name)
            if survey_mappings:
                print(f"  Configured mappings: {len(survey_mappings)} columns")
                
                # Apply transformations using the simple mapper
                df = mapper.transform_survey(df, survey_name, verbose=False)
                print(f"  ✓ Transformations applied")
            else:
                print(f"  ⚠ No mappings configured for this survey")
            
            # Validate that expected columns were transformed
            validation = mapper.validate_survey_config(survey_name, df)
            if validation['missing']:
                print(f"  ⚠ Missing columns: {validation['missing']}")
            
            # Store full dataset
            all_survey_data.append(df.copy())
            
            # Extract LTR and driver scores if available
            ltr_driver_subset = extract_ltr_and_drivers(df, survey_name, ALL_COLUMNS)
            if ltr_driver_subset is not None:
                nps_driver_data.append(ltr_driver_subset)
            
            # Record processing summary
            processing_summary.append({
                'Survey_Name': survey_name,
                'Records_Processed': len(df),
                'Columns_Count': len(df.columns),
                # Guard against an empty or missing mapping config
                'Columns_Transformed': len(survey_mappings) if survey_mappings else 0,
                'Has_LTR_Data': any(ALL_COLUMNS.get(col) == "LTR" for col in df.columns),
                'Has_Driver_Data': any(ALL_COLUMNS.get(col) == "DRIVERS" for col in df.columns),
                'Processing_Status': 'Success',
                'Processed_Time': datetime.now()
            })
            
            print(f"  ✓ Successfully processed {len(df)} records")
            
        else:
            status = 'No Data' if df is not None else 'Failed to load'
            print(f"  ✗ {status} for {survey_name}")
            processing_summary.append({
                'Survey_Name': survey_name,
                'Records_Processed': 0,
                'Columns_Count': 0,
                'Columns_Transformed': 0,
                'Has_LTR_Data': False,
                'Has_Driver_Data': False,
                'Processing_Status': status,
                'Processed_Time': datetime.now()
            })
            
    except Exception as e:
        print(f"  ✗ Error processing {survey_name}: {str(e)}")
        processing_summary.append({
            'Survey_Name': survey_name,
            'Records_Processed': 0,
            'Columns_Count': 0,
            'Columns_Transformed': 0,
            'Has_LTR_Data': False,
            'Has_Driver_Data': False,
            'Processing_Status': f'Error: {str(e)}',
            'Processed_Time': datetime.now()
        })

print(f"\n\nData processing complete. Processed {len(all_survey_data)} surveys successfully.")

## Output Generation

In [ ]:
print("Generating output files...\n")

# Define output paths
RAG_OUTPUT_DIR = "../rag_outputs/current/"
ARCHIVE_DIR = "../rag_outputs/archive/"
os.makedirs(RAG_OUTPUT_DIR, exist_ok=True)
os.makedirs(ARCHIVE_DIR, exist_ok=True)

# Output 1: Combined LTR and Driver Scores
if nps_driver_data:
    combined_ltr_drivers = pd.concat(nps_driver_data, ignore_index=True, sort=False)
    
    # Save to archive with timestamp
    archive_file = os.path.join(ARCHIVE_DIR, f"Combined_LTR_Drivers_{TIMESTAMP}.csv")
    combined_ltr_drivers.to_csv(archive_file, index=False)
    
    # Save to current for RAG access
    current_file = os.path.join(RAG_OUTPUT_DIR, "combined_ltr_drivers.csv")
    combined_ltr_drivers.to_csv(current_file, index=False)
    
    print(f"✓ LTR & Driver Scores:")
    print(f"  - RAG access: {current_file}")
    print(f"  - Archive: {archive_file}")
    print(f"  - Total records: {len(combined_ltr_drivers)}")
    print(f"  - Surveys included: {combined_ltr_drivers['Survey_Name'].nunique()}")
    print(f"  - Columns: {len(combined_ltr_drivers.columns)}")
else:
    print("✗ No LTR/Driver data found across surveys")

print()

# Output 2: All Survey Data Combined
if all_survey_data:
    combined_all_data = pd.concat(all_survey_data, ignore_index=True, sort=False)
    
    # Save to archive with timestamp
    archive_file = os.path.join(ARCHIVE_DIR, f"Combined_All_Surveys_{TIMESTAMP}.csv")
    combined_all_data.to_csv(archive_file, index=False)
    
    # Save to current for RAG access
    current_file = os.path.join(RAG_OUTPUT_DIR, "combined_all_surveys.csv")
    combined_all_data.to_csv(current_file, index=False)
    
    print(f"✓ All Survey Data:")
    print(f"  - RAG access: {current_file}")
    print(f"  - Archive: {archive_file}")
    print(f"  - Total records: {len(combined_all_data)}")
    print(f"  - Surveys included: {combined_all_data['Survey_Name'].nunique()}")
    print(f"  - Total columns: {len(combined_all_data.columns)}")
else:
    print("✗ No survey data processed successfully")

print()

# Output 3: Processing Summary
summary_df = pd.DataFrame(processing_summary)

# Save to archive with timestamp
archive_file = os.path.join(ARCHIVE_DIR, f"Processing_Summary_{TIMESTAMP}.csv")
summary_df.to_csv(archive_file, index=False)

# Save to current for RAG access
current_file = os.path.join(RAG_OUTPUT_DIR, "processing_summary.csv")
summary_df.to_csv(current_file, index=False)

print(f"✓ Processing Summary:")
print(f"  - RAG access: {current_file}")
print(f"  - Archive: {archive_file}")
print(f"  - Surveys processed: {len(summary_df)}")
print(f"  - Successful: {len(summary_df[summary_df['Processing_Status'] == 'Success'])}")
print(f"  - Failed: {len(summary_df[summary_df['Processing_Status'] != 'Success'])}")

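# Also keep backward compatibility - save copies to data/processed
# (file names mirror the RAG outputs above; adjust if older consumers expect other names)
OLD_OUTPUT_DIR = "../data/processed/"
os.makedirs(OLD_OUTPUT_DIR, exist_ok=True)
summary_df.to_csv(os.path.join(OLD_OUTPUT_DIR, "processing_summary.csv"), index=False)
if nps_driver_data:
    combined_ltr_drivers.to_csv(os.path.join(OLD_OUTPUT_DIR, "combined_ltr_drivers.csv"), index=False)
if all_survey_data:
    combined_all_data.to_csv(os.path.join(OLD_OUTPUT_DIR, "combined_all_surveys.csv"), index=False)
print(f"\n✓ Backward compatibility: Files also saved to {OLD_OUTPUT_DIR}")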

## Summary and Validation

In [ ]:
# Display processing summary
print("\n" + "="*60)
print("PROCESSING SUMMARY")
print("="*60)

summary_df = pd.DataFrame(processing_summary)
print(summary_df[['Survey_Name', 'Records_Processed', 'Columns_Transformed', 'Processing_Status']].to_string(index=False))

print(f"\nTotal records processed: {summary_df['Records_Processed'].sum()}")
print(f"Average records per survey: {summary_df['Records_Processed'].mean():.1f}")
print(f"Total columns transformed: {summary_df['Columns_Transformed'].sum()}")
print(f"Success rate: {len(summary_df[summary_df['Processing_Status'] == 'Success']) / len(summary_df) * 100:.1f}%")

In [ ]:
# Data quality checks
if all_survey_data:
    print("\n" + "="*60)
    print("DATA QUALITY CHECKS")
    print("="*60)
    
    combined_data = pd.concat(all_survey_data, ignore_index=True, sort=False)
    
    # Check for missing values
    missing_summary = combined_data.isnull().sum()
    print(f"Columns with missing values: {len(missing_summary[missing_summary > 0])}")
    
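    # Check for duplicate ResponseIDs within surveys (skipped if the column is absent)
    if 'ResponseID' in combined_data.columns:
        # dropna() keeps rows from surveys without a ResponseID from counting as duplicates
        duplicate_check = combined_data.groupby('Survey_Name')['ResponseID'].apply(lambda x: x.dropna().duplicated().sum())
        print(f"Surveys with duplicate ResponseIDs: {len(duplicate_check[duplicate_check > 0])}")
    else:
        print("ResponseID column not found; skipping duplicate check")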
    
    # Data type summary
    print(f"Total unique columns across all surveys: {len(combined_data.columns)}")
    print(f"Numeric columns: {len(combined_data.select_dtypes(include=[np.number]).columns)}")
    print(f"Text columns: {len(combined_data.select_dtypes(include=['object']).columns)}")

## Next Steps

### 1. Configure Survey SQL
For each survey, create a folder in `sql_queries/survey_name/` containing:
- `get_data.sql` - SQL query to extract survey data
- `dev_notes.md` - Documentation about the SQL logic

Example structure:
```
sql_queries/
├── survey_1/
│   ├── get_data.sql
│   └── dev_notes.md
├── survey_2/
│   ├── get_data.sql
│   └── dev_notes.md
└── survey_6/      # Your new survey
    ├── get_data.sql
    └── dev_notes.md
```

### 2. Configure Survey Mappings
Edit `config/survey_column_mappings.json` to specify which columns need which dictionary mappings:
```json
"survey_6": {
  "columns": {
    "Column_Name": "dictionary_name",
    "Another_Column": "another_dictionary"
  }
}
```
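
A typo in a dictionary name would otherwise surface only at transform time, so it may help to cross-check the two config files first. A minimal sketch, assuming `survey_column_mappings.json` follows the `survey -> "columns" -> {column: dictionary_name}` layout shown above and that `mapping_dictionaries.json` is keyed by dictionary name. The `find_unknown_dictionaries` helper is hypothetical, and the configs are inlined as strings (with abbreviated dictionary contents) so the example runs on its own:

```python
import json

# Inlined stand-in for config/survey_column_mappings.json (note the deliberate typo)
column_mappings_json = """
{
  "survey_6": {
    "columns": {
      "Column_Name": "yes_no_binary",
      "Another_Column": "satisfcation_5_scale"
    }
  }
}
"""

# Inlined, abbreviated stand-in for config/mapping_dictionaries.json
mapping_dictionaries_json = """
{
  "yes_no_binary": {"yes": 1, "no": 0},
  "satisfaction_5_scale": {"very dissatisfied": 1, "very satisfied": 5}
}
"""


def find_unknown_dictionaries(column_mappings, dictionaries):
    """Return (survey, column, dictionary_name) triples whose dictionary is undefined."""
    problems = []
    for survey, config in column_mappings.items():
        for column, dict_name in config.get("columns", {}).items():
            if dict_name not in dictionaries:
                problems.append((survey, column, dict_name))
    return problems


problems = find_unknown_dictionaries(
    json.loads(column_mappings_json),
    json.loads(mapping_dictionaries_json),
)
for survey, column, dict_name in problems:
    print(f"{survey}.{column}: unknown dictionary '{dict_name}'")
```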

### 3. Document SQL Logic
In each survey's `dev_notes.md`, document:
- Data source (database, tables, views)
- Column mappings and descriptions
- SQL logic (joins, filters, transformations)
- Data quality notes
- Change history

### 4. Available Dictionaries
The system includes 15 pre-defined mapping dictionaries in `config/mapping_dictionaries.json`:
- `satisfaction_5_scale`: Very Dissatisfied → 1, ..., Very Satisfied → 5
- `agreement_5_scale`: Strongly Disagree → 1, ..., Strongly Agree → 5
- `frequency_5_scale`: Never → 1, ..., Always → 5
- `quality_5_scale`: Very Poor → 1, ..., Excellent → 5
- `yes_no_binary`: Yes → 1, No → 0
- `true_false_binary`: True → 1, False → 0
- `importance_5_scale`: Not Important → 1, ..., Extremely Important → 5
- `likelihood_5_scale`: Very Unlikely → 1, ..., Very Likely → 5
- `difficulty_5_scale`: Very Difficult → 1, ..., Very Easy → 5
- `performance_5_scale`: Far Below Expectations → 1, ..., Far Above Expectations → 5
- `recommendation_10_scale`: 0 → 0, ..., 10 → 10
- `priority_3_scale`: Low → 1, Medium → 2, High → 3
- `size_5_scale`: Much Too Small → 1, ..., Much Too Large → 5
- `change_5_scale`: Much Worse → 1, ..., Much Better → 5
- `awareness_4_scale`: Not Aware → 1, ..., Extremely Aware → 4
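
Responses that do not appear in the chosen dictionary cannot be converted, so it can be worth listing them before transforming. A minimal sketch in plain Python, assuming lookups are case-insensitive (the notebook does not say whether they are); the dictionary contents shown and the `unmapped_values` helper are illustrative:

```python
from collections import Counter

# Illustrative copy of one dictionary; the real definitions live in the JSON config
priority_3_scale = {"low": 1, "medium": 2, "high": 3}


def unmapped_values(responses, lookup):
    """Count responses (after trimming and lowercasing) that have no mapping."""
    counts = Counter()
    for value in responses:
        if value is None:
            continue  # missing answers are not mapping errors
        key = str(value).strip().lower()
        if key not in lookup:
            counts[key] += 1
    return dict(counts)


responses = ["High", "medium", "urgent", None, "LOW", "Urgent ", "n/a"]
leftovers = unmapped_values(responses, priority_3_scale)
print(leftovers)
```

Any value reported here either needs a new dictionary entry or points to a column assigned to the wrong dictionary.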

### 5. Database Connection
Update the `execute_query_and_load_data` function with your actual database connection details when ready to process real data.
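
Before wiring in a real database, the query path can be exercised against a seeded SQLite database. A minimal sketch using only the standard library; the `survey_responses` table, its columns, and the sample query are assumptions standing in for a real `get_data.sql`:

```python
import sqlite3

# Stand-in for the contents of a survey's get_data.sql (hypothetical query)
query = "SELECT ResponseID, Would_Return FROM survey_responses ORDER BY ResponseID"

conn = sqlite3.connect(":memory:")
try:
    # Seed a tiny table so the query has something to return
    conn.execute("CREATE TABLE survey_responses (ResponseID INTEGER, Would_Return TEXT)")
    conn.executemany(
        "INSERT INTO survey_responses VALUES (?, ?)",
        [(1, "yes"), (2, "no"), (3, "yes")],
    )
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
finally:
    conn.close()

print(columns)
print(len(rows))
```

An open connection of this kind can be passed as the `connection` argument of `execute_query_and_load_data`, which then skips the in-memory fallback and leaves closing the connection to the caller.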