# NovaConnect Data Processing Notebook

## Purpose

This notebook loads and processes all data for the NovaConnect Snowflake Intelligence lab.

**What this notebook does:**
1. **Load Structured Data:** Import 6 CSV files (81,475 records) from @CSV_STAGE
2. **Process Audio Files:** Transcribe 25 customer support calls using AI_TRANSCRIBE from @AUDIO_STAGE
3. **Process PDF Documents:** Extract text from 8 NovaConnect help documents using AI_PARSE_DOCUMENT from @PDF_STAGE
4. **Apply AI Analysis:** Use Cortex AI functions (AI_SENTIMENT, AI_CLASSIFY, SUMMARIZE, AI_COMPLETE)
5. **Populate Tables:** Save all processed data for analytics and Cortex Search

**Prerequisites:**
- setup.sql has been run successfully
- Files uploaded to stages:
  - @CSV_STAGE: 6 CSV files (network_performance.csv, customer_details.csv, etc.)
  - @AUDIO_STAGE: 25 MP3 audio files (CALL_001.mp3 to CALL_025.mp3)
  - @PDF_STAGE: 8 PDF documents (NovaConnect help guides)

**Duration:** 15-20 minutes

**Output:** Populated tables ready for intelligence_lab.ipynb exploration
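
The stage prerequisites above can be sanity-checked before any processing starts. The following is a minimal sketch, not part of the lab itself: `EXPECTED_FILES` and `missing_files` are hypothetical names, and the input is assumed to be a list of file names such as the `name` column returned by a `LIST @stage` query.

```python
from collections import Counter
from pathlib import PurePosixPath

# Expected file counts per extension, taken from the prerequisites above.
EXPECTED_FILES = {".csv": 6, ".mp3": 25, ".pdf": 8}

def missing_files(names, expected=EXPECTED_FILES):
    """Return {extension: shortfall} for extensions with too few files."""
    found = Counter(PurePosixPath(n).suffix.lower() for n in names)
    return {ext: want - found[ext]
            for ext, want in expected.items() if found[ext] < want}

# Example with made-up stage listings (hypothetical file names).
listing = (
    [f"audio_stage/call_recordings/CALL_{i:03d}.mp3" for i in range(1, 26)]
    + [f"csv_stage/file_{i}.csv" for i in range(1, 6)]   # only 5 of 6 CSVs
    + [f"pdf_stage/doc_{i}.pdf" for i in range(1, 9)]
)
print(missing_files(listing))  # {'.csv': 1}
```

An empty result means every stage holds at least the expected number of files.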

In [None]:
# Setup
from snowflake.snowpark import Session, Row
from snowflake.snowpark.functions import col, lit, call_builtin
from snowflake.snowpark.types import StringType, TimestampType
from datetime import datetime
import random

session = Session.builder.getOrCreate()
print(f"Connected to: {session.get_current_database()}.{session.get_current_schema()}")


## Part 1: Load Structured Data from CSV Files

**What happens in this section:**
- Loads 6 CSV files from @CSV_STAGE using COPY INTO command
- Truncates existing data to ensure clean re-runs
- Tables loaded:
  - **network_performance:** 49,864 tower performance metrics
  - **customer_feedback_summary:** 20,061 daily feedback aggregations
  - **customer_details:** 10,000 customer profiles
  - **infrastructure_capacity:** 1,500 tower capacity snapshots
  - **csat_surveys:** 25 customer satisfaction surveys
  - **customer_interaction_history:** 25 customer interaction summaries

**Expected output:** All structured data tables populated and ready for analysis

**Processing time:** 1-2 minutes
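
Because the load uses `ON_ERROR = 'CONTINUE'`, rows that fail to parse are skipped instead of stopping the load, so it can be worth comparing the loaded counts with the figures listed above. A minimal sketch of such a comparison follows; `EXPECTED_ROWS` and `count_mismatches` are hypothetical helper names, and the `loaded` values are made up for illustration.

```python
# Expected row counts per table, as listed in this section.
EXPECTED_ROWS = {
    "network_performance": 49_864,
    "customer_feedback_summary": 20_061,
    "customer_details": 10_000,
    "infrastructure_capacity": 1_500,
    "csat_surveys": 25,
    "customer_interaction_history": 25,
}

def count_mismatches(actual, expected=EXPECTED_ROWS):
    """Return {table: (expected, actual)} for every table whose count differs."""
    return {
        table: (want, actual.get(table, 0))
        for table, want in expected.items()
        if actual.get(table, 0) != want
    }

# Example: pretend three rows of one file were skipped (made-up numbers).
loaded = dict(EXPECTED_ROWS, customer_details=9_997)
print(count_mismatches(loaded))  # {'customer_details': (10000, 9997)}
```

In the notebook, `actual` could be built from the `SELECT COUNT(*)` results that the load cell already prints.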

In [None]:
# Load CSV files from @CSV_STAGE with fully qualified names
print("Loading structured data from @CSV_STAGE...\n")

# Get current database and use DEFAULT_SCHEMA where tables are located
current_db = session.get_current_database()
target_schema = "DEFAULT_SCHEMA"  # Tables are in DEFAULT_SCHEMA, not NOTEBOOKS
print(f"Loading into: {current_db}.{target_schema}\n")

# Define CSV files in stage and their target tables
csv_files = {
    'network_performance': 'network_performance.csv',
    'customer_feedback_summary': 'customer_feedback_summary.csv',
    'customer_details': 'customer_details.csv',
    'infrastructure_capacity': 'infrastructure_capacity.csv',
    'csat_surveys': 'csat_surveys.csv',
    'customer_interaction_history': 'customer_interaction_history.csv'
}

# Load each CSV from stage using COPY INTO with fully qualified names
for table_name, file_name in csv_files.items():
    # Build fully qualified table name - tables are in DEFAULT_SCHEMA
    fully_qualified_table = f"{current_db}.{target_schema}.{table_name}"
    
    # Clear existing data to avoid duplicates on re-run
    session.sql(f"TRUNCATE TABLE IF EXISTS {fully_qualified_table}").collect()
    
    # Load from CSV_STAGE with fully qualified names
    session.sql(f"""
    COPY INTO {fully_qualified_table}
    FROM @{current_db}.{target_schema}.CSV_STAGE/{file_name}
    FILE_FORMAT = (FORMAT_NAME = '{current_db}.{target_schema}.csv_format')
    ON_ERROR = 'CONTINUE'
    PURGE = FALSE
    """).collect()
    
    # Count records using fully qualified name
    count = session.sql(f"SELECT COUNT(*) FROM {fully_qualified_table}").collect()[0][0]
    print(f"Loaded {count:,} records into {table_name.upper()}")

print("\nAll structured data loaded from stage!")

## Part 2: Process Audio Files with AI_TRANSCRIBE

**What happens in this section:**
- Lists all audio files (MP3/WAV/M4A) from @AUDIO_STAGE
- Transcribes each audio file to text using AI_TRANSCRIBE
- Extracts call duration from the `audio_duration` field of the transcription result
- Applies AI analysis:
  - **AI_SENTIMENT:** Analyzes customer emotion (returns VARIANT with sentiment categories)
  - **SUMMARIZE:** Creates concise call summaries
  - **AI_CLASSIFY:** Categorizes calls (Network Issue, Billing Issue, etc.)
  - **AI_COMPLETE:** Extracts agent names and key customer issues
- Saves to CUSTOMER_CALL_TRANSCRIPTS table
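
The first step, turning each listed file into a stage-relative path and a call ID, can be sketched in plain Python. This is an illustrative variant rather than the cell's exact code: `parse_stage_entry` is a hypothetical helper, and it assumes `LIST` names of the form `audio_stage/call_recordings/CALL_001.mp3`, where the first path component is the stage name.

```python
from pathlib import PurePosixPath

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}

def parse_stage_entry(name):
    """Split a LIST `name` into (relative_path, call_id), or None if not audio.

    Assumes the first path component is the stage name.
    """
    path = PurePosixPath(name)
    if path.suffix.lower() not in AUDIO_EXTENSIONS:
        return None
    relative_path = "/".join(path.parts[1:])  # drop the stage-name prefix
    return relative_path, path.stem           # stem strips the extension

print(parse_stage_entry("audio_stage/call_recordings/CALL_001.mp3"))
# ('call_recordings/CALL_001.mp3', 'CALL_001')
print(parse_stage_entry("audio_stage/call_recordings/CALL_002.MP3"))
# ('call_recordings/CALL_002.MP3', 'CALL_002')
print(parse_stage_entry("audio_stage/notes.txt"))  # None
```

Using the path's stem instead of chained string replacements keeps the call ID correct even when the extension is upper-case.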

**Expected output:** 25 fully analyzed call transcripts with all fields populated

**Processing time:** 5-10 minutes (parallel execution)
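
Model outputs usually benefit from light cleanup before they are stored. The sketch below shows two such steps in plain Python under stated assumptions: the classification result is taken to have the shape `{"labels": [...]}` (inferred from how the cell below reads it), and the issue list is taken to be a comma-separated string. `first_label` and `clean_issues` are hypothetical helpers, and the fallback label is an illustrative choice.

```python
import json

def first_label(classify_result, default="General"):
    """Return the first label from a classification result, or `default`."""
    if isinstance(classify_result, str):
        classify_result = json.loads(classify_result)
    labels = (classify_result or {}).get("labels") or []
    return labels[0] if labels else default

def clean_issues(issues_text):
    """Split a comma-separated model answer into trimmed, non-empty items."""
    return [item.strip() for item in issues_text.split(",") if item.strip()]

print(first_label('{"labels": ["Billing Issue"]}'))  # Billing Issue
print(first_label({"labels": []}))                   # General
print(clean_issues("dropped calls, slow data , ,billing error"))
# ['dropped calls', 'slow data', 'billing error']
```

A plain split on commas keeps the whitespace around each item, which is why the trimming step can matter when the array is later filtered or searched.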

In [None]:
# Get current database and use DEFAULT_SCHEMA where AUDIO_STAGE is located
current_db = session.get_current_database()
target_schema = "DEFAULT_SCHEMA"  # AUDIO_STAGE is in DEFAULT_SCHEMA

# Get audio files from @AUDIO_STAGE with fully qualified name
all_files = session.sql(f"LIST @{current_db}.{target_schema}.AUDIO_STAGE").collect()

# Filter for audio files (.mp3, .wav, .m4a)
audio_files = [f for f in all_files if f['name'].lower().endswith(('.mp3', '.wav', '.m4a'))]
print(f"Found {len(audio_files)} audio files in @{current_db}.{target_schema}.AUDIO_STAGE\n")

if len(audio_files) == 0:
    print("ERROR: No audio files found in @AUDIO_STAGE")
    print("Please upload MP3 files (CALL_001.mp3, etc.) to AUDIO_STAGE/call_recordings/")
    raise Exception("No audio files found")

# Build DataFrame with fully qualified stage paths
audio_data = []
for f in audio_files:
    # The LIST result 'name' field contains the full path like 'audio_stage/call_recordings/CALL_001.mp3'
    # We need to extract the relative path after the stage name
    full_path = f['name']
    # Get everything after 'audio_stage/' to get 'call_recordings/CALL_001.mp3'
    relative_path = '/'.join(full_path.split('/')[1:])  # Remove stage name prefix
    
    # Extract just the filename for call_id
    file_name = full_path.split('/')[-1]
    call_id = file_name.rsplit('.', 1)[0]  # Strip the extension regardless of its case
    
    # Construct fully qualified stage path with the full relative path
    fully_qualified_path = f"@{current_db}.{target_schema}.AUDIO_STAGE/{relative_path}"
    
    audio_data.append(Row(
        call_id=call_id,
        customer_id=f"CUST_{random.randint(1, 10000):06d}",
        file_path=fully_qualified_path,
        call_date=datetime.now()
    ))

audio_df = session.create_dataframe(audio_data)
print(f"Prepared {len(audio_data)} audio files for processing")
print(f"Sample path: {audio_data[0]['file_path']}")

# Transcribe with AI_TRANSCRIBE
print("Transcribing audio...")

# Use TO_FILE to create file reference (per Snowflake docs)
transcribed = audio_df.with_column(
    "file_ref",
    call_builtin("TO_FILE", col("file_path"))
).with_column(
    "transcription_result",
    call_builtin("SNOWFLAKE.CORTEX.AI_TRANSCRIBE", col("file_ref"))
).with_column(
    "transcript_text",
    col("transcription_result")["text"].cast(StringType())
).with_column(
    "call_duration_seconds",
    col("transcription_result")["audio_duration"].cast("INT")
)

# Apply AI analysis
transcribed = transcribed.with_column(
    "sentiment_score",
    call_builtin("SNOWFLAKE.CORTEX.AI_SENTIMENT", col("transcript_text"))
).with_column(
    "summary",
    call_builtin("SNOWFLAKE.CORTEX.SUMMARIZE", col("transcript_text"))
).with_column(
    "classify_result",
    call_builtin("SNOWFLAKE.CORTEX.AI_CLASSIFY", col("transcript_text"),
                 lit(['Network Issue', 'Billing Issue', 'Technical Support', 'General']))
).with_column(
    "call_reason",
    col("classify_result")["labels"][0].cast(StringType())
).drop("classify_result")

print("Extracting additional insights...")

# Extract agent name using AI_COMPLETE
transcribed = transcribed.with_column(
    "agent_name",
    call_builtin("SNOWFLAKE.CORTEX.AI_COMPLETE", lit("mistral-large2"),
        call_builtin("CONCAT",
            lit("Extract only the agent's name from this call transcript. Return just the first name or 'Agent' if none mentioned: "),
            col("transcript_text")))
)

# Extract key issues as ARRAY using AI_COMPLETE
transcribed = transcribed.with_column(
    "issues_text",
    call_builtin("SNOWFLAKE.CORTEX.AI_COMPLETE", lit("mistral-large2"),
        call_builtin("CONCAT",
            lit("Extract main customer issues as comma-separated list: "),
            col("transcript_text")))
).with_column(
    "key_issues",
    call_builtin("SPLIT", col("issues_text"), lit(","))
).drop("issues_text")

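# Materialize the results once; Snowpark DataFrames are lazy, so each later
# action (count, write) would otherwise re-run every Cortex call
transcribed = transcribed.cache_result()
print(f"Processed {transcribed.count()} transcripts with full AI analysis")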

# Clear existing data before saving (for re-runs) using fully qualified name
fully_qualified_table = f"{current_db}.{target_schema}.customer_call_transcripts"
session.sql(f"TRUNCATE TABLE IF EXISTS {fully_qualified_table}").collect()

# Save to table with fully qualified name
transcribed.select(
    "call_id",
    "customer_id",
    "call_date",
    "call_duration_seconds",
    col("file_path").alias("audio_file_path"),
    "transcript_text",
    "sentiment_score",
    "key_issues",
    "summary",
    lit("Processed").alias("resolution_status"),
    "agent_name",
    "call_reason",
    lit(None).alias("csat_score")
).write.mode("append").save_as_table(fully_qualified_table)

print(f"Saved {transcribed.count()} call transcripts with all fields populated")


## Part 3: Process PDF Documents with AI_PARSE_DOCUMENT

**What happens in this section:**
- Uses bulk processing (single INSERT SELECT) for parallel execution
- Parses all PDF files from @PDF_STAGE using AI_PARSE_DOCUMENT with LAYOUT mode
- Extracts full text content for Cortex Search indexing
- PDFs are NovaConnect help documentation:
  - 5G plans information
  - Help Centre guides (prepaid, roaming, internet issues, etc.)
  - One Plan benefits
- No sentiment analysis (these are reference documents, not customer feedback)
- Saves to CUSTOMER_COMPLAINT_DOCUMENTS table

**Expected output:** 8 help documents parsed and ready for Cortex Search

**Processing time:** 2-3 minutes (all PDFs processed in parallel)

**Use case:** Cortex Search will index this content so the Intelligence Agent can answer questions about NovaConnect plans and policies
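
The constant columns written for each help document can be illustrated without a Snowflake connection. The following sketch mirrors the column values of the bulk INSERT for a single file; `build_document_row` is a hypothetical helper, the stage and file names are made up, and `extracted_text` is left out because it comes from AI_PARSE_DOCUMENT.

```python
from datetime import date
from pathlib import PurePosixPath

def build_document_row(stage, relative_path, today=None):
    """Return the non-AI column values for one PDF, or None for other files."""
    path = PurePosixPath(relative_path)
    if path.suffix.lower() != ".pdf":
        return None
    return {
        "document_id": str(path.with_suffix("")),  # path without the extension
        "customer_id": "SYSTEM",
        "document_date": today or date.today(),
        "document_type": "Help Document",
        "file_path": f"@{stage}/{relative_path}",
        "sentiment_score": None,                   # reference docs are not scored
        "complaint_category": "Documentation",
        "priority_level": "Reference",
        "summary": relative_path,
    }

# Hypothetical stage and file name, for illustration only.
row = build_document_row("MY_DB.DEFAULT_SCHEMA.PDF_STAGE", "roaming_guide.PDF")
print(row["document_id"])  # roaming_guide
print(row["file_path"])    # @MY_DB.DEFAULT_SCHEMA.PDF_STAGE/roaming_guide.PDF
```

Unlike a case-sensitive string replacement of `.pdf`, this variant also strips an upper-case extension from the document ID.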

In [None]:
# Bulk process PDF files from @PDF_STAGE with fully qualified names
try:
    # Get current database and use DEFAULT_SCHEMA where tables/stages are located
    current_db = session.get_current_database()
    target_schema = "DEFAULT_SCHEMA"  # Tables and PDF_STAGE are in DEFAULT_SCHEMA
    fully_qualified_table = f"{current_db}.{target_schema}.customer_complaint_documents"
    fully_qualified_stage = f"{current_db}.{target_schema}.PDF_STAGE"
    
    # Get PDF file count
    all_files = session.sql(f"LIST @{fully_qualified_stage}").collect()
    pdf_count = len([f for f in all_files if f['name'].lower().endswith('.pdf')])
    
    print(f"Found {pdf_count} PDF files in @PDF_STAGE")
    
    if pdf_count > 0:
        # Clear existing data
        session.sql(f"TRUNCATE TABLE IF EXISTS {fully_qualified_table}").collect()
        
        # Bulk process all PDFs in single query (parallel processing)
        print("Processing all PDFs in parallel...")
        
        session.sql(f"""
        INSERT INTO {fully_qualified_table} (
            document_id, customer_id, document_date, document_type,
            file_path, extracted_text, sentiment_score,
            complaint_category, priority_level, summary
        )
        SELECT 
            REPLACE(relative_path, '.pdf', '') as document_id,
            'SYSTEM' as customer_id,
            CURRENT_DATE() as document_date,
            'Help Document' as document_type,
            '@{fully_qualified_stage}/' || relative_path as file_path,
            SNOWFLAKE.CORTEX.AI_PARSE_DOCUMENT(
                TO_FILE('@{fully_qualified_stage}', relative_path), 
                {{'mode': 'LAYOUT'}}
            ):content::VARCHAR as extracted_text,
            NULL::VARIANT as sentiment_score,
            'Documentation' as complaint_category,
            'Reference' as priority_level,
            relative_path as summary
        FROM DIRECTORY('@{fully_qualified_stage}')
        WHERE LOWER(relative_path) LIKE '%.pdf'
        """).collect()
        
        count = session.sql(f"SELECT COUNT(*) FROM {fully_qualified_table}").collect()[0][0]
        print(f"\nProcessed and saved {count} help documents in parallel")
    else:
        print("No PDFs found - skipping")
        
except Exception as e:
    print(f"PDF processing error: {e}")

## Summary: Data Processing Complete

**Verification:**
This cell verifies all data has been successfully loaded and processed.

**What to check:**
- All 8 tables should show record counts
- customer_call_transcripts: 25 records
- customer_complaint_documents: 8 records
- csat_surveys and customer_interaction_history: 25 records each
- Other tables: 1,500 to 49,864 records

**If any table shows "Not loaded":**
- Check that files were uploaded to the matching stage (@CSV_STAGE, @AUDIO_STAGE, or @PDF_STAGE)
- Verify file names match expected names
- Re-run the specific Part (1, 2, or 3) above

**Next step:** Run intelligence_lab.ipynb for hands-on exploration and visualization
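
The troubleshooting steps above amount to mapping each table back to the Part that populates it. A minimal sketch of that mapping follows; `TABLE_TO_PART` and `parts_to_rerun` are hypothetical names, and the example counts are made up.

```python
CSV_TABLES = [
    "network_performance", "customer_details", "customer_feedback_summary",
    "infrastructure_capacity", "csat_surveys", "customer_interaction_history",
]

# Which notebook Part populates which table.
TABLE_TO_PART = {
    **{table: 1 for table in CSV_TABLES},
    "customer_call_transcripts": 2,
    "customer_complaint_documents": 3,
}

def parts_to_rerun(counts):
    """Return sorted Part numbers whose tables are missing or empty.

    `counts` maps table name -> row count, with None for a failed query.
    """
    return sorted({part for table, part in TABLE_TO_PART.items()
                   if not counts.get(table)})

# Example with made-up counts: one empty table and one failed query.
counts = {table: 100 for table in TABLE_TO_PART}
counts["customer_call_transcripts"] = 0
counts["csat_surveys"] = None
print(parts_to_rerun(counts))  # [1, 2]
```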


In [None]:
# Verify all data loaded with fully qualified names
print("Data Loading Summary:")
print("=" * 60)

# Get current database and use DEFAULT_SCHEMA where tables are located
current_db = session.get_current_database()
target_schema = "DEFAULT_SCHEMA"  # Tables are in DEFAULT_SCHEMA
print(f"Schema: {current_db}.{target_schema}\n")

tables = ['network_performance', 'customer_details', 'customer_feedback_summary',
          'infrastructure_capacity', 'csat_surveys', 'customer_interaction_history',
          'customer_call_transcripts', 'customer_complaint_documents']

for table in tables:
    try:
        fully_qualified_table = f"{current_db}.{target_schema}.{table}"
        count = session.sql(f"SELECT COUNT(*) FROM {fully_qualified_table}").collect()[0][0]
        print(f"{table:40s}: {count:>6,} records")
    except Exception as e:
        print(f"{table:40s}: Not loaded ({str(e)[:50]})")

print("\n" + "=" * 60)
print("Data processing complete!")
print("Next: Run intelligence_lab.ipynb for hands-on exploration")
