# Sentiment Analysis with MLflow Integration

This notebook performs sentiment analysis on customer call transcripts. It processes raw text data, applies a pre-trained sentiment analysis model (DistilBERT fine-tuned on SST-2), calculates sentiment scores, and saves them to a Unity Catalog table for downstream business intelligence and anomaly detection.

### Core Functionality:
*   **Data Loading**: Ingests call transcript data from a specified source table.
*   **Sentiment Scoring**: Applies a transformer-based model to derive sentiment scores from text summaries. It handles long texts by breaking them into manageable chunks.
*   **Results Storage**: Writes the enriched data, including the sentiment scores, to a target table in Unity Catalog.
*   **Configuration Management**: Uses a centralized `config.py` to manage settings such as table names, model names, and logging levels, which keeps them in one place and easier to maintain.
*   **Robust Processing**: Adds logging for traceability and error handling, including fallbacks when model loading fails. It also supports batch processing for large datasets.
*   **MLflow Integration**: Leverages MLflow for model lifecycle management.
    *   **Model Registration**: The sentiment analysis model is programmatically registered in the MLflow Model Registry. This ensures that the exact model used for analysis is tracked and versioned.
    *   **Version Control**: MLflow automatically versions the registered model, allowing for reproducibility and easy rollback to previous model versions if needed.
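
The chunking behavior described under **Sentiment Scoring** might look roughly like the sketch below. It is a minimal illustration, not the notebook's actual implementation: `chunk_text`, `score_long_text`, the word-based chunk size, and the length-weighted averaging are all assumptions, and `score_fn` stands in for a call to the real model.

```python
from typing import Callable, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def score_long_text(
    text: str,
    score_fn: Callable[[str], float],
    max_words: int = 200,
) -> float:
    """Score each chunk and return the word-count-weighted average.

    Returns 0.0 (treated here as neutral) for empty input.
    """
    chunks = chunk_text(text, max_words)
    if not chunks:
        return 0.0
    weights = [len(chunk.split()) for chunk in chunks]
    scores = [score_fn(chunk) for chunk in chunks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy scorer standing in for the real model (hypothetical):
# +1 if the chunk contains "great", -1 if it contains "terrible", else 0.
def toy_score(chunk: str) -> float:
    if "great" in chunk:
        return 1.0
    if "terrible" in chunk:
        return -1.0
    return 0.0
```

Transformer models limit input length in tokens rather than words (DistilBERT accepts at most 512 tokens), so a production version would more likely chunk with the model's tokenizer than on whitespace.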

## Setup and Configuration

In [None]:
# Standard library imports
import logging

# Third-party imports
import mlflow
import pandas as pd

# Local application imports
import config
import de_utils as dut
from mlflow_utils import SentimentAnalyzer, create_model_serving_endpoint

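# Configure logging; fall back to INFO if LOG_LEVEL is not a valid level name
log_level = getattr(logging, str(config.LOG_LEVEL).upper(), logging.INFO)
logging.basicConfig(level=log_level)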
logger = logging.getLogger(__name__)

print(f"Source table: {config.SOURCE_TABLE}")
print(f"Target table: {config.TARGET_TABLE}")
print(f"Model name: {config.MLFLOW_MODEL_NAME}")
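
For reference, a `config.py` supporting the cell above could look something like the following. Only the attribute names (`SOURCE_TABLE`, `TARGET_TABLE`, `MLFLOW_MODEL_NAME`, `LOG_LEVEL`, and `TEXT_COLUMN`, which is used later) come from this notebook; every value shown is a placeholder assumption, as is the environment-variable override pattern.

```python
# config.py (illustrative sketch; all values are placeholders)
import os

# Fully qualified Unity Catalog table names: <catalog>.<schema>.<table>
SOURCE_TABLE = os.environ.get("SOURCE_TABLE", "my_catalog.my_schema.call_transcripts")
TARGET_TABLE = os.environ.get("TARGET_TABLE", "my_catalog.my_schema.call_sentiment")

# Column holding the transcript text to score
TEXT_COLUMN = os.environ.get("TEXT_COLUMN", "summary")

# Name under which the model is registered in MLflow
MLFLOW_MODEL_NAME = os.environ.get("MLFLOW_MODEL_NAME", "my_catalog.my_schema.sentiment_model")

# Standard logging level name: DEBUG, INFO, WARNING, ERROR, or CRITICAL
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO").upper()
```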

## Load Data

In [None]:
# Load the source table with the project's helper, then convert to pandas.
# Note: toPandas() collects the full table onto the driver.
df_spark = dut.read_data_from_bricks_catalog(config.SOURCE_TABLE)
df = df_spark.toPandas()

print(f"Loaded {len(df)} records")
print(f"Columns: {list(df.columns)}")
if config.TEXT_COLUMN in df.columns and not df.empty:
    # Preview the first transcript, truncated to 200 characters
    print(f"Text column '{config.TEXT_COLUMN}' preview:")
    print(str(df[config.TEXT_COLUMN].iloc[0])[:200] + "...")
else:
    print(f"❌ Warning: column '{config.TEXT_COLUMN}' not found or table is empty!")

## Model Registration with MLflow

In [None]:
# Initialize sentiment analyzer
analyzer = SentimentAnalyzer()

# Register model with MLflow
print("Registering sentiment analysis model with MLflow...")
model_info = analyzer.register_model_with_mlflow(
    model_version_description="Blue Bricks DistilBERT sentiment analysis model for customer call transcripts"
)
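
`SentimentAnalyzer.register_model_with_mlflow` is defined in the project's `mlflow_utils` module, which is not shown here. A helper along these lines is one way such a method could register an already-logged model and attach a description. `build_runs_uri`, `register_logged_model`, and their parameters are hypothetical; `mlflow.register_model` and `MlflowClient.update_model_version` are standard MLflow calls.

```python
def build_runs_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a runs:/ URI pointing at a model logged under an MLflow run."""
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


def register_logged_model(run_id: str, model_name: str, description: str):
    """Register a logged model and set a description on the new version.

    Assumes the model was previously logged under the run at the
    artifact path "model".
    """
    # Imported lazily so the helper can be defined where MLflow is absent.
    import mlflow
    from mlflow.tracking import MlflowClient

    # Each call creates a new version under model_name
    # (the registered model is created if it does not exist yet).
    model_version = mlflow.register_model(
        model_uri=build_runs_uri(run_id),
        name=model_name,
    )

    # Attach the human-readable description to that specific version.
    MlflowClient().update_model_version(
        name=model_name,
        version=model_version.version,
        description=description,
    )
    return model_version
```

Because every registration produces a new numbered version, an earlier version can still be loaded by its number if a rollback is needed.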

## Model Serving Setup

In [None]:
# Create a model serving endpoint; fall back to direct inference if model serving is not enabled
print("Creating model serving endpoint...")

try:
    endpoint = create_model_serving_endpoint()
    if endpoint:
        print(f"Serving endpoint created: {endpoint.name}")
    else:
        print("Serving endpoint creation skipped - will use direct model inference")
except Exception as e:
    print(f"Could not create serving endpoint: {e}")
    print("Pipeline will use direct model inference.")
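
The fallback to direct inference mentioned above can be expressed as a small wrapper. This is a sketch under assumptions rather than the pipeline's real logic: `score_with_fallback`, `endpoint_scorer`, and `local_scorer` are hypothetical names, and both scorers are modeled as plain callables that map a list of texts to a list of scores.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)

# A scorer takes a list of texts and returns one score per text.
Scorer = Callable[[List[str]], List[float]]


def score_with_fallback(
    texts: List[str],
    endpoint_scorer: Optional[Scorer],
    local_scorer: Scorer,
) -> List[float]:
    """Try the serving endpoint first; use the local model if it is
    unavailable (None) or raises an error."""
    if endpoint_scorer is not None:
        try:
            return endpoint_scorer(texts)
        except Exception as exc:
            logger.warning("Endpoint scoring failed (%s); using local model", exc)
    return local_scorer(texts)
```

Passing `None` as `endpoint_scorer` covers the case where endpoint creation was skipped, so callers need only one code path.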

## Sentiment Analysis Pipeline

In [None]:
# Use the complete pipeline function (recommended)
print("Running complete sentiment analysis pipeline...")

success = dut.process_sentiment_analysis_pipeline(
    source_table=config.SOURCE_TABLE,
    target_table=config.TARGET_TABLE,
    use_mlflow_model=True,  # Use MLflow registered model
    batch_size=1000,
)

if success:
    print("Pipeline completed successfully!")
    print(f"Results saved to: {config.TARGET_TABLE}")
else:
    print("❌ Pipeline failed - check logs for details")
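
The internals of `process_sentiment_analysis_pipeline` live in `de_utils` and are not shown in this notebook. As a rough idea of what a `batch_size` argument typically controls, the sketch below splits a sequence into fixed-size batches; `iter_batches` is a hypothetical helper, not part of the project's utilities.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Example: five records in batches of two; the last batch is shorter.
for batch in iter_batches(["r1", "r2", "r3", "r4", "r5"], batch_size=2):
    print(len(batch), batch)
```

The same slicing works on a pandas DataFrame with `df.iloc[start:start + batch_size]`, which bounds how many texts are passed to the model at once.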