
# **⚡ Spark & MLflow Setup**

## 📌 **Import Dependencies**
The following libraries are imported:
- **PySpark**: `SparkSession`, `SparkConf`
- **MLflow**: `mlflow`, `mlflow.spark`

---

## 🔧 **Define Spark Session Creator (`create_spark_session`)**
A function is implemented to **manage the SparkSession lifecycle**, ensuring:
- **Reusing an existing session** when available.
- **Applying essential configurations**, such as:
  - Memory allocation
  - Package dependencies
  - Checkpointing settings
  - Performance optimizations
- **Robust error handling** to catch and log session creation failures.  

📌 **Refer to the function docstring for configuration details.**
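
The configuration categories above can be pictured as a flat mapping from Spark property names to values. The following is a minimal, dependency-free sketch: `build_spark_settings` is a hypothetical helper that does not appear in the notebook, and the argument values are placeholders. The property names are the ones the notebook's function sets.

```python
def build_spark_settings(app_name, master, driver_memory, executor_memory,
                         packages, checkpoint_dir, shuffle_partitions=10):
    """Return Spark properties as a plain dict (hypothetical helper)."""
    settings = {
        "spark.app.name": app_name,
        "spark.master": master,
        "spark.driver.memory": driver_memory,
        "spark.executor.memory": executor_memory,
        "spark.sql.streaming.checkpointLocation": checkpoint_dir,
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
    # Spark expects package coordinates as one comma-separated string.
    if packages:
        settings["spark.jars.packages"] = ",".join(packages)
    return settings


# Placeholder values, chosen only for illustration.
example_settings = build_spark_settings(
    app_name="demo-app",
    master="local[*]",
    driver_memory="2g",
    executor_memory="2g",
    packages=["org.example:artifact-a:1.0", "org.example:artifact-b:2.0"],
    checkpoint_dir="/tmp/checkpoints",
)
for key in sorted(example_settings):
    print(f"{key}: {example_settings[key]}")
```

A mapping like this could be handed to `SparkConf().setAll(list(example_settings.items()))`, which would keep the settings for new and reused sessions in one place.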

---

## 🚀 **Initialize Spark**
- Calls `create_spark_session` to **initialize Spark**.
- Assigns the resulting **SparkSession** instance to the `spark` variable.

---

## 🔍 **Log Spark Info**
- Prints the **Spark version** and **active configuration settings** to the console for verification.

---

## 📊 **Setup MLflow**
1. **Configure Tracking**:
   - Sets **MLflow Tracking URI** to a local directory (`MLFLOW_DIR`).

2. **Manage Experiment**:
   - Attempts to retrieve an **MLflow experiment** matching `SPARK_APP_NAME`.
   - If the experiment **does not exist**, it is **created**.

3. **Log Experiment Details**:
   - Displays the **MLflow tracking URI** and the **active experiment name**.

4. **Error Handling**:
   - Catches potential **MLflow setup issues**.
   - Prints a **warning message** if an error occurs, but **allows execution to continue**.
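
The notebook builds its tracking URI by prefixing `MLFLOW_DIR` with `file:`. As a hedged alternative, assuming `MLFLOW_DIR` is an ordinary filesystem path, the URI could be derived with `pathlib`, which also copes with relative paths. `local_tracking_uri` and the directory name below are hypothetical.

```python
from pathlib import Path


def local_tracking_uri(directory):
    """Build a file:// URI for a local directory (hypothetical helper)."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(directory).resolve().as_uri()


tracking_uri = local_tracking_uri("mlruns_example")  # placeholder directory name
print(tracking_uri)
```

The resulting string could then be passed to `mlflow.set_tracking_uri`.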

---

## ✅ **Final Verification**
- Ensures **SparkSession** is running properly.
- Confirms **MLflow setup** is successful (or logs any issues).


In [None]:
from pyspark.sql import SparkSession
from pyspark import SparkConf
import mlflow
import mlflow.spark
# Assuming logger and constants (CHECKPOINT_DIR, SPARK_APP_NAME, etc.) are defined elsewhere

def create_spark_session():
    """
    Creates or retrieves and configures a PySpark SparkSession.

    What:
        This function initializes a SparkSession, which is the entry point
        for programming Spark with the Dataset and DataFrame API. It ensures
        that only one session is active and configured correctly.

    Why:
        - To provide a centralized and standardized way to obtain a SparkSession.
        - To reuse existing sessions if available, preventing resource overhead.
        - To apply consistent configurations (memory, packages, performance tuning,
          checkpointing) required for the application's Spark jobs.
        - To manage Spark's logging verbosity.

    How:
        1.  **Check Existing Session:** It first attempts to retrieve an active
            SparkSession using `SparkSession.getActiveSession()`.
        2.  **Configure Existing:** If a session exists, it updates its configuration
            with necessary settings like checkpoint location, adaptive query
            execution, and shuffle partitions. The log level is set to 'WARN'.
            The existing, re-configured session is then returned. A warning is
            logged if checking fails.
        3.  **Create New Session:** If no active session is found or an error
            occurred during the check, it proceeds to create a new one.
        4.  **Configure New:** A `SparkConf` object is instantiated and configured
            with the application name, master URL, driver/executor memory,
            required JAR packages, checkpoint location, adaptive execution, and
            shuffle partitions using predefined constants.
        5.  **Build Session:** `SparkSession.builder.config(conf=conf).getOrCreate()`
            is used to build the session with the specified configuration.
            `getOrCreate` ensures that if a session was created concurrently
            elsewhere, that session is returned.
        6.  **Set Log Level:** The SparkContext's log level is set to 'WARN' to
            reduce console output noise.
        7.  **Error Handling:** Both checking for existing sessions and creating
            new ones are wrapped in try-except blocks to log potential errors
            using the 'logger' object. Creation failure raises the exception.

    Returns:
        pyspark.sql.SparkSession: The active, configured SparkSession instance.

    Raises:
        Exception: Propagates exceptions encountered during SparkSession creation
                   if it fails after logging the error.

    Note:
        This function relies on globally defined constants such as SPARK_APP_NAME,
        SPARK_MASTER, SPARK_DRIVER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_PACKAGES,
        CHECKPOINT_DIR, and a pre-configured 'logger' object.
    """
    try:
        existing_spark = SparkSession.getActiveSession()
        if existing_spark:
            # logger.info("Using existing Spark session") # Assuming logger is available
            existing_spark.conf.set("spark.sql.streaming.checkpointLocation", CHECKPOINT_DIR)
            existing_spark.conf.set("spark.sql.adaptive.enabled", "true")
            existing_spark.conf.set("spark.sql.shuffle.partitions", "10")
            existing_spark.sparkContext.setLogLevel("WARN")
            return existing_spark
    except Exception as e:
        # logger.warning(f"Error checking for existing session: {str(e)}") # Assuming logger is available
        pass # Proceed to create a new session

    try:
        conf = SparkConf().setAppName(SPARK_APP_NAME).setMaster(SPARK_MASTER)
        conf.set("spark.driver.memory", SPARK_DRIVER_MEMORY)
        conf.set("spark.executor.memory", SPARK_EXECUTOR_MEMORY)
        if SPARK_PACKAGES:
            conf.set("spark.jars.packages", ",".join(SPARK_PACKAGES))
        conf.set("spark.sql.streaming.checkpointLocation", CHECKPOINT_DIR)
        conf.set("spark.sql.adaptive.enabled", "true")
        conf.set("spark.sql.shuffle.partitions", "10")
        builder = SparkSession.builder.config(conf=conf)
        spark = builder.getOrCreate()
        spark.sparkContext.setLogLevel("WARN")
        return spark
    except Exception as e:
        # logger.error(f"Failed to create Spark session: {str(e)}") # Assuming logger is available
        raise

spark = create_spark_session()

print(f"Spark Version: {spark.version}")
print("Spark Configuration:")
for item in sorted(spark.sparkContext.getConf().getAll()):
    print(f"  {item[0]}: {item[1]}")

try:
    mlflow.set_tracking_uri(f"file:{MLFLOW_DIR}")
    experiment = mlflow.get_experiment_by_name(SPARK_APP_NAME)
    if experiment is None:
        experiment_id = mlflow.create_experiment(SPARK_APP_NAME)
        experiment = mlflow.get_experiment(experiment_id)
        # logger.info(f"Created new MLflow experiment: {SPARK_APP_NAME}") # Assuming logger is available
    else:
        # logger.info(f"Using existing MLflow experiment: {SPARK_APP_NAME}") # Assuming logger is available
        pass
    # Activate the experiment so later runs are logged under it rather than the default one
    mlflow.set_experiment(SPARK_APP_NAME)
    print(f"\nMLflow tracking URI: {mlflow.get_tracking_uri()}")
    print(f"MLflow experiment: {experiment.name}")
except Exception as e:
    # logger.error(f"Error setting up MLflow: {str(e)}") # Assuming logger is available
    print(f"\nWarning: MLflow setup failed ({e}). Continuing without experiment tracking.")

# **🔗 Kafka Connection & Topic Management**

## 📌 **Import Dependencies**
The following Python modules are imported:
- **Kafka Interaction**: `kafka` (from the `kafka-python` package): `KafkaProducer`, `KafkaConsumer`, `KafkaAdminClient`, `NewTopic`
- **System Commands & JSON Handling**: `subprocess`, `json`

---

## 🔍 **Define Kafka Connection Check (`check_kafka_connection`)**
A function is defined to perform a **basic connectivity test** to the Kafka brokers by:
1. Attempting to instantiate a **Kafka Producer**.
2. Attempting to instantiate a **Kafka Consumer**.
3. Closing both instances after the check.  

📌 **For more details, refer to the function docstring.**
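
Before constructing Kafka clients, a plain TCP probe can answer the narrower question of whether any broker port is reachable at all. This is a different, lighter-weight technique than the producer/consumer check the notebook uses, shown here only as a sketch. It assumes the broker list is a comma-separated `host:port` string; `parse_brokers` and `any_broker_reachable` are hypothetical helpers.

```python
import socket


def parse_brokers(brokers):
    """Split 'host1:9092,host2:9092' into [(host, port), ...]."""
    pairs = []
    for entry in brokers.split(","):
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def any_broker_reachable(brokers, timeout_s=1.0):
    """Return True if a TCP connection to at least one broker succeeds."""
    for host, port in parse_brokers(brokers):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Refused, timed out, or unresolvable: try the next broker.
            continue
    return False


print(parse_brokers("localhost:9092, localhost:9093"))
print(any_broker_reachable("localhost:9092"))
```

A successful probe only shows that something is listening on the port. It does not confirm that the listener speaks the Kafka protocol, so it complements rather than replaces the client-based check.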

---

## 🛠 **Define Topic Verification & Creation (`verify_create_kafka_topics`)**
A function is defined to ensure that all **required Kafka topics** exist by:
1. **Listing existing topics**:  
   - Uses `KafkaConsumer.topics()` (the `kafka-python` client API) first.  
   - Falls back to **CLI-based listing** if necessary.  
2. **Creating missing topics**:  
   - Uses **Admin API** as the first approach.  
   - Falls back to **CLI commands** if API creation fails.  

📌 **For more details, refer to the function docstring.**
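
The core of the verification step is a set comparison between required and existing topic names. The sketch below isolates that logic in plain Python; the helper names and topic names are hypothetical, and the CLI output is a made-up sample in the one-topic-per-line shape that `kafka-topics.sh --list` prints.

```python
def parse_topic_listing(cli_output):
    """Turn one-topic-per-line CLI output into a set, skipping blank lines."""
    return {line.strip() for line in cli_output.splitlines() if line.strip()}


def find_missing_topics(required, existing):
    """Return required topics absent from `existing`, preserving their order."""
    return [t for t in required if t and t.strip() and t not in existing]


required = ["raw-events", "training-data", "predictions"]  # placeholder names
existing = parse_topic_listing("raw-events\n__consumer_offsets\n\n")
print(find_missing_topics(required, existing))
```

Keeping the comparison separate from the broker calls makes it easy to test without a running cluster.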

---

## 🚀 **Execute Connection Check**
- Calls the `check_kafka_connection` function to test **Kafka connectivity**.

---

## ⚠️ **Handle Connection Result**
- **If the connection fails**:
  - Displays a **warning message** with guidance on how to start Kafka.  
- **If the connection succeeds**:
  - Calls `verify_create_kafka_topics` to ensure required topics are available.

---

## ✅ **Summarize Results**
- Prints a summary of the **Kafka connection status**.
- Lists **newly created topics** (if any) during execution.


In [None]:
import subprocess
import json
from kafka.admin import KafkaAdminClient, NewTopic
from kafka import KafkaProducer, KafkaConsumer
# Assuming logger and constants (KAFKA_BROKERS, KAFKA_TOPIC_*, KAFKA_HOME) are defined elsewhere

def check_kafka_connection():
    """
    Verifies connectivity to the configured Kafka brokers.

    What:
        This function attempts to establish a basic connection to the Kafka
        cluster defined by the KAFKA_BROKERS constant.

    Why:
        To confirm that the Kafka cluster is accessible from the current
        environment before proceeding with operations like topic creation
        or data production/consumption. This prevents downstream errors
        caused by network issues or Kafka service unavailability.

    How:
        1.  Instantiates a `KafkaProducer` targeting the specified `KAFKA_BROKERS`.
            A basic JSON serializer is configured, though not strictly needed
            for the connection check itself.
        2.  Immediately closes the producer.
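        3.  Instantiates a `KafkaConsumer` targeting the same brokers.
            `consumer_timeout_ms` only bounds message iteration, which this check
            never performs; unreachable brokers instead surface as an exception
            raised by the constructor (e.g., `NoBrokersAvailable`).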
        4.  Immediately closes the consumer.
        5.  If both instantiation and closing succeed without exceptions, it
            logs a success message and returns `True`.
        6.  If any exception occurs during this process (e.g., connection refused,
            timeout), it logs an error message detailing the failure and
            returns `False`.

    Returns:
        bool: True if a connection could be established and resources closed,
              False otherwise.

    Note:
        Relies on the globally defined `KAFKA_BROKERS` constant and a
        pre-configured `logger` object.
    """
    try:
        producer = KafkaProducer(
            bootstrap_servers=KAFKA_BROKERS,
            value_serializer=lambda v: json.dumps(v).encode('utf-8')
        )
        producer.close()
        consumer = KafkaConsumer(
            bootstrap_servers=KAFKA_BROKERS,
            auto_offset_reset='earliest',
            consumer_timeout_ms=5000
        )
        consumer.close()
        # logger.info("✅ Kafka connection successful") # Assuming logger is available
        return True
    except Exception as e:
        # logger.error(f"❌ Kafka connection failed: {str(e)}") # Assuming logger is available
        return False

def verify_create_kafka_topics():
    """
    Checks for the existence of required Kafka topics and creates any that are missing.

    What:
        Ensures that a predefined list of Kafka topics, necessary for the
        application's workflow, exists within the connected Kafka cluster.

    Why:
        To automate the setup of essential Kafka infrastructure, preventing errors
        that would occur if producers or consumers attempt to interact with
        non-existent topics. It provides robustness by trying multiple methods
        for listing and creating topics.

    How:
        1.  **Define Requirements:** A list of necessary topic names is defined using
            globally available constants (e.g., `KAFKA_TOPIC_RAW`).
        2.  **List Existing Topics (Attempt 1: API):** It first tries to retrieve
            the list of all existing topics using `kafka-python`'s
            `KafkaConsumer.topics()` method.
        3.  **List Existing Topics (Attempt 2: CLI Fallback):** If the API method fails
            or returns an empty set, it attempts to list topics by executing the
            `kafka-topics.sh --list` command via `subprocess.run`, parsing the output.
            This requires `KAFKA_HOME` to be set correctly. Errors during CLI execution
            are logged.
        4.  **Identify Missing Topics:** The required topics list is compared against
            the retrieved list of existing topics to determine which ones need creation.
        5.  **Create Missing Topics (Attempt 1: Admin API):** If missing topics are
            found, it first tries to create them using `kafka-python`'s
            `KafkaAdminClient`. It configures `NewTopic` objects with desired
            partition counts and replication factors before calling `create_topics`.
        6.  **Create Missing Topics (Attempt 2: CLI Fallback):** If the Admin API
            creation fails (e.g., due to permissions or API issues), it falls back
            to creating each missing topic individually by executing the
            `kafka-topics.sh --create` command via `subprocess.run`. Errors during
            CLI creation are logged per topic.
        7.  **Logging:** Informational messages are logged (using `logger`) about
            existing topics, topics being created, and the success or failure of
            creation methods. Error messages detail any exceptions or CLI failures.
        8.  **Return Value:** The function returns a list containing the names of
            the topics that were successfully created during its execution. If no
            topics needed creation or creation failed, an empty list may be returned.

    Returns:
        list[str]: Topic names created by this function. On the CLI fallback path,
                   topics the CLI reports as already existing are included as well.
                   Returns an empty list if all topics already existed or if
                   creation failed completely.

    Note:
        Relies on globally defined constants `KAFKA_BROKERS`, `KAFKA_HOME`,
        `KAFKA_TOPIC_*`, and a pre-configured `logger` object. Assumes appropriate
        permissions for listing and creating topics via API or CLI. The partition
        count (3) and replication factor (1) are hardcoded but could be parameterized.
    """
    required_topics = [
        KAFKA_TOPIC_RAW,
        KAFKA_TOPIC_TRAINING,
        KAFKA_TOPIC_TESTING_STREAM,
        KAFKA_TOPIC_PREDICTIONS,
        KAFKA_TOPIC_METRICS
    ]
    existing_topics = set()
    try:
        # logger.info("Checking existing Kafka topics...") # Assuming logger is available
        try:
            consumer = KafkaConsumer(bootstrap_servers=KAFKA_BROKERS, request_timeout_ms=6000)
            existing_topics = consumer.topics()
            consumer.close()
            if existing_topics:
                # logger.info(f"Existing topics (via API): {existing_topics}") # Assuming logger is available
                pass
        except Exception as api_e:
             # logger.warning(f"Failed to list topics via API ({api_e}), trying CLI...") # Assuming logger is available
             existing_topics = set() # Ensure it's empty if API failed

        if not existing_topics and KAFKA_HOME: # Try CLI only if API failed/empty and KAFKA_HOME is set
            cmd = f"{KAFKA_HOME}/bin/kafka-topics.sh --list --bootstrap-server {KAFKA_BROKERS}"
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=10)
            if result.returncode == 0:
                existing_topics = set(topic for topic in result.stdout.strip().split('\n') if topic) # Filter empty lines
                # logger.info(f"Existing topics (via CLI): {existing_topics}") # Assuming logger is available
            else:
                # logger.warning(f"Error listing topics via CLI: {result.stderr or result.stdout}") # Assuming logger is available
                existing_topics = set() # Reset if CLI also failed

        topics_to_create = [topic for topic in required_topics if topic not in existing_topics and topic and topic.strip()]

        if topics_to_create:
            # logger.info(f"Creating missing topics: {topics_to_create}") # Assuming logger is available
            created_list = []
            try:
                admin_client = KafkaAdminClient(bootstrap_servers=KAFKA_BROKERS)
                new_topics_obj = [NewTopic(name=topic, num_partitions=3, replication_factor=1) for topic in topics_to_create]
                admin_client.create_topics(new_topics=new_topics_obj, validate_only=False)
                admin_client.close()
                # logger.info("Topics created successfully via Admin API") # Assuming logger is available
                created_list.extend(topics_to_create) # Assume all succeeded if no exception
                return created_list # Return immediately if API worked
            except Exception as e:
                # logger.warning(f"Error creating topics via Admin API: {str(e)}. Falling back to CLI.") # Assuming logger is available
                # Fallback to CLI
                if not KAFKA_HOME:
                     # logger.error("Cannot fallback to CLI: KAFKA_HOME not set.") # Assuming logger is available
                     return [] # Cannot proceed

                successfully_created_cli = []
                for topic in topics_to_create:
                    cmd = f"{KAFKA_HOME}/bin/kafka-topics.sh --create --topic {topic} " \
                          f"--bootstrap-server {KAFKA_BROKERS} --partitions 3 --replication-factor 1"
                    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=10)
                    if result.returncode == 0 or "already exists" in (result.stderr + result.stdout).lower():
                        # logger.info(f"Topic {topic} created or already exists (via CLI)") # Assuming logger is available
                        successfully_created_cli.append(topic)
                    else:
                         # logger.error(f"Failed to create topic {topic} via CLI: {result.stderr or result.stdout}") # Assuming logger is available
                         pass # Continue trying others
                return successfully_created_cli
        else:
            # logger.info("All required topics already exist") # Assuming logger is available
            return []
    except Exception as e:
        # logger.error(f"General error verifying/creating Kafka topics: {str(e)}") # Assuming logger is available
        return [] # Return empty list on major failure

kafka_connected = check_kafka_connection()
created_topics = []
if not kafka_connected:
    print("\n⚠️ WARNING: Kafka connection failed. Please check if Kafka is running.")
    print("  To start ZooKeeper: bin/zookeeper-server-start.sh config/zookeeper.properties")
    print("  To start the Kafka broker: bin/kafka-server-start.sh config/server.properties")
else:
    created_topics = verify_create_kafka_topics()

print("\n=== Kafka Environment Verification Results ===")
print(f"Kafka Connection: {'✅ Success' if kafka_connected else '❌ Failed'}")
print(f"Topics Created Now: {', '.join(created_topics) if created_topics else 'None (all existed or creation failed)'}")

# **Data Processing Pipeline Overview**

## 📌 **Import Dependencies**
The following libraries are imported:
- **Data Manipulation**: `pandas`, `numpy`, `pyspark.sql.functions`, `pyspark.sql.types`
- **Visualization**: `matplotlib`, `seaborn`
- **Automated EDA (Optional)**: `autoviz`
- **System Utilities**: `os`

---

## 📥 **Load Data (`load_and_inspect_data`)**
- Reads the **consumer complaints dataset** from `DATASET_PATH` into a **Spark DataFrame** using **inferred schema**.
- Displays:
  - **Schema**
  - **Sample rows**
  - **Basic descriptive statistics**

---

## 🔍 **Exploratory Data Analysis (`perform_eda`)**
1. **Sampling & Conversion**  
   - Extracts a **sample** from the Spark DataFrame and converts it to a Pandas DataFrame.

2. **Basic EDA**
   - Computes:
     - **Dimensions (rows & columns)**
     - **Data types**
     - **Missing values analysis**

3. **Data Visualization**
   - Generates and saves:
     - **Product Distribution** plot
     - **Company Response Distribution** plot  
   - Uses **Matplotlib** & **Seaborn** for visualizations.

4. **(Optional) Automated EDA**
   - Runs **AutoViz** for **comprehensive visualizations** (if applicable).
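
The sampling step turns a row cap into a fraction, because Spark samples by fraction rather than by row count. A minimal sketch of that arithmetic, with `sample_fraction` as a hypothetical helper and a 10,000-row cap matching the notebook's code:

```python
def sample_fraction(total_count, max_rows=10_000):
    """Fraction of rows to sample so that roughly `max_rows` rows are drawn."""
    if total_count <= 0:
        return 0.0  # Empty input: nothing to sample.
    return min(max_rows, total_count) / total_count


for n in (0, 500, 50_000):
    print(n, sample_fraction(n))
```

Because fraction-based sampling is probabilistic, the number of rows actually returned is close to the cap but not guaranteed to equal it.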

---

## 🛠 **Prepare Data for Kafka (`prepare_data_for_kafka`)**
- Filters the **original Spark DataFrame** to retain only:
  - Non-empty **Consumer complaint narratives**
  - Non-null **Complaint IDs**
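
The filter condition can be expressed as a plain-Python predicate over dictionary rows, which is convenient for reasoning about edge cases. This is an analogue of the Spark filter, not the notebook's implementation; `is_kafka_ready` is a hypothetical name and the rows are made-up samples.

```python
NARRATIVE_COL = "Consumer complaint narrative"
ID_COL = "Complaint ID"


def is_kafka_ready(row):
    """Keep rows with a non-null, non-empty narrative and a non-null ID."""
    narrative = row.get(NARRATIVE_COL)
    return narrative is not None and narrative != "" and row.get(ID_COL) is not None


# Made-up sample rows covering each rejection case.
rows = [
    {ID_COL: 1, NARRATIVE_COL: "Card was charged twice."},
    {ID_COL: 2, NARRATIVE_COL: ""},
    {ID_COL: None, NARRATIVE_COL: "Missing identifier."},
    {ID_COL: 4, NARRATIVE_COL: None},
]
kept = [r for r in rows if is_kafka_ready(r)]
print([r[ID_COL] for r in kept])
```

Note that a whitespace-only narrative such as `"   "` passes this condition; trimming before the comparison would be needed to exclude it.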

---

## 🚀 **Write to Kafka (`write_to_kafka`)**
- Transforms the **filtered DataFrame** into **key-value pairs**:
  - **Key**: `Complaint ID`
  - **Value**: Entire row **as a JSON string**
- **Writes records in batch** to **Kafka Topic (`KAFKA_TOPIC_RAW`)** using **Spark's Kafka Sink**.
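
Spark's Kafka sink expects a `value` column and an optional `key` column, each string or binary. The shape of one such record can be sketched in plain Python, as below. `to_kafka_record` is a hypothetical helper and the row is a made-up sample; in Spark the equivalent is typically built with `to_json(struct(...))`.

```python
import json


def to_kafka_record(row, key_field="Complaint ID"):
    """Map one row (a dict) to a key/value pair with a JSON-encoded value."""
    return {"key": str(row[key_field]), "value": json.dumps(row)}


# Made-up sample row for illustration.
row = {
    "Complaint ID": 101,
    "Product": "Credit card",
    "Consumer complaint narrative": "Charged twice.",
}
record = to_kafka_record(row)
print(record["key"])
print(record["value"])
```

Using the complaint ID as the key means records with the same ID hash to the same partition under Kafka's default partitioner.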

---

## ✅ **Execution & Error Handling**
- The **main execution block**:
  1. Calls all functions **sequentially**.
  2. Uses a **try-except block**:
     - ✅ Prints **success messages** if all steps run smoothly.
     - ❌ Catches and **logs errors** if any occur.


In [None]:
import pandas as pd
import numpy as np
from pyspark.sql.functions import col, to_json, struct, lit, when, isnull
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Assuming spark, logger, and constants (DATASET_PATH, RANDOM_SEED, DATA_DIR, VIZ_DIR,
# KAFKA_TOPIC_RAW, KAFKA_BROKERS) are defined elsewhere

def load_and_inspect_data():
    """
    Loads consumer complaints data from a CSV file into a Spark DataFrame and performs initial inspection.

    What:
        Reads a specified CSV file containing consumer complaints data, infers or applies a
        schema, and provides basic insights into the loaded data structure and content.

    Why:
        To ingest the raw data into the Spark environment for subsequent processing and analysis.
        Initial inspection helps verify successful loading, understand data types, and get a
        preliminary sense of the data's characteristics (row count, sample records, basic stats).
        Using `inferSchema` is convenient, though defining an explicit schema (left as a
        commented-out placeholder in the function body) is generally better for production
        performance and data type reliability.

    How:
        1.  Logs the path of the dataset being loaded using the `logger`.
        2.  Leaves a commented-out placeholder for an explicit `StructType` schema; the
            current code relies on `inferSchema`.
            *Note: Using `inferSchema=true` can be slow for large files and might guess incorrect types.*
        3.  Uses `spark.read.option("header", "true").option("inferSchema", "true").csv(DATASET_PATH)`
            to read the CSV file into a Spark DataFrame (`df`), assuming the file has headers.
        4.  Performs basic inspection actions:
            -   Logs the total number of rows loaded using `df.count()`.
            -   Prints the DataFrame's schema using `df.printSchema()`.
            -   Displays the first 5 rows of data without truncation using `df.show(5, truncate=False)`.
            -   Calculates and displays basic descriptive statistics (count, mean, stddev,
                min, max) for numeric and string columns using `df.describe().show()`.
        5.  Includes error handling: If any exception occurs during loading or inspection, it logs
            the error and re-raises the exception.

    Returns:
        pyspark.sql.DataFrame: The loaded Spark DataFrame containing the consumer complaints data.

    Raises:
        Exception: Propagates exceptions encountered during file reading or initial inspection
                   after logging the error.

    Note:
        Relies on the globally defined SparkSession `spark`, `DATASET_PATH` constant, and a
        pre-configured `logger` object. The current implementation uses `inferSchema`.
    """
    # logger.info(f"Loading data from: {DATASET_PATH}") # Assuming logger is available
    # Optional explicit schema definition (adjust types as needed)
    # schema = StructType([...]) # Example schema structure provided in original code
    try:
        df = spark.read.option("header", "true") \
                       .option("inferSchema", "true") \
                       .csv(DATASET_PATH)
        row_count = df.count()
        # logger.info(f"DataFrame loaded successfully with {row_count} rows") # Assuming logger is available
        # logger.info("DataFrame schema:") # Assuming logger is available
        df.printSchema()
        print("\nSample data:")
        df.show(5, truncate=False)
        print("\nBasic statistics:")
        df.describe().show()
        return df
    except Exception as e:
        # logger.error(f"Error loading data: {str(e)}") # Assuming logger is available
        raise

def perform_eda(df):
    """
    Performs basic Exploratory Data Analysis (EDA) on a sample of the input Spark DataFrame.

    What:
        Generates summary statistics, analyzes missing values, and visualizes distributions
        of key categorical features from a sample of the consumer complaints data. Optionally
        uses AutoViz for automated visualization generation.

    Why:
        To gain a deeper understanding of the data's characteristics, distributions, potential
        quality issues (like missing values), and relationships between variables before
        further processing or modeling. Sampling is used because EDA often involves libraries
        (like Pandas, Matplotlib, Seaborn, AutoViz) that work more efficiently or exclusively
        on in-memory data.

    How:
        1.  **Sampling:** Takes a random sample of the Spark DataFrame (`df`). The target
            sample size is capped at 10,000 rows (or the total count if smaller) to manage
            memory usage; because `df.sample` draws by fraction, the actual number of rows
            is approximate. The sample is converted to a Pandas DataFrame (`sample_df`)
            using `.toPandas()`.
            *Caution: `.toPandas()` collects all sampled data to the driver node.*
        2.  **Save Sample (Optional):** Saves the Pandas sample to a CSV file, which might be
            useful for external tools or AutoViz.
        3.  **Basic Pandas EDA:** Prints dimensions, data types, and a summary of missing values
            (count and percentage) for the sampled data.
        4.  **Visualization Prep:** Creates a directory (`VIZ_DIR`) to store generated plots.
        5.  **Manual Visualizations:**
            -   Generates and saves a bar plot of the top 10 most frequent values in the 'Product' column.
            -   Generates and saves a bar plot showing the distribution of 'Company response to consumer'.
            (Uses Matplotlib and Seaborn).
        6.  **Automated Visualization (Optional):**
            -   Attempts to import `AutoViz_Class`.
            -   If successful, instantiates `AutoViz_Class` and calls `AutoViz()` on the saved
              sample CSV or the Pandas DataFrame to automatically generate various plots.
            -   Catches potential errors during AutoViz execution (e.g., library not installed)
              and logs a warning.
        7.  Logs completion messages using the `logger`.

    Args:
        df (pyspark.sql.DataFrame): The Spark DataFrame containing the consumer complaints data.

    Note:
        Relies on global constants `RANDOM_SEED`, `DATA_DIR`, `VIZ_DIR`, and a pre-configured
        `logger`. Requires Pandas, Matplotlib, Seaborn, and optionally AutoViz to be installed.
        Converts a sample of Spark data to Pandas, which requires sufficient driver memory.
        Visualizations are saved to the path specified by `VIZ_DIR`.
    """
    # logger.info("Sampling data for EDA...") # Assuming logger is available
    total_count = df.count()
    sample_size = min(10000, total_count)
    sample_fraction = sample_size / total_count if total_count > 0 else 0
    if sample_fraction > 0:
        sample_df = df.sample(fraction=sample_fraction, seed=RANDOM_SEED).toPandas()
    else:
        sample_df = df.limit(0).toPandas() # Create empty pandas df with same schema if no data

    # logger.info(f"Sample data shape: {sample_df.shape}") # Assuming logger is available
    sample_csv_path = os.path.join(DATA_DIR, "sample_complaints.csv")
    try:
       os.makedirs(DATA_DIR, exist_ok=True)
       sample_df.to_csv(sample_csv_path, index=False)
    except Exception as e:
       # logger.warning(f"Could not save sample CSV to {sample_csv_path}: {e}") # Assuming logger is available
       pass

    print("\n=== Basic EDA ===")
    print(f"Dataset dimensions (sampled): {sample_df.shape}")
    print("\nData types (sampled):")
    print(sample_df.dtypes)
    print("\nMissing values by column (sampled):")
    if not sample_df.empty:
        missing_values = sample_df.isnull().sum()
        missing_pct = (missing_values / len(sample_df)) * 100
        missing_df = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_pct})
        print(missing_df[missing_df['Missing Values'] > 0].sort_values('Percentage', ascending=False))
    else:
        print("Sample DataFrame is empty, skipping missing value analysis.")

    try:
        os.makedirs(VIZ_DIR, exist_ok=True)
        if 'Product' in sample_df.columns and not sample_df.empty:
            plt.figure(figsize=(12, 6))  # Create the figure only when it will be drawn and closed
            product_counts = sample_df['Product'].value_counts().head(10)
            sns.barplot(x=product_counts.values, y=product_counts.index)
            plt.title('Top 10 Products (Sampled)')
            plt.tight_layout()
            plt.savefig(os.path.join(VIZ_DIR, 'top_products_sampled.png'))
            plt.close() # Close plot to free memory
        if 'Company response to consumer' in sample_df.columns and not sample_df.empty:
            plt.figure(figsize=(12, 6))
            response_counts = sample_df['Company response to consumer'].value_counts()
            sns.barplot(x=response_counts.values, y=response_counts.index)
            plt.title('Company Response Distribution (Sampled)')
            plt.tight_layout()
            plt.savefig(os.path.join(VIZ_DIR, 'response_distribution_sampled.png'))
            plt.close() # Close plot
    except Exception as plot_e:
         # logger.warning(f"Error during manual visualization: {plot_e}") # Assuming logger is available
         pass

    try:
        from autoviz.AutoViz_Class import AutoViz_Class
        # logger.info("Running AutoViz for automated visualizations...") # Assuming logger is available
        av = AutoViz_Class()
        # Using dfte=sample_df might be more robust if saving CSV failed
        dft = av.AutoViz(filename="", sep=",", depVar="", dfte=sample_df, header=0, verbose=0,
                         lowess=False, chart_format="png", max_rows_analyzed=10000, max_cols_analyzed=30,
                         save_plot_dir=VIZ_DIR) # Specify save directory
        # logger.info(f"AutoViz visualizations saved to: {VIZ_DIR}") # Assuming logger is available
    except ImportError:
        # logger.warning("AutoViz not installed. Skipping automated visualizations.") # Assuming logger is available
        pass
    except Exception as e:
        # logger.warning(f"AutoViz error: {str(e)}. Skipping automated visualizations.") # Assuming logger is available
        pass
    # logger.info("EDA completed") # Assuming logger is available

def prepare_data_for_kafka(df):
    """
    Filters the Spark DataFrame to select relevant records for Kafka ingestion.

    What:
        Applies filtering logic to the input DataFrame, primarily keeping records
        that have non-empty complaint narratives and valid Complaint IDs.

    Why:
        To ensure that only meaningful data (complaints with actual text) is sent
        to the Kafka topic intended for raw data processing. Filtering out records
        without narratives reduces noise and processing overhead downstream. Ensuring
        a non-null 'Complaint ID' is crucial if it's used as the Kafka message key
        for partitioning or identification.

    How:
        1.  Defines the column names for the complaint narrative and the complaint ID.
        2.  Counts the rows in `df` before filtering.
        3.  Filters `df` with `df.filter()` to keep rows where the narrative column
            is not null AND is not an empty string.
        4.  Counts the rows that remain after the narrative filter.
        5.  Applies a second filter to keep rows where 'Complaint ID' is not null.
        6.  Counts the final rows after the ID filter. Each `count()` triggers a
            separate Spark job.
        7.  Returns the resulting filtered Spark DataFrame.

    Args:
        df (pyspark.sql.DataFrame): The original Spark DataFrame loaded from the source.

    Returns:
        pyspark.sql.DataFrame: A filtered Spark DataFrame containing records suitable
                               for sending to the raw Kafka topic.

    Note:
        Relies on the input DataFrame having columns named 'Consumer complaint narrative'
        and 'Complaint ID', and on `col` from `pyspark.sql.functions` being imported.
        The `logger` calls are currently commented out, so no logger is required.
    """
    # logger.info("Preparing data for Kafka...") # Assuming logger is available
    narrative_col = "Consumer complaint narrative"
    complaint_id_col = "Complaint ID"
    total_count = df.count()
    # Filter for non-empty narrative
    prepared_df = df.filter(
        col(narrative_col).isNotNull() & (col(narrative_col) != "")
    )
    filtered_count = prepared_df.count()
    # logger.info(f"Filtered for non-empty narrative: {filtered_count} rows (from {total_count})") # Assuming logger is available
    # Filter for non-null Complaint ID
    prepared_df = prepared_df.filter(col(complaint_id_col).isNotNull())
    final_count = prepared_df.count()
    if final_count < filtered_count:
        # logger.info(f"Filtered for non-null Complaint ID: {final_count} rows (from {filtered_count})") # Assuming logger is available
        pass
    return prepared_df

def write_to_kafka(df):
    """
    Writes the prepared Spark DataFrame to a specified Kafka topic.

    What:
        Takes a Spark DataFrame, formats it appropriately for Kafka (key-value pairs
        with JSON value), and writes it to the target Kafka topic using Spark's
        built-in Kafka connector.

    Why:
        To publish the raw, filtered consumer complaints data onto the Kafka message
        bus, making it available for downstream consumers (like streaming applications
        or batch processing jobs) to process further. Using 'Complaint ID' as the key
        helps with partitioning and potentially idempotent processing downstream.

    How:
        1.  Logs the target Kafka topic name.
        2.  Selects columns from the input DataFrame `df` and transforms them:
            -   Casts the 'Complaint ID' column to string and aliases it as `key`.
            -   Uses `struct("*")` to gather all columns into a struct.
            -   Uses `to_json()` to serialize the struct into a JSON string, aliased as `value`.
            The result is a DataFrame (`kafka_df`) with 'key' and 'value' columns.
        3.  Initiates a write operation using `kafka_df.write`.
        4.  Specifies the format as "kafka".
        5.  Provides the Kafka broker list via `option("kafka.bootstrap.servers", KAFKA_BROKERS)`.
        6.  Specifies the target topic via `option("topic", KAFKA_TOPIC_RAW)`.
        7.  Executes the write with `.save()`, which performs a one-off batch write.
            A continuous streaming write would instead use `.writeStream` and `.start()`.
        8.  Re-raises any exception from the write so the caller can handle it
            (the success and error log calls are currently commented out).

    Args:
        df (pyspark.sql.DataFrame): The prepared Spark DataFrame ready for Kafka.

    Raises:
        Exception: Propagates exceptions encountered during the Kafka write operation
                   after logging the error.

    Note:
        Relies on global constants `KAFKA_BROKERS`, `KAFKA_TOPIC_RAW`, and a pre-configured
        `logger`. Assumes the Spark session has the necessary Kafka connector JARs available
        (usually specified during session creation). This function performs a batch write.
    """
    # logger.info(f"Writing data to Kafka topic: {KAFKA_TOPIC_RAW}") # Assuming logger is available
    try:
        kafka_df = df.select(
            col("Complaint ID").cast("string").alias("key"),
            to_json(struct("*")).alias("value")
        )
        kafka_df.write \
            .format("kafka") \
            .option("kafka.bootstrap.servers", KAFKA_BROKERS) \
            .option("topic", KAFKA_TOPIC_RAW) \
            .save()
        # logger.info(f"Successfully wrote {kafka_df.count()} records to Kafka topic {KAFKA_TOPIC_RAW}") # Assuming logger is available
    except Exception as e:
        # logger.error(f"Error writing to Kafka: {str(e)}") # Assuming logger is available
        raise

try:
    complaints_df = load_and_inspect_data()
    perform_eda(complaints_df)
    prepared_df = prepare_data_for_kafka(complaints_df)
    write_to_kafka(prepared_df)
    print("\n✅ Phase 2 completed successfully: Data ingested, explored, and loaded to Kafka")
except Exception as e:
    print(f"\n❌ Error in Phase 2: {e}")
    # Uncomment the bare `raise` below to make the notebook cell fail explicitly
    # (a bare `raise` re-raises the active exception with its traceback intact).
    # raise
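
The filtering in `prepare_data_for_kafka` and the key/value projection in `write_to_kafka` can be easier to reason about one row at a time. The sketch below is a plain-Python analogy, not the Spark code path: the helper names `passes_kafka_filter` and `to_kafka_record` and the sample rows are made up for illustration, and `json.dumps` only approximates Spark's `to_json(struct("*"))`, which has its own handling of nulls and column types.

```python
import json
from typing import Any, Dict, Tuple

NARRATIVE_COL = "Consumer complaint narrative"
COMPLAINT_ID_COL = "Complaint ID"


def passes_kafka_filter(row: Dict[str, Any]) -> bool:
    """Hypothetical row-level analogue of the two filters in prepare_data_for_kafka."""
    narrative = row.get(NARRATIVE_COL)
    # Mirrors: col(narrative_col).isNotNull() & (col(narrative_col) != "")
    has_narrative = narrative is not None and narrative != ""
    # Mirrors: col(complaint_id_col).isNotNull()
    has_id = row.get(COMPLAINT_ID_COL) is not None
    return has_narrative and has_id


def to_kafka_record(row: Dict[str, Any]) -> Tuple[str, str]:
    """Hypothetical row-level analogue of the key/value projection in write_to_kafka."""
    key = str(row[COMPLAINT_ID_COL])  # like cast("string").alias("key")
    value = json.dumps(row)           # roughly like to_json(struct("*")).alias("value")
    return key, value


# Made-up rows covering each filter outcome (not taken from the real dataset).
sample_rows = [
    {COMPLAINT_ID_COL: 101, "Product": "Mortgage", NARRATIVE_COL: "Payment was misapplied."},
    {COMPLAINT_ID_COL: 102, "Product": "Credit card", NARRATIVE_COL: ""},
    {COMPLAINT_ID_COL: 103, "Product": "Credit card", NARRATIVE_COL: None},
    {COMPLAINT_ID_COL: None, "Product": "Student loan", NARRATIVE_COL: "Servicer never replied."},
    {COMPLAINT_ID_COL: 104, "Product": "Debt collection", NARRATIVE_COL: "   "},
]

records = [to_kafka_record(r) for r in sample_rows if passes_kafka_filter(r)]
for key, value in records:
    print(key, value)
```

Only the rows with IDs 101 and 104 survive. The whitespace-only narrative (104) passes because the filter compares against the empty string only; trimming the narrative before the comparison would be a change to the pipeline's behaviour, not something the notebook currently does.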