## Introduction

This notebook demonstrates how to process and evaluate the results from a Kluster batch processing job. The output from Kluster is typically a JSONL file, where each line represents a single API response. We'll parse this JSONL file, handle potential issues like markdown formatting, and convert the results into a pandas DataFrame for easy analysis and export.
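
For orientation, a single line of such a file might look like the record built below. The shape is an assumption inferred from the fields the parsing code in this notebook reads (`custom_id` and `response.body.choices[0].message.content`); the values are invented, and real Kluster output may carry additional fields.

```python
import json

# Hypothetical record: field names mirror what the parser in this notebook reads;
# the values are invented for illustration.
sample_record = {
    "custom_id": "request-1",
    "response": {
        "body": {
            "choices": [
                {"message": {"content": '{"title": "Example", "tags": ["a", "b"]}'}}
            ]
        }
    },
}

# One JSONL line is the whole record serialized on a single line.
jsonl_line = json.dumps(sample_record)

# Reading it back: the model's answer is itself a JSON string nested in the record,
# so it needs a second json.loads call.
record = json.loads(jsonl_line)
content = record["response"]["body"]["choices"][0]["message"]["content"]
answer = json.loads(content)
print(answer["title"], answer["tags"])
```

Note the two levels of decoding: one for the line itself and one for the `content` string inside it.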

## Code Implementation

First, let's import the necessary libraries. We'll use `json` for handling JSON data, `pandas` for creating DataFrames, `typing` for type hints, and `re` for regular expressions (to handle markdown).

In [None]:
import json
import pandas as pd
from typing import List, Dict, Any, Optional
import re

### Extracting JSON from Markdown

Model responses are sometimes wrapped in a markdown code block (```json ... ```). This function extracts the JSON inside such a block and returns the input unchanged when no `json`-tagged block is found.

In [None]:
def extract_json_from_markdown(content: str) -> str:
    """Extract JSON content from markdown code blocks"""
    # Look for JSON content between markdown code blocks
    json_match = re.search(r'```json\s*(.*?)\s*```', content, re.DOTALL)  # Regex to find JSON in markdown
    if json_match:
        return json_match.group(1)  # Return the extracted JSON
    return content  # Return original if no markdown formatting
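
The pattern above only matches fences explicitly tagged `json`. As a minimal sketch, assuming responses may also arrive in an untagged fence, a more tolerant variant could look like the following. `extract_fenced_json` is a hypothetical name, not part of the original notebook.

```python
import re

# Three backticks, built programmatically so the example embeds cleanly in markdown.
FENCE = "`" * 3

# Optional "json" tag after the opening fence; DOTALL lets "." span newlines.
FENCE_PATTERN = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_fenced_json(content: str) -> str:
    """Return the body of the first fenced block (tagged json or untagged), else the stripped input."""
    match = FENCE_PATTERN.search(content)
    return match.group(1) if match else content.strip()


tagged = "Here you go:\n" + FENCE + 'json\n{"a": 1}\n' + FENCE
untagged = FENCE + '\n{"a": 2}\n' + FENCE
plain = '  {"a": 3}  '

print(extract_fenced_json(tagged))
print(extract_fenced_json(untagged))
print(extract_fenced_json(plain))
```

A fence tagged with some other language is not handled specially here; the tag would end up in the extracted text and the later JSON parse would fail for that record.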

### Parsing JSONL to DataFrame

This is the core function. It reads the JSONL file line by line, parses each line as a JSON object, extracts the response content (stripping any markdown wrapper), and compiles the results into a pandas DataFrame. Lines whose content cannot be parsed are reported with a warning and skipped. List-valued columns are converted to comma-separated strings.

In [None]:
def parse_jsonl_to_dataframe(jsonl_file_path: str) -> pd.DataFrame:
    """
    Parse a JSONL file with Kluster API response format and convert to a pandas DataFrame.

    Args:
        jsonl_file_path: Path to the JSONL file containing API responses

    Returns:
        A pandas DataFrame with the extracted data
    """
    # Store all parsed data
    extracted_data = []

    # Read the JSONL file
    with open(jsonl_file_path, 'r', encoding='utf-8') as file:
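        for line in file:
            line = line.strip()
            if not line:
                continue  # Skip blank lines (e.g. a trailing newline)
            try:
                # Parse the JSON object
                response_obj = json.loads(line)

                # Extract custom_id for reference
                custom_id = response_obj.get('custom_id', '')

                # Extract the content of the first choice; fall back to '{}' when
                # the response, the choices list, or the content is missing or null
                body = (response_obj.get('response') or {}).get('body') or {}
                choices = body.get('choices') or [{}]
                content = (choices[0].get('message') or {}).get('content') or '{}'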

                # Handle markdown-formatted JSON
                json_content = extract_json_from_markdown(content)

                # Parse the content into a Python dictionary
                try:
                    parsed_content = json.loads(json_content)

                    # Add metadata (custom_id for tracking)
                    parsed_content['custom_id'] = custom_id

                    # Add to our results list
                    extracted_data.append(parsed_content)
                except json.JSONDecodeError:
                    print(f"Warning: Could not parse JSON content for {custom_id}")

            except Exception as e:
                print(f"Error processing line: {e}")

    # Convert to DataFrame
    df = pd.DataFrame(extracted_data)

    # Handle list columns - convert lists to comma-separated strings
    for col in df.columns:
        if df[col].apply(lambda x: isinstance(x, list)).any():
            # str() each item so lists of non-strings can be joined; leave non-list values untouched
            df[col] = df[col].apply(lambda x: ', '.join(map(str, x)) if isinstance(x, list) else x)

    return df
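
The list-flattening step can be tried in isolation. The following is a small sketch on made-up data: the column names and values are invented for illustration, and each list item is passed through `str` on the assumption that some lists may hold numbers rather than strings.

```python
import pandas as pd

# Made-up rows standing in for parsed responses.
df = pd.DataFrame(
    [
        {"custom_id": "request-1", "tags": ["ai", "batch"], "scores": [1, 2]},
        {"custom_id": "request-2", "tags": [], "scores": [3]},
    ]
)

for col in df.columns:
    # Only touch columns that contain at least one list.
    if df[col].apply(lambda x: isinstance(x, list)).any():
        df[col] = df[col].apply(
            lambda x: ", ".join(map(str, x)) if isinstance(x, list) else x
        )

rows = df.to_dict(orient="records")
print(rows)
```

An empty list becomes an empty string, which is written to CSV as an empty field.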

### Usage Example

Here's how to use the `parse_jsonl_to_dataframe` function.  Replace `"results_20250303_200200.jsonl"` with the actual path to your Kluster output file.

In [None]:
# Usage example
if __name__ == "__main__":
    jsonl_file_path = "results_20250303_200200.jsonl"  # Replace with your file path
    df = parse_jsonl_to_dataframe(jsonl_file_path)

    # Display the first few rows
    print(df.head())

    # Save to CSV if needed
    df.to_csv("extracted_articles.csv", index=False)

This script reads the JSONL file, parses each response, strips any markdown wrapper, builds a DataFrame, and saves the result to a CSV file (`extracted_articles.csv`); remove the `to_csv` call if you don't need the export. The `print(df.head())` line displays the first five rows of the DataFrame so you can quickly inspect the extracted data.
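
Because records that fail to parse are only reported through printed messages, it can help to compare the number of records in the file with the number of rows in the DataFrame. A minimal sketch, assuming one record per non-blank line; `count_jsonl_records` is a hypothetical helper, not part of the original notebook.

```python
import os
import tempfile


def count_jsonl_records(path: str) -> int:
    """Count non-blank lines, i.e. the number of records in a JSONL file."""
    with open(path, "r", encoding="utf-8") as file:
        return sum(1 for line in file if line.strip())


# Demo on a throwaway file holding two records separated by a blank line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write('{"custom_id": "request-1"}\n\n{"custom_id": "request-2"}\n')
    demo_path = tmp.name

total = count_jsonl_records(demo_path)
os.remove(demo_path)
print(total)

# With a real run, the number of skipped records would be:
# skipped = count_jsonl_records(jsonl_file_path) - len(df)
```

A non-zero difference means some responses were dropped, and the printed warnings identify which `custom_id` values to look at.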