### General Description
#### 1. Task Statement
**Company:** Insight AI

**Issue:** Insight AI, a leading consulting firm in the artificial intelligence sector, is advising a major client on the integration of a Large Language Model (LLM) into their flagship customer support application. The client needs to select the most suitable model from a wide array of options, ensuring it aligns with user preferences and provides a high-quality conversational experience. Making the wrong choice could lead to poor user adoption and significant financial loss.

**ML/DS Solution:** To provide a data-driven recommendation, Insight AI must perform a comprehensive Exploratory Data Analysis (EDA) on the LMSYS Chatbot Arena dataset. This dataset contains records of head-to-head battles between anonymous LLMs, judged by humans. By analyzing this data, we can uncover patterns in model performance, identify strengths and weaknesses, and understand the factors that drive user preference.

**Feasibility:** Manually reviewing thousands of chat logs to gauge model performance is impractical and subjective, and it does not scale. A systematic, data-driven EDA is a far more practical way to extract objective, actionable insights from a dataset of this size and complexity.

**Task:** Your task, as a data scientist at Insight AI, is to conduct a detailed EDA on the Chatbot Arena dataset. You will need to clean the data, visualize key distributions, engineer relevant features, and build simple baseline models to identify the key predictors of a model's success.

**Data:** The company provides the 'LMSYS Chatbot Arena' dataset, which includes training and test sets containing chat logs, model identifiers (for training), and human-judged outcomes.

**Definition of Done:** The final deliverable is a structured report (this notebook) that details the findings from the EDA. It must include clear visualizations, statistical analysis of model performance, feature importance rankings from baseline models, and topic modeling of user prompts. The insights gathered will form the basis of the final recommendation to the client.
#### 2. Rewards
- Gaining expertise in Exploratory Data Analysis (EDA) for complex, text-based datasets.
- Mastering data cleaning and preprocessing techniques for real-world data.
- Developing advanced data visualization skills with Matplotlib and Seaborn.
- Practicing feature engineering for machine learning on text data (e.g., TF-IDF).
- Building and interpreting baseline models to guide feature selection.
- Getting an introduction to topic modeling with BERTopic.
#### 3. Difficulty Level
normal
#### 4. Task Type
Exploratory Data Analysis, Data Cleaning, Feature Engineering
#### 5. Tools
Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, BERTopic, SentenceTransformers

In [ ]:
import os
import random
import time
from collections import defaultdict
from tqdm.notebook import tqdm
import warnings
from pathlib import Path
from typing import Any, Dict, List, Optional
from IPython.display import display, HTML

import pandas as pd 
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import log_loss
from sentence_transformers import SentenceTransformer
from lightgbm import LGBMClassifier
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer

# Configuration
warnings.simplefilter("ignore")
sns.set_style("darkgrid")
pd.options.display.max_rows = None
pd.options.display.max_columns = None

```json
{
  "issue": "The analysis requires loading raw data from CSV files and performing initial cleaning, such as handling duplicates and parsing string-formatted lists.",
  "action": "Define functions to load the training and test data using pandas. Implement a data cleaning function that removes the 'id' column from the training set, drops duplicate rows, and correctly parses the list-like string columns ('prompt', 'response_a', 'response_b') into actual Python lists, handling potential 'null' values.",
  "state": "The data is loaded into pandas DataFrames, cleaned of duplicates, and all text-based columns are correctly formatted as lists, making them ready for detailed analysis."
}
```

In [ ]:
import json
from pathlib import Path
from typing import Any, List, Tuple
import pandas as pd

def load_data(data_path: Path) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Loads train, test, and submission files."""
    train = pd.read_csv(data_path / "train.csv")
    test = pd.read_csv(data_path / "test.csv")
    sub = pd.read_csv(data_path / "sample_submission.csv")
    return train, test, sub

def parse_list_string(value: Any) -> Any:
    """Parses a JSON-encoded list string; values that are not strings pass through unchanged."""
    # Assumes the list-like columns are JSON arrays (hence the literal 'null' entries).
    # json.loads maps null to None and, unlike eval, cannot execute arbitrary code.
    return json.loads(value) if isinstance(value, str) else value

def clean_and_prepare_data(df: pd.DataFrame) -> pd.DataFrame:
    """Cleans the dataframe by dropping the 'id' column and duplicates, and parsing string lists."""
    if 'id' in df.columns:
        df = df.drop("id", axis=1)

    # Deduplicate while the text columns are still strings (lists are unhashable)
    df = df.drop_duplicates(keep="first", ignore_index=True)

    for col in ["prompt", "response_a", "response_b"]:
        if col in df.columns:
            df[col] = df[col].apply(parse_list_string)
    return df

DATA_PATH = Path("/kaggle/input/lmsys-chatbot-arena")
TARGETS = ["winner_model_a", "winner_model_b", "winner_tie"]

train_df, test_df, sub_df = load_data(DATA_PATH)
train_df_cleaned = clean_and_prepare_data(train_df.copy())

print(f"Original train shape: {train_df.shape}")
print(f"Cleaned train shape: {train_df_cleaned.shape}")
display(train_df_cleaned.head(2))
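
Parsing a single hand-written value shows what the cleaning step has to handle. This is a minimal sketch, assuming the list-like columns hold JSON-encoded arrays (which would explain the literal `null` entries); the sample string is invented for illustration and is not taken from the dataset.

```python
import json

# Hypothetical raw cell value, shaped like the list-like string columns.
raw_response = '["Sure, here is a summary.", null, "Anything else?"]'

# json.loads maps JSON null to Python None, so no string replacement is needed.
parsed = json.loads(raw_response)

print(parsed)
print(type(parsed).__name__, len(parsed))
```

Each parsed cell is a Python list with one entry per conversation turn, and a missing turn appears as `None`.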

```json
{
  "issue": "To understand the competitive landscape of the models, we need to analyze their appearance frequency and head-to-head battle outcomes.",
  "action": "Create functions to visualize the distribution of models appearing as 'model_a' and 'model_b' using pie charts. Then, develop a 'battle report' by pivoting the data to create heatmaps showing the number of battles and win rates between every pair of top models.",
  "state": "Visualizations are generated that clearly show which models are most frequent and which models tend to win against others, providing a high-level overview of the model hierarchy."
}
```

In [ ]:
from typing import List, Optional
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_model_distribution(df: pd.DataFrame, thres: float = 0.02, max_labels: int = 5):
    """Plots the distribution of model_a and model_b identities."""
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 7))
    model_counts = {
        'model_a': df["model_a"].value_counts(),
        'model_b': df["model_b"].value_counts()
    }
    
    for ax, (model_col, counts) in zip(axes, model_counts.items()):
        plot_pie_single(counts.values, counts.index.tolist(), f"Distribution for {model_col}", thres, max_labels, ax)
    
    plt.tight_layout()
    plt.show()

def plot_pie_single(data: np.ndarray, labels: List[str], title: str, thres: float, max_labels: int, ax: plt.Axes):
    """Helper function to plot a single pie chart with minority aggregation."""
    tot = sum(data)
    major_data = [(d, l) for d, l in zip(data, labels) if d / tot >= thres]
    minor_data = [(d, l) for d, l in zip(data, labels) if d / tot < thres]
    
    if minor_data:
        major_data.append((sum(d for d, _ in minor_data), "Others"))
    
    data, labels = map(list, zip(*major_data))
    max_idx = np.argmax(data[:-1] if minor_data else data)
    explode = [0.1 if i == max_idx else 0 for i in range(len(data))]
    
    patches, _ = ax.pie(data, startangle=140, colors=sns.color_palette("pastel"), explode=explode)
    ax.legend(patches, labels, title="Model Identity", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
    ax.set_title(title, fontsize=16)

def plot_battle_heatmaps(df: pd.DataFrame, top_n_models: int = 16):
    """Generates a heatmap of battle counts between the top models."""
    top_models = set(df["model_a"].value_counts().index[:top_n_models])
    df_btl = df.groupby(["model_a", "model_b"], as_index=False).size().rename(columns={'size': 'battle_cnt'})
    df_btl = df_btl.query("model_a in @top_models and model_b in @top_models")

    battle_pivot = df_btl.pivot(index="model_a", columns="model_b", values="battle_cnt")
    
    fig, ax = plt.subplots(figsize=(12, 10))
    sns.heatmap(battle_pivot, annot=True, fmt=".0f", cmap="rocket_r", ax=ax)
    ax.set_title(f"Battle Count of Top-{top_n_models} Most Frequent Models", fontsize=16)
    plt.show()

# --- Execution ---
plot_model_distribution(train_df_cleaned)
plot_battle_heatmaps(train_df_cleaned)
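
The step description also calls for win rates between model pairs. One possible way to build that table is sketched below on an invented toy battle log; it assumes the same `model_a`, `model_b`, and `winner_model_a` columns as the training set, and the model names `m1` to `m3` are placeholders.

```python
import pandas as pd

# Toy battle log (invented data) with the same columns as the training set.
battles = pd.DataFrame({
    "model_a": ["m1", "m1", "m2", "m2", "m1", "m3"],
    "model_b": ["m2", "m2", "m1", "m3", "m3", "m1"],
    "winner_model_a": [1, 0, 0, 1, 1, 0],
    "winner_model_b": [0, 1, 1, 0, 0, 0],
    "winner_tie":     [0, 0, 0, 0, 0, 1],
})

# The mean of the binary winner_model_a flag per ordered pair is the share of
# battles won by the model shown as model_a against the model shown as model_b.
win_rate_a = battles.pivot_table(
    index="model_a", columns="model_b", values="winner_model_a", aggfunc="mean"
)
print(win_rate_a)
```

The resulting frame can be passed to `sns.heatmap` in the same way as `battle_pivot`. Note that this treats (A, B) and (B, A) as separate cells, so a position-independent win rate would need the two orderings merged first.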

```json
{
  "issue": "A key part of the analysis is understanding the verbosity of prompts and responses, as text length can be a strong predictor of user preference.",
  "action": "Engineer features related to the length and number of turns in each conversation. Create functions to calculate the number of turns, the total character length of prompts and responses, and the difference in length between model A's and model B's responses. Then, visualize the distributions of these new features.",
  "state": "The dataset is augmented with several new length-based features, and their distributions are plotted, revealing insights into conversational dynamics and model verbosity."
}
```

In [ ]:
def engineer_length_features(df: pd.DataFrame) -> pd.DataFrame:
    """Engineers features based on the length of prompts and responses."""
    df['n_turns'] = df["prompt"].apply(len)
    
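    # Sum character counts per conversation; missing (None) turns count as zero
    # instead of being measured as the 4-character string "None".
    df['prompt_len'] = df['prompt'].apply(lambda x: sum(len(str(p)) for p in x if p is not None))
    df['response_a_len'] = df['response_a'].apply(lambda x: sum(len(str(r)) for r in x if r is not None))
    df['response_b_len'] = df['response_b'].apply(lambda x: sum(len(str(r)) for r in x if r is not None))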
    
    df['len_diff'] = df['response_a_len'] - df['response_b_len']
    df['len_diff_abs'] = abs(df['len_diff'])
    return df

def plot_length_distributions(df: pd.DataFrame):
    """Visualizes the distributions of length-based features."""
    features_to_plot = ['n_turns', 'prompt_len', 'response_a_len', 'response_b_len', 'len_diff']
    fig, axes = plt.subplots(len(features_to_plot), 1, figsize=(12, 15))
    
    for ax, feature in zip(axes, features_to_plot):
        sns.histplot(df[feature], ax=ax, bins=50, kde=True)
        ax.set_title(f'Distribution of {feature}', fontsize=14)
        ax.set_xlabel('')
    
    plt.tight_layout()
    plt.show()

# --- Execution ---
train_df_lengths = engineer_length_features(train_df_cleaned.copy())
plot_length_distributions(train_df_lengths)
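
Whether length actually relates to preference can be checked by grouping outcomes by which response is longer. The sketch below uses invented toy rows, so the printed rates say nothing about the real dataset; it only illustrates the grouping, and the `longer` column is a hypothetical helper feature.

```python
import numpy as np
import pandas as pd

# Toy rows (invented data) with response lengths and one-hot outcomes.
toy = pd.DataFrame({
    "response_a_len": [900, 120, 400, 50, 700, 300],
    "response_b_len": [200, 800, 400, 600, 100, 310],
    "winner_model_a": [1, 0, 0, 0, 1, 0],
    "winner_model_b": [0, 1, 0, 1, 0, 0],
    "winner_tie":     [0, 0, 1, 0, 0, 1],
})
toy["len_diff"] = toy["response_a_len"] - toy["response_b_len"]

# Bucket each battle by the sign of len_diff.
toy["longer"] = np.select(
    [toy["len_diff"] > 0, toy["len_diff"] < 0],
    ["a_longer", "b_longer"],
    default="equal",
)

# Outcome rates per bucket; each row sums to 1 because the targets are one-hot.
rates = toy.groupby("longer")[["winner_model_a", "winner_model_b", "winner_tie"]].mean()
print(rates)
```

Applied to `train_df_lengths`, a table like this would show at a glance whether the longer response tends to win more often than the shorter one.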

```json
{
  "issue": "To establish a performance baseline, we need to create and evaluate simple, rule-based models before moving to more complex machine learning approaches.",
  "action": "Implement several naive baseline models: one that predicts a uniform 1/3 probability for all outcomes, one that predicts the global mean win/loss/tie rate, and a more refined version that predicts the mean rates specific to each model pair. Calculate the log loss for each baseline to measure its performance.",
  "state": "Performance benchmarks are established from several naive models, providing a clear metric that more sophisticated models must surpass."
}
```

In [ ]:
from sklearn.metrics import log_loss

def evaluate_uniform_baseline(df: pd.DataFrame) -> float:
    """Calculates log loss for a baseline that predicts 1/3 for all outcomes."""
    y_true = np.where(df[TARGETS].values)[1]
    y_pred = np.ones((len(df), 3)) / 3
    loss = log_loss(y_true, y_pred)
    print(f"Uniform Baseline Log Loss: {loss:.4f}")
    return loss

def evaluate_mean_baseline(df: pd.DataFrame) -> float:
    """Calculates log loss for a baseline predicting the global mean."""
    y_true = np.where(df[TARGETS].values)[1]
    mean_preds = df[TARGETS].mean().values
    y_pred = np.tile(mean_preds, (len(df), 1))
    loss = log_loss(y_true, y_pred)
    print(f"Global Mean Baseline Log Loss: {loss:.4f}")
    return loss

# --- Execution ---
evaluate_uniform_baseline(train_df_lengths)
evaluate_mean_baseline(train_df_lengths)
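
The third baseline named in the step description, the mean outcome rate per model pair, could look roughly like the sketch below. It runs on an invented toy battle log, and `multiclass_log_loss` is a small hand-written helper (not a library function) used here so the arithmetic is visible.

```python
import numpy as np
import pandas as pd

TARGET_COLS = ["winner_model_a", "winner_model_b", "winner_tie"]

def multiclass_log_loss(y_onehot: np.ndarray, y_prob: np.ndarray, eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_onehot * np.log(y_prob)).sum(axis=1).mean())

# Toy battle log (invented data).
toy = pd.DataFrame({
    "model_a": ["m1", "m1", "m1", "m2", "m2", "m2"],
    "model_b": ["m2", "m2", "m2", "m3", "m3", "m3"],
    "winner_model_a": [1, 1, 0, 0, 0, 1],
    "winner_model_b": [0, 0, 1, 1, 0, 0],
    "winner_tie":     [0, 0, 0, 0, 1, 0],
})
y = toy[TARGET_COLS].to_numpy()

# Uniform baseline: every class gets 1/3, so the loss equals ln(3).
uniform_loss = multiclass_log_loss(y, np.full(y.shape, 1 / 3))

# Pair-mean baseline: predict the outcome rates observed for each (model_a, model_b) pair.
pair_pred = toy.groupby(["model_a", "model_b"])[TARGET_COLS].transform("mean").to_numpy()
pair_loss = multiclass_log_loss(y, pair_pred)

print(f"Uniform: {uniform_loss:.4f}  Pair mean: {pair_loss:.4f}")
```

This is an in-sample estimate and therefore optimistic. A fairer comparison would compute the pair means out of fold and fall back to the global mean for unseen pairs. Because model identifiers are only available for training, the baseline serves as an analysis reference rather than something that can be applied to the test set.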

```json
{
  "issue": "To understand which engineered features are most predictive, we need to build a simple machine learning model and analyze its decision-making process.",
  "action": "Train a Decision Tree Classifier using the length-based features created earlier. Use stratified K-fold cross-validation to get a robust estimate of its performance. Visualize the trained tree to interpret which features and thresholds are most important for predicting the winner.",
  "state": "A baseline Decision Tree model is trained and evaluated, and its structure is visualized, providing initial insights into the predictive power of the engineered features."
}
```

In [ ]:
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import log_loss

def train_and_visualize_decision_tree(df: pd.DataFrame, features: List[str]):
    """Trains a Decision Tree and visualizes it."""
    X = df[features]
    y = np.where(df[TARGETS].values)[1]

    dt = DecisionTreeClassifier(max_depth=3, random_state=42)
    dt.fit(X, y)
    
    # Evaluate with cross-validation
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    losses = []
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]
        
        model = DecisionTreeClassifier(max_depth=3, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict_proba(X_val)
        losses.append(log_loss(y_val, y_pred))
    print(f"Decision Tree CV Log Loss: {np.mean(losses):.4f} (+/- {np.std(losses):.4f})")

    # Visualize the tree
    plt.figure(figsize=(20, 10))
    plot_tree(dt, feature_names=features, class_names=TARGETS, filled=True, rounded=True, fontsize=10)
    plt.title("Decision Tree Visualization", fontsize=16)
    plt.show()

# --- Execution ---
features = ['n_turns', 'prompt_len', 'response_a_len', 'response_b_len', 'len_diff', 'len_diff_abs']
train_and_visualize_decision_tree(train_df_lengths, features)
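
Besides reading the plotted tree, a feature ranking can be taken from the fitted estimator's `feature_importances_` attribute. The sketch below uses synthetic data in which the label depends only on the sign of `len_diff`; the `noise` column is a made-up distractor, so the output illustrates the mechanics rather than any finding about the real features.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (invented): the label depends only on the sign of len_diff.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "len_diff": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = (X["len_diff"] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# feature_importances_ holds the normalized impurity decrease per feature.
ranking = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)
```

The same two lines after `fit` would work on the tree trained on the length features, provided the fitted estimator is returned from or exposed by the training function.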

```json
{
  "issue": "To gain deeper insights from the raw text, we need to understand the main topics of discussion within the user prompts.",
  "action": "Use BERTopic, a powerful topic modeling technique, to identify and visualize topics from the prompts. This involves generating sentence embeddings, using UMAP for dimensionality reduction and HDBSCAN for clustering, and then representing topics. The topics are then visualized to show their prevalence and relationships.",
  "state": "A topic model is trained on the prompts, revealing the key themes and questions posed by users in the Chatbot Arena, which can be correlated with model performance."
}
```

In [ ]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

def perform_topic_modeling(prompts: List[List[str]]):
    """Performs topic modeling on per-conversation lists of prompt turns using BERTopic."""
    # Flatten the per-conversation lists into a single list of prompt strings,
    # skipping empty or missing turns
    all_prompts = [p for sublist in prompts for p in sublist if p]

    # For demonstration, use only the first 5,000 prompt turns (not a random sample)
    sample_prompts = all_prompts[:5000]

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
    
    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        language="english",
        calculate_probabilities=True,
        verbose=True
    )
    
    topics, probs = topic_model.fit_transform(sample_prompts)
    
    print("--- Top Topics ---")
    display(topic_model.get_topic_info().head(10))
    
    # Visualize topics
    try:
        fig = topic_model.visualize_topics()
        fig.show()
    except Exception as e:
        print(f"Could not visualize topics. Error: {e}")
        
    return topic_model

# --- Execution ---
prompts_list = train_df_lengths['prompt'].tolist()
topic_model = perform_topic_modeling(prompts_list)
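
To relate topics to outcomes, topic assignments have to be aligned with battle rows. The sketch below assumes one topic id per conversation (for example, from modeling only the first prompt turn), which differs from the flattened per-turn input used above. The topic ids and outcomes are invented; `-1` is the label BERTopic assigns to outliers.

```python
import pandas as pd

TARGET_COLS = ["winner_model_a", "winner_model_b", "winner_tie"]

# Toy frame (invented data): one hypothetical topic id per conversation
# plus the one-hot outcome of that battle.
toy = pd.DataFrame({
    "topic": [0, 0, 1, 1, 1, -1],
    "winner_model_a": [1, 0, 0, 0, 1, 1],
    "winner_model_b": [0, 1, 0, 1, 0, 0],
    "winner_tie":     [0, 0, 1, 0, 0, 0],
})

# Drop outlier rows, then compute outcome rates and battle counts per topic.
assigned = toy[toy["topic"] != -1]
per_topic = assigned.groupby("topic")[TARGET_COLS].mean()
per_topic["n_battles"] = assigned.groupby("topic").size()
print(per_topic)
```

Keeping the battle count next to the rates helps avoid over-reading topics that contain only a handful of conversations.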