### General Description
#### 1. Task Statement
**Company:** Insight AI

**Issue:** Insight AI, a leading consulting firm in the artificial intelligence sector, is advising a major client on the integration of a Large Language Model (LLM) into their flagship customer support application. The client needs to select the most suitable model from a wide array of options, ensuring it aligns with user preferences and provides a high-quality conversational experience. Making the wrong choice could lead to poor user adoption and significant financial loss.

**ML/DS Solution:** To provide a data-driven recommendation, Insight AI must perform a comprehensive Exploratory Data Analysis (EDA) on the LMSYS Chatbot Arena dataset. This dataset contains records of head-to-head battles between anonymous LLMs, judged by humans. By analyzing this data, we can uncover patterns in model performance, identify strengths and weaknesses, and understand the factors that drive user preference.

**Feasibility:** Manually reviewing thousands of chat logs to gauge model performance is impractical, subjective, and doesn't scale. A systematic, data-driven EDA is the only feasible way to extract objective, actionable insights from this large and complex dataset.

**Task:** Your task, as a data scientist at Insight AI, is to conduct a detailed EDA on the Chatbot Arena dataset. You will need to clean the data, visualize key distributions, engineer relevant features, and build simple baseline models to identify the key predictors of a model's success.

**Data:** The company provides the 'LMSYS Chatbot Arena' dataset, which includes training and test sets containing chat logs, model identifiers (for training), and human-judged outcomes.

**Definition of Done:** The final deliverable is a structured report (this notebook) that details the findings from the EDA. It must include clear visualizations, statistical analysis of model performance, feature importance rankings from baseline models, and topic modeling of user prompts. The insights gathered will form the basis of the final recommendation to the client.
#### 2. Rewards
- Gaining expertise in Exploratory Data Analysis (EDA) for complex, text-based datasets.
- Mastering data cleaning and preprocessing techniques for real-world data.
- Developing advanced data visualization skills with Matplotlib and Seaborn.
- Practicing feature engineering for machine learning on text data (e.g., TF-IDF).
- Building and interpreting baseline models to guide feature selection.
- Getting an introduction to topic modeling with libraries such as BERTopic.
#### 3. Difficulty Level
normal
#### 4. Task Type
Exploratory Data Analysis, Data Cleaning, Feature Engineering
#### 5. Tools
Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, BERTopic, SentenceTransformers

In [None]:
import os
import random
import re
import time
from collections import defaultdict
from tqdm.notebook import tqdm
import warnings
from pathlib import Path
from typing import Any, Dict, List, Optional
from IPython.display import display, HTML

import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import log_loss
from sentence_transformers import SentenceTransformer
from lightgbm import LGBMClassifier
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer

# Configuration
warnings.simplefilter("ignore")
sns.set_style("darkgrid")
pd.options.display.max_rows = None
pd.options.display.max_columns = None

```json
{
  "issue": "The analysis requires loading raw data from CSV files and performing initial cleaning, such as handling duplicates and parsing string-formatted lists.",
  "action": "Define functions to load the training and test data using pandas. Implement a data cleaning function that removes the 'id' column from the training set, drops duplicate rows, and correctly parses the list-like string columns ('prompt', 'response_a', 'response_b') into actual Python lists, handling potential 'null' values.",
  "state": "The data is loaded into pandas DataFrames, cleaned of duplicates, and all text-based columns are correctly formatted as lists, making them ready for detailed analysis."
}
```

In [None]:
from pathlib import Path
import pandas as pd

def load_data(data_path: Path):
    train = pd.read_csv(data_path / "train.csv")
    test = pd.read_csv(data_path / "test.csv")
    return train, test

def clean_data_incorrectly(df: pd.DataFrame):
    """Incorrectly handles string-to-list conversion and fails to remove duplicates."""
    # Error 1: Naive `eval` is unsafe and fails on `null`.
    # Error 2: Fails to drop duplicate rows from the dataset.
    for col in ["prompt", "response_a", "response_b"]:
        if col in df.columns:
            # This is unsafe and will fail on 'null' values
            df[col] = df[col].apply(eval)
    
    # The line to drop duplicates is missing.
    return df

DATA_PATH = Path("/kaggle/input/lmsys-chatbot-arena")
train_df, test_df = load_data(DATA_PATH)

try:
    train_df_cleaned = clean_data_incorrectly(train_df.copy())
    print(f"Dataframe shape after cleaning: {train_df_cleaned.shape}")
except Exception as e:
    print(f"An error occurred during cleaning: {e}")

print(f"Original dataframe shape: {train_df.shape}")

```json
{
    "required_ml_terms": ["data cleaning", "parsing", "duplicates", "exception handling"],
    "problems_to_detect": [
        "The use of `eval` for parsing is unsafe and can execute arbitrary code; it also fails to handle `null` values, which will raise an error.",
        "The code does not remove duplicate rows, which can skew analysis and model training."
    ]
}
```
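
One way to address both problems listed above is sketched here. It assumes the text columns are JSON-encoded lists (which is what the `null` values suggest), so `json.loads` can replace `eval`; the helper names `parse_turns` and `clean_data` and the tiny `demo` frame are illustrative, not part of the original notebook.

```python
import json

import pandas as pd

TEXT_COLS = ["prompt", "response_a", "response_b"]


def parse_turns(raw: str) -> list:
    """Parse a JSON-encoded list of turns; a JSON null becomes an empty string."""
    return ["" if turn is None else turn for turn in json.loads(raw)]


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the id column, remove duplicate rows, then parse the text columns."""
    df = df.drop(columns=["id"], errors="ignore")
    # Deduplicate while the columns are still strings (lists are not hashable).
    df = df.drop_duplicates().reset_index(drop=True)
    for col in TEXT_COLS:
        if col in df.columns:
            df[col] = df[col].apply(parse_turns)
    return df


# Hypothetical sample: rows 1 and 2 differ only by id, row 3 contains a null turn.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "prompt": ['["hi"]', '["hi"]', '["a", "b"]'],
    "response_a": ['["hello"]', '["hello"]', '[null, "x"]'],
    "response_b": ['["hey"]', '["hey"]', '["y", "z"]'],
})
cleaned = clean_data(demo)
print(cleaned.shape)
```

Because `json.loads` only parses data, it cannot execute arbitrary code the way `eval` can, and it maps `null` to `None` without raising.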

```json
{
  "issue": "To understand the competitive landscape of the models, we need to analyze their appearance frequency and head-to-head battle outcomes.",
  "action": "Create functions to visualize the distribution of models appearing as 'model_a' and 'model_b' using pie charts. Then, develop a 'battle report' by pivoting the data to create heatmaps showing the number of battles and win rates between every pair of top models.",
  "state": "Visualizations are generated that clearly show which models are most frequent and which models tend to win against others, providing a high-level overview of the model hierarchy."
}
```

In [None]:
import matplotlib.pyplot as plt

def plot_model_distribution_flawed(df: pd.DataFrame):
    """Plots distribution but only for one column and omits the battle heatmap."""
    # Error 1: Only plots 'model_a', ignoring 'model_b', giving an incomplete picture.
    # Error 2: Fails to generate the battle count heatmap, which is a key part of the task.
    model_a_counts = df["model_a"].value_counts()
    
    plt.figure(figsize=(8, 8))
    plt.pie(model_a_counts, labels=model_a_counts.index, autopct='%1.1f%%', startangle=140)
    plt.title("Distribution for model_a")
    plt.show()
    

plot_model_distribution_flawed(train_df)

```json
{
    "required_ml_terms": ["data visualization", "exploratory data analysis"],
    "problems_to_detect": [
        "The analysis is incomplete as it only visualizes the distribution for `model_a` while ignoring `model_b`.",
        "The required battle heatmap, which shows head-to-head model performance, was not implemented or generated."
    ]
}
```
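
The missing pieces could be computed roughly as follows. This is a sketch under the assumption that the frame has `model_a`, `model_b`, `winner_model_a`, and `winner_model_b` columns; the function names and the `demo` data are made up for illustration, and only the tables are built here so the example stays independent of any plotting backend.

```python
import pandas as pd


def model_frequency(df: pd.DataFrame) -> pd.Series:
    """Count appearances of each model across both sides of a battle."""
    return pd.concat([df["model_a"], df["model_b"]]).value_counts()


def battle_tables(df: pd.DataFrame):
    """Return (battle counts, win rate of the row model against the column model)."""
    # Mirror every battle so each pair is seen from both models' perspectives.
    fwd = pd.DataFrame({"model": df["model_a"], "opponent": df["model_b"],
                        "win": df["winner_model_a"]})
    rev = pd.DataFrame({"model": df["model_b"], "opponent": df["model_a"],
                        "win": df["winner_model_b"]})
    long = pd.concat([fwd, rev], ignore_index=True)
    counts = long.pivot_table(index="model", columns="opponent", values="win",
                              aggfunc="count", fill_value=0)
    # Ties count as non-wins for both sides in this simple definition.
    win_rate = long.pivot_table(index="model", columns="opponent", values="win",
                                aggfunc="mean")
    return counts, win_rate


# Hypothetical battles between three placeholder models.
demo = pd.DataFrame({
    "model_a": ["m1", "m1", "m2", "m3"],
    "model_b": ["m2", "m3", "m1", "m1"],
    "winner_model_a": [1, 0, 0, 1],
    "winner_model_b": [0, 1, 1, 0],
})
freq = model_frequency(demo)
counts, win_rate = battle_tables(demo)
print(counts)
print(win_rate)
```

Both tables can be passed to `sns.heatmap(..., annot=True)` to produce the battle report, and `freq` (or the separate `model_a` and `model_b` counts) can feed the pie charts.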

```json
{
  "issue": "A key part of the analysis is understanding the verbosity of prompts and responses, as text length can be a strong predictor of user preference.",
  "action": "Engineer features related to the length and number of turns in each conversation. Create functions to calculate the number of turns, the total character length of prompts and responses, and the difference in length between model A's and model B's responses. Then, visualize the distributions of these new features.",
  "state": "The dataset is augmented with several new length-based features, and their distributions are plotted, revealing insights into conversational dynamics and model verbosity."
}
```

In [None]:
import pandas as pd

def engineer_length_features_partially(df: pd.DataFrame):
    """Engineers only a subset of required features and does not visualize them."""
    # This assumes `prompt` column is already parsed into lists, which might not be true.
    # Error 1: It calculates number of turns, but not the character lengths or length differences.
    # Error 2: It fails to plot the distributions of the newly created features.
    try:
        # Incomplete feature engineering
        df['n_turns'] = df["prompt"].apply(len)
        print("Engineered 'n_turns' feature.")
        
        # Missing other features like 'prompt_len', 'response_a_len', 'response_b_len', 'len_diff', etc.
        
        # Missing visualization of the feature distributions
        print("Feature visualization was not performed.")
    except TypeError:
        print("Could not engineer features because 'prompt' column is not a list.")
    return df


```json
{
    "required_ml_terms": ["feature engineering", "data visualization"],
    "problems_to_detect": [
        "The feature engineering is incomplete; it only calculates the number of turns (`n_turns`) and omits other critical length-based features like character counts and response length differences.",
        "The distributions of the newly created features were not visualized, failing to provide insight into their characteristics."
    ]
}
```
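
A fuller version of the feature step might look like the sketch below. It assumes the three text columns already hold lists of strings (one entry per turn); the function name `add_length_features` and the two-row `demo` frame are illustrative, while the feature names match the list used later in the notebook.

```python
import pandas as pd


def add_length_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add turn-count and character-length features to a copy of df."""
    out = df.copy()
    out["n_turns"] = out["prompt"].apply(len)
    for col in ["prompt", "response_a", "response_b"]:
        # Total characters across all turns of the conversation.
        out[f"{col}_len"] = out[col].apply(lambda turns: sum(len(t) for t in turns))
    # Positive values mean model A was more verbose than model B.
    out["len_diff"] = out["response_a_len"] - out["response_b_len"]
    return out


# Hypothetical parsed conversations.
demo = pd.DataFrame({
    "prompt": [["hi", "ok"], ["q"]],
    "response_a": [["hello", "yes"], ["x"]],
    "response_b": [["hey", "no"], ["long answer"]],
})
featured = add_length_features(demo)
print(featured[["n_turns", "prompt_len", "response_a_len", "response_b_len", "len_diff"]])
```

To inspect the distributions, `featured[feature_cols].hist(bins=50)` is one quick option; character lengths are usually heavy-tailed, so a log-scaled axis may be easier to read.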

```json
{
  "issue": "To establish a performance baseline, we need to create and evaluate simple, rule-based models before moving to more complex machine learning approaches.",
  "action": "Implement several naive baseline models: one that predicts a uniform 1/3 probability for all outcomes, one that predicts the global mean win/loss/tie rate, and a more refined version that predicts the mean rates specific to each model pair. Calculate the log loss for each baseline to measure its performance.",
  "state": "Performance benchmarks are established from several naive models, providing a clear metric that more sophisticated models must surpass."
}
```

In [None]:
import pandas as pd
from sklearn.metrics import log_loss

def evaluate_naive_baseline_incorrectly(df: pd.DataFrame, targets: list):
    """Calculates a naive baseline but fails to implement the better mean-based one."""
    # Error 1: Implements only the most naive baseline (uniform probability).
    # Error 2: It does not calculate or return the log loss, so the baseline is not evaluated.
    y_pred = [[1/3, 1/3, 1/3]] * len(df)
    y_true = df[targets].values  # Extracted but never compared against y_pred.

    print("Generated uniform predictions, but did not calculate log loss.")
    return None  # No score is computed or returned.

TARGETS = ["winner_model_a", "winner_model_b", "winner_tie"]
evaluate_naive_baseline_incorrectly(train_df, TARGETS)

```json
{
    "required_ml_terms": ["baseline model", "log loss", "class imbalance"],
    "problems_to_detect": [
        "Only the most naive uniform-probability baseline was implemented; the more informative mean-based baseline was omitted.",
        "The function generates predictions but fails to calculate the `log_loss` score, so the baseline's performance is never actually measured."
    ]
}
```
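
The three baselines named in the task could be scored along these lines. This is a sketch, not the notebook's reference solution: `baseline_scores` and the `demo` frame are illustrative, the one-hot targets are converted to class indices before calling `log_loss`, and the pair-level means are computed in-sample, so that score is optimistic.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

TARGETS = ["winner_model_a", "winner_model_b", "winner_tie"]


def baseline_scores(df: pd.DataFrame) -> dict:
    """Log loss of the uniform, global-mean, and per-pair-mean baselines."""
    y_true = df[TARGETS].to_numpy().argmax(axis=1)  # one-hot -> class index
    n = len(df)
    preds = {
        "uniform": np.full((n, 3), 1 / 3),
        "global_mean": np.tile(df[TARGETS].mean().to_numpy(), (n, 1)),
        # Mean outcome for each ordered (model_a, model_b) pair, broadcast to rows.
        "pair_mean": df.groupby(["model_a", "model_b"])[TARGETS]
                       .transform("mean").to_numpy(),
    }
    return {name: log_loss(y_true, p, labels=[0, 1, 2]) for name, p in preds.items()}


# Hypothetical outcomes with an imbalanced class distribution.
demo = pd.DataFrame({
    "model_a": ["m1", "m1", "m1", "m1", "m2", "m2"],
    "model_b": ["m2", "m2", "m2", "m2", "m3", "m3"],
    "winner_model_a": [1, 1, 0, 0, 0, 0],
    "winner_model_b": [0, 0, 1, 0, 1, 1],
    "winner_tie":     [0, 0, 0, 1, 0, 0],
})
scores = baseline_scores(demo)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The uniform baseline always scores ln(3), about 1.0986, which gives a fixed reference point. Since the test set does not include model identifiers, the pair-level baseline is useful mainly as a diagnostic on the training data.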

```json
{
  "issue": "To understand which engineered features are most predictive, we need to build a simple machine learning model and analyze its decision-making process.",
  "action": "Train a Decision Tree Classifier using the length-based features created earlier. Use stratified K-fold cross-validation to get a robust estimate of its performance. Visualize the trained tree to interpret which features and thresholds are most important for predicting the winner.",
  "state": "A baseline Decision Tree model is trained and evaluated, and its structure is visualized, providing initial insights into the predictive power of the engineered features."
}
```

In [None]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def train_decision_tree_flawed(df: pd.DataFrame, features: list, targets: list):
    """Trains a Decision Tree but omits CV and visualization."""
    # Error 1: Training on the entire dataset without a train/test split or cross-validation.
    # Error 2: The trained decision tree is not visualized, so it cannot be interpreted.
    X = df[features]
    y = df[targets]
    
    model = DecisionTreeClassifier(max_depth=3, random_state=42)
    model.fit(X, y)
    
    print("Decision tree trained on the full dataset, but not evaluated or visualized.")
    
    # The code for cross-validation and plotting the tree is missing.
    return model


features = ['n_turns', 'prompt_len', 'response_a_len', 'response_b_len', 'len_diff']
# Placeholder data: random values stand in for the real engineered features and targets,
# so the fitted tree carries no meaningful signal.
train_df_featured = pd.DataFrame(columns=features, data=np.random.rand(100, len(features)))
train_df_featured[TARGETS] = pd.DataFrame(np.random.randint(0, 2, size=(100, 3)))
dt_model = train_decision_tree_flawed(train_df_featured, features, TARGETS)

```json
{
    "required_ml_terms": ["decision tree", "overfitting", "cross-validation", "model interpretation"],
    "problems_to_detect": [
        "The model was trained on the entire dataset without cross-validation, making it impossible to get a robust measure of performance and check for overfitting.",
        "The decision tree was not visualized, which is a key step for interpreting the model and understanding which features are most important."
    ]
}
```
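
A cross-validated variant could be structured as in the sketch below. It assumes the target has been collapsed to a single class index (0 = model A wins, 1 = model B wins, 2 = tie) so that `StratifiedKFold` can stratify on it; `cross_validate_tree` and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def cross_validate_tree(X: pd.DataFrame, y: np.ndarray, n_splits: int = 5, seed: int = 42):
    """Return the out-of-fold log loss and the out-of-fold probabilities."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros((len(X), 3))
    for train_idx, valid_idx in skf.split(X, y):
        model = DecisionTreeClassifier(max_depth=3, random_state=seed)
        model.fit(X.iloc[train_idx], y[train_idx])
        # Assumes every fold's training split contains all three classes,
        # so predict_proba returns exactly three columns.
        oof[valid_idx] = model.predict_proba(X.iloc[valid_idx])
    return log_loss(y, oof, labels=[0, 1, 2]), oof


# Synthetic stand-in data: the winner depends on the response length difference.
rng = np.random.default_rng(0)
len_diff = rng.normal(size=300)
X = pd.DataFrame({"len_diff": len_diff, "n_turns": rng.integers(1, 5, size=300)})
y = np.where(len_diff > 0.5, 0, np.where(len_diff < -0.5, 1, 2))

score, oof = cross_validate_tree(X, y)
print(f"Out-of-fold log loss: {score:.4f}")
```

For interpretation, a tree fitted on the full data can then be drawn with `sklearn.tree.plot_tree(model, feature_names=list(X.columns), filled=True)`; comparing the out-of-fold score with the training score gives a rough check for overfitting.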

```json
{
  "issue": "To gain deeper insights from the raw text, we need to understand the main topics of discussion within the user prompts.",
  "action": "Use BERTopic, a powerful topic modeling technique, to identify and visualize topics from the prompts. This involves generating sentence embeddings, using UMAP for dimensionality reduction and HDBSCAN for clustering, and then representing topics. The topics are then visualized to show their prevalence and relationships.",
  "state": "A topic model is trained on the prompts, revealing the key themes and questions posed by users in the Chatbot Arena, which can be correlated with model performance."
}
```

In [None]:
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

def run_bertopic_with_bad_vectorizer(prompts: pd.Series):
    """Runs BERTopic with a suboptimal vectorizer and without reducing dimensionality properly."""
    # Error 1: Using a simple CountVectorizer with default settings (no stop word removal)
    # can lead to noisy topics dominated by common, uninformative words.
    # Error 2: A UMAP model is not explicitly defined and passed, so BERTopic uses default UMAP settings,
    # which may not be optimal for reducing the dimensionality of the sentence embeddings.
    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=None)
    
    # UMAP model is not defined, which is a missed opportunity for optimization
    topic_model = BERTopic(
        vectorizer_model=vectorizer_model, 
        verbose=False,
        # umap_model=... is missing
    )
    
    # This would take a long time and produce poor topics
    print("BERTopic configured with a suboptimal vectorizer and default UMAP.")
    # topics, probs = topic_model.fit_transform(prompts.astype(str))
    return topic_model


prompts = train_df['prompt'].explode()
topic_model = run_bertopic_with_bad_vectorizer(prompts)

```json
{
    "required_ml_terms": ["topic modeling", "vectorization", "stop words", "dimensionality reduction", "UMAP"],
    "problems_to_detect": [
        "BERTopic was configured with a basic `CountVectorizer` that does not remove English stop words, which will likely result in uninformative topics.",
        "A custom UMAP model was not configured and passed to BERTopic, which is a missed opportunity to tune the dimensionality reduction step for better topic separation."
    ]
}
```
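
A configuration that addresses both points might look like this sketch. The parameter values are assumptions chosen for illustration (the UMAP settings mirror commonly used BERTopic defaults plus a fixed seed), not tuned values from the original analysis, and `make_vectorizer` and `build_topic_model` are hypothetical helper names. The heavy libraries are imported lazily so the vectorizer can be inspected on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer


def make_vectorizer() -> CountVectorizer:
    """Vectorizer for topic representations with English stop words removed."""
    return CountVectorizer(ngram_range=(1, 2), stop_words="english")


def build_topic_model(seed: int = 42):
    """Configure BERTopic with an explicit UMAP model and the vectorizer above."""
    # Imported here because umap-learn and bertopic are heavy optional dependencies.
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=15,     # size of the local neighbourhood (assumed value)
        n_components=5,     # target dimensionality before clustering (assumed value)
        min_dist=0.0,       # tight packing tends to help density-based clustering
        metric="cosine",    # cosine distance suits sentence embeddings
        random_state=seed,  # fixed seed for reproducible topics
    )
    return BERTopic(
        umap_model=umap_model,
        vectorizer_model=make_vectorizer(),
        verbose=False,
    )


# Quick check that stop words no longer reach the topic vocabulary.
vectorizer = make_vectorizer()
vectorizer.fit(["the cat sat on the mat", "a dog and a cat"])
print(sorted(vectorizer.vocabulary_))
```

With this in place, `build_topic_model().fit_transform(docs)` would run the full pipeline on a list of prompt strings; fitting on the whole dataset can take a while, so starting with a sample is a reasonable first step.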