# Exploratory Data Analysis (EDA)

### 1. Research Objective and Hypothesis

**Primary Research Question:** Do users demonstrate distinct preferences for specific Large Language Models (LLMs) based on task type, and how do model characteristics (speed, cost, performance) influence these preferences?

**Hypothesis:** Users exhibit statistically significant preferences for certain LLMs when performing specific tasks (e.g., coding, creative writing, data analysis). Furthermore, model characteristics such as response speed, operational cost, and benchmark performance metrics correlate with user preference patterns, with faster and more capable models being favored for technical tasks while cost-effective models are preferred for general conversation.

**Research Sub-questions:**
1. Which LLMs are most frequently preferred by users for coding-related tasks compared to creative or conversational tasks?
2. Do model performance characteristics (speed, cost, benchmark scores) correlate with user preference rates?
3. What task categories dominate user interactions with LLMs, and how does this vary across different models?
4. Are there statistically significant differences in user preference across LLMs when controlling for task type?

---

### 2. Significance and Application

**Relevance:**

As artificial intelligence tools become increasingly integrated into professional and personal workflows, understanding user preferences and usage patterns is critical for multiple stakeholders:
- **For AI Developers:** Insights into task-specific model performance can inform targeted optimization efforts and feature development priorities.
- **For End Users:** Understanding which models excel at specific tasks can guide tool selection and improve productivity.
- **For Researchers:** This analysis contributes to the growing body of knowledge on human-AI interaction patterns and user experience design.

**Potential Applications:**

The findings from this research could be applied to:
1. **Product Development:** Guide AI companies in optimizing models for specific use cases and user needs.
2. **User Interface Design:** Inform the development of recommendation systems that suggest appropriate models based on user tasks.
3. **Educational Resources:** Create evidence-based guidance for users learning to work effectively with LLMs.
4. **Market Analysis:** Provide insights into competitive positioning and market segmentation in the LLM space.

**Future Investigation:**

This work establishes a foundation for several future research directions:
- Longitudinal studies tracking how user preferences evolve as models improve
- Investigation of demographic factors influencing LLM preference patterns
- Analysis of task complexity and its relationship to model selection
- Exploration of multi-model workflows and when users switch between different LLMs

---

### 3. Data Sources

**Primary Dataset: Chatbot Arena Human Preferences**

**Source:** [Hugging Face Repository](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-55k) <br>
**Access:** Publicly available via Hugging Face Datasets library

**Dataset Characteristics:**
- Volume: 55,000+ real-world pairwise comparisons (57,477 rows in the CSV loaded for this analysis)
- LLM Coverage: 64 distinct model names in each of the `model_a` and `model_b` columns, including GPT-4, Claude 2, Llama 2, Gemini, Mistral, and others
- Data Collection Period: Ongoing collection from Chatbot Arena platform
- Key Features:
    - Conversation prompts (user queries)
    - Model responses from two competing models
    - User vote (which model response was preferred, or a tie)
    - Model identifiers
    - Conversation `id`
    - No timestamp or other metadata columns are present in the loaded CSV

**Preliminary Summary Statistics:**
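- Total conversations: 57,477 rows
- Unique models represented: 64 in each of `model_a` and `model_b`
- Data format: Structured CSV with 9 columns
- Average conversation length: To be determined during initial exploration
- Vote distribution: Overall about 34.9% for model A, 34.2% for model B, and 30.9% ties; balance across individual models is still to be analyzed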

**Secondary Dataset: Large Language Models Comparison Dataset**

**Source:** [Kaggle](https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset) <br>
**Access:** Publicly available via Kaggle

**Dataset Characteristics:**
- Volume: 200 rows and 15 columns, covering 70 unique model names from 8 providers
- Key Features:
    - Model names and providers
    - Benchmark scores (MMLU, Chatbot Arena)
    - Speed (tokens/sec) and latency (sec) measurements
    - Price per million tokens
    - Context window, training dataset size, compute power, and energy efficiency
    - Open-source flag and quality, speed, and price ratings

**Integration Strategy:** The two datasets will be merged using model names as the common key, enriching user preference data with technical model characteristics to enable deeper analysis of the relationship between model attributes and user choices. This merge may be difficult because the datasets differ greatly in size and their model names may not match (for example, `Llama-8` in the Kaggle file versus `llama-2-13b-chat` in the Arena file).
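
The merge described above can be sketched as follows. This is a minimal illustration, not the project's final pipeline: `normalize_model_name` and `merge_on_model` are hypothetical helpers, the normalization rule (lowercase, alphanumerics only) is an assumption, and the two toy frames are invented for the example. The `indicator=True` flag makes the share of unmatched model names visible right after the join.

```python
import pandas as pd


def normalize_model_name(name: str) -> str:
    """Lowercase and drop separators so 'GPT-4-0613' and 'gpt 4 0613' compare equal."""
    return "".join(ch for ch in str(name).lower() if ch.isalnum())


def merge_on_model(arena: pd.DataFrame, specs: pd.DataFrame) -> pd.DataFrame:
    """Left-join arena rows (by model_a) to per-model specs and flag unmatched rows."""
    arena = arena.assign(model_key=arena["model_a"].map(normalize_model_name))
    specs = specs.assign(model_key=specs["Model"].map(normalize_model_name))
    # Keep one spec row per key so the join stays many-to-one.
    specs = specs.drop_duplicates("model_key")
    return arena.merge(
        specs, on="model_key", how="left", indicator=True, validate="many_to_one"
    )


# Toy rows for illustration only (not taken from either dataset).
arena_toy = pd.DataFrame({"model_a": ["gpt-4-0613", "koala-13b"]})
specs_toy = pd.DataFrame({"Model": ["GPT-4-0613"], "Provider": ["OpenAI"]})

merged = merge_on_model(arena_toy, specs_toy)
match_rate = (merged["_merge"] == "both").mean()
print(f"matched {match_rate:.0%} of arena rows")
```

The same join would need to be repeated for `model_b`. A low match rate is a signal that name normalization alone is not enough and a manual mapping table may be required.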

---

### 4. Intended Analytical Approach

**Techniques and Methods:**  
- **Data Preparation:** pandas, Hugging Face datasets, data cleaning, and merging  
- **Feature Engineering:** Conversation length, task classification (keyword-based), win rate per model  
- **Exploratory Analysis:**  
    - Distribution of preferences by model and task  
    - Correlations between model characteristics (speed, cost) and user choices 
- **Statistical Testing:**  
    - Chi-square tests for independence  
    - ANOVA and t-tests for group comparisons  
    - Correlation analysis between benchmarks and preferences  
- **Visualization:** matplotlib, seaborn (bar charts, boxplots, heatmaps)  
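
As a rough sketch of how the keyword-based task classification and the chi-square test of independence listed above fit together: the keyword lists, the `classify_task` and `chi_square_statistic` helpers, and the toy rows are all assumptions made for illustration. The statistic is computed by hand with NumPy to keep the example dependency-free; `scipy.stats.chi2_contingency` is the usual one-call alternative and also returns a p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical keyword lists; a real taxonomy would need validation.
TASK_KEYWORDS = {
    "coding": ("python", "function", "code", "bug"),
    "creative": ("story", "poem", "lyrics"),
}


def classify_task(prompt: str) -> str:
    """Return the first task whose keywords appear in the prompt, else 'general'."""
    text = prompt.lower()
    for task, words in TASK_KEYWORDS.items():
        if any(word in text for word in words):
            return task
    return "general"


def chi_square_statistic(table: pd.DataFrame) -> tuple:
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    observed = table.to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()
    stat = float(((observed - expected) ** 2 / expected).sum())
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return stat, dof


# Toy rows for illustration only.
toy = pd.DataFrame({
    "prompt": ["Write a python function", "Tell me a story",
               "Fix this bug", "Write a poem"],
    "winner": ["model_a", "model_b", "model_a", "model_b"],
})
toy["task"] = toy["prompt"].map(classify_task)
table = pd.crosstab(toy["task"], toy["winner"])
stat, dof = chi_square_statistic(table)
print(table)
print("chi2 =", stat, "dof =", dof)
```

Substring matching is crude (for instance, "code" also matches "decode"), so the resulting labels should be spot-checked before any test results are interpreted.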

**Expected Output:** Cleaned dataset, descriptive and inferential statistics, and visualizations showing user preference trends and correlations with model performance.




In [1]:
import pandas as pd
import numpy as np
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns
import os

print("Setup complete!")

Setup complete!


### Data Exploration

**Objective:** Understand the structure, quality, and characteristics of both datasets before cleaning.

In [30]:
# Load Chatbot Arena dataset
df_arena = pd.read_csv('../data/raw/chatbot_arena.csv')
print(df_arena.shape[0], "rows ×", df_arena.shape[1], "columns")

# Load Kaggle dataset (update filename if different)
df_kaggle = pd.read_csv('../data/raw/llm_comparison_dataset.csv')
print(df_kaggle.shape[0], "rows ×", df_kaggle.shape[1], "columns")

57477 rows × 9 columns
200 rows × 15 columns


### Inspect Arena Dataset

In [31]:
print("Columns in Arena Dataset: ")
print(df_arena.columns.tolist())

Columns in Arena Dataset: 
['id', 'model_a', 'model_b', 'prompt', 'response_a', 'response_b', 'winner_model_a', 'winner_model_b', 'winner_tie']


In [32]:
df_arena.head()

| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | `["Is it morally right to try to have a certain...` | `["The question of whether it is morally right ...` | `["As an AI, I don't have personal beliefs or o...` | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | `["What is the difference between marriage lice...` | `["A marriage license is a legal document that ...` | `["A marriage license and a marriage certificat...` | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | `["explain function calling. how would you call...` | `["Function calling is the process of invoking ...` | `["Function calling is the process of invoking ...` | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | `["How can I create a test set for a very rare ...` | `["Creating a test set for a very rare category...` | `["When building a classifier for a very rare c...` | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | `["What is the best way to travel from Tel-Aviv...` | `["The best way to travel from Tel Aviv to Jeru...` | `["The best way to travel from Tel-Aviv to Jeru...` | 0 | 1 | 0 |
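
The `prompt` and response cells in the preview look like JSON-encoded lists of strings, one entry per conversation turn. Assuming that holds for the whole column, conversation length could be derived along these lines. `parse_turns` is a hypothetical helper and the two toy cells are made up for the example.

```python
import json

import pandas as pd


def parse_turns(cell) -> list:
    """Decode one JSON-encoded cell into a list of turns; return [] if malformed."""
    try:
        turns = json.loads(cell)
    except (TypeError, json.JSONDecodeError):
        return []
    return turns if isinstance(turns, list) else [turns]


def total_chars(turns: list) -> int:
    """Sum the lengths of the string turns, skipping nulls or other non-strings."""
    return sum(len(t) for t in turns if isinstance(t, str))


# Toy cells for illustration only.
toy = pd.DataFrame({
    "prompt": ['["Hi", "Tell me more"]', '["What is a marriage license?"]'],
})
turns = toy["prompt"].map(parse_turns)
toy["n_turns"] = turns.map(len)
toy["prompt_chars"] = turns.map(total_chars)
print(toy[["n_turns", "prompt_chars"]])
```

Returning an empty list for malformed cells keeps the pipeline running, but the number of such rows should be counted so that parsing failures are not silently treated as empty conversations.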


In [33]:
df_arena.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57477 entries, 0 to 57476
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              57477 non-null  int64 
 1   model_a         57477 non-null  object
 2   model_b         57477 non-null  object
 3   prompt          57477 non-null  object
 4   response_a      57477 non-null  object
 5   response_b      57477 non-null  object
 6   winner_model_a  57477 non-null  int64 
 7   winner_model_b  57477 non-null  int64 
 8   winner_tie      57477 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 3.9+ MB
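
The three `winner_*` columns appear to be one-hot flags. A small check that exactly one flag is set per row, followed by collapsing them into a single label, might look like the sketch below. `add_winner_label` is a hypothetical helper; the flag values in the toy frame are copied from the first three rows of the preview above.

```python
import pandas as pd

WINNER_COLS = ["winner_model_a", "winner_model_b", "winner_tie"]


def add_winner_label(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the one-hot winner flags and add a single 'winner' column."""
    flags = df[WINNER_COLS]
    if not (flags.sum(axis=1) == 1).all():
        raise ValueError("each row should have exactly one winner flag set")
    # idxmax returns the name of the column holding the 1 in each row.
    label = flags.idxmax(axis=1).str.replace("winner_", "", regex=False)
    return df.assign(winner=label)


# Flag values from rows 0-2 of the df_arena.head() output.
flags_toy = pd.DataFrame({
    "winner_model_a": [1, 0, 0],
    "winner_model_b": [0, 1, 0],
    "winner_tie": [0, 0, 1],
})
labeled = add_winner_label(flags_toy)
print(labeled["winner"].tolist())
```

A single categorical column is usually easier to pass to `pd.crosstab` or a seaborn plot than three separate indicator columns.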


### Inspect Kaggle Dataset

In [34]:
print("Columns in Kaggle Dataset: ")
print(df_kaggle.columns.tolist())

Columns in Kaggle Dataset: 
['Model', 'Provider', 'Context Window', 'Speed (tokens/sec)', 'Latency (sec)', 'Benchmark (MMLU)', 'Benchmark (Chatbot Arena)', 'Open-Source', 'Price / Million Tokens', 'Training Dataset Size', 'Compute Power', 'Energy Efficiency', 'Quality Rating', 'Speed Rating', 'Price Rating']


In [35]:
df_kaggle.head()

| | Model | Provider | Context Window | Speed (tokens/sec) | Latency (sec) | Benchmark (MMLU) | Benchmark (Chatbot Arena) | Open-Source | Price / Million Tokens | Training Dataset Size | Compute Power | Energy Efficiency | Quality Rating | Speed Rating | Price Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DeepSeek-4 | Deepseek | 128000 | 95 | 2.74 | 85 | 1143 | 1 | 18.81 | 760952565 | 13 | 0.5 | 2 | 2 | 3 |
| 1 | Llama-8 | Meta AI | 300000 | 284 | 3.21 | 71 | 1390 | 1 | 3.98 | 22891342 | 22 | 2.07 | 1 | 3 | 3 |
| 2 | Llama-5 | Meta AI | 300000 | 225 | 2.95 | 85 | 1406 | 0 | 1.02 | 827422145 | 21 | 0.95 | 2 | 3 | 2 |
| 3 | DeepSeek-3 | Deepseek | 2000000 | 242 | 12.89 | 72 | 1264 | 1 | 27.63 | 694305632 | 86 | 3.51 | 1 | 3 | 3 |
| 4 | DeepSeek-8 | Deepseek | 1000000 | 71 | 3.8 | 77 | 1381 | 1 | 18.52 | 378552278 | 92 | 1.8 | 2 | 2 | 3 |


In [36]:
df_kaggle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 15 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Model                      200 non-null    object 
 1   Provider                   200 non-null    object 
 2   Context Window             200 non-null    int64  
 3   Speed (tokens/sec)         200 non-null    int64  
 4   Latency (sec)              200 non-null    float64
 5   Benchmark (MMLU)           200 non-null    int64  
 6   Benchmark (Chatbot Arena)  200 non-null    int64  
 7   Open-Source                200 non-null    int64  
 8   Price / Million Tokens     200 non-null    float64
 9   Training Dataset Size      200 non-null    int64  
 10  Compute Power              200 non-null    int64  
 11  Energy Efficiency          200 non-null    float64
 12  Quality Rating             200 non-null    int64  
 13  Speed Rating               200 non-null    int64  
 14  Price Rating               200 non-null    int64  
dtypes: float64(3), int64(10), object(2)

### Summary Statistics

In [39]:
# Arena - Numerical
df_arena.describe()

| | id | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|
| count | 57477.0 | 57477.0 | 57477.0 | 57477.0 |
| mean | 2142564000.0 | 0.349079 | 0.341911 | 0.309011 |
| std | 1238327000.0 | 0.476683 | 0.474354 | 0.46209 |
| min | 30192.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1071821000.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2133658000.0 | 0.0 | 0.0 | 0.0 |
| 75% | 3211645000.0 | 1.0 | 1.0 | 1.0 |
| max | 4294947000.0 | 1.0 | 1.0 | 1.0 |
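
The column means above are the overall vote shares (about 34.9% model A, 34.2% model B, 30.9% ties), but they say nothing about individual models. One way to get a per-model win rate is to stack the `model_a` and `model_b` appearances into a long table, as in this sketch. `win_rate_per_model` is a hypothetical helper, ties are counted as non-wins (an assumption worth revisiting), and the toy frame reuses the five rows from the `df_arena.head()` preview.

```python
import pandas as pd


def win_rate_per_model(df: pd.DataFrame) -> pd.DataFrame:
    """Count battles and wins per model across both sides of each comparison."""
    side_a = pd.DataFrame({"model": df["model_a"], "win": df["winner_model_a"]})
    side_b = pd.DataFrame({"model": df["model_b"], "win": df["winner_model_b"]})
    long = pd.concat([side_a, side_b], ignore_index=True)
    out = long.groupby("model")["win"].agg(battles="count", wins="sum")
    out["win_rate"] = out["wins"] / out["battles"]
    return out.sort_values("win_rate", ascending=False)


# The five rows shown in the df_arena.head() output.
head_rows = pd.DataFrame({
    "model_a": ["gpt-4-1106-preview", "koala-13b", "gpt-3.5-turbo-0613",
                "llama-2-13b-chat", "koala-13b"],
    "model_b": ["gpt-4-0613", "gpt-4-0613", "mistral-medium",
                "mistral-7b-instruct", "gpt-3.5-turbo-0314"],
    "winner_model_a": [1, 0, 0, 1, 0],
    "winner_model_b": [0, 1, 0, 0, 1],
    "winner_tie": [0, 0, 1, 0, 0],
})
rates = win_rate_per_model(head_rows)
print(rates)
```

Keeping the `battles` count next to `win_rate` matters because models with only a handful of comparisons will have very noisy rates.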


In [41]:
# Arena - Categorical
df_arena.describe(include=['object'])

| | model_a | model_b | prompt | response_a | response_b |
|---|---|---|---|---|---|
| count | 57477 | 57477 | 57477 | 57477 | 57477 |
| unique | 64 | 64 | 51734 | 56566 | 56609 |
| top | gpt-4-1106-preview | gpt-4-1106-preview | `["Answer the following statements with \"Agree...` | `["Hello! How can I assist you today?"]` | `["Hello! How can I assist you today?"]` |
| freq | 3678 | 3709 | 101 | 109 | 100 |


In [42]:
# Kaggle - Numerical
df_kaggle.describe()

| | Context Window | Speed (tokens/sec) | Latency (sec) | Benchmark (MMLU) | Benchmark (Chatbot Arena) | Open-Source | Price / Million Tokens | Training Dataset Size | Compute Power | Energy Efficiency | Quality Rating | Speed Rating | Price Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 637180.0 | 163.24 | 9.35875 | 77.945 | 1192.96 | 0.49 | 14.4752 | 490264300.0 | 46.915 | 2.5191 | 1.9 | 2.275 | 2.91 |
| std | 690943.9 | 79.188106 | 5.489481 | 10.182356 | 174.649767 | 0.501154 | 8.890484 | 274754400.0 | 28.408679 | 1.458241 | 0.802008 | 0.625565 | 0.303911 |
| min | 128000.0 | 20.0 | 0.6 | 60.0 | 902.0 | 0.0 | 0.2 | 2012584.0 | 2.0 | 0.15 | 1.0 | 1.0 | 1.0 |
| 25% | 200000.0 | 93.75 | 4.265 | 69.0 | 1043.25 | 0.0 | 6.09 | 262297600.0 | 22.0 | 1.15 | 1.0 | 2.0 | 3.0 |
| 50% | 256000.0 | 165.5 | 8.82 | 80.0 | 1200.5 | 0.0 | 14.66 | 500249400.0 | 43.5 | 2.525 | 2.0 | 2.0 | 3.0 |
| 75% | 1000000.0 | 236.0 | 14.035 | 87.0 | 1343.75 | 1.0 | 21.515 | 721085700.0 | 72.0 | 3.8075 | 3.0 | 3.0 | 3.0 |
| max | 2000000.0 | 294.0 | 19.8 | 94.0 | 1493.0 | 1.0 | 29.89 | 984434500.0 | 99.0 | 4.98 | 3.0 | 3.0 | 3.0 |


In [43]:
# Kaggle - Categorical
df_kaggle.describe(include=['object'])

| | Model | Provider |
|---|---|---|
| count | 200 | 200 |
| unique | 70 | 8 |
| top | Command-9 | Cohere |
| freq | 8 | 34 |
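
With 200 rows but only 70 unique `Model` values, the Kaggle file contains repeated model names (`Command-9` appears 8 times), so it cannot serve directly as a one-row-per-model lookup table. One possible way to collapse it is sketched below. Averaging the numeric columns is an assumption, since it is not yet known whether the repeats are separate measurements or different variants; `one_row_per_model` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def one_row_per_model(specs: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated model names: mean of numeric columns, first provider."""
    numeric_cols = specs.select_dtypes("number").columns
    agg = {col: "mean" for col in numeric_cols}
    agg["Provider"] = "first"
    return specs.groupby("Model", as_index=False).agg(agg)


# Toy rows for illustration only.
specs_toy = pd.DataFrame({
    "Model": ["Command-9", "Command-9", "Llama-8"],
    "Provider": ["Cohere", "Cohere", "Meta AI"],
    "Speed (tokens/sec)": [100, 200, 284],
})
per_model = one_row_per_model(specs_toy)
print(per_model)
```

Before averaging, it is worth checking how much the repeated rows disagree (for example with `groupby("Model").std()`); large spreads would suggest the duplicates are not interchangeable.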


### Missing Values Analysis

In [45]:
# Arena missing values
missing_arena = df_arena.isnull().sum()
missing_pct = (missing_arena / len(df_arena) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing_arena, 
    'Missing %': missing_pct
    }).sort_values('Missing %', ascending=False)

# Only show columns with missing values
missing_df[missing_df['Missing Count'] > 0]

| | Missing Count | Missing % |
|---|---|---|


In [46]:
# Kaggle missing values
missing_kaggle = df_kaggle.isnull().sum()
missing_pct_k = (missing_kaggle / len(df_kaggle) * 100).round(2)
missing_df_k = pd.DataFrame({
    'Missing Count': missing_kaggle, 
    'Missing %': missing_pct_k
    }).sort_values('Missing %', ascending=False)

# Only show columns with missing values
missing_df_k[missing_df_k['Missing Count'] > 0]

| | Missing Count | Missing % |
|---|---|---|
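
Both results are empty, so neither dataset has `NaN` values. Since the same steps are repeated for each frame, they could be folded into one helper. The sketch below does that and also counts blank strings, which `isnull()` does not flag. `missing_report` and the extra `Blank Count` column are hypothetical additions, and the toy frame is made up for the example.

```python
import pandas as pd


def is_blank(value) -> bool:
    """True for empty or whitespace-only strings; False for everything else."""
    return isinstance(value, str) and value.strip() == ""


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column NaN and blank-string counts, keeping only affected columns."""
    report = pd.DataFrame({
        "Missing Count": df.isnull().sum(),
        "Blank Count": df.apply(lambda col: col.map(is_blank).sum()),
    })
    report["Missing %"] = (report["Missing Count"] / len(df) * 100).round(2)
    affected = (report["Missing Count"] > 0) | (report["Blank Count"] > 0)
    return report[affected].sort_values("Missing %", ascending=False)


# Toy frame for illustration only.
toy = pd.DataFrame({"id": [1, 2, 3], "prompt": ["hi", " ", None]})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df_arena)` and `missing_report(df_kaggle)` would replace the two cells above. Null entries nested inside JSON-encoded response lists would still go undetected and need a separate check after parsing.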
