# COGS 108 - Data Checkpoint

## Authors

- James Qian: Data curation, Writing, Analysis, Background research
- Even Wu: Background research, Writing, Analysis, Data curation
- Aston Martini-Facio: Methodology, Writing, Background research
- Matthew Odom (**dropped from the class**): Conceptualization, Background research, Writing

list for reference
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing


## Research Question

**Can benchmark performance across different domains (reasoning, coding, mathematics, and general knowledge) reliably predict the token pricing of AI language models?**

Subquestions might include: 

- Which models deliver the best overall performance across benchmarks?

- Which models provide the most cost-efficient performance?
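As a rough illustration of how the cost-efficiency subquestion might be operationalized, the sketch below divides a performance score by the blended price. The metric name `index_per_dollar` and the choice of the intelligence index as the numerator are assumptions made for illustration, not a method the project has committed to. The rows are copied from the cleaned leaderboard shown later in this notebook.

```python
# Rows copied from the cleaned leaderboard shown in this notebook:
# (name, artificial_analysis_intelligence_index, price_1m_blended_3_to_1)
models = [
    ("gpt-oss-20B (high)", 52.4, 0.094),
    ("gpt-oss-120B (high)", 60.5, 0.263),
    ("Qwen3 Max (Preview)", 48.5, 2.400),
    ("Qwen2.5 Instruct 32B", 22.9, 0.000),
]


def index_per_dollar(index_score, blended_price):
    """Hypothetical efficiency metric: index points per USD of blended price."""
    if blended_price <= 0:
        return None  # the ratio is undefined for zero-priced models
    return index_score / blended_price


efficiency = {name: index_per_dollar(score, price) for name, score, price in models}

# Rank only the models for which the ratio is defined, most efficient first
ranked = sorted(
    (item for item in efficiency.items() if item[1] is not None),
    key=lambda item: item[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name}: {value:.1f} index points per dollar")
```

Zero-priced models need separate handling under any ratio-based metric, which is why they are excluded from the ranking here rather than treated as infinitely efficient.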

## Background and Prior Work

Large language models (LLMs) are typically evaluated using standardized benchmarks that measure performance in reasoning, coding, mathematics, and general knowledge. Widely used tasks include MMLU for academic knowledge, GSM8K for multi-step math reasoning, and HumanEval for code generation. These benchmarks give each model an overall accuracy score that can be compared across systems. In addition to these static tests, preference-based evaluations such as the LMSYS Chatbot Arena use large-scale human comparisons to capture perceived quality across diverse prompts, offering a broader, more holistic view of model performance.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Alongside performance data, providers also publish detailed token pricing, usually expressed as cost per million input and output tokens. OpenAI, Google, and other companies differentiate their models by capability tier, context window, and modality. The resulting price lists are publicly available, but interpreting and comparing them is much harder.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) In particular, it is not obvious whether higher benchmark scores actually justify higher token costs. Some analyses suggest that cost-efficiency varies widely across models, and that smaller or open-source systems can sometimes deliver better “performance per dollar” on specific tasks than their larger counterparts.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

The findings of this project can help both students and researchers. College students span a wide variety of majors, and most now use LLMs in their daily lives. However, each student's use of LLMs differs, so finding the best model for a given task matters. For example, computer science students may care most about coding accuracy, math and statistics students about problem-solving, and social science or humanities students about writing quality and general reasoning. Many students also work within a limited budget. If one model is cheaper but nearly as accurate as a flagship model for a given domain, it may be the better practical choice. Understanding whether newer models are actually becoming more cost-efficient, and which models are best suited to particular types of tasks, could help students pick tools that fit both their needs and their budgets.

*Past work examples:*

Past work has begun to explore model comparisons in more applied settings. For example, articles and blog posts can be found directly comparing “ChatGPT vs DeepSeek,” noting that one system may feel better for general use while another excels on more technical tasks.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) There are even ongoing experiments that let different AI systems trade a virtual portfolio to compare their performance in financial markets over time.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) These examples highlight that “best” depends heavily on the task at hand.



Prior work on LLM evaluation has shown both the strengths and limits of current approaches. For example, the Hugging Face Open LLM Leaderboard combines results from many benchmarks into a single view, which helps make trade-offs between accuracy, efficiency, and model size easier to see.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Other reviews of evaluation methods argue that models should be judged along several dimensions, since no single benchmark can capture everything an LLM can do.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Building on this, our project will compare large language models ranked on public leaderboards across multiple evaluation domains with their published pricing structures. We will then fit a predictive model using benchmark scores as independent variables and token cost as the dependent variable, in order to test whether performance can reliably explain or predict pricing.
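A minimal sketch of the kind of predictive model described above, assuming ordinary least squares on a log-transformed price. The choice of `mmlu_pro` and `livecodebench` as predictors, the `log1p` transform, and the use of NumPy are all assumptions made for illustration; the nine rows are copied from the cleaned leaderboard shown later in this notebook, so the fit is far too small to interpret.

```python
import numpy as np

# Rows copied from the cleaned leaderboard shown in this notebook.
# Columns: mmlu_pro, livecodebench, price_1m_blended_3_to_1
rows = np.array([
    [0.556, 0.470, 0.138],
    [0.718, 0.652, 0.094],
    [0.808, 0.878, 0.263],
    [0.775, 0.707, 0.263],
    [0.748, 0.777, 0.094],
    [0.697, 0.248, 0.000],
    [0.828, 0.622, 2.625],
    [0.347, 0.121, 0.398],
    [0.838, 0.651, 2.400],
])

# Design matrix: intercept column plus the two benchmark scores
X = np.column_stack([np.ones(len(rows)), rows[:, :2]])

# log(1 + price) keeps zero-priced models and reduces right skew (an assumption)
y = np.log1p(rows[:, 2])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef

# In-sample coefficient of determination
r_squared = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print("intercept, mmlu_pro, livecodebench:", np.round(coef, 3))
print("in-sample R^2:", round(float(r_squared), 3))
```

On the full dataset, held-out evaluation (for example a train/test split) would be needed before saying anything about whether benchmarks "reliably predict" price, since in-sample fit alone overstates predictive power.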

**References:**

1. <a name="cite_note-1"></a> [^](#cite_ref-1)Lianmin Zheng, Ying Sheng, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. "LMSYS Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings", 03 May, 2023. https://lmsys.org/blog/2023-05-03-arena/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)Hugging Face Open LLM Leaderboard. "Leaderboards on the Hub" https://huggingface.co/docs/leaderboards/en/leaderboards/intro

3. <a name="cite_note-3"></a> [^](#cite_ref-3)OpenAI API pricing. https://openai.com/api/pricing

4. <a name="cite_note-4"></a> [^](#cite_ref-4)“Ai Trading Benchmark.” Alpha Arena, Accessed 29 Oct. 2025. https://nof1.ai/

5. <a name="cite_note-5"></a> [^](#cite_ref-5)“Deepseek vs Chatgpt: Which Ai Is Right for You?” InvoZone, 19 Oct, 2025. https://invozone.com/blog/deepseek-vs-chatgpt


## Hypothesis


**Null Hypothesis:** There is no significant relationship between an LLM's benchmark accuracy scores and its token cost. Improvements in performance are not systematically associated with higher or lower token cost.

**Alternative Hypothesis:** An LLM's benchmark accuracy is significantly related to its token cost, either positively (higher performance costs more) or non-linearly (diminishing returns).
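Because the alternative hypothesis allows for a non-linear relationship, a rank-based correlation is one reasonable first check, since it only assumes a monotonic trend. The sketch below implements Spearman's rank correlation in plain Python; choosing Spearman is an assumption for illustration, and the helper names are hypothetical. The values are nine rows copied from the cleaned leaderboard shown later in this notebook.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# artificial_analysis_intelligence_index and price_1m_blended_3_to_1
intelligence_index = [29.1, 44.3, 60.5, 47.5, 52.4, 22.9, 41.7, 14.2, 48.5]
blended_price = [0.138, 0.094, 0.263, 0.263, 0.094, 0.000, 2.625, 0.398, 2.400]

rho = spearman(intelligence_index, blended_price)
print(f"Spearman rho on the sample rows: {rho:.3f}")
```

This only yields the coefficient. Judging significance would additionally require a p-value, for example from a permutation test or from `scipy.stats.spearmanr`.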


## Data

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 

leaderboard_path = 'data/00-raw/llm_performance_leaderboard.csv'

# **LLM Performance Leaderboard Dataset**

## **Dataset Column Descriptions:**
---
### **Basic Model Information**
- **id** – Unique numerical identifier for each model.  
- **name** – Human-readable name of the model (e.g., “GPT-5-mini”).  
- **slug** – URL-friendly identifier for the model.  
- **release_date** – Date the model was released.

### **Speed & Latency Metrics**
- **median_output_tokens_per_second** – Model generation speed in tokens per second (higher = faster).  
- **median_time_to_first_token_seconds** – Time until the first generated token appears (seconds).  
- **median_time_to_first_answer_token** – Time until the model begins producing its final answer.

### **Creator Metadata**
- **model_creator.id** – Identifier for the organization that developed the model.  
- **model_creator.name** – Name of the model’s creator (e.g., OpenAI, Anthropic).  
- **model_creator.slug** – URL-friendly version of the creator name.

---
### **Performance Metrics (Evaluation Scores)**  
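Higher values indicate better performance. Individual benchmark scores are accuracies expressed as proportions between 0 and 1, while the composite intelligence index is reported on a larger point scale (1.0 to 68.5 in our cleaned data).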

### **Composite Scores**
- **evaluations.artificial_analysis_intelligence_index** – Overall composite reasoning score.  
- **evaluations.artificial_analysis_coding_index** – Composite coding/programming ability score.  
- **evaluations.artificial_analysis_math_index** – Composite mathematical reasoning score.

### **Academic / Professional Benchmarks**
- **evaluations.mmlu_pro** – Accuracy on MMLU-Pro, a harder, more reasoning-focused extension of the MMLU multiple-choice knowledge benchmark.  
- **evaluations.gpqa** – Accuracy on the graduate-level GPQA science question-answering benchmark.  
- **evaluations.hle** – Accuracy on Humanity's Last Exam (HLE), a benchmark of very difficult expert-level questions.  
- **evaluations.livecodebench** – Coding performance on LiveCodeBench, which uses recently published competitive-programming problems to limit training-data contamination.  
- **evaluations.scicode** – Accuracy on SciCode, a benchmark of coding problems drawn from scientific research.  
- **evaluations.math_500** – Accuracy on MATH-500, a set of 500 diverse math problems.  
- **evaluations.aime** – Score on AIME (American Invitational Mathematics Examination) problems.  
- **evaluations.aime_25** – Score on the AIME 2025 problem set.  
- **evaluations.ifbench** – Accuracy on IFBench, an instruction-following benchmark.  
- **evaluations.lcr** – Score on the LCR (long-context reasoning) benchmark.  
- **evaluations.terminalbench_hard** – Performance on hard terminal-based tasks.  
- **evaluations.tau2** – Score on the τ²-Bench (TAU2) agentic tool-use benchmark.
---

### **Pricing Metrics**  
Prices are in **USD per 1,000,000 tokens**.

- **pricing.price_1m_blended_3_to_1** – Estimated cost assuming a 3:1 input-to-output token ratio.  
- **pricing.price_1m_input_tokens** – Cost per one million input (prompt) tokens.  
- **pricing.price_1m_output_tokens** – Cost per one million output (generated) tokens.
---
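The blended price appears to be a weighted average of the input and output prices with weights 3 and 1. This formula is an assumption inferred from the column name, and the function name below is hypothetical. The sketch checks it against three rows copied from the leaderboard output shown later in this notebook.

```python
def blended_price_3_to_1(input_price, output_price):
    """Weighted mean of per-1M-token prices: 3 parts input to 1 part output."""
    return (3 * input_price + output_price) / 4


# (input price, output price, listed blended price) copied from the notebook output
examples = {
    "GPT-5 nano (minimal)": (0.05, 0.40, 0.138),
    "Qwen3 235B A22B (Reasoning)": (0.70, 8.40, 2.625),
    "Qwen3 Max (Preview)": (1.20, 6.00, 2.400),
}

# Absolute gap between the computed and listed blended price for each model
differences = {
    name: abs(blended_price_3_to_1(inp, out) - listed)
    for name, (inp, out, listed) in examples.items()
}

for name, diff in differences.items():
    print(f"{name}: |computed - listed| = {diff:.4f}")
```

Because the blended column is derived from the other two price columns, using all three as separate outcome variables would not add independent information.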
### **Major Concerns**
A major concern with this dataset is that many benchmark scores are missing for a large number of models. Several evaluation suites—such as AIME, IFBench, LCR, TAU2, and TerminalBench—only test a small subset of high-end or well-known models, meaning mid-tier models appear to have missing values not because they perform poorly, but simply because they were never evaluated. This selective inclusion introduces bias and makes direct comparison across all models difficult. Another issue is that different benchmarks are run by different organizations using slightly different prompting setups, evaluation conditions, or decoding parameters, which reduces cross-benchmark consistency. Additionally, token pricing is influenced by company strategy and business decisions rather than purely by model performance, so price should not be interpreted as an objective reflection of capability. Finally, the dataset only includes models that are publicly listed and evaluated through certain platforms, which excludes many open-source or privately deployed models and may skew the sample toward commercial systems from major AI labs.

---

## **Data Cleaning**

In [1]:
# Load the dataset:
import pandas as pd
leaderboard = pd.read_csv('data/00-raw/llm_performance_leaderboard.csv')

In [2]:
# first few rows of the dataset:
leaderboard.head()

| | id | name | slug | release_date | median_output_tokens_per_second | median_time_to_first_token_seconds | median_time_to_first_answer_token | model_creator.id | model_creator.name | model_creator.slug | ... | evaluations.math_500 | evaluations.aime | evaluations.aime_25 | evaluations.ifbench | evaluations.lcr | evaluations.terminalbench_hard | evaluations.tau2 | pricing.price_1m_blended_3_to_1 | pricing.price_1m_input_tokens | pricing.price_1m_output_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 05e45a36-b5c6-47a1-8adb-9ddc19add5b3 | GPT-5 nano (minimal) | gpt-5-nano-minimal | 2025-08-07 | 0.0 | 0.0 | 0.0 | e67e56e3-15cd-43db-b679-da4660a69f41 | OpenAI | openai | ... | NaN | NaN | 0.273 | 0.325 | 0.2 | 0.064 | 0.257 | 0.138 | 0.05 | 0.4 |
| 1 | 16149b9c-a1e9-4669-a5cb-ff3c00d78f89 | gpt-oss-20B (low) | gpt-oss-20b-low | 2025-08-05 | 194.081 | 0.481 | 10.786 | e67e56e3-15cd-43db-b679-da4660a69f41 | OpenAI | openai | ... | NaN | NaN | 0.623 | 0.578 | 0.31 | 0.043 | 0.503 | 0.094 | 0.06 | 0.2 |
| 2 | f0083258-8646-45b8-8082-7aaf6c2ea82a | gpt-oss-120B (high) | gpt-oss-120b | 2025-08-05 | 364.657 | 0.385 | 5.87 | e67e56e3-15cd-43db-b679-da4660a69f41 | OpenAI | openai | ... | NaN | NaN | 0.934 | 0.69 | 0.507 | 0.22 | 0.658 | 0.263 | 0.15 | 0.6 |
| 3 | c99f3bde-7c08-4de8-bd5c-8ee9123ebffa | gpt-oss-120B (low) | gpt-oss-120b-low | 2025-08-05 | 337.351 | 0.374 | 6.302 | e67e56e3-15cd-43db-b679-da4660a69f41 | OpenAI | openai | ... | NaN | NaN | 0.667 | 0.583 | 0.437 | 0.05 | 0.45 | 0.263 | 0.15 | 0.6 |
| 4 | 36f73aaf-d38a-4b56-a2b3-d04d17186910 | gpt-oss-20B (high) | gpt-oss-20b | 2025-08-05 | 228.811 | 0.396 | 9.136 | e67e56e3-15cd-43db-b679-da4660a69f41 | OpenAI | openai | ... | NaN | NaN | 0.893 | 0.651 | 0.343 | 0.099 | 0.602 | 0.094 | 0.06 | 0.2 |


In [3]:
# the columns:
leaderboard.columns.tolist()

['id',
 'name',
 'slug',
 'release_date',
 'median_output_tokens_per_second',
 'median_time_to_first_token_seconds',
 'median_time_to_first_answer_token',
 'model_creator.id',
 'model_creator.name',
 'model_creator.slug',
 'evaluations.artificial_analysis_intelligence_index',
 'evaluations.artificial_analysis_coding_index',
 'evaluations.artificial_analysis_math_index',
 'evaluations.mmlu_pro',
 'evaluations.gpqa',
 'evaluations.hle',
 'evaluations.livecodebench',
 'evaluations.scicode',
 'evaluations.math_500',
 'evaluations.aime',
 'evaluations.aime_25',
 'evaluations.ifbench',
 'evaluations.lcr',
 'evaluations.terminalbench_hard',
 'evaluations.tau2',
 'pricing.price_1m_blended_3_to_1',
 'pricing.price_1m_input_tokens',
 'pricing.price_1m_output_tokens']

In [4]:
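# remove columns that are not relevant to our research & rename some columns:
# regex=False treats the prefixes as literal text, so '.' is not read as a regex wildcard
leaderboard.columns = leaderboard.columns.str.replace('evaluations.', '', regex=False).str.replace('pricing.', '', regex=False)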
leaderboard = leaderboard[['id', 'name', 'release_date', 'model_creator.name', 'artificial_analysis_intelligence_index',
    'terminalbench_hard', 'tau2', 'lcr', 'hle', 'mmlu_pro', 'gpqa', 'livecodebench', 'scicode', 'ifbench', 'aime_25',
    'price_1m_blended_3_to_1', 'price_1m_input_tokens', 'price_1m_output_tokens']]

leaderboard.head()

| | id | name | release_date | model_creator.name | artificial_analysis_intelligence_index | terminalbench_hard | tau2 | lcr | hle | mmlu_pro | gpqa | livecodebench | scicode | ifbench | aime_25 | price_1m_blended_3_to_1 | price_1m_input_tokens | price_1m_output_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 05e45a36-b5c6-47a1-8adb-9ddc19add5b3 | GPT-5 nano (minimal) | 2025-08-07 | OpenAI | 29.1 | 0.064 | 0.257 | 0.2 | 0.041 | 0.556 | 0.428 | 0.47 | 0.291 | 0.325 | 0.273 | 0.138 | 0.05 | 0.4 |
| 1 | 16149b9c-a1e9-4669-a5cb-ff3c00d78f89 | gpt-oss-20B (low) | 2025-08-05 | OpenAI | 44.3 | 0.043 | 0.503 | 0.31 | 0.051 | 0.718 | 0.611 | 0.652 | 0.34 | 0.578 | 0.623 | 0.094 | 0.06 | 0.2 |
| 2 | f0083258-8646-45b8-8082-7aaf6c2ea82a | gpt-oss-120B (high) | 2025-08-05 | OpenAI | 60.5 | 0.22 | 0.658 | 0.507 | 0.185 | 0.808 | 0.782 | 0.878 | 0.389 | 0.69 | 0.934 | 0.263 | 0.15 | 0.6 |
| 3 | c99f3bde-7c08-4de8-bd5c-8ee9123ebffa | gpt-oss-120B (low) | 2025-08-05 | OpenAI | 47.5 | 0.05 | 0.45 | 0.437 | 0.052 | 0.775 | 0.672 | 0.707 | 0.36 | 0.583 | 0.667 | 0.263 | 0.15 | 0.6 |
| 4 | 36f73aaf-d38a-4b56-a2b3-d04d17186910 | gpt-oss-20B (high) | 2025-08-05 | OpenAI | 52.4 | 0.099 | 0.602 | 0.343 | 0.098 | 0.748 | 0.688 | 0.777 | 0.344 | 0.651 | 0.893 | 0.094 | 0.06 | 0.2 |


In [5]:
# Check whether 'name' is a unique identifier for each row
len(leaderboard['name'].unique()) == leaderboard.shape[0]

True

In [6]:
# since 'name' is a unique identifier for each row, drop the 'id' column and set 'name' as the index:
leaderboard = leaderboard.set_index('name').drop(columns='id')
leaderboard.head()

| name | release_date | model_creator.name | artificial_analysis_intelligence_index | terminalbench_hard | tau2 | lcr | hle | mmlu_pro | gpqa | livecodebench | scicode | ifbench | aime_25 | price_1m_blended_3_to_1 | price_1m_input_tokens | price_1m_output_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 nano (minimal) | 2025-08-07 | OpenAI | 29.1 | 0.064 | 0.257 | 0.2 | 0.041 | 0.556 | 0.428 | 0.47 | 0.291 | 0.325 | 0.273 | 0.138 | 0.05 | 0.4 |
| gpt-oss-20B (low) | 2025-08-05 | OpenAI | 44.3 | 0.043 | 0.503 | 0.31 | 0.051 | 0.718 | 0.611 | 0.652 | 0.34 | 0.578 | 0.623 | 0.094 | 0.06 | 0.2 |
| gpt-oss-120B (high) | 2025-08-05 | OpenAI | 60.5 | 0.22 | 0.658 | 0.507 | 0.185 | 0.808 | 0.782 | 0.878 | 0.389 | 0.69 | 0.934 | 0.263 | 0.15 | 0.6 |
| gpt-oss-120B (low) | 2025-08-05 | OpenAI | 47.5 | 0.05 | 0.45 | 0.437 | 0.052 | 0.775 | 0.672 | 0.707 | 0.36 | 0.583 | 0.667 | 0.263 | 0.15 | 0.6 |
| gpt-oss-20B (high) | 2025-08-05 | OpenAI | 52.4 | 0.099 | 0.602 | 0.343 | 0.098 | 0.748 | 0.688 | 0.777 | 0.344 | 0.651 | 0.893 | 0.094 | 0.06 | 0.2 |


In [7]:
# count the null values in each column:
leaderboard.isnull().sum()

release_date                                0
model_creator.name                          0
artificial_analysis_intelligence_index      3
terminalbench_hard                        134
tau2                                      130
lcr                                       121
hle                                        35
mmlu_pro                                   35
gpqa                                       31
livecodebench                              36
scicode                                    37
ifbench                                   121
aime_25                                   120
price_1m_blended_3_to_1                     0
price_1m_input_tokens                       0
price_1m_output_tokens                      0
dtype: int64

We noticed that some benchmarks have significantly more null entries than others, so we decided to focus on the benchmarks that have comparatively fewer null values:
- hle
- mmlu_pro
- gpqa
- livecodebench
- scicode

Since these benchmarks together still cover the same core competencies (reasoning, coding, and knowledge), dropping the other columns does not compromise the overall coverage of the analysis. At the same time, it allows us to keep more data after dropping rows that contain null values.
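The trade-off described above can be made concrete by counting how many rows survive a complete-case filter under different column subsets. The sketch below uses a toy frame with made-up values; only the column names mirror the notebook, and `rows_retained` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy frame with MADE-UP values; only the column names mirror the notebook
toy = pd.DataFrame({
    "mmlu_pro": [0.70, 0.81, np.nan, 0.55, 0.77],
    "gpqa": [0.61, 0.78, 0.43, np.nan, 0.69],
    "tau2": [np.nan, 0.66, np.nan, np.nan, 0.45],
    "price_1m_blended_3_to_1": [0.09, 0.26, 0.14, 0.40, 2.40],
})


def rows_retained(df, columns):
    """Count rows that have no missing value in any of the given columns."""
    return int(df[columns].notna().all(axis=1).sum())


all_columns = list(toy.columns)
kept_columns = ["mmlu_pro", "gpqa", "price_1m_blended_3_to_1"]  # sparse tau2 dropped

print("rows kept with every column:", rows_retained(toy, all_columns))
print("rows kept without tau2:", rows_retained(toy, kept_columns))
```

Running the same helper on the real `leaderboard` frame before and after removing the sparse benchmarks would show exactly how many models the column choice preserves.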

In [8]:
# drop the other benchmark columns:
leaderboard = leaderboard.drop(columns=['terminalbench_hard', 'tau2', 'lcr', 'ifbench', 'aime_25'])

In [9]:
# drop rows containing any null values:
leaderboard = leaderboard.dropna()

In [10]:
leaderboard.isnull().any()

release_date                              False
model_creator.name                        False
artificial_analysis_intelligence_index    False
hle                                       False
mmlu_pro                                  False
gpqa                                      False
livecodebench                             False
scicode                                   False
price_1m_blended_3_to_1                   False
price_1m_input_tokens                     False
price_1m_output_tokens                    False
dtype: bool

In [11]:
# all numerical variables are float type
leaderboard.dtypes

release_date                               object
model_creator.name                         object
artificial_analysis_intelligence_index    float64
hle                                       float64
mmlu_pro                                  float64
gpqa                                      float64
livecodebench                             float64
scicode                                   float64
price_1m_blended_3_to_1                   float64
price_1m_input_tokens                     float64
price_1m_output_tokens                    float64
dtype: object
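The output above shows that `release_date` is stored as `object` (strings). If release timing is later used, for example to look at whether newer models are more cost-efficient, it would likely need converting to a datetime type. A minimal sketch, assuming the dates follow the `YYYY-MM-DD` format seen in the output; the `release_year` column is a hypothetical addition.

```python
import pandas as pd

# Two rows copied from the notebook output; release_date starts as plain strings
sample = pd.DataFrame(
    {"release_date": ["2025-08-07", "2024-09-19"]},
    index=["GPT-5 nano (minimal)", "Qwen2.5 Instruct 32B"],
)

# Parse the strings into datetime64 values using the format seen in the data
sample["release_date"] = pd.to_datetime(sample["release_date"], format="%Y-%m-%d")

# Hypothetical derived column for grouping models by release year
sample["release_year"] = sample["release_date"].dt.year

print(sample.dtypes)
```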

In [14]:
# summary statistics:
leaderboard.describe()

| | artificial_analysis_intelligence_index | hle | mmlu_pro | gpqa | livecodebench | scicode | price_1m_blended_3_to_1 | price_1m_input_tokens | price_1m_output_tokens |
|---|---|---|---|---|---|---|---|---|---|
| count | 281.0 | 281.0 | 281.0 | 281.0 | 281.0 | 281.0 | 281.0 | 281.0 | 281.0 |
| mean | 31.051601 | 0.066274 | 0.666107 | 0.538922 | 0.39473 | 0.251416 | 1.622495 | 0.868089 | 3.86895 |
| std | 16.495261 | 0.041654 | 0.179972 | 0.185807 | 0.242261 | 0.118668 | 4.07067 | 2.104984 | 10.045099 |
| min | 1.0 | 0.028 | 0.055 | 0.098 | 0.002 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 17.4 | 0.043 | 0.571 | 0.371 | 0.181 | 0.156 | 0.085 | 0.05 | 0.15 |
| 50% | 29.7 | 0.05 | 0.73 | 0.557 | 0.343 | 0.267 | 0.419 | 0.2 | 0.8 |
| 75% | 43.8 | 0.072 | 0.805 | 0.698 | 0.617 | 0.354 | 1.313 | 0.7 | 2.8 |
| max | 68.5 | 0.265 | 0.88 | 0.877 | 0.878 | 0.465 | 30.0 | 15.0 | 75.0 |


In [13]:
leaderboard

| name | release_date | model_creator.name | artificial_analysis_intelligence_index | hle | mmlu_pro | gpqa | livecodebench | scicode | price_1m_blended_3_to_1 | price_1m_input_tokens | price_1m_output_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 nano (minimal) | 2025-08-07 | OpenAI | 29.1 | 0.041 | 0.556 | 0.428 | 0.470 | 0.291 | 0.138 | 0.05 | 0.40 |
| gpt-oss-20B (low) | 2025-08-05 | OpenAI | 44.3 | 0.051 | 0.718 | 0.611 | 0.652 | 0.340 | 0.094 | 0.06 | 0.20 |
| gpt-oss-120B (high) | 2025-08-05 | OpenAI | 60.5 | 0.185 | 0.808 | 0.782 | 0.878 | 0.389 | 0.263 | 0.15 | 0.60 |
| gpt-oss-120B (low) | 2025-08-05 | OpenAI | 47.5 | 0.052 | 0.775 | 0.672 | 0.707 | 0.360 | 0.263 | 0.15 | 0.60 |
| gpt-oss-20B (high) | 2025-08-05 | OpenAI | 52.4 | 0.098 | 0.748 | 0.688 | 0.777 | 0.344 | 0.094 | 0.06 | 0.20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Qwen2.5 Instruct 32B | 2024-09-19 | Alibaba | 22.9 | 0.038 | 0.697 | 0.466 | 0.248 | 0.229 | 0.000 | 0.00 | 0.00 |
| Qwen3 235B A22B (Reasoning) | 2025-04-28 | Alibaba | 41.7 | 0.117 | 0.828 | 0.700 | 0.622 | 0.399 | 2.625 | 0.70 | 8.40 |
| Qwen3 0.6B (Reasoning) | 2025-04-28 | Alibaba | 14.2 | 0.057 | 0.347 | 0.239 | 0.121 | 0.028 | 0.398 | 0.11 | 1.26 |
| Qwen3 Max (Preview) | 2025-09-05 | Alibaba | 48.5 | 0.093 | 0.838 | 0.764 | 0.651 | 0.370 | 2.400 | 1.20 | 6.00 |


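### The data is now **clean** and **tidy**:
- Each variable forms a column
- Each observation forms a row
- There are no missing values
- Units are consistent within each variable: benchmark scores are proportions between 0 and 1, the composite intelligence index uses its own larger point scale, and prices are in USD per 1M tokens
- Irrelevant columns are dropped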
---
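The tidiness claims can be spot-checked programmatically. The sketch below is one possible set of sanity checks, run on three rows copied from the cleaned leaderboard; the specific checks and the variable names are assumptions for illustration. It also flags zero-priced models, which the summary statistics show exist (the minimum price is 0.0) and which matter for any later ratio or log transform.

```python
import pandas as pd

# Three rows copied from the cleaned leaderboard shown in this notebook
check = pd.DataFrame(
    {
        "mmlu_pro": [0.556, 0.697, 0.838],
        "gpqa": [0.428, 0.466, 0.764],
        "price_1m_blended_3_to_1": [0.138, 0.000, 2.400],
    },
    index=["GPT-5 nano (minimal)", "Qwen2.5 Instruct 32B", "Qwen3 Max (Preview)"],
)

score_cols = ["mmlu_pro", "gpqa"]
price_col = "price_1m_blended_3_to_1"

# No missing values anywhere in the frame
no_missing = not check.isnull().values.any()

# Benchmark scores are proportions, so every value should lie in [0, 1]
scores_in_range = bool(((check[score_cols] >= 0) & (check[score_cols] <= 1)).all().all())

# Prices should never be negative
prices_non_negative = bool((check[price_col] >= 0).all())

# Zero-priced models are valid rows but need care in ratios and log transforms
zero_priced = check.index[check[price_col] == 0].tolist()

print(no_missing, scores_in_range, prices_non_negative, zero_priced)
```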

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> The data we are using describes models rather than people: it contains benchmark scores and prices only. Although training and using LLMs involves user data, nothing in this dataset or in our analysis exposes personally identifiable information.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> Yes they are. We accurately display performance comparisons between AI models without manipulating scales or selectively reporting results.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Yes, all data collection methods, prompts, model settings, and evaluation criteria for AI models are clearly recorded. This ensures that if any issues arise in the future, the experiments can be accurately replicated and verified for consistency.


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> Yes, we have taken steps to ensure that neither model’s evaluation relies on variables or proxies that could be unfairly discriminatory. Our comparisons focus only on objective performance metrics such as accuracy, response relevance, and coherence rather than user demographics or sensitive attributes.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> Yes, we carefully considered the effects of optimizing for our chosen metrics. While we prioritized metrics such as accuracy, response quality, and coherence, we also reviewed other factors like bias, response diversity, and factual consistency.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Yes, both models’ decisions and outputs can be explained in understandable terms when justification is needed. For example, we analyze responses by looking at the prompt structure, model training objectives, and language generation patterns to clarify why each model produced a specific answer. This approach helps interpret differences in reasoning, tone, or factual accuracy between different AI models.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

1. Use Discord for communication, with an expected response time of 1 hour; meet briefly after lectures.
2. Be polite and respectful.
3. Make decisions collectively.
4. Specialize in tasks.
5. Plan ahead and avoid last-minute work, since it greatly increases the chance of something going wrong.
6. If someone doesn't do their part, the rest of the team will need to complete the unfinished work in time to turn it in. The person at fault may be reported.
## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/4  |  2 PM | Searched for large public datasets on LLM performance and pricing; reviewed project prompt and brainstormed research ideas.  | Met to choose the Hugging Face Open LLM Leaderboard + provider pricing pages as our main data sources and discussed the direction of our project.| 
| 11/11  |  2 PM |  Downloaded and uploaded the leaderboard data, did a first round of data cleaning (selecting variables, renaming columns, checking missingness), and redrafted proposal sections. | Met to merge and edit the proposal based on feedback, confirm that the cleaned dataset fits our research question, and submit the second data checkpoint. | 
| 11/18  | 2 PM  | EDA; Begin Analysis. Build a processed dataset with one row per model and run exploratory data analysis (summary stats, basic plots of performance and token cost). | Meet to review EDA and visualizations, decide how to construct composite performance scores, and agree on the main modeling approach.   |
| 11/25  | 2 PM  | Finalize checkpoint 02. Run baseline models (correlation and simple regressions of performance vs. token cost, including domain-specific analyses) and create initial draft plots. | Discuss details to edit before submission. Short check-in meeting to interpret early results, decide which relationships to focus on and possible correlations, and adjust the analysis plan if needed. |
| 12/2  | 2 PM  | Finalize the full set of analyses and produce polished figures and tables showing the relationship between benchmark performance and token pricing. Start drafting up conclusion and findings of study. | Discuss feedback from TA. Meeting focused on outlining the final report sections (Intro, Methods, Results, Discussion, Conclusion).|
| 12/9  | 2 PM  | Complete full drafts of all report sections and ensure code + visualizations are fully reproducible from the cleaned dataset. | Finalize project, format checking. Meet to do group edits, tighten the narrative so it clearly answers the research question, finalize formatting and citations, and submit the final project|