# COGS 108 - Project Proposal

## Authors

- James Qian: Data curation, Writing
- Even Wu: Background research, Writing
- Matthew Odom: Conceptualization, Background research, Writing
- Aston Martini-Facio: Methodology, Writing



## Research Question

This project investigates the relationship between large language model (LLM) performance and token cost. Specifically, it asks: Which publicly available AI models achieve the highest overall benchmark performance, and is that performance proportional to their token costs? To answer this, we will compare models ranked at the top of public leaderboards across multiple evaluation domains (reasoning, coding, math, etc.) with their published pricing structures. We will then fit a predictive model using benchmark scores as independent variables and token cost as the dependent variable, in order to test whether performance can reliably explain or predict pricing.

Subquestions might include: 

- Which models deliver the best overall performance across benchmarks?

- Which models provide the most cost-efficient performance?
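The two subquestions can be made concrete with a small sketch. It assumes each model is summarized by one average benchmark score and one blended price per million tokens. The `ModelRecord` class, the field names, and all numbers are hypothetical placeholders, not real benchmark or pricing data.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    name: str
    avg_score: float         # mean benchmark accuracy on a 0-100 scale (hypothetical)
    usd_per_m_tokens: float  # blended price per million tokens in USD (hypothetical)


def rank_by_efficiency(records):
    """Sort models by benchmark points per dollar, highest first."""
    return sorted(
        records,
        key=lambda r: r.avg_score / r.usd_per_m_tokens,
        reverse=True,
    )


# Placeholder values for illustration only.
records = [
    ModelRecord("model_a", 90.0, 10.0),
    ModelRecord("model_b", 80.0, 2.0),
    ModelRecord("model_c", 60.0, 0.5),
]

best_overall = max(records, key=lambda r: r.avg_score)
most_efficient = rank_by_efficiency(records)[0]
print(best_overall.name, most_efficient.name)
```

With these placeholder values the best overall model and the most cost-efficient model differ, which is the kind of gap the subquestions are meant to surface.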

## Background and Prior Work

Large language models (LLMs) are typically evaluated using standardized benchmarks that measure reasoning, coding, mathematics, and general knowledge. Widely used tasks include MMLU for academic knowledge, GSM8K for multi-step math reasoning, and HumanEval for code generation. These benchmarks provide a quantitative basis for comparing model performance, giving us an accuracy score that can be standardized across different models. Besides static benchmarks, there are also preference-based evaluations such as the LMSYS Chatbot Arena, which uses large-scale human comparisons to capture perceived quality across diverse prompts, offering a more holistic view of model performance.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Alongside performance data, providers publish detailed token pricing, typically expressed as cost per million input and output tokens. OpenAI, Google, and other providers differentiate their models by capability tier, context window, and modality, creating a transparent but complex pricing structure.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) While pricing is public, it is not always clear whether higher benchmark performance directly explains higher costs. Some analyses suggest that cost-efficiency varies significantly across models, with smaller or open-source systems sometimes offering better performance-per-dollar on specific tasks.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)
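Because input and output tokens are priced separately, a single "token cost" per model has to be derived. The sketch below shows one possible way to do that. The function names, the prices, and the 75% input share are assumptions chosen for illustration, not published figures.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m):
    """Cost of one request when prices are quoted per million tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


def blended_price_per_m(input_price_per_m, output_price_per_m, input_share=0.75):
    """Single price per million tokens, weighting input and output prices.

    input_share is an assumed fraction of traffic that is input tokens.
    """
    return input_share * input_price_per_m + (1 - input_share) * output_price_per_m


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
cost = request_cost_usd(2_000, 500, 2.0, 8.0)
blended = blended_price_per_m(2.0, 8.0)
print(cost, blended)
```

Whatever weighting we end up using, applying the same one to every model keeps the cost variable comparable across providers.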

A simple Google search surfaces this summary of two top models: “ChatGpt is better for general users seeking feature rich experience while deepseek is superior in technical tasks…” (Google 2025). Testing artificial intelligence across a variety of categories produces metrics that can serve as a foundation for determining what is “best”. Topics that have helped analyze artificial intelligence in the past include: model, operationalization, proxies, ground truth, bias, machine learning, power, discrimination, agency, privacy and harms, data brokers, consent, Goodhart's law, accountability, and standpoint epistemology. This vocabulary lets us talk about something as intangible and circumstantial as the “quality” of AI. Other dimensions to consider are speed, coding success, accuracy, creativity, multimodality, user experience, and cost. For example, if ChatGPT is functionally superior but takes all of your data and personal information, and even the project you used it for, is it “better”?

*Past work examples:*

In this project<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) the author explores a similar question: “who wins the battle of wits?”, ChatGPT or DeepSeek. Uniquely, the author had the models themselves do most of the actual work. One example is asking each model what to name the article:

- ChatGPT: GPT vs. DeepSeek: Which AI Wins for Technical Writing?
- DeepSeek: GPT vs Deepseek: Which AI is Your Ultimate Wingman for Technical Writing?

Which is better? Having each model reveal itself by carrying out the test is a clever, self-referential way to run the comparison. Could we design a test in which the AI is evaluated on evaluating itself? Devising such a test may require a great deal of planning, and a traditional method may produce clearer results.

Another example is an ongoing test of the top AI models in financial trading. Starting October 17, 2025, each of the top AI models was given 10,000 dollars to trade with. Since then, ChatGPT has dropped into the bottom group and DeepSeek has risen to the top.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Because we are seeking information on the quality of the AI models available on the market today, we will cross-reference each model's token pricing with its results on benchmark performance tests like these.

Prior work has highlighted both the strengths and limitations of current evaluation practices. The Hugging Face Open LLM Leaderboard aggregates multiple benchmarks into composite rankings, displaying trade-offs between accuracy and efficiency.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Surveys of evaluation methods emphasize the need for multidimensional comparisons, noting that no single benchmark captures the full spectrum of model capabilities.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Building on these foundations, our project will investigate whether benchmark performance, measured as accuracy score, can reliably predict token pricing.

**References:**

1. <a name="cite_note-1"></a> [^](#cite_ref-1)Lianmin Zheng, Ying Sheng, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. "LMSYS Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings", 03 May, 2023. https://lmsys.org/blog/2023-05-03-arena/

2. <a name="cite_note-2"></a> [^](#cite_ref-2)Hugging Face Open LLM Leaderboard. "Leaderboards on the Hub" https://huggingface.co/docs/leaderboards/en/leaderboards/intro

3. <a name="cite_note-3"></a> [^](#cite_ref-3)OpenAI API pricing. https://openai.com/api/pricing

4. <a name="cite_note-4"></a> [^](#cite_ref-4)“Ai Trading Benchmark.” Alpha Arena, Accessed 29 Oct. 2025. https://nof1.ai/

5. <a name="cite_note-5"></a> [^](#cite_ref-5)“Deepseek vs Chatgpt: Which Ai Is Right for You?” InvoZone, 19 Oct, 2025. https://invozone.com/blog/deepseek-vs-chatgpt


## Hypothesis


**Null Hypothesis:** There is no significant relationship between an LLM's benchmark accuracy score and its token cost. Improvements in performance do not systematically result in higher or lower token cost.
- Benchmark scores and pricing are shaped by different forces. As noted in the background, public pricing is transparent but complex, and smaller or open models can be more cost-efficient on specific tasks, which suggests that performance alone may not systematically dictate token cost.

**Alternative Hypothesis:** An LLM's benchmark accuracy is significantly related to token cost, either positively (higher performance comes at a higher cost) or non-linearly (diminishing returns).
- Providers often price higher-capability tiers above lower ones, and aggregate benchmarks tend to correlate with perceived quality, which can command premium pricing. However, we also expect non-linearity: gains at the top end may incur disproportionately higher costs due to scaling, yielding diminishing returns.


## Data

1. We are investigating the relationship between LLM performance and token cost. We therefore need two kinds of data: evaluation results for LLMs, ideally comprehensive and covering many criteria, and token costs, which are relatively simple to collect.
   Variables: model name, performance scores (math, coding, literature, image recognition, etc.), weighted score, and token cost. We will collect these data from datasets available online.
   We need as many observations as possible to reduce potential bias.
   James is responsible for this part and will look for suitable datasets on sites such as Kaggle and Hugging Face. The data will be stored and organized on GitHub.

2. For example, https://www.vellum.ai/open-llm-leaderboard?utm_source=www.vellum.ai&utm_medium=referral provides a decent "open model comparison." It lists different LLMs and their performance on tests such as the AIME math competition. It is a small dataset, so we will also look for larger ones.
   https://nof1.ai/ offers an interesting performance criterion: an LLM's ability to make money. This is just one example, but we want detailed performance data of this kind.
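Since benchmark scores and prices will come from different sources, they will need to be joined on model name. Below is a minimal sketch of that step. The column names (`aime`, `coding`, `usd_per_m_input`, `usd_per_m_output`), the normalization rule, and all values are hypothetical. Real model names will likely need more careful matching.

```python
def normalize_name(name):
    """Lowercase and hyphenate a model name so both sources share one key."""
    return "-".join(name.strip().lower().replace("_", " ").split())


def merge_scores_and_prices(score_rows, price_rows):
    """Inner-join benchmark rows with pricing rows on the normalized name.

    Returns (merged_rows, unmatched_names) so dropped models stay visible.
    """
    prices = {normalize_name(row["model"]): row for row in price_rows}
    merged, unmatched = [], []
    for row in score_rows:
        key = normalize_name(row["model"])
        price = prices.get(key)
        if price is None:
            unmatched.append(row["model"])
            continue
        merged.append({
            "model": key,
            "aime": row["aime"],
            "coding": row["coding"],
            "usd_per_m_input": price["usd_per_m_input"],
            "usd_per_m_output": price["usd_per_m_output"],
        })
    return merged, unmatched


# Placeholder rows for illustration only.
score_rows = [
    {"model": "Model A", "aime": 80.0, "coding": 70.0},
    {"model": "model_b", "aime": 55.0, "coding": 62.0},
    {"model": "Model C", "aime": 40.0, "coding": 51.0},
]
price_rows = [
    {"model": "model-a", "usd_per_m_input": 3.0, "usd_per_m_output": 12.0},
    {"model": "Model B", "usd_per_m_input": 0.5, "usd_per_m_output": 1.5},
]

merged, unmatched = merge_scores_and_prices(score_rows, price_rows)
print(len(merged), unmatched)
```

Tracking the unmatched names matters here because models without public pricing would otherwise drop out of the analysis silently, which is one of the collection biases we want to watch for.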
    

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> The data we are using is primarily anonymous. Although training and using LLMs involves user data, the benchmark and pricing data needed for our question do not expose any personally identifiable information.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> Yes they are. We accurately display performance comparisons between AI models without manipulating scales or selectively reporting results.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Yes, all data collection methods, prompts, model settings, and evaluation criteria for AI models are clearly recorded. This ensures that if any issues arise in the future, the experiments can be accurately replicated and verified for consistency.


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> Yes, we have taken steps to ensure that neither model’s evaluation relies on variables or proxies that could be unfairly discriminatory. Our comparisons focus only on objective performance metrics such as accuracy, response relevance, and coherence rather than user demographics or sensitive attributes.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> Yes, we carefully considered the effects of optimizing for our chosen metrics. While metrics such as accuracy, response quality, and coherence were prioritized, we also reviewed other factors like bias, response diversity, and factual consistency.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Yes, both models’ decisions and outputs can be explained in understandable terms when justification is needed. For example, we analyze responses by looking at the prompt structure, model training objectives, and language generation patterns to clarify why each model produced a specific answer. This approach helps interpret differences in reasoning, tone, or factual accuracy between different AI models.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

1. Use Discord for communication, with an expected response time of 1 hour; meet briefly after lectures.
2. Be polite and respectful.
3. Make decisions collectively.
4. Specialize in tasks.
5. Plan ahead and avoid last-minute work, since it greatly increases the chance of something going wrong.
6. If someone doesn't do their part, the rest of the team will need to complete the unfinished work in time to turn it in. The person at fault may be reported.

## Project Timeline Proposal




| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 11/4 | 2 PM | Find relevant datasets | Discuss which datasets to use and how to distribute the workload |
| 11/11 | 2 PM | Have a first draft of each member's assigned section | Discuss details to edit before submission |
| 11/18 | 2 PM | EDA; begin analysis | Discuss possible analytical approaches; assign group members to lead each specific part |
| 11/25 | 2 PM | TBD | TBD |
| 12/2 | 2 PM | TBD | TBD |
| 12/9 | 2 PM | TBD | TBD |