# COGS 108 - Project Proposal

## Authors

- Kenric Hoang: Conceptualization, Data curation, Writing - original draft
- Deborah Kim: Conceptualization, Writing - original draft
- Andy Tang: Methodology, Software, Visualization, Writing - review & editing
- Aidan Tjon: Software, Project Admin, Investigation, Writing - original draft

## Research Question

Do performance gains on standardized LLM benchmarks exhibit diminishing returns relative to estimated inference cost as models scale up? Performance will be measured using scores from benchmarks such as MMLU and GSM8K. Environmental cost will be measured through available proxies for computational power, such as model parameter count and estimated FLOPs per token. This project combines statistical inference with exploratory modeling to characterize the relationship between performance metrics and increasing compute costs. By examining this relationship, we can determine whether marginal increases in performance require a disproportionate increase in computational, and thus environmental, cost.

## Background and Prior Work

Even as generative AI has seen rapid, mainstream adoption, a fundamental problem in the race to train the latest and greatest LLMs has been the sustainability of training and running each model. Before a model can respond to an end user's query, it must first be trained on enormous text datasets (and, for the most recent multimodal models, on inputs such as images and video as well). Processing the query with the model's inference engine then requires further computational power to piece together a response for the user. These training data sources have been growing exponentially, requiring similar exponential growth in computational power.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

The computational complexity of each step is, unfortunately, hidden from the end user. Each generative AI company’s flagship model exists within a black box, where the computational power required to answer a query is unknown. Inference time has been the most common metric<a name="cite_ref-2"></a>[<sup>2, </sup>](#cite_note-2)<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) for estimating environmental impact, but the transformer models that underlie all LLMs are highly parallelizable, so spreading a query across multiple servers obscures the real computational time required. HELM, a framework for LLM comparison created by the Center for Research on Foundation Models at Stanford, attempts to normalize the metric by running each model on the same hardware.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

This computational complexity has real impacts. Server farms come at a massive cost, not only in power but also in water use for cooling, the cost of manufacturing the hardware that goes into them, the proper disposal of e-waste at the end of that hardware's useful lifecycle (shortened by intense usage), and, in some cases, noise pollution for nearby residential areas when facilities are improperly placed. Moreover, the rapid adoption of LLMs has led to a development style focused on output performance that runs counter to long-term sustainability<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1), both in terms of environmental impact and the longevity of several AI firms.

Using the data we have access to, we want to put forward some rough ideas for how model performance relates to the environmental cost of each model. We understand that this analysis is limited by the fact that power and computational cost are not a priority for those who train frontier models.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Wu, C. et al. (Jan 2022) Sustainable AI: Environmental Implications, Challenges and Opportunities. *Proceedings of Machine Learning and Systems 4.* https://arxiv.org/abs/2111.00364
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Liang, P. et al. (Aug 2023) Holistic Evaluation of Language Models. *Transactions on Machine Learning Research.* https://arxiv.org/pdf/2211.09110
3. <a name="cite_note-3"></a> [^](#cite_ref-3) huggingface.co (Date Unknown) Open LLM Leaderboard. *Huggingface.co.* https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

## Hypothesis


Do performance gains on standardized LLM benchmarks exhibit diminishing returns in terms of estimated inference cost, when scaled up? 

We predict a positive relationship between inference cost, in terms of parameter count and FLOPs per token, and model performance, as evaluated by benchmark scores. As model scale increases, benchmark scores will keep improving, but the performance gain per unit of the inference cost proxy will decrease. Specifically, we predict that models under 70B parameters will show linear performance gains, while models exceeding this size will show significant diminishing returns.
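
One way to make "performance gain per unit of inference cost proxy" concrete is to sort models by parameter count and compute the score gained per additional billion parameters between neighbors. The sketch below is only an illustration of that idea: the function name `marginal_gains` is hypothetical, the numbers are made up and are not leaderboard data, and the real analysis may use a different cost proxy or a fitted curve instead.

```python
def marginal_gains(models):
    """Score gained per extra billion parameters between size-adjacent models.

    `models` is a list of (params_in_billions, benchmark_score) pairs.
    Returns a list of (larger_model_params, gain_per_billion) pairs.
    """
    ordered = sorted(models)
    gains = []
    for (p0, s0), (p1, s1) in zip(ordered, ordered[1:]):
        if p1 == p0:
            # Two models of identical size give no cost difference to divide by.
            continue
        gains.append((p1, (s1 - s0) / (p1 - p0)))
    return gains


# Made-up illustrative values, NOT real benchmark results.
toy_models = [(7, 45.0), (13, 52.0), (34, 60.0), (70, 66.0), (180, 70.0)]
gains = marginal_gains(toy_models)

for params_b, gain in gains:
    print(f"up to {params_b}B params: {gain:.3f} points per extra billion parameters")
```

If the hypothesis holds, the gain values should shrink as parameter count grows, with a visible drop past the 70B mark.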

As the amount of training data increases, the quality of the output for a natural language query can improve only to a certain extent. An end user can only receive the same answer in so many ways, and optimizing for accuracy and avoiding hallucinations sees diminishing returns. Training these models to answer the same queries better leads to a Pareto-principle-like outcome, where perfection becomes the enemy of the good.

## Data

### Ideal Dataset

The ideal dataset for answering this question would include verified training energy in joules, exact carbon intensity in CO2 per kWh, scores on multiple benchmarks that evaluate each model’s accuracy (MMLU, MMLU Pro, MATH, etc.), and the floating point precision used.

Verified training energy would give us direct hardware-level power measurements taken during the training run, rather than rough estimates derived from thermal design power (TDP). This would let us see potential diminishing returns between energy consumption and model performance. Exact carbon intensity would give us real-time grid emissions for the data center's location at the time of training. Training a model in an area that relies on fossil fuels for energy has a larger carbon footprint than training it in an area that uses renewable energy. Multiple benchmarks are ideal because a model that performs well across several different benchmarks should be considered more accurate than one that performs extremely well on only one. The floating point precision used during training and inference matters because higher precision makes each computation more expensive, resulting in more energy use.

We ideally would require over 1000 unique models that span the full spectrum of parameter sizes (7 billion to 1 trillion parameters) to ensure statistical power. Crucially, we need at least 50 observations for models greater than 70 billion parameters to test the frontier models effectively. 

Ideally, data would be collected through API calls that interact directly with the GPUs of major labs (OpenAI, Meta, Google) to log power draw and carbon intensity in real time, avoiding a confounded comparison.

The data would be stored in a centralized SQL Database. One table would hold immutable Model Metadata (Name, Precision, etc.), linked via a primary key (model_id) to a time-series table of energy logs and a results table for benchmark performance.
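
The three-table layout described above could look roughly like the following sketch, which uses Python's built-in `sqlite3` module. Only `model_id`, the model name, and precision come from the description; every other table and column name here (`energy_logs`, `benchmark_results`, `params_b`, `energy_j`, and so on) is a hypothetical placeholder. Note that SQL alone does not make the metadata table immutable; that would have to be enforced by convention or triggers.

```python
import sqlite3

# Hypothetical schema: one metadata table, linked by model_id to a
# time-series table of energy logs and a table of benchmark results.
SCHEMA = """
CREATE TABLE model_metadata (
    model_id        INTEGER PRIMARY KEY,
    name            TEXT NOT NULL UNIQUE,
    float_precision TEXT,
    params_b        REAL
);
CREATE TABLE energy_logs (
    log_id           INTEGER PRIMARY KEY,
    model_id         INTEGER NOT NULL REFERENCES model_metadata(model_id),
    logged_at        TEXT NOT NULL,   -- ISO 8601 timestamp
    energy_j         REAL NOT NULL,   -- energy used in this interval, joules
    carbon_intensity REAL             -- grid CO2 per kWh at logging time
);
CREATE TABLE benchmark_results (
    result_id INTEGER PRIMARY KEY,
    model_id  INTEGER NOT NULL REFERENCES model_metadata(model_id),
    benchmark TEXT NOT NULL,
    score     REAL NOT NULL
);
"""


def create_database(path=":memory:"):
    """Open a SQLite database and create the three linked tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


# Toy rows for illustration only, not real measurements.
conn = create_database()
model_id = conn.execute(
    "INSERT INTO model_metadata (name, float_precision, params_b) VALUES (?, ?, ?)",
    ("toy-model-7b", "bfloat16", 7.0),
).lastrowid
conn.executemany(
    "INSERT INTO energy_logs (model_id, logged_at, energy_j) VALUES (?, ?, ?)",
    [(model_id, "2026-01-01T00:00:00Z", 1500.0),
     (model_id, "2026-01-01T00:01:00Z", 2500.0)],
)
conn.execute(
    "INSERT INTO benchmark_results (model_id, benchmark, score) VALUES (?, ?, ?)",
    (model_id, "MMLU", 45.0),
)

# Total logged energy next to each benchmark score, joined on model_id.
rows = conn.execute("""
    SELECT m.name, r.benchmark, r.score, SUM(e.energy_j) AS total_energy_j
    FROM model_metadata AS m
    JOIN benchmark_results AS r ON r.model_id = m.model_id
    JOIN energy_logs AS e ON e.model_id = m.model_id
    GROUP BY m.model_id, r.result_id
""").fetchall()
```

Keeping energy logs in their own table means a model can accumulate any number of readings without duplicating its metadata, and the join above recovers one row per model and benchmark.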

### Real Datasets

**Dataset 1: Hugging Face Open LLM Leaderboard**
URL: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/ 

The data is publicly hosted and we can access it directly via the Python datasets library, without special permission. However, we must comply with their citation policy.

*Important Variables:*

- Co2_cost: The reported carbon emissions in kg, which serve as our ground-truth environmental impact metric
- Model_precision: The floating point precision (bit depth), which allows us to test whether lower precision improves carbon efficiency
- Mmlu_pro_normalized: A measure of reasoning accuracy for each model

**Dataset 2: Epoch AI Models Dataset**
URL: https://epoch.ai/data/ai-models

The dataset is available as a direct CSV download (all_ai_models.csv). It is licensed under CC-BY 4.0, requiring attribution, but there is no application process.

*Important Variables:*

- Training compute (FLOP): The theoretical computational work of training, from which we can estimate the amount of CO2 emitted in training a given model
- Training hardware: Identifies the GPU type, which we can use to control for hardware-efficiency confounders
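
As a rough sketch of how training compute and training hardware might be turned into a CO2 estimate, the chain below goes from FLOP to GPU-seconds to energy to emissions. Everything beyond the two variables above is an assumption introduced for illustration: the function name, the utilization factor, the per-GPU power draw, the data center overhead factor (PUE), and the grid carbon intensity are hypothetical inputs that the datasets do not necessarily provide.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def estimate_training_co2_kg(training_flop, peak_flop_per_s, utilization,
                             gpu_power_w, pue, grid_kg_co2_per_kwh):
    """Back-of-the-envelope CO2 estimate (kg) for a training run.

    training_flop       -- total training compute, in FLOP
    peak_flop_per_s     -- peak throughput of one GPU (assumed from hardware type)
    utilization         -- assumed fraction of peak actually achieved, in (0, 1]
    gpu_power_w         -- assumed average power draw of one GPU, in watts
    pue                 -- assumed data center power usage effectiveness (>= 1)
    grid_kg_co2_per_kwh -- assumed grid carbon intensity
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    gpu_seconds = training_flop / (peak_flop_per_s * utilization)
    energy_kwh = gpu_seconds * gpu_power_w * pue / JOULES_PER_KWH
    return energy_kwh * grid_kg_co2_per_kwh


# Hypothetical inputs chosen only to exercise the function.
example_kg = estimate_training_co2_kg(
    training_flop=1e21,
    peak_flop_per_s=1e14,
    utilization=0.4,
    gpu_power_w=400.0,
    pue=1.2,
    grid_kg_co2_per_kwh=0.4,
)
print(f"Estimated emissions: {example_kg:.0f} kg CO2")
```

Because every factor after `training_flop` is an assumption, estimates like this are best used to compare models under the same assumptions rather than as absolute figures.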

**Dataset 3: Stanford HELM (Holistic Evaluation of Language Models)**
URL: https://github.com/stanford-crfm/helm

We can download the structured JSON files via wget or Python requests. There is no login required.

*Important Variables:*

- Denoised Inference Runtime: Provides a normalized operating cost (speed) for each model, allowing us to contrast the amount of carbon it takes to train these models with inference latency (actually using the models)
- Model Name: String identifier that allows us to cross-match data with Hugging Face and Epoch AI
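
Model names are unlikely to be spelled identically across the three sources, so cross-matching will probably need some normalization first. The following is a minimal sketch of one possible approach, assuming names differ only in case, organization prefix, and punctuation. The helper names `normalize_model_name` and `cross_match` and the example records are hypothetical; real names may also need a hand-built alias table.

```python
import re


def normalize_model_name(name):
    """Lowercase, drop any 'org/' prefix, and collapse punctuation to hyphens."""
    base = name.strip().lower().split("/")[-1]
    return re.sub(r"[^a-z0-9]+", "-", base).strip("-")


def cross_match(left, right):
    """Inner-join two {raw_name: record} dicts on the normalized model name."""
    right_by_key = {normalize_model_name(k): v for k, v in right.items()}
    matched = {}
    for raw_name, record in left.items():
        key = normalize_model_name(raw_name)
        if key in right_by_key:
            matched[key] = (record, right_by_key[key])
    return matched


# Made-up records standing in for rows from two different sources.
leaderboard_rows = {
    "example-org/Toy_Model-7B": {"co2_kg": 1.2},
    "example-org/Unmatched-13B": {"co2_kg": 2.5},
}
compute_rows = {"Toy Model 7B": {"training_flop": 1e21}}

matched = cross_match(leaderboard_rows, compute_rows)
print(sorted(matched))
```

Models that fail to match are silently dropped by this inner join, so it would be worth logging the unmatched names and checking them by hand before analysis.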


### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Data on the computational cost of generative AI has been kept under lock and key. The data we use comes from independent groups running and testing LLMs outside of these black boxes, and it will carry remnants of each source's testing environment. Because running large models on local hardware is a limiting factor for us, we rely on dataset providers who run the models on their own hardware, which may introduce biases from their data center setups.

 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

 > CRFM HELM explicitly asks: “...you do not reveal examples from this dataset in plain text or images online, to minimize the risk of these instances being included in foundation model training corpora.” We will follow these principles in our data management, ensuring that the data we use does not enter the sample pool and dilute future LLM comparisons.

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Upstream confounding variables (for example, inference time as a metric of computational cost can be confounded by parallelization) will need to be filtered or cleaned based on how our sources normalize their data and generate their metrics. The sources we have found so far have each independently calculated a CO2 cost for each model, and we will need to take their assumptions into account and build upon them to compare models by environmental impact. In addition, our data sources are few and far between, as is the nature of analyzing frontier models. We do not have the computational power to run our own inference, so we will depend on the sources we can find.

 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

 > Our analysis will be based purely on the data we have. The analysis will be shaped by the method we use to clean the data; however, we plan to keep it straightforward and grounded in the data we collect.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 
 > Comparing language models purely on environmental cost would, at best, understate the scope at which language models have been deployed. Our report's suggestions will be inherently skewed by analyzing model performance relative to computational cost.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We are not building any sort of ML model from our inputs and outputs. We are analyzing the data we obtain and communicating results alongside the standards used to obtain them.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will communicate the limitations of our analysis within our report. Notable limitations we face are the use of proxies for the environmental impact of LLMs. 

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

- Respond in the group chat within 24 hours.
- If you cannot make a meeting or deadline, tell the group early.
- Every task is assigned a clear owner and due time.
- Complete your part of the work before our weekly meeting to keep the work flowing.
- Be respectful in feedback; frame disagreements as a suggestion plus a reason.
- Stay proactive. Our team meetings are on Thursdays, so if an assignment is due before then, work on it early.
- Ask questions rather than keeping them in your head.


## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 01/29  |  18:00  | Introductions, communicate meeting times and expectations; Brainstorm topics / questions related to AI | Completed Project Review Assignment | 
| 02/02  |  16:00 |  Look for applicable datasets to narrow our topic selection; Brainstorming specific AI-related question  | Looking for datasets, discussing how to narrow our question based on the Data, plan meeting to finish 00-ProjectProposal | 
| 02/04  | 19:00  | Edit, finalize Project Proposal  |  **Complete Checkpoint #0, Project Proposal**; Group Check-In  |
| 02/05  | 18:00  | Get data from epoch.ai, CRFM HELM, et al. | Discuss data format, cleaning, important data metrics; Discuss programs and packages used for analysis, analysis plan  |
| 02/12  | 18:00  | Work on Data Wrangling / Cleaning | **Complete Checkpoint #1 Data**; Discuss Analysis plans / outputs and Datavis |
| 02/19  | 18:00  | Work on Data Wrangling -> Analysis | Group Check-In; Discuss Analysis |
| 02/26  | 18:00  | Work on Data Analysis -> DataVis | **Complete Checkpoint #2: EDA**; Group Check-In |
| 03/05  | 18:00  | Work on Datavis -> Final Report + Video | Group Check-In; Discuss Final Submission |
| 03/1x  | TBD  | Work on Datavis -> Final Report + Video |  *Week 10, meeting TBD (finals)*; Group Check-In  |
| 03/1x  | TBD | Finalize Project Submission | Turn in Final Project |