# COGS 108 - Project Proposal

## Authors

- Keilani Li: Conceptualization, Writing - original draft, Experimental investigation
- Bryan Lu: Writing - review & editing, Experimental investigation, Methodology
- Hazel Liang: Writing - original draft, Visualization, Data curation, Formal analysis
- Lillian Tran: Data curation, Visualization, Software, Analysis
- Luis Quintanilla: Project administration, Software, Writing - review & editing

## Research Question

How do headshot percentage and agent pick rates relate to team win rates across regions in the 2024 and 2025 VCT playoffs? 

We will approach our research question using variables such as agent pick rates, match results, region labels, and headshot percentages gathered from players who participated in VCT playoffs for both years.

## Background and Prior Work

### Introduction

Developed and released in 2020 by Riot Games, VALORANT is a free-to-play 5v5 tactical shooter. As of February 2026, it has 28 agents divided into four roles: duelist, initiator, controller, and sentinel. Players are split into two teams of five, and each player selects an “agent” before the game starts. Each agent has a unique set of abilities. The goal of the game is to be the first team to win 13 rounds by attacking or defending with gunplay and abilities.

The VALORANT Champions Tour (VCT) is an annual tournament circuit hosted by Riot Games. It features four regional leagues: Americas, EMEA, Pacific, and China. VCT events are divided into regional leagues and international events. Regional leagues refer to the four regions mentioned above, while international events refer to Masters and Champions. In other words, international events bring together the top teams from the regional leagues, and the winning team is considered the best in the world for that year. Since there are many tournament levels and stages, this project focuses on VCT Challengers playoff games from 2024 and 2025. We leave out earlier stages because the playoffs feature the top teams.

With this in mind, we want to explore how headshot percentages and agent pick rates relate to win rates across regions.

### Prior Work

One study related to our topic is Valorant Analysis: A Comparative Study of Machine Learning Models<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). As the title suggests, the author compared the performance of different machine learning models trained on Valorant data. More specifically, they compared these models to determine which is best at revealing underlying patterns and predicting outcomes in a "high-stakes gaming environment." The author used a Kaggle dataset on a player's first 1000 Valorant games (not their own games). They performed basic EDA and created data visualizations to understand patterns across variables such as agent pick distributions, rank distribution, map frequency, wins/losses, and KDA (kills, deaths, assists) metrics. Afterwards, they trained an XGB (eXtreme Gradient Boosting) model and an RF (Random Forest) model. When comparing the two models on several metrics (accuracy, precision, recall, F1-score, class 0 performance, weighted average), they found that the RF model performed slightly better in terms of precision and F1-score.

Another study, titled Valorant VCT Challengers 2024 Analysis, has actually been done on a dataset we may consider using<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). The author first preprocessed the data by renaming columns, converting data types, and handling missing values. Going into EDA, they looked into region and agent distribution, and then they visualized selected columns as a density histogram. Afterward, they looked into the top 1% of players with the highest rating in the tournament and top 1% of agents played.

The last study we were able to find related to our topic is the VCT 2024 Agent Composition Analysis<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). This analysis was done on a different dataset that we are also considering for this project. The author preprocessed their data through feature engineering and column renaming, then did some basic exploration of descriptive summary statistics: number of matches played per tournament, stage distribution, match type distribution, and number of matches played per map. They also visualized win rates by agent pick and by map. They then applied a regression model to see whether meta agent picks have an impact on win probability, and a prediction model to test whether map and agent pick influence win probability. Lastly, they created a recommendation list of agents based on win rate and pick rate for each map, which was very interesting to look at.

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) https://medium.com/@sanketangchekar/valorant-analysis-a-comparative-study-of-machine-learning-models-fa3f2866335e
2. <a name="cite_note-2"></a> [^](#cite_ref-2) https://www.kaggle.com/code/blackdragonk333/valorant-vct-challengers-2024-analysis-eda
3. <a name="cite_note-3"></a> [^](#cite_ref-3) https://www.kaggle.com/code/setsuna00/vct-2024-agent-composition-analysis

## Hypothesis


We hypothesize that teams with higher average headshot percentages will have higher win rates in VCT playoffs.

From a viewer's perspective, each region has a different playstyle. Regions like EMEA may play very strategically, while other regions prefer to ‘run it down,’ meaning their playstyles are more aggressive. This potentially means that, regardless of the current game meta, each region will select different agents and guns to suit its playstyle. With these differences in mind, headshot percentages and agent pick rates are likely to vary between regions.
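
One way the first part of this hypothesis could be checked is with a correlation between team-level headshot percentage and win rate. The sketch below is only an illustration under stated assumptions: the team names and numbers are made up, and the `pearson_r` helper is our own, written with the standard library.

```python
from math import sqrt

# Hypothetical team-level summaries: (team, region, mean headshot %, win rate).
# These numbers are invented for illustration and are not VCT results.
teams = [
    ("Team A", "Americas", 24.0, 0.50),
    ("Team B", "EMEA", 26.0, 0.55),
    ("Team C", "Pacific", 28.0, 0.65),
    ("Team D", "China", 30.0, 0.70),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


headshot = [t[2] for t in teams]
win_rate = [t[3] for t in teams]
r = pearson_r(headshot, win_rate)
print(f"Pearson r between headshot % and win rate: {r:.3f}")
```

A clearly positive `r` on the real data would be consistent with the hypothesis, although a correlation alone would not show that higher headshot percentages cause wins.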


## Data

The ideal dataset would need the following variables to answer our question:
- headshot percentage
- region labels
- agent pick rates
- match outcome
- map played

It would be best for our data to be organized in tabular format and stored as a CSV file and, more importantly, to come from source(s) with publicly available esports stats. To support reasonable analyses, we think at least 1000 VCT playoff observations would be sufficient.

This Kaggle dataset, [Valorant Champion Tour 2021-2025](https://www.kaggle.com/datasets/ryanluong1/valorant-champion-tour-2021-2023-data?select=vct_2025), is one we are strongly considering, as it may provide all the information we need to carry out this project. The dataset was collected by web scraping [vlr.gg](https://www.vlr.gg/), a website that covers the esports side of VALORANT. Besides esports schedules, forums, and news, the site also records data from real games (match results, team stats), which is what we are looking for. Additionally, since the dataset's author collected data from multiple years of VCT, it can also be used to look at how trends have changed over time.

As this dataset has many different folders (``agents``, ``ids``, ``matches``, ``players_stats``), we may need to do a lot of data merging between files. Some of the variables in this dataset we may use include:
- ``Pick Rate``
- ``Headshot %``
- ``Average Combat Score``
- ``Outcome`` (win/loss of a round)
- ``Match Result`` (win/loss of the overall game)
- ``Stage``
- ``Teams``
- ``Agents``
- ``Map`` (since map pool changes per year)
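
The merging step mentioned above could look roughly like the following inner join between player stats and match results. This is a minimal sketch, not the dataset's actual layout: the rows are invented, and the join keys `match_id` and `Team` (as well as the `Player` column) are hypothetical names that would need to be checked against the real files.

```python
# Hypothetical rows standing in for two files from different folders.
# "match_id", "Team", and "Player" are assumed column names, not confirmed
# against the actual dataset.
player_stats = [
    {"match_id": 1, "Team": "Team A", "Player": "p1", "Headshot %": 25.0},
    {"match_id": 1, "Team": "Team A", "Player": "p2", "Headshot %": 31.0},
    {"match_id": 1, "Team": "Team B", "Player": "p3", "Headshot %": 22.0},
    {"match_id": 2, "Team": "Team C", "Player": "p4", "Headshot %": 27.0},
]
match_results = [
    {"match_id": 1, "Team": "Team A", "Match Result": "win"},
    {"match_id": 1, "Team": "Team B", "Match Result": "loss"},
]


def inner_join(left, right, keys):
    """Join two lists of dicts on the given key columns (inner join)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined


merged = inner_join(player_stats, match_results, keys=("match_id", "Team"))
# This count is only meaningful because each key appears once in match_results.
unmatched = len(player_stats) - len(merged)
print(f"{len(merged)} rows merged, {unmatched} player rows had no match result")
```

Counting the rows that fail to match is a cheap way to notice inconsistent team names or IDs before they silently shrink the dataset.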

Another potential Kaggle dataset we can use is titled [Valorant Champions Tour 2024](https://www.kaggle.com/datasets/sauurabhkr/valorant-champions-tour-2024). The dataset would be good for looking into individual player performance across 3 different VCT tiers (VCT International, VCT Challengers, and VCT Game Changers) and comparing statistics across competitive regions. Should we decide to focus on only one year's worth of VCT data, we may consider using this dataset. However, we would also have to tweak our research question slightly, since this dataset mainly includes player statistics from the 3 tiers and says less about the overall outcome of each match (i.e., which team won and by how many rounds). While the dataset is publicly available on Kaggle, its page does not state where the data was collected from. The data is stored as 3 separate JSON files, which should not be a problem to import since the pandas library can read JSON as well as CSV. We would most likely use the following variables for analysis:
- ``region``
- ``headshot_percentage``
- ``agent``
- ``playerCategory`` (which VCT tier they belong to)
- ``team``
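
As a rough illustration of how these variables might be used, the sketch below parses JSON records, keeps one tier, and averages ``headshot_percentage`` by ``region``. The layout of the real files is not documented, so the flat list of records, the tier label strings, and all values here are assumptions.

```python
import json
from collections import defaultdict

# Hypothetical JSON content. A flat list of player records and the tier
# labels used in "playerCategory" are assumptions about the real files.
raw = """
[
  {"team": "Team A", "region": "Americas", "agent": "Jett",
   "playerCategory": "VCT International", "headshot_percentage": 28.0},
  {"team": "Team A", "region": "Americas", "agent": "Omen",
   "playerCategory": "VCT International", "headshot_percentage": 22.0},
  {"team": "Team B", "region": "EMEA", "agent": "Sova",
   "playerCategory": "VCT International", "headshot_percentage": 25.0},
  {"team": "Team C", "region": "EMEA", "agent": "Jett",
   "playerCategory": "VCT Challengers", "headshot_percentage": 30.0}
]
"""
records = json.loads(raw)

# Keep a single tier so regions are compared on the same level of play.
tier = "VCT International"
by_region = defaultdict(list)
for rec in records:
    if rec["playerCategory"] == tier:
        by_region[rec["region"]].append(rec["headshot_percentage"])

mean_hs = {region: sum(vals) / len(vals) for region, vals in by_region.items()}
for region, value in sorted(mean_hs.items()):
    print(f"{region}: mean headshot % = {value:.1f}")
```

With the real files, `json.load` on an open file handle (or `pandas.read_json`) would replace the inline string.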


The last Kaggle dataset we considered is the [Valorant VCT Champions 2025 Dataset](https://www.kaggle.com/datasets/kierru/valorant-vct-champions-2025-dataset). Like the first dataset, its data was collected from [vlr.gg](https://www.vlr.gg/); like the second, it covers only a single year (VCT 2025) rather than multiple years. This dataset has multiple CSV files that support different analytic approaches, such as esports analytics, performance analysis, or match outcome prediction. Some variables that could be used for our analysis include:
- ``team_id``, ``name`` (of team)
- ``match_id``
- ``agent``
- ``game_id``
- ``team_1_wl`` (whether a team won/lost)
- ``team_1_ct``, ``team_1_t``, ``team_1_ot``
- ``country``
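
Agent pick rate is not necessarily a ready-made column, so it may need to be derived from per-player rows. Below is a minimal sketch that defines pick rate as the share of team lineups (one team in one game) that include a given agent. That definition, the two-player toy lineups, and the sample rows are our assumptions; only the column names ``game_id``, ``team_id``, and ``agent`` come from the list above.

```python
from collections import Counter

# Hypothetical rows, one per player per game. Lineups are shortened to two
# players each to keep the example small.
picks = [
    {"game_id": 1, "team_id": 10, "agent": "Jett"},
    {"game_id": 1, "team_id": 10, "agent": "Omen"},
    {"game_id": 1, "team_id": 20, "agent": "Jett"},
    {"game_id": 1, "team_id": 20, "agent": "Sova"},
    {"game_id": 2, "team_id": 10, "agent": "Omen"},
    {"game_id": 2, "team_id": 10, "agent": "Killjoy"},
]

# A "composition" is one team's lineup in one game.
compositions = {(p["game_id"], p["team_id"]) for p in picks}

# Count each agent at most once per composition, even if rows are duplicated.
unique_picks = {(p["game_id"], p["team_id"], p["agent"]) for p in picks}
agent_counts = Counter(agent for _, _, agent in unique_picks)

# Pick rate = compositions containing the agent / all compositions.
pick_rate = {agent: n / len(compositions) for agent, n in agent_counts.items()}
for agent, rate in sorted(pick_rate.items()):
    print(f"{agent}: {rate:.2f}")
```

Grouping the same calculation by region would give the per-region pick rates that our research question asks about.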

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

<!-- > Example of how to use the checkbox, and also of how you can put in a short paragraph that discusses the way this checklist item affects your project.  Remove this paragraph and the X in the checkbox before you fill this out for your project -->

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

 > On the Kaggle page for the first dataset we listed, the author states that some of the China stats are missing because Chinese-hosted events do not have APIs available for post-game stats. This may lead to underrepresentation of Chinese teams and bias our findings towards patterns from the other regions. To account for this limitation, we will avoid making strong claims about China where data is missing.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

 > While the dataset contains publicly available professional player names, we will not link player profiles to private identities, as these are not needed for our analyses. We will focus on teams as a whole rather than looking at each player individually.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

 > From the Kaggle website of the first dataset we listed, the author has also noted that the dataset contains missing values, placeholder team names, and inconsistencies such as two teams having the same team ID. As mentioned earlier, China has missing statistics as APIs are not publicly available. We plan to document any missing data we find and exclude rows where missing values are present.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

 > We plan to clearly label visualizations and document any uncertainties we come across to avoid strong conclusions. For example, if there are regional differences between the Americas and China, we will make sure to clarify whether the statement was based on complete or incomplete data.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

 > When performing data analysis and cleaning, we will make sure our code is well-documented so that others can easily follow along and reproduce our results. For example, since we will be looking at VCT playoff games for the years 2024 and 2025, we will clearly state these filtering decisions.
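
The commitments in C.2 and C.5 (stating filtering decisions, documenting missing data, and excluding incomplete rows) could be sketched as follows. The rows are invented, and the `Year` and `Region` column names and the `"Playoffs"` stage label are assumptions that would need to match the real dataset.

```python
from collections import Counter

# Filtering decisions stated once, as named constants. The stage label and
# the "Year" / "Region" column names are assumed, not taken from the dataset.
YEARS = {2024, 2025}
STAGE = "Playoffs"
REQUIRED = ("Headshot %", "Match Result")

rows = [
    {"Year": 2024, "Stage": "Playoffs", "Region": "Americas",
     "Headshot %": 27.0, "Match Result": "win"},
    {"Year": 2025, "Stage": "Playoffs", "Region": "China",
     "Headshot %": None, "Match Result": "loss"},
    {"Year": 2025, "Stage": "Playoffs", "Region": "EMEA",
     "Headshot %": 24.0, "Match Result": "loss"},
    {"Year": 2023, "Stage": "Playoffs", "Region": "Pacific",
     "Headshot %": 26.0, "Match Result": "win"},
    {"Year": 2024, "Stage": "Group Stage", "Region": "EMEA",
     "Headshot %": 23.0, "Match Result": "win"},
]

# Step 1: keep only 2024 and 2025 playoff rows.
in_scope = [r for r in rows if r["Year"] in YEARS and r["Stage"] == STAGE]

# Step 2: record, per region, how many in-scope rows lack a required value.
missing_by_region = Counter(
    r["Region"] for r in in_scope if any(r[c] is None for c in REQUIRED)
)

# Step 3: exclude the incomplete rows from the analysis set.
clean = [r for r in in_scope if all(r[c] is not None for c in REQUIRED)]

print(f"in scope: {len(in_scope)}, kept: {len(clean)}")
print(f"rows dropped for missing values, by region: {dict(missing_by_region)}")
```

Reporting the dropped rows by region makes it visible when one region, such as China, loses a disproportionate share of its data.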

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

- Communication
    - Discord will be our main form of communication, and we will meet every week, either in person or online
    - If you're struggling with a task, don't hesitate to ask others for help
- Tone
    - Be respectful, patient, and friendly towards each other
- Decision making
    - If pushing to the main branch, announce it to the group first just in case anyone else is working on it at the same time
- Tasks
    - Work will be discussed and divided equally among members
    - Everyone is expected to contribute equally to coding, writing, etc., and to meet deadlines on time
    - Set deadline reminders in accordance with our timeline
- Conflicts
    - We will first discuss issues calmly within the group, and if they cannot be resolved, we will ask the professor or TAs for help

 

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/1  | 5 PM  | Brainstorm & search for potential datasets  | Draft project proposal, assign roles   |
| 2/8  | 5 PM  | Do background research on topic   | Discuss data wrangling approaches |
| 2/15  | 5 PM  | Import & wrangle data, EDA | Discuss analysis plan |
| 2/22  | 5 PM  | Finalize wrangling/EDA, begin analysis | Discuss & revise analysis if needed |
| 3/1   | 5 PM  | Finish analysis; draft results, conclusion, discussion | Revise results, conclusion, & discussion if needed |
| 3/8   | 5 PM  | Complete analysis; finish drafting results, conclusion, and discussion | Finalize group project, split presentation slides across team members |
| 3/16  | Before 11:59 PM  | NA | Turn in final project, video presentation, & group project surveys |