# COGS 108 - Project Proposal

## Authors

- Danielle Trinh: Writing – original draft, Writing – review & editing, Conceptualization, Data curation
- Kayne Maniti: Writing – original draft, Writing – review & editing, Conceptualization, Data curation
- Dana Wei: Writing – original draft, Writing – review & editing, Background research
- Max Kim: Writing – original draft, Writing – review & editing, Background research
- Agnes Tang: Writing – original draft, Writing – review & editing, Methodology

## Research Question

Do regions systematically differ in what they buy (genres/platforms), and has that gap widened or shrunk over time?

We found a dataset covering North America, Europe, and Japan from 1995–2016. We will use `Year` and `Genre` together with regional sales (`NA_Sales`, `EU_Sales`, `JP_Sales`) to compute each region's genre market share per year (genre sales ÷ total regional sales). We will then run a statistical test comparing the genre-share distributions across regions, along with a simple trend test or regression of the "region difference" over time.
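
The share calculation and the "region difference" trend described above could be sketched as follows. This is a minimal illustration, not our final method: the toy rows are invented, total variation distance is only one assumed way to measure the gap between two regions, and the helper names `genre_shares` and `region_gap` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows invented purely for illustration (sales in millions of units).
toy = pd.DataFrame({
    "Year":     [2000, 2000, 2000, 2001, 2001, 2001],
    "Genre":    ["Action", "Role-Playing", "Sports"] * 2,
    "NA_Sales": [5.0, 1.0, 4.0, 6.0, 1.0, 3.0],
    "EU_Sales": [3.0, 1.0, 2.0, 3.0, 1.0, 1.0],
    "JP_Sales": [1.0, 3.0, 1.0, 1.0, 2.0, 1.0],
})

REGIONS = ["NA_Sales", "EU_Sales", "JP_Sales"]


def genre_shares(df):
    """Return each region's genre market share per year (index: Year, Genre)."""
    totals = df.groupby(["Year", "Genre"])[REGIONS].sum()
    # Divide by the yearly regional total so shares sum to 1 within each year.
    return totals / totals.groupby(level="Year").transform("sum")


def region_gap(shares, a, b):
    """Total variation distance between two regions' genre shares, per year."""
    return 0.5 * (shares[a] - shares[b]).abs().groupby(level="Year").sum()


shares = genre_shares(toy)
gap = region_gap(shares, "NA_Sales", "JP_Sales")

# Slope of a least-squares line through the yearly gap: a positive slope
# would suggest divergence, a negative slope convergence.
slope = np.polyfit(gap.index.to_numpy(dtype=float), gap.to_numpy(), 1)[0]

print(shares.round(3))
print(gap.round(3))
print(round(slope, 3))
```

Working with shares rather than raw sales keeps the comparison independent of market size, which matters because North America sells far more units overall than Japan. A formal inference step (for example, a permutation test on the gap) would still be needed on the real data.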

## Background and Prior Work

The video game industry is a global market in which a variety of games are produced, distributed, and consumed across many different cultural and economic contexts. Questions have emerged about whether consumer preferences are shared worldwide or whether they remain shaped by regional markets. In particular, it is unclear whether players in different regions increasingly converge on similar genres and platforms as the market globalizes, or whether long‑standing regional tastes, platform ecosystems, and local industry structures continue to generate distinct patterns of demand. This tension between global convergence and regional differentiation motivates a closer empirical examination of how video game purchases vary across regions and how those patterns evolve over time.

Prior work in the video game industry has highlighted substantial variation in commercial performance across titles, even within the same release year. Building on this observation, researchers have identified product characteristics, most notably platform and genre, as central determinants of sales outcomes. Studies using global sales rankings show that both platform and genre exert significant influence on overall units sold, suggesting that hardware ecosystems and content categories shape how games compete in the marketplace (Wang, 2025). In addition to these product‑level attributes, regional sales variables have been shown to explain further variation in performance, indicating that demand conditions differ meaningfully across geographic markets. These findings motivate a more region-sensitive approach to studying video game sales, in which regional market characteristics are treated as integral components of explanatory models rather than as residual or secondary factors.

Additionally, regional variation in video game consumption is not only cross‑sectional but also temporally persistent. Studies of long‑run sales patterns have found that, while some genres attain widespread popularity, regional markets still tend to maintain distinct preferences that reflect cultural tastes and local industry structures (Zhang, 2021). This suggests that differences in the composition of regional game libraries are the result of relatively stable underlying demand conditions, rather than short‑term trends or one‑off blockbuster releases. In other words, regions do not simply converge on the same set of titles. Instead, they continue to favor different mixes of genres and platforms over time, indicating that regional identity and market context play a lasting role in shaping what players choose to buy.

“Predictive Analytics for Video Game Sales” (Rifaz, 2025) primarily focuses on predictive modeling of sales outcomes rather than on comparative analysis of regional preferences. This prior work shows that global sales outcomes can be predicted with very high accuracy using regional sales data, indicating that regional markets exhibit stable and structured sales patterns that consistently explain global success. However, it treats regions mainly as inputs to prediction rather than as objects of comparison. In particular, it does not examine whether regions systematically differ in their genre preferences after accounting for total market size, nor does it analyze how those differences change over time. This project builds on that analytical foundation but reframes the problem toward comparative and longitudinal analysis, focusing on genre market share by region and year rather than on sales prediction.

Prior work has also identified gaming platforms as a key structural factor influencing global video game sales patterns. In "The impact of platform on global video game sales," Babb, Terry, and Dana (2013) examined worldwide video game sales across major platforms between 2006 and 2011 and found significant differences in sales performance attributable to platform ecosystems rather than individual game attributes alone. Their study showed that Nintendo platforms, particularly the Wii and Nintendo DS, consistently outperformed competing consoles globally, while Microsoft and Sony platforms occupied lower sales tiers. These findings suggest that platform strategy, vertical integration, and regional platform adoption play a central role in shaping observed sales distributions. This work provides an important foundation for our analysis of regional sales differences, particularly in understanding why Japan’s sales distribution diverges from those of North America and Europe, where non-Nintendo platforms are more dominant.

Wang, Z. (2025). Factors Affecting Global Video Game Sales Rankings: Analyzing the Impact of Platform, Genre, and Market Regions. American Journal of Student Research. https://doi.org/10.70251/hyjr2348.337280

Zhang, X. (2021). Evaluation of Game Publishers and Study on the Player Preference Based on the Data from 1980 to 2020 in Video Game Market. 2021 the 8th International Conference on Industrial Engineering and Applications (Europe). https://doi.org/10.1145/3463858.3463870

Rifaz, F. (2025). Predictive Analytics for Video Game Sales [Kaggle Notebook]. Kaggle. https://www.kaggle.com/code/faryalrifaz3374/predictive-analytics-for-video-game-sales/notebook

Babb, J., Terry, N., & Dana, K. (2013). The impact of platform on global video game sales. International Business & Economics Research Journal, 12(10), 1241–1248.


## Hypothesis


We hypothesize that regions differ systematically in the genres and platforms of the games they buy, and that the gap between regions will remain unchanged over time. Wang's (2025) analysis of the factors affecting global video game sales shows that North America and Europe make up the majority of global sales, while Japan and other regions make up a smaller percentage despite a positive trend in their sales. Zhang's (2021) study of player preference also reports a major regional difference in genres: players in North America and Europe favor action games, while Japanese players prefer more moderate genres such as role-playing. We expect the sales gap between regions to persist based on the predictive analysis created by Faryal Rifaz and edited by Muhammad Tufail, in which Platform, Shooting, and Role-playing are the top three genres sold, a mix that favors the North American and European regions.

## Data


1. Describe the **ideal** dataset you would want to answer this question. This should include:
   1. What variables? 
       * It would need precise variables such as: Game Title, Genre, Platform, Publisher, Release Date (Month/Year), and disaggregated region-specific Sales Figures. Ideally, it would also distinguish between Digital and Physical sales, as digital adoption varies by region.
   2. How many observations are needed? 
       * The ideal dataset would have tens of thousands of observations covering games from many different niches, not just the popular ones, to ensure representative samples for each region.
   3. Who/what/how would these data be collected? 
       * Ideally, these data would be collected from the sales records of major retailers and from digital storefront APIs (platforms like Steam, the PlayStation Store, and the Nintendo eShop), then aggregated to ensure consistent reporting standards.
   4. How would these data be stored/organized?
       * Preferably as a CSV or another tabular storage format, with each row representing a unique game title and columns describing the game's metadata and its sales figures in each region.
2. Search for potential **real** datasets that could provide you with something useful for this project.  For each dataset that you find write 3-5 sentences describing 
   1. where the data is located (e.g., URL) and anything you need to do to use it (e.g., ask for permission, fill out an application)
   2. what the important variables are in this dataset that you might use
   
   * Dataset 1: Video Game Sales (by anandshaw2001)
       - This dataset can be found at [this url](https://www.kaggle.com/datasets/anandshaw2001/video-game-sales). It contains 16,598 game titles covering the period from 1995–2016 and is free to download in CSV format with no extra steps. Some key features that we can use are `Year`, `Genre`, and regional sales columns such as `NA_Sales`, `JP_Sales`, and `EU_Sales`. Preprocessing may be needed because some data is missing; for example, 278 values are missing from `Year`.
   * Dataset 2: Steam Store Data (Digital/PC Market)
       - This dataset can be found at [this url](https://www.kaggle.com/datasets/fronkongames/steam-games-dataset) on Kaggle. The data was taken directly from the SteamSpy and Steam Store APIs and is free to download in CSV format with no extra permissions (pulling directly from the API instead would require an API key). Some important variables are genre, release date, price, and owners (an estimated range of units sold). To answer our regional question, we may need to use a proxy such as “supported_languages”.
   * Dataset 3: Google Play Store Data
       - This dataset can be found at [this url](https://www.kaggle.com/datasets/lava18/google-play-store-apps) on Kaggle and can be downloaded for free with no extra permissions in the CSV format. Some important variables are genre, installs (a proxy for sales), price, and content rating. This dataset could prove useful in representing the mobile gaming platform when answering our question.
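
As a rough sketch of the preprocessing mentioned for Dataset 1, the snippet below drops rows with a missing `Year` and keeps the 1995–2016 window. The in-memory CSV is an invented stand-in for the real download, the `Name` column and the helper `load_sales` are assumptions made for illustration, and the real file may need additional cleaning.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV; rows and values are invented for illustration.
# With the real file, pass its path to load_sales instead of this buffer.
csv_text = io.StringIO(
    "Name,Year,Genre,NA_Sales,EU_Sales,JP_Sales\n"
    "Game A,2006,Sports,1.0,0.5,0.2\n"
    "Game B,,Action,0.3,0.1,0.0\n"
    "Game C,1989,Puzzle,0.4,0.2,0.6\n"
    "Game D,2010,Role-Playing,0.2,0.1,0.9\n"
)


def load_sales(source, first_year=1995, last_year=2016):
    """Load the sales table, drop incomplete rows, and keep the study window."""
    df = pd.read_csv(source)
    # Rows without a year or genre cannot be placed in a per-year genre share.
    df = df.dropna(subset=["Year", "Genre"])
    # Year is read as float when values are missing; convert once they are gone.
    df = df.assign(Year=df["Year"].astype(int))
    in_window = df["Year"].between(first_year, last_year)
    return df.loc[in_window].reset_index(drop=True)


sales = load_sales(csv_text)
print(sales)
```

Dropping rows is the simplest option; if the missing years turn out to be concentrated in one region or genre, we would need to reconsider, since that could bias the shares.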
  

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The video game data collected in our source dataset may be biased or incomplete, so we can cross-reference a couple of other datasets to check for missing information.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

* COMMUNICATE! Keep the group updated on any issues you encounter or ideas you come up with
* When you agree to something, get it done
* Be chill and have fun

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 01/26  |  15:30 | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 02/02  | 15:30  | Do background research on topic  | Discuss ideal dataset(s) and ethics; draft project proposal   |
| 02/09  | 15:30  | Individual preparations | Data checkpoint work session   |
| 02/16  | 15:30  | Complete data checkpoint | Discuss EDA checkpoint |
| 02/23  | 15:30  | Progress on respective sections | Complete EDA checkpoint |
| 03/02  | 15:30  | Progress on respective sections | Final project work session |
| 03/09  | 15:30  | Finishing touches on final project | Submit final project + video |