# COGS 108 - EDA Checkpoint

## Authors

- Brian Liu: Conceptualization, Analysis, Writing – original draft
- Anchita Dash:  Data curation,  Experimental investigation, Writing – original draft
- Zihan Zhang:  Project administration, Visualization, Writing – original draft
- Ariane Hai: Background research, Methodology, Writing – original draft

# Research Question

Using statistical inference, which quantitative game statistics (such as total ratings, install milestones, and growth rates) are the strongest predictors of a game's rank within the Google Play Store's genre-specific top 100 charts? Furthermore, does the relative importance of these features vary significantly across game genres?

**Target variable**   
- **rank** (ordinal/numerical: the game’s rank (1–100) in the genre’s top 100 list) 

**Predictor variables** 
- **total ratings** (numerical: game’s total number of ratings) 
- **installs** (ordinal/numerical: game’s approximate install milestone—e.g., 100.0 M installs vs. 500.0 M installs) 
- **average rating** (numerical: average rating out of 5)  
- **growth (30 days)** (numerical: percent growth in 30 days)  
- **growth (60 days)** (numerical: percent growth in 60 days) 
- **price** (numerical: price in dollars)   
- **5 star ratings** (numerical: number of 5 star ratings) 

**Grouping variable**
- **category** (nominal: genre of the game) 
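
Because **installs** is recorded as a milestone label rather than a raw count, it has to be converted to a number before it can be used as a predictor. The sketch below shows one way to do that. It assumes the labels look like `100.0 M` (a number, a space, and a `k`/`M`/`B` suffix), which should be checked against the raw file; `parse_installs` is a hypothetical helper name, not something from the dataset or the course modules.

```python
# Assumed label format: "<number> <suffix>", e.g. "100.0 M" or "500.0 k".
# Both the format and the helper name are illustrative assumptions.
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_installs(label):
    """Convert an install milestone label such as '100.0 M' to a float count."""
    parts = label.replace("installs", "").split()
    if not parts or len(parts) > 2:
        raise ValueError(f"Unrecognized install label: {label!r}")
    value = float(parts[0])
    if len(parts) == 2:
        suffix = parts[1].upper()  # accept both 'k' and 'K'
        if suffix not in SUFFIX_MULTIPLIERS:
            raise ValueError(f"Unknown suffix in install label: {label!r}")
        value *= SUFFIX_MULTIPLIERS[suffix]
    return value


example_labels = ["100.0 M", "500.0 M installs", "500.0 k"]
parsed = [parse_installs(label) for label in example_labels]
```

The parsed values are still milestones, so they take only a handful of distinct levels and are better treated as ordinal than as exact install counts.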

## Background and Prior Work

The Google Play Store is the top distributor of downloadable content for Android devices. Run by Google as Android's official app store, it offers apps, games, books, movies, TV, and other media<sup><a href="#ref1">1</a></sup>. As of 2024, the Google Play Store offered approximately 264,000 mobile games spanning genres such as Action, Puzzle, and RPG<sup><a href="#ref2">2</a></sup>. Within this ecosystem, top chart rankings play an important role in determining which games users actually see and download. Higher-ranked games receive more visibility in lists, search results, and curated sections, which in turn increases the likelihood that users will click through and install them<sup><a href="#ref3">3</a></sup>. However, since Google does not disclose the full algorithm behind its rankings, it remains unclear which measurable performance metrics, such as install volume, rating averages, or review counts, most strongly predict a game's chart position<sup><a href="#ref3">3</a></sup>.

From our own experience browsing the Play Store, we almost always check the star rating and the number of reviews before downloading a game. An app with a high average rating and thousands of reviews feels more trustworthy than one with barely any feedback. We therefore hypothesize that review count and average star rating will be the features most strongly correlated with a game's ranking. With hundreds of thousands of games competing for attention on the Play Store, identifying the factors most closely tied to ranking would help developers focus their resources on those areas instead of guessing what might work. Because Google does not publish how its charts are computed, we seek to identify which quantitative signals in the Play Store data correlate most closely with ranking, thereby shedding light on the factors that likely drive discoverability in this marketplace. The "Top Games on Google Play Store" dataset provides rankings for the top 100 games in each of several genres, allowing us to analyze whether the relative importance of these predictive factors varies significantly between categories such as Action, Puzzle, and RPG.

Prior studies indicate that publishers' prior release volume, intra-genre ranking, consumer ratings, and review volume jointly drive high levels of game downloads, with the influence of any single information cue depending on which other cues are present<sup><a href="#ref4">4</a></sup>. Complementary research using feature selection based on the Technology Acceptance Model (TAM) and Innovation Diffusion Theory (IDT) further shows that mobile game downloads are shaped by both usability perceptions, such as perceived ease of use and usefulness, and diffusion-related factors, including visibility, trialability, and social communication channels<sup><a href="#ref5">5</a></sup>. Together, these findings suggest that download performance emerges from interacting quantitative and contextual signals, a premise that directly informs our investigation of how such signals translate into genre-specific ranking outcomes on the Google Play Store and similar app distribution platforms.

The Google Play Store project<sup><a href="#ref6">6</a></sup> offers several visualizations that we could adapt. Its bar plot of the number of paid and free apps per category is a useful way to see whether category affects ranking. The project also compares ratings between paid and free apps, an approach we could extend to examine how pricing relates to ranking. Its correlation matrix is relevant as well: because we want to find which features are most strongly associated with rank, a matrix of correlations between rank and the other features summarizes this well. From the App Store project<sup><a href="#ref7">7</a></sup>, we could borrow the idea of a preprocessing pipeline, so that any new, uncleaned data used for prediction passes through the same cleaning and preprocessing steps as our original data and is therefore fit for modeling. That project also trains several ML models (linear regression, SVM regression, polynomial regression, and others) and compares their RMSE. Doing the same would let us see how each model performs on the validation set and then use the best one on the test set.
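
The model comparison described above (fit several regressors, score each by RMSE on a held-out validation set, keep the best) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses synthetic data and NumPy polynomial fits of different degrees as stand-ins for the regressors in the referenced notebook, and `rmse` and `compare_polynomial_fits` are hypothetical helper names.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def compare_polynomial_fits(x_train, y_train, x_val, y_val, degrees=(1, 2, 3)):
    """Fit one polynomial per degree on the training split; return validation RMSE per degree."""
    scores = {}
    for degree in degrees:
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        scores[degree] = rmse(y_val, np.polyval(coeffs, x_val))
    return scores


# Synthetic stand-in data (NOT the project dataset): a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)

# Simple train/validation split; the points are already in random order.
x_train, x_val = x[:150], x[150:]
y_train, y_val = y[:150], y[150:]

scores = compare_polynomial_fits(x_train, y_train, x_val, y_val)
best_degree = min(scores, key=scores.get)  # lowest validation RMSE wins
```

Choosing the model on the validation split and reporting its error on a separate test split keeps the final estimate from being optimistically biased by the selection step.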

**References**

<a name="ref1"></a> ^ UW Connect. What is Google Play Store. [https://uwconnect.uw.edu/it?id=kb_article_view&sysparm_article=KB0034369](https://uwconnect.uw.edu/it?id=kb_article_view&sysparm_article=KB0034369)

<a name="ref2"></a> ^ Statista. (2024). Number of available gaming apps in the Google Play Store. [https://www.statista.com/statistics/780229/number-of-available-gaming-apps-in-the-google-play-store-quarter/](https://www.statista.com/statistics/780229/number-of-available-gaming-apps-in-the-google-play-store-quarter/)

<a name="ref3"></a> ^ Appinventiv. How rankings are determined in Google Play Store. [https://appinventiv.com/blog/google-play-store-statistics/](https://appinventiv.com/blog/google-play-store-statistics/)

<a name="ref4"></a> ^ Emerald Insight. (2022). The influence of information configuration on mobile game download. [https://www.emerald.com/intr/article-abstract/32/4/1191/176295/The-influence-of-information-configuration-on](https://www.emerald.com/intr/article-abstract/32/4/1191/176295/The-influence-of-information-configuration-on)

<a name="ref5"></a> ^ Springer. A Study of Downloading Game Applications. [https://link.springer.com/chapter/10.1007/978-3-662-47200-2_90](https://link.springer.com/chapter/10.1007/978-3-662-47200-2_90)

<a name="ref6"></a> ^ Alhajali, A. N. Google Play Store App Dataset Analysis. Kaggle. [https://www.kaggle.com/code/ammarnassanalhajali/google-play-store-app-dataset-analysis/notebook](https://www.kaggle.com/code/ammarnassanalhajali/google-play-store-app-dataset-analysis/notebook)

<a name="ref7"></a> ^ Nish, A. Analysis of Apple's App Store. Kaggle. [https://www.kaggle.com/code/avnishnish/analysis-of-apple-s-app-store](https://www.kaggle.com/code/avnishnish/analysis-of-apple-s-app-store)


# Hypothesis


We hypothesize that total ratings and average rating will be the most statistically significant predictors of chart ranking. Intuitively, these are the signals users check before deciding to download a game, so they seem the most pertinent to rank. We also anticipate that the relative weight of these features will vary by genre, with total ratings dominating mass-appeal categories like Action, while average rating holds greater predictive power in strategy-focused genres where quality drives retention.
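
One simple first check of this hypothesis is a rank correlation between each feature and chart rank, computed separately for each genre. The sketch below computes Spearman's correlation as a Pearson correlation on ranks using pandas. The column names (`rank`, `category`, `total ratings`, `average rating`) follow the variable list in this report and are assumptions about the processed file; the toy rows are made up for illustration; `spearman_by_genre` is a hypothetical helper.

```python
import pandas as pd


def spearman_by_genre(df, features, target="rank", group="category"):
    """Return a genre-by-feature table of Spearman correlations with the target."""
    rows = {}
    for genre, sub in df.groupby(group):
        target_ranks = sub[target].rank()
        # Spearman's rho equals Pearson's r computed on (average) ranks.
        rows[genre] = {f: sub[f].rank().corr(target_ranks) for f in features}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy data for illustration only (not values from the dataset).
toy = pd.DataFrame({
    "category": ["Action"] * 4 + ["Puzzle"] * 4,
    "rank": [1, 2, 3, 4, 1, 2, 3, 4],
    "total ratings": [900, 700, 400, 100, 50, 80, 60, 90],
    "average rating": [4.1, 4.5, 4.0, 4.3, 4.8, 4.6, 4.4, 4.2],
})

table = spearman_by_genre(toy, ["total ratings", "average rating"])
```

Since rank 1 is the best position, a negative correlation means that larger feature values go with better chart positions. Comparing the rows of `table` across genres gives a first look at whether feature importance differs by category.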

## Data

### Data overview

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your data checkpoint feedback


In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- Run only once after cloning!!! 
#
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: REPLACE the contents of this cell and the one below with your work, including any updates to recover points lost in your data checkpoint feedback

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

### Dataset #2
As above; add as many copies of this section as you need for the number of datasets you have.

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

## Results

### Exploratory Data Analysis

Instructions: replace the words in this subsection with whatever words you need to set up and preview the EDA you're going to do.

Please explicitly load the fully wrangled data you will use from `data/02-processed`.  This is a good idea rather than forcing people to re-run the data getting / wrangling cells above.  Sometimes it takes a long time to get / wrangle data compared to reloading the fixed up dataset.

Carry out whatever EDA you need to for your project in the code cells below.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

Please note that you should consider the use of Python modules in your work.  Any code which gets called repeatedly should be modularized. So if you run the same pre-processing, analysis or visualization on different subsets of the data, then you should turn that into a function or class.  Put that function or class in a .py file that lives in `modules/`.  Import the module you made and use it to get your work done.  For reference see `get_raw()` which is inside `modules/get_data.py`.



#### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

#### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Our dataset only includes the top 100 ranked games per genre, which introduces significant selection bias. By excluding games ranked below 100, we systematically omit indie games, newer releases, and potentially diverse developers who haven't broken into the top charts. This could skew our findings toward characteristics of already-successful games with substantial marketing budgets, potentially reinforcing the advantage of established publishers. We acknowledge this limitation and will interpret our results as applicable to high-ranking games specifically, rather than generalizing to the entire Play Store ecosystem.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our analysis focuses on quantitative metrics visible to users (ratings, installs, growth), but we recognize we lack perspective from game developers who understand the internal factors affecting rankings. Additionally, we do not account for qualitative factors like gameplay innovation, artistic merit, or accessibility features that may influence user satisfaction but not be captured in star ratings. To partially address this, we reviewed existing literature on mobile game success factors and will acknowledge that our statistical model captures correlation, not necessarily the causal mechanisms developers could control.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> The growth metrics (30-day and 60-day) may favor newer games with volatile early trajectories over established titles with stable user bases. The ordinal install milestones (e.g., "1M+", "10M+") compress continuous variation, potentially masking meaningful differences. We will examine correlations stratified by genre to identify where these biases most affect our conclusions.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We commit to transparent reporting of our findings, including null or weak correlations that do not support our hypothesis. When presenting visualizations, we will use appropriate axis scales to avoid exaggerating effects and include confidence intervals or error bars where applicable. 
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
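
Item C.3 above mentions reporting confidence intervals. One way to attach an interval to a rank correlation is a percentile bootstrap, sketched below with NumPy. The data are synthetic and the function names (`average_ranks`, `spearman`, `bootstrap_spearman_ci`) are hypothetical; the resample count and the 95% level are illustrative choices, not settings taken from the project.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    values = np.asarray(values, dtype=float)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)      # last ordinal position of each distinct value
    first = last - counts + 1     # first ordinal position of each distinct value
    return ((first + last) / 2.0)[inverse]


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def bootstrap_spearman_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Return (estimate, lower, upper) using a percentile bootstrap over paired rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (x, y) pairs with replacement
        stats[i] = spearman(x[idx], y[idx])
    lower, upper = np.nanquantile(stats, [alpha / 2, 1 - alpha / 2])
    return spearman(x, y), float(lower), float(upper)


# Synthetic example (NOT project data): ratings fall as rank gets worse, plus noise.
rng = np.random.default_rng(1)
rank = np.arange(1, 101)
ratings = 1000 - 5 * rank + rng.normal(0, 100, size=100)

estimate, lower, upper = bootstrap_spearman_ci(ratings, rank)
```

Resampling whole rows keeps each game's feature value paired with its rank, and average ranks are needed because bootstrap resamples contain many duplicated rows.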

## Team Expectations 

- We will use Discord as our primary means of communication. Responses should be within 24 hours. We will meet every Wednesday at Geisel at 5 PM unless notified otherwise through Discord.

- We will communicate in a supportive tone with each other. Every message should be responded to. If there is no response needed, always like the message. We will be inclusive of all opinions; no opinion should be rejected without being heard. To express disagreement, we will start with “I personally think.”

- Decisions should be made with the consent of all group members. Unless there are significant conflicts, the general rule of thumb should apply. If an immediate decision is needed, at least one other member must agree with the person proposing the decision.

- Everyone is expected to do a little bit of everything. Since we all have different strengths, members should be proactive in their area of expertise. Always reach out for help if there is difficulty. Communicate openly if work needs to be split differently.

- If a deadline cannot be met, notify the group as soon as possible (at least 36 hours before the deadline). The team will come together to figure out a solution. We commit to being understanding and supportive, with no pressure in communicating these situations. Each member has two chances to request deadline flexibility.

- For issues such as problem teammates or conflicts, communicate with the entire group first. The group will decide together whether the issue should be taken outside the group.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discussed At Meeting |
|-------------|-------------|--------------------------|----------------------|
| 1/21 | 5:00 PM | Reviewed project description and brainstormed initial project ideas | Reviewed project requirements, discussed potential research questions, explored possible datasets, and determined regular meeting times |
| 1/28 | 5:00 PM | Conducted background research related to the project topic | Completed project review and refined understanding of project expectations |
| 2/3 | 3:20 PM | Reviewed draft project proposal and brainstormed more specific research questions | Began drafting the project proposal, identified and selected datasets, and finalized the research question |
| 2/4 | 5:00 PM | Explored and familiarized ourselves with the datasets relevant to the research question | Finished Project Proposal |
| 2/11 | 5:00 PM | Reviewed and understood Checkpoint #1 requirements | Began Checkpoint #1, assigned tasks to group members, started dataset preprocessing and initial coding, and discussed wrangling and analytical approaches |
| 2/18 | 5:00 PM | Group members completed assigned portions of Checkpoint #1 | Reviewed and finalized Checkpoint #1 |
| 2/25 | 5:00 PM | Reviewed EDA and Checkpoint #2 requirements and completed exploratory data analysis using the fully processed dataset | Assigned tasks, loaded cleaned datasets, conducted EDA, and discussed observed patterns and insights |
| 3/4 | 5:00 PM | Reviewed final video requirements and brainstormed presentation ideas | Finalized Checkpoint #2, discussed next stages of analysis, planned the final video structure, drafted scripts, and began filming |
| 3/11 | 5:00 PM | Filmed individual segments and reviewed recorded clips | Completed filming, re-filmed unsatisfactory clips, and began video editing |
| 3/17 | 5:00 PM | Continued editing the final video | Verified all requirements were met and finalized the project |