# COGS 108 - Data Checkpoint

## Authors
Team:
- Ashley Vo: Conceptualization, Writing – review & editing, Project administration, Data curation
- Dorje Pradhan: Conceptualization, Writing – original draft, Writing – review & editing, Data curation
- Kilhoon (Andy) Kim: Writing – original draft, Writing – review & editing, Data curation
- Kobe Wood: Data curation, Writing – original draft, Writing – review & editing
- Vy (Kiet) Dang: Background research, Writing – original draft, Data curation

## Research Question

How did the popularity of each video game mode (singleplayer, multiplayer, online co-op) on PC change between the pre-COVID period (2018-2019), the COVID period (2020-2021), and the post-COVID period (2022-2023) among the top 250 player-count Steam games from each year from each mode?

Here, we define **popularity** using two metrics:
- Average concurrent player count over a given period 
- Peak player count over a given period

## Background and Prior Work

The outbreak of the coronavirus disease COVID-19 became a global pandemic in 2020, forcing people to isolate and quarantine. During the lockdown period, people spent most of their time at home and turned to digital entertainment and video games as a way to socialize and de-stress. As a result, the gaming industry saw peaks in gamer activity, play time, sales, and stock values during this period.

In this study, we are interested in finding which games were the most popular in each game mode (single player, multiplayer, online co-op). Seeing which game mode people are most interested in could give the gaming industry a better understanding of its consumers' desires. At first, the group was interested in researching the top 50 games from each game mode across the pre-COVID period (2018-2019), the COVID era (2020-2021), and the post-COVID era (2022-2023) on multiple popular platforms such as Steam, Epic Games, Xbox, PlayStation, and Nintendo. **However**, not all platforms share their player statistics with the public. So, we narrowed our scope to Steam, whose player data is publicly available through sites such as "Steam Charts" and "SteamDB".

Several research papers have analyzed video game activity and pricing over time, such as:
- Aliev et al. (2025): These researchers investigated how the pandemic affected the prices and player reviews of mostly Indie games on Steam. By analyzing SteamDB data, they found that player reviews and activity levels are highly correlated. This study confirmed that SteamDB is a reliable tool for our project. However, their work focused mostly on the Indie category and treated AAA, which is a pricing tier, as its own game category. In contrast, we want to look at the top 250 games overall for each game mode.

- Şener et al. (2021): This paper investigated the broader economic impact of COVID-19 on the gaming industry and showed a significant rise in player activity on Steam during 2020. Their findings gave us a "baseline" for when player activity started to increase, but the paper did not cover other factors like playtime, pricing dynamics, or game modes.

- Toledo (2021): Toledo used consumer **surveys** to study how gaming habits changed during lockdown. The surveys showed that games became a bridge to an "online social life" during the quarantine period. This finding is why we emphasize comparing Online Co-op and Multiplayer modes against Single Player games: we want to see whether that online social trend lasted after the pandemic ended. These modes are harder to categorize because Steam tags are not mutually exclusive, so a single game is often tagged as both single-player and multiplayer.

Therefore, as curious gamers and data analysts, we decided to research which games were the most popular in each game mode on Steam before, during, and after the COVID-19 era, using the top 250 games of each year.

References:
1. Aliev, A. R., Eyniyev, R., & Aliyev, T. A. (2025). Analyzing Price Dynamics, Activity of Players and Reviews of Popular Indie Games on Steam Post-COVID-19 Pandemic using SteamDB. https://www.mecs-press.org/ijitcs/ijitcs-v17-n3/IJITCS-V17-N3-3.pdf
2. Şener, M., Yalcin, T., & Gulseven, O. (2021). The Impact of COVID-19 on the Video Game Industry. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3766147
3. Toledo, M. (2021). Video Game Habits COVID-19. Journal of Marketing Management and Consumer Behavior, 3(4), 66–89. https://doi.org/10.2139/ssrn.3676004

## Hypothesis


We believe there will be a dramatic rise in the popularity of online co-op and multiplayer games during the COVID era, with some of that increase continuing after COVID. Our thinking is as follows: people were stuck inside and had largely lost the ability to connect with each other in person, so games that allowed online interaction became more appealing. In this study, popularity will be measured using average concurrent player count and peak player count, and we will examine these patterns within the top 250 Steam games across the pre-COVID period (2018-2019), COVID period (2020-2021), and post-COVID period (2022-2023). We also expect that, within the top 250, the number of games tagged as multiplayer or online co-op will increase during COVID compared with pre-COVID, though we recognize that tags are not mutually exclusive and a game may appear in more than one mode. 

## Data

### Data overview
#### Dataset 1: Steam250 - Top 250 Games Of Each Year
- Dataset name: Steam250 - Top 250 Games Of Each Year
- Link to the dataset:
  - 2018: https://steam250.com/2018 
  - 2019: https://steam250.com/2019
  - 2020: https://steam250.com/2020
  - 2021: https://steam250.com/2021
  - 2022: https://steam250.com/2022
  - 2023: https://steam250.com/2023
- Number of observations (per year): 250 
- Number of variables (per year): 5
- Important notes: Steam250 does not provide play mode data, so we will need to obtain and join this with another dataset, in our case, Dataset 2.

#### Dataset 2: Steam Games Dataset (Kaggle)
- Dataset name: Steam Games Dataset (Kaggle)
- Link to the dataset: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset
- Number of observations: 122611
- Number of variables: 39
- Important notes: This dataset serves as a supplementary dataset to provide playmode data for Dataset 1.

#### Dataset 3: Steam Charts Historical Player Activity
- Dataset name: Steam Charts game-level player activity (scraped/collected)
- Link to the dataset: https://steamcharts.com/
- Number of observations: 9,583 scraped rows, of which 9,468 have usable data (one row per game per month within 2018–2023)
- Number of variables: 8 (year, rank, name, appid, month, avg_players, peak_players, status)
- This dataset is the core time-series source for our project because it contains the two popularity metrics we can reliably measure across all periods: average concurrent players and peak concurrent players. In practical terms, average concurrent players captures the typical number of people actively playing a game at the same time during a month, while peak concurrent players captures the maximum simultaneous activity reached during that month. Both metrics are counts of players (not percentages), and both are useful: average concurrency reflects sustained engagement, while peak concurrency reflects major surges and maximum demand.
- Important notes: We first gather the list of the top 250 games from each year and only then pull from SteamCharts, so that we don't collect data we won't use.

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

# OUR IMPORTS
import csv
import time
import hashlib
from pathlib import Path
from typing import Optional
import requests
import pandas as pd
from bs4 import BeautifulSoup

# (!) Our data comes from scraping HTML so this section is unneeded.
import get_data # this is where we get the function we need to download data


In [None]:
# Note: this will need to be run to get games.csv (Dataset 2) as the CSV is too large for GitHub.
get_data.main()

The `get_data.main()` function normally prints output, but we omit it here because it is very long and only reports the status of each item it tried to download.

### Steam250 Data Collection
Steam250 is a site that displays a ranking of the top 250 games of each year. Although no CSV is provided, we can scrape the website to gather the data we need. It provides 5 variables of interest to us:
1. Rank is an integer that denotes the ranking of each game with 1 being the best of that year and 250 being the 250th best of the year.
2. AppID is an integer that uniquely identifies each and every game on Steam.
3. Name is a string that belongs to the name of the game.
4. Rating is a player-voted score. It is stored as a float but holds whole numbers that act as percentages (e.g. 94.0 corresponds to 94%).
5. Number of votes is an integer that denotes the number of players that have voted for that game.

The biggest caveat is that our analysis depends on Steam250's definition of top games. However, after [reading about their process](https://steam250.com/about), we concluded that it is a reasonable approach, since a ranking of "top games" is not readily available elsewhere.

Let's load all of the CSVs so we can observe their structure en masse.

In [4]:
steam250_2018_df = pd.read_csv('data/00-raw/2018_top250.csv')
steam250_2019_df = pd.read_csv('data/00-raw/2019_top250.csv')
steam250_2020_df = pd.read_csv('data/00-raw/2020_top250.csv')
steam250_2021_df = pd.read_csv('data/00-raw/2021_top250.csv')
steam250_2022_df = pd.read_csv('data/00-raw/2022_top250.csv')
steam250_2023_df = pd.read_csv('data/00-raw/2023_top250.csv')

yearly_dfs = [steam250_2018_df, steam250_2019_df, steam250_2020_df, steam250_2021_df, steam250_2022_df, steam250_2023_df]

We want to make sure there aren't any missing values and our data types line up correctly.

In [5]:
for i in range(6):
    curr_df = yearly_dfs[i]

    print(
        '=' * 64, '\n',
        f'Current year data frame: {2018 + i}\n',
        '=' * 64, '\n',
        'Shape: ',
        curr_df.shape, '\n\n',
        'Any nulls?\n',
        curr_df.isna().any(), '\n\n',
        'Column types:\n',
        curr_df.dtypes, '\n\n',
        'First five rows of the data:\n',
        curr_df.head(5), '\n\n',
        sep=''
    )

Current year data frame: 2018
Shape: (250, 5)

Any nulls?
rank         False
appid        False
name         False
rating       False
num_votes    False
dtype: bool

Column types:
rank           int64
appid          int64
name          object
rating       float64
num_votes      int64
dtype: object

First five rows of the data:
   rank   appid         name  rating  num_votes
0     1  960090  Bloons TD 6    97.0     379590
1     2  242760   The Forest    96.0     661632
2     3  264710   Subnautica    97.0     340307
3     4  294100     RimWorld    98.0     233787
4     5  588650   Dead Cells    97.0     177451


Current year data frame: 2019
Shape: (250, 5)

Any nulls?
rank         False
appid        False
name         False
rating       False
num_votes    False
dtype: bool

Column types:
rank           int64
appid          int64
name          object
rating       float64
num_votes      int64
dtype: object

First five rows of the data:
   rank    appid                                    

The data looks as we expect!
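
The visual inspection above can also be turned into an automated check. The following is a minimal sketch, not part of our pipeline: the helper name `check_top250` and the toy table are made up for illustration, and the rule that ranks run consecutively from 1 to n is our assumption about how the scraped tables should look.

```python
import pandas as pd

def check_top250(df: pd.DataFrame, expected_rows: int = 250) -> list[str]:
    """Return a list of human-readable problems found in one yearly ranking table."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df.isna().any().any():
        problems.append("table contains missing values")
    if df["appid"].duplicated().any():
        problems.append("duplicate appid values")
    # Assumption: ranks should be exactly 1..n with no gaps or repeats
    if sorted(df["rank"].tolist()) != list(range(1, len(df) + 1)):
        problems.append("ranks are not the consecutive integers 1..n")
    if not df["rating"].between(0, 100).all():
        problems.append("rating outside the 0-100 range")
    return problems

# Tiny made-up table standing in for one year's scrape (not real Steam250 data).
toy = pd.DataFrame({
    "rank": [1, 2, 3],
    "appid": [10, 20, 20],          # duplicate appid on purpose
    "name": ["A", "B", "C"],
    "rating": [97.0, 96.0, 101.0],  # out-of-range rating on purpose
    "num_votes": [100, 50, 25],
})
problems = check_top250(toy, expected_rows=3)
print(problems)
```

An empty list means the table passed every check, so the same helper could be run over each entry of `yearly_dfs`.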

### Steam Games Dataset
This dataset was provided by Martin Bustos on Kaggle. Bustos created the dataset using Steam's API and Steam Spy. It provides metadata for over 122,000 games on Steam across 39 variables. However, only three of these variables matter for our research question, so we ignore the rest:
1. AppID is an integer value that uniquely identifies each and every game on Steam.
2. Name is a string that belongs to the name of the game.
3. Categories should be an array of strings. Each string corresponds to a characterization of the game such as "singleplayer" or "PvP."

As we'll see, reading the CSV is a little troublesome because the column names and values are misaligned. Since we only need the Categories this dataset provides for each game, we can safely disregard the other columns. We also experimented with other datasets, but they did not cover all of the games we gathered.

First, let's read in the CSV downloaded from Kaggle and get an idea of its size.

In [None]:
# games.csv cannot be pushed to GitHub because it is 371.1 MB.
# Download it, run this cell once to get the results, then remove the file before pushing.
steam_games_df = pd.read_csv('./data/00-raw/games.csv', index_col=False)

steam_games_df.shape

Now, let's take a peek at what the data looks like:

In [None]:
steam_games_df.head()

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DiscountDLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,2539430,Black Dragon Mage Playtest,"Aug 1, 2023",0 - 0,0,0,0.0,0,0,,...,0,0,0,0,,,,,,https://shared.akamai.steamstatic.com/store_it...
1,496350,Supipara - Chapter 1 Spring Has Come!,"Jul 29, 2016",0 - 20000,0,0,5.24,65,0,"Springtime, April: when the cherry trees come ...",...,8,0,8,0,minori,MangaGamer,"Single-player,Steam Trading Cards,Steam Cloud,...",Adventure,"Adventure,Visual Novel,Anime,Cute",https://shared.akamai.steamstatic.com/store_it...
2,1034400,Mystery Solitaire The Black Raven,"May 6, 2019",0 - 20000,0,0,4.99,0,0,"Immerse yourself in the most beloved, mystical...",...,0,0,0,0,Somer Games,8floor,"Single-player,Family Sharing",Casual,"Casual,Card Game,Solitaire,Puzzle,Hidden Objec...",https://shared.akamai.steamstatic.com/store_it...
3,3292190,버튜버 파라노이아 - Vtuber Paranoia,"Oct 31, 2024",0 - 20000,1,0,8.99,0,1,"synopsis 'Hello, I'm Hiyoro, a new YouTuber!' ...",...,0,0,0,0,유진게임즈,유진게임즈,"Single-player,Steam Achievements,Family Sharing","Casual,Indie,Simulation",,https://shared.akamai.steamstatic.com/store_it...
4,3631080,Maze Quest VR,"Apr 24, 2025",0 - 20000,0,0,4.99,0,0,Its not just a Maze; its a Quest! Enter the ca...,...,0,0,0,0,Reality Expanded LLC,Reality Expanded LLC,"Single-player,VR Only,Steam Leaderboards,Famil...","Action,Early Access",,https://shared.akamai.steamstatic.com/store_it...


Let's look at the column types.

In [None]:
steam_games_df.dtypes

AppID                           int64
Name                           object
Release date                   object
Estimated owners               object
Peak CCU                        int64
Required age                    int64
Price                         float64
DiscountDLC count               int64
About the game                  int64
Supported languages            object
Full audio languages           object
Reviews                        object
Header image                   object
Website                        object
Support url                    object
Support email                  object
Windows                        object
Mac                              bool
Linux                            bool
Metacritic score                 bool
Metacritic url                  int64
User score                     object
Positive                        int64
Negative                        int64
Score rank                      int64
Achievements                  float64
Recommendati

This dataset is not clean: the column names don't line up with the values in the rows. For example, we see what look like game descriptions under "Supported languages" rather than "About the game." The header appears to fuse two column names into "DiscountDLC count", which would explain why the values after it sit one column to the right of their names. Printing the column types shows related discrepancies, like "Metacritic url" being an `int64` rather than an object. Luckily, we only care about a few variables in the dataset: AppID, Name, and Categories, so we can ignore the rest. Note that the Categories values currently sit under the "Genres" column. We can fix that:

In [None]:
steam_games_subset_df = steam_games_df[['AppID', 'Name', 'Genres']]
steam_games_subset_df.columns = ['appid', 'name', 'genres']

steam_games_subset_df.head()

Unnamed: 0,appid,name,genres
0,2539430,Black Dragon Mage Playtest,
1,496350,Supipara - Chapter 1 Spring Has Come!,"Single-player,Steam Trading Cards,Steam Cloud,..."
2,1034400,Mystery Solitaire The Black Raven,"Single-player,Family Sharing"
3,3292190,버튜버 파라노이아 - Vtuber Paranoia,"Single-player,Steam Achievements,Family Sharing"
4,3631080,Maze Quest VR,"Single-player,VR Only,Steam Leaderboards,Famil..."
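
Selecting the shifted "Genres" column works for our purposes, but the misalignment could also be repaired when the file is read. The sketch below uses a made-up four-line CSV, not the real file, and assumes the cause is a header that is exactly one name short because two names were fused. The corrected names `Discount` and `DLC count` are our guess at what was intended.

```python
import io
import pandas as pd

# Made-up miniature of the problem: the header fuses two names into
# "DiscountDLC count", so it lists 4 columns while every row has 5 fields.
raw = (
    "AppID,Name,DiscountDLC count,Categories\n"
    '10,Game A,0,2,"Single-player,Co-op"\n'
    '20,Game B,50,0,"Multi-player,PvP"\n'
)

# Skip the broken header line and supply the intended column names explicitly.
fixed_names = ["AppID", "Name", "Discount", "DLC count", "Categories"]
fixed = pd.read_csv(io.StringIO(raw), skiprows=1, names=fixed_names)

print(fixed[["AppID", "Categories"]])
```

Applying this to the real file would require listing every column name, so for a project that needs only three columns the positional workaround above is the cheaper choice.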


Let's check for missing values.

In [None]:
steam_games_subset_df.isna().sum()

appid        0
name         1
genres    8953
dtype: int64

There appears to be one entry without a name. Let's find out what game that is and whether it should be cause for concern.

In [None]:
steam_games_subset_df[steam_games_subset_df['name'].isna()]

Unnamed: 0,appid,name,genres
44432,396420,,


Interestingly enough, after [looking up the game on Steam](https://store.steampowered.com/app/396420/_/), it literally has no name, so this isn't something to worry about. Now, we can move on to merging datasets 1 and 2 together.

### Merging: Steam250 and Steam Games (Kaggle)

Now, we want to merge datasets 1 and 2. Before we can do that for every year, we need to work out how to map the raw categories onto the play modes we care about. We'll use 2018 as an example to find out which categories need to be mapped and whether there are any Python typing quirks in our data.

In [None]:
steam250_2018_df_merged = steam250_2018_df.merge(
    steam_games_subset_df,
    on='appid',
    how='left'
)

print('What is the type of the tags column?\n', type(steam250_2018_df_merged['genres'].iloc[0]), '\n')

What is the type of the tags column?
 <class 'str'> 



Since it's a string, we should convert it to a list to easily parse it.

In [None]:
steam250_2018_df_merged['genres'] = steam250_2018_df_merged['genres'].fillna('').str.split(',')
steam250_2018_df_merged
# print('What is the type of the tags column now?\n', type(steam250_2018_df_merged['genres'].iloc[0]), '\n')

Unnamed: 0,rank,appid,name_x,rating,num_votes,name_y,genres
0,1,960090,Bloons TD 6,97.0,379590,Bloons TD 6,"[Single-player, Multi-player, Co-op, Online Co..."
1,2,242760,The Forest,96.0,661632,The Forest,"[Single-player, Multi-player, Co-op, Online Co..."
2,3,264710,Subnautica,97.0,340307,Subnautica,"[Single-player, Steam Achievements, Full contr..."
3,4,294100,RimWorld,98.0,233787,RimWorld,"[Single-player, Steam Workshop, Partial Contro..."
4,5,588650,Dead Cells,97.0,177451,Dead Cells,"[Single-player, Steam Achievements, Full contr..."
...,...,...,...,...,...,...,...
245,246,955560,Evenicle,92.0,1309,Evenicle,"[Single-player, Steam Trading Cards, Steam Clo..."
246,247,882110,Google Spotlight Stories: Age of Sail,98.0,470,Google Spotlight Stories: Age of Sail,"[Single-player, VR Only]"
247,248,904740,东方试闻广纪 ~ Perfect Memento of Touhou Question,94.0,882,东方试闻广纪 ~ Perfect Memento of Touhou Question,[Single-player]
248,249,851530,Mini-Dead,94.0,907,Mini-Dead,"[Single-player, Steam Achievements]"
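
Because the merge above is a left join, a ranked game that is missing from the Kaggle data is kept but ends up with no categories. One way to count such games is pandas' `indicator` option, sketched here on made-up frames (the `top` and `meta` tables are illustrative stand-ins, not our data).

```python
import pandas as pd

# Made-up stand-ins for one yearly ranking and the Kaggle subset.
top = pd.DataFrame({"appid": [1, 2, 3], "name": ["A", "B", "C"]})
meta = pd.DataFrame({
    "appid": [1, 2],
    "genres": ["Single-player,Co-op", None],  # appid 2 is listed but has no categories
})

merged = top.merge(
    meta,
    on="appid",
    how="left",
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    validate="many_to_one",  # raises if an appid is duplicated in `meta`
)

# Games that never matched a metadata row
not_in_meta = merged.loc[merged["_merge"] == "left_only", "appid"].tolist()
# Games that matched, but whose metadata row has no categories
no_categories = merged.loc[
    (merged["_merge"] == "both") & merged["genres"].isna(), "appid"
].tolist()

print("missing from metadata:", not_in_meta)
print("present but uncategorized:", no_categories)
```

Separating the two cases matters for interpretation: the first is a coverage gap in the supplementary dataset, while the second is a game that Steam itself lists without categories.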


In [None]:
# See unique tags
unique_tags = (
    steam250_2018_df_merged['genres']
    .explode()
    .str
    .strip()
    .unique()
)

print('What are the unique values that can appear under categories?\n', unique_tags)

What are the unique values that can appear under categories?
 ['Single-player' 'Multi-player' 'Co-op' 'Online Co-op'
 'Steam Achievements' 'In-App Purchases' 'Remote Play on Phone'
 'Remote Play on Tablet' 'Family Sharing' 'LAN Co-op'
 'Tracked Controller Support' 'VR Supported' 'Partial Controller Support'
 'Steam Cloud' 'Full controller support' 'Steam Trading Cards'
 'VR Support' 'Remote Play on TV' 'Steam Workshop' 'PvP' 'Online PvP'
 'LAN PvP' 'Cross-Platform Multiplayer' 'Includes level editor' 'MMO'
 'Valve Anti-Cheat enabled' 'Stats' 'Shared/Split Screen PvP'
 'Shared/Split Screen Co-op' 'Shared/Split Screen' 'Remote Play Together'
 'Adjustable Text Size' 'Camera Comfort' 'Custom Volume Controls'
 'Keyboard Only Option' 'Stereo Sound' 'Surround Sound'
 'Steam Leaderboards' 'HDR available' 'Captions available'
 'Adjustable Difficulty' 'Mouse Only Option'
 'Playable without Timed Input' 'Save Anytime' 'Commentary available'
 'Color Alternatives' 'VR Only' 'SteamVR Collectibles' '

Let's create two things:
- A mapping that groups similar categories and normalizes them to the three play modes relevant to our research question: singleplayer, multiplayer, and co-op
- A helper function that cleans up the tag list in each row for us

In [None]:
tag_map = {
    'Single-player': 'singleplayer',

    'Multi-player': 'multiplayer',
    'MMO': 'multiplayer',
    'PvP': 'multiplayer',
    'Online PvP': 'multiplayer',
    'LAN PvP': 'multiplayer',
    'Shared/Split Screen PvP': 'multiplayer',
    'Cross-Platform Multiplayer': 'multiplayer',

    'Co-op': 'co-op',
    'Online Co-op': 'co-op',
    'LAN Co-op': 'co-op',
    'Shared/Split Screen Co-op': 'co-op',
}

In [None]:
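def clean_tags(tag_list):
    """Map a list of raw Steam categories to our play modes, dropping everything else."""
    # Anything that is not a list (e.g. NaN) is treated as "no tags"
    if not isinstance(tag_list, list):
        return []

    cleaned = []

    for tag in tag_list:
        tag = tag.strip()

        # Keep only the categories that describe a play mode
        if tag in tag_map:
            cleaned.append(tag_map[tag])

    # Deduplicate, and sort so the output order is the same on every run
    return sorted(set(cleaned))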
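
To make the behavior of this kind of helper concrete, here is a small self-contained sketch. `MODE_MAP` is a condensed, illustrative copy of the mapping and `normalize_modes` is a hypothetical stand-in for the helper, so neither is the exact code used in our pipeline.

```python
# Condensed copy of the mapping idea; the names here are illustrative only.
MODE_MAP = {
    "Single-player": "singleplayer",
    "Multi-player": "multiplayer",
    "Online PvP": "multiplayer",
    "Co-op": "co-op",
    "Online Co-op": "co-op",
}

def normalize_modes(raw_tags):
    """Map raw Steam categories to play modes, dropping anything unmapped."""
    if not isinstance(raw_tags, list):
        return []
    modes = {MODE_MAP[t.strip()] for t in raw_tags if t.strip() in MODE_MAP}
    return sorted(modes)  # sorted so the result is reproducible across runs

# Two co-op variants collapse into one mode; "Steam Cloud" is not a play mode.
mixed = normalize_modes(["Single-player", " Co-op", "Online Co-op", "Steam Cloud"])
# A game with no categories looks like [''] after fillna('').str.split(',').
empty = normalize_modes([""])
missing = normalize_modes(None)

print(mixed, empty, missing)
```

Note that a game can legitimately map to more than one mode, which is why the result is a list rather than a single label.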

Now, we can adjust the tags for each top 250 per year in batch.

In [None]:
for i in range(6):
    curr_year_df = pd.read_csv(f'./data/00-raw/{2018 + i}_top250.csv')

    curr_merged_df = curr_year_df.merge(
        steam_games_subset_df,
        on='appid',
        how='left'
    )

    curr_merged_df = curr_merged_df[['rank', 'appid', 'name_x', 'num_votes', 'rating', 'genres']]
    curr_merged_df.columns = ['rank', 'appid', 'name', 'num_votes', 'rating', 'tags']

    # Convert the tags column to a list
    curr_merged_df['tags'] = (
        curr_merged_df['tags']
        .fillna('')
        .str.split(',')
    )

    curr_merged_df['tags'] = curr_merged_df['tags'].apply(clean_tags)

    # Save to the data/02-processed directory
    curr_merged_df.to_csv(f'./data/02-processed/{2018 + i}_top250_final.csv')

Let's make sure all of our data is in order.

In [None]:
for i in range(6): 
    curr_df = pd.read_csv(f'./data/02-processed/{2018 + i}_top250_final.csv')
    print(
        '=' * 64, '\n',
        f'Current year data frame: {2018 + i}\n',
        '=' * 64, '\n',
        'Shape: ',
        curr_df.shape, '\n\n',
        'Any nulls?\n',
        curr_df.isna().any(), '\n\n',
        'Column types:\n',
        curr_df.dtypes, '\n\n',
        'First five rows of the data:\n',
        curr_df.head(5), '\n\n',
        sep=''
    )

Current year data frame: 2018
Shape: (250, 7)

Any nulls?
Unnamed: 0    False
rank          False
appid         False
name          False
num_votes     False
rating        False
tags          False
dtype: bool

Column types:
Unnamed: 0      int64
rank            int64
appid           int64
name           object
num_votes       int64
rating        float64
tags           object
dtype: object

First five rows of the data:
   Unnamed: 0  rank   appid         name  num_votes  rating  \
0           0     1  960090  Bloons TD 6     379590    97.0   
1           1     2  242760   The Forest     661632    96.0   
2           2     3  264710   Subnautica     340307    97.0   
3           3     4  294100     RimWorld     233787    98.0   
4           4     5  588650   Dead Cells     177451    97.0   

                                       tags  
0  ['co-op', 'multiplayer', 'singleplayer']  
1  ['co-op', 'multiplayer', 'singleplayer']  
2                          ['singleplayer']  
3             

And with that, everything looks as expected, so we can now look to SteamCharts for more granular data!
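
One detail worth knowing before analysis: a list column does not survive a round trip through CSV. The `tags` dtype above is `object`, and each value comes back as the text of a list rather than a list. The sketch below, on made-up rows, shows one way to parse the text back with `ast.literal_eval` and expand it into one boolean column per mode. The `is_<mode>` column names are our own invention.

```python
import ast
import io
import pandas as pd

# Made-up rows in the same shape as the processed files.
original = pd.DataFrame({
    "appid": [1, 2, 3],
    "tags": [["co-op", "multiplayer", "singleplayer"], ["singleplayer"], []],
})

# Write to an in-memory CSV and read it back, as the pipeline does on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(type(reloaded["tags"].iloc[0]))  # the list is now plain text

# Safely parse the text of each list literal back into a real list.
reloaded["tags"] = reloaded["tags"].apply(ast.literal_eval)

# Tags are not mutually exclusive, so use one indicator column per mode.
for mode in ["singleplayer", "multiplayer", "co-op"]:
    reloaded[f"is_{mode}"] = reloaded["tags"].apply(lambda tags, m=mode: m in tags)

print(reloaded.drop(columns="tags"))
```

Indicator columns make it straightforward to count games per mode with a simple `sum()`, while still allowing a game to count toward several modes.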

### Steam Charts Player Data Collection
For this project, the most relevant variables are: game identifier (name and appid), date (month and year), average concurrent players, and peak concurrent players.

- A game name (a string) may not be unique across the entire Steam catalog, so we also keep the Steam appid, which uniquely identifies each game. This is a number, and every game is expected to have one.
- Dates are stored in a `YYYY-MM` format so that we can parse them and aggregate the monthly player data by the defined study periods.
- Average and peak concurrent players are numeric values that must be greater than or equal to zero. Since we are looking at the top 250 Steam games, we expect these values to be greater than zero in practice.

We may later aggregate monthly values into three study periods: pre-COVID (2018–2019), COVID (2020–2021), and post-COVID (2022–2023). This allows direct period-to-period comparisons for each game and for groups of games by mode tags.
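
The period aggregation described above could look roughly like the following. This is a sketch on made-up numbers, not our final analysis code. It assumes the period is derived from the `month` string, that a period's average is the unweighted mean of its monthly averages, and that a period's peak is the largest monthly peak.

```python
import pandas as pd

# Calendar year -> study period, using the same labels as our summary tables.
PERIODS = {
    2018: "pre_covid", 2019: "pre_covid",
    2020: "covid", 2021: "covid",
    2022: "post_covid", 2023: "post_covid",
}

# Made-up monthly rows in the same layout as the SteamCharts table.
monthly = pd.DataFrame({
    "appid": [1, 1, 1, 1],
    "month": ["2019-12", "2020-03", "2020-04", "2022-01"],
    "avg_players": [100.0, 300.0, 500.0, 200.0],
    "peak_players": [250.0, 900.0, 1200.0, 450.0],
})

# Derive the calendar year from the YYYY-MM string, then look up the period.
month_year = pd.to_datetime(monthly["month"], format="%Y-%m").dt.year
monthly["period"] = month_year.map(PERIODS)

# One row per game and period: mean of monthly averages, max of monthly peaks.
by_period = (
    monthly.groupby(["appid", "period"], as_index=False)
    .agg(avg_players=("avg_players", "mean"), peak_players=("peak_players", "max"))
)
print(by_period)
```

Taking the mean of monthly averages treats every month equally regardless of its length, which is a simplification. Weighting by days per month would be a possible refinement.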

A key strength of this dataset is that it provides consistent, public Steam activity data at scale; SteamCharts obtains its data directly from Steam's Web API. The main shortcoming is that it is Steam-only and therefore not representative of console ecosystems such as Nintendo, Sony, and Xbox. Historical coverage is also incomplete for certain titles, and the data does not provide causal explanations for changes in player activity. It is also important to note that Steam does not record players who play offline, i.e. disconnected from the internet and therefore from Steam's servers, so these players are not counted even when they launch games through Steam. Lastly, selecting only top games introduces survivorship/popularity bias relative to the full Steam catalog: it leaves out games that may see interesting growth or variability in their player populations despite not being in the top 250.

We wrote some helper functions that parse our combined CSVs and report summary statistics on their contents. First, we establish the locations and file names of the CSVs:

In [8]:
interim_dir = Path("./data/01-interim")
final_dir = Path("./data/02-processed")

start_year = 2018
end_year = 2023

This is the data we got after parsing the HTML we scraped.

In [9]:
combined_before = interim_dir / f"steamcharts_{start_year}_{end_year}_combined.csv"
get_data.report_steamcharts_summary(combined_before, label="Combined BEFORE status filtering")


=== Combined BEFORE status filtering ===
File: data\01-interim\steamcharts_2018_2023_combined.csv
Rows: 9583 | Columns: 8

Columns:
['year', 'rank', 'name', 'appid', 'month', 'avg_players', 'peak_players', 'status']

Status counts:
status
ok                  9468
request_error        103
no_data_for_year      12

Period-level summary (avg/peak players):
    period  n_rows  avg_players_mean  avg_players_median  peak_players_mean  peak_players_median
     covid    3113       1348.084240               94.93        3149.519435                253.0
post_covid    3202       2040.683745              105.52        4680.950031                276.0
 pre_covid    3153       1218.562674               48.17        2750.359340                142.0


As we can see, we have 9583 observations, of which 103 + 12 = 115 are errors. Since we're only interested in rows that actually contain data, we did intermediate processing to remove the error rows. Let's observe the summary now.

In [10]:
combined_after = final_dir / f"steamcharts_{start_year}_{end_year}_ok.csv"
get_data.report_steamcharts_summary(combined_after, label="Combined AFTER status filtering (ok-only)")


=== Combined AFTER status filtering (ok-only) ===
File: data\02-processed\steamcharts_2018_2023_ok.csv
Rows: 9468 | Columns: 8

Columns:
['year', 'rank', 'name', 'appid', 'month', 'avg_players', 'peak_players', 'status']

Status counts:
status
ok    9468

Period-level summary (avg/peak players):
    period  n_rows  avg_players_mean  avg_players_median  peak_players_mean  peak_players_median
     covid    3113       1348.084240               94.93        3149.519435                253.0
post_covid    3202       2040.683745              105.52        4680.950031                276.0
 pre_covid    3153       1218.562674               48.17        2750.359340                142.0
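
For reference, the status filtering between the two summaries amounts to keeping only rows whose `status` is `ok`. The sketch below illustrates the idea on a made-up frame. It is not the code inside `get_data`, and the assumption that failed rows carry empty metrics is ours.

```python
import pandas as pd

# Made-up scrape results: one usable row and two failures.
scraped = pd.DataFrame({
    "appid": [1, 2, 3],
    "month": ["2018-01", None, None],
    "avg_players": [120.5, None, None],
    "peak_players": [300.0, None, None],
    "status": ["ok", "request_error", "no_data_for_year"],
})

# How many rows fall under each status before filtering
print(scraped["status"].value_counts())

# Keep only the rows that were scraped successfully
ok_only = scraped[scraped["status"] == "ok"].reset_index(drop=True)
print(len(scraped) - len(ok_only), "rows removed")
```

Keeping the `status` column in the interim file, rather than silently dropping failures during scraping, is what lets us report exactly how many rows were lost and why.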


Everything looks as expected. Now, we can do some general checks of the CSV's contents.

In [12]:
steamcharts_df = pd.read_csv(combined_after)
steamcharts_df.head()

Unnamed: 0,year,rank,name,appid,month,avg_players,peak_players,status
0,2018,1,Bloons TD 6,960090,2018-12,465.62,688.0,ok
1,2018,2,The Forest,242760,2018-01,5406.31,11300.0,ok
2,2018,2,The Forest,242760,2018-02,5169.69,11253.0,ok
3,2018,2,The Forest,242760,2018-03,4018.4,7626.0,ok
4,2018,2,The Forest,242760,2018-04,3674.44,15745.0,ok


This looks as we expect. Every observation is a game + the statistics of a given month. Let's ensure there are no nulls and types are as we expect.

In [13]:
steamcharts_df.dtypes

year              int64
rank              int64
name             object
appid             int64
month            object
avg_players     float64
peak_players    float64
status           object
dtype: object

In [14]:
steamcharts_df.isna().any()

year            False
rank            False
name            False
appid           False
month           False
avg_players     False
peak_players    False
status          False
dtype: bool

Everything looks as we expect, so we're ready to use the data.
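
Beyond nulls and dtypes, a few value-level checks follow from the expectations listed earlier: counts are non-negative, an average should not exceed its peak, and months follow `YYYY-MM`. The helper below is a sketch of such checks. `sanity_check` and the toy frame are hypothetical, and treating a repeated (year, appid, month) combination as an error is our assumption.

```python
import pandas as pd

def sanity_check(df: pd.DataFrame) -> dict:
    """Count rows that break basic expectations about monthly player data."""
    valid_month = df["month"].str.fullmatch(r"\d{4}-(0[1-9]|1[0-2])")
    return {
        "negative_values": int(((df["avg_players"] < 0) | (df["peak_players"] < 0)).sum()),
        "avg_above_peak": int((df["avg_players"] > df["peak_players"]).sum()),
        "bad_month_format": int((~valid_month).sum()),
        "duplicate_game_months": int(df.duplicated(subset=["year", "appid", "month"]).sum()),
    }

# Made-up rows with three deliberate problems.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018],
    "appid": [1, 1, 2],
    "month": ["2018-01", "2018-01", "2018-13"],  # repeated month, then month 13
    "avg_players": [50.0, 50.0, 900.0],
    "peak_players": [80.0, 80.0, 700.0],         # last row: average above peak
})
report = sanity_check(toy)
print(report)
```

A report of all zeros on the real table would support the claim that the data is ready to use.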

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Our project does not involve human subjects directly. We are analyzing publicly available gaming statistics from Steam, which does not require consent.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> We acknowledge that by focusing solely on Steam data, we introduce platform bias. However, our research question is specifically aimed at PC gaming on Steam.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> Our dataset contains no personally identifiable information (PII). We are using aggregate player statistics and game mode data only.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> We are not collecting data on protected groups such as gender or race, as our analysis focuses on gaming trends at the aggregate game-mode level rather than on individual player demographics.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> The data we are using is publicly available from Steam Charts and SteamDB. We are not collecting any new data or storing sensitive information.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
> Not applicable, as we are not collecting or storing any personal information from individuals.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> Since we are using publicly available data and not collecting new data, data retention is not a major concern. Any data we download for analysis will be retained only for the duration of this project.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> We acknowledge that other gaming platforms exist (e.g., Nintendo, Sony PlayStation, Microsoft Xbox, mobile) which provide different gaming experiences. Our dataset is limited to Steam users, who may not be representative of the global gaming population (e.g., mobile gamers or console players). However, our research question is intentionally narrowed to PC gaming on Steam due to data availability and accessibility.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> There could be potential bias in our dataset if certain games have wealthy backers who promote them more heavily, which could inflate their popularity metrics. We will keep this possibility in mind when interpreting our results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> We are committed to representing our data honestly and will strive to create visualizations and statistics that accurately reflect the underlying trends without misleading interpretations.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
> Privacy is not a concern for our analysis since we are not using any data with PII.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> We are committed to documenting our analysis process thoroughly to ensure reproducibility. This includes maintaining clear records of our data sources and processing steps.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> We are considering PC players as a general population. Since our analysis focuses on game mode trends rather than player demographics, bias and discrimination concerns are not relevant to our research.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> Not applicable for the same reason as D.1.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We are not creating a predictive model or optimizing for specific metrics.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Not applicable, as we are not building a predictive model or making automated decisions.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> While we are not building a model, we will clearly communicate the limitations of our analysis, including our focus on Steam data only and potential biases in the dataset.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Not applicable. We are not deploying a model. This is a research project analyzing game mode trends.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
> Not applicable, as our analysis is unlikely to cause harm and we are not deploying a system that impacts individuals.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
> Not applicable for the same reason as E.1.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> We are using the data for research, but game developers might use our results. For example, if they see that older games were popular, they might simply copy them instead of making new, creative games. This could lead to too many similar games and less variety for players.

## Team Expectations 

Team expectations are also [found separately here](./admin/rules.md).

### Communication
**Discord** is our main form of communication. We have a group chat.
- **Responding** If a message needs a response or acknowledgement, respond within about 1 day/24 hours (48 hours can be acceptable if there's an emergency). The exception is when we have a planned meeting coming up and ask for a ready check; in that case, an almost immediate response is expected.
- **Respect** Stay reasonably respectful to one another. It's okay to disagree, but do talk about the issue together or bring in another person (or the entire group) to discuss the matter if needed to mediate. If you don't talk about something, there's no way we'd know what's wrong.
  
### Missing Tasks/Meetings
- **Tasks** If you can't complete a task, let us know as soon as possible (i.e. as soon as you find out) so we can reorganize task assignment or move our schedule around.
- **Meetings** If you can't make a meeting, that's okay, and it's not detrimental. However, that would mean you can't provide your input on something live. You can share your thoughts and ideas in our group chat in this event so we can discuss your ideas. We do take meeting notes, so please read them to stay up to date with the team.

### Team Structure and Decision Making
- **Team Roles** We don't plan on having established team roles, but we'll try to have everyone do a bit of everything (to the best of our ability). The only real "role" we'll have is one note taker per meeting.
- **Task Tracking** We'll use the GitHub Projects tab/Kanban on the team repository. 
- **Decision Making** If it comes to a decision, we'll have a vote to decide (more votes = win).

### Addressing Problem Members
This is our protocol on addressing non-responsive teammates/those refusing to do work:
1. First offense: check in and see if everything is okay.
2. Second offense: what we do depends, but we'll talk with you again.
3. Clearly becoming a pattern: talk to a TA and/or the professor.

## Project Timeline Proposal

| Type | Date | Meeting/Due Time | To Complete Before Meeting | Discuss at Meeting |
| ---- | ---- | ---- | ---- |  ---- |
| Meeting | 2/22 | 2pm  | Read up on EDA checkpoint requirements; come into meeting with ideas on how to approach things | Discuss EDA and split up tasks for EDA checkpoint. |
| Meeting | 3/1  | 2pm  | Make 70-80% progress on EDA tasks | Check in on EDA progress and see what needs to be done. |
| **DUE** EDA Checkpoint | 3/4 | 11:59pm | - | - |
| Meeting | 3/8  | 2pm  | Wrap up any loose ends we didn't finish (if applicable). Read up on final project expectations. | Discuss final project tasks and split up tasks. |
| Meeting | 3/15 | 2pm  | Make about 80% progress on final project tasks. | Discuss the video work. |
| **DUE** Final Project + Video | 3/18 | 11:59pm | - | - |