# COGS 108 - Data Checkpoint

## Authors

- Lauren: project visualization, conceptualization, project administration
- Benson: software, experimental investigation, methodology
- Joseph: software, experimental investigation, methodology
- Heeyoung: writing - original draft, background research, data curation 
- Oscar: writing - review and editing, data curation

## Research Question

**How do variables such as game genre, price, average length of playtime, and achievement number influence how Steam games are rated on the Steam shop globally since the platform's launch in 2003? Can these variables be used to predict how a Steam game might be rated?**

Note: All of these variables are attributes listed on a game's Steam storefront page.
Genre refers to the available platforms and the number of players a game targets. Price refers to the cost in US dollars to purchase the game (including free games). Average length of playtime refers to the average number of hours it takes a player to reach an ending of the game for the first time. Achievement number refers to how many achievements can be attained by meeting in-game requirements.
We are restricting our population to games on Steam, and thus only reviewing ratings left on the Steam platform.
We are interested in sampling across a time period of 26 years and investigating whether there are any significant patterns in the aforementioned variables.


## Background and Prior Work

Steam, a digital distribution service, lets users buy video games from a digital store whose catalog spans thousands of publishers, and install and play those games from their own digital library.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Important attributes of each game, such as its genre, price, and number of achievements, are displayed on its store page, allowing interested players to quickly get a feel for a game they may want to purchase.

Reviews from players who purchased the game are also displayed, where each review provides a Positive or Negative rating, as well as how long the player played the game for.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

An overall game rating thus presents the percentage of positive reviews along with the number of reviews submitted (e.g., 80% positive reviews out of 100 total reviews is more convincing than 80% positive reviews out of 10 total reviews).
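
To illustrate why the review count matters alongside the positive percentage, the sketch below computes the lower bound of a Wilson score interval for the proportion of positive reviews. This is one common way to discount small samples; it is an illustrative assumption on our part, not how Steam computes its displayed rating, and the function name `wilson_lower_bound` is hypothetical.

```python
import math

def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return 0.0  # no reviews: nothing can be said about the rating
    p_hat = positive / total
    denom = 1 + z**2 / total
    centre = p_hat + z**2 / (2 * total)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2))
    return (centre - margin) / denom

# Same 80% positive share, different review counts.
many = wilson_lower_bound(80, 100)
few = wilson_lower_bound(8, 10)
print(f"80/100 -> {many:.3f}, 8/10 -> {few:.3f}")
```

The game with more reviews receives the higher lower bound, which matches the intuition that 80% positive out of 100 reviews is stronger evidence than 80% positive out of 10.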

However, some factors may systematically shift ratings. For example, certain genres may tend to be rated more highly by players, and a higher price could lead players to judge a game more harshly. We were therefore interested in finding whether the displayed variables have any significant influence on how a game is rated.

When looking through previous group projects as resources, we found several that examined how a rating is influenced by multiple factors, such as _Game of Thrones_ episode ratings and restaurant Michelin Star ratings.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)


The Game of Thrones project classified episode features by order of rating impact through a Random Forest method and XGBoost Classifier Model, finding Character Presence and Episode Sentiment as the two most impactful features through both methods. The Michelin Stars project analyzed restaurant names and cuisines through Support Vector Regression and Linear Regression on price, attempting to predict star ratings through these factors, and found no particular directional relationship of these factors with Michelin Star Rating.

Both projects provide valuable insight into how our group could approach our question. We can identify the variables most relevant to game rating and test specific algorithms to see whether they can predict rating from these factors; similar methods should apply because Steam games are also a form of consumed entertainment. However, unlike projects that focus on giving recommendations based on previous consumer activity, we aim to use concrete statistics such as cost and average playtime to analyze the factors behind a Steam game's rating.

There are also several publicly available repositories on GitHub that use Steam game datasets to build recommendation systems. One, by Christian Craft, analyzed reviews, popular tags, game details, and genres, and used the K-Nearest Neighbors and Collaborative Filtering methods to find games similar to an input game.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)

This project showed an example of how predictive machine learning could project similarities between given games, while our project would focus more on analyzing the existing potential impacts on game rating. 

Overall, these previous projects inspired us on the topic and methods related to analyzing data on ratings, and gave us a sense of what direction our project should head towards. We aim to take our investigation one step further and address the unexplored potential in whether the way players evaluate Steam games is impacted by objective properties of the games such as price and play time rather than subjective properties such as writing and style.

In general, game ratings reflect the quality and enjoyment a game delivers to its consumers. Finding relationships between game variables and ratings is significant because it could give game creators a reference for what to focus on during development to get the best possible reception from players. For consumers, better-received games could mean higher engagement and a better perception of the game, which could in turn increase how many people buy and play these games. For developers, increased consumption can result in increased sales, allowing them to generate more revenue for their company and reach a broader player base.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) Steam About Page. https://store.steampowered.com/about
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Information about Steam Reviews. https://help.steampowered.com/en/faqs/view/2DA6-9CB3-F84A-643E
3. <a name="cite_note-3"></a> [^](#cite_ref-3) COGS 108 Game of Thrones Project. https://github.com/COGS108/FinalProjects-Sp20/blob/master/FinalProject_group39.ipynb
4. <a name="cite_note-4"></a> [^](#cite_ref-4) COGS 108 Michelin Stars Project. https://github.com/COGS108/FinalProjects-Sp20/blob/master/FinalProject_group65.ipynb
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Steam Games Recommendation Project. https://github.com/craftstl97/SteamGamesAnalysis/blob/main/markdown/SemesterProject.ipynb

## Hypothesis


We predict that game length and achievement count have a strong positive correlation with the overall rating of a game, that genre is strongly associated with rating, and that cost has a negative correlation with rating. A higher cost may cause players to have higher expectations of a game's quality, since the high cost of certain games comes from game complexity and design, studio reputation, and development time, so more expensive games may be judged more harshly. Additionally, we predict that game genre has a high influence on the rating of a game, with more popular genres such as FPS games generally receiving higher ratings than puzzle or gacha games, because some players feel frustrated by genre-specific mechanics such as overly difficult or unintuitive puzzles or unforgiving gacha rates.
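
One way to check the direction of the numeric relationships predicted above is a correlation coefficient between a feature and the positive-review percentage. The sketch below implements Pearson's r from its definition on made-up toy values (not taken from our datasets); `pearson_r` is a hypothetical helper, and genre, being categorical, would need a different comparison such as group means.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed directly from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only (NOT from the Steam datasets).
toy_price = [0.0, 4.99, 9.99, 19.99, 59.99]
toy_percent_positive = [92.0, 88.0, 85.0, 80.0, 70.0]

r = pearson_r(toy_price, toy_percent_positive)
print(f"r = {r:.2f}")  # a negative r would be consistent with the predicted price effect
```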

## Data

### Data overview

- Dataset #1
    - Dataset Name: Steam games Dataset 2025
    - Link to the dataset: https://www.kaggle.com/datasets/srgiomanhes/steam-games-dataset-2025/data
    - Number of Observations: 71429 games
    - Number of variables: 21 columns
    - Most relevant variables:
        - steam_appid: ID of the game within Steam's database
        - name: Name of a given game
        - categories: Tags used to help describe what the games feature
        - genres: Tags used to help describe the gameplay content
        - release_date: Release date of the game; could reveal connections and correlations with reviews
        - is_free: Flags free games, making it easier to separate them from paid games
    - Shortcomings:
        - Lacks current cost; only has initial pricing in USD
        - Smaller of the two datasets
- Dataset #2
    - Dataset Name: Steam Games Dataset
    - Link to the dataset: https://www.kaggle.com/datasets/fronkongames/steam-games-dataset/data
    - Number of observations: 122611 games
    - Number of variables: 40 columns
    - Relevant variables:
        - AppID: ID of the game within Steam's database
        - Name: Name of a given game
        - Estimated Owners: Shows how successful the game was at selling
        - Playtime: Measure of how successful the game is at retaining players
        - Peak CCU: Peak concurrent players - Measure of how successful a game was at its peak
        - Positive/Negative: Reviews, measures player's actual enjoyment of the game
        - Metacritic/User Score: More official, third party scores of the game, from a critic and user standpoint
        - Price: The cost of the game
    - Shortcomings:
        - Was not initially tidy: the header row was missing a comma
        - Very large dataset that takes a lot of memory to process

We plan to combine the two datasets by matching games on their Steam app ID (`steam_appid` in Dataset #1, `AppID` in Dataset #2) and keeping only games that appear in both.
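
Keeping only games that appear in both datasets corresponds to an inner join on the app ID. A dependency-free sketch of that operation follows; the rows are made up, and `inner_join` is a hypothetical helper rather than the `pd.merge` call our notebook uses.

```python
def inner_join(left_rows, right_rows, key):
    """Keep only left rows whose key also appears on the right (assumes unique right keys)."""
    right_by_key = {row[key]: row for row in right_rows}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left_rows
        if row[key] in right_by_key
    ]

# Made-up rows for illustration only.
reviews = [
    {"steam_appid": 10, "name": "Game A", "n_achievements": 12},
    {"steam_appid": 20, "name": "Game B", "n_achievements": 0},
]
prices = [
    {"AppID": 20, "Price": 4.99},
    {"AppID": 30, "Price": 9.99},
]

# Rename the key on the left side so both sides use "AppID".
reviews = [
    {("AppID" if k == "steam_appid" else k): v for k, v in row.items()}
    for row in reviews
]

combined = inner_join(reviews, prices, key="AppID")
print(combined)
```

Only the game present on both sides survives, which is why an inner join can shrink the table considerably compared with either source.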

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [5]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

!pip install -q gdown
import os
import sys

import gdown

# make sure the target directory exists before downloading into it
os.makedirs("data/00-raw", exist_ok=True)

file_id = "166o0UjmSgPYyMD8vzIe178IhtWHKG2Q0"  # the long string after /d/ in the share link
url = f"https://drive.google.com/uc?id={file_id}"

gdown.download(url, "data/00-raw/steam_games_2.csv", quiet=False)

sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://drive.google.com/uc?id=1lF1c8fHKD7Ymzzlw3dNL_Q9IbdsgKagh&export=download', 'filename':'steam_games_1.csv'},
    { 'url': 'https://drive.google.com/uc?id=17VtzFY3WFINJ3buja9MoNRjZzBBVdKgA&export=download', 'filename':'steam_games_3.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Downloading...
From (original): https://drive.google.com/uc?id=166o0UjmSgPYyMD8vzIe178IhtWHKG2Q0
From (redirected): https://drive.google.com/uc?id=166o0UjmSgPYyMD8vzIe178IhtWHKG2Q0&confirm=t&uuid=6970edfa-a010-4294-944b-f6027248aeaa
To: /home/jotse/Group100_WI26/data/00-raw/steam_games_2.csv
100%|██████████| 389M/389M [00:07<00:00, 54.9MB/s]
Successfully downloaded: steam_games_1.csv
Successfully downloaded: steam_games_3.csv

### Steam Games Dataset: Genres, Achievements, Reviews 

In [12]:
import pandas as pd

# Step A: Load the dataset
steam1 = pd.read_csv("data/00-raw/steam_games_1.csv")

# Step B: Dataset tidiness
steam1.info()

# Step C: Dataset size
steam1.shape

# Step D: Missing data
steam1.isna().sum()  # 0 for every column
(steam1 == "").sum()  # Checks for empty strings: 0 for every column
steam1.apply(lambda col: col.astype(str).str.lower().isin(["unknown", "n/a", "na", "null"]).sum())  # Checks each column for placeholder terms: 1 in "name", 0 elsewhere
steam1.apply(lambda col: col.astype(str).str.strip().eq("").sum())  # Checks for whitespace-only values: 0 in all columns

# One game's name matches a placeholder term, so inspect it
# (the terms are lowercase because the names are lowercased before the comparison)
steam1[steam1["name"].str.lower().isin(["unknown", "na", "n/a", "null", "nan"])]
# After looking up its steam_appid on the official Steam page, "Unknown" turned out to be the actual name of a game, so it is a false positive.

# Therefore, there is no missing data.

# Step E: Outliers and/or suspicious entries
steam1.describe()

# Note: price_initial has a max of $999.98
steam1[steam1["price_initial (USD)"] > 100]
# Many games have an initial price over $100 USD, so count them
(steam1["price_initial (USD)"] > 100).sum()
# There are 171 games with an initial price over $100 USD.
# Such prices are likely a strategy that developers intentionally use to prevent early purchases while reserving their game page on Steam.
# Because these values would distort the price variable in our research question, we do not use this column from this dataset.

# Note: required_age has a max of 97.
(steam1["required_age"] > 30).sum()
# There are 2519 games with a required age over 30, which is unusual considering that standard age rating systems go up to 21+.
# A required age above 21 could indicate an illegitimate age rating,
# so games with a required age over 30 are removed for improved data accuracy.

# Number of achievements: counts as high as 9.8k seem unusual; we note them but do not clean them immediately
steam1[steam1["n_achievements"] > 300][["name", "n_achievements"]].head()
# All team members agreed to leave these games in, as they are still considered games

#Step F: Cleaning the Data

steam1 = steam1[steam1["required_age"] <= 30]
steam1 = steam1[steam1["is_released"] == True]
steam1 = steam1.drop(columns=["price_initial (USD)", "developers", "publishers", "platforms", "review_score_desc", "additional_content", "required_age", "is_released", "metacritic"])

# Removed rows and columns that would harm our analysis or that are unnecessary.

#Step G: Saving Processed Dataset
steam1.to_csv("data/02-processed/steam_games_1_cleaned.csv", index = False)

steam1.head()

| | steam_appid | name | categories | genres | n_achievements | release_date | total_reviews | total_positive | total_negative | review_score | positive_percentual | is_free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2719580 | 勇者の伝説の勇者 | ['Single-player', 'Family Sharing'] | ['Casual', 'Indie'] | 0 | 2024-01-04 | 0 | 0 | 0 | 0.0 | 0.0 | False |
| 2 | 2719600 | Lorhaven: Cursed War | ['Single-player', 'Multi-player', 'PvP', 'Shar... | ['RPG', 'Strategy'] | 32 | 2024-01-26 | 9 | 8 | 1 | 0.0 | 88.9 | False |
| 3 | 2719610 | PUIQ: Demons | ['Single-player', 'Steam Achievements', 'Famil... | ['Action', 'Casual', 'Indie', 'RPG'] | 28 | 2024-02-17 | 0 | 0 | 0 | 0.0 | 0.0 | False |
| 4 | 2719650 | Project XSTING | ['Single-player', 'Steam Achievements', 'Steam... | ['Action', 'Casual', 'Indie', 'Early Access'] | 42 | 2024-01-05 | 9 | 9 | 0 | 0.0 | 100.0 | False |
| 7 | 2719710 | Manor Madness | ['Single-player', 'Steam Achievements', 'HDR a... | ['Action', 'Adventure', 'Indie', 'RPG', 'Simul... | 5 | 2024-01-15 | 0 | 0 | 0 | 0.0 | 0.0 | True |
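
The placeholder check from Step D can be wrapped in a small reusable helper. The sketch below is a dependency-free version that works on a list of row dictionaries; the function name `count_placeholders`, the placeholder set, and the sample rows are illustrative assumptions. As with the game titled "Unknown", any hit still needs a manual look before it is treated as missing.

```python
PLACEHOLDERS = {"", "unknown", "n/a", "na", "null", "nan"}

def count_placeholders(rows, placeholders=PLACEHOLDERS):
    """Count, per column, the values that look like a missing-data placeholder."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # Normalise the same way for every value: strip whitespace, lowercase.
            text = "" if value is None else str(value).strip().lower()
            counts[column] = counts.get(column, 0) + int(text in placeholders)
    return counts

# Made-up rows for illustration only.
rows = [
    {"name": "Unknown", "genres": "['Indie']"},  # a real title, so a false positive
    {"name": "Game A", "genres": "  "},          # whitespace-only value
    {"name": "Game B", "genres": None},          # explicit missing value
]
flags = count_placeholders(rows)
print(flags)
```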


### Steam Games Dataset: Cost, Genres

In [14]:
# Step A: Load the dataset
steam2 = pd.read_csv("data/00-raw/steam_games_2.csv")
# Step B: Dataset tidiness
steam2.info()
steam2.columns
# Tidiness was an issue.
# The original CSV file was missing a comma in the header, so the columns were misaligned and all information was displayed incorrectly.
# After adding the comma back in with a text editor, all information is displayed correctly, so the data is tidy.

#Step C: Dataset Size
steam2.shape
#Size of (122611, 40)

#Step D: Missing Data
steam2.isna().sum()  # A lot of missing data. Much of it is in columns we do not need, but notable gaps are Genres (8413), Categories (8953), and Name (1)
steam2[["Genres", "Categories"]].isna().sum()  # Missing 8413 Genres and 8953 Categories
steam2[["Genres", "Categories"]].isna().mean() * 100  # 6.86% of Genres and 7.3% of Categories are missing

steam2.apply(lambda col: col.astype(str).str.strip().eq("").sum())  # No column has empty strings
# Checks for placeholder terms (lowercase, because the values are lowercased first); real NaN values are already counted by isna() above.
steam2.apply(lambda col: col.astype(str).str.lower().isin(["unknown", "n/a", "na", "null"]).sum())  # Hits mostly fall in columns we do not need, such as About the game and Support url.


steam2[steam2["Genres"].isna()]["Price"].describe()  # Games with missing genres have a mean price of 0.096, a max of 129.99, and a std of 1.88. The 75th percentile is 0, showing that most of them are free.
# The missingness seems systematic: most games that lack Genres are free, so these entries likely never had genres assigned

steam2[steam2["Categories"].isna()]["Price"].describe()  # Similar for Categories, which may indicate that these free games are not popular enough to be considered.

#Step E: Outliers and/or suspicious entries
steam2.describe()  # Price has a max of 999.98, which is an outlier.
(steam2["Price"] > 100).sum()  # 309
# In the current video game market, game prices should not exceed $100.
# AAA games, known for being the priciest, are typically priced at $70, so we treat anything over $100 as unreasonable and likely a placeholder price used to reserve a store page for an unfinished game.

#Step F: Cleaning Data
steam2 = steam2[steam2["Price"] <= 100]
# steam2 = steam2.drop(columns=["Discount", "Required age", "DLC count", "About the game", "Supported languages", "Full audio languages", "Reviews", "Header image", "Website", "Support url", "Support email", "Windows", "Mac", "Linux", "Metacritic url", "Recommendations", "Notes", "Developers", "Publishers", "Screenshots", "Movies", "Score rank", "Tags", "User Score"])
steam2 = steam2.dropna(subset=["Genres", "Categories"]) #Drop the ones where there are no genres, as it is a necessary component

steam2.isna().sum()
steam2.shape #We are left with 113232 entries

# Step G: Saving processed dataset
steam2.to_csv("data/02-processed/steam_games_2_cleaned.csv", index = False)

steam2.head()

| | AppID | Name | Release date | Estimated owners | Peak CCU | Required age | Price | Discount | DLC count | About the game | ... | Average playtime two weeks | Median playtime forever | Median playtime two weeks | Developers | Publishers | Categories | Genres | Tags | Screenshots | Movies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 496350 | Supipara - Chapter 1 Spring Has Come! | Jul 29, 2016 | 0 - 20000 | 0 | 0 | 5.24 | 65 | 0 | Springtime, April: when the cherry trees come ... | ... | 0 | 8 | 0 | minori | MangaGamer | Single-player,Steam Trading Cards,Steam Cloud,... | Adventure | Adventure,Visual Novel,Anime,Cute | https://shared.akamai.steamstatic.com/store_it... | |
| 2 | 1034400 | Mystery Solitaire The Black Raven | May 6, 2019 | 0 - 20000 | 0 | 0 | 4.99 | 0 | 0 | Immerse yourself in the most beloved, mystical... | ... | 0 | 0 | 0 | Somer Games | 8floor | Single-player,Family Sharing | Casual | Casual,Card Game,Solitaire,Puzzle,Hidden Objec... | https://shared.akamai.steamstatic.com/store_it... | |
| 3 | 3292190 | 버튜버 파라노이아 - Vtuber Paranoia | Oct 31, 2024 | 0 - 20000 | 1 | 0 | 8.99 | 0 | 1 | synopsis 'Hello, I'm Hiyoro, a new YouTuber!' ... | ... | 0 | 0 | 0 | 유진게임즈 | 유진게임즈 | Single-player,Steam Achievements,Family Sharing | Casual,Indie,Simulation | | https://shared.akamai.steamstatic.com/store_it... | |
| 4 | 3631080 | Maze Quest VR | Apr 24, 2025 | 0 - 20000 | 0 | 0 | 4.99 | 0 | 0 | Its not just a Maze; its a Quest! Enter the ca... | ... | 0 | 0 | 0 | Reality Expanded LLC | Reality Expanded LLC | Single-player,VR Only,Steam Leaderboards,Famil... | Action,Early Access | | https://shared.akamai.steamstatic.com/store_it... | |
| 5 | 1654170 | Agony VR | Apr 5, 2023 | 0 - 20000 | 0 | 0 | 13.99 | 0 | 0 | A JOURNEY THROUGH HELL! Explore the most terri... | ... | 0 | 0 | 0 | Ignibit | Ignibit,Madmind Studio | Single-player,Tracked Controller Support,VR On... | Action,Adventure | | https://shared.akamai.steamstatic.com/store_it... | |
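
The header problem described in Step B (a header row with fewer fields than the data rows) can be detected before the file is loaded into pandas. A minimal sketch using the standard `csv` module follows; `find_ragged_rows` is a hypothetical helper, and the in-memory sample is a made-up stand-in for the real file.

```python
import csv
import io

def find_ragged_rows(text_stream, max_report=5):
    """Return (header_width, [(line_number, row_width), ...]) for rows that do not match the header."""
    reader = csv.reader(text_stream)
    header = next(reader)
    mismatches = []
    for row in reader:
        if len(row) != len(header):
            mismatches.append((reader.line_num, len(row)))
            if len(mismatches) >= max_report:
                break  # a few examples are enough to diagnose the problem
    return len(header), mismatches

# Toy CSV whose header is missing one comma ("PricePositive" fuses two column names).
sample = io.StringIO(
    "AppID,Name,PricePositive\n"
    '1,"Game, The",4.99,10\n'
    "2,Other Game,0.0,3\n"
)
width, bad = find_ragged_rows(sample)
print(width, bad)
```

For a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`; if every data row is one field wider than the header, a missing header comma is a likely cause.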


In [15]:
#Merging the two datasets based on ID of the game.
steam1 = steam1.rename(columns={"steam_appid": "AppID"})
steam_games = pd.merge(steam1, steam2, on="AppID", how="inner")
steam_games.shape  # 54250 games are present in both datasets

steam_games.head()  # Both datasets contribute positive and negative review columns, and their values differ. The best choice is to keep only one set.
steam_games = steam_games.drop(columns=["total_reviews", "total_positive", "total_negative", "Name", "Achievements", "Categories", "Genres", "Release date", "review_score", "positive_percentual"])

# Each variable now appears only once, so we reorder the columns in a sensible manner

steam_games = steam_games[["AppID", "name", "release_date", "Price", "genres", "categories", "Estimated owners", "Peak CCU", "Average playtime forever", "Average playtime two weeks", "Median playtime forever", "Median playtime two weeks", "Positive", "Negative", "is_free"]]

# Now that the table is in its final form, save it
steam_games.to_csv("data/02-processed/steam_games.csv", index = False)
steam_games.head()


| | AppID | name | release_date | Price | genres | categories | Estimated owners | Peak CCU | Average playtime forever | Average playtime two weeks | Median playtime forever | Median playtime two weeks | Positive | Negative | is_free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2719580 | 勇者の伝説の勇者 | 2024-01-04 | 0.99 | ['Casual', 'Indie'] | ['Single-player', 'Family Sharing'] | 0 - 20000 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | False |
| 1 | 2719600 | Lorhaven: Cursed War | 2024-01-26 | 0.99 | ['RPG', 'Strategy'] | ['Single-player', 'Multi-player', 'PvP', 'Shar... | 0 - 20000 | 0 | 0 | 0 | 0 | 0 | 12 | 5 | False |
| 2 | 2719610 | PUIQ: Demons | 2024-02-17 | 2.99 | ['Action', 'Casual', 'Indie', 'RPG'] | ['Single-player', 'Steam Achievements', 'Famil... | 0 - 20000 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | False |
| 3 | 2719650 | Project XSTING | 2024-01-05 | 7.99 | ['Action', 'Casual', 'Indie', 'Early Access'] | ['Single-player', 'Steam Achievements', 'Steam... | 0 - 20000 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | False |
| 4 | 2719710 | Manor Madness | 2024-01-15 | 0.0 | ['Action', 'Adventure', 'Indie', 'RPG', 'Simul... | ['Single-player', 'Steam Achievements', 'HDR a... | 0 - 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True |
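
In the merged table, `genres` and `categories` hold Python-style list literals stored as strings, and `Estimated owners` holds a range such as `0 - 20000`. Both need to be converted before analysis. The sketch below shows one way to do that with the standard library; the helper names `parse_list_cell` and `parse_owner_range` are hypothetical, and the code assumes the cells are well-formed.

```python
import ast

def parse_list_cell(cell):
    """Turn a string such as "['RPG', 'Strategy']" into a real list of strings."""
    value = ast.literal_eval(cell)  # safely evaluates literals only, never arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return [str(item) for item in value]

def parse_owner_range(cell):
    """Turn a range such as "0 - 20000" into a (low, high) tuple of integers."""
    low, high = (int(part.strip()) for part in cell.split("-"))
    return low, high

genres = parse_list_cell("['RPG', 'Strategy']")
owners = parse_owner_range("0 - 20000")
print(genres, owners)
```

With pandas, the same helpers could be applied column-wise, for example `steam_games["genres"].apply(parse_list_cell)`.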


## Ethics

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
       
> Our project will be focusing on data on video games, not human subjects. This data is publicly available and displayed by developers and publishers on Steam for the purpose of informing interested individuals about their games.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
       
> We will check for and acknowledge any unusual patterns in the data for any specific games seemingly being over or under represented.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
       
> The variables of interest we will collect, such as genre, rating, and other descriptive measurements of games, pertain to the games themselves and not to any individuals.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
       
> The variables we are collecting information on with regard to video games are not based on protected human group status.



### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
       
> The data we are working with is publicly available and accessible to any interested user through Kaggle. However, the data in its edited form for the project will only be directly available through our group's local machines.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
       
> No personal or sensitive information of individuals is being collected or analyzed. However, we will be open to information takedown requests through our contacts such as by email.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
       
> We will likely retain the data until the members of this group graduate or longer, but we will be open to information takedown if needed. However, the datasets we will use are all still publicly available.


### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
       
> We will be open to discussing and critiquing our analysis and findings with any interested stakeholders. We have looked through previous projects and work done in the area of Steam game data analysis to consider our research question.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
       
> We acknowledge that reviews collected from Steam will likely not generalize to a broader population, so we are designating our population of interest as Steam users specifically. We will examine our data and the source of our dataset to consider any biased processes that could have impacted data collection.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
       
> We will design our visualizations, summary statistics, and reports to be as legible and reflective of the underlying data as possible. We will not tamper with our analyses to present our ideal findings over the facts.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
       
> No data with PII will be used in the analyses.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
       
> We will document each step of our analysis clearly for any future reader.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
       
> The variables involved in the model are objective metrics of video games such as genre and achievement number that are not collected through discriminatory methods.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
       
> We will test our model's results for fairness across the various aspects of video games.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
       
> We are looking into the most relevant variables in the datasets available to us that can effectively describe video games and can potentially affect how people rate games. So far, game genre, playtime, game price, and game achievement number seem like a solid start.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
       
> We will explain as best we can why our models behave in specific ways given the data available to them.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
       
> We will acknowledge, as clearly as we can, the limitations of models built on the limited data collected on Steam games.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
       
> We can examine how generally applicable our model of predicting Steam game ratings is on data of new games released this year to see the consistency of its performance.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
       
> If anyone is harmed by our model, we will consider what aspect of the model could have been used to harm them and how we can clarify the intended use for our model instead. For example, if consumers were to actually base their purchasing decisions off of the model predicting game ratings and thereby denying new games a playerbase, we would clarify that the model results are intended to be predictive and reflect data collected on previously released games and cannot always generalize well.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
       
> We will be open to taking down our model from public access if it is misused.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
       
> Unintended uses and/or abuses could be game publishers' playerbases being significantly impacted due to unfavorable predicted game ratings from the model, or model results of predicted ratings being used to undeservedly slander or praise particular games. We will be open to being contacted for any concerns over how our model is being used.

## Team Expectations 

**Overview**: -> Heeyoung 

* Write a clear summary of what you did
* Briefly describe the results of your project
* Limit overview to 3-4 sentences

**Research Question**: -> Heeyoung
* Include a specific, clear data science question
* Make sure what you're measuring (variables) to answer the question is clear

**Background & Prior Work**: -> Heeyoung (first draft: find background info and citations), Oscar (second draft/revisions to first draft and make a list of citations)
* Include a general introduction to your topic
* Include explanation of what work has been done previously
* Include citations or links to previous work

**Hypothesis**: -> Benson
* Include your team's hypothesis
* Ensure that this hypothesis is clear to readers
* Explain why you think this will be the outcome (what was your thinking?)

**Dataset(s)**: -> Lauren
* Include an explanation of dataset(s) used (i.e. features/variables included, number of observations, information in dataset)
* Source included (if outside dataset(s) being used)

**Data Analysis**: -> Benson, Joseph 
* Data Cleaning & Pre-processing
* Perform Data Cleaning and explain steps taken OR include an explanation as to why data cleaning was unnecessary (how did you determine your dataset was ready to go?)
* Dataset actually clean and usable after data wrangling steps carried out

**Data Visualization:** -> Lauren 

* Include at least three visualizations
* Clearly label all axes on plots
* Type of all plots appropriate given data displayed
* Interpretation of each visualization included in the text

**Data Analysis & Results** -> Benson + Joseph (first draft: explain what was done), Lauren (second draft: interpretation)

* EDA carried out with explanations of what was done and interpretations of output included
* Appropriate analysis performed
* Output of analysis interpreted and interpretation included in notebook

**Privacy/Ethics Considerations:** -> Heeyoung 
* Thoughtful discussion of ethical concerns included
* Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
* How your group handled bias/ethical concerns clearly described

**Conclusion & Discussion:** -> Joseph (first draft), Oscar (revisions to first draft) 
* Clear conclusion (answer to the question being asked) and discussion of results
* Limitations of analysis discussed
* Does not ramble on beyond providing necessary information



## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/4  |  1 PM | Brainstorm topics/questions, decide on final project topic  | Finish project proposal | 
| 2/8  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/14 | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/18  | 6 PM  | Import & Wrangle Data; EDA; make revisions based on given critiques | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |