# CS304 Data Science Project
## Title: **The Performance of Video Games Based on the Masses**
### Team Members: **Christopher Holt, Ethan Smith**

## Abstract:

**Video games are entertainment purchases, and the pressure is on developers to present a product that their customer base will enjoy. Thus, public opinion is important to the immediate and long-term success of these projects. We plan to use a Kaggle-sourced data set regarding Steam’s library of video games. The data source contains various characteristics of video games, including user scores, average playtime, price, and more. In examining this data set, we want to explore the question: “Does public opinion affect a game’s ability to garner support in the form of owners and playtime?” The ability to visualize potential relationships between factors such as user scores, playtime, and cost would aid in understanding the influence of public opinion, whether present or not, on the perceived success of games through playtime metrics. We plan to use scatterplots and histograms to create models that illustrate comparisons among the previously mentioned aspects of the data set. In addition to our main question, any outliers in the data will be highlighted to uncover exceptions and subtler dynamics. Possible cases might include games that have high user scores but are extremely inexpensive or expensive, games that have low user scores but considerable playtime, or games that have high user scores and have very little playtime. We believe negative reviews can diminish engagement even among current owners, motivated by social-image concerns of appearing as an "incompetent consumer," leading users to reduce playtime after reading negative feedback. This aligns with broader observations that negative sentiment affects post-purchase behavior, not just buying decisions. By integrating these insights with our visual modeling, our study aims to reveal how sentiment shapes long-term engagement. 
We anticipate confirming that higher ratios of negative reviews are associated with lower playtime per owner, providing developers with both analytical tools and a strategic perspective on managing user sentiment to maintain player engagement.**

## Setup

In [2]:
# Setup: libraries and utilities used throughout the notebook
import pandas as pd
from requests import get
import pathlib
import subprocess
import numpy as np

## Data Set / Acquisition


### The Steam Games Dataset

The dataset is a comprehensive collection of game information from the Steam digital distribution platform, containing detailed attributes for over ninety thousand games. It was curated to provide insights into the Steam ecosystem, including game characteristics, user engagement metrics, and market trends.

---

The dataset was compiled by Joakim Arvidsson through systematic web scraping of Steam's public API and store pages, focusing on the most popular games to create a representative sample of the platform's content. The collection process involved:

- Structured data extraction from Steam's public interfaces
- Standardized formatting of game attributes
- Quality control measures to ensure data consistency
- Annual updates to maintain currency of information


The dataset is organized into a tabular format with the following key attributes:

| Attribute | Data Type | Description | Purpose |
|---|---|---|---|
| `app_id` | Integer | Unique Steam identifier | Primary key for game identification |
| `name` | String | Game title | Game identification |
| `release_date` | Date | Launch date on Steam | Temporal analysis |
| `price` | Float | Current price in USD | Monetary analysis |
| `categories` | Array | Game genres/tags | Content classification |
| `platforms` | Array | Supported operating systems | Compatibility tracking |
| `achievements` | Integer | Number of achievements | Game feature analysis |
| `positive_ratings` | Integer | Number of positive reviews | Quality assessment |
| `negative_ratings` | Integer | Number of negative reviews | Quality assessment |
| `average_playtime` | Float | Average hours played | Engagement metrics |
| `owners` | String | Estimated player count range | Popularity metrics |
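
The attribute names in the table are descriptive. The CSV loaded later in this notebook shows headers such as `AppID`, `Name`, `Release date`, `Estimated owners`, and `Price`, so checking headers and inferred datatypes is a reasonable first step. The sketch below is only an illustration: it uses a two-row stand-in frame with values copied from the `df.head()` output rather than the full file, and the `%b %d, %Y` date format is an assumption based on those two rows.

```python
import pandas as pd

# Stand-in for the real CSV: two rows copied from the notebook's df.head() output.
sample = pd.DataFrame(
    {
        "AppID": [1424640, 402890],
        "Name": ["余烬", "Nyctophilia"],
        "Release date": ["Oct 3, 2020", "Sep 23, 2015"],
        "Estimated owners": ["20000 - 50000", "50000 - 100000"],
        "Price": [3.99, 0.0],
    }
)

# Dates arrive as strings; parse them so they can be used for temporal analysis.
# errors="coerce" turns any value that does not match the assumed format into NaT.
sample["Release date"] = pd.to_datetime(
    sample["Release date"], format="%b %d, %Y", errors="coerce"
)

# Show the datatype pandas inferred for each column.
print(sample.dtypes)
```

On the full file, counting the `NaT` values after parsing would show how many release dates do not follow the assumed format.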
### Data Quality and Limitations

The dataset maintains high data quality through:

- Annual updates to reflect current Steam store information
- Consistent formatting across all entries
- Comprehensive coverage of key game attributes
- Standardized rating calculation methods

Important considerations:

- Owner counts are provided as ranges (e.g., "20000 - 50000")
- Ratings are subject to change over time
- Some games may have incomplete information for certain fields
- Data represents a snapshot of the Steam store at time of collection
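
Because owner counts are stored as text ranges, they need to be converted before they can be plotted or compared numerically. The following is a minimal sketch of one way to do that. The helper names `parse_owner_range` and `owner_midpoint` are hypothetical, and summarizing a range by its midpoint is an assumption of this sketch, not something the dataset provides.

```python
import math


def parse_owner_range(text):
    """Split a range such as '20000 - 50000' into (low, high) integers.

    Returns (None, None) for missing values so callers can filter them out.
    """
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return (None, None)
    # Tolerate thousands separators such as '1,000 - 2,000'.
    low, high = (part.strip().replace(",", "") for part in str(text).split("-"))
    return (int(low), int(high))


def owner_midpoint(text):
    """Midpoint of the range: a crude single-number stand-in for ownership."""
    low, high = parse_owner_range(text)
    if low is None:
        return None
    return (low + high) / 2


print(owner_midpoint("20000 - 50000"))
```

Applied to a DataFrame column, this could look like `df["Estimated owners"].map(owner_midpoint)`, with missing ranges coming back as `None`.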

This dataset is particularly valuable for:

- Game market analysis
- Popularity trend studies
- Genre distribution research
- Player engagement patterns
- Price-point analysis

The structured format and comprehensive attribute set make it suitable for both exploratory data analysis and machine learning applications, while maintaining data integrity through consistent formatting and regular updates.

---
**Do not run the code below. It was used to get the dataset into our Colab project, and running it again will at best generate more copies of our roughly 300 MB file. That is ill-advised, as we don't know how much space is allotted to us for this class.**

---

In [5]:
# Load data from an external, public source
from requests import get

In [7]:
r = get('https://www.kaggle.com/api/v1/datasets/download/joebeachcapital/top-1000-steam-games')
r, r.status_code

(<Response [200]>, 200)

In [8]:
import pathlib
import subprocess

In [25]:
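ds = pathlib.Path('data/steam_games.zip')
# Create the data/ directory if it is missing so write_bytes does not fail
ds.parent.mkdir(parents=True, exist_ok=True)
ds.write_bytes(r.content)
print(ds)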

data/steam_games.zip


In [26]:
pathlib.Path('data/steam_games.zip').exists()

True

In [33]:
subprocess.run(['unzip', 'data/steam_games.zip'])

CompletedProcess(args=['unzip', 'data/steam_games.zip'], returncode=0)
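
One way to make the unzip step safe to re-run is to skip it when the expected file is already present. The sketch below uses the standard-library `zipfile` module in place of the `unzip` command. The helper name `extract_if_missing` and its parameters are hypothetical and not part of the original notebook.

```python
import pathlib
import zipfile


def extract_if_missing(zip_path, dest_dir, expected_name):
    """Extract zip_path into dest_dir unless expected_name is already there.

    Returns True when extraction ran and False when it was skipped.
    """
    dest = pathlib.Path(dest_dir)
    if (dest / expected_name).exists():
        # The file from a previous run is still present; do nothing.
        return False
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return True
```

With the paths used in this notebook, the call would be `extract_if_missing('data/steam_games.zip', '.', '93182_steam_games.csv')`.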

In [37]:
subprocess.run('ls -la', shell=True, capture_output=True).stdout.splitlines()

[b'total 306608',
 b'drwxr-xr-x 1 root root      4096 Sep 12 17:05 .',
 b'drwxr-xr-x 1 root root      4096 Sep 12 16:29 ..',
 b'-rw-r--r-- 1 root root 297347038 Sep 15  2024 93182_steam_games.csv',
 b'drwxr-xr-x 4 root root      4096 Sep  9 13:46 .config',
 b'drwxr-xr-x 2 root root      4096 Sep 12 17:05 data',
 b'drwxr-xr-x 1 root root      4096 Sep  9 13:46 sample_data',
 b'-rw-r--r-- 1 root root  15969959 Sep 15  2024 steam_app_data.csv',
 b'-rw-r--r-- 1 root root    622825 Sep 15  2024 steamspy_data.csv']

In [16]:
import numpy as np

---
Alright, all of the code above got our data loaded into the Google Colab environment. It does not need to be run again, but I am leaving it in to show our work. With the cell below, our dataframe is stored in the variable *df*.

---

In [43]:
df = pd.read_csv('./93182_steam_games.csv')
df.head()



Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
0,1424640,余烬,"Oct 3, 2020",20000 - 50000,0,0,3.99,0,'Ashes of war' is an anti war theme adventure ...,['Simplified Chinese'],...,0,0,0,宁夏华夏西部影视城有限公司,宁夏华夏西部影视城有限公司,"Single-player,Family Sharing","Adventure,Casual,Indie,RPG","Sokoban,RPG,Puzzle-Platformer,Exploration,Adve...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
1,402890,Nyctophilia,"Sep 23, 2015",50000 - 100000,0,0,0.0,0,NYCTOPHILIA Nyctophilia is an 2D psychological...,"['English', 'Russian']",...,0,0,0,Cat In A Jar Games,Cat In A Jar Games,Single-player,"Adventure,Free To Play,Indie","Free to Play,Indie,Adventure,Horror,2D,Pixel G...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
2,1151740,Prison Princess,"Apr 2, 2020",0 - 20000,0,0,19.99,0,"ABOUT Now nothing more than a phantom, can the...","['English', 'Simplified Chinese', 'Traditional...",...,0,0,0,qureate,qureate,"Single-player,Steam Achievements,Full controll...","Adventure,Indie","Sexual Content,Adventure,Indie,Nudity,Anime,Ma...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
3,875530,Dead In Time,"Oct 12, 2018",0 - 20000,0,0,7.99,0,Is a hardcore action with a non-trivial level ...,"['English', 'Russian']",...,0,0,0,Zelenov Artem,Zelenov Artem,"Single-player,Full controller support,Family S...","Action,Indie","Action,Indie,Souls-like,Fantasy,Early Access,R...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
4,1835360,Panacle: Back To Wild,"Mar 11, 2022",0 - 20000,2,0,3.99,0,Panacle: Back to the Wild is a indie card game...,"['English', 'Japanese', 'Simplified Chinese', ...",...,0,0,0,渡鸦游戏,"渡鸦游戏,电钮组","Single-player,Family Sharing","Indie,Strategy,Early Access","Trading Card Game,Turn-Based Strategy,Lore-Ric...",https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...


In [44]:
df.tail()

Unnamed: 0,AppID,Name,Release date,Estimated owners,Peak CCU,Required age,Price,DLC count,About the game,Supported languages,...,Average playtime two weeks,Median playtime forever,Median playtime two weeks,Developers,Publishers,Categories,Genres,Tags,Screenshots,Movies
93177,2251030,Mutant Monty (C64/CPC/Spectrum),"Jan 5, 2023",,0,0,4.99,0,Originally released in 1984 for home microcomp...,['English'],...,0,0,0,Artic Computing,Pixel Games UK,"Single-player,Partial Controller Support,Steam...",Action,,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
93178,1330890,Crystal Caves HD,"Oct 15, 2020",,0,0,7.99,0,The best miner in the Galaxy is back! Revisit ...,['English'],...,0,0,0,Emberheart Games,Apogee Entertainment,"Single-player,Steam Achievements,Full controll...",Action,,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
93179,1844230,Malicious ReloadⅡ,"Sep 5, 2023",,0,0,5.99,0,★ To ensure that the game you have purchased w...,"['Japanese', 'English', 'Simplified Chinese', ...",...,0,0,0,UNDER HILL,Playmeow,"Single-player,Family Sharing","Action,Adventure,Simulation",,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
93180,2623690,Mutant Frog,"Jan 27, 2024",,0,0,0.99,0,As a result of an unknown meteorite hitting an...,['English'],...,0,0,0,Run-O Games,Run-O Games,"Single-player,Family Sharing","Action,Adventure,Casual,Indie",,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...
93181,2313950,CRAZY GUY,"Mar 20, 2023",,0,0,9.99,0,CRAZY GUY is the story of a space tramp who tr...,"['English', 'Russian']",...,0,0,0,NIK Studios,NIK Studios,"Single-player,Steam Achievements,Family Sharing","Action,Adventure,Casual",,https://shared.akamai.steamstatic.com/store_it...,http://video.akamai.steamstatic.com/store_trai...


## Exploratory Analysis / Visualization

**TODO: (300-600 words) Explain what you're looking for in the data to help you understand how to use it to answer your question or otherwise meet your project goals. Then, explain what steps you took to explore the dataset, including code that implements these explorations. Finally, comment on the results of your exploration and identify how it shaped the latter steps of your project. You may want to use additional text and code cells to intersperse elements of your report with code that computes supporting statistics and visualizations.**

In [None]:
# Analyze

# TODO: add code to perform exploratory analysis or visualize
# elements of your dataset.
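
One way to start on the abstract's hypothesis is to compute a negative-review ratio per game and compare it with average playtime. The sketch below uses a small made-up frame, so its numbers are illustrative only and are not results. The column names `positive_ratings`, `negative_ratings`, and `average_playtime` follow the attribute table in the dataset description and are assumptions that may need mapping onto the actual CSV headers. A rank (Spearman-style) correlation is used here as one reasonable choice, not as the project's settled method.

```python
import pandas as pd

# Made-up illustrative rows; NOT taken from the Steam dataset.
toy = pd.DataFrame(
    {
        "positive_ratings": [900, 50, 0, 300, 80],
        "negative_ratings": [100, 150, 0, 100, 20],
        "average_playtime": [40.0, 2.0, 0.0, 12.0, 55.0],
    }
)

total = toy["positive_ratings"] + toy["negative_ratings"]

# Share of reviews that are negative; games with no reviews get NaN
# rather than a division by zero.
toy["negative_ratio"] = toy["negative_ratings"] / total.where(total > 0)

# Rank correlation: drop games without a ratio, rank both columns,
# then take the Pearson correlation of the ranks.
valid = toy.dropna(subset=["negative_ratio"])
rho = valid["negative_ratio"].rank().corr(valid["average_playtime"].rank())

print(toy[["negative_ratio", "average_playtime"]])
print("Rank correlation:", rho)
```

The same two columns could then feed the scatterplots and histograms mentioned in the abstract.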

## Data Cleaning and Rescaling

**TODO: (300 - 600 words) This section will include both "cleaning" and "rescaling" tasks from the data science pipeline in section 1.1.7. Intersperse text and code cells outlining each individual "cleaning" or "rescaling" task. The text cell should explain what is being done and why it is necessary and the code cell should implement the step.**

In [None]:
# Data Cleaning and Rescaling

# TODO: add code that integrates, filters, or otherwise updates data to make it
# more suitable for analysis in later steps.
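
As one possible starting point for this section, the sketch below drops rows with a missing `Estimated owners` value (the `df.tail()` output earlier shows several) and log-scales `Price`, in case its distribution turns out to be right-skewed. The stand-in rows are copied from the notebook's own output. Using `log1p` and the new column name `log_price` are assumptions of this sketch, not the project's settled method.

```python
import numpy as np
import pandas as pd

# Stand-in rows copied from the df.head() / df.tail() output earlier in the notebook.
sample = pd.DataFrame(
    {
        "Name": ["Nyctophilia", "Prison Princess", "Mutant Frog", "CRAZY GUY"],
        "Estimated owners": ["50000 - 100000", "0 - 20000", np.nan, np.nan],
        "Price": [0.0, 19.99, 0.99, 9.99],
    }
)

# Cleaning: keep only rows where the owner range is present.
cleaned = sample.dropna(subset=["Estimated owners"]).copy()

# Rescaling: log1p(x) = log(1 + x) is defined at 0, so free games are kept.
cleaned["log_price"] = np.log1p(cleaned["Price"])

print(cleaned)
```

Whether dropping or imputing the missing owner ranges is more appropriate depends on how many rows are affected in the full file.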

## Training / Test Split

**TODO: (200-600 words) explain the methodology you select for forming your training/test sets and why it's applicable to the problem.**

In [None]:
# Create training / test sets

# TODO: create variables holding all training, validation, and testing data
# you'll use in the model building step
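
A simple random split is one candidate methodology. The sketch below shuffles row positions with a seeded NumPy generator and slices them into training, validation, and test index arrays. The 70/15/15 proportions, the seed, and the helper name `split_indices` are all illustrative assumptions.

```python
import numpy as np


def split_indices(n_rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Return (train, val, test) arrays of shuffled row positions.

    Whatever is left after the train and validation slices becomes the test set.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(n_rows * train_frac))
    n_val = int(round(n_rows * val_frac))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a DataFrame, the arrays can be passed to `df.iloc[train_idx]`, `df.iloc[val_idx]`, and `df.iloc[test_idx]`. Fixing the seed keeps the split reproducible across notebook runs.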

## Model Building

**TODO: (200-600 words) Explain the choice of model or models for this project, from where and how you integrated libraries to develop these models, and the process of building the models. If model creation needs to be interspersed with model testing (such as with leave-one-out testing), develop code here that will be used to generate models. Include all code for loading libraries for developing models and include functions or other simplifying abstractions that will make the testing phase easier.**

In [None]:
# Model Building

# TODO: Train ML models over data from the prior step and prepare the model
# to be tested in the next step.

## Model Testing

**TODO: (25-300 words) Explain any deviations from original testing plan or special circumstances you had to develop code to overcome for this phase.**

In [None]:
# Model Testing

# Implement the testing plan explained two sections ago

## Results

**TODO: (200 - 600 words) Explain the significance and meaning of the results from the model testing phase. You may want to copy and intersperse code blocks developing different values / visualizations from the results with multiple text cells explaining them.**

In [None]:
# Results

# Evaluate and visualize results from model testing.

## Conclusions / Lessons Learned

**TODO: (500 - 1000 words) Explain how your experiments led to either additional context, a full answer, or anything in between in relation to the question you posed for your project. Expand with other lessons learned, both in terms of the project topic, data science, and new tools or skills you've gained.**