# Steam Game Dataset Analysis

## Introduction

This is a practice data-analysis project for my high school computer science course. Throughout this notebook, I'll be using Steam Store data that was scraped and uploaded to Kaggle. In the future, I'll likely do another trial by scraping and cleaning my own data.

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import ast

The data used in this notebook has already been scraped and is stored in my repository. Unfortunately, GitHub warns about files larger than 50 MB (and blocks anything over 100 MB), so the JSON data had to be split into 16 chunks to fit. To analyze the data with pandas, the chunks need to be combined back into one DataFrame.

(The JSONs were downloaded and split using the file titled 'scrapedata.py'. It's available on the repo if needed.)
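
The splitting step itself isn't shown in this notebook. As a rough sketch of the idea (not the actual contents of 'scrapedata.py'), a dictionary keyed by game ID could be divided into near-equal chunks like this. The function name `split_records` and the toy data are hypothetical and only here for illustration.

```python
import json
from itertools import islice


def split_records(records, n_chunks):
    """Split a dict of records into at most n_chunks smaller dicts of near-equal size."""
    chunk_size = -(-len(records) // n_chunks)  # ceiling division
    items = iter(records.items())
    chunks = []
    while True:
        chunk = dict(islice(items, chunk_size))
        if not chunk:
            break
        chunks.append(chunk)
    return chunks


# Toy stand-in for the scraped data: {game_id: {field: value}}
toy = {str(i): {"name": f"game {i}"} for i in range(10)}
chunks = split_records(toy, 3)

# Each chunk could then be written to its own file with json.dump.
serialized = [json.dumps(chunk) for chunk in chunks]
print([len(c) for c in chunks])
```

Keeping the game ID as the key in every chunk is what lets `pd.read_json(..., orient='index')` rebuild the same index when the pieces are read back.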

In [3]:
# read the split json files and concatenate them
json_files = [f'https://raw.githubusercontent.com/1metropolis/steam-analysis/refs/heads/main/data/data_{i}.json' for i in range(1, 17)]
dfs = [pd.read_json(file, orient='index') for file in json_files]

sd = pd.concat(dfs)

In [4]:
sd.head()

Unnamed: 0,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,...,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,peak_ccu,tags,discount
20200,Galactic Bowling,"Oct 21, 2008",0,19.99,0,Galactic Bowling is an exaggerated and stylize...,Galactic Bowling is an exaggerated and stylize...,Galactic Bowling is an exaggerated and stylize...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,6,11,0 - 20000,0,0,0,0,0,"{'Indie': 22, 'Casual': 21, 'Sports': 21, 'Bow...",
655370,Train Bandit,"Oct 12, 2017",0,0.99,0,THE LAW!! Looks to be a showdown atop a train....,THE LAW!! Looks to be a showdown atop a train....,THE LAW!! Looks to be a showdown atop a train....,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,53,5,0 - 20000,0,0,0,0,0,"{'Indie': 109, 'Action': 103, 'Pixel Graphics'...",
1732930,Jolt Project,"Nov 17, 2021",0,4.99,0,Jolt Project: The army now has a new robotics ...,Jolt Project: The army now has a new robotics ...,"Shoot vehicles, blow enemies with a special at...",,https://cdn.akamai.steamstatic.com/steam/apps/...,...,0,0,0 - 20000,0,0,0,0,0,[],
1355720,Henosis™,"Jul 23, 2020",0,5.99,0,HENOSIS™ is a mysterious 2D Platform Puzzler w...,HENOSIS™ is a mysterious 2D Platform Puzzler w...,HENOSIS™ is a mysterious 2D Platform Puzzler w...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,3,0,0 - 20000,0,0,0,0,0,"{'2D Platformer': 161, 'Atmospheric': 154, 'Su...",
1139950,Two Weeks in Painland,"Feb 3, 2020",0,0.0,0,ABOUT THE GAME Play as a hacker who has arrang...,ABOUT THE GAME Play as a hacker who has arrang...,Two Weeks in Painland is a story-driven game a...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,50,8,0 - 20000,0,0,0,0,0,"{'Indie': 42, 'Adventure': 41, 'Nudity': 22, '...",


Each entry is identified by its game ID, which is the DataFrame's index (the leftmost column above). The following attributes are available for each game entry:

In [5]:
print("game count:",len(sd))
print(sd.columns)

game count: 111452
Index(['name', 'release_date', 'required_age', 'price', 'dlc_count',
       'detailed_description', 'about_the_game', 'short_description',
       'reviews', 'header_image', 'website', 'support_url', 'support_email',
       'windows', 'mac', 'linux', 'metacritic_score', 'metacritic_url',
       'achievements', 'recommendations', 'notes', 'supported_languages',
       'full_audio_languages', 'packages', 'developers', 'publishers',
       'categories', 'genres', 'screenshots', 'movies', 'user_score',
       'score_rank', 'positive', 'negative', 'estimated_owners',
       'average_playtime_forever', 'average_playtime_2weeks',
       'median_playtime_forever', 'median_playtime_2weeks', 'peak_ccu', 'tags',
       'discount'],
      dtype='object')
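
Since the game ID is the index, a single game can be pulled out with `.loc`. Here is a minimal sketch on a toy frame (the two rows are copied from the preview above; `toy` is just an illustrative name). It also checks that the IDs are unique, which seems worth confirming after concatenating several files.

```python
import pandas as pd

# Toy frame mirroring the layout of sd: the game ID is the index.
toy = pd.DataFrame(
    {"name": ["Galactic Bowling", "Train Bandit"], "price": [19.99, 0.99]},
    index=[20200, 655370],
)

# Look up one game by its ID.
row = toy.loc[655370]
print(row["name"], row["price"])

# Duplicate IDs would make .loc return several rows instead of one.
print(toy.index.is_unique)
```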


## Cleaning Data

Let's see what kinds of games are included in the dataset.

Here are some tags that Steam users assigned to different games:

In [None]:
sd['tags']

20200      {'Indie': 22, 'Casual': 21, 'Sports': 21, 'Bow...
655370     {'Indie': 109, 'Action': 103, 'Pixel Graphics'...
1732930                                                   []
1355720    {'2D Platformer': 161, 'Atmospheric': 154, 'Su...
1139950    {'Indie': 42, 'Adventure': 41, 'Nudity': 22, '...
                                 ...                        
3600970    {'Action Roguelike': 296, 'Bullet Hell': 290, ...
3543710                                                   []
3265370    {'Simulation': 70, 'Walking Simulator': 44, 'I...
3423620                                                   []
3183790                                                   []
Name: tags, Length: 111452, dtype: object

That formatting isn't great. Let's try listing each tag individually:

In [None]:
unique_tags = set(
    tag 
    for tag_dict in sd['tags'] 
    if isinstance(tag_dict, dict) 
    for tag in tag_dict.keys()
)

print(unique_tags)

{'Heist', 'Romance', 'Spelling', 'Choose Your Own Adventure', 'Hidden Object', 'Agriculture', 'PvP', 'Dating Sim', 'Sailing', 'Political Sim', 'Party', 'Rhythm', 'Birds', 'Bikes', 'Rogue-lite', 'Cartoon', 'Coding', 'Mouse only', 'Naval', 'Building', 'Hockey', 'Destruction', 'Superhero', 'Rock Music', '2D Platformer', 'Immersive', 'Photo Editing', 'Asymmetric VR', 'Conspiracy', "1990's", 'Class-Based', 'Online Co-Op', 'Memes', "Shoot 'Em Up", 'Vehicular Combat', 'Mystery Dungeon', 'Gambling', 'Shooter', 'Remake', 'Violent', 'Hack and Slash', 'Snow', 'Local Multiplayer', 'Software', 'Arcade', 'Mythology', 'Martial Arts', 'Action RPG', 'Card Battler', 'Bullet Hell', 'Blood', 'Dragons', 'Base-Building', 'Massively Multiplayer', 'Quick-Time Events', 'Tabletop', 'ATV', 'Trading', 'Foreign', '360 Video', 'Platformer', 'Funny', 'Volleyball', 'Comedy', 'JRPG', 'Experience', 'Farming', 'Fox', 'Psychological Horror', 'Software Training', 'Difficult', 'Narration', 'Touch-Friendly', 'Satire', 'Ches
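
A set only shows which tags exist, not how common they are. One possible next step, sketched here on a toy list rather than the real `sd['tags']` column, is to count how many games carry each tag with `collections.Counter`. The names `toy_tags` and `tag_counts` are made up for this example.

```python
from collections import Counter

# Toy stand-in for sd['tags']: dicts of {tag: votes}, or [] when a game has no tags.
toy_tags = [
    {"Indie": 22, "Casual": 21},
    {"Indie": 109, "Action": 103},
    [],
    {"Action": 5},
]

# Count how many games carry each tag (this ignores the vote totals).
tag_counts = Counter(
    tag
    for tag_dict in toy_tags
    if isinstance(tag_dict, dict)
    for tag in tag_dict
)

print(tag_counts.most_common(2))
```

Swapping `toy_tags` for `sd['tags']` should give the same kind of ranking for the full dataset.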

## Modification

## Analysis

### Section 1

Let's see how the popularity of certain kinds of games has changed over time, starting with visual novels:

In [None]:
visual_novel_games = sd[sd['tags'].apply(lambda x: isinstance(x, dict) and 'Visual Novel' in x)]
visual_novel_games

Unnamed: 0,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,header_image,...,positive,negative,estimated_owners,average_playtime_forever,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,peak_ccu,tags,discount
1777550,Fuyu no Tsuma,"Oct 15, 2021",0,1.99,0,General Fuyu no Tsuma is an addicting game in ...,General Fuyu no Tsuma is an addicting game in ...,Fuyu no Tsuma is an addicting game in which yo...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,2,2,0 - 20000,0,0,0,0,0,"{'Adventure': 179, 'Indie': 170, 'Casual': 155...",
1431470,Mythos Ever After: A Cthulhu Dating Sim,"Oct 28, 2020",0,4.99,0,Welcome to Hallowearth Academy! As the premier...,Welcome to Hallowearth Academy! As the premier...,Just your average cosmic horror dating sim.,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,1,0,0 - 20000,0,0,0,0,0,"{'Dating Sim': 252, 'Visual Novel': 246, 'Love...",
1135380,不可思议佣兵团,"Oct 3, 2019",0,3.99,0,"With the rise of professional soldiers, the st...","With the rise of professional soldiers, the st...",This is a small story about a small mercenary ...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,6,4,0 - 20000,0,0,0,0,0,"{'Sexual Content': 24, 'Nudity': 23, 'Indie': ...",
1292520,Crimson Spires,"Oct 27, 2020",0,19.99,1,"The small mining town of Bataille, Missouri ha...","The small mining town of Bataille, Missouri ha...",This otome-style visual novel blends eeriness ...,,https://cdn.akamai.steamstatic.com/steam/apps/...,...,51,6,0 - 20000,0,0,0,0,1,"{'Indie': 82, 'Adventure': 81, 'Sexual Content...",
1131550,I Walk Among Zombies Vol. 3,"Dec 15, 2020",0,14.99,1,We recommend playing I Walk Among Zombies Vol....,We recommend playing I Walk Among Zombies Vol....,"Gunfire rings out, shattering the safe haven o...",,https://cdn.akamai.steamstatic.com/steam/apps/...,...,168,18,0 - 20000,34,0,34,0,2,"{'Adventure': 57, 'Visual Novel': 56, 'Zombies...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3447410,The Test: Reality Check,"Feb 24, 2025",0,3.99,1,"(This game, as well as many others in our bund...","(This game, as well as many others in our bund...",The Test: Reality Check is one of multiple ins...,,https://shared.akamai.steamstatic.com/store_it...,...,401,5,0 - 20000,0,0,0,0,0,"{'Simulation': 426, 'RPG': 405, 'Choices Matte...",20.0
3373900,Star Dream Journey,"Mar 28, 2025",0,8.99,0,Star Dream Trajectory is a live action interac...,Star Dream Trajectory is a live action interac...,Star Dream Trajectory - Chinese Style Agent 'i...,,https://shared.akamai.steamstatic.com/store_it...,...,36,9,0 - 20000,0,0,0,0,80,"{'Visual Novel': 303, 'Sexual Content': 275, '...",0.0
3574510,Serre,"Apr 15, 2025",0,4.24,2,🐝🐝 A visual novel about a girl and an alien dr...,🐝🐝 A visual novel about a girl and an alien dr...,A visual novel about a girl and an alien drink...,“It’s the kind of lesbian fulfilment and escap...,https://shared.akamai.steamstatic.com/store_it...,...,154,0,0 - 20000,0,0,0,0,5,"{'Romance': 50, 'LGBTQ+': 46, 'Visual Novel': ...",15.0
3584380,Personal Murder Theater: Strangled Roots,"Apr 11, 2025",0,4.99,0,¡¡ English Translation Coming Very Soon!! !! A...,¡¡ English Translation Coming Very Soon!! !! A...,Immerse yourself in this grotesque visual nove...,,https://shared.akamai.steamstatic.com/store_it...,...,2,0,0 - 20000,0,0,0,0,0,"{'Visual Novel': 202, 'Story Rich': 196, 'Anim...",0.0
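
The same filter would work for any tag, so it could be wrapped in a small helper. This is only a sketch: `games_with_tag` is a hypothetical name, and the toy frame stands in for `sd`.

```python
import pandas as pd


def games_with_tag(df, tag):
    """Return the rows of df whose 'tags' dict contains the given tag."""
    mask = df["tags"].apply(lambda t: isinstance(t, dict) and tag in t)
    return df[mask].copy()  # copy so later column assignments don't touch df


# Toy frame: tags are dicts, or [] when a game has none.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "tags": [{"Visual Novel": 10}, [], {"Action": 3, "Visual Novel": 1}],
    }
)

print(games_with_tag(toy, "Visual Novel")["name"].tolist())
```

Returning a copy means new columns can be added to the result without pandas warning about writing to a slice of the original frame.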


To count releases per year, the release dates first need to be converted to datetimes so that the year can be extracted:

In [None]:
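visual_novel_games = visual_novel_games.copy()  # work on a copy to avoid SettingWithCopyWarning
visual_novel_games['release_date'] = pd.to_datetime(visual_novel_games['release_date'], errors='coerce')  # unparseable dates become NaT
visual_novel_games['release_year'] = visual_novel_games['release_date'].dt.year
release_year = visual_novel_games['release_year']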

In [None]:
yearly_vn = visual_novel_games['release_year'].value_counts().sort_index()
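
The date handling can be tried on a handful of values first to see what `errors='coerce'` does. This is a small sketch on made-up input; the explicit `format` string is an assumption based on the dates visible in the preview, whereas the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Toy release dates in the style seen in the dataset, plus one unparseable value.
dates = pd.Series(["Oct 21, 2008", "Oct 12, 2017", "Nov 17, 2017", "not a date"])

# errors='coerce' turns anything that fails to parse into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")

# value_counts skips missing years by default; sort_index puts the years in order.
per_year = parsed.dt.year.value_counts().sort_index()
print(per_year)
```

Because of the missing value, the years come back as floats (for example `2017.0`), which is fine for plotting.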


In [1]:
fig, vnp = plt.subplots(figsize=(10, 10))

# x = years, y = number of visual novels released that year; blue circles joined by a line
vnp.plot(yearly_vn.index, yearly_vn.values, 'bo-')

vnp.set_title('Popularity of Visual Novel Games Over Time (By Year)')
vnp.set_xlabel('Year')
vnp.set_ylabel('Visual Novel Games')
vnp.tick_params(axis='x', labelrotation=45)
vnp.grid(True)
fig.tight_layout()

plt.show()

### Section 2

### Section 3