<h1>Analysis of Steam Games and Categories</h1>



<h5>Data Sources:</h5>
<a href='https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset' abc>Dataset-1</a><br>
<span>-By Aleksandr Antonov</span><br>
<a href='https://www.reddit.com/r/gamedev/comments/165cii0/this_year_we_gathered_data_about_65000_games_in/' abc>Dataset-2</a><br>
<span>-By Alex and Lev</span>

<h3>1. Introduction</h3>

<p>The gaming industry has witnessed exponential growth over the years, with digital distribution platforms like <b>Steam</b> playing a pivotal role in reshaping how games are accessed and experienced. Steam, hosting thousands of games spanning diverse genres, provides an immense amount of data that can offer meaningful insights and inferences into the trends, success factors, and categories of games that dominated or will dominate the market.

This project aims to delve into the vast datasets of Steam games, exploring their categories, revenue, and review metrics to uncover patterns and trends that define the gaming landscape. By leveraging interactive visualizations and comprehensive data analysis, the project provides actionable insights for game developers, publishers, and gaming enthusiasts alike.</p>

<h3>2. Motivation</h3>

<p>Being gamers, we have always been interested in what makes games successful or well-liked. Playing and exploring various game genres takes up a lot of our time, and we frequently ponder why some games perform so well while others don’t. Because of this, we became curious about the gaming industry and the elements that contribute to its success.
Platforms like Steam hold a lot of information about games, including player preferences, earnings, and reviews. Raw data, however, is hard to interpret on its own. We wanted something straightforward that makes this data easy to view and analyze, which is why we chose to build a dashboard for this project.
</p>

In [1]:
%load_ext autoreload
%autoreload 2

# libraries
import pandas as pd

# modules
from modules.DataCleaner import DataCleaner
from modules.Imputer import Imputer
from modules.DashBoard import DashBoard

import warnings
warnings.filterwarnings('ignore')

In [2]:
df1 = pd.read_csv('/workspaces/final-project-Bharath26214/Project/Data/steam_dataset-1.csv')
df2 = pd.read_csv('/workspaces/final-project-Bharath26214/Project/Data/steam_dataset-2.csv')

<h3>3. Methods</h3>

<div style="line-height: 1.5;">
<h4>3.1 Data Integration</h4>

<div>We used two datasets from different sources and decided to integrate them to support a wider range of insights. The main challenges were that both datasets contained a lot of noise and inconsistencies, and that no feature was directly shared between them to serve as a join key.</div>
<br>

<div>After carefully inspecting the features of both datasets, we found a common feature that is present only indirectly. The first dataset has a <b>'url'</b> column, and each url contains an app_id that is unique to the game. For instance, the url of the game DOOM is https://store.steampowered.com/app/379720/DOOM/, whose app_id, 379720, is also present in the second dataset. The datasets were therefore merged on app_id, increasing the amount of data available for visualization.</div>

</div>

<img src="/workspaces/final-project-Bharath26214/Project/images/columns_1.png" width="500" height='300'>
<img src="/workspaces/final-project-Bharath26214/Project/images/columns_2.png" width="500" height='300'>

In [3]:
print('----------Dataset-1 (field: url)-----------')
print(df1['url'].head())

print('----------Dataset-2 (field: App ID)--------')
print(df2['App ID'].head())

----------Dataset-1 (field: url)-----------
0      https://store.steampowered.com/app/379720/DOOM/
1    https://store.steampowered.com/app/578080/PLAY...
2    https://store.steampowered.com/app/637090/BATT...
3      https://store.steampowered.com/app/221100/DayZ/
4    https://store.steampowered.com/app/8500/EVE_On...
Name: url, dtype: object
----------Dataset-2 (field: App ID)--------
0       730
1    578080
2       570
3    271590
4    359550
Name: App ID, dtype: int64
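<p>The join described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's DataCleaner code: the helper name <code>extract_app_id</code>, the regular expression, the review counts, and the choice of an inner join are all assumptions made for the example. In pandas, the same step would typically be a <code>DataFrame.merge</code> on the extracted id.</p>

```python
import re

# Matches the numeric id in store URLs such as .../app/379720/DOOM/
APP_ID_PATTERN = re.compile(r"/app/(\d+)")


def extract_app_id(url):
    """Return the app id embedded in a Steam store URL, or None if absent."""
    match = APP_ID_PATTERN.search(url)
    return int(match.group(1)) if match else None


# Tiny stand-ins for the two datasets. The review counts are made-up
# placeholder values, not figures from the real data.
dataset_1 = [
    {"url": "https://store.steampowered.com/app/379720/DOOM/", "name": "DOOM"},
    {"url": "https://store.steampowered.com/app/221100/DayZ/", "name": "DayZ"},
]
dataset_2 = [
    {"App ID": 379720, "Reviews Total": 100},
    {"App ID": 730, "Reviews Total": 200},
]

# Index the second dataset by id, then keep only the rows found in both
# (an inner join; the project may have used a different join type).
by_app_id = {row["App ID"]: row for row in dataset_2}
merged = []
for row in dataset_1:
    app_id = extract_app_id(row["url"])
    if app_id in by_app_id:
        merged.append({**row, **by_app_id[app_id]})

print(merged)
```

<p>Only DOOM survives the join here, because DayZ has no matching 'App ID' in the second stand-in list.</p>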


<h4>3.2 Data Cleaning</h4>

<div>After integration, the data still contained a lot of noise and inconsistencies, so the next step was to clean it. The following cleaning procedures were applied one after another:</div><br>

<ol>
<li><b>Drop Unnecessary Features:</b> Not all columns in our dataset are needed for the analysis, in particular 'game_description', 'game_details', 'developer', and 'publisher', so we dropped these columns in order to reduce the complexity of the data.</li><br>

<li><b>Handling NaN values:</b> Null entries in the textual columns are replaced with empty strings, which reduces the chance of errors in later string processing.</li><br>

<li><b>Fill Empty Game Names:</b> A few game names in the dataset are null; these are filled in using the url feature. For example, if the url of a game is https://store.steampowered.com/app/379720/DOOM/, the game name DOOM can be retrieved from the url using string processing methods.</li><br>

<li><b>Integrating Similar Features:</b> Some features are present in both source datasets, so the integrated data ends up with two columns holding identical information. For instance, 'Release Date' and 'release_date' both give the date on which the game was launched. Such features are combined to reduce redundancy in the data.</li><br>

<li><b>Type Casting:</b> The features 'launch_price', 'Revenue Estimated', and 'review_score' should be numeric given their names, but they are actually stored as objects. The data was scraped from the web, where a value such as '$8.99' is captured as a string by the scraping tool. These features are converted into floating-point numbers so that trends and patterns in the data can be analyzed.</li><br>

<li><b>Feature Extraction:</b> A new feature, 'review_summary', is extracted from the 'all_reviews' column to capture the overall game outcome. For example, an entry in the 'all_reviews' column looks like: Very Positive,(42,550),- 92% of the 42,550 user reviews for this game are positive. From this we keep the outcome of the game, which is 'Very Positive'.</li><br>

<li><b>Imputing Null Values:</b> The NaN values in 'Reviews Total' and 'review_score' have been replaced with the median and the mean, respectively. The reason is that the data in 'Reviews Total' is highly skewed, so the median is the more robust choice, whereas 'review_score' is approximately normally distributed.</li>

</ol>
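<p>Three of the steps above (filling names from the url, type casting, and median/mean imputation) can be sketched in plain Python as follows. This is an illustrative sketch, not the DataCleaner implementation: the function names are hypothetical, the sample values in the usage lines are made up, and replacing underscores with spaces in the url segment is an assumption about how names are encoded.</p>

```python
import re
from statistics import mean, median


def name_from_url(url):
    """Recover a game name from the path segment that follows the app id."""
    match = re.search(r"/app/\d+/([^/?]+)", url)
    # Assumption: underscores in the URL stand for spaces in the name.
    return match.group(1).replace("_", " ") if match else ""


def to_float(value):
    """Convert scraped text such as '$8.99' or '1,372,835.79' to a float."""
    if value is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. 'Free to Play' has no numeric part


def fill_missing(values, strategy):
    """Replace None entries with the median or mean of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known) if strategy == "median" else mean(known)
    return [fill if v is None else v for v in values]


# Usage with made-up values: the median suits a skewed column, the mean a
# roughly symmetric one.
print(name_from_url("https://store.steampowered.com/app/379720/DOOM/"))
print(to_float("$8.99"))
print(fill_missing([10, None, 1000, 20], "median"))
print(fill_missing([80.0, None, 90.0], "mean"))
```

<p>In the skewed example, the single large value 1000 would pull a mean far above the typical entry, while the median stays at 20.</p>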

<h5>'all_reviews' column</h5>
<img src='/workspaces/final-project-Bharath26214/Project/images/reviews_column.png' width='400' height='400'>
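<p>The 'review_summary' extraction from step 6 could look roughly like the sketch below. It assumes every non-empty entry follows the shape quoted in that step (summary, then the review count in parentheses, then a percentage); entries with a different layout would need extra handling. The function name <code>parse_all_reviews</code> is hypothetical.</p>

```python
import re


def parse_all_reviews(text):
    """Split an 'all_reviews' entry into (summary, review_count, percent_positive)."""
    # The summary is everything before the first comma.
    summary = text.split(",", 1)[0].strip()
    # The review count sits in parentheses and may contain thousands separators.
    count = re.search(r"\(([\d,]+)\)", text)
    # The first percentage in the text is the share of positive reviews.
    percent = re.search(r"(\d+)%", text)
    return (
        summary,
        int(count.group(1).replace(",", "")) if count else None,
        int(percent.group(1)) if percent else None,
    )


entry = "Very Positive,(42,550),- 92% of the 42,550 user reviews for this game are positive."
print(parse_all_reviews(entry))  # ('Very Positive', 42550, 92)
print(parse_all_reviews(""))     # ('', None, None)
```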

In [4]:
df = DataCleaner(df1, df2).clean_data()

In [5]:
df.head()

Unnamed: 0,name,Reviews Total,Release Date,Tags,Revenue Estimated,launch_price,review_score,review_summary
0,Counter-Strike,137421.0,2000-11-01,"e-sports,1980s,Action,Competitive,Multiplayer,...",1372835.79,9.99,97.0,Overwhelmingly Positive
1,Team Fortress Classic,5475.0,1999-04-01,"Action,Competitive,Violent,Multiplayer,Mod,Tea...",27320.25,4.99,85.0,Very Positive
2,Day of Defeat,3692.0,2003-05-01,"Action,Multiplayer,Team Based,Tactical,Class-B...",18423.08,4.99,87.0,Very Positive
3,Deathmatch Classic,1923.0,2001-06-01,"Action,Competitive,Multiplayer,Classic,Singlep...",9595.77,4.99,80.0,Very Positive
4,Half-Life: Opposing Force,15498.0,1999-11-01,"Action,Classic,Military,Singleplayer,Co op,Sci...",77335.02,4.99,95.0,Very Positive


<h4>3.3 Imputing Missing Values</h4>

<div>After extracting 'review_summary' from 'all_reviews', some rows still held the empty strings ('') introduced by the cleaning procedure. Because 'review_summary' is a categorical feature, it cannot be imputed with the mean or median. We therefore used a machine learning model to impute the missing values in the 'review_summary' column.</div>

<h5>Algorithm: eXtreme Gradient Boosting Classifier</h5>

<p>XGBoost, provided by the xgboost library, is a powerful machine learning algorithm for both classification and regression tasks. It is a tree-based extension of gradient boosting: trees are added sequentially, each one correcting the errors of the ensemble built so far, while the construction of each individual tree is parallelized to speed up training.</p>

<p>Features of XGBoost</p>

<ol>
<li>Regularization: Supports L1 (Lasso) and L2 (Ridge) regularization, which helps prevent overfitting.</li>
<li>Handling Missing Data: Automatically learns the best direction for missing values during tree splitting.</li>
<li>Parallel Processing: Utilizes CPU or GPU resources efficiently, enabling faster computation.</li>
</ol>

<h5>Optuna: Hyper Parameter Optimization</h5>

<p>Optuna is an automatic hyperparameter optimization framework for machine learning algorithms. It is especially useful for tree-based algorithms, which have numerous parameters. Its objective is to find the parameters that give the best value of a chosen scoring metric (accuracy, precision, F1 score, etc.) within 'n' trials.</p>

<p>Features of Optuna</p>

<ol>
<li>Uses Bayesian optimization methods internally to find the best parameters; the default sampler is the Tree-structured Parzen Estimator (TPE).</li>
<li>The search space can be changed dynamically based on trial results.</li>
<li>Consists of inbuilt visualization tools to show the metrics of each trial over time.</li>
</ol>

<h5>Procedure</h5>

<p>The missing values in the 'review_summary' column are filled using a combination of the XGBoost classifier and Optuna. A game's overall review outcome depends on its price, its number of reviews, and its percentage of positive/negative reviews. Treating these as the independent features and 'review_summary' as the dependent feature, we ran Optuna for 50 trials to find the best possible parameters and then used the tuned model to predict the unknown/missing values.</p>

In [6]:
df = Imputer(df, False).predict_data() # switch to True to enable OPTUNA training

<h5>Best parameters obtained by Optuna over 50 trials, with an accuracy of 91%</h5>
<img src='/workspaces/final-project-Bharath26214/Project/images/optuna_params.png'>

<h4>3.4 Data Visualization</h4>

<h5>Plotly Library</h5>

<p>Plotly is an interactive, open-source data visualization library available for Python, R, JavaScript, and other programming languages. It is built on top of plotly.js, which in turn builds on D3.js, and it integrates with frameworks such as Dash and JupyterDash.</p>

<p>Features of Plotly</p>

<ol>
<li>Interactive Visualizations: Highly interactive charts with zooming, panning, and tooltips.</li>
<li>Wide Range of Chart Types: Plotly offers a wide range of charts, from basic plots such as scatter plots and bar charts to specialized charts like Sankey diagrams, treemaps, and network graphs.</li>
<li>3D Plotting: Allows visualization of higher-dimensional datasets.</li>
</ol>


<h5>Plots:</h5>

<ol>
<li><b>Percentage of Overall Game Outcome:</b>
<p>The pie chart shows the proportion of each 'review_summary' value across all games. It is evident from the chart that over 60% of games are reviewed as positive or above, over 30% have mixed reviews, and the rest have not done well in the market.</p>
</li>
<br>

<li><b>Top Games by Revenue vs. Top Games by Reviews:</b>
<p>The bar charts show the top 10 games by revenue and the top 10 games by number of reviews. Interestingly, the top 3 games by revenue are also the top 3 games by number of reviews, which suggests that the two features are strongly positively correlated.
</p>
</li>
<br>


<li><b>Revenue Collected by Top Categories over time:</b>
<p>This tab shows two kinds of plots. The bar chart represents the top 20 categories by revenue, with 'Co Op' ranked first, followed by 'action', 'first person', and so on. The pie charts show the share of total revenue collected by the popular tags in each of four time periods. Action, first-person, and single-player games have remained popular for over two decades.
</p>
</li>
<br>

<li><b>Reviews scores of Game Categories over time periods:</b>
<p>This tab shows the trend of selected tags over time. The metric used is the weighted average review score, in which each game's review score is weighted by its number of reviews. This is done because the data contains games with a single review and a review score of 100; an unweighted average would let such games distort the result, and no useful conclusions could be drawn from it. The line chart shows the weighted review scores of the selected tags.
</p>
</li>
<br>

<li><b>Analysis of Revenue vs. Review Score:</b>
<p>This tab shows the games that have collected revenue of over 5 million USD and brings together four features that are significant for a game's success. The bubble chart shows the release year on the x axis and the review score on the y axis; the size of each bubble represents the estimated revenue collected by the game, and its color represents the overall game outcome. It is clear that the number of games with revenue over $5 million has grown: there were just two such games up to 2010, and the number increased gradually over the next decade.
</p>
</li>
<br>

<li><b>Number of Games Released and the revenue earned by them over years:</b>
<p>This tab shows the number of games released in each year and the combined revenue of those games. It is striking from the graph that both the number of games produced and the revenue earned by them have increased drastically. The maximum revenue collected was 152 million USD in 2012, and revenue stayed in that range until 2022. The number of games released has also gone up, reaching its maximum in 2015, when 8,800 games were released.
</p>
</li>
<br>

</ol>
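<p>The weighted average review score used in the fourth plot can be illustrated with a small sketch. The function name and the dictionary keys below are assumptions for the example, and the two sample games are made up to show why the weighting matters.</p>

```python
def weighted_review_score(games):
    """Average review score, weighting each game by its number of reviews."""
    total_reviews = sum(g["reviews_total"] for g in games)
    if total_reviews == 0:
        return None  # no reviews at all, so no meaningful score
    weighted_sum = sum(g["review_score"] * g["reviews_total"] for g in games)
    return weighted_sum / total_reviews


# Made-up example: one game with a single perfect review next to a
# widely reviewed game.
games = [
    {"review_score": 100.0, "reviews_total": 1},
    {"review_score": 80.0, "reviews_total": 999},
]

simple_mean = sum(g["review_score"] for g in games) / len(games)
weighted = weighted_review_score(games)

print(simple_mean)          # 90.0
print(round(weighted, 2))   # 80.02
```

<p>The unweighted mean treats the single-review game as equal to the other one, while the weighted score stays close to the score backed by almost all of the reviews.</p>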

<h4>3.5 Dashboard Implementation</h4>

<h5>Dash Library</h5>

<p>Dash is a powerful Python framework for building web-based, interactive data visualization dashboards. It integrates seamlessly with Python libraries like Plotly, Pandas, and NumPy. Dash is often used to create interactive, production-ready web applications for data visualization without needing extensive knowledge of web development.</p>

<p>Features of Dash:</p>

<ol>
<li>Dash provides pre-built components for graphs, tables, sliders, dropdowns, and more.</li>
<li>Dash applications are reactive. User inputs (e.g., selecting a dropdown or adjusting a slider) dynamically update outputs (e.g., charts, tables).</li>
<li>Dash integrates seamlessly with machine learning models, APIs, and databases, making it a good choice for interactive model visualizations and real-time analytics.</li>
</ol>

<h5>Procedure:</h5>

<p>The Dash library is used to build the final deliverable, the dashboard. It renders the Plotly graphs as Dash components, one set per tab. The final application is run on a randomized port.</p>

In [8]:
dashboard = DashBoard(df).run()
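<p>One common way to obtain a randomized port is to let the operating system choose an unused one. The sketch below shows that approach with the standard library only. It is an assumption about how a port could be picked, not a description of what the DashBoard module does, and <code>find_free_port</code> is a hypothetical helper.</p>

```python
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        # getsockname() returns (host, port); the OS has filled in the port.
        return sock.getsockname()[1]


port = find_free_port()
print(f"Dashboard could be served on http://127.0.0.1:{port}")
```

<p>The returned number could then be passed as the <code>port</code> argument when starting the Dash server. Because the socket is closed before the server starts, another process could in principle claim the port in between, which is usually acceptable for a local dashboard.</p>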

<h3>4. Results</h3>

<h5>Game Reviews</h5>
<p>Positive or very positive reviews were given to the majority of the games on Steam. Only a small number of games received negative reviews. Games with more reviews were frequently more profitable, showing a correlation between visibility and success.</p>

<h5>Popular Games and Types</h5>
<p>The games with the highest revenue were Counter-Strike: Global Offensive, PUBG, and Dota 2. The most lucrative game genres were co-op, action, and first-person. These genres remained popular over time, consistently generating high revenue.</p>

<h5>Trends Over Time</h5>
<p>Some genres, such as trading card games and artificial intelligence, have become more popular in recent years. Meanwhile, older genres like "Classic" and "Masterpiece" saw a decline in player interest and revenue generation.</p>

<h5>Revenue and Reviews</h5>
<p>Games with very positive reviews often had the highest revenue. However, in recent years, there have been examples of highly profitable games that received mixed reviews, showing that other factors also contribute to financial success.</p>

<h5>Game Releases and Revenue Growth</h5>
<p>The number of games released annually increased from 50 in 2000 to over 2,000 in 2023. The highest revenue year was 2015, driven by blockbuster titles like Call of Duty and Grand Theft Auto V. Revenue increased again in 2020, likely influenced by the pandemic, which led to more people playing games.</p>

<h5>Revenue by Time Period</h5>
<p>Revenue grew significantly over time, starting at $283 million from 2000 to 2005 and surpassing $21 billion between 2018 and 2023. Action and co-op games consistently contributed a large portion of the revenue, while other genres gained popularity and market share in later years.</p>


<h3>5. Conclusions</h3>

<div>Data can be used to better understand the gaming market, as this project demonstrates. The dashboard clearly shows game categories which are popular, the relationship between reviews and earnings, and the evolution of player interests. The most popular game genres over time were discovered to be action, cooperative, and first-person. Recently, new genres like artificial intelligence and trading card games have gained popularity. Additionally, the analysis demonstrates the growth in both revenue and game numbers over time, particularly during pandemic years. When deciding what kinds of games to make and how to increase their chances of success, game developers and marketers can use this dashboard to make better decisions.</div>

<h3>References</h3>

<ol>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://xgboost.readthedocs.io/en/stable/">XGBoost Algorithm</a></li>
<li><a href="https://optuna.org/">Optuna</a></li>
<li><a href="https://dash.plotly.com/">Dash and Plotly</a></li>
<li><a href="https://www.semanticscholar.org/reader/3ad95671f78d3205ef9f183f241d5884758a2799">Popularity of Steam Games (research paper)</a></li>


</ol>