# Steam Video Game Platform - Big Data Analysis

## 1. Business Context

Ubisoft, a major video game publisher, is planning to release a new innovative game on the global market.  
In order to maximize the commercial success of this future release, the company wants to better understand the current video game ecosystem available on Steam, one of the largest digital distribution platforms in the world.

Steam gathers thousands of games developed by a wide variety of publishers and offers valuable information regarding:
- game genres
- pricing strategies
- platform availability
- age restrictions
- customer engagement through user reviews
- estimated number of owners

The objective of this exploratory data analysis is to study the global video game market on Steam in order to identify the key factors that may influence a game's popularity and commercial performance.

Through this analysis, Ubisoft aims to:
- understand the evolution of game releases over time
- identify the most represented and best performing genres
- analyze pricing and discount strategies
- evaluate the impact of platform availability
- explore which types of games generate the highest player ownership

The insights derived from this study will support strategic decision-making regarding the positioning, pricing, platform compatibility and genre selection of Ubisoft's future game release.

## 2. Data Loading

The dataset used for this analysis contains detailed information about video games available on the Steam platform.

It includes various attributes such as publishers, genres, pricing, supported platforms, release dates, user reviews and estimated number of owners.

This data will allow us to explore global video game market trends and identify key characteristics that may influence a game's commercial performance and popularity.

In [0]:
steam_df = spark.read \
    .option("inferSchema", "true") \
    .option("multiline", "true") \
    .json("s3://full-stack-bigdata-datasets/Big_Data/Project_Steam/steam_game_output.json")

## 3. Schema Exploration

Before starting the analysis, it is essential to explore the dataset structure in order to understand how the information is organized.

As the dataset comes from a JSON source, several attributes are stored in nested formats such as arrays or structured fields.

This step allows us to identify:
- relevant variables for market analysis
- nested fields that need to be flattened
- attributes requiring cleaning or transformation
- categorical variables such as genres or platforms

Understanding the schema is a necessary step before preparing the data for further business analysis.

In [0]:
steam_df.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- appid: long (nullable = true)
 |    |-- categories: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ccu: long (nullable = true)
 |    |-- developer: string (nullable = true)
 |    |-- discount: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- header_image: string (nullable = true)
 |    |-- initialprice: string (nullable = true)
 |    |-- languages: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- negative: long (nullable = true)
 |    |-- owners: string (nullable = true)
 |    |-- platforms: struct (nullable = true)
 |    |    |-- linux: boolean (nullable = true)
 |    |    |-- mac: boolean (nullable = true)
 |    |    |-- windows: boolean (nullable = true)
 |    |-- positive: long (nullable = true)
 |    |-- price: string (nullable = true)
 |    |-- publisher: string (nullable = true)
 |    |-- release_date: string (nullable = true)
 |    |-

The dataset contains nested structures inside the data field, meaning that several relevant attributes such as price, platform availability, or release date must be extracted before conducting further analysis.

In [0]:
steam_df.columns

['data', 'id']

From the schema exploration, we observe that several key attributes are stored in nested formats.

For example:
- categories are stored as arrays, while genres are a single comma-separated string
- supported platforms are grouped within a structured field
- release date is stored as text
- estimated owners are expressed as ranges
- pricing information requires transformation for analysis

These characteristics indicate that data preparation steps such as flattening, cleaning and transformation will be required before conducting any market analysis.

## 4. Data Preparation

In order to perform a reliable market analysis, the dataset needs to be prepared before any aggregation or segmentation.

As previously observed, several key attributes are stored in nested or non-standard formats, which prevents direct analytical use.

This preparation phase includes:
- flattening nested structures
- selecting relevant business variables
- preparing categorical attributes for segmentation
- converting specific fields into analysis-friendly formats

These transformations will allow us to conduct meaningful macro-level and genre-based market analysis.

### 4.1 Flatten JSON Structure

Some variables such as supported platforms are stored within structured fields.

Flattening these attributes allows us to transform the dataset into a tabular format suitable for aggregation and segmentation analysis.
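
The idea behind flattening can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's own code: the `record` dictionary is an assumed shape based on the schema printed in section 3, and the `flatten` helper is a hypothetical name introduced for illustration.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Turn nested dictionaries into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into struct-like fields such as "platforms".
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Assumed record shape, modelled on the printed schema (values from the preview).
record = {
    "data": {
        "name": "Counter-Strike",
        "publisher": "Valve",
        "genre": "Action",
        "price": "999",
        "platforms": {"windows": True, "mac": True, "linux": True},
    }
}

flat_record = flatten(record)
print(sorted(flat_record))
```

The dotted keys produced here (`data.platforms.mac`, for instance) mirror the column paths that `col("data.platforms.mac")` addresses in Spark.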

In [0]:
from pyspark.sql.functions import col

steam_flat = steam_df.select(
    col("data.name").alias("name"),
    col("data.publisher").alias("publisher"),
    col("data.genre").alias("genre"),
    col("data.price").alias("price"),
    col("data.owners").alias("owners"),
    col("data.positive").alias("positive"),
    col("data.negative").alias("negative"),
    col("data.release_date").alias("release_date"),
    col("data.required_age").alias("required_age"),
    col("data.platforms.windows").alias("windows"),
    col("data.platforms.mac").alias("mac"),
    col("data.platforms.linux").alias("linux"),
    col("data.categories").alias("categories"),
    col("data.languages").alias("languages")
)

In [0]:
display(steam_flat.limit(10))

name,publisher,genre,price,owners,positive,negative,release_date,required_age,windows,mac,linux,categories,languages
Counter-Strike,Valve,Action,999,"10,000,000 .. 20,000,000",201215,5199,2000/11/1,0,True,True,True,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)","English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean"
ASCENXION,PsychoFlux Entertainment,"Action, Adventure, Indie",999,"0 .. 20,000",27,5,2021/05/14,0,True,False,False,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)","English, Korean, Simplified Chinese"
Crown Trick,"Team17, NEXT Studios","Adventure, Indie, RPG, Strategy",599,"200,000 .. 500,000",4032,646,2020/10/16,0,True,False,False,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud, Steam Trading Cards)","Simplified Chinese, English, Japanese, Traditional Chinese, French, German, Spanish - Spain, Russian, Portuguese - Brazil"
"Cook, Serve, Delicious! 3?!",Vertigo Gaming Inc.,"Action, Indie, Simulation, Strategy",1999,"100,000 .. 200,000",1575,115,2020/10/14,0,True,True,False,"List(Multi-player, Single-player, Co-op, Steam Achievements, Steam Cloud, Shared/Split Screen, Full controller support, Steam Trading Cards, Shared/Split Screen Co-op, Remote Play on Phone, Remote Play on Tablet, Remote Play on TV, Remote Play Together)",English
细胞战争,DoubleC Games,"Action, Casual, Indie, Simulation",199,"0 .. 20,000",0,1,2019/03/30,0,True,False,False,List(Single-player),Simplified Chinese
Zengeon,2P Games,"Action, Adventure, Indie, RPG",799,"100,000 .. 200,000",1018,462,2019/06/24,0,True,True,False,"List(Multi-player, Single-player, Steam Achievements, Full controller support, Steam Trading Cards)","Simplified Chinese, English, Traditional Chinese, Japanese, Korean"
干支セトラ　陽ノ卷｜干支etc.　陽之卷,Starship Studio,"Adventure, Indie, RPG, Strategy",1299,"0 .. 20,000",18,6,2019/01/24,0,True,False,False,"List(Single-player, Steam Achievements, Steam Cloud)","Japanese, Simplified Chinese, Traditional Chinese"
Jumping Master(跳跳大咖),重庆环游者网络科技,"Action, Adventure, Casual, Free to Play, Massively Multiplayer",0,"20,000 .. 50,000",50,34,2019/04/8,0,True,False,False,"List(Multi-player, Single-player, Co-op, Online PvP, Online Co-op, PvP)","English, Simplified Chinese, Traditional Chinese"
Cube Defender,Simon Codrington,"Casual, Indie",299,"0 .. 20,000",6,0,2019/01/6,0,True,True,False,"List(Single-player, Steam Achievements, Steam Leaderboards)",English
Tower of Origin2-Worm's Nest,Villain Role,"Indie, RPG",1399,"0 .. 20,000",32,12,2021/09/9,0,True,False,False,List(Single-player),"English, Simplified Chinese, Traditional Chinese"


### 4.2 Release Date Cleaning

Release dates are stored in various textual formats within the dataset.

In order to perform time-based analysis, it is necessary to convert this attribute into a standardized date format.

Invalid or inconsistent date formats will be converted to null values to ensure data consistency and avoid processing errors.
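
The "parse or fall back to null" rule can be sketched in plain Python. This is only an illustration of the logic, assuming the `yyyy/MM/dd` layout seen in the preview; `parse_release_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, datetime
from typing import Optional


def parse_release_date(raw: Optional[str]) -> Optional[date]:
    """Return a date for 'yyyy/MM/dd'-style text, or None when it does not match."""
    if not raw:
        return None
    try:
        # %d accepts both zero-padded ("08") and unpadded ("8") day values.
        return datetime.strptime(raw.strip(), "%Y/%m/%d").date()
    except ValueError:
        # Any other layout is treated as missing instead of raising an error.
        return None


print(parse_release_date("2019/04/8"))
print(parse_release_date("Coming soon"))
```

Returning `None` rather than raising keeps a single malformed value from interrupting the whole job, at the cost of having to track how many rows end up without a date.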

In [0]:
from pyspark.sql.functions import expr

steam_flat = steam_flat.withColumn(
    "release_date_clean",
    expr("""
        coalesce(
            try_to_date(release_date, 'yyyy/MM/dd'),
            try_to_date(release_date, 'yyyy/MM/d')
        )
    """)
)

In [0]:
from pyspark.sql.functions import year

steam_flat = steam_flat.withColumn(
    "release_year",
    year("release_date_clean")
)

In [0]:
display(
    steam_flat.select("release_date", "release_date_clean", "release_year").limit(10)
)

release_date,release_date_clean,release_year
2000/11/1,2000-11-01,2000
2021/05/14,2021-05-14,2021
2020/10/16,2020-10-16,2020
2020/10/14,2020-10-14,2020
2019/03/30,2019-03-30,2019
2019/06/24,2019-06-24,2019
2019/01/24,2019-01-24,2019
2019/04/8,2019-04-08,2019
2019/01/6,2019-01-06,2019
2021/09/9,2021-09-09,2021


### 4.3 Price Cleaning

Pricing strategy is a key component of a game's commercial positioning.

However, some games in the dataset have missing or undefined pricing information, particularly free-to-play titles.

Cleaning this attribute ensures that pricing analysis reflects actual market positioning and allows us to explore how price levels relate to game popularity and ownership.
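
A minimal plain-Python sketch of the cleaning rule, assuming (as the preview in section 4.1 suggests) that prices are strings expressed in cents and that free titles may be labelled `Free`. The helper names are illustrative only.

```python
from typing import Optional


def clean_price_cents(raw: Optional[str]) -> Optional[float]:
    """Convert a raw price string to a number of cents, or None if unusable."""
    if raw is None:
        return None
    text = raw.strip()
    if text == "Free":
        return 0.0
    try:
        # A decimal comma, if present, is normalised to a decimal point.
        return float(text.replace(",", "."))
    except ValueError:
        return None


def to_currency_units(cents: Optional[float]) -> Optional[float]:
    """Convert cents to currency units, propagating missing values."""
    return None if cents is None else cents / 100


print(to_currency_units(clean_price_cents("1999")))
print(to_currency_units(clean_price_cents("Free")))
```

Keeping missing prices as `None` instead of `0` matters here, because a zero price is later interpreted as free to play.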

In [0]:
from pyspark.sql.functions import col, when, regexp_replace

steam_flat = steam_flat.withColumn(
    "price_clean",
    when(col("price") == "Free", 0)
    .otherwise(regexp_replace(col("price"), ",", ".").cast("float"))
)


In [0]:
steam_flat = steam_flat.withColumn(
    "price_eur",
    col("price_clean") / 100
)

In [0]:
display(
    steam_flat.select("price", "price_clean","price_eur").limit(10)
)

price,price_clean,price_eur
999,999.0,9.99
999,999.0,9.99
599,599.0,5.99
1999,1999.0,19.99
199,199.0,1.99
799,799.0,7.99
1299,1299.0,12.99
0,0.0,0.0
299,299.0,2.99
1399,1399.0,13.99


### 4.4 Owners Cleaning

The dataset provides ownership estimates as numerical ranges rather than exact values.

In order to evaluate a game's commercial performance, these ranges are transformed into numerical proxies by calculating the average between the minimum and maximum ownership values.

This transformation enables meaningful comparisons of popularity across games and genres.
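
The midpoint rule can be sketched in plain Python as follows. This is an illustration of the logic only, assuming the `"min .. max"` layout with thousands separators shown in the preview; `parse_owner_range` is a hypothetical helper.

```python
from typing import Optional, Tuple


def parse_owner_range(raw: Optional[str]) -> Optional[Tuple[int, int, float]]:
    """Return (minimum, maximum, midpoint) for a range such as '0 .. 20,000'."""
    if not raw or " .. " not in raw:
        return None
    low_text, high_text = raw.split(" .. ", 1)
    try:
        low = int(low_text.replace(",", ""))
        high = int(high_text.replace(",", ""))
    except ValueError:
        return None
    # The midpoint is used as a proxy for the unknown exact owner count.
    return low, high, (low + high) / 2


print(parse_owner_range("200,000 .. 500,000"))
```

The midpoint is a coarse proxy: every game in the `0 .. 20,000` bucket receives the same value of 10,000, so differences inside a bucket are invisible to later averages.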

In [0]:
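from pyspark.sql.functions import split, regexp_replace, col

# split() takes a regular expression, so the dots are escaped to match " .. " literally
owners_parts = split(col("owners"), r" \.\. ")

steam_flat = steam_flat.withColumn(
    "owners_min",
    # Remove thousands separators before casting to an integer
    regexp_replace(owners_parts.getItem(0), ",", "").cast("int")
).withColumn(
    "owners_max",
    regexp_replace(owners_parts.getItem(1), ",", "").cast("int")
)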

In [0]:
steam_flat = steam_flat.withColumn(
    "owners_avg",
    ((col("owners_min") + col("owners_max")) / 2)
)

In [0]:
display(
    steam_flat.select("owners", "owners_min", "owners_max", "owners_avg").limit(10)
)

owners,owners_min,owners_max,owners_avg
"10,000,000 .. 20,000,000",10000000,20000000,15000000.0
"0 .. 20,000",0,20000,10000.0
"200,000 .. 500,000",200000,500000,350000.0
"100,000 .. 200,000",100000,200000,150000.0
"0 .. 20,000",0,20000,10000.0
"100,000 .. 200,000",100000,200000,150000.0
"0 .. 20,000",0,20000,10000.0
"20,000 .. 50,000",20000,50000,35000.0
"0 .. 20,000",0,20000,10000.0
"0 .. 20,000",0,20000,10000.0


### 4.5 Categories Explosion

Games may belong to multiple gameplay categories such as single-player or multiplayer modes.

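Exploding this attribute produces one row per game and category, enabling segmentation analysis based on gameplay experience. Because the exploded result is assigned back to `steam_flat`, each game then appears once per category, so later row counts on this DataFrame count game-category pairs rather than distinct games.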

In [0]:
from pyspark.sql.functions import explode

steam_flat = steam_flat.withColumn(
    "category",
    explode("categories")
)

In [0]:
display(
    steam_flat.select("name", "categories", "category").limit(10)
)

name,categories,category
Counter-Strike,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Multi-player
Counter-Strike,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Valve Anti-Cheat enabled
Counter-Strike,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Online PvP
Counter-Strike,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Shared/Split Screen PvP
Counter-Strike,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",PvP
ASCENXION,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",Single-player
ASCENXION,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",Partial Controller Support
ASCENXION,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",Steam Achievements
ASCENXION,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",Steam Cloud
Crown Trick,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud, Steam Trading Cards)",Single-player
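
The effect of exploding on row counts is easy to see on a tiny example. Below is a minimal plain-Python sketch (not the Spark code itself) using two games from the preview in section 4.1; the variable names are illustrative.

```python
games = [
    {
        "name": "Counter-Strike",
        "categories": [
            "Multi-player",
            "Valve Anti-Cheat enabled",
            "Online PvP",
            "Shared/Split Screen PvP",
            "PvP",
        ],
    },
    {
        "name": "Cube Defender",
        "categories": ["Single-player", "Steam Achievements", "Steam Leaderboards"],
    },
]

# One output row per (game, category) pair, which is what an explode produces.
exploded = [
    {"name": game["name"], "category": category}
    for game in games
    for category in game["categories"]
]

row_count = len(exploded)
distinct_games = len({row["name"] for row in exploded})
print(row_count, distinct_games)
```

Two games become eight rows, so any count taken on the exploded rows measures game-category pairs. Counting distinct names (or keeping the exploded data in a separate DataFrame) is one way to keep per-game statistics unaffected.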


### 4.6 Platforms Extraction

Supported platforms represent an important distribution factor that may influence a game's accessibility and market reach.

Extracting platform availability allows us to evaluate how games are distributed across operating systems and determine whether certain genres tend to target specific platforms.

This information will support platform compatibility decisions for future game releases.

In [0]:
display(
    steam_flat.select("windows", "mac", "linux").limit(10)
)

windows,mac,linux
True,True,True
True,True,True
True,True,True
True,True,True
True,True,True
True,False,False
True,False,False
True,False,False
True,False,False
True,False,False


## 5. Macro Market Analysis

### 5.1 Games per Publisher

Analyzing the number of games released by each publisher provides insights into market structure and competitive dynamics.

This helps identify major content producers on the Steam platform and understand whether the market is driven by a few dominant publishers or by a large number of smaller contributors.

In [0]:
publisher_counts = (
    steam_flat.groupBy("publisher")
      .count()
      .orderBy("count", ascending=False)
      .limit(20)
)

In [0]:
display(publisher_counts)

publisher,count
SEGA,829
Square Enix,654
THQ Nordic,628
Devolver Digital,603
BANDAI NAMCO Entertainment,579
Electronic Arts,513
Choice of Games,503
Ubisoft,487
Nacon,459
Big Fish Games,457



The distribution of games per publisher shows that certain companies such as SEGA, Square Enix, THQ Nordic and Devolver Digital adopt high-volume publishing strategies on the Steam platform.

Compared to these publishers, Ubisoft ranks eighth in this list with a count of 487 (against 829 for SEGA), suggesting a more selective production approach.

This indicates that some competitors rely on frequent releases to maintain visibility and engagement within the marketplace.

For Ubisoft, this may highlight the need to balance quality-focused development with release frequency in order to remain competitive on digital distribution platforms.

### 5.2 Game Releases per Year

Analyzing the number of games released per year allows us to observe the evolution of the Steam marketplace over time and assess the potential impact of external events such as the Covid-19 pandemic on game production.

In [0]:
releases_per_year = (
    steam_flat.groupBy("release_year")
        .count()
        .orderBy("release_year")
)

In [0]:
display(releases_per_year)

release_year,count
,727
1997.0,7
1998.0,9
1999.0,11
2000.0,9
2001.0,10
2002.0,1
2003.0,7
2004.0,28
2005.0,12



The number of game releases on Steam has significantly increased over time, particularly from 2014 onwards.

A notable peak in releases is observed around 2019 and 2020, followed by a slight decline in the following years.

This trend suggests that the Covid-19 period may have accelerated game production and digital content distribution, likely due to increased player demand during lockdowns.

For Ubisoft, this indicates that external market conditions can strongly influence release dynamics, and that adapting production strategies during periods of increased demand may offer opportunities for greater market visibility.

### 5.3 Average Price Evolution

Understanding how the average price of games evolves over time provides insights into global pricing strategies on the Steam marketplace.

This analysis helps determine whether the platform is moving towards:

- lower-priced accessible games
- premium-priced productions
- or a mixed pricing positioning

Tracking this evolution allows Ubisoft to better position the pricing strategy of its future game release in relation to current market standards.

In [0]:
from pyspark.sql.functions import avg, round

avg_price_per_year = (
    steam_flat
    .filter(col("price_eur").isNotNull())
    .groupBy("release_year")
    .agg(
        round(avg("price_eur"),2).alias("avg_price_eur")
    )
    .orderBy("release_year")
)

In [0]:
display(avg_price_per_year)

release_year,avg_price_eur
,7.66
1997.0,15.69
1998.0,9.99
1999.0,4.54
2000.0,7.77
2001.0,6.99
2002.0,14.99
2003.0,5.28
2004.0,9.99
2005.0,6.58



The evolution of average game prices on Steam shows significant fluctuations in the early years, followed by a relatively stable pricing trend from 2010 onwards.

Despite the growing number of game releases over time, the average price has remained within a moderate range, typically between 8€ and 12€.

This suggests that increasing competition among developers may have contributed to maintaining accessible pricing strategies in order to attract a larger player base.

For Ubisoft, this indicates that pricing competitiveness remains an important factor for adoption, and that aligning with market standards may be necessary to ensure commercial success.

### 5.4 Price Distribution

Analyzing the price distribution of games available on Steam provides insights into the overall market structure.

This allows us to determine whether the platform is primarily composed of:

- free-to-play games
- low-cost independent productions
- mid-range titles
- premium AAA productions

Understanding this distribution is essential for positioning a future game release in terms of pricing competitiveness and accessibility.

In [0]:
from pyspark.sql.functions import when

steam_flat = steam_flat.withColumn(
    "price_segment",
    when(col("price_eur") == 0, "Free to Play")
    .when((col("price_eur") > 0) & (col("price_eur") <= 10), "Low Price")
    .when((col("price_eur") > 10) & (col("price_eur") <= 30), "Mid Price")
    .when(col("price_eur") > 30, "Premium")
    .otherwise("Unknown")
)

In [0]:
price_distribution = (
    steam_flat
    .groupBy("price_segment")
    .count()
)

In [0]:
display(price_distribution)

price_segment,count
Mid Price,47708
Premium,5894
Free to Play,26280
Low Price,111388



The price distribution of games on Steam reveals that the majority of titles are positioned within the low-price segment, representing approximately 58% of the market.

Mid-priced games account for around 25%, while free-to-play titles make up nearly 14% of available content.

In contrast, premium-priced games represent only about 3% of the overall offering.

This suggests that the Steam marketplace is largely driven by accessible pricing strategies aimed at maximizing player adoption.

For Ubisoft, this indicates that competitive pricing may be essential to ensure visibility and market penetration, particularly in a platform dominated by lower-cost alternatives.
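
The segment shares discussed in this section follow directly from the counts in the output table. A minimal plain-Python sketch of that arithmetic, using only the four counts displayed above:

```python
# Counts copied from the price_segment output table.
segment_counts = {
    "Low Price": 111388,
    "Mid Price": 47708,
    "Free to Play": 26280,
    "Premium": 5894,
}

total_rows = sum(segment_counts.values())

# Share of each segment as a percentage of all rows, rounded to one decimal.
segment_shares = {
    segment: round(100 * count / total_rows, 1)
    for segment, count in segment_counts.items()
}

print(total_rows)
print(segment_shares)
```

Deriving the percentages from the counts, rather than reading them off a chart, makes it easy to recompute them if the segment boundaries change.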

### 5.5 Age Restrictions

Analyzing age restrictions provides insights into the target audience of games available on the Steam platform.

This helps determine whether the marketplace is primarily oriented towards:

- family-friendly content
- teenage audiences
- or mature players

Understanding the distribution of age-restricted content enables Ubisoft to better define the target demographic for its future game release and anticipate potential market reach limitations.
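
The segmentation rule used in this section can be sketched in plain Python. This is an illustration only: the raw values passed in the example (such as `"18+"`) are hypothetical inputs, since the preview only shows `0`, and the helper names are not part of the notebook.

```python
import re
from typing import Optional


def extract_age(raw: Optional[str]) -> Optional[int]:
    """Return the first run of digits in the raw value as an integer, if any."""
    if raw is None:
        return None
    match = re.search(r"\d+", str(raw))
    return int(match.group(0)) if match else None


def age_segment(age: Optional[int]) -> str:
    """Map a minimum age to the segment labels used in this section."""
    if age is None:
        return "Unknown"
    if age == 0:
        return "All Ages"
    if age <= 12:
        return "Children"
    if age <= 16:
        return "Teen"
    return "Mature"


for raw_value in ["0", "12", "16", "18+", None]:
    print(raw_value, "->", age_segment(extract_age(raw_value)))
```

Note that a value of `0` is treated as "All Ages", which assumes that publishers who declare no minimum age really do target every audience.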

In [0]:
from pyspark.sql.functions import regexp_extract

steam_flat = steam_flat.withColumn(
    "required_age_clean",
    regexp_extract(col("required_age"), r'\d+', 0).cast("int")
)

In [0]:
steam_flat = steam_flat.withColumn(
    "age_segment",
    when(col("required_age_clean") == 0, "All Ages")
    .when((col("required_age_clean") > 0) & (col("required_age_clean") <= 12), "Children")
    .when((col("required_age_clean") > 12) & (col("required_age_clean") <= 16), "Teen")
    .when(col("required_age_clean") > 16, "Mature")
    .otherwise("Unknown")
)

In [0]:
age_distribution = (
    steam_flat
    .groupBy("age_segment")
    .count()
)

In [0]:
display(age_distribution)

age_segment,count
Children,179
All Ages,188205
Teen,1741
Mature,1145



The distribution of age restrictions indicates that approximately 98% of games available on Steam are accessible to all age groups.

Titles in the Children, Teen and Mature segments together represent less than 2% of the overall catalog.

This suggests that most developers aim to target a broad player base by offering content with minimal age limitations.

For Ubisoft, this highlights the potential advantage of developing games with wider accessibility in order to maximize audience reach and adoption on digital distribution platforms.

## 6. Genre Analysis

### 6.1 Most Represented Genres

Analyzing the distribution of game genres allows us to better understand which types of content are most commonly produced on the Steam platform.

This provides insights into market trends and helps identify potential areas of saturation or underrepresentation within the video game ecosystem.

Understanding genre prevalence enables Ubisoft to align future game development strategies with existing market demand.
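
Because `genre` is a single comma-separated string, counting genres means splitting each string and tallying the parts. A minimal plain-Python sketch of that logic, using four genre strings taken from the preview in section 4.1:

```python
from collections import Counter

# Genre strings copied from the section 4.1 preview.
genre_strings = [
    "Action",
    "Action, Adventure, Indie",
    "Adventure, Indie, RPG, Strategy",
    "Casual, Indie",
]

# Split on ", " and count every individual genre label.
genre_counter = Counter(
    genre
    for genre_string in genre_strings
    for genre in genre_string.split(", ")
)

print(genre_counter.most_common(3))
```

A game tagged with several genres contributes to each of them, so the genre counts add up to more than the number of games.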

In [0]:
from pyspark.sql.functions import split

steam_flat = steam_flat.withColumn(
    "genre_array",
    split(col("genre"), ", ")
)

In [0]:
from pyspark.sql.functions import explode

steam_genre = steam_flat.withColumn(
    "genre_exploded",
    explode("genre_array")
)

In [0]:
genre_count = (
    steam_genre
    .groupBy("genre_exploded")
    .count()
    .orderBy("count", ascending=False)
)

In [0]:
display(genre_count)

genre_exploded,count
Indie,140956
Action,99146
Casual,70668
Adventure,68142
Strategy,39690
Simulation,35183
RPG,33139
Early Access,21754
Free to Play,13425
Sports,13353



The genre distribution highlights that Indie games represent the largest share of titles available on Steam.

Other widely represented genres include Action, Casual, Adventure, Strategy and Simulation.

This indicates that the platform is largely driven by independent productions, reflecting relatively low barriers to entry for smaller development studios.

For Ubisoft, this suggests that competition is particularly strong within these popular genres, and that differentiation through gameplay innovation or production quality may be necessary to stand out in a saturated market.

### 6.2 Best Rated Genres

Analyzing the positive review ratio by genre allows us to evaluate player satisfaction across different types of games available on Steam.

This helps identify which genres tend to generate more positive player feedback and engagement.

Understanding player satisfaction by genre enables Ubisoft to focus development efforts on content types that are more likely to be well received by the gaming community.
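
The positive ratio and its average can be sketched in plain Python. This is a minimal illustration using three review counts from the section 4.1 preview; the helper name is hypothetical. It also contrasts the average of per-game ratios (the approach used in this section) with a pooled ratio over all reviews.

```python
from typing import List, Optional, Tuple


def positive_ratio(positive: int, negative: int) -> Optional[float]:
    """Share of positive reviews, or None when a game has no reviews."""
    total = positive + negative
    return positive / total if total > 0 else None


# (positive, negative) pairs taken from the section 4.1 preview.
reviews: List[Tuple[int, int]] = [(201215, 5199), (6, 0), (0, 1)]

ratios = [positive_ratio(p, n) for p, n in reviews]
valid_ratios = [r for r in ratios if r is not None]

# Average of per-game ratios: every game weighs the same.
mean_of_ratios = sum(valid_ratios) / len(valid_ratios)

# Pooled ratio: every review weighs the same.
pooled_ratio = sum(p for p, _ in reviews) / sum(p + n for p, n in reviews)

print(round(mean_of_ratios, 3), round(pooled_ratio, 3))
```

The two figures differ widely because a game with a single negative review counts as much as one with over 200,000 reviews when ratios are averaged per game. A minimum review threshold is one possible way to reduce that sensitivity.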

In [0]:
from pyspark.sql.functions import when

steam_genre = steam_genre.withColumn(
    "positive_ratio",
    when(
        (col("positive") + col("negative")) > 0,
        col("positive") / (col("positive") + col("negative"))
    )
)

In [0]:
from pyspark.sql.functions import avg, round

genre_rating = (
    steam_genre
    .filter(col("positive_ratio").isNotNull())
    .groupBy("genre_exploded")
    .agg(
        round(avg("positive_ratio"),2).alias("avg_positive_ratio")
    )
    .orderBy("avg_positive_ratio", ascending=False)
)

In [0]:
display(genre_rating)

genre_exploded,avg_positive_ratio
Web Publishing,0.8
Photo Editing,0.77
Casual,0.77
Indie,0.76
Action,0.76
Adventure,0.76
RPG,0.74
Strategy,0.74
Education,0.73
Racing,0.73



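The analysis of positive review ratios across genres shows that Web Publishing (0.80), Photo Editing (0.77), Casual (0.77), and Indie, Action and Adventure (0.76 each) achieve the highest average ratios. The first two are software categories with very few entries (67 and 42 Windows rows respectively in the section 7.2 table), so their averages are less robust than those of the large game genres.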

In contrast, genres such as Movie, Nudity and Violent display comparatively lower positive feedback ratios.

This suggests that games offering accessible, creative or experience-driven content may be more positively received by players than those associated with more niche or sensitive themes.

For Ubisoft, this highlights the importance of aligning game design with player expectations in order to enhance user satisfaction and overall reception.

### 6.3 Most Lucrative Genres

Analyzing the average number of game owners by genre allows us to estimate the commercial performance of different types of games available on Steam.

Since direct sales data is not available, the number of owners is used as a proxy to evaluate a game's potential market success.

This analysis helps identify which genres tend to attract the largest player base and may therefore generate higher revenue potential.

Understanding these trends enables Ubisoft to focus development efforts on genres with stronger commercial performance.

In [0]:
from pyspark.sql.functions import avg, round

genre_owners = (
    steam_genre
    .filter(col("owners_avg").isNotNull())
    .groupBy("genre_exploded")
    .agg(
        round(avg("owners_avg"),0).alias("avg_owners")
    )
    .orderBy("avg_owners", ascending=False)
)

In [0]:
display(genre_owners)

genre_exploded,avg_owners
Photo Editing,4232738.0
Free to Play,1141236.0
Massively Multiplayer,878689.0
Animation & Modeling,855639.0
Design & Illustration,780244.0
Movie,750000.0
Utilities,518771.0
RPG,389877.0
Action,379607.0
Strategy,278496.0



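The analysis of average ownership across genres shows that Photo Editing has the highest average ownership estimate on the Steam platform, although this category contains very few entries (42 Windows rows in the section 7.2 table), so its average is sensitive to a small number of titles.

Other genres such as Free to Play, Massively Multiplayer, Animation & Modeling and Design & Illustration also demonstrate strong ownership performance.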

In contrast, genres such as Accounting, Education and Gore attract a significantly smaller player base.

This suggests that interactive or community-driven experiences tend to achieve greater market reach compared to more niche or specialized content.

For Ubisoft, this highlights the importance of focusing on widely appealing gameplay experiences in order to maximize player acquisition and commercial success.

## 7. Platform Analysis

### 7.1 Games per Operating System

Analyzing platform availability allows us to evaluate how games are distributed across different operating systems on the Steam platform.

This helps determine whether the majority of games are developed for specific platforms such as Windows, Mac or Linux.

Understanding platform distribution enables Ubisoft to assess which operating systems are most commonly targeted and optimize compatibility decisions for future game releases.

In [0]:
# Alias the Spark aggregate so it does not shadow Python's built-in sum()
from pyspark.sql.functions import col, sum as spark_sum

platform_count = steam_flat.select(
    spark_sum(col("windows").cast("int")).alias("Windows"),
    spark_sum(col("mac").cast("int")).alias("Mac"),
    spark_sum(col("linux").cast("int")).alias("Linux")
)

platform_long = platform_count.selectExpr(
    "stack(3, 'Windows', Windows, 'Mac', Mac, 'Linux', Linux) as (Platform, Game_Count)"
)

In [0]:
display(platform_long)

Platform,Game_Count
Windows,191240
Mac,54519
Linux,39536



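The platform availability analysis shows that nearly every row in the dataset is flagged as Windows compatible (191,240 of the 191,270 rows that the section 5.4 segment counts add up to), against about 28.5% for Mac and 20.7% for Linux.

Expressed as a share of all platform flags combined, Windows accounts for approximately 67%, while Mac and Linux represent around 19% and 14% respectively.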

This indicates that Windows remains the primary target for game distribution on the platform, likely due to its larger player base and development compatibility.

For Ubisoft, this highlights the importance of prioritizing Windows support in order to maximize accessibility and market reach.
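
The percentages for this section depend on the denominator chosen. A minimal plain-Python sketch of both calculations, using the counts from the table above and assuming a total of 191,270 rows (the sum of the price segment counts in section 5.4):

```python
# Counts copied from the platform output table.
platform_counts = {"Windows": 191240, "Mac": 54519, "Linux": 39536}

# Assumption: total number of rows, taken as the sum of the section 5.4 segment counts.
total_rows = 191270

# Denominator 1: all platform flags added together (a row can carry several flags).
total_flags = sum(platform_counts.values())
share_of_flags = {
    platform: round(100 * count / total_flags, 1)
    for platform, count in platform_counts.items()
}

# Denominator 2: the number of rows, which gives the share of rows supporting each platform.
share_of_rows = {
    platform: round(100 * count / total_rows, 1)
    for platform, count in platform_counts.items()
}

print(share_of_flags)
print(share_of_rows)
```

The first result describes how platform support is split across operating systems, while the second answers the question "what fraction of rows supports this platform", which is usually the more relevant figure for a compatibility decision.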

### 7.2 Genre per Platform

Analyzing the availability of different genres across platforms allows us to determine whether certain types of games are more commonly developed for specific operating systems.

This helps identify potential technical or strategic preferences in game development depending on platform compatibility.

Understanding these patterns enables Ubisoft to better align genre selection with platform targeting strategies for future game releases.

In [0]:
from pyspark.sql.functions import col

genre_platform = steam_genre.select(
    col("genre_exploded"),
    col("windows").cast("int").alias("Windows"),
    col("mac").cast("int").alias("Mac"),
    col("linux").cast("int").alias("Linux")
)

In [0]:
genre_platform_count = (
    genre_platform
    .groupBy("genre_exploded")
    .sum("Windows", "Mac", "Linux")
)

In [0]:
display(genre_platform_count)

genre_exploded,sum(Windows),sum(Mac),sum(Linux)
Photo Editing,42,5,1
Web Publishing,67,21,22
Accounting,11,0,0
Indie,140946,44371,33498
Violent,523,112,62
Nudity,120,30,23
Casual,70656,20411,14436
Movie,13,0,0
Simulation,35171,9906,6726
Action,99131,24844,19388



The analysis of genre availability across platforms shows that most game categories are predominantly compatible with Windows systems.

Genres such as Indie, Action, Casual, Strategy and Simulation display significantly higher availability on Windows compared to Mac and Linux platforms.

This suggests that platform compatibility remains uneven across operating systems, with developers prioritizing Windows support due to technical or market-related considerations.

For Ubisoft, this highlights the importance of ensuring strong Windows compatibility when targeting widely represented genres in order to maximize accessibility and player reach.

## 8. Conclusion

This exploratory analysis of the Steam marketplace provided valuable insights into the global video game ecosystem and the key factors influencing game popularity and commercial performance.

The results indicate that the platform is largely driven by independent productions, accessible pricing strategies and content targeting a broad audience. In addition, Windows remains the dominant platform for game distribution across all genres.

Player satisfaction and ownership levels also vary depending on genre, suggesting that both content type and pricing strategy play a significant role in market success.

For Ubisoft, these findings highlight the importance of aligning game development with market expectations in terms of genre selection, pricing strategy and platform compatibility in order to maximize player adoption and commercial performance.