Analyzed Steam user–game interaction data to distinguish ownership from actual engagement, modeling behavior at the user–game level. Defined low engagement with a playtime threshold, found that about 35% of purchased games with recorded playtime receive under 2 hours in total, and showed that low-engagement counts are highest among very popular titles, which is consistent with large exploratory funnels rather than poor product quality. The analysis highlights a substantial gap between purchase behavior and sustained usage, with implications for recommendation systems, discovery, and post-purchase engagement strategies.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('/kaggle/input/steam-200k-dataset/steam-200k.csv')

df.head()

Unnamed: 0,151603712,The Elder Scrolls V Skyrim,purchase,1.0,0
0,151603712,The Elder Scrolls V Skyrim,play,273.0,0
1,151603712,Fallout 4,purchase,1.0,0
2,151603712,Fallout 4,play,87.0,0
3,151603712,Spore,purchase,1.0,0
4,151603712,Spore,play,14.9,0
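
The column labels in the output above are actually the file's first record, which suggests the CSV has no header row and read_csv consumed one data row as names. A minimal sketch of loading without losing that row, using an in-memory copy of the six records visible above. The name sample_csv is a stand-in for the real file, and the column names are the ones assigned later in this notebook.

```python
import io

import pandas as pd

# Stand-in for the real file: the six records visible in the output above
# (the first one is the row that was consumed as a header).
sample_csv = io.StringIO(
    "151603712,The Elder Scrolls V Skyrim,purchase,1.0,0\n"
    "151603712,The Elder Scrolls V Skyrim,play,273.0,0\n"
    "151603712,Fallout 4,purchase,1.0,0\n"
    "151603712,Fallout 4,play,87.0,0\n"
    "151603712,Spore,purchase,1.0,0\n"
    "151603712,Spore,play,14.9,0\n"
)

# header=None tells pandas that the first line is data, not column names.
sample = pd.read_csv(
    sample_csv,
    header=None,
    names=["user_id", "game", "behavior", "hours", "extra"],
    dtype={"user_id": str},  # keep IDs as strings from the start
)

print(sample.shape)
print(sample.iloc[0].tolist())
```

Loaded this way, the first Skyrim purchase row stays in the data and the later column rename and astype('str') steps become unnecessary.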


TASK 1: Data Understanding & Grain

Business question: What do users actually do with the games they buy?

In [3]:
df.shape

(199999, 5)

In [4]:
df.describe()

Unnamed: 0,151603712,1.0,0
count,199999.0,199999.0,199999.0
mean,103655600.0,17.874468,0.0
std,72080840.0,138.057292,0.0
min,5250.0,0.1,0.0
25%,47384200.0,1.0,0.0
50%,86912010.0,1.0,0.0
75%,154230900.0,1.3,0.0
max,309903100.0,11754.0,0.0


In [5]:
df.dtypes

151603712                       int64
The Elder Scrolls V Skyrim     object
purchase                       object
1.0                           float64
0                               int64
dtype: object

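Adding headers: the dataset ships without a header row, so read_csv used the first record (a Skyrim purchase row for user 151603712) as column names. Renaming the columns fixes the labels but does not bring that row back; loading with header=None would keep it.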

In [6]:
df.columns = [
    'user_id',
    'game',
    'behavior',
    'hours',
    'extra'
]

Modifying data types

In [7]:
df['user_id'] = df['user_id'].astype('str')

In [8]:
df.describe()

Unnamed: 0,hours,extra
count,199999.0,199999.0
mean,17.874468,0.0
std,138.057292,0.0
min,0.1,0.0
25%,1.0,0.0
50%,1.0,0.0
75%,1.3,0.0
max,11754.0,0.0


In [9]:
df['extra'].unique()

array([0])

In [10]:
df = df.drop(columns='extra')

In [11]:
df.head()

Unnamed: 0,user_id,game,behavior,hours
0,151603712,The Elder Scrolls V Skyrim,play,273.0
1,151603712,Fallout 4,purchase,1.0
2,151603712,Fallout 4,play,87.0
3,151603712,Spore,purchase,1.0
4,151603712,Spore,play,14.9


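The grain is one row per user, per game, per behavior.

Each row records a single user–game interaction of one behavior type (purchase or play). The hours column means different things depending on behavior: for play rows it is the hours played, while for purchase rows it is a constant 1.0 placeholder (verified in the sanity checks below).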
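
A stated grain is easy to verify by checking the key columns for duplicates. The sketch below uses made-up rows in the same layout as df; the toy frame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy rows in the same layout as the notebook's df (values are illustrative).
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "game": ["Spore", "Spore", "Spore", "Dota 2"],
    "behavior": ["purchase", "play", "play", "purchase"],
    "hours": [1.0, 14.9, 14.9, 1.0],
})

grain = ["user_id", "game", "behavior"]

# Rows beyond the first occurrence of each key violate the assumed grain.
n_violations = int(toy.duplicated(subset=grain).sum())

# keep=False marks every member of a duplicated group, which helps inspection.
dupes = toy[toy.duplicated(subset=grain, keep=False)]

print(n_violations)
print(dupes)
```

If the count is non-zero on the real data, summing hours per user–game pair (as done later) absorbs duplicate play rows, but it would double-count exact duplicates.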

TASK 2A — Defining Engagement Logic

Business question: Which games are purchased but never meaningfully played, and which users actually engage with what they buy?

In [12]:
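# Row-level flags for each behavior type
df['is_purchase'] = df['behavior'] == 'purchase'
df['is_play'] = df['behavior'] == 'play'
# Any play row with recorded hours; the 2-hour threshold is applied later in TASK 3C
df['has_meaningful_playtime'] = (df['behavior'] == 'play') & (df['hours'] > 0)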

df.head()

Unnamed: 0,user_id,game,behavior,hours,is_purchase,is_play,has_meaningful_playtime
0,151603712,The Elder Scrolls V Skyrim,play,273.0,False,True,True
1,151603712,Fallout 4,purchase,1.0,True,False,False
2,151603712,Fallout 4,play,87.0,False,True,True
3,151603712,Spore,purchase,1.0,True,False,False
4,151603712,Spore,play,14.9,False,True,True


TASK 2B — Sanity Checks

Count is_purchase vs is_play

Check if any play rows have hours == 0

Verify purchases always have hours == 1

In [13]:
df[df['is_play'] & (df['hours'] == 0)]

Unnamed: 0,user_id,game,behavior,hours,is_purchase,is_play,has_meaningful_playtime


In [14]:
df[df['is_purchase'] & (df['hours'] != 1)]

Unnamed: 0,user_id,game,behavior,hours,is_purchase,is_play,has_meaningful_playtime
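
The first check in the list above (counting purchase versus play rows) has no cell of its own. A minimal sketch of all three checks as counts, on a made-up frame; the toy data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Illustrative rows only; the real df has the same two columns.
toy = pd.DataFrame({
    "behavior": ["purchase", "play", "purchase", "purchase", "play"],
    "hours": [1.0, 273.0, 1.0, 1.0, 0.7],
})

# Check 1: how many rows of each behavior type exist.
behavior_counts = toy["behavior"].value_counts()

is_play = toy["behavior"] == "play"
is_purchase = toy["behavior"] == "purchase"

# Check 2: play rows with zero hours (expected: none).
n_zero_hour_plays = int((is_play & (toy["hours"] == 0)).sum())

# Check 3: purchase rows whose hours differ from the 1.0 placeholder (expected: none).
n_odd_purchases = int((is_purchase & (toy["hours"] != 1)).sum())

print(behavior_counts)
print(n_zero_hour_plays, n_odd_purchases)
```

When purchase rows outnumber play rows, some purchased games have no play row at all, which matters for the never-played analysis later.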


Conclusion: Purchase rows encode ownership, not usage (their hours value is always 1); play rows encode engagement through hours played, and none of them has zero hours.

TASK 3A — User–Game Engagement Modeling


Business question: Which user–game pairs show real engagement, and which purchases never turn into play?

In [15]:
# Keep only play rows; purchase rows carry a placeholder hours value of 1.0
user_df = df[df['is_play']]

# Total hours per user–game pair
hours_played = (
    user_df.groupby(['user_id', 'game'])['hours']
    .sum()
    .reset_index(name='total_play_hours')
)



In [16]:
hours_played.head()

Unnamed: 0,user_id,game,total_play_hours
0,100012061,Star Trek D-A-C,0.7
1,100053304,Dota 2,1.0
2,100053304,Dream Of Mirror Online,0.5
3,100053304,Dungeons & Dragons Online,12.6
4,100053304,PAYDAY The Heist,1.1


In [17]:
hours_played['has_played'] = hours_played['total_play_hours'] > 0

In [18]:
has_purchased = (
    df[df['is_purchase']]
      .groupby(['user_id', 'game'])
      .size()
      .reset_index(name='n_purchase_rows')
)

has_purchased['has_purchased'] = True
has_purchased = has_purchased.drop(columns='n_purchase_rows')

Merging the results into my hours_played dataframe

In [19]:
user_game = hours_played.merge(has_purchased, on=['user_id', 'game'], how='left')
user_game['has_purchased'] = user_game['has_purchased'].fillna(False)



In [20]:
user_game.head()

Unnamed: 0,user_id,game,total_play_hours,has_played,has_purchased
0,100012061,Star Trek D-A-C,0.7,True,True
1,100053304,Dota 2,1.0,True,True
2,100053304,Dream Of Mirror Online,0.5,True,True
3,100053304,Dungeons & Dragons Online,12.6,True,True
4,100053304,PAYDAY The Heist,1.1,True,True


TASK 3B — Purchased but Never Played

Business question: How many games are purchased but never actually played?

In [21]:
never_played = user_game[(user_game['has_played'] == False) & (user_game['has_purchased'] ==True)]

never_played

Unnamed: 0,user_id,game,total_play_hours,has_played,has_purchased


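Conclusion - This empty result is expected by construction and does not show that every purchase was played. user_game was built from play rows only (hours_played left-merged with has_purchased), so every pair in it has has_played == True. Purchase-only pairs, if any exist, were dropped before this check; finding them requires starting the merge from has_purchased instead.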
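
One way to surface purchase-only pairs is to start the merge from the purchase side and let pandas flag unmatched rows. This is a sketch on toy frames, not the notebook's result; the frames purchases and plays and the name never_played_toy are hypothetical.

```python
import pandas as pd

# Toy stand-ins for has_purchased and hours_played (values are illustrative).
purchases = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "game": ["Spore", "Fallout 4", "Dota 2"],
})
plays = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "game": ["Spore", "Dota 2"],
    "total_play_hours": [14.9, 1.0],
})

# Left merge from purchases keeps every purchased pair, played or not.
# indicator=True adds a _merge column: 'both' or 'left_only'.
merged = purchases.merge(plays, on=["user_id", "game"], how="left", indicator=True)

# Pairs with no play row have missing hours; treat them as zero playtime.
merged["total_play_hours"] = merged["total_play_hours"].fillna(0.0)

never_played_toy = merged[merged["_merge"] == "left_only"]
print(never_played_toy)
```

Applied to the real frames, the same pattern would give the never-played count, and including those pairs in the denominator would change the low-engagement share computed in TASK 3C.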

TASK 3C - Low Engagement Analysis

Business question: How many purchased games receive very little engagement?

low engagement: total_play_hours < 2 hours

In [22]:
user_game['is_low_engagement'] = user_game['total_play_hours'] < 2
user_game.head()

Unnamed: 0,user_id,game,total_play_hours,has_played,has_purchased,is_low_engagement
0,100012061,Star Trek D-A-C,0.7,True,True,True
1,100053304,Dota 2,1.0,True,True,True
2,100053304,Dream Of Mirror Online,0.5,True,True,True
3,100053304,Dungeons & Dragons Online,12.6,True,True,False
4,100053304,PAYDAY The Heist,1.1,True,True,True


In [23]:
purchased_games_only = user_game[user_game['has_purchased'] == True]

total_purchased_games = len(purchased_games_only)

low_engagement_games = purchased_games_only[
    purchased_games_only['is_low_engagement']
]

total_low_engagement_games = len(low_engagement_games)

pct_of_total = (total_low_engagement_games / total_purchased_games) * 100

pct_of_total

35.39360917191668

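What percentage of purchased games receive less than 2 hours of playtime?

About 35.4%, measured over purchased user–game pairs that have at least one play row.

Over one-third of these purchases never get past 2 hours of play.
Buying a game on Steam does not reliably imply sustained usage.
There is a substantial engagement drop-off after purchase; because pairs without any play row are not in user_game, the true share of low-engagement purchases can only be equal or higher.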
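
The 2-hour cutoff is a judgment call, so it can help to see how the low-engagement share moves with the threshold. A minimal sketch on made-up playtimes; toy_hours and the threshold list are assumptions, not values from the dataset.

```python
import pandas as pd

# Illustrative total playtimes per user-game pair (not real data).
toy_hours = pd.Series(
    [0.3, 0.7, 1.0, 1.5, 2.0, 4.5, 12.6, 87.0, 273.0, 0.9],
    name="total_play_hours",
)

# Candidate cutoffs in hours; 2 is the one used in this notebook.
thresholds = [1, 2, 5, 10]

# The mean of a boolean mask is the share of True values.
share_below = pd.Series(
    {t: (toy_hours < t).mean() * 100 for t in thresholds},
    name="pct_below",
)

print(share_below.round(1))
```

If the share changes little between neighbouring thresholds on the real data, the conclusion does not hinge on the exact cutoff.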

Top 10 games with the highest number of low-engagement purchases


In [24]:
top10 = (
    low_engagement_games.groupby('game')['user_id']
    .size()
    .sort_values(ascending=False)
    .head(10)
)

top10

game
Dota 2                    1626
Team Fortress 2            847
Unturned                   412
Heroes & Generals          199
Counter-Strike             174
Robocraft                  166
Half-Life 2 Lost Coast     163
Counter-Strike Source      156
Portal                     143
Alien Swarm                138
Name: user_id, dtype: int64

The Top 10 is dominated by very popular, often free-to-play or widely owned games.

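Conclusion: these are raw counts, so they largely track how many users own each game. Popular titles attract many exploratory players and therefore accumulate many early drop-offs. Whether drop-off is unusually high for these titles would require the low-engagement rate per game, which this notebook does not compute on the real data.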
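
Raw counts favour games with many owners, so a per-game rate is one way to separate popularity from drop-off. A sketch on a made-up frame shaped like purchased_games_only; the game names and flags are illustrative assumptions.

```python
import pandas as pd

# Toy user-game pairs: one widely owned game, one niche game (illustrative).
toy = pd.DataFrame({
    "game": ["Big Game"] * 6 + ["Niche Game"] * 2,
    "is_low_engagement": [True, True, True, False, False, False, True, True],
})

# Count, low-engagement count, and low-engagement rate per game.
per_game = (
    toy.groupby("game")["is_low_engagement"]
    .agg(n_players="size", n_low="sum", low_rate="mean")
    .sort_values("n_low", ascending=False)
)

print(per_game)
```

In this toy case the popular game leads on count while the niche game has the higher rate, which is the distinction a count-only ranking cannot show.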