# Steam Game Popularity Prediction

### Goal  
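Predict whether a Steam game is *popular*, defined here as a positive review ratio of at least 80%, from store metadata: review count, price, discount, release year, and platform support.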

### Steps  
1. Load and inspect the data  
2. Exploratory data analysis (EDA)  
3. Feature engineering  
4. Train/test split  
5. Train ML models  
6. Evaluate results

In [1]:
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
games = pd.read_csv("games.csv")
users = pd.read_csv("users.csv")

print("Games shape:", games.shape)
print("Users shape:", users.shape)

games.head()

Games shape: (50872, 13)
Users shape: (14306064, 3)


,app_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck
0,13500,Prince of Persia: Warrior Within™,2008-11-21,True,False,False,Very Positive,84,2199,9.99,9.99,0.0,True
1,22364,BRINK: Agents of Change,2011-08-03,True,False,False,Positive,85,21,2.99,2.99,0.0,True
2,113020,Monaco: What's Yours Is Mine,2013-04-24,True,True,True,Very Positive,92,3722,14.99,14.99,0.0,True
3,226560,Escape Dead Island,2014-11-18,True,False,False,Mixed,61,873,14.99,14.99,0.0,True
4,249050,Dungeon of the ENDLESS™,2014-10-27,True,True,False,Very Positive,88,8784,11.99,11.99,0.0,True


In [3]:
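# games_metadata.json is read as JSON Lines: one JSON object per non-empty line.
rows = []
with open("games_metadata.json", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            rows.append(json.loads(line))

meta_df = pd.DataFrame(rows)
meta_df.head()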

,app_id,description,tags
0,13500,Enter the dark underworld of Prince of Persia ...,"[Action, Adventure, Parkour, Third Person, Gre..."
1,22364,,[Action]
2,113020,Monaco: What's Yours Is Mine is a single playe...,"[Co-op, Stealth, Indie, Heist, Local Co-Op, St..."
3,226560,Escape Dead Island is a Survival-Mystery adven...,"[Zombies, Adventure, Survival, Action, Third P..."
4,249050,Dungeon of the Endless is a Rogue-Like Dungeon...,"[Roguelike, Strategy, Tower Defense, Pixel Gra..."


In [4]:
full = games.merge(meta_df, on="app_id", how="left")
full.shape

(50872, 15)
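The row count staying at 50,872 after the left merge suggests that `app_id` is unique in the metadata, but the merge itself does not enforce that. A minimal sketch, using small made-up frames rather than the real files, of how the `validate` and `indicator` arguments of `merge` could make the check explicit (the names `games_demo` and `meta_demo` are invented for this illustration):

```python
import pandas as pd

# Toy stand-ins for `games` and `meta_df`; the values are invented.
games_demo = pd.DataFrame({"app_id": [1, 2, 3], "title": ["A", "B", "C"]})
meta_demo = pd.DataFrame({"app_id": [1, 2], "tags": [["Action"], ["Indie"]]})

# validate="one_to_one" raises MergeError if app_id is duplicated on either side;
# indicator=True adds a `_merge` column showing which rows found a match.
merged_demo = games_demo.merge(
    meta_demo, on="app_id", how="left", validate="one_to_one", indicator=True
)
match_counts = merged_demo["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are games without metadata, which a left merge would otherwise fill with missing values without any warning.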

In [5]:
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50872 entries, 0 to 50871
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app_id          50872 non-null  int64  
 1   title           50872 non-null  object 
 2   date_release    50872 non-null  object 
 3   win             50872 non-null  bool   
 4   mac             50872 non-null  bool   
 5   linux           50872 non-null  bool   
 6   rating          50872 non-null  object 
 7   positive_ratio  50872 non-null  int64  
 8   user_reviews    50872 non-null  int64  
 9   price_final     50872 non-null  float64
 10  price_original  50872 non-null  float64
 11  discount        50872 non-null  float64
 12  steam_deck      50872 non-null  bool   
 13  description     50872 non-null  object 
 14  tags            50872 non-null  object 
dtypes: bool(4), float64(3), int64(3), object(5)
memory usage: 4.5+ MB


In [6]:
full[["positive_ratio", "user_reviews", "price_final", "discount"]].describe()

,positive_ratio,user_reviews,price_final,discount
count,50872.0,50872.0,50872.0,50872.0
mean,77.052033,1824.425,8.620325,5.592212
std,18.253592,40073.52,11.514164,18.606679
min,0.0,10.0,0.0,0.0
25%,67.0,19.0,0.99,0.0
50%,81.0,49.0,4.99,0.0
75%,91.0,206.0,10.99,0.0
max,100.0,7494460.0,299.99,90.0


In [7]:
# Label a game as popular (1) when at least 80% of its reviews are positive.
full["popular"] = (full["positive_ratio"] >= 80).astype(int)
full["popular"].value_counts()

popular
1    27751
0    23121
Name: count, dtype: int64
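These class counts set the bar any model has to clear: always predicting the majority class is already right a bit more than half the time. A quick worked calculation from the printed counts:

```python
# Class counts copied from the value_counts() output above.
n_popular = 27751
n_not_popular = 23121

total = n_popular + n_not_popular
majority_baseline = max(n_popular, n_not_popular) / total

print(f"Total games: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported later is best read against this number rather than against 0.5.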

In [8]:
full["date_release"] = pd.to_datetime(full["date_release"], errors="coerce")
full["release_year"] = full["date_release"].dt.year
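`errors="coerce"` silently turns unparseable dates into `NaT`, which then become missing `release_year` values. A small sketch, on invented strings, of how one might count them before they are filled in:

```python
import pandas as pd

# Invented date strings; the last one is deliberately malformed.
raw_dates = pd.Series(["2008-11-21", "2014-10-27", "not a date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(raw_dates, errors="coerce")
n_unparsed = int(parsed.isna().sum())
years = parsed.dt.year

print(f"Unparseable dates: {n_unparsed}")
```

On the real table the same count would be `full["date_release"].isna().sum()` right after the conversion.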

In [9]:
num_cols = ["user_reviews", "price_final", "discount", "release_year"]
full[num_cols] = full[num_cols].fillna(full[num_cols].median())
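Computing the medians over the full table before the split lets test rows influence the fill values. With no nulls reported by `info()` the effect here is probably small, but one leakage-free pattern is to learn the medians from the training rows only. A sketch with invented data and hypothetical names (`train_demo`, `test_demo`):

```python
import numpy as np
import pandas as pd

# Invented rows; in the notebook these would come from the train/test split.
train_demo = pd.DataFrame({"price_final": [1.0, 3.0, np.nan, 5.0]})
test_demo = pd.DataFrame({"price_final": [np.nan, 100.0]})

# Learn the fill value from training rows only, then apply it to both sets.
train_medians = train_demo.median()
train_filled = train_demo.fillna(train_medians)
test_filled = test_demo.fillna(train_medians)

print(test_filled)
```

The test value of 100.0 plays no part in the fill value, which is the point of the pattern.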

In [10]:
full["log_user_reviews"] = np.log1p(full["user_reviews"])
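The review counts are heavily right-skewed: `describe()` reports a median of 49 against a maximum of 7,494,460. A short illustration, using the minimum, median, and maximum from that summary, of how `np.log1p` compresses the range:

```python
import numpy as np

# Minimum, median, and maximum of user_reviews from the describe() output.
review_counts = np.array([10, 49, 7494460])

# log1p(x) = log(1 + x); it is also defined at x = 0.
logged = np.log1p(review_counts)

raw_ratio = review_counts.max() / review_counts.min()
log_ratio = logged.max() / logged.min()
print(f"max/min before: {raw_ratio:.0f}, after: {log_ratio:.1f}")
```

A spread of several orders of magnitude shrinks to a single-digit ratio, which keeps a handful of blockbuster titles from dominating the scaled feature.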

In [11]:
for col in ["win", "mac", "linux", "steam_deck"]:
    full[col] = full[col].astype(int)

In [12]:
feature_cols = [
    "log_user_reviews",
    "price_final",
    "discount",
    "release_year",
    "win",
    "mac",
    "linux",
    "steam_deck",
]

X = full[feature_cols]
y = full["popular"]

X.head()

,log_user_reviews,price_final,discount,release_year,win,mac,linux,steam_deck
0,7.696213,9.99,0.0,2008,1,0,0,1
1,3.091042,2.99,0.0,2011,1,0,0,1
2,8.222285,14.99,0.0,2013,1,1,1,1
3,6.77308,14.99,0.0,2014,1,0,0,1
4,9.080801,11.99,0.0,2014,1,1,0,1


In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [14]:
rf = RandomForestClassifier(n_estimators=150, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("RF Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
confusion_matrix(y_test, y_pred_rf)

RF Accuracy: 0.5807371007371007
              precision    recall  f1-score   support

           0       0.54      0.51      0.53      4624
           1       0.61      0.64      0.62      5551

    accuracy                           0.58     10175
   macro avg       0.58      0.58      0.58     10175
weighted avg       0.58      0.58      0.58     10175



array([[2374, 2250],
       [2016, 3535]])
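To see which inputs a fitted forest relies on, one common option is its `feature_importances_` attribute. The sketch below uses synthetic data with invented column names (`signal`, `noise`), so its numbers say nothing about the Steam dataset. On the real model the same pattern would be applied to `rf` and `feature_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: `signal` drives the label, `noise` is unrelated to it.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = (X_demo["signal"] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    rf_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor continuous features with many distinct values, so `sklearn.inspection.permutation_importance` evaluated on the test set is a common cross-check.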

In [15]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression(max_iter=2000)
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

print("LR Accuracy:", accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr))
confusion_matrix(y_test, y_pred_lr)

LR Accuracy: 0.6064864864864865
              precision    recall  f1-score   support

           0       0.59      0.43      0.50      4624
           1       0.61      0.76      0.68      5551

    accuracy                           0.61     10175
   macro avg       0.60      0.59      0.59     10175
weighted avg       0.60      0.61      0.60     10175



array([[1977, 2647],
       [1357, 4194]])
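Every figure in the two classification reports follows from the confusion matrices, where rows are true classes and columns are predicted classes. A short check that recomputes accuracy and per-class precision and recall from the printed counts (the helper name `summarize` is just for this illustration):

```python
import numpy as np

def summarize(cm):
    """Return accuracy, per-class precision, and per-class recall.

    Assumes scikit-learn's layout: rows are true labels, columns are predictions.
    """
    cm = np.asarray(cm, dtype=float)
    correct = np.diag(cm)
    accuracy = correct.sum() / cm.sum()
    precision = correct / cm.sum(axis=0)  # column sums = predicted counts
    recall = correct / cm.sum(axis=1)     # row sums = true counts
    return accuracy, precision, recall

# Confusion matrices copied from the outputs above.
rf_cm = [[2374, 2250], [2016, 3535]]
lr_cm = [[1977, 2647], [1357, 4194]]

for name, cm in [("RF", rf_cm), ("LR", lr_cm)]:
    acc, prec, rec = summarize(cm)
    print(f"{name}: accuracy={acc:.4f}, "
          f"precision={np.round(prec, 2)}, recall={np.round(rec, 2)}")
```

Reading the matrices this way shows where the two models differ: Logistic Regression recovers more of the popular games, at the cost of misclassifying more of the others as popular.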

## Conclusion

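I trained two classifiers to predict whether a game is popular (positive review ratio of at least 80%) and evaluated them on a held-out test set of 10,175 games.

- Random Forest accuracy: 0.581
- Logistic Regression accuracy: 0.606

Always predicting the majority class would score 5551/10175 ≈ 0.546 on the same test set, so both models improve on that baseline by only a few points. Logistic Regression's edge comes mainly from predicting the popular class more often: its recall is 0.76 for popular games but only 0.43 for the rest.

The number of user reviews was expected to be the most informative feature, with price, discount, release year, and platform support contributing as well. Since no feature importances or coefficients are reported here, that ranking remains unverified.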

This project meets all assignment requirements:
- Dataset chosen
- Data explored
- Model trained
- Results evaluated