# 02 — Exploratory Data Analysis

Before building any models, we need to understand the Steam market landscape. This notebook answers:

1. What does the distribution of games by genre/tag look like?
2. How does playtime correlate with review scores and ownership?
3. What's the relationship between price and ownership across genres?
4. Which genres/tags are growing vs. stagnant?
5. What does the "typical" successful indie game look like vs. the typical failure?

In [None]:
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sys.path.insert(0, str(Path.cwd().parent))

from src.visualisation.market_map import (
    plot_genre_cooccurrence,
    plot_playtime_vs_owners,
    plot_releases_over_time,
    plot_revenue_by_genre,
)

sns.set_theme(style="whitegrid", palette="viridis")

PROCESSED_DIR = Path("../data/processed")

In [None]:
# Load processed data
games = pd.read_json(PROCESSED_DIR / "games.json", lines=True)
user_games = pd.read_csv(PROCESSED_DIR / "user_games.csv")

print(f"Games: {len(games):,} | Users: {user_games['steam_id'].nunique():,} | User-game pairs: {len(user_games):,}")
games.head()

## Genre and Tag Landscape

We expect the Steam catalogue to follow a classic long-tail distribution: a few genres dominate, while many niche tag combinations have only a handful of titles. That sparse tail is where we will look for potential market gaps.
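One way to put a number on the long tail is to count how many distinct genre combinations appear on only a few titles. The sketch below assumes `genres` holds a list of strings per game with no missing values, which is what the `explode` call used in this notebook implies. The helper name `sparse_genre_combos` and the `max_titles` cutoff are illustrative and not part of the project code.

```python
import pandas as pd


def sparse_genre_combos(games: pd.DataFrame, max_titles: int = 5) -> pd.Series:
    """Return genre combinations that appear on at most `max_titles` games."""
    # Sort each list so ["RPG", "Indie"] and ["Indie", "RPG"] count as one combination.
    combos = games["genres"].apply(lambda g: " + ".join(sorted(g)))
    counts = combos.value_counts()
    return counts[counts <= max_titles]


# Toy data for illustration only; not drawn from the Steam dataset.
toy = pd.DataFrame(
    {
        "genres": [
            ["Action"],
            ["Action"],
            ["Action"],
            ["Indie", "RPG"],
            ["RPG", "Indie"],
            ["Simulation", "Strategy"],
        ]
    }
)
sparse = sparse_genre_combos(toy, max_titles=2)
print(sparse)
```

A low count alone does not prove a gap: a combination may be rare because demand is low, so this list is a starting point for the revenue and engagement checks later in the notebook.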

In [None]:
# Genre frequency
genre_counts = games.explode("genres")["genres"].value_counts()

fig, ax = plt.subplots(figsize=(12, 6))
genre_counts.head(20).plot.barh(ax=ax, color="#2ecc71")
ax.set_xlabel("Number of Games")
ax.set_title("Top 20 Genres by Game Count", fontweight="bold")
ax.invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Genre co-occurrence heatmap
plot_genre_cooccurrence(games)

## Revenue Distribution

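Estimated revenue (estimated owners × current price) is a rough proxy: it ignores discounts, regional pricing, and refunds, so treat it as a relative measure rather than actual earnings. We expect it to be highly right-skewed, with a small number of blockbusters capturing the majority of revenue while most games earn relatively little.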
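The concentration claim above can be checked directly by measuring what share of total revenue the top slice of games captures. This is a minimal sketch: it assumes `estimated_revenue` is `owners_mid * price_dollars`, following the definition in the text, and the helper name `top_share` is hypothetical.

```python
import pandas as pd


def top_share(revenue: pd.Series, top_frac: float = 0.10) -> float:
    """Fraction of total revenue captured by the top `top_frac` of games."""
    ranked = revenue.dropna().sort_values(ascending=False)
    # Always include at least one game, even for very small samples.
    n_top = max(1, int(len(ranked) * top_frac))
    return ranked.iloc[:n_top].sum() / ranked.sum()


# Toy numbers for illustration only; not drawn from the Steam dataset.
toy = pd.DataFrame(
    {
        "owners_mid": [1_000_000, 50_000, 20_000, 10_000, 5_000, 5_000, 2_000, 2_000, 1_000, 1_000],
        "price_dollars": [20.0] * 10,
    }
)
# Assumption: revenue proxy = estimated owners x current price, as described in the text.
toy["estimated_revenue"] = toy["owners_mid"] * toy["price_dollars"]

share = top_share(toy["estimated_revenue"], top_frac=0.10)
print(f"Top 10% of games capture {share:.1%} of estimated revenue")
```

On the real data, calling `top_share(games["estimated_revenue"], 0.01)` and `top_share(games["estimated_revenue"], 0.10)` would give two figures to record under the revenue concentration takeaway.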

In [None]:
plot_revenue_by_genre(games)

## Playtime vs. Ownership

Do games with high engagement (playtime) also have high ownership? Or are there hidden gems with dedicated but small playerbases?
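A simple way to surface candidate "hidden gems" is to flag games in the top quartile of median playtime and the bottom quartile of estimated owners. The sketch below uses the `median_forever` and `owners_mid` columns referenced later in this notebook. The quantile cutoffs, the helper name `hidden_gems`, and the `name` column in the toy data are assumptions for illustration.

```python
import pandas as pd


def hidden_gems(
    games: pd.DataFrame, playtime_q: float = 0.75, owners_q: float = 0.25
) -> pd.DataFrame:
    """Games with high median playtime but low estimated ownership."""
    high_playtime = games["median_forever"] >= games["median_forever"].quantile(playtime_q)
    low_owners = games["owners_mid"] <= games["owners_mid"].quantile(owners_q)
    return games[high_playtime & low_owners]


# Toy data for illustration only; not drawn from the Steam dataset.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E"],  # hypothetical column
        "median_forever": [100, 900, 50, 30, 800],
        "owners_mid": [1_000_000, 5_000, 20_000, 500_000, 2_000_000],
    }
)
gems = hidden_gems(toy)
print(gems)
```

Median playtime for games with very few owners rests on small samples, so anything flagged this way is worth checking by hand before drawing conclusions.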

In [None]:
plot_playtime_vs_owners(games)

## Release Trends Over Time

Are certain niches growing (more releases in recent years) or stagnating?
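To make "growing vs. stagnating" concrete, one option is to count releases per genre per year and compare the most recent years with the years just before them. This is a sketch under assumptions: the `release_date` column name is hypothetical (the processed data may name it differently), and the helpers `releases_per_year` and `recent_growth` are not part of `src.visualisation.market_map`.

```python
import pandas as pd


def releases_per_year(games: pd.DataFrame) -> pd.DataFrame:
    """Count releases per genre (rows) and release year (columns)."""
    df = games.explode("genres", ignore_index=True)
    # Unparseable dates become NaT and are dropped below.
    df["year"] = pd.to_datetime(df["release_date"], errors="coerce").dt.year
    df = df.dropna(subset=["genres", "year"])
    return pd.crosstab(df["genres"], df["year"].astype(int))


def recent_growth(counts: pd.DataFrame, window: int = 2) -> pd.Series:
    """Mean yearly releases in the last `window` years divided by the `window` years before."""
    years = sorted(counts.columns)
    recent = counts[years[-window:]].mean(axis=1)
    earlier = counts[years[-2 * window : -window]].mean(axis=1)
    # A genre with no earlier releases yields inf (or NaN if it has none at all).
    return (recent / earlier).sort_values(ascending=False)


# Toy data for illustration only; not drawn from the Steam dataset.
toy = pd.DataFrame(
    {
        "genres": [["Action"], ["Action"], ["Action", "Roguelike"], ["Action", "Roguelike"],
                   ["Roguelike"], ["Roguelike"], ["Roguelike"]],
        "release_date": ["2019-03-01", "2020-06-15", "2021-01-10", "2022-09-30",
                         "2019-11-05", "2022-02-20", "2022-07-07"],
    }
)
counts = releases_per_year(toy)
growth = recent_growth(counts, window=2)
print(growth)
```

If the most recent year in the dataset is incomplete, it will understate recent releases, so it may be safer to drop that year before computing the ratio.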

In [None]:
plot_releases_over_time(games)

## Success vs. Failure Profile

What distinguishes a "successful" indie game from the rest? As a first pass, we define success as top-quartile estimated revenue across the whole dataset; a per-genre threshold is a possible refinement.

In [None]:
# Success threshold: top 25% revenue within the dataset
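if "estimated_revenue" in games.columns:
    threshold = games["estimated_revenue"].quantile(0.75)
    # Games with missing revenue compare as False and fall into the "below" group
    games["success"] = games["estimated_revenue"] >= threshold

    compare_cols = ["price_dollars", "review_score", "median_forever", "owners_mid", "metacritic"]
    available = [c for c in compare_cols if c in games.columns]

    comparison = games.groupby("success")[available].median()
    # Rename by key so the labels stay correct even if one group is empty
    comparison = comparison.rename(
        index={False: "Below 75th pctl", True: "At or above 75th pctl (success)"}
    )
    print("Median metrics: successful vs. other games")
    print(comparison.T.to_string())
else:
    print("No 'estimated_revenue' column found; skipping the success comparison.")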
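The cell above applies a single dataset-wide threshold. If success should instead be judged relative to a game's own genre, a per-genre quantile is one option. The sketch below is an assumption about how that could look, not the project's method: the helper name `success_within_genre` and the output columns `genre_threshold` and `success_in_genre` are hypothetical, and a game with several genres appears once per genre.

```python
import pandas as pd


def success_within_genre(games: pd.DataFrame, q: float = 0.75) -> pd.DataFrame:
    """One row per (game, genre), flagged if revenue is in the top quantile of that genre."""
    per_genre = games.explode("genres", ignore_index=True).dropna(
        subset=["genres", "estimated_revenue"]
    )
    # transform broadcasts each genre's quantile back onto that genre's rows.
    threshold = per_genre.groupby("genres")["estimated_revenue"].transform(
        lambda s: s.quantile(q)
    )
    return per_genre.assign(
        genre_threshold=threshold,
        success_in_genre=per_genre["estimated_revenue"] >= threshold,
    )


# Toy data for illustration only; not drawn from the Steam dataset.
toy = pd.DataFrame(
    {
        "genres": [["Action"]] * 5 + [["Puzzle"]] * 5,
        "estimated_revenue": [100, 200, 300, 400, 1000, 10, 20, 30, 40, 50],
    }
)
result = success_within_genre(toy)
print(result)
```

In the toy data, the best-earning Puzzle game counts as a success within its genre even though it would fall below a dataset-wide 75th percentile, which is the difference between the two definitions.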

## Key EDA Takeaways

*To be filled after data collection:*

1. **Genre distribution:** ...
2. **Revenue concentration:** ...
3. **Engagement patterns:** ...
4. **Growth trends:** ...
5. **Success profile:** ...