# {Project Title}📝

![Banner](./assets/banner.jpeg)

## Topic
*What problem are you (or your stakeholder) trying to address?*
📝 <!-- Answer Below -->

## Project Question
*What specific question are you seeking to answer with this project?*
*This is not the same as the questions you ask to limit the scope of the project.*
📝 <!-- Answer Below -->

## What would an answer look like?
*What is your hypothesized answer to your question?*
📝 <!-- Answer Below -->

## Data Sources
*What 3 data sources have you identified for this project?*
*How are you going to relate these datasets?*
📝 <!-- Answer Below -->

## Approach and Analysis
*What is your approach to answering your project question?*
*How will you use the identified data to answer your project question?*
📝 <!-- Start Discussing the project here; you can add as many code cells as you need -->
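
The merge step planned for this project (linking the datasets through the Steam App ID) can be sketched on toy data before it is applied to the real files. This is a minimal sketch, not the project's actual merge: the frames `games_a` and `games_b`, their column names, and all values are hypothetical stand-ins. It uses the `validate` and `indicator` options of `pd.merge` to check that each app ID appears once per table and to see how many rows matched.

```python
import pandas as pd

# Toy stand-ins for two game tables; every value here is made up for illustration.
games_a = pd.DataFrame({
    "appid": [10, 20, 30],
    "genres": ["Action", "Strategy", "Indie"],
})
games_b = pd.DataFrame({
    "appid": [10, 20, 40],
    "average_playtime": [120, 45, 300],
})

# validate="one_to_one" raises an error if either table repeats an app ID;
# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(
    games_a, games_b,
    on="appid", how="outer",
    validate="one_to_one", indicator=True,
)

# Count rows present in both tables versus only one of them.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Checking `match_counts` before plotting shows how much of each dataset survives the join, which matters when an outer merge leaves many rows with missing genres or player counts.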

In [None]:
# Start your code here
# Project question: What are the long-term trends in PC player playtime and engagement across different game genres?
# Hypothesis: Multiplayer genres will show steady growth in engagement, while single-player genres will decline over time.
# Data sources: Steam Charts, Kaggle Video Game Dataset, Steam Web API
# Approach: I will import and clean the datasets, then merge them using the Steam App ID
# to link player counts, genres, and release data. After aggregating playtime by genre each year,
# I’ll create visualizations such as line charts, bar charts, and heatmaps to reveal long-term engagement
# trends. Finally, I’ll interpret the results to identify which genres are rising or declining
# and what that means for the future of PC gaming.
# Data source links:
# https://www.kaggle.com/datasets/fronkongames/steam-games-dataset?resource=download
# https://steamcharts.com/app/730
# https://www.kaggle.com/datasets/nikdavis/steam-store-games


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import requests
from bs4 import BeautifulSoup

#df = pd.read_csv("kaggle_videogames.csv")

df2 = pd.read_csv("steam.csv")

df = pd.read_csv("sample_videogames.csv")

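# Scrape the monthly player table for app 730 from Steam Charts
url = "https://steamcharts.com/app/730"
page = requests.get(url, timeout=30)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.text, "html.parser")
rows = soup.select("table.common-table tbody tr")
data = []
for row in rows:
    cols = [col.text.strip() for col in row.find_all("td")]
    if cols:  # skip rows that contain no data cells
        data.append(cols)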

#cleaning the Kaggle video games data (df)
df.drop_duplicates(inplace=True)
missing = df.isnull().sum().sort_values(ascending=False)
print(missing.head(10))
df = df.dropna(subset=["Genres", "Average playtime forever", "Release date"])
df["Release date"] = pd.to_datetime(df["Release date"], errors="coerce")
df["Year"] = df["Release date"].dt.year
df["Average playtime forever"] = pd.to_numeric(df["Average playtime forever"], errors="coerce")
df["Estimated owners"] = pd.to_numeric(df["Estimated owners"], errors="coerce")
# Assumption: this file names its ID column "AppID"; rename it to match steam.csv
df = df.rename(columns={"AppID": "appid"})

#cleaning steam.csv
df2['release_date'] = pd.to_datetime(df2['release_date'], errors='coerce')
df2['price'] = pd.to_numeric(df2['price'], errors='coerce')
def owners_to_midpoint(value):
    """Convert an owner range such as "10000-20000" to its midpoint."""
    if isinstance(value, str) and '-' in value:
        low, high = value.split('-')
        return (int(low) + int(high)) / 2
    return np.nan

df2['owners'] = df2['owners'].apply(owners_to_midpoint)
# Fill missing text fields with a placeholder
for col in ['developer', 'publisher', 'genres', 'categories', 'platforms']:
    if col in df2.columns:
        df2[col] = df2[col].fillna("Unknown")

# Fill missing numeric fields with 0
for col in ['positive_ratings', 'negative_ratings', 'average_playtime', 'price']:
    if col in df2.columns:
        df2[col] = df2[col].fillna(0)

# Keep only the first entry of each semicolon-separated list
for col in ['genres', 'categories', 'platforms']:
    if col in df2.columns:
        df2[col] = df2[col].apply(lambda x: x.split(';')[0] if isinstance(x, str) else x)

df2 = df2.drop_duplicates(subset=['appid'])
df2 = df2[df2['release_date'].notna()]

#cleaning the scraped page to easily merge the datasets
# soup is parsed HTML, not a DataFrame, so build a separate table from the scraped rows.
# Assumption: each row starts with the month and average players and ends with peak players.
charts_df = pd.DataFrame(data).iloc[:, [0, 1, -1]].copy()
charts_df.columns = ["month", "avg_players", "peak_players"]
charts_df["avg_players"] = pd.to_numeric(charts_df["avg_players"].str.replace(",", ""), errors="coerce")
charts_df["appid"] = 730  # app ID taken from the scraped URL
# Reduce to one row per app so the merge does not duplicate games
charts_summary = charts_df.groupby("appid", as_index=False)["avg_players"].mean()

#combining datasets on the Steam App ID
combined_df = pd.merge(df, df2, on='appid', how='outer', suffixes=('_steam', '_kaggle'))
combined_df = pd.merge(combined_df, charts_summary, on='appid', how='left')

#viz 1
genre_players = combined_df.groupby('genres')['avg_players'].mean().reset_index().dropna()
plt.figure(figsize=(10,6))
sns.barplot(y='genres', x='avg_players', data=genre_players.sort_values('avg_players', ascending=False).head(10))
plt.title("Top 10 Game Genres by Average Players")
plt.xlabel("Average Players")
plt.ylabel("Genre")
plt.tight_layout()
plt.show()

#Viz 2
# Work on combined_df throughout: it holds both release_date and avg_players
combined_df['release_date'] = pd.to_datetime(combined_df['release_date'], errors='coerce')
combined_df['year'] = combined_df['release_date'].dt.year
yearly_players = combined_df.groupby('year')['avg_players'].mean().reset_index().dropna()
plt.figure(figsize=(10,6))
sns.lineplot(x='year', y='avg_players', data=yearly_players, marker='o')
plt.title("Average Player Engagement Over Time")
plt.xlabel("Year")
plt.ylabel("Average Players")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

#Viz 3
plt.figure(figsize=(8,6))
sns.scatterplot(x='price', y='avg_players', data=combined_df, alpha=0.6)
plt.title("Price vs. Player Engagement")
plt.xlabel("Game Price")
plt.ylabel("Average Players")
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

#Viz 4
# Select every numeric column; corr() already ignores missing values pair by pair,
# so rows are not dropped beforehand
numeric_df = df.select_dtypes(include='number')
corr = numeric_df.corr()
plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Game Metrics")
plt.tight_layout()
plt.show()
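
The approach for this project also calls for aggregating playtime by genre each year, a step the cell above does not yet perform. The following is a minimal sketch on toy data, assuming columns named `genres`, `release_date`, and `average_playtime`; the `games` frame and its values are made up for illustration and are not taken from the project's datasets.

```python
import pandas as pd

# Toy data; the rows and values are hypothetical.
games = pd.DataFrame({
    "genres": ["Action", "Action", "Strategy", "Strategy", "Action"],
    "release_date": ["2018-03-01", "2019-06-15", "2018-11-20",
                     "2019-01-05", "2018-09-09"],
    "average_playtime": [100, 200, 50, 150, 300],
})

# Derive the release year from the date column.
games["release_date"] = pd.to_datetime(games["release_date"], errors="coerce")
games["year"] = games["release_date"].dt.year

# Mean playtime per genre and year, reshaped so that
# genres are rows and years are columns.
playtime_by_genre_year = (
    games.groupby(["genres", "year"])["average_playtime"]
    .mean()
    .unstack("year")
)
print(playtime_by_genre_year)
```

A table in this shape can be passed to `sns.heatmap`, or transposed and plotted as one line per genre. Note that grouping by release year compares release cohorts, which is not the same as measuring engagement in each calendar year.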


## Resources and References
*What resources and references have you used for this project?*
📝 <!-- Answer Below -->

In [None]:
# ⚠️ Make sure you run this cell at the end of your notebook before every submission!
!jupyter nbconvert --to python source.ipynb

[NbConvertApp] Converting notebook source.ipynb to python
[NbConvertApp] Writing 1271 bytes to source.py
