# Steam Games Dataset Analysis

This notebook analyzes the Steam Games dataset from Kaggle (`artermiloff/steam-games-dataset`), which describes 87,806 games on the Steam platform across 46 columns.

## Setup and Data Loading

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import kagglehub
import os
from pathlib import Path

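# Set style for better visualizations
# Apply the matplotlib style sheet first; calling it after sns.set_style()
# would overwrite the seaborn grid settings.
plt.style.use('seaborn-v0_8')
sns.set_style("whitegrid")
sns.set_palette('husl')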

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("All libraries successfully imported and configured!")

All libraries successfully imported and configured!


In [21]:
# Define the path to save the dataset
import shutil

data_dir = Path('../data')
dataset_path = data_dir / 'steam_games.csv'

# Download the dataset only if no local copy exists yet
if not dataset_path.exists():
    print(f"Working directory: {os.getcwd()}")
    print("Downloading dataset...")
    # kagglehub returns the directory that holds the downloaded files
    kaggle_path = kagglehub.dataset_download("artermiloff/steam-games-dataset")
    # Pick the first CSV file in that directory
    csv_file = sorted(Path(kaggle_path).glob('*.csv'))[0]
    # Create the data directory if needed; copy2 fails when it is missing
    data_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(csv_file, dataset_path)
    print(f"Dataset saved to {dataset_path}")
else:
    print("Dataset already exists, loading from local file...")

# Load the dataset
df = pd.read_csv(dataset_path)

# Display basic information about the dataset
print("\nDataset Shape:", df.shape)
print("\nDataset Info:")
df.info()

Dataset already exists, loading from local file...

Dataset Shape: (87806, 46)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87806 entries, 0 to 87805
Data columns (total 46 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   AppID                     87806 non-null  int64  
 1   name                      87803 non-null  object 
 2   release_date              87806 non-null  object 
 3   required_age              87806 non-null  int64  
 4   price                     87806 non-null  float64
 5   dlc_count                 87806 non-null  int64  
 6   detailed_description      83656 non-null  object 
 7   about_the_game            83634 non-null  object 
 8   short_description         83713 non-null  object 
 9   reviews                   10314 non-null  object 
 10  header_image              87806 non-null  object 
 11  website                   39906 non-null  object 
 12  support_url           

## Data Cleaning and Preprocessing

In [22]:
# Drop rows without a name, then columns with too few non-null values
df = df.dropna(subset=['name'])
df = df.drop(['support_email', 'support_url', 'notes', 'score_rank', 'website', 'reviews', 'metacritic_url'],
             axis=1)

# Drop columns that add little value to this analysis
df = df.drop(['header_image', 'screenshots', 'movies', 'full_audio_languages', 'average_playtime_2weeks', 'AppID',
              'median_playtime_2weeks', 'windows', 'mac', 'linux', 'packages', 'pct_pos_recent', 'num_reviews_recent',
              'positive', 'negative'],
             axis=1)

# Combine similar columns into ' ; '-separated text fields.
# Missing values become empty strings, and every part is stripped
# (the original chain only stripped the last part).
df['description'] = df['detailed_description'].fillna('').str.strip() + ' ; ' + \
                    df['about_the_game'].fillna('').str.strip() + ' ; ' + \
                    df['short_description'].fillna('').str.strip()

df['developer_publisher'] = df['developers'].fillna('').str.strip() + ' ; ' + \
                            df['publishers'].fillna('').str.strip()

df['game_classification'] = df['categories'].fillna('').str.strip() + ' ; ' + \
                            df['genres'].fillna('').str.strip() + ' ; ' + \
                            df['tags'].fillna('').str.strip()

# Drop original columns that were combined
df  = df.drop(['detailed_description', 'about_the_game', 'short_description', 'developers', 'publishers', 'categories',
              'genres', 'tags'], axis=1)

# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])

# Display basic information about the dataset
print("\nCleaned dataset shape:", df.shape)
print("\nCleaned dataset info:")
df.info()

Missing values in each column:
Series([], dtype: int64)

Cleaned dataset shape: (87803, 19)

Cleaned dataset info:
<class 'pandas.core.frame.DataFrame'>
Index: 87803 entries, 0 to 87805
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name                      87803 non-null  object 
 1   release_date              87803 non-null  object 
 2   required_age              87803 non-null  int64  
 3   price                     87803 non-null  float64
 4   dlc_count                 87803 non-null  int64  
 5   metacritic_score          87803 non-null  int64  
 6   achievements              87803 non-null  int64  
 7   recommendations           87803 non-null  int64  
 8   supported_languages       87803 non-null  object 
 9   user_score                87803 non-null  int64  
 10  estimated_owners          87803 non-null  object 
 11  average_playtime_forever  87803 non-null  int64  
 12  median
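
Because the cell above merges several source columns into `' ; '`-separated strings and then drops the originals, later cells need a way to get the individual fields back. The sketch below shows one way to do that on a made-up toy frame. The helper name `split_combined` and the sample values are hypothetical, and the approach assumes that no field contains the separator `' ; '` itself (free-text fields such as `description` may well violate that).

```python
import pandas as pd

# Toy frame mimicking the combined columns built above (values are made up).
toy = pd.DataFrame({
    "developer_publisher": ["Studio A ; Publisher X", "Studio B ; ", " ; Publisher Y"],
    "game_classification": [
        "Single-player ; Action,Indie ; Roguelike",
        " ; Casual ; ",
        "Multi-player ;  ; Shooter",
    ],
})

def split_combined(series: pd.Series, names: list[str]) -> pd.DataFrame:
    """Split an 'a ; b ; c' column back into one column per field.

    Hypothetical helper: assumes every row has exactly len(names) fields.
    """
    parts = series.str.split(" ; ", expand=True)
    parts.columns = names
    parts = parts.apply(lambda col: col.str.strip())
    # Empty strings stand in for values that were missing before the merge.
    return parts.mask(parts == "")

dev_pub = split_combined(toy["developer_publisher"], ["developer", "publisher"])
classes = split_combined(toy["game_classification"], ["categories", "genres", "tags"])

print(dev_pub)
print(classes)
```

If the individual fields are needed often, it may be simpler to keep the original columns alongside the combined ones instead of dropping them.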

In [None]:
# Clean price data
def clean_price(price):
    if pd.isna(price):
        return np.nan
    try:
        # Remove currency symbol and convert to float
        return float(str(price).replace('$', '').strip())
    except (TypeError, ValueError):
        return np.nan

df['clean_price'] = df['price'].apply(clean_price)

# Convert release date to datetime
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

# Extract year from release date
df['release_year'] = df['release_date'].dt.year

<class 'pandas.core.frame.DataFrame'>
Index: 87803 entries, 0 to 87805
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   name                      87803 non-null  object 
 1   release_date              87803 non-null  object 
 2   required_age              87803 non-null  int64  
 3   price                     87803 non-null  float64
 4   dlc_count                 87803 non-null  int64  
 5   metacritic_score          87803 non-null  int64  
 6   achievements              87803 non-null  int64  
 7   recommendations           87803 non-null  int64  
 8   supported_languages       87803 non-null  object 
 9   user_score                87803 non-null  int64  
 10  estimated_owners          87803 non-null  object 
 11  average_playtime_forever  87803 non-null  int64  
 12  median_playtime_forever   87803 non-null  int64  
 13  peak_ccu                  87803 non-null  int64  
 14  pct_pos_tot
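
The row-by-row `clean_price` function above can also be written as a vectorized conversion. The following is a minimal sketch on made-up data, not the notebook's own method: it assumes prices may arrive as strings with a leading `$`, and that dates share a single format such as `Oct 21, 2008`. In the loaded dataset `price` is already `float64`, so the string handling is only needed if the column is ever read as text.

```python
import pandas as pd

# Made-up sample values, including entries that cannot be parsed.
raw = pd.DataFrame({
    "price": ["$9.99", "19.99", "free", None],
    "release_date": ["Oct 21, 2008", "Mar 15, 2020", "not a date", None],
})

# Strip the currency symbol, then let pandas turn unparseable values into NaN.
raw["clean_price"] = pd.to_numeric(
    raw["price"].astype(str).str.replace("$", "", regex=False).str.strip(),
    errors="coerce",
)

# Unparseable dates become NaT instead of raising.
raw["release_date"] = pd.to_datetime(raw["release_date"], errors="coerce")

# Nullable integer dtype keeps years as 2008 rather than 2008.0 when some are missing.
raw["release_year"] = raw["release_date"].dt.year.astype("Int64")

print(raw[["clean_price", "release_year"]])
```

Using `Int64` for the year avoids float labels such as `2008.0` on plots and in group keys.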

## Exploratory Data Analysis (EDA)

In [None]:
# Basic statistics of numerical columns
print("Basic statistics of numerical columns:")
df.describe()

In [None]:
# Price Distribution Analysis
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='clean_price', bins=50)
plt.title('Distribution of Game Prices')
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.show()

# Clip prices to the $0-100 range so extreme values do not dominate the plot
# (clipping caps outliers at $100; it does not remove them)
plt.figure(figsize=(12, 6))
sns.boxplot(x=df['clean_price'].clip(0, 100))
plt.title('Box Plot of Game Prices (Clipped at $100)')
plt.xlabel('Price ($)')
plt.show()
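
A histogram shows the overall shape of the price distribution, but counting games per price band can make the common price points easier to read off. The sketch below uses `pd.cut` on made-up prices. The band edges and labels are arbitrary assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Made-up prices standing in for df['clean_price'].
prices = pd.Series([0.0, 0.99, 4.99, 9.99, 9.99, 19.99, 59.99, 199.99], name="clean_price")

# Assumed band edges; intervals are right-inclusive, e.g. (5, 10] contains 9.99.
bins = [-0.01, 0, 5, 10, 20, 60, float("inf")]
labels = ["Free", "$0-5", "$5-10", "$10-20", "$20-60", "$60+"]

bands = pd.cut(prices, bins=bins, labels=labels)
band_counts = bands.value_counts().reindex(labels)  # keep the bands in price order

print(band_counts)

# The most frequent exact prices, as a complement to the bands.
print(prices.round(2).value_counts().head(3))
```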

In [None]:
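# Games released per year
# Unparseable dates leave NaN years, which makes the column float;
# drop them and cast to int so the axis shows 2015 rather than 2015.0.
yearly_releases = df['release_year'].dropna().astype(int).value_counts().sort_index()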

plt.figure(figsize=(15, 6))
yearly_releases.plot(kind='bar')
plt.title('Number of Games Released per Year')
plt.xlabel('Year')
plt.ylabel('Number of Games')
plt.xticks(rotation=45)
plt.show()

In [None]:
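# Top 10 genres
# The 'genres' column was merged into 'game_classification'
# ("categories ; genres ; tags") and dropped, so read the middle field.
def extract_genres(classification):
    if pd.isna(classification):
        return []
    parts = str(classification).split(' ; ')
    genre_str = parts[1] if len(parts) > 1 else ''
    # Skip empty entries left by games without genres
    return [g.strip() for g in genre_str.split(',') if g.strip()]

all_genres = [genre for genres in df['game_classification'].apply(extract_genres) for genre in genres]
genre_counts = pd.Series(all_genres).value_counts().head(10)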

plt.figure(figsize=(12, 6))
sns.barplot(x=genre_counts.values, y=genre_counts.index)
plt.title('Top 10 Most Common Genres')
plt.xlabel('Number of Games')
plt.show()

In [None]:
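# Average price by genre
# Genres are the middle field of 'game_classification' ("categories ; genres ; tags")
genre_field = df['game_classification'].str.split(' ; ').str[1]

genre_prices = []
for genre in genre_counts.index:
    # regex=False matches the genre name literally
    mask = genre_field.str.contains(genre, na=False, regex=False)
    avg_price = df.loc[mask, 'clean_price'].mean()
    genre_prices.append({'Genre': genre, 'Average Price': avg_price})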

genre_price_df = pd.DataFrame(genre_prices)
plt.figure(figsize=(12, 6))
sns.barplot(data=genre_price_df, x='Average Price', y='Genre')
plt.title('Average Price by Genre')
plt.xlabel('Average Price ($)')
plt.show()
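
The loop above matches genres by substring, so a genre whose name is contained in another genre's name would be counted for both. An alternative, sketched here on a made-up frame, is to split the genre string into a list, `explode` it to one row per (game, genre) pair, and group on the exact genre. The `genres` column and its comma-separated format are assumptions about how the genre field is laid out.

```python
import pandas as pd

# Made-up games; 'genres' is assumed to be a comma-separated string.
toy = pd.DataFrame({
    "genres": ["Action,Indie", "Indie", "Action,RPG", None],
    "clean_price": [10.0, 5.0, 30.0, 20.0],
})

# One row per (game, genre) pair; games without genres are dropped.
exploded = (
    toy.assign(genre=toy["genres"].str.split(","))
       .explode("genre")
       .dropna(subset=["genre"])
)
exploded["genre"] = exploded["genre"].str.strip()

# Mean price and number of games for each exact genre name.
genre_stats = (
    exploded.groupby("genre")["clean_price"]
            .agg(["mean", "count"])
            .sort_values("count", ascending=False)
)

print(genre_stats)
```

A game with several genres contributes its price to each of them, which is the same convention the loop above uses.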

## Developer and Publisher Analysis

In [None]:
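# Top 10 developers by number of games
# 'developers' was merged into 'developer_publisher' ("developers ; publishers"),
# so take the part before the separator and ignore empty names.
developers = df['developer_publisher'].str.split(' ; ').str[0].str.strip()
top_developers = developers[developers != ''].value_counts().head(10)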

plt.figure(figsize=(12, 6))
sns.barplot(x=top_developers.values, y=top_developers.index)
plt.title('Top 10 Developers by Number of Games')
plt.xlabel('Number of Games')
plt.show()

In [None]:
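# Rating distribution
# The cleaned frame has no 'rating' or 'reviews_count' column, so this cell uses
# 'metacritic_score' as the rating and 'recommendations' as a proxy for review volume.
# Assumption: a metacritic_score of 0 means "no score", so those rows are excluded.
rated = df[df['metacritic_score'] > 0]

plt.figure(figsize=(12, 6))
sns.histplot(data=rated, x='metacritic_score', bins=50)
plt.title('Distribution of Metacritic Scores')
plt.xlabel('Metacritic Score')
plt.ylabel('Count')
plt.show()

# Top 10 highest rated games (with a minimum number of recommendations)
min_reviews = 100  # Minimum recommendations required to be ranked
top_rated = rated[rated['recommendations'] >= min_reviews].nlargest(10, 'metacritic_score')
print(f"\nTop 10 Highest Rated Games (with at least {min_reviews} recommendations):")
print(top_rated[['name', 'metacritic_score', 'recommendations', 'developer_publisher']])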

## Price Analysis Over Time

In [None]:
# Average price by year
yearly_avg_price = df.groupby('release_year')['clean_price'].mean()

plt.figure(figsize=(15, 6))
yearly_avg_price.plot(kind='line', marker='o')
plt.title('Average Game Price by Release Year')
plt.xlabel('Year')
plt.ylabel('Average Price ($)')
plt.grid(True)
plt.show()

## Correlation Analysis

In [None]:
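# Create correlation matrix
# Uses columns that exist in the cleaned frame: 'metacritic_score' as the rating
# and 'recommendations' as a proxy for review volume.
numeric_columns = ['clean_price', 'metacritic_score', 'recommendations', 'release_year']
correlation_matrix = df[numeric_columns].corr()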

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numeric Variables')
plt.show()
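
`DataFrame.corr()` computes Pearson correlation by default, which measures linear association and can understate relationships between heavily skewed variables such as prices and review counts. As a hedged complement, the sketch below compares Pearson with Spearman rank correlation on made-up numbers. The column names mirror ones in the cleaned frame, but the values are invented for illustration.

```python
import pandas as pd

# Made-up values: recommendations grow monotonically but very non-linearly with price.
toy = pd.DataFrame({
    "clean_price": [0, 5, 10, 20, 60],
    "recommendations": [1, 10, 100, 1000, 100000],
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")  # correlation of ranks, robust to skew

print("Pearson:\n", pearson)
print("Spearman:\n", spearman)
```

Here the rank correlation is perfect because the relationship is strictly increasing, while the Pearson coefficient is lower because the relationship is not a straight line.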

## Key Findings and Insights

1. Price Distribution:
   - Analyze the distribution of game prices
   - Identify price ranges and common price points

2. Release Trends:
   - Observe the trend in game releases over years
   - Identify peak periods and any patterns

3. Genre Analysis:
   - Most popular genres
   - Price variations across genres

4. Developer Analysis:
   - Most prolific developers
   - Relationship between developers and ratings

5. Rating Analysis:
   - Distribution of ratings
   - Correlation between ratings and other factors

6. Price Trends:
   - How prices have evolved over time
   - Price variations by genre and developer

Add your observations and insights based on the analysis above.
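
To turn the outline above into concrete statements, it can help to compute a few headline numbers first. The following sketch does that on a made-up frame. The function name `summarize` and the choice of statistics are hypothetical, and it assumes the `clean_price` and `release_year` columns created earlier in the notebook.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: headline numbers for the findings section."""
    prices = frame["clean_price"].dropna()
    years = frame["release_year"].dropna().astype(int)
    return {
        "median_price": float(prices.median()),
        "share_free": float((prices == 0).mean()),        # fraction of games priced at 0
        "most_common_price": float(prices.mode().iloc[0]),
        "peak_release_year": int(years.value_counts().idxmax()),
    }

# Made-up data standing in for the cleaned dataset.
toy = pd.DataFrame({
    "clean_price": [0.0, 0.0, 4.99, 9.99, 9.99, 9.99, 19.99, None],
    "release_year": [2019, 2020, 2020, 2021, 2021, 2021, 2022, None],
})

summary = summarize(toy)
print(summary)
```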