# Video Game Sales Research

## Dataset Overview

This notebook analyzes a video game sales dataset containing **16,598 entries** ranked by global sales, with release years ranging from **1980 to 2020**.

### Key Highlights

- **Top Game**: Wii Sports, with 82.74 million copies sold globally
- **Market Leaders**: Nintendo published every game in the top 10 and leads all publishers in total sales
- **Sales Range**: From 82.74M (top) down to 0.01M (bottom tier)
- **Regional Coverage**: NA, EU, JP, and other markets
- **Platform Diversity**: 31 unique platforms, including NES through PS4
- **Genre Variety**: Sports, Platform, Racing, Action, RPG, Shooter, and more

### Research Questions

This analysis will explore:
- Genre popularity trends over time
- Platform lifecycle and market share evolution
- Regional market preferences and differences
- Publisher market dominance and competition
- Franchise performance patterns

### Notebook Structure

1. **Setup and Data Loading**: Import libraries and load the dataset
2. **Data Quality Assessment**: Check missing values and descriptive statistics
3. **Top Performers Analysis**: Best-selling games and their regional breakdown
4. **Market Overview**: Platform, genre, and publisher distributions
5. **Advanced Analysis Functions**: Release-year trends and regional market share
6. **Summary Statistics and Insights**: Dataset-wide totals and key findings
7. **Next Steps**: Suggested follow-up analyses


## Setup and Data Loading


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import custom analysis modules
from data_loader import setup_display_options, load_game_sales_data
from data_analysis import (
    analyze_missing_data, get_top_performers, analyze_distribution,
    calculate_regional_breakdown, analyze_publishers, analyze_year_trends,
    get_regional_market_share, generate_summary_statistics
)
from utils import suggest_next_analysis_steps, print_analysis_complete_message

print("Libraries and custom modules imported successfully!")


Libraries and custom modules imported successfully!


In [2]:
# Apply display configuration
setup_display_options()
print("Display configuration applied successfully!")


Display configuration applied successfully!


In [17]:
# Load the dataset using our custom function
df = load_game_sales_data()

✅ Dataset loaded successfully from vgsales.csv
📊 Shape: (16598, 11)
📋 Columns: ['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
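
The source of `load_game_sales_data` lives in `data_loader.py` and is not shown here. As a rough sketch of what such a loader might do, the hypothetical `load_sales_csv` below reads a CSV and fails early if any of the eleven columns reported above is missing. The function name, the column check, and the in-memory sample are assumptions for illustration only.

```python
import io

import pandas as pd

# Columns the notebook reports after loading vgsales.csv.
EXPECTED_COLUMNS = [
    "Rank", "Name", "Platform", "Year", "Genre", "Publisher",
    "NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales", "Global_Sales",
]


def load_sales_csv(source):
    """Hypothetical loader: read a CSV path or buffer and validate its columns."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


# Two rows copied from the notebook output stand in for the real file.
sample_csv = io.StringIO(
    "Rank,Name,Platform,Year,Genre,Publisher,"
    "NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales\n"
    "1,Wii Sports,Wii,2006,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74\n"
    "2,Super Mario Bros.,NES,1985,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24\n"
)
sample_df = load_sales_csv(sample_csv)
print(sample_df.shape)
```

Passing a file path such as `"vgsales.csv"` instead of the buffer works the same way, since `pd.read_csv` accepts both.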


In [4]:
# Display sample of the data
print("=== First 5 rows of the dataset ===")
df.head()


=== First 5 rows of the dataset ===


| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |
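
The `Year` column prints as `2006.0` because pandas stores a numeric column containing missing values as float by default. If integer years are preferred, one option is the nullable `Int64` dtype, which keeps missing entries as `<NA>`. The short series below is illustrative and is not the notebook's own preprocessing.

```python
import numpy as np
import pandas as pd

# Two years from the table above plus one missing entry.
years = pd.Series([2006.0, 1985.0, np.nan], name="Year")

# "Int64" (capital I) is pandas' nullable integer dtype; NaN becomes <NA>.
years_int = years.astype("Int64")
print(years_int)
```

The cast only succeeds when every non-missing value is a whole number, which holds for release years.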


## Data Quality Assessment

This section checks the dataset for missing values and reviews basic descriptive statistics.


In [5]:
# Analysis functions are now imported from separate modules
# See: data_analysis.py, data_loader.py, and utils.py for function definitions
print("✅ Analysis functions loaded from modules!")


✅ Analysis functions loaded from modules!


### Missing Data Analysis


In [6]:
# Analyze data quality using helper function
missing_data_info = analyze_missing_data(df)


=== Missing Values Analysis ===
📊 Year: 271 missing values (1.63%)
📊 Publisher: 58 missing values (0.35%)
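
`analyze_missing_data` is defined in `data_analysis.py`, so its implementation is not visible here. A minimal sketch of the same kind of report, assuming it counts nulls per column and divides by the row count, could look like the hypothetical `summarize_missing` below. The toy frame is made-up illustration data.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame):
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = frame.isna().sum()
    counts = counts[counts > 0]  # keep only columns that have gaps
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"missing": counts, "percent": percent})


# Toy frame for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["Nintendo", "Nintendo", None, None],
})
missing_report = summarize_missing(toy)
print(missing_report)

# The percentages printed above follow from the same formula:
print(round(271 / 16598 * 100, 2), round(58 / 16598 * 100, 2))  # 1.63 0.35
```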


In [7]:
# Display basic statistics
print("=== Dataset Statistics ===")
df.describe()


=== Dataset Statistics ===


| | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|
| count | 16598.0 | 16327.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| mean | 8300.61 | 2006.41 | 0.26 | 0.15 | 0.08 | 0.05 | 0.54 |
| std | 4791.85 | 5.83 | 0.82 | 0.51 | 0.31 | 0.19 | 1.56 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4151.25 | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8300.5 | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12449.75 | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |
| max | 16600.0 | 2020.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |
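
One integrity check that fits this section is whether the four regional columns add up to `Global_Sales`. The sketch below tries this on three rows from the dataset. The helper name `regional_sum_gap` and the 0.02 tolerance are assumptions, the latter chosen to absorb rounding in the two-decimal figures.

```python
import pandas as pd

REGION_COLS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]


def regional_sum_gap(frame):
    """Absolute gap between the summed regional columns and Global_Sales."""
    return (frame[REGION_COLS].sum(axis=1) - frame["Global_Sales"]).abs()


# Three rows copied from the dataset's top entries.
rows = pd.DataFrame(
    [
        ["Wii Sports", 41.49, 29.02, 3.77, 8.46, 82.74],
        ["Super Mario Bros.", 29.08, 3.58, 6.81, 0.77, 40.24],
        ["Mario Kart Wii", 15.85, 12.88, 3.79, 3.31, 35.82],
    ],
    columns=["Name"] + REGION_COLS + ["Global_Sales"],
)

gap = regional_sum_gap(rows)
TOLERANCE = 0.02  # assumed threshold, not a value from the notebook
flagged = rows.loc[gap > TOLERANCE, "Name"].tolist()
print(flagged)
```

Mario Kart Wii shows a gap of about 0.01 (35.83 summed against 35.82 reported), which is why an exact equality test would be too strict.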


## Top Performers Analysis


In [8]:
# Analyze top 10 best-selling games
print("=== Top 10 Best-Selling Games ===")
top_10_games = get_top_performers(df, 'Global_Sales', 10)
display_cols = ['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'Global_Sales']
display(top_10_games[display_cols])


=== Top 10 Best-Selling Games ===


| | Rank | Name | Platform | Year | Genre | Publisher | Global_Sales |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 31.37 |
| 5 | 6 | Tetris | GB | 1989.0 | Puzzle | Nintendo | 30.26 |
| 6 | 7 | New Super Mario Bros. | DS | 2006.0 | Platform | Nintendo | 30.01 |
| 7 | 8 | Wii Play | Wii | 2006.0 | Misc | Nintendo | 29.02 |
| 8 | 9 | New Super Mario Bros. Wii | Wii | 2009.0 | Platform | Nintendo | 28.62 |
| 9 | 10 | Duck Hunt | NES | 1984.0 | Shooter | Nintendo | 28.31 |


In [9]:
# Regional sales breakdown for top 10 games
print("=== Regional Sales Breakdown (Top 10) ===")
top_10_regional = calculate_regional_breakdown(df, 10)
display(top_10_regional)


=== Regional Sales Breakdown (Top 10) ===


| | Name | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|
| 0 | Wii Sports | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | Super Mario Bros. | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | Mario Kart Wii | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | Wii Sports Resort | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | Pokemon Red/Pokemon Blue | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |
| 5 | Tetris | 23.2 | 2.26 | 4.22 | 0.58 | 30.26 |
| 6 | New Super Mario Bros. | 11.38 | 9.23 | 6.5 | 2.9 | 30.01 |
| 7 | Wii Play | 14.03 | 9.2 | 2.93 | 2.85 | 29.02 |
| 8 | New Super Mario Bros. Wii | 14.59 | 7.06 | 4.7 | 2.26 | 28.62 |
| 9 | Duck Hunt | 26.93 | 0.63 | 0.28 | 0.47 | 28.31 |
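
Absolute figures make it hard to compare regional mix across games of different sizes. One way to normalize, sketched here on three rows from the breakdown above, is to divide each regional column by the game's `Global_Sales`. This is an illustrative follow-up, not part of `calculate_regional_breakdown`.

```python
import pandas as pd

# Three rows copied from the regional breakdown above.
top_games = pd.DataFrame(
    {
        "Name": ["Wii Sports", "Pokemon Red/Pokemon Blue", "Duck Hunt"],
        "NA_Sales": [41.49, 11.27, 26.93],
        "EU_Sales": [29.02, 8.89, 0.63],
        "JP_Sales": [3.77, 10.22, 0.28],
        "Other_Sales": [8.46, 1.0, 0.47],
        "Global_Sales": [82.74, 31.37, 28.31],
    }
).set_index("Name")

region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Row-wise division: each region as a percentage of that game's global total.
shares = top_games[region_cols].div(top_games["Global_Sales"], axis=0).mul(100).round(1)
print(shares)
```

On these rows, Duck Hunt sold about 95% of its copies in North America, while Pokemon Red/Pokemon Blue sold roughly a third of its copies in Japan.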


## Market Overview


In [10]:
# Analyze platform distribution
platform_distribution = analyze_distribution(df, 'Platform', 15)


=== Platform Distribution (Top 15) ===
Platform
DS      2163
PS2     2161
PS3     1329
Wii     1325
X360    1265
PSP     1213
PS      1196
PC       960
XB       824
GBA      822
GC       556
3DS      509
PSV      413
PS4      336
N64      319
Name: count, dtype: int64

Total unique values: 31
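
The printed output above has the shape of a pandas `value_counts()` result followed by a distinct-value count. Assuming that is roughly what `analyze_distribution` does, a hypothetical equivalent is sketched below, using the platforms of the ten best-selling games as sample data.

```python
import pandas as pd


def top_value_counts(frame, column, n):
    """Hypothetical helper: the n most frequent values and the distinct-value count."""
    counts = frame[column].value_counts().head(n)
    return counts, frame[column].nunique()


# Platforms of the ten best-selling games listed earlier in the notebook.
top10 = pd.DataFrame(
    {"Platform": ["Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wii", "Wii", "NES"]}
)
counts, n_unique = top_value_counts(top10, "Platform", 3)
print(counts)
print(f"Total unique values: {n_unique}")
```

Note that `value_counts()` skips missing values by default, so a column with gaps needs `dropna=False` if those should be counted too.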


In [11]:
# Analyze genre distribution
genre_distribution = analyze_distribution(df, 'Genre', 12)


=== Genre Distribution (Top 12) ===
Genre
Action          3316
Sports          2346
Misc            1739
Role-Playing    1488
Shooter         1310
Adventure       1286
Racing          1249
Platform         886
Simulation       867
Fighting         848
Strategy         681
Puzzle           582
Name: count, dtype: int64

Total unique values: 12


In [12]:
# Analyze publishers using imported function
publisher_games, publisher_sales = analyze_publishers(df)


=== Top 10 Publishers by Number of Games ===
Publisher
Electronic Arts                 1351
Activision                       975
Namco Bandai Games               932
Ubisoft                          921
Konami Digital Entertainment     832
THQ                              715
Nintendo                         703
Sony Computer Entertainment      683
Sega                             639
Take-Two Interactive             413
Name: count, dtype: int64

=== Top 10 Publishers by Total Global Sales ===
Publisher
Nintendo                       1786.56
Electronic Arts                1110.32
Activision                      727.46
Sony Computer Entertainment     607.50
Ubisoft                         474.72
Take-Two Interactive            399.54
THQ                             340.77
Konami Digital Entertainment    283.64
Sega                            272.99
Namco Bandai Games              254.09
Name: Global_Sales, dtype: float64
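
The two rankings differ sharply: Nintendo is seventh by number of games but first by total sales. Dividing one table by the other gives average sales per title for each publisher. The sketch below uses only the figures printed above, and the dictionary names are illustrative.

```python
# Figures copied from the two publisher tables above.
titles = {
    "Electronic Arts": 1351, "Activision": 975, "Namco Bandai Games": 932,
    "Ubisoft": 921, "Konami Digital Entertainment": 832, "THQ": 715,
    "Nintendo": 703, "Sony Computer Entertainment": 683, "Sega": 639,
    "Take-Two Interactive": 413,
}
sales = {
    "Nintendo": 1786.56, "Electronic Arts": 1110.32, "Activision": 727.46,
    "Sony Computer Entertainment": 607.50, "Ubisoft": 474.72,
    "Take-Two Interactive": 399.54, "THQ": 340.77,
    "Konami Digital Entertainment": 283.64, "Sega": 272.99,
    "Namco Bandai Games": 254.09,
}

# Average global sales per title, in millions of copies.
per_title = {name: round(sales[name] / titles[name], 2) for name in titles}

for name, avg in sorted(per_title.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {avg:.2f}M per title")
```

Nintendo averages about 2.54M copies per title, more than twice the next publisher in this list (Take-Two Interactive at about 0.97M), while Namco Bandai Games averages about 0.27M.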


## Next Steps for Analysis

### Ready for Deep Dive Analysis
With the essential exploration in place, here are some possible next steps:

**📊 Visualizations to Create:**
- Time series of genre popularity over decades
- Regional market share comparisons (heatmaps)
- Platform lifecycle analysis
- Publisher market concentration
- Sales distribution patterns

**🔍 Analysis Questions:**
- Which genres perform best in different regions?
- How have platform preferences evolved over time?
- What factors contribute to a game's global success?
- Are there seasonal patterns in game releases?
- How concentrated is the gaming market among top publishers?

**💡 Business Insights:**
- Optimal release strategies by region
- Genre-platform combinations for success
- Market timing and competition analysis
- Franchise vs. standalone game performance

Feel free to add new cells below to continue your analysis!


## Data Validation and Quality Insights

### Key Findings from Initial Analysis

- **Data Completeness**: 271 games missing year information, 58 missing publisher data
- **Sales Range**: Global sales range from 0.01M to 82.74M copies
- **Time Coverage**: Dataset spans from 1980 to 2020, with median year 2007
- **Platform Diversity**: 31 unique gaming platforms represented
- **Market Concentration**: Nintendo published all of the top 10 games and leads every publisher in total sales (1,786.56M copies)


## Advanced Analysis Functions

This section applies two further helpers from the `data_analysis` module: release-year trends and regional market share.


In [13]:
# Advanced analysis functions are imported from data_analysis module
print("✅ Advanced analysis functions loaded!")


✅ Advanced analysis functions loaded!


In [14]:
# Apply advanced analysis functions
year_analysis = analyze_year_trends(df)
print("\n")
market_share = get_regional_market_share(df)


=== Gaming Industry Timeline ===
📅 Dataset covers: 1980 - 2020
📊 Median release year: 2007
🎮 Peak gaming year: 2009 (1431 games)


=== Global Market Share by Region ===
🌍 North America: 49.3%
🌍 Europe: 27.3%
🌍 Japan: 14.5%
🌍 Other: 8.9%
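
The implementation of `get_regional_market_share` is not shown. A plausible sketch, assuming each share is the region's column total divided by the sum of all four regional totals, is given below on the top five games only, so its numbers will differ from the full-dataset figures above. The helper name and the choice of denominator are assumptions.

```python
import pandas as pd

# Column name -> display label; the labels mirror the notebook's printed output.
REGIONS = {
    "NA_Sales": "North America",
    "EU_Sales": "Europe",
    "JP_Sales": "Japan",
    "Other_Sales": "Other",
}


def regional_share(frame):
    """Hypothetical helper: each region's percentage of the summed regional sales."""
    totals = frame[list(REGIONS)].sum()
    return (totals / totals.sum() * 100).rename(index=REGIONS)


# Regional figures for the five best-selling games in the dataset.
top5 = pd.DataFrame({
    "NA_Sales": [41.49, 29.08, 15.85, 15.75, 11.27],
    "EU_Sales": [29.02, 3.58, 12.88, 11.01, 8.89],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    "Other_Sales": [8.46, 0.77, 3.31, 2.96, 1.0],
})
share = regional_share(top5)
for region, pct in share.items():
    print(f"{region}: {pct:.1f}%")
```

Using the summed regional totals as the denominator guarantees that the shares add up to 100%. Dividing by the `Global_Sales` total instead is equally defensible, but small rounding differences in the source data can then leave the shares slightly off 100%.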


## Summary Statistics and Insights


In [15]:
# Generate comprehensive summary statistics using imported function
summary_stats = generate_summary_statistics(df)


=== Dataset Summary Statistics ===
🎮 Total games analyzed: 16,598
💰 Total global sales: 8920.44M copies
📊 Average sales per game: 0.54M copies
🕹️  Unique platforms: 31
🎯 Unique genres: 12
🏢 Unique publishers: 578
👑 Top selling game: Wii Sports
🎯 Nintendo market presence: 4.2% of all games
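
The 4.2% figure measures Nintendo's share of titles, not of sales. Combining the totals above with the publisher tables from the Market Overview section gives the sales-side view, along with a simple concentration measure for the top 10 publishers. The arithmetic uses only figures printed in this notebook, and the variable names are illustrative.

```python
# Totals from the summary statistics above.
total_titles = 16598
total_sales = 8920.44  # millions of copies

# Nintendo's row from the publisher tables.
nintendo_titles = 703
nintendo_sales = 1786.56

# Global sales of the top 10 publishers, in the order printed by the notebook.
top10_sales = [
    1786.56, 1110.32, 727.46, 607.50, 474.72,
    399.54, 340.77, 283.64, 272.99, 254.09,
]

title_share = nintendo_titles / total_titles * 100
sales_share = nintendo_sales / total_sales * 100
top10_share = sum(top10_sales) / total_sales * 100

print(f"Nintendo share of titles: {title_share:.1f}%")        # 4.2%
print(f"Nintendo share of global sales: {sales_share:.1f}%")  # 20.0%
print(f"Top 10 publishers' share of sales: {top10_share:.1f}%")  # 70.1%
```

Nintendo therefore accounts for about a fifth of global sales with about 4% of the titles, and the ten largest publishers out of 578 account for roughly 70% of all copies sold.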


## Comprehensive Analysis Results

All analysis sections completed successfully! The notebook now provides:

### ✅ Completed Analysis Sections:
1. **Data Loading & Quality Check** - Robust data loading with error handling
2. **Helper Functions** - Reusable analysis functions for maintainable code  
3. **Top Performers Analysis** - Top games and regional breakdowns
4. **Market Distribution** - Platform, genre, and publisher analysis
5. **Advanced Analytics** - Temporal trends and regional market share
6. **Summary Statistics** - Comprehensive dataset overview

### 🎯 Key Insights Discovered:
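- Nintendo published all ten best-selling games and leads publishers in total sales (1,786.56M copies), despite accounting for only 4.2% of titles
- North America is the largest regional market, with 49.3% of global sales
- The dataset covers 31 platforms, with release years from 1980 to 2020
- Action (3,316 titles) and Sports (2,346 titles) lead in game volume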
- Missing data is minimal and manageable (< 2% of records)

### 📈 Dataset Characteristics:
- **16,598 games** across multiple decades
- **Global sales data** broken down by region  
- **Comprehensive coverage** from 1980-2020
- **Rich metadata** including platform, genre, publisher information


## Next Steps for Extended Analysis


In [18]:
# Display suggested next analysis steps using imported function
next_steps = suggest_next_analysis_steps()


=== Recommended Next Analysis Steps ===
1. 📊 Create visualizations: Time series plots for genre trends over decades
2. 🗺️  Geographic analysis: Regional preference heatmaps by genre/platform
3. 📈 Correlation analysis: Relationship between regional sales patterns
4. 🎯 Market segmentation: Identify distinct gaming market segments
5. 🚀 Predictive modeling: Forecast future gaming trends
6. 🎮 Platform lifecycle: Analyze console adoption and decline patterns
7. 💰 Revenue optimization: Identify factors for commercial success
8. 🏆 Franchise analysis: Multi-game series performance patterns
