# Video Games Sales Analysis

## Introduction

This notebook examines the **'Video Games Sales as at 22 Dec 2016'** dataset from Kaggle. The dataset contains global sales information for thousands of video games from 1980 to 2016. In this project we will explore the data, clean and filter it, derive initial statistics, and then perform in-depth analyses to answer three key questions.

## Main Goals

The main goals of this analysis are to:

1. Understand the distribution of global sales for video games.
2. Determine which game genre has the highest average global sales.
3. Analyze how global sales vary across different gaming platforms.

We will generate visualizations that focus on the most relevant parts of the data and provide clear insights.

## Data Exploration

First, we load the dataset and take an initial look at its structure and content. We display the first 10 rows to get a feel for the data.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

# Load dataset from Kaggle attached data
data_path = "/kaggle/input/video-games-sales-as-at-22-dec-2016csv/Video_Games_Sales_as_at_22_Dec_2016.csv"
df = pd.read_csv(data_path, encoding='ISO-8859-1')

# Show dataset dimensions and first 10 rows
print(f"Dataset contains {df.shape[0]} games and {df.shape[1]} features.")
df.head(10)

## Data Observation and Cleaning

Let's review the data types, check for inconsistencies, and understand which columns might need cleaning. We also inspect missing values.

In [None]:
# Display data information
df.info()

# Check for missing values
missing = df.isnull().sum()
print("Missing values per column:")
print(missing[missing > 0])
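
Type inconsistencies can hide in columns that `df.info()` reports as `object`. The sketch below uses a small hypothetical frame to show one way to surface them: `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN`, so they appear in the missing-value count. The `User_Score` values, including the placeholder string `'tbd'`, are assumptions for illustration and are not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: a score column stored as text with a placeholder value
sample = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "User_Score": ["8.1", "tbd", None, "6.5"],
})

# Coerce to numeric; anything that cannot be parsed becomes NaN
sample["User_Score_num"] = pd.to_numeric(sample["User_Score"], errors="coerce")

# Compare missing counts before and after coercion
n_missing_before = int(sample["User_Score"].isnull().sum())
n_missing_after = int(sample["User_Score_num"].isnull().sum())
print(f"Missing before coercion: {n_missing_before}, after: {n_missing_after}")
```

If the two counts differ, the column holds text that is not a number and may need cleaning before it is used in any numeric summary.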

## Filtering Data and Finding Missing Values

In this step we count missing values in the columns our three questions depend on (`Global_Sales`, `Genre`, and `Platform`) and drop any row where one of them is missing. We do not expect to lose many rows, but recording the counts documents how much data the filter removes. Missing values in other columns are left in place because they do not affect these analyses.

In [None]:
# Count missing values in key columns
key_cols = ['Global_Sales', 'Genre', 'Platform']
for col in key_cols:
    print(f"{col} missing: {df[col].isnull().sum()}")

# Drop rows that are missing a value in any of the key columns
df_clean = df.dropna(subset=key_cols)
print(f"After dropping rows with missing key values, dataset has {df_clean.shape[0]} rows.")
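
To make the effect of `dropna(subset=...)` concrete, here is a minimal sketch on a hypothetical four-row frame (all values are invented for illustration). It shows that only rows missing a key column are removed, while a row missing a non-key column such as `Critic_Score` is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not taken from the real dataset
toy = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Genre": ["Action", None, "Sports", "Puzzle"],
    "Platform": ["PS2", "Wii", "DS", "PC"],
    "Global_Sales": [1.2, 0.4, np.nan, 0.1],
    "Critic_Score": [80.0, np.nan, 70.0, np.nan],
})

key_cols = ["Global_Sales", "Genre", "Platform"]

# Rows are dropped only if a value in one of the key columns is missing
toy_clean = toy.dropna(subset=key_cols)
rows_dropped = len(toy) - len(toy_clean)

print(f"Dropped {rows_dropped} of {len(toy)} rows")
print(toy_clean)
```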

## Initial Statistics on the Data

We now compute summary statistics to understand the overall numerical distribution of the data. For each numeric feature, `describe()` reports the count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.

In [None]:
# Display summary statistics for numerical columns
df_clean.describe()
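
When reading the summary table, comparing the mean with the median is a quick check for skew. The sketch below illustrates this on a hypothetical sales series (values invented for illustration): a single large value pulls the mean well above the median, and `Series.skew()` returns a positive number.

```python
import pandas as pd

# Hypothetical sales figures in million units, with one very large value
sales = pd.Series([0.1, 0.2, 0.2, 0.3, 0.5, 0.8, 1.5, 30.0], name="Global_Sales")

mean_sales = sales.mean()
median_sales = sales.median()
skewness = sales.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean_sales:.2f}, median={median_sales:.2f}, skew={skewness:.2f}")
```

On the real data, the same three numbers could be computed from `df_clean['Global_Sales']`.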

## Data Analysis & Insights

We now address our three main questions using focused visualizations and analyses. Each question is organized into its own section with clear insights and conclusions.

### Question 1: What is the Distribution of Global Sales?

The distribution of global sales per game is highly right-skewed, so we focus on the common range (0–5 million units) and provide three visualizations:
- **Histogram:** Shows the frequency of games with sales of at most 5 million units.
- **Density Plot:** Provides a smooth estimate of the distribution, computed on all games and displayed for 0–5 million units.
- **Box Plot:** Highlights the median and quartiles; outliers (blockbusters) above 5 million units fall outside the displayed range.

In [None]:
# Q1: Histogram (0-5 million range)
if 'Global_Sales' in df_clean.columns:
    common_sales = df_clean[df_clean['Global_Sales'] <= 5]['Global_Sales']
    plt.figure(figsize=(8,5))
    plt.hist(common_sales, bins=50, color='skyblue', edgecolor='black')
    plt.title('Distribution of Global Sales (0-5 million units)')
    plt.xlabel('Global Sales (million units)')
    plt.ylabel('Number of Games')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Q1: Density Plot (estimated on all games, x-axis limited to 0-5 million units)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
    df_clean['Global_Sales'].dropna().plot(kind='density', color='purple')
    plt.title('Density Plot of Global Sales')
    plt.xlabel('Global Sales (million units)')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Q1: Box Plot of Global Sales (all games, y-axis limited to 0-5 million units)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(6,8))
    plt.boxplot(df_clean['Global_Sales'].dropna(), vert=True, patch_artist=True)
    plt.title('Box Plot of Global Sales')
    plt.ylabel('Global Sales (million units)')
    plt.ylim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

**Insights from Question 1:**

- The histogram and density plot confirm that **most games sell under 1 million copies**.
- The box plot shows that the majority of games have modest sales, while outliers (blockbusters) extend far beyond the 5 million cutoff of the y-axis and are cropped from view for clarity.
- This is a classic long-tail distribution, where only a few titles achieve extremely high sales.
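
The two claims above can also be checked numerically rather than read off the plots. The following is a minimal sketch on hypothetical values (invented for illustration). It computes the fraction of titles under 1 million units and the share of total sales held by the top 10% of titles. The names `share_under_1m` and `top_share` are introduced here and do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sales in million units; not taken from the real dataset
sales = pd.Series([0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 2.0, 5.0, 40.0])

# Fraction of titles that sold fewer than 1 million units
share_under_1m = (sales < 1).mean()

# Share of total sales contributed by the top 10% of titles (at least one title)
n_top = max(1, int(np.ceil(0.10 * len(sales))))
top_share = sales.nlargest(n_top).sum() / sales.sum()

print(f"Titles under 1M units: {share_under_1m:.0%}")
print(f"Sales held by top {n_top} title(s): {top_share:.0%}")
```

Replacing `sales` with `df_clean['Global_Sales']` would give the corresponding figures for the real data.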

### Question 2: Which Genre Has the Highest Average Global Sales?

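This section compares genres by their average global sales. We display a table of average sales per genre, a horizontal bar chart of the top 5 genres by average sales, and a pie chart of the top 5 genres by total sales. The two top-5 lists need not contain the same genres, because a genre with many titles can lead in total sales without leading in average sales.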

In [None]:
# Q2: Table of Average Global Sales by Genre
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_avg = df_clean.groupby('Genre')['Global_Sales'].mean()
    genre_table = genre_avg.sort_values(ascending=False).to_frame(name='Average Global Sales')
    print("Top 10 Genres by Average Global Sales:")
    display(genre_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Q2: Horizontal Bar Chart for Top 5 Genres by Average Global Sales
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    top_genres = genre_avg.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,5))
    top_genres.sort_values().plot(kind='barh', color='mediumseagreen')
    plt.title('Top 5 Genres by Average Global Sales')
    plt.xlabel('Average Global Sales (million units)')
    plt.xlim(0, top_genres.max()*1.1)
    plt.show()
else:
    print("Required columns not found.")

In [None]:
# Q2: Pie Chart Showing Market Share of Total Global Sales for Top 5 Genres
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_total = df_clean.groupby('Genre')['Global_Sales'].sum()
    top5_total = genre_total.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,8))
    top5_total.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
    plt.title('Share of Global Sales Among the Top 5 Genres (by Total Sales)')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")

**Insights from Question 2:**

- The table shows that certain genres, such as **Shooter** and **Action**, lead in average global sales.
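- The bar chart compares the five genres with the highest average sales per title.
- The pie chart shows how sales are split among the five genres with the largest totals. Because it is limited to those five, its percentages are shares of their combined sales, not of the whole market.

These results are consistent with mainstream genres achieving higher sales per title. The dataset has no budget column, so any link to development budget remains an assumption.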
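
Because averages and totals can rank genres differently, it may help to view them side by side with the title count. The sketch below does this on hypothetical data (the genre labels and sales values are invented for illustration and do not reflect the real dataset). The column names `avg_sales`, `median_sales`, `n_games`, and `total_sales` are introduced here.

```python
import pandas as pd

# Hypothetical toy data: one genre has many titles, another has few but larger ones
toy = pd.DataFrame({
    "Genre": ["Action"] * 5 + ["Shooter"] * 2 + ["Puzzle"],
    "Global_Sales": [0.2, 0.3, 0.5, 1.0, 2.0, 1.0, 2.0, 0.1],
})

# Named aggregation: several statistics per genre in one table
summary = toy.groupby("Genre")["Global_Sales"].agg(
    avg_sales="mean",
    median_sales="median",
    n_games="count",
    total_sales="sum",
)

top_by_avg = summary["avg_sales"].idxmax()
top_by_total = summary["total_sales"].idxmax()

print(summary.sort_values("avg_sales", ascending=False))
print(f"Highest average: {top_by_avg}; highest total: {top_by_total}")
```

In this toy example the genre with the highest average differs from the one with the highest total, which is why the bar chart and the pie chart should be read separately.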

### Question 3: How Do Sales Vary Across Different Platforms?

In this section we compare global sales across different gaming platforms. We present a table of the ten platforms with the highest total sales, a horizontal bar chart of total sales for all platforms, and a pie chart of each platform's share of total sales.

In [None]:
# Q3: Table of Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    platform_sales = df_clean.groupby('Platform')['Global_Sales'].sum()
    platform_table = platform_sales.sort_values(ascending=False).to_frame(name='Total Global Sales')
    print("Top Platforms by Total Global Sales:")
    display(platform_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Q3: Horizontal Bar Chart for Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
    platform_sales.sort_values().plot(kind='barh', color='slateblue')
    plt.title('Total Global Sales by Platform')
    plt.xlabel('Total Global Sales (million units)')
    plt.xlim(0, platform_sales.max()*1.1)
    plt.ylabel('Platform')
    plt.show()
else:
    print("Required columns not found.")

In [None]:
# Q3: Pie Chart Showing Market Share of Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,8))
    platform_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
    plt.title('Market Share of Total Global Sales by Platform')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")

**Insights from Question 3:**

- The table shows that a few platforms (such as PS2 and Wii) dominate total global sales.
- The bar chart visualizes significant differences in sales between platforms.
- The pie chart clearly depicts each platform’s market share.

This analysis indicates that a few consoles account for a disproportionate share of total sales. Totals depend on both the number of titles and the sales per title, so the charts above do not separate the effect of blockbuster titles from that of a larger catalog.
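
A pie chart with one slice per platform can be hard to read when many platforms have small shares. One option, sketched below under assumed values, is to group platforms below a chosen share into a single "Other" entry before plotting. The platform labels `A` to `F`, the totals, and the 5% threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical total sales per platform (million units); labels are placeholders
totals = pd.Series({"A": 50.0, "B": 30.0, "C": 12.0, "D": 4.0, "E": 2.0, "F": 2.0})

min_share = 0.05  # assumed cutoff: platforms below 5% are grouped together

# Convert totals to shares of the overall sum, largest first
shares = (totals / totals.sum()).sort_values(ascending=False)

# Keep the major platforms and collapse the rest into one "Other" entry
major = shares[shares >= min_share]
other_share = shares[shares < min_share].sum()
combined = pd.concat([major, pd.Series({"Other": other_share})])

print(combined.round(3))
```

The resulting `combined` series could then be passed to `.plot(kind='pie', ...)` in place of the full per-platform series.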

## Conclusion

**Summary of Key Findings:**

- **Global Sales Distribution:** Most video games sell under 1 million copies. The distribution is heavily right-skewed, meaning only a small number of games become blockbusters.
- **Genre Performance:** Mainstream genres (particularly Shooter and Action) have the highest average global sales, indicating these genres are more likely to yield commercial success.
- **Platform Variability:** Although every platform has its share of modest-selling titles, a few platforms (e.g., PS2, Wii) capture a dominant share of total global sales. This analysis does not separate the contribution of a handful of extremely popular games from that of catalog size.

These insights can help stakeholders such as game developers, publishers, and investors understand market trends and make informed decisions about which genres and platforms to focus on.