# Video Games Sales Analysis

## Introduction

In this exploration, we analyze the **'Video Games Sales as at 22 Dec 2016'** dataset from Kaggle. This dataset contains global sales data for thousands of video games spanning from 1980 to 2016. Our goal is to extract meaningful insights from the data to help stakeholders understand market trends.

## Main Goal

The main objectives of this analysis are to:

1. Understand the distribution of global sales for video games.
2. Determine which game genre has the highest average global sales.
3. Analyze how global sales vary across different gaming platforms.

We will accomplish this by exploring, cleaning, and analyzing the data with targeted visualizations that focus on the most relevant data ranges.

## Data Exploration

First, we load the dataset, report its dimensions, and display the first 10 rows to get a sense of its structure.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

# Load dataset from Kaggle's attached data
data_path = "/kaggle/input/video-games-sales-as-at-22-dec-2016csv/Video_Games_Sales_as_at_22_Dec_2016.csv"
df = pd.read_csv(data_path, encoding='ISO-8859-1')

# Each row is one game on one platform, so a multi-platform title appears more than once
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.head(10)

## Data Observation and Cleaning

We inspect the DataFrame's structure and data types, then count the missing values in each column. This step ensures that the analysis rests on clean data.

In [None]:
# Display basic information about the dataset
df.info()

# Check for missing values in the entire DataFrame
missing = df.isnull().sum()
print("Missing values per column:")
print(missing[missing > 0])
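
The cell above counts missing values only. As a minimal sketch of how inconsistent values might be checked as well, the example below runs three simple tests on a small made-up frame standing in for `df`. The rows and the helper name `find_inconsistencies` are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

# Toy stand-in for df; these rows are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game C"],
    "Platform": ["PS2", "PS2", "Wii", "wii "],
    "Global_Sales": [1.20, 1.20, -0.50, 0.30],
})

def find_inconsistencies(frame):
    """Return simple consistency counts for a sales table (hypothetical helper)."""
    return {
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(frame.duplicated().sum()),
        # Sales figures should never be negative
        "negative_sales": int((frame["Global_Sales"] < 0).sum()),
        # Platform labels that change once surrounding whitespace is stripped
        "untrimmed_platforms": int(
            (frame["Platform"] != frame["Platform"].str.strip()).sum()
        ),
    }

report = find_inconsistencies(toy)
print(report)
```

The same helper could be pointed at `df` once the column names are confirmed with `df.info()`.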

## Filtering Data and Finding Missing Values

For our analysis, we focus on three key columns: `Global_Sales`, `Genre`, and `Platform`. We count the missing values in each and drop any row where at least one of them is missing.

In [None]:
# Define key columns for analysis
key_cols = ['Global_Sales', 'Genre', 'Platform']

# Print missing value counts for key columns
for col in key_cols:
    print(f"{col} missing: {df[col].isnull().sum()}")

# Drop rows missing key column values
df_clean = df.dropna(subset=key_cols)
print(f"After dropping rows with missing key values, dataset has {df_clean.shape[0]} rows.")

## Initial Statistics on the Data

We compute summary statistics for the numerical features to get an overall sense of the data distribution.

In [None]:
# Display summary statistics for numerical columns
df_clean.describe()

## Data Analysis & Insights

Now we move on to our main analysis, structured into three separate questions. Each section includes focused visualizations and accompanying insights.

### Question 1: What is the Distribution of Global Sales?

Global sales are highly skewed. To understand where most games fall, we limit our view to 0–5 million units. We present:
- **Histogram** (frequency of games within 0–5M)
- **Density Plot** (smoothed distribution)
- **Box Plot** (median and outliers)

In [None]:
# Q1: Histogram of Global Sales (0-5 million range)
if 'Global_Sales' in df_clean.columns:
    common_sales = df_clean[df_clean['Global_Sales'] <= 5]['Global_Sales']
    plt.figure(figsize=(8,5))
    plt.hist(common_sales, bins=50, color='skyblue', edgecolor='black')
    plt.title('Distribution of Global Sales (0-5 million units)')
    plt.xlabel('Global Sales (million units)')
    plt.ylabel('Number of Games')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Q1: Density Plot of Global Sales (x-axis limited to 0-5 million units)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
    # The density is estimated from all games; only the displayed range is capped.
    # df_clean has no missing Global_Sales values, so no dropna() is needed.
    df_clean['Global_Sales'].plot(kind='density', color='purple')
    plt.title('Density Plot of Global Sales (0-5 million units)')
    plt.xlabel('Global Sales (million units)')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Q1: Box Plot of Global Sales (y-axis limited to 0-5 million)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(6,8))
    # Vertical orientation is the default; outliers above 5 million are cut off by ylim.
    plt.boxplot(df_clean['Global_Sales'], patch_artist=True)
    plt.title('Box Plot of Global Sales (0-5 million units)')
    plt.ylabel('Global Sales (million units)')
    plt.ylim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")
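
Capping the axes at 5 million units keeps the plots readable but hides the upper tail. One alternative, not used in this notebook, is to count games in logarithmically spaced bins so that small and very large sales figures fit on one scale. The sketch below uses invented sales values, and the bin edges are an arbitrary choice made for illustration.

```python
import numpy as np

# Invented sales figures in million units, for illustration only.
sales = np.array([0.02, 0.03, 0.05, 0.2, 0.4, 0.5, 2.0, 3.0, 20.0, 80.0])

# The capped view used in the plots above drops everything over 5 million.
capped = sales[sales <= 5]

# Log-spaced bin edges: 0.01, 0.1, 1, 10, 100 (million units).
bins = np.logspace(-2, 2, num=5)
counts, edges = np.histogram(sales, bins=bins)

for low, high, n in zip(edges[:-1], edges[1:], counts):
    print(f"{low:g} to {high:g} million: {n} games")
print(f"Games kept by the 0-5M cap: {len(capped)} of {len(sales)}")
```

With real data, the same bin edges could be passed to `plt.hist` together with `plt.xscale('log')`.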

**Insights from Question 1:**

- Most games sell under 1 million copies, as the histogram and density plot show.
- The box plot confirms a long-tail distribution: outliers extend upward from the box, and the biggest blockbusters lie beyond the 5M cap on the y-axis, so they are not shown.
- Overall, the market consists of many low-selling titles and a few extremely popular ones.
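
The observation that most games sell under 1 million copies can also be checked numerically rather than read off the plots. The following is a minimal sketch on invented figures; with the real data, `sales` would be `df_clean['Global_Sales']`. The variable names are assumptions introduced for this example.

```python
import pandas as pd

# Invented sales figures (million units), used only to show the calculation.
sales = pd.Series([0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90, 1.50, 4.00, 30.00])

# Fraction of titles that sold fewer than 1 million units
share_under_1m = (sales < 1).mean()

# A mean far above the median is a sign of a right-skewed distribution
median_sales = sales.median()
mean_sales = sales.mean()
skewness = sales.skew()

# Share of all units sold that comes from titles at or above the 90th percentile
top_decile_share = sales[sales >= sales.quantile(0.9)].sum() / sales.sum()

print(f"Share under 1M: {share_under_1m:.0%}")
print(f"Median: {median_sales:.2f}, mean: {mean_sales:.2f}, skewness: {skewness:.2f}")
print(f"Top-decile share of units: {top_decile_share:.0%}")
```

Reporting the median alongside the mean gives a fairer picture of a typical title than the mean alone.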

### Question 2: Which Genre Has the Highest Average Global Sales?

Next, we analyze game genres. We compute the average global sales per genre and present the results as a table and a horizontal bar chart. A pie chart then shows how total global sales are split among the five genres with the highest totals.

In [None]:
# Q2: Table of Average Global Sales by Genre
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_avg = df_clean.groupby('Genre')['Global_Sales'].mean()
    genre_table = genre_avg.sort_values(ascending=False).to_frame(name='Average Global Sales')
    print("Top 10 Genres by Average Global Sales:")
    display(genre_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Q2: Horizontal Bar Chart for Top 5 Genres by Average Global Sales
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    top_genres = genre_avg.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,5))
    top_genres.sort_values().plot(kind='barh', color='mediumseagreen')
    plt.title('Top 5 Genres by Average Global Sales')
    plt.xlabel('Average Global Sales (million units)')
    plt.xlim(0, top_genres.max()*1.1)
    plt.show()
else:
    print("Required columns not found.")

In [None]:
# Q2: Pie Chart Showing How Total Global Sales Split Among the Top 5 Genres
# Percentages are relative to the combined top-5 total, not to the whole market.
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_total = df_clean.groupby('Genre')['Global_Sales'].sum()
    top5_total = genre_total.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,8))
    top5_total.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
    plt.title('Share of Total Global Sales Among the Top 5 Genres')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")

**Insights from Question 2:**

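- The table shows that genres such as **Shooter** and **Action** lead in average global sales.
- The bar chart isolates the top 5 genres by average sales and makes the gaps between them easy to compare.
- The pie chart shows how total global sales are split among the five genres with the highest totals. Because only those five are plotted, the percentages are shares of the top-5 total, not of the whole market.

The table and bar chart rank genres by average sales, while the pie chart uses totals, so a genre with many releases can hold a large share of total sales without leading on average. This analysis suggests that mainstream genres are more likely to achieve higher sales. It does not examine budgets, so any link to production cost remains untested.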
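
Average and total sales answer different questions, so it can help to view them side by side with the number of titles per genre. Below is a minimal sketch using pandas named aggregation on an invented frame; the figures are made up, and `genre_stats` is a name introduced here for illustration.

```python
import pandas as pd

# Invented rows for illustration; real values would come from df_clean.
toy = pd.DataFrame({
    "Genre": ["Action"] * 6 + ["Shooter"] * 2 + ["Puzzle"],
    "Global_Sales": [0.5, 0.3, 0.2, 1.0, 1.5, 0.7, 2.0, 1.0, 0.1],
})

# One row per genre: number of titles, total sales, and average sales
genre_stats = (
    toy.groupby("Genre")["Global_Sales"]
       .agg(titles="count", total="sum", average="mean")
       .sort_values("average", ascending=False)
)
print(genre_stats)

# The leader can differ depending on the measure
top_by_average = genre_stats["average"].idxmax()
top_by_total = genre_stats["total"].idxmax()
print(f"Highest average: {top_by_average}; highest total: {top_by_total}")
```

In this toy data the genre with the most titles leads on total sales but not on average, which is why the two rankings should be read separately.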

### Question 3: How Do Sales Vary Across Different Platforms?

Finally, we compare global sales across gaming platforms. We present a table of the top 10 platforms, a horizontal bar chart of every platform's total sales, and a pie chart of market share. The bar chart's x-axis runs from zero to just past the largest platform total.

In [None]:
# Q3: Table of Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    platform_sales = df_clean.groupby('Platform')['Global_Sales'].sum()
    platform_table = platform_sales.sort_values(ascending=False).to_frame(name='Total Global Sales')
    print("Top Platforms by Total Global Sales:")
    display(platform_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Q3: Horizontal Bar Chart for Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
    platform_sales.sort_values().plot(kind='barh', color='slateblue')
    plt.title('Total Global Sales by Platform')
    plt.xlabel('Total Global Sales (million units)')
    plt.xlim(0, platform_sales.max()*1.1)
    plt.ylabel('Platform')
    plt.show()
else:
    print("Required columns not found.")

In [None]:
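# Q3: Pie Chart Showing Market Share of Total Global Sales by Platform
# The 9 largest platforms get their own slice; the rest are pooled into 'Other'
# so labels stay readable and the 10-color pastel palette is not reused.
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    ranked = platform_sales.sort_values(ascending=False)
    pie_data = pd.concat([ranked.head(9), pd.Series({'Other': ranked.iloc[9:].sum()})])
    plt.figure(figsize=(8,8))
    pie_data.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))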
    plt.title('Market Share of Total Global Sales by Platform')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")

**Insights from Question 3:**

- The table shows that a few platforms (e.g., PS2 and Wii) have very high total global sales.
- The bar chart shows large differences in total sales between platforms.
- The pie chart shows each platform's market share and indicates that a small number of consoles dominate the market.

This analysis indicates that total global sales are concentrated on a few consoles. Whether that concentration comes from blockbuster titles or simply from larger game libraries is not tested here, because platform totals do not separate the two.
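
The degree of concentration can be expressed as a number by sorting platforms by total sales and accumulating their shares. The sketch below uses invented platform totals; the platform labels and the 80% threshold are assumptions chosen for illustration. With the real data, `platform_totals` would be the `platform_sales` series computed above.

```python
import pandas as pd

# Invented platform totals (million units), for illustration only.
platform_totals = pd.Series(
    {"P1": 500.0, "P2": 250.0, "P3": 120.0, "P4": 80.0, "P5": 50.0}
)

# Each platform's share of all units sold, largest first
shares = platform_totals.sort_values(ascending=False) / platform_totals.sum()

# Running total of those shares
cumulative = shares.cumsum()

# Smallest number of platforms whose combined share reaches 80%
n_for_80 = int((cumulative < 0.80).sum()) + 1

print(cumulative.round(2))
print(f"Platforms needed to reach 80% of sales: {n_for_80} of {len(shares)}")
```

A small count relative to the number of platforms would support the claim that a few consoles dominate the market.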

## Conclusion

**Summary of Key Findings:**

- **Global Sales Distribution:** Most video games sell under 1 million copies, with a long-tail distribution indicating a few extreme blockbusters.
- **Genre Performance:** Mainstream genres (e.g., Shooter, Action) lead in average global sales, highlighting their commercial potential.
- **Platform Variability:** While many platforms have titles with modest sales, a few consoles (e.g., PS2, Wii) dominate total global sales.

These insights help stakeholders understand market trends and make informed decisions regarding game development and marketing.