# Video Games Sales Analysis

## Main Goal: Understanding Video Game Sales Trends

### Target Audience: Game Developers, Publishers, and Investors

Our main objective is to identify key trends in video game sales. Our goals are:

1. **Distribution Analysis:** Determine the typical sales range for video games.
2. **Genre Analysis:** Identify which game genres achieve the highest average global sales.
3. **Platform Analysis:** Understand how global sales vary across different gaming platforms.

By carefully cleaning the data and focusing on relevant ranges, we aim to provide insights that help stakeholders (e.g., developers and publishers) make data-driven decisions.

## Data Exploration

We start by loading the dataset and exploring its structure. This includes viewing the first 10 rows and basic dataset information.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

# Load the dataset from the attached Kaggle input directory
data_path = "/kaggle/input/video-games-sales-as-at-22-dec-2016csv/Video_Games_Sales_as_at_22_Dec_2016.csv"
df = pd.read_csv(data_path, encoding='ISO-8859-1')

print(f"Dataset contains {df.shape[0]} games and {df.shape[1]} features.")
df.head(10)

## Data Observation and Cleaning

Let's inspect the DataFrame’s structure, data types, and missing values. We focus on key columns required for our analysis: `Global_Sales`, `Genre`, and `Platform`.

In [None]:
# Check data info and missing values
df.info()

missing = df.isnull().sum()
print("Missing values per column:")
print(missing[missing > 0])

In [None]:
# Focus on key columns and drop rows with missing values in those columns
key_cols = ['Global_Sales', 'Genre', 'Platform']
for col in key_cols:
    print(f"{col} missing: {df[col].isnull().sum()}")

df_clean = df.dropna(subset=key_cols)
print(f"After dropping rows with missing key values, dataset has {df_clean.shape[0]} rows.")
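
To illustrate the cleaning step above, here is a minimal sketch of how `dropna(subset=...)` behaves. The rows and the `Extra` column are hypothetical, invented for illustration and not taken from the dataset. Only rows missing a value in one of the key columns are removed; gaps in other columns are tolerated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; all values are invented for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Platform": ["PS2", "Wii", None, "DS"],
    "Genre": ["Action", None, "Sports", "Puzzle"],
    "Global_Sales": [1.2, 0.4, 0.1, 0.05],
    "Extra": [80.0, np.nan, 65.0, np.nan],  # hypothetical non-key column
})

key_cols = ["Global_Sales", "Genre", "Platform"]
toy_clean = toy.dropna(subset=key_cols)

# Rows B and C are dropped (missing Genre / Platform);
# row D survives even though its non-key column is missing
print(toy_clean["Name"].tolist())
```

Restricting the drop to the key columns keeps as many games as possible for the sales, genre, and platform questions.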

## Initial Statistics on the Data

We now compute summary statistics to understand the numerical distribution of the data.

In [None]:
df_clean.describe()

## Data Analysis & Insights

We now delve into our main questions with focused analyses and visualizations.

### Question 1: What is the Distribution of Global Sales?

Global sales are highly right-skewed. In order to see where most games lie, we zoom in on the range of 0–5 million units. We then visualize the distribution using three plots:

- **Histogram:** Frequency of games within the specified range.
- **Density Plot:** Smoothed estimation of the distribution.
- **Box Plot:** Highlights the median and identifies outliers (blockbusters).

In [None]:
# Histogram of Global Sales (0-5 million units)
if 'Global_Sales' in df_clean.columns:
    common_sales = df_clean[df_clean['Global_Sales'] <= 5]['Global_Sales']
    plt.figure(figsize=(8,5))
    plt.hist(common_sales, bins=50, color='skyblue', edgecolor='black')
    plt.title('Distribution of Global Sales (0-5 million units)')
    plt.xlabel('Global Sales (million units)')
    plt.ylabel('Number of Games')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Density Plot of Global Sales (0-5 million units)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
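    # Evaluate the KDE on a fine grid over 0-5 so the zoomed-in curve is smooth
    df_clean['Global_Sales'].dropna().plot(kind='density', ind=np.linspace(0, 5, 500), color='purple')
    plt.title('Density Plot of Global Sales (0-5 million units)')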
    plt.xlabel('Global Sales (million units)')
    plt.xlim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")

In [None]:
# Box Plot of Global Sales (y-axis limited to 0-5 million)
if 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(6,8))
    plt.boxplot(df_clean['Global_Sales'].dropna(), vert=True, patch_artist=True)
    plt.title('Box Plot of Global Sales')
    plt.ylabel('Global Sales (million units)')
    plt.ylim(0,5)
    plt.show()
else:
    print("Global_Sales column not found.")
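
The points a box plot draws as outliers follow a simple rule. The sketch below assumes matplotlib's default whisker setting (`whis=1.5`) and uses made-up sales figures, not values from the dataset. It computes the fences by hand so the blockbuster cutoff can be read as a number.

```python
import pandas as pd

# Hypothetical sales figures in million units; invented for illustration
sales = pd.Series([0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 6.0, 30.0])

q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are drawn as outliers
lower_fence = q1 - 1.5 * iqr  # negative here, so no low outliers are possible

outliers = sales[(sales > upper_fence) | (sales < lower_fence)]
print(f"Q1={q1:.3f}, Q3={q3:.3f}, upper fence={upper_fence:.3f}")
print("Points drawn as outliers:", outliers.tolist())
```

Applying the same calculation to `df_clean['Global_Sales']` would give the threshold above which a title counts as an outlier in the plot.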

**Insights from Question 1:**

- The histogram shows that most games sell under 1 million copies.
- The density plot confirms a heavy right-skew in the data.
- The box plot is compressed toward the low end of the range; the blockbuster outliers above 5 million units fall outside this focused view.

Thus, the video game market is characterized by a multitude of low-selling titles with a few exceptionally high-selling blockbusters.
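
The visual impressions above can also be checked numerically. The following is a minimal sketch on made-up figures (not dataset values): it computes the share of titles under 1 million units, compares the mean with the median, and reports the sample skewness, where a positive value indicates a right-skewed distribution.

```python
import pandas as pd

# Hypothetical sales figures in million units; invented for illustration
sales = pd.Series([0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1.5, 6.0, 30.0])

share_under_1m = (sales < 1.0).mean()  # fraction of titles below 1 million
median = sales.median()
mean = sales.mean()
skewness = sales.skew()  # positive means a long right tail

print(f"Share under 1M: {share_under_1m:.0%}")
print(f"Median: {median:.2f}, Mean: {mean:.2f}, Skewness: {skewness:.2f}")
```

A mean far above the median is a typical sign that a few very large values are pulling the average upward.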

### Question 2: Which Genre Has the Highest Average Global Sales?

Next, we analyze game genres to determine which types of games, on average, sell the most copies globally. We use a table, a horizontal bar chart, and a pie chart to visualize the findings.

In [None]:
# Table of Average Global Sales by Genre
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_avg = df_clean.groupby('Genre')['Global_Sales'].mean()
    genre_table = genre_avg.sort_values(ascending=False).to_frame(name='Average Global Sales')
    print("Top 10 Genres by Average Global Sales:")
    display(genre_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Horizontal Bar Chart for Top 5 Genres by Average Global Sales
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    top_genres = genre_avg.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,5))
    top_genres.sort_values().plot(kind='barh', color='mediumseagreen')
    plt.title('Top 5 Genres by Average Global Sales')
    plt.xlabel('Average Global Sales (million units)')
    plt.xlim(0, top_genres.max()*1.1)
    plt.show()
else:
    print("Required columns not found.")

In [None]:
# Pie Chart Showing Market Share of Total Global Sales by Top 5 Genres
if 'Genre' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    genre_total = df_clean.groupby('Genre')['Global_Sales'].sum()
    top5_total = genre_total.sort_values(ascending=False).head(5)
    plt.figure(figsize=(8,8))
    top5_total.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
    plt.title('Market Share of Total Global Sales by Top 5 Genres')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")
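
Percentages in a pie of only the top five genres are relative to those five, not to the whole market. One possible alternative, sketched below with hypothetical genre names and totals, adds an `Other` slice so that each percentage reads as a share of all sales.

```python
import pandas as pd

# Hypothetical total sales per genre (million units); invented for illustration
genre_total = pd.Series(
    {"G1": 60.0, "G2": 45.0, "G3": 35.0, "G4": 25.0, "G5": 15.0, "G6": 12.0, "G7": 8.0}
)

top5 = genre_total.sort_values(ascending=False).head(5)
other = genre_total.sum() - top5.sum()  # everything outside the top five
with_other = pd.concat([top5, pd.Series({"Other": other})])

share_of_market = with_other / with_other.sum() * 100  # share of all sales
share_within_top5 = top5 / top5.sum() * 100            # what a top-5-only pie shows

print(share_of_market.round(1))
print(share_within_top5.round(1))
```

Passing `with_other` to `.plot(kind='pie', ...)` instead of the top-five series would produce a chart whose labels are true market shares.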

**Insights from Question 2:**

- The table shows that genres such as **Shooter** and **Action** have the highest average sales.
- The bar chart clearly distinguishes the top 5 genres, emphasizing their superior performance.
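- The pie chart shows how total sales split among the five genres with the largest totals. Because the remaining genres are excluded, the percentages are shares within this top five, not of the entire market. Note also that it ranks genres by total sales, whereas the table and bar chart rank them by average sales.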

This analysis suggests that a small set of mainstream genres dominates global sales.
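
Because sales are right-skewed, a genre's average can be lifted by a single blockbuster. The sketch below uses invented rows (the genre labels are only placeholders) to report the count, mean, and median side by side, which is one way to check whether a high average reflects typical titles or a few hits.

```python
import pandas as pd

# Hypothetical rows; sales values are invented for illustration
toy = pd.DataFrame({
    "Genre": ["Shooter", "Shooter", "Shooter", "Puzzle", "Puzzle", "Puzzle"],
    "Global_Sales": [0.1, 0.2, 9.0, 0.5, 0.6, 0.7],
})

genre_stats = (
    toy.groupby("Genre")["Global_Sales"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(genre_stats)
```

In this toy example the first genre leads on the mean only because of one large value, while the second has the higher median.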

### Question 3: How Do Sales Vary Across Different Platforms?

Finally, we examine how global sales differ by gaming platform. We use a table, a horizontal bar chart, and a pie chart to show the distribution and market share of total sales by platform.

In [None]:
# Table of Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    platform_sales = df_clean.groupby('Platform')['Global_Sales'].sum()
    platform_table = platform_sales.sort_values(ascending=False).to_frame(name='Total Global Sales')
    print("Top Platforms by Total Global Sales:")
    display(platform_table.head(10))
else:
    print("Required columns not found.")

In [None]:
# Horizontal Bar Chart for Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,5))
    platform_sales.sort_values().plot(kind='barh', color='slateblue')
    plt.title('Total Global Sales by Platform')
    plt.xlabel('Total Global Sales (million units)')
    plt.xlim(0, platform_sales.max()*1.1)
    plt.ylabel('Platform')
    plt.show()
else:
    print("Required columns not found.")

In [None]:
# Pie Chart Showing Market Share of Total Global Sales by Platform
if 'Platform' in df_clean.columns and 'Global_Sales' in df_clean.columns:
    plt.figure(figsize=(8,8))
    platform_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
    plt.title('Market Share of Total Global Sales by Platform')
    plt.ylabel('')
    plt.show()
else:
    print("Required columns not found.")

**Insights from Question 3:**

- A few platforms (e.g., PS2, Wii) dominate in total global sales.
- The bar chart shows substantial differences between platforms.
- The pie chart clearly depicts the market share, indicating that only a handful of consoles account for most of the sales.

This suggests that while most platforms host many games with modest sales, a few key consoles account for the bulk of global sales, plausibly helped by blockbuster titles.
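
The claim that a handful of consoles account for most sales can be quantified with a cumulative share. The following is a minimal sketch with hypothetical platform names and totals; the 80% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical total sales per platform (million units); invented for illustration
platform_sales = pd.Series(
    {"P1": 450.0, "P2": 300.0, "P3": 150.0, "P4": 60.0, "P5": 40.0}
)

shares = platform_sales.sort_values(ascending=False) / platform_sales.sum()
cumulative = shares.cumsum()

# Number of top platforms needed to reach at least 80% of total sales
n_for_80 = int((cumulative < 0.80).sum()) + 1

print(cumulative.round(2))
print(f"Platforms needed for 80% of sales: {n_for_80}")
```

Running the same steps on the real `platform_sales` series would show how many consoles are needed to cover a given share of the market.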

## Conclusion

**Summary of Key Findings:**

- **Global Sales Distribution:** Most video games sell under 1 million copies; a few blockbusters skew the distribution.
- **Genre Performance:** Mainstream genres (especially Shooter and Action) achieve the highest average global sales.
- **Platform Variability:** A few platforms (e.g., PS2, Wii) dominate total global sales despite many games selling modestly on every platform.

These insights provide a clear picture of the video game market, highlighting the long-tail nature of sales and the dominance of certain genres and platforms. Such findings can assist developers, publishers, and investors in making informed, data-driven decisions.