# Exploratory Data Analysis (EDA) of Video Game Sales

This notebook presents an Exploratory Data Analysis (EDA) of video game sales data. The goal of this analysis is to uncover insights and trends within the video game industry, focusing on various aspects such as sales by publisher, genre, region, and over time.

## Objectives:
- **Understand Sales Distribution**: Analyze the distribution of sales across different publishers, genres, and regions.
- **Identify Trends**: Examine sales trends over the years to identify periods of growth and decline.
- **Regional Preferences**: Explore regional preferences for different genres to inform market strategies.
- **Visualize Data**: Use various visualizations to present the data in an intuitive and informative manner.

## Key Sections:
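1. **Top-Selling Game Genres**: Bar and pie charts of global sales by genre.
2. **Top-Selling Gaming Platforms**: Bar and pie charts of global sales by platform.
3. **Top 10 Publishers by Sales**: A bar chart highlighting the leading publishers based on global sales.
4. **Regional Sales by Genre**: A stacked bar chart showing the distribution of sales across genres in each region.
5. **Video Game Sales Trend Over the Years**: A line chart illustrating the trend of video game sales over time.
6. **Heat Map of Video Game Sales by Genre and Region**: A heat map providing a detailed view of sales across genres and regions.
7. **Impact of Major Console Release on Sales**: A line chart of global sales in the years around a major console release.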

## Data Source:
The dataset used for this analysis includes information on video game sales, including columns for publisher, genre, year of release, and sales in different regions (NA, EU, JP, Other).

By the end of this analysis, we aim to gain a deeper understanding of the video game market, uncovering valuable insights that can inform strategic decisions for publishers, developers, and marketers.

### Importing required libraries 

In [49]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

### Loading the dataset and checking for missing values

In [68]:
# Load dataset
file_path = "Video_Games_Sales_as_at_22_Dec_2016.csv"
df = pd.read_csv(file_path)

# Display the first 5 rows
df.head()


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,
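
Before aggregating, it can help to confirm that the regional columns add up to `Global_Sales`. The sketch below runs that check on the five rows shown above. The `TOLERANCE` value is an assumption to allow for rounding in the source figures, not something the dataset documents.

```python
import pandas as pd

# The five rows shown in the df.head() output above
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii",
             "Wii Sports Resort", "Pokemon Red/Pokemon Blue"],
    "NA_Sales": [41.36, 29.08, 15.68, 15.61, 11.27],
    "EU_Sales": [28.96, 3.58, 12.76, 10.93, 8.89],
    "JP_Sales": [3.77, 6.81, 3.79, 3.28, 10.22],
    "Other_Sales": [8.45, 0.77, 3.29, 2.95, 1.0],
    "Global_Sales": [82.53, 40.24, 35.52, 32.77, 31.37],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Gap between the summed regional sales and the reported global figure
sample["Regional_Total"] = sample[regions].sum(axis=1)
sample["Gap"] = (sample["Regional_Total"] - sample["Global_Sales"]).abs()

# Assumed allowance for rounding in the published numbers
TOLERANCE = 0.02
inconsistent = sample[sample["Gap"] > TOLERANCE]

print(sample[["Name", "Regional_Total", "Global_Sales", "Gap"]])
print(f"Rows outside tolerance: {len(inconsistent)}")
```

The same three lines (sum, gap, filter) should work on the full `df` once it is loaded.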


In [51]:
# Check missing values in each column
df.isnull().sum()


Name                  2
Platform              0
Year_of_Release     269
Genre                 2
Publisher            54
NA_Sales              0
EU_Sales              0
JP_Sales              0
Other_Sales           0
Global_Sales          0
Critic_Score       8582
Critic_Count       8582
User_Score         6704
User_Count         9129
Developer          6623
Rating             6769
dtype: int64
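
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary. `missing_summary` is a hypothetical helper name, and the toy frame is illustrative only, not the real dataset.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    percent = (frame.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only columns with gaps, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", None, "D"],
    "Critic_Score": [76.0, None, None, 80.0],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0],
})

print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset would show, for example, that the 8582 missing `Critic_Score` values cover roughly half of the 16719 rows.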

In [None]:
# Drop rows missing the year, genre, or publisher used for grouping below
df = df.dropna(subset=['Year_of_Release', 'Genre', 'Publisher'])

In [None]:
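# User_Score is stored as object (see the df.info() output below), so coerce it to numeric first
df = df.assign(User_Score=pd.to_numeric(df['User_Score'], errors='coerce'))
# replacing missing scores with 0
df = df.fillna({'Critic_Score': 0, 'User_Score': 0})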

In [92]:
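# summary: df.info() prints its report and df.shape is displayed as the last expression;
# df.describe() only shows output when it is the last expression in a cell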
df.describe()
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16717 non-null  object 
 1   Platform         16719 non-null  object 
 2   Year_of_Release  16450 non-null  float64
 3   Genre            16717 non-null  object 
 4   Publisher        16665 non-null  object 
 5   NA_Sales         16719 non-null  float64
 6   EU_Sales         16719 non-null  float64
 7   JP_Sales         16719 non-null  float64
 8   Other_Sales      16719 non-null  float64
 9   Global_Sales     16719 non-null  float64
 10  Critic_Score     8137 non-null   float64
 11  Critic_Count     8137 non-null   float64
 12  User_Score       10015 non-null  object 
 13  User_Count       7590 non-null   float64
 14  Developer        10096 non-null  object 
 15  Rating           9950 non-null   object 
dtypes: float64(9), object(7)
memory usage: 2.0+ MB


(16719, 16)

## Top-Selling Game Genres

The bar chart below ranks game genres by total global sales: sales are summed per genre and sorted in descending order. A donut chart of the top 10 genres then shows each one's share of those sales. Together they show which genres have been the most commercially successful.

### Top Performer - Action


In [82]:
# Group by genre and sum up global sales
genre_sales = df.groupby("Genre")["Global_Sales"].sum().reset_index()

# Sort in descending order
genre_sales = genre_sales.sort_values(by="Global_Sales", ascending=False)

# Create a bar chart using Plotly
fig_genre = px.bar(
    genre_sales, 
    x="Genre", 
    y="Global_Sales", 
    title="Top-Selling Game Genres", 
    color="Global_Sales"
)

# Show plot
fig_genre.show()

# Create pie chart
fig_pie_genre = px.pie(
    genre_sales.head(10), 
    names="Genre", 
    values="Global_Sales", 
    title="Market Share of Top 10 Game Genres", 
    hole=0.3
)

# Show plot
fig_pie_genre.show()
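
The pie chart shows shares visually. If the percentages are also needed as numbers, they could be computed as in the sketch below. `add_market_share` and the `Share_Pct` column are hypothetical names, and the toy figures are illustrative, not results from the dataset.

```python
import pandas as pd

def add_market_share(sales: pd.DataFrame, value_col: str = "Global_Sales") -> pd.DataFrame:
    """Return a copy with each row's percentage of the column total."""
    out = sales.copy()
    out["Share_Pct"] = (out[value_col] / out[value_col].sum() * 100).round(1)
    return out

# Illustrative numbers only
toy_genre_sales = pd.DataFrame({
    "Genre": ["Action", "Sports", "Puzzle"],
    "Global_Sales": [50.0, 30.0, 20.0],
})

shares = add_market_share(toy_genre_sales)
print(shares)
```

Applied to `genre_sales`, this gives each genre's share of all genres. Applied to `genre_sales.head(10)`, it gives the share within the top 10 only, which is what the pie chart displays.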


## Top-Selling Gaming Platforms

The bar chart below ranks gaming platforms by total global sales: sales are summed per platform and sorted in descending order. A donut chart of the top 10 platforms then shows each one's share of those sales. Together they show which platforms have achieved the highest commercial success.

### Top Performer - PS2

In [75]:
# Group by platform and sum up global sales
platform_sales = df.groupby("Platform")["Global_Sales"].sum().reset_index()

# Sort in descending order
platform_sales = platform_sales.sort_values(by="Global_Sales", ascending=False)

# Create a bar chart
fig_platform = px.bar(
    platform_sales, 
    x="Platform", 
    y="Global_Sales", 
    title="Top-Selling Gaming Platforms", 
    color="Global_Sales"
)

# Show bar chart
fig_platform.show()

# Create pie chart
fig_pie_platforms = px.pie(
    platform_sales.head(10), 
    names="Platform", 
    values="Global_Sales", 
    title="Market Share of Top 10 Gaming Platforms", 
    hole=0.3
)

# Show pie chart
fig_pie_platforms.show()

# Top 10 Publishers by Sales

The bar chart below illustrates the top 10 publishers based on their global sales. The data is aggregated by summing up the global sales for each publisher and then sorted in descending order. This visualization helps in identifying the leading publishers in the market.

## Key Insights:
- The chart highlights the publishers with the highest sales, providing a clear comparison.
- The color intensity represents the magnitude of sales, making it easy to distinguish between different publishers.
- This information can be useful for market analysis, strategic planning, and understanding competitive dynamics.

## Data Source:
The data used for this visualization is derived from the dataset, grouped by the `Publisher` column and summed up for the `Global_Sales` column.

## Visualization Details:
- **X-axis**: Publisher names
- **Y-axis**: Total global sales
- **Color**: Represents the total global sales for each publisher

This chart provides a quick and effective way to understand the distribution of sales among the top publishers in the industry.

In [None]:
# Group by publisher and sum up global sales
publisher_sales = df.groupby("Publisher")["Global_Sales"].sum().reset_index()

# Sort in descending order and take top 10
publisher_sales = publisher_sales.sort_values(by="Global_Sales", ascending=False).head(10)

# Create a bar chart
fig_publisher = px.bar(
    publisher_sales, 
    x="Publisher", 
    y="Global_Sales", 
    title="Top 10 Publishers by Sales", 
    color="Global_Sales"
)

fig_publisher.show()



# Regional Sales by Genre

The bar chart below showcases the regional sales distribution across different genres. This visualization provides insights into how various genres perform in different regions, helping to understand regional preferences and market trends.

## Key Insights:
- The chart breaks down sales by genre for each region, allowing for a detailed comparison.
- Different colors represent different regions, making it easy to identify regional sales patterns.
- This information is valuable for publishers and developers to tailor their strategies according to regional preferences.

## Data Source:
The data used for this visualization is derived from the dataset, grouped by the `Genre` column and summed up for the regional sales columns (`NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`).

## Visualization Details:
- **X-axis**: Genre names
- **Y-axis**: Total sales in each region
- **Color**: Represents different regions (NA, EU, JP, Other)

This chart provides a comprehensive view of how different genres perform across various regions, aiding in strategic decision-making for targeting specific markets.

In [80]:
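regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Sum regional sales per genre so each bar segment is one aggregated value
genre_region_sales = df.groupby("Genre")[regions].sum().reset_index()

fig_regional = go.Figure()
for region in regions:
    # One trace per region, stacked within each genre
    fig_regional.add_trace(go.Bar(name=region, x=genre_region_sales["Genre"], y=genre_region_sales[region]))

fig_regional.update_layout(barmode='stack', title="Regional Sales by Genre")

fig_regional.show()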


# Video Game Sales Trend Over the Years

The line chart below illustrates the trend of video game sales over the years. This visualization helps in understanding the growth and fluctuations in the video game market over time.

## Key Insights:
- The chart shows the total global sales of games released in each year, highlighting periods of growth and decline.
- Peaks in the chart indicate years with significant sales, which could be due to the release of popular games or consoles.
- This information is useful for analyzing market trends, predicting future sales, and understanding the impact of major releases.

## Data Source:
The data used for this visualization is derived from the dataset, grouped by the `Year_of_Release` column and summed up for the `Global_Sales` column.

## Visualization Details:
- **X-axis**: Years
- **Y-axis**: Total global sales
- **Line**: Represents the trend of sales over the years

This chart provides a clear view of the historical sales performance in the video game industry, helping stakeholders make informed decisions based on past trends.

In [81]:
# Group sales by year
yearly_sales = df.groupby("Year_of_Release")["Global_Sales"].sum().reset_index()

# Create a line chart
fig_yearly = px.line(
    yearly_sales, 
    x="Year_of_Release", 
    y="Global_Sales", 
    title="Game Sales Over Time"
)

fig_yearly.show()

# Create bar-line chart
fig_bar_line = go.Figure()

# Bar chart for global sales
fig_bar_line.add_trace(go.Bar(
    x=yearly_sales["Year_of_Release"], 
    y=yearly_sales["Global_Sales"], 
    name="Global Sales", 
    marker_color='blue'
))

# Line chart overlay
fig_bar_line.add_trace(go.Scatter(
    x=yearly_sales["Year_of_Release"], 
    y=yearly_sales["Global_Sales"], 
    name="Trend Line", 
    mode="lines+markers", 
    line=dict(color='red', width=2)
))

# Update layout
fig_bar_line.update_layout(
    title="Video Game Sales Trends Over the Years",
    xaxis_title="Year of Release",
    yaxis_title="Global Sales (millions)",
    barmode="overlay"
)

# Show plot
fig_bar_line.show()
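
Periods of growth and decline are easier to pin down with year-over-year changes than by reading them off the chart. A minimal sketch follows. `add_yearly_change` and the `Change` / `Change_Pct` columns are hypothetical names, and the toy totals are illustrative only.

```python
import pandas as pd

def add_yearly_change(yearly: pd.DataFrame, value_col: str = "Global_Sales") -> pd.DataFrame:
    """Add absolute and percentage change versus the previous year."""
    out = yearly.sort_values("Year_of_Release").copy()
    out["Change"] = out[value_col].diff()
    out["Change_Pct"] = (out[value_col].pct_change() * 100).round(1)
    return out

# Illustrative yearly totals, not results from the dataset
toy_yearly = pd.DataFrame({
    "Year_of_Release": [2006.0, 2007.0, 2008.0, 2009.0],
    "Global_Sales": [100.0, 120.0, 150.0, 135.0],
})

result = add_yearly_change(toy_yearly)
print(result)
```

The first year has no previous value, so its change columns are `NaN`. Passing `yearly_sales` from the cell above should work the same way, since it has the same two columns.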


# Heat Map of Video Game Sales by Genre and Region

The heat map below provides a detailed view of video game sales across different genres and regions. This visualization helps in identifying patterns and trends in the popularity of various genres in different parts of the world.

## Key Insights:
- The heat map uses color to represent magnitude; with the `jet` colorscale used below, values run from blue (low) to red (high).
- This visualization allows for quick identification of the most popular genres in each region.
- It highlights regional preferences, showing which genres are more popular in North America, Europe, Japan, and other regions.
- The heat map can be used to tailor marketing strategies and game development efforts to cater to regional tastes.

## Data Source:
The data used for this visualization is derived from the dataset, grouped by the `Genre` and regional sales columns (`NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`).

## Visualization Details:
- **X-axis**: Genres
- **Y-axis**: Regions (NA, EU, JP, Other)
- **Color Intensity**: Represents the total sales for each genre in each region

This heat map provides a comprehensive overview of how different genres perform across various regions, aiding in strategic decision-making for targeting specific markets and understanding regional preferences.

In [66]:
import plotly.figure_factory as ff

# Select numerical columns
num_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']

# Compute correlation matrix
corr_matrix = df[num_cols].corr()

# Convert to Plotly format
fig_heatmap = ff.create_annotated_heatmap(
    z=corr_matrix.values, 
    x=num_cols, 
    y=num_cols, 
    annotation_text=corr_matrix.round(2).values, 
    colorscale='jet', 
    showscale=True
)

# Set title
fig_heatmap.update_layout(title="Correlation Heatmap of Sales in Different Regions")

# Show plot
fig_heatmap.show()
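
The cell above plots correlations between the sales columns. The genre-by-region totals that this section describes could be built along the following lines. This is a sketch, not the notebook's own method: `genre_region_table`, `genre_share_by_region`, and `genre_region_heatmap` are hypothetical helper names, the `Blues` colorscale is an assumed choice so that darker cells mean higher values, and the toy rows are illustrative only.

```python
import pandas as pd

REGIONS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

def genre_region_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Total sales with regions as rows and genres as columns."""
    return frame.groupby("Genre")[REGIONS].sum().T

def genre_share_by_region(table: pd.DataFrame) -> pd.DataFrame:
    """Rescale each region's row to percentages, so every row sums to 100."""
    return table.div(table.sum(axis=1), axis=0).mul(100).round(1)

def genre_region_heatmap(table: pd.DataFrame, color_label: str = "Sales (millions)"):
    """Build a heat map figure with genres on the x-axis and regions on the y-axis."""
    # Imported here so the table helpers above work without plotly installed
    import plotly.express as px
    return px.imshow(
        table.values,
        x=list(table.columns),
        y=list(table.index),
        labels=dict(x="Genre", y="Region", color=color_label),
        color_continuous_scale="Blues",
        aspect="auto",
    )

# Toy rows for illustration only
toy = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Role-Playing"],
    "NA_Sales": [6.0, 2.0, 2.0],
    "EU_Sales": [3.0, 1.0, 1.0],
    "JP_Sales": [0.5, 0.5, 4.0],
    "Other_Sales": [1.0, 0.0, 1.0],
})

table = genre_region_table(toy)
shares = genre_share_by_region(table)
print(table)
print(shares)
```

In the notebook, `genre_region_heatmap(genre_region_table(df)).show()` would render the totals. Passing the shares table instead, with `color_label="Share of region (%)"`, removes the effect of market size and makes regional preferences easier to compare.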


# Impact of Major Console Release on Sales

The line chart below illustrates the impact of a major console release on video game sales. The data is filtered to include sales from four years before and after the event year (2006), which is the release year of a significant gaming console. This visualization helps in understanding how the release of a major console influences the overall sales in the video game industry.

## Key Insights:
- The chart shows a noticeable increase in sales following the release of the console in 2006.
- The peak sales year is 2008, indicating a strong market response to the console release.
- Sales begin to decline after 2009, suggesting the initial surge in sales may taper off over time.
- This information is valuable for analyzing the market impact of major console releases and can help in forecasting future sales trends.

## Data Source:
The data used for this visualization is derived from the dataset, filtered for the years around the event year (2006), and grouped by the `Year_of_Release` column to sum up the `Global_Sales`.

## Visualization Details:
- **X-axis**: Year of Release
- **Y-axis**: Total global sales
- **Line**: Represents the trend of sales around the console release year

This chart provides a clear view of how a major console release can affect video game sales, offering insights into market dynamics and consumer behavior.

In [89]:
# Example: Analyze sales before and after a major console release
# Filter data for years around the event
event_year = 2006  # Example year of a major console release
window = 4  # years on each side of the event year (inclusive)
sales_around_event = df[df["Year_of_Release"].between(event_year - window, event_year + window)]

# Group by year and sum sales
event_sales = sales_around_event.groupby("Year_of_Release")["Global_Sales"].sum().reset_index()

# Create a line chart
fig_event_impact = px.line(
    event_sales, 
    x="Year_of_Release", 
    y="Global_Sales", 
    title="Impact of Major Console Release on Sales"
)

fig_event_impact.show()
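
To put a number on the change around the event year, average annual sales before and after it can be compared. The sketch below is one simple way to do that. `before_after_mean` is a hypothetical helper name and the toy totals are illustrative. A difference in means describes what happened around the release; on its own it does not show that the release caused it.

```python
import pandas as pd

def before_after_mean(yearly: pd.DataFrame, event_year: int, window: int = 4) -> dict:
    """Mean yearly sales in the `window` years before and after event_year (event year excluded)."""
    years = yearly["Year_of_Release"]
    before = yearly[(years >= event_year - window) & (years < event_year)]
    after = yearly[(years > event_year) & (years <= event_year + window)]
    return {
        "before_mean": before["Global_Sales"].mean(),
        "after_mean": after["Global_Sales"].mean(),
    }

# Illustrative yearly totals, not results from the dataset
toy_event_sales = pd.DataFrame({
    "Year_of_Release": [2004.0, 2005.0, 2006.0, 2007.0, 2008.0],
    "Global_Sales": [100.0, 110.0, 120.0, 150.0, 170.0],
})

summary = before_after_mean(toy_event_sales, event_year=2006, window=2)
print(summary)
```

With the notebook's data, `before_after_mean(event_sales, event_year)` would compare the four years on either side of 2006.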

