# **Import Libraries**

In [58]:
import pandas as pd
import numpy as np
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from sklearn.cluster import KMeans

#set default values for graphs
px.defaults.width = 900
px.defaults.height = 400
px.defaults.template = "plotly_dark"

# **Import Dataset**

In [4]:
df = pd.read_csv('/content/Video Games Sales.csv')

In [5]:
df.head()

Unnamed: 0,index,Rank,Game Title,Platform,Year,Genre,Publisher,North America,Europe,Japan,Rest of World,Global,Review
0,0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,40.43,28.39,3.77,8.54,81.12,76.28
1,1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,91.0
2,2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,14.5,12.22,3.63,3.21,33.55,82.07
3,3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,14.82,10.51,3.18,3.01,31.52,82.65
4,4,5,Tetris,GB,1989.0,Puzzle,Nintendo,23.2,2.26,4.22,0.58,30.26,88.0


# **Data Preparation**

In [6]:
print(df.isnull().sum())

index             0
Rank              0
Game Title        0
Platform          0
Year             29
Genre             0
Publisher         2
North America     0
Europe            0
Japan             0
Rest of World     0
Global            0
Review            0
dtype: int64


**Impute missing values using mode**

In [7]:
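# Impute missing values in categorical columns with the mode.
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions warn about and may not apply.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])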

In [9]:
median_year = df["Year"].median()

# Replace the missing values in the "Year" column with the median
# (assigned back rather than filled in place, for the same reason as above)
df["Year"] = df["Year"].fillna(median_year)
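The two imputation steps above can be tried in isolation on a small frame. The following is a minimal sketch, assuming a made-up `toy` DataFrame (its values are illustrative and not taken from the dataset); it fills non-numeric columns with the mode and the numeric `Year` column with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the sales table (values are made up)
toy = pd.DataFrame({
    "Publisher": ["Nintendo", None, "Nintendo", "Sega"],
    "Year": [2006.0, np.nan, 1985.0, 2008.0],
})

# Non-numeric columns: fill gaps with the most frequent value (mode)
for col in toy.columns:
    if not pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].mode()[0])

# Numeric column: fill gaps with the median of the observed values
toy["Year"] = toy["Year"].fillna(toy["Year"].median())

print(toy)
print(toy.isnull().sum())
```

Filling with the median leaves the median of the column unchanged, but it does concentrate the imputed rows in a single year, which is worth keeping in mind for any per-year analysis.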

In [10]:
print(df.isnull().sum())

index            0
Rank             0
Game Title       0
Platform         0
Year             0
Genre            0
Publisher        0
North America    0
Europe           0
Japan            0
Rest of World    0
Global           0
Review           0
dtype: int64


# **Data Analysis & EDA**

## **Descriptive statistics**

**Shape of the dataset**

In [11]:
df.shape

(1907, 13)

**Info regarding dataset**

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1907 entries, 0 to 1906
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          1907 non-null   int64  
 1   Rank           1907 non-null   int64  
 2   Game Title     1907 non-null   object 
 3   Platform       1907 non-null   object 
 4   Year           1907 non-null   float64
 5   Genre          1907 non-null   object 
 6   Publisher      1907 non-null   object 
 7   North America  1907 non-null   float64
 8   Europe         1907 non-null   float64
 9   Japan          1907 non-null   float64
 10  Rest of World  1907 non-null   float64
 11  Global         1907 non-null   float64
 12  Review         1907 non-null   float64
dtypes: float64(7), int64(2), object(4)
memory usage: 193.8+ KB


**Summary of the numerical features**

In [13]:
print(df.describe())

           index       Rank         Year  North America       Europe  \
count  1907.0000  1907.0000  1907.000000    1907.000000  1907.000000   
mean    953.0000   954.0000  2003.785527       1.258789     0.706675   
std     550.6478   550.6478     5.852295       1.956560     1.148904   
min       0.0000     1.0000  1983.000000       0.000000     0.000000   
25%     476.5000   477.5000  2001.000000       0.510000     0.230000   
50%     953.0000   954.0000  2005.000000       0.810000     0.440000   
75%    1429.5000  1430.5000  2008.000000       1.375000     0.810000   
max    1906.0000  1907.0000  2012.000000      40.430000    28.390000   

             Japan  Rest of World       Global       Review  
count  1907.000000    1907.000000  1907.000000  1907.000000  
mean      0.317493       0.206471     2.489240    79.038977  
std       0.724945       0.343093     3.563159    10.616899  
min       0.000000       0.000000     0.830000    30.500000  
25%       0.000000       0.060000     1.1

**Count the number of unique values for each feature**

In [14]:
print(df.nunique())

index            1907
Rank             1907
Game Title       1519
Platform           22
Year               30
Genre              12
Publisher          94
North America     375
Europe            273
Japan             218
Rest of World     129
Global            479
Review            734
dtype: int64


**Number of unique values in each categorical column**

In [15]:
# Report the number of distinct values for every text (object) column
for column in df.columns:
    if df[column].dtype == 'object':
        unique_vals = df[column].nunique()
        print(f"Column name: {column} | Total unique values: {unique_vals}")

Column name: Game Title | Total unique values: 1519
Column name: Platform | Total unique values: 22
Column name: Genre | Total unique values: 12
Column name: Publisher | Total unique values: 94


**Top 5 selling games in North America**

In [23]:
# Find the top 5 selling games in North America
top_5_na = df.nlargest(5, "North America")[["Game Title", "North America"]]
print("Top 5 Selling Games in North America:")
print(top_5_na)

Top 5 Selling Games in North America:
          Game Title  North America
0         Wii Sports          40.43
1  Super Mario Bros.          29.08
7          Duck Hunt          26.93
4             Tetris          23.20
3  Wii Sports Resort          14.82


**Top 5 selling games in Europe**

In [24]:
# Find the top 5 selling games in Europe
top_5_eu = df.nlargest(5, "Europe")[["Game Title", "Europe"]]
print("\nTop 5 Selling Games in Europe:")
print(top_5_eu)


Top 5 Selling Games in Europe:
          Game Title  Europe
0         Wii Sports   28.39
2     Mario Kart Wii   12.22
9         Nintendogs   10.81
3  Wii Sports Resort   10.51
6           Wii Play    9.11


**Top 5 Selling Games in Japan**

In [25]:
top_5_jp = df.nlargest(5, "Japan")[["Game Title", "Japan"]]
print("\nTop 5 Selling Games in Japan:")
print(top_5_jp)


Top 5 Selling Games in Japan:
                         Game Title  Japan
10    Pokémon Gold / Silver Version   7.20
1                 Super Mario Bros.   6.81
5             New Super Mario Bros.   6.48
19  Pokémon Diamond / Pearl Version   6.04
27    Pokémon Black / White Version   5.64


**Top 5 Selling Games in the Rest of the World**

In [27]:
top_5_other = df.nlargest(5, "Rest of World")[["Game Title", "Rest of World"]]
print("\nTop 5 Selling Games in the Rest of the World:")
print(top_5_other)


Top 5 Selling Games in the Rest of the World:
              Game Title  Rest of World
0             Wii Sports           8.54
2         Mario Kart Wii           3.21
3      Wii Sports Resort           3.01
5  New Super Mario Bros.           2.88
6               Wii Play           2.84


**Top 5 Selling Games Overall**

In [28]:
# Find the top 5 selling games overall
top_5_global = df.nlargest(5, "Global")[["Game Title", "Global"]]
print("\nTop 5 Selling Games Overall:")
print(top_5_global)


Top 5 Selling Games Overall:
          Game Title  Global
0         Wii Sports   81.12
1  Super Mario Bros.   40.24
2     Mario Kart Wii   33.55
3  Wii Sports Resort   31.52
4             Tetris   30.26


### **Insights**

1. Wii Sports is the top-selling game in North America, Europe, the Rest of the World, and globally, indicating its widespread popularity. Japan is the only region where it does not appear in the top five.

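2. The Mario franchise is also highly successful: Super Mario Bros. is in the top five in North America and Japan, New Super Mario Bros. is in the top five in Japan and the Rest of the World, and Mario Kart Wii ranks second in Europe and the Rest of the World.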

3. Pokémon games are especially popular in Japan, where three of the top five titles (Pokémon Gold / Silver Version, Pokémon Diamond / Pearl Version, and Pokémon Black / White Version) belong to the franchise.

4. Sports-themed games, such as Wii Sports and Wii Sports Resort, seem to be well-received in multiple regions.

5. The top-selling games are not necessarily limited to a particular platform or genre, but rather the popularity of the game is widely spread across multiple regions and platforms.

6. The high sales of the games in North America and Europe indicate that these regions are crucial markets for video game developers and publishers.

7. The low sales of games in the Rest of the World compared to North America, Europe, and Japan suggest that there may be potential opportunities to expand the market in these regions.
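The regional comparisons in the list above can be made more concrete by computing the top title per region and each region's share of total sales. This is a minimal sketch on a hypothetical `toy` DataFrame (made-up numbers); the names `regions`, `top_by_region`, and `region_share` are introduced here for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same regional columns as the dataset
toy = pd.DataFrame({
    "Game Title": ["A", "B", "C", "D"],
    "North America": [10.0, 4.0, 1.0, 0.5],
    "Europe": [5.0, 6.0, 0.5, 0.2],
    "Japan": [1.0, 0.5, 3.0, 0.1],
    "Rest of World": [1.0, 0.5, 0.2, 0.1],
})
regions = ["North America", "Europe", "Japan", "Rest of World"]

# Best-selling title in each region
top_by_region = {r: toy.nlargest(1, r)["Game Title"].iloc[0] for r in regions}

# Each region's share of the summed regional sales
region_totals = toy[regions].sum()
region_share = region_totals / region_totals.sum()

print(top_by_region)
print(region_share.round(3))
```

Run against `df`, the same two expressions would show how much each region contributes overall, which is a firmer basis for statements about market size than the top-five lists alone.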

## **Visualisations**

**Top Selling Video Games: Characteristics and Insights**

In [36]:
# Find the top-selling games
top_games = df.nlargest(5, "Global")

# Analyze the characteristics of the top-selling games
platforms = top_games['Platform'].value_counts()
genres = top_games['Genre'].value_counts()
publishers = top_games['Publisher'].value_counts()

# Visualize the characteristics of the top-selling games
fig_platforms = px.bar(platforms, x=platforms.index, y=platforms.values, title="Top Platforms of the Top-Selling Games")
fig_genres = px.bar(genres, x=genres.index, y=genres.values, title="Top Genres of the Top-Selling Games")
fig_publishers = px.bar(publishers, x=publishers.index, y=publishers.values, title="Top Publishers of the Top-Selling Games")

# Show the plots
fig_platforms.show()
print()
fig_genres.show()
print()
fig_publishers.show()







All five top-selling games are published by Nintendo. Sports is the most common genre among them (2 games: Wii Sports and Wii Sports Resort), followed by Platform, Racing, and Puzzle with one game each. Wii is the most common platform (3 games), followed by NES and GB with one game each. This suggests that Nintendo as a publisher, the Wii platform, and the Sports genre are strongly associated with the very best-selling titles. With only five games in the sample, these counts are indicative rather than conclusive, and they do not show that genre diversity by itself drives sales.

**Top Platforms and Genres by Global Sales Trends**

In [41]:
# Filter the data to only include platforms and genres with significant sales
platform_sales = df.groupby('Platform').agg({'Global':'sum'}).sort_values('Global', ascending=False)
platform_sales = platform_sales[platform_sales['Global'] >= platform_sales['Global'].quantile(0.1)]
genre_sales = df.groupby('Genre').agg({'Global':'sum'}).sort_values('Global', ascending=False)
genre_sales = genre_sales[genre_sales['Global'] >= genre_sales['Global'].quantile(0.1)]

# Plot the data
fig = px.bar(platform_sales, x='Global', y=platform_sales.index, orientation='h', title='Top Selling Platforms')
fig.show()
print()
fig = px.bar(genre_sales, x='Global', y=genre_sales.index, orientation='h', title='Top Selling Genres')
fig.show()




The top 3 platforms with the highest global sales are PS2, Wii, and X360.

The genres with the highest global sales are Sports, Action, and Platform.

These insights can help game developers understand which platforms and genres have historically sold best.

The data suggests that Sports and Action games have had higher total sales than other genres, and that PS2, Wii, and X360 were the most commercially successful platforms in the period covered (1983 to 2012). Because these are totals over all titles, they also reflect how many games were released for each platform and genre, not only how well each title sold.
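The aggregation and quantile filter used for the platform and genre charts can be illustrated on a small example. The sketch below assumes a hypothetical `toy` frame with made-up sales; it sums `Global` per platform and drops platforms below the 10th percentile of those totals, mirroring the cell above.

```python
import pandas as pd

# Hypothetical toy data (made-up sales, in the same units as "Global")
toy = pd.DataFrame({
    "Platform": ["Wii", "Wii", "PS2", "PS2", "GB", "NES"],
    "Global": [10.0, 5.0, 8.0, 9.0, 1.0, 2.0],
})

# Total global sales per platform, largest first
platform_sales = toy.groupby("Platform")["Global"].sum().sort_values(ascending=False)

# Keep only platforms at or above the 10th percentile of the totals
cutoff = platform_sales.quantile(0.1)
kept = platform_sales[platform_sales >= cutoff]

# Totals alongside title counts and per-title averages, for context
summary = toy.groupby("Platform")["Global"].agg(["sum", "count", "mean"])

print(kept)
print(summary)
```

Looking at `count` and `mean` next to `sum` helps separate platforms that sold a lot because they had many titles from those whose individual titles sold well.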

**Top 10 Publishers by Global Sales**

In [45]:
# Group by publisher and sum only the "Global" column
# (summing every column would also add up text columns and years)
top_publishers = (
    df.groupby("Publisher")[["Global"]].sum()
    .sort_values(by="Global", ascending=False)
    .head(10)
)

fig = px.bar(top_publishers, x=top_publishers.index, y="Global", color="Global",
             title="Top 10 Publishers by Global Sales")

fig.show()

Nintendo is the publisher with the highest global sales in this dataset. Electronic Arts comes in second place, followed by Sony Computer Entertainment, Activision, and Take-Two Interactive. This chart shows global totals only, so it does not by itself establish which publisher leads in each individual region. The totals suggest that these publishers account for a large share of the sales of the best-selling games.
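Whether one publisher leads in every region can be checked directly by summing the regional columns per publisher. This is a minimal sketch on a hypothetical `toy` frame (publisher names are reused for readability, but the numbers are made up and do not reflect the dataset).

```python
import pandas as pd

# Hypothetical toy data; the figures are invented for illustration
toy = pd.DataFrame({
    "Publisher": ["Nintendo", "Nintendo", "EA", "EA", "Sony"],
    "North America": [5.0, 3.0, 4.0, 5.0, 1.0],
    "Europe": [2.0, 2.0, 3.0, 3.0, 1.0],
    "Japan": [3.0, 2.0, 0.1, 0.2, 1.0],
})
regions = ["North America", "Europe", "Japan"]

# Total sales per publisher in each region
regional = toy.groupby("Publisher")[regions].sum()

# Publisher with the highest total in each region
leaders = regional.idxmax()

print(regional)
print(leaders)
```

Applying the same `groupby(...).sum().idxmax()` pattern to `df` would confirm or refute a per-region leadership claim.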

**Global Sales for each year**

In [48]:
# Aggregate the data by year
agg_df = df.groupby("Year")["Global"].sum().reset_index()

# Plot the aggregated data
fig = px.bar(agg_df, x="Year", y="Global")
fig.show()

Yearly global sales trend upward overall, with some fluctuations. Sales rise through the late 1990s and early 2000s, peak in 2006, and decline somewhat in later years. Note that the 29 titles whose missing year was imputed are all counted in the median year, which slightly inflates that bar.

The rise is plausibly linked to the growth of the video game industry, as new technology and platforms were introduced and adopted. The later decline may partly reflect the global financial crisis of 2008 and its effect on consumer spending, although this dataset alone cannot confirm a cause; recent titles have also had less time to accumulate sales.
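Because yearly totals depend on how many titles fall in each year, it can help to look at title counts and per-title averages next to the sums. The following sketch uses a hypothetical `toy` frame with made-up values; the column names `total`, `titles`, and `mean_per_title` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data (made-up years and sales)
toy = pd.DataFrame({
    "Year": [2005.0, 2005.0, 2006.0, 2006.0, 2006.0, 2007.0],
    "Global": [2.0, 4.0, 1.0, 1.0, 1.0, 5.0],
})

# Per-year total, number of titles, and average sales per title
yearly = (
    toy.groupby("Year")["Global"]
    .agg(total="sum", titles="count", mean_per_title="mean")
    .reset_index()
)

# Year with the highest total sales
peak_year = yearly.loc[yearly["total"].idxmax(), "Year"]

print(yearly)
print("Peak year:", peak_year)
```

In this toy example the year with the most titles is not the year with the highest total, which is the kind of distinction the bar chart of sums cannot show.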

**Correlation between features**

In [54]:
# Calculate the correlation matrix for the numeric columns only
# (text columns such as "Game Title" cannot be correlated)
corr = df.corr(numeric_only=True)

# Create the heatmap; text_auto writes each correlation value into its cell
fig = px.imshow(corr, color_continuous_scale='bluered', text_auto='.2f',
                labels={'x':'Features', 'y':'Features', 'color': 'Correlation'})

# Show the plot
fig.show()

1. There is a strong positive correlation between "Global Sales" and "North America Sales" (0.933073), "Europe Sales" (0.888902), and "Rest of World Sales" (0.837469). This is expected: in the rows shown by `df.head()`, the global figure matches the sum of the four regional columns up to rounding, so each region is a component of the total.

2. There is a moderately strong positive correlation between "North America Sales" and "Europe Sales" (0.720766).

3. There is a weak positive correlation between "Year" and "Global Sales" (0.201001). Later years tend to have slightly higher global sales, but the relationship is not strong.

4. There is a moderate negative correlation between "Rank" and "Global Sales" (-0.529373). Rank 1 is the best-selling game, so a higher rank number means lower sales by construction; the coefficient is not closer to -1 because the relationship is not linear.

5. There is a weak negative correlation between "Review" and "Global Sales" (-0.292892). As the review score increases, global sales tend to decrease slightly, but the relationship is not strong.

6. There is a weak positive correlation between "Japan Sales" and "Review" (0.148584). As the review score increases, Japan sales tend to increase slightly, but the relationship is not strong.
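The pairs discussed above can be extracted from the correlation matrix programmatically rather than read off the heatmap. This is a minimal sketch on synthetic data: the `toy` frame is randomly generated, with `Global` built as the sum of the regional columns and `Rank` derived from `Global`, which are assumptions chosen to mimic the structure of the dataset rather than its values.

```python
import numpy as np
import pandas as pd

# Synthetic data: regional sales drawn at random, total and rank derived from them
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "North America": rng.gamma(2.0, 0.6, n),
    "Europe": rng.gamma(2.0, 0.35, n),
    "Japan": rng.gamma(1.5, 0.2, n),
})
toy["Global"] = toy.sum(axis=1)                    # total built from its parts
toy["Rank"] = toy["Global"].rank(ascending=False)  # rank 1 = highest total

corr = toy.corr(numeric_only=True)

# Keep each pair once (upper triangle without the diagonal), sorted by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)

print(pairs.round(2))
```

Even with independent random regions, `Global` correlates positively with each region and negatively with `Rank`, which illustrates why those correlations in the heatmap are structural rather than surprising.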