# About the Dataset
This dataset contains a list of video games with sales greater than 100,000 copies. It was generated by a scrape of vgchartz.com.

# Features
- Rank - Ranking of overall sales

- Name - The game's name

- Platform - Platform of the game's release (e.g. PC, PS4)

- Year - Year of the game's release

- Genre - Genre of the game

- Publisher - Publisher of the game

- NA_Sales - Sales in North America (in millions)

- EU_Sales - Sales in Europe (in millions)

- JP_Sales - Sales in Japan (in millions)

- Other_Sales - Sales in the rest of the world (in millions)

- Global_Sales - Total worldwide sales (in millions)


# Table of Contents

1. [Data Loading and Data Cleaning](#1.-Data-Loading-and-Data-Cleaning)
2. [Descriptive Analysis](#2.-Descriptive-Analysis)
3. [Popularity Through the Years by Platform](#3.-Popularity-Through-the-Years-by-Platform)
4. [Popularity Through the Years by Genre](#4.-Popularity-Through-the-Years-by-Genre)
5. [Global Sales Growth](#5-Global-Sales-Growth)
6. [Clusters](#6.-Clusters)
7. [Bonus](#Bonus.)

# Purpose
The purpose of this notebook is to explore the dataset's features and their relationships, and to find, analyze and visualize interesting patterns.

The dataset is pretty tidy and has categorical values, date values and sales values. These are perfect for practicing visualization skills, and maybe we can find a useful machine learning model.

Don't forget to upvote if you find the notebook useful. I'm so close to becoming a notebook expert, so I would really (REALLY) appreciate it :D

In [None]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Viz
import matplotlib.pyplot as plt
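# DivergingNorm was renamed TwoSlopeNorm in newer matplotlib releases
try:
    from matplotlib.colors import TwoSlopeNorm as DivergingNorm
except ImportError:
    from matplotlib.colors import DivergingNorm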
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio

# Cluster
from sklearn.cluster import KMeans

# np.warnings is not available in recent NumPy; use the standard library module
import warnings
warnings.filterwarnings('ignore')

# 1. Data Loading and Data Cleaning

In [None]:
game = pd.read_csv('../input/videogamesales/vgsales.csv')
display(game)

## Inspecting Null Values

In [None]:
sns.heatmap(game.isnull())

Looks like we have a few null values in the dataset. Let's check them out.

In [None]:
display(game[game['Year'].isnull()].head(10))
display(game[game['Year'].isnull()].shape)
display(game[game['Publisher'].isnull()].head(10))
display(game[game['Publisher'].isnull()].shape)


The null values don't look like inconsistent data, and there are not many of them. Dropping them would not have a huge impact on the analysis, but to my mind we would be losing some interesting information, such as sports games among the rows with a null 'Year', or GBA games among the rows with a null 'Publisher'.
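
To put numbers on "not a lot of them", the nulls can be counted per column. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for `game`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `game`; the values are made up
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Year': [2006.0, np.nan, 2010.0, np.nan],
    'Publisher': ['Nintendo', 'Sega', None, 'Sega'],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Share of rows that have at least one missing value
null_share = toy.isnull().any(axis=1).mean()

print(null_counts)
print(null_share)
```

Running the same two expressions on `game` should show how small the affected share of rows is.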

## Data Types

In [None]:
game.info()

The data types are also in order.

Let's continue with the descriptive analysis.

# 2. Descriptive Analysis

In [None]:
game.info()

## 2.1. Categories

In [None]:
game.select_dtypes('object').columns

### Platform

In [None]:
fig = px.histogram(game, x="Platform", template="plotly_white", color_discrete_sequence=["rgb(127,232,186)"]).update_xaxes(categoryorder="total descending")
fig.show()

print("Platform has {} unique values".format(len(game['Platform'].unique())))
print("Top 5 values are: {}".format(', '.join(game['Platform'].value_counts().index[:5])))

### Genre

In [None]:
fig = px.histogram(game, x="Genre", template="plotly_white", color_discrete_sequence=["rgb(127,232,186)"]).update_xaxes(categoryorder="total descending")
fig.show()

print("Genre has {} unique values".format(len(game['Genre'].unique())))
print("Top 5 values are: {}".format(', '.join(game['Genre'].value_counts().index[:5])))

### Publisher

In [None]:
fig = px.histogram(game, x="Publisher", template="plotly_white", color_discrete_sequence=["rgb(127,232,186)"]).update_xaxes(categoryorder="total descending")
fig.show()

print("Publisher has {} unique values".format(len(game['Publisher'].unique())))
print("Top 10 values are: {}".format(', '.join(game['Publisher'].value_counts().index[:10])))

The 'Publisher' feature has lots of distinct values. One advantage of plotly is that you can zoom in and out of the graph to explore it freely and get a more readable view.

## 2.2. Numbers

In [None]:
numbers = list(game.select_dtypes(['int64', 'float64']).columns)[2:]
numbers

In [None]:
fig = make_subplots(rows=6, cols=2)

fig.add_trace(go.Histogram(x=game['NA_Sales'], nbinsx=10, name='NA_Sales', marker_color='rgb(254, 147, 140)', opacity=0.8), row=1, col=1)
fig.add_trace(go.Box(x=game['NA_Sales'], name='NA_Sales', marker_color='rgb(254, 147, 140)'), row=2, col=1)

fig.add_trace(go.Histogram(x=game['EU_Sales'], nbinsx=10, name='EU_Sales', marker_color='rgb(230, 184, 156)', opacity=0.8), row=1, col=2)
fig.add_trace(go.Box(x=game['EU_Sales'], name='EU_Sales', marker_color='rgb(230, 184, 156)'), row=2, col=2)

fig.add_trace(go.Histogram(x=game['JP_Sales'], nbinsx=10, name='JP_Sales', marker_color='rgb(234, 210, 172)', opacity=0.8), row=3, col=1)
fig.add_trace(go.Box(x=game['JP_Sales'], name='JP_Sales', marker_color='rgb(234, 210, 172)'), row=4, col=1)

fig.add_trace(go.Histogram(x=game['Other_Sales'], nbinsx=10, name='Other_Sales', marker_color='rgb(156, 175, 183)', opacity=0.8), row=3, col=2)
fig.add_trace(go.Box(x=game['Other_Sales'], name='Other_Sales', marker_color='rgb(156, 175, 183)'), row=4, col=2)

fig.add_trace(go.Histogram(x=game['Global_Sales'], nbinsx=10, name='Global_Sales', marker_color='rgb(66, 129, 164)', opacity=0.8), row=5, col=1)
fig.add_trace(go.Box(x=game['Global_Sales'], name='Global_Sales', marker_color='rgb(66, 129, 164)'), row=6, col=1)

fig.update_layout(template="plotly_white",
    autosize=False,
    width=1200,
    height=1000,
    margin=dict(
        l=50,
        r=50,
        b=100,
        t=100))

fig.show()

Looks like the vast majority of values lie in the lowest bin. The box plots show exactly where the outliers are, which is why the histograms have no visible shape. Let's take the outliers out of the analysis for this section to see if we can get a more readable distribution.

In [None]:
woutliers = game[game['Global_Sales']<0.5]
woutliers[numbers].hist(figsize=(20,10), color='#aaf0d1', edgecolor='white')

plt.show()

game[numbers].describe()

- The graphs without the outliers show the shape of the distribution. Most of the values are indeed at the low end, and the counts fall off from there.
- All of the numeric features have a similar descending shape.
- The box plots show how rich in outliers the dataset is. We are not going to drop the outliers, since they could tell an interesting success story about some games. Nevertheless, for the machine learning model I might drop the extreme values to get a more realistic and perhaps more accurate model.
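
The cell above uses a fixed cutoff (`Global_Sales < 0.5`). As an alternative, here is a minimal sketch of the usual 1.5 × IQR box-plot rule on made-up numbers. The series `sales_toy` is hypothetical, and plotly may compute quartiles slightly differently from pandas:

```python
import pandas as pd

# Hypothetical sales figures in millions; not taken from the dataset
sales_toy = pd.Series([0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 8.0])

# Quartiles and interquartile range
q1 = sales_toy.quantile(0.25)
q3 = sales_toy.quantile(0.75)
iqr = q3 - q1

# Values above this fence are flagged as outliers
upper_fence = q3 + 1.5 * iqr

outliers = sales_toy[sales_toy > upper_fence]
print(upper_fence, outliers.tolist())
```

A data-driven fence like this adapts to each column, whereas the fixed 0.5 cutoff is simpler to read and explain.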

## 2.3. Date

In [None]:
fig = px.histogram(game, x="Year", template="plotly_white", color_discrete_sequence=["rgb(127,232,186)"]).update_xaxes(categoryorder="total descending")
fig.show()

game['Year'].describe()

- The graph and the statistics show that most games in this dataset were released between 2003 and 2010.

# Data analysis

# 3. Popularity Through the Years by Platform
A good measure of popularity is the total sales that a platform generated. Therefore, in this section we are going to look at how each platform has been gaining or losing popularity (sales) through the years.

To graph the dataframe with plotly, we first need to drop the NaN values from the 'Year' column.

In [None]:
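# Drop rows without a year; copy so the assignment below doesn't touch `game`
sales = game.dropna(subset=['Year']).copy()
# Year is stored as float (e.g. 2006.0); cast to int first for clean slider labels
sales['Year'] = sales['Year'].astype(int).astype(str)
sales = sales.sort_values(by=['Year'])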

# Build zero-sales rows for every (Platform, Year) pair so that each animation frame contains all platforms
platform = list(sales['Platform'].value_counts().index)
year = list(sales['Year'].value_counts().sort_index().index)


d = {}
p = []
y = []


for i in platform:
    for j in year:
        p.append(i)
        y.append(j)

d['Platform'] = p
d['Year'] = y

scratch = pd.DataFrame(d)
scratch['Global_Sales'] = 0


sales = sales.loc[:,['Platform', 'Year', 'Global_Sales']]

final = pd.concat([sales,scratch])
final = final.sort_values(by=['Year'])

finalx = pd.DataFrame(final.groupby(['Platform', 'Year'])['Global_Sales'].sum())
finalx = finalx.reset_index()

# Plotly figure with slide
fig = px.bar(
    data_frame=finalx,
    y='Global_Sales',
    x='Platform',
    animation_frame='Year', template="plotly_white", color_discrete_sequence=['rgb(254, 147, 140)']).update_xaxes(categoryorder="total descending")

fig.show()
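
The zero-filled grid built by the nested loops above can also be sketched with a `MultiIndex` and `reindex`. This is only an alternative under assumptions: `toy`, `totals`, `grid` and `full` are hypothetical names and the rows are made up.

```python
import pandas as pd

# Hypothetical toy rows standing in for `sales`
toy = pd.DataFrame({
    'Platform': ['PS2', 'PS2', 'Wii', 'Wii'],
    'Year': ['2005', '2006', '2006', '2006'],
    'Global_Sales': [1.0, 2.0, 3.0, 0.5],
})

# Total sales for each (Platform, Year) pair that actually occurs
totals = toy.groupby(['Platform', 'Year'])['Global_Sales'].sum()

# Every (Platform, Year) pair, including the ones with no rows
grid = pd.MultiIndex.from_product(
    [sorted(toy['Platform'].unique()), sorted(toy['Year'].unique())],
    names=['Platform', 'Year'],
)

# Missing combinations get 0 instead of being absent from the frame
full = totals.reindex(grid, fill_value=0).reset_index()
print(full)
```

The resulting frame has the same shape as `finalx` and could be passed to `px.bar` with `animation_frame='Year'` in the same way.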

Feel free to explore the chart, and remember to press the 'Autoscale' button as you advance to get a clearer view.
From the chart we can see how certain platforms have outperformed the others as the years pass. Some platforms that dominated an era are:

- SNES: 1983-1994
- PS: 1995-2000
- PS2: 2001-2005
- Wii: 2006-2009
- PS3: 2011-2013
- PS4: 2014-2017

The PlayStation consoles must be doing something right, since they keep dominating the market. Will they keep performing the same when the PS5 is released?

We see that the PlayStation consoles together dominate the market, but which single console is the absolute winner in sales across all these years? Let's find out.

In [None]:
best =  pd.DataFrame(game.groupby('Platform')['Global_Sales'].sum())
best = best.reset_index()

fig = px.bar(
  data_frame=best,
  y='Global_Sales',
  x='Platform',
    template="plotly_white", color_discrete_sequence=['rgb(254, 147, 140)']).update_xaxes(categoryorder="total descending")

fig.show()

- The winner is the PS2, with staggering total sales of 1,255.64 million copies.
- Although the X360 held first place in only one year (2010), it still finished second overall. The 2010s seem to have been an excellent period for selling video games.

Let's continue with how each 'Genre' performed through the years.

# 4. Popularity Through the Years by Genre
Since the logic is the same (category vs. sales), we are going to use the same code as in section 3.

Let's analyze 'Genre'

In [None]:
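# Drop rows without a year; copy so the assignment below doesn't touch `game`
sales = game.dropna(subset=['Year']).copy()
# Year is stored as float (e.g. 2006.0); cast to int first for clean slider labels
sales['Year'] = sales['Year'].astype(int).astype(str)
sales = sales.sort_values(by=['Year'])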

# Build zero-sales rows for every (Genre, Year) pair so that each animation frame contains all genres
genre = list(sales['Genre'].value_counts().index)
year = list(sales['Year'].value_counts().sort_index().index)


d = {}
g = []
y = []


for i in genre:
    for j in year:
        g.append(i)
        y.append(j)

d['Genre'] = g
d['Year'] = y

scratch = pd.DataFrame(d)
scratch['Global_Sales'] = 0


sales = sales.loc[:,['Genre', 'Year', 'Global_Sales']]

final = pd.concat([sales,scratch])
final = final.sort_values(by=['Year'])

finalx = pd.DataFrame(final.groupby(['Genre', 'Year'])['Global_Sales'].sum())
finalx = finalx.reset_index()

# Plotly figure with slide
fig = px.bar(
  data_frame=finalx,
  y='Global_Sales',
  x='Genre',
  animation_frame='Year', template="plotly_white", color_discrete_sequence=['rgb(66, 129, 164)']).update_xaxes(categoryorder="total descending")

fig.show()

- Until 2001, the genres trade first place, with none of them dominating for several years in a row.
- From 2001 on, the 'Action' genre dominated the market until 2017, when 'Role-Playing' took its place.

What is the absolute winner in sales?

In [None]:
best =  pd.DataFrame(game.groupby('Genre')['Global_Sales'].sum())
best = best.reset_index()

fig = px.bar(
  data_frame=best,
  y='Global_Sales',
  x='Genre',
    template="plotly_white", color_discrete_sequence=['rgb(66, 129, 164)']).update_xaxes(categoryorder="total descending")

fig.show()

No surprise that Action is the winner, with 1,751 million copies in total sales.

# 5 Global Sales Growth

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(23,12))
fig.tight_layout(pad=8)

line = game.groupby(['Year'])['Global_Sales'].sum()
line = pd.DataFrame(line)

group = game.groupby(['Year'])['Global_Sales'].sum()
group = pd.DataFrame(group)
group['Global_Sales'] = np.round(group['Global_Sales'].pct_change() * 100, 2)

norm = DivergingNorm(vmin=group['Global_Sales'].min(), vcenter=0, vmax=group['Global_Sales'].max())
colors = [plt.cm.RdYlGn(norm(c)) for c in group['Global_Sales']]

sns.lineplot(x=line.index, y=line['Global_Sales'], data=line, ax=ax[0])
ax[0].tick_params(labelrotation=90)

sns.barplot(x=group.index, y=group['Global_Sales'], data=group, palette=colors, ax=ax[1])
ax[1].tick_params(labelrotation=90)

plt.show()

display(line.iloc[19:,:].T)

- The graph shows video game sales increasing from 1980 until 2008, when they reached their peak at 678.9 million copies.
- From 2008 until today, sales have been suspiciously decreasing. Call it intuition, but I guess the data from 2016 until 2020 is not complete.
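
The growth bars above come from `pct_change`, which compares each year with the one before it. A minimal sketch with made-up totals (`yearly` is hypothetical, not the real figures):

```python
import pandas as pd

# Hypothetical yearly totals in millions; not the real figures
yearly = pd.Series([100.0, 150.0, 120.0], index=[2006, 2007, 2008])

# (current - previous) / previous * 100; the first year has no predecessor, so it is NaN
growth = (yearly.pct_change() * 100).round(2)
print(growth)
```

The NaN for the first year is the reason the first bar in the growth chart is empty.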

# 6. Clusters

Let's apply k-means. My intention is to divide the dataset into low-sales, medium-sales and high-sales groups in order to inspect them and build a sales profile for each.

In [None]:
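# Use only the five sales columns as features (safe to re-run after 'Cluster' is added)
X = game.loc[:, 'NA_Sales':'Global_Sales'].values

# Fix the seed so the clustering is reproducible; label numbers are still arbitrary
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
kmeans.fit(X)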

game['Cluster'] = kmeans.labels_
game['Cluster'] = game['Cluster'].astype(str)

fig = px.scatter(data_frame=game,
    x='EU_Sales',
    y='Global_Sales',
    color='Cluster',
    template="plotly_white",
    color_discrete_map={'0':"rgb(219, 58, 52)", '2':"rgb(255, 200, 87)", '1':"rgb(8, 76, 97)"},
     hover_name='Name',
    hover_data=['Platform', 'Publisher', 'Year'])
fig.show()

- Cluster 0 = low sales
- Cluster 1 = high sales
- Cluster 2 = medium sales

Let's rename the clusters in the dataset so that we don't get confused later.

In [None]:
def cat(cluster):
    if cluster == '0':
        return 'low'
    if cluster == '1':
        return 'high'
    if cluster == '2':
        return 'medium'

game['Cluster'] = game['Cluster'].apply(cat)
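
K-means assigns label numbers arbitrarily, so the hard-coded 0/1/2 mapping above can point at different groups on another run. One way around this, sketched here on made-up labels (`toy`, `order` and `names` are hypothetical), is to name the clusters by their mean global sales:

```python
import pandas as pd

# Hypothetical labels as k-means might return them; the numbering is arbitrary
toy = pd.DataFrame({
    'Global_Sales': [0.2, 0.3, 5.0, 6.0, 40.0],
    'Cluster': [2, 2, 0, 0, 1],
})

# Order the cluster ids by their mean global sales, lowest first
order = toy.groupby('Cluster')['Global_Sales'].mean().sort_values().index

# Attach the names in that order: lowest mean -> 'low', highest mean -> 'high'
names = dict(zip(order, ['low', 'medium', 'high']))

toy['Cluster'] = toy['Cluster'].map(names)
print(names)
```

With this approach the names follow the data rather than whatever numbers the algorithm happened to hand out.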


In [None]:
# Pass the column by keyword; newer seaborn versions no longer accept it positionally
sns.countplot(x='Cluster', data=game, palette='Set3')
plt.show()

Most of the data is 'low', but that's normal. Let's look at each category.

## 6.1. Low-Sales

In [None]:
low = game[game['Cluster']=='low']


print("5 most frequent Platform in the low category are: {}".format(', '.join(low['Platform'].value_counts().index[:5])))
print("5 most frequent Genre in the low category are: {}".format(', '.join(low['Genre'].value_counts().index[:5])))
print("5 most frequent Publisher in the low category are: {}".format(', '.join(low['Publisher'].value_counts().index[:5])))

display(low.describe())

## 6.2. Medium-Sales

In [None]:
medium = game[game['Cluster']=='medium']


print("5 most frequent Platform in the medium category are: {}".format(', '.join(medium['Platform'].value_counts().index[:5])))
print("5 most frequent Genre in the medium category are: {}".format(', '.join(medium['Genre'].value_counts().index[:5])))
print("5 most frequent Publisher in the medium category are: {}".format(', '.join(medium['Publisher'].value_counts().index[:5])))

display(medium.describe())

## 6.3. High-Sales

In [None]:
high = game[game['Cluster']=='high']


print("5 most frequent Platform in the high category are: {}".format(', '.join(high['Platform'].value_counts().index[:5])))
print("5 most frequent Genre in the high category are: {}".format(', '.join(high['Genre'].value_counts().index[:5])))
print("5 most frequent Publisher in the high category are: {}".format(', '.join(high['Publisher'].value_counts().index[:5])))

display(high.describe())

# Bonus.
## What's the best-selling game ever?

In [None]:
# .iloc[0] takes the first row after sorting; plain [0] would look up index label 0 instead
best = game.sort_values(by=['Global_Sales'], ascending=False)['Name'].iloc[0]
print("Best selling game ever is: {}".format(best))
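
On a sorted Series, `[0]` looks up the index label 0, while `.iloc[0]` takes the first row; the two only agree when the data already happens to be ordered by sales. A small sketch with made-up rows (`toy` is hypothetical) shows the difference, plus an `idxmax` variant that needs no sorting:

```python
import pandas as pd

# Hypothetical rows, deliberately not sorted by sales
toy = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C'],
    'Global_Sales': [1.5, 9.0, 4.2],
})

ranked = toy.sort_values(by='Global_Sales', ascending=False)

# Label-based lookup: returns the row whose index label is 0, whatever the order
by_label = ranked['Name'][0]

# Position-based lookup: returns the first row after sorting
by_position = ranked['Name'].iloc[0]

# Without sorting: find the label of the largest value, then read the name
via_idxmax = toy.loc[toy['Global_Sales'].idxmax(), 'Name']

print(by_label, by_position, via_idxmax)
```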

Hope you enjoyed the notebook and learned something from it! I can guarantee I got hooked by this fun dataset. I would appreciate some honest feedback in the comments on ways I could improve, or any thoughts you have about the dataset. If you like it, please upvote (I'm really close to becoming a notebook expert :D).

Cheers and happy coding!