## Video Games Sales Analysis and Visualization
<b> The dataset I worked on contains a list of video games with their sales in different regions. At the beginning of this project, I cleaned the dataset using Pandas, as there were some null values and incomplete information. Then I extracted some key information from the dataset, for example the top games, the top genres, and the highest sales. The goal of this project is to show different kinds of visualization using the Python libraries Matplotlib and Seaborn and to understand what the dataset tells us. </b> <br><br>  <b>N.B.: This dataset does not have adequate information from 2017 onwards.</b>  <br><br>[Dataset link](https://www.kaggle.com/gregorut/videogamesales) <br> <b>Columns</b>
<ul>
<li>Name - The game's name</li>
<li>Platform - Platform of the game's release (e.g. PC, PS4)</li>
<li>Year - Year of the game's release</li>
    <li>Genre - Genre of the game</li>
    <li>Publisher - Publisher of the game</li>
    <li>NA_Sales - Sales in North America (in millions)</li>
    <li>EU_Sales - Sales in Europe (in millions)</li>
    <li>JP_Sales - Sales in Japan (in millions)</li>
    <li>Other_Sales - Sales in the rest of the world (in millions)</li>
    <li>Global_Sales - Total worldwide sales.</li>
</ul>

# Importing Necessary Libraries

In [None]:
%autosave 10
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as st

        
%matplotlib inline

# Importing Data

In [None]:
filename = "/kaggle/input/videogamesales/vgsales.csv"
df = pd.read_csv(filename)

In [None]:
df.head()

# Overview of the Dataset

In [None]:
df.info()

In [None]:
df.shape

-    Since there are some null values, let's express them as percentages, which are easier to interpret.

In [None]:
nullvalues_percentage = df.isna().sum()*100 / df.shape[0]
nullvalues_percentage

-    Only 1.6% of the Year values and 0.3% of the Publisher values are null. If we delete these rows, we will not lose much data.

In [None]:
df.dropna(inplace = True)

In [None]:
nullvalues_percentage = df.isna().sum()*100 / df.shape[0]
nullvalues_percentage
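
The null-share check and the drop above can be sketched on a small toy frame. This is only an illustration under assumed values: the frame `toy` and its contents are hypothetical and are not taken from the dataset. It also shows one way to estimate how many rows `dropna()` would remove before actually removing them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["X", "Y", None, "Z"],
})

# isna().mean() gives the fraction of missing values per column
null_pct = toy.isna().mean() * 100

# Share of rows with at least one null, i.e. the rows dropna() would remove
rows_lost_pct = toy.isna().any(axis=1).mean() * 100

# Non-inplace drop, so the original frame stays available for comparison
cleaned = toy.dropna()

print(null_pct)
print(rows_lost_pct, len(cleaned))
```

Checking the row-level share first is useful because per-column percentages can understate the loss when the nulls fall in different rows.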

# Off-topic: 
### If you want to use a different theme for Jupyter Notebook, you can follow this link: [theme](https://towardsdatascience.com/optimizing-jupyter-notebook-tips-tricks-and-nbextensions-26d75d502663)

In [None]:
# Uncomment the below line and install jupyterthemes. 
#!pip install jupyterthemes

In [None]:
# To list all available themes (requires jupyterthemes to be installed)
!jt -l

In [None]:
# To apply a theme (uncomment the line below)

#!jt -t chesterish

In [None]:
# To go back to the default theme 

#!jt -r

# Data types of each column

In [None]:
df.dtypes

-    We can see that Year is stored as a float. Since the rows with null values have been dropped, we can convert it to int.

In [None]:
convert = {'Year':int}

df = df.astype(convert)

In [None]:
df.head(2)
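
The conversion above works only because the null years were dropped first. As a minimal sketch on a hypothetical toy series (not the real column), a plain `int` cast fails while `NaN` is present, whereas the nullable `Int64` dtype is one alternative if the missing years had to be kept:

```python
import numpy as np
import pandas as pd

# Hypothetical year column that still contains a missing value
years = pd.Series([2006.0, np.nan, 2008.0], name="Year")

# A plain int cast raises while NaN is present
try:
    years.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps the missing entry as <NA>
nullable_years = years.astype("Int64")

print(cast_failed)
print(nullable_years)
```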

-   Let's check the number of entries in each year, starting with the years that have the fewest

In [None]:
df['Year'].value_counts().sort_values(ascending = True).head()

- We see there are only three entries for 2017 and one entry for 2020. This is not enough to get the real picture, so we can ignore these entries.

In [None]:
# Dropping rows with the year 2017 or 2020
# (drop with inplace=True returns None, so the result is not assigned)
df.drop(df[df['Year'].isin([2017, 2020])].index, inplace=True)
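
The same idea can be written without hard-coding the years. The sketch below is an assumption-laden illustration: the toy frame and the threshold `min_entries` are hypothetical, and the notebook itself simply drops 2017 and 2020.

```python
import pandas as pd

# Hypothetical toy data: two well-covered years and two sparse ones
toy = pd.DataFrame({"Year": [2015] * 5 + [2016] * 4 + [2017] * 2 + [2020]})

min_entries = 3  # hypothetical cut-off for "too few entries"

# Count entries per year and collect the years below the cut-off
counts = toy["Year"].value_counts()
sparse_years = counts[counts < min_entries].index

# Keep only rows whose year is not in the sparse set
filtered = toy[~toy["Year"].isin(sparse_years)]

print(sorted(sparse_years))
print(filtered["Year"].value_counts())
```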

#### Let's find out the unique values of Genre and how often each occurs, as most of the tasks will revolve around this column

In [None]:
df['Genre'].unique()

In [None]:
# Figure style
plt.figure(figsize=[11,5], dpi= 95)
sns.set_style(style= 'whitegrid')

# Plot
sns.countplot(x=df['Genre'])

# Axis and title label
plt.xlabel('Genre',fontsize=12, color='black')
plt.ylabel('Count',fontsize=12, color='black')
plt.title('The Number of Games by Genre', fontsize= 15, color= 'blue')

plt.tight_layout()

-   We see that Action is the genre with the highest number of produced games

# Number of produced games by year

In [None]:
plt.figure(figsize=[11,4], dpi= 95)

sns.countplot(x=df['Year'], palette='viridis', order= df['Year'].value_counts().index)

plt.xticks(rotation = 90)
plt.title('Number of Produced Games in Each Year', fontsize= 15, color= 'blue')

# Number of released games by genre in each year.

In [None]:
plt.figure(figsize=[14,4], dpi= 95)

sns.countplot(x='Year',data=df, hue= 'Genre', palette="dark")

plt.xticks(rotation=90)

-   This looks rather cluttered. We can redraw the graph for only the five years with the most released games.  

In [None]:
plt.figure(figsize=[14,5], dpi= 95)

sns.countplot(x='Year',data=df, hue= 'Genre', palette="dark", order= df['Year'].value_counts().head().index, 
              hue_order=df['Genre'].value_counts().index)

# Axis label
plt.xlabel('Year',fontsize=12, color='black')
plt.ylabel('Count',fontsize=12, color='black')

# Axis ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Title
plt.title('Number of Released Games by Genre in Each Year', fontsize= 15, color= 'blue')

# Legend
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0, fontsize= 'x-large')

# Evolution in number of games for each genre.

In [None]:
# Number of entries for each genre to find out the number of games
genre_count_by_year = df.groupby(by= ['Year','Genre']).count()

# Unstack method
genre_count_by_year = genre_count_by_year['Rank'].unstack()
genre_count_by_year.head(3)

-  There are many NaN values. We can reasonably assume that a NaN means no game of that genre was recorded for that year, so we will replace the NaN values with zero. We will also reduce the number of years and keep only 1995 onward.

In [None]:
# Replacing the NaN values with 0
genre_count_by_year.fillna(0, inplace= True)

# Dropping the rows for the years 1980 to 1994 inclusive
genre_count_by_year.drop(range(1980, 1995), axis= 0, inplace= True)

In [None]:
genre_count_by_year.head(2)
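
As a compact alternative to the count, unstack and fill steps above, the same year-by-genre table can be sketched with `size()` and `unstack(fill_value=0)`. The toy frame below is hypothetical and only illustrates the shape of the result.

```python
import pandas as pd

# Hypothetical toy data with one row per game
toy = pd.DataFrame({
    "Year": [1995, 1995, 1995, 1996, 1996],
    "Genre": ["Action", "Action", "Sports", "Sports", "Puzzle"],
})

# size() counts rows per (Year, Genre) pair;
# unstack(fill_value=0) moves genres to columns and writes 0 for missing pairs
counts = toy.groupby(["Year", "Genre"]).size().unstack(fill_value=0)

# A label-based slice keeps a year range without an explicit drop
recent = counts.loc[1996:]

print(counts)
print(recent)
```

Slicing with `.loc` avoids listing every year to remove, which also sidesteps a `KeyError` if one of the listed years happens to be absent from the index.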

In [None]:
# Evolution of Action games only
# Figure style
sns.set_style('darkgrid')
fig_dpi=90
plt.figure(figsize=(14,6), dpi=fig_dpi)

# Line plot
for column in genre_count_by_year.drop('Action', axis=1):
    plt.plot(genre_count_by_year[column], marker= '', color= 'grey', linewidth=1, alpha=0.4)
    
# The highlighted plot with specific genre
plt.plot(genre_count_by_year['Action'], marker= '', color= 'green', linewidth=4, alpha=0.7)

# Increasing the xlimit as we need to add annotation
plt.xlim(1994,2019)

# Adding annotation
plt.text(2016.5, genre_count_by_year['Action'].iloc[-1], 'Action', horizontalalignment='left', size='large', color='green')

# Titles and axis label
plt.title("Evolution of Action games VS others", loc='left', fontsize=12, fontweight=0, color='green')
plt.xlabel("Year")
plt.ylabel("Number of Games")

In [None]:
# Evolution of each genre vs the others
# Figure style
sns.set_style('darkgrid')  # the 'seaborn-darkgrid' matplotlib style name was removed in newer matplotlib releases
my_dpi=96
plt.figure(figsize=(1000/my_dpi, 900/my_dpi), dpi=my_dpi)

# Color palette
palette = plt.get_cmap('tab10')

# Line plot
num = 0
for column in genre_count_by_year:
    num += 1
    
#     if num==10:
#         break
    
    plt.subplot(4,3, num)
    
    # Generating all lineplot
    for item in genre_count_by_year.drop(column, axis= 1):
        plt.plot(genre_count_by_year[item], marker='', color='grey', linewidth=1.2, alpha=0.3)
    
    # Line plot for the highlighted genre
    plt.plot(genre_count_by_year[column], marker='', color= palette(num), linewidth=2.4, alpha=0.9, label=column)
    
    # x-axis limit for subplot
    plt.xlim(1994,2017)
    
    # Sub plot title
    plt.title(column, loc='left', fontsize=12, fontweight=0, color= palette(num))
    
    # Axis label for the figure
    if column == 'Sports':
        plt.xlabel('Year(1995-2016)', fontsize= 15)
    if column == 'Misc':
        plt.ylabel('Number of Games', fontsize= 15)
    
    
# Figure title    
plt.suptitle('Evolution of Each Genre Compared to Others', fontsize=15, fontweight=0, color='blue', y= 1.03)


plt.tight_layout()

### Before figuring out which year had the highest global sales, let's look at the publishers

# Top ten publishers by number of released games

In [None]:
publisher = df['Publisher'].value_counts().sort_values(ascending= False)[:10]
publisher

In [None]:
# we need this to add text in the graph
x = list(publisher)
y = list(publisher.index)

#plotting figure
plt.figure(figsize= [15,5])
fig = publisher.plot(kind='bar', color = 'green')

# labeling the bar
style = dict(ha= 'center', size= 13, color = 'black')

x_position = 0
for i in range(0,len(x)):
    fig.text(x_position, x[i]+2, str(x[i]), **style)
    x_position += 1

# Styling axis
plt.ylabel('Count',fontsize=14, color='black')
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)

# Title
plt.title('Top Ten Publisher in Number of Produced Games', fontsize= 15, color= 'blue')
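
The manual labeling loop above can also be sketched with `Axes.bar_label`, which is available in matplotlib 3.4 and later. The publisher names and counts below are hypothetical, and the `Agg` backend line is only there so the sketch runs without a display; inside a notebook it can be omitted.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical publisher counts standing in for the real top-ten series
publisher_counts = pd.Series({"Pub A": 120, "Pub B": 95, "Pub C": 60})

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(publisher_counts.index, publisher_counts.values, color="green")

# bar_label writes each bar's height above it (requires matplotlib 3.4+)
labels = ax.bar_label(bars, padding=2)
ax.set_ylabel("Count")

label_texts = [text.get_text() for text in labels]
print(label_texts)
```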

# Top three genres of these publishers

In [None]:
# Index of publisher dataframe
top_ten_publisher = list(publisher.index)
print(top_ten_publisher)

In [None]:
# Creating a Dataset
g_1 = df.groupby(by = ['Publisher','Genre'])

# Dictionary to make another dataframe
dic = {'Publisher':[],'Genre':[], 'Count':[]}

for item in top_ten_publisher:
    sub_publisher = g_1.count().loc[item,'Year'].sort_values(ascending = False).head(3)
    dic['Publisher'].extend((item, item, item))  # to keep the publisher name next to one another
    dic['Genre'] += list(sub_publisher.index)    # Appending genre
    dic['Count'] += list(sub_publisher.values)   # Number of games for each genre 

<b>The cell above might be difficult to understand, so try running the lines below in separate cells. In the 'by' argument we could use any column: we only need to count rows, so any column works.</b> <br><br>
<code>g_1.count().loc['Electronic Arts'].sort_values(by = 'Year', ascending = False).head()
    
#run in a different cell.
g_1.count().head(20)
</code>

In [None]:
# Creating a dataset of the top three genres for each publisher.
sub_df = pd.DataFrame(dic) 
sub_df.head(5)
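
The loop that builds the dictionary can also be expressed with a sort followed by a grouped `head`. This is a sketch on hypothetical toy data (two made-up publishers, top two genres instead of three), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy data: one row per game
toy = pd.DataFrame({
    "Publisher": ["EA"] * 6 + ["Ubisoft"] * 6,
    "Genre": ["Sports", "Sports", "Sports", "Action", "Action", "Racing",
              "Action", "Action", "Action", "Puzzle", "Puzzle", "Misc"],
})

# Count games per (Publisher, Genre) pair
counts = toy.groupby(["Publisher", "Genre"]).size().reset_index(name="Count")

# Sort by count within each publisher, then keep the first rows of each group
top_genres = (
    counts.sort_values(["Publisher", "Count"], ascending=[True, False])
          .groupby("Publisher")
          .head(2)
          .reset_index(drop=True)
)

print(top_genres)
```

Unlike the fixed `(item, item, item)` extension, this form does not assume that every publisher has at least three genres.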

In [None]:
#bar height
height = list(sub_df['Count'])

# xtick label 
bars = list(sub_df['Genre'])

# index position
y_pos = np.arange(len(bars))

#color for all the bar values
color = []
name = ['blue','red','orange','green','purple','violet','olive','gray','brown','cyan']
for item in name:
    color.extend((item, item, item))

# Ticks for second axis
label = list(sub_df['Publisher']) 
ax_label = []
for item in label:
    if item not in ax_label:
        ax_label.append(item)

In [None]:
# Figure Style
sns.set_style(style='dark')
plt.figure(figsize=[17,5])

# Plot
plt.bar(y_pos, height, color=color)


plt.xticks(y_pos, bars, rotation = 90, fontsize=14)  # ticks set the usual pyplot way
plt.yticks(fontsize= 14)

# Second axis
axes1 = plt.gca()
axes2 = axes1.twiny()                 # a second x-axis needs the object-oriented interface

axes2.set_xticks(list(range(0,30,3))) # indexing for xticks.
axes2.set_xticklabels(ax_label, fontsize= 14, rotation= 90)

# Axis label
axes1.set_xlabel("Genre")
axes2.set_xlabel("Publisher")

# Axis limit
axes1.set_xlim(-.5,29.5)
axes2.set_xlim(0,30)

# Title
plt.title('Top Three Genres of Publishers', fontsize= 15, color= 'blue')
plt.show()

# Highest global sales by year

In [None]:
# Figure Size
plt.figure(figsize=[17,5], dpi= 95)

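# Sort the bars in descending order of total global sales
# (the column is selected before summing so that text columns are not aggregated)
order = df.groupby(by = 'Year')['Global_Sales'].sum().sort_values(ascending= False).index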

# plot
sns.barplot(x='Year',y='Global_Sales',data=df,estimator= sum, palette="dark", order= order, ci= 0)

# Axis ticks
plt.xticks(fontsize=12, rotation= 90)
plt.yticks(fontsize=12)

# Title
plt.title('Highest Global Sales by Year', fontsize= 16, color= 'blue')

plt.tight_layout()
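
The ordering used in the plot above can be checked numerically. The following is a minimal sketch on hypothetical toy values; it shows how the yearly totals are computed and how the year with the highest total can be read off directly.

```python
import pandas as pd

# Hypothetical toy data: one row per game, sales in millions
toy = pd.DataFrame({
    "Year": [2007, 2007, 2008, 2008, 2009],
    "Global_Sales": [1.5, 2.0, 4.0, 0.5, 3.0],
})

# Total global sales per year, largest first
yearly_sales = (
    toy.groupby("Year")["Global_Sales"].sum().sort_values(ascending=False)
)

# idxmax returns the index label (the year) of the largest total
best_year = yearly_sales.idxmax()

print(yearly_sales)
print(best_year)
```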

### Let's draw a line plot of sales to see how they have changed over the years in different regions

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=[17,5], dpi= 95)

tick = ['NA_Sales', 'EU_Sales','JP_Sales', 'Other_Sales']

for item in tick:
    sns.lineplot(x= 'Year', y= df[item], data= df, estimator= 'sum', label= item, ci=None, marker='o', linewidth= 3)

# Axis label and ticks    
plt.xlabel('Year',fontsize=14, color='black')
plt.ylabel('Sales (in millions)',fontsize=14, color='black')
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

# Title and legend
plt.title('Sales in Different Region', fontsize= 18, color= 'blue')
plt.legend(bbox_to_anchor=(1, 1), loc=2, borderaxespad=0, fontsize= 'x-large')

-   We can see that sales in the different regions were fairly close to each other until about 1995, after which North American sales rose sharply compared to the others. This might be related to the growth of the tech industry there.
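
One way to examine the observation above numerically is to aggregate sales per year and region and then look at North America's share of each year's total. The sketch below uses hypothetical toy values only, so it says nothing about the real dataset; it just shows the computation.

```python
import pandas as pd

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Hypothetical toy data: two games in each of two years, sales in millions
toy = pd.DataFrame({
    "Year": [1994, 1994, 1996, 1996],
    "NA_Sales": [1.0, 1.0, 4.0, 2.0],
    "EU_Sales": [1.0, 0.5, 1.0, 1.0],
    "JP_Sales": [0.5, 1.0, 1.0, 0.5],
    "Other_Sales": [0.5, 0.5, 0.25, 0.25],
})

# Yearly totals per region (one row per year, one column per region)
regional = toy.groupby("Year")[regions].sum()

# North America's share of each year's total across the four regions
na_share = regional["NA_Sales"] / regional.sum(axis=1)

print(regional)
print(na_share)
```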

# Sales of the top three genres in different regions

In [None]:
# Top three genres by number of produced games.
df['Genre'].value_counts().head(3)

In [None]:
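# Dataframe for sales of different genres in different regions
# (the sales columns are selected before summing so that text columns are not aggregated)
sales_by_genre = df.groupby(by= ['Year','Genre'])[['NA_Sales','EU_Sales','JP_Sales','Other_Sales']].sum().unstack()

# Replacing NaN values
sales_by_genre.fillna(0, inplace= True)

# Dropping the rows for the years 1980 to 1994 inclusive, so the data starts at 1995 as in the axis label below
sales_by_genre.drop(range(1980,1995), inplace= True)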

In [None]:
sales_by_genre.head(2)

In [None]:
# Top 3 Genre index list
index = list(df['Genre'].value_counts().head(3).index)


# Figure style
sns.set_style('darkgrid')
plt.figure(figsize=[14,6], dpi=95)

# Color palette
palette = plt.get_cmap('tab20')

# Multi plot
num = 0
col_pal = 0
for name in index:
    num += 1
    plt.subplot(1,3,num)
    
    # Using .xs method to get the region sales with specific genre
    data_frame = sales_by_genre.xs(key=name, level=1, axis=1)
    
    # Line plot
    for column in data_frame:
        col_pal += 1
        plt.plot(data_frame[column], marker='', color=palette(col_pal), linewidth=2.4, alpha=0.9, label=column)
    
    # Subplot title, legend and axis ticks
    plt.title(name, loc='left', fontsize=14, fontweight=0, color='red')
    plt.legend(fontsize=12)
    plt.xticks([1995,2000,2005,2010,2015])
    
    # Figure x-axis and y-axis
    if name == 'Sports':
        plt.xlabel('Year(1995-2016)', fontsize= 15)
    if name == 'Action':
        plt.ylabel('Sales (in millions)', fontsize= 15)

# Figure title
plt.suptitle('Sales of Top Three Genres in Different Regions', fontsize= 15, y = 1.05, color= 'blue')

plt.tight_layout()
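
The `.xs` call used in the loop above selects one genre from the second level of the column MultiIndex. The following sketch, on a hypothetical toy frame with only two regions, shows what the unstacked table looks like and what `.xs` returns.

```python
import pandas as pd

# Hypothetical toy data: sales in millions
toy = pd.DataFrame({
    "Year": [1995, 1995, 1996],
    "Genre": ["Action", "Sports", "Action"],
    "NA_Sales": [1.0, 2.0, 3.0],
    "EU_Sales": [0.5, 1.5, 2.5],
})

# Columns become a MultiIndex: level 0 is the region, level 1 is the genre
wide = (
    toy.groupby(["Year", "Genre"])[["NA_Sales", "EU_Sales"]]
       .sum()
       .unstack(fill_value=0)
)

# Cross-section: every region column for the 'Action' genre only
action = wide.xs(key="Action", level=1, axis=1)

print(wide)
print(action)
```

The result has one plain column per region, which is why the plotting loop can iterate over it directly.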


# Heat map for sales of all genres

In [None]:
genre_heatmap = df.groupby(by= 'Genre')[['NA_Sales','EU_Sales','JP_Sales','Other_Sales']].sum()
genre_heatmap.head()

In [None]:
# Fig style
plt.figure(figsize= [8,6], dpi= 99)

# Plot
ax = sns.heatmap(genre_heatmap, annot= True, fmt= '.1f', linecolor= 'white', linewidths= 1.2, cmap= 'gist_earth_r')

# Workaround for a matplotlib version issue (heatmap rows were cut off at the top and bottom, as seen in matplotlib 3.1.1).
# On versions without the issue, the two lines below are unnecessary.
# Getting the current y limit and then resizing
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + .5, top - .5)

# Title
plt.title('Heatmap for Sales of All Genre by Region', color= 'blue')
plt.show()

# Sales percentage in pie chart

In [None]:
plt.figure(figsize=[5,5], dpi= 120)

# parameters
labels = 'NA_Sales','JP_Sales','Other_Sales', 'EU_Sales'
sizes = [df['NA_Sales'].sum(),df['JP_Sales'].sum(), df['Other_Sales'].sum(), df['EU_Sales'].sum()]
colors = ['green','yellowgreen','red','lightskyblue']
explode = (0.1,0,0.1,0) # explode the highest and lowest slice.

# Pie plot
plt.pie(sizes, explode= explode, labels= labels, colors= colors, autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')  # to get a circle shape
plt.title('Sales Percentage by Region', color= 'blue', y= 1.02, x= .45)
plt.show()
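
The percentages that `autopct` prints on the pie chart can be reproduced as plain numbers. This sketch uses hypothetical toy values and only illustrates the arithmetic: each region's total divided by the sum of all regional totals.

```python
import pandas as pd

# Hypothetical toy data: sales in millions
toy = pd.DataFrame({
    "NA_Sales": [4.0, 2.0],
    "EU_Sales": [2.0, 1.0],
    "JP_Sales": [1.0, 0.5],
    "Other_Sales": [1.0, 0.5],
})

# Column totals per region
totals = toy.sum()

# Each region's share of the combined total, in percent
share_pct = (totals / totals.sum() * 100).round(1)

print(share_pct)
```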

### Pairplot

In [None]:
sns.pairplot(df[['Year','NA_Sales','EU_Sales','JP_Sales','Other_Sales']])

# Correlation

In [None]:
# Figure style
plt.figure(figsize=[8,5], dpi= 95)

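# Plot (only numeric columns are kept, since newer pandas versions raise an error when text columns are passed to corr)
ax = sns.heatmap(df.select_dtypes(include='number').corr(), annot= True)

# Workaround for a matplotlib version issue (heatmap rows were cut off at the top and bottom, as seen in matplotlib 3.1.1).
# On versions without the issue, the two lines below are unnecessary.
# Getting the current y limit and then resizing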
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + .5, top - .5)

# Title
plt.title('Correlation Between Columns', fontsize= 14, color= 'blue')
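
The correlation matrix behind the heatmap can be sketched on a toy frame. The values below are hypothetical and chosen so that the expected correlations are obvious; `select_dtypes` is used here as one way to leave out text columns such as the name.

```python
import pandas as pd

# Hypothetical toy data with one text column and three numeric ones
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "NA_Sales": [1.0, 2.0, 3.0, 4.0],
    "EU_Sales": [2.0, 4.0, 6.0, 8.0],   # exactly twice NA_Sales
    "JP_Sales": [4.0, 3.0, 2.0, 1.0],   # decreases as NA_Sales increases
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()

print(corr)
```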

### Thank you so much for spending your valuable time on my project. I would really appreciate it if you gave me some feedback on it.