# Sales information

### Data analysis to be performed on the gaming dataset:
- Top gaming platforms (the platforms with the most games developed for them)
- Top gaming genre
- Year with the most game releases
- Top-performing gaming platform by year
- Top-performing gaming genre


In [1]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("dark")
sns.despine()

In [2]:
# Import the video game sales CSV into a pandas DataFrame.
sales_df = pd.read_csv('video_game_sales.csv')
sales_df.head()


Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales,Critic_Score,Critic_Count,User_Score,User_Count,Developer,Rating
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53,76.0,51.0,8.0,322.0,Nintendo,E
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24,,,,,,
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52,82.0,73.0,8.3,709.0,Nintendo,E
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77,80.0,73.0,8.0,192.0,Nintendo,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37,,,,,,


This dataset contains sales figures and review scores for video games. <br>
Every row corresponds to one game. <br>
For each game it records the Platform, Year of Release, Genre, Publisher, and sales in millions (North America, Europe, Japan, other regions, and global).<br>
Other features include: critic score, critic count, user score, user count, developer, and rating.

Exploring the dataset
- checking for null/NaN values
- checking for anomalies such as erroneous data entries
- checking the number of rows and columns
- dropping columns that are not needed
- replacing or dropping rows with null/NaN values
- checking the summary statistics of the dataset: mean, median, mode, quartiles, etc.
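Several of the checks listed above map onto single pandas calls. The following is a minimal sketch on a small made-up frame; `toy_df` and its values are hypothetical stand-ins, not the real sales data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real sales table.
toy_df = pd.DataFrame({
    "Name": ["Game A", "Game B", None, "Game D"],
    "Year_of_Release": [2006.0, np.nan, 2008.0, 2009.0],
    "Global_Sales": [82.53, 40.24, 35.52, 32.77],
})

# Number of rows and columns.
n_rows, n_cols = toy_df.shape

# Missing values per column.
null_counts = toy_df.isnull().sum()

# Count, mean, std, min, quartiles and max for the numeric columns.
summary = toy_df.describe()

# The median is reported by describe() as the "50%" row,
# but it can also be computed directly.
median_sales = toy_df["Global_Sales"].median()

print(n_rows, n_cols)
print(null_counts)
print(summary)
print(median_sales)
```
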

In [3]:
# Dropping non-essential columns.
drop_columns = ['Critic_Score','Critic_Count','User_Count','User_Score','Developer','Rating']
sales_df.drop(drop_columns,inplace=True,axis = 1)

In [4]:
sales_df.head()

Unnamed: 0,Name,Platform,Year_of_Release,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Wii Sports,Wii,2006.0,Sports,Nintendo,41.36,28.96,3.77,8.45,82.53
1,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.68,12.76,3.79,3.29,35.52
3,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.61,10.93,3.28,2.95,32.77
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


In [5]:
#checking for any null values in the data frame
sales_df.isnull().sum()

Name                 2
Platform             0
Year_of_Release    269
Genre                2
Publisher           54
NA_Sales             0
EU_Sales             0
JP_Sales             0
Other_Sales          0
Global_Sales         0
dtype: int64

The table above shows that some values are missing.<br>
There are two things we can do:
- look up the missing information online, or
- drop the rows with missing values.

I am going to drop the rows with a missing Name or Year_of_Release, <br>
because the remaining data is enough for our exploratory analysis. <br>
As for the NaNs in the Publisher column, I am going to fill them with 'Unknown'.
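To make the trade-off concrete, here is a small sketch of dropping versus filling on a made-up frame. The data and the choice of which columns to drop on are illustrative assumptions, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; None/NaN mark the missing entries.
toy_df = pd.DataFrame({
    "Name": ["Game A", "Game B", None, "Game D"],
    "Year_of_Release": [2006.0, np.nan, 2008.0, 2009.0],
    "Publisher": ["Nintendo", None, "Nintendo", None],
})

# Dropping: remove rows that lack a key field and see how many rows that costs.
dropped = toy_df.dropna(subset=["Name", "Year_of_Release"])
rows_lost = len(toy_df) - len(dropped)

# Filling: keep the row and use a placeholder for a descriptive field.
filled = dropped.copy()
filled["Publisher"] = filled["Publisher"].fillna("Unknown")

print("rows lost by dropping:", rows_lost)
print(filled)
```

Counting the rows lost before committing to `dropna` is a cheap way to check that dropping does not remove a large share of the data.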

In [6]:
# Drop rows with NaN in Year_of_Release or Name.
# NaNs in Publisher are filled with 'Unknown' in a later cell.
sales_df.dropna(how='any', subset=['Year_of_Release','Name'], inplace=True)


In [7]:
sales_df.isnull().sum()

Name                0
Platform            0
Year_of_Release     0
Genre               0
Publisher          32
NA_Sales            0
EU_Sales            0
JP_Sales            0
Other_Sales         0
Global_Sales        0
dtype: int64

In [8]:
# Filling the NaNs in the Publisher column with 'Unknown'
sales_df['Publisher'] = sales_df['Publisher'].fillna('Unknown')

In [9]:
# Confirm that no null values remain.
print(sales_df.isnull().sum())

Name               0
Platform           0
Year_of_Release    0
Genre              0
Publisher          0
NA_Sales           0
EU_Sales           0
JP_Sales           0
Other_Sales        0
Global_Sales       0
dtype: int64


As the output above shows, there are no null values left in the dataframe.
In the earlier previews, the Year_of_Release values are stored as floats (for example 2006.0), so we convert the column to int.
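A likely reason the column was read as float is that it contained NaNs: a plain NumPy integer column cannot hold NaN, so pandas stores such a column as float64. The sketch below shows this on a made-up Series (`years` is hypothetical), and why the missing rows have to be handled before casting.

```python
import numpy as np
import pandas as pd

# Hypothetical year values with one missing entry.
years = pd.Series([2006, 1985, np.nan, 2009])
print(years.dtype)  # float64, because NaN forces a float column

# Casting to int while NaN is still present fails.
try:
    years.astype(int)
    cast_failed = False
except ValueError:
    cast_failed = True
print("cast with NaN failed:", cast_failed)

# After removing the missing values the cast works.
clean_years = years.dropna().astype(int)
print(clean_years.tolist())
```

The exact integer width after the cast (int32 or int64) depends on the platform and library versions.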

In [15]:
# Check the data type of the Year_of_Release column, then cast it to int.
print('Before the conversion',sales_df.Year_of_Release.dtypes)
# astype returns a new object and takes no inplace argument, so assign the result back.
sales_df['Year_of_Release'] = sales_df['Year_of_Release'].astype(int)
print(sales_df.head())
print('After typecasting',sales_df.Year_of_Release.dtypes)

Before the conversion int32
                       Name Platform  Year_of_Release         Genre Publisher  \
0                Wii Sports      Wii             2006        Sports  Nintendo   
1         Super Mario Bros.      NES             1985      Platform  Nintendo   
2            Mario Kart Wii      Wii             2008        Racing  Nintendo   
3         Wii Sports Resort      Wii             2009        Sports  Nintendo   
4  Pokemon Red/Pokemon Blue       GB             1996  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.36     28.96      3.77         8.45         82.53  
1     29.08      3.58      6.81         0.77         40.24  
2     15.68     12.76      3.79         3.29         35.52  
3     15.61     10.93      3.28         2.95         32.77  
4     11.27      8.89     10.22         1.00         31.37  
After typecasting int32


In [17]:
# Now let's check for any erroneous entries in the Year_of_Release column
print('Maximum Year', sales_df.Year_of_Release.max())
print('Minimum Year', sales_df.Year_of_Release.min())

Maximum Year 2020
Minimum Year 1980
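The minimum and maximum only show the extremes. To list the individual rows that fall outside a plausible range, a boolean mask can be used. This is a minimal sketch on made-up data; the bounds `MIN_YEAR` and `MAX_YEAR` are assumptions chosen for illustration and would need to be set from what is known about when the data was collected.

```python
import pandas as pd

# Hypothetical toy data with two suspicious years.
toy_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D"],
    "Year_of_Release": [1985, 2006, 2020, 1890],
})

# Assumed plausible range; adjust to the dataset at hand.
MIN_YEAR, MAX_YEAR = 1970, 2016

# between() is inclusive on both ends by default.
in_range = toy_df["Year_of_Release"].between(MIN_YEAR, MAX_YEAR)

# Rows whose year falls outside the assumed range.
suspect = toy_df[~in_range]
print(suspect)
```
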
