## Video Games Sales Data Exploration & Cleaning

The first part of any data analysis or predictive modeling task is an initial exploration of the data. Even if you collected the data yourself and you already have a list of questions in mind that you want to answer, it is important to explore the data before doing any serious analysis, since oddities in the data can cause bugs and muddle your results. Before exploring deeper questions, you have to answer many simpler ones about the form and quality of data. That said, it is important to go into your initial data exploration with a big picture question in mind since the goal of your analysis should inform how you prepare the data.

In [46]:

# Load in some packages
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import warnings


warnings.filterwarnings("ignore")

# Load datasets

vg_sales_df = pd.read_csv(r"C:\Users\jki\Downloads\games.csv")
vg_sales_df.head(7)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,
5,Tetris,GB,1989.0,Puzzle,23.2,2.26,4.22,0.58,,,
6,New Super Mario Bros.,DS,2006.0,Platform,11.28,9.14,6.5,2.88,89.0,8.5,E


In [47]:
# Let's check the data types
vg_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


Let's check for missing values.

In [35]:
# Count the missing values in each column

missing_values = vg_sales_df.isna().sum()
print(missing_values)

Name                  2
Platform              0
Year_of_Release     269
Genre                 2
NA_sales              0
EU_sales              0
JP_sales              0
Other_sales           0
Critic_Score       8578
User_Score         6701
Rating             6766
dtype: int64
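
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical; on the real data the same expression would be applied to `vg_sales_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for vg_sales_df
toy_df = pd.DataFrame({
    "Name": ["A", "B", None, "D"],
    "Critic_Score": [76.0, np.nan, np.nan, 80.0],
    "NA_sales": [1.0, 2.0, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing values per column;
# multiply by 100 for a percentage and sort with the worst column first
missing_pct = (toy_df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

In the output above, `Critic_Score` is missing in 8578 of 16715 rows, roughly 51%, so dropping on that column alone removes about half of the dataset.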


In [36]:
# Let's remove the rows that have a missing value in any of these columns
vg_sales_df.dropna(
    subset=['Name', 'Year_of_Release', 'Critic_Score', 'User_Score', 'Rating'],
    inplace=True,
)

# Confirm that no missing values remain
missing_values = vg_sales_df.isna().sum()
print(missing_values)


Name               0
Platform           0
Year_of_Release    0
Genre              0
NA_sales           0
EU_sales           0
JP_sales           0
Other_sales        0
Critic_Score       0
User_Score         0
Rating             0
dtype: int64


In [5]:
# Let's check the data types again after dropping rows
vg_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7878 entries, 0 to 16702
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             7878 non-null   object 
 1   Platform         7878 non-null   object 
 2   Year_of_Release  7878 non-null   float64
 3   Genre            7878 non-null   object 
 4   NA_sales         7878 non-null   float64
 5   EU_sales         7878 non-null   float64
 6   JP_sales         7878 non-null   float64
 7   Other_sales      7878 non-null   float64
 8   Critic_Score     7878 non-null   float64
 9   User_Score       7878 non-null   object 
 10  Rating           7878 non-null   object 
dtypes: float64(6), object(5)
memory usage: 738.6+ KB


In [37]:
# Let's change data types
# Note: astype(int) truncates toward zero, so fractional values such as 41.36 become 41
vg_sales_df['NA_sales'] = vg_sales_df['NA_sales'].astype(int)
vg_sales_df['EU_sales'] = vg_sales_df['EU_sales'].astype(int)
vg_sales_df['JP_sales'] = vg_sales_df['JP_sales'].astype(int)
vg_sales_df['Other_sales'] = vg_sales_df['Other_sales'].astype(int)
vg_sales_df['Critic_Score'] = vg_sales_df['Critic_Score'].astype(int)

# Note: pd.to_datetime reads a bare number as nanoseconds since 1970-01-01, not as a year
vg_sales_df['Year_of_Release'] = pd.to_datetime(vg_sales_df['Year_of_Release'])
# Let's check the data types
vg_sales_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7878 entries, 0 to 16702
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Name             7878 non-null   object        
 1   Platform         7878 non-null   object        
 2   Year_of_Release  7878 non-null   datetime64[ns]
 3   Genre            7878 non-null   object        
 4   NA_sales         7878 non-null   int32         
 5   EU_sales         7878 non-null   int32         
 6   JP_sales         7878 non-null   int32         
 7   Other_sales      7878 non-null   int32         
 8   Critic_Score     7878 non-null   int32         
 9   User_Score       7878 non-null   object        
 10  Rating           7878 non-null   object        
dtypes: datetime64[ns](1), int32(5), object(5)
memory usage: 584.7+ KB
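
The following is a minimal sketch of conversions that keep the fractional sales figures and the calendar year intact. The preview after the next cell shows 41.36 reduced to 41 and 2006.0 parsed as `1970-01-01 00:00:00.000002006`. The frame `toy_df`, the columns `Year` and `Release_Date`, and the `"n/a"` placeholder are hypothetical. Treating non-numeric `User_Score` strings as missing is an assumption, since the source does not show which values make that column `object`.

```python
import pandas as pd

# Hypothetical toy frame standing in for vg_sales_df after dropping NaNs
toy_df = pd.DataFrame({
    "Year_of_Release": [2006.0, 2008.0, 2009.0],
    "NA_sales": [41.36, 15.68, 15.61],
    "User_Score": ["8", "8.3", "n/a"],  # "n/a" is a made-up non-numeric value
})

# Sales stay as floats, so nothing is truncated.

# Store the year as a plain integer...
toy_df["Year"] = toy_df["Year_of_Release"].astype(int)

# ...or, if a real datetime is wanted, parse the year with an explicit format
toy_df["Release_Date"] = pd.to_datetime(toy_df["Year"].astype(str), format="%Y")

# Convert scores stored as text to numbers; unparseable strings become NaN
toy_df["User_Score"] = pd.to_numeric(toy_df["User_Score"], errors="coerce")

print(toy_df.dtypes)
```

Keeping the year as an integer is often enough for grouping and plotting, since the dataset records no month or day.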


In [38]:
# Let's compute 'Total Sales' with a mathematical operation (here, the product of the four regional columns)
vg_sales_df['Total Sales'] = vg_sales_df['NA_sales'] * vg_sales_df['EU_sales'] * vg_sales_df['JP_sales'] * vg_sales_df['Other_sales']

vg_sales_df.head(10)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating,Total Sales
0,Wii Sports,Wii,1970-01-01 00:00:00.000002006,Sports,41,28,3,8,76,8.0,E,27552
2,Mario Kart Wii,Wii,1970-01-01 00:00:00.000002008,Racing,15,12,3,3,82,8.3,E,1620
3,Wii Sports Resort,Wii,1970-01-01 00:00:00.000002009,Sports,15,10,3,2,80,8.0,E,900
6,New Super Mario Bros.,DS,1970-01-01 00:00:00.000002006,Platform,11,9,6,2,89,8.5,E,1188
7,Wii Play,Wii,1970-01-01 00:00:00.000002006,Misc,13,9,2,2,58,6.6,E,468
8,New Super Mario Bros. Wii,Wii,1970-01-01 00:00:00.000002009,Platform,14,6,4,2,87,8.4,E,672
11,Mario Kart DS,DS,1970-01-01 00:00:00.000002005,Racing,9,7,4,1,91,8.6,E,252
13,Wii Fit,Wii,1970-01-01 00:00:00.000002007,Sports,8,8,3,2,80,7.7,E,384
14,Kinect Adventures!,X360,1970-01-01 00:00:00.000002010,Misc,15,4,0,1,61,6.3,E,0
15,Wii Fit Plus,Wii,1970-01-01 00:00:00.000002009,Sports,9,8,2,1,80,7.4,E,144
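
Because `Total Sales` is a product, any row with a 0 in one region ends up at 0 overall, as Kinect Adventures! does above. If the intent is combined sales across regions, a sum is the usual choice. The following is a minimal sketch on a hypothetical toy frame that reuses the unrounded Wii Sports and Mario Kart Wii figures from the first preview; the column name `Total_Sales` is an assumption:

```python
import pandas as pd

# Hypothetical toy frame with the unrounded figures for two games
toy_df = pd.DataFrame({
    "Name": ["Wii Sports", "Mario Kart Wii"],
    "NA_sales": [41.36, 15.68],
    "EU_sales": [28.96, 12.76],
    "JP_sales": [3.77, 3.79],
    "Other_sales": [8.45, 3.29],
})

region_cols = ["NA_sales", "EU_sales", "JP_sales", "Other_sales"]

# Row-wise sum across the regional columns
toy_df["Total_Sales"] = toy_df[region_cols].sum(axis=1)

print(toy_df[["Name", "Total_Sales"]])
```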


In [39]:
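# Do we have any negative values?
# Only the info() output appears below: describe() is not the last expression in the cell, so its result is not displayed
vg_sales_df.describe()
vg_sales_df.info()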

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7878 entries, 0 to 16702
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Name             7878 non-null   object        
 1   Platform         7878 non-null   object        
 2   Year_of_Release  7878 non-null   datetime64[ns]
 3   Genre            7878 non-null   object        
 4   NA_sales         7878 non-null   int32         
 5   EU_sales         7878 non-null   int32         
 6   JP_sales         7878 non-null   int32         
 7   Other_sales      7878 non-null   int32         
 8   Critic_Score     7878 non-null   int32         
 9   User_Score       7878 non-null   object        
 10  Rating           7878 non-null   object        
 11  Total Sales      7878 non-null   int32         
dtypes: datetime64[ns](1), int32(6), object(5)
memory usage: 615.5+ KB
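
`describe()` answers the question about negative values only indirectly, through its `min` row. A more direct check compares the numeric columns against zero. The following is a minimal sketch on a hypothetical toy frame with one deliberately planted bad value:

```python
import pandas as pd

# Hypothetical toy frame; the -1.0 is planted to show what a hit looks like
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NA_sales": [41.36, 0.0, 15.68],
    "EU_sales": [28.96, -1.0, 12.76],
})

# Restrict the comparison to numeric columns
numeric = toy_df.select_dtypes(include="number")

# True for every column that contains at least one negative value
has_negative = (numeric < 0).any()
print(has_negative)

# Rows with a negative value in any numeric column
bad_rows = toy_df[(numeric < 0).any(axis=1)]
print(bad_rows)
```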


## Data Analysis
Data Analysis is the process of systematically applying statistical and/or logical techniques to describe and illustrate, condense and recap, and evaluate data. It is done after data exploration and cleaning. When we do data analysis we answer questions corresponding to our dataset.

In [10]:
# Load in some packages
import calendar
import warnings
import pandas as pd
import matplotlib.pyplot as plt
from itertools import combinations
from collections import Counter

warnings.filterwarnings("ignore")

Once the data is cleaned, we can start answering questions about it. In this part we use graphs and the `groupby` function to answer them. The questions we need to answer are the following:

1. What was the best Year for sales? How much was earned that Year?
2. Which Platform had the highest number of sales?
3. Which Genre had the highest number of sales?
4. What product sold the most? Why do you think it sold the most?

## 1. What was the best Year for sales? How much was earned that Year?

In [40]:
# Creating a new variable can be as simple as deriving it from an existing one. Let's create Month, Year and Hour from 'Year_of_Release':

vg_sales_df['Month'] = vg_sales_df['Year_of_Release'].dt.month
vg_sales_df['Year'] = vg_sales_df['Year_of_Release'].dt.year
vg_sales_df['Hour'] = vg_sales_df['Year_of_Release'].dt.hour

# Convert the derived columns to datetime
vg_sales_df['Month'] = pd.to_datetime(vg_sales_df['Month'])
vg_sales_df['Year'] = pd.to_datetime(vg_sales_df['Year'])
vg_sales_df['Hour'] = pd.to_datetime(vg_sales_df['Hour'])
vg_sales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7878 entries, 0 to 16702
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Name             7878 non-null   object        
 1   Platform         7878 non-null   object        
 2   Year_of_Release  7878 non-null   datetime64[ns]
 3   Genre            7878 non-null   object        
 4   NA_sales         7878 non-null   int32         
 5   EU_sales         7878 non-null   int32         
 6   JP_sales         7878 non-null   int32         
 7   Other_sales      7878 non-null   int32         
 8   Critic_Score     7878 non-null   int32         
 9   User_Score       7878 non-null   object        
 10  Rating           7878 non-null   object        
 11  Total Sales      7878 non-null   int32         
 12  Month            7878 non-null   datetime64[ns]
 13  Year             7878 non-null   datetime64[ns]
 14  Hour             7878 non-null   datetime64[ns]

In [42]:
vg_sales_df.head(3)

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating,Total Sales,Month,Year,Hour
0,Wii Sports,Wii,1970-01-01 00:00:00.000002006,Sports,41,28,3,8,76,8.0,E,27552,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000001970,1970-01-01
2,Mario Kart Wii,Wii,1970-01-01 00:00:00.000002008,Racing,15,12,3,3,82,8.3,E,1620,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000001970,1970-01-01
3,Wii Sports Resort,Wii,1970-01-01 00:00:00.000002009,Sports,15,10,3,2,80,8.0,E,900,1970-01-01 00:00:00.000000001,1970-01-01 00:00:00.000001970,1970-01-01
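
In the preview above, `Month`, `Year` and `Hour` hold the same values in every row, because each `Year_of_Release` was parsed as a few nanoseconds after 1970-01-01. They therefore cannot separate one release year from another. The following is a minimal sketch of how question 1 could be answered with `groupby`, on a hypothetical toy frame where the year is kept as an integer. The sales figures are made up and `Total_Sales` is an assumed column holding summed regional sales.

```python
import pandas as pd

# Hypothetical toy frame; years are plain integers, sales values are made up
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2006, 2006, 2008, 2009],
    "Total_Sales": [82.54, 29.80, 35.52, 32.77],
})

# Total sales per release year
sales_by_year = toy_df.groupby("Year")["Total_Sales"].sum()

# Year with the largest total, and that total
best_year = sales_by_year.idxmax()
best_total = sales_by_year.max()

print(sales_by_year)
print(best_year, round(best_total, 2))
```

The same pattern with `groupby("Platform")` or `groupby("Genre")` would address questions 2 and 3, and `sales_by_year.plot(kind="bar")` gives a quick visual check.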
