## Data Exploration

In this notebook we explore the dataset used in this project and complete the data preparation required for the analysis, such as imputing missing values and correcting wrong entries. This ensures the data is in the correct format and ready to be used in the analysis. We begin with the imports and then check whether there are any missing values in the dataset.

In [15]:
# Imports
import numpy as np 
import pandas as pd

In [34]:
# Read in the data and count the number of missing values per feature.
raw = pd.read_csv("../data/raw/vgsales.csv")
raw.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

Both the Year and Publisher features have missing values: Year has 271 and Publisher has 58. Note also that this dataset has not been updated since 2016, so any Year later than 2016 must be an incorrect entry. We therefore need to check for such entries and update them with the correct values. We begin by focusing on the Year feature.

In the step below, all missing values for the Year feature are imputed using the median year. The median is chosen because it is a more robust measure of location than the mean.

In [35]:
# Imputing the missing Year value with the median year
year_median = raw["Year"].median(skipna=True)
raw["Year"] = raw["Year"].fillna(year_median)
missing_vals = raw["Year"].isnull().sum()

print(f"There are {missing_vals} missing values for the Year feature.")

There are 0 missing values for the Year feature.
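
To illustrate why the median is preferred here, the following is a minimal sketch on a hypothetical toy series (the values are made up and are not taken from `vgsales.csv`). A single implausible year pulls the mean upwards, while the median stays at the centre of the typical values.

```python
import pandas as pd

# Hypothetical toy years, including one implausible entry (2020) and one gap
toy_years = pd.Series([2005.0, 2006.0, 2007.0, 2008.0, 2020.0, None])

# Both statistics skip missing values by default
mean_year = toy_years.mean()      # pulled upwards by the 2020 entry
median_year = toy_years.median()  # unaffected by the size of the outlier

# Same imputation pattern as in the cell above, applied to the toy series
filled = toy_years.fillna(median_year)

print(f"mean = {mean_year}, median = {median_year}")
```

With the mean, the imputed value would depend on how extreme the incorrect entries are, which matters here because the incorrect years have not yet been fixed at this point in the notebook.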


Now we search for any incorrect entries for the Year. These are any values in the dataset where the Year is greater than 2016. When identified, the actual release year will be found and the incorrect entries will be updated with the correct values.

In [36]:
# Identify any incorrect entries for the Year
raw[raw["Year"] > 2016]

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5957 | 5959 | Imagine: Makeup Artist | DS | 2020.0 | Simulation | Ubisoft | 0.27 | 0.0 | 0.0 | 0.02 | 0.29 |
| 14390 | 14393 | Phantasy Star Online 2 Episode 4: Deluxe Package | PS4 | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.03 | 0.0 | 0.03 |
| 16241 | 16244 | Phantasy Star Online 2 Episode 4: Deluxe Package | PSV | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |
| 16438 | 16441 | Brothers Conflict: Precious Baby | PSV | 2017.0 | Action | Idea Factory | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |


Research showed that these games were released on the listed platforms in 2009, 2016, 2013 and 2016 respectively.

In [37]:
# Updating the Year with the correct values (row label -> researched release year)
year_corrections = {5957: 2009, 14390: 2016, 16241: 2013, 16438: 2016}
for row_label, year in year_corrections.items():
    raw.loc[row_label, "Year"] = year

The year is now correct for each video game and there are no longer any missing values for this feature, so it does not require any more preprocessing.
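
A quick way to confirm that corrections like these have taken effect is to re-run the filter and check that it returns nothing. The sketch below assumes a small stand-in frame called `toy` and a hypothetical `corrections` mapping; the names and years are placeholders rather than rows from the real dataset.

```python
import pandas as pd

# Stand-in for `raw`; names and years are placeholders
toy = pd.DataFrame(
    {"Name": ["Game A", "Game B", "Game C"], "Year": [2009.0, 2020.0, 2017.0]}
)

# Hypothetical mapping of row label -> researched release year
corrections = {1: 2009, 2: 2016}

# Label-based assignment updates each row in place
for row_label, year in corrections.items():
    toy.loc[row_label, "Year"] = year

# The same filter used to find the bad rows should now be empty
remaining = toy[toy["Year"] > 2016]
assert remaining.empty, "Some years are still later than 2016"
```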

Next, the Publisher feature is prepared for analysis. Only a small percentage of values are missing for this feature (58 of 16,598 rows), and the feature will not play much of a role in the analysis. For these reasons, any missing values are imputed with the value "Unknown". An alternative would be to remove the rows containing missing values. However, imputing with "Unknown" means we keep more data and can perform a more granular analysis.

In [38]:
# Impute "Unknown" for missing values in the Publisher feature
raw['Publisher'] = raw['Publisher'].fillna('Unknown')

All missing values have now been imputed and the data is complete.
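
The trade-off between dropping and imputing can be sketched on a hypothetical toy frame (the rows below are invented for illustration). Dropping discards every row with a missing publisher, whereas imputing keeps all rows and marks the gaps explicitly.

```python
import pandas as pd

# Hypothetical toy data with two missing publishers
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C", "Game D"],
        "Publisher": ["Sega", None, "Ubisoft", None],
    }
)

# Option 1: drop rows where Publisher is missing
dropped = toy.dropna(subset=["Publisher"])

# Option 2: keep every row and label the gaps as "Unknown"
imputed = toy.fillna({"Publisher": "Unknown"})

print(f"rows after dropping: {len(dropped)}, rows after imputing: {len(imputed)}")
print(f"missing values left after imputing: {imputed['Publisher'].isnull().sum()}")
```

Because the imputed rows carry the explicit label "Unknown", they can still be filtered out later if an analysis by publisher needs it.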

Now we investigate each data type and update any which are incorrect.

In [39]:
# Looking at data types
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16598 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16598 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


In [25]:
# Change Year to an integer value instead of a decimal.
raw['Year'] = raw['Year'].astype('int')
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16598 non-null  int64  
 4   Genre         16598 non-null  object 
 5   Publisher     16598 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(5), int64(2), object(4)
memory usage: 1.4+ MB
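
Note that the cast to an integer type works here only because the missing years were imputed first. As a minimal sketch on a hypothetical toy series, casting a float column that still contains a missing value to a plain integer raises an error, while pandas' nullable `"Int64"` type is one alternative that keeps the gap.

```python
import pandas as pd

# Hypothetical toy years with one value still missing
toy_years = pd.Series([2006.0, None, 2010.0])

# A plain integer dtype cannot represent a missing value
try:
    toy_years.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype stores the gap as <NA> instead
nullable_years = toy_years.astype("Int64")

print(f"plain int cast failed: {cast_failed}")
print(nullable_years)
```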


We conclude the data exploration by looking at some basic descriptive statistics of the numeric features in the dataset. These can be seen below.

In [40]:
# Exploring the numeric features
raw.describe()

| | Rank | Year | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|
| count | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 | 16598.0 |
| mean | 8300.605254 | 2006.41511 | 0.264667 | 0.146652 | 0.077782 | 0.048063 | 0.537441 |
| std | 4791.853933 | 5.780191 | 0.816683 | 0.505351 | 0.309291 | 0.188588 | 1.555028 |
| min | 1.0 | 1980.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 4151.25 | 2003.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.06 |
| 50% | 8300.5 | 2007.0 | 0.08 | 0.02 | 0.0 | 0.01 | 0.17 |
| 75% | 12449.75 | 2010.0 | 0.24 | 0.11 | 0.04 | 0.04 | 0.47 |
| max | 16600.0 | 2016.0 | 41.49 | 29.02 | 10.22 | 10.57 | 82.74 |


The mean release year in the dataset is 2006, which suggests that the games in the data are fairly old. The region with the highest average sales is North America, followed by Europe, although the gap between the two averages is substantial (0.26 versus 0.15). Globally, a video game has approximately 537,000 sales on average.

The oldest game in the dataset was released in 1980 and the most recent in 2016. This means we are investigating a variety of video games from different periods in time.
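
The figures quoted above can be reproduced from the mean row of the summary table. This sketch assumes the sales columns are recorded in millions of units, which the notebook does not state explicitly but which is consistent with the figure of 537,000. The series `mean_sales` is a hand-copied stand-in for `raw.describe().loc["mean"]`.

```python
import pandas as pd

# Mean values copied from the describe() output above
mean_sales = pd.Series(
    {
        "NA_Sales": 0.264667,
        "EU_Sales": 0.146652,
        "JP_Sales": 0.077782,
        "Other_Sales": 0.048063,
        "Global_Sales": 0.537441,
    }
)

# Assumption: sales are in millions of units, so scale to single units
mean_units = (mean_sales * 1_000_000).round().astype(int)

# Region with the highest average sales, excluding the global total
top_region = mean_sales.drop("Global_Sales").idxmax()

# How much larger the North American average is than the European one
na_to_eu_ratio = mean_sales["NA_Sales"] / mean_sales["EU_Sales"]

print(mean_units)
print(f"top region: {top_region}, NA/EU ratio: {na_to_eu_ratio:.2f}")
```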

In [42]:
# Save the processed dataset without writing the row index as a column
raw.to_csv("../data/processed/vgsales_processed.csv", index=False)