## Data Exploration

In this notebook we explore the dataset used in this project and carry out the data preparation the analysis requires, such as cleaning the data and correcting wrong values, so that the data is in the right format for the analysis. We begin with the imports and then check whether the dataset contains any missing values.

In [28]:
# Imports
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

In [30]:
# Read in the data and count the number of missing values per feature.
raw = pd.read_csv("../../Video Games Analysis/data/raw/vgsales.csv")
raw.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

The output shows missing data for both the Year and Publisher features. In addition, the dataset has not been updated since 2016, so any entry with a release Year after 2016 must be incorrect. Such entries will be updated to reflect the correct release Year.

Next, all incorrect entries for the Year feature are changed to the correct value, and the missing values are then imputed; for Year, the median year is used.

In [23]:
# Identify any incorrect entries for the Year (anything after 2016)
raw[raw["Year"] >= 2017]

| | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5957 | 5959 | Imagine: Makeup Artist | DS | 2020.0 | Simulation | Ubisoft | 0.27 | 0.0 | 0.0 | 0.02 | 0.29 |
| 14390 | 14393 | Phantasy Star Online 2 Episode 4: Deluxe Package | PS4 | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.03 | 0.0 | 0.03 |
| 16241 | 16244 | Phantasy Star Online 2 Episode 4: Deluxe Package | PSV | 2017.0 | Role-Playing | Sega | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |
| 16438 | 16441 | Brothers Conflict: Precious Baby | PSV | 2017.0 | Action | Idea Factory | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |


These entries cannot be correct, because the dataset was last updated in 2016. Research showed that these games were released on the listed platforms in 2009, 2016, 2013 and 2016, respectively. These values are replaced below.

In [24]:
# Updating the Year feature: replace the four incorrect years by index label
# (the order matches the rows shown above)
raw.loc[[5957, 14390, 16241, 16438], "Year"] = [2009, 2016, 2013, 2016]
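As a sanity check on the corrections above, the following sketch re-runs the filter to confirm that no release year exceeds the dataset's last update. It uses a small hypothetical frame (`toy`) and a hypothetical constant `LAST_UPDATE_YEAR` in place of `raw`, so that it runs on its own; the names and values are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `raw`; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Name": ["Game A", "Game B", "Game C"],
        "Year": [2020.0, 2016.0, np.nan],
    }
)

LAST_UPDATE_YEAR = 2016  # assumed cutoff, taken from the text above

# Correct the out-of-range entry by index label, then re-run the filter.
toy.loc[0, "Year"] = 2009
too_late = toy[toy["Year"] > LAST_UPDATE_YEAR]

# Comparisons against NaN evaluate to False, so missing years are not
# flagged by this filter and still need to be handled separately.
print(len(too_late))
```

The same filter could be applied to `raw` after the update; an empty result would indicate that all four entries were corrected.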

In [27]:
# Impute the missing Year values with the median year
year_median = raw["Year"].median(skipna=True)
raw["Year"] = raw["Year"].fillna(year_median)

raw["Year"].isnull().sum()

0
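To illustrate what the median imputation does, here is a minimal sketch on a hypothetical Series (`years`) with made-up values. `median` skips missing values by default, and the column stays floating point after filling, so the integer cast shown at the end is an optional extra step and not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical release years with two missing entries (illustrative values).
years = pd.Series([2006.0, np.nan, 2008.0, 2010.0, np.nan])

# Series.median ignores NaN by default, so only 2006, 2008 and 2010 count.
median_year = years.median()

# Replace every missing entry with the median.
filled = years.fillna(median_year)

# Optional: once no NaN remains, the column can be cast to integers.
filled_int = filled.astype(int)
print(filled_int.tolist())
```

If the median falls between two years (for example with an even number of observed values), the integer cast truncates it, so rounding first may be preferable.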