# JUMPlus Python Project 4

## Video Game Sales Basic Data Cleaning

### by Nicholas Crossman

In this project, we've been given a dataset with some missing values. 

Collecting data in the real world is often messy, and some records will be incomplete. It's essential as a data engineer to know how to deal with these missing values. 

There are more advanced techniques, such as imputation, which fills in missing values with estimates (for example a column mean, or a prediction from a machine learning model). However, that's beyond the scope of this practice project. Here, we'll just work out which values are missing, which columns we can keep, and which we should ignore.

First, we read in the data from the `.csv` file.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule

data = pd.read_csv("Video_Games_Sales_as_at_22_Dec_2016.csv")

Let's see the first 5 entries to get an idea of what we're dealing with.

In [3]:
data.head()

| | Name | Platform | Year_of_Release | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Critic_Score | Critic_Count | User_Score | User_Count | Developer | Rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.36 | 28.96 | 3.77 | 8.45 | 82.53 | 76.0 | 51.0 | 8.0 | 322.0 | Nintendo | E |
| 1 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.68 | 12.76 | 3.79 | 3.29 | 35.52 | 82.0 | 73.0 | 8.3 | 709.0 | Nintendo | E |
| 3 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.61 | 10.93 | 3.28 | 2.95 | 32.77 | 80.0 | 73.0 | 8.0 | 192.0 | Nintendo | E |
| 4 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 | NaN | NaN | NaN | NaN | NaN | NaN |


In [5]:
data.shape

(16719, 16)

The `shape` attribute tells us there are 16,719 rows and 16 columns in total.

We can already see some `NaN` values indicating missing data. Since `head()` only shows the first 5 entries, we need a query to see which columns contain missing values anywhere in the dataset.

Helpfully, pandas provides the `isna()` method, which flags missing values such as `None` or `NaN`.

In [4]:
data.columns[data.isna().any()].tolist()

['Name',
 'Year_of_Release',
 'Genre',
 'Publisher',
 'Critic_Score',
 'Critic_Count',
 'User_Score',
 'User_Count',
 'Developer',
 'Rating']

That's 10 columns out of 16 with some missing values. 
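To make the `isna().any()` query above less opaque, here is a minimal sketch of the same steps on a tiny DataFrame. The `toy` data and its column names are invented for illustration and are not part of the video game dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not from the video game dataset).
toy = pd.DataFrame({
    "title": ["A", "B", None],
    "year": [2001.0, np.nan, 2003.0],
    "sales": [1.5, 0.2, 3.1],
})

# Step 1: a Boolean DataFrame, True wherever a value is missing.
mask = toy.isna()

# Step 2: collapse each column to a single Boolean:
# True if that column contains at least one missing value.
has_missing = mask.any()

# Step 3: use the Boolean Series to select the matching column names.
cols_with_missing = toy.columns[has_missing].tolist()
print(cols_with_missing)  # ['title', 'year']
```

Note that `any()` works column by column by default, which is why the result has one entry per column rather than one per row.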

That's too many columns to throw out wholesale, so next we should count the missing values in each column. Then we can decide which columns are still useful and which have too much missing data.
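One possible way to get per-column counts is to chain `isna()` with `sum()`, and with `mean()` for the share of missing rows. The sketch below uses a small made-up DataFrame (`toy_sales`, a hypothetical name) rather than the real data, so the numbers say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not from the video game dataset).
toy_sales = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "year": [2001.0, np.nan, 2003.0, np.nan],
    "score": [np.nan, np.nan, np.nan, 7.5],
})

# True counts as 1 when summed, so this gives missing values per column.
missing_count = toy_sales.isna().sum()

# The mean of a Boolean column is the fraction of rows that are missing.
missing_share = toy_sales.isna().mean()

# Combine both into one table, worst columns first.
summary = pd.DataFrame(
    {"missing": missing_count, "share": missing_share}
).sort_values("missing", ascending=False)
print(summary)
```

Running the same two calls on `data` instead of `toy_sales` should give the equivalent summary for the real dataset. Any cutoff for "too much missing data" is a judgment call that depends on how the column will be used.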