# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before you start coding, provide the link to your dataset below.
https://www.kaggle.com/datasets/thedevastator/global-video-game-sales

Import the necessary libraries and create your dataframe(s).
import pandas as pd
import numpy as np

In [None]:
import pandas as pd

df = pd.read_csv(r'vgsales.csv')

In [None]:
df.head()

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [None]:
# Checking where the null values are before taking any action.
df.isna().sum()
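
Raw null counts are easier to judge next to the column length. A minimal sketch of the same check expressed as a percentage, using a small made-up frame (`toy` and its values are illustrative only, not rows from vgsales.csv):

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "Year": [2006.0, None, 2010.0, 2008.0],
    "Publisher": ["Nintendo", "Sega", None, None],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction of missing values, converted here to a percentage.
missing_share = (toy.isna().mean() * 100).round(1)
print(missing_share)
```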

In [None]:
# The dataset is oddly missing 2018-2019, with only a single record for 2020.
df['Year'].sort_values().unique()

In [None]:
# Finding the mean of the year column to use in filling the null values in that column.
df['Year'].mean().round()

In [None]:
# Filled all null values in "Year" with the rounded mean of the column (2006) so it can be converted to an integer later in the cleaning process.
df['Year'] = df['Year'].fillna(2006)
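
The fill value can also be derived from the column instead of being typed in by hand, so it stays correct if the file changes. A minimal sketch on made-up years (`toy` and `fill_year` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Made-up years with two gaps, standing in for df['Year'].
toy = pd.DataFrame({"Year": [2004.0, None, 2008.0, 2007.0, None]})

# mean() skips nulls by default; rounding gives a whole year to fill with.
fill_year = int(round(toy["Year"].mean()))
toy["Year"] = toy["Year"].fillna(fill_year)
print(fill_year)
print(toy["Year"].tolist())
```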

In [None]:
# The dataset does not have a lot of null values in general. I've decided to let the remaining null values stay since they will not affect
# the business question being asked. The columns containing these null values may be dropped later.

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [None]:
# Checking all sales columns to pick out possible outliers through the upper quartile and max.
df[['NA_Sales','EU_Sales','JP_Sales','Other_Sales','Global_Sales']].describe()

In [None]:
# Sales below 10K appear to be either not counted or rounded. Since the sales are recorded in millions, .01 represents 10K.

In [None]:
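# Created a function to count the "outliers" in each sales column.
# A value is flagged when it is above 1.5 times the upper quartile (a rough cutoff, not the standard IQR rule).
def sales_outlier(x):
    up_quart = x.quantile(.75)
    threshold = up_quart * 1.5
    # Summing the boolean mask counts the rows above the threshold.
    outliers = (x > threshold).sum()
    return outliers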
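
The function above flags values above 1.5 times the upper quartile. A more common convention is Tukey's rule, which sets the upper fence at Q3 + 1.5 * IQR. A minimal sketch for comparison, assuming only high values matter for sales (`iqr_outlier_count` and `toy_sales` are hypothetical names with made-up numbers):

```python
import pandas as pd

def iqr_outlier_count(x: pd.Series) -> int:
    """Count values above the upper Tukey fence, Q3 + 1.5 * IQR."""
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return int((x > upper_fence).sum())

# Made-up sales figures: many small values and one very large one.
toy_sales = pd.Series([0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 41.49])
n_outliers = iqr_outlier_count(toy_sales)
print(n_outliers)
```

On heavily skewed sales data either cutoff will flag a large share of rows, so the count is more useful as a description of the skew than as a list of rows to delete.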

In [None]:
# Counting the "outliers" in the NA_Sales column. 
sales_outlier(df['NA_Sales'])

In [None]:
# Counting the "outliers" in the EU_Sales column.
sales_outlier(df['EU_Sales'])

In [None]:
# Counting the "outliers" in the JP_Sales column.
sales_outlier(df['JP_Sales'])

In [None]:
# Counting the "outliers" in the Other_Sales column.
sales_outlier(df['Other_Sales'])

In [None]:
# Counting the "outliers" in the Global_Sales column.
sales_outlier(df['Global_Sales'])
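
The five separate calls above could be collapsed into one by applying a function to every sales column at once. A minimal sketch on a made-up frame, using a hypothetical helper `count_above_threshold` that counts values above 1.5 times the upper quartile:

```python
import pandas as pd

def count_above_threshold(x: pd.Series) -> int:
    """Count values above 1.5 times the column's upper quartile."""
    return int((x > x.quantile(0.75) * 1.5).sum())

# Made-up numbers; only two of the five sales columns are shown.
toy = pd.DataFrame({
    "NA_Sales": [0.1, 0.2, 0.3, 0.4, 10.0],
    "EU_Sales": [0.0, 0.0, 0.1, 0.1, 0.1],
})

# DataFrame.apply runs the helper once per column and returns a Series
# indexed by column name.
counts = toy[["NA_Sales", "EU_Sales"]].apply(count_above_threshold)
print(counts)
```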

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [None]:
# The "Publisher" column is not necessary for the business question being asked, so it will be dropped.
# This will also remove the remaining null values.

In [None]:
df = df.drop('Publisher',axis = 1)
df
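
Duplicate rows are another kind of unnecessary data worth a quick look. A minimal sketch of how that check might go, on a made-up frame (the rows are invented, and whether vgsales.csv has any exact duplicates is not something this sketch establishes):

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates of each other.
toy = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B"],
    "Platform": ["Wii", "Wii", "DS"],
    "Year": [2006, 2006, 2010],
})

# duplicated() marks every repeat of an earlier row as True.
dup_count = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(dup_count, len(deduped))
```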

## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [None]:
# Checking for correct datatype for each column. 
df.dtypes

In [None]:
# I decided to convert the "Year" column to an integer to be more uniform. The null values were already filled with the rounded mean (2006),
# so the column can now be converted from a float to an integer.
df['Year'].describe()

In [None]:
# Converted the "Year" column to an integer dtype to be more uniform. 
df['Year'] = df['Year'].astype(int)
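
A plain `astype(int)` fails if any nulls remain, which is why the fill had to come first. If the missing years should stay missing, pandas also has a nullable integer dtype, `"Int64"`. A minimal sketch of that alternative on made-up values (not what this notebook does):

```python
import pandas as pd

# Made-up years with one gap left unfilled.
toy = pd.DataFrame({"Year": [2006.0, None, 2010.0]})

# "Int64" (capital I) is the nullable integer dtype: whole-number floats
# become integers and the gap is kept as <NA>.
toy["Year"] = toy["Year"].astype("Int64")
print(toy["Year"])
```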

In [None]:
# Checking the data types here verifies that the "Year" column is now an integer.
df.dtypes

In [None]:
# Checking the "Genre" column for any inconsistent genre names. 
df['Genre'].unique()

In [None]:
# Checking the "Platform" column for any inconsistent console names.
df['Platform'].unique()
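
If `unique()` had turned up variants of the same label, such as stray spaces or mixed casing, they could be normalized before grouping. A minimal sketch on made-up genre labels (the messy values are invented for illustration):

```python
import pandas as pd

# Made-up labels with inconsistent spacing and casing.
toy = pd.DataFrame({"Genre": ["Action", "action ", " Sports", "Sports"]})

# strip() removes surrounding whitespace and title() standardizes casing.
# title() would not suit "Platform", since it turns codes like "PS2" into "Ps2".
toy["Genre"] = toy["Genre"].str.strip().str.title()
print(toy["Genre"].value_counts())
```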

In [None]:
# Confirming the data has no nulls and is properly cleaned before creating the new csv.
df.isna().sum()

In [None]:
# Creating a new csv called "cleaned_vgsales" to use for task 4. index=False keeps the row index from being saved as an extra column.
df.to_csv('cleaned_vgsales', index=False)

In [None]:
# reading the csv and producing a sample to make sure everything works as it should.
cleaned_df = pd.read_csv(r'cleaned_vgsales')
cleaned_df.sample(20)

In [None]:
# Double-checking that there are no nulls left after cleaning.
cleaned_df.isna().sum()
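
One thing worth checking on a csv round trip is the row index: `to_csv` writes it by default, and it comes back as an extra "Unnamed: 0" column when the file is read again. A minimal sketch of the difference, using in-memory buffers and a made-up frame instead of real files:

```python
import io
import pandas as pd

# Made-up rows standing in for the cleaned dataframe.
toy = pd.DataFrame({"Name": ["Game A", "Game B"], "Year": [2006, 2010]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so the columns survive unchanged.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```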

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?
A. No, I found 3 different types of dirty data. However, my outliers will need to be included in the dataset since they're mostly highly popular games.
2. Did the process of cleaning your data give you new insights into your dataset?
A. Yes, it really showed how many games don't break through the market and only reach small sales numbers.
3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?
A. Yes, I think it would be very useful to convert the fractional sales figures (recorded in millions) into whole numbers to make them easier to read and understand.
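
The conversion mentioned in answer 3 might look like the sketch below. It assumes the sales columns are recorded in millions, which is consistent with .01 standing for 10K. The column name `Global_Sales_Whole` is hypothetical and the values are made up:

```python
import pandas as pd

# Made-up sales figures, assumed to be in millions.
toy = pd.DataFrame({"Global_Sales": [82.74, 0.01, 1.5]})

# Multiply by one million, then round before casting so that floating-point
# noise (for example 82739999.99999999) does not truncate to the wrong integer.
toy["Global_Sales_Whole"] = (toy["Global_Sales"] * 1_000_000).round().astype("int64")
print(toy)
```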