In [1]:
# INTRO TO DATA PREPARATION
# Have you ever worked on a spreadsheet that's full of errors such as typos, missing data, or duplicated data?
# These are examples of DIRTY DATA, which is the scourge of financial analysis.
# Working with flawed data leads to flawed analysis. 
# It's essential to prepare and clean any dataset before you analyze it.
# Pandas has built-in functions that allow analysts to accomplish this.

In [4]:
# MISSING DATA
# Analysts commonly encounter the problem of missing data.
# Missing data can result from a human or computer error or even from a real-world event, like the market closing.
# Before you start any analysis, you need to handle the missing values to avoid erroneous visualizations or incorrect calculations.
# Pandas represents missing values with a special floating-point value known as `NaN`.
# `NaN` stands for Not a Number. You can think of it as a placeholder for missing data.
# The most important benefit of `NaN` is that specific Pandas functions recognize it.
# With these functions, we can identify the missing values and then either remove or replace them.
# Let's start with an example CSV file that has missing values:

import pandas as pd
from pathlib import Path

items_df = pd.read_csv(Path('items.csv'))
items_df

Unnamed: 0,Item Number,Price,Quantity
0,2468,2.99,250
1,1357,,300
2,9753,3.47,350
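
# The `items.csv` file itself isn't shown in this notebook, so the following is only a sketch.
# It assumes the file's contents match the table displayed above and rebuilds them as an in-memory string.
# The names `csv_text` and `sample_items_df` are hypothetical and used just for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for items.csv, reconstructed from the displayed table.
csv_text = """Item Number,Price,Quantity
2468,2.99,250
1357,,300
9753,3.47,350
"""

# `read_csv` accepts any file-like object, so a StringIO buffer works like a file on disk.
sample_items_df = pd.read_csv(io.StringIO(csv_text))

# The empty Price field is parsed as NaN, which makes the whole column float64.
print(sample_items_df)
print(sample_items_df.dtypes)
```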


In [8]:
# FINDING MISSING DATA
# One of the best tools for finding missing, or null, values is the `isnull()` function.
# This function returns a DataFrame of the same shape, with `True` wherever a value is NaN and `False` everywhere else.
# As the previous cell shows, the Price column has a NaN at index 1.
# When we run the `isnull()` function, we get `True` for that spot and `False` for the others:
display(items_df.isnull())
display(items_df.isnull().sum())
display(items_df.isnull().mean())

# Now that we see the Price column has a missing value, we can use that information to decide how to handle it.
# Additionally, we can chain the `sum()` function to count how many NaN values are present in each column.
# We can also chain the `mean()` function to get the fraction of missing values in each column (multiply by 100 for a percentage).

# *NOTE* On the job, when deleting missing values, it's important to be aware of how much data you need for your analysis.
# That is, make sure not to delete too much data if doing so will affect your calculations or overall analysis.
# Statisticians might have set guidelines, but most analysts build an intuition for this type of decision-making. 

Unnamed: 0,Item Number,Price,Quantity
0,False,False,False
1,False,True,False
2,False,False,False


Item Number    0
Price          1
Quantity       0
dtype: int64

Item Number    0.000000
Price          0.333333
Quantity       0.000000
dtype: float64
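
# The following sketch builds on the `isnull()` results above.
# It converts the per-column fraction into a percentage and selects the rows that contain a NaN.
# The DataFrame `example_df` and the cutoff `MAX_PERCENT_MISSING` are assumptions made for illustration, not a recommended threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical data mirroring the items table shown earlier.
example_df = pd.DataFrame({
    "Item Number": [2468, 1357, 9753],
    "Price": [2.99, np.nan, 3.47],
    "Quantity": [250, 300, 350],
})

# Fraction of missing values per column, converted to a percentage.
percent_missing = example_df.isnull().mean() * 100

# `any(axis=1)` is True for each row that has at least one NaN.
rows_with_missing = example_df[example_df.isnull().any(axis=1)]

# Illustrative cutoff only: list the columns whose missing share exceeds it.
MAX_PERCENT_MISSING = 50
columns_over_cutoff = percent_missing[percent_missing > MAX_PERCENT_MISSING].index.tolist()

print(percent_missing.round(2))
print(rows_with_missing)
print(columns_over_cutoff)
```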

In [9]:
# HANDLING MISSING DATA
# We can handle missing values in one of two ways:
    # 1. By removing, or `dropping` them.
    # 2. By replacing, or `filling` them.

In [15]:
# DROP MISSING DATA
# The first way to deal with missing data is to drop any rows that have missing values.
# This option is best for large datasets where dropping some rows won't affect the overall analysis.
# To drop rows of data that have missing values, we use the `dropna()` function.
# The following code imports the Pandas and NumPy libraries and then creates a `daily_returns` DataFrame containing Apple and Google stock data:
import pandas as pd
import numpy as np

daily_returns = pd.DataFrame({
    "AAPL": [0.5, np.nan, 0.62],
    "GOOG": [0.45, 0.63, 0.55]
})
display(daily_returns)

# We see that `AAPL` is missing a value at index position 1. Now we can use the `dropna()` function to remove the row that contains the NaN value:
new_returns = daily_returns.dropna()
display(new_returns)

# As you can see, the NaN value and its entire row are gone, and only two rows remain.
# However, the fact that `dropna()` has just thrown away a third of our data can't be ignored.
# This example works for demonstrative purposes, but it's hard to imagine this scenario not skewing some aspect of your analysis.
# This is especially true in professional settings where every piece of data counts.
# With that in mind, this is where filling the data can be an alternative option.

Unnamed: 0,AAPL,GOOG
0,0.5,0.45
1,,0.63
2,0.62,0.55


Unnamed: 0,AAPL,GOOG
0,0.5,0.45
2,0.62,0.55
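
# By default, `dropna()` removes every row that has at least one NaN.
# The sketch below shows three standard `dropna()` parameters (`subset`, `how`, and `axis`) that narrow what gets dropped.
# The DataFrame `example_returns` is a hypothetical copy of the data above, so the original `daily_returns` is not touched.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the daily returns data used above.
example_returns = pd.DataFrame({
    "AAPL": [0.5, np.nan, 0.62],
    "GOOG": [0.45, 0.63, 0.55],
})

# Drop a row only when GOOG is missing; the row with the AAPL NaN is kept.
goog_complete = example_returns.dropna(subset=["GOOG"])

# Drop a row only when every column in it is missing.
not_all_missing = example_returns.dropna(how="all")

# Drop columns (axis=1) that contain any NaN instead of dropping rows.
complete_columns = example_returns.dropna(axis=1)

print(goog_complete)
print(not_all_missing)
print(complete_columns)
```

# Which variant fits depends on the analysis: dropping a column keeps every date but loses a ticker, while dropping a row keeps every ticker but loses a date.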


In [19]:
# FILL MISSING DATA
# When dropping data isn't an option, financial analysts sometimes fill the missing data with specific values that they can account for in the analysis.
# The three most common replacement values are the following:
    # "Unknown"
    # 0
    # `mean()`
# There are pros and cons for each of these values, so you'll have to be discerning when choosing the best replacement for your particular problem.
# Maybe you want to flag values with "Unknown".
# Or you might want to replace a missing number with either zero or the average value in that column.
# These choices are up to you as the analyst.
# To fill NaN values, use the `fillna()` function.

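# Replace the NaNs with "Unknown".
# Note that `fillna()` returns a new DataFrame and leaves the original unchanged.
# The result isn't assigned here, so `daily_returns` still displays its NaN.
# To keep the filled copy, assign it to a variable, as the next examples do.
daily_returns.fillna("Unknown")
display(daily_returns)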

# Replace the NaNs with 0.
zero_fill_returns = daily_returns.fillna(0)
display(zero_fill_returns)

# Replace the NaNs with the column average
mean_returns = daily_returns.fillna(daily_returns.mean())
display(mean_returns)

# You can also use the `fillna()` function on a single column, giving you control over which values are replaced.
# This is where the `loc` and `iloc` indexers come into play.
# For example, to replace the NaN value in the AAPL column with 0 without touching the GOOG column:
aapl_column = daily_returns.loc[:, "AAPL"]
aapl_returns = aapl_column.fillna(0)
display(aapl_returns)
display(aapl_returns.isnull().sum())

Unnamed: 0,AAPL,GOOG
0,0.5,0.45
1,,0.63
2,0.62,0.55


Unnamed: 0,AAPL,GOOG
0,0.5,0.45
1,0.0,0.63
2,0.62,0.55


Unnamed: 0,AAPL,GOOG
0,0.5,0.45
1,0.56,0.63
2,0.62,0.55


0    0.50
1    0.00
2    0.62
Name: AAPL, dtype: float64

0
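
# As an alternative to selecting a column first, `fillna()` also accepts a dictionary that maps column names to fill values.
# The sketch below uses a hypothetical `example_returns` DataFrame with the same values as `daily_returns`.
# It also shows the "Unknown" flag, which mixes text with numbers in the filled column.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the daily returns data used above.
example_returns = pd.DataFrame({
    "AAPL": [0.5, np.nan, 0.62],
    "GOOG": [0.45, 0.63, 0.55],
})

# A dict fills only the listed columns; GOOG is left as it is.
zero_filled = example_returns.fillna({"AAPL": 0})

# Fill AAPL with its own column average, which skips the NaN when computed.
mean_filled = example_returns.fillna({"AAPL": example_returns["AAPL"].mean()})

# Flag the gap with a label; the AAPL column now holds both numbers and text.
unknown_filled = example_returns.fillna("Unknown")

print(zero_filled)
print(mean_filled)
print(unknown_filled)

# Each call returned a new DataFrame, so the original still has its NaN.
print(example_returns["AAPL"].isnull().sum())
```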