In [1]:
# HANDLING DIRTY DATA
# Now that missing and duplicate data have been covered, we'll turn our sights to other common problems with datasets.
# Sometimes, datasets have other types of dirty data, which is a loose term for erroneous data.
# Besides missing and duplicated data, dirty data can include the following: 
    # Typos
    # Misplaced data
    # Mixed datatypes
    # Incorrect symbols or punctuation
# All these errors need to be fixed in preparation for analysis.

In [2]:
# IDENTIFY MESSY TEXT DATA
# While it's a financial analyst's dream to work with large datasets full of numbers, in reality data often contains text fields as well.
# Names, emails, comments, symbols, and other text values might prove relevant to the analysis.
# So we need tools and techniques to clean and prepare any messy text.
# In the following example, we'll focus on the influence of currency symbols.
# When a value contains such a symbol, Pandas stores it as a string, so it no longer behaves as a number and can't be used in calculations.
# We'll apply a technique for removing currency symbols:

# Import library
import pandas as pd

prices = pd.DataFrame({
    "prices_usd": ["$0.53", "$0.65", "0.22"]
})
display(prices)
display(prices.dtypes)

# The code creates the `prices` DataFrame with three values in the `prices_usd` column.
# The first two include the dollar sign, but the last does not.
# The values that contain the currency symbol can't be converted directly to numerical data types such as `float` or `int`.
# They are stored as strings, which Pandas reports as the `object` data type, so calculations such as `sum()` and `mean()` won't produce meaningful numeric results.
# The `dtypes` attribute shows the data type that Pandas assigned to each column of the DataFrame; here, `prices_usd` is `object`.

  prices_usd
0      $0.53
1      $0.65
2       0.22


prices_usd    object
dtype: object
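
# One way to see which values would block a numeric conversion is to attempt the conversion and flag the failures.
# The sketch below assumes `pd.to_numeric()` with `errors="coerce"`, which is not used elsewhere in this lesson; it turns values that can't be parsed into `NaN`.
# The names `prices_check`, `parsed`, and `not_numeric` are illustrative only.

```python
import pandas as pd

# Same values as the `prices` DataFrame above, rebuilt so the sketch is self-contained
prices_check = pd.DataFrame({
    "prices_usd": ["$0.53", "$0.65", "0.22"]
})

# Try to parse each value as a number; values that fail to parse become NaN
parsed = pd.to_numeric(prices_check["prices_usd"], errors="coerce")

# True where the original text could not be read as a number
not_numeric = parsed.isna()

# Show only the rows that still need cleaning
print(prices_check[not_numeric])
```

# The rows that are printed are the ones that still contain a symbol or other non-numeric text.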

In [7]:
# REMOVE SYMBOLS FROM DATA
# To remove the dollar sign, we'll call the Pandas `str.replace()` method.
# This replaces each dollar sign with an empty string, written as two quotation marks: `""`.
# To replace the values, we'll use the `loc` indexer to select the column (`iloc` works too, with a position instead of a label).
    # Though we only have one column, selecting it explicitly with `loc` or `iloc` is a good practice.
# Passing `regex=False` makes Pandas treat "$" as a literal character; in a regular expression, "$" means "end of string".
prices.loc[:, 'prices_usd'] = prices.loc[:, 'prices_usd'].str.replace("$", "", regex=False)
display(prices)
display(prices.dtypes)

# Here, `str.replace()` receives three arguments:
    # 1. The string we need to search for, which is "$".
    # 2. The string we are replacing it with, which is the empty string.
    # 3. `regex=False`, so that the search string is matched literally.
# Now that we've removed the currency symbol, the next step is to convert the `object` data type into a numerical data type.


  prices_usd
0       0.53
1       0.65
2       0.22


prices_usd    object
dtype: object
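
# Real price columns often carry more than one stray character, such as thousands separators or other currency symbols.
# As a hedged sketch, a regular-expression character class can remove several characters in one pass.
# The sample values and the set of symbols below are assumptions for illustration, not part of the dataset above; the sketch also assumes that a comma is a thousands separator.

```python
import pandas as pd

# Hypothetical sample values (not from the lesson's dataset)
messy = pd.Series(["$1,234.50", "€0.65", "0.22"])

# Inside square brackets, "$" is a literal character rather than the end-of-string anchor.
# regex=True tells Pandas to interpret the pattern as a regular expression.
cleaned = messy.str.replace(r"[$€,]", "", regex=True)

print(cleaned.tolist())
```

# The result is still text, so a type conversion is needed before any calculation.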

In [9]:
# CONVERT A TEXT DATA TYPE TO A NUMERICAL DATA TYPE
# Removing the currency symbols from the data left us with a string representation of our numbers.
# Now we can use the Pandas `astype()` method to convert the strings into numbers.
# The `astype()` method accepts an argument for the numerical type that we want to convert the data to.
# The two most common conversion arguments are the following:
    # 1. `float` - For numbers with decimals.
    # 2. `int` - For whole numbers.
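# Because financial datasets often deal with currencies or prices, the `float` data type tends to be the one of choice.
# Here we assign the converted column back by name: in recent Pandas versions, assigning through `loc[:, ...]` writes into the existing `object` column and can leave its data type unchanged.
prices['prices_usd'] = prices['prices_usd'].astype("float")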
prices.dtypes

# As you can see, the data type of the values in the `prices_usd` column is now `float64`.

prices_usd    float64
dtype: object
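
# The two steps in this section, removing the symbol and converting the data type, can be wrapped in a small helper.
# This is a minimal sketch, not a general-purpose cleaner: `clean_usd_column` is a hypothetical name, and it assumes the only non-numeric character in the column is a literal "$".

```python
import pandas as pd


def clean_usd_column(column):
    """Strip literal dollar signs from a text column and return it as floats."""
    # regex=False matches "$" literally instead of as a regex anchor
    without_symbol = column.str.replace("$", "", regex=False)
    # Convert the remaining text to floating-point numbers
    return without_symbol.astype("float")


# Rebuild the example data so the sketch runs on its own
prices_demo = pd.DataFrame({"prices_usd": ["$0.53", "$0.65", "0.22"]})
prices_demo["prices_usd"] = clean_usd_column(prices_demo["prices_usd"])

print(prices_demo.dtypes)
print(prices_demo["prices_usd"].sum())
print(prices_demo["prices_usd"].mean())
```

# With the column stored as `float64`, `sum()` and `mean()` return numeric results.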