# Standardization

When working with dataframes, we want every value of a given kind to be stored in the same, agreed-upon format. This is referred to as *standardization*. Standard formats are necessary when combining or analyzing data.

For example, consider the date 12/10/11. Is this date:
- December 10, 2011 (MONTH/DAY/20YEAR)
- December 10, 1911 (MONTH/DAY/19YEAR)
- October 12, 2011  (DAY/MONTH/20YEAR)
- October 12, 1911  (DAY/MONTH/19YEAR)
- October 11, 2012  (20YEAR/MONTH/DAY)

We wouldn't know without defining a standard beforehand. 
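
To make the ambiguity concrete, here is a small sketch using Python's standard `datetime` module (which is not otherwise used in this lesson). The same string parses to three different dates depending on which format we tell Python to assume. Note that `%y` has to guess the century from two digits, so the 1911 readings above cannot be recovered this way at all.

```python
from datetime import datetime

ambiguous = '12/10/11'

# Each format string encodes a different assumption about the order of the parts
formats = {
    'MONTH/DAY/YEAR': '%m/%d/%y',
    'DAY/MONTH/YEAR': '%d/%m/%y',
    'YEAR/MONTH/DAY': '%y/%m/%d',
}

parsed_dates = {}
for name, fmt in formats.items():
    # strptime reads the text according to the given format
    parsed_dates[name] = datetime.strptime(ambiguous, fmt).date()
    print(name, '->', parsed_dates[name])
```

None of the three results is "wrong"; each is correct for its own standard, which is why the standard has to be chosen before the data is read.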


# Regex

**Regex**, or **reg**ular **ex**pressions, is a pattern-matching language for text, available in Python through the built-in re module. It is incredibly useful for manipulating data to conform to a standard. First we'll look at a list of dates that we want to convert into MONTH/DAY/YEAR format and NOT use regex.

And rather than type out a bunch of dates, let's create some random data.

In [None]:
from numpy import random # needed for random. functions

# Lists of month names, grouped by the number of days in the month
day_months_31 = ['January', 'March', 'May', 'July', 
                 'August', 'October', 'December']
day_months_30 = ['April', 'June', 'September', 'November']
day_months_28 = ['February'] # Ignoring leap years
list_of_dict_months = day_months_31 + day_months_30 + day_months_28

# define a function that creates a random date
def random_date():
    # numpy's randint excludes the upper bound, so this picks an index from 0 to 11
    month = list_of_dict_months[random.randint(0, 12)]
    # ensures the month has the correct number of days
    if month in day_months_31:
        day = random.randint(1, 32)  # 1 to 31
    elif month in day_months_30:
        day = random.randint(1, 31)  # 1 to 30
    else:
        day = random.randint(1, 29)  # 1 to 28
    year = random.randint(1900, 2024)  # 1900 to 2023
    return f'{month} {day}, {year}'
    
# initialize empty list
list_of_dates = []

# Create a list of 1000 dates
while len(list_of_dates) < 1000:
    list_of_dates.append(random_date())

In [None]:
# After running the previous cell, we have a list of random dates called list_of_dates
list_of_dates

In [None]:
# Create dictionary of months and associated number
months_in_order = ['January', 'February', 'March', 'April', 
                       'May', 'June', 'July', 'August', 
                       'September', 'October', 'November', 'December']
dict_of_months = {}
    
for index in range(1, 13): # remember range does not include the last value
    # This adds a key (Month name) and an associated value (Month number)
    dict_of_months[months_in_order[index-1]] = index

def clean_date(date):
    # str.split(separator) splits the str wherever the separator appears
    # our code below splits on a single space
    list_of_strings = date.split(' ')
    # After split, list is ['raw_month', 'day,', 'year']
    
    # Remember list index starts at 0
    raw_month = list_of_strings[0]
    
    # Since the day has an extra comma, we can do .replace to replace commas with nothing, which deletes the comma
    raw_day = list_of_strings[1].replace(',', '')
    
    raw_year = list_of_strings[2]
    
    # Uses the dictionary to convert from month to associated number
    month_number = dict_of_months[raw_month]
    
    return f'{month_number}/{raw_day}/{raw_year}'

In [None]:
converted_list_of_dates = []
for date in list_of_dates:
    converted_list_of_dates.append(clean_date(date))

print(converted_list_of_dates)

We converted from the written form to a shorthand. What if we wanted to convert back? We can't just rely on character positions, since some days/months are a single digit while others are two digits. We can use regex for this! See if you can convert the date from the short form to the long form. *Hint: What could we split each string by to separate the month, day, and year?*

## Regex pattern matching

Regex excels at pattern matching. Let's take our previous example and, rather than split the text, use matching.

In [None]:
import re

converted_list_of_dates = []

# Parentheses collect part of the pattern together as a "group"
# This is how we define group(1), group(2), and group(3)
# .+ means one or more occurrences of any character
# \s means a whitespace character
# , is just a literal comma; it is not special here
# This effectively searches for the following groups:
#     (group 1) (group 2), (group 3)
#     Month Day, Year
# The r prefix makes this a raw string, so Python leaves the backslashes alone
pattern_to_match = r'(.+)\s(.+),\s(.+)'

def regex_convert(date):
    temp_match_object = re.search(pattern_to_match, date)
    
    raw_month = temp_match_object.group(1)
    day = temp_match_object.group(2)
    year = temp_match_object.group(3)
    
    month_number = dict_of_months[raw_month]
    
    short_date = f'{month_number}/{day}/{year}'
    return short_date

In [None]:
for date in list_of_dates:
    short_date = regex_convert(date)
    converted_list_of_dates.append(short_date)

print(converted_list_of_dates)

Regex can be unintuitive but can be an incredible tool for advanced data cleaning. In this course, you will not _need_ to use regex but it can make life a little easier. 
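
As a sketch of how a pattern can be made stricter than `.+`, the version below only accepts letters for the month and digits for the day and year, and uses `re.fullmatch` so the whole string has to fit the pattern. The names `strict_pattern`, `strict_convert`, and `month_numbers` are made up for this illustration and are not part of the lesson's code.

```python
import re

# Illustrative lookup table (only three months, to keep the sketch short)
month_numbers = {'January': 1, 'February': 2, 'March': 3}

# [A-Za-z]+ : one or more letters (the month name)
# \d{1,2}   : one or two digits (the day)
# \d{4}     : exactly four digits (the year)
strict_pattern = r'([A-Za-z]+)\s(\d{1,2}),\s(\d{4})'

def strict_convert(date):
    # fullmatch only succeeds if the entire string fits the pattern
    match = re.fullmatch(strict_pattern, date)
    if match is None or match.group(1) not in month_numbers:
        return None  # signal that the text is not a date we recognize
    month, day, year = match.groups()
    return f'{month_numbers[month]}/{day}/{year}'

good_result = strict_convert('March 5, 1999')
bad_result = strict_convert('March five, 1999')
print(good_result)
print(bad_result)
```

Returning `None` for text that does not match is one possible design choice; it lets messy rows be found and inspected later instead of stopping the whole conversion with an error.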

# dataframe.apply 

One of the most important functions you can use on a dataframe is the .apply function. It lets you run a function you define on every value in a column, or on every row or column of an entire dataframe! It is also usually faster than writing your own loop over every single row (but we won't get into the weeds of why it is faster). Let's see an example to check it out in action.

In [None]:
import numpy
import pandas

# We are going to use this cell just to load the dataframe
# That way, running the code in other cells will be quick since we don't have to re-load the dataframe
dataframe = pandas.read_excel("all_data_M_2022.xlsx")

In [None]:
def add_a_dollar_sign(integer):
    new_text = f'${integer}'
    return new_text

# We want to create a new dataframe that we will modify
# and leave the original one alone
# .copy() is required: plain assignment would only give the same dataframe a second name
new_dataframe = dataframe.copy()

# We want to modify the 'H_MEDIAN' column to add a dollar sign
new_dataframe['H_MEDIAN'] = dataframe['H_MEDIAN'].apply(add_a_dollar_sign)
new_dataframe.head()
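
One thing to watch for when trying to leave the original dataframe alone: plain assignment only gives the same dataframe a second name, while .copy() creates an independent one. Here is a minimal sketch with a small made-up dataframe (the values are invented for illustration and do not come from the Excel file).

```python
import pandas

# Small made-up dataframe standing in for the real data
wages = pandas.DataFrame({'H_MEDIAN': [15.5, 22.0, 31.25]})

def add_a_dollar_sign(value):
    return f'${value}'

alias = wages              # a second name for the SAME dataframe
true_copy = wages.copy()   # an independent dataframe

# Only the copy is modified; .apply runs the function on each value in the column
true_copy['H_MEDIAN'] = true_copy['H_MEDIAN'].apply(add_a_dollar_sign)

print(alias is wages)                   # True: both names point at one object
print(true_copy['H_MEDIAN'].tolist())   # the copy now holds text with dollar signs
print(wages['H_MEDIAN'].tolist())       # the original numbers are unchanged
```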

Let's combine both lessons into one. 

In [None]:
# Create dataframe with a column of long dates
date_df = pandas.DataFrame(list_of_dates, columns=['Long Date'])

# Create a new column of short dates by applying our regex_convert function to the Long Date column
date_df['Short Date'] = date_df['Long Date'].apply(regex_convert)

date_df.head()

## Recap
Let's recap some new things we did as we learned about regex and dataframe.apply:
- Splitting code between blocks for clarity
- Importing libraries at or near the top of a cell
- Creating a function that can be used in multiple cells
- Loading a dataframe once in a separate cell
- Copying the raw dataframe before making changes