# Standardization

When working with dataframes, we want every value of a given kind to be stored in the same format; this is referred to as *standardization*. A standard format is necessary when combining or analyzing data.

For example, consider the date 12/10/11. It could mean any of the following:
- December 10, 2011 (MONTH/DAY/20YEAR)
- December 10, 1911 (MONTH/DAY/19YEAR)
- October 12, 2011  (DAY/MONTH/20YEAR)
- October 12, 1911  (DAY/MONTH/19YEAR)
- October 11, 2012  (20YEAR/MONTH/DAY)

We wouldn't know without defining a standard beforehand. 
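
To make the ambiguity concrete, here is a minimal sketch that parses the same string under three different assumed field orders using the standard library's `datetime.strptime`. The `formats` dictionary and its labels are illustrative names, not part of the original notebook. Note that `%y` always maps the two-digit year 11 to 2011, so the 19YEAR readings above cannot be produced this way at all.

```python
from datetime import datetime

ambiguous = "12/10/11"

# Each format string encodes a different assumption about field order
formats = {
    "MONTH/DAY/YEAR": "%m/%d/%y",
    "DAY/MONTH/YEAR": "%d/%m/%y",
    "YEAR/MONTH/DAY": "%y/%m/%d",
}

parsed = {}
for label, fmt in formats.items():
    # The same string yields a different date under each assumption
    parsed[label] = datetime.strptime(ambiguous, fmt).date()
    print(f"{label}: {parsed[label]}")
```

All three calls succeed without error, which is exactly the problem: nothing in the string itself tells us which reading was intended.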

Rather than type out a bunch of dates, let's create some random data.

In [2]:
from numpy import random # provides random.randint

# Lists of month names, grouped by how many days each month has
day_months_31 = ['January', 'March', 'May', 'July', 
                 'August', 'October', 'December']
day_months_30 = ['April', 'June', 'September', 'November']
day_months_28 = ['February'] # Ignoring leap years
list_of_dict_months = day_months_31 + day_months_30 + day_months_28

# define a function that creates a random date
def random_date():
    # numpy's randint excludes the upper bound, so this draws an index from 0 to 11
    month = list_of_dict_months[random.randint(0,12)]
    # ensures the month has the correct number of days
    if month in day_months_31:
        day = random.randint(1, 32)
    elif month in day_months_30:
        day = random.randint(1, 31)
    else:
        day = random.randint(1, 29)
    # years 1900 through 2023
    year = random.randint(1900, 2024)
    return f'{month} {day}, {year}'
    
# initialize empty list
list_of_dates = []

# Create a list of 1000 dates
while len(list_of_dates) < 1000:
    list_of_dates.append(random_date())

In [None]:
# After running the previous cell, list_of_dates holds 1000 random dates
list_of_dates

# Standardizing with split() and replace()

In [1]:
# Create dictionary of months and associated number
months_in_order = ['January', 'February', 'March', 'April',
                   'May', 'June', 'July', 'August',
                   'September', 'October', 'November', 'December']
# As we saw before, a dictionary maps each key to a value.
# {} creates an empty dictionary (not an empty set), which we populate below
dict_of_months = {}

for index in range(1, 13): # remember range does not include the last value
    # This adds a key (month name) and an associated value (month number)
    dict_of_months[months_in_order[index-1]] = index

def clean_date(date):
    # str.split(separator) splits the string at each occurrence of separator
    # Here we split on a single space
    list_of_strings = date.split(' ')
    # After split, the list is ['raw_month', 'day,', 'year']
    
    # Remember list index starts at 0
    raw_month = list_of_strings[0]
    
    # The day has a trailing comma; replacing ',' with an empty
    # string deletes it
    raw_day = list_of_strings[1].replace(',', '')
    raw_year = list_of_strings[2]
    
    # Uses the dictionary to convert from month to associated number
    month_number = dict_of_months[raw_month]
    
    # Our standard format: MONTH/DAY/YEAR with a four-digit year
    return f'{month_number}/{raw_day}/{raw_year}'

In [None]:
converted_list_of_dates = []
for date in list_of_dates:
    converted_list_of_dates.append(clean_date(date))

print(converted_list_of_dates)

We converted from the written form to a shorthand. What if we wanted to convert back? We could split by '/', then convert the month from a number to the corresponding name.
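
One possible sketch of that reverse conversion is shown below. The function name `written_date` is hypothetical, and the code assumes the input is always in the MONTH/DAY/YEAR shorthand produced by `clean_date`. The month list is repeated so the cell runs on its own.

```python
months_in_order = ['January', 'February', 'March', 'April',
                   'May', 'June', 'July', 'August',
                   'September', 'October', 'November', 'December']

def written_date(short_date):
    # Assumes MONTH/DAY/YEAR input, e.g. '12/10/2011'
    month_number, day, year = short_date.split('/')
    # Month numbers start at 1, but list indices start at 0
    month_name = months_in_order[int(month_number) - 1]
    return f'{month_name} {day}, {year}'

print(written_date('12/10/2011'))
```

Here a list is enough for the lookup because the month number, minus one, is already a valid index. Going from name to number needed a dictionary because the key was a string.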