# dataframe.apply 

One of the most useful methods on a dataframe is .apply. It lets you run a function you define on every value in a column, or across every row or column of an entire dataframe! It is also usually faster than writing your own loop over every single row (but we won't get into the weeds of why). Let's look at an example to see it in action.

In [2]:
import pandas as pd

# We are going to use this cell just to load the dataframe
# That way, running the code in other cells will be quick since we don't have to re-load the dataframe
dataframe = pd.read_excel("../Datasets/all_data_M_2022.xlsx")

In [None]:
def add_a_dollar_sign(value):
    # Works for any value that can be turned into text, not just integers
    new_text = f'${value}'
    return new_text

# We want to create a new dataframe that we will modify and leave the original one alone
# We don't *need* to do this, but copying the dataframe and modifying the copy
# is faster than having to reload the original from the file again
new_dataframe = dataframe.copy()

# We want to modify the 'H_MEDIAN' column to add a dollar sign
new_dataframe['H_MEDIAN'] = dataframe['H_MEDIAN'].apply(add_a_dollar_sign)
new_dataframe.head()
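
If you want to see what .apply is doing without loading the full spreadsheet, here is a minimal sketch on a tiny made-up dataframe. The `JOB` column, the values, and the `describe_row` helper are invented for illustration and are not part of the dataset. It shows .apply on a single column, the same work written as a loop, and .apply on whole rows using `axis=1`.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy_df = pd.DataFrame({"JOB": ["Baker", "Welder"], "H_MEDIAN": [15.5, 22.0]})

def add_a_dollar_sign(value):
    return f"${value}"

# Series.apply calls the function once for each value in the column
with_apply = toy_df["H_MEDIAN"].apply(add_a_dollar_sign)

# The same result written as an explicit loop over the column
with_loop = [add_a_dollar_sign(value) for value in toy_df["H_MEDIAN"]]

# DataFrame.apply with axis=1 passes each row to the function instead
def describe_row(row):
    return f"{row['JOB']}: ${row['H_MEDIAN']}"

row_labels = toy_df.apply(describe_row, axis=1)

print(with_apply.tolist())
print(row_labels.tolist())
```

Both versions produce the same values; .apply simply saves you from writing the loop and hands you back a column that can go straight into the dataframe.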

Now let's see why we spent all that time creating the RegEx function: with .apply, we can run it on every row of a dataframe in a single line.

## Run this group of code to create the fake data that gets converted
Click the run button next to "2 cells hidden"

In [11]:
from numpy import random # needed for random. functions

# Lists of month names, grouped by how many days each month has
day_months_31 = ['January', 'March', 'May', 'July', 
                 'August', 'October', 'December']
day_months_30 = ['April', 'June', 'September', 'November']
day_months_28 = ['February'] # Ignoring leap years
list_of_dict_months = day_months_31 + day_months_30 + day_months_28

# define a function that creates a random date
def random_date():
    # numpy's randint excludes the upper bound, so this picks an index from 0 to 11
    month = list_of_dict_months[random.randint(0, 12)]
    # ensures the month has the correct number of days
    if month in day_months_31:
        day = random.randint(1, 32)
    elif month in day_months_30:
        day = random.randint(1, 31)
    else:
        day = random.randint(1, 29)
    year = random.randint(1900, 2024)
    return f'{month} {day}, {year}'
    
# initialize empty list
list_of_dates = []

# Create a list of 1000 dates
while len(list_of_dates) < 1000:
    list_of_dates.append(random_date())
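
The upper bounds in the cell above (32, 31, 29) may look like they are off by one. They are not, because numpy's `random.randint(low, high)` never returns `high` itself. Here is a small sketch to check that for yourself; the variable name `draws` and the sample size are just choices made for this illustration.

```python
from numpy import random

# Draw many values with the same bounds used for 31-day months
draws = random.randint(1, 32, size=10_000)

# The smallest possible value is 1 and the largest possible value is 31;
# 32 can never appear because numpy excludes the upper bound.
# (Python's built-in random.randint(a, b) is different: it includes b.)
print(draws.min(), draws.max())
```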

In [12]:
import re

# Create dictionary of months and associated number
months_in_order = ['January', 'February', 'March', 'April', 
                       'May', 'June', 'July', 'August', 
                       'September', 'October', 'November', 'December']
# As we saw before, a dictionary maps each key to an associated value,
# so we start by defining an empty dictionary that we will populate
dict_of_months = {}
    
for index in range(1, 13): # remember range does not include the last value
    # This adds a key (Month name) and an associated value (Month number)
    dict_of_months[months_in_order[index-1]] = index

converted_list_of_dates = []

# Using parentheses groups chunks of string together as a "group"
    # This is how we defined group(1), group(2), group(3)
# .+ means one or more occurrences of any character
# , is just a comma and is not special here
# This effectively searches for the following groups
    # (group 1) (group 2), (group 3)
    # Month Day, Year
pattern_to_match = '(.+) (.+), (.+)'

def regex_convert(date):
    temp_match_object = re.search(pattern_to_match, date) # We know there will be only 1 match, so there is no issue using re.search() here.
    
    raw_month = temp_match_object.group(1) 
    day = temp_match_object.group(2)
    year = temp_match_object.group(3)
    
    month_number = dict_of_months[raw_month] # Converting from word to number
    
    short_date = f'{month_number}/{day}/{year}'
    return short_date
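
Before applying the function to a whole column, it can help to watch the pattern work on a single string. This is a minimal sketch using a made-up date (`example_date` is not from the generated list) and the same pattern as above. It also shows what happens when a string does not fit the pattern, which is worth knowing because `regex_convert` assumes every date matches.

```python
import re

pattern_to_match = '(.+) (.+), (.+)'
example_date = 'March 14, 1995'  # made-up date for illustration

match = re.search(pattern_to_match, example_date)

# .groups() returns all three captured groups at once, as strings
month, day, year = match.groups()
print(month, day, year)

# re.search returns None when nothing matches,
# so calling .group() on the result would raise an error
no_match = re.search(pattern_to_match, '1995-03-14')
print(no_match is None)
```

Note that the day and year come back as text, not numbers, which is fine here because we only place them into a new string.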

## The finale

In [None]:
# Create dataframe with a column of long dates
date_df = pd.DataFrame(list_of_dates, columns=['Long Date'])

# Create a new column of short dates by applying our regex_convert function to the Long Date column
date_df['Short Date'] = date_df['Long Date'].apply(regex_convert)

date_df.head()

## Recap
Let's recap some of the new things we did as we learned about regex and dataframe.apply:
- Splitting code between blocks for clarity
- Importing libraries at or near the top of a cell
- Creating a function that can be used in multiple cells
- Loading a dataframe once in a separate cell
- Copying the raw dataframe before making changes