In [None]:
import pandas as pd

parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")

parks_df.head()

The `apply()` method in pandas is used to apply a function along an axis of a DataFrame or Series. It allows you to perform custom operations on the data within each row or column, depending on how you specify the axis parameter. Here's how it works:

1. **DataFrame.apply()**:
   - When applied to a DataFrame, `df.apply(func, axis=0)` applies the function `func` to each column (axis=0) or `df.apply(func, axis=1)` applies the function to each row (axis=1).
   - The `func` parameter can be a built-in function, lambda function, or a custom function defined by the user.
   - For example, you can write a custom function that sums two columns and apply it to each row with `df.apply(custom_function, axis=1)`. A function that summarizes a whole column, such as computing its range, would use `axis=0` instead.

2. **Series.apply()**:
   - When applied to a Series, `series.apply(func)` applies the function `func` to each element in the Series.
   - Similar to DataFrame.apply(), the `func` parameter can be a built-in function, lambda function, or a custom function.
   - For instance, you can use `series.apply(lambda x: x * 2)` to multiply each element in the Series by 2.

The `apply()` method works by calling the specified function once for each column, row, or element of the DataFrame or Series and collecting the results into a new pandas object. It is a powerful tool for performing complex transformations and calculations, or for building the boolean masks used in filtering.
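
To make the three cases above concrete, here is a minimal sketch on a tiny made-up DataFrame. The name `toy_df` and its values are hypothetical and are not part of the parks data.

```python
import pandas as pd

# Hypothetical toy data, not part of the parks dataset.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series.
column_ranges = toy_df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = toy_df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives one element at a time.
doubled = toy_df["a"].apply(lambda x: x * 2)
```

Here `column_ranges` has one value per column, while `row_sums` and `doubled` each have one value per row.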

For example, each entry in `parks_df['addresses']` is a list of JSON-like records. What if we only wanted the `city` and `stateCode` values from the first listed address? We can do it with `apply`:

In [None]:
parks_df["city_state"] = parks_df["addresses"].apply(
    lambda x: f"{x[0]['city']}, {x[0]['stateCode']}"
)

parks_df[["name", "city_state"]]


Note the syntax, which can be tricky: `lambda x: f"{x[0]['city']}, {x[0]['stateCode']}"`. You can think of `lambda x` as saying "for each x," so this command is saying _for each x in the column, build an f-string from the city and state code of the address at index 0_.
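
If the lambda feels dense, the same logic can be written as a named function. The sketch below is one possible version, run on a small hypothetical stand-in for the `addresses` column. The function name `first_city_state` and the choice to return `None` for an empty address list are assumptions for illustration, not something the dataset requires.

```python
import pandas as pd

# Hypothetical stand-in for parks_df["addresses"]:
# each entry is a list of address dicts.
sample_addresses = pd.Series([
    [{"city": "Bar Harbor", "stateCode": "ME"}, {"city": "Bar Harbor", "stateCode": "ME"}],
    [{"city": "Moab", "stateCode": "UT"}],
    [],
])


def first_city_state(addresses):
    """Return 'City, ST' from the first address, or None if there are none."""
    # len() works whether the entry is a list or a NumPy array.
    if len(addresses) == 0:
        return None
    first = addresses[0]
    return f"{first['city']}, {first['stateCode']}"


city_state = sample_addresses.apply(first_city_state)
```

A named function is easier to test on its own, and it gives you a place to handle edge cases that would make a one-line lambda hard to read.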

And we can easily count which city and state combinations appear most often:

In [None]:
parks_df["city_state"].value_counts()

In the above example, we only used `apply` on one column, but it can be used on entire rows, for example:

In [None]:
parks_df.apply(
    lambda row: f"{row['fullName']}, {row['addresses'][0]['city']}, {row['addresses'][0]['stateCode']}",
    axis=1,
)

But note the differences: the lambda now receives a whole row, so we have to pick out each _column_ by name inside it, _and_ we have to pass `axis=1`. Of course, we don't need to call the argument `x` or `row`; any valid name works:

In [None]:
# why not? 🐦

parks_df.apply(
    lambda bird: f"{bird['fullName']}, {bird['addresses'][0]['city']}, {bird['addresses'][0]['stateCode']}",
    axis=1,
)

But your coworkers and collaborators will like you better if you use descriptive names. The power of `apply` comes from being able to apply arbitrary functions, including ones you define yourself.

In [None]:
def cubed_len(input_string):
    return len(input_string) ** 3


parks_df["city_state"].apply(cubed_len)
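
A custom function can also take extra parameters, and `Series.apply` forwards additional keyword arguments to it. The sketch below assumes a hypothetical variant called `powered_len` and a small made-up Series in place of `parks_df["city_state"]`.

```python
import pandas as pd


def powered_len(input_string, power=3):
    """Return the length of the string raised to the given power."""
    return len(input_string) ** power


# Hypothetical sample values standing in for parks_df["city_state"].
sample_city_state = pd.Series(["Bar Harbor, ME", "Moab, UT"])

# Uses the default power=3, matching cubed_len.
cubed = sample_city_state.apply(powered_len)

# Extra keyword arguments are passed through to the function.
squared = sample_city_state.apply(powered_len, power=2)
```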

In [None]:
parks_df["states_list"] = parks_df["states"].apply(lambda x: x.split(","))

parks_df[["name", "states_list"]]

The thing to be most aware of is that `apply` calls your Python function once _per row_ (or per element), rather than operating on the whole column at once. That means for massive datasets, this can be a _very_ expensive operation. For that reason, column-oriented SQL engines or frameworks like Polars are likely a better fit for those operations, but you'll only need to worry about that in the hundred-thousand to million row range.
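
Before reaching for another framework, it is worth checking whether pandas already has a built-in column method for the job. As a rough sketch, the two `apply` calls above have equivalents on the `.str` accessor. The sample values here are hypothetical stand-ins for the parks columns.

```python
import pandas as pd

# Hypothetical sample values standing in for parks_df columns.
sample = pd.DataFrame({
    "city_state": ["Bar Harbor, ME", "Moab, UT"],
    "states": ["ME", "UT,CO"],
})

# Same result as .apply(cubed_len): string lengths, then cubed column-wise.
cubed_applied = sample["city_state"].apply(lambda s: len(s) ** 3)
cubed_builtin = sample["city_state"].str.len() ** 3

# Same result as .apply(lambda x: x.split(",")).
states_applied = sample["states"].apply(lambda x: x.split(","))
states_builtin = sample["states"].str.split(",")
```

The `.str` methods mostly buy conciseness and built-in handling of missing values. The arithmetic on the resulting numeric column (`** 3`) is the part that runs as a single column-wise operation, so any speedup depends on the workload and is worth measuring on your own data.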

So we come to another pattern: using `apply` to perform row-wise transformations in pandas. This is useful for complex transformations, but it can be slow on large datasets.