In [None]:
import pandas as pd

parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")
parks_df.head()

Write a snippet to fetch all the parks in Utah and order the results by the park name.

In [None]:
parks_df[parks_df["states"].str.contains("UT")].sort_values(by="fullName").head()

Build a query to fetch all the National Parks that cross state boundaries. 

Hint: the `states` column is a comma-delimited string, e.g. `UT,CA,NC`. The parks data also includes sites that aren't National Parks.

In [None]:
parks_df[
    (parks_df["designation"].str.contains("National Park"))
    & (parks_df["states"].str.contains(","))
]

For all national parks, return the `states` column as a list type, with each element being one state.

In [None]:
parks_df["states_list"] = parks_df["states"].str.split(",")
parks_df["states_list"]
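
Once the states are split into lists, `explode` gives one row per park/state pair, which allows exact matching on a state code rather than substring matching. A minimal sketch on made-up rows (the `toy_df` frame and its park names are hypothetical, not taken from the NPS data):

```python
import pandas as pd

# Hypothetical toy data standing in for parks_df
toy_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park B", "Park C"],
        "states": ["UT", "CA,NV", "ID,MT,WY"],
    }
)

# Split the comma-delimited string into a Python list per row
toy_df["states_list"] = toy_df["states"].str.split(",")

# One row per (park, state) pair
exploded = toy_df.explode("states_list")

# Exact match on a single state code
parks_in_nv = exploded.loc[exploded["states_list"] == "NV", "fullName"].tolist()
```

The exploded frame repeats the original index for each state, so it can be grouped back to one row per park if needed.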

Which parks are in either Montana or Wyoming?

In [None]:
# .copy() makes an independent frame, so adding columns later avoids SettingWithCopyWarning
national_parks_df = parks_df[parks_df["designation"].str.contains("National Park")].copy()

national_parks_df[
    national_parks_df["states"].str.contains("MT")
    | national_parks_df["states"].str.contains("WY")
].head()

What about _both_ Montana _and_ Wyoming?

In [None]:
national_parks_df[
    national_parks_df["states"].str.contains("MT")
    & national_parks_df["states"].str.contains("WY")
].head()

Which park is in the greatest number of states?

In [None]:
national_parks_df["num_states"] = national_parks_df["states_list"].str.len()

national_parks_df[["fullName", "num_states", "states_list"]].sort_values(
    by="num_states", ascending=False
).head()

Now, how many parks are in each "group" of state border-crossings?

Hint: we're grouping by the _number_ of states.

In [None]:
national_parks_df.groupby("num_states")["fullName"].count()

What's the percentage share of the total? Hint: window functions might be helpful.

In [None]:
# Count parks per group once, then express each count as a percentage of the total
counts = national_parks_df.groupby("num_states")["fullName"].count()

counts / counts.sum() * 100
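
On the window-function hint: the closest pandas analogue to `COUNT(*) OVER (PARTITION BY ...)` is probably `groupby(...).transform(...)`, which returns one value per original row instead of one per group. A small sketch on hypothetical data (the `toy_df` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: one row per park with its state count
toy_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park B", "Park C", "Park D"],
        "num_states": [1, 1, 1, 2],
    }
)

# Window-style count: every row gets the size of its own group
toy_df["group_size"] = toy_df.groupby("num_states")["fullName"].transform("count")

# Percentage share per group, one row per distinct num_states value
pct_share = toy_df["num_states"].value_counts(normalize=True).sort_index() * 100
```

`value_counts(normalize=True)` is a shortcut for dividing the group counts by their sum.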

Write a query that returns the _largest_ campground in each park, measured by its total number of campsites. Note: when creating a `total_campsites` column, you might have to fill `NaN` values with zero. You might also find the `idxmax` function helpful for this exercise 🙂

In [None]:
import pandas as pd

campsites_df = pd.read_parquet("../../data/nps/nps_public_data_campgrounds.parquet")

In [None]:
national_parks_df["parkCode"] = national_parks_df["parkCode"].astype("str")
campsites_df["parkCode"] = campsites_df["parkCode"].astype("str")

join_df = national_parks_df.merge(
    campsites_df, how="inner", on=["parkCode"], suffixes=("_park", "_camp")
)

# Fill missing counts with zero before adding: NaN + number is NaN,
# so filling after the sum would discard the known half of the total
join_df["total_campsites"] = (
    join_df["numberOfSitesFirstComeFirstServe"].fillna(0)
    + join_df["numberOfSitesReservable"].fillna(0)
)

join_df[["name_park", "name_camp", "total_campsites"]].loc[
    join_df.groupby("fullName")["total_campsites"].idxmax()
].sort_values(by="total_campsites", ascending=False).head()
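
The "largest row per group" pattern can be seen in isolation on a tiny frame. This is only a sketch under assumed data: `toy_df`, its column names, and its values are hypothetical stand-ins for the joined parks and campgrounds.

```python
import pandas as pd

# Hypothetical toy data: two campgrounds in Park A, one in Park B
toy_df = pd.DataFrame(
    {
        "park": ["Park A", "Park A", "Park B"],
        "camp": ["Camp 1", "Camp 2", "Camp 3"],
        "first_come": [10.0, None, 5.0],
        "reservable": [None, 30.0, 5.0],
    }
)

# NaN + number is NaN, so fill each column before adding
toy_df["total_campsites"] = toy_df["first_come"].fillna(0) + toy_df["reservable"].fillna(0)

# idxmax returns the row label of the maximum within each group
largest_idx = toy_df.groupby("park")["total_campsites"].idxmax()

# Use those labels to pull the full winning rows
largest = toy_df.loc[largest_idx, ["park", "camp", "total_campsites"]]
```

Because `idxmax` returns index labels, passing them to `.loc` recovers every column of the winning row, not just the maximum value.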

Say you'll be in California this spring and have time for three National Park visits. How many combinations of national parks can you visit? Can you return the combinations in a list ordered by the name of the first park?

In [None]:
import itertools

unique_ca_parks = national_parks_df[national_parks_df["states"].str.contains("CA")][
    "name"
].unique()

combinations_ca_parks = list(itertools.combinations(unique_ca_parks, 3))

# Number of distinct three-park trips
print(len(combinations_ca_parks))

sorted(combinations_ca_parks)

Find the combinations in alphabetical order, that is, the first letter of each visit occurs in the order of the alphabet, e.g. `[C]hannel Islands, [D]eath Valley, [J]oshua Tree` would satisfy that condition.

In [None]:
[c for c in combinations_ca_parks if list(c) == sorted(c)]
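
An alternative worth knowing: `itertools.combinations` emits tuples in the order of its input, so sorting the input first makes every combination alphabetical without a filter. A sketch with a hypothetical, deliberately unsorted list of park names:

```python
import itertools
import math

# Hypothetical park names, deliberately unsorted
parks = ["Yosemite", "Death Valley", "Joshua Tree", "Channel Islands"]

# combinations() preserves input order, so sorting first
# guarantees each tuple is already in alphabetical order
alphabetical = list(itertools.combinations(sorted(parks), 3))

# n choose k gives the count without building the list
expected_count = math.comb(len(parks), 3)
```

Each unordered set of three parks appears exactly once in the output, so the sorted-input version returns the same number of trips as the unfiltered one.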

Ok, now let's have some fun with `apply`

In [None]:
import pandas as pd

alerts_df = pd.read_parquet("../../data/nps/nps_public_data_alerts.parquet")

Create a new column `alert_date` that converts `lastIndexedDate` to a date (no time).

In [None]:
alerts_df["alert_date"] = pd.to_datetime(alerts_df["lastIndexedDate"]).dt.date

Grouping by `alert_date` and `category`, return a `groupby` result that counts the number of alerts per category per day. Be sure to reset the index so that each row has an `alert_date`.

In [None]:
alerts_df.groupby(["alert_date", "category"])["title"].count().reset_index()
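
The date conversion and the per-day count can be checked end to end on a few rows. This is a sketch only: `toy_alerts`, its timestamps, categories, and titles are invented, and `n_alerts` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical toy data standing in for alerts_df
toy_alerts = pd.DataFrame(
    {
        "lastIndexedDate": [
            "2023-05-01 08:30:00",
            "2023-05-01 17:45:00",
            "2023-05-02 09:00:00",
        ],
        "category": ["Caution", "Caution", "Information"],
        "title": ["Trail closed", "Road work", "New hours"],
    }
)

# Parse the strings, then drop the time component
toy_alerts["alert_date"] = pd.to_datetime(toy_alerts["lastIndexedDate"]).dt.date

# size() counts rows per group; reset_index(name=...) names the count column
daily_counts = (
    toy_alerts.groupby(["alert_date", "category"]).size().reset_index(name="n_alerts")
)
```

Using `size()` with `reset_index(name=...)` avoids ending up with a count column that is still called `title`.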

Join the `parks_df` on `parkCode` and create a new df grouping alerts by date and park.

In [None]:
alerts_joined = parks_df.merge(
    alerts_df, how="inner", on=["parkCode"], suffixes=("_park", "_alert")
)

alerts_agg = (
    alerts_joined.groupby(["alert_date", "name"])["title"].count().reset_index()
)

alerts_agg.head()

Return a dataframe that contains the latest alert date for each park

In [None]:
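most_recent_alerts = alerts_agg.groupby("name")["alert_date"].max()

# Keep each park's rows from its most recent alert date
alerts_agg.groupby("name").apply(
    lambda x: x[x["alert_date"] == most_recent_alerts[x.name]]
)

How many parks had alerts in 2023?

In [None]:
# alert_date holds plain dates, so convert back to datetimes to filter on the year
alerts_2023 = alerts_agg[pd.to_datetime(alerts_agg["alert_date"]).dt.year == 2023]

alerts_2023["name"].nunique()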