In [None]:
import pandas as pd

The first difference we'll note with Python and Pandas is that instead of selecting from a database, we'll be reading in individual files to query with Pandas. This is primarily a byproduct of _when_ we use tools like Pandas. Frequently, we _will not_ be querying a database, as with SQL, but structured or semi-structured data stored in JSON, CSV, Parquet, or other formats.
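
As a rough illustration of reading those other formats, the sketch below uses small in-memory strings in place of real files. The sample rows and the names `csv_text` and `json_text` are made up for this example; with real data you would pass a file path instead of a `StringIO` object.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for files on disk
csv_text = (
    "parkCode,fullName\n"
    "thro,Theodore Roosevelt National Park\n"
    "yell,Yellowstone National Park\n"
)
json_text = '[{"parkCode": "thro", "states": "ND"}, {"parkCode": "yell", "states": "ID,MT,WY"}]'

# Each reader returns a regular DataFrame, regardless of the source format
csv_df = pd.read_csv(StringIO(csv_text))
json_df = pd.read_json(StringIO(json_text))

print(csv_df)
print(json_df)
```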

Here we're using `read_parquet` to pull in the corresponding Parquet file for the Parks dataset.

In [None]:
parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")
parks_df.head()

Let's perform some operations similar to our DuckDB example: renaming a column and expanding the `STRUCT` column, `operatingHours`.
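
One thing to watch for when renaming: by default, `rename` silently skips keys that match no column. The sketch below shows this on a made-up one-row dataframe (`toy_df` is hypothetical), along with the `errors="raise"` option, which turns the silent no-op into a `KeyError`.

```python
import pandas as pd

# Hypothetical toy dataframe, not the Parks dataset
toy_df = pd.DataFrame({"operatingHours": ["9AM - 5PM"], "fullName": ["Example Park"]})

# Wrong casing: the key matches no column, so nothing is renamed and no error is raised
unchanged = toy_df.rename(columns={"operatinghours": "operating_hours"})
print(list(unchanged.columns))

# errors="raise" makes the mismatch visible
try:
    toy_df.rename(columns={"operatinghours": "operating_hours"}, errors="raise")
    raised = False
except KeyError as exc:
    raised = True
    print(f"KeyError: {exc}")
```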

In [None]:
from pprint import pprint

rename_dict = {"operatingHours": "operating_hours"}

# Note that rename requires the casing and names to be precisely correct.
# By default it won't raise an error if they are not; it just won't rename the columns, so be sure to check.

parks_df.rename(columns=rename_dict, inplace=True)

pprint(list(parks_df.columns))

In [None]:
# We can inspect the operating_hours column to better understand the data structure
# Let's look at a sample record

parks_df["operating_hours"].iloc[0]

This record is a _list_ of dictionaries... That means we can't just unpack the values since each row could have more than one value. DuckDB handled that for us, but Pandas won't! 

This will be a theme throughout the remainder of the course. Certain tools might be more effective than others. It's our job to figure out what makes sense.

In [None]:
# The explode() method will unpack our list into individual rows
parks_df_exploded = parks_df.explode("operating_hours")
parks_df_exploded.head()

From there, we can unnest the operating hours column with `pd.json_normalize`.
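
To see what `pd.json_normalize` does with nested dictionaries, here is a minimal sketch on made-up records (the values are hypothetical, loosely shaped like the operating hours data). Nested keys are flattened into dotted column names.

```python
import pandas as pd

# Hypothetical records shaped like a single operating-hours entry
records = [
    {"name": "Summer Hours", "standardHours": {"monday": "9AM - 5PM", "tuesday": "Closed"}},
    {"name": "Winter Hours", "standardHours": {"monday": "10AM - 4PM", "tuesday": "Closed"}},
]

flat_df = pd.json_normalize(records)

# Nested keys become "parent.child" columns
print(list(flat_df.columns))
print(flat_df)
```

This is where column names such as `standardHours.monday` come from.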

In [None]:
park_operating_hours_df = pd.json_normalize(parks_df_exploded["operating_hours"])

park_operating_hours_df.rename(
    columns={"name": "category", "description": "operating_hours_description"},
    inplace=True,
)

park_operating_hours_df.head()

But now we have a separate dataframe! To join it back to our original df, we can use `pd.concat` with `axis=1`, which tells Pandas to concatenate our dataframes column-wise (side by side).

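We also have to call `reset_index` on each of our dataframes first. With `axis=1`, `pd.concat` aligns rows by index label, and after `explode` the index contains repeated labels, so resetting gives both dataframes the same positional index. Yes, this is confusing. Unfortunately, the only way to learn is through trial, error, [StackOverflow](https://stackoverflow.com/a/47657006), and the Pandas [documentation](https://pandas.pydata.org/docs/reference/api/pandas.concat.html).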
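
The index issue is easier to see on a small example. The sketch below uses a made-up two-park dataframe (`toy_df` and its values are hypothetical) and passes a plain list to `pd.json_normalize` so the result always gets a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical toy data: park A has two hours entries, park B has one
toy_df = pd.DataFrame(
    {
        "park": ["A", "B"],
        "hours": [
            [{"name": "Summer", "open": "9AM"}, {"name": "Winter", "open": "10AM"}],
            [{"name": "All Year", "open": "8AM"}],
        ],
    }
)

exploded = toy_df.explode("hours")
print(list(exploded.index))  # explode repeats the original row label

normalized = pd.json_normalize(exploded["hours"].tolist())
print(list(normalized.index))  # fresh positional index

# Resetting the exploded index lets the rows line up by position
combined = pd.concat([exploded.reset_index(drop=True), normalized], axis=1)
print(combined[["park", "name", "open"]])
```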

In [None]:
# Because the rows of both dataframes are in the same order, we can simply concatenate them side by side
parks_with_hours_df = pd.concat(
    [
        parks_df_exploded.reset_index(drop=True),
        park_operating_hours_df.reset_index(drop=True),
    ],
    axis=1,
)

parks_with_hours_df.head()

We can then perform filters, as in SQL.

In [None]:
parks_with_hours_df[parks_with_hours_df["category"] == "Hours of Operation"]

Note the filter syntax:

```
df[
    df[column] [COMPARISON OPERATOR] [VALUE]
]
```

We're taking the dataframe and applying a _mask_: a boolean Series that is `True` for each row where the column satisfies the condition. This is fundamentally different from filtering in SQL and can take some getting used to.
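
To make the mask idea concrete, the sketch below builds the mask as its own variable before applying it. The dataframe `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame(
    {
        "park": ["A", "B", "C"],
        "category": ["Hours of Operation", "Visitor Center", "Hours of Operation"],
    }
)

# The comparison produces one True/False value per row
mask = toy_df["category"] == "Hours of Operation"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
filtered = toy_df[mask]
print(filtered)
```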

For multiple filters, we can combine masks with the operators `&` (and) and `|` (or). Wrap each condition in parentheses, since `&` and `|` bind more tightly than comparisons such as `==`. For example:

```
df[
    (CONDITION 1 & CONDITION 2) | CONDITION 3
]
```
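
A runnable version of that pattern might look like the following sketch. The dataframe and its `state` and `visitors` columns are made up for this example.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame(
    {
        "park": ["A", "B", "C", "D"],
        "state": ["ND", "ND", "WY", "MT"],
        "visitors": [100, 900, 500, 50],
    }
)

# (CONDITION 1 & CONDITION 2) | CONDITION 3
result = toy_df[
    ((toy_df["state"] == "ND") & (toy_df["visitors"] > 500))
    | (toy_df["state"] == "MT")
]
print(result)
```

Only park B (in ND with more than 500 visitors) and park D (in MT) pass the filter.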

In [None]:
# Let's get the hours of operation for Theodore Roosevelt National Park based on the description

parks_with_hours_df[
    (parks_with_hours_df["category"] == "Hours of Operation")
    & (
        parks_with_hours_df["operating_hours_description"].str.contains(
            "Theodore Roosevelt"
        )
    )
]

# Note the multiline formatting: it makes long filters easier to read

Distinct values can be accessed with the `unique()` method.
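
As a small sketch on made-up values, here is `unique()` next to two related Series methods, `nunique()` and `value_counts()`, which roughly correspond to `COUNT(DISTINCT ...)` and `GROUP BY ... COUNT(*)` in SQL.

```python
import pandas as pd

# Hypothetical sample of opening hours
hours = pd.Series(["9AM - 5PM", "Closed", "9AM - 5PM", "All Day"])

distinct = list(hours.unique())  # distinct values, in order of first appearance
n_distinct = hours.nunique()     # number of distinct values
counts = hours.value_counts()    # occurrences of each value

print(distinct)
print(n_distinct)
print(counts)
```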

In [None]:
pprint(list(parks_with_hours_df["standardHours.thursday"].unique())[:5])

The Pandas equivalent of `CASE WHEN` is NumPy's `np.where` function. It follows a similar pattern.

We'll set a condition to test, `parks_with_hours_df['standardHours.monday'] == 'unknown'`. Where it is true, we'll return `Closed`. Where it is not, we'll return the existing value.
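
The three arguments of `np.where` are the condition, the value to use where it is true, and the value to use where it is false. A minimal sketch on a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Monday hours
monday = pd.Series(["unknown", "9:00AM - 5:00PM", "Closed", "unknown"])

# np.where(condition, value_if_true, value_if_false), evaluated row by row
monday_hours = np.where(monday == "unknown", "Closed", monday)

print(list(monday_hours))
```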

In [None]:
import numpy as np

# CASE monday WHEN 'unknown' THEN 'Closed' ELSE monday END as monday_hours,

days = [
    "monday",
    "tuesday",
    "wednesday",
    "thursday",
    "friday",
    "saturday",
    "sunday",
]

# Apply the same CASE WHEN logic to each day's column
for day in days:
    parks_with_hours_df[f"{day}_hours"] = np.where(
        parks_with_hours_df[f"standardHours.{day}"] == "unknown",
        "Closed",
        parks_with_hours_df[f"standardHours.{day}"],
    )


parks_with_hours_df["open_seven_days_a_week"] = np.where(
    (
        (parks_with_hours_df["monday_hours"] != "Closed")
        & (parks_with_hours_df["tuesday_hours"] != "Closed")
        & (parks_with_hours_df["wednesday_hours"] != "Closed")
        & (parks_with_hours_df["thursday_hours"] != "Closed")
        & (parks_with_hours_df["friday_hours"] != "Closed")
        & (parks_with_hours_df["saturday_hours"] != "Closed")
        & (parks_with_hours_df["sunday_hours"] != "Closed")
    ),
    True,
    False,
)

cols_to_select = [
    "fullName",
    "open_seven_days_a_week",
    "monday_hours",
    "tuesday_hours",
    "wednesday_hours",
    "thursday_hours",
    "friday_hours",
    "saturday_hours",
    "sunday_hours",
]

parks_with_hours_df[cols_to_select].head()

In [None]:
open_seven_days_df = parks_with_hours_df[parks_with_hours_df["open_seven_days_a_week"]]

open_seven_days_df[cols_to_select].head()