In [None]:
import pandas as pd
import numpy as np

parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")
parks_df.head()

Coming directly from a SQL module on transformation, we'll approach aggregations in Pandas the same way. That means we're assuming you've completed module 2 _or_ have a basic understanding of SQL `GROUP BY`.

Pandas `groupby` works quite similarly to SQL, with a Pythonic twist:

In [None]:
# In the groupby method, we specify the column we want to group by, and then the column we want to count in brackets.

parks_df.groupby("states")["states"].count()
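
To make the comparison with SQL concrete, here is a minimal sketch on a small, made-up table. The `toy_parks_df` frame and its values are hypothetical stand-ins, not the NPS data. It also shows that a row listing two states is counted under its own combined key.

```python
import pandas as pd

# Hypothetical toy data standing in for the parks table.
toy_parks_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park B", "Park C", "Park D"],
        "states": ["CA", "CA", "UT", "CA,NV"],
    }
)

# Roughly equivalent to:
#   SELECT states, COUNT(states) FROM toy_parks GROUP BY states
toy_counts = toy_parks_df.groupby("states")["states"].count()

# "CA,NV" is its own group, separate from "CA".
print(toy_counts)
```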

Of course, a single row can list several states, so we need to break `states` apart before counting. Think in terms of Python: create a new column by _splitting_ `states` on the comma separator to convert it to a list.

In [None]:
parks_df["states_list"] = parks_df["states"].str.split(",")

In [None]:
parks_df["states_list"].head()

Now we can perform an `explode`, like we did in lesson 1:

In [None]:
parks_exploded_df = parks_df.explode("states_list").rename(
    columns={"states_list": "state"}
)
parks_exploded_df.head(1)
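
As a small illustration of the split-then-explode step, the sketch below runs the same two operations on a hypothetical two-row frame (`toy_parks_df` is made up for this example). Note that `explode` repeats the original index label for each list element.

```python
import pandas as pd

# Hypothetical toy data: the second park spans two states.
toy_parks_df = pd.DataFrame(
    {"fullName": ["Park A", "Park B"], "states": ["CA", "CA,NV"]}
)

# "CA,NV" -> ["CA", "NV"]; "CA" -> ["CA"]
toy_parks_df["states_list"] = toy_parks_df["states"].str.split(",")

# One output row per list element; other columns are repeated.
toy_exploded_df = toy_parks_df.explode("states_list").rename(
    columns={"states_list": "state"}
)

print(toy_exploded_df[["fullName", "state"]])
```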

Finally, we can perform a groupby on the new `state` column:

In [None]:
parks_exploded_df.groupby("state")["state"].count()[:5]

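Above, we used the slice `[:5]` to obtain the first 5 results. Those are the first 5 in index order, not the 5 largest counts. To get the top 5, we can sort first: any Series or DataFrame can be sorted using `sort_values`.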

In [None]:
parks_exploded_df.groupby("state")["state"].count().sort_values(ascending=False)[:5]
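
The difference between slicing before and after sorting is easy to see on a small, made-up example (the `toy_exploded_df` values below are hypothetical):

```python
import pandas as pd

# Hypothetical exploded data: one row per (park, state) pair.
toy_exploded_df = pd.DataFrame({"state": ["UT", "UT", "UT", "CA", "NV", "NV"]})

toy_counts = toy_exploded_df.groupby("state")["state"].count()

# Slicing alone keeps the first two groups in index (alphabetical) order.
first_two = toy_counts[:2]

# Sorting first, then slicing, keeps the two largest counts.
top_two = toy_counts.sort_values(ascending=False)[:2]

print(first_two)
print(top_two)
```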

But what if we only want those designated as National Parks?

In [None]:
national_park_count_df = (
    parks_exploded_df[parks_exploded_df["designation"] == "National Park"]
    .groupby("state")["state"]
    .count()
    .sort_values(ascending=False)[:5]
)

Note how we formatted our query: by wrapping it in parentheses `()`, we were able to split the query across multiple lines and organize the operations. That makes it much easier to read than the single-line alternative. The same can be accomplished with backslashes.

In [None]:
national_park_count_df
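
For comparison, here is a sketch of the backslash form mentioned above next to the parenthesized form, using a hypothetical `toy_df`. Both produce the same result. With backslashes, each one must be the very last character on its line.

```python
import pandas as pd

# Hypothetical toy data with a designation and a state per row.
toy_df = pd.DataFrame(
    {
        "designation": [
            "National Park",
            "National Park",
            "National Park",
            "National Monument",
        ],
        "state": ["CA", "CA", "UT", "CA"],
    }
)

# Parentheses: line breaks are allowed anywhere inside the brackets.
parenthesized = (
    toy_df[toy_df["designation"] == "National Park"]
    .groupby("state")["state"]
    .count()
    .sort_values(ascending=False)
)

# Backslashes: each continued line must end with a backslash.
backslashed = toy_df[toy_df["designation"] == "National Park"] \
    .groupby("state")["state"] \
    .count() \
    .sort_values(ascending=False)

print(parenthesized.equals(backslashed))
```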

Like our earlier example, let's find the campgrounds with the fewest and most sites.

In [None]:
campgrounds_df = pd.read_parquet("../../data/nps/nps_public_data_campgrounds.parquet")
campgrounds_df.head()

In [None]:
campgrounds_df["total_sites"] = (
    campgrounds_df["numberOfSitesFirstComeFirstServe"]
    + campgrounds_df["numberOfSitesReservable"]
)

national_park_campgrounds_df = campgrounds_df.merge(
    parks_df, on="parkCode", how="inner", suffixes=("_campground", "_park")
)

national_park_campgrounds_df = national_park_campgrounds_df[
    (national_park_campgrounds_df["designation"] == "National Park")
    & (national_park_campgrounds_df["total_sites"] > 0)
]
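
The `suffixes` argument matters when both tables share a column name other than the join key. The sketch below uses two hypothetical frames that both have a `name` column, which is an assumption about why `name_campground` appears in the merged result.

```python
import pandas as pd

# Hypothetical toy tables; both have a "name" column besides the join key.
toy_campgrounds_df = pd.DataFrame(
    {"name": ["Camp X", "Camp Y", "Camp Z"], "parkCode": ["aaa", "aaa", "zzz"]}
)
toy_parks_df = pd.DataFrame(
    {
        "name": ["Park A", "Park B"],
        "parkCode": ["aaa", "bbb"],
        "designation": ["National Park", "National Monument"],
    }
)

# Inner join keeps only campgrounds whose parkCode exists in the parks table.
# Overlapping non-key columns get the suffixes appended.
toy_merged_df = toy_campgrounds_df.merge(
    toy_parks_df, on="parkCode", how="inner", suffixes=("_campground", "_park")
)

print(toy_merged_df.columns.tolist())
print(toy_merged_df)
```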

In [None]:
min_sites = national_park_campgrounds_df["total_sites"].min()

max_sites = national_park_campgrounds_df["total_sites"].max()

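# Copy the filtered rows so that adding a column below works on an independent
# frame and does not trigger pandas' SettingWithCopyWarning.
min_max_df = national_park_campgrounds_df[
    (national_park_campgrounds_df["total_sites"] == min_sites)
    | (national_park_campgrounds_df["total_sites"] == max_sites)
].copy()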

min_max_df["min_max"] = np.where(
    min_max_df["total_sites"] == min_sites,
    "min",
    np.where(min_max_df["total_sites"] == max_sites, "max", "other"),
)

In [None]:
min_max_df[["fullName", "name_campground", "total_sites", "min_max"]].sort_values(
    by="total_sites", ascending=False
)

In [None]:
max_sites
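
The min/max labeling can be checked on a small, made-up frame. In this sketch the data is hypothetical, and `isin` is swapped in for the two `==` comparisons joined with `|`; it selects the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy campgrounds.
toy_campgrounds_df = pd.DataFrame(
    {
        "name_campground": ["Camp A", "Camp B", "Camp C", "Camp D"],
        "total_sites": [5, 120, 40, 5],
    }
)

toy_min = toy_campgrounds_df["total_sites"].min()
toy_max = toy_campgrounds_df["total_sites"].max()

# Keep only rows at either extreme; .copy() gives an independent frame.
toy_extremes_df = toy_campgrounds_df[
    toy_campgrounds_df["total_sites"].isin([toy_min, toy_max])
].copy()

# Every remaining row is either the min or the max, so one np.where suffices.
toy_extremes_df["min_max"] = np.where(
    toy_extremes_df["total_sites"] == toy_min, "min", "max"
)

print(toy_extremes_df)
```

Ties are kept: both campgrounds with 5 sites are labeled `min`.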

What about the parks? Now, we just need to group by park!

In [None]:
min_sites_park = national_park_campgrounds_df.groupby(["fullName"])["total_sites"].min()

max_sites_park = national_park_campgrounds_df.groupby(["fullName"])["total_sites"].max()

park_campgrounds_agg = (
    national_park_campgrounds_df.groupby("fullName")["total_sites"]
    .sum()
    .sort_values(ascending=False)
)

park_campgrounds_agg
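
If you want the minimum, maximum, and total per park in one table, `agg` accepts a list of aggregation names. This is a sketch on hypothetical data, not the notebook's own approach:

```python
import pandas as pd

# Hypothetical toy data: Park A has two campgrounds, Park B has one.
toy_campgrounds_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park A", "Park B"],
        "total_sites": [10, 30, 25],
    }
)

# One groupby, three aggregations; the result has one column per aggregation.
toy_park_stats = (
    toy_campgrounds_df.groupby("fullName")["total_sites"]
    .agg(["min", "max", "sum"])
    .sort_values(by="sum", ascending=False)
)

print(toy_park_stats)
```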

Pandas `groupby` is pretty similar to its SQL equivalent! You can perform all the basic aggregations we used here: counts, sums, minimums, and maximums. If you'd like to play around with more `groupby` examples, check out the Pandas [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html)!