# Exploring and Summarizing Data

This lesson covers ways to explore a dataset and compute summary statistics using a variety of Pandas methods and functions.

## Learning Objectives

By the end of this lesson you will be able to:

- Summarize numerical data with [Descriptive Statistics](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics) - mean, std, var, median, min, max
- Sort rows of a DataFrame by a particular column - [Sorting](https://pandas.pydata.org/docs/user_guide/basics.html#sorting)
- Summarize categorical data with value counts
- Select a subset of rows that meet a logical criterion using a mask - [Boolean Indexing](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)
- Filter rows using the Pandas query method - [The query() Method](https://pandas.pydata.org/docs/user_guide/indexing.html#the-query-method)
- Use groupby operations to summarize subsets of categorical and numerical data in a DataFrame


## Datasets used

- The Clean Trees dataset produced in the previous lesson


In [None]:
# load pandas using the alias pd
import pandas as pd

## Loading Data

Run the code cell below to load the clean City of Pittsburgh Trees dataset produced in the previous lesson.

In [None]:
# load the clean trees dataset
trees = pd.read_csv("pgh-trees-clean.csv")
trees

#### Task - How many trees?

1. How many trees are in the dataset? Use whatever computational technique you want (there are several ways).
2. Use a `print()` function to display the answer in the sentence `There are ??? trees in Pittsburgh (according to this dataset).` Replace the `???` with the number of trees.



In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
print("There are", trees.shape[0], "trees in Pittsburgh (according to this dataset).")

#### Task - I am the `info()` I speak for the `trees`

1. Use the `info()` method to get some information about the trees dataset
2. How many columns are in the dataset? 
3. Look at the column names to infer the information they contain. What groupings of columns might you create?

In [None]:
# your code here


#### Answer - I am the `info()` I speak for the `trees`

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees.info()

*answer*

2. There are 58 columns
3. There are many possible groupings, here is one potential set of groups:

- Columns about the tree's location
- Columns about characteristics of the tree itself
- Information about the tree's economic value
- Information about the tree's ecological value
- Administrative information and other


#### Task - Columns about locations

1. Look at the column names, which columns do you think might be relevant to the tree's location within the city?
2. Create a python list called `location_columns` that contains strings with the names of these columns.

You don't need to be comprehensive and there are subjective decisions in these categories.

In [None]:
# your code here


#### Answer - Columns about tree locations

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
# create a list of columns about a tree's location
location_columns = ["address_number", "street", "neighborhood", "latitude", "longitude"]

#### Task - Columns about the trees themselves

1. Look at the column names, which columns do you think might contain information about the tree itself?
2. Create a python list called `tree_info_columns` that contains strings with the names of these columns.


You don't need to be comprehensive and there are subjective decisions in these categories.

In [None]:
# your code here


#### Answer - Columns about the trees themselves

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# create a list of columns containing information about a tree
tree_info_columns = ["common_name", "scientific_name", "height", 
                "width", "growth_space_length", "growth_space_width", 
                "growth_space_type", "diameter_base_height", "stems","overhead_utilities", "condition"]

#### Task - Valuing trees economically & environmentally

The trees dataset contains a bunch of columns related to the "dollar value" and environmental benefits of the trees. Rather than type out all those column names, identify a single column for each.

1. What single column do you think has a useful summary of the economic value of a tree?
2. What single column do you think has a useful summary of the ecological value of a tree?

#### Answer - Valuing trees economically & environmentally

Click on the ellipses (...) below to see the answers.

*answer*

1. The `overall_benefits_dollar_value` column is probably the best for summarizing the economic value. 
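2. There are a couple of columns that might be useful, but the air quality benefits total is probably the best for summarizing environmental benefits. In the dataset this appears as `air_quality_benfits_total_lbs` (note the `benfits` spelling in the column names).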

## Selecting Columns

The dataset has many columns with information about trees, so much that it is hard to see all at once. By grouping and identifying specific columns we can begin to explore the tree data more deliberately.

#### Task - Just the tree information

1. Use your newly created list of columns with information about the trees, `tree_info_columns` as an index to select just those columns from the larger `trees` DataFrame.

In [None]:
# your code here


#### Answer - Just the tree information

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# select the columns 
trees[tree_info_columns]

#### Task - The tallest tree in the land

1. Look at the results, is there a column that you included in your `tree_info_columns` list that could help answer the question, what is the tallest tree in Pittsburgh?*
2. What are some of the ways to manipulate the data to answer this question?

\* According to this dataset. Not every tree in the city is in this dataset, just those managed by the city.

#### Answer - The tallest tree in the land

Click on the ellipses (...) below to see the answers.

*answer*

1. The `height` column is most likely the best column for answering the question about the tallest tree in the dataset (and so, as far as this data goes, in Pittsburgh).
2. We could calculate the max value, or we could sort the data based on the `height` column.

#### Task - The max value

1. Select just the column you identified in the last task
2. Review the documentation for [Pandas Series](https://pandas.pydata.org/pandas-docs/version/1.5/reference/series.html#). Find the method that will tell us the height of the tallest tree. Use that method to find the answer to our question.

In [None]:
# your code here


#### Answer - The max value

Click on the ellipses (...) below to see the answers.

In [None]:
# find the max value of the height column
trees['height'].max()

#### Task - What and Where is the tallest tree?

The previous task gave us the answer to our question about the tallest tree in Pittsburgh, but it doesn't provide much context about that tree. It would be nice to have a bit more information about it.

1. Review the [Pandas Series documentation](https://pandas.pydata.org/pandas-docs/version/1.5/reference/series.html#reindexing-selection-label-manipulation) to find a function that will help us find an identifier for the row with the largest value for height.
2. Use that method on the `height` column and save the results into a variable called `tallest_tree_id`
3. Use the variable with the appropriate indexing property to select the row of the `trees` dataset for the tallest tree
4. What type of tree is it? What neighborhood is it located in?

In [None]:
# your code here


#### Answer - What and where is the tallest tree?

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

tallest_tree_id = trees['height'].idxmax()
trees.loc[tallest_tree_id]

*answer*

The tallest tree in the dataset is a Japanese lilac and it is located in Lower Lawrenceville
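
As a rough illustration of how `idxmax()` and `.loc` fit together, here is a minimal sketch on a made-up three-row table. The names, heights, and index labels are invented for the example and are not taken from the trees dataset. The point is that `idxmax()` returns an index *label*, which is why `.loc` (label-based) is the matching indexer.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels
toy = pd.DataFrame(
    {"common_name": ["Oak", "Maple", "Pine"], "height": [40, 25, 60]},
    index=[101, 102, 103],
)

# idxmax() returns the index label of the row with the largest value
tallest_id = toy["height"].idxmax()

# .loc selects a row by its label
tallest_row = toy.loc[tallest_id]

print(tallest_id)
print(tallest_row["common_name"])
```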

#### Task - Top ten trees

What if we wanted to get information about the 10 tallest trees in the dataset? What DataFrame method should we use?

1. Review the [Pandas documentation for DataFrames](https://pandas.pydata.org/pandas-docs/version/1.5/reference/frame.html#). Find the section that lists methods relevant for answering this question. 
2. Find two methods that can answer the question about the 10 tallest trees and enter them in the code cells below.

In [None]:
# your code here


#### Answer - Top ten trees

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 1 

# use the nlargest() method to find the top ten rows based on the height column
trees.nlargest(10, "height")

In [None]:
# another answer

# sort trees by height and display the first ten rows
trees.sort_values("height", ascending=False).head(10)

#### Task - Every descriptive statistic everywhere all at once

1. Run the code cell below to compute descriptive statistics on the numeric columns
2. Identify the individual dataframe methods that will generate each individual descriptive statistics (count, mean, min, max, etc.). Refer to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/version/1.5/reference/frame.html#computations-descriptive-stats) for help. You already know one of them!
3. Run each descriptive statistic in a code cell below (you will need to create additional code cells).


In [None]:
# describe the information about the trees
trees.describe()

In [None]:
# your code here


#### Answer - Every descriptive statistic everywhere all at once

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# compute counts
trees[tree_info_columns].count()

In [None]:
# answer

# compute mean (numeric columns only, because the list also contains text columns)
trees[tree_info_columns].mean(numeric_only=True)

In [None]:
# answer

# compute standard deviation (numeric columns only)
trees[tree_info_columns].std(numeric_only=True)

In [None]:
# answer

# compute minimum (numeric columns only)
trees[tree_info_columns].min(numeric_only=True)

In [None]:
# answer

# compute 25% quantiles (numeric columns only)
trees[tree_info_columns].quantile(.25, numeric_only=True)

In [None]:
# answer

# compute 50% quantiles (the median; .5 is also the default)
trees[tree_info_columns].quantile(.5, numeric_only=True)

In [None]:
# answer

# compute 75% quantiles (numeric columns only)
trees[tree_info_columns].quantile(.75, numeric_only=True)

In [None]:
# answer 

# compute the max (numeric columns only)
trees[tree_info_columns].max(numeric_only=True)

## Working with Categorical data

Descriptive statistics such as the mean only apply to numeric columns, and `describe()` skips non-numeric columns by default. That doesn't mean there isn't valuable information in those columns! There is a separate set of functions and techniques for working with categorical data.

#### Task - The 20 most poplar trees

1. Review the [Pandas DataFrame Documentation](https://pandas.pydata.org/pandas-docs/version/1.5/reference/frame.html#computations-descriptive-stats) and identify a method that counts how many trees there are of each type, then display the 20 most common types.

In [None]:
# your code here

In [None]:
# answer

trees['common_name'].value_counts().head(20)

#### Task - Most arboreal neighborhoods

1. Use the same function above, but now identify the top 10 most tree-y neighborhoods

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees['neighborhood'].value_counts().head(10)

#### Task - How many types of trees

1. Review the [Pandas DataFrame Documentation](https://pandas.pydata.org/pandas-docs/version/1.5/reference/frame.html#computations-descriptive-stats) and identify a method that will return a *single* value representing how many types of trees are present in the dataset.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# return the number of unique values for the common_name column
trees['common_name'].nunique()

#### Task - Types of trees in Latin

1. Repeat the task above, but count the number of tree types based on their botanical (scientific) names.
2. Is the number the same? Why might the two values be different?

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer 

# return the number of unique values for the scientific name
trees['scientific_name'].nunique()

*answer*

2. The number is not the same! There are more scientific names than common names. Dirty data? 

#### Task - Which names not the same?

The previous tasks showed that there is not a one-to-one correspondence between the common name and scientific name of the trees in the dataset. The code cell below tries to perform a series of computations on the data to identify the mismatch between common and scientific names. However, there is a bug and it doesn't produce the correct results.

1. Fix the bug in the code below so we can see which values for names are duplicated.
2. Can you describe what these operations are doing and how they work to tell us about the mismatched names?

In [None]:
# 
trees[['common_name','scientific_name']].drop_duplicates().value_counts(['scientific_name'])

#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
# select the name columns, create a dataframe of the unique pairs, and count the occurrences of common name
trees[['common_name','scientific_name']].drop_duplicates().value_counts(['common_name'])

*answer*

2. The code first selects just the `common_name` and `scientific_name` columns from the trees dataset, which produces a new DataFrame of just those two columns. Then the [`drop_duplicates()` DataFrame method](https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates) creates yet another DataFrame containing only the unique pairs of common and scientific names. Finally, the [`value_counts()` DataFrame method](https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts) counts how many times each name appears among those unique pairs, so any count above 1 marks a name that is paired with more than one partner. The `common_name` column needs to be used because it is the column with fewer unique values and therefore has duplicates that `value_counts()` can identify.
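
The chain described above can be sketched on a tiny made-up table. The names below are placeholders invented for illustration, not real entries from the trees dataset. Here "Elm" is deliberately paired with two different scientific names, so it is the only common name whose count among the unique pairs exceeds 1.

```python
import pandas as pd

# Hypothetical toy data: "Elm" appears with two different scientific names
toy = pd.DataFrame({
    "common_name": ["Elm", "Elm", "Elm", "Oak", "Oak"],
    "scientific_name": ["Species A", "Species A", "Species B",
                        "Species C", "Species C"],
})

# keep one row per unique (common_name, scientific_name) pair
unique_pairs = toy[["common_name", "scientific_name"]].drop_duplicates()

# count how often each common name appears among the unique pairs
pair_counts = unique_pairs.value_counts(["common_name"])

# a count above 1 means the common name has more than one scientific name
mismatched = pair_counts[pair_counts > 1]
print(mismatched)
```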

## Filtering Data by value

Pandas provides a couple of mechanisms to create subsets of your data based upon particular criteria: *masking* and *querying*. This section discusses how to create a subset of your data by creating a *boolean mask* of `True`/`False` values based upon a logical criterion that applies to values within each row.

The tasks below will show how to filter data using boolean masks.
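
Before starting the tasks, here is a minimal sketch of the idea on a made-up table (the tree names and heights are invented for the example). It uses a numeric comparison so it doesn't give away the tasks: the comparison yields one boolean per row, and indexing the DataFrame with that Series keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "common_name": ["Oak", "Maple", "Pine"],
    "height": [40, 25, 60],
})

# comparing a column to a value produces a boolean Series, one value per row
tall_mask = toy["height"] > 30
print(tall_mask)

# indexing with the mask keeps only the rows where the mask is True
tall_trees = toy[tall_mask]
print(tall_trees)
```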

#### Task - Making the Yellowwood mask

1. The code below selects the `common_name` column from the `trees` dataset, but is missing the appropriate [comparison operator](https://docs.python.org/3/library/stdtypes.html#comparisons). Replace the `???` below with the appropriate operator to produce a boolean mask for American Yellowwood trees.
2. How many and what type of values are in the resulting Pandas Series? How does that compare to the number of trees? 

In [None]:
# which operator?
yellowwood_mask = trees['common_name'] ??? "Yellowwood: American"
yellowwood_mask

#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
# True if the common name equals American yellowwood
yellowwood_mask = trees['common_name'] == "Yellowwood: American"
yellowwood_mask

*answer*

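2. There are 45709 values in the Series, all of them boolean (`True`/`False`). The mask has one value per row, so that matches the number of trees in the dataset.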

#### Task - Using the Yellowwood mask

1. Use the boolean mask, `yellowwood_mask`, to index the `trees` DataFrame and select only the rows that match the conditional expression from the task above.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# select american yellowwoods
trees[yellowwood_mask]

#### Task - Create another Mask for Japanese pagodatree

1. There were two types of trees with mismatched common and scientific names. Create a boolean mask and save it to the variable `pagodatree_mask` and generate a table like the one in the previous task but for the other tree type.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# create a mask
pagodatree_mask = trees['common_name'] == "Pagodatree: Japanese" 
# select rows where mask is True
trees[pagodatree_mask]

#### Task - Son of the Mask

1. Use the two boolean mask variables with a [bitwise operation](https://docs.python.org/3/library/stdtypes.html#bitwise-operations-on-integer-types) to filter for rows matching either of the misnamed tree types. Note that you cannot use the Python boolean operators (`and`, `or`, `not`) because they are not applied on an element-wise basis. See [this post for an explanation](https://towardsdatascience.com/bitwise-operators-and-chaining-comparisons-in-pandas-d3a559487525).

In [None]:
# your code here

#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# filter for misnamed trees
trees[yellowwood_mask | pagodatree_mask]
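
To see why the bitwise operators are needed, the sketch below combines two masks on a small made-up table (names invented for the example). `|`, `&`, and `~` work element by element, while the plain Python `or` asks for a single truth value of the whole Series and raises a `ValueError`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"common_name": ["Oak", "Maple", "Pine", "Oak"]})

oak_mask = toy["common_name"] == "Oak"
pine_mask = toy["common_name"] == "Pine"

either = oak_mask | pine_mask   # element-wise OR
both = oak_mask & pine_mask     # element-wise AND
not_oak = ~oak_mask             # element-wise NOT

# Python's `or` is not element-wise and fails on a Series
try:
    oak_mask or pine_mask
    or_failed = False
except ValueError as err:
    or_failed = True
    print("ValueError:", err)
```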

#### Task - Display the misnamed trees


|       | common_name          | scientific_name         |
|------:|:---------------------|:------------------------|
|   339 | Pagodatree: Japanese | Sophora japonica        |
|  4335 | Yellowwood: American | Cladrastis kentukea     |
| 31991 | Yellowwood: American | Cladrastis lutea        |
| 32001 | Pagodatree: Japanese | Styphnolobium japonicum |


1. Use your masks plus some of the code from the task "Which names not the same?" to produce the table above.
2. Do some web searching to determine the proper scientific name for these two trees.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

trees[yellowwood_mask | pagodatree_mask][['common_name','scientific_name']].drop_duplicates()

*answer*

2. Wikipedia says
- The scientific name of American yellowwood is [*Cladrastis kentukea*](https://en.wikipedia.org/wiki/Cladrastis_kentukea) with *C. lutea* as a synonym
- The scientific name of Japanese pagodatree is [*Styphnolobium japonicum*](https://en.wikipedia.org/wiki/Styphnolobium_japonicum) with *Sophora japonica* as a synonym

## Query Method

Because masking can be somewhat cumbersome, Pandas provides a [DataFrame method called `query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) that allows for writing concise logical expressions as strings. Documentation and guidance on the `query()` method is a bit sparse. Refer to the [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) for guidance on how to write query strings. Also Google. Under the hood, the `query()` method uses the Pandas DataFrame `eval()` method, so reviewing the documentation for that can help with understanding how the query strings are evaluated.

The examples below show some of the functionality and ways of writing query strings.

In [None]:
# select stumps that aren't stumps
trees.query("common_name == 'Stump' and height > 0")
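
For comparison, here is a small sketch on made-up data (the rows are invented for the example) showing that a query string and the equivalent boolean mask select the same rows. It also shows the `@` prefix, which lets a query string refer to a Python variable; the variable name `min_height` is just an illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "common_name": ["Stump", "Stump", "Oak"],
    "height": [0, 3, 40],
})

# the same filter written as a mask and as a query string
by_mask = toy[(toy["common_name"] == "Stump") & (toy["height"] > 0)]
by_query = toy.query("common_name == 'Stump' and height > 0")

# @name refers to a Python variable from the surrounding code
min_height = 0
by_query_var = toy.query("common_name == 'Stump' and height > @min_height")

print(by_query)
```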

The query below uses a vectorized string method, `startswith()` in the logical expression to select the different varieties of maple tree. The `engine="python"` parameter is necessary when working with string methods inside of the query string. However, this still causes an error because the `common_name` column contains missing values which don't evaluate to True or False.

In [None]:
# filter for maple trees of all types
trees.query("common_name.str.startswith('Maple')", engine="python")

The solution is to expand the query string to first filter for the non-missing values in the `common_name` column and then filter for names that start with "Maple". 

In [None]:
# filter out missing values and then filter for maple trees.
trees.query("common_name.notna() and common_name.str.startswith('Maple')", engine="python")

So many maple trees!
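
As an alternative sketch outside of `query()`, the string methods accept an `na` argument that says what a missing value should become in the result. The toy names below are invented for the example. Passing `na=False` treats a missing name as "does not start with Maple", so the result is a clean boolean mask.

```python
import pandas as pd

# Hypothetical toy data with one missing name
toy = pd.DataFrame({
    "common_name": ["Maple: Red", None, "Oak: Pin", "Maple: Norway"],
})

# na=False makes missing names count as non-matches
maple_mask = toy["common_name"].str.startswith("Maple", na=False)

# the mask can now be used for indexing without an error
maples = toy[maple_mask]
print(maples)
```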

#### Task - Query misnamed trees

1. Use the `query()` method and write a query string that is a boolean expression that will only match rows where the `scientific_name` matches the incorrect scientific names identified above.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
trees.query("scientific_name == 'Sophora japonica'  or scientific_name == 'Cladrastis lutea' ")

#### Task - Fix the names

1. Review documentation for the Pandas DataFrame `replace()` method and use it to update the scientific names for American Yellowwood and Japanese Pagodatree. The `replace()` method can be used in many different ways, try them.
2. Test to see if the code works by using the query from the previous task.
3. Once you have confirmed your `replace()` call is working correctly, add a parameter to update the `trees` DataFrame directly rather than creating a copy.


In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# show data before replacement
trees.query("scientific_name == 'Sophora japonica'  or scientific_name == 'Cladrastis lutea' ")

In [None]:
# replace with a dictionary
trees.replace({"Sophora japonica":"Styphnolobium japonicum",
               "Cladrastis lutea":"Cladrastis kentukea"}
             ).query("scientific_name == 'Sophora japonica' or scientific_name == 'Cladrastis lutea' ")

In [None]:
# replace with two lists
trees.replace(["Sophora japonica","Cladrastis lutea"], 
              ["Styphnolobium japonicum","Cladrastis kentukea"]
             ).query("scientific_name == 'Sophora japonica'  or scientific_name == 'Cladrastis lutea' ")

In [None]:
# replace with a dictionary and set inplace to True to update trees directly
trees.replace({"Sophora japonica":"Styphnolobium japonicum",
               "Cladrastis lutea":"Cladrastis kentukea"
              }, inplace=True)
# check to see if the original dataframe has been updated
trees.query("scientific_name == 'Sophora japonica'  or scientific_name == 'Cladrastis lutea' ")
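
The answers above search every column of the DataFrame for the old values. One more variation, sketched here on made-up data with placeholder values, is a nested dictionary of the form `{column: {old: new}}`, which limits the replacement to a single column.

```python
import pandas as pd

# Hypothetical toy data: the same text appears in both columns
toy = pd.DataFrame({
    "common_name": ["Old name", "Oak"],
    "scientific_name": ["Old name", "Quercus"],
})

# nested dictionary: only replace values found in scientific_name
fixed = toy.replace({"scientific_name": {"Old name": "New name"}})
print(fixed)
```

Scoping the replacement this way avoids accidentally changing a matching value in an unrelated column.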

#### Task - Save the cleaner trees

1. Save the updated `trees` DataFrame as a CSV file called `pgh_trees_cleaner.csv` and don't include the index column in the output file.

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
#answer
# save cleaner trees to disk
trees.to_csv('pgh_trees_cleaner.csv', index=False)

## Grouping data

Grouping operations, often called "group by," are designed to aggregate datasets by grouping rows based upon a particular value, typically categorical data types, and then computing a single value through an operation that summarizes all the rows associated with the group.

Groupby operations require two things. First, a column of values to *group by*, that is, the values that will become rows in the new DataFrame. Second, an operation that can *aggregate* multiple rows into a single row. The groupby operation then applies the aggregation to all rows that share the same value in the group by column. Mathematical aggregations such as the mean only make sense for numerical columns, so select those columns (or pass `numeric_only=True`) before aggregating.
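
The two ingredients can be sketched on a small made-up table (the neighborhood labels and heights are invented for the example): `neighborhood` is the column to group by, and the mean of `height` is the aggregation. The `agg()` call shows one way to compute several aggregations at once.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B"],
    "height": [10, 30, 20, 40, 60],
})

# group rows by neighborhood, then average the height within each group
mean_height = toy.groupby("neighborhood")["height"].mean()
print(mean_height)

# several aggregations at once: one row per group, one column per statistic
summary = toy.groupby("neighborhood")["height"].agg(["count", "mean", "max"])
print(summary)
```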

#### Task - Identifying columns for grouping operations

1. Look at the tree data and consider, what columns are suitable for grouping and what columns are suitable for aggregation? You don't have to list all of them, just highlight some that might be most interesting.

|    |         id |   address_number | street        | common_name        | scientific_name   |   height |   width |   growth_space_length |   growth_space_width | growth_space_type    |   diameter_base_height |   stems | overhead_utilities   | land_use              | condition   |   stormwater_benefits_dollar_value |   stormwater_benefits_runoff_elim |   property_value_benefits_dollarvalue |   property_value_benefits_leaf_surface_area |   energy_benefits_electricity_dollar_value |   energy_benefits_gas_dollar_value |   air_quality_benfits_o3dep_dollar_value |   air_quality_benfits_o3dep_lbs |   air_quality_benfits_vocavd_dollar_value |   air_quality_benfits_vocavd_lbs |   air_quality_benfits_no2dep_dollar_value |   air_quality_benfits_no2dep_lbs |   air_quality_benfits_no2avd_dollar_value |   air_quality_benfits_no2avd_lbs |   air_quality_benfits_so2dep_dollar_value |   air_quality_benfits_so2dep_lbs |   air_quality_benfits_so2avd_dollar_value |   air_quality_benfits_so2avd_lbs |   air_quality_benfits_pm10depdollar_value |   air_quality_benfits_pm10dep_lbs |   air_quality_benfits_pm10avd_dollar_value |   air_quality_benfits_pm10avd_lbs |   air_quality_benfits_total_dollar_value |   air_quality_benfits_total_lbs |   co2_benefits_dollar_value |   co2_benefits_sequestered_lbs |   co2_benefits_sequestered_value |   co2_benefits_avoided_lbs |   co2_benefits_avoided_value |   co2_benefits_decomp_lbs |   co2_benefits_maint_lbs |   co2_benefits_totalco2_lbs |   overall_benefits_dollar_value | neighborhood     |   council_district |   ward |       tract |   public_works_division |   pli_division |   police_zone | fire_zone   |   latitude |   longitude |
|---:|-----------:|-----------------:|:--------------|:-------------------|:------------------|---------:|--------:|----------------------:|---------------------:|:---------------------|-----------------------:|--------:|:---------------------|:----------------------|:------------|-----------------------------------:|----------------------------------:|--------------------------------------:|--------------------------------------------:|-------------------------------------------:|-----------------------------------:|-----------------------------------------:|--------------------------------:|------------------------------------------:|---------------------------------:|------------------------------------------:|---------------------------------:|------------------------------------------:|---------------------------------:|------------------------------------------:|---------------------------------:|------------------------------------------:|---------------------------------:|------------------------------------------:|----------------------------------:|-------------------------------------------:|----------------------------------:|-----------------------------------------:|--------------------------------:|----------------------------:|-------------------------------:|---------------------------------:|---------------------------:|-----------------------------:|--------------------------:|-------------------------:|----------------------------:|--------------------------------:|:-----------------|-------------------:|-------:|------------:|------------------------:|---------------:|--------------:|:------------|-----------:|------------:|
|  0 |  754166088 |             7428 | MONTICELLO ST | Stump              | Stump             |        0 |       0 |                    10 |                    2 | Well or Pit          |                     16 |       1 | Yes                  | Vacant                | nan         |                          nan       |                           nan     |                              nan      |                                    nan      |                                  nan       |                           nan      |                               nan        |                      nan        |                               nan         |                     nan          |                                nan        |                      nan         |                                nan        |                       nan        |                               nan         |                      nan         |                                nan        |                      nan         |                                nan        |                       nan         |                                 nan        |                       nan         |                                nan       |                      nan        |                  nan        |                        nan     |                       nan        |                   nan      |                   nan        |                 nan       |                nan       |                     nan     |                        nan      | Homewood North   |                  9 |     13 | 4.20031e+10 |                       2 |             13 |             5 | 3-17        |    40.4582 |    -79.8897 |
|  1 | 1946899269 |              220 | BALVER AVE    | Linden: Littleleaf | Tilia cordata     |        0 |       0 |                    99 |                   99 | Open or Unrestricted |                     22 |       0 | No                   | Residential           | nan         |                           13.9467  |                          1743.34  |                               21.9848 |                                     36.5383 |                                   15.7765  |                            61.0683 |                                 2.36085  |                        0.514346 |                                 0.0721382 |                       0.0312287  |                                  0.992384 |                        0.216206  |                                  3.70231  |                         0.806603 |                                 0.274901  |                        0.0789944 |                                  1.40772  |                        0.404518  |                                  2.18533  |                         0.262976  |                                   0.46181  |                         0.0555728 |                                 11.4574  |                        2.37044  |                    0.944601 |                        115.328 |                         0.847431 |                   277.541  |                     2.03937  |                 -96.3455  |                -13.7088  |                     282.815 |                        125.178  | Oakwood          |                  2 |     28 | 4.20036e+10 |                       5 |             28 |             6 | 1-19        |    40.4293 |    -80.0679 |
|  2 | 1431517397 |             2822 | SIDNEY ST     | Maple: Red         | Acer rubrum       |       22 |       6 |                     6 |                    3 | Well or Pit          |                      6 |       1 | No                   | Commercial/Industrial | Fair        |                            3.97486 |                           496.857 |                               51.5291 |                                     85.6404 |                                    3.38882 |                            16.0847 |                                 0.464026 |                        0.101095 |                                 0.0176434 |                       0.00763782 |                                  0.200391 |                        0.0436582 |                                  0.875966 |                         0.190842 |                                 0.0587249 |                        0.016875  |                                  0.302735 |                        0.0869929 |                                  0.444639 |                         0.0535065 |                                   0.110526 |                         0.0133004 |                                  2.47465 |                        0.513908 |                    0.314952 |                         45.288 |                         0.332776 |                    59.6164 |                     0.438061 |                  -6.86864 |                 -3.73876 |                      94.297 |                         77.7671 | South Side Flats |                  3 |     16 | 4.20032e+10 |                       3 |             16 |             3 | 4-24        |    40.4268 |    -79.965  |
|  3 |  994063598 |              608 | SUISMON ST    | Maple: Freeman     | Acer x freemanii  |       25 |      10 |                     3 |                    3 | Well or Pit          |                      7 |       1 | Conflicting          | Residential           | Fair        |                            4.77566 |                           596.958 |                               43.1845 |                                     71.7718 |                                    5.39622 |                            24.2209 |                                 0.742735 |                        0.161816 |                                 0.0270871 |                       0.011726   |                                  0.312209 |                        0.0680194 |                                  1.357    |                         0.295642 |                                 0.0864852 |                        0.0248521 |                                  0.481898 |                        0.138476  |                                  0.687516 |                         0.0827335 |                                   0.170684 |                         0.0205395 |                                  3.86561 |                        0.803805 |                    0.395314 |                         33.565 |                         0.246635 |                    94.9307 |                     0.697551 |                  -5.77618 |                 -4.36189 |                     118.358 |                         81.8383 | East Allegheny   |                  1 |     23 | 4.20036e+10 |                       1 |             23 |             1 | 1-6         |    40.4555 |    -79.9993 |
|  4 | 1591838573 |             1135 | N NEGLEY AVE  | Maple: Norway      | Acer platanoides  |       52 |      13 |                    99 |                   99 | Open or Unrestricted |                     38 |       1 | Yes                  | Residential           | Good        |                           41.2284  |                          5153.55  |                              194.129  |                                    322.638  |                                   28.5715  |                            94.9301 |                                 5.87279  |                        1.27948  |                                 0.138217  |                       0.0598343  |                                  2.53883  |                        0.553121  |                                  7.27429  |                         1.58481  |                                 0.730628  |                        0.209951  |                                  2.95144  |                        0.848114  |                                  5.27533  |                         0.634817  |                                   0.856869 |                         0.103113  |                                 25.6384  |                        5.27324  |                    6.04169  |                       1391.74  |                        10.2265   |                   582.319  |                     4.27888  |                -137.739   |                -27.4328  |                    1808.89  |                        390.539  | Highland Park    |                  7 |     11 | 4.20031e+10 |                       2 |             11 |             5 | 3-9         |    40.4767 |    -79.9241 |

#### Answer - 

Click on the ellipses (...) below to see the answers.

*answer* 

`common_name` would be a good column for grouping because it contains categorical data. The `neighborhood` column could also be used as a group column. A categorical column is a good candidate for groupby when you want to see how an aggregation is distributed across its unique values. The `height`, `width`, or `overall_benefits_dollar_value` columns would all be good for aggregation because they contain numerical data that can be easily aggregated. 
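As a minimal sketch of that pairing, the cell below groups a small made-up table by a categorical column and aggregates a numeric one. The `toy_trees` DataFrame and its values are hypothetical and only borrow the column names from the trees dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the trees dataset
toy_trees = pd.DataFrame({
    "common_name": ["Maple: Red", "Maple: Red", "Oak: Pin"],
    "height": [20, 30, 40],
})

# categorical column goes in groupby(), numeric column gets aggregated
mean_height = toy_trees.groupby("common_name")["height"].mean()
print(mean_height)
```

The result is a Series with one row per unique value of the grouping column.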

#### Task - Finding the aggregation functions

Review the [pandas documentation](https://pandas.pydata.org/pandas-docs/version/0.23/api.html#id39) of the built-in aggregation functions for the group by operations. 
1. What groupby and aggregation function would give us the total economic benefit per neighborhood? 
2. What aggregation function would tell us how many types of trees are present in the dataset?
3. Are there aggregation functions that could perform computations on non-numeric columns? 

*answer*

1. Grouping by `neighborhood`, selecting the `overall_benefits_dollar_value` column, and applying the `sum` aggregation function would give the total economic benefit per neighborhood.
2. Grouping by `common_name` or `scientific_name` and computing the `size()` gives the number of trees of each type; the number of rows in the result is the number of tree types.
3. Look for functions that can produce a single value from categorical content. The `count` and `size` methods count the values in each group (`count` skips missing values, `size` includes them). The `first` and `last` methods return the first and last value from each group.
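The difference between these non-numeric aggregations is easier to see on a tiny example. The sketch below uses a hypothetical `toy_names` DataFrame with one missing value; `nunique` is not mentioned in the answer above and is included here only as a related option for counting distinct values.

```python
import pandas as pd

# Hypothetical toy data with one missing common name
toy_names = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B"],
    "common_name": ["Maple: Red", None, "Oak: Pin", "Oak: Pin", "Maple: Red"],
})

grouped = toy_names.groupby("neighborhood")["common_name"]

group_sizes = grouped.size()      # rows per group, missing values included
group_counts = grouped.count()    # non-missing values per group
group_first = grouped.first()     # first non-missing value per group
group_unique = grouped.nunique()  # distinct non-missing values per group

# show the four results side by side
print(pd.DataFrame({
    "size": group_sizes,
    "count": group_counts,
    "first": group_first,
    "nunique": group_unique,
}))
```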

#### Task -  Grouping and aggregating trees

1. Modify the code cell below to put a categorical column name in as the parameter for the `groupby()` method and add an aggregation method. 

In [None]:
# Your code here
# grouping by a category and aggregating by an aggregation function
trees.groupby(???).???()

#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# Answer 1
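# grouping by common name and computing the median value for all numeric columns
# numeric_only=True skips text columns, which newer pandas versions would otherwise raise an error on
trees.groupby("common_name").median(numeric_only=True)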

#### Task - Meaningful Means

1. The code below groups by the neighborhood name, selects a subset of columns, and then aggregates by the mean. However, the mean of these numeric columns is not meaningful.
2. Modify the code to calculate the mean just for the columns about the trees themselves.

In [None]:
# group by neighborhood, select location columns, and compute the mean
trees.groupby("neighborhood")[location_columns].mean()

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer
# aggregate the tree information columns by the mean
trees.groupby("neighborhood")[tree_info_columns].mean()

#### Task - The Sorting Hat

1. Copy your answer to the previous task and add code to sort the results by a column of your choosing. 


In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# tree summaries per neighborhood, sorted by height
trees.groupby("neighborhood")[tree_info_columns].mean().sort_values(by="height")

#### Task - Neighborhoods with the most valuable trees

1. Use the groupby, aggregation, and sorting methods to calculate the total economic value of all the trees per neighborhood.
2. Save the results in a variable called `value_per_neighborhood` and display the results

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# Group by neighborhood, select the overall_benefits_dollar_value column, and compute the sum. Sort the resulting Series from high to low
value_per_neighborhood = trees.groupby('neighborhood')['overall_benefits_dollar_value'].sum().sort_values(ascending=False)
value_per_neighborhood

#### Task - How many trees per neighborhood?

1. Not every neighborhood has the same number of trees. Use groupby, aggregation, and sorting methods to calculate how many trees there are in each neighborhood.
2. Save the results in a variable called `trees_per_neighborhood` and display the results

In [None]:
# your code here


#### Answer - 

Click on the ellipses (...) below to see the answers.

In [None]:
# answer

# calculate the number of trees per neighborhood
trees_per_neighborhood = trees.groupby('neighborhood').size().sort_values()
trees_per_neighborhood

#### Task - Neighborhood value per tree

1. Using the `value_per_neighborhood` and `trees_per_neighborhood` variables, compute the average economic value per tree for each neighborhood.
2. Save the results in a variable `value_by_trees` and display the results sorted from high to low. Are the results different from the total economic benefit per neighborhood? 

In [None]:
# answer

# calculate the total economic value per neighborhood divided by the number of trees per neighborhood
value_by_trees = value_per_neighborhood / trees_per_neighborhood
# sort from high to low
value_by_trees.sort_values(ascending=False)
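The division above works even though the two Series were sorted in different orders, because pandas matches values by index label rather than by position. A minimal sketch with hypothetical numbers (the `toy_value` and `toy_count` Series are made up for illustration):

```python
import pandas as pd

# Hypothetical totals per neighborhood, sorted high to low
toy_value = pd.Series({"A": 300.0, "B": 200.0, "C": 90.0})
# Hypothetical tree counts, stored in a different order
toy_count = pd.Series({"C": 3, "B": 10, "A": 100})

# division aligns on the index labels, not on row position
toy_value_per_tree = (toy_value / toy_count).sort_values(ascending=False)
print(toy_value_per_tree)
```

In this toy example neighborhood A has the largest total but the smallest value per tree, so the per-tree ranking can differ from the ranking by total benefit.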