# Section 02: Aggregating Data
### `01-Counting by region`
- Use `count()` to find the number of counties in each region, passing `sort = TRUE` as a second argument to sort in descending order.

In [33]:
library(dplyr)
counties <- read.csv("C:\\Users\\mosman\\Desktop\\Github\\Data_Scientist_with_R\\00_Datasets\\counties.csv", 
                    header=TRUE)

In [34]:
counties_selected <- counties %>%
    select(county, region, state, population, citizens)

In [35]:
# Use count to find the number of counties in each region
counties_selected %>%
  count(region, sort = TRUE)

region,n
<chr>,<int>
South,1420
North~,1271
West,447
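
`count()` behaves like a shorthand for grouping, summarizing with `n()`, and sorting. The sketch below checks that equivalence on a small made-up data frame; `toy` and its values are hypothetical and are not taken from `counties.csv`.

```r
library(dplyr)

# Hypothetical toy data, not the real counties dataset
toy <- data.frame(
  county = c("A", "B", "C", "D", "E"),
  region = c("South", "West", "South", "South", "West")
)

# Shorthand: one row per region, sorted by descending n
via_count <- toy %>%
  count(region, sort = TRUE)

# Longhand: group, count rows with n(), then sort
via_summarize <- toy %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

via_count
via_summarize
```

Both results should list the same regions with the same counts, so the shorter `count()` form is usually enough when only the row count per group is needed.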


### `02-Counting citizens by state`
- Count by state, weighted by the `citizens` column so that `n` holds each state's total number of citizens rather than its number of counties, and sort in descending order.

In [36]:
# Count by state, weighted by citizens (n is the total citizens per state), sorted in descending order
counties_selected %>%
  count(state, wt = citizens, sort = TRUE)

state,n
<chr>,<int>
California,24280349
Texas,16864864
Florida,13933052
New York,13531404
Pennsylvan,9710416
Illinois,8979999
Ohio,8709050
Michigan,7380136
North Caro,7107998
Georgia,6978660
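
With `wt` set, `count()` sums the weight column within each group instead of counting rows. A minimal sketch on hypothetical data (the `toy` data frame and its numbers are invented for illustration) compares it with an explicit `sum()`:

```r
library(dplyr)

# Hypothetical toy data: two counties in state X, one in state Y
toy <- data.frame(
  state = c("X", "X", "Y"),
  citizens = c(100, 250, 300)
)

# Weighted count: n is the sum of citizens per state
weighted <- toy %>%
  count(state, wt = citizens, sort = TRUE)

# Longhand: the same totals via group_by() and sum()
longhand <- toy %>%
  group_by(state) %>%
  summarize(n = sum(citizens)) %>%
  arrange(desc(n))

weighted
longhand
```

Here state X has two counties, but its `n` is 350, the sum of its `citizens` values.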


### `03-Mutating and counting`
- Use `mutate()` to calculate and add a column called `population_walk`, containing the total number of people who walk to work in a county.
- Use a (weighted and sorted) `count()` to find the total number of people who walk to work in each state.

In [37]:
counties_selected <- counties %>%
  select(county, region, state, population, walk)

In [38]:
counties_selected %>%
  # Add population_walk containing the total number of people who walk to work
  mutate(population_walk = population * walk / 100) %>%
  # Count weighted by the new column, sort in descending order
  count(state, wt = population_walk, sort = TRUE)

state,n
<chr>,<dbl>
New York,1237938.17
California,1017963.68
Pennsylvan,505397.19
Texas,430783.43
Illinois,400345.6
Massachuse,316765.03
Florida,284722.87
New Jersey,273047.19
Ohio,266910.98
Washington,239764.32


### `04-Summarizing`
- Summarize the counties dataset to find the following columns: `min_population` (with the smallest population), `max_unemployment` (with the maximum unemployment), and `average_income` (with the mean of the income variable).

In [39]:
counties_selected <- counties %>%
  select(county, population, income, unemployment)

In [40]:
counties_selected %>%
  # Summarize to find minimum population, maximum unemployment, and average income
  summarize(min_population = min(population),
            max_unemployment = max(unemployment),
            average_income = mean(income))

min_population,max_unemployment,average_income
<int>,<dbl>,<dbl>
85,29.4,46832


### `05-Summarizing by state`
- Group the data by state, and summarize to create the columns `total_area` (with total area in square miles) and `total_population` (with total population).

In [41]:
counties_selected <- counties %>%
  select(state, county, population, land_area)

In [42]:
counties_selected %>%
  # Group by state 
  group_by(state) %>%
  # Find the total area and population
  summarize(total_area = sum(land_area),
            total_population = sum(population))

state,total_area,total_population
<chr>,<dbl>,<int>
Alabama,50649.0,4830620
Alaska,553559.0,725461
Arizona,113596.0,6641928
Arkansas,52037.0,2958208
California,155778.9,38421464
Colorado,103643.0,5278906
Connecticu,4843.0,3593222
Delaware,1948.0,926454
Florida,53627.0,19645772
Georgia,57509.0,10006693


- Add a `density` column with the people per square mile, then arrange in descending order.



In [43]:
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  # Add a density column
  mutate(density = total_population / total_area) %>%
  # Sort by density in descending order
  arrange(desc(density))

state,total_area,total_population,density
<chr>,<dbl>,<int>,<dbl>
New Jersey,7356.2,8904413,1210.46369
Rhode Isla,1034.2,1053661,1018.817443
Massachuse,7800.2,6705586,859.66847
Connecticu,4843.0,3593222,741.941359
Maryland,9706.9,5930538,610.961069
Delaware,1948.0,926454,475.592402
New York,47127.1,19673174,417.449281
Florida,53627.0,19645772,366.34106
Pennsylvan,44740.0,12779559,285.640568
Ohio,40855.0,11575977,283.342969


### `06-Summarizing by state and region`
- Summarize to find the total population, as a column called `total_pop`, in each combination of region and state.



In [44]:
counties_selected <- counties %>%
  select(region, state, county, population)

In [45]:
counties_selected %>%
  # Group and summarize to find the total population
  group_by(region, state) %>%
  summarize(total_pop = sum(population))

[1m[22m`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.


region,state,total_pop
<chr>,<chr>,<int>
North~,Connecticu,3593222
North~,Illinois,12873761
North~,Indiana,6568645
North~,Iowa,3093526
North~,Kansas,2892987
North~,Maine,1329100
North~,Massachuse,6705586
North~,Michigan,9900571
North~,Minnesota,5419171
North~,Missouri,6045448


- Notice the tibble is still grouped by region; use another `summarize()` step to calculate two new columns: the average state population in each region (`average_pop`) and the median state population in each region (`median_pop`).

In [46]:
counties_selected %>%
  # Group and summarize to find the total population
  group_by(region, state) %>%
  summarize(total_pop = sum(population)) %>%
  # Calculate the average_pop and median_pop columns 
  summarize(average_pop = mean(total_pop), 
            median_pop = median(total_pop))
  

[1m[22m`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.


region,average_pop,median_pop
<chr>,<dbl>,<dbl>
North~,5881989,5419171
South,7370486,4804098
West,5722755,2798636
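
The message about grouped output reflects that each `summarize()` call drops only the last grouping variable. The sketch below, on a hypothetical `toy` data frame, makes that explicit with `.groups = "drop_last"` (which also silences the message) and inspects the remaining groups with `group_vars()`.

```r
library(dplyr)

# Hypothetical toy data: two regions, three states
toy <- data.frame(
  region = c("South", "South", "South", "West", "West"),
  state = c("X", "X", "Y", "Z", "Z"),
  population = c(10, 20, 40, 5, 15)
)

# First summarize: one row per state, still grouped by region
by_state <- toy %>%
  group_by(region, state) %>%
  summarize(total_pop = sum(population), .groups = "drop_last")

group_vars(by_state)

# Second summarize: one row per region, no grouping left
by_region <- by_state %>%
  summarize(average_pop = mean(total_pop),
            median_pop = median(total_pop))

group_vars(by_region)
by_region
```

After the first step the result is grouped by `region` only, which is why the second `summarize()` produces one row per region without another `group_by()`.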


### `07-Selecting a county from each region`
- Find the county in each region with the highest percentage of citizens who walk to work.



In [47]:
counties_selected <- counties %>%
  select(region, state, county, metro, population, walk)

In [48]:
counties_selected %>%
  # Group by region
  group_by(region) %>%
  # Find the county with the highest percentage of citizens who walk to work
  top_n(n = 1, walk)

region,state,county,metro,population,walk
<chr>,<chr>,<chr>,<chr>,<int>,<dbl>
West,Alaska,"""Aleutians East Borough""",Nonmetro,3304,71.2
North~,New York,"""New York""",Metro,1629507,20.7
South,Virginia,"""Lexington city""",Nonmetro,7071,31.7
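
In dplyr 1.0.0 and later, `top_n()` is marked as superseded in favor of `slice_max()`. As a sketch of the newer form, assuming a recent dplyr and using a hypothetical `toy` data frame rather than the real counties data:

```r
library(dplyr)

# Hypothetical toy data: two counties per region
toy <- data.frame(
  region = c("South", "South", "West", "West"),
  county = c("A", "B", "C", "D"),
  walk = c(3.1, 7.4, 12.0, 2.5)
)

# Keep the row with the largest walk value within each region
top_walkers <- toy %>%
  group_by(region) %>%
  slice_max(walk, n = 1) %>%
  ungroup()

top_walkers
```

Like `top_n()`, `slice_max()` keeps tied rows by default (`with_ties = TRUE`), so a region can return more than one county if the maximum is shared.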


### `08-Finding the highest-income state in each region`

- Calculate the average income (as `average_income`) of counties within each region and state (notice the `group_by()` has already been done for you).
- Find the highest income state in each region.

In [49]:
counties_selected <- counties %>%
  select(region, state, county, population, income)

In [50]:
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income)) %>%
  # Find the highest income state in each region
  top_n(n = 1, average_income)

[1m[22m`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.


region,state,average_income
<chr>,<chr>,<dbl>
North~,New Jersey,73014.1
South,Maryland,69200.38
West,Alaska,65124.54


In [51]:
counties_selected %>%
  group_by(region, state) %>%
  # Calculate average income
  summarize(average_income = mean(income))

[1m[22m`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.


region,state,average_income
<chr>,<chr>,<dbl>
North~,Connecticu,71184.12
North~,Illinois,50163.44
North~,Indiana,48745.4
North~,Iowa,50483.12
North~,Kansas,47322.21
North~,Maine,46141.75
North~,Massachuse,65974.43
North~,Michigan,44464.99
North~,Minnesota,53926.99
North~,Missouri,41755.4


### `09-Using summarize, top_n, and count together`
In this chapter, you've learned to use five `dplyr` verbs related to aggregation: `count()`, `group_by()`, `summarize()`, `ungroup()`, and `top_n()`. In this exercise, you'll use all of them to answer a question: **In how many states do more people live in metro areas than non-metro areas?**

Recall that the `metro` column has one of the two values `"Metro"` (for high-density city areas) or `"Nonmetro"` (for suburban and country areas).

In [52]:
counties_selected <- counties %>%
  select(state, metro, population)

- For each combination of `state` and `metro`, find the total population as `total_pop`.
- Extract the most populated row from each state, which will be either `Metro` or `Nonmetro`.
- Ungroup, then count how often `Metro` or `Nonmetro` appears to see how many states have more people living in those areas.

In [53]:
counties_selected %>%
  # Find the total population for each combination of state and metro
  group_by(state, metro) %>%
  summarize(total_pop = sum(population)) %>%
  # Extract the most populated row for each state
  top_n(1, total_pop) %>%
  # Count the states with more people in Metro or Nonmetro areas
  ungroup() %>%
  count(metro)

[1m[22m`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.


metro,n
<chr>,<int>
Metro,44
Nonmetro,6
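
To see how the steps combine, the same pipeline can be run on data small enough to check by hand. The `toy` data frame below is hypothetical: state X has more people in metro counties (150 vs. 120), while state Y has more in non-metro counties (30 vs. 10).

```r
library(dplyr)

# Hypothetical toy data: two states with metro and non-metro counties
toy <- data.frame(
  state = c("X", "X", "X", "Y", "Y"),
  metro = c("Metro", "Metro", "Nonmetro", "Metro", "Nonmetro"),
  population = c(100, 50, 120, 10, 30)
)

result <- toy %>%
  # Total population per state and metro category
  group_by(state, metro) %>%
  summarize(total_pop = sum(population), .groups = "drop_last") %>%
  # Keep the larger category within each state
  top_n(1, total_pop) %>%
  # Count how many states fall into each category
  ungroup() %>%
  count(metro)

result
```

Each category should appear once in `result`, matching the hand calculation. Without the `ungroup()` step, `count(metro)` would count within each state instead of across states.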


### `The End`