# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [10]:
fortune = pd.read_csv('fortune1000.csv', index_col=0)
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


## The groupby Method
- **Grouping** organizes the rows of a **DataFrame** into categories based on the values in a column.
- The `groupby` method returns a **DataFrameGroupBy** object. It behaves like a dictionary-like collection of **DataFrames**, one per group.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
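The dictionary-like structure described above can be inspected directly. The following is a minimal sketch, assuming a small hand-built `sample` **DataFrame** (a hypothetical stand-in for the full CSV file) that reuses a few rows from the dataset.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Microsoft"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 93580],
    },
    index=pd.Index([1, 2, 3, 14, 25], name="Rank"),
)

sample_sectors = sample.groupby("Sector")

# Number of groups: one per unique value in the Sector column.
group_count = len(sample_sectors)

# Number of rows in each group, as a Series indexed by Sector.
group_sizes = sample_sectors.size()

# The groups attribute maps each group label to the index labels of its rows.
group_labels = {
    sector: list(ranks) for sector, ranks in sample_sectors.groups.items()
}

print(group_count)
print(group_sizes)
print(group_labels)
```

Here `len`, `size`, and `groups` only describe how the rows are partitioned; no aggregation is computed yet.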

In [11]:
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [14]:
sectors = fortune.groupby('Sector')

In [16]:
sectors.first()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,96114,5176,161400
Apparel,Nike,Apparel,30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,19330,419,27000
Chemicals,Dow Chemical,Chemicals,48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,246204,16150,75600
Engineering & Construction,Fluor,"Engineering, Construction",18114,413,38758
Financials,Berkshire Hathaway,Insurance: Property and Casualty (Stock),210821,24083,331000
Food and Drug Stores,CVS Health,Food and Drug Stores,153290,5237,199000
"Food, Beverages & Tobacco",Archer Daniels Midland,Food Production,67702,1849,32300
Health Care,McKesson,Wholesalers: Health Care,181241,1476,70400


In [17]:
sectors.last()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,1923,-133,12000
Apparel,Guess,Apparel,2204,82,13500
Business Services,DeVry Education Group,Education,1910,140,11770
Chemicals,H.B. Fuller,Chemicals,2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,1987,84,9072


In [19]:
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021A8C7AEC90>

## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves the nested **DataFrame** of rows that belong to a specific group.

In [20]:
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [24]:
sectors.get_group('Energy')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
14,Chevron,Energy,Petroleum Refining,131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,64566,2852,45440
...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646


In [25]:
sectors.get_group('Technology')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
18,Amazon.com,Technology,Internet Services and Retailing,107006,596,230800
20,HP,Technology,"Computers, Office Equipment",103355,4554,287000
25,Microsoft,Technology,Computer Software,93580,12193,118000
31,IBM,Technology,Information Technology Services,82461,13190,411798
...,...,...,...,...,...,...
970,Rackspace Hosting,Technology,Internet Services and Retailing,2001,126,6189
971,VeriFone Systems,Technology,"Computers, Office Equipment",2001,79,5400
975,Super Micro Computer,Technology,"Computers, Office Equipment",1991,102,2285
984,Nuance Communications,Technology,Computer Software,1931,-115,13500


In [26]:
sectors.get_group('Health Care')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
6,UnitedHealth Group,Health Care,Health Care: Insurance and Managed Care,157107,5813,200000
12,AmerisourceBergen,Health Care,Wholesalers: Health Care,135962,-135,17000
21,Cardinal Health,Health Care,Wholesalers: Health Care,102531,1215,34500
22,Express Scripts Holding,Health Care,Health Care: Pharmacy and Other Services,101752,2476,25900
...,...,...,...,...,...,...
935,VCA,Health Care,Health Care: Medical Facilities,2134,211,12700
960,PharMerica,Health Care,Health Care: Pharmacy and Other Services,2029,35,5200
965,Bio-Rad Laboratories,Health Care,Medical Products and Equipment,2019,113,7770
977,Hill-Rom Holdings,Health Care,Medical Products and Equipment,1988,48,10000


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method adds up the **Revenue** values of the rows within each group.
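The steps above can be sketched as follows. This example assumes a small hand-built `sample` **DataFrame** (a hypothetical stand-in for the full CSV file) built from a few rows of the dataset.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Microsoft"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 93580],
        "Profits": [14694, 16150, 53394, 4587, 12193],
    }
)

sample_sectors = sample.groupby("Sector")

# Square brackets extract one column and return a SeriesGroupBy object.
revenue_by_sector = sample_sectors["Revenue"]

# sum() adds up the Revenue values within each sector.
total_revenue = revenue_by_sector.sum()

# Other aggregations, such as mean(), work the same way.
average_profits = sample_sectors["Profits"].mean()

print(total_revenue)
print(average_profits)
```

Each result is a **Series** indexed by the group labels, with one value per sector.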

## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.
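As a minimal sketch of grouping by two columns, the example below assumes a small hand-built `sample` **DataFrame** (a hypothetical stand-in for the full CSV file) built from a few rows of the dataset.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset).
sample = pd.DataFrame(
    {
        "Company": [
            "Exxon Mobil", "Chevron", "Portland General Electric",
            "Apple", "HP", "Microsoft",
        ],
        "Sector": [
            "Energy", "Energy", "Energy",
            "Technology", "Technology", "Technology",
        ],
        "Industry": [
            "Petroleum Refining", "Petroleum Refining",
            "Utilities: Gas and Electric",
            "Computers, Office Equipment", "Computers, Office Equipment",
            "Computer Software",
        ],
        "Revenue": [246204, 131118, 1898, 233715, 103355, 93580],
    }
)

# Group by each (Sector, Industry) pairing, then sum Revenue per pairing.
revenue_by_pair = sample.groupby(["Sector", "Industry"])["Revenue"].sum()

# The outer level can be selected on its own to see one sector's industries.
technology_revenue = revenue_by_pair.loc["Technology"]

print(revenue_by_pair)
print(technology_revenue)
```

Selecting a single outer-level label, as with `"Technology"` here, returns a **Series** indexed by the remaining level.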

## The agg Method
- The `agg` method can apply a different aggregation operation to each column.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
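The dictionary form described above might look like the following sketch. It assumes a small hand-built `sample` **DataFrame** (a hypothetical stand-in for the full CSV file); the choice of `sum`, `max`, and `mean` is purely illustrative.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Microsoft"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 93580],
        "Profits": [14694, 16150, 53394, 4587, 12193],
        "Employees": [2300000, 75600, 110000, 61500, 118000],
    }
)

sample_sectors = sample.groupby("Sector")

# Keys are column names; values are the aggregation to apply to that column.
summary = sample_sectors.agg(
    {
        "Revenue": "sum",     # total revenue per sector
        "Profits": "max",     # largest single-company profit per sector
        "Employees": "mean",  # average headcount per sector
    }
)

print(summary)
```

The result is a **DataFrame** with one row per group and one column per dictionary key.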

## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just as **Series** and **DataFrame** objects do).
- The `apply` method invokes a function once on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return value of each function call and combines the results into a new **DataFrame** (the return value of `apply`).
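A minimal sketch of `apply` follows, assuming a small hand-built `sample` **DataFrame** (a hypothetical stand-in for the full CSV file) and a hypothetical helper named `top_by_revenue`. The columns are selected before calling `apply` so that the function receives only those columns in each group. A plain `for` loop over the grouped object is included as a related way to visit each group.

```python
import pandas as pd

# Hand-built sample (an assumption for illustration, not the full dataset).
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Microsoft"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 93580],
    },
    index=pd.Index([1, 2, 3, 14, 25], name="Rank"),
)


def top_by_revenue(group_df):
    """Hypothetical helper: return the row with the highest Revenue."""
    return group_df.nlargest(1, "Revenue")


# apply calls top_by_revenue once per sector and combines the results.
largest_per_sector = (
    sample.groupby("Sector")[["Company", "Revenue"]].apply(top_by_revenue)
)

# Iterating over the grouped object yields (group label, DataFrame) pairs.
row_counts = {
    sector: len(group_df) for sector, group_df in sample.groupby("Sector")
}

print(largest_per_sector)
print(row_counts)
```

Because `top_by_revenue` returns a one-row **DataFrame** for each group, the combined result has one row per sector.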