# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and employee count.

In [10]:
fortune = pd.read_csv('fortune1000.csv', index_col=0)
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


## The groupby Method
- **Grouping** organizes the rows of a **DataFrame** into categories based on a column's values.
- The `groupby` method returns a **DataFrameGroupBy** object. It resembles a group/collection of **DataFrames** in a dictionary-like structure.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
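As a rough illustration of the dictionary-like structure described above, the sketch below groups a five-row sample and inspects the result. The rows are copied from this notebook's outputs; the variable names (`sample`, `groups`, `labels`) are made up for the example.

```python
import pandas as pd

# Five rows copied from this notebook's outputs; the real file has far more.
sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Amazon.com"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 107006],
    },
    index=pd.Index([1, 2, 3, 14, 18], name="Rank"),
)

groups = sample.groupby("Sector")

# One group per distinct Sector value.
n_groups = len(groups)

# Number of rows in each group, as a Series indexed by Sector.
sizes = groups.size()

# The .groups attribute maps each group label to the index labels of its rows.
labels = {name: list(idx) for name, idx in groups.groups.items()}

print(n_groups)
print(sizes)
print(labels)
```

Nothing is computed per group until an aggregation such as `size` or `sum` is called; the `groupby` call itself only records how the rows are partitioned.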

In [11]:
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [14]:
sectors = fortune.groupby('Sector')

In [16]:
sectors.first()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,96114,5176,161400
Apparel,Nike,Apparel,30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,19330,419,27000
Chemicals,Dow Chemical,Chemicals,48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,246204,16150,75600
Engineering & Construction,Fluor,"Engineering, Construction",18114,413,38758
Financials,Berkshire Hathaway,Insurance: Property and Casualty (Stock),210821,24083,331000
Food and Drug Stores,CVS Health,Food and Drug Stores,153290,5237,199000
"Food, Beverages & Tobacco",Archer Daniels Midland,Food Production,67702,1849,32300
Health Care,McKesson,Wholesalers: Health Care,181241,1476,70400


In [17]:
sectors.last()

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,1923,-133,12000
Apparel,Guess,Apparel,2204,82,13500
Business Services,DeVry Education Group,Education,1910,140,11770
Chemicals,H.B. Fuller,Chemicals,2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,1987,84,9072


In [19]:
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021A8C7AEC90>

## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.
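One way to think about `get_group` is as a shortcut for filtering the original **DataFrame** on the grouping column. The sketch below checks that on a four-row sample (rows copied from this notebook's outputs; the names `sample`, `groups`, and `by_mask` are illustrative) and also shows that asking for a label with no matching rows raises a `KeyError`.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Company": ["Exxon Mobil", "Apple", "Chevron", "Amazon.com"],
        "Sector": ["Energy", "Technology", "Energy", "Technology"],
        "Revenue": [246204, 233715, 131118, 107006],
    },
    index=pd.Index([2, 3, 14, 18], name="Rank"),
)
groups = sample.groupby("Sector")

# Rows belonging to the 'Energy' group.
energy = groups.get_group("Energy")

# The same rows selected with a boolean mask on the original DataFrame.
by_mask = sample[sample["Sector"] == "Energy"]

print(list(energy["Company"]))
print(list(energy.index) == list(by_mask.index))

# A label that does not occur in the grouping column raises KeyError.
try:
    groups.get_group("Retailing")
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```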

In [20]:
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [24]:
sectors.get_group('Energy')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
14,Chevron,Energy,Petroleum Refining,131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,64566,2852,45440
...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646


In [25]:
sectors.get_group('Technology')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
18,Amazon.com,Technology,Internet Services and Retailing,107006,596,230800
20,HP,Technology,"Computers, Office Equipment",103355,4554,287000
25,Microsoft,Technology,Computer Software,93580,12193,118000
31,IBM,Technology,Information Technology Services,82461,13190,411798
...,...,...,...,...,...,...
970,Rackspace Hosting,Technology,Internet Services and Retailing,2001,126,6189
971,VeriFone Systems,Technology,"Computers, Office Equipment",2001,79,5400
975,Super Micro Computer,Technology,"Computers, Office Equipment",1991,102,2285
984,Nuance Communications,Technology,Computer Software,1931,-115,13500


In [26]:
sectors.get_group('Health Care')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
6,UnitedHealth Group,Health Care,Health Care: Insurance and Managed Care,157107,5813,200000
12,AmerisourceBergen,Health Care,Wholesalers: Health Care,135962,-135,17000
21,Cardinal Health,Health Care,Wholesalers: Health Care,102531,1215,34500
22,Express Scripts Holding,Health Care,Health Care: Pharmacy and Other Services,101752,2476,25900
...,...,...,...,...,...,...
935,VCA,Health Care,Health Care: Medical Facilities,2134,211,12700
960,PharMerica,Health Care,Health Care: Pharmacy and Other Services,2029,35,5200
965,Bio-Rad Laboratories,Health Care,Medical Products and Equipment,2019,113,7770
977,Hill-Rom Holdings,Health Care,Medical Products and Equipment,1988,48,10000


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method adds up the **Revenue** values of the rows in each group/category.
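The arithmetic is easy to verify by hand on a small sample. The sketch below uses six rows copied from this notebook's outputs (the variable names are illustrative) and contrasts single brackets, which give a **SeriesGroupBy**, with double brackets, which keep a **DataFrameGroupBy** over the selected columns.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Amazon.com", "Microsoft"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology", "Technology"],
        "Revenue": [482130, 246204, 233715, 131118, 107006, 93580],
        "Profits": [14694, 16150, 53394, 4587, 596, 12193],
    },
    index=pd.Index([1, 2, 3, 14, 18, 25], name="Rank"),
)
sectors = sample.groupby("Sector")

# Single brackets: SeriesGroupBy, so the aggregation returns a Series.
revenue_by_sector = sectors["Revenue"].sum()
print(revenue_by_sector)

# Double brackets: the aggregation returns a DataFrame, one column per selection.
totals = sectors[["Revenue", "Profits"]].sum()
print(totals)
```

For the Energy group, for instance, the revenue total is 246204 + 131118 = 377322.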

In [34]:
fortune = pd.read_csv('fortune1000.csv', index_col=0)
fortune

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400
...,...,...,...,...,...,...
996,New York Community Bancorp,Financials,Commercial Banks,1902,-47,3448
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,1896,161,21200


In [36]:
sectors = fortune.groupby('Sector')
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021A8C7C92E0>

In [37]:
sectors['Revenue'].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [38]:
sectors['Profits'].mean()

Sector
Aerospace & Defense             1437.100000
Apparel                          549.066667
Business Services                553.470588
Chemicals                        754.266667
Energy                          -602.024590
Engineering & Construction       204.000000
Financials                      1872.007194
Food and Drug Stores            1117.266667
Food, Beverages & Tobacco       1195.744186
Health Care                     1414.853333
Hotels, Resturants & Leisure     827.880000
Household Products               515.285714
Industrials                      451.391304
Materials                        102.976744
Media                            973.880000
Motor Vehicles & Parts          1079.083333
Retailing                        597.875000
Technology                      1769.343137
Telecommunications              3242.466667
Transportation                  1226.916667
Wholesalers                      205.825000
Name: Profits, dtype: float64

In [39]:
sectors['Profits'].min()

Sector
Aerospace & Defense              -240
Apparel                            82
Business Services               -1481
Chemicals                        -816
Energy                         -23119
Engineering & Construction       -155
Financials                      -1194
Food and Drug Stores              -62
Food, Beverages & Tobacco        -253
Health Care                      -458
Hotels, Resturants & Leisure    -1394
Household Products              -1149
Industrials                     -6126
Materials                       -1642
Media                            -881
Motor Vehicles & Parts           -889
Retailing                       -1243
Technology                      -4359
Telecommunications               -271
Transportation                   -191
Wholesalers                      -502
Name: Profits, dtype: int64

In [40]:
sectors['Profits'].max()

Sector
Aerospace & Defense              7608
Apparel                          3273
Business Services                6328
Chemicals                        7685
Energy                          16150
Engineering & Construction        803
Financials                      24442
Food and Drug Stores             5237
Food, Beverages & Tobacco        7351
Health Care                     18108
Hotels, Resturants & Leisure     5920
Household Products               7036
Industrials                      4833
Materials                         991
Media                            8382
Motor Vehicles & Parts           9687
Retailing                       14694
Technology                      53394
Telecommunications              17879
Transportation                   7610
Wholesalers                      1472
Name: Profits, dtype: int64

In [43]:
sectors[['Revenue', 'Employees']].sum()

Sector,Revenue,Employees
Aerospace & Defense,357940,968057
Apparel,95968,346397
Business Services,272195,1361050
Chemicals,243897,463651
Energy,1517809,1188927
Engineering & Construction,153983,406708
Financials,2217159,3359948
Food and Drug Stores,483769,1395398
"Food, Beverages & Tobacco",555967,1211632
Health Care,1614707,2678289


## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.
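A small sketch of how the resulting **MultiIndex** can be read, using seven rows copied from this notebook's outputs (variable names such as `by_pair` are made up for the example). A full `(Sector, Industry)` tuple selects one value, while a bare sector label selects every industry under it.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Company": [
            "Exxon Mobil", "Apple", "Chevron", "HP",
            "Microsoft", "Energy Transfer Equity", "Kinder Morgan",
        ],
        "Sector": [
            "Energy", "Technology", "Energy", "Technology",
            "Technology", "Energy", "Energy",
        ],
        "Industry": [
            "Petroleum Refining", "Computers, Office Equipment", "Petroleum Refining",
            "Computers, Office Equipment", "Computer Software", "Pipelines", "Pipelines",
        ],
        "Revenue": [246204, 233715, 131118, 103355, 93580, 42126, 14403],
    },
    index=pd.Index([2, 3, 14, 20, 25, 65, 198], name="Rank"),
)

pairs = sample.groupby(["Sector", "Industry"])
by_pair = pairs["Revenue"].sum()
print(by_pair)

# A full (Sector, Industry) tuple selects a single value.
pipelines_total = by_pair.loc[("Energy", "Pipelines")]

# A bare Sector label selects a sub-Series indexed by Industry.
energy_only = by_pair.loc["Energy"]

# get_group takes the same kind of tuple when grouping by several columns.
pipelines = pairs.get_group(("Energy", "Pipelines"))
print(pipelines_total)
print(energy_only)
print(list(pipelines["Company"]))
```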

In [46]:
sectors = fortune.groupby(['Sector', 'Industry'])

In [53]:
sectors['Revenue'].sum()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

In [59]:
sectors.get_group(('Energy', 'Pipelines'))

Rank,Company,Sector,Industry,Revenue,Profits,Employees
65,Energy Transfer Equity,Energy,Pipelines,42126,1189,30078
104,Enterprise Products Partners,Energy,Pipelines,27028,2521,6800
121,Plains GP Holdings,Energy,Pipelines,23152,118,5400
198,Kinder Morgan,Energy,Pipelines,14403,253,11290
348,Oneok,Energy,Pipelines,7763,245,2364
387,Targa Resources,Energy,Pipelines,6659,58,1870
493,Spectra Energy,Energy,Pipelines,5234,196,6250
656,Buckeye Partners,Energy,Pipelines,3453,437,1765
859,Enable Midstream Partners,Energy,Pipelines,2418,-752,1640
908,Genesis Energy,Energy,Pipelines,2247,423,1400


In [60]:
sectors['Profits'].sum()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            28742
Apparel              Apparel                                           8236
Business Services    Advertising, marketing                            1549
                     Diversified Outsourcing Services                  4305
                     Education                                           69
                                                                      ...  
Transportation       Trucking, Truck Leasing                           1910
Wholesalers          Miscellaneous                                       17
                     Wholesalers: Diversified                          5193
                     Wholesalers: Electronics and Office Equipment     1857
                     Wholesalers: Food and Grocery                     1166
Name: Profits, Length: 79, dtype: int64

## The agg Method
- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
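The dictionary form can be checked by hand on a four-row sample (rows copied from this notebook's outputs; variable names are illustrative). The second call shows pandas' named-aggregation keyword form, which is not used elsewhere in this notebook and is included only as an alternative that lets you choose the output column names.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Company": ["Exxon Mobil", "Apple", "Chevron", "Microsoft"],
        "Sector": ["Energy", "Technology", "Energy", "Technology"],
        "Revenue": [246204, 233715, 131118, 93580],
        "Profits": [16150, 53394, 4587, 12193],
        "Employees": [75600, 110000, 61500, 118000],
    },
    index=pd.Index([2, 3, 14, 25], name="Rank"),
)
sectors = sample.groupby("Sector")

# Dictionary form: column name -> aggregation to apply to that column.
summary = sectors.agg({"Profits": "sum", "Revenue": "mean", "Employees": "max"})
print(summary)

# Named aggregation: output_name=(column, aggregation).
named = sectors.agg(
    total_profits=("Profits", "sum"),
    largest_workforce=("Employees", "max"),
)
print(named)
```

In the Energy row, for example, `Profits` is 16150 + 4587 = 20737 and `Revenue` is the mean of 246204 and 131118, which is 188661.0.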

In [63]:
sectors = fortune.groupby('Sector')

In [68]:
sectors.agg({'Profits': 'sum', 'Revenue': 'mean', 'Employees': 'max'})

Sector,Profits,Revenue,Employees
Aerospace & Defense,28742,17897.0,197200
Apparel,8236,6397.866667,65300
Business Services,28227,5337.156863,216500
Chemicals,22628,8129.9,52000
Energy,-73447,12441.057377,75600
Engineering & Construction,5304,5922.423077,92000
Financials,260209,15950.784173,331000
Food and Drug Stores,16759,32251.266667,431000
"Food, Beverages & Tobacco",51417,12929.465116,263000
Health Care,106114,21529.426667,203500


## Iterating through Groups 
- Iterating over the **DataFrameGroupBy** object with a `for` loop yields pairs of a group name and that group's nested **DataFrame**.
- The **DataFrameGroupBy** object also supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It collects the return values of those function calls into a new **DataFrame**, which is the return value of `apply`.
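To make the relationship between `apply` and a manual loop concrete, here is a minimal sketch on five rows copied from this notebook's outputs. The helper `top_profit` and the other variable names are hypothetical. The columns are selected before calling `apply`; this is a choice made for the sketch, since recent pandas releases emit a deprecation warning when `apply` is handed the grouping column itself.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Company": ["Exxon Mobil", "Apple", "Chevron", "Amazon.com", "Microsoft"],
        "Sector": ["Energy", "Technology", "Energy", "Technology", "Technology"],
        "Profits": [16150, 53394, 4587, 596, 12193],
    },
    index=pd.Index([2, 3, 14, 18, 25], name="Rank"),
)
sectors = sample.groupby("Sector")


def top_profit(df):
    # Hypothetical helper: the single most profitable row of one group.
    return df.nlargest(1, "Profits")


# apply calls top_profit once per group and stacks the results;
# the group label becomes the outer level of the result's index.
top = sectors[["Company", "Profits"]].apply(top_profit)
print(top)

# Roughly the same thing written as an explicit loop over (name, DataFrame) pairs.
by_loop = pd.concat(
    {name: top_profit(df[["Company", "Profits"]]) for name, df in sectors}
)
print(by_loop)
```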

In [69]:
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000021A8A4D73E0>

In [72]:
for name, df in sectors:
    print(name)
    print(df)

Aerospace & Defense
                            Company               Sector  \
Rank                                                       
24                           Boeing  Aerospace & Defense   
45              United Technologies  Aerospace & Defense   
60                  Lockheed Martin  Aerospace & Defense   
88                 General Dynamics  Aerospace & Defense   
118                Northrop Grumman  Aerospace & Defense   
120                        Raytheon  Aerospace & Defense   
209                         Textron  Aerospace & Defense   
245              L-3 Communications  Aerospace & Defense   
282             Precision Castparts  Aerospace & Defense   
378   Huntington Ingalls Industries  Aerospace & Defense   
389     Spirit AeroSystems Holdings  Aerospace & Defense   
490                Rockwell Collins  Aerospace & Defense   
560                     Orbital ATK  Aerospace & Defense   
605                   Triumph Group  Aerospace & Defense   
785                 

In [74]:
def find_two_highest_profits(df):
    # Return the two rows with the largest Profits within one group.
    return df.nlargest(2, 'Profits')

sectors.apply(find_two_highest_profits)


Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Apparel,91,Nike,Apparel,Apparel,30601,3273,62600
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
Business Services,204,Visa,Business Services,Financial Data Services,13880,6328,11300
Business Services,294,MasterCard,Business Services,Financial Data Services,9667,3808,11300
Chemicals,56,Dow Chemical,Chemicals,Chemicals,48778,7685,49495
Chemicals,189,Monsanto,Chemicals,Chemicals,15001,2314,24000
Energy,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
Energy,14,Chevron,Energy,Petroleum Refining,131118,4587,61500


In [75]:
def find_three_highest_employees_count(df):
    return df.nlargest(3, 'Employees')

sectors.apply(find_three_highest_employees_count)


Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Aerospace & Defense,60,Lockheed Martin,Aerospace & Defense,Aerospace and Defense,46132,3605,126000
Apparel,448,Hanesbrands,Apparel,Apparel,5732,429,65300
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
...,...,...,...,...,...,...,...
Transportation,58,FedEx,Transportation,"Mail, Package, and Freight Delivery",47453,1050,323035
Transportation,67,American Airlines Group,Transportation,Airlines,40990,7610,118500
Wholesalers,212,Synnex,Wholesalers,Wholesalers: Electronics and Office Equipment,13338,209,78500
Wholesalers,57,Sysco,Wholesalers,Wholesalers: Food and Grocery,48681,687,51700
