In [1]:
import pandas as pd

# 7. Dataframes VI: GroupBy Object


## The Fortune 1000 Dataset

- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, location, revenue, profits, and number of employees.

In [3]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


## 1. The `groupby` Method

- **Grouping** is a way to organize/categorize/group rows based on the values in certain columns.
- The `groupby` method returns a **DataFrameGroupBy** object. It resembles a group/collection of **DataFrames** in a dictionary-like structure.
- The **DataFrameGroupBy** object can perform aggregate operations on each group within it.
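
To make the "dictionary-like structure" concrete, here is a minimal sketch on a hypothetical five-row frame called `toy`. The name and the row selection are assumptions for illustration; the values are copied from rows shown in this notebook. It uses the `.groups` attribute and plain iteration, which are not used elsewhere in these notes.

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Costco", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Retailing", "Energy"],
        "Revenue": [482130, 246204, 233715, 116199, 131118],
    },
    index=pd.Index([1, 2, 3, 15, 14], name="Rank"),
)

groups = toy.groupby("Sector")

# .groups maps each group label to the index labels (here: ranks) of its rows
group_labels = {sector: list(ranks) for sector, ranks in groups.groups.items()}
print(group_labels)

# Iterating over the GroupBy object yields (label, sub-DataFrame) pairs
for sector, frame in groups:
    print(sector, frame["Company"].tolist())
```

Each key behaves like a dictionary key and each nested **DataFrame** like its value, which is why aggregations can run once per group.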


In [6]:
# A. Why use .groupby() method?

# example 1: group companies together based on Sector they are in
# Way 1: inconvenient approach

# calculate sum of revenues per sector: 3 steps 
fortune["Sector"] == "Retailing" # create Boolean Series 

Rank
1        True
2       False
3       False
4       False
5       False
        ...  
996     False
997     False
998     False
999     False
1000    False
Name: Sector, Length: 1000, dtype: bool

In [7]:
fortune[fortune["Sector"] == "Retailing"] # use Boolean Series to subset Df & only retain companies from Retailing Sector

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
15,Costco,Retailing,Specialty Retailers: Other,"Issaquah, WA",116199,2377,161000
28,Home Depot,Retailing,Specialty Retailers: Other,"Atlanta, GA",88519,7009,385000
38,Target,Retailing,General Merchandisers,"Minneapolis, MN",73785,3363,341000
47,Lowe’s,Retailing,Specialty Retailers: Other,"Mooresville, NC",59074,2546,225000
...,...,...,...,...,...,...,...
899,Sears Hometown & Outlet Stores,Retailing,Specialty Retailers: Other,"Hoffman Estates, IL",2288,-27,2918
922,Outerwall,Retailing,Specialty Retailers: Other,"Bellevue, WA",2195,44,2670
937,hhgregg,Retailing,Specialty Retailers: Other,"Indianapolis, IN",2129,-133,4941
940,Restoration Hardware,Retailing,Specialty Retailers: Other,"Corte Madera, CA",2109,91,4200


In [8]:
fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum() # on this Df, we can subset "Revenue" column to get Series & on this Series we can perform sum() method

1465076

In [26]:
# <--> Problem: we have many different sectors! We cannot write a line of code for every Sector

# therefore: Way 2: more convenient approach = use groupby() method: all 21 sectors at one time!
fortune.groupby("Sector")["Revenue"].sum() # simply indicate what column to group on, subset Revenue column, apply sum() method

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [25]:
# B. What is DataFrameGroupBy object?

fortune.groupby("Sector") # gives a DataFrameGroupBy object = a collection of Dataframes in which each Df holds rows for a specific Sector

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x14b88cb90>

In [13]:
len(fortune.groupby("Sector")) # len() function tells us how many unique values there are in Sector column

21

In [15]:
fortune.groupby("Sector").size() # .size() method tells us how many rows/instances each group holds

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

In [17]:
fortune.groupby("Sector").first() # extract the first row from each Df within the DataFrameGroupBy object (not displayed: a cell only shows its last expression)
fortune.groupby("Sector").last() # or the last row of each Df (this is the output shown below)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,"McLean, VA",1923,-133,12000
Apparel,Guess,Apparel,"Los Angeles, CA",2204,82,13500
Business Services,DeVry Education Group,Education,"Downers Grove, IL",1910,140,11770
Chemicals,H.B. Fuller,Chemicals,"St. Paul, MN",2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,"Denver, CO",1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,"Westbury, NY",1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,"Memphis, TN",2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,"Morrisville, NC",2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,"Tucson, AZ",1987,84,9072


## 2. Retrieve a Group with the `get_group` Method

- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.
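
As a rough sketch of what `get_group` returns, the example below compares it with a Boolean filter on a hypothetical small frame (`toy` is an assumed name; the values are copied from rows shown in this notebook).

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 131118],
    },
    index=pd.Index([1, 2, 3, 14], name="Rank"),
)

sectors = toy.groupby("Sector")

# Pull out the nested DataFrame for a single group label
energy = sectors.get_group("Energy")

# The same rows selected the manual way, with a Boolean Series
same_rows = toy[toy["Sector"] == "Energy"]

print(energy)
print(energy.index.equals(same_rows.index))
```

Both approaches select the same rows; `get_group` just avoids building the Boolean Series once the groups already exist.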


In [18]:
# load dataset

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [19]:
# method to retrieve Df/group within DataFrameGroupBy Object
sectors.get_group("Technology")
sectors.get_group("Retailing")
sectors.get_group("Energy")

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
14,Chevron,Energy,Petroleum Refining,"San Ramon, CA",131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,"Houston, TX",87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,"San Antonio, TX",81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,"Findlay, OH",64566,2852,45440
...,...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production","Tulsa, OK",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,"Houston, TX",1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production","Houston, TX",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646


## 3. Methods on the GroupBy Object

- Use square brackets on the **DataFrameGroupBy** object to *"extract" a column* from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation *on every group* within the collection.
- For example, the `sum` method adds up the **Revenue** values of the rows within each group/category.
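
The points above can be checked by hand on a frame small enough to add up mentally. This is a sketch under assumptions: `toy` is a hypothetical name, and its five rows are copied from rows shown in this notebook.

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Costco", "Chevron"],
        "Sector": ["Retailing", "Energy", "Technology", "Retailing", "Energy"],
        "Revenue": [482130, 246204, 233715, 116199, 131118],
    },
    index=pd.Index([1, 2, 3, 15, 14], name="Rank"),
)

# Square brackets on the DataFrameGroupBy object give a SeriesGroupBy object
revenue_by_sector = toy.groupby("Sector")["Revenue"]
print(type(revenue_by_sector).__name__)

# Each aggregation runs once per group and returns one value per group
totals = revenue_by_sector.sum()
means = revenue_by_sector.mean()
print(totals)
print(means)
```

For the Energy group, `sum` returns 246204 + 131118 = 377322, one value for the whole group rather than one per row.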


In [21]:
# load dataset

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [22]:
sectors["Revenue"] # extract 1 column from DataFrameGroupBy Object = we get back a SeriesGroupBy object

<pandas.core.groupby.generic.SeriesGroupBy object at 0x14b92ced0>

In [23]:
# on this SeriesGroupBy object we can perform multiple aggregation methods
fortune.groupby("Sector")["Revenue"].sum() # simply indicate what column to group on, subset Revenue column, apply sum() method
# or with the sectors-variable notation:
sectors["Revenue"].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [27]:
# other methods:
sectors["Revenue"].max()
sectors["Revenue"].min()
sectors["Revenue"].mean()

Sector
Aerospace & Defense             17897.000000
Apparel                          6397.866667
Business Services                5337.156863
Chemicals                        8129.900000
Energy                          12441.057377
Engineering & Construction       5922.423077
Financials                      15950.784173
Food and Drug Stores            32251.266667
Food, Beverages & Tobacco       12929.465116
Health Care                     21529.426667
Hotels, Resturants & Leisure     6781.840000
Household Products               8383.464286
Industrials                     10816.978261
Materials                        6026.627907
Media                            8830.560000
Motor Vehicles & Parts          20105.833333
Retailing                       18313.450000
Technology                      13505.882353
Telecommunications              30788.933333
Transportation                  11347.444444
Wholesalers                     11120.000000
Name: Revenue, dtype: float64

## 4. Grouping by Multiple Columns

- Pass a list of columns to the `groupby` method to group by each unique combination of values across those columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **MultiIndex Series** where the levels will be the original groups.
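
A small sketch of how the resulting **MultiIndex Series** can be read back. The frame `toy` is hypothetical (its rows are copied from rows shown in this notebook), and the `.loc` lookups and `reset_index` call are additions for illustration, not part of the original notes.

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Costco", "Home Depot", "Target", "Exxon Mobil", "WPX Energy"],
        "Sector": ["Retailing", "Retailing", "Retailing", "Retailing", "Energy", "Energy"],
        "Industry": [
            "General Merchandisers",
            "Specialty Retailers: Other",
            "Specialty Retailers: Other",
            "General Merchandisers",
            "Petroleum Refining",
            "Mining, Crude-Oil Production",
        ],
        "Revenue": [482130, 116199, 88519, 73785, 246204, 1958],
    }
)

# One group per unique (Sector, Industry) combination
totals = toy.groupby(["Sector", "Industry"])["Revenue"].sum()
print(totals)

# A tuple selects one combination; a single label selects a whole outer level
one_combination = totals.loc[("Retailing", "General Merchandisers")]
retailing_only = totals.loc["Retailing"]

# reset_index turns the index levels back into ordinary columns
flat = totals.reset_index()
print(flat)
```

The outer level holds the first grouping column and the inner level the second, in the order they were passed to `groupby`.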


In [28]:
# load dataset

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")

fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [30]:
# make groups based on combinations of values across multiple columns

# example: Industry column is a subtype of Sector column
# for each of these combined groups Pandas will then perform the aggregation

sectors_annex_industries = fortune.groupby(["Sector","Industry"])
sectors_annex_industries.size() # one group for each of the 79 unique Sector/Industry combinations

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [31]:
sectors_annex_industries["Revenue"].sum() # we can again target a column for which we want to perform aggregation & call method on it

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

## 5. The `agg` Method

- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
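
A minimal sketch of the dictionary form on a hypothetical small frame (`toy` is an assumed name; the values are copied from rows shown in this notebook). The second call uses pandas' named-aggregation keyword syntax, which is not covered in these notes and is included only as an optional variant; the output column names `total_revenue` and `best_profit` are made up for the example.

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Costco", "Exxon Mobil", "Chevron", "Apple"],
        "Sector": ["Retailing", "Retailing", "Energy", "Energy", "Technology"],
        "Revenue": [482130, 116199, 246204, 131118, 233715],
        "Profits": [14694, 2377, 16150, 4587, 53394],
    }
)

sectors = toy.groupby("Sector")

# Dictionary form: column name -> aggregation to apply to that column
summary = sectors.agg({"Revenue": "sum", "Profits": "max"})
print(summary)

# Named aggregation: output column name = (input column, aggregation)
named = sectors.agg(
    total_revenue=("Revenue", "sum"),
    best_profit=("Profits", "max"),
)
print(named)
```

Columns that are not listed in the dictionary (here **Company**) are left out of the result.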


In [33]:
# load dataset

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [35]:
# lets us apply different aggregation operations to different columns

sectors.agg({"Revenue":"sum","Profits":"max"}) # apply .agg() method directly on DataFrameGroupBy Object

Sector,Revenue,Profits
Aerospace & Defense,357940,7608
Apparel,95968,3273
Business Services,272195,6328
Chemicals,243897,7685
Energy,1517809,16150
Engineering & Construction,153983,803
Financials,2217159,24442
Food and Drug Stores,483769,5237
"Food, Beverages & Tobacco",555967,7351
Health Care,1614707,18108


## 6. Iterating through Groups with the `apply` Method

- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return values of the functions and collects them in a new **DataFrame** (the return value).
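
The behaviour described above can be sketched on a hypothetical six-row frame (`toy` and `top_two_by_employees` are assumed names; the values are copied from rows shown in this notebook). Selecting the columns before calling `apply` is a choice made for this sketch: it keeps the grouping column out of the frames passed to the function, which sidesteps version-dependent handling of that column.

```python
import pandas as pd

# Hypothetical miniature stand-in for fortune1000.csv (illustration only)
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Home Depot", "Target", "Exxon Mobil", "Halliburton", "Chevron"],
        "Sector": ["Retailing", "Retailing", "Retailing", "Energy", "Energy", "Energy"],
        "Employees": [2300000, 385000, 341000, 75600, 65000, 61500],
    },
    index=pd.Index([1, 28, 38, 2, 117, 14], name="Rank"),
)

def top_two_by_employees(group):
    # Called once per group; `group` is that group's nested DataFrame
    return group.nlargest(2, "Employees")

# The per-group results are stacked into one DataFrame indexed by (Sector, Rank)
top_two = toy.groupby("Sector")[["Company", "Employees"]].apply(top_two_by_employees)
print(top_two)
```

The function runs once for Energy and once for Retailing, and the group label becomes the outer index level of the combined result.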


In [36]:
# load dataset

fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head()

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [37]:
# with Series Object .apply() method invokes function once for every value
# with DataFrame Object .apply() method invokes function once for every column (or once for every row with axis=1)
# with DataFrameGroupBy Object .apply() method invokes function once for every group/nested Df in DataFrameGroupBy Object

# Example: find the 2 companies with the most employees in every sector

fortune.nlargest(2,"Employees") # this ranks across the whole dataset, not within each sector

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
218,Yum Brands,"Hotels, Resturants & Leisure",Food Services,"Louisville, KY",13105,1293,505000


In [38]:
def top_two_companies_by_employee_count(sector):
    return sector.nlargest(2,"Employees")

sectors.apply(top_two_companies_by_employee_count) # will apply the function 21 times, once for each nested Df


Sector,Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,"Farmington, CT",61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
Apparel,448,Hanesbrands,Apparel,Apparel,"Winston-Salem, NC",5732,429,65300
Apparel,231,VF,Apparel,Apparel,"Greensboro, NC",12377,1232,64000
Business Services,199,Aramark,Business Services,Diversified Outsourcing Services,"Philadelphia, PA",14329,236,216500
Business Services,744,Convergys,Business Services,Diversified Outsourcing Services,"Cincinnati, OH",2951,169,130000
Chemicals,101,DuPont,Chemicals,Chemicals,"Wilmington, DE",27940,1953,52000
Chemicals,56,Dow Chemical,Chemicals,Chemicals,"Midland, MI",48778,7685,49495
Energy,2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
Energy,117,Halliburton,Energy,"Oil and Gas Equipment, Services","Houston, TX",23633,-671,65000
