# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** lists each company's name, sector, and industry, along with its revenue, profits, and number of employees.

In [13]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    1000 non-null   object
 1   Sector     1000 non-null   object
 2   Industry   1000 non-null   object
 3   Revenue    1000 non-null   int64 
 4   Profits    1000 non-null   int64 
 5   Employees  1000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 54.7+ KB


In [6]:
pd.read_csv('fortune1000.csv').nunique()

Rank         996
Company      996
Sector        21
Industry      73
Revenue      945
Profits      760
Employees    755
dtype: int64

In [14]:

fortune['Sector'] = fortune['Sector'].astype('category')
fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [15]:
fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   Company    1000 non-null   object  
 1   Sector     1000 non-null   category
 2   Industry   1000 non-null   category
 3   Revenue    1000 non-null   int64   
 4   Profits    1000 non-null   int64   
 5   Employees  1000 non-null   int64   
dtypes: category(2), int64(3), object(1)
memory usage: 44.3+ KB


## The groupby Method
- **Grouping** splits the rows of a **DataFrame** into buckets based on the values in one or more columns.
- The `groupby` method returns a **DataFrameGroupBy** object. It behaves like a dictionary-like collection of **DataFrames**, one per distinct value of the grouping column.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
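The points above can be sketched on a tiny stand-in table. The `toy` DataFrame and its values are made up for illustration and are not part of the Fortune 1000 data; the cells that follow use the real file.

```python
import pandas as pd

# Hypothetical toy data standing in for fortune1000.csv
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Energy", "Retailing", "Energy", "Technology", "Retailing"],
    "Revenue": [100, 250, 50, 300, 75],
})

by_sector = toy.groupby("Sector")

# The number of groups equals the number of distinct Sector values
n_groups = len(by_sector)

# .groups maps each group key to the index labels of the rows in that group
group_labels = {key: list(labels) for key, labels in by_sector.groups.items()}

# An aggregation is computed once per group
revenue_totals = by_sector["Revenue"].sum()

print(n_groups)
print(group_labels)
print(revenue_totals)
```

The `.groups` attribute is a convenient way to see the dictionary-like structure without materializing every sub-DataFrame.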

In [16]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune['Sector'] = fortune['Sector'].astype('category')
fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [25]:
# Goal: sum of revenues for each sector

# groupby takes a column name (or a list of column names) whose values define the groups.
# It returns a DataFrameGroupBy object: conceptually, one sub-DataFrame per distinct value
# (or per distinct combination of values) in the grouping column(s).
sectors = fortune.groupby('Sector', observed=True)  # the observed parameter only matters when grouping by categorical columns
sectors
len(sectors)  # number of groups

21
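To make the role of `observed` concrete, here is a minimal sketch on made-up data. It assumes a categorical column that declares a category ("Media") which never appears in any row; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical data: "Media" is a declared category but no row uses it
toy = pd.DataFrame({
    "Sector": pd.Categorical(
        ["Energy", "Retailing", "Energy"],
        categories=["Energy", "Retailing", "Media"],
    ),
    "Revenue": [100, 250, 50],
})

# observed=True keeps only the categories that actually occur in the data
observed_sizes = toy.groupby("Sector", observed=True).size()

# observed=False also reports declared-but-unused categories, with size 0
all_sizes = toy.groupby("Sector", observed=False).size()

print(observed_sizes)
print(all_sizes)
```

In the Fortune data every declared category occurs at least once, so both settings should give the same 21 groups there.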

In [26]:
sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

In [28]:
sectors.first().head()  # the first row of each group (strictly, the first non-null value of each column within each group)

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,96114,5176,161400
Apparel,Nike,Apparel,30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,19330,419,27000
Chemicals,Dow Chemical,Chemicals,48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,246204,16150,75600


## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.
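As a rough sketch of what `get_group` does, the example below compares it with an ordinary boolean filter on a hypothetical `toy` table (the data is invented for illustration).

```python
import pandas as pd

# Hypothetical toy data standing in for fortune1000.csv
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Sector": ["Energy", "Retailing", "Energy", "Technology"],
    "Revenue": [100, 250, 50, 300],
})

by_sector = toy.groupby("Sector")

# Rows belonging to the "Energy" group, with their original index labels
energy = by_sector.get_group("Energy")

# A boolean filter selects the same rows
energy_filtered = toy[toy["Sector"] == "Energy"]
same_rows = list(energy.index) == list(energy_filtered.index)

# Asking for a key that is not a group raises KeyError
try:
    by_sector.get_group("Media")
    missing_key_raised = False
except KeyError:
    missing_key_raised = True

print(energy)
print(same_rows, missing_key_raised)
```

When only one group is needed, the boolean filter is often simpler; `get_group` is handy when the GroupBy object already exists.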

In [30]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [36]:
sectors.get_group('Energy')  # the nested DataFrame for the "Energy" sector (not displayed: a cell only shows its last expression)
sectors.get_group('Technology')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
18,Amazon.com,Technology,Internet Services and Retailing,107006,596,230800
20,HP,Technology,"Computers, Office Equipment",103355,4554,287000
25,Microsoft,Technology,Computer Software,93580,12193,118000
31,IBM,Technology,Information Technology Services,82461,13190,411798
...,...,...,...,...,...,...
970,Rackspace Hosting,Technology,Internet Services and Retailing,2001,126,6189
971,VeriFone Systems,Technology,"Computers, Office Equipment",2001,79,5400
975,Super Micro Computer,Technology,"Computers, Office Equipment",1991,102,2285
984,Nuance Communications,Technology,Computer Software,1931,-115,13500


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method adds up the **Revenue** values of the rows in each group/category.
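A minimal sketch of column selection on a GroupBy object, using an invented `toy` table: single brackets give a SeriesGroupBy, while a list of columns keeps a DataFrameGroupBy.

```python
import pandas as pd

# Hypothetical toy data standing in for fortune1000.csv
toy = pd.DataFrame({
    "Sector": ["Energy", "Retailing", "Energy", "Retailing"],
    "Revenue": [100, 250, 50, 75],
    "Profits": [10, 20, -4, 5],
})

by_sector = toy.groupby("Sector")

# Single brackets select one column and yield a SeriesGroupBy
revenue_by_sector = by_sector["Revenue"]
totals = revenue_by_sector.sum()  # a Series indexed by Sector

# A list of columns keeps a DataFrameGroupBy, so the result is a DataFrame
averages = by_sector[["Revenue", "Profits"]].mean().round(2)

print(type(revenue_by_sector).__name__)
print(totals)
print(averages)
```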

In [37]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [52]:
sectors['Revenue'].sum()  # total 'Revenue' for each sector, computed from the SeriesGroupBy object
sectors['Employees'].sum()
sectors['Profits'].max()
sectors['Profits'].min()

sectors['Employees'].mean().apply(lambda x: round(x))

sectors[['Revenue','Profits']].sum()
sectors[['Revenue','Profits']].mean().round(2)  # round every value to two decimals

Sector,Revenue,Profits
Aerospace & Defense,17897.0,1437.1
Apparel,6397.87,549.07
Business Services,5337.16,553.47
Chemicals,8129.9,754.27
Energy,12441.06,-602.02
Engineering & Construction,5922.42,204.0
Financials,15950.78,1872.01
Food and Drug Stores,32251.27,1117.27
"Food, Beverages & Tobacco",12929.47,1195.74
Health Care,21529.43,1414.85


## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas returns a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.
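The sketch below shows one way to work with the resulting MultiIndex on a small invented table; the industry names in `toy` are placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the industry names are made up
toy = pd.DataFrame({
    "Sector": ["Energy", "Energy", "Energy", "Technology", "Technology"],
    "Industry": ["Mining", "Pipelines", "Mining", "Software", "Software"],
    "Revenue": [100, 50, 25, 300, 200],
})

totals = toy.groupby(["Sector", "Industry"])["Revenue"].sum()

# A tuple selects one (Sector, Industry) pairing
energy_mining = totals.loc[("Energy", "Mining")]

# A single outer key returns a Series indexed by the inner level
energy_only = totals.loc["Energy"]

# unstack pivots the inner level into columns; missing pairings become NaN
wide = totals.unstack("Industry")

print(totals)
print(energy_mining)
print(wide)
```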

In [54]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby(['Sector', 'Industry'])

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [None]:
sectors.size() # number of rows within each group

# with two grouping columns, the result is a Series with a MultiIndex

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [58]:
sectors['Revenue'].sum()
sectors['Employees'].mean()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            48402.850000
Apparel              Apparel                                          23093.133333
Business Services    Advertising, marketing                           62050.000000
                     Diversified Outsourcing Services                 50595.000000
                     Education                                        15585.000000
                                                                          ...     
Transportation       Trucking, Truck Leasing                          18939.555556
Wholesalers          Miscellaneous                                     9200.000000
                     Wholesalers: Diversified                          9353.240000
                     Wholesalers: Electronics and Office Equipment    20832.625000
                     Wholesalers: Food and Grocery                    19317.500000
Name: Employees, Len

## The agg Method
- The `agg` method lets you apply different aggregation methods to different columns in a single call.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
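Besides the dictionary form described above, pandas also accepts keyword arguments of the form `new_name=(column, function)`, which yields flat, explicitly named result columns. A minimal sketch on invented data (the `toy` frame and the output names are illustrative choices):

```python
import pandas as pd

# Hypothetical toy data standing in for fortune1000.csv
toy = pd.DataFrame({
    "Sector": ["Energy", "Retailing", "Energy", "Retailing", "Technology"],
    "Revenue": [100, 250, 50, 75, 300],
    "Employees": [500, 1000, 200, 300, 50],
})

# Each keyword is an output column name; each value is (input column, aggregation)
summary = toy.groupby("Sector").agg(
    revenue_total=("Revenue", "sum"),
    revenue_mean=("Revenue", "mean"),
    max_employees=("Employees", "max"),
)

print(summary)
```

Compared with a dictionary that maps a column to a list of functions, this form avoids the two-level column header in the result.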

In [59]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [67]:
sectors.agg({'Revenue': 'sum'}).head()
# the line above computes the same values as

sectors['Revenue'].sum().head()
# the difference is that the first returns a DataFrame while the second returns a Series

Sector
Aerospace & Defense     357940
Apparel                  95968
Business Services       272195
Chemicals               243897
Energy                 1517809
Name: Revenue, dtype: int64

In [75]:
# the advantage of agg is that we can aggregate several columns at once and, for any given column, apply more than one aggregation function
sectors.agg({
        'Revenue': ['sum','mean'],
        'Employees': 'max'
}).head()

,Revenue,Revenue,Employees
Sector,sum,mean,max
Aerospace & Defense,357940,17897.0,197200
Apparel,95968,6397.866667,65300
Business Services,272195,5337.156863,216500
Chemicals,243897,8129.9,52000
Energy,1517809,12441.057377,75600


In [79]:
# when agg is called on a single column, keyword arguments let you choose the names of the resulting columns
sectors['Revenue'].agg(
    RevenueSum= 'sum',
    RevenueMean= 'mean',
    RevenueMax= 'max'
).head()

Sector,RevenueSum,RevenueMean,RevenueMax
Aerospace & Defense,357940,17897.0,96114
Apparel,95968,6397.866667,30601
Business Services,272195,5337.156863,19330
Chemicals,243897,8129.9,48778
Energy,1517809,12441.057377,246204


## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return values of the functions and collects them in a new **DataFrame** (the return value).
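A GroupBy object can also be iterated directly: each step yields the group key together with that group's DataFrame. The sketch below uses an invented `toy` table, and the sort-then-`head` alternative is just one possible way to get the same kind of result without `apply`.

```python
import pandas as pd

# Hypothetical toy data standing in for fortune1000.csv
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Sector": ["Energy", "Retailing", "Energy", "Technology", "Retailing"],
    "Employees": [500, 1000, 200, 50, 300],
})

by_sector = toy.groupby("Sector")

# Iterating yields (group key, sub-DataFrame) pairs
top_per_sector = {}
for sector_name, sector_df in by_sector:
    largest = sector_df.nlargest(1, "Employees")
    top_per_sector[sector_name] = largest["Company"].iloc[0]

# Alternative without apply: sort once, then keep the first row of each group
top_rows = (
    toy.sort_values("Employees", ascending=False)
       .groupby("Sector")
       .head(1)
)

print(top_per_sector)
print(top_rows)
```

An explicit loop is easier to debug, while `apply` is more compact when the per-group function already returns a DataFrame.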

In [3]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [92]:
# Goal: for each sector, find the two companies with the most employees

def top_two_companies_by_employ_count(sector):
    return sector.nlargest(2, 'Employees')  # called once for each sector's DataFrame

sectors.apply(top_two_companies_by_employ_count)
sectors.apply(lambda nested_df: nested_df.nlargest(2, 'Employees'))  # equivalent lambda: each "element" of sectors is an entire DataFrame

Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Apparel,448,Hanesbrands,Apparel,Apparel,5732,429,65300
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
Business Services,199,Aramark,Business Services,Diversified Outsourcing Services,14329,236,216500
Business Services,744,Convergys,Business Services,Diversified Outsourcing Services,2951,169,130000
Chemicals,101,DuPont,Chemicals,Chemicals,27940,1953,52000
Chemicals,56,Dow Chemical,Chemicals,Chemicals,48778,7685,49495
Energy,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
Energy,117,Halliburton,Energy,"Oil and Gas Equipment, Services",23633,-671,65000
