# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, and industry, along with its revenue, profits, and number of employees.

In [13]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Company    1000 non-null   object
 1   Sector     1000 non-null   object
 2   Industry   1000 non-null   object
 3   Revenue    1000 non-null   int64 
 4   Profits    1000 non-null   int64 
 5   Employees  1000 non-null   int64 
dtypes: int64(3), object(3)
memory usage: 54.7+ KB


In [6]:
pd.read_csv('fortune1000.csv').nunique()

Rank         996
Company      996
Sector        21
Industry      73
Revenue      945
Profits      760
Employees    755
dtype: int64

In [14]:

fortune['Sector'] = fortune['Sector'].astype('category')
fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [15]:
fortune.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   Company    1000 non-null   object  
 1   Sector     1000 non-null   category
 2   Industry   1000 non-null   category
 3   Revenue    1000 non-null   int64   
 4   Profits    1000 non-null   int64   
 5   Employees  1000 non-null   int64   
dtypes: category(2), int64(3), object(1)
memory usage: 44.3+ KB


## The groupby Method
- **Grouping** splits the rows of a **DataFrame** into buckets that share the same value in a column.
- The `groupby` method returns a **DataFrameGroupBy** object. It behaves like a dictionary-like collection of **DataFrames**, one per group.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
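
The points above can be sketched on a toy frame. This is only an illustration: the five rows are copied from outputs in this notebook, and the names `toy`, `groups`, `n_groups`, and `sizes` are made up for the example.

```python
import pandas as pd

# Toy data: five rows taken from this notebook's outputs, not the full CSV.
toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Amazon.com", "Halliburton"],
        "Sector": ["Retailing", "Energy", "Technology", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 107006, 23633],
    }
)

groups = toy.groupby("Sector")   # a DataFrameGroupBy object
n_groups = len(groups)           # one group per distinct Sector value
sizes = groups.size()            # number of rows per group, as a Series

# Iterating yields (label, sub-DataFrame) pairs, much like dict.items().
for label, frame in groups:
    print(label, list(frame["Company"]))
```

Nothing is computed until an aggregation such as `size` or `sum` is requested; the groupby object itself only records how the rows are split.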

In [16]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune['Sector'] = fortune['Sector'].astype('category')
fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [25]:
# group the companies by sector

# Pass groupby a column name (or a list of column names) whose values define the groups.
# It returns a DataFrameGroupBy object holding one sub-DataFrame per unique value (or per unique combination of values).
sectors= fortune.groupby('Sector', observed= True) # the observed parameter only matters for categorical columns
sectors
len(sectors) # number of groups

21
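
A small sketch of what `observed` changes when the grouping column is categorical. The toy frame below deliberately declares a `Media` category that has no rows; the data and variable names are assumptions made for illustration.

```python
import pandas as pd

# "Media" is a declared category with no rows in this toy frame.
toy = pd.DataFrame(
    {
        "Sector": pd.Categorical(
            ["Energy", "Technology", "Energy"],
            categories=["Energy", "Media", "Technology"],
        ),
        "Revenue": [246204, 233715, 23633],
    }
)

# observed=True: only categories that actually appear in the data form groups.
observed_only = toy.groupby("Sector", observed=True)["Revenue"].sum()

# observed=False: every declared category gets a group, even an empty one.
all_categories = toy.groupby("Sector", observed=False)["Revenue"].sum()

print(observed_only)
print(all_categories)
```

In the Fortune data every sector has at least one company, so both settings give the same 21 groups. Passing the argument explicitly mainly avoids depending on the default, which has differed between pandas versions.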

In [26]:
sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

In [28]:
sectors.first().head() # the first row of each "nested" DataFrame inside the DataFrameGroupBy object

Sector,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,96114,5176,161400
Apparel,Nike,Apparel,30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,19330,419,27000
Chemicals,Dow Chemical,Chemicals,48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,246204,16150,75600


## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.
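
A minimal sketch of `get_group`, using a toy frame whose rows and ranks are copied from this notebook's outputs. The variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Amazon.com", "Halliburton"],
        "Sector": ["Retailing", "Energy", "Technology", "Technology", "Energy"],
    },
    index=pd.Index([1, 2, 3, 18, 117], name="Rank"),
)

sectors = toy.groupby("Sector")

# get_group returns an ordinary DataFrame with the rows of one group.
tech = sectors.get_group("Technology")

# The groups attribute maps each group label to the index labels of its rows.
labels_by_group = sectors.groups

print(tech)
print(labels_by_group)
```

Asking for a label that does not exist raises a `KeyError`, so checking `labels_by_group` first can be useful when the label comes from user input.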

In [30]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [36]:
sectors.get_group('Energy') # getting the nested dataframe corresponding to the "Energy" Sector
sectors.get_group('Technology')

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
18,Amazon.com,Technology,Internet Services and Retailing,107006,596,230800
20,HP,Technology,"Computers, Office Equipment",103355,4554,287000
25,Microsoft,Technology,Computer Software,93580,12193,118000
31,IBM,Technology,Information Technology Services,82461,13190,411798
...,...,...,...,...,...,...
970,Rackspace Hosting,Technology,Internet Services and Retailing,2001,126,6189
971,VeriFone Systems,Technology,"Computers, Office Equipment",2001,79,5400
975,Super Micro Computer,Technology,"Computers, Office Equipment",1991,102,2285
984,Nuance Communications,Technology,Computer Software,1931,-115,13500


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, the `sum` method adds up the `Revenue` values of all rows in each group/category.
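
The bullets above can be sketched as follows, assuming a toy frame with five rows copied from this notebook's outputs. Variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Amazon.com", "Halliburton"],
        "Sector": ["Retailing", "Energy", "Technology", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 107006, 23633],
        "Profits": [14694, 16150, 53394, 596, -671],
    }
)

sectors = toy.groupby("Sector")

# One column name in brackets gives a SeriesGroupBy; aggregating it returns a Series.
revenue_groups = sectors["Revenue"]
revenue_by_sector = revenue_groups.sum()

# A list of column names keeps a DataFrameGroupBy; aggregating it returns a DataFrame.
rounded_means = sectors[["Revenue", "Profits"]].mean().round(2)

print(revenue_by_sector)
print(rounded_means)
```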

In [37]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [52]:
sectors['Revenue'].sum() # total 'Revenue' per sector, computed on the SeriesGroupBy object
sectors['Employees'].sum()
sectors['Profits'].max()
sectors['Profits'].min()

sectors['Employees'].mean().apply(lambda x: round(x))

sectors[['Revenue','Profits']].sum()
sectors[['Revenue','Profits']].mean().round(2) # round every per-sector mean to 2 decimals

Sector,Revenue,Profits
Aerospace & Defense,17897.0,1437.1
Apparel,6397.87,549.07
Business Services,5337.16,553.47
Chemicals,8129.9,754.27
Energy,12441.06,-602.02
Engineering & Construction,5922.42,204.0
Financials,15950.78,1872.01
Food and Drug Stores,32251.27,1117.27
"Food, Beverages & Tobacco",12929.47,1195.74
Health Care,21529.43,1414.85


## Grouping by Multiple Columns
- Pass a list of columns to the **groupby** method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex**, with one index level per grouping column.
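
A short sketch of grouping by two columns, again on toy rows copied from this notebook's outputs. The names `by_pair`, `tech_only`, and `flat` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Apple", "HP", "Microsoft", "Exxon Mobil", "Halliburton"],
        "Sector": ["Technology", "Technology", "Technology", "Energy", "Energy"],
        "Industry": [
            "Computers, Office Equipment",
            "Computers, Office Equipment",
            "Computer Software",
            "Petroleum Refining",
            "Oil and Gas Equipment, Services",
        ],
        "Revenue": [233715, 103355, 93580, 246204, 23633],
    }
)

# Each distinct (Sector, Industry) pair becomes one group.
by_pair = toy.groupby(["Sector", "Industry"])["Revenue"].sum()

# Select one outer label to get a Series indexed by Industry only.
tech_only = by_pair.loc["Technology"]

# reset_index turns the two index levels back into ordinary columns.
flat = by_pair.reset_index()

print(by_pair)
print(flat)
```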

In [54]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby(['Sector', 'Industry'])

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [None]:
sectors.size() # number of rows within each group

# in this case, the result is a multi-index series

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64

In [58]:
sectors['Revenue'].sum()
sectors['Employees'].mean()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            48402.850000
Apparel              Apparel                                          23093.133333
Business Services    Advertising, marketing                           62050.000000
                     Diversified Outsourcing Services                 50595.000000
                     Education                                        15585.000000
                                                                          ...     
Transportation       Trucking, Truck Leasing                          18939.555556
Wholesalers          Miscellaneous                                     9200.000000
                     Wholesalers: Diversified                          9353.240000
                     Wholesalers: Electronics and Office Equipment    20832.625000
                     Wholesalers: Food and Grocery                    19317.500000
Name: Employees, Length: 79, dtype: float64

## The agg Method
- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
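
A minimal sketch of both calling styles of `agg` on a toy frame (rows copied from this notebook's outputs). The result column names `total_revenue` and `largest_headcount` are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Amazon.com", "Halliburton"],
        "Sector": ["Retailing", "Energy", "Technology", "Technology", "Energy"],
        "Revenue": [482130, 246204, 233715, 107006, 23633],
        "Employees": [2300000, 75600, 110000, 230800, 65000],
    }
)

sectors = toy.groupby("Sector")

# Dictionary form: column -> aggregation, or column -> list of aggregations.
# A list produces two-level column labels such as ("Revenue", "sum").
summary = sectors.agg({"Revenue": ["sum", "mean"], "Employees": "max"})

# Named aggregation: keyword = (column, aggregation) gives flat, custom column names.
named = sectors.agg(
    total_revenue=("Revenue", "sum"),
    largest_headcount=("Employees", "max"),
)

print(summary)
print(named)
```

The named form tends to be easier to work with downstream, because the result has single-level column labels.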

In [59]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [67]:
sectors.agg( {'Revenue': 'sum'} ).head()
# the line above produces the same numbers as the line below

sectors['Revenue'].sum().head()
# the only difference: agg with a dictionary returns a DataFrame, while this line returns a Series

Sector
Aerospace & Defense     357940
Apparel                  95968
Business Services       272195
Chemicals               243897
Energy                 1517809
Name: Revenue, dtype: int64

In [75]:
# the advantage of agg is that it accepts several columns at once and, for any given column, a list of aggregation functions
sectors.agg({
        'Revenue': ['sum','mean'],
        'Employees': 'max'
}).head()

,Revenue,Revenue,Employees
Sector,sum,mean,max
Aerospace & Defense,357940,17897.0,197200
Apparel,95968,6397.866667,65300
Business Services,272195,5337.156863,216500
Chemicals,243897,8129.9,52000
Energy,1517809,12441.057377,75600


In [79]:
# on a single column (a SeriesGroupBy), keyword arguments passed to agg become the names of the result columns
sectors['Revenue'].agg(
    RevenueSum= 'sum',
    RevenueMean= 'mean',
    RevenueMax= 'max'
).head()

Sector,RevenueSum,RevenueMean,RevenueMax
Aerospace & Defense,357940,17897.0,96114
Apparel,95968,6397.866667,30601
Business Services,272195,5337.156863,19330
Chemicals,243897,8129.9,48778
Energy,1517809,12441.057377,246204


## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It captures the return values of the functions and collects them in a new **DataFrame** (the return value).
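
A sketch of `apply` on a toy frame (rows copied from this notebook's outputs). Selecting the columns of interest before calling `apply` is a choice made here so the example does not depend on how a given pandas version treats the grouping column inside `apply`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Walmart", "Exxon Mobil", "Apple", "Amazon.com", "Halliburton"],
        "Sector": ["Retailing", "Energy", "Technology", "Technology", "Energy"],
        "Employees": [2300000, 75600, 110000, 230800, 65000],
    }
)

# The lambda receives one sub-DataFrame per sector and returns its top row.
top_one = toy.groupby("Sector")[["Company", "Employees"]].apply(
    lambda nested_df: nested_df.nlargest(1, "Employees")
)

# The same companies without apply: sort once, then keep the first row of each group.
alternative = toy.sort_values("Employees", ascending=False).groupby("Sector").head(1)

print(top_one)
print(alternative)
```

The result of `apply` carries the group label as an extra outer index level, which is why the output of the notebook cell below shows both `Sector` and `Rank` in the index.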

In [3]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
#fortune['Sector'] = fortune['Sector'].astype('category')
#fortune['Industry'] = fortune['Industry'].astype('category')
fortune.sort_index(inplace= True)

sectors= fortune.groupby('Sector')

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [92]:
# For each sector, find the two companies with the most employees

def top_two_companies_by_employ_count(sector):
    return sector.nlargest(2, 'Employees') # called once per nested DataFrame

sectors.apply(top_two_companies_by_employ_count)
sectors.apply(lambda nested_df: nested_df.nlargest(2, 'Employees')) # same result with a lambda: each "element" of sectors is an entire DataFrame

Sector,Rank,Company,Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,45,United Technologies,Aerospace & Defense,Aerospace and Defense,61047,7608,197200
Aerospace & Defense,24,Boeing,Aerospace & Defense,Aerospace and Defense,96114,5176,161400
Apparel,448,Hanesbrands,Apparel,Apparel,5732,429,65300
Apparel,231,VF,Apparel,Apparel,12377,1232,64000
Business Services,199,Aramark,Business Services,Diversified Outsourcing Services,14329,236,216500
Business Services,744,Convergys,Business Services,Diversified Outsourcing Services,2951,169,130000
Chemicals,101,DuPont,Chemicals,Chemicals,27940,1953,52000
Chemicals,56,Dow Chemical,Chemicals,Chemicals,48778,7685,49495
Energy,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
Energy,117,Halliburton,Energy,"Oil and Gas Equipment, Services",23633,-671,65000


## ChatGPT Exercises

1. Find the average revenue of companies in each sector.
2. Calculate the total profits of each industry within each sector.
3. Determine the sector with the highest average number of employees per company.
4. Identify the company with the maximum profits in each sector. Return only the company name and profits.
5. Calculate the percentage contribution of each industry's total revenue to the overall revenue of its sector.
6. Within each sector, find the company with the minimum revenue, and also display its revenue and industry.
7. Rank the industries within each sector by total profits and return the top 3 industries in each sector.
8. For each sector, calculate the difference between the maximum and minimum profits among companies.
9. Create a new column called Revenue Per Employee and calculate its average for each industry.
10. Identify the number of companies in each sector where profits are greater than the sector's average profits.

In [1]:
import pandas as pd

In [2]:
fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune.sort_index(inplace= True)

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [12]:
#1) Find the average revenue of companies in each sector.
fortune.groupby('Sector')['Revenue'].mean().head()

Sector
Aerospace & Defense    17897.000000
Apparel                 6397.866667
Business Services       5337.156863
Chemicals               8129.900000
Energy                 12441.057377
Name: Revenue, dtype: float64


In [14]:
#2) 
fortune.groupby(['Sector', 'Industry'])[['Profits']].sum().sort_index().head()

Sector,Industry,Profits
Aerospace & Defense,Aerospace and Defense,28742
Apparel,Apparel,8236
Business Services,"Advertising, marketing",1549
Business Services,Diversified Outsourcing Services,4305
Business Services,Education,69


In [22]:
#3)
fortune.groupby('Sector')['Employees'].mean().sort_values(ascending= False).head(1)

Sector
Hotels, Resturants & Leisure    99369.8
Name: Employees, dtype: float64

In [39]:
#4) Identify the company with the maximum profits in each sector. Return only the company name and profits.
labels_max_profits_per_sector= fortune.groupby('Sector')['Profits'].idxmax()
fortune.loc[ labels_max_profits_per_sector, ['Sector', 'Company', 'Profits'] ].sort_values(by='Profits', ascending= False)


# an alternative that applies nlargest to each sector's nested DataFrame
# fortune.groupby('Sector')[['Company', 'Profits']].apply(lambda df: df.nlargest(1, 'Profits')).sort_values(by='Profits', ascending= False)

Rank,Sector,Company,Profits
3,Technology,Apple,53394
23,Financials,J.P. Morgan Chase,24442
86,Health Care,Gilead Sciences,18108
13,Telecommunications,Verizon,17879
2,Energy,Exxon Mobil,16150
1,Retailing,Walmart,14694
8,Motor Vehicles & Parts,General Motors,9687
53,Media,Disney,8382
56,Chemicals,Dow Chemical,7685
67,Transportation,American Airlines Group,7610


In [94]:
fortune.groupby('Sector')['Company'].count().sort_values(ascending= False)

Sector
Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Hotels, Resturants & Leisure     25
Media                            25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Apparel                          15
Telecommunications               15
Food and Drug Stores             15
Name: Company, dtype: int64

In [97]:
#5) Calculate the percentage contribution of each industry's total revenue to the overall revenue of its sector.
# Note: this cell computes each *company's* share of its sector's revenue (in whole percent), not an industry-level share.
total_revenue_by_sector= fortune.groupby('Sector')['Revenue'].sum()

fortune['Revenue_SecInd']= fortune['Sector'].map(total_revenue_by_sector) # sector total attached to every row
fortune['Revenue_Contrib']= (fortune['Revenue']/fortune['Revenue_SecInd']).round(2)*100

fortune.loc[ fortune['Sector'] == 'Food and Drug Stores' ][
    ['Sector', 'Industry', 'Company', 'Revenue', 'Revenue_SecInd', 'Revenue_Contrib']
].sort_values(by=['Sector', 'Industry'])

Rank,Sector,Industry,Company,Revenue,Revenue_SecInd,Revenue_Contrib
7,Food and Drug Stores,Food and Drug Stores,CVS Health,153290,483769,32.0
17,Food and Drug Stores,Food and Drug Stores,Kroger,109830,483769,23.0
19,Food and Drug Stores,Food and Drug Stores,Walgreens Boots Alliance,103444,483769,21.0
87,Food and Drug Stores,Food and Drug Stores,Publix Super Markets,32619,483769,7.0
107,Food and Drug Stores,Food and Drug Stores,Rite Aid,26528,483769,5.0
160,Food and Drug Stores,Food and Drug Stores,Supervalu,17820,483769,4.0
181,Food and Drug Stores,Food and Drug Stores,Whole Foods Market,15389,483769,3.0
595,Food and Drug Stores,Food and Drug Stores,Smart & Final Stores,3971,483769,1.0
614,Food and Drug Stores,Food and Drug Stores,Ingles Markets,3779,483769,1.0
631,Food and Drug Stores,Food and Drug Stores,Sprouts Farmers Market,3593,483769,1.0


In [12]:
#6) Within each sector, find the company with the minimum revenue, and also display its revenue and industry.
indexes_min_revenue_per_sector= fortune.groupby('Sector')['Revenue'].idxmin()
fortune.loc[ indexes_min_revenue_per_sector, ['Sector','Company', 'Industry', 'Revenue']].sort_values(by='Revenue', ascending= False).head()

Rank,Sector,Company,Industry,Revenue
786,Telecommunications,Equinix,Telecommunications,2726
917,Apparel,Guess,Apparel,2204
928,Food and Drug Stores,Fred’s,Food and Drug Stores,2151
949,Chemicals,H.B. Fuller,Chemicals,2084
954,"Food, Beverages & Tobacco",Alliance One International,Tobacco,2066


In [19]:
fortune[fortune['Sector']=='Technology'].sort_values(by='Profits', ascending= False)

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
36,Alphabet,Technology,Internet Services and Retailing,74989,16348,61814
31,IBM,Technology,Information Technology Services,82461,13190,411798
25,Microsoft,Technology,Computer Software,93580,12193,118000
51,Intel,Technology,Semiconductors and Other Electronic Components,55355,11420,107300
...,...,...,...,...,...,...
946,Engility Holdings,Technology,Information Technology Services,2086,-235,9800
842,Autodesk,Technology,Computer Software,2504,-331,9500
914,Twitter,Technology,Internet Services and Retailing,2218,-521,3898
593,Advanced Micro Devices,Technology,Semiconductors and Other Electronic Components,3991,-660,9100


In [28]:
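#7) Rank the industries within each sector by total profits and return the top 3 industries in each sector.
# Note: this cell returns the 3 most profitable *companies* per sector, then orders the rows by industry and profits.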
fortune.groupby('Sector')[['Industry', 'Company', 'Profits']].apply(lambda nested_df: nested_df.nlargest(3, 'Profits')).sort_index().sort_values(by=['Industry', 'Profits'])

Sector,Rank,Industry,Company,Profits
Aerospace & Defense,60,Aerospace and Defense,Lockheed Martin,3605
Aerospace & Defense,24,Aerospace and Defense,Boeing,5176
Aerospace & Defense,45,Aerospace and Defense,United Technologies,7608
Transportation,80,Airlines,United Continental Holdings,7340
Transportation,67,Airlines,American Airlines Group,7610
...,...,...,...,...
Telecommunications,13,Telecommunications,Verizon,17879
"Food, Beverages & Tobacco",106,Tobacco,Philip Morris International,6873
Wholesalers,183,Wholesalers: Diversified,Genuine Parts,706
Wholesalers,285,Wholesalers: Diversified,W.W. Grainger,769
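
Exercise 7 is phrased in terms of industries. A possible industry-level approach is sketched below on toy rows copied from this notebook's outputs. The toy data has few industries, so the sketch keeps the top 2; on the full data the `2` would become `3`. All variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": ["Apple", "HP", "Microsoft", "Amazon.com", "IBM", "Exxon Mobil", "Halliburton"],
        "Sector": ["Technology"] * 5 + ["Energy"] * 2,
        "Industry": [
            "Computers, Office Equipment",
            "Computers, Office Equipment",
            "Computer Software",
            "Internet Services and Retailing",
            "Information Technology Services",
            "Petroleum Refining",
            "Oil and Gas Equipment, Services",
        ],
        "Profits": [53394, 4554, 12193, 596, 13190, 16150, -671],
    }
)

# Total profits per (Sector, Industry) pair.
industry_profits = toy.groupby(["Sector", "Industry"])["Profits"].sum()

# Rank of each industry inside its own sector (1 = most profitable).
rank_in_sector = (
    industry_profits.groupby(level="Sector")
    .rank(ascending=False, method="dense")
    .astype(int)
)

# Sort once by profits, then keep the first rows of every sector.
ranked = industry_profits.sort_values(ascending=False)
top_industries = ranked.groupby(level="Sector").head(2)

print(rank_in_sector)
print(top_industries)
```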


In [37]:
#8) For each sector, calculate the difference between the maximum and minimum profits among companies.
# Note: this cell attaches each sector's max and min profits to every company and measures
# each company's distance from both extremes; the sector-level spread itself is max - min.

profits_by_sector= fortune.groupby('Sector')['Profits']

fortune['MaxSectorProfits']= profits_by_sector.transform('max')
fortune['MinSectorProfits']= profits_by_sector.transform('min')

fortune['Diff_Max']= (fortune['Profits']-fortune['MaxSectorProfits']).abs()
fortune['Diff_Min']= (fortune['Profits']-fortune['MinSectorProfits']).abs()

fortune[['Sector', 'Company', 'Industry', 'Profits', 'MaxSectorProfits', 'MinSectorProfits', 'Diff_Max', 'Diff_Min']].sort_values(by= 'Sector').head()

Rank,Sector,Company,Industry,Profits,MaxSectorProfits,MinSectorProfits,Diff_Max,Diff_Min
60,Aerospace & Defense,Lockheed Martin,Aerospace and Defense,3605,7608,-240,4003,3845
378,Aerospace & Defense,Huntington Ingalls Industries,Aerospace and Defense,404,7608,-240,7204,644
120,Aerospace & Defense,Raytheon,Aerospace and Defense,2074,7608,-240,5534,2314
245,Aerospace & Defense,L-3 Communications,Aerospace and Defense,-240,7608,-240,7848,0
45,Aerospace & Defense,United Technologies,Aerospace and Defense,7608,7608,-240,0,7848
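
If one number per sector is wanted (maximum minus minimum profits), a compact sketch could look like this. The Aerospace & Defense profits are the five values shown in the output above, the Energy rows come from earlier outputs, and the names `stats` and `range` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Company": [
            "Lockheed Martin",
            "Huntington Ingalls Industries",
            "Raytheon",
            "L-3 Communications",
            "United Technologies",
            "Exxon Mobil",
            "Halliburton",
        ],
        "Sector": ["Aerospace & Defense"] * 5 + ["Energy"] * 2,
        "Profits": [3605, 404, 2074, -240, 7608, 16150, -671],
    }
)

# One row per sector with its largest and smallest profit.
stats = toy.groupby("Sector")["Profits"].agg(["max", "min"])

# The spread between the most and least profitable company of each sector.
stats["range"] = stats["max"] - stats["min"]

print(stats)
```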


In [38]:
fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees,MaxSectorProfits,MinSectorProfits,Diff_Max,Diff_Min
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000,14694,-1243,0,15937
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600,16150,-23119,0,39269
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000,53394,-4359,0,57753
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000,24442,-1194,359,25277
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400,18108,-458,16632,1934


In [44]:
#9) Create a new column called Revenue Per Employee and calculate its average for each industry.

fortune['RevenuePerEmployee']= fortune['Revenue']/fortune['Employees'] # element-wise division, no row-wise apply needed
fortune.groupby('Industry')['RevenuePerEmployee'].mean().sort_values(ascending= False)

Industry
Miscellaneous                          11.051044
Real estate                             5.911200
Diversified Financials                  4.249644
Petroleum Refining                      3.052848
Insurance: Life, Health (Mutual)        2.778417
                                         ...    
Education                               0.176859
Health Care: Medical Facilities         0.160782
Mail, Package, and Freight Delivery     0.158965
Hotels, Casinos, Resorts                0.150742
Food Services                           0.069547
Name: RevenuePerEmployee, Length: 73, dtype: float64

In [62]:
#10) Identify the number of companies in each sector where profits are greater than the sector's average profits.


mean_profits_by_sector= fortune.groupby('Sector')['Profits'].transform('mean') # every row gets its own sector's mean
is_above_sector_mean= fortune['Profits'] > mean_profits_by_sector

fortune[is_above_sector_mean].groupby('Sector')['Company'].count()

Sector
Aerospace & Defense              7
Apparel                          4
Business Services               16
Chemicals                       10
Energy                          90
Engineering & Construction      11
Financials                      28
Food and Drug Stores             5
Food, Beverages & Tobacco        9
Health Care                     19
Hotels, Resturants & Leisure     7
Household Products               7
Industrials                     12
Materials                       23
Media                            6
Motor Vehicles & Parts           5
Retailing                       17
Technology                      16
Telecommunications               4
Transportation                   8
Wholesalers                     14
Name: Company, dtype: int64

## More ChatGPT Exercises

In [110]:
import pandas as pd

fortune= pd.read_csv('fortune1000.csv', index_col='Rank')
fortune.sort_index(inplace= True)

fortune.head()

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [107]:
#11) Find the top 2 companies with the highest revenue in each industry. Display the company name, industry, and revenue.

# solution: apply nlargest to each nested DataFrame of the groupby object
fortune.groupby('Industry').apply(lambda nested_df: nested_df.nlargest(2, 'Revenue'))[['Company', 'Industry', 'Revenue']]


Unnamed: 0_level_0,Unnamed: 1_level_0,Company,Industry,Revenue
Industry,Rank,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Advertising, marketing",186,Omnicom Group,"Advertising, marketing",15134
"Advertising, marketing",355,Interpublic Group,"Advertising, marketing",7614
Aerospace and Defense,24,Boeing,Aerospace and Defense,96114
Aerospace and Defense,45,United Technologies,Aerospace and Defense,61047
Airlines,67,American Airlines Group,Airlines,40990
...,...,...,...,...
Wholesalers: Electronics and Office Equipment,102,Avnet,Wholesalers: Electronics and Office Equipment,27925
Wholesalers: Food and Grocery,57,Sysco,Wholesalers: Food and Grocery,48681
Wholesalers: Food and Grocery,122,US Foods Holding,Wholesalers: Food and Grocery,23128
Wholesalers: Health Care,5,McKesson,Wholesalers: Health Care,181241
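
Depending on the pandas version, `groupby(...).apply` may warn about, or change, how the grouping column is handed to the function. One possible alternative that sidesteps `apply` is to sort first and then take the first two rows of each group with `head(2)`. This is only a sketch on a made-up `toy` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical data (made-up values)
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D", "E"],
    "Industry": ["Airlines", "Airlines", "Airlines", "Apparel", "Apparel"],
    "Revenue": [30, 50, 40, 20, 10],
})

top2 = (
    toy.sort_values("Revenue", ascending=False)   # largest revenue first
       .groupby("Industry")
       .head(2)                                    # first two rows of each industry
       .sort_values(["Industry", "Revenue"], ascending=[True, False])
       [["Company", "Industry", "Revenue"]]
)
print(top2)
```

Unlike the `apply` version, the result keeps the original row labels as a flat index, and `Industry` appears only once, as a regular column.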


In [109]:
#12) For each sector, calculate the cumulative sum of profits sorted by revenue in descending order.
fortune.sort_values(by="Revenue", ascending=False)\
    .groupby("Sector")["Profits"]\
    .cumsum()

Rank
1        14694
2        16150
3        53394
4        24083
5         1476
         ...  
996     260209
997     -73619
997     -73447
999      20697
1000     20764
Name: Profits, Length: 1000, dtype: int64
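
The output above shows only the running totals, so it is hard to tell which sector each value belongs to. A minimal sketch, on made-up data, of storing the cumulative sum as a column next to `Sector`; the `toy` frame and the column name `CumulativeProfits` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data (made-up values), indexed by Rank like the notebook's frame
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Retailing", "Energy", "Retailing"],
        "Revenue": [100, 300, 200, 50],
        "Profits": [10, 30, 20, -5],
    },
    index=pd.Index([3, 1, 2, 4], name="Rank"),
)

# Sort once, then compute the running total inside each sector
ordered = toy.sort_values(by="Revenue", ascending=False)
ordered["CumulativeProfits"] = ordered.groupby("Sector")["Profits"].cumsum()
print(ordered)
```

Because `cumsum` returns a Series with the same index and order as the sorted frame, the assignment lines up row by row.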

In [None]:
#13) Find the industry in each sector with the highest average revenue per employee. Display the sector, industry, and the average value.

# solution 1)
fortune.groupby(["Sector", "Industry"])\
    .apply(lambda x: (x["Revenue"] / x["Employees"]).mean())\
    .reset_index(name="Avg Revenue Per Employee")\
    .sort_values(by=["Sector", "Avg Revenue Per Employee"], ascending=[True, False])\
    .drop_duplicates(subset=["Sector"])

# solution 2)
fortune['RevenuePerEmployee']= fortune['Revenue']/fortune['Employees']

# Average ratio per (Sector, Industry), flattened back into regular columns
aux_df= fortune.groupby(['Sector','Industry'])['RevenuePerEmployee'].mean().reset_index()
# Keep, for each sector, the row holding the highest average
aux_df.loc[aux_df.groupby('Sector')['RevenuePerEmployee'].idxmax()]

Unnamed: 0,Sector,Industry,RevenuePerEmployee
0,Aerospace & Defense,Aerospace and Defense,0.321265
1,Apparel,Apparel,0.311082
7,Business Services,Temporary Help,0.565881
9,Chemicals,Chemicals,0.594194
14,Energy,Petroleum Refining,3.052848
18,Engineering & Construction,Homebuilders,1.516281
25,Financials,Real estate,5.9112
27,Food and Drug Stores,Food and Drug Stores,0.4331
30,"Food, Beverages & Tobacco",Food Production,1.011409
37,Health Care,Wholesalers: Health Care,2.655857


In [113]:
#14) Identify the company with the highest profit-to-revenue ratio in each sector and display the company name, sector, and the ratio.

fortune['ProfitsToRevenue']= fortune['Profits']/fortune['Revenue']
fortune.loc[
    fortune.groupby('Sector')['ProfitsToRevenue'].idxmax(),
    ['Company', 'Sector', 'ProfitsToRevenue']
]

Unnamed: 0_level_0,Company,Sector,ProfitsToRevenue
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
788,TransDigm Group,Aerospace & Defense,0.165127
91,Nike,Apparel,0.106957
870,Public Storage,Business Services,0.550378
566,CF Industries Holdings,Chemicals,0.162488
924,Magellan Midstream Partners,Energy,0.374143
576,Toll Brothers,Engineering & Construction,0.087029
861,General Growth Properties,Financials,0.571963
809,GNC Holdings,Food and Drug Stores,0.082986
266,Reynolds American,"Food, Beverages & Tobacco",0.304731
86,Gilead Sciences,Health Care,0.554796


In [118]:
#15) For each industry, calculate the percentage of total employees in its sector contributed by that industry.

# solution 1)
sector_total_employees = fortune.groupby("Sector")["Employees"].sum()

fortune.groupby(["Sector", "Industry"])["Employees"].sum()\
.div(sector_total_employees, level="Sector")\
.mul(100)

# solution 2)
total_employees_by_sector= fortune.groupby('Sector')['Employees'].sum()
# Caveat: this total is keyed on the industry name alone, so an industry name that
# occurs in more than one sector is pooled across those sectors and its share is
# overstated. Solution 1 groups by both 'Sector' and 'Industry' and avoids this.
total_employees_by_industry= fortune.groupby('Industry')['Employees'].sum()

# Look up each row's totals with map instead of a row-wise apply
fortune['TotalIndustryEmployees']= fortune['Industry'].map(total_employees_by_industry)
fortune['TotalSectorEmployees']= fortune['Sector'].map(total_employees_by_sector)
fortune['IndustryContribution']= (fortune['TotalIndustryEmployees']/fortune['TotalSectorEmployees']*100).round(2)

fortune[['Sector', 'Industry', 'TotalIndustryEmployees', 'TotalSectorEmployees', 'IndustryContribution']].sort_values(by= ['Sector', 'Industry']).drop_duplicates()


Unnamed: 0_level_0,Sector,Industry,TotalIndustryEmployees,TotalSectorEmployees,IndustryContribution
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
24,Aerospace & Defense,Aerospace and Defense,968057,968057,100.00
91,Apparel,Apparel,346397,346397,100.00
186,Business Services,"Advertising, marketing",124100,1361050,9.12
199,Business Services,Diversified Outsourcing Services,708330,1361050,52.04
737,Business Services,Education,46755,1361050,3.44
...,...,...,...,...,...
395,Transportation,"Trucking, Truck Leasing",170456,1536793,11.09
315,Wholesalers,Miscellaneous,188518,525597,35.87
92,Wholesalers,Wholesalers: Diversified,233831,525597,44.49
64,Wholesalers,Wholesalers: Electronics and Office Equipment,166661,525597,31.71
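
A quick sanity check for this exercise is that the shares inside each sector add up to 100. The sketch below, on made-up data, computes the shares keyed on both `Sector` and `Industry` and shows what happens when an industry name such as `Miscellaneous` is shared by two sectors. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data (made-up values); "Miscellaneous" occurs in two sectors
toy = pd.DataFrame({
    "Sector": ["Wholesalers", "Wholesalers", "Retailing", "Retailing"],
    "Industry": ["Miscellaneous", "Diversified", "Miscellaneous", "Apparel"],
    "Employees": [100, 300, 50, 150],
})

# Totals keyed on (Sector, Industry) and on Sector alone
industry_total = toy.groupby(["Sector", "Industry"])["Employees"].sum()
sector_total = toy.groupby("Sector")["Employees"].sum()

# Divide along the 'Sector' level of the MultiIndex, then convert to percent
share = industry_total.div(sector_total, level="Sector").mul(100)
print(share)

# Sanity check: shares within each sector sum to 100
print(share.groupby(level="Sector").sum())

# Grouping by industry name alone pools both "Miscellaneous" entries
pooled_misc = toy.groupby("Industry")["Employees"].sum()["Miscellaneous"]
print(pooled_misc)
```

Here the pooled total for `Miscellaneous` is 150, while the Wholesalers part is only 100, which is why the per-name lookup can push a sector's shares above 100 percent.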
