# Introduction

When multiple Series or DataFrames are combined, pandas first aligns them on each axis by label before any computation happens. This silent, automatic alignment can cause tremendous confusion for the uninitiated, but it gives great flexibility to the power user. This chapter explores the Index object in depth before showcasing a variety of recipes that take advantage of its automatic alignment.

In [1]:
import pandas as pd
import numpy as np

# Examining the index object

As was discussed in Chapter 1, Pandas Foundations, each axis of Series and DataFrames has an Index object that labels the values. There are many different types of Index objects, but they all share the same common behavior. All Index objects, except for the special MultiIndex, are single-dimensional data structures that combine the functionality and implementation of Python sets and NumPy ndarrays.

### Getting ready

In this recipe, we will examine the column index of the college dataset and explore much of its functionality.

### How to do it...

Read in the college dataset, assign the column index to a variable, and output it:

In [2]:
college = pd.read_csv('data/college.csv')
columns = college.columns
columns

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

Use the values attribute to access the underlying NumPy array:


In [3]:
columns.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

Select items from the index by integer location with scalars, lists, or slices:

In [4]:
columns[5]

'WOMENONLY'

In [5]:
columns[[1,8,10]]

Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')

In [6]:
columns[-7:-4]

Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

Indexes share many of the same methods as Series and DataFrames:


In [7]:
columns.min(), columns.max(), columns.isnull().sum()

('CITY', 'WOMENONLY', 0)

Use basic arithmetic and comparison operators directly on Index objects:

In [8]:
columns + '_A'

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [9]:
columns > 'G'

array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

Trying to change an Index value directly after its creation fails. Indexes are immutable objects:

In [10]:
columns[1] = 'city'

TypeError: Index does not support mutable operations

As you can see from many of the Index object operations, it appears to have quite a bit in common with both Series and ndarrays. One of the biggest differences comes in step 6. Indexes are immutable and their values cannot be changed once created.
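Because an Index cannot be modified in place, relabeling means building a new Index and attaching it. The following is a minimal sketch of two common ways to do that; the two-column DataFrame is made up for illustration and only borrows column names from the college data.

```python
import pandas as pd

# Hypothetical stand-in for the college data
df = pd.DataFrame({'INSTNM': ['A', 'B'], 'CITY': ['X', 'Y']})

# Item assignment on an Index raises TypeError
try:
    df.columns[1] = 'city'
except TypeError as err:
    print('TypeError:', err)

# rename returns a new DataFrame carrying a new column Index
renamed = df.rename(columns={'CITY': 'city'})

# Alternatively, replace the whole column Index at once
relabeled = df.copy()
relabeled.columns = ['INSTNM', 'city']

print(renamed.columns.tolist())
print(df.columns.tolist())  # the original labels are untouched
```

Both approaches leave the original Index object unchanged and hand the DataFrame a different one.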

### There's more...

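Indexes support the set operations union, intersection, difference, and symmetric difference through methods of the same names. As the errors in the following cells show, the | and ^ operators are not a reliable shorthand for union and symmetric difference: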

In [11]:
c1 = columns[:4]
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [12]:
c2 = columns[2:5]
c2

Index(['STABBR', 'HBCU', 'MENONLY'], dtype='object')

In [13]:
c1.union(c2)

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR'], dtype='object')

In [14]:
c1 | c2

ValueError: operands could not be broadcast together with shapes (4,) (3,) 

In [15]:
c1.symmetric_difference(c2)

Index(['CITY', 'INSTNM', 'MENONLY'], dtype='object')

In [16]:
c1 ^ c2

ValueError: operands could not be broadcast together with shapes (4,) (3,) 
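The four set operations can be sketched with their method forms on two indexes shaped like c1 and c2. The labels are retyped by hand here so the snippet runs without the college file.

```python
import pandas as pd

c1 = pd.Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'])
c2 = pd.Index(['STABBR', 'HBCU', 'MENONLY'])

common = c1.intersection(c2)              # labels present in both
only_c1 = c1.difference(c2)               # labels in c1 but not in c2
either = c1.union(c2)                     # labels in at least one
not_shared = c1.symmetric_difference(c2)  # labels in exactly one

for name, result in [('intersection', common), ('difference', only_c1),
                     ('union', either), ('symmetric_difference', not_shared)]:
    print(name, result.tolist())
```

Each method returns a new Index, so the results can be passed straight to selection or reindexing.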

Indexes share some of the same operations as Python sets, and they resemble sets in another important way: they are (usually) implemented using hash tables, which makes for extremely fast access when selecting rows or columns from a DataFrame. Because of this, the values of an Index object need to be hashable, such as strings, integers, or tuples, just like the keys in a Python dictionary.

**Note**

Indexes support duplicate values, and if there happens to be a duplicate in any Index, then a hash table can no longer be used for its implementation, and object access becomes much slower.
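One visible consequence of duplicates is the shape of a label lookup. The sketch below uses two small made-up indexes and the get_loc method: with unique labels it returns a single integer position, while a repeated label in an unsorted index comes back as a boolean mask.

```python
import pandas as pd

unique_idx = pd.Index(list('abc'))
dup_idx = pd.Index(list('abcb'))   # 'b' appears twice, index is not sorted

print(unique_idx.is_unique, dup_idx.is_unique)

# Unique label: one integer position
pos = unique_idx.get_loc('b')
print(pos)

# Duplicated label in a non-monotonic index: a boolean mask over all positions
mask = dup_idx.get_loc('b')
print(mask)
```

Checking is_unique before relying on fast label lookups is a cheap safeguard.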

# Producing Cartesian Product

Whenever a Series or DataFrame operates with another Series or DataFrame, the indexes (both the row index and column index) of each object align first before any operation begins. This index alignment happens silently and can be very surprising for those new to pandas. Unless the indexes are identical, this alignment creates a Cartesian product between matching index labels.

**Note**

A Cartesian product is a mathematical term that usually appears in set theory. A Cartesian product between two sets is all the combinations of pairs of both sets. For example, the 52 cards in a standard playing card deck represent a Cartesian product between the 13 ranks (A, 2, 3,..., Q, K) and the four suits.
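The playing card example can be reproduced with the standard library. This is only an illustration of the mathematical idea; the suit names are chosen here for readability.

```python
from itertools import product

ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['clubs', 'diamonds', 'hearts', 'spades']

# Every rank paired with every suit
deck = list(product(ranks, suits))

print(len(ranks), len(suits), len(deck))
print(deck[:3])
```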

### Getting ready

Producing a Cartesian product isn't always the intended outcome, but it's extremely important to be aware of how and when it occurs to avoid unintended consequences. In this recipe, two Series with overlapping but non-identical indexes are added together, yielding a surprising result.

### How to do it...

Follow these steps to create a Cartesian product:

Construct two Series that have indexes that are different but contain some of the same values:


In [17]:
s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s1

a    0
a    1
a    2
b    3
dtype: int32

In [18]:
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
s2

c    0
a    1
b    2
a    3
b    4
b    5
dtype: int32

Add the two Series together to produce a Cartesian product:

In [19]:
s1 + s2

a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

Each Series was created with the class constructor, which accepts a wide variety of inputs, the simplest being a sequence of values for each of the parameters index and data.

Mathematical Cartesian products are slightly different from the outcome of operating on two pandas objects. Each a label in s1 pairs up with each a label in s2. This pairing produces six a labels, three b labels, and one c label in the resulting Series. A Cartesian product happens between all identical index labels.
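The label counts described above can be checked directly. This sketch rebuilds the two Series from the recipe, multiplies the per-label counts of each index, and compares them with the counts in the result.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))

counts1 = s1.index.value_counts()
counts2 = s2.index.value_counts()

# A label found in both Series appears count1 * count2 times in the result;
# labels found in only one Series drop out of this product
expected = counts1.mul(counts2).dropna()

actual = (s1 + s2).index.value_counts()

print(expected)
print(actual)
```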

As the element with label c is unique to Series s2, pandas defaults its value to missing, as there is no label for it to align to in s1. Pandas defaults to a missing value whenever an index label is unique to one object. This has the unfortunate consequence of changing the data type of the Series to a float, whereas each Series had only integers as values. This occurred because of NumPy's missing value object; np.nan only exists for floats but not for integers. Series and DataFrame columns must have homogeneous numeric data types; therefore, each value was converted to a float. This makes very little difference for this small dataset, but for larger datasets, this can have a significant memory impact.
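The recipe itself does not use it, but pandas also provides a nullable integer extension dtype named 'Int64' that can hold missing values without converting to float. The following sketch assumes a pandas version that supports this dtype and repeats the addition with it.

```python
import numpy as np
import pandas as pd

# Same data as the recipe, stored with the nullable 'Int64' dtype
s1 = pd.Series(index=list('aaab'), data=np.arange(4), dtype='Int64')
s2 = pd.Series(index=list('cababb'), data=np.arange(6), dtype='Int64')

total = s1 + s2

print(total.dtype)
print(total.isna().sum())  # the unmatched label c is still missing
```

The Cartesian product still occurs; only the representation of the missing value changes.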

### There's more...

An exception to the preceding example takes place when the indexes contain the same exact elements in the same order. When this occurs, a Cartesian product does not take place, and the indexes instead align by their position. Notice here that each element aligned exactly by position and that the data type remained an integer:

In [20]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(5))
s1 + s2

a    0
a    2
a    4
b    6
b    8
dtype: int32

If the elements of the index are identical, but the order is different between the Series, a Cartesian product occurs. Let's change the order of the index in s2 and rerun the same operation:

In [21]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))
s1 + s2

a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int32

It is quite interesting that pandas has two drastically different outcomes for this same operation. If a Cartesian product were the only choice for pandas, then something as simple as adding DataFrame columns together would explode the number of elements returned.
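Which of the two outcomes occurs can be anticipated with the equals method of the Index, which is true only when both indexes hold the same labels in the same order. A short sketch using the Series from this section:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
same_order = pd.Series(index=list('aaabb'), data=np.arange(5))
shuffled = pd.Series(index=list('bbaaa'), data=np.arange(5))

# Identical indexes: positional alignment, length is preserved
print(s1.index.equals(same_order.index), len(s1 + same_order))

# Same labels in a different order: Cartesian product per label
print(s1.index.equals(shuffled.index), len(s1 + shuffled))
```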

In this recipe, each Series had a different number of elements. Typically, array-like data structures in Python and other languages do not allow operations to take place when the operating dimensions do not contain the same number of elements. Pandas allows this to happen by aligning the indexes first before completing the operation.
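The contrast with plain arrays can be seen in a few lines. In this sketch, adding two NumPy arrays of different lengths raises an error, while two Series of the same lengths align by label; the labels are made up for the example.

```python
import numpy as np
import pandas as pd

# NumPy refuses to add arrays of lengths 4 and 6
try:
    np.arange(4) + np.arange(6)
except ValueError as err:
    numpy_error = str(err)
    print('ValueError:', numpy_error)

# pandas aligns on labels and marks unmatched labels as missing
short = pd.Series(np.arange(4), index=list('abcd'))
long = pd.Series(np.arange(6), index=list('abcdef'))
aligned = short + long
print(aligned)
```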

# Exploding Indexes

The previous recipe walked through a trivial example of two small Series being added together with unequal indexes. This problem can produce comically incorrect results when dealing with larger data.

### Getting ready

In this recipe, we add two larger Series that have indexes with only a few unique values but in different orders. The result will explode the number of values in the indexes.

### How to do it...

Read in the employee data and set the index equal to the race column:

In [22]:
employee = pd.read_csv('data/employee.csv', index_col='RACE')
employee.head()

| RACE | UNIQUE_ID | POSITION_TITLE | DEPARTMENT | BASE_SALARY | EMPLOYMENT_TYPE | GENDER | EMPLOYMENT_STATUS | HIRE_DATE | JOB_DATE |
|---|---|---|---|---|---|---|---|---|---|
| Hispanic/Latino | 0 | ASSISTANT DIRECTOR (EX LVL) | Municipal Courts Department | 121862.0 | Full Time | Female | Active | 2006-06-12 | 2012-10-13 |
| Hispanic/Latino | 1 | LIBRARY ASSISTANT | Library | 26125.0 | Full Time | Female | Active | 2000-07-19 | 2010-09-18 |
| White | 2 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | Full Time | Male | Active | 2015-02-03 | 2015-02-03 |
| White | 3 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | Full Time | Male | Active | 1982-02-08 | 1991-05-25 |
| White | 4 | ELECTRICIAN | General Services Department | 56347.0 | Full Time | Male | Active | 1989-06-19 | 1994-10-22 |


Select the BASE_SALARY column as two different Series. Check to see whether this operation actually did create two new objects:

In [23]:
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2

True

The salary1 and salary2 variables are actually referring to the same object. This means that any change to one will change the other. To ensure that you receive a brand new copy of the data, use the copy method:

In [24]:
salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2

False

Let's change the order of the index for one of the Series by sorting it:

In [25]:
salary1 = salary1.sort_index()
salary1.head()

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In [26]:
salary2.head()

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

Let's add these salary Series together:

In [27]:
salary_add = salary1 + salary2

In [28]:
salary_add.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

The operation completed successfully. Let's create one more Series of salary1 added to itself and then output the lengths of each Series. We just exploded the index from 2,000 values to more than 1 million:

In [29]:
salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)

(2000, 2000, 1175424, 2000)

Step 2 appears at first to create two unique objects, but in fact it creates a single object that is referred to by two different variable names. The expression employee['BASE_SALARY'] technically creates a view, and not a brand new copy. This is verified with the is operator.

**Note**

In pandas, a view is not a new object but just a reference to another object, usually some subset of a DataFrame. This shared object can be a cause for many issues.

To ensure that both variables reference completely different objects, we use the copy Series method and again verify that they are different objects with the is operator. Step 4 uses the sort_index method to sort the Series by race. Step 5 adds these different Series together to produce some result. By just inspecting the head, it's still not clear what has been produced.
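The practical effect of the copy method can be sketched on a tiny made-up DataFrame: a change made to the copy does not reach the data it was copied from.

```python
import pandas as pd

# Hypothetical two-row stand-in for the employee data
df = pd.DataFrame({'BASE_SALARY': [100.0, 200.0]})

independent = df['BASE_SALARY'].copy()
independent.iloc[0] = -1.0   # modify only the copy

print(independent.tolist())
print(df['BASE_SALARY'].tolist())  # original values are unchanged
```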

Step 6 adds salary1 to itself to show a comparison between the two different Series additions. The lengths of all the Series in this recipe are output, and we clearly see that salary_add has now exploded to over one million values. A Cartesian product took place for each unique value in the index because the indexes were not exactly the same. This recipe dramatically shows how much of an impact the index can have when combining multiple Series or DataFrames.

### There's more...

We can verify the number of values of salary_add by doing a little mathematics. As a Cartesian product takes place between all of the same index values, we can sum the square of their individual counts. Even missing values in the index produce Cartesian products with themselves:

In [30]:
index_vc = salary1.index.fillna('None').value_counts()
index_vc

RACE
Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
None                                  35
American Indian or Alaskan Native     11
Others                                 2
Name: count, dtype: int64

In [31]:
index_vc.pow(2).sum()

1175424
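The same arithmetic can be wrapped in a small check that runs before two Series are combined. The helper below, aligned_length, is hypothetical and not part of pandas; it is a sketch that assumes neither index contains missing labels, since value_counts skips them by default.

```python
import numpy as np
import pandas as pd

def aligned_length(left, right):
    """Hypothetical helper: rows that `left + right` would produce.

    Assumes neither index contains missing labels.
    """
    # Identical indexes align by position, so the length is unchanged
    if left.index.equals(right.index):
        return len(left)
    vc_left = left.index.value_counts()
    vc_right = right.index.value_counts()
    shared = vc_left.index.intersection(vc_right.index)
    # Shared labels pair up as a Cartesian product
    paired = (vc_left.loc[shared] * vc_right.loc[shared]).sum()
    # Labels found on only one side are carried over once each
    unpaired = vc_left.drop(shared).sum() + vc_right.drop(shared).sum()
    return int(paired + unpaired)

s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
print(aligned_length(s1, s2), len(s1 + s2))
```

Comparing the predicted length with the input lengths makes an exploding index visible before the expensive operation runs.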


# Filling values with unequal indexes

When two Series are added together using the plus operator and one of the index labels does not appear in the other, the resulting value is always missing. Pandas offers the add method, which provides an option to fill the missing value.

### Getting ready

In this recipe, we add together multiple Series from the baseball dataset with unequal indexes using the fill_value parameter of the add method to ensure that there are no missing values in the result.

### How to do it...

Read in the three baseball datasets and set the index as playerID:

In [32]:
baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')
baseball_14.head()

| playerID | yearID | stint | teamID | lgID | G | AB | R | H | 2B | 3B | ... | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| altuvjo01 | 2014 | 1 | HOU | AL | 158 | 660 | 85 | 225 | 47 | 3 | ... | 59.0 | 56.0 | 9.0 | 36 | 53.0 | 7.0 | 5.0 | 1.0 | 5.0 | 20.0 |
| cartech02 | 2014 | 1 | HOU | AL | 145 | 507 | 68 | 115 | 21 | 1 | ... | 88.0 | 5.0 | 2.0 | 56 | 182.0 | 6.0 | 5.0 | 0.0 | 4.0 | 12.0 |
| castrja01 | 2014 | 1 | HOU | AL | 126 | 465 | 43 | 103 | 21 | 2 | ... | 56.0 | 1.0 | 0.0 | 34 | 151.0 | 1.0 | 9.0 | 1.0 | 3.0 | 11.0 |
| corpoca01 | 2014 | 1 | HOU | AL | 55 | 170 | 22 | 40 | 6 | 0 | ... | 19.0 | 0.0 | 0.0 | 14 | 37.0 | 0.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| dominma01 | 2014 | 1 | HOU | AL | 157 | 564 | 51 | 121 | 17 | 0 | ... | 57.0 | 0.0 | 1.0 | 29 | 125.0 | 2.0 | 5.0 | 2.0 | 7.0 | 23.0 |


Use the index method difference to discover which index labels are in baseball_14 and not in baseball_15, and vice versa:

In [33]:
baseball_14.index.difference(baseball_15.index)

Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

There are quite a few players unique to each index. Let's find out how many hits each player has in total over the three-year period. The H column contains the number of hits:

In [34]:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

Let's first add together two Series using the plus operator:

In [35]:
(hits_14 + hits_15).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

Even though players congeha01 and corpoca01 have recorded hits (in 2015 and 2014, respectively), their results are missing because each appears in only one of the two Series. Let's use the add method and its parameter, fill_value, to avoid missing values:

In [36]:
hits_14.add(hits_15, fill_value=0).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

We add hits from 2016 by chaining the add method once more:

In [37]:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)
hits_total.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

Check for missing values in the result:

In [38]:
hits_total.hasnans

False
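Chaining add works well for a handful of Series. As an alternative sketch, not the approach taken in this recipe, the Series can be stacked with concat and summed per label with groupby. The three small Series below are built by hand and assume each player appears at most once per year.

```python
import pandas as pd

# Hand-built stand-ins for the yearly hits Series
h14 = pd.Series({'altuvjo01': 225, 'corpoca01': 40}, name='H')
h15 = pd.Series({'altuvjo01': 200, 'congeha01': 46}, name='H')
h16 = pd.Series({'altuvjo01': 216, 'bregmal01': 53}, name='H')

# Approach from the recipe: chained add with fill_value
chained = h14.add(h15, fill_value=0).add(h16, fill_value=0)

# Alternative: stack all years, then sum the rows that share a label
stacked = pd.concat([h14, h15, h16]).groupby(level=0).sum()

print(chained)
print(stacked)
```

The stacked version keeps the integer data type because no missing values are ever introduced.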

### How it works...

The add method works similarly to the plus operator but allows for more flexibility by providing the fill_value parameter to take the place of a non-matching index. In this problem, it makes sense to default the non-matching index value to 0, but you could have used any other number.

There will be occasions when each Series contains index labels that correspond to missing values. In this specific instance, when the two Series are added, the index label will still correspond to a missing value regardless of whether the fill_value parameter is used. To clarify this, take a look at the following example where the index label a corresponds to a missing value in each Series:



In [39]:
s = pd.Series(index=['a', 'b', 'c', 'd'], data=[np.nan, 3, np.nan, 1])
s

a    NaN
b    3.0
c    NaN
d    1.0
dtype: float64

In [40]:
s1 = pd.Series(index=['a', 'b', 'c'], data=[np.nan, 6, 10])
s1

a     NaN
b     6.0
c    10.0
dtype: float64

In [41]:
s.add(s1, fill_value=5)

a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

In [42]:
s1.add(s, fill_value=5)

a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

### There's more...

This recipe shows how to add Series with only a single index together. It is also entirely possible to add DataFrames together. Adding DataFrames together will align both the index and columns before computation and yield missing values for non-matching indexes. Let's start by selecting a few of the columns from the 2014 baseball dataset:

In [43]:
df_14 = baseball_14[['G','AB', 'R', 'H']]
df_14.head()

| playerID | G | AB | R | H |
|---|---|---|---|---|
| altuvjo01 | 158 | 660 | 85 | 225 |
| cartech02 | 145 | 507 | 68 | 115 |
| castrja01 | 126 | 465 | 43 | 103 |
| corpoca01 | 55 | 170 | 22 | 40 |
| dominma01 | 157 | 564 | 51 | 121 |


Let's also select a few of the same and a few different columns from the 2015 baseball dataset:

In [44]:
df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.head()

| playerID | AB | R | H | HR |
|---|---|---|---|---|
| altuvjo01 | 638 | 86 | 200 | 15 |
| cartech02 | 391 | 50 | 78 | 24 |
| castrja01 | 337 | 38 | 71 | 11 |
| congeha01 | 201 | 25 | 46 | 11 |
| correca01 | 387 | 52 | 108 | 22 |


Adding the two DataFrames together creates missing values wherever row or column labels cannot align. Use the style attribute to access the highlight_null method to easily see where the missing values are:

In [45]:
(df_14 + df_15).head(10).style.highlight_null('yellow')

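| playerID | AB | G | H | HR | R |
|---|---|---|---|---|---|
| altuvjo01 | 1298.0 | NaN | 425.0 | NaN | 171.0 |
| cartech02 | 898.0 | NaN | 193.0 | NaN | 118.0 |
| castrja01 | 802.0 | NaN | 174.0 | NaN | 81.0 |
| congeha01 | NaN | NaN | NaN | NaN | NaN |
| corpoca01 | NaN | NaN | NaN | NaN | NaN |
| correca01 | NaN | NaN | NaN | NaN | NaN |
| dominma01 | NaN | NaN | NaN | NaN | NaN |
| fowlede01 | NaN | NaN | NaN | NaN | NaN |
| gattiev01 | NaN | NaN | NaN | NaN | NaN |
| gomezca01 | NaN | NaN | NaN | NaN | NaN |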


Only the rows whose playerID appears in both DataFrames will be non-missing. Similarly, the columns AB, H, and R are the only ones that appear in both DataFrames. Even if we use the add method with the fill_value parameter specified, we still have missing values. This is because some combinations of rows and columns never existed in our input data, for example, the intersection of playerID congeha01 and column G. He appeared only in the 2015 dataset, which did not have the G column, so there was no value to fill:



In [46]:
df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')

| playerID | AB | G | H | HR | R |
|---|---|---|---|---|---|
| altuvjo01 | 1298.0 | 158.0 | 425.0 | 15.0 | 171.0 |
| cartech02 | 898.0 | 145.0 | 193.0 | 24.0 | 118.0 |
| castrja01 | 802.0 | 126.0 | 174.0 | 11.0 | 81.0 |
| congeha01 | 201.0 | NaN | 46.0 | 11.0 | 25.0 |
| corpoca01 | 170.0 | 55.0 | 40.0 | NaN | 22.0 |
| correca01 | 387.0 | NaN | 108.0 | 22.0 | 52.0 |
| dominma01 | 564.0 | 157.0 | 121.0 | NaN | 51.0 |
| fowlede01 | 434.0 | 116.0 | 120.0 | NaN | 61.0 |
| gattiev01 | 566.0 | NaN | 139.0 | 27.0 | 66.0 |
| gomezca01 | 149.0 | NaN | 36.0 | 4.0 | 19.0 |
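The behavior of fill_value on DataFrames can be isolated with two tiny frames. The player labels p1 to p3 and the numbers are made up for this sketch; a cell stays missing only when neither input has that row and column combination.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'G': [158, 55], 'H': [225, 40]}, index=['p1', 'p2'])
df_b = pd.DataFrame({'H': [200, 46], 'HR': [15, 11]}, index=['p1', 'p3'])

total = df_a.add(df_b, fill_value=0)
print(total)

# (p2, HR) and (p3, G) exist in neither input, so they remain missing.
# If zero is an acceptable stand-in for those cells, fill them afterwards.
filled = total.fillna(0)
print(filled)
```

Whether a trailing fillna is appropriate depends on the meaning of the column; a zero for games played is a claim about the data, not just a formatting choice.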


# Appending columns from different DataFrames

All DataFrames can add new columns to themselves. However, as usual, whenever a DataFrame is adding a new column from another DataFrame or Series, the indexes align first before the new column is created.

### Getting ready

This recipe uses the employee dataset to append a new column containing the maximum salary of that employee's department.

### How to do it...

Import the employee data and select the DEPARTMENT and BASE_SALARY columns in a new DataFrame:

In [47]:
employee = pd.read_csv('data/employee.csv')
dept_salary = employee[['DEPARTMENT', 'BASE_SALARY']]

Sort this smaller DataFrame by salary within each department:

In [48]:
dept_salary = dept_salary.sort_values(['DEPARTMENT', 'BASE_SALARY'], ascending=[True, False])

Use the drop_duplicates method to keep the first row of each DEPARTMENT:

In [49]:
max_dept_salary = dept_salary.drop_duplicates(subset='DEPARTMENT')
max_dept_salary.head()

| | DEPARTMENT | BASE_SALARY |
|---|---|---|
| 1494 | Admn. & Regulatory Affairs | 140416.0 |
| 149 | City Controller's Office | 64251.0 |
| 236 | City Council | 100000.0 |
| 647 | Convention and Entertainment | 38397.0 |
| 1500 | Dept of Neighborhoods (DON) | 89221.0 |


Put the DEPARTMENT column into the index for each DataFrame:

In [50]:
max_dept_salary = max_dept_salary.set_index('DEPARTMENT')
employee = employee.set_index('DEPARTMENT')

Now that the indexes contain matching values, we can append a new column to the employee DataFrame:


In [51]:
employee['MAX_DEPT_SALARY'] = max_dept_salary['BASE_SALARY']

In [52]:
pd.options.display.max_columns = 6

In [53]:
employee.head()

| DEPARTMENT | UNIQUE_ID | POSITION_TITLE | BASE_SALARY | ... | HIRE_DATE | JOB_DATE | MAX_DEPT_SALARY |
|---|---|---|---|---|---|---|---|
| Municipal Courts Department | 0 | ASSISTANT DIRECTOR (EX LVL) | 121862.0 | ... | 2006-06-12 | 2012-10-13 | 121862.0 |
| Library | 1 | LIBRARY ASSISTANT | 26125.0 | ... | 2000-07-19 | 2010-09-18 | 107763.0 |
| Houston Police Department-HPD | 2 | POLICE OFFICER | 45279.0 | ... | 2015-02-03 | 2015-02-03 | 199596.0 |
| Houston Fire Department (HFD) | 3 | ENGINEER/OPERATOR | 63166.0 | ... | 1982-02-08 | 1991-05-25 | 210588.0 |
| General Services Department | 4 | ELECTRICIAN | 56347.0 | ... | 1989-06-19 | 1994-10-22 | 89194.0 |


We can validate our results with the query method to check whether there exist any rows where BASE_SALARY is greater than MAX_DEPT_SALARY:

In [54]:
employee.query('BASE_SALARY > MAX_DEPT_SALARY')

| DEPARTMENT | UNIQUE_ID | POSITION_TITLE | BASE_SALARY | ... | HIRE_DATE | JOB_DATE | MAX_DEPT_SALARY |
|---|---|---|---|---|---|---|---|


### How it works...

Steps 2 and 3 find the maximum salary for each department. For automatic index alignment to work properly, we set each DataFrame index as the department. Step 5 works because each row index from the left DataFrame, employee, aligns with one and only one index from the right DataFrame, max_dept_salary. If max_dept_salary had repeats of any departments in its index, then the operation would fail.

For instance, let's see what happens when we use a DataFrame on the right-hand side of the assignment that has repeated index values. We use the sample DataFrame method to randomly choose ten rows without replacement:

In [55]:
np.random.seed(1234)
random_salary = dept_salary.sample(n=10).set_index('DEPARTMENT')
random_salary

| DEPARTMENT | BASE_SALARY |
|---|---|
| Public Works & Engineering-PWE | 50586.0 |
| Houston Police Department-HPD | 66614.0 |
| Houston Police Department-HPD | 66614.0 |
| Housing and Community Devp. | 78853.0 |
| Houston Police Department-HPD | 66614.0 |
| Parks & Recreation | NaN |
| Public Works & Engineering-PWE | 37211.0 |
| Public Works & Engineering-PWE | 54683.0 |
| Human Resources Dept. | 58474.0 |
| Health & Human Services | 47050.0 |


Notice how there are several repeated departments in the index. Now when we attempt to create a new column, an error is raised alerting us that there are duplicates. At least one index label in the employee DataFrame is joining with two or more index labels from random_salary:

In [56]:
employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']

ValueError: cannot reindex on an axis with duplicate labels
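One way to avoid this error is to check the right-hand index with is_unique and collapse duplicates before assigning. The sketch below uses made-up departments and salaries, and taking the maximum per label is an assumption chosen for illustration; any aggregation that leaves one value per label would do.

```python
import pandas as pd

# Hypothetical stand-ins for employee and random_salary
employee = pd.DataFrame(
    {'BASE_SALARY': [50.0, 60.0, 70.0]},
    index=pd.Index(['Library', 'Library', 'Police'], name='DEPARTMENT'),
)
random_salary = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.Index(['Police', 'Police', 'Library'], name='DEPARTMENT'),
)

if not random_salary.index.is_unique:
    # Collapse repeated labels to a single value per department
    random_salary = random_salary.groupby(level=0).max()

# Duplicates on the left are fine; the right side is now unique
employee['RANDOM_SALARY'] = random_salary
print(employee)
```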

### There's more...

Not every index label on the left-hand side of the equal sign needs to have a match, but each can have at most one. If there is nothing for the left DataFrame index to align to, the resulting value will be missing. Let's create an example where this happens. We will use only the first three rows of the max_dept_salary['BASE_SALARY'] Series to create a new column:

In [58]:
employee['MAX_SALARY2'] = max_dept_salary['BASE_SALARY'].head(3)

In [59]:
employee.MAX_SALARY2.value_counts()

MAX_SALARY2
140416.0    29
100000.0    11
64251.0      5
Name: count, dtype: int64

In [60]:
employee.MAX_SALARY2.isnull().mean()

0.9775

The operation completed successfully but filled in salaries for only three of the departments. All the other departments that did not appear in the first three rows of the max_dept_salary['BASE_SALARY'] Series resulted in a missing value.
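The same maximum-per-department column can also be produced without any index alignment by using groupby with transform. This is an alternative sketch rather than the method of this recipe, and the four-row DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the employee data
employee = pd.DataFrame({
    'DEPARTMENT': ['Library', 'Library', 'Police', 'Police'],
    'BASE_SALARY': [26125.0, 40000.0, 45279.0, 66614.0],
})

# transform returns one value per original row, in the original order
employee['MAX_DEPT_SALARY'] = (
    employee.groupby('DEPARTMENT')['BASE_SALARY'].transform('max')
)
print(employee)
```

Because transform preserves the shape of the input, no sorting, deduplication, or index setting is needed.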

# Highlighting maximum value from each column

The college dataset has many numeric columns describing different metrics about each school. Many people are interested in schools that perform the best for certain metrics.

### Getting ready

This recipe discovers the school that has the maximum value for each numeric column and styles the DataFrame in order to highlight the information so that it is easily consumed by a user.

### How to do it...

Read the college dataset with the institution name as the index:

In [61]:
pd.options.display.max_rows = 8

In [62]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college.dtypes

CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
                       ...   
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
Length: 26, dtype: object

All the other columns besides CITY and STABBR are expected to be numeric. Examining the data types from the preceding step reveals, unexpectedly, that the MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP columns are of type object and not numeric. To get a better idea of what kind of values are in these columns, let's examine their most common values and their first value:

In [63]:
college.MD_EARN_WNE_P10.value_counts().head()

MD_EARN_WNE_P10
PrivacySuppressed    822
38800                151
21500                 97
49200                 78
27400                 46
Name: count, dtype: int64

In [64]:
college.MD_EARN_WNE_P10.iloc[0]

'30300'

In [65]:
college.GRAD_DEBT_MDN_SUPP.iloc[0]

'33888'

These values are strings but we would like them to be numeric. This means that there are likely to be non-numeric characters that appear elsewhere in the Series. One way to check for this is to sort these columns in descending order and examine the first few rows:

In [66]:
college.MD_EARN_WNE_P10.sort_values(ascending=False).head()

INSTNM
Sharon Regional Health System School of Nursing          PrivacySuppressed
P&A Scholars Beauty School                               PrivacySuppressed
Fairview Beauty Academy                                  PrivacySuppressed
Rabbi Jacob Joseph School                                PrivacySuppressed
Acupuncture and Integrative Medicine College-Berkeley    PrivacySuppressed
Name: MD_EARN_WNE_P10, dtype: object

The culprit appears to be that some schools have privacy concerns about these two columns of data. To force these columns to be numeric, use the pandas function to_numeric:

In [67]:
college['MD_EARN_WNE_P10'] = pd.to_numeric(college.MD_EARN_WNE_P10, errors='coerce')
college['GRAD_DEBT_MDN_SUPP'] = pd.to_numeric(college.GRAD_DEBT_MDN_SUPP, errors='coerce')

In [68]:
college.dtypes.loc[['MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP']]

MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
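The effect of errors='coerce' is easiest to see on a three-value Series. The values below imitate the column contents shown earlier; anything that cannot be parsed as a number becomes a missing value.

```python
import pandas as pd

raw = pd.Series(['30300', 'PrivacySuppressed', '38800'])

# Unparseable strings are replaced with NaN instead of raising an error
clean = pd.to_numeric(raw, errors='coerce')

print(raw.dtype, clean.dtype)
print(clean)
```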

Use the select_dtypes method to filter for only numeric columns. This will exclude STABBR and CITY columns, where a maximum value doesn't make sense with this problem:


In [69]:
college_numeric = college.select_dtypes(include=[np.number])
college_numeric.head() # only numeric columns

| INSTNM | HBCU | MENONLY | WOMENONLY | ... | UG25ABV | MD_EARN_WNE_P10 | GRAD_DEBT_MDN_SUPP |
|---|---|---|---|---|---|---|---|
| Alabama A & M University | 1.0 | 0.0 | 0.0 | ... | 0.1049 | 30300.0 | 33888.0 |
| University of Alabama at Birmingham | 0.0 | 0.0 | 0.0 | ... | 0.2422 | 39700.0 | 21941.5 |
| Amridge University | 0.0 | 0.0 | 0.0 | ... | 0.854 | 40100.0 | 23370.0 |
| University of Alabama in Huntsville | 0.0 | 0.0 | 0.0 | ... | 0.264 | 45500.0 | 24097.0 |
| Alabama State University | 1.0 | 0.0 | 0.0 | ... | 0.127 | 26600.0 | 33118.5 |


According to the data dictionary, several columns have only binary (0/1) values that will not provide useful information. To programmatically find these columns, we can create a boolean Series and find all the columns that have two unique values with the nunique method:

In [70]:
criteria = college_numeric.nunique() == 2
criteria.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

Pass this boolean Series to the indexing operator of the columns index object and create a list of the binary columns:

In [71]:
binary_cols = college_numeric.columns[criteria].tolist()
binary_cols

['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER']

Remove the binary columns with the drop method:

In [72]:
college_numeric2 = college_numeric.drop(labels=binary_cols, axis='columns')
college_numeric2.head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,424.0,420.0,4206.0,...,0.1049,30300.0,33888.0
University of Alabama at Birmingham,570.0,565.0,11383.0,...,0.2422,39700.0,21941.5
Amridge University,,,291.0,...,0.854,40100.0,23370.0
University of Alabama in Huntsville,595.0,590.0,5451.0,...,0.264,45500.0,24097.0
Alabama State University,425.0,430.0,4811.0,...,0.127,26600.0,33118.5


Use the idxmax method to find the index label of the maximum value for each column:

In [73]:
max_cols = college_numeric2.idxmax()
max_cols

SATVRMID                      California Institute of Technology
SATMTMID                      California Institute of Technology
UGDS                               University of Phoenix-Arizona
UGDS_WHITE                Mr Leon's School of Hair Design-Moscow
                                         ...                    
PCTFLOAN                                  ABC Beauty College Inc
UG25ABV                           Dongguk University-Los Angeles
MD_EARN_WNE_P10                     Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP    Southwest University of Visual Arts-Tucson
Length: 18, dtype: object

Call the unique method on the max_cols Series. This returns an ndarray of the unique school names:

In [74]:
unique_max_cols = max_cols.unique()
unique_max_cols[:5]

array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)

Use the values of max_cols to select only the rows that have schools with a maximum value and then use the style attribute to highlight these values:

In [75]:
college_numeric2.loc[unique_max_cols].style.highlight_max()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,


In [76]:
college_numeric2.loc[max_cols.values]

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
California Institute of Technology,765.0,785.0,983.0,...,0.0082,77800.0,11812.5
California Institute of Technology,765.0,785.0,983.0,...,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,...,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,...,0.2000,,15710.0
...,...,...,...,...,...,...,...
ABC Beauty College Inc,,,38.0,...,0.4688,,16500.0
Dongguk University-Los Angeles,,,20.0,...,1.0000,,
Medical College of Wisconsin,,,,...,,233100.0,
Southwest University of Visual Arts-Tucson,,,161.0,...,0.8657,27200.0,49750.0
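
The output above lists California Institute of Technology twice because .loc returns one row for every label it is given, repeats included. A minimal sketch of this behavior, assuming a made-up two-row DataFrame (`toy` and `labels` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data with school names as the index
toy = pd.DataFrame({'score': [765.0, 430.0]}, index=['School A', 'School B'])

# A Series of labels in which 'School A' appears twice
labels = pd.Series(['School A', 'School A', 'School B'])

with_repeats = toy.loc[labels.values]        # one row per label, duplicates kept
without_repeats = toy.loc[labels.unique()]   # each label selected once

print(len(with_repeats), len(without_repeats))
```

This is why the recipe calls unique on max_cols before passing the labels to .loc.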


### How it works...

The idxmax method is very powerful and becomes quite useful when the index is meaningfully labeled. It was unexpected that both MD_EARN_WNE_P10 and GRAD_DEBT_MDN_SUPP were of object data type. When importing, pandas reads every value of a column as a string if the column contains at least one value that cannot be parsed as a number.

By examining a specific column value in step 2, we were able to see clearly that we had strings in these columns. In step 3, we sort in descending order because numeric characters sort before letters, so descending order brings all alphabetical values to the top of the Series. This uncovers the PrivacySuppressed string that is causing havoc. Pandas can convert all strings that contain only numeric characters to actual numeric data types with the to_numeric function. To override the default behavior of raising an error when to_numeric encounters a string that cannot be converted, you must pass 'coerce' to the errors parameter. This forces all non-numeric strings to become missing values (np.nan).
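
The effect of errors='coerce' is easy to see on a tiny example. This is a minimal sketch that assumes a made-up Series (`earnings` is a hypothetical stand-in for a column such as MD_EARN_WNE_P10):

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a sentinel string
earnings = pd.Series(['30300', '39700', 'PrivacySuppressed', '45500'])

# errors='coerce' replaces anything that cannot be parsed with NaN
earnings_numeric = pd.to_numeric(earnings, errors='coerce')

print(earnings_numeric)
print(earnings_numeric.isna().sum())  # number of values that could not be converted
```

Counting the missing values before and after the conversion is a quick way to check how many entries were coerced.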

Several columns don't have useful or meaningful maximum values. They were removed in step 4 through step 6. The select_dtypes method can be extremely useful for very wide DataFrames with lots of columns.
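
As a small illustration of select_dtypes, here is a minimal sketch on a made-up DataFrame (the `toy` frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two text columns and two numeric columns
toy = pd.DataFrame({
    'CITY': ['Normal', 'Birmingham'],
    'STABBR': ['AL', 'AL'],
    'UGDS': [4206.0, 11383.0],
    'HBCU': [1, 0],
})

# np.number matches every integer and float dtype
numeric_only = toy.select_dtypes(include=[np.number])
print(numeric_only.columns.tolist())
```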

In step 7, idxmax iterates through all the columns to find the index label of the maximum value for each column. It outputs the results as a Series. The school with both the highest SAT math and verbal scores is California Institute of Technology. Dongguk University-Los Angeles has the highest proportion of undergraduates older than 25.

Although the information provided by idxmax is nice, it does not yield the corresponding maximum value. To do this, we gather all the unique school names from the values of the max_cols Series.
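
One alternative way to see each maximum next to the label that holds it is to place the results of idxmax and max side by side. This is only a sketch on made-up data (the `toy` frame and the `summary` name are hypothetical), not the approach used in the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed by school name, with one missing value
toy = pd.DataFrame(
    {'SATVRMID': [424.0, 765.0, np.nan], 'UGDS': [4206.0, 983.0, 151558.0]},
    index=['School A', 'School B', 'School C'],
)

# Both Series are indexed by column name, so they align automatically
summary = pd.DataFrame({'school': toy.idxmax(), 'max_value': toy.max()})
print(summary)
```

Both methods skip missing values by default, so the label and the value in each row of `summary` refer to the same cell.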

Finally, in step 8, we use the .loc indexer to select rows based on the index label, which we made as school names in the first step. This filters for only schools that have a maximum value. DataFrames have an experimental style attribute that itself has some methods to alter the appearance of the displayed DataFrame. Highlighting the maximum value makes the result much clearer.

### There's more...

By default, the highlight_max method highlights the maximum value of each column. We can use the axis parameter to highlight the maximum value of each row instead. Here, we select just the race percentage columns of the college dataset and highlight the race with the highest percentage for each school:

In [77]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds = college.filter(like='UGDS_').head()
college_ugds.style.highlight_max(axis='columns')


Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


# Replicating idxmax with method chaining
It can be a good exercise to attempt an implementation of a built-in DataFrame method on your own. This type of replication can give you a deeper understanding of other pandas methods that you normally wouldn't have come across. idxmax is a challenging method to replicate using only the methods covered thus far in the book.

### Getting ready

This recipe slowly chains together basic methods to eventually find all the row index values that contain a maximum column value.

### How to do it...

Load in the college dataset and execute the same operations as the previous recipe to get only the numeric columns that are of interest:

In [78]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college['MD_EARN_WNE_P10'] = pd.to_numeric(college.MD_EARN_WNE_P10, errors='coerce')
college['GRAD_DEBT_MDN_SUPP'] = pd.to_numeric(college.GRAD_DEBT_MDN_SUPP, errors='coerce')
college_numeric = college.select_dtypes(include=[np.number])
criteria = college_numeric.nunique() == 2
binary_cols = college_numeric.columns[criteria].tolist()
college_numeric = college_numeric.drop(labels=binary_cols, axis='columns')

Find the maximum of each column with the max method:

In [79]:
college_numeric.max().head()

SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

Use the eq DataFrame method to test each value against its column maximum. By default, the eq method aligns the columns of the calling DataFrame with the index labels of the passed Series:


In [80]:
college_numeric.eq(college_numeric.max()).head()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,False,False,False,...,False,False,False
University of Alabama at Birmingham,False,False,False,...,False,False,False
Amridge University,False,False,False,...,False,False,False
University of Alabama in Huntsville,False,False,False,...,False,False,False
Alabama State University,False,False,False,...,False,False,False
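
The alignment that eq performs is by label, not by position. A minimal sketch on made-up data (all names here are hypothetical) shows that reordering the Series of maximums does not change the result:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 5, 3], 'b': [9, 2, 9]}, index=['x', 'y', 'z'])

col_max = toy.max()       # Series indexed by column label
is_max = toy.eq(col_max)  # each column is compared with its own maximum
print(is_max)

# Same maximums in a different order: labels still decide what is compared
shuffled = col_max[['b', 'a']]
same = toy.eq(shuffled)[['a', 'b']].equals(is_max)
print(same)
```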


All the rows in this DataFrame that have at least one True value must contain a column maximum. Let's use the any method to find all such rows that have at least one True value:


In [81]:
has_row_max = college_numeric.eq(college_numeric.max()).any(axis='columns')
has_row_max.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

There are only 18 columns, which means that there should be at most 18 True values in has_row_max. Let's confirm the shape of the DataFrame and then find out how many True values there actually are:

In [82]:
college_numeric.shape

(7535, 18)

In [83]:
has_row_max.sum()

401

This was a bit unexpected, but it turns out that there are columns with many rows that equal the maximum value. This is common with many of the percentage columns that have a maximum of 1. idxmax returns the first occurrence of the maximum value. Let's back up a bit, remove the any method, and look at the output from step 3. Running the cumsum method instead accumulates all the True values. Once the max_rows display option is set to 6, only the first and last three rows are shown.

In [84]:
college_numeric.eq(college_numeric.max()).cumsum().cumsum().eq(1).any(axis='columns')[lambda x: x]

INSTNM
Thunderbird School of Global Management             True
Southwest University of Visual Arts-Tucson          True
ABC Beauty College Inc                              True
Velvatex College of Beauty Culture                  True
                                                    ... 
Palau Community College                             True
California University of Management and Sciences    True
Cosmopolitan Beauty and Tech School                 True
University of Phoenix-Arizona                       True
Length: 16, dtype: bool

In [85]:
%timeit college_numeric2.idxmax()

1.01 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [86]:
college_numeric2.idxmax??

Signature:
college_numeric2.idxmax(
    axis: 'Axis' = 0,
    skipna: 'bool' = True,
    numeric_only: 'bool' = False,
) -> 'Series'
Docstring:
Return index of first occurrence of maximum over requested axis.

NA/null values are excluded.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    The axis to use. 0 or 'index' for row-wise, 1 or 'columns' for column-wise.
skipna : bool, default True
    Exclude NA/null values. If an entire row/column is NA, the result
    will be NA.
numeric_only : bool, default False
    Include only `float`, `int` or `boolean` data.

    .. versionadded:: 1.5.0

Returns
-------
Series
    

In [87]:
pd.options.display.max_rows=6

In [88]:
college_numeric.eq(college_numeric.max()).cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,0,0,0,...,0,0,0
University of Alabama at Birmingham,0,0,0,...,0,0,0
Amridge University,0,0,0,...,0,0,0
...,...,...,...,...,...,...,...
National Personal Training Institute of Cleveland,1,1,1,...,12,1,2
Bay Area Medical Academy - San Jose Satellite Location,1,1,1,...,12,1,2
Excel Learning Center-San Antonio South,1,1,1,...,12,1,2


Some columns have one unique maximum like SATVRMID and SATMTMID, while others like UGDS_WHITE have many. 109 schools have 100% of their undergraduates as white. If we chain the cumsum method one more time, the value 1 would only appear once in each column and it would be the first occurrence of the maximum:


In [89]:
college_numeric.eq(college_numeric.max()).cumsum().cumsum()

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,...,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,0,0,0,...,0,0,0
University of Alabama at Birmingham,0,0,0,...,0,0,0
Amridge University,0,0,0,...,0,0,0
...,...,...,...,...,...,...,...
National Personal Training Institute of Cleveland,7307,7307,417,...,36207,3447,10270
Bay Area Medical Academy - San Jose Satellite Location,7308,7308,418,...,36219,3448,10272
Excel Learning Center-San Antonio South,7309,7309,419,...,36231,3449,10274


We can now test the equality of each value against 1 with the eq method and then use the any method to find rows that have at least one True value:

In [91]:
college_idxmax = college_numeric.eq(college_numeric.max())\
                                .cumsum()\
                                .cumsum()\
                                .eq(1)\
                                .any(axis='columns')
college_idxmax.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

We need all the institutions where college_idxmax is True. We can simply use boolean indexing on the Series itself:

In [92]:
idxmax_cols = college_idxmax[college_idxmax].index
idxmax_cols

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       "Mr Leon's School of Hair Design-Moscow",
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')

All 16 of these institutions are the index of the first maximum occurrence for at least one of the columns. We can check whether they are the same as the ones found with the idxmax method:

In [93]:
set(college_numeric.idxmax().unique()) == set(idxmax_cols)

True

### How it works...

The first step replicates work from the previous recipe by converting two columns to numeric and eliminating the binary columns. We find the maximum value of each column in step 2. Care needs to be taken here: older versions of pandas silently drop columns for which they cannot produce a maximum, while pandas 2.0 and later raise an error instead. When columns are silently dropped, step 3 will still complete but produce all False values for each column without an available maximum.

Step 4 uses the any method to scan across each row in search of at least one True value. Any row with at least one True value contains a maximum value for a column. We sum up the resulting boolean Series in step 5 to determine how many rows contain a maximum. Somewhat unexpectedly, there are far more rows than columns. Step 6 gives insight on why this happens. We take a cumulative sum of the output from step 3 and detect the total number of rows that equal the maximum for each column.

Many colleges have 100% of their student population as only a single race. This is by far the largest contributor to the multiple rows with maximums. As you can see, there is only one row with a maximum value for both SAT score columns and undergraduate population, but several of the race columns have a tie for the maximum.

Our goal is to find the first row with the maximum value. We need to take the cumulative sum once more so that each column has only a single row equal to 1. Step 8 formats the code to have one method per line and runs the any method exactly as it was done in step 4. If this step is successful, then we should have no more True values than the number of columns. Step 9 asserts that this is true.
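
The double cumulative sum can be traced by hand on a small example. This is a minimal sketch on made-up data (the `toy` frame and intermediate names are hypothetical), in which column b has a tie for its maximum:

```python
import pandas as pd

# Hypothetical toy data: column 'b' reaches its maximum (9) in rows 'x' and 'z'
toy = pd.DataFrame(
    {'a': [0, 1, 5, 3], 'b': [0, 9, 2, 9]},
    index=['w', 'x', 'y', 'z'],
)

is_max = toy.eq(toy.max())  # a: F F T F   b: F T F T
once = is_max.cumsum()      # a: 0 0 1 1   b: 0 1 1 2
twice = once.cumsum()       # a: 0 0 1 2   b: 0 1 2 4

# After one cumsum the value 1 can repeat (rows 'y' and 'z' in column 'a');
# after two, it appears only at the first occurrence of each column maximum
first_max = twice.eq(1)
rows = first_max.any(axis='columns')
labels = rows[rows].index.tolist()
print(labels)

# Compare with the built-in method, ignoring order
print(set(toy.idxmax()) == set(labels))
```

A single cumsum is not enough because the running count stays at 1 until the next tie is reached. The second cumsum makes every value after the first maximum strictly greater than 1.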

To validate that we have found the same schools as idxmax did in the previous recipe, we use boolean selection on college_idxmax with itself. The school names will be in a different order, so we convert both sequences of names to sets, which are inherently unordered, to compare equality.





# Finding the most common maximum

The college dataset contains the undergraduate population percentage of nine different race categories for over 7,500 colleges. It would be interesting to find the race with the highest undergrad population for each school and then find the distribution of this result for the entire dataset. We would be able to answer a question like, What percentage of institutions have more white students than any other race?

### Getting ready

In this recipe, we find the race with the highest percentage of the undergraduate population for each school with the idxmax method and then find the distribution of these maximums.

### How to do it...

Read in the college dataset and select just those columns with undergraduate race percentage information:

In [94]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds = college.filter(like='UGDS_')
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,0.0333,0.9353,0.0055,...,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,...,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,...,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,...,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,...,0.0098,0.0243,0.0137


Use the idxmax method to get the column name with the highest race percentage for each row:

In [95]:
highest_percentage_race = college_ugds.idxmax(axis='columns')
highest_percentage_race.head()


INSTNM
Alabama A & M University               UGDS_BLACK
University of Alabama at Birmingham    UGDS_WHITE
Amridge University                     UGDS_BLACK
University of Alabama in Huntsville    UGDS_WHITE
Alabama State University               UGDS_BLACK
dtype: object

Use the value_counts method to return the distribution of maximum occurrences:

In [96]:
highest_percentage_race.value_counts(normalize=True)

UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
                ...   
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
Name: proportion, Length: 9, dtype: float64

### How it works...

The key to this recipe is recognizing that the columns all represent the same unit of information. We can compare these columns with each other, which is usually not the case. For instance, it wouldn't make sense to directly compare SAT verbal scores with the undergraduate population. As the data is structured in this manner, we can apply the idxmax method to each row of data to find the column with the largest value. We need to alter its default behavior with the axis parameter.

Step 2 completes this operation and returns a Series, to which we can now simply apply the value_counts method to return the distribution. We pass True to the normalize parameter as we are interested in the distribution (relative frequency) and not the raw counts.
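
The two steps can be sketched on a few rows of made-up data (the `toy` frame and its values are hypothetical). Rows that are entirely missing are dropped first, as an assumption of this sketch, because idxmax may warn or raise on all-NaN rows depending on the pandas version:

```python
import pandas as pd

# Hypothetical toy data: each row is a school, each column a race percentage
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.03, 0.59, 0.50, 0.70],
     'UGDS_BLACK': [0.94, 0.26, 0.42, 0.13],
     'UGDS_HISP': [0.01, 0.03, 0.01, 0.04]},
    index=['School A', 'School B', 'School C', 'School D'],
)

# Drop rows with no data at all, then find the column label of each row maximum
top = toy.dropna(how='all').idxmax(axis='columns')

# normalize=True returns relative frequencies instead of raw counts
dist = top.value_counts(normalize=True)
print(dist)
```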

### There's more...

We might want to explore more and answer the question: For the schools with more black students than any other race, what is the distribution of their second highest race percentage?



In [97]:
college_black = college_ugds[highest_percentage_race == 'UGDS_BLACK']
college_black = college_black.drop('UGDS_BLACK', axis='columns')
college_black.idxmax(axis='columns').value_counts(normalize=True)

UGDS_WHITE    0.661228
UGDS_HISP     0.230326
UGDS_UNKN     0.071977
                ...   
UGDS_2MOR     0.006718
UGDS_AIAN     0.000960
UGDS_NHPI     0.000960
Name: proportion, Length: 8, dtype: float64

We needed to drop the UGDS_BLACK column before applying the same method from this recipe. Interestingly, it seems that these schools with higher black populations have a tendency to have higher Hispanic populations.
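
The same filter, drop, and idxmax sequence can be followed on a small scale. This is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: each row is a school, each column a race percentage
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.03, 0.59, 0.10, 0.20],
     'UGDS_BLACK': [0.94, 0.26, 0.60, 0.70],
     'UGDS_HISP': [0.01, 0.03, 0.30, 0.05]},
    index=['School A', 'School B', 'School C', 'School D'],
)

top = toy.idxmax(axis='columns')

# Keep only the schools whose largest group is UGDS_BLACK, then remove that
# column so the next idxmax call finds the second-largest group instead
black_top = toy[top == 'UGDS_BLACK'].drop('UGDS_BLACK', axis='columns')
second = black_top.idxmax(axis='columns')
print(second.value_counts(normalize=True))
```

Dropping the winning column works here because every selected row shares the same maximum column. A more general second-highest calculation across all schools would need a different approach.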