In [1]:
print("""
@File         : ch08_index_alignment.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2024-07-22 23:02:22
@Email        : cuixuanstephen@gmail.com
@Description  : Index alignment
""")


@File         : ch08_index_alignment.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2024-07-22 23:02:22
@Email        : cuixuanstephen@gmail.com
@Description  : Index alignment



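When Series or DataFrames are combined, **each dimension of the data is first automatically aligned on each axis** before any computation takes place. This silent, automatic axis alignment can confuse beginners, but it gives advanced users a great deal of flexibility.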
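As a minimal sketch of alignment on both axes, the two small frames below (made-up labels and values, not part of the chapter's data) share only the row label `b` and the column label `y`. Their sum is computed over the union of the labels, and every cell that exists in only one frame comes out missing.

```python
import pandas as pd

# Two hypothetical frames that only partly share row and column labels
df1 = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["x", "y"])
df2 = pd.DataFrame([[10, 20], [30, 40]], index=["b", "c"], columns=["y", "z"])

# Rows and columns are aligned by label before the addition happens
total = df1 + df2
print(total)
```

Only the cell at row `b`, column `y` has a value in both inputs, so it is the only non-missing cell in `total`.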

In [2]:
%cd ../

d:\Data-Analysis-and-Science\P1XC2E


In [3]:
import pandas as pd
import numpy as np

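## Examining the Index object

Each axis of a Series or DataFrame has an Index object that labels its values. There are many types of Index objects, but they all share common behavior. All Index objects except MultiIndex are one-dimensional data structures that combine the functionality of a Python set and a NumPy ndarray.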

In [4]:
college = pd.read_csv('data/college.csv')
columns = college.columns
columns

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
       'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
       'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
       'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
       'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
      dtype='object')

In [5]:
columns.values

array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
       'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
       'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
       'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
       'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
       'GRAD_DEBT_MDN_SUPP'], dtype=object)

Select items from an Index by position with a scalar, a list, or a slice:

In [6]:
columns[5]

'WOMENONLY'

In [7]:
columns[[1, 8, 10]]

Index(['CITY', 'SATMTMID', 'UGDS'], dtype='object')

In [8]:
columns[-7:-4]

Index(['PPTUG_EF', 'CURROPER', 'PCTPELL'], dtype='object')

Indexes share many of the same methods as Series and DataFrames:

In [9]:
columns.min(), columns.max(), columns.isnull().sum()

('CITY', 'WOMENONLY', 0)

Basic arithmetic and comparison operators can be used on Index objects:

In [10]:
columns + '_A'

Index(['INSTNM_A', 'CITY_A', 'STABBR_A', 'HBCU_A', 'MENONLY_A', 'WOMENONLY_A',
       'RELAFFIL_A', 'SATVRMID_A', 'SATMTMID_A', 'DISTANCEONLY_A', 'UGDS_A',
       'UGDS_WHITE_A', 'UGDS_BLACK_A', 'UGDS_HISP_A', 'UGDS_ASIAN_A',
       'UGDS_AIAN_A', 'UGDS_NHPI_A', 'UGDS_2MOR_A', 'UGDS_NRA_A',
       'UGDS_UNKN_A', 'PPTUG_EF_A', 'CURROPER_A', 'PCTPELL_A', 'PCTFLOAN_A',
       'UG25ABV_A', 'MD_EARN_WNE_P10_A', 'GRAD_DEBT_MDN_SUPP_A'],
      dtype='object')

In [11]:
columns > 'G'

array([ True, False,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True])

**Trying to change an Index value after the Index has been created fails. Indexes are immutable objects.**

In [12]:
try:
    columns[4] = 'city'
except TypeError:
    print('Changing an Index value after creation failed. Indexes are immutable')

Changing an Index value after creation failed. Indexes are immutable


Indexes support set operations: union, intersection, difference, and symmetric difference:

In [13]:
c1 = columns[:4]
c1

Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')

In [14]:
c2 = columns[2:6]
c2

Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')

In [15]:
c1.union(c2)  # set union (older pandas versions also accepted c1 | c2)

Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')

In [16]:
c1.symmetric_difference(c2)  # older pandas versions also accepted c1 ^ c2

Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')

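Indexes share many operations with Python sets and resemble them in another important way: they are (usually) implemented with hash tables, which makes access extremely fast when selecting rows or columns from a DataFrame. Because the values must be hashable, Index values need to be immutable types such as strings, integers, or tuples, just like the keys of a Python dictionary.

Indexes support duplicate values. If an Index contains duplicates, the hash table can no longer be used for its implementation, and object access becomes slower.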
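Since duplicates change how an Index behaves, it can help to check for them before relying on fast lookups or on alignment. The sketch below uses two small hypothetical Index objects and the `is_unique` property and `duplicated` method that pandas Index objects provide.

```python
import pandas as pd

unique_index = pd.Index(["INSTNM", "CITY", "STABBR"])
duplicated_index = pd.Index(["a", "a", "b"])

# is_unique is True only when no label is repeated
print(unique_index.is_unique)
print(duplicated_index.is_unique)

# duplicated() marks every repeat of a label after its first occurrence
dup_mask = duplicated_index.duplicated()
print(dup_mask)
```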

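## Producing Cartesian products

Whenever a Series or DataFrame operates with another Series or DataFrame, the indexes of each object (both the row index and the column index) are aligned before any operation begins. This alignment happens behind the scenes and can be very surprising to those new to pandas. According to the usual description, unless the indexes are identical, this alignment always creates a Cartesian product between the indexes.

> Note: strictly speaking, the aligned index is the union of the labels, so "Cartesian product" is not accurate in general. A Cartesian product arises only for duplicated labels: each label yields (count in the left index) × (count in the right index) rows, and these per-label products add up to the length of the result.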

In [17]:
s1 = pd.Series(index=list('aaab'), data=np.arange(4))
s1

a    0
a    1
a    2
b    3
dtype: int32

In [18]:
s2 = pd.Series(index=list('cababb'), data=np.arange(6))
s2

c    0
a    1
b    2
a    3
b    4
b    5
dtype: int32

In [19]:
s1 + s2

a    1.0
a    3.0
a    2.0
a    4.0
a    3.0
a    5.0
b    5.0
b    7.0
b    8.0
c    NaN
dtype: float64

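Each `a` label in s1 pairs with each `a` label in s2. This pairing produces six `a` labels, three `b` labels, and one `c` label in the resulting Series. A Cartesian product occurs between all identical index labels.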
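The counting rule can be sketched as a small helper. `expected_aligned_length` is a hypothetical function written for this illustration, not a pandas API: for each label it multiplies the number of occurrences on each side, treating a label that is absent from one side as occurring once, and sums the products. It assumes the labels contain no missing values.

```python
import numpy as np
import pandas as pd


def expected_aligned_length(left_index, right_index):
    """Predict len(left + right) for two indexes (hypothetical helper)."""
    # Identical indexes align position by position, with no product
    if left_index.equals(right_index):
        return len(left_index)
    left_counts = left_index.value_counts()
    right_counts = right_index.value_counts()
    labels = left_counts.index.union(right_counts.index)
    # A label missing on one side still produces one row per occurrence
    # on the other side, so it counts as 1 in the product
    left_n = left_counts.reindex(labels, fill_value=1)
    right_n = right_counts.reindex(labels, fill_value=1)
    return int((left_n * right_n).sum())


s1 = pd.Series(index=list("aaab"), data=np.arange(4))
s2 = pd.Series(index=list("cababb"), data=np.arange(6))

predicted = expected_aligned_length(s1.index, s2.index)
actual = len(s1 + s2)
print(predicted, actual)
```

For these two Series the products are 3 × 2 for `a`, 1 × 3 for `b`, and 1 for `c`, which matches the ten rows of the result.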

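Because the element with label `c` is unique to Series s2, pandas defaults its value to missing: there is no label in s1 for it to align with. Whenever an index label appears in only one of the objects, pandas sets the resulting value to missing by default. This changes the data type of the Series to float, even though each input Series held only integers. The type changes because NumPy's missing value object, np.nan, exists only for floats, not for integers. A Series or DataFrame column must have a homogeneous numeric data type, so every value in the column is converted to float. For this small dataset the change makes little difference, but for larger datasets it can have a significant impact on memory.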
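The upcast is easy to observe on a small hypothetical example. The second half of the sketch shows one possible alternative that the chapter itself does not use: the nullable `"Int64"` extension dtype, which represents missing values as `pd.NA` and therefore keeps the result integer-typed.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Labels a and c exist on one side only, so NaN appears and forces float64
default_sum = s1 + s2
print(default_sum.dtype)

# Assumption for illustration: the nullable Int64 dtype keeps integers
nullable_sum = s1.astype("Int64") + s2.astype("Int64")
print(nullable_sum.dtype)
```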

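> When the indexes are unique, no Cartesian product is created; when they contain exactly the same elements in the same order, the indexes simply align by position. Note that each element aligns exactly by position and the data type remains integer: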

In [20]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('aaabb'), data=np.arange(5))

In [21]:
s1 + s2

a    0
a    2
a    4
b    6
b    8
dtype: int32

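If the indexes have the same elements but in a different order in each Series, a Cartesian product occurs. Let's change the order of the index in s2 and rerun the same operation: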

In [22]:
s1 = pd.Series(index=list('aaabb'), data=np.arange(5))
s2 = pd.Series(index=list('bbaaa'), data=np.arange(5))

In [23]:
s1 + s2

a    2
a    3
a    4
a    3
a    4
a    5
a    4
a    5
a    6
b    3
b    4
b    4
b    5
dtype: int32

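Typically, array-like data structures in Python and other languages do not allow an operation when the operands do not contain the same number of elements along a dimension. pandas allows it by aligning the indexes before completing the operation.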
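A short contrast makes the difference concrete. In this sketch, adding two NumPy arrays of lengths 3 and 4 raises a `ValueError` because the shapes cannot be broadcast, while the same values wrapped in Series (with their default integer indexes) are aligned by label and added.

```python
import numpy as np
import pandas as pd

# Plain NumPy arrays of different lengths cannot be added
try:
    np.arange(3) + np.arange(4)
    numpy_error = None
except ValueError as exc:
    numpy_error = exc
print(type(numpy_error).__name__)

# Series align on their labels 0..2 and 0..3 first; label 3 has no partner
aligned = pd.Series(np.arange(3)) + pd.Series(np.arange(4))
print(aligned)
```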

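Another situation where this can happen is a groupby operation. If you group by multiple columns and one of them is of categorical type, you get a Cartesian product in which every outer index value is paired with every inner index value.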
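The groupby case can be sketched with a tiny made-up frame (the column names and values are hypothetical). `observed=False` is passed explicitly here because the default of that parameter differs between pandas versions; with it, the result contains every combination of the grouping values, including combinations that never occur in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "gender": ["x", "y"],
    "salary": [100, 200],
})

# observed=False expands the result to all dept/gender combinations
all_combos = df.groupby(["dept", "gender"], observed=False).size()
print(all_combos)

# observed=True keeps only the combinations present in the data
seen_only = df.groupby(["dept", "gender"], observed=True).size()
print(seen_only)
```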

## Exploding indexes

In [24]:
employee = pd.read_csv('data/employee.csv', index_col='RACE')
employee.head()

RACE,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE
Hispanic/Latino,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Full Time,Female,Active,2006-06-12,2012-10-13
Hispanic/Latino,1,LIBRARY ASSISTANT,Library,26125.0,Full Time,Female,Active,2000-07-19,2010-09-18
White,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,Full Time,Male,Active,2015-02-03,2015-02-03
White,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,Full Time,Male,Active,1982-02-08,1991-05-25
White,4,ELECTRICIAN,General Services Department,56347.0,Full Time,Male,Active,1989-06-19,1994-10-22


In [25]:
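# Technically this creates a view rather than a brand-new copy.
# In pandas, a view is not a new object but only a reference to another object, usually some subset of a DataFrame.
# Such a shared object can cause many problems.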
salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2

True

In [26]:
salary2 = employee["BASE_SALARY"].copy()
salary1 is salary2

False

In [27]:
salary1 = salary1.sort_index()
salary1.head()

RACE
American Indian or Alaskan Native    78355.0
American Indian or Alaskan Native    26125.0
American Indian or Alaskan Native    98536.0
American Indian or Alaskan Native        NaN
American Indian or Alaskan Native    55461.0
Name: BASE_SALARY, dtype: float64

In [28]:
salary2.head()

RACE
Hispanic/Latino    121862.0
Hispanic/Latino     26125.0
White               45279.0
White               63166.0
White               56347.0
Name: BASE_SALARY, dtype: float64

In [29]:
salary_add = salary1 + salary2
salary_add.head()

RACE
American Indian or Alaskan Native    138702.0
American Indian or Alaskan Native    156710.0
American Indian or Alaskan Native    176891.0
American Indian or Alaskan Native    159594.0
American Indian or Alaskan Native    127734.0
Name: BASE_SALARY, dtype: float64

In [30]:
salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)

(2000, 2000, 1175424, 2000)

Verify the number of rows in `salary_add`:

In [31]:
index_vc = salary1.index.value_counts(dropna=False)
index_vc

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
NaN                                   35
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

In [32]:
index_vc.pow(2).sum()

1175424

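## Filling values with unequal indexes

When two Series are added with the plus operator and an index label of one does not appear in the other, the resulting value is always missing. pandas provides the `.add` method, which offers an option to fill the missing value.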

In [33]:
baseball_14 = pd.read_csv('data/baseball14.csv', index_col='playerID')
baseball_15 = pd.read_csv('data/baseball15.csv', index_col='playerID')
baseball_16 = pd.read_csv('data/baseball16.csv', index_col='playerID')

In [34]:
baseball_14.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
altuvjo01,2014,1,HOU,AL,158,660,85,225,47,3,...,59.0,56.0,9.0,36,53.0,7.0,5.0,1.0,5.0,20.0
cartech02,2014,1,HOU,AL,145,507,68,115,21,1,...,88.0,5.0,2.0,56,182.0,6.0,5.0,0.0,4.0,12.0
castrja01,2014,1,HOU,AL,126,465,43,103,21,2,...,56.0,1.0,0.0,34,151.0,1.0,9.0,1.0,3.0,11.0
corpoca01,2014,1,HOU,AL,55,170,22,40,6,0,...,19.0,0.0,0.0,14,37.0,0.0,3.0,1.0,2.0,3.0
dominma01,2014,1,HOU,AL,157,564,51,121,17,0,...,57.0,0.0,1.0,29,125.0,2.0,5.0,2.0,7.0,23.0


In [35]:
baseball_14.index.difference(baseball_15.index)

Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
       'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
      dtype='object', name='playerID')

In [36]:
baseball_15.index.difference(baseball_14.index)

Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
       'rasmuco01', 'tuckepr01', 'valbulu01'],
      dtype='object', name='playerID')

In [37]:
hits_14 = baseball_14['H']
hits_15 = baseball_15['H']
hits_16 = baseball_16['H']
hits_14.head()

playerID
altuvjo01    225
cartech02    115
castrja01    103
corpoca01     40
dominma01    121
Name: H, dtype: int64

In [38]:
(hits_14 + hits_15).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01      NaN
corpoca01      NaN
Name: H, dtype: float64

In [39]:
hits_14.add(hits_15, fill_value=0).head()

playerID
altuvjo01    425.0
cartech02    193.0
castrja01    174.0
congeha01     46.0
corpoca01     40.0
Name: H, dtype: float64

In [40]:
hits_total = hits_14.add(hits_15, fill_value=0).add(hits_16, fill_value=0)

In [41]:
hits_total.head()

playerID
altuvjo01    641.0
bregmal01     53.0
cartech02    193.0
castrja01    243.0
congeha01     46.0
Name: H, dtype: float64

In [42]:
hits_total.hasnans

False

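The `.add` method works like the plus operator but offers more flexibility: its `fill_value` parameter supplies a value to stand in for an index label that has no match.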
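The same comparison can be reproduced without the CSV files. The player labels and hit counts below are made up for illustration; the sketch only contrasts the plus operator with `.add(..., fill_value=0)`.

```python
import pandas as pd

# Hypothetical hit counts for two seasons with partly different players
hits_a = pd.Series({"p1": 10, "p2": 5})
hits_b = pd.Series({"p2": 7, "p3": 3})

# The operator leaves players who appear in only one season as NaN
plain = hits_a + hits_b
print(plain)

# fill_value=0 treats the absent season as zero hits
filled = hits_a.add(hits_b, fill_value=0)
print(filled)
```

The result is still float64 in both cases, because alignment introduces missing values before the fill value is applied.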

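Sometimes an index label corresponds to a missing value in both Series. In that case, when the two Series are added, the label still corresponds to a missing value whether or not the `fill_value` parameter is used.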

In [44]:
s = pd.Series(
    index=list('abcd'),
    data=[np.nan, 3, np.nan, 1],
)
s

a    NaN
b    3.0
c    NaN
d    1.0
dtype: float64

In [45]:
s1 = pd.Series(
    index=['a', 'b', 'c'],
    data=[np.nan, 6, 10]
)
s1

a     NaN
b     6.0
c    10.0
dtype: float64

In [47]:
s.add(s1, fill_value=5)
# fill_value stands in when only one side is missing (c: 5 + 10, d: 1 + 5); label a is missing in both s and s1, so it stays NaN

a     NaN
b     9.0
c    15.0
d     6.0
dtype: float64

In [48]:
df_14 = baseball_14[['G', 'AB', 'R', 'H']]
df_14.sample()

playerID,G,AB,R,H
villajo01,87,263,31,55


In [49]:
df_15 = baseball_15[['AB', 'R', 'H', 'HR']]
df_15.sample()

playerID,AB,R,H,HR
marisja01,339,46,80,9


In [50]:
(
    df_15 + df_14
).head(10).style.highlight_null('yellow')

playerID,AB,G,H,HR,R
altuvjo01,1298.0,,425.0,,171.0
cartech02,898.0,,193.0,,118.0
castrja01,802.0,,174.0,,81.0
congeha01,,,,,
corpoca01,,,,,
correca01,,,,,
dominma01,,,,,
fowlede01,,,,,
gattiev01,,,,,
gomezca01,,,,,


In [53]:
(
    df_14.add(df_15, fill_value=0).head(10).style.highlight_null('yellow')
)
# the remaining values are missing because that row/column combination appears in neither DataFrame

playerID,AB,G,H,HR,R
altuvjo01,1298.0,158.0,425.0,15.0,171.0
cartech02,898.0,145.0,193.0,24.0,118.0
castrja01,802.0,126.0,174.0,11.0,81.0
congeha01,201.0,,46.0,11.0,25.0
corpoca01,170.0,55.0,40.0,,22.0
correca01,387.0,,108.0,22.0,52.0
dominma01,564.0,157.0,121.0,,51.0
fowlede01,434.0,116.0,120.0,,61.0
gattiev01,566.0,,139.0,27.0,66.0
gomezca01,149.0,,36.0,4.0,19.0


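## Adding columns from different DataFrames
Any DataFrame can add new columns to itself. However, when a DataFrame adds a new column from another DataFrame or Series, the indexes are usually aligned first, and only then is the new column created.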
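A minimal sketch with made-up department labels shows that the new column follows the labels, not the positions. The Series `dept_max` lists the departments in a different order and has no entry for `Finance`.

```python
import pandas as pd

df = pd.DataFrame(
    {"salary": [50, 60, 70]},
    index=["Library", "Finance", "Police"],
)
# Hypothetical per-department maxima, in a different order
dept_max = pd.Series({"Police": 700, "Library": 500})

# assign aligns dept_max on the index labels before creating the column
with_max = df.assign(max_salary=dept_max)
print(with_max)
```

`Finance` has nothing to align with, so its `max_salary` is missing.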

In [54]:
employee = pd.read_csv('data/employee.csv')
dept_sal = employee[['DEPARTMENT', 'BASE_SALARY']]

In [55]:
dept_sal = dept_sal.sort_values(
    ['DEPARTMENT', 'BASE_SALARY'],
    ascending=[True, False]
)

In [57]:
max_dept_sal = dept_sal.drop_duplicates(
    subset=['DEPARTMENT']
)
max_dept_sal

Unnamed: 0,DEPARTMENT,BASE_SALARY
1494,Admn. & Regulatory Affairs,140416.0
149,City Controller's Office,64251.0
236,City Council,100000.0
647,Convention and Entertainment,38397.0
1500,Dept of Neighborhoods (DON),89221.0
276,Finance,96272.0
561,Fleet Management Department,125884.0
697,General Services Department,89194.0
11,Health & Human Services,180416.0
396,Housing and Community Devp.,98536.0


In [58]:
max_dept_sal.set_index("DEPARTMENT", inplace=True)
employee.set_index("DEPARTMENT", inplace=True)

In [59]:
employee = employee.assign(
    MAX_DEPT_SALARY=max_dept_sal['BASE_SALARY']
)
employee

DEPARTMENT,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY
Municipal Courts Department,0,ASSISTANT DIRECTOR (EX LVL),121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,121862.0
Library,1,LIBRARY ASSISTANT,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,107763.0
Houston Police Department-HPD,2,POLICE OFFICER,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,199596.0
Houston Fire Department (HFD),3,ENGINEER/OPERATOR,63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,210588.0
General Services Department,4,ELECTRICIAN,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,89194.0
...,...,...,...,...,...,...,...,...,...,...
Houston Police Department-HPD,1995,POLICE OFFICER,43443.0,White,Full Time,Male,Active,2014-06-09,2015-06-09,199596.0
Houston Fire Department (HFD),1996,COMMUNICATIONS CAPTAIN,66523.0,Black or African American,Full Time,Male,Active,2003-09-02,2013-10-06,210588.0
Houston Police Department-HPD,1997,POLICE OFFICER,43443.0,White,Full Time,Male,Active,2014-10-13,2015-10-13,199596.0
Houston Police Department-HPD,1998,POLICE OFFICER,55461.0,Asian/Pacific Islander,Full Time,Male,Active,2009-01-20,2011-07-02,199596.0


In [60]:
employee.query('BASE_SALARY > MAX_DEPT_SALARY')

DEPARTMENT,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY


In [62]:
# Refactor our code into a chain:
employee = pd.read_csv('data/employee.csv')
max_dept_sal = (
    employee[['DEPARTMENT', 'BASE_SALARY']]
    .sort_values(['DEPARTMENT', 'BASE_SALARY'],
                 ascending=[True, False])
    .drop_duplicates(subset='DEPARTMENT')
    .set_index('DEPARTMENT')
)

(
    employee.set_index("DEPARTMENT")
    .assign(
        MAX_DEPT_SALARY=max_dept_sal['BASE_SALARY']
    )
)

DEPARTMENT,UNIQUE_ID,POSITION_TITLE,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SALARY
Municipal Courts Department,0,ASSISTANT DIRECTOR (EX LVL),121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,121862.0
Library,1,LIBRARY ASSISTANT,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,107763.0
Houston Police Department-HPD,2,POLICE OFFICER,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,199596.0
Houston Fire Department (HFD),3,ENGINEER/OPERATOR,63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,210588.0
General Services Department,4,ELECTRICIAN,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,89194.0
...,...,...,...,...,...,...,...,...,...,...
Houston Police Department-HPD,1995,POLICE OFFICER,43443.0,White,Full Time,Male,Active,2014-06-09,2015-06-09,199596.0
Houston Fire Department (HFD),1996,COMMUNICATIONS CAPTAIN,66523.0,Black or African American,Full Time,Male,Active,2003-09-02,2013-10-06,210588.0
Houston Police Department-HPD,1997,POLICE OFFICER,43443.0,White,Full Time,Male,Active,2014-10-13,2015-10-13,199596.0
Houston Police Department-HPD,1998,POLICE OFFICER,55461.0,Asian/Pacific Islander,Full Time,Male,Active,2009-01-20,2011-07-02,199596.0


In [64]:
random_salary = dept_sal.sample(n=10, random_state=42).set_index('DEPARTMENT')

In [65]:
random_salary

DEPARTMENT,BASE_SALARY
Public Works & Engineering-PWE,34861.0
Houston Airport System (HAS),29286.0
Houston Police Department-HPD,31907.0
Houston Police Department-HPD,66614.0
Houston Police Department-HPD,42000.0
Houston Police Department-HPD,43443.0
Houston Police Department-HPD,66614.0
Public Works & Engineering-PWE,52582.0
Finance,93168.0
Houston Police Department-HPD,35318.0


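**Note that several departments are duplicated in the index. When we try to create the new column, an error is raised alerting us that there are duplicates. At least one index label in the employee DataFrame joins with two or more index labels from random_salary:**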

In [66]:
employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']

  employee['RANDOM_SALARY'] = random_salary['BASE_SALARY']


ValueError: cannot reindex on an axis with duplicate labels
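The error can be reproduced on a few made-up rows. The sketch below also shows one possible remedy, which is an assumption of this example rather than something the chapter prescribes: collapse the duplicated labels first (here by taking the maximum per label) so that the Series being assigned has a unique index.

```python
import pandas as pd

# Hypothetical data: both objects repeat the label "HPD"
employee = pd.DataFrame(
    {"BASE_SALARY": [100, 200, 300]},
    index=["HPD", "HPD", "Finance"],
)
random_salary = pd.Series([10, 20, 30], index=["HPD", "Finance", "HPD"])

# Assigning a Series whose index has duplicates cannot be aligned
try:
    employee["RANDOM_SALARY"] = random_salary
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# One possible fix: make the labels unique before assigning
unique_salary = random_salary.groupby(level=0).max()
employee["MAX_RANDOM"] = unique_salary
print(employee)
```

Duplicates in the receiving DataFrame's index are fine; it is the duplicated labels in the Series being assigned that make the alignment ambiguous.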

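During alignment, if there is nothing for a DataFrame index label to align with, the resulting value will be missing.

> When a Series is passed to the `pd.DataFrame.assign` method as the value of a new column, **its index is automatically aligned** with the index of the original DataFrame.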

In [67]:
(
    # assign aligns the Series passed as an argument on its index labels
    employee
    .set_index('DEPARTMENT')
    .assign(MAX_SALARY2=max_dept_sal['BASE_SALARY'].head(3))
    .MAX_SALARY2
    .value_counts(dropna=False)
)

NaN         1955
140416.0      29
100000.0      11
64251.0        5
Name: MAX_SALARY2, dtype: int64

In [96]:
max_sal = (
    employee.groupby("DEPARTMENT")
    .BASE_SALARY
    .transform('max')
)
employee.assign(MAX_DEPT_SAL=max_sal)

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SAL
0,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,121862.0
1,1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,107763.0
2,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,199596.0
3,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,210588.0
4,4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,89194.0
...,...,...,...,...,...,...,...,...,...,...,...
1995,1995,POLICE OFFICER,Houston Police Department-HPD,43443.0,White,Full Time,Male,Active,2014-06-09,2015-06-09,199596.0
1996,1996,COMMUNICATIONS CAPTAIN,Houston Fire Department (HFD),66523.0,Black or African American,Full Time,Male,Active,2003-09-02,2013-10-06,210588.0
1997,1997,POLICE OFFICER,Houston Police Department-HPD,43443.0,White,Full Time,Male,Active,2014-10-13,2015-10-13,199596.0
1998,1998,POLICE OFFICER,Houston Police Department-HPD,55461.0,Asian/Pacific Islander,Full Time,Male,Active,2009-01-20,2011-07-02,199596.0


In [99]:
max_sal = (
    employee.groupby("DEPARTMENT")
    .BASE_SALARY
    .max()
)

(
    employee.merge(max_sal.rename('MAX_DEPT_SAL'),
                   how='left', left_on='DEPARTMENT',
                   right_index=True)
)

Unnamed: 0,UNIQUE_ID,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,EMPLOYMENT_STATUS,HIRE_DATE,JOB_DATE,MAX_DEPT_SAL
0,0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,Active,2006-06-12,2012-10-13,121862.0
1,1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,Active,2000-07-19,2010-09-18,107763.0
2,2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,Active,2015-02-03,2015-02-03,199596.0
3,3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,Active,1982-02-08,1991-05-25,210588.0
4,4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,Active,1989-06-19,1994-10-22,89194.0
...,...,...,...,...,...,...,...,...,...,...,...
1995,1995,POLICE OFFICER,Houston Police Department-HPD,43443.0,White,Full Time,Male,Active,2014-06-09,2015-06-09,199596.0
1996,1996,COMMUNICATIONS CAPTAIN,Houston Fire Department (HFD),66523.0,Black or African American,Full Time,Male,Active,2003-09-02,2013-10-06,210588.0
1997,1997,POLICE OFFICER,Houston Police Department-HPD,43443.0,White,Full Time,Male,Active,2014-10-13,2015-10-13,199596.0
1998,1998,POLICE OFFICER,Houston Police Department-HPD,55461.0,Asian/Pacific Islander,Full Time,Male,Active,2009-01-20,2011-07-02,199596.0


## Highlighting the maximum value of each column

In [100]:
college = pd.read_csv("data/college.csv", index_col='INSTNM')
college.dtypes

CITY                   object
STABBR                 object
HBCU                  float64
MENONLY               float64
WOMENONLY             float64
RELAFFIL                int64
SATVRMID              float64
SATMTMID              float64
DISTANCEONLY          float64
UGDS                  float64
UGDS_WHITE            float64
UGDS_BLACK            float64
UGDS_HISP             float64
UGDS_ASIAN            float64
UGDS_AIAN             float64
UGDS_NHPI             float64
UGDS_2MOR             float64
UGDS_NRA              float64
UGDS_UNKN             float64
PPTUG_EF              float64
CURROPER                int64
PCTPELL               float64
PCTFLOAN              float64
UG25ABV               float64
MD_EARN_WNE_P10        object
GRAD_DEBT_MDN_SUPP     object
dtype: object

In [101]:
college.MD_EARN_WNE_P10.sample(10, random_state=42)

INSTNM
Career Point College                                            20700
Ner Israel Rabbinical College                       PrivacySuppressed
Reflections Academy of Beauty                                     NaN
Capital Area Technical College                                  26400
West Virginia University Institute of Technology                43400
Mid-State Technical College                                     32000
Strayer University-Huntsville Campus                            49200
National Aviation Academy of Tampa Bay                          45000
University of California-Santa Cruz                             43000
Lexington Theological Seminary                                    NaN
Name: MD_EARN_WNE_P10, dtype: object

In [103]:
college.GRAD_DEBT_MDN_SUPP.sample(10, random_state=42)

INSTNM
Career Point College                                            14977
Ner Israel Rabbinical College                       PrivacySuppressed
Reflections Academy of Beauty                       PrivacySuppressed
Capital Area Technical College                      PrivacySuppressed
West Virginia University Institute of Technology                23969
Mid-State Technical College                                      8025
Strayer University-Huntsville Campus                          36173.5
National Aviation Academy of Tampa Bay                          22778
University of California-Santa Cruz                             19884
Lexington Theological Seminary                      PrivacySuppressed
Name: GRAD_DEBT_MDN_SUPP, dtype: object

In [104]:
set(college.MD_EARN_WNE_P10.apply(type))

{float, str}

To force these columns to be numeric, use the pandas function `to_numeric`. With the `errors='coerce'` parameter, it converts values that cannot be parsed into NaN.

In [108]:
cols = ["MD_EARN_WNE_P10", "GRAD_DEBT_MDN_SUPP"]
for col in cols:
    college[col] = pd.to_numeric(
        college[col], errors='coerce'
    )
college.dtypes.loc[cols]

MD_EARN_WNE_P10       float64
GRAD_DEBT_MDN_SUPP    float64
dtype: object
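The coercion can be checked in isolation on a few values of the kind seen in these columns. The Series `raw` below is a hand-made sample, not read from `college.csv`.

```python
import numpy as np
import pandas as pd

# Numeric strings, a suppression marker, and an existing missing value
raw = pd.Series(["20700", "PrivacySuppressed", np.nan, "36173.5"])

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```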

In [109]:
college_n = college.select_dtypes('number')
college_n.head()

INSTNM,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600.0,33118.5


In [111]:
binary_only = college_n.nunique() == 2
binary_only.head()

HBCU          True
MENONLY       True
WOMENONLY     True
RELAFFIL      True
SATVRMID     False
dtype: bool

In [112]:
binary_cols = binary_only[binary_only].index
binary_cols

Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER'], dtype='object')

In [113]:
college_n2 = college_n.drop(columns=binary_cols)
college_n2.head()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
Alabama A & M University,424.0,420.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,0.7356,0.8284,0.1049,30300.0,33888.0
University of Alabama at Birmingham,570.0,565.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,0.346,0.5214,0.2422,39700.0,21941.5
Amridge University,,,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,0.6801,0.7795,0.854,40100.0,23370.0
University of Alabama in Huntsville,595.0,590.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,0.3072,0.4596,0.264,45500.0,24097.0
Alabama State University,425.0,430.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,0.7347,0.7554,0.127,26600.0,33118.5


In [114]:
max_cols = college_n2.idxmax()
max_cols

SATVRMID                             California Institute of Technology
SATMTMID                             California Institute of Technology
UGDS                                      University of Phoenix-Arizona
UGDS_WHITE                       Mr Leon's School of Hair Design-Moscow
UGDS_BLACK                           Velvatex College of Beauty Culture
UGDS_HISP                       Thunderbird School of Global Management
UGDS_ASIAN                          Cosmopolitan Beauty and Tech School
UGDS_AIAN                             Haskell Indian Nations University
UGDS_NHPI                                       Palau Community College
UGDS_2MOR                                                 LIU Brentwood
UGDS_NRA               California University of Management and Sciences
UGDS_UNKN             Le Cordon Bleu College of Culinary Arts-San Fr...
PPTUG_EF                        Thunderbird School of Global Management
PCTPELL                                        MTI Business Coll

In [115]:
unique_max_cols = max_cols.unique()
unique_max_cols[:5]

array(['California Institute of Technology',
       'University of Phoenix-Arizona',
       "Mr Leon's School of Hair Design-Moscow",
       'Velvatex College of Beauty Culture',
       'Thunderbird School of Global Management'], dtype=object)

In [117]:
college_n2.loc[unique_max_cols].style.highlight_max()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,


Refactor the code to make it easier to read:

In [118]:
def remove_binary_cols(df: pd.DataFrame):
    binary_only = df.nunique() == 2
    cols = binary_only[binary_only].index.tolist()
    return df.drop(columns=cols)

def select_rows_with_max_cols(df):
    max_cols = df.idxmax()
    unique = max_cols.unique()
    return df.loc[unique]

(
    college.assign(
        MD_EARN_WNE_P10=pd.to_numeric(
            college.MD_EARN_WNE_P10, errors='coerce'
        ),
        GRAD_DEST_MDN_SUPP=pd.to_numeric(
            college.GRAD_DEBT_MDN_SUPP, errors='coerce'
        )
    )
    .select_dtypes('number')
    .pipe(remove_binary_cols)
    .pipe(select_rows_with_max_cols)
)

Unnamed: 0_level_0,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP,GRAD_DEST_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
California Institute of Technology,765.0,785.0,983.0,0.2787,0.0153,0.1221,0.4385,0.001,0.0,0.057,0.0875,0.0,0.0,0.1126,0.2303,0.0082,77800.0,11812.5,11812.5
University of Phoenix-Arizona,,,151558.0,0.3098,0.1555,0.076,0.0082,0.0042,0.005,0.1131,0.0131,0.3152,0.0,0.6009,0.592,,,33000.0,33000.0
Mr Leon's School of Hair Design-Moscow,,,16.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.625,0.625,0.2,,15710.0,15710.0
Velvatex College of Beauty Culture,,,25.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.7692,0.0,0.52,,,
Thunderbird School of Global Management,,,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,118900.0,,
Cosmopolitan Beauty and Tech School,,,110.0,0.0091,0.0,0.0182,0.9727,0.0,0.0,0.0,0.0,0.0,0.3182,0.7761,0.1244,0.9545,,,
Haskell Indian Nations University,430.0,440.0,805.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0224,0.8396,0.0,0.2089,22800.0,,
Palau Community College,,,602.0,0.0,0.0017,0.0,0.0,0.0,0.9983,0.0,0.0,0.0,0.3887,0.856,0.0,0.2616,24700.0,,
LIU Brentwood,,,15.0,0.0,0.1333,0.2667,0.0,0.0,0.0,0.5333,0.0,0.0667,0.4,0.5652,0.7826,0.7826,44600.0,25499.0,25499.0
California University of Management and Sciences,,,98.0,0.0102,0.0204,0.0,0.0408,0.0,0.0,0.0,0.9286,0.0,0.0,0.0926,0.0556,0.6852,,,
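
To see what the two `.pipe` calls do without the full dataset, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its column names are assumptions for illustration only; the two helpers mirror the ones defined above.

```python
import pandas as pd

# Hypothetical toy data, not the college dataset.
toy = pd.DataFrame(
    {
        "flag": [0, 1, 0, 1],        # exactly two unique values, so it is dropped
        "score": [10, 40, 30, 20],   # maximum in row "b"
        "size": [5, 6, 9, 7],        # maximum in row "c"
    },
    index=["a", "b", "c", "d"],
)


def remove_binary_cols(df: pd.DataFrame) -> pd.DataFrame:
    binary_only = df.nunique() == 2
    cols = binary_only[binary_only].index.tolist()
    return df.drop(columns=cols)


def select_rows_with_max_cols(df: pd.DataFrame) -> pd.DataFrame:
    return df.loc[df.idxmax().unique()]


# .pipe(f) passes the frame to f, so the chain reads top to bottom.
piped = toy.pipe(remove_binary_cols).pipe(select_rows_with_max_cols)

# The equivalent nested call reads inside out.
nested = select_rows_with_max_cols(remove_binary_cols(toy))

print(piped)
```

Both spellings return the same frame; the piped form simply keeps each step on its own line.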


In [119]:
college_ugds = college.filter(like='UGDS_').head()

In [120]:
college_ugds.style.highlight_max(axis='columns')

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


## Replicating idxmax with method chaining

In [121]:
college_n = (
    college.assign(
        MD_EARN_WNE_P10=pd.to_numeric(
            college.MD_EARN_WNE_P10, errors='coerce'
        ),
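        # The label GRAD_DEST_MDN_SUPP differs from GRAD_DEBT_MDN_SUPP, so assign
        # appends a new column instead of overwriting the original one
        GRAD_DEST_MDN_SUPP=pd.to_numeric(
            college.GRAD_DEBT_MDN_SUPP, errors='coerce'
        )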
    )
    .select_dtypes('number')
    .pipe(remove_binary_cols)
)

In [122]:
college_n.max().head()

SATVRMID         765.0
SATMTMID         785.0
UGDS          151558.0
UGDS_WHITE         1.0
UGDS_BLACK         1.0
dtype: float64

By default, the `.eq` method aligns the columns of the DataFrame with the index labels of the passed Series:

In [123]:
college_n.eq(college_n.max()).head()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP,GRAD_DEST_MDN_SUPP
Alabama A & M University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
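
The column-to-label matching is easier to follow on a frame small enough to check by eye. The sketch below uses a made-up `toy` DataFrame (an assumption, not part of the dataset) and compares each column against its own maximum.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy = pd.DataFrame({"x": [1, 3, 2], "y": [9, 4, 9]}, index=["a", "b", "c"])

# max() returns a Series whose index holds the column labels: x -> 3, y -> 9.
col_max = toy.max()

# .eq matches each DataFrame column with the Series entry of the same label,
# so column "x" is compared with 3 and column "y" with 9.
is_max = toy.eq(col_max)

print(is_max)
```

Column `y` reaches its maximum twice, so it holds two `True` values.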


In [125]:
has_raw_max = (
    college_n
    .eq(college_n.max())
    .any(axis='columns')
)
has_raw_max.head()

INSTNM
Alabama A & M University               False
University of Alabama at Birmingham    False
Amridge University                     False
University of Alabama in Huntsville    False
Alabama State University               False
dtype: bool

In [126]:
college_n.shape

(7535, 19)

In [127]:
has_raw_max.sum()

401

This happens because many rows share the same maximum value, whereas `idxmax` returns only the first occurrence of each maximum.
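
A small sketch of the mismatch, on made-up data (the `toy` frame is an assumption): every row ties for some column maximum, yet `idxmax` reports a single label per column.

```python
import pandas as pd

# Hypothetical toy data with tied maxima in both columns.
toy = pd.DataFrame(
    {"p": [1.0, 1.0, 0.5], "q": [0.2, 0.9, 0.9]},
    index=["a", "b", "c"],
)

# Rows that hold the maximum of at least one column, ties included.
rows_with_any_max = toy.eq(toy.max()).any(axis="columns")

# idxmax keeps only the first label at which each column peaks.
first_only = toy.idxmax()

print(rows_with_any_max.sum(), first_only.unique())
```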

In [128]:
college_n.eq(college_n.max()).cumsum()

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP,GRAD_DEST_MDN_SUPP
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2,2
Rasmussen College - Overland Park,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2,2
National Personal Training Institute of Cleveland,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2,2
Bay Area Medical Academy - San Jose Satellite Location,1,1,1,109,28,136,1,2,1,1,1,1,44,66,55,12,1,2,2


If we chain the `.cumsum` method once more, the value 1 appears only once in each column, at the first occurrence of the maximum:

In [129]:
(
    college_n.eq(college_n.max())
    .cumsum()
    .cumsum()
)

INSTNM,SATVRMID,SATMTMID,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP,GRAD_DEST_MDN_SUPP
Alabama A & M University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama at Birmingham,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Amridge University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
University of Alabama in Huntsville,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Alabama State University,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,7305,7305,415,379750,73107,341103,983,11382,3316,5056,1076,7276,113649,172944,170273,36183,3445,10266,10266
Rasmussen College - Overland Park,7306,7306,416,379859,73135,341239,984,11384,3317,5057,1077,7277,113693,173010,170328,36195,3446,10268,10268
National Personal Training Institute of Cleveland,7307,7307,417,379968,73163,341375,985,11386,3318,5058,1078,7278,113737,173076,170383,36207,3447,10270,10270
Bay Area Medical Academy - San Jose Satellite Location,7308,7308,418,380077,73191,341511,986,11388,3319,5059,1079,7279,113781,173142,170438,36219,3448,10272,10272
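
The double `.cumsum` trick is hard to read off a table with thousands of rows, so here is a sketch on a single made-up boolean column (the `is_max` Series is an assumption standing in for one column of the `.eq` result).

```python
import pandas as pd

# Hypothetical flags: True wherever a value equals the column maximum.
is_max = pd.Series([False, True, False, True, True], index=list("abcde"))

# First cumsum: running count of maxima seen so far.
once = is_max.cumsum()    # 0, 1, 1, 2, 3

# Second cumsum: the running total first reaches 1 at the first maximum
# and is at least 2 on every later row.
twice = once.cumsum()     # 0, 1, 2, 4, 7

# Only the first maximum is left after comparing with 1.
first_hit = twice.eq(1)

print(first_hit[first_hit].index.tolist())
```

After one `.cumsum` the value 1 can repeat (rows `b` and `c` here), which is why a second pass is needed.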


In [130]:
has_raw_max2 = (
    college_n.eq(college_n.max())
    .cumsum()
    .cumsum()
    .eq(1)
    .any(axis='columns')
)

In [131]:
has_raw_max2.sum()

16

In [133]:
idxmax_cols = has_raw_max2[has_raw_max2].index
idxmax_cols

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')

In [134]:
set(college_n.idxmax().unique()) == set(idxmax_cols)

True

In [135]:
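def idx_max(df: pd.DataFrame):
    has_raw_max = (
        df
        .eq(df.max())         # True where a value equals its column maximum
        .cumsum()             # running count of maxima seen so far
        .cumsum()             # 1 appears only at the first maximum
        .eq(1)
        .any(axis='columns')  # rows holding the first maximum of any column
    )
    return has_raw_max[has_raw_max].index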

In [136]:
idx_max(college_n)

Index(['Thunderbird School of Global Management',
       'Southwest University of Visual Arts-Tucson', 'ABC Beauty College Inc',
       'Velvatex College of Beauty Culture',
       'California Institute of Technology',
       'Le Cordon Bleu College of Culinary Arts-San Francisco',
       'MTI Business College Inc', 'Dongguk University-Los Angeles',
       'Mr Leon's School of Hair Design-Moscow',
       'Haskell Indian Nations University', 'LIU Brentwood',
       'Medical College of Wisconsin', 'Palau Community College',
       'California University of Management and Sciences',
       'Cosmopolitan Beauty and Tech School', 'University of Phoenix-Arizona'],
      dtype='object', name='INSTNM')

In [147]:
def idx_max(df:pd.DataFrame):
    has_raw_max = (
        df
        .eq(df.max())
        .cumsum()
        .cumsum()
        .eq(1)
        .any(axis='columns')
        # A callable inside [] receives the Series; returning it keeps the True rows
        [lambda ser_: ser_]
        .index
    )
    return has_raw_max
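
The second version folds the boolean filter into the chain by passing a callable to `[]`. A minimal sketch of that indexing form, on a made-up Series (`flags` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical boolean Series.
flags = pd.Series([True, False, True], index=["a", "b", "c"])

# Two-step form: the Series has to be named before it can filter itself.
two_step = flags[flags].index

# Chained form: the callable is called with the Series, and its return
# value is then used as the boolean mask.
chained = flags[lambda ser_: ser_].index

print(chained.tolist())
```

The callable form matters inside a method chain, where the intermediate Series has no name to refer to.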

In [146]:
%timeit college_n.idxmax()

1.07 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [148]:
%timeit idx_max(college_n)

2.36 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Finding the most common maximum of columns

In [156]:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds = college.filter(like='UGDS_')

highest_percentage_race = college_ugds.idxmax(axis='columns')
highest_percentage_race.head()

INSTNM
Alabama A & M University               UGDS_BLACK
University of Alabama at Birmingham    UGDS_WHITE
Amridge University                     UGDS_BLACK
University of Alabama in Huntsville    UGDS_WHITE
Alabama State University               UGDS_BLACK
dtype: object

In [157]:
highest_percentage_race.value_counts(normalize=True)

UGDS_WHITE    0.670352
UGDS_BLACK    0.151586
UGDS_HISP     0.129473
UGDS_UNKN     0.023422
UGDS_ASIAN    0.012074
UGDS_AIAN     0.006110
UGDS_NRA      0.004073
UGDS_NHPI     0.001746
UGDS_2MOR     0.001164
dtype: float64
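
The same two steps, sketched on a made-up frame whose columns `A`, `B`, `C` are hypothetical: `idxmax(axis='columns')` returns the label of the largest column in each row, and `value_counts(normalize=True)` turns those labels into shares.

```python
import pandas as pd

# Hypothetical toy data: each row holds three proportions.
toy = pd.DataFrame(
    {
        "A": [0.6, 0.2, 0.5, 0.1],
        "B": [0.3, 0.7, 0.4, 0.2],
        "C": [0.1, 0.1, 0.1, 0.7],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Label of the largest column per row: s1 -> A, s2 -> B, s3 -> A, s4 -> C.
top = toy.idxmax(axis="columns")

# Share of rows in which each column is the largest.
share = top.value_counts(normalize=True)

print(share)
```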

In [158]:
(
    college_ugds
    [highest_percentage_race == 'UGDS_BLACK']
    .drop(columns='UGDS_BLACK')
    .idxmax(axis='columns')
    .value_counts(normalize=True)
)

UGDS_WHITE    0.661228
UGDS_HISP     0.230326
UGDS_UNKN     0.071977
UGDS_NRA      0.018234
UGDS_ASIAN    0.009597
UGDS_2MOR     0.006718
UGDS_AIAN     0.000960
UGDS_NHPI     0.000960
dtype: float64
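
The chain in the last cell asks which column comes second among the rows where one particular column is the largest. A sketch of the same pattern on made-up data (the frame and the labels `A`, `B`, `C` are assumptions):

```python
import pandas as pd

# Hypothetical toy data: column "B" is the largest in rows r2, r3 and r4.
toy = pd.DataFrame(
    {
        "A": [0.6, 0.2, 0.1, 0.3],
        "B": [0.3, 0.7, 0.6, 0.5],
        "C": [0.1, 0.1, 0.3, 0.2],
    },
    index=["r1", "r2", "r3", "r4"],
)

top = toy.idxmax(axis="columns")

runner_up = (
    toy[top == "B"]            # keep only the rows where "B" is the largest
    .drop(columns="B")         # remove "B" so the next idxmax finds the runner-up
    .idxmax(axis="columns")
)

print(runner_up.value_counts(normalize=True))
```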