In [1]:
print("""
@File         : ch02_essential_dataframe_operations.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2024-07-21 12:49:36
@Email        : cuixuanstephen@gmail.com
@Description  : Essential DataFrame operations
""")


@File         : ch02_essential_dataframe_operations.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2024-07-21 12:49:36
@Email        : cuixuanstephen@gmail.com
@Description  : Essential DataFrame operations



In [2]:
%cd ../

d:\Data-Analysis-and-Science\P1XC2E


In [3]:
import pandas as pd
import numpy as np

movies = pd.read_csv('data/movie.csv')

## Selecting multiple DataFrame columns

In [4]:
movie_actor_director = movies[
    [
        'actor_1_name',
        'actor_2_name',
        'actor_3_name',
        'director_name'
        ]
]

In [5]:
movie_actor_director.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


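Sometimes you need to select a single column of a DataFrame. The indexing operation can return either a Series or a DataFrame. **If we pass in a list containing a single item, we get back a DataFrame. If we pass in just a string with the column name, we get back a Series.**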

In [6]:
type(movies[['director_name']])

pandas.core.frame.DataFrame

In [7]:
type(movies['director_name'])

pandas.core.series.Series

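We can also use `.loc` to pull out a column by name. Because this index operation requires a row selector first, we use a colon (`:`) to indicate a slice that selects all rows. This can also return either a DataFrame or a Series: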

In [8]:
type(movies.loc[:, ['director_name']])

pandas.core.frame.DataFrame

In [9]:
type(movies.loc[:, 'director_name'])

pandas.core.series.Series

Passing a long list inside the indexing operator can hurt readability. To help with this, you can first save all the column names to a list variable.

In [10]:
cols = [
    'actor_1_name',
    'actor_2_name',
    'actor_3_name',
    'director_name',
]
movie_actor_director = movies[cols]

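One of the most common exceptions raised when working with pandas is `KeyError`. It is mainly caused by mistyped column or index names. The same error is raised whenever you try to select multiple columns without using a list: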

In [11]:
movies['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
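A minimal sketch of the fix, using a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not part of the movie dataset). A bare comma-separated sequence inside the brackets is a tuple, which pandas looks up as a single key; wrapping the names in a list selects several columns:

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    "actor_1_name": ["A", "B"],
    "director_name": ["X", "Y"],
    "gross": [1.0, 2.0],
})

# Without a list, the names form a tuple that is treated as ONE column key
try:
    toy["actor_1_name", "director_name"]
    raised = False
except KeyError:
    raised = True

# With a list, each name is looked up separately and a DataFrame comes back
subset = toy[["actor_1_name", "director_name"]]
print(raised, list(subset.columns))
```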

## Selecting columns with methods

Although column selection is usually done with the indexing operator, some DataFrame methods facilitate it in other ways. The `.select_dtypes` and `.filter` methods are two useful ones.

In [12]:
def shorten(col):
    return (
        str(col)
        .replace('facebook_likes', 'fb')
        .replace('_for_reviews', '')
    )
# Shorten some column names
movies = movies.rename(columns=shorten)

In [13]:
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

In [14]:
movies.select_dtypes(include='int').head()

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0


If you want to select all numeric columns, pass the string `'number'` to the `include` parameter:

In [15]:
movies.select_dtypes(include='number').head()

Unnamed: 0,num_critic,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_user,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


If we want both integer and string columns, we can do the following:

In [16]:
movies.select_dtypes(include=['int', 'object']).head()

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
3,Color,Christopher Nolan,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,English,USA,PG-13,164000
4,,Doug Walker,Rob Walker,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,0


To exclude only the floating-point columns, do the following:

In [17]:
movies.select_dtypes(exclude='float').head()

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
3,Color,Christopher Nolan,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,English,USA,PG-13,164000
4,,Doug Walker,Rob Walker,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,0


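Another way to select columns is the `.filter` method. It is flexible and searches column names (or index labels) according to the parameter used. Here, we use the `like` parameter to find all the Facebook columns, that is, the names that contain the exact string `fb`. The `like` parameter checks for a substring in the column names: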

In [18]:
movies.filter(like='fb').head()

Unnamed: 0,director_fb,actor_3_fb,actor_1_fb,cast_total_fb,actor_2_fb,movie_fb
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0


In [19]:
cols

['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']

With the `items` parameter, you can pass in a list of column names:

In [20]:
movies.filter(items=cols).head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


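The `.filter` method also allows columns to be searched with regular expressions through the `regex` parameter. Here, we search for all columns whose name contains a digit: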

In [21]:
movies.filter(regex=r'\d').head()

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
3,23000.0,Christian Bale,27000.0,Tom Hardy,Joseph Gordon-Levitt,23000.0
4,,Rob Walker,131.0,Doug Walker,,12.0


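The `.filter` method selects columns by inspecting only the column names, not the actual data values. It has three mutually exclusive parameters, `items`, `like`, and `regex`, only one of which can be used at a time.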
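The three parameters can be compared side by side on a tiny frame. This is a sketch on assumed toy data (the `toy` frame and its column values are made up for illustration); it also shows that combining two of the parameters raises a `TypeError`:

```python
import pandas as pd

# Hypothetical toy data; only the column names matter to .filter
toy = pd.DataFrame({"actor_1_fb": [1], "actor_1_name": ["A"], "gross": [2.0]})

by_like = toy.filter(like="fb")            # substring match on column names
by_regex = toy.filter(regex=r"_name$")     # regular-expression match
by_items = toy.filter(items=["gross", "missing"])  # absent names are skipped

# Passing more than one of items/like/regex is rejected
try:
    toy.filter(like="fb", regex=r"\d")
    mutually_exclusive = False
except TypeError:
    mutually_exclusive = True

print(list(by_like.columns), list(by_regex.columns), list(by_items.columns))
```

Note that `items` silently ignores names that are not present, unlike the indexing operator, which raises `KeyError`.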

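One confusing aspect of `.select_dtypes` is its flexibility in accepting both strings and Python objects. pandas has no single standard or preferred way to refer to data types, so it helps to know both forms. Note that several of the NumPy aliases listed here (np.float_, np.int0, and np.object) have been removed from recent NumPy releases, and np.float128 is only available on some platforms:

- np.number, 'number' – selects both integers and floats, regardless of size
- np.float64, np.float_, float, 'float64', 'float_', 'float' – selects only 64-bit floats
- np.float16, np.float32, np.float128, 'float16', 'float32', 'float128' – select exactly 16-, 32-, and 128-bit floats, respectively
- np.floating, 'floating' – selects all floats, regardless of size
- np.int0, np.int64, np.int_, int, 'int0', 'int64', 'int_', 'int' – selects only 64-bit integers
- np.int8, np.int16, np.int32, 'int8', 'int16', 'int32' – select exactly 8-, 16-, and 32-bit integers, respectively
- np.integer, 'integer' – selects all integers, regardless of size
- 'Int64' – selects nullable integers; no NumPy equivalent
- np.object, 'object', 'O' – selects all object data types
- np.datetime64, 'datetime64', 'datetime' – all datetimes are 64 bits
- np.timedelta64, 'timedelta64', 'timedelta' – all timedeltas are 64 bits
- pd.Categorical, 'category' – unique to pandas; no NumPy equivalent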
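A few of the specifiers above can be checked on a small frame with mixed widths. This is only a sketch: the `toy` frame and the `selected` helper are assumptions made for illustration, and it deliberately sticks to specifiers that still exist in current NumPy releases:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column per dtype of interest
toy = pd.DataFrame({
    "i8": np.array([1, 2], dtype="int8"),
    "i64": np.array([1, 2], dtype="int64"),
    "f32": np.array([1.5, 2.5], dtype="float32"),
    "f64": np.array([1.5, 2.5], dtype="float64"),
    "txt": pd.Series(["a", "b"], dtype=object),
})

def selected(spec):
    """Return the column names that select_dtypes keeps for a specifier."""
    return list(toy.select_dtypes(include=spec).columns)

print(selected(np.number))    # abstract type: every integer and float width
print(selected("number"))     # string form of the same specifier
print(selected(np.floating))  # every float width
print(selected("float64"))    # exactly 64-bit floats
print(selected(np.integer))   # every integer width
print(selected("int64"))      # exactly 64-bit integers
```

The abstract NumPy types (`np.number`, `np.floating`, `np.integer`) match by class hierarchy, while the sized names match one exact width.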

## Ordering column names

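There is no standardized set of rules that dictates how columns should be organized within a dataset. However, it is good practice to develop a set of guidelines that you follow consistently. This is especially important if you work with a group of analysts who share many datasets.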

In [22]:
movies = movies.rename(columns=shorten)

In [23]:
movies.columns

Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

In [24]:
cat_core = ['movie_title', 'title_year', 'content_rating', 'genres']

In [25]:
cat_people = ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']

In [26]:
cat_other = ['color', 'country', 'language', 'plot_keywords', 'movie_imdb_link']

In [27]:
cont_fb = [
    "director_fb",
    "actor_1_fb",
    "actor_2_fb",
    "actor_3_fb",
    "cast_total_fb",
    "movie_fb",
    ]
cont_finance = ["budget", "gross"]
cont_num_reviews = [
    "num_voted_users",
    "num_user",
    "num_critic",
    ]
cont_other = [
    "imdb_score",
    "duration",
    "aspect_ratio",
    "facenumber_in_poster",
    ]

In [28]:
new_col_order = (
    cat_core
    + cat_people
    + cat_other
    + cont_fb
    + cont_finance
    + cont_num_reviews
    + cont_other
)

In [30]:
set(movies.columns) == set(new_col_order)
# Ordering columns by hand is prone to human error: it is easy to leave a column out of the new list.

True
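The set comparison above confirms that no column was forgotten, but a set cannot reveal a name that was listed twice. A slightly more talkative check might look like the sketch below; `check_new_order` is a hypothetical helper, and the column names are made up for illustration:

```python
def check_new_order(columns, new_order):
    """Report columns left out of, unknown to, or repeated in new_order."""
    missing = sorted(set(columns) - set(new_order))   # forgotten columns
    unknown = sorted(set(new_order) - set(columns))   # typos or stale names
    repeated = sorted({c for c in new_order if new_order.count(c) > 1})
    return missing, unknown, repeated

# Illustrative names only
existing = ["title", "year", "gross", "budget"]
proposed = ["title", "gross", "gross", "rating"]

missing, unknown, repeated = check_new_order(existing, proposed)
print(missing, unknown, repeated)
```

All three lists should be empty before the new order is applied with `movies[new_col_order]`.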

In [31]:
movies[new_col_order].head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,...,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,...,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,...,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,...,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,...,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,...,0,,,8,,,7.1,,,0.0


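Hadley Wickham's seminal paper on tidy data suggests placing the fixed variables first, followed by the measured variables. Because this data does not come from a controlled experiment, there is some flexibility in deciding which variables are fixed and which are measured. Good candidates for measured variables are the ones we would like to predict.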

## Summarizing a DataFrame

In [32]:
movies.shape, movies.size, movies.ndim, len(movies)

((4916, 28), 137648, 2, 4916)

In [33]:
movies.count()

color                   4897
director_name           4814
num_critic              4867
duration                4901
director_fb             4814
actor_3_fb              4893
actor_2_name            4903
actor_1_fb              4909
gross                   4054
genres                  4916
actor_1_name            4909
movie_title             4916
num_voted_users         4916
cast_total_fb           4916
actor_3_name            4893
facenumber_in_poster    4903
plot_keywords           4764
movie_imdb_link         4916
num_user                4895
language                4904
country                 4911
content_rating          4616
budget                  4432
title_year              4810
actor_2_fb              4903
imdb_score              4916
aspect_ratio            4590
movie_fb                4916
dtype: int64

In [36]:
movies.min(numeric_only=True)

num_critic                 1.00
duration                   7.00
director_fb                0.00
actor_3_fb                 0.00
actor_1_fb                 0.00
gross                    162.00
num_voted_users            5.00
cast_total_fb              0.00
facenumber_in_poster       0.00
num_user                   1.00
budget                   218.00
title_year              1916.00
actor_2_fb                 0.00
imdb_score                 1.60
aspect_ratio               1.18
movie_fb                   0.00
dtype: float64

In [37]:
movies.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_critic,4867.0,137.9889,120.2394,1.0,49.0,108.0,191.0,813.0
duration,4901.0,107.0908,25.28602,7.0,93.0,103.0,118.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,7.0,48.0,189.75,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,132.0,366.0,633.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,607.0,982.0,11000.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,5019656.25,25043962.0,61108412.75,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,8361.75,33132.5,93772.75,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,1394.75,3049.0,13616.75,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,1.0,2.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,64.0,153.0,320.5,5060.0


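Note that the numeric columns have missing values, yet `.describe` still returns results. By default, pandas handles missing values in numeric columns by skipping them. For aggregation methods such as `.min`, `.max`, and `.mean`, this behavior can be changed by setting the `skipna` parameter to `False`, in which case pandas returns NaN for every column that has at least one missing value.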
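The effect of `skipna` is easiest to see on a frame where only one column has a gap. The following is a sketch on assumed toy data (the `toy` frame is not part of the movie dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one complete column, one with a missing value
toy = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0],
    "gappy": [1.0, np.nan, 3.0],
})

default_min = toy.min()             # missing values are skipped
strict_min = toy.min(skipna=False)  # any missing value yields NaN

print(default_min)
print(strict_min)
```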

In [38]:
movies.describe(percentiles=[.01, .3, .99]).T

Unnamed: 0,count,mean,std,min,1%,30%,50%,99%,max
num_critic,4867.0,137.9889,120.2394,1.0,2.0,60.0,108.0,546.68,813.0
duration,4901.0,107.0908,25.28602,7.0,43.0,95.0,103.0,189.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,0.0,11.0,48.0,16000.0,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,0.0,176.0,366.0,11000.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,6.08,694.0,982.0,44920.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,8474.8,7914068.6,25043962.0,326412800.0,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,53.0,11864.5,33132.5,681584.6,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,6.0,1684.5,3049.0,62413.9,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,0.0,1.0,8.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,1.94,80.0,153.0,1999.24,5060.0


## Chaining DataFrame methods

One of the keys to method chaining is knowing the exact object returned at each step of the chain.

In [39]:
movies.isna().head()

Unnamed: 0,color,director_name,num_critic,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_user,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,...,True,True,True,True,True,True,False,False,True,False


In [40]:
movies.isna().sum()

color                    19
director_name           102
num_critic               49
duration                 15
director_fb             102
actor_3_fb               23
actor_2_name             13
actor_1_fb                7
gross                   862
genres                    0
actor_1_name              7
movie_title               0
num_voted_users           0
cast_total_fb             0
actor_3_name             23
facenumber_in_poster     13
plot_keywords           152
movie_imdb_link           0
num_user                 21
language                 12
country                   5
content_rating          300
budget                  484
title_year              106
actor_2_fb               13
imdb_score                0
aspect_ratio            326
movie_fb                  0
dtype: int64

In [41]:
movies.isna().sum().sum()

2654

In [46]:
movies.isna().any().any()

True
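To make the object returned at each link explicit, the chain can be broken into named steps. This sketch uses a made-up two-column frame (`toy` and the `step*` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with three missing values in total
toy = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

step1 = toy.isna()   # DataFrame of booleans, same shape as toy
step2 = step1.sum()  # Series: number of missing values per column
step3 = step2.sum()  # scalar: total number of missing values

print(type(step1).__name__, type(step2).__name__, int(step3))
```

Each reduction drops one dimension, which is why two calls to `.sum` (or `.any`) are needed to get from a DataFrame to a single value.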

In [47]:
movies[['color', 'movie_title']].max()


movie_title    Æon Flux
dtype: object

## DataFrame operations

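When an arithmetic or comparison operator is used with a DataFrame, the operation is applied to every value of every column. Typically, when an operator is used with a DataFrame, the columns are either all numeric or all object (usually strings). If the DataFrame does not contain homogeneous data, the operation is likely to fail.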

In [57]:
colleges = pd.read_csv('data/college.csv', index_col='INSTNM')

In [58]:
try:
    colleges + 5
except TypeError:
    print('The operation may fail if the DataFrame does not contain homogeneous data.')

The operation may fail if the DataFrame does not contain homogeneous data.


In [59]:
college_ugds = colleges.filter(like='UGDS_')
college_ugds.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


In [60]:
name = "Northwest-Shoals Community College"

In [62]:
college_ugds.loc[name]

UGDS_WHITE    0.7912
UGDS_BLACK    0.1250
UGDS_HISP     0.0339
UGDS_ASIAN    0.0036
UGDS_AIAN     0.0088
UGDS_NHPI     0.0006
UGDS_2MOR     0.0012
UGDS_NRA      0.0033
UGDS_UNKN     0.0324
Name: Northwest-Shoals Community College, dtype: float64

In [63]:
college_ugds.loc[name].round(2)

UGDS_WHITE    0.79
UGDS_BLACK    0.12
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

In [64]:
(college_ugds.loc[name] + 1e-4).round(2)

UGDS_WHITE    0.79
UGDS_BLACK    0.13
UGDS_HISP     0.03
UGDS_ASIAN    0.00
UGDS_AIAN     0.01
UGDS_NHPI     0.00
UGDS_2MOR     0.00
UGDS_NRA      0.00
UGDS_UNKN     0.03
Name: Northwest-Shoals Community College, dtype: float64

In [65]:
(college_ugds + 0.00501)//0.01

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,3.0,94.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
University of Alabama at Birmingham,59.0,26.0,3.0,5.0,0.0,0.0,4.0,2.0,1.0
Amridge University,30.0,42.0,1.0,0.0,0.0,0.0,0.0,0.0,27.0
University of Alabama in Huntsville,70.0,13.0,4.0,4.0,1.0,0.0,2.0,3.0,4.0
Alabama State University,2.0,92.0,1.0,0.0,0.0,0.0,1.0,2.0,1.0
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,


In [66]:
college_ugds_round = (college_ugds + 1e-5).round(2)
college_ugds_round

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.03,0.94,0.01,0.00,0.00,0.0,0.00,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.00,0.0,0.04,0.02,0.01
Amridge University,0.30,0.42,0.01,0.00,0.00,0.0,0.00,0.00,0.27
University of Alabama in Huntsville,0.70,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.00,0.00,0.0,0.01,0.02,0.01
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,


In [67]:
college_ugds_op_round = (college_ugds + 0.00501)//0.01/100

In [68]:
college_ugds_op_round.equals(college_ugds_round)

True

In [69]:
colleges2 = (college_ugds.add(0.00501).floordiv(0.01).div(100))
colleges2.equals(college_ugds_op_round)

True
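The operator form and the method form perform the same arithmetic, as this sketch on assumed toy values suggests (the `toy` frame and the constants are made up; they are not the rounding constants used above):

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow
toy = pd.DataFrame({"x": [1.0, 4.0], "y": [7.0, 10.0]})

with_operators = (toy + 1) // 2 / 10
with_methods = toy.add(1).floordiv(2).div(10)

print(with_operators.equals(with_methods))
```

The method form reads left to right without extra parentheses, and methods such as `.add` also accept keyword arguments like `fill_value` that the operators cannot express.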

## Comparing missing values

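pandas uses the NumPy NaN (`np.nan`) object to represent a missing value. It is an unusual object with interesting mathematical properties. For instance, it is not equal to itself. By contrast, even Python's `None` object evaluates to `True` when compared with itself. Every other comparison against `np.nan` also returns `False`, except not equal to (`!=`).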

In [70]:
np.nan == np.nan

False

In [71]:
None == None

True

In [72]:
np.nan > 5

False

In [73]:
np.nan != 5

True

Series and DataFrames use the equality operator, `==`, to make element-by-element comparisons. The result is an object with the same dimensions.

In [74]:
college_ugds == 0.0019

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,False,False,False,True,False,True,False,False,False
University of Alabama at Birmingham,False,False,False,False,False,False,False,False,False
Amridge University,False,False,False,False,False,False,False,False,False
University of Alabama in Huntsville,False,False,False,False,False,False,False,False,False
Alabama State University,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,False,False,False,False,False,False,False,False,False
Rasmussen College - Overland Park,False,False,False,False,False,False,False,False,False
National Personal Training Institute of Cleveland,False,False,False,False,False,False,False,False,False
Bay Area Medical Academy - San Jose Satellite Location,False,False,False,False,False,False,False,False,False


In [75]:
college_self_compare = college_ugds == college_ugds
college_self_compare.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,True,True,True,True,True,True,True,True,True
University of Alabama at Birmingham,True,True,True,True,True,True,True,True,True
Amridge University,True,True,True,True,True,True,True,True,True
University of Alabama in Huntsville,True,True,True,True,True,True,True,True,True
Alabama State University,True,True,True,True,True,True,True,True,True


In [76]:
college_self_compare.all()

UGDS_WHITE    False
UGDS_BLACK    False
UGDS_HISP     False
UGDS_ASIAN    False
UGDS_AIAN     False
UGDS_NHPI     False
UGDS_2MOR     False
UGDS_NRA      False
UGDS_UNKN     False
dtype: bool

**This happens because missing values do not compare equal to one another.**

In [77]:
(college_ugds == np.nan).sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

Do not use `==` to find missing values; use the `.isna` method instead.
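The difference between the two approaches shows up clearly on a small frame. This is a sketch on assumed data (the `toy` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in "a", two in "b"
toy = pd.DataFrame({"a": [0.1, np.nan], "b": [np.nan, np.nan]})

found_with_eq = (toy == np.nan).sum()  # NaN never equals NaN, so nothing is found
found_with_isna = toy.isna().sum()     # counts the missing values per column

print(found_with_eq.tolist(), found_with_isna.tolist())
```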

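The correct way to compare two entire DataFrames is not the equality operator (`==`) but the `.equals` method. This method treats NaNs in the same location as equal (note that the `.eq` method is the equivalent of `==`):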

In [78]:
college_ugds.equals(college_ugds)

True
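For contrast, here is how `.eq` and `.equals` behave on the same frame when it contains a missing value. The `toy` frame is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with one missing value
toy = pd.DataFrame({"a": [0.1, np.nan]})

elementwise = toy.eq(toy)      # same as toy == toy; NaN compares unequal
whole_frame = toy.equals(toy)  # one bool; NaNs in matching positions count as equal

print(elementwise["a"].tolist(), whole_frame)
```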

The pandas.testing subpackage has a function that developers can use when writing unit tests. The `assert_frame_equal` function raises an `AssertionError` if two DataFrames are not equal, and returns `None` if they are:

In [79]:
from pandas.testing import assert_frame_equal

In [81]:
assert_frame_equal(college_ugds, college_ugds) is None

True
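The failing case can be sketched in the same way. The two frames below are made-up examples, and the `differs` and `same` names are only illustrative:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frames that differ in one cell
left = pd.DataFrame({"a": [1.0, np.nan]})
right = pd.DataFrame({"a": [1.0, 2.0]})

try:
    assert_frame_equal(left, right)
    differs = False
except AssertionError:
    differs = True  # the message describes where the frames disagree

# Equal frames (NaNs in the same positions included) pass silently
same = assert_frame_equal(left, left.copy()) is None

print(differs, same)
```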

## Transposing the direction of a DataFrame operation

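Many DataFrame methods have an `axis` parameter, which controls the direction in which the operation takes place. The `axis` parameter can be `'index'` (or `0`) or `'columns'` (or `1`). I prefer the string versions because they are more explicit and tend to make the code easier to read.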

In [82]:
college_ugds.count()

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [83]:
college_ugds.count(axis=0)

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [84]:
college_ugds.count(axis='index')

UGDS_WHITE    6874
UGDS_BLACK    6874
UGDS_HISP     6874
UGDS_ASIAN    6874
UGDS_AIAN     6874
UGDS_NHPI     6874
UGDS_2MOR     6874
UGDS_NRA      6874
UGDS_UNKN     6874
dtype: int64

In [85]:
college_ugds.count(axis='columns').head()

INSTNM
Alabama A & M University               9
University of Alabama at Birmingham    9
Amridge University                     9
University of Alabama in Huntsville    9
Alabama State University               9
dtype: int64

In [86]:
college_ugds.sum(axis='columns').head()

INSTNM
Alabama A & M University               1.0000
University of Alabama at Birmingham    0.9999
Amridge University                     1.0000
University of Alabama in Huntsville    1.0000
Alabama State University               1.0000
dtype: float64

In [87]:
college_ugds.median(axis='columns')

INSTNM
Alabama A & M University                                  0.0055
University of Alabama at Birmingham                       0.0283
Amridge University                                        0.0034
University of Alabama in Huntsville                       0.0350
Alabama State University                                  0.0121
                                                           ...  
SAE Institute of Technology  San Francisco                   NaN
Rasmussen College - Overland Park                            NaN
National Personal Training Institute of Cleveland            NaN
Bay Area Medical Academy - San Jose Satellite Location       NaN
Excel Learning Center-San Antonio South                      NaN
Length: 7535, dtype: float64

In [90]:
college_ugds_cumsum = college_ugds.cumsum(axis=1)
college_ugds_cumsum.head()

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,0.0333,0.9686,0.9741,0.976,0.9784,0.9803,0.9803,0.9862,1.0
University of Alabama at Birmingham,0.5922,0.8522,0.8805,0.9323,0.9345,0.9352,0.972,0.9899,0.9999
Amridge University,0.299,0.7182,0.7251,0.7285,0.7285,0.7285,0.7285,0.7285,1.0
University of Alabama in Huntsville,0.6988,0.8243,0.8625,0.9001,0.9144,0.9146,0.9318,0.965,1.0
Alabama State University,0.0158,0.9366,0.9487,0.9506,0.9516,0.9522,0.962,0.9863,1.0


## Case study: determining college campus diversity

In [93]:
pd.read_csv('data/college_diversity.csv', index_col='School')

School,Diversity Index
"Rutgers University--Newark Newark, NJ",0.76
"Andrews University Berrien Springs, MI",0.74
"Stanford University Stanford, CA",0.74
"University of Houston Houston, TX",0.74
"University of Nevada--Las Vegas Las Vegas, NV",0.74
"University of San Francisco San Francisco, CA",0.74
"San Francisco State University San Francisco, CA",0.73
"University of Illinois--Chicago Chicago, IL",0.73
"New Jersey Institute of Technology Newark, NJ",0.72
"Texas Woman's University Denton, TX",0.72


In [94]:
(
    college_ugds.isna().sum(axis='columns')
    .sort_values(ascending=False)
    .head()
)

INSTNM
Excel Learning Center-San Antonio South              9
Western State College of Law at Argosy University    9
Albany Law School                                    9
Albany Medical College                               9
A T Still University of Health Sciences              9
dtype: int64

In [95]:
college_ugds = college_ugds.dropna(how='all')
college_ugds.isna().sum()

UGDS_WHITE    0
UGDS_BLACK    0
UGDS_HISP     0
UGDS_ASIAN    0
UGDS_AIAN     0
UGDS_NHPI     0
UGDS_2MOR     0
UGDS_NRA      0
UGDS_UNKN     0
dtype: int64

In [96]:
college_ugds.ge(0.15)

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Alabama A & M University,False,True,False,False,False,False,False,False,False
University of Alabama at Birmingham,True,True,False,False,False,False,False,False,False
Amridge University,True,True,False,False,False,False,False,False,True
University of Alabama in Huntsville,True,False,False,False,False,False,False,False,False
Alabama State University,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
Hollywood Institute of Beauty Careers-West Palm Beach,True,True,True,False,False,False,False,False,False
Hollywood Institute of Beauty Careers-Casselberry,False,True,True,False,False,False,False,False,False
Coachella Valley Beauty College-Beaumont,True,False,True,False,False,False,False,False,False
Dewey University-Mayaguez,False,False,True,False,False,False,False,False,False


In [97]:
diversity_metric = college_ugds.ge(.15).sum(axis='columns')

In [98]:
diversity_metric.head()

INSTNM
Alabama A & M University               1
University of Alabama at Birmingham    2
Amridge University                     3
University of Alabama in Huntsville    1
Alabama State University               1
dtype: int64
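The metric above counts, for each school, how many categories reach a 15% share. Wrapped in a function, the idea might look like the following sketch; `count_large_groups`, the `toy` frame, and its three made-up groups are assumptions for illustration, not part of the college dataset:

```python
import pandas as pd

def count_large_groups(shares, threshold=0.15):
    """Count, per row, the columns whose share is at least `threshold`."""
    return shares.ge(threshold).sum(axis="columns")

# Hypothetical shares for three schools; each row sums to 1
toy = pd.DataFrame(
    {
        "g1": [0.90, 0.40, 0.20],
        "g2": [0.05, 0.35, 0.20],
        "g3": [0.05, 0.25, 0.60],
    },
    index=["School A", "School B", "School C"],
)

metric = count_large_groups(toy)
print(metric)
```

Because the count ignores how the shares are distributed above the threshold, School B and School C receive the same score here even though their compositions differ.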

In [99]:
diversity_metric.value_counts()

1    3042
2    2884
3     876
4      63
0       7
5       2
dtype: int64

In [100]:
diversity_metric.sort_values(ascending=False)

INSTNM
Central Texas Beauty College-Temple                               5
Regency Beauty Institute-Austin                                   5
Westwood College-O'Hare Airport                                   4
Regency Beauty Institute-Pasadena                                 4
Soma Institute-The National School of Clinical Massage Therapy    4
                                                                 ..
Professional Business College                                     0
Education and Technology Institute                                0
Taft University System                                            0
Prince Institute-Rocky Mountains                                  0
Spanish-American Institute                                        0
Length: 6874, dtype: int64

In [101]:
college_ugds.loc[['Central Texas Beauty College-Temple', 'Regency Beauty Institute-Austin']]

INSTNM,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
Central Texas Beauty College-Temple,0.1616,0.2323,0.2626,0.0202,0.0,0.0,0.1717,0.0,0.1515
Regency Beauty Institute-Austin,0.1867,0.2133,0.16,0.0,0.0,0.0,0.1733,0.0,0.2667


In [102]:
us_news_top = [
    "Rutgers University-Newark",
    "Andrews University",
    "Stanford University",
    "University of Houston",
    "University of Nevada-Las Vegas",
]

In [103]:
diversity_metric.loc[us_news_top]

INSTNM
Rutgers University-Newark         4
Andrews University                3
Stanford University               3
University of Houston             3
University of Nevada-Las Vegas    3
dtype: int64

In [104]:
(
    college_ugds.max(axis='columns').sort_values(ascending=False).head(10)
)

INSTNM
Caribbean University-Ponce                                        1.0
Brighton Institute of Cosmetology                                 1.0
Mesivta Torah Vodaath Rabbinical Seminary                         1.0
Rabbinical College Telshe                                         1.0
University of Puerto Rico-Mayaguez                                1.0
Haskell Indian Nations University                                 1.0
Lake Career and Technical Center                                  1.0
Leon Studio One School of Hair Design & Career Training Center    1.0
Dewey University-Hato Rey                                         1.0
Columbia Central University-Caguas                                1.0
dtype: float64

Do any schools have more than 1% in all nine race categories?

In [105]:
(college_ugds > .01).all(axis='columns').any()

True