# Combining data in pandas

pandas offers two families of combining operations:
- Concatenation: pd.concat, DataFrame.append
- Merging: pd.merge, DataFrame.join

## 0. Review: concatenation in NumPy

In [1]:
import numpy as np

============================================

Exercise 12:

1. Create two 3*3 matrices and concatenate them along each of the two axes

============================================

In [2]:
nd1 = np.random.randint(0,10,size=(3,3))
nd1

array([[6, 7, 8],
       [1, 7, 6],
       [9, 6, 7]])

In [3]:
nd2 = np.random.randint(10,20,size=(3,3))
nd2

array([[11, 12, 17],
       [19, 13, 10],
       [16, 11, 15]])

In [6]:
# Concatenate along axis=1 (side by side)
np.concatenate([nd1,nd2],axis=1)

array([[ 6,  7,  8, 11, 12, 17],
       [ 1,  7,  6, 19, 13, 10],
       [ 9,  6,  7, 16, 11, 15]])

In [7]:
nd3 = np.random.randint(0,10,size=3)
nd3

array([8, 4, 5])

In [9]:
# nd1 is 2-D and nd3 is 1-D, so they cannot be concatenated directly
np.concatenate([nd1,nd3])

ValueError: all the input arrays must have same number of dimensions
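One way around the failure above is to reshape the 1-D array so that both inputs are 2-D. The following is a minimal sketch; the arrays `a` and `v` are made-up stand-ins for `nd1` and `nd3`.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)   # 2-D, shape (3, 3)
v = np.array([10, 11, 12])       # 1-D, shape (3,)

# Reshape the vector so both inputs are 2-D before concatenating.
as_row = np.concatenate([a, v.reshape(1, 3)], axis=0)   # appended as a new row
as_col = np.concatenate([a, v.reshape(3, 1)], axis=1)   # appended as a new column
print(as_row.shape, as_col.shape)
```

Whether the vector should become a row or a column depends on what it represents, so the reshape is left explicit here.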

In [10]:
nd4 = np.random.randint(0,10,size=(3,2))
nd4

array([[3, 6],
       [9, 2],
       [7, 9]])

In [12]:
np.concatenate([nd1,nd4],axis=1)

array([[6, 7, 8, 3, 6],
       [1, 7, 6, 9, 2],
       [9, 6, 7, 7, 9]])

To make the examples easier to follow, we first define a helper function that builds a DataFrame:

In [1]:
import pandas as pd
from pandas import DataFrame,Series

In [2]:
def make_df(rows,cols):
    data = {c:[c+str(i) for i in rows] for c in cols}
    return DataFrame(data=data,index=rows,columns=cols)

    

In [3]:
df1 = make_df([1,2,3],list("abc"))
df1

Unnamed: 0,a,b,c
1,a1,b1,c1
2,a2,b2,c2
3,a3,b3,c3


In [4]:
df2 = make_df([4,5,6],list("abc"))
df2

Unnamed: 0,a,b,c
4,a4,b4,c4
5,a5,b5,c5
6,a6,b6,c6


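## 1. Concatenating with pd.concat()

pandas provides pd.concat, which works like np.concatenate but takes more parameters. The signature below comes from an older pandas release; join_axes was removed in pandas 1.0: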
```
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
```

### 1) Simple concatenation

Like np.concatenate, pd.concat stacks rows by default (axis=0)

In [5]:
pd.concat([df1,df2],axis=0)

Unnamed: 0,a,b,c
1,a1,b1,c1
2,a2,b2,c2
3,a3,b3,c3
4,a4,b4,c4
5,a5,b5,c5
6,a6,b6,c6


Set axis to change the direction of concatenation

In [21]:
pd.concat([df1,df2],axis=1) # labels missing from one table (rows or columns) are filled in automatically with NaN

Unnamed: 0,a,b,c,a.1,b.1,c.1
1,a1,b1,c1,,,
2,a2,b2,c2,,,
3,a3,b3,c3,,,
4,,,,a4,b4,c4
5,,,,a5,b5,c5
6,,,,a6,b6,c6


Note that index labels may be duplicated after concatenation

In [22]:
df3 = make_df([1,2,4],list("bcd"))
df3

Unnamed: 0,b,c,d
1,b1,c1,d1
2,b2,c2,d2
4,b4,c4,d4


In [23]:
pd.concat([df1,df3],axis=0)

Unnamed: 0,a,b,c,d
1,a1,b1,c1,
2,a2,b2,c2,
3,a3,b3,c3,
1,,b1,c1,d1
2,,b2,c2,d2
4,,b4,c4,d4
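If duplicated index labels are not wanted, `pd.concat` can be asked to check for them through its `verify_integrity` parameter. A small sketch, using made-up frames `left` and `right` rather than the tables above:

```python
import pandas as pd

left = pd.DataFrame({"a": ["a1", "a2"]}, index=[1, 2])
right = pd.DataFrame({"a": ["a3", "a4"]}, index=[2, 3])

stacked = pd.concat([left, right])   # index label 2 now appears twice
print(stacked.index.is_unique)

try:
    # verify_integrity=True makes concat raise when labels overlap.
    pd.concat([left, right], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)
```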


In [24]:
pd.concat([df1,df3],axis=1)

Unnamed: 0,a,b,c,b.1,c.1,d
1,a1,b1,c1,b1,c1,d1
2,a2,b2,c2,b2,c2,d2
3,a3,b3,c3,,,
4,,,,b4,c4,d4


You can also pass ignore_index=True to discard the original index and renumber

In [26]:
pd.concat([df1,df3],ignore_index=True)

Unnamed: 0,a,b,c,d
0,a1,b1,c1,
1,a2,b2,c2,
2,a3,b3,c3,
3,,b1,c1,d1
4,,b2,c2,d2
5,,b4,c4,d4


In [28]:
pd.concat([df1,df3],axis=1,ignore_index=True)

Unnamed: 0,0,1,2,3,4,5
1,a1,b1,c1,b1,c1,d1
2,a2,b2,c2,b2,c2,d2
3,a3,b3,c3,,,
4,,,,b4,c4,d4


Or use keys to build a multi-level index

concat([x,y],keys=['x','y'])

In [30]:
pd.concat([df1,df3],keys=["小明","小王"])

Unnamed: 0,Unnamed: 1,a,b,c,d
小明,1,a1,b1,c1,
小明,2,a2,b2,c2,
小明,3,a3,b3,c3,
小王,1,,b1,c1,d1
小王,2,,b2,c2,d2
小王,4,,b4,c4,d4
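Once `keys` has produced a two-level index, the outer level can be used to get back the rows of one source table. The sketch below assumes two small made-up frames `x` and `y`:

```python
import pandas as pd

x = pd.DataFrame({"a": ["a1", "a2"]}, index=[1, 2])
y = pd.DataFrame({"a": ["a3", "a4"]}, index=[1, 2])

# Outer index level: "x" / "y"; inner level: the original index labels.
both = pd.concat([x, y], keys=["x", "y"])

from_y = both.loc["y"]               # rows that came from y, with their original index
one_cell = both.loc[("y", 2), "a"]   # one value addressed by (key, original label)
print(from_y)
print(one_cell)
```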


============================================

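Exercise 13:

1. Think of some use cases for concatenation.

2. Using what you learned yesterday, build a midterm score table ddd for Zhang San and Li Si

3. Suppose a new exam subject "Computer Science" is added; how would you add it?

4. How would you add the scores of a new student, Wang Laowu?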

============================================

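### 2) Concatenating mismatched objects

Mismatched means that the labels on the other axis differ: for example, different column labels when stacking rows, or different row labels when placing tables side by side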

In [6]:
df1

Unnamed: 0,a,b,c
1,a1,b1,c1
2,a2,b2,c2
3,a3,b3,c3


In [7]:
df4 = make_df([3,4,5,6],list("abcd"))
df4

Unnamed: 0,a,b,c,d
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6


In [9]:
pd.concat([df1,df4],axis=0)

Unnamed: 0,a,b,c,d
1,a1,b1,c1,
2,a2,b2,c2,
3,a3,b3,c3,
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6


There are three ways to join:

- Outer join: fill with NaN (the default)

In [11]:
pd.concat([df1,df4],axis=1,join="outer") # positions without a match are filled with NaN

Unnamed: 0,a,b,c,a.1,b.1,c.1,d
1,a1,b1,c1,,,,
2,a2,b2,c2,,,,
3,a3,b3,c3,a3,b3,c3,d3
4,,,,a4,b4,c4,d4
5,,,,a5,b5,c5,d5
6,,,,a6,b6,c6,d6


- Inner join: keep only the labels that match

In [13]:
pd.concat([df1,df4],join="inner") # inner join: labels without a match are dropped

Unnamed: 0,a,b,c
1,a1,b1,c1
2,a2,b2,c2
3,a3,b3,c3
3,a3,b3,c3
4,a4,b4,c4
5,a5,b5,c5
6,a6,b6,c6


In [15]:
pd.concat([df1,df4],axis=1,join="inner")

Unnamed: 0,a,b,c,a.1,b.1,c.1,d
3,a3,b3,c3,a3,b3,c3,d3


- Join on specified labels with join_axes (removed in pandas 1.0)

In [16]:
df5 = make_df([5,6,7,8],list("abde"))
df5

Unnamed: 0,a,b,d,e
5,a5,b5,d5,e5
6,a6,b6,d6,e6
7,a7,b7,d7,e7
8,a8,b8,d8,e8


In [26]:
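pd.concat([df1,df4,df5],join_axes=[pd.Index(["a","c"])])
# join_axes takes a one-element list: the Index of labels to keep
# (join_axes was removed in pandas 1.0, so this call only runs on older versions)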


Unnamed: 0,a,c
1,a1,c1
2,a2,c2
3,a3,c3
3,a3,c3
4,a4,c4
5,a5,c5
6,a6,c6
5,a5,
6,a6,
7,a7,


============================================

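Exercise 14:

    Suppose the [final] exam table ddd2 has no scores for Zhang San, only for Li Si, Wang Laowu and Zhao Xiaoliu; concatenate it in several different ways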

============================================

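### 3) Adding rows with append()

Because appending rows at the end is so common, the DataFrame.append method exists just for that. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; newer versions use pd.concat instead.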

In [27]:
df1

Unnamed: 0,a,b,c
1,a1,b1,c1
2,a2,b2,c2
3,a3,b3,c3


In [28]:
l = [1,2,3]

In [34]:
df1.append(df4) # append only adds rows, and it behaves like an outer join

Unnamed: 0,a,b,c,d
1,a1,b1,c1,
2,a2,b2,c2,
3,a3,b3,c3,
3,a3,b3,c3,d3
4,a4,b4,c4,d4
5,a5,b5,c5,d5
6,a6,b6,c6,d6


In [40]:
df1.append([[1,2,3]]) # append expects a 2-D-like object here, so the new row is passed as a nested list

Unnamed: 0,a,b,c,0,1,2
1,a1,b1,c1,,,
2,a2,b2,c2,,,
3,a3,b3,c3,,,
0,,,,1.0,2.0,3.0


In [43]:
l.append([1,2])
l

[1, 2, 3, [1, 2]]
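On pandas versions without `DataFrame.append`, the same row-wise result can be obtained with `pd.concat`. The sketch below uses made-up frames `base` and `extra`; it is one possible replacement, not the form used in this notebook.

```python
import pandas as pd

base = pd.DataFrame({"a": ["a1", "a2"], "b": ["b1", "b2"]}, index=[1, 2])
extra = pd.DataFrame({"a": ["a3"], "b": ["b3"], "d": ["d3"]}, index=[3])

# Row-wise outer concatenation: the counterpart of base.append(extra).
appended = pd.concat([base, extra])

# To add a single row given as a dict, wrap it in a one-row DataFrame first.
new_row = pd.DataFrame([{"a": "a4", "b": "b4"}], index=[4])
appended = pd.concat([appended, new_row])
print(appended)
```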

============================================

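Exercise 15:

    Create a final-exam score table ddd3 containing only Zhang San, Li Si and Wang Laowu, and use append() to concatenate it with the midterm table ddd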

============================================

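## 2. Merging with pd.merge()

Unlike concat, merge combines tables by matching the values of a shared column (or index)

By default, pd.merge() uses the columns whose names appear in both tables as the key.

Note that the rows of the key columns do not need to be in the same order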
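The point about row order can be checked with a small sketch. The frames `people` and `phones` are made up for illustration; their rows are deliberately in different orders.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [18, 19, 20]})
phones = pd.DataFrame({"name": ["Cid", "Ann", "Bob"], "phone": [119, 120, 110]})

# No key is given, so the shared column "name" is used.
merged = pd.merge(people, phones)
print(merged)
```

Each person still ends up with the right phone number because rows are paired by key value, not by position.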

### 1) One-to-one merge

In [2]:
df1 = DataFrame({
    "name":["小明","小黄","小绿","小白"],
    "age":[18,19,20,39],
    "job":["stu","driver","seller","boss"]
},index=list("abcd"))
df1

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白


In [48]:
df2 = DataFrame({
    "home":["上海","北京","新加坡","郑州"],
    "name":["小明","小黄","小绿","小白"],
    "phone":[120,110,119,999]
},index=list("abcd"))
df2

Unnamed: 0,home,name,phone
a,上海,小明,120
b,北京,小黄,110
c,新加坡,小绿,119
d,郑州,小白,999


In [49]:
pd.concat([df1,df2],axis=1)

Unnamed: 0,age,job,name,home,name.1,phone
a,18,stu,小明,上海,小明,120
b,19,driver,小黄,北京,小黄,110
c,20,seller,小绿,新加坡,小绿,119
d,39,boss,小白,郑州,小白,999


In [55]:
pd.merge(df1,df2)

Unnamed: 0,age,job,name,home,phone
0,18,stu,小明,上海,120
1,19,driver,小黄,北京,110
2,20,seller,小绿,新加坡,119
3,39,boss,小白,郑州,999


In [57]:
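df1.merge(df2) # merge matches on columns
# Condition for merging: the two tables share at least one column
# Rows with equal values in the shared column are combined
# One-to-one merge: each key value appears exactly once in each table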

Unnamed: 0,age,job,name,home,phone
0,18,stu,小明,上海,120
1,19,driver,小黄,北京,110
2,20,seller,小绿,新加坡,119
3,39,boss,小白,郑州,999


### 2) Many-to-one merge

In [58]:
df3 =  DataFrame({
    "home":["上海","北京","新加坡","郑州","广州","深圳","日本"],
    "name":["小明","小黄","小绿","小白","小明","小黄","小明"],
    "phone":[120,110,119,999,10086,10010,1231244]
},index=list("abcdefg"))
df3

Unnamed: 0,home,name,phone
a,上海,小明,120
b,北京,小黄,110
c,新加坡,小绿,119
d,郑州,小白,999
e,广州,小明,10086
f,深圳,小黄,10010
g,日本,小明,1231244


In [59]:
df1

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白


In [60]:
df1.merge(df3)

Unnamed: 0,age,job,name,home,phone
0,18,stu,小明,上海,120
1,18,stu,小明,广州,10086
2,18,stu,小明,日本,1231244
3,19,driver,小黄,北京,110
4,19,driver,小黄,深圳,10010
5,20,seller,小绿,新加坡,119
6,39,boss,小白,郑州,999


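One-to-many: for a shared column, one value in table 1 matches several rows in table 2

Merge rule: the matching row of table 1 is repeated once for each matching row of table 2, and the rows are then combined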
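If a merge is expected to be one-to-many, the `validate` parameter of `merge` can check that assumption. A hedged sketch with made-up frames `people` and `homes`:

```python
import pandas as pd

people = pd.DataFrame({"name": ["A", "B"], "age": [18, 19]})
homes = pd.DataFrame({"name": ["A", "A", "B"], "home": ["h1", "h2", "h3"]})

# "one_to_many": keys must be unique on the left side only.
ok = people.merge(homes, on="name", validate="one_to_many")

try:
    # Claiming one-to-one is false here (A appears twice in homes),
    # so pandas raises MergeError.
    homes.merge(people, on="name", validate="one_to_one")
    failed = False
except pd.errors.MergeError:
    failed = True
print(len(ok), failed)
```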

### 3) Many-to-many merge

In [61]:
df4 = DataFrame({
    "name":["小明","小黄","小绿","小白","小明"],
    "age":[18,19,20,39,32],
    "job":["stu","driver","seller","boss","teacher"]
},index=list("abcde"))
df4

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白
e,32,teacher,小明


In [62]:
df3

Unnamed: 0,home,name,phone
a,上海,小明,120
b,北京,小黄,110
c,新加坡,小绿,119
d,郑州,小白,999
e,广州,小明,10086
f,深圳,小黄,10010
g,日本,小明,1231244


In [63]:
df4.merge(df3)

Unnamed: 0,age,job,name,home,phone
0,18,stu,小明,上海,120
1,18,stu,小明,广州,10086
2,18,stu,小明,日本,1231244
3,32,teacher,小明,上海,120
4,32,teacher,小明,广州,10086
5,32,teacher,小明,日本,1231244
6,19,driver,小黄,北京,110
7,19,driver,小黄,深圳,10010
8,20,seller,小绿,新加坡,119
9,39,boss,小白,郑州,999


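Many-to-many: a key value appears several times in table 1 and also several times in table 2

Merge rule: every row of table 1 with that value is paired with every row of table 2 with the same value, so the number of resulting rows for that value is the product of the two counts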
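The product rule can be verified by counting. This sketch uses made-up frames `left` and `right`, in which key A occurs twice on the left and three times on the right:

```python
import pandas as pd

left = pd.DataFrame({"name": ["A", "A", "B"], "job": ["stu", "teacher", "driver"]})
right = pd.DataFrame({"name": ["A", "A", "A", "B"], "home": ["h1", "h2", "h3", "h4"]})

merged = left.merge(right, on="name")

# Rows per key = (occurrences on the left) * (occurrences on the right).
expected = (left["name"].value_counts() * right["name"].value_counts()).sum()
print(len(merged), expected)
```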

### 4) Specifying the key

- Use on= to state explicitly which column is the key; use this when the two tables share more than one column name

In [65]:
df1

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白


In [66]:
df5 =  DataFrame({
    "home":["上海","北京","新加坡","郑州","广州","深圳","日本"],
    "name":["小明","小黄","小绿","小白","小明","小黄","小明"],
    "phone":[120,110,119,999,10086,10010,1231244],
    "age":[18,19,20,21,22,23,34]
},index=list("abcdefg"))
df5

Unnamed: 0,age,home,name,phone
a,18,上海,小明,120
b,19,北京,小黄,110
c,20,新加坡,小绿,119
d,21,郑州,小白,999
e,22,广州,小明,10086
f,23,深圳,小黄,10010
g,34,日本,小明,1231244


In [67]:
df1.merge(df5) # when several column names are shared, merge uses all of them together as the key

Unnamed: 0,age,job,name,home,phone
0,18,stu,小明,上海,120
1,19,driver,小黄,北京,110
2,20,seller,小绿,新加坡,119


In [69]:
# When several column names are shared, you can choose which one serves as the key
df1.merge(df5,on="name",suffixes=["_真实","_虚假"])

Unnamed: 0,age_真实,job,name,age_虚假,home,phone
0,18,stu,小明,18,上海,120
1,18,stu,小明,22,广州,10086
2,18,stu,小明,34,日本,1231244
3,19,driver,小黄,19,北京,110
4,19,driver,小黄,23,深圳,10010
5,20,seller,小绿,20,新加坡,119
6,39,boss,小白,21,郑州,999


- Use left_on and right_on to name the key column on each side; use this when the key columns have different names in the two tables

In [71]:
df1

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白


In [3]:
df6 =  DataFrame({
    "家乡":["上海","北京","新加坡","郑州","广州","深圳","日本"],
    "姓名":["小明","小黄","小绿","小白","小明","小黄","小明"],
    "电话":[120,110,119,999,10086,10010,1231244]
},index=list("abcdefg"))
df6

Unnamed: 0,姓名,家乡,电话
a,小明,上海,120
b,小黄,北京,110
c,小绿,新加坡,119
d,小白,郑州,999
e,小明,广州,10086
f,小黄,深圳,10010
g,小明,日本,1231244


In [4]:
df1.merge(df6,left_on="name",right_on="姓名") # with no shared column name a default merge fails, but you can merge by naming the key columns explicitly

Unnamed: 0,age,job,name,姓名,家乡,电话
0,18,stu,小明,小明,上海,120
1,18,stu,小明,小明,广州,10086
2,18,stu,小明,小明,日本,1231244
3,19,driver,小黄,小黄,北京,110
4,19,driver,小黄,小黄,深圳,10010
5,20,seller,小绿,小绿,新加坡,119
6,39,boss,小白,小白,郑州,999


In [6]:
df1.merge(df6,left_on="age",right_on="电话")

Unnamed: 0,age,job,name,姓名,家乡,电话


In [7]:
df1.merge(df6,left_index=True,right_index=True)

Unnamed: 0,age,job,name,姓名,家乡,电话
a,18,stu,小明,小明,上海,120
b,19,driver,小黄,小黄,北京,110
c,20,seller,小绿,小绿,新加坡,119
d,39,boss,小白,小白,郑州,999


In [8]:
df1.merge(df6,left_index=True,right_on="姓名")

Unnamed: 0,age,job,name,姓名,家乡,电话
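The result above is empty because the index labels of df1 (a to d) never equal any value in the 姓名 column. Mixing `left_index=True` with `right_on` only produces rows when the left index actually holds the key values, as in this sketch with made-up frames `ages` and `homes`:

```python
import pandas as pd

# Here the person names live in the index of the left table.
ages = pd.DataFrame({"age": [18, 19]}, index=["A", "B"])
homes = pd.DataFrame({"person": ["A", "B", "A"], "home": ["h1", "h2", "h3"]})

# Match the left index against the "person" column on the right.
merged = ages.merge(homes, left_index=True, right_on="person")
print(merged)
```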


============================================

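Exercise 16:

1. Suppose there are two score tables: ddd holds Zhang San, Li Si and Wang Laowu, and ddd4 holds Zhang San and Zhao Xiaoliu. How would you merge them?

2. What if Zhang San's name was mistyped in ddd4 as Zhang Shisan?

3. Practise the many-to-one and many-to-many cases on your own

4. Study left_index and right_index on your own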

============================================

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame

### 5) Inner and outer merges

- Inner merge: keep only the keys present in both tables (the default)

In [11]:
df1

Unnamed: 0,age,job,name
a,18,stu,小明
b,19,driver,小黄
c,20,seller,小绿
d,39,boss,小白


In [12]:
df6

Unnamed: 0,姓名,家乡,电话
a,小明,上海,120
b,小黄,北京,110
c,小绿,新加坡,119
d,小白,郑州,999
e,小明,广州,10086
f,小黄,深圳,10010
g,小明,日本,1231244


In [14]:
df1.merge(df6,left_on="name",right_on="姓名",how="inner")

Unnamed: 0,age,job,name,姓名,家乡,电话
0,18,stu,小明,小明,上海,120
1,18,stu,小明,小明,广州,10086
2,18,stu,小明,小明,日本,1231244
3,19,driver,小黄,小黄,北京,110
4,19,driver,小黄,小黄,深圳,10010
5,20,seller,小绿,小绿,新加坡,119
6,39,boss,小白,小白,郑州,999


- Outer merge, how='outer': keep all keys and fill the gaps with NaN

In [17]:
df1.merge(df6,left_index=True,right_index=True,how="outer")

Unnamed: 0,age,job,name,姓名,家乡,电话
a,18.0,stu,小明,小明,上海,120
b,19.0,driver,小黄,小黄,北京,110
c,20.0,seller,小绿,小绿,新加坡,119
d,39.0,boss,小白,小白,郑州,999
e,,,,小明,广州,10086
f,,,,小黄,深圳,10010
g,,,,小明,日本,1231244


- Left and right merges: how='left', how='right'

In [19]:
df1.merge(df6,left_index=True,right_index=True,how="right")

Unnamed: 0,age,job,name,姓名,家乡,电话
a,18.0,stu,小明,小明,上海,120
b,19.0,driver,小黄,小黄,北京,110
c,20.0,seller,小绿,小绿,新加坡,119
d,39.0,boss,小白,小白,郑州,999
e,,,,小明,广州,10086
f,,,,小黄,深圳,10010
g,,,,小明,日本,1231244


============================================

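Exercise 17:

1. If you only have the Chinese, maths and English scores of Zhang San and Zhao Xiaoliu, how would you merge them?

2. Considering realistic scenarios, merge ddd and ddd4 in several different ways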

============================================

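### 6) Resolving column conflicts

When columns conflict, that is, when the two tables share several column names, use on= to choose the key column; the remaining same-named columns get suffixes to tell them apart (_x and _y by default)

Use suffixes= to choose your own suffixes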
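A short sketch of the conflict case, assuming two made-up score tables `mid` and `final` that share both a `name` and a `score` column:

```python
import pandas as pd

mid = pd.DataFrame({"name": ["A", "B"], "score": [90, 80]})
final = pd.DataFrame({"name": ["A", "B"], "score": [95, 70]})

# Key on "name" only; the clashing "score" columns get the default suffixes.
default = mid.merge(final, on="name")

# The same merge with suffixes chosen by hand.
merged = mid.merge(final, on="name", suffixes=("_mid", "_final"))
print(default.columns.tolist())
print(merged.columns.tolist())
```

Without `on="name"`, both shared columns would be used as the key and only rows with equal scores would survive.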

============================================

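Exercise 18:

    Suppose two students are both named Li Si, and ddd5 and ddd6 are both score tables for Zhang San and Li Si; how would you merge them?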

============================================

## Homework
## 3. Case study: analysing US state population data

First load the files and look at a sample of the data

In [79]:
abb = pd.read_csv("./usapop/state-abbrevs.csv")
areas = pd.read_csv("./usapop/state-areas.csv")
pop = pd.read_csv("./usapop/state-population.csv")

In [80]:
abb

Unnamed: 0,state,abbreviation
0,Alabama,AL
1,Alaska,AK
2,Arizona,AZ
3,Arkansas,AR
4,California,CA
5,Colorado,CO
6,Connecticut,CT
7,Delaware,DE
8,District of Columbia,DC
9,Florida,FL


In [81]:
areas

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707
5,Colorado,104100
6,Connecticut,5544
7,Delaware,1954
8,Florida,65758
9,Georgia,59441


In [82]:
pop

Unnamed: 0,state/region,ages,year,population
0,AL,under18,2012,1117489.0
1,AL,total,2012,4817528.0
2,AL,under18,2010,1130966.0
3,AL,total,2010,4785570.0
4,AL,under18,2011,1125763.0
5,AL,total,2011,4801627.0
6,AL,total,2009,4757938.0
7,AL,under18,2009,1134192.0
8,AL,under18,2013,1111481.0
9,AL,total,2013,4833722.0


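Merge the two DataFrames pop and abb, using the state/region column on the left and the abbreviation column on the right as keys.

Use an outer merge so that no information is lost.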

In [83]:
pop2 = pop.merge(abb,left_on="state/region",right_on="abbreviation" ,how="outer")
pop2.head()

Unnamed: 0,state/region,ages,year,population,state,abbreviation
0,AL,under18,2012,1117489.0,Alabama,AL
1,AL,total,2012,4817528.0,Alabama,AL
2,AL,under18,2010,1130966.0,Alabama,AL
3,AL,total,2010,4785570.0,Alabama,AL
4,AL,under18,2011,1125763.0,Alabama,AL


Drop the abbreviation column (axis=1)

In [85]:
pop2.drop("abbreviation",axis=1,inplace=True) # inplace=True modifies pop2 itself instead of returning a new DataFrame

In [86]:
pop2.head()

Unnamed: 0,state/region,ages,year,population,state
0,AL,under18,2012,1117489.0,Alabama
1,AL,total,2012,4817528.0,Alabama
2,AL,under18,2010,1130966.0,Alabama
3,AL,total,2010,4785570.0,Alabama
4,AL,under18,2011,1125763.0,Alabama


Check which columns contain missing data.

Use .isnull().any(): a column shows True as soon as it contains at least one missing value.

In [87]:
cond = pop2.isnull().any(axis=0)
cond

state/region    False
ages            False
year            False
population       True
state            True
dtype: bool

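Inspect the missing data

Filter rows by whether the value is missing: rows where the mask is True are shown

In [88]:
# pop2[cond] cannot show the rows that hold the missing values
cond_state = pop2["state"].isnull()
pop2[cond_state]["state/region"].unique()
# "PR" stands for Puerto Rico and "USA" stands for "United States"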

array(['PR', 'USA'], dtype=object)

In [89]:
# Goal: select the rows whose abbreviation is "PR"
cond2 = pop2["state/region"]=="PR"

In [90]:
pop2.loc[cond2,"state"] = "Puerto Rico"


In [92]:
cond3 = pop2["state/region"] == "USA"
cond3

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19      False
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
2514     True
2515     True
2516     True
2517     True
2518     True
2519     True
2520     True
2521     True
2522     True
2523     True
2524     True
2525     True
2526     True
2527     True
2528     True
2529     True
2530     True
2531     True
2532     True
2533     True
2534     True
2535     True
2536     True
2537     True
2538     True
2539     True
2540     True
2541     True
2542     True
2543     True
Name: state/region, Length: 2544, dtype: bool

In [95]:
pop2.loc[cond3,"state"] = "United States"


In [96]:
pop2.isnull().any()

state/region    False
ages            False
year            False
population       True
state           False
dtype: bool

In [97]:
pop2.head()

Unnamed: 0,state/region,ages,year,population,state
0,AL,under18,2012,1117489.0,Alabama
1,AL,total,2012,4817528.0,Alabama
2,AL,under18,2010,1130966.0,Alabama
3,AL,total,2010,4785570.0,Alabama
4,AL,under18,2011,1125763.0,Alabama


Merge in the state area data areas, using an outer merge.

Think about why an outer merge is used here.



In [99]:
areas.head()

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707


In [103]:
pop3 = pop2.merge(areas,how="outer")
pop3.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012,1117489.0,Alabama,52423.0
1,AL,total,2012,4817528.0,Alabama,52423.0
2,AL,under18,2010,1130966.0,Alabama,52423.0
3,AL,total,2010,4785570.0,Alabama,52423.0
4,AL,under18,2011,1125763.0,Alabama,52423.0


Check again for columns that contain missing data

In [104]:
pop3.isnull().any()

state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)     True
dtype: bool

In [108]:
cond4 = pop3["area (sq. mi)"].isnull()

In [111]:
pop3[cond4]["state"].unique()

array(['United States'], dtype=object)

In [112]:
areas

Unnamed: 0,state,area (sq. mi)
0,Alabama,52423
1,Alaska,656425
2,Arizona,114006
3,Arkansas,53182
4,California,163707
5,Colorado,104100
6,Connecticut,5544
7,Delaware,1954
8,Florida,65758
9,Georgia,59441


The area (sq. mi) column has missing data; to locate the affected rows, we look up which state has no area value

In [115]:
# Approximate the area of the United States as the sum of all areas in the table
usa_area = areas["area (sq. mi)"].sum()
usa_area

3790399

In [117]:
cond5 = pop3["state"] == "United States"

In [120]:
pop3.loc[cond5,"area (sq. mi)"] = usa_area


In [121]:
pop3.isnull().any()

state/region     False
ages             False
year             False
population        True
state            False
area (sq. mi)    False
dtype: bool

Drop the rows that contain missing data

In [123]:
population = pop3.dropna(axis=0,how="any")

Check whether any data is still missing

In [124]:
population.isnull().any()

state/region     False
ages             False
year             False
population       False
state            False
area (sq. mi)    False
dtype: bool

Select the 2010 total-population rows with df.query(query string)

In [125]:
population.head()

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
0,AL,under18,2012,1117489.0,Alabama,52423.0
1,AL,total,2012,4817528.0,Alabama,52423.0
2,AL,under18,2010,1130966.0,Alabama,52423.0
3,AL,total,2010,4785570.0,Alabama,52423.0
4,AL,under18,2011,1125763.0,Alabama,52423.0


In [133]:
pop_2010 = population.query("year==2010&ages=='total'")
pop_2010

Unnamed: 0,state/region,ages,year,population,state,area (sq. mi)
3,AL,total,2010,4785570.0,Alabama,52423.0
91,AK,total,2010,713868.0,Alaska,656425.0
101,AZ,total,2010,6408790.0,Arizona,114006.0
189,AR,total,2010,2922280.0,Arkansas,53182.0
197,CA,total,2010,37333601.0,California,163707.0
283,CO,total,2010,5048196.0,Colorado,104100.0
293,CT,total,2010,3579210.0,Connecticut,5544.0
379,DE,total,2010,899711.0,Delaware,1954.0
389,DC,total,2010,605125.0,District of Columbia,68.0
475,FL,total,2010,18846054.0,Florida,65758.0


Post-process the query result: make the state column the new row index with set_index

In [134]:
pop_2010.set_index("state",inplace=True)
pop_2010

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi)
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama,AL,total,2010,4785570.0,52423.0
Alaska,AK,total,2010,713868.0,656425.0
Arizona,AZ,total,2010,6408790.0,114006.0
Arkansas,AR,total,2010,2922280.0,53182.0
California,CA,total,2010,37333601.0,163707.0
Colorado,CO,total,2010,5048196.0,104100.0
Connecticut,CT,total,2010,3579210.0,5544.0
Delaware,DE,total,2010,899711.0,1954.0
District of Columbia,DC,total,2010,605125.0,68.0
Florida,FL,total,2010,18846054.0,65758.0


In [135]:
pop_2010.shape

(53, 5)

Compute the population density. Note that Series/Series yields another Series.

In [138]:
pop_2010_dens = pop_2010["population"]/pop_2010["area (sq. mi)"]
pop_2010_dens

state
Alabama                   91.287603
Alaska                     1.087509
Arizona                   56.214497
Arkansas                  54.948667
California               228.051342
Colorado                  48.493718
Connecticut              645.600649
Delaware                 460.445752
District of Columbia    8898.897059
Florida                  286.597129
Georgia                  163.409902
Hawaii                   124.746707
Idaho                     18.794338
Illinois                 221.687472
Indiana                  178.197831
Iowa                      54.202751
Kansas                    34.745266
Kentucky                 107.586994
Louisiana                 87.676099
Maine                     37.509990
Maryland                 466.445797
Massachusetts            621.815538
Michigan                 102.015794
Minnesota                 61.078373
Mississippi               61.321530
Missouri                  86.015622
Montana                    6.736171
Nebraska              

In [142]:
pop_2010_desity = pop_2010.merge(DataFrame(pop_2010_dens,columns=["desity"]),left_index=True,right_index=True)
pop_2010_desity

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi),desity
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alabama,AL,total,2010,4785570.0,52423.0,91.287603
Alaska,AK,total,2010,713868.0,656425.0,1.087509
Arizona,AZ,total,2010,6408790.0,114006.0,56.214497
Arkansas,AR,total,2010,2922280.0,53182.0,54.948667
California,CA,total,2010,37333601.0,163707.0,228.051342
Colorado,CO,total,2010,5048196.0,104100.0,48.493718
Connecticut,CT,total,2010,3579210.0,5544.0,645.600649
Delaware,DE,total,2010,899711.0,1954.0,460.445752
District of Columbia,DC,total,2010,605125.0,68.0,8898.897059
Florida,FL,total,2010,18846054.0,65758.0,286.597129


Sort with sort_values() and find the five states with the highest population density

In [147]:
pop_2010_desity.sort_values("desity",inplace=True)
pop_2010_desity

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi),desity
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alaska,AK,total,2010,713868.0,656425.0,1.087509
Wyoming,WY,total,2010,564222.0,97818.0,5.768079
Montana,MT,total,2010,990527.0,147046.0,6.736171
North Dakota,ND,total,2010,674344.0,70704.0,9.537565
South Dakota,SD,total,2010,816211.0,77121.0,10.583512
New Mexico,NM,total,2010,2064982.0,121593.0,16.982737
Idaho,ID,total,2010,1570718.0,83574.0,18.794338
Nebraska,NE,total,2010,1829838.0,77358.0,23.654153
Nevada,NV,total,2010,2703230.0,110567.0,24.448796
Utah,UT,total,2010,2774424.0,84904.0,32.677188


In [151]:
pop_2010_desity[:-6:-1]

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi),desity
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
District of Columbia,DC,total,2010,605125.0,68.0,8898.897059
Puerto Rico,PR,total,2010,3721208.0,3515.0,1058.665149
New Jersey,NJ,total,2010,8802707.0,8722.0,1009.253268
Rhode Island,RI,total,2010,1052669.0,1545.0,681.339159
Connecticut,CT,total,2010,3579210.0,5544.0,645.600649


In [155]:
pop_2010_desity.tail()[::-1]

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi),desity
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
District of Columbia,DC,total,2010,605125.0,68.0,8898.897059
Puerto Rico,PR,total,2010,3721208.0,3515.0,1058.665149
New Jersey,NJ,total,2010,8802707.0,8722.0,1009.253268
Rhode Island,RI,total,2010,1052669.0,1545.0,681.339159
Connecticut,CT,total,2010,3579210.0,5544.0,645.600649
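Instead of sorting and then slicing from the end, pandas also provides `nlargest` and `nsmallest`. A small sketch on a made-up Series `density`:

```python
import pandas as pd

density = pd.Series({"A": 1.0, "B": 50.0, "C": 900.0, "D": 20.0})

top2 = density.nlargest(2)       # highest values first
bottom2 = density.nsmallest(2)   # lowest values first
print(top2)
print(bottom2)
```

For a DataFrame the column name is passed as well, for example `df.nlargest(5, "desity")` with the column name used in this notebook.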


Find the five states with the lowest population density

In [153]:
pop_2010_desity.head(10)

Unnamed: 0_level_0,state/region,ages,year,population,area (sq. mi),desity
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alaska,AK,total,2010,713868.0,656425.0,1.087509
Wyoming,WY,total,2010,564222.0,97818.0,5.768079
Montana,MT,total,2010,990527.0,147046.0,6.736171
North Dakota,ND,total,2010,674344.0,70704.0,9.537565
South Dakota,SD,total,2010,816211.0,77121.0,10.583512
New Mexico,NM,total,2010,2064982.0,121593.0,16.982737
Idaho,ID,total,2010,1570718.0,83574.0,18.794338
Nebraska,NE,total,2010,1829838.0,77358.0,23.654153
Nevada,NV,total,2010,2703230.0,110567.0,24.448796
Utah,UT,total,2010,2774424.0,84904.0,32.677188


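Key takeaways:
- Use .loc[] consistently for indexing and assignment
- Use .isnull().any() to find the columns that contain NaN
- Use .unique() to see which keys in a column need attention
- Prefer outer or left merges: it is better to get NaN in a column than to drop the information in the other columns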
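The advice about .loc matters most for assignment. A chained form such as `df["state"][mask] = value` may write to a temporary copy, which is why pandas warns about it, while `.loc` selects rows and column in a single step. A minimal sketch on a made-up frame `df`:

```python
import pandas as pd

df = pd.DataFrame({"code": ["AL", "PR", "PR"], "state": ["Alabama", None, None]})
mask = df["code"] == "PR"

# One indexing step: rows chosen by the mask, column chosen by name.
df.loc[mask, "state"] = "Puerto Rico"
print(df)
```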

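## Review: how Series/DataFrame arithmetic differs from ndarray arithmetic

- Series and DataFrame align operands on their index labels instead of broadcasting by position; a label missing on one side gives NaN, unless you fill it with the fill_value argument of add()
- ndarray broadcasts: existing values are repeated so that the shapes match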
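The two behaviours can be compared side by side. The sketch below uses made-up values; `s1` and `s2` share only the labels b and c.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "c"])

aligned = s1 + s2                   # label "a" has no partner, so it becomes NaN
filled = s1.add(s2, fill_value=0)   # treat the missing partner as 0 instead

# ndarray: the 1-D row is broadcast across both rows of the 2-D array.
arr = np.array([[1, 2, 3], [4, 5, 6]]) + np.array([10, 20, 30])
print(aligned)
print(filled)
print(arr)
```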