### Concatenation
 - pd.concat (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; there is no top-level pd.append)

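pandas provides pd.concat, which works much like np.concatenate but takes a few more parameters:
    objs: the sequence of Series/DataFrame objects to concatenate
    axis: 0 stacks rows (the default), 1 places the frames side by side
    keys: one label per input, used to build an outer index level
    join = 'outer' / 'inner': how the labels on the other axis are combined. 'outer' keeps the union of the labels and fills the gaps with NaN; 'inner' keeps only the labels present in every input
    ignore_index=False: when True, discard the original labels along the concatenation axis and renumber from 0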
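
As a minimal sketch of what `keys` and `ignore_index` do when stacking rows, the example below uses two small made-up frames (`left` and `right` are illustrative names, not part of the cells that follow).

```python
import numpy as np
import pandas as pd

# Two small frames with identical columns (illustrative data)
left = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
right = pd.DataFrame(np.arange(6, 12).reshape(3, 2), columns=['A', 'B'])

# Default: the original row labels are kept, so 0, 1, 2 appear twice
stacked = pd.concat([left, right], axis=0)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([left, right], axis=0, ignore_index=True)

# keys adds an outer index level recording which input each row came from
labelled = pd.concat([left, right], axis=0, keys=['left', 'right'])

print(stacked.index.tolist())      # [0, 1, 2, 0, 1, 2]
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4, 5]
print(labelled.loc['right'])       # only the rows that came from `right`
```

`ignore_index=True` is usually the simpler choice when the old row labels carry no meaning; `keys` is useful when the origin of each row must stay recoverable.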

In [19]:
import pandas as pd
import numpy as np
from pandas import DataFrame

- Concatenating frames with matching labels

In [15]:
df1 = DataFrame(data=np.random.randint(0,100,size=(5,3)),columns=['A','B','C'])
df2 = DataFrame(data=np.random.randint(0,100,size=(5,3)),columns=['A','D','C'])
df1

Unnamed: 0,A,B,C
0,54,21,28
1,59,62,38
2,65,94,29
3,92,47,65
4,26,75,52


In [10]:
df2

Unnamed: 0,A,D,C
0,35,8,18
1,41,74,24
2,5,89,52
3,24,38,68
4,13,87,4


In [17]:
pd.concat((df1,df2),axis=1)

Unnamed: 0,A,B,C,A,D,C
0,54,21,28,19,72,66
1,59,62,38,98,16,98
2,65,94,29,73,4,10
3,92,47,65,49,93,45
4,26,75,52,83,6,51


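- Concatenating frames with mismatched labels
  - "Mismatched" means the labels on the axis that is not being concatenated differ: the column labels differ when stacking vertically, or the row labels differ when joining horizontally
  - There are two join modes:
    - Outer join: keep every label and fill the gaps with NaN (the default)
    - Inner join: keep only the labels that match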
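
The horizontal case (mismatched row labels with `axis=1`) can be sketched as follows. The frames `a` and `b` are made up for illustration.

```python
import pandas as pd

# Illustrative frames whose row labels only partly overlap
a = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'B': [10, 20, 30]}, index=[1, 2, 3])

# Outer join (default): union of the row labels, gaps become NaN
outer = pd.concat([a, b], axis=1)

# Inner join: only the row labels present in both frames survive
inner = pd.concat([a, b], axis=1, join='inner')

print(outer.index.tolist())   # [0, 1, 2, 3]
print(inner.index.tolist())   # [1, 2]
```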

In [12]:
pd.concat((df1,df2),axis=0) # outer join by default
# Q: What is an outer join?
# A: An outer join keeps every column from both frames and fills the missing entries with NaN

Unnamed: 0,A,B,C,D
0,35,83.0,4,
1,26,69.0,20,
2,6,75.0,13,
3,62,31.0,18,
4,57,61.0,79,
0,35,,18,8.0
1,41,,24,74.0
2,5,,52,89.0
3,24,,68,38.0
4,13,,4,87.0


In [30]:
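pd.concat((df1,df2),axis=0,join='inner') # inner join: keep only the columns both frames share
# Q: What is an inner join?
# A: An inner join keeps only the labels that are present in both frames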

Unnamed: 0,A,C
0,54,28
1,59,38
2,65,29
3,92,65
4,26,52
0,19,66
1,98,98
2,73,10
3,49,45
4,83,51


- To avoid losing any data, use outer (the outer join)

In [27]:
df3 = DataFrame(data=np.random.randint(0,100,size=(5,2)),columns=['A','C'])
df3

Unnamed: 0,A,C
0,56,33
1,12,60
2,58,47
3,94,16
4,9,30


In [28]:
pd.concat((df1,df3),axis=0)

Unnamed: 0,A,B,C
0,54,21.0,28
1,59,62.0,38
2,65,94.0,29
3,92,47.0,65
4,26,75.0,52
0,56,,33
1,12,,60
2,58,,47
3,94,,16
4,9,,30


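### Merging
  - Unlike concat, merge combines two tables by matching the values in one or more shared columns
  - When no key is given, pd.merge() uses every column name the two tables have in common as the key
  - The rows do not need to be in the same order in both tables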

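#### How do concatenation and merging differ?
   - Concatenation glues tables together vertically or horizontally
   - Merging combines tables by matching key values
        - Concatenation stitches tables together along an axis, while merging combines the rows whose key values match (similar to a table join in MySQL)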
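
To make the difference concrete, here is a small sketch with made-up frames (`staff` and `hired` are illustrative names) in which the same employees appear in a different row order.

```python
import pandas as pd

# Illustrative data: the same keys appear in a different row order
staff = pd.DataFrame({'employee': ['bob', 'jake'],
                      'group': ['accounting', 'engineering']})
hired = pd.DataFrame({'employee': ['jake', 'bob'],
                      'hire_date': [2012, 2008]})

# concat aligns on the row index only; the 'employee' values are never compared
glued = pd.concat([staff, hired], axis=1)

# merge lines the rows up by the shared 'employee' column
joined = pd.merge(staff, hired)

print(glued)    # row 0 pairs bob with jake's hire date
print(joined)   # bob -> 2008, jake -> 2012
```

Note that `glued` ends up with two `employee` columns, because concat does not treat any column as a key.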

#### One-to-one merge

In [33]:
dh1 = DataFrame({
    'employee':['bob','jake','lisa','sue'],
    'group':['accounting','engineering','engineering','hr']
})

dh2 = DataFrame({
    'employee':['lisa','bob','jake','sue'],
    'hire_date':[2004,2008,2012,2014]
})
dh1

Unnamed: 0,employee,group
0,bob,accounting
1,jake,engineering
2,lisa,engineering
3,sue,hr


In [35]:
dh2

Unnamed: 0,employee,hire_date
0,lisa,2004
1,bob,2008
2,jake,2012
3,sue,2014


In [40]:
pd.merge(dh1,dh2,on='employee') # on names the key column to join on

Unnamed: 0,employee,group,hire_date
0,bob,accounting,2008
1,jake,engineering,2012
2,lisa,engineering,2004
3,sue,hr,2014


#### One-to-many merge

In [49]:
dh3 = DataFrame({
    'employee':['bob','jake','lisa','sue'],
    'group':['accounting','engineering','engineering','hr'],
    'hire_date':[2004,2008,2012,2014]
})

dh4 = DataFrame({
    'group':['accounting','accounting','engineering','hr'],
    'supervisor':[2004,2008,2012,2014]
})
dh3

Unnamed: 0,employee,group,hire_date
0,bob,accounting,2004
1,jake,engineering,2008
2,lisa,engineering,2012
3,sue,hr,2014


In [50]:
dh4

Unnamed: 0,group,supervisor
0,accounting,2004
1,accounting,2008
2,engineering,2012
3,hr,2014


In [51]:
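pd.merge(dh3,dh4,on='group') # inner join by default; on names the key column
# Q: Where does the one-to-many relationship show up?
# A: 'accounting' appears once in dh3 but twice in dh4, so bob's row is repeated once per match; 'engineering' appears twice in dh3 but once in dh4, so jake and lisa both pick up the same dh4 row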

Unnamed: 0,employee,group,hire_date,supervisor
0,bob,accounting,2004,2004
1,bob,accounting,2004,2008
2,jake,engineering,2008,2012
3,lisa,engineering,2012,2012
4,sue,hr,2014,2014
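
If the one-to-many shape is something you rely on, `pd.merge` can check it for you through its `validate` parameter. The sketch below uses made-up frames (`groups` and `members` are illustrative names) in which the key is unique on the left side only.

```python
import pandas as pd

# Illustrative data: each group appears once on the left, possibly many times on the right
groups = pd.DataFrame({'group': ['accounting', 'engineering'],
                       'floor': [1, 2]})
members = pd.DataFrame({'group': ['accounting', 'engineering', 'engineering'],
                        'employee': ['bob', 'jake', 'lisa']})

# validate='one_to_many' asserts that the key is unique in the left frame
result = pd.merge(groups, members, on='group', validate='one_to_many')
print(result)

# Swapping the frames breaks that assumption, so pandas raises MergeError
try:
    pd.merge(members, groups, on='group', validate='one_to_many')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)   # True
```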


#### Many-to-many merge

In [56]:
dh5 = DataFrame({
    'employee':['bob','jake','lisa'],
    'group':['engineering','accounting','engineering'],
})

dh6 = DataFrame({
    'group':['engineering','engineering','hr'],
    'supervisor':[2004,2008,2012]
})

In [58]:
pd.merge(dh5,dh6,how='outer') # outer join

Unnamed: 0,employee,group,supervisor
0,bob,engineering,2004.0
1,bob,engineering,2008.0
2,lisa,engineering,2004.0
3,lisa,engineering,2008.0
4,jake,accounting,
5,,hr,2012.0


In [59]:
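pd.merge(dh5,dh6,how='left') # left join
# Q: What is a left join?
# A: A left join keeps every row of the left table and attaches the matching rows of the right table; left rows with no match get NaN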

Unnamed: 0,employee,group,supervisor
0,bob,engineering,2004.0
1,bob,engineering,2008.0
2,jake,accounting,
3,lisa,engineering,2004.0
4,lisa,engineering,2008.0
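
To see which side each row of a join came from, `pd.merge` accepts `indicator=True`. The sketch below rebuilds dh5 and dh6 under the illustrative names `left` and `right` so that it is self-contained, and also shows `how='right'`, the mirror image of the left join.

```python
import pandas as pd

# Copies of dh5 and dh6, rebuilt here so the sketch runs on its own
left = pd.DataFrame({'employee': ['bob', 'jake', 'lisa'],
                     'group': ['engineering', 'accounting', 'engineering']})
right = pd.DataFrame({'group': ['engineering', 'engineering', 'hr'],
                      'supervisor': [2004, 2008, 2012]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
outer = pd.merge(left, right, how='outer', indicator=True)
print(outer[['employee', 'group', '_merge']])

# how='right' keeps every row of the right frame instead of the left one
right_join = pd.merge(left, right, how='right')
print(right_join)
```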


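#### When both tables share several column names, merge uses all of them as the key by default; when the key columns have different names, name them with left_on and right_on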

In [67]:
dh7 = DataFrame({
    'employee':['bob','jake','lisa'],
    'group':['engineering','accounting','engineering'],
})

dh8 = DataFrame({
    'name':['bob','jake','lisa'],
    'supervisor':[2004,2008,2012]
})

In [68]:
pd.merge(dh7,dh8,left_on='employee', right_on='name') # the key is 'employee' in the left table and 'name' in the right table

Unnamed: 0,employee,group,name,supervisor
0,bob,engineering,bob,2004
1,jake,accounting,jake,2008
2,lisa,engineering,lisa,2012
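
The case where two tables share more than one column name can be sketched as follows. The frames `t1`, `t2` and `people` and the `salary` column are made up for illustration.

```python
import pandas as pd

# Illustrative frames sharing two column names: 'employee' and 'group'
t1 = pd.DataFrame({'employee': ['bob', 'jake', 'lisa'],
                   'group': ['engineering', 'accounting', 'engineering'],
                   'salary': [10, 20, 30]})
t2 = pd.DataFrame({'employee': ['bob', 'jake', 'lisa'],
                   'group': ['engineering', 'accounting', 'hr'],
                   'supervisor': [2004, 2008, 2012]})

# No `on`: both shared columns form the key, so lisa (engineering vs hr) drops out
both_keys = pd.merge(t1, t2)

# on='employee': only that column is the key; the clashing 'group' columns get suffixes
one_key = pd.merge(t1, t2, on='employee')

# left_on/right_on keep both key columns; drop the duplicate if it is not needed
people = pd.DataFrame({'name': ['bob', 'jake', 'lisa'],
                       'supervisor': [2004, 2008, 2012]})
tidy = pd.merge(t1, people, left_on='employee', right_on='name').drop(columns='name')

print(both_keys)
print(one_key.columns.tolist())
print(tidy.columns.tolist())
```

Passing `on` explicitly is a reasonable habit whenever two tables might share more columns than the one you intend to join on.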


### Inner vs. outer merge: outer takes the union of the keys, inner takes the intersection
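
A minimal sketch of the union/intersection behaviour, using made-up frames `x` and `y` whose keys only partly overlap:

```python
import pandas as pd

# Illustrative frames: keys 'b' and 'c' are shared, 'a' and 'd' are not
x = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
y = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = pd.merge(x, y, on='key', how='inner')
outer = pd.merge(x, y, on='key', how='outer')

# Compare the surviving keys with plain set operations
print(set(inner['key']) == (set(x['key']) & set(y['key'])))   # True
print(set(outer['key']) == (set(x['key']) | set(y['key'])))   # True
```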