In [1]:
print("""
@File         : concatenating_pd.dataframe_objects.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2025-01-02 21:11:46
@Email        : cuixuanstephen@gmail.com
@Description  : Concatenating pd.DataFrame objects
""")


@File         : concatenating_pd.dataframe_objects.ipynb
@Author(s)    : Stephen CUI
@LastEditor(s): Stephen CUI
@CreatedTime  : 2025-01-02 21:11:46
@Email        : cuixuanstephen@gmail.com
@Description  : Concatenating pd.DataFrame objects



In [2]:
import pandas as pd
import numpy as np

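In pandas, the term *concatenation* refers to the process of taking two or more `pd.DataFrame` objects and stacking them in some way. Most commonly, pandas users perform what we would consider a vertical concatenation, which places the `pd.DataFrame` objects on top of one another: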


![Vertical concatenation of two pd.DataFrame objects](../../IMAGES/FIG7-1.png)

However, pandas can also take `pd.DataFrame` objects and stack them side by side, a process referred to as horizontal concatenation:

![Horizontal concatenation of two pd.DataFrame objects](../../IMAGES/FIG7-2.png)

In [3]:
df_q1 = pd.DataFrame([
    ["AAPL", 100., 50., 75.],
    ["MSFT", 80., 42., 62.],
    ["AMZN", 60., 100., 120.],
], columns=["ticker", "shares", "low", "high"])
df_q1 = df_q1.convert_dtypes(dtype_backend="numpy_nullable")
df_q1

,ticker,shares,low,high
0,AAPL,100,50,75
1,MSFT,80,42,62
2,AMZN,60,100,120


In [4]:
df_q2 = pd.DataFrame([
    ["AAPL", 80., 70., 80., 77.],
    ["MSFT", 90., 50., 60., 55.],
    ["IBM", 100., 60., 70., 64.],
    ["GE", 42., 30., 50., 44.],
], columns=["ticker", "shares", "low", "high", "close"])
df_q2 = df_q2.convert_dtypes(dtype_backend="numpy_nullable")
df_q2

,ticker,shares,low,high,close
0,AAPL,80,70,80,77
1,MSFT,90,50,60,55
2,IBM,100,60,70,64
3,GE,42,30,50,44


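The most basic call to `pd.concat` accepts the two `pd.DataFrame` objects in a list. By default, this stacks the objects vertically, i.e., the first `pd.DataFrame` is simply stacked on top of the second. Because `df_q1` has no `close` column, its rows show a missing value in that column of the result.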

In [5]:
pd.concat([df_q1, df_q2])

,ticker,shares,low,high,close
0,AAPL,100,50,75,
1,MSFT,80,42,62,
2,AMZN,60,100,120,
0,AAPL,80,70,80,77.0
1,MSFT,90,50,60,55.0
2,IBM,100,60,70,64.0
3,GE,42,30,50,44.0


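Note the row index pandas gives in the result. In essence, pandas takes the index values of `df_q1`, which range from 0 to 2, and then the index values of `df_q2`, which range from 0 to 3. When creating the new row index, pandas simply retains those values and stacks them vertically in the result. If you do not want this behavior, you can pass `ignore_index=True` to `pd.concat`: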

In [6]:
pd.concat([df_q1, df_q2], ignore_index=True)

,ticker,shares,low,high,close
0,AAPL,100,50,75,
1,MSFT,80,42,62,
2,AMZN,60,100,120,
3,AAPL,80,70,80,77.0
4,MSFT,90,50,60,55.0
5,IBM,100,60,70,64.0
6,GE,42,30,50,44.0
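
The repeated row labels discussed above can also be detected rather than renumbered. The following is a minimal sketch, assuming smaller stand-in frames than the ones in this notebook; it uses the `verify_integrity=True` argument of `pd.concat`, which raises a `ValueError` when the resulting index would contain duplicates.

```python
import pandas as pd

# Smaller stand-ins for df_q1 and df_q2 (illustrative data only).
df_q1 = pd.DataFrame(
    [["AAPL", 100.0, 50.0, 75.0], ["MSFT", 80.0, 42.0, 62.0]],
    columns=["ticker", "shares", "low", "high"],
)
df_q2 = pd.DataFrame(
    [["AAPL", 80.0, 70.0, 80.0], ["IBM", 100.0, 60.0, 70.0]],
    columns=["ticker", "shares", "low", "high"],
)

# Both inputs carry the default labels 0 and 1, so the stacked index repeats them.
stacked = pd.concat([df_q1, df_q2])
has_duplicates = stacked.index.has_duplicates

# verify_integrity=True turns that silent duplication into an error.
try:
    pd.concat([df_q1, df_q2], verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# ignore_index=True sidesteps the problem by generating a fresh 0..n-1 index.
renumbered = pd.concat([df_q1, df_q2], ignore_index=True)
print(has_duplicates, raised, list(renumbered.index))
```

Whether to renumber or to fail loudly depends on whether the original row labels carry meaning; for default integer labels like these, renumbering is usually the simpler choice.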


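Another potential issue is that we can no longer see which `pd.DataFrame` each record originally came from. To retain that information, we can pass the `keys=` argument, providing custom labels that denote the source of our data: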

In [7]:
pd.concat([df_q1, df_q2], keys=['q1', 'q2'])

,,ticker,shares,low,high,close
q1,0,AAPL,100,50,75,
q1,1,MSFT,80,42,62,
q1,2,AMZN,60,100,120,
q2,0,AAPL,80,70,80,77.0
q2,1,MSFT,90,50,60,55.0
q2,2,IBM,100,60,70,64.0
q2,3,GE,42,30,50,44.0
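
Once the source labels are part of the row index, they can be used to pull one input's rows back out. The sketch below assumes small stand-in frames; the level names `"quarter"` and `"row"` are illustrative choices, not names used elsewhere in this notebook. It also shows that passing a mapping to `pd.concat` uses the mapping's keys as the labels.

```python
import pandas as pd

# Small stand-ins for df_q1 and df_q2 (illustrative data only).
df_q1 = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "shares": [100, 80]})
df_q2 = pd.DataFrame({"ticker": ["AAPL", "IBM", "GE"], "shares": [80, 100, 42]})

# names= labels the two index levels created by keys=.
labeled = pd.concat([df_q1, df_q2], keys=["q1", "q2"], names=["quarter", "row"])

# Selecting on the outer level recovers the rows that came from one input.
only_q2 = labeled.loc["q2"]

# A mapping can replace the list plus keys=; its keys become the labels.
from_mapping = pd.concat({"q1": df_q1, "q2": df_q2}, names=["quarter", "row"])

print(only_q2)
```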


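In addition to the default vertical stacking behavior, we can pass `axis=1` (equivalently `axis='columns'`, as used here) to stack things horizontally: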

In [9]:
pd.concat([df_q1, df_q2], keys=['q1', 'q2'], axis='columns')

,q1,q1,q1,q1,q2,q2,q2,q2,q2
,ticker,shares,low,high,ticker,shares,low,high,close
0,AAPL,100.0,50.0,75.0,AAPL,80,70,80,77
1,MSFT,80.0,42.0,62.0,MSFT,90,50,60,55
2,AMZN,60.0,100.0,120.0,IBM,100,60,70,64
3,,,,,GE,42,30,50,44


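> pandas aligns on the values of the row index, not on any other column (such as `ticker`) that we may actually be interested in. If we want `pd.concat` to align by ticker symbol, we can set that column as the row index of both `pd.DataFrame` objects before concatenating: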

In [10]:
pd.concat(
    [df_q1.set_index('ticker'),
     df_q2.set_index('ticker')],
    axis='columns',
    keys=['q1', 'q2']
)

,q1,q1,q1,q2,q2,q2,q2
,shares,low,high,shares,low,high,close
ticker,,,,,,,
AAPL,100.0,50.0,75.0,80.0,70.0,80.0,77.0
MSFT,80.0,42.0,62.0,90.0,50.0,60.0,55.0
AMZN,60.0,100.0,120.0,,,,
IBM,,,,100.0,60.0,70.0,64.0
GE,,,,42.0,30.0,50.0,44.0
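
To make the point about label-based alignment concrete, here is a minimal sketch in which the two inputs list the same tickers in a different order. The frames `left` and `right` and their column names are assumptions made up for this illustration. If `pd.concat` matched rows by position, the values would be paired incorrectly; because it matches on index labels, each ticker's values end up on the same row.

```python
import pandas as pd

# Same tickers, different row order (illustrative data only).
left = pd.DataFrame({"shares_q1": [100, 80]}, index=["AAPL", "MSFT"])
right = pd.DataFrame({"shares_q2": [90, 80]}, index=["MSFT", "AAPL"])

# Horizontal concatenation pairs rows by index label, not by position.
aligned = pd.concat([left, right], axis="columns")

aapl_row = aligned.loc["AAPL"].tolist()
msft_row = aligned.loc["MSFT"].tolist()
print(aligned)
```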


One last thing we might want to control about the alignment behavior is how it treats labels that appear in at least one, but not all, of the objects being concatenated. By default, `pd.concat` performs an "outer" join, which takes all of the index values (in our case, the ticker symbols) and shows them in the output, using a missing value indicator where applicable. Passing `join="inner"`, by contrast, shows only the index labels that appear in all of the objects being concatenated:

In [12]:
pd.concat([
    df_q1.set_index('ticker'),
    df_q2.set_index('ticker')
], axis='columns', join='inner', keys=['q1', 'q2'])

,q1,q1,q1,q2,q2,q2,q2
,shares,low,high,shares,low,high,close
ticker,,,,,,,
AAPL,100,50,75,80,70,80,77
MSFT,80,42,62,90,50,60,55
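
One way to think about the two join modes is in terms of sets of index labels: an outer join keeps the union, an inner join keeps the intersection. The sketch below checks that reading on reduced, single-column stand-ins for the quarterly frames; `q1` and `q2` here are illustrative names, not objects defined earlier in this notebook.

```python
import pandas as pd

# Single-column stand-ins, already indexed by ticker (illustrative data only).
q1 = pd.DataFrame({"shares": [100, 80, 60]}, index=["AAPL", "MSFT", "AMZN"])
q2 = pd.DataFrame({"shares": [80, 90, 100, 42]}, index=["AAPL", "MSFT", "IBM", "GE"])

outer = pd.concat([q1, q2], axis="columns", keys=["q1", "q2"])
inner = pd.concat([q1, q2], axis="columns", keys=["q1", "q2"], join="inner")

# Outer keeps every label seen in any input; inner keeps only the shared ones.
outer_labels = set(outer.index)
inner_labels = set(inner.index)

print(sorted(outer_labels))
print(sorted(inner_labels))
```

Only the outer result can contain missing values introduced by alignment, since every label in the inner result exists in all inputs.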


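> **`pd.concat` is an expensive operation and should never be called from within a Python loop**. Each call copies the data accumulated so far, so calling it repeatedly redoes work that a single call would do once. If you create a bunch of `pd.DataFrame` objects in a loop and ultimately want to concatenate them, it is better to store them in a sequence first and call `pd.concat` only once, after the sequence has been fully populated.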

In [44]:
%%time

concatenated_dfs = df_q1
for i in range(1_000):
    concatenated_dfs = pd.concat([concatenated_dfs, df_q1])
    
print(f'Final pd.DataFrame shape is {concatenated_dfs.shape}')

Final pd.DataFrame shape is (3003, 4)
CPU times: total: 281 ms
Wall time: 354 ms


In [45]:
%%time

accumulated = [df_q1]
for i in range(1000):
    accumulated.append(df_q1)
    
concatenated_dfs = pd.concat(accumulated)
print(f'Final pd.DataFrame shape is {concatenated_dfs.shape}')

Final pd.DataFrame shape is (3003, 4)
CPU times: total: 62.5 ms
Wall time: 91.3 ms


In [46]:
%%time

concatenated_dfs = pd.concat([df_q1 for i in range(1001)])
print(f'Final pd.DataFrame shape is {concatenated_dfs.shape}')

Final pd.DataFrame shape is (3003, 4)
CPU times: total: 15.6 ms
Wall time: 60.9 ms
