# DataFrame Concatenation in Pandas

In this notebook, we'll explore various ways to concatenate DataFrames using pandas. We'll cover:
1. Basic DataFrame concatenation using `concat()`
2. Using the `ignore_index` argument
3. Using the `keys` argument
4. Understanding the `axis` parameter
5. Joining DataFrames with Series

In [49]:
# Import required libraries
import pandas as pd
import numpy as np
import os

def save_dataframe(df, filename):
    """
    Save DataFrame to a CSV file, overwriting if it already exists
    
    Parameters:
    df : pandas DataFrame
        The DataFrame to save
    filename : str
        The filename to save to (without path)
    """
    # Define the output directory
    output_dir = 'concat_outputs'
    
    # Create the output directory if it doesn't already exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Full path for the file
    filepath = os.path.join(output_dir, filename)
    
    # Save the DataFrame, overwriting if exists
    df.to_csv(filepath, index=True)
    print(f"DataFrame saved to {filepath}")

In [50]:
# Read the sample dataframes
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

print("DataFrame 1:")
print(df1.head())
print("\nDataFrame 2:")
print(df2.head())

DataFrame 1:
   id    value1 category
0   1  1.764052        C
1   2  0.400157        C
2   3  0.978738        A
3   4  2.240893        B
4   5  1.867558        C

DataFrame 2:
      id     value2            timestamp
0  25001  98.599691  2021-01-01 00:00:00
1  25002  10.099892  2021-01-01 00:01:00
2  25003   4.311686  2021-01-01 00:02:00
3  25004  87.741601  2021-01-01 00:03:00
4  25005  65.238450  2021-01-01 00:04:00


## 1. Basic DataFrame Concatenation

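Let's start with basic concatenation using `pd.concat()`. By default, it stacks DataFrames vertically (along `axis=0`) and keeps the union of their columns. A column that is missing from one DataFrame is filled with `NaN` for that DataFrame's rows, which is why `value2` and `timestamp` are `NaN` in the first ten rows below: they all come from `df1`.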

In [51]:
# Basic concatenation
result = pd.concat([df1, df2])
print("Basic concatenation result:")
print(result.head(10))

# Save the result
save_dataframe(result, 'basic_concat.csv')

Basic concatenation result:
   id    value1 category  value2 timestamp
0   1  1.764052        C     NaN       NaN
1   2  0.400157        C     NaN       NaN
2   3  0.978738        A     NaN       NaN
3   4  2.240893        B     NaN       NaN
4   5  1.867558        C     NaN       NaN
5   6 -0.977278        D     NaN       NaN
6   7  0.950088        B     NaN       NaN
7   8 -0.151357        A     NaN       NaN
8   9 -0.103219        A     NaN       NaN
9  10  0.410599        B     NaN       NaN
DataFrame saved to concat_outputs\basic_concat.csv
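
To see the column union and the `NaN` filling at a glance, here is a minimal sketch on two tiny frames. The names `left`, `right`, and `stacked` and their values are made up for illustration and are not the `df1`/`df2` files loaded above.

```python
import pandas as pd

# Hypothetical toy frames (not the df1/df2 loaded from CSV above)
left = pd.DataFrame({"id": [1, 2], "value1": [0.5, 1.5]})
right = pd.DataFrame({"id": [3, 4], "value2": [10.0, 20.0]})

# Default behaviour: axis=0 (stack rows), join='outer' (union of columns)
stacked = pd.concat([left, right])

# Columns are the union of both inputs, in order of first appearance
print(list(stacked.columns))              # ['id', 'value1', 'value2']

# Each input keeps its own index labels, so 0 and 1 both appear twice
print(list(stacked.index))                # [0, 1, 0, 1]

# Cells with no source value are filled with NaN
print(stacked["value2"].isna().tolist())  # [True, True, False, False]
```

The repeated index labels in this sketch are the reason the next section introduces `ignore_index`.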


## 2. Using ignore_index

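By default, `concat()` keeps each input's original index, so the result can contain duplicate index labels (here both DataFrames start at 0). Passing `ignore_index=True` discards the original labels and assigns a fresh index running from 0 to n-1.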

In [52]:
# Concatenation with ignore_index
result_ignore_index = pd.concat([df1, df2], ignore_index=True)
print("Concatenation with ignore_index=True:")
print(result_ignore_index.head(10))

# Save the result
save_dataframe(result_ignore_index, 'concat_ignore_index.csv')

Concatenation with ignore_index=True:
   id    value1 category  value2 timestamp
0   1  1.764052        C     NaN       NaN
1   2  0.400157        C     NaN       NaN
2   3  0.978738        A     NaN       NaN
3   4  2.240893        B     NaN       NaN
4   5  1.867558        C     NaN       NaN
5   6 -0.977278        D     NaN       NaN
6   7  0.950088        B     NaN       NaN
7   8 -0.151357        A     NaN       NaN
8   9 -0.103219        A     NaN       NaN
9  10  0.410599        B     NaN       NaN
DataFrame saved to concat_outputs\concat_ignore_index.csv
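
Because the first ten rows look identical with and without `ignore_index`, the difference is easier to see on small inputs. The sketch below uses hypothetical frames `a` and `b` and compares the two results directly.

```python
import pandas as pd

# Hypothetical toy frames, each with the default index 0, 1
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([a, b])                      # original labels preserved
fresh = pd.concat([a, b], ignore_index=True)  # labels replaced by 0..n-1

print(kept.index.is_unique)           # False
print(list(fresh.index))              # [0, 1, 2, 3]

# With duplicate labels, a label lookup returns every matching row
print(kept.loc[0, "x"].tolist())      # [1, 3]

# Resetting the index afterwards gives the same result as ignore_index=True
print(kept.reset_index(drop=True).equals(fresh))  # True
```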


## 3. Using keys

The `keys` argument allows you to create a hierarchical index (MultiIndex) when concatenating DataFrames. This is useful for identifying which DataFrame each row came from.

In [53]:
# Concatenation with keys
result_keys = pd.concat([df1, df2], keys=['First', 'Second'])
print("Concatenation with keys:")
print(result_keys.head(10))

# Access data from a specific key
print("\nData from 'First' DataFrame:")
print(result_keys.loc['First'].head())

# Save the result
save_dataframe(result_keys, 'concat_with_keys.csv')

Concatenation with keys:
         id    value1 category  value2 timestamp
First 0   1  1.764052        C     NaN       NaN
      1   2  0.400157        C     NaN       NaN
      2   3  0.978738        A     NaN       NaN
      3   4  2.240893        B     NaN       NaN
      4   5  1.867558        C     NaN       NaN
      5   6 -0.977278        D     NaN       NaN
      6   7  0.950088        B     NaN       NaN
      7   8 -0.151357        A     NaN       NaN
      8   9 -0.103219        A     NaN       NaN
      9  10  0.410599        B     NaN       NaN

Data from 'First' DataFrame:
   id    value1 category  value2 timestamp
0   1  1.764052        C     NaN       NaN
1   2  0.400157        C     NaN       NaN
2   3  0.978738        A     NaN       NaN
3   4  2.240893        B     NaN       NaN
4   5  1.867558        C     NaN       NaN
DataFrame saved to concat_outputs\concat_with_keys.csv
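
The outer level created by `keys` can also be named and queried. The following is a small sketch with hypothetical frames `jan` and `feb`; the level names `month` and `row` are chosen here for illustration and are passed through the `names` argument of `pd.concat()`.

```python
import pandas as pd

# Hypothetical toy frames
jan = pd.DataFrame({"sales": [10, 20]})
feb = pd.DataFrame({"sales": [30, 40]})

# keys labels each input; names labels the resulting index levels
combined = pd.concat([jan, feb], keys=["jan", "feb"], names=["month", "row"])

print(combined.index.nlevels)                         # 2

# Select every row that came from one input
print(combined.loc["feb"]["sales"].tolist())          # [30, 40]

# Cross-section: the row labelled 1 from every input
print(combined.xs(1, level="row")["sales"].tolist())  # [20, 40]

# Turn the source label into an ordinary column
flat = combined.reset_index(level="month")
print(list(flat.columns))                             # ['month', 'sales']
```

Moving the key into a column with `reset_index` can be convenient before saving to CSV, since a flat table is simpler to read back than a MultiIndex.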


## 4. Using axis argument

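The `axis` parameter determines whether to concatenate along rows (`axis=0`, the default) or columns (`axis=1`). With `axis=1`, rows are aligned on their index labels and the columns are placed side by side, so a column name shared by both inputs (such as `id` here) appears twice in the result. Let's see both examples.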

In [54]:
# Concatenation along rows (axis=0, default)
result_axis0 = pd.concat([df1, df2], axis=0)
print("Concatenation along rows (axis=0):")
print(result_axis0.head(10))

# Save the result for axis=0
save_dataframe(result_axis0, 'concat_axis0.csv')

# Concatenation along columns (axis=1)
result_axis1 = pd.concat([df1, df2], axis=1)
print("\nConcatenation along columns (axis=1):")
print(result_axis1.head(10))

# Save the result for axis=1
save_dataframe(result_axis1, 'concat_axis1.csv')

Concatenation along rows (axis=0):
   id    value1 category  value2 timestamp
0   1  1.764052        C     NaN       NaN
1   2  0.400157        C     NaN       NaN
2   3  0.978738        A     NaN       NaN
3   4  2.240893        B     NaN       NaN
4   5  1.867558        C     NaN       NaN
5   6 -0.977278        D     NaN       NaN
6   7  0.950088        B     NaN       NaN
7   8 -0.151357        A     NaN       NaN
8   9 -0.103219        A     NaN       NaN
9  10  0.410599        B     NaN       NaN
DataFrame saved to concat_outputs\concat_axis0.csv

Concatenation along columns (axis=1):
   id    value1 category     id     value2            timestamp
0   1  1.764052        C  25001  98.599691  2021-01-01 00:00:00
1   2  0.400157        C  25002  10.099892  2021-01-01 00:01:00
2   3  0.978738        A  25003   4.311686  2021-01-01 00:02:00
3   4  2.240893        B  25004  87.741601  2021-01-01 00:03:00
4   5  1.867558        C  25005  65.238450  2021-01-01 00:04:00
5   6 -0.977278   
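
As a small illustration of column-wise concatenation, the sketch below uses two hypothetical frames of different lengths (`left` and `right`, not the data above). It shows the repeated `id` column, the `NaN` padding for index labels present in only one input, and one possible way to disambiguate the columns with `keys`.

```python
import pandas as pd

# Hypothetical toy frames with different lengths
left = pd.DataFrame({"id": [1, 2, 3], "a": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"id": [7, 8], "b": [70, 80]})

wide = pd.concat([left, right], axis=1)

# The shared column name is not merged; it simply appears twice
print(list(wide.columns))          # ['id', 'a', 'id', 'b']
print(wide.shape)                  # (3, 4)

# Index label 2 exists only in `left`, so `right`'s columns are NaN there
print(wide["b"].isna().tolist())   # [False, False, True]

# With axis=1, keys become an outer column level that tells the inputs apart
labeled = pd.concat([left, right], axis=1, keys=["left", "right"])
print(labeled.columns.tolist())
# [('left', 'id'), ('left', 'a'), ('right', 'id'), ('right', 'b')]
```

Note that `axis=1` aligns on index labels only. It does not match rows on the values of a column such as `id`; that is a job for `merge`.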

## 5. Joining DataFrame with Series

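You can also concatenate a DataFrame with a Series. With `axis=1` the Series becomes a new column, aligned on index labels. Note that the Series built below is indexed by category (`A` to `D`) while `df1` uses an integer index, so no labels match: `category_mean` is `NaN` for every `df1` row shown, the Series values land in extra rows labelled `A` to `D`, and `id` is upcast to float because those extra rows have no `id`.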

In [55]:
# Create a Series holding the mean of value1 for each category
series = df1.groupby('category')['value1'].mean()
print("Series created from DataFrame:")
print(series)

# Concatenate DataFrame with Series
result_with_series = pd.concat([df1, series], axis=1)
result_with_series.columns = ['id', 'value1', 'category', 'category_mean']
print("\nDataFrame concatenated with Series:")
print(result_with_series.head(10))

# Save the result
save_dataframe(result_with_series, 'concat_with_series.csv')

Series created from DataFrame:
category
A   -0.000972
B   -0.013369
C   -0.002833
D    0.001889
Name: value1, dtype: float64

DataFrame concatenated with Series:
     id    value1 category  category_mean
0   1.0  1.764052        C            NaN
1   2.0  0.400157        C            NaN
2   3.0  0.978738        A            NaN
3   4.0  2.240893        B            NaN
4   5.0  1.867558        C            NaN
5   6.0 -0.977278        D            NaN
6   7.0  0.950088        B            NaN
7   8.0 -0.151357        A            NaN
8   9.0 -0.103219        A            NaN
9  10.0  0.410599        B            NaN
DataFrame saved to concat_outputs\concat_with_series.csv
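
The `NaN` column above comes from index alignment: the Series is indexed by category, the DataFrame by row number. If the goal is a per-row category mean, one possible approach (a sketch on a hypothetical toy frame `df`, not necessarily what this notebook intended) is to build a Series that shares the DataFrame's index with `groupby(...).transform("mean")` before concatenating.

```python
import pandas as pd

# Hypothetical toy frame standing in for df1
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value1": [1.0, 3.0, 10.0, 20.0],
    "category": ["A", "A", "B", "B"],
})

# transform("mean") returns one value per original row, indexed like df
category_mean = (
    df.groupby("category")["value1"].transform("mean").rename("category_mean")
)

# The indexes now match, so axis=1 lines the values up row by row
result = pd.concat([df, category_mean], axis=1)
print(result["category_mean"].tolist())  # [2.0, 2.0, 15.0, 15.0]
print(result["id"].dtype)                # int64 (no extra rows, no upcast)

# Alternative: map each row's category onto the per-category means
mapped = df["category"].map(df.groupby("category")["value1"].mean())
print(mapped.tolist())                   # [2.0, 2.0, 15.0, 15.0]
```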


## Additional concat() Parameters

Here are some other useful parameters of `pd.concat()`:

- `join`: {'inner', 'outer'}, default 'outer' - How to handle indexes on the other axis (union for 'outer', intersection for 'inner')
- `join_axes`: List of Index objects - Removed in pandas 1.0; reindex the result (or the inputs) instead
- `verify_integrity`: Boolean, default False - Raise a `ValueError` if the new concatenation axis contains duplicates
- `sort`: Boolean, default False - Sort the non-concatenation axis
- `copy`: Boolean - Whether to copy data

Example with some of these parameters:

In [56]:
# Example with additional parameters
result_advanced = pd.concat(
    [df1, df2],
    axis=0,
    join='outer',
    ignore_index=True,
    verify_integrity=False,
    sort=True
)

print("Concatenation with additional parameters:")
print(result_advanced.head(10))

# Save the result
save_dataframe(result_advanced, 'concat_advanced.csv')

Concatenation with additional parameters:
  category  id timestamp    value1  value2
0        C   1       NaN  1.764052     NaN
1        C   2       NaN  0.400157     NaN
2        A   3       NaN  0.978738     NaN
3        B   4       NaN  2.240893     NaN
4        C   5       NaN  1.867558     NaN
5        D   6       NaN -0.977278     NaN
6        B   7       NaN  0.950088     NaN
7        A   8       NaN -0.151357     NaN
8        A   9       NaN -0.103219     NaN
9        B  10       NaN  0.410599     NaN
DataFrame saved to concat_outputs\concat_advanced.csv
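
The example above uses the default-like settings, so the effect of each parameter is hard to see. The sketch below, on hypothetical frames `a` and `b`, exercises `join='inner'`, `verify_integrity=True`, and `sort=True` one at a time.

```python
import pandas as pd

# Hypothetical toy frames sharing only the `id` column
a = pd.DataFrame({"id": [1, 2], "value1": [0.1, 0.2]})
b = pd.DataFrame({"id": [3, 4], "value2": [9.0, 8.0]})

# join='inner' keeps only the columns present in every input
inner = pd.concat([a, b], join="inner", ignore_index=True)
print(list(inner.columns))          # ['id']

# verify_integrity=True raises when the resulting index has duplicates
try:
    pd.concat([a, b], verify_integrity=True)
    duplicate_detected = False
except ValueError as exc:
    duplicate_detected = True
    print("Duplicate index labels:", exc)

# sort=True orders the non-concatenation axis (here, the column labels)
unsorted_cols = pd.concat([b, a])
sorted_cols = pd.concat([b, a], sort=True)
print(list(unsorted_cols.columns))  # ['id', 'value2', 'value1']
print(list(sorted_cols.columns))    # ['id', 'value1', 'value2']
```

Combining `verify_integrity=True` with `ignore_index=True` would never raise, because the fresh index is unique by construction.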
