# Merging and Joining DataFrames in Pandas

This notebook covers three ways to combine DataFrames in pandas: `merge()`, `join()`, and `concat()`.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.2.3
NumPy version: 2.2.4


## Sample Data

Let's create sample DataFrames to demonstrate merging and joining operations.

In [2]:
# Create sample DataFrames
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'DepartmentID': [101, 102, 101, 103, 102]
})

departments = pd.DataFrame({
    'DepartmentID': [101, 102, 103, 104],
    'DepartmentName': ['HR', 'IT', 'Finance', 'Marketing'],
    'Location': ['New York', 'London', 'Paris', 'Tokyo']
})

salaries = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4, 6],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Bonus': [5000, 6000, 7000, 5500, 6500]
})

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
print("\nSalaries DataFrame:")
print(salaries)

Employees DataFrame:
   EmployeeID     Name  DepartmentID
0           1    Alice           101
1           2      Bob           102
2           3  Charlie           101
3           4    Diana           103
4           5      Eve           102

Departments DataFrame:
   DepartmentID DepartmentName  Location
0           101             HR  New York
1           102             IT    London
2           103        Finance     Paris
3           104      Marketing     Tokyo

Salaries DataFrame:
   EmployeeID  Salary  Bonus
0           1   50000   5000
1           2   60000   6000
2           3   70000   7000
3           4   55000   5500
4           6   65000   6500


## Merge Operations

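The `merge()` function combines DataFrames on common columns or indices, much like a SQL join. The `how` argument selects the join type (`'inner'`, `'left'`, `'right'`, or `'outer'`) and defaults to `'inner'`.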
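
To see which rows each join type keeps or drops, it can help to run an outer merge with `indicator=True`, which adds a `_merge` column recording where each key was found. The sketch below uses small hypothetical tables (`people`, `depts`) rather than the notebook's sample data.

```python
import pandas as pd

# Hypothetical miniature tables, used only for this illustration
people = pd.DataFrame({
    "DepartmentID": [101, 102, 105],
    "Name": ["Ann", "Ben", "Cy"],
})
depts = pd.DataFrame({
    "DepartmentID": [101, 102, 104],
    "DepartmentName": ["HR", "IT", "Marketing"],
})

# indicator=True adds a `_merge` column with the values
# 'left_only', 'right_only', or 'both' for every row
audit = pd.merge(people, depts, on="DepartmentID", how="outer", indicator=True)
print(audit)

# Rows not marked 'both' are exactly the ones an inner join would drop
unmatched = audit[audit["_merge"] != "both"]
print(unmatched)
```

Filtering on `_merge` is a quick way to check for unmatched keys before choosing a join type.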

In [3]:
# Inner join (default)
print("Inner Join - Employees and Departments:")
inner_join = pd.merge(employees, departments, on='DepartmentID', how='inner')
print(inner_join)

# Left join
print("\nLeft Join - Employees and Departments:")
left_join = pd.merge(employees, departments, on='DepartmentID', how='left')
print(left_join)

# Right join
print("\nRight Join - Employees and Departments:")
right_join = pd.merge(employees, departments, on='DepartmentID', how='right')
print(right_join)

# Outer join
print("\nOuter Join - Employees and Departments:")
outer_join = pd.merge(employees, departments, on='DepartmentID', how='outer')
print(outer_join)

# Left merge of employees and salaries on the shared EmployeeID column
print("\nMerge Employees and Salaries:")
emp_salary = pd.merge(employees, salaries, on='EmployeeID', how='left')
print(emp_salary)

Inner Join - Employees and Departments:
   EmployeeID     Name  DepartmentID DepartmentName  Location
0           1    Alice           101             HR  New York
1           2      Bob           102             IT    London
2           3  Charlie           101             HR  New York
3           4    Diana           103        Finance     Paris
4           5      Eve           102             IT    London

Left Join - Employees and Departments:
   EmployeeID     Name  DepartmentID DepartmentName  Location
0           1    Alice           101             HR  New York
1           2      Bob           102             IT    London
2           3  Charlie           101             HR  New York
3           4    Diana           103        Finance     Paris
4           5      Eve           102             IT    London

Right Join - Employees and Departments:
   EmployeeID     Name  DepartmentID DepartmentName  Location
0         1.0    Alice           101             HR  New York
1         3

## Join Operations

The `join()` method combines DataFrames on their index. It is similar to `merge()`, but it aligns on the index by default and performs a left join unless `how` says otherwise.
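
The relationship between `join()` and `merge()` can be checked directly. The sketch below assumes two small hypothetical tables (`names`, `pay`) indexed by `EmployeeID`. It compares an index-based `join()` with the matching `merge()` call, then shows the `on=` form, which matches a column of the calling DataFrame against the index of the other one.

```python
import pandas as pd

# Hypothetical miniature tables indexed by EmployeeID
names = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Eve"]},
    index=pd.Index([1, 2, 5], name="EmployeeID"),
)
pay = pd.DataFrame(
    {"Salary": [50000, 60000, 65000]},
    index=pd.Index([1, 2, 6], name="EmployeeID"),
)

# Index-on-index left join, written two ways
via_join = names.join(pay, how="left")
via_merge = pd.merge(names, pay, left_index=True, right_index=True, how="left")
print(via_join.equals(via_merge))

# join(on=...) matches a column of the caller against the other frame's index
names_col = names.reset_index()
via_on = names_col.join(pay, on="EmployeeID")
print(via_on)
```

`join()` is the shorter spelling when both tables are already indexed by the key, while `merge()` is more flexible when the keys live in columns.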

In [4]:
# Set EmployeeID as index for join examples
employees_idx = employees.set_index('EmployeeID')
salaries_idx = salaries.set_index('EmployeeID')

print("Employees with EmployeeID as index:")
print(employees_idx)
print("\nSalaries with EmployeeID as index:")
print(salaries_idx)

# Left join using join()
print("\nLeft Join using join():")
joined_left = employees_idx.join(salaries_idx, how='left')
print(joined_left)

# Inner join using join()
print("\nInner Join using join():")
joined_inner = employees_idx.join(salaries_idx, how='inner')
print(joined_inner)

# Outer join using join()
print("\nOuter Join using join():")
joined_outer = employees_idx.join(salaries_idx, how='outer')
print(joined_outer)

Employees with EmployeeID as index:
               Name  DepartmentID
EmployeeID                       
1             Alice           101
2               Bob           102
3           Charlie           101
4             Diana           103
5               Eve           102

Salaries with EmployeeID as index:
            Salary  Bonus
EmployeeID               
1            50000   5000
2            60000   6000
3            70000   7000
4            55000   5500
6            65000   6500

Left Join using join():
               Name  DepartmentID   Salary   Bonus
EmployeeID                                        
1             Alice           101  50000.0  5000.0
2               Bob           102  60000.0  6000.0
3           Charlie           101  70000.0  7000.0
4             Diana           103  55000.0  5500.0
5               Eve           102      NaN     NaN

Inner Join using join():
               Name  DepartmentID  Salary  Bonus
EmployeeID                                      
1 

## Concatenation

The `concat()` function stacks DataFrames along rows (`axis=0`, the default) or along columns (`axis=1`).
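
Two `concat()` options are worth knowing when stacking rows: `ignore_index=True` replaces the original row labels with a fresh 0..n-1 index, and `join` controls what happens to columns that are not shared. The sketch below uses hypothetical tables (`q1`, `q2`), where `q2` has an extra `Team` column.

```python
import pandas as pd

# Hypothetical quarterly tables; q2 carries an extra column
q1 = pd.DataFrame({"EmployeeID": [1, 2], "Name": ["Alice", "Bob"]})
q2 = pd.DataFrame({
    "EmployeeID": [4, 5],
    "Name": ["Diana", "Eve"],
    "Team": ["A", "B"],
})

# Default join='outer': keep every column, fill missing values with NaN;
# ignore_index=True renumbers the rows instead of repeating 0, 1
stacked = pd.concat([q1, q2], ignore_index=True)
print(stacked)

# join='inner': keep only the columns present in every input
shared = pd.concat([q1, q2], ignore_index=True, join="inner")
print(shared)
```

Without `ignore_index=True`, the result keeps each input's original row labels, so labels can repeat.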

In [5]:
# Create additional DataFrames for concatenation
employees_q1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Quarter': ['Q1', 'Q1', 'Q1']
})

employees_q2 = pd.DataFrame({
    'EmployeeID': [4, 5, 6],
    'Name': ['Diana', 'Eve', 'Frank'],
    'Quarter': ['Q2', 'Q2', 'Q2']
})

print("Q1 Employees:")
print(employees_q1)
print("\nQ2 Employees:")
print(employees_q2)

# Concatenate along rows (default)
print("\nConcatenated DataFrames (rows):")
concat_rows = pd.concat([employees_q1, employees_q2])
print(concat_rows)

# Concatenate along columns
print("\nConcatenated DataFrames (columns):")
concat_cols = pd.concat([employees_q1, employees_q2], axis=1)
print(concat_cols)

# Concatenate with keys for hierarchical indexing
print("\nConcatenated with keys:")
concat_keys = pd.concat([employees_q1, employees_q2], keys=['Q1', 'Q2'])
print(concat_keys)

Q1 Employees:
   EmployeeID     Name Quarter
0           1    Alice      Q1
1           2      Bob      Q1
2           3  Charlie      Q1

Q2 Employees:
   EmployeeID   Name Quarter
0           4  Diana      Q2
1           5    Eve      Q2
2           6  Frank      Q2

Concatenated DataFrames (rows):
   EmployeeID     Name Quarter
0           1    Alice      Q1
1           2      Bob      Q1
2           3  Charlie      Q1
0           4    Diana      Q2
1           5      Eve      Q2
2           6    Frank      Q2

Concatenated DataFrames (columns):
   EmployeeID     Name Quarter  EmployeeID   Name Quarter
0           1    Alice      Q1           4  Diana      Q2
1           2      Bob      Q1           5    Eve      Q2
2           3  Charlie      Q1           6  Frank      Q2

Concatenated with keys:
      EmployeeID     Name Quarter
Q1 0           1    Alice      Q1
   1           2      Bob      Q1
   2           3  Charlie      Q1
Q2 0           4    Diana      Q2
   1           5  

## Summary

You have learned various techniques for combining DataFrames in Pandas:

- **Merge Operations**: Using `merge()` with different join types (inner, left, right, outer)
- **Join Operations**: Using the `join()` method to combine DataFrames on their index
- **Concatenation**: Using `concat()` to stack DataFrames along rows or columns

These operations are essential for combining data from multiple sources in data analysis workflows.
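
As a recap, merges can be chained to build one table from several sources. The sketch below recreates the notebook's sample data and adds the `validate` argument, which is an addition not used in the cells above. It asks pandas to raise an error if the keys do not have the expected relationship.

```python
import pandas as pd

# Same sample values as the notebook's "Sample Data" section
employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "DepartmentID": [101, 102, 101, 103, 102],
})
departments = pd.DataFrame({
    "DepartmentID": [101, 102, 103, 104],
    "DepartmentName": ["HR", "IT", "Finance", "Marketing"],
    "Location": ["New York", "London", "Paris", "Tokyo"],
})
salaries = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 6],
    "Salary": [50000, 60000, 70000, 55000, 65000],
    "Bonus": [5000, 6000, 7000, 5500, 6500],
})

# Left merges keep every employee; validate guards against duplicated keys
report = (
    employees
    .merge(departments, on="DepartmentID", how="left", validate="many_to_one")
    .merge(salaries, on="EmployeeID", how="left", validate="one_to_one")
)
# Eve (EmployeeID 5) has no salary row, so her Salary and Bonus are NaN
print(report)
```

Using left merges throughout keeps the row count equal to the number of employees, which makes the result easy to sanity-check.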