## Day 6: Merging and Joining Data
Goal: Master SQL JOINs in pandas

First, some basic joins:

In [None]:
import pandas as pd

# Sample data setup
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'customer_name': ['Alice Corp', 'Bob Inc', 'Charlie Ltd', 'Delta Co'],
    'region': ['North', 'South', 'East', 'West']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 1, 2, 3, 5],  # Note: customer_id 5 doesn't exist in customers
    'total_amount': [1000, 1500, 2000, 750, 3000],
    'order_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05']
})

# SQL: SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
inner_join = pd.merge(customers, orders, on='customer_id', how='inner')
print("Inner Join Result:")
print(inner_join)

# SQL: SELECT * FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
left_join = pd.merge(customers, orders, on='customer_id', how='left')

print("\nLeft Join Result:")
print(left_join)

# SQL: SELECT * FROM customers c RIGHT JOIN orders o ON c.customer_id = o.customer_id
right_join = pd.merge(customers, orders, on='customer_id', how='right')
print("\nRight Join Result:")
print(right_join)

# SQL: SELECT * FROM customers c FULL OUTER JOIN orders o ON c.customer_id = o.customer_id
outer_join = pd.merge(customers, orders, on='customer_id', how='outer')
print("\nOuter Join Result:")
print(outer_join)

These are pretty straightforward. The arguments map directly onto SQL: the two DataFrames are the tables, `on` is the join key, and `how` is the join type.
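To check how the four `how` options differ on this data, here is a minimal sketch that counts the rows each join returns and uses `indicator=True` to flag which side each row of the outer join came from. The names `row_counts`, `audit`, and `unmatched` are only illustrative, and the frames are trimmed-down copies of the ones above.

```python
import pandas as pd

# Trimmed-down copies of the sample frames (only the columns needed here)
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'customer_name': ['Alice Corp', 'Bob Inc', 'Charlie Ltd', 'Delta Co'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 1, 2, 3, 5],
})

# Count the rows produced by each join type
row_counts = {
    how: len(pd.merge(customers, orders, on='customer_id', how=how))
    for how in ('inner', 'left', 'right', 'outer')
}
print(row_counts)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['customer_id', '_merge']])
```

Customer 4 has no orders and order 105 points at a customer that doesn't exist, so those are the two rows the outer join flags as one-sided.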

*Note*: Today I learned that `\n` is the newline escape character. Putting it at the start of a printed string adds a blank line between the results instead of them being right next to each other.

Next, Advanced Joining:

In [None]:
# Joining on multiple columns with matching column names

import pandas as pd

# Sample data for multiple column join
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'customer_name': ['Alice Corp', 'Bob Inc', 'Charlie Ltd', 'Delta Co'],
    'location': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'region': ['North', 'South', 'East', 'West']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 1, 2, 3, 5],  
    'total_amount': [1000, 1500, 2000, 750, 3000],
    'order_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05'],
    'location': ['New York', 'New York', 'Los Angeles', 'Chicago', 'Miami']
})


# SQL: JOIN ON table1.col1 = table2.col1 AND table1.col2 = table2.col2
# Run the merge once and keep the result instead of computing it twice
multi_key_join = pd.merge(customers, orders, on=['customer_id', 'location'])
print("Join on multiple columns Result:")
print(multi_key_join)

# Joining with suffixes for duplicate column names
print("\nJoin with suffixes for duplicate column names:")
# SQL: SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
print(pd.merge(customers, orders, on='customer_id', suffixes=('_left', '_right')))
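In the sample data every matched order happens to share its customer's location, so the two-key join returns the same rows as a join on `customer_id` alone. As a rough sketch of where the extra key matters, the frames below are hypothetical: order 102 is given a made-up location ('Boston') that differs from the customer's.

```python
import pandas as pd

# Hypothetical data: order 102 has a different location than its customer
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'customer_name': ['Alice Corp', 'Bob Inc'],
    'location': ['New York', 'Los Angeles'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'location': ['New York', 'Boston', 'Los Angeles'],
})

# One key: every order with a known customer_id matches
by_id = pd.merge(customers, orders, on='customer_id')

# Two keys: both customer_id AND location must be equal
by_id_and_location = pd.merge(customers, orders, on=['customer_id', 'location'])

print(len(by_id), "rows when joining on customer_id")
print(len(by_id_and_location), "rows when joining on customer_id and location")
print(by_id_and_location)
```

The second join drops order 102 because only one of its two keys matches.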

In [None]:
# Joining on multiple columns with different column names

import pandas as pd

# Sample data for multiple column join
customers = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'customer_name': ['Alice Corp', 'Bob Inc', 'Charlie Ltd', 'Delta Co'],
    'location': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'region': ['North', 'South', 'East', 'West']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 1, 2, 3, 5],  
    'total_amount': [1000, 1500, 2000, 750, 3000],
    'order_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05'],
    'location': ['New York', 'New York', 'Los Angeles', 'Chicago', 'Miami']
})

# SQL: JOIN ON orders.customer_id = customers.id
print("\nJoin on different column names Result:")
print(pd.merge(orders, customers, left_on='customer_id', right_on='id'))

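# Joining on multiple columns with different column names and suffixes
# Note: 'location' is a join key with the same name on both sides, so it appears only once;
# suffixes only rename overlapping non-key columns, and there are none in this example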
print("\nJoin on multiple columns with different names and suffixes Result:")
print(pd.merge(orders, customers, left_on=['customer_id', 'location'], right_on=['id', 'location'], suffixes=('_order', '_customer')))



This is interesting because SQL handles it differently: there you tell same-named columns apart with table aliases (`c.location`, `o.location`).
In pandas, when you merge two DataFrames that share a column name (other than the one you're joining on), those columns would conflict in the resulting DataFrame, so pandas appends `_x` and `_y` to them by default. To choose more descriptive names, you can use the `suffixes` parameter in `pd.merge()`.

In [None]:
# Joining with suffixes for duplicate column names

import pandas as pd

# Sample data for multiple column join
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'customer_name': ['Alice Corp', 'Bob Inc', 'Charlie Ltd', 'Delta Co'],
    'location': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'region': ['North', 'South', 'East', 'West']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 1, 2, 3, 5],  
    'total_amount': [1000, 1500, 2000, 750, 3000],
    'order_date': ['2023-01-15', '2023-02-20', '2023-01-30', '2023-03-10', '2023-02-05'],
    'location': ['New York', 'New York', 'Los Angeles', 'Chicago', 'Miami']
})

# SQL: SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
print("\nJoin with suffixes for duplicate column names:")
print(pd.merge(customers, orders, on='customer_id', suffixes=('_left', '_right')))

Suffix Example Breakdown:

`customers` and `orders` are being joined on the `customer_id` column.

Both DataFrames also have a `location` column, which would otherwise conflict.

With `suffixes=('_left', '_right')`, you'd see it renamed like:

- `location_left` → from the `customers` DataFrame (left side)
- `location_right` → from the `orders` DataFrame (right side)
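For comparison, here is a small sketch of what the column names look like with and without the `suffixes` argument. The frames are cut-down versions of the ones above, and the suffix names `_customer` and `_order` are just my own choice for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'location': ['New York', 'Los Angeles', 'Chicago'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 3],
    'location': ['New York', 'Los Angeles', 'Chicago'],
})

# No suffixes argument: pandas falls back to '_x' (left) and '_y' (right)
default_merge = pd.merge(customers, orders, on='customer_id')
print(list(default_merge.columns))

# Custom suffixes make it obvious which table each column came from
named_merge = pd.merge(customers, orders, on='customer_id',
                       suffixes=('_customer', '_order'))
print(list(named_merge.columns))
```

The join key `customer_id` never gets a suffix; only the overlapping non-key column does.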


____________
The next cell performs an **index-based join** between two DataFrames, similar to a SQL join where the key is the row label (the index) rather than a regular column.
How it works
The `.join()` method combines DataFrames by matching their index values (row labels), not their physical row positions. It's essentially saying "combine rows that have the same index label."

In [None]:
# Index-based joining (like a SQL join keyed on the row label)

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 90, 78]
}, index=[0, 1, 2])

df2 = pd.DataFrame({
    'name': ['Marketing', 'Engineering', 'Sales'],
    'budget': [50000, 80000, 60000]
}, index=[0, 1, 2])

print(df1.join(df2, lsuffix='_left', rsuffix='_right'))

Key characteristics of index joins
- *Default behavior*: Left join (keeps all rows from df1)
- *Index alignment*: Rows are matched by their index values (labels), not by position
- *Suffixing*: Not automatic. If both DataFrames share a column name you must pass `lsuffix` and/or `rsuffix`, otherwise `.join()` raises an error; the suffixes only apply to columns with duplicate names
- *Convenient*: Roughly a shorthand for `merge()` with `left_index=True, right_index=True, how='left'`, so it saves typing more than it saves time

This is particularly useful when you have DataFrames that share the same index and you want to combine their columns horizontally.
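Since the example above uses the index 0, 1, 2 on both sides, it can't show whether rows are matched by label or by position. The sketch below uses made-up string labels in a different order to make that visible, and also shows what happens when overlapping column names get no suffix. The frame names `scores` and `teams` are hypothetical.

```python
import pandas as pd

# Same labels, different row order (and 'bob' is missing from teams)
scores = pd.DataFrame({'score': [85, 90, 78]},
                      index=['alice', 'bob', 'charlie'])
teams = pd.DataFrame({'team': ['Sales', 'Engineering']},
                     index=['charlie', 'alice'])

# Rows are paired by label: 'alice' gets 'Engineering' even though it is row 2 in teams
joined = scores.join(teams)
print(joined)

# The same left join written with merge()
via_merge = scores.merge(teams, left_index=True, right_index=True, how='left')
print(via_merge)

# Overlapping column names without lsuffix/rsuffix are rejected
overlap_error = None
try:
    pd.DataFrame({'name': ['Alice']}).join(pd.DataFrame({'name': ['Marketing']}))
except ValueError as exc:
    overlap_error = exc
    print("join refused:", exc)
```

Because the default is a left join, 'bob' stays in the result with a missing `team` value.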

In [None]:
# Concatenating dataframes (UNION equivalent) (vertical concatenation)

import pandas as pd

# Create sample DataFrames (with some overlapping data)
df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Sales', 'Marketing'],
    'salary': [75000, 65000, 70000]
})

df2 = pd.DataFrame({
    'name': ['Bob', 'Diana', 'Charlie'],  # Bob and Charlie are duplicates
    'department': ['Sales', 'Engineering', 'Marketing'],
    'salary': [65000, 80000, 70000]
})

# SQL: SELECT * FROM table1 UNION ALL SELECT * FROM table2
# Preserves duplicates: Like UNION ALL, keeps duplicate rows (unlike UNION)
# Column alignment: Matches columns by name automatically
print("\nConcatenating DataFrames (with duplicates):")
print(pd.concat([df1, df2], ignore_index=True))
# Index handling: ignore_index=True creates clean sequential numbering


# SQL: SELECT * FROM table1 UNION SELECT * FROM table2 (removes duplicates)
print("\nConcatenating DataFrames (removing duplicates):")
print(pd.concat([df1, df2]).drop_duplicates())

# Removes duplicates: Unlike UNION ALL, eliminates identical rows
# Preserves original indices: The result keeps labels 0, 1, 2, 1 (Diana keeps her df2 label 1;
#   the repeated Bob and Charlie rows from df2 are dropped). Chain .reset_index(drop=True) for clean numbering
# Row-wise comparison: Considers entire rows when identifying duplicates
# Maintains order: Keeps the first occurrence of each duplicate row (keep='first' is the default)
# This is useful when combining datasets where you want to eliminate redundant records and create a unique set of rows.
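As a quick check of the index behavior, here is a minimal sketch that compares the index left behind by `drop_duplicates()` with a version that renumbers the rows afterwards. The variable names `union` and `union_clean` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['Engineering', 'Sales', 'Marketing'],
    'salary': [75000, 65000, 70000]
})
df2 = pd.DataFrame({
    'name': ['Bob', 'Diana', 'Charlie'],
    'department': ['Sales', 'Engineering', 'Marketing'],
    'salary': [65000, 80000, 70000]
})

# UNION-style result, but each surviving row keeps the label it had in its source frame
union = pd.concat([df1, df2]).drop_duplicates()
print(list(union.index))

# Renumber after de-duplicating to get a clean 0..n-1 index
union_clean = (
    pd.concat([df1, df2], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
print(union_clean)
```

Repeated index labels are legal in pandas, but they make `.loc` lookups return several rows, so resetting the index is usually the safer habit.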