# Week 8: Data Joins in Pandas

This week we'll cover three parts:

## Part 1: Join Fundamentals
- Understanding different types of joins (inner, left, right, outer)
- When to use each join type
- Join principles with clear examples

## Part 2: Real-World Multi-System Joins
- Working with data from multiple business systems
- Handling inconsistent keys and data quality issues
- Using composite keys for complex joins
- Understanding cardinality and post-join analysis

## Part 3: Advanced Join Strategies
- Dealing with partial matches and missing data
- Validating join results
- Best practices for multi-system data integration

These skills are essential for:
- **Business intelligence** - Combining data from different departments
- **Customer analytics** - Linking CRM, sales, and support data
- **Financial reporting** - Integrating accounting, sales, and inventory systems
- **Data warehousing** - Building comprehensive analytical datasets

## Your Mission: Integrate TechMart's Business Systems

You're a Data Analyst at **TechMart**, a growing e-commerce company. The business has grown rapidly, and different departments use different systems:

- **Product Database** (Inventory Management System)
- **Customer CRM** (Sales & Marketing System)  
- **Accounts System** (Finance & Billing System)
- **Orders Data** (E-commerce Platform)

**Your Challenge:**
These systems weren't designed to work together perfectly. You'll encounter real-world data integration challenges:
- **Inconsistent identifiers** - Different systems use different customer IDs
- **Missing data** - Not all customers exist in all systems
- **Data quality issues** - Typos, formatting differences, incomplete records
- **Multiple relationships** - Some customers have multiple accounts

**Your Goal:**
Create a comprehensive view of business performance by successfully joining these imperfect datasets, just like you'll do in real data analyst roles!

## Step 1: Understanding Join Types with Simple Examples

Before we tackle real-world complexity, let's understand the four main types of joins using simple examples.

### The Four Join Types

Imagine we have two simple tables:

**Students Table:**
```
student_id | name
1          | Alice
2          | Bob  
3          | Carol
```

**Grades Table:**
```
student_id | grade
1          | A
2          | B
4          | A
```

Notice that:
- Carol (ID 3) has no grade recorded
- Student ID 4 has a grade but no name in the Students table

Let's see how different joins handle this!

In [None]:
import pandas as pd
import numpy as np

Let's create our simple example tables:

In [None]:
# Create simple example tables
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol']
})

grades = pd.DataFrame({
    'student_id': [1, 2, 4],
    'grade': ['A', 'B', 'A']
})

print('Students Table:')
students.head()

In [None]:
print('Grades Table:')
grades.head()

### Inner Join: Only Matching Records

An **inner join** returns only rows where the key exists in **both** tables.

**Result:** Only Alice and Bob (IDs 1 and 2) appear because they exist in both tables.

In [None]:
# Inner join - only matching records
inner_result = students.merge(grades, on='student_id', how='inner')
print('Inner Join Result:')
inner_result.head()

### Left Join: All Records from Left Table

A **left join** returns **all** rows from the left table, plus matching rows from the right table.

**Result:** Alice, Bob, and Carol all appear. Carol gets NaN for grade because she has no grade recorded.

In [None]:
# Left join - all students, grades where available
left_result = students.merge(grades, on='student_id', how='left')
print('Left Join Result:')
left_result.head()

### Right Join: All Records from Right Table

A **right join** returns **all** rows from the right table, plus matching rows from the left table.

**Result:** All grades appear. Student ID 4 gets NaN for name because they don't exist in the Students table.

In [None]:
# Right join - all grades, student names where available
right_result = students.merge(grades, on='student_id', how='right')
print('Right Join Result:')
right_result.head()

### Outer Join: All Records from Both Tables

An **outer join** (also called full outer join) returns **all** rows from **both** tables.

**Result:** Everyone appears - Alice, Bob, Carol, and the mystery student ID 4. Missing values are filled with NaN.

In [None]:
# Outer join - all records from both tables
outer_result = students.merge(grades, on='student_id', how='outer')
print('Outer Join Result:')
outer_result.head()

### When to Use Each Join Type

**Inner Join:** Use when you only want complete records
- Example: "Show me customers who have both placed orders AND have account information"

**Left Join:** Use when you want to keep all records from your main table
- Example: "Show me all customers, and their order history if they have any"

**Right Join:** Less common, but useful when the right table is your main focus
- Example: "Show me all orders, and customer details if available"

**Outer Join:** Use when you want to see the complete picture
- Example: "Show me all customers and all orders, whether they match or not"
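
To make the differences concrete, the sketch below reruns all four join types on the toy `students` and `grades` tables from above and compares the row counts. The tables are rebuilt inside the block so it runs on its own; `row_counts` is just an illustrative name.

```python
import pandas as pd

# Same toy tables as in the examples above, rebuilt so this cell runs on its own
students = pd.DataFrame({
    'student_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol']
})
grades = pd.DataFrame({
    'student_id': [1, 2, 4],
    'grade': ['A', 'B', 'A']
})

# Run each join type and record how many rows it returns
row_counts = {
    how: len(students.merge(grades, on='student_id', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
```

The inner join returns 2 rows (IDs 1 and 2), the left and right joins return 3 rows each, and the outer join returns 4.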

## Step 2: Load Our Real-World Business Data

Now let's work with realistic business data that has the messy characteristics of real-world systems.

In [None]:
# Load our business datasets
products = pd.read_csv('product_database.csv')
customers = pd.read_csv('customer_crm.csv')
accounts = pd.read_csv('accounts_system.csv')
orders = pd.read_csv('orders_data.csv')

Let's examine each dataset:

In [None]:
print('Product Database:')
print(f'Shape: {products.shape}')
products.head()

In [None]:
print('Customer CRM:')
print(f'Shape: {customers.shape}')
customers.head()

In [None]:
print('Accounts System:')
print(f'Shape: {accounts.shape}')
accounts.head()

In [None]:
print('Orders Data:')
print(f'Shape: {orders.shape}')
orders.head()

### Understanding Our Data Relationships

Let's understand how these systems relate to each other and identify potential data quality issues.

**Key Relationships:**
- **Orders** link to **Customers** via `customer_id`
- **Orders** link to **Products** via `product_id`
- **Accounts** link to **Customers** via `customer_id`
- **Accounts** might have multiple records per customer (business + personal accounts)

**Potential Issues:**
- Not all customers may have accounts
- Not all products may have been ordered
- Some orders might reference missing customers or products

In [None]:
# Check data coverage across systems
print('Data Coverage Analysis:')
print(f'Total customers in CRM: {customers["customer_id"].nunique()}')
print(f'Customers with accounts: {accounts["customer_id"].nunique()}')
print(f'Customers who have ordered: {orders["customer_id"].nunique()}')
print(f'Total products in database: {products["product_id"].nunique()}')
print(f'Products that have been ordered: {orders["product_id"].nunique()}')

This shows us the typical real-world scenario - not all data exists in all systems!
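
Unique counts show how many keys each system holds, but not which keys overlap. A minimal sketch of a set-based overlap check follows. It uses small made-up tables rather than the TechMart files, and the IDs and variable names (`crm_ids`, `order_ids`, and so on) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical miniature versions of the CRM and orders tables
customers = pd.DataFrame({'customer_id': ['C001', 'C002', 'C003']})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': ['C001', 'C001', 'C009']
})

# Compare the sets of keys held by each system
crm_ids = set(customers['customer_id'])
order_ids = set(orders['customer_id'])

in_both = crm_ids & order_ids      # customers known to the CRM who have ordered
only_crm = crm_ids - order_ids     # CRM customers with no orders
only_orders = order_ids - crm_ids  # orders pointing at unknown customers

print(f'In both systems: {sorted(in_both)}')
print(f'Only in CRM: {sorted(only_crm)}')
print(f'Only in orders: {sorted(only_orders)}')
```

Keys that appear only in the orders set are the ones that will produce NaN customer columns in a left join from orders.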

## Step 3: Simple Joins - Orders with Customer Information

Let's start with a straightforward join: adding customer information to our orders.

In [None]:
# Join orders with customer information (left join)
orders_with_customers = orders.merge(customers, on='customer_id', how='left')
print('Orders with Customer Information:')
print(f'Original orders: {len(orders)}')
print(f'Orders after join: {len(orders_with_customers)}')
orders_with_customers.head()

### Checking for Missing Customer Data

In [None]:
# Check if any orders have missing customer information
from IPython.display import display  # bare expressions inside if/else are not rendered, so display explicitly

missing_customers = orders_with_customers['first_name'].isnull().sum()
print(f'Orders with missing customer data: {missing_customers}')

if missing_customers > 0:
    print('\nOrders with missing customer information:')
    display(orders_with_customers[orders_with_customers['first_name'].isnull()].head())
else:
    print('All orders have matching customer information')
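
Checking `first_name` for nulls assumes the CRM never stores a blank first name. If that assumption might not hold, the `indicator=True` option of `merge` separates "no matching customer" from "customer found, but the name is empty". The sketch below uses made-up rows to show the difference; the data and the names `flagged`, `null_name_count`, and `unmatched_count` are hypothetical.

```python
import pandas as pd

# Hypothetical data: customer C2 exists but has no first name; C9 is not in the CRM at all
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['C1', 'C2', 'C9']
})
customers = pd.DataFrame({
    'customer_id': ['C1', 'C2'],
    'first_name': ['Ana', None]
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
flagged = orders.merge(customers, on='customer_id', how='left', indicator=True)

null_name_count = int(flagged['first_name'].isnull().sum())
unmatched_count = int((flagged['_merge'] == 'left_only').sum())

print(f'Orders with a null first_name: {null_name_count}')
print(f'Orders with no matching customer: {unmatched_count}')
```

In this toy data two orders have a null `first_name`, but only one of them (customer `C9`) is genuinely missing from the CRM.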

## Step 4: Adding Product Information

Now let's add product details to understand what customers are buying.

In [None]:
# Add product information to our orders
orders_complete = orders_with_customers.merge(products, on='product_id', how='left')
print('Orders with Customer and Product Information:')
print(f'Shape: {orders_complete.shape}')
orders_complete.head()

### Checking Product Data Coverage

In [None]:
# Check for missing product information
from IPython.display import display  # bare expressions inside if/else are not rendered, so display explicitly

missing_products = orders_complete['product_name'].isnull().sum()
print(f'Orders with missing product data: {missing_products}')

if missing_products > 0:
    print('\nOrders with missing product information:')
    display(orders_complete[orders_complete['product_name'].isnull()][['order_id', 'product_id', 'customer_id']].head())
else:
    print('All orders have matching product information')

## Step 5: Complex Joins - Customer Accounts

Now for a more complex scenario. The accounts system uses the same `customer_id` as the CRM, but:
1. Not all customers have accounts
2. Some customers have multiple accounts (personal + business)

This affects the **cardinality** of the join - how many rows in one table can match each row in the other (one-to-one, one-to-many, or many-to-many).

### Understanding Cardinality

In [None]:
# Check how many accounts each customer has
accounts_per_customer = accounts.groupby('customer_id').size()
print('Accounts per customer distribution:')
accounts_per_customer.value_counts().head()

Most customers have 1 account, but some have 2. This is a **one-to-many** relationship.
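
One way to label a relationship before joining is to test whether the key is unique on each side. The helper below is a sketch under that assumption; `describe_relationship` is a hypothetical function written for this example, not part of pandas, and the tables are toy data.

```python
import pandas as pd

# Hypothetical toy tables: C1 has two accounts, C3 has none
customers = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3']})
accounts = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2'],
    'account_type': ['personal', 'business', 'personal']
})

def describe_relationship(left_keys: pd.Series, right_keys: pd.Series) -> str:
    """Classify a join by whether the key is unique on each side."""
    left_unique = left_keys.is_unique
    right_unique = right_keys.is_unique
    if left_unique and right_unique:
        return 'one-to-one'
    if left_unique:
        return 'one-to-many'
    if right_unique:
        return 'many-to-one'
    return 'many-to-many'

relationship = describe_relationship(customers['customer_id'], accounts['customer_id'])
print(f'customers -> accounts is {relationship}')
```

Here the key is unique in `customers` but repeated in `accounts`, so the helper reports `one-to-many`.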

### Join Customers with Accounts

In [None]:
# Join customers with their accounts
customers_with_accounts = customers.merge(accounts, on='customer_id', how='left')
print('Customers with Account Information:')
print(f'Original customers: {len(customers)}')
print(f'Rows after join: {len(customers_with_accounts)}')
customers_with_accounts.head()

Notice that we now have more rows than customers because some customers have multiple accounts!

### Analysing the Join Results

In [None]:
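# Analyse the join results
unique_customers = customers_with_accounts['customer_id'].nunique()
# Customers without an account appear once, with NaN in the account columns
customers_without_accounts = customers_with_accounts['account_number'].isnull().sum()
# A customer with several accounts appears on several rows
rows_per_customer = customers_with_accounts.groupby('customer_id').size()
customers_multiple_accounts = (rows_per_customer > 1).sum()

print(f'Unique customers in result: {unique_customers}')
print(f'Customers without accounts: {customers_without_accounts}')
print(f'Total rows: {len(customers_with_accounts)}')
print(f'Customers with multiple accounts: {customers_multiple_accounts}')
print(f'Extra rows added by the join: {len(customers_with_accounts) - len(customers)}')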

### Understanding What Each Row Represents

**Critical Concept:** After joining with a one-to-many relationship, **each row represents a customer-account combination**, not just a customer.

This changes how we interpret our data:
- **Before join:** Each row = one customer
- **After join:** Each row = one customer-account pair

This affects calculations like averages, counts, and totals!
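
The sketch below shows how a customer-level total gets double-counted once each customer can occupy several rows. The `lifetime_value` column and all values are made up for illustration; they are not columns from the TechMart files.

```python
import pandas as pd

# Hypothetical customer-level metric and a one-to-many accounts table
customers = pd.DataFrame({
    'customer_id': ['C1', 'C2', 'C3'],
    'lifetime_value': [100.0, 200.0, 300.0]
})
accounts = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2'],
    'account_number': ['A-1', 'A-2', 'A-3']
})

joined = customers.merge(accounts, on='customer_id', how='left')

# Summing over customer-account rows counts C1 twice
naive_total = joined['lifetime_value'].sum()

# Collapsing back to one row per customer restores the customer-level total
correct_total = joined.drop_duplicates('customer_id')['lifetime_value'].sum()

row_count = len(joined)
customer_count = joined['customer_id'].nunique()

print(f'Rows: {row_count}, unique customers: {customer_count}')
print(f'Naive total: {naive_total}, per-customer total: {correct_total}')
```

The joined table has 4 rows for 3 customers, so the naive sum comes to 700 while the per-customer total is 600.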

In [None]:
# Example: Show customers with multiple accounts
multiple_accounts = customers_with_accounts.groupby('customer_id').size()
customers_multiple = multiple_accounts[multiple_accounts > 1].index

print('Customers with multiple accounts:')
has_multiple = customers_with_accounts['customer_id'].isin(customers_multiple)
columns_to_show = ['customer_id', 'first_name', 'last_name', 'account_type', 'account_number']
customers_with_accounts.loc[has_multiple, columns_to_show].head()

## Step 6: Composite Keys and Complex Matching

Sometimes we need to join on multiple columns to create unique matches. This is called using **composite keys**.

Let's create a scenario where we need to match customers using multiple fields because the simple `customer_id` isn't reliable.

### Simulating a Real-World Problem

In [None]:
# Let's imagine the accounts system has some corrupted customer_ids
# We'll need to match using email + name combination
accounts_corrupted = accounts.copy()

# Cast to object first so the string placeholder can be assigned even if the IDs were read as integers
accounts_corrupted['customer_id'] = accounts_corrupted['customer_id'].astype(object)

# Corrupt some customer IDs to simulate real-world data issues
np.random.seed(42)
corrupt_indices = np.random.choice(accounts_corrupted.index, size=10, replace=False)
accounts_corrupted.loc[corrupt_indices, 'customer_id'] = 'CORRUPTED'

print('Accounts with corrupted customer IDs:')
accounts_corrupted[accounts_corrupted['customer_id'] == 'CORRUPTED'][['account_number', 'customer_id', 'email_on_file', 'billing_name']].head()

### Preparing for Composite Key Matching

We need to create matching fields in both datasets to join on email and name.

In [None]:
# Prepare customers data for composite key matching
customers_for_matching = customers.copy()
customers_for_matching['full_name'] = customers_for_matching['first_name'] + ' ' + customers_for_matching['last_name']

# Prepare accounts data for composite key matching
accounts_for_matching = accounts_corrupted.copy()

print('Prepared data for composite key matching:')
customers_for_matching[['customer_id', 'email', 'full_name']].head()
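
A composite key matches only when every field is identical, so stray spaces or a difference in letter case will silently drop a row. The sketch below normalises both sides before joining. The `normalise_text` helper, the `email_key` and `name_key` columns, and the sample rows are all hypothetical.

```python
import pandas as pd

def normalise_text(series: pd.Series) -> pd.Series:
    """Trim, lower-case, and collapse repeated whitespace."""
    return (
        series.str.strip()
        .str.lower()
        .str.replace(r'\s+', ' ', regex=True)
    )

# Hypothetical rows with formatting differences between the two systems
crm = pd.DataFrame({
    'customer_id': ['C1', 'C2'],
    'email': ['Ana@Example.com ', 'ben@example.com'],
    'full_name': ['Ana  Lopez', 'Ben Ng']
})
acct = pd.DataFrame({
    'account_number': ['A-1', 'A-2'],
    'email_on_file': ['ana@example.com', 'BEN@example.com'],
    'billing_name': ['ana lopez', 'Ben Ng ']
})

# Joining on the raw fields finds no matches
raw_matches = crm.merge(
    acct,
    left_on=['email', 'full_name'],
    right_on=['email_on_file', 'billing_name'],
    how='inner'
)

# Build normalised key columns on both sides, then join on those
crm = crm.assign(
    email_key=normalise_text(crm['email']),
    name_key=normalise_text(crm['full_name'])
)
acct = acct.assign(
    email_key=normalise_text(acct['email_on_file']),
    name_key=normalise_text(acct['billing_name'])
)
clean_matches = crm.merge(acct, on=['email_key', 'name_key'], how='inner')

print(f'Matches on raw fields: {len(raw_matches)}')
print(f'Matches on normalised fields: {len(clean_matches)}')
```

Keeping the original columns alongside the normalised keys means the cleaned values are only used for matching, not for reporting.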

### Performing Composite Key Join

In [None]:
# Join using composite key: email + full_name
composite_join = customers_for_matching.merge(
    accounts_for_matching,
    left_on=['email', 'full_name'],
    right_on=['email_on_file', 'billing_name'],
    how='inner'
)

print('Composite key join results:')
print(f'Successful matches: {len(composite_join)}')
composite_join[['customer_id_x', 'customer_id_y', 'email', 'full_name', 'account_number']].head()

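Notice we have `customer_id_x` and `customer_id_y`: both tables contain a `customer_id` column that isn't part of the join key, so pandas keeps both and appends its default suffixes, `_x` for the left table and `_y` for the right.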
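
The `suffixes` parameter of `merge` replaces those defaults with more descriptive labels, which makes it easier to compare the two ID columns afterwards. The sketch below joins on email alone to stay short; the sample rows and the names `matched`, `id_mismatch`, and `repaired` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows: the second account has a corrupted customer_id
crm = pd.DataFrame({
    'customer_id': ['C1', 'C2'],
    'email': ['ana@example.com', 'ben@example.com']
})
acct = pd.DataFrame({
    'customer_id': ['C1', 'CORRUPTED'],
    'email_on_file': ['ana@example.com', 'ben@example.com'],
    'account_number': ['A-1', 'A-2']
})

# Name the overlapping columns after their source system
matched = crm.merge(
    acct,
    left_on='email',
    right_on='email_on_file',
    how='inner',
    suffixes=('_crm', '_accounts')
)

# Flag rows where the two systems disagree on the customer ID
matched['id_mismatch'] = matched['customer_id_crm'] != matched['customer_id_accounts']

# Treat the CRM ID as the trusted one and drop the accounts copy
repaired = (
    matched
    .drop(columns='customer_id_accounts')
    .rename(columns={'customer_id_crm': 'customer_id'})
)

print(matched[['customer_id_crm', 'customer_id_accounts', 'id_mismatch']])
```

Treating the CRM as the source of truth is a design choice made for this example; in practice the trusted system depends on the business context.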

## Step 7: Validating Join Results

Always validate your joins to ensure they worked as expected.

### Check for Duplicate Records

In [None]:
# Check for unexpected duplicates in our main orders join
duplicate_orders = orders_complete['order_id'].duplicated().sum()
print(f'Duplicate order IDs after join: {duplicate_orders}')

if duplicate_orders > 0:
    print('Warning: Duplicate orders found - check join logic')
else:
    print('No duplicate orders - join preserved order uniqueness')
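
The check above runs after the join. The `validate` parameter of `merge` can enforce the expected relationship during the join itself and raises `pandas.errors.MergeError` when it doesn't hold. A minimal sketch with made-up tables follows; `customers_dup` and `duplicate_detected` are illustrative names.

```python
import pandas as pd

# Hypothetical tables: many orders per customer, one CRM row per customer
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['C1', 'C1', 'C2']
})
customers = pd.DataFrame({
    'customer_id': ['C1', 'C2'],
    'first_name': ['Ana', 'Ben']
})

# many_to_one: the key must be unique in the right-hand table
checked = orders.merge(customers, on='customer_id', how='left', validate='many_to_one')
print(f'Rows before: {len(orders)}, rows after: {len(checked)}')

# Simulate a CRM with a duplicated customer row
customers_dup = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)

try:
    orders.merge(customers_dup, on='customer_id', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError as err:
    duplicate_detected = True
    print(f'Join rejected: {err}')
```

Failing early like this stops a duplicated key from quietly inflating the row count of every downstream table.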

### Validate Data Completeness

In [None]:
# Check data completeness after joins
print('Data Completeness Check:')
print(f'Orders with customer info: {(~orders_complete["first_name"].isnull()).sum()}/{len(orders_complete)}')
print(f'Orders with product info: {(~orders_complete["product_name"].isnull()).sum()}/{len(orders_complete)}')

# Calculate completeness percentages
customer_completeness = (~orders_complete['first_name'].isnull()).mean() * 100
product_completeness = (~orders_complete['product_name'].isnull()).mean() * 100

print('\nCompleteness rates:')
print(f'Customer data: {customer_completeness:.1f}%')
print(f'Product data: {product_completeness:.1f}%')

## Step 8: Practice Exercises

### Exercise 1: Customer Purchase Summary

Create a summary showing each customer's total purchases, but only include customers who have both CRM records AND have placed orders.

In [None]:
# Exercise 1 Solution
customer_orders = orders.merge(customers, on='customer_id', how='inner')
customer_summary = customer_orders.groupby(['customer_id', 'first_name', 'last_name']).agg({
    'order_id': 'count',
    'total_value': 'sum',
    'quantity': 'sum'
}).reset_index()

customer_summary.columns = ['customer_id', 'first_name', 'last_name', 'total_orders', 'total_spent', 'total_items']
customer_summary = customer_summary.sort_values('total_spent', ascending=False)

print('Top customers by total spending:')
customer_summary.head()
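
An alternative to renaming the columns after aggregating is pandas named aggregation, which sets the output names inside `agg`. The sketch below applies it to a small made-up `customer_orders` table with the same column names as the solution above; the values are hypothetical.

```python
import pandas as pd

# Hypothetical joined orders-customers rows
customer_orders = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C2'],
    'first_name': ['Ana', 'Ana', 'Ben'],
    'last_name': ['Lopez', 'Lopez', 'Ng'],
    'order_id': [1, 2, 3],
    'total_value': [50.0, 25.0, 10.0],
    'quantity': [1, 2, 1]
})

# Named aggregation: output_name=(source_column, aggregation)
customer_summary = (
    customer_orders
    .groupby(['customer_id', 'first_name', 'last_name'], as_index=False)
    .agg(
        total_orders=('order_id', 'count'),
        total_spent=('total_value', 'sum'),
        total_items=('quantity', 'sum')
    )
    .sort_values('total_spent', ascending=False)
)

print(customer_summary)
```

Because each output name sits next to its source column, the result does not depend on the order in which the aggregated columns come back.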

### Exercise 2: Customer Coverage Analysis

Find customers who exist in the CRM but have never placed an order.

In [None]:
# Exercise 2 Solution
customer_analysis = customers.merge(orders, on='customer_id', how='left')
never_ordered_customers = customer_analysis[customer_analysis['order_id'].isnull()]

print('Customers who have never placed an order:')
print(f'Count: {len(never_ordered_customers)}')
never_ordered_customers[['customer_id', 'first_name', 'last_name', 'customer_segment']].head()

### Exercise 3: Product and Order Data

Use an outer join between orders and products to check for data gaps. Specifically, answer these questions:
1. Are there any products that exist but have never been ordered?
2. Are there any orders for products that don't exist in our product database?

In [None]:
# Exercise 3 Solution
# Use outer join to see all products and all orders
coverage_analysis = orders.merge(products, on='product_id', how='outer')

# 1) Products that exist but have never been ordered
unordered_products = coverage_analysis[coverage_analysis['order_id'].isnull()]
print('1) Products that have never been ordered:')
print(f'Count: {len(unordered_products)}')
unordered_products[['product_id', 'product_name', 'category']].head()

In [None]:
# 2) Orders for products not in our product database
missing_products = coverage_analysis[coverage_analysis['product_name'].isnull()]
print('2) Orders for products not in our database:')
print(f'Count: {len(missing_products)}')
missing_products[['order_id', 'product_id', 'customer_id', 'total_value']].head()

## Step 9: Best Practices for Real-World Joins

### Key Lessons for Data Analysts

**1. Always Understand Your Data First**
- Check data coverage across systems
- Identify potential missing or inconsistent keys
- Understand the business relationships between datasets

**2. Choose the Right Join Type**
- **Inner join**: When you only want complete records
- **Left join**: When preserving your main dataset is important
- **Outer join**: When you need to see all data, including gaps

**3. Be Aware of Cardinality Changes**
- One-to-many joins increase row count
- Many-to-many joins can explode your dataset
- Always check row counts before and after joins

**4. Validate Your Results**
- Check for unexpected duplicates
- Verify data completeness
- Understand the business impact of missing data

**5. Document Data Quality Issues**
- Track which records couldn't be joined
- Quantify the impact of missing data
- Communicate limitations to stakeholders

**6. Consider Composite Keys**
- When single keys are unreliable
- As an exact-match fallback before resorting to fuzzy matching
- When dealing with legacy systems
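
Lesson 3 above warns that many-to-many joins can explode a dataset. The sketch below shows the effect on made-up data where the key repeats on both sides; the tables and the name `exploded` are illustrative only.

```python
import pandas as pd

# Hypothetical tables where customer_id repeats on BOTH sides
orders = pd.DataFrame({
    'customer_id': ['C1', 'C1'],
    'order_id': [1, 2]
})
accounts = pd.DataFrame({
    'customer_id': ['C1', 'C1', 'C1'],
    'account_number': ['A-1', 'A-2', 'A-3']
})

# Every order row pairs with every account row for the same customer
exploded = orders.merge(accounts, on='customer_id', how='inner')

print(f'Order rows: {len(orders)}')
print(f'Account rows: {len(accounts)}')
print(f'Rows after join: {len(exploded)}')
```

Two order rows and three account rows for the same customer produce 2 x 3 = 6 rows, which is why comparing row counts before and after a join is worth the extra line.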

## Summary

You've mastered data joins in pandas and learned to handle real-world data integration challenges.

**Join Types Mastered:**
- **Inner joins** for complete records only
- **Left joins** for preserving your main dataset
- **Right joins** for alternative perspectives
- **Outer joins** for comprehensive views

**Real-World Skills Developed:**
- Working with multi-system data
- Handling inconsistent keys and missing data
- Using composite keys for complex matching
- Understanding cardinality and post-join analysis
- Validating join results and data quality

**Business Value Created:**
- Integrated customer view across CRM and accounts
- Complete order analysis with customer and product details
- Identified data quality issues and their business impact
- Created comprehensive business intelligence datasets

**Next Steps:**
These join skills are fundamental for:
- **Data warehousing** - Building analytical datasets
- **Business intelligence** - Creating dashboards and reports
- **Customer analytics** - 360-degree customer views
- **Financial analysis** - Integrating multiple business systems

You're now equipped to handle the complex data integration challenges that define real-world data analyst roles!