### **Merging and Joining DataFrames**
In real-world data analysis, information is often spread across multiple tables or DataFrames.
To analyze it effectively, you need to **combine** these datasets based on shared keys or indices, much like **SQL joins**.
Pandas provides powerful tools for this purpose through the **`merge()`** and **`join()`** methods.

---
➡️ **Types of Joins**

You can specify how the merge behaves using the **`how`** parameter:
| Join Type | Description                                       | Keeps                |
| --------- | ------------------------------------------------- | -------------------- |
| `inner`   | Only matching rows                                | Intersection of keys |
| `left`    | All rows from left DataFrame + matches from right | Left keys            |
| `right`   | All rows from right DataFrame + matches from left | Right keys           |
| `outer`   | All rows from both DataFrames                     | Union of keys        |

Example:
```python
pd.merge(df1, df2, on='ID', how='outer')
```
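To make the table concrete, the sketch below counts the rows each join type returns. The `left` and `right` DataFrames are hypothetical sample data, chosen so that each side has exactly one key the other lacks.

```python
import pandas as pd

# Hypothetical sample data: ID 1 exists only in `left`, ID 4 only in `right`
left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})

# Count the rows produced by each join type
row_counts = {
    how: len(pd.merge(left, right, on='ID', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```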
---
➡️ **Joining on Indices**

You can also join DataFrames using their **index values** instead of columns.
```python
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')

joined_df = df1.join(df2, how='inner')
```
---
➡️ **Key Takeaways**
* **`merge()`** → SQL-style joins using column keys
* **`join()`** → Combines DataFrames using **indices** by default
* You can merge on **multiple columns** using `on=['col1', 'col2']`
* Always check the **`how`** parameter to control which data is kept
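As a small illustration of merging on multiple columns, the sketch below uses two hypothetical DataFrames (`sales` and `staffing`, not part of the examples in this section) in which a row is identified by the pair `Store` and `Date`. Rows are combined only when both key columns agree.

```python
import pandas as pd

# Hypothetical data: each record is identified by the (Store, Date) pair
sales = pd.DataFrame({
    'Store': ['A', 'A', 'B'],
    'Date': ['2024-01-01', '2024-01-02', '2024-01-01'],
    'Revenue': [100, 150, 200],
})
staffing = pd.DataFrame({
    'Store': ['A', 'B', 'B'],
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02'],
    'Staff': [3, 4, 5],
})

# A row matches only when BOTH Store and Date are equal
combined = pd.merge(sales, staffing, on=['Store', 'Date'], how='inner')
print(combined)
```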

➡️ **Merging DataFrames**

The **`pd.merge()`** function combines rows from two DataFrames based on matching values in one or more key columns.

```python
import pandas as pd

# Sample data
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 75]})

# Merge on a common column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
```

Output:
```
   ID   Name  Score
0   1  Alice     85
1   2    Bob     90
```
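The key columns do not need to share a name. The sketch below, using hypothetical `employees` and `salaries` DataFrames, shows one way to handle that with the `left_on` and `right_on` parameters of `pd.merge()`.

```python
import pandas as pd

# Hypothetical data: the key is called 'emp_id' on one side and 'employee_id' on the other
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
salaries = pd.DataFrame({'employee_id': [1, 2, 4], 'Salary': [5000, 6000, 7000]})

# left_on / right_on name the key column on each side
merged = pd.merge(employees, salaries,
                  left_on='emp_id', right_on='employee_id', how='inner')

# Both key columns are kept in the result, so drop the redundant one
merged = merged.drop(columns='employee_id')
print(merged)
```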


➡️ **We have three separate DataFrames:**
* Order data
* Customer data
* Product data

Merging them lets us derive insights such as total sales by customer, the most popular product, and any orders without customer information.

```python
import pandas as pd

# Sample Order data
# The None in CustomerID becomes NaN, so that column is stored as float
orders = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5],
    'CustomerID': [101, 102, 103, None, 105],
    'ProductID': [1001, 1002, 1003, 1001, 1002],
    'Quantity': [1, 2, 1, 4, 2]
})

# Sample Customer data
customers = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105],
    'CustomerName': ['Alice', 'Bob', 'Charlie', 'David', 'Edward']
})

# Sample Product data
products = pd.DataFrame({
    'ProductID': [1001, 1002, 1003],
    'ProductName': ['Laptop', 'Smartphone', 'Tablet'],
    'Price': [1000, 500, 300]
})

# Merging Orders with Customers to get Customer Names
orders_with_customers = pd.merge(orders, customers, on='CustomerID', how='left')
print(f"Orders with Customer information:\n{orders_with_customers}\n")

# Merging the result with Products to get Product details
orders_full = pd.merge(orders_with_customers, products, on='ProductID', how='left')
print(f"Full Order details with Customer and Product information:\n{orders_full}\n")

# Calculating Total Sales by Customer
# groupby drops rows whose CustomerName is NaN, so order 4 is excluded here
orders_full['TotalPrice'] = orders_full['Quantity'] * orders_full['Price']
total_sales_by_customer = orders_full.groupby('CustomerName')['TotalPrice'].sum().reset_index()
print(f"Total Sales by Customer:\n{total_sales_by_customer}\n")

# Finding the Most Popular Product
most_popular_product = orders_full.groupby('ProductName')['Quantity'].sum().reset_index()
most_popular_product = most_popular_product.sort_values(by='Quantity', ascending=False)
print(f"Most Popular Product:\n{most_popular_product}")
```

Output:
```
Orders with Customer information:
   OrderID  CustomerID  ProductID  Quantity CustomerName
0        1       101.0       1001         1        Alice
1        2       102.0       1002         2          Bob
2        3       103.0       1003         1      Charlie
3        4         NaN       1001         4          NaN
4        5       105.0       1002         2       Edward

Full Order details with Customer and Product information:
   OrderID  CustomerID  ProductID  Quantity CustomerName ProductName  Price
0        1       101.0       1001         1        Alice      Laptop   1000
1        2       102.0       1002         2          Bob  Smartphone    500
2        3       103.0       1003         1      Charlie      Tablet    300
3        4         NaN       1001         4          NaN      Laptop   1000
4        5       105.0       1002         2       Edward  Smartphone    500

Total Sales by Customer:
  CustomerName  TotalPrice
0        Alice        1000
1          Bob        1000
2      Charlie         300
3       Edward        1000

Most Popular Product:
  ProductName  Quantity
0      Laptop         5
1  Smartphone         4
2      Tablet         1
```
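The order with a missing `CustomerID` can also be isolated explicitly. One possible approach, sketched below with the same sample data, is the `indicator=True` option of `pd.merge()`, which records where each row came from. The variable names `checked` and `orders_without_customer` are illustrative.

```python
import pandas as pd

# Same sample data as in the example above
orders = pd.DataFrame({
    'OrderID': [1, 2, 3, 4, 5],
    'CustomerID': [101, 102, 103, None, 105],
    'ProductID': [1001, 1002, 1003, 1001, 1002],
    'Quantity': [1, 2, 1, 4, 2]
})
customers = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105],
    'CustomerName': ['Alice', 'Bob', 'Charlie', 'David', 'Edward']
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
checked = pd.merge(orders, customers, on='CustomerID', how='left', indicator=True)

# 'left_only' rows are orders whose CustomerID found no match in customers
orders_without_customer = checked.loc[
    checked['_merge'] == 'left_only',
    ['OrderID', 'ProductID', 'Quantity']
]
print(orders_without_customer)
```

With this data, only order 4 is reported, since it is the one row with no customer attached.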

➡️ **Comparing Inner, Left, and Right Joins**

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})

# Inner join: only IDs present in both DataFrames
merged = pd.merge(df1, df2, on='ID', how='inner')
print(f"Inner Join:\n{merged}\n")

# Left join: every row of df1, with Score where available
merged1 = pd.merge(df1, df2, on='ID', how='left')
print(f"Left Join:\n{merged1}\n")

# Right join: every row of df2, with Name where available
merged2 = pd.merge(df1, df2, on='ID', how='right')
print(f"Right Join:\n{merged2}")
```

Output:
```
Inner Join:
   ID     Name  Score
0   2      Bob     85
1   3  Charlie     90

Left Join:
   ID     Name  Score
0   1    Alice    NaN
1   2      Bob   85.0
2   3  Charlie   90.0

Right Join:
   ID     Name  Score
0   2      Bob     85
1   3  Charlie     90
2   4      NaN     95
```
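In the left join output, `Score` prints as `85.0` rather than `85`. A plain integer column cannot hold `NaN`, so pandas converts it to float when a join introduces missing values. If integer display matters, one option (a sketch, not the only approach) is to convert the column to the nullable `Int64` dtype afterwards.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [85, 90, 95]})

left_joined = pd.merge(df1, df2, on='ID', how='left')
print(left_joined['Score'].dtype)  # float64, because of the NaN for Alice

# Convert to pandas' nullable integer dtype; the missing value is shown as <NA>
left_joined['Score'] = left_joined['Score'].astype('Int64')
print(left_joined)
```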


```python
import pandas as pd

# Create DataFrames
students = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

grades = pd.DataFrame({
    'ID': [1, 2, 4, 5],
    'Grade': ['A', 'B', 'A', 'C']
})

# Outer merge: keep every ID from both DataFrames
outer_merge = pd.merge(students, grades, on='ID', how='outer')
print(f"Outer Merge:\n{outer_merge}")
```

Output:
```
Outer Merge:
   ID     Name Grade
0   1    Alice     A
1   2      Bob     B
2   3  Charlie   NaN
3   4    David     A
4   5      NaN     C
```
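After an outer merge, the `NaN` entries mark rows that had no partner on the other side. If a report needs explicit labels instead, they can be filled per column with `fillna()`. The placeholder strings below (`'Unknown'`, `'Not graded'`) are hypothetical choices for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
grades = pd.DataFrame({
    'ID': [1, 2, 4, 5],
    'Grade': ['A', 'B', 'A', 'C']
})
outer_merge = pd.merge(students, grades, on='ID', how='outer')

# Fill missing values column by column (placeholder text is an assumption)
cleaned = outer_merge.fillna({'Name': 'Unknown', 'Grade': 'Not graded'})
print(cleaned)
```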


### **Join Using Index**
In Pandas, **joining** is similar to merging but is typically performed **using the index** of the DataFrames instead of specific columns.
It is especially useful when the index represents a meaningful identifier, such as customer ID, date, or category.

---
➡️ **Explanation**
* `df1.join(df2)` joins the two DataFrames **based on their index**; with no `how` argument it performs a left join.
* The `how` parameter determines how mismatched indices are handled:

| Parameter     | Description                                        |
| ------------- | -------------------------------------------------- |
| `how='inner'` | Keeps only matching indices                        |
| `how='outer'` | Keeps all indices, filling missing values with NaN |
| `how='left'`  | Keeps all indices from `df1`                       |
| `how='right'` | Keeps all indices from `df2`                       |

---
➡️ **Key Difference from Merge**

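* `merge()` matches on **columns** by default (index-based merges are available through `left_index` / `right_index`)
* `join()` matches on the **index** by default (its `on` parameter lets the calling DataFrame supply a key column instead)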
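The two methods overlap more than the summary suggests. The sketch below uses hypothetical `orders` and `customers` DataFrames, where the customer ID is a column on one side and the index on the other, and pairs them once with `join()` and once with `merge()`.

```python
import pandas as pd

# Hypothetical data: CustomerID is a column in orders but the index of customers
orders = pd.DataFrame({'OrderID': [1, 2, 3], 'CustomerID': [101, 102, 101]})
customers = pd.DataFrame({'CustomerName': ['Alice', 'Bob']}, index=[101, 102])

# join(): match the 'CustomerID' column of orders against the index of customers
via_join = orders.join(customers, on='CustomerID')

# merge(): the same pairing, expressed with left_on and right_index
via_merge = pd.merge(orders, customers,
                     left_on='CustomerID', right_index=True, how='left')

print(via_join)
```

Both results attach the same customer name to each order; which form to use is largely a matter of readability.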


```python
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['K', 'L', 'M'])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=['K', 'M', 'N'])

# Join DataFrames
joined = df1.join(df2, how='outer')
print(joined)
```

Output:
```
     A    B
K  1.0  4.0
L  2.0  NaN
M  3.0  5.0
N  NaN  6.0
```
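The example above works because the two DataFrames have different column names (`A` and `B`). When both sides share a column name, `join()` needs suffixes to tell them apart. The sketch below uses hypothetical quarterly data to show the `lsuffix` and `rsuffix` parameters.

```python
import pandas as pd

# Hypothetical data: both DataFrames have a column named 'Sales'
q1 = pd.DataFrame({'Sales': [10, 20]}, index=['K', 'L'])
q2 = pd.DataFrame({'Sales': [30, 40]}, index=['K', 'L'])

# Without suffixes, q1.join(q2) raises a ValueError because 'Sales' overlaps
joined = q1.join(q2, lsuffix='_q1', rsuffix='_q2')
print(joined)
```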


```python
import pandas as pd

# Create DataFrames
courses = pd.DataFrame({
    'CourseName': ['Python', 'Data Science', 'Machine Learning']
}, index=['CS101', 'DS201', 'ML301'])

instructors = pd.DataFrame({
    'Instructor': ['Dr. Smith', 'Prof. Johnson', 'Dr. Williams']
}, index=['CS101', 'DS201', 'CS201'])

# Outer join on the index: keep the course codes from both DataFrames
join_merge = courses.join(instructors, how='outer')
print(f"Joining via Index:\n{join_merge}")
```

Output:
```
Joining via Index:
             CourseName     Instructor
CS101            Python      Dr. Smith
CS201               NaN   Dr. Williams
DS201      Data Science  Prof. Johnson
ML301  Machine Learning            NaN
```
