# __Merge__

###  `pd.merge()` -- **SQL-like joins**

#### Parameters

| Parameter   | Default       | Meaning                                   |
| ----------- | ------------- | ----------------------------------------- |
| `left`      | (required)    | Left DataFrame                            |
| `right`     | (required)    | Right DataFrame                           |
| `on`        | `None`        | Column(s) to join on                      |
| `how`       | `'inner'`     | `'inner'`, `'left'`, `'right'`, `'outer'` |
| `left_on`   | `None`        | Left column if names differ               |
| `right_on`  | `None`        | Right column if names differ              |
| `suffixes`  | `('_x','_y')` | Rename overlapping columns                |
| `indicator` | `False`       | Adds column showing source (`_merge`)     |


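#### ***When `on` is omitted, `pd.merge()` joins on every column name the two DataFrames share, so the key column must have the same name in both (much like a SQL foreign key). With the default inner join, only rows whose key values appear in both DataFrames are kept.***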

In [1]:
import pandas as pd

In [2]:
df1 = pd.DataFrame({
    'cust_id': [1, 2, 3,5],
    'name': ['Ali', 'Sara', 'Usman','Haroon']
})

df2 = pd.DataFrame({
    'cust_id': [1, 2, 3,4],
    'order_id': [101, 102, 103,104]
})
pd.merge(df1,df2)
# pd.merge(df2,df1,on='cust_id')

# 'cust_id' is the only column name shared by both DataFrames, so it is used as the join key

Unnamed: 0,cust_id,name,order_id
0,1,Ali,101
1,2,Sara,102
2,3,Usman,103


### `on='Column to join on'` -- **Parameter; On which column to join**

In [3]:
pd.merge(df1,df2,on='cust_id')

Unnamed: 0,cust_id,name,order_id
0,1,Ali,101
1,2,Sara,102
2,3,Usman,103


### `how= 'inner/left/right/outer'` -- **Parameter; Type of join**

In [4]:
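# Only the last expression in a cell is displayed, so the output below is the outer join
pd.merge(df1,df2,how='inner')   # keys present in both: 1, 2, 3
pd.merge(df1,df2,how='left')    # all keys from df1: 1, 2, 3, 5
pd.merge(df1,df2,how='right')   # all keys from df2: 1, 2, 3, 4
pd.merge(df1,df2,how='outer')   # union of keys: 1, 2, 3, 4, 5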

Unnamed: 0,cust_id,name,order_id
0,1,Ali,101.0
1,2,Sara,102.0
2,3,Usman,103.0
3,4,,104.0
4,5,Haroon,
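
A quick way to compare the four join types is to count the rows each one returns. The sketch below rebuilds the two small frames from the cells above under the illustrative names `customers` and `orders` (these names are not from the notebook) and collects the row counts in a dictionary.

```python
import pandas as pd

# Same data as df1 / df2 above, under illustrative names
customers = pd.DataFrame({'cust_id': [1, 2, 3, 5],
                          'name': ['Ali', 'Sara', 'Usman', 'Haroon']})
orders = pd.DataFrame({'cust_id': [1, 2, 3, 4],
                       'order_id': [101, 102, 103, 104]})

# Number of rows returned by each join type
row_counts = {
    how: len(pd.merge(customers, orders, on='cust_id', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)  # {'inner': 3, 'left': 4, 'right': 4, 'outer': 5}
```

The inner join keeps only the three shared keys, each one-sided join adds the unmatched key from its own side, and the outer join keeps all five.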


### `indicator= True/False` -- **Parameter; Adds column showing source `_merge`**

In [5]:
pd.merge(df1,df2,how='outer',indicator=True)


Unnamed: 0,cust_id,name,order_id,_merge
0,1,Ali,101.0,both
1,2,Sara,102.0,both
2,3,Usman,103.0,both
3,4,,104.0,right_only
4,5,Haroon,,left_only


###  `left_index / right_index = True/False` -- **Parameter; joins on matching row indexes** 

* **`left_index`**: Tells Pandas to ignore columns and match using the **Index** (Row Labels) of the **Left** table.
* **`right_index`**: Tells Pandas to ignore columns and match using the **Index** (Row Labels) of the **Right** table.


In [6]:
df1 = pd.DataFrame({
    'cust_id': [1, 2, 3, 5],
    'order_id': [101, 102, 103,104]
})

df2 = pd.DataFrame({
    'cust_id': [1, 2, 3, 4],
    'order_id': [101, 102, 103,104]
})

In [7]:
# Join strictly based on their Index
pd.merge(df1,df2,left_index=True,right_index=True)

Unnamed: 0,cust_id_x,order_id_x,cust_id_y,order_id_y
0,1,101,1,101
1,2,102,2,102
2,3,103,3,103
3,5,104,4,104
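
Note that the index join above pairs rows by row label only: row 3 combines customer 5 with customer 4 because both sit at index 3. The two styles can also be mixed. The sketch below matches a column on the left against the index on the right; the `names` lookup table is a hypothetical example, not part of the notebook.

```python
import pandas as pd

orders = pd.DataFrame({'cust_id': [1, 2, 3, 4],
                       'order_id': [101, 102, 103, 104]})

# Hypothetical lookup table whose index holds the customer id
names = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman', 'Haroon']},
                     index=[1, 2, 3, 5])

# Match the 'cust_id' column on the left against the index on the right
result = pd.merge(orders, names, left_on='cust_id', right_index=True, how='left')
print(result)
```

Customer 4 has no entry in the index of `names`, so its `name` is NaN, while the other three orders pick up the correct name.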


### `left_on / right_on = 'column'` -- **Parameter; Use differently named columns and match them**
* **`left_on`**: Tells Pandas to look for the matching key in a **Column** of the **Left** table.
* **`right_on`**: Tells Pandas to look for the matching key in a **Column** of the **Right** table.

In [8]:
left = pd.DataFrame({
    'cust_id': [1, 2, 3, 5,6],
    'transaction_id': [101, 102, 103,104,105]
})

right = pd.DataFrame({
    'cust_id': [1, 2, 3, 4,7],
    'order_id': [101, 102, 103,104,106]
})

In [9]:
pd.merge(left,right,left_on='transaction_id',right_on='order_id')


Unnamed: 0,cust_id_x,transaction_id,cust_id_y,order_id
0,1,101,1,101
1,2,102,2,102
2,3,103,3,103
3,5,104,4,104


In [10]:
pd.merge(left,right,left_index=True,right_index=True)


Unnamed: 0,cust_id_x,transaction_id,cust_id_y,order_id
0,1,101,1,101
1,2,102,2,102
2,3,103,3,103
3,5,104,4,104
4,6,105,7,106





### ⚡ Cheat Sheet

| Parameter | What it tells Pandas |
| --- | --- |
| `on='ID'` | Both tables have a column named `'ID'`. Use it. |
| `left_on='ID', right_on='uid'` | Left uses `'ID'`, Right uses `'uid'`. Match them. |
| `left_index=True` | Don't look for a column; use the **Left Index** as the key. |
| `right_index=True` | Don't look for a column; use the **Right Index** as the key. |

**In short:**
* **`_on`** = Use a Column.
* **`_index`** = Use the Index.

In [11]:
left = pd.DataFrame({
    'cust_id': [1, 2, 3, 5,6],
    'order_id': [101, 102, 103,104,105]
})

right = pd.DataFrame({
    'cust_id': [1, 2, 3, 4,7],
    'order_id': [101, 102, 103,104,106]
})

### `suffixes=('_x','_y')` -- **Parameter; Suffixes added to overlapping column names**


In [12]:
pd.merge(left,right,left_index=True,right_index=True, suffixes=('ID','Number'))


Unnamed: 0,cust_idID,order_idID,cust_idNumber,order_idNumber
0,1,101,1,101
1,2,102,2,102
2,3,103,3,103
3,5,104,4,104
4,6,105,7,106
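
Suffixes are appended exactly as given, so a leading underscore usually reads better. They are applied only to overlapping columns that are not used as the join key. A minimal sketch with made-up order numbers (the `_old` / `_new` suffixes are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'cust_id': [1, 2, 3], 'order_id': [101, 102, 103]})
right = pd.DataFrame({'cust_id': [1, 2, 3], 'order_id': [201, 202, 203]})

# 'cust_id' is the key, so it appears once; only 'order_id' overlaps and gets suffixed
merged = pd.merge(left, right, on='cust_id', suffixes=('_old', '_new'))
print(merged.columns.tolist())  # ['cust_id', 'order_id_old', 'order_id_new']
```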


---
---

# **Practice Problems**

In [13]:
employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman'],
    'DeptID': [1, 2, 1, 3, 2]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3, 4],
    'DeptName': ['IT', 'HR', 'Finance', 'Marketing']
})

salaries = pd.DataFrame({
    'EmpID': [101, 102, 104],
    'Salary': [50000, 60000, 55000]
})

### 🧠 Problem 1

Merge `employees` and `departments` to show:

> EmpID, Name, DeptName

(Keep **all employees**, even if department is missing)


In [14]:
# how='left' keeps every employee; an outer join would also add a Marketing row with no employee
pd.merge(employees,departments,on='DeptID',how='left')[['EmpID','Name','DeptName']]

Unnamed: 0,EmpID,Name,DeptName
0,101,Ali,IT
1,102,Sara,HR
2,103,Ahmed,IT
3,104,Zara,Finance
4,105,Usman,HR



### 🧠 Problem 2

Merge `employees` and `salaries` so that:

* Employees without salary still appear
* Missing salary values are allowed


In [15]:
pd.merge(employees,salaries, on='EmpID',how='left')

Unnamed: 0,EmpID,Name,DeptID,Salary
0,101,Ali,1,50000.0
1,102,Sara,2,60000.0
2,103,Ahmed,1,
3,104,Zara,3,55000.0
4,105,Usman,2,


### 🧠 Problem 3

Find employees whose department exists **but salary does NOT exist**
(Hint: merge + filter)


In [16]:
merged=pd.merge(employees,salaries,on='EmpID',how='left',indicator=True)
merged[merged['_merge']=='left_only']


Unnamed: 0,EmpID,Name,DeptID,Salary,_merge
2,103,Ahmed,1,,left_only
4,105,Usman,2,,left_only
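
The cell above checks only for a missing salary; it gives the right answer here because every `DeptID` in `employees` exists in `departments`. A sketch that enforces both conditions could chain two merges (the intermediate names `with_dept` and `checked` are illustrative):

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman'],
    'DeptID': [1, 2, 1, 3, 2]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3, 4],
    'DeptName': ['IT', 'HR', 'Finance', 'Marketing']
})
salaries = pd.DataFrame({
    'EmpID': [101, 102, 104],
    'Salary': [50000, 60000, 55000]
})

# Inner join keeps only employees whose DeptID exists in `departments`
with_dept = pd.merge(employees, departments, on='DeptID', how='inner')

# Left join with indicator marks the employees that have no salary row
checked = pd.merge(with_dept, salaries, on='EmpID', how='left', indicator=True)
no_salary = checked[checked['_merge'] == 'left_only']
print(no_salary[['EmpID', 'Name', 'DeptName']])
```

With this data the result is still Ahmed (103) and Usman (105).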



### 🧠 Problem 4

Do an **outer merge** between `employees` and `salaries` and identify:

* Employees without salary
* Salaries without employees (if any)


In [17]:
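merged=pd.merge(employees,salaries,on='EmpID',how='outer',indicator=True)
merged[merged['_merge']=='left_only']    # employees without salary (103, 105); not displayed, only the last expression is
merged[merged['_merge']=='right_only']   # salaries without employees: none, so the output below is empty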


Unnamed: 0,EmpID,Name,DeptID,Salary,_merge


---
---
# 🧠 **HARD MERGE PRACTICE (ADVANCED)**

# ⚠️ RULES (STRICT)

❌ No loops

❌ No `.apply()`

❌ No manual row filtering

✅ Only `merge()` logic

✅ Use `_merge`, `validate`, grouping if needed

---

In [26]:
employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3],
    'DeptName': ['IT', 'HR', 'Finance']
})
salaries = pd.DataFrame({
    'EmpID': [101, 101, 102, 104, 107],
    'Salary': [50000, 52000, 60000, 55000, 70000]
})
ratings = pd.DataFrame({
    'EmpID': [101, 102, 103, 104],
    'Year': [2023, 2023, 2024, 2023],
    'Rating': ['A', 'B', 'A', 'C']
})


In [30]:
employees
departments
salaries
ratings

Unnamed: 0,EmpID,Year,Rating
0,101,2023,A
1,102,2023,B
2,103,2024,A
3,104,2023,C


## 🧩 Problem 1 -- Missing Join Keys

Merge `employees` with `departments` so that:

* All employees appear
* Employees with missing `DeptID` remain
* Department name appears where possible

👉 Explain **why** NaN appears where it does.

In [36]:
pd.merge(employees,departments,on='DeptID',how='left')

Unnamed: 0,EmpID,Name,DeptID,DeptName
0,101,Ali,1.0,IT
1,102,Sara,2.0,HR
2,103,Ahmed,1.0,IT
3,104,Zara,3.0,Finance
4,105,Usman,2.0,HR
5,106,Hina,,
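
One way to answer the "why" is to look at the key column itself. In the sketch below (the `unmatched` name is illustrative), the `None` in `DeptID` is stored as NaN, which turns the whole column into floats (hence `1.0`, `2.0`), and the left join keeps Hina's row even though her key finds no partner in `departments`.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3],
    'DeptName': ['IT', 'HR', 'Finance']
})

# None is stored as NaN, which forces the whole DeptID column to float64
print(employees['DeptID'].dtype)

merged = pd.merge(employees, departments, on='DeptID', how='left')

# Rows whose left key found no partner in `departments`
unmatched = merged[merged['DeptName'].isna()]
print(unmatched[['EmpID', 'Name', 'DeptID']])
```

Only Hina is unmatched: her `DeptID` is NaN because it was missing in the input, and her `DeptName` is NaN because no department row matches that key.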


## 🧩 Problem 2 -- Duplicate Explosion

Merge `employees` with `salaries` using `EmpID`.

* Observe number of rows **before & after**
* Identify which employee causes row duplication

👉 Explain **why this happens**.


In [47]:
pd.merge(employees,salaries,on='EmpID')

Unnamed: 0,EmpID,Name,DeptID,Salary
0,101,Ali,1.0,50000
1,101,Ali,1.0,52000
2,102,Sara,2.0,60000
3,104,Zara,3.0,55000


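**Because `salaries` has two rows for `EmpID` 101, Ali's single row is matched twice (a one-to-many join). The default inner join also drops employees 103, 105 and 106, who have no salary row, so 6 employees become 4 rows.**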
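
The row counts and the offending key can be checked without printing whole tables. A minimal sketch, using illustrative variable names:

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
salaries = pd.DataFrame({
    'EmpID': [101, 101, 102, 104, 107],
    'Salary': [50000, 52000, 60000, 55000, 70000]
})

rows_before = len(employees)                                   # 6
rows_inner = len(pd.merge(employees, salaries, on='EmpID'))    # 4
rows_left = len(pd.merge(employees, salaries, on='EmpID', how='left'))  # 7

# Keys that occur more than once on the right side cause the duplication
dup_keys = salaries.loc[salaries['EmpID'].duplicated(keep=False), 'EmpID'].unique().tolist()
print(rows_before, rows_inner, rows_left, dup_keys)  # 6 4 7 [101]
```

The left join makes the explosion easier to see: all six employees are kept, yet seven rows come back because 101 appears twice in `salaries`.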


## 🧩 Problem 3 -- Fix Duplicate Explosion

From Problem 2:

* Keep **only the latest salary** per employee
* Then merge again

(No hardcoding, no manual filtering)


In [62]:

# Assumption: rows in `salaries` are in chronological order, so the last row per EmpID is the latest.
# drop_duplicates() returns a new DataFrame and leaves `salaries` unchanged;
# assigning a de-duplicated EmpID column back would put NaN in the dropped row instead of removing it.
latest_salaries=salaries.drop_duplicates(subset='EmpID',keep='last')
pd.merge(employees, latest_salaries, on='EmpID')

Unnamed: 0,EmpID,Name,DeptID,Salary
0,101,Ali,1.0,52000
1,102,Sara,2.0,60000
2,104,Zara,3.0,55000


## 🧩 Problem 4 -- Anti-Join (Left Only)

Find employees who **do NOT have salary records**.

🚫 You are **NOT allowed** to use `isin()`.

In [74]:
no_salary=pd.merge(employees,salaries,on='EmpID',how='outer',indicator=True)
no_salary[no_salary['_merge']=='left_only']


Unnamed: 0,EmpID,Name,DeptID,Salary,_merge
3,103,Ahmed,1.0,,left_only
5,105,Usman,2.0,,left_only
6,106,Hina,,,left_only


## 🧩 Problem 5 -- Anti-Join (Right Only)

Find salary records that **do NOT belong to any employee**.

🚫 No boolean tricks; use merge logic only.

In [76]:
no_emp=pd.merge(employees,salaries, on='EmpID',how='outer',indicator=True)
no_emp[no_emp['_merge']=='right_only']

Unnamed: 0,EmpID,Name,DeptID,Salary,_merge
7,107,,,70000.0,right_only



## 🧩 Problem 6 -- Multi-Key Merge

Merge `employees` and `ratings` such that:

* Only ratings from **2023** appear
* Employees without ratings still appear

⚠️ Do **NOT** filter before merging.
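
One possible approach, offered as a sketch rather than the intended answer: add the target year to the left table as a second key column, then left-join on both `EmpID` and `Year`. The name `ratings_2023` is illustrative.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
ratings = pd.DataFrame({
    'EmpID': [101, 102, 103, 104],
    'Year': [2023, 2023, 2024, 2023],
    'Rating': ['A', 'B', 'A', 'C']
})

# Add the target year as a second key on the left side, then join on both keys.
# Neither table is filtered: a rating matches only if EmpID and Year both agree.
ratings_2023 = pd.merge(
    employees.assign(Year=2023),
    ratings,
    on=['EmpID', 'Year'],
    how='left',
)
print(ratings_2023[['EmpID', 'Name', 'Year', 'Rating']])
```

Ahmed's only rating is from 2024, so it does not match the `(103, 2023)` key and his `Rating` is NaN, just like the two employees with no ratings at all.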


## 🧩 Problem 7 -- Wrong Merge Trap

Someone writes:

```python
pd.merge(employees, departments, left_on='EmpID', right_on='DeptID')
```

1️⃣ Run it
2️⃣ Explain **why the result is logically wrong**, not syntactically
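
A sketch of what running it shows, using the sample data from this section (the names `wrong` and `correct` are illustrative). The call is valid pandas, but it compares employee numbers with department numbers, two unrelated identifiers.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3],
    'DeptName': ['IT', 'HR', 'Finance']
})

# Runs without error, but no EmpID (101..106) equals any DeptID (1..3)
wrong = pd.merge(employees, departments, left_on='EmpID', right_on='DeptID')
print(len(wrong))  # 0

# Joining on the shared DeptID column is what links an employee to a department
correct = pd.merge(employees, departments, on='DeptID', how='left')
print(len(correct))  # 6
```

Here the mistake is easy to spot because the result is empty. If the two id ranges happened to overlap, the merge would return rows that look plausible but pair employees with unrelated departments.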



## 🧩 Problem 8 -- Validate Your Merge

Merge `employees` and `departments` but:

* Throw an error if **one DeptID maps to multiple departments**

(Hint: `validate=` parameter)
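
A minimal sketch of the hint: `validate='many_to_one'` lets many employees share a `DeptID` but requires each `DeptID` to be unique in the right table. The `bad_departments` table is a hypothetical example added here to trigger the error.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
departments = pd.DataFrame({
    'DeptID': [1, 2, 3],
    'DeptName': ['IT', 'HR', 'Finance']
})

# Passes: every DeptID appears once in `departments`
ok = pd.merge(employees, departments, on='DeptID', how='left', validate='many_to_one')

# Hypothetical bad table in which DeptID 1 maps to two department names
bad_departments = pd.DataFrame({
    'DeptID': [1, 1, 2, 3],
    'DeptName': ['IT', 'Tech', 'HR', 'Finance']
})

try:
    pd.merge(employees, bad_departments, on='DeptID', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print('Merge rejected:', err)
```

The check runs before any rows are combined, so a duplicated key is reported as an error instead of silently multiplying rows.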

## 🧩 Problem 9 -- Merge Indicator Analysis

Merge `employees` and `ratings` using:

```python
indicator=True
```

Then:

* Count how many rows are:

  * `left_only`
  * `both`
  * `right_only`
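
A sketch of one way to get the counts, assuming an outer join so that all three categories can occur:

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
ratings = pd.DataFrame({
    'EmpID': [101, 102, 103, 104],
    'Year': [2023, 2023, 2024, 2023],
    'Rating': ['A', 'B', 'A', 'C']
})

merged = pd.merge(employees, ratings, on='EmpID', how='outer', indicator=True)

# Count rows per source category
counts = merged['_merge'].value_counts()
print(counts)
```

With this data four employees have a rating (`both`), employees 105 and 106 have none (`left_only`), and no rating belongs to an unknown employee (`right_only`).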


## 🧩 Problem 10 -- Real-World Debug

You expect **6 employees** after merge, but you get **8 rows**.

👉 Without printing full DataFrames:

* Identify the cause
* Fix the merge
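
The problem does not say which merge produced 8 rows. The sketch below assumes it was an outer merge of `employees` and `salaries` from this section, which does give 8 rows, and uses small summaries instead of full printouts. Treating the last row per `EmpID` as the latest salary is also an assumption, since the table has no date column.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105, 106],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman', 'Hina'],
    'DeptID': [1, 2, 1, 3, 2, None]
})
salaries = pd.DataFrame({
    'EmpID': [101, 101, 102, 104, 107],
    'Salary': [50000, 52000, 60000, 55000, 70000]
})

# Assumed problem merge: outer join, 8 rows instead of the expected 6
merged = pd.merge(employees, salaries, on='EmpID', how='outer', indicator=True)

# Diagnose cause 1: duplicated keys on the right side add extra rows
dup_count = int(salaries['EmpID'].duplicated().sum())

# Diagnose cause 2: right_only rows are salaries with no matching employee
source_counts = merged['_merge'].value_counts()

# Fix: one salary row per employee, keep only the left side's keys,
# and let validate= raise an error if either key column is not unique
latest = salaries.drop_duplicates(subset='EmpID', keep='last')
fixed = pd.merge(employees, latest, on='EmpID', how='left', validate='one_to_one')
print(len(merged), dup_count, len(fixed))  # 8 1 6
```

One extra row comes from the duplicated `EmpID` 101 and one from the orphan salary for `EmpID` 107; removing the duplicate and switching to a left join brings the result back to six rows.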