# **Join**

In [1]:
import pandas as pd
import numpy as np

### `df.join()` -- **Index-based join**

#### 👉 Important Parameters

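| Parameter | Default  | Meaning                                                                                          |
| --------- | -------- | ------------------------------------------------------------------------------------------------ |
| `other`   | required | DataFrame, named Series, or list of them to join                                                 |
| `on`      | `None`   | Column(s) of the calling DataFrame to match against the index of `other`; `None` means index-on-index |
| `how`     | `'left'` | `'left'`, `'right'`, `'inner'`, `'outer'` (recent pandas versions also accept `'cross'`)         |
| `lsuffix` | `''`     | Suffix added to overlapping column names from the left                                           |
| `rsuffix` | `''`     | Suffix added to overlapping column names from the right                                          |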

📌 **Best when the index is meaningful**

**Columns cannot share a name unless `lsuffix` or `rsuffix` is given**
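The `on` parameter from the table is not used anywhere else in this notebook, so here is a minimal sketch of it. The `staff` and `depts` frames and their column names are made up for illustration. The key lives in an ordinary column on the left and in the index on the right.

```python
import pandas as pd

# Hypothetical data: 'dept_id' is a regular column on the left,
# while the right frame is indexed by the department id.
staff = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman'], 'dept_id': [1, 2, 1]})
depts = pd.DataFrame({'dept_name': ['IT', 'HR']}, index=[1, 2])

# Match staff['dept_id'] against depts.index; the left index (0, 1, 2) is kept.
joined = staff.join(depts, on='dept_id')
print(joined)
```

Without `on`, the same call would try to match the left index 0, 1, 2 against the right index 1, 2, which is rarely what is wanted when the key is a column.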

In [2]:

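# df1 has 5 rows and df2 has 4, both with the default 0..n-1 index
df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [11, 12, 13, 14, 15]})
df2 = pd.DataFrame({'C': [21, 22, 23, 25], 'D': [31, 32, 33, 35]})

# Left join on the index: row 4 has no match in df2, so C and D are NaN
df1.join(df2)
# df2.join(df1)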

# users = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman']}, index=[1, 2, 3])
# scores = pd.DataFrame({'score': [85, 90]}, index=[1, 3])
# users.join(scores)

# users = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman','Haroon']}, index=['a','b','c','d'])
# scores = pd.DataFrame({'score': [85, 90]})
# users.join(scores)


# users = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman','Haroon']}, index=['a','b','c','d'])
# scores = pd.DataFrame({'score': [85, 90]}, index=['a','d'])
# users.join(scores)




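|     | A   | B   | C    | D    |
| --- | --- | --- | ---- | ---- |
| 0   | 1   | 11  | 21.0 | 31.0 |
| 1   | 2   | 12  | 22.0 | 32.0 |
| 2   | 3   | 13  | 23.0 | 33.0 |
| 3   | 4   | 14  | 25.0 | 35.0 |
| 4   | 5   | 15  | NaN  | NaN  |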


## `how= left/right/inner/outer` -- **Parameter**

In [3]:
users = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman','Haroon']}, index=['a','b','c','d'])
scores = pd.DataFrame({'score': [85, 90]}, index=['a','d'])

In [4]:
# left (default): keep every index label of the left table
users.join(scores, how='left')

|     | name   | score |
| --- | ------ | ----- |
| a   | Ali    | 85.0  |
| b   | Sara   | NaN   |
| c   | Usman  | NaN   |
| d   | Haroon | 90.0  |


In [5]:
# right: keep every index label of the right table
users.join(scores, how='right')


|     | name   | score |
| --- | ------ | ----- |
| a   | Ali    | 85    |
| d   | Haroon | 90    |


In [6]:
# inner : Intersection
users.join(scores,how='inner')


|     | name   | score |
| --- | ------ | ----- |
| a   | Ali    | 85    |
| d   | Haroon | 90    |


In [7]:
# outer : Union
users.join(scores,how='outer')


|     | name   | score |
| --- | ------ | ----- |
| a   | Ali    | 85.0  |
| b   | Sara   | NaN   |
| c   | Usman  | NaN   |
| d   | Haroon | 90.0  |
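In the outputs above, `left` looks the same as `outer` and `right` looks the same as `inner`. That is a property of this particular data: every label in `scores` also exists in `users`. The sketch below collects the resulting index for each `how` value so the four can be compared side by side.

```python
import pandas as pd

users = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman', 'Haroon']},
                     index=['a', 'b', 'c', 'd'])
scores = pd.DataFrame({'score': [85, 90]}, index=['a', 'd'])

# Index labels that survive under each join type
index_by_how = {
    how: list(users.join(scores, how=how).index)
    for how in ['left', 'right', 'inner', 'outer']
}

for how, labels in index_by_how.items():
    print(f"{how:>5}: {labels}")
```

The four results only differ from each other once the right table contains a label that the left table lacks, as in the `df_a` / `df_b` problems later on.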


## `suffix : lsuffix / rsuffix` -- **Parameter**

In [8]:
left = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman','Haroon']}, index=['a','b','c','d'])
right= pd.DataFrame({'name': ['Aisha','Areeba','Aleena','Areesha']}, index=['a','b','c','d'])

In [9]:
left.join(right, lsuffix='_Left')     # columns: name_Left, name
left.join(right, rsuffix='_Right')    # columns: name, name_Right
# left.join(right, lsuffix='_Left', rsuffix='_Right')   # columns: name_Left, name_Right

|     | name   | name_Right |
| --- | ------ | ---------- |
| a   | Ali    | Aisha      |
| b   | Sara   | Areeba     |
| c   | Usman  | Aleena     |
| d   | Haroon | Areesha    |
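A suffix is only applied to the side it is given for, and only to column names that clash. A small sketch of the three combinations, reusing the `left` and `right` frames from this section:

```python
import pandas as pd

left = pd.DataFrame({'name': ['Ali', 'Sara', 'Usman', 'Haroon']},
                    index=['a', 'b', 'c', 'd'])
right = pd.DataFrame({'name': ['Aisha', 'Areeba', 'Aleena', 'Areesha']},
                     index=['a', 'b', 'c', 'd'])

# Resulting column names for each suffix combination
only_l = left.join(right, lsuffix='_Left').columns.tolist()
only_r = left.join(right, rsuffix='_Right').columns.tolist()
both = left.join(right, lsuffix='_Left', rsuffix='_Right').columns.tolist()

print(only_l)
print(only_r)
print(both)
```

Giving both suffixes is usually the clearest choice, since neither output column keeps the ambiguous bare name.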


---
# **Practice Problems**

In [10]:
employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman'],
    'DeptID': [1, 2, 1, 3, 2]
})

salaries = pd.DataFrame({
    'EmpID': [101, 102, 104],
    'Salary': [50000, 60000, 55000]
})
# 📘 Dataset F -- Employee Info (indexed by EmpID)
emp_info = employees.set_index('EmpID')

# 📘 Dataset G -- Salary Info (indexed by EmpID)
sal_info = salaries.set_index('EmpID')

---
### 🧠 Problem 1

Join `emp_info` and `sal_info` so that:

* All employees remain
* Salary appears where available


In [11]:
emp_info.join(sal_info, how='left')

| EmpID | Name  | DeptID | Salary  |
| ----- | ----- | ------ | ------- |
| 101   | Ali   | 1      | 50000.0 |
| 102   | Sara  | 2      | 60000.0 |
| 103   | Ahmed | 1      | NaN     |
| 104   | Zara  | 3      | 55000.0 |
| 105   | Usman | 2      | NaN     |


### 🧠 Problem 2

Change join type so that **only employees with salary** appear.


In [12]:
emp_info.join(sal_info,how='inner')

| EmpID | Name | DeptID | Salary |
| ----- | ---- | ------ | ------ |
| 101   | Ali  | 1      | 50000  |
| 102   | Sara | 2      | 60000  |
| 104   | Zara | 3      | 55000  |


### 🧠 Problem 3

Why would `join()` be cleaner than `merge()` in this case?
(Answer in words, not code)

**`merge()` works like a SQL join and, by default, matches rows on column values, while `join()` matches rows on the index. Here both tables are already indexed by `EmpID`, so `join()` needs no key arguments.**

✅ Answer (conceptual)

* Keys already in index

* No need for `on=`, `left_on=`, `right_on=`

* Less error-prone

* More readable for index-aligned data

👉 `join()` = semantic clarity when indexes matter
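To make the comparison concrete, here is a minimal sketch that builds the same result both ways from the `employees` and `salaries` tables defined for these practice problems. The variable names `via_merge` and `via_join` are only for illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmpID': [101, 102, 103, 104, 105],
    'Name': ['Ali', 'Sara', 'Ahmed', 'Zara', 'Usman'],
    'DeptID': [1, 2, 1, 3, 2]
})
salaries = pd.DataFrame({
    'EmpID': [101, 102, 104],
    'Salary': [50000, 60000, 55000]
})

# merge(): the key is a column, so it has to be named explicitly
via_merge = employees.merge(salaries, on='EmpID', how='left').set_index('EmpID')

# join(): the key is already the index, so no key argument is needed
via_join = employees.set_index('EmpID').join(salaries.set_index('EmpID'))

print(via_join.equals(via_merge))
```

Both routes give the same table. The `join()` version is shorter only because the index was set beforehand, so the choice mostly depends on where the key already lives.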

---
# 🔥 10 HARD & TRICKY `join()` PROBLEMS

In [13]:
# 📘 Dataset 1 -- Index Mismatch
df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])

df_b = pd.DataFrame({
    'Salary': [50000, 60000, 55000]
}, index=[102, 103, 104])

### 🧩 Problem 1

Join `df_a` and `df_b`.

* Explain why NaN appears
* Identify which rows mismatched


In [14]:
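df_a.join(df_b)                 # left join: 101 has no match, so Salary is NaN
df_a.join(df_b).count()         # non-null count per column: Salary is 2, not 3
df_a.join(df_b, how='inner')    # only labels present in both tables: 102 and 103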


|     | Name  | Dept | Salary |
| --- | ----- | ---- | ------ |
| 102 | Sara  | HR   | 50000  |
| 103 | Ahmed | IT   | 60000  |


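**Ans: In the default left join, `Salary` is NaN for index 101 because `df_b` has no row labelled 101. Label 104 exists only in `df_b`, so it is dropped.**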
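The mismatched labels can also be found without reading the joined table by eye. A small sketch using `Index.difference` on the two indexes (the names `only_in_a` and `only_in_b` are illustrative):

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])

df_b = pd.DataFrame({
    'Salary': [50000, 60000, 55000]
}, index=[102, 103, 104])

# Labels that exist on one side only
only_in_a = df_a.index.difference(df_b.index)   # gets NaN in a left join
only_in_b = df_b.index.difference(df_a.index)   # dropped by a left join

print(list(only_in_a))
print(list(only_in_b))
```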

In [15]:
# 📘 Dataset 2 -- Duplicate Index

df_c = pd.DataFrame({
    'Score': [80, 85, 90]
}, index=[101, 101, 102])


### 🧩 Problem 2

Join `df_a` with `df_c`.

* Observe row count
* Explain duplication


In [22]:
df_a.join(df_c)                           # 4 rows, although df_a has only 3
df_a.join(df_c)['Name'].value_counts()    # Ali appears twice

|     | Name  | Dept | Score |
| --- | ----- | ---- | ----- |
| 101 | Ali   | IT   | 80.0  |
| 101 | Ali   | IT   | 85.0  |
| 102 | Sara  | HR   | 90.0  |
| 103 | Ahmed | IT   | NaN   |


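**`df_c` has the index label 101 twice, so the single `df_a` row for 101 is matched with both `df_c` rows and appears twice in the result.**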

In [23]:

## 📘 Dataset 3 -- Join with Series
bonus = pd.Series(
    [5000, 7000],
    index=[101, 103],
    name='Bonus'
)


### 🧩 Problem 3

Join `df_a` with `bonus`.

* What happens to column name?
* Why is this useful?


In [24]:
df_a.join(bonus)

|     | Name  | Dept | Bonus  |
| --- | ----- | ---- | ------ |
| 101 | Ali   | IT   | 5000.0 |
| 102 | Sara  | HR   | NaN    |
| 103 | Ahmed | IT   | 7000.0 |
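The Series `name` becomes the new column name, which is why `Bonus` shows up as a column above. The sketch below also shows what happens when the name is missing: pandas raises a `ValueError`, and `rename()` is one way to supply a name. The `unnamed` Series is a made-up variant of `bonus` for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])

bonus = pd.Series([5000, 7000], index=[101, 103], name='Bonus')
named_cols = df_a.join(bonus).columns.tolist()
print(named_cols)

# Same data without a name: join() has nothing to call the new column
unnamed = pd.Series([5000, 7000], index=[101, 103])
error = None
try:
    df_a.join(unnamed)
except ValueError as exc:
    error = exc
    print('ValueError:', exc)

# Giving the Series a name makes the join work again
fixed_cols = df_a.join(unnamed.rename('Bonus')).columns.tolist()
print(fixed_cols)
```

Joining a Series is handy for attaching a single computed column, such as the result of a `groupby` aggregation, without first wrapping it in a DataFrame.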


### 📘 Dataset 4 -- Join Multiple DataFrames

---

### 🧩 Problem 4

Join `df_a` with `df_b` **and** `bonus` in one statement.

(Hint: list of DataFrames)


In [25]:
dfs = [df_b, bonus]    # a list may mix DataFrames and named Series
df_a.join(dfs)

|     | Name  | Dept | Salary  | Bonus  |
| --- | ----- | ---- | ------- | ------ |
| 101 | Ali   | IT   | NaN     | 5000.0 |
| 102 | Sara  | HR   | 50000.0 | NaN    |
| 103 | Ahmed | IT   | 60000.0 | 7000.0 |
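An alternative to passing a list is to chain two `join()` calls. This is only a sketch of that alternative, not a replacement for the list form. Chaining is a little longer, but each step can take its own `how`, `lsuffix`, or `rsuffix`.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])
df_b = pd.DataFrame({'Salary': [50000, 60000, 55000]}, index=[102, 103, 104])
bonus = pd.Series([5000, 7000], index=[101, 103], name='Bonus')

# Two left joins in a row; df_a's index is preserved at every step
chained = df_a.join(df_b).join(bonus)
print(chained)
```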


In [62]:

## 📘 Dataset 5 -- Index Type Trap

df_d = pd.DataFrame({
    'Salary': [50000, 60000]
}, index=['101', '102'])   # strings!



### 🧩 Problem 5

Join `df_a` with `df_d`.

* Why does NOTHING match?
* How do you fix it?

In [63]:
df_a.join(df_d)                        # nothing matches: 101 (int) is not '101' (str)
df_d.index = df_d.index.astype(int)    # fix: convert the string labels to integers
df_a.join(df_d)

|     | Name  | Dept | Salary  |
| --- | ----- | ---- | ------- |
| 101 | Ali   | IT   | 50000.0 |
| 102 | Sara  | HR   | 60000.0 |
| 103 | Ahmed | IT   | NaN     |
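This kind of mismatch is easy to miss because `101` and `'101'` print the same way. One way to catch it early, sketched below, is to compare the index dtypes and count the shared labels before joining. The check itself is an illustrative habit, not a pandas feature.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])
df_d = pd.DataFrame({'Salary': [50000, 60000]}, index=['101', '102'])  # strings!

# Different dtypes are the warning sign
print(df_a.index.dtype, df_d.index.dtype)

# Plain Python set intersection: integers never equal strings
shared_before = set(df_a.index) & set(df_d.index)
print(len(shared_before))

# Convert a copy so the original df_d is left untouched
df_d_fixed = df_d.copy()
df_d_fixed.index = df_d_fixed.index.astype(int)
shared_after = set(df_a.index) & set(df_d_fixed.index)

result = df_a.join(df_d_fixed)
print(result)
```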


In [44]:
## 📘 Dataset 6 -- Join with Suffixes
df_e = pd.DataFrame({
    'Dept': ['IT', 'HR']
}, index=[101, 102])


### 🧩 Problem 6

Join `df_a` and `df_e`.

* What error occurs?
* **Ans: A `ValueError` (columns overlap but no suffix specified), because both DataFrames have a column named `Dept`**
* Fix it properly


In [None]:
df_a.join(df_e, lsuffix='_Left', rsuffix='_Right')   # fix: suffixes disambiguate the two Dept columns

|     | Name  | Dept_Left | Dept_Right |
| --- | ----- | --------- | ---------- |
| 101 | Ali   | IT        | IT         |
| 102 | Sara  | HR        | HR         |
| 103 | Ahmed | IT        | NaN        |
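The failing call is not shown in the cell above, so here is a minimal sketch that triggers the error and catches it. It also shows a second possible fix, renaming the clashing column before the join. The new name `Dept_hr_record` is purely hypothetical.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])
df_e = pd.DataFrame({'Dept': ['IT', 'HR']}, index=[101, 102])

# Without suffixes, the shared column name 'Dept' makes join() fail
error = None
try:
    df_a.join(df_e)
except ValueError as exc:
    error = exc
    print('ValueError:', exc)

# Alternative fix: give the right-hand column a meaningful name first
renamed = df_a.join(df_e.rename(columns={'Dept': 'Dept_hr_record'}))
print(renamed.columns.tolist())
```

Renaming tends to read better than suffixes when the two columns mean different things, since the result says what each one is.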


### 📘 Dataset 7 -- Outer Join Logic

---

### 🧩 Problem 7

Perform an **outer join** between `df_a` and `df_b`.

* Identify rows that come only from right DataFrame


In [None]:
df_a.join(df_b, how='outer')    # 104 exists only in df_b, so Name and Dept are NaN


|     | Name  | Dept | Salary  |
| --- | ----- | ---- | ------- |
| 101 | Ali   | IT   | NaN     |
| 102 | Sara  | HR   | 50000.0 |
| 103 | Ahmed | IT   | 60000.0 |
| 104 | NaN   | NaN  | 55000.0 |
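Spotting right-only rows by looking for NaN works here, but it breaks down if the left table has genuine missing values. `join()` does not take an `indicator` argument, so this sketch switches to `pd.merge` with `indicator=True`, which adds a `_merge` column that records where each row came from.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])
df_b = pd.DataFrame({'Salary': [50000, 60000, 55000]}, index=[102, 103, 104])

# Outer join on the index, with a provenance column
merged = pd.merge(df_a, df_b, left_index=True, right_index=True,
                  how='outer', indicator=True)

# Rows that exist only in the right DataFrame
right_only = merged[merged['_merge'] == 'right_only']
print(right_only)
```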



### 📘 Dataset 8 -- Join vs Merge Trap

---

### 🧩 Problem 8

Rewrite this using `join()`:

```python
pd.merge(df_a, df_b, left_index=True, right_index=True)
```

Explain **why `join()` is cleaner**.


In [59]:
pd.merge(df_a, df_b, left_index=True, right_index=True)   # merge defaults to how='inner'
df_a.join(df_b, how='inner')                              # join defaults to how='left', so 'inner' must be explicit

|     | Name  | Dept | Salary |
| --- | ----- | ---- | ------ |
| 102 | Sara  | HR   | 50000  |
| 103 | Ahmed | IT   | 60000  |
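The trap in this problem is the default `how`: `merge()` defaults to `'inner'` while `join()` defaults to `'left'`. The sketch below checks that the two calls agree once `how='inner'` is passed, and that a bare `df_a.join(df_b)` does not.

```python
import pandas as pd

df_a = pd.DataFrame({
    'Name': ['Ali', 'Sara', 'Ahmed'],
    'Dept': ['IT', 'HR', 'IT']
}, index=[101, 102, 103])
df_b = pd.DataFrame({'Salary': [50000, 60000, 55000]}, index=[102, 103, 104])

via_merge = pd.merge(df_a, df_b, left_index=True, right_index=True)
via_join_inner = df_a.join(df_b, how='inner')
via_join_default = df_a.join(df_b)   # left join: keeps 101 with NaN Salary

print(via_merge.equals(via_join_inner))
print(len(via_merge), len(via_join_default))
```

With `join()`, the two `*_index=True` flags disappear, which is the readability gain the problem asks about.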


### 📘 Dataset 9 -- Validate Logic (Advanced)

---

### 🧩 Problem 9

Use `join()` logic to **detect duplicate index values** before joining.

(Hint: index properties)


In [None]:
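df_c.index.is_unique                     # False: at least one label repeats
df_c.index.duplicated()                  # boolean mask; True marks a repeat of an earlier label
df_c.index[df_c.index.duplicated()]      # the repeated labels themselves (101)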

array([False,  True, False])
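By default `duplicated()` flags only the second and later occurrences, which is why the first `101` shows as `False` above. To see every row that shares a repeated label, `keep=False` can be used, as in this small sketch:

```python
import pandas as pd

df_c = pd.DataFrame({'Score': [80, 85, 90]}, index=[101, 101, 102])

# keep=False marks all occurrences of a repeated label, not just the later ones
dup_mask = df_c.index.duplicated(keep=False)
print(dup_mask.tolist())

# All rows that would multiply matches in a join
conflicting_rows = df_c[dup_mask]
print(conflicting_rows)
```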

### 📘 Dataset 10 -- Real-World Bug

---

### 🧩 Problem 10

Your join result suddenly doubles rows after adding new data.

* What caused this?
* How do you prevent it permanently?



In [74]:
df = pd.DataFrame({
    'Status': ['Shipped', 'Pending', 'Cancelled']
}, index=[100, 101, 100])
df.index.name='index_col'

In [None]:
# Keep the first row for each index label and drop later repeats
df = df.reset_index().drop_duplicates(subset='index_col').set_index('index_col')
df
# OR fail fast before joining:
# assert df.index.is_unique


| index_col | Status  |
| --------- | ------- |
| 100       | Shipped |
| 101       | Pending |
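The rows double because the newly added data repeats an index label, so one left row matches several right rows. Deduplicating once fixes today's data. To guard against it recurring, one option is to let pandas enforce uniqueness at join time. The sketch below uses `pd.merge(..., validate='one_to_one')`, which raises `MergeError` when either key is not unique. The `orders` frame and its `Amount` column are hypothetical stand-ins for the left table.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical left table; the right table repeats label 100
orders = pd.DataFrame({'Amount': [250, 400]}, index=[100, 101])
status = pd.DataFrame({
    'Status': ['Shipped', 'Pending', 'Cancelled']
}, index=[100, 101, 100])

# The validated join refuses to run on duplicated keys
error = None
try:
    pd.merge(orders, status, left_index=True, right_index=True,
             how='left', validate='one_to_one')
except MergeError as exc:
    error = exc
    print('MergeError:', exc)

# Drop repeated labels (keeping the first), then the same join succeeds
deduped = status[~status.index.duplicated(keep='first')]
safe = pd.merge(orders, deduped, left_index=True, right_index=True,
                how='left', validate='one_to_one')
print(safe)
```

Which duplicate to keep is a data question rather than a pandas one. `keep='first'` mirrors the `drop_duplicates` call in the cell above, but in a real pipeline the latest status might be the right one.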
