# Merging and Joining 
- real datasets are often spread across multiple CSVs/tables
- they need to be joined (e.g. features + labels)
- data is combined across shared keys
- datasets are enriched before modeling
- pandas provides `pd.merge` for SQL-like joins, `pd.concat` for stacking, and `join()` for simple index-based joins
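
Only `pd.merge` is worked through in these notes, so here is a minimal sketch of the other two tools. The `q1`, `q2`, and `names` tables and the `Bonus` column are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy tables, used only for illustration
q1 = pd.DataFrame({'EmployeeID': [1, 2], 'Bonus': [500, 700]})
q2 = pd.DataFrame({'EmployeeID': [3, 4], 'Bonus': [300, 900]})

# pd.concat stacks rows; ignore_index=True renumbers the result from 0
stacked = pd.concat([q1, q2], ignore_index=True)
print(stacked)

# join() aligns on the index, so EmployeeID is moved into the index first
names = pd.DataFrame(
    {'Name': ['Alice', 'Bob']},
    index=pd.Index([1, 2], name='EmployeeID')
)
joined = names.join(q1.set_index('EmployeeID'))
print(joined)
```

`concat` does not match rows on a key, it only stacks them, whereas `join()` matches on the index and defaults to a left join.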

| EmployeeID | Name    | Department |
| ---------- | ------- | ---------- |
| 1          | Alice   | HR         |
| 2          | Bob     | IT         |
| 3          | Charlie | Finance    |

| EmployeeID | Salary |
| ---------- | ------ |
| 2          | 60000  |
| 3          | 55000  |
| 4          | 52000  |

We want to combine both tables on EmployeeID, so we use merge.


In [5]:
import pandas as pd

# Table 1: Employee Info
employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['HR', 'IT', 'Finance']
})

# Table 2: Salary Info
salaries = pd.DataFrame({
    'EmployeeID': [ 2, 3, 4],
    'Salary': [ 60000, 55000, 52000]
})

employees, salaries


(   EmployeeID     Name Department
 0           1    Alice         HR
 1           2      Bob         IT
 2           3  Charlie    Finance,
    EmployeeID  Salary
 0           2   60000
 1           3   55000
 2           4   52000)

In [None]:
# Merge (inner join by default) - keeps only EmployeeIDs present in both tables.
# Rows found only in employees (no salary) or only in salaries (no name/dept) are dropped.

combined = pd.merge(employees, salaries, on='EmployeeID')
combined

   EmployeeID     Name Department  Salary
0           2      Bob         IT   60000
1           3  Charlie    Finance   55000


In [7]:
# the default inner merge leaves out employees 1 and 4
# how='left' keeps everyone from the employees table, even if they have no salary info

combined = pd.merge(employees, salaries, on='EmployeeID', how='left')
combined

   EmployeeID     Name Department   Salary
0           1    Alice         HR      NaN
1           2      Bob         IT  60000.0
2           3  Charlie    Finance  55000.0


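### Types of merges
1. Inner join (default) - shows only rows where EmployeeID exists in both tables
2. Outer join - keeps all rows from both left and right, with NaN where there is no match
3. Left join - shows all rows from the left table, adds data from the right if available
4. Right join - shows all rows from the right table, adds data from the left if available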
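
To see the four types side by side, the sketch below rebuilds the same `employees` and `salaries` tables and lists which EmployeeIDs survive each `how` option. The `kept` dictionary is just a helper name for this illustration.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['HR', 'IT', 'Finance']
})
salaries = pd.DataFrame({
    'EmployeeID': [2, 3, 4],
    'Salary': [60000, 55000, 52000]
})

# Collect the EmployeeIDs that each join type keeps
kept = {
    how: pd.merge(employees, salaries, on='EmployeeID', how=how)['EmployeeID'].tolist()
    for how in ['inner', 'outer', 'left', 'right']
}

for how, ids in kept.items():
    print(how, ids)
```

Inner keeps only IDs 2 and 3, left keeps 1 to 3, right keeps 2 to 4, and outer keeps all four.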

In [13]:
combined = pd.merge(employees, salaries, on='EmployeeID', how='outer')
combined

   EmployeeID     Name Department   Salary
0           1    Alice         HR      NaN
1           2      Bob         IT  60000.0
2           3  Charlie    Finance  55000.0
3           4      NaN        NaN  52000.0
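
After an outer join it is often useful to know which table each row came from. One possible way, sketched here on the same two tables, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column. The `audited` and `unmatched` names are assumptions for this example.

```python
import pandas as pd

employees = pd.DataFrame({
    'EmployeeID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['HR', 'IT', 'Finance']
})
salaries = pd.DataFrame({
    'EmployeeID': [2, 3, 4],
    'Salary': [60000, 55000, 52000]
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
audited = pd.merge(employees, salaries, on='EmployeeID', how='outer', indicator=True)

# Rows that found no partner in the other table
unmatched = audited[audited['_merge'] != 'both']
print(unmatched[['EmployeeID', '_merge']])
```

Here employee 1 is marked `left_only` (no salary) and employee 4 is marked `right_only` (no name or department).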
