# Joining and Merging in pandas

Joining and merging in pandas combine DataFrames much like SQL joins, primarily through the `merge()` function and the `.join()` method. The four core pillars are:

1. Functions: merge() (flexible, SQL-like joins on columns or index) and .join() (joins on index by default).

2. Join types: inner, left, right, and outer define which rows are retained during the join.

3. Keys: columns or indexes on which the join is based, supporting single or multiple keys.

4. Common pitfalls: mismatched keys, duplicated column names, and misunderstanding index vs column joining.
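As a rough illustration of the first and last pillars, the sketch below joins the same two frames once by column with `merge()` and once by index with `.join()`. The frames, their values, and the suffix names are made up for this example; the point is only that an overlapping non-key column (`count` here) needs suffixes in both cases.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
left = pd.DataFrame({"species": ["Adelie", "Gentoo"], "count": [152, 124]})
right = pd.DataFrame({"species": ["Adelie", "Gentoo"], "count": [10, 20]})

# merge(): joins on a column; overlapping non-key columns receive suffixes.
by_column = left.merge(
    right, on="species", how="inner", suffixes=("_left", "_right")
)

# .join(): joins on the index by default, so move the key into the index first.
by_index = left.set_index("species").join(
    right.set_index("species"), lsuffix="_left", rsuffix="_right"
)

print(by_column.columns.tolist())
print(by_index.columns.tolist())
```

With `merge()` the key stays a regular column, while with `.join()` it ends up as the index of the result.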

In real-world scenarios, data is often spread across multiple files, databases, or even APIs. Combining this data gives us a complete picture and lets us perform meaningful analysis.

# Parent and Child Tables

Parent and child tables represent a **"one-to-many"** data relationship.  
- The **parent table** holds unique records with a **primary key**.  
- The **child table** holds related records referencing the parent's key as a **foreign key**.
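One way to check this relationship before joining is sketched below. The `parent` and `child` frames are hypothetical; the check assumes the primary key should be unique in the parent and that every foreign key in the child should appear in the parent.

```python
import pandas as pd

# Hypothetical parent table: one row per species (primary key: species).
parent = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "genus": ["Pygoscelis", "Pygoscelis"],
})

# Hypothetical child table: many rows per species (foreign key: species).
child = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Macaroni"],
    "body_mass_g": [3750, 3800, 5000, 4100],
})

# The primary key must not repeat in the parent table.
pk_is_unique = parent["species"].is_unique

# Child rows whose foreign key has no matching parent row ("orphans").
orphans = child[~child["species"].isin(parent["species"])]

print(pk_is_unique)
print(orphans)
```

Here the `Macaroni` row is an orphan, so an inner join on `species` would silently drop it.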


# Types of Joins

When joining tables:

- **Inner join** returns only rows with matching keys in both tables.
- **Left join** returns all rows from the left table and matched rows from the right table. Rows from the left table with no match in the right will have `null` or `NaN` in the right table columns.
- **Right join** returns all rows from the right table and matched rows from the left table. Rows from the right table with no match in the left will have `null` or `NaN` in the left table columns.
- **Outer join** returns **all rows** from both tables, matching where possible, and filling with `null` or `NaN` where no match exists.

> **Note:**  
> Which table is left or right depends on the join operation, and either the parent or child table can be on either side depending on the context.

# **Relating to SQL**

| Join Type   | SQL Keyword   | Pandas `merge()` `how` Parameter |
|-------------|---------------|----------------------------------|
| Inner Join  | `INNER JOIN`  | `how='inner'`                    |
| Left Join   | `LEFT JOIN`   | `how='left'`                     |
| Right Join  | `RIGHT JOIN`  | `how='right'`                    |
| Outer Join  | `FULL OUTER JOIN` | `how='outer'`                   |


### Table A (Penguins Basic)

| species   | island    | body_mass_g |
|-----------|-----------|-------------|
| Adelie    | Torgersen | 3750        |
| Chinstrap | Dream     | 3800        |
| Gentoo    | Biscoe    | 5000        |

### Table B (Penguins Extra Info)

| species   | flipper_length_mm |
|-----------|-------------------|
| Adelie    | 181               |
| Gentoo    | 220               |
| Macaroni  | 195               |

---

### Inner Join on `species` (only matching species)

| species   | island    | body_mass_g | flipper_length_mm |
|-----------|-----------|-------------|-------------------|
| Adelie    | Torgersen | 3750        | 181               |
| Gentoo    | Biscoe    | 5000        | 220               |

---

### Left Join (all from Table A, matched from Table B)

| species   | island    | body_mass_g | flipper_length_mm |
|-----------|-----------|-------------|-------------------|
| Adelie    | Torgersen | 3750        | 181               |
| Chinstrap | Dream     | 3800        | NULL              |
| Gentoo    | Biscoe    | 5000        | 220               |

---

### Right Join (all from Table B, matched from Table A)

| species   | island    | body_mass_g | flipper_length_mm |
|-----------|-----------|-------------|-------------------|
| Adelie    | Torgersen | 3750        | 181               |
| Gentoo    | Biscoe    | 5000        | 220               |
| Macaroni  | NULL      | NULL        | 195               |

---

### Outer Join (all from both, matched where possible)

| species   | island    | body_mass_g | flipper_length_mm |
|-----------|-----------|-------------|-------------------|
| Adelie    | Torgersen | 3750        | 181               |
| Chinstrap | Dream     | 3800        | NULL              |
| Gentoo    | Biscoe    | 5000        | 220               |
| Macaroni  | NULL      | NULL        | 195               |
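The four results above can be reproduced in pandas with a short sketch like the following. It rebuilds Table A and Table B by hand (the variable names `table_a`, `table_b`, and `results` are chosen for this example) and runs each join type. Note that pandas shows `NaN` where the tables above show `NULL`, and integer columns that gain missing values are converted to floats.

```python
import pandas as pd

# Table A (Penguins Basic)
table_a = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "body_mass_g": [3750, 3800, 5000],
})

# Table B (Penguins Extra Info)
table_b = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Macaroni"],
    "flipper_length_mm": [181, 220, 195],
})

# Run all four join types on the shared key column.
results = {
    how: table_a.merge(table_b, on="species", how=how)
    for how in ["inner", "left", "right", "outer"]
}

for how, df in results.items():
    print(f"--- {how} join: {len(df)} rows ---")
    print(df)
```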


# **Best Practices for Joins**
1. Specify the parent table first.
2. Validate the keys.
3. Inspect missing data.
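A minimal sketch of these three practices, using two hypothetical frames (`species` as the parent, `birds` as the child): the parent goes on the left, `validate` asks pandas to check the key relationship, and `indicator=True` adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Hypothetical parent (one row per species) and child (many rows per species).
species = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "typical_island": ["Torgersen", "Biscoe"],
})
birds = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Macaroni"],
    "body_mass_g": [3750, 3800, 4100],
})

# Parent first; validate that keys are unique on the left side;
# indicator=True labels each row as both, left_only, or right_only.
merged = species.merge(
    birds, on="species", how="outer", validate="one_to_many", indicator=True
)
print(merged["_merge"].value_counts())

# Rows that did not find a partner in the other table.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)

# A wrong assumption about the keys raises an error instead of passing silently.
one_to_one_failed = False
try:
    birds.merge(species, on="species", validate="one_to_one")
except pd.errors.MergeError:
    one_to_one_failed = True
print(one_to_one_failed)
```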

```python
import pandas as pd

penguins = pd.read_csv("../Datasets/penguins.csv")
penguins.head()
```

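| Unnamed: 0 | rowid | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    | year |
|------------|-------|---------|-----------|----------------|---------------|-------------------|-------------|--------|------|
| 0          | 1     | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | male   | 2007 |
| 1          | 2     | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | female | 2007 |
| 2          | 3     | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | female | 2007 |
| 3          | 4     | Adelie  | Torgersen | NaN            | NaN           | NaN               | NaN         | NaN    | 2007 |
| 4          | 5     | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | female | 2007 |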


```python
# Prepare splits for joining (simulate two related tables)
left = penguins[["rowid", "species", "island", "sex", "year"]].copy()
right = penguins[["rowid", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]].copy()

# Inject mismatches so that some left rows have no partner on the right
right = right.drop([0, 10, 20])  # arbitrary drops for the demo; change to random for variation
right = right.dropna(subset=["bill_length_mm"])  # also drop rows with no measurements

# Inner join: keep only the rowids present in both tables
inner_merged = left.merge(right, on="rowid", how="inner")
print(inner_merged.head())
print(inner_merged.isna().sum())  # check remaining NaNs (e.g., sex still has some)

# Mean body mass per species, computed on the matched rows only
inner_merged.groupby("species")["body_mass_g"].mean()
```

```text
   rowid species     island     sex  year  bill_length_mm  bill_depth_mm  \
0      2  Adelie  Torgersen  female  2007            39.5           17.4
1      3  Adelie  Torgersen  female  2007            40.3           18.0
2      5  Adelie  Torgersen  female  2007            36.7           19.3
3      6  Adelie  Torgersen    male  2007            39.3           20.6
4      7  Adelie  Torgersen  female  2007            38.9           17.8

   flipper_length_mm  body_mass_g
0              186.0       3800.0
1              195.0       3250.0
2              193.0       3450.0
3              190.0       3650.0
4              181.0       3625.0
rowid                0
species              0
island               0
sex                  8
year                 0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
dtype: int64


species
Adelie       3705.067568
Chinstrap    3733.088235
Gentoo       5076.016260
Name: body_mass_g, dtype: float64
```

```python
# Multi-column inner join on species and year (no rowid needed)
left_multi = penguins[["species", "island", "sex", "year"]]
right_multi = penguins[["species", "year", "bill_length_mm", "body_mass_g"]].drop([0, 10])  # inject mismatches

inner_multi = pd.merge(left_multi, right_multi, on=["species", "year"], how="inner")
# (species, year) is not unique on either side, so this is a many-to-many join:
# each left row pairs with every right row sharing its species and year,
# and the result has far more rows than either input.
print(inner_multi.shape)
```

```text
(14388, 6)
```
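The large row count comes from joining on keys that repeat on both sides. The sketch below uses small made-up frames to show how the size of such a join can be predicted: for each key, the number of left rows is multiplied by the number of right rows, and the products are summed. It also shows one possible fix, assuming an average per key is acceptable for the analysis: aggregate the right table to one row per key before merging.

```python
import pandas as pd

# Hypothetical frames where (species, year) repeats on both sides.
left = pd.DataFrame({
    "species": ["Adelie"] * 3 + ["Gentoo"] * 2,
    "year": [2007] * 5,
    "sex": ["male", "female", "male", "female", "male"],
})
right = pd.DataFrame({
    "species": ["Adelie"] * 2 + ["Gentoo"] * 2,
    "year": [2007] * 4,
    "body_mass_g": [3750, 3800, 5000, 5100],
})

keys = ["species", "year"]
merged = left.merge(right, on=keys, how="inner")

# Predicted size: sum over keys of (left rows) * (right rows).
left_counts = left.groupby(keys).size()
right_counts = right.groupby(keys).size()
expected_rows = int((left_counts * right_counts).dropna().sum())
print(len(merged), expected_rows)

# Possible fix: reduce the right table to one row per key, then merge.
right_unique = right.groupby(keys, as_index=False)["body_mass_g"].mean()
safe = left.merge(right_unique, on=keys, how="inner", validate="many_to_one")
print(len(safe))
```

After the aggregation, the merged frame has exactly as many rows as `left`, because each left row matches at most one right row.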


# Inner Join in Pandas

1. **Understanding Inner Joins**  
   Inner joins combine datasets on a common key.  
   Only rows with matching keys in both datasets are included.  
   Example: Finding sales reps who have made sales, including only matching records.

2. **Performing an Inner Join in Pandas**  
   Use `pd.merge()` to combine datasets.  
   Specify join keys using `left_on` and `right_on` parameters (or `on` if key name is the same).

3. **Practical Use Case: Aggregating Mass by Species**  
   After merging, use operations like `groupby()` for aggregation.  
   Example: Compute mean body mass by species using `groupby()`.

4. **Best Practices for Inner Joins**  
   - ✅ Ensure data consistency: Confirm join keys are properly defined.  
   - ✅ Inspect results: Use `.shape` to check row counts after join.  
   - ✅ Handle missing keys: Remember unmatched rows are excluded; consider left/right join if you want to keep them.
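The points above can be pulled together in a small sketch. The two frames and the `bird_id` column are hypothetical; they stand in for tables whose key columns have different names, which is the case where `left_on` and `right_on` are needed instead of `on`.

```python
import pandas as pd

# Hypothetical tables whose key columns are named differently.
birds = pd.DataFrame({
    "bird_id": [1, 2, 3, 4],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})
measurements = pd.DataFrame({
    "rowid": [1, 2, 3, 5],
    "body_mass_g": [3750.0, 3800.0, 5000.0, 4200.0],
})

# Inner join: bird 4 has no measurement and rowid 5 has no bird, so both drop out.
merged = pd.merge(
    birds, measurements, left_on="bird_id", right_on="rowid", how="inner"
)

# Inspect the result size before trusting any aggregate.
print(merged.shape)

# Aggregate on the matched rows only.
mean_mass = merged.groupby("species")["body_mass_g"].mean()
print(mean_mass)
```

Because both key columns are kept in the result, the merged frame carries `bird_id` and `rowid` side by side; one of them can be dropped afterwards if it is not needed.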