## Creating new columns from `PassengerId`

### Line 1

```python
df["Group"] = df["PassengerId"].apply(lambda x: x.split("_")[0])
```

**Syntax explanation**

* `df["PassengerId"]` → selects a column
* `.apply()` → applies a function to **each element** of the Series (element-wise; `DataFrame.apply` with `axis=1` is the row-wise variant)
* `lambda x:` → anonymous (inline) function
* `x.split("_")` → splits string at `_`
* `[0]` → first part of split string

Example:

```
PassengerId = "0001_02"
Group = "0001"
```

---

### Line 2

```python
df["Member"] = df["PassengerId"].apply(lambda x: x.split("_")[1])
```

Same syntax logic:

* `[1]` → second part of split
* Creates **Member number inside the group**
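
The two `apply` calls above can also be written as a single vectorized split. The following is a minimal sketch on a made-up three-row frame (the sample IDs are assumptions for illustration, not data from the original):

```python
import pandas as pd

# Hypothetical sample data in the same "GROUP_MEMBER" format.
df = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02"]})

# .str.split with expand=True returns one column per part,
# so both new columns can be assigned in one statement.
df[["Group", "Member"]] = df["PassengerId"].str.split("_", expand=True)

print(df)
```

Both new columns hold strings (for example `"0001"`, not the integer `1`), just as with the `lambda` version.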


---

## Groupby operation

### Line 3

```python
x = df.groupby("Group")["Member"].count().sort_values()
```

**Syntax breakdown**

* `df.groupby("Group")`
  → Groups rows by `Group`

* `["Member"]`
  → Selects the `Member` column within each group

* `.count()`
  → Counts non-null values per group

* `.sort_values()`
  → Sorts counts in ascending order

Result:
`x` is a **Series**

```
Group
0001    1
0002    3
0003    5
```
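
Because `.count()` skips nulls, it can differ from the true number of rows in a group. A small sketch with invented data (one `Member` value is deliberately missing) shows the difference from `.size()`:

```python
import pandas as pd

# Hypothetical data: one Member value in group "0002" is missing.
df = pd.DataFrame({
    "Group":  ["0001", "0002", "0002", "0002", "0003", "0003"],
    "Member": ["01",   "01",   "02",   None,   "01",   "02"],
})

counts = df.groupby("Group")["Member"].count()  # non-null values only
sizes = df.groupby("Group").size()              # all rows, nulls included

print(counts.to_dict())
print(sizes.to_dict())
```

In the walkthrough `Member` is derived from `PassengerId`, so the two agree as long as `PassengerId` has no missing values.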

---

## Filtering groups with size > 1

### Line 4

```python
y = set(x[x > 1].index)
```

**Syntax explanation**

* `x > 1`
  → Boolean mask

* `x[x > 1]`
  → Keeps only groups with more than 1 member

* `.index`
  → Extracts group IDs

* `set(...)`
  → Converts to a set for **fast lookup**

Result:

```
y = {"0002", "0003"}
```
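
To make the intermediate steps visible, here is a small sketch that rebuilds the example Series by hand (the values mirror the example output above and are not real data):

```python
import pandas as pd

# Hand-built stand-in for the groupby result.
x = pd.Series({"0001": 1, "0002": 3, "0003": 5}, name="Member")

mask = x > 1            # boolean Series, same index as x
multi = x[mask]         # keeps only entries where the mask is True
y = set(multi.index)    # group IDs as a set

print(mask.tolist())
print(sorted(y))
```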

---

## Creating boolean feature

### Line 5

```python
df["Travelling_Solo"] = df["Group"].apply(lambda x: x not in y)
```

**Syntax breakdown**

* For each group ID, checks membership in `y` using `not in`
* Returns `True` if the passenger is solo (the group is not in `y`)
* Returns `False` if the passenger belongs to a multi-member group
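
The same feature can be built without `apply` by negating `Series.isin`. This is a sketch on invented data, with `y` hard-coded for illustration:

```python
import pandas as pd

# Hypothetical frame and set of multi-member groups.
df = pd.DataFrame({"Group": ["0001", "0002", "0002", "0003"]})
y = {"0002"}

via_apply = df["Group"].apply(lambda g: g not in y)
via_isin = ~df["Group"].isin(y)   # vectorized: "not in y" for every row

df["Travelling_Solo"] = via_isin
print(df)
```

Both versions give the same booleans; `isin` avoids calling a Python function once per row.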

---

## Initializing a column

### Line 6

```python
df["Group_Size"] = 0
```

* Creates a new column
* Initializes all values to `0`
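* Optional but tidy: `.loc` assignment would create the column on its own, but initializing first gives every row a defined integer default instead of `NaN`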

---

## Looping over grouped counts

### Line 7

```python
for i in x.items():
```

**Syntax explanation**

* `x.items()` → returns `(index, value)` pairs
* `i[0]` → group ID
* `i[1]` → group size

Example:

```
i = ("0002", 3)
```

---

## Conditional assignment using `.loc`

### Line 8

```python
df.loc[df["Group"] == i[0], "Group_Size"] = i[1]
```

**Syntax breakdown**

* `df.loc[rows, columns]` → label-based indexing
* `df["Group"] == i[0]` → Boolean row filter
* `"Group_Size"` → column to update
* `= i[1]` → assigns group size

Effect:

> All passengers in the same group get the same group size value
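
Putting Lines 3, 6, 7 and 8 together gives the complete loop. The sketch below uses invented data and unpacks each pair into two names (`group_id`, `size`, chosen here for readability) instead of indexing `i[0]` and `i[1]`:

```python
import pandas as pd

# Hypothetical data: groups of size 1, 2 and 3.
df = pd.DataFrame({
    "Group":  ["0001", "0002", "0002", "0003", "0003", "0003"],
    "Member": ["01",   "01",   "02",   "01",   "02",   "03"],
})

x = df.groupby("Group")["Member"].count().sort_values()

df["Group_Size"] = 0
for group_id, size in x.items():
    # Boolean mask selects the rows, "Group_Size" selects the column.
    df.loc[df["Group"] == group_id, "Group_Size"] = size

print(df)
```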

---



## One-line vectorized alternative (better syntax)

```python
df["Group_Size"] = df.groupby("Group")["PassengerId"].transform("count")
```

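No loop, so it is faster and cleaner: `transform("count")` returns one value per original row, aligned to the DataFrame's index.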
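
As a quick check of what `transform` returns, here is a sketch on invented data. It also derives `Travelling_Solo` directly from the group size, which is an alternative to the set-based approach rather than the original method:

```python
import pandas as pd

# Hypothetical passengers: one solo traveller and one group of two.
df = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02"]})
df["Group"] = df["PassengerId"].str.split("_").str[0]

# One count per row, broadcast back to every member of the group.
df["Group_Size"] = df.groupby("Group")["PassengerId"].transform("count")

# A passenger travels solo exactly when the group has one member.
df["Travelling_Solo"] = df["Group_Size"] == 1

print(df)
```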

---

## Final takeaway (syntax-focused)

You used:

* `lambda` → inline function
* `.apply()` → element-wise operation on a Series
* `groupby()` → aggregation
* Boolean masking
* `.loc[]` → conditional assignment
* `set()` → fast membership testing

This is **intermediate-level Pandas syntax**, commonly used in feature engineering.



## Core Idea

* **LightGBM (LGBMC, short for `LGBMClassifier`)**
  Fast, memory-efficient, **leaf-wise** boosting; excellent for **large tabular data**.

* **XGBoost**
  Stable, highly tunable, **level-wise** boosting; strong general-purpose baseline.

* **CatBoost**
  Designed for **categorical features**; minimal preprocessing; robust on small/medium data.

---

## Side-by-Side Comparison

| Aspect                     | LGBMC                     | XGBoost            | CatBoost                     |
| -------------------------- | ------------------------- | ------------------ | ---------------------------- |
| Tree Growth                | **Leaf-wise**             | **Level-wise**     | Symmetric (balanced)         |
| Speed                      | **Fastest** on large data | Medium             | Slower than LGBM             |
| Memory Usage               | **Lowest**                | Higher             | Medium                       |
| Categorical Handling       | Native (needs indices)    | One-hot / encoding | **Best (native, automatic)** |
| Overfitting Risk           | Higher (if untuned)       | Lower              | Lowest                       |
| Hyperparameter Sensitivity | High                      | Medium             | Low                          |
| Small Dataset Performance  | Average                   | Good               | **Excellent**                |
| Large Dataset Performance  | **Excellent**             | Good               | Average                      |
| GPU Support                | Yes                       | Yes                | Yes                          |
| Ease of Use                | Medium                    | Medium             | **Easy**                     |

---

## Key Technical Differences

### 1. Tree Growth Strategy

* **LGBMC (Leaf-wise)**
  Splits the leaf with **maximum loss reduction** → faster convergence, higher risk of overfitting.
* **XGBoost (Level-wise)**
  Splits all nodes at a depth uniformly → more stable, slower.
* **CatBoost (Symmetric)**
  Uses the same split condition for every node on a level (oblivious trees) → regularized, robust.
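
The difference in tree shape can be illustrated with a toy simulation. This is not how any of the libraries is implemented: the `child_gains` rule is a made-up assumption in which one child always keeps most of the parent's gain, chosen only to show how leaf-wise growth can go deep while level-wise growth stays balanced.

```python
import heapq
from collections import deque

def child_gains(gain):
    # Hypothetical rule: the left child keeps most of the potential gain.
    return gain * 0.9, gain * 0.3

def leaf_wise_depths(n_splits):
    """Always split the leaf with the largest gain (max-heap via negation)."""
    heap = [(-1.0, 0)]  # (negative gain, depth)
    for _ in range(n_splits):
        neg_gain, depth = heapq.heappop(heap)
        for g in child_gains(-neg_gain):
            heapq.heappush(heap, (-g, depth + 1))
    return sorted(depth for _, depth in heap)

def level_wise_depths(n_splits):
    """Split leaves in breadth-first order, one full level at a time."""
    queue = deque([(1.0, 0)])  # (gain, depth)
    for _ in range(n_splits):
        gain, depth = queue.popleft()
        for g in child_gains(gain):
            queue.append((g, depth + 1))
    return sorted(depth for _, depth in queue)

leaf_wise = leaf_wise_depths(7)
level_wise = level_wise_depths(7)
print("leaf-wise leaf depths: ", leaf_wise)
print("level-wise leaf depths:", level_wise)
```

With the same budget of 7 splits, both strategies end with 8 leaves, but the leaf-wise tree is much deeper under this gain rule. That depth is why leaf-wise growth usually needs a cap such as `num_leaves` or `max_depth`.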

---

### 2. Handling Categorical Features

* **LGBMC**: Native categorical splits, but you must specify categorical columns.
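* **XGBoost**: Traditionally requires encoding (One-Hot / Target Encoding); recent versions also offer native categorical support via `enable_categorical=True`.
* **CatBoost**: Uses **ordered target statistics** → reduces target leakage.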
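
The idea behind ordered target statistics can be sketched in a few lines: each row's category is encoded using only the targets of rows that came *before* it, so a row never sees its own label. This is a simplified illustration, not CatBoost's actual implementation (which, among other things, works over random permutations of the data); the function name and the `prior` and `weight` parameters are assumptions made for this example.

```python
def ordered_target_stat(categories, targets, prior=0.5, weight=1.0):
    """Encode each category using only the targets seen in earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of previous targets; falls back to the prior
        # when the category has not been seen yet.
        encoded.append((s + prior * weight) / (n + weight))
        # Only now does the current row's target become "history".
        sums[cat] = s + target
        counts[cat] = n + 1
    return encoded

cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stat(cats, ys)
print(enc)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

A plain target mean per category would use every row's label, including the row being encoded, which is the leakage this ordering is meant to limit.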

**Winner: CatBoost**

---

### 3. Hyperparameter Tuning Difficulty

* **LGBMC**: Sensitive to `num_leaves`, `min_data_in_leaf`
* **XGBoost**: Many parameters; flexible but tuning is costly
* **CatBoost**: Works well with defaults
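
As a rough sketch of what this means in practice, the dictionaries below list the parameters that typically get tuned first in each library's scikit-learn-style classifier. The parameter names are real, but the values are illustrative starting points assumed for this example, not recommendations from the original notes.

```python
# Illustrative starting points only; the values are assumptions.
lgbm_params = {
    "num_leaves": 31,          # main complexity control for leaf-wise trees
    "max_depth": 6,            # caps how deep leaf-wise growth can go
    "min_child_samples": 20,   # alias of min_data_in_leaf
    "learning_rate": 0.05,
}

xgb_params = {
    "max_depth": 6,            # main complexity control for level-wise trees
    "learning_rate": 0.05,
    "subsample": 0.8,          # row sampling per tree
    "colsample_bytree": 0.8,   # feature sampling per tree
}

catboost_params = {
    "depth": 6,                # depth of the symmetric trees
    "learning_rate": 0.05,
    "iterations": 500,
}

def leaves_fit_depth(params):
    """Check that num_leaves does not exceed what max_depth allows."""
    return params["num_leaves"] <= 2 ** params["max_depth"]

print(leaves_fit_depth(lgbm_params))
```

Each dictionary would be passed as keyword arguments to the corresponding classifier, for example `LGBMClassifier(**lgbm_params)`. The helper reflects a common LightGBM rule of thumb: keep `num_leaves` below `2 ** max_depth`, since a tree of that depth cannot hold more leaves.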

**Winner: CatBoost**

---

## When to Use What

### Choose **LGBMC** if:

* Dataset is **large (≥ 100k rows)**
* Many features
* Performance and speed are critical

### Choose **XGBoost** if:

* You want **predictable, stable results**
* Dataset size is medium
* You need fine-grained control

### Choose **CatBoost** if:

* Dataset has **many categorical features**
* Dataset is small to medium
* You want minimal preprocessing

---

## Practical Recommendation

* **Tabular ML competitions** → LGBMC / XGBoost
* **Business datasets with categories** → CatBoost
* **Production pipelines at scale** → LGBMC
* **Quick baseline model** → CatBoost

---

## Interview One-Liners 

* *“LGBM grows trees leaf-wise, XGBoost level-wise, CatBoost symmetrically.”*
* *“CatBoost handles categorical features without manual encoding.”*
* *“LGBM is fastest but more prone to overfitting if not tuned.”*

