### 1.  What is a Decision Tree, and how does it work for classification problems?

- A Decision Tree is a **supervised, rule-based model** that recursively partitions the feature space into axis-aligned rectangles.

1. Start with the entire training set at the **root**.

2. At each node choose the split (feature + threshold) that **maximises impurity reduction** (e.g., Gini, Entropy).

3. Stop splitting when a purity/size/depth criterion is met.

4. Assign the **majority class** of the node to every sample reaching a leaf.
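As a minimal sketch of these four steps, the snippet below fits a shallow tree on Iris and prints the learned rules with scikit-learn's `export_text`. The depth limit of 2 is an arbitrary choice made for readability, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 acts as the stopping criterion from step 3 (illustrative value)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each branch is a feature/threshold test; each leaf reports its majority class
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Every root-to-leaf path in the printout is one rule of the form "feature <= threshold", which is what makes the model readable.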

---

### 2.  Entropy & Information Gain:

- *Entropy* quantifies node disorder, where $p_k$ is the fraction of samples in node $t$ that belong to class $k$:

$$
H(t)=-\sum_{k=1}^{C}p_k \log_2 p_k
$$

- *Information Gain* for a split $s$ is

$$
IG(s)=H(\text{parent})-\sum_{j\in \{\text{left},\text{right}\}}\frac{n_j}{n_{\text{parent}}}H(j)
$$

The split with the **highest IG** is selected.
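The two formulas can be checked numerically. The sketch below is a plain NumPy illustration (the helper names `entropy` and `information_gain` are made up for this example) that compares a split producing pure children with one that leaves the class mix unchanged.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
perfect = information_gain(parent, parent[:4], parent[4:])     # both children pure
useless = information_gain(parent, parent[::2], parent[1::2])  # both children stay 50/50
print(perfect, useless)
```

The pure split recovers the full 1 bit of parent entropy, while the second split gains nothing, so the first would be selected.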

---

### 3.  Handling categorical vs numerical data:

- *Numerical*: compare against a threshold, e.g., $x_j\le 7.5$.

- *Categorical*: (i) one-hot encode, or (ii) split on membership of a category subset. scikit-learn's trees need numeric input, so categories must be encoded first; libraries like CatBoost handle categories natively.
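A small sketch of option (i), assuming a hypothetical toy table with one categorical and one numerical column. `pandas.get_dummies` is one of several ways to encode; scikit-learn's `OneHotEncoder` would work as well.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical and one numerical feature
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green", "red", "blue"],
    "weight": [1.2, 3.4, 2.2, 3.1, 0.9, 2.5],
    "label":  [0, 1, 0, 1, 0, 0],
})

# One-hot encode the categorical column; the numerical column passes through unchanged
X = pd.get_dummies(df[["colour", "weight"]], columns=["colour"], dtype=float)
y = df["label"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(list(X.columns))
```

After encoding, a split such as `colour_green <= 0.5` plays the role of a membership test on the original category.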

---

### 4.  Pros & Cons:

- **Advantages** – intuitive, no scaling, handles mixed types, captures non-linearities, white-box.

- **Disadvantages** – high variance, prone to overfitting, unstable (small changes in the data can produce a very different tree), biased toward features with many levels, decision boundaries are piecewise constant and axis-aligned only.

---

### 5.  Gini Index:

$$
G(t)=1-\sum_{k=1}^{C} p_k^2
$$

- The algorithm picks the split that yields the largest decrease in **weighted Gini**; lower Gini ⇒ purer node.
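To make the "decrease in weighted Gini" concrete, here is a short NumPy sketch. The helper names `gini` and `gini_decrease` are illustrative, not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_decrease(parent, left, right):
    """Parent Gini minus the size-weighted Gini of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1, 1, 0])
left, right = parent[:4], parent[4:]   # [0, 0, 0, 1] and [1, 1, 1, 0]
print(gini(parent), gini(left), gini_decrease(parent, left, right))
```

Here the parent is a 50/50 mix (Gini 0.5), each child is a 75/25 mix (Gini 0.375), so the split lowers the weighted Gini by 0.125.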

---

### 6.  Role & derivation of the splitting criterion:

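- The criterion is an **objective function** $J(s)$ that measures post-split impurity. It is built by choosing a node impurity $I(t)$ (Gini, Entropy, or MSE) and taking the size-weighted average over the child nodes $j$ of split $s$:

$$
J(s)=\sum_{j}\frac{n_j}{n_{\text{parent}}}I(j)
$$

The tree picks the split $s$ that minimises $J(s)$ over all candidate splits, which is equivalent to maximising the impurity reduction $I(\text{parent})-J(s)$.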
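The search over splits can be sketched for a single numerical feature as follows. This is a brute-force illustration, assuming candidate thresholds at midpoints between sorted unique values; production libraries use faster sorted scans, and `best_threshold` is a made-up name.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def best_threshold(x, y, impurity=gini):
    """Return (threshold, J) minimising the weighted child impurity J(s) on one feature."""
    best_thr, best_j = None, np.inf
    values = np.unique(x)                       # sorted unique feature values
    for thr in (values[:-1] + values[1:]) / 2:  # midpoints between neighbours
        left, right = y[x <= thr], y[x > thr]
        j = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if j < best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
thr, j = best_threshold(x, y)
print(thr, j)
```

On this toy data the threshold 6.5 separates the classes perfectly, so $J(s)$ reaches 0 there. A full tree repeats this search over every feature at every node.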

---

### 7.  Gini vs Entropy (quick compare):

| Aspect         | Gini                          | Entropy                        |
| -------------- | ----------------------------- | ------------------------------ |
| Formula        | $1-\sum p_k^2$                | $-\sum p_k\log_2 p_k$          |
| Range (binary) | 0-0.5                         | 0-1                            |
| Speed          | no logs ⇒ faster              | slower                         |
| Sensitivity    | slightly favours larger class | more sensitive to rare classes |

- Both usually choose the same split; Gini is scikit-learn’s default.

---

### 8.  Variance in regression trees:

- Impurity becomes **variance** of the target:

$$
Var(t)=\frac1{n_t}\sum_{i}(y_i-\bar y_t)^2
$$

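- The split minimises the size-weighted variance of the children. Because each node predicts its mean, a node's variance is exactly its MSE, so this is the same as maximising MSE reduction.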
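A short numerical check of the variance criterion, using NumPy's population variance (`ddof=0`, matching the $1/n_t$ in the formula). The function name `variance_reduction` is illustrative.

```python
import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(y_parent)
    weighted = len(y_left) / n * np.var(y_left) + len(y_right) / n * np.var(y_right)
    return float(np.var(y_parent) - weighted)

y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])
print(variance_reduction(y, y[:3], y[3:]))
```

The parent has variance 4 (every value is 2 away from the mean of 3) and both children are constant, so the split removes all of it.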

---

### 9.  Post- vs Pre-pruning:

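- *Pre-pruning*: stop growth early with constraints such as `max_depth` or `min_samples_leaf`.

- *Post-pruning*: grow the full tree, then cut back using cost-complexity or validation error. Post-pruning often generalises better because it judges a branch after seeing everything beneath it, whereas pre-pruning can stop before a useful later split is found.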

---

### 10.  How post-pruning reduces overfitting:

- By adding a **complexity penalty term** $\alpha|\text{leaves}|$ and choosing a subtree that minimises validation error, it removes branches that model noise, thereby lowering variance.

---

### 11.  Example pre-pruning strategy:

- `min_samples_split=20` prevents any node with fewer than 20 samples from splitting, limiting depth and variance.
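The effect can be seen by comparing tree sizes on Iris, as in this sketch. The dataset choice is an assumption made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
pre_pruned = DecisionTreeClassifier(min_samples_split=20, random_state=0).fit(X, y)

print("leaves (unconstrained):", full.get_n_leaves())
print("leaves (min_samples_split=20):", pre_pruned.get_n_leaves())

# Every node that was actually split holds at least 20 training samples
t = pre_pruned.tree_
internal = t.children_left != -1
print(t.n_node_samples[internal].min())
```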

---

### 12.  Python – applying pruning:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X,y = load_iris(return_X_y=True)
Xtr,Xte,ytr,yte = train_test_split(X,y,random_state=0)

# Cost-complexity post-pruning
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtr,ytr)
alpha = path.ccp_alphas[-3]          # third-largest alpha on the path, i.e. a heavily pruned tree
clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
clf.fit(Xtr,ytr)
print("Pruned accuracy:", clf.score(Xte,yte))
```
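The snippet above picks one alpha by its position in the pruning path. A common alternative is to choose it by cross-validation on the training split, as sketched below; the 5-fold setting and the decision to drop the last alpha are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtr, ytr)
# Drop the last alpha, which prunes the tree down to the root alone;
# clip guards against tiny negative values caused by floating-point error.
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

# Mean 5-fold CV accuracy on the training split for each candidate alpha
cv_scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), Xtr, ytr, cv=5).mean()
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(cv_scores))])

clf_cv = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(Xtr, ytr)
test_acc = clf_cv.score(Xte, yte)
print("chosen alpha:", best_alpha, "test accuracy:", test_acc)
```

The test split is touched only once, after the alpha has been fixed, so the reported accuracy is not biased by the selection.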

---

### 13.  Implementing a Decision Tree Classifier:

```python
from sklearn.tree import DecisionTreeClassifier
# Reuses Xtr, ytr from the split in section 12; max_depth=None grows until leaves are pure
clf = DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=42)
clf.fit(Xtr,ytr)
```

---

### 14.  Train & evaluate snippet:

```python
from sklearn.metrics import accuracy_score
ypred = clf.predict(Xte)
print("Accuracy:", accuracy_score(yte, ypred))
```

---

### 15.  Handling class imbalance:

- *Techniques*:

* **Class weights**: `DecisionTreeClassifier(class_weight='balanced')`.

* **Resampling**: SMOTE / undersampling.

* **Evaluation metrics**: use F1, AUC instead of accuracy.
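A hedged sketch combining two of these techniques: class weights and F1 as the metric. The data are synthetic (`make_classification` with roughly a 95/5 split), and `max_depth=4` is an arbitrary choice, so the printed numbers only illustrate the workflow.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly a 95/5 class split (illustrative only)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

results = {}
for cw in (None, "balanced"):
    clf = DecisionTreeClassifier(max_depth=4, class_weight=cw, random_state=0).fit(Xtr, ytr)
    # F1 of the positive (minority) class; zero_division=0 avoids a warning if it is never predicted
    results[str(cw)] = f1_score(yte, clf.predict(Xte), zero_division=0)
    print(cw, "minority-class F1:", results[str(cw)])
```

`stratify=y` keeps the class ratio the same in both splits, which matters when the minority class is this small.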

---

### 16.  DT vs Random Forest:

- Random Forest averages many decorrelated trees ⇒ **lower variance and usually better accuracy**. Feature importances still give some insight, but the ensemble is slower and less transparent than a single tree.
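One way to compare the two is cross-validated accuracy on the same data, as in this sketch. The breast-cancer dataset and `n_estimators=100` are assumptions chosen for illustration; results will differ on other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# 5-fold accuracy per model; the spread across folds hints at variance
scores = {name: cross_val_score(m, X, y, cv=5) for name, m in models.items()}
for name, s in scores.items():
    print(f"{name}: mean={s.mean():.3f} std={s.std():.3f}")
```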

---

### 17.  What is a Decision Tree Regressor?

- Same algorithm but predicts a **continuous value** (mean of training samples in a leaf) and uses variance/MSE as impurity instead of class measures.

---

### 18.  Handling continuous targets:

- Each leaf stores the **mean target**; splits chosen to minimise weighted MSE, delivering piece-wise constant approximations.
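The piecewise-constant behaviour is easy to verify on a noisy sine curve. The data below are synthetic and `max_depth=3` is an arbitrary choice for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = reg.predict(X)

# A depth-3 tree has at most 8 leaves, hence at most 8 distinct predicted values
print("leaves:", reg.get_n_leaves(), "distinct predictions:", len(np.unique(pred)))

# The prediction for a sample equals the mean training target of its leaf
leaf_ids = reg.apply(X)
first_leaf = leaf_ids[0]
print(np.isclose(pred[0], y[leaf_ids == first_leaf].mean()))
```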

---

### 19.  Key regressor metrics:

- MSE, RMSE, MAE, $R^2$.

---

### 20.  MSE in trees:

- For a node $t$, impurity $=\frac1{n_t}\sum(y_i-\bar y_t)^2$; the algorithm selects splits that reduce total MSE.

---

### 21.  Implementing a Decision Tree Regressor:

```python
from sklearn.tree import DecisionTreeRegressor
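# Reuses the Iris split from section 12, treating the class labels (0, 1, 2)
# as a numeric target purely for illustration
reg = DecisionTreeRegressor(max_depth=5, random_state=42).fit(Xtr, ytr)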
```

---

### 22.  Train & evaluate snippet:

```python
from sklearn.metrics import mean_squared_error
y_pred = reg.predict(Xte)
print("MSE:", mean_squared_error(yte, y_pred))
```

---

### 23.  Preventing overfitting in regression trees:

- Limit depth, set `min_samples_leaf`, use post-pruning (`ccp_alpha`), or switch to ensemble methods (Bagging, RF, GB).

---

### 24.  DT Regressor vs Gradient Boosting Regressor:

- GBR sequentially fits shallow trees to the residuals of the current ensemble, typically achieving **lower bias and, with regularisation, lower variance** than a single tree, at the cost of interpretability and training time.

---

### 25.  Handling missing values:

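- scikit-learn's trees historically rejected NaNs, so imputing first is the safe default (since version 1.3, `DecisionTreeClassifier` and `DecisionTreeRegressor` accept missing values natively). Some frameworks (XGBoost, LightGBM) learn a **default direction** for missing values during splitting.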
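A minimal sketch of the impute-first route, assuming a hypothetical toy array with missing entries and median imputation. Putting the imputer inside a pipeline keeps it from seeing validation data during cross-validation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data with missing entries
X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5],
              [4.0, 2.0], [5.0, np.nan], [6.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The imputer is fitted together with the tree, so it is re-fitted on each training fold
model = make_pipeline(SimpleImputer(strategy="median"), DecisionTreeClassifier(random_state=0))
model.fit(X, y)

X_filled = model.named_steps["simpleimputer"].transform(X)
print(X_filled)
print(model.predict([[np.nan, 2.5]]))
```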

---

### 26.  Effect of hyperparameters:

| Hyperparameter     | Too Low  | Too High |
| ------------------ | -------- | -------- |
| `max_depth`        | underfit | overfit  |
| `min_samples_leaf` | overfit  | underfit |
| `ccp_alpha`        | overfit  | underfit |

Tuning these balances the bias–variance trade-off: `ccp_alpha=0` means no pruning at all, while a large value prunes the tree towards a single node.
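These hyperparameters are usually tuned together with a cross-validated grid search. The grid values below are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Illustrative grid covering the three hyperparameters discussed above
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.01, 0.05],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refitted on all of `X, y` with the winning combination and can be used directly for prediction.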

---

### 27.  Interpretability comparison:

- Decision Trees: explicit rules, easy plots.

- SVM: hyperplanes in high-dim space – moderate interpretability.

- Neural Nets: distributed representations – **black box**.

---

### 28.  Addressing model instability:

* **Bagging / Random Forests**

* **Cross-validation pruning**

* Use ensemble averages for predictions; report feature importances with confidence bands.
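A sketch of the last point: the per-tree importances inside a random forest give a spread around the mean importance. The standard deviation used here is a rough variability band rather than a formal confidence interval, and `n_estimators=200` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(iris.data, iris.target)

# One row of importances per tree; each row sums to 1
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
mean_imp, std_imp = per_tree.mean(axis=0), per_tree.std(axis=0)

for name, m, s in zip(iris.feature_names, mean_imp, std_imp):
    print(f"{name}: {m:.3f} ± {s:.3f}")
```

A feature whose band is wide relative to its mean is one whose ranking changes from tree to tree, which is the instability a single tree would hide.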

---



Recorded outputs when the snippets in sections 12, 13–14 and 21–22 are run in sequence on the Iris split:

- Section 12 (cost-complexity pruning): `Pruned accuracy: 0.8947368421052632`

- Sections 13–14 (unpruned classifier): `Accuracy: 0.9736842105263158`

- Sections 21–22 (regressor with `max_depth=5`): `MSE: 0.02631578947368421`


Loading the Car Evaluation dataset with `kagglehub`:

```python
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# File to load from the dataset
file_path = "car_evaluation.csv"

# Load the latest version
df = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "elikplim/car-evaluation-data-set",
    file_path,
    # Provide any additional arguments like
    # sql_query or pandas_kwargs. See the
    # documentation for more information:
    # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

print("First 5 records:", df.head())
```

```text
First 5 records:    vhigh vhigh.1  2 2.1  small   low  unacc
0  vhigh   vhigh  2   2  small   med  unacc
1  vhigh   vhigh  2   2  small  high  unacc
2  vhigh   vhigh  2   2    med   low  unacc
3  vhigh   vhigh  2   2    med   med  unacc
4  vhigh   vhigh  2   2    med  high  unacc
```

The column names in this output (`vhigh`, `vhigh.1`, `2`, `2.1`, ...) are actually the file's first data row: the CSV has no header line, so pandas promoted the first record to the header. Every column is categorical, so the encoding from section 3 is needed before fitting a scikit-learn tree.



