Here's a **clean, concept-first explanation** of what you've written, without drowning you in output details.

---

## B.3 Model Training: Core Ideas

### 1️⃣ Holdout Method (Train–Test Split)

**What it is**

* You split the dataset **once** into:

  * **Training set** → used to build the model
  * **Test set** → used to evaluate the model on unseen data

**Why it's needed**

* To check whether the model **generalizes** beyond the data it learned from.

**How scikit-learn does it**

```python
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(
    data,
    test_size=0.3,
    random_state=123
)
```

**Key points**

* `test_size=0.3` → 30% test, 70% train
* `random_state` → ensures **reproducibility**
* Without `random_state`, every run gives a **different split**
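
As a quick check of the key points above, the sketch below splits a 150-row stand-in DataFrame and confirms the 70/30 sizes and that a fixed `random_state` reproduces the same rows. The name `toy` and its single column are assumptions made for illustration; they are not part of the original data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `data`: 150 rows, one column
toy = pd.DataFrame({"x": range(150)})

train_a, test_a = train_test_split(toy, test_size=0.3, random_state=123)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=123)

print(len(train_a), len(test_a))          # 105 45
print(test_a.index.equals(test_b.index))  # True: same seed, same split

# A different seed (7 is arbitrary) generally selects different rows
_, test_c = train_test_split(toy, test_size=0.3, random_state=7)
print(test_a.index.equals(test_c.index))
```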

**Limitation**

* The performance **estimate** depends heavily on **one random split**
* Risky for small datasets, where a single small test set may not be representative
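
To see this limitation in practice, here is a minimal sketch that repeats the holdout split with a few different seeds and reports how much the accuracy moves. The iris dataset, the logistic regression model, and the seed values 0 to 4 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # five different random splits (seeds chosen arbitrarily)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, model.predict(X_te)))

print("Accuracy per split:", [round(s, 4) for s in holdout_scores])
print("Spread (max - min):", round(max(holdout_scores) - min(holdout_scores), 4))
```

If the spread is noticeably above zero, a single holdout number should be read with some caution.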

---

### 2️⃣ K-Fold Cross-Validation

**What it is**

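* Data is split into **k roughly equal parts (folds)**; sizes differ by at most one row when the row count is not divisible by k
* Each fold is used **once as the test set**, with the remaining k - 1 folds as the training set
* Final performance = **average over k runs**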

For `k = 10`:

* Train 10 models
* Each data point is tested **exactly once**
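
The partitioning idea can be sketched in plain Python, without scikit-learn. The helper `k_fold_indices` below is a hypothetical illustration (not a library function), and it uses consecutive, unshuffled folds to keep the logic visible.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k consecutive, near-equal folds."""
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more row
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


n_samples, k = 23, 10
times_tested = [0] * n_samples
fold_sizes = []
for train_idx, test_idx in k_fold_indices(n_samples, k):
    fold_sizes.append(len(test_idx))
    for i in test_idx:
        times_tested[i] += 1

print("Fold sizes:", fold_sizes)  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]
print("Every row tested exactly once:", all(c == 1 for c in times_tested))  # True
```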

---

### How KFold Works in scikit-learn

```python
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=123)
```

**Important details**

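* `shuffle=True` → shuffles the rows before forming folds, which avoids biased folds when the data is ordered (for example, sorted by class)
* `random_state` → same folds every run (it only has an effect when `shuffle=True`)
* `kf.split(data)` **does not split data**

  * It yields pairs of **index arrays** (train indices, test indices), one pair per fold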
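
A small sketch makes the "indices, not data" point concrete. The six-element array `toy` is hypothetical, and shuffling is left off here so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(60, 66)  # six "rows" holding the values 60..65 (hypothetical data)

kf_demo = KFold(n_splits=3)  # no shuffling, so folds are consecutive blocks
splits = list(kf_demo.split(toy))

for train_index, test_index in splits:
    # Indices are positions into `toy`, not the values stored there
    print("train:", train_index, "test:", test_index, "test rows:", toy[test_index])
```

The rows themselves only appear once the indices are applied to the data, as in `toy[test_index]`.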

---

### Understanding `kf.split(data)`

```python
for train_index, test_index in kf.split(data):
    data_train = data.iloc[train_index]
    data_test  = data.iloc[test_index]
```

**What you get**

* `train_index` → row numbers for training
* `test_index` → row numbers for testing

This is why you see **huge index lists** in the output.

---

### Accessing a Specific Fold

```python
folds = []
for train_index, test_index in kf.split(data):
    folds.append((train_index, test_index))

train_index, test_index = folds[3]  # 4th fold
```

---

## Holdout vs K-Fold (Quick Comparison)

| Aspect           | Holdout | K-Fold |
| ---------------- | ------- | ------ |
| Number of splits | 1       | k      |
| Stability        | Low     | High   |
| Bias             | Higher  | Lower  |
| Computation      | Fast    | Slower |
| Small datasets   | ❌       | ✅      |

---

## When to Use What?

* **Large dataset, quick experiment** → Holdout
* **Small/medium dataset, reliable evaluation** → K-Fold
* **Research / reproducibility** → K-Fold + fixed `random_state`
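
For the reproducibility case, the following sketch checks that two shuffled `KFold` objects built with the same `random_state` produce identical folds. The 20-row array `rows` and the helper `fold_test_indices` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(20)  # hypothetical stand-in for a 20-row dataset


def fold_test_indices(seed):
    """Return the test-index arrays produced by a shuffled 5-fold split."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in kf.split(rows)]


run_1 = fold_test_indices(123)
run_2 = fold_test_indices(123)

same = all(np.array_equal(a, b) for a, b in zip(run_1, run_2))
print("Identical folds with the same seed:", same)  # True
```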

---

### Mental Model (Easy to Remember)

* **Holdout** = one exam
* **K-Fold** = average of multiple exams

You're building solid foundations here; this is exactly how real ML pipelines are evaluated. Keep going 🚀


Excellent, this is a **classic lab handout**, and your request is clear 👍
Below is **all the code, extracted and separated into clean Jupyter cells**, exactly how an assignment notebook should look.

You can **copy-paste cell by cell**, running the cells in order.

---

## 🔹 Cell 1 – Import Pandas

```python
import pandas as pd
```

---

## 🔹 Cell 2 – Feature Construction: Derived Feature (Area)

```python
data = {
    'length': [20, 25, 30, 22],
    'breadth': [15, 20, 18, 25],
    'price': [200000, 250000, 300000, 220000]
}

df = pd.DataFrame(data)
df['area'] = df['length'] * df['breadth']
df
```

---

## 🔹 Cell 3 – Encoding Nominal Variables (One-Hot Encoding)

```python
data = {
    'city': ['A', 'B', 'C', 'A'],
    'parents_athlete': ['Y', 'N', 'N', 'Y'],
    'chance_of_win': ['Y', 'N', 'Y', 'N']
}

df = pd.DataFrame(data)
pd.get_dummies(df, drop_first=True)
```

---

## 🔹 Cell 4 – Encoding Ordinal Variables

```python
data = {'grade': ['A', 'B', 'C', 'D', 'A']}
df = pd.DataFrame(data)

grade_map = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
df['num_grade'] = df['grade'].map(grade_map)
df
```

---

## 🔹 Cell 5 – Binning Continuous Variables

```python
import numpy as np

df = pd.DataFrame({'price': [200000, 350000, 600000, 800000]})

bins = [0, 300000, 600000, np.inf]
labels = ['Low', 'Medium', 'High']

df['price_category'] = pd.cut(df['price'], bins=bins, labels=labels)
df
```

---

## 🔹 Cell 6 – PCA (Feature Extraction)

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original shape:", X.shape)
print("Reduced shape:", X_pca.shape)
```

---

## 🔹 Cell 7 – Truncated SVD

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(5, 4)

svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X)

print("Original:", X.shape)
print("Reduced:", X_svd.shape)
```

---

## 🔹 Cell 8 – LDA (Linear Discriminant Analysis)

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_lda.shape)
```

---

## 🔹 Cell 9 – Feature Selection: Chi-Square (Filter Method)

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris

iris = load_iris()

X_new = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)

print("Original:", iris.data.shape)
print("Reduced:", X_new.shape)
```

---

## 🔹 Cell 10 – Chi-Square using `.iloc`

```python
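from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: no suitable `df` exists at this point in the notebook, so iris
# stands in for it. Any DataFrame with non-negative features and the class
# label in the last column works the same way.
df = load_iris(as_frame=True).frame

X = df.iloc[:, :-1]   # all columns except the last (features)
y = df.iloc[:, -1]    # last column (class label)

# Keep the two features with the highest chi-square score
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_new.shape)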
```

---

## 🔹 Cell 11 – Wrapper Method: Recursive Feature Elimination (RFE)

```python
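from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Uses X and y defined in Cell 10
model = LogisticRegression(max_iter=200)

# Recursively drop the weakest feature until two remain
rfe = RFE(model, n_features_to_select=2)
fit = rfe.fit(X, y)

print("Selected Features:", fit.support_)  # boolean mask over the columns of X
print("Feature Ranking:", fit.ranking_)    # rank 1 marks a selected feature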
```

---

## 🔹 Cell 12 – Embedded Method: Lasso (LassoCV)

```python
from sklearn.linear_model import LassoCV
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)
df_d = diabetes.frame

X = df_d.iloc[:, :-1]
y = df_d.iloc[:, -1]

lasso = LassoCV(cv=5)
lasso.fit(X, y)

print("Coefficients:", lasso.coef_)
print("Number of selected features:", sum(lasso.coef_ != 0))
```

---


In [2]:

```python
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load example dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target)

print(data.head())
print(target.head())
```

```text
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
0    0
1    0
2    0
3    0
4    0
dtype: int32
```


In [3]:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split data: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=123
)

# Train a simple model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Holdout Accuracy: {accuracy:.4f}")
```

```text
Holdout Accuracy: 0.9333
```


In [4]:

```python
from sklearn.model_selection import KFold

# K-Fold setup: 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=123)

fold_accuracies = []

for fold, (train_index, test_index) in enumerate(kf.split(data)):
    X_train, X_test = data.iloc[train_index], data.iloc[test_index]
    y_train, y_test = target.iloc[train_index], target.iloc[test_index]

    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    fold_accuracies.append(acc)

    print(f"Fold {fold+1} Accuracy: {acc:.4f}")

print(f"Average K-Fold Accuracy: {np.mean(fold_accuracies):.4f}")
```

```text
Fold 1 Accuracy: 1.0000
Fold 2 Accuracy: 0.9667
Fold 3 Accuracy: 0.9667
Fold 4 Accuracy: 0.9667
Fold 5 Accuracy: 0.9333
Average K-Fold Accuracy: 0.9667
```
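
The manual loop above can usually be condensed with scikit-learn's `cross_val_score`, which takes the same `KFold` object through its `cv` parameter. This is a sketch of that shortcut, assuming the same iris data and model settings; it is written to be self-contained, so it reloads the data as NumPy arrays rather than reusing `data` and `target`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=123)
model = LogisticRegression(max_iter=200)

# One accuracy value per fold; the estimator is cloned and refit for each fold
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

for fold, acc in enumerate(scores, start=1):
    print(f"Fold {fold} Accuracy: {acc:.4f}")
print(f"Average K-Fold Accuracy: {scores.mean():.4f}")
```

The explicit loop remains useful when you need the per-fold data itself, for example to inspect which rows were misclassified.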


In [5]:

```python
# Save folds into a list
folds = list(kf.split(data))

# Get 3rd fold indices (remember Python indexing starts at 0)
train_index, test_index = folds[2]

X_train_fold3 = data.iloc[train_index]
X_test_fold3 = data.iloc[test_index]
y_train_fold3 = target.iloc[train_index]
y_test_fold3 = target.iloc[test_index]

print("Train shape:", X_train_fold3.shape)
print("Test shape:", X_test_fold3.shape)
```

```text
Train shape: (120, 4)
Test shape: (30, 4)
```
