## 8. Model-Based Learning

### Definition
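**Model-Based Learning** (often called **Eager Learning**, and closely associated with **Parametric Learning**) builds a **mathematical model** that captures patterns in the training data. Learning happens during training; prediction simply applies the learned model to new inputs. Not every model-based learner is parametric: decision trees, for example, learn a structure rather than a fixed set of parameters.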

### Core Concept

```
Training Phase:
Input Data → Learn Parameters/Structure → Create Mathematical Model
(Intensive computation, learns general patterns)

Prediction Phase:
New Sample → Apply Model (simple calculation) → Prediction
(Fast, uses learned knowledge)
```

### How Model-Based Learning Works

The algorithm learns a **parameterized function** that generalizes to new data:

```
f(x) = θ₀ + θ₁×x₁ + θ₂×x₂ + ... (Linear Regression)
       where θ are learned parameters

f(x) = Tree of decision rules (Decision Trees)
       where rules are learned

f(x) = Deep Neural Network with up to billions of parameters
       where parameters are learned via backpropagation
```

### Common Model-Based Algorithms

#### 1. **Linear Regression** (Parametric)


In [None]:
from sklearn.linear_model import LinearRegression
import numpy as np

# Training: Learn weights θ
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 5, 4, 5])

model = LinearRegression()
model.fit(X_train, y_train)  # Learns: y = 2.2 + 0.6*x

print(f"Learned parameters (coefficients): {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Prediction: Just apply the model
x_new = np.array([[6]])
y_pred = model.predict(x_new)
print(f"Prediction for x=6: {y_pred[0]:.2f}")

# Model formula: y = 0.6 * x + 2.2
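
To make "prediction is just applying the learned model" concrete, the sketch below refits the same toy data and reproduces `predict` by hand from the intercept and coefficient. The name `manual_pred` is illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same toy data as in the cell above
X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 5, 4, 5])

model = LinearRegression().fit(X_train, y_train)

# Apply f(x) = theta_0 + theta_1 * x by hand
x_new = np.array([[6]])
manual_pred = model.intercept_ + x_new @ model.coef_

print(f"theta_0 (intercept): {model.intercept_:.2f}")
print(f"theta_1 (slope):     {model.coef_[0]:.2f}")
print(f"Manual prediction:   {manual_pred[0]:.2f}")
print(f"model.predict:       {model.predict(x_new)[0]:.2f}")
```

The two predictions agree because, for a linear model, `predict` is nothing more than this dot product plus the intercept.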


#### 2. **Decision Trees** (Non-parametric but model-based)


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Training: Learn decision rules
iris = load_iris()
X, y = iris.data, iris.target

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)  # Learns tree structure

# Visualize learned tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 10))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.title('Learned Decision Tree Model')
plt.show()

# Prediction: Follow decision paths
x_new = [[5.1, 3.5, 1.4, 0.2]]
y_pred = tree.predict(x_new)
print(f"Prediction: {iris.target_names[y_pred[0]]}")
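
The plot shows the tree graphically; the learned rules can also be printed as text. The following is a minimal sketch using scikit-learn's `export_text`, refitting the same kind of tree so the example is self-contained. The `random_state=0` setting is an assumption added only for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned rules as nested if/else text
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# The fitted model is the tree itself: its size is bounded by max_depth,
# not by the number of training samples
print(f"Depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
```

Once these rules are learned, the training data is no longer needed to make a prediction.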


#### 3. **Neural Networks** (Parametric, from thousands to millions of parameters or more)


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Create data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2)

# Normalize
scaler = StandardScaler()
X = scaler.fit_transform(X)

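# Training: Learn the weights (about 7,000 for this small network)
model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=200)
model.fit(X, y)

# Model structure: MLPClassifier has no parameter-count attribute,
# so sum the sizes of the weight matrices and bias vectors
n_params = sum(w.size for w in model.coefs_) + sum(b.size for b in model.intercepts_)
print(f"Number of parameters: {n_params}")
print("Layer sizes: Input->100->50->Output")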

# Prediction: Forward pass through network
y_pred = model.predict(X[:10])
print(f"Predictions: {y_pred}")


### Characteristics of Model-Based Learning

#### 1. **Parametric** 📐
- Creates a **fixed-size** model regardless of data size
- Number of parameters determined before training
- Generalizes through learned parameters
- Simpler models (fewer parameters) are less prone to overfitting, but can underfit when the true pattern is complex


In [None]:
# Linear model: 100 samples = 1M samples, same parameters
# Model size doesn't grow with data size

X_100 = np.random.rand(100, 20)
X_1M = np.random.rand(1_000_000, 20)

model_100 = LinearRegression()
model_100.fit(X_100, np.random.rand(100))

model_1M = LinearRegression()
model_1M.fit(X_1M, np.random.rand(1_000_000))

# Both models have same number of parameters!
print(f"Model trained on 100 samples: {model_100.coef_.shape}")
print(f"Model trained on 1M samples: {model_1M.coef_.shape}")


#### 2. **Eager Learning** 🎯
- Learning happens during training (expensive computation)
- Predictions are fast (apply learned model)
- Training phase is computationally intensive
- Prediction phase is lightweight


In [None]:
import time

# Training: Computationally expensive
X_train = np.random.rand(50000, 100)
y_train = np.random.rand(50000)

start = time.perf_counter()
model = LinearRegression()
model.fit(X_train, y_train)  # Solves a least-squares problem (closed form: (X'X)^-1 X'y)
train_time = time.perf_counter() - start
print(f"Training time: {train_time:.4f} seconds")

# Prediction: Very fast
X_test = np.random.rand(1000, 100)
start = time.perf_counter()  # perf_counter avoids a zero reading on coarse clocks
y_pred = model.predict(X_test)  # Simple matrix multiplication
predict_time = time.perf_counter() - start
print(f"Prediction time: {predict_time:.4f} seconds")
print(f"Training is {train_time/predict_time:.0f}x slower than prediction")


#### 3. **Generalization**
- Learned parameters generalize to unseen data
- Avoid overfitting through regularization
- Need sufficient training data
- Requires careful model selection
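
As a rough illustration of regularization, the sketch below compares ordinary least squares with ridge regression when there are almost as many features as training samples. The data is synthetic, and `alpha=10.0` is an arbitrary value chosen for illustration rather than a recommended setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 60 samples, 40 features, only the first 3 features matter
X = rng.normal(size=(60, 40))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

plain = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)  # alpha controls the penalty strength

for name, m in [("Unregularized", plain), ("Ridge", ridge)]:
    print(f"{name:14s} train R^2: {m.score(X_tr, y_tr):.3f}  "
          f"test R^2: {m.score(X_te, y_te):.3f}  "
          f"coef norm: {np.linalg.norm(m.coef_):.2f}")
```

The number to watch is the gap between train and test scores: the unregularized model fits the training set almost perfectly, while ridge trades some training fit for smaller coefficients.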

### Advantages of Model-Based Learning

#### 1. **Fast Predictions** ⚡


In [None]:
# Once the model is learned, predictions are fast:
# a tree follows a short path of comparisons per sample,
# whereas KNN must search the stored training set for neighbors
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

knn = KNeighborsClassifier(n_neighbors=5)
tree = DecisionTreeClassifier(max_depth=5)

X_train = np.random.rand(100000, 50)
y_train = np.random.randint(0, 2, 100000)

knn.fit(X_train, y_train)  # Stores 100k samples
tree.fit(X_train, y_train)  # Learns small tree

X_test = np.random.rand(1000, 50)

start = time.perf_counter()
knn_pred = knn.predict(X_test)
knn_time = time.perf_counter() - start

start = time.perf_counter()
tree_pred = tree.predict(X_test)
tree_time = time.perf_counter() - start

print(f"KNN prediction time: {knn_time:.4f}s")
print(f"Tree prediction time: {tree_time:.4f}s")
print(f"Tree is {knn_time/tree_time:.0f}x faster")


#### 2. **Memory Efficient** 💾
- Model size independent of training data size
- A linear model's size depends on the number of features, not on the number of training samples
- Suitable for deployment (small model files)


In [None]:
import pickle
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Model-based: Tiny model file
model = LinearRegression()
model.fit(X_train, y_train)
model_size = len(pickle.dumps(model)) / 1024
print(f"Linear model size: {model_size:.2f} KB")

# Instance-based: Huge model file (stores all data!)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_size = len(pickle.dumps(knn)) / (1024**2)
print(f"KNN model size: {knn_size:.2f} MB")
# KNN is typically thousands of times larger, because it keeps the training set


#### 3. **Scalable Deployment** 🚀


In [None]:
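# Model-based learners are easy to deploy:
# save the fitted parameters once, then load them in production
import joblib

# Save trained model
joblib.dump(model, 'trained_model.pkl')

# Load in production
model_prod = joblib.load('trained_model.pkl')

# Make predictions: new_data must have the same columns as the training data
# (X_test from the earlier cell stands in for production data here)
new_data = X_test
predictions = model_prod.predict(new_data)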


#### 4. **Interpretability** 🔍
- Linear models: See weight importance
- Decision trees: Understand decision rules
- More understandable than instance-based


In [None]:
# Linear model: inspect the learned coefficients
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Coefficients indicate feature importance only when features share a scale
feature_importance = model.coef_[0]
feature_names = [f"feature_{i + 1}" for i in range(X_train.shape[1])]

for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Decision tree: Visually interpretable
from sklearn.tree import plot_tree
plot_tree(tree, feature_names=feature_names, filled=True)


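#### 5. **Handles Very Large or Streaming Data** ♾️
- Doesn't need to keep the training data once the model is fitted
- Models that support incremental updates can learn from streaming data, one batch at a time
- Scales to datasets too large to fit in memory, as long as each batch fits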
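
Incremental learning can be sketched with scikit-learn's `partial_fit`, which some estimators (such as `SGDRegressor`) provide. The stream below is simulated, and `true_w` is a hypothetical set of ground-truth weights used only to generate data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights

model = SGDRegressor(random_state=0)

# Simulate a stream: each batch is seen once, then discarded
for _ in range(200):
    X_batch = rng.normal(size=(50, 3))
    y_batch = X_batch @ true_w + rng.normal(scale=0.1, size=50)
    model.partial_fit(X_batch, y_batch)

print(f"Learned weights: {np.round(model.coef_, 2)}")
print(f"True weights:    {true_w}")
```

Only the three weights and the intercept persist between batches, so memory use stays constant no matter how long the stream runs. Estimators without `partial_fit` (for example `LinearRegression`) need the full dataset at once.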

### Disadvantages of Model-Based Learning

#### 1. **Assumption-Dependent** ⚠️
- Makes assumptions about data (linear, gaussian, etc.)
- Wrong assumptions lead to poor performance
- Assumptions may not match reality


In [None]:
# Example: Linear regression assumes linear relationship
# If data is nonlinear, performance suffers

# Generate nonlinear data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, 100)

# Linear model assumes straight line
linear = LinearRegression()
linear.fit(X, y)
linear_pred = linear.predict(X)
linear_mse = np.mean((linear_pred - y) ** 2)

# Polynomial model (still linear in its parameters, but fitted on polynomial features)
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=10)
X_poly = poly_features.fit_transform(X)
poly = LinearRegression()
poly.fit(X_poly, y)
poly_pred = poly.predict(X_poly)
poly_mse = np.mean((poly_pred - y) ** 2)

print(f"Linear regression MSE: {linear_mse:.4f}")
print(f"Polynomial (degree 10) MSE: {poly_mse:.4f}")
# The polynomial fit has much lower training error: the expanded features
# let a linear model represent a curve instead of a straight line


#### 2. **Slower Learning** 🐌
- Training computationally expensive
- Fitting parameters takes time
- Not suitable when quick updates needed


In [None]:
# Training is expensive for complex models
X = np.random.rand(100000, 100)
y = np.random.rand(100000)

# Simple model: Fast
start = time.time()
linear = LinearRegression()
linear.fit(X, y)
print(f"Linear model training: {time.time() - start:.2f}s")

# Complex model: Slow
from sklearn.ensemble import RandomForestRegressor
start = time.time()
forest = RandomForestRegressor(n_estimators=100, n_jobs=-1)
forest.fit(X, y)
print(f"Random Forest training: {time.time() - start:.2f}s")


#### 3. **Requires Sufficient Training Data**
- Need enough data to learn parameters well
- Too little data → overfitting
- Too much noise → poor generalization


In [None]:
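# Model needs sufficient training data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic but learnable data (purely random labels would score ~50% at every size)
X_all, y_all = make_classification(n_samples=3000, n_features=20,
                                   n_informative=10, random_state=0)
X_pool, X_test_t, y_pool, y_test_t = train_test_split(
    X_all, y_all, test_size=1000, random_state=0)

sample_sizes = [10, 50, 100, 500, 1000]
accuracies = []

for size in sample_sizes:
    # Train on the first `size` samples; evaluate on the same held-out test set
    model_t = RandomForestClassifier(n_estimators=10, random_state=0)
    model_t.fit(X_pool[:size], y_pool[:size])
    accuracies.append(model_t.score(X_test_t, y_test_t))

plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, accuracies, marker='o', linewidth=2)
plt.xlabel('Number of Training Samples')
plt.ylabel('Accuracy')
plt.title('Model-Based Learning: Accuracy vs Data Size')
plt.grid(True, alpha=0.3)
plt.show()

# Accuracy generally improves with more data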


#### 4. **Hyperparameter Tuning**
- Many models require tuning (hidden layers, regularization, etc.)
- Wrong hyperparameters → poor performance
- Grid search expensive


In [None]:
# Hyperparameter tuning required
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Many parameters to tune
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', None]  # random forests have no learning_rate
}

# This becomes computationally expensive!
rf = RandomForestClassifier()
grid = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1)
# grid.fit(X, y) would train 4 * 4 * 3 * 3 = 144 combinations x 5 folds = 720 models
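
The cost of a grid search can be estimated before running it. The sketch below counts the candidates for a small three-parameter grid (chosen here for illustration) with scikit-learn's `ParameterGrid`, without fitting any model.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative three-parameter grid for a random forest
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}
cv_folds = 5

n_candidates = len(ParameterGrid(param_grid))  # 4 * 4 * 3 = 48
n_fits = n_candidates * cv_folds               # 48 * 5 = 240

print(f"Candidate settings: {n_candidates}")
print(f"Total model fits with {cv_folds}-fold CV: {n_fits}")
```

Each extra hyperparameter multiplies the count, which is why exhaustive search grows expensive so quickly.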


### When to Use Model-Based Learning

✅ **Use Model-Based when:**
- Large datasets available
- Need fast predictions
- Storage/memory important
- Patterns are stable
- Can afford training time
- Need interpretable models
- Streaming data, provided the model supports incremental updates

❌ **Avoid when:**
- Very little training data
- Instant adaptation needed
- Data patterns extremely complex
- Cannot afford training time
- Unknown pattern complexity

### Real-World Applications


In [None]:
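# The sketches below use small synthetic arrays as stand-ins for real datasets,
# so the outputs are meaningless; only the fit/predict pattern matters.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Application 1: Housing Price Prediction
# Features: square_feet, bedrooms, bathrooms, age, location
# (location must be numeric; here 0=urban, 1=suburban, 2=rural)
X_train_houses = np.column_stack([
    rng.uniform(500, 4000, 200),   # square_feet
    rng.integers(1, 6, 200),       # bedrooms
    rng.integers(1, 4, 200),       # bathrooms
    rng.uniform(0, 50, 200),       # age
    rng.integers(0, 3, 200),       # location code
])
y_train_prices = 150 * X_train_houses[:, 0] + rng.normal(0, 10_000, 200)

model = GradientBoostingRegressor()
model.fit(X_train_houses, y_train_prices)

# Fast prediction on new house (2000 sq ft, 3 bed, 2 bath, 10 years old, suburban)
new_house_pred = model.predict([[2000, 3, 2, 10, 1]])

# Application 2: Disease Diagnosis
# Patient symptoms as binary features: fever, cough, fatigue, headache
X_train_symptoms = rng.integers(0, 2, size=(200, 4))
y_train_diagnosis = rng.integers(0, 3, size=200)  # placeholder labels
diagnosis_model = RandomForestClassifier()
diagnosis_model.fit(X_train_symptoms, y_train_diagnosis)

# Predict disease
patient_symptoms = [1, 1, 0, 1]
predicted_disease = diagnosis_model.predict([patient_symptoms])

# Application 3: Spam Detection
# Features: sender_reputation, content_similarity, link_count
X_train_emails = rng.random((200, 3))
y_train_spam_labels = rng.integers(0, 2, size=200)  # placeholder labels
spam_model = LogisticRegression()
spam_model.fit(X_train_emails, y_train_spam_labels)

# Classify incoming email
incoming_email_features = [0.2, 0.9, 0.7]
is_spam = spam_model.predict([incoming_email_features])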


---
