## 7. Instance-Based Learning

### Definition
**Instance-Based Learning** (also called **Lazy Learning** or **Memory-Based Learning**) is an approach where the model:
1. **Stores** the entire training dataset in memory
2. **Defers learning** to the prediction phase; the training phase only stores data
3. **Makes predictions** by comparing new instances to stored examples using similarity measures

### Core Concept

```
Training Phase:
Input Data → Store exactly as is → Done!
(No model creation, no parameter learning)

Prediction Phase:
New Sample → Find similar stored examples → Predict based on neighbors
```

### How It Works: K-Nearest Neighbors (KNN)

KNN is the most popular instance-based algorithm.

**Algorithm:**
1. Store all training examples
2. For each new sample to predict:
   a. Calculate the distance to all stored examples
   b. Find the K nearest neighbors
   c. Return the class (majority vote) or value (average) of the K neighbors
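
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes Euclidean distance and a brute-force search; the function name `knn_predict`, its `task` argument, and the toy data are hypothetical and exist only for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3, task="classify"):
    """Brute-force KNN for one query point (illustrative sketch)."""
    # Step a: Euclidean distance from x_new to every stored example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step b: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step c: majority vote for classification, mean for regression
    if task == "classify":
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]
    return y_train[nearest].mean()

# Toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

query = np.array([0.1, 0.1])
pred_class = knn_predict(X_toy, y_class, query, k=3)
pred_value = knn_predict(X_toy, y_value, query, k=3, task="regress")
print(pred_class, pred_value)
```

The query sits inside the first cluster, so all three neighbors come from it: the vote returns class 0 and the regression returns the mean of their target values.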

#### Distance Metrics


In [None]:
import numpy as np
from scipy.spatial.distance import cdist

def euclidean_distance(x1, x2):
    """Straight-line distance between two points"""
    return np.sqrt(np.sum((x1 - x2) ** 2))

def manhattan_distance(x1, x2):
    """Taxicab distance (sum of absolute differences)"""
    return np.sum(np.abs(x1 - x2))

def cosine_similarity(x1, x2):
    """Angle-based similarity (useful for text/documents)"""
    dot_product = np.dot(x1, x2)
    norms = np.linalg.norm(x1) * np.linalg.norm(x2)
    return dot_product / norms if norms > 0 else 0

# Example
point1 = np.array([0, 0])
point2 = np.array([3, 4])

print(f"Euclidean distance: {euclidean_distance(point1, point2):.2f}")  # 5.00
print(f"Manhattan distance: {manhattan_distance(point1, point2):.2f}")  # 7.00
# Cosine similarity is undefined for the zero vector; the function returns 0 by convention
print(f"Cosine similarity: {cosine_similarity(point1, point2):.2f}")  # 0.00
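
When many points are compared at once, SciPy's `cdist` computes the same metrics pairwise without an explicit loop. The sketch below uses made-up toy points. Note that SciPy's `"cosine"` metric returns a distance (1 minus the cosine similarity), not the similarity itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

# One query point against a small set of stored points (toy values)
stored = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
query = np.array([[0.0, 0.0]])

euclid = cdist(query, stored, metric="euclidean")    # shape (1, 3)
manhattan = cdist(query, stored, metric="cityblock")

# SciPy's "cosine" metric is a distance: 1 - cosine similarity
a = np.array([[1.0, 0.0]])
b = np.array([[1.0, 1.0], [0.0, 1.0]])
cos_dist = cdist(a, b, metric="cosine")

print(euclid)     # distances 5, 1, 2
print(manhattan)  # distances 7, 1, 2
print(cos_dist)
```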


#### KNN Implementation


In [None]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification Example
iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed seed for reproducible results
)

# KNN Classification
knn_clf = KNeighborsClassifier(n_neighbors=5)
knn_clf.fit(X_train, y_train)  # Stores training data (possibly in a search index); no parameters are fitted
predictions = knn_clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"KNN Classification Accuracy: {accuracy:.4f}")

# Regression Example
diabetes = load_diabetes()
X_d, y_d = diabetes.data, diabetes.target

X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42  # fixed seed for reproducible results
)

knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_d, y_train_d)
predictions_reg = knn_reg.predict(X_test_d)
mse = mean_squared_error(y_test_d, predictions_reg)
print(f"KNN Regression MSE: {mse:.4f}")

# Visualizing KNN decision boundary
import matplotlib.pyplot as plt

# Use only the first 2 features so the boundary can be drawn in 2D
X_2d = X_train[:, :2]

# knn_clf was fitted on all 4 features, so fit a separate 2-feature model for plotting
knn_2d_plot = KNeighborsClassifier(n_neighbors=5)
knn_2d_plot.fit(X_2d, y_train)

# Create mesh for decision boundary
h = 0.02  # Step size
x_min, x_max = X_2d[:, 0].min() - 0.5, X_2d[:, 0].max() + 0.5
y_min, y_max = X_2d[:, 1].min() - 0.5, X_2d[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))

# Predict on mesh with the 2-feature model
Z = knn_2d_plot.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap=plt.cm.RdYlBu)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_train,
            cmap=plt.cm.RdYlBu, edgecolors='black')
plt.title('KNN Decision Boundary (K=5)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


### Characteristics of Instance-Based Learning

#### 1. **Non-Parametric**
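- No fixed set of model parameters is learned
- Model complexity grows with the training data
- Can approximate complex patterns when enough nearby examples are available
- No assumed parametric form for the data distribution (it does assume that nearby points have similar targets)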

#### 2. **Lazy Learning**
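- No explicit learning phase: the data is stored (scikit-learn may also build a KD-tree or ball-tree index)
- Actual computation happens at prediction time
- Fast training (little more than copying the data)
- Slow prediction (a brute-force search calculates distances to all samples)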


In [None]:
import time

# Training phase: typically fast (stores data, possibly builds a search index)
start = time.perf_counter()
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
train_time = time.perf_counter() - start
print(f"Training time: {train_time:.4f} seconds")

# Prediction phase: distances are computed here
start = time.perf_counter()
predictions = knn.predict(X_test)
predict_time = time.perf_counter() - start
print(f"Prediction time: {predict_time:.4f} seconds")
# On a dataset as small as iris both timings are tiny and noisy;
# the gap widens as the training set grows
print(f"Ratio: prediction took {predict_time / train_time:.1f}x the training time")


#### 3. **Flexible**
- Quickly adapts to new data
- Can learn complex non-linear patterns
- Works with any distance metric
- Adding data means extending the stored set; no parameters have to be re-optimized


In [None]:
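# Adding new data: append it to the stored set
new_X = np.array([[5.1, 3.5, 1.4, 0.2]])
new_y = np.array([0])

# Concatenate the new example with the existing training data
X_updated = np.vstack([X_train, new_X])
y_updated = np.hstack([y_train, new_y])

# scikit-learn's KNN has no incremental update, so fit() is called again;
# fit() only stores (and possibly indexes) the data, so this is cheap
knn_updated = KNeighborsClassifier(n_neighbors=5)
knn_updated.fit(X_updated, y_updated)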
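
To make the "just add to memory" idea concrete without scikit-learn, here is a hypothetical minimal store written in NumPy. The class name `InstanceStore` and its methods are invented for this sketch, and it assumes Euclidean distance with a brute-force search.

```python
import numpy as np

class InstanceStore:
    """Hypothetical minimal instance-based classifier with incremental updates."""

    def __init__(self, k=3):
        self.k = k
        self.X = None
        self.y = None

    def add(self, X_new, y_new):
        # "Training" is appending rows to memory
        X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
        y_new = np.atleast_1d(np.asarray(y_new))
        if self.X is None:
            self.X, self.y = X_new, y_new
        else:
            self.X = np.vstack([self.X, X_new])
            self.y = np.concatenate([self.y, y_new])

    def predict_one(self, x):
        # Brute-force Euclidean search over everything stored so far
        distances = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(distances)[: self.k]
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

store = InstanceStore(k=1)
store.add([[0.0, 0.0], [1.0, 1.0]], [0, 0])
before = store.predict_one([4.0, 4.0])   # only class 0 is known so far

store.add([[5.0, 5.0]], [1])             # new example is usable immediately
after = store.predict_one([4.0, 4.0])
print(before, after)
```

The prediction for the same query changes as soon as the new example is appended, with no separate fitting step in between.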


### Advantages of Instance-Based Learning

#### 1. **Simple Implementation** ✅


In [None]:
# Implementation is straightforward
# Just calculate distance and find nearest neighbors

def simple_knn_predict(X_train, y_train, x_new, k=5):
    """Simple KNN implementation"""
    distances = []
    for x_train_sample in X_train:
        dist = np.sqrt(np.sum((x_train_sample - x_new) ** 2))
        distances.append(dist)
    
    # Find K nearest
    distances = np.array(distances)
    k_nearest_indices = np.argsort(distances)[:k]
    k_nearest_labels = y_train[k_nearest_indices]
    
    # Majority vote (np.bincount requires non-negative integer labels)
    prediction = np.bincount(k_nearest_labels).argmax()
    return prediction


#### 2. **Quick Adaptation** 🔄
- New data is incorporated immediately
- No need for full retraining
- Well suited to data that arrives incrementally, as long as the stored set stays manageable

#### 3. **Handles Complex Patterns** 🎯
- No assumed parametric form for the underlying distribution
- Captures non-linear relationships
- Accepts any number of features, though accuracy degrades in high dimensions (see Disadvantages)
- Flexible decision boundaries

#### 4. **Naturally Handles Multi-Class**


In [None]:
# KNN naturally works with multiple classes
knn_multiclass = KNeighborsClassifier(n_neighbors=5)
knn_multiclass.fit(X_train, y_train)  # Works with any number of classes


#### 5. **Good for Small Datasets** 📊
- Works well when training data is limited
- Every example is used directly at prediction time, so none is wasted
- Can outperform complex model-based approaches, which may overfit small datasets
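
With limited data, holding out a large test set is costly. Leave-one-out cross-validation is one way to pick K while still using almost every example for prediction. The sketch below assumes the iris dataset and an arbitrary list of candidate K values; it is an illustration, not a tuning recipe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary candidates for illustration
loo_accuracy = {}
for k in candidate_ks:
    # Each sample is held out once and predicted from all the others
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut()
    )
    loo_accuracy[k] = scores.mean()

best_k = max(loo_accuracy, key=loo_accuracy.get)
for k, acc in loo_accuracy.items():
    print(f"K={k:2d}  leave-one-out accuracy: {acc:.4f}")
print(f"Best K on this dataset: {best_k}")
```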

### Disadvantages of Instance-Based Learning

#### 1. **High Memory Requirement** ❌
- Must store entire training dataset
- Large datasets require huge memory
- Not suitable for millions of samples


In [None]:
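# Memory cost grows linearly with the number of stored samples

# Store 1 million samples with 100 features
X_large = np.random.rand(1_000_000, 100)
y_large = np.random.randint(0, 2, size=1_000_000)  # placeholder labels so fit() can run
print(f"Memory required: {X_large.nbytes / (1024**3):.2f} GB")
# float64 features alone take ~0.75 GB (1e6 x 100 x 8 bytes)

# KNN must keep all of it in memory (plus any search index it builds)
knn_large = KNeighborsClassifier()
knn_large.fit(X_large, y_large)
# This fits on typical hardware, but the footprint scales with every
# sample and feature added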


#### 2. **Slow Prediction** 🐢
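- A brute-force search must calculate the distance to ALL training examples
- Prediction time per query: O(n × d) where n=samples, d=features
- Tree indexes (KD-tree, ball tree) speed up search in low dimensions but lose their advantage as d grows
- Becomes impractical for millions of samples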


In [None]:
# Prediction speed degrades with dataset size
dataset_sizes = [100, 1000, 10000, 100000]
prediction_times = []

for size in dataset_sizes:
    X_temp = np.random.rand(size, 50)
    y_temp = np.random.randint(0, 2, size)
    
    knn_temp = KNeighborsClassifier(n_neighbors=5)
    knn_temp.fit(X_temp, y_temp)
    
    X_test_temp = np.random.rand(100, 50)
    
    start = time.time()
    knn_temp.predict(X_test_temp)
    elapsed = time.time() - start
    prediction_times.append(elapsed)

plt.figure(figsize=(10, 6))
plt.plot(dataset_sizes, prediction_times, marker='o', linewidth=2)
plt.xlabel('Number of Training Samples')
plt.ylabel('Prediction Time (seconds)')
plt.title('KNN Prediction Speed vs Dataset Size')
plt.xscale('log')
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
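
scikit-learn exposes the search strategy through the `algorithm` parameter of `KNeighborsClassifier`. The sketch below compares the exact `"brute"` and `"kd_tree"` strategies on random low-dimensional data. The dataset sizes are arbitrary, and the timings depend on hardware and library version, so treat them as indicative only.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.random((20_000, 3))          # low-dimensional, where trees help most
y_train = rng.integers(0, 2, size=20_000)
X_query = rng.random((500, 3))

results = {}
for algorithm in ("brute", "kd_tree"):
    model = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    model.fit(X_train, y_train)            # kd_tree builds its index here
    start = time.perf_counter()
    preds = model.predict(X_query)
    results[algorithm] = (preds, time.perf_counter() - start)
    print(f"{algorithm:8s} prediction time: {results[algorithm][1]:.4f} s")

# Both strategies are exact, so they should agree on the predictions
same = np.array_equal(results["brute"][0], results["kd_tree"][0])
print("Identical predictions:", same)
```

The index moves some cost from prediction to `fit()`, which is why training a tree-backed KNN is no longer a pure copy of the data.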


#### 3. **Sensitive to Irrelevant Features** 🎲
- All features treated equally in distance calculation
- Irrelevant features increase noise
- Performance degrades in high dimensions (curse of dimensionality)


In [None]:
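# Curse of dimensionality
from sklearn.datasets import make_classification

# Few dimensions: KNN works well
X_2d, y_2d = make_classification(n_samples=500, n_features=2,
                                 n_informative=2, n_redundant=0,
                                 random_state=0)
X_tr_2d, X_te_2d, y_tr_2d, y_te_2d = train_test_split(
    X_2d, y_2d, test_size=0.3, random_state=0)
knn_2d = KNeighborsClassifier(n_neighbors=5)
knn_2d.fit(X_tr_2d, y_tr_2d)
acc_2d = knn_2d.score(X_te_2d, y_te_2d)  # score held-out data, not the training set
print(f"2D accuracy: {acc_2d:.4f}")

# Many dimensions: 2 informative features plus 98 pure-noise features
# (n_redundant=0 keeps the extra features uninformative; redundant features
# would be linear combinations of the informative ones)
X_100d, y_100d = make_classification(n_samples=500, n_features=100,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
X_tr_100d, X_te_100d, y_tr_100d, y_te_100d = train_test_split(
    X_100d, y_100d, test_size=0.3, random_state=0)
knn_100d = KNeighborsClassifier(n_neighbors=5)
knn_100d.fit(X_tr_100d, y_tr_100d)
acc_100d = knn_100d.score(X_te_100d, y_te_100d)
print(f"100D accuracy: {acc_100d:.4f} (typically lower: noise features swamp the distance)")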
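
One way to see why distances lose meaning in high dimensions is to measure how much farther the farthest point is than the nearest one. The sketch below assumes points drawn uniformly from the unit hypercube; the dimensions and sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))     # uniform points in the unit hypercube
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  nearest={dists.min():.3f}  farthest={dists.max():.3f}  "
          f"relative contrast={contrast[d]:.2f}")
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance, so the "nearest" neighbors carry less information.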


#### 4. **Sensitive to Noise** 🔊
- Noisy data points become problematic "neighbors"
- Outliers can mislead predictions
- No learning process to filter noise


In [None]:
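# Effect of label noise
from sklearn.datasets import make_classification

# make_classification has no `noise` argument; `flip_y` sets the fraction
# of samples whose class label is assigned at random
X_clean, y_clean = make_classification(n_samples=200, n_features=20,
                                       n_informative=10, flip_y=0.0,
                                       random_state=0)
X_noisy, y_noisy = make_classification(n_samples=200, n_features=20,
                                       n_informative=10, flip_y=0.3,
                                       random_state=0)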

# Split into train/test
X_train_clean, X_test_clean, y_train_clean, y_test_clean = \
    train_test_split(X_clean, y_clean, test_size=0.3)
X_train_noisy, X_test_noisy, y_train_noisy, y_test_noisy = \
    train_test_split(X_noisy, y_noisy, test_size=0.3)

# KNN on clean data
knn_clean = KNeighborsClassifier(n_neighbors=5)
knn_clean.fit(X_train_clean, y_train_clean)
acc_clean = knn_clean.score(X_test_clean, y_test_clean)

# KNN on noisy data
knn_noisy = KNeighborsClassifier(n_neighbors=5)
knn_noisy.fit(X_train_noisy, y_train_noisy)
acc_noisy = knn_noisy.score(X_test_noisy, y_test_noisy)

print(f"Clean data accuracy: {acc_clean:.4f}")
print(f"Noisy data accuracy: {acc_noisy:.4f}")
print(f"Accuracy drop: {(acc_clean - acc_noisy)*100:.2f} percentage points")
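
A common mitigation is to increase K so that a single mislabeled neighbor is outvoted. The sketch below uses a synthetic rule and a 20% label-flip rate, both assumptions made for illustration, and scores against clean test labels so the result reflects the true rule.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# True rule: class 1 when the first feature is positive
X_train = rng.uniform(-1, 1, size=(600, 2))
y_true = (X_train[:, 0] > 0).astype(int)

# Flip roughly 20% of the training labels to simulate label noise
flip = rng.random(600) < 0.2
y_noisy = np.where(flip, 1 - y_true, y_true)

# Evaluate against clean labels so the score reflects the true rule
X_test = rng.uniform(-1, 1, size=(300, 2))
y_test = (X_test[:, 0] > 0).astype(int)

accuracy = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_noisy)
    accuracy[k] = model.score(X_test, y_test)
    print(f"K={k:2d}  accuracy on clean test labels: {accuracy[k]:.3f}")
```

With K=1 every flipped training label is copied straight into the prediction for nearby queries. A larger K averages over the neighborhood, at the cost of blurring the boundary.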


#### 5. **Imbalanced Data Sensitivity**


In [None]:
# With imbalanced classes, the majority class tends to dominate the K-neighbor vote
# Distance weighting is one partial mitigation
weighted_knn = KNeighborsClassifier(
    n_neighbors=5,
    weights='distance'  # Weight by distance (closer neighbors matter more)
)
weighted_knn.fit(X_train, y_train)
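
The effect of `weights='distance'` is easiest to see on a tiny hand-built case. In the toy data below (made up for this illustration), two minority-class points sit right next to the query while three majority-class points are farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: two minority points (class 1) sit right next to the query,
# three majority points (class 0) are farther away
X_toy = np.array([[0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [1.0, -1.0], [-1.0, 1.0]])
y_toy = np.array([1, 1, 0, 0, 0])
query = np.array([[0.0, 0.0]])

uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_toy, y_toy)

pred_uniform = uniform.predict(query)[0]    # 3 votes vs 2: majority class wins
pred_weighted = weighted.predict(query)[0]  # votes scaled by 1/distance
print(pred_uniform, pred_weighted)
```

Uniform voting returns the majority class 0, while distance weighting returns class 1 because the two close neighbors outweigh the three distant ones. This helps only when minority examples really are closer to the query; it does not correct the imbalance itself.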


### When to Use Instance-Based Learning

✅ **Use Instance-Based when:**
- Small to medium dataset
- Data patterns complex and non-linear
- Quick training needed
- Data constantly updating
- Want simple, interpretable model
- Limited labeled data (few-shot learning)

❌ **Avoid when:**
- Millions of samples (memory/speed)
- Low-latency, real-time predictions are critical
- Storage is constrained
- Noisy data
- High-dimensional data (100+ features)

### Real-World Applications


In [None]:
# Example 1: Medical Diagnosis
# Find similar patient cases and use their outcomes

# Example 2: Recommendation Systems
# Find users with similar preferences

# Example 3: Document Classification
# Find similar documents to classify new ones

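class RecommendationSystem:
    """Simple user-similarity recommendation using KNN"""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        self.knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        self.item_ids = None

    def train(self, user_embeddings, item_ids):
        """Store one (user embedding, item id) pair per interaction"""
        self.item_ids = np.asarray(item_ids)
        self.knn.fit(user_embeddings, self.item_ids)

    def recommend(self, new_user_embedding, top_k=3):
        """Recommend the items of the users most similar to the new user"""
        distances, indices = self.knn.kneighbors(
            [new_user_embedding], n_neighbors=self.n_neighbors
        )
        # kneighbors returns row indices into the training data,
        # so map them back to item ids (duplicates are possible)
        return self.item_ids[indices[0][:top_k]]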


---
