### 4.2.9 Baseline Models

Before developing complex predictive models, it is important to establish a **baseline performance**.  
A **baseline model** provides a simple point of comparison for evaluating whether more advanced models actually improve performance.

Baseline predictions are evaluated using the **same metrics** you would apply to your main model —  
for example, **Classification Accuracy** for classification tasks or **RMSE** for regression tasks.
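
As a minimal sketch of those two metrics, the functions below take the actual and predicted values as equal-length lists. The names `accuracy_metric` and `rmse_metric` are illustrative and do not appear elsewhere in this notebook.

```python
from math import sqrt

def accuracy_metric(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

def rmse_metric(actual, predicted):
    """Return the root mean squared error between two numeric lists."""
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared_error / len(actual))

print(accuracy_metric([0, 1, 1, 0], [0, 1, 0, 0]))  # 75.0
print(rmse_metric([3.0, 5.0], [2.0, 6.0]))          # 1.0
```

Higher accuracy is better, while lower RMSE is better, so "beating the baseline" means different directions for the two metrics.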

A good baseline helps answer the question:

> *“Is my model performing better than a simple or random approach?”*

---

### 🔹 Common Baseline Algorithms

1. **Random Prediction Algorithm**  
   - Generates predictions **randomly**: for classification, it picks among the class values seen in training (either uniformly or in proportion to how often each class occurs); for regression, it picks random values.  
   - Serves as a *minimum performance threshold*.  
   - Example: randomly predicting “spam” or “not spam” with equal probability.

2. **Zero Rule (ZeroR) Algorithm**  
   - A very simple method that **always predicts the most frequent class** (for classification) or the **mean value** (for regression).  
   - Provides a strong and interpretable baseline.  
   - Example: always predicting “not spam” if 80% of training samples are “not spam”.

---

### 🔹 Purpose of a Baseline

- Establishes a **minimum expected performance**.  
- Helps identify whether a new model provides **meaningful improvement**.  
- Acts as a **diagnostic tool** — if your model cannot outperform the baseline, it needs revisiting.
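
One way to apply that diagnostic check is sketched below. The helper `compare_to_baseline`, the toy labels, and the stand-in model predictions are all hypothetical; they only illustrate scoring a model and a majority-class baseline with the same metric on the same test labels.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_class_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every test row."""
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * n_predictions

def compare_to_baseline(train_labels, test_labels, model_predictions):
    """Return (baseline_score, model_score, model_beats_baseline)."""
    baseline_predictions = majority_class_baseline(train_labels, len(test_labels))
    baseline_score = accuracy(test_labels, baseline_predictions)
    model_score = accuracy(test_labels, model_predictions)
    return baseline_score, model_score, model_score > baseline_score

# Toy data (made up for illustration): 80% of training labels are "not spam".
train_labels = ["not spam"] * 8 + ["spam"] * 2
test_labels = ["not spam", "not spam", "spam", "not spam", "spam"]
# Stand-in predictions from some hypothetical model.
model_predictions = ["not spam", "not spam", "spam", "not spam", "not spam"]

print(compare_to_baseline(train_labels, test_labels, model_predictions))
# (0.6, 0.8, True)
```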

---

**In summary:**  
A baseline model may be simple, but it is an **essential first step** in building and validating any predictive modeling pipeline.


In [1]:
from random import seed, randrange

In [5]:
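def random_algorithm(train, test):
    """Predict a randomly chosen class label for every row in `test`."""
    # Collect the distinct output values (last column) seen in training.
    output_values = [row[-1] for row in train]
    unique = list(set(output_values))
    predicted = list()
    for _ in range(len(test)):
        # Each distinct class is equally likely, whatever its training frequency.
        index = randrange(len(unique))
        predicted.append(unique[index])
    return predicted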

In [8]:
seed(1)
train = [[0], [1], [0], [1], [0], [1]]
test = [[None], [None], [None], [None]]
predictions = random_algorithm(train, test)
print(predictions)

[0, 0, 1, 0]
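
The `random_algorithm` function above picks uniformly among the distinct classes. A variant that instead samples in proportion to class frequency could look like the sketch below; the name `weighted_random_algorithm` and the 90/10 toy data are assumptions made for illustration.

```python
from random import seed, choice

def weighted_random_algorithm(train, test):
    """Predict by sampling training labels, so frequent classes are drawn more often."""
    output_values = [row[-1] for row in train]
    # choice() picks one element uniformly from the full label list (duplicates
    # included), which is equivalent to sampling classes by their frequency.
    return [choice(output_values) for _ in test]

seed(1)
train = [[0]] * 90 + [[1]] * 10   # 90% class 0, 10% class 1
test = [[None]] * 1000
predictions = weighted_random_algorithm(train, test)
print(predictions.count(0) / len(predictions))  # close to 0.9
```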


### Zero Rule Algorithm

The **Zero Rule Algorithm (ZeroR)** is a simple baseline method that uses the most frequent class or average value to make predictions. It provides a more informed baseline than the random prediction algorithm.

#### Classification
For classification problems:
- ZeroR predicts the **most common class** in the training dataset.
- Example: If a dataset has 90 instances of class 0 and 10 of class 1, ZeroR will always predict class 0.
- Baseline accuracy = 90 / 100 = **90%**
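- This is higher than the expected accuracy of random guessing that follows the class distribution: (0.9 × 0.9) + (0.1 × 0.1) = **82%**. Guessing each class with equal probability would score only about 50%.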
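
The arithmetic behind those figures can be checked with a short sketch. It assumes the 90/10 split from the example and a test set with the same class proportions; the function names are illustrative.

```python
def zero_rule_expected_accuracy(class_proportions):
    """ZeroR is right whenever the true class is the majority class."""
    return max(class_proportions)

def weighted_random_expected_accuracy(class_proportions):
    """Guessing class k with probability p_k is right with probability sum(p_k ** 2)."""
    return sum(p * p for p in class_proportions)

def uniform_random_expected_accuracy(class_proportions):
    """Guessing every class with equal probability is right 1 / (number of classes) of the time."""
    return 1.0 / len(class_proportions)

proportions = [0.9, 0.1]
print(round(zero_rule_expected_accuracy(proportions), 2))        # 0.9
print(round(weighted_random_expected_accuracy(proportions), 2))  # 0.82
print(round(uniform_random_expected_accuracy(proportions), 2))   # 0.5
```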

#### Key Idea
ZeroR creates **one simple rule** using the target distribution — a strong baseline for comparing more complex models.


In [7]:
def zero_rule_algorithm_classification(train, test):
    """Predict the most frequent training class for every row in `test`."""
    output_values = [row[-1] for row in train]
    # The class with the highest count among the training labels.
    prediction = max(set(output_values), key=output_values.count)
    predicted = [prediction for _ in range(len(test))]
    return predicted

In [8]:
from random import seed

seed(1)

train1 = [
    [1, 0],
    [2, 0],
    [3, 1],
    [4, 0]
]
test1 = [
    [5, None],
    [6, None]
]
print("Test 1:", zero_rule_algorithm_classification(train1, test1))

Test 1: [0, 0]


### Zero Rule Algorithm — Regression

For **regression problems**, the Zero Rule Algorithm (ZeroR) predicts a constant real value — typically the **mean** (average) of all observed target values in the training data.

#### How It Works
- Calculates the mean of the output values, where $n$ is the number of training values:
  
  $$
  \text{mean} = \frac{1}{n} \sum_{i=1}^{n} \text{value}_i
  $$

- This mean value is then used as the prediction for all inputs.
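- Using the mean (or sometimes the median) provides a strong baseline: among all constant predictions, the mean gives the lowest squared error (and so the lowest RMSE) on the training data, while the median gives the lowest absolute error.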
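
The sketch below scores a few constant predictions on a made-up list of target values, to show how the RMSE changes with the constant chosen. The data and the helper `rmse_of_constant` are hypothetical.

```python
from math import sqrt

def rmse_of_constant(values, constant):
    """RMSE obtained when every prediction equals `constant`."""
    return sqrt(sum((v - constant) ** 2 for v in values) / len(values))

targets = [10, 20, 30, 40, 100]           # made-up target values
mean_value = sum(targets) / len(targets)  # 40.0

# Print each candidate constant next to the RMSE it produces.
for candidate in (0, 30, mean_value, 60, 100):
    print(candidate, round(rmse_of_constant(targets, candidate), 2))
```

On this data the mean scores lowest among the candidates tried, which is why it is the usual choice when RMSE is the evaluation metric.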

#### Key Idea
ZeroR for regression predicts the **central tendency** (mean or median) of the target variable — offering a simple yet effective performance benchmark for evaluating advanced models.


In [1]:
def zero_rule_algorithm_regression_mean(train, test):
    """Predict the mean training output for every row in `test`."""
    output_values = [row[-1] for row in train]
    prediction = sum(output_values) / float(len(output_values))
    predicted = [prediction for _ in range(len(test))]
    return predicted

In [3]:
# ✅ Test Case 1: Simple increasing values
train1 = [
    [1, 10],
    [2, 20],
    [3, 30]
]
test1 = [
    [4, None],
    [5, None]
]
print("Test 1:", zero_rule_algorithm_regression_mean(train1, test1))
# Expected mean = (10 + 20 + 30) / 3 = 20
# Output → [20.0, 20.0]


# ✅ Test Case 2: Mixed positive and negative values
train2 = [
    [1, -5],
    [2, 5],
    [3, 15]
]
test2 = [
    [4, None],
    [5, None],
]
print("Test 2:", zero_rule_algorithm_regression_mean(train2, test2))
# Expected mean = (-5 + 5 + 15) / 3 = 5.0
# Output → [5.0, 5.0]


# ✅ Test Case 3: Decimal (float) values
train3 = [
    [1, 2.5],
    [2, 3.5],
    [3, 4.5]
]
test3 = [
    [4, None]
]
print("Test 3:", zero_rule_algorithm_regression_mean(train3, test3))
# Expected mean = (2.5 + 3.5 + 4.5) / 3 = 3.5
# Output → [3.5]


# ✅ Test Case 4: All same output values
train4 = [
    [1, 100],
    [2, 100],
    [3, 100]
]
test4 = [
    [4, None],
    [5, None],
    [6, None]
]
print("Test 4:", zero_rule_algorithm_regression_mean(train4, test4))
# Expected mean = 100.0
# Output → [100.0, 100.0, 100.0]


# ✅ Test Case 5: Single training example
train5 = [
    [1, 42]
]
test5 = [
    [2, None],
    [3, None]
]
print("Test 5:", zero_rule_algorithm_regression_mean(train5, test5))
# Expected mean = 42
# Output → [42.0, 42.0]

Test 1: [20.0, 20.0]
Test 2: [5.0, 5.0]
Test 3: [3.5]
Test 4: [100.0, 100.0, 100.0]
Test 5: [42.0, 42.0]


### Central Tendency

**Central tendency** is a statistical concept that describes the **center or typical value** of a dataset — the point around which most data values cluster.  
It helps summarize a large set of numbers with a **single representative value**.

---

### 🔹 The Three Main Measures of Central Tendency

| **Measure** | **Description** | **When to Use** | **Example** |
|--------------|-----------------|-----------------|--------------|
| **Mean** | The arithmetic average — sum of all values divided by their count. | When data is roughly symmetric (no extreme outliers). | (2 + 4 + 6 + 8) / 4 = **5** |
| **Median** | The middle value when data is sorted. | When data has outliers or is skewed. | [2, 4, 100] → **4** |
| **Mode** | The most frequent value in the dataset. | When data is categorical or has repeating values. | [2, 2, 3, 5] → **2** |

---

### 📊 Example

Suppose we have data:  
`[10, 15, 15, 20, 35, 50, 50, 50, 100]`

| **Measure** | **Calculation** | **Result** |
|--------------|-----------------|-------------|
| **Mean** | (10 + 15 + 15 + 20 + 35 + 50 + 50 + 50 + 100) / 9 | **38.3** |
| **Median** | Middle value (5th in sorted list) | **35** |
| **Mode** | Most frequent value | **50** |
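
Python's standard `statistics` module provides all three measures, so the table can be checked with a short sketch:

```python
from statistics import mean, median, mode

data = [10, 15, 15, 20, 35, 50, 50, 50, 100]

print(round(mean(data), 1))  # 38.3
print(median(data))          # 35
print(mode(data))            # 50
```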

---

In [15]:
def zero_rule_algorithm_regression_median(train, test):
    """Return the median training output as a single value.

    Unlike the mean version, this returns one number rather than a list with
    one prediction per row in `test`; `test` is accepted but not used.
    """
    output_values = sorted(row[-1] for row in train)
    middle = len(output_values) // 2
    if len(output_values) % 2 == 0:
        # Even count: average the two middle values (always a float).
        return (output_values[middle - 1] + output_values[middle]) / 2
    # Odd count: the middle value itself.
    return output_values[middle]

In [16]:
# ---------- TEST CASES ----------

# 1️⃣ Test with odd number of values
train1 = [[1], [3], [5], [7], [9]]
test1 = [[2], [4], [6]]
print("Test 1 (Odd count):", zero_rule_algorithm_regression_median(train1, test1))
# Expected median = 5

# 2️⃣ Test with even number of values
train2 = [[2], [3], [4], [6], [8], [10]]
test2 = [[0], [1]]
print("Test 2 (Even count):", zero_rule_algorithm_regression_median(train2, test2))
# Expected median = (4 + 6) / 2 = 5.0

# 3️⃣ Test with all identical values
train3 = [[5], [5], [5], [5]]
test3 = [[10], [20]]
print("Test 3 (Identical values):", zero_rule_algorithm_regression_median(train3, test3))
# Expected median = (5 + 5) / 2 = 5.0

# 4️⃣ Test with negative numbers
train4 = [[-10], [-5], [0], [5], [10]]
test4 = [[1], [2]]
print("Test 4 (Negative values):", zero_rule_algorithm_regression_median(train4, test4))
# Expected median = 0

# 5️⃣ Test with single value
train5 = [[42]]
test5 = [[1], [2]]
print("Test 5 (Single value):", zero_rule_algorithm_regression_median(train5, test5))
# Expected median = 42

# 6️⃣ Test with unsorted input
train6 = [[9], [1], [8], [2], [7], [3]]
test6 = [[0]]
print("Test 6 (Unsorted data):", zero_rule_algorithm_regression_median(train6, test6))
# Expected median = (3 + 7) / 2 = 5.0


Test 1 (Odd count): 5
Test 2 (Even count): 5.0
Test 3 (Identical values): 5.0
Test 4 (Negative values): 0
Test 5 (Single value): 42
Test 6 (Unsorted data): 5.0


In [17]:
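def zero_rule_algorithm_regression_mode(train, test):
    """Return the most frequent training output as a single value.

    `test` is accepted but not used. On a tie, the result depends on the
    iteration order of `set`, so either tied value may be returned.
    """
    output_values = [row[-1] for row in train]
    unique_values = list(set(output_values))

    # Start with the first candidate and keep whichever value occurs most often.
    mode = unique_values[0]
    count = output_values.count(mode)
    for v in unique_values:
        c = output_values.count(v)
        if c > count:
            mode = v
            count = c
    return mode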

In [18]:
# 1️⃣ Single mode (most frequent value)
train1 = [[1], [2], [2], [3], [4]]
test1 = [[0], [1]]
print("Test 1 (Single mode):", zero_rule_algorithm_regression_mode(train1, test1))
# Expected mode = 2

# 2️⃣ Multiple modes (tie) — function may return any of them (depending on set order)
train2 = [[1], [1], [2], [2], [3]]
test2 = [[10]]
print("Test 2 (Multiple modes):", zero_rule_algorithm_regression_mode(train2, test2))
# Expected mode = 1 or 2 (both appear twice)

# 3️⃣ All identical values
train3 = [[5], [5], [5], [5]]
test3 = [[7]]
print("Test 3 (All identical):", zero_rule_algorithm_regression_mode(train3, test3))
# Expected mode = 5

# 4️⃣ Negative values
train4 = [[-3], [-1], [-1], [0], [2], [2], [2]]
test4 = [[1]]
print("Test 4 (Negative + positive):", zero_rule_algorithm_regression_mode(train4, test4))
# Expected mode = 2

# 5️⃣ Single value in dataset
train5 = [[42]]
test5 = [[5]]
print("Test 5 (Single value):", zero_rule_algorithm_regression_mode(train5, test5))
# Expected mode = 42

# 6️⃣ Unsorted data
train6 = [[9], [1], [9], [3], [9], [3], [1]]
test6 = [[0]]
print("Test 6 (Unsorted data):", zero_rule_algorithm_regression_mode(train6, test6))
# Expected mode = 9

Test 1 (Single mode): 2
Test 2 (Multiple modes): 1
Test 3 (All identical): 5
Test 4 (Negative + positive): 2
Test 5 (Single value): 42
Test 6 (Unsorted data): 9
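
Because `set` ordering is not guaranteed, the tie in Test 2 could in principle resolve either way. One deterministic alternative, sketched here under the assumption of Python 3.7 or later, uses `collections.Counter`, whose `most_common` orders equal counts by first appearance. The function name `mode_first_seen` is hypothetical.

```python
from collections import Counter

def mode_first_seen(values):
    """Return the most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

print(mode_first_seen([1, 1, 2, 2, 3]))  # 1
print(mode_first_seen([2, 2, 1, 1, 3]))  # 2
```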


### Moving Average Algorithm — Time Series Baseline

The **Moving Average Algorithm** is a simple **baseline method** used in **time series forecasting**.  
It predicts the next value by averaging a fixed number of the most recent observations.  
This approach assumes that the **next value will be close to the recent average level** of the series, so its forecasts lag behind when the data trend strongly upward or downward.

---

### 🔹 Concept Overview

In time series problems, data points are ordered by time — for example, daily temperatures, stock prices, or website traffic.  
The **Moving Average (MA)** method helps smooth out short-term fluctuations and highlight long-term trends.

The algorithm works by computing the **average of the last _n_ observations** and using it as the **prediction for the next time step**.

---

### 🧠 Formula

If the last `n` observed values are:

$$
x_{t-n+1}, x_{t-n+2}, \ldots, x_t
$$

Then the moving average prediction for the next value, $\hat{x}_{t+1}$, is:

$$
\hat{x}_{t+1} = \frac{1}{n} \sum_{i=t-n+1}^{t} x_i
$$

---

### 🔸 How It Works

1. Choose a **window size** `n` (the number of past observations to average).  
2. For each time step after the first `n` points:
   - Compute the average of the last `n` values.
   - Use it as the prediction for the next time step.
3. Continue sliding the window forward by one time step at a time.
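
The steps above can be sketched as a walk-forward evaluation, where each window average is compared with the value that actually came next. The names `moving_average_forecasts` and `rmse` and the sample series are illustrative assumptions, not part of the notebook.

```python
from math import sqrt

def moving_average_forecasts(series, window_size):
    """Return (forecast, actual) pairs for every step that has a known actual value."""
    pairs = []
    for t in range(window_size, len(series)):
        window = series[t - window_size:t]    # the n values just before time t
        forecast = sum(window) / window_size  # their average
        pairs.append((forecast, series[t]))   # compare with what happened at t
    return pairs

def rmse(pairs):
    """Root mean squared error over (forecast, actual) pairs."""
    return sqrt(sum((f - a) ** 2 for f, a in pairs) / len(pairs))

series = [10, 12, 14, 13, 11, 9]              # made-up observations
pairs = moving_average_forecasts(series, window_size=3)
print(pairs)  # [(12.0, 13), (13.0, 11), (12.666666666666666, 9)]
print(rmse(pairs))
```

Pairing each forecast with its actual value makes it possible to score the moving average with the same RMSE metric used for other regression baselines.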

---

In [25]:
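def zero_rule_algorithm_regression_moving_average(data, window_size):
    """Return the average of every full window of `window_size` consecutive values.

    Entry k averages data[k : k + window_size] and serves as the forecast for
    index k + window_size; the last entry forecasts the step after the data ends.
    """
    predictions = list()
    for i in range(window_size, len(data) + 1):
        window = data[i - window_size:i]
        predictions.append(sum(window) / float(window_size))
    return predictions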

In [26]:
# 1️⃣ Increasing sequence
data1 = [1, 2, 3, 4, 5, 6]
window_size1 = 3
print("Test 1 (Increasing sequence):", zero_rule_algorithm_regression_moving_average(data1, window_size1))
# Expected = [(1+2+3)/3, (2+3+4)/3, (3+4+5)/3, (4+5+6)/3]
# Expected Output: [2.0, 3.0, 4.0, 5.0]

# 2️⃣ Constant values
data2 = [5, 5, 5, 5, 5]
window_size2 = 2
print("Test 2 (Constant values):", zero_rule_algorithm_regression_moving_average(data2, window_size2))
# Expected = [5.0, 5.0, 5.0, 5.0]

# 3️⃣ Mixed increasing and decreasing
data3 = [10, 12, 14, 13, 11, 9]
window_size3 = 3
print("Test 3 (Mixed data):", zero_rule_algorithm_regression_moving_average(data3, window_size3))
# Expected = [(10+12+14)/3, (12+14+13)/3, (14+13+11)/3, (13+11+9)/3]
# Expected Output: [12.0, 13.0, 12.67, 11.0]

# 4️⃣ Small dataset (window size = 1)
data4 = [7, 8, 9, 10]
window_size4 = 1
print("Test 4 (Window size = 1):", zero_rule_algorithm_regression_moving_average(data4, window_size4))
# Expected = [7.0, 8.0, 9.0, 10.0]
# With a window of 1, each prediction equals the most recent value;
# the final 10.0 is the forecast for the step after the data ends.

# 5️⃣ Window size equals dataset length - 1
data5 = [2, 4, 6, 8]
window_size5 = 3
print("Test 5 (Large window):", zero_rule_algorithm_regression_moving_average(data5, window_size5))
# Expected = [(2+4+6)/3, (4+6+8)/3] = [4.0, 6.0]

# 6️⃣ Floating-point data
data6 = [1.5, 2.5, 3.5, 4.5, 5.5]
window_size6 = 2
print("Test 6 (Float values):", zero_rule_algorithm_regression_moving_average(data6, window_size6))
# Expected = [(1.5+2.5)/2, (2.5+3.5)/2, (3.5+4.5)/2, (4.5+5.5)/2]
# Output: [2.0, 3.0, 4.0, 5.0]

# 7️⃣ Window size larger than dataset (edge case)
data7 = [1, 2, 3]
window_size7 = 5
print("Test 7 (Window > Data length):", zero_rule_algorithm_regression_moving_average(data7, window_size7))
# Expected = [] (no prediction possible)

Test 1 (Increasing sequence): [2.0, 3.0, 4.0, 5.0]
Test 2 (Constant values): [5.0, 5.0, 5.0, 5.0]
Test 3 (Mixed data): [12.0, 13.0, 12.666666666666666, 11.0]
Test 4 (Window size = 1): [7.0, 8.0, 9.0, 10.0]
Test 5 (Large window): [4.0, 6.0]
Test 6 (Float values): [2.0, 3.0, 4.0, 5.0]
Test 7 (Window > Data length): []
