## Understand the Concept

### Train/Test Split
- You divide your dataset into Training data (to teach the model) and Testing data (to check how well it performs on unseen data).
- Common split: 80% training, 20% testing.

### Baseline Model

- A very simple model to set a reference point.
- For regression → predict the mean of the target.
- For classification → predict the most frequent class.
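
The two baseline rules above can be sketched with the Python standard library alone. The toy values and the names `house_prices` and `fruit_labels` are made up for illustration and are not part of the original notebook.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy targets, invented for this illustration only
house_prices = [200.0, 250.0, 300.0, 250.0]                  # regression target
fruit_labels = ["apple", "pear", "apple", "apple", "plum"]   # classification target

# Regression baseline: always predict the mean of the training target
mean_prediction = mean(house_prices)

# Classification baseline: always predict the most frequent class
majority_class = Counter(fruit_labels).most_common(1)[0][0]

print("Regression baseline predicts:", mean_prediction)
print("Classification baseline predicts:", majority_class)
```

Neither rule looks at the features at all, which is what makes them useful as a floor for comparison.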

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error

## Step-by-Step Breakdown

### train_test_split function

- Comes from sklearn.model_selection.
- It splits arrays (X and y) into training and testing subsets.

### Inputs
- X → your features (independent variables, e.g., age, salary, height).
- y → your labels/targets (what you want to predict, e.g., price, class).
- test_size=0.2 → means 20% of the data goes to testing and the remaining 80% goes to training.
- random_state=42 → ensures the split is reproducible. If you run the code again, you’ll get the same train/test sets every time (useful for consistency).

### Outputs
- X_train → the features used for training (80% of X).
- X_test → the features used for testing (20% of X).
- y_train → the labels corresponding to X_train.
- y_test → the labels corresponding to X_test.
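
As a small sketch of the inputs and outputs listed above, the snippet below splits a toy array and checks two things: the 80/20 sizes, and that repeating the call with the same `random_state` returns the same rows. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, not data from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 20% of 10 samples -> 2 test rows and 8 training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the split with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2), np.array_equal(y_te, y_te2))
```

Features and labels are shuffled together, so each row of `X_te` still lines up with its label in `y_te`.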

In [2]:
from sklearn.datasets import load_iris

data = load_iris()
X = data.data    # features
y = data.target  # labels

print(X,y)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 ...

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## What is a Baseline in Machine Learning?

- A baseline is a very simple prediction method that does not use any complex ML algorithm.
- It gives you a reference point to compare your real model against.

**Think of it like:**
👉 “If I did something extremely simple, how well would I perform? Can my ML model do better than this?”

## Why Do We Calculate a Baseline?

- Reference Performance
    - Helps you know the minimum standard your model should beat.
    - If your ML model performs worse than the baseline, it adds nothing over the simple rule.
- Detect Problems Early
    - If a baseline is already very good, maybe you don’t need a complex model.
    - If your model can’t beat the baseline → maybe your data or features need improvement.
- Saves Time
    - No need to waste hours tuning complex models if a simple rule already works.
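
To make the "can my model beat the baseline?" check concrete, here is a minimal sketch that uses scikit-learn's built-in `DummyClassifier` as the baseline on the Iris data. The choice of `LogisticRegression` as the "real" model is an assumption made for illustration; any classifier could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Candidate model (an illustrative choice, not prescribed by this notebook)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_acc = accuracy_score(y_te, model.predict(X_te))

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:   ", model_acc)
print("Model beats baseline:", model_acc > baseline_acc)
```

If the last line printed `False`, that would be a signal to revisit the data or features before tuning the model further.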

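In [9]:
import numpy as np

# Run this cell right after the Iris split in In [3], so that y_train and
# y_test hold the integer Iris labels (0, 1, 2).
# Note: .astype(int) is a no-op for class labels, but it silently truncates
# continuous targets, so it will not warn you if y_train is a regression target.
majority_class = np.bincount(y_train.astype(int)).argmax()
y_pred_baseline = [majority_class] * len(y_test)

baseline_acc = accuracy_score(y_test.astype(int), y_pred_baseline)
print("Baseline Accuracy:", baseline_acc)


Baseline Accuracy: 0.40794573643410853

The value shown here was produced after the regression cell (In [8]) had already replaced `y_train` and `y_test` with the California housing targets. It therefore reflects 4,128 housing test rows truncated to integers, not the 30 Iris test rows. Rerun the cell directly after In [3] to get the Iris baseline.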


In [8]:
# Example: California housing dataset (a regression task)
import numpy as np
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: predict the mean
mean_value = y_train.mean()
y_pred_baseline = [mean_value] * len(y_test)

# Evaluate RMSE
baseline_mse = mean_squared_error(y_test, y_pred_baseline)
baseline_rmse = np.sqrt(baseline_mse)
print("Baseline RMSE:", baseline_rmse)


Baseline RMSE: 1.1448563543099792
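
The same mean baseline is available as scikit-learn's `DummyRegressor`. The sketch below compares it with a fitted model on synthetic data, so it runs without downloading a dataset. The data-generating rule (`y ≈ 3x` plus noise) and the use of `LinearRegression` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): one feature, target is roughly 3 * x plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))

# Candidate model (an illustrative choice)
model = LinearRegression().fit(X_tr, y_tr)
model_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print("Baseline RMSE:", baseline_rmse)
print("Model RMSE:   ", model_rmse)
```

A model is only worth keeping if its RMSE comes out clearly below the baseline RMSE on the same test set.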
