## Why Data Splitting Matters

In supervised machine learning, the goal is to build a model that learns how to correctly connect inputs to outputs. The inputs are often called **features** or **predictors**, while the outputs are known as **targets** or **responses**.

How you measure a model’s performance depends on the type of problem you’re solving.  
- For **regression** tasks, performance is usually measured using metrics like **R²**, **root mean square error (RMSE)**, or **mean absolute error (MAE)**.  
- For **classification** problems, common metrics include **accuracy**, **precision**, **recall**, and the **F1 score**.
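
To make these metrics concrete, here is a minimal sketch that computes them with the Python standard library. The function names `regression_metrics` and `classification_metrics` and the sample values are illustrative assumptions, not part of any particular library. The classification metrics assume a binary problem in which the label `1` is the positive class.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot  # assumes the targets are not all identical
    return {"r2": r2, "rmse": rmse, "mae": mae}


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up values, used only to exercise the functions.
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(reg)
print(clf)
```

In practice you would normally rely on a tested library implementation rather than hand-written formulas. The sketch is only meant to show what each metric measures.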

There’s no single *perfect* value for these metrics. What counts as good performance varies widely with the industry and use case. Many resources explain these metrics in detail, but the most important thing to remember is this:

> **A model must be evaluated fairly to be trusted.**

You can’t accurately judge a model’s performance using the same data it was trained on, because the model has already *seen* that data. Scores computed on the training data are therefore overly optimistic. Instead, the model should be tested on **new, unseen data** to understand how well it will perform in real-world situations.

That’s why we split our dataset before training. One part is used to **train** the model, and another part is kept aside to **test** it. This separation ensures that performance metrics reflect the model’s true predictive ability, not just its ability to memorize the training data.
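
The split itself can be sketched in a few lines. The helper `split_train_test`, the 25% test fraction, and the toy dataset below are assumptions made for illustration. Libraries such as scikit-learn offer a ready-made equivalent (`train_test_split`) that you would normally use instead.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset of (feature, target) pairs.
data = [(x, 2 * x + 1) for x in range(20)]
train, test = split_train_test(data, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

Shuffling before slicing matters when the rows are stored in a meaningful order, such as sorted by target. Without it, the test set could contain only one kind of example.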


## Training, Validation, and Test Sets

Splitting your dataset is a key step in making sure your model’s performance is evaluated fairly and realistically. In many machine learning projects, the dataset is divided, usually at random, into **three parts**:

### Training Set
The **training set** is used to teach the model. This is where the model learns patterns in the data by adjusting its internal parameters.  
For example, training determines the weights or coefficients of models such as **linear regression**, **logistic regression**, and **neural networks**.

### Validation Set
The **validation set** is used to compare candidate versions of the model while you tune it. This is especially useful during **hyperparameter tuning**.  
For instance, if you’re deciding how many neurons to use in a neural network or which kernel works best for a support vector machine, you try different options. For each option, the model is trained on the training set and evaluated using the validation set to see which configuration performs best.
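
As a rough sketch of that loop, the example below tunes a single hyperparameter: the number of neighbors `k` in a small hand-written k-nearest-neighbors regressor. The model, the synthetic data, the 80/40 split, and the candidate values are all assumptions chosen to keep the example dependency-free. The same pattern applies to neuron counts or SVM kernels.

```python
import random


def knn_predict(train_rows, x, k):
    """Predict by averaging the targets of the k training points nearest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mse_for_k(train_rows, eval_rows, k):
    """Mean squared error on eval_rows for a k-NN model 'trained' on train_rows."""
    return sum((y - knn_predict(train_rows, x, k)) ** 2 for x, y in eval_rows) / len(eval_rows)


# Synthetic noisy data; the rows are generated in random order already.
rng = random.Random(0)
data = []
for _ in range(120):
    x = rng.uniform(-5, 5)
    data.append((x, x * x + rng.gauss(0, 4.0)))

train, validation = data[:80], data[80:]

# Try each candidate on the validation set, never on the test set.
candidate_ks = [1, 3, 5, 9, 15, 31]
validation_mse = {k: mse_for_k(train, validation, k) for k in candidate_ks}
best_k = min(validation_mse, key=validation_mse.get)
print("best k by validation MSE:", best_k)
```

Only after `best_k` is chosen would the model be scored once on a separate test set. The validation score of the winning configuration is itself slightly optimistic, because it was used to make the choice.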

### Test Set
The **test set** is reserved for the final evaluation of the model. It provides an unbiased measure of how well the model performs on completely new data.  
It should **not** be used to fit the model or to choose hyperparameters, because doing so would leak information into the evaluation and make the final score overly optimistic.

In simpler scenarios—where hyperparameter tuning isn’t needed—it’s often acceptable to work with just **training** and **test** sets.
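
A three-way split along the lines described in this section might look like the sketch below. The helper name `split_three_ways` and the 60/20/20 proportions are assumptions. Suitable proportions depend on how much data you have.

```python
import random


def split_three_ways(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and return (train, validation, test) lists."""
    if val_fraction <= 0 or test_fraction <= 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]                       # held out until the very end
    validation = shuffled[n_test:n_test + n_val]   # used while tuning
    train = shuffled[n_test + n_val:]              # used to fit the model
    return train, validation, test


samples = list(range(100))
train, validation, test = split_three_ways(samples, seed=7)
print(len(train), len(validation), len(test))
```

Each row ends up in exactly one of the three sets, which keeps the test set untouched by both training and tuning.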

---

## Underfitting and Overfitting

Comparing a model’s scores on the training data and on held-out data also helps identify two common modeling problems: **underfitting** and **overfitting**.

### Underfitting
**Underfitting** occurs when a model is too simple to capture the underlying patterns in the data.  
For example, using a linear model to describe a clearly nonlinear relationship can lead to underfitting. These models usually perform poorly on both the training and test datasets because they fail to learn meaningful patterns.

### Overfitting
**Overfitting** happens when a model is too complex and learns not only the true patterns in the data but also the noise.  
Such models often perform extremely well on the training data but fail to generalize to new, unseen data. As a result, their performance on the test set is usually much worse.
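
Both failure modes show up when you compare training and test error. The sketch below uses synthetic data from an assumed quadratic relationship with noise. It fits a straight line, which is too simple for this data, and a 1-nearest-neighbor model, which memorizes every training point including its noise. Both models and the data are illustrative choices, not a prescribed method.

```python
import random

rng = random.Random(1)


def make_rows(n):
    """Generate n (x, y) pairs from y = x^2 plus Gaussian noise."""
    rows = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        rows.append((x, x * x + rng.gauss(0, 1.0)))
    return rows


def mse(predict, rows):
    """Mean squared error of a prediction function on rows."""
    return sum((y - predict(x)) ** 2 for x, y in rows) / len(rows)


def fit_line(rows):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


train, test = make_rows(60), make_rows(60)

# Too simple: a line cannot follow the curve, so both errors stay high.
line_model = fit_line(train)


# Too flexible: returns the target of the single closest training point.
def memorizing_model(x):
    return min(train, key=lambda row: abs(row[0] - x))[1]


underfit_train, underfit_test = mse(line_model, train), mse(line_model, test)
overfit_train, overfit_test = mse(memorizing_model, train), mse(memorizing_model, test)
print("line: train", underfit_train, "test", underfit_test)
print("1-NN: train", overfit_train, "test", overfit_test)
```

A high error on both sets points to underfitting. A training error near zero combined with a clearly higher test error points to overfitting.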

---

Splitting your dataset properly helps you balance learning and generalization, ensuring that your model performs well not just on known data, but also in real-world scenarios.
