# Machine Learning Fundamentals

Table of contents:
1. Cross-validation

## 1. Cross-validation

Cross-validation is a statistical method used to **estimate the skill** of a machine learning model on independent data, that is, data that was not used to train it.

### The Core Problem It Solves

When training an AI model, you want it to learn the general patterns in your data, not just memorize the specific examples.

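* **Overfitting:** This happens when a model fits the training data too closely, including its noise and outliers, and then performs poorly on new data. Cross-validation helps detect this, because an overfit model scores well on its training data but poorly on held-out data. It does not prevent overfitting by itself, but it tells you when to simplify or regularize the model.
* **Data Usage:** By cycling through different data subsets, every example ends up being used for both training and validation, so no data is permanently set aside. This matters most when the dataset is small.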
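
To make the overfitting point concrete, here is a minimal sketch in plain Python. The "model" is a hypothetical lookup table that only memorizes its training examples, and the toy data and the even/odd split are assumptions chosen for illustration. The gap between the training score and the held-out score is the signal that cross-validation exposes.

```python
def fit_memorizer(train_x, train_y):
    """Hypothetical 'model' that only memorizes: a lookup table of seen examples."""
    return dict(zip(train_x, train_y))

def predict(table, x, default=0):
    # Unseen inputs fall back to a fixed guess, since nothing general was learned.
    return table.get(x, default)

def accuracy(table, xs, ys):
    """Fraction of inputs whose prediction matches the true label."""
    return sum(predict(table, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data (an assumption for this sketch): label is 1 when x >= 10.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]

# Train on the even inputs and hold out the odd ones, much as one
# cross-validation iteration holds out a fold.
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

model = fit_memorizer(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)  # 1.0: every training example is memorized
val_acc = accuracy(model, val_x, val_y)        # 0.5: unseen inputs all get the default guess
print(f"train={train_acc:.2f} validation={val_acc:.2f}")
```

A perfect training score paired with a held-out score no better than guessing is the pattern to look for. Real models rarely show it this starkly.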


### 🛠️ How it Works: The K-Fold Method

The most common form is **k-Fold Cross-Validation**. Here is a step-by-step breakdown:

1.  **Divide the Data:** The entire dataset is randomly split into $k$ segments of (nearly) equal size, called "folds." Common choices for $k$ are 5 and 10.
2.  **Iterate and Train:** The process repeats $k$ times, once per fold. In each iteration:
    * One fold is set aside to be the test/validation set.
    * The remaining $k-1$ folds are combined to form the training set.
    * The model is trained on the training set and then evaluated on the holdout test set.
3.  **Aggregate Results:** After all $k$ iterations are complete (and every fold has been used exactly once as the test set), the $k$ evaluation scores (e.g., accuracy, error rate) are averaged.

$$\text{Model Performance} = \frac{\text{Score}_1 + \text{Score}_2 + \dots + \text{Score}_k}{k}$$

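Because it does not depend on one particular split, this final average is a *more stable* estimate of the model's predictive power on unseen data than a single train/test split provides. It is not perfectly unbiased: each model is trained on only $(k-1)/k$ of the data, so the estimate tends to be slightly pessimistic.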
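
The three steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the helper names (`k_fold_indices`, `fit_centroids`, `accuracy`, `cross_validate`), the toy 1-D data, and the nearest-centroid model are all assumptions made for the example. In practice, libraries such as scikit-learn provide ready-made equivalents (for example `KFold` and `cross_val_score` in `sklearn.model_selection`).

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Step 1: shuffle the indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def fit_centroids(train_x, train_y):
    """Hypothetical toy model: the mean x of each class (1-D nearest centroid)."""
    centroids = {}
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = sum(members) / len(members)
    return centroids

def accuracy(centroids, test_x, test_y):
    """Fraction of test points whose nearest centroid matches the true label."""
    hits = 0
    for x, y in zip(test_x, test_y):
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        hits += predicted == y
    return hits / len(test_x)

def cross_validate(xs, ys, k, seed=0):
    """Steps 2 and 3: train on k-1 folds, score on the held-out fold, k times."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i goes into the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_centroids([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(accuracy(model, [xs[j] for j in test_idx], [ys[j] for j in test_idx]))
    return scores

# Toy data (an assumption for this sketch): label is 1 when x >= 10.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]

scores = cross_validate(xs, ys, k=5)
cv_score = sum(scores) / len(scores)  # the average from the formula above
print("per-fold scores:", scores)
print("cross-validated accuracy:", cv_score)
```

Printing the per-fold scores alongside the average is a deliberate choice: a large spread between folds suggests the estimate is sensitive to how the data was split.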


### 📋 Common Types of Cross-Validation

| Technique | Description | Best Used When... |
| :--- | :--- | :--- |
| **k-Fold CV** | Splits data into $k$ folds, cycling through each as the test set. | Most general use; for a balanced and robust estimate. |
| **Leave-One-Out CV (LOOCV)** | $k$ equals the number of data points ($n$): each iteration trains on $n-1$ points and tests on the remaining one. | The dataset is very small; training $n$ models is computationally expensive on large datasets. |
| **Stratified k-Fold CV** | Ensures each fold has the same proportion of classes as the entire dataset. | Dealing with **imbalanced datasets** (e.g., rare disease detection). |
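
The difference between the techniques in the table comes down to how indices are assigned to folds. The sketch below uses hypothetical helpers (`contiguous_folds`, `stratified_folds`, `leave_one_out_folds`) and a made-up imbalanced label list. It compares plain k-fold without shuffling against a simple round-robin form of stratification. Library implementations such as scikit-learn's `StratifiedKFold` are more careful about remainders and shuffling.

```python
from collections import defaultdict

def contiguous_folds(n_samples, k):
    """Plain k-fold without shuffling: k consecutive blocks of indices."""
    size, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + size + (1 if i < extra else 0)
        folds.append(list(range(start, stop)))
        start = stop
    return folds

def stratified_folds(labels, k):
    """Deal each class's indices round-robin so every fold keeps the class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def leave_one_out_folds(n_samples):
    """LOOCV is k-fold with k = n: every fold holds exactly one index."""
    return [[i] for i in range(n_samples)]

# Made-up imbalanced data: 4 positives (10%) followed by 36 negatives.
labels = [1] * 4 + [0] * 36

plain_counts = [sum(labels[i] for i in fold) for fold in contiguous_folds(len(labels), 4)]
strat_counts = [sum(labels[i] for i in fold) for fold in stratified_folds(labels, 4)]

print("positives per fold, plain:     ", plain_counts)   # [4, 0, 0, 0]
print("positives per fold, stratified:", strat_counts)   # [1, 1, 1, 1]
print("number of LOOCV folds:", len(leave_one_out_folds(len(labels))))  # 40
```

With the plain split, three of the four test folds contain no positive examples at all, so their scores say nothing about the rare class. Shuffling before splitting makes that less likely but does not rule it out, which is why stratification is preferred for imbalanced data.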


### Key Takeaways

Cross-validation is a critical tool for any AI practitioner because it:

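* **Estimates Generalization:** Gives you an estimate of how the model will perform on data it has not seen, rather than on the data it was trained on.
* **Supports Hyperparameter Tuning:** It's used to compare different models or to find the best settings (hyperparameters) for a single model, by picking the candidate with the best cross-validated score.
* **Reduces Dependence on the Split:** Averaging over $k$ splits means the reported performance doesn't hinge on a single, lucky (or unlucky) random split of the data.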
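
As an illustration of the hyperparameter use case, the sketch below picks the number of neighbours for a toy 1-D nearest-neighbour classifier by comparing cross-validated accuracy. The data, the candidate values, and the helper names (`knn_predict`, `cv_accuracy`) are assumptions made for this example. Libraries such as scikit-learn offer `GridSearchCV` for the same purpose.

```python
import random

def knn_predict(train, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x (1-D)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def cv_accuracy(data, n_neighbors, k=5, seed=0):
    """Mean accuracy of the k-NN classifier over k shuffled folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    fold_scores = []
    for i, test in enumerate(folds):
        # Train on every fold except fold i, then score on fold i.
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        correct = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / k

# Made-up noisy data: label is 1 for x >= 15, with four labels flipped.
data = [(x, int(x >= 15)) for x in range(30)]
for flipped in (3, 11, 20, 26):
    x, y = data[flipped]
    data[flipped] = (x, 1 - y)

# The same seed gives every candidate the same folds, so scores are comparable.
candidates = [1, 3, 5, 7]
cv_scores = {n: cv_accuracy(data, n) for n in candidates}
best_n = max(candidates, key=cv_scores.get)
print(cv_scores, "-> best n_neighbors:", best_n)
```

Because the winning setting was chosen using these scores, its cross-validated score is an optimistic estimate. A separate held-out test set (or nested cross-validation) gives a fairer final number.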