* In machine learning our goal is to discover `patterns`.
* But how can we be sure we have truly discovered a `general` pattern and not simply memorized our data?
* The problem of discovering patterns that hold beyond the data we trained on is called `generalization`.
* Whenever we work with finite samples of data, we must keep in mind the risk that we might fit our training data, only to discover that we have failed to find a generalizable pattern.
* The phenomenon of fitting closer to our training data than to the underlying distribution is called `overfitting`, and techniques for combating overfitting are often called `regularization` methods.

## 1.1 Training Error and Generalization Error.

* The training error $R_{emp}$ is a statistic calculated on the training dataset, while the generalization error $R$ is an expectation taken with respect to the underlying distribution.
* The generalization error can be thought of as what you would see if you applied your model to an infinite stream of additional data examples drawn from the same underlying data distribution.
* The training error is expressed as a sum:
$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y^{(i)}, f(x^{(i)}))$

* where:
    1. $n$ is the total number of samples in your training dataset.
    2. $x^{(i)}$ - the input features for the $i$-th observation.
    3. $y^{(i)}$ - the true label or target value for the $i$-th observation.
    4. $f(x^{(i)})$ - the model's prediction for the $i$-th observation.
    5. $L$ - the loss function, which measures the discrepancy between the prediction and the true label.

* The generalization error, in contrast, is expressed as an integral:
$R[p, f] = \mathbb{E}_{(x,y) \sim P} [L(y, f(x))] = \iint L(y, f(x)) \, p(x, y) \, dx \, dy$
   
* A problem arises because we cannot calculate the generalization error $R$ exactly: nobody ever tells us the precise form of the density function $p(x,y)$.
* Moreover, we cannot sample an infinite stream of data points.
* Thus, in practice, we must `estimate` the generalization error by applying our model to an independent test set consisting of a random selection of examples `X` and labels `y` that were withheld from our training set.
* This consists of applying the same formula used for the empirical training error, but to the test set `(X, y)`.
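
As a rough illustration of the two quantities, the sketch below evaluates the same empirical-risk formula on a training split and on a withheld test split. The data (`y = 2x` plus noise), the squared loss, the one-parameter model, and the names `squared_loss` and `empirical_risk` are all assumptions made for this example.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Per-example squared loss L(y, f(x)) = (y - f(x))^2 (an assumed choice of L)."""
    return (y_true - y_pred) ** 2

def empirical_risk(f, X, y, loss=squared_loss):
    """Average loss of model f over the examples (X, y)."""
    return float(np.mean(loss(y, f(X))))

rng = np.random.default_rng(0)

# Hypothetical data: y = 2x + Gaussian noise.
X = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * X + rng.normal(0.0, 0.1, size=200)

# Withhold a random 25% of the examples as a test set.
perm = rng.permutation(len(X))
test_idx, train_idx = perm[:50], perm[50:]

# Fit f(x) = w * x by least squares, using the training split only.
w = np.sum(X[train_idx] * y[train_idx]) / np.sum(X[train_idx] ** 2)
f = lambda x: w * x

train_error = empirical_risk(f, X[train_idx], y[train_idx])  # R_emp on training data
test_error = empirical_risk(f, X[test_idx], y[test_idx])     # estimate of R
print(f"training error: {train_error:.4f}, test error: {test_error:.4f}")
```

The test error here is only an estimate of the generalization error; it fluctuates with the particular examples that happen to land in the test split.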

## 1.2 Model Complexity.

* With `simple models` and `abundant data`, training error and generalization error are usually close.
* As model complexity increases or data becomes scarce, the training error may decrease while the generalization gap grows. A complex model has more parameters and a more flexible functional form, so it can adjust its parameters to match each of the few training examples closely or even perfectly. It thereby achieves low training error by memorizing the small training set, but its error on new data increases.
* Extremely expressive models can perfectly fit even random labels, making training error alone meaningless for judging generalization.
* Without restrictions on model complexity, fitting training data does not guarantee that a model has learned a generalizable pattern.
* Learning theory draws inspiration from Karl Popper's falsifiability principle: a useful scientific theory must rule out some possibilities, not explain everything.
* Model complexity is not determined only by the number of parameters:
  1. Some models, e.g. kernel methods, have infinitely many parameters but are controlled by other constraints.
  2. One useful notion of complexity is the range of values the parameters are allowed to take, which motivates regularization techniques such as weight decay.
* However, comparing complexity across very different model classes (e.g. decision trees vs. neural networks) is often difficult.
* Low training error does not imply low generalization error, especially for highly expressive models. Such models can fit the training data in many fundamentally different ways, most of which capture noise (e.g. erroneous data points) rather than true structure, and training error cannot tell us which one was chosen.
* However, low training error also does not necessarily imply poor generalization.
* For powerful models like deep neural networks, generalization must be assessed by using holdout data.
* The error measured on this holdout (validation) set is called validation error.
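
The point about random labels can be sketched with a deliberately simple memorizer. The example below uses a 1-nearest-neighbour classifier, which is an assumption chosen for brevity and not a model discussed in these notes; the data are synthetic and the labels are coin flips that carry no relation to the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the features carry no information about the labels.
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)   # labels are pure coin flips
X_val = rng.normal(size=(200, 5))
y_val = rng.integers(0, 2, size=200)

def predict_1nn(X_query):
    """Return the label of the closest training point for each query row."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d2.argmin(axis=1)]

# Each training point is its own nearest neighbour, so it is "memorized".
train_error = float(np.mean(predict_1nn(X_train) != y_train))
val_error = float(np.mean(predict_1nn(X_val) != y_val))
print(f"training error: {train_error:.2f}, validation error: {val_error:.2f}")
```

The training error is zero even though there is nothing to learn, while the validation error stays near chance level. Only the holdout measurement reveals that no generalizable pattern was found.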

## 1.3 Underfitting or Overfitting.

* When comparing training and validation errors, two common situations are important to recognize.
* The first is `underfitting`, which occurs when both the training and validation errors are high and close to each other. Here the model is too simple to capture the underlying pattern. Because the generalization gap is small, there is reason to believe that a more complex model could do better.

* The second is `overfitting`, which occurs when the training error is much lower than the validation error. Here the model fits the training data well but does not generalize effectively.
* The main objective is to minimize the generalization error, not necessarily the gap itself.
* However, if the training error reaches zero, the generalization gap equals the generalization error, and further improvement is possible only by reducing this gap.
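
Writing the generalization gap out explicitly may make the last point easier to see. The identity below is only a rearrangement of the quantities from Section 1.1, with $f$ denoting the trained model:

$R(f) = R_{emp}(f) + \big(R(f) - R_{emp}(f)\big)$

The term in parentheses is the generalization gap. When $R_{emp}(f) = 0$, the right-hand side reduces to the gap alone, so the generalization error and the gap coincide.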

## Polynomial Curve Fitting

* To build some intuition about overfitting and model complexity, consider the following: given training data consisting of a single feature `x` and a corresponding real-valued label `y`, we try to find the polynomial of degree `d`
$\hat{y} = \sum_{i=0}^{d} w_i x^i$
for estimating the label `y`.
* This is just a linear regression problem where the features are the powers of `x`, the model's weights are given by $w_i$, and the bias is given by $w_0$, since $x^0 = 1$ for all $x$.
* Higher-degree polynomials are more complex because they have more parameters and can represent a wider range of functions.
* For a fixed training dataset, increasing the polynomial degree can only decrease or maintain the training error.
* If all training inputs `x` are distinct, a polynomial with degree equal to the number of data points can fit the training data perfectly (for $n$ points, degree $n - 1$ already suffices).
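
The behaviour described above can be checked numerically. The sketch below fits polynomials of increasing degree by least squares to 10 training points. The cubic ground truth, the noise level, and the helper names `make_data`, `features`, and `mse` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical ground truth: a cubic polynomial plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0.0, 0.1, size=n)
    return x, y

def features(x, d):
    """Columns x^0, x^1, ..., x^d, so column 0 is the bias feature."""
    return np.vander(x, d + 1, increasing=True)

def mse(w, x, y):
    """Mean squared error of the polynomial with weights w on (x, y)."""
    return float(np.mean((features(x, len(w) - 1) @ w - y) ** 2))

x_train, y_train = make_data(10)
x_val, y_val = make_data(100)

train_errors = []
for d in range(10):
    # Least-squares fit of the weights w_0, ..., w_d.
    w, *_ = np.linalg.lstsq(features(x_train, d), y_train, rcond=None)
    train_errors.append(mse(w, x_train, y_train))
    print(f"degree {d}: train {train_errors[-1]:.4f}, "
          f"validation {mse(w, x_val, y_val):.4f}")
```

The training error never increases with the degree and reaches (numerically) zero at degree 9, where the polynomial interpolates all 10 points. The validation error typically stops improving well before that and tends to grow at high degrees, although the exact numbers depend on the random draw.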


## 1.4 Model Selection

* Typically, we select our final model only after evaluating multiple models that differ in various ways, such as architecture, training objective, or selected features.
* Choosing among many models is called `model selection`.

## Cross-Validation

* When training data is scarce, we might not even be able to afford to hold out enough data to constitute a proper validation set.
* One popular solution is to employ `K-fold cross-validation`.
* Here, the original training data is split into $K$ non-overlapping subsets.
* Then model training and validation are executed $K$ times, each time training on $K-1$ subsets and validating on the remaining subset, i.e. the one not used for training in that round.
* Finally, the training and validation errors are estimated by averaging over the results from the $K$ experiments.


* K-fold cross-validation can be computationally expensive because the model must be trained and validated **K separate times**. When working with **large datasets** and **complex models**, each training and validation cycle requires significant computational resources, making the overall process costly in terms of time and computation.
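
The procedure above can be sketched in a few lines of NumPy. The function `k_fold_cv`, its `fit` and `error` callback arguments, and the straight-line model used to exercise it are hypothetical names and choices for this example, not part of any library.

```python
import numpy as np

def k_fold_cv(x, y, k, fit, error, seed=0):
    """Return the mean training and validation error over k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then split the indices into k non-overlapping subsets.
    folds = np.array_split(rng.permutation(len(x)), k)
    train_errors, val_errors = [], []
    for i in range(k):
        val_idx = folds[i]
        # Train on the other k - 1 subsets.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train_idx], y[train_idx])
        train_errors.append(error(model, x[train_idx], y[train_idx]))
        val_errors.append(error(model, x[val_idx], y[val_idx]))
    return float(np.mean(train_errors)), float(np.mean(val_errors))

# Hypothetical model: a straight line fitted by least squares.
def fit_line(x, y):
    return np.polyfit(x, y, deg=1)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 3.0 * x + 0.5 + rng.normal(0.0, 0.1, size=50)

train_err, val_err = k_fold_cv(x, y, k=5, fit=fit_line, error=mse)
print(f"mean training error: {train_err:.4f}, mean validation error: {val_err:.4f}")
```

The loop calls `fit` once per fold, which is where the cost noted above comes from: the total work is roughly $K$ times that of a single training run.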
