# ICL Course Reference

## Formulas

#### **Binomial Distribution**
- **Binomial Coefficient**:
    - $\binom{n}{k} = \frac{n!}{k!(n-k)!}$
        - $n$ = the number of trials,
        - $k$ = the number of successes.
- **Probability Mass Function**:
    - $f(k) = \binom{n}{k} \cdot \theta^k \cdot (1 - \theta)^{n-k}$
        - $n$ = the number of trials,
        - $\theta$ = the probability of success on a trial,
        - $\binom{n}{k}$ = the binomial coefficient,
        - $k$ = the number of successes in $n$ trials.
- **Binomial Likelihood Function**:
    - $L(\theta) = \binom{n}{s} \cdot \theta^s \cdot (1 - \theta)^{n - s}$
        - $s$ = the number of successes,
        - $n$ = the number of trials,
        - $\theta$ = the probability of success.
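
As a quick check of the formulas above, the following is a minimal Python sketch that evaluates the binomial coefficient and the PMF with the standard library only. The function name `binomial_pmf` and the example values ($n = 10$, $\theta = 0.5$) are illustrative assumptions, not part of the course material.

```python
from math import comb

def binomial_pmf(k: int, n: int, theta: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

# Illustrative values: 10 fair coin flips.
n, theta = 10, 0.5
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

print(comb(10, 3))                 # ways to place 3 successes in 10 trials
print(binomial_pmf(3, n, theta))   # probability of exactly 3 heads
print(sum(pmf))                    # a PMF sums to 1 over k = 0..n
```

Viewed as a function of $\theta$ with $k$ fixed at the observed count, the same expression is the binomial likelihood.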

#### **Likelihood and Probability**
- **General Likelihood Function**:
    - $L(\theta | X) = P(X | \theta) = \prod_{i=1}^{n} f(x_i | \theta)$
        - $X$ = observed data ($x_1, x_2, \dots, x_n$),
        - $\theta$ = parameter(s) of interest,
        - $f(x_i | \theta)$ = probability (or density) of $x_i$ given $\theta$.
- **Log-Likelihood Function**:
    - $\ell(\theta | X) = \log L(\theta | X) = \sum_{i=1}^{n} \log f(x_i | \theta)$
- **Total Probability**:
    - $P(X) = \sum_Y P(X, Y)$
- **Conditional Probability**:
    - $P(A|B) = \frac{P(A \cap B)}{P(B)}$
        - $P(A|B)$ = probability of $A$ given $B$,
        - $P(A \cap B)$ = probability of both $A$ and $B$,
        - $P(B)$ = probability of $B$.
- **Probability Density Function (Normal Distribution)**:
    - $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
        - $f(x)$ = probability density at $x$,
        - $\mu$ = mean,
        - $\sigma^2$ = variance,
        - $\sigma$ = standard deviation.
- **Bayes' Theorem**:
    - $P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)}$
        - $P(Y|X)$ = posterior probability of the parameters $Y$ given the data $X$,
        - $P(X|Y)$ = likelihood of the data $X$,
        - $P(Y)$ = prior probability of the parameters,
        - $P(X)$ = the marginal likelihood (or evidence).
#### **Statistical Measures**
- **Variance**:
    - $\text{Var}(X) = \mathbb{E} \left[ (X - \mathbb{E}(X))^2 \right] = \sum_i (x_i - \mathbb{E}(X))^2 \cdot P(x_i)$
        - $\text{Var}(X)$ = variance of $X$,
        - $\mathbb{E}(X)$ = expected value (mean) of $X$,
        - $x_i$ = specific possible value of $X$,
        - $P(x_i)$ = probability of $x_i$.
- **Covariance**:
    - $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))]$
        - $\text{Cov}(X, Y)$ = covariance between $X$ and $Y$,
        - $\mathbb{E}(X)$ = expected value of $X$,
        - $\mathbb{E}(Y)$ = expected value of $Y$.

#### **Regression Analysis**
- **Regression Line**:
    - $Y = aX + b$
        - $a$ = slope (regression coefficient),
        - $b$ = y-intercept,
        - $X$ = independent variable,
        - $Y$ = dependent variable.
- **Regression Coefficient**:
    - $a = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
        - $\bar{x}$ = mean of $X$,
        - $\bar{y}$ = mean of $Y$.
- **Y-Intercept**:
    - $b = \bar{y} - a\bar{x}$
- **Correlation Coefficient**:
    - $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$
        - $r$ = correlation coefficient,
        - $\bar{x}$ = mean of $X$,
        - $\bar{y}$ = mean of $Y$.



#### **Generalisation Bound**
- **Generalisation Bound**:
    - $n \geq \frac{\log\left(\frac{\delta}{N-1}\right)}{\log(1 - \epsilon)}$
        - $n$ =  required number of samples,
        - $\delta$ = allowed probability of selecting an incorrect function (confidence = $1 - \delta$),
        - $N$ = number of candidate functions,
        - $\epsilon$ = error rate of incorrect functions.

#### **Performance Measures For Regression Problems**
- **Mean Absolute Error (MAE)**  
    - $\frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $
        - $n$ = number of samples,
        - $y_i$ = actual value,
        - $\hat{y}_i$ = predicted value.
- **Average Error (AE)**  
    - $\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) $
- **Mean Absolute Percentage Error (MAPE)**    
    - $  \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100 $
- **Root Mean Squared Error (RMSE)**  
    - $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $
- **Total Sum of Squared Errors (SSE)**  
    - $  \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $
#### Metrics Derived from the Confusion Matrix:
- **Accuracy**:  
   - $\frac{TP + TN}{TP + TN + FP + FN}$
       - $TP$ = Correct positive predictions.
       - $TN$ = Correct negative predictions.
       - $FP$ = Incorrect positive predictions (Type I error).
       - $FN$ = Incorrect negative predictions (Type II error).
- **Precision**:
   - $\frac{TP}{TP + FP}$
- **Recall (Sensitivity)**:  
   - $\frac{TP}{TP + FN}$
- **F1-Score**:  
   - $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
- **Specificity**:
   - $\frac{TN}{FP + TN}$
- **Total Error Rate**
   - $\frac{FN + FP}{TP + FN + FP + TN}$

## **Introduction**

- **Stochastic Data Models**: These models treat the data generation process as a random variable. Common examples include normal, binomial, and Student’s t-distributions.
- **Goodness of Fit**: Summarizes the discrepancy between observed and expected values. A common measure is the root mean squared error (RMSE), which quantifies the differences between the predicted and actual values.
- **Predictive Accuracy**: Refers to how well the predicted values match the actual observed values.
- **Generalization**: The ability of a model to perform well on new, unseen data, rather than just fitting the data it was trained on.
- **Summary Statistics**: A set of values that describe the central tendency, spread, and shape of a dataset, such as mean, standard deviation, and median.
- **Black Box**: A system or process where the internal workings are hidden or not fully understood, often used to describe complex models or algorithms.
- **Model Validation**: The process of confirming that a model is performing as expected and meets its intended purpose, often using validation data or cross-validation.
- **Residual Analysis**: The residual is the difference between the observed and predicted values. Analyzing these residuals can help assess the accuracy of a model and identify patterns or systematic errors.
- **Deterministic Input**: An input that produces the same output every time it is used, often derived from historical data, standards, or specifications.
- **Linear Regression**: A method for modeling the relationship between two or more variables. In simple linear regression, the goal is to fit a straight line, while in multiple linear regression, a hyperplane is used to best represent the relationship.
- **Logistic Regression**: Used for binary classification, where the goal is to predict one of two possible outcomes. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities that can be mapped to discrete classes.
- **Stochastic Events**: Events that involve randomness or uncertainty, where the outcomes cannot be precisely predicted.
- **Poisson Distribution**: A probability distribution that models the number of events occurring within a fixed time period or space, often used in scenarios like modeling the number of customer arrivals or system failures over time.
- **Statistics**: Involves analyzing data to make inferences about a population or process. It involves fitting a model that explains the observed data (Data → Model).
- **Probability**: Describes what we expect to happen in a random experiment. It helps us understand the data or the properties of the data that the model can produce (Model → Data).


## **Machine Learning (ML)**
- **Machine Learning (ML)** is about understanding the relationship between input variables (also called features or predictors) and an output. This relationship can be described by the equation: 
  $$ Y = f(X_1, X_2, ..., X_n) + \alpha $$
  where $Y$ is the output, $X_1, X_2, ..., X_n$ are the input features, and $\alpha$ is a constant (often a bias term).
- In **supervised learning**, both $X$ (input variables) and $Y$ (output or target) are provided during training. In **unsupervised learning**, only the input data $X$ is provided, and the algorithm seeks to learn the structure or patterns in the data without predefined output labels.
- ML models typically serve two main purposes:
    - **Forecasting**: Predicting future events (denoted as $\hat{y}$) based on historical data.
    - **Inference**: Understanding the relationship between $Y$ and $X$. For example, understanding why certain products sell better than others.
        - **Statistical inference** involves drawing conclusions about a population based on sample data.
        - **Model inference** refers to using a trained model to make predictions on new, unseen data.
- Two main types of supervised learning:
    - **Prediction** (also called **regression**) estimates a continuous numerical value, e.g., predicting how much or how many of something.
    - **Classification** estimates a discrete class or category, e.g., determining whether an email is spam or not.
- Two main categories of models:
    - **Parametric** models make strong assumptions about the data's underlying structure. For example, assuming a linear relationship between size and price in a real-estate dataset (i.e., "There's a straight-line relationship between size and price").
    - **Non-parametric** models make fewer assumptions about the structure of the data and instead learn directly from the data itself. For example, using the prices of similar houses to predict the price of a new house without assuming a specific mathematical form.
- **Imputation** refers to methods for dealing with missing data. Some common techniques for handling missing data include:
    - **Removal**: Deleting the data entries that have missing values.
    - **Simple imputation**: Substituting missing values with statistical measures such as the mean or median of the column.
    - **Normalization**: Normalizing the values to estimate the missing value.
    - **Regression imputation**: Using regression models to predict missing values based on other variables in the dataset.
    - **Random regression imputation**: Adding a random component to the regular regression imputation. After predicting the missing value using regression, a residual term is added to introduce randomness.
    - **Multiple imputation**: Using all available features in the data to forecast and fill in missing values, often involving multiple imputed datasets to account for uncertainty in the imputations.
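
The two simplest strategies in the list above, removal and simple imputation, can be sketched with NumPy as follows. The array `x` is made-up data, and the variable names are illustrative assumptions.

```python
import numpy as np

# Hypothetical feature column with two missing entries (NaN).
x = np.array([2.0, np.nan, 4.0, 6.0, np.nan, 8.0])

missing = np.isnan(x)

# Removal: keep only the observed entries.
x_removed = x[~missing]

# Simple imputation: fill the gaps with the mean (or median) of the observed values.
x_mean = np.where(missing, np.nanmean(x), x)
x_median = np.where(missing, np.nanmedian(x), x)

print(x_removed)
print(x_mean)
print(x_median)
```

Mean or median filling keeps every row but shrinks the variance of the column, which is one reason the regression-based and multiple-imputation approaches exist.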


## **Probability**
- **Probability distribution**: Describes how likely different possible outcomes are to occur in a random event or experiment. There are two main types:
    - **Discrete Probability Distributions** deal with outcomes that can only take specific, countable values (e.g., flipping a coin or rolling a die).
    - **Continuous Probability Distributions** deal with outcomes that can take any value within a continuous range (e.g., measuring a person's height or time).
- A **stochastic event** is any event that involves randomness or uncertainty in its outcome. For example, each time you roll a die, you can't predict the exact outcome, but you can describe the probability of each possible result.
    - **Discrete Stochastic Events** take specific countable values, typically whole numbers or categories. These events use **Probability Mass Functions (PMF)**.
    - **Continuous Stochastic Events** take any value within a continuous range. These events use **Probability Density Functions (PDF)**.
- A **Probability Mass Function (PMF)** is used for discrete random variables. It can be represented as a bar chart, where the height of each bar represents the exact probability of a specific outcome.
- A **Probability Density Function (PDF)** is used for continuous random variables. It is represented as a smooth curve, where the probability of the variable falling between two points is the area under the curve between them. The probability of any single exact value is zero, so a PDF describes how likely the random variable $x$ is to fall within a given range.
- The **binomial distribution** is a probability distribution that describes the number of successes in a fixed number of independent trials, where each trial has two possible outcomes: success or failure. For example, flipping a fair coin multiple times and counting the number of heads. This distribution is widely used in machine learning for modeling binary outcomes.
- The **binomial coefficient** gives the number of ways to choose $k$ successes from $n$ trials. It is mathematically expressed as: 
    $$ \binom{n}{k} = \frac{n!}{k!(n - k)!} $$
- The **Central Limit Theorem (CLT)** states that the distribution of the sample mean (or sum) of a large number of independent, identically distributed random variables with finite variance approaches a normal distribution (also called a Gaussian distribution), regardless of the original distribution of the data. The idea can be illustrated in three steps:
    - **Grouping events**: Taking multiple observations (e.g., dice rolls, test scores, etc.) and dividing them into groups.
    - **Calculating averages**: For each group, compute the average of the observations within that group.
    - **Plotting the averages**: When these averages are plotted as a distribution, the shape tends to resemble a bell curve (Normal Distribution), even if the original data (individual events) is not normally distributed.
- **Absolute Probability** (also called **Unconditional Probability**) refers to the probability of an event occurring without any prior conditions or restrictions. It is the most straightforward form of probability and does not depend on any other events. For example, the probability of rolling a 4 on a fair six-sided die is:
    $$ P(4) = \frac{1}{6} $$
- **Conditional Probability** is the probability of an event occurring, given that another event has already occurred. It reflects the relationship between two events and is denoted as $P(A|B)$, meaning **the probability of A given B**.
- **Total Probability** refers to the overall probability of an event occurring, accounting for all possible scenarios or conditions. It is calculated using the **law of total probability**, which partitions the sample space into mutually exclusive and exhaustive events $B_1, B_2, \dots, B_n$ such that:
    $$ P(A) = \sum_{i=1}^{n} P(A|B_i) P(B_i) $$
- **Independent events** occur when the outcome of one event does not affect the outcome of another event. For independent events, the probability of their co-occurrence (composite event) is the product of their individual probabilities:
    $$ P(X, Y) = P(X) P(Y) = P(X = x) P(Y = y) $$
- A **complementary event** in probability refers to all the outcomes in the sample space that are not part of the given event. It is essentially the opposite of the event you're considering. If $A$ is an event, then the complement of $A$, denoted as $A^c$, is the event that $A$ does not occur. The probability of the complement of an event is given by:
    $$ P(A^c) = 1 - P(A) $$
- **Bayes' Theorem** (or Bayes' Rule) describes how to update the probability of a hypothesis (or event) based on new evidence. It is mathematically represented as:
    $$ P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)} $$
    - Where:
        - $P(Y|X)$ is the **posterior probability** (the probability of $Y$ given $X$).
        - $P(X|Y)$ is the **likelihood** (the probability of observing $X$ given $Y$).
        - $P(Y)$ is the **prior probability** (the initial probability of $Y$).
        - $P(X)$ is the **evidence** (the probability of observing $X$).
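
To see how the complement rule, the law of total probability, and Bayes' theorem fit together, here is a small worked example in Python. All numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical setting: Y = "email is spam", X = "email contains the word 'offer'".
p_y = 0.2               # prior P(Y)
p_x_given_y = 0.6       # likelihood P(X | Y)
p_x_given_not_y = 0.05  # P(X | not Y)

# Complement rule: P(not Y) = 1 - P(Y).
p_not_y = 1 - p_y

# Law of total probability: P(X) = P(X|Y) P(Y) + P(X|not Y) P(not Y).
p_x = p_x_given_y * p_y + p_x_given_not_y * p_not_y

# Bayes' theorem: P(Y|X) = P(X|Y) P(Y) / P(X).
p_y_given_x = p_x_given_y * p_y / p_x

print(f"P(X)   = {p_x:.3f}")
print(f"P(Y|X) = {p_y_given_x:.3f}")
```

With these made-up inputs the evidence is $P(X) = 0.16$ and the posterior is $P(Y|X) = 0.75$, so observing $X$ raises the probability of $Y$ from 20% to 75%.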


## Descriptive Statistics

- **1. Measures of Central Tendency**
    -  These describe the center, or typical value, of a dataset
    -  **Mean**
        - The average of the data
        - $\mu = \frac{\sum{x_i}}{N}$ 
    -  **Median**
        - The middle value when data is ordered
    -  **Mode**
        - The most frequently occurring value
    -  If the mean < median, the data is typically left skewed, which suggests some low outliers. If the mean > median, the data is typically right skewed, which suggests some high outliers.
- **2. Measures of Dispersion**
    - These describe the variability or spread of a dataset
    - **Range**
        - $\text{range} = \max(x) - \min(x)$
    - **Variance**
        - The average squared deviation from the mean ($\mu$)
        - $\sigma^2 = \frac{\sum{(x_i-\mu)^2}}{N}$ (population)
        - $s^2 = \frac{\sum{(x_i-\bar{x})^2}}{N-1}$ (sample, where $\bar{x}$ is the sample mean)
    - **Standard Deviation**
        - The square root of the variance, providing a measure in the same units as the data
        - $\text{standard deviation} = \sqrt{\sigma^2}$
    - **Coefficient of Variation (CV)**
        - The ratio of the standard deviation to the mean, expressed as a percentage
        - $\text{CV} = \frac{\sigma}{\mu}\cdot 100$
            - CV < 100%: Lower relative variability
            - CV ≈ 100%: High relative variability
            - CV > 100%: Very high relative variability
- **3. Measure of Shape**
    - Used to describe the shape and symmetry of the dataset 
    - **Skewness**
        - $\text{skewness} = \frac{\frac{1}{N}\sum{(x_i - \mu)^3}}{\sigma^3}$
        - Measures the asymmetry of the distribution.
        - skewness > 0 = right skewed
        - skewness < 0 = left skewed
        - skewness = 0 = symmetric
    - **Modality**
        - The number of peaks
        - **Unimodal** One peak
        - **Bimodal** Two peaks
        - **Multimodal** More than two peaks
- **4. Measure of Position**
    - **Percentiles**
        - Values that divide the ordered data into 100 equal parts; the $k$-th percentile is the value below which $k\%$ of the data falls
    - **Quartiles**
        - Special percentiles dividing data into four equal parts:
            - $Q_1$ = (25th percentile): Lower quartile.
            - $Q_2$ = (50th percentile): Median.
            - $Q_3$ = (75th percentile): Upper quartile.
    - **Deciles**
        - Divide data into 10 equal parts
    - **Z-Scores**
        - Measure how many standard deviations a data point is from the mean
        - $z = \frac{x - \bar{x}}{\text{standard deviation}}$
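
The measures above can be computed with Python's standard `statistics` module. The sketch below uses a made-up sample with one high outlier; using the sample standard deviation for the CV and the z-scores is an assumption of this example.

```python
import statistics as st

# Hypothetical sample with one high outlier.
data = [2, 3, 3, 4, 5, 6, 19]

mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)
data_range = max(data) - min(data)
pop_var = st.pvariance(data)     # divides by N
sample_var = st.variance(data)   # divides by N - 1
sample_sd = st.stdev(data)       # square root of the sample variance
cv_percent = sample_sd / mean * 100
z_scores = [(x - mean) / sample_sd for x in data]

print(mean, median, mode, data_range)
print(pop_var, sample_var, sample_sd, cv_percent)
print(mean > median)  # the high outlier pulls the mean above the median
```

Here the mean (6) exceeds the median (4), which matches the rule of thumb for right-skewed data.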


## **Misc Statistics**

- **Statistical inference** is the process of using data from a sample to make conclusions or predictions about a larger population. Frequentist and Bayesian are two distinct paradigms of reasoning and methodology in statistical inference.
    - **Frequentist**: Views probability as the long-run frequency of events; parameters are fixed, and data is used to estimate them (e.g., hypothesis testing, confidence intervals). Frequentists believe that the "true" value of a statistic about a population (for example, the mean) is fixed and unknown.
    - **Bayesian**: Treats probability as a degree of belief; parameters are treated as random variables and updated using prior knowledge and observed data (via Bayes' Theorem). Bayesians believe that data inform us about the distribution of a statistic or event, and that as we receive more data, our belief about the distribution can be updated, confirming or revising our previous beliefs. 
- **Outliers** are data points that significantly differ from the rest of the data. Methods to detect and remove outliers include:
    - **Z-Score**: Identifies outliers based on how many standard deviations away a point is from the mean.
    - **Modified Z-Score**: Uses the median and MAD (Median Absolute Deviation) for small datasets or non-normal distributions.
    - **Tukey's Fences**: An aggressive IQR-based method with different "fence" thresholds to identify outliers.
    - **Percentile Approach**: Identifies outliers by examining the distribution of data using percentiles, which divide the data into 100 equal parts.
    - **Quartile Approach (IQR)**: Identifies outliers based on the Interquartile Range (IQR), dividing the data into four equal parts using quartiles:
        - With $\text{IQR} = Q_3 - Q_1$, a data point $x$ is typically considered an outlier if: 
        $$ x < Q_1 - 1.5 \cdot \text{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \cdot \text{IQR} $$
    - **Visualization**: Boxplots and scatter plots help visually identify outliers.
    - **Isolation Forest**: A machine learning method that isolates outliers by partitioning the data and determining how isolated each point is.
    - **DBSCAN**: A clustering-based method that detects outliers as points that do not belong to any cluster.
    - **Anomaly Detection**: Advanced machine learning techniques like One-Class SVM or Autoencoders for detecting outliers.
- **Variance** measures the spread of a single variable around its mean, calculated as the average of squared deviations from the mean:
    $$ \text{Variance of } X = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 $$
- **Covariance** measures the relationship between two variables and how they vary together. It is calculated as:
    $$ \text{Covariance of } X \text{ and } Y = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) $$ 
- **Bootstrapping** is a resampling technique used to assess the accuracy of a statistic (e.g., correlation coefficient, mean, median) by repeatedly sampling from the original dataset. It involves generating many new datasets (called "bootstrap samples") by randomly sampling data points from the original dataset with replacement (i.e., some data points may appear more than once in a new sample, while others may not appear at all).
    - It is a **non-parametric method**, meaning it does not rely on assumptions about the distribution of the original data.
    - Bootstrapping is often used to estimate the sampling distribution of a statistic, calculate confidence intervals, and assess the stability or variability of a model's estimates (including correlation coefficients between two time series, regression coefficients, etc.).
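
A minimal sketch of bootstrapping, here applied to a correlation coefficient, might look like the following. The data are synthetic, and the helper name `bootstrap_corr`, the number of resamples, and the percentile-based interval are all choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical paired data: a linear relationship plus noise.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

def bootstrap_corr(x, y, n_boot=2000, rng=rng):
    """Correlation coefficient recomputed on n_boot resamples drawn with replacement."""
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample row indices with replacement
        stats[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return stats

boot = bootstrap_corr(x, y)
lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile 95% confidence interval
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Pairs are resampled together (the same `idx` indexes both arrays) so that the dependence between $x$ and $y$ is preserved in every bootstrap sample.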



## **Maximum Likelihood Estimation (MLE)**
- **MLE** is a method in statistics used to determine the values of a model's parameters that are most likely to have produced the observed data.
- For instance, if you were estimating the average height of a population based on a sample, MLE helps identify the average height that maximizes the likelihood of observing the sample data.

#### **General Likelihood Function**
Let the observed data be $X = (x_1, x_2, \dots, x_n)$, where $x_i$ represents individual data points, and let the unknown parameter(s) be denoted as $\theta$.
- The **likelihood function** $L(\theta | X)$ represents the probability of observing the data $X$ given the parameters $\theta$. For independent observations it factorises into a product over the data points:
  $$ L(\theta | X) = P(X | \theta) = \prod_{i=1}^{n} f(x_i | \theta) $$ 
  - Here, $f(x_i | \theta)$ is the probability (or probability density) of observing $x_i$ given the parameter $\theta$.
  - $\prod_{i=1}^{n}$ indicates the product over all the data points $x_1, x_2, \dots, x_n$.
- The **log-likelihood function** is often used for simplification, as working with the logarithm of the likelihood can make calculations easier and more stable. The log-likelihood is given by:
  $$ \ell(\theta | X) = \log L(\theta | X) = \sum_{i=1}^{n} \log f(x_i | \theta) $$
- To find the **Maximum Likelihood Estimate (MLE)** of the parameter(s), we maximize the log-likelihood function with respect to $\theta$. The MLE is the value of $\theta$ that maximizes this function:
  $$ \hat{\theta} = \arg \max_{\theta} \ell(\theta | X) $$
    - Where $\hat{\theta}$ is the MLE of $\theta$, which is the value that maximizes the likelihood of observing the given data.
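
As a numerical illustration of maximizing a log-likelihood, the sketch below estimates the mean of normally distributed data by grid search. The simulated heights, the fixed standard deviation of 8, and the function name `normal_log_likelihood` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample of heights in cm; the true mean 170 and sd 8 are made up.
heights = rng.normal(loc=170.0, scale=8.0, size=500)

def normal_log_likelihood(mu, sigma, x):
    """Sum of log N(x_i | mu, sigma^2) over all observations."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Grid search over mu with sigma held fixed at 8.
grid = np.linspace(160, 180, 2001)  # step of 0.01
ll = np.array([normal_log_likelihood(mu, 8.0, heights) for mu in grid])
mu_hat = grid[np.argmax(ll)]

print(mu_hat, heights.mean())  # the maximizer sits at (approximately) the sample mean
```

For the normal model the maximizer has a closed form, the sample mean, so the grid search is only there to make the "arg max" step concrete.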

#### **Binomial Likelihood Function**
As a specific example, consider the case where the data follows a binomial distribution. The binomial likelihood function expresses the probability of observing $s$ successes in $n$ trials, given a success probability $\theta$. The likelihood is:
  $$ L(\theta) = \binom{n}{s} \cdot \theta^s \cdot (1 - \theta)^{n - s} $$
- Here:
    - $\binom{n}{s}$ is the binomial coefficient, representing the number of ways to choose $s$ successes from $n$ trials.
    - $\theta^s$ is the probability of $s$ successes.
    - $(1 - \theta)^{n - s}$ is the probability of $n - s$ failures.
- The **log-likelihood function** for the binomial case is:
  $$ \ell(\theta) = \log L(\theta) = \log \binom{n}{s} + s \log \theta + (n - s) \log (1 - \theta) $$
- To find the MLE, we maximize this log-likelihood function with respect to $\theta$.
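
Carrying out that maximization explicitly gives a closed-form result. The term $\log \binom{n}{s}$ does not depend on $\theta$, so it vanishes when differentiating:

$$ \frac{d\ell}{d\theta} = \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0 \;\Longrightarrow\; s(1 - \theta) = (n - s)\,\theta \;\Longrightarrow\; \hat{\theta} = \frac{s}{n} $$

As an illustrative example (the numbers are assumptions, not course data), observing $s = 7$ successes in $n = 10$ trials gives $\hat{\theta} = 7/10 = 0.7$.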

#### **Summary**
Both the general likelihood function and the binomial likelihood function are part of the MLE framework. The general formulation applies to any statistical model, while the binomial likelihood function is specific to data following a binomial distribution.

## Regression Analysis

- The idea of **regression** is to fit a line that best describes the trend between two variables, minimizing the discrepancy between the model and the data. This line is defined as:
    - $y = ax + b$
        - $a$ is the **slope** (regression coefficient)
        - $b$ is the **y-intercept**
        - $x$ is the **independent variable**
        - $y$ is the **dependent variable**
- We minimize the error (discrepancy) by measuring the differences between the line and the actual values. This error is calculated as:
    - $\text{SSR} = \sum (\hat{y}_i - y_i)^2$
        - where $\hat{y}_i = ax_i + b$ is the predicted value.
        - SSR stands for **Sum of Squared Residuals**, representing the sum of squared differences between the observed ($y_i$) and predicted ($\hat{y}_i$) values.

### Regression Coefficient and Intercept:

- The **regression coefficient** is a key concept in regression analysis. It represents the relationship between a predictor (independent variable $x$) and the outcome (dependent variable $y$). It indicates the strength and direction of the relationship between $x$ and $y$ in regression models.
    - The regression coefficient ($a$) is given by the formula:
        $$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
    - The **intercept** ($b$) follows from requiring the line to pass through the point of means $(\bar{x}, \bar{y})$:
        $$b = \bar{y} - a \bar{x}$$
        - where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
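
The slope and intercept formulas translate directly into NumPy. The data below are made up to lie near the line $y = 2x + 1$; `np.polyfit` is used only as a cross-check.

```python
import numpy as np

# Hypothetical data lying near the line y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared x-deviations.
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: makes the line pass through the point (x_bar, y_bar).
b = y_bar - a * x_bar

y_hat = a * x + b
ssr = np.sum((y_hat - y) ** 2)  # sum of squared residuals

print(a, b, ssr)
print(np.polyfit(x, y, deg=1))  # least-squares fit, returned as [slope, intercept]
```

For this sample the formulas give $a = 1.97$ and $b = 1.09$, close to the slope and intercept used to make up the data.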

### Correlation Coefficients:

- The **correlation coefficient** measures the degree of co-movement or linear association between two variables, i.e. how strongly they are related to each other.
    - The correlation coefficient ($r$) lies between -1 and +1:
        - $r = 1$ indicates a perfect positive linear correlation.
        - $r = -1$ indicates a perfect negative linear correlation.
        - $r = 0$ indicates no linear correlation.
    - To compute the correlation coefficient:
        - For each pair of data points $(x_i, y_i)$, where $i = (1, \dots, N)$:
        $$r = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^N (y_i - \bar{y})^2}}$$
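
The formula can be implemented in a few lines. In the sketch below, the function name `pearson_r` and the three small datasets are illustrative assumptions.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient computed from deviations about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on an increasing line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a decreasing line
r_curve = pearson_r(x, [4, 1, 0, 1, 4])  # symmetric parabola y = (x - 3)^2

print(r_up, r_down, r_curve)
```

The parabola case yields $r = 0$ even though $y$ is completely determined by $x$, which shows that $r$ captures only linear association.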


## General Machine Learning Concepts

- **Generalisation theory**, in simple terms, is about understanding how well a machine learning model can apply what it has learned from the training data to new, unseen data.
- **Deterministic** models assume that there is no randomness in the system, so the same inputs will always return the exact same outputs.
- **Stochastic** models assume that there is some randomness (noise) in the system that we cannot account for, meaning that the same input can give a slightly different output each time. Real-world data usually behaves this way because of measurement noise and unobserved factors.
- The **bias-variance tradeoff** is the balance between fitting the training data well and performing well on new, unseen data. In simple terms:
    - **High bias** means the model is too simple and misses important patterns in the data. This is called **underfitting**.
    - **High variance** means the model is too complex and overly focused on the training data, capturing noise instead of true patterns. This is called **overfitting**.

## Probabilistic Framework in Machine Learning

- **Probabilistic Setting** refers to approaching machine learning problems from a probability theory perspective. Rather than making deterministic predictions, the model expresses uncertainty through probability distributions.
- **Stationarity** is a property where the statistical characteristics of data remain constant over time. There are different types:
    - ***Strict stationarity*** means the entire probability distribution stays unchanged over time.
    - ***Weak stationarity*** (or covariance stationarity) means the mean and variance remain constant over time, and the covariance between two observations depends only on the lag between them.
- **A Priori Knowledge** refers to information we know before analyzing the data—our assumptions and domain expertise that we build into the model. For example, physical constraints (like knowing values must be positive) are often used as prior knowledge.

## Interpreting the Generalisation Bound

- **Generalisation Bound** allows for the rigorous quantification of how much data is needed to make predictions about unseen data. It tells us how many samples $n$ are needed to be sure we select the correct function with a certain level of confidence. The generalisation bound is given by:
    $$n \geq \frac{\log\left(\frac{\delta}{N-1}\right)}{\log(1 - \epsilon)}$$
    - $n$ = number of samples needed
    - $\delta$ = allowed probability of selecting an incorrect function (in this case, 99.9% confidence, so $\delta = 0.001$)
    - $N$ = number of candidate functions (in this case, $N = 1000$)
    - $\epsilon$ = error rate of incorrect functions (in this case, 1%, so $\epsilon = 0.01$)
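
Plugging the example values into the bound can be done with a short helper. The function name `samples_needed` is an assumption for this sketch; rounding up reflects that $n$ must be a whole number of samples.

```python
import math

def samples_needed(delta: float, n_functions: int, epsilon: float) -> int:
    """Smallest integer n with n >= log(delta / (N - 1)) / log(1 - epsilon)."""
    bound = math.log(delta / (n_functions - 1)) / math.log(1 - epsilon)
    return math.ceil(bound)

n = samples_needed(delta=0.001, n_functions=1000, epsilon=0.01)
print(n)
```

For $\delta = 0.001$, $N = 1000$, and $\epsilon = 0.01$ the ratio is about 1374.5, so 1375 samples are needed.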

## Laplace’s Rule of Succession

- **Laplace’s Rule of Succession** is a way of estimating the probability of an event happening in the future, based on past observations, while accounting for uncertainty with limited data:
    $$P(\text{next event is a success}) = \frac{s+1}{n+2}$$
    - $s$ is the number of successes observed (i.e., the event has happened $s$ times)
    - $n$ is the total number of trials (i.e., the event has been observed $n$ times in total)
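
A small sketch of the rule, using exact fractions so the results are easy to read. The function name `rule_of_succession` and the example counts are illustrative assumptions.

```python
from fractions import Fraction

def rule_of_succession(s: int, n: int) -> Fraction:
    """Laplace estimate (s + 1) / (n + 2) that the next trial is a success."""
    if not 0 <= s <= n:
        raise ValueError("need 0 <= s <= n")
    return Fraction(s + 1, n + 2)

print(rule_of_succession(0, 0))   # no data yet: 1/2
print(rule_of_succession(3, 3))   # 3 successes in 3 trials: 4/5 rather than 1
print(rule_of_succession(0, 10))  # 0 successes in 10 trials: 1/12 rather than 0
```

Unlike the raw frequency $s/n$, the estimate never reaches exactly 0 or 1, which is how the rule accounts for limited data.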



## Evaluating Model Performance

### 1. Regression Models:
##### **Mean Absolute Error (MAE)**  
- MAE measures the average magnitude of errors between actual values ($y$) and predicted values ($\hat{y}$).
- A low MAE indicates that the model's predictions are close to the actual values on average.
- A high MAE suggests larger average errors, meaning the model is less accurate.
- In NumPy, with predictions stored in `y1`: `MAE = np.mean(np.abs(y - y1))`
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$

##### **Average Error (AE)**  
- AE gives the average signed error, where positive and negative errors can cancel out.
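- A positive AE means the actual values exceed the predictions on average (underprediction).
- A negative AE means the predictions exceed the actual values on average (overprediction).
- In NumPy, with predictions stored in `y1`: `AE = np.mean(y - y1)`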
$$ \text{AE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) $$

##### **Mean Absolute Percentage Error (MAPE)**    
- MAPE measures the average percentage error relative to the actual values.
- A low MAPE suggests the model predicts well on a relative scale.
- A high MAPE means that errors are large relative to the actual values.
- In NumPy, with predictions stored in `y1`: `MAPE = np.mean(np.abs((y - y1) / y)) * 100`
$$ \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100 $$

##### **Root Mean Squared Error (RMSE)**  
- RMSE measures the square root of the average squared errors.
- Penalizes large errors more heavily than MAE because of the squaring.
- A low RMSE indicates better predictive performance.
- A high RMSE highlights larger deviations in predictions.
- In NumPy, with predictions stored in `y1`: `RMSE = np.sqrt(np.mean((y - y1) ** 2))`
$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

##### **Total Sum of Squared Errors (SSE)**  
- SSE measures the total squared deviation of predictions from actual values.
- It represents the overall error without normalization.
- A low SSE means better predictive performance, but its scale depends on the dataset size.
- NumPy: `SSE = np.sum((y - y1) ** 2)`
$$ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
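
The five regression metrics above can be computed together. The following is a minimal sketch, assuming `y` holds the actual values and `y1` the predictions (as in the NumPy one-liners above); the function name `regression_metrics` and the sample values are made up for illustration.

```python
import numpy as np

def regression_metrics(y, y1):
    """Return MAE, AE, MAPE, RMSE, and SSE for actual values y and predictions y1."""
    y = np.asarray(y, dtype=float)
    y1 = np.asarray(y1, dtype=float)
    err = y - y1  # signed errors: actual minus predicted
    return {
        "MAE": float(np.mean(np.abs(err))),
        "AE": float(np.mean(err)),
        "MAPE": float(np.mean(np.abs(err / y)) * 100),  # undefined if any actual value is 0
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "SSE": float(np.sum(err ** 2)),
    }

# Made-up illustrative values, not course data
y = [100.0, 200.0, 400.0]
y1 = [110.0, 190.0, 360.0]
metrics = regression_metrics(y, y1)
print(metrics)
```

With these values the errors are -10, 10, and 40, so MAE is 20 and SSE is 1800. The single large error pulls RMSE (about 24.5) above MAE.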

### 2. Classification Models:
#### Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a classification model. It compares the actual target values with the predictions made by the model. Here's an example of a confusion matrix layout:

| Actual \ Predicted | Positive | Negative |
|---------------------|----------|----------|
| **Positive**        | True Positive (TP)  | False Negative (FN) |
| **Negative**        | False Positive (FP) | True Negative (TN)  |

##### Explanation:
- **True Positive (TP)**: The model correctly predicted the positive class.
- **True Negative (TN)**: The model correctly predicted the negative class.
- **False Positive (FP)**: The model incorrectly predicted the positive class when it was actually negative (Type I Error).
- **False Negative (FN)**: The model incorrectly predicted the negative class when it was actually positive (Type II Error).
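
As a rough illustration of how the four cells are counted, the sketch below tallies them from paired actual and predicted labels. The helper name `confusion_counts`, the `positive` parameter, and the label lists are assumptions made for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for binary labels, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, actually negative
        elif p == positive:
            fp += 1  # predicted positive, actually negative (Type I error)
        else:
            fn += 1  # predicted negative, actually positive (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels for illustration
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
```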

##### Metrics Derived from the Confusion Matrix:
1. **Accuracy**:  
   The accuracy metric tells you the overall correctness of the model by calculating the proportion of correct predictions to the total predictions.  
   $$
   \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
   $$

2. **Precision**:  
   Precision measures the proportion of positive predictions that are actually correct. A higher precision means that when the model predicts positive, it is more likely to be correct.  
   $$
   \text{Precision} = \frac{TP}{TP + FP}
   $$

3. **Recall (Sensitivity)**:  
   Recall (also known as sensitivity) measures the proportion of actual positives that are correctly identified by the model. A higher recall indicates that the model is better at identifying the positive class.
   - Recall focuses on the model's ability to correctly identify actual positives.
   $$
   \text{Recall} = \frac{TP}{TP + FN}
   $$

4. **F1-Score**:  
   The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is particularly useful when the class distribution is imbalanced. A higher F1-score indicates a better balance between precision and recall.  
   $$
   \text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
   $$
5. **Specificity**:  
   Specificity measures the proportion of actual negatives that are correctly identified as negative by the model.
   - Specificity focuses on the model's ability to correctly predict negatives.
   $$
   \text{Specificity} = \frac{TN}{FP + TN}
   $$

   
6. **Misclassification Rate (Total Error Rate)**:  
   The misclassification rate is the proportion of incorrect predictions; it equals $1 - \text{Accuracy}$.  
   $$
   \text{Misclassification Rate} = \frac{FP + FN}{TP + TN + FP + FN}
   $$
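
The metrics in this list all follow from the four confusion-matrix counts. A minimal sketch, assuming the counts are already known; the function name `classification_metrics` and the example counts are hypothetical, and the code does not guard against zero denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the standard metrics from confusion-matrix counts (no zero-division guard)."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "error_rate": (fp + fn) / total,
    }

# Made-up counts for illustration
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and the error rate always sum to 1, which is a quick sanity check on any set of counts.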

## Lift Charts 
- Lift:
    - $\text{Lift} = \frac{\text{True Positives in model-selected subset}}{\text{Expected True Positives by random selection}}$
    - Measures how much better a model performs compared to random selection.
- Interpretation:
    - A lift chart plots the cumulative percentage of actual positives (y-axis) against the cumulative percentage of samples (x-axis), ranked by predicted probability.
    - The baseline (random selection) is the diagonal line, which corresponds to a lift of 1: a model with no predictive power captures positives no faster than chance.
    - A perfect model captures all positive cases first, so its curve rises steeply and then flattens at 100%.
- Use Cases:
    - Commonly used in marketing, fraud detection, and customer targeting to rank predictions and prioritize high-value actions.
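
To make the lift formula concrete, the sketch below ranks samples by predicted probability and computes the lift for the top fraction. The function name `gain_and_lift`, the `fraction` parameter, and the labels and scores are all assumptions for illustration.

```python
def gain_and_lift(actual, scores, fraction):
    """Cumulative gain and lift for the top `fraction` of samples, ranked by descending score."""
    # Rank the actual labels (1 = positive, 0 = negative) by predicted score, highest first
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: t[0], reverse=True)]
    k = max(1, round(fraction * len(ranked)))
    positives_in_top = sum(ranked[:k])               # true positives in the model-selected subset
    expected_random = k * sum(ranked) / len(ranked)  # expected positives if k samples were picked at random
    gain = positives_in_top / sum(ranked)            # share of all positives captured (y-axis of the chart)
    lift = positives_in_top / expected_random
    return gain, lift

# Made-up labels and predicted probabilities
actual = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.3, 0.5, 0.35, 0.05]
gain, lift = gain_and_lift(actual, scores, fraction=0.2)
print(gain, lift)
```

Here the top 20% of ranked samples contain 2 of the 4 positives, against 0.8 expected by chance, giving a lift of 2.5. Selecting 100% of the samples always gives a lift of 1.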

## Techniques for fine-tuning the machine learning algorithms

#### 1. K-Folds Cross Validation:
- The dataset is split into $K$ equal-sized folds. The model is trained on $K-1$ folds and tested on the remaining fold. This process repeats $K$ times, each time using a different fold for testing.
- **Cross-Validation Estimate**:
    - $\text{CV Error} = \frac{1}{K} \sum_{i=1}^{K} \text{Error}_i$
    - Computes the average error across all $K$ iterations.
- Interpretation:
    - Helps assess model performance by ensuring every data point is used for both training and testing.
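    - Gives a more stable (lower-variance) performance estimate than a single train-test split, because the result does not depend on one particular partition of the data.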
- Common Variants:
    - **Stratified K-Folds**: Maintains class distribution across folds (useful for imbalanced datasets).
    - **Leave-One-Out Cross Validation (LOO-CV)**: Special case where $K$ equals the number of samples ($n$).
- Use Cases:
    - Model selection and hyperparameter tuning to ensure robust evaluation.
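
The procedure can be sketched without any ML library. The example below is an illustration only: the helper names are hypothetical, the "model" is a deliberately trivial baseline that predicts the mean of the training folds, and the per-fold error is the mean squared error.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(y, k):
    """K-fold CV error for a baseline model that predicts the training mean."""
    errors = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # "fit" on the other K-1 folds
        mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        errors.append(mse)                    # Error_i for this fold
    return sum(errors) / k                    # CV Error: average over the K folds

# Made-up target values
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 7.5]
cv_error = cross_val_mse(y, k=5)
loo_error = cross_val_mse(y, k=len(y))  # leave-one-out: K equals n
print(cv_error, loo_error)
```

Swapping the mean baseline for a real model only changes the two lines that fit and predict; the fold logic stays the same.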

#### 2. Oversampling
- A technique used to address class imbalance by increasing the representation of the minority class to match the majority class.
- Approaches:
    - **Basic Duplication**: Creates identical copies of existing minority class samples.
    - Synthetic Data Generation:
        - **SMOTE (Synthetic Minority Oversampling Technique)**: Generates new samples by interpolating between existing minority class samples.
        - **ADASYN (Adaptive Synthetic Sampling)**: Extends SMOTE by generating more synthetic samples for minority points that are harder to learn, i.e., those with more majority-class neighbors.
        - **Variational Autoencoders (VAEs) & GANs**: Create realistic synthetic data points, often used in image and text applications.
    - **Weighted Oversampling**: Assigns higher weights to minority class samples instead of duplicating them.
- Challenges:
    - Overfitting: Repeating samples can lead to models memorizing instead of generalizing.
    - Computational Cost: Increases dataset size, leading to higher training time.
    - Synthetic Sample Quality: Poorly generated samples can negatively impact model performance.
- **Stratified Sampling**
    - In general, stratified sampling keeps each class at its original proportion in both the training and test sets, preventing biased performance evaluation. When combined with oversampling, only the training set is rebalanced; the validation set keeps the original class ratio.
    - Approach:
        - Divide data into two groups:
            - Set A: Minority class samples.
            - Set B: Majority class samples.
        - Construct training and validation sets:
            - Training: Randomly select 50% from Set A and an equal number from Set B.
            - Validation: Use the remaining 50% of Set A and add samples from Set B to restore the original class ratio.
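
The two simplest ideas in this section, basic duplication and the Set A / Set B split, can be sketched with the standard library alone. The function names, the seed parameter, and the 20-versus-180 toy dataset are assumptions made for illustration, not part of the course material.

```python
import random

def oversample_by_duplication(minority, majority, seed=0):
    """Basic duplication: resample minority rows with replacement until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def balanced_train_split(set_a, set_b, seed=0):
    """Training: 50% of Set A plus an equal number from Set B.
    Validation: the rest of Set A plus enough of Set B to restore the original ratio."""
    rng = random.Random(seed)
    a = rng.sample(set_a, len(set_a))  # shuffled copy of the minority class
    b = rng.sample(set_b, len(set_b))  # shuffled copy of the majority class
    half = len(a) // 2
    train = a[:half] + b[:half]
    val_a = a[half:]
    # Number of majority rows that keeps B:A in validation equal to the original B:A
    n_val_b = round(len(val_a) * len(set_b) / len(set_a))
    val_b = b[half:half + n_val_b]
    return train, val_a + val_b

# Toy dataset: 20 minority rows (Set A) and 180 majority rows (Set B)
set_a = [("pos", i) for i in range(20)]
set_b = [("neg", i) for i in range(180)]

train, val = balanced_train_split(set_a, set_b)
new_minority, majority = oversample_by_duplication(set_a, set_b)
print(len(train), len(val), len(new_minority))
```

In this toy example the training set holds 10 rows of each class, while the validation set holds 10 minority and 90 majority rows, matching the original 10% minority share.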