<font size="+3"><strong>Machine Learning: Core Concepts</strong></font>

# Statistical Concepts

## ***Cost Functions***

When we train a model, we're solving an optimization problem. We provide training data to an algorithm and tell it to find the model or model parameters that best fit the data. But how can the algorithm judge what the "best" fit is? What criteria should it use?

A **cost function** (sometimes also called a loss or error function) is a mathematical formula that provides the score by which the algorithm will determine the best fit. Generally, the goal is to minimize the cost function and get the lowest score. For linear models, these functions measure distance, and the model tries to get the closest fit to the data. For tree-based models, they measure impurity, and the model tries to get the purest terminal nodes.

#### Key Points about Cost Functions:

1. **Quantifying Error**: The cost function assigns a numerical value to the error between predicted values and actual values. Which function is appropriate depends on the problem type (for example, regression versus classification).

2. **Model Complexity**: The choice of cost function can also influence the complexity of the resulting model. Some cost functions encourage simpler models that generalize better, while others might lead to more complex models that overfit the training data.

3. **Trade-offs**:  The choice of a specific cost function can involve trade-offs. For example, a complex cost function might better capture the intricacies of the problem, but it could also make optimization more difficult. A simpler cost function might lead to faster training, but it might not capture all the nuances of the data.

4. **Common Cost Functions**:
   - **Mean Squared Error (MSE)**: For regression, it measures squared differences.

   $$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

   Where:
    - $n$ is the number of data points
    - $y_i$ is the true target value for the $i$th data point
    - $\hat{y}_i$ is the predicted value for the $i$th data point


   - **Cross-Entropy**: Often used for classification problems. It quantifies the difference between predicted class probabilities and true class labels.

   - **Hinge Loss**: Used for support vector machines (SVMs) and other classifiers. It aims to maximize the margin between classes.



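   - **Log-Likelihood**: Used in probabilistic models. It measures the likelihood of the observed data under the model's distribution. Because a higher likelihood is better, the quantity minimized in practice is the negative log-likelihood.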
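
To make the list above concrete, here is a minimal sketch of three of these cost functions in plain Python. The function names and the toy inputs are illustrative assumptions, not part of any library; the cross-entropy shown is the binary form, which is also the average negative log-likelihood of a Bernoulli model.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression targets."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for labels in {0, 1}."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def hinge_loss(y_true, scores):
    """Average hinge loss for labels in {-1, +1} and raw decision scores."""
    return sum(max(0.0, 1 - y * s) for y, s in zip(y_true, scores)) / len(y_true)

# Toy inputs, made up for illustration.
mse_value = mse([3.0, 5.0, 2.0], [2.5, 5.0, 4.0])
bce_value = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
hinge_value = hinge_loss([1, -1, 1], [2.0, -0.5, 0.3])
```

In the hinge example, the first point is classified correctly with a margin of at least 1 and contributes nothing; the other two are on the correct side but inside the margin, so they are still penalized.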

************************************

## ***Residuals in Regression Analysis***

When we perform regression analysis to model relationships between variables, we often use a line (or curve) of best fit to approximate the underlying trend in the data. However, real-world data is rarely perfectly clean and rarely follows the theoretical model exactly. Residuals come into play when we consider the differences between the observed data points and the predictions made by the regression model.

### What are Residuals?

A **residual** is the vertical distance between an individual data point and the regression line. In other words, it's the difference between the actual observed value and the value predicted by the regression model. Each data point has its own residual, and it tells us how much that particular data point deviates from the regression line.

### Interpreting Residuals

Residuals provide valuable insights into the quality and appropriateness of our regression model. Here's how to interpret them:

- **Positive Residual**: If a residual is positive, it means that the actual observed value is higher than the value predicted by the regression line at that particular point. This suggests that the model is underestimating the value for that data point.

- **Negative Residual**: Conversely, a negative residual indicates that the actual observed value is lower than the predicted value. In this case, the model is overestimating the value.

- **Zero Residual**: If a data point lies exactly on the regression line, its residual is zero. This means that the model's prediction matches the observed value perfectly for that point.
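
The three cases above can be seen on a tiny example. This is a sketch under stated assumptions: `fit_line` is a hypothetical helper implementing ordinary least squares for a single feature, and the four data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 5.5, 8.0]
slope, intercept = fit_line(xs, ys)

# Residual = observed value minus predicted value.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Here the second point has a positive residual (the line underestimates it) and the third has a negative one (the line overestimates it). For a least-squares line with an intercept, the residuals always sum to zero.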

### Importance of Residuals

Residual analysis is an essential part of regression modeling for several reasons:

1. **Model Validation**: By examining the distribution of residuals, we can assess how well the regression model fits the data. If the residuals are randomly scattered around zero and have constant variance, it's an indication that the model is appropriate.

2. **Detection of Patterns**: Patterns in the residual plot, such as a curved shape or a fan-like pattern, might indicate that the model is not capturing some underlying relationships between variables.

3. **Outlier Detection**: Residuals that are significantly larger or smaller than others could represent outliers or anomalies in the data. Outliers can have a substantial impact on the regression model's performance.
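
As a rough illustration of pattern detection, the sketch below fits a straight line to data that is actually quadratic. The helper `fit_line` and the data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]  # the true relationship is curved, not linear

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

The residuals come out positive at both ends and negative in the middle. That U-shape is the kind of curved pattern that suggests a linear model is missing part of the relationship.
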
************************************


## ***Performance Metrics***

In statistics, an *error* is the difference between a measurement and reality. For a model, the measurement is the prediction and reality is the observed value. There may not be any difference at all, but there's usually *something* not quite right, and we need to quantify it to judge our model. One way to do that is the **mean absolute error (MAE)**. The absolute error is the size of the error in a single prediction, ignoring its sign, and the mean absolute error is the average of those sizes over all predictions.
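
In the notation used for the other metrics in this section, the mean absolute error can be written as $\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$. A minimal sketch in Python follows; the function name and the numbers are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the prediction errors, ignoring their sign."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 1, 0 and 3, so the mean is 4 / 3.
mae = mean_absolute_error([10.0, 12.0, 15.0], [11.0, 12.0, 12.0])
```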

## More Performance Metrics

### Mean Squared Error (MSE)

The **mean squared error (MSE)** is another widely used performance metric, especially in regression tasks. It measures the average squared difference between the predicted values and the true values. MSE gives more weight to larger errors, making it sensitive to outliers.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

### Root Mean Squared Error (RMSE)

The **root mean squared error (RMSE)** is the square root of the MSE. Taking the square root puts the metric back in the same units as the target variable, which makes it easier to interpret. When the errors average out to roughly zero, RMSE approximates the standard deviation of the prediction errors.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
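
The two formulas above translate directly into code. This is a small sketch with made-up numbers; the function names are not taken from any library.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

y_true = [10.0, 12.0, 15.0]
y_pred = [11.0, 12.0, 12.0]

mse_value = mse(y_true, y_pred)    # squared errors are 1, 0 and 9
rmse_value = rmse(y_true, y_pred)
mae_value = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)
```

The single error of 3 contributes 9 of the 10 units of squared error, which is why RMSE (about 1.83) ends up larger than the mean absolute error (about 1.33) on the same predictions.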


### R-squared (Coefficient of Determination)

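The **R-squared** value, also known as the coefficient of determination, quantifies the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. A value of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean $\bar{y}$. For a least-squares linear regression with an intercept, evaluated on its own training data, it lies between 0 and 1; on new data, or for other models, it can be negative.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$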
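
A short sketch of the formula, using made-up values. It also shows that predictions worse than the mean produce a score below 0.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # every prediction exact
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
```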

These metrics are commonly used for classification tasks:

- **Accuracy**: Measures the proportion of correctly classified instances out of the total instances.
- **Precision**: Measures the proportion of true positive predictions out of all positive predictions.
- **Recall**: Measures the proportion of true positive predictions out of all actual positive instances.
- **F1-Score**: The harmonic mean of precision and recall, which provides a single balanced measure of a model's performance.
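
All four metrics can be computed from the counts of true/false positives and negatives. The sketch below assumes binary labels with `1` as the positive class; the function name and the labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

With these labels there are 3 true positives, 2 false positives, 1 false negative and 2 true negatives, so precision is 3/5 while recall is 3/4.
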
************************************

# Data Concepts

## ***Leakage***

**Leakage** is the use of data in training your model that would not typically be available when making predictions. For example, suppose we want to predict property prices in USD but include property prices in Mexican Pesos as a feature. If the exchange rate is fixed or nearly constant, the model can recover the target almost exactly from that feature, so it will show a low error on the training data. That error will not reflect its performance on real-world data, where the peso price is just as unknown as the dollar price.

Leakage is usually inadvertent. It can significantly distort the performance of your model during training and evaluation, leading to overly optimistic results that do not generalize well to unseen data.

### Types of Leakage

There are two main types of leakage to be aware of:

1. **Target Leakage**: This occurs when information from the target variable is available to the model during training, but it would not be available at prediction time. Including such information can make your model appear more accurate during training, but it will perform poorly in real-world scenarios.

   Example: Imagine you're predicting whether a credit card transaction is fraudulent or not. If the training data includes a feature recording whether the transaction was later marked as fraudulent, that feature leaks the target. The model will learn to rely on a signal that does not exist at the moment a real transaction has to be scored.

2. **Data Leakage**: This happens when your model has access to features that it should not have during training. These features could be directly or indirectly related to the target variable.

   Example: Suppose you're predicting the stock market, and you include future stock prices as features in your training data. This would lead to data leakage because, in a real-world scenario, you wouldn't have access to future stock prices at the time of prediction.

### Consequences of Leakage

Leakage can have several negative consequences:

- **Overfitting**: Your model could learn to exploit the leaked information to fit the training data extremely well, resulting in poor generalization to new data.
  
- **Unrealistic Performance**: Leakage can make your model appear highly accurate during training and validation, but its performance will degrade significantly when faced with real-world scenarios.

### Preventing Leakage

To prevent leakage, follow these practices:

1. **Data Splitting**: Always split your data into training, validation, and testing sets before preprocessing or feature engineering. Leakage can occur if these steps are performed on the entire dataset before splitting.

2. **Feature Engineering**: Avoid including any information that wouldn't be available during prediction. For instance, remove features derived from the target variable or future data.

3. **Time-Based Splits**: If dealing with time-series data, ensure that your validation and test sets come after your training data. This prevents future information from leaking into the training process.
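
The first and third practices can be sketched together. The example assumes the rows are already ordered by time and uses standardization as the preprocessing step; the values, the split point, and the helper name `standardize` are all made up for illustration.

```python
import statistics

# Rows assumed to be in chronological order (oldest first).
values = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0, 42.0, 38.0]
split = 6  # time-based split: the first six rows train, the later rows test
train, test = values[:split], values[split:]

# Fit the preprocessing statistics on the training split only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)

def standardize(xs, mean, std):
    """Scale values using statistics supplied by the caller."""
    return [(x - mean) / std for x in xs]

train_scaled = standardize(train, train_mean, train_std)
test_scaled = standardize(test, train_mean, train_std)  # reuse train statistics

# Leaky alternative, shown only for contrast: statistics from all rows
# let information about the test period flow into training.
leaky_mean = statistics.fmean(values)
```

The leaky mean (23.125) differs from the training-only mean (17.5) because it already "knows" about the later, higher values.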



## ***Imputation***

Datasets are often incomplete, containing missing values in one or more rows or columns. When dealing with these missing entries, it's important to address them appropriately to ensure accurate and reliable analysis. **Imputation** is the process of estimating and filling in these missing values with educated guesses.

### Importance of Imputation

Imputation is crucial for several reasons:

1. **Preserving Data Integrity**: Removing rows or columns with missing values can lead to loss of valuable information. Imputation allows you to retain as much data as possible.

2. **Maintaining Statistical Power**: Dropping incomplete rows shrinks the sample. Imputing missing values keeps the sample size up, which helps your analysis retain statistical power and can reduce the bias that arises when the dropped rows differ systematically from the rest.

3. **Algorithm Compatibility**: Many machine learning algorithms require complete datasets. Imputation allows you to use these algorithms effectively.

### Imputation Techniques

There are various techniques for imputing missing values:

1. **Mean/Median Imputation**: Replacing missing values with the mean or median of the non-missing values in the column. This is a simple method, but it assumes the values are missing completely at random, and it reduces the variability of the column.

2. **Mode Imputation**: Replacing missing categorical values with the mode (most frequent value) of the non-missing values in the column.

3. **Regression Imputation**: Predicting missing values using regression models based on other variables.

4. **K-Nearest Neighbors (KNN) Imputation**: Filling in missing values by considering the values of the k-nearest neighbors.

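5. **Interpolation and Extrapolation**: For time-series data, using previous or subsequent values to estimate missing values, for example by carrying the last observation forward or interpolating between neighboring observations.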
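
A few of these techniques are simple enough to sketch with the standard library. The code below is an illustration, not the scikit-learn implementation: missing values are assumed to be represented by `None`, and the function names are hypothetical.

```python
import statistics

def impute(values, strategy="mean"):
    """Fill None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward (for ordered, time-series data)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observation appears
    return filled

ages = [20.0, None, 30.0, 100.0, None]
colors = ["red", None, "blue", "red"]

ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
colors_mode = impute(colors, "mode")
series_ffill = forward_fill([None, 1.0, None, None, 4.0])
```

Note how the single large value (100) pulls the mean fill up to 50 while the median fill stays at 30, which is one reason the median is often preferred for skewed columns.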

### Caveats of Imputation

While imputation is valuable, it's important to be cautious:

- **Bias Introduction**: Imputation can introduce bias if not done carefully. The imputed values might not accurately represent the true values.
- **Underestimation of Uncertainty**: Imputed values often underestimate the uncertainty associated with missing data.
- **Distorted Relationships**: Imputed data might distort the relationships between variables, impacting downstream analyses.
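
The underestimation of uncertainty is easy to demonstrate for mean imputation. In this sketch (made-up numbers, `None` marking missing entries), every filled value sits exactly at the mean, so the spread of the column shrinks.

```python
import statistics

with_gaps = [20.0, None, 30.0, 100.0, None]
observed = [v for v in with_gaps if v is not None]

fill = statistics.fmean(observed)
imputed = [fill if v is None else v for v in with_gaps]

# Population variance before and after mean imputation.
var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
```

The variance drops from about 1266.7 to 760 even though no new information was added, so anything downstream that relies on the column's spread will look more certain than it should.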



[Sklearn-Imputation](https://scikit-learn.org/stable/modules/impute.html#impute)

## ***Generalization***

Notice that we tested the model with a dataset that's *different* from the one we used to train the model. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept **generalization**. By testing your model with different data than you used to train it, you're checking to see if your model can generalize. Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.

### The Goal of Generalization

The ultimate goal of machine learning is to create models that can generalize well to unseen data. A model that generalizes well can accurately predict outcomes for new examples drawn from the same underlying distribution as the training data. When new data comes from a noticeably different distribution, performance usually degrades. Generalization ensures that your model is not merely memorizing the training data but is learning meaningful patterns that apply more broadly.

### Overfitting and Underfitting

Two common challenges related to generalization are **overfitting** and **underfitting**:

- **Overfitting**: Occurs when a model learns the noise and random fluctuations in the training data rather than the underlying patterns. As a result, the model performs well on the training data but poorly on new, unseen data.

- **Underfitting**: Happens when a model is too simple to capture the underlying patterns in the data. It performs poorly on both the training data and new data.
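
The contrast can be sketched on a toy problem. Everything here is an assumption made for illustration: the training targets are `2 * x` plus alternating noise of +1 and -1, the test targets are noise-free, and the three "models" are a constant, a least-squares line, and a 1-nearest-neighbor memorizer.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]   # 2 * x with noise of +1 / -1
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.0, 3.0, 5.0, 7.0, 9.0]         # noise-free 2 * x

# Underfit: ignore x and always predict the training mean.
mean_y = sum(train_y) / len(train_y)
def constant_model(x):
    return mean_y

# Reasonable fit: least-squares line.
x_mean = sum(train_x) / len(train_x)
sxy = sum((x - x_mean) * (y - mean_y) for x, y in zip(train_x, train_y))
sxx = sum((x - x_mean) ** 2 for x in train_x)
slope = sxy / sxx
intercept = mean_y - slope * x_mean
def line_model(x):
    return slope * x + intercept

# Overfit: memorize the training set and return the nearest point's target.
def memorizer(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

results = {}
for name, model in [("constant", constant_model), ("line", line_model), ("memorizer", memorizer)]:
    train_error = mse(train_y, [model(x) for x in train_x])
    test_error = mse(test_y, [model(x) for x in test_x])
    results[name] = (train_error, test_error)
```

The constant model is poor on both sets (underfitting). The memorizer has zero training error but a much larger test error than the line, because it has reproduced the noise (overfitting).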

### Techniques for Generalization

To achieve better generalization, consider the following techniques:

1. **Cross-Validation**: Use cross-validation to assess your model's performance on multiple subsets of the data. This helps evaluate how well your model generalizes.

2. **Regularization**: Apply techniques like L1 or L2 regularization to prevent overfitting by adding penalty terms to the model's loss function.

3. **Feature Selection**: Choose relevant features and remove irrelevant ones to prevent the model from fitting noise.

4. **Ensemble Methods**: Combine multiple models to improve overall performance and reduce overfitting.
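
As an illustration of the first technique, here is a minimal k-fold cross-validation sketch. It is not the scikit-learn implementation: the helper names are hypothetical, the folds are contiguous and unshuffled, and the "model" is a baseline that always predicts the mean of its training fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

def cross_val_mse(ys, k):
    """Average validation MSE of a mean-only baseline across k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)  # "fit"
        fold_mse = sum((ys[i] - prediction) ** 2 for i in val_idx) / len(val_idx)
        scores.append(fold_mse)
    return sum(scores) / len(scores)

ys = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
folds = list(k_fold_indices(len(ys), 3))
cv_score = cross_val_mse(ys, 3)
```

Each sample is used for validation exactly once. In practice the rows are usually shuffled before being split into folds, unless they are ordered in time.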

### Balancing Generalization and Complexity

Striking the right balance between a model's complexity and its ability to generalize is key. A model that's too complex might overfit, while one that's too simple might underfit. Regularization techniques and model evaluation help find this balance.


# Model Concepts

## ***Hyperparameters***

When we instantiate an estimator, we can pass keyword arguments that will dictate its structure. These arguments are called **hyperparameters**. For example, when we defined our decision tree estimator, we limited how many levels the tree could have using the `max_depth` keyword. This is in contrast to **parameters**, which are the numbers that our model uses to make predictions based on features. Parameters are not set by hand: the training algorithm adjusts them to fit the data, and the values that give the best fit are the ones the trained model keeps.

### The Distinction between Parameters and Hyperparameters

**Parameters** are learned by the model during training. They are the coefficients or weights that are adjusted to minimize the difference between the model's predictions and the actual target values. For example, in linear regression, the coefficients of the features are parameters.

**Hyperparameters**, on the other hand, are set before the training process begins. They determine the high-level characteristics of the model, such as its complexity, capacity, or behavior during training. Adjusting hyperparameters impacts how the model learns and generalizes.
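
The distinction can be seen in a toy estimator. `TinyRidge` below is a hypothetical class written for this example (one feature, no intercept, L2 penalty); it only mimics the general shape of an estimator, with the hyperparameter passed at construction and the parameter learned in `fit`.

```python
class TinyRidge:
    """One-feature ridge regression without an intercept (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha    # hyperparameter: chosen before training
        self.weight_ = None   # parameter: learned from the data in fit()

    def fit(self, xs, ys):
        # Closed-form minimizer of sum((y - w * x)^2) + alpha * w^2.
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        self.weight_ = sxy / (sxx + self.alpha)
        return self

    def predict(self, xs):
        return [self.weight_ * x for x in xs]

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2 * x

weak = TinyRidge(alpha=0.0).fit(xs, ys)     # no penalty
strong = TinyRidge(alpha=14.0).fit(xs, ys)  # heavy penalty
```

With no penalty the learned weight is 2.0, matching the data. With `alpha=14.0` the same training procedure learns a weight of 1.0: the hyperparameter was fixed in advance, and it changed which parameter value training settled on.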

### Role of Hyperparameters

Hyperparameters play a crucial role in machine learning:

1. **Model Behavior**: Hyperparameters influence how a model learns and generalizes. They determine the complexity of the model, which affects its ability to fit the training data and generalize to new data.

2. **Avoiding Overfitting**: By adjusting hyperparameters, you can control overfitting, underfitting, and the balance between the two. Techniques like regularization involve tuning hyperparameters.

3. **Computational Efficiency**: Some hyperparameters affect how the model is trained and how computations are distributed across resources.

### Common Hyperparameters

Here are some common examples of hyperparameters and their impact on models:

- **Learning Rate**: Affects the step size taken during gradient descent in optimization algorithms.

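- **Number of Neighbors (K) in K-Nearest Neighbors**: Controls how local the predictions are. A small K follows local patterns closely (and can fit noise), while a large K averages over more points and gives smoother predictions.

- **Number of Trees in Random Forest**: More trees generally give more stable predictions by averaging out the variance of individual trees, at the cost of longer training and prediction time.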

- **Regularization Strength**: Affects the penalty applied to large coefficients in linear models.

### Hyperparameter Tuning

Finding the right set of hyperparameters is crucial for model performance. This process is called **hyperparameter tuning**. It involves selecting the best hyperparameters based on validation data or cross-validation. Grid search, random search, and Bayesian optimization are common methods used for tuning.
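
A grid search is the simplest of these methods: try every candidate value and keep the one with the best validation score. The sketch below tunes K for a hand-written one-feature K-nearest-neighbors regressor using a single hold-out validation set. The data, the grid, and the function names are assumptions made for illustration.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def validation_mse(train_x, train_y, val_x, val_y, k):
    """Mean squared error on the validation set for a given k."""
    errors = [(y - knn_predict(train_x, train_y, x, k)) ** 2 for x, y in zip(val_x, val_y)]
    return sum(errors) / len(errors)

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
val_x = [0.5, 1.5, 2.5, 3.5, 4.5]
val_y = [1.0, 3.0, 5.0, 7.0, 9.0]

grid = [1, 2, 3, 6]  # candidate values for the hyperparameter k
scores = {k: validation_mse(train_x, train_y, val_x, val_y, k) for k in grid}
best_k = min(scores, key=scores.get)
```

On this toy data `k=1` follows the noise, `k=6` predicts the overall mean for every input, and `k=2` gives the lowest validation error. With real data, the score for each candidate would usually come from cross-validation rather than one hold-out set.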
