## 1

**Purpose of Grid Search CV:**
Grid Search Cross-Validation (CV) systematically works through every combination of hyperparameter values in a predefined grid, cross-validates each combination, and selects the hyperparameters that yield the best average validation performance.

**How It Works:**
1. **Specify Hyperparameter Grid:** Define a set of hyperparameters and their possible values to search through.
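2. **Exhaustive Search:** For each combination of hyperparameters:
   - Split the training data into k folds (the same folds are normally reused for every combination so that scores are comparable).
   - Train the model on k-1 folds and validate it on the remaining fold.
   - Repeat so that each fold serves as the validation fold once, then average the performance metric (e.g., accuracy, F1-score) over the k runs.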
3. **Select Best Parameters:** Choose the hyperparameter combination with the best average performance across the k-folds.
4. **Refit Model:** Train the final model on the entire training dataset using the best hyperparameters.
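
The four steps above can be sketched from scratch using only the standard library. This is a minimal illustration under stated assumptions, not how any particular library implements it: the toy 1-D k-nearest-neighbours model, the `threshold` hyperparameter, the data, and all function names are hypothetical.

```python
from itertools import product
from statistics import mean

def knn_predict(train, x, k, threshold):
    """Predict 1 if the share of positive labels among the k nearest points reaches threshold."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    positive_share = sum(label for _, label in nearest) / len(nearest)
    return 1 if positive_share >= threshold else 0

def cv_accuracy(data, n_folds, **params):
    """Step 2: mean validation accuracy over n_folds folds for one combination."""
    scores = []
    for fold in range(n_folds):
        val = [p for i, p in enumerate(data) if i % n_folds == fold]
        train = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, **params) == y for x, y in val)
        scores.append(hits / len(val))
    return mean(scores)

def grid_search(data, param_grid, n_folds=3):
    """Steps 2-3: score every combination, return the best one and all scores."""
    names = list(param_grid)
    scores = {}
    for combo in product(*param_grid.values()):
        scores[combo] = cv_accuracy(data, n_folds, **dict(zip(names, combo)))
    best_combo = max(scores, key=scores.get)
    return dict(zip(names, best_combo)), scores

# Toy data (feature, label): larger features tend to be positive, with one noisy point.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 0),
        (3.5, 1), (4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1), (6.0, 1)]

# Step 1: the hyperparameter grid (3 x 3 = 9 combinations).
param_grid = {"k": [1, 3, 5], "threshold": [0.3, 0.5, 0.7]}

best_params, scores = grid_search(data, param_grid)

# Step 4: "refit" on all training data. For k-NN that simply means keeping every point.
def final_model(x):
    return knn_predict(data, x, **best_params)

print("best hyperparameters:", best_params)
```

Note that ties between combinations are broken by grid order here, which is an arbitrary choice of this sketch.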

## 2

**Grid Search CV:**
- **Search Method:** Exhaustively searches all possible combinations of hyperparameters.
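- **Pros:** Guarantees finding the best-scoring combination within the specified grid, as measured by cross-validation. Values outside the grid are never tried.
- **Cons:** Computationally expensive and time-consuming: the number of model fits is the product of the number of values per hyperparameter, multiplied by the number of folds.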

**Randomized Search CV:**
- **Search Method:** Samples a fixed number of hyperparameter combinations at random from the specified grid (or from specified distributions).
- **Pros:** Faster and more efficient than grid search, especially when dealing with a large hyperparameter space.
- **Cons:** Does not guarantee finding the absolute best combination but usually finds a good enough solution.

**When to Choose:**
- **Grid Search CV:** When the hyperparameter space is small or when computational resources and time are not a constraint.
- **Randomized Search CV:** When dealing with a large hyperparameter space or limited computational resources and time.
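
To make the cost difference concrete, the sketch below counts model fits for both strategies. The hyperparameter names and values, the budget `n_iter`, and the fold count are illustrative assumptions, not recommendations.

```python
import random
from itertools import product

# Hypothetical grid: 5 x 4 x 4 = 80 combinations.
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 5, 10],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
}

# Grid search: every combination is a candidate.
grid_candidates = list(product(*param_grid.values()))

# Randomized search: a fixed budget of combinations, drawn without replacement.
rng = random.Random(0)   # seeded so the sample is reproducible
n_iter = 10
random_candidates = rng.sample(grid_candidates, n_iter)

n_folds = 5
print("grid search fits:      ", len(grid_candidates) * n_folds)    # 80 * 5 = 400
print("randomized search fits:", len(random_candidates) * n_folds)  # 10 * 5 = 50
```

Adding one more hyperparameter with four values would quadruple the grid-search cost, while the randomized budget stays at `n_iter` combinations.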

## 3

**Data Leakage:**
Data leakage occurs when information from outside the training dataset is used to create the model, resulting in overly optimistic performance estimates and poor generalization to new data.

**Why It’s a Problem:**
- It leads to models that look strong during training and validation but fail to generalize to unseen data, where the leaked information is not available.
- It can cause incorrect conclusions about the model’s performance and effectiveness.

**Example:**
Consider predicting customer churn using features like customer tenure and a flag for whether the customer churned last month. If any feature in the training data was recorded after the churn event (i.e., it contains post-churn information that would not exist at prediction time), the model learns to predict churn from the future, and its measured performance is misleadingly high.

## 4

**Preventing Data Leakage:**
1. **Separate Data Properly:** Ensure that the training, validation, and test datasets are correctly separated and no data is shared between them.
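2. **Feature Engineering:** Perform data-dependent feature engineering (e.g., scaling, encoding) within cross-validation: fit each transformation on the training folds only, then apply it to the validation fold, so that no information from the validation set leaks into the training process.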
3. **Temporal Ordering:** For time-series data, maintain temporal order by training on past data and testing on future data.
4. **Careful Feature Selection:** Avoid using features that won’t be available at prediction time or that contain information from the future.
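
Points 2 and 3 can be illustrated together with a small sketch: a temporal split followed by standardization whose statistics come from the training portion only. The series values, the split position, and the helper names `fit_scaler` and `transform` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn standardization statistics (mean, std) from the given values."""
    return mean(values), pstdev(values) or 1.0   # fall back to 1.0 if std is zero

def transform(values, scaler):
    """Apply previously learned statistics without re-estimating them."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Observations already sorted by time (oldest first).
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 30.0, 32.0]

# Temporal split: train on the past, validate on the future.
split = 6
train, valid = series[:split], series[split:]

# Correct: statistics come from the training portion only.
scaler = fit_scaler(train)
train_scaled = transform(train, scaler)
valid_scaled = transform(valid, scaler)

# Leaky: statistics computed on all rows let future values shape the training features.
leaky_scaler = fit_scaler(series)

print("train-only mean:", scaler[0])        # 12.0
print("leaky mean:     ", leaky_scaler[0])  # 16.75
```

The gap between the two means shows how much the future observations would have influenced the scaled training features.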

## 5

**Confusion Matrix:**
A confusion matrix is a table that summarizes the performance of a classification model by comparing the actual labels with the predicted labels. For binary classification it consists of four components:
- **True Positives (TP):** Positive cases correctly predicted as positive.
- **True Negatives (TN):** Negative cases correctly predicted as negative.
- **False Positives (FP):** Negative cases incorrectly predicted as positive.
- **False Negatives (FN):** Positive cases incorrectly predicted as negative.

**Performance Insights:**
- Provides a detailed breakdown of correct and incorrect classifications.
- Helps identify types of errors the model makes (e.g., false positives vs. false negatives).
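
As a rough sketch, the four counts can be tallied directly from paired labels. The label lists and the function name `confusion_counts` below are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)   # {'TP': 3, 'TN': 4, 'FP': 1, 'FN': 2}
```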

## 6

**Precision:**
- **Definition:** The ratio of correctly predicted positive observations to the total predicted positives.
- **Formula:** $\text{Precision} = \frac{TP}{TP + FP}$
- **Focus:** Measures the accuracy of positive predictions.

**Recall (Sensitivity):**
- **Definition:** The ratio of correctly predicted positive observations to the actual positives.
- **Formula:** $\text{Recall} = \frac{TP}{TP + FN}$
- **Focus:** Measures the model’s ability to capture all positive cases.
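
A short worked example shows how the two can differ. The counts are hypothetical: suppose a model produces $TP = 3$, $FP = 1$, and $FN = 2$.

$$
\text{Precision} = \frac{TP}{TP + FP} = \frac{3}{3 + 1} = 0.75,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{3}{3 + 2} = 0.60
$$

Three out of four positive predictions are correct, yet only three of the five actual positives are found.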

## 7

**Interpreting Errors:**
- **False Positives (FP):** Instances where the model incorrectly predicts the positive class. High FP indicates the model is too lenient in classifying positives.
- **False Negatives (FN):** Instances where the model incorrectly predicts the negative class. High FN indicates the model is too strict in classifying positives.
- By analyzing the counts of FP and FN, you can determine if your model tends to favor one class over the other and adjust accordingly (e.g., adjusting the decision threshold).
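
The effect of moving the decision threshold can be sketched as follows. The scores are hypothetical predicted probabilities, and `fp_fn_at` is an illustrative helper, not part of any library.

```python
def fp_fn_at(threshold, y_true, scores):
    """Count false positives and false negatives when predicting 1 for score >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return fp, fn

y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.55, 0.70, 0.85, 0.95]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = fp_fn_at(threshold, y_true, scores)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
# threshold=0.3: FP=3, FN=0   (lenient: no missed positives, more false alarms)
# threshold=0.5: FP=1, FN=1
# threshold=0.7: FP=0, FN=2   (strict: no false alarms, more missed positives)
```

Lowering the threshold trades false negatives for false positives, and raising it does the reverse.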

## 8

**Common Metrics:**
1. **Accuracy:**
   - **Formula:** $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
   - Measures the overall correctness of the model.

2. **Precision:**
   - **Formula:** $\text{Precision} = \frac{TP}{TP + FP}$
   - Measures the accuracy of positive predictions.

3. **Recall (Sensitivity):**
   - **Formula:** $\text{Recall} = \frac{TP}{TP + FN}$
   - Measures the ability to capture all positive cases.

4. **F1-Score:**
   - **Formula:** $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
   - Harmonic mean of precision and recall, balancing both metrics.

5. **Specificity:**
   - **Formula:** $\text{Specificity} = \frac{TN}{TN + FP}$
   - Measures the ability to correctly identify negative cases (the true negative rate).
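
The five formulas can be collected into one small function. This is a sketch: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero (a convention of this sketch)."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics from binary confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }

metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
# accuracy     0.850
# precision    0.889
# recall       0.800
# f1           0.842
# specificity  0.900
```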

## 9

**Relationship:**
- **Accuracy:** Measures the proportion of correct predictions (both true positives and true negatives) among all predictions.
- **Calculation:** Directly derived from the confusion matrix values:
   $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- **Limitations:** Accuracy can be misleading in imbalanced datasets where one class dominates.
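
The limitation is easy to demonstrate with a hypothetical dataset of 5 positives and 95 negatives, scored against a baseline that always predicts the majority class.

```python
# Hypothetical imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print("accuracy:", accuracy)  # 0.95
print("recall:  ", recall)    # 0.0
```

Accuracy is 95% even though not a single positive case is detected, which is why recall, precision, or F1 is usually reported alongside it on imbalanced data.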

## 10

**Identifying Biases and Limitations:**
- **Class Imbalance:** A high number of errors on the minority class (e.g., many false negatives when positives are rare) might indicate a bias towards the majority class.
- **Error Patterns:** Consistent errors in certain classes can indicate areas where the model is underperforming.
- **Threshold Adjustment:** Use precision-recall trade-offs to adjust the decision threshold based on the type of errors you want to minimize (e.g., reducing false negatives in medical diagnoses).
- **Performance Across Classes:** Analyze metrics like precision, recall, and F1-score for each class to ensure balanced performance.
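
A per-class breakdown can be sketched by treating each class in turn as the positive class (one-vs-rest). The labels and the function name `per_class_report` are invented for illustration, and a zero denominator is reported as 0.0 by convention in this sketch.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Precision, recall and F1 per class, treating each class as positive in turn."""
    pairs = Counter(zip(y_true, y_pred))   # (actual, predicted) -> count
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "dog", "cat", "bird"]

report = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
```

In this toy data, "bird" has perfect precision but a recall of only 0.5, a pattern that a single overall accuracy figure would hide.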
