**1) What is the purpose of grid search CV in machine learning, and how does it work?**

Grid search CV (Cross-Validation) is a technique used in machine learning for hyperparameter tuning.

**Purpose:**
- Find optimal hyperparameters for a model
- Improve model performance
- Reduce the risk of overfitting to a single validation split by scoring each candidate with cross-validation
- Automate the process of model tuning

**How it works:**
- Define a grid of hyperparameter values
- Create models with all possible combinations of these values
- Evaluate each model using cross-validation
- Select the best-performing combination of hyperparameters
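The four steps above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration, not a recommended configuration: the dataset (iris), the estimator (`SVC`), and the grid values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: define the grid (illustrative values, not tuned recommendations).
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Steps 2-3: each of the 3 x 2 = 6 combinations is fitted and scored
# with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Step 4: the combination with the best mean cross-validation score.
print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", round(search.best_score_, 3))
print("Combinations evaluated:", len(search.cv_results_["params"]))
```

With 6 combinations and 5 folds, the search fits 30 models (plus one final refit on all the data), which is why the cost grows quickly as the grid gets larger.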

**2) Describe the difference between grid search CV and randomized search CV, and when might you choose one over the other?**

Grid search CV and randomized search CV are both hyperparameter tuning techniques, but they differ in their approach:

**Grid Search CV:**
- Exhaustively searches through a specified set of hyperparameters
- Tests all possible combinations

**Randomized Search CV:**
- Randomly samples from the hyperparameter space
- Tests a specified number of combinations

Key differences:

**Search strategy:**
- Grid: Systematic and complete
- Random: Stochastic and partial

**Efficiency:**
- Grid: Can be computationally expensive
- Random: Often more efficient, especially for large parameter spaces

**Coverage:**
- Grid: May miss optimal values between grid points
- Random: Can find unexpected combinations, better for continuous parameters

When to choose:

**Choose Grid Search when:**
- Parameter space is small
- You have specific values to test
- Computational resources are not a constraint

**Choose Randomized Search when:**
- Parameter space is large or continuous
- You're unsure about the range of good parameters
- Computational resources are limited
- You want to explore the parameter space more broadly
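As a rough sketch of the randomized approach, scikit-learn's `RandomizedSearchCV` draws a fixed number of candidates (`n_iter`) from distributions instead of enumerating a grid. The estimator, the log-uniform ranges, and `n_iter=10` below are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous ranges instead of fixed grid points (ranges are assumptions).
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

# Only n_iter candidates are sampled and scored, regardless of how large
# the underlying parameter space is.
search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=0,  # fixed seed so the sampled candidates are reproducible
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Candidates evaluated:", len(search.cv_results_["params"]))
```

The compute budget is set directly by `n_iter`, so adding another hyperparameter does not multiply the number of fits the way it does in a grid.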

**3) What is data leakage, and why is it a problem in machine learning? Provide an example.**

Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates.

Why it's a problem:
- Overestimates model performance
- Creates unrealistic expectations
- Model fails to generalize to new, unseen data
- Can lead to incorrect business decisions

Example:

In a credit default prediction model:
- Dataset includes a feature "account_closed" (Yes/No)
- This feature is highly correlated with default status
- However, account closure often happens after a default

**4) How can you prevent data leakage when building a machine learning model?**

Consider the credit default example above.

The problem:

Using "account_closed" in the model leaks information from the future, making predictions unrealistically accurate but useless for real-world applications.

Prevention:
- Careful feature engineering
- Proper train-test splitting
- Time-based validation for time-series data
- Cross-validation
- Domain expertise to identify potential leakage sources
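The splitting and cross-validation points can be illustrated with a related form of leakage: preprocessing fitted on the full dataset before splitting. This is a different case from the "account_closed" feature, which needs domain review rather than code. The sketch below assumes scikit-learn and uses its built-in breast cancer dataset only as a stand-in. Placing the scaler inside a pipeline means it is refitted on the training portion of every fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the pipeline, so in each fold its mean and variance
# are computed from the training portion only. The held-out portion never
# influences the preprocessing step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Calling `StandardScaler().fit_transform(X)` on the whole dataset before cross-validation would let statistics from the validation folds leak into training.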

**5) What is a confusion matrix, and what does it tell you about the performance of a classification model?**

A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted and actual class outcomes.

Structure (for binary classification):

| | Predicted Positive | Predicted Negative |
|-------------------|--------------------|--------------------|
| Actual Positive   | True Positive (TP)  | False Negative (FN)|
| Actual Negative   | False Positive (FP) | True Negative (TN) |

What it tells you:

**Correct predictions:**
- TP: Correctly predicted positives
- TN: Correctly predicted negatives

**Errors:**
- FP: Incorrectly predicted positives
- FN: Incorrectly predicted negatives

From these four counts we can derive important metrics such as accuracy, precision, recall, and F1-score.
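A small sketch of building this matrix with scikit-learn's `confusion_matrix`. The labels are made-up toy data. By default scikit-learn orders classes in sorted order (0 before 1), which puts TN in the upper-left cell, so `labels=[1, 0]` is passed here to match the layout of the table above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes.
# labels=[1, 0] puts the positive class first: [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()

print(cm)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")  # TP=3 FN=1 FP=1 TN=5
```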

**6) Explain the difference between precision and recall in the context of a confusion matrix.**

**Precision:**
- The proportion of correct positive predictions among all positive predictions
- Formula: **TP / (TP + FP)**
- Focus: Measures the accuracy of positive predictions
- Interpretation: "When the model predicts positive, how often is it correct?"

**Recall (also known as Sensitivity):**
- The proportion of actual positives that were correctly identified
- Formula: **TP / (TP + FN)**
- Focus: Measures the model's ability to find all positive instances
- Interpretation: "Of all the actual positive cases, how many did the model correctly identify?"

**Key differences:**
- Precision prioritizes minimizing false positives
- Recall prioritizes minimizing false negatives
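To make the contrast concrete, here is a small sketch using scikit-learn's `precision_score` and `recall_score` on made-up labels. The two prediction vectors are hypothetical: one model predicts positives liberally, the other conservatively.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Liberal model: finds most positives but raises two false alarms.
# TP=3, FN=1, FP=2
y_liberal = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Conservative model: no false alarms, but misses half the positives.
# TP=2, FN=2, FP=0
y_conservative = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

for name, y_pred in [("liberal", y_liberal), ("conservative", y_conservative)]:
    p = precision_score(y_true, y_pred)  # TP / (TP + FP)
    r = recall_score(y_true, y_pred)     # TP / (TP + FN)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# liberal: precision=0.60 recall=0.75
# conservative: precision=1.00 recall=0.50
```

The conservative model wins on precision and loses on recall, which is the trade-off the two formulas capture.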

**7) How can you interpret a confusion matrix to determine which types of errors your model is making?**

Interpreting a confusion matrix helps identify the types of errors your model is making.

Here's how to analyze it:

**True Positives (TP):** 
- Correct positive predictions
- Upper left cell

**True Negatives (TN):** 
- Correct negative predictions
- Lower right cell

**False Positives (FP) - Type I error:**
- Lower left cell
- Model incorrectly predicts positive
- Indicates over-prediction of positive class

**False Negatives (FN) - Type II error:**
- Upper right cell
- Model incorrectly predicts negative
- Indicates under-prediction of positive class

Interpretation:
- High FP: Model over-predicts the positive class (low precision and low specificity)
- High FN: Model under-predicts the positive class (low recall)
- Compare FP vs FN to see which error type is more common
- Look at the ratio of errors to correct predictions in each class

**8) What are some common metrics that can be derived from a confusion matrix, and how are they calculated?**

Derived metrics:
- Accuracy: (TP + TN) / Total
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1-score: Harmonic mean of precision and recall, 2 × (Precision × Recall) / (Precision + Recall)
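The formulas above can be written as one plain-Python helper. The function name `metrics_from_counts` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention (libraries differ on how they handle that case).

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the four counts.

    Returns 0.0 for any metric whose denominator is zero (an assumed
    convention for this sketch).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


result = metrics_from_counts(tp=3, fp=2, fn=1, tn=4)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.700
# precision: 0.600
# recall: 0.750
# f1: 0.667
```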

**9) What is the relationship between the accuracy of a model and the values in its confusion matrix?**

Accuracy is the proportion of correct predictions (both true positives and true negatives) among the total number of cases examined.

Accuracy formula:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)

  Where:
  - TP = True Positives
  - TN = True Negatives
  - FP = False Positives
  - FN = False Negatives

Relationship:
- Higher TP and TN values increase accuracy
- Higher FP and FN values decrease accuracy
- Accuracy is the sum of the diagonal elements (TP + TN) divided by the sum of all elements in the matrix
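The "diagonal over total" view also extends to more than two classes. A minimal NumPy sketch, using a made-up 3-class confusion matrix (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Hypothetical 3-class confusion matrix.
cm = np.array([
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
])

correct = np.trace(cm)  # sum of the diagonal: 5 + 6 + 4 = 15
total = cm.sum()        # all predictions: 20
accuracy = correct / total

print(f"accuracy = {correct} / {total} = {accuracy:.2f}")  # 15 / 20 = 0.75
```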

**10) How can you use a confusion matrix to identify potential biases or limitations in your machine learning model?**

A confusion matrix can reveal several biases and limitations in a machine learning model. Here's how to use it for this purpose:

**Class imbalance bias:**
- Compare row totals (actual class distributions)
- If one class is much larger, the model may be biased towards it

**Prediction bias:**
- Compare column totals to row totals
- If significantly different, model may over/under-predict certain classes

**Specific class performance:**
- Examine TP, TN, FP, FN for each class
- Poor performance in one class suggests limitations for that category

**Error types:**
- Compare FP and FN
- Consistent misclassification between specific classes indicates model limitations

**Overall accuracy vs. class-specific accuracy:**
- High overall accuracy with poor performance in minority classes suggests bias
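A deliberately extreme sketch of this effect, using made-up labels: with 95 negatives and 5 positives, a model that always predicts the majority class scores high accuracy while finding none of the minority class. It assumes scikit-learn's `accuracy_score`, `recall_score`, and `confusion_matrix`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class.
y_pred = [0] * 100

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_true, y_pred)
# average=None returns one recall value per class, in the order of `labels`.
per_class_recall = recall_score(y_true, y_pred, labels=[0, 1], average=None)

print(cm)                          # row totals show the 95 / 5 imbalance
print("accuracy:", accuracy)       # 0.95
print("recall per class:", per_class_recall)  # class 0: 1.0, class 1: 0.0
```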

**Precision and recall trade-offs:**
- Imbalanced precision and recall may indicate bias towards certain outcomes

**Cost-sensitive errors:**
- If certain misclassifications are more costly, assess their frequency

**Threshold bias:**
- Adjust classification threshold and observe changes in TP, FP, TN, FN
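The threshold check can be sketched with NumPy alone. The scores below are made-up predicted probabilities, and `counts_at` is a hypothetical helper. Lowering the threshold trades false negatives for false positives, and raising it does the opposite.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9])


def counts_at(threshold):
    """Return (TP, FP, TN, FN) when predicting positive for score >= threshold."""
    y_pred = scores >= threshold
    actual_pos = y_true == 1
    tp = int(np.sum(y_pred & actual_pos))
    fp = int(np.sum(y_pred & ~actual_pos))
    tn = int(np.sum(~y_pred & ~actual_pos))
    fn = int(np.sum(~y_pred & actual_pos))
    return tp, fp, tn, fn


for threshold in (0.3, 0.5, 0.75):
    tp, fp, tn, fn = counts_at(threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} TN={tn} FN={fn}")
# threshold=0.3: TP=4 FP=3 TN=3 FN=0
# threshold=0.5: TP=3 FP=2 TN=4 FN=1
# threshold=0.75: TP=1 FP=1 TN=5 FN=3
```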