# Module66 ML Logistic Regression Assignment2

Q1. What is the purpose of grid search cv in machine learning, and how does it work?

A1. **Purpose:** Grid Search with Cross-Validation (CV) is used to find the optimal hyperparameters for a machine learning model by systematically searching through a predefined set of values.


**How It Works:**

1.) Specify a grid of hyperparameter combinations to test.

2.) For each combination:

Perform k-fold cross-validation.

Calculate the performance metric for each fold and average it across the folds.

3.) Choose the combination with the best average cross-validated performance.
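
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data so the example runs on its own; substitute your own X and y
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid (3 x 3 = 9 combinations)
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# Create and run GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and their mean cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)
```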
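
To make the three steps listed earlier concrete, the search can also be written by hand. This is only a rough sketch of the idea, not how `GridSearchCV` is implemented internally; the synthetic dataset, the smaller grid, and the variable names (`results`, `best_params`, `best_score`) are assumptions chosen for illustration.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}

results = []
# Step 1: enumerate every combination in the grid
for n_estimators, max_depth in product(param_grid['n_estimators'], param_grid['max_depth']):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    # Step 2: 5-fold cross-validation gives one accuracy score per fold
    fold_scores = cross_val_score(model, X, y, cv=5)
    params = {'n_estimators': n_estimators, 'max_depth': max_depth}
    results.append((params, fold_scores.mean()))

# Step 3: keep the combination with the best mean score
best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, round(best_score, 3))
```

The loop evaluates 2 x 2 = 4 combinations with 5 folds each, so 20 models are trained in total, which shows why the cost grows quickly as the grid gets larger.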

Q2. Describe the difference between grid search CV and randomized search CV, and when might you choose one over the other?

A2. Grid Search CV and Randomized Search CV differ in the following ways:

**1.) Search space**

GridSearchCV - Exhaustively tests all parameter combinations.

RandomizedSearchCV - Randomly samples a fixed number of combinations (set by `n_iter`) from the grid or from specified distributions.

**2.) Efficiency**

GridSearchCV - Computationally expensive for large grids.

RandomizedSearchCV - Faster, especially for large parameter spaces.

**3.) Optimal use case**

GridSearchCV - Small, well-defined parameter grids.

RandomizedSearchCV - Large or loosely bounded parameter spaces, including continuous ranges.



## When to Choose:

1.) Use Grid Search for small parameter grids where exhaustive testing is feasible.

2.) Use Randomized Search for large parameter grids or when you need faster results.
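
For comparison, here is a minimal sketch of a randomized search. The synthetic dataset, the parameter ranges, and the choice of `n_iter=5` are assumptions for illustration; in practice the budget depends on how much compute is available.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A distribution for n_estimators and a list for max_depth
param_distributions = {
    'n_estimators': randint(10, 100),   # integers from 10 to 99
    'max_depth': [None, 5, 10, 20],
}

# n_iter caps the number of sampled combinations, regardless of space size
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)

print(random_search.best_params_)
```

Only 5 combinations are evaluated here, whereas an exhaustive grid over the same ranges would need 90 x 4 = 360.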

Q3. What is data leakage, and why is it a problem in machine learning? Provide an example.

A3. **Definition:**

Data leakage occurs when information from outside the training dataset is inadvertently used to build the model, leading to over-optimistic performance metrics.

**Problem:**
It causes the model to perform well during training/validation but poorly on unseen data.

**Example:**
When predicting loan defaults, a feature column that encodes the target variable (e.g., loan repayment status, which is only known after the outcome) leaks the answer to the model.

Q4. How can you prevent data leakage when building a machine learning model?

A4. Ways to prevent data leakage when building a ML model are:

**1.) Split Data Properly:** Ensure test data is not used during training or feature engineering.

**2.) Feature Engineering:** Fit feature scaling or imputation on the training data only, after splitting, and then apply the fitted transform to the test data.

**3.) Time-Series Data:** Avoid using future data to predict past outcomes.
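
One common way to apply the first two points is to put preprocessing inside a scikit-learn pipeline, so the scaler is refit on the training part of every cross-validation fold. The sketch below assumes a synthetic dataset and a logistic regression model purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Split first so the test set is never seen during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The scaler is part of the pipeline, so in each CV fold it is
# fitted on that fold's training portion only
model = make_pipeline(StandardScaler(), LogisticRegression())
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the training data, then a single evaluation on the test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would instead let test-set statistics influence training, which is the leakage the list warns about.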

Q5. What is a confusion matrix, and what does it tell you about the performance of a classification model?

A5. **Definition:**

A confusion matrix summarizes the performance of a classification model by comparing predicted vs. actual values.

**Structure:**

matrix[Actual Positive][Predicted Positive] = TP

matrix[Actual Positive][Predicted Negative] = FN

matrix[Actual Negative][Predicted Positive] = FP

matrix[Actual Negative][Predicted Negative] = TN

where,

TP = True Positive

FN = False Negative

FP = False Positive

TN = True Negative

**Insights:**

It helps evaluate metrics like accuracy, precision, recall, and F1-score.
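
As a small illustration, scikit-learn's `confusion_matrix` can produce these four counts. The labels below are made up for the example. Note that scikit-learn puts actual classes in rows and predicted classes in columns, ordered by label, so with `labels=[0, 1]` the negative class comes first, unlike the listing above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Layout with labels=[0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=1, TN=5
```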


Q6. Explain the difference between precision and recall in the context of a confusion matrix.

A6. **1.) Precision:**

Definition: Proportion of true positive predictions out of all positive predictions.

Formula:
``` Precision = TP / (TP + FP) ```

Focus: Reducing false positives.

Example: Useful in spam detection, where flagging a legitimate email as spam is costly.

**2.) Recall:**

Definition: Proportion of true positive predictions out of all actual positives.

Formula:
``` Recall = TP / (TP + FN) ```

Focus: Reducing false negatives.

Example: Useful in medical diagnoses, where missing a true case is costly.
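
The two formulas can be checked with scikit-learn on a small made-up set of labels (an assumption for illustration), where TP = 2, FP = 1, and FN = 2.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

print(f"precision={precision:.3f}, recall={recall:.3f}")
# precision=0.667, recall=0.500
```

Here the model is right about two of its three positive predictions, but it finds only half of the actual positives.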

Q7. How can you interpret a confusion matrix to determine which types of errors your model is making?

A7. **False Positives (FP):**

Model predicts positive when actual is negative.

Example: Predicting spam for non-spam emails.


**False Negatives (FN):**

Model predicts negative when actual is positive.

Example: Failing to detect a disease in a patient.


**Analyze FP and FN to understand where the model struggles (e.g., precision vs. recall trade-offs).**

Q8. What are some common metrics that can be derived from a confusion matrix, and how are they
calculated?

A8. Some common metrics that can be derived from a confusion matrix are:

1.) Accuracy
```Accuracy = (TP + TN) / (TP + TN + FP + FN)```

2.) Precision
```Precision = TP / (TP + FP)```

3.) Recall (Sensitivity)
```Recall = TP / (TP + FN)```

4.) Specificity
```Specificity = TN / (TN + FP)```

5.) F1-score
```F1-score = 2 * [(Precision * Recall) / (Precision + Recall)]```
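
The formulas above can be collected into one helper. This is a minimal sketch: the function name `confusion_metrics` and the example counts are hypothetical, and it assumes every denominator is non-zero.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute common metrics from the four confusion-matrix counts.

    Assumes all denominators are non-zero (no guard against division by zero).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'specificity': tn / (tn + fp),
        'f1': 2 * (precision * recall) / (precision + recall),
    }


# Hypothetical counts for illustration
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.850, precision 0.889, recall 0.800, specificity 0.900, and F1-score 0.842.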

Q9. What is the relationship between the accuracy of a model and the values in its confusion matrix?

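A9. Accuracy is the proportion of correct predictions, (TP + TN) out of all four cells of the confusion matrix (TP + TN + FP + FN). Because it pools both classes, it can be misleading in imbalanced datasets where one cell dominates.

Example: For a dataset with 95% negatives, predicting all negatives yields 95% accuracy but fails to identify any positives (recall of 0).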
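
The 95% example can be reproduced directly. The labels below are constructed for illustration: 95 negatives, 5 positives, and a model that always predicts the negative class.

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives and 5 positives; the "model" predicts negative every time
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)  # (TP + TN) / total = 95 / 100
recall = recall_score(y_true, y_pred)      # TP / (TP + FN) = 0 / 5

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# accuracy=0.95, recall=0.00
```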

Q10. How can you use a confusion matrix to identify potential biases or limitations in your machine learning
model?

A10. **Class Imbalance:**

High FN or FP for a minority class indicates poor handling of class imbalance.


**Bias Detection:**

Disparities in FN or FP across demographic groups can indicate bias.


**Corrective Actions:**

Use resampling techniques or adjust class weights to address imbalance.

Evaluate fairness metrics to ensure unbiased predictions.
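
One simple way to look for such disparities is to build a separate confusion matrix per group and compare error rates. The sketch below uses made-up labels and a hypothetical `group` attribute with values `A` and `B`; it illustrates the idea and is not a complete fairness audit.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, predictions, and group membership
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1])
group = np.array(['A'] * 6 + ['B'] * 6)

rates = {}
for g in np.unique(group):
    mask = group == g
    # labels=[0, 1] fixes the layout to [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(
        y_true[mask], y_pred[mask], labels=[0, 1]
    ).ravel()
    rates[str(g)] = {
        'false_negative_rate': fn / (fn + tp),
        'false_positive_rate': fp / (fp + tn),
    }

for g, r in rates.items():
    print(g, r)
```

In this toy data both groups have the same false positive rate (1/3), but group B has a false negative rate of 2/3 against 0 for group A, which is the kind of gap worth investigating.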
