# Logistic Regression 2: Model Selection, Data Leakage, and Confusion Matrix Analysis
This notebook covers grid search CV, randomized search CV, data leakage, confusion matrix interpretation, and related classification metrics.

## Q1. What is the purpose of grid search CV in machine learning, and how does it work?

Grid search cross-validation (`GridSearchCV` in scikit-learn) is used to find the best combination of hyperparameters for a model. It exhaustively evaluates every combination in a specified parameter grid, scoring each one with cross-validation, and selects the combination with the best mean validation score. By default, scikit-learn then refits the model with those parameters on all of the data passed to `fit`.

In [None]:
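# Example: GridSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Placeholder data (an assumption, not from the original notebook):
# replace with your own feature matrix X and label vector y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# liblinear is used because it supports both the l1 and l2 penalties.
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X, y)
print('Best parameters:', grid.best_params_)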
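
To make the "exhaustive search" concrete, the loop below is a rough sketch of what a grid search does, written by hand with `ParameterGrid` and `cross_val_score`. The synthetic `X_demo`, `y_demo` data is an assumption for illustration only, and the real `GridSearchCV` does more (parallelism, result bookkeeping, the final refit).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic placeholder data (assumption; the notebook's own X, y are not shown).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}

best_score, best_params = -1.0, None
n_candidates = 0
# ParameterGrid yields every combination: 4 values of C x 2 penalties x 1 solver.
for params in ParameterGrid(param_grid):
    model = LogisticRegression(**params)
    # Mean accuracy over 5 cross-validation folds for this combination.
    score = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    n_candidates += 1
    if score > best_score:
        best_score, best_params = score, params

print('Candidates evaluated:', n_candidates)
print('Best parameters:', best_params)
print('Best mean CV accuracy:', round(best_score, 3))
```

The cost grows multiplicatively with each added hyperparameter, which is the main reason to consider randomized search for larger spaces.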

## Q2. Describe the difference between grid search CV and randomized search CV, and when might you choose one over the other?

- **Grid search CV:** Tests all possible combinations in the parameter grid. More thorough but computationally expensive.
- **Randomized search CV:** Samples a fixed number of parameter combinations at random. Faster and more efficient for large or continuous parameter spaces.

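Choose randomized search when the parameter space is large or continuous, or when computational resources are limited. Choose grid search when the grid is small enough that evaluating every combination is affordable.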

In [None]:
# Example: RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_dist = {'C': np.logspace(-3, 2, 100), 'penalty': ['l1', 'l2'], 'solver': ['liblinear']}
random_search = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X, y)
print('Best parameters (randomized):', random_search.best_params_)
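
For a truly continuous parameter, one option is to pass a distribution object instead of a fixed array. The sketch below assumes SciPy is available and uses `scipy.stats.loguniform` for `C`; the synthetic data and the chosen bounds are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic placeholder data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_dist = {
    'C': loguniform(1e-3, 1e2),   # sampled continuously on a log scale
    'penalty': ['l1', 'l2'],      # lists are sampled uniformly
    'solver': ['liblinear'],
}
search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_dist,
    n_iter=10,          # only 10 candidates are evaluated, however large the space
    cv=5,
    random_state=42,
)
search.fit(X_demo, y_demo)

n_sampled = len(search.cv_results_['params'])
print('Candidates evaluated:', n_sampled)
print('Best parameters (randomized):', search.best_params_)
```

The budget is set by `n_iter` rather than by the size of the grid, so adding more hyperparameters does not increase the number of fits.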

## Q3. What is data leakage, and why is it a problem in machine learning? Provide an example.

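Data leakage occurs when information that would not be available at prediction time, such as test data or the target itself, is used to build the model. This leads to overly optimistic performance estimates that do not hold up on new data. For example, if a scaler or feature selector is fit on the full dataset before the train/test split, statistics from the test rows influence training, and the model may "cheat" and appear to perform unrealistically well.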
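
A small experiment can illustrate the effect. The sketch below uses pure-noise features and labels that are unrelated to them (both are assumptions made up for the demonstration), so an honest accuracy estimate should sit near chance. Selecting features on all rows before cross-validation leaks label information into the validation folds.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 5000))   # pure noise: no feature carries signal
y_noise = np.array([0, 1] * 50)          # balanced labels, unrelated to X_noise

# Leaky: feature selection sees every row, including future validation rows.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Honest: the selector is refit on the training part of each fold only.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_score = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print('Leaky CV accuracy: ', round(leaky_score, 3))
print('Honest CV accuracy:', round(honest_score, 3))
```

The leaky estimate should come out well above the honest one even though there is nothing to learn, which is exactly the kind of result that fails to reproduce on new data.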

## Q4. How can you prevent data leakage when building a machine learning model?

- Split data into training and test sets before any preprocessing or feature engineering.
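- Use pipelines so that, during cross-validation, each transformation is fit on the training folds only and then applied unchanged to the validation fold.
- Avoid features built from information that would not be available at prediction time, such as future values or quantities derived from the target.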
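
The points above can be sketched as a single workflow: split first, then put preprocessing and the model in one `Pipeline` so the search refits the scaler inside each fold. The synthetic data, the step names `scaler` and `clf`, and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# 1. Split before any preprocessing; the test set is not touched until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# 2. Scaling lives inside the pipeline, so it is fit on training folds only.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])

# 3. Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

# 4. One final evaluation on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print('Best parameters:', search.best_params_)
print('Held-out test accuracy:', round(test_accuracy, 3))
```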

## Q5. What is a confusion matrix, and what does it tell you about the performance of a classification model?

A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). It helps identify how well the model distinguishes between classes.

In [None]:
# Example: Confusion matrix
from sklearn.metrics import confusion_matrix

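# Note: these predictions are for the same data the model was fit on, so the
# counts are optimistic; use a held-out test set for a realistic estimate.
y_pred = grid.predict(X)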
cm = confusion_matrix(y, y_pred)
print('Confusion matrix:\n', cm)
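
To read the printed matrix, it helps to know scikit-learn's layout for binary labels: rows are true classes and columns are predicted classes, giving `[[TN, FP], [FN, TP]]`. The tiny hand-made label lists below are hypothetical and chosen so the counts are easy to verify by eye.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives (1) and 6 actual negatives (0).
y_true_small = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_small = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

cm_small = confusion_matrix(y_true_small, y_pred_small)
print(cm_small)
# [[4 2]
#  [1 3]]

# ravel() flattens row by row, so the order is TN, FP, FN, TP.
tn, fp, fn, tp = cm_small.ravel()
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```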

## Q6. Explain the difference between precision and recall in the context of a confusion matrix.

- **Precision:** TP / (TP + FP), the proportion of positive predictions that are actually positive.
- **Recall:** TP / (TP + FN), the proportion of actual positives that are correctly identified.

Precision is lowered by false positives, while recall is lowered by false negatives. A model that rarely predicts the positive class can have high precision but low recall, and one that predicts it liberally can have the opposite.
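
The trade-off can be seen with two hypothetical sets of predictions on the same labels: a "cautious" one that flags a single positive, and a "liberal" one that flags most rows. All label lists here are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth: 4 positives, 6 negatives.
y_true_pr = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Cautious predictions: TP=1, FP=0, FN=3.
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Liberal predictions: TP=4, FP=3, FN=0.
liberal = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

cautious_precision = precision_score(y_true_pr, cautious)  # 1 / (1 + 0) = 1.0
cautious_recall = recall_score(y_true_pr, cautious)        # 1 / (1 + 3) = 0.25
liberal_precision = precision_score(y_true_pr, liberal)    # 4 / (4 + 3) = 0.571...
liberal_recall = recall_score(y_true_pr, liberal)          # 4 / (4 + 0) = 1.0

print('Cautious: precision', cautious_precision, 'recall', cautious_recall)
print('Liberal:  precision', round(liberal_precision, 3), 'recall', liberal_recall)
```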

## Q7. How can you interpret a confusion matrix to determine which types of errors your model is making?

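By examining the off-diagonal elements (FP and FN), you can see whether the model makes more false positives or more false negatives. In scikit-learn's binary layout, rows are true labels and columns are predicted labels, so `cm[0, 1]` counts FPs and `cm[1, 0]` counts FNs. You can then adjust the model or the decision threshold according to which error is more costly in the application.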
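
One way to act on this is to move the decision threshold applied to the predicted probabilities. The sketch below assumes a logistic regression fit on synthetic data and compares three arbitrary thresholds; raising the threshold can only shrink the set of predicted positives, so FPs cannot rise and FNs cannot fall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]   # probability of the positive class

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_hat = (proba >= threshold).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted.
    tn, fp, fn, tp = confusion_matrix(y_test, y_hat, labels=[0, 1]).ravel()
    results[threshold] = (int(fp), int(fn))
    print(f'threshold={threshold}: FP={fp}, FN={fn}')
```

A lower threshold suits applications where missing a positive is costly, and a higher one suits applications where false alarms are costly.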

## Q8. What are some common metrics that can be derived from a confusion matrix, and how are they calculated?

- **Accuracy:** (TP + TN) / (TP + TN + FP + FN)
- **Precision:** TP / (TP + FP)
- **Recall:** TP / (TP + FN)
- **F1-score:** 2 * (Precision * Recall) / (Precision + Recall)

Accuracy summarizes overall correctness, precision and recall each isolate one type of error, and the F1-score is the harmonic mean of precision and recall.

In [None]:
# Example: Calculating metrics from confusion matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy:', accuracy_score(y, y_pred))
print('Precision:', precision_score(y, y_pred))
print('Recall:', recall_score(y, y_pred))
print('F1-score:', f1_score(y, y_pred))

## Q9. What is the relationship between the accuracy of a model and the values in its confusion matrix?

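Accuracy is the proportion of correct predictions (TP + TN) out of all predictions, i.e. the sum of the confusion matrix's diagonal divided by the sum of all its entries. Because it pools both classes, it can stay high on imbalanced data even when the minority class is mostly misclassified.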
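
As a quick check of this relationship, the sketch below computes accuracy as diagonal over total and compares it with `accuracy_score`. The labels are a hypothetical imbalanced case (95 negatives, 5 positives) with a model that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical imbalanced labels and an "always negative" predictor.
y_true_imb = np.array([0] * 95 + [1] * 5)
y_pred_imb = np.zeros(100, dtype=int)

cm_imb = confusion_matrix(y_true_imb, y_pred_imb)
print(cm_imb)
# [[95  0]
#  [ 5  0]]

# Accuracy is the diagonal (TN + TP) divided by the sum of all entries.
accuracy_from_cm = np.trace(cm_imb) / cm_imb.sum()
accuracy_sklearn = accuracy_score(y_true_imb, y_pred_imb)
recall_imb = recall_score(y_true_imb, y_pred_imb)

print('Accuracy from matrix:', accuracy_from_cm)   # 0.95
print('Recall:', recall_imb)                       # 0.0
```

The 95% accuracy here hides the fact that no positive case is ever found, which the matrix itself makes obvious.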

## Q10. How can you use a confusion matrix to identify potential biases or limitations in your machine learning model?

If the errors are concentrated in one class (e.g., many more FNs than FPs), the model may be biased toward the other class, which can stem from class imbalance or unrepresentative training data. Analyzing the confusion matrix exposes such patterns and guides further model tuning, threshold adjustment, or data collection.
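
A row-normalized confusion matrix makes per-class weaknesses easier to spot, since each diagonal entry is then that class's recall. The three-class labels below are hypothetical and small enough to verify by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical three-class labels, 5 samples per class.
y_true_mc = [0] * 5 + [1] * 5 + [2] * 5
y_pred_mc = [0, 0, 0, 0, 0,   # class 0: all correct
             1, 1, 1, 1, 0,   # class 1: one sample mistaken for class 0
             2, 2, 1, 1, 1]   # class 2: three samples mistaken for class 1

# normalize='true' divides each row by the number of true samples in that class.
cm_norm = confusion_matrix(y_true_mc, y_pred_mc, normalize='true')
per_class_recall = cm_norm.diagonal()
worst_class = int(np.argmin(per_class_recall))

print(cm_norm)
print('Per-class recall:', per_class_recall)   # 1.0, 0.8, 0.4
print('Weakest class:', worst_class)           # 2
```

Here class 2 is mostly confused with class 1, which would point toward collecting more class 2 examples or adding features that separate those two classes.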