**Q1. Explain the difference between linear regression and logistic regression models. Provide an example of
a scenario where logistic regression would be more appropriate.**

**ANSWER:-----**

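Linear regression and logistic regression are both supervised statistical models, but they target different kinds of outcome: linear regression predicts a continuous quantity, while logistic regression predicts the probability of a class. The sections below cover each model, an example scenario for logistic regression, and the key differences.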

### Linear Regression

**Purpose:** Linear regression is used to predict a continuous dependent variable based on one or more independent variables.

**Output:** It produces a continuous output (e.g., predicting house prices, temperatures, etc.).

**Equation:** The relationship is modeled using a linear equation:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \epsilon \]
where \( Y \) is the dependent variable, \( X_i \) are the independent variables, \( \beta_i \) are the coefficients, and \( \epsilon \) is the error term.

### Logistic Regression

**Purpose:** Logistic regression is used to predict a categorical dependent variable, often binary (i.e., two possible outcomes).

**Output:** It produces probabilities that can be mapped to binary outcomes (e.g., yes/no, pass/fail, win/lose).

**Equation:** The relationship is modeled using the logistic function (sigmoid function):
\[ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n)}} \]
where \( P(Y=1) \) is the probability that the dependent variable \( Y \) equals 1 (the event of interest).

### Example Scenario for Logistic Regression

**Scenario:** Predicting whether a student will pass or fail a course based on study hours, attendance, and prior grades.

**Why Logistic Regression?** The outcome (pass/fail) is categorical. Logistic regression will model the probability of passing (or failing) based on the input variables, allowing us to classify each student into one of the two categories.
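
As a rough sketch of this scenario, the snippet below turns study hours and attendance into a pass probability using the logistic function. The coefficient values and the names `beta_hours` and `beta_attendance` are hypothetical, chosen only for illustration rather than fitted to data, and prior grades are left out to keep the example short.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen for illustration only (not fitted to data).
beta_0 = -6.0            # intercept
beta_hours = 0.8         # effect of one extra weekly study hour on the log-odds
beta_attendance = 0.04   # effect of one extra percentage point of attendance

def pass_probability(study_hours: float, attendance_pct: float) -> float:
    """P(pass) for one student under the assumed coefficients."""
    z = beta_0 + beta_hours * study_hours + beta_attendance * attendance_pct
    return sigmoid(z)

p_low = pass_probability(study_hours=2, attendance_pct=50)
p_high = pass_probability(study_hours=6, attendance_pct=90)
print(f"{p_low:.3f} {p_high:.3f}")

# Classify with the conventional 0.5 cutoff.
label_low = "pass" if p_low >= 0.5 else "fail"
label_high = "pass" if p_high >= 0.5 else "fail"
```

With these assumed coefficients the two students get probabilities of about 0.083 and 0.917, so the first is classified as "fail" and the second as "pass".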

### Key Differences

1. **Nature of the Outcome:**
   - **Linear Regression:** Predicts a continuous outcome.
   - **Logistic Regression:** Predicts a categorical outcome.

2. **Error Distribution:**
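   - **Linear Regression:** Assumes that the residuals (errors) are normally distributed; this matters mainly for confidence intervals and hypothesis tests on the coefficients.
   - **Logistic Regression:** Makes no normality assumption; the outcome is modeled as Bernoulli (binomial) distributed, and the logistic function bounds the predicted probabilities between 0 and 1.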

3. **Model Interpretation:**
   - **Linear Regression:** Coefficients represent the change in the dependent variable for a one-unit change in the independent variable.
   - **Logistic Regression:** Coefficients represent the change in the log odds of the dependent variable for a one-unit change in the independent variable.
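
The log-odds interpretation can be checked numerically. In the sketch below (the coefficients are hypothetical), raising the feature by one unit multiplies the odds by \( e^{\beta_1} \) regardless of the starting value, even though the change in probability differs each time.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def odds(p: float) -> float:
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients for a single-feature model.
beta_0, beta_1 = -1.0, 0.8
odds_ratio = math.exp(beta_1)  # multiplicative change in odds per unit of x

for x in (0.0, 1.0, 2.0):
    p_before = sigmoid(beta_0 + beta_1 * x)
    p_after = sigmoid(beta_0 + beta_1 * (x + 1.0))
    ratio = odds(p_after) / odds(p_before)
    print(f"x={x}: p {p_before:.3f} -> {p_after:.3f}, odds ratio {ratio:.3f}")
```

Each row prints the same odds ratio, \( e^{0.8} \approx 2.226 \), which is why coefficients are often reported as odds ratios.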



**Q2. What is the cost function used in logistic regression, and how is it optimized?**

**ANSWER:---------**

The cost function used in logistic regression is the **logistic loss function**, also known as the **log-loss** or **cross-entropy loss**. This function is optimized to find the best-fitting parameters for the logistic regression model.

### Logistic Loss Function (Log-Loss)

For binary classification, the log-loss function is defined as:

\[ \text{Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \]

where:
- \( m \) is the number of training examples,
- \( y_i \) is the actual label for the \( i \)-th training example (0 or 1),
- \( \hat{y}_i \) is the predicted probability of the \( i \)-th training example being in class 1.

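The log-loss grows as the predicted probability moves away from the actual label, and it grows without bound for confident wrong predictions: predicting \( \hat{y}_i \) close to 0 when \( y_i = 1 \) (or close to 1 when \( y_i = 0 \)) is penalized far more heavily than a mildly wrong prediction.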
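
A minimal NumPy sketch of the formula, assuming predicted probabilities are clipped slightly away from 0 and 1 to avoid `log(0)` (the `eps` value is an illustrative choice):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

y = [1, 0, 1, 0]
good = log_loss(y, [0.9, 0.1, 0.8, 0.2])    # all predictions on the right side
bad = log_loss(y, [0.9, 0.1, 0.8, 0.99])    # last prediction is confidently wrong
print(f"good={good:.4f} bad={bad:.4f}")
```

Changing a single prediction from 0.2 to 0.99 for a negative example raises the loss sharply, which is the behaviour described above.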

### Optimization

The optimization of the log-loss function in logistic regression is typically done using **gradient descent** or one of its variants (e.g., stochastic gradient descent, mini-batch gradient descent).

### Gradient Descent

1. **Initialize Parameters:** Start with initial values for the parameters (weights) \(\beta_0, \beta_1, \ldots, \beta_n\).

2. **Compute Predictions:** For each training example, compute the predicted probability \(\hat{y}_i\) using the logistic function:
   \[ \hat{y}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \ldots + \beta_nX_{in})}} \]

3. **Compute the Gradient:** Calculate the gradient of the log-loss with respect to each parameter \(\beta_j\). The gradient for a parameter \(\beta_j\) is:
   \[ \frac{\partial \text{Log-Loss}}{\partial \beta_j} = -\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i) X_{ij} \]
   where \( X_{ij} \) is the value of the \( j \)-th feature for the \( i \)-th training example.

4. **Update Parameters:** Update each parameter using the gradient and a learning rate \(\alpha\):
   \[ \beta_j := \beta_j - \alpha \frac{\partial \text{Log-Loss}}{\partial \beta_j} \]

5. **Iterate:** Repeat the steps of computing predictions, calculating the gradient, and updating parameters until convergence (i.e., when changes in the cost function are below a certain threshold or after a fixed number of iterations).

### Convergence Criteria

The optimization process continues until one of the following criteria is met:
- The change in the cost function is smaller than a predefined threshold.
- A maximum number of iterations is reached.
- The gradient values become sufficiently small.
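
The steps above can be sketched in NumPy as follows. This is an illustrative batch gradient descent, not a production solver: the learning rate, tolerance, and synthetic data are assumptions, and only two of the convergence criteria (change in cost and maximum iterations) are implemented.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.5, max_iter=5000, tol=1e-8):
    """Batch gradient descent on the log-loss. Returns (beta, final_loss)."""
    m = X.shape[0]
    Xb = np.column_stack([np.ones(m), X])   # column of ones so beta[0] is the intercept
    beta = np.zeros(Xb.shape[1])            # step 1: initialize parameters
    prev_loss = np.inf
    for _ in range(max_iter):
        y_hat = sigmoid(Xb @ beta)          # step 2: predicted probabilities
        grad = -(Xb.T @ (y - y_hat)) / m    # step 3: gradient of the log-loss
        beta -= alpha * grad                # step 4: parameter update
        y_hat = np.clip(sigmoid(Xb @ beta), 1e-15, 1 - 1e-15)
        loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
        if prev_loss - loss < tol:          # step 5: stop when the cost barely changes
            break
        prev_loss = loss
    return beta, float(loss)

# Synthetic data generated from known (assumed) coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([-0.5, 2.0, -1.0])
p = sigmoid(true_beta[0] + X @ true_beta[1:])
y = (rng.random(500) < p).astype(float)

beta_hat, final_loss = fit_logistic_gd(X, y)
print(beta_hat, final_loss)
```

The fitted coefficients should land near the generating values, and the final loss should fall below \( \log 2 \approx 0.693 \), the loss of always predicting 0.5.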

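Because the log-loss is convex in the parameters, gradient descent with a suitably small learning rate converges toward the global minimum. The resulting parameters are the ones that best fit the training data under this loss; how well the predicted probabilities hold up on new data still has to be checked on a held-out set.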

**Q3. Explain the concept of regularization in logistic regression and how it helps prevent overfitting.**

**ANSWER:------**


Regularization is a technique used in logistic regression (and other machine learning models) to prevent overfitting by adding a penalty term to the cost function. This penalty discourages the model from fitting too closely to the training data, which can lead to poor generalization on unseen data.

### Types of Regularization

The two most common forms of regularization in logistic regression are **L1 regularization** and **L2 regularization**.

1. **L1 Regularization (Lasso):**
   - Adds the absolute value of the coefficients to the cost function.
   - The regularized cost function becomes:
     \[ \text{Log-Loss} + \lambda \sum_{j=1}^{n} |\beta_j| \]
   - Encourages sparsity, meaning it can shrink some coefficients to exactly zero, effectively performing feature selection.

2. **L2 Regularization (Ridge):**
   - Adds the squared value of the coefficients to the cost function.
   - The regularized cost function becomes:
     \[ \text{Log-Loss} + \frac{\lambda}{2} \sum_{j=1}^{n} \beta_j^2 \]
   - Tends to shrink coefficients evenly, without necessarily driving them to zero.

### Combined Regularization (Elastic Net)

- Combines L1 and L2 regularization:
  \[ \text{Log-Loss} + \lambda_1 \sum_{j=1}^{n} |\beta_j| + \lambda_2 \sum_{j=1}^{n} \beta_j^2 \]
- Allows for a balance between sparsity and coefficient shrinkage.

### How Regularization Prevents Overfitting

1. **Penalizes Complexity:** Regularization adds a penalty for large coefficients, which typically indicate a more complex model. Constraining the size of the coefficients forces the model to be simpler and reduces the likelihood of fitting noise in the training data.

2. **Bias-Variance Trade-off:** Regularization introduces bias into the model, but this bias can result in lower variance. The training error might increase slightly, but the generalization error (performance on unseen data) is often reduced when the penalty strength is chosen well, for example by cross-validation.

3. **Feature Selection (L1 Regularization):** L1 regularization can drive some coefficients to zero, effectively removing irrelevant or redundant features. This simplifies the model and reduces the risk of overfitting.
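
The difference between the two penalties can be seen by applying one penalty-only step to the same coefficient vector. Note the assumption here: the L1 step uses soft-thresholding (a proximal step), which is one common way to handle the L1 penalty and is not the same as a plain sub-gradient update. The step size and \(\lambda\) are illustrative values.

```python
import numpy as np

def l2_shrink(beta, alpha, lam):
    """Gradient step on the L2 penalty alone: every coefficient shrinks proportionally."""
    return beta - alpha * lam * beta

def l1_soft_threshold(beta, alpha, lam):
    """Proximal step for the L1 penalty: shrink by a fixed amount and clamp at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - alpha * lam, 0.0)

beta = np.array([2.0, 0.3, -0.05, -1.5])
after_l2 = l2_shrink(beta, alpha=0.1, lam=1.0)          # approx. [1.8, 0.27, -0.045, -1.35]
after_l1 = l1_soft_threshold(beta, alpha=0.1, lam=1.0)  # approx. [1.9, 0.2, 0.0, -1.4]
print(after_l2)
print(after_l1)
```

The small coefficient (-0.05) becomes exactly zero under the L1 step but only moves closer to zero under the L2 step, which is the sparsity effect described in point 3.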

### Implementation in Logistic Regression

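Regularization is typically controlled by a parameter \(\lambda\) (or \(\alpha\) in some contexts), which determines the strength of the penalty. A higher \(\lambda\) value results in stronger regularization, while a lower \(\lambda\) value reduces the regularization effect. Some libraries expose the inverse instead: scikit-learn's `LogisticRegression` takes `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.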

### Optimized Cost Function with Regularization

- **Gradient Descent Updates:** When applying gradient descent, the updates to the coefficients incorporate the regularization term. For L2 regularization, the update rule for a parameter \(\beta_j\) is:
  \[ \beta_j := \beta_j - \alpha \left( \frac{\partial \text{Log-Loss}}{\partial \beta_j} + \lambda \beta_j \right) \]
  For L1 regularization, the update rule includes a sub-gradient due to the non-differentiability at zero.
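
A minimal sketch of the L2 update rule, assuming the intercept \(\beta_0\) is left unpenalized (consistent with the sums starting at \( j = 1 \)). The data, learning rate, and \(\lambda\) values are illustrative assumptions.

```python
import numpy as np

def fit_l2_logistic(X, y, lam, alpha=0.5, n_iter=2000):
    """Gradient descent on log-loss + (lam / 2) * sum(beta_j ** 2), intercept unpenalized."""
    m, n = X.shape
    beta0, beta = 0.0, np.zeros(n)
    for _ in range(n_iter):
        y_hat = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        grad0 = -np.mean(y - y_hat)             # intercept gradient, no penalty term
        grad = -(X.T @ (y - y_hat)) / m         # log-loss gradient for the weights
        beta0 -= alpha * grad0
        beta -= alpha * (grad + lam * beta)     # extra lam * beta_j term from the L2 penalty
    return beta0, beta

# Synthetic data from assumed coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
p = 1.0 / (1.0 + np.exp(-(-0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1])))
y = (rng.random(500) < p).astype(float)

_, beta_plain = fit_l2_logistic(X, y, lam=0.0)
_, beta_ridge = fit_l2_logistic(X, y, lam=1.0)
print(np.linalg.norm(beta_plain), np.linalg.norm(beta_ridge))
```

The regularized fit ends with a smaller coefficient norm than the unregularized one, which is the shrinkage effect the penalty is meant to produce.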

### Summary

Regularization helps logistic regression models generalize better by:
- Penalizing large coefficients and complex models.
- Introducing a bias that reduces variance.
- Simplifying the model by potentially eliminating irrelevant features (L1 regularization).

By balancing the model complexity and ensuring that it does not fit the training data too closely, regularization effectively mitigates the risk of overfitting.

**Q4. What is the ROC curve, and how is it used to evaluate the performance of the logistic regression
model?**

**ANSWER:--------**


The **Receiver Operating Characteristic (ROC) curve** is a graphical representation used to evaluate the performance of a binary classification model, such as logistic regression. It illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) at various threshold settings.

### Key Concepts

1. **True Positive Rate (TPR) / Sensitivity / Recall:**
   - TPR = \(\frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\)
   - Measures the proportion of actual positives correctly identified by the model.

2. **False Positive Rate (FPR):**
   - FPR = \(\frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}\)
   - Measures the proportion of actual negatives incorrectly identified as positives by the model.

3. **Threshold:**
   - The ROC curve is plotted by varying the decision threshold. For logistic regression, the threshold determines the cutoff point for classifying a probability prediction as positive or negative. 

### Plotting the ROC Curve

To plot the ROC curve:

1. **Calculate Predictions:** Use the logistic regression model to predict probabilities for the positive class on the test set.
   
2. **Vary Threshold:** Adjust the decision threshold from 0 to 1. For each threshold value, calculate the TPR and FPR.

3. **Plot Points:** Plot TPR against FPR for each threshold value, resulting in the ROC curve.
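
These three steps can be sketched without a plotting library. The function name `roc_points` and the four-sample data are illustrative; the thresholds are taken from the distinct predicted scores, plus infinity so the curve starts at (0, 0).

```python
import numpy as np

def roc_points(y_true, y_score):
    """Return (fpr, tpr, thresholds), sweeping the threshold from high to low."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(y_score))[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred_pos = y_score >= t                                # classify as positive at this cutoff
        tpr.append(np.sum(pred_pos & (y_true == 1)) / n_pos)   # TP / (TP + FN)
        fpr.append(np.sum(pred_pos & (y_true == 0)) / n_neg)   # FP / (FP + TN)
    return np.array(fpr), np.array(tpr), thresholds

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr, thresholds = roc_points(y_true, y_score)
print(fpr)  # [0.  0.  0.5 0.5 1. ]
print(tpr)  # [0.  0.5 0.5 1.  1. ]
```

Plotting `tpr` against `fpr` (for example with matplotlib) gives the ROC curve for this toy example.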

### Interpreting the ROC Curve

- **Perfect Classifier:** A perfect classifier would have a point at (0, 1), meaning it has 100% sensitivity (no false negatives) and 0% FPR (no false positives).
- **Random Classifier:** A classifier that makes random guesses would produce a diagonal line from (0, 0) to (1, 1), indicating no better performance than random chance.
- **Good Classifier:** The closer the ROC curve is to the top left corner, the better the model's performance. This indicates high sensitivity and low FPR.

### Area Under the ROC Curve (AUC-ROC)

- **AUC-ROC:** The Area Under the ROC Curve (AUC-ROC) is a single scalar value summarizing the performance of the model. It ranges from 0 to 1.
  - **AUC = 1:** Perfect model.
  - **AUC = 0.5:** No discriminative power, equivalent to random guessing.
  - **AUC > 0.5:** Better than random guessing; the higher the AUC, the better the model's performance.
  - **AUC < 0.5:** Worse than random guessing; the model's ranking is inverted, so flipping its predictions would give an AUC above 0.5.
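
One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it directly from that definition; the function name `auc_pairwise` and the toy data are illustrative, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]      # every positive score minus every negative score
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = [0, 0, 1, 1]
auc_example = auc_pairwise(y_true, [0.1, 0.4, 0.35, 0.8])    # 3 of 4 pairs correct
auc_perfect = auc_pairwise(y_true, [0.1, 0.2, 0.8, 0.9])     # every pair correct
auc_inverted = auc_pairwise(y_true, [0.9, 0.8, 0.2, 0.1])    # every pair wrong
print(auc_example, auc_perfect, auc_inverted)  # 0.75 1.0 0.0
```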

### Using the ROC Curve to Evaluate Logistic Regression

1. **Visual Assessment:** The shape of the ROC curve provides a visual assessment of the model's ability to distinguish between the positive and negative classes.
2. **Threshold Selection:** The ROC curve helps in choosing an optimal threshold that balances sensitivity and specificity according to the specific needs of the problem.
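3. **Model Comparison:** AUC-ROC is useful for comparing different models. A model with a higher AUC ranks positive cases above negative cases more reliably, though the better choice can still depend on the operating threshold that matters for the problem.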

### Example

Consider a logistic regression model predicting whether patients have a certain disease (positive class) based on various features.

1. **Predict Probabilities:** The model outputs probabilities for each patient.
2. **Vary Threshold:** Calculate TPR and FPR for thresholds ranging from 0 to 1.
3. **Plot ROC Curve:** Create a plot of TPR vs. FPR at different thresholds.
4. **Calculate AUC-ROC:** Compute the area under the ROC curve to quantify the overall performance.

The ROC curve and AUC-ROC evaluate a logistic regression model across all thresholds rather than at a single cutoff, which makes them useful for comparing models and for judging the balance between sensitivity and specificity.

**Q5. What are some common techniques for feature selection in logistic regression? How do these
techniques help improve the model's performance?**

**ANSWER:------**


Feature selection is a crucial step in building a logistic regression model, as it helps in improving model performance by eliminating irrelevant or redundant features. Here are some common techniques for feature selection in logistic regression:

### 1. **Filter Methods**

Filter methods assess the relevance of features by looking at their statistical properties, independently of the model.

- **Correlation Matrix:**
  - Calculate the correlation coefficients between each feature and the target variable.
  - Select features with high correlation (absolute value) with the target.
  - Also, check for multicollinearity (high correlation between features) and remove highly correlated features.

- **Chi-Square Test:**
  - For categorical features, use the chi-square test to determine the independence between the feature and the target variable.
  - Select features with low p-values (indicating a significant relationship with the target).

- **ANOVA (Analysis of Variance):**
  - For numerical features, use the ANOVA F-test to compare the feature's mean across the target classes, and select features whose means differ significantly.
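
As a small illustration of a filter method, the sketch below ranks features by their absolute correlation with the target and keeps those above a cutoff. The synthetic data and the cutoff of 0.2 are assumptions made for the example, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)                 # feature that drives the target
noise = rng.normal(size=n)                  # feature unrelated to the target
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([signal, noise])
names = ["signal", "noise"]

# Absolute Pearson correlation between each feature and the binary target.
corr_with_target = np.array(
    [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
)

cutoff = 0.2                                # hypothetical threshold
selected = [names[j] for j in np.where(corr_with_target >= cutoff)[0]]
print(dict(zip(names, corr_with_target.round(3))), selected)
```

Because the score is computed without fitting a model, this step is cheap, but it looks at one feature at a time and can miss features that only matter in combination.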

### 2. **Wrapper Methods**

Wrapper methods evaluate the performance of a subset of features by training and testing a model.

- **Forward Selection:**
  - Start with no features and iteratively add the most significant feature at each step.
  - Evaluate model performance (e.g., using cross-validation) and stop when adding more features does not significantly improve performance.

- **Backward Elimination:**
  - Start with all features and iteratively remove the least significant feature at each step.
  - Continue until removing features no longer improves model performance.

- **Recursive Feature Elimination (RFE):**
  - Train the model and rank features based on their importance.
  - Remove the least important features recursively and evaluate model performance at each step.
  - Stop when performance no longer improves significantly.
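
A compact sketch of forward selection, under several assumptions: the model is fitted with a plain gradient descent helper (`fit_logistic`, hypothetical), candidates are scored by log-loss on a single validation split rather than cross-validation, and the search stops when the best candidate improves the score by less than `min_gain`.

```python
import numpy as np

def fit_logistic(X, y, alpha=0.5, n_iter=500):
    """Plain gradient descent on the log-loss; returns (intercept, weights)."""
    m, n = X.shape
    b0, b = 0.0, np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))
        b0 -= alpha * np.mean(p - y)
        b -= alpha * (X.T @ (p - y)) / m
    return b0, b

def val_log_loss(cols, X_tr, y_tr, X_val, y_val):
    """Fit on the training split using only `cols`; return validation log-loss."""
    b0, b = fit_logistic(X_tr[:, cols], y_tr)
    p = np.clip(1.0 / (1.0 + np.exp(-(b0 + X_val[:, cols] @ b))), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p)))

def forward_selection(X_tr, y_tr, X_val, y_val, min_gain=0.01):
    selected, remaining = [], list(range(X_tr.shape[1]))
    best = float(np.log(2.0))               # baseline: always predicting 0.5
    while remaining:
        scores = {j: val_log_loss(selected + [j], X_tr, y_tr, X_val, y_val)
                  for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < min_gain:    # no meaningful improvement: stop
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

# Synthetic data: column 0 is strongly predictive, column 1 weaker, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
logits = 2.0 * X[:, 0] + 1.0 * X[:, 1]
y = (rng.random(600) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

selected = forward_selection(X[:400], y[:400], X[400:], y[400:])
print(selected)
```

On this synthetic data the strongest column is picked first. Backward elimination follows the same pattern in reverse, starting from all columns and removing the one whose removal hurts the score least.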

### 3. **Embedded Methods**

Embedded methods perform feature selection during the model training process.

- **L1 Regularization (Lasso):**
  - Adds a penalty equal to the absolute value of the coefficients.
  - Encourages sparsity by driving some coefficients to zero, effectively selecting a subset of features.

- **L2 Regularization (Ridge):**
  - Adds a penalty equal to the squared value of the coefficients.
  - While it does not perform feature selection, it reduces the impact of less important features.

- **Elastic Net:**
  - Combines L1 and L2 regularization.
  - Balances feature selection (L1) and coefficient shrinkage (L2).

### 4. **Dimensionality Reduction Techniques**

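These techniques transform the feature space into a lower-dimensional space while retaining most of the information. Strictly speaking, they are feature extraction rather than feature selection: the model is trained on new combined variables, not on a subset of the original ones.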

- **Principal Component Analysis (PCA):**
  - Transforms the original features into a smaller set of uncorrelated components.
  - Select components that explain the most variance in the data.

- **Linear Discriminant Analysis (LDA):**
  - Aims to maximize the separation between multiple classes.
  - Transforms features into a lower-dimensional space based on class separability.
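
PCA can be sketched with a singular value decomposition of the standardized data. The synthetic features below are an assumption for illustration: `x2` is almost a copy of `x1`, so two components are enough to capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)        # nearly redundant with x1
x3 = rng.normal(size=300)                   # independent feature
X = np.column_stack([x1, x2, x3])

# Standardize, then take the SVD; rows of Vt are the principal directions.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
explained = S**2 / np.sum(S**2)             # share of variance per component

k = 2
Z = Xs @ Vt[:k].T                           # data projected onto the first k components
print(explained.round(3), Z.shape)
```

The columns of `Z` are uncorrelated with each other, which is why PCA is also used to deal with multicollinearity; the cost is that each component mixes the original features and is harder to interpret.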

### How Feature Selection Improves Model Performance

1. **Reduces Overfitting:**
   - By removing irrelevant or redundant features, the model is less likely to fit noise in the training data, improving generalization to unseen data.

2. **Improves Model Interpretability:**
   - A simpler model with fewer features is easier to interpret and understand, providing clearer insights.

3. **Enhances Model Training Efficiency:**
   - Fewer features reduce the computational cost and time required for training the model.

4. **Mitigates Multicollinearity:**
   - Feature selection helps in reducing multicollinearity, ensuring that the model coefficients are more stable and reliable.

5. **Boosts Model Performance:**
   - By retaining only the most relevant features, the model can achieve better predictive performance.

By employing these feature selection techniques, logistic regression models can become more robust, interpretable, and efficient, leading to better overall performance.

**Q6. How can you handle imbalanced datasets in logistic regression? What are some strategies for dealing
with class imbalance?**

**ANSWER:-------**



Handling imbalanced datasets is crucial in logistic regression as it ensures the model performs well across all classes, especially the minority class. Here are some strategies to deal with class imbalance:

### 1. **Resampling Techniques**

#### a. **Oversampling the Minority Class**
- **Random Oversampling:** Randomly duplicate instances of the minority class to balance the class distribution.
- **SMOTE (Synthetic Minority Over-sampling Technique):** Generate synthetic samples for the minority class by interpolating between existing samples.

#### b. **Undersampling the Majority Class**
- **Random Undersampling:** Randomly remove instances of the majority class to balance the class distribution.
- **Tomek Links and NearMiss:** Techniques that intelligently undersample the majority class by focusing on borderline examples or nearest neighbors.
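
Random oversampling and undersampling are simple enough to sketch directly in NumPy. The helper names and the 90/10 class split below are illustrative; SMOTE, Tomek Links, and NearMiss need neighbour searches and are better taken from a library such as imbalanced-learn.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate random minority rows until every class matches the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=n_max - n, replace=True)   # empty for the largest class
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

def random_undersample(X, y, rng):
    """Drop random majority rows until every class matches the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate(
        [rng.choice(np.where(y == c)[0], size=n_min, replace=False) for c in classes]
    )
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)           # 90/10 imbalance

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(np.bincount(y_over), np.bincount(y_under))  # [90 90] [10 10]
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution.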

### 2. **Algorithmic Approaches**

#### a. **Cost-Sensitive Learning**
- Modify the learning algorithm to incorporate different costs for misclassification errors, giving higher penalty to misclassifying the minority class.
- **Weighted Loss Function:** Adjust the loss function to assign higher weights to the minority class:
  \[ \text{Weighted Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ w_i y_i \log(\hat{y}_i) + w_i (1 - y_i) \log(1 - \hat{y}_i) \right] \]
  where \( w_i \) is the weight for the \( i \)-th instance.
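
A minimal sketch of the weighted loss, assuming each example's weight \( w_i \) is the weight of its class and that class weights follow the "balanced" heuristic `n_samples / (n_classes * class_count)`, the same rule scikit-learn uses for `class_weight='balanced'`.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

def weighted_log_loss(y, y_prob, class_weights, eps=1e-15):
    """Log-loss where each example is scaled by the weight of its true class."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    w = class_weights[y]                    # w_i looked up from the example's label
    per_example = y * np.log(y_prob) + (1 - y) * np.log(1 - y_prob)
    return float(-np.mean(w * per_example))

y = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(y)         # approx. [0.556, 5.0]

# A model that ignores the minority class and predicts 0.1 for everyone.
y_prob = np.full(len(y), 0.1)
plain = weighted_log_loss(y, y_prob, np.ones(2))
weighted = weighted_log_loss(y, y_prob, weights)
print(weights.round(3), round(plain, 3), round(weighted, 3))
```

With the weights applied, the same predictions incur a much larger loss, because the ten missed minority examples now count as heavily as the ninety majority examples combined.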

### 3. **Evaluation Metrics**

#### a. **Use Appropriate Metrics**
- **Precision, Recall, and F1-Score:** Evaluate the model using metrics that are sensitive to class imbalance.
- **ROC-AUC:** Use the area under the ROC curve to evaluate the model’s ability to distinguish between classes.
- **Precision-Recall Curve:** Especially useful for imbalanced datasets, as it focuses on the performance with respect to the minority class.

### 4. **Generating Synthetic Data**

#### a. **SMOTE Variants**
- **Borderline-SMOTE:** Focus on generating synthetic samples near the decision boundary.
- **ADASYN (Adaptive Synthetic Sampling):** Generate more synthetic samples for harder-to-classify instances.

### 5. **Ensemble Methods**

#### a. **Balanced Random Forests**
- Combine resampling techniques with random forests by resampling the dataset in each iteration of tree building.
- Ensure each tree in the ensemble is built on a balanced subset of the data.

#### b. **EasyEnsemble and BalanceCascade**
- **EasyEnsemble:** Create multiple balanced subsets by undersampling the majority class and train separate classifiers on each subset, then aggregate their predictions.
- **BalanceCascade:** Iteratively train classifiers, removing correctly classified majority class instances at each step, thus focusing on harder examples.

### 6. **Adjusting Decision Threshold**

- **Threshold Moving:** Adjust the classification threshold to favor the minority class. Instead of using the default threshold of 0.5, find an optimal threshold that improves recall or F1-score.
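
Threshold moving can be sketched as a simple search over candidate cutoffs. The labels and predicted probabilities below are made up for illustration, and the candidate grid (0.05 to 0.95 in steps of 0.05) is an arbitrary choice; in practice the search would use validation data.

```python
import numpy as np

def f1_score_at(y_true, y_prob, threshold):
    """F1-score of the positive class when classifying with the given cutoff."""
    pred = y_prob >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Illustrative data: 7 negatives, 3 positives, with the positives scored fairly low.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.04, 0.12, 0.13, 0.17, 0.22, 0.27, 0.47, 0.32, 0.42, 0.62])

thresholds = np.arange(1, 20) * 0.05
best_t = max(thresholds, key=lambda t: f1_score_at(y_true, y_prob, t))

f1_default = f1_score_at(y_true, y_prob, 0.5)
f1_best = f1_score_at(y_true, y_prob, best_t)
print(f"default F1={f1_default:.3f}, best threshold={best_t:.2f}, best F1={f1_best:.3f}")
```

On this toy data, lowering the cutoff from 0.5 to 0.30 raises the F1-score from 0.5 to about 0.857, because all three positives are recovered at the cost of one false positive.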



In [2]:
pip install imbalanced-learn




In [3]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE

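# Generate sample data
# Note: the labels are drawn uniformly at random, so the classes are roughly
# balanced and unrelated to the features; scores near chance (about 0.5) are expected.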
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000),
})
y = np.random.randint(0, 2, 1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Strategy 1: Resampling with SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

# Standardize the features
scaler = StandardScaler()
X_res_scaled = scaler.fit_transform(X_res)
X_test_scaled = scaler.transform(X_test)

# Apply PCA to reduce multicollinearity (optional)
pca = PCA(n_components=2)  # Adjust the number of components as needed
X_res_pca = pca.fit_transform(X_res_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train logistic regression on resampled data
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X_res_pca, y_res)

# Predict and evaluate
y_pred = model.predict(X_test_pca)
print("Results after SMOTE and PCA:")
print(classification_report(y_test, y_pred))
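# Note: roc_auc_score is given hard 0/1 predictions here, so it reflects a single
# threshold; pass model.predict_proba(X_test_pca)[:, 1] for the threshold-free ROC-AUC.
print("ROC-AUC:", roc_auc_score(y_test, y_pred))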

# Strategy 2: Cost-sensitive learning
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = {0: class_weights[0], 1: class_weights[1]}

model = LogisticRegression(class_weight=class_weights_dict)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("\nResults with cost-sensitive learning:")
print(classification_report(y_test, y_pred))
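# Note: as above, this uses hard 0/1 predictions; pass
# model.predict_proba(X_test)[:, 1] for the threshold-free ROC-AUC.
print("ROC-AUC:", roc_auc_score(y_test, y_pred))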


Results after SMOTE and PCA:
              precision    recall  f1-score   support

           0       0.48      0.51      0.49       136
           1       0.57      0.53      0.55       164

    accuracy                           0.52       300
   macro avg       0.52      0.52      0.52       300
weighted avg       0.53      0.52      0.52       300

ROC-AUC: 0.522596843615495

Results with cost-sensitive learning:
              precision    recall  f1-score   support

           0       0.47      0.58      0.52       136
           1       0.57      0.46      0.51       164

    accuracy                           0.52       300
   macro avg       0.52      0.52      0.52       300
weighted avg       0.53      0.52      0.52       300

ROC-AUC: 0.5221484935437589


**Q7. Can you discuss some common issues and challenges that may arise when implementing logistic
regression, and how they can be addressed? For example, what can be done if there is multicollinearity
among the independent variables?**

**ANSWER:-------**


Implementing logistic regression can come with several issues and challenges. Here are some common ones and their potential solutions:

### 1. Multicollinearity

#### Issue:
Multicollinearity occurs when two or more independent variables are highly correlated, which can lead to unstable coefficient estimates and inflated standard errors.

#### Solutions:
- **Remove Highly Correlated Variables:** Identify and remove one of the correlated variables using a correlation matrix.
- **Principal Component Analysis (PCA):** Transform the correlated variables into a smaller set of uncorrelated components.
- **Ridge Regression:** Use L2 regularization, which can mitigate the effect of multicollinearity by shrinking the coefficients.
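
The first solution can be sketched with a correlation matrix. The synthetic features and the 0.9 cutoff are assumptions for illustration; which member of a correlated pair to drop is a judgment call, and here the second one is dropped arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)         # almost a duplicate of x1
x3 = rng.normal(size=n)                     # independent feature
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

corr = np.corrcoef(X, rowvar=False)         # feature-by-feature correlation matrix
cutoff = 0.9                                # hypothetical threshold

# List every feature pair whose absolute correlation exceeds the cutoff.
pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > cutoff
]

# Drop the second feature of each flagged pair.
to_drop = {b for _, b in pairs}
kept = [name for name in names if name not in to_drop]
print(pairs, kept)  # [('x1', 'x2')] ['x1', 'x3']
```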

### 2. Overfitting

#### Issue:
Overfitting occurs when the model performs well on the training data but poorly on unseen data due to capturing noise and outliers.

#### Solutions:
- **Regularization:** Apply L1 (Lasso) or L2 (Ridge) regularization to penalize large coefficients.
- **Cross-Validation:** Use techniques like k-fold cross-validation to ensure the model generalizes well to unseen data.
- **Simplify the Model:** Remove irrelevant features or use feature selection methods to reduce the complexity of the model.

### 3. Imbalanced Datasets

#### Issue:
Class imbalance can lead to a model that performs well on the majority class but poorly on the minority class.

#### Solutions:
- **Resampling Techniques:** Use oversampling (e.g., SMOTE) or undersampling to balance the class distribution.
- **Class Weights:** Assign higher weights to the minority class in the cost function.
- **Evaluation Metrics:** Use metrics like precision, recall, F1-score, and ROC-AUC instead of accuracy.

### 4. Non-Linearity

#### Issue:
Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which may not always hold true.

#### Solutions:
- **Feature Engineering:** Create interaction terms or polynomial features to capture non-linear relationships.
- **Non-Linear Models:** Consider using non-linear models like decision trees or neural networks if the relationship is highly non-linear.

### 5. Outliers

#### Issue:
Outliers can disproportionately affect the model, leading to skewed results.

#### Solutions:
- **Identify and Remove Outliers:** Use statistical tests or visualization methods to detect and remove outliers.
- **Robust Scalers:** Use robust scalers that are less sensitive to outliers (e.g., median-based scaling).
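
A short sketch of median-based scaling, assuming the interquartile range (IQR) as the measure of spread, compared with ordinary standardization on a feature containing one extreme value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is the outlier

# Robust scaling: center on the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - median) / (q3 - q1)

# Standard scaling: center on the mean, divide by the standard deviation.
standard = (x - x.mean()) / x.std()

print(robust)               # [-1.  -0.5  0.   0.5 48.5]
print(standard.round(3))
```

With standard scaling the outlier inflates the mean and standard deviation, so the four ordinary values are squeezed into a narrow band; with robust scaling they keep a usable spread.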

### 6. Missing Data

#### Issue:
Missing data can reduce the amount of available data and introduce bias.

#### Solutions:
- **Imputation:** Use techniques like mean/mode/median imputation, k-nearest neighbors, or model-based imputation to fill in missing values.
- **Remove Missing Data:** If the amount of missing data is small, consider removing those instances.
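
Median imputation, the simplest of these options, can be sketched in a few lines of NumPy. The small matrix is made up for illustration; with a train/test split, the medians should be computed on the training data only and reused for the test data.

```python
import numpy as np

X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
])

# Column medians, ignoring missing values.
medians = np.nanmedian(X, axis=0)           # [2.0, 30.0]

# Replace each NaN with the median of its column.
X_imputed = np.where(np.isnan(X), medians, X)
print(X_imputed)
```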

### 7. Model Interpretability

#### Issue:
Understanding and interpreting the coefficients in logistic regression can be challenging, especially with many features.

#### Solutions:
- **Standardization:** Standardize the features to make the coefficients comparable.
- **Odds Ratios:** Transform the coefficients into odds ratios to make them more interpretable.
- **Partial Dependence Plots:** Use these plots to understand the relationship between each feature and the target variable.

### 8. Convergence Issues

#### Issue:
The solver may fail to converge within its iteration limit. Common causes are features on very different scales, too few iterations, and perfectly (or almost perfectly) separable classes, where the unregularized coefficients grow without bound.

#### Solutions:
- **Feature Scaling:** Standardize or normalize the features to ensure they are on a similar scale.
- **Algorithm Parameters:** Adjust the solver and maximum iterations parameters in the learning algorithm.
- **Simplify the Model:** Reduce the number of features or use regularization to stabilize the optimization process.



In [4]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, roc_auc_score

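# Generate sample data
# Note: the features are independent and the labels are random, so there is no
# multicollinearity for PCA to remove and scores near chance (about 0.5) are expected.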
np.random.seed(42)
X = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000),
})
y = np.random.randint(0, 2, 1000)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA to reduce multicollinearity
pca = PCA(n_components=2)  # Adjust the number of components as needed
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train logistic regression with L2 regularization
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X_train_pca, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_pca)
print(classification_report(y_test, y_pred))
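# Note: roc_auc_score is given hard 0/1 predictions here, so it reflects a single
# threshold; pass model.predict_proba(X_test_pca)[:, 1] for the threshold-free ROC-AUC.
print("ROC-AUC:", roc_auc_score(y_test, y_pred))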


              precision    recall  f1-score   support

           0       0.46      0.57      0.51       136
           1       0.56      0.45      0.50       164

    accuracy                           0.50       300
   macro avg       0.51      0.51      0.50       300
weighted avg       0.51      0.50      0.50       300

ROC-AUC: 0.5086979913916786
