# Logistic Regression Assignment

---
## 1. What is Logistic Regression, and how does it differ from Linear Regression?

**Logistic Regression** is a supervised learning algorithm used for **classification tasks**, especially binary classification (e.g., spam/not spam, pass/fail).
* **Purpose:** It predicts the **probability** that an instance belongs to a particular class.
* **Output:** Produces values between 0 and 1 by using the **sigmoid (logistic) function** on the linear combination of input features.

**Mathematical form:**

$$
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
$$

Where $p$ is the probability of belonging to the positive class.

**How does it differ from Linear Regression?**

| Feature                     | Linear Regression                               | Logistic Regression                                             |
| --------------------------- | ----------------------------------------------- | --------------------------------------------------------------- |
| **Goal**                    | Predicts continuous numerical values            | Predicts probability for classification                         |
| **Output range**            | Any real number (-∞ to +∞)                      | Between 0 and 1 (via sigmoid/logistic function)                 |
| **Algorithm type**          | Regression                                      | Classification                                                  |
| **Loss function**           | Minimizes **Mean Squared Error (MSE)**          | Minimizes **Log-Loss (Cross-Entropy Loss)**                     |
| **Decision boundary**       | Not applicable                                  | Classification threshold (commonly 0.5)                         |
| **Relationship assumption** | Linear relationship between features and output | Linear relationship between features and **log-odds** of output |
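
The last row of the table can be made explicit by inverting the sigmoid in the mathematical form above. A short derivation, using the same symbols $p$ and $\beta_j$:

$$
\frac{p}{1 - p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}
\quad\Longrightarrow\quad
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
$$

So a one-unit increase in $x_j$ adds $\beta_j$ to the log-odds, which is the same as multiplying the odds $\frac{p}{1-p}$ by $e^{\beta_j}$.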

---
## 2. Explain the role of the Sigmoid function in Logistic Regression.

1. **Purpose**

   * In logistic regression, the model first computes a **linear combination** of input features:

     $$
     z = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n
     $$

     This $z$ can take any value from $-\infty$ to $+\infty$.
   * The **sigmoid function** is applied to transform this unbounded value into a **probability** in the range **(0, 1)**.

2. **Sigmoid Function Formula**

   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$

   * When $z \to +\infty$, $\sigma(z) \to 1$
   * When $z \to -\infty$, $\sigma(z) \to 0$
   * When $z = 0$, $\sigma(z) = 0.5$

3. **Why It’s Important in Logistic Regression**

   * **Probability Mapping:** Converts linear output into probabilities for classification.
   * **Interpretability:** The output can be interpreted as “probability of belonging to the positive class.”
   * **Decision Making:** By setting a threshold (commonly 0.5), we classify the data into two classes.
   * **Smooth Gradient:** The function is differentiable, which allows optimization using **Gradient Descent**.

4. **Visualization Insight**

   * The sigmoid curve is **S-shaped**: extremely large or small linear scores saturate near 1 or 0 instead of producing unbounded outputs.
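
The limiting behaviour listed above can be checked numerically. The following is a minimal sketch in plain Python; the helper name `sigmoid` and the sample values of $z$ are illustrative choices, not part of the assignment.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) could overflow, so use the equivalent form
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Evaluate the function on a few linear scores
for z in (-10, -2, 0, 2, 10):
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0  # default 0.5 decision threshold
    print(f"z = {z:>3}  sigma(z) = {p:.5f}  predicted class = {label}")
```

Scores far from zero map to probabilities very close to 0 or 1, while $z = 0$ sits exactly on the 0.5 threshold.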

---
## 3. What is Regularization in Logistic Regression and why is it needed?

Regularization is a technique used to **prevent overfitting** by adding a **penalty term** to the cost function in logistic regression.
It discourages the model from fitting too closely to the training data by **shrinking the coefficients** ($\beta$ values).

### **Why It’s Needed**

1. **Overfitting Control**

   * Without regularization, logistic regression can produce very large weights for some features, especially if the dataset has many features or multicollinearity.
   * Large weights can cause the model to fit training data noise, reducing generalization to new data.

2. **Feature Selection**

   * Certain regularization methods (like **L1 regularization**) can force some coefficients to become exactly zero, effectively removing irrelevant features.

3. **Improved Generalization**

   * By constraining weights, the model becomes simpler and less sensitive to fluctuations in training data.

### **Types of Regularization in Logistic Regression**

| Type            | Penalty Term Added to Cost Function                  | Effect                                                                               |
| --------------- | ---------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **L1 (Lasso)**  | $\frac{\lambda}{m} \sum_{j=1}^n \lvert\beta_j\rvert$ | Shrinks some coefficients to exactly zero → **feature selection**                    |
| **L2 (Ridge)**  | $\frac{\lambda}{2m} \sum_{j=1}^n \beta_j^2$          | Shrinks all coefficients toward zero without zeroing any out → **better stability**  |
| **Elastic Net** | Combination of L1 and L2                             | Balances between feature selection and stability                                     |

**Without regularization**:

$$
J(\beta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_\beta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\beta(x^{(i)})) \right]
$$

#### **L2 Regularization (Ridge Logistic Regression)**

With **L2 penalty**:

$$
J_{L2}(\beta) = J(\beta) + \frac{\lambda}{2m} \sum_{j=1}^n \beta_j^2
$$

* **Penalty term:** $\frac{\lambda}{2m} \sum_{j=1}^n \beta_j^2$
* **Effect:** Shrinks coefficients towards zero but **does not** make them exactly zero.

#### **L1 Regularization (Lasso Logistic Regression)**

With **L1 penalty**:

$$
J_{L1}(\beta) = J(\beta) + \frac{\lambda}{m} \sum_{j=1}^n |\beta_j|
$$

* **Penalty term:** $\frac{\lambda}{m} \sum_{j=1}^n |\beta_j|$
* **Effect:** Can make some coefficients **exactly zero**, performing **automatic feature selection**.
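
The three cost functions above can be evaluated side by side. This is a minimal NumPy sketch under stated assumptions: the function name `log_loss_cost`, the toy data, and the coefficient values are made up for illustration, and the intercept $\beta_0$ is left out of the penalty because the sums above start at $j = 1$.

```python
import numpy as np


def log_loss_cost(beta0, beta, X, y, lam=0.0, penalty=None):
    """Cross-entropy cost J(beta), optionally with an L1 or L2 penalty."""
    m = X.shape[0]
    z = beta0 + X @ beta                      # linear scores
    h = 1.0 / (1.0 + np.exp(-z))              # predicted probabilities
    h = np.clip(h, 1e-12, 1 - 1e-12)          # keep log() finite
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    if penalty == "l2":
        cost += lam / (2 * m) * np.sum(beta ** 2)
    elif penalty == "l1":
        cost += lam / m * np.sum(np.abs(beta))
    return cost


# Toy data (illustrative only): 4 samples, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 0, 1])
beta0, beta = 0.0, np.array([1.0, -1.0])

j_plain = log_loss_cost(beta0, beta, X, y)
j_l2 = log_loss_cost(beta0, beta, X, y, lam=1.0, penalty="l2")
j_l1 = log_loss_cost(beta0, beta, X, y, lam=1.0, penalty="l1")
print(f"J = {j_plain:.4f}, J_L2 = {j_l2:.4f}, J_L1 = {j_l1:.4f}")
```

With $\lambda = 1$ and $m = 4$, the L2 penalty adds $\frac{1}{8}(1^2 + 1^2) = 0.25$ and the L1 penalty adds $\frac{1}{4}(1 + 1) = 0.5$ to the unregularized cost.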

---
## 4. What are some common evaluation metrics for classification models, and why are they important?

Evaluation metrics help measure **how well** a classification model performs.
They are important because accuracy alone can be misleading — especially with **imbalanced datasets**.

### **1. Accuracy**

* **Formula:**

$$
\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Samples}}
$$

* **Meaning:** Percentage of correct predictions.
* **Limitation:** Can be misleading if classes are imbalanced (e.g., 95% accuracy by predicting all samples as the majority class).

### **2. Precision**

* **Formula:**

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

* **Meaning:** Out of all predicted positives, how many were actually positive.
* **Importance:** High precision means fewer false alarms — useful in cases like spam detection.

### **3. Recall (Sensitivity / True Positive Rate)**

* **Formula:**

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

* **Meaning:** Out of all actual positives, how many were correctly predicted.
* **Importance:** High recall means fewer missed positive cases — critical in medical diagnoses.

### **4. F1-Score**

* **Formula:**

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

* **Meaning:** Harmonic mean of Precision and Recall.
* **Importance:** Good for imbalanced datasets when both precision and recall matter.

### **5. ROC-AUC (Receiver Operating Characteristic – Area Under Curve)**

* **Meaning:** The ROC curve plots True Positive Rate (Recall) against False Positive Rate across all thresholds; AUC is the area under that curve.
* **Importance:** AUC close to 1 indicates strong model discrimination ability, while 0.5 corresponds to random guessing.

### **6. Confusion Matrix**

* **Meaning:** A table showing counts of TP, TN, FP, FN.
* **Importance:** Gives a complete picture of classification performance, not just one score.
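
The formulas above can be applied directly to confusion-matrix counts. The sketch below uses made-up counts for a dataset with 1,000 samples and 50 positives; the function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Return 0.0 when a ratio is undefined (no predicted or actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Model A: predicts the majority (negative) class for every sample
majority = classification_metrics(tp=0, tn=950, fp=0, fn=50)
# Model B: finds 30 of the 50 positives at the cost of 20 false alarms
useful = classification_metrics(tp=30, tn=930, fp=20, fn=20)

for name, scores in (("majority-only", majority), ("useful", useful)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Both models have similar accuracy, but only precision, recall, and F1 reveal that the majority-only model never finds a positive case.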

✅ **Why these metrics are important:**

* They allow you to **choose the right model** for the problem.
* Different problems need **different priorities** (e.g., Precision for fraud detection, Recall for cancer screening).
* They help detect **overfitting** and **imbalanced class bias**.

---
## 5. Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Load dataset from sklearn (Iris dataset for example)
iris = datasets.load_iris()

# Create a Pandas DataFrame from the dataset
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Display first few rows
print("Dataset preview:")
print(df.head())

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=200)  # Increase max_iter to ensure convergence

# Train the model
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\nLogistic Regression Model Accuracy: {accuracy:.2f}")
```

Output:

```text
Dataset preview:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Logistic Regression Model Accuracy: 1.00
```
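
The question asks for a CSV file, while the cell above builds the DataFrame from scikit-learn's bundled Iris data. A minimal sketch of the CSV route follows. It assumes no CSV ships with the assignment, so it first writes Iris to a temporary file; the file name `iris.csv` is hypothetical and would be replaced by the real path.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: create a CSV to read, since none is provided with the assignment
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")  # hypothetical file name
load_iris(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)

# Separate features and target, then split 80/20
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train and evaluate
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy from CSV-loaded data: {accuracy:.2f}")
```

For a real dataset only the `pd.read_csv` path and the name of the target column need to change.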


---
## 6. Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset from sklearn (Iris dataset for example)
iris = datasets.load_iris()

# Create a Pandas DataFrame from the dataset
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model with L2 regularization
model = LogisticRegression(
    penalty='l2',       # L2 Regularization
    C=1.0,              # Inverse regularization strength (smaller value = stronger regularization)
    max_iter=200        # Increase iterations for convergence
)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print model coefficients and accuracy
# coef_ has shape (n_classes, n_features): one row per class, one column per feature
print("Model Coefficients (per feature, per class):")
print(model.coef_)
print("\nIntercepts for each class:")
print(model.intercept_)
print(f"\nL2-Regularized Logistic Regression Accuracy: {accuracy:.2f}")
```

Output:

```text
Model Coefficients (per feature, per class):
[[-0.39348375  0.96248072 -2.37513667 -0.99874733]
 [ 0.50844947 -0.25480597 -0.21300937 -0.77574588]
 [-0.11496571 -0.70767474  2.58814604  1.77449321]]

Intercepts for each class:
[  9.00911397   1.86887848 -10.87799245]

L2-Regularized Logistic Regression Accuracy: 1.00
```


---
## 7. Write a Python program to train a Logistic Regression model for multiclass classification using `multi_class='ovr'` and print the classification report.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load dataset from sklearn (Iris dataset for example)
iris = datasets.load_iris()

# Create a Pandas DataFrame from the dataset
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression model with One-vs-Rest strategy
# Note: recent scikit-learn releases deprecate `multi_class` and emit a FutureWarning
model = LogisticRegression(
    multi_class='ovr',   # One-vs-Rest classification
    max_iter=200,        # Increase iterations for convergence
    solver='lbfgs'       # Suitable for small datasets like Iris
)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report (One-vs-Rest):")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Output:

```text
Classification Report (One-vs-Rest):
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
```
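
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`. One way to keep the One-vs-Rest strategy explicit without that argument is the `OneVsRestClassifier` wrapper. The sketch below is an alternative under that assumption, not the form the question asks for, and its report may differ slightly from the one above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Same data and split settings as the cell above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Wrap a binary logistic regression: one classifier is fitted per class
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print(f"Number of binary classifiers: {len(ovr_model.estimators_)}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Each entry of `ovr_model.estimators_` is a separate binary model that scores "this class versus all others"; the wrapper predicts the class whose model is most confident.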



---
## 8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Load dataset from sklearn (Iris dataset)
iris = datasets.load_iris()

# Create DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the Logistic Regression model
log_reg = LogisticRegression(max_iter=500, solver='liblinear')
# Using 'liblinear' because it supports both L1 and L2 penalties

# Define the parameter grid for C and penalty
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # Regularization strength (inverse)
    'penalty': ['l1', 'l2']         # L1 = Lasso, L2 = Ridge
}

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,             # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search to the training data only
grid_search.fit(X_train, y_train)

# Get the best parameters and mean cross-validated score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

# Make predictions using the best model (refitted on the full training set)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate test accuracy
test_accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", best_params)
print(f"Best Cross-Validation Accuracy: {best_score:.4f}")
print(f"Test Accuracy with Best Parameters: {test_accuracy:.4f}")
```

Output:

```text
Best Hyperparameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583
Test Accuracy with Best Parameters: 1.0000
```


---
## 9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load dataset from sklearn (Iris dataset)
iris = datasets.load_iris()

# Create DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Without Scaling
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# With Scaling (fit the scaler on the training set only, then reuse it on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=200)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# Results
print(f"Accuracy without Scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with Scaling:    {accuracy_with_scaling:.4f}")
```

Output:

```text
Accuracy without Scaling: 1.0000
Accuracy with Scaling:    1.0000
```


---
## 10. Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

We want to predict which customers will respond to a marketing campaign.
**Challenge:**

* **Imbalanced dataset** → Only **5% positive class** (responders).
* If we use standard accuracy, predicting everyone as “No” would give \~95% accuracy but be useless for business.

### **1. Data Understanding & Preprocessing**

* **Exploratory Data Analysis (EDA):**

  * Check class distribution → confirm imbalance.
  * Look for missing values, duplicates, outliers.
  * Understand feature types (numeric, categorical, date).
* **Feature Engineering:**

  * Extract meaningful features (e.g., recency of purchase, frequency, total spend).
  * Convert categorical variables → **One-Hot Encoding** or **Target Encoding**.
  * Handle missing values with imputation (mean/median for numeric, mode/most frequent for categorical).

### **2. Feature Scaling**

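* Regularized Logistic Regression is sensitive to feature scales: the penalty treats all coefficients alike, and gradient-based solvers converge more slowly on unscaled inputs.
* Use **StandardScaler** (mean = 0, variance = 1), fitted on the training split only and then applied to the test split, to avoid leakage.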
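
One way to make the "fit on training data only" rule hard to break is to put the scaler and the model in a single scikit-learn pipeline. The sketch below is a minimal example on synthetic data; the dataset from `make_classification` (about 5% positives) is a stand-in for the real customer table, which is not available here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption, roughly 5% responders)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler and the model together on the training split
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# At prediction time the stored training statistics are reused on the test split
test_probs = pipeline.predict_proba(X_test)[:, 1]

scaler = pipeline.named_steps["standardscaler"]
print("Scaler means match training means:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
```

Because the pipeline is a single estimator, it can also be passed to cross-validation utilities, which then refit the scaler inside each training fold.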

### **3. Handling Class Imbalance**

We can combine several techniques:

1. **Class Weight Adjustment:**

   * Use `class_weight='balanced'` in Logistic Regression.
   * This adjusts weights inversely proportional to class frequencies.
2. **Oversampling Minority Class:**

   * Use **SMOTE** (Synthetic Minority Over-sampling Technique) to generate synthetic responders, applied to the training data only so that the test set keeps the real class ratio.
3. **Undersampling Majority Class:**

   * Randomly remove some "non-responders" to balance faster, but risk losing information.
4. **Combination Approach:**

   * Light undersampling + SMOTE to keep dataset size manageable.
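
To see what "inversely proportional to class frequencies" means for a 5% response rate, the weights can be computed by hand. The sketch below uses the formula given in the scikit-learn documentation for `class_weight='balanced'`, `n_samples / (n_classes * np.bincount(y))`; the label vector of 1,000 customers with 50 responders is made up for illustration.

```python
import numpy as np

# Hypothetical labels: 50 responders (1) among 1,000 customers (5%)
y = np.array([1] * 50 + [0] * 950)

n_samples = y.size
n_classes = np.unique(y).size
counts = np.bincount(y)  # [950, 50]

# "Balanced" weights: rarer classes receive proportionally larger weights
weights = n_samples / (n_classes * counts)

for cls, (count, weight) in enumerate(zip(counts, weights)):
    print(f"class {cls}: count = {count}, weight = {weight:.4f}, "
          f"weighted total = {count * weight:.1f}")
```

After weighting, both classes contribute the same total weight to the loss, so a missed responder costs the model far more than a misclassified non-responder.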

### **4. Train/Test Split**

* **Stratified Split** to preserve class ratio in both sets.
* Common: 70–80% training, 20–30% testing.
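
A stratified split is requested through the `stratify` argument of `train_test_split`. The sketch below uses placeholder data (random features and a made-up label vector with 50 responders in 1,000 rows) to show that both splits keep the 5% rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1,000 customers, 5% responders, 3 random features
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(y.size, 3))

# stratify=y keeps the class ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"Responder rate, train: {y_train.mean():.3f} ({y_train.sum()} of {y_train.size})")
print(f"Responder rate, test:  {y_test.mean():.3f} ({y_test.sum()} of {y_test.size})")
```

Without `stratify`, a random split of so few positives can leave the test set with noticeably more or fewer responders than the population, which makes the evaluation metrics noisier.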

### **5. Model Training with Hyperparameter Tuning**

* Use **GridSearchCV** or **RandomizedSearchCV** with **StratifiedKFold CV** to tune:

  * `C` (inverse regularization strength)
  * `penalty` (L1, L2, Elastic Net)
  * `solver` (`liblinear` supports L1 and L2; `saga` also supports Elastic Net)
* Example parameter grid:

  ```python
  param_grid = {
      'C': [0.01, 0.1, 1, 10],
      'penalty': ['l1', 'l2'],
      'solver': ['liblinear', 'saga']
  }
  ```

### **6. Model Evaluation**

* **Avoid plain accuracy** — use metrics that handle imbalance:

  * **Precision**: Of predicted responders, how many actually respond? (Important to avoid spamming uninterested customers)
  * **Recall (Sensitivity)**: Of all actual responders, how many did we correctly identify? (Important for campaign reach)
  * **F1-score**: Harmonic mean of Precision & Recall.
  * **ROC-AUC**: Overall ability to rank responders higher than non-responders.
  * **PR-AUC** (Precision-Recall AUC): More informative for highly imbalanced datasets.
* For business context:

  * High **Recall** ensures we target as many potential responders as possible.
  * High **Precision** keeps campaign cost low by reducing wasted outreach.

### **7. Business-Aware Decision Threshold**

* Logistic Regression outputs probabilities → default threshold is 0.5.
* In imbalanced cases, **lower the threshold** (e.g., 0.3) to capture more positives (increase recall).
* Select threshold based on business trade-off between:

  * **Marketing Cost** (false positives)
  * **Lost Revenue** (false negatives)
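
The trade-off can be turned into a simple search: assign a cost to each error type and pick the threshold with the lowest total cost on validation data. The sketch below is one possible approach; the function name `best_threshold`, the cost values (1 per wasted contact, 20 per missed responder), and the ten scored customers are all hypothetical.

```python
import numpy as np


def best_threshold(y_true, probs, cost_fp, cost_fn, thresholds=None):
    """Return the (threshold, total cost) pair with the lowest total cost."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    costs = []
    for t in thresholds:
        pred = probs >= t
        fp = np.sum(pred & (y_true == 0))    # contacted, did not respond
        fn = np.sum(~pred & (y_true == 1))   # responder we failed to contact
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))             # first threshold with minimal cost
    return float(thresholds[best]), float(costs[best])


# Hypothetical validation labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
probs = np.array([0.03, 0.08, 0.12, 0.22, 0.27, 0.33, 0.42, 0.57, 0.62, 0.91])

threshold, cost = best_threshold(y_true, probs, cost_fp=1.0, cost_fn=20.0)
_, default_cost = best_threshold(
    y_true, probs, cost_fp=1.0, cost_fn=20.0, thresholds=np.array([0.5])
)
print(f"Chosen threshold: {threshold:.2f} (cost {cost:.0f})")
print(f"Cost at the default 0.5 threshold: {default_cost:.0f}")
```

In this toy example a threshold below 0.5 is cheaper because it recovers the responder scored at 0.42, and missing a responder is assumed to cost far more than one extra contact.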

### **8. Final Deployment & Monitoring**

* Retrain periodically as customer behavior changes.
* Monitor:

  * Data drift (feature distributions changing over time)
  * Model drift (degradation in recall, precision, or PR-AUC over time)
  * Business KPIs (ROI of marketing campaigns)

✅ **Summary Table of Approach:**

| Step                    | Technique Used                                           |
| ----------------------- | -------------------------------------------------------- |
| Data cleaning           | Missing value imputation, encoding categorical variables |
| Feature scaling         | StandardScaler                                           |
| Handling imbalance      | Class weights, SMOTE, undersampling                      |
| Model selection         | Logistic Regression                                      |
| Hyperparameter tuning   | GridSearchCV with StratifiedKFold                        |
| Evaluation metrics      | Precision, Recall, F1, ROC-AUC, PR-AUC                   |
| Threshold adjustment    | Based on cost-benefit analysis                           |
| Deployment & monitoring | Retraining & KPI tracking                                |