# Logistic Regression | Assignment

Question 1 : What is Logistic Regression, and how does it differ from Linear Regression?

Answer :
***
Logistic regression is a supervised machine learning algorithm used primarily for classification: despite its name, it predicts the probability that an instance belongs to a particular class rather than a continuous value. It models the relationship between the independent variables and a categorical dependent variable by passing a linear combination of the features through the S-shaped sigmoid function, which yields a probability between 0 and 1. That probability is then compared with a threshold (commonly 0.5) to assign a discrete class, such as "spam" or "not spam" for emails, or "disease present" versus "disease absent".

Logistic Regression predicts the probability of a categorical outcome, often binary (like yes/no, 0/1), using an S-shaped sigmoid curve, whereas Linear Regression predicts a continuous numerical outcome (like price or temperature) by fitting a straight line to the data. The core difference lies in the type of target variable: linear regression outputs a continuous value directly, while logistic regression outputs a probability that is then mapped to a discrete class. The two are also fitted differently: linear regression typically minimizes squared error, whereas logistic regression maximizes the likelihood of the observed labels (equivalently, minimizes log loss).
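
A small sketch of this difference, using made-up toy data (the arrays below are assumptions for illustration only): a linear regression fitted to 0/1 labels can return values outside the range 0 to 1, while logistic regression always returns a valid probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, binary label.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Two points far outside the training range.
X_new = np.array([[-10.0], [20.0]])

linear_out = linear.predict(X_new)                   # unbounded real values
logistic_out = logistic.predict_proba(X_new)[:, 1]   # probability of class 1

print("Linear regression outputs:  ", linear_out)
print("Logistic regression outputs:", logistic_out)
```

The linear model extrapolates below 0 and above 1 for these inputs, which cannot be read as probabilities; the logistic outputs stay inside the interval.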
***

Question 2 : Explain the role of the Sigmoid function in Logistic Regression.

Answer :
***
The Sigmoid function, or logistic function, is crucial in Logistic Regression because it transforms the raw output of a linear model into a probability value between 0 and 1. This S-shaped function "squashes" any real-valued input into this bounded range, making the output directly interpretable as the probability of a binary event occurring (e.g., 0.88 meaning an 88% chance). This probabilistic output is essential for binary classification, enabling the model to predict which of two classes an input belongs to based on a set threshold.

Here's a breakdown of its role:
1. Probability Mapping:
Logistic regression starts by calculating a linear combination of input features, which can produce any real number. The sigmoid function takes this raw score and maps it to a value between 0 and 1, which is a valid probability.
2. Binary Classification:
The model uses this probability to classify an input into one of two categories. For example, if the probability exceeds a certain threshold (often 0.5), the input is classified as belonging to the positive class; otherwise, it's assigned to the negative class.
3. Interpretability:
Without the sigmoid function, the output of the linear model wouldn't be directly interpretable as a probability. The sigmoid function provides this essential probabilistic interpretation, which is valuable for risk assessment and other predictive tasks.
4. Log-Odds Connection:
Mathematically, the output of the linear model (before the sigmoid) is the "log-odds" of the event, log(p / (1 - p)). The sigmoid is the inverse of this log-odds (logit) function, which gives a well-defined link between the model's linear component and the predicted probability.
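
The mapping can be sketched in a few lines of NumPy. The scores in `z` are hypothetical values standing in for the linear part of a model; `sigmoid` and `logit` are helper names chosen for this example.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to its log-odds."""
    return np.log(p / (1.0 - p))

z = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])  # hypothetical linear scores
p = sigmoid(z)

print("scores:        ", z)
print("probabilities: ", np.round(p, 3))
print("log-odds back: ", np.round(logit(p), 3))
print("class at 0.5:  ", (p >= 0.5).astype(int))
```

A score of 0 maps to a probability of exactly 0.5, and applying `logit` to the probabilities recovers the original scores, which is the log-odds connection described in point 4.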
***

Question 3 : What is Regularization in Logistic Regression and why is it needed?

Answer :
***
Regularization in Logistic Regression is a technique that adds a penalty to the model's complexity, preventing it from overfitting the training data. It's needed because logistic regression, especially with many features, can learn complex patterns in the training data that don't generalize to new, unseen data, leading to poor predictive performance. By penalizing large model coefficients, regularization encourages simpler models that are more robust and accurate on new data.

Characteristics of Regularization :-
1. Penalty for Complexity:
Regularization introduces a penalty term into the logistic regression model's objective function (the function it tries to minimize).
2. Shrinks Coefficients:
This penalty discourages excessively large coefficients for the input features, which would make the model too complex.
3. Trade-off:
It creates a trade-off between fitting the training data perfectly (low training error) and building a model that generalizes well to unseen data (low generalization error).
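
One common way to write the penalized objective is sketched below. The symbols (w for the coefficients, b for the intercept, λ for the penalty strength, R for the penalty function) are notation assumed here for illustration; the answer above does not fix one.

```latex
J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big] + \lambda \, R(w),
\qquad p_i = \frac{1}{1 + e^{-(w^\top x_i + b)}}
```

The first term is the log loss on the training data and the second is the penalty. A larger λ means a stronger penalty. scikit-learn exposes the inverse of this strength as the parameter `C`, so a smaller `C` means stronger regularization.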

Need of Regularization :-
1. To Prevent Overfitting:
The primary reason for regularization is to combat overfitting, a common problem where a model learns the noise and specific details of the training data too well, leading to poor performance on new data.
2. High-Dimensional Data:
Logistic regression models with many features are particularly prone to overfitting.
3. Improve Generalization:
By controlling model complexity, regularization helps the model generalize better to new, unseen data, resulting in more reliable predictions.
4. Reduce High Variance:
Overfitting often manifests as high variance, meaning the model is highly sensitive to the specific training data. Regularization reduces this variance at the cost of a small increase in bias, leading to a more stable and generalized model.

Common Regularization Techniques :-
1. L1 Regularization (Lasso):
Adds a penalty proportional to the sum of the absolute values of the coefficients. It can shrink some coefficients to exactly zero, effectively performing feature selection and creating sparse models.
2. L2 Regularization (Ridge):
Adds a penalty proportional to the sum of the squared coefficients. It shrinks coefficients towards zero but rarely to exactly zero, distributing the importance across features rather than eliminating them.
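
The sparsity difference between the two penalties can be checked directly. This is a minimal sketch on the breast cancer dataset; the value `C=0.1` and the `liblinear` solver are choices made for this illustration, not settings taken from the answer above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put all features on the same scale

# C=0.1 is an arbitrary, fairly strong penalty chosen for illustration.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# Count coefficients that were driven to exactly zero.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Coefficients exactly zero with L1:", n_zero_l1, "of", X.shape[1])
print("Coefficients exactly zero with L2:", n_zero_l2, "of", X.shape[1])
```

With L1 a number of features drop out of the model entirely, while L2 keeps all of them with smaller weights.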
***

Question 4 : What are some common evaluation metrics for classification models, and why are they important?

Answer :
***
Common classification evaluation metrics include Accuracy, Precision, Recall (Sensitivity), F1-Score, Specificity, AUC-ROC, and the Confusion Matrix, which together describe a model's performance in more detail than a single count of correct and incorrect predictions. These metrics are important because the right metric depends on the cost of each type of error in the task at hand, such as missing a disease (false negative) vs. falsely diagnosing one (false positive), and on how balanced the classes are.

Common Classification Metrics
1. Confusion Matrix:
A table that breaks down a model's predictions into four components: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). It forms the basis for other metrics.
2. Accuracy:
The proportion of all predictions that were correct (TP+TN) out of the total number of predictions. It is useful for balanced datasets but can be misleading with imbalanced ones.
3. Precision:
The proportion of positive predictions that were actually correct (TP / (TP + FP)). High precision means that when the model predicts a positive class, it is usually correct, which is important to minimize false alarms.
4. Recall (Sensitivity or True Positive Rate):
The proportion of actual positive instances that were correctly identified (TP / (TP + FN)). High recall indicates the model's ability to find all positive cases and is crucial for tasks like medical diagnosis where false negatives have severe consequences.
5. F1-Score:
The harmonic mean of precision and recall (2 * (Precision * Recall) / (Precision + Recall)). It provides a single metric that balances both precision and recall, making it useful for imbalanced datasets where you want both to be high.
6. Specificity (True Negative Rate):
The proportion of actual negative instances that were correctly identified (TN / (TN + FP)). It measures the model's ability to correctly identify negatives.
7. Area Under the ROC Curve (AUC-ROC):
Measures the model's ability to distinguish between positive and negative classes across various probability thresholds. It is a good metric for evaluating models that output probabilities.
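
The formulas above can be worked through on a tiny hand-made example. The labels and predictions below are invented for illustration; only `confusion_matrix` comes from scikit-learn, the rest is plain arithmetic.

```python
from sklearn.metrics import confusion_matrix

# Ten invented examples: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f}")
```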

Why They Are Important
1. Task-Specific Decision Making:
The choice of metric must align with the real-world problem's objectives. For example, prioritizing Recall is critical in disease screening to avoid missing cases (false negatives), while Precision is vital in spam detection to avoid marking legitimate emails as spam (false positives).
2. Handling Imbalanced Datasets:
Accuracy can be a poor indicator when one class is significantly more common than others. Metrics like Precision, Recall, and F1-Score provide a more nuanced view of the model's performance on minority classes, which might be the more critical ones.
3. Understanding Trade-offs:
Metrics like Precision and Recall represent different types of errors. Using them helps in understanding the trade-offs inherent in the model's predictions and choosing a threshold that best fits the specific business or scientific goals.
4. Comprehensive Evaluation:
No single metric tells the whole story. A combination of metrics from the confusion matrix provides a comprehensive picture of a model's strengths and weaknesses, leading to better model development and selection.
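
The precision/recall trade-off in point 3 can be seen by moving the decision threshold. The scores and labels below are invented for illustration, and `precision_recall_at` is a helper defined only for this sketch.

```python
import numpy as np

# Invented predicted probabilities and true labels for ten examples.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90])

def precision_recall_at(threshold):
    """Return (precision, recall) when predicting positive for scores >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in [0.35, 0.50, 0.65]:
    precision, recall = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches every positive at the price of more false alarms; raising it removes the false alarms but misses some positives.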
***

Question 5 : Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)

Answer :
***

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # increased max_iter for convergence
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Logistic Regression Model Accuracy:", accuracy)


Logistic Regression Model Accuracy: 0.9649122807017544
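
The question asks for a CSV file, while the program above builds the DataFrame in memory. A possible variant that goes through an actual file is sketched below: it first writes the same sklearn dataset to a temporary CSV and then reads it back with `pd.read_csv`. The file name and the use of a temporary directory are assumptions made for this sketch.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a CSV so there is a file to load.
# The file name "breast_cancer.csv" is an assumption for this sketch.
data = load_breast_cancer(as_frame=True)
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
data.frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

# Split, train, and evaluate as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

csv_accuracy = accuracy_score(y_test, model.predict(X_test))
print("Accuracy (data loaded from CSV):", csv_accuracy)
```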


***

Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

Answer :
***

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic Regression with L2 (Ridge) regularization; L2 is scikit-learn's default penalty
model = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients (per feature):")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

print("\nIntercept:", model.intercept_[0])
print("Accuracy:", accuracy)


Model Coefficients (per feature):
mean radius: 0.8071
mean texture: 0.1133
mean perimeter: -0.2831
mean area: 0.0252
mean smoothness: -0.1673
mean compactness: -0.2022
mean concavity: -0.4551
mean concave points: -0.2524
mean symmetry: -0.3092
mean fractal dimension: -0.0312
radius error: -0.0551
texture error: 1.1033
perimeter error: 0.0856
area error: -0.0960
smoothness error: -0.0223
compactness error: 0.0591
concavity error: -0.0214
concave points error: -0.0354
symmetry error: -0.0404
fractal dimension error: 0.0137
worst radius: 0.0952
worst texture: -0.3769
worst perimeter: -0.0878
worst area: -0.0146
worst smoothness: -0.3248
worst compactness: -0.7477
worst concavity: -1.3233
worst concave points: -0.5634
worst symmetry: -0.7879
worst fractal dimension: -0.0916

Intercept: 29.173300070114156
Accuracy: 0.9649122807017544
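
To see what the L2 penalty does to the coefficients, one option is to refit the model for a few values of `C` and compare the overall size of the coefficient vector. This is a sketch under assumptions: the three `C` values are arbitrary, and the features are standardized first (unlike the program above) so that the coefficients are comparable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

norms = []
for C in [0.01, 1.0, 100.0]:  # smaller C = stronger L2 penalty
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    pipe.fit(X, y)
    coef = pipe.named_steps["logisticregression"].coef_
    norms.append(float(np.linalg.norm(coef)))
    print(f"C={C:<6} L2 norm of coefficients: {norms[-1]:.3f}")
```

The norm grows as `C` increases, because a larger `C` weakens the penalty and lets the coefficients move further from zero.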


***

Question 7 : Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

Answer :
***


In [4]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

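import warnings
# multi_class is deprecated in recent scikit-learn releases; silence only that FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning)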

# Load dataset
data = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic Regression with OvR
model = LogisticRegression(multi_class="ovr", solver="lbfgs", max_iter=5000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.80      0.89        10
   virginica       0.83      1.00      0.91        10

    accuracy                           0.93        30
   macro avg       0.94      0.93      0.93        30
weighted avg       0.94      0.93      0.93        30
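
The `multi_class` argument is deprecated in recent scikit-learn releases (since version 1.5, as far as I know), which is why the program above has to silence warnings. A possible alternative that keeps the one-vs-rest scheme is to wrap the model in `OneVsRestClassifier`. This is a sketch of that option, not the form the question asks for.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# One binary logistic regression is trained per class (class vs. the rest).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr_model.fit(X_train, y_train)

y_pred_ovr = ovr_model.predict(X_test)
print("Number of binary models:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred_ovr, target_names=data.target_names))
```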



***

Question 8 : Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.
(Use Dataset from sklearn package)

Answer:
***

In [5]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Load dataset
data = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define parameter grid
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],            # Inverse of regularization strength (smaller = stronger)
    "penalty": ["l1", "l2"],                 # Penalty type
    "solver": ["liblinear"]                  # Solver that supports both l1 & l2
}

# Logistic Regression + GridSearchCV
grid = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid.fit(X_train, y_train)

# Print best parameters and validation accuracy
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9666666666666668
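
The cross-validation score is computed on the training split only, so it is worth checking the tuned model on the held-out test set as well. The sketch below does this with a slightly different setup, and the changes are my own choices rather than part of the program above: the `saga` solver is swapped in (it supports both penalties with a multinomial loss) and a `StandardScaler` step is added inside a `Pipeline` so that scaling is refitted within each fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000, random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l1", "l2"],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)  # uses the refitted best estimator
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)
print("Held-out Test Accuracy:", test_accuracy)
```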


***

Question 9 : Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)

Answer :
***

In [6]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Without Scaling
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# With Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=5000)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

# Compare Results
print("Accuracy without Scaling:", acc_no_scaling)
print("Accuracy with Scaling:", acc_scaling)


Accuracy without Scaling: 0.9649122807017544
Accuracy with Scaling: 0.9824561403508771
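
A single train/test split can make the gap between the two models look larger or smaller than it really is. A possible follow-up, sketched here as an addition to the comparison above, is to repeat it with 5-fold cross-validation and to put the scaler in a pipeline so that it is fitted on the training part of each fold only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

unscaled = LogisticRegression(max_iter=5000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# cross_val_score refits each estimator on every fold; for the pipeline this
# includes the scaler, so no statistics leak from the validation part.
scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print("Mean CV accuracy without scaling:", scores_unscaled.mean())
print("Mean CV accuracy with scaling:   ", scores_scaled.mean())
```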


***

Question 10 : Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer :
***

Here’s how I’d approach building a Logistic Regression model for predicting customer responses to a marketing campaign, step by step :

1. Understand the Problem

* Target variable : Response (1 = responded, 0 = no response)
* Challenge : Severe class imbalance (only about 5% positives).
* Goal : Build a model that identifies responders well (the business cares about not missing responders, but also about not spamming non-responders).

2. Data Preparation :

* Explore Data : Check missing values, outliers, skewed distributions.
* Feature Engineering :

  * Customer demographics (age, income, location).
  * Purchase history (frequency, recency, monetary value).
  * Marketing engagement (email opens, clicks).

3. Feature Scaling :

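* Regularized Logistic Regression is sensitive to feature scales: the penalty acts on the coefficients, so the scale of a feature changes how strongly it is penalized, and solvers also converge more slowly on unscaled data.
* Apply StandardScaler (z-score normalization) to continuous features.
* Encode categorical variables (via One-Hot Encoding or Target Encoding); one-hot columns can be left unscaled.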

4. Handling Imbalanced Classes

Several strategies:

1. Class Weights:

   * `LogisticRegression(class_weight="balanced")` weights each class inversely to its frequency, so errors on the minority class cost more during training.
2. Resampling:

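   * Oversample responders (e.g., SMOTE from the imbalanced-learn package) or undersample non-responders.
   * Resample only the training data (inside each cross-validation fold), never the validation or test set.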
3. Hybrid approach: Start with class weights, then experiment with SMOTE.
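
The effect of class weights can be sketched on synthetic data. The dataset below is generated with `make_classification` as a stand-in for the campaign data (about 5% positives); its size and feature counts are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data: roughly 5% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

results = {}
for name, weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred),
    }
    print(name, results[name])
```

On data like this, the balanced model typically finds more of the responders (higher recall) and pays for it with more false positives (lower precision), which is the trade-off to weigh against the campaign costs.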

5. Model Training & Hyperparameter Tuning

* Base model: Logistic Regression.
* Important hyperparameters:

  * `C`: Inverse of regularization strength (smaller `C` means a stronger penalty).
  * `penalty`: L1 (lasso, can zero out features) vs. L2 (ridge).
  * `solver`: e.g., `liblinear`, `saga`.
* Use GridSearchCV or RandomizedSearchCV with Stratified K-Fold CV (to keep the same class ratio in every fold).

6. Evaluation Metrics

* Accuracy is misleading (model could get 95% accuracy by predicting all 0s).
* Instead, focus on:

  * Precision & Recall (for responders)
  * F1-score (balance between precision & recall)
  * ROC-AUC (overall ranking ability).
  * PR-AUC (Precision-Recall curve) (better for highly imbalanced data).

Business perspective:

* If cost of spamming non-responders is low → prioritize **Recall** (catch more responders).
* If cost of spamming is high → prioritize **Precision**.
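
The point about accuracy can be made concrete with a baseline that predicts "no response" for everyone. The 1000 customers below are hypothetical; the constant score of 0.05 stands for a model with no ranking ability at all.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, recall_score, roc_auc_score,
)

# 1000 hypothetical customers, 5% responders.
y_true = np.array([1] * 50 + [0] * 950)

# A "model" that predicts nobody responds and gives everyone the same score.
y_pred = np.zeros_like(y_true)
y_score = np.full(len(y_true), 0.05)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
baseline_roc_auc = roc_auc_score(y_true, y_score)
baseline_pr_auc = average_precision_score(y_true, y_score)

print("Accuracy:", baseline_accuracy)
print("Recall:  ", baseline_recall)
print("ROC-AUC: ", baseline_roc_auc)
print("PR-AUC:  ", baseline_pr_auc)
```

The baseline reaches 95% accuracy while finding no responders. Its ROC-AUC is 0.5 and its PR-AUC equals the positive rate (0.05), which are the reference values a useful model has to beat.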

7. Final Steps

* Train final model with best hyperparameters.
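* Calibrate predicted probabilities (using Platt scaling or isotonic regression) if the probabilities themselves are needed, for example to estimate expected response rates. Ranking customers only needs the order of the scores, but class weighting and resampling both distort the raw probability values.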
* Deploy model and monitor performance (check drift, recalibrate periodically).

Summary of Approach:

1. Clean & preprocess features (scale numerics, encode categoricals).
2. Handle imbalance using `class_weight="balanced"` and/or SMOTE.
3. Train Logistic Regression with hyperparameter tuning (C, penalty).
4. Evaluate using Precision, Recall, F1, ROC-AUC, PR-AUC (not just accuracy).
5. Align metric choice with business objective (catch more responders vs. reduce false positives).
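
Put together, the steps could look roughly like the sketch below. It runs on synthetic data: every column name (`age`, `income`, `recency_days`, `channel`, `responded`) and the way responses are generated are hypothetical, and the grid covers only `C` to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 4000
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(50_000, 15_000, n),
    "recency_days": rng.integers(1, 365, n),
    "channel": rng.choice(["email", "sms", "app"], n),
})
# Recent buyers and email users respond more often; roughly 5% positives overall.
log_odds = -2.0 - 0.01 * df["recency_days"] + 0.8 * (df["channel"] == "email")
df["responded"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = df.drop(columns="responded")
y = df["responded"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: scale numeric columns, one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income", "recency_days"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
# Step 2: handle imbalance with class weights.
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=2000)),
])
# Step 3: tune C with stratified folds, scored by PR-AUC (average precision).
search = GridSearchCV(
    pipe,
    {"clf__C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="average_precision",
)
search.fit(X_train, y_train)

# Step 4: evaluate on the untouched test set.
test_scores = search.predict_proba(X_test)[:, 1]
test_pr_auc = average_precision_score(y_test, test_scores)
print("Best C:", search.best_params_["clf__C"])
print("Test PR-AUC:", test_pr_auc)
print(classification_report(y_test, search.predict(X_test), zero_division=0))
```

Step 5 (matching the metric to the business objective) would then come down to choosing a decision threshold on `test_scores` instead of relying on the default of 0.5.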

---