### Logistic Regression

1. What is Logistic Regression, and how does it differ from Linear
Regression?
- Logistic regression is a statistical method for classification, most commonly for predicting the probability of a binary outcome (e.g., yes/no, 0/1). It models the log-odds of the positive class as a linear function of the inputs and passes that score through the logistic (sigmoid) function, so its output lies strictly between 0 and 1. In contrast, linear regression predicts a continuous numeric value by fitting a linear relationship between the input variables and the output, and its predictions are unbounded. Linear regression is typically fitted by least squares (minimizing squared error), whereas logistic regression is fitted by maximum likelihood estimation (equivalently, minimizing log loss).
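
As a rough illustration of the difference in output range, the sketch below fits both models to a hypothetical one-feature toy dataset (the arrays and variable names are made up for this example). It only aims to show that linear regression can predict values outside [0, 1] on binary labels, while logistic regression returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, label is 1 when the feature is positive.
X = np.arange(-5, 6, dtype=float).reshape(-1, 1)
y = (X.ravel() > 0).astype(int)

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query points, two of them far outside the training range.
X_new = np.array([[-20.0], [0.5], [20.0]])
lin_pred = lin.predict(X_new)               # unbounded numeric predictions
log_prob = log.predict_proba(X_new)[:, 1]   # probabilities of class 1

print("Linear regression predictions:", lin_pred)
print("Logistic regression P(y=1):   ", log_prob)
```

The linear model extrapolates below 0 and above 1 at the extreme points, which is why its raw output cannot be read as a probability.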

2. Explain the role of the Sigmoid function in Logistic Regression.
- The sigmoid (logistic) function maps the linear score z = w·x + b to a value in (0,1), turning it into a valid probability for the positive class and enabling a natural decision threshold (often 0.5). It also provides the link between log-odds and features: applying the inverse (logit) gives log(p/(1−p)) = w·x + b, which is central to maximum-likelihood training and interpretable odds ratios.
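
A minimal NumPy sketch of the sigmoid and its inverse follows. The weights `w`, bias `b`, and feature vector `x` are hypothetical values chosen only to show that the logit of the predicted probability recovers the linear score.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability -> log-odds."""
    return np.log(p / (1.0 - p))

# Hypothetical parameters and input (not taken from any fitted model).
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([2.0, 1.0])

z = w @ x + b              # linear score: 1.6 - 0.5 + 0.1 = 1.2
p = sigmoid(z)             # probability of the positive class
log_odds = logit(p)        # equals z up to floating-point error

# exp(w_j) is the multiplicative change in the odds per unit increase in x_j.
odds_ratio_feature0 = np.exp(w[0])

print(f"z = {z:.2f}, p = {p:.4f}, log-odds = {log_odds:.4f}")
print(f"Odds ratio for feature 0: {odds_ratio_feature0:.4f}")
```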

3. What is Regularization in Logistic Regression and why is it needed?
- Regularization in logistic regression adds a penalty to the loss function (typically L1 or L2) to discourage large coefficients, which reduces model complexity and helps prevent overfitting, improving generalization to unseen data. L2 regularization shrinks weights toward zero without making them exactly zero, stabilizing models under multicollinearity, while L1 can drive some weights to exactly zero, doubling as embedded feature selection; the penalty strength is controlled by a hyperparameter (λ), trading a bit of training accuracy for better test performance.
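
The sparsity difference between L1 and L2 can be checked with a short sketch like the one below. It assumes the breast cancer dataset from scikit-learn and an arbitrary penalty strength; note that scikit-learn exposes the strength as `C`, the inverse of the regularization strength (smaller `C` means a stronger penalty). The helper `count_zero_coefs` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def count_zero_coefs(penalty, C):
    """Fit a scaled logistic regression and count coefficients that are exactly zero."""
    model = make_pipeline(
        StandardScaler(),
        # liblinear supports both the L1 and the L2 penalty.
        LogisticRegression(penalty=penalty, C=C, solver="liblinear", max_iter=1000),
    )
    model.fit(X, y)
    coefs = model[-1].coef_.ravel()
    return int(np.sum(coefs == 0.0))

C = 0.1  # arbitrary, fairly strong penalty chosen for illustration
n_zero_l1 = count_zero_coefs("l1", C)
n_zero_l2 = count_zero_coefs("l2", C)

print(f"L1: {n_zero_l1} of {X.shape[1]} coefficients are exactly zero")
print(f"L2: {n_zero_l2} of {X.shape[1]} coefficients are exactly zero")
```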

4. What are some common evaluation metrics for classification models, and
why are they important?
- Common metrics for classification include accuracy, precision, recall, F1 score, ROC-AUC, PR-AUC, log loss, and the confusion matrix. They are important because each captures different aspects of performance; for example, precision and recall are key for imbalanced data, F1 balances them, ROC-AUC and PR-AUC assess discrimination across thresholds, and log loss measures the quality of probability predictions.
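
As a small sketch of how these metrics are computed with scikit-learn, the example below uses hand-made labels and predicted probabilities (hypothetical values, not model output). Threshold-based metrics use the hard predictions, while ROC-AUC, PR-AUC, and log loss use the probabilities.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, average_precision_score, confusion_matrix,
    f1_score, log_loss, precision_score, recall_score, roc_auc_score,
)

# Hypothetical ground truth (3 positives out of 10) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.4, 0.7, 0.4, 0.9])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at a 0.5 threshold

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Threshold-free metrics computed from the probabilities.
roc_auc = roc_auc_score(y_true, y_prob)
pr_auc = average_precision_score(y_true, y_prob)
ll = log_loss(y_true, y_prob)

print("Confusion matrix:\n", cm)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} F1={f1:.3f}")
print(f"ROC-AUC={roc_auc:.3f} PR-AUC={pr_auc:.3f} Log loss={ll:.3f}")
```

Here accuracy is 0.8 while precision and recall are both 2/3, which shows how accuracy can look better than the minority-class performance actually is.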

In [1]:
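# 5. Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)
# Note: no CSV is read here; the sklearn breast cancer dataset is loaded directly into a DataFrame.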
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

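# On these unscaled features lbfgs can still hit the iteration limit (see the warning in the output);
# standardizing the inputs or raising max_iter further addresses it.
model = LogisticRegression(max_iter=1000)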
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")


Accuracy: 0.9649


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [2]:
# 6. Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train_scaled, y_train)

coefficients = model.coef_.ravel()
intercept = model.intercept_[0]

y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print("Intercept (bias):", intercept)
print("\nCoefficients (feature -> weight):")
for fname, coef in zip(data.feature_names, coefficients):
    print(f"{fname}: {coef:.6f}")

print(f"\nAccuracy: {accuracy:.4f}")


Intercept (bias): 0.3022075735370281

Coefficients (feature -> weight):
mean radius: -0.511479
mean texture: -0.552698
mean perimeter: -0.476298
mean area: -0.541059
mean smoothness: -0.212479
mean compactness: 0.648342
mean concavity: -0.602103
mean concave points: -0.704156
mean symmetry: -0.167233
mean fractal dimension: 0.199732
radius error: -1.082965
texture error: 0.248823
perimeter error: -0.544333
area error: -0.929104
smoothness error: -0.160276
compactness error: 0.647227
concavity error: 0.160563
concave points error: -0.443784
symmetry error: 0.360492
fractal dimension error: 0.437894
worst radius: -0.947616
worst texture: -1.255088
worst perimeter: -0.763220
worst area: -0.947756
worst smoothness: -0.746625
worst compactness: 0.055514
worst concavity: -0.823151
worst concave points: -0.953686
worst symmetry: -0.939181
worst fractal dimension: -0.187251

Accuracy: 0.9825


In [3]:
# 7. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

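# Note: the multi_class parameter is deprecated as of scikit-learn 1.5 and slated for removal in a later
# release; it is kept here because the exercise asks for it. The suggested replacement is to wrap the
# estimator: OneVsRestClassifier(LogisticRegression(...)).
clf = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)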
clf.fit(X_train_scaled, y_train)

y_pred = clf.predict(X_test_scaled)
print(classification_report(y_test, y_pred, target_names=iris.target_names))


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.89      0.80      0.84        10
   virginica       0.82      0.90      0.86        10

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30
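
Since recent scikit-learn releases deprecate the `multi_class` argument, a one-vs-rest model can also be built with the `OneVsRestClassifier` wrapper. The sketch below is an assumed alternative for the same exercise, not the code that produced the report above, so its numbers may differ slightly. The name `ovr_clf` is introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# One binary logistic regression per class, trained on standardized features.
ovr_clf = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
ovr_clf.fit(X_train, y_train)

y_pred = ovr_clf.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)
n_binary_models = len(ovr_clf[-1].estimators_)

print(classification_report(y_test, y_pred, target_names=iris.target_names))
print("Binary models fitted:", n_binary_models)
```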





In [4]:
# 8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000))
])

param_grid = [
    {
        "logreg__solver": ["liblinear"],
        "logreg__penalty": ["l1", "l2"],
        "logreg__C": np.logspace(-3, 3, 7),
    },
    {
        "logreg__solver": ["lbfgs", "newton-cg", "sag", "saga"],
        "logreg__penalty": ["l2"],
        "logreg__C": np.logspace(-3, 3, 7),
    }
]

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print("Best parameters:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {test_acc:.4f}")

Best parameters: {'logreg__C': np.float64(0.1), 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Best CV accuracy: 0.9802
Test accuracy: 0.9825


In [5]:
# 9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf_no_scale = LogisticRegression(max_iter=1000)
clf_no_scale.fit(X_train, y_train)
y_pred_no_scale = clf_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

clf_scaled = LogisticRegression(max_iter=1000)
clf_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = clf_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy without scaling: {acc_no_scale:.4f}")
print(f"Accuracy with scaling:    {acc_scaled:.4f}")

Accuracy without scaling: 0.9649
Accuracy with scaling:    0.9825


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

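- Data handling: with a 5% response rate, start with careful preprocessing. Check for target leakage (features that are only known after, or because of, a response), impute missing values, and one-hot encode categorical features. Use stratified splits so that every fold and the held-out test set keep the roughly 5% response rate.
- Feature scaling: standardize numeric features. Logistic Regression does not strictly require scaled inputs, but the L1/L2 penalty treats all coefficients alike and the solvers converge faster and more reliably when features are on comparable scales. Fit the scaler inside the pipeline so it only ever sees training folds.
- Balancing classes: handle imbalance primarily via cost-sensitive learning. Set class_weight="balanced" (weights inversely proportional to class frequency) or tune custom weights to emphasize responders without exploding false positives. Optionally compare data-level methods applied to the training folds only: SMOTE to oversample minority examples (fitted inside the CV pipeline, e.g., with imbalanced-learn's Pipeline, to avoid leakage) versus modest undersampling. SMOTE often improves minority recall but can add noise if classes overlap, so validate carefully.
- Pipeline and hyperparameter tuning: build a pipeline of scaler → (optional) SMOTE → LogisticRegression with L2 (or L1 for sparsity, which needs a solver that supports it such as liblinear or saga) and class weights. Tune C, penalty (L1/L2), solver, and class_weight via stratified cross-validation, optimizing PR AUC or F2 (if recall matters more) rather than accuracy or ROC AUC alone. With 5% responders, a model that predicts "no response" for everyone already reaches 95% accuracy, whereas PR AUC reflects performance at low prevalence.
- Evaluation: evaluate on a held-out test set with PR AUC, recall/precision, F1/F2, calibration (reliability curve, Brier score), and business-costed metrics. Class weights and resampling shift predicted probabilities upward relative to the true response rate, so recalibrate before using probabilities in profit calculations. Choose the decision threshold by maximizing expected profit or meeting operational constraints (e.g., cap contact volume while targeting a minimum precision).
- Monitoring: check stability across time-based splits, inspect coefficient signs for sanity, and monitor drift and calibration after deployment, retraining and retuning weights and thresholds as class balance or campaign economics change.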
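
The approach described for this question can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the data is synthetic (`make_classification` with about 5% positives stands in for real campaign data), SMOTE is omitted to keep the example dependency-free, and `VALUE_PER_RESPONSE` and `COST_PER_CONTACT` are hypothetical economics, not figures from any real campaign.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_score, recall_score
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_predict, train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data with roughly 5% responders (assumption).
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=6,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(solver="liblinear", max_iter=1000)),
])
param_grid = {
    "logreg__C": [0.01, 0.1, 1.0, 10.0],
    "logreg__penalty": ["l1", "l2"],
    "logreg__class_weight": [None, "balanced", {0: 1, 1: 5}],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# "average_precision" is scikit-learn's summary of the precision-recall curve.
grid = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
grid.fit(X_train, y_train)

# Pick the threshold on out-of-fold training predictions, not on the test set.
oof_proba = cross_val_predict(
    grid.best_estimator_, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]

VALUE_PER_RESPONSE = 20.0  # hypothetical revenue per responder
COST_PER_CONTACT = 1.0     # hypothetical cost per customer contacted

thresholds = np.linspace(0.05, 0.95, 19)
profits = [
    VALUE_PER_RESPONSE * np.sum((oof_proba >= t) & (y_train == 1))
    - COST_PER_CONTACT * np.sum(oof_proba >= t)
    for t in thresholds
]
best_t = float(thresholds[int(np.argmax(profits))])

# Final, one-time evaluation on the held-out test set.
test_proba = grid.predict_proba(X_test)[:, 1]
test_pred = (test_proba >= best_t).astype(int)
pr_auc = average_precision_score(y_test, test_proba)

print("Best params:", grid.best_params_)
print(f"Chosen threshold: {best_t:.2f}")
print(f"Test PR AUC: {pr_auc:.4f} (baseline = prevalence {y_test.mean():.4f})")
print(f"Test precision: {precision_score(y_test, test_pred, zero_division=0):.4f}")
print(f"Test recall:    {recall_score(y_test, test_pred, zero_division=0):.4f}")
```

PR AUC is compared against the prevalence because that is roughly what a random ranking would achieve. If weighted models are selected, the profit-based threshold should be revisited after recalibrating the probabilities.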