#  Logistic Regression

1. What is Logistic Regression, and how does it differ from Linear Regression?

- Logistic Regression is a supervised classification algorithm that predicts a binary outcome (or a multiclass outcome via extensions such as one-vs-rest). It models the probability that a given input belongs to a class using the logistic (sigmoid) function. Linear Regression, by contrast, predicts an unbounded continuous value from the linear equation Y = b0 + b1*X + .... Logistic Regression passes the same kind of linear combination through the sigmoid, p = sigmoid(z) where z = b0 + b1*X + ..., so the output is a probability between 0 and 1, which is then thresholded to produce a class label.
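To make the contrast concrete, here is a minimal sketch that fits both models on the same toy data. The data set is hypothetical (one feature, with the label switching from 0 to 1 at x = 5), and the query points are chosen far outside the training range on purpose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, label flips from 0 to 1 at x = 5
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points far outside the training range
X_new = np.array([[-10.0], [20.0]])

linear_pred = linear.predict(X_new)                  # unbounded real values
logistic_prob = logistic.predict_proba(X_new)[:, 1]  # P(class 1), bounded by 0 and 1
logistic_label = logistic.predict(X_new)             # thresholded class labels

print("Linear predictions:", linear_pred)
print("Logistic probabilities:", logistic_prob)
print("Logistic labels:", logistic_label)
```

The linear model extrapolates below 0 and above 1, so its output cannot be read as a probability, while the logistic model stays within the unit interval.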

2. Explain the role of the Sigmoid function in Logistic Regression.

- The sigmoid (logistic) function maps any real-valued number z into the open interval (0, 1):

        sigmoid(z) = 1 / (1 + exp(-z))

  In Logistic Regression, the linear combination z = b0 + b1*x1 + ... is passed through the sigmoid to produce a probability p that the observation belongs to class 1. This probability is used for classification (e.g., classify as 1 if p >= 0.5) and for likelihood-based training.
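As a small illustration, the sketch below computes z, p and the thresholded label for a single observation. The coefficient values and the feature vector are made up for the example; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one observation with two features
b0 = -1.0
b = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = b0 + b @ x          # linear combination: -1.0 + 1.6 - 0.5 = 0.1
p = sigmoid(z)          # probability of class 1
label = int(p >= 0.5)   # threshold at 0.5

print(f"z = {z:.2f}, p = {p:.3f}, label = {label}")
```

Because sigmoid(0) = 0.5, thresholding p at 0.5 is the same as checking whether z is non-negative.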

3. What is Regularization in Logistic Regression and why is it needed?

- Regularization adds a penalty term to the loss function to limit model complexity and reduce overfitting. Common types:

 - L2 (Ridge): penalty = lambda * sum(coef^2)

 - L1 (Lasso): penalty = lambda * sum(|coef|)

- Why it is needed:

 - Reduces overfitting on training data.

 - L1 can perform feature selection (drives some coefficients to zero).

 - L2 shrinks coefficients to stabilize the model and improve generalization.
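The difference between the two penalties can be seen by counting exact zeros among the fitted coefficients. This is a rough sketch on the breast cancer dataset from sklearn; the value C = 0.1 is an illustrative choice (in scikit-learn, C is the inverse of the regularization strength lambda), and the features are standardized first so the penalty treats them comparably.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a common scale

# C = 0.1 is an illustrative choice; smaller C means a stronger penalty
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1, "of", l1_model.coef_.size)
print("Zero coefficients with L2:", n_zero_l2, "of", l2_model.coef_.size)
```

Features whose L1 coefficient is exactly zero are effectively dropped from the model, which is what "feature selection" means here.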

4.  What are some common evaluation metrics for classification models, and
why are they important?
Answer:

 - Accuracy: fraction of correct predictions. Good for balanced classes.

 - Precision: TP / (TP + FP) — important when false positives cost more.

 - Recall (Sensitivity): TP / (TP + FN) — important when false negatives cost more.

 - F1-score: harmonic mean of precision and recall — good for imbalanced classes.

 - ROC-AUC: area under ROC curve — measures discrimination ability across thresholds.

 - Confusion Matrix: breakdown of TP, FP, TN, FN — helps understand types of errors.
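
These metrics matter because the right one depends on business priorities and class balance. For example, accuracy alone can look high on an imbalanced dataset even when most of the minority class is missed.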
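The metrics above can be computed with `sklearn.metrics`. The sketch below uses ten hand-made labels and scores (hypothetical values, not model output) so that each number can be checked against the formulas by hand.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical ground truth and predicted probabilities for 10 samples
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1])
y_pred = (y_score >= 0.5).astype(int)  # threshold at 0.5

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)     # uses scores, not thresholded labels

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
print("ROC-AUC:", auc)
```

Note that ROC-AUC is computed from the scores rather than the thresholded labels, which is why it summarizes performance across all thresholds.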

5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)


In [5]:
# 5. Load a tabular dataset bundled with sklearn (standing in for a CSV file), split, train Logistic Regression, print accuracy
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression(max_iter=10000, solver='liblinear')
model.fit(X_train, y_train)

# Predict and print accuracy
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)


Accuracy: 0.956140350877193
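The cell above reads the data directly from sklearn. If an actual CSV file and a Pandas DataFrame are required, one possible variant is sketched below. It assumes the dataset is first written to a temporary CSV file; the file name `breast_cancer.csv` and the variable names are choices made for this sketch only.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the dataset to a temporary CSV so there is a real file to load.
# "breast_cancer.csv" is a hypothetical file name chosen for this sketch.
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
load_breast_cancer(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)
X = df.drop(columns="target")   # 30 feature columns
y = df["target"]                # binary label column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000, solver="liblinear")
model.fit(X_train, y_train)

csv_acc = accuracy_score(y_test, model.predict(X_test))
print("Accuracy:", csv_acc)
```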


6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [4]:
# 6. L2 regularization (Ridge) example using breast_cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# C is the inverse of regularization strength (smaller C = stronger penalty); penalty='l2' is the default
model_l2 = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=10000)
model_l2.fit(X_train, y_train)

y_pred = model_l2.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("Accuracy (L2):", acc)
print("Coefficients:", model_l2.coef_)
print("Intercept:", model_l2.intercept_)


Accuracy (L2): 0.956140350877193
Coefficients: [[ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
  -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
  -2.07613380e-01 -2.97739324e-02 -5.00338038e-02  1.44298427e+00
  -3.03857384e-01 -7.25692126e-02 -1.61591524e-02 -1.90655332e-03
  -4.48855442e-02 -3.77188737e-02 -4.17516190e-02  5.61347410e-03
   1.23214996e+00 -4.04581097e-01 -3.62091502e-02 -2.70867580e-02
  -2.62630530e-01 -1.20898539e+00 -1.61796947e+00 -6.15250835e-01
  -7.42763610e-01 -1.16960181e-01]]
Intercept: [0.40847797]
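To see what the L2 penalty does to these coefficients, the following sketch refits the model for a few values of C and prints the L2 norm of the coefficient vector. The three C values are arbitrary illustrative choices, and the features are standardized here (unlike in the cell above) so the fits converge quickly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:  # illustrative values; smaller C = stronger L2 penalty
    model = LogisticRegression(penalty='l2', C=C, solver='liblinear', max_iter=10000)
    model.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```

A smaller C pulls the coefficients toward zero, which is the shrinkage effect of Ridge-style regularization.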


7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

In [6]:
# 7. Multiclass classification with multi_class='ovr' on iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

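# Use multi_class='ovr' (one-vs-rest): one binary classifier is fitted per class.
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5.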
model_ovr = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=10000)
model_ovr.fit(X_train, y_train)

y_pred = model_ovr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))


Accuracy: 1.0
Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
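Since `multi_class` is deprecated in scikit-learn 1.5 and later, a version-independent way to get one-vs-rest behaviour is to wrap the estimator in `OneVsRestClassifier`. The following is a sketch of that alternative on the same split, not the code that produced the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary Logistic Regression per class, wrapped explicitly
ovr = OneVsRestClassifier(LogisticRegression(solver='liblinear', max_iter=10000))
ovr.fit(X_train, y_train)

y_pred_ovr = ovr.predict(X_test)
print(classification_report(y_test, y_pred_ovr, target_names=iris.target_names))
print("Number of binary classifiers:", len(ovr.estimators_))
```

The fitted wrapper exposes the per-class binary models through `estimators_`, which makes the one-vs-rest structure explicit.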





8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [7]:
# 8. GridSearchCV to tune C and penalty
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameter grid: try l1 and l2 penalties; the 'liblinear' solver supports both
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100]
}

lr = LogisticRegression(solver='liblinear', max_iter=10000)

grid = GridSearchCV(lr, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV score:", grid.best_score_)

# Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy with best params:", accuracy_score(y_test, y_pred))


Best params: {'C': 100, 'penalty': 'l1'}
Best CV score: 0.9670329670329672
Test Accuracy with best params: 0.9824561403508771
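Beyond the single best setting, it can help to look at the whole grid. The sketch below is a variant of the search, not a rerun of the cell above: it assumes the model is wrapped in a `Pipeline` with a `StandardScaler` so that scaling is refitted inside each CV fold, and the step names `scaler` and `clf` are arbitrary labels chosen here. Its scores will therefore differ from the output shown above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The step names "scaler" and "clf" are arbitrary labels chosen for this sketch
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# One row per parameter combination, best first
results = pd.DataFrame(search.cv_results_)[
    ["param_clf__penalty", "param_clf__C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.to_string(index=False))
print("Test accuracy of best pipeline:", search.score(X_test, y_test))
```

Comparing `mean_test_score` with `std_test_score` shows whether the top-ranked setting is clearly better or within noise of its neighbours.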


9.  Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [8]:
# 9. Compare accuracy with and without StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without scaling
model_raw = LogisticRegression(max_iter=10000, solver='liblinear')
model_raw.fit(X_train, y_train)
y_pred_raw = model_raw.predict(X_test)
acc_raw = accuracy_score(y_test, y_pred_raw)
print("Accuracy without scaling:", acc_raw)

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=10000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", acc_scaled)


Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158
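On a test set of 114 samples, the gap between the two accuracies above corresponds to two samples, so a single split is a fairly noisy comparison. A possible way to firm it up is sketched below with 5-fold cross-validation, where the scaler lives inside a pipeline and is refitted on each training fold. The choice of `cv=5` is an assumption of this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

raw_model = LogisticRegression(max_iter=10000, solver='liblinear')
scaled_model = make_pipeline(
    StandardScaler(),  # fitted on each training fold only
    LogisticRegression(max_iter=10000, solver='liblinear'),
)

raw_scores = cross_val_score(raw_model, X, y, cv=5, scoring='accuracy')
scaled_scores = cross_val_score(scaled_model, X, y, cv=5, scoring='accuracy')

print(f"Without scaling: mean = {raw_scores.mean():.4f}, std = {raw_scores.std():.4f}")
print(f"With scaling:    mean = {scaled_scores.mean():.4f}, std = {scaled_scores.std():.4f}")
```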


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model, including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

- Answer (practical pipeline):

 1. Understand business & metric:

 - Determine the cost of false positives vs false negatives. Use metrics such as Precision, Recall, F1, and PR-AUC alongside business KPIs (e.g., campaign ROI). With only 5% responders, accuracy is misleading: predicting "no response" for everyone already scores 95%.

 2. Data preprocessing & features:

 - Clean the data, handle missing values, and create meaningful features (recency, frequency, monetary, engagement features).

 - Encode categorical features with one-hot or target encoding (fit target encoding on training data only to avoid target leakage).

 3. Train-test split:

 - Use a stratified split to maintain the class ratio in the train and test sets.

 4. Feature scaling & pipelines:

 - Use StandardScaler or RobustScaler in a pipeline together with the model, so the scaler is fitted on training data only and no information leaks from the test set.

 5. Handle class imbalance:

 - Try sampling methods: oversampling the minority class (e.g., SMOTE), undersampling the majority class, or a combination of both.

 - Alternatively, use a class-weighted loss: LogisticRegression(class_weight='balanced'), or set custom weights to penalize misclassifying the minority class more heavily.

 - Apply any resampling inside cross-validation (for example through a pipeline) so that it touches only the training folds, never the validation or test data.
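Steps 3 to 5 can be sketched as follows. Since no campaign data is available here, the example assumes a synthetic dataset from `make_classification` with roughly 5% positives as a stand-in; the sample size, feature counts and the choice of class weighting over resampling are all assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratified split keeps the response rate similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is fitted on training data only;
# class_weight='balanced' up-weights the rare responder class in the loss
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_train, y_train)

# PR-AUC is computed from predicted probabilities, not hard labels
y_prob = clf.predict_proba(X_test)[:, 1]
pr_auc = average_precision_score(y_test, y_prob)

print("Response rate (train, test):", y_train.mean(), y_test.mean())
print("PR-AUC (average precision):", round(pr_auc, 3))
print(classification_report(y_test, clf.predict(X_test), digits=3))
```

A useful reference point for PR-AUC is the positive rate itself: a model that ranks customers at random scores about that value, so anything meaningfully above it reflects real ranking ability.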

 6. Model selection & regularization:

 - Use regularized Logistic Regression (L1/L2) and tune C (the inverse of regularization strength). L1 is useful when feature selection is wanted.

 - Consider