# Logistic Regression

# Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Answer
1. Logistic Regression

Definition: Logistic Regression is a statistical and machine learning model used for classification problems, where the target variable is categorical (e.g., Yes/No, 0/1, Spam/Not Spam).

How it works:

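Instead of predicting a continuous value, it predicts the probability that an input belongs to a certain class.

It computes a weighted sum of the input features and passes it through the sigmoid (logistic) function, which maps any real number into the range 0 to 1.

Decision rule: By default, if the probability is greater than 0.5 → class 1, else class 0. The 0.5 threshold is a convention and can be adjusted.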

Example: Predicting if a customer will buy a product (Yes/No) based on income and age.
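
The customer example can be sketched in a few lines of plain Python. This is only an illustration: the weights, the bias, and the function name `predict_purchase` are made-up values and names, not learned parameters or part of any library.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_purchase(income_k: float, age: float,
                     w_income: float = 0.05, w_age: float = 0.02,
                     bias: float = -4.0, threshold: float = 0.5):
    """Return (probability, label) for one customer.

    The weights and bias are hypothetical values for illustration;
    a real model would learn them from training data.
    """
    z = bias + w_income * income_k + w_age * age  # linear combination of features
    p = sigmoid(z)                                # probability of "Yes"
    return p, int(p > threshold)                  # apply the decision threshold

p_low, label_low = predict_purchase(income_k=20, age=25)
p_high, label_high = predict_purchase(income_k=90, age=40)
print(f"income 20k, age 25: p={p_low:.3f}, label={label_low}")
print(f"income 90k, age 40: p={p_high:.3f}, label={label_high}")
```

With these assumed weights, the first customer falls below the 0.5 threshold and the second one above it.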


2. Linear Regression

Definition: Linear Regression is used for regression problems, where the target variable is continuous (e.g., predicting house prices, salary, temperature).

How it works:

It fits a straight line (or hyperplane) that minimizes the error between predicted and actual values, usually the sum of squared errors (ordinary least squares).
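
For a single feature, the least-squares line has a closed-form solution, sketched below. The data (house sizes and prices) and the helper name `fit_line` are hypothetical and chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data generated from price = 2 * size + 10
sizes = [50, 60, 80, 100]
prices = [110, 130, 170, 210]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70 + intercept  # a continuous number, not a class
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction for 70: {predicted_price:.2f}")
```

The prediction is an unbounded number, which is the key contrast with the probability that Logistic Regression returns.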

Linear Regression → Predicts numbers

Logistic Regression → Predicts categories (via probabilities)


# Question 2: Explain the role of the Sigmoid function in Logistic Regression.

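Answer
1. Logistic Regression Goal

Logistic Regression is about predicting the probability of an event (e.g., Yes/No, 1/0).

Probabilities, however, must always lie between 0 and 1, while the weighted sum of features that the model computes can be any real number. The sigmoid function bridges this gap.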
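
For reference, the standard definition of the sigmoid can be written as follows. The symbols $z$, $w$, $x$, and $b$ are notation assumed here for the linear score, the weights, the feature vector, and the bias.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b, \qquad P(y = 1 \mid x) = \sigma(z)
$$

Since $e^{-z} > 0$ for every real $z$, the output always lies strictly between 0 and 1, and $\sigma(0) = 0.5$.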

3. Why It’s Important in Logistic Regression

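Probability Mapping: Converts linear outputs into probabilities.

Decision Making: Comparing the probability with a threshold (0.5 by default) turns it into a class label.

Non-linearity: Although the model's score is a linear function of the inputs, the sigmoid applies a non-linear transformation to it, making the output suitable for classification. The decision boundary itself remains linear in the features.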

Interpretability: The output directly tells us the probability of belonging to a class (e.g., 0.8 means 80% chance of "Yes").

4. Visual Intuition

The sigmoid curve is S-shaped:

For very negative inputs → probability close to 0

For very positive inputs → probability close to 1
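
The S-shape is easy to see by evaluating the function at a few points. The sketch below uses a numerically stable form, which is a common implementation choice rather than something this text prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable sigmoid: never calls exp on a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so exp(z) is at most 1 and cannot overflow
    return ez / (1.0 + ez)

# Very negative inputs approach 0, very positive inputs approach 1
for z in (-10, -2, 0, 2, 10):
    print(f"z = {z:>3}: sigmoid(z) = {sigmoid(z):.5f}")
```

The curve is also symmetric around its midpoint: `sigmoid(z) + sigmoid(-z)` equals 1 for any `z`.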


# Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer
1. What is Regularization in Logistic Regression?

Regularization is a technique used to prevent overfitting by adding a penalty term to the model’s cost function (loss function).

In Logistic Regression, the basic cost function is Log Loss (Cross-Entropy Loss).

With regularization, we modify it to penalize very large coefficients (weights).
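
As a sketch, the L2-regularized cost can be written as below. The notation is assumed here: $n$ samples, $d$ features, labels $y_i \in \{0, 1\}$, predicted probabilities $\hat{p}_i$, and penalty strength $\lambda$.

$$
J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \Big] + \lambda \sum_{j=1}^{d} w_j^2,
\qquad \hat{p}_i = \frac{1}{1 + e^{-(w^\top x_i + b)}}
$$

The first term is the Log Loss and the second is the penalty. Note that scikit-learn parameterizes this differently: its `C` argument is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.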


2. Why is Regularization Needed?

Problem: Without regularization, Logistic Regression may try to fit the training data too closely (overfitting), especially when:

There are many features

Features are highly correlated

Dataset is small or noisy

3. Types of Regularization in Logistic Regression
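(a) L1 Regularization (Lasso)

Adds penalty = sum of absolute values of coefficients. It can shrink some coefficients to exactly zero, which acts as feature selection.

(b) L2 Regularization (Ridge)

Adds penalty = sum of squared values of coefficients. It shrinks coefficients toward zero but rarely makes them exactly zero.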

(c) Elastic Net (Combination of L1 + L2)

Balances both feature selection (L1) and coefficient shrinking (L2).
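
The three penalty terms can be computed directly for a given weight vector. This is a minimal sketch with hypothetical coefficients; the function names and the `lam` and `l1_ratio` parameters are illustrative, and libraries differ in the exact scaling constants they use.

```python
def l1_penalty(weights, lam):
    """L1 (Lasso): lam times the sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge): lam times the sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)

def elastic_net_penalty(weights, lam, l1_ratio):
    """Weighted mix of both: l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    return (l1_ratio * l1_penalty(weights, lam)
            + (1.0 - l1_ratio) * l2_penalty(weights, lam))

weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical coefficients
lam = 0.1                        # hypothetical penalty strength

print("L1 penalty         :", l1_penalty(weights, lam))
print("L2 penalty         :", l2_penalty(weights, lam))
print("Elastic Net (50/50):", elastic_net_penalty(weights, lam, l1_ratio=0.5))
```

The large coefficient (3.0) dominates the L2 penalty because it is squared, while L1 charges every unit of weight equally.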


4. Benefits of Regularization

Prevents overfitting

Improves generalization (better performance on new data)

Helps deal with multicollinearity (correlated features)

Encourages simpler models


# Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer
1. Why Do We Need Evaluation Metrics?

Just checking accuracy is not always enough.

Example: If 95% of patients are healthy and only 5% have a disease → a model that always predicts "healthy" will have 95% accuracy but is useless in practice.

That’s why we use multiple evaluation metrics.
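
The patient example can be reproduced with a few lines of plain Python. The data below is synthetic and simply mirrors the 95%/5% split described above.

```python
# 100 synthetic patients: 5 have the disease (1), 95 are healthy (0)
y_true = [1] * 5 + [0] * 95

# A "model" that always predicts healthy
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that the model found
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall  : {recall:.2f}")
```

Accuracy looks excellent while recall shows that not a single sick patient was detected.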


2. Common Evaluation Metrics for Classification
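(a) Accuracy

Fraction of all predictions that are correct: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP/TN are true positives/negatives and FP/FN are false positives/negatives.

When to use: Good when classes are balanced.

Limitation: Misleading if the dataset is imbalanced.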

(b) Precision

Out of all predicted positives, how many are actually positive? Precision = TP / (TP + FP).

High Precision → few false positives.

Example: Spam detection (better to be precise so genuine emails aren’t marked as spam).

(c) Recall (Sensitivity or True Positive Rate)

Out of all actual positives, how many did we correctly predict? Recall = TP / (TP + FN).

High Recall → few false negatives.

Example: Disease detection (better to catch as many patients as possible).

(d) F1-Score

Harmonic mean of Precision and Recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).

When to use: Best when you need a balance between Precision and Recall.
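
Precision, Recall, and F1 can all be derived from three counts. The helper below is a sketch with a hypothetical name and made-up counts; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-Score : {f1:.4f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average would.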


(e) ROC Curve & AUC (Area Under Curve)

ROC Curve: Plots True Positive Rate (Recall) vs False Positive Rate.

AUC: Measures overall ability of the model to distinguish between classes.

Closer to 1 → Better. A value of 0.5 corresponds to random guessing.

Example: In credit card fraud detection, a model with AUC = 0.95 is very good at separating fraud from non-fraud.
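
One way to build intuition for AUC is its ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute force over all pairs. The function name `roc_auc` and the tiny dataset are illustrative only, and the quadratic loop is meant for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
auc = roc_auc(y_true, scores)
print(f"AUC: {auc:.2f}")
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.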

(f) Confusion Matrix

A table that shows true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

Helps you see types of errors made by the model.



# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.)

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset from sklearn
data = load_breast_cancer()

# 2. Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

print("First 5 rows of dataset:")
print(df.head(), "\n")

# 3. Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 4. Train-Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5. Train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # Increase iterations for convergence
model.fit(X_train, y_train)

# 6. Predict on test set
y_pred = model.predict(X_test)

# 7. Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression Model: {accuracy:.4f}")
```



# Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

Answer
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()

# 2. Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features and Target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression with L2 Regularization (Ridge)
# penalty='l2' is default, but we explicitly specify it
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# 5. Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 6. Print coefficients and accuracy
print("Logistic Regression with L2 Regularization (Ridge)")
print("-------------------------------------------------")
print("Model Coefficients (per feature):")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature:25s}: {coef:.4f}")

print("\nIntercept:", model.intercept_[0])
print(f"\nAccuracy on Test Set: {accuracy:.4f}")
```


# Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


Answer
```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 1. Load dataset
data = load_iris()

# 2. Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features and Target
X = df.drop('target', axis=1)
y = df['target']

# 3. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression with One-vs-Rest (OvR)
# Note: multi_class is deprecated in recent scikit-learn releases (1.5+);
# OneVsRestClassifier(LogisticRegression(...)) is the suggested alternative.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# 5. Predictions
y_pred = model.predict(X_test)

# 6. Classification Report
print("Logistic Regression with OvR (One-vs-Rest)")
print("------------------------------------------")
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```


# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

Answer
```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# 1. Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],          # Inverse of regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2']                # Type of penalty
}

# 4. Create Logistic Regression model
# Note: the default lbfgs solver does not support the l1 penalty;
# liblinear (used here) and saga support both l1 and l2.
log_reg = LogisticRegression(solver='liblinear', max_iter=5000, multi_class='ovr')

# 5. Apply GridSearchCV (5-fold cross-validation on the training set)
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# 6. Print results
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")
print(f"Test Set Accuracy: {grid.score(X_test, y_test):.4f}")
```


# Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

Answer
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# 2. Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -----------------------------
# Model WITHOUT Standardization
# -----------------------------
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -----------------------------
# Model WITH Standardization
# -----------------------------
# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=5000)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# -----------------------------
# Print Results
# -----------------------------
print("Logistic Regression Accuracy Comparison")
print("--------------------------------------")
print(f"Without Scaling: {acc_no_scaling:.4f}")
print(f"With Scaling   : {acc_with_scaling:.4f}")
```


# Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer
Scenario

Task: Predict which customers will respond to a marketing campaign.

Challenge: Only 5% positive responses (imbalanced dataset).

Goal: Build a robust Logistic Regression model that handles imbalance and provides actionable insights.

1. Data Handling

Load & Clean Data: Handle missing values (impute with mean/median for numeric, mode for categorical).

Feature Engineering:

Encode categorical variables (One-Hot Encoding).

Create domain-specific features (e.g., recency, frequency, monetary value for purchases).

Train/Test Split: Use stratified split so class ratios remain consistent.

2. Feature Scaling

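Since regularized Logistic Regression and its gradient-based solvers are sensitive to feature magnitudes, apply StandardScaler or MinMaxScaler to continuous variables. Fit the scaler on the training data only and reuse it on the test data to avoid leakage.

This makes coefficients more comparable and helps the optimizer converge faster.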

3. Handling Class Imbalance
Options:

Resampling Techniques:

Oversampling: Use SMOTE (Synthetic Minority Oversampling Technique) to increase positive class samples. Apply it to the training data only, never to the validation or test data.

Undersampling: Reduce the majority class, at the risk of losing information.

Hybrid: Combine both.

Class Weights:

Set class_weight='balanced' in Logistic Regression.

Automatically adjusts weights inversely proportional to class frequencies.
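
The weights behind `class_weight='balanced'` follow the formula documented by scikit-learn: `n_samples / (n_classes * count_of_class)`. The sketch below recomputes them in plain Python for a synthetic 5% response rate; the function name `balanced_class_weights` is illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Synthetic labels: 50 responders (1) out of 1000 customers, i.e. 5%
labels = [1] * 50 + [0] * 950
weights = balanced_class_weights(labels)

print(f"Weight for responders    : {weights[1]:.4f}")
print(f"Weight for non-responders: {weights[0]:.4f}")
```

Each misclassified responder therefore costs about 19 times as much in the loss as a misclassified non-responder, which matches the 95:5 class ratio.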


4. Hyperparameter Tuning

Use GridSearchCV or RandomizedSearchCV with cross-validation.

Tune parameters like:

C (inverse of regularization strength; smaller values mean stronger regularization) → controls overfitting/underfitting.

penalty (L1 vs L2).

solver (liblinear, saga depending on penalty).

Use Stratified K-Fold CV to ensure balanced splits during tuning.

5. Evaluation Metrics

Since the dataset is imbalanced, accuracy is misleading. Focus on:

Precision: Of predicted responders, how many are correct?

Recall (Sensitivity): Of all actual responders, how many did we catch?

F1-Score: Balance between Precision & Recall.

ROC-AUC: Measures overall separation between responders/non-responders.

PR-AUC (Precision-Recall AUC): More informative when positives are rare.


6. Business Deployment Considerations

Choose the probability threshold deliberately instead of defaulting to 0.5.

Example: If the model predicts P(Response) > 0.3, target that customer.

The threshold can be tuned based on cost of marketing vs expected revenue uplift.

Periodically retrain the model as customer behavior changes.

Combine Logistic Regression with business rules (e.g., only send offers to high-value customers even if probability is moderate).
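
The threshold tuning mentioned above can be sketched as a small search over candidate thresholds that maximizes expected campaign profit. Everything in this example is hypothetical: the labels, the predicted probabilities, the contact cost, the revenue per response, and the function names. In practice the search would run on a held-out validation set.

```python
def expected_profit(y_true, probs, threshold, cost_per_contact, revenue_per_response):
    """Profit from contacting every customer whose probability exceeds the threshold."""
    profit = 0.0
    for y, p in zip(y_true, probs):
        if p > threshold:
            profit -= cost_per_contact          # every contact costs money
            if y == 1:
                profit += revenue_per_response  # only responders bring revenue
    return profit

def best_threshold(y_true, probs, thresholds, cost_per_contact, revenue_per_response):
    """Return the candidate threshold with the highest profit."""
    return max(
        thresholds,
        key=lambda t: expected_profit(y_true, probs, t, cost_per_contact, revenue_per_response),
    )

# Hypothetical validation data: true responses and model probabilities
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
probs = [0.62, 0.10, 0.35, 0.41, 0.05, 0.55, 0.25, 0.08, 0.33, 0.15]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]

best = best_threshold(y_true, probs, thresholds, cost_per_contact=1.0, revenue_per_response=10.0)
for t in thresholds:
    print(f"threshold {t:.1f}: profit = {expected_profit(y_true, probs, t, 1.0, 10.0):.1f}")
print("Best threshold:", best)
```

With these made-up numbers, the default 0.5 threshold misses two of the three responders, while 0.3 captures all of them at an acceptable contact cost.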

In [1]:
# 5

First 5 rows of dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  wor

In [2]:
# 6

Logistic Regression with L2 Regularization (Ridge)
-------------------------------------------------
Model Coefficients (per feature):
mean radius              : 1.0274
mean texture             : 0.2215
mean perimeter           : -0.3621
mean area                : 0.0255
mean smoothness          : -0.1562
mean compactness         : -0.2377
mean concavity           : -0.5326
mean concave points      : -0.2837
mean symmetry            : -0.2267
mean fractal dimension   : -0.0365
radius error             : -0.0971
texture error            : 1.3706
perimeter error          : -0.1814
area error               : -0.0872
smoothness error         : -0.0225
compactness error        : 0.0474
concavity error          : -0.0429
concave points error     : -0.0324
symmetry error           : -0.0347
fractal dimension error  : 0.0116
worst radius             : 0.1117
worst texture            : -0.5089
worst perimeter          : -0.0156
worst area               : -0.0169
worst smoothness         : -0.30

In [3]:
# 7

Logistic Regression with OvR (One-vs-Rest)
------------------------------------------
Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [4]:
# 8




Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583
Test Set Accuracy: 1.0000




In [5]:
# 9

Logistic Regression Accuracy Comparison
--------------------------------------
Without Scaling: 0.9561
With Scaling   : 0.9737
