**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

**Answer:**

Logistic Regression is a statistical and machine learning method used for classification problems, especially when the output (dependent variable) is categorical, most commonly binary (e.g., Yes/No, Spam/Not Spam).

- It predicts the probability of an outcome belonging to a certain class.
- The sigmoid (logistic) function keeps the output values between 0 and 1.
- A decision threshold (e.g., probability > 0.5) is applied to classify the outcome.

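Linear regression fits a straight line to the data points and predicts a continuous, unbounded value; it is usually trained by minimizing squared error.

Logistic regression passes that same linear combination through a sigmoid curve so outputs stay between 0 and 1, making it suitable for classification; it is trained by minimizing log loss (maximizing likelihood).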
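The contrast can be seen on a tiny example. The sketch below is only illustrative: the "hours studied" data is made up, and the variable names are assumptions rather than part of the assignment. It fits both models to the same binary labels and queries points far outside the training range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied (feature) and pass/fail (binary label).
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(hours, passed)
logistic = LogisticRegression().fit(hours, passed)

# Query points, two of them well outside the training range.
new_hours = np.array([[-5.0], [2.75], [15.0]])

linear_out = linear.predict(new_hours)                   # unbounded real values
logistic_out = logistic.predict_proba(new_hours)[:, 1]   # probabilities of class 1

print("Linear regression outputs:  ", np.round(linear_out, 2))
print("Logistic regression outputs:", np.round(logistic_out, 2))
print("Predicted classes:          ", logistic.predict(new_hours))
```

The linear model's outputs fall below 0 and above 1 at the extreme points, so they cannot be read as probabilities, while the logistic model's outputs always stay inside that range.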





---



**Question 2: Explain the role of the Sigmoid function in Logistic Regression?**

**Answer:**  

In Logistic Regression, the sigmoid function (also called the logistic function) plays the role of converting a raw linear prediction into a probability between 0 and 1.

The regression equation inside logistic regression is still linear:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

This $z$ can be any real number in $(-\infty, \infty)$, but probabilities must lie between 0 and 1.
That’s where the sigmoid comes in: it “squashes” any real number into the range $(0, 1)$.

**2. The Sigmoid Function**

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

- If $z$ is very large → $\sigma(z)$ approaches 1.
- If $z$ is very small (negative) → $\sigma(z)$ approaches 0.
- If $z = 0$ → $\sigma(z) = 0.5$.
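These three limiting cases are easy to check numerically. The following is a minimal sketch using only the standard library; the split into two branches is one common way to avoid overflow for very negative inputs, not something the assignment prescribes.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp() never receives a large positive argument.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

for z in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(f"sigmoid({z:>7}) = {sigmoid(z):.6f}")
```

The output climbs from 0 through 0.5 (at `z = 0`) toward 1, and the function is symmetric in the sense that `sigmoid(z) + sigmoid(-z)` equals 1.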

**3. Role in Logistic Regression**

- Probability Conversion: turns the raw score $z$ into a probability $P(Y=1 \mid X)$.
- Decision Boundary: if $\sigma(z) \ge 0.5$ → predict class 1; otherwise → class 0 (the threshold can be adjusted).
- Interpretability: outputs can be directly interpreted as the “chance of belonging to class 1.”

**4. Intuition**

Think of the sigmoid function as a soft switch:

Instead of abruptly saying “yes” or “no” (like a step function), it gradually changes from 0 to 1.

This smoothness makes it differentiable, which is crucial for optimization using gradient descent.
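To make the link to gradient descent concrete, here is a small sketch in plain Python. The toy data, learning rate, and iteration count are arbitrary assumptions chosen for illustration. Because the sigmoid's derivative is $\sigma(z)(1 - \sigma(z))$, the gradient of the log loss reduces to the average of $(p - y)\,x$, which is what the loop uses.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D toy data (not linearly separable, so the weights stay finite).
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def log_loss(w: float, b: float) -> float:
    """Average negative log-likelihood of the labels under the model."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)

w, b = 0.0, 0.0
learning_rate = 0.5
initial_loss = log_loss(w, b)

for _ in range(200):
    # Gradient of the average log loss with respect to w and b.
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = log_loss(w, b)
print(f"loss: {initial_loss:.4f} -> {final_loss:.4f}, w={w:.3f}, b={b:.3f}")
```

A step function would give a zero gradient almost everywhere, so this kind of update would never move the weights; the smooth sigmoid is what makes the loop work.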



---



**Question 3: What is Regularization in Logistic Regression and why is it needed?**

**Answer:**

Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the model’s loss function.
It discourages the model from assigning too large weights to features, which can cause it to fit noise in the training data instead of learning general patterns.

In logistic regression, if a feature strongly separates the classes, the model may give it a very large coefficient (weight).

Large weights make the model sensitive to small changes in input data → overfitting.

Overfitted models work well on training data but poorly on unseen data.

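Regularization controls the complexity of the model by shrinking weights. The two common penalties are L2 (Ridge), which penalizes the sum of squared weights, and L1 (Lasso), which penalizes the sum of absolute weights and can drive some weights exactly to zero.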

**3. Benefits**

Reduces overfitting → better generalization to new data.

Improves stability of the model.

Can perform automatic feature selection (L1 case).
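The feature-selection effect of L1 can be observed directly. The sketch below is an illustration under assumptions: it uses the Breast Cancer dataset from sklearn, standardizes the features, and picks `C=0.1` arbitrarily; depending on the installed scikit-learn version, the `penalty` argument may emit a deprecation warning.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Penalties compare weights across features, so put features on a common scale first.
X_scaled = StandardScaler().fit_transform(X)

# liblinear supports both penalties on this binary problem; C=0.1 is an arbitrary choice.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X_scaled, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set exactly to zero with L1:", n_zero_l1, "of", X.shape[1])
print("Coefficients set exactly to zero with L2:", n_zero_l2, "of", X.shape[1])
```

L1 typically zeroes out a sizable share of the 30 coefficients here, while L2 only shrinks them toward zero.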





---



**Question 4: What are some common evaluation metrics for classification models, and why are they important?**

**Answer:**

When we build a classification model (like Logistic Regression, Decision Trees, etc.), we need to check how well it performs — not just on training data, but on unseen data. This is where evaluation metrics come in.

**1. Common Evaluation Metrics for Classification**

**(a) Accuracy**

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

Example: If your model predicts correctly 90 out of 100 times → Accuracy = 90%.

When useful: When classes are balanced (equal distribution of classes).

Limitation: Misleading for imbalanced data (e.g., in a fraud dataset where 99% of transactions are legitimate, a model that always predicts “not fraud” reaches 99% accuracy while catching no fraud).

**(b) Precision**

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

Example: Out of 10 emails marked spam, 8 are truly spam → Precision = 80%.

When useful: When false positives are costly (e.g., flagging legitimate emails as spam).

**(c) Recall (Sensitivity / True Positive Rate)**

$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

Example: Out of 100 spam emails, your model catches 90 → Recall = 90%.

When useful: When missing positives is costly (e.g., detecting cancer, fraud).

**(d) F1-Score**

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

When useful: When you need a balance between Precision and Recall, especially in imbalanced datasets.

**(e) ROC-AUC (Receiver Operating Characteristic – Area Under Curve)**

Definition: Measures the model’s ability to distinguish between classes across different thresholds.

AUC Value Meaning:

- 0.5 → No better than random guessing
- 1.0 → Perfect separation

When useful: For comparing models, especially in binary classification.

**(f) Confusion Matrix**

Definition: A table showing True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

Why useful: Gives a full picture of what types of errors the model makes.

**2. Why These Metrics Are Important**

- Different problems care about different errors:
  - Medical diagnosis → high Recall (catch as many true patients as possible).
  - Email spam filter → high Precision (don’t mark real emails as spam).
- Accuracy alone can be misleading when data is imbalanced.
- They help in model selection and hyperparameter tuning.
- They guide trade-offs (e.g., increasing recall might lower precision).
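The formulas above can be worked through by hand from the four confusion-matrix counts. The counts in this sketch are hypothetical, chosen only to show how accuracy can look strong while recall is noticeably weaker.

```python
# Hypothetical confusion-matrix counts for a spam filter evaluated on 100 emails.
tp, fp, fn, tn = 8, 2, 4, 86

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.940
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1-score:  {f1:.3f}")         # 0.727
```

Here 94% of predictions are correct, yet one in three spam emails is missed, which is exactly the kind of gap that accuracy alone hides.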






---



**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy (Use Dataset from sklearn package)?**
**(Include your Python code and output in the code box below.)**

**Answer:**

The Python example below uses Pandas and scikit-learn with the built-in Breast Cancer dataset from sklearn.
We’ll:

1. Load the dataset.
2. Convert it into a Pandas DataFrame.
3. Split into training/testing sets.
4. Train a Logistic Regression model.
5. Print accuracy on the test set.





In [3]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Step 1: Load dataset from sklearn
cancer = datasets.load_breast_cancer()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Step 3: Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Create and train Logistic Regression model
model = LogisticRegression(max_iter=10000)  # max_iter increased to ensure convergence
model.fit(X_train, y_train)

# Step 6: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 7: Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Step 8: Print results
print("Logistic Regression Model Accuracy:", accuracy)


Logistic Regression Model Accuracy: 0.956140350877193
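The question also mentions loading a CSV file, which the cell above skips by building the DataFrame directly. One possible way to cover that step, sketched here under assumptions, is to export the same sklearn dataset to a temporary CSV and read it back with `pd.read_csv`; the file name `breast_cancer.csv` and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sklearn dataset (features plus 'target' column) to a CSV file.
    csv_path = os.path.join(tmp_dir, "breast_cancer.csv")
    load_breast_cancer(as_frame=True).frame.to_csv(csv_path, index=False)

    # Load the CSV file into a Pandas DataFrame.
    df_csv = pd.read_csv(csv_path)

X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=42
)

csv_model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
csv_accuracy = accuracy_score(y_te, csv_model.predict(X_te))

print("Rows, columns:", df_csv.shape)
print("Accuracy from CSV-loaded data:", csv_accuracy)
```

With a real project file, only the `pd.read_csv(...)` path would change; the rest of the workflow stays the same.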




---



**Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)?**

**(Include your Python code and output in the code box below.)**

**Answer:**

Here’s the Python program using L2 Regularization (Ridge) in Logistic Regression with the Breast Cancer dataset from sklearn.

LogisticRegression in scikit-learn uses L2 regularization by default; we still set penalty='l2' explicitly to make the choice visible, and print the model coefficients.

In [4]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
cancer = datasets.load_breast_cancer()

# Step 2: Convert to Pandas DataFrame
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Step 3: Split into features and target
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Train Logistic Regression model with L2 regularization
model = LogisticRegression(penalty='l2', C=1.0, max_iter=10000)  # C is the inverse of regularization strength (smaller C = stronger penalty)
model.fit(X_train, y_train)

# Step 6: Predictions
y_pred = model.predict(X_test)

# Step 7: Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Step 8: Print coefficients and accuracy
print("Model Coefficients:\n", model.coef_)
print("\nIntercept:", model.intercept_)
print("\nAccuracy:", accuracy)


Model Coefficients:
 [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]

Intercept: [28.64871395]

Accuracy: 0.956140350877193
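To see what `C` does to the coefficients, the following sketch refits the model at a few values and reports the overall size of the weight vector. The chosen values of `C` are arbitrary, and the features are standardized first (an assumption added here so the solver converges quickly and the weights are comparable).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # The default penalty is L2; smaller C means a stronger penalty.
    model_c = LogisticRegression(C=C, max_iter=10000).fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(model_c.coef_))
    print(f"C={C:<6} L2 norm of coefficients = {coef_norms[C]:.3f}")
```

The norm grows as `C` increases, which matches the reading of `C` as the inverse of regularization strength.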




---



**Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)?**

**(Include your Python code and output in the code box below.)**

**Answer:**

The Python program below trains a One-vs-Rest (multi_class='ovr') Logistic Regression model on the Iris dataset from sklearn:

In [7]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import classification_report


# Step 1: Load Iris dataset
iris = datasets.load_iris()

# Step 2: Create DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Create Logistic Regression model for multiclass classification
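# Note: multi_class is deprecated as of scikit-learn 1.5; OneVsRestClassifier is the suggested replacement
model = LogisticRegression(multi_class='ovr', max_iter=10000)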
model.fit(X_train, y_train)

# Step 6: Predictions
y_pred = model.predict(X_test)

# Step 7: Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
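On scikit-learn versions where `multi_class` is deprecated or unavailable, the same One-vs-Rest scheme can be expressed with `OneVsRestClassifier`. This is a sketch of that alternative, not the code that produced the report above, so its numbers may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary logistic regression is trained per class (class k vs. the rest).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=10000))
ovr_model.fit(X_train, y_train)
y_pred_ovr = ovr_model.predict(X_test)

print(classification_report(y_test, y_pred_ovr, target_names=iris.target_names))
print("Number of binary classifiers:", len(ovr_model.estimators_))
```

Each of the three fitted estimators outputs a score for its own class, and the wrapper predicts the class with the highest score.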







---



**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)?**

**(Include your Python code and output in the code box below.)**

**Answer:**

Here’s the Python program that uses GridSearchCV to tune the C (regularization strength) and penalty hyperparameters for Logistic Regression using a dataset from sklearn (I’ll use the Iris dataset):

In [8]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

# Step 1: Load the Iris dataset
iris = datasets.load_iris()

# Step 2: Create DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Step 3: Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],           # Inverse of regularization strength (smaller = stronger)
    'penalty': ['l1', 'l2'],                # L1 (Lasso) or L2 (Ridge)
    'solver': ['liblinear']                 # liblinear supports both L1 and L2
}

# Step 6: Create Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)

# Step 7: Apply GridSearchCV
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 8: Print results
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9583333333333334
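The cell above holds out a test set but never scores on it. A possible follow-up, sketched here under assumptions, is to report the tuned model's accuracy on that held-out split. This variant swaps in the `saga` solver inside a scaling pipeline, since recent scikit-learn releases, as far as I am aware, discourage `liblinear` for multiclass targets; it is therefore not expected to reproduce the exact parameters printed above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# saga handles both L1 and L2 penalties for multiclass targets;
# standardizing the features helps it converge.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=5000),
)
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10, 100],
    "logisticregression__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# score() uses the best estimator, refit on the full training split.
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Cross-validation accuracy:", round(search.best_score_, 4))
print("Held-out test accuracy:", round(test_accuracy, 4))
```

The cross-validation score guides the choice of hyperparameters, while the held-out score is the less biased estimate of how the chosen model performs on new data.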




---



**Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)?**

**(Include your Python code and output in the code box below.)**

**Answer:**

The Python program below compares Logistic Regression accuracy with and without feature standardization using the Breast Cancer dataset from sklearn:

In [9]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Step 1: Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()

# Step 2: Create DataFrame
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Step 3: Split into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Step 4: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Model WITHOUT scaling ----
model_no_scale = LogisticRegression(max_iter=10000)
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# ---- Model WITH scaling ----
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=10000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Step 5: Print comparison
print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scaled)


Accuracy without scaling: 0.956140350877193
Accuracy with scaling: 0.9736842105263158
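A single train/test split can make the gap look larger or smaller than it really is. As a complementary check, this sketch compares the two setups with 5-fold cross-validation and places the scaler inside a pipeline; the fold count and variable names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

unscaled = LogisticRegression(max_iter=10000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

# The pipeline refits the scaler on the training part of every fold,
# so no statistics from the validation fold leak into preprocessing.
scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print(f"Without scaling: {scores_unscaled.mean():.4f} +/- {scores_unscaled.std():.4f}")
print(f"With scaling:    {scores_scaled.mean():.4f} +/- {scores_scaled.std():.4f}")
```

Reporting the mean and spread across folds gives a steadier picture than one split, and the pipeline keeps the comparison free of preprocessing leakage.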




---



**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

**Answer:**

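The approach is illustrated on a simulated imbalanced dataset (matching the 5% response rate) built with sklearn's make_classification: split the data with stratification, standardize the features, oversample the minority class in the training set with SMOTE, tune C and the penalty with GridSearchCV scored by ROC-AUC, and evaluate on the untouched test set with a classification report, confusion matrix, and ROC-AUC.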

In [10]:
# Step 1: Import Libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE  # to handle imbalance

# Step 2: Create an imbalanced dataset (5% positive class)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=6, n_redundant=2,
    n_classes=2, weights=[0.95, 0.05], flip_y=0, random_state=42
)

# Step 3: Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 4: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Handle Class Imbalance (SMOTE Oversampling)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Step 6: Hyperparameter Tuning with GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],  # L1 = Lasso, L2 = Ridge
    'solver': ['liblinear']   # solver for L1/L2 penalties
}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring='roc_auc',  # good for imbalanced datasets
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train_resampled, y_train_resampled)

# Step 7: Best Model
best_model = grid_search.best_estimator_

# Step 8: Evaluation
y_pred = best_model.predict(X_test_scaled)
y_proba = best_model.predict_proba(X_test_scaled)[:, 1]

print("Best Parameters:", grid_search.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nROC-AUC Score:", roc_auc_score(y_test, y_proba))


Best Parameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.76      0.86       950
           1       0.15      0.82      0.26        50

    accuracy                           0.76      1000
   macro avg       0.57      0.79      0.56      1000
weighted avg       0.95      0.76      0.83      1000


Confusion Matrix:
 [[721 229]
 [  9  41]]

ROC-AUC Score: 0.8634947368421052
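One caveat with the cell above is that scaling and SMOTE are applied before cross-validation, so the validation folds inside GridSearchCV contain synthetic and pre-scaled rows, which can make the cross-validation scores optimistic. The sketch below shows one alternative under stated assumptions: it swaps SMOTE for `class_weight="balanced"` (a different balancing technique that reweights the loss) and keeps scaling inside a pipeline. It is not the method used above and will not reproduce those exact numbers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same simulated data as above: roughly 5% positive class.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=6, n_redundant=2,
    n_classes=2, weights=[0.95, 0.05], flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
# class_weight="balanced" reweights the loss instead of creating synthetic rows.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
search = GridSearchCV(
    pipeline,
    {"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

y_proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_proba)
test_ap = average_precision_score(y_test, y_proba)

print("Best C:", search.best_params_["logisticregression__C"])
print("Test ROC-AUC:", round(test_auc, 4))
print("Test average precision:", round(test_ap, 4))
```

For the marketing use case, the predicted probabilities can then be ranked so that the campaign targets the customers most likely to respond, with the cut-off chosen from the cost of a contact versus the value of a response.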
