**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

**Logistic Regression**

* Logistic Regression is a statistical and machine learning algorithm used for classification problems, especially when the output (dependent variable) is categorical.

* It predicts the probability that an observation belongs to a certain class (e.g., Yes/No, Spam/Not Spam).

* The linear combination of the input features is passed through a sigmoid (logistic) function, which maps it to a value between 0 and 1.

* A threshold (commonly 0.5) is then applied to decide the final class.

**Example:** Predicting whether a customer will buy a product (Yes=1, No=0).

**Linear Regression**

* Linear Regression is used for predicting continuous outcomes.

* It assumes a linear relationship between the independent variables (X) and the dependent variable (Y).

* The output is a real number that can range from −∞ to +∞.

**Example:** Predicting house prices based on area, location, and number of rooms.

**In short:**

Linear Regression → Continuous prediction.

Logistic Regression → Classification (probability-based).
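
To make the contrast concrete, here is a minimal sketch on a made-up one-feature dataset (the arrays `x`, `y` and the query point `x_new` are illustrative assumptions, not part of the original answer). It fits both models to the same 0/1 labels and shows that the linear model's prediction can leave the [0, 1] range, while the logistic model's probability cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (assumed for illustration): one feature, label is 1 when x > 0
x = np.linspace(-5, 5, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)

x_new = np.array([[10.0]])  # a point far outside the training range
lin_pred = lin.predict(x_new)[0]           # unbounded real number
log_prob = log.predict_proba(x_new)[0, 1]  # probability of class 1

print("Linear regression output:", lin_pred)
print("Logistic regression P(y=1):", log_prob)
```

The linear output here exceeds 1, so it cannot be read as a probability; the logistic output always stays between 0 and 1.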

**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

Logistic Regression starts with a linear combination of inputs:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

Used directly, z can take any value between −∞ and +∞, so it cannot be read as a probability, which must lie between 0 and 1.

The sigmoid function (also called the logistic function) fixes this:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Role of the Sigmoid function:**

* Maps outputs to the range 0–1, so predictions can be interpreted as probabilities.

* Decision boundary: if the probability is ≥ 0.5 (equivalently, z ≥ 0), predict Class 1; otherwise predict Class 0.

* Smooth and differentiable, which makes optimization with gradient-based methods such as gradient descent possible.

* Probabilistic interpretation: the output tells how likely the input is to belong to the positive class.
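
The steps above can be sketched in a few lines of NumPy. This is only an illustration: the coefficients `beta0`, `beta` and the observation `x` are hypothetical values chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and a single observation, for illustration only
beta0 = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = beta0 + beta @ x    # linear combination of the inputs
p = sigmoid(z)          # probability of the positive class
label = int(p >= 0.5)   # apply the 0.5 threshold

print(f"z = {z:.2f}, p = {p:.4f}, predicted class = {label}")
```

Because `sigmoid(0)` is exactly 0.5, thresholding the probability at 0.5 is the same as checking the sign of z. For very large negative z, `np.exp(-z)` can overflow; `scipy.special.expit` is a numerically stable alternative.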

**Question 3: What is Regularization in Logistic Regression and why is it needed?**

Regularization is a technique in Logistic Regression (and other ML models) that adds a penalty term to the loss function to prevent the model from overfitting.

The idea: when coefficients become too large, the model fits the training data extremely well but performs poorly on new data (poor generalization).

Regularization keeps coefficients smaller and more stable, so the model balances fit and simplicity.

**Types of Regularization:**

*L1 Regularization (Lasso):*

Adds $\lambda \sum_j |\beta_j|$ to the cost function.

Forces some coefficients to become exactly zero → performs feature selection.

*L2 Regularization (Ridge):*

Adds $\lambda \sum_j \beta_j^2$ to the cost function.

Shrinks coefficients towards zero but, unlike L1, generally does not make them exactly zero.

Helps when features are correlated (multicollinearity).

*Elastic Net:*

Combination of L1 and L2 penalties.

**Why Regularization is needed:**

* Prevents overfitting (model memorizing noise in training data).

* Improves generalization on unseen data.

* Helps with multicollinearity by shrinking correlated feature weights.

* Encourages sparser models (L1) or more stable models (L2).
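
The difference between the two penalties can be seen in a small experiment. The sketch below uses a synthetic dataset (an assumption for illustration: 20 features, of which only 3 carry signal) and a deliberately small `C`, and counts how many coefficients each penalty sets exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 20 features, only 3 of them informative
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=3,
    n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Small C means strong regularization (C is the inverse of lambda)
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)

zeros_l1 = int(np.sum(l1.coef_ == 0))
zeros_l2 = int(np.sum(l2.coef_ == 0))
print("Coefficients set exactly to zero with L1:", zeros_l1)
print("Coefficients set exactly to zero with L2:", zeros_l2)
```

With L1, the uninformative features tend to drop out entirely, while L2 keeps all of them with small weights. The exact counts depend on the data and on `C`.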

**Question 4: What are some common evaluation metrics for classification models, and why are they important?**

When evaluating classification models (like Logistic Regression), accuracy alone is not enough, especially when the dataset is imbalanced (e.g., fraud detection, medical diagnosis).
That’s why we use multiple evaluation metrics:

1. Accuracy

* Measures overall % of correct predictions.

* Misleading if classes are imbalanced (e.g., predicting everyone as “negative” in a 95%-5% dataset gives 95% accuracy but is useless).

2. Precision (Positive Predictive Value)

* Out of all predicted positives, how many are actually positive?

* Important when the cost of false positives is high.

**Example:** Predicting spam emails → better to avoid labeling normal emails as spam.

3. Recall (Sensitivity or True Positive Rate)

* Out of all actual positives, how many did we catch?

* Important when the cost of false negatives is high.

**Example:** In cancer detection, missing a positive case is very costly.

4. F1-Score

* Harmonic mean of Precision and Recall.

* Useful when we want a balance between avoiding false positives & false negatives.

5. ROC-AUC (Receiver Operating Characteristic – Area Under Curve)

* Plots True Positive Rate (Recall) vs False Positive Rate (FP / (FP + TN)) at different thresholds.

* AUC = probability that the model ranks a random positive higher than a random negative.

* Closer to 1.0 → better.

6. PR-AUC (Precision-Recall AUC)

* Plots Precision vs Recall across thresholds.

* More informative than ROC-AUC when classes are highly imbalanced.
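
As a small check of the ranking interpretation of AUC, the sketch below uses four made-up labels and scores (illustrative values only). It counts the fraction of (positive, negative) pairs in which the positive gets the higher score and compares that with scikit-learn's `roc_auc_score`. It also computes `average_precision_score`, a common single-number summary of the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Tiny made-up example: true labels and model scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Fraction of (positive, negative) pairs where the positive scores higher
# (no tied scores in this example, so ties are not handled here)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

print("Pairwise ranking estimate:", pairwise)
print("ROC-AUC:", roc_auc)
print("Average precision:", pr_auc)
```

Three of the four positive/negative pairs are ordered correctly, so both the pairwise estimate and the ROC-AUC come out at 0.75.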

**Why are these metrics important?**

Accuracy: Good only when classes are balanced.

Precision & Recall: Help us understand trade-offs between catching positives and avoiding false alarms.

F1: Balances Precision & Recall in one number.

ROC-AUC & PR-AUC: Evaluate model performance across thresholds instead of at just one cut-off.
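
The accuracy trap on a 95%-5% dataset can be reproduced directly. In this sketch the labels and both sets of predictions are invented for illustration: a trivial "always negative" predictor is compared with a hypothetical model that finds 4 of the 5 positives at the cost of 6 false alarms.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 95%-5% dataset: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95

# A "model" that predicts negative for everyone
y_all_negative = [0] * 100
acc_trivial = accuracy_score(y_true, y_all_negative)
rec_trivial = recall_score(y_true, y_all_negative)

# Hypothetical model: 4 true positives, 1 miss, 6 false alarms, 89 true negatives
y_model = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89
acc = accuracy_score(y_true, y_model)
prec = precision_score(y_true, y_model)
rec = recall_score(y_true, y_model)
f1 = f1_score(y_true, y_model)

print(f"Always-negative: accuracy={acc_trivial:.2f}, recall={rec_trivial:.2f}")
print(f"Hypothetical model: accuracy={acc:.2f}, precision={prec:.2f}, "
      f"recall={rec:.2f}, f1={f1:.2f}")
```

The trivial predictor has the higher accuracy (0.95 vs 0.93) but a recall of 0, which is why precision, recall, and F1 are reported alongside accuracy.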

**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

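# Step 1: Load dataset (the built-in Wine data stands in for a real CSV here)
wine = load_wine()
X, y = wine.data, wine.target

# Step 2: Save as CSV, then load the CSV into a DataFrame
# ('wine.csv' is an assumed file name; substitute your own CSV path)
raw = pd.DataFrame(X, columns=wine.feature_names)
raw['target'] = y
raw.to_csv('wine.csv', index=False)
df = pd.read_csv('wine.csv')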

print("Dataset shape:", df.shape)
print("Class distribution:\n", df['target'].value_counts())

# Step 3: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['target']),
    df['target'],
    test_size=0.2,
    random_state=42,
    stratify=df['target']
)

# Step 4: Train Logistic Regression model
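# Note: the multi_class parameter is deprecated in recent scikit-learn releases (1.5+);
# OneVsRestClassifier(LogisticRegression(...)) is the suggested one-vs-rest alternative.
model = LogisticRegression(max_iter=5000, multi_class='ovr')  # one-vs-rest for multiclass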
model.fit(X_train, y_train)

# Step 5: Predictions & Accuracy
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)

print("\nTest Accuracy:", acc)

Dataset shape: (178, 14)
Class distribution:
 target
1    71
0    59
2    48
Name: count, dtype: int64





Test Accuracy: 0.9444444444444444


**Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.**

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
wine = load_wine()
X, y = wine.data, wine.target
feature_names = wine.feature_names

# Step 2: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: Standardize features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train Logistic Regression with L2 regularization
model = LogisticRegression(penalty='l2', solver='liblinear', max_iter=5000, C=1.0)
model.fit(X_train_scaled, y_train)

# Step 5: Predictions & Accuracy
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)

# Step 6: Model Coefficients
coef_df = pd.DataFrame(model.coef_, columns=feature_names)
coef_df['class'] = wine.target_names

print("Test Accuracy:", acc)
print("\nLogistic Regression Coefficients (per class):")
print(coef_df)

Test Accuracy: 1.0

Logistic Regression Coefficients (per class):
    alcohol  malic_acid       ash  alcalinity_of_ash  magnesium  \
0  1.210972    0.415617  0.805159          -1.359310   0.096232   
1 -1.307149   -0.790072 -1.169551           0.697742  -0.199082   
2  0.276977    0.483434  0.478266           0.406877   0.018923   

   total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
0       0.254444    1.120447             -0.006312        -0.248447   
1       0.087130    0.338720              0.206824         0.379657   
2      -0.160326   -1.349051             -0.144092        -0.425197   

   color_intensity       hue  od280/od315_of_diluted_wines   proline    class  
0         0.185990  0.001280                      0.953953  1.645212  class_0  
1        -1.869870  0.905697                     -0.204198 -1.685038  class_1  
2         1.448417 -0.895848                     -0.662348  0.066522  class_2  


**Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.**

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# Step 1: Load dataset
wine = load_wine()
X, y = wine.data, wine.target
target_names = wine.target_names

# Step 2: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train Logistic Regression with OVR
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=5000)
model.fit(X_train_scaled, y_train)

# Step 5: Predictions & Evaluation
y_pred = model.predict(X_test_scaled)
acc = accuracy_score(y_test, y_pred)

print("Test Accuracy:", acc)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=target_names))

Test Accuracy: 1.0

Classification Report:

              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        12
     class_1       1.00      1.00      1.00        14
     class_2       1.00      1.00      1.00        10

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36





**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**

In [None]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Step 1: Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Step 2: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3: Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define Logistic Regression and parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

log_reg = LogisticRegression(solver='liblinear', max_iter=5000, multi_class='ovr')

# Step 5: GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train_scaled, y_train)

# Step 6: Print best parameters and validation accuracy
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

Best Parameters: {'C': 0.1, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.993103448275862




**Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.**

In [2]:
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Step 2: Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 3a: Train Logistic Regression without scaling
model_no_scale = LogisticRegression(solver='liblinear', multi_class='ovr', max_iter=5000)
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
acc_no_scale = accuracy_score(y_test, y_pred_no_scale)

# Step 3b: Train Logistic Regression with scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(solver='liblinear', multi_class='ovr', max_iter=5000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Step 4: Print results
print("Accuracy without scaling:", acc_no_scale)
print("Accuracy with scaling:", acc_scaled)

Accuracy without scaling: 0.9722222222222222
Accuracy with scaling: 1.0




**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

**1. Problem framing & metric selection**

* Goal: Identify customers likely to respond (positive class = responder). Only 5% responders → high class imbalance.

* Primary metrics: Precision@k, Recall, Precision-Recall AUC (average precision), Lift/Gain, and a business metric (expected profit).

* Do not rely on accuracy. Use PR-AUC or precision@k because they reflect performance on the rare class.

**2. Data preparation / EDA**

* Inspect class distribution, missing values, feature types, correlations, outliers.

* Visualize positive vs negative distributions for key features.

* Handle missing data: impute (median / KNN / iterative) based on feature type and missingness mechanism.

* Outliers: Winsorize or clip if they are data errors; otherwise consider robust scaling.

**3. Feature engineering**

* Create interaction features (e.g., recency × frequency), binned continuous variables, one-hot or target-encoding for high-cardinality categoricals.

* Use domain knowledge — e.g., recency of last purchase, total spend, categorical membership.

**4. Train/validation split**

* Use a stratified split so the 5% class ratio is preserved in both the training and test sets.

* Prefer StratifiedKFold for cross-validation so that every fold keeps the same class ratio.
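
A minimal sketch of both ideas, using synthetic labels as a stand-in for the campaign data (the 1,000-customer size and the random placeholder features are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Synthetic labels (assumed): 1,000 customers, 5% responders
rng = np.random.default_rng(42)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(1000, 4))  # placeholder features

# Stratified hold-out split keeps the 5% ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("Train responder rate:", y_train.mean())
print("Test responder rate:", y_test.mean())

# Stratified folds keep the ratio inside each validation fold as well
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [y_train[val_idx].mean() for _, val_idx in skf.split(X_train, y_train)]
print("Responder rate per validation fold:", fold_rates)
```

Without stratification, a small test set or fold could end up with very few responders, or none, which makes metrics such as recall unstable.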

**5. Balancing strategies (choose one or combine)**

* Model-level (fast, safe):

* Use class_weight='balanced' in LogisticRegression — often a good baseline (no resampling).

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced', solver='liblinear')

* Data-level (resampling):

* SMOTE (synthetic oversampling) to increase minority class in training only:

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

* Combine SMOTE + Tomek or SMOTEENN to oversample then clean.

* Hybrid / ensemble:

* BalancedBaggingClassifier, or ensemble of models trained on different balanced samples.

* Important: only resample the training set. Keep validation/test sets untouched for realistic evaluation.
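
To see what `class_weight='balanced'` does under the hood, the sketch below computes the weights for a hypothetical training set with 5% responders (the label array is made up for illustration). scikit-learn's "balanced" heuristic is `n_samples / (n_classes * count_per_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 50 responders, 950 non-responders
y_train = np.array([1] * 50 + [0] * 950)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)

# The same heuristic written out by hand
manual = len(y_train) / (len(classes) * np.bincount(y_train))

print("Weight for class 0 (non-responder):", weights[0])
print("Weight for class 1 (responder):", weights[1])
```

Each responder counts roughly 19 times as much as a non-responder in the loss, which compensates for the 95:5 imbalance without creating or discarding any rows.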

**6. Feature scaling**

* Use StandardScaler (or MinMaxScaler) when using regularized logistic regression, since the penalty treats all coefficients on the same scale. Fit the scaler on the training set only, then apply it to the validation and test sets.

**7. Hyperparameter tuning**

* Tune C (inverse of regularization strength), penalty (l1,l2), and solver. Use Stratified CV.

* For imbalanced data, optimize for PR-AUC or average_precision:

In [None]:
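from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# For classifiers, an integer cv uses stratified folds by default
grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                    param_grid={'C':[0.01,0.1,1,10],'penalty':['l1','l2']},
                    scoring='average_precision', cv=5)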

* Consider nested CV if you want unbiased performance estimates.
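
Scaling and tuning can be combined so that the scaler is refit inside every cross-validation fold and never sees the validation data. The following is a sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, and the step names `scaler` and `clf` are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumed): about 5% positives
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear', class_weight='balanced')),
])

param_grid = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__penalty': ['l1', 'l2'],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring='average_precision', cv=cv)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV average precision:", grid.best_score_)
```

In a real project, `grid.fit` would be called on the training split only, with the test set held back for a final check.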

**8. Probability calibration & threshold tuning**

* Logistic regression outputs probabilities, but class weighting or resampling shifts them away from the true response rate, so they may need recalibration:

In [None]:
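from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
# base_model: the unfitted classifier to calibrate, e.g. the class-weighted model above
base_model = LogisticRegression(class_weight='balanced', solver='liblinear')
calibrated = CalibratedClassifierCV(base_model, cv=3)
calibrated.fit(X_train, y_train)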

* Threshold selection:

* The probability threshold need not be 0.5. Choose it based on the business metric (maximum expected profit, or a required precision@k).

* Use the precision-recall curve to pick a threshold that gives an acceptable precision/recall trade-off:

In [None]:
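from sklearn.metrics import precision_recall_curve
# model: a fitted classifier; X_val, y_val: a held-out validation set
probs = model.predict_proba(X_val)[:,1]
# thresholds has one fewer element than precision and recall
precision, recall, thresholds = precision_recall_curve(y_val, probs)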

* Evaluate precision@k (top k customers) and expected revenue per targeted customer.

**9. Evaluation — practical business metrics**

* Report: Precision, Recall, F1 (for chosen threshold), PR-AUC (average precision), Lift/Gain charts, Confusion matrix, Precision@k.

* For business: compute expected profit using:

* revenue per positive, cost per contact, and predicted counts at threshold → choose threshold maximizing expected value.

* Weigh the cost of false positives (contacting uninterested customers) against the revenue lost to false negatives (missed responders).
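
The expected-profit calculation can be sketched as follows. Everything numeric here is a made-up assumption for illustration: the ten validation labels and probabilities, the revenue of 20 per responder, the cost of 2 per contact, and the list of candidate thresholds. The helper names `expected_profit` and `precision_at_k` are hypothetical, not library functions.

```python
import numpy as np

def expected_profit(y_true, probs, threshold, revenue_per_response, cost_per_contact):
    """Profit from contacting everyone whose predicted probability >= threshold."""
    contacted = probs >= threshold
    responders = np.sum(y_true[contacted] == 1)
    return responders * revenue_per_response - np.sum(contacted) * cost_per_contact

def precision_at_k(y_true, probs, k):
    """Share of true responders among the k highest-scored customers."""
    top_k = np.argsort(probs)[::-1][:k]
    return y_true[top_k].mean()

# Made-up validation labels and predicted probabilities
y_val = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
probs = np.array([0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.05, 0.15, 0.25, 0.02])

REVENUE, COST = 20.0, 2.0  # assumed business figures
candidates = [0.05, 0.2, 0.35, 0.5, 0.8]

profits = {t: expected_profit(y_val, probs, t, REVENUE, COST) for t in candidates}
best_threshold = max(profits, key=profits.get)

print("Profit by threshold:", profits)
print("Best threshold:", best_threshold)
print("Precision@3:", precision_at_k(y_val, probs, 3))
```

In this toy example the best threshold is 0.35 rather than 0.5: lowering the cut-off picks up a second responder whose revenue outweighs the cost of one extra contact.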

**10. Explainability & monitoring**

* Logistic regression coefficients are interpretable — show top drivers of response.

* Deploy monitoring: population drift, model calibration, feature drift.

* Retrain periodically and keep a labeled feedback loop.