## Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

**Answer:**

• **Logistic Regression** is a classification algorithm that models the probability of a categorical outcome (binary or multiclass) by passing a linear combination of the features through the logistic function

• **Linear Regression** predicts continuous numerical values, while Logistic Regression predicts probabilities and class labels

• Logistic Regression uses the **sigmoid function** to map any real-valued number to a value between 0 and 1

• **Output interpretation**: Linear regression outputs the predicted value directly; logistic regression outputs a probability that must be thresholded (commonly at 0.5) to obtain a class label

• **Use cases**: Linear regression for house prices, logistic regression for spam detection or customer churn prediction
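
To make the contrast concrete, here is a minimal sketch on a made-up one-feature dataset (the data and variable names are illustrative assumptions, not part of the original answer). It fits both models to the same 0/1 labels: linear regression's predictions can fall outside [0, 1], while logistic regression's `predict_proba` output stays inside that range and `predict` turns it into a class label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (illustrative): one feature, label flips from 0 to 1 at x = 5
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# New points, two of them well outside the training range
X_new = np.array([[-10.0], [4.5], [20.0]])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

lin_pred = lin.predict(X_new)            # unbounded real values
proba = log.predict_proba(X_new)[:, 1]   # P(y = 1 | x), always within [0, 1]
labels = log.predict(X_new)              # class labels (0.5 threshold for binary)

print("Linear regression outputs:", lin_pred.round(2))
print("Logistic probabilities:   ", proba.round(3))
print("Logistic class labels:    ", labels)
```

Treating the linear regression output as a probability would not work here, since it goes below 0 and above 1 for the extreme inputs.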


## Question 2: Explain the role of the Sigmoid function in Logistic Regression.

**Answer:**

• **Sigmoid function** (σ(z) = 1/(1 + e^(-z))) transforms the linear combination z = w·x + b into a probability between 0 and 1

• **S-shaped curve**: The output changes smoothly with z and saturates toward 0 or 1 for large |z|, so predictions always stay within the valid probability range

• **Decision boundary**: Since σ(0) = 0.5, thresholding the probability at 0.5 is equivalent to checking the sign of z, which gives a linear decision boundary in feature space

• **Differentiability**: The sigmoid function is smooth and differentiable, enabling gradient descent optimization

• **Interpretability**: Output values can be directly interpreted as probabilities, making the model more intuitive for business stakeholders
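
The properties above can be checked with a short sketch in plain Python. The function names `sigmoid` and `sigmoid_grad` are illustrative; the two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs, not the only one.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # because math.exp(-z) would overflow for very negative z
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(f"z = {z:8.1f}  sigma(z) = {sigmoid(z):.6f}  sigma'(z) = {sigmoid_grad(z):.6f}")

# Central finite-difference check of the derivative at z = 0.7
h = 1e-6
numeric_grad = (sigmoid(0.7 + h) - sigmoid(0.7 - h)) / (2 * h)
print(f"analytic: {sigmoid_grad(0.7):.8f}  numeric: {numeric_grad:.8f}")
```

The closed-form derivative σ(z)(1 − σ(z)) is what makes gradient-based training cheap: the gradient reuses the value already computed in the forward pass.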


## Question 3: What is Regularization in Logistic Regression and why is it needed?

**Answer:**

• **Regularization** adds penalty terms to the cost function to prevent overfitting and improve generalization

• **L1 (Lasso)**: Adds the sum of the absolute values of the coefficients as a penalty; it can drive some coefficients exactly to zero, which acts as feature selection

• **L2 (Ridge)**: Adds the sum of the squared coefficients as a penalty; it shrinks coefficients toward zero but generally keeps all features

• **Prevents overfitting** by constraining model complexity, especially important with high-dimensional data

• **Improves stability** by reducing variance in predictions and making the model more robust to noise in training data
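
The difference between the two penalties shows up when counting coefficients that are exactly zero. The sketch below is an illustration only, assuming the breast cancer dataset, `C=0.1`, and the `liblinear` solver (chosen because it supports both penalties); none of these choices come from the answer above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Scale all features so the penalty treats them comparably
# (no train/test split here: this only inspects coefficients, not accuracy)
X_scaled = StandardScaler().fit_transform(X)

# In scikit-learn, C is the inverse of the regularization strength:
# a smaller C means a stronger penalty
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", random_state=42)
l2_model = LogisticRegression(penalty="l2", C=0.1, solver="liblinear", random_state=42)
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print(f"Zero coefficients with L1: {n_zero_l1} of {l1_model.coef_.size}")
print(f"Zero coefficients with L2: {n_zero_l2} of {l2_model.coef_.size}")
```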


## Question 4: What are some common evaluation metrics for classification models, and why are they important?

**Answer:**

• **Accuracy**: Fraction of all predictions that are correct; can be misleading with imbalanced datasets

• **Precision**: True positives / (True positives + False positives) - measures how many predicted positives are actually positive

• **Recall (Sensitivity)**: True positives / (True positives + False negatives) - measures how many actual positives were correctly identified

• **F1-Score**: Harmonic mean of precision and recall; a single number that is low whenever either one is low, which makes it more informative than accuracy on imbalanced datasets

• **ROC-AUC**: Area under the ROC curve, measures model's ability to distinguish between classes across all thresholds
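
A small worked example shows why accuracy alone can mislead. The confusion-matrix counts below are hypothetical (a test set of 1000 samples with 50 positives), and `classification_metrics` is an illustrative helper rather than a library function.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical model A: predicts "negative" for every sample
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
# Hypothetical model B: finds 30 of the 50 positives, with 20 false alarms
some_model = classification_metrics(tp=30, fp=20, fn=20, tn=930)

print("Always negative:", always_negative)
print("Some model:     ", some_model)
```

Model A reaches 95% accuracy while identifying no positives at all, which its recall and F1 of 0 make obvious; model B is only slightly more accurate (96%) but has precision, recall, and F1 of 0.6.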


## Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

**Answer:**


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

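# Load a built-in sklearn dataset into a DataFrame
# (for a real CSV file, pd.read_csv("your_file.csv") would replace these lines)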
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")


Dataset shape: (569, 31)

First few rows:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  wor

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
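
The convergence warning above suggests scaling the data, and the question asks for a CSV file. The following sketch covers both under stated assumptions: it writes the same dataset to a temporary CSV (the file name is illustrative) so that `pd.read_csv` has something to load, and it wraps `StandardScaler` and `LogisticRegression` in a pipeline. The stratified split is an extra choice not present in the cell above.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Write the dataset to a temporary CSV so the example has a real file to load.
# In practice csv_path would point at your own file (this path is illustrative).
raw = load_breast_cancer(as_frame=True).frame  # features plus a 'target' column
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
raw.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the target
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then reuses it at predict time
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Model accuracy: {accuracy:.4f}")
print(f"Iterations used: {model[-1].n_iter_[0]}")
```

On standardized features the default solver typically needs far fewer iterations, so the warning should no longer appear.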


## Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

**Answer:**


In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report

# Load wine dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Wine Dataset - Binary Classification")
print("Dataset shape:", df.shape)

# Create binary classification (class 0 vs others)
df['binary_target'] = (df['target'] == 0).astype(int)

X = df.drop(['target', 'binary_target'], axis=1)
y = df['binary_target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

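# Train Logistic Regression with L2 regularization (Ridge)
# Note: in scikit-learn, C is the inverse of the regularization strength (smaller C = stronger penalty)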
model = LogisticRegression(penalty='l2', C=1.0, random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"\nL2 Regularization Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Regularization strength (C): {model.C}")  # read C from the fitted model instead of hardcoding it
print(f"\nModel Coefficients:")
for i, coef in enumerate(model.coef_[0]):
    print(f"{X.columns[i]}: {coef:.4f}")

print(f"\nIntercept: {model.intercept_[0]:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))


Wine Dataset - Binary Classification
Dataset shape: (178, 14)

L2 Regularization Results:
Accuracy: 1.0000
Regularization strength (C): 1.0

Model Coefficients:
alcohol: 1.1018
malic_acid: 0.5916
ash: 0.9935
alcalinity_of_ash: -0.3907
magnesium: -0.0163
total_phenols: 0.4541
flavanoids: 1.3373
nonflavanoid_phenols: 0.1403
proanthocyanins: -0.1387
color_intensity: -0.0583
hue: 0.0314
od280/od315_of_diluted_wines: 0.8082
proline: 0.0129

Intercept: -25.3053

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        35
           1       1.00      1.00      1.00        19

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54
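
To see what the L2 penalty does to the coefficients, the sketch below refits the same "class 0 vs others" problem for a few values of `C` and reports the L2 norm of the coefficient vector. The specific `C` values, the added `StandardScaler`, and fitting on the full dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
y_binary = (y == 0).astype(int)  # same "class 0 vs others" target as above

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Smaller C means a stronger L2 penalty
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", C=C, max_iter=5000),
    )
    pipe.fit(X, y_binary)
    coef_norms[C] = float(np.linalg.norm(pipe[-1].coef_))
    print(f"C = {C:>6}: L2 norm of coefficients = {coef_norms[C]:.3f}")
```

The norm grows as `C` increases, because a larger `C` weakens the penalty and lets the coefficients move further from zero.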



## Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.

**Answer:**


In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix

# Load iris dataset for multiclass classification
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Iris Dataset - Multiclass Classification")
print("Dataset shape:", df.shape)
print("\nTarget classes:", data.target_names)
print("\nClass distribution:")
print(df['target'].value_counts().sort_index())

X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train Logistic Regression with One-vs-Rest (OvR) strategy
model = LogisticRegression(multi_class='ovr', random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

print(f"\nMulticlass Classification Results:")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

print("\nModel Coefficients for each class:")
for i, class_name in enumerate(data.target_names):
    print(f"\n{class_name} vs Rest:")
    for j, feature in enumerate(data.feature_names):
        print(f"  {feature}: {model.coef_[i][j]:.4f}")


Iris Dataset - Multiclass Classification
Dataset shape: (150, 5)

Target classes: ['setosa' 'versicolor' 'virginica']

Class distribution:
target
0    50
1    50
2    50
Name: count, dtype: int64

Multiclass Classification Results:
Training samples: 112
Test samples: 38

Confusion Matrix:
[[15  0  0]
 [ 0 10  1]
 [ 0  0 12]]

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       1.00      0.91      0.95        11
   virginica       0.92      1.00      0.96        12

    accuracy                           0.97        38
   macro avg       0.97      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38


Model Coefficients for each class:

setosa vs Rest:
  sepal length (cm): -0.4150
  sepal width (cm): 0.8674
  petal length (cm): -2.1851
  petal width (cm): -0.9055

versicolor vs Rest:
  sepal length (cm): -0.1534
  sepal width (cm): -2.0957
  petal length (cm): 0.5
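
Newer scikit-learn releases emit a deprecation warning for the `multi_class` argument (starting around version 1.5, as far as the release notes indicate). A hedged alternative that keeps the one-vs-rest strategy is to wrap the estimator in `OneVsRestClassifier`, sketched here on the same Iris split; this is a possible substitute, not the approach the question asks for.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42
)

# One binary logistic regression is trained per class (class k vs the rest)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)
y_pred = ovr_model.predict(X_test)

print(classification_report(y_test, y_pred, target_names=data.target_names))

# Per-class coefficients live on the fitted binary estimators
for class_name, est in zip(data.target_names, ovr_model.estimators_):
    print(f"{class_name} vs rest: {est.coef_[0].round(4)}")
```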



## Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

**Answer:**


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report

# Load breast cancer dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Breast Cancer Dataset - Hyperparameter Tuning")
print("Dataset shape:", df.shape)

X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']  # solvers that support both L1 and L2
}

# Create base model
log_reg = LogisticRegression(random_state=42, max_iter=1000)

# GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    log_reg, 
    param_grid, 
    cv=5, 
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

print("\nStarting Grid Search...")
grid_search.fit(X_train, y_train)

# Get best parameters and score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"\nBest Parameters: {best_params}")
print(f"Best Cross-Validation Score: {best_score:.4f}")

# Test on holdout set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Test Set Accuracy: {test_accuracy:.4f}")

print("\nDetailed Results:")
results_df = pd.DataFrame(grid_search.cv_results_)
print(results_df[['param_C', 'param_penalty', 'param_solver', 'mean_test_score', 'std_test_score']].head(10))

print("\nClassification Report on Test Set:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Breast Cancer Dataset - Hyperparameter Tuning
Dataset shape: (569, 31)

Starting Grid Search...
Fitting 5 folds for each of 24 candidates, totalling 120 fits

Best Parameters: {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Score: 0.9670
Test Set Accuracy: 0.9825

Detailed Results:
   param_C param_penalty param_solver  mean_test_score  std_test_score
0    0.001            l1    liblinear         0.914286        0.039560
1    0.001            l1         saga         0.912088        0.041117
2    0.001            l2    liblinear         0.920879        0.045786
3    0.001            l2         saga         0.914286        0.040166
4    0.010            l1    liblinear         0.914286        0.041931
5    0.010            l1         saga         0.914286        0.040166
6    0.010            l2    liblinear         0.925275        0.045786
7    0.010            l2         saga         0.916484        0.044284
8    0.100            l1    liblinear         0.92087
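
The grid search above runs on unscaled features. One possible refinement, sketched below, puts `StandardScaler` and the classifier in a `Pipeline` so that scaling is refit inside every cross-validation fold. The step names `scaler` and `clf`, the smaller grid, and the single `liblinear` solver are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)),
])

# The "clf__" prefix routes each parameter to the pipeline step named "clf"
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test set accuracy: {search.score(X_test, y_test):.4f}")
```

Because the scaler is part of the estimator being cross-validated, each validation fold is transformed with statistics learned from its own training folds only, which avoids leaking information into the validation score.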

## Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

**Answer:**


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report

# Load wine dataset
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

print("Wine Dataset - Feature Scaling Comparison")
print("Dataset shape:", df.shape)
print("\nFeature statistics (before scaling):")
print(df.describe().round(2))

X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("\n" + "="*60)
print("COMPARISON: WITH vs WITHOUT FEATURE SCALING")
print("="*60)

# 1. WITHOUT SCALING
print("\n1. LOGISTIC REGRESSION WITHOUT SCALING:")
print("-" * 40)

model_no_scale = LogisticRegression(random_state=42, max_iter=1000)
model_no_scale.fit(X_train, y_train)

y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)

print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")
print(f"Convergence: {'Yes' if model_no_scale.n_iter_[0] < model_no_scale.max_iter else 'No'}")
print(f"Iterations needed: {model_no_scale.n_iter_[0]}")

# 2. WITH SCALING
print("\n2. LOGISTIC REGRESSION WITH STANDARD SCALING:")
print("-" * 40)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nScaled feature statistics:")
scaled_df = pd.DataFrame(X_train_scaled, columns=data.feature_names)
print(f"Mean: {scaled_df.mean().round(4).tolist()}")
print(f"Std:  {scaled_df.std().round(4).tolist()}")

model_scaled = LogisticRegression(random_state=42, max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"\nAccuracy with scaling: {accuracy_scaled:.4f}")
print(f"Convergence: {'Yes' if model_scaled.n_iter_[0] < model_scaled.max_iter else 'No'}")
print(f"Iterations needed: {model_scaled.n_iter_[0]}")

# COMPARISON
print("\n" + "="*60)
print("COMPARISON RESULTS:")
print("="*60)
print(f"Accuracy without scaling: {accuracy_no_scale:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaled:.4f}")
print(f"Improvement: {accuracy_scaled - accuracy_no_scale:.4f}")
print(f"Iterations without scaling: {model_no_scale.n_iter_[0]}")
print(f"Iterations with scaling:    {model_scaled.n_iter_[0]}")

print("\nClassification Report (With Scaling):")
print(classification_report(y_test, y_pred_scaled, target_names=data.target_names))


Wine Dataset - Feature Scaling Comparison
Dataset shape: (178, 14)

Feature statistics (before scaling):
       alcohol  malic_acid     ash  alcalinity_of_ash  magnesium  \
count   178.00      178.00  178.00             178.00     178.00   
mean     13.00        2.34    2.37              19.49      99.74   
std       0.81        1.12    0.27               3.34      14.28   
min      11.03        0.74    1.36              10.60      70.00   
25%      12.36        1.60    2.21              17.20      88.00   
50%      13.05        1.87    2.36              19.50      98.00   
75%      13.68        3.08    2.56              21.50     107.00   
max      14.83        5.80    3.23              30.00     162.00   

       total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
count         178.00      178.00                178.00           178.00   
mean            2.30        2.03                  0.36             1.59   
std             0.63        1.00                  0.12   

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
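
One detail about the "Std:" line in the script: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`). The printed values can therefore be expected to sit slightly above 1 rather than exactly at 1. The sketch below illustrates this on the full wine dataset (fitting on all rows is a simplification for the demonstration).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine(as_frame=True).data
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

n = len(X_scaled)
population_std = X_scaled.std(ddof=0)  # the quantity StandardScaler normalizes to 1
sample_std = X_scaled.std()            # pandas default, ddof=1

print("Population std (ddof=0):", population_std.round(4).unique())
print("Sample std (ddof=1):    ", sample_std.round(4).unique())
print("Expected ratio sqrt(n / (n - 1)):", round(float(np.sqrt(n / (n - 1))), 4))
```

The gap is harmless for modelling; it only matters when checking by eye that the scaling worked.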


## Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you'd take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

**Answer:**

• **Data Handling**: Clean missing values, engineer features such as purchase recency, frequency, and monetary value, and derive customer segments from behavior patterns

• **Feature Scaling**: Apply StandardScaler to numerical features, fitting it on the training data only; scaling matters because the regularization penalty acts on coefficient magnitudes, which depend on feature scale, and it also helps the solver converge

• **Class Balancing**: Handle the 5% response rate with class_weight='balanced' (a form of cost-sensitive learning) or by oversampling the minority class with SMOTE; apply any resampling to the training data only, never to validation or test data

• **Hyperparameter Tuning**: Run GridSearchCV over C, penalty type (L1/L2), and class_weight options, using stratified cross-validation so that every fold keeps the original class distribution

• **Evaluation Strategy**: Focus on the precision-recall curve and F1-score instead of accuracy (a model that predicts "no response" for everyone is already 95% accurate), track business metrics such as cost per acquisition and ROI, and tune the decision threshold based on the business costs of false positives and false negatives
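
The steps above can be sketched end to end on synthetic data. Everything specific in this example is an assumption made for illustration: the dataset comes from `make_classification` with roughly 5% positives, the step names `scaler` and `clf` are arbitrary, and the business numbers `GAIN_PER_RESPONDER` and `COST_PER_CONTACT` are invented placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratified splits keep the ~5% response rate in every subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": [None, "balanced"],
}
search = GridSearchCV(
    pipe, param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="average_precision",  # summarizes the precision-recall curve
)
search.fit(X_fit, y_fit)

# Choose the decision threshold on validation data using assumed business numbers
GAIN_PER_RESPONDER = 10.0  # assumed profit from a contacted customer who responds
COST_PER_CONTACT = 1.0     # assumed cost of contacting any customer

val_proba = search.predict_proba(X_val)[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)
profits = []
for t in thresholds:
    contacted = val_proba >= t
    responders_reached = np.sum(contacted & (y_val == 1))
    profits.append(GAIN_PER_RESPONDER * responders_reached - COST_PER_CONTACT * contacted.sum())
best_threshold = float(thresholds[int(np.argmax(profits))])

# Final check on the untouched test set
test_proba = search.predict_proba(X_test)[:, 1]
test_ap = average_precision_score(y_test, test_proba)
baseline_ap = y_test.mean()  # expected average precision of a random ranking

print("Best parameters:", search.best_params_)
print(f"Chosen threshold: {best_threshold:.2f}")
print(f"Test average precision: {test_ap:.3f} (baseline {baseline_ap:.3f})")
```

The threshold is selected on a separate validation split so that the test set stays untouched until the final evaluation; with real campaign data, the gain and cost figures would come from the business side.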
