Question 1: What is Logistic Regression, and how does it differ from Linear
Regression?

Answer: Logistic Regression is a supervised machine learning algorithm used mainly for classification, especially binary classification, where the output variable has two possible outcomes such as yes or no, true or false, or 0 and 1. Unlike Linear Regression, which predicts continuous numerical values, Logistic Regression predicts the probability that a given input belongs to a particular class. It uses the logistic (sigmoid) function, 1 / (1 + e^(-z)), where z is the linear combination of the input features and their weights, to map that linear score to a probability between 0 and 1. The probability is then converted into a class label using a threshold, usually 0.5. In contrast, Linear Regression models the dependent variable directly as a linear function of the independent variables (a straight line for a single feature, a hyperplane for several) and is used for continuous outcomes. The two also differ in the loss they minimize: Logistic Regression uses log loss (cross-entropy), whereas Linear Regression uses mean squared error. Therefore, Linear Regression suits predicting numeric values, while Logistic Regression classifies data points into discrete categories based on probability estimates.
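As a minimal sketch of the steps described above (linear score, sigmoid, threshold, log loss), the snippet below works through a single input by hand. The weights w0 and w1 and the input x are made-up values chosen only for illustration; they do not come from any fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p):
    """Cross-entropy loss for one example with true label 0 or 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical intercept, weight, and feature value (illustration only)
w0, w1 = -4.0, 2.0
x = 3.0

z = w0 + w1 * x          # linear score: -4 + 2 * 3 = 2.0
p = sigmoid(z)           # probability of class 1, about 0.881
label = int(p >= 0.5)    # apply the usual 0.5 threshold

# The loss is small when the true label agrees with a confident prediction
# and large when it disagrees.
loss_if_1 = log_loss(1, p)   # about 0.127
loss_if_0 = log_loss(0, p)   # about 2.127

print(f"z = {z}, p = {p:.4f}, label = {label}")
print(f"log loss if y = 1: {loss_if_1:.4f}, if y = 0: {loss_if_0:.4f}")
```

A Linear Regression model would stop at z and report it as the prediction; the sigmoid and the threshold are what turn the same linear score into a class label.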

Question 2: Explain the role of the Sigmoid function in Logistic Regression.

Answer: The Sigmoid function plays a crucial role in Logistic Regression because it transforms the output of the linear equation into a probability between 0 and 1. The model first computes a linear combination of the input features and their coefficients, z = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ. This value of z can range from negative to positive infinity, so it cannot be read as a probability. The Sigmoid function, defined as σ(z) = 1 / (1 + e^(-z)), is applied to this linear output to squash it into the range between 0 and 1, which lets the model interpret the result as the probability that the input belongs to the positive class. For instance, if the Sigmoid output is greater than or equal to 0.5, the data point is classified as class 1; otherwise, it is classified as class 0. Because σ(0) = 0.5, this rule is equivalent to checking whether z is non-negative. Thus, the Sigmoid function acts as a bridge between the linear model and the probabilistic interpretation required for classification.
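The squashing behaviour can be checked numerically. The sketch below evaluates the Sigmoid at a few arbitrary values of z, including very large magnitudes; the two-branch form is one common way to avoid overflow in exp() and is an implementation choice made here, not something the definition requires.

```python
import math

def sigmoid(z: float) -> float:
    """Sigmoid written so that exp() is only ever called on a non-positive value."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Arbitrary scores spanning a wide range
for z in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    p = sigmoid(z)
    predicted_class = 1 if p >= 0.5 else 0
    print(f"z = {z:8.1f}  sigmoid(z) = {p:.4f}  class = {predicted_class}")
```

Every output stays within [0, 1], the value at z = 0 is exactly 0.5, and sigmoid(-z) equals 1 - sigmoid(z), which is why thresholding the probability at 0.5 matches thresholding z at 0.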

Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer: Regularization in Logistic Regression is a technique that prevents the model from overfitting the training data by adding a penalty term to the cost function. Overfitting occurs when a model learns the noise and random fluctuations in the training data instead of the true underlying patterns, which reduces its ability to generalize to new, unseen data. Regularization controls the complexity of the model by discouraging excessively large coefficient values. Two types are common in Logistic Regression: L1 regularization (Lasso) and L2 regularization (Ridge). L1 regularization adds the sum of the absolute values of the coefficients to the cost function, which can shrink some coefficients exactly to zero and so performs feature selection. L2 regularization adds the sum of the squared coefficients, which keeps the coefficient values small but does not usually make them exactly zero. With these penalty terms the model stays simpler and more stable, and the L2 penalty in particular helps when features are multicollinear. Regularization is therefore essential in Logistic Regression: it improves generalization and keeps the model from fitting too closely to the training data.
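The difference between the two penalties can be seen by counting zero coefficients. The following sketch fits one L1-penalized and one L2-penalized model on scikit-learn's breast cancer dataset; the dataset, the liblinear solver, and the value C=0.05 are choices made for this illustration. In scikit-learn, C is the inverse of the regularization strength, so a small C means a strong penalty.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 30 numeric features, binary target
X, y = load_breast_cancer(return_X_y=True)

# Standardize so the penalty treats all coefficients on the same scale.
# No test set is evaluated here, so the scaler is fit on all rows.
X_scaled = StandardScaler().fit_transform(X)

# Same strong penalty (small C) for both models; only the penalty type differs.
# liblinear is used because it supports both 'l1' and 'l2'.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, max_iter=1000)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05, max_iter=1000)
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

n_features = X.shape[1]
n_zero_l1 = int((l1_model.coef_ == 0).sum())
n_zero_l2 = int((l2_model.coef_ == 0).sum())

print(f"L1: {n_zero_l1} of {n_features} coefficients are exactly zero")
print(f"L2: {n_zero_l2} of {n_features} coefficients are exactly zero")
```

The exact counts depend on the scikit-learn version and on C, but the L1 model is expected to drop a number of features while the L2 model keeps all of them with small weights.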

Question 4: What are some common evaluation metrics for classification models, and
why are they important?

Answer: Evaluation metrics for classification models are essential because they measure how well a model distinguishes between classes. Common metrics include accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC-ROC). Accuracy is the proportion of correctly predicted instances out of all predictions and is useful when the dataset is balanced. With imbalanced data, however, accuracy alone can be misleading because it can remain high even if the model performs poorly on the minority class. Precision is the proportion of true positives among all instances predicted as positive, TP / (TP + FP), and indicates how reliable positive predictions are. Recall, also known as sensitivity or true positive rate, is the proportion of actual positives that the model correctly identifies, TP / (TP + FN). The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both, which is especially useful when class distributions are uneven. The ROC curve and its AUC value measure the model’s ability to separate the classes across all threshold values: an AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation. These metrics matter because they expose aspects of model performance that accuracy hides, allowing data scientists to choose the most suitable model for the problem’s specific requirements and class distribution.
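To make the point about imbalanced data concrete, the sketch below computes the metrics by hand from confusion-matrix counts. The counts are invented for illustration: 1,000 samples of which 50 are positive, with a hypothetical model that finds 30 of them and raises 10 false alarms.

```python
# Hypothetical confusion-matrix counts (illustration only)
tp, fn = 30, 20    # actual positives: 50
fp, tn = 10, 940   # actual negatives: 950
total = tp + fn + fp + tn

accuracy = (tp + tn) / total                         # 970 / 1000 = 0.97
precision = tp / (tp + fp)                           # 30 / 40 = 0.75
recall = tp / (tp + fn)                              # 30 / 50 = 0.60
f1 = 2 * precision * recall / (precision + recall)   # about 0.667

# A model that always predicts the negative class never finds a positive,
# yet it is still right on every actual negative.
baseline_accuracy = (fp + tn) / total                # 950 / 1000 = 0.95

print(f"accuracy  = {accuracy:.3f} (always-negative baseline: {baseline_accuracy:.3f})")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```

The accuracy of 0.97 is barely above the 0.95 that a trivial always-negative model achieves, while recall shows that 40% of the positives are missed.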


Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.

(Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Iris ships with scikit-learn, so it is loaded into a DataFrame directly
# rather than read from a CSV file with pd.read_csv
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("First five rows of the dataset:")
print(df.head())

# Features are every column except the last; the label is 'target'
X = df.iloc[:, :-1]
y = df['target']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised above the default of 100 to give the solver room to converge
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("\nLogistic Regression Model Accuracy: {:.2f}%".format(accuracy * 100))


First five rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Logistic Regression Model Accuracy: 100.00%


Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.

(Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df.iloc[:, :-1]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# L2 (Ridge) penalty with the lbfgs solver; C is left at its default of 1.0,
# where C is the inverse of the regularization strength
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# coef_ has one row per class and one column per feature
print("Model Coefficients:")
print(model.coef_)
print("\nIntercept:")
print(model.intercept_)
print("\nLogistic Regression Model Accuracy: {:.2f}%".format(accuracy * 100))


Model Coefficients:
[[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]

Intercept:
[  9.00884295   1.86902164 -10.87786459]

Logistic Regression Model Accuracy: 100.00%


Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.

(Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# One-vs-rest: one binary classifier is trained per class. The OneVsRestClassifier
# wrapper is used in place of multi_class='ovr', which is deprecated in recent
# scikit-learn releases.
model = OneVsRestClassifier(LogisticRegression(solver='liblinear', max_iter=200))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Per-class precision, recall, F1-score, and support on the test set
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45



Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.

(Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:


In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The 'estimator__' prefix routes each parameter to the LogisticRegression
# wrapped inside OneVsRestClassifier
param_grid = {
    'estimator__C': [0.01, 0.1, 1, 10, 100],
    'estimator__penalty': ['l1', 'l2']
}

# liblinear supports both the l1 and l2 penalties; cv=5 runs 5-fold cross-validation
grid = GridSearchCV(OneVsRestClassifier(LogisticRegression(solver='liblinear', max_iter=1000)), param_grid, cv=5)
grid.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)

Best Parameters: {'estimator__C': 10, 'estimator__penalty': 'l2'}
Validation Accuracy: 0.9523809523809523


Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.

(Use Dataset from sklearn package)

(Include your Python code and output in the code box below.)

Answer:


In [9]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model 1: trained on the raw, unscaled features
model1 = LogisticRegression(max_iter=200)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred1)

# Fit the scaler on the training set only, then apply the same
# transformation to the test set to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model 2: trained on the standardized features
model2 = LogisticRegression(max_iter=200)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred2)

print("Accuracy without scaling:", acc_without_scaling)
print("Accuracy with scaling:", acc_with_scaling)


Accuracy without scaling: 1.0
Accuracy with scaling: 1.0


Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Answer: For an imbalanced marketing campaign dataset where only 5% of customers respond, I would start by collecting and cleaning relevant data, including customer demographics, purchase history, and engagement metrics, while handling missing values and removing duplicates. I would perform feature engineering to create meaningful variables, such as recency, frequency, monetary (RFM) scores, and one-hot encode categorical features. Numeric features would be standardized with a method like `StandardScaler`, which helps the solver converge and puts all coefficients on a comparable scale for regularization. To address the severe class imbalance, I would either resample the training data (oversampling the minority class with a technique such as SMOTE, or undersampling the majority class) or set `class_weight='balanced'` in the Logistic Regression model. Any resampling would be applied to the training folds only, so that validation and test data keep the real 5% response rate. Hyperparameter tuning would be performed using GridSearchCV or RandomizedSearchCV over parameters like the inverse regularization strength `C`, the penalty type (`l1` or `l2`), the solver, and the class weighting. For model evaluation, I would avoid relying on accuracy and instead focus on metrics suited to imbalanced data, such as precision, recall, F1-score, ROC-AUC, and precision-recall curves, emphasizing recall for responders since identifying them is the business priority. Finally, the model would be validated using stratified splits or cross-validation, and deployed in a way that allows monitoring for drift and regular retraining with new campaign data, so that it remains aligned with business objectives and maximizes campaign effectiveness.
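The modeling part of this approach can be sketched as follows. No campaign data is available here, so `make_classification` generates a synthetic stand-in with roughly 5% positives; the feature count, the parameter grid, and the choice of average precision as the tuning metric are all assumptions made for illustration. Class weighting is used instead of SMOTE to keep the example within scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: about 5% of rows are responders (class 1)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so the train and test sets keep the same response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=1000)),
])

# Tune C and the penalty type, scoring by area under the precision-recall curve
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

# Evaluate on the untouched test set with imbalance-aware metrics
proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)

print("Best parameters:", search.best_params_)
print("ROC-AUC:", roc_auc)
print("PR-AUC (average precision):", pr_auc)
print(classification_report(y_test, search.predict(X_test), digits=3))
```

With real campaign data, the 0.5 decision threshold would normally be revisited as well, since the right trade-off between precision and recall depends on the cost of contacting a non-responder versus missing a responder.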

