**Logistic Regression | Assignment**

Question 1: What is Logistic Regression, and how does it differ from Linear Regression?


 Answer: Logistic Regression is a supervised machine learning algorithm used for classification tasks, where
 the output variable is categorical (such as Yes/No or 0/1). It predicts the probability that an instance belongs
 to a particular class using the logistic (sigmoid) function. The model takes a linear combination of the input
 features and applies the sigmoid function to map the result between 0 and 1.
The mathematical representation of Logistic Regression is:
P(Y=1|X) = 1 / (1 + e^-(b0 + b1x1 + b2x2 + ... + bnxn))
Key Points:
- Logistic Regression is used when the dependent variable is categorical, while Linear Regression is used when it is continuous.
- Linear Regression predicts a numeric value, but Logistic Regression predicts a probability value that is later converted to a class (0 or 1).
- Logistic Regression uses log loss (cross-entropy) as its cost function, whereas Linear Regression uses mean squared error (MSE).
- The output of Linear Regression is unbounded, while Logistic Regression restricts the output between 0 and 1 using the sigmoid function.
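To make the contrast concrete, here is a minimal sketch in plain Python. The coefficients, bias, and input example are hypothetical values chosen only for illustration, and the 0.5 cutoff is a common default rather than something fixed by the model.

```python
import math

def linear_output(features, weights, bias):
    """Linear combination b0 + b1*x1 + ... + bn*xn (the quantity Linear Regression predicts)."""
    return bias + sum(w * x for w, x in zip(weights, features))

def logistic_probability(features, weights, bias):
    """P(Y=1|X): the same linear combination passed through the sigmoid."""
    z = linear_output(features, weights, bias)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and a single example, for illustration only
weights = [0.8, -0.5]
bias = -1.0
x = [3.0, 1.0]

z = linear_output(x, weights, bias)          # unbounded real number
p = logistic_probability(x, weights, bias)   # strictly between 0 and 1
label = 1 if p >= 0.5 else 0                 # 0.5 is a common default threshold

print("linear output:", z)
print("probability:", round(p, 4))
print("predicted class:", label)
```

The linear output can be any real number, while the logistic output is a probability that only becomes a class label once a threshold is applied.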




Question 2: Explain the role of the Sigmoid function in Logistic Regression.


Answer: The sigmoid function is the core of Logistic Regression: it converts the output of a linear
equation into a probability between 0 and 1, so predictions can be interpreted as probabilities and
then turned into class labels.
The formula for the sigmoid function is: σ(z) = 1 / (1 + e^-z), where z = b0 + b1x1 + b2x2 + ... + bnxn
Role of the Sigmoid Function:
- Converts any real-valued number into a probability value strictly between 0 and 1.
- When z > 0 the output is above 0.5 and approaches 1 as z grows; when z < 0 it is below 0.5 and approaches 0 as z decreases; at z = 0 it is exactly 0.5.
- Supports thresholding: comparing the probability with a cutoff (commonly 0.5) assigns each example to a class.
- Provides a smooth and differentiable output, which makes it suitable for gradient-based optimization methods.
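The behaviour of the sigmoid can be checked numerically. The sketch below uses only the standard library. The branch on the sign of z is a common way to avoid overflow in exp() and is an implementation choice, not something the formula requires.

```python
import math

def sigmoid(z):
    """Sigmoid 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_derivative(z):
    """Derivative of the sigmoid: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Print the probability and the gradient at a few sample points
for z in (-6, -2, 0, 2, 6):
    print(f"z={z:>3}  sigmoid={sigmoid(z):.4f}  derivative={sigmoid_derivative(z):.4f}")
```

The derivative is largest at z = 0 and shrinks toward zero in both tails, which is why gradient-based training slows down for examples the model is already very confident about.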

 Question 3: What is Regularization in Logistic Regression and why is it needed?


 Answer: Regularization in logistic regression is a technique used to reduce overfitting by adding a penalty
 term to the loss function. Overfitting occurs when a model learns unnecessary details from training data,
 reducing its performance on unseen data. Regularization helps keep the model simple and general.
The regularized cost function is:
J(θ) = -(1/m) Σ_{i=1..m} [y_i log(hθ(x_i)) + (1 - y_i) log(1 - hθ(x_i))] + λ R(θ)
where hθ(x_i) is the predicted probability for example i, m is the number of training examples, R(θ) is
the regularization term, and λ is the regularization strength.
Types of Regularization:
- L1 Regularization (Lasso): Adds the sum of the absolute values of the coefficients, R(θ) = Σ |θ_j|, as a penalty; it can shrink some coefficients exactly to zero.
- L2 Regularization (Ridge): Adds the sum of the squared coefficients, R(θ) = Σ θ_j^2, as a penalty; it prevents large weights but keeps all features.
Importance:
- Controls the complexity of the model.
- Prevents overfitting and improves generalization.
- Makes the model more stable and interpretable.
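The regularized cost J(θ) can be written out directly. The following is a minimal sketch in plain Python, assuming the bias term is left unpenalized (a common convention). The toy data, coefficient values, and λ = 0.1 are hypothetical. In scikit-learn the strength is controlled through `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_log_loss(theta, bias, X, y, lam=0.0, penalty="l2"):
    """Mean log loss plus lam * R(theta); the bias term is not penalized."""
    m = len(y)
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for features, target in zip(X, y):
        z = bias + sum(t * x for t, x in zip(theta, features))
        h = min(max(sigmoid(z), eps), 1.0 - eps)
        total += target * math.log(h) + (1 - target) * math.log(1.0 - h)
    data_loss = -total / m
    if penalty == "l1":
        reg = sum(abs(t) for t in theta)   # R(theta) = sum |theta_j|
    elif penalty == "l2":
        reg = sum(t * t for t in theta)    # R(theta) = sum theta_j^2
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_loss + lam * reg

# Hypothetical toy data and coefficients, for illustration only
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [2.0, 1.2]]
y = [0, 1, 0, 1]
theta, bias = [2.0, -1.0], 0.1

unregularized = regularized_log_loss(theta, bias, X, y, lam=0.0)
with_l1 = regularized_log_loss(theta, bias, X, y, lam=0.1, penalty="l1")
with_l2 = regularized_log_loss(theta, bias, X, y, lam=0.1, penalty="l2")
print("no penalty:", unregularized)
print("L1 penalty:", with_l1)
print("L2 penalty:", with_l2)
```

With these coefficients the L2 term (2.0^2 + 1.0^2 = 5) is larger than the L1 term (2.0 + 1.0 = 3), showing how L2 penalizes large weights more heavily.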

Question 4: What are some common evaluation metrics for classification models, and why are they
 important?


Answer: In classification problems, it is essential to evaluate a model using several metrics to understand its
performance. Common evaluation metrics include:
- Accuracy: Measures the ratio of correct predictions to total predictions. Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision: Indicates how many of the predicted positive cases were actually positive. Precision = TP / (TP + FP)
- Recall (Sensitivity): Measures how well the model captures actual positive cases. Recall = TP / (TP + FN)
- F1-Score: Harmonic mean of precision and recall. F1 = 2 * (Precision * Recall) / (Precision + Recall)
- ROC-AUC Score: Evaluates how well the model's predicted probabilities separate the classes across all decision thresholds (0.5 is no better than chance, 1.0 is perfect separation).
Here TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
These metrics are important because relying on accuracy alone can be misleading, especially in imbalanced
datasets. Precision, recall, and F1-score provide a deeper understanding of the model’s performance and
help select the optimal decision threshold.
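A short worked example shows why accuracy alone can mislead. The confusion-matrix counts below are hypothetical: 1000 examples of which 50 are positive, compared against a baseline that always predicts the negative class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a model on an imbalanced dataset (50 positives out of 1000)
model = classification_metrics(tp=30, tn=930, fp=20, fn=20)

# Baseline that predicts "negative" for every example
baseline = classification_metrics(tp=0, tn=950, fp=0, fn=50)

print("model:   ", model)
print("baseline:", baseline)
```

The baseline reaches 95% accuracy while finding none of the positive cases, which only precision, recall, and F1 reveal.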

In [6]:
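# Question 5: Python Program – Load CSV, Train Logistic Regression Model, and Print Accuracy
# Note: the built-in Iris dataset stands in for a CSV file here. To use a real CSV,
# read it with pd.read_csv(...) and select the feature and target columns instead.

import pandas as pd  # only needed when loading from a CSV via pd.read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
# Load dataset
data = load_iris()
X = data.data
# Binary target: 0 = setosa, 1 = versicolor or virginica
y = (data.target != 0).astype(int)
# Split dataset (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)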
# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 1.0
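Since the question asks for loading a CSV, the following sketch shows one way the same workflow might look with `pd.read_csv`. To stay self-contained it first writes the Iris data to a temporary file. The file name `iris_example.csv` is hypothetical, and `max_iter=200` is an illustrative setting.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a stand-in CSV so the example is self-contained (hypothetical file name)
iris_frame = load_iris(as_frame=True).frame
csv_path = os.path.join(tempfile.mkdtemp(), "iris_example.csv")
iris_frame.to_csv(csv_path, index=False)

# Load the CSV and separate the features from the "target" column
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = (df["target"] != 0).astype(int)  # binary target: setosa vs. the other two species

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Model Accuracy:", accuracy)
```

For a real dataset, only the path and the name of the target column would need to change.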


In [9]:
#  Question 6: Train Logistic Regression Model using L2 Regularization (Ridge)

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
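X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
# L2 Regularization (penalty='l2' is also the default; C, the inverse
# regularization strength, is left at its default of 1.0)
model = LogisticRegression(penalty='l2', solver='liblinear')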
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Model Coefficients:", model.coef_)
print("Accuracy:", accuracy_score(y_test, y_pred))

Model Coefficients: [[ 2.17532856e+00  1.59657795e-01 -1.25372350e-01 -4.00203956e-03
  -1.30412639e-01 -4.11271449e-01 -6.55025779e-01 -3.50105949e-01
  -2.02221998e-01 -2.92893734e-02 -6.61181920e-02  1.40364311e+00
   1.17866280e-01 -1.09265346e-01 -1.46461555e-02 -2.48382696e-02
  -6.34867207e-02 -4.11476085e-02 -4.87826550e-02 -7.69418228e-04
   1.15519347e+00 -3.90327993e-01 -7.67924369e-02 -2.13242325e-02
  -2.42143589e-01 -1.13976004e+00 -1.57934527e+00 -6.17350727e-01
  -7.29143464e-01 -1.10785408e-01]]
Accuracy: 0.9649122807017544
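To see how the penalty type affects the coefficients, the sketch below fits the same data with L1 and L2 penalties and counts coefficients that are exactly zero. `C=0.1` is an illustrative choice and not part of the original cell.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

zero_counts = {}
for penalty in ("l1", "l2"):
    # C=0.1 is an illustrative value; a smaller C means a stronger penalty
    clf = LogisticRegression(penalty=penalty, C=0.1, solver="liblinear")
    clf.fit(X_train, y_train)
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(penalty, "-> zero coefficients:", zero_counts[penalty], "of", clf.coef_.size)
```

L1 typically drives several coefficients to exactly zero, acting as a form of feature selection, whereas L2 shrinks coefficients without eliminating them.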


In [11]:
#  Question 7: Train Logistic Regression for Multiclass Classification using ‘ovr’


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
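X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
# Train model (one-vs-rest: one binary classifier per class)
# Note: multi_class is deprecated in recent scikit-learn releases (1.5 and later);
# wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the
# forward-compatible way to request one-vs-rest.
model = LogisticRegression(multi_class='ovr', solver='liblinear')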
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.92      0.96        13
           2       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
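On newer scikit-learn versions, one-vs-rest can also be requested by wrapping the estimator explicitly. This is a sketch of that alternative, not the code that produced the report above. It trains one binary classifier per class, so its scores should be close to that report but are not guaranteed to match it exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# One binary logistic regression is fitted per class (class k vs. all others)
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(X_train, y_train)

ovr_accuracy = accuracy_score(y_test, ovr_model.predict(X_test))
print("Number of binary classifiers:", len(ovr_model.estimators_))
print("Accuracy:", ovr_accuracy)
```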





In [13]:
#  Question 8: Hyperparameter Tuning using GridSearchCV


from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
# Parameter grid (liblinear supports both the l1 and l2 penalties)
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}
# GridSearchCV with 5-fold cross-validation on the training set
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter setting
print("Best Validation Accuracy:", grid.best_score_)



Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Validation Accuracy: 0.9697784810126582
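The validation score above is an average over cross-validation folds of the training data. A natural follow-up, sketched here with the same grid and split, is to score the refitted best model on the held-out test set, which the search never saw.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid.fit(X_train, y_train)

# GridSearchCV refits the best setting on the full training set by default,
# so score() evaluates that refitted model on the untouched test data
test_accuracy = grid.score(X_test, y_test)
print("Best Parameters:", grid.best_params_)
print("Test Accuracy:", test_accuracy)
```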


In [15]:
#  Question 9: Compare Accuracy with and without Standardization

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
# Without scaling
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
acc1 = accuracy_score(y_test, model1.predict(X_test))
# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_scaled, y_train)
acc2 = accuracy_score(y_test, model2.predict(X_test_scaled))
print("Accuracy without scaling:", acc1)
print("Accuracy with scaling:", acc2)

Accuracy without scaling: 0.9707602339181286
Accuracy with scaling: 0.9824561403508771


Note: this cell also emitted a scikit-learn convergence warning ("STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT").
The warning advises increasing the number of iterations (max_iter) or scaling the data
(https://scikit-learn.org/stable/modules/preprocessing.html) and points to alternative solver options
(https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).
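One way to keep scaling and model fitting together is a pipeline, so the scaler is always fitted on training data only, including inside cross-validation. The sketch below uses 5-fold cross-validation on the full dataset; the fold count is an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted on the training portion of every fold,
# so no information from the validation fold leaks into the scaling
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean CV accuracy with scaling:", scores.mean())
```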


 Question 10: Real-World Business Case – Predicting Marketing Campaign Response


Answer: Consider an e-commerce company that wants to predict which customers will respond to a marketing
campaign when only 5% respond. Because the dataset is highly imbalanced, a careful approach is required:
- Clean and preprocess the data: handle missing values and encode categorical variables.
- Split the dataset into training and testing sets, using a stratified split so that both keep the 5% response rate.
- Handle class imbalance with SMOTE (Synthetic Minority Oversampling Technique), applied to the training set only, or with class_weight='balanced' in logistic regression.
- Standardize numerical features with StandardScaler, fitted on the training data, so that all features are on a comparable scale.
- Train Logistic Regression with L2 regularization to limit overfitting, and tune C and the penalty type with GridSearchCV.
- Evaluate with Precision, Recall, F1-score, and ROC-AUC rather than accuracy, since a model that predicts "no response" for every customer would already be 95% accurate.
- Choose a probability threshold that maximizes business benefit, for example by weighing the cost of contacting a non-responder against the value of a response.
Following this approach helps produce a robust and effective prediction model for marketing campaign responses.
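The workflow can be sketched end to end on synthetic data. Everything specific here is an assumption made for illustration: the dataset comes from `make_classification` with roughly 5% positives, imbalance is handled with `class_weight='balanced'` rather than SMOTE, and the threshold is picked by maximizing F1 on out-of-fold training predictions as a stand-in for a real business objective.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_recall_curve, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% of customers respond
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Pick the threshold from out-of-fold predictions on the training set,
# so the test set stays untouched until the final evaluation
oof_proba = cross_val_predict(model, X_train, y_train, cv=5,
                              method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)
# precision and recall have one more entry than thresholds; drop the last point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Final fit and evaluation on the held-out test set
model.fit(X_train, y_train)
test_proba = model.predict_proba(X_test)[:, 1]
y_pred = (test_proba >= best_threshold).astype(int)
test_auc = roc_auc_score(y_test, test_proba)

print("Chosen threshold:", best_threshold)
print("ROC-AUC:", test_auc)
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred))
```

In a real campaign the F1 criterion would likely be replaced by an expected-profit calculation based on contact cost and response value.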