Q1: What is Logistic Regression, and how does it differ from Linear
Regression?
Ans: Logistic Regression is a statistical model used primarily for binary classification: it predicts the probability that an observation belongs to one of two categories (e.g., Yes/No, 0/1, True/False, Spam/Not Spam).
Although its name contains the word "regression," it is fundamentally a classification algorithm. It applies the sigmoid function (also called the logistic function) to a linear combination of the input variables, which squashes the output of the linear equation into a probability between 0 and 1. A threshold (often 0.5) is then applied to this probability to classify the outcome.
Linear Regression, by contrast, predicts a continuous, unbounded numeric value and is typically fit by minimizing squared error, whereas Logistic Regression is fit by maximizing the likelihood (equivalently, minimizing log loss).
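
The contrast with Linear Regression can be sketched on a tiny example. The data below is made up purely for illustration: a linear model fit to 0/1 labels produces values outside the range 0 to 1 for extreme inputs, while the logistic model always returns a valid probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (made up for illustration): one feature, binary label.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Query points, two of them far outside the training range.
X_new = np.array([[-5.0], [4.5], [15.0]])
lin_out = lin.predict(X_new)               # unbounded real values
log_out = log.predict_proba(X_new)[:, 1]   # probabilities of class 1

print("Linear Regression output  :", lin_out)
print("Logistic Regression output:", log_out)
```

The linear predictions at -5 and 15 fall below 0 and above 1, so they cannot be read as probabilities.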

Q2: Explain the role of the Sigmoid function in Logistic Regression.

Ans: The Sigmoid function, also known as the logistic function, is the component that turns a linear model into a probabilistic classifier. The model first computes a linear combination of its inputs and weights, z, which can range from negative to positive infinity. Since this unbounded value cannot directly represent a probability, the Sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), is applied. Its characteristic S-shaped curve maps any real number to a value strictly between 0 and 1. This output is interpreted as the predicted probability that the observation belongs to the positive class (Class 1). Finally, the probability is converted into a binary classification (0 or 1) by applying a predetermined threshold (most commonly 0.5).
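
A minimal sketch of these steps in NumPy follows. The weights, bias, and inputs are hypothetical values chosen only to illustrate the computation; they are not taken from any fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and inputs (illustration only).
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[2.0, 1.0],
              [-1.0, 3.0],
              [0.5, 0.2]])

z = X @ w + b                    # linear combination: 1.2, -2.2, 0.4
p = sigmoid(z)                   # predicted P(class = 1)
labels = (p >= 0.5).astype(int)  # apply the 0.5 threshold

print(p)       # roughly 0.77, 0.10, 0.60
print(labels)  # [1 0 1]
```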

Q3: What is Regularization in Logistic Regression and why is it needed?

Ans: Regularization in Logistic Regression is a technique that adds a penalty term to the model's loss function to discourage excessively large coefficient (weight) values.

It is needed primarily to prevent overfitting, which occurs when the model learns the noise in the training data too well, leading to poor performance on new data. By forcing weights to be smaller, regularization reduces model complexity and increases the model's ability to generalize.

The two main types are:

L2 Regularization (Ridge): Shrinks coefficients toward zero but rarely to zero, helping to handle multicollinearity.

L1 Regularization (Lasso): Can shrink coefficients exactly to zero, effectively performing automatic feature selection.
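
The difference between the two penalties can be sketched on synthetic data. The dataset, the value of C, and the choice of the liblinear solver are assumptions made for this illustration; the point is only that, under strong regularization, L1 sets some coefficients exactly to zero while L2 merely shrinks them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which only a few carry signal.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=2, random_state=0)

# liblinear supports both penalties; a small C means strong regularization.
l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print(f"Coefficients exactly zero with L1: {n_zero_l1} of {l1.coef_.size}")
print(f"Coefficients exactly zero with L2: {n_zero_l2} of {l2.coef_.size}")
```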

Q4: What are some common evaluation metrics for classification models, and
why are they important?
Ans: Classification model evaluation metrics are vital because they offer different views on a model's performance, which is especially critical in imbalanced datasets where simple Accuracy can be misleading.
The key metrics, derived from the Confusion Matrix, provide insight into the specific types of errors being made:
Accuracy: Overall correct predictions. It's useful for balanced datasets but poor for judging performance when one class greatly outweighs the other.
Precision: Focuses on the correctness of positive predictions. It's important when False Positives are costly (e.g., wrongly flagging a safe transaction as fraud).
Recall (Sensitivity): Focuses on the model's ability to find all actual positive cases. It's important when False Negatives are costly (e.g., failing to detect a disease).
F1 Score: The harmonic mean of Precision and Recall, giving a single number that balances the two. It's useful for imbalanced datasets because it is high only when both Precision and Recall are high.
AUC-ROC: Measures the model's overall ability to distinguish the classes across all possible classification thresholds. It is less sensitive to class imbalance than Accuracy, although under severe imbalance a precision-recall curve is often more informative.
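
The metrics above can be checked by hand on a small example. The labels and scores below are made up for illustration; predictions are obtained by thresholding the scores at 0.5.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Made-up ground truth and predicted probabilities (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in y_score]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")   # TP=3 FP=1 FN=1 TN=5

acc = accuracy_score(y_true, y_pred)     # (3 + 5) / 10 = 0.8
prec = precision_score(y_true, y_pred)   # 3 / (3 + 1) = 0.75
rec = recall_score(y_true, y_pred)       # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)            # 0.75
auc = roc_auc_score(y_true, y_score)     # 23 / 24, about 0.958

print(acc, prec, rec, f1, round(auc, 3))
```

Note that AUC-ROC is computed from the scores, not from the thresholded predictions.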

In [12]:
#Q5: Write a Python program that loads a CSV file into a Pandas DataFrame,
#splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)

import pandas as pd

# Note: silencing all warnings also hides the convergence and failed-fit
# warnings that scikit-learn may raise in the cells below.
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_iris
data = load_iris()

df = pd.DataFrame( data.data , columns= data.feature_names)
df['target'] = data.target

# Keep only classes 0 and 1 so that the task is binary
df = df[df['target'] != 2]

x = df.iloc[: , :-1]
y = df.iloc[: , -1]

from sklearn.model_selection import train_test_split
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size=0.3 , random_state=1)

from sklearn.linear_model import LogisticRegression
model5 = LogisticRegression(max_iter=400)
model5.fit(x_train , y_train)
y_pred = model5.predict(x_test)

from sklearn.metrics import accuracy_score
print(f"The accuracy for the above model is {accuracy_score(y_test , y_pred)}")

The accuracy for the above model is 1.0
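
The cell above builds the DataFrame directly from load_iris rather than from a CSV file. A minimal sketch of the CSV step the question asks for is shown below; it assumes no CSV is available, so it first writes the Iris data to a temporary file (the file name iris.csv is hypothetical) and then reads it back with pd.read_csv.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)  # as_frame needs scikit-learn >= 0.23

# Write the data to a temporary CSV, then load it as any CSV would be loaded.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "iris.csv")  # hypothetical file name
    iris.frame.to_csv(csv_path, index=False)
    df = pd.read_csv(csv_path)

# Keep two classes for a binary problem, as in the cell above.
df = df[df["target"] != 2]
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = LogisticRegression(max_iter=400).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {acc}")
```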


In [22]:
#Q6: Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy

from sklearn.linear_model import LogisticRegression

# C is the inverse of the regularization strength: a smaller C means a stronger L2 penalty
ridge = LogisticRegression(penalty='l2' , C = 0.1)
ridge.fit(x_train , y_train)
y_ridge = ridge.predict(x_test)

print(ridge.coef_)
print(f"The accuracy for the above model is {accuracy_score(y_test , y_ridge)}")

[[ 0.30467322 -0.3003339   1.09759147  0.44211871]]
The accuracy for the above model is 1.0
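
To see how C controls the strength of the L2 penalty, the sketch below refits the model for several values of C and prints the norm of the coefficient vector. The particular C values are arbitrary choices for illustration; the norm should grow as C increases, that is, as the penalty weakens.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Same binary subset of Iris (classes 0 and 1) as used above.
X, y = load_iris(return_X_y=True)
mask = y != 2
X, y = X[mask], y[mask]

norms = {}
for C in (0.01, 0.1, 1.0, 10.0):
    # The default penalty of LogisticRegression is L2.
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<5} coefficient norm = {norms[C]:.3f}")
```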


In [26]:
#Q7: Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)

from sklearn.datasets import make_classification
x , y = make_classification(n_samples=1000 , n_features=10 , n_redundant=5 , n_informative=5 , n_classes=3 , random_state=1)

x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.3 , random_state = 23)

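# Note: the multi_class argument is deprecated in scikit-learn 1.5 and later;
# it is kept here because the question asks for it.
model8 = LogisticRegression(multi_class='ovr' , solver='lbfgs')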

model8.fit(x_train , y_train)

y_ovr = model8.predict(x_test)

from sklearn.metrics import classification_report
print(classification_report(y_test , y_ovr))

              precision    recall  f1-score   support

           0       0.62      0.79      0.70        91
           1       0.80      0.71      0.75       110
           2       0.79      0.69      0.74        99

    accuracy                           0.73       300
   macro avg       0.74      0.73      0.73       300
weighted avg       0.74      0.73      0.73       300
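
Because newer scikit-learn releases deprecate the multi_class argument, an equivalent one-vs-rest setup can be sketched with OneVsRestClassifier, which fits one binary Logistic Regression per class. This is an alternative formulation, not the cell that produced the report above, so its numbers are not guaranteed to match exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Same synthetic 3-class data and split settings as the cell above.
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=5,
                           n_informative=5, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23)

# One binary classifier per class; the class with the highest score wins.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)

report = classification_report(y_test, ovr.predict(X_test))
print(report)
```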



In [31]:
#Q8:  Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.

from sklearn.model_selection import GridSearchCV
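# Note: the default solver (lbfgs) supports only the L2 penalty, so the 'l1' and
# 'elasticnet' candidates fail to fit and are scored as NaN (the warning is hidden
# by the filter set earlier). In effect, only the L2 candidates are compared.
classifier = LogisticRegression()
params = {"penalty":('l1','l2','elasticnet') ,'C' :[1,2,10,20,30,40]}
clf = GridSearchCV(classifier , param_grid=params , cv = 5 , verbose=1)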
clf.fit(x_train , y_train)

print(clf.best_params_)
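# The accuracy printed below is measured on the held-out test set; the mean
# cross-validated (validation) accuracy is available as clf.best_score_.
y_clf = clf.best_estimator_.predict(x_test)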

print(f"The accuracy for the above model is {accuracy_score(y_test , y_clf)}")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
{'C': 1, 'penalty': 'l2'}
The accuracy for the above model is 0.73
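
A variant in which every candidate can actually be fit is sketched below. It assumes the saga solver, which supports L1, L2, and elastic-net penalties, and expresses the penalty type through l1_ratio (0 is pure L2, 1 is pure L1). The grid values are arbitrary choices for illustration, and features are standardized inside a pipeline because saga tends to converge faster on scaled data. The sketch prints the cross-validated accuracy via best_score_ as well as the test accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic 3-class data and split settings as the Q7 cell.
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=5,
                           n_informative=5, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=23)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               max_iter=5000)),
])

# l1_ratio = 0.0 is pure L2, 1.0 is pure L1, 0.5 is an even mix.
param_grid = {
    "clf__l1_ratio": [0.0, 0.5, 1.0],
    "clf__C": [0.01, 0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```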


In [39]:
# Q9:  Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.

from sklearn.datasets import make_classification
x , y = make_classification(n_samples=1000 , n_features=10 , n_redundant=5,n_informative=5,n_classes=2,random_state=5)

x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.3 , random_state = 25)

model9 = LogisticRegression(max_iter=350)
model9.fit(x_train , y_train)
y_simple = model9.predict(x_test)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_fit = scaler.fit_transform(x_train)
x_test_fit = scaler.transform(x_test)

model91 = LogisticRegression()
model91.fit(x_fit , y_train)
y_scaled = model91.predict(x_test_fit)

print(f"Accuracy before scaling is {accuracy_score(y_test , y_simple)}")
print(f"Accuracy after scaling is {accuracy_score(y_test , y_scaled)}")

Accuracy before scaling is 0.8766666666666667
Accuracy after scaling is 0.8766666666666667
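
The same comparison can be sketched with a pipeline, which guarantees that the scaler is fit on the training data only and then reused unchanged on the test data. Unlike the cell above, this sketch uses the same max_iter for both models so that scaling is the only difference between them; it is an illustration, not the code that produced the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data and split settings as the cell above.
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=5,
                           n_informative=5, n_classes=2, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=25)

plain = LogisticRegression(max_iter=350).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       LogisticRegression(max_iter=350)).fit(X_train, y_train)

acc_plain = plain.score(X_test, y_test)
acc_scaled = scaled.score(X_test, y_test)  # test data is scaled with training statistics

print(f"Accuracy without scaling: {acc_plain}")
print(f"Accuracy with scaling:    {acc_scaled}")
```

On data whose features already share a similar scale, the two accuracies tend to be close; scaling usually matters more when feature ranges differ by orders of magnitude.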


Q10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.

Ans: To build a robust Logistic Regression model for predicting campaign response when only 5% of customers respond, the strategy must prioritize the minority class. Data handling comes first: clean missing values, encode categorical features, and split the data with stratification so that every subset keeps the 5% response rate. We would then scale features with StandardScaler (fit on the training set only) and counter the imbalance directly in the model using the class_weight='balanced' parameter. The regularization strength C and the penalty type would be tuned with cross-validation, scored on a metric suited to imbalance, such as F1 or average precision, rather than Accuracy. Model evaluation must likewise set aside misleading Accuracy and focus on Recall (to maximize identified responders, minimizing missed sales) and Precision (to ensure marketing efforts are cost-effective, minimizing wasted budget). The final step is tuning the probability threshold based on the business's tolerance for False Positives versus False Negatives, rather than relying on the default 0.5.
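
The approach can be sketched end to end on synthetic data. Everything specific here is an assumption for illustration: the data is generated with make_classification as a stand-in for real campaign data, the grid over C is arbitrary, and TARGET_RECALL is a hypothetical business requirement. The threshold is chosen on a validation split, not on the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95], random_state=0)

# Stratified splits keep the response rate similar in every subset.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

# Scaling and class weighting, with C tuned on average precision.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000))
search = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]},
                      scoring="average_precision", cv=5)
search.fit(X_train, y_train)

# Choose the highest threshold that still reaches the target recall
# on the validation split.
TARGET_RECALL = 0.80  # hypothetical business requirement
val_proba = search.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
ok = np.where(recall[:-1] >= TARGET_RECALL)[0]
threshold = thresholds[ok[-1]]

val_pred = (val_proba >= threshold).astype(int)
test_pred = (search.predict_proba(X_test)[:, 1] >= threshold).astype(int)

print(f"Chosen threshold: {threshold:.3f}")
print(f"Test recall:    {recall_score(y_test, test_pred):.3f}")
print(f"Test precision: {precision_score(y_test, test_pred):.3f}")
```

Raising TARGET_RECALL lowers the threshold and catches more responders at the cost of precision; the right trade-off depends on the cost of a wasted contact versus a missed sale.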
