1. What is Logistic Regression, and how does it differ from Linear
Regression?
    -> Logistic Regression is a statistical method used for classification problems, where the output is categorical (e.g., yes/no, 0/1). It passes a linear combination of the input features through the logistic (sigmoid) function to obtain a probability between 0 and 1. Unlike Linear Regression, which predicts a continuous numerical value directly, Logistic Regression predicts the probability of a class and then assigns the class label based on a threshold (commonly 0.5).

2. Explain the role of the Sigmoid function in Logistic Regression.
    -> In Logistic Regression, the Sigmoid function transforms the linear combination of input features into a value between 0 and 1, representing the probability of belonging to a particular class. This mapping allows the model to handle classification tasks by setting a decision threshold (usually 0.5) to assign class labels from the predicted probabilities.
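
A minimal sketch of the mapping described above. The weights, bias, and sample values are made up purely for illustration and are not taken from any fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single two-feature sample.
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b       # linear score: 0.8*2 - 0.4*1 + 0.1 = 1.3
p = sigmoid(z)             # estimated probability of the positive class
label = int(p >= 0.5)      # apply the usual 0.5 decision threshold

print(f"z = {z:.2f}, p = {p:.3f}, label = {label}")
```

A score of 0 maps to exactly 0.5, so thresholding the probability at 0.5 is the same as checking the sign of the linear score.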

3. What is Regularization in Logistic Regression and why is it needed?
    -> Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages the model from assigning excessively large weights to features. It helps improve the model’s generalization on unseen data by controlling complexity, with common types being L1 (Lasso) and L2 (Ridge) regularization.
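
The difference between the two penalties can be seen in a small experiment. The sketch below is illustrative only: the choice of `C=0.1`, the `liblinear` solver, and fitting on the full breast cancer dataset are assumptions made to keep the example short, not a recommended workflow.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # penalties are sensitive to feature scale

# liblinear supports both penalties; C is the inverse of regularization strength.
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

# Count coefficients that were driven exactly to zero by each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

L1 tends to set some weights exactly to zero, which acts as a form of feature selection, while L2 shrinks all weights toward zero without usually eliminating any.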

4. What are some common evaluation metrics for classification models, and
why are they important?
    -> Common evaluation metrics for classification models include Accuracy, Precision, Recall, F1-Score, and the ROC-AUC score. Each gives a different view of model performance: accuracy measures overall correctness; precision is the share of predicted positives that are truly positive; recall is the share of actual positives the model finds; F1-score is the harmonic mean of precision and recall; and ROC-AUC evaluates the model’s ability to distinguish between classes across all thresholds. Comparing them helps choose the most suitable model for the task, especially when classes are imbalanced and accuracy alone is misleading.
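
To make these definitions concrete, here is a small worked example on ten hand-written labels. The labels and scores are invented for illustration; the predictions are obtained by thresholding the scores at 0.5.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Toy data: 4 actual positives and 6 actual negatives (made up for illustration).
y_true  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred  = [int(s >= 0.5) for s in y_score]  # threshold the scores at 0.5

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the hard labels

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("ROC-AUC:", auc)
```

Here one positive is missed and one negative is flagged, so accuracy is 8/10 while precision and recall are both 3/4. ROC-AUC is computed from the scores and equals the fraction of positive/negative pairs that are ranked correctly (23 of 24 here).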
                

In [1]:
#5. Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.956140350877193


In [2]:
#6. Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)

from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Model Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Accuracy: 0.956140350877193
Model Coefficients: [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]
Intercept: [28.64871395]


In [4]:
#7. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.

from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

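# Note: the multi_class argument is deprecated in recent scikit-learn releases (1.5+);
# wrapping the estimator in sklearn.multiclass.OneVsRestClassifier is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)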
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [5]:
#8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

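# saga supports both the l1 and l2 penalties. It converges fastest on features with
# similar scales; the features here are not standardized, which may hold the score down.
model = LogisticRegression(solver='saga', max_iter=5000)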

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best hyperparameter combination, then its mean 5-fold cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)

{'C': 0.01, 'penalty': 'l1'}
0.9142857142857144


In [7]:
#9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaling = LogisticRegression(max_iter=5000)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
acc_scaling = accuracy_score(y_test, y_pred_scaling)

# Test accuracy without scaling, then with standardized features
print(acc_no_scaling)
print(acc_scaling)

0.956140350877193
0.9736842105263158


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

    -> For this marketing campaign problem, I’d start by exploring and cleaning the data to fix any missing or incorrect values. I’d then standardize the features, fitting the scaler on the training data only, because the regularization penalty treats all coefficients alike and the solvers converge faster when inputs share a similar scale. Since only 5% of customers respond, I’d handle the imbalance either with class_weight='balanced' in Logistic Regression or with oversampling techniques such as SMOTE, applied only to the training folds so that synthetic samples never leak into validation data.

    Next, I’d use GridSearchCV to tune hyperparameters such as C and penalty, with stratified cross-validation so that each fold keeps the 5% responder ratio. Because a model that predicts "no response" for every customer would already be about 95% accurate, I’d evaluate with precision, recall, F1-score, and ROC-AUC rather than accuracy, and use the confusion matrix to understand misclassifications. Finally, I’d present the results in simple terms, showing the trade-off between identifying more responders (recall) and keeping campaign costs reasonable (precision).
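
The approach above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with make_classification (no real campaign data is used), only C is tuned with the default L2 penalty to keep the grid small, and class_weight='balanced' is used in place of SMOTE.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders (class 1).
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so the train and test sets keep the same responder ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Putting the scaler inside the pipeline means it is re-fit on each training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]},
                    cv=cv, scoring="average_precision")
grid.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
proba = grid.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)

print("Best parameters:", grid.best_params_)
print("Test ROC-AUC:", test_auc)
print(classification_report(y_test, grid.predict(X_test)))
```

Scoring the search with average precision rather than accuracy is a design choice made here because accuracy is dominated by the 95% majority class. To tune penalty as well, the solver would need to be one that supports L1, such as saga or liblinear.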