#Logistic Regression Assignment


#Theory Questions


Q1. What is Logistic Regression, and how does it differ from Linear Regression?
 - Logistic Regression is a classification algorithm that models the probability of a class by passing a linear combination of the features through the sigmoid function. Linear Regression, by contrast, predicts a continuous value. Logistic Regression outputs probabilities between 0 and 1, while Linear Regression can output any real number. Logistic Regression is fit by minimizing log-loss (cross-entropy), while Linear Regression typically minimizes mean squared error. A Logistic Regression coefficient gives the change in the log-odds of the positive class per unit increase in a feature, while a Linear Regression coefficient gives the change in the predicted output itself.

Q2. Role of Sigmoid function in Logistic Regression
 - The sigmoid function maps any real-valued linear predictor to the range (0, 1), so its output can be interpreted as a probability. It is smooth and differentiable, which makes gradient-based optimization possible. It is the inverse of the logit (log-odds) function, so the linear predictor equals the log-odds of the positive class. Applying a threshold, such as 0.5, to the probability assigns a class label.
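
As a minimal sketch of the answer above, the snippet below implements the sigmoid, its inverse (the logit), and a thresholding step using only the standard library. The function names `sigmoid`, `logit`, and `predict_label` are illustrative choices, not part of any library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1); this is the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn a linear predictor z into a class label by thresholding the probability."""
    return int(sigmoid(z) >= threshold)


# z plays the role of the linear predictor w.x + b
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  log-odds={logit(p):+.4f}  label={predict_label(z)}")
```

The log-odds column reproduces z, which illustrates that the linear predictor and the log-odds are the same quantity.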

Q3. What is Regularization in Logistic Regression and why is it needed?
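 - Regularization adds a penalty on the size of the model coefficients to the loss function in order to prevent overfitting. L2 regularization (Ridge) shrinks the weights toward zero, while L1 regularization (Lasso) can drive some weights exactly to zero, which produces a sparse model and acts as feature selection. Regularization is needed to improve generalization, handle multicollinearity, and stabilize the model. In scikit-learn the strength is controlled by C, the inverse of the regularization strength, so a smaller C means a stronger penalty.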
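
The sparsity difference between the two penalties can be checked with a short experiment. This is only a sketch: the breast cancer dataset, the `liblinear` solver, and `C=0.1` are arbitrary choices made for illustration, and the scaler is fit on the whole dataset because nothing is being evaluated here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Standardize so the penalty treats every coefficient on the same scale
X_scaled = StandardScaler().fit_transform(X)

# liblinear supports both penalties; C=0.1 is an assumed, fairly strong setting
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X_scaled, y)

# Count coefficients that were driven exactly to zero
l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))
print("Coefficients exactly zero with L1:", l1_zeros)
print("Coefficients exactly zero with L2:", l2_zeros)
```

The L1 model should report noticeably more zero coefficients than the L2 model, which only shrinks them.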

Q4. What are some common evaluation metrics for classification models, and why are they important?
 - Accuracy is the fraction of all predictions that are correct. Precision is the fraction of predicted positives that are actually positive. Recall is the fraction of actual positives that the model finds. The F1-score is the harmonic mean of precision and recall. ROC-AUC and PR-AUC summarize performance across all decision thresholds. These metrics are important because each captures a different aspect of performance, which matters most on imbalanced data, where accuracy alone can look high even if the model misses the minority class.
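
To make the point about imbalanced data concrete, here is a small sketch that computes these metrics from scratch. The helper `classification_metrics` and the two sets of predictions are made up for illustration; the labels contain 5 positives out of 100.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 5 positives followed by 95 negatives
y_true = [1] * 5 + [0] * 95
# A "model" that never predicts the positive class
always_negative = [0] * 100
# A "model" that finds 4 of the 5 positives but raises 6 false alarms
finds_most = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline = classification_metrics(y_true, always_negative)
model = classification_metrics(y_true, finds_most)
print("Always negative:", baseline)
print("Finds most positives:", model)
```

The always-negative predictor has the higher accuracy of the two but zero recall, which is why accuracy should not be read on its own here.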

#Practical Questions

In [10]:
#Question 5
"""Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package) """

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

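# Load the dataset and put it into a Pandas DataFrame, as the question asks
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Separate the features from the label
X = df.drop(columns="target")
y = df["target"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised so the solver converges on the unscaled features
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))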


Accuracy: 0.956140350877193


In [11]:
#Question 6
"""Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package) """

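from sklearn.linear_model import LogisticRegression

# Reuses the breast cancer train/test split from Question 5.
# penalty='l2' is also scikit-learn's default; it is set explicitly to match the question.
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

print("Coefficients:", model.coef_)
print("Accuracy:", model.score(X_test, y_test))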


Coefficients: [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]
Accuracy: 0.956140350877193


In [12]:
#Question 7
"""Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)"""


from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(multi_class='ovr', max_iter=5000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
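
Recent scikit-learn releases deprecate the `multi_class` argument (starting with version 1.5, as far as I am aware), so the call above may warn or fail depending on the installed version. A possible alternative, sketched below under that assumption, wraps the model in `OneVsRestClassifier` to get the same one-vs-rest scheme. The variable name `ovr_model` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One binary Logistic Regression is trained per class (class vs. rest)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred))
```

After fitting, `ovr_model.estimators_` holds the three binary classifiers, one for each Iris class.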





In [13]:
#Question 8
"""Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
"""
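from sklearn.model_selection import GridSearchCV

# Candidate values; liblinear is used because it supports both the L1 and L2 penalties
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

# 5-fold cross-validation on the training split (still the Iris data from Question 7)
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
# best_score_ is the mean cross-validated accuracy of the best parameter combination
print("Best validation accuracy:", grid.best_score_)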


Best parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best validation accuracy: 0.9583333333333334


In [14]:
#Question 9
"""Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
"""

from sklearn.preprocessing import StandardScaler

# X_train / X_test are still the Iris split from Question 7

# Baseline: train on the raw features
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print("Accuracy without scaling:", model.score(X_test, y_test))

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=5000)
model_scaled.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", model_scaled.score(X_test_scaled, y_test))


Accuracy without scaling: 1.0
Accuracy with scaling: 1.0


Q10. Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.


Answer:
To build a Logistic Regression model for this case, I would follow these steps:

 - Data handling: Clean missing values, encode categorical features, and remove duplicates and irrelevant columns. Split the data into train and test sets with stratification so that both keep the 5% response rate.

 - Feature scaling: Standardize numerical features so that all are on the same scale, which is important because the regularization penalty treats every coefficient alike. Fit the scaler on the training data only.

 - Balancing classes: Since only 5% respond, a model that predicts "no response" for everyone would already reach 95% accuracy, so accuracy alone is misleading. Use methods like oversampling (SMOTE), undersampling, or class weights (class_weight='balanced' in LogisticRegression). Apply any resampling to the training data only.

 - Hyperparameter tuning: Use GridSearchCV to tune regularization strength (C), penalty type (L1 or L2), and solver, scored with a metric suited to imbalanced data rather than accuracy.

 - Evaluation: Focus on metrics like precision, recall, F1-score, ROC-AUC, and PR-AUC instead of accuracy. Plot the Precision-Recall curve, since positives are rare. Adjust the probability threshold depending on business need (e.g., lower it to raise recall and reach most responders).

 - Conclusion: By scaling features, balancing the dataset, carefully tuning hyperparameters, and using the right evaluation metrics, Logistic Regression can provide a reliable model for predicting customer response in an imbalanced e-commerce dataset.
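
The steps above can be sketched end to end as follows. Everything specific here is an assumption made for illustration: the data is synthetic (`make_classification` with roughly 5% positives stands in for the campaign data), class weights are used instead of SMOTE, only C is tuned to keep the grid small, and the 0.80 target recall is an arbitrary business requirement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders (class 1)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so the train and test sets keep the same response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Putting the scaler in the pipeline means it is re-fit on each training fold only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])

# Tune C with stratified folds, scored by average precision (PR-AUC)
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=cv)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
print("Best parameters:", search.best_params_)
print("PR-AUC (average precision):", average_precision_score(y_test, proba))
print("ROC-AUC:", roc_auc_score(y_test, proba))

# Pick the highest threshold that still reaches the assumed target recall.
# For brevity this uses the test set; in practice a separate validation set is safer.
precision, recall, thresholds = precision_recall_curve(y_test, proba)
target_recall = 0.80
# precision and recall have one more entry than thresholds, so drop the last point
reaches_target = recall[:-1] >= target_recall
threshold = thresholds[reaches_target].max() if reaches_target.any() else 0.5

y_pred = (proba >= threshold).astype(int)
print("Chosen threshold:", threshold)
print(classification_report(y_test, y_pred, digits=3))
```

Keeping the scaler and classifier in one `Pipeline` is a design choice that prevents information from the validation folds leaking into preprocessing during the grid search.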
