Q1) What is Logistic Regression, and how does it differ from Linear Regression?

Ans1)
    Logistic Regression is a statistical method for predicting the probability of a categorical outcome, most commonly a binary one such as “yes or no” or “success or failure.” Instead of fitting a straight line directly to the target as linear regression does, logistic regression passes a linear combination of the independent variables through a sigmoid (S-shaped) curve. This keeps every prediction between 0 and 1, so it can be read as a probability. Linear Regression, in contrast, predicts continuous values and assumes a linear relationship between the inputs and the output itself; logistic regression assumes that relationship is linear in the log-odds of the outcome. For example, linear regression might predict a person’s exact salary from years of experience, while logistic regression might predict the probability that the person earns above a certain threshold. The main difference therefore lies in the type of outcome each method predicts: continuous values for linear regression and class probabilities for logistic regression.
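
    The salary example can be sketched with scikit-learn as follows. The data is synthetic and purely illustrative, and the 70,000 cut-off, the variable names, and the noise level are assumptions made for this sketch rather than anything taken from a real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, illustrative data: salary grows roughly linearly with experience.
rng = np.random.default_rng(0)
years = rng.uniform(0, 20, size=200).reshape(-1, 1)
salary = 30_000 + 4_000 * years.ravel() + rng.normal(0, 5_000, size=200)

# Linear regression predicts the salary itself (a continuous value).
lin = LinearRegression().fit(years, salary)

# Logistic regression predicts the probability of exceeding a threshold.
threshold = 70_000  # hypothetical cut-off chosen for illustration
above = (salary > threshold).astype(int)
log = LogisticRegression(max_iter=1000).fit(years, above)

new_years = np.array([[2.0], [10.0], [18.0]])
predicted_salary = lin.predict(new_years)
prob_above = log.predict_proba(new_years)[:, 1]

for yrs, sal, p in zip(new_years.ravel(), predicted_salary, prob_above):
    print(f"{yrs:>4.0f} years: salary ~ {sal:,.0f}, P(salary > {threshold:,}) = {p:.2f}")
```

    The linear model returns unbounded numbers in salary units, while the logistic model returns values between 0 and 1 for the same inputs.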

Q2) Explain the role of the Sigmoid function in Logistic Regression.

Ans2)
    The sigmoid function plays a central role in logistic regression because it converts the output of a linear equation into a probability between 0 and 1. Logistic regression first calculates a weighted sum of the input variables, as linear regression does, but this value can range from negative infinity to positive infinity. To turn it into a probability, the sigmoid function σ(z) = 1 / (1 + e^(-z)) is applied. Its S-shaped curve smoothly maps any real number into the range between 0 and 1, so predictions are bounded and interpretable. An output close to 1 indicates a high likelihood of the event occurring, while a value near 0 indicates a low likelihood. A class label is then obtained by comparing the probability with a threshold, commonly 0.5, so logistic regression can classify outcomes while keeping a probabilistic interpretation.
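
    A minimal sketch of this transformation, using only the standard library. The function names and the example weights are hypothetical, and the sign-based branch is just one common way to avoid overflow in `math.exp`.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number z to a value between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Weighted sum of the inputs, followed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# The weighted sum is unbounded, but the sigmoid output stays within [0, 1].
for z in (-6, -2, 0, 2, 6):
    print(f"z = {z:>2}  ->  sigmoid(z) = {sigmoid(z):.4f}")

# Hypothetical two-feature model: z = 0.1 + 0.5*2.0 - 0.25*4.0 = 0.1
print(predict_proba([0.5, -0.25], 0.1, [2.0, 4.0]))
```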

Q3) What is Regularization in Logistic Regression and why is it needed?

Ans3)
    Regularization in logistic regression is a technique for preventing the model from overfitting the training data by adding a penalty term to the cost function. Overfitting occurs when the model becomes too complex and starts capturing noise or random fluctuations in the data, which reduces its ability to generalize to new, unseen data. Regularization discourages the model from assigning very large weights to the features, keeping the coefficients smaller and the model simpler. The two common types are L1 (Lasso), which penalizes the sum of the absolute values of the coefficients, and L2 (Ridge), which penalizes the sum of their squares. L1 regularization can also perform feature selection by shrinking some coefficients exactly to zero, while L2 shrinks all coefficients toward zero without eliminating any. In short, regularization improves the stability and generalization ability of logistic regression models.
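
    The idea of adding a penalty term to the cost function can be sketched in plain Python. The helper names, the toy data, and the strength parameter `lam` are assumptions made for illustration; scikit-learn expresses the same idea through `C`, which is the inverse of the regularization strength, and its exact scaling of the terms differs from this sketch.

```python
import math

def log_loss(weights, bias, X, y):
    """Average binary cross-entropy of a logistic model on (X, y)."""
    total = 0.0
    for features, label in zip(X, y):
        z = bias + sum(w * x for w, x in zip(weights, features))
        s = z if label == 1 else -z
        # -log(sigmoid(s)) written as log(1 + exp(-s))
        total += math.log1p(math.exp(-s))
    return total / len(y)

def penalty(weights, lam, kind):
    """L1 sums absolute values, L2 sums squares; the bias is left unpenalized."""
    if kind == "l1":
        return lam * sum(abs(w) for w in weights)
    if kind == "l2":
        return lam * sum(w * w for w in weights)
    raise ValueError(f"unknown penalty: {kind}")

def regularized_cost(weights, bias, X, y, lam=0.1, kind="l2"):
    """Data-fit term plus the penalty term."""
    return log_loss(weights, bias, X, y) + penalty(weights, lam, kind)

# Toy data (hypothetical): two features, four samples.
X = [[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.2]]
y = [1, 1, 0, 0]
small_w, large_w = [1.0, 0.2], [10.0, 2.0]

for name, w in (("small weights", small_w), ("large weights", large_w)):
    print(
        f"{name}: loss = {log_loss(w, 0.0, X, y):.4f}, "
        f"L1 cost = {regularized_cost(w, 0.0, X, y, kind='l1'):.4f}, "
        f"L2 cost = {regularized_cost(w, 0.0, X, y, kind='l2'):.4f}"
    )
```

    The large weights fit this toy data more tightly, but the penalty makes their total cost higher, which is how regularization steers the optimizer toward smaller coefficients.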

Q4) What are some common evaluation metrics for classification models, and why are they important?

Ans4)

    Common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC-ROC). Accuracy measures the percentage of correctly predicted outcomes, but it can be misleading when the data is imbalanced. Precision is the fraction of predicted positives that are actually positive, which matters in cases like spam detection where false positives are costly. Recall is the fraction of actual positives that the model correctly identifies, which is critical in applications like disease detection where missing a positive case can be dangerous. The F1-score is the harmonic mean of precision and recall, making it useful when both need to be balanced. Lastly, the AUC-ROC evaluates how well the model separates the classes across all threshold values. These metrics are important because they give deeper insight into model performance than accuracy alone, helping to choose the right model for the problem at hand.
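
    The threshold-based metrics above can be computed directly from confusion-matrix counts. The sketch below uses made-up counts for a dataset with 5% positives to show how accuracy can look good while recall is zero; the function name and the numbers are illustrative assumptions, and AUC-ROC is left out because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Return 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 1,000 samples with 50 actual positives (5%); counts are made up for illustration.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
useful_model = classification_metrics(tp=40, fp=60, fn=10, tn=890)

for name, metrics in (("always negative", always_negative), ("useful model", useful_model)):
    summary = ", ".join(f"{key} = {value:.2f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

    The always-negative classifier has the higher accuracy of the two, yet it finds no positives at all, which is why precision, recall, and F1 are reported alongside accuracy.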


In [2]:
#Q5) Write a Python program that loads a CSV file into a Pandas DataFrame, splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset from sklearn
data = load_breast_cancer()

# Create a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # Increased iterations for convergence
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression model: {accuracy:.2f}")


Accuracy of Logistic Regression model: 0.96


In [3]:
#Q6) Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset from sklearn
data = load_breast_cancer()

# Create a DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Logistic Regression with L2 regularization (default = 'l2')
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=5000)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print model coefficients and accuracy
print("Model Coefficients:\n", model.coef_)
print("\nIntercept:", model.intercept_)
print(f"\nAccuracy of Logistic Regression with L2 regularization: {accuracy:.2f}")


Model Coefficients:
 [[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
  -0.53255786 -0.28369224 -0.22668189 -0.03649446 -0.09710208  1.3705667
  -0.18140942 -0.08719575 -0.02245523  0.04736092 -0.04294784 -0.03240188
  -0.03473732  0.01160522  0.11165329 -0.50887722 -0.01555395 -0.016857
  -0.30773117 -0.77270908 -1.42859535 -0.51092923 -0.74689363 -0.10094404]]

Intercept: [28.64871395]

Accuracy of Logistic Regression with L2 regularization: 0.96


In [4]:
#Q7) Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load Iris dataset from sklearn
data = load_iris()

# Create DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

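# Initialize Logistic Regression with one-vs-rest strategy
# (multi_class is deprecated in recent scikit-learn releases; this cell assumes a version that still accepts it)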
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
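
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression` (as far as I know, starting with version 1.5), so the cell above may warn or fail on newer installs. A possible alternative, sketched here on the same Iris split, wraps the estimator in `OneVsRestClassifier`. The scores are not guaranteed to match the report above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary Logistic Regression is fitted per class (one-vs-rest).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print("Binary classifiers fitted:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```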





In [5]:
#Q8) Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()

# Create DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']             # L1 = Lasso, L2 = Ridge
}

# Initialize Logistic Regression with solver that supports L1 and L2
log_reg = LogisticRegression(solver='liblinear', max_iter=5000)

# Apply GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Print best parameters and validation accuracy
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.2f}")
print(f"Test Set Accuracy: {grid.score(X_test, y_test):.2f}")


Best Parameters: {'C': 100, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.97
Test Set Accuracy: 0.98


In [6]:
#Q9) Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_scaling = LogisticRegression(max_iter=5000)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)

# Print comparison
print(f"Accuracy without Scaling: {accuracy_no_scaling:.2f}")
print(f"Accuracy with Scaling:    {accuracy_scaling:.2f}")


Accuracy without Scaling: 0.96
Accuracy with Scaling:    0.97
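
A single train/test split can make the difference between the two models look larger or smaller than it really is. As a rough sketch, the same comparison can be repeated with 5-fold cross-validation, placing the scaler inside a pipeline so that it is refitted on the training part of each fold. The choice of 5 folds and the variable names are assumptions made for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "without scaling": LogisticRegression(max_iter=5000),
    # The pipeline fits the scaler only on the training part of each fold.
    "with scaling": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = scores
    print(f"Accuracy {name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```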


Q10) Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Ans10)

    To build a Logistic Regression model that predicts which customers will respond to a marketing campaign when only 5% respond, I would take a structured approach. First, I would clean the data, handle missing values, and engineer meaningful features such as purchase history, browsing behavior, and demographics. I would split the data with stratification so that the training and test sets both keep the 5% response rate. Because the regularization penalty and the solver’s convergence both depend on feature scale, I would standardize or normalize the numerical variables, fitting the scaler on the training data only. The major challenge is the imbalance: predicting “no response” for everyone would give about 95% accuracy but no business value. To handle this, I would use class weighting (giving more importance to the minority class in Logistic Regression), oversampling methods such as SMOTE, or undersampling the majority class, applying any resampling to the training data only. For hyperparameter tuning, I would apply GridSearchCV to optimize parameters such as the inverse regularization strength (C) and the penalty type (L1 or L2).

    For evaluation, I would avoid relying on accuracy alone, since it is misleading in an imbalanced setting. Instead, I would use precision, recall, F1-score, and AUC-ROC. Recall is especially important in this business case because the company wants to capture as many responders as possible, but precision also matters since targeting uninterested customers increases campaign costs. Finally, I would present the model’s performance in terms of business impact, for example by estimating how many additional responders the campaign can capture compared with random targeting. This approach keeps the Logistic Regression model both technically sound and aligned with the company’s marketing goals.
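
    The approach described in this answer can be sketched end to end as below. Since no campaign data is available here, the sketch generates a synthetic dataset with roughly 5% positives via `make_classification`; the feature count, the parameter grid, the choice of class weighting over SMOTE, and the F1 scoring are all assumptions made for illustration, not a recommendation for a real campaign.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so both splits keep the same response rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is refitted on each CV training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced", max_iter=5000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],   # inverse of regularization strength
    "clf__penalty": ["l1", "l2"],
}

# Score on F1 rather than accuracy because of the class imbalance.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred, target_names=["no response", "response"]))
print(f"ROC AUC: {auc:.3f}")
```

    Class weighting is used here because it needs no extra dependency; SMOTE, mentioned above, comes from the separate imbalanced-learn package and would be applied to the training folds only.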
