**Logistic Regression - Assignment**

**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

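Logistic regression is a classification model: it estimates the probability that an observation belongs to a category (such as "yes" or "no"). Linear regression predicts a continuous value (such as a house price).

The main differences are in the output and in how the model is fit. Logistic regression passes a linear combination of the features through a sigmoid (S-shaped) function, so its output is a probability between 0 and 1, and it is typically fit by minimizing log loss (cross-entropy). Linear regression fits a straight line (or hyperplane), can output any real number, and is typically fit by minimizing squared error.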
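
As a small illustration of the output-range difference, the sketch below fits both models to a made-up one-feature dataset and compares their predictions well outside the training range. The data values and variable names are assumptions chosen for the example, not taken from any real source.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): one feature, binary label that flips at x = 3
X = np.array([[0], [1], [2], [3], [4], [5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

# Predict far outside the training range
x_new = np.array([[10.0]])
lin_pred = lin.predict(x_new)[0]            # unbounded: can exceed 1
log_prob = log.predict_proba(x_new)[0, 1]   # stays between 0 and 1

print(f"Linear regression output at x=10: {lin_pred:.2f}")
print(f"Logistic regression P(y=1) at x=10: {log_prob:.2f}")
```

The linear model extrapolates past 1, which cannot be read as a probability, while the logistic model's output stays bounded.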

**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

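The sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), is a crucial part of logistic regression. It maps the linear combination of inputs, z = w·x + b, to a probability between 0 and 1. A threshold (commonly 0.5) then converts that probability into a class label.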
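
A minimal sketch of this mapping, using hypothetical weights, bias, and input values picked only for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and input (illustration only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b      # linear combination: 0.8*2 - 0.4*1 + 0.1 = 1.3
p = sigmoid(z)            # probability of the positive class
label = int(p >= 0.5)     # common default decision threshold of 0.5

print(f"z = {z:.2f}, p = {p:.4f}, predicted class = {label}")
```

A score of 0 maps to exactly 0.5, large positive scores approach 1, and large negative scores approach 0.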

**Question 3: What is Regularization in Logistic Regression and why is it needed?**

Regularization in logistic regression is a technique that penalizes model complexity to prevent overfitting and improve the model's ability to generalize to new, unseen data.

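This is achieved by adding a penalty term to the model's loss function, which discourages the model from assigning excessively large weights (coefficients) to features. The two common choices are L2 (ridge), which penalizes the sum of squared weights and shrinks them toward zero, and L1 (lasso), which penalizes the sum of absolute weights and can set some of them exactly to zero.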
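
The effect of the penalty can be sketched by fitting the same model at several regularization strengths and comparing the size of the learned coefficients. This example assumes scikit-learn's default L2 penalty and uses the Iris dataset; the specific `C` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# In scikit-learn, C is the inverse of the regularization strength:
# a smaller C means a stronger penalty on the weights.
norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:>6}: coefficient norm = {norms[C]:.3f}")
```

The coefficient norm grows as `C` increases, which is the penalty loosening its hold on the weights.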

**Question 4: What are some common evaluation metrics for classification models, and why are they important?**

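Common classification metrics include accuracy (the share of all predictions that are correct), precision (the share of predicted positives that are truly positive), recall (the share of actual positives that the model finds), the F1-score (the harmonic mean of precision and recall), and AUC-ROC (how well the predicted scores rank positives above negatives across all thresholds).

These metrics are crucial for evaluating a model's performance, understanding its strengths and weaknesses (for example, accuracy alone can look high on an imbalanced dataset even when the minority class is missed), and selecting the best model for a specific task.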
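
To see why accuracy alone can mislead, the sketch below compares two sets of predictions on a small hand-made label vector. All labels and predictions here are hypothetical and exist only to make the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 18 negatives followed by 2 positives (10% positive)
y_true = [0] * 18 + [1] * 2

# Baseline that always predicts the majority (negative) class
y_baseline = [0] * 20
# Model that finds both positives at the cost of 2 false positives
y_model = [0] * 16 + [1] * 4

def summarize(y_pred):
    """Return the four metrics for one set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

baseline_metrics = summarize(y_baseline)
model_metrics = summarize(y_model)
print("Always-negative baseline:", baseline_metrics)
print("Model:", model_metrics)
```

Both prediction sets are 90% accurate, but the baseline has a recall of 0 while the model has a recall of 1.0 with a precision of 0.5. Only the additional metrics expose that difference.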

**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# 1. Load a dataset (using a built-in sklearn dataset for demonstration)
# In a real-world scenario, you would use pd.read_csv('your_file.csv')
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# 2. Split the data into training and testing sets
# test_size=0.3 means 30% of the data will be used for testing
# random_state ensures reproducibility of the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Train a Logistic Regression model
model = LogisticRegression(max_iter=200) # max_iter increased for convergence with some datasets
model.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = model.predict(X_test)

# 5. Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.4f}")

Accuracy of the Logistic Regression model: 1.0000
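
The program above builds the DataFrame directly from the sklearn dataset. If the CSV-loading step needs to be shown explicitly, one possible variant is to write the Iris data to a CSV file first and read it back with `pd.read_csv`. This is a sketch under assumptions: the file name `iris.csv` and the temporary directory are illustrative choices, not part of the original answer.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the Iris data to a temporary CSV so there is a file to load
# (the file name and location are illustrative assumptions)
iris_frame = load_iris(as_frame=True).frame
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
iris_frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the label
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy from CSV-loaded data: {accuracy:.4f}")
```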


**Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)**

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a dataset (e.g., Iris dataset)
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train a Logistic Regression model with L2 regularization
# The 'penalty' parameter is set to 'l2' for L2 regularization (Ridge)
# The 'C' parameter controls the inverse of regularization strength; smaller C means stronger regularization.
# The 'lbfgs' solver is chosen for its compatibility with the 'l2' penalty and multi-class classification.
# 'multi_class' is left at its default ('auto'); the argument is deprecated in recent scikit-learn releases.
model = LogisticRegression(penalty='l2', C=0.1, solver='lbfgs', max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Print the model coefficients
print("Model Coefficients:")
print(model.coef_)
print("\nModel Intercept:")
print(model.intercept_)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

Model Coefficients:
[[-0.26735189  0.29439384 -1.02390677 -0.41212167]
 [ 0.02959517 -0.3373663   0.07552206 -0.15957599]
 [ 0.23775672  0.04297246  0.94838471  0.57169765]]

Model Intercept:
[ 4.54765055  1.6115804  -6.15923095]

Model Accuracy: 0.9556




**Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)**

In [3]:
import warnings
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Suppress UserWarning and its subclasses (which include sklearn's ConvergenceWarning) for demonstration purposes
warnings.filterwarnings('ignore', category=UserWarning)

# 1. Load a multiclass dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

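# 3. Instantiate and train the Logistic Regression model with multi_class='ovr'
# The 'liblinear' solver is efficient for smaller datasets and handles 'ovr' well.
# Note: recent scikit-learn releases deprecate 'multi_class'; it is kept here because the question asks for it.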
model = LogisticRegression(multi_class='ovr', solver='liblinear')
model.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = model.predict(X_test)

# 5. Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.92      0.96        13
   virginica       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
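
Recent scikit-learn releases deprecate the `multi_class` argument, so the call above may emit a warning or stop working depending on the installed version. A possible alternative, sketched here as an assumption about how one might adapt the answer and not as the required solution, is to wrap a binary logistic regression in `OneVsRestClassifier`, which trains one classifier per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# One binary logistic regression per class, combined by the wrapper
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(f"Number of binary classifiers: {len(ovr_model.estimators_)}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The wrapper makes the one-vs-rest scheme explicit: with three Iris classes it fits three separate binary models and predicts the class whose model is most confident.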





**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)**

In [5]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (important for Logistic Regression with regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength
    'penalty': ['l1', 'l2']  # Regularization type (l1 or l2)
}

log_reg = LogisticRegression(solver='liblinear', max_iter=200)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters found
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

# Print the best validation accuracy
print("\nBest validation accuracy (mean cross-validation score):")
print(f"{grid_search.best_score_:.4f}")

# Evaluate the best estimator on the test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_scaled, y_test)
print(f"\nTest set accuracy with best parameters: {test_accuracy:.4f}")

Best parameters found by GridSearchCV:
{'C': 10, 'penalty': 'l1'}

Best validation accuracy (mean cross-validation score):
0.9583

Test set accuracy with best parameters: 1.0000


**Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)**


In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- Model without scaling ---
print("--- Model without scaling ---")
# Initialize and train Logistic Regression model
model_no_scaling = LogisticRegression(max_iter=200, random_state=42)
model_no_scaling.fit(X_train, y_train)

# Make predictions and evaluate accuracy
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")

# --- Model with scaling ---
print("\n--- Model with scaling ---")
# Initialize StandardScaler
scaler = StandardScaler()

# Fit scaler on training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train Logistic Regression model on scaled data
model_scaled = LogisticRegression(max_iter=200, random_state=42)
model_scaled.fit(X_train_scaled, y_train)

# Make predictions and evaluate accuracy on scaled data
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {accuracy_scaled:.4f}")

# Compare accuracies
print("\nComparison:")
print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling: {accuracy_scaled:.4f}")

--- Model without scaling ---
Accuracy without scaling: 1.0000

--- Model with scaling ---
Accuracy with scaling: 1.0000

Comparison:
Accuracy without scaling: 1.0000
Accuracy with scaling: 1.0000


**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

When building a logistic regression model for an imbalanced e-commerce marketing campaign, the approach would involve several key steps.

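First, during data handling and preprocessing, clean the data (handle missing values and encode categorical features) and split it with stratification so that the 5% response rate is preserved in both the training and test sets. Standardize numerical features, fitting the scaler on the training data only, so that high-magnitude features do not dominate the regularized model. Then address the class imbalance using a technique such as oversampling the minority class (responders) with SMOTE or setting class weights within the logistic regression algorithm itself (for example, class_weight='balanced' in scikit-learn). Apply any resampling to the training data only, never to the validation or test data. This keeps the model from becoming biased toward the majority class (non-responders), which would result in poor performance in predicting who will actually respond.

For hyperparameter tuning, use cross-validation, such as Stratified K-Fold, to ensure each fold maintains the same class distribution as the original dataset. Tune key hyperparameters like the regularization strength (C), the penalty type, and the class weight settings, and score candidates on a metric aligned with the business objective (such as F1 or recall) instead of accuracy.

Finally, evaluate the model not by accuracy, which is misleading with imbalanced data (a model that predicts "no response" for every customer would be 95% accurate), but with metrics such as precision, recall, F1-score, and AUC-ROC, which provide a more comprehensive view of the model's ability to correctly identify potential responders. Because the model outputs probabilities, the decision threshold can also be adjusted to reflect the relative cost of contacting a non-responder versus missing a responder.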
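
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data is synthetic (generated with `make_classification` at roughly a 5% positive rate as a stand-in for real campaign data), class weights are used in place of SMOTE to avoid an extra dependency, and the parameter grid and the `average_precision` scoring choice are examples, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% positives (responders)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so train and test keep the same responder rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Example grid: regularization strength and class weighting
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

# Evaluate on the held-out test set with imbalance-aware metrics
test_scores = search.predict_proba(X_test)[:, 1]
test_ap = average_precision_score(y_test, test_scores)

print("Best parameters:", search.best_params_)
print(f"Cross-validated average precision: {search.best_score_:.4f}")
print(f"Test average precision: {test_ap:.4f}")
print(classification_report(y_test, search.predict(X_test), zero_division=0))
```

Putting the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion, which avoids leaking validation data into preprocessing.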