*Assignment Code: DA-AG-011*

Logistic Regression | Assignment

*Question 1: What is Logistic Regression, and how does it differ from Linear Regression?*

Answer: - Logistic regression is a supervised machine learning algorithm for classification, most commonly binary classification. In a classification problem, the target variable (output) y takes only discrete values for a given set of features (inputs) X. The model passes a linear combination of the features through the sigmoid function to estimate the probability of the positive class, then thresholds that probability (usually at 0.5) to predict 1 or 0.
-	Linear regression is a supervised machine learning algorithm for regression. It models a continuous target as a linear function of the independent variables and is mostly used for quantifying relationships between variables and for forecasting. Its prediction is an unbounded real number, not a class label.
-	The two also differ in how they are fit: linear regression usually minimizes squared error, while logistic regression minimizes log-loss (equivalently, maximizes the likelihood of the observed labels).
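
A minimal sketch of the contrast, using a made-up toy dataset (hours studied vs. pass/fail, not part of the assignment data). It only illustrates that linear regression can return values outside [0, 1], while logistic regression returns probabilities and class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (invented for illustration): hours studied vs. pass (1) / fail (0)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(hours, passed)
log = LogisticRegression().fit(hours, passed)

new_hours = np.array([[0], [4.5], [12]])

# Linear regression: unbounded real-valued output
lin_pred = lin.predict(new_hours)

# Logistic regression: probability of class 1, plus a discrete label
log_proba = log.predict_proba(new_hours)[:, 1]
log_label = log.predict(new_hours)

print("Linear output:       ", np.round(lin_pred, 3))
print("Logistic probability:", np.round(log_proba, 3))
print("Logistic label:      ", log_label)
```

For the extreme inputs (0 and 12 hours) the linear model's output falls below 0 and above 1, which is why it cannot be read as a probability.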


*Question 2: Explain the role of the Sigmoid function in Logistic Regression.*

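Answer: The sigmoid function, σ(z) = 1 / (1 + e^(-z)), maps any real-valued input to a value between 0 and 1. In logistic regression it is applied to the linear score z = w·x + b, so the output can be read as the probability of the positive class. Comparing that probability with a threshold (commonly 0.5) turns it into a class label.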
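
As a small illustration, the sigmoid can be written in a few lines of NumPy. The helper name `sigmoid` and the sample scores are chosen here for the example only.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Example linear scores z = w.x + b (values picked for illustration)
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

p = sigmoid(z)                   # probabilities strictly between 0 and 1
labels = (p >= 0.5).astype(int)  # threshold at 0.5 to get class labels

print("Probabilities:", np.round(p, 4))
print("Labels:       ", labels)
```

A score of 0 maps to a probability of exactly 0.5, large negative scores approach 0, and large positive scores approach 1.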


*Question 3: What is Regularization in Logistic Regression and why is it needed?*

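Answer: - Regularization penalizes model complexity during training by adding a term to the loss that grows with the size of the coefficients. The two common forms are L1 (lasso), which can drive some coefficients to exactly zero, and L2 (ridge), which shrinks all coefficients toward zero. In scikit-learn the strength is set by C, the inverse of the regularization strength, so a smaller C means a stronger penalty.
- It is needed because an unpenalized model can overfit: with many or highly correlated features, or with classes that are perfectly separable, the coefficients can grow very large and the model ends up fitting noise in the training data.
- The primary objective of regularization is to strike a balance between fitting the training data accurately and ensuring that the model can make reliable predictions on new, unseen data.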
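
To make the shrinking effect concrete, the sketch below fits the default L2-penalized model for three illustrative values of C (picked here as examples) on the breast cancer dataset and reports the length of the coefficient vector. It is meant only as a demonstration, not a tuning recipe.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put features on a common scale

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # L2 is the default penalty; smaller C means stronger regularization
    model = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} ||w||_2 = {coef_norms[C]:.3f}")
```

The coefficient norm grows as C increases, that is, as the penalty is relaxed.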


*Question 4: What are some common evaluation metrics for classification models, and why are they important?*

Answer: - Common evaluation metrics for classification models include the confusion matrix, accuracy, precision, recall, F1-score, ROC-AUC, and log-loss.
-	Evaluation metrics quantify the performance of classification models. They give feedback for refining a model and help in selecting the most suitable model for a specific problem. Different problems weigh errors differently. In fraud detection, for example, missing a fraud (False Negative) is usually worse than flagging an innocent transaction (False Positive), so recall matters more than raw accuracy.
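
The metrics listed above can all be computed with `sklearn.metrics`. The sketch below uses ten hand-made labels and predicted probabilities (invented for illustration) so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score,
                             roc_auc_score)

# Hand-made example: 6 negatives, 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.7, 0.4, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
    "roc_auc": roc_auc_score(y_true, y_prob),      # uses scores, not labels
    "log_loss": log_loss(y_true, y_prob),          # penalizes confident errors
}

print(cm)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Here TN = 5, FP = 1, FN = 2, and TP = 2, so accuracy is 7/10 while recall is only 2/4. The gap shows why accuracy alone can hide weak performance on the positive class.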


In [None]:
'''Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.)'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

#Load dataset from sklearn
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

#Save to CSV (simulating reading from a CSV file)
df.to_csv("breast_cancer.csv", index=False)

#Load CSV into Pandas DataFrame
data = pd.read_csv("breast_cancer.csv")

#Split features and target
X = data.drop('target', axis=1)
y = data['target']

#Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

#Train Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

#Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")


Accuracy: 0.9561


In [None]:
'''Question 6: Write a Python program to train a Logistic Regression model using
L2 regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.)'''



from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

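# Train Logistic Regression with L2 (ridge) regularization.
# penalty='l2' and C=1.0 are the defaults; they are spelled out here for clarity.
model = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)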
model.fit(X_train, y_train)

# Predict and check accuracy
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Accuracy:", accuracy_score(y_test, y_pred))


Coefficients: [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]
Accuracy: 1.0


In [None]:
'''Question 7: Write a Python program to train a Logistic Regression model for
 multiclass classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.)'''



from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

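# Train Logistic Regression (One-vs-Rest), as the question requests.
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5; on newer
# versions, sklearn.multiclass.OneVsRestClassifier(LogisticRegression(max_iter=1000))
# is the documented way to get a one-vs-rest model.
model = LogisticRegression(multi_class='ovr', max_iter=1000)
model.fit(X_train, y_train)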

# Predictions
y_pred = model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred, target_names=iris.target_names))


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30



In [None]:
'''Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation accuracy.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.)'''


import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)

# Parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Regularization strength
    'penalty': ['l1', 'l2'],            # Regularization type
    'solver': ['liblinear']             # liblinear supports both l1 and l2
}

# Grid search with cross-validation
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and validation accuracy
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9670


In [2]:
'''Question 9: Write a Python program to standardize the features before training
Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package) (Include your Python code and output in the code box below.) '''


import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


import warnings
warnings.filterwarnings('ignore')

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without feature scaling
model_unscaled = LogisticRegression(max_iter=10000, random_state=42)  # same seed as the scaled model
model_unscaled.fit(X_train, y_train)
y_pred_unscaled = model_unscaled.predict(X_test)
accuracy_unscaled = accuracy_score(y_test, y_pred_unscaled)

# Apply StandardScaler to standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with feature scaling
model_scaled = LogisticRegression(max_iter=10000, random_state=42)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Display the results
print(f"Accuracy without scaling: {accuracy_unscaled:.4f}")
print(f"Accuracy with scaling   : {accuracy_scaled:.4f}")





Accuracy without scaling: 0.9561
Accuracy with scaling   : 0.9737
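
A single train/test split can make the gap between the two models look larger or smaller than it really is. As a hedged follow-up, the sketch below repeats the comparison with 5-fold cross-validation. Placing the scaler in a pipeline (via `make_pipeline`) means it is re-fit on each training fold only. The fold count and scoring choice are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Model on raw features
unscaled = LogisticRegression(max_iter=10000)

# Scaler + model in one estimator, so scaling is learned inside each fold
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print(f"CV accuracy without scaling: {scores_unscaled.mean():.4f} "
      f"(+/- {scores_unscaled.std():.4f})")
print(f"CV accuracy with scaling   : {scores_scaled.mean():.4f} "
      f"(+/- {scores_scaled.std():.4f})")
```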


*Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.*

Answer:
1. Understand the problem & business goals

•	Objective: Predict which customers will respond to the marketing campaign.
•	Business priority: Minimize wasted marketing spend by targeting the right customers.
•	Error cost:
   -	False Positive (FP): We spend money marketing to someone who won't respond.
   -	False Negative (FN): We miss someone who would have responded, which is potentially lost revenue.
•	Recall matters most when outreach is cheap and missed responders are costly (we want to reach most potential responders); precision matters more when each contact is expensive.
2. Prepare the data

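•	Split into train/test sets, stratifying on the target so both sets keep roughly the 5% response rate.
•	Handle missing values (impute or drop).
•	Convert categorical features to numbers (e.g., one-hot encoding).
•	Scale features with StandardScaler, fitting it on the training set only to avoid leaking test information. Regularized Logistic Regression converges faster and penalizes coefficients more evenly on scaled data.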
3. Handle imbalance

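•	Use class_weight='balanced' in Logistic Regression. It reweights the loss inversely to class frequency, so no manual resampling is needed.
•	This makes the model pay more attention to the rare positive class.
•	Resampling (oversampling responders or undersampling non-responders) is an alternative if class weights alone are not enough.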
4. Train & tune

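•	Try different values of C (e.g., 0.01, 0.1, 1, 10) and penalty (l1 or l2), using a solver that supports both, such as liblinear or saga.
•	Use stratified cross-validation to find the best parameters.
•	Score candidates on F1-score or Recall instead of Accuracy, since a model that predicts "no response" for everyone is already 95% accurate on this data.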
5. Evaluate

•	Look at Precision, Recall, F1-score, and the Confusion Matrix.
•	Check PR-AUC to measure performance on rare positives.
•	Pick a probability threshold that matches budget (e.g., target top 10% likely customers).
6. Deploy

•	Use predict_proba() to get response probabilities.
•	Rank customers by score and choose how many to target.
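
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a production pipeline. The campaign data is replaced by a synthetic dataset from `make_classification` with roughly 5% positives, missing-value handling and encoding are skipped because the synthetic features are already numeric and complete, and the grid values, F1 scoring, and 10% targeting budget are example choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption): ~5% responders
X, y = make_classification(n_samples=5000, n_features=20, n_informative=6,
                           weights=[0.95, 0.05], random_state=42)

# Stratified split keeps the response rate similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so it is fit on training folds only
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced",
                               solver="liblinear", max_iter=1000)),
])

# Tune C and penalty with stratified CV, scoring on F1 rather than accuracy
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1")
grid.fit(X_train, y_train)

# Evaluate on the held-out test set
proba = grid.predict_proba(X_test)[:, 1]
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test), digits=3))
print(f"PR-AUC (average precision): {average_precision_score(y_test, proba):.3f}")

# Budget-driven targeting: contact the 10% of customers with the highest scores
n_target = int(0.10 * len(proba))
top_idx = np.argsort(proba)[::-1][:n_target]
print(f"Response rate in targeted top 10%: {y_test[top_idx].mean():.3f}")
print(f"Overall response rate:             {y_test.mean():.3f}")
```

Comparing the response rate among the targeted customers with the overall rate gives a direct, business-facing view of how much the ranking helps.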
