1) What is Logistic Regression?

Logistic Regression is a supervised learning algorithm used for classification problems, particularly binary classification (e.g., yes/no, 0/1, spam/not spam).

It predicts the probability that a given input point belongs to a certain class using the logistic (sigmoid) function, which maps any real-valued number to a value between 0 and 1.

Sigmoid Function Formula:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$


 What is Linear Regression?
Linear Regression is also a supervised learning algorithm, but it is used for regression tasks, meaning it predicts a continuous numerical value based on input features.

It assumes a linear relationship between the input variables x and the target variable y:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

-- Key Differences Between Logistic and Linear Regression:


| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Type of Problem | Regression (predicts continuous value) | Classification (predicts class label) |
| Output | Any real number | Probability between 0 and 1 |
| Target Variable | Continuous | Categorical (mostly binary: 0 or 1) |
| Function Used | Linear function | Sigmoid/logistic function |
| Goal | Minimize Mean Squared Error (MSE) | Maximize Likelihood (or minimize log-loss) |
| Linearity | Predicts directly using a linear equation | Applies a linear equation, then maps to probability using sigmoid |
| Decision Boundary | Not applicable | Yes (usually at 0.5 for binary classification) |

 Example Use Cases:


Linear Regression: Predicting house prices, temperature, salary, etc.

Logistic Regression: Spam detection, disease prediction, loan default prediction.
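
To make the contrast concrete, here is a minimal sketch on a made-up one-feature dataset (the data and the variable names are assumptions for illustration, not part of the assignment). It fits both models to the same 0/1 labels and shows that the linear model's raw output can leave the [0, 1] range, while the logistic model's predicted probability stays inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (assumed for illustration): one feature, binary label
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

lin = LinearRegression().fit(X_toy, y_toy)
log = LogisticRegression().fit(X_toy, y_toy)

# Two points far outside the training range
X_new = np.array([[-10], [20]])
lin_pred = lin.predict(X_new)              # raw linear output, unbounded
log_prob = log.predict_proba(X_new)[:, 1]  # P(y = 1 | x), bounded by 0 and 1

print("Linear regression output:  ", lin_pred)
print("Logistic regression P(y=1):", log_prob)
```

The linear predictions fall below 0 and above 1 for these inputs, which is why a linear model's output cannot be read as a probability.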





2) What is the Sigmoid Function?


The sigmoid function, also known as the logistic function, is a mathematical function that squashes any real-valued number into the range (0, 1).

Mathematical Formula:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where z is the linear combination of input features:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$


  Why Use Sigmoid in Logistic Regression?


The core goal of logistic regression is to predict probabilities that a given input belongs to Class 1 (positive class).

However, the linear combination z can take any real value (from −∞ to +∞).
But a probability must always lie in the range [0, 1].

The sigmoid function solves this by:



Taking the raw output z of the linear model

Converting it into a probability between 0 and 1
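
The two steps above can be sketched in a few lines of NumPy. The helper name `sigmoid` and the sample values of z are chosen here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Raw linear-model outputs (assumed values), from strongly negative to strongly positive
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z)

for score, p in zip(z, probs):
    print(f"z = {score:+.1f} -> sigma(z) = {p:.4f}")
```

Larger values of z always give larger probabilities, and z = 0 maps to exactly 0.5.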

 Interpretation of Sigmoid Output


The output of the sigmoid function is interpreted as:


P(y = 1 | x): the probability that the output is 1 (positive class) given input x

Example:
If σ(z) = 0.82, then the model estimates an 82% chance that the instance belongs to class 1.

To classify, we usually use a threshold:

If σ(z) ≥ 0.5, predict class 1

If σ(z) < 0.5, predict class 0
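
As a small sketch of this rule, the hypothetical helper `classify` below turns probabilities into class labels. The 0.5 cut-off is the usual default, but it is only a convention; the second call uses a stricter threshold to show how the predictions change.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability reaches the threshold, else 0."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# Example sigmoid outputs (assumed values)
p = np.array([0.82, 0.50, 0.49, 0.10])

default_labels = classify(p)                 # default 0.5 cut-off
strict_labels = classify(p, threshold=0.8)   # stricter cut-off

print("Threshold 0.5:", default_labels)
print("Threshold 0.8:", strict_labels)
```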

 Visual Intuition

The sigmoid curve:


Has an S-shape

Smoothly transitions from 0 to 1

Centered at z = 0 (where probability = 0.5)


    
    
    
    
    
3) Definition of Regularization
Regularization is a technique used to prevent overfitting in machine learning models by penalizing large coefficients (weights) in the model.

In the context of Logistic Regression, regularization modifies the cost function by adding a penalty term that discourages overly complex models.

 Why is Regularization Needed?
  Without Regularization:

The model might memorize the training data (overfitting), especially when:

You have many features (high-dimensional data)

Features are highly correlated

Dataset is small or noisy

 With Regularization:
The model is simpler, generalizes better to unseen data

Coefficients are kept small and meaningful, reducing the risk of overfitting
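
The shrinking effect can be checked directly. The sketch below is an illustration under assumptions (the breast cancer dataset, standardized features, and three arbitrary values of C): it fits scikit-learn's `LogisticRegression`, whose `C` parameter is the inverse of the regularization strength, and prints the size of the coefficient vector for each setting.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

norms = []
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger penalty; the default penalty is L2
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_scaled, y)
    norms.append(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: L2 norm of coefficients = {norms[-1]:.3f}")
```

Stronger regularization (small C) keeps the coefficients close to zero, while weak regularization (large C) lets them grow.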




Question 4: What are some common evaluation metrics for classification models, and
why are they important?


   -Why Are Evaluation Metrics Important?
Evaluation metrics measure the performance of a classification model on unseen data.
They help you understand:

How well the model distinguishes between classes

Whether the model is accurate, balanced, or biased

Which model performs best in real-world conditions

Different metrics are appropriate depending on whether you're dealing with balanced or imbalanced datasets.

Common Evaluation Metrics for Classification

- Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)


TP: True Positives

TN: True Negatives

FP: False Positives

FN: False Negatives

 Good for: Balanced datasets

Bad for: Imbalanced datasets (e.g., 95% "No" and 5% "Yes")
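
A short sketch of why accuracy misleads on imbalanced data, using made-up labels with the 95% / 5% split mentioned above: a model that always predicts "No" scores high accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 "No" (0) and 5 "Yes" (1)
y_true = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts "No"
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("Accuracy:", acc)
print("Recall for the positive class:", rec)
```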


- Precision
Precision = TP / (TP + FP)

Measures how many predicted positives are actually correct

 Important when false positives are costly
(e.g., spam email filters, fraud detection)

- Recall (Sensitivity or True Positive Rate)
Recall = TP / (TP + FN)

Measures how many actual positives are correctly identified
 Important when false negatives are costly
(e.g., cancer detection, defect detection)

- F1 Score
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)


Harmonic mean of precision and recall
 Useful when classes are imbalanced
 Balances false positives and false negatives
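
The formulas for accuracy, precision, recall, and F1 can be worked through by hand. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
TP, FP, FN, TN = 30, 10, 20, 940

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```

With these counts accuracy looks excellent, yet recall shows that 20 of the 50 actual positives were missed, which is what the F1 score reflects.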

- ROC-AUC (Receiver Operating Characteristic – Area Under Curve)
Plots True Positive Rate (Recall) vs. False Positive Rate at various thresholds

AUC = Area Under Curve

0.5 = random guess

1.0 = perfect classifier

 Measures overall model discrimination ability
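
One way to read AUC is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses two small, made-up score vectors to show this with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Every positive is scored above every negative
perfect_scores = np.array([0.1, 0.2, 0.8, 0.9])
# One positive (0.35) is scored below one negative (0.4)
mixed_scores = np.array([0.1, 0.4, 0.35, 0.8])

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)

print("AUC, perfect ranking:", auc_perfect)
print("AUC, one pair out of four misordered:", auc_mixed)
```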

Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Convert to binary classification (class 0 vs not class 0)
y_binary = (y == 0).astype(int)  # Setosa vs not-setosa

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Step 4: Train Logistic Regression with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)  # C is inverse of regularization strength
model.fit(X_train, y_train)

# Step 5: Print model coefficients and intercept
print("Model coefficients:", model.coef_)
print("Model intercept:", model.intercept_)

# Step 6: Predict on test data and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on test set:", accuracy)


Model coefficients: [[ 0.36479402  1.35499766 -2.09628559 -0.92154751]]
Model intercept: [0.23630834]
Accuracy on test set: 1.0


Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)


In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load a dataset from sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Step 2: Save the dataset to a CSV file (simulating a real-world file load)
df.to_csv('breast_cancer.csv', index=False)

# Step 3: Load the CSV file into a Pandas DataFrame
df_loaded = pd.read_csv('breast_cancer.csv')

# Step 4: Split into features (X) and target (y)
X = df_loaded.drop('target', axis=1)
y = df_loaded['target']

# Step 5: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 6: Train Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Step 7: Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 8: Print accuracy
print("Accuracy of Logistic Regression model:", accuracy)


Accuracy of Logistic Regression model: 0.956140350877193


Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

In [3]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step 3: Train Logistic Regression with One-vs-Rest (OvR) strategy
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)
model.fit(X_train, y_train)

# Step 4: Predict on test set
y_pred = model.predict(X_test)

# Step 5: Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.92      0.96        13
   virginica       0.93      1.00      0.96        13

    accuracy                           0.98        45
   macro avg       0.98      0.97      0.97        45
weighted avg       0.98      0.98      0.98        45
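
Newer scikit-learn releases report the `multi_class` argument as deprecated (to my understanding since version 1.5, so check the documentation for your installed version). A possible alternative, shown here as a sketch rather than as the required answer, is to wrap a binary logistic regression in `OneVsRestClassifier`, which fits one classifier per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# One binary logistic regression is trained for each of the three classes
ovr_model = OneVsRestClassifier(
    LogisticRegression(solver='liblinear', max_iter=200)
)
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print("Number of binary classifiers:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```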





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.

(Use Dataset from sklearn package)

In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Step 2: Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: Define model and parameter grid
model = LogisticRegression(solver='liblinear', max_iter=1000)

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2']
}

# Step 4: GridSearchCV
grid = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Step 5: Best parameters and score
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Step 6: Test accuracy with best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_accuracy)


Best Parameters: {'C': 10, 'penalty': 'l2'}
Best Cross-Validation Accuracy: 0.9626373626373628
Test Set Accuracy: 0.956140350877193


Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.

(Use Dataset from sklearn package)

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Step 2: Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: Train Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=1000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Step 4: Standardize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Train Logistic Regression with scaled features
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Step 6: Print both accuracies
print("Accuracy without scaling:", acc_no_scaling)
print("Accuracy with scaling:   ", acc_scaled)


Accuracy without scaling: 0.956140350877193
Accuracy with scaling:    0.9736842105263158


Note: this run also printed scikit-learn's lbfgs convergence warning ("STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT"), which suggests increasing max_iter or scaling the data. It most likely comes from the unscaled fit in Step 3, which is one more reason to standardize features before training. The warning points to:
    https://scikit-learn.org/stable/modules/preprocessing.html
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Question 10: Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.


-  Step-by-Step Approach for Handling Imbalanced Marketing Data

1. Data Understanding and Preprocessing
Load and explore the dataset

Understand feature types (numeric, categorical, date/time)

Analyze class distribution (e.g., 5% response = strong imbalance)

Data cleaning

Handle missing values, outliers, incorrect data types

Feature engineering

Derive meaningful features: e.g., recency, frequency, average spend

Convert categorical variables using One-Hot Encoding or Target Encoding

2. Train-Test Split (Preserve Class Ratio)
Use stratified splitting to ensure both training and test sets maintain the rare class (5% responders):

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
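
A quick check of what `stratify` does, using synthetic labels with a 5% positive rate (the arrays `X_demo` and `y_demo` are placeholders invented for this sketch):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (assumed): 950 non-responders, 50 responders
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Both subsets should keep roughly the original 5% positive rate
print("Train positive rate:", y_tr.mean())
print("Test positive rate: ", y_te.mean())
```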


3. Feature Scaling
Standardize numeric features (important for Logistic Regression):

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


4. Handling Class Imbalance
You can balance the classes using any of the following:


 a. Class Weighting (recommended for Logistic Regression)

In [8]:
model = LogisticRegression(class_weight='balanced', solver='liblinear')
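
To see what `class_weight='balanced'` does, the sketch below computes the weights for synthetic labels with a 5% positive rate (the labels are an assumption for illustration). scikit-learn uses n_samples / (n_classes * count of each class), so the rare class receives a much larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumed): 950 non-responders, 50 responders
y_demo = np.array([0] * 950 + [1] * 50)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

# The same formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print("Balanced class weights:", dict(zip(classes.tolist(), weights)))
```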


b. Resampling (alternative)
Use SMOTE, RandomUnderSampler, or combine them:

In [9]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)


 5. Train Logistic Regression Model
Use proper solver and regularization:

In [10]:
model = LogisticRegression(
    class_weight='balanced',
    solver='liblinear',
    max_iter=1000
)
model.fit(X_train_scaled, y_train)


6. Hyperparameter Tuning with GridSearchCV
Tune C, penalty, class_weight using cross-validation

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'class_weight': [None, 'balanced']
}
grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, scoring='f1', cv=5)
grid.fit(X_train_scaled, y_train)
best_model = grid.best_estimator_
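
A variation worth considering is to put the scaler and the model in a `Pipeline`, so that scaling is re-fitted inside each cross-validation fold and the validation folds do not influence the scaler. The sketch below is one possible setup under assumptions: the data is a synthetic stand-in from `make_classification` with roughly 5% positives, and the step names `scaler` and `clf` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption): about 5% positives
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
    "clf__class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```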


7. Model Evaluation: Focus on Minority Class
Since the data is imbalanced, avoid relying on accuracy. Instead, use:


Precision - Prevent targeting wrong customers

Recall - Ensure responders aren't missed

F1 Score - Balance of both

ROC-AUC - Overall discrimination capability

PR-AUC - More informative than ROC-AUC when the positive class is rare

Confusion Matrix - Understand true/false positives and negatives
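
PR-AUC can be summarized with scikit-learn's `average_precision_score`, which is one common way (not the only one) to condense the precision-recall curve into a single number. The labels and scores below are randomly generated placeholders, so the printed values only illustrate the calls.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical model scores for 95 non-responders and 5 responders
rng = np.random.default_rng(0)
y_true = np.array([0] * 95 + [1] * 5)
scores = np.concatenate([
    rng.uniform(0.0, 0.7, 95),   # non-responders tend to score lower
    rng.uniform(0.3, 1.0, 5),    # responders tend to score higher
])

roc_auc = roc_auc_score(y_true, scores)
pr_auc = average_precision_score(y_true, scores)

print("ROC-AUC:", roc_auc)
print("PR-AUC (average precision):", pr_auc)
# For a random ranking, average precision is close to the positive rate
print("Positive rate (baseline):", y_true.mean())
```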

In [12]:
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
y_pred = best_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, best_model.predict_proba(X_test_scaled)[:, 1]))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.98      0.98      0.98        42
           1       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114

ROC AUC: 0.9957010582010581
Confusion Matrix:
 [[41  1]
 [ 1 71]]

Note: the output above was produced with the breast cancer data still held in X and y from the earlier cells (114 test samples, split 42 vs. 72), not with a 5%-responder marketing dataset. It illustrates the reporting code only; on real campaign data the minority-class scores would be expected to be noticeably lower.
