# **Theoretical Questions**

1. What is Logistic Regression, and how does it differ from Linear
Regression?
- Logistic Regression is a supervised learning algorithm used for classification problems, where the dependent variable (output) is categorical—typically binary (e.g., Yes/No, 0/1, True/False). It predicts the probability that an observation belongs to a particular class. The model uses the logistic (sigmoid) function to map predicted values to a range between 0 and 1, making it suitable for probability estimation. The general form of the logistic function is:

$$
P(Y = 1) = \frac{1}{1 + e^{-(b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n)}}
$$

Linear Regression, on the other hand, is used for regression problems, where the dependent variable is continuous (e.g., predicting sales, temperature, or prices). It models the relationship between the independent variables and the dependent variable by fitting a linear function (a straight line for a single feature, a hyperplane for several), represented as:

$$
Y = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n
$$


| Aspect                | Linear Regression                            | Logistic Regression                                     |
| --------------------- | -------------------------------------------- | ------------------------------------------------------- |
| **Purpose**           | Predicts a continuous outcome                | Predicts a categorical outcome (usually binary)         |
| **Output Range**      | Any real number                              | Between 0 and 1 (as probability)                        |
| **Function Used**     | Linear function                              | Sigmoid (logistic) function                             |
| **Error Measurement** | Measured using Mean Squared Error (MSE)      | Measured using Log Loss or Cross-Entropy                |
| **Linearity**         | Models linear relationship between variables | Models the log-odds (logit) of the probability linearly |
| **Use Case Example**  | Predicting house prices                      | Predicting if an email is spam or not                   |
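
To make the contrast concrete, the sketch below applies both equations to the same inputs. The coefficients `b0` and `b1` and the input values are arbitrary illustrative numbers, not fitted parameters.

```python
import math

# Illustrative (not fitted) coefficients for a single-feature model.
b0, b1 = -1.0, 0.5

def linear_prediction(x):
    """Linear regression output: can be any real number."""
    return b0 + b1 * x

def logistic_probability(x):
    """Logistic regression output: sigmoid of the same linear score."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

for x in [-10.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  linear={linear_prediction(x):6.2f}  "
          f"P(Y=1)={logistic_probability(x):.4f}")
```

The linear output grows without bound as `x` grows, while the logistic output stays strictly between 0 and 1 and equals 0.5 exactly where the linear score is zero (here at `x = 2`).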


2. Explain the role of the Sigmoid function in Logistic Regression.
- The Sigmoid function, also known as the logistic function, plays a crucial role in Logistic Regression by converting the output of a linear equation into a probability value between 0 and 1.

The Sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

where,

$$
z = b_0 + b_1X_1 + b_2X_2 + \dots + b_nX_n
$$

Key roles of the Sigmoid function:

- **Probability mapping:** The linear combination \( z \) can take any real value from −∞ to +∞. The Sigmoid function maps this range to (0, 1), so the output can be interpreted as the probability that the observation belongs to the positive class, \( P(Y = 1) \).
- **Decision boundary:** By setting a threshold (commonly 0.5), the Sigmoid output can be used to classify data: if \( \sigma(z) \ge 0.5 \), predict class 1; if \( \sigma(z) < 0.5 \), predict class 0. Because \( \sigma(0) = 0.5 \), a threshold of 0.5 is equivalent to checking whether \( z \ge 0 \).
- **Smooth and differentiable:** The Sigmoid function is smooth and differentiable, which is essential for optimization with gradient-based methods such as gradient descent during model training.
- **Non-linearity:** Although Logistic Regression is linear in its parameters (the log-odds are a linear function of the features, so the decision boundary is linear), the Sigmoid function makes the output a non-linear function of the inputs, which lets the model represent bounded probabilities for binary outcomes.
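
The properties described above can be checked with a small sketch. The two-branch form of `sigmoid` is one common way to avoid floating-point overflow for very negative `z`; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow in math.exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def sigmoid_derivative(z):
    """d(sigma)/dz = sigma(z) * (1 - sigma(z)), the term gradient descent relies on."""
    s = sigmoid(z)
    return s * (1.0 - s)

def predict_class(z, threshold=0.5):
    """Turn the probability into a class label."""
    return 1 if sigmoid(z) >= threshold else 0

for z in [-1000.0, -2.0, 0.0, 2.0, 1000.0]:
    print(f"z={z:8.1f}  sigma={sigmoid(z):.4f}  class={predict_class(z)}")
```

The derivative is largest at `z = 0` (where it equals 0.25) and shrinks toward zero in both tails, which is why very confident predictions receive small gradient updates.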

3. What is Regularization in Logistic Regression and why is it needed?
-  **Regularization** is a technique used in **Logistic Regression** to **prevent overfitting** by adding a **penalty term** to the cost (loss) function.  
Overfitting happens when the model learns the noise and details of the training data, reducing its performance on unseen data.  
Regularization discourages the model from assigning excessively large weights to the features, thereby improving **generalization**.



### **Mathematical Representation**

The regularized cost function in Logistic Regression is:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)) \Big] + \lambda R(\theta)
$$

where:  
- \( h_\theta(x_i) \) = predicted probability using the sigmoid function  
- \( m \) = number of training samples  
- \( \lambda \) = regularization parameter controlling penalty strength  
- \( R(\theta) \) = regularization term (depends on type)
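
As a rough illustration of the formula, the sketch below evaluates \( J(\theta) \) on a tiny hand-made dataset. The data, the parameter values, and the choice of a sum-of-squares penalty that skips the intercept `theta[0]` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_cost(theta, X, y, lam, R):
    """Average cross-entropy over m samples plus lam * R(theta)."""
    m = len(y)
    log_likelihood = 0.0
    for x_i, y_i in zip(X, y):
        z = theta[0] + sum(t * v for t, v in zip(theta[1:], x_i))
        h = sigmoid(z)  # predicted probability h_theta(x_i)
        log_likelihood += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -log_likelihood / m + lam * R(theta)

def sum_of_squares(theta):
    # One possible R(theta); the intercept theta[0] is left unpenalized (an assumption).
    return sum(t * t for t in theta[1:])

# Toy data: one feature, four samples.
X = [[0.5], [1.5], [-1.0], [-2.0]]
y = [1, 1, 0, 0]
theta = [0.0, 1.0]

unpenalized = regularized_cost(theta, X, y, lam=0.0, R=sum_of_squares)
penalized = regularized_cost(theta, X, y, lam=0.1, R=sum_of_squares)
print(f"lambda=0.0: {unpenalized:.4f}")
print(f"lambda=0.1: {penalized:.4f}")
```

The two values differ by exactly \( \lambda R(\theta) \), which shows that the penalty is added on top of the data-fit term rather than changing the predictions themselves.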



### **Types of Regularization**

#### **i. L1 Regularization (Lasso)**
Adds the sum of the absolute values of the coefficients to the cost function:

$$
R(\theta) = \sum_{j=1}^{n} |\theta_j|
$$

It can shrink some coefficients to **zero**, effectively performing **feature selection**.



#### **ii. L2 Regularization (Ridge)**
Adds the sum of the squared coefficients to the cost function:

$$
R(\theta) = \sum_{j=1}^{n} \theta_j^2
$$

It **reduces large coefficients** smoothly without completely eliminating them.
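
One way to see why L1 produces exact zeros while L2 only shrinks is to compare the shrinkage step each penalty induces on a single coefficient. The sketch below uses the standard proximal operators for the penalties \( t|w| \) and \( t w^2 \); it is an illustration of the idea, not a description of what any particular solver does internally, and the coefficient values are made up.

```python
def l1_shrink(w, t):
    """Soft-thresholding: move w toward zero by t, stopping at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_shrink(w, t):
    """Proximal step for the penalty t * w**2: rescale w toward zero."""
    return w / (1.0 + 2.0 * t)

coefs = [3.0, -0.4, 0.05]  # hypothetical coefficients
t = 0.5                    # hypothetical penalty strength

print([l1_shrink(w, t) for w in coefs])            # [2.5, 0.0, 0.0]
print([round(l2_shrink(w, t), 3) for w in coefs])  # [1.5, -0.2, 0.025]
```

Under L1 the two small coefficients are set exactly to zero, whereas under L2 every coefficient is scaled down by the same factor and none of them vanishes.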


### **Why Regularization is Needed**

- **Prevents overfitting:** Keeps the model from fitting the noise in data.  
- **Improves generalization:** Enhances performance on unseen data.  
- **Controls coefficient size:** Avoids excessively large parameter values.  
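- **Stabilizes optimization:** An L2 penalty keeps the coefficients finite even when the classes are perfectly separable (without a penalty they would grow without bound), which makes training better conditioned.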





4. What are some common evaluation metrics for classification models, and
why are they important?
-  **Evaluation metrics** are used to measure the **performance** of classification models.  
They help determine how well the model is making predictions and whether it is suitable for real-world deployment.  
Different metrics capture different aspects of model performance such as accuracy, precision, recall, and overall balance between them.



### **1. Accuracy**

Accuracy measures the proportion of correctly predicted observations to the total observations.

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where:  
- \( TP \): True Positives  
- \( TN \): True Negatives  
- \( FP \): False Positives  
- \( FN \): False Negatives  

Accuracy works well when the dataset is **balanced**, but it can be **misleading** for **imbalanced datasets**.
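
A short worked example shows how accuracy can mislead. The counts below are hypothetical: a test set with 950 negatives and 50 positives, scored by a "model" that always predicts the negative class.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Always predicting the negative class on 950 negatives and 50 positives:
tp, fn = 0, 50    # every positive is missed
tn, fp = 950, 0   # every negative is trivially correct

acc = accuracy(tp, tn, fp, fn)
recall = tp / (tp + fn)  # share of actual positives that were found
print(f"accuracy={acc:.2f}, recall={recall:.2f}")  # accuracy=0.95, recall=0.00
```

The model reaches 95% accuracy without identifying a single positive case, which is why the metrics that follow look at the positive class separately.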



### **2. Precision**

Precision measures how many of the predicted positive cases are actually positive.

$$
\text{Precision} = \frac{TP}{TP + FP}
$$

It is useful when the **cost of false positives** is high (e.g., spam detection).



### **3. Recall (Sensitivity or True Positive Rate)**

Recall measures how many of the actual positive cases were correctly identified.

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

It is important when **missing positive cases** is costly (e.g., medical diagnosis).



### **4. F1-Score**

F1-Score is the **harmonic mean** of Precision and Recall.  
It balances the trade-off between them.

$$
\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

It is especially useful for **imbalanced datasets**.
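
The three formulas above can be computed directly from confusion-matrix counts. The counts used here are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```

Here precision is 0.75 and recall is 0.60; the harmonic mean (about 0.667) sits below their arithmetic mean (0.675), so F1 is pulled toward the weaker of the two.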



### **5. ROC Curve and AUC (Area Under the Curve)**

- The **ROC curve** plots the **True Positive Rate (TPR)** against the **False Positive Rate (FPR)** at various threshold settings.  
- The **AUC** represents the **area under this curve** and indicates how well the model distinguishes between classes.

$$
\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}
$$

A higher AUC value (closer to 1) indicates a better-performing model, while an AUC of 0.5 corresponds to random guessing.
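
AUC also has a ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way on a small made-up example; it is quadratic in the number of samples and meant for illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75
```

Three of the four positive-negative pairs are ordered correctly (only the positive at 0.35 ranks below the negative at 0.4), giving an AUC of 0.75.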



### **Importance of Evaluation Metrics**

- They provide **quantitative measures** of model performance.  
- Help in **model comparison** and **hyperparameter tuning**.  
- Identify **strengths and weaknesses** (e.g., sensitivity vs. specificity).  
- Ensure the model is **reliable and fair**, especially in critical applications.




In [None]:
# 5. Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# Note: the cells below use scikit-learn's built-in breast cancer dataset instead of a CSV file.



In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

breast_cancer = load_breast_cancer()
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df['target'] = breast_cancer.target

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (455, 30)
Testing set shape: (114, 30)


In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.956140350877193


In [None]:
print(accuracy)

0.956140350877193
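
Since the question asks for a CSV file, here is a minimal sketch of the same workflow that goes through `pd.read_csv`. To keep it self-contained it first writes the built-in dataset to a temporary CSV; the file name, the variable names, and the larger `max_iter` are assumptions for this example, and in practice `csv_path` would point to your own file.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a CSV to load (stand-in for a real file on disk).
frame = load_breast_cancer(as_frame=True).frame  # 30 features plus a 'target' column
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the label.
df_csv = pd.read_csv(csv_path)
X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

# Split, train, and evaluate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=42
)
clf = LogisticRegression(max_iter=10000)  # generous limit for unscaled features
clf.fit(X_tr, y_tr)

csv_accuracy = accuracy_score(y_te, clf.predict(X_te))
print("Accuracy:", csv_accuracy)
```

With a real dataset, only the `csv_path` and the name of the label column would need to change.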


In [8]:
# 6. Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.

# Step 1: Import Libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 2: Load Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 3: Split into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 4: Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Train Logistic Regression with L2 Regularization (Ridge)
# multi_class='ovr' for One-vs-Rest strategy
model = LogisticRegression(penalty='l2', solver='liblinear', multi_class='ovr', random_state=42)
model.fit(X_train_scaled, y_train)

# Step 6: Print Model Coefficients
print("Model Coefficients:\n", model.coef_)
print("Model Intercept:\n", model.intercept_)

# Step 7: Evaluate Model Accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on Test Set:", accuracy)



Model Coefficients:
 [[-0.82629339  1.29687882 -1.60852314 -1.43624942]
 [ 0.06394239 -1.24520478  0.84925141 -0.90288382]
 [ 0.10388483 -0.01359343  1.58911662  2.54884662]]
Model Intercept:
 [-1.55171933 -0.89578062 -2.38035044]
Accuracy on Test Set: 0.8333333333333334
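
In scikit-learn the strength of the L2 penalty is controlled by `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty. The sketch below is a rough way to see that effect on the coefficient norm; it scales the full Iris data without a train/test split purely for brevity, and the chosen `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=5000)  # L2 penalty is the default
    clf.fit(X_scaled, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6}  ||coef||_2 = {coef_norms[C]:.3f}")
```

The coefficient norm should grow as `C` increases, since a weaker penalty lets the model use larger weights.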




In [6]:
# 7. Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.

# Step 1: Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Step 2: Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Optional: Convert to DataFrame for clarity
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=y)

# Step 4: Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 5: Train Logistic Regression with one-vs-rest (OvR)
model = LogisticRegression(multi_class='ovr', solver='liblinear', random_state=42)
model.fit(X_train_scaled, y_train)

# Step 6: Make predictions
y_pred = model.predict(X_test_scaled)

# Step 7: Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.86      0.60      0.71        10
   virginica       0.69      0.90      0.78        10

    accuracy                           0.83        30
   macro avg       0.85      0.83      0.83        30
weighted avg       0.85      0.83      0.83        30
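
To connect the report to the metric definitions, the sketch below recomputes its figures from a confusion matrix. The matrix is not printed output: it is inferred from the precision, recall, and support values in the report (it is the only 3x3 matrix with 10 samples per class that reproduces them).

```python
# Rows = true class, columns = predicted class (inferred from the report).
labels = ["setosa", "versicolor", "virginica"]
cm = [
    [10, 0, 0],
    [0, 6, 4],
    [0, 1, 9],
]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k, treating it as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rows = [per_class_metrics(cm, k) for k in range(len(labels))]
for name, (p, r, f) in zip(labels, rows):
    print(f"{name:>10}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")

macro_f1 = sum(f for _, _, f in rows) / len(rows)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(f"macro f1={macro_f1:.2f}  accuracy={accuracy:.2f}")
```

The macro average is the unweighted mean of the per-class scores; because every class has the same support (10), the weighted average coincides with it in this report.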





In [2]:
# 8. Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Logistic Regression model
model = LogisticRegression(max_iter=1000, solver='liblinear')

# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,             # 5-fold cross-validation
    scoring='accuracy'
)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters and corresponding score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy: {:.2f}".format(grid_search.best_score_))

# Evaluate on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))


Best Parameters: {'C': 0.1, 'penalty': 'l2'}
Best Cross-Validation Accuracy: 0.92
Test Accuracy: 1.00
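
It can help to see how much work a grid search performs. The sketch below expands the same grid with `itertools.product` to count the candidates; this only illustrates the exhaustive enumeration and is not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", n_fits, "cross-validation fits")
# 10 candidates, 50 cross-validation fits
```

Each candidate is trained once per fold, and with the default `refit=True` the best candidate is then trained once more on the whole training set, which is the model exposed as `best_estimator_`.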


In [9]:
# 9. Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


# Model 1: train on the raw (unscaled) features
model_no_scaling = LogisticRegression(max_iter=1000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)


# Model 2: standardize the features (scaler fitted on the training set only), then train
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)


print("Accuracy WITHOUT Scaling: {:.4f}".format(acc_no_scaling))
print("Accuracy WITH Scaling:    {:.4f}".format(acc_scaled))


Accuracy WITHOUT Scaling: 0.9708
Accuracy WITH Scaling:    0.9825


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
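
For reference, standardization itself is a small computation. The sketch below applies it to one hypothetical feature column, using the population standard deviation (the convention `StandardScaler` follows) and reusing the training statistics for the test values.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(v - mean) / std for v in column]

train = [10.0, 20.0, 30.0, 40.0]  # hypothetical training values of one feature
test = [25.0, 50.0]               # hypothetical test values of the same feature

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # uses the training mean and std

print("train:", [round(v, 3) for v in train_scaled])
print("test: ", [round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and standard deviation 1, while the test column generally does not, because it is transformed with statistics it did not contribute to. That is the intended behavior: it mirrors calling `fit_transform` on the training set and `transform` on the test set.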


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

-   The goal is to predict whether a customer will respond to a marketing campaign using Logistic Regression. The dataset is highly imbalanced, with only 5% of customers responding, which makes it challenging for standard predictive models.

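### **1. Data Handling and Preprocessing**

- **Data Cleaning:** Handle missing values, remove duplicates, and check for outliers.
- **Feature Engineering:** Create variables such as customer tenure, purchase frequency, average spending, and recency of last purchase.
- **Encoding Categorical Variables:** Convert nominal features such as gender or region with one-hot encoding; reserve label (ordinal) encoding for features with a natural order, because a linear model treats the encoded integers as magnitudes.

### **2. Feature Scaling**

Regularized Logistic Regression is sensitive to feature magnitudes, because the penalty acts on the coefficients and features on larger scales need smaller coefficients. Numerical features should be standardized (subtract the mean and divide by the standard deviation) so that the penalty treats all features comparably and training converges reliably. The scaler should be fitted on the training data only and then applied to the test data.

### **3. Handling Class Imbalance**

- **Resampling Techniques:** Oversample the minority class using SMOTE or undersample the majority class to balance the dataset. Resample only the training data so that the test set keeps the real 5% response rate.
- **Class Weights:** Assign a higher weight to the minority class in the Logistic Regression model using `class_weight='balanced'` so that errors on responders count more in the loss.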
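
For intuition about what `class_weight='balanced'` does, the sketch below reproduces the heuristic documented by scikit-learn, `n_samples / (n_classes * count_per_class)`, on a hypothetical label vector with a 5% response rate.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Hypothetical labels: 50 responders (1) and 950 non-responders (0).
y = [1] * 50 + [0] * 950
weights = balanced_class_weights(y)
print(weights)
```

Responders receive a weight of 10.0 and non-responders roughly 0.53, so both classes contribute the same total weight (500 each) to the loss.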

### **4. Model Building**

- Split the dataset into training (80%) and testing (20%) sets, using a stratified split so that both sets keep the 5% response rate. Fit any scaling and resampling steps on the training set only.
- Train a Logistic Regression model with regularization (L1 or L2).
- Use Grid Search or Randomized Search with stratified cross-validation to tune hyperparameters such as the regularization strength (`C`) and the penalty type, scoring with a metric suited to imbalance (for example F1 or ROC-AUC) rather than accuracy.

### **5. Model Evaluation**

For imbalanced datasets, accuracy is not sufficient: a model that predicts "no response" for every customer would already reach 95% accuracy. Use:

- **Precision:** Measures how many predicted responders actually responded.
- **Recall:** Measures how many actual responders were correctly identified.
- **F1-Score:** Balances precision and recall.
- **ROC-AUC:** Evaluates the model's ability to distinguish responders from non-responders.
- **Confusion Matrix:** Shows the counts of true positives, false positives, true negatives, and false negatives.

### **6. Business Application**

The model predicts the probability of each customer responding. Customers with high predicted probabilities can be targeted for campaigns, which focuses marketing spend on the most promising customers and improves campaign ROI.
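
One simple way to turn predicted probabilities into a targeting decision is a break-even rule: contact a customer only if the expected value of a response exceeds the cost of contact. All figures and customer IDs below are hypothetical, and the rule assumes the probabilities are reasonably calibrated (class weighting or resampling shifts them, so a calibration step may be needed first).

```python
contact_cost = 1.0     # hypothetical cost of contacting one customer
response_value = 20.0  # hypothetical profit from one response

# Contacting pays off when p * response_value > contact_cost.
break_even = contact_cost / response_value  # 0.05

# Hypothetical predicted response probabilities.
predicted = {"cust_a": 0.02, "cust_b": 0.30, "cust_c": 0.08, "cust_d": 0.04}

# Keep customers above the break-even probability, most promising first.
targets = sorted(
    (c for c, p in predicted.items() if p > break_even),
    key=lambda c: predicted[c],
    reverse=True,
)
expected_profit = sum(predicted[c] * response_value - contact_cost for c in targets)

print("Target list:", targets)
print("Expected profit:", round(expected_profit, 2))
```

With these numbers only `cust_b` and `cust_c` clear the 5% break-even point, for an expected profit of 5.6; the threshold follows from the campaign economics rather than from the default 0.5 cut-off.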

### **7. Summary**

- Clean and preprocess the data.
- Scale numerical features.
- Handle class imbalance through resampling or class weights.
- Train and tune a Logistic Regression model.
- Evaluate performance using precision, recall, F1-score, and ROC-AUC.
- Apply predictions to guide targeted marketing campaigns.
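
The steps above could be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_classification` (a stand-in for real campaign data, with roughly 5% positives); the sample size, the `C` grid, and the choice of ROC-AUC as the tuning metric are assumptions, and class weights are used instead of resampling to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratified split keeps the response rate similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Evaluate on the untouched test set.
proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)
print("Best params:", search.best_params_)
print("Test ROC-AUC:", round(test_auc, 3))
print(classification_report(y_test, search.predict(X_test), digits=3))
```

Putting the scaler inside the pipeline means it is refitted on the training part of every cross-validation fold, which avoids leaking validation data into preprocessing.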
