**Q1. What is Logistic Regression, and how does it differ from Linear Regression?**

Ans. **Logistic Regression** is a classification algorithm used to predict the probability of a categorical outcome, usually binary (e.g., 0 or 1). It applies the **sigmoid function** to map predictions to values between 0 and 1.

**Difference from Linear Regression:**

* Linear Regression predicts **continuous** values.
* Logistic Regression predicts **probabilities** for **categorical** outcomes.

**Key Differences:**

1. **Purpose:**

   * Linear: Regression (predict numeric output)
   * Logistic: Classification (predict class)

2. **Output:**

   * Linear: Any real number
   * Logistic: Between 0 and 1 (probability)

3. **Function Used:**

   * Linear: Linear equation
   * Logistic: Sigmoid (logistic) function

4. **Loss Function:**

   * Linear: Mean Squared Error
   * Logistic: Cross-Entropy (Log Loss)

5. **Decision Boundary:**

   * Linear: Not used
   * Logistic: Uses threshold (e.g., 0.5) to classify
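
As a rough illustration of the output difference, the sketch below fits both models to the same toy data. The ten-point dataset and the variable names are made up for this example. For inputs far outside the training range, the linear model's predictions leave the [0, 1] interval, while the logistic model's probabilities stay inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, label flips from 0 to 1 at x = 5.
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Inputs well outside the training range.
X_new = np.array([[-5], [15]])

linear_pred = linear.predict(X_new)                  # unbounded real numbers
logistic_prob = logistic.predict_proba(X_new)[:, 1]  # P(y = 1), always in [0, 1]

print("Linear predictions:    ", linear_pred)
print("Logistic probabilities:", logistic_prob)
```

The linear fit extrapolates below 0 at x = -5 and above 1 at x = 15, which is why its raw output cannot be read as a probability.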

**Q2. Explain the role of the Sigmoid function in Logistic Regression.**

Ans. In Logistic Regression, the **Sigmoid function** is used to convert the output of a linear equation into a **probability value** between **0 and 1**. This probability indicates how likely an input belongs to a particular class, making it suitable for **binary classification**.

The sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where $z$ is the linear combination of input features and model weights.

**Key Points:**

1. **Probability Output:** Maps any real-valued input to a range between 0 and 1.
2. **Classification Decision:** Helps classify data by applying a threshold (e.g., >0.5 = class 1).
3. **Smooth Curve:** Provides a smooth gradient, which is useful for optimization using gradient descent.
4. **Core Component:** It turns the linear score into a probability, which is what lets an otherwise linear model act as a classifier. The decision boundary itself remains linear.
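
The points above can be checked with a few lines of NumPy. This is a minimal sketch: the scores in `z` are hypothetical values standing in for the linear combination of features and weights, and the 0.5 threshold follows the example given in the key points.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = w . x + b for five inputs.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)

# Apply the 0.5 threshold: probabilities above it are assigned to class 1.
labels = (probs > 0.5).astype(int)

print(np.round(probs, 3))
print(labels)
```

The probabilities come out at roughly 0.018, 0.269, 0.5, 0.731, and 0.982, so only the last two inputs are assigned to class 1.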

**Q3. What is Regularization in Logistic Regression and why is it needed?**

Ans. **Regularization** is a technique used in Logistic Regression (and other models) to prevent **overfitting** by adding a **penalty** to large model coefficients. It discourages the model from becoming too complex and helps improve its performance on unseen data.

**Why Regularization is Needed:**

1. **Prevents Overfitting:** Controls the model’s complexity.
2. **Improves Generalization:** Helps the model perform better on new data.
3. **Stabilizes Coefficients:** Shrinks coefficient magnitudes, which reduces the influence of less important or highly correlated features.
4. **Promotes Simplicity:** Encourages simpler, more interpretable models.

**Common Types in Logistic Regression:**

* **L1 Regularization (Lasso):** Can reduce some coefficients to zero (feature selection).
* **L2 Regularization (Ridge):** Shrinks coefficients smoothly but keeps them all.
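
To see the difference between the two penalties, the sketch below fits both on the breast cancer dataset from scikit-learn and counts coefficients that are exactly zero. The value `C=0.05` is an arbitrary choice made for this illustration (a fairly strong penalty), not a recommended setting.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize so the penalty treats all features on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Assumed setting: C=0.05; liblinear supports both L1 and L2 penalties.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X_scaled, y)
l2_model = LogisticRegression(penalty="l2", C=0.05, solver="liblinear").fit(X_scaled, y)

# Count coefficients driven exactly to zero by each penalty.
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

In scikit-learn, `C` is the inverse of the regularization strength, so smaller values of `C` mean a stronger penalty.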

**Q4. What are some common evaluation metrics for classification models, and why are they important?**

Ans. Evaluation metrics help us measure how well a **classification model** is performing. They provide insights beyond just accuracy, especially when dealing with **imbalanced datasets** or when different types of errors have different costs.


**Common Evaluation Metrics:**

1. **Accuracy:**

   * Proportion of correct predictions.
   * Useful when classes are balanced.

2. **Precision:**

   * Ratio of true positives to predicted positives.
   * Important when **false positives** are costly.

3. **Recall (Sensitivity):**

   * Ratio of true positives to actual positives.
   * Important when **false negatives** are critical.

4. **F1-Score:**

   * Harmonic mean of precision and recall.
   * Good balance when both false positives and negatives matter.

5. **ROC-AUC (Receiver Operating Characteristic - Area Under Curve):**

   * Measures the model’s ability to rank positives above negatives across all classification thresholds (0.5 is chance level, 1.0 is perfect).
   * Useful for comparing classifiers.

**Why Important:**

* Help choose the best model for a specific problem.
* Reveal different types of errors.
* Guide model tuning and improvement.
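
The metrics listed above can all be computed with `sklearn.metrics`. The sketch below uses eight made-up labels and scores (not taken from any real model) so that each value can be verified by hand: there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth and predicted probabilities for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

# Hard predictions from a 0.5 threshold.
y_pred = [int(s > 0.5) for s in y_score]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is computed from the scores, not the thresholded labels.
    "roc_auc": roc_auc_score(y_true, y_score),
}

for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Accuracy, precision, recall, and F1 all equal 0.75 here, while ROC-AUC is 0.9375 because 15 of the 16 positive/negative pairs are ranked correctly.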

'''Q5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# STEP 1: Load dataset and save to CSV (only needed once)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.to_csv('breast_cancer_data.csv', index=False)  # Save to CSV

# STEP 2: Load CSV into DataFrame
df = pd.read_csv('breast_cancer_data.csv')

# STEP 3: Split into features and target
X = df.drop('target', axis=1)
y = df['target']

# STEP 4: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# STEP 5: Train Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# STEP 6: Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.956140350877193


'''Q6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset and save to CSV (optional for assignment)
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.to_csv('breast_cancer_data.csv', index=False)

# Read CSV
df = pd.read_csv('breast_cancer_data.csv')

# Split data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with L2 Regularization
model = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Model Coefficients:", model.coef_)
print("Accuracy:", accuracy)

Model Coefficients: [[ 2.13248406e+00  1.52771940e-01 -1.45091255e-01 -8.28669349e-04
  -1.42636015e-01 -4.15568847e-01 -6.51940282e-01 -3.44456106e-01
  -2.07613380e-01 -2.97739324e-02 -5.00338038e-02  1.44298427e+00
  -3.03857384e-01 -7.25692126e-02 -1.61591524e-02 -1.90655332e-03
  -4.48855442e-02 -3.77188737e-02 -4.17516190e-02  5.61347410e-03
   1.23214996e+00 -4.04581097e-01 -3.62091502e-02 -2.70867580e-02
  -2.62630530e-01 -1.20898539e+00 -1.61796947e+00 -6.15250835e-01
  -7.42763610e-01 -1.16960181e-01]]
Accuracy: 0.956140350877193


'''Q7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)'''

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset and save to CSV (optional)
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.to_csv('iris_data.csv', index=False)

# Load CSV file
df = pd.read_csv('iris_data.csv')

# Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

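# Train Logistic Regression with one-vs-rest multiclass strategy
# Note: multi_class is deprecated as of scikit-learn 1.5; newer code can wrap the
# model as OneVsRestClassifier(LogisticRegression()) to get the same strategy.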
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





'''Q8. Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for
Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features and target
X = df.drop('target', axis=1)
y = df['target']

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with scaling and logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=10000))
])

# Define hyperparameter grid
param_grid = {
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [0.01, 0.1, 1, 10, 100],
    'clf__solver': ['liblinear'],  # compatible with l1 and l2
}

# Setup GridSearchCV
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

# Best parameters and validation accuracy
best_params = grid.best_params_
y_val_pred = grid.predict(X_val)
val_accuracy = accuracy_score(y_val, y_val_pred)

print("Best Parameters:", best_params)
print("Validation Accuracy:", val_accuracy)

Best Parameters: {'clf__C': 0.1, 'clf__penalty': 'l2', 'clf__solver': 'liblinear'}
Validation Accuracy: 0.9912280701754386


'''Q9. Write a Python program to standardize the features before training Logistic Regression and
compare the model's accuracy with and without scaling. (Use Dataset from sklearn package)'''

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Prepare features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --------- Without Scaling ---------
model_no_scaling = LogisticRegression(max_iter=10000, solver='liblinear')
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# --------- With Scaling ---------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=10000, solver='liblinear')
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_with_scaling = model_with_scaling.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

# Print comparison
print(f"Accuracy without scaling: {acc_no_scaling:.4f}")
print(f"Accuracy with scaling:    {acc_with_scaling:.4f}")

Accuracy without scaling: 0.9561
Accuracy with scaling:    0.9737


**Q10. Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

Ans. When predicting customer response with a **highly imbalanced dataset** (only 5% responders), the key challenges are handling the imbalance, building a robust model, and ensuring meaningful evaluation.

**1. Data Handling**

* **Explore and clean data:** Check for missing values, outliers, and inconsistent entries.
* **Feature engineering:** Create meaningful features from raw data (e.g., customer demographics, purchase history).
* **Feature selection:** Use correlation analysis or feature importance methods to keep relevant features.

**2. Feature Scaling**

* Scale numerical features using **StandardScaler** (mean=0, std=1) or **MinMaxScaler**.
* Scaling helps the solver converge faster and ensures the regularization penalty is applied evenly across features, since the penalty depends on coefficient magnitude.

**3. Handling Class Imbalance**

Since only 5% respond, standard training would bias towards predicting non-responders.

* **Resampling techniques:**

  * **Oversampling** minority class (e.g., using **SMOTE**).
  * **Undersampling** majority class.
  * Or a combination of both (balanced sampling).
* **Use class weights:**

  * Set `class_weight='balanced'` in Logistic Regression to penalize mistakes on minority class more heavily.
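* Prefer **class weights** or **SMOTE** over plain undersampling when discarding majority-class rows would lose useful information. Apply any resampling to the training folds only, so validation data keeps the true 5% response rate.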
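
A minimal sketch of the class-weight option follows. No real campaign data is available here, so `make_classification` generates a synthetic stand-in with roughly 5% positives; the helper `responder_recall` and all parameter values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (class 1).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify so train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def responder_recall(class_weight):
    """Fit a scaled logistic regression and return recall on the responder class."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight=class_weight, max_iter=1000),
    )
    model.fit(X_train, y_train)
    return recall_score(y_test, model.predict(X_test))

recall_plain = responder_recall(None)
recall_balanced = responder_recall("balanced")

print(f"Recall without class weights: {recall_plain:.3f}")
print(f"Recall with balanced weights: {recall_balanced:.3f}")
```

Balanced weights typically raise recall on the minority class at the cost of some precision, so the trade-off still has to be judged against the campaign's costs.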

**4. Model Building & Hyperparameter Tuning**

* Use **Logistic Regression** with **L2 regularization** to avoid overfitting.
* Tune hyperparameters like:

  * **C (inverse regularization strength)**
  * **Penalty type (L1/L2)**
  * **Solver type**
* Employ **GridSearchCV** or **RandomizedSearchCV** with **stratified k-fold cross-validation** to ensure minority class is represented in each fold.
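* Use a **Pipeline** to combine scaling and model training, so the scaler is refit on the training portion of each fold and no information leaks from the validation portion.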
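
The tuning setup described above might look like the following sketch. The data is synthetic (about 5% positives), and the grid values, fold count, and the choice of `average_precision` as the scoring metric are assumptions for illustration rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

# Assumed grid: liblinear supports both L1 and L2 penalties.
param_grid = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [0.01, 0.1, 1, 10],
}

# Stratified folds keep the responder rate similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# average_precision summarizes the precision-recall curve, which suits rare positives.
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV average precision: {search.best_score_:.3f}")
```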

**5. Evaluation Metrics**

* Avoid relying on **accuracy**, which can be misleading due to imbalance.
* Focus on:

  * **Precision**: Of predicted responders, how many are truly responders?
  * **Recall (Sensitivity)**: Of all actual responders, how many did we catch?
  * **F1-score**: Harmonic mean of precision and recall for balance.
  * **ROC-AUC**: Measures ability to discriminate classes regardless of threshold.
  * **Precision-Recall Curve**: More informative when classes are imbalanced.
* Consider **business impact** of false positives (wasting marketing budget) vs false negatives (missing potential customers).
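
Because the default 0.5 cutoff is rarely the best choice for a 5% positive rate, the threshold itself can be chosen from the precision-recall curve. The sketch below uses synthetic data and picks the threshold that maximizes F1 on a validation split. Using F1 as the selection criterion is an assumption; a real campaign would weight precision and recall by their business costs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for campaign data: about 5% responders.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]  # predicted P(responder)

precision, recall, thresholds = precision_recall_curve(y_val, scores)

# The final curve point (precision=1, recall=0) has no threshold, so drop it.
# The small constant avoids division by zero when precision + recall is 0.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = int(np.argmax(f1))
best_threshold = thresholds[best]

ap = average_precision_score(y_val, scores)
print(f"Average precision: {ap:.3f}")
print(f"Best threshold by F1: {best_threshold:.3f} "
      f"(precision={precision[best]:.3f}, recall={recall[best]:.3f})")
```

The threshold should be selected on validation data and then confirmed on a separate test set, since choosing it on the same data used for the final report gives an optimistic estimate.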

**6. Final Model & Deployment**

* Choose the model and decision threshold that balance recall and precision according to business priorities.
* Monitor model performance regularly on new data.
* Update the model periodically with fresh data to adapt to customer behavior changes.