**1. What is Logistic Regression, and how does it differ from Linear
Regression?**

* 1. Logistic Regression
  * Logistic Regression is a statistical/machine learning algorithm used for classification tasks (predicting categorical outcomes).
  *  Instead of predicting continuous values, it predicts the probability that a given input belongs to a certain class (usually binary: 0 or 1).
  * Mathematics: a linear combination of the inputs is computed first: $z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$
  * It is then passed through the sigmoid (logistic) function: $p = \dfrac{1}{1 + e^{-z}}$
  * The output $p$ lies between 0 and 1 and represents a probability.
  * A threshold (e.g., 0.5) is applied to classify into categories.
* 2. Linear Regression
   * Linear Regression is used for regression tasks (predicting continuous outcomes).
  *  Directly predicts a numeric value, not probabilities.
  * Mathematics: $y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$
  * Here $y$ can take any real value (−∞ to +∞).

* **Key Differences**

| Aspect | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Type of problem | Regression (predict continuous values) | Classification (predict categories, usually binary) |
| Output range | −∞ to +∞ (real numbers) | 0 to 1 (probabilities via sigmoid) |
| Algorithm goal | Minimize squared error (MSE) | Maximize likelihood (via log-loss / cross-entropy) |
| Prediction | Continuous value (e.g., house price) | Probability/class label (e.g., spam vs. not spam) |
| Function used | Linear function | Logistic (sigmoid) function |
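
To make the contrast concrete, here is a minimal sketch that feeds the same linear combination through both models. The weights and inputs are hypothetical values chosen only for illustration; they do not come from any fitted model.

```python
import numpy as np


def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical intercept, weights and inputs (illustration only).
w0 = -1.0
w = np.array([0.8, -0.5])
X = np.array([
    [3.0, 1.0],
    [0.5, 2.0],
    [2.0, 1.0],
])

z = w0 + X @ w                            # shared linear combination
linear_prediction = z                     # linear regression reports z as-is
probability = sigmoid(z)                  # logistic regression squashes z
label = (probability >= 0.5).astype(int)  # threshold at 0.5

for zi, pi, li in zip(linear_prediction, probability, label):
    print(f"z = {zi:+.2f} -> p = {pi:.3f} -> class {li}")
```

The first and third rows have positive `z` and land in class 1; the second has negative `z` and lands in class 0.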

**2. Explain the role of the Sigmoid function in Logistic Regression?**

* 1. Transforms Linear Output into Probability
  * Logistic Regression first computes a linear combination of inputs:
  * $z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$
  * This $z$ can take any real value (from −∞ to +∞).
  * But probabilities must always lie in the range [0, 1].
  * The sigmoid function fixes this: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
  * It "squashes" any real number into a value between 0 and 1.
* 2. Interpreting Output as Probability
  * After applying sigmoid, the output represents: $P(y = 1 \mid X) = \sigma(z)$
  * Example: If sigmoid output is 0.85 → there is an 85% probability that the input belongs to class 1.
* 3. Decision Boundary
  * Logistic Regression uses a threshold (commonly 0.5) on the sigmoid output to assign classes:
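     * If $\sigma(z) \geq 0.5$ ⇒ predict class 1.
     * If $\sigma(z) < 0.5$ ⇒ predict class 0.
     * Because $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, the decision boundary is the hyperplane $z = 0$, which is linear in the inputs.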
* 4. Smooth Gradient for Optimization
  * The sigmoid function is differentiable, which is essential for optimization using Gradient Descent.
  * Its derivative: $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
  * This property makes weight updates efficient during training.
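
The derivative identity can be checked numerically. The sketch below compares the closed form against a central finite difference; the grid of `z` values and the step size `eps` are arbitrary choices made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(z):
    """Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = np.linspace(-6.0, 6.0, 13)  # arbitrary evaluation points
eps = 1e-6                      # arbitrary finite-difference step

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(z)
max_gap = np.max(np.abs(numeric - analytic))

print(f"largest gap between analytic and numeric derivative: {max_gap:.2e}")
print(f"derivative at z = 0: {sigmoid_derivative(0.0):.2f}")  # prints 0.25
```

Since the derivative reuses the value of `sigmoid(z)` that the forward pass already computed, the gradient needs no extra exponential.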


**3. What is Regularization in Logistic Regression and why is it needed?**

* Regularization is a technique used to prevent overfitting in machine learning models.
* It works by adding a penalty term to the model’s loss function, discouraging the model from assigning too large weights to the features.
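* In Logistic Regression, the modified cost function looks like: $J(w) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right] + \lambda \cdot \text{Penalty}(w)$
* Here:
  * $m$ = number of samples
  * $y^{(i)}$ = true label of sample $i$ (0 or 1)
  * $\hat{y}^{(i)}$ = predicted probability
  * $\lambda$ = regularization strength (controls penalty size)
  * $\text{Penalty}(w)$ = term based on the weights $w$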
* Types of Regularization
* 1. L2 Regularization (Ridge)
   * Penalty: $\text{Penalty}(w) = \sum_j w_j^2$
   * Encourages small but nonzero weights.
   * Prevents model from relying too much on a single feature.
* 2. L1 Regularization (Lasso)
   * Penalty: $\text{Penalty}(w) = \sum_j |w_j|$
   * Encourages sparsity → pushes some weights to exactly zero.
   * Useful for feature selection.
* 3. Elastic Net
   * Combination of L1 and L2 regularization.

* Why is Regularization Needed?
* Logistic Regression can overfit when:
     * There are too many features.
     * Features are highly correlated (multicollinearity).
     * Training data is small or noisy.
* Overfitting means the model learns noise instead of general patterns → performs poorly on unseen data.

* Regularization helps by:
   * Controlling model complexity.
   * Preventing large weights.
   * Improving generalization to new data.

* Intuition Example
* Suppose you’re predicting if an email is spam.
* Without regularization, the model might assign an extremely high weight to a rare word (e.g., "lottery"), making predictions unstable.
* With regularization, the model balances weights across features, improving stability and accuracy.
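
The regularized cost can be sketched directly in NumPy. This is a minimal illustration under stated assumptions: the function name `regularized_log_loss`, the toy data, and the choice to leave the intercept `b` out of the penalty are all mine, not part of the original answer.

```python
import numpy as np


def regularized_log_loss(w, b, X, y, lam=0.1, penalty="l2"):
    """Mean log-loss plus lam * Penalty(w); the intercept b is not penalized."""
    z = X @ w + b
    y_hat = 1.0 / (1.0 + np.exp(-z))
    y_hat = np.clip(y_hat, 1e-12, 1 - 1e-12)  # keep log() finite
    data_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    if penalty == "l2":
        reg = np.sum(w ** 2)       # Ridge: sum of squared weights
    elif penalty == "l1":
        reg = np.sum(np.abs(w))    # Lasso: sum of absolute weights
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_loss + lam * reg


# Toy data and deliberately large weights (illustration only).
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.5]])
y = np.array([1, 1, 0, 1])
w = np.array([3.0, -4.0])
b = 0.0

loss_none = regularized_log_loss(w, b, X, y, lam=0.0)
loss_l2 = regularized_log_loss(w, b, X, y, lam=0.1, penalty="l2")
loss_l1 = regularized_log_loss(w, b, X, y, lam=0.1, penalty="l1")

print(f"no penalty: {loss_none:.4f}")
print(f"L2 adds   : {loss_l2 - loss_none:.2f}")  # 0.1 * (9 + 16) = 2.50
print(f"L1 adds   : {loss_l1 - loss_none:.2f}")  # 0.1 * (3 + 4)  = 0.70
```

The L2 term grows with the square of each weight, so it punishes the large weights here far more than the L1 term does.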

**4. What are some common evaluation metrics for classification models, and
why are they important?**

* When we build a classification model (like Logistic Regression, Decision Trees, etc.), we need to measure how well it performs. Different metrics highlight different aspects of performance.
* 1. Accuracy
   * Accuracy=Correct Predictions/Total Predictions
   * Example: If 90 out of 100 predictions are correct → Accuracy = 90%.
  * Usefulness: Simple and intuitive.
  * Limitation: Misleading when classes are imbalanced (e.g., 95% "not spam" and 5% "spam" → model predicting always "not spam" gives 95% accuracy but is useless).
* 2. Precision
   *  Among the samples predicted as positive, how many are actually positive?
   * $\text{Precision} = \dfrac{TP}{TP + FP}$
   * Usefulness: Important when false positives are costly (e.g., predicting "cancer" when it’s not).
* 3. Recall (Sensitivity or True Positive Rate)
  *  Among the actual positives, how many did the model correctly identify?
  * $\text{Recall} = \dfrac{TP}{TP + FN}$
  * Usefulness: Important when false negatives are costly (e.g., missing an actual cancer patient).
* 4. F1-Score
  *  Harmonic mean of Precision and Recall.
  * $F_1 = 2 \cdot \dfrac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
  * Usefulness: Balances Precision and Recall. Useful when dataset is imbalanced.
* 5. ROC Curve & AUC (Area Under Curve)
  * ROC Curve: Plots True Positive Rate (Recall) vs False Positive Rate at different thresholds.
  * AUC: Measures area under ROC curve (closer to 1 is better).
  * Usefulness: Evaluates model performance across thresholds instead of a single cutoff (like 0.5).
* 6. Confusion Matrix
   * A table showing:
      * TP (True Positives)
      * TN (True Negatives)
      * FP (False Positives)
      * FN (False Negatives)
   * Usefulness: Gives a complete picture of errors.
* Different problems have different priorities:
   * Medical diagnosis → Recall is more important (don’t miss actual cases).
   * Spam filter → Precision is more important (don’t classify normal mail as spam).
   * Accuracy alone can be misleading, especially with imbalanced datasets.
   * Using multiple metrics helps understand trade-offs between errors.
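
As a worked example of the formulas above, the sketch below derives each metric from the four confusion-matrix counts. The counts are made up for illustration (1,000 emails, 60 of them actually spam).

```python
# Hypothetical confusion-matrix counts (illustration only).
tp, fp, fn, tn = 40, 10, 20, 930

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall   : {recall:.3f}")     # 0.667
print(f"F1-score : {f1:.3f}")         # 0.727
```

Accuracy looks excellent at 97%, yet a third of the spam is missed, which is the gap that Recall and F1 expose.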

**5.  Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)**


```python
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Show first 5 rows
print("First 5 rows of dataset:")
print(df.head(), "\n")

# 2. Split into features (X) and target (y)
X = df.drop("target", axis=1)
y = df["target"]

# 3. Split into Train/Test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Train Logistic Regression model
model = LogisticRegression(max_iter=5000)  # increase iterations for convergence
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of Logistic Regression model: {accuracy:.4f}")
```

```
First 5 rows of dataset:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  wor
```

**6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)**

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# 2. Split dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Logistic Regression with L2 Regularization (Ridge)
# C controls regularization strength (smaller C = stronger regularization)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=5000)
model.fit(X_train, y_train)

# 4. Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# 5. Print Model Coefficients and Accuracy
print("Model Coefficients (first 10 shown):")
print(model.coef_[0][:10])  # printing only first 10 for readability
print("\nIntercept:", model.intercept_[0])
print(f"\nAccuracy of Logistic Regression with L2 Regularization: {accuracy:.4f}")
```

```
Model Coefficients (first 10 shown):
[ 1.0274368   0.22145051 -0.36213488  0.0254667  -0.15623532 -0.23771256
 -0.53255786 -0.28369224 -0.22668189 -0.03649446]

Intercept: 28.648713947072245

Accuracy of Logistic Regression with L2 Regularization: 0.9561
```


**7.Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)**

```python
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# 1. Load dataset
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features and target
X = df.drop("target", axis=1)
y = df["target"]

# 2. Split dataset into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train Logistic Regression with One-vs-Rest (OvR)
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=5000)
model.fit(X_train, y_train)

# 4. Predictions
y_pred = model.predict(X_test)

# 5. Print Classification Report
print("Classification Report for Logistic Regression (OvR):\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

```
Classification Report for Logistic Regression (OvR):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
```
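
In recent scikit-learn releases the `multi_class` argument of `LogisticRegression` is deprecated, and the library's warning points to `OneVsRestClassifier` for a one-vs-rest scheme. The sketch below is one possible equivalent under that assumption, using the same split; the variable name `ovr_model` is mine, and exact scores may differ slightly from the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Wrap a binary logistic regression; one classifier is fitted per class.
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=5000))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(f"Binary classifiers fitted: {len(ovr_model.estimators_)}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

Each of the three fitted estimators separates one species from the other two, and prediction picks the class whose classifier is most confident.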





**8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)**

```python
# Import required libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define Logistic Regression model
log_reg = LogisticRegression(max_iter=5000, solver='liblinear')
# (liblinear supports both l1 and l2 penalties)

# 4. Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # Regularization strength
    'penalty': ['l1', 'l2']         # L1 = Lasso, L2 = Ridge
}

# 5. Apply GridSearchCV
grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 6. Print best parameters and validation accuracy
print("Best Parameters found:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}")
```

```
Best Parameters found: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583
```
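
The cross-validation accuracy is measured on folds of the training data only, so the held-out test split is never touched by the search. The sketch below is a variant that also reports test accuracy for the refitted best model. Two choices here are my assumptions rather than part of the answer above: the `saga` solver (which accepts both penalties) and a `StandardScaler` step inside a `Pipeline` to help `saga` converge. The selected parameters may therefore differ from the output shown.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__penalty": ["l1", "l2"],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# score() uses the best estimator, refitted on the full training split.
test_accuracy = grid_search.score(X_test, y_test)

print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")
print(f"Held-out test accuracy: {test_accuracy:.4f}")
```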


**9. Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)**

```python
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# -------- Model WITHOUT Scaling --------
model_no_scaling = LogisticRegression(max_iter=5000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -------- Model WITH Scaling --------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=5000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# 3. Print Results
print("Accuracy WITHOUT Scaling: {:.4f}".format(accuracy_no_scaling))
print("Accuracy WITH Scaling   : {:.4f}".format(accuracy_scaled))
```

```
Accuracy WITHOUT Scaling: 0.9561
Accuracy WITH Scaling   : 0.9737
```
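
A single train/test split can be noisy, so the comparison could also be repeated with cross-validation. The sketch below is one way to do that: `make_pipeline` refits the scaler inside every fold, which keeps test-fold statistics out of training. The 5-fold setting and `max_iter=10000` are arbitrary choices of mine.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same model twice; only the second one standardizes its inputs.
unscaled = LogisticRegression(max_iter=10000)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print(f"Without scaling: {scores_unscaled.mean():.4f} "
      f"(+/- {scores_unscaled.std():.4f})")
print(f"With scaling   : {scores_scaled.mean():.4f} "
      f"(+/- {scores_scaled.std():.4f})")
```

Comparing the mean and spread over folds gives a steadier picture than the two single-split numbers above.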


**10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

**Approach to Building the Model**

* 1. Data Understanding & Cleaning
  * Check missing values → impute or drop as appropriate.
  * Handle outliers (e.g., extremely high purchase amounts).
  * Feature engineering:
      * Recency, frequency, and monetary value (RFM features).
      * Past response history.
      * Demographics, engagement data (website visits, email opens).
* 2. Feature Scaling
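  * Regularized Logistic Regression is sensitive to feature scale: the penalty treats all weights alike, so features measured in large units are penalized unevenly.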
  * Apply StandardScaler or MinMaxScaler to ensure features like age (years) and income ($) are on similar scales.
  * Scaling helps coefficients be more meaningful and improves convergence.
* 3. Handling Class Imbalance (only 5% responders)
   * Imbalance is critical here: otherwise the model can predict “no response” for everyone and still achieve ~95% accuracy.
   * Options:
       * 1. Resampling techniques
           * Oversampling the minority class (e.g., SMOTE, the Synthetic Minority Oversampling Technique).
           * Undersampling the majority class (downsampling non-responders).
           * Sometimes a hybrid works best.
           * Resample the training data only, never the validation or test data.
       * 2. Class weights
           * In scikit-learn: `LogisticRegression(class_weight='balanced')`
           * Penalizes misclassification of the minority class more heavily.
* 4. Model Training & Hyperparameter Tuning
    * Logistic Regression hyperparameters:
        * C (inverse regularization strength).
        * Penalty (l1, l2, elasticnet).
        * Solver (must match penalty type).
    * Use GridSearchCV or RandomizedSearchCV with cross-validation.
    * Example grid:

      ```python
      param_grid = {
          "C": [0.01, 0.1, 1, 10],
          "penalty": ["l1", "l2"],
          "solver": ["liblinear", "saga"]
      }
      ```
* 5. Evaluation Metrics
   * Accuracy is not useful on imbalanced datasets; use instead:
       * Precision: Of predicted responders, how many actually responded?
       * Recall (Sensitivity): Of actual responders, how many did we capture?
       * F1-score: Balance between Precision and Recall.
       * ROC-AUC: Probability that the model ranks a random responder higher than a non-responder.
       * PR-AUC (Precision-Recall AUC): Especially valuable in highly imbalanced data.
   * If campaign cost is high → prioritize Precision (target fewer customers, but more accurately).
   * If missing a potential customer is worse → prioritize Recall.
   * Best: tune the decision threshold to maximize expected profit.
* 6. Business Deployment
   * Use the model’s predicted probability, not just hard 0/1.
   * Rank customers by probability → run campaigns on top N% most likely responders.
   * Continuously retrain as new campaign results arrive.
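
The steps above can be sketched end to end. Everything specific in this example is an assumption made for illustration: the data is synthetic (`make_classification` with roughly 5% positives stands in for real campaign data), and `value_per_response` and `cost_per_contact` are hypothetical business figures. For brevity the threshold is chosen on the test split; in practice a separate validation split would be the safer choice. `class_weight='balanced'` also shifts the predicted probabilities, so they are best read as ranking scores here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=42,
)
# Stratify so both splits keep the same responder rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced",
                               solver="liblinear", max_iter=5000)),
])
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Select hyperparameters by PR-AUC (average precision), not accuracy.
search = GridSearchCV(pipeline, param_grid, cv=cv,
                      scoring="average_precision")
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)
print("Best parameters:", search.best_params_)
print(f"ROC-AUC: {roc_auc:.3f}  PR-AUC: {pr_auc:.3f}")

# Hypothetical economics: revenue per responder, cost per contact.
value_per_response = 50.0
cost_per_contact = 2.0

thresholds = np.linspace(0.05, 0.95, 19)
profits = []
for t in thresholds:
    contacted = proba >= t
    responders_reached = np.sum(contacted & (y_test == 1))
    profits.append(value_per_response * responders_reached
                   - cost_per_contact * np.sum(contacted))

best_threshold = thresholds[int(np.argmax(profits))]
print(f"Most profitable threshold: {best_threshold:.2f}")
```

Changing the two business figures moves the most profitable threshold, which is why the cutoff is treated as a business decision rather than fixed at 0.5.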