Logistic Regression

# 1: What is Logistic Regression, and how does it differ from Linear Regression?

     -> Linear Regression:

        * Used for predicting continuous values.
        * Output range: −∞ to +∞.
        * Equation: y = β₀ + β₁x.
        * No activation function; the output is the linear combination itself.
        * Loss function: Mean Squared Error (MSE).

     -> Logistic Regression:

        * Used for classification tasks (predicting class probabilities).
        * Output range: 0 to 1.
        * Equation: p = 1 / (1 + e^(−(β₀ + β₁x))).
        * Uses the Sigmoid activation function.
        * Loss function: Log Loss (Cross-Entropy Loss).
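
The difference in output range can be seen on a small example. The sketch below is illustrative only: the toy "hours studied vs. pass/fail" data is made up, and both models are fit with scikit-learn defaults.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (invented for illustration): hours studied -> fail (0) / pass (1)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression().fit(X, y)

# Query points far outside and inside the training range
X_new = np.array([[-5.0], [2.75], [12.0]])

linear_out = linear.predict(X_new)                  # unbounded real values
logistic_out = logistic.predict_proba(X_new)[:, 1]  # probabilities of class 1

print("Linear regression output:  ", linear_out)
print("Logistic regression output:", logistic_out)
```

The linear model extrapolates below 0 and above 1 at the extreme inputs, while the logistic model's outputs stay inside the 0 to 1 range.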



# 2: Explain the role of the Sigmoid function in Logistic Regression.

  -> The Sigmoid function, σ(z) = 1 / (1 + e^(−z)), maps any real-valued number into the range between 0 and 1, making it well suited to probability estimation in classification.

   Role in Logistic Regression:

     * Converts the linear combination of features (z = β₀ + β₁x) into a probability.

     * Ensures the model output is bounded between 0 and 1, so it can be thresholded (commonly at 0.5) to assign a class.
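
A minimal sketch of this mapping follows. The coefficients `beta0` and `beta1` are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued inputs to the open interval (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta0, beta1 = -4.0, 1.5

x = np.array([0.0, 2.0, 4.0, 6.0])
z = beta0 + beta1 * x            # linear combination: -4, -1, 2, 5
p = sigmoid(z)                   # probabilities in (0, 1)
labels = (p >= 0.5).astype(int)  # threshold at 0.5 to get class labels

print("z:", z)
print("p:", np.round(p, 3))
print("labels:", labels)
```

Since sigmoid(0) = 0.5, thresholding the probability at 0.5 is the same as checking whether the linear combination z is non-negative.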




# 3: What is Regularization in Logistic Regression and why is it needed?

    -> Regularization is a technique to prevent overfitting by adding a penalty term to the loss function.
       It limits the size of the coefficients so the model generalizes better to unseen data.

    Types:

      * L1 (Lasso): Adds the sum of the absolute values of the coefficients → can shrink some coefficients to exactly 0.

      * L2 (Ridge): Adds the sum of the squared coefficients → keeps all coefficients small but generally non-zero.

    Why Needed:

      * Reduces overfitting, especially when there are many or highly correlated features.

      * Improves model stability.

      * Enhances generalization to new data.
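
The sparsity effect of L1 compared with L2 can be checked empirically. The sketch below assumes a synthetic dataset from `make_classification` with mostly uninformative features and an arbitrarily chosen `C=0.05`; in scikit-learn, `C` is the inverse of the regularization strength, so a small `C` means a strong penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 4 carry signal
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Same regularization strength, different penalty type
l1_model = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X, y)
l2_model = LogisticRegression(penalty='l2', C=0.05, solver='lbfgs').fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

On data like this, L1 typically zeroes out most of the uninformative features, whereas L2 only shrinks them toward zero.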

# 4: What are some common evaluation metrics for classification models, and why are they important?

   -> Common Evaluation Metrics in Classification (TP, TN, FP, FN = true/false positives/negatives):

   Accuracy = (TP + TN) / (TP + TN + FP + FN):
      * Measures the overall correctness of predictions.

      * Best used when the dataset is balanced.

   Precision = TP / (TP + FP):
      * Of all predicted positives, how many are actually positive.

      * Important when avoiding false positives is the priority.

   Recall (True Positive Rate) = TP / (TP + FN):
      * Of all actual positives, how many were correctly predicted.

      * Important when avoiding false negatives is the priority.

   F1-Score = 2 × Precision × Recall / (Precision + Recall):
      * Harmonic mean of precision and recall.

      * Useful for imbalanced datasets.
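
As a worked example, the metrics above can be computed on a small set of hand-written labels (invented for illustration) using scikit-learn's metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels (illustrative): 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = accuracy_score(y_true, y_hat)    # (TP + TN) / total
precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of the two

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
```

Here TP = 3, FP = 1, FN = 1, and TN = 5, so accuracy is 8/10 while precision, recall, and F1 are each 3/4.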

In [1]:
# 5: Python program to train Logistic Regression and print accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# Load dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1), df['target'], test_size=0.2, random_state=42
)

# Train Logistic Regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 1.0


In [2]:
# 6: Logistic Regression with L2 regularization (Ridge).

# L2 is the default penalty in LogisticRegression; C is the inverse of the
# regularization strength, so a smaller C means stronger regularization.
model = LogisticRegression(penalty='l2', C=1.0, max_iter=200)
model.fit(X_train, y_train)

print("Model Coefficients:", model.coef_)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))


Model Coefficients: [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]
Accuracy: 1.0


In [3]:
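# 7: Logistic Regression for multiclass classification (one-vs-rest, OvR).
from sklearn.metrics import classification_report

# Note: the multi_class argument is deprecated as of scikit-learn 1.5; wrapping the
# estimator as OneVsRestClassifier(LogisticRegression(...)) is the suggested replacement.
model = LogisticRegression(multi_class='ovr', max_iter=200)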
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Classification Report:\n", classification_report(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [5]:
# 8: Logistic Regression hyperparameter tuning with GridSearchCV.
from sklearn.model_selection import GridSearchCV

# Search only penalty/solver pairs that are valid together: both 'liblinear'
# and 'saga' support 'l1' and 'l2'. 'elasticnet' needs solver='saga' plus an
# l1_ratio value, and the unpenalized model is requested with penalty=None
# (not the string 'none') and is not supported by 'liblinear', so those
# combinations fail to fit when placed in a single flat grid.
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}

grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 1, 'penalty': 'l1', 'solver': 'saga'}
Validation Accuracy: 0.9666666666666667


In [6]:
# 9: Compare accuracy with and without feature scaling.
from sklearn.preprocessing import StandardScaler

# Without scaling
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Without Scaling Accuracy:", accuracy_score(y_test, model.predict(X_test)))

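# With scaling: fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A model trained on standardized features must also be evaluated on standardized
# features; predicting on the raw X_test here would give a misleadingly low score.
scaled_model = LogisticRegression(max_iter=200)
scaled_model.fit(X_train_scaled, y_train)
print("With Scaling Accuracy:", accuracy_score(y_test, scaled_model.predict(X_test_scaled)))



Without Scaling Accuracy: 1.0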




# 10: Real-world approach for an imbalanced dataset (marketing campaign prediction).

  -> Steps:

   1. Data Understanding & Cleaning: Handle missing values, encode categorical variables.

   2. Feature Scaling: Standardize numeric features for better convergence (fit the scaler on the training data only).

   3. Handling Imbalance: Use SMOTE, class weighting (class_weight='balanced'), or undersampling; apply any resampling to the training split only.

   4. Train Logistic Regression: Use regularization to avoid overfitting.

   5. Hyperparameter Tuning: Tune C, penalty, and solver using GridSearchCV.

   6. Evaluation Metrics: Use Precision, Recall, F1-score, and ROC-AUC (accuracy is misleading when one class dominates).

   7. Business Considerations: Focus on Recall (so we don’t miss potential customers).
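
The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the data is synthetic (`make_classification` with roughly 10% positives stands in for campaign responders), imbalance is handled with `class_weight='balanced'` rather than SMOTE, and hyperparameter tuning is omitted for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data (assumption): about 10% positive class
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Stratify so both splits keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The pipeline fits the scaler on training data only; class_weight='balanced'
# reweights the loss so the minority class is not ignored
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000),
)
clf.fit(X_tr, y_tr)

y_hat = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

recall = recall_score(y_te, y_hat)
auc = roc_auc_score(y_te, y_prob)

print(classification_report(y_te, y_hat))
print("Recall (positive class):", recall)
print("ROC-AUC:", auc)
```

If recall matters more than precision for the campaign, the decision threshold applied to `y_prob` can also be lowered below 0.5, at the cost of more false positives.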

