
1. What is Logistic Regression, and how does it differ from Linear Regression?

-> Logistic Regression is a supervised learning algorithm used for classification problems, where the target variable is categorical (e.g., yes/no, spam/ham).

-It predicts the probability that an observation belongs to a particular class, using the logistic (sigmoid) function to map values between 0 and 1.

Difference from Linear Regression:

-Linear Regression predicts a continuous outcome (e.g., predicting house prices).

-Logistic Regression predicts probabilities for classification (e.g., will the customer buy or not).

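-Linear Regression is fit by minimizing least-squares (squared error) loss, while Logistic Regression is fit by minimizing log loss (cross-entropy), which is equivalent to maximizing the log-likelihood.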
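The contrast can be seen on a toy one-feature dataset. The sketch below uses made-up data (the values are an assumption for illustration only): a linear fit returns values outside [0, 1] for extreme inputs, while logistic regression always returns valid probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): one feature, binary target
X = np.arange(8).reshape(-1, 1)              # feature values 0, 1, ..., 7
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-5], [12]])               # inputs far outside the training range
lin_pred = lin.predict(X_new)                # unbounded real-valued outputs
log_proba = log.predict_proba(X_new)[:, 1]   # probability of class 1, always in [0, 1]

print("Linear regression output:", lin_pred)
print("Logistic regression P(class 1):", log_proba)
```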

2. Explain the role of the Sigmoid function in Logistic Regression

The Sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

It converts the linear combination of features, $z = w^T x + b$, into a probability between 0 and 1.

Decision Rule:

If probability ≥ 0.5 → Class 1

If probability < 0.5 → Class 0

Thus, sigmoid makes logistic regression suitable for binary classification.
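A minimal NumPy sketch of the sigmoid and the 0.5 decision rule. The weights `w`, bias `b`, and inputs `X` are hypothetical values chosen for illustration, not taken from a trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs (assumptions for illustration)
w = np.array([0.8, -0.4])
b = -0.2
X = np.array([[2.0, 1.0],
              [0.5, 3.0]])

z = X @ w + b                        # linear combination w^T x + b for each row
proba = sigmoid(z)                   # probability of class 1
labels = (proba >= 0.5).astype(int)  # decision rule: >= 0.5 -> class 1, else class 0

print("z:", z)
print("probabilities:", proba)
print("predicted classes:", labels)
```

Because sigmoid(0) = 0.5, the 0.5 threshold on the probability is the same as checking whether z is non-negative.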

3. What is Regularization in Logistic Regression and why is it needed?

-> Regularization adds a penalty on the size of the coefficients to the loss function, which discourages overly complex fits and helps prevent overfitting.

Types:

L1 (Lasso) → Encourages sparsity (some coefficients become 0).

L2 (Ridge) → Shrinks coefficients but keeps all features.

Needed because:

-Helps prevent overfitting on noisy data.

-Improves generalization on unseen test data.
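The sparsity difference between L1 and L2 can be checked directly by counting coefficients that are exactly zero. This is a sketch under assumptions: it uses the breast-cancer dataset with standardized features, the `liblinear` solver (which supports both penalties), and an arbitrary `C=0.1`. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean a stronger penalty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put features on a comparable scale

# Same regularization strength, different penalty type (C=0.1 is an arbitrary choice)
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000)
l2_model = LogisticRegression(penalty='l2', C=0.1, solver='liblinear', max_iter=1000)
l1_model.fit(X_scaled, y)
l2_model.fit(X_scaled, y)

# Count coefficients driven exactly to zero by each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1, "of", l1_model.coef_.size)
print("Zero coefficients with L2:", n_zero_l2, "of", l2_model.coef_.size)
```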

4. Common Evaluation Metrics for Classification Models

Accuracy - percentage of correct predictions.

Precision - how many predicted positives are actually positive.

Recall (Sensitivity) - how many actual positives are correctly predicted.

F1-Score - harmonic mean of Precision & Recall, balances false positives/negatives.

ROC-AUC - measures the model's ability to rank positives above negatives across all decision thresholds.

Note that on imbalanced datasets accuracy alone is misleading; Precision, Recall, and ROC-AUC are more informative.
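These metrics can all be computed with `sklearn.metrics`. The sketch below uses ten hand-made labels and scores (hypothetical values, chosen so the counts are easy to verify by hand: 3 true positives, 1 false positive, 1 false negative, 5 true negatives).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth and predicted probabilities for class 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1])
y_pred = (y_score >= 0.5).astype(int)    # hard labels at the 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 8 / 10
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the hard labels

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC-AUC:", auc)
```

ROC-AUC is computed from the probability scores, so it does not depend on the 0.5 threshold the way the other four metrics do.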



In [2]:
#5.Train a Logistic Regression model, and print accuracy.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :-1], df['target'], test_size=0.3, random_state=42)

# Logistic Regression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predictions & Accuracy
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))




Accuracy: 1.0


In [3]:
#6.Logistic Regression with L2 Regularization (Ridge)

from sklearn.datasets import load_breast_cancer

# Load dataset
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

# Logistic Regression with L2 Regularization
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

print("Model Coefficients:", model.coef_)
print("Accuracy:", model.score(X_test, y_test))

Model Coefficients: [[ 2.09913415  0.1673581  -0.11876587 -0.00399211 -0.11964349 -0.41157283
  -0.62086473 -0.31616778 -0.18731917 -0.03068573 -0.04091334  1.55447102
   0.22882622 -0.12126295 -0.01220061 -0.04231814 -0.07360856 -0.03703756
  -0.04597635 -0.0027586   1.20265811 -0.40104191 -0.08808799 -0.02043432
  -0.22211505 -1.20053841 -1.55500712 -0.57085173 -0.67900347 -0.11799646]]
Accuracy: 0.9649122807017544


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [4]:
#7.Logistic Regression for Multiclass (One-vs-Rest)

from sklearn.metrics import classification_report

# Using Iris dataset again
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Multiclass Logistic Regression
model = LogisticRegression(multi_class='ovr', max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      0.85      0.92        13
           2       0.87      1.00      0.93        13

    accuracy                           0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
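Recent scikit-learn releases (1.5 onward) deprecate the `multi_class` argument, so the cell above may warn or fail on newer versions. A sketch of an equivalent one-vs-rest setup using `OneVsRestClassifier`, assuming the same Iris split:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# One binary logistic regression per class, each trained as "this class vs. the rest"
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print("Number of binary classifiers:", len(ovr_model.estimators_))
print("Classification Report:\n", classification_report(y_test, y_pred))
```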





In [5]:
#8. GridSearchCV for Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

# Grid search parameters (X_train, y_train are still the Iris split from the previous cell)
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # supports L1
}

grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Validation Accuracy:", grid.best_score_)

Best Parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
Best Validation Accuracy: 0.9523809523809523
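The score printed above is a cross-validation score on the training split; the tuned model can also be checked on the held-out test split. A minimal sketch under assumptions: it uses the binary breast-cancer dataset from question 6 instead of Iris (so `liblinear` handles the target directly), and places the scaler inside a pipeline so each CV fold is scaled using only its own training portion.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear', max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    'logisticregression__C': [0.01, 0.1, 1, 10],
    'logisticregression__penalty': ['l1', 'l2'],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# best_estimator_ is refit on the full training split with the best parameters
test_accuracy = grid.best_estimator_.score(X_test, y_test)

print("Best Parameters:", grid.best_params_)
print("Best Validation Accuracy:", grid.best_score_)
print("Test Accuracy:", test_accuracy)
```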


In [6]:
#9.Compare Accuracy With & Without Feature Scaling

from sklearn.preprocessing import StandardScaler

# Without scaling (same Iris split as in question 7)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Accuracy without scaling:", model.score(X_test, y_test))

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
print("Accuracy with scaling:", model_scaled.score(X_test_scaled, y_test))

Accuracy without scaling: 1.0
Accuracy with scaling: 1.0


10. Real-world Case Study: Predicting Marketing Campaign Response (Imbalanced Data)

*Business Challenge*
-Only 5% of customers respond, so the dataset is highly imbalanced.

*Approach*

Data Handling & Preprocessing:

-Clean missing values, encode categorical features.

-Standardize numerical features for logistic regression.

Handling Imbalance:

-Use SMOTE (Synthetic Minority Oversampling Technique) or

-Apply class_weight='balanced' in Logistic Regression.

Feature Scaling:

-Use StandardScaler to standardize features, fitting it on the training split only and reusing it on the test split (as in question 9).

Model Training:

-Logistic Regression with regularization (L1/L2).

-Hyperparameter tuning with GridSearchCV (parameters: C, penalty).

Evaluation Metrics:

-Accuracy is misleading here: a model that predicts "no response" for every customer is already 95% accurate.

-Use Precision, Recall, F1-score, ROC-AUC.

-Focus on Recall (don’t miss responders) or Precision (don’t waste resources on non-responders), depending on business goal.

Deployment:

-Final model should output probability scores.

-The business team can then set the decision threshold (e.g., target the top 10% of customers most likely to respond).
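The approach above can be sketched end to end. Everything here is an assumption for illustration: the data is synthetic (generated with `make_classification` at roughly 5% positives as a stand-in for real campaign data), imbalance is handled with `class_weight='balanced'` rather than SMOTE (which would need the separate imbalanced-learn package), and the top-10% cutoff is just the example threshold mentioned above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (class 1)
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42)

# Stratify so both splits keep the same responder rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Scaling + class-weighted logistic regression (L2 penalty by default)
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight='balanced', max_iter=1000))
model.fit(X_train, y_train)

# Probability scores for the positive class
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)

# Business rule: contact only the 10% of customers with the highest scores
threshold = np.quantile(proba, 0.90)
targeted = (proba >= threshold).astype(int)

precision = precision_score(y_test, targeted)
recall = recall_score(y_test, targeted)

print("ROC-AUC:", auc)
print("Share of customers targeted:", targeted.mean())
print("Precision among targeted:", precision)
print("Recall of responders:", recall)
```

Moving the quantile up or down trades precision against recall, which is the lever the business team would adjust.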