<a href="https://colab.research.google.com/github/HimAir10/Pw-skillsAssignment/blob/main/Logistic_Regression_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Theory Questions

# Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

Logistic Regression is a supervised learning algorithm used mainly for classification tasks. Instead of predicting continuous numerical values, it estimates the probability that a given input belongs to a certain class. To achieve this, it uses the logistic or sigmoid function, which transforms the linear combination of input features into a value between 0 and 1. This probability can then be converted into a class label based on a threshold, usually set at 0.5.

The main difference between Logistic Regression and Linear Regression lies in their purpose and output. Linear Regression predicts continuous values and can output any number between negative and positive infinity, making it suitable for regression problems like predicting house prices or sales figures. Logistic Regression, on the other hand, predicts probabilities and outputs values strictly between 0 and 1, making it suitable for classification tasks like spam detection or customer churn prediction.

Another important distinction is that Linear Regression models the relationship between input variables and the actual target values, while Logistic Regression models the relationship between input variables and the log-odds of the target variable. Additionally, Linear Regression typically uses Mean Squared Error as its loss function, whereas Logistic Regression uses log loss (cross-entropy) to measure performance.
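To make the loss-function contrast concrete, here is a minimal sketch in plain Python. The helper names `squared_error` and `log_loss_single` are illustrative, not library functions, and the probabilities are made-up values. It scores a single positive example (y = 1) under both losses.

```python
import math

def squared_error(y_true, y_hat):
    """Squared error for one prediction (the per-sample term of MSE)."""
    return (y_true - y_hat) ** 2

def log_loss_single(y_true, p):
    """Binary cross-entropy for one sample; p is the predicted P(y = 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A positive example (y = 1) scored with decreasing predicted probability.
for p in (0.9, 0.5, 0.01):
    print(f"p={p:<4}  squared error={squared_error(1, p):.4f}  "
          f"log loss={log_loss_single(1, p):.4f}")
```

The squared error can never exceed 1 for probabilities in [0, 1], while the log loss grows without bound as the predicted probability of the true class approaches 0, so confident mistakes are punished much more heavily.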

# Question 2: Explain the role of the Sigmoid function in Logistic Regression

In Logistic Regression, the sigmoid function plays a crucial role in transforming the output of a linear equation into a probability value between 0 and 1. The model first computes a weighted sum of the input features along with a bias term, which can range from negative infinity to positive infinity. This raw score, often called the logit, is then passed through the sigmoid function to "squash" it into a range that is interpretable as a probability.

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


Here, $z$ is the linear combination of input features and their weights. When $z$ is a large positive number, the sigmoid output approaches 1; when $z$ is a large negative number, the output approaches 0; and when $z$ is 0, the output is exactly 0.5.

This transformation is essential because probabilities must always fall between 0 and 1. By using the sigmoid function, Logistic Regression can map any real-valued number into this range, making it possible to interpret the output as the likelihood of a given observation belonging to the positive class. This probability can then be compared to a decision threshold (often 0.5) to assign the final class label.
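The behaviour described here can be checked with a small sketch in plain Python. The function names `sigmoid` and `predict_label` are chosen for illustration; the two-branch form is one common way to avoid overflow for large negative scores, not the only one.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this form cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(z, threshold=0.5):
    """Turn a raw score z into a class label using a decision threshold."""
    return 1 if sigmoid(z) >= threshold else 0

# Raw scores far below zero map near 0, far above zero map near 1.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 5), predict_label(z))
```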

# Question 3: What is Regularization in Logistic Regression and why is it needed?

Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the loss function. Overfitting occurs when the model learns not only the general patterns in the data but also the noise, which reduces its ability to generalize well to unseen data. In Logistic Regression, overfitting often happens when the model assigns very large weights to certain features, making the decision boundary overly sensitive to small changes in input.

The penalty term in regularization discourages the model from assigning excessively large coefficients. There are two common types of regularization used in Logistic Regression: L1 regularization (Lasso), which adds the absolute values of the coefficients to the loss function, and L2 regularization (Ridge), which adds the squared values of the coefficients. L1 tends to produce sparse models by driving some coefficients to zero, effectively performing feature selection, while L2 generally shrinks all coefficients but keeps them non-zero, helping to maintain stability.
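As a rough illustration of the two penalties, the sketch below computes them for a hypothetical coefficient vector and adds them to a made-up data loss. The names `l1_penalty`, `l2_penalty`, `penalized_loss`, and the strength `lam` are assumptions for this example only.

```python
def l1_penalty(weights):
    """Sum of absolute coefficient values (Lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared coefficient values (Ridge-style penalty)."""
    return sum(w * w for w in weights)

def penalized_loss(data_loss, weights, lam, kind="l2"):
    """Data loss plus lam times the chosen penalty (bias term excluded)."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return data_loss + lam * penalty

weights = [0.5, -2.0, 0.0]   # hypothetical coefficients
print(l1_penalty(weights))   # 2.5
print(l2_penalty(weights))   # 4.25
print(penalized_loss(0.30, weights, lam=0.1, kind="l2"))
```

The large coefficient (-2.0) dominates the L2 penalty because it is squared, which is why L2 pushes hardest against the biggest weights. In scikit-learn the strength is set through `C`, the inverse of regularization strength, so a smaller `C` corresponds to a larger `lam` here.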

Regularization is needed because it improves the model’s ability to generalize, reduces variance, and makes it more robust to multicollinearity among features. Without it, especially in datasets with many features or noisy data, the Logistic Regression model can become unstable and produce poor performance on new, unseen samples.

# Question 4: What are some common evaluation metrics for classification models, and why are they important?

Common evaluation metrics for classification models help measure how well a model is performing and whether it meets the requirements of the problem. One of the most widely used metrics is accuracy, which measures the proportion of correct predictions out of the total predictions. While accuracy is easy to understand, it can be misleading for imbalanced datasets where one class dominates, as a model could achieve high accuracy simply by predicting the majority class every time.

Precision and recall are two important metrics that address such limitations. Precision measures the proportion of true positive predictions out of all predicted positives, showing how reliable the positive predictions are. Recall, also known as sensitivity or true positive rate, measures the proportion of actual positives correctly identified by the model, which is important when missing positive cases has a high cost. The F1-score combines precision and recall into a single metric by taking their harmonic mean, providing a balanced view when both metrics are important.
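The following sketch computes these metrics from confusion-matrix counts. The counts are hypothetical (1000 samples, 50 of them positive) and the function name `classification_metrics` is illustrative. It contrasts a model that always predicts the majority class with a modest model that finds some positives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Undefined ratios (no predicted or actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced data: 950 negatives, 50 positives.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=950)
modest_model = classification_metrics(tp=30, fp=20, fn=20, tn=930)

print(always_negative)
print(modest_model)
```

The always-negative model reaches 95% accuracy with zero recall, while the modest model has only slightly higher accuracy but a precision and recall of 0.6 each.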

Another key metric is the ROC-AUC (Receiver Operating Characteristic – Area Under Curve), which evaluates the model’s ability to distinguish between classes across different probability thresholds. A related metric for imbalanced datasets is Precision-Recall AUC, which focuses more on the performance with respect to the positive class. These metrics are important because they go beyond a simple accuracy number, allowing data scientists to choose and tune models based on the priorities of the specific business or real-world use case.
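One way to build intuition for ROC-AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below implements that definition directly in plain Python. The function name `roc_auc_pairwise`, the labels, and the scores are all made up for illustration, and the quadratic loop is meant for small examples only.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 0, 1, 1]                    # hypothetical labels
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]       # hypothetical model scores
print(roc_auc_pairwise(y_true, scores))        # 6 of 9 pairs are ordered correctly
```

Because only the ordering of scores matters, the value does not depend on any particular decision threshold.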

# Practical Questions

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Question 5

In [52]:
# Question 5: Load CSV, Train Logistic Regression, Print Accuracy

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset from sklearn
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# 2. Save DataFrame to CSV (simulating loading from a CSV file)
csv_path = "iris_dataset.csv"
df.to_csv(csv_path, index=False)

# 3. Load CSV into Pandas DataFrame
data = pd.read_csv(csv_path)

# 4. Split into features (X) and target (y)
X = data.drop('target', axis=1)
y = data['target']

# 5. Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 6. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 7. Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 8. Predictions & Accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)


Accuracy: 0.9333333333333333


# Question 6

In [51]:
# Question 6: Logistic Regression with L2 Regularization (Ridge)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Logistic Regression with L2 regularization
model = LogisticRegression(
    penalty='l2',     # L2 Regularization (Ridge)
    C=1.0,            # Inverse of regularization strength (default); smaller C = stronger penalty
    solver='lbfgs',   # Supports L2
    max_iter=1000
)

# 5. Train the model
model.fit(X_train_scaled, y_train)

# 6. Predictions & Accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

# 7. Output model coefficients and accuracy
print("Model Coefficients:\n", model.coef_)
print("\nIntercepts:\n", model.intercept_)
print("\nAccuracy:", accuracy)


Model Coefficients:
 [[-1.08894494  1.02420763 -1.79905609 -1.68622819]
 [ 0.53633654 -0.36048698 -0.20407418 -0.80795703]
 [ 0.5526084  -0.66372065  2.00313027  2.49418523]]

Intercepts:
 [-0.30558672  1.90855554 -1.60296882]

Accuracy: 0.9333333333333333


# Question 7

In [50]:
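# Question 7: Multiclass Logistic Regression (multi_class='ovr') with classification report

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# 1. Generate a synthetic 3-class dataset
X, y = make_classification(
    n_samples=1000, n_features=10, n_redundant=5, n_informative=5,
    n_classes=3, random_state=1
)

# 2. Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# 3. One-vs-Rest Logistic Regression
# Note: multi_class is deprecated since scikit-learn 1.5; on newer versions,
# OneVsRestClassifier(LogisticRegression()) is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='lbfgs')
model.fit(X_train, y_train)

# 4. Predictions on the test set
y_pred = model.predict(X_test)

# 5. Confusion matrix, accuracy, and per-class report
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))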

[[75 21  3]
 [10 58 21]
 [23 14 75]]
0.6933333333333334
              precision    recall  f1-score   support

           0       0.69      0.76      0.72        99
           1       0.62      0.65      0.64        89
           2       0.76      0.67      0.71       112

    accuracy                           0.69       300
   macro avg       0.69      0.69      0.69       300
weighted avg       0.70      0.69      0.69       300
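The per-class numbers in the report can be re-derived by hand from the confusion matrix printed above it. The sketch below does that arithmetic in plain Python, relying on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = [
    [75, 21, 3],
    [10, 58, 21],
    [23, 14, 75],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

per_class = []
for k in range(n_classes):
    support = sum(cm[k])                                   # actual samples of class k
    predicted = sum(cm[i][k] for i in range(n_classes))    # samples predicted as class k
    precision = cm[k][k] / predicted
    recall = cm[k][k] / support
    per_class.append((round(precision, 2), round(recall, 2), support))
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f} support={support}")

print(f"accuracy={accuracy:.4f}")
```

Recall divides the diagonal entry by its row sum and precision divides it by its column sum, which is why class 0 has a recall of 75/99 but a precision of 75/108.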



# Question 8

In [48]:
# Question 8: GridSearchCV for Logistic Regression

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# 1. Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Train-Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Create pipeline: StandardScaler + LogisticRegression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000))
])

# 4. Define parameter grid
param_grid = {
    'logreg__C': [0.01, 0.1, 1, 10, 100],     # Inverse regularization strength (smaller = stronger)
    'logreg__penalty': ['l1', 'l2'],          # Regularization type
    'logreg__solver': ['liblinear']           # Supports both L1 and L2
}

# 5. Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# 6. Print best parameters and validation accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)

# 7. Evaluate on test data
y_pred = grid_search.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'logreg__C': 10, 'logreg__penalty': 'l1', 'logreg__solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.9583333333333334
Test Accuracy: 0.9333333333333333


# Question 9

In [23]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_redundant=5, n_informative=5, n_classes=2, random_state=1)

In [26]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size= 0.3, random_state=1)

In [34]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [47]:
y_pred = model.predict(X_train)
# Confusion matrix and accuracy before standardization
# (note: computed on the training split, not on X_test)
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
print(confusion_matrix(y_train, y_pred))  # confusion matrix

print(accuracy_score(y_train, y_pred))  # accuracy

[[272  85]
 [ 54 289]]
0.8014285714285714


In [43]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [39]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

In [46]:
y_pred_scaled = model.predict(X_train_scaled)
# Confusion matrix and accuracy after standardization
# (note: computed on the training split, not on X_test_scaled)
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
print(confusion_matrix(y_train, y_pred_scaled))  # confusion matrix

print(accuracy_score(y_train, y_pred_scaled))  # accuracy

[[307  50]
 [ 89 254]]
0.8014285714285714
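The cells above score both models on the training split. A possible variant, sketched below under the same `make_classification` settings, repeats the comparison on the held-out test split. It uses `make_pipeline` so that the scaler is fit on the training data only. The dictionary names are illustrative, and no output is shown because the numbers depend on the installed scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic binary dataset and split as in the cells above.
X, y = make_classification(n_samples=1000, n_features=10, n_redundant=5,
                           n_informative=5, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

models = {
    "unscaled": LogisticRegression(max_iter=1000),
    "scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

test_accuracy = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)                       # fit on training data only
    test_accuracy[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, test_accuracy[name])
```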


# Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Ans.

Step 1 — Understanding the Business Problem

We want to predict customers likely to respond to a campaign so the marketing team can focus efforts and reduce costs.
Since only 5% respond, the dataset is heavily imbalanced — meaning a naïve model could predict “No” for everyone and still be 95% “accurate,” but it wouldn’t be useful.

Step 2 — Data Preparation

Load & Inspect Data

Check missing values, outliers, and data types.

Ensure target variable is binary (0 = No, 1 = Yes).

Feature Engineering

Create meaningful features (e.g., purchase frequency, recency, average order value, last campaign response).

Encode categorical variables:

One-hot encoding for nominal (e.g., product category).

Ordinal encoding for ordered categories (e.g., loyalty tiers).

Feature Scaling

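Logistic Regression, especially with regularization, is sensitive to feature magnitude: the penalty acts on coefficient size, so features on very different scales are penalized unevenly, and gradient-based solvers converge more slowly on unscaled data.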

Apply StandardScaler (z-score normalization) to continuous features to bring them on a similar scale.

Step 3 — Handling Class Imbalance

Options:

Class weights (preferred for Logistic Regression):
Use class_weight='balanced' so the algorithm gives more penalty to misclassifying the minority class.

Resampling techniques:

Oversampling minority (SMOTE) to synthetically create more positive samples (applied to the training data only, so synthetic samples never leak into validation or test sets).

Undersampling majority to reduce negative samples (careful with data loss).

Hybrid approach: Combine SMOTE oversampling with undersampling for better balance.

For business use, I’d start with class weights, then test SMOTE + Logistic Regression to see if recall improves.
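To see what `class_weight='balanced'` does for a 5% response rate, the sketch below reproduces the heuristic documented by scikit-learn, `n_samples / (n_classes * count_of_class)`, in plain Python. The function name `balanced_class_weights` and the label counts are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights from the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

labels = [1] * 50 + [0] * 950        # hypothetical campaign data: 5% responders
weights = balanced_class_weights(labels)
print(weights)                       # responders get weight 10.0, non-responders about 0.53
```

Each misclassified responder therefore counts roughly 19 times as much in the loss as a misclassified non-responder, which mirrors the 19:1 class ratio.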

Step 4 — Splitting Data

Train–Test split: Stratified split (preserves 5% positive ratio in both sets).

Further split train set into train–validation for hyperparameter tuning.
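A stratified split can be sketched as follows. The feature matrix is random placeholder data and the label vector is constructed to contain exactly 5% positives, so both are assumptions rather than real campaign data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # placeholder features
y = np.array([1] * 50 + [0] * 950)        # 5% responders

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())
```

Without `stratify`, a small test set could by chance contain very few responders, which would make recall and precision estimates unstable.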

Step 5 — Model Building (Logistic Regression)

Base Model

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=42)


Hyperparameter Tuning

Grid search over:

C (inverse regularization strength, e.g., [0.01, 0.1, 1, 10]).

penalty (L1 or L2).

solver (liblinear or saga when L1 is in the grid, since lbfgs supports only L2 or no penalty).

Use Stratified K-Fold CV to maintain class ratio in folds.
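The tuning steps listed here could be wired together roughly as below. This is a sketch on synthetic data generated with `make_classification` (about 5% positives), not the real campaign dataset. The choice of `average_precision` as the scoring metric is an assumption that fits the imbalanced setting, and the grid values are taken from the list above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(class_weight="balanced", solver="liblinear",
                                  max_iter=1000, random_state=42)),
])

param_grid = {
    "logreg__C": [0.01, 0.1, 1, 10],      # inverse regularization strength
    "logreg__penalty": ["l1", "l2"],      # liblinear supports both
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(pipeline, param_grid, cv=cv,
                           scoring="average_precision")
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV average precision:", grid_search.best_score_)
```

Keeping the scaler inside the pipeline means each cross-validation fold is standardized using statistics from its own training portion only.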

Step 6 — Model Evaluation (Business Context)

Accuracy is misleading here — we need metrics that care about the minority class:

Precision: Of those predicted as responders, how many actually respond?
(Important for cost control — avoid wasting money on uninterested customers.)

Recall: Of all actual responders, how many did we catch?
(Important if missing a potential buyer is costly.)

F1-score: Balance between precision & recall.

ROC-AUC: Overall ranking ability of the model.

Precision–Recall AUC: More meaningful for imbalanced datasets.

For marketing:

If budget is tight → prioritize high precision.

If goal is maximum reach → prioritize high recall.

Step 7 — Threshold Tuning

By default, Logistic Regression uses 0.5 probability as the cutoff.
For imbalanced data, lowering the threshold (e.g., 0.2) can improve recall — test different cutoffs based on business goals.
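The effect of moving the cutoff can be illustrated with a few hypothetical predicted probabilities. In the sketch below, the labels, the probabilities, and the helper name `precision_recall_at` are all made up for illustration.

```python
def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, preds) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, preds) if y == 1 and yh == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted response probabilities for 10 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.70, 0.35, 0.25, 0.55, 0.30, 0.28, 0.22, 0.10, 0.05, 0.02]

for threshold in (0.5, 0.2):
    precision, recall = precision_recall_at(y_true, probs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, lowering the cutoff from 0.5 to 0.2 raises recall from one third to 1.0 while precision falls from 0.50 to about 0.43, which is the trade-off the business goal has to settle.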

Step 8 — Final Deployment Strategy

Deploy model to score incoming customers.

Sort customers by predicted probability of responding.

Marketing team can target the top X% based on campaign budget.

Regularly retrain the model as customer behavior changes.
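The scoring and targeting steps above can be sketched in a few lines of plain Python. The customer IDs, the probabilities, and the function name `select_top_fraction` are hypothetical. In practice the probabilities would come from the trained model's `predict_proba` output.

```python
import math

def select_top_fraction(scored_customers, fraction):
    """Return the customer IDs in the top `fraction` by predicted probability."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored_customers, key=lambda pair: pair[1], reverse=True)
    n_target = math.ceil(len(ranked) * fraction)   # round up so at least one is chosen
    return [customer_id for customer_id, _ in ranked[:n_target]]

# Hypothetical (customer_id, predicted response probability) pairs.
scores = [("c1", 0.02), ("c2", 0.41), ("c3", 0.15), ("c4", 0.77), ("c5", 0.08)]

print(select_top_fraction(scores, 0.4))   # ['c4', 'c2']
```

Ranking by probability lets the campaign budget decide the cutoff, rather than a fixed probability threshold.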