#Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

-- Logistic Regression:

It is a classification algorithm used to predict categorical outcomes (often binary, like 0/1, Yes/No, True/False).

Instead of predicting exact values, it predicts the probability that an observation belongs to a particular class.

Uses the logistic (sigmoid) function to map predicted values to a range between 0 and 1:

$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$


Linear Regression:

It is a regression algorithm used to predict continuous numerical outcomes.

Predicts the value of a dependent variable using a linear relationship:

$$Y = \beta_0 + \beta_1 X + \epsilon$$

Key Differences:

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Output | Continuous values | Probability (0–1) / class label |
| Purpose | Regression (predict value) | Classification (predict class) |
| Equation | Linear | Sigmoid of linear equation |
| Error Measure | MSE (Mean Squared Error) | Log-Loss / Cross-Entropy |


#Question 2: Explain the role of the Sigmoid function in Logistic Regression.


-- Role of the Sigmoid Function in Logistic Regression:

Definition:
The sigmoid (or logistic) function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


Where $z = \beta_0 + \beta_1 X$.

Maps Any Value to Probability:

The linear combination $z = \beta_0 + \beta_1 X$ can be any real number in $(-\infty, +\infty)$.

The sigmoid transforms $z$ into a value between 0 and 1, which can be interpreted as the probability of the positive class.

Decision Boundary:

Typically, if $\sigma(z) \ge 0.5$, predict class 1; otherwise, predict class 0.

This threshold allows Logistic Regression to classify data points.
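
The mapping and the 0.5 threshold described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original answer: the coefficients `beta0` and `beta1` and the input values are hypothetical, chosen only to show scores on both sides of the boundary.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores z to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 2.0

x = np.array([-3.0, 0.0, 0.5, 3.0])
z = beta0 + beta1 * x             # linear score: can be any real number
p = sigmoid(z)                    # probability of class 1
labels = (p >= 0.5).astype(int)   # default 0.5 decision threshold

for xi, zi, pi, li in zip(x, z, p, labels):
    print(f"x={xi:+.1f}  z={zi:+.1f}  p={pi:.3f}  class={li}")
```

Because the sigmoid equals 0.5 exactly at z = 0, thresholding the probability at 0.5 is the same as checking the sign of the linear score.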

Non-linearity:

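The sigmoid is a non-linear mapping from the linear score to a probability. The decision boundary itself (where the score equals 0) remains linear in the input features, so Logistic Regression needs engineered features (e.g., polynomial or interaction terms) to separate classes that are not linearly separable.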

#Question 3: What is Regularization in Logistic Regression and why is it needed?

-- Regularization in Logistic Regression:

Definition:
Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. It discourages the model from fitting the training data too perfectly, which helps it generalize better to unseen data.

Why It’s Needed:

Logistic Regression can overfit when there are many features or when features are highly correlated.

Overfitting leads to high variance, meaning the model performs well on training data but poorly on new data.

Regularization helps simplify the model by shrinking coefficients.

Types of Regularization:

L1 Regularization (Lasso): Adds the sum of absolute values of coefficients as a penalty. Can produce sparse models (some coefficients become zero).

$$\text{Loss} = -\text{Log-Likelihood} + \lambda \sum_j |\beta_j|$$

L2 Regularization (Ridge): Adds the sum of squared coefficients as a penalty. Shrinks coefficients but usually keeps all of them.

$$\text{Loss} = -\text{Log-Likelihood} + \lambda \sum_j \beta_j^2$$


Effect:

Reduces overfitting.

Improves model generalization.

Controls the magnitude of coefficients.
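
The shrinkage and sparsity effects listed above can be checked with a small experiment. The sketch below is an illustration under stated assumptions: the data is synthetic (`make_classification`), and the `C` values are arbitrary. Note that scikit-learn parameterizes the penalty through `C`, the inverse of λ, so a smaller `C` means a stronger penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X = StandardScaler().fit_transform(X)

# In scikit-learn, C = 1 / lambda: smaller C means stronger regularization.
l2_weak = LogisticRegression(penalty="l2", C=100.0, solver="liblinear").fit(X, y)
l2_strong = LogisticRegression(penalty="l2", C=0.01, solver="liblinear").fit(X, y)
l1_strong = LogisticRegression(penalty="l1", C=0.01, solver="liblinear").fit(X, y)

# Compare coefficient magnitudes and count exact zeros.
zeros_l2 = int(np.sum(l2_strong.coef_ == 0))
zeros_l1 = int(np.sum(l1_strong.coef_ == 0))

print("L2 norm of coefficients, weak penalty:  ", np.linalg.norm(l2_weak.coef_))
print("L2 norm of coefficients, strong penalty:", np.linalg.norm(l2_strong.coef_))
print("Zero coefficients with L2:", zeros_l2)
print("Zero coefficients with L1:", zeros_l1)
```

The stronger L2 penalty should yield a smaller coefficient norm, while the L1 fit is the one expected to set some coefficients exactly to zero.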

#Question 4: What are some common evaluation metrics for classification models, and why are they important?

-- Common Evaluation Metrics for Classification Models:

Accuracy

Definition: Percentage of correct predictions out of total predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$


Use: Good for balanced datasets.

Limitation: Misleading for imbalanced datasets.

Precision

Definition: Out of all predicted positives, how many are actually positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$


Use: Important when false positives are costly (e.g., spam detection).

Recall (Sensitivity)

Definition: Out of all actual positives, how many were correctly predicted.

$$\text{Recall} = \frac{TP}{TP + FN}$$


Use: Important when false negatives are costly (e.g., disease detection).

F1-Score

Definition: Harmonic mean of Precision and Recall.

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$


Use: Balances Precision and Recall, useful for imbalanced datasets.

ROC-AUC (Receiver Operating Characteristic - Area Under Curve)

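Definition: Measures the model's ability to distinguish between classes across all thresholds. AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative.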

Use: Higher AUC = better separation between positive and negative classes.
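
As a worked example of the four count-based formulas above, the sketch below computes them from a confusion matrix. The counts are hypothetical, picked to resemble an imbalanced problem (60 positives out of 1000).

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.970
print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.667
print(f"F1:        {f1:.3f}")         # 0.727
```

Accuracy is 97% even though a third of the positives are missed, which is why accuracy alone is misleading on imbalanced data.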

Why Metrics Are Important:

They provide different perspectives on model performance.

Help choose the right model for the problem.

Essential for imbalanced datasets, where accuracy alone is misleading.



In [1]:
#Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package.) (Include your Python code and output in the code box below.)

# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# For simplicity, make it a binary classification (class 0 vs class 1)
X = X[y != 2]
y = y[y != 2]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0


In [2]:
#Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package.) (Include your Python code and output in the code box below.)


# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

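# Print model coefficients
# Note: iris has 3 classes, so model.coef_ has one row per class;
# row 0 holds the coefficients for class 0 (setosa).
print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")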

# Make predictions and print accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)


Model Coefficients:
sepal length (cm): -0.4054
sepal width (cm): 0.8689
petal length (cm): -2.2779
petal width (cm): -0.9568

Accuracy: 1.0


In [3]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package.) (Include your Python code and output in the code box below.)

# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Logistic Regression model with One-vs-Rest strategy
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print classification report
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)


              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.85      0.92        13
   virginica       0.87      1.00      0.93        13

    accuracy                           0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
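
A note on portability: recent scikit-learn releases emit a deprecation warning for the `multi_class` argument (this is an observation about newer versions, not something the exercise states). A sketch of the same One-vs-Rest strategy using the explicit `OneVsRestClassifier` wrapper, assuming the same split as above, could look like this:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit one binary Logistic Regression per class (class k vs. the rest).
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred_ovr = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred_ovr, target_names=iris.target_names))
```

The fitted wrapper exposes the per-class binary models in `ovr_model.estimators_`, one per iris species.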





In [4]:
#Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package.) (Include your Python code and output in the code box below.)

# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Logistic Regression model
model = LogisticRegression(solver='liblinear', max_iter=200)  # liblinear supports both L1 and L2

# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Apply GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Print best parameters
print("Best Parameters:", grid.best_params_)

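# Evaluate on the held-out test set.
# Note: the value printed here is test-set accuracy; the mean cross-validated
# (validation) accuracy of the best model is available as grid.best_score_.
y_pred = grid.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Validation Accuracy:", accuracy)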


Best Parameters: {'C': 10, 'penalty': 'l2'}
Validation Accuracy: 1.0


In [5]:
# Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling. (Use Dataset from sklearn package.) (Include your Python code and output in the code box below.)


# Import libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --------- Logistic Regression WITHOUT scaling ---------
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# --------- Logistic Regression WITH standardization ---------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print results
print("Accuracy WITHOUT scaling:", accuracy_no_scaling)
print("Accuracy WITH scaling:", accuracy_scaled)


Accuracy WITHOUT scaling: 1.0
Accuracy WITH scaling: 1.0
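
A single 45-sample test split cannot separate two models that both score 1.0. As a complementary check, the sketch below repeats the comparison with 5-fold cross-validation and wraps the scaler in a pipeline so it is refit on each training fold. The fold count and the use of `make_pipeline` are choices made for this illustration, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

unscaled = LogisticRegression(max_iter=200)
# The pipeline fits StandardScaler on each training fold only,
# so no information from the validation fold leaks into the scaling.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scores_scaled = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print("CV accuracy WITHOUT scaling:", scores_unscaled.mean())
print("CV accuracy WITH scaling:   ", scores_scaled.mean())
```

On iris the features share similar units and ranges, so only a small difference is expected; scaling matters more when features differ by orders of magnitude.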


## Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

-- 1. Data Handling

Collect and clean data: Ensure there are no missing values, duplicates, or erroneous entries.

Feature selection/engineering: Create meaningful features (e.g., past purchase history, browsing behavior, demographics).

Categorical encoding: Convert categorical variables into numerical format (e.g., One-Hot Encoding, Target Encoding).

2. Feature Scaling

Scale numerical features using StandardScaler or MinMaxScaler.

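Logistic Regression benefits from scaling because it speeds up solver convergence and, with L1/L2 regularization, ensures the penalty treats all coefficients comparably. It also makes coefficient magnitudes easier to compare across features.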

3. Handle Class Imbalance

Since only 5% of customers respond:

Option 1: Resampling

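Oversample minority class: e.g., using SMOTE (Synthetic Minority Oversampling Technique). Apply it to the training data only (inside each cross-validation fold) so that synthetic samples do not leak into validation or test sets.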

Undersample majority class: Randomly reduce non-responders.

Option 2: Use class weights

Logistic Regression allows class_weight='balanced', which weights each class inversely to its frequency so that misclassifying a minority-class sample (a responder) is penalized more heavily.
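
To make the 'balanced' weighting concrete, the sketch below computes the weights scikit-learn would assign for a 5% response rate. The label vector is hypothetical (950 non-responders, 50 responders); the weights follow n_samples / (n_classes × class count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels matching a 5% response rate.
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)

# weight_k = n_samples / (n_classes * count_k)
print("Weight for non-responders (0):", round(weights[0], 4))  # 1000 / (2 * 950)
print("Weight for responders (1):    ", round(weights[1], 4))  # 1000 / (2 * 50)
```

Each responder therefore counts 19 times as much as a non-responder in the loss, which offsets the 19:1 class ratio.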

4. Model Building & Hyperparameter Tuning

Train Logistic Regression with L1 or L2 regularization.

Tune hyperparameters using GridSearchCV or RandomizedSearchCV:

C (inverse of regularization strength; smaller C means stronger regularization)

penalty (L1 or L2)

solver compatible with your penalty

Include stratified cross-validation to maintain class distribution in folds.
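
The tuning steps above can be combined in one pipeline. The following is a sketch under stated assumptions: the data is a synthetic stand-in for the campaign dataset (about 5% positives), the grid values are arbitrary, and average precision is used as the selection metric because accuracy is uninformative at this imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)

# Scaling lives inside the pipeline so it is refit on every training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear",  # supports both L1 and L2
                               class_weight="balanced", max_iter=200)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

# Stratified folds keep the ~5% positive rate in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV average precision:", round(search.best_score_, 3))
```

Swapping `scoring` for `"recall"` or `"f1"` is a reasonable alternative when the business goal weights missed responders differently.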

5. Model Evaluation Metrics

Accuracy is not reliable due to imbalance. Use:

Precision: Of predicted responders, how many are actual responders?

Recall (Sensitivity): Of all actual responders, how many did we identify?

F1-score: Harmonic mean of precision and recall.

ROC-AUC: Measures how well the model separates responders vs. non-responders.

PR-AUC (Precision-Recall AUC): Especially useful for imbalanced datasets.
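
The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on a synthetic imbalanced dataset, which is an assumption standing in for the real campaign data. ROC-AUC and PR-AUC are computed from predicted probabilities, not from hard class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=4,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Probability of the positive class (responder) for each test customer.
scores = model.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, scores)
pr_auc = average_precision_score(y_test, scores)  # summary of the PR curve

# Precision, recall and F1 per class at the default 0.5 threshold.
print(classification_report(y_test, model.predict(X_test), digits=3))
print("ROC-AUC:", round(roc_auc, 3))
print("PR-AUC (average precision):", round(pr_auc, 3))
```

A useful reference point: a model that ranks at random scores about 0.5 on ROC-AUC, but only about the positive rate (here roughly 0.05) on PR-AUC.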

6. Deployment Considerations

Threshold tuning: Adjust the probability threshold for predicting a responder to optimize business goals (e.g., maximize campaign ROI).

Monitor model drift: Customer behavior may change over time, so periodically retrain the model.

Explainability: Use feature importance or coefficients to understand what drives responses, which helps marketing strategy.
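
Threshold tuning can be sketched with `precision_recall_curve`. The labels and probabilities below are hypothetical, and F1 is used only as a stand-in objective; in practice the threshold would be chosen against a business metric such as expected campaign ROI.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical held-out labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / (p + r + 1e-12)   # small constant avoids division by zero

best = int(np.argmax(f1))
print(f"Best threshold: {thresholds[best]:.2f}")
print(f"Precision: {p[best]:.2f}  Recall: {r[best]:.2f}  F1: {f1[best]:.3f}")

# Apply the tuned threshold instead of the default 0.5.
y_pred_tuned = (y_prob >= thresholds[best]).astype(int)
```

With these numbers the selected threshold is 0.40, which recovers all three responders at the cost of one false positive; the default 0.5 would miss the responder scored at 0.40.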

✅ Summary Approach:

Clean and engineer features.

Scale numerical features.

Handle imbalance via SMOTE or class weights.

Train Logistic Regression with cross-validation and hyperparameter tuning.

Evaluate using precision, recall, F1-score, ROC-AUC, and PR-AUC.

Deploy with threshold tuning and monitor performance.