**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**
Logistic Regression is a type of machine learning algorithm that we use when the outcome we want to predict has only two possibilities — for example, whether an email is spam or not, whether a person will buy a product or not, or whether a student will pass or fail.

Even though the name has the word “regression,” it is actually used for classification, not for predicting continuous values.

How it works

Logistic regression studies the relationship between the input variables (such as age, income, or study hours) and the output variable (like pass or fail).
However, instead of giving a direct number as a result, it predicts a probability — a value between 0 and 1.

If the probability is closer to 1, the model predicts “yes.”
If it is closer to 0, the model predicts “no.”

To keep the output within this 0 to 1 range, logistic regression uses a mathematical function called the sigmoid function, which bends a straight line into an S-shaped curve.

How it differs from Linear Regression

Linear Regression is used when we want to predict continuous values — for example, predicting a person’s height based on their age, or predicting the price of a house.
It can output any number, positive or negative.

Logistic Regression, on the other hand, is used when we want to predict categories — like whether something will happen or not.
It does not give exact numbers but rather the probability that something belongs to a particular class.
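
The contrast can be sketched on a small, made-up dataset. The hours and pass/fail values below are hypothetical, chosen only to illustrate the point: a fitted linear regression returns unbounded numbers, while a fitted logistic regression returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied and whether the student passed (1) or failed (0).
hours = np.arange(1, 11).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

linear = LinearRegression().fit(hours, passed)
logistic = LogisticRegression().fit(hours, passed)

new_hours = np.array([[0], [5], [20]])

# Linear regression output is unbounded, so extreme inputs fall outside [0, 1].
linear_output = linear.predict(new_hours)

# Logistic regression returns the probability of the positive class, always within [0, 1].
logistic_output = logistic.predict_proba(new_hours)[:, 1]

print("Linear output:  ", linear_output.round(2))
print("Logistic output:", logistic_output.round(2))
```

For 0 and 20 hours the linear model produces values below 0 and above 1, which cannot be read as probabilities, whereas the logistic model stays inside the valid range.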

**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**

The Sigmoid function is the heart of logistic regression. Its main job is to take any number the model produces and turn it into something that looks like a probability — a value between 0 and 1.

Why we need it

When a logistic regression model makes a prediction, it first calculates a value using a simple linear equation, something like $b_0 + b_1 x_1 + b_2 x_2$.
This number can be anything: very small, very large, positive, or negative.
But we cannot use that directly because probabilities can never be less than 0 or greater than 1.

That's where the Sigmoid function helps. It takes this raw number and compresses it into a range between 0 and 1.
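
For reference, the standard form of the function is shown below, where $z$ stands for the raw number from the linear equation (the symbol $z$ is introduced here only for illustration):

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = b_0 + b_1 x_1 + b_2 x_2
```

Because $e^{-z}$ is always positive, the denominator is always greater than 1, so the result always lies strictly between 0 and 1.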

How it works

The Sigmoid function has an S-shaped curve.

When the input is a large positive number, the output of the function is close to 1.

When the input is a large negative number, the output is close to 0.

When the input is around 0, the output is about 0.5.

So, the function naturally fits the idea of probability.
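
A minimal sketch of this behavior, assuming NumPy is available; the input values are arbitrary and chosen to show the three regions of the curve:

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary inputs: large negative, moderate, zero, and large positive.
inputs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(inputs)

for z, p in zip(inputs, outputs):
    print(f"z = {z:6.1f}  ->  sigmoid(z) = {p:.4f}")
```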

In logistic regression

After calculating the linear part, the model passes the result through the Sigmoid function.
The output we get is the probability that the given input belongs to a particular class.

Usually, we set a cutoff point at 0.5:

If the probability is greater than 0.5, we predict the class as “1” or “Yes.”

If it is less than 0.5, we predict the class as “0” or “No.”

Example

Imagine a model that predicts whether a student will pass an exam based on how many hours they study.
The raw output might be something like 2.4, which doesn't mean much on its own.
When we apply the Sigmoid function to this value, it converts it to 0.92 — meaning there is a 92% chance the student will pass.
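
The arithmetic in this example can be checked with a few lines of Python. The helper name `predict_pass` is hypothetical, and the 0.5 cutoff is the default described above:

```python
import math

def predict_pass(raw_score, cutoff=0.5):
    """Return (probability, label) for a raw linear score."""
    probability = 1.0 / (1.0 + math.exp(-raw_score))
    label = 1 if probability > cutoff else 0
    return probability, label

probability, label = predict_pass(2.4)
print(f"Probability of passing: {probability:.2f}")  # 0.92
print("Predicted class:", label)                     # 1
```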

**Question 3: What is Regularization in Logistic Regression and why is it needed?**

Understanding the need for it

When we train a logistic regression model, it tries to find the best coefficients (weights) for each input variable so that it can make accurate predictions.
Sometimes, especially when the data has many features or noise, the model starts fitting too closely to the training data. It tries to capture every small detail, even random fluctuations that don’t actually represent real patterns.

As a result, the model learns the training data too well but fails to generalize — this is called overfitting.
Regularization helps solve this problem.

How it works

Regularization adds a penalty term to the model’s cost function.
This penalty discourages the model from assigning very large values to the coefficients.
When the coefficients are smaller, the model becomes simpler and more general, which improves its ability to perform well on new data.
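
One common way to write the penalized cost is sketched below. The notation is an assumption made for illustration: $p_i$ is the predicted probability for example $i$, $y_i$ is its true label, $b_j$ are the coefficients, $R$ is the penalty, and $\lambda$ controls how strong the penalty is.

```latex
J(b) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big] + \lambda \, R(b)
```

The first term rewards accurate probabilities and the second term grows as the coefficients grow, so minimizing $J$ trades fit against coefficient size. In scikit-learn the strength is set through `C`, which acts as the inverse of $\lambda$.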

There are mainly two types of regularization used in logistic regression:

L1 Regularization (Lasso)

It adds the sum of the absolute values of the coefficients as a penalty.

This can shrink some coefficients to exactly zero, effectively removing less important features from the model.

L2 Regularization (Ridge)

It adds the sum of the squared coefficients as a penalty.

This makes the coefficients smaller but rarely zero.

It helps reduce the influence of less important features without completely removing them.
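
The difference between the two penalties can be observed by counting coefficients that end up exactly zero. This is a rough sketch, assuming scikit-learn's built-in breast cancer dataset (30 features) and an arbitrarily chosen `C=0.1`; the exact counts depend on the data and on `C`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # penalties are sensitive to feature scale

# C is the inverse of regularization strength: smaller C means a stronger penalty.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

l1_zeros = int(np.sum(l1_model.coef_ == 0))
l2_zeros = int(np.sum(l2_model.coef_ == 0))

print("Coefficients set to zero with L1:", l1_zeros)
print("Coefficients set to zero with L2:", l2_zeros)
```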

Why it is needed

To avoid overfitting and make the model generalize better.

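To reduce the effect of multicollinearity (features that are highly correlated with each other), since the penalty keeps the coefficients of correlated features from becoming large and unstable.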

To simplify the model by reducing unnecessary complexity.

To improve prediction accuracy on new, unseen data.

**Question 4: What are some common evaluation metrics for classification models, and why are they important?**

When we build a classification model, like logistic regression, we need to check how well it actually performs.
Just because a model gives some predictions doesn’t mean they’re always reliable — so we use evaluation metrics to measure its performance.
These metrics help us understand where the model is doing well and where it’s making mistakes.

1. Accuracy

Accuracy tells us how many predictions the model got right overall.
If out of 100 test cases the model correctly predicts 90, then its accuracy is 90%.

Accuracy is simple and useful when the data is balanced (when both classes have roughly the same number of examples).
But if one class dominates — say 95% “no” and 5% “yes” — the model could just keep predicting “no” and still get 95% accuracy.
That’s why accuracy alone isn’t always enough.
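
The 95%/5% scenario can be reproduced with made-up labels. The sketch below builds a "model" that always predicts "no" and shows that its accuracy looks good while its recall on the "yes" class is zero.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 "no" (0) and 5 "yes" (1), matching the split described above.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a trivial model that always predicts "no"

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # share of actual "yes" cases that were found

print("Accuracy:", accuracy)  # 0.95
print("Recall:", recall)      # 0.0
```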

2. Precision

Precision answers the question: When the model predicted something as positive, how often was it correct?

For example, if a spam detector says 10 emails are spam and 8 actually are, the precision is 8 out of 10.
So precision focuses on how trustworthy the positive predictions are.
It’s especially important in situations where false positives are costly — like marking an important email as spam or diagnosing someone as sick when they’re not.

3. Recall (or Sensitivity)

Recall tells us how well the model finds all the actual positives.

For example, if there are 10 spam emails and the model correctly finds 8 of them, the recall is 8 out of 10.
Recall matters most when missing a positive case is serious — like failing to detect a disease or missing a fraudulent transaction.

4. F1 Score

Sometimes we want a balance between precision and recall.
The F1 Score does exactly that: it combines the two into one number, their harmonic mean.
The F1 score is high only when both precision and recall are high; if either one is low, the score drops with it.
It’s useful when the dataset is imbalanced or when we care equally about false positives and false negatives.
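
Precision, recall, and F1 can be computed directly from the counts. The sketch below reuses the spam numbers from the examples above and assumes, for illustration, that they describe the same test set: 8 true positives, 2 false positives, and 2 false negatives. The helper `precision_recall_f1` is a hypothetical name, and it assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)  # how many predicted positives were correct
    recall = tp / (tp + fn)     # how many actual positives were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```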

5. Confusion Matrix

A confusion matrix is a simple table that shows where the model got things right and where it went wrong.
It breaks predictions into four parts:

True Positives (correctly predicted “yes”)

True Negatives (correctly predicted “no”)

False Positives (predicted “yes” but actually “no”)

False Negatives (predicted “no” but actually “yes”)

Looking at this table helps us understand the types of mistakes the model makes.
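
A short sketch of how the four counts can be read from scikit-learn's `confusion_matrix`, using a handful of made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = "yes", 0 = "no").
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary 0/1 labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("True Positives:", tp)
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
```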

6. ROC Curve and AUC

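The ROC curve plots the true positive rate (recall) against the false positive rate at different decision thresholds, which shows how well the model separates the two classes.
The AUC (Area Under the Curve) summarizes the curve as a single score: 0.5 corresponds to random guessing, and the closer it is to 1, the better the model is at distinguishing between classes.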
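
As a small illustration, the curve and its area can be computed from predicted scores rather than 0/1 predictions. The four labels and scores below are made up:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print("False positive rates:", fpr)
print("True positive rates:", tpr)
print("AUC:", auc)  # 0.75
```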





In [2]:
# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Save dataset to CSV
df.to_csv("iris.csv", index=False)

# Load CSV into DataFrame
data = pd.read_csv("iris.csv")

# Keep only two classes for binary classification
data = data[data['target'] != 2]

X = data[iris.feature_names]
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Predicted values:", y_pred)
print("Actual values:", list(y_test))
print("Model Accuracy:", round(accuracy * 100, 2), "%")


Predicted values: [1 1 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0]
Actual values: [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0]
Model Accuracy: 100.0 %


In [3]:
# Question 6: Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
df = df[df['target'] != 2]

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='l2', solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Model Coefficients:", model.coef_)
print("Model Accuracy:", round(accuracy * 100, 2), "%")


Model Coefficients: [[-0.3753915  -1.39664105  2.15250857  0.96423532]]
Model Accuracy: 100.0 %


In [4]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import classification_report

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

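# Note: the multi_class parameter is deprecated in recent scikit-learn releases (1.5 and later);
# OneVsRestClassifier(LogisticRegression(...)) is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='liblinear')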
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





In [5]:
# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)

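best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
# Note: this is accuracy on the held-out test set. The mean cross-validated
# (validation) accuracy for the best parameters is stored in grid.best_score_.
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Validation Accuracy:", round(accuracy * 100, 2), "%")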


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Validation Accuracy: 100.0 %




In [6]:
# Question 9: Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model1 = LogisticRegression(max_iter=200)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
accuracy_without_scaling = accuracy_score(y_test, y_pred1)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model2 = LogisticRegression(max_iter=200)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred2)

print("Accuracy without scaling:", round(accuracy_without_scaling * 100, 2), "%")
print("Accuracy with scaling:", round(accuracy_with_scaling * 100, 2), "%")


Accuracy without scaling: 100.0 %
Accuracy with scaling: 100.0 %


**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**

1. Understand the data and problem

 - You have an imbalanced dataset where only 5% of customers respond to a marketing campaign.
 - This is a binary classification problem: Responded (Yes/No).
 - Since the classes are highly imbalanced, a simple logistic regression trained on raw data may always predict “No” and still appear 95% accurate. So careful handling is needed.

2. Data preprocessing

 - Check for missing values and handle them (impute or remove).

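 - Convert categorical features to numerical values: use one-hot encoding for unordered categories, and ordinal (label-style) encoding only when the categories have a natural order.

 - Feature scaling: Use StandardScaler so that all numeric features are on a comparable scale. This matters when features vary widely in scale (e.g., income vs. number of purchases), because the regularization penalty treats all coefficients alike. Fit the scaler on the training data only and then apply it to the test data.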
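
The preprocessing steps above can be combined into a single transformer. This is a sketch under stated assumptions: the customer table and its column names (`income`, `num_purchases`, `channel`) are hypothetical, and median imputation is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer table; column names and values are illustrative only.
customers = pd.DataFrame({
    "income": [42000.0, 58000.0, np.nan, 75000.0],
    "num_purchases": [3, 12, 7, 1],
    "channel": ["email", "app", "email", "web"],
})

numeric_cols = ["income", "num_purchases"]
categorical_cols = ["channel"]

preprocess = ColumnTransformer([
    # Numeric columns: fill missing values with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode, ignoring categories unseen during fitting.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# In a real workflow, fit on the training split only and call transform on the test split.
features = preprocess.fit_transform(customers)
print(features.shape)  # 2 numeric columns + 3 one-hot columns
```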

3. Handle class imbalance

 - Since only 5% of customers respond, use one of the following approaches:

 - Resampling techniques:

     - Oversampling the minority class using SMOTE (Synthetic Minority Over-sampling Technique).

     - Undersampling the majority class (if the dataset is very large).

 - Class weights in logistic regression: class_weight='balanced'. This tells the model to pay more attention to the minority class without changing the dataset.
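
To see what `class_weight='balanced'` means for a 5% response rate, the weights can be computed explicitly. The label array below is made up (950 non-responders, 50 responders); scikit-learn derives each weight as `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 non-responders (0) and 50 responders (1).
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print("Weight for non-responders:", round(weights[0], 3))
print("Weight for responders:", round(weights[1], 3))
```

Each responder counts about 19 times as much as a non-responder during training, which offsets the 19-to-1 imbalance.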

4. Train/Test split

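 - Split the data into training and test sets before fitting the scaler or resampling.

 - Use a stratified split so that the 5% response rate is preserved in both sets, and keep the test set untouched.

 - Balance only the training set using the techniques above; the test set should keep the real-world class distribution.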
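
A stratified split can be sketched as follows, with placeholder features and made-up labels at a 5% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 customers, 5% of whom responded.
X = np.arange(1000).reshape(-1, 1)  # placeholder feature column
y = np.array([0] * 950 + [1] * 50)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Response rate in train:", y_train.mean())
print("Response rate in test:", y_test.mean())
```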

5. Model training with logistic regression

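 - Use L2 regularization (ridge) to prevent overfitting.

 - Set class_weight='balanced' to account for imbalance if the training set has not been resampled; on a training set that is already balanced, this setting has no additional effect.

 - Train the logistic regression on the scaled training data.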

6. Hyperparameter tuning

 - Use GridSearchCV or RandomizedSearchCV to tune important parameters:

     - C (inverse of regularization strength; smaller values mean stronger regularization)

     - penalty (L1 or L2)

     - solver appropriate for your penalty and dataset size

 - Use cross-validation with stratified folds to maintain class distribution.
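
The tuning step might look like the sketch below. Several things here are assumptions made for illustration: the data is synthetic (generated with `make_classification` at roughly a 5% positive rate), the grid values are arbitrary, and average precision is used as the scoring metric because accuracy is uninformative on imbalanced data. Placing the scaler inside the pipeline means it is refit on each training fold only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data, with about 5% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear")),
])

# Parameter names are prefixed with the pipeline step name ("clf__").
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean cross-validated average precision:", round(search.best_score_, 3))
```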

7. Model evaluation

 - Accuracy is misleading for imbalanced data — a 95% accuracy may just reflect the majority class.

 - Use metrics that focus on minority class performance:

     - Precision: Of all customers predicted to respond, how many actually responded?

     - Recall (Sensitivity): Of all customers who responded, how many did the model identify?

     - F1 Score: Balance between precision and recall.

     - ROC-AUC: Measures the model’s ability to distinguish responders from non-responders.

     - Confusion matrix: To visualize true positives, false positives, true negatives, and false negatives.

8. Final deployment considerations

 - Test the model on new data to ensure it generalizes.

 - Use the predicted probabilities (not just the 0/1 class) to rank customers for marketing campaigns.

 - Continuously monitor model performance as customer behavior changes over time.
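
Ranking customers by predicted probability can be sketched as follows. The data is synthetic, the held-out portion stands in for "new" customers, and the campaign size `top_k = 30` is an arbitrary, hypothetical budget.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer data with about 5% responders.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.95, 0.05], random_state=0
)
X_train, X_new, y_train, _ = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Probability of the positive class ("responds") for each new customer.
response_prob = model.predict_proba(X_new)[:, 1]

# Indices of the top_k customers, ordered from most to least likely to respond.
top_k = 30
ranked = np.argsort(response_prob)[::-1][:top_k]

print("Top customer indices:", ranked[:5])
print("Their predicted probabilities:", response_prob[ranked[:5]].round(3))
```

With `class_weight='balanced'` the probabilities are shifted upward relative to the true response rate, so they are better suited to ranking than to being read as calibrated response rates.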
