In [None]:
##Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

"""
Answer: Logistic regression predicts the probability of a categorical outcome, such as a yes/no choice, whereas linear regression predicts a continuous numerical value.
The main difference is in their applications: linear regression is used to predict quantities (regression), whereas logistic regression is used to predict categories (classification).
Logistic regression passes a linear combination of the features through a sigmoid (S-shaped) function to produce a probability between 0 and 1, and is typically fitted by maximum likelihood (minimising log loss),
whereas linear regression uses the linear equation directly and is typically fitted by least squares to find the best-fit line.
"""

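A minimal sketch of the contrast described in Question 1, assuming scikit-learn is installed. The study-hours data below is made up purely for illustration: the same input feature is used once to predict a quantity and once to predict a category.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy data (hypothetical): hours studied, exam score, and pass/fail label
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
score = np.array([35, 42, 50, 55, 63, 70, 78, 85])   # continuous target
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # categorical target

lin = LinearRegression().fit(hours, score)
log = LogisticRegression().fit(hours, passed)

new = np.array([[6.5]])
predicted_score = lin.predict(new)[0]              # a quantity, not bounded to [0, 1]
pass_probability = log.predict_proba(new)[0, 1]    # a probability between 0 and 1
predicted_class = log.predict(new)[0]              # a category: 0 or 1

print(f"Linear regression, predicted score: {predicted_score:.1f}")
print(f"Logistic regression, P(pass): {pass_probability:.3f}, class: {predicted_class}")
```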
In [None]:
##Question 2: Explain the role of the Sigmoid function in Logistic Regression.


"""
Answer: The Sigmoid (or logistic) function converts the model's linear output into a probability value between 0 and 1.
This "S-shaped" curve squashes any real-valued input into the range (0, 1), which suits it to representing the probability of a binary outcome in classification tasks
and lets the model output the likelihood of an event occurring.
Logistic regression first computes a weighted sum of the input features plus a bias term, yielding a continuous value (call it z).
The z value is then passed through the Sigmoid function, defined as g(z) = 1 / (1 + e^(-z)); a threshold on g(z), commonly 0.5, turns the probability into a class label.
"""

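The two steps from Question 2 (weighted sum, then Sigmoid) can be sketched with NumPy. The weights, bias, and feature values here are hypothetical numbers chosen only to make the arithmetic easy to follow; they do not come from a trained model.

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)): maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and feature vector (illustration only)
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.0])

z = np.dot(w, x) + b   # weighted sum of the features plus the bias term
p = sigmoid(z)         # probability of the positive class

print(f"z = {z:.2f}, g(z) = {p:.4f}")
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # near 0, exactly 0.5, near 1
```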
In [None]:
##Question 3: What is Regularization in Logistic Regression and why is it needed?

"""
Answer: Regularisation in Logistic Regression is a strategy for preventing overfitting by adding a penalty term (for example, an L1 or L2 penalty on the coefficients) to the model's cost function, which discourages large coefficient values.
This penalty helps the model balance fitting the training data against generalising to new, unseen data. It is required because, without it,
particularly with many features, the logistic regression model may become too complex and learn the noise in the training data,
resulting in poor performance on future predictions.
"""

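As a rough illustration of the penalty described in Question 3, the sketch below fits scikit-learn's LogisticRegression (L2 penalty by default) on the Iris data with three values of C and compares the size of the learned coefficients. The choice of dataset and of C values is an assumption made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# In scikit-learn, C is the inverse of the regularisation strength:
# a smaller C means a stronger penalty and therefore smaller coefficients.
norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:>6}: coefficient norm = {norms[C]:.3f}")
```

The norm should grow as C increases, which is the shrinkage effect the penalty is meant to produce.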

In [None]:
##Question 4: What are some common evaluation metrics for classification models, and why are they important?

"""
Answer: Accuracy, Precision, Recall (Sensitivity), F1-Score, and Specificity are common classification model evaluation metrics that measure different aspects of a model's performance,
such as how frequently it is correct (accuracy), the reliability of its positive predictions (precision), its ability to find all actual positive instances (recall),
the balance between precision and recall (F1-score, their harmonic mean), and its ability to correctly identify negative instances (specificity). These metrics are critical for evaluating a model's strengths and limitations,
selecting the best model for a given problem, and ensuring that it performs effectively in its intended application.
"""

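The metrics from Question 4 can all be read off a confusion matrix. This is a small sketch using scikit-learn; the labels and predictions are hypothetical values chosen so the counts are easy to verify by hand.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)     # (tp + tn) / total
precision = precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = recall_score(y_true, y_pred)         # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision and recall
specificity = tn / (tn + fp)                  # no dedicated scikit-learn helper

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}, "
      f"F1={f1:.2f}, Specificity={specificity:.2f}")
```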

In [5]:
##Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy.


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

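# Export Iris to a CSV file so that there is a CSV to load
# ("iris.csv" is an illustrative file name; substitute your own data file)
load_iris(as_frame=True).frame.to_csv("iris.csv", index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv("iris.csv")
X = df.drop(columns="target")
y = df["target"]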

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")



Accuracy: 1.0000


In [6]:
##Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

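# Print model coefficients for the first class (setosa);
# coef_ has one row per class, so model.coef_[0] is only the setosa row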
print("Model coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

# Predict on test set
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")


Model coefficients:
sepal length (cm): -0.3935
sepal width (cm): 0.9625
petal length (cm): -2.3751
petal width (cm): -0.9987

Accuracy: 1.0000


In [7]:
##Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load dataset from sklearn
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

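# Train Logistic Regression model with one-vs-rest strategy
# (note: the multi_class parameter is deprecated since scikit-learn 1.5,
# where OneVsRestClassifier(LogisticRegression()) is the suggested replacement)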
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print classification report
report = classification_report(y_test, y_pred, target_names=data.target_names)
print("Classification Report:\n", report)


Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
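
On recent scikit-learn releases, where the multi_class parameter is deprecated, the same one-vs-rest strategy can be expressed with OneVsRestClassifier. The following is a sketch of that alternative under the same split and settings as above; it is not the cell that produced the report shown, so the numbers it prints are not guaranteed to match.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Wrap a binary logistic regression so that one classifier is fitted per class
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```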





In [8]:
##Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.


import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the Logistic Regression model
model = LogisticRegression(solver='liblinear', max_iter=200)

# Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Setup GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate on test set using best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {test_accuracy:.4f}")


Best parameters: {'C': 10, 'penalty': 'l1'}
Best cross-validation accuracy: 0.9583
Test set accuracy: 1.0000


In [10]:
##Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_scaling = LogisticRegression(max_iter=200)
model_scaling.fit(X_train_scaled, y_train)
y_pred_scaling = model_scaling.predict(X_test_scaled)
accuracy_scaling = accuracy_score(y_test, y_pred_scaling)


print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with scaling:    {accuracy_scaling:.4f}")


Accuracy without scaling: 1.0000
Accuracy with scaling:    1.0000


In [None]:
"""
Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.

Answer:
1. Data Collection and Cleaning: Collect relevant customer data such as demographics, purchase history, browsing habits, and past campaign responses.
   Handle missing values through imputation or removal, depending on their extent and importance. Remove duplicates and fix inconsistencies.
   Feature Engineering: Build informative features such as recency, frequency, and monetary value (RFM). Encode categorical variables using one-hot or target encoding.
   Consider interaction terms or polynomial features, if applicable.
   Train-Test Split: Divide the dataset into training and test sets (for example, an 80/20 split) using stratified sampling, so that both sets keep the 5% response rate.
2. Feature Scaling: Use StandardScaler or MinMaxScaler to put numerical features on a common scale. Scaling is not strictly required for Logistic Regression, but it speeds up convergence and lets the regularisation penalty treat all coefficients comparably.
   Fit the scaler on the training data only, then apply the same transformation to the test data.
3. Handling Class Imbalance: Because only 5% of customers respond, the dataset is highly imbalanced.
   Techniques for addressing imbalance:
   Resampling: Oversample the minority class with methods such as SMOTE, undersample the majority class to reduce bias toward it, or combine both (for example, SMOTE with Tomek links). Resample the training data only, never the test set.
   Class Weighting: Use the class_weight='balanced' parameter in Logistic Regression to penalise misclassification of the minority class more heavily.
   Anomaly Detection Perspective: Treat responders as anomalies and use specialised approaches if needed.
4. Model Training and Hyperparameter Tuning: Use Logistic Regression with L2 regularisation (ridge) as a baseline. Tune the following hyperparameters with GridSearchCV or RandomizedSearchCV and cross-validation:
   C (the inverse of the regularisation strength).
   Penalty type (l1, l2), if the solver supports it.
   Solver (liblinear, saga, etc.).
   Class weight (balanced or custom).
   Use stratified k-fold cross-validation to preserve the class distribution in every fold.
5. Model Evaluation: Accuracy is a poor metric here, because a model that predicts "no response" for every customer would already be 95% accurate.
   Use metrics that assess minority class performance, such as precision, recall, and F1-score (recall in particular, to capture responders).
   ROC-AUC (Area Under the Receiver Operating Characteristic Curve).
   PR-AUC (Area Under the Precision-Recall Curve), which is more informative than ROC-AUC for imbalanced data.
   A confusion matrix helps to quantify false positives and false negatives.
   Consider the business impact: false negatives (missed responders) may be more costly than false positives.
   Adjust the classification threshold to meet business objectives (for example, maximise recall while maintaining acceptable precision).
6. Deployment Considerations: Monitor model performance over time for data drift.
   Retrain on new data at regular intervals. Integrate model predictions into marketing workflows to create targeted campaigns. Provide stakeholders with explainability
   (for example, feature importance and coefficients).
"""
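
The steps above can be sketched end to end as follows. This is only an outline under stated assumptions: the data is a synthetic stand-in generated with make_classification (roughly 5% positives), the C grid and the 0.3 threshold are arbitrary example values, and class weighting is used instead of SMOTE so that the sketch needs nothing beyond scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           random_state=42)

# Stratified split keeps the response rate similar in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="average_precision",  # a PR-AUC style metric suited to imbalance
)
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
test_pr_auc = average_precision_score(y_test, proba)
print("Best C:", search.best_params_["clf__C"])
print(f"Test PR-AUC: {test_pr_auc:.3f}")

# A threshold below 0.5 trades precision for recall; 0.3 is an arbitrary example
threshold = 0.3
print(classification_report(y_test, (proba >= threshold).astype(int)))
```

In practice the threshold would be chosen from the precision-recall curve on validation data, guided by the relative cost of a missed responder versus a wasted contact.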

