# Question 1: What is Logistic Regression, and how does it differ from Linear Regression?
ANSWER-1
Logistic Regression

Logistic regression is a statistical model used when the dependent variable is categorical, most commonly binary (0/1). Rather than modeling the dependent variable directly, as linear regression does, it models the log-odds (logit) of the event probability as a linear function of the predictors:

$$\ln \left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

where:

$p$ is the probability of the event occurring,

$\frac{p}{1-p}$ is the odds of success.


Inverting the log-odds with the logistic (sigmoid) function gives the predicted probability, which always lies between 0 and 1:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \dots + \beta_nx_n)}}$$

Estimation is done using Maximum Likelihood Estimation (MLE).
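
The step from the log-odds equation to the sigmoid form is short and can be written out explicitly. Using the shorthand $z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ for the linear combination:

$$
\begin{aligned}
\ln\left(\frac{p}{1-p}\right) &= z \\
\frac{p}{1-p} &= e^{z} \\
p &= e^{z}\,(1-p) \\
p\,(1 + e^{z}) &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{aligned}
$$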


Linear Regression

Linear regression is a model used when the dependent variable is continuous. It assumes a linear relationship between the independent variables and the dependent variable:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

where:

$y$ is continuous,

$\epsilon$ is the error term, assumed to be normally distributed with constant variance.


Estimation is done using Ordinary Least Squares (OLS).
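
A small experiment can make the practical difference visible. The sketch below fits both models to the same 0/1 labels and evaluates them far outside the training range. The synthetic data and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic 1-D data (an assumption for illustration): the label is 1 when x > 0
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x.ravel() > 0).astype(int)

# OLS fitted directly on the 0/1 labels
linear = LinearRegression().fit(x, y)
# Penalized maximum likelihood (scikit-learn applies an L2 penalty by default)
logistic = LogisticRegression().fit(x, y)

# Evaluate both models well outside the training range
x_new = np.array([[-10.0], [0.0], [10.0]])
linear_pred = linear.predict(x_new)
logistic_prob = logistic.predict_proba(x_new)[:, 1]

print("Linear regression output:  ", linear_pred)
print("Logistic regression P(y=1):", logistic_prob)
```

The linear model's outputs at the extreme points fall below 0 and above 1, so they cannot be read as probabilities, while the logistic model's outputs stay within the unit interval.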






# Question 2: Explain the role of the Sigmoid function in Logistic Regression.
ANSWER-2
In Logistic Regression, the sigmoid function is used to map the linear combination of input features into a probability value between 0 and 1. Its mathematical form is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$


This ensures that the output of the model can be interpreted as the probability that the dependent variable belongs to the positive class (class 1). Values close to 0 indicate class 0, while values close to 1 indicate class 1. A threshold, commonly 0.5, then converts the probability into a class label, which is how logistic regression performs binary classification.
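
As a minimal sketch of these properties, the function below evaluates the sigmoid in a numerically stable way and applies a 0.5 threshold. The helper name `sigmoid` and the sample inputs are illustrative assumptions, not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function for a NumPy array z."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1 and cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p = sigmoid(z)
labels = (p >= 0.5).astype(int)  # thresholding at 0.5 is the same as testing z >= 0

print("Probabilities:", p)
print("Labels:", labels)
```

Two properties are worth noting: the sigmoid of 0 is exactly 0.5, and it is symmetric in the sense that the values at `z` and `-z` sum to 1.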


# Question 3: What is Regularization in Logistic Regression and why is it needed?

Answer:
Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the loss function. It discourages the model from assigning very large weights to features. Common forms are L1 (Lasso), which penalizes the sum of absolute coefficient values and can shrink some coefficients to exactly zero, and L2 (Ridge), which penalizes the sum of squared coefficients and shrinks them smoothly toward zero. It is needed because it improves generalization, reduces model complexity, and gives better performance on unseen data.
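
The effect of the penalty can be observed directly on the fitted coefficients. The following is a rough sketch on synthetic data (the dataset, the `C` values, and the variable names are assumptions picked for illustration). Note that in scikit-learn `C` is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 3 informative features and 7 pure-noise features
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# Default penalty is L2; a small C means a strong penalty
strong_l2 = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak_l2 = LogisticRegression(C=100, max_iter=1000).fit(X, y)
# liblinear is one of the solvers that supports the L1 penalty
strong_l1 = LogisticRegression(penalty="l1", C=0.01, solver="liblinear").fit(X, y)

norm_strong = np.linalg.norm(strong_l2.coef_)
norm_weak = np.linalg.norm(weak_l2.coef_)
n_zero_l1 = int(np.sum(strong_l1.coef_ == 0))
n_zero_l2 = int(np.sum(strong_l2.coef_ == 0))

print("L2 coefficient norm, C=0.01:", norm_strong)
print("L2 coefficient norm, C=100: ", norm_weak)
print("Coefficients exactly zero (L1 vs L2):", n_zero_l1, "vs", n_zero_l2)
```

The stronger L2 penalty yields a smaller coefficient norm, and the L1 fit is the one expected to produce coefficients that are exactly zero.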

# Question 4: What are some common evaluation metrics for classification models, and why are they important?

Answer:
Common evaluation metrics include:

Accuracy: Fraction of all predictions that are correct.

Precision: Proportion of correctly predicted positives among all predicted positives.

Recall (Sensitivity): Proportion of correctly predicted positives among all actual positives.

F1-Score: Harmonic mean of precision and recall, which balances the two.

ROC-AUC: Area under the ROC curve; measures how well the model's scores rank positives above negatives across all classification thresholds.


These metrics are important because they provide insights into different aspects of model performance, especially when the dataset is imbalanced.
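
A tiny hand-checkable example shows how these metrics can disagree. The labels and scores below are hypothetical values made up for illustration; the counts in the comments follow from them.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical labels and model scores: 3 positives, 7 negatives
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.15, 0.05, 0.25])
y_pred = (y_score >= 0.5).astype(int)  # gives TP=2, FN=1, FP=1, TN=6

acc = accuracy_score(y_true, y_pred)      # (2 + 6) / 10
prec = precision_score(y_true, y_pred)    # 2 / (2 + 1)
rec = recall_score(y_true, y_pred)        # 2 / (2 + 1)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)      # 20 of 21 positive/negative pairs ranked correctly

# Baseline that always predicts the majority (negative) class
always_negative = np.zeros_like(y_true)
baseline_acc = accuracy_score(y_true, always_negative)  # 7 / 10
baseline_rec = recall_score(y_true, always_negative)    # no positive is found

print("Accuracy:", acc, "Precision:", prec, "Recall:", rec, "F1:", f1, "ROC-AUC:", auc)
print("Majority-class baseline -> accuracy:", baseline_acc, "recall:", baseline_rec)
```

The baseline reaches 70% accuracy while finding none of the positives, which is why accuracy alone can mislead on imbalanced data.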

In [1]:
# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset from sklearn and convert to DataFrame
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# For binary classification, keep only two classes (0 and 1)
df = df[df['target'] != 2]

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0


In [2]:
# Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression with L2 Regularization (Ridge)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Print coefficients
print("Model Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predictions and Accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Model Coefficients: [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]
Intercept: [  9.00884295   1.86902164 -10.87786459]
Accuracy: 1.0


In [3]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

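# Logistic Regression for multiclass with One-vs-Rest (OvR).
# Note: the multi_class parameter is deprecated since scikit-learn 1.5;
# sklearn.multiclass.OneVsRestClassifier(LogisticRegression()) is the suggested replacement.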
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30





In [5]:
# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],       # Inverse of regularization strength (smaller C = stronger penalty)
    'penalty': ['l1', 'l2']             # L1 = Lasso, L2 = Ridge
}

# liblinear supports both L1 and L2 penalties; for multiclass targets it fits one-vs-rest models
log_reg = LogisticRegression(solver='liblinear', max_iter=1000)

# GridSearchCV
grid = GridSearchCV(log_reg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Validation accuracy (best cross-validation score)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Test accuracy on unseen data
y_pred = grid.best_estimator_.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583333333333334
Test Accuracy: 1.0


In [6]:
# Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Baseline: train and evaluate on the raw (unscaled) features
model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)
acc_without_scaling = accuracy_score(y_test, y_pred1)


# Standardize: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_scaled, y_train)
y_pred2 = model2.predict(X_test_scaled)
acc_with_scaling = accuracy_score(y_test, y_pred2)

# Print results
print("Accuracy without Scaling:", acc_without_scaling)
print("Accuracy with Scaling:", acc_with_scaling)

Accuracy without Scaling: 1.0
Accuracy with Scaling: 1.0


# Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you'd take to build a Logistic Regression model including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.
Answer:

To build a Logistic Regression model for predicting customer responses in an imbalanced dataset, the following approach can be taken:

1. Data Handling:

Collect and clean customer data (e.g., demographics, past purchases, browsing behavior).

Handle missing values, encode categorical variables, and remove irrelevant features.



2. Feature Scaling:

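Apply standardization (z-score scaling) so that all features are on the same scale. This matters because the regularization penalty acts on coefficient magnitudes, which depend on feature units, and it also helps the solver converge. Fit the scaler on the training data only and reuse it on validation and test data.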



3. Balancing Classes:

Since only 5% of customers respond, the dataset is highly imbalanced.

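Use techniques such as SMOTE (Synthetic Minority Oversampling Technique), undersampling of the majority class, or class weights (class_weight='balanced') in Logistic Regression so that the model does not simply predict the majority class. Apply any resampling to the training data only, never to the validation or test data.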



4. Hyperparameter Tuning:

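Use GridSearchCV with stratified folds to tune hyperparameters such as C (the inverse of regularization strength) and penalty (L1 or L2), scoring with a metric suited to imbalanced data (for example F1) rather than accuracy.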

This helps avoid overfitting and improves generalization.



5. Model Evaluation:

Accuracy is not a good metric in imbalanced data (predicting all customers as "no" would give ~95% accuracy).

Instead, use metrics such as Precision, Recall, F1-Score, and ROC-AUC to evaluate performance.

In this business case, Recall (Sensitivity) is very important (to capture as many potential responders as possible), but Precision is also relevant (to avoid wasting marketing budget on uninterested customers).
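
The steps above can be sketched end to end as follows. This is only an outline under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (no real campaign data is involved), class weighting is used instead of SMOTE, and average precision is chosen here as the tuning metric, which is one reasonable option rather than the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (assumption): roughly 5% responders
X, y = make_classification(n_samples=4000, n_features=12, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratify so train and test keep a similar responder rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so it is re-fit on each CV training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", class_weight="balanced",
                               max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],     # inverse of regularization strength
    "clf__penalty": ["l1", "l2"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
grid.fit(X_train, y_train)

# Evaluate once on the held-out test set
y_prob = grid.predict_proba(X_test)[:, 1]
y_pred = grid.predict(X_test)
test_roc_auc = roc_auc_score(y_test, y_prob)
test_pr_auc = average_precision_score(y_test, y_prob)

print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred, digits=3))
print("ROC-AUC:", test_roc_auc)
print("Average precision:", test_pr_auc)
```

In a real campaign setting, the 0.5 decision threshold used by `predict` would likely be replaced by a threshold chosen from `y_prob` to match the marketing budget and the desired recall.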