Python Questions
-

Q1] What is Logistic Regression, and how does it differ from Linear Regression?  
Ans] Logistic Regression is a supervised machine learning algorithm used for classification problems.  
It predicts the probability that a given input belongs to a certain class (category).  
Let’s say you want to predict whether a student will pass (1) or fail (0) based on study hours.  
Input: X = study hours  
Output: Y = 1 (pass) or 0 (fail)  
Instead of predicting a continuous number (like Linear Regression does), Logistic Regression predicts:  
The probability that Y = 1 (student passes).  

Since probabilities must be between 0 and 1, Logistic Regression uses a sigmoid function to “squash” the output of a linear equation into that range.  

The two models differ in the following ways:  
1 Linear Regression is used to predict a continuous numeric value, while Logistic Regression is used to predict a categorical outcome such as Yes/No or 0/1.  

2 In Linear Regression, the output can take any real value (from negative infinity to positive infinity), whereas in Logistic Regression, the output is always between 0 and 1, representing a probability.  

3 Linear Regression fits a straight line to the data, showing a linear relationship between the input and output, while Logistic Regression fits an S-shaped (sigmoid) curve to map values into probabilities.  

4 Linear Regression directly predicts a numeric value, but Logistic Regression predicts a probability, which is then converted into a class label using a threshold (commonly 0.5).  

5 Linear Regression has no decision boundary, because it outputs a number rather than a class. Logistic Regression cuts its sigmoid probability curve at a threshold, which yields a decision boundary that is linear in the input features.  

6 Linear Regression assumes a linear relationship between the independent and dependent variables, whereas Logistic Regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable.  

7 The error function (or loss function) used in Linear Regression is Mean Squared Error (MSE), while Logistic Regression uses Log Loss (Binary Cross-Entropy).  

8 Linear Regression is applied to problems like predicting house prices, sales, or temperature, while Logistic Regression is used for tasks like spam detection, disease diagnosis, or customer churn prediction.  

9 Finally, Linear Regression is meant for regression tasks (continuous outputs), and Logistic Regression is meant for classification tasks (discrete outputs).  
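The difference between the two loss functions named above can be made concrete with a small sketch. The labels and predicted probabilities below are made-up values for illustration, and the helper names are hypothetical rather than taken from any library.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual Linear Regression loss."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy, the usual Logistic Regression loss."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Illustrative data: true labels and two sets of predicted probabilities
labels = [1, 0, 1, 1]
mostly_right = [0.9, 0.1, 0.8, 0.7]
mostly_wrong = [0.1, 0.9, 0.2, 0.3]

print("MSE (mostly right):     ", mean_squared_error(labels, mostly_right))
print("Log loss (mostly right):", log_loss(labels, mostly_right))
print("Log loss (mostly wrong):", log_loss(labels, mostly_wrong))
```

Log loss grows sharply when the model assigns a low probability to the true class, which is why confidently wrong predictions are penalized much more than mildly wrong ones.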


2] Explain the role of the Sigmoid function in Logistic Regression?  
Ans] The Sigmoid function is a mathematical function that converts any real number (from −∞ to +∞) into a value between 0 and 1.  
In Logistic Regression, we want to predict the probability that a data point belongs to a certain class (e.g., 1 = yes, 0 = no).
But the linear equation $b_0 + b_1 x$ can produce any value: positive, negative, or zero.  
That’s not suitable for a probability (which must be between 0 and 1).  

So, the Sigmoid function “squashes” this linear output into the range (0, 1).  
1️⃣ Compute linear combination  
Logistic Regression first computes: $z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$  

2️⃣ Apply the Sigmoid function  
Then it applies the sigmoid transformation: $p = \sigma(z) = \frac{1}{1 + e^{-z}}$  

3️⃣ Interpret the result  
If $p > 0.5$: predict class 1  
If $p \le 0.5$: predict class 0  

4️⃣ Use in model training  
The model adjusts the coefficients $b_0, b_1, \dots$ to minimize the difference between predicted probabilities and actual class labels (using log loss).  
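The first three steps can be sketched in a few lines of plain Python, using the study-hours example from Q1. The coefficients `b0 = -4` and `b1 = 1` are assumed values chosen for illustration, not fitted ones, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a probability, in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)

def predict_proba(x, b0, b1):
    """Step 1: linear combination. Step 2: sigmoid."""
    z = b0 + b1 * x
    return sigmoid(z)

def predict_class(x, b0, b1, threshold=0.5):
    """Step 3: class 1 if p > threshold, otherwise class 0."""
    return 1 if predict_proba(x, b0, b1) > threshold else 0

# Assumed coefficients: the pass probability reaches 0.5 at 4 study hours
b0, b1 = -4.0, 1.0
for hours in [1, 4, 8]:
    p = predict_proba(hours, b0, b1)
    print(f"hours={hours}  p(pass)={p:.3f}  class={predict_class(hours, b0, b1)}")
```

With these assumed coefficients, 4 study hours gives exactly p = 0.5, which falls into class 0 under the "p > 0.5" rule used above.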


3] What is Regularization in Logistic Regression and why is it needed?  
Ans] Regularization is a technique used to prevent overfitting in a model by adding a penalty term to the model’s loss (cost) function.  
When training Logistic Regression (or any model), the algorithm tries to minimize a loss function — in this case, the log loss (or cross-entropy loss).

If we don’t control the model, it might:  
Fit too closely to the training data (overfitting)  
Learn noise or irrelevant patterns  
Perform poorly on new (test) data  
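To fix this, regularization adds an extra term that penalizes large weights (coefficients), forcing the model to stay simpler and generalize better.  

Types of Regularization  
L1 Regularization (Lasso): penalizes the sum of absolute coefficient values and can shrink some coefficients to exactly zero, which acts as feature selection.  
L2 Regularization (Ridge): penalizes the sum of squared coefficient values, shrinking all coefficients toward zero without usually eliminating any.  
Elastic Net: combines the L1 and L2 penalties, with a mixing ratio controlling the balance between them.  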
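As a rough sketch of how the penalty enters the loss, the snippet below adds an L1, L2, or Elastic Net term to the log loss. The strength `lam`, the `l1_ratio` mix, and the example weights are assumptions for illustration. Libraries scale these terms differently; scikit-learn, for instance, expresses the strength through `C`, the inverse of the regularization strength.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over the samples."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def penalized_loss(y_true, p_pred, weights, lam=0.1, kind="l2", l1_ratio=0.5):
    """Log loss plus a penalty on the size of the weights (intercept excluded)."""
    l1 = sum(abs(w) for w in weights)   # Lasso term
    l2 = sum(w * w for w in weights)    # Ridge term
    if kind == "l1":
        penalty = l1
    elif kind == "l2":
        penalty = l2
    elif kind == "elasticnet":
        penalty = l1_ratio * l1 + (1 - l1_ratio) * l2
    else:
        raise ValueError(f"unknown penalty kind: {kind}")
    return log_loss(y_true, p_pred) + lam * penalty

# Illustrative values only
labels = [1, 0, 1, 1]
probs = [0.9, 0.1, 0.8, 0.7]
weights = [3.0, -4.0]

for kind in ("l1", "l2", "elasticnet"):
    print(kind, round(penalized_loss(labels, probs, weights, kind=kind), 4))
```

The same predictions cost more when they come from larger weights, so the optimizer is pushed toward smaller coefficients unless the data clearly justifies large ones.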

4] What are some common evaluation metrics for classification models, and why are they important?  
Ans] Evaluation metrics are quantitative measures used to assess how well a classification model performs — that is, how accurately and reliably it predicts classes (like 0/1 or Yes/No).  

Accuracy  
The ratio of correctly predicted observations to the total observations.  
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  
TP = True Positives (correctly predicted 1s)  
TN = True Negatives (correctly predicted 0s)  
FP = False Positives (wrongly predicted 1s)  
FN = False Negatives (missed 1s)  

Precision  
Out of all the positive predictions, how many were actually correct?  
$\text{Precision} = \frac{TP}{TP + FP}$  
Measures how reliable positive predictions are.  
Important when false positives are costly (e.g., spam detection — you don’t want to mark genuine emails as spam).   

Recall (Sensitivity or True Positive Rate)  
Out of all actual positives, how many did the model correctly predict?  
$\text{Recall} = \frac{TP}{TP + FN}$  
Measures how comprehensive the model is in finding positives.  
Important when missing a positive case is costly (e.g., disease detection — you don’t want to miss an actual patient).  

F1-Score  
The harmonic mean of Precision and Recall.    
$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$  
Balances both Precision and Recall.  
Useful when you have imbalanced classes.  
A single metric to evaluate trade-off between FP and FN.  
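The four formulas above can be checked by hand with a small helper. The counts used here (TP=40, TN=45, FP=5, FN=10) are hypothetical, and `classification_metrics` is an illustrative name rather than a library function.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 predictions
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 0.85 while recall is only 0.80, a reminder that a single headline number can hide how many positives were missed.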

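ROC Curve (Receiver Operating Characteristic) & AUC (Area Under Curve)  
Plots True Positive Rate (Recall) vs False Positive Rate (1 - Specificity) for different thresholds.  
AUC summarizes the curve in a single number: 1.0 means the model ranks every positive above every negative, and 0.5 means it ranks no better than random guessing.  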
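A minimal sketch of both ideas: the (TPR, FPR) point for one threshold, and AUC computed as the fraction of positive/negative pairs that the scores rank correctly. The labels and scores are a small made-up example, the helper names are hypothetical, and both classes are assumed to be present.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)

def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]

print(rates_at_threshold(y, s, 0.5))  # one point on the ROC curve
print(rates_at_threshold(y, s, 0.3))  # a lower threshold moves along the curve
print(auc_pairwise(y, s))
```

Lowering the threshold raises the True Positive Rate and the False Positive Rate together, which is the trade-off the ROC curve traces out. The pairwise loop is quadratic, so it is only meant for small illustrations.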

Confusion Matrix  
Gives a complete picture of model performance.  
Helps calculate all other metrics like Precision, Recall, and F1-score.  


5] Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)  

In [4]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target


# Keep only classes 0 and 1 so the task is binary classification
df = df[df['target'] != 2]

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Model Accuracy: {accuracy:.2f}")

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Logistic Regression Model Accuracy: 1.00
[[12  0]
 [ 0  8]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00         8

    accuracy                           1.00        20
   macro avg       1.00      1.00      1.00        20
weighted avg       1.00      1.00      1.00        20



6] Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.


In [6]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Keep only classes 0 and 1 so the task is binary classification
df = df[df['target'] != 2]

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


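# penalty='l2' applies Ridge regularization; C is the inverse of the
# regularization strength (smaller C = stronger penalty)
model = LogisticRegression(penalty='l2', C=1.0, solver='liblinear')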

model.fit(X_train, y_train)

print("Model coefficients (weights for each feature):")
for feature, coef in zip(iris.feature_names, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

print(f"Intercept: {model.intercept_[0]:.4f}")

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nLogistic Regression Model Accuracy: {accuracy:.2f}")


Model coefficients (weights for each feature):
sepal length (cm): -0.3754
sepal width (cm): -1.3966
petal length (cm): 2.1525
petal width (cm): 0.9642
Intercept: -0.2564

Logistic Regression Model Accuracy: 1.00


7] Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)  

In [7]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

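# One-vs-Rest: one binary classifier is fitted per class.
# Note: the multi_class parameter is deprecated since scikit-learn 1.5.
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)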

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Classification Report for Multiclass Logistic Regression (OvR):\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report for Multiclass Logistic Regression (OvR):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
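Because recent scikit-learn releases deprecate the `multi_class` argument, one possible alternative is to wrap a binary Logistic Regression in `OneVsRestClassifier`. The sketch below assumes a reasonably recent scikit-learn and adds `stratify` to the split, so its numbers need not match the report above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()

# Stratify so each class keeps the same share in train and test
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# The wrapper trains one binary Logistic Regression per class
ovr_model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print("Binary models fitted:", len(ovr_model.estimators_))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The wrapper makes the One-vs-Rest strategy explicit instead of relying on an estimator argument, and each fitted binary model is available in `ovr_model.estimators_`.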





8] Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.


In [9]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

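# liblinear supports both the l1 and l2 penalties searched below
model = LogisticRegression(max_iter=500, solver='liblinear')

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],   # inverse regularization strength
    'penalty': ['l1', 'l2']         # Lasso vs Ridge penalty
}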

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, 
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.2f}")

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {test_accuracy:.2f}")


Best Hyperparameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.96
Test Set Accuracy: 1.00


9] Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.

In [12]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df[data.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


model_no_scaling = LogisticRegression(max_iter=5000, solver='liblinear')
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print(f"Accuracy WITHOUT scaling: {accuracy_no_scaling:.2f}")


# Fit the scaler on the training data only, then reuse it on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model_scaled = LogisticRegression(max_iter=5000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy WITH scaling: {accuracy_scaled:.2f}")


print("\nComparison:")
print(f"Without scaling: {accuracy_no_scaling:.2f}")
print(f"With scaling   : {accuracy_scaled:.2f}")


Accuracy WITHOUT scaling: 0.96
Accuracy WITH scaling: 0.97

Comparison:
Without scaling: 0.96
With scaling   : 0.97
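On a single split of this size, a 0.01 gap amounts to only a sample or two, so the comparison is more convincing when averaged over several folds. The sketch below is one way to do that, assuming scikit-learn's `make_pipeline` and `cross_val_score`; the scaler sits inside the pipeline so it is refitted on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same estimator settings as above, with and without a scaling step
unscaled_model = LogisticRegression(max_iter=5000, solver='liblinear')
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=5000, solver='liblinear'),
)

# 5-fold cross-validated accuracy for each variant
scores_unscaled = cross_val_score(unscaled_model, X, y, cv=5, scoring='accuracy')
scores_scaled = cross_val_score(scaled_model, X, y, cv=5, scoring='accuracy')

print(f"Without scaling: {scores_unscaled.mean():.3f} (+/- {scores_unscaled.std():.3f})")
print(f"With scaling   : {scores_scaled.mean():.3f} (+/- {scores_scaled.std():.3f})")
```

Putting the scaler in the pipeline also prevents the test fold from influencing the scaling statistics.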


10] Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business

In [15]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE


X, y = make_classification(
    n_samples=5000,            # synthetic stand-in for the customer table
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_clusters_per_class=1,
    weights=[0.95, 0.05],      # 95% non-responders, 5% responders
    flip_y=0,                  # no label noise
    random_state=42
)

df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df['target'] = y

print("Original class distribution:")
print(df['target'].value_counts())


X = df.drop('target', axis=1)
y = df['target']

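# Caveat: scaling and SMOTE are applied here before the train/test split, so
# the test set contains synthetic samples and statistics learned from test
# rows; the scores printed below are therefore optimistic.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)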


smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_scaled, y)

print("\nClass distribution AFTER SMOTE:")
print(pd.Series(y_res).value_counts())


X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42, stratify=y_res
)


param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'] 
}

grid = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    param_grid,
    cv=5,
    scoring='f1',   
    n_jobs=-1
)

grid.fit(X_train, y_train)
best_model = grid.best_estimator_

print("\nBest hyperparameters:")
print(grid.best_params_)


y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:,1]

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")


Original class distribution:
target
0    4750
1     250
Name: count, dtype: int64

Class distribution AFTER SMOTE:
target
0    4750
1    4750
Name: count, dtype: int64

Best hyperparameters:
{'C': 1, 'penalty': 'l1', 'solver': 'liblinear'}

Confusion Matrix:
[[936  14]
 [  1 949]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       950
           1       0.99      1.00      0.99       950

    accuracy                           0.99      1900
   macro avg       0.99      0.99      0.99      1900
weighted avg       0.99      0.99      0.99      1900

ROC-AUC Score: 1.00
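One way to turn the steps above into a leak-free workflow is sketched below, using only scikit-learn: split first with stratification so the test set keeps the real 5% response rate, put scaling inside a pipeline, handle imbalance with `class_weight='balanced'` instead of resampling, tune with stratified cross-validation, and judge the model with a metric suited to rare positives. The choice of average precision as the tuning metric is an assumption. If SMOTE is preferred, imbalanced-learn's own `Pipeline` can apply it to the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data as above: about 5% responders
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0, random_state=42
)

# 1. Split first, so the test set is untouched and still imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Scaling and the classifier are fitted together on training folds only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced",
                               solver="liblinear", max_iter=1000)),
])

# 3. Tune C and the penalty with stratified folds
param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
grid.fit(X_train, y_train)

# 4. Evaluate on the untouched, imbalanced test set
y_prob = grid.predict_proba(X_test)[:, 1]
y_pred = grid.predict(X_test)
test_ap = average_precision_score(y_test, y_prob)

print("Best hyperparameters:", grid.best_params_)
print(f"Test average precision: {test_ap:.3f}")
print(classification_report(y_test, y_pred))
```

For the business decision, the predicted probabilities in `y_prob` can be used to rank customers, and the 0.5 cut-off can be moved depending on how the cost of contacting a non-responder compares with the value of reaching a responder.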
