Question 1: What is Logistic Regression, and how does it differ from Linear
Regression?

Answer: Logistic Regression:
Logistic Regression is a supervised machine learning algorithm used primarily for classification problems—especially binary classification (i.e., predicting one of two possible outcomes such as yes/no, spam/ham, 0/1, disease/no disease, etc.).

Instead of predicting a continuous output (as in linear regression), logistic regression predicts the probability that a given input point belongs to a particular class.

The core idea is to use the logistic (sigmoid) function to map any real-valued number into the range (0, 1):

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

If the output probability is greater than a certain threshold (commonly 0.5), the instance is classified as class 1; otherwise, class 0.
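
The mapping from features to a class label described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sigmoid` and `predict_class`, and the coefficient values, are hypothetical and not taken from any fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_class(features, coefficients, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    p = sigmoid(z)
    return p, int(p > threshold)


# Hypothetical coefficients, chosen only for illustration (z works out to 1.0)
p, label = predict_class([2.0, 1.0], [0.8, -0.5], intercept=-0.1)
print(f"probability={p:.3f}, label={label}")
```

Raising `threshold` makes the model more conservative about predicting class 1, which trades recall for precision.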

| Aspect               | Linear Regression                                 | Logistic Regression                                    |
| -------------------- | ------------------------------------------------- | ------------------------------------------------------ |
| **Type of Problem**  | Regression (predicting continuous value)          | Classification (predicting class label or probability) |
| **Output Range**     | Real numbers (-∞ to ∞)                            | Probability (0 to 1)                                   |
| **Function Used**    | Linear function                                   | Sigmoid (logistic) function                            |
| **Model Equation**   | $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ | $p = \frac{1}{1 + e^{-z}}$                             |
| **Loss Function**    | Mean Squared Error (MSE)                          | Log Loss (Binary Cross-Entropy)                        |
| **Use Case Example** | Predicting house prices                           | Predicting if an email is spam or not                  |
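
To make the loss-function row of the table concrete, here is a small sketch comparing the two losses on the same predictions. The function names and the toy labels are assumptions for illustration; in practice `sklearn.metrics` provides equivalents.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual loss for linear regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy, the usual loss for logistic regression."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1]
confident_right = [0.9, 0.1, 0.9]
confident_wrong = [0.1, 0.9, 0.1]

for name, preds in [("right", confident_right), ("wrong", confident_wrong)]:
    print(f"{name}: MSE={mean_squared_error(y_true, preds):.3f}, "
          f"log loss={log_loss(y_true, preds):.3f}")
```

Log loss grows without bound as a confident prediction approaches the wrong label, whereas squared error on a probability is capped at 1. That is one reason log loss suits probability outputs better.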


Question 2: Explain the role of the Sigmoid function in Logistic Regression

Answer: In Logistic Regression, the model needs to predict the probability that a given input belongs to a particular class (usually binary: 0 or 1). The sigmoid function (also known as the logistic function) plays a central role in making this possible.

| Role                                         | Explanation                                                                                                                                                                                             |
| -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **1. Converts Linear Output to Probability** | The linear combination $z$ can take any value from $-\infty$ to $\infty$. The sigmoid function squashes this value into a **bounded range (0 to 1)**, which can be interpreted as a **probability**.    |
| **2. Enables Classification**                | After calculating the probability $\hat{y} = \sigma(z)$, logistic regression applies a threshold (usually 0.5). If $\hat{y} > 0.5$, predict class 1; else, predict class 0.                             |
| **3. Makes Optimization Feasible**           | The sigmoid function is **differentiable**, which allows optimization algorithms like **gradient descent** to efficiently minimize the loss (usually **log loss**, also called **binary cross-entropy**). |
| **4. Helps Interpret Model Output**          | The sigmoid output reads directly as a probability: a value of 0.9 means the model estimates a 90% chance that the input belongs to class 1.                                                            |
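
Role 3 can be illustrated with a minimal gradient-descent sketch. Because the sigmoid is differentiable, the gradient of the mean log loss with respect to each coefficient simplifies to the average of $(\hat{y} - y)\,x$. The function name `fit_logistic_1d`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration. This is not how scikit-learn's solvers are implemented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, lr=0.1, epochs=2000):
    """Batch gradient descent on mean log loss for p = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error (p - y) for every sample
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradient of the mean log loss w.r.t. intercept and slope
        grad_b0 = sum(errors) / n
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / n
        b0 -= lr * grad_b0
        b1 -= lr * grad_b1
    return b0, b1


# Toy data that is not perfectly separable, so the optimum is finite
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_1d(xs, ys)
print(f"intercept={b0:.3f}, slope={b1:.3f}")
```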


Question 3: What is Regularization in Logistic Regression and why is it needed?


Answer: Regularization is a technique used in logistic regression (and other machine learning models) to prevent overfitting by penalizing large or unnecessary coefficients in the model.

In simpler terms, regularization discourages the model from becoming too complex, ensuring it generalizes well to unseen data—not just memorizing the training data.
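
Written out, the penalty is an extra term added to the log loss. The notation below is an assumption for illustration: $m$ is the number of training samples, $\hat{y}_i = \sigma(z_i)$, and $\lambda \ge 0$ sets the penalty strength. The intercept $\beta_0$ is usually left out of the penalty.

$$J_{L2}(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big] + \lambda \sum_{j=1}^{n} \beta_j^2$$

$$J_{L1}(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big] + \lambda \sum_{j=1}^{n} |\beta_j|$$

The L2 (Ridge) penalty shrinks all coefficients toward zero, while the L1 (Lasso) penalty can set some of them exactly to zero. scikit-learn exposes the strength through `C`, the inverse of the regularization strength, so a smaller `C` means a stronger penalty.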

Why is Regularization Needed?
1. To Prevent Overfitting:

Logistic regression can easily overfit when there are too many features or if some features dominate the prediction due to large weights.

Overfitting means the model performs well on training data but poorly on new data.

2. To Improve Generalization:

Regularization helps the model focus on the most important features and ignore noise or irrelevant information.

3. To Handle Multicollinearity:

When features are highly correlated, the coefficient estimates become unstable; regularization stabilizes the model by reducing the variance of those estimates.
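
The shrinking effect can be observed directly. The sketch below fits the same model at three values of `C` and prints the size of the coefficient vector. The dataset, the specific `C` values, and the use of a scaling pipeline are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means stronger L2 regularization in scikit-learn
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=5000))
    pipe.fit(X, y)
    coef = pipe.named_steps["logisticregression"].coef_
    coef_norms[C] = float(np.linalg.norm(coef))

for C, norm in coef_norms.items():
    print(f"C={C:>6}: ||coefficients|| = {norm:.2f}")
```

The norm should grow as `C` increases, since a weaker penalty lets the coefficients move further from zero.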

Question 4: What are some common evaluation metrics for classification models, and
why are they important?

Answer: Common evaluation metrics for classification are accuracy (share of correct predictions), precision (TP / (TP + FP)), recall (TP / (TP + FN)), F1-score (harmonic mean of precision and recall), and AUC (area under the ROC curve). Taken together, they give more nuanced, situation-specific feedback than accuracy alone. Choosing the right metric ensures your model not only performs well statistically but also meets the real-world goals of your application.

| Reason                                           | Explanation                                                                                                                        |
| ------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------- |
| **1. Different Problems Need Different Metrics** | In medical diagnosis, a **false negative** could be fatal → **recall** is more important than accuracy.                            |
| **2. Accuracy Can Be Misleading**                | In imbalanced datasets (e.g., 95% class 0, 5% class 1), predicting all zeros gives 95% accuracy but **fails to detect positives**. |
| **3. Guides Model Improvement**                  | Metrics highlight model weaknesses (e.g., low recall), helping you **tune or redesign** the model.                                 |
| **4. Essential for Model Comparison**            | Helps choose between multiple models (e.g., compare F1 scores or AUC).                                                             |
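
The imbalanced-data case in row 2 is easy to reproduce. The sketch below builds the 95/5 split from the table and scores a model that always predicts class 0. The label arrays are synthetic and exist only for this illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 95 negatives and 5 positives; the "model" predicts 0 for everything
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, f1={f1:.2f}")
```

Accuracy looks strong while recall and F1 are zero, which is the failure the table describes.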


Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

# Split dataset into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train Logistic Regression model
model = LogisticRegression(max_iter=10000)  # Increased max_iter to avoid convergence issues
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")


Accuracy of the Logistic Regression model: 0.96
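
The question asks for a CSV file, while the cell above builds the DataFrame directly from the sklearn dataset. A minimal sketch of the CSV route follows: it writes the same dataset to a temporary CSV and reads it back with `pd.read_csv`. The file name `breast_cancer.csv` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a temporary CSV (hypothetical file name)
frame = load_breast_cancer(as_frame=True).frame  # features plus a 'target' column
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame, as the question asks
df_csv = pd.read_csv(csv_path)
X_csv = df_csv.drop(columns="target")
y_csv = df_csv["target"]

# Same split and model settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(X_csv, y_csv, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=10000)
clf.fit(X_tr, y_tr)

csv_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy from CSV-loaded data: {csv_accuracy:.2f}")
```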


Question 6: Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)

In [3]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with L2 regularization (default penalty='l2')
model = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)

# Train the model
model.fit(X_train, y_train)

# Print model coefficients
print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

# Predict and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy of the L2-Regularized Logistic Regression Model: {accuracy:.2f}")


Model Coefficients:
mean radius: 2.1325
mean texture: 0.1528
mean perimeter: -0.1451
mean area: -0.0008
mean smoothness: -0.1426
mean compactness: -0.4156
mean concavity: -0.6519
mean concave points: -0.3445
mean symmetry: -0.2076
mean fractal dimension: -0.0298
radius error: -0.0500
texture error: 1.4430
perimeter error: -0.3039
area error: -0.0726
smoothness error: -0.0162
compactness error: -0.0019
concavity error: -0.0449
concave points error: -0.0377
symmetry error: -0.0418
fractal dimension error: 0.0056
worst radius: 1.2321
worst texture: -0.4046
worst perimeter: -0.0362
worst area: -0.0271
worst smoothness: -0.2626
worst compactness: -1.2090
worst concavity: -1.6180
worst concave points: -0.6153
worst symmetry: -0.7428
worst fractal dimension: -0.1170

Accuracy of the L2-Regularized Logistic Regression Model: 0.96


Question 7: Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)

In [5]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load Iris dataset (3-class classification)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

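# Train Logistic Regression with One-vs-Rest (OvR) strategy
# Note: the multi_class parameter is deprecated in recent scikit-learn releases (1.5+);
# wrapping the estimator in OneVsRestClassifier is the suggested replacement.
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=1000)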
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report (One-vs-Rest):\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report (One-vs-Rest):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
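
Because `multi_class` is deprecated in newer scikit-learn releases, the same One-vs-Rest strategy can also be expressed with `OneVsRestClassifier`. This is a sketch of that alternative, not the code that produced the report above. It uses the default solver, so its scores may differ slightly from the `liblinear` run.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary logistic regression is fitted per class (class k vs. the rest)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)

ovr_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, ovr_pred)
print(classification_report(y_test, ovr_pred, target_names=iris.target_names))
```

The fitted binary models are available as `ovr_model.estimators_`, one per class.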





Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)

In [6]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2
}

# Initialize Logistic Regression
logreg = LogisticRegression(max_iter=1000)

# Apply GridSearchCV
grid = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best parameters and score
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.2f}")

# Evaluate on test data
y_pred = grid.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {test_accuracy:.2f}")


Best Parameters: {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validation Accuracy: 0.97
Test Set Accuracy: 0.98


Question 9: Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)

In [7]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ----------- Model without Scaling -----------
model_no_scaling = LogisticRegression(max_iter=1000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print(f"Accuracy without scaling: {accuracy_no_scaling:.2f}")

# ----------- Model with Standardization -----------
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression with scaled features
model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with standardization: {accuracy_scaled:.2f}")


Accuracy without scaling: 0.96
Accuracy with standardization: 0.97


Note: this cell also emitted a scikit-learn ConvergenceWarning, most likely from the unscaled model, whose solver reached max_iter=1000 before converging:

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
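
One way to keep scaling and model fitting together is a `Pipeline`, which re-fits the scaler on the training portion of each split so that no test-fold statistics leak into training. The sketch below is an illustration under that assumption. The 5-fold setup and the variable names are choices made here, not part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# StandardScaler is fitted inside each training fold, then applied to the held-out fold
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(scaled_logreg, X, y, cv=5, scoring="accuracy")

print(f"Mean 5-fold CV accuracy with scaling: {cv_scores.mean():.3f}")
```

A single train/test split, as used above, can make a 0.01 difference look meaningful. Averaging over folds gives a steadier comparison.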
