Assignment Code: DA-AG-011

Assignment: Logistic Regression

Ques.1 What is Logistic Regression, and how does it differ from Linear
Regression?

Ans: Logistic Regression is a classification algorithm used to predict the probability of a binary outcome (e.g., 0 or 1, Yes or No, True or False). It uses a logistic function (sigmoid function) to map the output of a linear equation to a probability value between 0 and 1.

Here's how it differs from Linear Regression:

- **Purpose:** Linear Regression predicts continuous numerical values, while Logistic Regression predicts categorical outcomes (specifically binary outcomes).
- **Output:** Linear Regression outputs an unbounded continuous value, while Logistic Regression outputs a probability between 0 and 1.
- **Function:** Linear Regression uses the linear function (y = mx + c) directly as its prediction, while Logistic Regression passes the output of that linear equation through the sigmoid function to obtain a probability.
- **Cost Function:** Linear Regression typically uses Mean Squared Error (MSE) as its cost function, while Logistic Regression uses log loss (also known as cross-entropy).
- **Decision Boundary:** Linear Regression has no decision boundary because it predicts a quantity rather than a class. Logistic Regression assigns a class by thresholding the predicted probability (0.5 by default), and that threshold defines the boundary between the two categories.
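
To make the cost-function contrast concrete, here is a minimal pure-Python sketch that scores the same predictions with MSE and with log loss. The labels, the probabilities, and the helper names `mse` and `log_loss` are made up for illustration; they are not scikit-learn's functions.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
# Third example: the true label is 1 but the model predicts 0.01.
confident_wrong = [0.9, 0.1, 0.01, 0.1]

for name, probs in [("right", confident_right), ("wrong", confident_wrong)]:
    print(f"{name}: MSE={mse(y_true, probs):.3f}  log loss={log_loss(y_true, probs):.3f}")
```

MSE is bounded by 1 for each example when the targets are 0/1, whereas log loss grows without bound as a confident prediction moves toward the wrong label, so the single confident mistake raises the log loss far more than it raises the MSE.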

Ques.2 Explain the role of the Sigmoid function in Logistic Regression?

Ans: The Sigmoid function (also known as the logistic function) plays a crucial role in Logistic Regression. Its purpose is to map the output of the linear equation to a probability value between 0 and 1.

Here's a breakdown of its role:

1.  **Transformation:** The Sigmoid function takes any real-valued number as input and transforms it into a value between 0 and 1. This is essential because we want to predict a probability, which must lie within this range.
2.  **Probability Interpretation:** The output of the Sigmoid function can be interpreted as the probability that the input belongs to a particular class (usually the positive class). A value close to 1 indicates a high probability of belonging to the positive class, while a value close to 0 indicates a high probability of belonging to the negative class.
3.  **Decision Boundary:** The Sigmoid function helps in establishing a decision boundary. By default, if the output of the Sigmoid function is greater than or equal to 0.5, the prediction is classified as the positive class, and if it's less than 0.5, it's classified as the negative class.

Essentially, the Sigmoid function provides a smooth transition between the linear output and the probability output, making it suitable for binary classification problems.
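
The three roles above can be sketched in a few lines of plain Python. The function names `sigmoid` and `predict_class` and the sample inputs are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_class(z, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

# z stands for the output of the linear equation (w . x + b).
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z={z:+.1f}  p={sigmoid(z):.3f}  class={predict_class(z)}")
```

Because sigmoid(0) is exactly 0.5, thresholding the probability at 0.5 is the same as checking the sign of the linear output.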

Ques.3 What is Regularization in Logistic Regression and why is it needed?

Ans: Regularization in Logistic Regression is a technique used to prevent overfitting. Overfitting occurs when the model learns the training data too well, including the noise and outliers, which results in poor performance on unseen data.

Regularization adds a penalty term to the cost function that the model tries to minimize during training. This penalty discourages the model from assigning excessively large weights to the features. By shrinking the weights, regularization makes the model simpler and less sensitive to small fluctuations in the training data, thus improving its ability to generalize to new data.

There are two common types of regularization used in Logistic Regression:

- **L1 Regularization (Lasso):** Adds a penalty proportional to the absolute value of the weights. This can lead to some weights becoming exactly zero, effectively performing feature selection.
- **L2 Regularization (Ridge):** Adds a penalty proportional to the square of the weights. This shrinks the weights towards zero but rarely makes them exactly zero.

Regularization is needed in Logistic Regression to:

- **Prevent Overfitting:** The primary reason is to prevent the model from becoming too complex and performing poorly on new, unseen data.
- **Reduce Variance:** Regularization helps reduce the variance of the model, making it more stable and less sensitive to changes in the training data.
- **Handle Multicollinearity:** In cases where features are highly correlated, regularization can help stabilize the model and prevent issues caused by multicollinearity.
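
As a rough illustration of how the penalty term enters the cost, the sketch below adds an L1 or an L2 penalty to a log loss computed by hand. The weights, the probabilities, and the strength `lam` are hypothetical values chosen for the example; some formulations also scale the L2 term by 1/2, which is omitted here.

```python
import math

def log_loss(y_true, p_pred):
    """Average binary cross-entropy (probabilities assumed strictly inside (0, 1))."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, p_pred)) / len(y_true)

def l1_penalty(weights, lam):
    """Lasso penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

y_true = [1, 0, 1]
p_pred = [0.8, 0.3, 0.6]
weights = [3.0, -0.5, 0.0]   # hypothetical fitted weights (intercept excluded)
lam = 0.1                    # hypothetical regularization strength

data_term = log_loss(y_true, p_pred)
print(f"unpenalized cost: {data_term:.4f}")
print(f"cost with L1:     {data_term + l1_penalty(weights, lam):.4f}")
print(f"cost with L2:     {data_term + l2_penalty(weights, lam):.4f}")
```

The single large weight (3.0) dominates the L2 term because it is squared, which is why L2 pushes hardest against large weights, while L1 charges every unit of weight equally.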

Ques.4 What are some common evaluation metrics for classification models, and
why are they important?

Ans: There are several common evaluation metrics for classification models, and they are crucial for understanding how well your model is performing. Here are some of the key ones and why they are important:

- **Accuracy:**
    - What it is: The proportion of correctly predicted instances out of the total number of instances.
    - Why it's important: It provides a general measure of the model's overall correctness. However, it can be misleading in cases of imbalanced datasets (where one class is significantly more frequent than the other).
- **Precision:**
    - What it is: The proportion of true positive predictions among all positive predictions (true positives + false positives). It answers the question: "Of all the instances predicted as positive, how many were actually positive?"
    - Why it's important: It is important when the cost of a false positive is high. For example, in medical diagnosis, a false positive could lead to unnecessary treatment.
- **Recall (Sensitivity or True Positive Rate):**
    - What it is: The proportion of true positive predictions among all actual positive instances (true positives + false negatives). It answers the question: "Of all the actual positive instances, how many were correctly predicted as positive?"
    - Why it's important: It is important when the cost of a false negative is high. For example, in fraud detection, a false negative means a fraudulent transaction is missed.
- **F1-Score:**
    - What it is: The harmonic mean of precision and recall. It provides a balance between precision and recall.
    - Why it's important: It is useful when you need to consider both precision and recall and want a single metric to evaluate the model's performance, especially in cases of imbalanced datasets.
- **Confusion Matrix:**
    - What it is: A table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.
    - Why it's important: It provides a detailed breakdown of the model's predictions and allows you to calculate other metrics like precision, recall, and accuracy. It helps you understand where the model is making mistakes.
- **ROC Curve and AUC:**
    - What it is: The Receiver Operating Characteristic (ROC) curve is a plot that shows the trade-off between the True Positive Rate (Recall) and the False Positive Rate (1 - Specificity) at various threshold settings. The Area Under the Curve (AUC) is a single value that summarizes the overall performance of the model across all possible thresholds.
    - Why it's important: It is useful for evaluating the model's ability to distinguish between the two classes. A higher AUC indicates better performance.
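
The definitions above translate directly into arithmetic on the confusion-matrix counts. The sketch below is one possible way to write them in plain Python; the counts describe a hypothetical imbalanced problem, and the helper names `classification_metrics` and `auc_from_scores` are made up for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical imbalanced problem: 20 positives among 1000 instances.
metrics = classification_metrics(tp=5, fp=5, fn=15, tn=975)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

print(f"AUC: {auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]):.2f}")
```

With these counts the accuracy is 0.98 even though only a quarter of the actual positives are found, which is the imbalanced-data pitfall described under Accuracy.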

Ques.5 Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load a sample dataset from scikit-learn (e.g., Iris dataset)
# If you have a CSV file, you would use:
# df = pd.read_csv('your_dataset.csv')
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='target')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")

Accuracy of the Logistic Regression model: 1.00
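
Since the question mentions a CSV file, the following variant sketches the same workflow with an actual `pd.read_csv` call. It assumes a scikit-learn version that supports `load_iris(as_frame=True)`; the temporary file path and the variable names (`csv_path`, `X_csv`, `csv_model`, and so on) are chosen here so that the variables from the cell above are left untouched.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the Iris data to a temporary CSV so there is a real file to load.
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")  # hypothetical file location
load_iris(as_frame=True).frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame, as the question asks.
df = pd.read_csv(csv_path)
X_csv = df.drop(columns="target")
y_csv = df["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=42
)

# max_iter is raised above the default only to avoid convergence warnings.
csv_model = LogisticRegression(max_iter=1000)
csv_model.fit(X_tr, y_tr)
csv_accuracy = accuracy_score(y_te, csv_model.predict(X_te))
print(f"Accuracy from the CSV-loaded data: {csv_accuracy:.2f}")
```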


Ques.6 Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy?

Ans:

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# Initialize and train the Logistic Regression model with L2 regularization
# The default penalty in LogisticRegression is 'l2'
model_l2 = LogisticRegression(penalty='l2')
model_l2.fit(X_train, y_train)

# Make predictions on the test set
y_pred_l2 = model_l2.predict(X_test)

# Calculate and print the accuracy
accuracy_l2 = accuracy_score(y_test, y_pred_l2)
print(f"Accuracy of the Logistic Regression model with L2 regularization: {accuracy_l2:.2f}")

# Print the model coefficients
print("\nModel Coefficients (L2 regularization):")
# For multi-class classification, coefficients are per class
if len(model_l2.coef_) > 1:
    for i, coef in enumerate(model_l2.coef_):
        print(f"Class {model_l2.classes_[i]}: {coef}")
else:
    print(model_l2.coef_[0])

# Print the intercept
print(f"\nModel Intercept (L2 regularization): {model_l2.intercept_}")

Accuracy of the Logistic Regression model with L2 regularization: 1.00

Model Coefficients (L2 regularization):
Class 0: [-0.39345607  0.96251768 -2.37512436 -0.99874594]
Class 1: [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
Class 2: [-0.11497673 -0.70769055  2.58813565  1.7744936 ]

Model Intercept (L2 regularization): [  9.00884295   1.86902164 -10.87786459]
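
To see the shrinking effect of the L2 penalty on the coefficients, one option is to refit the model at a few values of `C` and compare the overall size of the weights. This is a sketch under assumptions: the `C` values and the large `max_iter` are arbitrary choices, the model is fit on the full Iris data rather than the split used above, and the penalty is left at its default (L2).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_all, y_all = load_iris(return_X_y=True)

# Smaller C means a stronger L2 penalty (the default penalty), so weights shrink.
coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X_all, y_all)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} L2 norm of coefficients: {coef_norms[C]:.3f}")
```

The norm should grow as `C` increases, since a larger `C` weakens the penalty and lets the weights move further from zero.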


Ques.7 Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report?


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

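# Initialize and train the Logistic Regression model with multi_class='ovr'
# The 'ovr' (one-vs-rest) strategy trains a separate binary classifier for each class
# Note: multi_class is deprecated as of scikit-learn 1.5; newer releases suggest
# OneVsRestClassifier(LogisticRegression()) for the same behavior
# 'liblinear' is often suitable for 'ovr' with smaller datasets
model_ovr = LogisticRegression(multi_class='ovr', solver='liblinear')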

# Fit the model
model_ovr.fit(X_train, y_train)

# Make predictions on the test set
y_pred_ovr = model_ovr.predict(X_test)

# Print the classification report
print("Classification Report (multi_class='ovr'):")
print(classification_report(y_test, y_pred_ovr))

Classification Report (multi_class='ovr'):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30





Ques.8 Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy?


In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the parameter grid to search
# 'C' is the inverse of regularization strength; smaller values specify stronger regularization.
# 'penalty' can be 'l1', 'l2', 'elasticnet', or None (the string 'none' in older scikit-learn releases).
# 'elasticnet' requires 'solver' to be 'saga'.
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Initialize Logistic Regression model
# Use a solver that supports both l1 and l2 penalties, like 'liblinear' or 'saga'
# 'liblinear' is generally good for smaller datasets and supports l1/l2.
# 'saga' is good for larger datasets and supports l1, l2, elasticnet, and no penalty.
# For this example with l1 and l2, 'liblinear' is a good choice.
model = LogisticRegression(solver='liblinear')

# Initialize GridSearchCV
# cv=5 means 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters and best score (validation accuracy)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation accuracy: {:.2f}".format(grid_search.best_score_))

Best parameters found:  {'C': 10, 'penalty': 'l1'}
Best cross-validation accuracy: 0.96




Ques.9 : Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling?


In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model on scaled data
model_scaled = LogisticRegression()
model_scaled.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
y_pred_scaled = model_scaled.predict(X_test_scaled)

# Calculate and print the accuracy of the scaled model
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy of the Logistic Regression model with scaling: {accuracy_scaled:.2f}")

# Train an unscaled baseline on the same split so the comparison does not
# rely on the 'accuracy' variable left over from an earlier cell
model_unscaled = LogisticRegression()
model_unscaled.fit(X_train, y_train)
accuracy_unscaled = accuracy_score(y_test, model_unscaled.predict(X_test))
print(f"Accuracy of the Logistic Regression model without scaling: {accuracy_unscaled:.2f}")

Accuracy of the Logistic Regression model with scaling: 1.00
Accuracy of the Logistic Regression model without scaling: 1.00
