# Logistic Regression | Assignment

1. What is Logistic Regression, and how does it differ from Linear
Regression?
  - Logistic Regression is a supervised machine learning algorithm used primarily for classification tasks, where the goal is to predict discrete outcomes such as whether an email is spam or not, or if a customer will make a purchase. Unlike Linear Regression, which predicts continuous numeric values using a linear relationship between input features and the target variable, Logistic Regression predicts the probability that a given input belongs to a particular category. It does this by applying the sigmoid (logistic) function to the linear combination of input features, effectively mapping the output to a range between 0 and 1. This probability is then used to classify the input into classes (e.g., 0 or 1). While Linear Regression uses mean squared error as its loss function, Logistic Regression uses log loss (or cross-entropy loss). The key difference lies in their purpose and output: Linear Regression is used for predicting quantities, whereas Logistic Regression is used for making categorical decisions.
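
To make the contrast concrete, here is a minimal sketch in plain Python. The weights, bias, and feature values are made up for illustration; the point is only that both models start from the same weighted sum, and Logistic Regression then passes it through the sigmoid and scores it with log loss instead of squared error.

```python
import math

def linear_output(weights, bias, features):
    """Weighted sum of the features: the raw prediction of Linear Regression."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def logistic_probability(weights, bias, features):
    """Same weighted sum, squashed by the sigmoid into a probability."""
    z = linear_output(weights, bias, features)
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(y_true, y_pred):
    """Per-example loss used by Linear Regression."""
    return (y_true - y_pred) ** 2

def log_loss(y_true, p):
    """Per-example cross-entropy loss used by Logistic Regression."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical parameters and a single example (illustrative values only).
weights, bias = [0.8, -0.5], 0.1
features = [2.0, 1.0]

z = linear_output(weights, bias, features)         # unbounded real number
p = logistic_probability(weights, bias, features)  # probability of class 1

print(f"Linear output z       = {z:.2f}")
print(f"Logistic probability  = {p:.4f}")
print(f"Predicted class       = {1 if p >= 0.5 else 0}")
print(f"Log loss if true y=1  = {log_loss(1, p):.4f}")
print(f"Log loss if true y=0  = {log_loss(0, p):.4f}")
```

The log loss is small when the predicted probability agrees with the true label and grows quickly when a confident prediction is wrong, which is why it suits probabilistic classification better than squared error.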










2. Explain the role of the Sigmoid function in Logistic Regression.
   - The sigmoid function converts the linear output of a Logistic Regression model into a probability between 0 and 1. Logistic Regression computes a weighted sum of the input features (as in Linear Regression), but because the goal is to classify data into categories, typically 0 or 1, the output must be interpretable as a probability. The sigmoid function, defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, takes the linear result $z$ and maps it to a value strictly between 0 and 1. This output represents the probability that a given input belongs to the positive class (class 1). If the probability is greater than or equal to 0.5, the model usually classifies the input as class 1; otherwise, it is classified as class 0. Thus, the sigmoid function enables Logistic Regression to make meaningful and interpretable probabilistic predictions.
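
The mapping and the 0.5 decision rule can be sketched in a few lines of plain Python. The function names `sigmoid` and `classify` are illustrative, and the two-branch form is one common way to avoid overflow for very negative inputs, not something the assignment requires.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for negative z, exp(z) is small and safe to compute
    return ez / (1.0 + ez)

def classify(z, threshold=0.5):
    """Turn the linear score z into a class label using a probability threshold."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(f"z = {z:+.1f}  probability = {sigmoid(z):.4f}  class = {classify(z)}")
```

Because the sigmoid is symmetric around zero, `sigmoid(-z)` equals `1 - sigmoid(z)`, and a score of exactly 0 lands on the 0.5 boundary.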

















3. What is Regularization in Logistic Regression and why is it needed?
   - Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty to the loss function based on the magnitude of the model's coefficients. In simple terms, it discourages the model from relying too heavily on any one feature by keeping the weights (coefficients) small.

When a model becomes too complex—especially if there are many features—it might fit the training data very well but perform poorly on new, unseen data. This is known as overfitting. Regularization addresses this problem by adding a term to the loss function that increases as the coefficients become larger.

There are two common types of regularization used in Logistic Regression:

 - L1 Regularization (Lasso): adds the sum of the absolute values of the coefficients to the loss function. It can shrink some coefficients to exactly zero, effectively performing feature selection.

 - L2 Regularization (Ridge): adds the sum of the squared coefficients to the loss function. It penalizes large weights but does not usually shrink them to exactly zero.

Regularization is needed to improve the generalization ability of the model, ensuring it performs well not just on the training data but also on unseen data. It makes the model more stable and less sensitive to noise or irrelevant features.
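
As a rough illustration of how the penalty enters the loss, the sketch below adds an L1 or L2 term to the mean log loss on a tiny made-up dataset. The data, the weight vectors, and the strength `lam` are all hypothetical. Note that scikit-learn expresses the strength through `C`, which is the inverse of the regularization strength, so a smaller `C` means a stronger penalty.

```python
import math

def mean_log_loss(weights, bias, X, y):
    """Average cross-entropy loss of a logistic model on the data."""
    total = 0.0
    for features, label in zip(X, y):
        z = sum(w * x for w, x in zip(weights, features)) + bias
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(y)

def l1_penalty(weights):
    """Sum of absolute coefficient values (Lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared coefficient values (Ridge-style penalty)."""
    return sum(w * w for w in weights)

def penalized_loss(weights, bias, X, y, lam, penalty):
    """Data loss plus lam times the chosen penalty; the bias is not penalized."""
    return mean_log_loss(weights, bias, X, y) + lam * penalty(weights)

# Tiny made-up dataset and two candidate weight vectors (illustrative only).
X = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [-1.0, 2.0]]
y = [1, 1, 0, 0]
small_weights = [1.0, -1.0]
large_weights = [10.0, -10.0]
lam = 0.1

for name, w in (("small", small_weights), ("large", large_weights)):
    print(f"{name} weights: data loss = {mean_log_loss(w, 0.0, X, y):.4f}, "
          f"L1 total = {penalized_loss(w, 0.0, X, y, lam, l1_penalty):.4f}, "
          f"L2 total = {penalized_loss(w, 0.0, X, y, lam, l2_penalty):.4f}")
```

The large weights fit this training data almost perfectly, yet their penalized loss is much higher, so a regularized model would prefer the smaller weights.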

4. What are some common evaluation metrics for classification models, and
why are they important?
   - Some common evaluation metrics for classification models include accuracy, precision, recall, F1-score, and the ROC-AUC score. These metrics are essential for understanding how well a classification model performs, especially when dealing with imbalanced datasets or when the cost of different types of errors varies.

   - Accuracy measures the overall correctness of the model by calculating the proportion of correctly predicted instances out of the total. While it's simple and useful, it can be misleading when classes are imbalanced.

   - Precision is the ratio of true positive predictions to the total predicted positives. It’s important when the cost of false positives is high, such as in spam detection.

   - Recall (or sensitivity) is the ratio of true positives to the actual positives. It’s crucial when missing a positive case is costly, like in disease detection.

   - F1-score is the harmonic mean of precision and recall, offering a balance between the two. It is especially useful when you need to balance false positives and false negatives.

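   - ROC-AUC score measures the model's ability to distinguish between classes across all thresholds. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a higher AUC means the model ranks positives above negatives more reliably.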

These metrics help in choosing the right model and in understanding the trade-offs involved in different types of errors, which is critical for making reliable predictions in real-world applications.
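
The threshold-based metrics above can be computed directly from confusion-matrix counts. The sketch below uses made-up labels (5 positives out of 100) and two hypothetical sets of predictions to show why accuracy alone can mislead; ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1, with 0.0 when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up imbalanced labels: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Baseline that always predicts the majority class.
always_negative = [0] * 100

# Hypothetical model: finds 4 of the 5 positives, raises 6 false alarms.
some_model = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

baseline_metrics = classification_metrics(y_true, always_negative)
model_metrics = classification_metrics(y_true, some_model)
print("Always negative:", baseline_metrics)
print("Some model:     ", model_metrics)
```

The always-negative baseline has the higher accuracy even though it never finds a positive case, while the second set of predictions trades a little accuracy for much better recall and F1.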

5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy.
(Use Dataset from sklearn package)

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset from sklearn
data = load_breast_cancer()

# Convert to Pandas DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Split into features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train Logistic Regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")


Accuracy of the Logistic Regression model: 0.96


6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [3]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=10000)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print model coefficients
print("Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")


Model Coefficients:
mean radius: 1.0274
mean texture: 0.2215
mean perimeter: -0.3621
mean area: 0.0255
mean smoothness: -0.1562
mean compactness: -0.2377
mean concavity: -0.5326
mean concave points: -0.2837
mean symmetry: -0.2267
mean fractal dimension: -0.0365
radius error: -0.0971
texture error: 1.3706
perimeter error: -0.1814
area error: -0.0872
smoothness error: -0.0225
compactness error: 0.0474
concavity error: -0.0429
concave points error: -0.0324
symmetry error: -0.0347
fractal dimension error: 0.0116
worst radius: 0.1117
worst texture: -0.5089
worst perimeter: -0.0156
worst area: -0.0169
worst smoothness: -0.3077
worst compactness: -0.7727
worst concavity: -1.4286
worst concave points: -0.5109
worst symmetry: -0.7469
worst fractal dimension: -0.1009

Model Accuracy: 0.9561


7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [4]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load Iris dataset (multiclass: 3 classes)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

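# Train Logistic Regression model with multi_class='ovr'
# Note: the multi_class parameter is deprecated as of scikit-learn 1.5; on newer
# versions, OneVsRestClassifier(LogisticRegression(...)) provides the same one-vs-rest scheme.
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)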
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))




Classification Report:
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30



8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation
accuracy.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)

In [5]:
# Import required libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # 'liblinear' supports both l1 and l2 penalties
}

# Initialize logistic regression
logreg = LogisticRegression(max_iter=1000)

# Initialize GridSearchCV
grid = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Get best parameters and score
print("Best Parameters:", grid.best_params_)
print(f"Best Cross-Validated Accuracy: {grid.best_score_:.4f}")

# Test set accuracy
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {test_accuracy:.4f}")


Best Parameters: {'C': 100, 'penalty': 'l1', 'solver': 'liblinear'}
Best Cross-Validated Accuracy: 0.9670
Test Set Accuracy: 0.9825


9. Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.
(Use Dataset from sklearn package)
(Include your Python code and output in the code box below.)


In [6]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic Regression without scaling
model_no_scaling = LogisticRegression(max_iter=10000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with scaling
model_scaled = LogisticRegression(max_iter=10000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

# Print the results
print(f"Accuracy without Scaling: {accuracy_no_scaling:.4f}")
print(f"Accuracy with Scaling:    {accuracy_scaled:.4f}")


Accuracy without Scaling: 0.9561
Accuracy with Scaling:    0.9737


10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced
dataset (only 5% of customers respond), describe the approach you’d take to build a
Logistic Regression model — including data handling, feature scaling, balancing
classes, hyperparameter tuning, and evaluating the model for this real-world business
use case.
   - To build a Logistic Regression model that predicts customer responses to an e-commerce marketing campaign when only 5% of customers respond, I would take a structured approach that accounts for the class imbalance at every step.

   - Data handling: explore and preprocess the data by handling missing values, encoding categorical features, and engineering useful variables (such as customer purchase history or engagement). Split the data into training and testing sets using stratified sampling so that both keep the original class distribution.

   - Feature scaling: standardize all numerical features using a method like StandardScaler, fitted on the training data only. Regularized Logistic Regression penalizes coefficients by their magnitude, so features on very different scales would otherwise be penalized unevenly and the solver may converge slowly.

   - Balancing classes: set class_weight='balanced' in the Logistic Regression model, which weights classes inversely proportional to their frequencies, or use an oversampling method like SMOTE to synthetically increase the minority class. Any resampling should be applied to the training data only, so that the test set still reflects the real 5% response rate.

   - Hyperparameter tuning: use GridSearchCV to optimize the regularization strength C, the penalty type (l1 or l2), and the solver, scoring candidates with a metric suited to imbalanced data rather than accuracy.

   - Evaluation: avoid relying on accuracy, since a model that predicts "no response" for every customer would already be 95% accurate. Focus instead on precision, recall, F1-score, ROC-AUC, and the confusion matrix.

Additionally, I would consider threshold tuning to adjust the decision boundary, aiming to maximize recall without incurring excessive false positives. This aligns the model with the business goal of identifying as many potential responders as possible while managing marketing costs effectively.
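
The approach described in this answer can be sketched end to end with scikit-learn. This is a minimal sketch under stated assumptions, not a production recipe: the data is synthetic (`make_classification` with roughly 5% positives stands in for real campaign data), only `C` is searched to keep the grid short, class weighting is used instead of SMOTE, and the threshold is chosen by maximizing F1 on out-of-fold training predictions, which is just one possible business rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, precision_recall_curve,
                             roc_auc_score)
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_predict, train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% responders (class 1).
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=42)

# Stratified split keeps the response rate similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tune C with stratified folds, scored by average precision instead of accuracy.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipeline, {"logreg__C": [0.01, 0.1, 1, 10]},
                    cv=cv, scoring="average_precision")
grid.fit(X_train, y_train)

# Choose the threshold from out-of-fold training predictions, not the test set.
oof_proba = cross_val_predict(grid.best_estimator_, X_train, y_train,
                              cv=cv, method="predict_proba")[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, oof_proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Final evaluation on the held-out test set.
test_proba = grid.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
y_pred = (test_proba >= best_threshold).astype(int)

print("Best parameters:", grid.best_params_)
print(f"Chosen threshold: {best_threshold:.3f}")
print(f"Test ROC-AUC: {test_auc:.3f}")
print(classification_report(y_test, y_pred,
                            target_names=["no response", "response"],
                            zero_division=0))
```

In a real project the threshold rule would come from the cost of contacting a customer versus the value of a response, and the grid could be widened to include the penalty type and solver mentioned in the answer.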