#### #Question 1: What is Logistic Regression, and how does it differ from Linear Regression?
##### #Ans.
Logistic Regression is a statistical method for binary classification, where the dependent variable has only two possible outcomes, such as Yes/No, Spam/Not Spam, or Pass/Fail. Despite its name, it is used as a classification algorithm: it models the probability of a class, and that probability is then thresholded into a class label.

In Logistic Regression, rather than predicting a continuous value as in Linear Regression, we predict the probability that an observation belongs to a particular category. This probability is computed with the logistic (sigmoid) function, which maps any real-valued number to a value between 0 and 1. The sigmoid function is given as:

p = 1 / (1 + e^(-z))

where z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

The value of p represents the probability of the positive class. With the usual threshold of 0.5, we classify the observation as positive if p ≥ 0.5 and as negative otherwise. Since p = 0.5 exactly when z = 0, this is the same as checking the sign of z.
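The formula and the 0.5 threshold can be sketched in a few lines of plain Python. The coefficients and the observation below are made-up values for illustration only, and treating p = 0.5 as positive is a convention assumed here, not something the formula dictates.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a probability between 0 and 1."""
    # Use the algebraically equivalent form for negative z so that
    # math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, beta0, betas):
    """Compute p = sigmoid(beta0 + beta1*x1 + ... + betan*xn)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


# Hypothetical coefficients and a single observation (illustration only)
beta0, betas = -1.0, [0.8, -0.5]
x = [2.0, 1.0]

p = predict_proba(x, beta0, betas)  # z = -1.0 + 1.6 - 0.5 = 0.1
label = 1 if p >= 0.5 else 0        # assumed convention: ties go to positive
print(f"p = {p:.4f}, predicted class = {label}")
```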

Now let us understand how Logistic Regression differs from Linear Regression.

1. Purpose and Output
* Linear Regression is used for predicting continuous values such as prices or temperatures.
* Logistic Regression is used for predicting probabilities and classifications such as spam or not spam.

2. Mathematical Model
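* Linear Regression models the target directly as a linear function of the features: y = β₀ + β₁x₁ + … + βₙxₙ (a straight line, y = β₀ + β₁x, when there is a single feature).
* Logistic Regression passes the same kind of linear combination z through the S-shaped sigmoid function: p = 1 / (1 + e^(-z)).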

3. Assumptions
* Linear Regression assumes a linear relationship between variables and normally distributed errors.
* Logistic Regression assumes a linear relationship between the log-odds of the dependent variable and the independent variables.

4. Error Measurement
* Linear Regression minimizes the sum of squared errors (ordinary least squares).
* Logistic Regression is fit by maximum likelihood estimation, which is equivalent to minimizing the log-loss (cross-entropy).

5. Range of Predictions
* Linear Regression can predict any real number from minus infinity to plus infinity.
* Logistic Regression always predicts values between 0 and 1 as probabilities.
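The log-odds assumption listed under point 3 follows directly from the sigmoid formula. A short derivation, using the same symbols p and z as above:

```latex
\begin{aligned}
p &= \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \\
1 - p &= \frac{e^{-z}}{1 + e^{-z}} \\
\frac{p}{1 - p} &= e^{z} \\
\ln\!\left(\frac{p}{1 - p}\right) &= z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
\end{aligned}
```

So the model is linear in the log-odds, not in the probability itself: a one-unit increase in xᵢ adds βᵢ to the log-odds, which multiplies the odds p / (1 − p) by e^βᵢ.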

#### #Question 2: Explain the role of the Sigmoid function in Logistic Regression.
##### #Ans. Role of the Sigmoid Function in Logistic Regression

The sigmoid function plays a central role in Logistic Regression because it transforms the linear combination of input features into a probability value between 0 and 1.

In Logistic Regression, the input is a linear equation of the form:

z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

The output of this linear equation can range from minus infinity to plus infinity, which is not suitable for classification since probabilities must always lie between 0 and 1.

To solve this, the sigmoid (logistic) function is applied:

p = 1 / (1 + e^(-z))

Roles of the sigmoid function:

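1. Probability Mapping - It compresses any real-valued number into the open interval (0, 1), so the output can be interpreted as a probability. The values 0 and 1 are approached as z goes to minus or plus infinity but are never reached exactly.

2. Decision Boundary - By setting a threshold (commonly 0.5), it helps in classifying outcomes into two categories. If p ≥ 0.5 (equivalently z ≥ 0), the observation is classified as positive; otherwise it is classified as negative. The decision boundary itself is the set of points where z = 0.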

3. Smooth Gradient - The S-shaped curve of the sigmoid function provides smooth gradients, which are useful for optimization during training with methods like gradient descent.
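To make the "smooth gradient" point concrete: the sigmoid has the closed-form derivative dp/dz = p(1 − p), which is what gradient-based training relies on. The sketch below only evaluates that derivative at a few points; the function names are chosen for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function p = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: dp/dz = p * (1 - p)."""
    p = sigmoid(z)
    return p * (1.0 - p)


# The gradient peaks at z = 0 and shrinks toward both tails
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}   p = {sigmoid(z):.4f}   dp/dz = {sigmoid_grad(z):.4f}")
```

The gradient is defined everywhere and is largest (0.25) at z = 0, where the model is least certain; it becomes small when |z| is large, which is one reason very large feature values can slow down training.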

#### #Question 3: What is Regularization in Logistic Regression and why is it needed?
##### #Ans. Regularization in Logistic Regression

Regularization is a technique used in Logistic Regression (and other machine learning models) to prevent overfitting by adding a penalty term to the cost function. Overfitting occurs when the model learns not only the underlying pattern but also the noise in the training data, which reduces its performance on unseen data.

In Logistic Regression, the cost function without regularization is the log-loss (the negative log-likelihood from maximum likelihood estimation). Regularization adds a penalty term to this cost function that discourages the model from assigning overly large weights (coefficients) to the features.

There are mainly two types of regularization used:

1. L1 Regularization (Lasso): Adds the sum of the absolute values of the coefficients as a penalty term. It can shrink some coefficients to exactly zero, effectively performing feature selection.

2. L2 Regularization (Ridge): Adds the sum of the squared coefficients as a penalty term. It reduces the magnitude of coefficients but does not usually make them exactly zero.

Why is Regularization needed?

* To prevent overfitting and improve generalization on new/unseen data.

* To keep coefficients small, making the model more stable and less sensitive to fluctuations in training data.

* In the case of L1, it also helps in simplifying the model by removing irrelevant features.
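The penalized cost function described above can be written out directly. This is a minimal sketch in plain Python, not how any particular library implements it: the name `lam` for the penalty strength is an assumption of this example (scikit-learn instead exposes `C`, the inverse of the strength), the intercept is left out of the penalty as is common practice, and the labels, probabilities and coefficients are made-up values.

```python
import math


def log_loss(y_true, probs, eps=1e-12):
    """Average negative log-likelihood of binary labels given probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def penalized_loss(y_true, probs, betas, lam, penalty="l2"):
    """Log-loss plus lam * (L1 or L2 penalty on the coefficients)."""
    if penalty == "l1":
        reg = sum(abs(b) for b in betas)   # sum of absolute values
    elif penalty == "l2":
        reg = sum(b * b for b in betas)    # sum of squares
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return log_loss(y_true, probs) + lam * reg


# Hypothetical labels, predicted probabilities and coefficients
y_true = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
betas = [2.0, -0.5, 0.0]

base = log_loss(y_true, probs)
print(f"log-loss only:      {base:.4f}")
print(f"with L1, lam = 0.1: {penalized_loss(y_true, probs, betas, 0.1, 'l1'):.4f}")
print(f"with L2, lam = 0.1: {penalized_loss(y_true, probs, betas, 0.1, 'l2'):.4f}")
```

Because the penalty grows with the size of the coefficients, the optimizer has to trade a slightly worse fit on the training data for smaller weights.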

#### #Question 4: What are some common evaluation metrics for classification models, and why are they important?
##### #Ans.
When building classification models like Logistic Regression, it is not enough to only look at accuracy. Different evaluation metrics are used to measure how well the model performs, especially when data is imbalanced. These metrics are important because they help us understand the strengths and weaknesses of a model and choose the best one for a given problem.

Common Evaluation Metrics:

1. Accuracy
* Definition: The proportion of correctly classified observations out of the total observations.
* Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
* Limitation: Can be misleading when classes are imbalanced.

2. Precision
* Definition: Out of all predicted positives, how many are actually positive.
* Formula: Precision = TP / (TP + FP)
* Importance: High precision means fewer false positives.

3. Recall (Sensitivity or True Positive Rate)
* Definition: Out of all actual positives, how many were correctly identified.
* Formula: Recall = TP / (TP + FN)
* Importance: High recall means fewer false negatives.

4. F1 Score
* Definition: Harmonic mean of precision and recall.
* Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
* Importance: Useful when we want a balance between precision and recall.

5. ROC Curve and AUC (Area Under Curve)
* ROC Curve plots the True Positive Rate against the False Positive Rate at different thresholds.
* AUC measures the overall ability of the model to distinguish between classes.
* Importance: AUC closer to 1 indicates a better model; an AUC of 0.5 corresponds to random guessing.

Why they are important:

* They provide a deeper understanding of model performance beyond accuracy.
* They help in selecting models suited for specific needs (e.g., fraud detection requires high recall, while spam detection may need high precision).
* They allow comparison between different models on the same dataset.
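The formulas above can be checked by hand from the four confusion-matrix counts. The counts in this sketch are invented to mimic an imbalanced problem (50 positives out of 1000) and show how accuracy can look good while recall is poor.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 1000 samples, of which 50 are actually positive
metrics = classification_metrics(tp=20, fp=10, tn=940, fn=30)
for name, value in metrics.items():
    print(f"{name:>9}: {value:.4f}")
```

Here accuracy is 0.96 even though only 20 of the 50 actual positives are found (recall 0.40), which is the situation the "Limitation" note under Accuracy warns about.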

In [1]:
# Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame,
# splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Load breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train/test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression model
model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

# Predict and calculate accuracy
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print(f"Test Accuracy: {accuracy:.4f}")


Test Accuracy: 0.9737
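The question asks for the data to be loaded from a CSV file, while the cell above builds the DataFrame straight from the sklearn loader. One possible way to cover the CSV step is sketched below: the dataset is first written to a temporary CSV and then read back with `pd.read_csv`. The file name `breast_cancer.csv` and the temporary directory are assumptions made for this illustration.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Export the sklearn dataset to a CSV file (hypothetical file name)
dataset = load_breast_cancer(as_frame=True)
csv_path = os.path.join(tempfile.mkdtemp(), "breast_cancer.csv")
dataset.frame.to_csv(csv_path, index=False)

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

# Same split, scaling and model as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(random_state=42)
model.fit(X_train_scaled, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Test Accuracy: {accuracy:.4f}")
```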


In [5]:
# Question 6: Write a Python program to train a Logistic Regression model using L2
# regularization (Ridge) and print the model coefficients and accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression with L2 regularization (Ridge)
model = LogisticRegression(
    penalty='l2',        # Ridge Regularization
    C=1.0,               # Inverse of regularization strength (default = 1.0); smaller C means a stronger penalty
    solver='liblinear',  # Suitable for smaller datasets
    max_iter=1000        # Increase iterations to ensure convergence
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Print coefficients
print("Model Coefficients with L2 Regularization:\n")
coef_df = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_[0]
})
print(coef_df.to_string(index=False))

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")



Model Coefficients with L2 Regularization:

                Feature  Coefficient
            mean radius     2.132484
           mean texture     0.152772
         mean perimeter    -0.145091
              mean area    -0.000829
        mean smoothness    -0.142636
       mean compactness    -0.415569
         mean concavity    -0.651940
    mean concave points    -0.344456
          mean symmetry    -0.207613
 mean fractal dimension    -0.029774
           radius error    -0.050034
          texture error     1.442984
        perimeter error    -0.303857
             area error    -0.072569
       smoothness error    -0.016159
      compactness error    -0.001907
        concavity error    -0.044886
   concave points error    -0.037719
         symmetry error    -0.041752
fractal dimension error     0.005613
           worst radius     1.232150
          worst texture    -0.404581
        worst perimeter    -0.036209
             worst area    -0.027087
       worst smoothness    -0.2

In [7]:
# Question 7: Write a Python program to train a Logistic Regression model for multiclass
# classification using multi_class='ovr' and print the classification report.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Logistic Regression model for multiclass classification using One-vs-Rest (OvR)
model = LogisticRegression(
    multi_class='ovr',  # One-vs-Rest strategy (this argument is deprecated in recent scikit-learn releases)
    solver='liblinear', # Suitable for smaller datasets
    max_iter=1000
)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report for Multiclass Logistic Regression (OvR):\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Classification Report for Multiclass Logistic Regression (OvR):

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
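The question asks specifically for `multi_class='ovr'`, but recent scikit-learn releases deprecate that argument (check the documentation for your installed version). A sketch of the same One-vs-Rest strategy using the `OneVsRestClassifier` wrapper, which fits one binary Logistic Regression per class, is given here as an alternative; the split and solver settings mirror the cell above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Load Iris and split 80/20, as in the cell above
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# One binary Logistic Regression per class (class k vs. the rest)
ovr_model = OneVsRestClassifier(
    LogisticRegression(solver="liblinear", max_iter=1000)
)
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(f"Number of binary models fitted: {len(ovr_model.estimators_)}")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```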





In [9]:
# Question 8: Write a Python program to apply GridSearchCV to tune C and penalty
# hyperparameters for Logistic Regression and print the best parameters and validation
# accuracy.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}\n")

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Logistic Regression model
model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)

# GridSearchCV
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)

# Best parameters and cross-validation accuracy
print("Best parameters found by GridSearchCV:", grid_search.best_params_)
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

# Evaluate best model on test set
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Test set accuracy with best parameters: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)\n")

# Classification report on test set
print("Classification Report on Test Set with Best Model:")
print(classification_report(y_test, y_pred_test, target_names=cancer.target_names))


Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Best parameters found by GridSearchCV: {'C': 1, 'penalty': 'l2'}
Best cross-validation accuracy: 0.9802
Test set accuracy with best parameters: 0.9825 (98.25%)

Classification Report on Test Set with Best Model:
              precision    recall  f1-score   support

   malignant       0.98      0.98      0.98        42
      benign       0.99      0.99      0.99        72

    accuracy                           0.98       114
   macro avg       0.98      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114
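In the cell above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling statistics. A common way to avoid this is to put the scaler and the model in a single `Pipeline` and tune that; the sketch below assumes a smaller grid than the one above to keep it quick, and the `logisticregression__` prefix is simply the step name that `make_pipeline` generates.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and model in one estimator: the scaler is re-fitted on the
# training part of every cross-validation fold
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=42, max_iter=1000),
)

# make_pipeline names each step after its lowercased class name
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.4f}")
print(f"Test set accuracy: {search.score(X_test, y_test):.4f}")
```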



In [11]:
# Question 9: Write a Python program to standardize the features before training Logistic
# Regression and compare the model's accuracy with and without scaling.
# (Use Dataset from sklearn package)
# (Include your Python code and output in the code box below.)

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(f"Dataset shape: {X.shape}")
print(f"Classes: {cancer.target_names}")
print(f"Class distribution: {np.bincount(y)}\n")

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}\n")

# Logistic Regression WITHOUT scaling
print("Training Logistic Regression model WITHOUT scaling...")
model_no_scale = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')
model_no_scale.fit(X_train, y_train)
y_pred_no_scale = model_no_scale.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)
print(f"Accuracy WITHOUT scaling: {accuracy_no_scale:.4f} ({accuracy_no_scale*100:.2f}%)\n")

# Logistic Regression WITH scaling
print("Standardizing features...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training Logistic Regression model WITH scaling...")
model_scaled = LogisticRegression(random_state=42, max_iter=1000, solver='liblinear')
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy WITH scaling:    {accuracy_scaled:.4f} ({accuracy_scaled*100:.2f}%)\n")

# Accuracy Comparison
print("Comparison of Accuracy:")
print(f"  Without Scaling: {accuracy_no_scale:.4f}")
print(f"  With Scaling:    {accuracy_scaled:.4f}")

Dataset shape: (569, 30)
Classes: ['malignant' 'benign']
Class distribution: [212 357]

Training set shape: (455, 30)
Test set shape: (114, 30)

Training Logistic Regression model WITHOUT scaling...
Accuracy WITHOUT scaling: 0.9561 (95.61%)

Standardizing features...
Training Logistic Regression model WITH scaling...
Accuracy WITH scaling:    0.9825 (98.25%)

Comparison of Accuracy:
  Without Scaling: 0.9561
  With Scaling:    0.9825


#### #Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you'd take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.
##### #Ans.
When predicting customer response to a marketing campaign with an imbalanced dataset (e.g., only 5% of customers respond), the following approach can be taken:

1. Data Handling and Preprocessing
* Feature Selection/Engineering: Identify meaningful features such as customer demographics, past purchases, browsing history, or engagement metrics.
* Missing Values: Handle missing data via imputation or removal of rows/columns.
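* Categorical Variables: Encode categorical features with One-Hot Encoding. Label Encoding assigns arbitrary integers that Logistic Regression would treat as ordered magnitudes, so reserve integer codes for genuinely ordinal features.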

2. Feature Scaling
* Scale numeric features using StandardScaler or MinMaxScaler, because the regularization penalty and solver convergence in Logistic Regression are sensitive to feature magnitudes. Fit the scaler on the training data only and reuse it to transform the test data.

3. Handling Class Imbalance
* Resampling Techniques:
  * Oversampling: Increase the number of minority class samples using techniques like SMOTE. Apply it to the training data only, so that the test set keeps the real 5% response rate.
  * Undersampling: Reduce majority class samples to balance the dataset, at the cost of discarding data.
* Class Weights: Use class_weight='balanced' in Logistic Regression to penalize misclassification of the minority class more heavily.

4. Model Building and Hyperparameter Tuning
* Use Logistic Regression with regularization (L1 or L2) to prevent overfitting.
* Tune hyperparameters like C (inverse of regularization strength) and penalty using GridSearchCV or RandomizedSearchCV.
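* For imbalanced data, use scoring='roc_auc' (or a precision-recall based score such as scoring='average_precision') during hyperparameter tuning rather than accuracy: with 5% responders, a model that predicts "no response" for everyone already reaches 95% accuracy.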

5. Model Evaluation
* Avoid using accuracy alone; focus on metrics that are robust for imbalanced datasets:
   * Precision: How many predicted responders are actually responders.
   * Recall (Sensitivity): How many actual responders are correctly identified.
   * F1-Score: Balance between precision and recall.
   * ROC-AUC: Measures the model's ability to discriminate between responders and non-responders.
* Use Confusion Matrix to understand true positives, false positives, true negatives, and false negatives.

6. Deployment Considerations

* Evaluate business impact: prioritize minimizing false negatives if missing a potential customer is costly.
* Monitor model performance over time and retrain periodically as customer behavior changes.
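The steps above can be sketched end to end as follows. No campaign data is available here, so a synthetic dataset from `make_classification` with roughly 5% positives stands in for it; the sample size, feature counts and grid of `C` values are all assumptions of this example, and class weighting is used as the balancing method (SMOTE would need the separate imbalanced-learn package).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: about 5% responders (assumption)
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so train and test keep the same response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling + class-weighted Logistic Regression in one pipeline
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Tune C with ROC-AUC instead of accuracy
search = GridSearchCV(
    pipe,
    {"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Evaluate with metrics that are informative for imbalanced data
proba = search.predict_proba(X_test)[:, 1]
y_pred = search.predict(X_test)
roc_auc = roc_auc_score(y_test, proba)
pr_auc = average_precision_score(y_test, proba)

print("Best parameters:", search.best_params_)
print(f"ROC-AUC: {roc_auc:.3f}   Average precision: {pr_auc:.3f}")
print(classification_report(y_test, y_pred, digits=3))
```

In practice the default 0.5 threshold used by `predict` is rarely the best choice for a campaign; the predicted probabilities in `proba` can instead be ranked or thresholded according to the cost of contacting a customer versus the value of a response.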

