
# **Logistic Regression | Assignment**


# **Assignment Code: DA-AG-011**



# **Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**

**Logistic Regression** is a supervised learning algorithm used for **classification tasks**.  
**Linear Regression** is used for **regression tasks** (predicting continuous values).  

**Differences:**

| Feature         | Linear Regression          | Logistic Regression            |
|-----------------|---------------------------|-------------------------------|
| Output          | Continuous value          | Probability (0-1)             |
| Function        | Linear equation: y = wx + b | Sigmoid of a linear score: p = 1/(1+e^-z), where z = wx + b |
| Loss Function   | Mean Squared Error (MSE)  | Log Loss / Cross-Entropy      |
| Task Type       | Regression                | Classification               |
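
To make the loss-function row concrete, here is a minimal pure-Python sketch of both losses. The function names `mse` and `log_loss` and the sample labels and probabilities are illustrative assumptions, not part of the assignment.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error, the usual loss for Linear Regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy, the usual loss for Logistic Regression."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]          # made-up labels
p_pred = [0.9, 0.1, 0.6]    # made-up predicted probabilities

print("MSE:     ", round(mse(y_true, p_pred), 4))
print("Log loss:", round(log_loss(y_true, p_pred), 4))

# A confident wrong prediction is punished far more by log loss than by MSE.
print("Confident mistake, MSE:     ", round(mse([1], [0.01]), 4))
print("Confident mistake, log loss:", round(log_loss([1], [0.01]), 4))
```

Log loss grows without bound as a predicted probability moves toward the wrong extreme, which is one reason it suits probabilistic classifiers better than MSE.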


# **Question 2: Explain the role of the Sigmoid function in Logistic Regression.**


- The **Sigmoid function** converts any real-valued number into a **probability between 0 and 1**.  
- Formula:  
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]  
- This allows Logistic Regression to output probabilities.  
- Probabilities can then be thresholded (e.g., 0.5) to classify inputs into classes 0 or 1.
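
The two steps above (squash the score, then threshold it) can be sketched in a few lines of plain Python. The helper names `sigmoid` and `classify` are assumptions made for illustration; the branch on the sign of `z` is only there to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe for large negative z
    return ez / (1.0 + ez)

def classify(z, threshold=0.5):
    """Turn a linear score z into a class label 0 or 1."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

With a threshold of 0.5, the predicted class flips exactly where the linear score `z` changes sign, since `sigmoid(0)` is 0.5.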


# **Question 3: What is Regularization in Logistic Regression and why is it needed?**

**Regularization** is a technique to **prevent overfitting** by penalizing large model coefficients.  

**Types of Regularization:**
- **L1 (Lasso):** Penalizes sum of absolute coefficients. Can shrink some coefficients to zero (feature selection).  
- **L2 (Ridge):** Penalizes sum of squared coefficients. Prevents large weights but keeps all features.  

**Why needed:**
- Overfitting occurs when the model fits training data too closely and performs poorly on new data.
- Regularization accepts slightly higher bias in exchange for lower variance, which improves generalization to new data.
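
As a rough illustration of the two penalties, the sketch below computes them for a made-up weight vector and adds them to a made-up data loss. The names `l1_penalty`, `l2_penalty`, `penalized_loss`, and the strength `lam` are hypothetical. Note that scikit-learn expresses the strength through `C`, which is the inverse: a larger `C` means a weaker penalty.

```python
def l1_penalty(weights):
    """Sum of absolute coefficients (Lasso-style penalty)."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared coefficients (Ridge-style penalty)."""
    return sum(w * w for w in weights)

def penalized_loss(data_loss, weights, lam, kind="l2"):
    """Data loss plus lam times the chosen penalty."""
    penalty = l1_penalty(weights) if kind == "l1" else l2_penalty(weights)
    return data_loss + lam * penalty

weights = [0.5, -2.0, 0.0, 3.0]  # made-up coefficients

print(l1_penalty(weights))  # 5.5
print(l2_penalty(weights))  # 13.25
print(penalized_loss(0.3, weights, lam=0.1, kind="l1"))
print(penalized_loss(0.3, weights, lam=0.1, kind="l2"))
```

Because the L2 penalty squares each weight, the largest coefficient dominates it, while the L1 penalty charges every unit of weight equally, which is what pushes small coefficients all the way to zero.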


# **Question 4: What are some common evaluation metrics for classification models, and why are they important?**

**Common Metrics:**
- **Accuracy:** Fraction of correct predictions (useful for balanced datasets).  
- **Precision:** Fraction of predicted positives that are actually positive (important when false positives are costly).  
- **Recall (Sensitivity):** Fraction of actual positives correctly predicted (important when false negatives are costly).  
- **F1 Score:** Harmonic mean of precision and recall. Balances both.  
- **ROC-AUC:** Measures the model’s ability to distinguish between classes.  

**Importance:**
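- Looking at several metrics together gives a fuller picture of model performance than accuracy alone, especially on imbalanced datasets, where a model that always predicts the majority class can still score high accuracy.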
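
The sketch below computes these metrics from scratch on a made-up dataset with 5 positives out of 100 samples, to show how accuracy can look good while recall is zero. The function name `classification_metrics` and the data are assumptions; precision, recall, and F1 are defined as 0.0 here when their denominator is zero.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# A model that always predicts the majority class.
always_negative = [0] * 100
print(classification_metrics(y_true, always_negative))

# A model that finds 3 of the 5 positives and raises 2 false alarms.
some_positives = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93
print(classification_metrics(y_true, some_positives))
```

The always-negative model reaches 95% accuracy without finding a single positive, which is exactly the failure that precision, recall, and F1 expose.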


# **Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits it into train/test sets, trains a Logistic Regression model, and prints its accuracy.**


In [None]:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

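# Load the Iris dataset from sklearn into a DataFrame (it stands in for a CSV file;
# with a real file, load it with pd.read_csv instead)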
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict and print accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0
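
The question asks for a CSV file, while the cell above loads Iris directly from sklearn. One way to exercise the CSV path without an external file is to write Iris to a temporary CSV and read it back with `pd.read_csv`. This is a sketch under that assumption: the file name `iris.csv`, the temporary directory, and the variable names are hypothetical.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write Iris to a temporary CSV so the example has a file to read.
iris = load_iris(as_frame=True)
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")  # hypothetical file name
iris.frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the target column.
df = pd.read_csv(csv_path)
X_csv = df.drop(columns="target")
y_csv = df["target"]

# Same split and model settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_csv, y_csv, test_size=0.2, random_state=42
)
csv_model = LogisticRegression(max_iter=200)
csv_model.fit(X_tr, y_tr)

csv_accuracy = accuracy_score(y_te, csv_model.predict(X_te))
print("Accuracy:", csv_accuracy)
```

With a real dataset, only the `pd.read_csv` path and the name of the target column would need to change.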


# **Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.**

In [None]:
from sklearn.linear_model import LogisticRegression

# L2 Regularization is default in sklearn
model = LogisticRegression(penalty='l2', max_iter=200)
model.fit(X_train, y_train)

print("Model Coefficients:", model.coef_)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Model Coefficients: [[-0.39345607  0.96251768 -2.37512436 -0.99874594]
 [ 0.50843279 -0.25482714 -0.21301129 -0.77574766]
 [-0.11497673 -0.70769055  2.58813565  1.7744936 ]]
Accuracy: 1.0


# **Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.**


In [None]:
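from sklearn.metrics import classification_report
from sklearn.multiclass import OneVsRestClassifier

# Wrap Logistic Regression so that one binary classifier is trained per class (one-vs-rest)
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=200))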
ovr_model.fit(X_train, y_train)
y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      0.89      0.94         9
           2       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30



# **Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # solver compatible with L1
}

grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Parameters:", grid.best_params_)
print("Best Validation Accuracy:", grid.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l1', 'solver': 'liblinear'}
Best Validation Accuracy: 0.9583333333333334


# **Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.**


In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Model without scaling
model1 = LogisticRegression(max_iter=200)
model1.fit(X_train, y_train)
acc1 = accuracy_score(y_test, model1.predict(X_test))

# Model with scaling
model2 = LogisticRegression(max_iter=200)
model2.fit(X_train_scaled, y_train)
acc2 = accuracy_score(y_test, model2.predict(X_test_scaled))

print("Accuracy without scaling:", acc1)
print("Accuracy with scaling:", acc2)


Accuracy without scaling: 1.0
Accuracy with scaling: 1.0
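
For reference, standardization itself is a small computation: subtract the training mean and divide by the training standard deviation. The sketch below reproduces it for a single made-up column in plain Python. The helper names `fit_scaler` and `transform` are hypothetical; the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

def fit_scaler(column):
    """Learn (mean, scale) from a training column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation
    return mean, (std if std > 0 else 1.0)  # constant columns are left unscaled

def transform(column, mean, scale):
    """Apply z = (x - mean) / scale to every value."""
    return [(x - mean) / scale for x in column]

train_col = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up training values
test_col = [3.0, 6.0]                  # made-up test values

# The statistics come from the training data only and are reused on the test data.
mean, scale = fit_scaler(train_col)
train_scaled = transform(train_col, mean, scale)
test_scaled = transform(test_col, mean, scale)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The identical accuracies above are plausible for Iris, whose four features are all measured in centimetres on similar ranges; scaling tends to matter more when features differ by orders of magnitude.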


# **Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model, including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**


**Scenario:** Predict which customers will respond to a marketing campaign (only 5% respond).  

**Approach:**

1. **Data Handling:**
   - Handle missing values.
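   - Encode categorical features with One-Hot Encoding; reserve integer (label-style) encoding for categories with a natural order, because Logistic Regression treats the codes as magnitudes.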

2. **Feature Scaling:**
   - Standardize numerical features using `StandardScaler`.

3. **Balancing Classes:**
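   - Use SMOTE (Synthetic Minority Over-sampling Technique) or random oversampling, applied to the training split only so that the test set keeps the real 5% response rate.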
   - Alternatively, set `class_weight='balanced'` in Logistic Regression.

4. **Model Training:**
   - Train Logistic Regression with L1 or L2 regularization.
   - Tune `C` (the inverse of regularization strength; smaller values regularize more) and `penalty` using GridSearchCV.

5. **Evaluation:**
   - Use Precision, Recall, F1-score, ROC-AUC.
   - Avoid relying only on Accuracy due to class imbalance.

6. **Deployment Considerations:**
   - Monitor for concept drift (changes in customer behavior).
   - Retrain periodically with new data for accurate predictions.
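
The steps above can be sketched end to end on synthetic data. Everything specific here is an assumption made for illustration: `make_classification` stands in for the real customer table (so there are no missing values or categorical columns to handle), `class_weight='balanced'` is used instead of SMOTE because it needs no extra package, and `average_precision` is one reasonable scoring choice for a rare positive class, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: roughly 5% responders.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)

# Stratify so train and test keep the same response rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", solver="liblinear", max_iter=1000)),
])

param_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

# Evaluate on the untouched test set with imbalance-aware metrics.
proba = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, proba)
print("Best parameters:", search.best_params_)
print("Test ROC-AUC:", round(test_auc, 3))
print(classification_report(y_test, search.predict(X_test), digits=3))
```

In a real project the default 0.5 threshold would usually be revisited as well, since the cost of contacting a non-responder and the cost of missing a responder are rarely equal.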
