#**Question 1: What is Logistic Regression, and how does it differ from Linear Regression?**
-  Logistic Regression

Logistic Regression is a statistical method for classification problems, where the target variable is categorical. It models the probability that an observation belongs to a class as a function of a set of independent variables; a cut-off (commonly 0.5) then converts that probability into a class label.

Key Differences

| Aspect              | Linear Regression               | Logistic Regression                 |
| ------------------- | ------------------------------- | ----------------------------------- |
| **Type of Problem** | Regression (continuous output)  | Classification (categorical output) |
| **Output Range**    | Any real value (−∞ to +∞)       | Probability between 0 and 1         |
| **Function Used**   | Linear function                 | Sigmoid (logistic) function         |
| **Error Metric**    | Mean Squared Error (MSE), RMSE  | Log Loss (Cross-Entropy Loss)       |
| **Goal**            | Fit best line to predict values | Find probability and classify data  |

#**Question 2: Explain the role of the Sigmoid function in Logistic Regression.**
-  Sigmoid Function in Logistic Regression

The sigmoid function, also known as the logistic function, plays a crucial role in Logistic Regression. It's used to model the probability of the target variable being in a specific class (e.g., 1 or 0).

Mathematical Representation
The sigmoid function is represented mathematically as:

σ(z) = 1 / (1 + e^(-z))

Where:

- σ(z): Sigmoid function
- e: Base of the natural logarithm
- z: Linear combination of the independent variables and their coefficients, i.e. z = β₀ + β₁x₁ + … + βₙxₙ

Role in Logistic Regression
The sigmoid function serves several purposes in Logistic Regression:

1. Mapping to Probability Space: The sigmoid function maps any real number to a value between 0 and 1, making it suitable for modeling probabilities.
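2. Non-Linear Link: The sigmoid is a non-linear transformation of the linear score z, so the predicted probability follows an S-shaped curve rather than a straight line. The decision boundary itself remains linear in the independent variables unless non-linear features (e.g., polynomial or interaction terms) are added.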
3. Binary Classification: The sigmoid function is particularly useful for binary classification problems, where the target variable has two classes (e.g., 0 and 1).
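
The mapping from a linear score to a probability can be sketched in a few lines of plain Python. The intercept, coefficients, and observation below are hypothetical values chosen only for illustration, and the 0.5 cut-off is the usual default rather than a requirement.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that exp() never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical model parameters and a single observation
intercept = -1.0
coefs = [0.8, -0.5]
x = [2.0, 1.0]

# z is the linear combination of features and coefficients
z = intercept + sum(b * xi for b, xi in zip(coefs, x))
p = sigmoid(z)                  # estimated P(y = 1 | x)
label = 1 if p >= 0.5 else 0    # class label at the default 0.5 cut-off

print(f"z = {z:.2f}, p = {p:.3f}, predicted class = {label}")
```

Because σ(0) = 0.5, the 0.5 cut-off corresponds to the line z = 0, which is why the decision boundary is linear in the inputs.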

#**Question 3: What is Regularization in Logistic Regression and why is it needed?**
-  Regularization in Logistic Regression

Regularization is a technique used in Logistic Regression to prevent overfitting by adding a penalty term to the loss function. Overfitting occurs when a model is too complex and fits the training data too well, resulting in poor performance on unseen data.

Why is Regularization Needed?
Regularization is needed in Logistic Regression for several reasons:

1. Preventing Overfitting: Regularization helps prevent overfitting by reducing the complexity of the model.
2. Improving Generalization: Regularization improves the model's ability to generalize to new, unseen data.
3. Reducing Model Variance: Regularization reduces the variance of the model, making it more stable and reliable.

Types of Regularization
There are two common types of regularization used in Logistic Regression:

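1. L1 Regularization (Lasso): L1 regularization adds a penalty term proportional to the sum of the absolute values of the model coefficients. It can shrink some coefficients exactly to zero, which acts as a form of feature selection.
2. L2 Regularization (Ridge): L2 regularization adds a penalty term proportional to the sum of the squared model coefficients. It shrinks all coefficients toward zero but rarely eliminates any of them.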
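
As a rough illustration of how the two penalties enter the loss, the sketch below adds each one to the log loss in plain Python. The labels, predicted probabilities, coefficients, and the penalty weight `lam` are all made-up values; scikit-learn expresses the same idea through `C`, the inverse of the regularization strength.

```python
import math

def log_loss(y_true, p_pred):
    """Average binary cross-entropy."""
    eps = 1e-15  # keep probabilities away from 0 and 1 so log() stays finite
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def l1_penalty(coefs):
    """Sum of absolute coefficient values (Lasso)."""
    return sum(abs(b) for b in coefs)

def l2_penalty(coefs):
    """Sum of squared coefficient values (Ridge)."""
    return sum(b * b for b in coefs)

# Hypothetical values for illustration only
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.7, 0.4]
coefs = [1.0, -2.0, 0.5]
lam = 0.1  # assumed penalty weight

base = log_loss(y_true, p_pred)
print("log loss        :", round(base, 4))
print("with L1 (Lasso) :", round(base + lam * l1_penalty(coefs), 4))
print("with L2 (Ridge) :", round(base + lam * l2_penalty(coefs), 4))
```

The L2 term grows quadratically, so it punishes the large coefficient (-2.0) much harder than L1 does.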

#**Question 4: What are some common evaluation metrics for classification models, and why are they important?**
-  Common Evaluation Metrics for Classification Models

When evaluating classification models, several metrics can be used to assess their performance. Here are some common ones:

1. Accuracy: Accuracy measures the proportion of correctly classified instances out of all instances in the dataset.
2. Precision: Precision measures the proportion of true positives (instances correctly predicted as positive) among all positive predictions made by the model.
3. Recall: Recall, also known as sensitivity, measures the proportion of true positives among all actual positive instances in the dataset.
4. F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balanced measure of both.
5. Area Under the ROC Curve (AUC-ROC): AUC-ROC measures the model's ability to distinguish between positive and negative classes, with higher values indicating better performance.
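
The first four metrics follow directly from the confusion-matrix counts. The sketch below computes them by hand for a small, invented set of labels and predictions; AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Invented labels and predictions for ten instances
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In practice `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`, which also handle edge cases such as a zero denominator.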

Importance of Evaluation Metrics
These evaluation metrics are important for several reasons:

- Assessing Model Performance: Evaluation metrics provide a quantitative measure of a model's performance, allowing you to compare different models and identify areas for improvement.
- Exposing Class Imbalance Issues: When one class has far more instances than the others, accuracy can look high even if the minority class is rarely predicted correctly. Metrics like precision, recall, and F1 score reveal that weakness.
- Optimizing Model Parameters: Evaluation metrics can be used to optimize model parameters and hyperparameters, ensuring the best possible performance.
- Comparing Models: Evaluation metrics enable comparison between different models, helping you choose the best model for your specific problem.


#**Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

In [1]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

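# Load dataset (Iris is used as a stand-in so the example is self-contained;
# for a real CSV file, build the DataFrame with pd.read_csv("path/to/file.csv") instead)
iris = load_iris()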

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("First 5 rows of dataset:")
print(df.head())

# Features (X) and Target (y)
X = df[iris.feature_names]
y = df['target']

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Logistic Regression model
model = LogisticRegression(max_iter=200)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", accuracy)


First 5 rows of dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Model Accuracy: 1.0
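
Since the question asks for a CSV file, here is a minimal sketch of the same workflow with `pd.read_csv`. No CSV ships with this notebook, so the sketch first writes the Iris data to a temporary file; the file name `iris.csv` and the `target` column name are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a CSV to read back (stand-in for a real data file)
    csv_path = os.path.join(tmp_dir, "iris.csv")
    load_iris(as_frame=True).frame.to_csv(csv_path, index=False)

    # Load the CSV file into a Pandas DataFrame
    df = pd.read_csv(csv_path)

# Features are every column except the label column
X = df.drop(columns="target")
y = df["target"]

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("Model Accuracy:", accuracy)
```

For a real dataset, only the `pd.read_csv(...)` path and the name of the label column would need to change.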


#**Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy.**


In [4]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset (Iris dataset as example)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df[iris.feature_names]
y = df['target']

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Logistic Regression with L2 regularization
# C is the inverse of regularization strength (smaller C = stronger regularization);
# C=1.0 is the default and is written out here to make the setting explicit
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=500)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print coefficients and accuracy
print("Model Coefficients (per class):\n", model.coef_)
print("\nIntercepts (per class):\n", model.intercept_)
print("\nModel Accuracy:", accuracy)


Model Coefficients (per class):
 [[-0.40538546  0.86892246 -2.2778749  -0.95680114]
 [ 0.46642685 -0.37487888 -0.18745257 -0.72127133]
 [-0.06104139 -0.49404358  2.46532746  1.67807247]]

Intercepts (per class):
 [  8.86383271   2.20981479 -11.0736475 ]

Model Accuracy: 1.0
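
To see what "smaller C = stronger regularization" means in practice, the sketch below refits the model for three values of `C` and reports the overall size of the coefficient matrix. The choice of values and the use of the full Iris dataset (no train/test split) are simplifications for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # L2 is the default penalty; only the strength changes between fits
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: coefficient norm = {norms[C]:.3f}")
```

The norm should grow as `C` increases, because a weaker penalty lets the coefficients move further from zero.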




#**Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report.**

In [5]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load dataset (Iris dataset)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df[iris.feature_names]
y = df['target']

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

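# Initialize Logistic Regression with One-vs-Rest
# (the multi_class parameter is deprecated as of scikit-learn 1.5;
# newer versions emit a warning when it is set)
model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=500)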

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      0.85      0.92        13
   virginica       0.87      1.00      0.93        13

    accuracy                           0.96        45
   macro avg       0.96      0.95      0.95        45
weighted avg       0.96      0.96      0.96        45
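
The `multi_class` parameter is deprecated in recent scikit-learn releases. One way to get the same one-vs-rest behavior without it is to wrap the model in `OneVsRestClassifier`. This is a sketch of that alternative, not the code that produced the report above, so its numbers may differ slightly.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# One binary Logistic Regression is trained per class (class vs. the rest)
ovr_model = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=500))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

After fitting, `ovr_model.estimators_` holds the three underlying binary classifiers.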





#**Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**

In [6]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load dataset (Iris dataset)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df[iris.feature_names]
y = df['target']

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define Logistic Regression model
log_reg = LogisticRegression(solver='liblinear', max_iter=500)

# Define parameter grid for tuning
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],          # Inverse of regularization strength
    'penalty': ['l1', 'l2']                # L1 = Lasso, L2 = Ridge
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=log_reg,
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Print best parameters and validation accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


Best Parameters: {'C': 10, 'penalty': 'l2'}
Best Cross-Validation Accuracy: 0.9523809523809523
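
The score above is a cross-validation average computed on the training folds only. A common follow-up is to check the selected model on the held-out test set. To keep it short, the sketch below tunes only `C` with the default solver, which is a simplification of the grid used above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# best_estimator_ is refit on the whole training set (refit=True by default)
test_accuracy = grid_search.best_estimator_.score(X_test, y_test)

print("Best C:", grid_search.best_params_["C"])
print("Cross-validation accuracy:", round(grid_search.best_score_, 3))
print("Held-out test accuracy   :", round(test_accuracy, 3))
```

The per-combination results are also available in `grid_search.cv_results_` for a closer look at how sensitive the score is to `C`.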


#**Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling.**

In [7]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load dataset (Iris dataset)
iris = load_iris()

# Convert to Pandas DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Features (X) and Target (y)
X = df[iris.feature_names]
y = df['target']

# Split dataset into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# ---------------- Without Scaling ----------------
model_no_scaling = LogisticRegression(max_iter=500)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# ---------------- With Standard Scaling ----------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_with_scaling = LogisticRegression(max_iter=500)
model_with_scaling.fit(X_train_scaled, y_train)
y_pred_scaled = model_with_scaling.predict(X_test_scaled)
accuracy_with_scaling = accuracy_score(y_test, y_pred_scaled)

# ---------------- Results ----------------
print("Accuracy without Scaling:", accuracy_no_scaling)
print("Accuracy with Scaling   :", accuracy_with_scaling)


Accuracy without Scaling: 1.0
Accuracy with Scaling   : 1.0
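
Both runs reach the same accuracy here, presumably because the four Iris features are all measured in centimetres on similar ranges. When scaling is combined with cross-validation, a pipeline keeps the scaler from seeing the validation fold. The following is a minimal sketch of that pattern, with the 5-fold setting chosen arbitrarily.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is refit on the training part of every fold,
# so no statistics from the validation part leak into training
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

scores = cross_val_score(scaled_model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", round(scores.mean(), 3))
```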


#**Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**
  
1. Data Handling

Exploration:

- Check for missing values, outliers, and the distribution of each feature.
- Look for data leakage (e.g., features that directly reveal the response).

Feature Engineering:

- Categorical features → one-hot encoding.
- Date/time features → extract month, weekday, and recency (how recently a customer purchased).
- Customer behavior features (purchase frequency, last purchase amount, etc.).
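
The feature-engineering steps above could look roughly like this in Pandas. The table, its column names (`channel`, `last_purchase`, `orders_last_year`), and the snapshot date are all hypothetical.

```python
import pandas as pd

# Hypothetical customer table
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "channel": ["email", "app", "email"],
    "last_purchase": pd.to_datetime(["2024-01-10", "2024-03-02", "2023-11-25"]),
    "orders_last_year": [4, 12, 1],
})
snapshot_date = pd.Timestamp("2024-04-01")  # assumed date the features are computed

features = customers.copy()

# Date/time features: recency in days, plus month and weekday of last purchase
features["recency_days"] = (snapshot_date - features["last_purchase"]).dt.days
features["last_purchase_month"] = features["last_purchase"].dt.month
features["last_purchase_weekday"] = features["last_purchase"].dt.dayofweek

# Categorical features: one-hot encode the channel column
features = pd.get_dummies(
    features.drop(columns=["last_purchase"]), columns=["channel"]
)

print(features)
```

Recency must be measured against a date that precedes the campaign; otherwise the feature would leak information about the response.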

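2. Feature Scaling

- Logistic Regression with regularization (L1/L2) is sensitive to feature scale, because the penalty treats all coefficients alike.
- Apply StandardScaler (zero mean, unit variance) or MinMaxScaler to numerical variables.
- Split into train and test sets first, then fit the scaler on the training set only and reuse it to transform the test set. This avoids data leakage.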

3. Handling Class Imbalance (5% responders)

Imbalanced data is the biggest challenge here. Options:

Resampling Methods

- Oversample the minority class (e.g., SMOTE, ADASYN).
- Undersample the majority class (at the risk of losing information).
- Apply resampling to the training data only, never to the test set.

Class Weights

- In scikit-learn: LogisticRegression(class_weight='balanced')
- This weights each class inversely to its frequency, so the minority class has higher importance.

Hybrid Approaches

- Use moderate oversampling + class weights.
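
For intuition on the class-weight option, the sketch below reproduces the 'balanced' heuristic documented by scikit-learn, n_samples / (n_classes * class_count), in plain Python. The 1,000-customer sample is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical campaign data: 1,000 customers, 5% responders (label 1)
labels = [1] * 50 + [0] * 950

weights = balanced_class_weights(labels)
print(weights)
```

With a 5% response rate, an error on a responder costs 19 times as much as an error on a non-responder during training.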

4. Model Training with Hyperparameter Tuning

Use GridSearchCV or RandomizedSearchCV for:

- C → inverse of regularization strength (smaller C = stronger regularization).
- penalty → L1 (Lasso) or L2 (Ridge).
- class_weight → balanced vs. custom weights.

Example grid:

param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}


Cross-validation should use StratifiedKFold so that every fold keeps roughly the same 5% responder ratio.
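
Putting scaling, class weights, tuning, and stratified folds together might look like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the campaign dataset, tunes only `C` and `class_weight` to stay short, and uses `average_precision` (a PR-AUC summary) as an assumed scoring choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data: roughly 5% positives
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

# Stratified folds keep the responder ratio stable across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV PR-AUC :", round(search.best_score_, 3))
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys start with `clf__`.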

5. Evaluation Metrics (beyond accuracy)

Accuracy is misleading (95% accuracy if predicting “no” always!).

Use metrics focused on the minority class:

- Precision, Recall, F1-score
- ROC-AUC (probability-based measure)
- PR-AUC (Precision-Recall AUC) → especially important under high imbalance.

Business Metric: Focus on Recall (catch as many responders as possible) OR Precision (don’t waste marketing cost), depending on company strategy.
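
The accuracy trap is easy to reproduce. The sketch below scores a baseline that never predicts a response, using a hypothetical test set of 1,000 customers with a 5% response rate.

```python
# Hypothetical test set: 1,000 customers, 50 of whom respond (label 1)
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # baseline that always predicts "no response"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The baseline is 95% accurate yet finds no responders at all, which is why recall, precision, and PR-AUC are the more informative numbers here.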

6. Threshold Tuning

- The default threshold of 0.5 may not be optimal.
- Adjust the threshold using the ROC curve or the Precision-Recall curve.

For example:

- If the marketing budget is high → lower threshold → catch more responders (maximize recall).
- If the budget is limited → higher threshold → focus on the most likely responders (maximize precision).
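
The trade-off can be made concrete by sweeping the threshold over a set of predicted probabilities. The ten labels and scores below are invented for illustration; with a fitted model the scores would come from `predict_proba`.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when predicting 1 for scores >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels and predicted probabilities for ten customers
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.70, 0.45, 0.40, 0.30, 0.25, 0.20, 0.10, 0.05, 0.02]

for threshold in (0.2, 0.5, 0.8):
    precision, recall = precision_recall_at(threshold, y_true, scores)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold to 0.2 recovers every responder at the cost of more wasted contacts, while raising it to 0.8 contacts only the surest customer and misses most responders.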

7. Deployment & Monitoring

- Deploy the model in the marketing workflow.
- Monitor the response rate over time.
- Monitor model drift (customer behavior changes).
- Re-train periodically with fresh data.

**Summary Approach:**

1. Preprocess features + scale data.
2. Handle imbalance (SMOTE / class_weight).
3. Train Logistic Regression with hyperparameter tuning.
4. Evaluate using Precision, Recall, F1, ROC-AUC, PR-AUC (not just accuracy).
5. Adjust threshold based on business cost/benefit trade-off.
6. Deploy and monitor performance.