
**1. What is Logistic Regression, and how does it differ from Linear
Regression?**
>**Introduction**

>Logistic Regression and Linear Regression are both statistical methods used in supervised machine learning. However, they serve different purposes and are used for different types of problems.

>**Definition**

>>**Linear Regression:**

>>Linear Regression is used for predicting a continuous output.

>>It establishes a linear relationship between the input variables (independent variables) and the output (dependent variable).

>>The model predicts values using the equation:

>>>$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$



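>>where $y$ is the output (dependent variable), $x_i$ are the input features, $\beta_i$ are the coefficients, and $\varepsilon$ is the error term. The model's prediction $\hat{y}$ is the same expression without $\varepsilon$.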

>>**Logistic Regression:**

>>Logistic Regression is used for classification problems, especially binary classification (e.g., spam vs. not spam).

>>It predicts the probability that a given input belongs to the positive class (labeled 1); the other class is labeled 0.

>>The output is transformed using the logistic (sigmoid) function:

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}
$$


>>This maps the output to a range between 0 and 1.

>**Example**

>>**Linear Regression Example:**

>>Predicting house prices based on size, location, and number of rooms.

>>**Logistic Regression Example:**

>>Predicting whether an email is spam (1) or not spam (0) based on its content.

>**Assumptions**

>>**Linear Regression assumes** linearity, independence, homoscedasticity, and normality of errors.

>>**Logistic Regression assumes:**

>>>The log-odds of the dependent variable is a linear combination of the independent variables.

>>>Little or no multicollinearity among the independent variables.

>>>Large sample size for stable estimates.
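>>>The log-odds assumption follows directly from the sigmoid form of $P(y = 1 \mid x)$ given earlier. A short derivation, using the shorthand $p = P(y = 1 \mid x)$ and $z = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$ (both introduced here only for brevity):

$$
p = \frac{1}{1 + e^{-z}}
\;\Rightarrow\;
1 - p = \frac{e^{-z}}{1 + e^{-z}}
\;\Rightarrow\;
\frac{p}{1 - p} = e^{z}
\;\Rightarrow\;
\log\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n
$$

>>>Read this way, each coefficient $\beta_j$ is the change in log-odds for a one-unit increase in $x_j$ with the other features held fixed.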

>**Applications**

>>**Linear Regression:** Forecasting sales, temperature prediction, stock prices.

>>**Logistic Regression:** Medical diagnosis, credit scoring, email spam detection, fraud detection.

**2. Explain the role of the Sigmoid function in Logistic Regression.**
>In Logistic Regression, the Sigmoid function plays a central role in converting the output of a linear equation into a probability value between 0 and 1. This is essential for binary classification, where the goal is to determine whether an instance belongs to class 0 or class 1.

>**What is the Sigmoid Function?**

>The Sigmoid (or logistic) function is a mathematical function defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$


>Where:

>$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
$$


>>e is the base of the natural logarithm.

>This function maps any real-valued number to a value in the range (0, 1).

>**Why is it Needed in Logistic Regression?**

>>Logistic Regression starts with a linear combination of input features (like in Linear Regression).

>>However, unlike Linear Regression which outputs any real number, Logistic Regression needs to output probabilities.

>>The Sigmoid function transforms the output of the linear model into a probability, making it suitable for classification tasks.

>**How It Works**

>>When z (the linear output) is a large positive number, σ(z) approaches 1.

>>When z is a large negative number, σ(z) approaches 0.

>>When z=0, σ(z)=0.5, representing the decision boundary.
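>>The three cases above can be checked numerically. The snippet below is a minimal sketch; the helper name `sigmoid` and the sample values of `z` are chosen here for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


# Illustrative inputs: large negative, moderate, zero, and large positive.
z_values = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
probs = sigmoid(z_values)

for z, p in zip(z_values, probs):
    print(f"z = {z:6.1f}  ->  sigma(z) = {p:.5f}")
```

>>The printed values increase monotonically with z, stay strictly between 0 and 1, and pass through 0.5 at z = 0.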

>**Decision Making**

>>In binary classification:

>>>If σ(z)≥0.5, predict class 1.

>>>If σ(z)<0.5, predict class 0.

>>This threshold (0.5 by default) can be adjusted based on the problem or evaluation metrics.

>**Visualization Insight**

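>>The Sigmoid function has an S-shaped curve: it is steepest around z = 0 and flattens out toward 0 and 1, so the predicted probability changes smoothly from one class to the other. The decision boundary itself is still linear in the input features, because σ(z) = 0.5 exactly where z = 0.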

**3. What is Regularization in Logistic Regression and why is it needed?**
>**Introduction**

>Regularization is a technique used in Logistic Regression (and other machine learning models) to prevent overfitting by penalizing large coefficients in the model. It helps the model to generalize better on unseen data.

>**The Need for Regularization**

>In Logistic Regression, the model learns coefficients $\beta_0, \beta_1, \ldots, \beta_n$ to fit the training data. However:

>>If the model is too complex or if there are too many features, it may learn patterns that are specific to the training data (i.e., noise).

>>This results in overfitting, where the model performs well on training data but poorly on test data.

>>Regularization addresses this by constraining the magnitude of the coefficients, encouraging simpler models.

>**How Regularization Works**

>>Regularization adds a penalty term to the loss function (cost function) that Logistic Regression tries to minimize.

>>**Original Logistic Loss (Binary Cross-Entropy):**

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]
$$



>>**With Regularization:**

>>>**L2 Regularization (Ridge)** – adds the squared magnitude of coefficients:

$$
J(\theta) = \text{Loss} + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
$$


>>>**L1 Regularization (Lasso)** – adds the absolute value of coefficients:

$$
J(\theta) = \text{Loss} + \frac{\lambda}{m} \sum_{j=1}^{n} \left| \theta_j \right|
$$


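>>Where:

>>$\lambda$ is the regularization parameter that controls the strength of the penalty.

>>$\theta_j$ are the model coefficients (the same quantities written as $\beta_j$ earlier), excluding the bias term.

>>$m$ is the number of training examples and $h_\theta(x)$ is the predicted probability, i.e. the sigmoid of the linear combination of the features.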
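>>To make the penalty terms concrete, here is a minimal NumPy sketch of the regularized loss. The function name `regularized_log_loss`, the tiny dataset, and the coefficient values are all made up for illustration; scikit-learn's own objective is parameterized differently (through `C`), so treat this only as a direct transcription of the formulas above.

```python
import numpy as np


def regularized_log_loss(theta, bias, X, y, lam=1.0, penalty="l2"):
    """Binary cross-entropy plus an optional L1 or L2 penalty on theta.

    The bias term is deliberately left out of the penalty.
    """
    m = X.shape[0]
    z = X @ theta + bias
    h = 1.0 / (1.0 + np.exp(-z))          # predicted probabilities
    h = np.clip(h, 1e-12, 1 - 1e-12)      # avoid log(0)
    loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    if penalty == "l2":
        reg = lam / (2 * m) * np.sum(theta ** 2)
    elif penalty == "l1":
        reg = lam / m * np.sum(np.abs(theta))
    else:
        reg = 0.0
    return loss + reg


# Hypothetical toy data: 4 examples, 2 features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 0, 1])
theta = np.array([1.0, -0.5])
bias = 0.1

plain = regularized_log_loss(theta, bias, X, y, lam=0.0)
with_l2 = regularized_log_loss(theta, bias, X, y, lam=1.0, penalty="l2")
with_l1 = regularized_log_loss(theta, bias, X, y, lam=1.0, penalty="l1")

print(f"Unregularized loss: {plain:.4f}")
print(f"L2 penalty added  : {with_l2 - plain:.5f}")
print(f"L1 penalty added  : {with_l1 - plain:.5f}")
```

>>With these values the L2 penalty is (1 / 8) × (1.0² + 0.5²) and the L1 penalty is (1 / 4) × (1.0 + 0.5), so larger coefficients raise the cost even when the fit to the data is unchanged.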

>**Why Regularization is Important**

>>Controls model complexity

>>Reduces overfitting

>>Improves generalization

>>Helps with high-dimensional data (many features)

>>Encourages simpler models with smaller weights

**4. What are some common evaluation metrics for classification models, and
why are they important?**
>**Accuracy**

>>**Definition:** The ratio of correctly predicted observations to the total observations.

>>**Formula:**

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

>>**Importance:**

>>>Simple and intuitive metric to measure overall correctness.

>>>Useful when classes are balanced.

>>**Limitation:**

>>>Misleading if classes are imbalanced.

>**Precision**

>>**Definition:** The ratio of correctly predicted positive observations to the total predicted positives.

>>**Formula:**

$$
\text{Precision} = \frac{TP}{TP + FP}
$$


>>**Importance:**

>>>Important when the cost of false positives is high (e.g., spam detection, fraud detection).

>>**Focus:**

>>>How many selected items are relevant.

>**Recall (Sensitivity or True Positive Rate)**

>>**Definition:** The ratio of correctly predicted positive observations to all actual positives.

>>**Formula:**

$$
\text{Recall} = \frac{TP}{TP + FN}
$$

>>**Importance:**

>>>Crucial when the cost of false negatives is high (e.g., disease diagnosis).

>>**Focus:**

>>>How many relevant items are selected.

>**F1 Score**

>>**Definition:** The harmonic mean of Precision and Recall.

>>**Formula:**

$$
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$


>>**Importance:**

>>>Balances Precision and Recall.

>>>Useful when classes are imbalanced and you need a single metric.

>**Confusion Matrix**

>>**Definition:** A table showing the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

>>**Importance:**

>>>Provides detailed insight into the types of errors.

>>>Foundation for calculating other metrics.
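>>As a worked example of how the confusion matrix feeds the other metrics, the sketch below uses ten hand-made labels and predictions (hypothetical data, not from any model in this notebook) and recomputes accuracy, precision, recall, and F1 from the four counts.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for 10 observations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, "
      f"Recall={recall:.2f}, F1={f1:.2f}")
```

>>Here TP = 3, TN = 5, FP = 1, and FN = 1, which gives an accuracy of 0.80 and a precision, recall, and F1 of 0.75 each.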

>**ROC Curve and AUC (Area Under the Curve)**

>>**ROC Curve:** Plots True Positive Rate (Recall) against False Positive Rate at various threshold settings.

>>**AUC:** Measures the overall ability of the model to discriminate between classes.

>>**Importance:**

>>>Useful for comparing models.

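>>>Summarizes performance across all thresholds. On highly imbalanced datasets it can look optimistic, so it is best read alongside the precision-recall curve.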
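>>One way to read AUC is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four hand-made scores (illustrative values only) against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Pairwise interpretation: share of positive/negative pairs ranked correctly
# (there are no tied scores in this toy example).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(f"AUC from roc_auc_score : {auc:.2f}")
print(f"AUC from pair counting : {pairwise:.2f}")
```

>>Three of the four positive/negative pairs are ordered correctly, so both computations give 0.75.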


**5. Write a Python program that loads a CSV file into a Pandas DataFrame,
splits into train/test sets, trains a Logistic Regression model, and prints its accuracy.**

**(Use Dataset from sklearn package)**

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load dataset from sklearn
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Convert to Pandas DataFrame (optional, for visualization)
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y

# Step 3: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train a Logistic Regression model
model = LogisticRegression(max_iter=200)  # Increase max_iter if needed
model.fit(X_train, y_train)

# Step 5: Make predictions and print accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")


Accuracy of the Logistic Regression model: 1.00
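The program above builds the DataFrame directly from the sklearn dataset. To also exercise the CSV-loading step the question mentions, one possible variant is sketched below: it first writes the Iris data to a temporary CSV file and then reads it back with `pd.read_csv`. The file name `iris.csv` and the temporary directory are assumptions made for this example, not part of the original program.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Write the sklearn dataset to a CSV so the example is self-contained.
iris = load_iris(as_frame=True)
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")  # hypothetical path
iris.frame.to_csv(csv_path, index=False)

# Load the CSV into a DataFrame and separate features from the target column.
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy of the Logistic Regression model: {accuracy:.2f}")
```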


**6. Write a Python program to train a Logistic Regression model using L2
regularization (Ridge) and print the model coefficients and accuracy.**

**(Use Dataset from sklearn package)**

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Step 2: Convert to DataFrame (optional)
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y

# Step 3: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train Logistic Regression with L2 regularization (default)
model = LogisticRegression(penalty='l2', solver='liblinear', C=1.0)  # C is the inverse of regularization strength
model.fit(X_train, y_train)

# Step 5: Print model coefficients
print("Model Coefficients:")
for feature, coef in zip(data.feature_names, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")

# Step 6: Evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy on test set: {accuracy:.4f}")


Model Coefficients:
mean radius: 2.1325
mean texture: 0.1528
mean perimeter: -0.1451
mean area: -0.0008
mean smoothness: -0.1426
mean compactness: -0.4156
mean concavity: -0.6519
mean concave points: -0.3445
mean symmetry: -0.2076
mean fractal dimension: -0.0298
radius error: -0.0500
texture error: 1.4430
perimeter error: -0.3039
area error: -0.0726
smoothness error: -0.0162
compactness error: -0.0019
concavity error: -0.0449
concave points error: -0.0377
symmetry error: -0.0418
fractal dimension error: 0.0056
worst radius: 1.2321
worst texture: -0.4046
worst perimeter: -0.0362
worst area: -0.0271
worst smoothness: -0.2626
worst compactness: -1.2090
worst concavity: -1.6180
worst concave points: -0.6153
worst symmetry: -0.7428
worst fractal dimension: -0.1170

Accuracy on test set: 0.9561


**7. Write a Python program to train a Logistic Regression model for multiclass
classification using multi_class='ovr' and print the classification report.**

**(Use Dataset from sklearn package)**

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Step 1: Load dataset
data = load_iris()
X = data.data
y = data.target

# Step 2: Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train Logistic Regression with OvR strategy
model = LogisticRegression(multi_class='ovr', solver='liblinear', max_iter=200)
model.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = model.predict(X_test)

# Step 5: Print classification report
print("Classification Report (OvR):")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report (OvR):
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
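Recent scikit-learn releases mark the `multi_class` argument as deprecated (to the best of my knowledge, starting with version 1.5). A hedged alternative that expresses the same one-vs-rest strategy explicitly is to wrap a binary Logistic Regression in `OneVsRestClassifier`, as sketched here; the variable name `ovr_model` is chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# One binary Logistic Regression is fitted per class (class k vs. the rest).
ovr_model = OneVsRestClassifier(
    LogisticRegression(solver="liblinear", max_iter=200))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
ovr_accuracy = accuracy_score(y_test, y_pred)

print(f"Number of binary classifiers: {len(ovr_model.estimators_)}")
print("Classification Report (OneVsRestClassifier):")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```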





**8. Write a Python program to apply GridSearchCV to tune C and penalty
hyperparameters for Logistic Regression and print the best parameters and validation accuracy.**

**(Use Dataset from sklearn package)**

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset
data = load_iris()
X = data.data
y = data.target

# Step 2: Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Step 3: Define the model
model = LogisticRegression(multi_class='ovr', solver='liblinear')

# Step 4: Define hyperparameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],          # Inverse of regularization strength
    'penalty': ['l1', 'l2']                # L1 = Lasso, L2 = Ridge
}

# Step 5: Apply GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 6: Print best parameters and validation score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy: {:.4f}".format(grid_search.best_score_))

# Step 7: Evaluate on test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy: {:.4f}".format(test_accuracy))




Best Parameters: {'C': 10, 'penalty': 'l1'}
Best Cross-Validation Accuracy: 0.9583
Test Accuracy: 1.0000


**9. Write a Python program to standardize the features before training Logistic
Regression and compare the model's accuracy with and without scaling.**

**(Use Dataset from sklearn package)**


In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Step 2: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### --- Model WITHOUT Scaling ---
model_no_scaling = LogisticRegression(max_iter=1000)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)
acc_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

### --- Model WITH Scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=1000)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

# Step 3: Compare and Print Results
print(f"Accuracy WITHOUT scaling: {acc_no_scaling:.4f}")
print(f"Accuracy WITH scaling   : {acc_scaled:.4f}")


Accuracy WITHOUT scaling: 0.9561
Accuracy WITH scaling   : 0.9737


The run also emitted the following scikit-learn convergence warning, most likely from the unscaled fit, which is itself a hint that the features should be scaled:

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
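A single train/test split gives only one accuracy number per model. As a complementary sketch, scaling and the classifier can be bundled in a pipeline and scored with cross-validation, so the scaler is re-fitted on each training fold and never sees the held-out fold. The 5-fold setting and variable names here are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# StandardScaler is fitted inside each CV training fold, avoiding leakage.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", [f"{s:.4f}" for s in scores])
print(f"Mean accuracy  : {scores.mean():.4f}")
```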


**10. Imagine you are working at an e-commerce company that wants to
predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.**
>**1. Understand the Business Problem**

>>**Goal:** Predict which customers are likely to respond to a marketing campaign.

>>**Challenge:** Only 5% positive class, i.e., very imbalanced.

>>**Cost-sensitive:** False negatives may lead to lost revenue; false positives may waste marketing spend.

>**2. Data Preparation**

>>Handle missing values (e.g., fill, drop, or impute).

>>Convert categorical features using OneHotEncoder or pd.get_dummies().

>>Split the data into training and test sets (e.g., train_test_split with stratify=y to preserve class ratio).

>**3. Feature Scaling**

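>>Use StandardScaler to standardize features (zero mean, unit variance). Regularized Logistic Regression penalizes coefficient size, so features on larger scales would otherwise be penalized unevenly, and the solvers converge faster on scaled data. Fit the scaler on the training set only, then apply it to the test set.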

In [6]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


>**4. Handle Class Imbalance**

>>**Option A:** Use Class Weights (Recommended for Logistic Regression)

In [7]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(class_weight='balanced')  # weight classes inversely to their frequency


>>**Option B:** Resampling

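>>>Oversampling minority class: e.g., SMOTE() from the imbalanced-learn (imblearn) package.

>>>Undersampling majority class: e.g., RandomUnderSampler() from the same package.

>>>Resample the training data only; the test set must keep the original 5% response rate.

>>>Can use imblearn.pipeline.Pipeline for combining scaling + resampling + model training, which applies resampling during fitting only.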
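>>To see what `class_weight='balanced'` means for a 5% response rate, the sketch below computes the weights on a synthetic label vector (950 non-responders and 50 responders, made up for illustration) and compares them with the formula n_samples / (n_classes × class_count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 95% class 0 (no response), 5% class 1 (response).
y = np.array([0] * 950 + [1] * 50)
classes = np.array([0, 1])

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# Same quantity computed by hand.
manual = len(y) / (len(classes) * np.bincount(y))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight = {w:.4f}")
```

>>The minority class receives a weight of 10 and the majority class roughly 0.53, so each missed responder costs the model about 19 times as much as a misclassified non-responder.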

>**5. Train the Model**

>>Use Logistic Regression with:

>>>penalty='l2' (ridge regularization),

>>>class_weight='balanced',

>>>solver='liblinear' (if using L1/L2 on small data),

>>>max_iter=1000.

>**6. Hyperparameter Tuning with GridSearchCV**

>>Tune:

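>>>C: inverse of regularization strength (smaller C means a stronger penalty),

>>>penalty: 'l1', 'l2' (with a solver that supports both, such as 'liblinear' or 'saga').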

>>Use GridSearchCV with scoring='f1' or scoring='roc_auc' instead of accuracy.
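>>Steps 2 to 6 can be sketched end to end as follows. No real campaign data is available here, so the example generates a synthetic dataset with roughly 5% positives via `make_classification`; the grid values, the `average_precision` scoring choice, and all variable names are assumptions for illustration rather than a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the campaign data (about 5% responders).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42)

# Stratify so train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scaling lives inside the pipeline so it is re-fitted on every CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(
        solver="liblinear", class_weight="balanced", max_iter=1000)),
])

param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="average_precision")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated average precision: {search.best_score_:.4f}")
```

>>`average_precision` summarizes the precision-recall curve; `scoring='f1'` or `scoring='roc_auc'` can be substituted without other changes.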

>**7. Model Evaluation**

>>Since accuracy is misleading in imbalanced datasets, use:

>>| Metric | Why? |
>>|---|---|
>>| Precision | Minimize marketing waste (false positives) |
>>| Recall | Capture as many responders as possible |
>>| F1-score | Balance of precision and recall |
>>| ROC-AUC | General discrimination ability |
>>| PR AUC | Better for highly imbalanced classes |

>**8. Threshold Tuning**

>>By default, predictions are based on threshold 0.5, but for imbalanced data:

>>>Adjust threshold to improve recall/precision tradeoff.

>>>Use precision-recall curve to choose optimal threshold.
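>>A minimal sketch of threshold selection from the precision-recall curve is shown below. It again uses synthetic data with about 5% positives, and it picks the threshold that maximizes F1 on a held-out validation split; maximizing F1 is only one possible criterion (a business cost function could replace it), and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the campaign data (about 5% responders).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]  # P(response) per customer

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

# precision/recall have one more entry than thresholds; drop the final point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
y_pred = (val_scores >= best_threshold).astype(int)

print(f"Chosen threshold: {best_threshold:.3f}")
print(f"Precision={p[best]:.3f}, Recall={r[best]:.3f}, F1={f1[best]:.3f}")
```

>>The threshold should be chosen on validation data and then confirmed once on the untouched test set, otherwise the reported metrics will be optimistic.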

>**9. Model Deployment & Monitoring**

>>Deploy the model (e.g., via API or in marketing pipeline).

>>Monitor:

>>>Precision/recall drift over time,

>>>False positives/negatives,

>>>Real campaign response vs. predictions.