#LOGISTIC REGRESSION

#Question 1: What is Logistic Regression, and how does it differ from Linear Regression?

- Logistic Regression is a supervised machine learning algorithm used for classification problems, where the output (dependent variable) is categorical — for example, predicting whether an email is spam or not spam, or whether a patient has a disease or not.

- How Logistic Regression Works:

   - Instead of predicting continuous values like Linear Regression, it predicts the probability that a data point belongs to a particular class.

   - It uses the logistic (sigmoid) function to map predicted values between 0 and 1, which represent probabilities.

- Differences:

1. Linear regression solves regression problems; logistic regression solves classification problems.

2. Linear regression outputs any real number; logistic regression outputs a probability between 0 and 1.

3. Linear regression is fit by minimizing the mean squared error (MSE); logistic regression is fit by maximizing the likelihood, which is equivalent to minimizing the log loss.

4. A typical linear regression task is predicting house prices; a typical logistic regression task is predicting whether a customer will buy a product.
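
The two training objectives named in the differences above can be compared with a small NumPy sketch. The function names and the sample numbers are illustrative assumptions, not part of any library or dataset.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference, the usual linear regression objective."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for binary labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log(0) never occurs
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Illustrative regression targets and predictions (made-up numbers)
prices_true = [200.0, 310.0, 150.0]
prices_pred = [210.0, 300.0, 155.0]
mse = mean_squared_error(prices_true, prices_pred)

# Illustrative binary labels with confident vs. hesitant probabilities
labels = [1, 0, 1]
confident = log_loss(labels, [0.9, 0.1, 0.8])
unsure = log_loss(labels, [0.6, 0.4, 0.5])

print("MSE:", mse)
print("Log loss (confident, correct):", confident)
print("Log loss (unsure):", unsure)
```

Maximizing the likelihood rewards probabilities that are both correct and confident, which is why the first set of probabilities gets the lower log loss.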



#Question 2: Explain the role of the Sigmoid function in Logistic Regression.

- The Sigmoid function plays a central role in Logistic Regression because it converts the linear output of the model into a probability value between 0 and 1, making classification possible.

Formula of the Sigmoid Function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where

$$z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

- Role in Logistic Regression:

1. Transforms Linear Output to Probability:
    - The raw output z from the linear equation can be any real number (positive or negative).
     - The Sigmoid function compresses it into a value between 0 and 1, which can be interpreted as a probability.

2. Enables Classification:
 - With the default threshold of 0.5, if P(Y=1) > 0.5 the model predicts class 1.
 - If P(Y=1) <= 0.5, it predicts class 0.

3. Ensures Smooth Gradient for Optimization:
     - The sigmoid curve is smooth and differentiable, allowing gradient descent to optimize the model parameters effectively.

4. Probability Interpretation:
     - The Sigmoid output is read as P(Y=1 | x), the probability that the sample belongs to the positive class. Equivalently, z is the log-odds of the positive class, so each coefficient describes how a feature shifts the log-odds.

5. Non-linearity Introduction:
     - The Sigmoid makes the mapping from z to probability non-linear, so the predicted probability changes sharply around the threshold and flattens out near 0 and 1. The decision boundary itself remains linear in the features.
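
The roles listed above can be traced through a few lines of NumPy. This is a minimal sketch: the coefficient values and the sample are hypothetical, chosen only to show the linear score, the Sigmoid step, and the 0.5 threshold.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

# Hypothetical coefficients (b0 is the intercept) and one sample with two features
b0 = -1.0
b = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = b0 + b @ x                  # linear score: -1.0 + 1.6 - 0.5 = 0.1
p = sigmoid(z)                  # estimated P(Y=1)
predicted_class = int(p > 0.5)  # apply the 0.5 threshold

print("z =", z)
print("P(Y=1) =", p)
print("Predicted class:", predicted_class)
```

A score of exactly zero maps to a probability of 0.5, so the sign of z decides the predicted class at this threshold.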


#Question 3: What is Regularization in Logistic Regression and why is it needed?


- Regularization in Logistic Regression is a technique used to prevent overfitting by adding a penalty term to the model’s cost (loss) function.

- It ensures that the model does not become too complex or depend too heavily on any one feature, which helps it generalize better on new, unseen data.

- Why Regularization is Needed:

1. Prevents Overfitting:
   - When a model fits the training data too perfectly, it may perform poorly on test data. Regularization discourages overly large coefficient values, reducing overfitting.

2. Controls Model Complexity:
   - By penalizing large weights, regularization keeps the model simple and more interpretable.

3. Improves Generalization:
   - A regularized model performs better on unseen data because it focuses on important patterns rather than noise.

4. Handles Multicollinearity:
   - If features are highly correlated, regularization helps by reducing their coefficients, making the model more stable.

- Types of Regularization in Logistic Regression:   

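1. L1 Regularization (Lasso)
   - Penalty = λ Σ |w_i|
   - Effect = Can shrink some coefficients exactly to zero, which acts as a built-in form of feature selection.

2. L2 Regularization (Ridge)
   - Penalty = λ Σ w_i²
   - Effect = Reduces the magnitude of all coefficients but rarely makes them zero; helps smoothly shrink weights.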
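
The difference between the two penalties can be checked by counting coefficients that end up exactly zero. The sketch below is one possible setup, not a prescribed one: it assumes the breast cancer dataset, the `liblinear` solver (which accepts both penalties on binary problems), and an arbitrary `C=0.1`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # put all features on a comparable scale

# Same solver and C for both models so only the penalty differs
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set to zero with L1:", n_zero_l1)
print("Coefficients set to zero with L2:", n_zero_l2)
```

The whole dataset is scaled and fitted here only to inspect the coefficients; a real evaluation would scale inside a train/test split.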


#Question 4: What are some common evaluation metrics for classification models, and why are they important?      


- Evaluation metrics for classification models are used to measure how well the model predicts the correct categories.

- They are important because they help us understand a model’s accuracy, reliability, and usefulness in real-world decision-making — especially when data is imbalanced or when certain types of errors (like false positives) are more serious than others.

- Why These Metrics Are Important:

1. Provide deeper insights than accuracy alone.

2. Help select the best model for specific use cases (e.g., medical, finance, spam detection).

3. Show trade-offs between false positives and false negatives.

4. Assist in tuning thresholds to optimize model performance.

- Common Evaluation Metrics for Classification Models:-

1. Accuracy

   - Meaning: Measures the percentage of total correct predictions.

   - Importance: Simple and easy to interpret, but works best when classes are balanced.

2. Precision

   - Meaning: Out of all instances predicted as positive, how many are actually positive.

   - Importance: Useful when false positives are costly — for example, marking a normal email as spam.

3. Recall

    - Meaning: Out of all actual positive cases, how many the model correctly identified.

     - Importance: Important when missing positive cases is risky — for example, detecting diseases or fraud.

4. F1 Score

    - Meaning: It is the harmonic mean of Precision and Recall.

    - Importance: Balances Precision and Recall; useful when data is imbalanced.

5. Confusion Matrix

    - Meaning: A table showing counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

    - Importance: Gives a complete picture of the model’s performance, showing where it makes mistakes.   
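
The metrics above all derive from the four confusion-matrix counts. The following sketch computes them by hand; the helper name `classification_metrics` and the counts are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for an imbalanced problem (60 actual positives out of 1000)
metrics = classification_metrics(tp=40, tn=930, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is high even though a third of the actual positives are missed, which is the situation where precision, recall, and F1 add information.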


#Question 5: Write a Python program that loads a CSV file into a Pandas DataFrame, splits into train/test sets, trains a Logistic Regression model, and prints its accuracy. (Use Dataset from sklearn package)
    

    

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

print("Dataset Preview:")
print(df.head())

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")


Dataset Preview:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  

Model Accuracy: 100.00%
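
The program above builds the DataFrame directly from `load_iris()`. If the CSV-loading step in the question is meant literally, one way to cover it is to export the same data to a CSV file first and read it back with `pd.read_csv`. This is a sketch under that assumption; the temporary directory and the file name `iris.csv` are arbitrary.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris

# Export the bundled Iris data to a temporary CSV file (the file name is arbitrary)
iris_frame = load_iris(as_frame=True).frame   # four feature columns plus 'target'
csv_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
iris_frame.to_csv(csv_path, index=False)

# Load the CSV back into a DataFrame, as the question describes
df = pd.read_csv(csv_path)
X = df.drop(columns="target")
y = df["target"]
print(df.shape)
```

From here `X` and `y` can be passed to `train_test_split` and `LogisticRegression` exactly as in the cell above.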


#Question 6: Write a Python program to train a Logistic Regression model using L2 regularization (Ridge) and print the model coefficients and accuracy. (Use Dataset from sklearn package)

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df[data.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)


model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("🔹 Logistic Regression with L2 Regularization (Ridge)")
print("---------------------------------------------------")
print("Model Coefficients:\n", model.coef_)
print("\nModel Intercept:\n", model.intercept_)
print(f"\nModel Accuracy: {accuracy * 100:.2f}%")


🔹 Logistic Regression with L2 Regularization (Ridge)
---------------------------------------------------
Model Coefficients:
 [[ 2.09981182  0.13248576 -0.10346836 -0.00255646 -0.17024348 -0.37984365
  -0.69120719 -0.4081069  -0.23506963 -0.02356426 -0.0854046   1.12246945
  -0.32575716 -0.06519356 -0.02371113  0.05960156  0.00452206 -0.04277587
  -0.04148042  0.01425051  0.96630267 -0.37712622 -0.05858253 -0.02395975
  -0.31765956 -1.00443507 -1.57134711 -0.69351401 -0.84095566 -0.09308282]]

Model Intercept:
 [2.13128402]

Model Accuracy: 95.61%


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
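
The message above is a convergence warning from the `lbfgs` solver, and it suggests scaling the data. A minimal sketch of that suggestion follows, assuming a `StandardScaler` placed in a pipeline ahead of the model. The coefficients it learns are on a different scale, so they will not match the values printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fit on the training split only, then applied to the test split
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", max_iter=1000),  # L2 is the default penalty
)
pipeline.fit(X_train, y_train)

clf = pipeline.named_steps["logisticregression"]
scaled_accuracy = accuracy_score(y_test, pipeline.predict(X_test))
print("Iterations used:", clf.n_iter_[0])
print("Model Coefficients:\n", clf.coef_)
print(f"Accuracy with scaling: {scaled_accuracy * 100:.2f}%")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training and lets the solver finish well within the iteration limit.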


#Question 7: Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr' and print the classification report. (Use Dataset from sklearn package)


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df[iris.feature_names]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=500)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("🔹 Logistic Regression (One-vs-Rest) Classification Report:")
print("-----------------------------------------------------------")
print(classification_report(y_test, y_pred, target_names=iris.target_names))


🔹 Logistic Regression (One-vs-Rest) Classification Report:
-----------------------------------------------------------
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
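
Recent scikit-learn releases deprecate the `multi_class` argument, so the cell above may emit a warning or stop working depending on the installed version. An alternative that expresses the same one-vs-rest scheme is to wrap the model in `OneVsRestClassifier`. The sketch below assumes the same split as above; its report may differ slightly from the one printed earlier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# One binary logistic regression per class; the class with the highest score wins
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr_model.fit(X_train, y_train)

y_pred = ovr_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

After fitting, `ovr_model.estimators_` holds the three binary models, one per Iris class.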





#Question 8: Write a Python program to apply GridSearchCV to tune C and penalty hyperparameters for Logistic Regression and print the best parameters and validation accuracy. (Use Dataset from sklearn package)


In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)


iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(solver='liblinear', multi_class='ovr', max_iter=200)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5, scoring='accuracy')

grid_search.fit(X_train, y_train)

print("Best Hyperparameters:", grid_search.best_params_)

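y_pred = grid_search.best_estimator_.predict(X_test)

# This is accuracy on the held-out test set; the cross-validated
# (validation) accuracy is stored in grid_search.best_score_
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)


Best Hyperparameters: {'C': 10, 'penalty': 'l1'}
Test Accuracy: 1.0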
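
The accuracy printed above is measured on the held-out test split. The validation accuracy that `GridSearchCV` uses to pick the hyperparameters is the mean cross-validated score, available as `best_score_`. The sketch below reports both. It swaps in a scaled pipeline with the `saga` solver, which is an assumption made here because that solver accepts both `'l1'` and `'l2'` on multiclass data, so the selected parameters may differ from those shown above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=5000, random_state=0)),
])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100], "clf__penalty": ["l1", "l2"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best Hyperparameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)  # validation estimate
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Reporting the two numbers separately keeps the test split as an untouched check on the configuration chosen by cross-validation.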


#Question 9: Write a Python program to standardize the features before training Logistic Regression and compare the model's accuracy with and without scaling(Use Dataset from sklearn package)


In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model_no_scaling = LogisticRegression(max_iter=200)
model_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = model_no_scaling.predict(X_test)

accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)
print("Accuracy without scaling:", accuracy_no_scaling)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LogisticRegression(max_iter=200)
model_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = model_scaled.predict(X_test_scaled)


accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print("Accuracy with scaling:", accuracy_scaled)


Accuracy without scaling: 1.0
Accuracy with scaling: 1.0


#Question 10: Imagine you are working at an e-commerce company that wants to predict which customers will respond to a marketing campaign. Given an imbalanced dataset (only 5% of customers respond), describe the approach you’d take to build a Logistic Regression model — including data handling, feature scaling, balancing classes, hyperparameter tuning, and evaluating the model for this real-world business use case.


1. Understand the Problem

   - Target: Predict whether a customer will respond (1) or not (0) to a marketing campaign.

   - Issue: Only 5% of customers respond → highly imbalanced dataset.

   - Goal: Maximize true positives (catch responders) while controlling false positives.

2. Data Handling

   - Check for missing values and handle them (mean/median for numeric, mode for categorical).

   - Feature selection/engineering:

      - Encode categorical variables (One-Hot or Ordinal Encoding).

      - Create meaningful features (e.g., recency, frequency, monetary value).

   - Split the dataset into training and test sets using stratification to preserve class distribution.

3. Feature Scaling

   - Logistic Regression benefits from scaled features, especially if using regularization.

   - Use StandardScaler or MinMaxScaler, fitted on the training set only and then applied to the test set.

4. Handling Class Imbalance

   - With only 5% responders, a naive model predicts all zeros → 95% accuracy but useless.

   - Strategies:

      - Class weighting in Logistic Regression (e.g., class_weight='balanced').

      - Resampling (applied to the training set only):

         - Oversample the minority class, e.g., SMOTE (imblearn.over_sampling.SMOTE).

         - Undersample the majority class (careful not to lose data).

5. Hyperparameter Tuning

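   - Key parameters for Logistic Regression:

      - C: Inverse of regularization strength (smaller values mean stronger regularization).

      - penalty: 'l1', 'l2', or 'elasticnet' (the chosen solver limits which penalties are available).

   - Use GridSearchCV with cross-validation and a scoring metric suited to imbalanced data, such as F1 or average precision, rather than accuracy.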
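
The steps above can be sketched end to end as follows. Since no campaign data is given, the sketch generates a synthetic dataset with 5% positives using `make_classification`. The feature counts, the grid of `C` values, and the choice of average precision as the scoring metric are all assumptions for illustration. Class weighting stands in for resampling, and the final report uses the precision and recall metrics from Question 4.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for campaign data: 5% positives ("responders")
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], flip_y=0, random_state=42,
)

# Stratify so both splits keep the 5% response rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    # class_weight='balanced' reweights errors inversely to class frequency
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Tune C only; average precision is one scoring choice suited to rare positives
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5,
                      scoring="average_precision")
search.fit(X_train, y_train)

test_scores = search.predict_proba(X_test)[:, 1]
test_ap = average_precision_score(y_test, test_scores)
print("Best C:", search.best_params_["clf__C"])
print("Test average precision:", round(test_ap, 3))
print(classification_report(y_test, search.predict(X_test), digits=3))
```

In a business setting the 0.5 threshold used by `predict` would normally be revisited, for example by ranking customers on `test_scores` and contacting only the top fraction the campaign budget allows.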



   