#**Logistic Regression**

#**Theoretical**

-----

**1. What is Logistic Regression, and how does it differ from Linear Regression?**

* **Linear Regression:** Predicts a continuous numerical output (e.g., house price). It fits a straight line (a hyperplane when there are several features) to the data.
* **Logistic Regression:** Predicts a categorical output (e.g., yes/no, spam/not spam). It estimates the probability of an event occurring, then classifies by comparing that probability to a threshold (commonly 0.5). It uses a "squishing" function (sigmoid) to constrain the output between 0 and 1.

-----

**2. What is the mathematical equation of Logistic Regression?**

* The equation is: $P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}}$
    * Where:
        * $P(y=1|x)$ is the probability of the output being 1 (e.g., "yes").
        * $e$ is Euler's number.
        * $\beta_0, \beta_1, ... \beta_n$ are the coefficients.
        * $x_1, x_2, ... x_n$ are the input features.

----

**3. Why do we use the Sigmoid function in Logistic Regression?**

* The sigmoid function (the fraction in the equation above) "squishes" any real number into a value between 0 and 1. This is perfect for probabilities, which must be within that range. It takes the linear output of the input features, and transforms it into a probability.
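As a minimal sketch of that transformation, the snippet below computes the linear score and passes it through the sigmoid. The coefficient and feature values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept, coefficients, and one input row (illustration only)
beta_0 = -1.0
beta = np.array([0.8, -0.5])
x = np.array([2.0, 1.0])

z = beta_0 + beta @ x   # linear part: -1.0 + 1.6 - 0.5 = 0.1
p = sigmoid(z)          # P(y=1 | x)

print(f"linear score z = {z:.2f}, probability = {p:.3f}")
# Very negative scores approach 0, a score of 0 gives exactly 0.5,
# and very positive scores approach 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
```

A score of 0 sits exactly on the 0.5 decision boundary, which is why the sign of the linear part decides the predicted class when the threshold is 0.5.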

----

**4. What is the cost function of Logistic Regression?**

* The cost function is called "cross-entropy" or "log loss." It measures how wrong our predictions are. It penalizes incorrect predictions more heavily when they are made with high confidence. The goal is to minimize this cost.
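For a single example with true label $y$ and predicted probability $p$, the log loss is $-[y\log(p) + (1-y)\log(1-p)]$, and the cost is the average over all examples. The sketch below evaluates that per-example loss for three made-up predictions of a positive example, to show how a confident wrong prediction is penalized far more than a hesitant one. The function name and the sample values are illustrative assumptions.

```python
import numpy as np

def log_loss_per_sample(y_true, p_pred):
    """Per-example binary cross-entropy (log loss)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-15, 1 - 1e-15)
    return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Three positive examples: confident and right, unsure, confident and wrong
y_true = [1, 1, 1]
p_pred = [0.9, 0.6, 0.1]

losses = log_loss_per_sample(y_true, p_pred)
for p, loss in zip(p_pred, losses):
    print(f"predicted p = {p:.1f} -> loss = {loss:.3f}")
print("mean log loss:", losses.mean())
```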


----

**5. What is Regularization in Logistic Regression? Why is it needed?**

* Regularization adds a penalty to the cost function based on the size of the coefficients. This prevents "overfitting," where the model learns the training data too well and performs poorly on new data. It keeps the model simple.

----

**6. Explain the difference between Lasso, Ridge, and Elastic Net regression.**

* **Ridge (L2):** Adds a penalty proportional to the *square* of the coefficients. Shrinks all coefficients, but rarely makes them exactly zero.
* **Lasso (L1):** Adds a penalty proportional to the *absolute value* of the coefficients. Can shrink some coefficients to exactly zero, effectively performing feature selection.
* **Elastic Net:** A combination of Ridge and Lasso. It adds both L1 and L2 penalties, providing a balance between feature selection and coefficient shrinkage.
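One way to see the difference is to fit the same data with each penalty and count how many coefficients end up exactly zero. The sketch below assumes a synthetic dataset from `make_classification` and a deliberately strong penalty (`C=0.05`); both are illustrative choices, and the exact counts will depend on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, of which only 3 carry signal (assumed setup)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # penalties are scale-sensitive

# Same regularization strength, different penalty types
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)
en_model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5,
                              C=0.05, max_iter=5000).fit(X, y)

n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
n_zero_en = int(np.sum(en_model.coef_ == 0))

print("Coefficients set exactly to zero")
print("  Lasso (L1):  ", n_zero_l1)
print("  Ridge (L2):  ", n_zero_l2)
print("  Elastic Net: ", n_zero_en)
```

Features are standardized first because the penalty acts on coefficient size, which depends on each feature's scale.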

----

**7. When should we use Elastic Net instead of Lasso or Ridge?**

* When you have many features, and you suspect that some are correlated. Elastic Net combines the strengths of both Lasso and Ridge, handling correlated features better than Lasso alone and providing feature selection.

----

**8. What is the impact of the regularization parameter (λ) in Logistic Regression?**

* The regularization parameter (lambda) controls the strength of regularization. Libraries such as scikit-learn expose its inverse, C = 1/lambda, instead.
    * A large lambda (small C) means strong regularization: coefficients shrink more, simplifying the model.
    * A small lambda (large C) means weak regularization: the model can become more complex, potentially overfitting.
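The effect of C can be checked by fitting the same data at several values and comparing the overall size of the coefficient vector. This is a rough sketch on an assumed synthetic dataset; the specific C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumed setup for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Default penalty is L2; small C means a large lambda (strong shrinkage)
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C = {C:>6}: coefficient norm = {norms[C]:.3f}")
```

The norm should grow as C increases, since a weaker penalty leaves the coefficients freer to fit the training data.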

----

**9. What are the key assumptions of Logistic Regression?**

* Binary output (for standard, two-class logistic regression).
* Independent observations (each row is not influenced by the others).
* Linearity between features and the log-odds of the outcome.
* No severe multicollinearity (high correlation between features).


----

**10. What are some alternatives to Logistic Regression for classification tasks?**

* Decision Trees
* Random Forests
* Support Vector Machines (SVMs)
* K-Nearest Neighbors (KNN)
* Neural Networks

----

**11. What are Classification Evaluation Metrics?**

* **Accuracy:** Overall correct predictions.
* **Precision:** Correct positive predictions out of all positive predictions.
* **Recall (Sensitivity):** Correct positive predictions out of all actual positive cases.
* **F1-score:** Harmonic mean of precision and recall.
* **AUC-ROC:** Area under the Receiver Operating Characteristic curve, measuring the ability to distinguish between classes.
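The metrics above can be computed on a small hand-made example. The labels and scores below are invented for illustration; the predicted classes are derived from the scores with a 0.5 threshold, while AUC-ROC is computed from the scores themselves.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Invented true labels and predicted probabilities for the positive class
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_score)         # uses scores, not hard labels

print(f"accuracy={accuracy} precision={precision} recall={recall} f1={f1} auc={auc}")
```

Here 3 of the 4 predicted positives are correct and 3 of the 4 actual positives are found, so precision, recall, and F1 are all 0.75. The AUC is 15/16 = 0.9375 because 15 of the 16 positive/negative pairs are ranked correctly.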


-----

**12. How does class imbalance affect Logistic Regression?**

* If one class is much more frequent than the other, the model may be biased towards the majority class. It might perform well on the majority class but poorly on the minority class. Solutions include oversampling, undersampling, and using class weights.
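As a sketch of the class-weight option, the snippet below shows the weights scikit-learn derives in "balanced" mode, `n_samples / (n_classes * class_count)`, for an assumed 90/10 label split. The label array is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class gets the larger weight

# Passing class_weight='balanced' makes the model apply these weights
# to the loss during fitting
model = LogisticRegression(class_weight='balanced')
```

With these weights, each minority-class error costs about nine times as much as a majority-class error, which pushes the decision boundary toward the majority class.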

------

**13. What is Hyperparameter Tuning in Logistic Regression?**

* It's the process of finding the best values for hyperparameters such as the regularization strength (C), the penalty type, and the solver. Techniques include grid search and random search, with each candidate setting scored by cross-validation.

------

**14. What are different solvers in Logistic Regression? Which one should be used?**

* Solvers are algorithms used to optimize the cost function. Common ones in scikit-learn include:
    * 'liblinear': Good for small datasets; supports L1 and L2 regularization.
    * 'lbfgs': Good for larger datasets; supports L2 regularization or no penalty.
    * 'sag' and 'saga': Good for large datasets, ideally with scaled features. 'sag' supports L2 only, while 'saga' also supports L1 and Elastic Net.
* The choice depends on the dataset size and the type of regularization.

-----

**15. How is Logistic Regression extended for multiclass classification?**

* **One-vs-Rest (OvR):** Trains a binary classifier for each class against all other classes.
* **Softmax Regression:** Generalizes logistic regression to multiple classes by calculating the probability of each class and selecting the class with the highest probability.


----

**16. What are the advantages and disadvantages of Logistic Regression?**

* **Advantages:**
    * Simple and easy to implement.
    * Provides probabilities.
    * Efficient for binary classification.
    * Easy to interpret coefficients.
* **Disadvantages:**
    * Assumes linearity.
    * Can struggle with complex relationships.
    * Sensitive to outliers.


----

**17. What are some use cases of Logistic Regression?**

* Spam detection.
* Medical diagnosis (e.g., predicting disease presence).
* Credit risk assessment.
* Customer churn prediction.

------

**18. What is the difference between Softmax Regression and Logistic Regression?**

* **Logistic Regression:** Binary classification (two classes).
* **Softmax Regression:** Multiclass classification (more than two classes). Softmax outputs probabilities for each class, summing to 1.
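A minimal sketch of the softmax function follows, with hypothetical class scores. It also checks that, with two classes and one score fixed at 0, softmax reduces to the sigmoid used in binary logistic regression.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / exp_z.sum()

# Hypothetical linear scores for three classes
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print("probabilities:", probs)
print("sum:", probs.sum())
print("predicted class:", int(np.argmax(probs)))

# Two-class case: softmax([z, 0]) gives the same value as sigmoid(z)
z = 0.7
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(two_class, sigmoid_z)
```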

----

**19. How do we choose between One-vs-Rest (OvR) and Softmax for multiclass classification?**

* Softmax is generally preferred when the classes are mutually exclusive (an item belongs to only one class).
* OvR can be useful when classes are not mutually exclusive (an item can belong to multiple classes).
* In many cases, the results are very similar.

-----


**20. How do we interpret coefficients in Logistic Regression?**

* The coefficients represent the change in the log-odds of the outcome for a one-unit change in the predictor, holding other predictors constant.
* To make it more interpretable, you can exponentiate the coefficients, which gives you the odds ratio. An odds ratio greater than 1 means the odds of the outcome increase; less than 1 means they decrease.
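The conversion from coefficients to odds ratios is a single exponentiation. The coefficient values below are hypothetical, picked so the resulting odds ratios are round numbers; with a fitted scikit-learn model the same step would be applied to `model.coef_`.

```python
import numpy as np

# Hypothetical log-odds coefficients for three features
feature_names = ["feature_a", "feature_b", "feature_c"]
coefficients = np.array([np.log(2.0), -np.log(2.0), 0.0])

# Exponentiating a log-odds coefficient gives the odds ratio
odds_ratios = np.exp(coefficients)

for name, coef, ratio in zip(feature_names, coefficients, odds_ratios):
    print(f"{name}: coefficient = {coef:+.3f}, odds ratio = {ratio:.2f}")
```

An odds ratio of 2 means a one-unit increase in that feature doubles the odds of the outcome, 0.5 halves them, and 1 leaves them unchanged.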

-----

#**Practical**

In [None]:
# 1. Write a Python program that loads a dataset, splits it into training and testing sets, applies Logistic Regression, and prints the model accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 4, 3, 2, 1, 0],
    'target_column': [0, 0, 0, 1, 1, 1]  # Binary target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]  # Use a list of feature column names
y = data['target_column']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Model Accuracy:", accuracy)

Model Accuracy: 1.0


In [None]:
# 2. Write a Python program to apply L1 regularization (Lasso) on a dataset using LogisticRegression (penalty='l1') and print the model accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 4, 3, 2, 1, 0],
    'target_column': [0, 0, 0, 1, 1, 1]  # Binary target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]
y = data['target_column']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with L1 regularization
model = LogisticRegression(penalty='l1', solver='liblinear')  # 'liblinear' for L1

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Model Accuracy (L1 Regularization):", accuracy)

Model Accuracy (L1 Regularization): 1.0


In [None]:
# 3. Write a Python program to train Logistic Regression with L2 regularization (Ridge) using LogisticRegression (penalty='l2'). Print model accuracy and coefficients.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 4, 3, 2, 1, 0],
    'target_column': [0, 0, 0, 1, 1, 1]  # Binary target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]
y = data['target_column']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with L2 regularization
model = LogisticRegression(penalty='l2')  # 'l2' is the default

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Model Accuracy (L2 Regularization):", accuracy)

# Print coefficients
print("Model Coefficients:", model.coef_)

Model Accuracy (L2 Regularization): 1.0
Model Coefficients: [[ 0.60372287 -0.60348415]]


In [None]:
# 4. Write a Python program to train Logistic Regression with Elastic Net Regularization (penalty='elasticnet').

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 4, 3, 2, 1, 0],
    'target_column': [0, 0, 0, 1, 1, 1]  # Binary target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]
y = data['target_column']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Logistic Regression model with Elastic Net regularization
model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Model Accuracy (Elastic Net Regularization):", accuracy)

Model Accuracy (Elastic Net Regularization): 1.0


In [None]:
# 5. Write a Python program to train a Logistic Regression model for multiclass classification using multi_class='ovr'.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6],
    'feature2': [5, 4, 3, 2, 1, 0],
    'target_column': [0, 0, 1, 1, 2, 2]  # Multiclass target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]
y = data['target_column']

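# Split data into training and testing sets
# Caution: with six rows and three classes, the two test rows can belong to a class
# that never appears in the training rows, which would explain an accuracy of 0.0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)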

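# Create Logistic Regression model for multiclass classification (One-vs-Rest)
# Note: `multi_class` is deprecated in recent scikit-learn releases (1.5+);
# sklearn.multiclass.OneVsRestClassifier is the suggested replacement
model = LogisticRegression(multi_class='ovr', solver='liblinear')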

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy
print("Model Accuracy (Multiclass - OvR):", accuracy)

Model Accuracy (Multiclass - OvR): 0.0
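An accuracy of 0.0 on a two-row test set is what one would expect if the test rows belong to a class the model never saw during training. The sketch below is one possible variant, not the original exercise: it assumes the built-in Iris dataset instead of the six-row sample, uses `stratify=y` so every class appears in both splits, and uses `OneVsRestClassifier` as the One-vs-Rest wrapper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Iris: 150 rows, 3 classes (swapped in here for illustration)
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One binary Logistic Regression per class, each trained against the rest
ovr_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, ovr_model.predict(X_test))
print("Number of binary classifiers:", len(ovr_model.estimators_))
print("Model Accuracy (Multiclass - OvR):", accuracy)
```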




In [None]:
# 6. Write a Python program to apply GridSearchCV to tune the hyperparameters (C and penalty) of Logistic Regression. Print the best parameters and accuracy.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create a sample dataset (replace with your data loading)
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8],
    'feature2': [5, 4, 3, 2, 1, 0, 1, 2],
    'target_column': [0, 0, 0, 1, 1, 1, 0, 1]  # Binary target
})

# Separate features (X) and target variable (y)
X = data[['feature1', 'feature2']]
y = data['target_column']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid with explicit solver-penalty pairings
param_grid = [
    {'C': [0.1, 1, 10], 'penalty': ['l1'], 'solver': ['liblinear']},
    {'C': [0.1, 1, 10], 'penalty': ['l2'], 'solver': ['liblinear', 'lbfgs', 'sag', 'newton-cg']},
    # l1_ratio had no value in the original grid; 0.5 is an assumed placeholder
    {'C': [0.1, 1, 10], 'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.5]},
    # C is ignored when there is no penalty, so it is left out here
    # (penalty=None needs scikit-learn 1.2+; older versions use the string 'none')
    {'penalty': [None], 'solver': ['lbfgs']}
]

# Create Logistic Regression model
model = LogisticRegression(max_iter=5000)  # No solver specified here; max_iter raised to help sag/saga converge

# Create GridSearchCV object
# cv=3 because the 6-row training set cannot support 5 stratified folds
grid_search = GridSearchCV(model, param_grid, cv=3)  # cv is for cross-validation

# Fit the GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Get the best accuracy
best_accuracy = grid_search.best_score_  # Accuracy from cross-validation
print("Best Accuracy:", best_accuracy)

# Evaluate on the test set
y_pred = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test Set Accuracy:", test_accuracy)
