# What it is 
* Logistic Regression is a statistical method used to fit a regression model when the response variable is binary. In logistic regression, the goal is to find the best coefficients, denoted by $\beta_0, \beta_1, ..., \beta_p$, that will separate the positive instances from the negative instances in the training set.

* The logistic function, also known as the sigmoid function, is used to model the probability of the positive class given the feature values:

$$P(y=1|x) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p)}}$$
Where $x$ are the feature values and $\beta_0, \beta_1, ..., \beta_p$ are the coefficients.

* To find the best coefficients, we need to maximize the likelihood of the observed data. The likelihood is defined as the probability of the observed data given the model. The negative log-likelihood is used as a cost function to minimize during the optimization process.

* The cost function for logistic regression is given by:
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(P(y_i=1|x_i)) + (1-y_i)\log(1-P(y_i=1|x_i))]$$

Where $n$ is the number of instances in the training set.
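
The link between the likelihood and the cost function can be checked numerically. The sketch below is only an illustration: the toy data, the coefficient values, and the helper names `sigmoid` and `cost` are assumptions made for this example and are not part of the notebook's own code. It computes the likelihood as a product of per-instance probabilities and confirms that $J(\beta)$ equals $-\frac{1}{n}$ times its logarithm.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta0, beta, X, y):
    """Average negative log-likelihood J(beta) for labels y in {0, 1}."""
    p = sigmoid(beta0 + X @ beta)  # P(y=1 | x_i) for every row of X
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: 4 instances, 2 features
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.0]])
y = np.array([1, 0, 1, 0])
beta0, beta = 0.1, np.array([-0.8, 0.6])  # arbitrary coefficients

p = sigmoid(beta0 + X @ beta)
# Likelihood: probability the model assigns to the labels that were observed
likelihood = np.prod(np.where(y == 1, p, 1 - p))
J = cost(beta0, beta, X, y)

print(J, -np.log(likelihood) / len(y))  # the two values should agree
```

Because the logarithm is monotonic, the coefficients that minimize $J(\beta)$ are the same ones that maximize the likelihood.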

* This cost function is minimized with iterative optimization algorithms such as gradient descent, the Newton-Raphson method, or conjugate gradient.

* Once the optimal values of coefficients are found, the model can be used to predict the probability of the positive class for new instances. A threshold is chosen to convert the predicted probability into a binary output.

* In summary, Logistic Regression is a supervised learning algorithm used to classify binary data by fitting a logistic function to the input features and the output variable, and then maximizing the likelihood of the observed data.

## L1
* L1 regularization, also known as Lasso regularization, is a method to avoid overfitting by adding a penalty term to the cost function. The penalty term is the sum of the absolute values of the coefficients multiplied by a hyperparameter, $\lambda$. This has the effect of shrinking the coefficients towards zero, which in turn can lead to some of the features being completely ignored by the model (i.e., the coefficients become exactly zero).

* The L1 regularization term causes some coefficients to become exactly zero. This can be useful when we have a high number of features and we want to select a subset for our model. This method is known as feature selection.

* An advantage of L1 regularization is that the sparse models it produces are computationally efficient to evaluate, which is useful when we have a large number of features and want to select a subset for our model.

* A disadvantage of L1 regularization is that the penalty is not differentiable at zero, which makes it more difficult to optimize the cost function using gradient-based optimization algorithms. Additionally, L1 regularization can lead to unstable solutions: when features are strongly correlated, which of them is kept can change with small changes in the data, and this makes the selected subset harder to interpret.

* Overall, L1 regularization is a useful technique to prevent overfitting and to select a subset of features for a logistic regression model. However, it should be used with caution, as it can lead to unstable solutions and models that are difficult to interpret.
* The goal of logistic regression is to find the best coefficients $\beta_0, \beta_1, ..., \beta_p$ that will separate the positive instances from the negative instances in the training set.

* The logistic function is used to model the probability of the positive class given the feature values:
$$P(y=1|x) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p)}}$$
Where $x$ is the feature vector, $\beta_0, \beta_1, ..., \beta_p$ are the coefficients and $e$ is the base of the natural logarithm.

* With L1 regularization, we add a penalty term to the cost function: the sum of the absolute values of the coefficients multiplied by a hyperparameter $\lambda$. This has the effect of shrinking the coefficients towards zero. The cost function for logistic regression with L1 regularization is given by:
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(P(y_i=1|x_i)) + (1-y_i)\log(1-P(y_i=1|x_i))] + \lambda\sum_{j=1}^{p}|\beta_j|$$
Where $\lambda$ is the regularization parameter, which controls the strength of the regularization. A higher value of $\lambda$ will result in smaller coefficients. The sum starts at $j=1$, so the intercept $\beta_0$ is not penalized.

* It is important to note that this cost function is not differentiable at $\beta_j=0$, so plain gradient descent cannot be used directly. Instead, methods that can handle the kink at zero are used, such as sub-gradient descent, proximal gradient descent, or coordinate descent.
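
To illustrate how an L1 penalty can produce coefficients that are exactly zero, here is a minimal sketch using proximal gradient descent with soft-thresholding. Note that this is a swapped-in alternative to the sub-gradient descent mentioned above, chosen because it makes the exact zeros easy to see. The synthetic data, the step size, the iteration count, and the function names `soft_threshold` and `fit_l1` are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink each entry towards zero by t, stopping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1(X, y, lam, alpha=0.1, n_iter=2000):
    """Proximal gradient descent for L1-penalized logistic regression (illustrative)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        err = sigmoid(beta0 + X @ beta) - y       # y_hat - y for every instance
        beta0 -= alpha * err.mean()               # the intercept is not penalized
        # Gradient step on the log-loss, then soft-threshold to apply the L1 penalty
        beta = soft_threshold(beta - alpha * (X.T @ err) / n, alpha * lam)
    return beta0, beta

# Hypothetical data: only the first of five features carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

_, beta_weak = fit_l1(X, y, lam=0.001)    # weak penalty
_, beta_strong = fit_l1(X, y, lam=0.1)    # strong penalty

print(np.round(beta_weak, 3))
print(np.round(beta_strong, 3))
```

With the stronger penalty one would expect a smaller total coefficient magnitude and more coefficients that are exactly zero, which is the feature-selection behaviour described in this section.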

## L2
* L2 regularization, also known as Ridge regularization, is a method to avoid overfitting by adding a penalty term to the cost function. The penalty term is the sum of the squares of the coefficients multiplied by a hyperparameter, lambda. This has the effect of shrinking the coefficients towards zero, which in turn can lead to a simpler model.

* The L2 regularization term causes the coefficients to be smaller, and it does not force any coefficients to be exactly zero. This can be useful when we want to balance the trade-off between a simpler model and a model that fits the data well.

* The cost function for logistic regression with L2 regularization is given by:
$$J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(P(y_i=1|x_i)) + (1-y_i)\log(1-P(y_i=1|x_i))] + \frac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2$$
Where $\lambda$ is the regularization parameter, and it controls the strength of the regularization. A higher value of $\lambda$ will result in smaller coefficients.

* It is important to note that L2 regularization is differentiable so it can be optimized using optimization algorithms like gradient descent.

* As L2 regularization causes the coefficients to be smaller, it can help reduce the variance of the model and prevent overfitting. However, it may not be as effective as L1 regularization in selecting a subset of features for the model.
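
Because the L2-penalized cost is differentiable everywhere, its gradient can be written down directly: the penalty $\frac{\lambda}{2}\sum_j \beta_j^2$ contributes $\lambda\beta_j$ to each partial derivative. The sketch below is an illustration under assumed toy data and hypothetical helper names (`l2_cost`, `l2_grad`); it compares the analytic gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_cost(beta0, beta, X, y, lam):
    """Negative log-likelihood plus (lam / 2) * sum(beta_j ** 2); intercept unpenalized."""
    p = sigmoid(beta0 + X @ beta)
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * np.sum(beta ** 2)

def l2_grad(beta0, beta, X, y, lam):
    """Gradient of l2_cost with respect to the intercept and the coefficients."""
    err = sigmoid(beta0 + X @ beta) - y           # y_hat - y
    return err.mean(), X.T @ err / len(y) + lam * beta

# Hypothetical toy data and arbitrary parameter values
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
beta0, beta, lam = 0.2, np.array([0.5, -0.3, 0.1]), 0.7

# Central finite-difference estimate of dJ/dbeta_1
eps = 1e-6
step = np.array([eps, 0.0, 0.0])
numeric = (l2_cost(beta0, beta + step, X, y, lam)
           - l2_cost(beta0, beta - step, X, y, lam)) / (2 * eps)
_, grad = l2_grad(beta0, beta, X, y, lam)

print(numeric, grad[0])  # the two values should agree closely
```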

# Feedforward in Logistic Regression
In logistic regression, feedforward is the process of calculating the predicted output from the input features and the coefficients of the model.

The feedforward process starts by taking the input feature values and multiplying them with the corresponding coefficients, then it sums them up and applies the logistic function to the result.

The logistic function, also known as the sigmoid function, is defined as:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
Where $z$ is the input to the function.

In logistic regression, we compute the input to the logistic function as:
$$z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p$$
Where $\beta_0, \beta_1, ..., \beta_p$ are the coefficients of the model and $x_1, x_2, ..., x_p$ are the features.
The predicted output, also known as the hypothesis, is then given by the logistic function applied to the input:
$$\hat{y} = \sigma(z) = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p)}}$$
It is important to note that the logistic function outputs a probability of the positive class, so a threshold is chosen to convert the predicted probability into a binary output.
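
The feedforward steps above can be sketched in a few lines of NumPy. The function name `feedforward`, the example inputs, and the default threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def feedforward(X, beta0, beta, threshold=0.5):
    """Return (predicted probabilities, binary predictions) for each row of X."""
    z = beta0 + X @ beta                  # z = beta_0 + beta_1*x_1 + ... + beta_p*x_p
    y_hat = 1.0 / (1.0 + np.exp(-z))      # logistic (sigmoid) function
    labels = (y_hat >= threshold).astype(int)
    return y_hat, labels

# Hypothetical inputs: 3 instances, 2 features
X = np.array([[2.0, 1.0], [0.0, 0.0], [-3.0, 1.0]])
beta0, beta = 0.5, np.array([1.0, -0.5])

y_hat, labels = feedforward(X, beta0, beta)
print(y_hat)
print(labels)
```

Raising or lowering `threshold` trades off false positives against false negatives without changing the fitted coefficients.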

# Backpropagation in Logistic Regression
* Backpropagation is an algorithm for computing the gradient of the cost function with respect to a model's coefficients; the coefficients are then updated in the opposite direction of the gradient (gradient descent). It is commonly used in neural networks, and logistic regression can be viewed as the single-layer special case, where the chain rule gives the gradient in closed form.

* In logistic regression, the cost function is the negative log-likelihood of the observed data given the model. The gradient of the cost function with respect to the coefficients is given by:
$$\frac{\partial J}{\partial \beta_j} = -\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)x_{ij}$$
Where $\hat{y}_i$ is the predicted probability of the positive class for the i-th instance in the training set, $y_i$ is the actual label, $x_{ij}$ is the j-th feature value for the i-th instance (with $x_{i0} = 1$ for the intercept $\beta_0$), and $n$ is the number of instances in the training set.

* The coefficients are then updated using the gradient:
$$\beta_j = \beta_j - \alpha\frac{\partial J}{\partial \beta_j}$$
Where $\alpha$ is the learning rate, which controls the step size of the updates.

* This process is repeated for a number of iterations until the cost function converges to a minimum.

* It's important to note that this training procedure is supervised: the gradient depends on the true labels $y_i$, so labeled data is needed to learn the model's parameters.
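
The gradient and update rule above can be put together into a short training loop. This is a minimal sketch, not the notebook's own implementation: the synthetic data, the learning rate, the iteration count, and the function name `gradient_descent` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(beta, Xb, y):
    """Average negative log-likelihood for a design matrix Xb that includes a column of ones."""
    p = sigmoid(Xb @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent(X, y, alpha=0.1, n_iter=500):
    """Fit coefficients by repeating beta_j <- beta_j - alpha * dJ/dbeta_j."""
    Xb = np.column_stack([np.ones(len(X)), X])   # x_i0 = 1 handles the intercept beta_0
    beta = np.zeros(Xb.shape[1])
    history = []
    for _ in range(n_iter):
        y_hat = sigmoid(Xb @ beta)               # feedforward pass
        grad = -(Xb.T @ (y - y_hat)) / len(y)    # -(1/n) * sum_i (y_i - y_hat_i) * x_ij
        beta = beta - alpha * grad               # step against the gradient
        history.append(cost(beta, Xb, y))
    return beta, history

# Hypothetical noisy data with two informative features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=100) > 0).astype(float)

beta, history = gradient_descent(X, y)
print(history[0], history[-1])  # cost after the first and the last iteration
```

In practice the loop is usually stopped when the change in cost falls below a tolerance rather than after a fixed number of iterations.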

In [None]:
# Implementing logistic regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Optional: Intel Extension for Scikit-learn, patched in before the sklearn imports
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # fall back to stock scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler


In [None]:
# Loading the breast cancer dataset from sklearn
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # 0 = malignant, 1 = benign
display(df.head())

# Splitting data into train and test sets (80/20)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# We use grid search to find the best hyperparameters for logistic regression
from sklearn.model_selection import GridSearchCV


In [None]:
#plotting datapoints
plt.scatter(X_train['mean radius'], X_train['mean texture'], c=y_train, cmap='rainbow')
plt.xlabel('mean radius')
plt.ylabel('mean texture')
plt.show()

## With L1 Regularization

In [None]:
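# L1-penalized model; liblinear is a solver that supports penalty='l1'
clf = LogisticRegression(random_state=0, max_iter=1000, penalty='l1', solver='liblinear')
# C is the inverse of the regularization strength: smaller C means a stronger penalty
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 200]}
grid = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Collect the mean cross-validated accuracy for each C in a DataFrame
df = pd.DataFrame({'C': param_grid['C'], 'Accuracy': grid.cv_results_['mean_test_score']})
df = df.set_index('C')
df = df.sort_index()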

In [None]:
display(df)

In [None]:
# Printing the classification report for the best model found by the grid search
y_pred = grid.predict(X_test)
print(classification_report(y_test, y_pred))

## With L2 Regularization

In [None]:

clf = LogisticRegression(random_state=0, max_iter=1000,penalty='l2')
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100,200]}
grid = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

#putting it all in a df
dff = pd.DataFrame({'C':param_grid['C'],'Accuracy':grid.cv_results_['mean_test_score']})
dff = dff.set_index('C')
dff = dff.sort_index()
display(dff)
# Plotting accuracy against C (log scale, since C spans several orders of magnitude)
plt.plot(dff.index, dff['Accuracy'])
plt.xscale('log')
plt.xlabel('C')
plt.ylabel('Accuracy with l2 penalty')
plt.title('Accuracy vs C')
plt.show()


In [None]:
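# Sweep C from 1 up to (but not including) 100 in steps of 0.1
# Note: accuracy below is measured on the test set, so treat the comparison as illustrative
C = np.arange(1,100,0.1)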
l1accuracy = []
l2accuracy = []
for c in C:
    l1 = LogisticRegression(random_state=0, max_iter=1000,penalty='l1',solver='liblinear',C=c)
    l1.fit(X_train,y_train)
    l1y_pred = l1.predict(X_test)
    l1accuracy.append(accuracy_score(y_test,l1y_pred))
    l2 = LogisticRegression(random_state=0, max_iter=1000,penalty='l2',C=c)
    l2.fit(X_train,y_train)
    l2y_pred = l2.predict(X_test)
    l2accuracy.append(accuracy_score(y_test,l2y_pred))
    

In [19]:
#make a df with l1accuracy and l2accuracy as columns
df = pd.DataFrame({'C':C,'l1':l1accuracy,'l2':l2accuracy})
df = df.set_index('C')
df = df.sort_index()
display(df)


| C | l1 | l2 |
|------|----------|---------|
| 1.0 | 0.956140 | 0.95614 |
| 1.1 | 0.956140 | 0.95614 |
| 1.2 | 0.956140 | 0.95614 |
| 1.3 | 0.956140 | 0.95614 |
| 1.4 | 0.956140 | 0.95614 |
| ... | ... | ... |
| 99.5 | 0.982456 | 0.95614 |
| 99.6 | 0.982456 | 0.95614 |
| 99.7 | 0.982456 | 0.95614 |
| 99.8 | 0.982456 | 0.95614 |


In [18]:
# Show the first five rows
df.head()

| C | l1 | l2 |
|-----|---------|---------|
| 1.0 | 0.95614 | 0.95614 |
| 1.1 | 0.95614 | 0.95614 |
| 1.2 | 0.95614 | 0.95614 |
| 1.3 | 0.95614 | 0.95614 |
| 1.4 | 0.95614 | 0.95614 |


In [None]:
# Plotting l1accuracy and l2accuracy vs C directly from the lists
plt.plot(C, l1accuracy, label='l1')
plt.plot(C, l2accuracy, label='l2')
plt.xlabel('C')
plt.ylabel('Test accuracy')
plt.legend()
plt.show()