# What is Linear Classification?

Imagine you have two point clouds that you want to classify. What solution would you propose? The idea behind logistic regression is to separate them with a straight line.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact
%matplotlib inline

In [2]:
from jupyterthemes import jtplot
jtplot.style()

In [3]:
np.random.seed(1995)

In [4]:
x = [4 + np.random.normal() for i in range(20)] + [2 + np.random.normal() for i in range(20)]
y = [2 + np.random.normal() for i in range(20)] + [4 + np.random.normal() for i in range(20)]
z = ['Stark'] * 20 + ['Bolton'] * 20
data = pd.DataFrame({'x' : x, 'y' : y, 'Type' : z})
data.head()

| | x | y | Type |
|---|---|---|---|
| 0 | 2.759367 | 3.450856 | Stark |
| 1 | 2.529421 | 4.03792 | Stark |
| 2 | 6.101191 | 2.030075 | Stark |
| 3 | 2.535178 | 1.790706 | Stark |
| 4 | 4.817922 | 0.162836 | Stark |


In [5]:
def linear_classification(line = False, point1 = False, point2 = False, point3 = False):
    
    fig, ax = plt.subplots(figsize = (13, 8))
    
    sns.scatterplot(data = data, x = 'x', y = 'y', hue = 'Type', ax = ax)
    
    if line:
        ax.plot([0, data['x'].max()], [0, data['x'].max()])
        
    if point1:
        ax.plot([4], [2], markersize = 15, marker = '*', color = 'g')
        
    if point2:
        ax.plot([2], [4], markersize = 15, marker = '*', color = 'g')
        
    if point3:
        ax.plot([4], [4], markersize = 15, marker = '*', color = 'g')

In [6]:
interact(linear_classification, line = False, point1 = False, point2 = False, point3 = False);


We can solve this problem by defining $h$ as follows:

$$h(x, y) = x - y$$

- If $h > 0$ → Stark
- If $h < 0$ → Bolton
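As a minimal sketch of this rule, the decision function can be written directly in Python. The name `classify` and the `'undecided'` label for points exactly on the line are illustrative assumptions; the three test points are the same ones marked with stars in the plot above.

```python
def h(x, y):
    """Decision function from the text: positive below the line y = x."""
    return x - y

def classify(x, y):
    # h > 0 -> Stark, h < 0 -> Bolton. The rule in the text does not cover
    # h == 0 (a point exactly on the line), so we flag it as undecided.
    value = h(x, y)
    if value > 0:
        return 'Stark'
    if value < 0:
        return 'Bolton'
    return 'undecided'

print(classify(4, 2))  # Stark
print(classify(2, 4))  # Bolton
print(classify(4, 4))  # undecided
```

The third point shows the limitation of a hard rule: near the line we would like a degree of confidence rather than a yes/no answer, which is what the probability model in the following sections provides.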

# Biological inspiration


This model, strange as it may seem, can be interpreted as a neuron. A neuron in our brain is made up of several dendrites and an axon. The neuron connects its axon to the dendrites of other neurons, and in turn has several neurons connected to its own dendrites.

<img src = https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Blausen_0657_MultipolarNeuron.png/1024px-Blausen_0657_MultipolarNeuron.png>

So a neuron receives information from several other neurons but sends only one result (through the axon); that is, the neuron has several inputs but a single output. This mirrors the previous model, which receives two inputs ($x$, $y$) and produces a single output (Stark or Bolton, depending on the case).

<hr>

# From linear regression to logistic regression


The first idea we might have when trying to model the above is to take the probability that a point is Stark or Bolton and model it with what we already know, linear regression.

$$P = \alpha + \beta X$$

But something is wrong: $P \in [0, 1]$, but on the right side $X \in (- \infty, \infty)$, so the two sides do not cover the same range. What if we use the odds instead? The odds are defined as follows:

$$Odds_{P} = \frac{P}{1 - P}$$

If we use the odds instead of the probability, our model becomes:

$$\frac{P}{1 - P} = \alpha + \beta X$$


Notice that when the numerator tends to 0 (the probability of success tends to 0), the odds tend to 0. On the other hand, when the numerator is bigger than the denominator the odds are bigger than 1, and when the denominator tends to 0 (the probability of failure tends to 0, that is, the probability of success tends to 1) the odds tend to $\infty$. Thus the odds are a real number in $[0, \infty)$, but on the right side of our equation we still have $X \in (- \infty, \infty)$, so the problem is still there.


So let's take a look at the function $\ln$.


$$\ln \left| \frac{P}{1 - P} \right| =  \alpha + \beta X$$

Finally, $\ln \left| \frac{P}{1 - P} \right| \in (- \infty, \infty)$ and $X \in (- \infty, \infty)$, so both sides of the equation cover the same range.
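To make the ranges concrete, here is a small sketch that evaluates the odds and the log-odds for a few probabilities. The helper names `odds` and `log_odds` are illustrative, not part of the original notebook.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (0 < p < 1)."""
    return math.log(odds(p))

# Probabilities near 0 give odds near 0 and very negative log-odds;
# probabilities near 1 give large odds and very positive log-odds.
for p in (0.01, 0.5, 0.99):
    print(p, odds(p), log_odds(p))
```

At $P = 0.5$ the odds are exactly 1 and the log-odds are 0, and the log-odds are symmetric around that point, which is what lets them match the unbounded right-hand side.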


Maybe you're thinking that $\ln$ does not look like a straight line, and you are right, but don't forget the form of the odds.

We can solve for $P$:

$$\ln \left| \frac{P}{1 - P} \right|=  \alpha + \beta X$$


Apply the exponential function:

$$\frac{P}{1 - P} = e^{\alpha + \beta X}$$


Multiply by $(1 - P)$:

$$P = e^{\alpha + \beta X} (1 - P)$$
$$P = e^{\alpha + \beta X}  - P e^{\alpha + \beta X}$$


Add $P e^{\alpha + \beta X}$

$$P + P e^{\alpha + \beta X} = e^{\alpha + \beta X}$$

$$P (1 + e^{\alpha + \beta X}) = e^{\alpha + \beta X}$$


Divide by $(1 + e^{\alpha + \beta X})$

$$P = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$$

And finally apply a sneaky 1:

$$P = \frac{e^{-(\alpha + \beta X)}}{e^{-(\alpha + \beta X)}}\frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}$$

$$P = \frac{1}{1 + e^{-(\alpha + \beta X)}}$$
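The derivation can be checked numerically. The sketch below uses hypothetical coefficients `alpha` and `beta` (chosen only for illustration) and verifies that the two algebraic forms of $P$ agree and that taking the log-odds of $P$ recovers $\alpha + \beta X$.

```python
import math

def sigmoid(z):
    """Final form of the derivation: P = 1 / (1 + e^{-z})."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return math.log(p / (1 - p))

alpha, beta = -1.0, 2.0  # hypothetical coefficients, for illustration only

for X in (-3.0, 0.0, 0.5, 3.0):
    z = alpha + beta * X
    P = sigmoid(z)
    # Intermediate form from the derivation: e^z / (1 + e^z)
    P_alt = math.exp(z) / (1 + math.exp(z))
    print(X, P, abs(P - P_alt) < 1e-12, abs(logit(P) - z) < 1e-9)
```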



In the case of multiple predictor variables we would have:

$$\ln \left| \frac{P}{1 - P} \right| =  w_0 + w_1 X_1 + \cdots + w_k X_k$$

And we can define two vectors:

$$\vec{w}
=
\begin{bmatrix}
w_0\\
w_1\\
\vdots \\
w_k \\
\end{bmatrix}
$$


$$\vec{X}
=
\begin{bmatrix}
1\\
X_1\\
\vdots \\
X_k \\
\end{bmatrix}
$$

Thus:

$$\ln \left| \frac{P}{1 - P} \right| = w^T X$$

$$P = \frac{1}{1 + e^{-(w^T X)}}$$
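A minimal NumPy sketch of the vector form follows. The function name `predict_proba` and the weights are assumptions made for illustration; the weights are chosen so that $w^T X = x - y$, which reproduces the rule $h(x, y) = x - y$ from the first section.

```python
import numpy as np

def predict_proba(w, features):
    """P = 1 / (1 + exp(-w^T X)) with X = [1, X_1, ..., X_k]."""
    X = np.concatenate(([1.0], features))  # prepend the constant 1 that multiplies w_0
    return 1 / (1 + np.exp(-(w @ X)))

w = np.array([0.0, 1.0, -1.0])  # hypothetical weights: w^T X = x - y

p_stark = predict_proba(w, np.array([4.0, 2.0]))   # point below the line y = x
p_bolton = predict_proba(w, np.array([2.0, 4.0]))  # point above the line y = x
print(p_stark, p_bolton)
```

Prepending the constant 1 lets the intercept $w_0$ be handled by the same dot product as the other weights.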

# How do we estimate $\vec{w}$?

Every time we fit a statistical or machine learning model, we are estimating parameters. A single-variable linear regression has the equation:

$Y = w_0 + w_1 X$

Our goal when we fit this model is to estimate the parameters  $w_0$ and $w_1$ given our observed values of $Y$ and $X$. We use Ordinary Least Squares (OLS) to fit the linear regression model and estimate $w_0$ and $w_1$.
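For reference, this is roughly what an OLS fit looks like with NumPy. The data here are synthetic and the true values $w_0 = 1.5$, $w_1 = 2.0$ are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from Y = 1.5 + 2.0 X plus Gaussian noise
X = rng.uniform(0, 10, size=50)
Y = 1.5 + 2.0 * X + rng.normal(scale=0.5, size=50)

# Design matrix with a column of ones for the intercept w_0
A = np.column_stack([np.ones_like(X), X])

# Least-squares solution of A @ [w_0, w_1] ~ Y
(w0, w1), *_ = np.linalg.lstsq(A, Y, rcond=None)
print(w0, w1)
```

The estimates should land close to the values used to generate the data.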

- **Can we use OLS to estimate the $\vec{w}$?**

The answer is **No**. So we need to change the cost function; we will now focus on the function called the cross-entropy error.

$$j = - \left[y \ln(P) + (1 - y) \ln(1 - P) \right]$$

In [7]:
def j(y, p):
    return - (y * np.log(p) + (1 - y) * np.log(1 - p))

In [8]:
j(1, 0.9)  # right

0.10536051565782628

In [9]:
j(1, 0.021) #wrong

3.863232841258714

In [10]:
j(0, 0.01)  # right

0.01005033585350145

In [11]:
j(0, 0.9) #wrong

2.302585092994046

As you can see, the smaller the error in our prediction, the smaller this metric, so we can average it over all the samples to build a function to minimize.

$$J = - \frac{1}{n} \sum \left[y_i \ln(P_i) + (1 - y_i) \ln(1 - P_i) \right]$$
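A vectorized sketch of $J$ is shown below. The function name `cross_entropy` and the clipping constant `eps` are assumptions added here to avoid evaluating $\ln(0)$; they are not part of the formula itself.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Mean cross-entropy J over n samples; p is clipped to avoid log(0)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = [1, 1, 0, 0]
good = cross_entropy(y_true, [0.9, 0.8, 0.1, 0.2])  # confident and right
bad = cross_entropy(y_true, [0.1, 0.2, 0.9, 0.8])   # confident and wrong
print(good, bad)
```

With a single sample the function reduces to the per-sample $j$ defined above.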

To minimize this function we can use gradient descent, but for that we need to know the gradient. First, recall that $P$ is a function of $w^T X$, which in turn is a function of the weights.

$$P = \frac{1}{1 + e^{-(w^T X)}}$$


Using the chain rule, and writing $i$ for the sample index and $j$ for the weight index, we can express the derivative of $J$ with respect to $w_j$ as:

$$\frac{\partial J}{\partial w_j} = \sum_{i} \frac{\partial J}{\partial P_i}\frac{\partial P_i}{\partial \alpha_i}\frac{\partial \alpha_i}{\partial w_j}$$

where $\alpha_i$ is:


$$\alpha_i = w^T x_i$$

and $x_{ij}$ denotes the $j$-th component of the sample $x_i$. If we calculate each of the individual derivatives we have:

$$
\left \{
\begin{array}{l}
\frac{\partial J}{\partial P_i} = -\frac{1}{n}\left[\frac{y_i}{P_i} - \frac{1 - y_i}{1 - P_i}\right]\\
\frac{\partial P_i}{\partial \alpha_i} = \frac{-1}{(1 + e^{-\alpha_i})^2} (e^{-\alpha_i})(- 1)\\
\frac{\partial \alpha_i}{\partial w_j} = x_{ij}
\end{array}
\right.
$$



The second derivative can be simplified:

$$\frac{\partial P_i}{\partial \alpha_i} = \frac{-1}{(1 + e^{-\alpha_i})^2} (e^{-\alpha_i})(- 1)$$
<hr>
$$\frac{-1}{(1 + e^{-\alpha_i})^2} (e^{-\alpha_i})(- 1) = \frac{e^{-\alpha_i}}{(1 + e^{-\alpha_i})^2}$$
<hr>
$$\frac{e^{-\alpha_i}}{(1 + e^{-\alpha_i})^2} = \frac{1}{1 + e^{-\alpha_i}} \frac{e^{-\alpha_i}}{1 + e^{-\alpha_i}}$$

Finally notice that:

$$1 - P_i = 1 - \frac{1}{1 + e^{-\alpha_i}}$$

$$1 - P_i = \frac{1 + e^{-\alpha_i}}{1 + e^{-\alpha_i}} - \frac{1}{1 + e^{-\alpha_i}}$$

$$1 - P_i = \frac{1 + e^{-\alpha_i} - 1}{1 + e^{-\alpha_i}}$$

<div class="alert alert-danger">$$1 - P_i = \frac{e^{-\alpha_i}}{1 + e^{-\alpha_i}}$$</div>


So:


$$
\left \{
\begin{array}{l}
\frac{\partial J}{\partial P_i} = -\frac{1}{n}\left[\frac{y_i}{P_i} - \frac{1 - y_i}{1 - P_i}\right]\\
\frac{\partial P_i}{\partial \alpha_i} = P_i (1 - P_i)\\
\frac{\partial \alpha_i}{\partial w_j} = x_{ij}
\end{array}
\right.
$$


Putting them all together:

$$\frac{\partial J}{\partial w_j} = - \frac{1}{n} \sum_{i} \left[ \frac{y_i}{P_i} P_i(1 - P_i)x_{ij} - \frac{1 - y_i}{1 - P_i} P_i (1 - P_i) x_{ij}\right]$$



$$\frac{\partial J}{\partial w_j} = - \frac{1}{n} \sum_{i} \left[ y_i(1 - P_i)x_{ij} - (1 - y_i) P_i  x_{ij}\right]$$

$$\frac{\partial J}{\partial w_j} = - \frac{1}{n} \sum_{i} \left[ (y_i - y_iP_i - P_i + y_iP_i)  x_{ij}\right]$$


$$\frac{\partial J}{\partial w_j} = - \frac{1}{n} \sum_{i} (y_i  - P_i)  x_{ij}$$

$$\frac{\partial J}{\partial w_j} =  \frac{1}{n} \sum_{i} (P_i  - y_i)  x_{ij}$$

<hr>

$$\frac{\partial J}{\partial \vec{w}} =  \frac{1}{n} \sum_{i} (P_i  - y_i)  \vec{x}_i$$


<hr>

In matrix form, where $X$ is now the matrix whose rows are the samples $x_i^T$ and $P$ and $y$ are the vectors of predictions and labels:

$$\nabla J =  \frac{1}{n} X^T (P  - y)$$
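One way to gain confidence in a gradient formula is to compare it with finite differences. The sketch below does this on random synthetic data (all names and values are illustrative assumptions). Because $J$ is defined as a mean over $n$ samples, the analytic gradient used here is $X^T (P - y) / n$; the factor $1/n$ comes from the $\frac{1}{n}$ in the definition of $J$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def J(w, X, y):
    """Mean cross-entropy cost for weights w."""
    P = sigmoid(X @ w)
    return -np.mean(y * np.log(P) + (1 - y) * np.log(1 - P))

def grad_J(w, X, y):
    """Analytic gradient of the mean cross-entropy: X^T (P - y) / n."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])  # leading 1s for w_0
y = (rng.random(30) < 0.5).astype(float)                       # random 0/1 labels
w = rng.normal(size=3)

# Central finite differences, perturbing one weight at a time
step = 1e-6
numeric = np.array([
    (J(w + step * e, X, y) - J(w - step * e, X, y)) / (2 * step)
    for e in np.eye(3)
])

max_diff = np.max(np.abs(numeric - grad_J(w, X, y)))
print(max_diff)
```

If the derivation is right, the printed difference should be tiny compared with the size of the gradient entries.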

In [12]:
x = [4 + np.random.normal() for i in range(20)] + [2 + np.random.normal() for i in range(20)]
y = [4 + np.random.normal() for i in range(20)] + [2 + np.random.normal() for i in range(20)]
z = [1] * 20 + [0] * 20
data = pd.DataFrame({'x' : x, 'y' : y, 'Type' : z})
data.head()

| | x | y | Type |
|---|---|---|---|
| 0 | 3.948864 | 5.043422 | 1 |
| 1 | 4.654946 | 3.221058 | 1 |
| 2 | 2.346167 | 3.612457 | 1 |
| 3 | 3.160796 | 4.624555 | 1 |
| 4 | 3.70533 | 3.836523 | 1 |


In [15]:
import numpy as np

class LogisticRegressionGD:
    """Logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for _ in range(self.num_iterations):
            linear_model = np.dot(X, self.weights) + self.bias
            y_predicted = self.sigmoid(linear_model)

            # Compute the gradients of the mean cross-entropy cost
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update the weights and the bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.sigmoid(linear_model)
        # Threshold the probabilities at 0.5 to get class labels
        y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]
        return np.array(y_predicted_cls)

# Example usage
if __name__ == "__main__":
    x = [4 + np.random.normal() for i in range(20)] + [2 + np.random.normal() for i in range(20)]
    y = [4 + np.random.normal() for i in range(20)] + [2 + np.random.normal() for i in range(20)]
    z = [1] * 20 + [0] * 20
    data = pd.DataFrame({'x' : x, 'y' : y, 'Type' : z})

    # Feature matrix (one row per point) and the 0/1 labels from the 'Type' column
    X = data[['x', 'y']].to_numpy()
    labels = data['Type'].to_numpy()

    # Instantiate and train the model
    model = LogisticRegressionGD(learning_rate=0.01, num_iterations=1000)
    model.fit(X, labels)

    # Predict and show the results
    y_pred = model.predict(X)
    print("Actual values:", labels)
    print("Predicted values:", y_pred)

### Exercise 

Implement logistic regression.

In [None]:
class LogisticRegression:
    def __sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def __j(self, w):
        # Mean cross-entropy cost for weights w on the stored data self.X, self.y
        p = self.__sigmoid(self.X @ w)
        a = self.y * np.log(p)
        b = (1 - self.y) * np.log(1 - p)
        return -np.mean(a + b)

## Performance Measures

<img src = 'https://geekflare.com/wp-content/uploads/2022/07/basic_cm-edited.jpg'>


True Positives (TP) - These are the correctly predicted positive values: the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that this passenger survived and the predicted class tells you the same thing.

True Negatives (TN) - These are the correctly predicted negative values: the actual class is no and the predicted class is also no. E.g. the actual class says this passenger did not survive and the predicted class tells you the same thing.

False positives and false negatives occur when the actual class contradicts the predicted class.

False Positives (FP) - The actual class is no and the predicted class is yes. E.g. the actual class says this passenger did not survive but the predicted class tells you that this passenger will survive.

False Negatives (FN) - The actual class is yes but the predicted class is no. E.g. the actual class indicates that this passenger survived but the predicted class tells you that the passenger will die.

Once you understand these four quantities, we can calculate Accuracy, Precision, Recall and F1 score.


**Accuracy** - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations, $\frac{TP + TN}{TP + TN + FP + FN}$. One may think that if we have high accuracy then our model is the best. Accuracy is a great measure, but only when you have symmetric datasets where the numbers of false positives and false negatives are almost the same. Therefore, you have to look at other parameters to evaluate the performance of your model.


**Precision** - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, $\frac{TP}{TP + FP}$. The question this metric answers is: of all the passengers that we labeled as survived, how many actually survived? High precision relates to a low false positive rate.

**Recall (Sensitivity)** - Recall is the ratio of correctly predicted positive observations to all observations in the actual class yes, $\frac{TP}{TP + FN}$. The question recall answers is: of all the passengers that truly survived, how many did we label as survived?

**F1 score** - The F1 score is the harmonic mean of Precision and Recall, $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
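The four measures can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration: the function name `classification_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: report 0.0 when a ratio is undefined
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Toy example: 1 = survived, 0 = did not survive
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four measures come out to 0.75.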

In [None]:
class Class:
    pass  # placeholder: class body not yet written