# Supervised Learning: Logistic Regression

Prediction Function
$$\hat{y} = \sigma(z)$$

Activation Function - Sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

- Turns the logit's linear score into a probability 

Score function - Logit
$$z = \mathbf{w} \cdot \mathbf{x} + b$$

$$\begin{align*}
  &\mathbf{w}\text{: weight vector}\\
  &\mathbf{x}\text{: input feature vector}\\
  &b\text{: bias term}
\end{align*}$$

- A linear combination of the weights and features plus bias
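
As a small worked example of the two formulas above, the sketch below computes the score and the resulting probability for one sample. The weights, bias, and sample values are hypothetical numbers chosen only so the arithmetic is easy to follow; they are not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values, chosen only for illustration
w = np.array([0.5, -0.25])   # weight vector
x = np.array([3.0, 6.0])     # one sample with two features
b = 0.1                      # bias term

z = np.dot(w, x) + b         # 0.5*3 + (-0.25)*6 + 0.1 = 0.1
y_hat = sigmoid(z)           # slightly above 0.5 because z is slightly positive

print(f"z = {z:.2f}, y_hat = {y_hat:.4f}")
```

A score of exactly zero maps to a probability of 0.5, so the sign of $z$ decides which side of the 0.5 threshold a sample falls on.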




In [None]:
# set up model parameters
import numpy as np

LEARNING_RATE = 0.1
EPOCHS = 1000

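def prediction_function(z):
    # sigmoid: maps the score z to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

def score_function(w, x, b):
    # logit: linear combination of features and weights, plus bias
    # (x may be a single sample or a matrix with one sample per row)
    return np.dot(x, w) + b

# placeholders; weights are sized to the feature count once the data is loaded
weights = []
bias = 0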

In [None]:
# set up the training data

# each sample is [kills, deaths]
features = [[3, 6], [7, 7], [2, 4], [3, 5], [0, 5], [4, 5], [2, 4], [0, 4]]
# each label is 1 (win) or 0 (loss)
labels = [1, 1, 1, 0, 1, 1, 1, 0]

# convert to numpy arrays for matrix and vector operations
features = np.array(features)
labels = np.array(labels)
# start with one zero weight per feature
weights = np.zeros(features.shape[1])

Loss Function - Cross-Entropy
$$\mathcal{L}(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \right]$$

- Loss is a number indicating how bad the model's prediction was on a single example. If the prediction is perfect, the loss is zero; otherwise, the loss is greater than zero, and it grows as the prediction moves further from the true label.
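
To make the behaviour of the loss concrete, here is a minimal sketch that evaluates the formula on two hypothetical predictions for a positive example ($y = 1$). The function name `cross_entropy_loss` and the probabilities 0.9 and 0.1 are illustrative assumptions, not part of the notebook's code.

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """Cross-entropy loss for a single example with label y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical predictions for an example whose true label is 1
confident_right = cross_entropy_loss(1, 0.9)  # equals -log(0.9), a small loss
confident_wrong = cross_entropy_loss(1, 0.1)  # equals -log(0.1), a large loss

print(f"y=1, y_hat=0.9 -> loss {confident_right:.4f}")
print(f"y=1, y_hat=0.1 -> loss {confident_wrong:.4f}")
```

The loss is asymmetric in an intuitive way: a confident wrong answer is penalized far more heavily than a confident right answer is rewarded.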

Cost Function

$$J(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)}) \right]$$

- The cost function is the average loss across all samples. The ultimate goal is to minimize the cost of the model.  
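
As a rough illustration, the sketch below averages the per-sample loss over the eight training labels used in this notebook. The helper name `cost` and the clipping constant `eps` are assumptions added here; clipping is one common way to avoid taking $\log(0)$ when a prediction is exactly 0 or 1.

```python
import numpy as np

def cost(y, y_hat, eps=1e-12):
    """Average cross-entropy loss over all samples."""
    # eps is an assumption of this sketch: it keeps log() away from 0
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

labels = np.array([1, 1, 1, 0, 1, 1, 1, 0])

# With all-zero weights and bias, every score is 0 and every prediction is 0.5
untrained_predictions = np.full(labels.shape, 0.5)
untrained_cost = cost(labels, untrained_predictions)

print(f"Cost before training: {untrained_cost:.4f}")  # log(2), about 0.6931
```

An untrained model that predicts 0.5 everywhere always has a cost of $\log 2$, which makes a convenient baseline: training should push the cost below that value.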

In gradient descent, each iteration moves the parameters in the direction opposite to the gradient, because the gradient points in the direction where the loss increases the most.
 
$$\mathcal{L}(y, \hat{y})= \mathcal{L}(y, f(\mathbf{w}, \mathbf{x}, b))$$

$y$ and $\mathbf{x}$ are constants in this function, so the only partial derivatives needed are those with respect to $\mathbf{w}$ and $b$:
$$\begin{align*}
&\nabla \mathcal{L} = \left( \frac{\partial \mathcal{L}}{\partial \mathbf{w}}, \frac{\partial \mathcal{L}}{\partial b} \right)\\
&\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = (\hat{y} - y) \mathbf{x}, \quad \frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y
\end{align*}$$
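
One way to sanity-check these partial derivatives is to compare them against a finite-difference approximation. The sketch below does this for a single example; the helper names (`loss`, `analytic_gradient`, `numeric_gradient`), the step size `h`, and the parameter values are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for one example."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_gradient(w, b, x, y):
    """Closed-form partial derivatives: (y_hat - y) * x and (y_hat - y)."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return (y_hat - y) * x, y_hat - y

def numeric_gradient(w, b, x, y, h=1e-6):
    """Central finite differences, one parameter at a time."""
    grad_w = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad_w[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * h)
    grad_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return grad_w, grad_b

# Hypothetical parameters and a single example
w = np.array([0.2, -0.1])
b = 0.05
x = np.array([3.0, 6.0])
y = 1

aw, ab = analytic_gradient(w, b, x, y)
nw, nb = numeric_gradient(w, b, x, y)
print("analytic:", aw, ab)
print("numeric: ", nw, nb)
```

If the two rows agree to several decimal places, the closed-form expressions are very likely implemented correctly.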



In [None]:
# fit the model using gradient descent 
n = len(features)

for _ in range(EPOCHS):
    # compute the prediction function
    z = score_function(weights, features, bias)
    y_hat = prediction_function(z)

    # gradient components of the loss, summed over samples for the weights
    dL_dw = np.dot(features.T, (y_hat - labels))
    dL_db = y_hat - labels

    # gradient of the cost: average over the n samples
    dJ_dw = 1/n * dL_dw
    dJ_db = 1/n * np.sum(dL_db)

    # the gradient points in the direction where the cost increases the most,
    # so step in the opposite direction, scaled by the learning rate
    weights -= LEARNING_RATE * dJ_dw
    bias -= LEARNING_RATE * dJ_db

In [None]:
def predict_probability(X_test):
    return prediction_function(score_function(weights, X_test, bias))

def predict_successes(X_test):
    return np.where(predict_probability(X_test) >= 0.5, 1, 0)

# note: the magnitude of a weight indicates how strongly its feature affects
# the prediction (comparable across features only when they share a scale)
print("Weights: ", end="")
for v in weights:
    print(f"{v:.2f} ", end="")
print()
print(f"Bias: {bias:.2f}\n")

X_test = [[3, 6], [0, 2], [3, 5]]
print(f"X testing data: {X_test}")

print("Probability of elements: ", end="")
for v in X_test:
    print(f"{predict_probability(v):.2f} ", end="")
print()

print(f"Prediction of successes: {predict_successes(X_test)}")

Weights: 0.36 0.19 
Bias: -0.58

X testing data: [[3, 6], [0, 2], [3, 5]]
Probability of elements: 0.84 0.45 0.81 
Prediction of successes: [1 0 1]


#### Loss function derivation for logistic regression

The Bernoulli Distribution

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}, \; 0 \leq p \leq 1$$

The Bernoulli distribution models the outcome of a single trial that has only two possible outcomes:  
1: Success  
0: Failure  
  
$p$: the probability of success  
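
The probability mass function above is simple enough to check by hand. The sketch below is a direct transcription of it; the function name `bernoulli_pmf` and the value $p = 0.7$ are illustrative assumptions.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli trial with success probability p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    # The exponent selects p when x = 1 and (1 - p) when x = 0
    return p**x * (1 - p)**(1 - x)

p = 0.7  # hypothetical probability of success
prob_success = bernoulli_pmf(1, p)
prob_failure = bernoulli_pmf(0, p)

print(f"P(X=1) = {prob_success:.2f}, P(X=0) = {prob_failure:.2f}")
```

The two probabilities always sum to one, since success and failure are the only possible outcomes.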

The likelihood is the probability of observing the data, given parameters $\theta$ for a distribution $P$
$$L(\theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$

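The log-likelihood is maximized by the same $\theta$, because the logarithm is monotonically increasing, but it turns the product into a sum, which is easier to differentiate and numerically more stable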
$$\log L(\theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)$$
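
A practical reason to prefer the sum of logs is floating-point underflow: a product of many probabilities quickly becomes too small to represent. The sketch below shows this with 2000 hypothetical observations that each have probability 0.5; the sample size is an arbitrary choice for the demonstration.

```python
import numpy as np

# 2000 hypothetical observations, each with probability 0.5
probs = np.full(2000, 0.5)

# The product is 0.5**2000, far below the smallest positive float64
likelihood = np.prod(probs)

# The sum of logs is 2000 * log(0.5), a perfectly ordinary number
log_likelihood = np.sum(np.log(probs))

print(f"likelihood     = {likelihood}")
print(f"log-likelihood = {log_likelihood:.2f}")
```

The raw likelihood collapses to zero, so it can no longer distinguish between parameter settings, while the log-likelihood remains usable.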

Cross-Entropy

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Cross-Entropy is a measure of the difference between two probability distributions. 

$P$: True distribution  
$Q$: Predicted distribution  
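
As a minimal sketch of the definition, the code below computes $H(P, Q)$ for a two-outcome case where the true distribution puts all of its mass on one outcome. The function name `cross_entropy` and the predicted distributions are hypothetical, and skipping terms where $P(x) = 0$ is a convention assumed here to avoid evaluating $0 \cdot \log 0$.

```python
import numpy as np

def cross_entropy(p_true, q_pred):
    """H(P, Q) = -sum over x of P(x) * log Q(x)."""
    p_true = np.asarray(p_true, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    # Terms with P(x) = 0 contribute nothing, so they are skipped
    mask = p_true > 0
    return -np.sum(p_true[mask] * np.log(q_pred[mask]))

# Outcome order is [success, failure]; the true outcome is a success
p = [1.0, 0.0]

good_prediction = cross_entropy(p, [0.9, 0.1])  # equals -log(0.9)
bad_prediction = cross_entropy(p, [0.1, 0.9])   # equals -log(0.1)

print(f"good: {good_prediction:.4f}, bad: {bad_prediction:.4f}")
```

The closer the predicted distribution is to the true one, the smaller the cross-entropy.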

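In logistic regression, each label has only two possible outcomes, so it follows a Bernoulli distribution. $P$ here is the true distribution of the labels, and $Q$ is the Bernoulli distribution whose parameter $p$ is the probability predicted by the model.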

$$L(p) = \prod_{i=1}^{n} p^{y_i} (1 - p)^{1 - y_i}$$
$$\log L(p) = \sum_{i=1}^{n} \left[ y_i \log p + (1 - y_i) \log (1 - p) \right]$$


We can use cross-entropy to measure the loss between the actual probability distribution and the one predicted by our model. In this case, the cross-entropy is the negative of the log-likelihood, which brings us to this:

$$H(P, Q) = -\log L(p)$$

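The loss, summed over all samples, is equal to the cross-entropy. Here $p$ is the model's prediction, so each sample uses its own $p = \hat{y}_i$:
$$\sum_{i=1}^{n} \mathcal{L}(y_i, \hat{y}_i) = H(P, Q) = -\log L(p) = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

Dividing this sum by $n$ gives the cost function $J(\mathbf{w}, b)$.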
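
The identity between the negative log-likelihood and the summed cross-entropy can be checked numerically. In the sketch below, the labels are the training labels from this notebook, while the predicted probabilities in `y_hat` are hypothetical values made up for the check.

```python
import numpy as np

y = np.array([1, 1, 1, 0, 1, 1, 1, 0])
# Hypothetical model predictions, one probability per sample
y_hat = np.array([0.8, 0.9, 0.7, 0.4, 0.6, 0.8, 0.7, 0.3])

# Bernoulli likelihood of the labels, using each sample's own probability
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))
neg_log_likelihood = -np.log(likelihood)

# Cross-entropy loss summed over all samples
summed_loss = -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Averaging the summed loss gives the cost
average_cost = summed_loss / len(y)

print(f"-log L      = {neg_log_likelihood:.6f}")
print(f"summed loss = {summed_loss:.6f}")
print(f"cost        = {average_cost:.6f}")
```

The first two printed values should match up to floating-point rounding, which is why minimizing the cross-entropy cost is the same as maximizing the likelihood of the labels.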






https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning