# Logistic Regression

Logistic regression estimates the probability that an input belongs to class 1 by applying a sigmoid function to a linear combination of the input features. It is trained by minimizing the average cross-entropy loss over all training examples.

For a single example $x^{(i)}$:
$$z^{(i)} = w^T x^{(i)} + b \tag{1}$$
$$\hat{y}^{(i)} = a^{(i)} = \text{sigmoid}(z^{(i)})\tag{2}$$ 
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})\tag{3}$$

The cost is then computed by averaging the loss over all $m$ training examples:
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})\tag{4}$$
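As a small worked example of equations (3) and (4), the sketch below evaluates the per-example loss and the cost for three made-up predictions. The values in `a` and `y` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy values: predicted probabilities a and true labels y.
a = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])

# Per-example cross-entropy loss, equation (3).
losses = -y * np.log(a) - (1 - y) * np.log(1 - a)

# The cost J is the mean of the per-example losses, equation (4).
J = losses.mean()
print(losses, J)
```

Confident correct predictions (0.9 for a positive example) contribute a small loss, while less confident ones (0.7) contribute more.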

## 00 Importing

In [1]:
import numpy as np

## 01 Data preprocessing

#### 1. Determine key dimensions:
- m_train (number of training examples)
- m_test (number of test examples)

If image:
- num_px (= height = width of a training image)
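These dimensions can be read directly from the array shapes. The sketch below assumes the images are stored as arrays of shape `(m, num_px, num_px, 3)`. The names `train_set_x_orig` and `test_set_x_orig` and the zero-filled data are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical stand-in data: 5 training and 2 test RGB images of 4x4 pixels.
train_set_x_orig = np.zeros((5, 4, 4, 3))
test_set_x_orig = np.zeros((2, 4, 4, 3))

m_train = train_set_x_orig.shape[0]  # number of training examples
m_test = test_set_x_orig.shape[0]    # number of test examples
num_px = train_set_x_orig.shape[1]   # height (= width) of each image

print(m_train, m_test, num_px)
```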

#### 2. Reshape the data

If the input is an image, flatten it into shape $(\text{num\_px} \cdot \text{num\_px} \cdot 3, 1)$ using the `image2vector` function.

In [8]:
def image2vector(image):
    return image.reshape(image.shape[0]*image.shape[1]*image.shape[2], 1)

To flatten an array $X$ of shape (a, b, c, d) into a matrix of shape (b * c * d, a), where each column is one example, use:
```python
X_flatten = X.reshape(X.shape[0], -1).T
```
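A quick check of this reshape on a small, hypothetical batch shows that each column of the result holds one flattened example:

```python
import numpy as np

# Hypothetical batch with (a, b, c, d) = (2, 4, 4, 3): two 4x4 RGB images.
X = np.arange(2 * 4 * 4 * 3).reshape(2, 4, 4, 3)

# Flatten each example into a row, then transpose so examples become columns.
X_flatten = X.reshape(X.shape[0], -1).T

print(X_flatten.shape)  # (b * c * d, a)
```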

#### 3. Standardize the data

Standardizing data prevents features with large values from dominating learning, which typically leads to faster convergence, more stable training, and often better model performance.

Below are ways of normalizing data. For images, only pixel value normalization is used.

##### pixel value normalization
Each pixel in a color image has 3 values (RGB), each from 0 to 255. 

It's common to normalize by dividing by 255.

This scales pixel values to the [0, 1] range for easier model training.
```python
train_set_x = train_set_x_flatten/255
test_set_x = test_set_x_flatten/255
```

##### row normalization

In [None]:
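def normalize_rows(x):
    # Euclidean (L2) norm of each row, kept as a column of shape (n, 1) for broadcasting
    x_norm = np.linalg.norm(x, axis=1, keepdims=True)

    # Divide out of place so the caller's array is not mutated and integer inputs work
    return x / x_norm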
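A minimal sketch of the same computation on a hypothetical toy matrix, written out of place so the input array is left unchanged. After normalization every row should have unit Euclidean length.

```python
import numpy as np

# Hypothetical toy matrix with two rows.
x = np.array([[0.0, 3.0, 4.0],
              [1.0, 6.0, 4.0]])

# Row norms have shape (2, 1), so the division broadcasts across each row.
x_norm = np.linalg.norm(x, axis=1, keepdims=True)
x_normalized = x / x_norm

print(np.linalg.norm(x_normalized, axis=1))
```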

## 02 Helper functions

### Sigmoid

$$\text{sigmoid}(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

In [2]:
def sigmoid(z):
    return 1/(1+np.exp(-z))

### Sigmoid derivative

$$
\frac{d}{dz} \sigma(z) = \sigma(z)\left(1-\sigma(z)\right)
$$

In [3]:
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)
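One way to sanity-check the derivative formula is to compare it against a central finite difference. The sketch below re-declares the two functions so that it runs on its own. The test points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6

# Central difference approximation of d(sigmoid)/dz.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(z)

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

The two results should agree to many decimal places. At $z = 0$ the derivative reaches its maximum of 0.25.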

### Softmax

$$
\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$$

In [5]:
def softmax(x):
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=1, keepdims=True)
    s = x_exp/x_sum
    
    return s
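The function above normalizes along `axis=1`, so each row of the output sums to 1. The sketch below checks that property and also shows a common stabilization trick that is not part of the original notes: subtracting the row maximum before exponentiating, which leaves the result unchanged but avoids overflow for large inputs. The name `softmax_stable` is hypothetical.

```python
import numpy as np

def softmax(x):
    x_exp = np.exp(x)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

def softmax_stable(x):
    # Hypothetical variant: shifting each row by its maximum does not change the output.
    shifted = x - np.max(x, axis=1, keepdims=True)
    x_exp = np.exp(shifted)
    return x_exp / np.sum(x_exp, axis=1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])
s = softmax(x)
row_sums = s.sum(axis=1)

# Inputs this large would overflow in the plain version.
big = np.array([[1000.0, 1001.0, 1002.0]])
s_big = softmax_stable(big)

print(row_sums, s_big)
```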

## 03 Initialization

Initialize the parameters as follows:
- $w$ as a vector of zeros.
- $b$ to 0.

In [9]:
def initialize_with_zeros(dim):
    w = np.zeros((dim, 1))
    b = float(0)

    return w, b

## 04 Forward and backward propagation

With parameters initialized:
1. Perform forward propagation to compute predictions and the cost.
2. Perform backward propagation to compute gradients for learning.

#### Forward propagation
1. Get $X$
2. Compute
$$A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)}) \tag{5}$$
3. Compute the cost function:
$$J = -\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})) \tag{6}$$

#### Backward propagation
Compute the following gradients:
$$dw = \frac{\partial J}{\partial w} = \frac{1}{m}X(A-Y)^T\tag{7}$$
$$db = \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})\tag{8}$$
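These gradients follow from the chain rule applied to a single example. A short derivation, writing $a = \sigma(z)$ with $z = w^T x + b$ and dropping the superscript $(i)$:

$$\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a)$$
$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a}\,\frac{\partial a}{\partial z} = -y(1-a) + (1-y)\,a = a - y$$
$$\frac{\partial \mathcal{L}}{\partial w} = x\,(a - y), \qquad \frac{\partial \mathcal{L}}{\partial b} = a - y$$

Averaging these per-example gradients over the $m$ training examples gives equations (7) and (8).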

The `propagate()` function computes the cost and its gradients with respect to $w$ and $b$:

In [10]:
def propagate(w, b, X, Y):
    m = X.shape[1]
    
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -(1/m)*(np.dot(Y, np.log(A).T) + np.dot((1-Y), np.log(1-A).T))
    
    dw = (1/m)*np.dot(X, (A-Y).T)
    db = (1/m)*np.sum(A-Y)
    
    cost = np.squeeze(np.array(cost))

    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
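A gradient check is a simple way to gain confidence in the analytic gradients. The sketch below re-declares `sigmoid` and `propagate` so that it runs on its own, then compares `dw` and `db` against central finite differences of the cost. The toy inputs and the step size `eps` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    cost = -(1 / m) * (np.dot(Y, np.log(A).T) + np.dot(1 - Y, np.log(1 - A).T))
    dw = (1 / m) * np.dot(X, (A - Y).T)
    db = (1 / m) * np.sum(A - Y)
    return {"dw": dw, "db": db}, np.squeeze(np.array(cost))

# Hypothetical toy problem: 2 features, 3 examples.
w = np.array([[1.0], [2.0]])
b = 1.5
X = np.array([[1.0, -2.0, -1.0],
              [3.0, 0.5, -3.2]])
Y = np.array([[1, 1, 0]])

grads, cost = propagate(w, b, X, Y)

# Central finite differences of the cost with respect to each entry of w.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    _, cost_plus = propagate(w + step, b, X, Y)
    _, cost_minus = propagate(w - step, b, X, Y)
    numeric_dw[j, 0] = (cost_plus - cost_minus) / (2 * eps)

# Same check for the bias b.
_, cost_plus = propagate(w, b + eps, X, Y)
_, cost_minus = propagate(w, b - eps, X, Y)
numeric_db = (cost_plus - cost_minus) / (2 * eps)

print(np.max(np.abs(grads["dw"] - numeric_dw)), abs(grads["db"] - numeric_db))
```

Both printed differences should be very close to zero if the gradients match the cost.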

## 05 Optimization

Once the parameters are initialized and the cost and gradients are computed, the next step is optimization.

Optimization is the use of gradients from backprop to:
- update the parameters, and
- reduce the cost.

The `optimize` function learns $w$ and $b$ by running gradient descent to minimize the cost $J$.

Each parameter $\theta$ is updated using $\theta = \theta - \alpha \, d\theta$, where $\alpha$ is the learning rate.
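As a concrete illustration of the update rule, the sketch below applies a single gradient descent step to made-up parameter and gradient values. All numbers are hypothetical.

```python
import numpy as np

learning_rate = 0.1

# Hypothetical current parameters and gradients.
w = np.array([[1.0], [2.0]])
b = 0.5
dw = np.array([[0.5], [-1.0]])
db = 0.2

# One gradient descent step: move each parameter against its gradient.
w = w - learning_rate * dw
b = b - learning_rate * db

print(w, b)
```

A positive gradient decreases the parameter and a negative gradient increases it, each scaled by the learning rate.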


In [11]:
def optimize(w, b, X, Y, num_iterations=100, learning_rate=0.009, print_cost=False):    
    w = copy.deepcopy(w)
    b = copy.deepcopy(b)
    
    costs = []
    
    for i in range(num_iterations):
        # Cost and gradient calculation 
        grads, cost = propagate(w, b, X, Y)
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # Gradient descent update rule
        w = w - learning_rate*dw
        b = b - learning_rate*db
        
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
        
            # Print the cost every 100 training iterations
            if print_cost:
                print ("Cost after iteration %i: %f" %(i, cost))
    
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

## 06 Prediction

The `predict(w, b, X)` function uses the logistic regression parameters $w$ and $b$ to compute predictions (0 or 1) for each example in $X$.

It applies the sigmoid function $\sigma(z)$ and thresholding at 0.5. 

It returns a NumPy array `Y_prediction` containing these binary predictions.

In [14]:
def predict(w, b, X):    
    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)
    A = sigmoid(np.dot(w.T, X) + b)
    
    for i in range(A.shape[1]):
        if A[0, i] > 0.5:
            Y_prediction[0, i] = 1
        else:
            Y_prediction[0, i] = 0
    
    return Y_prediction
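The explicit loop can also be replaced by a vectorized comparison. The sketch below is one possible alternative, not the original implementation. The name `predict_vectorized` and the toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_vectorized(w, b, X):
    # Hypothetical vectorized variant of predict(): threshold all activations at once.
    w = w.reshape(X.shape[0], 1)
    A = sigmoid(np.dot(w.T, X) + b)
    return (A > 0.5).astype(float)

# Hypothetical parameters and inputs: 2 features, 3 examples.
w = np.array([[0.1124579], [0.23106775]])
b = -0.3
X = np.array([[1.0, -1.1, -3.2],
              [1.2, 2.0, 0.1]])

Y_prediction = predict_vectorized(w, b, X)
print(Y_prediction)
```

The boolean array `A > 0.5` is cast to floats, so the result has the same shape `(1, m)` and value type as the loop-based version.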

## 07 Merge into a model function

In [15]:
def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=False):
    w, b = initialize_with_zeros(X_train.shape[0])
    
    # Gradient descent
    params, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)

    # Get parameters w and b from dictionary "params"
    w = params['w']
    b = params['b']

    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)
    
    # Print train/test accuracy
    if print_cost:
        print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
        print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "w" : w, 
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

## 08 Training

Run the following cell to train your model:

````python
logistic_regression_model = model(train_set_x,
                                  train_set_y,
                                  test_set_x,
                                  test_set_y,
                                  num_iterations=2000, 
                                  learning_rate=0.005, 
                                  print_cost=True)
````

## 09 Data analysis

### Plot learning curve (with costs)

```python
import matplotlib.pyplot as plt

# Costs were recorded once every 100 iterations during training
costs = np.squeeze(logistic_regression_model['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate =" + str(logistic_regression_model["learning_rate"]))
plt.show()
```

## 10 Additional remarks

### Learning rate

The learning rate $\alpha$ is a hyperparameter that controls the size of each gradient descent update.
- If it's too large, the updates may overshoot the minimum, and the cost can oscillate or diverge.
- If it's too small, convergence will be very slow.

Tuning the learning rate can significantly impact the convergence speed and performance of the algorithm.
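The effect is easy to see on a toy problem. The sketch below runs gradient descent on $J(\theta) = \theta^2$, whose gradient is $2\theta$ and whose minimum is at $\theta = 0$, with three hypothetical learning rates. It illustrates the behavior only and is not a tuning recipe for logistic regression.

```python
def gradient_descent(theta, learning_rate, num_steps=20):
    # Minimize J(theta) = theta**2, whose gradient is 2 * theta.
    for _ in range(num_steps):
        grad = 2 * theta
        theta = theta - learning_rate * grad
    return theta

# Final theta after 20 steps from theta = 1.0, for three learning rates.
final = {lr: gradient_descent(1.0, lr) for lr in (0.01, 0.4, 1.1)}

for lr, theta in final.items():
    print(f"learning_rate={lr}: theta={theta:.6g}")
```

With the smallest rate, $\theta$ is still far from 0 after 20 steps. With the middle rate it has essentially converged. With the largest rate each step overshoots and $|\theta|$ grows.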