<a href="https://colab.research.google.com/github/AhmadAghaebrahimian/Optimization/blob/main/GradientDescent/GD_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this exercise, we implement Gradient Descent for a function of one multidimensional variable. We want to fit the function (hypothesis) $h(x) = \theta x + b$, where $x$ and $\theta$ are multidimensional vectors. We also use the Sigmoid function (also known as the logistic function, hence the name logistic regression)

$\frac{1}{1+e^{-h(x)}}$

to transform the output of $h(x)$ from a real value to a value between $0$ and $1$, which is then thresholded to an integer value. This is the setting for a classification problem using logistic regression, where the hypothesis maps an n-dimensional feature space to an integer (the id of a class).

Let's import dependencies and get started.

In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

#suppress warnings
warnings.filterwarnings('ignore')

We use the Tumor dataset for training the model. It consists of 30 different measurements (features) of 569 tumors (instances). Each instance is assigned one label, either Malignant or Benign, as the type of the tumor.

Let's load the dataset.

In [None]:
link = 'https://drive.google.com/uc?export=download&id=1m1s6Q7xQfdWf642OqMkzu2nmvoJ0xK_5'
data = pd.read_csv(link)
# data = data.sample(frac=1).reset_index(drop=True)
d = {'M': 1, 'B': 0} # we map labels from string to integer
data['diagnosis'] = data['diagnosis'].map(d)
print(data.head())

Although it is not directly related to optimization, a good practice in Machine Learning is to have three separate datasets, each for a particular purpose: a training set, a development set, and a test set. Training data is used to tune the model parameters $\theta$, while development data is used to tune hyperparameters of the overall algorithm, such as the learning rate. Test data is reserved for when data scientists are satisfied with the results and want to report final results.

Let's separate the Tumor data into the standard datasets: 80% for training, 10% for development, and 10% for testing.

In [78]:
test_idx = int(data.shape[0]*0.9)
val_idx = int(data.shape[0]*0.8)

test_data = data[test_idx:]
val_data = data[val_idx:test_idx]
data = data[:val_idx]

train_Y, train_X = data['diagnosis'], data.drop('diagnosis', axis=1)
val_Y, val_X = val_data['diagnosis'], val_data.drop('diagnosis', axis=1)
test_Y, test_X = test_data['diagnosis'], test_data.drop('diagnosis', axis=1)

print('Training data shape: ', train_X.shape)
print('Validation data shape: ', val_X.shape)
print('Testing data shape: ', test_X.shape)

Training data shape:  (455, 30)
Validation data shape:  (57, 30)
Testing data shape:  (57, 30)
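
The split above takes rows in file order, and the loading cell contains a commented-out shuffle. As a minimal sketch of a shuffled 80/10/10 split, assuming NumPy only, the helper below (the name `split_indices` and the seed are hypothetical, not part of the notebook) returns shuffled row positions instead of slicing the data directly:

```python
import numpy as np

def split_indices(n_rows, seed=0):
    """Return shuffled row positions for an 80% / 10% / 10% split (hypothetical helper)."""
    rng = np.random.default_rng(seed)      # fixed seed so the split is reproducible
    order = rng.permutation(n_rows)        # random ordering of 0 .. n_rows - 1
    val_idx = int(n_rows * 0.8)            # same cut points as the cell above
    test_idx = int(n_rows * 0.9)
    return order[:val_idx], order[val_idx:test_idx], order[test_idx:]

train_rows, val_rows, test_rows = split_indices(569)
print(len(train_rows), len(val_rows), len(test_rows))  # 455 57 57
```

With a DataFrame, positions like these can be passed to `data.iloc[train_rows]` to select the corresponding rows.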


Now we implement the Sigmoid function.

$s(h(x)) = \frac{1}{1+e^{-h(x)}}$

It squeezes the real-valued output of $h(x)$ into a probability between $0$ and $1$, which can be transformed to $1$ or $0$ by defining a threshold such as $0.5$:

In [25]:
def sigmoid(x):
    return 1/(1+np.exp(-x))
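
To make the thresholding step concrete, here is a small sketch that applies the same `sigmoid` to a few made-up scores (the values in `scores` are illustrative, not taken from the dataset) and converts the resulting probabilities to class labels:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

scores = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])  # illustrative values of h(x)
probs = sigmoid(scores)                          # every value lies strictly between 0 and 1
labels = (probs > 0.5).astype(int)               # threshold at 0.5; exactly 0.5 maps to 0

print(np.round(probs, 3))
print(labels)  # [0 0 0 1 1]
```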

Now let's work out the gradient step:

Here is the hypothesis: $h(x) = \theta^\top x$. For the sake of simplicity, we ignore the bias term. As an exercise, add a bias term and make the required changes.

Similar to Exercise 1, we could use MSE for computing the loss.

$\ell (\theta) = \frac{1}{m} \sum_{i=1}^m {(\hat y^i - y^i)}^2$

However, since $\hat y$ is a nonlinear function of $\theta$ in logistic regression, $\ell$ above is not convex and is therefore poorly suited to gradient-based optimization. To avoid some complexity, we skip the derivation and directly define and implement the following update rule, where $\hat y^i$ is the Sigmoid output for example $i$:

$\theta_{j\_new} := \theta_{j\_old} - \alpha x^i_j(\hat y^i - y^i)$
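
For readers who want the skipped step, one common choice is the cross-entropy (log) loss. The notebook does not name its loss, so this is an assumption. With $\hat y^i$ denoting the Sigmoid of $\theta^\top x^i$:

$\ell (\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^i \log \hat y^i + (1 - y^i) \log (1 - \hat y^i) \right]$

Using the Sigmoid derivative $s'(z) = s(z)(1 - s(z))$, the partial derivative with respect to $\theta_j$ simplifies to

$\frac{\partial \ell}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m x^i_j (\hat y^i - y^i)$

Gradient descent subtracts $\alpha$ times this quantity from $\theta_j$. The implementation computes the batch sum `X.T.dot(y_hat - y)` without the $\frac{1}{m}$ factor, which amounts to folding that constant into the learning rate.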


In [73]:
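def gradient_step(theta, X, y, lr):
    # Predicted probabilities for the batch: Sigmoid of the linear hypothesis
    h_x = X.dot(theta)
    y_hat = np.array(sigmoid(h_x))
    # Move theta against the direction X^T (y_hat - y), updating it in place
    theta -= lr * (X.T.dot(y_hat - y))
    return theta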

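def gradient_descent(lr, n_iterations):
    # Relies on the globals train_X, train_Y, val_X, val_Y, and batch_size
    theta = np.ones((30, 1))  # one weight per feature, initialized to 1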

    iter_count = 0
    val_accuracies = []
    while iter_count < n_iterations:
        val_accuracy = get_accuracy(val_X, val_Y.to_frame(), theta)
        print('Epoch {}: Validation Accuracy {}%'.format(iter_count, val_accuracy))
        val_accuracies.append(val_accuracy)

        for i in range(0, train_X.shape[0], batch_size):
            theta = gradient_step(theta, train_X[i:i+batch_size], train_Y[i:i+batch_size].to_frame(), lr)
        iter_count += 1

    return iter_count, theta, val_accuracies
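
One way to gain confidence in the update direction is a numerical gradient check. The sketch below assumes the direction `X.T.dot(y_hat - y)` is the gradient of a summed cross-entropy loss (the notebook does not state the loss) and compares it with central finite differences on synthetic data. The names `cross_entropy` and `analytic_gradient` are hypothetical helpers:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy(theta, X, y):
    """Summed cross-entropy loss (assumed loss; no 1/m factor, like gradient_step)."""
    y_hat = sigmoid(X @ theta)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_gradient(theta, X, y):
    """The expression that gradient_step scales by the learning rate."""
    return X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                          # synthetic features
y = rng.integers(0, 2, size=(20, 1)).astype(float)    # synthetic 0/1 labels
theta = rng.normal(size=(3, 1))

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cross_entropy(theta + step, X, y)
                  - cross_entropy(theta - step, X, y)) / (2 * eps)

max_diff = np.max(np.abs(numeric - analytic_gradient(theta, X, y)))
print(max_diff)  # should be very small if the two gradients agree
```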

Before running the optimization, we define a metric to evaluate how well the parameters are optimized. For this purpose, we use accuracy: the ratio of correctly classified cases to all cases, expressed as a percentage.

Let's implement a function to compute accuracy:

In [74]:
def get_accuracy(X, y, theta):
    y_hat = np.array(sigmoid(X.dot(theta)))

    # Convert probabilities to predictions: > 0.5 -> 1 (M), otherwise 0 (B)
    predictions = y_hat > 0.5
    accuracy = np.count_nonzero(np.equal(predictions, y)) / predictions.shape[0] * 100
    return accuracy
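
The calls to `get_accuracy` pass the labels through `.to_frame()`. One plausible reason (an assumption, since the notebook does not say) is to keep the labels as a column so they line up with the column of predictions. The plain-NumPy sketch below, using made-up values, shows what goes wrong when a column of predictions is compared with a flat label vector:

```python
import numpy as np

predictions = np.array([[1], [0], [1], [1]])  # column vector, shape (4, 1)
y_flat = np.array([1, 0, 0, 1])               # flat labels, shape (4,)
y_col = y_flat.reshape(-1, 1)                 # column labels, shape (4, 1)

wrong = np.equal(predictions, y_flat)         # broadcasts to an all-pairs comparison
right = np.equal(predictions, y_col)          # element-wise comparison

print(wrong.shape, right.shape)               # (4, 4) (4, 1)
accuracy = np.count_nonzero(right) / predictions.shape[0] * 100
print(accuracy)                               # 75.0
```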

Hyperparameters

In [84]:
batch_size = 50
lr = 0.003
n_iterations = 100
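
The values above are set by hand. As a rough sketch of how a development set could guide the choice of learning rate, the example below trains a small logistic regression on synthetic data for several candidate rates and keeps the one with the best validation accuracy. The data, the candidate values, and the helpers `train` and `accuracy` are all hypothetical stand-ins, not the notebook's functions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, lr, n_epochs=50):
    """Full-batch gradient descent for logistic regression without a bias term."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(n_epochs):
        theta -= lr * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def accuracy(X, y, theta):
    """Percentage of examples whose thresholded prediction matches the label."""
    predictions = sigmoid(X @ theta) > 0.5
    return np.mean(predictions == y) * 100

# Synthetic, linearly separable data standing in for the real splits.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
true_theta = rng.normal(size=(5, 1))
y = (X @ true_theta > 0).astype(float)
X_train, y_train = X[:240], y[:240]
X_val, y_val = X[240:], y[240:]

candidates = [0.0001, 0.001, 0.01]  # hypothetical learning rates to compare
results = {lr: accuracy(X_val, y_val, train(X_train, y_train, lr)) for lr in candidates}
best_lr = max(results, key=results.get)  # rate with the highest validation accuracy
print(results, best_lr)
```

The test set plays no part in this loop, which keeps it untouched for the final report.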

Now everything is in place. Let's run it all together and visualize the accuracy curve.

In [None]:

iter_count, theta, val_accuracies = gradient_descent(lr, n_iterations)
print('After {} Iterations'.format(iter_count))
print('Test Accuracy: {}%'.format(get_accuracy(test_X, test_Y.to_frame(), theta)))

plt.plot(range(len(val_accuracies)), val_accuracies)
plt.xlabel('Iterations')
plt.ylabel('Accuracy')
plt.show()