# Neural Network optimization with SGD and Adam

In this assignment you will study the behavior of Adam and compare it with SGD by replicating the results of Section 6 of the paper [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980) by Diederik P. Kingma and Jimmy Ba.
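
As a reference point for the comparison, the two update rules can be sketched in plain Python. The Adam step below follows Algorithm 1 of the paper. The toy objective `f(theta) = theta**2`, the starting point, and the helper names `sgd_step`, `adam_step` and `grad_f` are assumptions made only for illustration; they are not part of the assignment code.

```python
import math

def sgd_step(theta, grad, lr=0.01):
    """Plain SGD: move against the gradient, scaled by the learning rate."""
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count)."""
    m = beta_1 * m + (1 - beta_1) * grad        # biased first-moment estimate
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # biased second-moment estimate
    m_hat = m / (1 - beta_1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta_2 ** t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return theta, m, v

def grad_f(theta):
    """Gradient of the toy objective f(theta) = theta**2 (illustrative only)."""
    return 2.0 * theta

theta_sgd = theta_adam = 5.0
m = v = 0.0
for t in range(1, 1001):
    theta_sgd = sgd_step(theta_sgd, grad_f(theta_sgd))
    theta_adam, m, v = adam_step(theta_adam, grad_f(theta_adam), m, v, t)

print("SGD :", theta_sgd)
print("Adam:", theta_adam)
```

Because Adam divides by the square root of the second-moment estimate, the size of its first step is close to the learning rate regardless of the gradient's scale, whereas the SGD step is proportional to the gradient. This is one reason the two optimizers need separately tuned learning rates before they can be compared fairly.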

## Datasets

You will use the [IMDB](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb), [CIFAR-10](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/cifar10), and MNIST datasets. The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. The CIFAR-10 dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. The MNIST dataset consists of 70,000 28x28 grayscale images of handwritten digits, with 60,000 training images and 10,000 test images.

In [2]:
import tensorflow as tf

# IMDB dataset
(x_train_imdb, y_train_imdb), (x_test_imdb, y_test_imdb) = tf.keras.datasets.imdb.load_data(
    path='imdb.npz',
    num_words=None,
    skip_top=0,
    maxlen=None,
    seed=113,
    start_char=1,
    oov_char=2,
    index_from=3
)

# CIFAR-10 dataset
(x_train_cfar10, y_train_cfar10), (x_test_cfar10, y_test_cfar10) = tf.keras.datasets.cifar10.load_data()

# MNIST dataset
(x_train_mnist, y_train_mnist), (x_test_mnist, y_test_mnist) = tf.keras.datasets.mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


## Create the BoW feature vectors (10 points)

Create the word vectors using the Bag of Words (BoW) representation. You can use the following code to get the BoW representation of the dataset. You can read more about BoW [here](https://www.freecodecamp.org/news/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-nlp-282e87a9da04/).

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer
vectorizer = CountVectorizer(max_features = 10000)
word_index = tf.keras.datasets.imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

def bow(index):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in index])

x_train_bow = vectorizer.fit_transform([bow(index) for index in x_train_imdb])
x_test_bow = vectorizer.transform([bow(index) for index in x_test_imdb])
print(x_train_bow, x_test_bow)

  (0, 9010)	3
  (0, 3489)	6
  (0, 9686)	11
  (0, 4946)	4
  (0, 1238)	3
  (0, 1484)	1
  (0, 5342)	1
  (0, 7769)	1
  (0, 8565)	2
  (0, 2631)	1
  (0, 3184)	2
  (0, 7206)	2
  (0, 8703)	1
  (0, 8967)	15
  (0, 6472)	2
  (0, 8994)	4
  (0, 6693)	2
  (0, 473)	9
  (0, 9970)	4
  (0, 2126)	1
  (0, 4514)	1
  (0, 948)	2
  (0, 8988)	2
  (0, 7555)	1
  (0, 7252)	1
  :	:
  (24999, 9858)	1
  (24999, 3367)	1
  (24999, 9294)	1
  (24999, 4329)	1
  (24999, 4681)	1
  (24999, 9607)	1
  (24999, 3391)	1
  (24999, 559)	1
  (24999, 546)	1
  (24999, 7340)	1
  (24999, 9654)	1
  (24999, 7351)	1
  (24999, 1651)	2
  (24999, 8256)	1
  (24999, 5193)	1
  (24999, 7945)	1
  (24999, 4045)	1
  (24999, 9669)	1
  (24999, 3531)	1
  (24999, 2208)	1
  (24999, 7702)	1
  (24999, 861)	1
  (24999, 1395)	1
  (24999, 7548)	1
  (24999, 844)	1
  (0, 396)	1
  (0, 409)	1
  (0, 414)	1
  (0, 473)	2
  (0, 1187)	4
  (0, 1482)	1
  (0, 2126)	1
  (0, 2751)	1
  (0, 3372)	1
  (0, 3550)	3
  (0, 3888)	2
  (0, 3928)	1
  (0, 4185)	1
  (0, 4198)	1
  (0, 
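
To make the sparse output above easier to read, here is a minimal pure-Python sketch of what a bag-of-words matrix contains: one row per document, one column per vocabulary word, and each entry a word count. The helper `bag_of_words` and the two toy documents are hypothetical, and the whitespace tokenization is a simplification; `CountVectorizer` applies its own tokenization rules, so its vocabulary can differ on real text.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, count rows) for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Count every token over the whole corpus
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep only the most frequent terms (all of them when max_features is None),
    # then sort alphabetically so the column order is fixed
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts.get(word, 0) for word in vocab])
    return vocab, rows

docs = ["good movie good plot", "bad movie"]
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

Each `(row, column)  count` pair printed by the notebook cell corresponds to one nonzero entry of such a matrix; the sparse format simply omits the zeros.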

## Implement the models (10 points)

You need to implement Logistic Regression, MLP, and CNN models.

In [None]:
# Logistic Regression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def logistic_regression(x_train, y_train, x_test, y_test, optimizer, max_iter=100):
    if optimizer not in ('sgd', 'adam'):
        raise ValueError("Invalid optimizer. Supported options: 'sgd', 'adam'")
    # scikit-learn's LogisticRegression has no 'sgd' or 'adam' solver.
    # An MLPClassifier with no hidden layers is a logistic-regression model
    # and supports both optimizers, so it is used here instead.
    model = MLPClassifier(solver=optimizer, hidden_layer_sizes=(), max_iter=max_iter)

    # Train the model on the training data
    model.fit(x_train, y_train)
    # Make predictions on the test data
    y_pred = model.predict(x_test)
    # Calculate accuracy and return it together with the fitted model
    accuracy = accuracy_score(y_test, y_pred)
    return model, accuracy

In [None]:
# MLP
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def mlp_model(x_train, y_train, x_test, y_test, optimizer, hidden_layers=(100,)):
    if optimizer not in ('sgd', 'adam'):
        raise ValueError("Invalid optimizer. Supported options: 'sgd', 'adam'")
    # x_train must be 2-D (samples, features); flatten image data first
    model = MLPClassifier(solver=optimizer, hidden_layer_sizes=hidden_layers, max_iter=1000)

    # Train the model on the training data
    model.fit(x_train, y_train)
    # Make predictions on the test data
    y_pred = model.predict(x_test)
    # Calculate accuracy and return it together with the fitted model
    accuracy = accuracy_score(y_test, y_pred)
    return model, accuracy

In [None]:
# CNN
import numpy as np
import tensorflow as tf
from tensorflow import keras

def cnn_model(x_train, y_train, x_test, y_test, optimizer, epochs=10, batch_size=32):
    # Expects images shaped (samples, height, width, channels). MNIST has no
    # channel axis, so add one first, e.g. x_train[..., np.newaxis].
    input_shape = x_train.shape[1:]
    # np.unique also handles CIFAR-10 labels, which have shape (n, 1)
    num_classes = len(np.unique(y_train))
    if optimizer == 'sgd':
        opt = keras.optimizers.SGD(learning_rate=0.01)
    elif optimizer == 'adam':
        opt = keras.optimizers.Adam()
    else:
        raise ValueError("Invalid optimizer. Supported options: 'sgd', 'adam'")

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        keras.layers.Conv2D(32, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Conv2D(64, (3, 3), activation='relu'),
        keras.layers.MaxPooling2D((2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer=opt,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(x_test, y_test))
    return model

## SGD and Adam optimizers (20 points)

Use SGD and Adam optimizers with Optuna to find the best hyperparameters for the models. You can read more about Optuna [here](https://optuna.readthedocs.io/en/stable/).
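
The search spaces in this section sample the learning rate on a logarithmic scale between 1e-5 and 1e-2. The sketch below illustrates what that means using plain random search in pure Python; it is not Optuna, whose default sampler (TPE) chooses new trials adaptively. The helpers `sample_log_uniform` and `final_loss` and the toy objective are hypothetical and stand in for a real training run.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def final_loss(lr, steps=100):
    """Toy stand-in for training: run SGD on f(theta) = theta**2, return the final loss."""
    theta = 5.0
    for _ in range(steps):
        theta -= lr * 2.0 * theta
    return theta ** 2

rng = random.Random(0)  # fixed seed so the search is repeatable
trials = []
for _ in range(20):
    lr = sample_log_uniform(rng, 1e-5, 1e-2)
    trials.append((final_loss(lr), lr))

# Lower loss is better, so the best trial is the minimum
best_loss, best_lr = min(trials)
print(f"best learning rate: {best_lr:.2e}, final loss: {best_loss:.4f}")
```

Sampling on a log scale spends as many trials between 1e-5 and 1e-4 as between 1e-3 and 1e-2, which suits parameters such as the learning rate whose useful values span several orders of magnitude.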

In [20]:
!pip3 install optuna


In [None]:
#Adam
import optuna
import tensorflow as tf

def optuna_adam(build_model, x_train, y_train, x_test, y_test, n_trials=100):
    # build_model is an assumed helper: a callable that returns a fresh,
    # uncompiled Keras model (an empty tf.keras.Sequential() has no layers to train).
    def objective(trial):
        # Define the hyperparameter search space
        learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
        epsilon = trial.suggest_float('epsilon', 1e-8, 1e-5, log=True)
        beta_1 = trial.suggest_float('beta_1', 0.85, 0.99)
        beta_2 = trial.suggest_float('beta_2', 0.9, 0.999)
        model = build_model()
        optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=epsilon, beta_1=beta_1, beta_2=beta_2)
        model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        # Train the model on the training data
        model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
        # Trials are scored on the test set here; a separate validation split
        # would avoid tuning on test data.
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        # The study maximizes, so return the accuracy itself, not its negative
        return accuracy

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=n_trials)
    best_params = study.best_params
    # Retrain a fresh model with the best hyperparameters
    final_model = build_model()
    final_optimizer = tf.keras.optimizers.Adam(**best_params)
    final_model.compile(optimizer=final_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    final_model.fit(x_train, y_train, epochs=20, batch_size=32)
    return final_model, best_params

In [None]:
#sgd
import optuna
import tensorflow as tf

def optuna_sgd(build_model, x_train, y_train, x_test, y_test, n_trials=100):
    # build_model is an assumed helper: a callable that returns a fresh,
    # uncompiled Keras model (an empty tf.keras.Sequential() has no layers to train).
    def objective(trial):
        # Keras SGD has no epsilon, beta_1 or beta_2 arguments; it is tuned
        # through learning_rate and momentum (the momentum range is an assumption).
        learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)
        momentum = trial.suggest_float('momentum', 0.0, 0.99)
        model = build_model()
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate, momentum=momentum)
        model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        # Train the model on the training data
        model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
        # Trials are scored on the test set here; a separate validation split
        # would avoid tuning on test data.
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        # The study maximizes, so return the accuracy itself, not its negative
        return accuracy

    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=n_trials)
    best_params = study.best_params
    # Retrain a fresh model with the best hyperparameters
    final_model = build_model()
    final_optimizer = tf.keras.optimizers.SGD(**best_params)
    final_model.compile(optimizer=final_optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    final_model.fit(x_train, y_train, epochs=20, batch_size=32)
    return final_model, best_params

## Compare the results (10 points)

Comment on whether the results of the paper have been replicated and on the relative merits of Adam vs. SGD. Also consider the statements made on page 24 of [this reference](https://arxiv.org/pdf/1912.08957.pdf) and comment on how hyperparameter optimization may make the empirical results of the paper less relevant.