# Homework: Basic Artificial Neural Networks

The goal of this homework is simple, yet the actual implementation may take some time :). We are going to write an Artificial Neural Network (almost) from scratch. The software design is heavily inspired by [PyTorch](http://pytorch.org), one of the most widely used deep learning frameworks.

In [None]:
%matplotlib inline
from time import time, sleep
import numpy as np
import pandas as pd  # needed for pd.read_csv / pd.concat in the Data Analysis part
import matplotlib.pyplot as plt
from IPython import display
import gzip

In [None]:
# Import your google drive with notebooks
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [None]:
# move to folder with homework (all files need to be in one folder)
%cd '/content/drive/MyDrive/smthg/path_to_folder/'

In [None]:
#import modules
%run homework_modules.ipynb

# Framework

Implement everything in `homework_modules.ipynb` (the notebook loaded by `%run` above). Read all the comments carefully to ease the pain. Please try not to change the prototypes.

Do not forget that each module should return **AND** store `output` and `gradInput`.

The typical assumption is that `module.backward` is always executed after `module.forward`, so `output` is already stored when `backward` runs. This is useful for `SoftMax`, whose gradient can be expressed through its own output.
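The store-and-return convention can be sketched with a toy layer. `TanhSketch` below is a hypothetical class, not one of the homework modules, and the `backward(input, gradOutput)` signature is an assumption; follow the prototypes in the modules notebook. The only point is that `backward` reuses the `output` cached by `forward`.

```python
import numpy as np


class TanhSketch:
    """Hypothetical toy layer; not part of the homework modules."""

    def __init__(self):
        self.output = None
        self.gradInput = None

    def forward(self, input):
        # Store the result AND return it.
        self.output = np.tanh(input)
        return self.output

    def backward(self, input, gradOutput):
        # d tanh(x) / dx = 1 - tanh(x)^2, so the cached output is reused
        # instead of recomputing tanh(input).
        self.gradInput = np.multiply(gradOutput, 1.0 - np.square(self.output))
        return self.gradInput


layer = TanhSketch()
x = np.array([[0.0, 1.0, -2.0]])
out = layer.forward(x)
grad = layer.backward(x, np.ones_like(x))
```

The same pattern applies to any layer whose derivative is cheaper to write in terms of its output than its input.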

### Tech note
Prefer `np.multiply`, `np.add`, `np.divide`, `np.subtract` with the `out=` argument over `*`, `+`, `/`, `-`: they can write the result into a preallocated array instead of allocating a new one.

Example: suppose you allocated a variable

```
a = np.zeros(...)
```
So, instead of
```
a = b + c  # will be reallocated, GC needed to free
```
You can use:
```
np.add(b,c,out = a) # puts result in `a`
```
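A small check of the difference, as a sketch on toy arrays (the names `buffer_before`, `same_buffer` and `rebound` are only for illustration): `np.add(..., out=a)` keeps the original buffer, while `a = b + c` rebinds `a` to a newly allocated array.

```python
import numpy as np

b = np.arange(6, dtype=np.float64).reshape(2, 3)
c = np.ones((2, 3))

a = np.zeros((2, 3))   # preallocated buffer
buffer_before = a      # second reference to the same array object

np.add(b, c, out=a)    # writes b + c into the existing buffer
same_buffer = a is buffer_before      # no new array was created

a = b + c              # allocates a fresh array and rebinds the name `a`
rebound = a is not buffer_before      # `a` now points to a different array
```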

# Data Analysis

This task is aimed at testing the skill of a data analyst.

Remember, you don't always need a large neural network to solve a problem; sometimes it's enough to look carefully at the data.

In [None]:
train_df = pd.read_csv('data/table_data_train.csv')
test_df = pd.read_csv('data/table_data_test.csv')

In [None]:
df = pd.concat([train_df, test_df])

In [None]:
df

First of all, look at the data. Guess where it comes from, what it is about, and which variable is the most important one to predict.

***Your opinion here:***

#### 1. Check data types and missing values.


In [None]:
# Your code goes here

#### 2. Numerical Features Analysis

Calculate the mean values of `Total day minutes`, `Total intl charge` and `Customer service calls` for churned (`Churn=1`) and retained (`Churn=0`) customers.
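One possible way to get per-group means is `groupby`, sketched here on a tiny synthetic frame. The rows in `toy_df` are made up for illustration and are not taken from the homework data; only the column names follow the task text.

```python
import pandas as pd

# Synthetic rows for illustration only; not taken from the homework data.
toy_df = pd.DataFrame({
    "Total day minutes": [180.0, 250.0, 120.0, 300.0],
    "Total intl charge": [2.7, 3.5, 2.1, 4.0],
    "Customer service calls": [1, 4, 0, 5],
    "Churn": [0, 1, 0, 1],
})

cols = ["Total day minutes", "Total intl charge", "Customer service calls"]
# Result has one row per Churn value and one column per feature.
group_means = toy_df.groupby("Churn")[cols].mean()
print(group_means)
```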

In [None]:
# Your code goes here

#### 3. Distribution Visualization

- Create a histogram of `Customer service calls` for `Churn=0` and `Churn=1` on the same plot.

- Create a boxplot for `Total day minutes` segmented by Churn.

- Create a bar chart showing the churn rate for `International plan` (Yes/No).

In [None]:
# Your code goes here

In [None]:
# Your code goes here

In [None]:
# Your code goes here

#### 4. Correlation Analysis

Find the top 3 features with the highest Pearson correlation to `Churn`.
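As a sketch of what ranking by Pearson correlation looks like, the block below uses `np.corrcoef` on synthetic features (the names `strong`, `weak` and `noise` are hypothetical). Ranking by absolute value is a choice made here, so that strong negative correlations are not ignored; the task may intend signed values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n).astype(float)  # synthetic binary target

# Synthetic features with different amounts of noise around the target.
features = {
    "strong": target + 0.1 * rng.normal(size=n),
    "weak": target + 2.0 * rng.normal(size=n),
    "noise": rng.normal(size=n),
}

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson r of the pair.
corr = {name: np.corrcoef(values, target)[0, 1] for name, values in features.items()}

# Sort feature names by |r|, largest first.
ranked = sorted(corr, key=lambda name: abs(corr[name]), reverse=True)
```

With a pandas frame, `DataFrame.corr()` computes the same coefficient for every pair of numeric columns at once.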

In [None]:
# Your code goes here

#### 5. Decision Rule Without ML

Create a rule to predict Churn using no more than 3 conditions. You can perform any additional analysis that you deem necessary.

Achieve accuracy ≥ 0.82 on the test data (this is the threshold asserted in the check below).

In [None]:
# Your analysis below

In [None]:
def custom_rule(df):
    # Your code goes here
    return None

In [None]:
test_df['Predicted'] = custom_rule(test_df)
accuracy = (test_df['Predicted'] == test_df['Churn']).mean()
print(f"\nAccuracy: {accuracy:.4f}")

assert accuracy >= 0.82

# Digit classification

### Load dataset

We are using good old [MNIST](http://yann.lecun.com/exdb/mnist/) as our dataset.

In [None]:
def load_image(filename):
    # Read the inputs in Yann LeCun's binary format.
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=16)
    # The inputs are vectors now, we reshape them to monochrome 2D images
    data = data.reshape(-1, 28, 28)
    # The inputs come as bytes (0..255); dividing by 256 gives float32 values in [0, 1).
    return (data / np.float32(256)).squeeze()

def load_mnist_labels(filename):
    # Read the labels in Yann LeCun's binary format.
    with gzip.open(filename, 'rb') as f:
        data = np.frombuffer(f.read(), np.uint8, offset=8)
    # The labels are vectors of integers now, that's exactly what we want.
    return data

In [None]:
X_train = load_image('data/train-images-idx3-ubyte.gz')
X_test = load_image('data/t10k-images-idx3-ubyte.gz')
Y_train = load_mnist_labels('data/train-labels-idx1-ubyte.gz')
Y_test = load_mnist_labels('data/t10k-labels-idx1-ubyte.gz')
# We reserve the last 10000 training examples for validation.
X_train, X_val = X_train[:-10000], X_train[-10000:]
Y_train, Y_val = Y_train[:-10000], Y_train[-10000:]

In [None]:
print('X_train: ' + str(X_train.shape))
print('Y_train: ' + str(Y_train.shape))
print('X_val: ' + str(X_val.shape))
print('Y_val: ' + str(Y_val.shape))
print('X_test:  '  + str(X_test.shape))
print('Y_test:  '  + str(Y_test.shape))

In [None]:
plt.subplot(331)
plt.imshow(X_train[0], cmap=plt.get_cmap('gray'))
plt.show()
print()
print('Y_train[0]: ' + str(Y_train[0]))

## Preparing Data

### Task 1:

Make a one-hot encoding for the labels. Hint: use [np.eye](https://numpy.org/doc/stable/reference/generated/numpy.eye.html).
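To see why `np.eye` helps, here is a minimal sketch on a toy problem with 3 classes (the labels and the names `toy_labels`, `toy_one_hot` are hypothetical). Row `k` of the identity matrix is the one-hot vector of class `k`, so integer-array indexing selects one row per sample.

```python
import numpy as np

toy_labels = np.array([2, 0, 1, 2])  # hypothetical labels for 3 classes
num_classes = 3

# Indexing the identity matrix with the label array picks one row per sample.
toy_one_hot = np.eye(num_classes)[toy_labels]

# argmax along the class axis recovers the original labels.
recovered = toy_one_hot.argmax(axis=1)
```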

In [None]:
def one_hot_encode(y):
    # YOUR CODE HERE:
    ###########################
    ### ╰( ͡° ͜ʖ ͡° )つ──☆*:・ﾟ
    ###########################
    one_hot_y = None
    return one_hot_y

hot_y_train = one_hot_encode(Y_train)
hot_y_val = one_hot_encode(Y_val)
hot_y_test = one_hot_encode(Y_test)

#### Test task 1

In [None]:
def one_hot_encode_test(hot_y_train):
    first_ten_answers = np.array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
                        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
                        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
                        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
                        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
                        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
                        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
                        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
                        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
                        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
    np.testing.assert_equal(hot_y_train[:10], first_ten_answers, err_msg="First ten samples are not equal")
    print("The test passed successfully!")

one_hot_encode_test(hot_y_train)

### Task 2:

The network in this notebook is fully connected, so it treats MNIST images as vectors. Reshape `X_train`, `X_val` and `X_test` from `(n_samples, 28, 28)` to `(n_samples, 784)`; this is the shape checked by the test below.
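The test for this task expects flattened inputs of shape `(n_samples, 784)`. A minimal sketch of such a reshape on a toy array (the names `toy_images` and `toy_flat` are hypothetical stand-ins, not the homework variables):

```python
import numpy as np

toy_images = np.zeros((5, 28, 28), dtype=np.float32)  # stand-in for a batch of images

# Keep the sample axis and let -1 infer the remaining 28 * 28 = 784 values.
toy_flat = toy_images.reshape(toy_images.shape[0], -1)
```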

In [None]:
# YOUR CODE HERE:
###########################
### ╰( ͡° ͜ʖ ͡° )つ──☆*:・ﾟ
###########################
X_train, X_val, X_test = None, None, None

#### Test Task 2

In [None]:
def dimension_test(X_train, X_val, X_test):
    true_train_shape = (50000, 784)
    true_test_shape = (10000, 784)  # validation and test sets have the same size
    np.testing.assert_equal(X_train.shape, true_train_shape, err_msg="Train shape is not as expected")
    np.testing.assert_equal(X_val.shape, true_test_shape, err_msg="Validation shape is not as expected")
    np.testing.assert_equal(X_test.shape, true_test_shape, err_msg="Test shape is not as expected")
    print("The test passed successfully!")

dimension_test(X_train, X_val, X_test)

### Compare activation function

- **Compare** the `ReLU`, `ELU`, `LeakyReLU` and `SoftPlus` activation functions.
Ideally you would tune the optimizer parameters for each of them, but that is overkill for now.

- **Try** inserting `BatchNormalization` (followed by `ChannelwiseScaling`) between the `Linear` module and the activation function.

- Fill in the blanks in the code below.

- Plot the losses from both the activation function comparison and the `BatchNormalization` comparison on one plot. Choose a scale (log?) on which the lines are distinguishable, do not forget to label the axes, and make the plot good-looking.

- Hint: a good log loss for MNIST should be around 0.5.
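For reference, the four activations being compared can be written as plain NumPy functions. This is a sketch of the forward formulas only, not the homework modules; the default `slope` and `alpha` values are assumptions chosen for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, slope=0.03):
    # Small non-zero slope for negative inputs (slope value is an assumption).
    return np.where(x > 0, x, slope * x)


def elu(x, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # since np.where evaluates both branches.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def softplus(x):
    # log(1 + exp(x)) computed in a numerically stable way.
    return np.logaddexp(0.0, x)


x = np.linspace(-3.0, 3.0, 7)
values = {f.__name__: f(x) for f in (relu, leaky_relu, elu, softplus)}
```

All four agree closely for large positive inputs and differ mainly in how they treat negative ones, which is what the comparison is meant to probe.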

In [None]:
# batch generator
def get_batches(dataset, batch_size):
    X, Y = dataset
    n_samples = X.shape[0]

    # Shuffle at the start of epoch
    indices = np.arange(n_samples)
    np.random.shuffle(indices)

    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        batch_idx = indices[start:end]
        yield X[batch_idx], Y[batch_idx]

In [None]:
# Your code goes here.
def get_optimizer(optimizer_name):
    if optimizer_name == 'sgd_momentum':
        optimizer_config = # Your code goes here
        optimizer_state = {}

    elif optimizer_name == 'adam_optimizer':
        optimizer_config = # Your code goes here
        optimizer_state = {}

    else:
        raise NameError("Optimizer name has to be one of {'sgd_momentum', 'adam_optimizer'}")

    return optimizer_config, optimizer_state

In [None]:
def train(net, criterion, optimizer_name, n_epoch,
          X_train, y_train, X_val, y_val, batch_size):

    loss_train_history = []
    loss_val_history = []
    optimizer_config, optimizer_state = get_optimizer(optimizer_name)

    for i in range(n_epoch):
        print('Epoch {}/{}:'.format(i, n_epoch - 1), flush=True)

        for phase in ['train', 'val']:
            if phase == 'train':
                X = X_train
                y = y_train
                net.train()
            else:
                X = X_val
                y = y_val
                net.evaluate()

            # get_batches also yields the last, possibly smaller, batch
            num_batches = int(np.ceil(X.shape[0] / batch_size))
            running_loss = 0.
            running_acc = 0.

            for x_batch, y_batch in get_batches((X, y), batch_size):

                net.zeroGradParameters()

                # Forward
                predictions = # Your code goes here
                loss = # Your code goes here

                # Backward
                if phase == 'train':
                    # Your code goes here

                    # Update weights
                    if optimizer_name == 'sgd_momentum':
                        # Your code goes here
                    else:
                        # Your code goes here

                running_loss += loss
                running_acc += np.sum(predictions.argmax(axis=1) == y_batch.argmax(axis=1))

            epoch_loss = running_loss / num_batches
            epoch_acc = running_acc / y.shape[0]
            if phase == 'train':
                loss_train_history.append(epoch_loss)
            else:
                loss_val_history.append(epoch_loss)

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc), flush=True)

    return net, loss_train_history, loss_val_history

In [None]:
def test(net, criterion, X_test, y_test, batch_size):
    net.evaluate()
    # get_batches also yields the last, possibly smaller, batch
    num_batches = int(np.ceil(X_test.shape[0] / batch_size))
    running_loss = 0.
    running_acc = 0.
    for x_batch, y_batch in get_batches((X_test, y_test), batch_size):
        net.zeroGradParameters()

        # Forward
        predictions = # Your code goes here
        loss = # Your code goes here
        running_loss += loss
        running_acc += (predictions.argmax(axis=1) == y_batch.argmax(axis=1)).astype(float).mean()

    epoch_loss = running_loss / num_batches
    epoch_acc = running_acc / num_batches
    return epoch_loss, epoch_acc

In [None]:
def get_net(activation=ReLU, norm=False):
    net = Sequential()
    net.add(Linear(28*28, 100))
    if norm:
        net.add(BatchNormalization(alpha=0.0001))
        net.add(ChannelwiseScaling(100))
    net.add(activation())
    net.add(Linear(100, 10))
    if norm:
        net.add(BatchNormalization(alpha=0.0001))
        net.add(ChannelwiseScaling(10))
    net.add(LogSoftMax())
    return net

In [None]:
# Fixed parameters (you can change them if you want)
batch_size = 64
n_epoch = 15
criterion = ClassNLLCriterion()
optimizer_name = 'sgd_momentum'

In [None]:
nets = []
activations = # Your code goes here

for activ in activations:
    # Your code goes here
    # Add nets for all activations, with and without Batch Normalization
    # Use `get_net` function

In [None]:
losses_train = []
losses_val = []

for i, net in enumerate(nets):
    print(f'\n\nTrain net {i + 1}/{len(nets)}')
    # Your code goes here
    # Train net and save net, losses on train and validation
    # Use `train` function
    # This may take up to 15 minutes

In [None]:
for net in nets:
    # Your code goes here
    # Test net and print loss and accuracy
    # Use `test` function

In [None]:
import matplotlib.pyplot as plt

# Your code goes here
# Plot train and validation loss for all nets

### Compare optimizer

- Plot the losses for two networks: one trained with SGD with momentum (`sgd_momentum`), the other with Adam (`adam_optimizer`). Which one performs better?
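To make the comparison concrete, here is a sketch of the two update rules applied to a toy quadratic loss. The function names, signatures and hyperparameter defaults are hypothetical and do not follow the homework optimizer prototypes; the formulas are the standard momentum and Adam updates.

```python
import numpy as np


def sgd_momentum_step(x, grad, state, lr=0.1, momentum=0.9):
    # Velocity is an exponentially decaying sum of past gradients.
    v = state.setdefault("v", np.zeros_like(x))
    np.add(momentum * v, lr * grad, out=v)
    x -= v


def adam_step(x, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = state.setdefault("m", np.zeros_like(x))
    v = state.setdefault("v", np.zeros_like(x))
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    np.add(beta1 * m, (1 - beta1) * grad, out=m)        # first moment estimate
    np.add(beta2 * v, (1 - beta2) * grad ** 2, out=v)   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                        # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)


def run(step, n_steps=200):
    x = np.array([3.0, -2.0])
    state = {}
    losses = []
    for _ in range(n_steps):
        losses.append(0.5 * np.sum(x ** 2))  # toy loss f(x) = 0.5 * ||x||^2
        step(x, x.copy(), state)             # the gradient of the toy loss is x
    return losses


sgd_losses = run(sgd_momentum_step)
adam_losses = run(adam_step)
```

Both loss curves decrease on this toy problem; how they compare on MNIST depends on the learning rates chosen in `get_optimizer`.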

In [None]:
nets = []
optimizer_names = ['sgd_momentum', 'adam_optimizer']

for optim_name in optimizer_names:
    nets.append(get_net(activation=ReLU, norm=True))

In [None]:
criterion = ClassNLLCriterion()
batch_size = 64
n_epoch = 15

losses_train = []
losses_val = []

for i, net in enumerate(nets):
    # Your code goes here
    # Train net and save net, losses on train and validation

In [None]:
for net in nets:
    # Your code goes here
    # Test net and print loss and accuracy

In [None]:
import matplotlib.pyplot as plt

# Your code goes here
# Plot train and validation loss for nets trained by sgd momentum and adam optimizer

### Your conclusions

What conclusions did you draw for yourself? Write your conclusion on the work done.

You can rely on the questions below:

- Which activation functions provide the best accuracy?

- Are there differences in training stability when using different activation functions?

- How does batch normalization affect the model's training speed?

- Does batch normalization improve the model's accuracy on the test dataset?

- How does batch normalization affect training stability?

- Which activation function provided the greatest improvement in performance when combined with batch normalization?

- How does the convergence speed of SGD with momentum compare to that of Adam?

- Are there differences in training stability when using SGD with momentum versus Adam?

- How does the loss function behave over epochs for each optimizer?

***Your answer here***:

### Custom model

**Finally**, use all your knowledge to build a super cool model on this dataset. Use **dropout** to prevent overfitting and play with **learning rate decay**. You can use **data augmentation**, such as rotations and translations, to boost your score. Use your knowledge and imagination to train a model. Don't forget to call the `train()` and `evaluate()` methods to set the desired behaviour of the `BatchNormalization` and `Dropout` layers.
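Two of these ingredients can be sketched in a few lines. `decayed_lr` and `random_shift` are hypothetical helpers, not part of the homework code, and the decay rate and shift range are arbitrary choices. Note that `random_shift` works on a 2D image, so a flattened sample would first need to be reshaped to `(28, 28)`.

```python
import numpy as np


def decayed_lr(base_lr, epoch, decay_rate=0.9):
    # Exponential decay: the rate shrinks by a constant factor every epoch.
    return base_lr * decay_rate ** epoch


def random_shift(image, max_shift, rng):
    # Translate a 2D image by up to max_shift pixels per axis, filling with zeros.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Part of the source image that stays inside the frame after the shift.
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted


rng = np.random.default_rng(0)
toy_image = np.ones((28, 28), dtype=np.float32)
augmented = random_shift(toy_image, max_shift=2, rng=rng)
lr_schedule = [decayed_lr(0.1, epoch) for epoch in range(5)]
```

The decayed rate would be written into the optimizer config at the start of each epoch, and the augmentation applied to each training batch only, never to validation or test data.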

In [None]:
# Your code goes here
# Create a model

In [None]:
# Your code goes here
# Train your custom architecture

In [None]:
# Your code goes here
# Plot validation loss

Print your accuracy on the test set here. It should be around 90%.

In [None]:
# Your answer goes here

### Comparing with PyTorch implementation
The last (and maybe the easiest, compared to the previous tasks) step: build a network with the same architecture as above, now with PyTorch.
__Good Luck!__

In [None]:
# Your beautiful code here