Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart Kernel) and then **run all cells** (in the menubar, select Run$\rightarrow$Run All Cells). Alternatively, you can use the **validate** button in the assignment list panel.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". When you insert your code, you can remove the line `raise NotImplementedError()`. Also put your name, matriculation number, and collaborators below:

In [None]:
NAME = ""
MATRICULATIONNUMBER = ""
COLLABORATORS = ""

---

<img src="images/logo_ifn.svg" alt="Drawing" style="width: 256px;" align="right"/>

# Exercise 2.2: Simple 2D Problems

In the last notebook you got to know the basics of machine learning. You learned how to generate, prepare, and preprocess data for machine learning models, and how to train and evaluate such models. The last notebook also showed that not every problem needs a sophisticated deep neural network, which you should keep in mind when encountering data-related problems in the future. In this notebook, we will further explore simple 2D problems with support vector machines and first single-layer neural networks.

In [11]:
import os
import numpy as np
import random
from sklearn.datasets import make_swiss_roll
from sklearn import svm
from sklearn import neural_network
import matplotlib.pyplot as plt
import params
from ipylab import JupyterFrontEnd
from threadpoolctl import threadpool_limits

PARAM_1, PARAM_2, PARAM_3 = params.gen_params(os.getcwd())
PARAM_4 = int(params.gen_params(os.getcwd(), mode='float', num=1)[0] *100000)
app = JupyterFrontEnd()
app.commands.execute('notebook:render-all-markdown')

<img src="images/2d-distribution.png" alt="Drawing" style="width: 256px;" align="right"/>

### Task 2.2-A: Data Generation (5P) 

The aim of this task is to generate a data distribution as shown on the right. For this purpose, below you can find two functions. Look through the functions and familiarize yourself with them. Execute the cell with the `generate_distribution` function. It will yield two Gaussian clusters as shown in the resulting plot. Modify the data generation according to the following steps to generate the data distribution shown on the right hand side.
- Generate four Gaussian clusters of class 0 with centers at (2, 2), (-2, 2), (-2, -2), and (2, -2), identity covariance matrix, and 100 samples per cluster.
- Generate four Gaussian clusters of class 1 with centers at (0, 2), (-2, 0), (2, 0), and (0, -2), diagonal covariance matrix which only contains the values 0.1 and 2, and 100 samples per cluster.
- Multiply all x-values by 1000 and all y-values by 0.1, for both the training data and the (later generated) validation data.
- Generate a validation set with the same parameters. Use a seed of {{PARAM_4}} for generation of the training set and a seed of {{PARAM_4*2}} for generation of the validation set.
- Keep only 10\% of values from class 1 in the training dataset. This should yield a similar distribution as shown on the right hand side. Retain the same amount of samples for both classes in the validation set. The validation set should contain 400 samples per class at this point.
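
The steps above can be sketched on a toy example. This is a minimal illustration and not the reference solution: it uses `np.random.default_rng` instead of the notebook's `genGauss` helper, only one cluster per class, and an arbitrary seed. The names `rng`, `X0`, `X1`, and `keep` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the one required by the task

# One cluster per class: identity covariance for class 0, anisotropic diagonal for class 1
X0 = rng.multivariate_normal(mean=[2, 2], cov=[[1, 0], [0, 1]], size=100)
X1 = rng.multivariate_normal(mean=[0, 2], cov=[[0.1, 0], [0, 2]], size=100)

# Keep only 10% of the class-1 samples to create a class imbalance
keep = int(len(X1) * 0.1)
X1 = X1[:keep]

X = np.concatenate([X0, X1])
Y = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])

# Scale the two feature axes very differently (x by 1000, y by 0.1)
X = X * np.array([1000, 0.1])

print(X.shape, np.bincount(Y.astype(int)))
```

After these steps the x-values span thousands while the y-values stay well below 1, which is the kind of inhomogeneous value range the later tasks ask you to investigate.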

In [None]:
def genGauss(means, covs, pre_label, amount_samples, seed):
    """This function generates Gaussian 2D data clusters with axis-specific means and covariances as parameters"""
    np.random.seed(seed)
    random.seed(seed)
    X = []
    Y = []
    if ((len(means.shape) <= 1) or (len(covs.shape) <= 1)):
        exit("Dimension of Data (means or cov-matrices) not equal.\nPlease ensure that dimensions of mean and cov-matrices fit!")
    if pre_label.size == 0 and amount_samples.size == 1:
        if not (means.shape[0] == covs.shape[0]):
            exit(
                "Number of means do not fit to the number of cov-matrices!\nPlease ensure that the amounts are equal.")
    elif pre_label.size == 0:
        if not (means.shape[0] == covs.shape[0] == amount_samples.shape[0]):
            exit(
                "Number of means do not fit to the number of cov-matrices or labels!\nPlease ensure that the amounts are equal.")
    elif amount_samples.size == 1:
        if not (means.shape[0] == covs.shape[0] == pre_label.shape[0]):
            exit(
                "Number of means do not fit to the number of cov-matrices or amount of samples!\nPlease ensure that the amounts are equal.")
    else:
        if not (means.shape[0] == covs.shape[0] == pre_label.shape[0] == amount_samples.shape[0]):
            exit("Number of means do not fit to the number of cov-matrices or labels or amount of samples!\nPlease ensure that the amounts are equal.")
    for i in range(0, means.shape[0]):
        if amount_samples.size == 1:
            gauss_vals = np.random.multivariate_normal(means[i], covs[i], amount_samples[0])
        elif amount_samples.size < 1:
            exit(
                "Amount of samples < 1!\nPlease ensure that the amount is at least 1.")
        else:
            gauss_vals = np.random.multivariate_normal(means[i], covs[i], amount_samples[i])
        labels = np.empty([gauss_vals.shape[0]])
        if pre_label.size == 0:
            labels.fill(i)
        else:
            labels.fill(pre_label[i])
        for k in range(0, gauss_vals.shape[0]):
            X.append(gauss_vals[k])
            Y.append(labels[k])
    combined = list(zip(X, Y))
    random.shuffle(combined)
    X[:], Y[:] = zip(*combined)
    return np.asarray(X), np.asarray(Y)    
    
def modify_data(X, Y, multipliers=(1, 1), imbalance=1):
    """This function multiplies the data in X axis-wise with the respective multiplier elements.
    It also removes samples of class 1 according to the imbalance parameter."""
    np.random.seed(0)
    random.seed(0)
    X[:,0] *= multipliers[0]
    X[:,1] *= multipliers[1]
    X_N = []
    Y_N = []
    classes = np.unique(Y)
    num_samples = int(len(X)/len(classes))
    for cl in classes:
        mask = (Y==cl)
        if cl == 1:
            X_N.append(X[mask,:][:int(num_samples*imbalance)])
            Y_N.append(Y[mask][:int(num_samples*imbalance)])
        else:
            X_N.append(X[mask,:][:num_samples])
            Y_N.append(Y[mask][:num_samples])
    X_N = np.concatenate(X_N)
    Y_N = np.concatenate(Y_N)
    p = np.random.permutation(len(X_N))
    X_N, Y_N = X_N[p], Y_N[p]
    return X_N, Y_N

In [None]:
def generate_distribution(multipliers, imbalance):
    # the below code is just a simple example on how to use the above functions.
    # You can edit it as necessary for solving the task
    # Four class-0 clusters at the corners and four class-1 clusters on the axes, 100 samples each
    amount_samples = np.array([100] * 8)
    means = np.array([[2, 2], [-2, 2], [-2, -2], [2, -2],
                      [0, 2], [-2, 0], [2, 0], [0, -2]])
    # Class 0: identity covariance. Class 1: diagonal covariance with the values 0.1 and 2.
    # Which axis receives which value is an assumption here; compare with the reference plot.
    covs = np.array([[[1, 0], [0, 1]]] * 4 + [[[0.1, 0], [0, 2]]] * 4)
    labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    X_tr, Y_tr = genGauss(means, covs, labels, amount_samples, seed=13832)
    X_tr, Y_tr = modify_data(X_tr, Y_tr, multipliers, imbalance)

    # Validation set: same parameters, different seed, and no class imbalance
    X_val, Y_val = genGauss(means, covs, labels, amount_samples, seed=27664)
    X_val, Y_val = modify_data(X_val, Y_val, multipliers, imbalance=1)
    # YOUR CODE HERE

    return X_tr, Y_tr, X_val, Y_val

X_tr, Y_tr, X_val, Y_val = generate_distribution((1000, 0.1),0.1)
# YOUR CODE HERE


plt.scatter(X_tr[:,0], X_tr[:,1], c=Y_tr)
plt.show()

In [None]:
assert type(X_tr) == np.ndarray
assert type(Y_tr) == np.ndarray
assert type(X_val) == np.ndarray
assert type(Y_val) == np.ndarray


### Task 2.2-B: Support Vector Machines (5P) 

The generated data is clearly not linearly separable. As a first attempt at solving this problem, implement a support vector machine (SVM). For this purpose, complete the function below, which takes the training data and labels as well as several hyperparameters as input and returns the trained SVM. Familiarize yourself with the SVM implementation of sklearn and implement an SVM that achieves 100% accuracy on the training set. Discuss with your fellow students why this is not the optimal solution and what the reasons might be.
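
To get a feeling for how the kernel choice and the kernel coefficient influence the fit, the following sketch trains two SVMs on a small toy problem. It is an illustration under stated assumptions, not the solution to this task: the toy data, the seed, and the names `X_train`, `X_valid`, and `model` are made up for the example, and the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)  # illustrative seed

# Toy data that is not linearly separable: the label depends on the sign of x*y
X_train = rng.normal(size=(200, 2))
Y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)
X_valid = rng.normal(size=(200, 2))
Y_valid = (X_valid[:, 0] * X_valid[:, 1] > 0).astype(int)

# Compare a linear kernel with an RBF kernel that uses a large kernel coefficient
for kernel, gamma in [("linear", "scale"), ("rbf", 100.0)]:
    model = svm.SVC(kernel=kernel, C=1.0, gamma=gamma, max_iter=10000)
    model.fit(X_train, Y_train)
    print(kernel, model.score(X_train, Y_train), model.score(X_valid, Y_valid))
```

Comparing the training and validation scores of the two models is a useful starting point for the discussion asked for above.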

In [None]:
def train_svm(X_tr, Y_tr, kernel_function, penalty, kernel_coeff, max_iterations):
    # YOUR CODE HERE
    # SVC lives in the sklearn.svm module, which is imported above as `svm`
    clf = svm.SVC(kernel=kernel_function, C=penalty, gamma=kernel_coeff, max_iter=max_iterations)
    clf.fit(X_tr,Y_tr)
    return clf

with threadpool_limits(limits=1, user_api='blas'):
    clf = train_svm(X_tr, Y_tr, kernel_function='linear', penalty=1.0, kernel_coeff=10.0, max_iterations=10000)
    # YOUR CODE HERE
    
    print('Result svm-Training - Accuracy on training data = ' + '{0:g}%'.format(clf.score(X_tr, Y_tr) * 100) + "\n")
    print('Result svm-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf.score(X_val, Y_val) * 100) + "\n")

In [None]:
train_accuracy = clf.score(X_tr, Y_tr)
print(train_accuracy)


### Task 2.2-C: Class Balancing and Data Normalization (5P) 

You might have received a ConvergenceWarning in the previous task, hinting at a rather poor convergence of the model. Unnormalized data as well as the class imbalance can be possible reasons for the poor convergence and the rather bad results on the validation set. Regenerate your data using the `generate_distribution` function (it should be reproducible due to the set random seeds) and see how the class imbalance as well as the unnormalized and inhomogeneous value ranges affect the SVM training and generalization result. Find a way to improve the SVM result on the validation set to more than 75\%. Also, discuss with your fellow students how the hyperparameters affect the training and validation scores and which commonly known phenomenon you can observe.
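
One possible way to address both issues is sketched below on toy data: standardize the features with statistics computed on the training set only, and let the SVM reweight the classes. This is a hedged illustration, not the required solution; the toy data, the seed, and the variable names are assumptions made for the example. `StandardScaler` and the `class_weight="balanced"` option of `SVC` are existing scikit-learn features.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # illustrative seed

# Imbalanced toy data with very different feature ranges
X_train = rng.normal(size=(110, 2)) * np.array([1000, 0.1])
Y_train = np.array([0] * 100 + [1] * 10)
X_valid = rng.normal(size=(50, 2)) * np.array([1000, 0.1])

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

# class_weight="balanced" reweights the penalty inversely to the class frequency
model = svm.SVC(kernel="rbf", C=1.0, gamma="scale", class_weight="balanced")
model.fit(X_train_scaled, Y_train)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, each training feature has zero mean and unit standard deviation. The validation data is transformed with the same statistics, so the model must also be scored on the scaled validation data.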

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

X_tr, Y_tr, X_val, Y_val = generate_distribution((1000, 0.1), 0.1)
clf = train_svm(X_tr, Y_tr, kernel_function='linear', penalty=1.0, kernel_coeff=10.0, max_iterations=10000)

with threadpool_limits(limits=1, user_api='blas'):
    # YOUR CODE HERE
    # Normalize the features: fit the scaler on the training data only and reuse it for validation.
    # X_tr and X_val are overwritten so that later cells score the model on normalized data.
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)
    X_val = scaler.transform(X_val)

    # Weight each sample inversely to its class frequency to counter the class imbalance
    class_weights = compute_class_weight('balanced', classes=np.unique(Y_tr), y=Y_tr)
    sample_weight = np.array([class_weights[int(label)] for label in Y_tr])

    clf.fit(X_tr, Y_tr, sample_weight=sample_weight)
    print('Result svm-Training - Accuracy on training data = ' + '{0:g}%'.format(clf.score(X_tr, Y_tr) * 100) + "\n")
    print('Result svm-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf.score(X_val, Y_val) * 100) + "\n")

In [None]:
validation_accuracy = clf.score(X_val, Y_val)
print(validation_accuracy)


### Task 2.2-D: Single-Layer Neural Networks (5P) 

Let's start implementing our first neural network. We will use the neural network to solve the same task that we previously approached with the SVM. To this end, please implement the following:
- Complete the `train_nnet` function by first instantiating the `MLPClassifier` class with the parameters passed to the `train_nnet` function and afterwards fitting it to the training data. Return the trained model as `clf`.
- As you will see, the accuracy on both the training and the validation set is rather poor. Try to find model parameters that achieve a training score of more than 85\% (don't change the data at this point). Use a random_state value of {{PARAM_4}}. Discuss why the validation accuracy is not easily improved.
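
To make the model structure concrete, here is a NumPy sketch of the forward pass of a network with one hidden layer. It is an illustration only: the ReLU hidden activation, the sigmoid output, and the random weights are assumptions chosen for the example, and `forward`, `W1`, `b1`, `W2`, and `b2` are hypothetical names that are not part of the `MLPClassifier` API.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def forward(X, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer and a sigmoid output."""
    hidden = np.maximum(0, X @ W1 + b1)      # ReLU hidden activation
    logits = hidden @ W2 + b2                # linear output layer
    return 1.0 / (1.0 + np.exp(-logits))     # probability of class 1

X = rng.normal(size=(5, 2))                  # 5 samples, 2 features
W1, b1 = rng.normal(size=(2, 100)), np.zeros(100)
W2, b2 = rng.normal(size=(100, 1)), np.zeros(1)

probs = forward(X, W1, b1, W2, b2)
print(probs.shape)
```

Training adjusts `W1`, `b1`, `W2`, and `b2` so that these probabilities match the labels; the hidden layer size and the activation function determine which decision boundaries the model can represent.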


In [None]:
def train_nnet(X_tr, Y_tr, 
               hidden_layer_sizes, 
               activation, 
               solver,
               alpha,
               batch_size,
               learning_rate,
               max_iter,
               learning_rate_init=0.001, 
               shuffle=True, 
               nesterovs_momentum=True, 
               momentum=0.9,
               random_state=0):
    np.random.seed(random_state)
    random.seed(random_state)
    # YOUR CODE HERE
    # MLPClassifier lives in the sklearn.neural_network module imported above
    clf = neural_network.MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                                       activation=activation,
                                       solver=solver,
                                       alpha=alpha,
                                       batch_size=batch_size,
                                       learning_rate=learning_rate,
                                       max_iter=max_iter,
                                       learning_rate_init=learning_rate_init,
                                       shuffle=shuffle,
                                       nesterovs_momentum=nesterovs_momentum,
                                       momentum=momentum,
                                       random_state=random_state)
    clf.fit(X_tr,Y_tr)
    return clf

with threadpool_limits(limits=1, user_api='blas'):
    X_tr, Y_tr, X_val, Y_val = generate_distribution((1000, 0.1), 0.1)
    clf = train_nnet(X_tr, Y_tr, 
                     hidden_layer_sizes=(100,),
                     activation="identity", 
                     solver="sgd",
                     alpha=0.0001,
                     batch_size=200,
                     learning_rate="constant",
                     max_iter=500,
                     learning_rate_init=0.001,
                     shuffle=True,
                     nesterovs_momentum=True,
                     momentum=0.9,
                     random_state=13832)
    # YOUR CODE HERE
    

    print('Result nnet-Training - Accuracy on training data = ' + '{0:g}%'.format(clf.score(X_tr, Y_tr) * 100) + "\n")
    print('Result nnet-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf.score(X_val, Y_val) * 100) + "\n")

In [None]:
X_tr, Y_tr, X_val, Y_val = generate_distribution((1000, 0.1), 0.1)
train_accuracy = clf.score(X_tr, Y_tr)
validation_accuracy = clf.score(X_val, Y_val)
print(train_accuracy, validation_accuracy)


### Task 2.2-E: Class Balancing and Data Normalization II (5P) 

Now that the training accuracy is satisfactory, modify the training data distribution using the parameters of the `generate_distribution` function. Remember to train a new neural network model on the new training data (again with a random_state value of {{PARAM_4}}). Discuss how balanced class distributions and normalized feature value ranges affect the performance of the neural network. In the end, use a training data set that allows for a validation accuracy of more than 80\%.
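
Before retraining, it can help to check how balanced the classes are and how far apart the feature ranges lie. The helper below is a small sketch for that purpose; `describe` is a hypothetical name introduced here, and the four-sample array is hand-made for illustration.

```python
import numpy as np

def describe(X, Y):
    """Return per-class sample counts and the value range of each feature."""
    classes, counts = np.unique(Y, return_counts=True)
    ranges = X.max(axis=0) - X.min(axis=0)
    return dict(zip(classes.tolist(), counts.tolist())), ranges

# Tiny hand-made example: three samples of class 0, one of class 1
X = np.array([[0.0, 0.0], [1000.0, 0.1], [2000.0, 0.2], [500.0, 0.05]])
Y = np.array([0, 0, 0, 1])

counts, ranges = describe(X, Y)
print(counts, ranges)
```

Calling the same helper on `X_tr, Y_tr` for different arguments of `generate_distribution` shows directly how the multipliers and the imbalance parameter change the training set.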

In [None]:
with threadpool_limits(limits=1, user_api='blas'):
    X_tr, Y_tr, X_val, Y_val = generate_distribution((1000, 0.1), 0.1)
    # YOUR CODE HERE
    clf = train_nnet(X_tr, Y_tr, 
                     hidden_layer_sizes=(100,),
                     activation="identity", 
                     solver="sgd",
                     alpha=0.0001,
                     batch_size=200,
                     learning_rate="constant",
                     max_iter=500,
                     learning_rate_init=0.001,
                     shuffle=True,
                     nesterovs_momentum=True,
                     momentum=0.9,
                     random_state=13832)
    print('Result nnet-Training - Accuracy on training data = ' + '{0:g}%'.format(clf.score(X_tr, Y_tr) * 100) + "\n")
    print('Result nnet-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf.score(X_val, Y_val) * 100) + "\n")

In [None]:
validation_accuracy = clf.score(X_val, Y_val)
print(validation_accuracy)


### Task 2.2-F: Parameter Initialization (5P) 

As you learned, the weights in a neural network are initialized randomly. Accordingly, the same training with the same parameters may yield differing results depending on the initialization. In practice, one usually fixes the random seed of all involved libraries to ensure (at least somewhat) reproducible experiments. Try this out yourself by using three arbitrary, different values for the `random_state` parameter. Train the models for only 100 iterations and observe the validation accuracy at this point. Train three different models `clf1`, `clf2`, and `clf3`, which all yield different validation accuracies. While for this simple problem all trainings usually converge to the global optimum, in more complicated settings this phenomenon can lead not only to differing convergence speeds but also to convergence towards different local optima.
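
The effect of the seed can be checked directly on the learned weights. The sketch below is a minimal illustration on made-up toy data: `fit_with_seed` is a hypothetical helper, and the network size and iteration count are arbitrary. It trains three small networks and compares their first-layer weights (`coefs_[0]`).

```python
import warnings
import numpy as np
from sklearn import neural_network
from sklearn.exceptions import ConvergenceWarning

rng = np.random.default_rng(0)  # illustrative data seed
X = rng.normal(size=(40, 2))
Y = (X[:, 0] > 0).astype(int)

def fit_with_seed(seed):
    """Train a small MLP for a few iterations with the given random_state."""
    model = neural_network.MLPClassifier(hidden_layer_sizes=(5,), solver="sgd",
                                         max_iter=5, random_state=seed)
    with warnings.catch_warnings():
        # Stopping after 5 iterations is intentional here, so hide the warning
        warnings.simplefilter("ignore", ConvergenceWarning)
        model.fit(X, Y)
    return model

a, b, c = fit_with_seed(1), fit_with_seed(1), fit_with_seed(2)
same = np.allclose(a.coefs_[0], b.coefs_[0])           # same seed
different = not np.allclose(a.coefs_[0], c.coefs_[0])  # different seed
print(same, different)
```

Models trained with the same `random_state` end up with identical weights, while a different seed leads to different weights after the same number of iterations.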

In [None]:
with threadpool_limits(limits=1, user_api='blas'):
    X_tr, Y_tr, X_val, Y_val = generate_distribution((1, 1), 1)
    # YOUR CODE HERE
    clf1 = train_nnet(X_tr, Y_tr, 
                      hidden_layer_sizes=(100,),
                      activation="identity", 
                      solver="sgd",
                      alpha=0.0001,
                      batch_size=200,
                      learning_rate="constant",
                      max_iter=100,
                      learning_rate_init=0.001,
                      shuffle=True,
                      nesterovs_momentum=True,
                      momentum=0.9,
                      random_state=42)
    clf2 = train_nnet(X_tr, Y_tr, 
                      hidden_layer_sizes=(100,),
                      activation="identity", 
                      solver="sgd",
                      alpha=0.0001,
                      batch_size=200,
                      learning_rate="constant",
                      max_iter=100,
                      learning_rate_init=0.001,
                      shuffle=True,
                      nesterovs_momentum=True,
                      momentum=0.9,
                      random_state=123)
    clf3 = train_nnet(X_tr, Y_tr, 
                      hidden_layer_sizes=(100,),
                      activation="identity", 
                      solver="sgd",
                      alpha=0.0001,
                      batch_size=200,
                      learning_rate="constant",
                      max_iter=100,
                      learning_rate_init=0.001,
                      shuffle=True,
                      nesterovs_momentum=True,
                      momentum=0.9,
                      random_state=456)

    print('Result nnet-Training - Accuracy on training data = ' + '{0:g}%'.format(clf1.score(X_tr, Y_tr) * 100) + "\n")
    print('Result nnet-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf1.score(X_val, Y_val) * 100) + "\n")
    
    print('Result nnet-Training - Accuracy on training data = ' + '{0:g}%'.format(clf2.score(X_tr, Y_tr) * 100) + "\n")
    print('Result nnet-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf2.score(X_val, Y_val) * 100) + "\n")
    
    print('Result nnet-Training - Accuracy on training data = ' + '{0:g}%'.format(clf3.score(X_tr, Y_tr) * 100) + "\n")
    print('Result nnet-Validation - Accuracy on validation data = ' + '{0:g}%'.format(clf3.score(X_val, Y_val) * 100) + "\n")

In [None]:
validation_accuracy1 = clf1.score(X_val, Y_val)
validation_accuracy2 = clf2.score(X_val, Y_val)
validation_accuracy3 = clf3.score(X_val, Y_val)
print(validation_accuracy1, validation_accuracy2, validation_accuracy3)
