Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart Kernel) and then **run all cells** (in the menubar, select Run$\rightarrow$Run All Cells). Alternatively, you can use the **validate** button in the assignment list panel.

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE". When you insert your code, you can remove the line `raise NotImplementedError()`. Also put your name, matriculation number, and collaborators below:

In [None]:
NAME = ""
MATRICULATIONNUMBER = ""
COLLABORATORS = ""

---

<img src="images/logo_ifn.svg" alt="Drawing" style="width: 256px;" align="right"/>

# Exercise 2.1: Machine Learning Basics

Now that you have learned how to handle the programming language Python, you are ready for some real exercises in machine learning (ML). The main aim of this unit is for you to develop a rough intuition for ML model design and selection, method evaluation, and data loading. You will train several different ML models as well as your first small-scale neural network. While the remaining 5 tasks will focus solely on neural networks and deep learning, in this unit you will also learn about cases where a neural network may not be the best choice or where other methods yield good solutions with less complexity. As before, we first need to import some libraries. Concretely, we import the os, numpy, matplotlib, and sklearn libraries, all of which have documentation available online.

In [13]:
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import params
from ipylab import JupyterFrontEnd
PARAM_1, PARAM_2, PARAM_3 = params.gen_params(os.getcwd())
PARAM_4 = int(params.gen_params(os.getcwd(), mode='float', num=1)[0] *100000)
app = JupyterFrontEnd()
app.commands.execute('notebook:render-all-markdown')

<img src="images/xor-data-distribution.png" alt="Drawing" style="width: 256px;" align="right"/>

### Task 2.1-A: Classification Data Generation (5P) 

As a first step, we need to create a toy data distribution. In this exercise we want to use a simple 2D data distribution for the XOR problem, where the decision boundary cannot be linear. While this example is not the most difficult, it is easy to analyse and the results can be easily visualized. An example of the desired data distribution for this task is shown on the right-hand side of this cell. You can visualize your generated data distribution using the plot function two cells below. To obtain this data distribution, implement the following steps in the function `generate_xor_data` below:
- Set the random seed to {{PARAM_4}} to ensure reproducibility when executing the function several times
- Generate {{500 + PARAM_1 * 4}} uniform randomly distributed x and y coordinates in the range \[-{{3 + PARAM_2 % 5}} , {{3 + PARAM_2 % 5}}\]. 
- Ensure that the padding interval \[{{-(0.5 + PARAM_1 % 2 / 4)}}, {{0.5 + PARAM_1 % 2 / 4}}\] in both dimensions is free of points (as in the picture on the right) by adding {{0.5 + PARAM_1 % 2 / 4}} to all coordinates >0 and subtracting {{0.5 + PARAM_1 % 2 / 4}} from all coordinates <=0 after they have been generated.
- In practice, labels are often noisy. We want to simulate that by assigning labels -1 and 1 to the data based on modified coordinates. To obtain these modified coordinates, add to each coordinate a random number from the interval \[-{{(3 + PARAM_2 % 3)/2}} , {{(3 + PARAM_2 % 3)/2}}\]. Then assign labels using the XOR criterion based on the modified coordinates: assign the label 1 if both coordinates are positive or both coordinates are negative, otherwise assign the label -1. We treat 0 as neither positive nor negative.
- Return the non-noisy coordinates as a numpy array `data` in the shape ({{1000 + PARAM_1 * 4}}, 2) as well as the noisy labels as `labels` in the shape ({{1000 + PARAM_1 * 4}},). Your final distribution should appear similar (not exactly the same!) as in the image shown on the right.

In [None]:
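def generate_xor_data():
    # YOUR CODE HERE
    # 1. Fix the seed so that repeated calls return the same data.
    np.random.seed(13832)
    # 2. Draw the same number of x and y coordinates.
    n = 500 + PARAM_1 * 4
    x = np.random.uniform(-5, 5, n)
    y = np.random.uniform(-5, 5, n)
    # 3. Clear the padding interval: shift coordinates > 0 up and coordinates <= 0 down.
    x = np.where(x > 0, x + 0.5, x - 0.5)
    y = np.where(y > 0, y + 0.5, y - 0.5)
    # 4. Label noise: perturb copies of the coordinates and label those copies.
    noisy_x = x + np.random.uniform(-2.0, 2.0, n)
    noisy_y = y + np.random.uniform(-2.0, 2.0, n)
    # The product is > 0 only if both noisy coordinates share a sign (0 counts as neither).
    labels = np.where(noisy_x * noisy_y > 0, 1, -1)
    # 5. Return the clean coordinates together with the noisy labels.
    data = np.column_stack((x, y))
    return data, labels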

In [None]:
data, labels = generate_xor_data()
assert type(data) == np.ndarray
assert type(labels) == np.ndarray

def plot_data(x, y, label):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(x[label == 1], y[label == 1], s=40, facecolors='#00EB2C', edgecolors='white')
    ax.scatter(x[label == -1], y[label == -1], s=40, facecolors='#2679B6', edgecolors='white')
    plt.show()
    
plot_data(data[:,0], data[:,1], labels)


### Task 2.1-B: Data Splitting (3P) 

Machine learning usually involves three sets of data: the training set, on which the model is trained; a validation set, which is used to monitor the model performance during training and indicates how well the model generalizes to unseen data; and a test set, on which the final model is benchmarked. In this task, we want to split the data created in the previous task into these three distinct subsets. For this, retain the order of array dimensions and apply the following:
- Use the first 50\% of elements for the training set (don't apply shuffling) and return them as ``data_train`` and ``labels_train``
- Use the following 25\% of elements for the validation set (don't apply shuffling) and return them as ``data_val`` and ``labels_val``
- Use the last 25\% of elements for the test set (don't apply shuffling) and return them as ``data_test`` and ``labels_test``
(Hint: 25\% = 50\% \* 50\%)

In [None]:
def split_data(data, labels):
    # YOUR CODE HERE
    num_samples = len(data)
    train_end = int(0.5 * num_samples)
    val_end = int(0.75 * num_samples)
    
    data_train,labels_train = data[:train_end],labels[:train_end]
    data_val,labels_val = data[train_end:val_end],labels[train_end:val_end]
    data_test,labels_test = data[val_end:],labels[val_end:]
    
    return data_train, data_val, data_test, labels_train, labels_val, labels_test

In [None]:
data_train, data_val, data_test, labels_train, labels_val, labels_test = split_data(data, labels)
assert type(data_train) == np.ndarray
assert type(labels_train) == np.ndarray
assert type(data_val) == np.ndarray
assert type(labels_val) == np.ndarray
assert type(data_test) == np.ndarray
assert type(labels_test) == np.ndarray


### Task 2.1-C: Linear Classification Models (5P) 

Now that we have created data, let's try to fit a model to this data distribution. Below you can find the initialization for a simple linear logistic regression model from the sklearn library:
- Set the parameter ``random_state`` to {{PARAM_2}}, fit this model to the training data (using the standard settings from sklearn), and return the class predictions (maximum class probability) on the validation set as ``result``. Check out the sklearn documentation for an example of how to do this. 
- Try to visualize the results and interpret them. Discuss with fellow students in the course why the model is not able to fit the data distribution.
- Try to change the parameters ``solver`` and ``penalty`` of the ``LogisticRegression`` class and see how that affects the results on the validation set. In the end, set the parameters as solver = {{'lbfgs' if PARAM_1 % 2 == 0 else 'liblinear'}} and penalty = {{'l2' if PARAM_2 % 2 == 0 or PARAM_1 % 2 == 0 else 'l1'}}. Discuss why you are not able to improve the results.

Note: Whenever a task asks you to discuss something, you do not need to write this down explicitly anywhere in the notebook. It is merely meant for you and your fellow students to gain a deeper understanding of the topics.
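
To make the limitation of a linear model concrete, here is a minimal sketch on hypothetical, noise-free XOR data (the names `toy_x`, `toy_y`, and `toy_clf` are made up for this illustration and are not part of the assignment). A single straight decision boundary can at best separate one quadrant from the other three, so the accuracy should stay well below what a non-linear model can reach.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical noise-free XOR sample on the square [-1, 1] x [-1, 1].
rng = np.random.default_rng(0)
toy_x = rng.uniform(-1.0, 1.0, size=(400, 2))
# Label 1 if both coordinates share a sign, otherwise -1.
toy_y = np.where(toy_x[:, 0] * toy_x[:, 1] > 0, 1, -1)

# Fit a linear logistic regression with sklearn's default settings.
toy_clf = LogisticRegression().fit(toy_x, toy_y)
toy_acc = (toy_clf.predict(toy_x) == toy_y).mean()
print(f"training accuracy of the linear model: {toy_acc:.2f}")
```

Even on the training data itself, the linear model cannot fit this distribution, which suggests that the problem lies in the model class rather than in the solver or the regularization.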


In [None]:
def linear_cls(d_train, l_train, d_val):
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(random_state=67)
    # YOUR CODE HERE
    clf.fit(d_train,l_train)
    result  = clf.predict(d_val)
    return result

linear_val_result = linear_cls(data_train, labels_train, data_val)

# plotting command here
# YOUR CODE HERE
plot_data(data_val[:, 0], data_val[:, 1], linear_val_result)

In [None]:
linear_val_result = linear_cls(data_train, labels_train, data_val)
assert type(linear_val_result) == np.ndarray
assert linear_val_result.shape == labels_val.shape


### Task 2.1-D: Classification Evaluation (4P) 

The first step in debugging a machine learning model is often to check visually whether the result makes sense. This is what we did in the previous section by looking at the 2D plots. Even if we go to more complex tasks, this strategy mostly remains the same, as qualitative results often provide more insight than pure metrics. Nevertheless, to quantify the method's performance, we also need metrics. Implement the following metrics, interpreting class 1 as the positive class:
- The accuracy, indicating what fraction of all samples has been classified correctly. Discuss why this can often be a bad measure.
- The precision, indicating what fraction of the samples predicted as class 1 are correctly classified
- The recall, indicating what fraction of the ground-truth samples of class 1 have been classified correctly
- The F1 score, taking both precision and recall into account
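
The four metrics above can be sketched by hand on a small, hypothetical example (the arrays `gt_toy` and `pred_toy` are invented for illustration). This is one way to compute them from the counts of true positives, false positives, and false negatives; it assumes that at least one sample is predicted as class 1 and at least one ground-truth sample belongs to class 1, so that no division by zero occurs.

```python
import numpy as np

# Hypothetical ground truth and predictions with labels in {-1, 1}.
gt_toy = np.array([1, 1, 1, -1, -1, -1, -1, -1])
pred_toy = np.array([1, 1, -1, 1, -1, -1, -1, -1])

tp = np.sum((pred_toy == 1) & (gt_toy == 1))    # true positives
fp = np.sum((pred_toy == 1) & (gt_toy == -1))   # false positives
fn = np.sum((pred_toy == -1) & (gt_toy == 1))   # false negatives

accuracy = np.mean(pred_toy == gt_toy)          # 6 of 8 correct -> 0.75
precision = tp / (tp + fp)                      # 2 / 3
recall = tp / (tp + fn)                         # 2 / 3
f1_toy = 2 * precision * recall / (precision + recall)  # harmonic mean -> 2 / 3
print(accuracy, precision, recall, f1_toy)
```

Note that a classifier that always predicts -1 would still reach an accuracy of 5/8 on this toy example, although its recall for class 1 would be 0.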

In [None]:
def evaluate_cls(gt, pred):
    # YOUR CODE HERE
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    acc = accuracy_score(gt,pred)
    prec = precision_score(gt,pred)
    rec = recall_score(gt,pred)
    f1 = f1_score(gt,pred)
    return acc, prec, rec, f1

In [None]:
acc, prec, rec, f1 = evaluate_cls(labels_val, linear_val_result)
print(acc, prec, rec, f1)
assert type(f1) == np.float64
assert type(prec) == np.float64
assert type(rec) == np.float64
assert type(acc) == np.float64


### Task 2.1-E: Data Preprocessing (4P) 

Let's try to improve the result from above. We have two angles from which we can approach the problem: the data perspective and the model perspective. Starting with the data, we note that a linear model cannot separate data that is not linearly separable. However, the data may be preprocessed such that it becomes separable. Try to implement such a preprocessing and use it to preprocess the data for your function `linear_cls` above. You can verify that your solution is correct by plotting the result and computing the metrics (which should improve significantly). Your achieved F1-score and accuracy should both be higher than 0.75.
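
As an illustration of what "separable after preprocessing" can mean, consider the four corner points of the XOR problem. The sketch below uses the product of the two coordinates as a single new feature; this is only one possible choice and not necessarily the intended solution, and the names `corners`, `corner_labels`, and `product` are hypothetical.

```python
import numpy as np

# The four noise-free XOR corner points and their labels.
corners = np.array([[1.0, 1.0], [-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0]])
corner_labels = np.array([1, 1, -1, -1])

# The product is positive when both coordinates share a sign, negative otherwise.
product = corners[:, 0] * corners[:, 1]
print(product)

# A single threshold at 0 on this one-dimensional feature separates the classes.
print(np.sign(product).astype(int) == corner_labels)
```

On this feature, a linear model only has to learn a threshold near 0, which is something logistic regression can represent.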

In [None]:
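def preprocess(d):
    # YOUR CODE HERE
    # Scaling alone leaves the XOR classes inseparable. The product x*y is
    # positive exactly when both coordinates share a sign, so a linear
    # model can separate the classes with a threshold on this feature.
    d_prep = (d[:, 0] * d[:, 1]).reshape(-1, 1)
    return d_prep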

In [None]:
prep_val_result = linear_cls(preprocess(data_train), labels_train, preprocess(data_val))
assert type(prep_val_result) == np.ndarray
assert prep_val_result.shape == labels_val.shape


### Task 2.1-F: Non-Linear Classification Models (4P) 

Apart from preprocessing the data, we can also make use of non-linear models such as Support Vector Machines (SVMs). Fit the SVM model (initialized with `random_state` as {{PARAM_4}}) and return the model predictions on the validation set. In this case, too, the performance should improve significantly, both qualitatively and quantitatively. Your achieved F1-score and accuracy should both be higher than 0.75.
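
For intuition, the following sketch fits sklearn's `SVC` with an RBF kernel (stated explicitly here) on hypothetical, noise-free XOR data and evaluates it on held-out points. The data and the names `toy_x`, `toy_y`, and `svm_toy` are assumptions made for this illustration only; the accuracy on the noisy assignment data will differ.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical noise-free XOR sample on the square [-1, 1] x [-1, 1].
rng = np.random.default_rng(0)
toy_x = rng.uniform(-1.0, 1.0, size=(400, 2))
toy_y = np.where(toy_x[:, 0] * toy_x[:, 1] > 0, 1, -1)

# Train on the first 300 points, evaluate on the remaining 100.
svm_toy = SVC(kernel="rbf").fit(toy_x[:300], toy_y[:300])
svm_acc = (svm_toy.predict(toy_x[300:]) == toy_y[300:]).mean()
print(f"held-out accuracy of the RBF SVM: {svm_acc:.2f}")
```

The kernel lets the model form a curved decision boundary in the original 2D space, so no manual feature engineering is needed.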

In [None]:
def non_linear_cls(d_train, l_train, d_val):
    from sklearn.svm import SVC
    # YOUR CODE HERE
    svm = SVC(random_state = 13832)
    svm.fit(d_train,l_train)
    result = svm.predict(d_val)
    return result

In [None]:
nonlinear_val_result = non_linear_cls(data_train, labels_train, data_val)
assert type(nonlinear_val_result) == np.ndarray
assert nonlinear_val_result.shape == labels_val.shape


### Task 2.1-G: Unsupervised Clustering (5P)

In cases where we do not have many labels, we often still want to implement a classification. In this case, unsupervised methods are of interest. Apply the KMeans algorithm (initialized with `random_state` as {{PARAM_4}}) to the problem and return the predictions on the validation set. (Hint: how many clusters do you need for a meaningful application of the algorithm?) Discuss with your fellow students how you could use the result of the KMeans algorithm to obtain the same result as in the supervised case, and what this would require.
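
One possible way to turn cluster indices into class labels is a majority vote over a small labeled subset. The sketch below illustrates this on hypothetical, well-separated blobs (one per quadrant); the data, the choice of 4 clusters, and all names (`blob_x`, `blob_y`, `labeled_idx`, `cluster_to_label`) are assumptions for this example, not the reference solution.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: four well-separated blobs, one per quadrant.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 2.0], [-2.0, -2.0], [-2.0, 2.0], [2.0, -2.0]])
blob_x = np.concatenate([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])
blob_y = np.repeat([1, 1, -1, -1], 50)  # XOR labels of the four blobs

# Cluster without using any labels.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(blob_x)
cluster_ids = km.labels_

# Assume only every 10th point comes with a label (20 labeled points in total).
labeled_idx = np.arange(0, len(blob_x), 10)

# Majority vote: each cluster takes the most common label among its labeled members.
cluster_to_label = {}
for c in range(4):
    votes = blob_y[labeled_idx][cluster_ids[labeled_idx] == c]
    cluster_to_label[c] = 1 if votes.sum() >= 0 else -1

mapped = np.array([cluster_to_label[c] for c in cluster_ids])
mapped_acc = np.mean(mapped == blob_y)
print(f"accuracy after mapping clusters to labels: {mapped_acc:.2f}")
```

This only works when each cluster contains mostly one class and at least a few labeled samples fall into every cluster.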

In [None]:
def unsupervised_cls(d_train, d_val):
    from sklearn.cluster import KMeans
    # YOUR CODE HERE
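    # One cluster per quadrant: each XOR class occupies two opposite quadrants,
    # so two clusters cannot capture the class structure.
    num_clusters = 4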
    kmeans = KMeans(n_clusters = num_clusters,random_state = 13832,n_init=10)
    kmeans.fit(d_train)
    
    result = kmeans.predict(d_val)
    return result

In [None]:
unsupervised_result = unsupervised_cls(data_train, data_val)
assert type(unsupervised_result) == np.ndarray
assert unsupervised_result.shape == labels_val.shape


### Task 2.1-H: Linear Regression Models (10P) 

Now we want to apply our knowledge acquired above to a very simple regression problem. Below you can find the data generation code for a data distribution which can be approximated by a linear regression. Implement the following steps:
- Generate a data distribution using the given function with a seed of {{PARAM_4 * 2}}
- Use the first 50\% of data for training, the second 25\% for validation and the last 25\% for testing and return them as `d_train`, `d_val`, and `d_test` (don't use data shuffling). 
- Fit a linear regression model on the training set which regresses the y-value from the x-value. Afterwards, return the predictions on the test set (not the validation set) as `result`. The `result` array should have the shape (100,). 
- Implement an evaluation that computes the mean absolute error and the mean squared error and returns them as `mae` and `mse`. Assume that pred and gt both have the same shape.
- Implement a function that executes all of the above functions such that the predictions as well as the MAE and MSE metrics can be returned.
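
As a small worked example of the two regression metrics, the sketch below computes the mean absolute error and the mean squared error by hand with numpy. The arrays `gt_reg` and `pred_reg` are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical ground-truth values and predictions of the same shape.
gt_reg = np.array([2.0, 3.0, 5.0, 4.0])
pred_reg = np.array([2.5, 3.0, 4.0, 4.5])

err = pred_reg - gt_reg            # [0.5, 0.0, -1.0, 0.5]
mae_toy = np.mean(np.abs(err))     # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
mse_toy = np.mean(err ** 2)        # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
print(mae_toy, mse_toy)
```

The MSE weights the single large error more strongly than the MAE does, which is why the two metrics can rank models differently.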

In [None]:
def generate_data(seed = 0):
    # Use the seed argument so that the caller controls reproducibility.
    np.random.seed(seed=seed)
    x = np.random.uniform(0,10,400)
    y = 0.3*x + 2 + np.random.uniform(-0.5, 0.5, 400)
    data_reg = np.stack([x,y], axis=1)
    return data_reg

def split_data(d_reg):
    # YOUR CODE HERE
    # First 50% for training, next 25% for validation, last 25% for testing.
    train_end = int(0.5*len(d_reg))
    val_end = int(0.75*len(d_reg))
    
    d_train, d_val, d_test = d_reg[:train_end], d_reg[train_end:val_end], d_reg[val_end:]
    return d_train, d_val, d_test

def fit_and_predict(d_reg_train, d_reg_test):
    from sklearn.linear_model import LinearRegression
    # YOUR CODE HERE
    # Column 0 holds x (input), column 1 holds y (target); sklearn expects 2D inputs.
    reg = LinearRegression()
    reg.fit(d_reg_train[:,0].reshape(-1,1),d_reg_train[:,1])
    result = reg.predict(d_reg_test[:,0].reshape(-1,1))
    return result

def evaluate_reg(gt, pred):
    # YOUR CODE HERE
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    mae = mean_absolute_error(gt,pred)
    mse = mean_squared_error(gt,pred)
    return mae, mse

def regression_model():
    # don't forget to change the seed ;)
    data_reg = generate_data(seed=27664)
    # YOUR CODE HERE
    d_train, d_val, data_reg_test = split_data(data_reg)
    reg_result = fit_and_predict(d_train,data_reg_test)
    mae,mse = evaluate_reg(data_reg_test[:,1],reg_result)
    
    return data_reg_test, reg_result, mae, mse

In [None]:
data_reg_test, reg_result, mae, mse = regression_model()
assert data_reg_test.shape == (100,2)
assert type(reg_result) == np.ndarray
assert len(reg_result) == 100

assert type(mae) == np.float64
assert type(mse) == np.float64
