In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import datasets

import tensorflow as tf
from tensorflow.keras import Input, Model, layers, losses, optimizers

from matplotlib import pyplot as plt

**Differential Privacy**

The goal of differential privacy is to bound the effect that a single training sample can have on the output of a (randomized) algorithm. To explore this effect, perform the following tasks:
1) Load the Boston housing dataset, one-hot encode the categorical features, and normalize the continuous features (use StandardScaler).

2) Split the data into train/test (80% / 20%) sets (give a seed for reproducibility, i.e. random_state=42).

3) Train a LinearRegression model (set fit_intercept=False) on the training set and compute r2_score and mean_squared_error on the test set.

4) Find the training sample with the largest prediction error. Create a mask to exclude that **sample** from the training set.

5) Fit the same LinearRegression model to the training set excluding the found **sample**. Measure the deviation between the two regression models (e.g. np.linalg.norm(w-w'), where the weights w and w' are the LinearRegression.coef_ of the two fitted models).

6) Add noise of varying scales to the training set, create a second training set by excluding the **sample** and train linear regression models on each. How does the utility (r2_score, mse) behave with increasing noise level? What happens to the difference between the fitted models' weights? What about the prediction error on the **sample**? Plot the r2_score, mse, prediction error on **sample** and weight difference over the noise scale.
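
The mechanics of steps 3) to 6) can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the reference solution: the data are random stand-ins for the preprocessed housing features, `np.linalg.lstsq` stands in for `LinearRegression(fit_intercept=False)`, a planted outlier plays the role of the **sample**, and Gaussian noise is added to both features and targets (the task does not say which). The helper `fit` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the preprocessed data
# (assumption: 200 train / 50 test samples, 5 features).
n_train, n_test, d = 200, 50, 5
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.1 * rng.normal(size=n_test)
y_train[0] += 5.0  # planted outlier so that one sample clearly has the largest error


def fit(X, y):
    """Least squares without intercept (stand-in for LinearRegression(fit_intercept=False))."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


# Steps 3-4: fit on the full training set and mask out the worst-fit sample.
w_full = fit(X_train, y_train)
worst = int(np.argmax(np.abs(X_train @ w_full - y_train)))
mask = np.ones(n_train, dtype=bool)
mask[worst] = False

# Step 5: refit on the neighboring dataset and measure the weight deviation.
w_loo = fit(X_train[mask], y_train[mask])
base_diff = np.linalg.norm(w_full - w_loo)

# Step 6: repeat with noisy copies of the training set, averaging over repeats.
sigmas = np.logspace(-3, 2, 6)
n_repeats = 5
weight_diff, test_mse = [], []
for sigma in sigmas:
    diffs, mses = [], []
    for _ in range(n_repeats):
        Xn = X_train + sigma * rng.normal(size=X_train.shape)
        yn = y_train + sigma * rng.normal(size=y_train.shape)
        w_a = fit(Xn, yn)              # model on the noisy full set
        w_b = fit(Xn[mask], yn[mask])  # model on the noisy neighboring set
        diffs.append(np.linalg.norm(w_a - w_b))
        mses.append(np.mean((X_test @ w_a - y_test) ** 2))
    weight_diff.append(np.mean(diffs))
    test_mse.append(np.mean(mses))
weight_diff = np.array(weight_diff)
test_mse = np.array(test_mse)
```

The arrays `weight_diff` and `test_mse` can then be plotted against `sigmas` on a logarithmic x-axis. The r2_score and the prediction error on the **sample** can be collected inside the same loop.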




In [2]:
# load dataset
# note: load_boston was removed in scikit-learn 1.2, so this cell requires scikit-learn < 1.2
boston_dataset = datasets.load_boston()
X = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
Y = pd.DataFrame(boston_dataset.target, columns=['price'])

# normalize continuous features
cont_features = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
feature_scaler = StandardScaler()
X_pp = X.copy()
X_pp[cont_features] = feature_scaler.fit_transform(X[cont_features])
X_pp.head()

# normalize targets
Y_pp = Y.copy()
target_scaler = StandardScaler()
Y_pp[['price']] = target_scaler.fit_transform(Y[['price']])

# one-hot-encode the 'RAD' feature
X_pp[['CHAS', 'RAD']] = X_pp[['CHAS', 'RAD']].astype('int32')
X_pp = pd.get_dummies(X_pp, columns=['RAD'])
X_pp.head()

# split data and target DataFrames into data train, data test, target train and target test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X_pp.to_numpy(), Y_pp.to_numpy(), test_size=0.20, random_state=42)

In [3]:
# train regression model and find sample with largest prediction error
LR = LinearRegression(fit_intercept=False)

# create neighboring dataset, i.e. training set without the sample with largest prediction error

# train on that dataset the same model, measure the difference between both models' coefficients

In [4]:
# train on varying noise scales; repeat the computation several times for each sigma and average the results
sigmas = np.logspace(-3, 2, 24)  # noise scales




In [5]:
# plot results


**Autoencoder**

An autoencoder is a model that maps samples into a so-called latent space and then back to the original space. It consists of an encoder model X->Z and a decoder model Z->X. 
It is used for dimensionality reduction, representation learning, and as a generative model. The training objective of an autoencoder is usually a combination of a reconstruction error and some regularization (on its weights and/or the latent representation).
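
One common way to write such an objective is shown below. The symbols are assumptions introduced here for illustration: $f$ is the encoder, $g$ the decoder, $R$ a generic regularizer, and $\lambda \ge 0$ its weight; the squared error is one possible choice of reconstruction error.

$$
\mathcal{L}(f, g) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - g(f(x_i)) \rVert_2^2 + \lambda \, R(f, g)
$$

Setting $\lambda = 0$ leaves the plain reconstruction loss.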

1) Load the MNIST dataset and split it into 10,000 training samples and 1,000 test samples. Take 1,000 of the training samples for evaluation.

2) Build a linear autoencoder (encoder and decoder are each linear models). Train your autoencoder with a 2D latent space Z.

3) Visualize the results as follows, for 1000 training and 1000 test samples:

    a) Create a scatterplot of the latent embeddings
    
    b) Plot the reconstructions of 100 samples in a 10x10 grid
    
    c) Plot the corresponding original samples below
    
4) Build a non-linear autoencoder. The encoder has 2 conv layers (32/64 filters of size (3,3), strides 2), followed by a dense layer that maps to 2-dimensional latent variables. The decoder consists of a dense layer, followed by 2 conv-transpose layers (64/32, ... matching the encoder). Train the non-linear autoencoder and visualize the results as above.

5) Create synthetic samples by randomly sampling 100 points in latent space and decoding them using the linear and non-linear decoders.
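
As a point of comparison for the linear case, the encode/decode/sample pipeline can be sketched in closed form. This is an illustration under stated assumptions, not the Keras model the tasks ask for: at its optimum, a linear autoencoder trained with squared error spans the same subspace as PCA, so a truncated SVD stands in for gradient training here. The data are random stand-ins for flattened images, and the helpers `encode` and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened images
# (assumption: 500 samples, 64 features with values in [0, 1]).
X = rng.random((500, 64))
mean = X.mean(axis=0)

# Truncated SVD of the centered data yields the top-2 principal directions.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V = Vt[:2].T  # shape (64, 2), orthonormal columns


def encode(x):
    """Linear encoder X -> Z: project centered data onto the 2D subspace."""
    return (x - mean) @ V


def decode(z):
    """Linear decoder Z -> X: map latent points back to the original space."""
    return z @ V.T + mean


Z = encode(X)
X_rec = decode(Z)
rec_mse = np.mean((X - X_rec) ** 2)

# Generative use: sample 100 latent points from a Gaussian matched to the
# mean and spread of the embeddings (one possible choice), then decode them.
z_new = rng.normal(loc=Z.mean(axis=0), scale=Z.std(axis=0), size=(100, 2))
X_new = decode(z_new)
```

With image data, `X_new.reshape(-1, 28, 28)` could be passed to a grid helper for plotting. Matching the sampling distribution to the observed embeddings matters because points far from the embedded data tend to decode to implausible samples.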


In [6]:
# load data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

X = x_train.astype('float32').reshape((-1, 28,28,1))/255.
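# 10k training samples; the remainder is kept as a pool for the test split
Xtrain, Xtest_, Ytrain, Ytest_ = train_test_split(X, y_train, train_size=10000)
# 1000 test samples drawn from the remaining pool
Xtest, Xtest_, Ytest, Ytest_ = train_test_split(Xtest_, Ytest_, train_size=1000)

# 1000 of the training samples, used for evaluation
Xtrain_, _, Ytrain_, _ = train_test_split(Xtrain, Ytrain, train_size=1000)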

In [7]:
# helper function to plot several samples in one image
def make_grid(X, grid_size=(10, 10)):
    sh = [X.shape[1], X.shape[2]]
    G = np.zeros((sh[0]*grid_size[0], sh[1]*grid_size[1]))
    for i in range(grid_size[0]):
        for j in range(grid_size[1]):
            G[i*sh[0]:(i+1)*sh[0], j*sh[1]:(j+1)*sh[1]] = X[i*grid_size[1]+j].reshape(sh)
    return G

In [8]:
# define autoencoder model
class AutoEncoder(Model):
    
    def __init__(self, encoder, decoder, **kwargs):
        super(AutoEncoder, self).__init__(**kwargs)
        self.encoder = encoder
        self.decoder = decoder
        
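    def call(self, X, training=None):
        # forward the training flag so that layers such as dropout behave correctly
        z = self.encoder(X, training=training)
        x = self.decoder(z, training=training)
        return x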

In [9]:
# linear models    


In [10]:
# train linear autoencoder

In [11]:
# visualize latent space, reconstructions and original samples for train and test data


In [12]:
# nonlinear models


In [13]:
# train non-linear autoencoder

In [14]:
# Visualization as before for non-linear model


In [15]:
# Try out linear decoder as generative model


In [16]:
# Non-linear decoder as generative model
