## CSCI 470 Activities and Case Studies

1. For all activities, you are allowed to collaborate with a partner.
2. For case studies, you should work individually and are **not** allowed to collaborate.

By filling out this notebook and submitting it, you acknowledge that you are aware of the above policies and are agreeing to comply with them.

Some considerations with regard to how these notebooks will be graded:

1. You can add more notebook cells or edit existing notebook cells other than "# YOUR CODE HERE" to test out or debug your code. We actually highly recommend you do so to gain a better understanding of what is happening. However, during grading, **these changes are ignored**. 
2. You must ensure that all your code for the particular task is available in the cells that say "# YOUR CODE HERE"
3. Every cell that says "# YOUR CODE HERE" is followed by a "raise NotImplementedError". You need to remove that line. During grading, if an error occurs then you will not receive points for your work in that section.
4. If your code passes the "assert" statements, then no output will result. If your code fails the "assert" statements, you will get an "AssertionError". Getting an assertion error means you will not receive points for that particular task.
5. If you edit the "assert" statements to make your code pass, they will still fail when they are graded since the "assert" statements will revert to the original. Make sure you don't edit the assert statements.
6. We may sometimes have "hidden" tests for grading. This means that passing the visible "assert" statements is not sufficient. The "assert" statements are there as a guide but you need to make sure you understand what you're required to do and ensure that you are doing it correctly. Passing the visible tests is necessary but not sufficient to get the grade for that cell.
7. When you are asked to define a function, make sure you **don't** use any variables outside of the parameters passed to the function. You can think of the parameters being passed to the function as a hint. Make sure you're using all of those variables.
8. Finally, **make sure you run "Kernel > Restart and Run All"** and pass all the asserts before submitting. If you don't restart the kernel, code that you ran and later deleted may still be in memory, and that may be the only reason your asserts are passing.

# Deep Learning - Autoencoders

In this exercise we'll use an autoencoder to learn a dimensionally reduced representation of the data and compare its performance with that of the original data. You'll learn how to build autoencoders and how to use the Keras functional API.

In [92]:
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.layers import Dense
from tensorflow.keras import Model, Input
import sklearn as sk
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
import numpy as np


np.random.seed(0)
tf.random.set_seed(0)

In [93]:
data = load_breast_cancer()
features = data["data"]
targets = data["target"]
X_train, X_test, y_train, y_test = train_test_split(features, targets, random_state=0)

In [94]:
# Read through the description of the data to better understand it
# What features do we have and what is the target we're trying to predict?
print(data["DESCR"])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [95]:
X_train.shape

(426, 30)

In [96]:
X_train

array([[1.185e+01, 1.746e+01, 7.554e+01, ..., 9.140e-02, 3.101e-01,
        7.007e-02],
       [1.122e+01, 1.986e+01, 7.194e+01, ..., 2.022e-02, 3.292e-01,
        6.522e-02],
       [2.013e+01, 2.825e+01, 1.312e+02, ..., 1.628e-01, 2.572e-01,
        6.637e-02],
       ...,
       [9.436e+00, 1.832e+01, 5.982e+01, ..., 5.052e-02, 2.454e-01,
        8.136e-02],
       [9.720e+00, 1.822e+01, 6.073e+01, ..., 0.000e+00, 1.909e-01,
        6.559e-02],
       [1.151e+01, 2.393e+01, 7.452e+01, ..., 9.653e-02, 2.112e-01,
        8.732e-02]])

In [97]:
y_train.shape

(426,)

In [98]:
y_train

array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,

In this exercise, instead of the Sequential model, we will use the base Model class in tf.keras. There are two ways to use tf.keras.Model; we will use the functional API as outlined in the [Model docs](https://www.tensorflow.org/api_docs/python/tf/keras/models/Model).
Note that, unlike the way we defined our model in the prior Activity, a definition built on the base Model reads like a chain of function calls, where each layer is applied to the output of the previous one, e.g.:

b = f1(a)

c = f2(b)

d = f3(c)

etc.
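The chaining pattern above can be sketched with the Keras functional API. This is a minimal illustration, not the model for this exercise: the layer sizes (4, 3, 2) and the names `a`, `b`, `c` are arbitrary assumptions chosen to mirror the pseudocode.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

# Each layer object is called like a function on the previous output.
a = Input(shape=(4,))                # symbolic input with 4 features (arbitrary size)
b = Dense(3, activation="relu")(a)   # b = f1(a)
c = Dense(2)(b)                      # c = f2(b)

# A Model is defined by naming where the chain starts and where it ends.
model = Model(inputs=a, outputs=c)

# Trainable parameters: (4*3 + 3) for the first Dense, (3*2 + 2) for the second.
n_params = model.count_params()
```

Because the intermediate tensors (`b`, `c`) are ordinary Python variables, several models can start or stop at different points of the same chain, which is what makes separate encoder and decoder models possible later on.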


In [99]:
# Determine the number of input dimensions (features) of each datapoint
# and use that value to create a tf.keras.Input object, giving it the variable
# name "inputs".
# Also, select a dimension (<=5) for the embedding output and give it the
# variable name "embedding_dim".

inputs = tf.keras.Input(shape=(X_train.shape[1],))
embedding_dim = 3

# raise NotImplementedError()

In [100]:
assert inputs.shape[1] == X_train.shape[1]
assert isinstance(embedding_dim, int)
assert embedding_dim >0
assert embedding_dim <= 5

In [101]:
# Chain our Input layer to two subsequent Dense layers, the first with 10 neurons (units),
# the second with "embedding_dim" neurons. Name the final output "encoded".
# Use ReLU as the activation function for the first Dense layer and do not set an
# activation for the final layer.

second = tf.keras.layers.Dense(10, activation="relu")(inputs)
encoded = tf.keras.layers.Dense(embedding_dim)(second)

# raise NotImplementedError()

In [102]:
testM = Model(inputs, encoded)
assert len(testM.layers) == 3
assert encoded.shape[1] == embedding_dim

In [103]:
# Chain two additional dense layers, the first with 10 neurons and the
# second (final layer) with the same number of neurons as your input (number
# of features).
# Set the output of the final layer equal to "decoded".

dsecond = tf.keras.layers.Dense(10, activation="relu")(encoded)
decoded = tf.keras.layers.Dense(X_train.shape[1])(dsecond)

# raise NotImplementedError()

In [104]:
testM = Model(inputs, decoded)
print(len(testM.layers))
assert len(testM.layers) == 5
assert decoded.shape[1] == 30

5


In [105]:
# Create the autoencoder
autoencoder = Model(inputs,decoded)

# Create the encoder which takes the same inputs but stops at the encoded layers
encoder = Model(inputs, encoded)

# Create the decoder which starts at the encoded output and uses the remaining layers
encoded_embedding = Input(shape=(embedding_dim,))

decoder_layer2 = autoencoder.layers[-2]
decoder_layer3 = autoencoder.layers[-1]

decoder_out = decoder_layer3(decoder_layer2(encoded_embedding))

decoder = Model(encoded_embedding, decoder_out)
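Since the decoder reuses the autoencoder's last two layer objects, encoding and then decoding should reproduce the autoencoder's output. The sketch below rebuilds the same architecture on its own and checks this; the random batch `x` and the variable names are assumptions made for illustration, and no training is involved.

```python
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

n_features, embedding_dim = 30, 3  # sizes used in this notebook

# Autoencoder built with the functional API, mirroring the cells above.
inputs = Input(shape=(n_features,))
hidden = Dense(10, activation="relu")(inputs)
encoded = Dense(embedding_dim)(hidden)
dec_hidden = Dense(10, activation="relu")(encoded)
decoded = Dense(n_features)(dec_hidden)

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

# Stand-alone decoder that calls the autoencoder's last two layers on a new input.
embedding_in = Input(shape=(embedding_dim,))
decoder_out = autoencoder.layers[-1](autoencoder.layers[-2](embedding_in))
decoder = Model(embedding_in, decoder_out)

# Hypothetical random batch, used only to compare the two paths.
x = np.random.default_rng(0).normal(size=(8, n_features)).astype("float32")
full = autoencoder.predict(x, verbose=0)
two_step = decoder.predict(encoder.predict(x, verbose=0), verbose=0)

# The decoder holds the very same layer object, so weights are shared.
shares_layer = decoder.layers[-1] is autoencoder.layers[-1]
outputs_match = bool(np.allclose(full, two_step, atol=1e-5))
```

Note that indexing with `layers[-2]` and `layers[-1]` relies on the decoder half being exactly the last two layers; a deeper decoder would need a different slice.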

In [106]:
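# Compile the model with the Adam optimizer and mean-squared-error loss,
# then train it to reconstruct its own input (X_train is both input and target)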
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_train, X_train, epochs=1000)

Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
...

<tensorflow.python.keras.callbacks.History at 0x7fa7945c3110>
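The "mse" loss passed to compile is the mean squared difference between each input and its reconstruction. A minimal NumPy sketch of that quantity follows; the helper name `reconstruction_mse` and the tiny arrays are assumptions for illustration. Keras averages over features and then over the batch, which gives the same number as one overall mean when every sample has the same number of features.

```python
import numpy as np

def reconstruction_mse(x, x_hat):
    """Mean squared error between inputs x and reconstructions x_hat."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))

# Two samples with two features each (made-up values).
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = np.array([[1.0, 2.5], [2.0, 4.0]])

# Squared errors are 0, 0.25, 1 and 0, so the mean is 1.25 / 4.
mse = reconstruction_mse(x, x_hat)
```

Because the error is computed on raw feature values, features with large magnitudes (such as area) contribute far more to this loss than small ones (such as smoothness).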

In [107]:
autoencoder.summary()

Model: "model_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
dense_50 (Dense)             (None, 10)                310       
_________________________________________________________________
dense_51 (Dense)             (None, 3)                 33        
_________________________________________________________________
dense_52 (Dense)             (None, 10)                40        
_________________________________________________________________
dense_53 (Dense)             (None, 30)                330       
Total params: 713
Trainable params: 713
Non-trainable params: 0
_________________________________________________________________
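The parameter counts in the summary can be checked by hand: a Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases. The short sketch below reproduces the numbers for the 30-10-3-10-30 architecture; the helper name `dense_params` is an assumption.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# Layer widths from input to output: 30 features, 10 units, 3-dim embedding, 10 units, 30 outputs.
layer_sizes = [30, 10, 3, 10, 30]

# Pair each width with the next one to get (n_in, n_out) for every Dense layer.
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)
```

The four values line up with the Param # column for dense_50 through dense_53, and their sum matches the reported total.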


In [108]:
# Calculate the embedding using the encoder model
X_train_embed = encoder.predict(X_train)
X_test_embed = encoder.predict(X_test)

In [109]:
# Now, train two LinearSVC models.
# Fit the first model on the original data and name the model "base_model".
# Fit the second model on the autoencoder-embedded data (X_train_embed) and
# name the model "embed_model".

base_model = LinearSVC().fit(X_train, y_train)
embed_model = LinearSVC().fit(X_train_embed, y_train)


# raise NotImplementedError()



In [110]:
assert base_model
assert isinstance(base_model, LinearSVC)
assert base_model.coef_.shape[1] == 30
assert embed_model
assert isinstance(embed_model, LinearSVC)
assert embed_model.coef_.shape[1] == embedding_dim

In [111]:
print(f"The base SVM classifier scores {base_model.score(X_test, y_test)}.")
print(f"The autoencoder embedding SVM classifier scores {embed_model.score(X_test_embed, y_test)}.")

The base SVM classifier scores 0.9440559440559441.
The autoencoder embedding SVM classifier scores 0.9090909090909091.


Was the test set score of the embedded data model better or worse than that of the original data model?

Ask yourself why it might be better or worse.

 - What happens when you change the activation function(s) in the autoencoder?
 - What happens when you change the embedding_dim to be larger or smaller?
 - Is it sufficient to just use LinearSVC with the default parameters to make any of these conclusions?
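One way to probe the last question is to compare classifiers with cross-validation rather than a single train/test split, and to check whether feature scaling changes the picture. The sketch below is an assumption about how such a check could look, not part of the graded exercise; `StandardScaler`, the 5-fold setting, and the raised `max_iter` are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# LinearSVC on the raw features (may emit convergence warnings on unscaled data).
raw = LinearSVC(max_iter=10000, random_state=0)

# Same classifier, but each feature is standardized inside the pipeline,
# so the scaler is fit only on the training folds.
scaled = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000, random_state=0))

raw_scores = cross_val_score(raw, X, y, cv=5)
scaled_scores = cross_val_score(scaled, X, y, cv=5)

print(f"raw:    mean accuracy {raw_scores.mean():.3f}")
print(f"scaled: mean accuracy {scaled_scores.mean():.3f}")
```

The same comparison could be repeated on the embedded data, and with different values of the regularization parameter `C`, before drawing conclusions about the autoencoder.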

## Feedback

In [112]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    return "not too easy not too difficult"
    # raise NotImplementedError()