## Assignment 4 - Question Duplicates

Below for reference:

https://github.com/latentghost/NLP-Specialization-deeplearning.ai/blob/818304cecbe2d64252c365eb2ad9f350676ffd1b/NLP%20with%20Sequence%20Models/Week3/Assignment/.ipynb_checkpoints/C3W3_Assignment-checkpoint.ipynb#L989

<a name='0'></a>
## Overview
In this assignment, concretely you will: 

- Learn about Siamese networks
- Understand how the triplet loss works
- Understand how to evaluate accuracy
- Use cosine similarity between the model's outputted vectors
- Use the data generator to get batches of questions
- Predict using your own model

By now, you are familiar with trax and know how to make use of classes to define your model. We will start this homework by asking you to preprocess the data the same way you did in the previous assignments. After processing the data you will build a classifier that will allow you to identify whether two questions are the same or not. 
<img src = "images/meme.png" style="width:550px;height:300px;"/>


You will process the data first and then pad in a similar way you have done in the previous assignment. Your model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two sub networks using cosine similarity. Before taking a deep dive into the model, start by importing the data set.

### 1. Importing the data

In [121]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # silence TensorFlow info and warning logs

import numpy as np
import pandas as pd
import random as rnd
import tensorflow as tf

# Set random seeds
rnd.seed(34)

In [122]:
import w3_unittest

In [123]:
data = pd.read_csv('data/questions.csv')
print(f"number of questions pairs: {len(data)}")
data.head()

number of questions pairs: 404351


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [124]:
N_TRAIN = 300000
N_TEST=10*1024
data_train = data[:N_TRAIN]
data_test = data[N_TRAIN:N_TRAIN + N_TEST]

print(f"length of training set: {len(data_train)}, length of testing set: {len(data_test)}")

length of training set: 300000, length of testing set: 10240


In [125]:
is_duplicate_index = data_train[data_train['is_duplicate'] == True].index.to_list()
is_duplicate_index

[5, 7, 11, 12, 13, 15, 16, 18, 20, 29, 31, 32, 38, 48, 49, 50, 51, 53, 58, 62,
 65, 66, 67, 71, 72, 73, 74, 79, 84, 85, 86, 88, 92, 93, 95, 100, 104, 107, 113, 120,
 122, 125, 127, 135, 136, 143, 144, 152, 156, 158, 159, 160, 163, 165, 168, 173, 175, 176, 178, 179,
 180, 182, 185, 188, 189, 190, 191, 193, 194, 197, 198, 199, 200, 203, 209, 210, 215, 216, 219, 220,
 221, 224, 226, 229, 235, 236, 238, 242, 243, 244, 246, 249, 250, 251, 253, 255, 260, 261, 262, 267,
 269, 270, 273, 274, 275, 281, 284, 285, 286, 287, 288, 291, 293, 295, 296, 299, 304, 307, 308, 309,
 312, 317, 318, 321, 322, 323, 326, 329, 331, 339, 341, 346, 347, 348, 349, 350, 353, 364, 365, 368,
 373, 377, 380, 383, 390, 393, 394, 395, 397, 399, 400, 402, 403, 404, 405, 409, 410, 412, 415, 421,
 422, 428, 430, 431, 432, 439, 442, 443, 445, 446, 450, 451, 457, ...

In [126]:
print(f"number of duplicate pairs in the training set: {len(is_duplicate_index)}, number of non-duplicate pairs in the training set: {len(data_train) - len(is_duplicate_index)}")

number of duplicate pairs in the training set: 111486, number of non-duplicate pairs in the training set: 188514


In [127]:
data.loc[is_duplicate_index[:5]]

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
11,11,23,24,How do I read and find my YouTube comments?,How can I see all my Youtube comments?,1
12,12,25,26,What can make Physics easy to learn?,How can you make physics easy to learn?,1
13,13,27,28,What was your first sexual experience like?,What was your first sexual experience?,1


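Split the training and test sets into their question 1 and question 2 arrays. Only the duplicate pairs are kept for training; the test set keeps every pair along with its label.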

In [128]:
q1_train_words = data_train.loc[is_duplicate_index,'question1'].to_numpy()
q2_train_words = data_train.loc[is_duplicate_index,'question2'].to_numpy()

q1_test_words = data_test['question1'].to_numpy()
q2_test_words = data_test['question2'].to_numpy()
y_test = data_test['is_duplicate'].to_numpy()

In [None]:
print('TRAINING QUESTIONS:\n')
print('Question 1: ', q1_train_words[0])
print('Question 2: ', q2_train_words[0], '\n')
print('Question 1: ', q1_train_words[5])
print('Question 2: ', q2_train_words[5], '\n')

print('TESTING QUESTIONS:\n')
print('Question 1: ', q1_test_words[0])
print('Question 2: ', q2_test_words[0], '\n')
print('is_duplicate =', y_test[0], '\n')

Q1 and Q2 training breakdown

In [129]:
q1_train_words[:10], q2_train_words[:10]

(array(['Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?',
        'How can I be a good geologist?',
        'How do I read and find my YouTube comments?',
        'What can make Physics easy to learn?',
        'What was your first sexual experience like?',
        'What would a Trump presidency mean for current international master’s students on an F1 visa?',
        'What does manipulation mean?',
        'Why are so many Quora users posting questions that are readily answered on Google?',
        'Why do rockets look white?',
        'How should I prepare for CA final law?'], dtype=object),
 array(["I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?",
        'What should I do to be a great geologist?',
        'How can I see all my Youtube comments?',
        'How can you make physics easy to learn?',
        'What was your first sexual experience?',
        'How will a Trump presidency affect the stude

In [130]:
print(f"number of q1 train words: {len(q1_train_words)}, and number of q2 train words {len(q2_train_words)}")

number of q1 train words: 111486, and number of q2 train words 111486


Q1 and Q2 testing breakdown.

In [131]:
print(f"number of q1 test words: {len(q1_test_words)}, number of q2 test words: {len(q2_test_words)}, number of y test labels: {len(y_test)}")

number of q1 test words: 10240, number of q2 test words: 10240, number of y test labels: 10240


Split into training / validation sets.

In [132]:
cut_off = int(len(q1_train_words) * 0.8)
train_q1, train_q2 = q1_train_words[:cut_off], q2_train_words[:cut_off]
val_q1, val_q2 = q1_train_words[cut_off:], q2_train_words[cut_off:]
print(f"Number of duplicate questions:{len(q1_train_words)}")
print(f"Length of the training set is: {len(train_q1)}")
print(f"Length of the validation set is: {len(val_q1)}")

Number of duplicate questions:111486
Length of the training set is: 89188
Length of the validation set is: 22298


<a name='1.2'></a>
### 1.2 Learning question encoding

The next step is to encode each question as a list of integers, one index per word.

You will start by building a word dictionary, or vocabulary, containing all the words in your training dataset. Each word of the selected duplicate pairs is then replaced by its index in that vocabulary.

For this task you will be using the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer from Keras, which handles both steps. Begin by setting a seed, so we all get the same encoding.

In [133]:
tf.random.set_seed(0)
text_vectorization = tf.keras.layers.TextVectorization(output_mode="int", split="whitespace", standardize="strip_punctuation")
text_vectorization.adapt(np.concatenate((q1_train_words, q2_train_words)))

In [134]:
print(f'Vocabulary size: {text_vectorization.vocabulary_size()}')

Vocabulary size: 36224


In [135]:
print('first question in the train set:\n')
print(q1_train_words[0], '\n') 
print('encoded version:')
print(text_vectorization(q1_train_words[0]),'\n')

print('first question in the test set:\n')
print(q1_test_words[0], '\n')
print('encoded version:')
print(text_vectorization(q1_test_words[0]) )

first question in the train set:

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me? 

encoded version:
tf.Tensor(
[ 6984     6   178    10  8988  2442 35393   761    13  6636 28205    31
    28   483    45    98], shape=(16,), dtype=int64) 

first question in the test set:

How do I prepare for interviews for cse? 

encoded version:
tf.Tensor([    4     8     6   160    17  2079    17 11775], shape=(8,), dtype=int64)
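
To make the encoding step more concrete, here is a minimal plain-Python sketch of what a vocabulary-based encoder does. It is not the Keras implementation: the helper names `tokenize`, `build_vocab` and `encode` are hypothetical, the punctuation handling only approximates `standardize="strip_punctuation"`, and the order of equally frequent words may differ from `TextVectorization`. It does follow the same convention of reserving index 0 for padding and index 1 for out-of-vocabulary words.

```python
import string
from collections import Counter


def tokenize(text):
    # Rough stand-in for standardize="strip_punctuation" and split="whitespace":
    # delete ASCII punctuation (case is kept), then split on whitespace.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


def build_vocab(texts):
    # More frequent tokens get smaller indices.
    counts = Counter(token for text in texts for token in tokenize(text))
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    tokens = ["", "[UNK]"] + [token for token, _ in counts.most_common()]
    return {token: index for index, token in enumerate(tokens)}


def encode(text, vocab):
    # Tokens that are not in the vocabulary map to index 1.
    return [vocab.get(token, 1) for token in tokenize(text)]


questions = [
    "How can I be a good geologist?",
    "What should I do to be a great geologist?",
]
vocab = build_vocab(questions)
encoded_known = encode("How can I be a good geologist?", vocab)
encoded_unknown = encode("How can I be a great chemist?", vocab)
print(len(vocab), encoded_known, encoded_unknown)
```

Note that removing punctuation without inserting a space turns `rising...what` into the single token `risingwhat`, which is consistent with the 16 indices in the encoded training question above.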


<a name='2'></a>
## 2 - Defining the Siamese Model

<a name='2-1'></a>
### 2.1 - Understanding Siamese Network 
A Siamese network is a neural network which uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. The Siamese network you are about to implement looks like this:

<img src = "images/siamese.png" style="width:600px;height:300px;"/>

You get the question embedding, run it through an LSTM layer, and normalize the outputs $v_1$ and $v_2$. The model is trained with a triplet loss (explained below), and the cosine similarity of $v_1$ and $v_2$ then measures how similar a pair of questions is. The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. In math, you are trying to minimize the following.

$$\mathcal{L}(A, P, N)=\max \left(\|\mathrm{f}(A)-\mathrm{f}(P)\|^{2}-\|\mathrm{f}(A)-\mathrm{f}(N)\|^{2}+\alpha, 0\right)$$

$A$ is the anchor input, for example $q1_1$, $P$ the duplicate input, for example, $q2_1$, and $N$ the negative input (the non duplicate question), for example $q2_2$.<br>
$\alpha$ is a margin; you can think about it as a safety net, or by how much you want to push the duplicates from the non duplicates. 
<br>
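
As a small illustration of the formula above, the following plain-Python sketch evaluates the loss for one hand-picked triplet. The function names and the toy 2-dimensional vectors are assumptions made for this example only; the margin of 0.25 is the default used later in this notebook.

```python
def squared_distance(u, v):
    # Squared Euclidean distance between two vectors given as lists.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.25):
    # max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0)
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + alpha, 0.0)


anchor = [1.0, 0.0]
positive = [0.6, 0.8]

# A negative that sits closer to the anchor than the positive does: the loss is positive.
hard_loss = triplet_loss(anchor, positive, [0.8, 0.6])
# A negative that is already far away from the anchor: the loss is clipped to zero.
easy_loss = triplet_loss(anchor, positive, [-1.0, 0.0])
print(hard_loss, easy_loss)
```

Only the first triplet contributes to the loss, which is why triplets with negatives that are close to the anchor are the informative ones during training.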

<a name='ex-2'></a>
### Exercise 01 - Siamese

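**Instructions:** Implement the `Siamese` function below, using all the building blocks described in this section.

The list below describes the model in terms of `trax` layers. The implementation in the next cell uses the corresponding `tf.keras` layers instead: `Sequential` for `tl.Serial`, `Embedding` for `tl.Embedding`, `LSTM` for `tl.LSTM`, `GlobalAveragePooling1D` for `tl.Mean`, a `Lambda` layer for `tl.Fn`, and two calls to the shared branch followed by `Concatenate` in place of `tl.Parallel`.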


- `tl.Serial`: Combinator that applies layers serially (by function composition), which lets you set up the overall structure of the feedforward network. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/combinators.py#L26)
    - You can pass in the layers as arguments to `Serial`, separated by commas. 
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))` 


-  `tl.Embedding`: Maps discrete tokens to vectors. It will have shape (vocabulary length X dimension of output vectors). The dimension of output vectors (also called d_feature) is the number of elements in the word embedding. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113)
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).


-  `tl.LSTM`: The LSTM layer. It leverages another Trax layer called [`LSTMCell`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTMCell). The number of units should be specified and should match the number of elements in the word embedding. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTM) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/rnn.py#L87)
    - `tl.LSTM(n_units)` builds an LSTM layer of `n_units`.
    
    
- `tl.Mean`: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Mean) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L276)
    - `tl.Mean(axis=1)` mean over columns.


- `tl.Fn`: Layer with no weights that applies the function `f`, which should be specified using a lambda syntax. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.base.Fn) / [source code](https://github.com/google/trax/blob/70f5364dcaf6ec11aabbd918e5f5e4b0f5bfb995/trax/layers/base.py#L576)
    - Here the function normalizes its input $x$, so that the dot product of two outputs equals their cosine similarity.
    - `tl.Fn('Normalize', lambda x: normalize(x))` returns a layer with no weights that applies `normalize` to its input.
    
    
- `tl.Parallel`: A combinator layer (like `Serial`) that applies a list of layers in parallel to its inputs. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Parallel) / [source code](https://github.com/google/trax/blob/37aba571a89a8ad86be76a569d0ec4a46bdd8642/trax/layers/combinators.py#L152)


In [136]:
# UNQ_C2 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: Siamese
def Siamese(text_vectorizer, vocab_size=35224, d_feature=128):
    """Returns a Siamese model.

    Args:
        text_vectorizer (TextVectorization): TextVectorization instance, already adapted to your training data.
        vocab_size (int, optional): Length of the vocabulary. Defaults to 35224.
        d_feature (int, optional): Depth of the model. Defaults to 128.
        
    Returns:
        tf.keras.Model: A Siamese model. 
    
    """
    LSTM = tf.keras.models.Sequential(name="sequential")
    LSTM.add(text_vectorizer)
    LSTM.add(tf.keras.layers.Embedding(name="embedding", input_dim=vocab_size, output_dim=d_feature))
    LSTM.add(tf.keras.layers.LSTM(name="LSTM", units=d_feature, return_sequences=True))
    LSTM.add(tf.keras.layers.GlobalAveragePooling1D(name="mean"))
    LSTM.add(tf.keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1), name='out'))
    
    input1 = tf.keras.layers.Input((1,), name="input_1", dtype=tf.string)
    input2 = tf.keras.layers.Input((1,), name="input_2", dtype=tf.string)
    
    LSTM1 = LSTM(input1)
    LSTM2 = LSTM(input2)
    
    conc = tf.keras.layers.Concatenate(axis=1, name="conc_1_2")([LSTM1, LSTM2])
    return tf.keras.models.Model(inputs=(input1, input2), outputs=conc, name="SiameseModel")

In [137]:
# check your model
model = Siamese(text_vectorization, vocab_size=text_vectorization.vocabulary_size())
model.build(input_shape=None)
model.summary()
model.get_layer(name='sequential').summary()

Model: "SiameseModel"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 1)]                  0         []                            
                                                                                                  
 sequential (Sequential)     (None, 128)                  4768256   ['input_1[0][0]',             
                                                                     'input_2[0][0]']             
                                                                                                  
 conc_1_2 (Concatenate)      (None, 256)                  0         ['sequential[0][0]'

In [138]:
tf.keras.utils.plot_model(
    model,
    to_file="model.png",
    show_shapes=True,
    show_dtype=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=True)

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


In [139]:
# Test your function!
w3_unittest.test_Siamese(Siamese)

All tests passed!


**Expected output:**  

```CPP
Parallel_in2_out2[
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_41699_128
    LSTM_128
    Mean
    Normalize
  ]
]
```

<a name='2.2'></a>

### 2.2 Hard Negative Mining


You will now implement the `TripletLoss` with hard negative mining.<br>
As explained in the lecture, you will be using all the questions from each batch to compute this loss. Positive examples are questions $q1_i$, and $q2_i$, while all the other combinations $q1_i$, $q2_j$ ($i\neq j$), are considered negative examples. The loss will be composed of two terms. One term utilizes the mean of all the non duplicates, the second utilizes the *closest negative*. Our loss expression is then:
 
\begin{align}
 \mathcal{L}_1(A,P,N) &= \max \left( -\cos(A,P) + \mathrm{mean}_{neg} + \alpha, 0\right) \\
 \mathcal{L}_2(A,P,N) &= \max \left( -\cos(A,P) + \mathrm{closest}_{neg} + \alpha, 0\right) \\
 \mathcal{L}(A,P,N) &= \sum_{\mathrm{batch}} \left( \mathcal{L}_1 + \mathcal{L}_2 \right) \\
\end{align}


Further, two sets of instructions are provided. The first set, found just below, provides a brief description of the task. If that set proves insufficient, a more detailed set can be displayed.  

<a name='ex03'></a>
### Exercise 02

**Instructions (Brief):** Here is a list of things you should do: <br>

- As this will be run inside TensorFlow, use the operations supplied by `tf.math` or `tf.linalg` instead of `numpy` functions. You will also need to explicitly use `tf.shape` to get the batch size from the inputs. This makes the function compatible with the Tensor inputs it will receive during actual training and testing. 
- Use [`tf.linalg.matmul`](https://www.tensorflow.org/api_docs/python/tf/linalg/matmul) to calculate the similarity matrix $v_2v_1^T$ of dimension `batch_size` x `batch_size`. 
- Take the score of the duplicates on the diagonal with [`tf.linalg.diag_part`](https://www.tensorflow.org/api_docs/python/tf/linalg/diag_part). 
- Use the `TensorFlow` functions [`tf.eye`](https://www.tensorflow.org/api_docs/python/tf/eye) and [`tf.math.reduce_max`](https://www.tensorflow.org/api_docs/python/tf/math/reduce_max) for the identity matrix and the maximum respectively. 

In [172]:
# UNQ_C3 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)
# GRADED FUNCTION: TripletLossFn
def TripletLossFn(v1, v2, margin=0.25):
    """Custom Loss function.

    Args:
        v1 (numpy.ndarray or tf.Tensor): Array with dimension (batch_size, model_dimension) associated to Q1.
        v2 (numpy.ndarray or tf.Tensor): Array with dimension (batch_size, model_dimension) associated to Q2.
        margin (float, optional): Desired margin. Defaults to 0.25.

    Returns:
        tf.Tensor: Triplet Loss.
    """
    ### START CODE HERE ###

    # similarity matrix: scores[i, j] is the dot product of v2[i] and v1[j]
    # (`transpose_b=True` transposes the second argument)
    scores = tf.linalg.matmul(v2, v1, transpose_b=True)
    # batch size, cast to the dtype of `scores` so it can be used in arithmetic
    batch_size = tf.cast(tf.shape(scores)[0], scores.dtype)
    # the diagonal holds the similarities of the positive (duplicate) pairs
    sim_ap = tf.linalg.diag_part(scores)
    # zero out the diagonal so that only the negative pairs remain
    sim_an = scores - tf.linalg.diag(sim_ap)
    # mean similarity of the negatives in each row
    mean_negative = tf.reduce_sum(sim_an, axis=1) / (batch_size - 1)
    # create a composition of two masks:
    # the first mask selects the diagonal elements,
    # the second mask selects the negatives that score higher than the positive of their row
    mask1 = (tf.eye(batch_size) == 1)
    mask2 = (sim_an > tf.expand_dims(sim_ap, 1))
    mask_exclude_positives = tf.cast((mask1 | mask2), scores.dtype)
    # subtract 2.0 from the masked entries so that they cannot win the row-wise max
    negative_without_positive = sim_an - (2.0 * mask_exclude_positives)
    # closest negative: row-by-row max of the remaining entries
    closest_negative = tf.math.reduce_max(negative_without_positive, axis=1)
    # first loss term: mean of the negatives minus the positive, plus the margin, clipped at 0.0
    triplet_loss1 = tf.maximum(mean_negative - sim_ap + margin, 0.0)
    # second loss term: closest negative minus the positive, plus the margin, clipped at 0.0
    triplet_loss2 = tf.maximum(closest_negative - sim_ap + margin, 0.0)
    # add the two terms together and sum over the batch
    triplet_loss = tf.math.reduce_sum(triplet_loss1 + triplet_loss2)
    ### END CODE HERE ###

    return triplet_loss

In [173]:
v1 = np.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887]])
v2 = np.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887]])
print("Triplet Loss:", TripletLossFn(v1,v2).numpy())

Triplet Loss: 0.7035076825158911
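
The value above can be checked by hand. The following sketch recomputes the loss for the same two-pair batch in plain Python, as a cross-check rather than a replacement for the TensorFlow function. The name `triplet_loss_hard_negative` is hypothetical, and the list of candidates mirrors the masking in `TripletLossFn`: the diagonal entry and every negative that scores above the positive are pushed down by 2.0 before the row-wise maximum is taken.

```python
def triplet_loss_hard_negative(v1, v2, margin=0.25):
    n = len(v1)
    # scores[i][j] is the dot product of v2[i] and v1[j]
    scores = [[sum(a * b for a, b in zip(v2[i], v1[j])) for j in range(n)]
              for i in range(n)]
    total = 0.0
    for i in range(n):
        positive = scores[i][i]
        negatives = [scores[i][j] for j in range(n) if j != i]
        mean_negative = sum(negatives) / (n - 1)
        # The zeroed diagonal entry becomes 0 - 2.0, and negatives scoring
        # above the positive are lowered by 2.0 so they cannot be selected.
        candidates = [-2.0] + [s - 2.0 if s > positive else s for s in negatives]
        closest_negative = max(candidates)
        loss1 = max(mean_negative - positive + margin, 0.0)
        loss2 = max(closest_negative - positive + margin, 0.0)
        total += loss1 + loss2
    return total


v1 = [[0.26726124, 0.53452248, 0.80178373], [0.5178918, 0.57543534, 0.63297887]]
v2 = [[0.26726124, 0.53452248, 0.80178373], [-0.5178918, -0.57543534, -0.63297887]]
loss = triplet_loss_hard_negative(v1, v2)
print(round(loss, 4))  # 0.7035, in line with the TensorFlow result above
```

Row by row: the first pair has a positive score of about 1.0 and a single negative of about 0.9535, so both terms equal roughly 0.2035. The second pair has a positive score of about -1.0, its only negative is masked out, so only the mean-negative term contributes, roughly 0.2965.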


In [174]:
def TripletLoss(labels, out, margin=0.25):
    _, embedding_size = out.shape # get embedding size
    v1 = out[:,:int(embedding_size/2)] # Extract v1 from out
    v2 = out[:,int(embedding_size/2):] # Extract v2 from out
    return TripletLossFn(v1, v2, margin=margin)

**Expected Output:**
```CPP
Triplet Loss: ~ 0.70
```   

In [175]:
# Test your function
w3_unittest.test_TripletLoss(TripletLoss)

All tests passed!


<a name='3'></a>

# Part 3: Training

Now it's time to finally train your model. As usual, you have to define the cost function and the optimizer. You also have to build the actual model you will be training. 

To pass the input questions for training and validation you will use the iterator produced by [`tensorflow.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset). Run the next cell to create your train and validation datasets. 

In [176]:
train_dataset = tf.data.Dataset.from_tensor_slices(((train_q1, train_q2),tf.constant([1]*len(train_q1))))
val_dataset = tf.data.Dataset.from_tensor_slices(((val_q1, val_q2),tf.constant([1]*len(val_q1))))

In [177]:
test_dataset = tf.data.Dataset.from_tensor_slices([train_q1])
test_dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(89188,), dtype=tf.string, name=None)>

<a name='3.1'></a>

### 3.1 Training the model

You will now write a function that takes in your model to train it. To train your model you have to decide how many times you want to iterate over the entire data set; each iteration is defined as an `epoch`. For each epoch, you have to go over all the data, using your `Dataset` iterator.

<a name='ex04'></a>
### Exercise 03

**Instructions:** Implement the `train_model` function below to train the neural network above. Here is a list of things you should do: 

- Compile the model. Here you will need to pass in:
    - `loss=TripletLoss`
    - `optimizer=Adam()` with learning rate `lr`
- Call the `fit` method. You should pass:
    - `train_dataset`
    - `epochs`
    - `validation_data` 



You will be using your triplet loss function with Adam optimizer. Also, note that you are not explicitly defining the batch size, because it will be already determined by the `Dataset`.

This function will return the trained model.

In [178]:
# GRADED FUNCTION: train_model
def train_model(Siamese, triplet_loss, text_vectorizer, train_dataset, val_dataset, d_feature=128, lr=0.01, train_steps=5):
    """Training the Siamese Model

    Args:
        Siamese (function): Function that returns the Siamese model.
        triplet_loss (function): Function that defines the TripletLoss loss function.
        text_vectorizer: trained instance of `TextVectorization` 
        train_dataset (tf.data.Dataset): Training dataset
        val_dataset (tf.data.Dataset): Validation dataset
        d_feature (int, optional): size of the encoding. Defaults to 128.
        lr (float, optional): learning rate for optimizer. Defaults to 0.01
        train_steps (int, optional): number of epochs. Defaults to 5.
        
    Returns:
        tf.keras.Model
    """
    ### START CODE HERE ###

    # Instantiate your Siamese model
    model = Siamese(text_vectorizer,
                    vocab_size = text_vectorizer.vocabulary_size(), # use the vectorizer passed in, not the global one
                    d_feature = d_feature)
    # Compile the model
    model.compile(loss=triplet_loss,
                  optimizer=tf.keras.optimizers.Adam(learning_rate=lr)
            )
    # Train the model 
    print(f"train_dataset: {train_dataset}")
    model.fit(train_dataset,
              epochs = train_steps,
              validation_data = val_dataset,
             )
             
    ### END CODE HERE ###

    return model

In [179]:
train_steps = 2
batch_size = 256
train_generator = train_dataset.shuffle(len(train_q1),
                                        seed=7, 
                                        reshuffle_each_iteration=True).batch(batch_size=batch_size)
val_generator = val_dataset.shuffle(len(val_q1), 
                                   seed=7,
                                   reshuffle_each_iteration=True).batch(batch_size=batch_size)
model = train_model(Siamese, TripletLoss,text_vectorization, 
                                            train_generator, 
                                            val_generator, 
                                            train_steps=train_steps,)

train_dataset: <_BatchDataset element_spec=((TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.string, name=None)), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
Epoch 1/2
Epoch 2/2
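
As a rough check on how much work one epoch involves, the number of batches follows directly from the set sizes printed earlier and the batch size of 256. This sketch assumes the default behaviour of `Dataset.batch`, which keeps the final partial batch; `batches_per_epoch` is a hypothetical helper name.

```python
import math


def batches_per_epoch(n_examples, batch_size):
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_examples / batch_size)


train_batches = batches_per_epoch(89188, 256)
val_batches = batches_per_epoch(22298, 256)
# Size of the final, partial training batch.
last_train_batch = 89188 - (train_batches - 1) * 256
print(train_batches, val_batches, last_train_batch)
```

The batch size also matters for the loss itself: with hard negative mining, each question in a full batch of 256 is compared against the 255 other pairs of the same batch as negatives.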


In [180]:
# Test your function!
w3_unittest.test_train_model(train_model, Siamese, TripletLoss)

train_dataset: <_BatchDataset element_spec=((TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.string, name=None)), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
train_dataset: <_BatchDataset element_spec=((TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.string, name=None)), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
All tests passed!


<a name='4'></a>

# Part 4:  Evaluation  

<a name='4.1'></a>

### 4.1 Evaluating your siamese network

In this section you will learn how to evaluate a Siamese network. You will start by loading a pretrained model, and then you will use it to predict. For the prediction you will need to take the output of your model and compute the cosine similarity between each pair of questions.

In [181]:
model = tf.keras.models.load_model('model/trained_model.keras', safe_mode=False, compile=False)

# Show the model architecture
model.summary()

OSError: No file or directory found at model/trained_model.keras

<a name='4.2'></a>
### 4.2 Classify
To determine the accuracy of the model, you will use the test set that was configured earlier. While in training you used only positive examples, the test data, `q1_test_words`, `q2_test_words` and `y_test`, is set up as pairs of questions, some of which are duplicates and some are not. 
This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it and compare the result to `y_test` - the correct response from the data set. The results are accumulated to produce an accuracy; the confusion matrix is also computed to have a better understanding of the errors.


<a name='ex05'></a>
### Exercise 04

**Instructions**  
 - Loop through the incoming data in `batch_size` chunks. You will again define a `tensorflow.data.Dataset` to do so. This time you don't need the labels, so you can replace them with `None`.
 - Compute `v1`, `v2` using the model output.
 - For each element of the batch:
     - compute the cosine similarity `d` of each pair of entries, `v1[j]`, `v2[j]`
     - determine if `d > threshold`
     - increment accuracy if that result matches the expected result (`y_test[j]`)

   Instead of running a for loop, you will vectorize all these operations to make things more efficient.
 - Compute the final accuracy and confusion matrix and return them. For the confusion matrix you can use the [`tf.math.confusion_matrix`](https://www.tensorflow.org/api_docs/python/tf/math/confusion_matrix) function. 

In [None]:
# GRADED FUNCTION: classify
def classify(test_Q1, test_Q2, y_test, threshold, model, batch_size=64, verbose=True):
    """Function to test the accuracy of the model.

    Args:
        test_Q1 (numpy.ndarray): Array of Q1 questions. Each element of the array would be a string.
        test_Q2 (numpy.ndarray): Array of Q2 questions. Each element of the array would be a string.
        y_test (numpy.ndarray): Array of actual target.
        threshold (float): Desired threshold
        model (tensorflow.Keras.Model): The Siamese model.
        batch_size (int, optional): Size of the batches. Defaults to 64.

    Returns:
        float: Accuracy of the model
        numpy.array: confusion matrix
    """
    y_pred = []
    test_gen = tf.data.Dataset.from_tensor_slices(((test_Q1, test_Q2),None)).batch(batch_size=batch_size)
    
    ### START CODE HERE ###

    # Run every pair through the model; each output row holds v1 and v2 concatenated
    pred = model.predict(test_gen)
    _, n_feat = pred.shape
    v1 = pred[:, :n_feat // 2]
    v2 = pred[:, n_feat // 2:]
    
    # Compute the cosine similarity. Using `tf.math.reduce_sum`. 
    # v1 and v2 are L2-normalized by the model, so the row-wise dot product is the cosine similarity
    d  = tf.math.reduce_sum(v1 * v2, axis=1)
    # Check if d>threshold to make predictions
    y_pred = tf.cast(d > threshold, tf.float64)
    # take the average of correct predictions to get the accuracy
    accuracy = tf.math.reduce_mean(tf.cast(tf.math.equal(y_pred, tf.cast(y_test, tf.float64)), tf.float64))
    # compute the confusion matrix using `tf.math.confusion_matrix`
    cm = tf.math.confusion_matrix(y_test, y_pred)
    
    ### END CODE HERE ###
    
    return accuracy, cm

In [None]:
# this takes around 1 minute
accuracy, cm = classify(q1_test_words, q2_test_words, y_test, 0.7, model,  batch_size = 512) 
print("Accuracy", accuracy.numpy())
print(f"Confusion matrix:\n{cm.numpy()}")

### **Expected Result**  
Accuracy ~0.725

Confusion matrix:
```
[[4876 1506]
 [1300 2558]]
 ```
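
To read the expected confusion matrix, recall that `tf.math.confusion_matrix` puts the true labels on the rows and the predicted labels on the columns. The sketch below recomputes the accuracy from the four cells in plain Python. Precision and recall are not part of the assignment and are included only as an extra illustration of what the matrix contains.

```python
# Expected confusion matrix: rows are true labels (0, 1), columns are predictions (0, 1).
cm = [[4876, 1506],
      [1300, 2558]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

# Accuracy is the share of pairs on the diagonal.
accuracy = (tn + tp) / total
# Of the pairs predicted as duplicates, how many really are duplicates.
precision = tp / (tp + fp)
# Of the real duplicates, how many were found.
recall = tp / (tp + fn)

print(f"pairs={total} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The four cells add up to the 10240 test pairs, and the accuracy comes out at about 0.726, in line with the expected result.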

In [None]:
# Test your function!
w3_unittest.test_classify(classify, model)

<a name='5'></a>

# Part 5: Testing with your own questions

In this final section you will test the model with your own questions. You will write a function `predict` that takes two questions, a similarity threshold and the model, and returns `True` if the questions are duplicates and `False` otherwise.

<a name='ex06'></a>
### Exercise 05


**Instructions:** 
- Create a `tensorflow.data.Dataset` from your two questions. Again, labels are not important, so you simply write `None`.
- Use the trained model output to create `v1`, `v2`.
- Compute the cosine similarity `d` (dot product) of `v1`, `v2`.
- Compute `res` by comparing `d` to the threshold.
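
The reason a plain dot product is enough here is that the model ends with an L2-normalization layer, so both vectors have unit length. The following sketch checks this on two toy vectors in plain Python; the helper names and the vectors are assumptions made for this example only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Scale the vector to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    # Textbook definition: dot product divided by the product of the norms.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [1.0, 2.0, 3.0]
v = [2.0, 3.0, 5.0]

# After normalization, the dot product alone gives the cosine similarity.
d = dot(l2_normalize(u), l2_normalize(v))
is_duplicate = d > 0.7
print(d, cosine_similarity(u, v), is_duplicate)
```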


In [None]:
# GRADED FUNCTION: predict
def predict(question1, question2, threshold, model, verbose=False):
    """Function for predicting if two questions are duplicates.

    Args:
        question1 (str): First question.
        question2 (str): Second question.
        threshold (float): Desired threshold.
        model (tensorflow.keras.Model): The Siamese model.
        verbose (bool, optional): If the results should be printed out. Defaults to False.

    Returns:
        bool: True if the questions are duplicates, False otherwise.
    """
    generator = tf.data.Dataset.from_tensor_slices((([question1], [question2]),None)).batch(batch_size=1)
    
    ### START CODE HERE ###
    
    # Call the predict method of your model and save the output into v1v2
    v1v2 = model.predict(generator)
    # Extract v1 and v2 from the model output (the two halves of each output row)
    n_feat = v1v2.shape[1]
    v1 = v1v2[:, :n_feat // 2]
    v2 = v1v2[:, n_feat // 2:]
    # Take the dot product to compute cos similarity of each pair of entries, v1, v2
    # Since v1 and v2 are both vectors, use the function tf.math.reduce_sum instead of tf.linalg.matmul
    d = tf.math.reduce_sum(v1 * v2)
    # Is d greater than the threshold?
    res = d > threshold

    ### END CODE HERE ###
    
    if(verbose):
        print("Q1  = ", question1, "\nQ2  = ", question2)
        print("d   = ", d.numpy())
        print("res = ", res.numpy())

    return res.numpy()

In [None]:
# Feel free to try with your own questions
question1 = "When will I see you?"
question2 = "When can I see you again?"
# True means the questions are duplicates, False otherwise
predict(question1 , question2, 0.7, model, verbose = True)

##### Expected Output
If input is:
```
question1 = "When will I see you?"
question2 = "When can I see you again?"
```

Output is (d may vary a bit):
```
1/1 [==============================] - 0s 13ms/step
Q1  =  When will I see you? 
Q2  =  When can I see you again?
d   =  0.8422112
res =  True
```

In [None]:
# Feel free to try with your own questions
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# True means the questions are duplicates, False otherwise
predict(question1 , question2, 0.7, model, verbose=True)

##### Expected output

If input is:
```
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
```

Output (d may vary a bit):

```
1/1 [==============================] - 0s 12ms/step
Q1  =  Do they enjoy eating the dessert? 
Q2  =  Do they like hiking in the desert?
d   =  0.12625802
res =  False

False
```

In [None]:
# Test your function!
w3_unittest.test_predict(predict, model)