# [Math-Bot] Siamese LSTM:  Detecting duplicates

<img src="media/uni_logo.png"/>

<b>Author: Alin-Andrei Georgescu 2021</b>

Welcome to my notebook! It explores Siamese networks applied to natural language processing. The model is intended to detect duplicates, in other words to check whether two sentences are similar.
The model uses "Long short-term memory" (LSTM) networks, a type of artificial recurrent neural network (RNN).

## Outline

- [Overview](#0)
- [Part 1: Importing the Data](#1)
    - [1.1 Loading in the data](#1.1)
    - [1.2 Converting a sentence to a tensor](#1.2)
    - [1.3 Understanding and building the iterator](#1.3)
- [Part 2: Defining the Siamese model](#2)
    - [2.1 Understanding and building the Siamese Network](#2.1)
    - [2.2 Implementing Hard Negative Mining](#2.2)
- [Part 3: Training](#3)
- [Part 4: Evaluation](#4)
- [Part 5: Making predictions](#5)

<a name='0'></a>
### Overview

General ideas:
- Designing a Siamese network model
- Implementing the triplet loss
- Evaluating accuracy
- Using cosine similarity between the model's outputted vectors
- Working with Trax and Numpy libraries in Python 3

The LSTM cell's architecture (source: https://www.researchgate.net/figure/The-structure-of-the-LSTM-unit_fig2_331421650):
<img src="https://www.researchgate.net/profile/Xiaofeng-Yuan-4/publication/331421650/figure/fig2/AS:771405641695233@1560928845927/The-structure-of-the-LSTM-unit.png" style="width:600px;height:300px;"/>



I will start by preprocessing the data, then I will build a classifier that identifies whether two sentences are duplicates or not.


I split the dataset into training and test sets, tokenize the sentences, and build a vocabulary from the training set. The sentences in each batch are then padded to equal lengths. The model takes the two encoded sentences, embeds them, runs them through an LSTM, and then compares the outputs of the two subnetworks using cosine similarity.

This notebook has been built based on Coursera's <a href="https://www.coursera.org/specializations/natural-language-processing">Natural Language Processing Specialization</a>.


<a name='1'></a>
# Part 1: Importing the Data
<a name='1.1'></a>
### 1.1 Loading in the data

The first step in building a model is building a dataset. I used three datasets to build my model:
- the Quora Question Pairs dataset
- an edited SICK dataset
- a custom Maths duplicates dataset

Run the cell below to import some of the needed packages. 

In [None]:
import os
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import numpy as np
import pandas as pd
import random as rnd

!pip install textcleaner
import textcleaner as tc

!pip install trax
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp

# set random seeds
rnd.seed(34)

**Notice that in this notebook Trax's numpy is referred to as `fastnp`, while regular numpy is referred to as `np`.**

Now the dataset will get loaded and the data processed.

In [None]:
data = pd.read_csv("data/merged_dataset.csv", encoding="utf-8")

N = len(data)
print("Number of sentence pairs: ", N)

data.head()

Then I split the data into a train and test set. The test set will be used later to evaluate the model.

In [None]:
N_dups = len(data[data.is_duplicate == 1])

# Take 90% of the duplicates for the train set
N_train = int(N_dups * 0.9)
print(N_train)

# Take the rest of the duplicates for the test set + an equal number of non-dups
N_test = (N_dups - N_train) * 2
print(N_test)

data_train = data[: N_train]
# Shuffle the train set
data_train = data_train.sample(frac=1)

data_test = data[N_train : N_train + N_test]
# Shuffle the test set
data_test = data_test.sample(frac=1)

print("Train set: ", len(data_train), "; Test set: ", len(data_test))

# Remove the unneeded data to save some memory
del(data)

In [None]:
S1_train_words = np.array(data_train["sentence1"])
S2_train_words = np.array(data_train["sentence2"])

S1_test_words = np.array(data_test["sentence1"])
S2_test_words = np.array(data_test["sentence2"])
y_test  = np.array(data_test["is_duplicate"])

As shown above, only the duplicate pairs are used for training.
This is deliberate: the data generator will produce batches $([s1_1, s1_2, s1_3, ...]$, $[s2_1, s2_2,s2_3, ...])$, where $s1_i$ and $s2_k$ are duplicates if and only if $i = k$.
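To make the pairing concrete, here is a minimal sketch with a made-up batch of three duplicate pairs. The sentences and the helper name `batch_labels` are illustrative assumptions, not part of the notebook's data: diagonal pairs are duplicates, and every off-diagonal pair can serve as a negative.

```python
# Toy batch of duplicate pairs (illustrative sentences, not from the dataset).
s1_batch = ["what is 2 + 2", "how do i integrate x", "define a prime number"]
s2_batch = ["compute 2 + 2", "what is the integral of x", "what is a prime number"]


def batch_labels(batch_size):
    """Return a matrix whose entry [i][k] is 1 if (s1_i, s2_k) is a duplicate pair."""
    return [[1 if i == k else 0 for k in range(batch_size)] for i in range(batch_size)]


pair_labels = batch_labels(len(s1_batch))
for i, row in enumerate(pair_labels):
    for k, is_dup in enumerate(row):
        kind = "duplicate" if is_dup else "negative"
        print(f"{kind}: ({s1_batch[i]!r}, {s2_batch[k]!r})")
```

A batch of size $n$ therefore yields $n$ positive pairs and $n(n-1)$ negative pairs without any explicit negative examples in the training set.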

An example of how the data looks is shown below.

In [None]:
print("TRAINING SENTENCES:\n")
print("Sentence 1: ", S1_train_words[0])
print("Sentence 2: ", S2_train_words[0], "\n")
print("Sentence 1: ", S1_train_words[5])
print("Sentence 2: ", S2_train_words[5], "\n")

print("TESTING SENTENCES:\n")
print("Sentence 1: ", S1_test_words[0])
print("Sentence 2: ", S2_test_words[0], "\n")
print("is_duplicate =", y_test[0], "\n")

The first step is to tokenize the sentences using a custom tokenizer defined below. Then each word of the selected duplicate pairs is encoded with an index. After that, given a sentence, it can just be encoded as a list of numbers. 
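As a rough illustration of the encoding step, the sketch below builds a toy vocabulary from two already-tokenized sentences and encodes new sentences with it. The tokens, `toy_vocab`, and the helper `encode` are assumptions made for the example; the real tokens come from `data_tokenizer`.

```python
from collections import defaultdict

# Toy, already-tokenized sentences (illustrative only).
tokenized = [["what", "is", "a", "prime"], ["is", "seven", "a", "prime"]]

toy_vocab = defaultdict(lambda: 0)  # unknown words map to 0
toy_vocab["<PAD>"] = 1

for sentence in tokenized:
    for word in sentence:
        if word not in toy_vocab:
            toy_vocab[word] = len(toy_vocab) + 1


def encode(sentence, vocab):
    """Map each token to its index; .get avoids inserting unknown words into the vocab."""
    return [vocab.get(word, 0) for word in sentence]


print(encode(["is", "seven", "prime"], toy_vocab))   # [3, 6, 5]
print(encode(["is", "eleven", "prime"], toy_vocab))  # [3, 0, 5]
```

Note that indexing a `defaultdict` with a missing key (`vocab[word]`) inserts that key, so the vocabulary can grow during lookups; `.get` is used here to keep it unchanged.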

In [None]:
# Create arrays
S1_train = np.empty_like(S1_train_words)
S2_train = np.empty_like(S2_train_words)

S1_test = np.empty_like(S1_test_words)
S2_test = np.empty_like(S2_test_words)

In [None]:
def data_tokenizer(sentence):
    """Tokenizer function - cleans and tokenizes the data

    Args:
        sentence (str): The input sentence.
    Returns:
        list: The transformed input sentence.
    """
    
    if sentence == "":
        return []  # keep the return type a list, as documented

    sentence = tc.lower_all(sentence)[0]

    # Change tabs to spaces
    sentence = re.sub(r"\t+_+", " ", sentence)
    # Change short forms
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"(can\'t|can not)", "cannot", sentence)
    sentence = re.sub(r"n\'t", " not", sentence)
    # The sentence is already lowercased, so match the lowercase form
    sentence = re.sub(r"i\'m", "i am", sentence)
    sentence = re.sub(r" m ", " am ", sentence)
    sentence = re.sub(r"(\'re| r )", " are ", sentence)
    sentence = re.sub(r"\'d", " would ", sentence)
    sentence = re.sub(r"\'ll", " will ", sentence)
    sentence = re.sub(r"(\d+)(k)", r"\g<1>000", sentence)
    # Make word separations (Python backreferences are written \1, not $1)
    sentence = re.sub(r"(\+|-|\*|\/|\^|\.)", r" \1 ", sentence)
    # Remove irrelevant stuff, nonprintable characters and spaces
    sentence = re.sub(r"(\'s|\'S|\'|\"|,|[^ -~]+)", "", sentence)
    sentence = tc.strip_all(sentence)[0]
    # Remove the trailing dot, if necessary
    sentence = re.sub(r" *\. *$", "", sentence)

    if sentence == "":
        return []  # keep the return type a list, as documented

    return tc.token_it(tc.lemming(tc.stemming(sentence)))[0]

In [None]:
# Building the vocabulary with the train set
from collections import defaultdict

vocab = defaultdict(lambda: 0)
vocab["<PAD>"] = 1

for idx in range(len(S1_train_words)):
    S1_train[idx] = data_tokenizer(S1_train_words[idx])
    S2_train[idx] = data_tokenizer(S2_train_words[idx])

    s = S1_train[idx] + S2_train[idx]
    for word in s:
        if word not in vocab:
            vocab[word] = len(vocab) + 1

print("The length of the vocabulary is: ", len(vocab))

In [None]:
print(vocab["<PAD>"])
print(vocab["Astrology".lower()])
print(vocab["Scrumptious".lower()])  #not in vocabulary, returns 0

In [None]:
print(vocab)

In [None]:
for idx in range(len(S1_test_words)): 
    S1_test[idx] = data_tokenizer(S1_test_words[idx])
    S2_test[idx] = data_tokenizer(S2_test_words[idx])

<a name='1.2'></a>
### 1.2 Converting a sentence to a tensor

The next step is to convert every sentence to a tensor, or an array of numbers, using the vocabulary built above.

In [None]:
# Converting sentences to arrays of integers
for i in range(len(S1_train)):
    S1_train[i] = [vocab[word] for word in S1_train[i]]
    S2_train[i] = [vocab[word] for word in S2_train[i]]

        
for i in range(len(S1_test)):
    S1_test[i] = [vocab[word] for word in S1_test[i]]
    S2_test[i] = [vocab[word] for word in S2_test[i]]

In [None]:
print("FIRST SENTENCE IN TRAIN SET:\n")
print(S1_train_words[0], "\n") 
print("ENCODED VERSION:")
print(S1_train[0],"\n")

print("FIRST SENTENCE IN TEST SET:\n")
print(S1_test_words[0], "\n")
print("ENCODED VERSION:")
print(S1_test[0]) 

Now, the train set must be split into a training/validation set so that it can be used to train and evaluate the Siamese model.

In [None]:
# Splitting the data
cut_off = int(len(S1_train) * .8)
train_S1, train_S2 = S1_train[: cut_off], S2_train[: cut_off]
val_S1, val_S2 = S1_train[cut_off :], S2_train[cut_off :]
print("Number of duplicate sentences: ", len(S1_train))
print("The length of the training set is:  ", len(train_S1))
print("The length of the validation set is: ", len(val_S1))

<a name='1.3'></a>
### 1.3 Understanding and building the iterator 

Given the computational limits, the data needs to be split into batches. In this notebook, I built a data generator that takes in $S1$ and $S2$ and returns a batch of size `batch_size` in the following format $([s1_1, s1_2, s1_3, ...]$, $[s2_1, s2_2,s2_3, ...])$. The tuple consists of two arrays and each array has `batch_size` sentences. Again, $s1_i$ and $s2_i$ are duplicates, but neither is a duplicate of any other element in the batch. 

The command `next(data_generator)` returns the next batch. This iterator returns a pair of arrays of sentences, which will later be used in the model.

**The ideas behind:**  
- The generator returns shuffled batches of data. To achieve this without modifying the actual sentence lists, a list containing the indexes of the sentences is created. This list can be shuffled and used to get random batches every time the index is reset.
- Append elements of $S1$ and $S2$ to `input1` and `input2` respectively.
- If the batches are full (i.e. they have reached `batch_size` length), determine `max_len` as the length of the longest sentence in `input1` and `input2`. `max_len` is then rounded up to the next power of $2$, for computational purposes.
- The shorter sentences are padded with `vocab["<PAD>"]` up to length `max_len`.
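The padding step can be sketched in isolation as follows. This is a simplified, standalone version that pads a single list of encoded sentences (the notebook's generator pads `input1` and `input2` to one shared length); the helper name `pad_batch` and the sample indices are assumptions for illustration.

```python
import math


def pad_batch(batch, pad=1):
    """Pad every encoded sentence in `batch` to a shared power-of-two length."""
    longest = max(len(s) for s in batch)
    # Round up to the next power of two (at least 1, to avoid log2(0)).
    max_len = 2 ** math.ceil(math.log2(max(longest, 1)))
    return [s + [pad] * (max_len - len(s)) for s in batch]


padded = pad_batch([[5, 8, 13], [7, 2, 9, 4, 6]])
print(padded)  # [[5, 8, 13, 1, 1, 1, 1, 1], [7, 2, 9, 4, 6, 1, 1, 1]]
```

Here the longest sentence has 5 tokens, so both sentences are padded to length 8.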

In [None]:
def data_generator(S1, S2, batch_size, pad=1, shuffle=False):
    """Generator function that yields batches of data

    Args:
        S1 (list): List of transformed (to tensor) sentences.
        S2 (list): List of transformed (to tensor) sentences.
        batch_size (int): Number of elements per batch.
        pad (int, optional): Pad character from the vocab. Defaults to 1.
        shuffle (bool, optional): If the batches should be randomized or not. Defaults to False.
    Yields:
        tuple: Of the form (input1, input2) with types (numpy.ndarray, numpy.ndarray)
        NOTE: input1: first sentences  [s1_1, s1_2, s1_3, ...]
              input2: second sentences [s2_1, s2_2, s2_3, ...]
              (s1_i, s2_i) are duplicates; (s1_i, s2_k) with k != i are not.
    """

    input1 = []
    input2 = []
    idx = 0
    len_s = len(S1)
    sentence_indexes = [*range(len_s)]

    if shuffle:
        rnd.shuffle(sentence_indexes)

    while True:
        if idx >= len_s:
            # If idx is greater than or equal to len_s, reset it
            idx = 0
            # Shuffle to get random batches if shuffle is set to True
            if shuffle:
                rnd.shuffle(sentence_indexes)

        s1 = S1[sentence_indexes[idx]]
        s2 = S2[sentence_indexes[idx]]

        idx += 1

        input1.append(s1)
        input2.append(s2)

        if len(input1) == batch_size:
            # Determine max_len as the longest sentence in input1 & input 2
            max_len = max(max([len(s) for s in input1]), max([len(s) for s in input2]))
            # Pad to power-of-2
            max_len = 2 ** int(np.ceil(np.log2(max_len)))

            b1 = []
            b2 = []
            for s1, s2 in zip(input1, input2):
                # Add [pad] to s1 until it reaches max_len
                s1 = s1 + [pad] * (max_len - len(s1))
                # Add [pad] to s2 until it reaches max_len
                s2 = s2 + [pad] * (max_len - len(s2))

                # Append s1
                b1.append(s1)
                # Append s2
                b2.append(s2)

            # Use b1 and b2
            yield np.array(b1), np.array(b2)

            # reset the batches
            input1, input2 = [], []

In [None]:
batch_size = 2
res1, res2 = next(data_generator(train_S1, train_S2, batch_size))
print("First sentences  :\n", res1, "\n")
print("Second sentences :\n", res2)

Now we can go ahead and start building the neural network, as we have a data generator.

<a name='2'></a>
# Part 2: Defining the Siamese model

<a name='2.1'></a>

### 2.1 Understanding and building the Siamese Network 

A Siamese network is a neural network which uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. The Siamese network model proposed in this notebook looks like this:

<img src="media/siamese.png" style="width:600px;height:300px;"/>

The sentences' embeddings are passed to an LSTM layer, the output vectors, $v_1$ and $v_2$, are normalized, and a triplet loss based on the cosine similarity of each pair of sentences is used for training. The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. Mathematically, the following loss is minimized:

$$\mathcal{L}(A, P, N)=\max \left(\|\mathrm{f}(A)-\mathrm{f}(P)\|^{2}-\|\mathrm{f}(A)-\mathrm{f}(N)\|^{2}+\alpha, 0\right)$$

$A$ is the anchor input, for example $s1_1$, $P$ the duplicate input, for example $s2_1$, and $N$ the negative input (the non-duplicate sentence), for example $s2_2$.<br>
$\alpha$ is a margin: how far the duplicates are pushed from the non-duplicates. 
<br>
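A small worked example may help here. The sketch below evaluates the formula on made-up 2-D embeddings; the vectors and the helper names are illustrative assumptions, and the embeddings are given directly rather than produced by a model.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.25):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0) on precomputed embeddings."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + alpha, 0.0)


# Illustrative embeddings (made-up numbers).
f_a = [1.0, 0.0]
f_p = [0.8, 0.6]      # duplicate of the anchor
f_easy = [0.0, 1.0]   # negative that is far from the anchor
f_hard = [0.8, -0.6]  # negative that is as close to the anchor as the positive

easy_loss = triplet_loss(f_a, f_p, f_easy)
hard_loss = triplet_loss(f_a, f_p, f_hard)
print(easy_loss, hard_loss)
```

The easy negative is already further away than the positive by more than the margin, so it contributes no loss; the hard negative is just as close as the positive, so the loss equals the margin $\alpha$.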

**The ideas behind:**
- The Trax library is used to implement the model.
- `tl.Serial`: Combinator that applies layers serially (by function composition), which sets up the overall structure of the feedforward pass.
- `tl.Embedding`: Maps discrete tokens to vectors. Its weight matrix has shape (vocabulary length × dimension of output vectors). The dimension of output vectors (also called `d_feature`) is the number of elements in the word embedding.
- `tl.LSTM`: The LSTM layer.
- `tl.Mean`: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group.
- `tl.Fn`: Layer with no weights that applies the function f, vector normalization in this case.
- `tl.Parallel`: A combinator layer (like `Serial`) that applies a list of layers in parallel to its inputs.
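To illustrate what the last two processing steps do, here is a plain-Python sketch of averaging per-token outputs and normalizing the result. It does not use Trax; the helper names and the numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
import math


def mean_over_time(sequence):
    """Average a list of per-token vectors into one sentence vector."""
    n_tokens = len(sequence)
    return [sum(step[d] for step in sequence) / n_tokens
            for d in range(len(sequence[0]))]


def l2_normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Made-up per-token outputs for a 2-token sentence with 2 features each.
token_outputs = [[2.0, 0.0], [4.0, 8.0]]
sentence_vec = l2_normalize(mean_over_time(token_outputs))
print(sentence_vec)  # [0.6, 0.8]
```

Because every sentence vector has unit length after this step, the dot product of two sentence vectors equals their cosine similarity.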


In [None]:
def Siamese(vocab_size=len(vocab), d_model=128, mode="train"):
    """Returns a Siamese model.

    Args:
        vocab_size (int, optional): Length of the vocabulary. Defaults to len(vocab).
        d_model (int, optional): Depth of the model. Defaults to 128.
        mode (str, optional): "train", "eval" or "predict", predict mode is for fast inference. Defaults to "train".

    Returns:
        trax.layers.combinators.Parallel: A Siamese model. 
    """

    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))

    s_processor = tl.Serial(                        # Processor will run on S1 and S2.
        tl.Embedding(vocab_size, d_model),          # Embedding layer
        tl.LSTM(d_model),                           # LSTM layer
        tl.Mean(axis=1),                            # Mean over the sequence axis
        tl.Fn('Normalize', lambda x: normalize(x))  # Apply normalize function
    )  # Returns one vector of shape [batch_size, d_model].
    
    # Run on S1 and S2 in parallel.
    model = tl.Parallel(s_processor, s_processor)
    return model


Set up the Siamese network model.

In [None]:
# Check the model
model = Siamese(d_model=256)
print(model)

<a name='2.2'></a>

### 2.2 Implementing Hard  Negative Mining


Now it is time to implement the `TripletLoss`.
The loss is composed of two terms. One term uses the mean of all the non-duplicates, the other uses the *closest negative*, i.e. the non-duplicate with the highest cosine similarity to the anchor. The loss expression is then:
 
\begin{align}
\mathcal{L}_1(A,P,N) &= \max\left(-\cos(A,P) + \mathrm{mean}_{neg} + \alpha, 0\right) \\
\mathcal{L}_2(A,P,N) &= \max\left(-\cos(A,P) + \mathrm{closest}_{neg} + \alpha, 0\right) \\
\mathcal{L}(A,P,N) &= \mathrm{mean}(\mathcal{L}_1 + \mathcal{L}_2)
\end{align}
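The two terms can be traced by hand on a small similarity matrix. The sketch below uses plain Python lists and made-up cosine similarities; the helper names `hard_negative_terms` and `toy_triplet_loss` are assumptions for this illustration and are separate from the Trax implementation in the notebook.

```python
def hard_negative_terms(scores):
    """For each row i, return (positive, mean_negative, closest_negative)."""
    terms = []
    for i, row in enumerate(scores):
        negatives = [s for k, s in enumerate(row) if k != i]
        terms.append((row[i], sum(negatives) / len(negatives), max(negatives)))
    return terms


def toy_triplet_loss(scores, margin=0.25):
    """Batch mean of the sum of the two loss terms defined above."""
    total = 0.0
    for positive, mean_neg, closest_neg in hard_negative_terms(scores):
        mean_term = max(margin - positive + mean_neg, 0.0)
        closest_term = max(margin - positive + closest_neg, 0.0)
        total += mean_term + closest_term
    return total / len(scores)


# Made-up cosine similarities: entry [i][k] compares s1_i with s2_k.
toy_scores = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.0],
    [0.3, 0.5, 0.6],
]
toy_terms = hard_negative_terms(toy_scores)
toy_loss = toy_triplet_loss(toy_scores)
print(toy_terms)
print(toy_loss)
```

In the first row the mean negative (0.45) is far enough from the positive (0.9), but the closest negative (0.8) is not, so only the closest-negative term contributes. This is how the hard negatives keep driving the training even when the average negative is already well separated.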

In [None]:
def TripletLossFn(v1, v2, margin=0.25):
    """Custom Loss function.

    Args:
        v1 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to S1.
        v2 (numpy.ndarray): Array with dimension (batch_size, model_dimension) associated to S2.
        margin (float, optional): Desired margin. Defaults to 0.25.

    Returns:
        jax.interpreters.xla.DeviceArray: Triplet Loss.
    """    
    scores = fastnp.dot(v1, v2.T)       # pairwise cosine sim
    batch_size = len(scores)

    positive = fastnp.diagonal(scores)  # the positive ones (duplicates)
    negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)

    closest_negative = fastnp.max(negative_without_positive, axis=1)
    negative_zero_on_duplicate = (1.0 - fastnp.eye(batch_size)) * scores
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)

    triplet_loss1 = fastnp.maximum(0.0, margin - positive + mean_negative)
    triplet_loss2 = fastnp.maximum(0.0, margin - positive + closest_negative)
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
    
    return triplet_loss

In [None]:
v1 = np.array([[0.26726124, 0.53452248, 0.80178373],[0.5178918 , 0.57543534, 0.63297887]])
v2 = np.array([[ 0.26726124,  0.53452248,  0.80178373],[-0.5178918 , -0.57543534, -0.63297887]])
TripletLossFn(v2,v1)
print("Triplet Loss:", TripletLossFn(v2,v1))

**Expected Output:**
```CPP
Triplet Loss: 0.5
```   

In [None]:
from functools import partial
def TripletLoss(margin=0.25):
    # Trax layer creation
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return tl.Fn("TripletLoss", triplet_loss_fn)

<a name='3'></a>

# Part 3: Training

The next step is model training - defining the cost function and the optimizer, feeding in the built model. But first I will define the data generators used in the model.

In [None]:
batch_size = 512
train_generator = data_generator(train_S1, train_S2, batch_size, vocab["<PAD>"])
val_generator = data_generator(val_S1, val_S2, batch_size, vocab["<PAD>"])
print("train_S1.shape ", train_S1.shape)
print("val_S1.shape   ", val_S1.shape)

Now, I will define the training step. Training runs for a fixed number of steps, each of which consumes one batch from the training iterator; a full pass over all the data is one `epoch`.

**The ideas behind:**
- Two tasks are needed: `TrainTask` and `EvalTask`.
- The training runs in a Trax loop, `trax.supervised.training.Loop`.
- The model, the two tasks and the output directory are passed to the loop.

In [None]:
def train_model(Siamese, TripletLoss, lr_schedule, train_generator=train_generator, val_generator=val_generator, output_dir="trax_model/"):
    """Training the Siamese Model

    Args:
        Siamese (function): Function that returns the Siamese model.
        TripletLoss (function): Function that defines the TripletLoss loss function.
        lr_schedule (function): Trax multifactor schedule function.
        train_generator (generator, optional): Training generator. Defaults to train_generator.
        val_generator (generator, optional): Validation generator. Defaults to val_generator.
        output_dir (str, optional): Path to save model to. Defaults to "trax_model/".

    Returns:
        trax.supervised.training.Loop: Training loop for the model.
    """
    output_dir = os.path.expanduser(output_dir)

    train_task = training.TrainTask(
        labeled_data=train_generator,
        loss_layer=TripletLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule
    )

    eval_task = training.EvalTask(
        labeled_data=val_generator,
        metrics=[TripletLoss()]
    )

    training_loop = training.Loop(Siamese(),
                                  train_task,
                                  eval_tasks=[eval_task],
                                  output_dir=output_dir,
                                  random_seed=34)

    return training_loop

In [None]:
train_steps = 1500
lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)
training_loop = train_model(Siamese, TripletLoss, lr_schedule)
training_loop.run(train_steps)

<a name='4'></a>

# Part 4:  Evaluation

To determine the accuracy of the model, the test set that was configured earlier is used. While the training used only positive examples, the test data, `S1_test`, `S2_test` and `y_test`, is set up as pairs of sentences, some of which are duplicates and some of which are not. 
This routine runs all the test sentence pairs through the model, computes the cosine similarity of each pair, thresholds it and compares the result to `y_test`, the correct response from the dataset. The results are accumulated to produce an accuracy.

**The ideas behind:**  
 - The model loops through the incoming data in batch_size chunks.
 - The output vectors are computed and their cosine similarity is thresholded.
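The thresholding step can be sketched on its own, without a model. The similarities and labels below are made up, and `accuracy_at_threshold` is a hypothetical helper used only for this illustration.

```python
def accuracy_at_threshold(similarities, labels, threshold):
    """Fraction of pairs whose thresholded similarity matches the 0/1 label."""
    correct = sum(int(sim > threshold) == label
                  for sim, label in zip(similarities, labels))
    return correct / len(labels)


# Made-up cosine similarities and ground-truth labels.
sims = [0.92, 0.65, 0.71, 0.10]
truth = [1, 1, 0, 0]

print(accuracy_at_threshold(sims, truth, 0.7))  # 0.5
print(accuracy_at_threshold(sims, truth, 0.6))  # 0.75
```

The accuracy depends on the chosen threshold, so it is worth trying a few values on held-out data rather than relying on a single one.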

In [None]:
def classify(test_S1, test_S2, y, threshold, model, vocab, data_generator=data_generator, batch_size=64):
    """Function to test the accuracy of the model.

    Args:
        test_S1 (numpy.ndarray): Array of S1 sentences.
        test_S2 (numpy.ndarray): Array of S2 sentences.
        y (numpy.ndarray): Array of actual target.
        threshold (float): Desired threshold.
        model (trax.layers.combinators.Parallel): The Siamese model.
        vocab (collections.defaultdict): The vocabulary used.
        data_generator (function): Data generator function. Defaults to data_generator.
        batch_size (int, optional): Size of the batches. Defaults to 64.

    Returns:
        float: Accuracy of the model.
    """
    accuracy = 0

    for i in range(0, len(test_S1), batch_size):
        to_process = len(test_S1) - i

        if to_process < batch_size:
            batch_size = to_process

        s1, s2 = next(data_generator(test_S1[i : i + batch_size], test_S2[i : i + batch_size], batch_size, vocab["<PAD>"], shuffle=False))
        y_test = y[i : i + batch_size]

        v1, v2 = model((s1, s2))

        for j in range(batch_size):
            d = np.dot(v1[j], v2[j].T)
            res = d > threshold

            accuracy += (y_test[j] == res)

    accuracy = accuracy / len(test_S1)
    
    return accuracy

In [None]:
print(len(S1_test))

In [None]:
# Loading in the saved model
model = Siamese()
model.init_from_file("trax_model/model.pkl.gz")
# Evaluating it
accuracy = classify(S1_test, S2_test, y_test, 0.7, model, vocab, batch_size=512) 
print("Accuracy", accuracy)

<a name='5'></a>

# Part 5: Making predictions

In this section the model will be put to work. It will be wrapped in a function called `predict`, which takes two sentences as input and returns `True` or `False`, depending on whether the pair is a duplicate or not.

The function tokenizes both sentences, encodes them with the vocabulary built above, pads them using the data generator, and thresholds the cosine similarity of the two output vectors.

In [None]:
def predict(sentence1, sentence2, threshold, model, vocab, data_generator=data_generator, verbose=False):
    """Function for predicting if two sentences are duplicates.

    Args:
        sentence1 (str): First sentence.
        sentence2 (str): Second sentence.
        threshold (float): Desired threshold.
        model (trax.layers.combinators.Parallel): The Siamese model.
        vocab (collections.defaultdict): The vocabulary used.
        data_generator (function): Data generator function. Defaults to data_generator.
        verbose (bool, optional): If the results should be printed out. Defaults to False.

    Returns:
        bool: True if the sentences are duplicates, False otherwise.
    """

    s1 = data_tokenizer(sentence1)  # tokenize
    s2 = data_tokenizer(sentence2)  # tokenize
    S1, S2 = [], []

    for word in s1:  # encode s1
        S1 += [vocab[word]]
    for word in s2:  # encode s2
        S2 += [vocab[word]]

    S1, S2 = next(data_generator([S1], [S2], 1, vocab["<PAD>"]))

    v1, v2 = model((S1, S2))
    d = np.dot(v1[0], v2[0].T)
    res = d > threshold
    
    if verbose:
        print("S1  = ", S1, "\nS2  = ", S2)
        print("d   = ", d)
        print("res = ", res)

    return res

Now we can test the model's ability to make predictions.

In [None]:
sentence1 = "I love running in the park."
sentence2 = "I like running in park."
# True means the pair is a duplicate, False otherwise
predict(sentence1 , sentence2, 0.7, model, vocab, verbose=True)

The Siamese network can capture more than surface word overlap. Concretely, it can identify duplicate sentences even when they do not have many words in common.