<a href="https://colab.research.google.com/github/Shawn1303/semantic-textual-similarity/blob/main/semantic_textual_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Overview

This assignment is built on the skills and techniques introduced in previous assignments, but completion (partially or entirely) of any previous assignment is not required to complete this work.

The previous assignment ended with word similarity and word analogy tests. Here, we take the next step and explore sentence similarity: students will learn how the representations of smaller language units (e.g., words) can be composed to form representations of larger units (e.g., sentences) using deep learning. Specifically, the assignment introduces
1. the **task** of measuring semantic textual similarity (STS),
2. a popular **dataset** of [Sentences Involving Compositional Knowledge (SICK)](https://zenodo.org/records/2787612),
3. **fine-tuning** pretrained embeddings for a specific task, and
4. how regression (or regression-like) tasks use **correlation statistics for evaluation**.
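
As a rough illustration of the evaluation idea in item 4, the sketch below computes Pearson and Spearman correlation between gold scores and predictions using only NumPy. The scores are made up, and the helper names `pearson_r` and `spearman_rho` are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation: Pearson correlation of the ranks (assumes no tied values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson_r(rank_x, rank_y)

# Made-up gold relatedness scores and model predictions, for illustration only.
gold = np.array([1.0, 2.5, 3.0, 4.2, 4.9])
pred = np.array([1.4, 2.0, 3.5, 3.9, 4.6])

print(f"Pearson r:    {pearson_r(gold, pred):.3f}")
print(f"Spearman rho: {spearman_rho(gold, pred):.3f}")
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only checks that the two rankings agree.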

**<font color="darkred">Important Notes</font>**

In this assignment, the model implementation is largely left for the student to complete. As such, you are no longer restricted to completion of `todo` blocks like in previous assignments.

This freedom includes the right to use the following libraries and their functionalities:
- PyTorch
- Pandas
- Scikit-Learn
- NumPy
- TorchText
- NLTK

A few libraries are prohibited, however:
- PyTorch Lightning
- Skorch
- AllenNLP
- Keras
- TensorFlow

*There is a huge ecosystem of hundreds of libraries in the PyTorch universe, and it is possible that some libraries have already built the kind of neural network you are being asked to build in this assignment. Even though such libraries are not explicitly named here, their use is prohibited for this assignment.*

# 2. Technical Overview of Model Architecture

Modeling semantic textual similarity is complicated by the ambiguity and variability of linguistic expressions. To tackle this, you will develop and test a model comprising two components:

1. A sentence model for converting a sentence into a representation for similarity measurement. This is a convolutional neural network (CNN) architecture with multiple types of convolution and pooling, designed to capture different granularities of information.
2. A similarity measurement layer using structured similarity measurements, which compare local regions of sentence representations (obtained from the sentence model).

This approach involves two subnetworks, each processing one sentence (in parallel). The subnetworks share all their weights, and are eventually joined by a similarity measurement layer. This is followed by a fully connected layer that outputs the final similarity score. This kind of architecture is called a *Siamese network* or *twin network* in NLP research literature.

> **[Schematic diagram of a twin network](https://drive.google.com/file/d/1sqS8n145QCEjxdBo6Ztlrjyf0ahoF8eJ/view?usp=drive_link)**

*Make sure to understand the conceptual layout shown in the above schematic diagram before you proceed.*
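
To make the weight-sharing idea concrete, here is a minimal sketch of a twin network in PyTorch. It is not the architecture required by this assignment: the class name `TinyTwinNetwork` is hypothetical, and a single linear layer stands in for the convolutional sentence model. The point is only that one encoder object processes both inputs, so both subnetworks use the same parameters.

```python
import torch
import torch.nn as nn

class TinyTwinNetwork(nn.Module):
    """Hypothetical toy twin network: one shared encoder, applied to both inputs."""

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # A single encoder object is reused for both inputs, so the weights are shared.
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.Tanh())
        self.similarity = nn.CosineSimilarity(dim=-1)
        self.fc = nn.Linear(1, 1)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        h1 = self.encoder(x1)   # subnetwork 1
        h2 = self.encoder(x2)   # subnetwork 2, same parameters
        sim = self.similarity(h1, h2).unsqueeze(-1)
        return self.fc(sim).squeeze(-1)

torch.manual_seed(0)
model = TinyTwinNetwork(input_dim=8, hidden_dim=4)
a, b = torch.randn(3, 8), torch.randn(3, 8)   # three made-up input pairs
scores = model(a, b)
print(scores.shape)   # one score per pair
```

Because the encoder is shared and cosine similarity is symmetric, swapping the two inputs of this toy model leaves the score unchanged.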

# 3. Semantic Textual Similarity (STS): Technical Details and Programming


## 3.1 The SICK Dataset

You are going to use a very well-known corpus called [the SICK (Sentences Involving Compositional Knowledge) dataset](https://zenodo.org/records/2787612). It includes information other than semantic similarity of sentences, but for the purposes of this assignment, you can ignore those additional properties of this corpus.

So, let us first obtain the corpus.

In [None]:
!wget https://zenodo.org/records/2787612/files/SICK.zip

--2024-04-30 20:56:29--  https://zenodo.org/records/2787612/files/SICK.zip
Resolving zenodo.org (zenodo.org)... 188.184.98.238, 188.184.103.159, 188.185.79.172, ...
Connecting to zenodo.org (zenodo.org)|188.184.98.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 217584 (212K) [application/octet-stream]
Saving to: ‘SICK.zip’


2024-04-30 20:56:30 (382 KB/s) - ‘SICK.zip’ saved [217584/217584]



This is a `.zip` archive, so we need to extract it.

In [None]:
from zipfile import ZipFile

with ZipFile('SICK.zip', 'r') as z:
    z.extractall('sick_dataset')

You should be able to see the extracted corpus using the `Files` icon on the left sidebar here on Colab. The corpus resides in the `sick_dataset` folder, and contains a `readme.txt` and a `SICK.txt`.

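By default, this should be located in your `/content` folder on Colab. You can verify this using the `!pwd` and `!ls` commands. To change directories, use `%cd` rather than `!cd`, since `!cd` runs in a temporary subshell and does not persist.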

*Before moving forward in this assignment, check the structure of the data and understand what it provides.*

## 3.2 The `torchtext` package

This package consists of data processing utilities for natural language processing. You are being introduced to this package through this assignment with the expectation that you will find it useful not just in this assignment, but in future work related to NLP. It has the added advantage of being extremely well-integrated with the wider PyTorch project.

In [None]:
!pip install torchtext


## 3.3 Utility functions and Lexical Similarity

Here, we provide you with two utility functions: one that builds a dictionary of document frequencies, and one that computes a few features based on lexical overlap.

The use of these functions is highly recommended (but not mandatory), as these features are known to improve the performance in semantic similarity detection.

You are free to add other utility functions that compute specific features or create various dictionaries (for maintaining indices, or other mappings required by your implementation).
- **Always provide a docstring with any function you add, and also include type hints so that the data types are obvious to anyone using your code**.

In [None]:
from collections import defaultdict

import nltk

nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

import numpy as np

def pairwise_word2doc_frequency(sentence_list_1: list[list[str]],
                                sentence_list_2: list[list[str]]) -> dict[str, int]:
    """
    Calculate the document frequency of each unique token from two lists of tokenized sentences.

    This function counts how many "documents" (in this context, a pair of sentences) each unique token appears in.
    Each pair of sentences from the two lists is considered as a separate "document". If a token occurs in either
    sentence of the pair, it is counted once for that pair.

    Args:
    sentence_list_1 (list[list[str]]): The first list of tokenized sentences (each sentence is a list of tokens).
    sentence_list_2 (list[list[str]]): The second list of tokenized sentences. It should be of the same length as
                                       sentence_list_1.

    Returns:
    dict[str, int]: A dictionary where keys are the unique words and values are the number of "documents" in which the
                    word appears.

    Raises:
    ValueError: If the input lists have different lengths.
    """
    if len(sentence_list_1) != len(sentence_list_2):
        raise ValueError("Sentence lists have different lengths.")

    word2doc_counts = defaultdict(int)
    for s1, s2 in zip(sentence_list_1, sentence_list_2):
        uniquetokens = set(s1) | set(s2)
        for t in uniquetokens:
            word2doc_counts[t] += 1

    return word2doc_counts

def pairwise_lexical_overlap_features(sentence_list_1: list[list[str]], sentence_list_2: list[list[str]],
                                      word2doc_counts: dict[str, int]) -> list[list[float]]:
    """
    Calculate various lexical overlap features between two lists of tokenized sentences.

    This function computes four types of lexical overlap features for each pair of sentences:
    1. Basic overlap: The proportion of overlapping tokens in the two sentences.
    2. IDF-weighted overlap: The inverse document frequency (IDF) weighted overlap score.
    3. Content-only overlap: The basic overlap excluding stopwords.
    4. Content-only IDF-weighted overlap: The IDF-weighted overlap score excluding stopwords.

    Args:
        sentence_list_1 (list[list[str]]): The first list of tokenized sentences to be analyzed.
        sentence_list_2 (list[list[str]]): The second list of tokenized sentences to be analyzed.
        word2doc_counts (dict[str, int]): A dictionary mapping tokens to their document frequency across a corpus.

    Returns:
        list[list[float]]: A list of lists, where each sublist contains four float values for each sentence-pair:
                           [overlap, IDF-weighted overlap, content-only overlap, content-only IDF-weighted overlap]

    Raises:
        ValueError: If `sentence_list_1` and `sentence_list_2` have different lengths.
    """
    if len(sentence_list_1) != len(sentence_list_2):
        raise ValueError("Sentence lists have different lengths.")

    stopwords_set = set(stopwords.words('english'))
    num_docs = len(sentence_list_1)
    overlap_features = []
    for s1, s2 in zip(sentence_list_1, sentence_list_2):
        tokens_a_set, tokens_b_set = set(s1), set(s2)
        intersection = tokens_a_set & tokens_b_set
        # Fall back to 1 if both token sets are empty, to avoid dividing by zero.
        denom = (len(tokens_a_set) + len(tokens_b_set)) or 1
        overlap = len(intersection)/denom

        idf_intersection = sum(np.log(num_docs/word2doc_counts[t]) for t in intersection if t in word2doc_counts)
        idf_weighted_overlap = idf_intersection/denom

        tokens_a_contentset = set(t for t in s1 if t not in stopwords_set)
        tokens_b_contentset = set(t for t in s2 if t not in stopwords_set)
        intersection_content = tokens_a_contentset & tokens_b_contentset
        # A pair consisting only of stopwords has no content tokens at all.
        denom_content = (len(tokens_a_contentset) + len(tokens_b_contentset)) or 1
        overlap_content = len(intersection_content)/denom_content

        idf_intersection_content = sum(np.log(num_docs/word2doc_counts[t]) for t in intersection_content
                                       if t in word2doc_counts)
        idf_weighted_overlap_content = idf_intersection_content/denom_content

        overlap_features.append([overlap, idf_weighted_overlap, overlap_content, idf_weighted_overlap_content])

    return overlap_features
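
To see what these features measure, the following standalone sketch works through the basic and IDF-weighted overlap for one sentence pair by hand. The two sentence pairs are made up for illustration, and the computation mirrors the formulas in the functions above rather than calling them.

```python
import math

# Two made-up tokenized sentence pairs; each pair counts as one "document".
pairs = [
    (["a", "dog", "runs"], ["a", "dog", "sleeps"]),
    (["a", "cat", "sleeps"], ["the", "bird", "sings"]),
]
num_docs = len(pairs)

# Document frequency: the number of pairs in which each token occurs.
doc_freq: dict[str, int] = {}
for s1, s2 in pairs:
    for token in set(s1) | set(s2):
        doc_freq[token] = doc_freq.get(token, 0) + 1

s1, s2 = pairs[0]
shared = set(s1) & set(s2)                   # {"a", "dog"}
denominator = len(set(s1)) + len(set(s2))    # 3 + 3 = 6

basic_overlap = len(shared) / denominator    # 2 / 6
# "a" occurs in both pairs, so its IDF is log(2/2) = 0; only "dog" contributes.
idf_overlap = sum(math.log(num_docs / doc_freq[t]) for t in shared) / denominator

print(round(basic_overlap, 4), round(idf_overlap, 4))
```

Note that with this normalization (the sum of the two set sizes in the denominator), two identical sentences get a basic overlap of 0.5 rather than 1.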

Before going any further, we must fix random seeds for random number generation throughout the remainder of this assignment. This ensures that the experiments are reproducible.

In [None]:
import random
import torch

# Do not change the seed values or any line where these values are used for settings.
# Changing the seed values may prevent your results from being reproduced if needed.

SEED = 1234
DATA_SPLIT_SEED = 99
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == 'cuda':
    torch.cuda.manual_seed(SEED)
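
The effect of seeding can be checked with a small experiment. The sketch below uses only `random` and NumPy (`torch.manual_seed` behaves analogously for PyTorch's generator); the helper `draw` is a hypothetical name used only for this illustration.

```python
import random
import numpy as np

def draw(seed: int) -> tuple[float, float]:
    """Seed both generators, then draw one number from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.rand())

first = draw(1234)
second = draw(1234)
other = draw(99)

print(first == second)   # same seed, same draws
print(first == other)    # different seed, different draws
```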

## 3.4 The SICK Dataset Object in Python

Here, you are given the class to represent the SICK dataset. Some preprocessing functionality is also included. PyTorch allows for [map-style datasets](https://pytorch.org/docs/stable/data.html#map-style-datasets), which is the approach taken in this assignment.

You are encouraged to add methods and/or enhance the structure of this dataset in any way, as long as the additional code does not use a prohibited library or package. But please remember that the primary objective of this assignment is to understand convolutional neural networks (CNNs) and semantic representation of sentences. Your enhancements should be guided by those goals. Otherwise, you run the risk of overinvesting in this portion of the assignment for diminished returns!

> A standard recommendation from the teaching staff is that once you understand the `SickDataset` class in its given form, you should move on to the CNN and its filters (next section). There, you may think of enhancing the `SickDataset` class in certain ways so that the class' structure integrates smoothly with how you want to use the CNN architecture.


In [None]:
import os
import torch

import pandas as pd

from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from sklearn.model_selection import train_test_split

class SickDataset(Dataset):
    """
    A PyTorch Dataset class for loading and processing the SICK dataset.

    This class handles loading the SICK dataset from a specified path and provides methods for processing the data,
    splitting it into training, development, and test sets, and accessing individual instances.

    Attributes:
        path (str): The path to the directory containing the SICK dataset files.
        tokenizer (callable): The tokenizer function used to process text data.
        fields (list): A list specifying how each column in the dataset should be processed.
        instances (list): A list containing dictionaries, each representing an instance in the dataset.
                          Each dictionary contains the processed data for a single instance.

    Methods:
        __len__(): Returns the total number of instances in the dataset.
        __getitem__(idx): Returns the instance at the specified index.
        splits(): Splits the dataset into training, development, and test sets.
    """

    def __init__(self, path):
        self.path = path
        self.tokenizer = get_tokenizer("basic_english")
        self.fields = [('id', None), ('sentence_1', self.tokenizer), ('sentence_2', self.tokenizer),
                       ('external_features', None), ('label', float)]
        self.instances = self._process_data()

    def __len__(self):
        return len(self.instances)

    def __getitem__(self, idx):
        return self.instances[idx]

    def __create_instance(self, row):
        return {
            'id': row['pair_ID'],
            'sentence_1': row['sentence_A'],
            'sentence_2': row['sentence_B'],
            'external_features': row['overlap_features'],
            'label': float(row['relatedness_score'])
        }

    def _process_data(self):
        corpus_df = pd.read_csv(os.path.join(self.path, 'SICK.txt'), sep='\t')
        remove_trailing_space = lambda s : self.tokenizer(s.rstrip())
        sentence_1_list = corpus_df['sentence_A'].apply(remove_trailing_space).tolist()
        sentence_2_list = corpus_df['sentence_B'].apply(remove_trailing_space).tolist()

        self.word2doc_counts = pairwise_word2doc_frequency(sentence_1_list, sentence_2_list)
        corpus_df['overlap_features'] = pairwise_lexical_overlap_features(sentence_1_list, sentence_2_list,
                                                                          self.word2doc_counts)
        instances = corpus_df.apply(self.__create_instance, axis=1).tolist()

        return instances

    def splits(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        """
        Split the dataset into training (70%), development (10%), and test (20%) sets.

        Returns:
            tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]: A tuple containing three DataFrames, in the order train_df,
                                                             dev_df, and test_df.
        """
        entire_df = pd.DataFrame(self.instances)

        train_df, other = train_test_split(entire_df, test_size=0.3, random_state=DATA_SPLIT_SEED)
        dev_df, test_df = train_test_split(other, test_size=2/3, random_state=DATA_SPLIT_SEED)

        return train_df, dev_df, test_df

## 3.5 Model Description

A **convolutional neural network (CNN)** is a *regularized* feed-forward neural network that learns useful features automatically through the use of **filters**: small sets of learned weights that are slid across the input and respond to particular local patterns, while suppressing unwanted components of the input.
- A **convolution** is the process of applying a filter to text data to extract features from it. The original idea comes from image processing, where convolutions are applied to 2D grids. In NLP, we apply a filter to a sequence of words or characters.

You will use a CNN to model each sentence, by using two types of convolution filters. The intention is to model two different perspectives of the semantics of a sentence. You will also explore multiple types of pooling.

We view the input as a sequence of tokens where nearby tokens are very likely correlated. Thus, we consider a sentence $S \in \mathbb{R}^{l \times d}$ as a sequence of $l$ input words, with each word represented by a $d$-dimensional embedding.

We will need to introduce some notation for a technical description of the model. To keep things as similar as possible to our textbook,

- $S_i \in \mathbb{R}^d$ will denote the embedding of the $i^{\text{th}}$ word in the sequence
- $S_{i:j}$ will denote the concatenation of embeddings from word $i$ up to and including word $j$.
- The $k^{\text{th}}$ dimension will be denoted by $[k]$ in the superscript. That is, $S_i^{[k]}$ is dimension $k$ in the representation of word $i$ in our sentence. Similarly, $S_{i:j}^{[k]}$ is the vector of the values in the $k^{\text{th}}$ dimension of words $i$ to $j$.
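
The notation maps directly onto array indexing. The sketch below uses a made-up NumPy matrix as the sentence; the variable names are illustrative, and the only subtlety is that the text counts words and dimensions from 1 while NumPy counts from 0.

```python
import numpy as np

l, d = 5, 4                                        # toy sentence length and embedding dimension
S = np.arange(l * d, dtype=float).reshape(l, d)    # made-up "embeddings", one row per word

# The text indexes from 1; NumPy indexes from 0, hence the shifts below.
i, j, k = 2, 4, 3

S_i = S[i - 1]                       # embedding of word i, shape (d,)
S_i_to_j = S[i - 1:j].reshape(-1)    # concatenated embeddings of words i..j, shape ((j-i+1)*d,)
S_i_k = S[i - 1, k - 1]              # dimension k of word i, a scalar
S_i_to_j_k = S[i - 1:j, k - 1]       # dimension k of words i..j, shape (j-i+1,)

print(S_i, S_i_to_j.shape, S_i_k, S_i_to_j_k)
```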


### 3.5.1 Convolution filters

We define a convolution filter $F = \langle ws, w_F, b_F, h_F \rangle$ as a tuple of size 4, comprising
1. $ws$, the width of the sliding window,
2. $w_F$, the weight vector for the filter (this is a vector in $\mathbb{R}^{ws \times d}$),
3. a real-valued scalar bias $b_F$, and
4. a nonlinear activation function $h_F$.

When the above filter is applied to a sentence $S$, it computes the inner product between $w_F$ and each possible window of length $ws$ in the sentence $S$. Then, as with any feedforward neural network, we add the bias and apply the activation function. Thus, the output is a vector $\mathbf{o}_F \in \mathbb{R}^{1+l-ws}$ given by
$$\mathbf{o}_F = \langle h_F(w_F \cdot S_{i:i+ws-1} + b_F) \rangle_{i=1}^{1+l-ws}$$
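
The formula can be read as a loop over windows. The following NumPy sketch applies one filter to a toy sentence; the function name `apply_filter` and the random inputs are assumptions made for illustration, and $\tanh$ is used as the activation $h_F$.

```python
import numpy as np

def apply_filter(S: np.ndarray, w_F: np.ndarray, b_F: float, h_F=np.tanh) -> np.ndarray:
    """Apply one convolution filter (weights of shape (ws, d)) to a sentence S of shape (l, d)."""
    ws = w_F.shape[0]
    l = S.shape[0]
    outputs = []
    for i in range(l - ws + 1):
        window = S[i:i + ws]                               # S_{i:i+ws-1} in the 1-based notation
        outputs.append(h_F(np.sum(w_F * window) + b_F))    # inner product, bias, activation
    return np.array(outputs)

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))       # toy sentence: l = 6 words, d = 4 dimensions
w_F = rng.normal(size=(2, 4))     # filter with window size ws = 2
o_F = apply_filter(S, w_F, b_F=0.1)
print(o_F.shape)                  # 1 + l - ws entries
```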

First, you will need some pretrained embeddings to serve as initial $d$-dimensional word embeddings. For this, let us resort to something you have already seen in the previous assignment: GloVe embeddings.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
# !wget https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip
!unzip glove*.zip

--2024-04-30 20:57:55--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-04-30 20:57:56--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2024-04-30 21:00:35 (5.19 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Now you have the following embeddings:

- `glove.6B.50d.txt`
- `glove.6B.100d.txt`
- `glove.6B.200d.txt`
- `glove.6B.300d.txt`

You can either use the `gensim` library as shown in the previous assignment, or use these pretrained embeddings directly as follows:

In [None]:
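def load_glove_embeddings(file_path: str) -> dict[str, np.ndarray]:
    """
    Load GloVe vectors from a text file into a dictionary.

    Each line of the file holds a word followed by its whitespace-separated vector components.

    Args:
        file_path (str): Path to a GloVe text file, e.g. 'glove.6B.50d.txt'.

    Returns:
        dict[str, np.ndarray]: A mapping from each word to its embedding vector (dtype float32).
    """
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index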

# Example use:
# embeddings_file = 'glove.6B.50d.txt'
# glove_embeddings = load_glove_embeddings(embeddings_file)
#
# if 'fox' in glove_embeddings:
#     fox_embedding = glove_embeddings['fox']
#     print("Embedding vector for 'fox':", fox_embedding)
# else:
#     print("Embedding vector for 'fox' not found.")

With such pretrained embeddings, and the technical details described above, it's time to implement the convolution filter. The start of this portion is given to you below. You are free to add and/or modify fields to make this filter class richer, if you want/need (and you will almost certainly need to add more methods).

In [None]:
import torch

import torch.nn as nn

class ConvolutionalFilter(nn.Module):
    def __init__(self, window_size, embedding_dim, pooling_type, activation_function=nn.Tanh()):
        super(ConvolutionalFilter, self).__init__()
        self.window_size = window_size
        self.embedding_dim = embedding_dim
        self.weight = nn.Parameter(torch.randn(window_size, embedding_dim, device=device))
        self.bias = nn.Parameter(torch.randn(1,device=device))
        self.activation_function = activation_function
        self.pooling_type = pooling_type

    def forward(self, sentence: torch.Tensor) -> torch.Tensor:
        # In a PyTorch nn.Module, the forward function defines the forward pass computation of the neural network. This
        # function describes how input data is processed through the layers of the network to produce the output.
        # Usually, it consists of
        # 1. The input data (usually a tensor) is passed into the forward function as an argument.
        # 2. Then, you define the computation graph by specifying how the input data flows forward through the layers
        #    of your network.
        # 3. You apply operations defined by the layers (e.g., convolutional transformation, activation functions,
        #    pooling, etc.) to the input data successively, forming the forward pass of the network.
        # 4. Finally, the forward function returns the output of the network after processing the input data through all
        #    the layers.
        # Slide a window of `window_size` words over the sentence. Flattening the (l, d) sentence and unfolding with
        # step `embedding_dim` yields 1 + l - window_size windows, each holding window_size * embedding_dim values.
        windows = sentence.flatten().unfold(0, self.window_size * self.embedding_dim, self.embedding_dim)
        # Inner product of every window with the filter weights, plus bias, then activation: shape (num_windows,).
        z = self.activation_function(torch.matmul(windows, self.weight.flatten()) + self.bias)
        # Pooling reduces the filter output to a single scalar (a 0-dimensional tensor).
        if self.pooling_type == 'max':
            pooled_output = torch.max(z)
        elif self.pooling_type == 'mean':
            pooled_output = torch.mean(z)
        elif self.pooling_type == 'min':
            pooled_output = torch.min(z)
        else:
            raise ValueError(f"Unknown pooling type: {self.pooling_type!r}")
        return pooled_output
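
The `unfold` call in `forward` is compact but easy to misread. The sketch below, on a made-up sentence tensor, checks that unfolding the flattened sentence produces the same windows as slicing `ws` consecutive words explicitly.

```python
import torch

l, d, ws = 6, 4, 3
sentence = torch.arange(l * d, dtype=torch.float32).reshape(l, d)   # made-up embeddings

# Sliding windows via unfold on the flattened sentence: each window holds ws*d values,
# and consecutive windows start d values (one word) apart.
windows_unfold = sentence.flatten().unfold(0, ws * d, d)

# The same windows built explicitly, one slice of ws consecutive words at a time.
windows_loop = torch.stack([sentence[i:i + ws].flatten() for i in range(l - ws + 1)])

print(windows_unfold.shape)
print(torch.equal(windows_unfold, windows_loop))
```

Both approaches give `1 + l - ws` windows; note that a sentence shorter than the window size has no valid window, so `unfold` raises an error in that case.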

### 3.5.2 Pooling

Once you start looking into the details of the convolution filter, you will perhaps notice a discrepancy in the shape of certain intermediate results and the output vector as described. This is where **pooling** comes in.

In networks like this, the output vector of a convolution filter is usually converted to a scalar, and this conversion is done using some sort of pooling. In this context, it makes sense for us to think of an operational object triple $(ws, p, S)$ that contains a convolution layer with width $ws$, uses a pooling function $p$, and operates on the sentence $S$. These operational triples model different perspectives on the semantics of a sentence. That is, such a "perspective" can be defined as

$$\{(ws, p, S) : p \in \{\max, \min, \text{mean}\}\}$$

**Why all these complicated things?**
*We want each convolution layer to learn to recognize distinct phenomena of the input. This allows for richer modeling of compositional semantics. To this end, the design of the operational triples allows a pooling function to interact with its own underlying convolution layers independently.*

For a perspective $(ws, p, S)$, with a convolution layer with $n$ filters, the output is a vector of length $n$, whose $i^\text{th}$ entry is $p(\mathbf{o}_{F_i})$, where $F_i$ is the $i^\text{th}$ filter.
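
As a small illustration of this definition, the sketch below pools the outputs of $n = 3$ filters under each pooling function. The filter outputs are made-up numbers, and `perspective_output` is a hypothetical helper name.

```python
import numpy as np

pool = {"max": np.max, "min": np.min, "mean": np.mean}

def perspective_output(filter_outputs: list[np.ndarray], p: str) -> np.ndarray:
    """Pool each filter's output vector o_{F_i} to a scalar and collect the n scalars."""
    return np.array([pool[p](o) for o in filter_outputs])

# Made-up outputs of n = 3 filters, each over a sentence with 4 windows.
filter_outputs = [
    np.array([0.1, 0.9, -0.3, 0.2]),
    np.array([-0.5, -0.1, -0.8, -0.2]),
    np.array([0.0, 0.4, 0.4, 0.0]),
]

for p in ("max", "min", "mean"):
    print(p, perspective_output(filter_outputs, p))
```

Whatever the sentence length, each perspective yields a vector of length $n$, which is what makes sentences of different lengths comparable.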

Pooling is a common technique used in neural networks for reducing the spatial dimensions of tensors.

**Max Pooling** is a pooling operation that takes the maximum value from each patch of the input tensor. It effectively downsamples the input tensor by retaining only the maximum value from each patch. In the simplest scenario, the entire tensor is a single patch, and it is downsampled to a single scalar -- the element in that tensor which had the highest value.

**Mean Pooling**, also known as average pooling, computes the average value from each patch of the input tensor. It downsamples the input tensor by retaining the average value from each patch. In the simplest scenario, the entire tensor is downsampled to a single scalar, which is the average of all the elements in that tensor.

**Min Pooling** computes the minimum value from each patch of the input tensor. It downsamples the input tensor by retaining the minimum value from each patch. In the simplest case, the entire tensor is replaced by a scalar -- its minimum element.

These pooling operations are commonly used in convolutional neural networks (CNNs) and other types of neural network architectures to reduce the spatial dimensions of feature maps while retaining important information. They help in controlling overfitting and in extracting essential features from input data.
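
The difference between patch-wise pooling and the "simplest scenario" of a single patch can be seen on a short vector. This is a NumPy sketch with made-up values and non-overlapping patches of size 2, which is one of several possible patch layouts.

```python
import numpy as np

x = np.array([0.2, -0.7, 0.5, 0.1, -0.3, 0.9])

# Non-overlapping patches of size 2: reshape to (num_patches, patch_size) and reduce each row.
patches = x.reshape(-1, 2)
max_pooled = patches.max(axis=1)      # one maximum per patch
mean_pooled = patches.mean(axis=1)    # one average per patch
min_pooled = patches.min(axis=1)      # one minimum per patch

# Simplest case: the whole vector is one patch, so each pooling yields a single scalar.
global_max, global_mean, global_min = x.max(), x.mean(), x.min()

print(max_pooled, mean_pooled, min_pooled)
print(global_max, global_mean, global_min)
```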

### 3.5.3 Structured Similarity Computation

Recall the lexical overlap features provided to you initially as a utility function. You may or may not have used it in your code already. Now, however, is the time to look at it again.

You will combine word representations through different pooling techniques and window sizes to create the input to the similarity computation layer in your Siamese network (i.e., the `ConvolutionalTwinNetwork` coming up soon). So, simply using cosine similarity is most probably not leveraging all that hard work!

For both sentences in the input, you have used convolution filters. Some of these filters may have used max pooling, others may have used min or mean pooling. Similarly, multiple window sizes may have also been used.

Now, build a similarity measurement function with the following properties:

- it computes the similarity of the max-pooled regions of the representation of sentence 1 with the max-pooled regions of the representation of sentence 2 (and similarly for min-pooled and mean-pooled)
- it computes a weighted sum of these regional similarities, where the weights are derived from the amount of lexical overlap
- the final similarity measure abides by the basic rule of cosine similarity in that its range is $[-1, 1]$.
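
One way to satisfy all three properties is a convex combination of the regional cosine similarities: if the weights are non-negative and sum to 1, the result stays within $[-1, 1]$. The NumPy sketch below is only one possible design, not the required solution. In particular, the matrix `W` that maps overlap features to per-region weights is a hypothetical choice (in a real model it might be learned), and the function names are assumptions.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity, with a small epsilon guarding against zero vectors."""
    return float(u @ v / max(np.linalg.norm(u) * np.linalg.norm(v), eps))

def weighted_regional_similarity(rep1: np.ndarray, rep2: np.ndarray, split_sizes: list[int],
                                 overlap_features: list[float], W: np.ndarray) -> float:
    """Convex combination of per-region cosine similarities.

    W is a hypothetical (num_regions, num_features) matrix that turns the overlap
    features into one logit per region; a softmax makes the weights sum to 1.
    """
    bounds = np.cumsum(split_sizes)[:-1]
    regions1, regions2 = np.split(rep1, bounds), np.split(rep2, bounds)
    sims = np.array([cosine(a, b) for a, b in zip(regions1, regions2)])
    logits = W @ np.asarray(overlap_features)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return float(weights @ sims)

rng = np.random.default_rng(0)
rep1, rep2 = rng.normal(size=10), rng.normal(size=10)   # made-up sentence representations
W = rng.normal(size=(3, 4))                             # 3 regions, 4 overlap features
score = weighted_regional_similarity(rep1, rep2, [5, 3, 2], [0.3, 0.1, 0.25, 0.2], W)
print(-1.0 <= score <= 1.0)
```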

In [None]:
# Write your structured similarity computation function here. This is the function you should use in the siamese network
# code. Observe carefully how `nn.CosineSimilarity` works, and write your code in a similar form so that in your final
# experiments, you can swap the standard cosine similarity with your custom function (i.e., this one) and see which one
# works better.
from typing import List

def structured_similarity(sentence1_pooled: torch.Tensor, sentence2_pooled: torch.Tensor, split_size: List[int],
                          overlap_features: List[float]) -> torch.Tensor:
  """
  Calculate the similarity scores for the max-pooled, mean-pooled, and min-pooled regions of the two sentence
  representations, and concatenate them with the overlap features.

  Args:
      sentence1_pooled (torch.Tensor): Representation of sentence 1 (pooled filter outputs).
      sentence2_pooled (torch.Tensor): Representation of sentence 2 (pooled filter outputs).
      split_size (List[int]): Sizes of the max-, mean-, and min-pooled regions, in that order.
      overlap_features (List[float]): The four lexical overlap features for the sentence pair.

  Returns:
      torch.Tensor: A tensor of 7 values: the 3 regional similarity scores followed by the 4 overlap features.
  """
  cosine_similarity = nn.CosineSimilarity(dim=0)  # initialize the cosine similarity function

  # convert overlap_features to a tensor
  overlap_features_tensor = torch.tensor(overlap_features, device=device, dtype=torch.float32)

  #split the inputted sentence representations by their pooling type
  sentence1_split = torch.split(sentence1_pooled, split_size, dim=0)
  sentence2_split = torch.split(sentence2_pooled, split_size, dim=0)

  # calculate the max, mean and min similarity scores
  similarity_max = cosine_similarity(sentence1_split[0], sentence2_split[0])
  similarity_mean = cosine_similarity(sentence1_split[1], sentence2_split[1])
  similarity_min = cosine_similarity(sentence1_split[2], sentence2_split[2])

  # return a concatenated tensor of the overlap features and the 3 scores
  return torch.cat((similarity_max.unsqueeze(0), similarity_mean.unsqueeze(0), similarity_min.unsqueeze(0), overlap_features_tensor))

### 3.5.4 Siamese Network

Once you have finished the convolutional filter, the next step is to build a "Siamese" feedforward neural network whose input layer takes in two sentences.

**How to handle sentences of varying lengths?** *Pad the shorter sentence with zero vectors.*
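
A minimal sketch of zero padding, using PyTorch's `pad_sequence` utility on two made-up embedded sentences (the tensors of ones are placeholders for real embeddings):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

d = 4
sentence1 = torch.ones(3, d)    # 3 "words", made-up embeddings
sentence2 = torch.ones(5, d)    # 5 "words"

# pad_sequence appends zero rows so that every sequence matches the longest one.
padded = pad_sequence([sentence1, sentence2], batch_first=True)   # shape (2, 5, d)
padded1, padded2 = padded[0], padded[1]

print(padded.shape)
print(padded1[3:])    # the two zero vectors appended to the shorter sentence
```

One design point to keep in mind: a window that covers only padding still produces the filter output $h_F(b_F)$, so padded positions can influence min, max, and mean pooling.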

PyTorch can be used to combine the following to create a Siamese neural network:

- an input layer (with padding, as described above),
- a hidden convolutional layer (which uses the convolutional filter described earlier),
- a similarity computation layer,
- a fully connected (linear) layer, and
- the final scalar output, which is the "relatedness score" of the two sentences (recall that the gold-standard actual score is available from the dataset during training, and also for evaluation during testing).

This component of the assignment requires you to build this network. You do **not** have to build it from scratch! You should use the flexibility of PyTorch and combine layers built using PyTorch.

Skeleton code is given below. Please understand that it only shows the overall structure to be followed.



In [None]:
import torch

import torch.nn as nn
import torch.nn.functional as F

class MyConvolutionalTwinNetwork(nn.Module):
  def __init__(self, embedding_dim, glove_embedding, num_filters, similarity_computation):
    super(MyConvolutionalTwinNetwork, self).__init__()

    # initialize embedding layer with GloVe embeddings
    # (no padding_idx: index 0 is a real vocabulary word, and padding is done with explicit zero vectors in forward)
    self.embedding_layer = nn.Embedding(len(glove_embedding), embedding_dim).to(device)
    self.embedding_layer.weight.data.copy_(torch.Tensor(np.array(list(glove_embedding.values()))))
    self.word_to_index = {word: idx for idx, word in enumerate(glove_embedding.keys())}

    # determine the split of which filters use which window size and pooling
    total_size = num_filters - 3
    # roughly 50% use max pooling, 30% mean, and 20% min; each pooling type gets at least one filter
    self.split_size = [int(0.5 * total_size) + 1, int(0.3 * total_size) + 1, int(0.2 * total_size) + 1]
    # give any filters lost to rounding to mean pooling, so that the sizes sum to num_filters
    self.split_size[1] += total_size - sum(self.split_size) + 3
    # initialize filters with different window sizes (ws)
    self.convolfilters = nn.ModuleList()
    # fraction of each pooling type's filters that use ws 2; the remaining filters use ws 3
    ws2_fraction = {
        'max': 0.5,
        'mean': 0.3,
        'min': 0.5
    }
    for pooling_type, group_size in zip(['max', 'mean', 'min'], self.split_size):
      n_ws2 = int(np.ceil(group_size * ws2_fraction[pooling_type]))
      n_ws3 = group_size - n_ws2
      for _ in range(n_ws2):
        self.convolfilters.append(ConvolutionalFilter(2, embedding_dim, pooling_type))
      for _ in range(n_ws3):
        self.convolfilters.append(ConvolutionalFilter(3, embedding_dim, pooling_type))

    # determine similarity computation layer and initialize fully connected layer
    self.similarity_computation = similarity_computation
    if similarity_computation == 'structured_similarity':
      self.similarity_computation_func = structured_similarity
      self.fc = nn.Linear(7, 1).to(device)
    elif similarity_computation == 'cosine_similarity':
      self.similarity_computation_func = nn.CosineSimilarity(dim=0)
      self.fc = nn.Linear(5, 1).to(device)
    else:
      # fail early instead of leaving self.fc undefined
      raise ValueError(f"unknown similarity computation: {similarity_computation}")

  def embedding(self, sentence):
    # convert words to indices, mapping out-of-vocabulary words to the 'unk' token
    unk_index = self.word_to_index['unk']
    sentence_indices = [self.word_to_index.get(word.lower(), unk_index) for word in sentence.split()]
    sentence_tensor = torch.tensor(sentence_indices, dtype=torch.long, device=device)
    # embeds word by indices
    sentence_embedded = self.embedding_layer(sentence_tensor)

    return sentence_embedded

  def forward(self, sentence1, sentence2, overlap_features):
    # embedding layer
    sentence1_embed = self.embedding(sentence1)
    sentence2_embed = self.embedding(sentence2)

    # padding
    max_length = max(sentence1_embed.size(0), sentence2_embed.size(0))
    pad_length1 = max_length - sentence1_embed.size(0)
    padded_sentence1 = torch.cat([sentence1_embed, torch.zeros(pad_length1, sentence1_embed.size(1), device=device)], dim=0)
    pad_length2 = max_length - sentence2_embed.size(0)
    padded_sentence2 = torch.cat([sentence2_embed, torch.zeros(pad_length2, sentence2_embed.size(1), device=device)], dim=0)

    # convolutional layer (conv_filter avoids shadowing the built-in `filter`)
    sentence1_convol_output = torch.stack([conv_filter(padded_sentence1) for conv_filter in self.convolfilters])
    sentence2_convol_output = torch.stack([conv_filter(padded_sentence2) for conv_filter in self.convolfilters])

    # similarity computation layer
    if self.similarity_computation == 'cosine_similarity':
      feature = self.similarity_computation_func(sentence1_convol_output, sentence2_convol_output)
      overlap_features_tensor = torch.tensor(overlap_features, device=device, dtype=torch.float32)
      feature = torch.cat((feature.unsqueeze(0), overlap_features_tensor))
    else:
      feature = self.similarity_computation_func(sentence1_convol_output, sentence2_convol_output, self.split_size, overlap_features)

    # scale the output to the range (1, 5)
    output = torch.sigmoid(self.fc(feature)) * 4 + 1

    return output
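
The last step of `forward` squashes the linear layer's output into the SICK relatedness range. A minimal, dependency-free sketch of that mapping follows; the helper name `to_relatedness` is hypothetical and not part of the skeleton.

```python
import math

def to_relatedness(logit: float) -> float:
    """Map an unbounded logit to the open interval (1, 5) with a scaled sigmoid."""
    sigmoid = 1.0 / (1.0 + math.exp(-logit))
    return sigmoid * 4.0 + 1.0

# Large negative logits approach 1, large positive logits approach 5.
for logit in (-10.0, 0.0, 10.0):
    print(logit, round(to_relatedness(logit), 4))
```

Because the sigmoid never reaches 0 or 1, gold scores of exactly 1.0 or 5.0 can only be approached, never matched exactly.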


## 3.6 Training (fine-tuning)

With your Siamese network ready, you can now start modifying the pretrained vectors. This is easier than training a model from scratch, and typically happens much faster (even without a GPU runtime, the entire process should not take more than 1 to 1.5 hours).

The training process is nothing new, so we won't repeat any code from the previous assignment here. The steps are:

1. Define your Siamese network with pretrained embeddings
2. Define your loss function
3. Define/choose your optimizer (we recommend Adam, available in PyTorch)
4. Create your training loop (in this assignment, you will not need more than 6-8 epochs; the use of small batch sizes such as 8 is recommended)

By training your Siamese network with gold-standard labels, you effectively modify the pretrained embeddings so that they become better suited to your specific task. This process is called **fine-tuning**. In this assignment, you are fine-tuning GloVe embeddings to capture semantic similarity between sentences, and in doing so you are also creating embeddings (think of the output of your convolution layer) that capture the meaning of sentences instead of individual words.
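
As a toy illustration of fine-tuning, the sketch below loads a small random matrix as a stand-in for GloVe, keeps it trainable, and runs a few optimizer steps. The tiny vocabulary, the averaging "encoder", and the linear head are all hypothetical placeholders, not the architecture this assignment asks for.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a GloVe matrix: 5 "words" with 4-dimensional vectors (toy values).
pretrained = torch.randn(5, 4)

# freeze=False keeps the embedding weights trainable, which is what makes this fine-tuning.
embedding = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
head = nn.Linear(4, 1)  # hypothetical task head standing in for the twin network

params = list(embedding.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)
loss_fn = nn.MSELoss()

token_ids = torch.tensor([0, 2, 3])  # a toy "sentence" using words 0, 2 and 3
target = torch.tensor([4.0])         # toy gold relatedness score

for _ in range(3):  # a few steps stand in for real training epochs
    optimizer.zero_grad()
    sentence_vector = embedding(token_ids).mean(dim=0)  # average the word vectors
    prediction = head(sentence_vector)
    loss = loss_fn(prediction, target)
    loss.backward()
    optimizer.step()

# Rows that appeared in the sentence have moved away from their pretrained values;
# rows that never appeared received zero gradient and are unchanged.
changed = (embedding.weight.detach() - pretrained).abs().sum(dim=1) > 0
print(changed)
```

Only the vectors of words that actually occur in the training data are updated, which is one reason fine-tuning on a small dataset such as SICK is fast.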

One loss function you can use is the **mean squared error**. It turns out, however, that there is a better loss function for STS, called the **KL divergence loss**. It is defined as follows:

$\mathcal{L}_{KL}(\theta) = \frac{1}{m}\sum_{k=1}^m \mathrm{KL}\left(\mathbf{y} || \hat{\mathbf{y}}_\theta \right) + \lambda \Vert\theta\Vert^2$

A few things to note:

- the loss incorporates $L_2$ regularization.
- for STS, the KL divergence can be computed by considering the actual scores $\mathbf{y}$ over all test instances as the first probability distribution, and the estimated scores $\hat{\mathbf{y}}_\theta$ as the second distribution.
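
One way to read the second note, as a sketch only: normalise the gold scores and the predicted scores of a batch into probability distributions and compute the KL divergence between them, then add the $L_2$ term. The softmax normalisation and the helper names (`softmax`, `kl_divergence`, `kl_loss`) are assumptions made for illustration; the text above does not prescribe how to normalise the scores.

```python
import math
from typing import List, Sequence

def softmax(scores: Sequence[float]) -> List[float]:
    """Turn a list of scores into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_loss(gold: Sequence[float], predicted: Sequence[float],
            theta: Sequence[float], lam: float) -> float:
    """KL term between normalised gold and predicted scores plus lam * ||theta||^2."""
    data_term = kl_divergence(softmax(gold), softmax(predicted))
    l2_term = lam * sum(w * w for w in theta)
    return data_term + l2_term

gold = [1.0, 3.5, 4.8]        # toy gold relatedness scores
predicted = [4.0, 2.0, 1.5]   # toy model outputs
theta = [0.5, -0.25]          # toy parameter vector
print(kl_loss(gold, predicted, theta, lam=0.1))
print(kl_loss(gold, gold, theta, lam=0.0))  # identical distributions give zero KL
```

The KL term is zero only when the two normalised score vectors agree, and the order of the arguments matters because KL divergence is not symmetric.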

In [None]:
from tqdm import tqdm
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error

def run_training(
    embedding_dim, # embedding dim of words
    num_filters, # number of filters
    similarity_computation, # similarity layer: 'cosine_similarity' or 'structured_similarity'
    loss_name, # name of loss func: kl-loss/mse
    train_df, # training set
    dev_df, # dev set for eval per epoch
    epochs, # number of epochs
    batch_size, # batch size
    learning_rate, # learning rate
    alpha, # lambda for l2 regularization
    m # used for unique naming to save to file
    ) -> MyConvolutionalTwinNetwork:
  # initialize glove embedding
  embeddings_file = f'glove.6B.{embedding_dim}d.txt'
  glove_embeddings = load_glove_embeddings(embeddings_file)
  # initialize my network
  network = MyConvolutionalTwinNetwork(embedding_dim, glove_embeddings, num_filters, similarity_computation)

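  # choose the loss function
  if loss_name == 'kl-loss':
    # softmax turns the batch's predicted and gold scores into distributions before computing KL
    loss_func = lambda y_hat, y: F.kl_div(F.log_softmax(y_hat, dim=0), F.softmax(y, dim=0), reduction='mean')
  elif loss_name == 'mse':
    loss_func = lambda y_hat, y: F.mse_loss(y_hat, y)
  else:
    raise ValueError(f"unknown loss function: {loss_name}")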

  # initialize the optimizer and counters
  optimizer = torch.optim.Adam(network.parameters(), lr=learning_rate)
  training_loss = 0.0
  time = 0
  dev_loss = 0
  for epoch in range(epochs):
    # generate the batches from training set by shuffling the indices around
    shuffled_indices = np.random.permutation(len(train_df))
    shuffled_df = train_df.iloc[shuffled_indices]
    batches = np.array_split(shuffled_df, np.ceil(len(train_df) / batch_size))
    batch_process = tqdm(batches)

    for i, batch in enumerate(batch_process):
      # extract true scores
      true_scores = torch.tensor(np.array(batch['label']), device=device)
      predicted_scores = torch.zeros_like(true_scores, device=device)

      for j, instance in enumerate(batch.iloc):
        # predict all instances in a batch
        predicted_scores[j] = network(instance['sentence_1'], instance['sentence_2'], instance['external_features'])

      # L2 penalty: sum of the (unsquared) L2 norms of all parameter tensors
      l2_regularization = sum(torch.norm(param, p=2) for param in network.parameters())
      # calculate loss
      loss_bl2 = loss_func(predicted_scores, true_scores)
      # apply l2 reg
      loss = loss_bl2 + alpha * l2_regularization
      # backpropagation
      loss.backward()
      # update parameters
      optimizer.step()
      # reset gradient
      optimizer.zero_grad()
      # total loss
      training_loss += loss.item()
      time += 1
      if time:
        batch_process.set_description(f"Average loss: {(round(training_loss/time, 6))}, L2: {l2_regularization.item()}, loss_bl2: {loss_bl2.item()}, loss: {loss.item()}")

    with torch.no_grad():
      # eval with dev set after each epoch
      ts = dev_df['label'].tolist()
      ps = []
      for a in dev_df.iloc:
        ps.append(network(a['sentence_1'], a['sentence_2'], a['external_features']).item())
      # uses pearson and mse
      correlation_coefficient, _ = pearsonr(ps, ts)
      mse = mean_squared_error(ts, ps)
      ps_tensor = torch.tensor(ps, device=device)
      ts_tensor = torch.tensor(ts, device=device)
      l2_regularization = sum(torch.norm(param, p=2) for param in network.parameters())
      # total loss on dev set
      dev_loss += (loss_func(ps_tensor, ts_tensor) + alpha * l2_regularization).item()
      print(correlation_coefficient)
      print(mse)
      print((round(dev_loss/(epoch+1), 6)))

  # saves the whole model, can extract embedding if needed
  torch.save(network, f"glove.{embedding_dim}d.tuned.{num_filters}-filters.{loss_name}{m}.txt")

  if loss_name == 'kl-loss':
    loss_func_name = 'KL Divergence Loss'
  if loss_name == 'mse':
    loss_func_name = 'Measured Squared Error'

  content_dict = {
      "embedding_dim": embedding_dim,
      "training_loss_function": loss_func_name,
      "regularization_parameter": alpha,
      "window_sizes_used": '2, 3',
      "num_convolutional_filters_used": num_filters,
      "num_filters_with_max_pooling": network.split_size[0],
      "num_filters_with_mean_pooling": network.split_size[1],
      "num_filters_with_min_pooling": network.split_size[2],
      "similarity_computation": similarity_computation,
      "num_training_epochs": epochs,
      "batch_size_used_for_training": batch_size,
      "learning_rate": learning_rate
  }

  # saves a dict for properties
  torch.save(content_dict, f"glove.{embedding_dim}d.tuned.{num_filters}-filters.{loss_name}{m}.properties.md")

  return network

In [None]:
# initialize dataset and split it into training, test and dev set
dataset = SickDataset("/content/sick_dataset")
train_df, dev_df, test_df = dataset.splits()

In [None]:
# model 1
m1 = run_training(
    embedding_dim = 100,
    num_filters = 16,
    similarity_computation = 'structured_similarity',
    loss_name = 'kl-loss',
    train_df = train_df,
    dev_df = dev_df,
    epochs = 5,
    batch_size = 8,
    learning_rate = 0.004,
    alpha = 0.00001,
    m=1
)

Average loss: 0.036438, L2: 454.9950866699219, loss_bl2: 0.023986688205270695, loss: 0.028536639032183814: 100%|██████████| 861/861 [02:58<00:00,  4.83it/s]


0.7108176935500273
1.0939373806401635
0.004726


Average loss: 0.029166, L2: 422.44390869140625, loss_bl2: 0.014090418960285393, loss: 0.018314858111870495: 100%|██████████| 861/861 [03:09<00:00,  4.55it/s]


0.7056740041135797
0.7540792252053903
0.004559


Average loss: 0.026264, L2: 417.9087219238281, loss_bl2: 0.012286561682611748, loss: 0.016465648684442086: 100%|██████████| 861/861 [02:57<00:00,  4.85it/s]


0.7185606447636977
0.6668056665510245
0.004487


Average loss: 0.024289, L2: 414.3552551269531, loss_bl2: 0.010436776082286027, loss: 0.014580328471281913: 100%|██████████| 861/861 [02:42<00:00,  5.29it/s]


0.7249472272912776
0.4936859045434176
0.00444


Average loss: 0.022809, L2: 410.33599853515625, loss_bl2: 0.02371385852956577, loss: 0.027817218295870264: 100%|██████████| 861/861 [02:39<00:00,  5.41it/s]


0.728550940018299
0.49800652209012336
0.004404


In [None]:
# model 2
m2 = run_training(
    embedding_dim = 100,
    num_filters = 16,
    similarity_computation = 'cosine_similarity',
    loss_name = 'kl-loss',
    train_df = train_df,
    dev_df = dev_df,
    epochs = 6,
    batch_size = 8,
    learning_rate = 0.004,
    alpha = 0.00001,
    m=2
)

Average loss: 0.037359, L2: 369.777587890625, loss_bl2: 0.018155978323643546, loss: 0.021853754093622307: 100%|██████████| 861/861 [02:30<00:00,  5.71it/s]


0.6784449842777348
1.0749950251457805
0.00389


Average loss: 0.030593, L2: 302.36004638671875, loss_bl2: 0.015382062355060738, loss: 0.01840566279367046: 100%|██████████| 861/861 [02:31<00:00,  5.70it/s]


0.6994383468771559
1.0265232000218123
0.003549


Average loss: 0.027846, L2: 288.0466003417969, loss_bl2: 0.005803038846002862, loss: 0.008683504783781215: 100%|██████████| 861/861 [02:28<00:00,  5.79it/s]


0.700581337185474
0.9285386665298863
0.003386


Average loss: 0.026149, L2: 287.83453369140625, loss_bl2: 0.02818589894410758, loss: 0.031064244260383535: 100%|██████████| 861/861 [02:31<00:00,  5.67it/s]


0.6744010880699587
0.6962510323976394
0.003305


Average loss: 0.024924, L2: 289.1406555175781, loss_bl2: 0.0039973444936599115, loss: 0.006888750910552917: 100%|██████████| 861/861 [02:32<00:00,  5.65it/s]


0.671544804956969
0.603485184851955
0.003258


Average loss: 0.023896, L2: 288.9397277832031, loss_bl2: 0.004662137133735891, loss: 0.007551534222174163: 100%|██████████| 861/861 [02:31<00:00,  5.69it/s]


0.6755836447128171
0.5776179788718008
0.003226


Average loss: 0.023057, L2: 289.77325439453125, loss_bl2: 0.007815993116881814, loss: 0.010713725542362895: 100%|██████████| 861/861 [02:30<00:00,  5.72it/s]


0.667421453221254
0.5746552553983295
0.003205


In [None]:
# model 3
m3 = run_training(
    embedding_dim = 200,
    num_filters = 18,
    similarity_computation = 'structured_similarity',
    loss_name = 'kl-loss',
    train_df = train_df,
    dev_df = dev_df,
    epochs = 6,
    batch_size = 8,
    learning_rate = 0.004,
    alpha = 0.00001,
    m=3
)

Average loss: 0.047407, L2: 960.1822509765625, loss_bl2: 0.013181423659172922, loss: 0.022783245782104403: 100%|██████████| 861/861 [03:19<00:00,  4.32it/s]


0.6854695645242547
0.8284245526031891
0.009763


Average loss: 0.036189, L2: 573.0174560546875, loss_bl2: 0.005893781818978122, loss: 0.011623956300846866: 100%|██████████| 861/861 [03:29<00:00,  4.10it/s]


0.71479492807214
0.6487956070087442
0.007828


Average loss: 0.031621, L2: 561.7040405273438, loss_bl2: 0.017835913775436887, loss: 0.023452953984909066: 100%|██████████| 861/861 [03:19<00:00,  4.32it/s]


0.7252350433513228
0.5015027314479407
0.007141


Average loss: 0.029042, L2: 556.6574096679688, loss_bl2: 0.028883491160191548, loss: 0.034450065327651755: 100%|██████████| 861/861 [03:21<00:00,  4.26it/s]


0.6965090417694549
0.5467094292321882
0.006786


Average loss: 0.027391, L2: 550.689208984375, loss_bl2: 0.010691306072854677, loss: 0.01619819783010451: 100%|██████████| 861/861 [03:20<00:00,  4.29it/s]


0.6983198301017538
0.5525301867001039
0.006563


Average loss: 0.026165, L2: 540.3027954101562, loss_bl2: 0.011171031555023539, loss: 0.01657405942478453: 100%|██████████| 861/861 [03:21<00:00,  4.27it/s]


0.714758952370834
0.5096526111031333
0.006395


In [None]:
# model 4
m4 = run_training(
    embedding_dim = 200,
    num_filters = 18,
    similarity_computation = 'structured_similarity',
    loss_name = 'mse',
    train_df = train_df,
    dev_df = dev_df,
    epochs = 4,
    batch_size = 8,
    learning_rate = 0.004,
    alpha = 0.0001,
    m=4
)

Average loss: 0.785292, L2: 558.9580688476562, loss_bl2: 0.659770616071614, loss: 0.7156664214305007: 100%|██████████| 861/861 [03:19<00:00,  4.32it/s]


0.679813705652421
0.5550722563706575
0.610967


Average loss: 0.668216, L2: 557.3414306640625, loss_bl2: 0.6493225034653912, loss: 0.7050566461264859: 100%|██████████| 861/861 [03:19<00:00,  4.32it/s]


0.7096200353605859
0.5135307493949648
0.590116


Average loss: 0.610097, L2: 555.384033203125, loss_bl2: 0.7350070489515832, loss: 0.7905454499592355: 100%|██████████| 861/861 [03:22<00:00,  4.25it/s]


0.7133194679248316
0.5079637865418862
0.581244


Average loss: 0.571482, L2: 552.2826538085938, loss_bl2: 0.2767035971010616, loss: 0.33193186024078636: 100%|██████████| 861/861 [03:20<00:00,  4.30it/s]


0.7126223079908363
0.5126493010175442
0.577902


## 3.7 Experiments and Evaluation

In this final leg of the assignment, you are required to report on your experiments and the final results. Unlike the previous assignments, the evaluation here is not in terms of precision, recall, and accuracy. Rather, it is in terms of the following:

- Pearson's correlation coefficient $r$
- Mean Squared Error (MSE)

Your aim should be to achieve $r \geq 0.80$ and MSE $\leq 0.35$ on the test set. This is a somewhat ambitious aim, so please don't despair if you can't reach this goal! These numbers are given to you as a yardstick.
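
To make the two metrics concrete, here is a small dependency-free sketch of both. The function names `pearson_r` and `mse` are hypothetical; in practice `scipy.stats.pearsonr` and `sklearn.metrics.mean_squared_error` compute the same quantities.

```python
import math
from typing import Sequence

def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the squared differences between gold and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

gold = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by exactly one point

print(pearson_r(gold, pred))  # perfectly correlated despite the offset
print(mse(gold, pred))        # the offset still shows up in the error
```

The toy predictions rank every pair correctly yet miss each score by a full point, which is why the two metrics are reported together: correlation alone does not penalise a constant offset or a wrong scale.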

When you finish fine-tuning and start to achieve decent results, you should save your model as a `.zip` (see the last section on what to submit).

For grading the performance of your models, four models are required:

- Your two best models using structured similarity computation
- Your best model using cosine similarity
- Your best model among those trained using KL divergence loss. If this criterion is already fulfilled by the earlier models, then you should provide your best fine-tuned model that was trained using mean squared error as the loss function.



In [None]:
from typing import List, Optional, Tuple

import numpy as np
import pandas as pd
import torch
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error
from tabulate import tabulate

def test(finetuned_embeddings_path: str, test_df: pd.DataFrame, model_properties_file_path: Optional[str] = None) -> Tuple[np.ndarray, np.ndarray, str]:
    """
    Computes the Pearson's correlation coefficient r and the mean squared error (MSE) for fine-tuned GloVe embeddings
    and prints the result in a neat tabular structure with model details, as follows (actual values not displayed):

    Embedding dimension                            200
    Training loss function                         KL Divergence Loss
    Regularization parameter (in loss function)    0.4
    Window sizes used                              2, 3, 4
    Number of convolutional filters used           15
    Number of filters with max pooling             8
    Number of filters with min pooling             5
    Number of filters with mean pooling            2
    Similarity computation (cosine/structured)     structured
    Number of training epochs                      8
    Batch size used for training                   20

    Pearson's Correlation Coefficient              0.821
    Mean Squared Error                             0.344
    """
    true_scores = test_df['label'].tolist()
    mynetwork = torch.load(finetuned_embeddings_path)
    results = []
    if model_properties_file_path:
      content_dict = torch.load(model_properties_file_path)
      results = [[key, value] for key, value in content_dict.items()]
    pred_scores = []
    for a in test_df.iloc:
      pred_scores.append(mynetwork(a['sentence_1'], a['sentence_2'], a['external_features']).item())
    results.append(["", ""])
    correlation_coefficient, _ = pearsonr(pred_scores, true_scores)
    results.append(['Pearson\'s Correlation Coefficient', correlation_coefficient])
    mse = mean_squared_error(true_scores, pred_scores)
    results.append(['Mean Squared Error', mse])

    return np.array(pred_scores), np.array(true_scores), tabulate(results)

def worst(test_df: pd.DataFrame, pred_scores: List[float], true_scores: List[float], k: int = 10):
    """
    Returns the top k sentence pairs for which your model's estimated score was the worst (by worst, we mean the
    highest difference between the gold-standard relatedness score and your predicted score).
    """
    # score a copy so the caller's DataFrame is not left with extra columns
    abs_diff = np.abs(np.asarray(true_scores) - np.asarray(pred_scores))
    scored_df = test_df.assign(pred_scores=pred_scores, abs_diff=abs_diff)
    sorted_df = scored_df.sort_values(by='abs_diff', ascending=False)
    print(f"Top {k} worst pairs:")
    display(sorted_df.head(k))

def best(test_df: pd.DataFrame, pred_scores: List[float], true_scores: List[float], k: int = 10):
    """
    Returns the top k sentence pairs for which your model's estimated score was the best (by best, we mean that the
    difference between the gold-standard relatedness score and your predicted score was the lowest)
    """
    # score a copy so the caller's DataFrame is not left with extra columns
    abs_diff = np.abs(np.asarray(true_scores) - np.asarray(pred_scores))
    scored_df = test_df.assign(pred_scores=pred_scores, abs_diff=abs_diff)
    sorted_df = scored_df.sort_values(by='abs_diff', ascending=True)
    print(f"Top {k} best pairs:")
    display(sorted_df.head(k))

In [None]:
file_name = '/content/glove.100d.tuned.16-filters.kl-loss-structuredsimilarity'

pred_scores, true_scores, results = test(f'{file_name}.txt', test_df, f'{file_name}.properties.md')
print(results)
worst(test_df, pred_scores, true_scores)
best(test_df, pred_scores, true_scores)

---------------------------------  ---------------------
embedding_dim                      100
training_loss_function             KL Divergence Loss
regularization_parameter           1e-05
window_sizes_used                  2, 3
num_convolutional_filters_used     16
num_filters_with_max_pooling       7
num_filters_with_mean_pooling      6
num_filters_with_min_pooling       3
similarity_computation             structured_similarity
num_training_epochs                5
batch_size_used_for_training       8
learning_rate                      0.004

Pearson's Correlation Coefficient  0.6951298834188069
Mean Squared Error                 0.5400974703215263
---------------------------------  ---------------------
Top 10 worst pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
4997,5124,A potato is being sliced by a person,A person is playing a piano,"[0.25, 0.22341673820133656, 0.1666666666666666...",1.185,3.545886,2.360886
5259,5398,The man is putting chicken into the container,A man is putting some food in a box,"[0.2, 0.3409810090819531, 0.25, 0.622109644357...",4.315,1.975988,2.339012
1295,1313,A person is doing a trick with some cards,A man is frying a tortilla,"[0.15384615384615385, 0.01811918583549378, 0.0...",1.0,3.338049,2.338049
4672,4790,A woman is riding on an elephant,A soda is being opened and drunk by a woman,"[0.1875, 0.11934536558956817, 0.14285714285714...",1.0,3.325091,2.325091
8668,8828,A boy is wearing all black and is sliding down...,A kid is speeding down a steep slide,"[0.2, 0.44973287829224395, 0.09090909090909091...",4.1,1.835842,2.264158
9689,9850,A football is being held by a man with no shirt,The man is stirring the sauce for the chicken,"[0.11764705882352941, 0.07257118383675207, 0.1...",1.0,3.259403,2.259403
9664,9825,The girl is carrying a sign and a group of peo...,A woman is cleaning a man's face,"[0.1, 0.011777470793070958, 0.0, 0.0]",1.0,3.244311,2.244311
9741,9902,The dog is catching a ball,A small girl is riding in a toy car,"[0.14285714285714285, 0.016824958275815653, 0....",1.0,3.219852,2.219852
4870,4996,An email is being read by a man,A person is peeling a banana,"[0.15384615384615385, 0.01811918583549378, 0.0...",1.0,3.19475,2.19475
9618,9778,A dog is playing alone on a beach covered by snow,The buttons of a microwave are being pushed by...,"[0.1, 0.11643662193349687, 0.0, 0.0]",1.0,3.158071,2.158071


Top 10 best pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
8978,9138,A man is standing in the middle of a gym and i...,A man in a black jersey is standing in a gym,"[0.35, 0.8270588366542299, 0.36363636363636365...",3.9,3.899513,0.000487
5707,5857,Someone is stirring noodles in the water,Noodles are being boiled in water by a woman,"[0.1875, 0.6418167983608748, 0.25, 1.138686102...",4.0,3.998662,0.001338
618,619,A boy is climbing up a wall artificially built...,"A little boy, who looks fearful, is on a climb...","[0.2, 0.44332518771360385, 0.2, 0.723172018465...",3.5,3.498552,0.001448
944,952,Two young kids are playing with a man eating c...,Two young kids are eating corndogs,"[0.375, 1.3452657915030617, 0.4166666666666667...",3.9,3.903232,0.003232
3982,4065,A man is outside close to a target for blade t...,A man is outside and is throwing blades at a c...,"[0.35, 1.1131803607157107, 0.4166666666666667,...",3.9,3.89602,0.00398
6583,6738,The people are sitting on a bench in front of ...,The people are sitting in a restaurant in fron...,"[0.42857142857142855, 0.9559200787016386, 0.4,...",4.0,4.004121,0.004121
2622,2657,The lady is eating slices of meat,The lady is slicing up the meat,"[0.3076923076923077, 0.7170249548590034, 0.285...",3.4,3.404528,0.004528
436,437,A woman is jumping with a bicycle,The helmeted girl is perching on a bike,"[0.14285714285714285, 0.016824958275815653, 0....",2.8,2.805148,0.005148
5951,6103,A brown and black dog is running through the l...,A dog and a black man are running through brow...,"[0.4, 1.015052629938546, 0.45454545454545453, ...",3.4,3.405848,0.005848
4996,5123,A person is slicing a potato,There is no person slicing a potato,"[0.4166666666666667, 0.8444685890884692, 0.5, ...",2.9,2.906138,0.006138


In [None]:
file_name = '/content/glove.100d.tuned.16-filters.kl-loss-cosinesimilarity'

pred_scores, true_scores, results = test(f'{file_name}.txt', test_df, f'{file_name}.properties.md')
print(results)
worst(test_df, pred_scores, true_scores)
best(test_df, pred_scores, true_scores)

---------------------------------  ------------------
embedding_dim                      100
training_loss_function             KL Divergence Loss
regularization_parameter           1e-05
window_sizes_used                  2, 3
num_convolutional_filters_used     16
num_filters_with_max_pooling       7
num_filters_with_mean_pooling      6
num_filters_with_min_pooling       3
similarity_computation             cosine_similarity
num_training_epochs                7
batch_size_used_for_training       8
learning_rate                      0.004

Pearson's Correlation Coefficient  0.6739208335937442
Mean Squared Error                 0.5703112110472557
---------------------------------  ------------------
Top 10 worst pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
4484,4593,Some onions are being cut by a person,A hamster is singing,"[0.08333333333333333, 0.008142619541156339, 0....",1.0,4.327632,3.327632
4870,4996,An email is being read by a man,A person is peeling a banana,"[0.15384615384615385, 0.01811918583549378, 0.0...",1.0,4.160628,3.160628
4997,5124,A potato is being sliced by a person,A person is playing a piano,"[0.25, 0.22341673820133656, 0.1666666666666666...",1.185,4.212835,3.027835
4672,4790,A woman is riding on an elephant,A soda is being opened and drunk by a woman,"[0.1875, 0.11934536558956817, 0.14285714285714...",1.0,3.955758,2.955758
4859,4985,A woman is frying something,A horse is being ridden by a woman,"[0.25, 0.1591271541194242, 0.16666666666666666...",1.4,4.27386,2.87386
4176,4263,The horse is being ridden by the girl,The girl is spraying the plants with water,"[0.21428571428571427, 0.2412669961645377, 0.14...",1.4,4.227293,2.827293
1515,1536,A horse is running,A woman is cutting an onion,"[0.2, 0.023554941586141916, 0.0, 0.0]",1.0,3.801035,2.801035
5232,5369,A flute is being played by a man,A man is riding a scooter,"[0.25, 0.11095179664322176, 0.1666666666666666...",1.3,4.079238,2.779238
5266,5405,Eggs are being cracked into a bowl by a woman,A woman is placing skewers onto a rack,"[0.125, 0.11073049175409673, 0.111111111111111...",1.7,4.339128,2.639128
4670,4788,An elephant is being ridden by a woman,A woman is opening a soda and drinking it,"[0.1875, 0.11934536558956817, 0.14285714285714...",1.0,3.559777,2.559777


Top 10 best pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
1096,1106,A man is opening a small package that contains...,A man is opening a package that contains headp...,"[0.47058823529411764, 1.7672672212712859, 0.45...",4.3,4.297334,0.002666
2570,2605,The monkey is brushing a bull dog,The monkey is not brushing a bull dog,"[0.4666666666666667, 1.2958696593079113, 0.5, ...",4.0,3.997027,0.002973
4507,4623,The boy is cutting the grass by the curb with ...,There is no boy cutting grass with scissors,"[0.35294117647058826, 1.0337056096823525, 0.44...",4.1,4.096787,0.003213
6736,6891,A girl is jumping off a rock and another is st...,A girl is jumping on a rock and another is sit...,"[0.42857142857142855, 1.0310459876150664, 0.4,...",3.8,3.803302,0.003302
489,490,Two daschunds are playing with a red ball,Two brown dogs are fighting over a red ball an...,"[0.2727272727272727, 0.5593231594304615, 0.307...",3.7,3.703827,0.003827
1794,1815,The person is dicing up a pepper which is orange,The orange colored dices are being cast by a c...,"[0.18181818181818182, 0.5195428584863551, 0.18...",2.9,2.896044,0.003956
6463,6618,Four boys are kneeling next to each other in f...,The young boys are not posing with a green soc...,"[0.19230769230769232, 0.38696122637111724, 0.1...",3.0,3.00445,0.00445
8978,9138,A man is standing in the middle of a gym and i...,A man in a black jersey is standing in a gym,"[0.35, 0.8270588366542299, 0.36363636363636365...",3.9,3.895399,0.004601
9340,9500,A big brown and white spotted dog is lying on ...,The dog is having a nap in the park,"[0.19047619047619047, 0.14082652836446316, 0.0...",2.9,2.893877,0.006123
837,844,A woman with a ponytail is climbing down a wal...,A woman with a ponytail is climbing a wall of ...,"[0.47368421052631576, 1.3282970879272995, 0.5,...",4.085,4.091183,0.006183


In [None]:
file_name = '/content/glove.200d.tuned.18-filters.kl-loss-structuredsimilarity'

pred_scores, true_scores, results = test(f'{file_name}.txt', test_df, f'{file_name}.properties.md')
print(results)
worst(test_df, pred_scores, true_scores)
best(test_df, pred_scores, true_scores)

---------------------------------  ---------------------
embedding_dim                      200
training_loss_function             KL Divergence Loss
regularization_parameter           1e-05
window_sizes_used                  2, 3
num_convolutional_filters_used     18
num_filters_with_max_pooling       8
num_filters_with_mean_pooling      6
num_filters_with_min_pooling       4
similarity_computation             structured_similarity
num_training_epochs                6
batch_size_used_for_training       8
learning_rate                      0.004

Pearson's Correlation Coefficient  0.6900155031706035
Mean Squared Error                 0.5441464128481119
---------------------------------  ---------------------
Top 10 worst pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
4997,5124,A potato is being sliced by a person,A person is playing a piano,"[0.25, 0.22341673820133656, 0.1666666666666666...",1.185,4.196121,3.011121
4870,4996,An email is being read by a man,A person is peeling a banana,"[0.15384615384615385, 0.01811918583549378, 0.0...",1.0,3.763891,2.763891
4859,4985,A woman is frying something,A horse is being ridden by a woman,"[0.25, 0.1591271541194242, 0.16666666666666666...",1.4,4.034573,2.634573
4484,4593,Some onions are being cut by a person,A hamster is singing,"[0.08333333333333333, 0.008142619541156339, 0....",1.0,3.533337,2.533337
1288,1306,There is no man playing flute,A man is playing soccer,"[0.2727272727272727, 0.28046744781265803, 0.33...",1.1,3.570554,2.470554
2508,2543,Someone is cutting a tomato,The person is slicing a vegetable,"[0.18181818181818182, 0.021413583260129012, 0....",4.6,2.166119,2.433881
9633,9793,A person is reading the email,A person with a green shirt is jumping high ov...,"[0.23529411764705882, 0.19389242668941295, 0.1...",1.1,3.385404,2.285404
3599,3668,A man is strolling in the rain,A man is walking in the rain,"[0.42857142857142855, 0.628848238623261, 0.333...",5.0,2.745414,2.254586
1515,1536,A horse is running,A woman is cutting an onion,"[0.2, 0.023554941586141916, 0.0, 0.0]",1.0,3.247742,2.247742
2419,2452,A baby rhino is following an adult rhino,"A rhino, which is a youngling, is walking arou...","[0.15789473684210525, 0.3190770829432234, 0.09...",4.8,2.567545,2.232455


Top 10 best pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
421,422,Two men are holding fishing poles and fishing ...,Two men are holding fishing poles,"[0.42857142857142855, 1.6575520157765118, 0.45...",4.3,4.299221,0.000779
7507,7662,The girl in the black hat is wearing a bikini ...,A young lady in a bikini is lying on the beach,"[0.2857142857142857, 0.5587523069019342, 0.181...",3.5,3.500922,0.000922
2742,2777,There is no person pouring olive oil into a po...,A person is pouring cooking oil into a pot,"[0.3333333333333333, 0.9429884688024084, 0.363...",3.8,3.799068,0.000932
1472,1492,The man is stirring oil,The man is stirring the sauce for the chicken,"[0.3333333333333333, 0.5976575416389313, 0.285...",3.6,3.598866,0.001134
9830,9991,The young girl is blowing a bubble that is huge,There is no girl in pink twirling a ribbon,"[0.16666666666666666, 0.1589043880829679, 0.11...",2.1,2.098631,0.001369
6120,6272,A dog with a golden coat is swimming clumsily ...,A dog is swimming in a lake and the sun is shi...,"[0.21739130434782608, 0.3220276503189902, 0.18...",3.3,3.296967,0.003033
4576,4694,A man is holding a small animal in one hand,A man is exhibiting a small monkey,"[0.26666666666666666, 0.32035471821050326, 0.2...",3.665,3.669563,0.004563
9308,9468,A young swimmer is sitting in a bathing suit a...,A young swimmer is standing in a bathing suit ...,"[0.4642857142857143, 1.5444167540018388, 0.437...",4.3,4.294511,0.005489
5582,5726,A cow is not eating hay,A cow is eating hay,"[0.45454545454545453, 1.5035506411393478, 0.5,...",3.7,3.692907,0.007093
6369,6521,A girl in a brown dress is being surrounded by...,The sun is not shining on the face of the girl...,"[0.17857142857142858, 0.3257044775157434, 0.15...",1.9,1.890922,0.009078


In [None]:
file_name = '/content/glove.200d.tuned.18-filters.mse-structuredsimilarity'

pred_scores, true_scores, results = test(f'{file_name}.txt', test_df, f'{file_name}.properties.md')
print(results)
worst(test_df, pred_scores, true_scores)
best(test_df, pred_scores, true_scores)

---------------------------------  ----------------------
embedding_dim                      200
training_loss_function             Measured Squared Error
regularization_parameter           0.0001
window_sizes_used                  2, 3
num_convolutional_filters_used     18
num_filters_with_max_pooling       8
num_filters_with_mean_pooling      6
num_filters_with_min_pooling       4
similarity_computation             structured_similarity
num_training_epochs                4
batch_size_used_for_training       8
learning_rate                      0.004

Pearson's Correlation Coefficient  0.6986376124486324
Mean Squared Error                 0.5392236397246336
---------------------------------  ----------------------
Top 10 worst pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
5601,5745,A tiger cub is playing with a ball,A baby is playing with a doll,"[0.3076923076923077, 0.30390669190598224, 0.14...",1.4,3.861058,2.461058
8668,8828,A boy is wearing all black and is sliding down...,A kid is speeding down a steep slide,"[0.2, 0.44973287829224395, 0.09090909090909091...",4.1,1.725043,2.374957
4241,4329,A man is making a sketch,The man is drawing,"[0.2222222222222222, 0.1370789028027539, 0.2, ...",4.9,2.579376,2.320624
4599,4717,A monkey is kicking a person,A person is being kicked by a monkey,"[0.3333333333333333, 0.6298103112438356, 0.333...",4.9,2.592914,2.307086
9794,9955,"Two men are doing a skit, which is very funny,...",A rabbit is playing with a toy rabbit,"[0.125, 0.08747313308597149, 0.0, 0.0]",1.0,3.302805,2.302805
9634,9794,A monkey is brushing the dog,One man is jumping off a rock wall and another...,"[0.1111111111111111, 0.01308607865896773, 0.0,...",1.0,3.275209,2.275209
4264,4354,A jet is not flying,A jet is flying,"[0.4444444444444444, 1.4378212920683706, 0.5, ...",2.4,4.631783,2.231783
2419,2452,A baby rhino is following an adult rhino,"A rhino, which is a youngling, is walking arou...","[0.15789473684210525, 0.3190770829432234, 0.09...",4.8,2.596921,2.203079
1604,1625,The man is playing a piano,There is a guy cheerfully playing with a footbag,"[0.21428571428571427, 0.14907008689827658, 0.1...",1.5,3.658001,2.158001
8993,9153,A lady is wearing a black dress and is carryin...,A man in a black dress is pulling a cart and i...,"[0.16666666666666666, 0.3064587039293566, 0.10...",1.4,3.535406,2.135406


Top 10 best pairs:


Unnamed: 0,id,sentence_1,sentence_2,external_features,label,pred_scores,abs_diff
5728,5880,A skateboarder is doing acrobatics on a half pipe,A skateboarder is doing tricks on a half pipe,"[0.4375, 1.5921989298170784, 0.375, 2.49161754...",4.6,4.600728,0.000728
2125,2151,The picture of a man is being taken near a lake,A man is taking pictures of a lake,"[0.29411764705882354, 0.47475895711315225, 0.2...",3.4,3.399128,0.000872
4162,4249,There is no clown singing and people are not d...,A clown is singing and people are dancing,"[0.3888888888888889, 1.0799934702716998, 0.5, ...",3.5,3.498847,0.001153
1974,1999,The man is intensely playing the guitar,The guitar player is loving one girl,"[0.23076923076923078, 0.298904803676139, 0.111...",2.7,2.701306,0.001306
3008,3048,A man is athletically jumping a wall,A man is jumping onto a low wall,"[0.38461538461538464, 0.6477008688693903, 0.33...",4.0,3.998386,0.001614
494,495,A little dog is slowing down and the owner is ...,A little dog is sprinting and its owner is try...,"[0.2916666666666667, 0.7845474339859532, 0.25,...",3.9,3.898206,0.001794
6807,6962,The man in the white hat is playing guitar on ...,Four people are performing music outdoors,"[0.0, 0.0, 0.0, 0.0]",2.7,2.69752,0.00248
9172,9332,A group of people is walking down the street a...,A group of people is walking down the street a...,"[0.4838709677419355, 1.1243547135914058, 0.466...",4.5,4.49593,0.00407
4089,4175,A woman is slicing an eggplant,There is no woman slicing an eggplant,"[0.38461538461538464, 1.1391218252304063, 0.5,...",3.8,3.794676,0.005324
936,944,A young child is barefoot and is walking past ...,A little boy is jumping into a puddle on the s...,"[0.13636363636363635, 0.27716163162420515, 0.0...",2.9,2.905514,0.005514


Complete the above `test` function's implementation. You may want to store various model properties in a file (e.g., the actual fine-tuned embeddings are in a file `blah-blah.txt`; and all the details about its training and other properties may be stored in `blah-blah.properties`).

If your `test` function expects such a properties file, make sure that the file is also saved in a human-readable format and included in your submission.
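
One possible way to keep such a properties file human-readable is a two-column Markdown table. The sketch below is only an illustration: the helper names `save_properties` and `load_properties` are hypothetical, and the example keys are taken from the results table printed above. Nothing here is required by the assignment.

```python
from pathlib import Path


def save_properties(path, properties):
    """Write model properties as a two-column Markdown table (hypothetical helper)."""
    lines = ["| property | value |", "| --- | --- |"]
    for key, value in properties.items():
        lines.append(f"| {key} | {value} |")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_properties(path):
    """Read the table written by save_properties back into a dict of strings."""
    properties = {}
    # Skip the header row and the separator row.
    for line in Path(path).read_text(encoding="utf-8").splitlines()[2:]:
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 2:
            properties[cells[0]] = cells[1]
    return properties


# Example values copied from the results table above.
example_properties = {
    "embedding_dim": 200,
    "window_sizes_used": "2, 3",
    "learning_rate": 0.004,
}
```

All values are read back as strings, so a `test` function built on this sketch would need to convert numeric properties itself. Values containing a `|` character would break this simple format.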

For your four models (as required for grading, described earlier), run the `test` function. Each run should be in a separate code cell. **Do NOT remove the results of these runs.**
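
The results tables above report Pearson's correlation coefficient and the mean squared error. As a reference for what these two numbers measure, here is a minimal pure-Python sketch; the function names are hypothetical, and an actual `test` implementation may well compute them with NumPy or Scikit-Learn instead.

```python
import math


def mean_squared_error(pred_scores, true_scores):
    """Average squared difference between predicted and gold scores."""
    total = sum((p - t) ** 2 for p, t in zip(pred_scores, true_scores))
    return total / len(true_scores)


def pearson_r(pred_scores, true_scores):
    """Pearson's correlation coefficient between predicted and gold scores."""
    n = len(true_scores)
    mean_p = sum(pred_scores) / n
    mean_t = sum(true_scores) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred_scores, true_scores))
    var_p = sum((p - mean_p) ** 2 for p in pred_scores)
    var_t = sum((t - mean_t) ** 2 for t in true_scores)
    return cov / math.sqrt(var_p * var_t)


# Predictions that are off by a constant still correlate perfectly,
# which is why the two metrics are reported side by side.
gold = [1.0, 2.0, 3.0, 4.0]
shifted = [t + 1.0 for t in gold]
```

With these toy lists, `pearson_r(shifted, gold)` is 1 up to floating-point error while `mean_squared_error(shifted, gold)` is 1.0: correlation rewards getting the ranking and linear trend right, whereas MSE also penalizes the offset.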

Similarly, for each one of your four models, return the results of calling the `best` and `worst` functions. You are free to change the signature of these two functions, if needed. Each run should be in a separate code cell, and clearly mention which model's `best` and `worst` are being called. **Do NOT remove the results of these runs.**
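
The `abs_diff` column in the outputs above suggests that `best` and `worst` rank sentence pairs by the absolute difference between the predicted score and the label. The sketch below shows that ranking idea on plain Python lists. `rank_pairs` is a hypothetical helper, not the signature used in this notebook (which takes a DataFrame), and the three example rows are copied from the outputs above.

```python
def rank_pairs(pairs, pred_scores, true_scores, k=10, worst=True):
    """Return the k pairs with the largest (worst=True) or smallest absolute error."""
    rows = []
    for pair, pred, true in zip(pairs, pred_scores, true_scores):
        rows.append({
            "pair": pair,
            "label": true,
            "pred_score": pred,
            "abs_diff": abs(pred - true),
        })
    # Largest error first for the worst pairs, smallest error first for the best.
    rows.sort(key=lambda row: row["abs_diff"], reverse=worst)
    return rows[:k]


pairs = [
    ("A jet is not flying", "A jet is flying"),
    ("The man is playing a piano", "There is a guy cheerfully playing with a footbag"),
    ("A man is athletically jumping a wall", "A man is jumping onto a low wall"),
]
labels = [2.4, 1.5, 4.0]
preds = [4.631783, 3.658001, 3.998386]

worst_pair = rank_pairs(pairs, preds, labels, k=1, worst=True)[0]
best_pair = rank_pairs(pairs, preds, labels, k=1, worst=False)[0]
```

Passing a larger `k` is the easiest way to look for recurring patterns among the worst pairs, as question 4.2 asks.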


# 4. Conceptual Questions and Qualitative Analysis

**4.1** Write down a brief but precise description of your structured similarity computation. Then, write this similarity using a mathematical formula. Make sure your formula and your implementation are faithful to each other!

> For my structured similarity computation, I compute the cosine similarity between the two sentences' max-pooled regions, between their mean-pooled regions, and between their min-pooled regions. I then concatenate these three scores with the four lexical overlap features into a single vector of size 1×7, which is passed to the fully connected layer.

> cosine_similarity($A_{max\_pooled}$, $B_{max\_pooled}$) = $\frac{A_{max\_pooled} \cdot B_{max\_pooled}}{\lVert A_{max\_pooled} \rVert \, \lVert B_{max\_pooled} \rVert}$

> cosine_similarity($A_{mean\_pooled}$, $B_{mean\_pooled}$) = $\frac{A_{mean\_pooled} \cdot B_{mean\_pooled}}{\lVert A_{mean\_pooled} \rVert \, \lVert B_{mean\_pooled} \rVert}$

> cosine_similarity($A_{min\_pooled}$, $B_{min\_pooled}$) = $\frac{A_{min\_pooled} \cdot B_{min\_pooled}}{\lVert A_{min\_pooled} \rVert \, \lVert B_{min\_pooled} \rVert}$
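
The computation described in 4.1 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the notebook's actual implementation: the function names and the dictionary layout of the pooled regions are hypothetical, the pooled vectors are toy values, and returning 0 for a zero-norm vector is a choice made here. The four lexical features are copied from one row of the output above.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def structured_similarity_features(pooled_a, pooled_b, lexical_features):
    """Build the 7-dimensional input to the fully connected layer.

    pooled_a and pooled_b map 'max', 'mean' and 'min' to the pooled
    filter outputs of sentence A and sentence B (hypothetical layout).
    """
    sims = [
        cosine_similarity(pooled_a[region], pooled_b[region])
        for region in ("max", "mean", "min")
    ]
    return sims + list(lexical_features)


# Toy pooled vectors; real ones would come from the convolutional filters.
sentence_a = {"max": [1.0, 0.0], "mean": [0.5, 0.5], "min": [0.0, 1.0]}
sentence_b = {"max": [1.0, 0.0], "mean": [0.5, 0.5], "min": [1.0, 0.0]}
lexical = [0.2, 0.023554941586141916, 0.0, 0.0]

features = structured_similarity_features(sentence_a, sentence_b, lexical)
```

In a PyTorch model the same three similarities would more likely be computed on batched tensors, but the resulting feature vector has the same seven entries.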

**4.2** Identify at least one linguistic pattern in the worst performing sentence-pairs as shown by the call to `worst` for your best performing model (you may need to use a non-default value of `k`). What aspect of your fine-tuning process will you change to improve the predicted scores for sentences that fall into this pattern? Why do you think this change will work to improve the result for these sentence-pairs?

> My best model used MSE as the loss function together with structured similarity. One linguistic pattern I see is that some sentence pairs have many overlapping words in the same window, yet their meanings are very different. For example, the action word "playing" appears in both sentences in almost the same window, but the subject performing the action is vastly different. Another pattern is that two sentences may share many words, but the overlap occurs in different windows of the sentences. Also, the meaning of a multi-word window is sometimes similar to the meaning of a single word, as with "making a sketch" and "drawing", but this is not being captured correctly. I might use a little more min-pooling, because it may help capture sentences with almost the same words but opposite meanings, or increase the window size to capture more details of the sentences.

---

# Collaboration Policy

You may discuss any details of this assignment at a conceptual level with anyone. In fact, discussion of ideas and helping each other to gain a better understanding of the concepts and the mathematical principles is encouraged. But any written answers (natural language or programming language) must be entirely your own original work.

There must not be any collaboration in programming (including the design, implementation, and debugging of code). There must not be any code in your submission that is written by anyone other than you (whether human or AI). Submitted code will be checked against other submissions AND against AI-generated code, and evidence of plagiarism will lead to academic dishonesty charges.

---

# What to submit?

**This assignment is due by 11:59 pm, APR 28 (Sunday)**

- Make your Colab notebook publicly accessible. You can do this by clicking "Share" in the top-right corner of your notebook and making sure that "anyone with the link" can view it. Also make sure that viewers are allowed to download your notebook. Then, **put this link in the comment section of your submission on Brightspace**.

- Create an empty folder (locally, on your computer) called `firstname-lastname-cse354-hw4`. For example, John Doe will create `john-doe-cse354-hw4`.
    * Download this Colab notebook with all the questions (code as well as the text questions) implemented/answered. This will be a single Python notebook, as a `.ipynb` file. Put the notebook in your folder.
    * Also put your saved models, i.e., the fine-tuned embeddings, in this folder. *These are the four models mentioned in the 'Experiments and Evaluation' section of this assignment.*

    Zip this folder (i.e., create `firstname-lastname-cse354-hw4.zip`) and submit on Brightspace.

Once unzipped, your submission is expected to have the following structure (if you don't need to store the model properties for the evaluation functions to work, then the `*.properties.md` files are not needed):

```
john-doe-cse354-hw4
├── CSE354-Assignment-4.ipynb
├── README.md (optional)
├── glove.100d.tuned.15-filters.kl-loss.txt
├── glove.100d.tuned.15-filters.kl-loss.properties.md
├── glove.200d.tuned.18-filters.cosine-loss.txt
├── glove.200d.tuned.18-filters.cosine-loss.properties.md
├── glove.100d.tuned.16-filters.kl-loss.txt
├── glove.100d.tuned.16-filters.kl-loss.properties.md
├── glove.100d.tuned.8-filters.kl-loss.txt
└── glove.100d.tuned.8-filters.kl-loss.properties.md
```

If you want to store the model properties in some file format other than `md`, that is ok. But if you do so, it is **strongly recommended** that you keep the format very intuitive *and* describe it in your README.md file.