### Few-Shot Learning & One-Shot Learning
- https://www.kaggle.com/competitions/humpback-whale-identification/discussion/73454

- Challenge: Traditional machine learning models require large amounts of labeled data to train effectively.  Few-shot and one-shot learning are geared towards scenarios where you have very limited labeled data per class.

- **Few-Shot Learning**:  Training models to learn new categories from a few examples (typically 2-5).

- **One-Shot Learning**:  An extreme case of few-shot learning where you only have a single example per new category.

- Techniques:
    - Meta-Learning: Focuses on "learning to learn." The model is trained across many tasks to become good at adapting to new classes with minimal data.
    - Data Augmentation: Generating variations of your limited examples to expand the training set.
    - Transfer Learning: Leveraging a pre-trained model and fine-tuning it on the small dataset.
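
As a rough illustration of learning from a handful of examples, the sketch below classifies a query by comparing it to one "prototype" (mean embedding) per class. This is only one possible approach, and the 2-D embeddings, class names, and helper functions are hypothetical; in practice the embeddings would come from a pre-trained or meta-learned encoder.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify_by_prototype(query, support_set):
    """Return the class whose prototype (mean support embedding) is nearest to the query."""
    prototypes = {label: mean_vector(examples) for label, examples in support_set.items()}
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))

# Hypothetical embeddings: two labeled examples ("shots") per class.
support_set = {
    "whale_a": [[0.9, 0.1], [1.1, -0.1]],
    "whale_b": [[-1.0, 0.2], [-0.8, 0.0]],
}
print(classify_by_prototype([0.7, 0.0], support_set))  # whale_a
```

With a single example per class, the same function covers the one-shot case, because the prototype is then just that one embedding.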

### Semi-Supervised Learning

- Semi-Supervised Learning is a machine learning approach that combines a small amount of labeled data with a large amount of unlabeled data during training.
- Scenario: You have a dataset with a small portion of labeled data and a much larger portion of unlabeled data.
- Goal: Semi-supervised learning aims to leverage both the labeled and unlabeled data to improve the model's performance.
- Techniques:
    - Self-Training: The model trains on the labeled data, predicts labels for the unlabeled data, and then retrains on the combined set.
    - Co-Training: Multiple models train on different aspects (or views) of the data, and their predictions are used to label unlabeled examples for each other.
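
The self-training loop can be sketched in a few lines. The version below is a simplified illustration, not a reference implementation: it uses a toy nearest-centroid classifier on 1-D features, and it only accepts pseudo-labels whose confidence margin reaches `min_margin`, which is a common variant and an assumption added here. All function names and data are hypothetical.

```python
def fit_centroids(points, labels):
    """Nearest-centroid 'model': one mean value per class (1-D features)."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(centroids, x):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=5):
    """Repeatedly pseudo-label confident unlabeled points and refit on the combined set."""
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = remaining
    return fit_centroids(xs, ys), pool

centroids, leftover = self_train([0.0, 10.0], ["low", "high"], [1.0, 2.0, 9.0, 8.0, 5.2])
print(centroids)  # {'low': 1.0, 'high': 9.0}
print(leftover)   # [5.2]
```

The ambiguous point 5.2 never receives a pseudo-label because it sits almost midway between the two centroids.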

### Pairwise Learning

- Pairwise Learning is a strategy used in machine learning where the model learns from pairs of examples.
- It is often used in tasks like ranking, recommendation systems, or anywhere the relative comparison between two items is important. For example, in a recommendation system, pairwise learning could be used to learn user preferences by comparing pairs of items to determine which is preferred.
- Applications:
    - Ranking: Learning to rank items (e.g., search results).
    - Metric Learning: Learning a distance function that meaningfully captures similarity between examples.
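
For ranking, one common way to learn from pairs is a pairwise hinge loss: the preferred item should score higher than the other item by at least a margin. The sketch below assumes the model's scores are already available as a dictionary; the item names, scores, and function names are hypothetical.

```python
def pairwise_hinge_loss(score_preferred, score_other, margin=1.0):
    """Zero when the preferred item outscores the other by at least `margin`."""
    return max(0.0, margin - (score_preferred - score_other))

def mean_pairwise_loss(scores, preferences, margin=1.0):
    """scores: item -> model score; preferences: list of (preferred, other) pairs."""
    losses = [pairwise_hinge_loss(scores[a], scores[b], margin) for a, b in preferences]
    return sum(losses) / len(losses)

# Hypothetical model scores for three search results.
scores = {"doc_a": 2.5, "doc_b": 1.0, "doc_c": 0.75}
# Each tuple says "the first item should rank above the second".
preferences = [("doc_a", "doc_b"), ("doc_b", "doc_c")]
print(mean_pairwise_loss(scores, preferences))  # 0.375
```

The first pair is already separated by more than the margin and contributes nothing; the second pair is ordered correctly but too close, so it still contributes a loss of 0.75.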

### Siamese Networks, Triplet Loss, & Contrastive Loss

- Architecture of Siamese Network
    <div>
    <img src = "nlp_images/siameseNetwork.png" width = 700>
    </div>

- **Siamese Networks**: Neural networks with two identical "branches" that share weights. Each branch processes one input, and the resulting feature vectors are compared to measure how similar the inputs are. "Identical" here means the branches have the same architecture and the same parameters, so every parameter update is mirrored across both sub-networks.
- **Loss Functions**
    - **Triplet Loss**: Used to train Siamese-style networks on triplets of examples. A triplet consists of:
        - Anchor: A reference (baseline) example.
        - Positive: An example similar to the Anchor.
        - Negative: An example dissimilar to the Anchor.
        - Triplet loss minimizes the distance between the Anchor and Positive representations while maximizing the distance between the Anchor and Negative representations, typically until the Negative is farther away than the Positive by at least a margin.
            <div>
            <img src = "nlp_images/tripletLoss.png" width = 400>
            </div>

    - **Contrastive Loss**: A widely used distance-based loss, as opposed to more conventional prediction-error losses. It is used to learn embeddings in which two similar points have a small Euclidean distance and two dissimilar points have a large one. It operates on pairs:
        - Positive Pair: Two examples that are similar.
        - Negative Pair: Two examples that are dissimilar.
        - Contrastive loss encourages representations of positive pairs to be close together and those of negative pairs to be far apart.
        <div>
        <img src = "nlp_images/ContrastiveLoss.png" width = 400>
        </div>

- Signature verification with Siamese Networks:
    <div>
    <img src = "nlp_images/verification_SNN.png" width = 700>
    </div>
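
To make the contrastive loss described above concrete, here is a minimal sketch of one common formulation: similar pairs are penalized by their squared distance, and dissimilar pairs are penalized only while they are closer than a margin. The function names, the margin value, and the example points are assumptions for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, is_similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = euclidean(u, v)
    if is_similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], is_similar=True))   # 12.5
print(contrastive_loss([0.0, 0.0], [0.5, 0.0], is_similar=False))  # 0.125
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], is_similar=False))  # 0.0
```

In a Siamese setup, `u` and `v` would be the outputs of the same weight-sharing encoder applied to the two inputs, for example two signature images.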

### Summary
- **Few-Shot and One-Shot Learning** focus on learning from very few examples, with applications in domains where data collection is challenging.
- **Semi-Supervised Learning** leverages both labeled and unlabeled data to improve learning efficiency.
- **Pairwise Learning** involves learning from pairs of examples, useful in ranking and recommendation systems.
- **Siamese Networks (SNN) with Triplet Loss** learn embeddings by comparing an anchor to both similar and dissimilar examples.
- **Contrastive Loss** aims to separate similar and dissimilar pairs in embedding space, beneficial for similarity comparisons.

In [34]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from transformers import DataCollatorForTokenClassification
from transformers import DistilBertTokenizerFast
from sklearn.model_selection import train_test_split
import datasets
import evaluate
from itertools import chain
from datasets import Dataset
from torch.utils.data import DataLoader
import torch.nn.utils.rnn as rnn_utils

In [35]:
import ast
#Label encoding
df = pd.read_csv('TalkFile_ner_2.csv').iloc[:300,:]
df['Tag'] = df['Tag'].apply(ast.literal_eval) #parse the stringified tag lists without executing arbitrary code
list_all_tag = df.Tag.to_list()
list_labels = ['O'] + sorted(set(chain.from_iterable(list_all_tag)) - {'O'})

label2ind = {label: idx for idx, label in enumerate(list_labels)}
ind2label = {idx: label for label, idx in label2ind.items()}
label2ind

{'O': 0,
 'B-art': 1,
 'B-eve': 2,
 'B-geo': 3,
 'B-gpe': 4,
 'B-nat': 5,
 'B-org': 6,
 'B-per': 7,
 'B-tim': 8,
 'I-art': 9,
 'I-eve': 10,
 'I-geo': 11,
 'I-gpe': 12,
 'I-nat': 13,
 'I-org': 14,
 'I-per': 15,
 'I-tim': 16}

In [36]:
#Sentence Tokenization & Label Encoding
data_dict = {
    'id': list(range(len(df))),
    'tokens': [sentence.split(' ') for sentence in df['Sentence']],
    'ner_tags': [list(map(label2ind.get, tags)) for tags in df['Tag']]
}
new_df = pd.DataFrame(data_dict)
new_df.head()

| | id | tokens | ner_tags |
|---|---|---|---|
| 0 | 0 | [Thousands, of, demonstrators, have, marched, ... | [0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, ... |
| 1 | 1 | [Families, of, soldiers, killed, in, the, conf... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | 2 | [They, marched, from, the, Houses, of, Parliam... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 11, 0] |
| 3 | 3 | [Police, put, the, number, of, marchers, at, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] |
| 4 | 4 | [The, protest, comes, on, the, eve, of, the, a... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 6, ... |


In [37]:
def generate_triplets(df): #generate a new DataFrame of triplets, anchor, positive, negative example.
    triplets = [] #create an empty list to store the triplet dictionaries

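    for index, row in df.iterrows():
        #only rows that have two following rows can anchor a triplet
        if index + 2 <= df.index.max():
            positive_index = index + 1 #the next sentence serves as the positive
            negative_index = index + 2 #the sentence after that serves as the negative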

            triplets.append({
                'anchor_id': row['id'],
                'anchor_tokens': row['tokens'],
                'positive_id': df.at[positive_index, 'id'],
                'positive_tokens': df.at[positive_index, 'tokens'],
                'negative_id': df.at[negative_index, 'id'],
                'negative_tokens': df.at[negative_index, 'tokens'],
            })

    return pd.DataFrame(triplets)

triplets_df = generate_triplets(new_df)

if not triplets_df.empty:
    print(triplets_df)
else:
    print("No suitable triplets found.")



     anchor_id                                      anchor_tokens  \
0            0  [Thousands, of, demonstrators, have, marched, ...   
1            1  [Families, of, soldiers, killed, in, the, conf...   
2            2  [They, marched, from, the, Houses, of, Parliam...   
3            3  [Police, put, the, number, of, marchers, at, 1...   
4            4  [The, protest, comes, on, the, eve, of, the, a...   
..         ...                                                ...   
293        293  [People, who, worship, in, unauthorized, ways,...   
294        294  [President, Bush, has, issued, 14, pardons, to...   
295        295  [The, pardons, ,, announced, Monday, ,, includ...   
296        296  [They, were, for, people, convicted, of, such,...   
297        297  [Including, Monday, 's, actions, ,, Mr., Bush,...   

     positive_id                                    positive_tokens  \
0              1  [Families, of, soldiers, killed, in, the, conf...   
1              2  [They, marc

In [38]:
triplet_dataset = Dataset.from_pandas(triplets_df)
triplet_dataset

Dataset({
    features: ['anchor_id', 'anchor_tokens', 'positive_id', 'positive_tokens', 'negative_id', 'negative_tokens'],
    num_rows: 298
})

In [39]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
#prepares batches for processing with a ML model that uses triplet loss
def collate_fn(batch):
    anchor_texts = [' '.join(triplet['anchor_tokens']) for triplet in batch]
    positive_texts = [' '.join(triplet['positive_tokens']) for triplet in batch]
    negative_texts = [' '.join(triplet['negative_tokens']) for triplet in batch]
    
    # Tokenize and encode texts in the batch
    anchor_encodings = tokenizer(anchor_texts, padding=True, truncation=True, return_tensors="pt")
    positive_encodings = tokenizer(positive_texts, padding=True, truncation=True, return_tensors="pt")
    negative_encodings = tokenizer(negative_texts, padding=True, truncation=True, return_tensors="pt")
    
    # Separate and structure the encodings as needed
    anchor_ids, anchor_attn_masks = anchor_encodings['input_ids'], anchor_encodings['attention_mask']
    positive_ids, positive_attn_masks = positive_encodings['input_ids'], positive_encodings['attention_mask']
    negative_ids, negative_attn_masks = negative_encodings['input_ids'], negative_encodings['attention_mask']

    return (anchor_ids, anchor_attn_masks), (positive_ids, positive_attn_masks), (negative_ids, negative_attn_masks)
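
To make the effect of `padding=True` concrete, here is a minimal pure-Python sketch of what padding a batch to its longest sequence and building the matching attention mask amounts to. The helper `pad_batch`, the token IDs, and `pad_id=0` are illustrative assumptions; this is not the tokenizer's actual implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest length; mask is 1 for real tokens, 0 for padding."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Hypothetical token IDs for two sentences of different length.
input_ids, attention_mask = pad_batch([[5, 6, 7], [8]])
print(input_ids)       # [[5, 6, 7], [8, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```

Summing each mask row gives the true sequence length, which is how the encoder defined later recovers lengths from `attention_mask`.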


In [40]:
#The primary purpose of a DataLoader is to automate the process of data sampling, batching, and (optionally) applying transformations to the data.
triplet_dataset_train, triplet_dataset_eval = train_test_split(triplets_df, test_size=0.2, random_state=42) 
triplet_dataset_train = Dataset.from_pandas(triplet_dataset_train)
triplet_dataset_eval = Dataset.from_pandas(triplet_dataset_eval)
train_dataloader = DataLoader(triplet_dataset_train, batch_size=32, shuffle=True, collate_fn=collate_fn) 
eval_dataloader = DataLoader(triplet_dataset_eval, batch_size=32, collate_fn=collate_fn)  

In [41]:
class BiLSTMEncoder(torch.nn.Module): #encode sequences using a BiLSTM network.
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(BiLSTMEncoder, self).__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        self.lstm = torch.nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim // 2, bidirectional=True, batch_first=True)
    #Defines the forward pass of the encoder
    def forward(self, input_ids, attention_mask=None):
        if attention_mask is not None:
            sequence_lengths = attention_mask.sum(dim=1)

            # Sort batches by sequence length
            sequence_lengths, perm_idx = sequence_lengths.sort(0, descending=True)
            input_ids = input_ids[perm_idx]

        embedded = self.embedding(input_ids)

        # If using sequence lengths, pack the sequence
        if attention_mask is not None:
            packed_embedded = rnn_utils.pack_padded_sequence(embedded, lengths=sequence_lengths.cpu(), batch_first=True)
            packed_output, (hidden, cell) = self.lstm(packed_embedded)
            output, _ = rnn_utils.pad_packed_sequence(packed_output, batch_first=True)
        else:
            output, (hidden, cell) = self.lstm(embedded)

        # Concatenate the final forward and backward hidden states
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)

        # If you reordered the batch by sequence length, reorder back to its original order
        if attention_mask is not None:
            _, unperm_idx = perm_idx.sort(0)
            hidden = hidden[unperm_idx]
        #concatenated hidden state for each sequence in the batch, representing the encoded information of each sequence
        return hidden


- The triplet loss function is commonly used in tasks that learn from relative comparisons between an anchor sample, a positive sample (similar to the anchor), and a negative sample (dissimilar from the anchor).
- Its purpose is to ensure that, in the learned representation space, the anchor is closer to the positive sample than to the negative sample by at least a margin.
- The constructor of the TripletLoss class takes a single parameter, margin, which defaults to 1.0. This hyperparameter sets the minimum amount by which the anchor-negative distance should exceed the anchor-positive distance. The margin is stored as an instance variable to be used in the forward method.
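
Written out for a single triplet, the loss computed by the TripletLoss class in the next cell can be expressed as follows, where $f$ denotes the encoder, $m$ the margin, and the distances are squared Euclidean distances (the symbols are notation introduced here for illustration):

```latex
\mathcal{L}(a, p, n) = \max\bigl(0,\; \lVert f(a) - f(p) \rVert_2^2 - \lVert f(a) - f(n) \rVert_2^2 + m \bigr)
```

For example, with $m = 1$, an anchor-positive squared distance of $0.5$ and an anchor-negative squared distance of $2.0$ give $\max(0,\, 0.5 - 2.0 + 1) = 0$, so the triplet already satisfies the margin. If the anchor-negative squared distance were only $1.0$, the loss would be $\max(0,\, 0.5 - 1.0 + 1) = 0.5$.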

In [42]:
class TripletLoss(nn.Module):
    def __init__(self, margin=1.0):
        super(TripletLoss, self).__init__()
        self.margin = margin

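    def forward(self, anchor, positive, negative):
        #squared Euclidean distance between each anchor and its positive / negative
        distance_positive = (anchor - positive).pow(2).sum(1)
        distance_negative = (anchor - negative).pow(2).sum(1)
        #the loss is zero once the negative is farther than the positive by at least the margin
        losses = torch.relu(distance_positive - distance_negative + self.margin)
        return losses.mean()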

In [47]:
def train(model, data_loader, loss_fn, optimizer, num_epochs):
    model.train()
    for epoch in range(num_epochs):
        for (anchor_ids, anchor_masks), (positive_ids, positive_masks), (negative_ids, negative_masks) in data_loader:
            optimizer.zero_grad() #to prevent accumulation of gradients from multiple backward passes.

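            # Call the model with only token IDs; the attention masks are ignored here,
            # so padded positions are fed through the LSTM as well. Passing the masks
            # (e.g. model(anchor_ids, anchor_masks)) would let the encoder pack the sequences.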
            anchor_embeddings = model(anchor_ids)
            positive_embeddings = model(positive_ids)
            negative_embeddings = model(negative_ids)

            # Compute the loss using the embeddings
            loss = loss_fn(anchor_embeddings, positive_embeddings, negative_embeddings)
            loss.backward() #computes the gradients of the loss with respect to model parameters.
            optimizer.step() #updates the model parameters based on the gradients

        print(f'Epoch {epoch+1}, Loss: {loss.item()}') #reports the loss of the last batch in the epoch


# Example usage:
vocab_size = len(tokenizer)
embedding_dim = 300  
hidden_dim = 100

model = BiLSTMEncoder(embedding_dim, hidden_dim, vocab_size)
loss_fn = TripletLoss(margin=1.0)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [48]:
data_loader = train_dataloader
num_epochs = 5
train(model, data_loader, loss_fn, optimizer, num_epochs)

Epoch 1, Loss: 1.3259323835372925
Epoch 2, Loss: 0.6622036695480347
Epoch 3, Loss: 0.5667781829833984
Epoch 4, Loss: 0.25489354133605957
Epoch 5, Loss: 0.20623275637626648
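
The training loss alone does not show how well the embeddings order unseen triplets. One simple complementary check, sketched here under the assumption that embeddings have already been computed for a held-out set, is the fraction of triplets in which the anchor is closer to the positive than to the negative. The function names and the toy embeddings are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance, matching the distance used in the loss."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) embedding triplets ranked correctly."""
    correct = sum(
        squared_distance(a, p) < squared_distance(a, n) for a, p, n in triplets
    )
    return correct / len(triplets)

# Toy embeddings standing in for encoder outputs on held-out triplets.
triplets = [
    ([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]),  # positive is closer: correct
    ([1.0, 1.0], [4.0, 1.0], [1.0, 2.0]),  # negative is closer: incorrect
]
print(triplet_accuracy(triplets))  # 0.5
```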


- Next Step
    - Perform cross-validation and hyperparameter tuning to find the best combination of parameters (embedding_dim, hidden_dim).
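
A minimal sketch of what such a search could look like is shown below. It tries every combination in a small grid and keeps the one with the lowest score. The scoring function `fake_validation_loss` is a hypothetical stand-in; in practice it would train the encoder with the given settings and return a validation loss, ideally averaged over cross-validation folds. The grid values are also assumptions.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score), lower is better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train the encoder and return its validation loss".
def fake_validation_loss(embedding_dim, hidden_dim):
    return abs(embedding_dim - 200) / 100 + abs(hidden_dim - 128) / 64

# hidden_dim values are kept even because the encoder splits it across two directions.
param_grid = {"embedding_dim": [100, 200, 300], "hidden_dim": [64, 128, 256]}
best_params, best_score = grid_search(param_grid, fake_validation_loss)
print(best_params, best_score)  # {'embedding_dim': 200, 'hidden_dim': 128} 0.0
```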