# Task 1 -- NLP
## Things to do:
### *Coding* - Document work
### *Paper*  - Summarize, 3 strength, weakness, improvement



Representations for Words, Phrases, Sentences
NLP encompasses various tasks - Regression, Classification, Generation. A common denominator across all these tasks is the question of "how do we convert text into numbers/representations" so that machines can process them. One way to measure a machine's ability to “understand” text is Semantic Similarity i.e one should be able to tell you how similar or dissimilar are a given pair of text inputs - and this is the central theme of this task. You are expected to come up with solutions to the problems listed below. For all the tasks below, you can assess how well your solution is working based on some quantitative measures - you are expected to choose a metric that is suitable and justify
the same.

a. Word Similarity Scores: Given a pair of words, predict their similarity score. The focus is how do you convert a word to its numerical representation, on which learning algorithms (like Regression, classification etc) can be applied. Download the dataset from this link. You have to come up with an unsupervised / semi supervised method to achieve the task. Assume that you don't have any supervised training data at your disposal. The whole dataset will be used as a test set. Choose an appropriate metric that is suitable to assess the task and report the results. You have to come up with a solution for each of the following conditions:

i. Constraints on Data Resources: You can only use the following resources
(any one or all) to solve the problem (DON’T USE PRE-TRAINED MODELS!) :
- any monolingual English corpus - Maximum 1 million tokens.
- any curated/structured knowledge-bases / ontologies

ii. Unconstrained : Consider that the constraints above are removed and you
are allowed to use any data or model.
Compare results/analysis across the two settings. What works, what doesn’t?
And Why?

b. Phrase and Sentence Similarity : In question (1) you would have come up with a method to get numerical/vector representation given a word. Now you have to come up with a mechanism to get representations for phrases and sentences. How do you aggregate individual word representations to get phrase or sentence embedding?
- You can use any pretrained static word embeddings like word2vec,
GLOVE, FASTTEXT etc, or create your own.
- You can use popular tool/libraries (e.g nltk, Stanza, Spacy etc) to
compute linguistic features (PoS, Constituency/Dependency Parse).
i. Phrase Similarity : Given a pair of phrases classify whether or not they are similar. Dataset can be found here. Dataset has train/dev/test splits. You have to report results on the test set, and use train/dev sets as needed.
ii. Sentence Similarity : Given a pair of sentences, classify whether or not they are similar. Dataset can be found here. Dataset has train/dev/test split. You have to report results on the test set. , and use train/dev sets as needed.
You are encouraged to try multiple approaches to come up with phrase / sentence representations and models to solve the task, and do comparative analysis. What are the cases where your model fails - any patterns? Why so?

c. BONUS TASK:
i. Transformers are all the rage right now (backbone of most of the LLMs you might have used). Can you fine-tune a pre-trained transformer based models (BERT, Roberta, etc) to solve Phrase and Sentence Similarity Tasks described above? You are free to use any resource out there.
ii. Can you prompt LLMs (ChatGPT, LLAMA) to solve the phrase and sentence similarity scores? Solve the task using
1. commercial LLM APIs (ChatGPT, BARD etc);
2. open source LLMs/APIs (LLAMA, Mistral etc).
Try with zero and few shot settings. If querying LLMs is computational /
commercially prohibitive, do it for only the test set / subset of test set. Analyze the results. Explain some analysis that you have done.
iii. Compare all the approaches that you tried - static word embeddings, fine-tuned transformers, LLMs. What are the improvements you notice across the three settings?
d. Paper Reading Task : BERTSCORE: EVALUATING TEXT GENERATION WITH BERT

#Task 1


##Constrained Word Similarity

In [None]:
import pandas as pd
import io
from google.colab import files

In [None]:
# Upload Dataset
uploaded_dataset_file_constrained_word_similarity = files.upload()

Saving SimLex-999.txt to SimLex-999.txt


In [None]:
data_constrained_word = pd.read_csv(io.BytesIO(uploaded_dataset_file_constrained_word_similarity['SimLex-999.txt']), delimiter='\t')
print(data_constrained_word)

      word1        word2 POS  SimLex999  conc(w1)  conc(w2)  concQ  \
0       old          new   A       1.58      2.72      2.81      2   
1     smart  intelligent   A       9.20      1.75      2.46      1   
2      hard    difficult   A       8.77      3.76      2.21      2   
3     happy     cheerful   A       9.55      2.56      2.34      1   
4      hard         easy   A       0.95      3.76      2.07      2   
..      ...          ...  ..        ...       ...       ...    ...   
994    join      acquire   V       2.85      2.86      2.93      2   
995    send       attend   V       1.67      2.70      3.17      2   
996  gather       attend   V       4.80      2.75      3.17      2   
997  absorb     withdraw   V       2.97      3.11      3.04      2   
998  attend       arrive   V       6.08      3.17      3.22      2   

     Assoc(USF)  SimAssoc333  SD(SimLex)  
0          7.25            1        0.41  
1          7.11            1        0.67  
2          5.94            1  

##Unconstrained Word Similarity

###Using Word2Vec

In [None]:
from google.colab import files
import pandas as pd
import io
import gensim
import os
from scipy.stats import spearmanr
import csv
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from gensim.models import KeyedVectors
import spacy

In [None]:
uploaded_unconstrained_word2vec = files.upload()

Saving word.csv to word.csv


In [None]:
# uploaded_unconstrained_word = files.upload()
# encyclopedia

In [None]:
class Corpus(object):
    def __init__(self, filename):
        self.filename = filename
        self.nlp = spacy.blank("en")
    def __iter__(self):
        with open(self.filename, "r") as i:
            reader = csv.reader(i, delimiter=",")
            for _, abstract in reader:
                tokens = [t.text.lower() for t in self.nlp(abstract)]
                yield tokens
documents = Corpus("word.csv")

In [None]:
model_unconstrained_word2vec = gensim.models.Word2Vec(documents, min_count=100, window=5, vector_size=200)

In [None]:
model_unconstrained_word2vec.save('models/unconstrained_word2vec')

In [None]:
model_unconstrained_word2vec_path = 'models/unconstrained_word2vec'  # Path to your trained Word2Vec model
model_unconstrained_word2vec = gensim.models.Word2Vec.load(model_unconstrained_word2vec_path)

In [None]:
def word_similarity(word1, word2, word_embeddings):
    if word1 in word_embeddings.wv.key_to_index and word2 in word_embeddings.wv.key_to_index:
        similarity_score = cosine_similarity(word_embeddings.wv[word1].reshape(1, -1), word_embeddings.wv[word2].reshape(1, -1))[0][0]
        return similarity_score
    return -1

In [None]:
print(word_similarity("ml", "nlp", model_unconstrained_word2vec))

0.5263437


In [None]:
simlex_copy=np.array(data_constrained_word)
word_pairs = [(row[0], row[1]) for row in simlex_copy]

simlex_ratings = [float(row[3]) for row in simlex_copy]
simlex_ratings = [rating for rating in simlex_ratings]

In [None]:
word2vec_model_similarities = []
filtered_word2vec_simlex_ratings = []
for pair, rating in zip(word_pairs, simlex_ratings):
    word1, word2 = pair
    similarity_score = word_similarity(word1, word2, model_unconstrained_word2vec)
    if similarity_score != -1:
        word2vec_model_similarities.append(similarity_score * 10)
        filtered_word2vec_simlex_ratings.append(rating)

spearman_correlation, _ = spearmanr(filtered_word2vec_simlex_ratings, word2vec_model_similarities)
print("Spearman correlation between word2vec model predictions and SimLex-999 ratings:", spearman_correlation)

Spearman correlation between word2vec model predictions and SimLex-999 ratings: 0.20142711234786037


###Glove using D2L data

In [None]:
import os
import torch
from torch import nn
!pip install d2l
from d2l import torch as d2l

In [None]:


d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')

d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')

d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')

d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')

class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, torch.tensor(idx_to_vec)

    def __getitem__(self, tokens):
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[torch.tensor(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)

In [None]:
glove_6b50d = TokenEmbedding('glove.6b.50d')

In [None]:
def knn(W, x, k):
    # Add 1e-9 for numerical stability
    cos = torch.mv(W, x.reshape(-1,)) / (
        torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
        torch.sqrt((x * x).sum()))
    _, topk = torch.topk(cos, k=k)
    return topk, [cos[int(i)] for i in topk]

def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')

In [None]:
get_similar_tokens('chip', 3, glove_6b50d)

cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics


##Comparing both

#Task 2

##Phrase Similarity

In [None]:
!git lfs install
!pip install datasets

In [None]:
from datasets import load_dataset
phrase_similarity_dataset = load_dataset("PiC/phrase_similarity")

Similar to Sentence Similarity Changing the dataset

##Sentence Similarity



In [42]:
from datasets import load_dataset
sentence_similarity_dataset = load_dataset("paws", "labeled_final")

In [5]:
import spacy
from sklearn.metrics.pairwise import cosine_similarity
nlp = spacy.load("en_core_web_sm")
import numpy as np
from gensim.models import KeyedVectors
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
import pandas as pd
import datasets
#nltk.download('stopwords')
from datasets import load_dataset
# import sys
# sys.path.append("../../")
import os
from collections import Counter
import math
from sklearn.metrics import accuracy_score
from tempfile import TemporaryDirectory
# !pip install scrapbook
import scrapbook as sb
import scipy
from scipy.spatial import distance
import gensim
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

In [48]:
sentence_similarity_train=pd.DataFrame(sentence_similarity_dataset['train'])
sentence_similarity_test=pd.DataFrame(sentence_similarity_dataset['test'])
sentence_similarity_dev=pd.DataFrame(sentence_similarity_dataset['validation'])

In [54]:
sentence_similarity_test.head(10)
# with open('output.txt', 'a') as f:
#     f.write(sentence_similarity_train[:100].to_string())

Unnamed: 0,id,sentence1,sentence2,label
0,1,This was a series of nested angular standards ...,"This was a series of nested polar scales , so ...",0
1,2,His father emigrated to Missouri in 1868 but r...,"His father emigrated to America in 1868 , but ...",0
2,3,"In January 2011 , the Deputy Secretary General...","In January 2011 , FIBA Asia deputy secretary g...",1
3,4,"Steiner argued that , in the right circumstanc...",Steiner held that the spiritual world can be r...,0
4,5,"Luciano Williames Dias ( born July 25 , 1970 )...",Luciano Williames Dias ( born 25 July 1970 ) i...,0
5,6,"During her sophomore , junior and senior summe...","During her second , junior and senior summers ...",1
6,7,The smallest number that can be represented in...,The smallest number that can be represented as...,0
7,8,"His father emigrated to Missouri in 1868 , but...",His father emigrated to Missouri in 1868 but r...,1
8,9,The Villa Pesquera facilities are owned by the...,The facilities of Villa Pesquera are operated ...,0
9,10,It is situated south of Köroğlu Mountains and ...,It is situated south of Köroğlu - mountains an...,1


###Using Hugging Face SentenceTransformer

In [40]:
# !pip install sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, models, losses, util
from torch import nn
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])



In [13]:
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(
    in_features=pooling_model.get_sentence_embedding_dimension(),
    out_features=256,
    activation_function=nn.Tanh(),
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

In [14]:
def dataframe_to_input_examples(df_chunk):
    examples = []
    for index, row in df_chunk.iterrows():
        text1 = row['sentence1']
        text2 = row['sentence2']
        label = float(row['label'])
        examples.append(InputExample(texts=[text1, text2], label=label))
    return examples
chunk_size = 100

train_examples_sentence = []
dev_examples_sentence = []
test_examples_sentence = []

for i in range(10):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, 1000)
    df_chunk = sentence_similarity_test[start_idx:end_idx]
    examples_chunk = dataframe_to_input_examples(df_chunk)
    train_examples_sentence.extend(examples_chunk)

dev_examples_sentence = []
for i in range(5):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, 500)
    df_chunk = sentence_similarity_dev[start_idx:end_idx]
    examples_chunk = dataframe_to_input_examples(df_chunk)
    dev_examples_sentence.extend(examples_chunk)


test_examples_sentence = []
for i in range(5):
    start_idx = i * chunk_size
    end_idx = min((i + 1) * chunk_size, 500)
    df_chunk = sentence_similarity_train[start_idx:end_idx]
    examples_chunk = dataframe_to_input_examples(df_chunk)
    test_examples_sentence.extend(examples_chunk)

In [15]:
model = SentenceTransformer("distilbert-base-nli-mean-tokens")
# train_examples = [
#     InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
# ]
train_dataloader = DataLoader(train_examples_sentence, shuffle=True, batch_size=100)

In [16]:
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/10 [00:00<?, ?it/s]

In [32]:
model.save("models/sentence")

In [36]:
sentence_similarity_test1=sentence_similarity_test[:500]
test_encodings1 = model.encode(sentence_similarity_test1['sentence1'].tolist())
test_encodings2 = model.encode(sentence_similarity_test1['sentence2'].tolist())

In [80]:
sentence1 = "This is the first sentence."
sentence2 = "This is the second sentence."
# Encode sentences
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)
error=0
for i in range(500):
  error=(error*(i)+(sentence_similarity_test1['label'][i]-(util.pytorch_cos_sim(test_encodings1[i], test_encodings2[i])).item()))/(i+1)
print("Error:", error)

Error: -0.5355040491819378


For large corpus we can use Pragraph mining

#Bonus Task

##Fine Tune Transformer

####Using BERT

In [None]:
!pip install transformers sentence-transformers datasets
!pip install dataset
!pip install transformers
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, models
from transformers import BertTokenizer
from transformers import get_linear_schedule_with_warmup
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm
import time
import datetime
import random
import numpy as np
import pandas as pd

In [6]:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

In [7]:
dataset = load_dataset("stsb_multi_mt", "en")

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [10]:
class STSBDataset(torch.utils.data.Dataset):

    def __init__(self, dataset):

        # Normalize the similarity scores in the dataset
        similarity_scores = [i['similarity_score'] for i in dataset]
        self.normalized_similarity_scores = [i/5.0 for i in similarity_scores]
        self.first_sentences = [i['sentence1'] for i in dataset]
        self.second_sentences = [i['sentence2'] for i in dataset]
        self.concatenated_sentences = [[str(x), str(y)] for x,y in zip(self.first_sentences, self.second_sentences)]

    def __len__(self):
        return len(self.concatenated_sentences)

    def get_batch_labels(self, idx):
        return torch.tensor(self.normalized_similarity_scores[idx])

    def get_batch_texts(self, idx):
        return tokenizer(self.concatenated_sentences[idx], padding='max_length', max_length=128, truncation=True, return_tensors="pt")

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y


def collate_fn(texts):
    input_ids = texts['input_ids']
    attention_masks = texts['attention_mask']
    features = [{'input_ids': input_id, 'attention_mask': attention_mask}
                for input_id, attention_mask in zip(input_ids, attention_masks)]
    return features

class BertForSTS(torch.nn.Module):

    def __init__(self):
        super(BertForSTS, self).__init__()
        self.bert = models.Transformer('bert-base-uncased', max_seq_length=128)
        self.pooling_layer = models.Pooling(self.bert.get_word_embedding_dimension())
        self.sts_bert = SentenceTransformer(modules=[self.bert, self.pooling_layer])

    def forward(self, input_data):
        output = self.sts_bert(input_data)['sentence_embedding']
        return output

In [None]:
model = BertForSTS()
model.to(device)

In [12]:
class CosineSimilarityLoss(torch.nn.Module):

    def __init__(self,  loss_fn=torch.nn.MSELoss(), transform_fn=torch.nn.Identity()):
        super(CosineSimilarityLoss, self).__init__()
        self.loss_fn = loss_fn
        self.transform_fn = transform_fn
        self.cos_similarity = torch.nn.CosineSimilarity(dim=1)

    def forward(self, inputs, labels):
        emb_1 = torch.stack([inp[0] for inp in inputs])
        emb_2 = torch.stack([inp[1] for inp in inputs])
        outputs = self.transform_fn(self.cos_similarity(emb_1, emb_2))
        return self.loss_fn(outputs, labels.squeeze())

train_ds = STSBDataset(dataset['train'])
val_ds = STSBDataset(dataset['dev'])

train_size = len(train_ds)
val_size = len(val_ds)

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

batch_size = 8

train_dataloader = DataLoader(
            train_ds,  # The training samples.
            num_workers = 4,
            batch_size = batch_size, # Use this batch size.
            shuffle=True # Select samples randomly for each batch
        )

validation_dataloader = DataLoader(
            val_ds,
            num_workers = 4,
            batch_size = batch_size # Use the same batch size
        )

5,749 training samples
1,500 validation samples




In [18]:
optimizer = AdamW(model.parameters(),
                  lr = 1e-6)
epochs = 2

# Total number of training steps is [number of batches] x [number of epochs].
total_steps = len(train_dataloader) * epochs

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

In [None]:
#Training in data: Can be skipped afterward as our model is trained

def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

def train():
  seed_val = 42

  criterion = CosineSimilarityLoss()
  criterion = criterion.to(device)

  random.seed(seed_val)
  torch.manual_seed(seed_val)

  # We'll store a number of quantities such as training and validation loss,
  # validation accuracy, and timings.
  training_stats = []
  total_t0 = time.time()

  for epoch_i in range(0, epochs):

      # ========================================
      #               Training
      # ========================================

      print("")
      print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
      print('Training...')

      t0 = time.time()

      total_train_loss = 0

      model.train()

      # For each batch of training data...
      for train_data, train_label in tqdm(train_dataloader):

          train_data['input_ids'] = train_data['input_ids'].to(device)
          train_data['attention_mask'] = train_data['attention_mask'].to(device)

          train_data = collate_fn(train_data)
          model.zero_grad()

          output = [model(feature) for feature in train_data]

          loss = criterion(output, train_label.to(device))
          total_train_loss += loss.item()

          loss.backward()
          torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
          optimizer.step()
          scheduler.step()


      # Calculate the average loss over all of the batches.
      avg_train_loss = total_train_loss / len(train_dataloader)

      # Measure how long this epoch took.
      training_time = format_time(time.time() - t0)

      print("")
      print("  Average training loss: {0:.5f}".format(avg_train_loss))
      print("  Training epoch took: {:}".format(training_time))

      # ========================================
      #               Validation
      # ========================================

      print("")
      print("Running Validation...")

      t0 = time.time()

      model.eval()

      total_eval_accuracy = 0
      total_eval_loss = 0
      nb_eval_steps = 0

      # Evaluate data for one epoch
      for val_data, val_label in tqdm(validation_dataloader):

          val_data['input_ids'] = val_data['input_ids'].to(device)
          val_data['attention_mask'] = val_data['attention_mask'].to(device)

          val_data = collate_fn(val_data)

          with torch.no_grad():
              output = [model(feature) for feature in val_data]

          loss = criterion(output, val_label.to(device))
          total_eval_loss += loss.item()

      # Calculate the average loss over all of the batches.
      avg_val_loss = total_eval_loss / len(validation_dataloader)

      # Measure how long the validation run took.
      validation_time = format_time(time.time() - t0)

      print("  Validation Loss: {0:.5f}".format(avg_val_loss))
      print("  Validation took: {:}".format(validation_time))

      # Record all statistics from this epoch.
      training_stats.append(
          {
              'epoch': epoch_i + 1,
              'Training Loss': avg_train_loss,
              'Valid. Loss': avg_val_loss,
              'Training Time': training_time,
              'Validation Time': validation_time
          }
      )

  print("")
  print("Training complete!")

  print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

  return model, training_stats
  model, training_stats = train()
# Create a DataFrame from our training statistics
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index
df_stats = df_stats.set_index('epoch')

# Display the table
df_stats

In [29]:
test_dataset = load_dataset("stsb_multi_mt", name="en", split="test")

first_sent = [i['sentence1'] for i in test_dataset]
second_sent = [i['sentence2'] for i in test_dataset]
full_text = [[str(x), str(y)] for x,y in zip(first_sent, second_sent)]

In [26]:
model.eval()

def predict_similarity(sentence_pair):
  test_input = tokenizer(sentence_pair, padding='max_length', max_length = 128, truncation=True, return_tensors="pt").to(device)
  test_input['input_ids'] = test_input['input_ids']
  test_input['attention_mask'] = test_input['attention_mask']
  del test_input['token_type_ids']
  output = model(test_input)
  sim = torch.nn.functional.cosine_similarity(output[0], output[1], dim=0).item()

  return sim

In [27]:
example_2 = full_text[130]
print(f"Sentence 1: {example_2[0]}")
print(f"Sentence 2: {example_2[1]}")
print(f"Predicted similarity score: {round(predict_similarity(example_2), 2)}")

Sentence 1: Two men are playing football.
Sentence 2: Two men are practicing football.
Predicted similarity score: 0.94


##Prompt LLM

###Commercial bardapi

In [None]:
!pip install bardapi
! pip install git+https://github.com/dsdanielpark/Bard-API.git

In [41]:
import bardapi
import os

# set your __Secure-1PSID value to key
token = 'xxxxxxxxxx'

# set your input text
input_text = "Hello"

# Send an API request and get a response.
response = bardapi.core.Bard(token).get_answer(input_text)
print(response)

{'content': 'Response Error: b\')]}\\\'\\n\\n38\\n[["wrb.fr",null,null,null,null,[7]]]\\n54\\n[["di",39],["af.httprm",38,"3346129587758524624",0]]\\n25\\n[["e",4,null,null,129]]\\n\'. \nUnable to get response.\nPlease double-check the cookie values and verify your network environment or google account.'}


ChatGPT https://chat.openai.com/share/04375dc5-9265-4f59-8b8c-840b6f750ed5

In [55]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the dataset
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
    'sentence1': [
        "In Paris , in October 1560 , he secretly met the English ambassador , Nicolas Throckmorton , asking him for a passport to return to England through Scotland .",
        "The NBA season of 1975 -- 76 was the 30th season of the National Basketball Association .",
        "There are also specific discussions , public profile debates and project discussions .",
        "When comparable rates of flow can be maintained , the results are high .",
        "It is the seat of Zerendi District in Akmola Region .",
        "William Henry Henry Harman was born on 17 February 1828 in Waynesboro , Virginia , where his parents were Lewis and Sally ( Garber ) Harman .",
        "Bullion Express - concept is being introduced new store in Dallas , Texas in Preston Center opened .",
        "With a discrete amount of probabilities Formula 1 with the condition formula 2 and Formula 3 any real number , the Tsallis is defined as entropy as",
        "The Soviet Union maintained an embassy in Oslo and a consulate in Barentsburg , while Norway maintained a message in Moscow .",
        "Vocabulary even went to Brazil through leaving Portuguese settlers with some Macanese and Chinese settlers .",
        "Kabir Suman recorded several albums under the name of Suman Chattopaddhyay or Suman Chatterjee between 1992 and 1999 .",
        "He was a scholar in Metaphysical Literature , Theology and Classical sciences .",
        "The city sits at the confluence of the Snake River with the great Weiser River , which marks the border with Oregon .",
        "He has been trained by his grandfather , Nick Dakin , and is now trained by Geoff Barraclough .",
        "The Austrian school assumes that the subjective choices of individuals , including individual knowledge , time , expectations , and other subjective factors , cause all economic phenomena .",
        "Werder 's forces invested Belfort and reached the city on 3 November .",
        "The kBox facilitates both isometric and concentric contractions as well as eccentric training .",
        "The first five weapons were delivered in the first half of 1916 , with a total of 57 barrels and 56 cars completed by the end of the war .",
        "Elizabeth II was an ancestor of Queens Edzard II and Beatrix of the Netherlands .",
        "The friendship between him and Duncan ended at a club meeting in 1951 when the two disagreed at an annual meeting and Duncan reported that Greaves said :",
        "Pluto was classified as the planet when the Grand Tour was proposed and was launched at the time `` New Horizons '' .",
        "For their performances in the game , quarterback Jameis Winston and defensive back P. J. Williams were named the game 's most valuable players .",
        "Shaffer Creek is a tributary of the Raystown Branch Juniata River ( Brush Creek ) in Bedford County , Pennsylvania , United States .",
        "Kevin Spacey ( Henry Drummond ) and David Troughton ( Matthew Harrison Brady ) starred in a 2009 revival at The Old Vic in London .",
        "Briggs later met Briggs at the 1967 Monterey Pop Festival , where Ravi Shankar was also performing , with Eric Burdon and The Animals .",
        "Laura Myntti was born in Salt Lake City and lived in Sioux City , Iowa and San Diego before settling in Minnesota in 1968 .",
        "The female lead role was played by Cortez in `` Ali Baba and the Sacred Crown '' , directed by Erminio Salvi .",
        "She worked and lived in Stuttgart , Berlin ( Germany ) and in Vienna ( Austria ) .",
        "Akshuat dendropark ( Russian : Акшуатский дендропарк ) is a natural monument ( Ulyanovsk Oblast protected areas )",
        "The Little Jocko River flows across the Saint Lawrence River and the Ottawa River to the Jocko River .",
        "In 1951 , he died and retired in 1956 ."
    ],
    'sentence2': [
        "In October 1560 , he secretly met with the English ambassador , Nicolas Throckmorton , in Paris , and asked him for a passport to return to Scotland through England .",
        "The 1975 -- 76 season of the National Basketball Association was the 30th season of the NBA .",
        "There are also public discussions , profile specific discussions , and project discussions .",
        "The results are high when comparable flow rates can be maintained .",
        "It is the seat of the district of Zerendi in Akmola region .",
        "William Henry Harman was born in Waynesboro , Virginia on February 17 , 1828 . His parents were Lewis and Sally ( Garber ) Harman .",
        "2011-DGSE Bullion Express concept is introduced , new store opened in Preston Center in Dallas , Texas",
        "Given a discrete set of probabilities formula _ 1 with the condition formula _ 2 , and formula _ 3 any real number , the Tsallis entropy is defined as",
        "The Soviet Union maintained an embassy in Moscow and a consulate in Barentsburg , while Norway maintained a message in Oslo .",
        "Vocabulary even went to Brazil by leaving Macanese and Chinese settlers with some Portuguese settlers .",
        "Suman Chatterjee , recorded a number of albums between 1992 and 1999 under the name Suman Chattopaddhyay or Kabir Suman .",
        "He was a scholar in metaphysical literature , theology , and classical science .",
        "The city lies at the confluence of the Snake River and the Great Weiser River , which marks the border with Oregon .",
        "He has been trained by his grandfather , Geoff Barraclough , and is now coached by Nick Dakin .",
        "The Austrian school assumes that the subjective choices of individuals , including subjective knowledge , time , expectation , and other individual factors , cause all economic phenomena .",
        "Werder 's troops invested Belfort and reached the city on November 3 .",
        "The kBox facilitates eccentric as well as concentric contractions and isometric training .",
        "The first five weapons were delivered in the first half of 1916 . A total of 57 barrels and 56 carriages were completed by the end of the war .",
        "Edzard II was an ancestor of the Queens Elizabeth II and the Beatrix of the Netherlands .",
        "The friendship between him and Duncan ended in 1951 at a club meeting , when the two did not agree at an annual meeting , and Duncan reported that Greaves said :",
        "Note : Pluto was classified as a planet when the Grand Tour was launched and at the time `` New Horizons '' was proposed .",
        "Quarterback P. J. Williams and Defensive Back Jameis Winston were named the most valuable players of the game for their performances in the game .",
        "Shaffer Creek is an tributary of Brush Creek ( Raystown Branch Juniata River ) in Bedford County , Pennsylvania in the United States .",
        "Kevin Spacey ( Henry Drummond ) and David Troughton ( Matthew Harrison Brady ) played in a resume in 2009 at the Old Vic London .",
        "Briggs met Briggs later at the Monterey Pop Festival of 1967 , where Ravi Shankar also performed with Eric Burdon and The Animals .",
        "Born in Minnesota , Laura Myntti lived in Sioux City , Iowa and San Diego , before settling in Salt Lake City in 1968 .",
        "Cortez played the female lead in `` Ali Baba and the Sacred Crown '' , directed by Erminio Salvi .",
        "She worked and lived in Germany ( Stuttgart , Berlin ) and in Vienna ( Austria ) .",
        "Akshuat dendropark ( Russian : Акшуатский дендропарк ) is a natural monument ( Protected areas of Ulyanovsk Oblast )",
        "The Little Jocko River flows via the Saint Lawrence River and the Ottawa River to the Jocko River .",
        "He died in 1951 and retired in 1956 ."
    ],
    'label': [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

# Function to calculate cosine similarity
def calculate_similarity(sentence1, sentence2):
    vectorizer = CountVectorizer().fit_transform([sentence1, sentence2])
    vectors = vectorizer.toarray()
    return cosine_similarity([vectors[0]], [vectors[1]])[0][0]

# Calculate cosine similarity for each pair of sentences
similarities = []
for index, row in df.iterrows():
    similarity = calculate_similarity(row['sentence1'], row['sentence2'])
    similarities.append(similarity)

# Check for errors
errors = 0
for i, label in enumerate(df['label']):
    if label == 0 and similarities[i] >= 0.5:
        errors += 1
    elif label == 1 and similarities[i] < 0.5:
        errors += 1

# Report errors
print("Total errors found:", errors)


Total errors found: 14


###Open Source

In [None]:
!pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"
import torch
from transformers import pipeline

In [2]:
generate_text = pipeline(model="databricks/dolly-v2-3b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
res = generate_text("Give semantic similarity label for 'This is first', 'This is second'")
print(res[0]["generated_text"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


KeyboardInterrupt: 