This notebook is an attempt to implement [Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings](https://aclanthology.org/P19-1309) (Artetxe & Schwenk, ACL 2019) and to use it for parallel corpus mining in low-resource languages (in our case, North African dialects).

In [None]:
from transformers import BertTokenizer, BertModel

In [None]:
tokenizer=BertTokenizer.from_pretrained('UBC-NLP/MARBERT')

In [None]:
MARBERT=BertModel.from_pretrained('/content/drive/MyDrive/Marbert_tuned')

All TF 2.0 model weights were used when initializing BertModel.

All the weights of BertModel were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training.


In [None]:
with open('/content/drive/MyDrive/corp.mor', 'r') as f:
    mor_sentences = [line.strip() for line in f]

In [None]:
with open('/content/drive/MyDrive/corp.alg', 'r') as f:
    alg_sentences = [line.strip() for line in f]

In [None]:
import torch

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
device

'cuda'

# Data Pre-Processing

In [None]:
mor_tokenized=[['[CLS]']+tokenizer.tokenize(sentence)+['[SEP]'] for sentence in mor_sentences]

In [None]:
alg_tokenized=[['[CLS]']+tokenizer.tokenize(sentence)+['[SEP]'] for sentence in alg_sentences]

In [None]:
maxlen=75

In [None]:
padded_mor=[tokens +['[PAD]' for _ in range(maxlen-len(tokens))] for tokens in mor_tokenized]

In [None]:
padded_alg=[tokens +['[PAD]' for _ in range(maxlen-len(tokens))] for tokens in alg_tokenized]

In [None]:
seg_ids_mor=[[0 for _ in range(len(padded))]for padded in padded_mor]

In [None]:
seg_ids_alg=[[0 for _ in range(len(padded))]for padded in padded_alg]

In [None]:
sent_ids_mor=[tokenizer.convert_tokens_to_ids(padded) for padded in padded_mor]

In [None]:
sent_ids_alg=[tokenizer.convert_tokens_to_ids(padded) for padded in padded_alg]

In [None]:
attn_mask_mor=[[ 1 if token != '[PAD]' else 0 for token in padded ]for padded in padded_mor]

In [None]:
attn_mask_alg=[[ 1 if token != '[PAD]' else 0 for token in padded ]for padded in padded_alg]
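The padding cells above leave any sentence longer than `maxlen` tokens unpadded and untruncated, so sequence lengths can differ between sentences. A minimal sketch of a helper that both truncates and pads, assuming a hypothetical name `pad_and_mask` and assuming that the closing `[SEP]` should be kept when truncating:

```python
def pad_and_mask(tokens, maxlen, pad_token='[PAD]', sep_token='[SEP]'):
    """Return (padded_tokens, attention_mask), both of length maxlen.

    Hypothetical helper, not part of the original notebook.
    """
    if len(tokens) > maxlen:
        # Truncate, but keep the closing [SEP] as the last token
        tokens = tokens[:maxlen - 1] + [sep_token]
    n_pad = maxlen - len(tokens)
    padded = tokens + [pad_token] * n_pad
    # 1 for real tokens, 0 for padding
    mask = [1] * len(tokens) + [0] * n_pad
    return padded, mask


padded, mask = pad_and_mask(['[CLS]', 'salam', '[SEP]'], maxlen=5)
print(padded)
print(mask)
```

Building the mask from the token count, rather than by comparing each token with `'[PAD]'`, gives the same result here and avoids a second pass over the tokens.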

In [None]:
token_ids_mor = [torch.tensor(sent_ids).unsqueeze(0).to(device) for sent_ids in sent_ids_mor ]
attn_mask_mor = [torch.tensor(attn_mask).unsqueeze(0).to(device) for attn_mask in attn_mask_mor]
seg_ids_mor = [torch.tensor(seg_ids).unsqueeze(0).to(device) for seg_ids in seg_ids_mor]

In [None]:
token_ids_alg = [torch.tensor(sent_ids).unsqueeze(0).to(device) for sent_ids in sent_ids_alg]



In [None]:
attn_mask_alg = [torch.tensor(attn_mask).unsqueeze(0).to(device) for attn_mask in attn_mask_alg]

In [None]:
seg_ids_alg   = [torch.tensor(seg_ids).unsqueeze(0).to(device) for seg_ids in seg_ids_alg]

In [None]:
MARBERT.to(device)

We write the [CLS] token embedding of each sentence to a file, since it serves as a representation of the whole sentence. We could also use the model's pooler output or a mean-pooled output, and look for the encoder layer that gives the best sentence representation. Note that the model used here has already been fine-tuned for the classification task.

We write the embeddings to a file because they are high-dimensional and would otherwise require more VRAM than we have at hand.
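As an alternative to the [CLS] vector, the mean-pooled variant mentioned above averages the token vectors while ignoring padding positions. A minimal sketch, assuming a hypothetical helper `mean_pool` and random tensors standing in for the model's `last_hidden_state`:

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for [PAD]
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in for a model output: 2 sentences, 4 positions, hidden size 3
hidden = torch.randn(2, 4, 3)
attn = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])
pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([2, 3])
```

With the notebook's objects, the call would look like `mean_pool(MARBERT(token_ids, attention_mask=attn_mask).last_hidden_state, attn_mask)`.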

In [None]:
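with open('/content/drive/MyDrive/algerian_cls_embeddings_1', 'w') as f:
  f.write("id | cls_embedding\n")
  # no_grad skips building the autograd graph, which reduces GPU memory use
  with torch.no_grad():
    for i, (token_ids, attn_mask) in enumerate(zip(token_ids_alg, attn_mask_alg)):
      # last_hidden_state[0][0] is the [CLS] vector of the single sentence in the batch
      cls_emb = MARBERT(token_ids, attention_mask=attn_mask).last_hidden_state[0][0]
      f.write(f"{i}  | {cls_emb.tolist()} \n")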

In [None]:
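with open('/content/drive/MyDrive/moroccan_cls_embeddings_', 'w') as f:
  f.write("id | cls_embedding \n")
  # no_grad skips building the autograd graph, which reduces GPU memory use
  with torch.no_grad():
    for i, (token_ids, attn_mask) in enumerate(zip(token_ids_mor, attn_mask_mor)):
      # last_hidden_state[0][0] is the [CLS] vector of the single sentence in the batch
      cls_emb = MARBERT(token_ids, attention_mask=attn_mask).last_hidden_state[0][0]
      f.write(f"{i}  | {cls_emb.tolist()} \n")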

In [None]:
import pandas as pd

# Read
If you already have the embeddings, start from here.

Link to the PyTorch fine-tuned MARBERT: https://drive.google.com/drive/folders/1-9fNj9RaAErb9VCjgaLxKGT-tkFRF4R4?usp=sharing

Link to alg CLS emb: https://drive.google.com/file/d/10EZD1qs9U1vjhwaP-LrPH9eToNvs8w27/view?usp=sharing

Link to mor CLS emb:
https://drive.google.com/file/d/10DmcHc-l7I_POmDs5FU7OyPsM4QL7eJB/view?usp=sharing

In [None]:
# header=0 replaces the file's own header line with `names` instead of reading it as a data row
df_mor= pd.read_csv('/content/drive/MyDrive/moroccan_cls_embeddings_', sep='|', names=['id','cls_emb'], header=0).set_index('id')

In [None]:
df_mor

id,cls_emb
0,"[-0.3836098611354828, -0.12836281955242157, 0..."
1,"[-0.4837908148765564, -0.10353101789951324, 0..."
2,"[0.03222808986902237, -0.06129252910614014, 1..."
3,"[-0.026018619537353516, -0.07560951262712479,..."
...,...
6407,"[-0.3569198548793793, -0.14029435813426971, 0..."
6408,"[0.02530290000140667, -0.06131020560860634, -..."
6409,"[0.07313661277294159, -0.07153552770614624, 1..."
6410,"[0.0845789685845375, -0.08402840793132782, 1...."


In [None]:
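# header=0 replaces the file's own header line with `names` instead of reading it as a data row
df_alg= pd.read_csv('/content/drive/MyDrive/algerian_cls_embeddings_1', sep='|', names=['id','cls_emb'], header=0).set_index('id')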

In [None]:
cos_sim=torch.nn.CosineSimilarity(dim=1, eps=1e-6)
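The scoring functions further down read two lists, `mor_emb` and `alg_emb`, that are not built in the cells shown here. A minimal sketch of how such a list could be parsed from the `id | cls_embedding` files written earlier, assuming a hypothetical helper `load_embeddings` and using only the standard library:

```python
import io
import json


def load_embeddings(lines):
    """Parse 'id | [v0, v1, ...]' lines into a list of float lists.

    Hypothetical helper; the header line and blank lines are skipped.
    """
    vectors = []
    for line in lines:
        _, sep, payload = line.partition('|')
        payload = payload.strip()
        if not sep or not payload.startswith('['):
            continue  # header ("id | cls_embedding") or malformed line
        vectors.append(json.loads(payload))
    return vectors


# Tiny in-memory stand-in for one of the embedding files
sample = io.StringIO("id | cls_embedding\n0  | [0.1, -0.2, 0.3] \n1  | [1.0, 2.0, 3.0] \n")
vectors = load_embeddings(sample)
print(len(vectors), vectors[0])
```

Each vector can then be turned into a `(1, d)` tensor with `torch.tensor(v).unsqueeze(0)`, which is the shape that `cos_sim` (created with `dim=1`) compares row by row.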

# Hubness
Hubness is the tendency of high-dimensional data to contain points (hubs) that occur frequently in the k-nearest-neighbor lists of other points.
Since our embeddings come from a pretrained BERT model and are high-dimensional, we adopt a margin-based approach to mine our corpus for parallel sentences.
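One way to make hubness concrete is to count, for every point, how many other points include it among their k nearest neighbors (its k-occurrence). Hubs are the points with unusually high counts. A minimal sketch on random vectors, assuming cosine similarity as the neighbor criterion and a hypothetical helper `k_occurrence`:

```python
import torch
import torch.nn.functional as F


def k_occurrence(emb, k):
    """Count how often each row of emb appears in the k-NN lists of the other rows.

    emb: (n, d) tensor. Neighbors are ranked by cosine similarity.
    """
    normed = F.normalize(emb, dim=1)
    sims = normed @ normed.T                 # (n, n) cosine similarities
    sims.fill_diagonal_(float('-inf'))       # a point is not its own neighbor
    knn = sims.topk(k, dim=1).indices        # (n, k) neighbor indices
    return torch.bincount(knn.flatten(), minlength=emb.shape[0])


torch.manual_seed(0)
points = torch.randn(200, 768)               # random stand-in for sentence embeddings
counts = k_occurrence(points, k=5)
print(counts.sum().item())   # n * k = 1000
print(counts.max().item())   # a maximum far above k points to hubs
```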

# Margin-based scoring
This method is inspired by *Mikel Artetxe* and *Holger Schwenk*'s paper *Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*.

We consider the margin between the cosine of a given candidate and the average cosine of its k nearest neighbors in both directions as follows:

$$score(x,y)=margin\left(\cos(x,y),\ \sum_{z \in NN_k(x)}\frac{\cos(x,z)}{2k} + \sum_{z \in NN_k(y)}\frac{\cos(y,z)}{2k}\right)$$
where $NN_k(x)$ denotes the $k$ nearest neighbors of $x$ in the other language, excluding duplicates (and likewise for $NN_k(y)$).
The margin function can be one of:
  * Absolute: $margin(a,b)=a$
  * Distance: $margin(a,b)=a-b$
  * Ratio: $margin(a,b)=a/b$
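The formula and the three margin variants above can be sketched in batched form. This is a minimal illustration, not the notebook's implementation: it assumes a hypothetical function `margin_scores`, takes the neighbors of each sentence from the other language as in the definition of $NN_k$, and does not handle duplicate removal.

```python
import torch
import torch.nn.functional as F


def margin_scores(x_emb, y_emb, k, margin='ratio'):
    """Score every (x, y) pair with the margin criterion.

    x_emb: (n, d) embeddings of one language, y_emb: (m, d) of the other.
    Returns an (n, m) tensor of scores.
    """
    x = F.normalize(x_emb, dim=1)
    y = F.normalize(y_emb, dim=1)
    cos = x @ y.T                                   # cos(x, y) for all pairs
    # Sum of cosines to the k nearest neighbors in the other language,
    # divided by 2k as in the formula.
    x_term = cos.topk(k, dim=1).values.sum(dim=1) / (2 * k)   # (n,)
    y_term = cos.topk(k, dim=0).values.sum(dim=0) / (2 * k)   # (m,)
    b = x_term[:, None] + y_term[None, :]           # (n, m)
    if margin == 'absolute':
        return cos
    if margin == 'distance':
        return cos - b
    if margin == 'ratio':
        return cos / b
    raise ValueError(f"unknown margin: {margin}")


torch.manual_seed(0)
x_emb = torch.randn(6, 8)   # random stand-ins for sentence embeddings
y_emb = torch.randn(7, 8)
scores = margin_scores(x_emb, y_emb, k=3, margin='distance')
print(scores.shape)            # torch.Size([6, 7])
print(scores.argmax(dim=1))    # index of the best-scoring y for each x
```

Computing the full cosine matrix once lets both neighbor terms be read off with `topk` along rows and columns, at the cost of holding an `(n, m)` matrix in memory.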

### Determining NNk based on cosine similarity

In [None]:
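# mor_emb / alg_emb are assumed to be lists of (1, d) tensors, one per sentence.
def nnSimAlg(emb, n):
  # Indices of the n+1 Algerian sentences most similar to emb (cosine), best first
  cos_sim_table=[cos_sim(emb, sentence).item() for sentence in alg_emb]
  a=sorted(range(len(alg_emb)), key=lambda i: cos_sim_table[i], reverse=True)[:n+1]
  return a
def nnSimMor(emb, n):
  # Indices of the n+1 Moroccan sentences most similar to emb (cosine), best first
  cos_sim_table=[cos_sim(emb, sentence).item() for sentence in mor_emb]
  a=sorted(range(len(mor_emb)), key=lambda i: cos_sim_table[i], reverse=True)[:n+1]
  return a
def Margin_score(mor_embed, alg_embed, n):
  # Neighbours are searched in each sentence's own corpus (the paper uses the
  # other language), so rank 0 is the sentence itself and is skipped below.
  mor=nnSimMor(mor_embed, n)
  alg=nnSimAlg(alg_embed, n)
  sum_cos_mor=0
  sum_cos_alg=0
  for i in range(n):
    # mor[i+1] and alg[i+1] are indices: look up the embedding before comparing
    sum_cos_mor=sum_cos_mor+(cos_sim(mor_embed,mor_emb[mor[i+1]]))
    sum_cos_alg=sum_cos_alg+(cos_sim(alg_embed,alg_emb[alg[i+1]]))
  sum_cos_mor=sum_cos_mor/(2*n)
  sum_cos_alg=sum_cos_alg/(2*n)
  a=cos_sim(mor_embed,alg_embed)
  b=(sum_cos_mor+sum_cos_alg)
  score_diff=a-b    # distance margin
  score_frac=a/b    # ratio margin
  score_abs=a       # absolute margin
  return score_diff, score_frac, score_abs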

### Determining NNk based on distance

In [1]:
import torch.linalg as LA

In [None]:
def nnNormSimilarMor(emb, n):
  norm_table=[LA.vector_norm(emb-sentence, ord=17)for sentence in mor_emb]
  a=sorted(range(len(mor_emb)), key=lambda i: norm_table[i])[:n+1]
  return a

In [None]:
def nnNormSimilarAlg(emb, n):
  norm_table=[LA.vector_norm(emb-sentence, ord=17)for sentence in alg_emb]
  a=sorted(range(len(alg_emb)), key=lambda i: norm_table[i])[:n+1]
  return a
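The two helpers above compute one norm per sentence in a Python loop. The same ranking can be sketched as a single batched call, assuming a hypothetical helper `knn_by_distance` and defaulting to the ordinary Euclidean norm (`p=2`) rather than the `ord=17` used in the notebook:

```python
import torch


def knn_by_distance(query, corpus, n, p=2.0):
    """Indices of the n rows of corpus closest to each row of query.

    query: (q, d), corpus: (c, d). Distances are p-norms of the differences.
    """
    dists = torch.cdist(query, corpus, p=p)              # (q, c) pairwise distances
    return dists.topk(n, dim=1, largest=False).indices   # smallest distances first


corpus = torch.tensor([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 2.0]])
query = torch.tensor([[0.1, 0.0]])
print(knn_by_distance(query, corpus, n=2))  # tensor([[0, 1]])
```

Passing `p=17.0` would rank by the same norm order as the notebook's helpers.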

In [None]:
def Margin_score_with_norm(mor_embed, alg_embed, n):
  mor=nnNormSimilarMor(mor_embed, n)
  alg=nnNormSimilarAlg(alg_embed, n)
  sum_cos_mor=0
  sum_cos_alg=0
  for i in range(n):
    # mor[i+1] and alg[i+1] are indices: look up the embedding before comparing
    sum_cos_mor=sum_cos_mor+(cos_sim(mor_embed,mor_emb[mor[i+1]]))
    sum_cos_alg=sum_cos_alg+(cos_sim(alg_embed,alg_emb[alg[i+1]]))
  sum_cos_mor=sum_cos_mor/(2*n)
  sum_cos_alg=sum_cos_alg/(2*n)
  a=cos_sim(mor_embed,alg_embed)
  b=(sum_cos_mor+sum_cos_alg)
  score_diff=a-b
  score_frac=a/b
  score_abs=a
  return score_diff, score_frac, score_abs