Alex Jones (alexander.g.jones.23@dartmouth.edu) <br>
March 15, 2022 <br>
LING 28 (Rolando Coto-Solano), Winter 2022 <br>
Final Project


---

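This notebook translates the Danish side of the mined kl-da sentence pairs into English using a pretrained machine translation (MT) model, Helsinki-NLP/opus-mt-da-en, and saves the result as a kl-en corpus.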

In [25]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm
import torch
import pandas as pd
import time

In [3]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    
    # The Tesla K40m is treated as incompatible here, so fail early with a built-in exception.
    if torch.cuda.get_device_name(0) == "Tesla K40m":
        raise RuntimeError("GPU Error: No compatible GPU found")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: NVIDIA TITAN V


In [4]:
MODEL_NAME = 'Helsinki-NLP/opus-mt-da-en'

In [5]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [6]:
# Load model
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

In [7]:
# Move the model to the selected device (GPU if available, otherwise CPU)
model.to(device)

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(58930, 512, padding_idx=58929)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(58930, 512, padding_idx=58929)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
   

In [34]:
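def translateDAtoEN(sentences,
                    tokenizer,
                    model,
                    device):
    '''
    Translate a batch of Danish sentences to English using a pretrained Hugging Face model.
    Returns a list of English strings, one per input sentence.
    '''
    # Tokenize the batch, padding to the longest sentence and truncating overlong ones
    tokenized = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True).to(device)
    # Generate translations without tracking gradients
    with torch.no_grad():
        translated = model.generate(**tokenized)
    # Convert the generated token IDs back to text, dropping special tokens
    return tokenizer.batch_decode(translated, skip_special_tokens=True)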

In [19]:
kl_da_corpus = pd.read_csv('./data/kl-da/kl-da.csv', index_col=0)

In [20]:
kl_da_corpus

Unnamed: 0,kl,da
0,netredaktionen ansvarshavend chefredaktørchris...,flere kritiser laksepolitikken sermitsiaq\n
1,mary arctica angalaneq 144-mut allannguuteqart...,nægtede godkend lån siumut notat er udsendt ef...
2,statistik sermitsiaq\n,forsiden indland nuuk politik erhverv politi u...
3,polar aassik kivivoq sermitsiaq\n,vi har ikk undersøgt det siger ida abelsenhans...
4,dk iserfigiuk oqalliffimmut tikilluarit oqalli...,unikt sort hul opdaget mælkevejen sermitsiaq\n
...,...,...
6388,""" www\n",arbejd vore medarbejder på gøre leveranc klar ...
6389,filmi illoqarfiup filmertarfiani pingasunngorp...,når de kommend atlantlufthavn ilulissat og nuu...
6390,bussit ingerlaarnissaannut pilersaarut faceboo...,grønland erhverv vil derfor på det kraftigst o...
6391,euro-nngorlugit nusunneqartarput\n,del på facebook del på twitter del på googl em...


In [29]:
da_sents = list(kl_da_corpus['da'])

In [30]:
NUM_SENTS = len(da_sents)
BATCH_SIZE = 32
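# Ceiling division: avoids an extra, empty batch when NUM_SENTS is a multiple of BATCH_SIZE
NUM_BATCHES = (NUM_SENTS + BATCH_SIZE - 1) // BATCH_SIZE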
print(f'We will translate {NUM_BATCHES} batches of size {BATCH_SIZE}')

We will translate 200 batches of size 32
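The batch count is a ceiling division of the number of sentences by the batch size. The sketch below compares two ways of computing it; the function names are hypothetical, and the count of 6392 sentences is an assumption read off the index of the table displayed above. The two formulas agree except when the sentence count is an exact multiple of the batch size, where floor-plus-one yields one extra (empty) batch.

```python
import math


def batches_floor_plus_one(num_items: int, batch_size: int) -> int:
    """Floor division plus one: overcounts when num_items is a multiple of batch_size."""
    return (num_items // batch_size) + 1


def batches_ceil(num_items: int, batch_size: int) -> int:
    """Ceiling division: the smallest number of batches that covers every item."""
    return math.ceil(num_items / batch_size)


# 6392 is the assumed corpus size; 6400 is a made-up exact multiple of 32.
for n in (6392, 6400):
    print(n, batches_floor_plus_one(n, 32), batches_ceil(n, 32))
```

An empty final batch matters here because an empty list of sentences would be passed on to the tokenizer.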


In [35]:
transl_da_sents = []
start = time.time()
# Translate the corpus one batch at a time, collecting results in input order
for i in range(NUM_BATCHES):
    transl_da_sents.extend(translateDAtoEN(da_sents[i*BATCH_SIZE : (i+1)*BATCH_SIZE],
                                           tokenizer,
                                           model,
                                           device))
    print("Completed batch {:} of {:}".format(i+1, NUM_BATCHES))
end = time.time()
print("Time taken: {:.3f} seconds".format(end-start))

Completed batch 1 of 200
Completed batch 2 of 200
Completed batch 3 of 200
Completed batch 4 of 200
Completed batch 5 of 200
Completed batch 6 of 200
Completed batch 7 of 200
Completed batch 8 of 200
Completed batch 9 of 200
Completed batch 10 of 200
Completed batch 11 of 200
Completed batch 12 of 200
Completed batch 13 of 200
Completed batch 14 of 200
Completed batch 15 of 200
Completed batch 16 of 200
Completed batch 17 of 200
Completed batch 18 of 200
Completed batch 19 of 200
Completed batch 20 of 200
Completed batch 21 of 200
Completed batch 22 of 200
Completed batch 23 of 200
Completed batch 24 of 200
Completed batch 25 of 200
Completed batch 26 of 200
Completed batch 27 of 200
Completed batch 28 of 200
Completed batch 29 of 200
Completed batch 30 of 200
Completed batch 31 of 200
Completed batch 32 of 200
Completed batch 33 of 200
Completed batch 34 of 200
Completed batch 35 of 200
Completed batch 36 of 200
Completed batch 37 of 200
Completed batch 38 of 200
Completed batch 39 of
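The batched loop above can also be written as a small helper that steps through the list in strides, so no batch count is needed and no empty slice is ever produced. This is a minimal sketch, not the notebook's own code: `translate_in_batches` and `fake_translate` are hypothetical names, and the stand-in translator simply upper-cases its input so the example runs without a model.

```python
from typing import Callable, List


def translate_in_batches(sentences: List[str],
                         translate_fn: Callable[[List[str]], List[str]],
                         batch_size: int = 32) -> List[str]:
    """Apply translate_fn to consecutive slices of sentences and concatenate the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    results: List[str] = []
    # Stepping by batch_size never yields an empty slice, even for exact multiples
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        output = translate_fn(batch)
        # Each input sentence must map to exactly one output so rows stay aligned
        if len(output) != len(batch):
            raise ValueError("translate_fn returned a different number of sentences")
        results.extend(output)
    return results


def fake_translate(batch: List[str]) -> List[str]:
    """Stand-in for a real MT call (an assumption for illustration only)."""
    return [s.upper() for s in batch]


demo = [f"sentence {i}" for i in range(70)]
translated = translate_in_batches(demo, fake_translate, batch_size=32)
print(len(translated))
```

In this notebook, `translate_fn` could be something like `lambda batch: translateDAtoEN(batch, tokenizer, model, device)`. The length check guards the final step, where the translations are paired row by row with the `kl` column.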

In [42]:
pd.DataFrame({'kl': kl_da_corpus['kl'], 'en': transl_da_sents}).to_csv('./data/kl-en/kl-en.csv')