What are all the languages it supports?

MuRIL currently supports the following 17 languages:

    Assamese
    Bengali
    English
    Gujarati
    Hindi
    Kannada
    Kashmiri
    Malayalam
    Marathi
    Nepali
    Oriya
    Punjabi
    Sanskrit
    Sindhi
    Tamil
    Telugu
    Urdu

MuRIL is trained much like Multilingual BERT, with the addition of translation and transliteration segment pairs. The model is pre-trained on monolingual segments as well as parallel segments, as detailed below:

    Monolingual Data: Publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
    Translated Data: Translations of the above monolingual corpora using the Google NMT pipeline and publicly available PMINDIA corpus.
    Transliterated Data: Transliterations of Wikipedia using the IndicTrans library and publicly available Dakshina dataset.
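
The text does not say how the parallel segments are packed into a single training input. Assuming the usual BERT convention for two-segment inputs, a translation or transliteration pair could be laid out roughly as in the sketch below. The function `pack_segment_pair` and the toy tokens are hypothetical and are not taken from MuRIL's training code.

```python
# Illustrative only: BERT-style packing of two segments into one input.
# The function name and the toy tokens are hypothetical.
def pack_segment_pair(tokens_a, tokens_b):
    """Return (tokens, type_ids) for a pair of already-tokenized segments."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment A (with [CLS] and its [SEP]) gets type 0, segment B gets type 1.
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, type_ids


# A transliteration-style pair: a Devanagari segment and its romanized form.
tokens, type_ids = pack_segment_pair(["मित्र"], ["mitr"])
print(tokens)
print(type_ids)
```

The type IDs are what would let a model tell the two halves of a pair apart; the rest of this notebook only uses single-segment inputs, where every type ID is 0.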

In [1]:
!pip install bert-for-tf2
!pip install tensorflow-text

In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
import tensorflow_text as text
from bert import bert_tokenization
import numpy as np
from scipy.spatial import distance

The get_model function created here downloads the MuRIL model from TensorFlow Hub. It accepts two inputs: the model_url and the maximum sequence length, that is, the maximum number of tokens in each input sequence.
TensorFlow Hub is a repository of trained machine learning models ready for fine-tuning and deployable anywhere.

In [4]:
# Build a Keras model around MuRIL that takes BERT-style inputs:
# word-piece IDs, an input mask, and segment (type) IDs.
def get_model(model_url, max_seq_length):
    inputs = dict(
        input_word_ids=tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32),
        input_mask=tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32),
        input_type_ids=tf.keras.layers.Input(shape=(max_seq_length,), dtype=tf.int32),
    )
    # Load the MuRIL encoder from TensorFlow Hub as a trainable Keras layer.
    muril_layer = hub.KerasLayer(model_url, trainable=True)
    outputs = muril_layer(inputs)

    # Sanity-check that the expected output keys are present.
    assert 'sequence_output' in outputs
    assert 'pooled_output' in outputs
    assert 'encoder_outputs' in outputs
    assert 'default' in outputs
    # Expose only the pooled output as the model's output.
    return tf.keras.Model(inputs=inputs, outputs=outputs["pooled_output"]), muril_layer

In [None]:
max_seq_length = 128
muril_model, muril_layer = get_model(
    model_url="https://tfhub.dev/google/MuRIL/1", max_seq_length=max_seq_length)

The path to the vocabulary file and the lowercasing flag are read from the loaded layer, and a full BERT tokenizer is created from them.

In [6]:
vocab_file = muril_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = muril_layer.resolved_object.do_lower_case.numpy()
tokenizer = bert_tokenization.FullTokenizer(vocab_file, do_lower_case)
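
To give a feel for what the WordPiece step of such a tokenizer does, here is a minimal sketch of greedy longest-match-first splitting of a single word. It is a simplified illustration, not the library's code: the function `wordpiece` and the tiny vocabulary are made up for this example, and the real MuRIL vocabulary is far larger.

```python
# Toy greedy longest-match-first WordPiece split of one word.
# The vocabulary below is invented for illustration only.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"ghar", "gh", "##ar", "ja", "##ung", "##a"}
print(wordpiece("ghar", toy_vocab))
print(wordpiece("jaunga", toy_vocab))
print(wordpiece("naam", toy_vocab))
```

A word that is in the vocabulary stays whole, a word that is not is split into a head piece and `##`-prefixed continuation pieces, and a word with no matching pieces collapses to the unknown token.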

The next step is to build the model inputs for each input sequence.
Here the token IDs, the input mask, and the segment (type) IDs are created; the model computes the token, segment, and position embeddings from them internally.

In [7]:
def create_input(input_strings, tokenizer, max_seq_length):
  """Convert raw strings into padded ID, mask, and segment arrays."""
  input_ids_all, input_mask_all, input_type_ids_all = [], [], []
  for input_string in input_strings:
    # Truncate the word pieces first so that [CLS] and [SEP] always fit.
    tokens = tokenizer.tokenize(input_string)[:max_seq_length - 2]
    input_tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
    sequence_length = len(input_ids)

    # Pad with 0 up to max_seq_length; the mask marks the real tokens with 1.
    padding = [0] * (max_seq_length - sequence_length)
    input_mask = [1] * sequence_length + padding
    input_ids = input_ids + padding

    input_ids_all.append(input_ids)
    input_mask_all.append(input_mask)
    # Every input is a single segment, so all type IDs are 0.
    input_type_ids_all.append([0] * max_seq_length)

  return np.array(input_ids_all), np.array(input_mask_all), np.array(input_type_ids_all)
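
The padding and masking logic is easier to see on a toy input. The sketch below isolates just that step; `pad_and_mask` is a hypothetical helper and the token IDs are made up rather than taken from the MuRIL vocabulary.

```python
# Pad (or truncate) a list of token IDs to a fixed length and build the mask.
# Helper name and token IDs are invented for illustration.
def pad_and_mask(ids, max_len, pad_id=0):
    ids = ids[:max_len]                      # truncate if too long
    n_pad = max_len - len(ids)               # how many padding slots remain
    mask = [1] * len(ids) + [0] * n_pad      # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask


ids, mask = pad_and_mask([104, 2571, 105], max_len=6)
print(ids)
print(mask)

long_ids, long_mask = pad_and_mask([1, 2, 3, 4, 5], max_len=3)
print(long_ids)
print(long_mask)
```

Every row produced this way has the same length, which is what allows the inputs to be stacked into rectangular arrays.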

The encode function defined here is the final piece of this experiment. It accepts the input strings, creates the model inputs, and passes them to the MuRIL model.

In [8]:
def encode(input_text):
  input_ids, input_mask, input_type_ids = create_input(input_text, 
                                                       tokenizer, 
                                                       max_seq_length)
  inputs = dict(
      input_word_ids=input_ids,
      input_mask=input_mask,
      input_type_ids=input_type_ids,
  )
  return muril_model(inputs)

The following list of Hindi words is passed to the encode function to obtain the pooled MuRIL representation of each word.

In [9]:
sentences = ["दोस्त", "मित्र", "शत्रु"]

In [10]:
embeddings = encode(sentences)

Given the meanings of the three words in the list, we can check the distances between their embeddings. The first two words, “dost” and “mitr”, both mean “friend”, so the distance between them in the embedding space should be small.

In [11]:
dst_1 = distance.euclidean(np.array(embeddings[0]), 
                           np.array(embeddings[1]))
print("Distance between {} & {} is {}".format(sentences[0],
                                                sentences[1],
                                                dst_1))

Distance between दोस्त & मित्र is 0.009007968939840794


The distance between the second and third words, “mitr” (friend) and “shatru” (enemy), is somewhat larger.

In [12]:
dst_2 = distance.euclidean(np.array(embeddings[1]), 
                           np.array(embeddings[2]))
print("Distance between {} & {} is {}".format(sentences[1],
                                                sentences[2],
                                                dst_2))

Distance between मित्र & शत्रु is 0.011569418013095856


In [13]:
dst_2 > dst_1

True
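
For reference, the Euclidean distance used above is the square root of the summed squared differences between two vectors. The sketch below computes it by hand for every pair in a small set; the labels and the two-dimensional vectors are invented for illustration and are not MuRIL embeddings.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy 2-D vectors standing in for embeddings (values are made up).
toy = {"w1": [0.0, 0.0], "w2": [0.0, 1.0], "w3": [3.0, 4.0]}

# Distance for every unordered pair of labels.
pairs = {(p, q): euclidean(toy[p], toy[q]) for p, q in combinations(toy, 2)}
for (p, q), d in pairs.items():
    print(p, q, round(d, 4))
```

Computing all pairs at once avoids repeating the same cell for each comparison.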

In [15]:
code_mix_sentences = ["मै घर जाऊंगा","मै घर जा रही हूँ","i am going home",'main ghar ja raha hoon','apka naam kya hai']

In [16]:
code_mix_embedding = encode(code_mix_sentences)

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
dst_1 = cosine_similarity(np.array(code_mix_embedding[0]).reshape(1,-1), 
                           np.array(code_mix_embedding[1]).reshape(1,-1))
print("SIMILARITY between {} & {} is {}".format(code_mix_sentences[0],
                                                code_mix_sentences[1],
                                                dst_1))

SIMILARITY between मै घर जाऊंगा & मै घर जा रही हूँ is [[0.9997084]]


In [19]:
dst_1 = cosine_similarity(np.array(code_mix_embedding[0]).reshape(1,-1), 
                           np.array(code_mix_embedding[2]).reshape(1,-1))
print("SIMILARITY between {} & {} is {}".format(code_mix_sentences[0],
                                                code_mix_sentences[2],
                                                dst_1))

SIMILARITY between मै घर जाऊंगा & i am going home is [[0.99974626]]


In [20]:
dst_1 = cosine_similarity(np.array(code_mix_embedding[0]).reshape(1,-1), 
                           np.array(code_mix_embedding[4]).reshape(1,-1))
print("SIMILARITY between {} & {} is {}".format(code_mix_sentences[0],
                                                code_mix_sentences[4],
                                                dst_1))

SIMILARITY between मै घर जाऊंगा & apka naam kya hai is [[0.99927735]]
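
Cosine similarity compares only the direction of two vectors and ignores their length. As a minimal sketch, the full matrix of pairwise similarities can be built in one pass instead of one pair per cell; `cosine`, `similarity_matrix`, and the toy vectors are hypothetical and stand in for the real embeddings.

```python
import math


def cosine(u, v):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Cosine similarity of every vector against every other vector."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy vectors (made up): the first two point the same way, the third is orthogonal.
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(x, 3) for x in row])
```

With the real embeddings, scikit-learn's cosine_similarity should give the same kind of matrix when it is passed the whole 2-D array of embeddings rather than one reshaped row at a time.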
