Learning Cross-Lingual Phonological and Orthographic Adaptations

This repo hosts the code necessary to reproduce the results of our paper, formerly titled Neural Machine Translation Based Word Transduction Mechanisms for Low Resource Languages (recently accepted at the Journal of Language Modelling).

(Figure: encoder-decoder architecture)


Generating char2vec from pre-trained Hindi fastText embeddings

The pre-trained Hindi character vectors can be downloaded from here. This repo contains two methods for generating these character embeddings:

  1. Running the file generate_char2vec.py generates the character vectors for 71 Devanagari characters from the pre-trained word vectors. The outputs can be found in char2vec.txt (see the first sketch after this list).
  2. Running the file char_rnn.py trains a character-level language model over the hindi-wikipedia-articles-55000 corpus (i.e., predicting the 30th character given a sequence of 29 consecutive characters). The trained embedding weights are then retained as the character-level embeddings (see the second sketch after this list).
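For orientation, the two procedures can be sketched. First, a minimal sketch of method 1, assuming a character's vector is the average of the fastText vectors of the words containing it; the file names, dimensionality, character range, and the averaging scheme are assumptions, not the confirmed contents of generate_char2vec.py:

```python
# Sketch (assumed scheme): build char2vec.txt by averaging the fastText
# vectors of all words that contain each Devanagari character.
import numpy as np

def load_word_vectors(path, limit=100000):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "<count> <dim>" header of the .vec format
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

word_vecs = load_word_vectors("cc.hi.300.vec")  # assumed file name
devanagari = [chr(c) for c in range(0x0905, 0x0905 + 71)]  # 71 chars, assumed range

with open("char2vec.txt", "w", encoding="utf-8") as out:
    for ch in devanagari:
        hits = [v for w, v in word_vecs.items() if ch in w]
        if hits:
            vec = np.mean(hits, axis=0)
            out.write(ch + " " + " ".join("%.5f" % x for x in vec) + "\n")
```

Second, a sketch of method 2: a Keras character-level language model that predicts the 30th character from the preceding 29, after which the embedding layer's weight matrix is kept as the char2vec table. The corpus path, layer sizes, and training settings are assumptions:

```python
# Sketch: char-level LM over a Hindi Wikipedia dump; the Embedding layer
# learned during training doubles as the character-embedding table.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

SEQ_LEN = 29  # predict the 30th character from 29 consecutive ones
text = open("hindi_wikipedia_articles.txt", encoding="utf-8").read()[:200000]
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}

# Build (29-char context, next char) training pairs.
X = np.array([[char2idx[c] for c in text[i:i + SEQ_LEN]]
              for i in range(len(text) - SEQ_LEN)])
y = np.array([char2idx[text[i + SEQ_LEN]] for i in range(len(text) - SEQ_LEN)])

model = Sequential([
    Embedding(len(chars), 300, input_length=SEQ_LEN),
    LSTM(256),
    Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=128, epochs=5)

# Retain the embedding weights as the character-level embeddings.
char_embeddings = model.layers[0].get_weights()[0]  # shape: (n_chars, 300)
```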

Models Used

We experimented with four variants of sequence-to-sequence models for our project:

  • Peeky Seq2seq Model: Run the file peeky_Seq2seq.py. The implementation is based on Sequence to Sequence Learning with Keras.
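A minimal sketch of the "peek" idea, as popularized by the Keras seq2seq library this bullet references: the encoder's summary vector is repeated and fed to every decoder timestep, so the decoder can "peek" at the source encoding throughout generation. Sizes and layer choices are assumptions, not the configuration of peeky_Seq2seq.py:

```python
# Sketch of a peeky encoder-decoder: the encoder context is repeated and
# consumed by the decoder at every output timestep.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, RepeatVector, Dense, TimeDistributed

SRC_LEN, TGT_LEN, VOCAB, DIM = 20, 20, 90, 128  # assumed sizes

src = Input(shape=(SRC_LEN,))
x = Embedding(VOCAB, DIM)(src)
context = LSTM(DIM)(x)                       # encoder summary vector
peek = RepeatVector(TGT_LEN)(context)        # same context at every step
dec = LSTM(DIM, return_sequences=True)(peek)
out = TimeDistributed(Dense(VOCAB, activation="softmax"))(dec)

model = Model(src, out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```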

  • Alignment Model (AM): Run the file attentionDecoder.py. Following the work of Bahdanau et al. [1], the file attention_decoder.py contains the custom Keras layer, implemented on the TensorFlow backend. The original implementation can be found here, and a good blog post guiding the use of this implementation can be found here (a usage sketch follows).
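A sketch of how such a custom attention layer is typically wired into an encoder-decoder stack, following the blog implementation referenced above; the import path and hyperparameters are assumptions:

```python
# Sketch: an LSTM encoder returns its full state sequence, and the custom
# AttentionDecoder layer attends over those states while decoding.
from keras.models import Sequential
from keras.layers import Embedding, LSTM
from attention_decoder import AttentionDecoder  # custom layer (assumed path)

SRC_LEN, VOCAB, DIM = 20, 90, 128  # assumed sizes

model = Sequential([
    Embedding(VOCAB, DIM, input_length=SRC_LEN),
    LSTM(DIM, return_sequences=True),   # encoder states for attention
    AttentionDecoder(DIM, VOCAB),       # attends over encoder states
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```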

  • Hierarchical Attention Model (HAM): Run the file attentionEncoder.py. Inspired by the work of Yang et al. [2]; the original implementation can be found here (a sketch of the attention pooling follows).
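The core of the Yang et al. [2] attention can be sketched as a small pooling layer: a learned context vector scores every timestep of a recurrent encoder, and the hidden states are summed under those softmax weights. This is a generic sketch of the mechanism, not the code in attentionEncoder.py:

```python
# Sketch of Yang et al.-style attention pooling as a Keras custom layer:
# u_t = tanh(W h_t + b), alpha = softmax(u_t . u), output = sum_t alpha_t h_t
import keras.backend as K
from keras.layers import Layer

class AttentionPool(Layer):
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(dim, dim), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(dim,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(dim,), initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, h):
        uit = K.tanh(K.dot(h, self.W) + self.b)          # (batch, time, dim)
        alpha = K.softmax(K.sum(uit * self.u, axis=-1))  # (batch, time)
        return K.sum(h * K.expand_dims(alpha), axis=1)   # (batch, dim)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])
```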

  • Transformer Network: generate_data_for_tensor2tensor.py generates the data in the format required by the Transformer network [3]. This data is needed when registering your own dataset as a tensor2tensor problem (see this for further reading; a registration sketch follows). For a detailed look at installation and usage, visit their official GitHub page.
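Registering a custom problem with tensor2tensor looks roughly like the following sketch; the class name, problem settings, and the tab-separated input format are assumptions about what generate_data_for_tensor2tensor.py produces, not the repo's actual registration code:

```python
# Sketch: a custom text-to-text problem registered with tensor2tensor.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class TransduceHindiBhojpuri(text_problems.Text2TextProblem):
    """Character-level Hindi -> Bhojpuri word transduction (assumed name)."""

    @property
    def is_generate_per_split(self):
        return False  # let t2t carve train/dev splits from one generator

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        # One tab-separated (hindi, bhojpuri) pair per line (assumed format).
        with open("hindi_bhojpuri_pairs.tsv", encoding="utf-8") as f:
            for line in f:
                src, tgt = line.rstrip("\n").split("\t")
                yield {"inputs": src, "targets": tgt}
```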

Evaluation metrics

  • bleu_score.py measures the BLEU score between the transduced and the actual Bhojpuri words, averaged over the entire output file (sketched below).
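A sketch of this metric, assuming each transduced word is scored as a character sequence against its reference with sentence-level BLEU and the scores are averaged; the one-pair-per-line tab-separated file format is an assumption:

```python
# Sketch: average character-level BLEU over a file of (predicted, reference)
# word pairs; smoothing avoids zero scores on short words.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
scores = []
with open("outputs.tsv", encoding="utf-8") as f:
    for line in f:
        pred, ref = line.rstrip("\n").split("\t")
        scores.append(sentence_bleu([list(ref)], list(pred),
                                    smoothing_function=smooth))
print("Average BLEU:", sum(scores) / len(scores))
```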

  • word_accuracy.py simply measures the proportion of correctly transduced words in the output file.
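A sketch, under the same assumed tab-separated file format as above:

```python
# Sketch: word accuracy is the fraction of predictions matching the
# reference exactly.
with open("outputs.tsv", encoding="utf-8") as f:
    pairs = [line.rstrip("\n").split("\t") for line in f]
correct = sum(pred == ref for pred, ref in pairs)
print("Word accuracy:", correct / len(pairs))
```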

  • measure_distance.py measures the Soundex score similarity between the actual and transduced Bhojpuri word pairs, averaged over the output file. A good blog post explaining the implementation can be found here.
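A simplified sketch of one way to turn Soundex codes into a similarity score. Classic Soundex is defined for Latin script, so this assumes romanized inputs; measure_distance.py may instead use an Indic-aware variant, as the linked blog post discusses:

```python
# Sketch: simplified Soundex encoding, then similarity as the fraction of
# matching positions in the two 4-character codes.
def soundex(word):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

def soundex_similarity(a, b):
    ca, cb = soundex(a), soundex(b)
    return sum(x == y for x, y in zip(ca, cb)) / 4.0

print(soundex_similarity("robert", "rupert"))  # 1.0: same Soundex code
```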

Citation

If our code was helpful in your research, consider citing our work:

@article{jha2018neural,
  title={Neural Machine Translation based Word Transduction Mechanisms for Low-Resource Languages},
  author={Jha, Saurav and Sudhakar, Akhilesh and Singh, Anil Kumar},
  journal={arXiv preprint arXiv:1811.08816},
  year={2018}
}

References

[1] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. CoRR, abs/1409.0473.

[2] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., & Hovy, E.H. (2016). Hierarchical Attention Networks for Document Classification. HLT-NAACL.

[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NIPS.
