State-of-the-Art Language Modeling and Text Classification in Hindi Language

hindi2vec

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi language modeling, compared to 40.68 for English (lower is better).

  • To the best of my knowledge, as of September 18, 2018
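
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to the loss a language model reports. It assumes the conventional definition, perplexity = exp(mean per-token cross-entropy in nats); the notebook's exact evaluation code is not reproduced here, and the function name `perplexity` and the sample loss values are hypothetical.

```python
import math


def perplexity(token_losses):
    """Return exp(mean loss) for per-token cross-entropy losses in nats.

    Assumption: this is the conventional definition of perplexity; it is
    not copied from the notebook in this repository.
    """
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    return math.exp(sum(token_losses) / len(token_losses))


def implied_loss(ppl):
    """Invert the definition: the mean cross-entropy implied by a perplexity."""
    return math.log(ppl)


if __name__ == "__main__":
    # Hypothetical per-token losses, for illustration only.
    sample_losses = [3.9, 3.8, 3.85]
    print("perplexity of sample losses:", perplexity(sample_losses))

    # Mean cross-entropy implied by the two perplexities quoted above.
    print("implied loss, Hindi (46.81):", implied_loss(46.81))
    print("implied loss, English (40.68):", implied_loss(40.68))
```

Because perplexity is exponential in the loss, a small gap in cross-entropy shows up as a much larger gap in perplexity.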

Downloads

TODO

  • Language modeling based on a Wikipedia dump
  • Release Language Models: Hindi Language Model
  • Create Text classification Datasets: BBC Hindi
  • Benchmark text classification with FastText

Idea Dump

  • Change the custom head to perform transliteration instead of classification: Hindi script (Devanagari) to English script (Roman)
  • Multi-task learning (MTL) for training and inference using custom heads
  • Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, the version used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here

Special thanks to Jeremy, Rachel and the other contributors to fastai. This work reproduces their English-language work for Hindi. Thanks to @cstorm125 for thai2vec, which inspired this work.