12-in-1: Multi-Task Vision and Language Representation Learning Web Demo

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets, often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of task: visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform an in-depth analysis of the effect of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

Arxiv Paper Link: https://arxiv.org/abs/1912.02315

Demo Link: https://vilbert.cloudcv.org/

If you have more questions about the project, you can email us at team@cloudcv.org

Built & Maintained by -

Rishabh Jain

Acknowledgements

We thank Jiasen Lu for his help.
