NAN.ai English-Filipino Corpus is an open source initiative to create an open data repository for Filipino-English corpus intended to train cross-lingual natural language models for chatbots and topic models, among others.

Saphron-Asia/nan.ai-opendata-corpus

Nan.ai English-Filipino Corpus Open Data

Nan.ai English-Filipino Corpus is an open-source initiative to create an open data repository for a Filipino-English corpus, intended to train cross-lingual natural language models for chatbots and topic models, among others. The corpus is currently maintained to train Nan.ai support, the chatbot that serves as customer support for Nan.ai users. Supported by UNICEF Innovations, this initiative aims to start collaboration on enriching low-resource languages, such as Filipino, and to jump-start NLP applications.

You can participate by (1) contributing language data or (2) annotating existing datasets. We also welcome computing and linguistics experts to improve this repository's usability for various use cases.

To explore our datasets, you can use the existing NLP notebooks available here or import data by following the instructions here.

Description of the data

Language data is collected from the public domain and stored as text files. These text files are grouped by use (spoken or written text) and labeled by source (e.g. reportage, conversation). The data available in this repository is annotated and, where a dataset contains personally identifiable information (PII), anonymized.
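As a rough illustration of working with text files grouped by use and labeled by source, the sketch below walks a local copy of the data and yields one record per file. The directory layout `<root>/<use>/<source>/*.txt`, the `Document` type, and the `iter_documents` name are assumptions made for this example; they are not taken from the repository, so adjust the glob pattern to the actual layout.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class Document(NamedTuple):
    """One corpus text file plus the labels implied by its location."""
    use: str      # e.g. "spoken" or "written"
    source: str   # e.g. "reportage" or "conversation"
    path: Path
    text: str


def iter_documents(root: Path) -> Iterator[Document]:
    """Yield every .txt file found at root/<use>/<source>/.

    The two-level layout is a hypothetical convention for this sketch,
    not the repository's documented structure.
    """
    for path in sorted(root.glob("*/*/*.txt")):
        # The first two path components below root act as the labels.
        use, source = path.relative_to(root).parts[:2]
        yield Document(use, source, path, path.read_text(encoding="utf-8"))
```

Calling `list(iter_documents(Path("data")))` on a folder laid out this way would give records that can be filtered by `use` or `source` before training.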

We are also creating datasets derived from and annotated against the corpus data, such as stoplists, labeled sentiments, and domain-specific dictionaries.
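To show how a derived dataset such as a stoplist might be used, here is a minimal sketch that filters stopwords out of a sentence. It assumes a stoplist stored as one word per line. The function names and the sample words are illustrative only and do not come from the repository's stoplists.

```python
import re
from typing import Iterable, List, Set


def load_stoplist(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase stopword set from lines, skipping blank ones."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text: str, stoplist: Set[str]) -> List[str]:
    """Lowercase the text, split it into word tokens, and drop stopwords.

    A token is a run of letters, optionally joined by hyphens or
    apostrophes (so "mag-aral" stays one token).
    """
    tokens = re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())
    return [token for token in tokens if token not in stoplist]


# Illustrative stoplist mixing Filipino and English function words.
sample_stoplist = load_stoplist(["ang", "ng", "sa", "the"])
filtered = remove_stopwords("Ang bata ay pumunta sa the store", sample_stoplist)
# filtered == ["bata", "ay", "pumunta", "store"]
```

Because the corpus mixes English and Filipino, a single combined stoplist is used here for simplicity. Separate per-language lists would work the same way.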

Alongside our open data initiative, we are also open-sourcing a related machine learning service, NAN.ai Natural Language Understanding (NLU).

Navigate this project

Resources

License

nanai-opendata-corpus is licensed under the Creative Commons Zero v1.0 Universal license.
