
TextAsCorpusRep

Multilingual Text As Corpus Repository for Machine Translation of Low-Resource Languages
Explore the docs »

View Demo (TODO) · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Usage
  3. Roadmap
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project


Our project addresses low-resource languages, focusing on Mauritian Creole and Kurdish dialects. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.

Our research questions are:

  • (Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
  • (Q2) How can we ensure correct, useful, and high-quality translations and linguistic annotations, taking variation and dialectal nuances into account?

The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach, including data acquisition, evaluation, and risk mitigation. Our project contributes to the UN Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.

(back to top)

Built With

Major frameworks and libraries used or considered for bootstrapping this project (a small illustrative sketch follows the list):

  • BeautifulSoup
  • Scrapy
  • langid.py
  • fastText
  • spaCy
  • NLTK
  • KLPT
  • ASAB
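
As a rough illustration of how a few of these tools could fit together during data acquisition, the sketch below fetches a web page, strips the markup with BeautifulSoup, and guesses the language of the extracted text with langid.py. The URL is a placeholder and the snippet is not part of this repository's code; it only illustrates the kind of building blocks listed above.

    # Illustrative sketch (not part of main.py): fetch a page, strip markup,
    # and guess the language of the extracted text. The URL is a placeholder.
    import requests                 # plain HTTP fetching
    from bs4 import BeautifulSoup   # HTML parsing / text extraction
    import langid                   # off-the-shelf language identification

    url = "https://example.org/some-article"  # placeholder URL

    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    lang, score = langid.classify(text)  # returns e.g. ("de", -1234.5)
    print(f"Detected language: {lang} (score {score:.1f})")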

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

  • You need a Python installation (tested with: 3.11.5 on macOS & 3.10.9 on Ubuntu)
  • You need to use a terminal (at least once ;) ). For more information about how to work with a terminal, refer to Microsoft's guide for Windows, Apple's guide for macOS, and Ubuntu's guide for Linux systems.

Installation

Create a directory in which the corpus and all your projects will be saved; for this description we will call it "MyAwesomeDirectory". Then navigate into this directory and open a terminal from within it.

  1. Clone this repository to get a local copy on your system by executing the following line inside your terminal:
    git clone git@github.com:Low-ResourceDialectology/TextAsCorpusRep.git
  2. (Optional, but recommended) Create a virtual environment:
    1. (If not yet installed) Install Python venv:
    python3 -m pip install virtualenv
    or alternatively (at least on Ubuntu) via:
    apt install python3.10-venv
    2. Create an environment named "venvTextAsCorpusRep":
    python -m venv venvTextAsCorpusRep
    3. Activate the virtual environment every time before starting work:
    source venvTextAsCorpusRep/bin/activate
  3. Navigate into the cloned corpus directory named "TextAsCorpusRep", so that you end up inside it:
    cd TextAsCorpusRep
    Assuming you cloned the repository into your "/Home/Download/" directory, you would type
    cd /Home/Download/MyAwesomeDirectory/TextAsCorpusRep
  4. Install the requirements:
    python -m pip install -r requirements.txt
  5. Usage is managed via the main.py script (continue in the Usage section below):
    python main.py -MODE_TO_OPERATE -l LANGUAGE_A LANGUAGE_B LANGUAGE_C

Usage

TODO: How to explore/read/use the corpus data.

(First steps) on Ubuntu 22.04

(First steps) on Windows 10

(First steps) on Mac

Continuing for any operating system

Collect datasets

python main.py -c -l ger kur mor ukr vie

Preprocess collected data

python main.py -p -l ger kur mor ukr vie

Explore collected and preprocessed data

python main.py -e -l ger kur mor ukr vie
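
The actual command-line interface is defined in main.py; purely as an assumption about how such flags could be wired up, the sketch below parses the -c/-p/-e modes and a -l language list with argparse. It is an illustration, not the repository's real implementation.

    # Hypothetical sketch of a CLI with flags like the ones above; the real
    # argument handling lives in main.py and may differ.
    import argparse

    parser = argparse.ArgumentParser(description="TextAsCorpusRep pipeline (sketch)")
    parser.add_argument("-c", "--collect", action="store_true", help="collect datasets")
    parser.add_argument("-p", "--preprocess", action="store_true", help="preprocess collected data")
    parser.add_argument("-e", "--explore", action="store_true", help="explore collected and preprocessed data")
    parser.add_argument("-l", "--languages", nargs="+", default=[], help="language codes, e.g. ger kur mor ukr vie")
    args = parser.parse_args()

    if args.collect:
        print(f"Collecting data for: {', '.join(args.languages)}")
    if args.preprocess:
        print(f"Preprocessing data for: {', '.join(args.languages)}")
    if args.explore:
        print(f"Exploring data for: {', '.join(args.languages)}")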

(back to top)

Roadmap

  • Set up this Repository
  • Prior Exploration of Available Data Sets
  • Phase 1: Initial Data Acquisition
    • Web Crawling and Scraping
      • Language-specific (news) websites
      • Language-specific Wikipedia & similar
    • Language Identification
      • Based on already available tools
      • Own approach based on linguistic rules (see the sketch after this roadmap)
  • Phase 2: Targeting Crucial Aspects
    • Native Speaker Involvement
      • Contact Language Communities
      • Field Worker Data Collection
    • Exchange with Language Experts
      • Expert Interviews (Delphi Study?)
      • Situating the Project's Corpus in Research
  • Phase 3: Final Quality Evaluation
    • Algorithmic Approaches
      • Structure and Basic Attributes of Data
      • Tentative Use of Language Models
    • Mobilizing Native Speakers
      • Application for easy use on smartphones
      • Social Media Platforms
  • Finalize Documentation and Release Corpus
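
The rule-based language identification mentioned in Phase 1 is still an open item; purely to sketch the general idea, the snippet below scores a text against small, hand-maintained lists of characteristic characters and words per language. The cue lists are illustrative placeholders, not curated linguistic rules.

    # Illustrative sketch of rule-based language identification. The cue lists
    # are placeholders; real rules would be curated together with language experts.
    RULES = {
        "kur": {"chars": set("şçêîû"), "words": {"û", "li", "ez"}},     # placeholder cues
        "ger": {"chars": set("äöüß"), "words": {"und", "der", "die"}},  # placeholder cues
    }

    def score(text: str) -> dict[str, int]:
        tokens = text.lower().split()
        scores = {}
        for lang, cues in RULES.items():
            char_hits = sum(ch in cues["chars"] for ch in text.lower())
            word_hits = sum(tok in cues["words"] for tok in tokens)
            scores[lang] = char_hits + 2 * word_hits  # weight word cues higher
        return scores

    print(score("die kinder gehen über die straße"))  # toy example sentence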

See the open issues (TODO) for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the Apache License. See LICENSE.txt for more information.

(back to top)

Contact

Christian Schuler - @christians89898 - christianschuler8989(4T)gmail.com

Deepesha Saurty - deepesha.saurty@studium.uni-hamburg.de

Tramy Thi Tran - @TranyMyy - tramy.thi.tran@studium.uni-hamburg.de

Raman Ahmed -

Anran Wang - @AnranW - anran.wang (thesymbolforemail)tum.de

(back to top)

Acknowledgments

A list of helpful resources we would like to give credit to:

(back to top)
