Multilingual Text As Corpus Repository for Machine Translation of Low-Resource Languages
Explore the docs »
View Demo (TODO)
·
Report Bug
·
Request Feature
Our project addresses low-resource languages, focusing on Mauritian Creole and Kurdish dialects. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.
Our research questions are:
- (Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
- (Q2) How can we ensure correct, useful, and high-quality translations and linguistic annotations in the face of dialectal variation and nuance?
The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach covering data acquisition, evaluation, and risk mitigation. The project contributes to the UN Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.
Major frameworks and libraries used or considered to bootstrap this project:
- BeautifulSoup
- Scrapy
- langid.py
- fastText
- spaCy
- NLTK
- KLPT
- ASAB
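Several of these libraries target the same early pipeline step: pulling clean, visible text out of web pages before language identification and annotation. As a rough illustration of that step (BeautifulSoup or Scrapy would do this far more robustly in practice), here is a minimal standard-library-only sketch; all names are illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside of skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In the actual pipeline, BeautifulSoup's `get_text()` (or Scrapy selectors) would replace this class; the sketch only shows what "text extraction" means here.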
To get a local copy up and running follow these simple steps.
- You need a Python installation (tested with: 3.11.5 on macOS & 3.10.9 on Ubuntu)
- For how to install Python on Windoof refer to: Using Python on Windows
- For how to install Python on macOS refer to: Using Python on a Mac
- For how to install Python on Linux refer to: The person who introduced you to Linux, and please tell them "The Lannisters send their regards!" (or go to Using Python on Unix platforms)
- You need to use a terminal (at least once ;) ) For more information about how to work with a terminal, refer to Microsoft's Guide for Windoof, Apple's Guide for macOS, and Ubuntu's Guide for Linux systems.
Create a directory for the corpus and all your projects to be saved in. For this description we will call it "MyAwesomeDirectory". Then navigate into this directory and open a terminal from within it.
- Clone this repository to get a local copy on your system:
Execute the following lines inside of your terminal.
git clone git@github.com:Low-ResourceDialectology/TextAsCorpusRep.git
- (Optional, but recommended) Create a virtual environment:
- (If not yet installed) Install python venv:
python3 -m pip install virtualenv
or alternatively (at least on Ubuntu) via:
apt install python3.10-venv
- Create an environment named "venvTextAsCorpusRep"
python -m venv venvTextAsCorpusRep
- Activate the virtual environment every time before starting work
source venvTextAsCorpusRep/bin/activate
- Navigate into the cloned corpus-directory named "TextAsCorpusRep":
cd TextAsCorpusRep
Assuming you cloned the repository into your "/Home/Download/" directory, the full path would be:
cd /Home/Download/MyAwesomeDirectory/TextAsCorpusRep
- Install the requirements:
python -m pip install -r requirements.txt
- Usage is managed via the main.py script (continue in Usage section below):
python main.py -MODE_TO_OPERATE -l LANGUAGE_A LANGUAGE_B LANGUAGE_C
TODO: How to explore/read/use the corpus data.
python main.py -c -l ger kur mor ukr vie
python main.py -p -l ger kur mor ukr vie
python main.py -e -l ger kur mor ukr vie
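How to explore the corpus data is still a TODO above, so as a placeholder here is a hedged sketch of reading sentence-aligned pairs, assuming a simple tab-separated layout with one source/target pair per line. The file layout and function name are assumptions for illustration, not the project's documented format:

```python
import csv

def read_parallel_pairs(path):
    """Yield (source, target) sentence pairs from a tab-separated file.

    Assumption: one pair per line, source and target separated by a tab.
    The real corpus layout may differ once it is documented.
    """
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            # Skip blank or malformed lines
            if len(row) >= 2 and row[0].strip() and row[1].strip():
                yield row[0].strip(), row[1].strip()
```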
- Set up this Repository
- Prior Exploration of Available Data Sets
- Phase 1: Initial Data Acquisition
- Web Crawling and Scraping
- Language specific (News-) Websites
- Language specific Wikipedia & Similar
- Language Identification
- Based on already available tools
- Own approach based on linguistic rules
- Phase 2: Targeting Crucial Aspects
- Native Speaker Involvement
- Contact Language Communities
- Field Worker Data Collection
- Exchange with Language Experts
- Expert Interviews (Delphi Study?)
- Situating Project's Corpus in Research
- Phase 3: Final Quality Evaluation
- Algorithmic Approaches
- Structure and Basic Attributes of Data
- Tentative Use of Language Models
- Mobilizing Naive Speakers
- Application for easy use on smartphones
- Social Media Platforms
- Finalize Documentation and Release Corpus
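Phase 1 mentions an own language-identification approach based on linguistic rules. A toy, standard-library-only sketch of one classic idea behind such approaches — comparing character-trigram profiles — could look like the following; the profile size and seed texts are purely illustrative, and in practice langid.py or fastText from the list above would be the stronger baseline:

```python
from collections import Counter

def trigram_profile(text, top=300):
    """Set of the most frequent character trigrams in a text."""
    text = f"  {text.lower()}  "  # pad so word boundaries form trigrams
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g for g, _ in grams.most_common(top)}

def identify(text, profiles):
    """Return the language whose profile overlaps most with the text's."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: len(sample & profiles[lang]))

# Toy seed texts; real profiles would be built from per-language corpus data.
profiles = {
    "deu": trigram_profile("der die das und ist ein eine nicht ich du wir"),
    "eng": trigram_profile("the and is a an not I you we this that with"),
}
```

With these toy profiles, `identify("die Katze und der Hund", profiles)` resolves to `"deu"`; a real system would combine such statistics with the hand-written linguistic rules the roadmap refers to.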
See the TODO: open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the Apache License. See LICENSE.txt for more information.
Christian Schuler - @christians89898 - christianschuler8989(4T)gmail.com
Deepesha Saurty - deepesha.saurty@studium.uni-hamburg.de
Tramy Thi Tran - @TranyMyy - tramy.thi.tran@studium.uni-hamburg.de
Raman Ahmed -
Anran Wang - @AnranW - anran.wang (thesymbolforemail)tum.de
A list of helpful resources we would like to give credit to: