
TextAsCorpusRep

Multilingual Text As Corpus Repository for Machine Translation of Low-Resource Languages
Explore the docs »

View Demo (TODO) · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Usage
  3. Roadmap
  4. Contributing
  5. License
  6. Contact
  7. Acknowledgments

About The Project


Our project addresses low-resource languages, focusing on Mauritian Creole and Kurdish dialects. We aim to collect and curate language data to support natural language processing, especially the development of robust translation systems for low-resource languages.

Our research questions are:

  • (Q1) How can we create comprehensive, high-quality language datasets from diverse data sources of varying quality?
  • (Q2) How can we ensure correct, useful, and high-quality translations and linguistic annotations, taking variation and dialectal nuances into account?

The project targets native speakers, language experts, and language technology practitioners. We follow a data-driven approach, including data acquisition, evaluation, and risk mitigation. Our project contributes to the UN Sustainable Development Goals of Quality Education and Reduced Inequalities by preserving languages, promoting inclusivity, and fostering data literacy.

(back to top)

Built With

Major frameworks and libraries used or considered for bootstrapping this project (a small illustrative sketch follows the list):

  • BeautifulSoup
  • Scrapy
  • langid.py
  • fastText
  • spaCy
  • NLTK
  • KLPT
  • ASAB
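
As a rough illustration of how a few of these tools could fit together during data acquisition, the sketch below fetches a web page, strips the markup with BeautifulSoup, and guesses the language of the extracted text with langid.py. The URL is a placeholder and the snippet is not part of this repository's code; it only illustrates the kind of building blocks listed above.

    # Illustrative sketch (not part of main.py): fetch a page, strip markup,
    # and guess the language of the extracted text. The URL is a placeholder.
    import requests                 # plain HTTP fetching
    from bs4 import BeautifulSoup   # HTML parsing / text extraction
    import langid                   # off-the-shelf language identification

    url = "https://example.org/some-article"  # placeholder URL

    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    lang, score = langid.classify(text)  # returns e.g. ("de", -1234.5)
    print(f"Detected language: {lang} (score {score:.1f})")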

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

  • You need a Python installation (tested with: 3.11.5 on macOS & 3.10.9 on Ubuntu)
  • You need to use a terminal (at least once ;) ). For more information about how to work with a terminal, refer to Microsoft's guide for Windows, Apple's guide for macOS, and Ubuntu's guide for Linux systems.

Installation

Create a directory in which the corpus and all your projects will be saved; for this description we will call it "MyAwesomeDirectory". Then navigate into this directory and open a terminal from within it.

  1. Clone this repository to get a local copy on your system by executing the following line inside your terminal:
    git clone git@github.com:Low-ResourceDialectology/TextAsCorpusRep.git
  2. (Optional, but recommended) Create a virtual environment:
    1. (If not yet installed) Install Python venv:
    python3 -m pip install virtualenv
    or alternatively (at least on Ubuntu) via:
    apt install python3.10-venv
    2. Create an environment named "venvTextAsCorpusRep":
    python -m venv venvTextAsCorpusRep
    3. Activate the virtual environment every time before starting work:
    source venvTextAsCorpusRep/bin/activate
  3. Navigate into the cloned corpus directory named "TextAsCorpusRep", so that you end up inside it:
    cd TextAsCorpusRep
    Assuming you cloned the repository into your "/Home/Download/" directory, you would type
    cd /Home/Download/MyAwesomeDirectory/TextAsCorpusRep
  4. Install the requirements:
    python -m pip install -r requirements.txt
  5. Usage is managed via the main.py script (continue in the Usage section below):
    python main.py -MODE_TO_OPERATE -l LANGUAGE_A LANGUAGE_B LANGUAGE_C

Usage

TODO: How to explore/read/use the corpus data.

(First steps) on Ubuntu 22.04

(First steps) on Windows 10

(First steps) on Mac

Continuing for any operating system

Collect datasets

python main.py -c -l ger kur mor ukr vie

Preprocess collected data

python main.py -p -l ger kur mor ukr vie

Explore collected and preprocessed data

python main.py -e -l ger kur mor ukr vie
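
The actual command-line interface is defined in main.py; purely as an assumption about how such flags could be wired up, the sketch below parses the -c/-p/-e modes and a -l language list with argparse. It is an illustration, not the repository's real implementation.

    # Hypothetical sketch of a CLI with flags like the ones above; the real
    # argument handling lives in main.py and may differ.
    import argparse

    parser = argparse.ArgumentParser(description="TextAsCorpusRep pipeline (sketch)")
    parser.add_argument("-c", "--collect", action="store_true", help="collect datasets")
    parser.add_argument("-p", "--preprocess", action="store_true", help="preprocess collected data")
    parser.add_argument("-e", "--explore", action="store_true", help="explore collected and preprocessed data")
    parser.add_argument("-l", "--languages", nargs="+", default=[], help="language codes, e.g. ger kur mor ukr vie")
    args = parser.parse_args()

    if args.collect:
        print(f"Collecting data for: {', '.join(args.languages)}")
    if args.preprocess:
        print(f"Preprocessing data for: {', '.join(args.languages)}")
    if args.explore:
        print(f"Exploring data for: {', '.join(args.languages)}")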

(back to top)

Roadmap

  • Set up this Repository
  • Prior Exploration of Available Data Sets
  • Phase 1: Initial Data Acquisition
    • Web Crawling and Scraping
      • Language-specific (news) websites
      • Language-specific Wikipedia & similar
    • Language Identification
      • Based on already available tools
      • Own approach based on linguistic rules (see the sketch after this roadmap)
  • Phase 2: Targeting Crucial Aspects
    • Native Speaker Involvement
      • Contact Language Communities
      • Field Worker Data Collection
    • Exchange with Language Experts
      • Expert Interviews (Delphi Study?)
      • Situating the Project's Corpus in Research
  • Phase 3: Final Quality Evaluation
    • Algorithmic Approaches
      • Structure and Basic Attributes of Data
      • Tentative Use of Language Models
    • Mobilizing Native Speakers
      • Application for easy use on smartphones
      • Social Media Platforms
  • Finalize Documentation and Release Corpus
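
The rule-based language identification mentioned in Phase 1 is still an open item; purely to sketch the general idea, the snippet below scores a text against small, hand-maintained lists of characteristic characters and words per language. The cue lists are illustrative placeholders, not curated linguistic rules.

    # Illustrative sketch of rule-based language identification. The cue lists
    # are placeholders; real rules would be curated together with language experts.
    RULES = {
        "kur": {"chars": set("şçêîû"), "words": {"û", "li", "ez"}},     # placeholder cues
        "ger": {"chars": set("äöüß"), "words": {"und", "der", "die"}},  # placeholder cues
    }

    def score(text: str) -> dict[str, int]:
        tokens = text.lower().split()
        scores = {}
        for lang, cues in RULES.items():
            char_hits = sum(ch in cues["chars"] for ch in text.lower())
            word_hits = sum(tok in cues["words"] for tok in tokens)
            scores[lang] = char_hits + 2 * word_hits  # weight word cues higher
        return scores

    print(score("die kinder gehen über die straße"))  # toy example sentence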

See the open issues (TODO) for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the Apache License. See LICENSE.txt for more information.

(back to top)

Contact

Christian Schuler - @christians89898 - christianschuler8989(4T)gmail.com

Deepesha Saurty - deepesha.saurty@studium.uni-hamburg.de

Tramy Thi Tran - @TranyMyy - tramy.thi.tran@studium.uni-hamburg.de

Raman Ahmed -

Anran Wang - @AnranW - anran.wang (thesymbolforemail)tum.de

(back to top)

Acknowledgments

A list of helpful resources we would like to give credit to:

(back to top)
