Misalignment, i.e., parallel sentence pairs that are not accurate translations of each other, is a common problem that occurs even in well-curated datasets. This project builds a prototype cleaner that uses sentence embeddings to filter out misaligned segment pairs.
git clone https://github.com/Priyanshiguptaaa/FilterMisalignedTranslationPairs.git
cd FilterMisalignedTranslationPairs
virtualenv venv
source venv/bin/activate
For macOS:
brew update
brew install rabbitmq
For other platforms, see the download page: https://www.rabbitmq.com/download.html
** Note: Uncomment the code blocks in run.sh that match your current requirements before running it
bash run.sh
- Installing dependencies
pip install --upgrade pip
#installing the libraries mentioned in requirements.txt
pip3 install -r requirements.txt
#installing pika
python -m pip install pika --upgrade
#installing the laser models
python -m laserembeddings download-models
- For extracting data from a tmx file
python scripts/extractdata.py resources/tmx-file.tmx data/data.de data/data.fr
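To illustrate what the extraction step involves, here is a minimal sketch that pulls German/French segment pairs out of TMX content with the Python standard library. It is not the code in scripts/extractdata.py; the function name extract_pairs and the sample document are hypothetical, and it assumes each <tu> holds one <tuv> per language with the language given in xml:lang (or the older lang attribute).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace; older TMX files use plain "lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, used only for illustration.
SAMPLE = """<tmx version="1.4"><body>
<tu><tuv xml:lang="de"><seg>Guten Morgen</seg></tuv><tuv xml:lang="fr"><seg>Bonjour</seg></tuv></tu>
<tu><tuv xml:lang="de"><seg>Nur Deutsch</seg></tuv></tu>
</body></tmx>"""


def extract_pairs(tmx_text, src_lang="de", tgt_lang="fr"):
    """Return (source, target) tuples for every <tu> that contains both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                # Only the two-letter prefix is kept, so "de-DE" matches "de".
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        # Units missing one side are skipped so both outputs stay line-aligned.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


print(extract_pairs(SAMPLE))
```

Writing the first element of each tuple to data/data.de and the second to data/data.fr, one pair per line, would give two files whose line numbers correspond.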
- For cleaning data
Use cases handled:
- <a something></a>
- <br />
- </something>
- <something>
- <something/>
- %something
- Multiple spaces -> single space
- Remove the space at the beginning of a sentence
Bash commands for cleaning the text files:
sed -i.old "s/<a[^>]*>/ /g;s/<br \/>/ /g;s/<\/[[:alpha:]]*>//g;s/<[[:alpha:]]*>//g;/%link/d;s/<[[:alpha:]]*\/>//g;s/&lt\;.*&gt\;//g;s/[[:space:]][[:space:]]*/ /g;s/^[[:space:]]//" data/data.de
sed -i.old "s/<a[^>]*>/ /g;s/<br \/>/ /g;s/<\/[[:alpha:]]*>//g;s/<[[:alpha:]]*>//g;/%link/d;s/<[[:alpha:]]*\/>//g;s/&lt\;.*&gt\;//g;s/[[:space:]][[:space:]]*/ /g;s/^[[:space:]]//" data/data.fr
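The same cleaning rules can be sketched in Python, which makes each rule easier to test. This sketch covers only the use cases listed above and differs from the sed commands in one deliberate way: sed deletes a %link line from one file at a time, which can shift the two files out of line alignment, whereas the hypothetical clean_pairs helper below drops the pair from both sides. The names RULES, clean_line, and clean_pairs are illustrative and are not part of the repository.

```python
import re

# Each rule mirrors one substitution in the sed command, in the same order.
RULES = [
    (re.compile(r"<a[^>]*>"), " "),      # opening anchor tags, with attributes
    (re.compile(r"<br />"), " "),        # line-break tags
    (re.compile(r"</[A-Za-z]*>"), ""),   # closing tags
    (re.compile(r"<[A-Za-z]*>"), ""),    # opening tags without attributes
    (re.compile(r"<[A-Za-z]*/>"), ""),   # self-closing tags
    (re.compile(r"\s+"), " "),           # runs of whitespace become one space
]


def clean_line(line):
    """Apply every rule to one sentence and strip leading whitespace."""
    line = line.rstrip("\n")
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line.lstrip()


def clean_pairs(src_lines, tgt_lines, marker="%link"):
    """Clean both sides; drop the whole pair if either side contains the marker."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        if marker in src or marker in tgt:
            continue
        cleaned.append((clean_line(src), clean_line(tgt)))
    return cleaned


print(clean_line('  <a href="x">Hallo</a>   Welt<br />!'))
```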
brew services start rabbitmq
** Note: Each terminal runs one consumer, so open more or fewer terminals to change the number of consumers
Navigate to your working directory and execute the following commands:
source venv/bin/activate
python3 scripts/worker.py
The terminal shows:
" [x] Awaiting Language Pairs, To exit press CTR+C"
** Note: After the client.py script finishes and returns to the prompt, press Ctrl+C in every consumer terminal to stop the workers
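For readers unfamiliar with RabbitMQ consumers, the following is a minimal sketch of what a worker of this kind could look like with pika. It is not the code in scripts/worker.py: the queue name language_pairs, the JSON message format with src and tgt keys, and the function names are all assumptions made for illustration.

```python
import json

QUEUE_NAME = "language_pairs"  # hypothetical queue name


def decode_pair(body):
    """Decode one message body into a (source, target) tuple.

    Assumes a UTF-8 JSON payload of the form {"src": ..., "tgt": ...}.
    """
    message = json.loads(body.decode("utf-8"))
    return message["src"], message["tgt"]


def run_worker(host="localhost", handle=print):
    """Consume sentence pairs until interrupted with Ctrl+C."""
    import pika  # imported here so decode_pair can be used without a broker

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    # Give each worker one unacknowledged message at a time, so a slow
    # consumer does not hoard work that an idle one could take.
    channel.basic_qos(prefetch_count=1)

    def on_message(ch, method, properties, body):
        handle(decode_pair(body))
        # Acknowledge only after the pair has been processed.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
    try:
        channel.start_consuming()
    except KeyboardInterrupt:
        channel.stop_consuming()
    finally:
        connection.close()
```

Calling run_worker() in each terminal would start one consumer per terminal. With prefetch_count=1, RabbitMQ hands a new pair to whichever consumer has finished its previous one, which is what spreads the load across terminals.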
** Note: Pass the paths of the data files you want to send as arguments to the command
Navigate to your working directory and execute the following commands:
source venv/bin/activate
python3 scripts/client.py data/data.de data/data.fr
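The filtering idea rests on comparing the embedding of a source sentence with the embedding of its translation. A minimal sketch, assuming the similarity score is cosine similarity and that a pair counts as aligned when the score reaches the threshold (both are assumptions; the repository may compute or compare the score differently):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector carries no direction; treat it as dissimilar
    return dot / (norm_u * norm_v)


def is_aligned(src_vec, tgt_vec, threshold=0.80):
    """Classify a pair as aligned when its similarity reaches the threshold."""
    return cosine_similarity(src_vec, tgt_vec) >= threshold


# Toy two-dimensional vectors; real sentence embeddings have many more dimensions.
print(is_aligned([1.0, 0.1], [1.0, 0.0]))
print(is_aligned([1.0, 1.0], [1.0, 0.0]))
```

With the laserembeddings package installed above, the vectors would typically come from Laser().embed_sentences(sentences, lang="de") and the corresponding call with lang="fr"; because LASER maps both languages into a shared space, the two embeddings can be compared directly.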
Note: Pass these as arguments, in this order:
- file with the filtered data
- file with the initial data
- similarity score threshold used to classify a pair as aligned or misaligned
python3 scripts/filtereddataanalysis.py output/filtereddata.de data/data.de 0.80
Analysis for filtering based on similairty score: 0.80
Total langauge pairs : 1449
Aligned langauge pairs : 70
Percentage of aligned langauge pairs : 4.830917874396135 %
Analysis for filtering based on similairty score: 0.75
Total langauge pairs : 1449
Aligned langauge pairs : 188
Percentage of aligned langauge pairs : 12.974465148378192 %
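The percentages above follow directly from the counts. As a sketch of the arithmetic, assuming one similarity score per language pair is available as a list (the actual script takes file paths, so the function below is hypothetical):

```python
def alignment_summary(scores, threshold):
    """Return (total pairs, aligned pairs, percentage aligned) for a threshold."""
    total = len(scores)
    aligned = sum(1 for score in scores if score >= threshold)
    percentage = 100.0 * aligned / total if total else 0.0
    return total, aligned, percentage


# Made-up scores that reproduce the counts reported for the 0.80 threshold.
scores = [0.9] * 70 + [0.1] * 1379
total, aligned, percentage = alignment_summary(scores, 0.80)
print(total, aligned, round(percentage, 2))
```

Lowering the threshold can only keep the aligned count the same or raise it, which matches the reported jump from 70 to 188 pairs when moving from 0.80 to 0.75.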
When the service runs locally, it is limited by the CPU and memory of a single machine. The purpose of scaling is to design the service so that it keeps performing well as the data load or traffic grows. We must also ensure that no single worker is overburdened; an overloaded worker may crash.
Potential additions to the software when scaling:
- Include end-to-end tests to ensure that the service does not fail or operate poorly at any phase.
- Include data validation tests to ensure that the data is clean and contains the qualities required by the service.
- When deploying the service, include the processes in a CI/CD pipeline. Building, packaging, testing, validating, certifying infrastructure, and deploying to all required environments are among them.
- After the data extraction and cleaning, add an approval phase to ensure that no use cases for cleaning are missing.
- Figure out the number of workers needed to scale the service.
- Enable worker autoscaling so that idle processes are shut down while the queue is short and more processes are started as the number of waiting messages grows.
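The last two points can be sketched as a simple sizing rule: derive the worker count from the number of waiting messages and clamp it to a fixed range. The function name and the default values below are hypothetical placeholders; real limits would have to come from measuring how many pairs one worker processes per unit of time.

```python
import math


def desired_workers(queue_depth, pairs_per_worker=100, min_workers=1, max_workers=8):
    """Pick a worker count proportional to the backlog, clamped to [min, max]."""
    needed = math.ceil(queue_depth / pairs_per_worker)
    return max(min_workers, min(max_workers, needed))


for depth in (0, 250, 10000):
    print(depth, desired_workers(depth))
```

With pika, the current backlog can be read from a passive declare, for example channel.queue_declare(queue=name, passive=True).method.message_count, and a supervisor process could call desired_workers periodically to decide whether to start or stop consumers.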