# **Neural Fuzzy Repair (NFR) is a data augmentation pipeline that integrates fuzzy matches (i.e., similar translations) into neural machine translation.**

In [None]:
%cd /content/gdrive/MyDrive/DataAugmentationNMT

/content/gdrive/MyDrive/DataAugmentationNMT


**For basic usage, clone the repository from GitHub and install the library with pip.**

In [None]:
!git clone https://github.com/lt3/nfr.git
%cd nfr
!pip install .

In [None]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [None]:
!pip install sentence-transformers

# **nfr-create-faiss-index:** 
# Creates a FAISS index for semantic matches with Sent2Vec or Sentence Transformers. This is a necessary step if you want to extract semantic fuzzy matches later on.



```
usage: nfr-create-faiss-index [-h] -c CORPUS_F -p MODEL_NAME_OR_PATH -o
                              OUTPUT_F [-m {sent2vec,stransformers}]
                              [-b BATCH_SIZE] [--use_cuda]

Create a FAISS index based on the semantic representation of an existing text
corpus. To do so, the text will be embedded by means of a sent2vec model or a
sentence-transformers model. The index is (basically) an efficient list that
contains all the representations of the training corpus sentences (the TM). as
such, this index can later be used to find those entries that are most similar
to a given representation of a sentence. The index is saved to a binary file
so that it can be reused later on to calculate cosine similarity scores and to
retrieve the most resembling entries.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_F, --corpus_f CORPUS_F
                        Path to the corpus to turn into vectors and add to the
                        index. This is typically your TM or training file for
                        an MT system containing text, one sentence per line
  -p MODEL_NAME_OR_PATH, --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  -o OUTPUT_F, --output_f OUTPUT_F
                        Path to the output file to write the FAISS index to
  -m {sent2vec,stransformers}, --mode {sent2vec,stransformers}
                        Whether to use 'sent2vec' or 'stransformers'
                        (sentence-transformers)
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use to create sent2vec embeddings or
                        sentence-transformers embeddings. A larger value will
                        result in faster creation, but may lead to an out-of-
                        memory error. If you get such an error, lower the
                        value.
  --use_cuda            Whether to use GPU when using sentence-transformers.
                        Requires PyTorch installation with CUDA support and a
                        CUDA-enabled device
```
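
The help text above describes the index as an efficient list of sentence representations that is later queried for the most similar entries by cosine similarity. A minimal pure-Python sketch of that idea follows. It is not NFR's or FAISS's implementation: the function names (`normalize`, `build_index`, `search`) and the toy 3-dimensional vectors are assumptions made for illustration, and a real index would hold model embeddings and use an optimized search.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def build_index(vectors):
    """The 'index' here is just a list of unit-length vectors, one per TM sentence."""
    return [normalize(v) for v in vectors]

def search(index, query, k):
    """Return the k (score, position) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [(sum(a * b for a, b in zip(vec, q)), i) for i, vec in enumerate(index)]
    return sorted(scored, reverse=True)[:k]

# Toy "embeddings" (hypothetical values, not real model output).
tm_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
index = build_index(tm_vectors)

# The query points in the same direction as TM entry 0, so that entry ranks first.
hits = search(index, [2.0, 0.0, 0.0], k=2)
for score, position in hits:
    print(position, round(score, 4))
```

Normalizing every vector once at indexing time means each lookup only needs dot products, which is the usual way to obtain cosine similarity from an inner-product search.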



In [None]:
!nfr-create-faiss-index  -c /content/gdrive/MyDrive/DataAugmentationNMT/ta/train.en -p paraphrase-multilingual-MiniLM-L12-v2 -o /train.faiss.hi.cln -m stransformers

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# **nfr-extract-fuzzy-matches:** 
# Here, fuzzy matches can be extracted from the training set. A variety of methods are available, including semantic fuzzy matching, set similarity and edit distance.
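
To give an intuition for the two formal measures named above, here is a small sketch of a token-level edit-distance score and a Jaccard set similarity. The exact formulas and tokenization that NFR uses may differ. The normalization by the longer sentence and the function names (`edit_distance`, `edit_score`, `jaccard`) are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_score(src, cand):
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length (assumed normalization)."""
    a, b = src.split(), cand.split()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def jaccard(src, cand):
    """Set similarity: shared word types divided by all word types."""
    a, b = set(src.split()), set(cand.split())
    return len(a & b) / len(a | b) if a | b else 1.0

query = "the cat sat on the mat"
candidate = "the cat sat on a mat"
print(edit_score(query, candidate))
print(jaccard(query, candidate))
```

Edit distance is sensitive to word order, while set similarity ignores it, which is one reason a toolkit might combine the two (using a cheap set-based pass to shortlist candidates before computing edit distance).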



```
usage: nfr-extract-fuzzy-matches [-h] --tmsrc TMSRC --tmtgt TMTGT --insrc
                                 INSRC --method
                                 {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                                 --minscore MINSCORE --maxmatch MAXMATCH
                                 [--model_name_or_path MODEL_NAME_OR_PATH]
                                 [--faiss FAISS] [--threads THREADS]
                                 [--n_setsim_candidates N_SETSIM_CANDIDATES]
                                 [--setsim_function SETSIM_FUNCTION]
                                 [--use_cuda] [-q QUERY_MULTIPLIER]
                                 [-v {info,debug}]

Given source and target TM files, extract fuzzy matches for a new input file
by using a variety of methods. You can use formal matching methods such as
edit distance and set similarity, as well as semantic fuzzy matching with
sent2vec and Sentence Transformers.

optional arguments:
  -h, --help            show this help message and exit
  --tmsrc TMSRC         Source text of the TM from which fuzzy matches will be
                        extracted
  --tmtgt TMTGT         Target text of the TM from which fuzzy matches will be
                        extracted
  --insrc INSRC         Input source file to extract matches for (insrc is
                        queried against tmsrc)
  --method {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                        Method to find fuzzy matches
  --minscore MINSCORE   Min fuzzy match score. Only matches with a similarity
                        score of at least 'minscore' will be included
  --maxmatch MAXMATCH   Max number of fuzzy matches kept per source segment
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  --faiss FAISS         Path to faiss index. Must be provided when `method` is
                        sent2vec or stransformers
  --threads THREADS     Number of threads. Must be 0 or 1 when using
                        `use_cuda`
  --n_setsim_candidates N_SETSIM_CANDIDATES
                        Number of fuzzy match candidates extracted by setsim
  --setsim_function SETSIM_FUNCTION
                        Similarity function used by setsimsearch
  --use_cuda            Whether to use GPU for FAISS indexing and sentence-
                        transformers. For this to work properly `threads`
                        should be 0 or 1.
  -q QUERY_MULTIPLIER, --query_multiplier QUERY_MULTIPLIER
                        (applies only to FAISS) Initially look for
                        `query_multiplier * maxmatch` matches to ensure that
                        we find enough hits after filtering. If still not
                        enough matches, search the whole index
  -v {info,debug}, --logging_level {info,debug}
                        Set the information level of the logger. 'info' shows
                        trivial information about the process. 'debug' also
                        notifies you when less matches are found than
                        requested during semantic matching
```
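
The interaction between `--minscore`, `--maxmatch` and `--query_multiplier` described in the help text can be sketched as follows. This is an illustration under stated assumptions, not NFR's code: `select_matches`, the default multiplier of 2, and the `exclude` set (standing in for whatever additional filtering the tool applies) are all hypothetical.

```python
def select_matches(ranked, minscore, maxmatch, query_multiplier=2, exclude=frozenset()):
    """Pick up to `maxmatch` candidates from `ranked`.

    `ranked` is a list of (score, tm_position) pairs sorted by descending score,
    a stand-in for the result of an index lookup.
    """
    def keep(cands):
        # Filter by minimum score and by the (hypothetical) exclusion set.
        kept = [(s, p) for s, p in cands if s >= minscore and p not in exclude]
        return kept[:maxmatch]

    # First look only at the top `query_multiplier * maxmatch` candidates.
    shortlist = keep(ranked[: query_multiplier * maxmatch])
    if len(shortlist) >= maxmatch:
        return shortlist
    # Not enough hits after filtering: fall back to the full ranking.
    return keep(ranked)

ranked = [(0.99, 7), (0.98, 3), (0.90, 5), (0.80, 1), (0.60, 2)]

# The two best candidates are filtered out, so the search widens to the full list.
matches = select_matches(ranked, minscore=0.75, maxmatch=2,
                         query_multiplier=1, exclude={7, 3})
print(matches)
```

A larger multiplier makes the fallback to a full search less likely, at the cost of retrieving more candidates per query up front.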



In [None]:
!nfr-extract-fuzzy-matches --tmsrc /content/gdrive/MyDrive/DataAugmentationNMT/ta/train.en --tmtgt /content/gdrive/MyDrive/DataAugmentationNMT/ta/train.ta --insrc /content/gdrive/MyDrive/DataAugmentationNMT/ta/augmentedOutput.ta.augment.multmax.aug.filtered.1000 --method stransformers --faiss /train.faiss.hi.cln --model_name_or_path  paraphrase-multilingual-MiniLM-L12-v2 --maxmatch 3 --minscore 0.75 --threads 1

# **Split the retrieved fuzzy-matched source and target sentences.**

In [None]:
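# Path to the tab-separated matches file written by nfr-extract-fuzzy-matches
matches_path = "/content/gdrive/MyDrive/DataAugmentationNMT/ta/augmentedOutput.ta.augment.multmax.aug.filtered.1000.matches.mins0.75.maxm3.stransformers.txt"

# Open the input and both output files once; they are closed automatically.
# "w" overwrites earlier results, so re-running the cell does not duplicate lines.
with open(matches_path, "r", encoding="utf-8") as text, \
        open("eng_ta_0.75.nfr", "w", encoding="utf-8") as src_out, \
        open("ta_0.75.nfr", "w", encoding="utf-8") as tgt_out:
    # Loop through each line of the file
    for line in text:
        # Remove surrounding whitespace and split the line into tab-separated columns
        sentences = line.strip().split("\t")
        # Skip lines with fewer than four columns, which would otherwise raise an IndexError
        if len(sentences) < 4:
            continue
        print(sentences[0], file=src_out)  # column 0 goes to the source-side file
        print(sentences[3], file=tgt_out)  # column 3 goes to the target-side file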
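
Since the split above relies on fixed column positions (0 and 3), it may be worth checking the column layout on one line of your own matches file before writing the output. The helper below is a small sketch for that purpose. `describe_columns` is a hypothetical name, and the sample line is invented for illustration, so its layout is an assumption and not a statement about NFR's output format.

```python
def describe_columns(line):
    """Return (index, field) pairs for one tab-separated line of a matches file."""
    return list(enumerate(line.rstrip("\n").split("\t")))

# Hypothetical line; check the real layout against the first line of your own file.
sample = "input sentence\t0.91\tmatched TM source\tmatched TM target\n"

columns = describe_columns(sample)
for index, field in columns:
    print(index, repr(field))
```

In the notebook, the same function can be applied to the first line read from the matches file to confirm which indices hold the sentences you want to keep.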
 
