Translation with NLLB-200 using CTranslate2

This notebook is part of the repository [Adaptive-MT-LLM-Fine-tuning](https://github.com/ymoslem/Adaptive-MT-LLM-Fine-tuning).

# Requirements

In [None]:
# Requirements
#!pip install ctranslate2 sentencepiece -q
#!ls /content/models/ct2-nllb-200-1.3B-int8

# Optional: convert an NLLB model to CTranslate2 format with int8 quantization.
# Conversion is slow, so skip it if you already have the converted model,
# or run it on Google Colab and download the resulting model.

#!ct2-transformers-converter --model facebook/nllb-200-1.3B --quantization int8 --output_dir /content/models/ct2-nllb-200-1.3B-int8

# Download the SentencePiece model
#!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
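
The same conversion can also be driven from Python instead of the `ct2-transformers-converter` command. The following is a minimal sketch, assuming `ctranslate2`, `transformers`, and `torch` are installed; the helper name `convert_nllb_to_ct2` is hypothetical and not part of the notebook.

```python
def convert_nllb_to_ct2(model_name: str, output_dir: str, quantization: str = "int8") -> str:
    """Convert a Hugging Face NLLB checkpoint to CTranslate2 format (sketch)."""
    # Imported lazily so that defining this helper does not require the package.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(model_name)
    # With the default force=False, an existing output directory is not overwritten.
    return converter.convert(output_dir, quantization=quantization)


# Not executed here, because it downloads the model and takes a while:
# convert_nllb_to_ct2("facebook/nllb-200-1.3B", "/content/models/ct2-nllb-200-1.3B-int8")
```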

# Loading the data

In [15]:
import os

data_path = "data"
tgt_lang = "spa_Latn"
src_lang = "eng_Latn"

# Language dictionary for NLLB-200 including ES, EN, FR, RU, ZH, AR, PT, SW
lang_dict = {
    "spa_Latn": "Spanish",
    "eng_Latn": "English",
    "fra_Latn": "French",
    "rus_Cyrl": "Russian",
    "zho_Hans": "Chinese",
    "arb_Arab": "Arabic",  # NLLB-200 code for Modern Standard Arabic
    "por_Latn": "Portuguese",
    "swh_Latn": "Swahili"  # NLLB-200 code for Swahili
}

src_lang_name = lang_dict[src_lang]
tgt_lang_name = lang_dict[tgt_lang]

source = """The UN Environment Programme (UNEP) and the Food and Agriculture Organization of the UN (FAO) have named the first World Restoration Flagships for this year, tackling pollution, unsustainable exploitation, and invasive species in three continents.
These initiatives are restoring almost five million hectares of marine ecosystems – an area about the size of Costa Rica, which, together with France, is hosting the 3rd UN Ocean Conference.

The three new flagships comprise restoration initiatives in the coral-rich Northern Mozambique Channel Region, more than 60 of Mexico’s islands and the Mar Menor in Spain, Europe’s first ecosystem with legal personhood.
The winning initiatives were announced at an event during the UN Ocean Conference in Nice, France, and are now eligible for UN support.

“After decades of taking the ocean for granted, we are witnessing a great shift towards restoration.
But the challenge ahead of us is significant and we need everyone to play their part,” said Inger Andersen, Executive Director of UNEP.
“These World Restoration Flagships show how biodiversity protection, climate action, and economic development are deeply interconnected.
To deliver our restoration goals, our ambition must be as big as the ocean we must protect.”

FAO Director-General QU Dongyu said: “The climate crisis, unsustainable exploitation practices and nature resources shrinking are affecting our blue ecosystems, harming marine life and threatening the livelihoods of dependent communities.
These new World Restoration Flagships show that halting and reversing degradation is not only possible, but also beneficial to planet and people."

The World Restoration Flagship awards are part of the UN Decade on Ecosystem Restoration – led by UNEP and FAO – which aims to prevent, halt, and reverse the degradation of ecosystems on every continent and in every ocean.
The awards track notable initiatives that support global commitments to restore one billion hectares – an area larger than China – by 2030."""


#directory = os.path.join(data_path, "spanish")

#os.chdir(directory)
os.getcwd()

# Print the display name of the source language
print(lang_dict[src_lang])

English


## Load the models

In [4]:
import os

# [Modify] Set paths to the CTranslate2 and SentencePiece models
#!mkdir -p /content/models
#!cp -r /content/ct2-nllb* /content/models
drive = "../models"

ct_model_path = os.path.join(drive, "ct2-nllb-200-1.3B-int8")
sp_model_path = os.path.join(drive, "flores200_sacrebleu_tokenizer_spm.model")

In [5]:
import ctranslate2
import sentencepiece as spm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 model
translator = ctranslate2.Translator(ct_model_path, device=device)



In [6]:
sp.encode_as_pieces("English:")


['▁English', ':']

# Translate (source sentences only)

In [16]:

source_sents = [sent.strip() for sent in source.split("\n")]
print(*source_sents, sep="\n")

The UN Environment Programme (UNEP) and the Food and Agriculture Organization of the UN (FAO) have named the first World Restoration Flagships for this year, tackling pollution, unsustainable exploitation, and invasive species in three continents.
These initiatives are restoring almost five million hectares of marine ecosystems – an area about the size of Costa Rica, which, together with France, is hosting the 3rd UN Ocean Conference.

The three new flagships comprise restoration initiatives in the coral-rich Northern Mozambique Channel Region, more than 60 of Mexico’s islands and the Mar Menor in Spain, Europe’s first ecosystem with legal personhood.
The winning initiatives were announced at an event during the UN Ocean Conference in Nice, France, and are now eligible for UN support.

“After decades of taking the ocean for granted, we are witnessing a great shift towards restoration.
But the challenge ahead of us is significant and we need everyone to play their part,” said Inger Ande

In [17]:
# src_lang = "eng_Latn"
# tgt_lang = "spa_Latn"

beam_size = 2

# Normalize typographic characters: curly double quotes -> straight quotes, en dash -> hyphen
source_sents = [sent.replace("“", '"').replace("”", '"') for sent in source_sents]
source_sents = [sent.replace("–", "-") for sent in source_sents]

# Remove empty string from source_sents
source_sents = [sent.strip() for sent in source_sents if sent.strip()]
target_prefix = [[tgt_lang]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(source_sents_subworded,
                                          batch_type="tokens",
                                          max_batch_size=2024,
                                          beam_size=beam_size,
                                          target_prefix=target_prefix)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]

print(*translations_desubword[:10], sep="\n")

El Programa de las Naciones Unidas para el Medio Ambiente (PNUMA) y la Organización de las Naciones Unidas para la Alimentación y la Agricultura (FAO) han nombrado los primeros buques insignia de restauración mundial para este año, que abordan la contaminación, la explotación insostenible y las especies invasoras en tres continentes.
Estas iniciativas están restaurando casi cinco millones de hectáreas de ecosistemas marinos, un área de aproximadamente el tamaño de Costa Rica, que, junto con Francia, acoge la 3a Conferencia de las Naciones Unidas sobre los Océanos.
Los tres nuevos buques insignia incluyen iniciativas de restauración en la región del canal de Mozambique, rica en corales, más de 60 de las islas de México y el Mar Menor en España, el primer ecosistema europeo con personalidad jurídica.
Las iniciativas ganadoras se anunciaron en un evento durante la Conferencia de las Naciones Unidas sobre el Océano en Niza, Francia, y ahora son elegibles para el apoyo de las Naciones Unida
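
The cell above wraps each subworded sentence with the source language code and `</s>`, forces the target language code as the first decoded token, and then slices that code off the decoded string. A minimal pure-Python sketch of this bookkeeping follows. It uses hand-written pieces instead of a real SentencePiece model, and the helper names `add_nllb_tags` and `strip_target_prefix` are hypothetical.

```python
from typing import List


def add_nllb_tags(pieces: List[str], src_lang: str) -> List[str]:
    """Wrap subword pieces as: <source language code> pieces... </s>."""
    return [src_lang] + pieces + ["</s>"]


def strip_target_prefix(hypothesis: List[str], tgt_lang: str) -> List[str]:
    """Drop the forced target-language token if it leads the hypothesis."""
    if hypothesis and hypothesis[0] == tgt_lang:
        return hypothesis[1:]
    return hypothesis


# Hand-written pieces standing in for the output of sp.encode_as_pieces(...)
pieces = ["▁Hello", "▁world"]
tagged = add_nllb_tags(pieces, "eng_Latn")
print(tagged)

# A made-up hypothesis, shaped like one decoded with target_prefix set
hypothesis = ["spa_Latn", "▁Hola", "▁mundo"]
print(strip_target_prefix(hypothesis, "spa_Latn"))
```

Removing the language token at the token level, before decoding, is one alternative to slicing `len(tgt_lang)` characters off the decoded string, which assumes the decoded text always starts with the code.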

In [None]:
# Save the translations (one sentence per line)
# with open("testUNEP.es", "w", encoding="utf-8") as output:
#   for translation in translations_desubword:
#     output.write(translation + "\n")

# Fuzzy search indexer WIP
* ✅ glossary
* ➡️ translation memory
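
The glossary matcher in the next cell (`find_nearly_exact_english_matches`) slides a window with the same word count as the glossary term over the text and scores each n-gram against the term. The sketch below illustrates that idea in isolation. It swaps in the standard library's `difflib.SequenceMatcher` for the RapidFuzz scorers, so its scores are not the same as `fuzz.token_set_ratio`; the function name `window_matches` and the sample sentence are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def window_matches(text: str, term: str, cutoff: float = 90.0) -> List[Tuple[str, float]]:
    """Return (n-gram, score) pairs whose similarity to `term` reaches `cutoff`."""
    words = text.split()
    n = len(term.split())
    hits: List[Tuple[str, float]] = []
    if n == 0:
        return hits
    for i in range(len(words) - n + 1):
        ngram = " ".join(words[i:i + n])
        # ratio() lies in [0, 1]; scale to 0-100 to mirror RapidFuzz conventions
        score = 100.0 * SequenceMatcher(None, ngram.lower(), term.lower()).ratio()
        if score >= cutoff:
            hits.append((ngram, score))
    return hits


text = "The UN Environment Programme and UNEA are developing new policies"
exact_hits = window_matches(text, "UN Environment Programme")
near_hits = window_matches(text, "UN Environment Program")
print(exact_hits)
print(near_hits)
```

Both calls should pick out the n-gram "UN Environment Programme": the first as an exact match, the second because a single spelling variant still scores above the cutoff.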

In [18]:
import pandas as pd
import re
from typing import List, Dict, Optional, Union, Callable, Any, Tuple
from rapidfuzz import fuzz, process, utils
import numpy as np


class MultilingualGlossaryProcessor:
    """
    A class for processing multilingual glossaries using RapidFuzz for fuzzy string matching.
    """
    
    def __init__(self, glossary_path: str):
        """
        Initialize the processor with a glossary file.
        
        Args:
            glossary_path: Path to CSV/Excel file with columns:
                          Keyword, Category, English, Arabic, French, Spanish, Chinese, Russian, Portuguese, Swahili
        """
        if glossary_path.endswith('.xlsx') or glossary_path.endswith('.xls'):
            self.glossary = pd.read_excel(glossary_path)
        else:
            self.glossary = pd.read_csv(glossary_path)
        
        # Define available languages
        self.languages = ['English', 'Arabic', 'French', 'Spanish', 'Chinese', 'Russian', 'Portuguese', 'Swahili']
        
        # Validate glossary structure
        required_columns = ['Keyword', 'Category'] + self.languages
        missing_columns = [col for col in required_columns if col not in self.glossary.columns]
        if missing_columns:
            raise ValueError(f"Missing required columns: {missing_columns}")
    
    def find_best_fuzzy_match(
        self,
        query: str,
        source_language: str,
        target_languages: List[str],
        scorer: Callable = fuzz.WRatio,
        processor: Optional[Callable] = None,
        score_cutoff: Optional[float] = 60.0,
        process_method: str = "extractOne"
    ) -> Dict[str, Union[str, Dict[str, str]]]:
        """
        Find the best fuzzy match in the glossary for a given query.
        
        Args:
            query: The text to search for
            source_language: Language of the query
            target_languages: List of target languages to return translations
            scorer: RapidFuzz scorer function (default: fuzz.WRatio)
            processor: Text preprocessing function (default: None)
            score_cutoff: Minimum similarity score (default: 60.0)
            process_method: RapidFuzz process method ("extractOne", "extract", "cdist", "cpdist")
            
        Returns:
            Dictionary with best_fuzzy match and translations in target languages
        """
        if source_language not in self.languages:
            raise ValueError(f"Source language '{source_language}' not supported. Available: {self.languages}")
        
        invalid_targets = [lang for lang in target_languages if lang not in self.languages]
        if invalid_targets:
            raise ValueError(f"Invalid target languages: {invalid_targets}. Available: {self.languages}")
        
        # Get all terms in source language (excluding NaN values)
        source_terms = self.glossary[source_language].dropna().tolist()
        
        if not source_terms:
            return {"best_fuzzy": "", "result": {}}
        
        # Find best match using specified process method
        if process_method == "extractOne":
            result = process.extractOne(
                query, 
                source_terms, 
                scorer=scorer, 
                processor=processor, 
                score_cutoff=score_cutoff
            )
        elif process_method == "extract":
            results = process.extract(
                query, 
                source_terms, 
                scorer=scorer, 
                processor=processor, 
                limit=1, 
                score_cutoff=score_cutoff
            )
            result = results[0] if results else None
        elif process_method == "cdist":
            # Using cdist for single query
            distances = process.cdist(
                [query], 
                source_terms, 
                scorer=scorer, 
                processor=processor, 
                score_cutoff=score_cutoff
            )
            if distances.size > 0:
                best_idx = np.argmax(distances[0])
                if distances[0][best_idx] >= (score_cutoff or 0):
                    result = (source_terms[best_idx], distances[0][best_idx], best_idx)
                else:
                    result = None
            else:
                result = None
        elif process_method == "cpdist":
            # cpdist requires equal length arrays, so we'll use the query repeated
            if len(source_terms) > 0:
                distances = process.cpdist(
                    [query] * len(source_terms), 
                    source_terms, 
                    scorer=scorer, 
                    processor=processor, 
                    score_cutoff=score_cutoff
                )
                if distances.size > 0:
                    best_idx = np.argmax(distances)
                    if distances[best_idx] >= (score_cutoff or 0):
                        result = (source_terms[best_idx], distances[best_idx], best_idx)
                    else:
                        result = None
                else:
                    result = None
            else:
                result = None
        else:
            raise ValueError(f"Unsupported process method: {process_method}")
        
        if not result:
            return {"best_fuzzy": "", "result": {}}
        
        best_match, score, index = result
        
        # Find the row containing this match
        match_row = self.glossary[self.glossary[source_language] == best_match].iloc[0]
        
        # Get translations for target languages
        translations = {}
        for lang in target_languages:
            translation = match_row[lang]
            if pd.notna(translation):
                translations[lang] = str(translation)
            else:
                translations[lang] = ""
        
        return {
            "best_fuzzy": best_match,
            "score": score,
            "result": translations
        }
    
    def _remove_overlapping_matches(self, matches: List[Dict]) -> List[Dict]:
        """
        Remove overlapping matches, keeping the longest/highest scoring ones.
        
        Args:
            matches: List of match dictionaries with 'start', 'end', 'score', etc.
            
        Returns:
            Filtered list with non-overlapping matches
        """
        if not matches:
            return []
        
        # Sort by length (descending) then by score (descending)
        sorted_matches = sorted(matches, 
                              key=lambda x: (x['end'] - x['start'], x['score']), 
                              reverse=True)
        
        final_matches = []
        used_positions = set()
        
        for match in sorted_matches:
            # Check if this match overlaps with any already selected match
            match_positions = set(range(match['start'], match['end']))
            
            if not match_positions.intersection(used_positions):
                # No overlap, add this match
                final_matches.append(match)
                used_positions.update(match_positions)
        
        # Sort final matches by position in text
        final_matches.sort(key=lambda x: x['start'])
        return final_matches
    
    def find_all_fuzzy_matches_in_text(
        self,
        text: str,
        source_language: str,
        target_languages: List[str],
        scorer: Callable = fuzz.partial_ratio,
        processor: Optional[Callable] = None,
        score_cutoff: Optional[float] = 80.0,
        min_word_length: int = 1,
        limit: Optional[int] = None
    ) -> List[Dict[str, Union[str, Dict[str, str]]]]:
        """
        Find all glossary terms that fuzzy-match within a given text, using process.extract with the given scorer (fuzz.partial_ratio by default).
        Efficiently searches for glossary entries in the text and handles overlapping matches.
        
        Args:
            text: Input text to search within
            source_language: Language of the input text
            target_languages: List of target languages to return translations
            scorer: RapidFuzz scorer function (default: fuzz.partial_ratio)
            processor: Text preprocessing function (default: None)
            score_cutoff: Minimum similarity score (default: 80.0)
            min_word_length: Minimum length of a glossary term, in characters (default: 1)
            limit: Maximum number of matches to return (default: None for all matches)
            
        Returns:
            List of dictionaries with found matches and their translations
        """
        if source_language not in self.languages:
            raise ValueError(f"Source language '{source_language}' not supported. Available: {self.languages}")
        
        invalid_targets = [lang for lang in target_languages if lang not in self.languages]
        if invalid_targets:
            raise ValueError(f"Invalid target languages: {invalid_targets}. Available: {self.languages}")
        
        # Get all terms in source language (excluding NaN values)
        source_terms = self.glossary[source_language].dropna().tolist()
        
        if not source_terms:
            return []
        
        # Drop glossary terms shorter than min_word_length characters
        filtered_terms = [term for term in source_terms if len(str(term).strip()) >= min_word_length]
        
        if not filtered_terms:
            return []
        
        # Sort glossary terms by length (longest first) for better matching
        source_terms_sorted = sorted(filtered_terms, key=len, reverse=True)
        
        all_matches = []
        
        # Score every glossary term against the whole text with the chosen scorer
        extract_results = process.extract(
            text,
            source_terms_sorted,
            scorer=scorer,
            processor=processor,
            score_cutoff=score_cutoff,
            limit=limit
        )
        
        if not extract_results:
            return []
        
        #print(f"Found {len(extract_results)} matches in text using process.extract")
        
        # Process each match result
        for match_term, similarity, _ in extract_results:
            # Find the row in glossary containing this match
            match_rows = self.glossary[self.glossary[source_language] == match_term]
            
            if match_rows.empty:
                continue
                
            match_row = match_rows.iloc[0]
            
            # Get translations for target languages
            translations = {}
            for lang in target_languages:
                translation = match_row[lang] if lang in match_row else None
                if pd.notna(translation):
                    translations[lang] = str(translation)
                else:
                    translations[lang] = ""
            
            # Find approximate positions in text for overlap detection
            # Using case-insensitive search to find the term in text
            text_lower = text.lower()
            term_lower = match_term.lower()
            
            # Try to find the exact match position
            start_pos = text_lower.find(term_lower)
            if start_pos != -1:
                end_pos = start_pos + len(match_term)
            else:
                # If exact match not found, use fuzzy position estimation
                # Split text into words and try to find approximate position
                words = text.split()
                best_match_idx = 0
                best_score = 0
                
                term_words = match_term.split()
                term_length = len(term_words)
                
                # Search for best matching position using sliding window
                for i in range(len(words) - term_length + 1):
                    window_text = " ".join(words[i:i + term_length])
                    window_score = fuzz.token_set_ratio(window_text.lower(), term_lower)
                    if window_score > best_score:
                        best_score = window_score
                        best_match_idx = i
                
                # Calculate approximate positions based on best match
                if best_match_idx < len(words):
                    words_before = " ".join(words[:best_match_idx])
                    start_pos = len(words_before) + (1 if words_before else 0)
                    
                    matched_words = words[best_match_idx:best_match_idx + term_length]
                    end_pos = start_pos + len(" ".join(matched_words))
                else:
                    start_pos = 0
                    end_pos = len(match_term)
            
            # Extract the actual text segment that was matched
            if start_pos >= 0 and end_pos <= len(text):
                found_text = text[start_pos:end_pos]
            else:
                found_text = match_term  # Fallback to the glossary term
            
            all_matches.append({
                "found_in_text": found_text,
                "best_fuzzy": match_term,
                "score": similarity,
                "result": translations,
                "start": start_pos,
                "end": end_pos
            })
        
        # Remove overlapping matches (prefer longer and higher scoring matches)
        final_matches = self._remove_overlapping_matches(all_matches)
        
        # Remove position information from final output and sort by score
        result_matches = []
        for match in final_matches:
            result_match = {k: v for k, v in match.items() if k not in ['start', 'end']}
            result_matches.append(result_match)
        
        # Sort by score (highest first)
        result_matches.sort(key=lambda x: x["score"], reverse=True)
        
        return result_matches

    def find_nearly_exact_english_matches(
        self,
        text: str,
        target_languages: List[str],
        score_cutoff: float = 95.0,
        normalize_text: bool = True,
        remove_overlaps: bool = True
    ) -> List[Dict[str, Union[str, Dict[str, str]]]]:
        """
        Find nearly-exact matches for English glossary terms in the given text.
        Optionally normalizes the text and glossary terms before matching.
        
        Args:
            text: Input English text
            target_languages: List of target languages to return translations
            score_cutoff: Minimum similarity score (default: 95.0)
            normalize_text: Whether to normalize text and terms (default: True)
            remove_overlaps: Whether to remove overlapping matches (default: True)
        
        Returns:
            List of dictionaries with found matches and their translations
        """
        def normalize(s):
            s = str(s).strip()
            # Drop a leading number; a dot that follows it is removed by the punctuation pass below
            s = re.sub(r'^\d+\s*', '', s)  # Example: "123 term" -> "term", "123. term" -> ". term"
            s = re.sub(r'[^\w\s]', '', s)  # Remove punctuation
            s = re.sub(r'\s+', ' ', s)  # Collapse runs of whitespace
            return s.strip()  # "123. term" -> "term"
        
        if "English" not in self.languages:
            raise ValueError("English language not available in glossary.")
        
        invalid_targets = [lang for lang in target_languages if lang not in self.languages]
        if invalid_targets:
            raise ValueError(f"Invalid target languages: {invalid_targets}. Available: {self.languages}")
        
        # Get all English terms
        english_terms = self.glossary["English"].dropna().tolist()
        if not english_terms:
            return []
        
        # Sort terms by length (longest first) for better matching priority
        english_terms_sorted = sorted(english_terms, key=len, reverse=True)
        
        all_matches = []
        
        # Normalize and tokenize the text once; it does not depend on the term
        search_text = normalize(text) if normalize_text else text.strip()
        text_words = search_text.split()
        
        # For each glossary term, try to find it in the text
        for orig_term in english_terms_sorted:
            # Normalize term if requested
            search_term = normalize(orig_term) if normalize_text else orig_term.strip()
            
            # Split into words for position tracking
            term_words = search_term.split()
            if not term_words:
                continue
            
            # Slide a window with the term's word count over the text
            term_length = len(term_words)
            
            for i in range(len(text_words) - term_length + 1):
                # Get n-gram from text
                ngram_words = text_words[i:i + term_length]
                ngram_text = " ".join(ngram_words)
                
                # Calculate similarity using token_set_ratio for better subset matching
                score = fuzz.token_set_ratio(ngram_text, search_term)
                
                if score >= score_cutoff:
                    # Find positions in original text
                    # This is approximate since we're working with normalized text
                    original_words = text.split()
                    if i < len(original_words) and i + term_length <= len(original_words):
                        # Get the original text segment
                        original_segment = " ".join(original_words[i:i + term_length])
                        
                        # Estimate positions (approximate)
                        start_pos = text.lower().find(original_segment.lower())
                        if start_pos == -1:
                            # Fallback: use word-based estimation
                            words_before = " ".join(original_words[:i])
                            start_pos = len(words_before) + (1 if words_before else 0)
                        end_pos = start_pos + len(original_segment)
                        
                        # Get translations
                        match_row = self.glossary[self.glossary["English"] == orig_term].iloc[0]
                        translations = {}
                        for lang in target_languages:
                            translation = match_row[lang]
                            if pd.notna(translation):
                                translations[lang] = str(translation)
                            else:
                                translations[lang] = ""
                        
                        all_matches.append({
                            "found_in_text": original_segment,
                            "best_fuzzy": orig_term,
                            "score": score,
                            "result": translations,
                            "start": start_pos,
                            "end": end_pos
                        })
        
        # Remove overlapping matches if requested
        if remove_overlaps:
            final_matches = self._remove_overlapping_matches(all_matches)
        else:
            final_matches = all_matches
        
        # Remove position information from final output and sort by score
        result_matches = []
        for match in final_matches:
            result_match = {k: v for k, v in match.items() if k not in ['start', 'end']}
            result_matches.append(result_match)
        
        # Sort by score (highest first)
        result_matches.sort(key=lambda x: x["score"], reverse=True)
        
        return result_matches


def create_processor_function(processor_type: str) -> Optional[Callable]:
    """
    Create a processor function based on the specified type.
    
    Args:
        processor_type: Type of processor ("none", "default", "custom")
        
    Returns:
        Processor function or None
    """
    if processor_type == "none":
        return None
    elif processor_type == "default":
        return utils.default_process
    elif processor_type == "custom":
        # Custom processor that handles special cases
        def custom_processor(text):
            if not text:
                return ""
            # Drop a leading number and collapse whitespace; punctuation is kept
            processed = re.sub(r'^\d+\s*', '', str(text).strip()) # Example: "123 term" -> "term", "123. term" -> ". term"
            #processed = re.sub(r'[^\w\s]', '', processed) # Remove punctuation
            processed = re.sub(r'\s+', ' ', processed)
            return processed
        return custom_processor
    else:
        raise ValueError(f"Unknown processor type: {processor_type}")


# Example usage and testing functions
def example_usage():
    """
    Example usage of the MultilingualGlossaryProcessor.
    """
    # Initialize processor (assuming you have a glossary file)
    processor = MultilingualGlossaryProcessor("data/glossaryUNEP_corrected.xlsx")
    
    # Example 1: Find best fuzzy match
    result1 = processor.find_best_fuzzy_match(
        query="UN Environment Program",
        source_language="English",
        target_languages=["French", "Spanish", "Arabic"],
        scorer=fuzz.WRatio,
        processor=utils.default_process,
        score_cutoff=70.0,
        process_method="extractOne"
    )
    print("Best match result:", result1)
    
    # Example 2: Find all matches in text with overlap handling
    text = "UN Environment Programme is sponsored by UNESCO."
    result2 = processor.find_all_fuzzy_matches_in_text(
        text=text,
        source_language="English",
        target_languages=["French", "Spanish"],
        processor=utils.default_process,
        score_cutoff=80.0
    )
    print(f"All matches in text (no overlaps), {len(result2)} results in total:", result2)
    
    # Example 3: Nearly-exact English matches with normalization
    english_text = "The UN Environment Programme and UNEA are working on the International Day of Women Judge with organizations and developing new policies."
    result3 = processor.find_nearly_exact_english_matches(
        text=english_text,
        target_languages=["French", "Spanish"],
        score_cutoff=95.0,
        normalize_text=True,
        remove_overlaps=True
    )
    print(f"Nearly-exact English matches, {len(result3)} in total:", result3)
    
    # Example 4: More complex text
    complex_text = "This year, the United Nations Environment Programme (UNEP) and the SSC (South-South cooperation) are presiding the COP on Climate Change to address persistent organic pollutants before the UNEA7 with FAO and UNESCO, where the 1. total greenhouse gas emissions per year indicator is expected to be reduced by 50%."
    result4 = processor.find_all_fuzzy_matches_in_text(
        text=complex_text,
        source_language="English",
        target_languages=["French", "Spanish"],
        processor=utils.default_process,
        score_cutoff=95.0
    )
    print(f"Complex text matches RESULT #4, {len(result4)} results in total:", result4)

    # Example 5: Nearly-exact English matches on the complex text, with a lower cutoff
    english_text = complex_text
    result5 = processor.find_nearly_exact_english_matches(
        text=english_text,
        target_languages=["French", "Spanish"],
        score_cutoff=90.0,
        normalize_text=True,
        remove_overlaps=True
    )
    print(f"Nearly-exact English matches: {len(result5)} matches in total: ", result5)
    # Print (best_fuzzy, Spanish translation) pairs from result5
    bilingual_pairs = [(match['best_fuzzy'], match['result'].get('Spanish', '')) for match in result5]
    print("Bilingual pairs (best_fuzzy, Spanish translation):")
    for pair in bilingual_pairs:
        print(pair)


if __name__ == "__main__":
    example_usage()
        

Best match result: {'best_fuzzy': 'UN Environment Programme', 'score': 95.65217391304348, 'result': {'French': '', 'Spanish': 'Programa ONU Medio Ambiente', 'Arabic': ''}}
All matches in text (no overlaps), 2 results in total: [{'found_in_text': 'ore', 'best_fuzzy': 'ore', 'score': 100.0, 'result': {'French': 'minerais', 'Spanish': 'yacimientos minerales'}}, {'found_in_text': 'UN Environment Programme is', 'best_fuzzy': 'United Nations Environment Programme', 'score': 80.0, 'result': {'French': '', 'Spanish': 'Programa de las Naciones Unidas para el Medio Ambiente'}}]
Nearly-exact English matches, 3 in total: [{'found_in_text': 'UN Environment Programme', 'best_fuzzy': 'UN Environment Programme', 'score': 100.0, 'result': {'French': '', 'Spanish': 'Programa ONU Medio Ambiente'}}, {'found_in_text': 'UNEA', 'best_fuzzy': 'UNEA', 'score': 100.0, 'result': {'French': '', 'Spanish': ''}}, {'found_in_text': 'International Day of Women Judge', 'best_fuzzy': 'International Day of Women Judges'

# TM fuzzy matches TODO

In [19]:
similar_text = """El Programa de las Naciones Unidas para el Medio Ambiente (PNUMA) y la Organización de las Naciones Unidas para la Alimentación y la Agricultura (FAO) han nombrado las primeras Iniciativas Emblemáticas de la Restauración Mundial para este año, que abordan la degradación de los ecosistemas en todo el planeta.
Estas iniciativas han estado restaurando alrededor tres millones de hectáreas de ecosistemas marinos, un área del tamaño de El Salvador.
Las siete nuevas Iniciativas Emblemáticas comprenden iniciativas de restauración en Ecuador, Colombia, Kenya e Indonesia.
.
"Por mucho tiempo se ha dado por sentado el poder de los bosques, tan esenciales para la restauración.
Cada persona debe cumplir su parte", afirmó Inger Andersen, Directora Ejecutiva del PNUMA.
"Las Iniciativas Emblemáticas de la Restauración Mundial muestran cómo la protección de la biodiversidad, la acción climática y el desarrollo económico están profundamente interconectados.
Para lograr nuestros objetivos de restauración, nuestra ambición debe ser tan grande como el océano que debemos proteger".
El Director General de la FAO, QU Dongyu, manifestó: "La crisis climática, las prácticas de explotación insostenible y la reducción de los recursos naturales están afectando nuestros ecosistemas azules, dañando la vida marina y amenazando los medios de vida de las comunidades.
Estas nuevas 7 Iniciativas Emblemáticas muestran que detener y revertir la degradación es posible y beneficioso para el planeta y las personas"."""
fuzzy_sents = [sent.strip() for sent in similar_text.split("\n")]
fuzzy_target_prefixes = [sent.strip() for sent in fuzzy_sents if sent.strip()]

similar_text_en = """The United Nations Environment Programme (UNEP) and the Food and Agriculture Organization (FAO) have named the first World Restoration Flagships for this year, which address ecosystem degradation across the globe.
These initiatives have been restoring around three million hectares of marine ecosystems, an area the size of El Salvador.
The seven new flagships include restoration initiatives in Ecuador, Colombia, Kenya and Indonesia.
.
"The power of forests, so essential to restoration, has long been taken for granted.
Everyone must do their part," said Inger Andersen, Executive Director of UNEP.
"The World Restoration Flagships show how biodiversity protection, climate action and economic development are deeply interconnected.
To achieve our restoration goals, our ambition must be as big as the ocean we must protect."
FAO Director-General QU Dongyu said, "The climate crisis, unsustainable exploitation practices and depletion of natural resources are affecting our blue ecosystems, damaging marine life and threatening the livelihoods of communities.
These new 7 flagships show that halting and reversing degradation is possible and beneficial for the planet and people."
"""

#glossary_entry = ["World Restoration Flagships", "Iniciativas Emblemáticas de la Restauración Mundial"] #plural
#glossary_entry = ["World Restoration Flagship", "Iniciativa Emblemática de la Restauración Mundial"] #singular
#glossary_entry = ["world restoration flagship", "iniciativa emblemática de la restauración mundial"] #singular_lowercase

fuzzy_src_sents = [sent.strip() for sent in similar_text_en.split("\n")]
fuzzy_source_sentences = [sent.strip() for sent in fuzzy_src_sents if sent.strip()]

# Replace the first element of the source and target lists with the glossary entry
#fuzzy_source_sentences[0] = glossary_entry[0]
#fuzzy_target_prefixes[0] = glossary_entry[1]

print("Length of fuzzy source and fuzzy target prefixes:")
print(len(fuzzy_source_sentences))
print(len(fuzzy_target_prefixes))

print(fuzzy_source_sentences[0])
print(fuzzy_target_prefixes[0])

Length of fuzzy source and fuzzy target prefixes:
10
10
The United Nations Environment Programme (UNEP) and the Food and Agriculture Organization (FAO) have named the first World Restoration Flagships for this year, which address ecosystem degradation across the globe.
El Programa de las Naciones Unidas para el Medio Ambiente (PNUMA) y la Organización de las Naciones Unidas para la Alimentación y la Agricultura (FAO) han nombrado las primeras Iniciativas Emblemáticas de la Restauración Mundial para este año, que abordan la degradación de los ecosistemas en todo el planeta.
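
Since the TM lookup itself is still a TODO, here is a minimal sketch of how the closest TM segment for a new source sentence might be picked. It uses the standard-library `difflib.SequenceMatcher` as a stand-in scorer, which is an assumption and not the matcher used elsewhere in this notebook. The helper name `best_tm_match`, the `min_ratio` threshold, and the two-entry `tm_pairs` list are hypothetical and only illustrate the idea.

```python
from difflib import SequenceMatcher


def best_tm_match(sentence, tm_pairs, min_ratio=0.5):
    """Return (ratio, tm_source, tm_target) for the closest TM source, or None."""
    best = None
    for tm_src, tm_tgt in tm_pairs:
        # Case-insensitive character-level similarity in the range [0, 1]
        ratio = SequenceMatcher(None, sentence.lower(), tm_src.lower()).ratio()
        if ratio >= min_ratio and (best is None or ratio > best[0]):
            best = (ratio, tm_src, tm_tgt)
    return best


# Tiny hypothetical TM, shortened from the sentences in the cell above
tm_pairs = [
    ("These initiatives have been restoring around three million hectares of marine ecosystems.",
     "Estas iniciativas han estado restaurando alrededor de tres millones de hectáreas de ecosistemas marinos."),
    ("Everyone must do their part.",
     "Cada persona debe cumplir su parte."),
]

new_sentence = "These initiatives are restoring almost five million hectares of marine ecosystems."
match = best_tm_match(new_sentence, tm_pairs)
if match is not None:
    ratio, tm_src, tm_tgt = match
    print(f"{ratio:.2f}", tm_src, tm_tgt, sep="\n")
```

The returned source and target could then fill `fuzzy_source_sentences[i]` and `fuzzy_target_prefixes[i]` for the sentence at index `i`. Sentences without a match above the threshold would need an empty string on both sides to keep the two lists aligned.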


# Glossary fuzzy matches

In [20]:
tgt_lang_name

'Spanish'

In [21]:
processor = MultilingualGlossaryProcessor("data/glossaryUNEP_corrected.xlsx")

glossary_matches = []

for src_sent in source_sents:
    sent_matches = processor.find_nearly_exact_english_matches(
        text=src_sent,
        target_languages=[tgt_lang_name],
        score_cutoff=90.0,
        normalize_text=True,
        remove_overlaps=True
    )

    # Keep (best_fuzzy, translation) pairs for the target language
    sent_matches = [(match['best_fuzzy'], match['result'].get(tgt_lang_name, '')) for match in sent_matches if 'best_fuzzy' in match and tgt_lang_name in match['result']]
    # Drop pairs where the glossary term or its translation is empty
    sent_matches = [match for match in sent_matches if match[0] and match[1]]
    if sent_matches:
        print(f"Matches found for source sentence '{src_sent}': {len(sent_matches)}"
              f" - {sent_matches}")
        glossary_matches.append(sent_matches)
    else:
        # Append an empty list so the join below yields empty strings for this sentence
        # (appending the tuple ("", "") would raise IndexError when indexing its empty strings)
        print(f"No matches found for source sentence '{src_sent}'")
        glossary_matches.append([])

# Collapse each sentence's matches into one (source terms, target terms) tuple of comma-separated strings
glossary_matches = [(", ".join(match[0] for match in matches),
                     ", ".join(match[1] for match in matches)) for matches in glossary_matches]

# separate the glossary_matches into two lists: first elements and second elements
glossary_matches_src = [match[0] for match in glossary_matches]
glossary_matches_tgt = [match[1] for match in glossary_matches]

fuzzy_source_sentences = glossary_matches_src
fuzzy_target_prefixes = glossary_matches_tgt

Matches found for source sentence 'The UN Environment Programme (UNEP) and the Food and Agriculture Organization of the UN (FAO) have named the first World Restoration Flagships for this year, tackling pollution, unsustainable exploitation, and invasive species in three continents.': 4 - [('pollution', 'contaminación'), ('World Restoration Flagship', 'Iniciativa Emblemática de la Restauración Mundial'), ('UN Environment Programme', 'Programa ONU Medio Ambiente'), ('invasive alien species', 'Especie exótica invasiva')]
Matches found for source sentence 'These initiatives are restoring almost five million hectares of marine ecosystems - an area about the size of Costa Rica, which, together with France, is hosting the 3rd UN Ocean Conference.': 4 - [('marine ecosystems', 'ecosistemas marinos'), ('Costa Rica', 'Costa Rica'), ('France', 'Francia'), ('conference', 'conferencias')]
Matches found for source sentence 'The three new flagships comprise restoration initiatives in the coral-rich No
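
To make the data flow of the cell above easier to follow, here is a small self-contained sketch of the flattening step on toy data. The helper name `join_matches` and the `example` list are hypothetical. In the sketch, an empty list stands for a sentence without glossary matches.

```python
def join_matches(per_sentence_matches, sep=", "):
    """Turn [[(src, tgt), ...], ...] into two parallel lists of joined strings."""
    src_hints, tgt_hints = [], []
    for matches in per_sentence_matches:
        # Drop pairs where either side is empty, as the cell above does
        kept = [(src, tgt) for src, tgt in matches if src and tgt]
        src_hints.append(sep.join(src for src, _ in kept))
        tgt_hints.append(sep.join(tgt for _, tgt in kept))
    return src_hints, tgt_hints


example = [
    [("pollution", "contaminación"), ("UNEA", "")],         # second pair has no translation
    [],                                                     # sentence with no glossary match
    [("Costa Rica", "Costa Rica"), ("France", "Francia")],
]

src_hints, tgt_hints = join_matches(example)
print(src_hints)
print(tgt_hints)
```

Both lists always have one entry per source sentence, which matters because the translation cell pairs them positionally with `source_sents`.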

## Translation inserting matches

In [22]:
import ctranslate2
import sentencepiece as spm
import torch

# src_lang = "eng_Latn"
# tgt_lang = "spa_Latn"

beam_size = 2

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)


# Subword the source sentences
fuzzy_source_sentences_subworded = sp.encode_as_pieces(fuzzy_source_sentences)
real_source_sentences_subworded = sp.encode_as_pieces(source_sents)
fuzzy_real_subworded = zip(fuzzy_source_sentences_subworded, real_source_sentences_subworded)

separator = sp.encode_as_pieces("•")  # tokenize "•"; the result is the list ["▁•"]

source_sents_subworded = [[src_lang] + fuzzy_src + [src_lang] + separator + real_src + ["</s>"]
                          for fuzzy_src, real_src in fuzzy_real_subworded]
#source_sents_subworded = [[src_lang] + fuzzy_src + [src_lang] + separator + ["</s>"]
                          #for fuzzy_src in fuzzy_source_sentences_subworded]
print(source_sents_subworded[0])

prefixes_subworded = sp.encode_as_pieces(fuzzy_target_prefixes)
target_prefixes = [[tgt_lang] + sent + [tgt_lang] + separator for sent in prefixes_subworded]
print(target_prefixes[0])

# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device=device)
translations = translator.translate_batch(source_sents_subworded,
                                          batch_type="tokens",
                                          max_batch_size=2024,
                                          beam_size=beam_size,
                                          min_decoding_length=2,
                                          max_decoding_length=512,
                                          target_prefix=target_prefixes)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]

translations_only = [sent.split(tgt_lang)[1].strip() for sent in translations_desubword]

print("\nTranslations:", *translations_desubword[:10], sep="\n")
print("\nTranslations only:", *translations_only[:10], sep="\n")

# Remove bullet points and leading/trailing whitespace from translations_only
translations_only = [sent[1:].strip() if sent.startswith("•") else sent.strip() for sent in translations_only]

['eng_Latn', '▁pollu', 'tion', ',', '▁World', '▁Rest', 'oration', '▁Flag', 'ship', ',', '▁UN', '▁Environment', '▁Programme', ',', '▁invasi', 've', '▁alien', '▁species', 'eng_Latn', '▁•', '▁The', '▁UN', '▁Environment', '▁Programme', '▁(', 'UN', 'EP', ')', '▁and', '▁the', '▁Food', '▁and', '▁Agric', 'ulture', '▁Organization', '▁of', '▁the', '▁UN', '▁(', 'FA', 'O', ')', '▁have', '▁named', '▁the', '▁first', '▁World', '▁Rest', 'oration', '▁Flag', 'shi', 'ps', '▁for', '▁this', '▁year', ',', '▁tack', 'ling', '▁pollu', 'tion', ',', '▁uns', 'usta', 'inable', '▁explo', 'itation', ',', '▁and', '▁invasi', 've', '▁species', '▁in', '▁three', '▁contin', 'ents', '.', '</s>']
['spa_Latn', '▁contamina', 'ción', ',', '▁Inici', 'ativa', '▁Emb', 'lem', 'ática', '▁de', '▁la', '▁Resta', 'uración', '▁Mundial', ',', '▁Programa', '▁ONU', '▁Medio', '▁Ambiente', ',', '▁Es', 'pe', 'cie', '▁ex', 'ó', 'tica', '▁invasi', 'va', 'spa_Latn', '▁•']

Translations:
contaminación, Iniciativa Emblemática de la Restauración Mu
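
The prompt layout used above can be sketched without loading any model. The source side is `[src_lang] + hints + [src_lang] + separator + sentence + ["</s>"]`, and the target prefix is `[tgt_lang] + hints + [tgt_lang] + separator`. The helper names below are hypothetical. `extract_translation` uses `rpartition` instead of `split(tgt_lang)[1]`, a deliberate variation that falls back to the whole string when the language tag is missing from the output.

```python
def build_source(src_lang, hint_tokens, sent_tokens, sep_tokens):
    """Source tokens: language tag, glossary hints, language tag, separator, sentence, EOS."""
    return [src_lang] + hint_tokens + [src_lang] + sep_tokens + sent_tokens + ["</s>"]


def build_target_prefix(tgt_lang, hint_tokens, sep_tokens):
    """Target prefix: language tag, translated hints, language tag, separator."""
    return [tgt_lang] + hint_tokens + [tgt_lang] + sep_tokens


def extract_translation(decoded, tgt_lang, bullet="•"):
    """Keep the text after the last language tag and strip the leading bullet."""
    # rpartition returns ('', '', decoded) when the tag is absent, so nothing is lost
    _, _, tail = decoded.rpartition(tgt_lang)
    tail = tail.strip()
    if tail.startswith(bullet):
        tail = tail[len(bullet):].strip()
    return tail


sep_tokens = ["▁•"]
source_tokens = build_source("eng_Latn", ["▁pollu", "tion"], ["▁The", "▁UN"], sep_tokens)
prefix_tokens = build_target_prefix("spa_Latn", ["▁contamina", "ción"], sep_tokens)
print(source_tokens)
print(prefix_tokens)

decoded = "contaminación, Programa ONU Medio Ambientespa_Latn • El Programa de las Naciones Unidas"
print(extract_translation(decoded, "spa_Latn"))
```

Because the second language tag is glued to the last hint token after decoding (as in `invasivaspa_Latn` in the output of the next cell), splitting on the tag is more reliable than splitting on whitespace.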

In [None]:
# # Save the translations

# translations_file_name = "testUNEP.es"

# with open(translations_file_name, "w+") as output:
#   for translation in translations_only:
#     output.write(translation + "\n")

# Print each line of testUNEP.en and testUNEP.es side by side



In [23]:
# translations_desubword and translations_only
print(len(translations_desubword))
print(len(translations_only))

# Both translation lists have one entry per source sentence, so zip keeps them aligned
for full, only, src in zip(translations_desubword, translations_only, source_sents):
  print(full)
  print(only)
  print(src)
  print()

12
12
contaminación, Iniciativa Emblemática de la Restauración Mundial, Programa ONU Medio Ambiente, Especie exótica invasivaspa_Latn • El Programa de las Naciones Unidas para el Medio Ambiente (PNUMA) y la Organización de las Naciones Unidas para la Alimentación y la Agricultura (FAO) han nombrado las primeras Iniciativas Emblemáticas de la Restauración Mundial para este año, que abordan la contaminación, la explotación insostenible y las especies invasoras en tres continentes.
El Programa de las Naciones Unidas para el Medio Ambiente (PNUMA) y la Organización de las Naciones Unidas para la Alimentación y la Agricultura (FAO) han nombrado las primeras Iniciativas Emblemáticas de la Restauración Mundial para este año, que abordan la contaminación, la explotación insostenible y las especies invasoras en tres continentes.
The UN Environment Programme (UNEP) and the Food and Agriculture Organization of the UN (FAO) have named the first World Restoration Flagships for this year, tackling p