<a href="https://colab.research.google.com/github/JoyeBright/domain-adapt-mt/blob/main/Examples/Ex_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we show how to use **[DomainAdapt-MT v1.0.0](https://github.com/JoyeBright/domain-adapt-mt)** to select training data for machine translation (MT) models. 😀


## 🌍 Out-of-Domain Data

We download a large out-of-domain English–French corpus (**CCAligned v1**) from the [OPUS repository](https://opus.nlpl.eu/CCAligned.php).  


In [1]:
!wget  https://object.pouta.csc.fi/OPUS-CCAligned/v1/moses/en-fr.txt.zip

--2025-06-13 10:03:59--  https://object.pouta.csc.fi/OPUS-CCAligned/v1/moses/en-fr.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1013655142 (967M) [application/zip]
Saving to: ‘en-fr.txt.zip’


2025-06-13 10:08:11 (3.86 MB/s) - ‘en-fr.txt.zip’ saved [1013655142/1013655142]



In [2]:
!unzip /content/en-fr.txt.zip

Archive:  /content/en-fr.txt.zip
  inflating: README                  
  inflating: LICENSE                 
  inflating: CCAligned.en-fr.en      
  inflating: CCAligned.en-fr.fr      
  inflating: CCAligned.en-fr.xml     
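
Before going further, it can be worth confirming that the two extracted sides are still line-aligned, since every later step assumes line *i* of the `.en` file pairs with line *i* of the `.fr` file. The sketch below is our own helper, not part of DomainAdapt-MT; the function names `count_lines` and `check_parallel` are hypothetical, and the paths are taken from the unzip output above.

```python
from pathlib import Path

def count_lines(path):
    """Count lines by streaming the file, so the corpus is never held in memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two sides differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"Misaligned corpus: {n_src} source vs {n_tgt} target lines")
    return n_src

# Paths follow the unzip output above; adjust them if you extracted elsewhere.
src = Path("/content/CCAligned.en-fr.en")
tgt = Path("/content/CCAligned.en-fr.fr")
if src.exists() and tgt.exists():
    print(f"{check_parallel(src, tgt):,} aligned sentence pairs")
```

If the counts differ, re-extract the archive before sampling, because a single shifted line misaligns every pair after it.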


## ⚠️ Downsizing the Out-of-Domain Corpus (for Demo Only)

Environments like Google Colab have limited memory, so for this demo we downsize the out-of-domain (OOD) corpus.  
We randomly sample 4 million sentence pairs from the full CCAligned dataset:

In [None]:
!paste /content/CCAligned.en-fr.en /content/CCAligned.en-fr.fr \
  | shuf -n 4000000 \
  | tee >(cut -f1 > /content/sample.en) >(cut -f2 > /content/sample.fr) \
  > /dev/null
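
The shell pipeline above joins the two sides with a tab and splits them again afterwards, which relies on no sentence containing a tab character. As an alternative, here is a minimal Python sketch that samples aligned pairs in a single pass using reservoir sampling. It is our own illustration rather than anything shipped with DomainAdapt-MT; `sample_pairs` and `write_pairs` are hypothetical names, and the reservoir keeps all `k` pairs in memory, so `k` still has to fit in RAM.

```python
import os
import random

def sample_pairs(src_path, tgt_path, k, seed=0):
    """Reservoir-sample up to k aligned (source, target) line pairs in one pass."""
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for i, pair in enumerate(zip(fs, ft)):
            if i < k:
                reservoir.append(pair)       # fill the reservoir first
            else:
                j = rng.randint(0, i)        # inclusive on both ends
                if j < k:
                    reservoir[j] = pair      # replace with probability k / (i + 1)
    return reservoir

def write_pairs(pairs, src_out, tgt_out):
    """Write sampled pairs to two files, one sentence per line, keeping alignment."""
    with open(src_out, "w", encoding="utf-8") as fs, open(tgt_out, "w", encoding="utf-8") as ft:
        for s, t in pairs:
            fs.write(s.rstrip("\n") + "\n")
            ft.write(t.rstrip("\n") + "\n")

# Paths mirror the notebook; the block is skipped if the corpus is not present.
SRC, TGT = "/content/CCAligned.en-fr.en", "/content/CCAligned.en-fr.fr"
if os.path.exists(SRC) and os.path.exists(TGT):
    write_pairs(sample_pairs(SRC, TGT, k=4_000_000),
                "/content/sample.en", "/content/sample.fr")
```

Fixing `seed` makes the sample reproducible across runs, which the shell version does not guarantee.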


## 📘 In-Domain Data

We use the [Unsupervised Domain Clusters](https://github.com/roeeaharoni/unsupervised-domain-clusters) dataset for in-domain data.
For this example, we use only the **IT** domain subset. We uploaded the dataset archive (`multi_domain_new_split.zip`) to our drive; you can do the same.


In [7]:
!unzip /content/multi_domain_new_split.zip -d /content/

Archive:  /content/multi_domain_new_split.zip
   creating: /content/it/
   creating: /content/koran/
   creating: /content/law/
   creating: /content/medical/
   creating: /content/subtitles/
  inflating: /content/it/dev.de      
  inflating: /content/it/dev.en      
  inflating: /content/it/test.de     
  inflating: /content/it/test.en     
  inflating: /content/it/train.de    
  inflating: /content/it/train.en    
  inflating: /content/koran/dev.de   
  inflating: /content/koran/dev.en   
  inflating: /content/koran/test.de  
  inflating: /content/koran/test.en  
  inflating: /content/koran/train.de  
  inflating: /content/koran/train.en  
  inflating: /content/law/dev.de     
  inflating: /content/law/dev.en     
  inflating: /content/law/test.de    
  inflating: /content/law/test.en    
  inflating: /content/law/train.de   
  inflating: /content/law/train.en   
  inflating: /content/medical/dev.de  
  inflating: /content/medical/dev.en  
  inflating: /content/medical/test.de  
  in

## 🚀 Run the Data Selection Script

Now that we have the required input files, we can use the data selection script from the repository:  
👉 [domain-adapt-mt](https://github.com/JoyeBright/domain-adapt-mt)

### 📥 Step 1: Download the Script

Download the main script directly: [main.py](https://github.com/JoyeBright/domain-adapt-mt/blob/main/main.py)

Alternatively, clone the full repository:
```bash
git clone https://github.com/JoyeBright/domain-adapt-mt.git
cd domain-adapt-mt
```


In [8]:
!git clone https://github.com/JoyeBright/domain-adapt-mt.git

Cloning into 'domain-adapt-mt'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 3), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (15/15), 8.53 KiB | 8.53 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [9]:
%cd domain-adapt-mt

/content/domain-adapt-mt
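
Since the selection run can take a while, a quick pre-flight check that the three input files exist and are non-empty can save a wasted launch. The snippet below is a small sketch of our own, not part of the repository; `missing_inputs` is a hypothetical helper, and the paths are the ones used in this notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the paths that are not regular files or that are empty."""
    return [p for p in map(Path, paths) if not p.is_file() or p.stat().st_size == 0]

# The OOD source, OOD target, and in-domain files used in this notebook.
inputs = ["/content/sample.en", "/content/sample.fr", "/content/it/train.en"]
problems = missing_inputs(inputs)
if problems:
    print("Missing or empty:", [str(p) for p in problems])
else:
    print("All inputs found.")
```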


In [None]:
!python /content/domain-adapt-mt/main.py \
    -ood_src /content/sample.en \
    -ood_tgt /content/sample.fr \
    -id /content/it/train.en

2025-06-13 10:23:56.304768: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749810236.533197    6480 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749810236.595197    6480 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-13 10:23:57.067188: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Below are the arguments entered ...
source-side OOD= /content/sample.en
target-side OOD= /content/sample.fr
ID= /cont