<a href="https://colab.research.google.com/github/FarhanDhanani/joker-clef-22-FAST-MT/blob/main/JOKER_CLEF_TASK_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center> This notebook was prepared to submit the FAST-MT team's solutions to the JOKER CLEF conference.</center></h1>

<center><img src="https://upload.wikimedia.org/wikipedia/en/e/e4/National_University_of_Computer_and_Emerging_Sciences_logo.png"/></center>

A simple set of booleans that indicate whether the notebook is running on Google Colab or locally, and whether each model is already downloaded (on Google Drive or locally) or still needs to be downloaded from the checkpoint repository.

In [1]:
'''
## A set of simple booleans:
##  - (run_on_colab) = True : if the notebook is running on Colab
##  - (run_on_colab) = False : if the notebook is running locally
'''
run_on_colab = True

'''
##  - (download_camem_bert_for_extractive_question_answering) = True : if the CamemBERT model needs to be downloaded from the given checkpoint
##  - (download_camem_bert_for_extractive_question_answering) = False : if the CamemBERT model is already downloaded and its path is set
'''
download_camem_bert_for_extractive_question_answering = True

'''
##  - (download_distil_bert_for_extractive_question_answering) = True : if the DistilBERT model needs to be downloaded from the given checkpoint
##  - (download_distil_bert_for_extractive_question_answering) = False : if the DistilBERT model is already downloaded and its path is set
'''
download_distil_bert_for_extractive_question_answering = True


Installing the required dependencies and setting up the base paths

In [2]:
if run_on_colab:
    from google.colab import drive
    from google.colab import files
    base_path = '/content/drive'
    drive.mount(base_path)
    base_path = base_path + '/My Drive/'

    ## UPLOADING REQUIREMENTS.TXT
    uploaded = files.upload()
    for fn in uploaded.keys():
      print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

    ## INSTALLING ALL DEPENDENCIES
    !pip3 install -r requirements_task2.txt

else:
    base_path = '/Users/fdhanani/Desktop/JOKER CLEF/'
    ## INSTALLING ALL DEPENDENCIES
    !pip3 install -r requirements_task2.txt

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Saving requirements_task2.txt to requirements_task2 (1).txt
User uploaded file "requirements_task2.txt" with length 1627 bytes
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Collection of all the imports required to run this notebook

In [3]:
import json
import torch
import collections
import numpy as np
import pandas as pd 
from tqdm import tqdm
from functools import partial
from torch.optim import AdamW
from sklearn.utils import shuffle
from accelerate import Accelerator
from torch.utils.data import DataLoader
from translate.storage.tmx import tmxfile
from sklearn.model_selection import train_test_split

from datasets import load_dataset, Features, Value, load_metric
from transformers import AutoTokenizer, default_data_collator, AutoModelForQuestionAnswering, get_scheduler, pipeline

Before executing the subsequent notebook cells, please clone all the related files from the [official Github repository](https://github.com/FarhanDhanani/joker-clef-22-FAST-MT). Then ensure that the directory structure shown below is set up correctly, either on Google Drive if you are executing on Colab or on your local system if you are running the notebook locally.

**DIRECTORY STRUCTURE:** 
```
.
├── JOKER_CLEF_TASK_1.ipynb
├── JOKER_CLEF_TASK_2.ipynb
├── JOKER_CLEF_TASK_3.ipynb
├── requirements_task1.txt
├── requirements_task2.txt
├── requirements_task3.txt
├── Readme.md
├── dataset
│   └── JOKER
│       ├── Task 1
│       │   ├── test
│       │   │   ├── check_format.py
│       │   │   ├── joker_task1_en_test.csv
│       │   │   ├── joker_task1_en_test.json
│       │   │   └── joker_task1_en_testfull.json
│       │   └── train
│       │       ├── FT_Model
│       │       │   ├── FINE_TUNED_MODELS_FOR_TASK1_WILL_SAVE_HERE
│       │       │   ├── TFDistilBertForSequenceClassification
│       │       │   │   ├── finetuned model TFDistilBertForSequenceClassification
│       │       │   ├── TFDistilBertForSequenceClassificationOfConventionalType
│       │       │   │   ├── finetuned model TFDistilBertForSequenceClassificationOfConventionalType
│       │       │   ├── TFDistilBertForSequenceClassificationOfCulturalReferenceType
│       │       │   │   ├── finetuned model TFDistilBertForSequenceClassificationOfCulturalReferenceType
│       │       │   ├── TFDistilBertForSequenceClassificationOfHorizontalVerticalType
│       │       │   │   ├── finetuned model TFDistilBertForSequenceClassificationOfHorizontalVerticalType
│       │       │   ├── TFDistilBertForSequenceClassificationOfManipulationType
│       │       │   │   ├── finetuned model TFDistilBertForSequenceClassificationOfManipulationType
│       │       │   ├── bert-finetuned-ner-JOKER_task1
│       │       │   │   ├── finetuned model ner-JOKER_task1
│       │       │   ├── bert-finetuned-ner-location-JOKER_task1
│       │       │   │   ├── finetuned model location-JOKER_task1
│       │       ├── joker_task1_en_train.csv
│       │       ├── joker_task1_en_train.json
│       │       ├── joker_task1_en_trainfull.json
│       │       ├── train_val_split
│       │       │   ├── train1.csv
│       │       │   ├── train1.json
│       │       │   ├── val1.csv
│       │       │   └── val1.json
│       ├── Task 2
│       │   ├── test
│       │   │   ├── CAMEM-solutions_joker_task2_test_context_aware_data_where_answers_exist.json
│       │   │   ├── DISTIL-solutions_joker_task2_test_context_aware_data_where_answers_exist.json
│       │   │   ├── check_format.py
│       │   │   ├── joker_task2_test.csv
│       │   │   ├── joker_task2_test.json
│       │   │   ├── joker_task2_test_context_aware_data_where_answers_exist.csv
│       │   └── train
│       │       ├── FT_Model
│       │       │   ├── camembert-base-squadFR-fquad-piaf
│       │       │   │   ├── Fine-tuned camembert model will save here
│       │       │   └── distilbert-base-cased-distilled-squad
│       │       │       ├── Fine-tuned distilbert model will save here
│       │       ├── joker_task2_train.csv
│       │       ├── joker_task2_train.json
│       │       ├── joker_task2_train_context_aware_data_where_answers_exist.csv
│       │       └── train_val_split
│       │           ├── train2.csv
│       │           └── val2.csv
│       ├── Task 3
│       │   ├── test
│       │   │   ├── check_format.py
│       │   │   ├── joker_task3_test.csv
│       │   │   └── joker_task3_test.json
│       │   └── train
│       │       ├── Helsinki-NLP-performance.json
│       │       ├── joker_task3_train.csv
│       │       ├── joker_task3_train.json
│       │       ├── t5-base-performance.json
│       │       ├── t5-large-performance.json
│       │       └── t5-small-performance.json
│       ├── csv
│       │   ├── context
│       │   │   ├── c_ref.csv
│       │   │   ├── c_ref_test.csv
│       │   │   └── c_ref_train.csv
│       │   ├── en-fr_books.csv
│       │   ├── en-fr_bookshop.csv
│       │   ├── en-fr_opensub.csv
│       │   ├── en-fr_ted.csv
│       │   ├── en-fr_wiki.csv
│       │   ├── en-fr_wikipedia.csv
│       │   ├── en_processed
│       │   │   ├── en-fr_books.csv
│       │   │   ├── en-fr_bookshop.csv
│       │   │   ├── en-fr_opensub.csv
│       │   │   ├── en-fr_ted.csv
│       │   │   ├── en-fr_wiki.csv
│       │   │   └── en-fr_wikipedia.csv
│       │   └── extractVocab
│       │       ├── test_nonPok_en-fr_books.json
│       │       ├── test_nonPok_en-fr_bookshop.json
│       │       ├── test_nonPok_en-fr_opensub.json
│       │       ├── test_nonPok_en-fr_ted.json
│       │       ├── test_nonPok_en-fr_wiki.json
│       │       ├── test_nonPok_en-fr_wikipedia.json
│       │       ├── train_nonPok_en-fr_books.json
│       │       ├── train_nonPok_en-fr_bookshop.json
│       │       ├── train_nonPok_en-fr_opensub.json
│       │       ├── train_nonPok_en-fr_ted.json
│       │       ├── train_nonPok_en-fr_wiki.json
│       │       └── train_nonPok_en-fr_wikipedia.json
│       ├── en-fr.tmx_wiki.gz
│       ├── en-fr_books.tmx
│       ├── en-fr_books.tmx.gz
│       ├── en-fr_bookshop.tmx
│       ├── en-fr_bookshop.tmx_2.gz
│       ├── en-fr_opensub.tmx
│       ├── en-fr_opensub.tmx.gz
│       ├── en-fr_ted.tmx
│       ├── en-fr_ted.tmx.gz
│       ├── en-fr_wiki.tmx
│       ├── en-fr_wikipedia.tmx
│       ├── en-fr_wikipedia.tmx.gz
│       ├── dragons.csv
│       ├── task2_pokemon_en.json
│       └── task2_pokemon_fr.json
├── models
│   ├── Downloaded Pre-trained Models will be saved here
│   ├── Helsinki-NLP-opus-mt-en-fr
│   │   
│   ├── bert-base-cased
│   │   
│   ├── camembert-base-squadFR-fquad-piaf
│   │   
│   ├── distilbert-base-cased-distilled-squad
│   │   
│   ├── distilbert-base-uncased
│   │   
│   ├── mt5-small
│   │   
│   ├── roberta-base
│   │   
│   └── t5-base
│       
├── .
```

# **TASK 2: TRANSLATE SINGLE WORDS (NOUNS) CONTAINING WORDPLAY.**

Train data format: List of translated wordplay instances in a JSON format or a CSV file (for manual runs) with the following fields:

    id: a unique wordplay identifier
    en: wordplay text in English (source)
    fr: wordplay text in French (target)

Example:

    [{"id":"noun_1","en":"Ambipom","fr":"Capidextre"}]

Test data input format: List of wordplay instances to translate in a JSON format or a CSV file (for manual runs) with the following fields:

    id: a unique wordplay identifier
    en: wordplay text in English (source)

Input example:

    [{"id":"noun_1185","en":"Fungun"}]

Test data output format:

List of wordplay instances to be translated in a JSON format or a CSV file (for manual runs) with the following fields:

    RUN_ID: Run ID starting with team_id_ (as registered at the CLEF website)
    MANUAL: Whether the run is manual {0,1}
    id: a unique wordplay identifier
    en: wordplay text in English (source)
    fr: wordplay text in French (target)

Output example:

    [{"RUN_ID":"OFFICIAL","MANUAL":1,"id":"noun_1","en":"Ambipom","fr":"Capidextre"}]
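
As a minimal sketch of the output format above, the following snippet wraps predicted translations in the required fields. The helper name `build_submission` and the run ID `TEAMID_run1` are hypothetical placeholders, not part of the official task material; the real run ID must start with the registered team ID.

```python
import json


def build_submission(predictions, run_id, manual=0):
    """Wrap (id, en, fr) triples in the fields required by the Task 2 output format."""
    return [
        {"RUN_ID": run_id, "MANUAL": manual, "id": pid, "en": en, "fr": fr}
        for pid, en, fr in predictions
    ]


# Hypothetical run ID, used only for illustration.
rows = build_submission([("noun_1", "Ambipom", "Capidextre")], run_id="TEAMID_run1")

# ensure_ascii=False keeps accented French characters readable in the file.
submission_json = json.dumps(rows, ensure_ascii=False)
print(submission_json)
```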


**Evaluation**. Human evaluators will manually annotate the submitted translations according to both subjective measures and according to more concrete features such as whether wordplay exists in the target text, whether it corresponds to the type used in the source text, whether the target text preserves the semantic field, etc.

**Result submission**. Participants should put their run results into the folder Documents created for their user and submit them by email to contact@joker-project.com. The email subject has to be in the format [CLEF TASK 2] TEAM_ID.


## LOADING THE DATASETS

LOADING THE TRAIN DATA-SET FOR TASK-2

In [4]:
'''
## The train data was provided in both JSON and CSV formats.
## Here we load the CSV file into a pandas dataframe for further analysis.
'''

path = "/dataset/JOKER/Task 2/train/joker_task2_train.csv"
data_set_2 = pd.read_csv(base_path+path)
data_set_2.head(5)

Unnamed: 0,id,en,fr
0,noun_1,Ambipom,Capidextre
1,noun_2,Dartrix,Efflèche
2,noun_3,Malamar,Sepiatroce
3,noun_4,Bounsweet,Croquine
4,noun_5,Obelix,Obélix


THE STRUCTURE OF THE TRAIN DATA-SET

In [5]:
'''
 ## Determining the structure of the training data set
'''
data_set_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1164 entries, 0 to 1163
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      1158 non-null   object
 1   en      1164 non-null   object
 2   fr      1164 non-null   object
dtypes: object(3)
memory usage: 27.4+ KB


LOADING THE TEST DATA-SET FOR TASK-2

In [6]:
'''
## The test data was provided in both JSON and CSV formats.
## Here we load the CSV file into a pandas dataframe for further analysis.
'''
path = "dataset/JOKER/Task 2/test/joker_task2_test.csv"
test_data_set_2 = pd.read_csv(base_path+path)
test_data_set_2.head(5)

Unnamed: 0,id,en
0,noun_1161,Orbeetle
1,noun_1162,Gossifleur
2,noun_1163,Eldegoss
3,noun_1164,Crabominable
4,noun_1165,Ribombee


In [7]:
'''
 ## Determining the structure of the testing data set.
'''
test_data_set_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284 entries, 0 to 283
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      284 non-null    object
 1   en      284 non-null    object
dtypes: object(2)
memory usage: 4.6+ KB


## EDA FOR THE LOADED DATASET

TOTAL NUMBER OF RECORDS IN THE GIVEN TESTING DATA-SET

In [8]:
'''
 ## The number of total records in the given testing data set
'''
len(test_data_set_2)

284

NUMBER OF MISSING VALUES IN EACH COLUMN OF THE TEST DATA-SET

In [9]:
test_data_set_2.isnull().sum()

id    0
en    0
dtype: int64

TOTAL NUMBER OF RECORDS IN TRAIN DATASET

In [10]:
'''
 ## The number of total records in the given training data set
'''
len(data_set_2)

1164

TOTAL NUMBER OF MISSING VALUES IN EACH COLUMN

In [11]:
'''
 ## Finding out total Number of missing cells in the given training data set
'''
data_set_2.isnull().sum()

id    6
en    0
fr    0
dtype: int64

EXPLORING THE ID VARIABLE IN THE TRAINING DATA-SET

An ID should normally be a non-null, unique identifier for each record (row) in the data set. In the provided training data, every ID that is present is unique, but six records have no ID at all.

 - TOTAL NUMBER OF RECORDS: 1164

 - DISTINCT NUMBER OF IDs: 1158

 - RECORDS WITH MISSING IDs: 6

In [12]:
'''
 ## Finding out the description data related to id variable
'''
data_set_2.id.describe()

count       1158
unique      1158
top       noun_1
freq           1
Name: id, dtype: object

EXPLORING THE EN COLUMN

In [13]:
'''
 ## Finding out the description data related to en variable
'''
data_set_2.en.describe()

count         1164
unique        1152
top       official
freq             3
Name: en, dtype: object

NOT ALL VALUES IN THE EN COLUMN ARE UNIQUE

The task is to predict the French translation of a given English word, so ideally the training data should contain no duplicates. In particular, a duplicated English word should not map to different French translations: the input is a single word, so a deterministic model can never produce two different outputs for the same input. Alternatively, we can assume that a given English word may have more than one valid French translation.

In [14]:
'''
 ## Finding duplicate English words in the given training data.
'''
dups = data_set_2[data_set_2.duplicated(subset=['en'], keep=False)]
dups.sort_values(['en'], ascending=[1])

Unnamed: 0,id,en,fr
60,noun_61,Crustacius,Gracchus Cétinconcensus
676,noun_674,Crustacius,Gracchus Cétinconsensus
1157,noun_1155,Elimentaler,Fromager
1065,noun_1063,Elimentaler,Demmentaleur
819,noun_817,Face Melter,Amplificatueur
1114,noun_1112,Face Melter,Défiguratrice
656,noun_654,Fightsabre,Épée-flingue
650,noun_648,Fightsabre,Mitrailleuse Jedi
169,noun_170,Glummy,Savourable
253,noun_252,Glummy,Miamifique
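
The check above can also be phrased as "which English nouns carry more than one distinct French translation". The snippet below is a small library-free sketch of that grouping; the function name `conflicting_translations` is an illustrative assumption, and the sample pairs are taken from the duplicate listing above.

```python
from collections import defaultdict


def conflicting_translations(pairs):
    """Map each English noun to its French translations, keeping only nouns with more than one."""
    by_source = defaultdict(set)
    for en, fr in pairs:
        by_source[en].add(fr)
    return {en: sorted(frs) for en, frs in by_source.items() if len(frs) > 1}


sample = [
    ("Glummy", "Savourable"),
    ("Glummy", "Miamifique"),
    ("Obelix", "Obélix"),
]
conflicts = conflicting_translations(sample)
print(conflicts)
```

Nouns reported by such a check are the ones for which a deterministic model cannot reproduce every reference translation.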


FINDING RECORDS IN WHICH EN COLUMN HAS NUMERIC VALUES

No records contain a numeric value in the English column.

In [15]:
'''
 ## Sanity check to verify if there exist any invalid numeric entry in the
 ## English column of the given training data-set
'''
data_set_2[pd.to_numeric(data_set_2['en'], errors='coerce').notnull()]

Unnamed: 0,id,en,fr


EXPLORING THE FR COLUMN

In [16]:
'''
 ## Finding out the description data related to fr variable
'''
data_set_2.fr.describe()

count     1164
unique    1163
top          2
freq         2
Name: fr, dtype: object

NOT ALL VALUES IN THE FR COLUMN ARE UNIQUE

The French values shown below are not valid translations of the English word; these records must be filtered out of the data.

In [17]:
'''
 ## Finding duplicate French words in the given training data.
'''
dups = data_set_2[data_set_2.duplicated(subset=['fr'], keep=False)]
dups.sort_values(['fr'], ascending=[1])

Unnamed: 0,id,en,fr
179,,official,2
321,,official,2


FINDING RECORDS IN WHICH FR COLUMN HAS NUMERIC VALUES

There are three records in which the French translation of an English word is numeric. These records are invalid and need to be either corrected or filtered out.

In [18]:
'''
 ## Sanity check to verify if there exist any invalid numeric entry in the
 ## French column of the given training data-set
'''
data_set_2[pd.to_numeric(data_set_2['fr'], errors='coerce').notnull()]

Unnamed: 0,id,en,fr
179,,official,2
214,,official,1
321,,official,2
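
One possible way to separate such records is sketched below on plain dictionaries, so that the rule itself is easy to inspect. This is an assumption about how the filtering could be done, not the cleaning step used by the notebook; the helper names are hypothetical. Note that `float()` also accepts strings such as `"nan"` or `"inf"`, so a real pipeline may want a stricter test.

```python
def is_number(text):
    """Return True if the value parses as a number (for example "2" or "1.5")."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def split_valid_invalid(records):
    """Separate records that have an id and a non-numeric French value from the rest."""
    valid, invalid = [], []
    for rec in records:
        bad = rec.get("id") is None or is_number(rec.get("fr"))
        (invalid if bad else valid).append(rec)
    return valid, invalid


records = [
    {"id": None, "en": "official", "fr": "2"},
    {"id": "noun_5", "en": "Obelix", "fr": "Obélix"},
]
valid, invalid = split_valid_invalid(records)
print(len(valid), len(invalid))
```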


## PRE-PROCESSING AND BUILDING MODELS

After analyzing the data set, we can observe that it contains English nouns from movies, cartoons, and anime, along with their corresponding translations in the French versions. For example, the first record in the data set is **"Ambipom"**, the English name of a Pokémon; the following picture shows its appearance.

<center><img src="https://img.pokemondb.net/artwork/large/ambipom.jpg"/></center>

Anime and movie writers usually choose the names of their characters very carefully, so that a name conveys the role the character will play. In this case, **"Ambi"** is a Latin prefix that means **both**. The other sub-word, **"pom"**, refers to **"pom-pom"** (the accessory cheerleaders use to decorate their hands and emphasize their hand movements). Together, the word **"Ambipom"** suggests someone who carries a pom-pom in both hands all the time and has a very cheerful, feminine nature. The challenge in this task is to capture such puns in English nouns taken from popular movies, anime, and cartoon shows, and to learn to output their corresponding French translations.

In [19]:
data_set_2.iloc[0]

id        noun_1
en       Ambipom
fr    Capidextre
Name: 0, dtype: object

We have recast the task of learning the mapping between English nouns and their French translations as an extractive question answering (Q/A) problem. We transformed the two-column tabular data set for task 2 into a classical extractive Q/A data set by using English/French parallel sentence pairs (subtitles, books, and wiki) from the [OPUS open-source parallel corpus](https://opus.nlpl.eu/). For each English/French noun pair in the Joker CLEF task 2 data set, we collected the English/French sentence pairs from the downloaded OPUS corpus that contain those nouns. The deep learning models use each English noun from the Joker CLEF task 2 data set as the query and the corresponding French text as the context, which contains the translation of the queried English noun somewhere within it. The task for the models is then to locate the exact position of that French translation in the French text.
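
The retrieval idea described above can be sketched as a simple case-insensitive containment filter over parallel sentence pairs. This is only an illustration under assumptions: the function name `find_context_pairs` and the English sample sentences are made up for this example, and the actual conversion used by the notebook is described in its dedicated section.

```python
def find_context_pairs(en_noun, fr_noun, sentence_pairs):
    """Keep the (English, French) sentence pairs that mention both nouns, ignoring case."""
    en_key, fr_key = en_noun.lower(), fr_noun.lower()
    return [
        (en_sent, fr_sent)
        for en_sent, fr_sent in sentence_pairs
        if en_key in en_sent.lower() and fr_key in fr_sent.lower()
    ]


# Tiny illustrative corpus; the English sides are invented for this sketch.
corpus = [
    ("Asterix and Obelix should not leave the village anymore.",
     "Astérix et Obélix ne devraient plus quitter le village."),
    ("The village is quiet.", "Le village est calme."),
]
matches = find_context_pairs("Obelix", "Obélix", corpus)
print(len(matches))
```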


As an example, consider the record at index 4 below, where we have to learn the translation of the English noun **"Obelix"** into its French version **"Obélix"**.

| index | en      | fr |
| ----------- | ----------- | ----------- |
| 4  | Obelix      | Obélix |

To learn this translation, we will use the extractive Q/A style data set we developed, which has the following entry for this English/French noun pair.

| id | context | question | answers |
| :---   | :---:     |    :----:   | :---:  |
| 17  | astérix et obélix ne devraient plus quitter le village. | Obelix      | {"text": ["ob\u00e9lix"], "answer_start": [11]} |

In this way, the task of deep learning models is to learn to output the values of the answers column after processing the given context and question in its input.
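
To make the record layout concrete, the sketch below builds one such entry from a noun pair and a French sentence. It assumes that context, question, and answer are lower-cased and that `answer_start` is the character offset of the first occurrence of the French noun; the function name `build_qa_record` is hypothetical.

```python
import json


def build_qa_record(record_id, en_noun, fr_noun, fr_context):
    """Build an extractive Q/A record, or return None if the French noun is absent."""
    context = fr_context.lower()
    answer = fr_noun.lower()
    start = context.find(answer)  # character offset of the first occurrence
    if start == -1:
        return None
    return {
        "id": record_id,
        "context": context,
        "question": en_noun.lower(),
        # Stored as a JSON string, matching the "answers" column shown above.
        "answers": json.dumps({"text": [answer], "answer_start": [start]}),
    }


record = build_qa_record(
    17, "Obelix", "Obélix",
    "Astérix et Obélix ne devraient plus quitter le village.",
)
print(record["answers"])
```

Keeping `answers` as a JSON string lets the column survive a round trip through CSV, at the cost of a `json.loads` call wherever the answer is needed.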



For more details about how we converted all the English/French noun pairs provided in the training data set of Joker CLEF task 2 into the extractive Q/A style data set, please check the notebook section [Conversion of two-column based tabular data set into Extractive Q/A style data set](#joker-clef-task2-processing).

In [20]:
'''
 ## Exploring English/French noun pair given in the training data set.
'''
data_set_2.iloc[4]

id    noun_5
en    Obelix
fr    Obélix
Name: 4, dtype: object

In [21]:
'''
 ## Exploring the saved Q/A style contextual reference for each English noun in the training data set.
'''
col_list = ['id', 'context', 'question', 'answers']
train_dataset_path = "/dataset/JOKER/Task 2/train/joker_task2_train_context_aware_data_where_answers_exist.csv"
data_set_with_context_ref = pd.read_csv(base_path+train_dataset_path, usecols=col_list)
data_set_with_context_ref.iloc[17]

id                                                         17
context     astérix et obélix ne devraient plus quitter le...
question                                               obelix
answers       {"text": ["ob\u00e9lix"], "answer_start": [11]}
Name: 17, dtype: object

### PRE-PROCESSING THE PREPARED TRAINING DATA-SET TO LEARN TO LOCATE THE FRENCH TRANSLATION FOR THE GIVEN ENGLISH NOUN IN THE FORM OF QUERY IN THE FRENCH TEXT CONTEXT

First, we will split the prepared data set into training, validation, and test sets to evaluate and track our model's performance. The Joker CLEF task 2 organizers have provided a test data set, but its ground-truth answers are not available to us; we only submit predictions for the English nouns it contains. The additional test set created here therefore serves only to estimate model performance on unseen records and to compare different models.

Because we build this test set by holding out part of the actual training data, we also have the ground-truth answers for it.

In [22]:
'''
 ## Train, validation, and held-out test split
'''
col_list = ['id', 'context', 'question', 'answers']
train_dataset_path = "/dataset/JOKER/Task 2/train/joker_task2_train_context_aware_data_where_answers_exist.csv"
data_set_with_context_ref = pd.read_csv(base_path+train_dataset_path, usecols=col_list)
data_set_with_context_ref = shuffle(data_set_with_context_ref)
train_2, test_2 = train_test_split(data_set_with_context_ref, test_size=0.1, shuffle=True)
train_2, val_2 = train_test_split(train_2, test_size=0.2, shuffle=True)

'''
 ## Saving the split into the file system.
'''
save_split_path = "/dataset/JOKER/Task 2/train/train_val_split/"
train_2.to_csv(base_path+save_split_path+"train2.csv",columns =col_list)
val_2.to_csv(base_path+save_split_path+"val2.csv", columns=col_list)
print(len(train_2), len(val_2))

1563 391
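
The printed sizes follow from the two-stage split: 10% is held out first, and 20% of the remainder becomes the validation set, which leaves roughly 72% / 18% / 10% for train / validation / test. The sketch below reproduces the second stage arithmetically. It assumes the rounding rule scikit-learn applies to a fractional `test_size` (the held-out part is rounded up); the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the held-out part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# 1563 + 391 = 1954 records remain after the first (10%) hold-out.
n_train, n_val = split_sizes(1954, 0.2)
print(n_train, n_val)

# Overall proportions of the original data set after both stages.
train_frac = round(0.9 * 0.8, 2)
val_frac = round(0.9 * 0.2, 2)
print(train_frac, val_frac, 0.1)
```

Since `shuffle` and `train_test_split` are called without a fixed seed in the cell above, the membership of each split changes between runs even though the sizes stay the same.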


Next, we will load the train and validation CSVs into a `DatasetDict`, which the deep learning models will consume later.

In [23]:
## FEATURES FOR DATA-SET
ft = Features({
    'id': Value(dtype='int64', id=None),
    'context': Value(dtype='string', id=None),
    'question': Value(dtype='string', id=None),
    'answers': Value(dtype='string', id=None),
})

## DATA FILES FOR DATA-SET
data_files = {
    "train": base_path+save_split_path+"train2.csv", 
    "validation": base_path+save_split_path+"val2.csv"
}

## LOADING THE DATA FILES INTO DATA SET
dataset_2 = load_dataset('csv', data_files=data_files, features=ft)
dataset_2

Using custom data configuration default-25ec87105edd7413


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-25ec87105edd7413/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-25ec87105edd7413/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 1563
    })
    validation: Dataset({
        features: ['id', 'context', 'question', 'answers'],
        num_rows: 391
    })
})

Let's have a quick glance at the loaded data set and its contents.

In [24]:
'''
 ## Exploring the record at index 3 of the loaded data set
'''
index = 3
print("Context: ", dataset_2["train"][index]["context"])
print("Question: ", dataset_2["train"][index]["question"])
print("Answer: ", dataset_2["train"][index]["answers"])

Context:  j'en fais le vœu solennel, j'irai au pays des vikings et je ramènerai goudurix avant la prochaine pleine lune.
Question:  justforkix
Answer:  {"text": ["goudurix"], "answer_start": [70]}


When training deep learning models for extractive question answering, each record must have exactly one answer. We ensured this while transforming the [two-column-based tabular data set provided by the Joker CLEF task2 team into an extractive Q/A styled data set](#joker-clef-task2-processing), but we can re-verify it here with the following code snippet.

In [25]:
'''
 ## Verifying that no records in the data set have more than one possible answer or no available answer at all
'''
dataset_2["train"].filter(lambda x: len(json.loads(x["answers"])["text"]) != 1)

  0%|          | 0/2 [00:00<?, ?ba/s]

Dataset({
    features: ['id', 'context', 'question', 'answers'],
    num_rows: 0
})
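
A complementary check, sketched below as an assumption rather than a step taken from the notebook, is to confirm that each `answer_start` really points at the answer text inside its context. The function name `misaligned_answers` and the record with id 18 are hypothetical and serve only to show a failing case.

```python
import json


def misaligned_answers(rows):
    """Return the ids of rows whose single answer is missing or not found at answer_start."""
    bad_ids = []
    for row in rows:
        answers = json.loads(row["answers"])
        if len(answers["text"]) != 1 or len(answers["answer_start"]) != 1:
            bad_ids.append(row["id"])
            continue
        text = answers["text"][0]
        start = answers["answer_start"][0]
        if row["context"][start:start + len(text)] != text:
            bad_ids.append(row["id"])
    return bad_ids


context = "astérix et obélix ne devraient plus quitter le village."
rows = [
    {"id": 17, "context": context,
     "answers": json.dumps({"text": ["obélix"], "answer_start": [11]})},
    # Deliberately shifted offset to demonstrate a misaligned record.
    {"id": 18, "context": context,
     "answers": json.dumps({"text": ["obélix"], "answer_start": [12]})},
]
bad = misaligned_answers(rows)
print(bad)
```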

Next, we will tokenize the entire training data set so that our deep learning model can consume it.
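
Contexts longer than the maximum length are split into overlapping windows rather than cut off. The sketch below illustrates that idea on a plain list, assuming that the stride is the number of tokens shared by two consecutive windows; it is a simplified stand-in for the tokenizer's overflow handling (which also reserves room for the question and special tokens), and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into windows of at most `window` items that share `overlap` items."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


chunks = sliding_windows(list(range(10)), window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap makes it less likely that an answer is split across two windows without appearing whole in at least one of them.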

In [26]:
'''
 ## Defining the max length and stride length for tokenizing the data set.
'''
max_length = 256
stride = 64


'''
 ## The function takes an example training record as its input and returns its aligned tokenized form as its output.
 
 ## INPUT:
 ## Example record from the training data set
 
 ## OUTPUT:
 ## Aligned tokenized form of the input training record
'''
def preprocess_training_examples(examples, tokenizer):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = json.loads(answers[sample_idx])
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
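
The core of the function above is the conversion of a character span into token positions. The sketch below isolates that step on hand-written offsets so it can be followed without a tokenizer; the offsets and sequence ids are toy values invented for illustration, and `char_span_to_token_span` is a hypothetical helper that mirrors the loops above.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token), or (0, 0) if outside."""
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # The answer is not fully inside this window: label it (0, 0).
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token starting at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token ending at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy layout: <s> question </s> </s> "astérix" "et" "obélix" </s>
sequence_ids = [None, 0, None, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 6), (0, 0), (0, 0), (0, 7), (8, 10), (11, 17), (0, 0)]
span = char_span_to_token_span(offsets, sequence_ids, start_char=11, end_char=17)
print(span)
```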

We will define a similar kind of function for tokenizing and aligning the validation data, just as we have defined for the training data.

In [27]:
'''
 ## The function takes an example validation record as its input and returns its aligned tokenized form as its output.
 
 ## INPUT:
 ## Example record from the validation data set
 
 ## OUTPUT:
 ## Aligned tokenized form of the input validation record
'''
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs
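
The offset masking performed in the loop above can be shown in isolation. In this sketch the offsets and sequence ids are toy values, and `mask_non_context_offsets` is a hypothetical helper: every token that does not belong to the context (sequence id 1) loses its offsets, so that later steps can discard predictions that point at the question or at special tokens.

```python
def mask_non_context_offsets(offsets, sequence_ids):
    """Keep offsets only for context tokens (sequence id 1); replace the others with None."""
    return [o if sid == 1 else None for o, sid in zip(offsets, sequence_ids)]


# Toy layout: <s> question </s> </s> "astérix" "et" "obélix" </s>
sequence_ids = [None, 0, None, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 6), (0, 0), (0, 0), (0, 7), (8, 10), (11, 17), (0, 0)]
masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)
```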

Next, we will design a function that will be able to evaluate the model's prediction based on a given metric.

In [28]:
'''
 ## The function evaluates and scores the predictions emitted by the 
 ## deep learning model according to a given metric.
 
 ## INPUTS:
 ## START LOGITS: scores for each token being the start of the answer in the provided context
 ## END LOGITS: scores for each token being the end of the answer in the provided context
 ## FEATURES: tokenized data set developed earlier, from which offsets and example ids are read
 ## EXAMPLES: individual examples in the data set that need to be evaluated
 ## METRIC: metric object used to score the predictions against the reference answers
 ## N_BEST: number of best start/end candidates to consider
 ## MAX_ANSWER_LENGTH: maximum length (in tokens) allowed for a produced answer
 
 ## OUTPUTS:
 ## EVALUATED SCORES for the predictions, the predicted answers, and the reference answers
'''
def compute_metrics(start_logits, end_logits, features, examples, metric, n_best = 1, max_answer_length = 20):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)
      
    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if not offsets[start_index] or not offsets[end_index]:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue
                    
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": json.loads(ex["answers"])} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers), predicted_answers, theoretical_answers
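
The span selection inside `compute_metrics` can be illustrated on toy logits. The sketch below is a simplified, hypothetical helper (`best_span` is not part of the notebook): it assumes plain Python lists of logits and leaves out the offset check that the notebook uses to skip tokens outside the context.

```python
def best_span(start_logit, end_logit, n_best=2, max_answer_length=20):
    """Return (start, end, score) of the best valid span, or None if none is valid."""

    def top_indexes(logits):
        # Indexes of the n_best largest logits, highest first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best]

    best = None
    for start in top_indexes(start_logit):
        for end in top_indexes(end_logit):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = start_logit[start] + end_logit[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


toy_start = [0.1, 2.0, 0.3, 1.5]
toy_end = [0.2, 0.1, 3.0, 0.4]
print(best_span(toy_start, toy_end))
```

With the notebook's default of `n_best = 1`, only the single highest start and the single highest end are combined, so an example gets an empty prediction whenever that one pair is invalid.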

### PREPARING [CAMEMBERT](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) FOR FINE-TUNING ON THE DEVELOPED DATA SET FOR JOKER CLEF TASK-2

We selected the [CamemBERT model](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) from the Hugging Face Hub and fine-tune it, in extractive Q/A style, on the data set developed for Joker CLEF task 2. Later, we use the fine-tuned version of this model to generate translations for the English nouns provided in the test set of the Joker CLEF task 2 data set.

In [29]:
'''
 ## Loading the tokenizer for CamemBERT
'''
model_checkpoint = "etalab-ia/camembert-base-squadFR-fquad-piaf" if download_camem_bert_for_extractive_question_answering else "models/camembert-base-squadFR-fquad-piaf"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


'''
 ## Exploring the working of loaded tokenizer
'''
context = dataset_2["train"][6]["context"]
question = dataset_2["train"][6]["question"]

inputs = tokenizer(question, context, max_length=384, 
                   truncation="only_second", stride=128, 
                   return_overflowing_tokens=True, return_offsets_mapping=True,)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
    
inputs.keys()

<s> electric</s></s> gec alsthom est une entreprise commune néerlandaise créée par general electric company et alcatel alsthom cge.</s>


dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
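
The `stride` and `return_overflowing_tokens` arguments used above make the tokenizer split a long context into overlapping windows, where `stride` is the number of tokens shared by consecutive windows. A minimal sketch of that idea follows. It assumes a plain list of tokens and ignores the question and special tokens that the real tokenizer also has to fit into `max_length`; the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(context_tokens, max_context_len, stride):
    """Split tokens into windows; consecutive windows share `stride` tokens."""
    if stride >= max_context_len:
        raise ValueError("stride must be smaller than max_context_len")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + max_context_len])
        # Stop once the current window reaches the end of the context.
        if start + max_context_len >= len(context_tokens):
            break
        start += max_context_len - stride
    return windows


tokens = list(range(10))
for window in sliding_windows(tokens, max_context_len=4, stride=2):
    print(window)
```

Each window becomes one feature, which is why a tokenized data set can contain more features than the original data set has records.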

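Now we process each record of the training and validation data sets with the previously defined preprocess_training_examples and preprocess_validation_examples functions to generate the aligned, tokenized data sets. Because long contexts are split into several overlapping windows, the number of tokenized features can exceed the number of original records.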

In [30]:
'''
 ## Mapping each record of the training data set on the preprocess_training_examples function.
'''
train_dataset = dataset_2["train"].map(
    partial(preprocess_training_examples, tokenizer=tokenizer),
    batched=True,
    remove_columns=dataset_2["train"].column_names,)

len(dataset_2["train"]), len(train_dataset)

  0%|          | 0/2 [00:00<?, ?ba/s]

(1563, 1846)

In [31]:
'''
 ## Mapping each record of the validation data set on the preprocess_validation_examples function.
'''
validation_dataset = dataset_2["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset_2["validation"].column_names,)
len(dataset_2["validation"]), len(validation_dataset)

  0%|          | 0/1 [00:00<?, ?ba/s]

(391, 443)

We now define data loaders for the prepared training and validation data sets, and set the torch format on both of them.

In [32]:
'''
 ## Set torch format on training and validation data set. 
 ## Plus, removing un-used columns names from the validation data set.
'''
train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

'''
 ## Initializing data loaders for the developed train and validation data set. 
'''
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

Almost everything is ready at this point. Now we can load the [CamemBERT](https://huggingface.co/etalab-ia/camembert-base-squadFR-fquad-piaf) model and the AdamW optimizer.

In [33]:
'''
 ## Loading the CamemBERT model and AdamW optimizer
'''
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = AdamW(model.parameters(), lr=2e-5)

Downloading:   0%|          | 0.00/422M [00:00<?, ?B/s]

We use the Accelerator API from the Hugging Face Accelerate library to fine-tune our model on the developed data set. So, let's create an Accelerator object, then initialize the learning-rate scheduler and the other hyper-parameters needed to start the training.

In [34]:
'''
 ## Preparing the accelerator API for starting the training.
'''
accelerator = Accelerator(fp16=True)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

In [35]:
'''
 ## Initializing hyper-parameters and the schedulers
'''
num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,)
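
To make the effect of the "linear" schedule concrete, here is a small stand-alone sketch of the learning-rate curve it is expected to produce: a linear ramp during warmup (none here, since `num_warmup_steps=0`) followed by a linear decay to zero. The function `linear_lr` is a hypothetical illustration, not the library implementation.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# 231 steps corresponds to one epoch over 1846 features with batch size 8.
for step in (0, 115, 231):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=231))
```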

Finally, we define the training loop that fine-tunes the loaded model, evaluates it after each epoch, and saves it.

In [36]:
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/camembert-base-squadFR-fquad-piaf"

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]

    metrics, pridictions, validations = compute_metrics(
        start_logits, end_logits, validation_dataset, dataset_2["validation"],
        load_metric("squad"))
    
    print(f"epoch {epoch}:", metrics)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

100%|██████████| 231/231 [00:38<00:00,  6.35it/s]

Evaluation!



100%|██████████| 56/56 [00:02<00:00, 23.06it/s]


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]


100%|██████████| 391/391 [00:00<00:00, 1627.32it/s]


epoch 0: {'exact_match': 93.60613810741688, 'f1': 94.33077578857629}
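
For readers unfamiliar with the two reported scores, the sketch below shows how SQuAD-style exact match and token-level F1 can be computed for a single prediction. It is a simplified illustration under stated assumptions: the normalization (lower-casing, stripping punctuation and the English articles "a", "an", "the") follows the usual SQuAD convention, and the function names are hypothetical. The metric loaded in the notebook additionally averages over all examples and reports percentages.

```python
import collections
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lower-case, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(pred_tokens == ref_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Lumpaland.", "lumpaland"))
print(f1_score("general electric company", "general electric"))
```

F1 gives partial credit when the predicted span overlaps the reference, which is why it is never lower than exact match.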


### PREPARING [DistilBERT](https://huggingface.co/distilbert-base-cased-distilled-squad) FOR FINE-TUNING ON THE DEVELOPED DATA SET FOR JOKER CLEF TASK-2

We selected the [DistilBERT model](https://huggingface.co/distilbert-base-cased-distilled-squad) from the Hugging Face Hub and fine-tune it, in extractive Q/A style, on the data set developed for Joker CLEF task 2. Later, we use the fine-tuned version of this model to generate translations for the English nouns provided in the test set of the Joker CLEF task 2 data set.

In [37]:
'''
 ## Loading the tokenizer for DistilBERT
'''
model_checkpoint = "distilbert-base-cased-distilled-squad" if download_distil_bert_for_extractive_question_answering else "models/distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


'''
 ## Exploring the working of loaded tokenizer
'''
context = dataset_2["train"][6]["context"]
question = dataset_2["train"][6]["question"]

inputs = tokenizer(question, context, max_length=384, 
                   truncation="only_second", stride=128, 
                   return_overflowing_tokens=True, return_offsets_mapping=True,)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))
    
inputs.keys()

[CLS] electric [SEP] gec alsthom est une entreprise commune néerlandaise créée par general electric company et alcatel alsthom cge. [SEP]


dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

Now we process each record of the training and validation data sets with the previously defined preprocess_training_examples and preprocess_validation_examples functions to generate the aligned, tokenized data sets.

In [38]:
'''
 ## Mapping each record of the training data set on the preprocess_training_examples function.
'''
train_dataset = dataset_2["train"].map(
    partial(preprocess_training_examples, tokenizer=tokenizer),
    batched=True,
    remove_columns=dataset_2["train"].column_names,)

len(dataset_2["train"]), len(train_dataset)

  0%|          | 0/2 [00:00<?, ?ba/s]

(1563, 1827)

In [39]:
'''
 ## Mapping each record of the validation data set on the preprocess_validation_examples function.
'''
validation_dataset = dataset_2["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dataset_2["validation"].column_names,)
len(dataset_2["validation"]), len(validation_dataset)

  0%|          | 0/1 [00:00<?, ?ba/s]

(391, 436)

We now define data loaders for the prepared training and validation data sets, and set the torch format on both of them.

In [40]:
'''
 ## Set torch format on training and validation data set. 
 ## Plus, removing un-used columns names from the validation data set.
'''
train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

'''
 ## Initializing data loaders for the developed train and validation data set. 
'''
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)

Almost everything is ready at this point. Now we can load the [DistilBERT](https://huggingface.co/distilbert-base-cased-distilled-squad) model and the AdamW optimizer.

In [41]:
'''
 ## Loading the DistilBERT model and AdamW optimizer
'''
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = AdamW(model.parameters(), lr=2e-5)

Downloading:   0%|          | 0.00/249M [00:00<?, ?B/s]

We use the Accelerator API from the Hugging Face Accelerate library to fine-tune our model on the developed data set. So, let's create an Accelerator object, then initialize the learning-rate scheduler and the other hyper-parameters needed to start the training.

In [42]:
'''
 ## Preparing the accelerator API for starting the training.
'''
accelerator = Accelerator(fp16=True)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(model, optimizer, train_dataloader, eval_dataloader)

In [43]:
'''
 ## Initializing hyper-parameters and the schedulers
'''
num_train_epochs = 1
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,)

Finally, we define the training loop that fine-tunes the loaded model, evaluates it after each epoch, and saves it.

In [44]:
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/distilbert-base-cased-distilled-squad"

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]

    metrics, pridictions, validations = compute_metrics(
        start_logits, end_logits, validation_dataset, dataset_2["validation"],
        load_metric("squad"))
    
    print(f"epoch {epoch}:", metrics)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)


100%|██████████| 231/231 [01:31<00:00,  2.52it/s]


Evaluation!


100%|██████████| 55/55 [00:01<00:00, 44.36it/s]
100%|██████████| 391/391 [00:00<00:00, 1690.68it/s]


epoch 0: {'exact_match': 97.9539641943734, 'f1': 98.05626598465473}


### COMPARING PERFORMANCE OF FINE-TUNED MODELS 

In [45]:
'''
 ## Loading the fine-tuned CamemBERT model
'''
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/camembert-base-squadFR-fquad-piaf"
FT_model_camemBERT = pipeline(task= "question-answering", model= output_dir, tokenizer = output_dir)

'''
 ## Loading the fine-tuned DistilBERT model
'''
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/distilbert-base-cased-distilled-squad"
FT_model_distilBERT = pipeline(task= "question-answering", model= output_dir, tokenizer = output_dir)

perfect_matches = 0
camemBertPredictions = []
camemBertActualLabels = []

for index, row in tqdm(test_2.iterrows()):
    context = row.context.lower().strip()
    question = row.question.lower().strip()
    prediction = FT_model_camemBERT(question=question, context=context)
    
    camemBertPredictions.append(prediction["answer"])
    camemBertActualLabels.append(json.loads(row.answers)['text'][0])
    
    if (prediction["answer"] == json.loads(row.answers)['text'][0]):
        perfect_matches +=1
print("CamemBERT PERFECT MATCHES: ", perfect_matches, "OUT OF TOTAL RECORDS: ", len(test_2))


perfect_matches = 0
distilBertPredictions = []
distilBertActualLabels = []
for index, row in tqdm(test_2.iterrows()):
    context = row.context.lower().strip()
    question = row.question.lower().strip()
    prediction = FT_model_distilBERT(question=question, context=context)
    
    distilBertPredictions.append(prediction["answer"])
    distilBertActualLabels.append(json.loads(row.answers)['text'][0])
    
    if (prediction["answer"] == json.loads(row.answers)['text'][0]):
        perfect_matches +=1

print("DistilBERT PERFECT MATCHES: ", perfect_matches, "OUT OF TOTAL RECORDS: ", len(test_2))

218it [01:12,  3.01it/s]


CamemBERT PERFECT MATCHES:  96 OUT OF TOTAL RECORDS:  218


218it [00:38,  5.66it/s]

DistilBERT PERFECT MATCHES:  209 OUT OF TOTAL RECORDS:  218
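
The two evaluation loops above differ only in the pipeline they call, so the counting logic can be factored into one helper. The sketch below is a hypothetical refactoring: `count_exact_matches` and the stand-in `toy_qa` are illustrative names, and the rows are plain dictionaries rather than data frame rows. Any callable that accepts `question` and `context` and returns a dictionary with an `"answer"` key can be passed in.

```python
import json


def count_exact_matches(qa_fn, rows):
    """Count rows whose predicted answer equals the first gold answer exactly."""
    matches = 0
    for row in rows:
        context = row["context"].lower().strip()
        question = row["question"].lower().strip()
        gold = json.loads(row["answers"])["text"][0]
        prediction = qa_fn(question=question, context=context)
        if prediction["answer"] == gold:
            matches += 1
    return matches


def toy_qa(question, context):
    # Hypothetical stand-in for a fine-tuned pipeline: answers with the first word.
    return {"answer": context.split()[0]}


rows = [
    {"context": "Lumpaland est loin", "question": "Loompaland",
     "answers": json.dumps({"text": ["lumpaland"], "answer_start": [0]})},
    {"context": "Bonjour le monde", "question": "world",
     "answers": json.dumps({"text": ["monde"], "answer_start": [11]})},
]
print(count_exact_matches(toy_qa, rows), "of", len(rows))
```

Note that this comparison, like the one in the notebook, is strict string equality, so it is sensitive to case and surrounding whitespace, unlike the normalized exact-match score reported during validation.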





### EXECUTING FINE-TUNED MODELS TO GENERATE TRANSLATIONS FOR ENGLISH NOUNS FROM THE TEST DATA-SET

Lastly, we applied the same approach to transform the English nouns in the test data set provided by the Joker CLEF task 2 team into an extractive Q/A styled data set, using the OPUS English/French parallel corpus. The only difference is that this time there is no answers column. So, during this conversion, we only checked that the English noun occurs in the English side of a sentence pair to filter out the relevant pairs. We assume that the French side of each extracted pair contains the translation of the English noun found in the English side.

As an example, consider the following record from the test data set, where we have to predict the French translation of the English noun **"Loompaland"**.

| id | en      | 
| ----------- | ----------- | 
| 18  | Loompaland      | 

To predict this translation, we use the developed extractive Q/A styled data set, which has the following entry for this English noun.

| id | context | question |  
| :---   | :---:     |    :----:   | 
| 18  | j'étais venu à lumpaland pour chercher de nouvelles saveurs. | loompaland |

In this way, the task of the fine-tuned deep learning model is to output the location of the answer in the given context. In the example above, the fine-tuned model has to identify the word **lumpaland** as the answer.

For more details on how we converted all the English nouns provided in the Joker CLEF task 2 test data set into the extractive Q/A style data set, please see the notebook section [Transforming two-column-based tabular data set for English/French noun pairs into Extractive Q/A styled data set](#joker-clef-task2-processing).

In [46]:
'''
 ## Exploring the mentioned record from the example above in the test data set for Joker CLEF task2
'''
test_data_set_2.iloc[18]

id     noun_1179
en    Loompaland
Name: 18, dtype: object

In [47]:
'''
 ## Exploring the corresponding Q/A styled entry for the record mentioned in the example above
'''
cols = ["id", "context", "question"]
test_data_2 = pd.read_csv(base_path+
                          "/dataset/JOKER/Task 2/test/joker_task2_test_context_aware_data_where_answers_exist.csv", 
                          usecols = cols)
test_data_2.iloc[18]

id                                                         18
context     j'étais venu à lumpaland pour chercher de nouv...
question                                           loompaland
Name: 18, dtype: object

Finally, we use the fine-tuned models to generate predictions for all the English nouns in the test data set that were successfully converted into the Q/A styled data set.

In [48]:
'''
 ## Loading the Q/A styled transformed data set for the English nouns provided in the test data set 
 ## of Joker CLEF task2
'''
cols = ["id", "context", "question"]
test_data_2 = pd.read_csv(base_path+
                          "/dataset/JOKER/Task 2/test/joker_task2_test_context_aware_data_where_answers_exist.csv", 
                          usecols = cols)

'''
 ## Loading the fine-tuned CamemBERT model
'''
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/camembert-base-squadFR-fquad-piaf"
FT_model = pipeline(task= "question-answering", model= output_dir, tokenizer = output_dir)

'''
 ## Generating translations
'''
predictions=[]
for index, row in tqdm(test_data_2.iterrows()):
    context = row.context.lower().strip()
    question = row.question.lower().strip()
    prediction = FT_model(question=question, context=context)
    predictions.append({"MANUAL":0,
                       "id":row.id,
                       "en":row.question,
                       "fr":prediction["answer"]})

'''
 ## Saving the generated translation
'''    
with open(base_path+"/dataset/JOKER/Task 2/test/CAMEM-solutions_joker_task2_test_context_aware_data_where_answers_exist.json", "w") as outfile:
    outfile.write(json.dumps(predictions, indent = 4))

6176it [20:59,  4.90it/s]


In [49]:
'''
 ## Loading the Q/A styled transformed data set for the English nouns provided in the test data set 
 ## of Joker CLEF task2
'''
cols = ["id", "context", "question"]
test_data_2 = pd.read_csv(base_path+
                          "/dataset/JOKER/Task 2/test/joker_task2_test_context_aware_data_where_answers_exist.csv", 
                          usecols = cols)

'''
 ## Loading the fine-tuned DistilBERT model
'''
output_dir = base_path+"/dataset/JOKER/Task 2/train/FT_Model/distilbert-base-cased-distilled-squad"
FT_model = pipeline(task= "question-answering", model= output_dir, tokenizer = output_dir)

'''
 ## Generating translations
'''
predictions=[]
for index, row in tqdm(test_data_2.iterrows()):
    context = row.context.lower().strip()
    question = row.question.lower().strip()
    prediction = FT_model(question=question, context=context)
    predictions.append({"MANUAL":0,
                       "id":row.id,
                       "en":row.question,
                       "fr":prediction["answer"]})

'''
 ## Saving the generated translation
'''    
with open(base_path+"/dataset/JOKER/Task 2/test/DISTIL-solutions_joker_task2_test_context_aware_data_where_answers_exist.json", "w") as outfile:
    outfile.write(json.dumps(predictions, indent = 4))

6176it [15:13,  6.76it/s]


In [56]:
## loading the translations generated by the fine-tuned DistilBERT model
path = "/dataset/JOKER/Task 2/test/DISTIL-solutions_joker_task2_test_context_aware_data_where_answers_exist.json"
with open(base_path+path) as file:
    solutions = json.load(file)

path = "dataset/JOKER/Task 2/test/joker_task2_test.csv"
test_data_set_2 = pd.read_csv(base_path+path)

task_subimission_2 = []
run_id = "non-disclosable-numeric-UUID"
is_manual = 0

for index, row in tqdm(test_data_set_2.iterrows()):
    prediction = "NOT FOUND"
    for sol in solutions:
      if(row.en == sol['en']):
        prediction = sol['fr']
        break

    predictions_dict = {}
    predictions_dict["RUN_ID"] = run_id
    predictions_dict["MANUAL"] = is_manual
    predictions_dict["id"] = row.id
    predictions_dict["en"] = row.en
    predictions_dict["fr"] = prediction
    task_subimission_2.append(predictions_dict)

with open(base_path + "dataset/JOKER/Task 2/test/[CLEF TASK <2>] "+run_id+".json", 'w') as fp:
    ## serializing the list once; dumping an already serialized string would write one quoted JSON string
    json.dump(task_subimission_2, fp, indent = 4)

284it [00:12, 22.78it/s]
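
The nested loop above scans every solution for every test noun and compares the strings case-sensitively. A possible alternative is to build a dictionary once and look each noun up in it. The sketch below is an assumption-laden illustration, not the submitted method: the helper names are hypothetical, and lower-casing both sides is an added assumption that would change the results whenever the stored questions are lower-cased while the test nouns are capitalized (as with "loompaland" and "Loompaland" in the earlier example).

```python
def build_lookup(solutions):
    """Map each lower-cased English noun to its first predicted translation."""
    lookup = {}
    for sol in solutions:
        # setdefault keeps the first occurrence, like the loop with `break`.
        lookup.setdefault(sol["en"].lower().strip(), sol["fr"])
    return lookup


def translate(noun, lookup, default="NOT FOUND"):
    return lookup.get(noun.lower().strip(), default)


toy_solutions = [
    {"en": "loompaland", "fr": "lumpaland"},
    {"en": "loompaland", "fr": "autre"},
]
toy_lookup = build_lookup(toy_solutions)
print(translate("Loompaland", toy_lookup))
print(translate("Pikachu", toy_lookup))
```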


## Transforming two-column-based tabular data set for English/French noun pairs into Extractive Q/A styled data set

We map JOKER CLEF task 2 to extractive question answering by using English/French parallel sentence pairs (subtitles, books, and wiki) from the [OPUS open-source parallel corpus](https://opus.nlpl.eu/), where each pair must contain at least one English noun in the English version and the translation of that noun in the French version. Training is then straightforward: we treat the English noun as the question and the French text containing its translation as the context, and the task is to extract the translation of the given English noun from the French text. Below is the script that generates a simple CSV file from the TMX version of the English/French parallel corpus built from TED talks, downloaded from the [OPUS open-source parallel corpus](https://opus.nlpl.eu/).

<a id='joker-clef-task2-processing'></a>

We can repeat this process for every English/French parallel corpus available at OPUS. In this notebook, we have only used the English/French parallel corpora listed below.

 - BOOKS
 - OPEN-SUB 
 - WIKIPEDIA
 - WIKIMEDIA
 - TED
 
 

In [None]:
'''
 ## Conversion of English/French TED talks parallel corpus from tmx to csv
'''

## Opening tmx file
d_name = "en-fr_ted"
store_tmx_name = d_name+".tmx"
store_csv_name = d_name+".csv"
store_json_name = d_name+".json"
path = "dataset/JOKER/" + store_tmx_name
file_name = base_path + path
with open(file_name, 'rb') as fin:
    tmx_file = tmxfile(fin, 'en', 'fr')

## collecting the English (source) and French (target) side of every translation unit
en = []
fr = []
for node in tqdm(tmx_file.unit_iter()):
    en.append(node.source)
    fr.append(node.target)

## building the data frame from the two aligned lists and saving it into csv file
df = pd.DataFrame({"en": en, "fr": fr})
df.to_csv(base_path + "dataset/JOKER/csv/" + store_csv_name)



Note that the complete English/French parallel corpus is not useful to us: to build the contexts, we only need the French text of the sentence pairs whose English side mentions an English noun from the Joker-CLEF task-2 train/test data set. The problem can then be cast as extractive question answering, where the English noun serves as the query and the task is to extract its translation from the French version of the text.

To implement this approach, we extract a subset of the parallel corpus in which the English text contains at least one English noun from the Joker-CLEF task-2 data set. For the test data set, we assume that the corresponding French text always contains the translation of that noun. To build this subset, we iterate through all the English nouns given in the training data set by the JOKER-CLEF task-2 team and search for them word by word in the English text of the parallel corpus. We then store the matching pairs in another data frame and use them in our training.

To make the process of searching case-insensitive, we must first lower the English version text in the English/French parallel corpus and save it in a new CSV file on the disk.

In [None]:
'''
 ## Converting text present in the English Column into lower case and saving the result in a new processed CSV file
'''

## lower-casing the English side so that the later noun search is case-insensitive
listed = df["en"].tolist()
pr_en = [n.lower() for n in listed]
faster_search_df = pd.DataFrame({"en": pr_en, "fr": df["fr"].tolist()})
faster_search_df.to_csv(base_path + "dataset/JOKER/csv/en_processed/" + store_csv_name)
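
The noun search described above can be sketched on a toy corpus. This is a minimal illustration, assuming whole-word matching with a regular expression and an in-memory list of `(english, french)` pairs; the function name `find_pairs` and the sample sentences are hypothetical, and the actual filtering used to build the data set may differ.

```python
import re


def find_pairs(nouns, sentence_pairs):
    """Return (noun, french_text) for pairs whose English side contains the noun."""
    matches = []
    for noun in nouns:
        # Whole-word, case-insensitive match; re.escape protects special characters.
        pattern = re.compile(r"(?<!\w)" + re.escape(noun.lower()) + r"(?!\w)")
        for en_text, fr_text in sentence_pairs:
            if pattern.search(en_text.lower()):
                matches.append((noun.lower(), fr_text))
    return matches


toy_pairs = [
    ("I came to Loompaland for new flavours.", "J'étais venu à Lumpaland."),
    ("The electric car is new.", "La voiture électrique est neuve."),
    ("Electricity is useful.", "L'électricité est utile."),
]
for question, context in find_pairs(["Loompaland", "electric"], toy_pairs):
    print(question, "->", context)
```

Matching whole words keeps "electric" from selecting the sentence about "Electricity", which would otherwise add a context that may not contain the expected translation.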

In the EDA, we also noticed that both the train and test data sets of JOKER CLEF task 2 contain many English nouns from [Pokemon](https://pokemondb.net/pokedex/all) and the [How to Train Your Dragon series](https://howtotrainyourdragon.fandom.com/wiki/A_Hero%27s_Guide_to_Deadly_Dragons). Below, we load a list of Pokemon names and a list of nouns from the "How to Train Your Dragon" series to count how many records of the train/test data sets provided by the JOKER CLEF task 2 team come from these lists.

 FINDING HOW MANY RECORDS OF THE TRAINING DATA SET COME FROM THE POKEMON DATA SET

In [None]:
'''
 ## Counting the number of records from the training dataset that are Pokemon names.
'''

## loading the pokemon names json file
path = "/dataset/JOKER/task2_pokemon_en.json"
with open(base_path+path) as file:
    pokemonVocab = json.load(file)

## counting the number of English nouns present in the loaded json file of Pokemon names
count = 0
for index, row in data_set_2.iterrows():
    if (row.en in pokemonVocab):
        count = count+1
        
print("Count Of Pokemon Records In Train=",count, ", Total Records=", len(data_set_2.en), ", Percent=", count/len(data_set_2.en)*100)

Count Of Pokemon Records In Train= 520 , Total Records= 1164 , Percent= 44.67353951890035


FINDING FOR HOW MANY TEST RECORDS THE POKEMON DATA SET CAN PROVIDE TRANSLATIONS

In [None]:
'''
 ## Counting the number of records from the test dataset that are Pokemon names.
'''

## loading the test data set
path = "dataset/JOKER/Task 2/test/joker_task2_test.csv"
test_data_set_2 = pd.read_csv(base_path+path)

## counting the number of English nouns from the test data set present in the loaded json file of Pokemon names
count = 0
for index, row in test_data_set_2.iterrows():
    if (row.en in pokemonVocab):
        count = count+1
        
print("Count Of Pokemon Records In Test=",count, ", Total Records=", len(test_data_set_2.en), ", Percent=", count/len(test_data_set_2.en)*100)

Count Of Pokemon Records In Test= 140 , Total Records= 284 , Percent= 49.29577464788733


FINDING HOW MANY RECORDS OF THE TRAINING DATA SET COME FROM THE DRAGON DATA SET

In [None]:
'''
 ## Counting the number of English nouns from the training dataset that belong to the "How to train your dragon" series.
'''

## loading the CSV file containing nouns from the "How to train your dragon" series. 
path = "dataset/JOKER/dragons.csv"
dragon_data_set = pd.read_csv(base_path+path, names=["en", "fr"])


## counting the number of English nouns from the train data set present in the loaded CSV file
dragon_dictionary_train = {}
listed = np.array([ n.lower() for n in dragon_data_set.en])
f = np.frompyfunc(lambda x,w: w in x, 2,1)
for index, row in tqdm(data_set_2.iterrows()):
    word = row.en.lower()
    finds = f(listed, word)
    dragon_dictionary_train[word] = finds.sum()

count=0
for key in dragon_dictionary_train:
    if(dragon_dictionary_train[key]>0):
        count = count+1
        
print("Count Of Dragon Records In Train=",count, ", Total Records=", len(data_set_2.en), ", Percent=", count/len(data_set_2.en)*100)



0it [00:00, ?it/s][A
176it [00:00, 1754.79it/s][A
369it [00:00, 1857.20it/s][A
584it [00:00, 1989.46it/s][A
787it [00:00, 2004.06it/s][A
1164it [00:00, 1987.98it/s][A

Count Of Dragon Records In Train= 161 , Total Records= 1164 , Percent= 13.831615120274915
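
Note that the dragon lookup is a substring test, not an exact match: a noun counts as found when it occurs anywhere inside an entry. The sketch below illustrates this on a toy array, assuming lower-cased entries as in the cell above; the entries and the helper `count_matches` are hypothetical.

```python
import numpy as np

# Hypothetical dragon-series entries, lower-cased as in the cell above.
listed = np.array(["night fury", "monstrous nightmare", "gronckle"])

# Element-wise "is w a substring of x?" test, the same construction as above.
contains = np.frompyfunc(lambda x, w: w in x, 2, 1)

def count_matches(word):
    """Number of entries in `listed` that contain `word` as a substring."""
    return int(contains(listed, word).sum())

print(count_matches("night"))      # also matches inside "nightmare"
print(count_matches("fury"))
print(count_matches("toothless"))
```

Because short nouns can match inside longer, unrelated words, the counts reported here are best read as an upper bound on the true number of dragon-series nouns.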





FINDING FROM DRAGON DATASET FOR HOW MANY TESTING RECORDS WE CAN GENERATE TRANSLATIONS

In [None]:
'''
 ## Counting the number of English nouns from the test dataset that belong to the "How to train your dragon" series.
'''

## loading the CSV file containing nouns from the "How to train your dragon" series. 
path = "dataset/JOKER/dragons.csv"
dragon_data_set = pd.read_csv(base_path+path, names=["en", "fr"])

## counting the number of English nouns from the test data set present in the loaded CSV file
dragon_dictionary_test = {}
listed = np.array([ n.lower() for n in dragon_data_set.en])
f = np.frompyfunc(lambda x,w: w in x, 2,1)
for index, row in tqdm(test_data_set_2.iterrows()):
    word = row.en.lower()
    finds = f(listed, word)
    dragon_dictionary_test[word] = finds.sum()

count=0
for key in dragon_dictionary_test:
    if(dragon_dictionary_test[key]>0):
        count = count+1
        
print("Count Of Dragon Records In Test=",count, ", Total Records=", len(test_data_set_2.en), ", Percent=", count/len(test_data_set_2.en)*100)


0it [00:00, ?it/s][A
284it [00:00, 1865.66it/s][A

Count Of Dragon Records In Test= 32 , Total Records= 284 , Percent= 11.267605633802818





From these counts, about 58.5% of the English nouns in the training data set for JOKER CLEF Task 2 (520 + 161 of 1164) belong to the Pokemon or How to Train Your Dragon series. Likewise, about 60% of the English nouns in the test data set belong to the same two series.

So, for the remaining English nouns, which do not belong to either series, we have to search for context in the English/French parallel corpus downloaded from the OPUS data set.
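
The combined shares can be recomputed directly from the counts printed by the cells above. This is a quick check, assuming the Pokemon and dragon matches do not overlap so that the two counts can simply be added; the helper `coverage` is hypothetical.

```python
# Counts copied from the cell outputs above.
train_total, train_pokemon, train_dragon = 1164, 520, 161
test_total, test_pokemon, test_dragon = 284, 140, 32

def coverage(pokemon, dragon, total):
    """Percentage of records matched by either resource (simple sum of both counts)."""
    return (pokemon + dragon) / total * 100

train_cov = coverage(train_pokemon, train_dragon, train_total)
test_cov = coverage(test_pokemon, test_dragon, test_total)

print(f"train coverage: {train_cov:.2f}%")
print(f"test coverage:  {test_cov:.2f}%")
```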

In [None]:
'''
 ## Calculation of the percentage of records from the training data set that do not belong
 ## to either the pokemon names or the nouns from the How to train your dragon series.
'''

filter_unresolved_train_records = []
for index, row in data_set_2.iterrows():
    if not ((row.en in pokemonVocab) or (dragon_dictionary_train[row.en.lower()] > 0)):
        filter_unresolved_train_records.append(row.en)
        
print("Count Of Records In Training that does not belong to either Pokemon or How to train your dragon series=",len(filter_unresolved_train_records), ", Total Records=", len(data_set_2.en),", Percent=", len(filter_unresolved_train_records)/len(data_set_2.en)*100)


Count Of Records In Training that does not belong to either Pokemon or How to train your dragon series= 483 , Total Records= 1164 , Percent= 41.49484536082475


In [None]:
'''
 ## Calculation of the percentage of records from the test data set that do not belong
 ## to either the pokemon names or the nouns from the How to train your dragon series.
'''

filter_unresolved_test_records = []
for index, row in test_data_set_2.iterrows():
    if not ((row.en in pokemonVocab) or (dragon_dictionary_test[row.en.lower()] > 0)):
        filter_unresolved_test_records.append(row.en)

print("Count Of Records In Testing that does not belong to either Pokemon or How to train your dragon series=",len(filter_unresolved_test_records), ", Total Records=", len(test_data_set_2.en),", Percent=", len(filter_unresolved_test_records)/len(test_data_set_2.en)*100)


Count Of Records In Testing that does not belong to either Pokemon or How to train your dragon series= 113 , Total Records= 284 , Percent= 39.7887323943662


This analysis shows that we still need to find context for about 41.5% of the English nouns in the training data set of JOKER CLEF Task 2, and for about 40% of the English nouns in the test data set provided by the task organizers.

Below, we lay out the counts side by side as a check that our calculations are correct.

VERIFICATION THAT THE MATH IS CORRECT:

    TOTAL VOCAB COUNT IN TRAIN = 1164
    ---
    TOTAL TRAIN VOCAB COUNT WHOSE CONTEXTUAL REF. FOUND IN POKEMON DATASET    = 520 +
    TOTAL TRAIN VOCAB COUNT WHOSE CONTEXTUAL REF. FOUND IN DRAGON DATASET     = 161 +
    TOTAL TRAIN VOCAB COUNT WHOSE CONTEXTUAL REF. NOT FOUND IN ABOVE DATASETS = 483 = 1164


    TOTAL VOCAB COUNT IN TEST = 284
    ---
    TOTAL TEST VOCAB COUNT WHOSE CONTEXTUAL REF. FOUND IN POKEMON DATASET    = 140 +
    TOTAL TEST VOCAB COUNT WHOSE CONTEXTUAL REF. FOUND IN DRAGON DATASET     = 32 +
    TOTAL TEST VOCAB COUNT WHOSE CONTEXTUAL REF. NOT FOUND IN ABOVE DATASETS = 113 = 285
    ---
    285 rather than 284 because the test_data_set_2 records contain one duplicate VOCAB, "Cold 45"
                    
    

In [None]:
print(set([x for x in filter_unresolved_test_records if filter_unresolved_test_records.count(x) > 1]))
print("Count of record \'Cold 45\' appearance in the test data set:", (test_data_set_2['en'] == "Cold 45").sum())

{'Cold 45'}
Count of record 'Cold 45' appearance in the test data set: 2
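
Duplicated terms can also be listed in a single pass with pandas, which avoids calling `list.count` once per element. A minimal sketch on toy data; `test_df` is a hypothetical stand-in for `test_data_set_2`.

```python
import pandas as pd

# Toy test set with one repeated term (hypothetical rows).
test_df = pd.DataFrame({"en": ["Cold 45", "Snorlax", "Cold 45", "Gronckle"]})

# Occurrence count per term; keep only the terms that appear more than once.
counts = test_df["en"].value_counts()
duplicates = counts[counts > 1]

print(duplicates.to_dict())
```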


Now we implement our approach: for each English noun in the test and training data sets, we search the parallel corpus and save every pair in which the noun is found. The result is a subset of the downloaded parallel corpora in which the English text of each pair contains at least one of the English nouns.

Please note that running the following cells to completion takes some time, depending on the configuration of the system where you run this notebook. The parallel corpora downloaded from OPUS are huge, so searching for each English noun in the English text of every record pair is slow.
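
The search described above can be sketched on a toy corpus as follows. This is an illustration under stated assumptions, not the notebook's exact implementation: `collect_context_pairs` is a hypothetical helper, the corpus is assumed to have `en` and `fr` columns with lower-cased text, and matching is a literal substring test.

```python
import pandas as pd

def collect_context_pairs(nouns, corpus):
    """Return the rows of `corpus` whose English text contains any of `nouns`."""
    frames = []
    for noun in nouns:
        # Literal (non-regex) substring search over the English column.
        mask = corpus["en"].str.contains(noun, regex=False)
        if mask.any():
            frames.append(corpus[mask])
    if not frames:
        # No noun matched: return an empty frame with the same columns.
        return corpus.iloc[0:0]
    return pd.concat(frames, ignore_index=True)

corpus = pd.DataFrame({
    "en": ["the cat sleeps", "a dragon flies", "the cat and the dragon"],
    "fr": ["le chat dort", "un dragon vole", "le chat et le dragon"],
})
pairs = collect_context_pairs(["cat", "dragon"], corpus)
print(len(pairs))
```

A pair that contains several of the nouns is collected once per matching noun, so the resulting subset can contain repeated rows; `drop_duplicates()` could be applied afterwards if that is not wanted.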

BUILDING CONTEXTUAL REFERENCE DATA FOR PROCESSING TESTING QUERIES

In [None]:
'''
 ## Extraction of the sub-set of English/French pairs from the downloaded parallel corpus
 ## whose English text contains at least one English noun from the test set.
'''
en = []
fr = []
store_csv_name = "c_ref_test.csv"
df_contextual_refrences = pd.DataFrame(list(zip([],[])), columns = ["en", "fr"])

## Loading POKEMON DATASET
with open(base_path+"/dataset/JOKER/task2_pokemon_en.json") as file:
    pokemonENVocab = json.load(file)
with open(base_path+"/dataset/JOKER/task2_pokemon_fr.json") as file:
    pokemonFrVocab = json.load(file)

## Loading DRAGON DATASET
dragon_data_set = pd.read_csv(base_path+"dataset/JOKER/dragons.csv", names=["en", "fr"])

## Loading names of downloaded corpus from the OPUS data set
processed_csv_file_names = ["opensub","books","bookshop", "wikipedia", "wiki", "ted"]
csv_file_names_head = "en-fr_"

f = np.frompyfunc(lambda x,w: w in x, 2,1)

## Extracting English/French pairs from the pokemon names
for index, row in tqdm(test_data_set_2.iterrows()):
    if(row.en in pokemonENVocab):
        en.append(row.en.lower().strip())
        fr.append(pokemonFrVocab[pokemonENVocab.index(row.en)].lower().strip())

## Extracting English/French pairs from the nouns of How to train your dragon series
listed = np.array([ n.lower() for n in dragon_data_set.en])
for index, row in tqdm(test_data_set_2.iterrows()):
    word = row.en.lower()
    finds = f(listed, word)
    if(finds.sum()>0):
        en+=dragon_data_set[finds].en.str.lower().str.strip().tolist()
        fr+=dragon_data_set[finds].fr.str.lower().str.strip().tolist()

## Extracting English/French pairs from the downloaded OPUS corpus        
check_pr = [word.lower().strip() for word in test_data_set_2.en.tolist()]
for csv_file in processed_csv_file_names:
    faster_search_df = pd.read_csv(base_path+"dataset/JOKER/csv/en_processed/"+csv_file_names_head+csv_file+".csv")
    faster_search_df = faster_search_df.dropna()
    listed = faster_search_df['en'].to_numpy()
    for word in tqdm(check_pr):
        finds = f(listed, word)
        if(finds.sum()>0):
            en+=faster_search_df[finds].en.str.lower().str.strip().tolist()
            fr+=faster_search_df[finds].fr.str.lower().str.strip().tolist()

## Saving the extracted pairs, then reporting completion
df_contextual_refrences.en = en
df_contextual_refrences.fr = fr
df_contextual_refrences.to_csv(base_path + "dataset/JOKER/csv/context/" + store_csv_name)
print("c_ref_test is ready")


c_ref_test is ready


In [None]:
'''
 ## Exploring the saved sub-set of English/French pairs that contain at least
 ## one English noun from the test set in the English version of the text
'''
store_csv_name = "c_ref_test.csv"
df_contextual_refrences = pd.read_csv(base_path + "dataset/JOKER/csv/context/" + store_csv_name, names=["en", "fr"])
df_contextual_refrences.tail(5)

Unnamed: 0,en,fr
12743.0,"well, seven years passed, we were sold to a he...","» sept années ont passé, nous avons été vendus..."
12744.0,"in those days, nobody had bank accounts, or am...","a l'époque, personne n'avait de compte bancair..."
12745.0,"in one study, we asked college soccer players ...","pour une étude, nous avons demandé à des footb..."
12746.0,(laughter) i play sports at a four-year-old le...,et moi non plus. (rrires) mon niveau sportif e...
12747.0,"so for some reason, this was what i wanted to ...","pour une raison ou une autre, c'est ce que je ..."
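
One detail worth checking when a saved file is read back: `to_csv` writes the index and a header line by default, so passing `names=[...]` to `read_csv` makes pandas treat the header line as an ordinary data row. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a file; it illustrates pandas behaviour and is not a statement about the contents of `c_ref_test.csv`.

```python
import io
import pandas as pd

pairs = pd.DataFrame({"en": ["a cat", "a dog"], "fr": ["un chat", "un chien"]})

# to_csv writes the index and a header line by default.
buffer = io.StringIO()
pairs.to_csv(buffer)
csv_text = buffer.getvalue()

# With names=..., every line (including the header) is read as data,
# and the extra leading column becomes the index.
with_names = pd.read_csv(io.StringIO(csv_text), names=["en", "fr"])

# With index_col=0, the header is consumed and the saved index is restored.
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(len(with_names), len(with_index_col))
```

If the extra row is not wanted, reading with `index_col=0` (or saving with `index=False`) is one way to avoid it.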


Next, we repeat the same approach for the English nouns of the training data set, creating a subset of the downloaded parallel corpora in which the English text of each pair contains at least one English noun from the training data set.

Before that, we drop the records whose English value is "official" because, as we saw earlier in the EDA, their corresponding translations are invalid.
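
More generally, invalid translations of this kind can be flagged by their form instead of by a specific English value. The following is a minimal sketch on toy data, under the assumption (made only for this illustration) that a translation consisting solely of digits is invalid; `train_df` is a hypothetical stand-in for `data_set_2`.

```python
import pandas as pd

# Toy training rows (hypothetical); two of them carry a numeric "translation".
train_df = pd.DataFrame({
    "en": ["Snorlax", "official", "Toothless", "official"],
    "fr": ["Ronflex", "2", "Krokmou", "1"],
})

# Assumption for this sketch: a translation made only of digits is invalid.
invalid = train_df["fr"].astype(str).str.fullmatch(r"\d+")
cleaned = train_df[~invalid].reset_index(drop=True)

print(cleaned["en"].tolist())
```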

In [None]:
'''
 ## The records having the "official" value in the English column don't have a 
 ## corresponding valid translation in French.   
'''
data_set_2[data_set_2.en == "official"]

Unnamed: 0,id,en,fr
179,,official,2
214,,official,1
321,,official,2


In [None]:
'''
 ## Dropping all the records where the English column has the "official" value.
'''
data_set_2 = data_set_2[data_set_2.en != "official"]

In [None]:
data_set_2[data_set_2.en == "official"]

Unnamed: 0,id,en,fr


BUILDING CONTEXTUAL REFERENCE DATA FOR PROCESSING TRAINING QUERIES

In [None]:
'''
 ## Extraction of the sub-set of English/French pairs from the downloaded parallel corpus
 ## whose English text contains at least one English noun from the training set.
'''
en = []
fr = []
store_csv_name = "c_ref_train.csv"
df_contextual_refrences = pd.DataFrame(list(zip([],[])), columns = ["en", "fr"])

## Loading POKEMON DATASET
with open(base_path+"/dataset/JOKER/task2_pokemon_en.json") as file:
    pokemonENVocab = json.load(file)
with open(base_path+"/dataset/JOKER/task2_pokemon_fr.json") as file:
    pokemonFrVocab = json.load(file)

## Loading DRAGON DATASET
dragon_data_set = pd.read_csv(base_path+"dataset/JOKER/dragons.csv", names=["en", "fr"])

## Loading names of downloaded corpus from the OPUS data set
processed_csv_file_names = ["opensub","books","bookshop", "wikipedia", "wiki", "ted"]
csv_file_names_head = "en-fr_"


f = np.frompyfunc(lambda x,w: w in x, 2,1)

## Extracting English/French pairs from the pokemon names
for index, row in tqdm(data_set_2.iterrows()):
    if(row.en in pokemonENVocab):
        en.append(row.en.lower().strip())
        fr.append(pokemonFrVocab[pokemonENVocab.index(row.en)].lower().strip())

## Extracting English/French pairs from the nouns of How to train your dragon series
listed = np.array([ n.lower() for n in dragon_data_set.en])
for index, row in tqdm(data_set_2.iterrows()):
    word = row.en.lower()
    finds = f(listed, word)
    if(finds.sum()>0):
        en+=dragon_data_set[finds].en.str.lower().str.strip().tolist()
        fr+=dragon_data_set[finds].fr.str.lower().str.strip().tolist()

## Extracting English/French pairs from the downloaded OPUS corpus          
check_pr = [word.lower().strip() for word in data_set_2.en.tolist()]
for csv_file in processed_csv_file_names:
    faster_search_df = pd.read_csv(base_path+"dataset/JOKER/csv/en_processed/"+csv_file_names_head+csv_file+".csv")
    faster_search_df = faster_search_df.dropna()
    listed = faster_search_df['en'].to_numpy()
    for word in tqdm(check_pr):
        finds = f(listed, word)
        if(finds.sum()>0):
            en+=faster_search_df[finds].en.str.lower().str.strip().tolist()
            fr+=faster_search_df[finds].fr.str.lower().str.strip().tolist()

## Saving the extracted pairs, then reporting completion
df_contextual_refrences.en = en
df_contextual_refrences.fr = fr
df_contextual_refrences.to_csv(base_path + "dataset/JOKER/csv/context/" + store_csv_name)
print("c_ref_train is ready")


0it [00:00, ?it/s][A
1161it [00:00, 10730.10it/s][A

0it [00:00, ?it/s][A
86it [00:00, 852.16it/s][A
172it [00:00, 385.25it/s][A
278it [00:00, 567.36it/s][A
384it [00:00, 705.28it/s][A
499it [00:00, 831.87it/s][A
626it [00:00, 957.81it/s][A
746it [00:00, 1026.26it/s][A
889it [00:01, 1143.92it/s][A
1161it [00:01, 970.94it/s] [A

  0%|                                                  | 0/1161 [00:00<?, ?it/s][A
  0%|                                        | 1/1161 [00:04<1:25:28,  4.42s/it][A
  0%|                                        | 2/1161 [00:08<1:26:18,  4.47s/it][A
  0%|                                        | 3/1161 [00:13<1:25:58,  4.45s/it][A
  0%|▏                                       | 4/1161 [00:18<1:28:25,  4.59s/it][A
  0%|▏                                       | 5/1161 [00:24<1:41:32,  5.27s/it][A
  1%|▏                                       | 6/1161 [00:29<1:37:05,  5.04s/it][A
  1%|▏                                       | 7/1161 [00:34<1:41:09,

  8%|███                                    | 93/1161 [08:35<1:41:37,  5.71s/it][A
  8%|███▏                                   | 94/1161 [08:43<1:54:47,  6.45s/it][A
  8%|███▏                                   | 95/1161 [08:49<1:51:49,  6.29s/it][A
  8%|███▏                                   | 96/1161 [08:56<1:56:32,  6.57s/it][A
  8%|███▎                                   | 97/1161 [09:01<1:48:23,  6.11s/it][A
  8%|███▎                                   | 98/1161 [09:07<1:43:07,  5.82s/it][A
  9%|███▎                                   | 99/1161 [09:12<1:38:44,  5.58s/it][A
  9%|███▎                                  | 100/1161 [09:17<1:36:44,  5.47s/it][A
  9%|███▎                                  | 101/1161 [09:24<1:44:04,  5.89s/it][A
  9%|███▎                                  | 102/1161 [09:31<1:50:08,  6.24s/it][A
  9%|███▎                                  | 103/1161 [09:36<1:43:01,  5.84s/it][A
  9%|███▍                                  | 104/1161 [09:41<1:41:08,  5.74s

 16%|██████▏                               | 190/1161 [17:39<1:16:46,  4.74s/it][A
 16%|██████▎                               | 191/1161 [17:43<1:15:17,  4.66s/it][A
 17%|██████▎                               | 192/1161 [17:48<1:14:24,  4.61s/it][A
 17%|██████▎                               | 193/1161 [17:52<1:13:45,  4.57s/it][A
 17%|██████▎                               | 194/1161 [17:59<1:22:17,  5.11s/it][A
 17%|██████▍                               | 195/1161 [18:03<1:20:08,  4.98s/it][A
 17%|██████▍                               | 196/1161 [18:10<1:26:37,  5.39s/it][A
 17%|██████▍                               | 197/1161 [18:14<1:22:07,  5.11s/it][A
 17%|██████▍                               | 198/1161 [18:19<1:19:30,  4.95s/it][A
 17%|██████▌                               | 199/1161 [18:24<1:19:18,  4.95s/it][A
 17%|██████▌                               | 200/1161 [18:28<1:16:45,  4.79s/it][A
 17%|██████▌                               | 201/1161 [18:33<1:14:42,  4.67s

 25%|█████████▍                            | 287/1161 [26:37<1:24:21,  5.79s/it][A
 25%|█████████▍                            | 288/1161 [26:44<1:28:04,  6.05s/it][A
 25%|█████████▍                            | 289/1161 [26:48<1:21:41,  5.62s/it][A
 25%|█████████▍                            | 290/1161 [26:53<1:18:58,  5.44s/it][A
 25%|█████████▌                            | 291/1161 [26:58<1:16:35,  5.28s/it][A
 25%|█████████▌                            | 292/1161 [27:03<1:13:50,  5.10s/it][A
 25%|█████████▌                            | 293/1161 [27:08<1:12:55,  5.04s/it][A
 25%|█████████▌                            | 294/1161 [27:14<1:18:45,  5.45s/it][A
 25%|█████████▋                            | 295/1161 [27:19<1:15:42,  5.25s/it][A
 25%|█████████▋                            | 296/1161 [27:24<1:13:26,  5.09s/it][A
 26%|█████████▋                            | 297/1161 [27:30<1:19:41,  5.53s/it][A
 26%|█████████▊                            | 298/1161 [27:37<1:23:26,  5.80s

 33%|████████████▌                         | 384/1161 [35:43<1:17:24,  5.98s/it][A
 33%|████████████▌                         | 385/1161 [35:48<1:12:37,  5.61s/it][A
 33%|████████████▋                         | 386/1161 [35:54<1:17:32,  6.00s/it][A
 33%|████████████▋                         | 387/1161 [35:59<1:13:12,  5.67s/it][A
 33%|████████████▋                         | 388/1161 [36:04<1:09:24,  5.39s/it][A
 34%|████████████▋                         | 389/1161 [36:11<1:14:00,  5.75s/it][A
 34%|████████████▊                         | 390/1161 [36:15<1:09:50,  5.43s/it][A
 34%|████████████▊                         | 391/1161 [36:20<1:07:50,  5.29s/it][A
 34%|████████████▊                         | 392/1161 [36:25<1:05:34,  5.12s/it][A
 34%|████████████▊                         | 393/1161 [36:30<1:04:04,  5.01s/it][A
 34%|████████████▉                         | 394/1161 [36:35<1:03:52,  5.00s/it][A
 34%|████████████▉                         | 395/1161 [36:40<1:03:02,  4.94s

 41%|████████████████▌                       | 481/1161 [44:33<53:24,  4.71s/it][A
 42%|████████████████▌                       | 482/1161 [44:38<53:49,  4.76s/it][A
 42%|████████████████▋                       | 483/1161 [44:43<53:31,  4.74s/it][A
 42%|████████████████▋                       | 484/1161 [44:49<58:24,  5.18s/it][A
 42%|████████████████▋                       | 485/1161 [44:54<56:32,  5.02s/it][A
 42%|████████████████▋                       | 486/1161 [44:58<54:29,  4.84s/it][A
 42%|████████████████▊                       | 487/1161 [45:03<53:30,  4.76s/it][A
 42%|████████████████▊                       | 488/1161 [45:09<59:05,  5.27s/it][A
 42%|████████████████▊                       | 489/1161 [45:14<55:54,  4.99s/it][A
 42%|████████████████▉                       | 490/1161 [45:18<54:01,  4.83s/it][A
 42%|████████████████▉                       | 491/1161 [45:23<52:52,  4.74s/it][A
 42%|████████████████▉                       | 492/1161 [45:27<51:58,  4.66s

 50%|███████████████████▉                    | 578/1161 [52:38<47:19,  4.87s/it][A
 50%|███████████████████▉                    | 579/1161 [52:43<46:14,  4.77s/it][A
 50%|███████████████████▉                    | 580/1161 [52:47<44:57,  4.64s/it][A
 50%|████████████████████                    | 581/1161 [52:52<44:25,  4.60s/it][A
 50%|████████████████████                    | 582/1161 [52:56<44:31,  4.61s/it][A
 50%|████████████████████                    | 583/1161 [53:03<49:05,  5.10s/it][A
 50%|████████████████████                    | 584/1161 [53:07<47:10,  4.91s/it][A
 50%|████████████████████▏                   | 585/1161 [53:12<46:07,  4.80s/it][A
 50%|████████████████████▏                   | 586/1161 [53:16<45:22,  4.74s/it][A
 51%|████████████████████▏                   | 587/1161 [53:23<50:31,  5.28s/it][A
 51%|████████████████████▎                   | 588/1161 [53:30<55:40,  5.83s/it][A
 51%|████████████████████▎                   | 589/1161 [53:35<52:07,  5.47s

 58%|██████████████████████                | 675/1161 [1:01:17<47:18,  5.84s/it][A
 58%|██████████████████████▏               | 676/1161 [1:01:22<44:55,  5.56s/it][A
 58%|██████████████████████▏               | 677/1161 [1:01:27<43:34,  5.40s/it][A
 58%|██████████████████████▏               | 678/1161 [1:01:31<41:30,  5.16s/it][A
 58%|██████████████████████▏               | 679/1161 [1:01:38<44:56,  5.59s/it][A
 59%|██████████████████████▎               | 680/1161 [1:01:43<43:18,  5.40s/it][A
 59%|██████████████████████▎               | 681/1161 [1:01:48<42:32,  5.32s/it][A
 59%|██████████████████████▎               | 682/1161 [1:01:53<41:18,  5.17s/it][A
 59%|██████████████████████▎               | 683/1161 [1:01:58<40:19,  5.06s/it][A
 59%|██████████████████████▍               | 684/1161 [1:02:02<39:43,  5.00s/it][A
 59%|██████████████████████▍               | 685/1161 [1:02:13<53:00,  6.68s/it][A
 59%|██████████████████████▍               | 686/1161 [1:02:18<48:15,  6.10s

 66%|█████████████████████████▎            | 772/1161 [1:10:08<35:39,  5.50s/it][A
 67%|█████████████████████████▎            | 773/1161 [1:10:13<34:01,  5.26s/it][A
 67%|█████████████████████████▎            | 774/1161 [1:10:18<32:50,  5.09s/it][A
 67%|█████████████████████████▎            | 775/1161 [1:10:23<32:05,  4.99s/it][A
 67%|█████████████████████████▍            | 776/1161 [1:10:27<31:43,  4.95s/it][A
 67%|█████████████████████████▍            | 777/1161 [1:10:35<35:44,  5.58s/it][A
 67%|█████████████████████████▍            | 778/1161 [1:10:39<34:19,  5.38s/it][A
 67%|█████████████████████████▍            | 779/1161 [1:10:44<32:53,  5.17s/it][A
 67%|█████████████████████████▌            | 780/1161 [1:10:49<32:11,  5.07s/it][A
 67%|█████████████████████████▌            | 781/1161 [1:10:54<31:38,  5.00s/it][A
 67%|█████████████████████████▌            | 782/1161 [1:10:59<31:14,  4.94s/it][A
 67%|█████████████████████████▋            | 783/1161 [1:11:03<30:39,  4.87s

 75%|████████████████████████████▍         | 869/1161 [1:18:49<25:08,  5.16s/it][A
 75%|████████████████████████████▍         | 870/1161 [1:18:54<24:37,  5.08s/it][A
 75%|████████████████████████████▌         | 871/1161 [1:18:59<24:13,  5.01s/it][A
 75%|████████████████████████████▌         | 872/1161 [1:19:04<23:50,  4.95s/it][A
 75%|████████████████████████████▌         | 873/1161 [1:19:10<26:10,  5.45s/it][A
 75%|████████████████████████████▌         | 874/1161 [1:19:15<25:06,  5.25s/it][A
 75%|████████████████████████████▋         | 875/1161 [1:19:22<27:17,  5.73s/it][A
 75%|████████████████████████████▋         | 876/1161 [1:19:27<25:51,  5.44s/it][A
 76%|████████████████████████████▋         | 877/1161 [1:19:31<24:46,  5.23s/it][A
 76%|████████████████████████████▋         | 878/1161 [1:19:36<24:00,  5.09s/it][A
 76%|████████████████████████████▊         | 879/1161 [1:19:41<23:30,  5.00s/it][A
 76%|████████████████████████████▊         | 880/1161 [1:19:46<22:59,  4.91s

 83%|███████████████████████████████▌      | 966/1161 [1:28:51<17:27,  5.37s/it][A
 83%|███████████████████████████████▋      | 967/1161 [1:28:56<16:43,  5.17s/it][A
 83%|███████████████████████████████▋      | 968/1161 [1:29:02<17:45,  5.52s/it][A
 83%|███████████████████████████████▋      | 969/1161 [1:29:07<16:44,  5.23s/it][A
 84%|███████████████████████████████▋      | 970/1161 [1:29:12<16:08,  5.07s/it][A
 84%|███████████████████████████████▊      | 971/1161 [1:29:16<15:28,  4.89s/it][A
 84%|███████████████████████████████▊      | 972/1161 [1:29:21<15:02,  4.77s/it][A
 84%|███████████████████████████████▊      | 973/1161 [1:29:25<14:38,  4.67s/it][A
 84%|███████████████████████████████▉      | 974/1161 [1:29:32<16:12,  5.20s/it][A
 84%|███████████████████████████████▉      | 975/1161 [1:29:36<15:37,  5.04s/it][A
 84%|███████████████████████████████▉      | 976/1161 [1:29:41<15:10,  4.92s/it][A
 84%|███████████████████████████████▉      | 977/1161 [1:29:47<16:22,  5.34s

 92%|█████████████████████████████████▉   | 1063/1161 [1:38:00<09:56,  6.08s/it][A
 92%|█████████████████████████████████▉   | 1064/1161 [1:38:05<09:36,  5.94s/it][A
 92%|█████████████████████████████████▉   | 1065/1161 [1:38:12<09:39,  6.04s/it][A
 92%|█████████████████████████████████▉   | 1066/1161 [1:38:18<09:51,  6.22s/it][A
 92%|██████████████████████████████████   | 1067/1161 [1:38:27<10:49,  6.91s/it][A
 92%|██████████████████████████████████   | 1068/1161 [1:38:36<11:58,  7.73s/it][A
 92%|██████████████████████████████████   | 1069/1161 [1:38:42<11:02,  7.20s/it][A
 92%|██████████████████████████████████   | 1070/1161 [1:38:50<10:52,  7.17s/it][A
 92%|██████████████████████████████████▏  | 1071/1161 [1:38:55<10:07,  6.75s/it][A
 92%|██████████████████████████████████▏  | 1072/1161 [1:39:02<10:07,  6.83s/it][A
 92%|██████████████████████████████████▏  | 1073/1161 [1:39:07<09:12,  6.27s/it][A
 93%|██████████████████████████████████▏  | 1074/1161 [1:39:12<08:30,  5.87s

100%|████████████████████████████████████▉| 1160/1161 [1:46:43<00:04,  4.64s/it]
100%|█████████████████████████████████████| 1161/1161 [1:46:49<00:00,  5.52s/it]



 98%|██████████████████████████████████████ | 1132/1161 [03:17<00:04,  5.98it/s][A
 98%|██████████████████████████████████████ | 1133/1161 [03:17<00:04,  6.09it/s][A
 98%|██████████████████████████████████████ | 1134/1161 [03:17<00:04,  6.16it/s][A
 98%|██████████████████████████████████████▏| 1135/1161 [03:18<00:04,  6.19i

  5%|██                                       | 59/1161 [00:11<03:27,  5.30it/s][A
  5%|██                                       | 60/1161 [00:11<03:25,  5.35it/s][A
  5%|██▏                                      | 61/1161 [00:12<03:25,  5.35it/s][A
  5%|██▏                                      | 62/1161 [00:12<03:26,  5.32it/s][A
  5%|██▏                                      | 63/1161 [00:12<03:22,  5.43it/s][A
  6%|██▎                                      | 64/1161 [00:12<03:21,  5.44it/s][A
  6%|██▎                                      | 65/1161 [00:12<03:23,  5.38it/s][A
  6%|██▎                                      | 66/1161 [00:12<03:21,  5.44it/s][A
  6%|██▎                                      | 67/1161 [00:13<03:23,  5.37it/s][A
  6%|██▍                                      | 68/1161 [00:13<03:36,  5.06it/s][A
  6%|██▍                                      | 69/1161 [00:13<03:59,  4.57it/s][A
  6%|██▍                                      | 70/1161 [00:13<04:04,  4.47i

 13%|█████▎                                  | 156/1161 [00:32<03:20,  5.02it/s][A
 14%|█████▍                                  | 157/1161 [00:32<03:12,  5.21it/s][A
 14%|█████▍                                  | 158/1161 [00:32<03:18,  5.07it/s][A
 14%|█████▍                                  | 159/1161 [00:32<03:18,  5.06it/s][A
 14%|█████▌                                  | 160/1161 [00:32<03:24,  4.90it/s][A
 14%|█████▌                                  | 161/1161 [00:33<03:17,  5.06it/s][A
 14%|█████▌                                  | 162/1161 [00:33<03:11,  5.21it/s][A
 14%|█████▌                                  | 163/1161 [00:33<03:27,  4.82it/s][A
 14%|█████▋                                  | 164/1161 [00:33<03:18,  5.03it/s][A
 14%|█████▋                                  | 165/1161 [00:33<03:34,  4.65it/s][A
 14%|█████▋                                  | 166/1161 [00:34<03:45,  4.41it/s][A
 14%|█████▊                                  | 167/1161 [00:34<03:26,  4.81i

 22%|████████▋                               | 253/1161 [00:51<03:00,  5.02it/s][A
 22%|████████▊                               | 254/1161 [00:52<03:03,  4.94it/s][A
 22%|████████▊                               | 255/1161 [00:52<03:01,  4.98it/s][A
 22%|████████▊                               | 256/1161 [00:52<03:25,  4.40it/s][A
 22%|████████▊                               | 257/1161 [00:52<03:18,  4.55it/s][A
 22%|████████▉                               | 258/1161 [00:53<03:10,  4.74it/s][A
 22%|████████▉                               | 259/1161 [00:53<03:08,  4.78it/s][A
 22%|████████▉                               | 260/1161 [00:53<03:10,  4.73it/s][A
 22%|████████▉                               | 261/1161 [00:53<03:17,  4.56it/s][A
 23%|█████████                               | 262/1161 [00:53<03:14,  4.62it/s][A
 23%|█████████                               | 263/1161 [00:54<03:07,  4.79it/s][A
 23%|█████████                               | 264/1161 [00:54<03:05,  4.82i

 30%|████████████                            | 350/1161 [01:11<02:46,  4.88it/s][A
 30%|████████████                            | 351/1161 [01:11<02:50,  4.75it/s][A
 30%|████████████▏                           | 352/1161 [01:12<02:42,  4.97it/s][A
 30%|████████████▏                           | 353/1161 [01:12<02:37,  5.15it/s][A
 30%|████████████▏                           | 354/1161 [01:12<02:38,  5.09it/s][A
 31%|████████████▏                           | 355/1161 [01:12<02:40,  5.01it/s][A
 31%|████████████▎                           | 356/1161 [01:12<02:41,  4.99it/s][A
 31%|████████████▎                           | 357/1161 [01:13<02:53,  4.64it/s][A
 31%|████████████▎                           | 358/1161 [01:13<02:41,  4.98it/s][A
 31%|████████████▎                           | 359/1161 [01:13<02:46,  4.80it/s][A
 31%|████████████▍                           | 360/1161 [01:13<03:01,  4.42it/s][A
 31%|████████████▍                           | 361/1161 [01:13<02:49,  4.71i

 39%|███████████████▍                        | 447/1161 [01:30<02:23,  4.97it/s][A
 39%|███████████████▍                        | 448/1161 [01:30<02:30,  4.73it/s][A
 39%|███████████████▍                        | 449/1161 [01:30<02:19,  5.09it/s][A
 39%|███████████████▌                        | 450/1161 [01:31<02:29,  4.76it/s][A
 39%|███████████████▌                        | 451/1161 [01:31<02:25,  4.89it/s][A
 39%|███████████████▌                        | 452/1161 [01:31<02:20,  5.05it/s][A
 39%|███████████████▌                        | 453/1161 [01:31<02:16,  5.18it/s][A
 39%|███████████████▋                        | 454/1161 [01:31<02:11,  5.37it/s][A
 39%|███████████████▋                        | 455/1161 [01:32<02:09,  5.44it/s][A
 39%|███████████████▋                        | 456/1161 [01:32<02:08,  5.47it/s][A
 39%|███████████████▋                        | 457/1161 [01:32<02:09,  5.43it/s][A
 39%|███████████████▊                        | 458/1161 [01:32<02:09,  5.44i

 47%|██████████████████▋                     | 544/1161 [01:50<02:01,  5.08it/s][A
 47%|██████████████████▊                     | 545/1161 [01:50<01:58,  5.20it/s][A
 47%|██████████████████▊                     | 546/1161 [01:50<02:06,  4.87it/s][A
 47%|██████████████████▊                     | 547/1161 [01:50<02:01,  5.05it/s][A
 47%|██████████████████▉                     | 548/1161 [01:50<01:57,  5.22it/s][A
 47%|██████████████████▉                     | 549/1161 [01:51<01:58,  5.17it/s][A
 47%|██████████████████▉                     | 550/1161 [01:51<01:55,  5.30it/s][A
 47%|██████████████████▉                     | 551/1161 [01:51<01:54,  5.34it/s][A
 48%|███████████████████                     | 552/1161 [01:51<01:52,  5.43it/s][A
 48%|███████████████████                     | 553/1161 [01:51<01:59,  5.09it/s][A
 48%|███████████████████                     | 554/1161 [01:52<02:00,  5.02it/s][A
 48%|███████████████████                     | 555/1161 [01:52<01:57,  5.16i

 55%|██████████████████████                  | 641/1161 [02:09<01:37,  5.32it/s][A
 55%|██████████████████████                  | 642/1161 [02:09<01:48,  4.77it/s][A
 55%|██████████████████████▏                 | 643/1161 [02:09<01:45,  4.92it/s][A
 55%|██████████████████████▏                 | 644/1161 [02:09<01:42,  5.03it/s][A
 56%|██████████████████████▏                 | 645/1161 [02:10<01:37,  5.29it/s][A
 56%|██████████████████████▎                 | 646/1161 [02:10<01:38,  5.20it/s][A
 56%|██████████████████████▎                 | 647/1161 [02:10<01:49,  4.69it/s][A
 56%|██████████████████████▎                 | 648/1161 [02:10<01:46,  4.80it/s][A
 56%|██████████████████████▎                 | 649/1161 [02:11<01:47,  4.76it/s][A
 56%|██████████████████████▍                 | 650/1161 [02:11<01:47,  4.74it/s][A
 56%|██████████████████████▍                 | 651/1161 [02:11<01:45,  4.83it/s][A
 56%|██████████████████████▍                 | 652/1161 [02:11<01:42,  4.98i

 64%|█████████████████████████▍              | 738/1161 [02:28<01:19,  5.35it/s][A
 64%|█████████████████████████▍              | 739/1161 [02:28<01:22,  5.12it/s][A
 64%|█████████████████████████▍              | 740/1161 [02:28<01:21,  5.20it/s][A
 64%|█████████████████████████▌              | 741/1161 [02:29<01:18,  5.38it/s][A
 64%|█████████████████████████▌              | 742/1161 [02:29<01:26,  4.86it/s][A
 64%|█████████████████████████▌              | 743/1161 [02:29<01:23,  5.02it/s][A
 64%|█████████████████████████▋              | 744/1161 [02:29<01:22,  5.04it/s][A
 64%|█████████████████████████▋              | 745/1161 [02:29<01:20,  5.19it/s][A
 64%|█████████████████████████▋              | 746/1161 [02:30<01:17,  5.37it/s][A
 64%|█████████████████████████▋              | 747/1161 [02:30<01:16,  5.42it/s][A
 64%|█████████████████████████▊              | 748/1161 [02:30<01:16,  5.38it/s][A
 65%|█████████████████████████▊              | 749/1161 [02:30<01:15,  5.43i

 72%|████████████████████████████▊           | 835/1161 [02:47<01:05,  4.96it/s][A
 72%|████████████████████████████▊           | 836/1161 [02:48<01:04,  5.07it/s][A
 72%|████████████████████████████▊           | 837/1161 [02:48<01:03,  5.12it/s][A
 72%|████████████████████████████▊           | 838/1161 [02:48<01:02,  5.18it/s][A
 72%|████████████████████████████▉           | 839/1161 [02:48<01:03,  5.05it/s][A
 72%|████████████████████████████▉           | 840/1161 [02:48<01:10,  4.57it/s][A
 72%|████████████████████████████▉           | 841/1161 [02:49<01:06,  4.82it/s][A
 73%|█████████████████████████████           | 842/1161 [02:49<01:03,  5.02it/s][A
 73%|█████████████████████████████           | 843/1161 [02:49<01:03,  5.00it/s][A
 73%|█████████████████████████████           | 844/1161 [02:49<01:03,  5.03it/s][A
 73%|█████████████████████████████           | 845/1161 [02:49<01:00,  5.26it/s][A
 73%|█████████████████████████████▏          | 846/1161 [02:50<00:59,  5.26i

 80%|████████████████████████████████        | 932/1161 [03:06<00:49,  4.62it/s][A
 80%|████████████████████████████████▏       | 933/1161 [03:07<00:48,  4.75it/s][A
 80%|████████████████████████████████▏       | 934/1161 [03:07<00:46,  4.91it/s][A
 81%|████████████████████████████████▏       | 935/1161 [03:07<00:44,  5.06it/s][A
 81%|████████████████████████████████▏       | 936/1161 [03:07<00:44,  5.07it/s][A
 81%|████████████████████████████████▎       | 937/1161 [03:07<00:45,  4.94it/s][A
 81%|████████████████████████████████▎       | 938/1161 [03:08<00:48,  4.57it/s][A
 81%|████████████████████████████████▎       | 939/1161 [03:08<00:46,  4.72it/s][A
 81%|████████████████████████████████▍       | 940/1161 [03:08<00:46,  4.77it/s][A
 81%|████████████████████████████████▍       | 941/1161 [03:08<00:46,  4.78it/s][A
 81%|████████████████████████████████▍       | 942/1161 [03:08<00:47,  4.65it/s][A
 81%|████████████████████████████████▍       | 943/1161 [03:09<00:46,  4.69i

 89%|██████████████████████████████████▌    | 1029/1161 [03:27<00:26,  5.01it/s][A
 89%|██████████████████████████████████▌    | 1030/1161 [03:27<00:26,  4.92it/s][A
 89%|██████████████████████████████████▋    | 1031/1161 [03:27<00:26,  4.85it/s][A
 89%|██████████████████████████████████▋    | 1032/1161 [03:27<00:25,  5.05it/s][A
 89%|██████████████████████████████████▋    | 1033/1161 [03:28<00:24,  5.15it/s][A
 89%|██████████████████████████████████▋    | 1034/1161 [03:28<00:24,  5.24it/s][A
 89%|██████████████████████████████████▊    | 1035/1161 [03:28<00:23,  5.34it/s][A
 89%|██████████████████████████████████▊    | 1036/1161 [03:28<00:23,  5.39it/s][A
 89%|██████████████████████████████████▊    | 1037/1161 [03:28<00:22,  5.39it/s][A
 89%|██████████████████████████████████▊    | 1038/1161 [03:28<00:22,  5.45it/s][A
 89%|██████████████████████████████████▉    | 1039/1161 [03:29<00:23,  5.14it/s][A
 90%|██████████████████████████████████▉    | 1040/1161 [03:29<00:23,  5.12i

 97%|█████████████████████████████████████▊ | 1126/1161 [03:47<00:07,  4.97it/s][A
 97%|█████████████████████████████████████▊ | 1127/1161 [03:47<00:06,  5.08it/s][A
 97%|█████████████████████████████████████▉ | 1128/1161 [03:47<00:06,  5.15it/s][A
 97%|█████████████████████████████████████▉ | 1129/1161 [03:48<00:06,  5.07it/s][A
 97%|█████████████████████████████████████▉ | 1130/1161 [03:48<00:06,  5.11it/s][A
 97%|█████████████████████████████████████▉ | 1131/1161 [03:48<00:05,  5.26it/s][A
 98%|██████████████████████████████████████ | 1132/1161 [03:48<00:06,  4.68it/s][A
 98%|██████████████████████████████████████ | 1133/1161 [03:48<00:05,  4.88it/s][A
 98%|██████████████████████████████████████ | 1134/1161 [03:49<00:05,  5.08it/s][A
 98%|██████████████████████████████████████▏| 1135/1161 [03:49<00:05,  5.17it/s][A
 98%|██████████████████████████████████████▏| 1136/1161 [03:49<00:04,  5.11it/s][A
 98%|██████████████████████████████████████▏| 1137/1161 [03:49<00:04,  5.12i

 11%|████▏                                   | 122/1161 [00:08<01:12, 14.42it/s][A
 11%|████▎                                   | 124/1161 [00:08<01:09, 14.88it/s][A
 11%|████▎                                   | 126/1161 [00:08<01:09, 14.79it/s][A
 11%|████▍                                   | 128/1161 [00:08<01:08, 15.08it/s][A
 11%|████▍                                   | 130/1161 [00:09<01:07, 15.17it/s][A
 11%|████▌                                   | 132/1161 [00:09<01:06, 15.42it/s][A
 12%|████▌                                   | 134/1161 [00:09<01:06, 15.51it/s][A
 12%|████▋                                   | 136/1161 [00:09<01:07, 15.23it/s][A
 12%|████▊                                   | 138/1161 [00:09<01:06, 15.34it/s][A
 12%|████▊                                   | 140/1161 [00:09<01:07, 15.06it/s][A
 12%|████▉                                   | 142/1161 [00:09<01:07, 15.04it/s][A
 12%|████▉                                   | 144/1161 [00:09<01:08, 14.91i

 27%|██████████▉                             | 316/1161 [00:22<01:03, 13.36it/s][A
 27%|██████████▉                             | 318/1161 [00:22<01:06, 12.72it/s][A
 28%|███████████                             | 320/1161 [00:22<01:11, 11.80it/s][A
 28%|███████████                             | 322/1161 [00:22<01:06, 12.66it/s][A
 28%|███████████▏                            | 324/1161 [00:23<01:03, 13.28it/s][A
 28%|███████████▏                            | 326/1161 [00:23<01:00, 13.77it/s][A
 28%|███████████▎                            | 328/1161 [00:23<00:58, 14.22it/s][A
 28%|███████████▎                            | 330/1161 [00:23<00:57, 14.49it/s][A
 29%|███████████▍                            | 332/1161 [00:23<00:57, 14.52it/s][A
 29%|███████████▌                            | 334/1161 [00:23<01:00, 13.61it/s][A
 29%|███████████▌                            | 336/1161 [00:23<00:58, 14.15it/s][A
 29%|███████████▋                            | 338/1161 [00:24<00:59, 13.90i

 44%|█████████████████▌                      | 510/1161 [00:36<00:47, 13.75it/s][A
 44%|█████████████████▋                      | 512/1161 [00:36<00:46, 13.86it/s][A
 44%|█████████████████▋                      | 514/1161 [00:36<00:46, 14.00it/s][A
 44%|█████████████████▊                      | 516/1161 [00:37<00:45, 14.14it/s][A
 45%|█████████████████▊                      | 518/1161 [00:37<00:44, 14.40it/s][A
 45%|█████████████████▉                      | 520/1161 [00:37<00:41, 15.37it/s][A
 45%|█████████████████▉                      | 522/1161 [00:37<00:42, 15.00it/s][A
 45%|██████████████████                      | 524/1161 [00:37<00:42, 14.84it/s][A
 45%|██████████████████                      | 526/1161 [00:37<00:45, 14.07it/s][A
 45%|██████████████████▏                     | 528/1161 [00:37<00:44, 14.28it/s][A
 46%|██████████████████▎                     | 530/1161 [00:38<00:43, 14.37it/s][A
 46%|██████████████████▎                     | 532/1161 [00:38<00:46, 13.59i

 61%|████████████████████████▎               | 704/1161 [00:50<00:30, 14.91it/s][A
 61%|████████████████████████▎               | 706/1161 [00:50<00:30, 14.84it/s][A
 61%|████████████████████████▍               | 708/1161 [00:51<00:30, 14.75it/s][A
 61%|████████████████████████▍               | 710/1161 [00:51<00:30, 14.69it/s][A
 61%|████████████████████████▌               | 712/1161 [00:51<00:30, 14.69it/s][A
 61%|████████████████████████▌               | 714/1161 [00:51<00:29, 15.00it/s][A
 62%|████████████████████████▋               | 716/1161 [00:51<00:30, 14.75it/s][A
 62%|████████████████████████▋               | 718/1161 [00:51<00:31, 13.90it/s][A
 62%|████████████████████████▊               | 720/1161 [00:51<00:32, 13.41it/s][A
 62%|████████████████████████▉               | 722/1161 [00:52<00:31, 13.75it/s][A
 62%|████████████████████████▉               | 724/1161 [00:52<00:30, 14.19it/s][A
 63%|█████████████████████████               | 726/1161 [00:52<00:30, 14.17i

 77%|██████████████████████████████▉         | 898/1161 [01:05<00:23, 11.22it/s][A
 78%|███████████████████████████████         | 900/1161 [01:05<00:21, 12.02it/s][A
 78%|███████████████████████████████         | 902/1161 [01:05<00:20, 12.50it/s][A
 78%|███████████████████████████████▏        | 904/1161 [01:05<00:19, 12.98it/s][A
 78%|███████████████████████████████▏        | 906/1161 [01:05<00:19, 13.07it/s][A
 78%|███████████████████████████████▎        | 908/1161 [01:05<00:19, 12.99it/s][A
 78%|███████████████████████████████▎        | 910/1161 [01:06<00:19, 13.18it/s][A
 79%|███████████████████████████████▍        | 912/1161 [01:06<00:18, 13.47it/s][A
 79%|███████████████████████████████▍        | 914/1161 [01:06<00:17, 13.80it/s][A
 79%|███████████████████████████████▌        | 916/1161 [01:06<00:17, 14.32it/s][A
 79%|███████████████████████████████▋        | 918/1161 [01:06<00:17, 14.17it/s][A
 79%|███████████████████████████████▋        | 920/1161 [01:06<00:16, 14.27i

 94%|████████████████████████████████████▋  | 1092/1161 [01:18<00:04, 14.67it/s][A
 94%|████████████████████████████████████▋  | 1094/1161 [01:18<00:04, 14.69it/s][A
 94%|████████████████████████████████████▊  | 1096/1161 [01:18<00:04, 14.13it/s][A
 95%|████████████████████████████████████▉  | 1098/1161 [01:19<00:04, 13.26it/s][A
 95%|████████████████████████████████████▉  | 1100/1161 [01:19<00:04, 14.51it/s][A
 95%|█████████████████████████████████████  | 1102/1161 [01:19<00:03, 14.77it/s][A
 95%|█████████████████████████████████████  | 1104/1161 [01:19<00:03, 14.69it/s][A
 95%|█████████████████████████████████████▏ | 1106/1161 [01:19<00:03, 14.73it/s][A
 95%|█████████████████████████████████████▏ | 1108/1161 [01:19<00:03, 14.90it/s][A
 96%|█████████████████████████████████████▎ | 1110/1161 [01:19<00:03, 15.26it/s][A
 96%|█████████████████████████████████████▎ | 1112/1161 [01:19<00:03, 15.31it/s][A
 96%|█████████████████████████████████████▍ | 1114/1161 [01:20<00:03, 14.48i

c_ref_train is ready


In [None]:
'''
 ## Exploring the saved subset of English/French pairs that contain at least 
 ## one English noun from the training set in the English version of the text
'''
store_csv_name = "c_ref_train.csv"
## The CSV already carries a header row, so it is read without overriding the column names
df_contextual_refrences = pd.read_csv(base_path + "dataset/JOKER/csv/context/" + store_csv_name)
df_contextual_refrences.tail(5)

Unnamed: 0,en,fr
184887.0,because while having less honey might be a buz...,"parce que si moins de miel, ce n'est pas drôle..."
184888.0,cheshire cats can juggle their own heads.,les chats du cheshire savent jongler avec leur...
184889.0,so that goes along with the cheshire cat sayin...,"« si l'endroit où tu veux aller t'importe peu,..."
184890.0,no quidditch match ends until the golden snitc...,un match de quidditch ne s'arrête que lorsque ...
184891.0,apricot halves like the ears of cherubim.,des moitiés d'abricot comme des oreilles de ch...


Next, we count how many of the English nouns in the training and test data sets provided by the JOKER CLEF Task 2 team occur in the English side of the subset we built from the OPUS parallel corpus.

In [None]:
## Substring membership test: True when word w occurs anywhere inside sentence x
f = np.frompyfunc(lambda x,w: w in x, 2,1)

'''
 ## Count how many English nouns from the training data set occur in the 
 ## context references (c_ref_train.csv) built for the training data set
'''
store_csv_name = "c_ref_train.csv"
df_contextual_refrences = pd.read_csv(base_path+"dataset/JOKER/csv/context/"+store_csv_name)

check_pr = [word.lower().strip() for word in data_set_2.en.tolist()]
listed = df_contextual_refrences.en.to_numpy()

train_nf = []
count_ref_for_train_vocab = 0
for word in tqdm(check_pr):
    finds = f(listed, word)
    if (finds.sum()>0):
        count_ref_for_train_vocab +=1
    else:
        train_nf.append(word)

print("Count Of training vocabs for which contextual refrences exist in data set", count_ref_for_train_vocab, ", Other Remaining Records=", len(check_pr)-count_ref_for_train_vocab)


'''
 ## Count how many English nouns from the test data set occur in the 
 ## context references (c_ref_test.csv) built for the test data set
'''
store_csv_name = "c_ref_test.csv"
df_contextual_refrences = pd.read_csv(base_path+"dataset/JOKER/csv/context/"+store_csv_name)

check_pr = [word.lower().strip() for word in test_data_set_2.en.tolist()]
listed = df_contextual_refrences.en.to_numpy()

test_nf = []
count_ref_for_test_vocab=0
for word in tqdm(check_pr):
    finds = f(listed, word)
    if (finds.sum()>0):
        count_ref_for_test_vocab +=1
    else:
        test_nf.append(word)
        
print("Count Of testing vocabs for which contextual refrences exist in data set", count_ref_for_test_vocab, ", Other Remaining Records=", len(check_pr)-count_ref_for_test_vocab)




Count Of training vocabs for which contextual refrences exist in data set 874 , Other Remaining Records= 287




Count Of testing vocabs for which contextual refrences exist in data set 219 , Other Remaining Records= 65
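
The counting cell above relies on substring membership (`w in x`), so a short noun can also match inside a longer, unrelated word. The following is a minimal sketch, on made-up sentences that are not taken from the corpus, contrasting that test with a whole-token test based on `split()`. The names `substring_match` and `token_match` are hypothetical and introduced only for this illustration.

```python
import numpy as np

# Toy English sentences (made up for illustration, not taken from the corpus).
sentences = np.array([
    "the category list is long",
    "cheshire cats can juggle",
    "a cat sat on the mat",
], dtype=object)

# Substring membership, as in the counting cell: also matches inside longer words.
substring_match = np.frompyfunc(lambda text, word: word in text, 2, 1)
# Whole-token membership: the sentence is split on whitespace first.
token_match = np.frompyfunc(lambda text, word: word in text.split(), 2, 1)

# frompyfunc returns object arrays, so cast to bool before using them as masks.
substring_hits = substring_match(sentences, "cat").astype(bool)
token_hits = token_match(sentences, "cat").astype(bool)

print(substring_hits.tolist())  # [True, True, True]
print(token_hits.tolist())      # [False, False, True]
```

The substring test is more permissive, so the counts it reports are an upper bound on what a whole-token test would find. The whole-token test, in turn, cannot match multi-word terms or words with attached punctuation.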





The contextual references were created for the English nouns from the training data set provided by the JOKER CLEF Task 2 team. Each contextual reference pair must contain at least one English noun in the English side of the text and its corresponding translation in the French side. So far we have only ensured that each pair contains at least one English noun, so we now run another pass to check that the corresponding French translation also appears in the French text. We also change the structure of the saved contextual references from two columns to the classical extractive Q/A layout.
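
To make the target layout concrete, here is a minimal sketch of how a single (English noun, French noun, French sentence) triple could be turned into one extractive Q/A row. The helper name `make_qa_record` and the example sentence are assumptions made for illustration. The field names (`id`, `context`, `question`, `answers`) and the JSON-encoded `answers` value with `text` and `answer_start` mirror what the cell that follows writes out.

```python
import json

def make_qa_record(record_id, word_en, word_fr, context_fr):
    """Build one extractive-QA row; return None if the answer is not in the context."""
    start = context_fr.find(word_fr)  # character offset of the first occurrence
    if start == -1:
        return None
    return {
        "id": record_id,
        "context": context_fr,   # French sentence that contains the answer
        "question": word_en,     # English noun acts as the question
        "answers": json.dumps({"text": [word_fr], "answer_start": [start]}),
    }

# Made-up example, not a row from the data set.
record = make_qa_record(0, "cat", "chat", "le chat dort sur le canapé")
print(record["answers"])
```

Note that `str.find` returns the offset of the first substring occurrence, so the reported `answer_start` may fall inside a longer word when the French noun is short.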

In [None]:
import json

## Token-level membership test: True when word w is one of the whitespace-separated tokens of sentence x
f = np.frompyfunc(lambda x,w: w in x.split(), 2,1)
train_dataset_QA = pd.DataFrame(columns = ['id', 'context', 'question', 'answers'])

store_csv_name = "c_ref_train.csv"
df_contextual_refrences = pd.read_csv(base_path+"dataset/JOKER/csv/context/"+store_csv_name)

check_pr_en = [word.lower().strip() for word in data_set_2.en.tolist()]
check_pr_fr = [word.lower().strip() for word in data_set_2.fr.tolist()]

listed_en = df_contextual_refrences.en.to_numpy()
listed_fr = df_contextual_refrences.fr.to_numpy()

## English nouns for which no usable contextual reference was found
train_nf = []
count_ref_for_train_vocab = 0

ids = []
questions = []
contexts = []
answers = []

for word_en, word_fr in tqdm(zip(check_pr_en, check_pr_fr)):
    
    finds_en = f(listed_en, word_en.strip())
    finds_fr = f(listed_fr, word_fr.strip())
    
    if (finds_en.sum()>0 and finds_fr.sum()>0):
        count_ref_for_train_vocab +=1
        ## Every French sentence containing the French noun becomes a context
        french_contexts = df_contextual_refrences[finds_fr].fr.to_list()
        questions+=[word_en for _ in french_contexts]
        contexts+=french_contexts
        for context_fr in french_contexts:
            ## SQuAD-style answer: the French noun and its character offset in the context
            answers.append(json.dumps({"text": [word_fr], "answer_start": [context_fr.find(word_fr)]}))
    else:
        train_nf.append(word_en)

train_dataset_QA.id = list(range(0,len(questions)))
train_dataset_QA.question = questions
train_dataset_QA.context = contexts
train_dataset_QA.answers = answers

train_dataset_QA.to_csv(base_path +"dataset/JOKER/Task 2/train/joker_task2_train_context_aware_data_where_answers_exist.csv")
print("Count Of training vocabs for which contextual refrences exist in data set", count_ref_for_train_vocab, ", Other Remaining Records=", len(check_pr_en)-count_ref_for_train_vocab)



0it [00:00, ?it/s][A
1it [00:00,  1.07it/s][A
2it [00:01,  1.11it/s][A
3it [00:02,  1.15it/s][A
4it [00:03,  1.15it/s][A
5it [00:04,  1.11it/s][A
6it [00:05,  1.10it/s][A
7it [00:06,  1.13it/s][A
8it [00:07,  1.12it/s][A
9it [00:08,  1.09it/s][A
10it [00:09,  1.10it/s][A
11it [00:09,  1.11it/s][A
12it [00:10,  1.10it/s][A
13it [00:11,  1.09it/s][A
14it [00:12,  1.09it/s][A
15it [00:13,  1.08it/s][A
16it [00:14,  1.10it/s][A
17it [00:15,  1.10it/s][A
18it [00:16,  1.12it/s][A
19it [00:17,  1.16it/s][A
20it [00:17,  1.17it/s][A
21it [00:18,  1.19it/s][A
22it [00:19,  1.19it/s][A
23it [00:20,  1.20it/s][A
24it [00:21,  1.17it/s][A
25it [00:22,  1.17it/s][A
26it [00:22,  1.17it/s][A
27it [00:23,  1.13it/s][A
28it [00:24,  1.12it/s][A
29it [00:25,  1.10it/s][A
30it [00:26,  1.10it/s][A
31it [00:27,  1.12it/s][A
32it [00:28,  1.14it/s][A
33it [00:29,  1.14it/s][A
34it [00:30,  1.14it/s][A
35it [00:30,  1.15it/s][A
36it [00:31,  1.16it/s][A
37it [00:32,  

296it [04:29,  1.03s/it][A
297it [04:30,  1.07s/it][A
298it [04:32,  1.07s/it][A
299it [04:32,  1.04s/it][A
300it [04:33,  1.02s/it][A
301it [04:34,  1.02s/it][A
302it [04:35,  1.01s/it][A
303it [04:36,  1.01s/it][A
304it [04:37,  1.01s/it][A
305it [04:38,  1.01it/s][A
306it [04:39,  1.03it/s][A
307it [04:40,  1.04it/s][A
308it [04:41,  1.07it/s][A
309it [04:42,  1.08it/s][A
310it [04:43,  1.05it/s][A
311it [04:44,  1.02it/s][A
312it [04:45,  1.05it/s][A
313it [04:46,  1.07it/s][A
314it [04:47,  1.09it/s][A
315it [04:48,  1.10it/s][A
316it [04:49,  1.11it/s][A
317it [04:49,  1.12it/s][A
318it [04:50,  1.13it/s][A
319it [04:51,  1.12it/s][A
320it [04:52,  1.12it/s][A
321it [04:53,  1.12it/s][A
322it [04:54,  1.12it/s][A
323it [04:55,  1.12it/s][A
324it [04:56,  1.11it/s][A
325it [04:57,  1.13it/s][A
326it [04:57,  1.10it/s][A
327it [04:58,  1.12it/s][A
328it [04:59,  1.13it/s][A
329it [05:00,  1.13it/s][A
330it [05:01,  1.14it/s][A
331it [05:02,  1.15i

588it [08:47,  1.13it/s][A
589it [08:48,  1.14it/s][A
590it [08:49,  1.14it/s][A
591it [08:50,  1.11it/s][A
592it [08:51,  1.12it/s][A
593it [08:52,  1.12it/s][A
594it [08:53,  1.10it/s][A
595it [08:54,  1.12it/s][A
596it [08:55,  1.12it/s][A
597it [08:55,  1.13it/s][A
598it [08:56,  1.14it/s][A
599it [08:57,  1.15it/s][A
600it [08:58,  1.15it/s][A
601it [08:59,  1.15it/s][A
602it [09:00,  1.15it/s][A
603it [09:01,  1.15it/s][A
604it [09:02,  1.15it/s][A
605it [09:02,  1.15it/s][A
606it [09:03,  1.16it/s][A
607it [09:04,  1.15it/s][A
608it [09:05,  1.15it/s][A
609it [09:06,  1.15it/s][A
610it [09:07,  1.12it/s][A
611it [09:08,  1.12it/s][A
612it [09:09,  1.14it/s][A
613it [09:09,  1.14it/s][A
614it [09:10,  1.12it/s][A
615it [09:11,  1.13it/s][A
616it [09:12,  1.13it/s][A
617it [09:13,  1.12it/s][A
618it [09:14,  1.10it/s][A
619it [09:15,  1.09it/s][A
620it [09:16,  1.10it/s][A
621it [09:17,  1.09it/s][A
622it [09:18,  1.03it/s][A
623it [09:19,  1.04i

880it [12:55,  1.19it/s][A
881it [12:56,  1.20it/s][A
882it [12:57,  1.20it/s][A
883it [12:57,  1.20it/s][A
884it [12:58,  1.20it/s][A
885it [12:59,  1.20it/s][A
886it [13:00,  1.19it/s][A
887it [13:01,  1.18it/s][A
888it [13:02,  1.18it/s][A
889it [13:02,  1.18it/s][A
890it [13:03,  1.19it/s][A
891it [13:04,  1.19it/s][A
892it [13:05,  1.20it/s][A
893it [13:06,  1.19it/s][A
894it [13:07,  1.19it/s][A
895it [13:07,  1.18it/s][A
896it [13:08,  1.18it/s][A
897it [13:09,  1.19it/s][A
898it [13:10,  1.19it/s][A
899it [13:11,  1.19it/s][A
900it [13:12,  1.19it/s][A
901it [13:13,  1.19it/s][A
902it [13:13,  1.18it/s][A
903it [13:14,  1.17it/s][A
904it [13:15,  1.18it/s][A
905it [13:16,  1.19it/s][A
906it [13:17,  1.20it/s][A
907it [13:18,  1.21it/s][A
908it [13:18,  1.21it/s][A
909it [13:19,  1.22it/s][A
910it [13:20,  1.22it/s][A
911it [13:21,  1.22it/s][A
912it [13:22,  1.22it/s][A
913it [13:22,  1.21it/s][A
914it [13:23,  1.21it/s][A
915it [13:24,  1.20i

Count Of training vocabs for which contextual refrences exist in data set 669 , Other Remaining Records= 492





In [None]:
'''
 ## Exploring the saved Q/A style contextual reference for each English noun in the training data set.
'''
train_dataset_path = "/dataset/JOKER/Task 2/train/joker_task2_train_context_aware_data_where_answers_exist.csv"
data_set_with_context_ref = pd.read_csv(base_path+train_dataset_path)
print("Total number of Q/A records:", len(data_set_with_context_ref))
data_set_with_context_ref.head(5)

Total number of Q/A records: 2172


Unnamed: 0.1,Unnamed: 0,id,context,question,answers
0,0,0,capidextre,ambipom,"{""text"": [""capidextre""], ""answer_start"": [0]}"
1,1,1,efflèche,dartrix,"{""text"": [""effl\u00e8che""], ""answer_start"": [0]}"
2,2,2,sepiatroce,malamar,"{""text"": [""sepiatroce""], ""answer_start"": [0]}"
3,3,3,croquine,bounsweet,"{""text"": [""croquine""], ""answer_start"": [0]}"
4,4,4,"eh, astérix et obélix sont de retour.",obelix,"{""text"": [""ob\u00e9lix""], ""answer_start"": [15]}"
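The `answers` column shown above appears to hold JSON strings with a `text` list and an `answer_start` list. As a minimal sketch, assuming that serialization, the check below confirms that each `answer_start` offset really points at the answer text inside the context. The helper name `answer_span_is_consistent` is hypothetical and not part of the notebook.

```python
import json

# One record copied from the preview above (row 4). Accented characters are
# written as escapes so the code points are unambiguous.
record = {
    "context": "eh, ast\u00e9rix et ob\u00e9lix sont de retour.",
    "question": "obelix",
    # Assumption: the column stores JSON text, as the preview suggests.
    "answers": r'{"text": ["ob\u00e9lix"], "answer_start": [15]}',
}


def answer_span_is_consistent(context, answers_json):
    """Return True if every answer text sits at its stated character offset."""
    answers = json.loads(answers_json)
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


print(answer_span_is_consistent(record["context"], record["answers"]))
```

Running the same check over every row of the saved CSV is a cheap way to catch offsets that drifted during lowercasing or stripping.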


We apply a similar transformation to the contextual references created for the English nouns in the test data set provided by the JOKER CLEF Task 2 team, changing their structure from two columns to the classical extractive Q/A data set style. Note that the test data set does not include French translations for its English nouns, so we cannot filter the references by checking whether the French text contains the corresponding translation. Instead, we assume that each French text contains the corresponding translation of at least one English noun matched in it.
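The transformation described above can be sketched on toy data. The rows of `refs` and the word list below are invented for illustration; only the column names `en` and `fr` and the whole-word matching via `np.frompyfunc` follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the contextual-reference table (illustrative rows only).
refs = pd.DataFrame({
    "en": ["obelix is back", "hello asterix", "obelix and asterix"],
    "fr": ["obélix est de retour", "bonjour astérix", "obélix et astérix"],
})

# Whole-word membership test, applied element-wise over an object array.
contains_word = np.frompyfunc(lambda text, w: w in text.split(), 2, 1)

questions, contexts, not_found = [], [], []
for word_en in ["obelix", "idefix"]:
    mask = contains_word(refs.en.to_numpy(), word_en).astype(bool)
    if mask.any():
        french_contexts = refs.loc[mask, "fr"].tolist()
        # Each matching French sentence becomes one Q/A record.
        questions += [word_en] * len(french_contexts)
        contexts += french_contexts
    else:
        not_found.append(word_en)

toy_qa = pd.DataFrame({
    "id": range(len(questions)),
    "context": contexts,
    "question": questions,
})
print(toy_qa)
```

Since no answers are available for the test set, each record carries only `id`, `context`, and `question`.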

In [None]:
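# Element-wise whole-word test: True where word `w` occurs in sentence `x`.
f = np.frompyfunc(lambda x,w: w in x.split(), 2,1)
# Empty Q/A frame; the test set has no answers, so only three columns are needed.
test_dataset_QA = pd.DataFrame(columns = ['id', 'context', 'question'])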

store_csv_name = "c_ref_test.csv"
df_contextual_refrences = pd.read_csv(base_path+"dataset/JOKER/csv/context/"+store_csv_name)

check_pr_en = [word.lower().strip() for word in test_data_set_2.en.tolist()]

listed_en = df_contextual_refrences.en.to_numpy()

test_nf = []
count_ref_for_test_vocab = 0

ids = []
questions = []
contexts = []
answers = []

for word_en in tqdm(check_pr_en):
    
    finds_en = f(listed_en, word_en.strip())
    
    if (finds_en.sum()>0):
        count_ref_for_test_vocab +=1
        french_contexts = df_contextual_refrences[finds_en].fr.to_list()
        questions+=[word_en for _ in french_contexts]
        contexts+=french_contexts
    
    else:
        test_nf.append(word_en)

test_dataset_QA.id = list(range(0,len(questions)))
test_dataset_QA.question = questions
test_dataset_QA.context = contexts

test_dataset_QA.to_csv(base_path +"dataset/JOKER/Task 2/test/joker_task2_test_context_aware_data_where_answers_exist.csv")
print("Count Of testing vocabs for which contextual refrences exist in data set", count_ref_for_test_vocab, ", Other Remaining Records=", len(check_pr_en)-count_ref_for_test_vocab)




Count Of testing vocabs for which contextual refrences exist in data set 185 , Other Remaining Records= 99





In [None]:
'''
 ## Exploring the saved Q/A style contextual reference for each English noun in the test data set.
'''
test_dataset_path = "/dataset/JOKER/Task 2/test/joker_task2_test_context_aware_data_where_answers_exist.csv"
data_set_with_context_ref = pd.read_csv(base_path+test_dataset_path)
print("Total number of Q/A records:", len(data_set_with_context_ref))
data_set_with_context_ref.head(5)

Total number of Q/A records: 6176


Unnamed: 0.1,Unnamed: 0,id,context,question
0,0,0,astronelle,orbeetle
1,1,1,tournicoton,gossifleur
2,2,2,blancoton,eldegoss
3,3,3,crabominable,crabominable
4,4,4,rubombelle,ribombee
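The `Unnamed: 0` style columns in the preview above are what pandas produces when a frame is written with its index and then read back. As a small sketch on toy data (the two rows are copied from the preview and the in-memory buffers are only for illustration), passing `index=False` to `to_csv` avoids them:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "id": [0, 1],
    "context": ["astronelle", "tournicoton"],
    "question": ["orbeetle", "gossifleur"],
})

# Default behaviour: the index is written as an unlabeled first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# With index=False only the real columns are written.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

Alternatively, `pd.read_csv(..., index_col=0)` reads the first column back as the index instead of as data.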
