# Document test set

#### In this notebook, we create a test set from WikiCLIR, a cross-lingual retrieval dataset built from Wikipedia, to test our final models at the query-document level.

In [1]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

#### We first import the data from our data folder and prepare it for the retrieval test. We focus on 5000 positive examples here. We can then use our pipeline in the same manner as for the other test sets to create a test of 100 queries and 5000 documents.

In [2]:
# Read the queries: each line holds a query id and the query text, separated by a tab
source_id, queries = list(), list()
with open('../data/external/test.queries', encoding="utf8") as f:
    for line in f:
        # Strip the line break so it does not end up inside the query text
        split = line.rstrip('\n').split('\t')
        source_id.append(split[0])
        queries.append(split[1])
feature_retrieval_query = pd.DataFrame({'id_source': source_id, 'text_source': queries})
# Keep the first 5000 rows as queries to make the test set comparable
feature_retrieval_query = feature_retrieval_query.iloc[:5000]
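
The manual loop above can also be expressed with `pd.read_csv`. The following is a minimal sketch rather than the notebook's own method: the helper name `read_tsv_pairs` and the sample lines are made up for illustration, and it assumes the id and the text sit in the first two tab-separated fields.

```python
import csv
import io

import pandas as pd


def read_tsv_pairs(source, id_col, text_col):
    """Load the first two tab-separated fields of each line as string columns.

    Hypothetical helper, not part of the original notebook.
    """
    return pd.read_csv(
        source,
        sep='\t',
        header=None,             # the file has no header row
        names=[id_col, text_col],
        usecols=[0, 1],          # ignore any further fields
        dtype=str,               # keep ids as strings, as the manual loop does
        quoting=csv.QUOTE_NONE,  # treat quote characters in the text literally
        keep_default_na=False,   # keep strings such as "NA" as text
        encoding='utf8',
    )


# Made-up stand-in for the contents of a queries file
sample = io.StringIO(
    '311\tdie afroasiatischen sprachen\n'
    '331\t"amnestie" ist ein begriff\n'
)
toy_queries = read_tsv_pairs(sample, 'id_source', 'text_source')
print(toy_queries)
```

Passing `nrows=5000` to `pd.read_csv` would limit the read to the first 5000 lines, which corresponds to the `iloc[:5000]` step.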

In [3]:
# Read the documents: each line holds a document id and the document text, separated by a tab
target_id, documents = list(), list()
with open('../data/external/test.docs', encoding="utf8") as f:
    for line in f:
        # Strip the line break so it does not end up inside the document text
        split = line.rstrip('\n').split('\t')
        target_id.append(split[0])
        documents.append(split[1])
feature_retrieval_documents = pd.DataFrame({'text_target': documents, 'id_target': target_id})

In [4]:
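# Read the relevance judgements: fields 0, 2 and 3 hold the query id, document id and label
source_id, target_id, translation = list(), list(), list()
with open('../data/external/test.qrel', encoding="utf8") as f:
    for line in f:
        split = line.split('\t')
        source_id.append(split[0])
        target_id.append(split[2])
        # Take the first character of the label field, which also drops the line break
        translation.append(split[3][0])
feature_retrieval_translation = pd.DataFrame({'id_source': source_id, 'id_target': target_id, 'translation': translation})
# Recode the labels: '3' becomes 1 (positive pair) and '2' becomes 0.
# The result is assigned back because in-place replacement on a selected column
# triggers a chained-assignment warning in recent pandas versions.
feature_retrieval_translation['translation'] = feature_retrieval_translation['translation'].replace({'3': 1, '2': 0})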

In [5]:
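# Attach the relevance labels to the queries
feature_retrieval = pd.merge(left=feature_retrieval_query, right=feature_retrieval_translation, on='id_source')
# Keep only the positive query-document pairs
feature_retrieval = feature_retrieval[feature_retrieval['translation'] == 1]
# Attach the document text and drop the label column, which is now constant
feature_retrieval = feature_retrieval.merge(right=feature_retrieval_documents, on='id_target').drop(columns=['translation'])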
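
After two merges and a filter it can be useful to check what the resulting table contains, for example whether any query ended up with more than one positive document. The following is a minimal sketch of such a check. The helper `summarize_pairs` and the toy rows are hypothetical, and only the column names are taken from the cells above.

```python
import pandas as pd


def summarize_pairs(pairs):
    """Return simple counts describing a table of positive query-document pairs.

    Hypothetical helper, not part of the original notebook.
    """
    # Number of distinct documents attached to each query
    per_query = pairs.groupby('id_source')['id_target'].nunique()
    # Rows where either the query text or the document text is missing
    missing = pairs[['text_source', 'text_target']].isna().any(axis=1)
    return {
        'pairs': len(pairs),
        'queries': int(per_query.size),
        'queries_with_several_documents': int((per_query > 1).sum()),
        'missing_text': int(missing.sum()),
    }


# Made-up rows that mimic the column layout of feature_retrieval
toy_pairs = pd.DataFrame({
    'id_source': ['311', '331', '331'],
    'text_source': ['query a', 'query b', 'query b'],
    'id_target': ['599', '18947898', '4502'],
    'text_target': ['document a', 'document b', 'document c'],
})
summary = summarize_pairs(toy_pairs)
print(summary)
```

Calling `summarize_pairs(feature_retrieval)` on the real table would show how many of the first 5000 queries actually received a positive document.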

In [7]:
feature_retrieval

Unnamed: 0,id_source,text_source,id_target,text_target
0,311,die afroasiatischen ( früher auch als hamito -...,599,afroasiatic languages afroasiatic afro asiatic...
1,331,"( von ‚ begnadigung , straferlass , amnestie '...",18947898,amnesty international amnesty international co...
2,561,die ( bíos ‚leben ' ; auch als synonym zu biot...,4502,biotechnology biotechnology is the use of livi...
3,658,eine ist die beschreibung der individuellen zu...,55309,blood type a blood type also called a blood gr...
4,708,( englische aussprache [ ˈbuːtən ] ; von engl ...,40909,booting in computing booting also known as boo...
5,782,ein [ kɔmˈpjuːtɐ ] oder rechner ist ein gerät ...,7878457,computer a computer is a general purpose devic...
6,785,( [ çeˈmi : ] ; mittel - und norddeutsch auch ...,5180,chemistry chemistry a branch of physical scien...
7,820,"( gesprochen [ çirʊrˈɡiː ] , regional auch [ k...",45599,surgery surgery from the cheirourgikē composed...
8,870,"vom altgriechischen namen grc kálliste "" die s...",43126,callisto ( moon ) callisto jupiter iv is a moo...
9,913,waffen ( auch chemiewaffen ) sind toxisch wirk...,27179600,chemical weapon a chemical weapon cw is a devi...


In [8]:
# Save the test set to disk, then load it back
feature_retrieval.to_pickle('feature_retrieval_doc.pickle')
retrieval = pd.read_pickle('feature_retrieval_doc.pickle')
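
Instead of comparing the printed tables by eye, the round trip through the pickle file can be checked programmatically with `DataFrame.equals`. The following is a minimal sketch on a made-up one-row frame. It writes to a temporary directory so that it does not touch `feature_retrieval_doc.pickle`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with the same columns as the saved test set
toy = pd.DataFrame({
    'id_source': ['311'],
    'text_source': ['query a'],
    'id_target': ['599'],
    'text_target': ['document a'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_retrieval.pickle')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# True only if values, dtypes and index all survived the round trip
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

In the notebook itself, the equivalent check would be `retrieval.equals(feature_retrieval)`.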

In [9]:
retrieval

Unnamed: 0,id_source,text_source,id_target,text_target
0,311,die afroasiatischen ( früher auch als hamito -...,599,afroasiatic languages afroasiatic afro asiatic...
1,331,"( von ‚ begnadigung , straferlass , amnestie '...",18947898,amnesty international amnesty international co...
2,561,die ( bíos ‚leben ' ; auch als synonym zu biot...,4502,biotechnology biotechnology is the use of livi...
3,658,eine ist die beschreibung der individuellen zu...,55309,blood type a blood type also called a blood gr...
4,708,( englische aussprache [ ˈbuːtən ] ; von engl ...,40909,booting in computing booting also known as boo...
5,782,ein [ kɔmˈpjuːtɐ ] oder rechner ist ein gerät ...,7878457,computer a computer is a general purpose devic...
6,785,( [ çeˈmi : ] ; mittel - und norddeutsch auch ...,5180,chemistry chemistry a branch of physical scien...
7,820,"( gesprochen [ çirʊrˈɡiː ] , regional auch [ k...",45599,surgery surgery from the cheirourgikē composed...
8,870,"vom altgriechischen namen grc kálliste "" die s...",43126,callisto ( moon ) callisto jupiter iv is a moo...
9,913,waffen ( auch chemiewaffen ) sind toxisch wirk...,27179600,chemical weapon a chemical weapon cw is a devi...
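
The pipeline that turns this table into a test of 100 queries and 5000 documents is not shown in this notebook. The following is only a sketch of one possible way to do it, under stated assumptions: queries are sampled at random with a fixed seed, every document in the table forms the candidate pool, and each query keeps the first positive document listed for it. The function `build_query_document_test` and the toy rows are hypothetical.

```python
import pandas as pd


def build_query_document_test(pairs, n_queries=100, seed=0):
    """Sample queries and return them together with the full document pool.

    Hypothetical helper: the sampling scheme is an assumption, not the
    pipeline used in the original project.
    """
    # Candidate pool: every distinct document in the table
    documents = (
        pairs[['id_target', 'text_target']]
        .drop_duplicates('id_target')
        .reset_index(drop=True)
    )
    # One row per query, keeping the first positive document listed for it
    unique_queries = pairs.drop_duplicates('id_source')
    n = min(n_queries, len(unique_queries))
    queries = (
        unique_queries.sample(n=n, random_state=seed)[['id_source', 'text_source', 'id_target']]
        .rename(columns={'id_target': 'id_relevant'})
        .reset_index(drop=True)
    )
    return queries, documents


# Made-up rows that mimic the column layout of the saved test set
toy_pairs = pd.DataFrame({
    'id_source': ['1', '2', '3', '4', '5'],
    'text_source': ['q1', 'q2', 'q3', 'q4', 'q5'],
    'id_target': ['10', '20', '30', '40', '50'],
    'text_target': ['d1', 'd2', 'd3', 'd4', 'd5'],
})
toy_test_queries, toy_test_documents = build_query_document_test(toy_pairs, n_queries=2)
print(len(toy_test_queries), len(toy_test_documents))
```

Each sampled query carries the id of its relevant document in `id_relevant`, so a model's ranking over the document pool can be scored against it.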
