In [9]:
import itertools
import os
from src.data.text_retriever import TextRetriever
from src.data.make_dataset import TrainTripletsDataset, ValPairsDataset

In [10]:
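# The notebook runs one level below the project root, so step up to the parent directory.
parent_dir = os.path.abspath(os.pardir)
# Build the QQP file paths in a platform-independent way.
qqp_dir = os.path.join(parent_dir, 'data', 'raw', 'QQP')
retriever = TextRetriever(os.path.join(qqp_dir, 'train.tsv'), os.path.join(qqp_dir, 'dev.tsv'))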
train_df = retriever.train_df
val_df = retriever.val_df

In [4]:
train_df.tail()

Unnamed: 0,id_left,id_right,text_left,text_right,label
363841,100941,83372,How do I make money flying my drone?,How can I use a dji phantom to make money,1
363842,62873,34460,What can you do with an economics degree?,What jobs can you get with an economics degree?,1
363843,217377,217378,What type of current does a battery produce?,How does a generator work and produce current?,0
363844,425744,285638,Grammar: What is difference between schedule a...,How do I understand the difference between the...,0
363845,39774,20105,What is the easiest way to earn money using in...,How can I earn money online easily?,1


Data description
- id_left and id_right - identifiers of the left and right questions
- text_left and text_right - texts of these questions
- label - 1 when the two questions are similar, 0 when they are not

I create triplets for the training phase and pairs for validation. Triplets come in 3 variations:

- left_id | relevant right_id question | irrelevant right_id from left_id group | 1.0 target
- left_id | relevant right_id question | irrelevant question from entire dataset | 1.0 target
- left_id | irrelevant right_id question | irrelevant question from left_id group | 0.5 target
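
The three variations above can be sketched for a single id_left group as follows. This is only an illustration of the scheme, not the actual code of `TrainTripletsDataset.create_train_triplets`: the function name `sketch_triplets`, its arguments and the way negatives are sampled are assumptions, and the sampling limits (`num_positive_examples`, `num_same_rel_examples`) are left out.

```python
import itertools
import random


def sketch_triplets(left_id, relevant, irrelevant, outside_ids, seed=0):
    """Build the three triplet variations for one id_left group (hypothetical helper)."""
    rng = random.Random(seed)
    triplets = []
    for pos in relevant:
        # Variation 1: relevant question vs. irrelevant question from the same group.
        for neg in irrelevant:
            triplets.append([left_id, pos, neg, 1.0])
        # Variation 2: relevant question vs. a random question from the rest of the dataset.
        triplets.append([left_id, pos, rng.choice(outside_ids), 1.0])
    # Variation 3: two irrelevant questions from the same group, target 0.5.
    for neg_a, neg_b in itertools.combinations(irrelevant, 2):
        triplets.append([left_id, neg_a, neg_b, 0.5])
    return triplets


# Toy group built from ids that appear in this notebook; the grouping is for illustration only.
example = sketch_triplets(
    left_id='10024',
    relevant=['12018', '27456'],
    irrelevant=['29590', '159258'],
    outside_ids=['477559', '10382'],
)
```

With 2 relevant and 2 irrelevant questions, the sketch yields 4 triplets of the first kind, 2 of the second and 1 of the third.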

In [5]:
%%time
triplets = TrainTripletsDataset.create_train_triplets(train_df,
                                                      seed=0,
                                                      num_positive_examples=4,
                                                      num_same_rel_examples=2)

CPU times: total: 6.62 s
Wall time: 23.2 s


In [6]:
len(triplets)

8063

In [7]:
triplets[:8]

[['10024', '117', '159258', 1.0],
 ['10024', '12018', '159258', 1.0],
 ['10024', '12018', '10382', 1.0],
 ['10024', '37121', '29590', 1.0],
 ['10024', '37121', '477559', 1.0],
 ['10024', '29590', '10382', 0.5],
 ['10024', '29590', '159258', 0.5],
 ['100294', '491115', '100295', 1.0]]

##### Everything is fine. 4 positive examples were chosen from the same group (id_left=10024), 477559 is a random sample from the entire train dataset, and the 2 examples with 0.5 relevancy are pairs of questions irrelevant to 10024.

In [8]:
train_df[train_df['id_left']=='10024']

Unnamed: 0,id_left,id_right,text_left,text_right,label
18623,10024,12018,How do I reset my Gmail password when I don't ...,How can I reset my Gmail password when I don't...,1
36292,10024,27456,How do I reset my Gmail password when I don't ...,I forgot my Gmail password and I can't answer ...,1
54955,10024,64009,How do I reset my Gmail password when I don't ...,How can I add a recovery phone number to my Gm...,1
113459,10024,29590,How do I reset my Gmail password when I don't ...,How do I reset my Instagram password if I put ...,0
117166,10024,61676,How do I reset my Gmail password when I don't ...,How do I reset my password to Gmail without my...,1
139644,10024,159258,How do I reset my Gmail password when I don't ...,I forgot my Facebook password. I don't remembe...,0
198999,10024,21153,How do I reset my Gmail password when I don't ...,How can I reset my Gmail password if I don't r...,1
261942,10024,115735,How do I reset my Gmail password when I don't ...,How can I reset my Gmail password without know...,1
288171,10024,143891,How do I reset my Gmail password when I don't ...,How can I access my Gmail account if I don't r...,1
291969,10024,23752,How do I reset my Gmail password when I don't ...,How do I reset my Gmail password when I don't ...,1


Validation pairs have 3 targets:
- left_id | relevant question from the same group | 2.0 target
- left_id | irrelevant question from the same group | 1.0 target
- left_id | irrelevant question from entire data | 0.0 target

In [11]:
%%time
val_pairs = ValPairsDataset.create_val_pairs(val_df, fill_top_to=15, min_group_size=2, seed=0)

CPU times: total: 16 s
Wall time: 1min 22s


For every __id_left group__ the model has to rank the three kinds of pairs described above correctly. If there are fewer than 15 rows in a group, it is filled up to 15 with random examples that have relevancy 0 to the __id_left question__.
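
The filling step can be sketched for one group as below. This is a minimal illustration, not the actual body of `ValPairsDataset.create_val_pairs`: the helper name `sketch_val_pairs`, its arguments and the candidate pool of outside ids are assumptions made for the example.

```python
import random


def sketch_val_pairs(left_id, relevant, irrelevant, outside_ids, fill_top_to=15, seed=0):
    """Build validation pairs for one id_left group and pad them (hypothetical helper)."""
    rng = random.Random(seed)
    # Target 2: relevant questions from the same group.
    pairs = [[left_id, q, 2] for q in relevant]
    # Target 1: irrelevant questions from the same group.
    pairs += [[left_id, q, 1] for q in irrelevant]
    # Target 0: random questions from outside the group, added until the
    # group holds fill_top_to rows (or the candidate pool is exhausted).
    n_missing = max(0, fill_top_to - len(pairs))
    used = set(relevant) | set(irrelevant) | {left_id}
    candidates = [q for q in outside_ids if q not in used]
    fillers = rng.sample(candidates, min(n_missing, len(candidates)))
    pairs += [[left_id, q, 0] for q in fillers]
    return pairs


# The outside ids here are made up for the example.
toy_outside_ids = [str(i) for i in range(1000, 1020)]
toy_pairs = sketch_val_pairs('100141', ['75743', '100142'], [], toy_outside_ids)
```

For a group with only 2 relevant rows, as in the id_left=100141 example, the sketch adds 13 pairs with target 0 so that the group reaches 15 rows.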

In [14]:
val_pairs[:5]

[['100141', '75743', 2],
 ['100141', '100142', 2],
 ['100141', '147228', 0],
 ['100141', '293530', 0],
 ['100141', '121016', 0]]

In [15]:
val_df[val_df['id_left']=='100141']

Unnamed: 0,id_left,id_right,text_left,text_right,label
13278,100141,75743,What should I shouldn't do when visiting your ...,What should I absolutely not do when visiting ...,1
22909,100141,100142,What should I shouldn't do when visiting your ...,What things should I not do when visiting your...,1
