# Search

To create search I firstly tried the suggested approach, then tried Iterative PCA, then doc2vec and then a BERT model.

The suggested approach could only work on the subset of the dataset, because of the memory limitations. 

Iterative PCA allowed the model to run, however the training process takes an hour to fit the pca model.

doc2vec produced bad results for searching.

BERT model produced pretty mediocre results and takes quite a long time to encode vectors.

In every apprach I took a subset of samples either due to memory limitations or due to slow processing speed.

As a last resort, I just took PCA, fitted it on a random subset of samples and transformed the whole dataset. It produced good results and was not running OOM, however it may not have selected best PC, that would explain best variance on the whole set.

NOTE: It might be better to restart kernel and run first three cells before each approach due to memory limitations.

In [14]:
import pandas as pd
import nltk

ds_path = "datasets/train.jsonl"
df = pd.read_json(ds_path, lines=True)
df.head()

Unnamed: 0,id,verifiable,label,claim,evidence
0,75397,VERIFIABLE,SUPPORTS,Nikolaj Coster-Waldau worked with the Fox Broa...,"[[[92206, 104971, Nikolaj_Coster-Waldau, 7], [..."
1,150448,VERIFIABLE,SUPPORTS,Roman Atwood is a content creator.,"[[[174271, 187498, Roman_Atwood, 1]], [[174271..."
2,214861,VERIFIABLE,SUPPORTS,"History of art includes architecture, dance, s...","[[[255136, 254645, History_of_art, 2]]]"
3,156709,VERIFIABLE,REFUTES,Adrienne Bailon is an accountant.,"[[[180804, 193183, Adrienne_Bailon, 0]]]"
4,83235,NOT VERIFIABLE,NOT ENOUGH INFO,System of a Down briefly disbanded in limbo.,"[[[100277, None, None, None]]]"


In [2]:
verif_df = df[(df['verifiable']=='VERIFIABLE') & (df['label']=='SUPPORTS')]
verif_df.head()

Unnamed: 0,id,verifiable,label,claim,evidence
0,75397,VERIFIABLE,SUPPORTS,Nikolaj Coster-Waldau worked with the Fox Broa...,"[[[92206, 104971, Nikolaj_Coster-Waldau, 7], [..."
1,150448,VERIFIABLE,SUPPORTS,Roman Atwood is a content creator.,"[[[174271, 187498, Roman_Atwood, 1]], [[174271..."
2,214861,VERIFIABLE,SUPPORTS,"History of art includes architecture, dance, s...","[[[255136, 254645, History_of_art, 2]]]"
5,129629,VERIFIABLE,SUPPORTS,Homeland is an American television spy thrille...,"[[[151831, 166598, Homeland_-LRB-TV_series-RRB..."
8,33078,VERIFIABLE,SUPPORTS,The Boston Celtics play their home games at TD...,"[[[49158, 58489, Boston_Celtics, 3]], [[49159,..."


In [3]:
from sklearn.feature_extraction.text import CountVectorizer

to_tdm = CountVectorizer(tokenizer=nltk.word_tokenize)
to_tdm.fit(verif_df['claim'])

CountVectorizer(tokenizer=<function word_tokenize at 0x7f342ac05700>)

## PCA+TDM+hsnw

In [4]:
X = to_tdm.transform(verif_df['claim'])
TDM = pd.DataFrame(X.toarray(), columns=to_tdm.get_feature_names())

fit_transforming PCA. I took only 20000 samples because my RAM doesn't allow for more

In [5]:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1500)
tdm_pca = pd.DataFrame(pca.fit_transform(TDM[:20000]))
print(sum(pca.explained_variance_ratio_))

0.8267685654458785


In [6]:
tdm_pca.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1490,1491,1492,1493,1494,1495,1496,1497,1498,1499
0,0.274933,-0.403122,-0.190818,-0.426525,-0.058092,-0.038775,-0.581888,-0.14626,0.037089,-0.039497,...,0.030712,-0.003086,-0.059155,-0.000657,-0.011793,0.038035,0.014311,-0.028168,0.021815,-0.038385
1,-0.627415,0.324823,0.822102,0.09397,0.075213,0.06131,0.001915,-0.184241,0.143812,-0.045845,...,-0.028753,0.066659,0.047388,-0.090217,0.017659,0.019931,0.030875,-0.018938,-0.024505,0.054116
2,6.168995,6.258847,0.938537,-0.300605,-0.230533,-0.954852,0.466747,-0.259475,-0.463401,-0.286365,...,0.002015,-0.030375,-0.038482,-0.039791,0.000306,-0.017464,-0.033838,0.03539,-0.080861,0.048241
3,0.399398,-0.650302,0.701173,-0.791744,0.0461,0.149518,0.746968,0.231181,-0.740373,-0.375252,...,0.027823,-0.038555,0.016616,-0.065417,0.053753,-0.037554,-0.00623,0.032986,0.054645,0.027326
4,0.302666,-0.425505,-0.19107,-0.41449,-0.031847,-0.008057,-0.537349,-0.144615,0.060322,-0.075051,...,-0.034048,0.019152,0.009938,0.023361,-0.008969,0.014082,0.004835,-0.064524,0.039825,0.011461


In [7]:
import hnswlib
import numpy as np

dim = len(tdm_pca.columns)
num_elements = len(tdm_pca)
p = hnswlib.Index(space = 'cosine', dim = dim)

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
p.add_items(tdm_pca)
p.set_ef(50)

In [8]:
def process_query(query):
    a = to_tdm.transform([query]).toarray()
    labels, distances = p.knn_query(pca.transform(a), k = 10)
    return [i for i in verif_df.claim.iloc[labels[0]]]

In [9]:
print(process_query(input()))

['The Academy Awards have 24 awards.', 'The 79th Academy Awards honored films.', 'All About Eve won 6 Academy Awards.', 'Five Academy Awards were won by Braveheart.', 'The Academy Awards are given annually.', 'The Academy Awards are overseen by AMPAS.', 'The Incredibles won two Academy Awards.', "Schindler's List received seven Academy Awards.", 'The 79th Academy Awards began at 5:30 p.m. PST / 8:30 p.m. EST.', 'Braveheart was nominated for ten Academy Awards.']


## IncerementalPCA

IncerementalPCA takes an hour on google colab to train. It is not the worst time considering other data science tasks, however it is still too long.

In [4]:
from sklearn.decomposition import  IncrementalPCA
import numpy as np
import tqdm
ipca =  IncrementalPCA(n_components=1500)
num_rows= len(verif_df)//8
chunk_size = 10000
for i in tqdm.tqdm(range(0, num_rows//chunk_size)):
    X = to_tdm.transform(verif_df[i*chunk_size : (i+1)*chunk_size]['claim'])
    ipca.partial_fit(pd.DataFrame(X.toarray(), columns=to_tdm.get_feature_names()))
print(sum(ipca.explained_variance_ratio_))

100%|██████████| 1/1 [09:41<00:00, 581.57s/it]0.8526924910034374



In [9]:
X = to_tdm.transform(verif_df['claim'][:len(verif_df)//8])
TDM = pd.DataFrame(X.toarray(), columns=to_tdm.get_feature_names())
tdm_ipca = pd.DataFrame(ipca.transform(TDM))
print(sum(ipca.explained_variance_ratio_))

0.8526924910034374


In [10]:
import hnswlib
import numpy as np

dim = len(tdm_ipca.columns)
num_elements = len(tdm_ipca)
p = hnswlib.Index(space = 'cosine', dim = dim)

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
p.add_items(tdm_ipca)
p.set_ef(50)

In [13]:
def process_query(query):
    a = to_tdm.transform([query]).toarray()
    labels, distances = p.knn_query(ipca.transform(a), k = 10)
    return [i for i in verif_df.claim.iloc[labels[0]]]
print(process_query(input()))

['The Academy Awards have 24 awards.', 'All About Eve won 6 Academy Awards.', 'Five Academy Awards were won by Braveheart.', 'The 84th Academy Awards winners included Rango.', 'The Academy Awards are an annual event.', 'Jerry Goldsmith was nominated for eighteen Academy Awards.', 'Winona Rider was nominated for two Academy Awards.', 'The 84th Academy Awards winners included Saving Face.', 'Tom Cruise has been nominated for Academy Awards.', 'Judd Apatow has been nominated for Academy Awards.']


## doc2vec

Results are mediocre and training time is not that good as well. Maybe, using pretrained model improves the quality of the search. 

In [71]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(nltk.word_tokenize(doc.lower()), [doc]) for i, doc in verif_df.claim.iteritems()]
model = Doc2Vec(
    documents,
    vector_size=1000,
    window=2,
    min_count=2,
    workers=4,
    epochs=25,
    # hs=1,
    negative=5
)

model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

In [50]:
documents[8].tags

['There is a movie called The Hunger Games.']

In [51]:
d2v_df = pd.DataFrame([model.infer_vector(nltk.word_tokenize(doc.lower())) for doc in verif_df.claim])
d2v_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,240,241,242,243,244,245,246,247,248,249
0,-0.01517,0.009911,-0.005649,0.044883,-0.070835,0.018299,0.079309,-0.032559,0.017582,-0.022468,...,-0.039067,0.009005,0.029349,0.010545,-0.081545,0.01277,0.030488,0.003625,0.046975,-0.043632
1,-0.017702,-0.056137,-0.03228,0.095168,-0.030644,-0.09085,-0.076155,-0.003721,-0.000136,-0.092783,...,0.003352,0.015115,-0.026281,0.058902,0.03929,-0.040519,0.10333,-0.070613,-0.033845,-0.023018
2,0.192599,-0.288584,-0.088721,0.032468,-0.106758,0.072062,0.091159,0.283151,-0.086075,-0.105104,...,0.173682,-0.18456,0.126438,0.00415,-0.006492,-0.141945,0.186002,-0.068589,0.081722,0.049389
3,0.066397,0.102317,-0.013497,0.010727,0.055052,-0.089071,-0.047674,0.0213,0.007775,-0.116966,...,-0.073393,0.005211,-0.04435,0.036146,0.056494,-0.010376,0.024438,-0.063543,-0.046629,0.03308
4,0.105879,-0.089824,-0.032577,0.059719,0.046317,0.020823,0.11976,0.156948,0.031081,-0.073652,...,0.08138,-0.024795,0.083456,-0.093101,-0.071669,0.058142,-0.017791,-0.069858,0.030407,0.061695


In [52]:
import hnswlib
import numpy as np

dim = len(d2v_df.columns)
num_elements = len(d2v_df)
p = hnswlib.Index(space = 'cosine', dim = dim)

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
p.add_items(d2v_df)
p.set_ef(50)

In [53]:
def process_query_d2v(query):
    a = model.infer_vector(nltk.word_tokenize(query.lower()))
    labels, distances = p.knn_query(a, k = 10)
    return [i for i in verif_df.claim.iloc[labels[0]]]

In [74]:
print(process_query_d2v('Academy Awards'))

[[ 8431 43566 34240 51507 29183 73858 34000 25407 27924 17363]]
['Sarah Paulson was nominated for a Golden Globe Award.', 'Shania Twain plays music.', 'John Mayer won a Grammy Award.', 'Bob Marley wrote songs.', 'Anne Bancroft won two Emmy Awards.', 'Claire Danes received a Golden Globe Award.', 'John Cena is a professional WWE wrestler.', 'The 84th Academy Awards winners included Undefeated.', 'Foxcatcher was nominated for Best Actor.', 'Rihanna has sold more than 230 million records worldwide.']


In [77]:
query = 'Academy Award.'
print(model.docvecs.most_similar(positive=[model.infer_vector(nltk.word_tokenize(query.lower()))]))

[('Winona Ryder was nominated for an Academy Award.', 0.9120476245880127), ('Walt Disney smoked.', 0.9055764079093933), ('Liverpool F.C. plays football.', 0.9034287333488464), ('Guyana shares a border with Venezuela.', 0.9022931456565857), ("Inhumans's main character's full name is Blackagar Boltagon.", 0.9004422426223755), ('Scotland includes islands.', 0.8989565372467041), ('Maggie Gyllenhaal acts.', 0.8986670970916748), ('Daniel Craig has appeared in multiple movies.', 0.898438036441803), ('Doctor Strange is a fictional superhero.', 0.8980916738510132), ('Selene serves as a character.', 0.8965005874633789)]


## BERT model

I used roberta-base model from huggingface. It gives results that are somewhat similar to doc2vec results and take too long to encode vectors.

In [93]:
from simpletransformers.language_representation import RepresentationModel

model = RepresentationModel(model_type='roberta', model_name='roberta-base', use_cuda=True)

vec = model.encode_sentences(verif_df[:20000].claim, combine_strategy='mean')

f.query.weight', 'roberta.encoder.layer.0.attention.self.query.bias', 'roberta.encoder.layer.0.attention.self.key.weight', 'roberta.encoder.layer.0.attention.self.key.bias', 'roberta.encoder.layer.0.attention.self.value.weight', 'roberta.encoder.layer.0.attention.self.value.bias', 'roberta.encoder.layer.0.attention.output.dense.weight', 'roberta.encoder.layer.0.attention.output.dense.bias', 'roberta.encoder.layer.0.attention.output.LayerNorm.weight', 'roberta.encoder.layer.0.attention.output.LayerNorm.bias', 'roberta.encoder.layer.0.intermediate.dense.weight', 'roberta.encoder.layer.0.intermediate.dense.bias', 'roberta.encoder.layer.0.output.dense.weight', 'roberta.encoder.layer.0.output.dense.bias', 'roberta.encoder.layer.0.output.LayerNorm.weight', 'roberta.encoder.layer.0.output.LayerNorm.bias', 'roberta.encoder.layer.1.attention.self.query.weight', 'roberta.encoder.layer.1.attention.self.query.bias', 'roberta.encoder.layer.1.attention.self.key.weight', 'roberta.encoder.layer.1.atte

In [95]:
vec_df = pd.DataFrame(vec)

In [96]:
import hnswlib
import numpy as np

dim = len(vec_df.columns)
num_elements = len(vec_df)
p = hnswlib.Index(space = 'cosine', dim = dim)

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
p.add_items(vec_df)
p.set_ef(50)

In [98]:
def process_query(query):
    a = model.encode_sentences(query, combine_strategy='mean')[0]
    labels, distances = p.knn_query(a, k = 10)
    return [i for i in verif_df.claim.iloc[labels[0]]]
print(process_query('Academy Awards'))


['Frank Sinatra received critical acclaim for his performance in The Manchurian Candidate.', 'A co-producer of From the Earth to the Moon is credited to be Michael Bostick.', 'Aishwarya Raj was nominated eleven times for films Aishwarya Raj was in.', 'War Dogs stars Jonah Hill, Miles Teller, Ana de Armas and Bradley Cooper.', 'A 2009 war film featured Mike Myers starring in a small role.', 'The Attitude Era of WWE saw Dwayne "The Rock" Johnson become a major figure.', 'The programmer of Tetris was Alexey Pajitnov.', 'Maggie Gyllenhaal has been nominated for an Oscar.', "Tesla Model S was ranked the world's best-selling plug-in electric car for 2015.", 'Tyrese Gibson is well known for his role as Joseph "Jody" Summers in Baby Boy.']


## PCA fitted on subset, transform all dataset

In [10]:
X = to_tdm.transform(verif_df['claim'])
TDM = pd.DataFrame(X.toarray(), columns=to_tdm.get_feature_names())

In [11]:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=1500)
pca.fit(TDM.sample(n=10000))
tdm_pca = pd.DataFrame(pca.transform(TDM))
print(sum(pca.explained_variance_ratio_))

0.8538933408821134


In [12]:
import hnswlib
import numpy as np

dim = len(tdm_pca.columns)
num_elements = len(tdm_pca)
p = hnswlib.Index(space = 'cosine', dim = dim)

p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
p.add_items(tdm_pca)
p.set_ef(50)

In [13]:
def process_query(query):
    a = to_tdm.transform([query]).toarray()
    labels, distances = p.knn_query(pca.transform(a), k = 10)
    return [i for i in verif_df.claim.iloc[labels[0]]]
print(process_query(input()))

['Braveheart won five Academy Awards.', 'Daniel Day-Lewis earned numerous awards like Academy Awards.', 'The Academy Awards have 24 awards.', 'Judi Dench has won Academy Awards.', 'The Academy Awards have multiple awards.', 'The 79th Academy Awards honored films.', 'Overwatch lets players gain cosmetic awards.', 'Jack Nicholson has won Academy Awards.', 'Laurence Olivier received four Academy Awards.', 'All About Eve won 6 Academy Awards.']
