<a href="https://colab.research.google.com/github/Santosh-Gupta/NaturalLanguageRecommendations/blob/master/notebooks/inference/DemoNaturalLanguageRecommendationsCPU_Autofeedback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is our simple Colab demo notebook. It runs on a CPU instance, but it is memory intensive and may crash a standard Colab CPU instance. If that happens, a message should appear at the bottom left asking whether you would like to switch to a 25 GB RAM instance, which provides more than enough memory.
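
As a rough sanity check on the memory requirement, the sketch below adds up the two embedding files reported in the download log (about 2.59 GB each) and assumes the L2-normalized copies created by the setup cell take the same space again. The helper name `estimate_embedding_memory_gb` and the assumption that raw and normalized arrays are resident at the same time are ours, not measurements; the model and the paper metadata come on top of this figure.

```python
def estimate_embedding_memory_gb(file_sizes_gb, copies_per_array=2):
    """Rough peak RAM for the embedding matrices, in GB.

    copies_per_array=2 assumes each raw array and its L2-normalized copy
    are both held in memory at once (an assumption, not a measurement).
    """
    return sum(file_sizes_gb) * copies_per_array


# Sizes reported by the download log for the two .npy files.
embedding_files_gb = [2.59, 2.59]
total_gb = estimate_embedding_memory_gb(embedding_files_gb)
print(f"Embeddings alone: about {total_gb:.1f} GB")
```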

If you would like to experiment with a GPU or TPU for inference, please see our advanced demo notebook: https://colab.research.google.com/github/Santosh-Gupta/NaturalLanguageRecommendations/blob/master/notebooks/inference/build_index_and_search.ipynb

### NOTE: This Colab notebook automatically records submitted queries (anonymously), which will be used to optimize the next iterations of our model. If you do not want your queries sent automatically, use this notebook instead: https://colab.research.google.com/drive/1Es_peeshcFsReqFEVr12GIiGPKD0KnKY

In [0]:
#@title Download and load the model, embeddings, and data. This will take several minutes. Double-click this cell to pop open the hood and check out the code.

!gdown --id "10LV9QbZOkUyOzR4nh8hxesoKJhpmvpM9"   # citation vectors
# !gdown --id "1-8gmT9cQpOUoZ_HzEaT9Xz6qfeVooAFn"
!gdown --id "1-23aNm7j0bnycvyd_OaQfofVYPTewgOI"   # abstract vectors
!gdown --id "1NyUQwgUNj9bFsiCnZ2TfKmWn5r-Y6wav"   # TitlesIdAbstractsEmbedIds
!gdown --id "1wIRsAApaE2L7E1fjnDOSSVBG1fY-LT9i" # Model
!wget 'https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar'
!tar -xvf 'scibert_scivocab_uncased.tar'

import zipfile
with zipfile.ZipFile('tfworld.zip', 'r') as zip_ref:
    zip_ref.extractall('')

!pip install transformers --quiet

%tensorflow_version 2.x
import numpy as np
import tensorflow as tf
from time import time
from tqdm import tqdm_notebook as tqdm
from transformers import BertTokenizer
import pandas as pd
from pprint import pprint

print('TensorFlow:', tf.__version__)

!gdown --id "1owiHXcDyTYecOq0Y27bOk0s4jgxmukTs"
!pip install --upgrade --quiet gspread
import gspread
from oauth2client.service_account import ServiceAccountCredentials
scope = ['https://www.googleapis.com/auth/spreadsheets']
credentials = ServiceAccountCredentials.from_json_keyfile_name('worksheet1.worksheet2.worksheet3.json', scope)
gc = gspread.authorize(credentials)
worksheet1 = gc.open_by_key('1qlIZAAK3ZYTb20KeOr9f-TAxYz7yQnvHUsdGP9Iakrc').sheet1
worksheet2 = gc.open_by_key('1AU37NTxsafd9GNhum2yR3iCux9nT9GAN5Bn4HaWcyU4').sheet1
worksheet3 = gc.open_by_key('1Vaxn8rWz0CufCeDF_Ip9lzErZjK3AUA3g02fYMBe5P4').sheet1

print('Loading Embeddings')
citations_embeddings = np.load('CitationSimilarityVectors106Epochs.npy')
abstract_embeddings = np.load('AbstractSimVectors.npy')
assert citations_embeddings.shape == abstract_embeddings.shape

normalizedC = tf.nn.l2_normalize(citations_embeddings, axis=1)
normalizedA = tf.nn.l2_normalize(abstract_embeddings, axis=1) 

print('Loading Model')
model = tf.saved_model.load('tfworld/inference_model/')
print('Loading Tokenizer')
tokenizer = BertTokenizer(vocab_file='scibert_scivocab_uncased/vocab.txt')

print('Loading Semantic Scholar CS data, almost done . . .')
df = pd.read_json('/content/TitlesIdsAbstractsEmbedIdsCOMPLETE_12-30-19.json.gzip', compression = 'gzip')
embed2Title = pd.Series(df['title'].values,index=df['EmbeddingID']).to_dict()
embed2Abstract = pd.Series(df['paperAbstract'].values,index=df['EmbeddingID']).to_dict()
embed2Paper = pd.Series(df['id'].values,index=df['EmbeddingID']).to_dict()

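import sys, os

# Keep a reference to the notebook's output stream; in Jupyter/Colab,
# sys.__stdout__ is typically not the stream that cell output is written to.
_notebook_stdout = sys.stdout

# Disable
def blockPrint():
    sys.stdout = open(os.devnull, 'w')

# Restore
def enablePrint():
    if sys.stdout is not _notebook_stdout:
        sys.stdout.close()
    sys.stdout = _notebook_stdout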

Downloading...
From: https://drive.google.com/uc?id=10LV9QbZOkUyOzR4nh8hxesoKJhpmvpM9
To: /content/CitationSimilarityVectors106Epochs.npy
2.59GB [00:38, 68.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-23aNm7j0bnycvyd_OaQfofVYPTewgOI
To: /content/AbstractSimVectors.npy
2.59GB [00:42, 60.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1NyUQwgUNj9bFsiCnZ2TfKmWn5r-Y6wav
To: /content/TitlesIdsAbstractsEmbedIdsCOMPLETE_12-30-19.json.gzip
432MB [00:05, 82.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=1wIRsAApaE2L7E1fjnDOSSVBG1fY-LT9i
To: /content/tfworld.zip
411MB [00:04, 85.2MB/s]
--2020-01-12 20:27:42--  https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.237.80
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.237.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4424

Use the cell below to search for papers. Our model was trained with full abstracts as the query, so it performs better with longer queries, but it works surprisingly well with short queries too. Give it a try.
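
The search cell encodes every query to a fixed length of 512 token ids (`max_length=512`, `pad_to_max_length=True`), so a short query is mostly padding while a long abstract fills the window. A minimal sketch of that idea in plain Python, ignoring special tokens; `pad_or_truncate`, the pad id of 0, and the toy ids are illustrative assumptions, not the SciBERT tokenizer's implementation:

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# Hypothetical token ids standing in for a short query and a long abstract.
short_query = list(range(1, 9))       # 8 tokens
long_abstract = list(range(1, 601))   # 600 tokens

for name, ids in [("short query", short_query), ("long abstract", long_abstract)]:
    fixed = pad_or_truncate(ids)
    real = sum(1 for t in fixed if t != 0)
    print(f"{name}: {len(fixed)} ids, {real} real tokens, {len(fixed) - real} padding")
```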

Our model was trained with a citation embedding as the label, but we found that running similarity search on our abstract embeddings also gives surprisingly robust results, so we included both. The first half of the printed results are from the abstract embeddings, the second half are from the citation embeddings.
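
The retrieval step amounts to a cosine-similarity search run once per embedding matrix, with the requested result count split between the two. The sketch below illustrates this with NumPy on random toy data; the helper names (`l2_normalize`, `top_k_cosine`), the matrix shapes, and the toy query are assumptions for illustration, not the notebook's actual data or code.

```python
import numpy as np


def l2_normalize(x, axis=1, eps=1e-12):
    """Scale rows to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def top_k_cosine(matrix, query, k):
    """Return (indices, scores) of the k rows most similar to the query."""
    scores = l2_normalize(matrix) @ l2_normalize(query[None, :])[0]
    order = np.argsort(-scores)[:k]
    return order, scores[order]


rng = np.random.default_rng(0)
# Toy stand-ins for the citation and abstract embedding matrices (assumed shapes).
citation_vectors = rng.normal(size=(1000, 16))
abstract_vectors = rng.normal(size=(1000, 16))
# A query vector placed very close to row 42 of the citation matrix.
query_vector = citation_vectors[42] + 0.01 * rng.normal(size=16)

top_k = 5
k_citation = (top_k + 1) // 2   # citation side gets the extra slot when top_k is odd
k_abstract = top_k // 2

cit_idx, cit_scores = top_k_cosine(citation_vectors, query_vector, k_citation)
abs_idx, abs_scores = top_k_cosine(abstract_vectors, query_vector, k_abstract)
print("citation hits:", cit_idx, np.round(cit_scores, 3))
print("abstract hits:", abs_idx, np.round(abs_scores, 3))
```

A full sort is used here for clarity; with millions of rows, a partial selection such as `np.argpartition` followed by sorting only the selected entries would avoid sorting the whole score vector.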

In [0]:
query = "Vector representation of text for information retrieval. Document embeddings for search. Vector representation of query. Embedding representation of queries. " #@param {type:"string"}

top_k_results = 50 #@param {type:"integer"}

# Split the result count between the two indexes;
# the citation side gets the extra slot when top_k_results is odd.
halfA = top_k_results // 2
halfC = top_k_results - halfA

abstract_encoded = tokenizer.encode(query, max_length=512, pad_to_max_length=True)
abstract_encoded = tf.constant(abstract_encoded, dtype=tf.int32)[None, :]
print('\nQuery : ')
pprint(query)

s = time()
bert_output = model(abstract_encoded)
xq = tf.nn.l2_normalize(bert_output, axis=1)
prediction_time = time() - s

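# Cosine similarity of the query against every citation embedding (all vectors are L2-normalized)
simNumpyC = np.matmul(normalizedC, tf.transpose(xq))
simNumpyCTopK = (-simNumpyC[:,0]).argsort()[:halfC]    # indices of the best matches
simNumpyC_oTopK = -np.sort(-simNumpyC[:,0])[:halfC]    # their similarity scores
allCit = np.vstack((simNumpyCTopK, simNumpyC_oTopK))   # row 0: indices, row 1: scores
del simNumpyC

# Same search against the abstract embeddings
simNumpyA = np.matmul(normalizedA, tf.transpose(xq))
simNumpyATopK = (-simNumpyA[:,0]).argsort()[:halfA]
simNumpyA_oTopK = -np.sort(-simNumpyA[:,0])[:halfA]
allAbs = np.vstack((simNumpyATopK, simNumpyA_oTopK))
del simNumpyA

# Abstract-based results first, then citation-based results
allResults = np.concatenate((allAbs, allCit), axis=1)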

print('\n')

print('------ Nearest papers  -----------------------------------------------------------')
print('\n')

for embed in allResults[0]:
    print('---------------')
    print('-------')
    print('---')
    title = embed2Title[int(embed)]
    abstractR = embed2Abstract[int(embed)]
    paperId = embed2Paper[int(embed)]
    print('Title: ', title)
    print('\nAbstract : ')
    pprint(abstractR)
    # print('\n')
    print('\nLink: https://www.semanticscholar.org/paper/'+paperId)
    print('---')
    print('-------')

blockPrint()
values_list = worksheet1.col_values(3)
worksheet1.update_cell(len(values_list)+1, 3, query)
enablePrint()




Query : 
('Vector representation of text for information retrieval. Document embeddings '
 'for search. Vector representation of query. Embedding representation of '
 'queries. ')


------ Nearest papers  -----------------------------------------------------------


---------------
-------
---
Title:  Discriminative features for document classification

Abstract : 
('Document representation using the bag-of-words approach may require bringing '
 'the dimensionality of the representation down in order to be able to make '
 'effective use of various statistical classification methods. Latent Semantic '
 'Indexing (LSI) is one such method that is based on eigendecomposition of the '
 'covariance of the document-term matrix. Another often used approach is to '
 'select a small number of most important features out of the whole set '
 'according to some relevant criterion. This paper points out that LSI ignores '
 'discrimination while concentrating on representation. Furthermore, selectio

## If you have additional feedback about a query, or feedback in general, we would very much appreciate it. It will help with the qualitative analysis of our models.

In [0]:
#@title Feedback about a particular query

# Console output from the sheet update is silenced by blockPrint()/enablePrint() below.

query = "Vector representation of text for information retrieval. Document embeddings for search. Vector representation of query. Embedding representation of queries. " #@param {type:"string"}

feedback = "First result didn't seem to say anything about negative sampling" #@param {type:"string"}

blockPrint()
values_list = worksheet2.col_values(3)
values_list2 = worksheet2.col_values(4)
rowV = max(len(values_list) , len(values_list2) )
worksheet2.update_cell(rowV+1, 3, query)
worksheet2.update_cell(rowV+1, 4, feedback)
enablePrint()

print('Submitted')
print('Query recorded, ', query)
print('Feedback recorded, ', feedback)

In [0]:
#@title Feedback

# Console output from the sheet update is silenced by blockPrint()/enablePrint() below.

feedback = "UI could use some work" #@param {type:"string"}

blockPrint()
values_list = worksheet3.col_values(3)
values_list2 = worksheet3.col_values(4)
rowV = max(len(values_list) , len(values_list2) )
worksheet3.update_cell(rowV+1, 3, feedback)
enablePrint()

print('Submitted')
print('Feedback recorded, ', feedback)