<a href="https://colab.research.google.com/github/Khaninsi/patent-citation-prediction/blob/main/IR_Model_for_EV_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Question-Answering using Simple Wikipedia Index

This examples demonstrates the setup for Query / Question-Answer-Retrieval.

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve 100 potentially passages that answer the input query.

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores the query and all retrieved passages for their relevancy. The cross-encoder is neccessary to filter out certain noise
that might be retrieved from the semantic search step.


In [5]:
!pip install -U sentence-transformers rank_bm25 sklearn spacy nltk gensim


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[K     |████████████████████████████████| 85 kB 3.0 MB/s 
[?25hCollecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting gensim
  Downloading gensim-4.2.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
[K     |████████████████████████████████| 24.1 MB 1.6 MB/s 
[?25hCollecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 42.9 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 58.7 MB/s 
[?25hCollecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.11.1-

In [6]:
from sklearn.metrics import ndcg_score
import pandas as pd
from tqdm.autonotebook import tqdm
import numpy as np
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings(action='ignore',category=UserWarning,module='gensim')  
warnings.filterwarnings(action='ignore',category=FutureWarning,module='gensim')
warnings.simplefilter('ignore')
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import time
from gensim.models import Word2Vec

  from tqdm.autonotebook import tqdm


In [7]:
from google.colab import drive
#drive.mount('/content/drive/drive/Shareddrives/ISE_540_Project/data')
drive.mount('/content/drive')
## Change directory
import os
os.chdir("drive/Shareddrives/ISE_540_Project/data")

Mounted at /content/drive


In [4]:
# Read the file
import json
f = open("patent_information.json")
# returns JSON object as a dictionary
patent_dict = json.load(f)
f.close()

In [None]:
len(patent_dict)

35148

In [None]:
## NOTE: The first 800 files are queries, the remaining are documents
queries = patent_dict[:800]
docs = patent_dict[800:]
print(len(queries))


## TODO: Some files are the same, remove duplicates
query_dict = {}
for query in queries:
    if query['title'] in query_dict.values():
        continue
    else:
        query_dict[query['patent']] = query['title']

print('Before remove duplicated queries: {}'.format(len(queries)))
print('After remove duplicated queries: {}'.format(len(query_dict)))

800
Before remove duplicated queries: 800
After remove duplicated queries: 765


In [None]:
# patent_id_doc = [doc['patent'] for doc in docs]
# Select only US patent
docs = [doc for doc in docs if doc['patent'][:2] == 'US']
print(len(docs))

patent_id_doc = [doc['patent'] for doc in docs]
patent_id_doc[:10]

31856


['US20050108642A1',
 'US7454351B2',
 'US20150312666A1',
 'US20090112572A1',
 'US20080300768A1',
 'US7411316B2',
 'US7611198B2',
 'US20140112556A1',
 'US9229905B1',
 'US5745759A']

In [None]:
# Remove duplicates from the queries
queries = [query for query in queries if query['patent'] in query_dict.keys() and query['patent'][:2] == 'US']
len(queries)

745

In [None]:
## NOTE: The first 800 files are queries, the remaining are documents

def extract_information_patent(docs):
    title, abstract, title_and_abs, body = [], [], [], []
    citation = {}
    patent_index2id = {}
    patent_id2index = {}
    #doc_index2id = {}
    # Obtain all titles, abstracts, citations 
    for index, patent in enumerate(tqdm(docs)):
        patent_num = patent['patent']
        citation[patent_num] = [num['patent_number'] for num in patent['patent_citations'] if num['patent_number'][:2] == "US"]
        titl = patent['title']
        abs = patent['abstract_text']
        title.append(titl)
        abstract.append(abs)
        title_and_abs.append(titl+'[SEP]'+abs)
        body.append(patent['body_message'])
        
        # Create index to id for querys and documents
        patent_index2id[index] = patent_num
        patent_id2index[patent_num] = index
    return title, abstract, title_and_abs, body, citation, patent_index2id, patent_id2index


In [None]:
title_q, abstract_q, title_and_abs_q, body_q, citation_q, patent_index2id_q, patent_id2index_q = extract_information_patent(queries)
title, abstract, title_and_abs, body, _, patent_index2id, patent_id2index = extract_information_patent(docs)

  0%|          | 0/745 [00:00<?, ?it/s]

  0%|          | 0/31856 [00:00<?, ?it/s]

In [None]:
len(title)

31856

In [None]:
title_and_abs_q[0]

'Electric vehicle having cover for inlet for DC charging and lock mechanism to lock cover[SEP]An electric vehicle includes a battery, a charging lid, a vehicle-side charging connector, a contactor, a contactor welding detector, a DC inlet cover, a DC lock mechanism, and a lock controller. The battery is to be charged with power supplied from an external power supply. The vehicle-side charging connector is disposed inside a position of the charging lid in the electric vehicle and includes an AC inlet for AC charging and a DC inlet for DC charging. The contactor is provided on a power line connecting the DC inlet to the battery. The contactor welding detector is configured to detect welding of the contactor. The DC inlet cover covers the DC inlet without covering the AC inlet in a state where the DC inlet cover is in a closed state. The DC lock mechanism locks the DC inlet cover in the closed state.'

In [None]:
!nvidia-smi

Sat Nov 19 20:11:26 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import time
import gzip
import os
import torch

if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")


#We use the Bi-Encoder to encode all passages, so that we can use it with sematic search
# bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v4')
bi_encoder.max_seq_length = 256     #Truncate long passages to 256 tokens
top_k = 100                         #Number of passages we want to retrieve with the bi-encoder
# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings_abs = bi_encoder.encode(abstract, convert_to_tensor=True, show_progress_bar=True)
corpus_embeddings_title_and_abs = bi_encoder.encode(title_and_abs, convert_to_tensor=True, show_progress_bar=True)

specter_model = SentenceTransformer('allenai-specter')
corpus_embeddings_specter_abs = specter_model.encode(abstract, convert_to_tensor=True, show_progress_bar=True)
corpus_embeddings_spector_title_and_abs = specter_model.encode(title_and_abs, convert_to_tensor=True, show_progress_bar=True)

#The bi-encoder will retrieve 100 documents. We use a cross-encoder, to re-rank the results list to improve the quality
# cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# cross_encoder = CrossEncoder('cross-encoder/ms-marco-electra-base')


# corpus_embeddings_body = bi_encoder.encode(body, convert_to_tensor=True, show_progress_bar=True)

Downloading:   0%|          | 0.00/690 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.71k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/545 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/265M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/319 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Batches:   0%|          | 0/996 [00:00<?, ?it/s]

Batches:   0%|          | 0/996 [00:00<?, ?it/s]

Downloading:   0%|          | 0.00/690 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.71k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/622 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/462k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/331 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/222k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Batches:   0%|          | 0/996 [00:00<?, ?it/s]

Batches:   0%|          | 0/996 [00:00<?, ?it/s]

In [None]:
# corpus_embeddings_specter_abs = specter_model.encode(abstract, convert_to_tensor=True, show_progress_bar=True)
# corpus_embeddings_spector_title_and_abs = specter_model.encode(title_and_abs, convert_to_tensor=True, show_progress_bar=True)
#corpus_embeddings_spector_body = specter_model.encode(body, convert_to_tensor=True, show_progress_bar=True)

In [None]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string
from tqdm.autonotebook import tqdm
import numpy as np

# We lower case our text and remove stop-words from indexing
def bm25_tokenizer(text):
  tokenized_doc = []
  for token in text.lower().split():
    token = token.strip(string.punctuation)

    if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
      tokenized_doc.append(token)
  return tokenized_doc

# Abstract
tokenized_corpus = []
for passage in tqdm(abstract):
  tokenized_corpus.append(bm25_tokenizer(passage))
print(tokenized_corpus[0])
bm25_abs = BM25Okapi(tokenized_corpus)

# Title + Abstract
tokenized_corpus = []
for passage in tqdm(title_and_abs):
  tokenized_corpus.append(bm25_tokenizer(passage))
print(tokenized_corpus[0])
bm25_title_and_abs = BM25Okapi(tokenized_corpus)

# Body
# tokenized_corpus = []
# for passage in tqdm(body):
#   tokenized_corpus.append(bm25_tokenizer(passage))
# print(tokenized_corpus[0])
# bm25_body = BM25Okapi(tokenized_corpus)

  0%|          | 0/31856 [00:00<?, ?it/s]

['method', 'adapting', 'computing', 'device', 'response', 'changes', 'environment', 'surrounding', 'computing', 'device', 'response', "user's", 'stated', 'preferences', 'computing', 'device', 'includes', 'sensors', 'sense', 'environment', 'changed', 'characteristic', 'environment', 'detected', 'determination', 'settings', 'change', 'response', 'changed', 'characteristic', 'settings', 'changed', 'cause', 'computing', 'device', 'interact', 'user', 'different', 'mode', 'mode', 'include', 'inputs', 'outputs', 'and/or', 'processes', 'used', 'communicate', 'user', 'mode', 'include', 'application', 'formats', 'output', 'receives', 'input']


  0%|          | 0/31856 [00:00<?, ?it/s]

['adaptive', 'computing', 'environment[sep]a', 'method', 'adapting', 'computing', 'device', 'response', 'changes', 'environment', 'surrounding', 'computing', 'device', 'response', "user's", 'stated', 'preferences', 'computing', 'device', 'includes', 'sensors', 'sense', 'environment', 'changed', 'characteristic', 'environment', 'detected', 'determination', 'settings', 'change', 'response', 'changed', 'characteristic', 'settings', 'changed', 'cause', 'computing', 'device', 'interact', 'user', 'different', 'mode', 'mode', 'include', 'inputs', 'outputs', 'and/or', 'processes', 'used', 'communicate', 'user', 'mode', 'include', 'application', 'formats', 'output', 'receives', 'input']


## **TFIDF Embedding**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer_abs = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words='english')
tfidf_abs = vectorizer_abs.fit_transform(abstract)

vectorizer_title_and_abs = TfidfVectorizer(ngram_range=(1,2), min_df=2, stop_words='english')
tfidf_title_and_abs = vectorizer_title_and_abs.fit_transform(title_and_abs)


print(tfidf_abs.shape)
print(tfidf_title_and_abs.shape)


(31856, 245817)
(31856, 259851)


In [None]:
def dot_product(A, B):
    return cosine_similarity(A, B)

#Jaccard Similarity

In [None]:
def jaccard_similarity(doc1, doc2): 
    # List the unique words in a document
    words_doc1 = set(doc1) 
    words_doc2 = set(doc2)
    
    # Find the intersection of words list of doc1 & doc2
    intersection = words_doc1.intersection(words_doc2)

    # Find the union of words list of doc1 & doc2
    union = words_doc1.union(words_doc2)

    # Calculate Jaccard similarity score 
    # using length of intersection set divided by length of union set
    return float(len(intersection)) / len(union)

#Word2Vec

In [None]:
import spacy
import re
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "ner", "textcat", "tok2vec"])

In [None]:
def segmentation(text):
    # Select only chatacters by selecting characters
    # text = " ".join(re.findall("[a-zA-Z]+", text))
    text = re.sub(r'[^a-z ]', '', text)
    doc = nlp(text)
    
    # Apply lemmatization and lower
    # "if not token.is_stop"  is to remove stopwords
    return [token.lemma_ for token in doc if not token.is_stop]

In [None]:
# words = [re.sub(r'[^a-z]', '', w.lower()).strip() for w in words]

In [None]:
title[0]

'Adaptive computing environment'

In [None]:
tokenized_abs = [segmentation(txt) for txt in abstract]

In [None]:
tokenized_title_and_abs = [segmentation(txt) for txt in title_and_abs]

In [None]:
import time
from gensim.models import Word2Vec
print("training Word2Vec model for abstract only...")
startTime = time.time()
word2vec_title = Word2Vec(tokenized_abs)
usedTime = time.time() - startTime
print('spend %f seconds' %usedTime)

training Word2Vec model for abstract only...
spend 12.611078 seconds


In [None]:
print("training Word2Vec model for title and abstract...")
startTime = time.time()
word2vec_title_and_abs = Word2Vec(tokenized_title_and_abs)
usedTime = time.time() - startTime
print('spend %f seconds' %usedTime)

training Word2Vec model for title and abstract...
spend 13.827043 seconds


In [None]:
tokenized_q = segmentation(title_q[0])

# Search query using all techniques

In [None]:
#This function will search all wikipedia articles for passages that
#answer the query
def search(query):
    #print("Input question:", query)
    
    result_dict = []
    ##########################################################################
    ###########  Lexical search ###################
    # 1. BM25
    #BM25 search (lexical search) on abstract only
    bm25_scores = bm25_abs.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-10:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
#     print("Top-5 lexical search (BM25) hits")
    #print(bm25_hits[:10])
    bm25_hits_rel = []
    for hit in bm25_hits[0:10]:
        bm25_hits_rel.append(patent_index2id[hit['corpus_id']])
#         print(patent_index2id[hit['corpus_id']])
#         print("\t{:.3f}\t{}".format(hit['score'], abstract[hit['corpus_id']].replace("\n", " ")))
    bm25_dict = {'method': 'bm25_abs', "rel_docs": bm25_hits_rel}
    result_dict.append(bm25_dict)

    #BM25 search (lexical search) on title and abstract
    bm25_scores = bm25_title_and_abs.get_scores(bm25_tokenizer(query))
    top_n = np.argpartition(bm25_scores, -5)[-10:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)
#     print("Top-5 lexical search (BM25) hits")
    bm25_hits_rel = []
    for hit in bm25_hits[0:10]:
        bm25_hits_rel.append(patent_index2id[hit['corpus_id']])
#         print(patent_index2id[hit['corpus_id']])
#         print("\t{:.3f}\t{}".format(hit['score'], title_and_abs[hit['corpus_id']].replace("\n", " ")))
    bm25_dict = {'method': 'bm25_title_and_abs', "rel_docs": bm25_hits_rel}
    result_dict.append(bm25_dict)
    # print ('Done lexical search')
    ##########################################################################

        
    ##########################################################################
    # 2. TFIDF
    tfidf_abs_query = vectorizer_abs.transform([query])
    CS_abs = dot_product(tfidf_abs, tfidf_abs_query)
    top_10_indices = sorted(range(len(CS_abs)), key=lambda i: CS_abs[i], reverse=True)[:10]
    tfidf_abs_top10_id = [patent_index2id[ind] for ind in top_10_indices]
    tfidf_dict = {'method': 'tfidf_abs', "rel_docs": tfidf_abs_top10_id}
    result_dict.append(tfidf_dict)


    tfidf_title_and_abs_query = vectorizer_title_and_abs.transform([query])
    CS_title_and_abs = dot_product(tfidf_title_and_abs, tfidf_title_and_abs_query)
    top_10_indices = sorted(range(len(CS_title_and_abs)), key=lambda i: CS_title_and_abs[i], reverse=True)[:10]
    tfidf_title_and_abs_top10_id = [patent_index2id[ind] for ind in top_10_indices]
    tfidf_dict = {'method': 'tfidf_title_and_abs', "rel_docs": tfidf_title_and_abs_top10_id}
    result_dict.append(tfidf_dict)

    # print('Done TFIDF')

    ##########################################################################

    ##########################################################################
    #### 3. Jaccard Similarity
    tokenized_query = segmentation(query)
    # 3.1 title only 
    jacc_score = [jaccard_similarity(tokenized_query, doc) for doc in tokenized_abs]
    top_jacc_score = sorted(range(len(jacc_score)), key=lambda i: jacc_score[i], reverse=True)[:10]
    jacc_title_top10_id = [patent_index2id[ind] for ind in top_jacc_score]
    jacc_dict = {'method': 'jaccard_abs', "rel_docs": jacc_title_top10_id}
    result_dict.append(jacc_dict)

    # 3.2 title and abstract 
    jacc_score = [jaccard_similarity(tokenized_query, doc) for doc in tokenized_title_and_abs]
    top_jacc_score = sorted(range(len(jacc_score)), key=lambda i: jacc_score[i], reverse=True)[:10]
    jacc_title_top10_id = [patent_index2id[ind] for ind in top_jacc_score]
    jacc_dict = {'method': 'jaccard_titleand_abs', "rel_docs": jacc_title_top10_id}
    result_dict.append(jacc_dict)

    # print('Done Jaccard')

    ##########################################################################
    ## 4. Word2Vec
    # 4.1 abstract only 
    # w2v_score = []
    # for doc in tokenized_abs:
    #     w2v_score.append(1 - word2vec_title.wv.wmdistance(tokenized_q, doc))
    # top_w2v_score = sorted(range(len(w2v_score)), key=lambda i: w2v_score[i], reverse=True)[:10]
    # w2v_title_top10_id = [patent_index2id[ind] for ind in top_w2v_score]
    # w2v_dict = {'method': 'Word2Vec_abs', "rel_docs": w2v_title_top10_id}

    # # 4.2 titl and abstract
    # w2v_score = []
    # for doc in tokenized_title_and_abs:
    #     w2v_score.append(1 - word2vec_title_and_abs.wv.wmdistance(tokenized_q, doc))
    # top_w2v_score = sorted(range(len(w2v_score)), key=lambda i: w2v_score[i], reverse=True)[:10]
    # w2v_title_and_abs_top10_id = [patent_index2id[ind] for ind in top_w2v_score]
    # w2v_dict = {'method': 'Word2Vec_titleand_abs', "rel_docs": w2v_title_and_abs_top10_id}
    # result_dict.append(w2v_dict)
    # print('Done W2V')

    ##########################################################################
    ##### Sematic Search #####
    
    #################### 1. Use only abstract
    #Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    question_embedding = question_embedding.cuda()
    hits = util.semantic_search(question_embedding, corpus_embeddings_abs, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query
    related_ids = [hit['corpus_id'] for hit in hits]
    ### 1.1 Bi-encoder
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    bi_encoder_rel = []
    for hit in hits[0:10]:
        bi_encoder_rel.append(patent_index2id[hit['corpus_id']])
        #print("\t{:.3f}\t{}".format(hit['score'], abstract[hit['corpus_id']].replace("\n", " ")))
    bi_encoder_dict = {'method': 'bi_encoder_abs', "rel_docs": bi_encoder_rel}
    result_dict.append(bi_encoder_dict)

    ### 1.2 Cross-encoder
    #Now, score all retrieved passages with the cross_encoder
    # cross_inp = [[query, abstract[hit['corpus_id']]] for hit in hits]
    # cross_scores = cross_encoder.predict(cross_inp)

    # #Sort results by the cross-encoder scores
    # for idx in range(len(cross_scores)):
    #     hits[idx]['cross-score'] = cross_scores[idx]
    # hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    # cross_encoder_rel = []
    # for hit in hits[0:10]:
    #     cross_encoder_rel.append(patent_index2id[hit['corpus_id']])
    #     #print("\t{:.3f}\t{}".format(hit['cross-score'], abstract[hit['corpus_id']].replace("\n", " ")))
    # cross_encoder_dict = {'method': 'cross_encoder_abs', "rel_docs": cross_encoder_rel}
    # result_dict.append(cross_encoder_dict)

    ### 1.3 Spector
    spector_encoder_rel = []
    query_embedding = specter_model.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding.cuda()
    search_hits = util.semantic_search(query_embedding, corpus_embeddings_specter_abs)
    search_hits = search_hits[0]  #Get the hits for the first query
    for hit in search_hits:
        spector_encoder_rel.append(patent_index2id[hit['corpus_id']])
        #print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['venue'], related_paper['year']))
    spector_encoder_dict = {'method': 'spector_encoder_abs', "rel_docs": spector_encoder_rel}
    result_dict.append(spector_encoder_dict)


    ############## 2. Use both abstract and title
    #Encode the query using the bi-encoder and find potentially relevant passages
    hits = util.semantic_search(question_embedding, corpus_embeddings_title_and_abs, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query
    related_ids = [hit['corpus_id'] for hit in hits]

    # #Now, score all retrieved passages with the cross_encoder
    # cross_inp = [[query, title_and_abs[hit['corpus_id']]] for hit in hits]
    # cross_scores = cross_encoder.predict(cross_inp)

    #Sort results by the cross-encoder scores
    # for idx in range(len(cross_scores)):
    #     hits[idx]['cross-score'] = cross_scores[idx]
    
    ## 2.1 Bi-encoder
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    bi_encoder_rel = []
    for hit in hits[0:10]:
        bi_encoder_rel.append(patent_index2id[hit['corpus_id']])
        #print("\t{:.3f}\t{}".format(hit['score'], title_and_abs[hit['corpus_id']].replace("\n", " ")))
    bi_encoder_dict = {'method': 'bi_encoder_title_and_abs', "rel_docs": bi_encoder_rel}
    result_dict.append(bi_encoder_dict)

    ## 2.2 Cross-encoder
    # hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    # cross_encoder_rel = []
    # for hit in hits[0:10]:
    #     cross_encoder_rel.append(patent_index2id[hit['corpus_id']])
    #     #print("\t{:.3f}\t{}".format(hit['cross-score'], title_and_abs[hit['corpus_id']].replace("\n", " ")))
    # cross_encoder_dict = {'method': 'cross_title_and_abs', "rel_docs": cross_encoder_rel}
    # result_dict.append(cross_encoder_dict)


    ## 2.3 Spector
    spector_encoder_rel = []
    search_hits = util.semantic_search(query_embedding, corpus_embeddings_spector_title_and_abs)
    search_hits = search_hits[0]  #Get the hits for the first query
    for hit in search_hits:
        spector_encoder_rel.append(patent_index2id[hit['corpus_id']])
        # related_paper = papers[hit['corpus_id']]
        #print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['venue'], related_paper['year']))
    spector_encoder_dict = {'method': 'spector_encoder_title_and_abs', "rel_docs": spector_encoder_rel}
    result_dict.append(spector_encoder_dict)
    # print('Done semantic search')

    ##########################################################################

    return result_dict

In [None]:
def cal_mapk(relevance_score):
    mapk = []
    i = 1
    for val in relevance_score:
        if val != 0:
            #print(relevance_score[:i])
            mapk.append(np.mean(relevance_score[:i]))
        i += 1
    if mapk:
      return np.mean(mapk)
    else:
      return 0

In [None]:
def cal_f1k(apk, recallk):
    if apk == 0 and recallk == 0:
        return 0
    else:
        return 2 * apk * recallk / (apk + recallk)

def cal_acc(relevance_score, len_groundt):
    mapk = []
    mrecallk = []
    mf1k = []
    i = 1
    for val in relevance_score:
        if val != 0:
            #print(relevance_score[:i])
            # Precision @ k
            apk = np.mean(relevance_score[:i])
            mapk.append(apk)
            # Recall @ k
            recallk = i/len_groundt
            mrecallk.append(recallk)
            # f1 @ k
            f1k = cal_f1k(apk, recallk)
            mf1k.append(f1k)
        i += 1
    
    if mapk:
        mapk = np.mean(mapk)
    else:
        mapk = 0
    
    if mrecallk:
        mrecallk = np.mean(mrecallk)
    else:
        mrecallk = 0

    if mf1k:
        mf1k = np.mean(mf1k)
    else:
        mf1k = 0

    return mapk, mrecallk, mf1k

In [None]:
# We are going to use 
# 1. Query: title and  title + abstract
# 2. Doc:   abstract, title+abtract, and body message 
def obtain_performance(query, input_type):
    all_performance = []
    for index, header in enumerate(tqdm(query)):
        #result[patent_index2id[index]] = search(header)
        patent_id = patent_index2id_q[index]
        query_rel = search(header)
        #print('Patent ID: {}'.format(patent_id))
        for tech in query_rel:
            technique = tech['method']
            # print(tech['rel_docs'])
            relevance_score = [1 if doc in citation_q[patent_id] else 0 for doc in tech['rel_docs']]
            # Calculate ideal relevance score by decending sorting 
            ideal_true_relevance_score = sorted(relevance_score, reverse=True)
            #print(relevance_score)
            ndcg_sc = ndcg_score([ideal_true_relevance_score], [relevance_score])

            # Cal mapk, mean average recall and f1-score @k
            mapk_val, mrecallk_val, f1k_val = cal_acc(relevance_score, len(citation_q[patent_id]))

            # Collect the statistical values
            ndcg_score_list = [patent_id, input_type, "nDCG", technique, ndcg_sc]
            mapk_score_list = [patent_id, input_type, "mapk", technique, mapk_val]
            mrecallk_score_list = [patent_id, input_type, "mrecallk", technique, mrecallk_val]
            f1k_val_score_list = [patent_id, input_type, "mf1k", technique, f1k_val]
            all_performance.append(ndcg_score_list)
            all_performance.append(mapk_score_list)
            all_performance.append(mrecallk_score_list)
            all_performance.append(f1k_val_score_list)
    return pd.DataFrame(all_performance, columns=['patent_id', 'input_type', 'matrix', 'method', 'score'])
        #print('-'*60)

In [None]:
title_q[0]

'Electric vehicle having cover for inlet for DC charging and lock mechanism to lock cover'

In [None]:
patent_index2id[0]

'US20050108642A1'

In [None]:
patent_id_doc = [doc['patent'] for doc in docs]
patent_id_q  = [que['patent'] for que in queries]
patent_id_doc[:10]

['US20050108642A1',
 'US7454351B2',
 'US20150312666A1',
 'US20090112572A1',
 'US20080300768A1',
 'US7411316B2',
 'US7611198B2',
 'US20140112556A1',
 'US9229905B1',
 'US5745759A']

In [None]:
title_q[0]

'Electric vehicle having cover for inlet for DC charging and lock mechanism to lock cover'

In [None]:
## TODO: Select only US patents
title_perf_df = obtain_performance(title_q, 'title')
title_perf_df

  0%|          | 0/745 [00:00<?, ?it/s]

Unnamed: 0,patent_id,input_type,matrix,method,score
0,US9533588B2,title,nDCG,bm25_abs,0.393729
1,US9533588B2,title,mapk,bm25_abs,0.125000
2,US9533588B2,title,mrecallk,bm25_abs,0.400000
3,US9533588B2,title,mf1k,bm25_abs,0.190476
4,US9533588B2,title,nDCG,bm25_title_and_abs,0.000000
...,...,...,...,...,...
29795,US8866745B1,title,mf1k,bi_encoder_title_and_abs,0.000000
29796,US8866745B1,title,nDCG,spector_encoder_title_and_abs,0.000000
29797,US8866745B1,title,mapk,spector_encoder_title_and_abs,0.000000
29798,US8866745B1,title,mrecallk,spector_encoder_title_and_abs,0.000000


In [None]:
title_and_abs_perf_df = obtain_performance(title_and_abs_q, 'title_and_abs')
title_and_abs_perf_df

  0%|          | 0/745 [00:00<?, ?it/s]

Unnamed: 0,patent_id,input_type,matrix,method,score
0,US9533588B2,title_and_abs,nDCG,bm25_abs,0.393729
1,US9533588B2,title_and_abs,mapk,bm25_abs,0.333333
2,US9533588B2,title_and_abs,mrecallk,bm25_abs,0.150000
3,US9533588B2,title_and_abs,mf1k,bm25_abs,0.206897
4,US9533588B2,title_and_abs,nDCG,bm25_title_and_abs,0.393729
...,...,...,...,...,...
29795,US8866745B1,title_and_abs,mf1k,bi_encoder_title_and_abs,0.000000
29796,US8866745B1,title_and_abs,nDCG,spector_encoder_title_and_abs,0.000000
29797,US8866745B1,title_and_abs,mapk,spector_encoder_title_and_abs,0.000000
29798,US8866745B1,title_and_abs,mrecallk,spector_encoder_title_and_abs,0.000000


In [None]:
## Append performance together 
perf_df = pd.concat([title_perf_df, title_and_abs_perf_df], ignore_index=True)
perf_df.to_csv('performance.csv')
perf_df

Unnamed: 0,patent_id,input_type,matrix,method,score
0,US9533588B2,title,nDCG,bm25_abs,0.393729
1,US9533588B2,title,mapk,bm25_abs,0.125000
2,US9533588B2,title,mrecallk,bm25_abs,0.400000
3,US9533588B2,title,mf1k,bm25_abs,0.190476
4,US9533588B2,title,nDCG,bm25_title_and_abs,0.000000
...,...,...,...,...,...
59595,US8866745B1,title_and_abs,mf1k,bi_encoder_title_and_abs,0.000000
59596,US8866745B1,title_and_abs,nDCG,spector_encoder_title_and_abs,0.000000
59597,US8866745B1,title_and_abs,mapk,spector_encoder_title_and_abs,0.000000
59598,US8866745B1,title_and_abs,mrecallk,spector_encoder_title_and_abs,0.000000


#Calculate t-test

In [3]:
from scipy.stats import ttest_ind
def t_test(x,y,):
    _, double_p = ttest_ind(x.to_numpy(),y.to_numpy(),equal_var = True)
    if np.mean(x) > np.mean(y):
        pval = double_p/2.
    else:
        pval = 1.0 - double_p/2.
    return pval

In [8]:
perf_df = pd.read_csv('performance.csv')
perf_df.tail(20)

Unnamed: 0.1,Unnamed: 0,patent_id,input_type,matrix,method,score
59580,59580,US8866745B1,title_and_abs,nDCG,jaccard_titleand_abs,0.0
59581,59581,US8866745B1,title_and_abs,mapk,jaccard_titleand_abs,0.0
59582,59582,US8866745B1,title_and_abs,mrecallk,jaccard_titleand_abs,0.0
59583,59583,US8866745B1,title_and_abs,mf1k,jaccard_titleand_abs,0.0
59584,59584,US8866745B1,title_and_abs,nDCG,bi_encoder_abs,0.0
59585,59585,US8866745B1,title_and_abs,mapk,bi_encoder_abs,0.0
59586,59586,US8866745B1,title_and_abs,mrecallk,bi_encoder_abs,0.0
59587,59587,US8866745B1,title_and_abs,mf1k,bi_encoder_abs,0.0
59588,59588,US8866745B1,title_and_abs,nDCG,spector_encoder_abs,0.393729
59589,59589,US8866745B1,title_and_abs,mapk,spector_encoder_abs,0.2


In [9]:
score_by_method_df = perf_df.groupby(['input_type', 'matrix', 'method'])['score'].agg(score_by_method='mean').reset_index()
score_by_method_df

Unnamed: 0,input_type,matrix,method,score_by_method
0,title,mapk,bi_encoder_abs,0.194088
1,title,mapk,bi_encoder_title_and_abs,0.259137
2,title,mapk,bm25_abs,0.211398
3,title,mapk,bm25_title_and_abs,0.272201
4,title,mapk,jaccard_abs,0.073185
...,...,...,...,...
75,title_and_abs,nDCG,jaccard_titleand_abs,0.365710
76,title_and_abs,nDCG,spector_encoder_abs,0.352329
77,title_and_abs,nDCG,spector_encoder_title_and_abs,0.401603
78,title_and_abs,nDCG,tfidf_abs,0.371295


In [11]:
title_acc_df = score_by_method_df[score_by_method_df['input_type']=='title'].sort_values(by=['matrix', 'score_by_method'], ascending=False)
title_acc_df.groupby('matrix').head(3).reset_index(drop=True)

Unnamed: 0,input_type,matrix,method,score_by_method
0,title,nDCG,bm25_title_and_abs,0.310817
1,title,nDCG,bi_encoder_title_and_abs,0.29839
2,title,nDCG,tfidf_title_and_abs,0.278969
3,title,mrecallk,bi_encoder_title_and_abs,0.118709
4,title,mrecallk,bm25_title_and_abs,0.112209
5,title,mrecallk,bm25_abs,0.101131
6,title,mf1k,bm25_title_and_abs,0.084808
7,title,mf1k,bi_encoder_title_and_abs,0.080885
8,title,mf1k,tfidf_title_and_abs,0.076692
9,title,mapk,bm25_title_and_abs,0.272201


In [12]:
title_and_abs_acc_df =score_by_method_df[score_by_method_df['input_type']=='title_and_abs'].sort_values(by=['matrix', 'score_by_method'], ascending=False)
title_and_abs_acc_df.groupby('matrix').head(3).reset_index(drop=True)

Unnamed: 0,input_type,matrix,method,score_by_method
0,title_and_abs,nDCG,bm25_title_and_abs,0.418529
1,title_and_abs,nDCG,bm25_abs,0.412935
2,title_and_abs,nDCG,spector_encoder_title_and_abs,0.401603
3,title_and_abs,mrecallk,bm25_abs,0.162883
4,title_and_abs,mrecallk,bm25_title_and_abs,0.160327
5,title_and_abs,mrecallk,spector_encoder_title_and_abs,0.155887
6,title_and_abs,mf1k,bm25_title_and_abs,0.123721
7,title_and_abs,mf1k,bm25_abs,0.119691
8,title_and_abs,mf1k,spector_encoder_title_and_abs,0.111996
9,title_and_abs,mapk,bm25_title_and_abs,0.369581


## Part 1: Compare input of query (title VS title+abstract) 

In [13]:
p_val_ndcg = t_test(title_and_abs_acc_df[title_and_abs_acc_df['matrix'] == 'nDCG']['score_by_method'],
              title_acc_df[title_acc_df['matrix'] == 'nDCG']['score_by_method'])
p_val_mrecallk = t_test(title_and_abs_acc_df[title_and_abs_acc_df['matrix'] == 'mrecallk']['score_by_method'],
              title_acc_df[title_acc_df['matrix'] == 'mrecallk']['score_by_method'])
p_val_mf1k = t_test(title_and_abs_acc_df[title_and_abs_acc_df['matrix'] == 'mf1k']['score_by_method'],
              title_acc_df[title_acc_df['matrix'] == 'mf1k']['score_by_method'])
p_val_mapk = t_test(title_and_abs_acc_df[title_and_abs_acc_df['matrix'] == 'mapk']['score_by_method'],
              title_acc_df[title_acc_df['matrix'] == 'mapk']['score_by_method'])


In [14]:
print('p-value of nDCG between title and titles combined with abstracts: {}.'.format(p_val_ndcg))
print('p-value of mapk between title and titles combined with abstracts: {}.'.format(p_val_mapk))
print('p-value of ma recall k between title and titles combined with abstracts: {}.'.format(p_val_mrecallk))
print('p-value of ma f1 k between title and titles combined with abstracts: {}.'.format(p_val_mf1k))

p-value of nDCG between title and titles combined with abstracts: 1.489067146003719e-06.
p-value of mapk between title and titles combined with abstracts: 1.4009452459873077e-06.
p-value of ma recall k between title and titles combined with abstracts: 2.317612946790197e-06.
p-value of ma f1 k between title and titles combined with abstracts: 2.1248906517182753e-06.


## Part 2 Compare the baseline model with the best model

In [19]:
# best_model_acc
best_model_acc_ndcg = perf_df[(perf_df['matrix'] == 'nDCG') & (perf_df['method'] == 'bm25_title_and_abs') ]['score']
best_model_acc_mapk = perf_df[(perf_df['matrix'] == 'mapk') & (perf_df['method'] == 'bm25_title_and_abs') ]['score']
best_model_acc_mrecallk = perf_df[(perf_df['matrix'] == 'mrecallk') & (perf_df['method'] == 'bm25_title_and_abs') ]['score']
best_model_acc_mf1k = perf_df[(perf_df['matrix'] == 'mf1k') & (perf_df['method'] == 'bm25_title_and_abs') ]['score']

# baseline_model_acc
baseline_model_acc_ndcg = perf_df[(perf_df['matrix'] == 'nDCG') & (perf_df['method'] == 'tfidf_title_and_abs') ]['score']
baseline_model_acc_mapk = perf_df[(perf_df['matrix'] == 'mapk') & (perf_df['method'] == 'tfidf_title_and_abs') ]['score']
baseline_model_acc_mrecallk = perf_df[(perf_df['matrix'] == 'mrecallk') & (perf_df['method'] == 'tfidf_title_and_abs') ]['score']
baseline_model_acc_mf1k = perf_df[(perf_df['matrix'] == 'mf1k') & (perf_df['method'] == 'tfidf_title_and_abs') ]['score']

# Obtain p-value by matrics
p_val_ndcg_bb_mod = t_test(best_model_acc_ndcg, baseline_model_acc_ndcg)
p_val_mapk_bb_mod = t_test(best_model_acc_mapk, baseline_model_acc_mapk)
p_val_mrecallk_bb_mod = t_test(best_model_acc_mrecallk, baseline_model_acc_mrecallk)
p_val_mf1k_bb_mod = t_test(best_model_acc_mf1k, baseline_model_acc_mf1k)

In [20]:
print('P-value of nDCG between best model (BM25) and baseline model (TFIDF): {}.'.format(p_val_ndcg_bb_mod))
print('P-value of mapk between best model (BM25) and baseline model (TFIDF): {}.'.format(p_val_mapk_bb_mod))
print('P-value of mrecallk between best model (BM25) and baseline model (TFIDF): {}.'.format(p_val_mrecallk_bb_mod))
print('P-value of mf1k between best model (BM25) and baseline model (TFIDF): {}.'.format(p_val_mf1k_bb_mod))

p-value of nDCG between best model (BM25) and baseline model (TFIDF): 0.0040068717098271775.
p-value of mapk between best model (BM25) and baseline model (TFIDF): 0.005362862168430343.
p-value of mrecallk between best model (BM25) and baseline model (TFIDF): 0.04042885766963906.
p-value of mf1k between best model (BM25) and baseline model (TFIDF): 0.018628944852912003.


## Part 3 Compare semantic search with keyword search

In [21]:
# Spector model accuracy (semantic search)
spector_model_acc_ndcg = perf_df[(perf_df['matrix'] == 'nDCG') & (perf_df['method'] == 'spector_encoder_title_and_abs') ]['score']
spector_model_acc_mapk = perf_df[(perf_df['matrix'] == 'mapk') & (perf_df['method'] == 'spector_encoder_title_and_abs') ]['score']
spector_model_acc_mrecallk = perf_df[(perf_df['matrix'] == 'mrecallk') & (perf_df['method'] == 'spector_encoder_title_and_abs') ]['score']
spector_model_acc_mf1k = perf_df[(perf_df['matrix'] == 'mf1k') & (perf_df['method'] == 'spector_encoder_title_and_abs') ]['score']

# Obtain p-value by matrics
p_val_ndcg_sb_mod = t_test(best_model_acc_ndcg, spector_model_acc_ndcg)
p_val_mapk_sb_mod = t_test(best_model_acc_mapk, spector_model_acc_mapk)
p_val_mrecallk_sb_mod = t_test(best_model_acc_mrecallk, spector_model_acc_mrecallk)
p_val_mf1k_sb_mod = t_test(best_model_acc_mf1k, spector_model_acc_mf1k)

In [23]:
print('P-value of nDCG between best model (BM25) and Spector model (BERT semantic search): {}.'.format(p_val_ndcg_sb_mod))
print('P-value of mapk between best model (BM25) and Spector model (BERT semantic search): {}.'.format(p_val_mapk_sb_mod))
print('P-value of mrecallk between best model (BM25) and Spector model (BERT semantic search): {}.'.format(p_val_mrecallk_sb_mod))
print('P-value of mf1k between best model (BM25) and Spector model (BERT semantic search): {}.'.format(p_val_mf1k_sb_mod))

P-value of nDCG between best model (BM25) and Spector model (BERT semantic search): 0.002157059244868851.
P-value of mapk between best model (BM25) and Spector model (BERT semantic search): 0.001091469805804435.
P-value of mrecallk between best model (BM25) and Spector model (BERT semantic search): 0.08387422749185708.
P-value of mf1k between best model (BM25) and Spector model (BERT semantic search): 0.0017530233829082592.
