# ACL 2023
## Semantic Legal Searcher (SLS) on the English dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
cd /content/drive/MyDrive/ACL_2023_SLS/

/content/drive/MyDrive/ACL_2023_SLS


## PIP

In [4]:
! pip install transformers
! pip install -U sentence-transformers
! pip install sentencepiece
! pip install faiss-gpu
! pip install funcy pickle5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)

## STEP 1. Load arXiv dataset & PLMs (ENG)

In [5]:
import pandas as pd
import json

# 1. Load arXiv abstract metadata
data_file = './data/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    """Yield one raw JSON line per paper so the snapshot is never held in memory at once."""
    with open(data_file, 'r') as f:
        for line in f:
            yield line

metadata = get_metadata()
for paper in metadata:
    paper_dict = json.loads(paper)
    print('Title: {}\n\nAbstract: {}\nRef: {}'.format(paper_dict.get('title'), paper_dict.get('abstract'), paper_dict.get('journal-ref')))
    break

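# Keep only papers whose journal-ref ends in a year from 2017 to 2020.
titles, abstracts, years = [], [], []
metadata = get_metadata()
for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    try:
        year = int(ref[-4:])
    except (TypeError, ValueError):
        # journal-ref is missing (None) or does not end in a four-digit number
        continue
    if 2016 < year < 2021:
        years.append(year)
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))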

# 2. Make DataFrame
papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'year': years
})

papers.to_csv('./data/arxiv_meta.csv', sep = ',', na_rep="NaN")

Title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies

Abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
tha
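
The filter in this cell reads the year from the last four characters of `journal-ref`, so references that place the year elsewhere (for example in parentheses mid-string) are silently skipped. A minimal sketch of a more tolerant parser, assuming it is acceptable to take the last standalone four-digit year found anywhere in the string; `year_from_journal_ref` is a hypothetical helper and not part of the notebook:

```python
import re

# Matches a standalone four-digit year starting with 19 or 20
# (not embedded in a longer run of digits such as a page or article number).
_YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")

def year_from_journal_ref(ref):
    """Return the last four-digit year found in ref, or None if there is none."""
    if not ref:
        return None
    matches = _YEAR_RE.findall(ref)
    return int(matches[-1]) if matches else None

# Hypothetical journal-ref strings, for illustration only.
examples = ["Phys.Rev.D76:013009,2007", "J. Foo (2019) 12-34", None]
years_found = [year_from_journal_ref(r) for r in examples]
```

Swapping this in would change which papers pass the 2017 to 2020 filter, so the dataset size reported later would no longer be comparable.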

In [6]:
# 1. Load arXiv dataset (Cornell University, 2022)
df = pd.read_csv('./data/arxiv_meta.csv')
print(">> arxiv-meta data size : ", len(df))

# 2. Load pre-trained language model on English dataset
my_plms = "all-mpnet-base-v2"
df.head()

>> arxiv-meta data size :  19035


| Unnamed: 0.1 | Unnamed: 0 | title | abstract | year |
|---|---|---|---|---|
| 0 | 0 | On the Cohomological Derivation of Yang-Mills ... | We present a brief review of the cohomologic... | 2017 |
| 1 | 1 | Regularity of solutions of the isoperimetric p... | In this work we consider a question in the c... | 2018 |
| 2 | 2 | Asymptotic theory of least squares estimators ... | This paper considers the effect of least squ... | 2017 |
| 3 | 3 | Teichm\"uller Structures and Dual Geometric Gi... | The Gibbs measure theory for smooth potentia... | 2020 |
| 4 | 4 | Distributional Schwarzschild Geometry from non... | In this paper we leave the neighborhood of t... | 2018 |


## STEP 2. Parallel Clustering-based Topic Modeling

In [7]:
from models.parallel_clustering_TM import *

In [None]:
# 1. Obtain Embeddings
target_text = 'abstract'

cluster = ParallelCluster(
    dataframe = df,
    tgt_col = target_text,
    model_name = my_plms,
    use_sentence_bert = True
    )

In [12]:
# 2. Parallel Clustering
clusters, unclusters = cluster.parallel_cluster(
    clusters = None,
    threshold = 0.56,
    page_size = 2000,
    iterations = 30
    )

===== Iteration 1 / 30 =====


>> Number of Total Clusters :  429
>> Percentage clusted Doc Embeddings : 38.92%


===== Iteration 2 / 30 =====


>> Number of Total Clusters :  839
>> Percentage clusted Doc Embeddings : 58.91%


===== Iteration 3 / 30 =====


>> Number of Total Clusters :  1125
>> Percentage clusted Doc Embeddings : 67.89%


===== Iteration 4 / 30 =====


>> Number of Total Clusters :  1245
>> Percentage clusted Doc Embeddings : 70.77%


===== Iteration 5 / 30 =====


>> Number of Total Clusters :  1356
>> Percentage clusted Doc Embeddings : 73.15%


===== Iteration 6 / 30 =====


>> Number of Total Clusters :  1414
>> Percentage clusted Doc Embeddings : 74.25%


===== Iteration 7 / 30 =====


>> Number of Total Clusters :  1474
>> Percentage clusted Doc Embeddings : 75.31%


===== Iteration 8 / 30 =====


>> Number of Total Clusters :  1510
>> Percentage clusted Doc Embeddings : 75.99%


===== Iteration 9 / 30 =====


>> Number of Total Clusters :  1536
>> Percentage c
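
The implementation of `parallel_cluster` lives in `models.parallel_clustering_TM` and is not shown in this notebook, so the following is only a rough illustration of what clustering embeddings against a cosine-similarity `threshold` can look like. It is a greedy, single-pass sketch in pure Python; the function names, the `min_size` parameter, and the toy vectors are all assumptions, and paging (`page_size`) and repeated `iterations` are deliberately left out:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def threshold_cluster(vectors, threshold=0.56, min_size=2):
    """Greedily group vectors whose similarity to a seed vector reaches the threshold.

    Returns (clusters, unclustered): clusters is a list of index lists sorted by
    size (largest first); unclustered holds indices that found no group.
    """
    unassigned = list(range(len(vectors)))
    clusters, unclustered = [], []
    while unassigned:
        seed = unassigned[0]
        members = [i for i in unassigned
                   if cosine(vectors[seed], vectors[i]) >= threshold]
        if len(members) >= min_size:
            clusters.append(members)
            taken = set(members)
            unassigned = [i for i in unassigned if i not in taken]
        else:
            unclustered.append(seed)
            unassigned = unassigned[1:]
    clusters.sort(key=len, reverse=True)
    return clusters, unclustered

# Toy 2-D "embeddings": two tight pairs and one outlier.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
toy_clusters, toy_unclustered = threshold_cluster(toy, threshold=0.56)
```

In this sketch, raising the threshold yields more, smaller clusters and leaves more vectors unclustered.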

In [13]:
# 3. Stack : Stack the clustered results in order of cluster size
col_list = ['title', 'abstract', 'year']
new_df = cluster.cluster_stack(
    col_list = col_list,
    clusters = clusters,
    unclusters = unclusters
    )

# 4. Extract keywords from each document
top_n_words = cluster.extract_top_n_words_per_topic(
    dataframe = new_df,
    n = 20,
    en = True
    )
new_df['keywords'] = [', '.join(top_n_words[i]) for i in new_df['Topic'].values]

# 5. Save the parallel-clustered dataset
new_df.to_csv("./data/clusted_arxiv_df.csv", sep=',', na_rep="NaN")

In [14]:
new_df.head()

| Unnamed: 0 | title | abstract | year | Topic | keywords |
|---|---|---|---|---|---|
| 890 | Testing the anisotropy in the angular distribu... | Gamma-ray bursts (GRBs) were confirmed to be... | 2017 | 0 | ray, emission, gamma, star, 10, mass, luminosi... |
| 1228 | Solving the missing GRB neutrino and GRB-SN pu... | Every GRB model where the progenitor is assu... | 2018 | 0 | ray, emission, gamma, star, 10, mass, luminosi... |
| 1518 | Simulating galaxy formation with black hole dr... | The inefficiency of star formation in massiv... | 2017 | 0 | ray, emission, gamma, star, 10, mass, luminosi... |
| 1874 | High energy properties of the flat spectrum ra... | We investigate the $\gamma$-ray and X-ray pr... | 2018 | 0 | ray, emission, gamma, star, 10, mass, luminosi... |
| 1980 | Nearest Neighbor: The Low-Mass Milky Way Satel... | We present Magellan/IMACS spectroscopy of th... | 2017 | 0 | ray, emission, gamma, star, 10, mass, luminosi... |
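
Every row in a topic shares the same keyword string because keywords are extracted per topic, not per document. How `extract_top_n_words_per_topic` scores words is not shown here. One common approach is a TF-IDF variant in which each topic's documents are pooled into a single "document"; the sketch below assumes that approach and uses hypothetical names and toy texts throughout:

```python
import math
import re
from collections import Counter

def top_words_per_topic(docs, topics, n=20):
    """Return {topic: top-n words}, scoring words by a topic-level TF-IDF."""
    # Pool word counts over all documents assigned to the same topic.
    per_topic = {}
    for doc, topic in zip(docs, topics):
        words = re.findall(r"[a-z0-9]+", doc.lower())
        per_topic.setdefault(topic, Counter()).update(words)

    # In how many topics does each word occur at least once?
    topic_freq = Counter(w for counts in per_topic.values() for w in counts)
    n_topics = len(per_topic)

    result = {}
    for topic, counts in per_topic.items():
        total = sum(counts.values())
        scores = {w: (c / total) * math.log(1 + n_topics / topic_freq[w])
                  for w, c in counts.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[topic] = [w for w, _ in ranked[:n]]
    return result

toy_docs = ["gamma ray burst emission", "gamma ray star",
            "neural translation attention", "attention transformer translation"]
toy_topics = [0, 0, 1, 1]
toy_keywords = top_words_per_topic(toy_docs, toy_topics, n=2)
keyword_column = [", ".join(toy_keywords[t]) for t in toy_topics]
```

The last line mirrors how the notebook builds the `keywords` column: one joined string per row, looked up by that row's topic.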


## STEP 3. Embedding modeling (split-merge) and scoring (multi-interactions)

In [15]:
from models.semantic_searcher_eng import *

In [16]:
# 1. Load SLS framework
sls = SLS(
    dataframe = new_df,
    doc_col = 'abstract',
    key_col = 'keywords',
    model_name = my_plms,
    use_sentence_bert = True,
    split_and_merge = True,
    multi_inter = True,
    )

# 2. Build the Index
# (Strategy 1) : All Distance Metric
all_index = sls.all_distance_metric()
# (Strategy 2) : Restricted Distance Metric
#restricted_index = sls.restricted_distance_metric(nlist = 200, nprobe = 6)

>> Split and Merage embeddings shape(Items x PLMs_dim) : (19035, 768)


>> Keywords embeddings shape(Items x PLMs_dim) : (19035, 768)
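
The two indexing strategies differ in how many stored vectors a query is compared against. The `nlist` and `nprobe` parameter names match the vocabulary of inverted-file indexes (as in FAISS), but whether `sls` wraps such an index is an assumption. The pure-Python sketch below only illustrates the idea: exhaustive search scores every vector, while restricted search first picks the `nprobe` closest cells and scores only their members. All names are hypothetical, and the centroids are given by hand instead of being learned by k-means:

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def flat_search(vectors, query, top_k):
    """Exhaustive search: score the query against every stored vector."""
    scored = sorted(((dot(query, v), i) for i, v in enumerate(vectors)),
                    key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]

def build_cells(vectors, centroids):
    """Assign each vector index to the cell of its most similar centroid."""
    cells = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
        cells[best].append(i)
    return cells

def restricted_search(vectors, centroids, cells, query, top_k, nprobe):
    """Score only the vectors stored in the nprobe cells closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: -dot(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    scored = sorted(((dot(query, vectors[i]), i) for i in candidates),
                    key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:top_k]]

# Toy data: five 2-D vectors and three hand-picked centroids (nlist = 3).
vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
cents = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
cells = build_cells(vecs, cents)
q = [0.9, 0.1]

exact = flat_search(vecs, q, top_k=3)
probe1 = restricted_search(vecs, cents, cells, q, top_k=3, nprobe=1)
probe2 = restricted_search(vecs, cents, cells, q, top_k=3, nprobe=2)
```

With `nprobe=1` the restricted search misses a neighbor that sits in a different cell; with `nprobe=2` it matches the exhaustive result. That is the speed-versus-recall trade-off the restricted strategy accepts.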


## STEP 4. Semantic Search

In [17]:
# 3. Semantic document search (question answering)
my_query = "Research about the Transformer network architecture, based solely on attention mechanisms."

original_outputs, _ = sls.semantic_search(
    user_query = my_query,
    top_k = 10,
    index = all_index,
    print_results = True,
    )


 === Calculate run time : 35.1307 ms === 

>> Your query : Research about the Transformer network architecture, based solely on attention mechanisms.

 >> Top 1 - Paper Title : Area Attention  
 | Cluster : 32 
 | Extracted keywords : translation, nmt, generation, attention, bleu, language, english, text, sentences, dialogue, machine, nlg, neural, sequence, transformer, human, resource, korean, summarization, training 
 | Year : 2019 
 | Abstract :   Existing attention mechanisms are trained to attend to individual items in a
collection (the memory) with a predefined, fixed granularity, e.g., a word
token or an image grid. We propose area attention: a way to attend to areas in
the memory, where each area contains a group of items that are structurally
adjacent, e.g., spatially for a 2D memory such as images, or temporally for a
1D memory such as natural language sentences. Importantly, the shape and the
size of an area are dynamically determined via learning, which enables a model
to 
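
Each hit reports both the abstract and the topic keywords, and the index step builds separate document and keyword embedding matrices. How `multi_inter = True` combines them is defined in `models.semantic_searcher_eng`, which is not shown here. The sketch below illustrates one possible scheme, a weighted sum of query-to-document and query-to-keyword cosine similarity. The weight `alpha`, the function names, and the toy vectors are all assumptions, not the SLS scoring rule:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    d = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return d / (nu * nv) if nu and nv else 0.0

def combined_search(query, doc_vecs, key_vecs, top_k=10, alpha=0.5):
    """Rank items by alpha * sim(query, doc) + (1 - alpha) * sim(query, keywords)."""
    scores = []
    for i, (dv, kv) in enumerate(zip(doc_vecs, key_vecs)):
        score = alpha * cosine(query, dv) + (1 - alpha) * cosine(query, kv)
        scores.append((score, i))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:top_k]]

# Toy vectors: item 2 has a moderately similar document vector
# but a perfectly matching keyword vector.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_keys = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
toy_query = [1.0, 0.0]

mixed = combined_search(toy_query, toy_docs, toy_keys, top_k=2, alpha=0.5)
docs_only = combined_search(toy_query, toy_docs, toy_keys, top_k=2, alpha=1.0)
```

Setting `alpha=1.0` reduces the ranking to plain document similarity, which makes it easy to see how much the keyword signal reorders results.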