# arXiv Keyword Extraction and Analysis Pipeline with KeyBERT and Taipy

https://towardsdatascience.com/arxiv-keyword-extraction-and-analysis-pipeline-with-keybert-and-taipy-2972e81d9fa4

In [1]:
"""
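arXiv is a website that collects preprints in mathematics, physics, astronomy, computer science, quantitative biology, and statistics.
"""

'\narXiv is a website that collects preprints in mathematics, physics, astronomy, computer science, quantitative biology, and statistics.\n'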

- Keyword extraction automatically identifies and extracts the most relevant words from a given text.
- Keyword analysis examines the extracted keywords to gain insight into the underlying patterns.
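
To make the distinction concrete, here is a minimal sketch of the analysis half. It assumes the extraction step has already produced a keyword list per abstract. The lists below are made-up illustrative data, not real arXiv output, and the counting approach is only one simple way to surface patterns.

```python
from collections import Counter

# Hypothetical keywords already extracted from three abstracts (illustrative data only)
keywords_per_abstract = [
    ["learning", "robots", "planning"],
    ["learning", "language", "models"],
    ["language", "learning", "reasoning"],
]

# Keyword analysis: count in how many abstracts each keyword appears
# (set() ensures a keyword is counted at most once per abstract)
counts = Counter(kw for kws in keywords_per_abstract for kw in set(kws))

# The two most frequent keywords across the collection
top_keywords = counts.most_common(2)
print(top_keywords)  # [('learning', 3), ('language', 2)]
```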

# Tools Overview
- arXiv API Python wrapper
- KeyBERT
- Taipy

arxiv 1.4.3
keybert 0.7.0
pandas 1.5.3
taipy 2.2.0

# Step 2 — Setup Configuration File

In [3]:
import yaml

In [4]:
with open('config.yml') as f:
    cfg = yaml.safe_load(f)

In [5]:
cfg

{'QUERY': 'artificial intelligence',
 'MAX_ABSTRACTS': 30,
 'NGRAM_MIN': 1,
 'NGRAM_MAX': 1,
 'TOP_N': 3,
 'DIVERSITY_ALGO': 'mmr',
 'DIVERSITY': 0.2,
 'NR_CANDIDATES': 20}
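
The keys above look like they line up with arguments of KeyBERT's `extract_keywords` method (`keyphrase_ngram_range`, `top_n`, `use_mmr`, `diversity`, `use_maxsum`, `nr_candidates`). The sketch below shows one way such a mapping could be written. The helper name `keybert_kwargs` and the `'maxsum'` value for `DIVERSITY_ALGO` are assumptions made for illustration and do not come from the original notebook.

```python
def keybert_kwargs(cfg: dict) -> dict:
    """Translate the config dict into keyword arguments for KeyBERT's extract_keywords.

    Hypothetical helper: the mapping of config keys to arguments is an assumption.
    """
    kwargs = {
        # Range of n-gram lengths to consider as keyword candidates
        "keyphrase_ngram_range": (cfg["NGRAM_MIN"], cfg["NGRAM_MAX"]),
        # Number of keywords to return per document
        "top_n": cfg["TOP_N"],
    }
    algo = cfg["DIVERSITY_ALGO"]
    if algo == "mmr":
        # Maximal Marginal Relevance: trade relevance against diversity
        kwargs["use_mmr"] = True
        kwargs["diversity"] = cfg["DIVERSITY"]
    elif algo == "maxsum":  # assumed spelling of the alternative option
        # Max Sum Distance: choose among the top nr_candidates candidates
        kwargs["use_maxsum"] = True
        kwargs["nr_candidates"] = cfg["NR_CANDIDATES"]
    return kwargs


# Same values as the config shown above
example_cfg = {
    "QUERY": "artificial intelligence",
    "MAX_ABSTRACTS": 30,
    "NGRAM_MIN": 1,
    "NGRAM_MAX": 1,
    "TOP_N": 3,
    "DIVERSITY_ALGO": "mmr",
    "DIVERSITY": 0.2,
    "NR_CANDIDATES": 20,
}

kwargs = keybert_kwargs(example_cfg)
print(kwargs)
```

With KeyBERT installed, the result could then be unpacked into a call such as `KeyBERT().extract_keywords(abstract, **kwargs)`.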

In [6]:
# Step 3 — Build Functions

In [7]:
# (3.1) Retrieve and Save arXiv Abstracts and Metadata


In [11]:
# function.py
import arxiv
import pandas as pd

# Function 1 - Retrieve abstracts from arXiv database
def extract_arxiv(query: str):
    search = arxiv.Search(
                query=query,
                max_results=cfg['MAX_ABSTRACTS'], # No. of abstracts to retrieve (from config)
                sort_by=arxiv.SortCriterion.SubmittedDate,
                sort_order=arxiv.SortOrder.Descending  # Sort by latest date
                )
     
    # Return the arxiv.Search object (results are fetched later via search.results())
    return search

# Function 2 - Save abstract text and metadata in pd.DataFrame
def save_in_dataframe(search):
    df = pd.DataFrame([{'uid': result.entry_id.split('.')[-1],
                        'title': result.title,
                        'date_published': result.published,
                        'abstract': result.summary} for result in search.results()])

    # Return the DataFrame so it can be passed on to preprocess_data
    return df
    
# Function 3 - Preprocess data
def preprocess_data(df: pd.DataFrame):
    df['date_published'] = pd.to_datetime(df['date_published'])

    # Create empty column to store keyword and similarity scores
    df['keywords_and_scores'] = ''

    # Create empty column to store keyword texts only
    df['keywords'] = ''

    return df
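
It may help to see what the `uid` expression in `save_in_dataframe` actually yields. The sketch below applies the same `split('.')[-1]` logic to two sample entry URLs. The helper `extract_uid` and both URLs are illustrative assumptions, not part of the original code.

```python
def extract_uid(entry_id: str) -> str:
    """Mirror the 'uid' expression used in save_in_dataframe (hypothetical helper)."""
    return entry_id.split('.')[-1]


# Illustrative entry URLs in the new-style and old-style arXiv identifier formats
new_style = extract_uid("http://arxiv.org/abs/2305.12345v1")
old_style = extract_uid("http://arxiv.org/abs/hep-th/9901001v1")

print(new_style)  # 12345v1
print(old_style)  # org/abs/hep-th/9901001v1
```

For new-style identifiers the expression keeps only the part after the last dot, so the `2305` prefix is dropped. For old-style identifiers it returns most of the URL path. If the full short ID is needed, the `get_short_id()` method on `arxiv.Result` may be a more robust choice.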