# Keyword extractor

1. We use a Sentence Transformers model (`sentence-transformers/all-MiniLM-L6-v2`, https://huggingface.co/sentence-transformers) to generate embeddings. For simplicity, the token embeddings are mean-pooled into a single vector per text.
2. We use KeyBERT (https://maartengr.github.io/KeyBERT/) to extract key phrases from the abstract, using these embeddings.
3. We build a prompt from the abstract and the key phrases, asking Gemini to generate 5 keywords for the article.
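
To make the mean-pooling step in item 1 concrete, here is a minimal sketch in plain Python, without PyTorch. The helper name `masked_mean_pool` and the toy numbers are made up for illustration; the notebook itself does the same thing on tensors. Token vectors are averaged only where the attention mask is 1, so padding does not affect the result.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input, similar to the clamp used in the notebook.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Toy input (hypothetical values): three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [3.0, 4.0]
```

The padding vector `[9.0, 9.0]` is excluded, so the result is the mean of the first three vectors only.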


In [24]:
import os

import google.generativeai as genai
from transformers import AutoModel, AutoTokenizer
import torch
from keybert import KeyBERT # !pip install keybert

# Gemini API setup: read the key from the environment instead of hard-coding it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Load model for sentence embedding
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Function to get text embeddings
def embed_text(texts):
    if isinstance(texts, str):
        texts = [texts]

    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        model_output = model(**encoded_input)

    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return embeddings

# Mean pooling function to get a single vector per text
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

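# Wrap embed_text in a KeyBERT backend: KeyBERT does not recognize a bare
# function as an embedding backend, so passing embed_text directly would not use it.
from keybert.backend import BaseEmbedder

class MeanPoolingEmbedder(BaseEmbedder):
    def embed(self, documents, verbose=False):
        # KeyBERT may pass a NumPy array of strings; the tokenizer needs a list
        if isinstance(documents, str):
            documents = [documents]
        return embed_text(list(documents)).numpy()

# Create KeyBERT with the custom embedding backend
kw_model = KeyBERT(model=MeanPoolingEmbedder())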

# Example abstract (expected keywords are: Sentiment classification, Text classification, Natural language processing, Emotion detection, Sentiment analysis)
abstract = """
Sentiment analysis is a method within natural language processing that evaluates and identifies the emotional
tone or mood conveyed in textual data. Scrutinizing words and phrases categorizes them into positive, negative,
or neutral sentiments. The significance of sentiment analysis lies in its capacity to derive valuable insights
from extensive textual data, empowering businesses to grasp customer sentiments, make informed choices,
and enhance their offerings. For the further advancement of sentiment analysis, gaining a deep understanding
of its algorithms, applications, current performance, and challenges is imperative. Therefore, in this extensive
survey, we began exploring the vast array of application domains for sentiment analysis, scrutinizing them
within the context of existing research. We then delved into prevalent pre-processing techniques, datasets,
and evaluation metrics to enhance comprehension. We also explored Machine Learning, Deep Learning, Large
Language Models and Pre-trained models in sentiment analysis, providing insights into their advantages and
drawbacks. Subsequently, we precisely reviewed the experimental results and limitations of recent state-of-the-art articles.
Finally, we discussed the diverse challenges encountered in sentiment analysis and proposed
future research directions to mitigate these concerns. This extensive review provides a complete understanding
of sentiment analysis, covering its models, application domains, results analysis, challenges, and research
directions.
"""

# Extract keywords with KeyBERT
keywords = kw_model.extract_keywords(
    abstract,
    keyphrase_ngram_range=(1, 3), # Key phrases from 1 to 3 words
    stop_words='english',
    top_n=10  # Number of key phrases to extract
)

# Extract key phrases from KeyBERT's result
key_phrases = [kw[0] for kw in keywords]

print("\n📝 Extracted Key Phrases:")
print(key_phrases)

# Function to build the prompt for Gemini
def build_prompt(abstract, key_phrases):
    key_phrases_str = ", ".join(key_phrases)
    prompt = f"""
    Given the following abstract and a list of detected key phrases, generate 5 clean and normalized keywords that best summarize the article.
    Only provide the keywords, separated by commas, with no additional information.

    Abstract:
    {abstract}

    Detected Key Phrases:
    {key_phrases_str}

    Keywords (separated by commas):
    """
    return prompt

prompt = build_prompt(abstract, key_phrases)

print("\n🔑 Prompt for Gemini:")
print(prompt)

# Function to generate keywords using Gemini
def generate_keywords_gemini(prompt):
    # Named gemini_model to avoid shadowing the global embedding `model`
    gemini_model = genai.GenerativeModel('gemini-2.0-flash')

    response = gemini_model.generate_content(prompt)

    return response.text.strip()

keywords_from_gemini = generate_keywords_gemini(prompt)

# Show result
print("\n🎯 Keywords generated by Gemini:")
print(keywords_from_gemini)

print("\n👁️ Expected keywords were:")
print("Sentiment classification, Text classification, Natural language processing, Emotion detection, Sentiment analysis")


📝 Extracted Key Phrases:
['sentiment analysis method', 'understanding sentiment analysis', 'sentiment analysis providing', 'sentiment analysis scrutinizing', 'sentiment analysis', 'sentiment analysis gaining', 'sentiment analysis covering', 'sentiment analysis proposed', 'significance sentiment analysis', 'encountered sentiment analysis']

🔑 Prompt for Gemini:

    Given the following abstract and a list of detected key phrases, generate 5 clean and normalized keywords that best summarize the article.
    Only provide the keywords, separated by commas, with no additional information.

    Abstract:
    
Sentiment analysis is a method within natural language processing that evaluates and identifies the emotional
tone or mood conveyed in textual data. Scrutinizing words and phrases categorizes them into positive, negative,
or neutral sentiments. The significance of sentiment analysis lies in its capacity to derive valuable insights
from extensive textual data, empowering businesses to g

### Paper search

We use the arXiv API to search for papers using a list of keywords.
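
The query string sent to arXiv can be sketched in isolation. The helper below, `build_arxiv_query`, is hypothetical (it is not part of the notebook or of the `arxiv` package). It mirrors the notebook's approach of keeping the first three keywords, quoting each one as an exact phrase, and joining them with `AND`. Unlike the notebook code, it also drops empty strings, which is an assumption made for this example.

```python
def build_arxiv_query(keywords, max_terms=3):
    """Quote each keyword and AND-join the first `max_terms` of them."""
    cleaned = [kw.strip() for kw in keywords if kw.strip()]
    if not cleaned:
        raise ValueError("At least one keyword must be provided.")
    return " AND ".join(f'"{kw}"' for kw in cleaned[:max_terms])


# Example keywords, as they might come from splitting a comma-separated string.
query = build_arxiv_query(
    ["Sentiment Analysis", " Natural Language Processing", "Machine Learning", "Deep Learning"]
)
print(query)
# "Sentiment Analysis" AND "Natural Language Processing" AND "Machine Learning"
```

Because every phrase must match, adding more keywords narrows the search; capping the list at three is a way to keep the result set from becoming empty.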

In [25]:
import arxiv  # !pip install arxiv

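def get_abstracts_by_keywords(keywords, max_results=15):
    """Search arXiv and return a dict mapping each title to its abstract and URL."""
    if len(keywords) == 0:
        raise ValueError("At least one keyword must be provided.")

    # Use only the first three keywords, each as a quoted exact phrase
    keywords = keywords[:3]
    search_query = " AND ".join(f'"{kw.strip()}"' for kw in keywords)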

    search = arxiv.Search(
        query=search_query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.Relevance,
        sort_order=arxiv.SortOrder.Descending,
    )

    client = arxiv.Client()
    results = {}

    for result in client.results(search):
        results[result.title] = {
            "abstract": result.summary,
            "url": result.entry_id
        }

    return results

In [26]:
papers = get_abstracts_by_keywords([kw.strip() for kw in keywords_from_gemini.split(',')], max_results=5)

for title, data in papers.items():
    print(f"Title: {title}")
    print(f"URL: {data['url']}")
    print(f"Abstract: {data['abstract']}\n")

Title: Sentiment analysis and opinion mining on E-commerce site
URL: http://arxiv.org/abs/2211.15536v2
Abstract: Sentiment analysis or opinion mining help to illustrate the phrase NLP
(Natural Language Processing). Sentiment analysis has been the most significant
topic in recent years. The goal of this study is to solve the sentiment
polarity classification challenges in sentiment analysis. A broad technique for
categorizing sentiment opposition is presented, along with comprehensive
process explanations. With the results of the analysis, both sentence-level
classification and review-level categorization are conducted. Finally, we
discuss our plans for future sentiment analysis research.

Title: Twitter Sentiment Analysis System
URL: http://arxiv.org/abs/1807.07752v1
Abstract: Social media is increasingly used by humans to express their feelings and
opinions in the form of short text messages. Detecting sentiments in the text
has a wide range of applications including identifying anxie

### Test with several abstracts
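
The test in this section scores each abstract by counting exact matches between generated and expected keywords after lowercasing and trimming. The standalone sketch below reproduces that scoring logic under a different name, `count_exact_matches`, to show how strict it is. The inputs are taken from the Abstract 2 run shown in the output further down.

```python
def count_exact_matches(generated, expected):
    """Count keywords present in both sets after lowercasing and trimming."""
    generated_set = {kw.strip().lower() for kw in generated.split(",")}
    expected_set = {kw.strip().lower() for kw in expected}
    return len(generated_set & expected_set)


generated = "smart cities, smart city, urban intelligence, city measurement, urban transformation"
expected = ["Smart City", "Measurement", "Ranking", "Framework", "Model", "Planning", "Project"]
print(count_exact_matches(generated, expected))  # 1
```

Only "smart city" matches. Near misses such as "city measurement" versus "Measurement" or "smart cities" versus "Smart City" score zero, so the match counts should be read as a lower bound on how well the keywords overlap.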

In [None]:
import os

import google.generativeai as genai
from transformers import AutoModel, AutoTokenizer
import torch
from keybert import KeyBERT  # !pip install keybert

# Gemini API setup: read the key from the environment instead of hard-coding it
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Load model for sentence embedding
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Function to get text embeddings
def embed_text(texts):
    if isinstance(texts, str):
        texts = [texts]
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    return embeddings

# Mean pooling function to get a single vector per text
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

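# Wrap embed_text in a KeyBERT backend: KeyBERT does not recognize a bare
# function as an embedding backend, so passing embed_text directly would not use it.
from keybert.backend import BaseEmbedder

class MeanPoolingEmbedder(BaseEmbedder):
    def embed(self, documents, verbose=False):
        # KeyBERT may pass a NumPy array of strings; the tokenizer needs a list
        if isinstance(documents, str):
            documents = [documents]
        return embed_text(list(documents)).numpy()

# Create KeyBERT with the custom embedding backend
kw_model = KeyBERT(model=MeanPoolingEmbedder())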

# Function to build the prompt for Gemini
def build_prompt(abstract, key_phrases):
    key_phrases_str = ", ".join(key_phrases)
    prompt = f"""
    Given the following abstract and a list of detected key phrases, generate 5 clean and normalized keywords that best summarize the article.
    Only provide the keywords, separated by commas, with no additional information.

    Abstract:
    {abstract}

    Detected Key Phrases:
    {key_phrases_str}

    Keywords (separated by commas):
    """
    return prompt

# Function to generate keywords using Gemini
def generate_keywords_gemini(prompt):
    # Named gemini_model to avoid shadowing the global embedding `model`
    gemini_model = genai.GenerativeModel('gemini-2.0-flash')
    response = gemini_model.generate_content(prompt)
    return response.text.strip()

# Function to compare keywords (case-insensitive, trimmed)
def count_matches(generated, expected):
    gen_set = {kw.strip().lower() for kw in generated.split(',')}
    exp_set = {kw.strip().lower() for kw in expected}
    return len(gen_set & exp_set)

# Main function to process abstracts and expected keywords
def process_abstracts(abstracts, expected_keywords_list):
    results = []
    for abstract, expected_keywords in zip(abstracts, expected_keywords_list):
        keywords = kw_model.extract_keywords(
            abstract,
            keyphrase_ngram_range=(1, 3),
            stop_words='english',
            top_n=10
        )
        key_phrases = [kw[0] for kw in keywords]
        prompt = build_prompt(abstract, key_phrases)
        gemini_keywords = generate_keywords_gemini(prompt)
        match_count = count_matches(gemini_keywords, expected_keywords)
        results.append({
            "abstract": abstract,
            "key_phrases": key_phrases,
            "gemini_keywords": gemini_keywords,
            "expected_keywords": expected_keywords,
            "match_count": match_count
        })
    return results

abstracts = [
    """Sentiment analysis is a method within natural language processing that evaluates and identifies the emotional
    tone or mood conveyed in textual data. Scrutinizing words and phrases categorizes them into positive, negative,
    or neutral sentiments. The significance of sentiment analysis lies in its capacity to derive valuable insights
    from extensive textual data, empowering businesses to grasp customer sentiments, make informed choices,
    and enhance their offerings. For the further advancement of sentiment analysis, gaining a deep understanding
    of its algorithms, applications, current performance, and challenges is imperative. Therefore, in this extensive
    survey, we began exploring the vast array of application domains for sentiment analysis, scrutinizing them
    within the context of existing research. We then delved into prevalent pre-processing techniques, datasets,
    and evaluation metrics to enhance comprehension. We also explored Machine Learning, Deep Learning, Large
    Language Models and Pre-trained models in sentiment analysis, providing insights into their advantages and
    drawbacks. Subsequently, we precisely reviewed the experimental results and limitations of recent state-of-the-art articles.
    Finally, we discussed the diverse challenges encountered in sentiment analysis and proposed
    future research directions to mitigate these concerns. This extensive review provides a complete understanding
    of sentiment analysis, covering its models, application domains, results analysis, challenges, and research
    directions.""",

    """Smart cities are an international phenomenon. Many cities are actively working to build or transform their models toward that of a Smart City.
    There is constant research and reports devoted to measuring the intelligence of cities through establishing specific methodologies and indicators
    (grouped by various criteria). We believe the subject lacks a certain uniformity, which we aim to redress in this paper by suggesting a framework
    for properly measuring the smart level of a city. Cities are complex and heterogeneous structures, which complicates comparisons between them.
    To address this we propose an N--dimensional measurement framework where each level or dimension supplies information of interest that is evaluated
    independently. As a result, the measure of a city's intelligence is the result of the evaluations obtained for each of these levels. To this end,
    we have typified the transformation (city to smart city) and the measurement (smart city ranking) processes.""",

    """Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries
    for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with
    language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from
    both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark,
    especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark.
    This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing
    and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research
    and development directions on Apache Spark for big data analytics.""",

    """Football should not be considered only a branch of sports. Football is a sociological phenomenon. Fans being together doesn't always make for a peaceful
    atmosphere. Those who fail to control their aggressive tendencies create incidents that overshadow football. The purpose of this article is to determine
    the emergence of aversive social learning in football fans. Football fans are evaluated within the framework of the social learning theory, which is
    the basis of the aggressive theory. The article reveals the importance of learning by anticipating the behavior of aggressive others and by imitating
    fan groups and group leaders as well as media personalities.""",

    """Plant foods are consumed worldwide due to their immense energy density and nutritive value. Their consumption has been following an increasing trend
    due to several metabolic disorders linked to non-vegetarian diets. In addition to their nutritive value, plant foods contain several bioactive
    constituents that have been shown to possess health-promoting properties. Plant-derived bioactive compounds, such as biologically active proteins,
    polyphenols, phytosterols, biogenic amines, carotenoids, etc., have been reported to be beneficial for human health, for instance in cases of cancer,
    cardiovascular diseases, and diabetes, as well as for people with gut, immune function, and neurodegenerative disorders. Previous studies have reported
    that bioactive components possess antioxidative, anti-inflammatory, and immunomodulatory properties, in addition to improving intestinal barrier
    functioning etc., which contribute to their ability to mitigate the pathological impact of various human diseases. This review describes the bioactive
    components derived from fruit, vegetables, cereals, and other plant sources with health promoting attributes, and the mechanisms responsible for the
    bioactive properties of some of these plant components. This review mainly compiles the potential of food derived bioactive compounds, providing
    information for researchers that may be valuable for devising future strategies such as choosing promising bioactive ingredients to make functional foods
    for various non-communicable disorders.""",

    """In this paper, a mathematical model with a standard incidence rate is proposed to assess the role of media such as facebook, television, radio and tweeter
    in the mitigation of the outbreak of COVID-19. The basic reproduction number R0 which is the threshold dynamics parameter between the disappearance and
    the persistence of the disease has been calculated. And, it is obvious to see that it varies directly to the number of hospitalized people, asymptomatic,
    symptomatic carriers and the impact of media coverage. The local and the global stabilities of the model have also been investigated by using the
    Routh–Hurwitz criterion and the Lyapunov’s functional technique, respectively. Furthermore, we have performed a local sensitivity analysis to assess the
    impact of any variation in each one of the model parameter on the threshold R0 and the course of the disease accordingly. We have also computed the
    approximative rate at which herd immunity will occur when any control measure is implemented. To finish, we have presented some numerical simulation
    results by using some available data from the literature to corroborate our theoretical findings.""",

    """Online mapping technologies such as Google Maps and Street View have become increasingly accessible. These technologies have many convenient uses in
    everyday life, but law enforcement agencies have expressed concern that they could be exploited by offenders and might alter existing offending patterns
    and habits. For environmental criminologists, they have the potential to open up new approaches to conducting research. This paper draws on the results
    of earlier studies in related fields and a handful of criminological studies to discuss how these online mapping applications can trigger new research
    questions, and how they could be considered a valuable methodological addition to criminological research.""",

    """In recent years, extensive research has been dedicated to Mars exploration and the potential for sustainable interplanetary human colonization.
    One of the significant challenges in ensuring the survival of life on Mars lies in the production of food as the Martian environment is highly inhospitable
    to agriculture, rendering it impractical to transport food from Earth. To improve the well-being and quality of life for future space travelers on Mars,
    it is crucial to develop innovative horticultural techniques and food processing technologies. The unique challenges posed by the Martian environment,
    such as the lack of oxygen, nutrient-deficient soil, thin atmosphere, low gravity, and cold, dry climate, necessitate the development of advanced farming
    strategies. This study explores existing knowledge and various technological innovations that can help overcome the constraints associated with food
    production and water extraction on Mars. The key lies in utilizing resources available on Mars through in-situ resource utilization. Water can be
    extracted from beneath the ice and from the Martian soil. Furthermore, hydroponics in controlled environment chambers, equipped with nutrient delivery
    systems and waste recovery mechanisms, have been investigated as a means of cultivating crops on Mars. The inefficiency of livestock production, which
    requires substantial amounts of water and land, highlights the need for alternative protein sources such as microbial protein, insects, and in-vitro meat.
    Moreover, the fields of synthetic biology and 3-D food printing hold immense potential in revolutionizing food production and making significant
    contributions to the sustainability of human life on Mars."""
]

expected_keywords_list = [
    [
        "Sentiment classification", "Text classification", "Natural language processing",
        "Emotion detection", "Sentiment analysis"
    ],
    [
        "Smart City", "Measurement", "Ranking", "Framework", "Model", "Planning", "Project"
    ],
    [
        "Big data", "Data analysis", "Distributed and parallel computing", "Cluster computing",
        "Apache Spark", "Machine learning", "Graph analysis", "Stream processing",
        "Resilient Distributed Datasets"
    ],
    [
        "Football", "Fans", "Aggression", "Fan behavior", "Social learning theory"
    ],
    [
        "plant foods", "bioactive components", "antioxidants", "polyphenols", "anti-inflammatory",
        "chronic diseases", "human health", "gut health"
    ],
    [
        "COVID-19 mitigation", "Media coverage", "Mathematical study", "Sensitivity analysis",
        "Herd immunity", "Numerical simulation"
    ],
    [
        "Google Maps", "Street View", "Environmental criminology", "Innovation", "Methodology", "Methods"
    ],
    [
        "Mars", "In-situ food production", "Single-cell proteins", "Insect farming",
        "3-D food printing", "Synthetic biology"
    ]
]

results = process_abstracts(abstracts, expected_keywords_list)

for i, res in enumerate(results):
    print(f"\n📄 Abstract {i+1}")
    print("🎯 Keywords generated by Gemini:", res["gemini_keywords"])
    print("👁️ Expected keywords were:", res["expected_keywords"])
    print(f"✅ Matches found: {res['match_count']} of 5")


📄 Abstract 1
🎯 Keywords generated by Gemini: Sentiment Analysis, Natural Language Processing, Machine Learning, Deep Learning, Textual Data
👁️ Expected keywords were: ['Sentiment classification', 'Text classification', 'Natural language processing', 'Emotion detection', 'Sentiment analysis']
✅ Matches found: 2 of 5

📄 Abstract 2
🎯 Keywords generated by Gemini: smart cities, smart city, urban intelligence, city measurement, urban transformation
👁️ Expected keywords were: ['Smart City', 'Measurement', 'Ranking', 'Framework', 'Model', 'Planning', 'Project']
✅ Matches found: 1 of 5

📄 Abstract 3
🎯 Keywords generated by Gemini: Apache Spark, big data analytics, cluster computing, machine learning, stream processing
👁️ Expected keywords were: ['Big data', 'Data analysis', 'Distributed and parallel computing', 'Cluster computing', 'Apache Spark', 'Machine learning', 'Graph analysis', 'Stream processing', 'Resilient Distributed Datasets']
✅ Matches found: 4 of 5

📄 Abstract 4
🎯 Keywords gener