<h1>Embedding Model to Get Keywords Given News Context</h1>

<h2>Objective</h2>
<p>Use the <code>Metric-AI/armenian-text-embeddings-1</code> model to extract keywords from news articles. The model produces an embedding that represents the content of an article, and that embedding is then used to select the words that best describe the article.</p>

<p><strong>Problem Statement:</strong> Given a news article as input, the model should generate an embedding that captures the essence of the article's context. From this embedding, we aim to extract keywords that best represent the article's topic or main ideas.</p>

<p><strong>Implementation:</strong> This task is implemented using <code>sentence-transformers</code> for generating embeddings and <code>KeyBERT</code> for keyword extraction. KeyBERT embeds candidate words from the article and ranks them by cosine similarity to the embedding of the whole article. The resulting embeddings and keywords can then be used for further analysis or downstream tasks.</p>
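<p>The ranking idea behind embedding-based keyword extraction can be illustrated with a minimal sketch. The vectors below are made-up 3-dimensional values and <code>rank_candidates</code> is a hypothetical helper, not part of KeyBERT; a real model would produce much larger vectors.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, candidate_vecs, top_n=2):
    """Return the top_n (word, score) pairs most similar to the document vector."""
    scored = [(word, cosine(doc_vec, vec)) for word, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical embeddings, for illustration only.
doc_vec = [0.9, 0.1, 0.3]
candidate_vecs = {
    "album": [0.8, 0.2, 0.3],
    "composer": [0.7, 0.0, 0.6],
    "weather": [0.0, 1.0, 0.1],
}

top = rank_candidates(doc_vec, candidate_vecs, top_n=2)
print(top)  # "album" and "composer" rank above "weather"
```

<p>Words whose vectors point in roughly the same direction as the article vector score close to 1 and are kept; unrelated words score near 0 and are dropped.</p>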

<h3>Simple One-to-One Example Using Gradio</h3>
<p>A Gradio interface provides a simple way to test the model on individual news articles. The user enters a single article, and the model returns the extracted keywords. This one-to-one mode is useful for quickly checking the model's output in real time.</p>

<h3>Batched Processing for Large Datasets</h3>
<p>To handle large datasets efficiently, we use batched processing. The dataset is split into fixed-size batches of articles, and keywords are extracted one batch at a time. This keeps memory usage bounded and lets the pipeline scale to many articles without significant delays.</p>

<h2>Importing Required Packages</h2>

<p>Before starting, ensure you have all the necessary packages installed. If a package is missing, you can install it using <code>pip</code>. Below is the list of required imports for this project:</p>

In [4]:
from keybert import KeyBERT
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import torch
import gradio as gr

<h2>Initializing Sentence-BERT Model and Keyword Extractor</h2>

<p>In this step, we initialize the <code>sentence-transformers</code> model to generate embeddings and the <code>KeyBERT</code> model to extract keywords based on those embeddings. For computational simplicity, we use only the test set of the Ilur dataset for this task.</p>

<p><strong>Implementation:</strong> First, the Sentence-BERT model is loaded onto the GPU when one is available and onto the CPU otherwise. The loaded model is then passed to KeyBERT, which uses it as the embedding backend for keyword extraction.</p>

In [3]:
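# Use the GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1', device=device)
# KeyBERT reuses the same model as its embedding backend.
kw_model = KeyBERT(model=embedding_model)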


No sentence-transformers model found with name Metric-AI/armenian-text-embeddings-1. Creating a new one with mean pooling.
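<p>The message above says the model is wrapped with mean pooling, that is, the sentence embedding is the average of the token embeddings with padding positions ignored. A minimal sketch of that operation follows; <code>mean_pool</code> and the tiny 2-dimensional vectors are hypothetical and only illustrate the arithmetic, not the library's internal code.</p>

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                summed[i] += value
    if count == 0:
        raise ValueError("attention_mask has no real tokens")
    return [s / count for s in summed]

# Hypothetical token embeddings: three real tokens and one padding position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # [3.0, 4.0]
```

<p>The padding vector <code>[9.0, 9.0]</code> does not influence the result because its mask value is 0.</p>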


<h2>Extracting Keywords Using KeyBERT</h2>

<p>The <code>extract_keywords</code> method of the <code>KeyBERT</code> package extracts keywords from a given text. The package offers two strategies for diversifying the selected keywords: <strong>MaxSum</strong> (Max Sum Distance) and <strong>MMR</strong> (Maximal Marginal Relevance). Either strategy can be used, and you can choose the strategy and its parameters based on your needs.</p>

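<p><strong>Implementation:</strong> You pass the text to the <code>KeyBERT</code> model, which embeds it with the configured embedding model and extracts keywords using the chosen strategy. Parameters fine-tune the results: <code>keyphrase_ngram_range</code> sets the length of candidate phrases in words, <code>top_n</code> sets how many keywords are returned, and <code>nr_candidates</code> sets the size of the candidate pool considered by MaxSum.</p>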
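<p>To make the MMR strategy concrete, here is a minimal sketch of Maximal Marginal Relevance in plain Python. It is an illustration of the general technique under stated assumptions, not KeyBERT's own code: the function name <code>mmr</code>, the <code>diversity</code> weighting as written here, and the 2-dimensional vectors are all hypothetical.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Select top_n words, trading relevance to the document against redundancy."""
    if not candidates:
        return []
    relevance = {word: cosine(doc_vec, vec) for word, vec in candidates.items()}
    # Start with the single most relevant candidate.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(candidates)):
        best_word, best_score = None, -math.inf
        for word, vec in candidates.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected keyword.
            redundancy = max(cosine(vec, candidates[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

# Hypothetical embeddings; "albums" is a near-duplicate of "album".
doc_vec = [1.0, 0.5]
candidates = {
    "album": [1.0, 0.45],
    "albums": [1.0, 0.4],
    "orchestra": [0.3, 1.0],
}

diverse = mmr(doc_vec, candidates, top_n=2, diversity=0.7)
relevant_only = mmr(doc_vec, candidates, top_n=2, diversity=0.0)
print(diverse)        # ['album', 'orchestra']
print(relevant_only)  # ['album', 'albums']
```

<p>With a high <code>diversity</code> value the near-duplicate is skipped in favour of a less similar word; with <code>diversity=0.0</code> the selection reduces to plain relevance ranking.</p>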

<p>For more information, visit the official <a href="https://github.com/MaartenGr/KeyBERT" target="_blank">KeyBERT GitHub page</a>.</p>
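<p>The MaxSum strategy can be sketched in a similar way: take the candidates most similar to the document, then choose the combination whose members are least similar to one another. The sketch below is an assumption-laden illustration rather than KeyBERT's implementation; <code>max_sum_distance</code> and the toy vectors are hypothetical.</p>

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def max_sum_distance(doc_vec, candidates, top_n=2, nr_candidates=3):
    """Among the nr_candidates most relevant words, pick the top_n least similar to each other."""
    ranked = sorted(candidates, key=lambda w: cosine(doc_vec, candidates[w]), reverse=True)
    pool = ranked[:nr_candidates]
    top_n = min(top_n, len(pool))
    best_combo, best_total = (), math.inf
    # Exhaustive search over every combination of top_n words from the pool.
    for combo in itertools.combinations(pool, top_n):
        total = sum(cosine(candidates[a], candidates[b])
                    for a, b in itertools.combinations(combo, 2))
        if total < best_total:
            best_combo, best_total = combo, total
    return list(best_combo)

# Hypothetical embeddings, for illustration only.
doc_vec = [1.0, 0.5]
candidates = {
    "album": [1.0, 0.45],
    "albums": [1.0, 0.4],
    "orchestra": [0.3, 1.0],
    "weather": [-1.0, 0.2],
}

picked = max_sum_distance(doc_vec, candidates, top_n=2, nr_candidates=3)
print(picked)  # ['albums', 'orchestra']
```

<p>Because the search is exhaustive, the cost grows quickly with the pool size. With <code>nr_candidates=20</code> and <code>top_n=5</code>, the values used in this notebook, there are 15,504 combinations per article.</p>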


<h2>Simple One-to-One Example Using Gradio</h2>

<p>This Gradio interface allows you to extract keywords from a given news article using two methods: MaxSum and MMR (Maximal Marginal Relevance). The union of keywords from both methods is returned, ensuring unique keywords are shown.</p>

<h2>How to Use</h2>
<p>To use the tool, enter your news article into the input textbox and submit it. The interface displays the union of the keywords extracted by the MaxSum and MMR methods, with duplicates removed.</p>
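<p>The union step can be sketched on its own. A plain <code>set</code> union does not guarantee any particular order, so the variant below keeps the order in which keywords were first seen. It is an optional alternative, and <code>union_keywords</code> and the sample lists are hypothetical.</p>

```python
def union_keywords(maxsum_keywords, mmr_keywords):
    """Merge two keyword lists, dropping duplicates while keeping first-seen order."""
    # dict keys preserve insertion order (Python 3.7+), unlike set elements.
    return list(dict.fromkeys(list(maxsum_keywords) + list(mmr_keywords)))

# Hypothetical keyword lists, for illustration only.
maxsum = ["album", "composer", "orchestra"]
mmr = ["composer", "december", "album"]

merged = union_keywords(maxsum, mmr)
print(merged)  # ['album', 'composer', 'orchestra', 'december']
```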

<h2>Example</h2>
<p>Enter the following news article text into the input textbox:</p>
<code>
Դեկտեմբերի 20-ին կթողարկվի հայ մեծանուն կոմպոզիտոր Տիգրան Մանսուրյանի կինոերաժշտությունների ալբոմը` Հայաստանի պետական սիմֆոնիկ նվագախմբի կատարմամբ` Սերգեյ Սմբատյանի ղեկավարությամբ: Այս մասին հայտնում են Երևանի քաղաքապետարանից:
</code>

<p>After entering the text, you will receive the extracted keywords, which may look like this:</p>
<pre>
['կթողարկվի', 'կինոերաժշտությունների', 'ալբոմը', 'կոմպոզիտոր', 'քաղաքապետարանից', 'դեկտեմբերի', 'մանսուրյանի', 'հայ']</pre>

<h2>Output Explanation</h2>
<p>The output shows the most relevant keywords extracted from the news article, which can help summarize or analyze the main topics of the article.</p>


In [16]:
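def extract_keywords(text):
    """Return the union of MaxSum and MMR keywords for a single article."""
    # MaxSum: pick the 5 least mutually similar words among the 20 best candidates.
    maxsum_keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 1),  # single-word keywords
        use_maxsum=True,
        nr_candidates=20,
        top_n=5
    )
    # Each result is a (keyword, score) pair; keep only the keyword.
    maxsum_keywords = {keyword for keyword, _score in maxsum_keywords}

    # MMR: balance relevance to the article against redundancy among keywords.
    mmr_keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 1),
        use_mmr=True,
        nr_candidates=20,
        top_n=5
    )
    mmr_keywords = {keyword for keyword, _score in mmr_keywords}

    # The set union removes keywords found by both methods.
    return list(maxsum_keywords | mmr_keywords)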

iface = gr.Interface(
    fn=extract_keywords,
    inputs=gr.Textbox(label="📝 Enter News Article", placeholder="Paste your news article here..."),
    outputs=gr.Textbox(label="Extracted Keywords (Union of MaxSum and MMR)"),
    live=False,
    clear_btn="Clear"
)

iface.launch(share=True)

* Running on local URL:  http://127.0.0.1:7878
* Running on public URL: https://2c1f2fe5bae41703d5.gradio.live





<h2>Batched Processing for Large Datasets</h2>
<p>In this section, we demonstrate how to process large datasets efficiently using batched processing. Batched processing allows you to handle large volumes of text data without overwhelming your system's memory. This is particularly useful when working with large-scale datasets like the Ilur dataset.</p>

<p><strong>Note:</strong> For this experiment, we are using only the <code>test</code> set from the Ilur dataset to demonstrate the processing technique.</p>

<h3>How It Works</h3>
<p>The input data is processed in batches, allowing for faster computation and optimized memory usage. We split the dataset into smaller batches and feed them into the model sequentially, ensuring we don't overload the system memory.</p>
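<p>The splitting step described above can be sketched without any library. <code>iter_batches</code> is a hypothetical helper and the rows are placeholders; the row count and batch size mirror the values used in this notebook (2445 rows, batches of 128).</p>

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Placeholder rows standing in for the articles of the test split.
rows = [f"article {i}" for i in range(2445)]

batch_sizes = [len(batch) for batch in iter_batches(rows, 128)]
print(len(batch_sizes), batch_sizes[-1])  # 20 13
```

<p>Only one batch needs to be held in working memory at a time. The last batch is smaller than the others because 2445 is not a multiple of 128.</p>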

In [10]:
dataset = load_dataset('Metric-AI/ILUR-news-text-classification-corpus-formatted')['test']
print(dataset)

Dataset({
    features: ['Sentence', 'class', 'source'],
    num_rows: 2445
})


In [11]:
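def extract_maxsum_keywords(batch):
    """Add a 'maxsum_keywords' column holding the MaxSum keywords of each article."""
    list_of_keywords = []

    # `batch` holds up to `batch_size` rows; keywords are extracted per article.
    for text in batch['Sentence']:
        results = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1),
                                            use_maxsum=True, nr_candidates=20, top_n=5)
        # Keep the keyword strings; drop the scores and any duplicates.
        list_of_keywords.append(list({keyword for keyword, _score in results}))
    batch['maxsum_keywords'] = list_of_keywords

    return batch

dataset = dataset.map(extract_maxsum_keywords, batched=True, batch_size=128)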


In [12]:
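def extract_mmr_keywords(batch):
    """Add an 'mmr_keywords' column holding the MMR keywords of each article."""
    list_of_keywords = []

    # `batch` holds up to `batch_size` rows; keywords are extracted per article.
    for text in batch['Sentence']:
        results = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1),
                                            use_mmr=True, nr_candidates=20, top_n=5)
        # Keep the keyword strings; drop the scores and any duplicates.
        list_of_keywords.append(list({keyword for keyword, _score in results}))
    batch['mmr_keywords'] = list_of_keywords

    return batch

dataset = dataset.map(extract_mmr_keywords, batched=True, batch_size=128)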


In [13]:
# Union of both keyword lists; sorted() returns a list with a stable order.
dataset = dataset.map(lambda x: {'keywords': sorted(set(x['maxsum_keywords'] + x['mmr_keywords']))})


<h2>Dataset After Batched Processing</h2>
<p>After processing the dataset in batches, the following columns are added to the dataset:</p>
<ul>
  <li><strong><code>maxsum_keywords</code></strong>: Keywords extracted using the MaxSum method.</li>
  <li><strong><code>mmr_keywords</code></strong>: Keywords extracted using the MMR (Maximal Marginal Relevance) method.</li>
  <li><strong><code>keywords</code></strong>: The union of keywords from both MaxSum and MMR methods, providing a comprehensive list of keywords for each news article.</li>
</ul>

In [15]:
print(dataset)

Dataset({
    features: ['Sentence', 'class', 'source', 'maxsum_keywords', 'mmr_keywords', 'keywords'],
    num_rows: 2445
})


<h1>Topic Modeling with BERTopic on the Ilur Dataset</h1>

<h2>Objective</h2>
<p>Use the <code>BERTopic</code> package to perform topic modeling on the Ilur dataset, identifying latent topics from the provided news articles. This method utilizes embeddings generated by a pre-trained model to discover clusters of related documents, helping to uncover hidden themes within the dataset.</p>

<p><strong>Problem Statement:</strong> Given a collection of news articles, the model aims to automatically identify topics or themes present in the articles, which can be used for further analysis, categorization, or summarization.</p>

<p><strong>Implementation:</strong> The <code>BERTopic</code> model uses embeddings to represent each document in a high-dimensional space, and then applies dimensionality reduction and clustering techniques to discover topics. The result is a set of topics that represent different themes within the dataset.</p>

<p><strong>For more information:</strong> You can explore the <a href="https://github.com/MaartenGr/BERTopic" target="_blank">BERTopic GitHub repository</a> or the official <a href="https://maartengr.github.io/BERTopic/" target="_blank">BERTopic documentation</a> for detailed guides, tutorials, and advanced features.</p>


In [9]:
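from bertopic import BERTopic

# Reuse the embedding model loaded earlier and ask BERTopic to reduce the result to 20 topics.
topic_model = BERTopic(embedding_model=embedding_model, nr_topics=20)

# `topics` holds one topic id per article; -1 marks outliers.
topics, probabilities = topic_model.fit_transform(dataset['Sentence'])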
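<p>Once a topic id has been assigned to every article, a quick way to inspect the result is to count and group the articles per topic. The sketch below uses made-up documents and topic ids rather than the notebook's actual output, and <code>summarize_topics</code> is a hypothetical helper; the only BERTopic convention it relies on is that <code>-1</code> marks outliers.</p>

```python
from collections import Counter, defaultdict

def summarize_topics(docs, topics):
    """Count documents per topic and group the documents by topic id."""
    counts = Counter(topics)
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return counts, dict(grouped)

# Hypothetical documents and topic ids, for illustration only.
docs = ["doc a", "doc b", "doc c", "doc d", "doc e"]
topics = [0, 1, 0, -1, 1]

counts, grouped = summarize_topics(docs, topics)
print(counts[0], counts[1], counts[-1])  # 2 2 1
print(grouped[0])                        # ['doc a', 'doc c']
```

<p>In the notebook, the same helper could be called with <code>dataset['Sentence']</code> and the <code>topics</code> list returned by <code>fit_transform</code>.</p>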

In [13]:
topic_model.visualize_topics(top_n_topics=10)