<h1>Keyword Extraction from News Text with an Embedding Model on the Ilur Dataset</h1>

<h2>Objective</h2>
<p>Use a pretrained embedding model on the Ilur dataset to extract keywords from news text. The model produces embeddings that represent the content of each news item, and those embeddings are then used to select the words that best describe it.</p>

<p><strong>Problem Statement:</strong> Given a news article as input, the model should generate an embedding that captures the essence of the article's context. From this embedding, we aim to extract keywords that best represent the article's topic or main ideas.</p>

<p><strong>Implementation:</strong> This task uses <code>sentence-transformers</code> to generate embeddings and <code>KeyBERT</code> to extract keywords. KeyBERT embeds the document and its candidate words with the same model, then ranks the candidates by cosine similarity to the document embedding. The resulting keywords can be used for further analysis or downstream tasks.</p>
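
<p>The ranking idea behind embedding-based keyword extraction can be sketched with toy vectors. This is a simplified illustration, not KeyBERT's actual code: the three-dimensional vectors, the English words, and the helper names <code>cosine</code> and <code>rank_candidates</code> are all invented for the example. In the real pipeline the vectors come from the embedding model.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, cand_vecs, top_n=2):
    """Return the top_n (word, score) pairs most similar to the document vector."""
    scored = [(word, cosine(doc_vec, vec)) for word, vec in cand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy vectors, invented for illustration only
doc_vec = [0.9, 0.1, 0.3]
cand_vecs = {
    "bank": [0.8, 0.2, 0.3],
    "market": [0.7, 0.0, 0.6],
    "weather": [0.0, 1.0, 0.1],
}

top_keywords = rank_candidates(doc_vec, cand_vecs)
print([word for word, _score in top_keywords])
```

<p>Words whose vectors point in nearly the same direction as the document vector rank highest, which is why an off-topic word such as "weather" drops out here.</p>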

<h2>Importing Required Packages</h2>

<p>Before starting, ensure you have all the necessary packages installed. If a package is missing, you can install it using <code>pip</code>. Below is the list of required imports for this project:</p>
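
<p>One way to check which packages are missing before running the notebook is sketched below. The helper name <code>missing_packages</code> is hypothetical, and the list includes <code>bertopic</code>, which is only imported later in the topic modeling part. Note that the pip name for <code>sentence_transformers</code> is <code>sentence-transformers</code>.</p>

```python
import importlib.util

# Import names used in this notebook
REQUIRED = ["keybert", "datasets", "sentence_transformers", "torch", "bertopic"]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```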

In [1]:
from keybert import KeyBERT
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import torch

<h2>Initializing Sentence-BERT Model and Keyword Extractor</h2>

<p>In this step, we initialize the <code>sentence-transformers</code> model to generate embeddings and the <code>KeyBERT</code> model to extract keywords based on those embeddings. For computational simplicity, we use only the test set of the Ilur dataset for this task.</p>

<p><strong>Implementation:</strong> The Sentence-BERT model <code>Metric-AI/armenian-text-embeddings-1</code> is loaded first, on the GPU when one is available and on the CPU otherwise. It is then passed to <code>KeyBERT</code> as the embedding backend. Loading only the <code>test</code> split keeps the number of articles, and therefore the runtime, small.</p>

In [2]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1', device=device)
kw_model = KeyBERT(model=embedding_model)

dataset = load_dataset('Metric-AI/ILUR-news-text-classification-corpus-formatted')['test']

No sentence-transformers model found with name Metric-AI/armenian-text-embeddings-1. Creating a new one with mean pooling.
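
<p>The message above says that the checkpoint was wrapped with mean pooling, that is, the sentence embedding is the average of the token embeddings, with padding positions ignored. A minimal sketch of that operation on plain Python lists follows. The function name <code>mean_pool</code> and the toy values are assumptions made for illustration; the library performs the same kind of averaging on tensors.</p>

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    # max(count, 1) avoids division by zero for an all-padding input
    return [t / max(count, 1) for t in totals]

# Toy example: three real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)
```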


In [3]:
print(dataset)

Dataset({
    features: ['Sentence', 'class', 'source'],
    num_rows: 2445
})


<h2>Extracting Keywords Using KeyBERT</h2>

<p>KeyBERT's <code>extract_keywords</code> method returns the candidate words most similar to a given text. On top of plain similarity ranking, the package offers two strategies for diversifying the result: <strong>MaxSum</strong> (Max Sum Distance) and <strong>MMR</strong> (Maximal Marginal Relevance). Both reduce near-duplicate keywords, and you can choose the strategy and its parameters based on your needs.</p>

<p><strong>Implementation:</strong> We pass the raw text of each article to <code>kw_model.extract_keywords</code>; KeyBERT computes the document and candidate embeddings internally. The helper functions below (both named <code>extract_keywords</code>) apply it to each batch, once with <code>use_maxsum=True</code> and once with <code>use_mmr=True</code>, and keep the top 5 keywords per article.</p>

<p>For more information, visit the official <a href="https://github.com/MaartenGr/KeyBERT" target="_blank">KeyBERT GitHub page</a>.</p>
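
<p>To make the difference between the two strategies concrete, here is a simplified sketch of both on toy two-dimensional vectors. It is written under stated assumptions and is not KeyBERT's source code: the helper names (<code>unit</code>, <code>cosine</code>, <code>mmr</code>, <code>max_sum</code>), the English words, and the angles are invented for the example. MMR greedily adds the word that is relevant to the document but least similar to the words already chosen. MaxSum takes the most relevant candidates and picks the subset whose members are least similar to one another.</p>

```python
import math
from itertools import combinations

def unit(degrees):
    """2-D unit vector at the given angle (toy stand-in for an embedding)."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(doc_vec, cand_vecs, top_n=2, diversity=0.5):
    """Greedy Maximal Marginal Relevance over a dict of word -> vector."""
    relevance = {w: cosine(doc_vec, v) for w, v in cand_vecs.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant word
    while len(selected) < min(top_n, len(cand_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in cand_vecs.items():
            if word in selected:
                continue
            # Penalise similarity to the closest already-selected word
            redundancy = max(cosine(vec, cand_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

def max_sum(doc_vec, cand_vecs, top_n=2, nr_candidates=3):
    """From the nr_candidates most relevant words, pick the top_n-sized
    combination with the smallest sum of pairwise similarities."""
    relevance = {w: cosine(doc_vec, v) for w, v in cand_vecs.items()}
    pool = sorted(relevance, key=relevance.get, reverse=True)[:nr_candidates]

    def pairwise_sum(combo):
        return sum(cosine(cand_vecs[a], cand_vecs[b]) for a, b in combinations(combo, 2))

    return list(min(combinations(pool, top_n), key=pairwise_sum))

# Toy data: "bank" and "banks" are near-duplicates
doc_vec = unit(0)
cand_vecs = {"bank": unit(10), "banks": unit(12), "dollar": unit(-40), "weather": unit(90)}

mmr_result = mmr(doc_vec, cand_vecs)
max_sum_result = max_sum(doc_vec, cand_vecs)
print(mmr_result)
print(max_sum_result)
```

<p>Plain similarity ranking would return the two near-duplicates "bank" and "banks" for this toy input. Both strategies instead pair one of them with "dollar", which is the kind of redundancy reduction they are meant to provide.</p>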


In [59]:
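def extract_keywords(batch):
    """Add a 'maxsum_keywords' column with up to 5 single-word keywords per text."""
    list_of_keywords = []

    for text in batch['Sentence']:
        # MaxSum: from the 20 most similar candidates, keep the 5 least similar to each other
        scored = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1),
                                           use_maxsum=True, nr_candidates=20, top_n=5)
        # extract_keywords returns (keyword, score) pairs; keep only the words
        list_of_keywords.append([kw for kw, _score in scored])
    batch['maxsum_keywords'] = list_of_keywords

    return batch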

dataset = dataset.map(extract_keywords, batched=True, batch_size=128)


In [60]:
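def extract_keywords(batch):
    """Add an 'mmr_keywords' column with up to 5 single-word keywords per text."""
    list_of_keywords = []

    for text in batch['Sentence']:
        # MMR is controlled by `diversity` (default 0.5); `nr_candidates` only applies to MaxSum
        scored = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1),
                                           use_mmr=True, top_n=5)
        # extract_keywords returns (keyword, score) pairs; keep only the words
        list_of_keywords.append([kw for kw, _score in scored])
    batch['mmr_keywords'] = list_of_keywords

    return batch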

dataset = dataset.map(extract_keywords, batched=True, batch_size=128)


<h2>Merging Keywords and Removing Duplicates</h2>

<p>The two strategies (MaxSum and MMR) often select some of the same keywords. We therefore merge the two lists for each article and drop the repeated entries, so that the final list contains each keyword only once.</p>

<p><strong>Implementation:</strong> For each article, the MaxSum and MMR lists are concatenated and converted to a set, which removes any keyword that appears in both. The result is stored in a new <code>keywords</code> column.</p>
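
<p>The merge step can be illustrated on two small lists. This sketch uses hypothetical helper names (<code>merge_keywords</code>, <code>shared_keywords</code>) and invented English keywords. Unlike a plain <code>set</code>, <code>dict.fromkeys</code> keeps the first-seen order, which can be convenient when reproducible output matters.</p>

```python
def merge_keywords(maxsum_keywords, mmr_keywords):
    """Combine two keyword lists, dropping repeats but keeping first-seen order."""
    return list(dict.fromkeys(maxsum_keywords + mmr_keywords))

def shared_keywords(maxsum_keywords, mmr_keywords):
    """Keywords selected by both strategies, sorted alphabetically."""
    return sorted(set(maxsum_keywords) & set(mmr_keywords))

# Invented example lists
maxsum_kw = ["bank", "dollar", "market"]
mmr_kw = ["dollar", "interbank", "bank"]

merged = merge_keywords(maxsum_kw, mmr_kw)
overlap = shared_keywords(maxsum_kw, mmr_kw)
print(merged)
print(overlap)
```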


In [None]:
# Union of both lists; convert back to a list so the column holds a plain list type
dataset = dataset.map(lambda x: {'keywords': list(set(x['maxsum_keywords'] + x['mmr_keywords']))})


<h2>Example of Extracted Keywords</h2>

<p>Here are the merged keywords for the first two articles in the dataset. Each list combines the MaxSum and MMR results with repeated keywords removed.</p>


In [62]:
print(dataset[0]['keywords'])
print(dataset[1]['keywords'])

['դոլար', 'մարտի', 'շուկայում', 'բանկերի', 'կրճատվել', 'գործարքների', 'ցուցանիշից', 'միջբանկային']
['գործարարներ', 'ռուսաստանաբնակ', 'միլիարդատերերից', 'ամենահարուստը', 'մլրդ', 'բելառուսը', 'հայազգի', 'ցանկը']


<h1>Topic Modeling with BERTopic on the Ilur Dataset</h1>

<h2>Objective</h2>
<p>Use the <code>BERTopic</code> package to perform topic modeling on the Ilur dataset, identifying latent topics from the provided news articles. This method utilizes embeddings generated by a pre-trained model to discover clusters of related documents, helping to uncover hidden themes within the dataset.</p>

<p><strong>Problem Statement:</strong> Given a collection of news articles, the model aims to automatically identify topics or themes present in the articles, which can be used for further analysis, categorization, or summarization.</p>

<p><strong>Implementation:</strong> <code>BERTopic</code> embeds each document, reduces the dimensionality of the embeddings, clusters the reduced vectors, and then describes each cluster by its most characteristic words (by default using UMAP, HDBSCAN, and a class-based TF-IDF, respectively). Here we reuse the same embedding model as above and set <code>nr_topics=20</code>, which asks BERTopic to reduce the discovered topics to 20.</p>
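
<p>Once documents are clustered, each topic needs a word-level description. BERTopic does this with a class-based TF-IDF (c-TF-IDF), which treats every cluster as one long document. The sketch below is a simplified version of that weighting, assuming the form "term frequency within the cluster, multiplied by log(1 + average words per cluster / total frequency of the word)". The function name <code>c_tf_idf</code> and the toy clusters are invented for illustration, and the library's own implementation differs in detail.</p>

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict of cluster id -> list of tokenised documents.
    Returns dict of cluster id -> {word: score}."""
    # Treat each cluster as a single long document
    class_counts = {
        cid: Counter(tok for doc in docs for tok in doc)
        for cid, docs in clusters.items()
    }
    # Frequency of each word across all clusters
    total_per_word = Counter()
    for counts in class_counts.values():
        total_per_word.update(counts)
    # Average number of words per cluster
    avg_words = sum(sum(c.values()) for c in class_counts.values()) / len(class_counts)

    scores = {}
    for cid, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[cid] = {
            word: (n / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, n in counts.items()
        }
    return scores

# Toy clusters with invented English tokens
clusters = {
    0: [["bank", "dollar", "rate"], ["bank", "market"]],
    1: [["match", "goal", "team"], ["team", "market"]],
}
scores = c_tf_idf(clusters)
for cid, word_scores in scores.items():
    print(cid, max(word_scores, key=word_scores.get))
```

<p>A word that is frequent inside one cluster and rare elsewhere scores high, while a word shared across clusters (here "market") is pushed down.</p>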

<p><strong>For more information:</strong> You can explore the <a href="https://github.com/MaartenGr/BERTopic" target="_blank">BERTopic GitHub repository</a> or the official <a href="https://maartengr.github.io/BERTopic/" target="_blank">BERTopic documentation</a> for detailed guides, tutorials, and advanced features.</p>


In [9]:
from bertopic import BERTopic

topic_model = BERTopic(embedding_model=embedding_model, nr_topics=20)

topics, probabilities = topic_model.fit_transform(dataset['Sentence'])
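
<p>The returned <code>topics</code> list holds one topic id per document, and BERTopic uses <code>-1</code> for documents it treats as outliers. A quick way to see how documents are distributed is to count the ids. The list below is a hypothetical stand-in for the real <code>topics</code> output.</p>

```python
from collections import Counter

# Hypothetical topic assignments, standing in for the `topics` list above
example_topics = [0, 2, -1, 0, 1, 0, -1, 2]

topic_sizes = Counter(example_topics)
outliers = topic_sizes.pop(-1, 0)  # -1 marks documents left unassigned

for topic_id, size in topic_sizes.most_common():
    print(f"topic {topic_id}: {size} documents")
print(f"outliers: {outliers}")
```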

In [13]:
topic_model.visualize_topics(top_n_topics=10)