<img src="https://bigdata.com/assets/notebooks/bigdata-by-ravenpack-logo.png" width="300" align="center">
<br>
<br>

# **Daily Top Trending Topics for Central Bank Announcements**

This Jupyter notebook implements an **agentic workflow** built on content retrieved from the Bigdata API to **identify, verify, reindex, and summarize** the specialized news items that are **trending topics** for Central Bank Announcements.

The workflow is structured as follows:

**Step 1- Generation of the Lexicon**: Identify the specialized, industry-specific jargon relevant to Central Bank Announcements to ensure high recall in content retrieval.

**Step 2- Content Retrieval Based on Bigdata**: Run a keyword search over news content with the Bigdata API, splitting the search into daily time frames and multi-threading the per-keyword searches for speed.

**Step 3- Topic Clustering and Selection**: Use a large language model to verify and cluster the news into topics. A summarization pass then selects the top trending topics for Central Bank Announcements and derives analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on the inferred effect on equity prices).

**Step 4- Customized Report Generation**: Rank the summarized topics by a customizable combination of trendiness, novelty, and impact on equity prices, and display a daily market update. For verification purposes, the reports are backed by the granular news and sources.

**Output**

1. **Daily Market Reports**: A detailed, formatted report summarizing the top trending topics for Central Bank Announcements, with a customizable ranking system to reindex the news.
2. **Actionable Dataframe**: A timestamped dataframe containing the granular news clustered into relevant topics, together with the trendiness, novelty, impact, and magnitude scores, which can be used for backtesting.

**Requirements**

- Credentials for the Bigdata API to perform keyword and document searches on news content.
- Credentials for the OpenAI API used in the notebook. Any other LLM could be substituted.
- A `src` folder in the same directory as this notebook, containing the modules imported in the Set-Up section (`lexicon_generator.py`, `search_topics.py`, and `topics_extractor.py`).
- A `requirements.txt` file listing all the necessary Python libraries and dependencies. We recommend installing these packages in a virtual environment.

# Set-Up

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
from dotenv import load_dotenv

load_dotenv(os.path.abspath("/home/abouchs/.python_env_var/.env"))

True

In [3]:
BIGDATA_USERNAME = os.getenv("BIGDATA_USERNAME")
BIGDATA_PASSWORD = os.getenv("BIGDATA_PASSWORD")
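
`os.getenv` returns `None` when a variable is missing, so a misconfigured `.env` file would only surface later as an authentication error. The following is a minimal sketch of a fail-fast check. The helper `require_env` and the demo values are hypothetical and are not part of the notebook's `src` package.

```python
import os

def require_env(names, environ=os.environ):
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Demo with an explicit mapping so the snippet runs without real credentials.
demo_env = {"BIGDATA_USERNAME": "user@example.com", "BIGDATA_PASSWORD": "secret"}
credentials = require_env(["BIGDATA_USERNAME", "BIGDATA_PASSWORD"], environ=demo_env)
```

In the notebook itself, the same helper could be called without the `environ` argument right after `load_dotenv`.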

In [4]:
from bigdata_client import Bigdata
bigdata_cred = Bigdata(BIGDATA_USERNAME, BIGDATA_PASSWORD)

In [5]:
from src.lexicon_generator import LexiconGenerator
from src.search_topics import search_by_keywords
from src.topics_extractor import (process_all_reports,
                                run_process_all_trending_topics,
                                run_add_advanced_novelty_scores,
                                add_market_impact_to_df,
                                prepare_data_for_report,
                                generate_html_report)
from IPython.display import HTML, display

In [6]:
output_dir = "/home/abouchs/shared/OutputData/abouchs/Bigdata_cookbook/trending_topics/"

In [7]:
try:
    import asyncio
    asyncio.get_running_loop()
    import nest_asyncio; nest_asyncio.apply()
    print("✅ nest_asyncio applied")
except (RuntimeError, ImportError):
    print("✅ nest_asyncio not needed")

✅ nest_asyncio applied


# Step 1- Generation of the Lexicon

In this step, we identify the specialized, industry-specific jargon relevant to Central Bank Announcements to ensure high recall in content retrieval.

In [9]:
main_theme = "Central Bank Announcements"
system_prompt = (
    f"""You are an expert tasked with generating a lexicon of the most important and relevant keywords specific to the {main_theme}.

    Your goal is to compile a list of terms that are critical for understanding and analyzing the {main_theme}. This lexicon should include only the most essential keywords, phrases, and abbreviations that are directly associated with {main_theme} topics, analysis, logistics, and industry reporting.

    Guidelines:

    1. **Focus on relevance:** Include only the most important and commonly used keywords that are uniquely tied to the {main_theme}. These should reflect key concepts, industry-specific mechanisms, benchmarks, logistical aspects, and terminology that are central to the theme.
    2. **Avoid redundancy:** Do not repeat the primary terms of the theme in multiple phrases. Include the main term (e.g., "{main_theme}") only as a standalone term, and focus on other specific terms without redundant repetition.
    3. **Strict exclusion of generic terms:** Exclude any terms that are generic or broadly used across different fields, such as "Arbitrage," "Hedge," "Liquidity," or "Futures Contract," even if they have a specific meaning within the context of {main_theme}. Only include terms that are uniquely relevant to {main_theme} and cannot be applied broadly.
    4. **Include specific variations:** Where applicable, provide both the full form and common abbreviations relevant to the {main_theme}. **Present the full term and its abbreviation as separate entries.** For example, instead of "Zero Lower Bound (ZLB)", list "Zero Lower Bound" and "ZLB" as separate keywords.
    5. **Ensure clarity:** Each keyword should be concise, clear, and directly relevant to the {main_theme}, avoiding any ambiguity.
    6. **Select only the most critical:** There is no need to reach a specific number of keywords. Focus solely on the most crucial terms without padding the list. If fewer keywords meet the criteria, that is acceptable.

    The output should be a lexicon of only the most critical and uniquely relevant keywords related to the {main_theme}, formatted as a JSON list, with full terms and abbreviations listed separately.
    """
)

In [8]:
# Use a distinct instance name so the LexiconGenerator class is not shadowed when the cell is re-run
lexicon_generator = LexiconGenerator(openai_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o", seeds=[123, 123456, 123456789, 456789, 789])

In [10]:
keywords_lex = lexicon_generator.generate(theme=main_theme, system_prompt=system_prompt)
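
The generator is configured with five seeds, which suggests several LLM runs whose keyword lists are then combined. How `LexiconGenerator` does this is not shown in this notebook. The sketch below illustrates one plausible approach under that assumption: a case-insensitive union of JSON keyword lists that keeps first-seen order. The function `merge_lexicons` and the sample responses are hypothetical.

```python
import json

def merge_lexicons(raw_responses):
    """Union JSON keyword lists, deduplicating case-insensitively in first-seen order."""
    seen = set()
    merged = []
    for raw in raw_responses:
        for keyword in json.loads(raw):
            cleaned = keyword.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# Hypothetical responses from two seeded runs.
demo_responses = [
    '["Central Bank Announcements", "Forward Guidance", "FOMC"]',
    '["forward guidance", "Quantitative Tightening", "FOMC"]',
]
merged_keywords = merge_lexicons(demo_responses)
```

Taking the union rather than the intersection favors recall, which matches the stated goal of this step.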

# Step 2- Content Retrieval Based on Bigdata

In this section, we run a keyword search over news content with the Bigdata API, splitting the search into daily time frames and multi-threading the per-keyword searches for speed. Set the time range below to generate daily reports between the start and end dates.

In [11]:
start_query = '2025-05-18'
end_query = '2025-05-20'

In [12]:
results, daily_keyword_count = search_by_keywords(
    keywords=keywords_lex,
    start_date=start_query,
    end_date=end_query,
    freq='D',
    document_limit=10)

About to run 180 queries
Example Query: Keyword('Central Bank Announcements') over date range: AbsoluteDateRange('2025-05-18T00:00:00', '2025-05-18T23:59:59')


Querying Bigdata...: 100%|██████████| 180/180 [00:37<00:00,  4.83it/s]
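
The query count reported above is consistent with one query per keyword per day: 180 queries over three days would correspond to 60 keywords. The sketch below shows that fan-out pattern using only the standard library. It is an illustration, not the implementation of `search_by_keywords`, and `fake_search` is a hypothetical stand-in for the actual API call.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from itertools import product

def daily_ranges(start, end):
    """Yield one (start_of_day, end_of_day) ISO pair per calendar day, inclusive."""
    day = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while day <= last:
        yield (f"{day.isoformat()}T00:00:00", f"{day.isoformat()}T23:59:59")
        day += timedelta(days=1)

def fake_search(task):
    """Hypothetical stand-in for one API call; returns a label for the query."""
    keyword, (day_start, _day_end) = task
    return f"{keyword} @ {day_start[:10]}"

demo_keywords = ["Forward Guidance", "FOMC"]
# One task per (keyword, day) pair: 2 keywords x 3 days = 6 tasks.
demo_tasks = list(product(demo_keywords, daily_ranges("2025-05-18", "2025-05-20")))

# Threads suit this workload because each task mostly waits on network I/O.
with ThreadPoolExecutor(max_workers=4) as pool:
    demo_hits = list(pool.map(fake_search, demo_tasks))
```

`pool.map` returns results in task order, so each result can be matched back to its keyword and day.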


# Step 3- Topic Clustering and Selection

In this step, we use a large language model to verify and cluster the news into topics. A summarization pass then selects the top trending topics for Central Bank Announcements and derives analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on the inferred effect on equity prices).

Before clustering, we apply a verification layer that removes news items that are not relevant to Central Bank Announcements.

In [13]:
model = "gpt-4o-mini" 
api_key = os.getenv("OPENAI_API_KEY")

In [14]:
semaphore_size = 1000

# Filter the search results from Step 2, keeping only news relevant to the main theme
filtered_reports = process_all_reports(results, model, api_key, main_theme, semaphore_size)

Filtering News:   0%|          | 0/2610 [00:00<?, ?it/s]
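
The `semaphore_size` argument suggests that the filter caps the number of LLM requests in flight at once. The sketch below shows that general pattern with `asyncio.Semaphore`. It is an assumption about the approach, not the code of `process_all_reports`, and `is_relevant` is a hypothetical keyword check standing in for the LLM call.

```python
import asyncio

async def is_relevant(text, theme_terms):
    """Hypothetical stand-in for an LLM relevance check."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    lowered = text.lower()
    return any(term in lowered for term in theme_terms)

async def filter_texts(texts, theme_terms, semaphore_size=2):
    """Keep only relevant texts, running at most semaphore_size checks at once."""
    semaphore = asyncio.Semaphore(semaphore_size)

    async def check(text):
        async with semaphore:
            return await is_relevant(text, theme_terms)

    flags = await asyncio.gather(*(check(text) for text in texts))
    return [text for text, keep in zip(texts, flags) if keep]

demo_texts = [
    "The ECB held its deposit rate steady.",
    "A retailer reported strong quarterly sales.",
    "Fed minutes point to a slower pace of rate cuts.",
]
# In a notebook with a running event loop, use `await filter_texts(...)` instead.
kept = asyncio.run(filter_texts(demo_texts, ("ecb", "fed")))
```

A smaller semaphore lowers the risk of hitting provider rate limits at the cost of throughput.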

In this cell, we use an LLM to perform topic modeling, identifying and clustering the key topics from the news reports.

In [15]:
flattened_trending_topics_df = run_process_all_trending_topics(
    unique_reports=filtered_reports,
    model=model,
    start_query=start_query,
    end_query=end_query,
    api_key=os.environ['OPENAI_API_KEY'],
    main_theme = main_theme,
    batches = 20
)

Extracting Topics for 2025-05-18:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-05-19:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-05-20:   0%|          | 0/20 [00:00<?, ?it/s]

Consolidating topics...


Consolidating topic batches: 100%|██████████| 4/4 [00:11<00:00,  2.79s/it]


Summarizing text for each topic...


Summarizing topics: 100%|██████████| 18/18 [00:07<00:00,  2.56it/s]
Generating titles: 100%|██████████| 19/19 [00:07<00:00,  2.70it/s]


Generating Day in Review summaries...
Adding one-line summaries to DataFrame...


Generating text summaries: 100%|██████████| 631/631 [00:33<00:00, 18.69it/s] 


In this cell, we iterate over each identified topic and enrich the DataFrame by retrieving and adding the full details of the cited news, including source headlines and the news text.

**Trendiness and Novelty Scores**: We derive analytics related to the trendiness of the topic based on the news volume, and the novelty of the topic based on the changes in daily summaries, evaluating the uniqueness and freshness of each topic. 

In [16]:
# Calculate trendiness and novelty scores, assessing the uniqueness and freshness of each topic
flattened_trending_topics_df = run_add_advanced_novelty_scores(flattened_trending_topics_df, api_key = os.environ['OPENAI_API_KEY'], main_theme = main_theme)

Calculating Novelty Scores:   0%|          | 0/27 [00:00<?, ?it/s]
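
To make the two scores concrete, here is a simplified sketch. It assumes trendiness is a topic's share of the day's news volume and novelty is one minus the textual similarity between consecutive daily summaries. `run_add_advanced_novelty_scores` relies on an LLM, so `difflib.SequenceMatcher` is only a swapped-in proxy, and both function names are hypothetical.

```python
from difflib import SequenceMatcher

def trendiness(topic_counts):
    """Share of the day's articles that fall under each topic."""
    total = sum(topic_counts.values())
    if total == 0:
        return {}
    return {topic: count / total for topic, count in topic_counts.items()}

def novelty(today_summary, previous_summary):
    """Return 1.0 for a topic with no earlier summary and 0.0 for an unchanged one."""
    if not previous_summary:
        return 1.0
    similarity = SequenceMatcher(None, previous_summary, today_summary).ratio()
    return 1.0 - similarity

demo_shares = trendiness({"Rate decision": 6, "Balance sheet runoff": 2})
unchanged = novelty("The Fed held rates.", "The Fed held rates.")
first_seen = novelty("The Fed held rates.", None)
```

Character-level similarity misses paraphrases, which is one reason an LLM-based comparison is better suited to the actual workflow.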

**Price impact**: We derive analytics on the impact (Positive, Negative) and magnitude (High, Medium, Low) of each topic, inferring its market impact on equity prices. The price impact inference is based on price mechanisms and on the perceived sentiment and market reaction to the news.

In [17]:
# Assess the market impact of each topic from the chosen point of view (here, the domestic equity market)
point_of_view = 'Domestic Equity market'
flattened_trending_topics_df = add_market_impact_to_df(flattened_trending_topics_df, api_key = os.environ['OPENAI_API_KEY'], main_theme = main_theme, point_of_view = point_of_view)

We display the results of topic modeling and summarization. The **Topic** column holds the themes inferred through LLM-based topic clustering, which groups the news articles by their content and underlying themes. The **Summary** column gives a synthesized overview of all news articles within the same topic, offering a high-level view of the key messages for each cluster. The **Topic** is then rephrased into a concise form based on the summary. The **Text_Summary** column gives a detailed summary of each individual chunk, capturing its core message.

For verification purposes, this timestamped dataframe contains the granular news clustered into relevant topics, along with the trendiness, novelty, impact, and magnitude scores, which can be used for backtesting.
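
For backtesting, the categorical impact and magnitude labels usually need to become numbers. The sketch below shows one way to do that, assuming the labels match those listed above. The weights 1, 2, and 3 and the function `impact_signal` are illustrative assumptions, not values used by the notebook.

```python
# Label sets come from the notebook text; the numeric weights are assumptions.
DIRECTION = {"Positive": 1, "Negative": -1}
MAGNITUDE = {"Low": 1, "Medium": 2, "High": 3}

def impact_signal(impact, magnitude):
    """Map (impact, magnitude) labels to a signed score; unknown labels give 0."""
    return DIRECTION.get(impact, 0) * MAGNITUDE.get(magnitude, 0)

demo_signals = [
    impact_signal("Positive", "High"),
    impact_signal("Negative", "Medium"),
    impact_signal("Neutral", "Low"),
]
```

Returning 0 for unrecognized labels keeps unexpected LLM outputs from contributing a spurious signal.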

# Step 4- Customized Report Generation

In this step, we rank the topics with a customizable ranking system based on trendiness, novelty, and impact on equity prices. We then display a daily market update, supported by the corresponding granular news and sources for verification purposes.

The user selects the report date and customizes the ranking to prioritize topics by volume (trendiness and media attention), novelty (the emergence of new daily news), impact direction (positive or negative), and magnitude.

The order of the criteria in `user_selected_ranking` determines their priority: the first criterion has the highest priority, followed by the second, and then the third. The user can list one, two, or three criteria, depending on their needs.

In [18]:
specific_date = '2025-05-20'  # Example date, can be modified as needed
user_selected_ranking = ['novelty', 'volume', 'magnitude']  # User can modify this list to change the ranking order
# impact_filter = 'positive_impact'  # Optional: restrict the report to topics with the given impact direction

In [19]:
prepared_reports = prepare_data_for_report(flattened_trending_topics_df, user_selected_ranking, impact_filter = None, report_date = specific_date)
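
Ranking by prioritized criteria amounts to a multi-key sort in which later criteria only break ties in earlier ones. The sketch below illustrates that idea on plain dictionaries. It assumes numeric scores where higher ranks first, and `rank_topics` and the sample records are hypothetical rather than the internals of `prepare_data_for_report`.

```python
def rank_topics(topics, ranking):
    """Sort topics by the listed criteria; the first criterion has the highest priority."""
    return sorted(
        topics,
        key=lambda topic: tuple(topic[criterion] for criterion in ranking),
        reverse=True,  # higher scores first
    )

# Hypothetical topic records with numeric scores.
demo_topics = [
    {"topic": "A", "novelty": 0.9, "volume": 10, "magnitude": 2},
    {"topic": "B", "novelty": 0.9, "volume": 25, "magnitude": 1},
    {"topic": "C", "novelty": 0.4, "volume": 40, "magnitude": 3},
]
by_novelty = [t["topic"] for t in rank_topics(demo_topics, ["novelty", "volume", "magnitude"])]
by_volume = [t["topic"] for t in rank_topics(demo_topics, ["volume"])]
```

Topics A and B tie on novelty, so volume decides their order in the first ranking, while the second ranking ignores novelty entirely.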

In [20]:
# Generate and display the HTML report for each date
for report in prepared_reports:
    html_content = generate_html_report(
        report['date'],
        report['day_in_review'],
        report['topics'],
        main_theme  # Pass the main theme to dynamically generate the title
    )
    display(HTML(html_content))
    print("")
    print("")
    print("")




