# **Top Trending Topics for Crude Oil**

This Jupyter notebook implements an **agentic workflow** built on content retrieval from the Bigdata API to **identify, verify, reindex, and summarize** specialized news on **trending topics** in the crude oil market.

The workflow is structured as follows:

**Step 1- Generation of the Lexicon**: Identify the specialized industry-specific jargon relevant to the crude oil market to ensure a high recall in the content retrieval.

**Step 2- Content Retrieval Based on Bigdata**: Perform a keyword search on news content with the Bigdata API to retrieve documents, splitting the search into daily timeframes and multi-threading the search across individual keywords for speed.

**Step 3- Topic Clustering and Selection**: Perform topic modelling using a large language model to verify and cluster the news. A summarization step then selects the top trending topics for crude oil and derives advanced analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on financial materiality for crude oil prices).

**Step 4- Customized Report Generation**: Customize the ranking system of the summarized topics based on their trendiness, novelty, and financial materiality for crude oil prices, and display a daily market update. For verification purposes, the reports are supported by the underlying granular news and sources.

**Output**

1. **Daily Market Reports**: A detailed and visually appealing report summarizing the top trending topics for crude oil, with a customizable ranking system to reindex the news.
2. **Actionable Dataframe**: A timestamped dataframe containing the granular news clustered into relevant topics, together with the trendiness, novelty, impact, and magnitude scores, which can potentially be used for backtesting.

**Requirements**

- Credentials for the Bigdata API to perform keyword and document searches on news content.
- Credentials for the OpenAI API used in the notebook. This could be substituted with any other LLM provider.
- A `src` folder in the same directory as this notebook, containing the modules imported below (`lexicon_generator.py`, `search_topics.py`, and `topics_extractor.py`).
- A `requirements.txt` file listing all the necessary Python libraries and dependencies. We recommend installing these packages in a virtual environment.

# Set-Up

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pypi-marbella.ravenpack.com/simple/


In [3]:
import os
from dotenv import load_dotenv

load_dotenv(os.path.abspath("/home/abouchs/.python_env_var/.env"))

True

In [4]:
BIGDATA_USERNAME = os.getenv("BIGDATA_USERNAME")
BIGDATA_PASSWORD = os.getenv("BIGDATA_PASSWORD")
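
Because `os.getenv` silently returns `None` for unset variables, it can help to check the credentials before any client is created. The following is a minimal sketch of such a check; the helper `missing_env_vars` is hypothetical and not part of the notebook's `src` package.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Variable names taken from the cells of this notebook.
required = ["BIGDATA_USERNAME", "BIGDATA_PASSWORD", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Reporting all missing names at once avoids discovering them one by one as later cells fail.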

In [5]:
from bigdata_client import Bigdata
bigdata_cred = Bigdata(BIGDATA_USERNAME, BIGDATA_PASSWORD)

In [6]:
from src.lexicon_generator import LexiconGenerator
from src.search_topics import search_by_keywords
from src.topics_extractor import (process_all_reports,
                                run_process_all_trending_topics,
                                run_add_advanced_novelty_scores,
                                add_market_impact_to_df,
                                prepare_data_for_report,
                                generate_html_report)
from IPython.display import HTML, display

In [7]:
output_dir = "/home/abouchs/shared/OutputData/abouchs/Bigdata_cookbook/trending_topics/"

In [8]:
try:
    import asyncio
    asyncio.get_running_loop()
    import nest_asyncio; nest_asyncio.apply()
    print("✅ nest_asyncio applied")
except (RuntimeError, ImportError):
    print("✅ nest_asyncio not needed")

✅ nest_asyncio applied


# Step 1- Generation of the Lexicon

In this step, we identify the specialized industry-specific jargon relevant to the crude oil market to ensure a high recall in the content retrieval.

In [9]:
main_theme = "Crude Oil"
system_prompt = (
    f"""You are an expert tasked with generating a lexicon of the most important and relevant keywords specific to the given main theme and its related market.

    Your goal is to compile a list of terms that are critical for understanding and analyzing the main theme's market. This lexicon should include only the most essential keywords, phrases, and abbreviations that are directly associated with trading, analysis, logistics, and industry reporting related to the main theme.

    Guidelines:

    1. **Focus on relevance:** Include only the most important and commonly used keywords that are uniquely tied to main theme and its market. These should reflect key concepts, market mechanisms, pricing benchmarks, logistical aspects, and industry-specific terminology.
    2. **Avoid redundancy:** Do not repeat the word of the main theme or its components, such as "Crude" or "Oil" in multiple phrases. Include the main theme only as a standalone term, and focus on other specific terms without redundant repetition.
    3. **Strict exclusion of generic terms:** Exclude any terms that are generic or broadly used in other markets, such as "Arbitrage," "Hedge," "Liquidity," "Spot Price," "Futures Contract," "Backwardation," or "Contango," even if they have a specific meaning in the main theme market. Only include terms that are uniquely relevant to the main theme market and cannot be applied broadly.
    4. **Include specific variations:** Where applicable, provide both the full form and common abbreviations as SEPARATE keywords (e.g., "West Texas Intermediate" and "WTI" or variations like "Brent" and "Brent Crude").
    5. **Ensure clarity:** Each keyword should be concise, clear, and directly relevant to the main theme's market, avoiding any ambiguity.
    6. **Select only the most critical:** There is no need to reach a specific number of keywords. Focus solely on the most crucial terms without padding the list. If fewer keywords meet the criteria, that is acceptable.

    The output should be a lexicon of only the most critical and uniquely relevant keywords related to the main theme market, formatted as a JSON list.
    """
)


In [10]:
# Use a distinct instance name so the LexiconGenerator class is not shadowed when the cell is re-run
lexicon_generator = LexiconGenerator(openai_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o", seeds=[123, 123456, 123456789, 456789, 789])

In [11]:
keywords_lex = lexicon_generator.generate(theme=main_theme, system_prompt=system_prompt)
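
The prompt asks for a JSON list, and the generator is configured with five seeds, which suggests that several responses are produced and then combined. The internals of `LexiconGenerator` are not shown in this notebook, so the following is only a sketch of one plausible merging strategy; the function `merge_lexicons` and the sample responses are hypothetical.

```python
import json

def merge_lexicons(raw_responses):
    """Parse JSON keyword lists and merge them, dropping case-insensitive duplicates.

    The first spelling encountered for each keyword is kept.
    """
    merged, seen = [], set()
    for raw in raw_responses:
        for keyword in json.loads(raw):
            cleaned = keyword.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# Illustrative responses, standing in for one LLM answer per seed.
raw_responses = ['["Brent", "WTI", "OPEC"]', '["OPEC", "wti", "Cushing"]']
merged_keywords = merge_lexicons(raw_responses)
print(merged_keywords)
```

Taking the union across seeds favors recall, which matches the stated goal of this step; the verification layer in Step 3 is then responsible for precision.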

# Step 2- Content Retrieval Based on Bigdata

In this section, we perform a keyword search on news content with the Bigdata API to retrieve documents, splitting the search into daily timeframes and multi-threading the search across individual keywords for speed. The user can define the time range below to generate daily reports between the start and end dates.

In [12]:
start_query = '2025-06-21'
end_query = '2025-06-28'

In [13]:
results, daily_keyword_count = search_by_keywords(
    keywords=keywords_lex,
    start_date=start_query,
    end_date=end_query,
    freq='D',
    document_limit=10)

About to run 664 queries
Example Query: Keyword('Brent') over date range: AbsoluteDateRange('2025-06-21T00:00:00', '2025-06-21T23:59:59')


Querying Bigdata...: 100%|██████████| 664/664 [02:22<00:00,  4.66it/s]
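
The 664 queries reported above would be consistent with one query per keyword per day (for example, 83 keywords over the 8 days from 2025-06-21 to 2025-06-28), although the implementation of `search_by_keywords` is not shown here. As a rough sketch of that pattern, the code below builds the keyword-by-day grid and runs it through a thread pool. The function `fake_search` is a hypothetical stand-in for a real Bigdata query, and the keyword list is an illustrative subset.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from itertools import product

def daily_ranges(start, end):
    """Yield (start_of_day, end_of_day) strings for each day from start to end, inclusive."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        yield f"{day}T00:00:00", f"{day}T23:59:59"

def fake_search(task):
    """Hypothetical stand-in for a single keyword query over one day."""
    keyword, (day_start, day_end) = task
    return {"keyword": keyword, "start": day_start, "end": day_end, "documents": []}

keywords = ["Brent", "WTI", "OPEC"]  # illustrative subset of a lexicon
tasks = list(product(keywords, daily_ranges("2025-06-21", "2025-06-28")))

# Threads suit this workload because each query mostly waits on the network.
with ThreadPoolExecutor(max_workers=8) as pool:
    query_results = list(pool.map(fake_search, tasks))

print(f"Ran {len(query_results)} queries")
```

Splitting by day keeps each query small enough for a per-query document limit to apply to every day separately, rather than to the whole date range.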


In [14]:
results

Unnamed: 0,timestamp,rp_document_id,headline,chunk_number,sentence_id,source_id,source_name,text,keyword,date
0,2025-06-21 00:00:00+00:00,45A5868267C8A445F6B696C495AA8673,Aduro Clean Technologies Announces Closing of ...,7.0,45A5868267C8A445F6B696C495AA8673-7,D051D6,Global Data Point,About Aduro Clean Technologies\nAduro Clean Te...,Heavy Crude,2025-06-21
1,2025-06-21 00:00:00+00:00,86E659AE53391BE3F217C0EC3B031F97,Peter Dey Announces Retirement from Gran Tierr...,3.0,86E659AE53391BE3F217C0EC3B031F97-3,346656,Executive Appointments Worldwide,About Gran Tierra Energy Inc.\nGran Tierra Ene...,Exploration and Production,2025-06-21
2,2025-06-21 00:00:00+00:00,5D552639AE35AB037AB8D54E1BB8E210,India's Core Sector Growth Drops to a Nine-Mon...,3.0,5D552639AE35AB037AB8D54E1BB8E210-3,923B93,Financial Services Monitor Worldwide,"The subdued performance, per the data released...",Oil Production,2025-06-21
3,2025-06-21 00:00:00+00:00,69CCADC9237768F4A7026FB0DAC9F6CA,"""eBL, Payment Equals Settlement"" - KUN and Tra...",6.0,69CCADC9237768F4A7026FB0DAC9F6CA-6,923B93,Financial Services Monitor Worldwide,About TradeGo\nFounded in Singapore in Novembe...,Saudi Aramco,2025-06-21
4,2025-06-21 00:00:00+00:00,A592E34C64FF3D667032053A54199868,Global Times: China and LAC countries deepen t...,17.0,A592E34C64FF3D667032053A54199868-17,D051D6,Global Data Point,CBERS images are also utilized within the fram...,Oil Spill,2025-06-21
...,...,...,...,...,...,...,...,...,...,...
7469,2025-06-28 23:37:13+00:00,99083749AA42E83A4352A4002BC9479C,"3 days after Kullu flash flood, missing teen's...",2.0,99083749AA42E83A4352A4002BC9479C-2,80FC03,The Times Of India,They were swept away from a hydropower project...,Upstream,2025-06-28
7470,2025-06-28 23:48:49+00:00,456DB9D74C8F0B01626EE845F8FF4CA6,Caught on camera: Car literally drives through...,2.0,456DB9D74C8F0B01626EE845F8FF4CA6-2,E54C73,ABC News,"""This is like a movie or something,"" Patel sai...",Barrel,2025-06-28
7471,2025-06-28 23:50:36+00:00,DF667772C27122819087C82C1D54C3DD,The Strategic Empire: Debt & the Dollar,46.0,DF667772C27122819087C82C1D54C3DD-46,EC0C87,Michael Hudson,The United States is unwilling to annul Global...,OPEC,2025-06-28
7472,2025-06-28 23:50:36+00:00,DF667772C27122819087C82C1D54C3DD,The Strategic Empire: Debt & the Dollar,55.0,DF667772C27122819087C82C1D54C3DD-55,EC0C87,Michael Hudson,"Yes, someday the United States cannot get a fr...",OPEC,2025-06-28


# Step 3- Topic Clustering and Selection

In this step, we perform topic modelling using a large language model to verify and cluster the news. A summarization step then selects the top trending topics for crude oil and derives advanced analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on financial materiality for crude oil prices).

Before performing the topic clustering, we apply a verification layer to remove news items that are not relevant to the crude oil market.

In [15]:
model = "gpt-4o-mini" 
api_key = os.getenv("OPENAI_API_KEY")

In [16]:
semaphore_size = 1000  # Concurrency limit passed to process_all_reports

# `results` is the DataFrame returned by search_by_keywords in Step 2
filtered_reports = process_all_reports(results, model, api_key, main_theme, semaphore_size)

Filtering News:   0%|          | 0/7474 [00:00<?, ?it/s]
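
The `semaphore_size` argument suggests that the relevance checks run concurrently under a semaphore. The internals of `process_all_reports` are not shown, so the following is only a sketch of that pattern using `asyncio.Semaphore`. The coroutine `is_relevant` is a hypothetical stand-in for an LLM call and uses a simple substring test instead.

```python
import asyncio

async def is_relevant(text, theme):
    """Hypothetical stand-in for an LLM relevance check (substring match only)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return theme.lower() in text.lower()

async def filter_texts(texts, theme, semaphore_size=2):
    """Keep the texts judged relevant, running at most semaphore_size checks at once."""
    semaphore = asyncio.Semaphore(semaphore_size)

    async def check(text):
        async with semaphore:
            return await is_relevant(text, theme)

    flags = await asyncio.gather(*(check(text) for text in texts))
    return [text for text, keep in zip(texts, flags) if keep]

sample_texts = [
    "Crude oil inventories fell at Cushing.",
    "Premium alcohol market forecast 2025.",
    "OPEC+ raises crude oil output.",
]
kept = asyncio.run(filter_texts(sample_texts, "crude oil"))
print(kept)
```

Inside a notebook, where an event loop is already running, `asyncio.run` relies on `nest_asyncio` having been applied as in the Set-Up section.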

The filtered reports are displayed below. In the subsequent cell, we leverage an LLM to perform topic modeling, identifying and clustering the key topics from the news reports.

In [17]:
filtered_reports

Unnamed: 0,timestamp,rp_document_id,headline,chunk_number,sentence_id,source_id,source_name,text,keyword,date
0,2025-06-21 00:00:00+00:00,45A5868267C8A445F6B696C495AA8673,Aduro Clean Technologies Announces Closing of ...,7.0,45A5868267C8A445F6B696C495AA8673-7,D051D6,Global Data Point,About Aduro Clean Technologies\nAduro Clean Te...,Heavy Crude,2025-06-21
1,2025-06-21 00:00:00+00:00,86E659AE53391BE3F217C0EC3B031F97,Peter Dey Announces Retirement from Gran Tierr...,3.0,86E659AE53391BE3F217C0EC3B031F97-3,346656,Executive Appointments Worldwide,About Gran Tierra Energy Inc.\nGran Tierra Ene...,Exploration and Production,2025-06-21
2,2025-06-21 00:00:00+00:00,EE3309C79EA552874F436912D2F6A67A,US Supreme Court sets rules for venue selectio...,3.0,EE3309C79EA552874F436912D2F6A67A-3,BC923D,Legal Monitor Worldwide,The EPA had previously argued that the CAAs na...,Oil Refinery,2025-06-21
3,2025-06-21 00:00:00+00:00,5DB7156ACE3425F4F61B3BE3E882AF0D,Premium Alcohol Market Forecasts Report 2025-2...,3.0,5DB7156ACE3425F4F61B3BE3E882AF0D-3,923B93,Financial Services Monitor Worldwide,Market Trends Growing Disposable Income and Ur...,Energy Information Administration,2025-06-21
4,2025-06-21 00:00:00+00:00,96EC510A80FAD37155CD1543681E8CC1,"Morning Briefing: June 21, 2025",10.0,96EC510A80FAD37155CD1543681E8CC1-10,CE1ADC,Anadolu Agency,- China faces oil supply risk if Israel-Iran c...,Oil Import,2025-06-21
...,...,...,...,...,...,...,...,...,...,...
4256,2025-06-28 23:33:44+00:00,42F7600A56C081D0A03B719DAF6A339A,3 Magnificent S&P 500 Dividend Stocks Down 25%...,6.0,42F7600A56C081D0A03B719DAF6A339A-6,648085,AOL.com,Oneok's durable midstream business model has e...,Midstream,2025-06-28
4257,2025-06-28 23:33:44+00:00,42F7600A56C081D0A03B719DAF6A339A,3 Magnificent S&P 500 Dividend Stocks Down 25%...,5.0,42F7600A56C081D0A03B719DAF6A339A-5,648085,AOL.com,A lot of fuel to continue growing\nOneok's sto...,Midstream,2025-06-28
4258,2025-06-28 23:48:49+00:00,456DB9D74C8F0B01626EE845F8FF4CA6,Caught on camera: Car literally drives through...,2.0,456DB9D74C8F0B01626EE845F8FF4CA6-2,E54C73,ABC News,"""This is like a movie or something,"" Patel sai...",Barrel,2025-06-28
4259,2025-06-28 23:50:36+00:00,DF667772C27122819087C82C1D54C3DD,The Strategic Empire: Debt & the Dollar,46.0,DF667772C27122819087C82C1D54C3DD-46,EC0C87,Michael Hudson,The United States is unwilling to annul Global...,OPEC,2025-06-28


In [18]:
flattened_trending_topics_df = run_process_all_trending_topics(
    unique_reports=filtered_reports,
    model=model,
    start_query=start_query,
    end_query=end_query,
    api_key=os.environ['OPENAI_API_KEY'],
    main_theme=main_theme,
    batches=20,
)


Extracting Topics for 2025-06-21:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-22:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-23:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-24:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-25:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-26:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-27:   0%|          | 0/20 [00:00<?, ?it/s]

Extracting Topics for 2025-06-28:   0%|          | 0/20 [00:00<?, ?it/s]

Consolidating topics...


Consolidating topic batches: 100%|██████████| 10/10 [00:22<00:00,  2.21s/it]


Summarizing text for each topic...


Summarizing topics: 100%|██████████| 18/18 [00:09<00:00,  1.97it/s]
Generating titles: 100%|██████████| 19/19 [00:11<00:00,  1.62it/s]


Generating Day in Review summaries...
Adding one-line summaries to DataFrame...


Generating text summaries: 100%|██████████| 1811/1811 [00:59<00:00, 30.19it/s] 
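
The `batches` argument and the per-day progress bars above suggest that each day's reports are split into up to 20 batches before topics are extracted and consolidated. How `run_process_all_trending_topics` actually partitions the data is not shown, so the following is only a sketch of one way to split a list into near-equal batches; `split_into_batches` is a hypothetical helper.

```python
def split_into_batches(items, n_batches):
    """Split items into at most n_batches contiguous batches of near-equal size."""
    n_batches = max(1, min(n_batches, len(items)))
    size, remainder = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `remainder` batches receive one extra item.
        end = start + size + (1 if i < remainder else 0)
        batches.append(items[start:end])
        start = end
    return batches

example_batches = split_into_batches(list(range(7)), 3)
print(example_batches)
```

Batching keeps each LLM request within a manageable context size, at the cost of a later consolidation pass to merge topics that appear in several batches.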


**Trendiness and Novelty Scores**: We derive analytics related to the trendiness of the topic based on the news volume, and the novelty of the topic based on the changes in daily summaries, evaluating the uniqueness and freshness of each topic.

In [19]:
# Calculate trendiness and novelty scores, assessing the uniqueness and freshness of each topic
flattened_trending_topics_df = run_add_advanced_novelty_scores(
    flattened_trending_topics_df,
    api_key=os.environ['OPENAI_API_KEY'],
    main_theme=main_theme,
)

Calculating Novelty Scores:   0%|          | 0/49 [00:00<?, ?it/s]
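
To make the two scores concrete, the sketch below computes a volume score as the number of news chunks per topic and day, and a simplified novelty label. This is an assumption-laden simplification: the notebook derives novelty with an LLM from changes in the daily summaries, whereas this sketch substitutes a first-appearance rule (a topic is "New" on the first day it is seen and "Old" afterwards). The function `volume_and_novelty` and the sample records are hypothetical.

```python
from collections import Counter

def volume_and_novelty(records):
    """Map each (date, topic) pair to (volume, novelty label).

    records: iterable of (date, topic) pairs, one per news chunk,
    with dates as ISO strings so that they sort chronologically.
    """
    volume = Counter(records)
    first_seen = {}
    for day, topic in sorted(volume):
        first_seen.setdefault(topic, day)
    return {
        (day, topic): (count, "New" if first_seen[topic] == day else "Old")
        for (day, topic), count in volume.items()
    }

# Illustrative records, not taken from the notebook's data.
records = [
    ("2025-06-21", "OPEC+ output"),
    ("2025-06-21", "OPEC+ output"),
    ("2025-06-22", "OPEC+ output"),
    ("2025-06-22", "Refinery outage"),
]
scores = volume_and_novelty(records)
print(scores)
```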

**Financial Materiality**: We derive analytics on the impact (Positive, Negative) and magnitude (High, Medium, Low) of the topics, inferring their market impact on crude oil prices. The inference is based on price mechanisms such as supply and demand dynamics and geopolitical factors, among others.

In [20]:
point_of_view = (
    "a crude oil trader, where price is influenced by supply-demand dynamics, geopolitical events, and market sentiment. "
    "As a trader, you focus on changes in production, inventories, and economic indicators from key markets."
)

flattened_trending_topics_df = add_market_impact_to_df(
    flattened_trending_topics_df,
    api_key=os.environ['OPENAI_API_KEY'],
    main_theme=main_theme,
    point_of_view=point_of_view,
)

We display the results of topic modeling and summarization. The **Topic** column represents the themes inferred through topic clustering using an LLM, which groups the news articles based on their content and underlying themes. The **Summary** provides a synthesized overview of all news articles within the same topic, offering a high-level view of the key messages for each cluster. The **Topic** is then rephrased into a concise form based on the summary. The **Text_Summary** provides a detailed summary of each individual chunk, capturing its core message.

In [21]:
flattened_trending_topics_df.head(5)

Unnamed: 0,Date,Day_in_Review,Topic,Summary,Source,Headline,Text,Volume_Score,Text_Summary,Volume_Score.1,Novelty_Score,Impact_Score,Magnitude_Score
0,2025-06-21,- **OPEC+ Production Increase**: OPEC+ raised ...,OPEC Responds to Geopolitical Tensions with St...,OPEC's production decisions are increasingly i...,Livemint,Russia's Top Oil Executive Says OPEC Was Astu...,(Bloomberg) -- Steps taken by the OPEC group t...,1,OPEC's strategic decision to increase oil prod...,1,Old,Positive,High
48,2025-06-21,- **OPEC+ Production Increase**: OPEC+ raised ...,Geopolitical Tensions Surge Crude Oil Prices b...,India's oil supply and demand dynamics are sig...,Straits Times,A US attack on Iran would show the limits of C...,But as President Donald Trump openly ponders d...,3,"Geopolitical tensions, particularly involving ...",3,New,Positive,High
1,2025-06-21,- **OPEC+ Production Increase**: OPEC+ raised ...,"OPEC+ Boosts Oil Production by 411,000 Barrels...",The global oil market is currently facing sign...,InsideClimate News,Scientists' Letter Urges Brazil's President Lu...,He said the loss of the Amazon would be a huma...,8,Concerns over environmental impacts and humani...,8,New,Negative,High
2,2025-06-21,- **OPEC+ Production Increase**: OPEC+ raised ...,"OPEC+ Boosts Oil Production by 411,000 Barrels...",The global oil market is currently facing sign...,InsideClimate News,Scientists' Letter Urges Brazil's President Lu...,"""As we rapidly approach the 10-year anniversar...",8,Scientists urge leaders to prioritize climate ...,8,New,Negative,High
3,2025-06-21,- **OPEC+ Production Increase**: OPEC+ raised ...,"OPEC+ Boosts Oil Production by 411,000 Barrels...",The global oil market is currently facing sign...,InsideClimate News,Scientists' Letter Urges Brazil's President Lu...,We're hiring!\nPlease take a look at the new o...,8,"OPEC+ increases oil production by 411,000 barr...",8,New,Negative,High


For verification purposes, this actionable, timestamped dataframe contains the granular news clustered into relevant topics, together with the trendiness, novelty, impact, and magnitude scores, which can potentially be used for backtesting.
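
As one possible starting point for backtesting, the categorical impact and magnitude labels can be converted into a signed numeric value and aggregated per day. The sketch below does this with the standard library only. The weights (High = 3, Medium = 2, Low = 1) and the function `daily_signal` are assumptions made for illustration and do not come from the notebook.

```python
from collections import defaultdict

SIGN = {"Positive": 1, "Negative": -1}
WEIGHT = {"High": 3, "Medium": 2, "Low": 1}  # assumed weights, not from the notebook

def daily_signal(rows):
    """Sum signed magnitude weights per day.

    rows: iterable of (date, impact, magnitude) tuples, one per topic.
    """
    totals = defaultdict(int)
    for day, impact, magnitude in rows:
        totals[day] += SIGN[impact] * WEIGHT[magnitude]
    return dict(totals)

# Illustrative rows, not taken from the notebook's data.
rows = [
    ("2025-06-21", "Positive", "High"),
    ("2025-06-21", "Negative", "High"),
    ("2025-06-21", "Positive", "Medium"),
    ("2025-06-22", "Negative", "Low"),
]
signal = daily_signal(rows)
print(signal)
```

In practice one would deduplicate to one row per topic and day first, since the dataframe repeats the topic-level scores on every chunk belonging to that topic.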

# Step 4- Customized Report Generation

In this step, we rank the topics according to their trendiness, novelty, and financial materiality for crude oil prices; the user can customize the ranking system to reindex the news. Finally, we display a daily market update, supported by the corresponding granular news and sources for verification purposes.

The user selects the date for the report summarizing the top trending topics, and customizes the ranking system to prioritize the topics based on volume (trendiness and media attention), novelty (based on the emergence of new daily news), impact direction (positive or negative), and magnitude (financial materiality). The ranking system prioritizes the criteria in the order specified by the user, allowing for a tailored focus on the most relevant aspects of the data.

The order in which the criteria are listed in `user_selected_ranking` determines their priority for ranking the topics within the report. The first criterion in the list has the highest priority, followed by the second, and then the third. The user can choose among impact direction (positive or negative), novelty, magnitude, and volume, and can list one, two, or three of these criteria depending on their needs.
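
The priority ordering described above amounts to a lexicographic sort: topics are compared on the first criterion, and later criteria only break ties. The sketch below illustrates this with a tuple sort key. It is not the implementation of `prepare_data_for_report`; the function `rank_topics`, the dictionary keys, and the orderings (New before Old, High before Medium before Low, larger volume first) are assumptions.

```python
# Each scorer returns a value where smaller means higher priority.
SCORERS = {
    "novelty": lambda t: {"New": 0, "Old": 1}[t["novelty"]],
    "volume": lambda t: -t["volume"],
    "magnitude": lambda t: {"High": 0, "Medium": 1, "Low": 2}[t["magnitude"]],
}

def rank_topics(topics, ranking):
    """Sort topics by the criteria in ranking, first criterion taking precedence."""
    return sorted(topics, key=lambda t: tuple(SCORERS[c](t) for c in ranking))

# Illustrative topics, not taken from the notebook's data.
topics = [
    {"topic": "A", "novelty": "Old", "volume": 8, "magnitude": "High"},
    {"topic": "B", "novelty": "New", "volume": 3, "magnitude": "Medium"},
    {"topic": "C", "novelty": "New", "volume": 3, "magnitude": "High"},
    {"topic": "D", "novelty": "New", "volume": 8, "magnitude": "Low"},
]
ranked = rank_topics(topics, ["novelty", "volume", "magnitude"])
print([t["topic"] for t in ranked])
```

Here the new topics come first, the highest-volume new topic (D) leads, and magnitude only separates B and C, which tie on both novelty and volume.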

In [22]:
specific_date = '2025-06-23'  # Example date; modify as needed
user_selected_ranking = ['novelty', 'volume', 'magnitude']  # Modify this list to change the ranking order
# impact_filter = 'positive_impact'  # Optional: restrict the report to topics with a given impact

In [23]:
prepared_reports = prepare_data_for_report(
    flattened_trending_topics_df,
    user_selected_ranking,
    impact_filter=None,
    report_date=specific_date,
)

In [24]:
# Generate and display the HTML report for each date
for report in prepared_reports:
    html_content = generate_html_report(
        report['date'],
        report['day_in_review'],
        report['topics'],
        'Daily crude oil market update'  # Title displayed at the top of the report
    )
    display(HTML(html_content))
    # Leave blank lines between consecutive reports
    print("")
    print("")
    print("")




