# Set-Up

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
%pip install -r requirements_sdk_api.txt

Looking in indexes: https://pypi.org/simple, https://pypi-marbella.ravenpack.com/simple/
Collecting aiohttp==3.9.5 (from -r requirements_sdk_api.txt (line 1))
  Downloading aiohttp-3.9.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting aiosignal==1.3.1 (from -r requirements_sdk_api.txt (line 2))
  Downloading aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)
Collecting annotated-types==0.7.0 (from -r requirements_sdk_api.txt (line 3))
  Downloading annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting anyio==4.4.0 (from -r requirements_sdk_api.txt (line 4))
  Downloading anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting argon2-cffi==23.1.0 (from -r requirements_sdk_api.txt (line 5))
  Downloading argon2_cffi-23.1.0-py3-none-any.whl.metadata (5.2 kB)
Collecting argon2-cffi-bindings==21.2.0 (from -r requirements_sdk_api.txt (line 6))
  Downloading argon2_cffi_bindings-21.2.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.wh

In [2]:
from src.generic_utils_reports import *

ModuleNotFoundError: No module named 'openai'

In [None]:
import os

from dotenv import load_dotenv

# Load the API credentials from a local .env file
load_dotenv("/home/abouchs/.python_env_var/.env")

True

In [None]:
BIGDATA_USERNAME = os.getenv("BIGDATA_USERNAME")
BIGDATA_PASSWORD = os.getenv("BIGDATA_PASSWORD")
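
Because `os.getenv` returns `None` for an unset variable, a missing credential would only surface later, when a client is created. A minimal check is sketched below, assuming the variable names used in this notebook. The helper name `missing_env_vars` is hypothetical and not part of the notebook's utilities.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names in `names` that are unset or empty in the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

# Variable names taken from the cells of this notebook
required = ["BIGDATA_USERNAME", "BIGDATA_PASSWORD", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes a misconfigured `.env` file visible before any API call is attempted.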

In [None]:
bigdata_cred = Bigdata(BIGDATA_USERNAME, BIGDATA_PASSWORD)

In [None]:
user = 'aammydriss'
case_id = 'QIS-537'

# Directory where the outputs of this case are written
output_dir = f"/home/{user}/shared/OutputData/{user}/{case_id}/"

# Step 1- Generation of the Lexicon

In this step, we identify the specialized industry-specific jargon relevant to the crude oil market to ensure a high recall in the content retrieval.

In [None]:
main_theme = "Crude Oil"
system_prompt = (
    f"""You are an expert tasked with generating a lexicon of the most important and relevant keywords specific to the crude oil market.

    Your goal is to compile a list of terms that are critical for understanding and analyzing the crude oil market. This lexicon should include only the most essential keywords, phrases, and abbreviations that are directly associated with crude oil trading, analysis, logistics, and industry reporting.

    Guidelines:

    1. **Focus on relevance:** Include only the most important and commonly used keywords that are uniquely tied to the crude oil market. These should reflect key concepts, market mechanisms, pricing benchmarks, logistical aspects, and industry-specific terminology.
    2. **Avoid redundancy:** Do not repeat the word "Crude" or "Oil" in multiple phrases. Include "Crude Oil" only as a standalone term, and focus on other specific terms without redundant repetition.
    3. **Strict exclusion of generic terms:** Exclude any terms that are generic or broadly used in other markets, such as "Arbitrage," "Hedge," "Liquidity," "Spot Price," "Futures Contract," "Backwardation," or "Contango," even if they have a specific meaning in the oil market. Only include terms that are uniquely relevant to the crude oil market and cannot be applied broadly.
    4. **Include specific variations:** Where applicable, provide both the full form and common abbreviations (e.g., "West Texas Intermediate" and "WTI"), as well as variations like "Brent" and "Brent Crude."
    5. **Ensure clarity:** Each keyword should be concise, clear, and directly relevant to the crude oil market, avoiding any ambiguity.
    6. **Select only the most critical:** There is no need to reach a specific number of keywords. Focus solely on the most crucial terms without padding the list. If fewer keywords meet the criteria, that is acceptable.

    The output should be a lexicon of only the most critical and uniquely relevant keywords related to the crude oil market, formatted as a JSON list.
    """
)


In [None]:
keywords, keywords_query = keywords_functions.run_generate_keywords(
    theme=main_theme, openAIkey=os.getenv("OPENAI_API_KEY"), system_prompt=system_prompt,
    seeds=[123, 1234, 12345, 123456, 1234567, 12345678, 123456789, 1, 2, 3, 4, 5, 6, 7, 8, 9],
)

Fetching OpenAI Responses: 100%|██████████| 16/16 [00:50<00:00,  3.16s/it]

['Crude Oil Insights', 'Oil Refinery', 'Carbon Tax', 'Crude Oil Hydrotreating', 'Petrochemicals', 'Crude Oil Options', 'Petrobras', 'Wellhead', 'Crude Oil Market Mergers', 'Crude Assay', 'Possible Reserves', 'Dubai Crude', 'Pipeline', 'Crude Oil Distillation', 'Heavy Crude', 'TotalEnergies', 'Crude Blending', 'Crude Oil Pipeline', 'Oilfield', 'ICE Futures', 'Crude Oil Market Capital', 'Supply Glut', 'Chevron', 'Crude Oil Market Sentiment', 'Crude Oil Market Intelligence', 'Crude Oil Market Cash Flow', 'Crude Oil Export', 'Crude Oil Offshore', 'Crude Oil Market Efficiency', 'EOR', 'ICE Futures Europe', 'Crude Oil Market Alliances', 'Crude Oil Regulation', 'Crude Oil Market Collaboration', 'Fracking', 'Oil Field', 'Crude Oil Market Profit', 'SPRO', 'Petroleum Coke', 'E&P', 'Crude Oil Renewable Alternatives', 'Crude Oil Market Political Impact', 'Crude Oil Market Governance', 'Crude Oil Supply Chain', 'Barrel of Oil Equivalent', 'DOE', 'NOC', 'BP', 'Crude Oil Transition', 'Shale Oil', 'Ca
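
Running the same prompt over several seeds yields one keyword list per run. How `run_generate_keywords` combines these lists is not shown here. The sketch below is one plausible approach, offered as an assumption rather than the actual implementation: merge the runs, deduplicate case-insensitively, and optionally keep only keywords that recur across runs. The function name `merge_keyword_runs` and the `min_runs` parameter are hypothetical.

```python
from collections import Counter

def merge_keyword_runs(runs, min_runs=1):
    """Merge keyword lists from several runs, deduplicating case-insensitively.

    Keeps the first spelling seen for each keyword and drops keywords that
    appear in fewer than `min_runs` runs.
    """
    counts = Counter()
    first_spelling = {}
    for run in runs:
        seen_in_run = set()  # count each keyword at most once per run
        for keyword in run:
            cleaned = " ".join(keyword.split())  # normalize whitespace
            key = cleaned.casefold()
            if not key or key in seen_in_run:
                continue
            seen_in_run.add(key)
            counts[key] += 1
            first_spelling.setdefault(key, cleaned)
    return [first_spelling[key] for key in first_spelling if counts[key] >= min_runs]

# Illustrative input, not actual model output
runs = [
    ["Brent", "WTI", "Crude Assay"],
    ["brent", "OPEC+", "WTI "],
    ["Dubai Crude", "WTI"],
]
print(merge_keyword_runs(runs))
print(merge_keyword_runs(runs, min_runs=2))
```

Raising `min_runs` trades recall for precision: keywords produced by a single seed are more likely to be noise, but dropping them may also discard rare, valid jargon.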




<img src="https://bigdata.com/assets/notebooks/bigdata-by-ravenpack-logo.png" width="300" align="center">
<br>
<br>

# **Daily Top Trending Topics for Crude Oil**

This Jupyter notebook implements an **agentic workflow** built on content retrieval from the Bigdata API to **identify, verify, reindex, and summarize** the specialized news items that are **trending topics** for the crude oil market.

The workflow is structured as follows:

**Step 1- Generation of the Lexicon**: Identify the specialized industry-specific jargon relevant to the crude oil market to ensure a high recall in the content retrieval.

**Step 2- Content Retrieval Based on Bigdata**: Perform a keyword search on news content with the Bigdata API to retrieve documents, splitting the search into daily time windows and multi-threading the search across individual keywords for speed.

**Step 3- Topic Clustering and Selection**: Perform topic modelling with a large language model to verify and cluster the news. Summarization then drives topic selection, identifying the top trending news for crude oil and deriving analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on financial materiality for crude oil prices).

**Step 4- Customized Report Generation**: Customize the ranking of the summarized topics by trendiness, novelty, and financial materiality for crude oil prices, and display a daily market update. For verification purposes, the reports are backed by the granular news and sources.

**Output**

1. **Daily Market Reports**: A detailed and visually appealing report summarizing the top trending topics for crude oil, with a customizable ranking system to reindex the news.
2. **Actionable Dataframe**: A timestamped dataframe containing the granular news clustered into relevant topics, along with the trendiness, novelty, impact, and magnitude scores, which can be used for backtesting.

**Requirements**

- Credentials for the Bigdata API to perform keyword and document searches on news content.
- Credentials for the OpenAI API used in the notebook; any other LLM could be substituted.
- A `tools` folder in the same directory as this notebook, containing a Python file named `utils_reports.py` with all required functionalities.
- A `requirements.txt` file listing all the necessary Python libraries and dependencies. We recommend installing these packages in a virtual environment.

# Step 2- Content Retrieval Based on Bigdata

In this section, we perform a keyword search on news content with the Bigdata API to retrieve documents, splitting the search into daily time windows and multi-threading the search across individual keywords for speed. The user can set the time range below to generate daily reports between the start and end dates.

In [None]:
start_query = '2024-07-25'
end_query = '2024-07-25'

In [None]:
# Retrieve document identifiers in order to find the full documents
unique_reports, daily_keyword_counts = search_keyword(
    keywords, start_query, end_query, [], bigdata_cred, job_number=30, limit=750
)

2024-07-25 00:00:00 - 2024-07-25 23:59:59


QUEUEING TASKS | :   0%|          | 0/321 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/321 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/321 [00:00<?, ?it/s]
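
The retrieval strategy described in this step (one task per day and keyword, executed on a thread pool) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code behind `search_keyword`: `fake_search` is a hypothetical stand-in for a real Bigdata query, and `daily_windows` and `search_all` are names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, datetime, time, timedelta

def daily_windows(start, end):
    """Yield (window_start, window_end) datetimes, one pair per calendar day."""
    day = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while day <= last:
        yield datetime.combine(day, time.min), datetime.combine(day, time(23, 59, 59))
        day += timedelta(days=1)

def fake_search(keyword, window):
    """Hypothetical stand-in for one keyword search over one time window."""
    window_start, _ = window
    return {"keyword": keyword, "day": window_start.date().isoformat()}

def search_all(keywords, start, end, max_workers=8, search=fake_search):
    """Run one search task per (day, keyword) pair on a thread pool."""
    tasks = [(kw, window) for window in daily_windows(start, end) for kw in keywords]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in task order even though tasks run concurrently
        return list(pool.map(lambda task: search(*task), tasks))

results = search_all(["Brent", "WTI"], "2024-07-25", "2024-07-26")
print(len(results))
```

Threads suit this workload because each task spends most of its time waiting on the network. The number of tasks grows as days times keywords, so a per-task result limit and a bounded worker count keep the load on the API predictable.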

# Step 3- Topic Clustering and Selection

In this step, we perform topic modelling with a large language model to verify and cluster the news. Summarization then drives topic selection, identifying the top trending news for crude oil and deriving analytics that quantify each topic's trendiness (based on news volume), novelty (based on daily changes in summaries), and impact and magnitude (based on financial materiality for crude oil prices).

Before clustering, we apply a verification layer that removes news not relevant to the oil market.

In [None]:
model = 'gpt-4o-mini-2024-07-18'
api_key = os.environ['OPENAI_API_KEY']
semaphore_size = 1000

# Keep only the reports in unique_reports (the DataFrame from Step 2) that are relevant to main_theme
filtered_reports = process_all_reports(unique_reports, model, api_key, main_theme, semaphore_size)

Filtering News:   0%|          | 0/22372 [00:00<?, ?it/s]
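
A verification layer over tens of thousands of reports is typically run concurrently, with a semaphore capping the number of requests in flight. The sketch below shows that pattern with `asyncio`. It does not reproduce `process_all_reports`: the `is_relevant` coroutine is a hypothetical stand-in that uses a substring test where the notebook would call an LLM, and the sample texts are illustrative.

```python
import asyncio

async def is_relevant(text, theme):
    """Hypothetical stand-in for an LLM relevance check (simple substring test)."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return theme.casefold() in text.casefold()

async def filter_texts(texts, theme, semaphore_size=10, check=is_relevant):
    """Keep the texts judged relevant, with at most `semaphore_size` checks in flight."""
    semaphore = asyncio.Semaphore(semaphore_size)

    async def guarded(text):
        async with semaphore:
            return await check(text, theme)

    # gather() returns verdicts in the same order as the inputs
    verdicts = await asyncio.gather(*(guarded(text) for text in texts))
    return [text for text, keep in zip(texts, verdicts) if keep]

texts = [
    "Crude oil inventories fell by 3.7 million barrels.",
    "Taiwan shares tipped to open in the red.",
]
kept = asyncio.run(filter_texts(texts, "crude oil"))
print(kept)
```

Inside a Jupyter cell an event loop is already running, so `asyncio.run` raises a `RuntimeError` there; use `kept = await filter_texts(texts, "crude oil")` instead.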

In this cell, we use an LLM to perform topic modeling, identifying and clustering the key topics from the news reports.

In [None]:
flattened_trending_topics_df = run_process_all_trending_topics(
    unique_reports=filtered_reports,
    model=model,
    start_query=start_query,
    end_query=end_query,
    api_key=os.environ['OPENAI_API_KEY'],
    main_theme = main_theme,
    batchs = 20
)


Extracting for 2024-07-25:   0%|          | 0/20 [00:00<?, ?it/s]

Summarizing topics: 100%|██████████| 35/35 [00:02<00:00, 11.71it/s]
Generating titles: 100%|██████████| 36/36 [00:03<00:00,  9.17it/s]
Generating text summaries: 100%|██████████| 230/230 [00:04<00:00, 54.08it/s] 


**Trendiness and Novelty Scores**: We derive analytics related to the trendiness of the topic based on the news volume, and the novelty of the topic based on the changes in daily summaries, evaluating the uniqueness and freshness of each topic.

In [None]:
# Calculate trendiness and novelty scores, assessing the uniqueness and freshness of each topic
flattened_trending_topics_df = run_add_advanced_novelty_scores(flattened_trending_topics_df, api_key = os.environ['OPENAI_API_KEY'], main_theme = main_theme)

Calculating Novelty Scores:   0%|          | 0/180 [00:00<?, ?it/s]
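
To make the two scores concrete, the sketch below counts news chunks per topic as a raw volume measure and labels a summary as new unless it closely resembles one from the previous day. Both parts are assumptions for illustration: the notebook derives novelty with an LLM, how raw counts are scaled into `Volume_Score` is not shown, and the `"Seen before"` label and the 0.6 threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def volume_by_topic(rows):
    """Count how many news chunks were assigned to each topic."""
    return Counter(row["Topic"] for row in rows)

def novelty_label(summary, previous_summaries, threshold=0.6):
    """Label a summary 'New' unless it closely resembles one from the previous day.

    String similarity is a crude stand-in for an LLM-based comparison.
    """
    for previous in previous_summaries:
        ratio = SequenceMatcher(None, summary.casefold(), previous.casefold()).ratio()
        if ratio >= threshold:
            return "Seen before"  # hypothetical label
    return "New"

# Illustrative rows, not actual notebook output
rows = [
    {"Topic": "Inventories", "Text": "a"},
    {"Topic": "Inventories", "Text": "b"},
    {"Topic": "OPEC+ output", "Text": "c"},
]
print(volume_by_topic(rows).most_common())
print(novelty_label("U.S. crude inventories fell sharply.", []))
```

On the first day of a run there are no previous summaries, so every topic is labeled new, which is consistent with a single-day query.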

**Financial Materiality**: We derive analytics on the impact (Positive, Negative) and magnitude (High, Medium, Low) of each topic, inferring its effect on crude oil prices. The inference draws on price mechanisms such as supply and demand dynamics and geopolitical factors, among others.

In [None]:
point_of_view = "a crude oil trader, where price is influenced by supply-demand dynamics, geopolitical events, and market sentiment. \
As a trader, you focus on changes in production, inventories, and economic indicators from key markets."


flattened_trending_topics_df = add_market_impact_to_df(flattened_trending_topics_df, api_key = os.environ['OPENAI_API_KEY'], main_theme = main_theme, point_of_view = point_of_view)

We display the results of topic modeling and summarization. The **Topic** column holds the themes inferred through LLM-based topic clustering, which groups news articles by content and underlying theme. The **Summary** column provides a synthesized overview of all news articles within the same topic, giving a high-level view of the key messages for each cluster. The **Topic** is then rephrased into a concise form based on the summary. The **Text_Summary** column provides a detailed summary of each individual chunk, capturing its core message.

In [None]:
flattened_trending_topics_df.head(5)

| Unnamed: 0 | Date | Day_in_Review | Topic | Summary | Source | Headline | Text | Volume_Score | Text_Summary | Volume_Score.1 | Novelty_Score | Impact_Score | Magnitude_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-07-25 | - **U.S. Crude Inventories Decline**: U.S. cru... | U.S. Crude Oil Inventories Plunge 3.7 Million ... | Recent reports from the U.S. Energy Informatio... | Klse I3investor.com | PublicInvest Research Headlines - 25 Jul 2024 | Workers are now having a harder time finding j... | 5 | U.S. crude oil inventories unexpectedly droppe... | 5 | New | Negative | High |
| 3 | 2024-07-25 | - **U.S. Crude Inventories Decline**: U.S. cru... | U.S. Crude Oil Inventories Plunge 3.7 Million ... | Recent reports from the U.S. Energy Informatio... | Financial Express via Web | Will Nifty hold 24,200 as markets see time & p... | The US Dollar Index (DXY), which measures the ... | 5 | A slight decline in the US Dollar Index coinci... | 5 | New | Negative | High |
| 4 | 2024-07-25 | - **U.S. Crude Inventories Decline**: U.S. cru... | U.S. Crude Oil Inventories Plunge 3.7 Million ... | Recent reports from the U.S. Energy Informatio... | RTTNews via Web | Taiwan Shares Tipped To Open In The Red | In economic news, the Commerce Department unex... | 5 | U.S. crude oil prices rose following a surpris... | 5 | New | Negative | High |
| 5 | 2024-07-25 | - **U.S. Crude Inventories Decline**: U.S. cru... | US Crude Inventory Decline Fuels Fluctuations ... | Brent and West Texas Intermediate (WTI) crude ... | Livemint | Indian stock market: 7 key things that changed... | Oil Prices Crude oil prices traded lower. Bren... | 4 | Recent declines in Brent and WTI crude oil pri... | 4 | New | Positive | High |
| 6 | 2024-07-25 | - **U.S. Crude Inventories Decline**: U.S. cru... | US Crude Inventory Decline Fuels Fluctuations ... | Brent and West Texas Intermediate (WTI) crude ... | Business Standard via Web | Market outlook July 25: Global sell-off hints ... | The US 10-year bond yield quoted around 4.266 ... | 4 | Recent fluctuations in Brent and WTI oil price... | 4 | New | Positive | High |


For verification purposes, this actionable, timestamped dataframe contains the granular news clustered into relevant topics, along with the trendiness, novelty, impact, and magnitude scores, which can be used for backtesting.
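
For backtesting, the categorical impact and magnitude labels usually need to be turned into a number per day. One possible encoding is sketched below. The column names and the label sets come from this notebook, while the numeric weights, the function names, and the choice to count each topic once per day are assumptions made for illustration.

```python
IMPACT_SIGN = {"Positive": 1, "Negative": -1}
MAGNITUDE_WEIGHT = {"Low": 1, "Medium": 2, "High": 3}  # assumed weights

def signed_score(impact, magnitude):
    """Combine direction and size into one signed number; unknown labels give 0."""
    return IMPACT_SIGN.get(impact, 0) * MAGNITUDE_WEIGHT.get(magnitude, 0)

def daily_signal(rows):
    """Sum the signed scores per date, counting each (date, topic) pair once."""
    seen = set()
    totals = {}
    for row in rows:
        key = (row["Date"], row["Topic"])
        if key in seen:
            continue  # the same topic appears on several news rows
        seen.add(key)
        score = signed_score(row["Impact_Score"], row["Magnitude_Score"])
        totals[row["Date"]] = totals.get(row["Date"], 0) + score
    return totals

# Illustrative rows, not actual notebook output
rows = [
    {"Date": "2024-07-25", "Topic": "A", "Impact_Score": "Negative", "Magnitude_Score": "High"},
    {"Date": "2024-07-25", "Topic": "A", "Impact_Score": "Negative", "Magnitude_Score": "High"},
    {"Date": "2024-07-25", "Topic": "B", "Impact_Score": "Positive", "Magnitude_Score": "Low"},
]
print(daily_signal(rows))
```

Deduplicating on topic matters because the dataframe holds one row per news chunk, so a topic with many sources would otherwise dominate the daily total through volume alone.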

# Step 4- Customized Report Generation

In this step, we rank the topics and let the user customize the ranking to reindex the news by trendiness, novelty, and financial materiality for crude oil prices. We then display a daily market update, supported by the corresponding granular news and sources for verification purposes.

The user selects the date for the report summarizing the top trending topics, and customizes the ranking system to prioritize the topics based on volume (trendiness and media attention), novelty (based on the emergence of new daily news), impact direction (positive or negative), and magnitude (financial materiality). The ranking system prioritizes the criteria in the order specified by the user, allowing for a tailored focus on the most relevant aspects of the data.

The order in which the criteria are listed in `user_selected_ranking` determines their priority when ranking the topics within the report. The first criterion in the list has the highest priority, followed by the second, and then the third. The user can prioritize impact direction (positive or negative), novelty, magnitude, or volume, and can select one, two, or all three criteria depending on their needs.
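
Priority-ordered ranking of this kind maps naturally onto sorting by a tuple key, where earlier tuple positions dominate later ones. The sketch below illustrates the idea and is not the logic inside `prepare_data_for_report`: the rank mappings, the function names, and the `"Seen before"` label are assumptions.

```python
NOVELTY_RANK = {"New": 0}  # any other label ranks after "New"
MAGNITUDE_RANK = {"High": 0, "Medium": 1, "Low": 2}

def sort_key(topic, ranking):
    """Build a tuple key; earlier criteria in `ranking` take priority."""
    parts = []
    for criterion in ranking:
        if criterion == "novelty":
            parts.append(NOVELTY_RANK.get(topic["Novelty_Score"], 1))
        elif criterion == "volume":
            parts.append(-topic["Volume_Score"])  # higher volume first
        elif criterion == "magnitude":
            parts.append(MAGNITUDE_RANK.get(topic["Magnitude_Score"], 3))
        else:
            raise ValueError(f"Unknown criterion: {criterion}")
    return tuple(parts)

def rank_topics(topics, ranking):
    """Return the topics sorted by the criteria in priority order."""
    return sorted(topics, key=lambda topic: sort_key(topic, ranking))

# Illustrative topics, not actual notebook output
topics = [
    {"Topic": "A", "Novelty_Score": "New", "Volume_Score": 4, "Magnitude_Score": "High"},
    {"Topic": "B", "Novelty_Score": "New", "Volume_Score": 5, "Magnitude_Score": "Medium"},
    {"Topic": "C", "Novelty_Score": "Seen before", "Volume_Score": 5, "Magnitude_Score": "High"},
]
print([t["Topic"] for t in rank_topics(topics, ["novelty", "volume", "magnitude"])])
print([t["Topic"] for t in rank_topics(topics, ["magnitude", "volume"])])
```

Because `sorted` is stable, topics that tie on every listed criterion keep their original order, so listing fewer criteria never shuffles otherwise equal topics.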

In [None]:
from IPython.display import HTML, display

specific_date = '2024-07-25'

# Clean the text columns before rendering
for column in ['Summary', 'Day_in_Review', 'Text_Summary', 'Topic']:
    flattened_trending_topics_df[column] = flattened_trending_topics_df[column].apply(clean_text)

# Ranking criteria in priority order; modify this list to change the ranking
user_selected_ranking = ['novelty', 'volume', 'magnitude']

# Optionally restrict the report by impact direction, e.g. impact_filter='positive_impact'
prepared_reports = prepare_data_for_report(
    flattened_trending_topics_df,
    user_selected_ranking,
    impact_filter=None,
    report_date=specific_date,
)

# Generate and display the HTML report for each date
for report in prepared_reports:
    html_content = generate_html_report(
        report['date'],
        report['day_in_review'],
        report['topics'],
        'Daily crude oil market update',  # Title of the report
    )
    display(HTML(html_content))
    print("\n\n")




