## Setting things up

### Loading the example dataset

In [1]:
from news_signals.signals_dataset import SignalsDataset

local_dataset_dir = '../../resources/test/nasdaq100_sample_dataset'
dataset = SignalsDataset.load(local_dataset_dir)

# Exploring a News Signals dataset

We can use `print` on the dataset to get some information about it.

In [2]:
print(dataset)

SignalsDataset({'eyJuYW1lIjogIlExNTEwOTg2NSIsICJtZXRhZGF0YSI6IHt9fQ==': {
  "type": "<class 'news_signals.signals.AylienSignal'>",
  "name": "Q15109865",
  "metadata": {},
  "params": {
    "entity_ids": [
      "Q15109865"
    ],
    "language": "en",
    "sort_by": "relevance",
    "per_page": 3
  },
  "aql": "entities: {{prominence_score:[0.7 TO *] AND  (id:Q15109865) sort_by(overall_prominence)}}",
  "ts_column": "count",
  "timeseries_df_columns": "['count', 'published_at']",
  "feeds_df_columns": "['stories']"
}, 'eyJuYW1lIjogIlExMTQ2MyIsICJtZXRhZGF0YSI6IHt9fQ==': {
  "type": "<class 'news_signals.signals.AylienSignal'>",
  "name": "Q11463",
  "metadata": {},
  "params": {
    "entity_ids": [
      "Q11463"
    ],
    "language": "en",
    "sort_by": "relevance",
    "per_page": 3
  },
  "aql": "entities: {{prominence_score:[0.7 TO *] AND  (id:Q11463) sort_by(overall_prominence)}}",
  "ts_column": "count",
  "timeseries_df_columns": "['count', 'published_at']",
  "feeds_df_column

A signals dataset contains multiple `Signal` objects. In this dataset, each signal corresponds to one of the first 10 companies in the Nasdaq-100 index and is named after its entity ID (for example, `Q11463`).

In [3]:
signals = sorted(dataset.signals.values(), key=lambda s: s.name)
[s.name for s in signals]

['Q1024454',
 'Q1055390',
 'Q1092571',
 'Q11463',
 'Q1155668',
 'Q1383669',
 'Q14772',
 'Q15109865',
 'Q1545076',
 'Q17081612']

## Semantic Filtering

### Keyword Filter

Find all stories where the keyword "Million" is present in the title.

In [71]:
from news_signals.semantic_filters import StoryKeywordMatchFilter

dataset = SignalsDataset.load(local_dataset_dir)
signals = sorted(dataset.signals.values(), key=lambda s: s.name)
signal = signals[0]

keywords = ['Million']
filter_model = StoryKeywordMatchFilter(keywords=keywords)
filtered_signal = signal.filter_stories(filter_model=filter_model)
filtered_stories_per_tick = [len(tick) for tick in filtered_signal['stories']]

for tick_stories in filtered_signal['stories']:
    for s in tick_stories:
        assert any(kw in s['title'] for kw in keywords)
        print(s['title'])

Van ECK Associates Corp Has $4.45 Million Holdings in CSX Co. (NASDAQ:CSX)
RSM US Wealth Management LLC Invests $1.69 Million in CSX Co. (NASDAQ:CSX)
Rothschild & Co. Asset Management US Inc. Takes $75.18 Million Position in CSX Co. (NASDAQ:CSX)
O Shaughnessy Asset Management LLC Has $8.55 Million Holdings in CSX Co. (NASDAQ:CSX)
Healthcare of Ontario Pension Plan Trust Fund Has $13.38 Million Stock Position in CSX Co. (NASDAQ:CSX)
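
The assertion in the loop above relies on Python's `in` operator, which is a case-sensitive substring test, so a title containing "million" in lowercase would not satisfy it. As a minimal sketch of a case-insensitive variant, the helper below works on plain story dictionaries. The name `title_matches_any` and the toy titles are made up for illustration and are not part of `news_signals`.

```python
def title_matches_any(story: dict, keywords: list[str]) -> bool:
    """Return True if any keyword occurs in the story title, ignoring case."""
    # Treat a missing or empty title as an empty string.
    title = (story.get("title") or "").casefold()
    return any(kw.casefold() in title for kw in keywords)


# Toy stories (invented for this sketch, not taken from the dataset).
stories = [
    {"title": "Example Fund Takes $5 Million Position in Example Co."},
    {"title": "example fund sells $2 million of shares"},
    {"title": "Example Co. announces quarterly dividend"},
]

matched = [s["title"] for s in stories if title_matches_any(s, ["Million"])]
for title in matched:
    print(title)
```

Because the helper only needs a dictionary with a `title` key, the same logic could be moved into the `__call__` method of a custom filter like the one defined in the next section.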


### Let's make our own Semantic Filter

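The `SemanticFilter` class can be subclassed: as the example below shows, a filter is a callable that takes a story dictionary and returns `True` to keep the story or `False` to drop it.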

We can go beyond keywords and filter for stories that align with a given phrase.

See the example below, where we use a [Sentence Transformers](https://sbert.net/index.html) model to filter based on article titles.

In [72]:
from news_signals.semantic_filters import SemanticFilter
from sentence_transformers import SentenceTransformer, util


class SemanticMatchFilter(SemanticFilter):
    
    def __init__(self, 
                 keywords: list[str], 
                 model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
                 threshold: float = 0.5, 
                 verbose: bool = False
    ):
        """
        Initializes the semantic filter with a list of keywords, a sentence transformer model, and a similarity threshold.
        
        :param keywords: List of phrases or keywords for semantic matching.
        :param model_name: The name of the sentence transformer model to use.
        :param threshold: The cosine similarity threshold above which a story matches the filter.
        :param verbose: If True, print each title and its similarity to every keyword.
        """
        self.keywords = keywords
        self.model = SentenceTransformer(model_name)
        self.keyword_embeddings = self.model.encode(
            self.keywords, convert_to_tensor=True, 
        )
        self.threshold = threshold
        self.verbose = verbose
    
    def __call__(self, item: dict) -> bool:
        """
        Filters a story by computing the cosine similarity between the story title and the keywords.
        
        :param item: A dictionary representing a news story, which should contain the 'title' key.
        :return: True if the story's title has a similarity above the threshold with any keyword embedding, False otherwise.
        """
        title_embedding = self.model.encode(
            item['title'], convert_to_tensor=True, 
        )
        similarities = util.cos_sim(title_embedding, self.keyword_embeddings)
        if self.verbose:
            print(f"Title: {item['title']}")
            for keyword, score in zip(self.keywords, similarities[0]):
                print(f"Keyword: {keyword}, Similarity: {score}")
        
        return any(score > self.threshold for score in similarities[0])
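
The decision rule in `__call__` can be looked at in isolation: a story passes if the cosine similarity between its title embedding and at least one phrase embedding is strictly above the threshold. The sketch below reproduces that rule in pure Python. The 3-dimensional vectors are toy stand-ins for real embeddings, and `cosine_similarity` and `matches_any` are hypothetical helper names, not library functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def matches_any(title_vec, phrase_vecs, threshold: float = 0.5) -> bool:
    """True if the title is similar enough to at least one phrase."""
    return any(cosine_similarity(title_vec, p) > threshold for p in phrase_vecs)


# Toy vectors standing in for sentence embeddings.
title_vec = [1.0, 0.0, 0.0]
phrase_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the title: similarity 0
    [1.0, 1.0, 0.0],  # 45 degrees from the title: similarity about 0.707
]

print(matches_any(title_vec, phrase_vecs, threshold=0.5))
print(matches_any(title_vec, phrase_vecs, threshold=0.8))
```

With these toy vectors the first call passes because one phrase scores about 0.707, while raising the threshold to 0.8 rejects the title. Since a single matching phrase is enough, adding more phrases can only let more stories through.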


In [73]:
dataset = SignalsDataset.load(local_dataset_dir)
signals = sorted(dataset.signals.values(), key=lambda s: s.name)
signal = signals[0]

semantic_phrases = [
    'company stock position has changed significantly', 
    'billionaire investor goes bankrupt', 
    'company shares are bought', 
    'family goes on annual ski trip',  
    'i went grocery shopping but forgot to buy eggs'
]
filter_model = SemanticMatchFilter(keywords=semantic_phrases, threshold=0.5, verbose=False)
filtered_signal = signal.filter_stories(filter_model=filter_model)
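
To judge how aggressive a filter is, it can help to compare story counts per tick before and after filtering. The sketch below assumes each tick holds a list of story dictionaries, as in the loops over `filtered_signal['stories']` in this notebook. The helpers `stories_per_tick` and `retention` are hypothetical, and the input is toy data rather than real dataset output.

```python
def stories_per_tick(ticks) -> list[int]:
    """Number of stories in each tick; non-list entries count as 0."""
    return [len(t) if isinstance(t, list) else 0 for t in ticks]


def retention(before, after) -> float:
    """Fraction of stories that survive filtering (0.0 if there were none)."""
    total_before = sum(stories_per_tick(before))
    total_after = sum(stories_per_tick(after))
    return total_after / total_before if total_before else 0.0


# Toy ticks: three time steps with 2, 1, and 0 stories.
before = [[{"title": "a"}, {"title": "b"}], [{"title": "c"}], []]
# After a hypothetical filter, only story "a" remains.
after = [[{"title": "a"}], [], []]

print(stories_per_tick(before))
print(stories_per_tick(after))
print(f"retention: {retention(before, after):.2f}")
```

In the notebook, the same helpers could be applied to `signal['stories']` and `filtered_signal['stories']`, assuming the unfiltered signal exposes its stories the same way as the filtered one.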

In [74]:
for tick_stories in filtered_signal['stories']:
    for s in tick_stories:
        print(s['title'])

PNC Financial Services Group Inc. Sells 16,836 Shares of CSX Co. (NASDAQ:CSX)
Proquility Private Wealth Partners LLC Buys 1,909 Shares of CSX Co. (NASDAQ:CSX)
CSX Co. (NASDAQ:CSX) Stock Position Reduced by Lincoln National Corp
Marcum Wealth LLC Buys 8,231 Shares of CSX Co. (NASDAQ:CSX)
CSX Co. (NASDAQ:CSX) Shares Purchased by First National Trust Co
CSX Co. (NASDAQ:CSX) Shares Acquired by Ceredex Value Advisors LLC
Bank of Hawaii Reduces Stock Position in CSX Co. (NASDAQ:CSX)
MEMBERS Trust Co Purchases 13,200 Shares of CSX Co. (NASDAQ:CSX)
FUKOKU MUTUAL LIFE INSURANCE Co Raises Stock Holdings in CSX Co. (NASDAQ:CSX)
CSX Co. (NASDAQ:CSX) Stock Holdings Increased by Ieq Capital LLC
What Is The Deal With CSX Corporation (NASDAQ: CSX) Stock?
Sawtooth Solutions LLC Increases Stock Position in CSX Co. (NASDAQ:CSX)
Ally Financial Inc. Buys Shares of 40,000 CSX Co. (NASDAQ:CSX)
Jupiter Asset Management Ltd. Buys 1,576 Shares of CSX Co. (NASDAQ:CSX)
CX Institutional Acquires 47,590 Shares of C