## This notebook shows how news-signals makes exploratory data analysis on newsfeeds easy

In this example, we have a signal with a feed of stories. We want to process the stories and discover how narratives evolved over time.

Let's look at the recent Silicon Valley Bank collapse.
In the first pass, let's just look at the top entities and categories in this signal and how they changed over time, leading up to the Silicon Valley Bank collapse.

In [None]:
# not needed if news_signals is already installed
# you might see a pip version error but it's grand, don't worry
!pip install -q news_signals

In [None]:
from pathlib import Path
from collections import Counter, OrderedDict

from news_signals import signals, signals_dataset, newsapi, wikidata_utils

If you want to build a signal yourself, first get a NewsAPI account. The cell below then shows how to create and save the signal
(uncomment the commented code).

However, for the purposes of this example, we've already created a signal and uploaded it to Google Drive, so you don't need a NewsAPI account, and the example should just run!

In [None]:
# let's set up the entity we want to work with
entity_name = 'Silicon Valley Bank'


# Build a new signal - see the README at https://github.com/AYLIEN/news-signals-datasets 
# for how to set up a NewsAPI trial account. 
# entity_id_candidates = wikidata_utils.search_wikidata(entity_name)
# test_entity = entity_id_candidates[0]

# # cool, now let's create a signal
# signal = signals.AylienSignal(
#     name=test_entity['label'],
#     params={"entity_ids": [test_entity['id']]}
# )

# # let's instantiate our signal for the time period we care about
# # investigation period
# start = '2022-10-01'
# end = '2023-03-18'

# signal = signal(start, end).sample_stories_in_window(start, end, num_stories=50)

# output_dir = Path(f'example_signals/{entity_name}_{start}_{end}')
# output_dir.mkdir(parents=True, exist_ok=True)

# signal.save(output_dir)

In [None]:
# instantiate the saved signal from Google Drive
from pathlib import Path


cache_dir = Path('tmp/saved_signals')
cache_dir.mkdir(parents=True, exist_ok=True)
dataset_path = 'https://drive.google.com/drive/folders/1RgstgaORO0OEdwUIQ0Bj997JVQulZo7n?usp=share_link'

signal = list(
    signals_dataset.SignalsDataset.load(dataset_path, cache_dir=cache_dir)
    .signals.values())[0]
signal.name

In [None]:
signal.plot()

In [None]:
def collect_sfs(aylien_entities):
    """
    A utility function that collects all entity surface forms from the 
    'entities' field of an Aylien NewsAPI story. 
    """
    sfs = Counter()
    for e in aylien_entities:
        for sf in e['title']['surface_forms'] + e['body']['surface_forms']:
            sfs.update([sf['text']])
    return sfs
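
To make the expected input concrete, here is a small sketch that runs the same counting logic on a hand-written entity list. The two entity records are invented for illustration and only mimic the nested `title`/`body`, `surface_forms`, `text` shape that `collect_sfs` reads; real NewsAPI entities carry more fields. The function is repeated so that the block runs on its own.

```python
from collections import Counter


def collect_sfs(aylien_entities):
    """Count entity surface forms across the title and body of one story."""
    sfs = Counter()
    for e in aylien_entities:
        for sf in e['title']['surface_forms'] + e['body']['surface_forms']:
            sfs.update([sf['text']])
    return sfs


# Hypothetical entities, invented for illustration only.
toy_entities = [
    {
        'title': {'surface_forms': [{'text': 'SVB'}]},
        'body': {'surface_forms': [{'text': 'Silicon Valley Bank'}, {'text': 'SVB'}]},
    },
    {
        'title': {'surface_forms': []},
        'body': {'surface_forms': [{'text': 'FDIC'}]},
    },
]

toy_counts = collect_sfs(toy_entities)
print(toy_counts.most_common())
```

Each mention counts once, so a surface form that appears in both the title and the body of a story is counted twice.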

In [None]:
category_counts = OrderedDict()
category_diffs = OrderedDict()
category_probs = OrderedDict()


# scroll through all the stories in a feed and check for surprises in entities or categories
prev_date = None
for date, stories in signal['stories'].items():
    date = str(date.date())

    # CATEGORIES
    category_counts[date] = Counter(c['label'] for s in stories for c in s['categories'] if 'label' in c)

    # ENTITIES
#     category_counts[date] = Counter(sf_ for s in stories 
#                                     for sf_ in 
#                                     collect_sfs(s['entities']).keys())

    category_probs[date] = \
        OrderedDict((c, count / len(stories))
                    for c, count in category_counts[date].most_common())
    
    diffs = OrderedDict()
    if prev_date is not None:
        for c in category_probs[date]:
            if c in category_probs[prev_date]:
                diff = category_probs[date][c] - category_probs[prev_date][c]
            else:
                diff = category_probs[date][c]
            diffs[c] = diff

    category_diffs[date] = OrderedDict((c, d) for c, d in sorted(diffs.items(), key=lambda x: x[1], reverse=True))
    prev_date = date
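
The per-day arithmetic in the loop above can be checked by hand on a tiny example. The sketch below uses two invented days of category labels (not data from the signal) and two hypothetical helpers, `label_shares` and `share_diffs`. Each label's count is divided by the number of stories that day, and the diff is that day's share minus the previous day's share, with a label unseen on the previous day treated as having share 0.

```python
from collections import Counter


def label_shares(stories):
    """Return {label: occurrences / number of stories} for one day."""
    counts = Counter(
        c['label'] for s in stories for c in s['categories'] if 'label' in c
    )
    return {label: n / len(stories) for label, n in counts.most_common()}


def share_diffs(prev, curr):
    """Change in share for every label present in `curr`, largest first."""
    diffs = {label: share - prev.get(label, 0.0) for label, share in curr.items()}
    return dict(sorted(diffs.items(), key=lambda x: x[1], reverse=True))


# Two invented days with four stories each, for illustration only.
day1 = [
    {'categories': [{'label': 'Banking'}]},
    {'categories': [{'label': 'Banking'}]},
    {'categories': [{'label': 'Technology'}]},
    {'categories': [{'label': 'Technology'}]},
]
day2 = [
    {'categories': [{'label': 'Banking'}]},
    {'categories': [{'label': 'Banking'}, {'label': 'Bankruptcy'}]},
    {'categories': [{'label': 'Banking'}, {'label': 'Bankruptcy'}]},
    {'categories': [{'label': 'Technology'}]},
]

toy_diffs = share_diffs(label_shares(day1), label_shares(day2))
print(toy_diffs)
```

Here 'Bankruptcy' goes from a share of 0 to 0.5, 'Banking' from 0.5 to 0.75, and 'Technology' from 0.5 to 0.25. As in the loop above, a label that disappears entirely on the second day gets no diff at all, so only rises and partial drops are reported.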

In [None]:
signal.timeseries_df['count'].idxmax()

In [None]:
significance_threshold = 0.3

for date, diffs in category_diffs.items():
    print(f'Date: {date}')
    print(f'Timeseries: {signal.loc[date]["count"]}')
    for c, d in diffs.items():
        if abs(d) > significance_threshold:
            print(f'\t{c}: {d:0.2f}')


In [None]:
# check a specific entity - what was going on when this entity trended in this feed?

for s in signal.feeds_df.loc['2022-10-14']['stories']:
    if any('Greece' in sf for sf in collect_sfs(s['entities'])):
        print(f'Title: {s["title"]}')
        if entity_name in s['body']:
            loc = s['body'].index(entity_name)
            # clamp the start at 0 so a mention in the first 100 characters still prints
            print(s['body'][max(0, loc - 100):loc + 100])

In [None]:
# check a specific category - what was going on when this category trended in this feed?

for s in signal.feeds_df.loc['2022-10-09']['stories']:
    if any('Private Banking' in c['label'] for c in s['categories']):
        print(f'Title: {s["title"]}')
        if entity_name in s['body']:
            loc = s['body'].index(entity_name)
            # clamp the start at 0 so a mention in the first 100 characters still prints
            print(s['body'][max(0, loc - 100):loc + 100])
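
The snippet printing in the two cells above amounts to cutting a fixed-width window around the first mention of a term. A minimal sketch of that idea as a reusable helper follows. `context_window` and its `width` parameter are hypothetical names, not part of news-signals, and the example text is invented. The start index is clamped at 0 because a negative slice start in Python is counted from the end of the string.

```python
def context_window(text, term, width=100):
    """Return the text around the first occurrence of `term`.

    The window runs from `width` characters before the start of the match
    to `width` characters after the start of the match.
    Returns None when `term` does not occur in `text`.
    """
    loc = text.find(term)
    if loc == -1:
        return None
    # Clamp at 0: a negative start would be counted from the end of the string.
    start = max(0, loc - width)
    return text[start:loc + width]


# Invented text for illustration only.
toy_body = 'Shares in Silicon Valley Bank fell sharply on Thursday.'
snippet = context_window(toy_body, 'Silicon Valley Bank', width=10)
print(snippet)
```

With a helper like this, the body of each loop could reduce to a single call such as `context_window(s['body'], entity_name)`, followed by a print when the result is not `None`.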