
I found 


| Country        | Value (NOK)           | Investments | % of Portfolio |
|----------------|-----------------------|-------------|----------------|
| USA            | 10,488,258,386,469   | 2,901       | 52.4%          |
| Japan          | 1,207,846,613,870    | 1,410       | 6.0%           |
| United Kingdom | 1,100,973,152,288    | 1,036       | 5.5%           |
| Germany        | 916,292,718,569      | 295         | 4.6%           |
| France         | 708,716,940,392      | 263         | 3.5%           |





In [1]:
pip install newsapi-python

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
import yaml
from datetime import datetime, timedelta
from pathlib import Path
import time

import pandas as pd
import numpy as np
import requests
import feedparser
from newsapi import NewsApiClient
from dotenv import load_dotenv
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')


### **Loading Configuration**

**What I am doing:**
- Loading environment variables from .env file (API keys)
- Reading config.yaml for market definitions and parameters
- Validating that required API keys are present

**Why I'm doing this:**
- Secure credential management (keys not hardcoded)
- Centralized configuration for easy updates
- Early validation prevents runtime errors later

**Technical Note:** dotenv for environment variables, yaml for configuration parsing

In [5]:
load_dotenv()

with open('../config.yaml', 'r') as f:
    config = yaml.safe_load(f)

NEWS_API_KEY = os.getenv('NEWS_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not NEWS_API_KEY or NEWS_API_KEY == 'your_newsapi_key_here':
    print("WARNING: NewsAPI key missing")
else:
    print("NewsAPI key loaded")

if not OPENAI_API_KEY or OPENAI_API_KEY == 'your_openai_key_here':
    print("WARNING: OpenAI key missing")
else:
    print("OpenAI key loaded")

print(f"Config loaded: {len(config['markets'])} markets")

NewsAPI key loaded
OpenAI key loaded
Config loaded: 5 markets
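
The cells in this notebook read `config['markets']` as a mapping from a market id (for example `'us'`) to a dict with `name`, `keywords`, and `country_code`. The real config.yaml is not shown here, so the following is only a sketch of that shape with a small structural check. The keyword values and the helper name `check_markets` are hypothetical.

```python
# Hypothetical example of what yaml.safe_load might return for config.yaml.
# Only the key names (markets, name, keywords, country_code) come from this
# notebook; the values are illustrative.
example_config = {
    "markets": {
        "us": {
            "name": "United States",
            "keywords": ["Federal Reserve", "SEC"],
            "country_code": "US",
        },
    }
}

REQUIRED_FIELDS = ("name", "keywords", "country_code")


def check_markets(config):
    """Return {market_id: [missing fields]} for markets lacking required fields."""
    problems = {}
    for market_id, market in config.get("markets", {}).items():
        missing = [field for field in REQUIRED_FIELDS if not market.get(field)]
        if missing:
            problems[market_id] = missing
    return problems


print(check_markets(example_config))  # {}
```

Running a check like this right after loading the config would surface a missing `country_code` before the first network call rather than inside a fetch function.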


### **Data Source Functions**

**What I am doing:**
- Creating a function to fetch from NewsAPI with keyword queries
- Creating a function to fetch from Google News RSS feeds
- Adding metadata (market_id, source_type) to each article

**Why I'm doing this:**
- Modular design allows easy testing and debugging
- Multiple data sources provide broader coverage
- Metadata enables tracking and filtering by source

**Technical Note:** newsapi-python library, feedparser for RSS, requests for HTTP calls

In [21]:
def fetch_newsapi(market_id, market_config, days_back=30):
    """Fetch news from NewsAPI"""
    try:
        newsapi = NewsApiClient(api_key=NEWS_API_KEY)
        
        end_date = datetime.now()
        start_date = end_date - timedelta(days=days_back)
        
        query = ' OR '.join([f'"{kw}"' for kw in market_config['keywords']])
        
        response = newsapi.get_everything(
            q=query,
            from_param=start_date.strftime('%Y-%m-%d'),
            to=end_date.strftime('%Y-%m-%d'),
            language='en',
            sort_by='publishedAt',
            page_size=100
        )
        
        if response['status'] == 'ok':
            articles = response['articles']
            for article in articles:
                article['source_type'] = 'newsapi'
                article['market_id'] = market_id
                article['market_name'] = market_config['name']
            return articles
        return []
        
    except Exception as e:
        print(f"NewsAPI error for {market_config['name']}: {e}")
        return []


def fetch_google_news_rss(market_id, market_config):
    """Fetch news from Google News RSS"""
    try:
        articles = []
        for keyword in market_config['keywords'][:3]:
            query = keyword.replace(' ', '+')
            rss_url = f"https://news.google.com/rss/search?q={query}+when:30d&hl=en&gl={market_config['country_code']}&ceid={market_config['country_code']}:en"
            
            feed = feedparser.parse(rss_url)
            
            for entry in feed.entries[:20]:
                if hasattr(entry, 'published_parsed') and entry.published_parsed:
                    pub_date = datetime(*entry.published_parsed[:6])
                else:
                    pub_date = datetime.now()
                
                article = {
                    'title': entry.get('title', ''),
                    'description': entry.get('summary', ''),
                    'url': entry.get('link', ''),
                    'publishedAt': pub_date.isoformat(),
                    'source': {'name': entry.get('source', {}).get('title', 'Google News')},
                    'source_type': 'google_news',
                    'market_id': market_id,
                    'market_name': market_config['name']
                }
                articles.append(article)
            
            time.sleep(0.5)
            
        return articles
        
    except Exception as e:
        print(f"Google News error for {market_config['name']}: {e}")
        return []
print("Data source functions defined")

Data source functions defined
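
`fetch_google_news_rss` builds its URL by replacing spaces with `+`, which leaves characters such as `&` unescaped if a keyword ever contains them. A minimal sketch of an alternative that lets the standard library do the encoding. The helper name `build_google_news_url` and its `days` and `language` parameters are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import urlencode


def build_google_news_url(keyword, country_code, days=30, language="en"):
    """Build a Google News RSS search URL with the keyword safely encoded."""
    params = {
        "q": f"{keyword} when:{days}d",
        "hl": language,
        "gl": country_code,
        "ceid": f"{country_code}:{language}",
    }
    # safe=":" keeps the colons in "when:30d" and "US:en" literal,
    # matching the shape of the hand-built URL in the fetch function.
    return "https://news.google.com/rss/search?" + urlencode(params, safe=":")


print(build_google_news_url("M&A deals", "US"))
# https://news.google.com/rss/search?q=M%26A+deals+when:30d&hl=en&gl=US&ceid=US:en
```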


In [22]:
# Debug Google News date parsing
test_feed = feedparser.parse("https://news.google.com/rss/search?q=SEC+when:30d&hl=en&gl=us&ceid=us:en")

if test_feed.entries:
    sample = test_feed.entries[0]
    print("Sample entry structure:")
    print(f"Title: {sample.get('title')}")
    print(f"Published field: {sample.get('published')}")
    print(f"Published_parsed: {sample.get('published_parsed')}")
    print(f"\nAll keys: {sample.keys()}")

Sample entry structure:
Title: SEC Football Players of the Week: Oct. 27 - Southeastern Conference
Published field: Mon, 27 Oct 2025 19:43:14 GMT
Published_parsed: time.struct_time(tm_year=2025, tm_mon=10, tm_mday=27, tm_hour=19, tm_min=43, tm_sec=14, tm_wday=0, tm_yday=300, tm_isdst=0)

All keys: dict_keys(['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'summary', 'summary_detail', 'source'])


**What I am doing:**
- Running both fetch functions on the US market only
- Counting articles from each source
- Displaying sample article to verify data structure

**Why I'm doing this:**
- Validate functions work before full fetch (fail fast)
- Check data quality and relevance
- Estimate total article volume before processing all markets

**Technical Note**: Basic Python testing, manual validation

In [23]:
print("Testing data fetch for United States\n")

us_config = config['markets']['us']

print("Fetching from NewsAPI...")
newsapi_articles = fetch_newsapi('us', us_config, days_back=30)
print(f"NewsAPI: {len(newsapi_articles)} articles")

print("\nFetching from Google News...")
google_articles = fetch_google_news_rss('us', us_config)
print(f"Google News: {len(google_articles)} articles")

all_us_articles = newsapi_articles + google_articles
print(f"\nTotal US articles: {len(all_us_articles)}")

if all_us_articles:
    sample = all_us_articles[0]
    print("\nSample article:")
    print(f"Title: {sample['title']}")
    print(f"Source: {sample.get('source', {}).get('name', 'Unknown')}")
    print(f"Type: {sample['source_type']}")

Testing data fetch for United States

Fetching from NewsAPI...
NewsAPI: 100 articles

Fetching from Google News...
Google News: 60 articles

Total US articles: 160

Sample article:
Title: Fed official warns inflation is still too high for more rate cuts
Source: Biztoc.com
Type: newsapi
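
Printing one sample confirms the structure of a single article. A quick structural check over the whole batch could complement that manual look. The sketch below assumes the fields used later in the notebook are the ones worth checking; the names `REQUIRED_KEYS`, `missing_fields`, and `summarize_problems` are hypothetical.

```python
REQUIRED_KEYS = ("title", "url", "publishedAt", "source_type", "market_id")


def missing_fields(article):
    """Return the required keys that are absent or empty in one article dict."""
    return [key for key in REQUIRED_KEYS if not article.get(key)]


def summarize_problems(articles):
    """Count how many articles are missing each required key."""
    counts = {}
    for article in articles:
        for key in missing_fields(article):
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling `summarize_problems(all_us_articles)` would return an empty dict when every article carries all five fields.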


In [24]:
print("Fetching news for all markets\n")

all_articles = []

for market_id, market_config in tqdm(config['markets'].items(), desc="Markets"):
    print(f"\nProcessing {market_config['name']}...")
    
    newsapi_articles = fetch_newsapi(market_id, market_config, days_back=30)
    print(f"  NewsAPI: {len(newsapi_articles)} articles")
    
    google_articles = fetch_google_news_rss(market_id, market_config)
    print(f"  Google News: {len(google_articles)} articles")
    
    market_articles = newsapi_articles + google_articles
    all_articles.extend(market_articles)
    
    print(f"  Total for {market_config['name']}: {len(market_articles)}")
    
    time.sleep(1)  # Rate limiting between markets


print(f"Total Articles Collected: {len(all_articles)}")


Fetching news for all markets



Markets:   0%|          | 0/5 [00:00<?, ?it/s]


Processing United States...
  NewsAPI: 100 articles
  Google News: 60 articles
  Total for United States: 160


Markets:  20%|██        | 1/5 [00:04<00:16,  4.03s/it]


Processing Japan...
  NewsAPI: 97 articles
  Google News: 47 articles
  Total for Japan: 144


Markets:  40%|████      | 2/5 [00:09<00:14,  4.68s/it]


Processing United Kingdom...
  NewsAPI: 81 articles
  Google News: 60 articles
  Total for United Kingdom: 141


Markets:  60%|██████    | 3/5 [00:14<00:10,  5.20s/it]


Processing Germany...
  NewsAPI: 40 articles
  Google News: 60 articles
  Total for Germany: 100


Markets:  80%|████████  | 4/5 [00:19<00:05,  5.07s/it]


Processing France...
  NewsAPI: 87 articles
  Google News: 53 articles
  Total for France: 140


Markets: 100%|██████████| 5/5 [00:24<00:00,  4.94s/it]

Total Articles Collected: 685





### **Time for Pandas**

**What I am doing:**
- Now that I've collected articles from all five markets, it's time to structure them
- I will use a pandas DataFrame. Since this is a small-scale project with 685 initial articles, I am not using SQL
- I will extract source names from the nested `source` dictionary

**Why I'm doing this:**
- DataFrame enables efficient data manipulation and analysis
- Proper datetime parsing required for time-based filtering and visualization
- Flattening nested structures simplifies downstream processing

**Technical Note:** pandas for data manipulation, datetime parsing

In [25]:
df = pd.DataFrame(all_articles)

print(f"Initial shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

df['publishedAt'] = pd.to_datetime(df['publishedAt'], errors='coerce')

df['source_name'] = df['source'].apply(
    lambda x: x.get('name', 'Unknown') if isinstance(x, dict) else 'Unknown'
)

print(f"\nDataFrame created: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Date range: {df['publishedAt'].min()} to {df['publishedAt'].max()}")
print(f"\nFirst few rows:")
df.head()

Initial shape: (685, 11)
Columns: ['source', 'author', 'title', 'description', 'url', 'urlToImage', 'publishedAt', 'content', 'source_type', 'market_id', 'market_name']

DataFrame created: 685 rows, 12 columns
Date range: 2025-10-03 01:04:26+00:00 to 2025-11-01 14:13:42+00:00

First few rows:


| | source | author | title | description | url | urlToImage | publishedAt | content | source_type | market_id | market_name | source_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'id': None, 'name': 'Biztoc.com'} | thestreet.com | Fed official warns inflation is still too high... | Federal Reserve officials are divided again ov... | https://biztoc.com/x/6a7fb82c9f09a310 | https://biztoc.com/cdn/6a7fb82c9f09a310_s.webp | 2025-11-01 14:13:42+00:00 | Federal Reserve officials are divided again ov... | newsapi | us | United States | Biztoc.com |
| 1 | {'id': None, 'name': 'GlobeNewswire'} | The Rosen Law Firm PA | ROSEN, RECOGNIZED INVESTOR COUNSEL, Encourages... | NEW YORK, Nov. 01, 2025 (GLOBE NEWSWIRE) -- WH... | https://www.globenewswire.com/news-release/202... | https://ml.globenewswire.com/Resource/Download... | 2025-11-01 14:09:00+00:00 | NEW YORK, Nov. 01, 2025 (GLOBE NEWSWIRE) -- \r... | newsapi | us | United States | GlobeNewswire |
| 2 | {'id': None, 'name': 'CBS Sports'} | Owen OBrien | Use DraftKings promo code to get $300 bonus be... | DraftKings offers $300 in bonus bets if your f... | https://www.cbssports.com/college-football/new... | https://sportshub.cbsistatic.com/i/r/2025/10/1... | 2025-11-01 14:06:04+00:00 | New users can capitalize on the latest DraftKi... | newsapi | us | United States | CBS Sports |
| 3 | {'id': None, 'name': 'Crooksandliars.com'} | John Amato | Newsmax Host Bravely Resurrects Reagan-Era Wel... | Newsmax host Rob Schmidt called SNAP an "ugly ... | https://crooksandliars.com/2025/10/newsmax-hos... | https://crooksandliars.com/files/mediaposters/... | 2025-11-01 14:03:03+00:00 | Newsmax host Rob Schmidt called SNAP an "ugly ... | newsapi | us | United States | Crooksandliars.com |
| 4 | {'id': 'usa-today', 'name': 'USA Today'} | Vols Wire | Tennessee football announces game captains aga... | Tennessee football announces game captains aga... | https://volswire.usatoday.com/story/sports/col... | https://s.yimg.com/ny/api/res/1.2/7F4A.V52DWyF... | 2025-11-01 14:02:07+00:00 | No. 18 Oklahoma (6-2, 2-2 SEC) will travel to ... | newsapi | us | United States | USA Today |


In [26]:
# I'm getting rid of the messy-looking source column; I already have a separate source_name column
df = df.drop(columns=['source'])

In [31]:
print("Data Quality Assessment\n")


print("\nMissing Values:")
print(df[['title', 'description', 'url', 'publishedAt', 'market_id']].isnull().sum())

print("\nArticles by Market:")
print(df['market_name'].value_counts())

print("\nArticles by Source Type:")
print(df['source_type'].value_counts())



Data Quality Assessment


Missing Values:
title            0
description      0
url              0
publishedAt    271
market_id        0
dtype: int64

Articles by Market:
market_name
United States     160
Japan             142
France            129
United Kingdom    122
Germany            99
Name: count, dtype: int64

Articles by Source Type:
source_type
newsapi        381
google_news    271
Name: count, dtype: int64
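
The 271 missing `publishedAt` values match the 271 `google_news` rows exactly, which suggests the Google News timestamps are the ones that failed to parse. A likely cause, stated here as an assumption: `fetch_google_news_rss` stores a naive `isoformat()` string with no UTC marker, while NewsAPI timestamps end in `Z`, so `pd.to_datetime(..., errors='coerce')` may turn the odd-format rows into `NaT`. One possible remedy is to emit the same UTC style at the source. feedparser documents its `*_parsed` fields as UTC, so attaching `timezone.utc` should be safe. The helper name `struct_time_to_utc_iso` is hypothetical.

```python
import time
from datetime import datetime, timezone


def struct_time_to_utc_iso(parsed):
    """Convert a feedparser *_parsed struct_time (UTC) to 'YYYY-MM-DDTHH:MM:SSZ'."""
    moment = datetime(*parsed[:6], tzinfo=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# The struct_time printed by the Google News debug cell earlier in the notebook.
sample = time.struct_time((2025, 10, 27, 19, 43, 14, 0, 300, 0))
print(struct_time_to_utc_iso(sample))  # 2025-10-27T19:43:14Z
```

Using this in place of `pub_date.isoformat()` would give both sources one timestamp format. An alternative on the pandas side, in pandas 2.0 or later, is to pass `format='ISO8601', utc=True` to `pd.to_datetime`.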


In [28]:
print("\n Articles by Source Name (Top 10):")
print(df['source_name'].value_counts().head(10))

print("\n Date Distribution:")
print(df['publishedAt'].dt.date.value_counts().sort_index().tail(10))


 Articles by Source Name (Top 10):
source_name
Biztoc.com             48
Yahoo Entertainment    36
The Times of India     31
GlobeNewswire          29
Reuters                25
Bloomberg.com          17
Bank of England        14
CNA                    14
USA Today              14
Financial Post         13
Name: count, dtype: int64

 Date Distribution:
publishedAt
2025-10-23      3
2025-10-24      1
2025-10-25      3
2025-10-26      3
2025-10-27      7
2025-10-28      7
2025-10-29     14
2025-10-30     96
2025-10-31     69
2025-11-01    111
Name: count, dtype: int64


In [30]:
print("Deduplication\n")

print(f"Articles before deduplication: {len(df)}")

duplicates = df.duplicated(subset=['title', 'url'], keep='first').sum()
print(f"Duplicates found: {duplicates}")

df = df.drop_duplicates(subset=['title', 'url'], keep='first')

print(f"Articles after deduplication: {len(df)}")
print(f"Removed: {duplicates} duplicate articles")

print("\nFinal distribution by market:")
print(df['market_name'].value_counts())

Deduplication

Articles before deduplication: 652
Duplicates found: 0
Articles after deduplication: 652
Removed: 0 duplicate articles

Final distribution by market:
market_name
United States     160
Japan             142
France            129
United Kingdom    122
Germany            99
Name: count, dtype: int64


**What I am doing:**
- Creating timestamp for file versioning
- Saving deduplicated DataFrame to JSON in data/raw directory
- Generating summary statistics file for reference

**Why I'm doing this:**
- Preserves raw collected data before LLM analysis
- Timestamped files allow tracking multiple collection runs
- JSON format enables easy loading in next notebook

**Technical Note:** pandas to_json, Path for file handling, json for metadata

In [32]:
print("Saving raw data\n")

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
output_dir = Path('../data/raw')
output_dir.mkdir(parents=True, exist_ok=True)

output_file = output_dir / f'news_raw_{timestamp}.json'
df.to_json(output_file, orient='records', date_format='iso', indent=2)

print(f"Saved to: {output_file}")
print(f"Total articles: {len(df)}")

summary = {
    'collection_timestamp': timestamp,
    'total_articles': len(df),
    'date_range': {
        'start': df['publishedAt'].min().isoformat() if pd.notna(df['publishedAt'].min()) else None,
        'end': df['publishedAt'].max().isoformat() if pd.notna(df['publishedAt'].max()) else None
    },
    'articles_by_market': df['market_name'].value_counts().to_dict(),
    'articles_by_source': df['source_type'].value_counts().to_dict(),
    'missing_dates': int(df['publishedAt'].isna().sum())
}

summary_file = output_dir / f'collection_summary_{timestamp}.json'
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nSummary saved to: {summary_file}")
print("\nCollection complete!")

Saving raw data

Saved to: ..\data\raw\news_raw_20251102_195719.json
Total articles: 652

Summary saved to: ..\data\raw\collection_summary_20251102_195719.json

Collection complete!
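
Because each run writes a new timestamped file, the next notebook needs a way to pick the most recent one. A minimal sketch, assuming the `news_raw_<YYYYMMDD_HHMMSS>.json` naming used above; the helper name `latest_raw_file` is hypothetical.

```python
from pathlib import Path


def latest_raw_file(directory, pattern="news_raw_*.json"):
    """Return the newest timestamped file in directory, or None if there is none."""
    # The YYYYMMDD_HHMMSS suffix sorts chronologically as plain text,
    # so a lexicographic sort on the path is enough.
    candidates = sorted(Path(directory).glob(pattern))
    return candidates[-1] if candidates else None
```

In the follow-up notebook this could be combined with `pd.read_json(latest_raw_file('../data/raw'))`, after checking that the result is not `None`.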
