# NewsExtractor Demo

This notebook shows how to use the `NewsExtractor` library to extract and analyze news articles.

### 1. Installation

First, let's install the necessary libraries.

In [None]:
!pip install -r requirements.txt
!python -m spacy download en_core_web_sm

### 2. Basic Article Extraction

Here's how to extract a single article from a URL.

In [23]:
# Clean extraction with only reliable properties
import sys
import os

# Add the project root to Python path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

from core.news_extractor import NewsExtractor

# Create extractor with NLP enabled for AI-powered features
extractor = NewsExtractor(enable_nlp=True)

# Extract article from URL
article = extractor.extract_from_url("https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/")

print("📰 CORE ARTICLE DATA:")
print(f"  Article ID: {article.article_id}")
print(f"  Title: {article.title}")
print(f"  Author: {article.author}")
print(f"  URL: {article.url}")
print(f"  Published Date: {article.published_date}")
print(f"  Word Count: {article.word_count}")
print(f"  Reading Time: {article.read_time} minutes")
print(f"  Top Image: {article.top_image}")

print(f"\n🔍 CATEGORIZATION (Reliable):")
print(f"  Source: {article.source}")
print(f"\n👥 NAMED ENTITIES FOUND:")
for entity_type, entities in article.entities.items():
    if entities: 
        print(f"  {entity_type}: {', '.join(entities)}")

print(f"\n🧠 ANALYSIS (AI-Powered):")
print(f"  Language: {article.language}")
print(f"  Sentiment: {article.sentiment}")
print(f"  Summary: {article.nlp_summary}")

print(f"\n📊 ADDITIONAL METADATA:")
print(f"  Publication Name: {article.publication_name}")
print(f"  Meta Description: {article.meta_description}")
print(f"  Canonical Link: {article.canonical_link}")
print(f"  Image URLs: {len(article.image_urls)} images")
print(f"  Video URLs: {len(article.video_urls)} videos")
print(f"  Links: {len(article.links)} external links")
print(f"  Is Paywalled: {article.is_paywalled}")

📰 CORE ARTICLE DATA:
  Article ID: article_64ac505ad950
  Title: Ready-made stem cell therapies for pets could be coming
  Author: Connie Loizos
  URL: https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/
  Published Date: 2025-07-04 23:36:00+00:00
  Word Count: 265
  Reading Time: 1 minutes
  Top Image: https://techcrunch.com/wp-content/uploads/2024/10/GettyImages-1357481031.jpg?resize=1200,900

🔍 CATEGORIZATION (Reliable):
  Source: techcrunch.com

👥 NAMED ENTITIES FOUND:
  PERSON: Gallant, Aaron Hirschhorn, Linda Black
  ORG: FDA, Gallant’s, Feline Chronic Gingivostomatitis, FCGS, Digitalis Ventures
  GPE: San Diego
  MONEY: $18 million, at least $44 million
  DATE: Earlier this week, decades, Seven-year-old, early 2026, up to two years

🧠 ANALYSIS (AI-Powered):
  Language: en
  Sentiment: {'compound': 0.9762, 'positive': 0.11, 'negative': 0.014, 'neutral': 0.876, 'polarity': 0.05665584415584417, 'subjectivity': 0.3983766233766234, 'label': 'pos
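
The sentiment dictionary combines keys that resemble VADER output (`compound`, `positive`, `negative`, `neutral`) with TextBlob-style `polarity` and `subjectivity`. As a rough sketch of how a `label` could be derived from `compound`, the helper below uses the ±0.05 cut-offs commonly recommended for VADER. Whether `NewsExtractor` applies these thresholds is an assumption, and `label_from_compound` is a hypothetical name, not part of the library.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label.

    The 0.05 threshold is an assumption borrowed from common VADER usage.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores that appear in this notebook's outputs
print(label_from_compound(0.9762))  # positive
print(label_from_compound(0.0))     # neutral
```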

### 3. Advanced NLP Processing

Enable NLP to get sentiment, named entities, a summary, and the detected language.

In [26]:
from core.news_extractor import NewsExtractor

# Enable NLP processing and set the summarization method
# Options are: 'auto', 'sumy', 'transformers', 'simple'
extractor = NewsExtractor(enable_nlp=True, summarization_method='simple')
article = extractor.extract_from_url("https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming")

# Access NLP results (reliable metadata only)
print(f"😊 Sentiment: {article.sentiment['label']} ({article.sentiment['compound']:.2f})")
print(f"👤 Named Entities by Type:")
for entity_type, entities in article.entities.items():
    if entities:  # Only show non-empty entity types
        print(f"   {entity_type}: {', '.join(entities)}")
print(f"📝 Summary: {article.nlp_summary}")
print(f"🌐 Language: {article.language}")

😊 Sentiment: positive (0.98)
👤 Named Entities by Type:
   PERSON: Gallant, Aaron Hirschhorn, Linda Black
   ORG: FDA, Gallant’s, Feline Chronic Gingivostomatitis, FCGS, Digitalis Ventures
   GPE: San Diego
   MONEY: $18 million, at least $44 million
   DATE: Earlier this week, decades, Seven-year-old, early 2026, up to two years
📝 Summary: Earlier this week, San Diego startup Gallant announced $18 million in funding to bring the first FDA-approved ready-to-use stem cell therapy to veterinary medicine. If it passes regulatory muster, it could create a whole new way to treat our fur babies. It’s still an experimental field, even though people have been researching stem cells for humans for decades.
🌐 Language: en
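
For intuition about what a "simple" summarizer can look like, one common baseline is to return the first few sentences of the text (often called lead-N). The sketch below illustrates that baseline only; whether `summarization_method='simple'` works this way is an assumption, and `lead_summary` is a hypothetical helper.

```python
import re


def lead_summary(text: str, max_sentences: int = 3) -> str:
    """Return the first `max_sentences` sentences of `text`."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    # Abbreviations such as "U.S. " would be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


sample = "First sentence. Second sentence! Third sentence? Fourth sentence."
print(lead_summary(sample, max_sentences=2))
```

A production summarizer would normally use a proper sentence tokenizer (for example the one in spaCy, which this notebook already installs) instead of a regular expression.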


### 4. Language-Specific Processing

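This example runs `NewsExtractor` with Bengali (`language='bn'`) as the target language. The source article is detected as English and then translated; note that the named entities extracted from the translated text in the output below are noticeably less reliable than those from the English original.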

In [35]:
from core.news_extractor import NewsExtractor

# Create extractor with Bengali language setting
extractor_bengali = NewsExtractor(
    enable_nlp=True, 
    language='bn',  # Bengali language code
    summarization_method='simple'
)

# Extract article from URL with Bengali language processing
test_article = extractor_bengali.extract_from_url("https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/")

print("🇧🇩 BENGALI LANGUAGE PROCESSING:")
print(f"Target Language Setting: bn (Bengali)")

print("\n📰 CORE ARTICLE DATA:")
print(f"  Article ID: {test_article.article_id}")
print(f"  Title: {test_article.title}")
print(f"  Author: {test_article.author}")
print(f"  URL: {test_article.url}")
print(f"  Published Date: {test_article.published_date}")
print(f"  Word Count: {test_article.word_count}")
print(f"  Reading Time: {test_article.read_time} minutes")
print(f"  Top Image: {test_article.top_image}")

print(f"\n🔍 CATEGORIZATION (Reliable):")
print(f"  Source: {test_article.source}")
print(f"\n👥 NAMED ENTITIES FOUND:")
for entity_type, entities in test_article.entities.items():
    if entities: 
        print(f"  {entity_type}: {', '.join(entities)}")

print(f"\n🧠 ANALYSIS (AI-Powered):")
print(f"  Detected Language: {test_article.language}")
print(f"  Translation Status: {'Translated' if test_article.translated else 'Original'}")
print(f"  Sentiment: {test_article.sentiment}")
print(f"  Summary: {test_article.nlp_summary}")

print(f"\n📊 ADDITIONAL METADATA:")
print(f"  Publication Name: {test_article.publication_name}")
print(f"  Meta Description: {test_article.meta_description}")
print(f"  Canonical Link: {test_article.canonical_link}")
print(f"  Image URLs: {len(test_article.image_urls)} images")
print(f"  Video URLs: {len(test_article.video_urls)} videos")
print(f"  Links: {len(test_article.links)} external links")
print(f"  Is Paywalled: {test_article.is_paywalled}")

🇧🇩 BENGALI LANGUAGE PROCESSING:
Target Language Setting: bn (Bengali)

📰 CORE ARTICLE DATA:
  Article ID: article_d0effad4258e
  Title: পোষা প্রাণীদের জন্য রেডিমেড স্টেম সেল থেরাপি আসতে পারে
  Author: Connie Loizos
  URL: https://techcrunch.com/2025/07/04/ready-made-stem-cell-therapies-for-pets-could-be-coming/
  Published Date: 2025-07-04 23:36:00+00:00
  Word Count: 271
  Reading Time: 1 minutes
  Top Image: https://techcrunch.com/wp-content/uploads/2024/10/GettyImages-1357481031.jpg?resize=1200,900

🔍 CATEGORIZATION (Reliable):
  Source: techcrunch.com

👥 NAMED ENTITIES FOUND:
  PERSON: শুরুতে, জন্য প্রস্তুত স্টেম, পারে, সুবিধা দুই, স্থায়ী
  ORG: করতে, বয়সী গ্যালান্টের প্রথম, মধ্যে, উত্সাহজনক প্রাথমিক ফলাফল দেখিয়েছে, বাতজনিত
  GPE: প্রথম
  DATE: 2021
  PRODUCT: এখনও একটি পরীক্ষামূলক ক্ষেত্র, গ্যালান্ট বলেছেন, ২০২26, দেখিয়েছিল

🧠 ANALYSIS (AI-Powered):
  Detected Language: en
  Translation Status: Translated
  Sentiment: {'compound': 0.0, 'positive': 0.0, 'negative': 0.0, 'neutra
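
A quick sanity check that a field such as the title was actually translated is to look at which script its letters use. The sketch below measures the share of letters that fall in the Bengali Unicode block (U+0980 to U+09FF). `bengali_ratio` is a hypothetical helper, not part of `NewsExtractor`.

```python
def bengali_ratio(text: str) -> float:
    """Fraction of letters (per str.isalpha) that lie in the Bengali block.

    Combining vowel signs and the virama are not counted, because
    str.isalpha() is False for them.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = [ch for ch in letters if "\u0980" <= ch <= "\u09ff"]
    return len(bengali) / len(letters)


# Titles taken from the English and Bengali runs in this notebook
print(bengali_ratio("Ready-made stem cell therapies for pets could be coming"))
print(bengali_ratio("পোষা প্রাণীদের জন্য রেডিমেড স্টেম সেল থেরাপি আসতে পারে"))
```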

### 5. Extracting from an RSS Feed

You can also extract all articles from an RSS feed.

In [32]:
from core.news_extractor import NewsExtractor

extractor = NewsExtractor()
articles = extractor.extract_from_rss_feed("https://techcrunch.com/feed/")

# Extract full content for the first two articles in the feed
for article in articles[:2]:
    full_article = extractor.extract_from_url(article.url)
    print(f"📰 {full_article.title}")
    print(f"👥 Entities: {full_article.entities}")
    print(f"📝 Summary: {full_article.nlp_summary}")
    print("-" * 50)

📰 Ready-made stem cell therapies for pets could be coming
👥 Entities: {'PERSON': ['Gallant', 'Aaron Hirschhorn', 'Linda Black'], 'ORG': ['FDA', 'Gallant’s', 'Feline Chronic Gingivostomatitis', 'FCGS', 'Digitalis Ventures'], 'GPE': ['San Diego'], 'MONEY': ['$18 million', 'at least $44 million'], 'DATE': ['Earlier this week', 'decades', 'Seven-year-old', 'early 2026', 'up to two years'], 'EVENT': [], 'PRODUCT': []}
📝 Summary: Earlier this week, San Diego startup Gallant announced $18 million in funding to bring the first FDA-approved ready-to-use stem cell therapy to veterinary medicine. Most stem cell treatments today require harvesting cells from the patient or donors with matching tissue, whereas Gallant’s therapy uses ready-to-use cells from donor animals, even if they are a different species. The funding round was led by existing backer Digitalis Ventures, with participation from NovaQuest Capital Management, which previously invested in the first FDA-approved human stem cell therap
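
Conceptually, reading an RSS feed comes down to parsing XML and collecting the `<link>` of each `<item>`, which can then be passed to `extract_from_url`. The sketch below shows that step with the standard library on an inline sample feed; it is not how `extract_from_rss_feed` is necessarily implemented, and `rss_items` and `SAMPLE_RSS` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# A tiny RSS 2.0 document used as stand-in data (not a real feed)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def rss_items(xml_text: str) -> list[dict[str, str]]:
    """Return the title and link of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
        })
    return items


for entry in rss_items(SAMPLE_RSS):
    print(entry["title"], "->", entry["link"])
```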

### 6. Getting Trending News

You can get trending news topics. **Note:** This requires a SerpAPI key.

In [None]:
import os
from core.trending import NewsSearcher

# Get API key from environment variable
SERPAPI_KEY = os.getenv('SERPAPI_KEY', 'your-serpapi-key-here')

if SERPAPI_KEY == 'your-serpapi-key-here':
    print("⚠️  Please set your SERPAPI_KEY environment variable")
    print("   Example: set SERPAPI_KEY=your-actual-api-key")
    print("   Or create a .env file with SERPAPI_KEY=your-actual-api-key")
else:
    searcher = NewsSearcher(serpapi_key=SERPAPI_KEY)
    trending_articles = searcher.get_trending_news(limit=5)

    for article in trending_articles:
        print(f"🔥 {article.title}")

Attempt 1 failed for https://www.deccanherald.com/india/karnataka/bengaluru/trending-b-luru-youth-join-fake-shaadi-party-scene-3611580: 403 Client Error: Forbidden for url: https://www.deccanherald.com/india/karnataka/bengaluru/trending-b-luru-youth-join-fake-shaadi-party-scene-3611580
Attempt 2 failed for https://www.deccanherald.com/india/karnataka/bengaluru/trending-b-luru-youth-join-fake-shaadi-party-scene-3611580: 403 Client Error: Forbidden for url: https://www.deccanherald.com/india/karnataka/bengaluru/trending-b-luru-youth-join-fake-shaadi-party-scene-3611580
Attempt 2 failed for https://www.deccanherald.com/india/karnataka/bengaluru/trending-b-lu

🔥 ‘Rs 20 lakh stolen in 60 seconds…’: Thieves hack parked car in under a minute in Delhi; owner shares CCTV footage of heist
🔥 Who is Soham Parekh? The ‘Indian techie’ going viral on X and giving US start-up founders a headache
🔥 These 7 stocks showed RSI Trending Up on July 3
🔥 Ferrari worth Rs 7.5 crore seized in Bengaluru for road tax evasion; owner pays Rs 1.41 crore to RTO
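
The `Attempt N failed` lines in the log suggest that each URL is retried a few times before the article is skipped. A generic retry helper with exponential backoff might look like the sketch below. It is not the library's implementation; `with_retries` and `flaky` are hypothetical names. Note that an HTTP 403 usually means the site is refusing the client, so retrying the same request rarely helps in that case.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `fetch` up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


# Stand-in for a network call that fails twice and then succeeds
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(with_retries(flaky, attempts=3, base_delay=0.01))
```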


### 7. Searching for Specific News

You can also search for news articles by keyword. **Note:** This requires a SerpAPI key.

In [None]:
import os
from core.trending import NewsSearcher

# Initialize the searcher with your SerpAPI key from environment
SERPAPI_KEY = os.getenv('SERPAPI_KEY', 'your-serpapi-key-here')

if SERPAPI_KEY == 'your-serpapi-key-here':
    print("⚠️  Please set your SERPAPI_KEY environment variable")
    print("   Example: set SERPAPI_KEY=your-actual-api-key")
    print("   Or create a .env file with SERPAPI_KEY=your-actual-api-key")
else:
    searcher = NewsSearcher(serpapi_key=SERPAPI_KEY)
    articles = searcher.search_news_by_keyword("World War 3", limit=3)

    for i, article in enumerate(articles, 1):
        print(f"\n{i}. 📰 {article.title}")
        print(f"   🌐 Source: {article.source}")
        print(f"   📅 Published: {article.published_date}")
        print(f"   🔗 URL: {article.url}")
        
        # Show entities if available
        if article.entities:
            entity_summary = []
            for entity_type, entities in article.entities.items():
                if entities:
                    entity_summary.append(f"{entity_type}: {len(entities)}")
            if entity_summary:
                print(f"   👥 Entities: {', '.join(entity_summary)}")
        
        # Show sentiment if available
        if article.sentiment:
            sentiment_label = article.sentiment.get('label', 'N/A')
            print(f"   😊 Sentiment: {sentiment_label}")

Attempt 1 failed for https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232: 403 Client Error: Forbidden for url: https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232
Attempt 2 failed for https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232: 403 Client Error: Forbidden for url: https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232
Failed to extract single article from https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232: Failed to fetch URL after 3 attempts: 403 Client Error: Forbidden for url: https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232
Failed to extract from https://www.ndtv.com/opinion/why-a-world-war-iii-is-very-unlikely-8825232: Failed to extract a


1. 📰 Iran Israel war: How close is World War 3?
   🌐 Source: m.economictimes.com
   📅 Published: 2025-07-05 09:08:42.839073
   🔗 URL: https://m.economictimes.com/news/international/us/how-close-is-world-war-3-amidst-israel-iran-war/articleshow/121912720.cms
   👥 Entities: PERSON: 3, ORG: 2, GPE: 5, DATE: 1, EVENT: 3
   😊 Sentiment: negative

2. 📰 How close are we to World War Three?
   🌐 Source: theweek.com
   📅 Published: 2018-04-17 08:48:49+00:00
   🔗 URL: https://theweek.com/92967/are-we-heading-towards-world-war-3
   👥 Entities: PERSON: 5, ORG: 5, GPE: 5, DATE: 5, EVENT: 3
   😊 Sentiment: negative
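
The two results above print one timestamp without a UTC offset and one with `+00:00`. If `published_date` is a `datetime` object (an assumption), comparing or sorting such mixed values raises `TypeError`. The sketch below normalizes both to UTC first; treating naive values as UTC is also an assumption, and `to_utc` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc(dt: datetime) -> datetime:
    """Return an aware UTC datetime; naive values are assumed to already be UTC."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# The two timestamps printed in the search results
naive = datetime.fromisoformat("2025-07-05 09:08:42.839073")
aware = datetime.fromisoformat("2018-04-17 08:48:49+00:00")

# sorted([naive, aware]) would raise TypeError; normalizing first avoids that
newest_first = sorted([to_utc(naive), to_utc(aware)], reverse=True)
print(newest_first[0].isoformat())
```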
