# README for News Article Collection and Categorization Application

## Overview

This application is designed to collect news articles from various RSS feeds, store them in a database, categorize them based on predefined categories, and enhance the categorization for articles that do not fit neatly into existing categories. The primary technologies used include Python, SQLAlchemy for database management, and spaCy for natural language processing.

## Project Structure

1. **Database Creation and Data Extraction**:
   - **Logic**: This part of the code establishes the database schema and creates tables necessary for storing the articles. It also includes the logic for extracting articles from specified RSS feeds.
   - **Design Choice**: Using SQLAlchemy allows for easy management of database interactions and supports multiple database backends. The code is modular to ensure easy modifications in the future.

2. **Categorization of Data Using spaCy**:
   - **Logic**: The collected articles are categorized using predefined keywords associated with specific categories. The code checks the titles, summaries, and provided categories of the articles for matches against these keywords.
   - **Design Choice**: spaCy is utilized for its efficient natural language processing capabilities, allowing for robust keyword matching. The use of a dictionary for category keywords makes it easy to update or expand categories as needed.

3. **Enhanced Categorization for "Others"**:
   - **Logic**: Articles that do not match any of the predefined categories are categorized into a general "Others" category. This section of the code further refines this category by breaking it down into subcategories such as Economy, Technology, Health, Environment, and Social Issues based on additional keywords.
   - **Design Choice**: This enhancement improves the granularity of article categorization and helps in better reporting and analysis of articles that would otherwise be lost in a broad "Others" category.

4. **Printing Categorized Articles**:
   - **Logic**: The final section of the code prints out the details of each categorized article, including the title, content, link, published date, media URL, and assigned category. It also summarizes the total count of articles in each category.
   - **Design Choice**: This allows for easy verification of the categorization process and ensures that users can see the results in a structured and readable format.

## Dependencies

- Python 3.x
- SQLAlchemy
- spaCy
- A database management system (e.g., SQLite, PostgreSQL)
- An RSS feed source

## Future Enhancements

- Implement a user interface for easier interaction with the application.
- Add functionalities for users to input custom keywords for categorization.
- Integrate machine learning techniques for more dynamic categorization based on article content.

## Conclusion

This application provides a foundational system for collecting and categorizing news articles, leveraging modern tools and techniques in database management and natural language processing. The modular design allows for future enhancements and adaptations to meet evolving needs.


In [2]:
%pip install feedparser sqlalchemy tqdm psycopg2-binary spaCy collections

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement collections (from versions: none)
ERROR: No matching distribution found for collections


# **News Article Collection and Categorization Application**

## **1. Database Creation and Data Extraction**

This part of the code is responsible for setting up the database and extracting news articles from various RSS feeds. It involves defining the database schema, creating tables, and populating them with articles collected from specified sources.


In [3]:
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from datetime import datetime
import json
import feedparser
from tqdm import tqdm  # Import tqdm for progress bar

Base = declarative_base()

# Define the Articles table
class Article(Base):
    __tablename__ = 'articles'
    
    article_id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    link = Column(String, nullable=False)
    pub_date = Column(DateTime, nullable=True)  # Allow null if pub_date is missing
    summary = Column(String, nullable=True)
    categories = Column(String, nullable=True)  # Store as JSON
    media_url = Column(String, nullable=True)  # Add media_url directly

# Create the database connection
def get_engine():
    # Replace the below URL with your actual database URL
    DATABASE_URL = "postgresql://postgres:root@localhost/Data_1"
    engine = create_engine(DATABASE_URL)
    return engine

# Create the tables
engine = get_engine()
Base.metadata.create_all(engine)

# Define RSS feeds to parse
rss_urls = [
    "http://rss.cnn.com/rss/cnn_topstories.rss",
    "http://qz.com/feed",
    "http://feeds.foxnews.com/foxnews/politics",
    "http://feeds.reuters.com/reuters/businessNews",
    "http://feeds.feedburner.com/NewshourWorld",
    "https://feeds.bbci.co.uk/news/world/asia/india/rss.xml"
]

# Set up the database session
Session = sessionmaker(bind=engine)
session = Session()

# Function to parse RSS feeds and insert data into the database
def parse_rss(url):
    feed = feedparser.parse(url)
    
    # Use tqdm to show a progress bar for the entries
    for entry in tqdm(feed.entries, desc=f'Processing {url}', unit='article'):
        title = entry.title
        link = entry.link
        pub_date = entry.published_parsed if 'published_parsed' in entry else entry.get('updated_parsed', None)
        if pub_date:
            pub_date = datetime(*pub_date[:6])
        else:
            pub_date = datetime.now()  # Set to current timestamp if not available

        description = entry.get('description', 'No description')
        categories = [tag.term for tag in entry.get('tags', [])] or [entry.get('category', 'General')]
        categories = json.dumps(categories)  # Store as JSON string

        media_url = entry.get('media_thumbnail', [None])[0].get('url') if 'media_thumbnail' in entry else None

        # Check if the article already exists
        existing_article = session.query(Article).filter_by(title=title).first()

        if existing_article is None:
            # Insert new article
            new_article = Article(
                title=title,
                link=link,
                pub_date=pub_date,
                summary=description,
                categories=categories,
                media_url=media_url
            )
            session.add(new_article)
            
            # Commit to save changes
            try:
                session.commit()
            except Exception as e:
                session.rollback()
                print(f"Failed to insert article {title}: {e}")
        else:
            # Optionally, you can keep track of existing articles
            pass

# Loop through all RSS feeds with a progress bar
for url in rss_urls:
    print(f'Starting to parse: {url}')
    parse_rss(url)

# Close the session after completion
session.close()


  Base = declarative_base()


Starting to parse: http://rss.cnn.com/rss/cnn_topstories.rss


Processing http://rss.cnn.com/rss/cnn_topstories.rss: 100%|██████████| 69/69 [00:00<00:00, 1002.01article/s]


Starting to parse: http://qz.com/feed


Processing http://qz.com/feed: 100%|██████████| 50/50 [00:00<00:00, 226.01article/s]


Starting to parse: http://feeds.foxnews.com/foxnews/politics


Processing http://feeds.foxnews.com/foxnews/politics: 100%|██████████| 25/25 [00:00<00:00, 396.76article/s]


Starting to parse: http://feeds.reuters.com/reuters/businessNews


Processing http://feeds.reuters.com/reuters/businessNews: 0article [00:00, ?article/s]


Starting to parse: http://feeds.feedburner.com/NewshourWorld


Processing http://feeds.feedburner.com/NewshourWorld: 100%|██████████| 15/15 [00:00<00:00, 356.28article/s]


Starting to parse: https://feeds.bbci.co.uk/news/world/asia/india/rss.xml


Processing https://feeds.bbci.co.uk/news/world/asia/india/rss.xml: 100%|██████████| 35/35 [00:00<00:00, 1127.30article/s]


## **2. Categorization of Data Using spaCy**
In this section, the code categorizes the collected articles based on predefined categories using spaCy. It checks the titles, summaries, and category fields of the articles for keywords that correspond to each category and assigns them accordingly. If no match is found, the articles are categorized as "Others."

In [9]:
import spacy
import json
from collections import defaultdict
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the category keywords
with open('category.json', 'r') as f:
    category_keywords = json.load(f)

# Function to check if a string contains a link (URL)
def contains_link(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return bool(url_pattern.search(text))

# Function to categorize articles using spaCy NLP
def categorize_article_spacy(title, summary=None, categories=None):
    category = 'Others'
    
    # Exclude if contains links
    if contains_link(title):
        title = None
    if summary and contains_link(summary):
        summary = None
    if categories and contains_link(categories):
        categories = None

    # Process text with spaCy if valid
    if title:
        doc_title = nlp(title)
    else:
        doc_title = None

    if summary:
        doc_summary = nlp(summary)
    else:
        doc_summary = None

    # Search for keywords in title, summary, or categories using spaCy's tokenization
    for cat, keywords in category_keywords.items():
        for keyword in keywords:
            if doc_title and any(token.text.lower() == keyword.lower() for token in doc_title):
                category = cat
                break
            if doc_summary and any(token.text.lower() == keyword.lower() for token in doc_summary):
                category = cat
                break
        if category != 'Others':
            break

    return category

# Count the valid categories and reject invalid strings
category_counts = defaultdict(int)
all_articles = session.query(Article).all()

# List to hold categorized article information
categorized_articles = []

for article in all_articles:
    article_title = article.title
    article_summary = getattr(article, 'summary', None)
    article_categories = getattr(article, 'categories', None)
    
    # Use spaCy-based categorization
    category = categorize_article_spacy(article_title, article_summary, article_categories)
    
    if category in category_keywords.keys():
        category_counts[category] += 1
    else:
        category = 'Others'
        category_counts['Others'] += 1

    categorized_articles.append({
        'title': article.title,
        'content': article_summary if article_summary else 'Unknown',
        'category': category,
        'link': getattr(article, 'link', 'Unknown'),
        'published_date': getattr(article, 'pub_date', 'Unknown'),
        'media_url': getattr(article, 'media_url', 'Unknown')
    })

# Print categorized article details and category summary
for article_info in categorized_articles:
    print(f"Title: {article_info['title']}")
    print(f"Published Date: {article_info['published_date']}")
    print(f"Category: {article_info['category']}")
    print("-" * 50)

print("\nTotal number of categories:", len(category_counts))
for category, count in category_counts.items():
    print(f"{category}: {count} articles")


Title: Some on-air claims about Dominion Voting Systems were false, Fox News acknowledges in statement after deal is announced
Published Date: 2023-04-19 12:44:51
Category: political
--------------------------------------------------
Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Published Date: 2024-10-09 10:50:14.301041
Category: political
--------------------------------------------------
Title: Here are the 20 specific Fox broadcasts and tweets Dominion says were defamatory
Published Date: 2023-04-17 16:01:11
Category: political
--------------------------------------------------
Title: Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'
Published Date: 2023-04-19 08:28:17
Category: political
--------------------------------------------------
Title: 'Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement
Published Date: 2023-04-18 21:17:44
Category: politi

## **3. Enhanced Categorization for "Others"**
This part enhances the categorization logic by breaking down the "Others" category into more specific subcategories such as Economy, Technology, Health, Environment, and Social Issues. It utilizes spaCy to perform keyword matching and determine the appropriate subcategory for articles that don't fit the predefined categories.

In [7]:
import spacy
import json
from collections import defaultdict
import re

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Load the existing category keywords
with open('category.json', 'r') as f:
    category_keywords = json.load(f)

# Add new subcategories to handle "Others" category
extra_categories = {
    "Economy": ["economy", "business", "market", "stocks", "finance", "investment"],
    "Technology": ["technology", "software", "AI", "machine learning", "gadgets", "innovation", "IT"],
    "Health": ["health", "medicine", "wellness", "disease", "public health", "mental health"],
    "Environment": ["climate change", "pollution", "conservation", "environment", "biodiversity"],
    "Social Issues": ["human rights", "gender", "inequality", "racism", "social justice", "discrimination"]
}

# Combine the original categories and extra categories
all_categories = {**category_keywords, **extra_categories}

# Function to check if a string contains a link (URL)
def contains_link(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return bool(url_pattern.search(text))

# Function to categorize articles with expanded categories
def categorize_article_spacy_expanded(title, summary=None, categories=None):
    category = 'Others'
    
    if contains_link(title):
        title = None
    if summary and contains_link(summary):
        summary = None
    if categories and contains_link(categories):
        categories = None

    # Process the text using spaCy
    doc_title = nlp(title) if title else None
    doc_summary = nlp(summary) if summary else None

    # Check for matches in original categories first
    for cat, keywords in category_keywords.items():
        for keyword in keywords:
            if doc_title and any(token.text.lower() == keyword.lower() for token in doc_title):
                return cat
            if doc_summary and any(token.text.lower() == keyword.lower() for token in doc_summary):
                return cat

    # If no match, try to assign to the new subcategories under "Others"
    for cat, keywords in extra_categories.items():
        for keyword in keywords:
            if doc_title and any(token.text.lower() == keyword.lower() for token in doc_title):
                return cat
            if doc_summary and any(token.text.lower() == keyword.lower() for token in doc_summary):
                return cat

    # If nothing matches, return 'Others'
    return category

# Count the valid categories
category_counts = defaultdict(int)
all_articles = session.query(Article).all()

# List to hold categorized articles
categorized_articles = []

for article in all_articles:
    article_title = article.title
    article_summary = getattr(article, 'summary', None)
    article_categories = getattr(article, 'categories', None)

    # Use the expanded spaCy-based categorization
    category = categorize_article_spacy_expanded(article_title, article_summary, article_categories)
    
    if category in all_categories.keys():
        category_counts[category] += 1
    else:
        category = 'Others'
        category_counts['Others'] += 1

    categorized_articles.append({
        'title': article.title,
        'content': article_summary if article_summary else 'Unknown',
        'category': category,
        'link': getattr(article, 'link', 'Unknown'),
        'published_date': getattr(article, 'pub_date', 'Unknown'),
        'media_url': getattr(article, 'media_url', 'Unknown')
    })

# Print categorized articles
print("Categorized Articles:")
for article_info in categorized_articles:
    print(f"Title: {article_info['title']}")
    # print(f"Content: {article_info['content']}")
    # print(f"Link: {article_info['link']}")
    print(f"Published Date: {article_info['published_date']}")
    # print(f"Media URL: {article_info['media_url']}")
    print(f"Category: {article_info['category']}")
    print("-" * 50)

# Print category summary
print(f"\nTotal number of categories: {len(category_counts)}")
print("Category breakdown:")

for category, count in category_counts.items():
    print(f"{category}: {count} articles")


Categorized Articles:
Title: Some on-air claims about Dominion Voting Systems were false, Fox News acknowledges in statement after deal is announced
Published Date: 2023-04-19 12:44:51
Category: political
--------------------------------------------------
Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Published Date: 2024-10-09 10:50:14.301041
Category: political
--------------------------------------------------
Title: Here are the 20 specific Fox broadcasts and tweets Dominion says were defamatory
Published Date: 2023-04-17 16:01:11
Category: political
--------------------------------------------------
Title: Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'
Published Date: 2023-04-19 08:28:17
Category: political
--------------------------------------------------
Title: 'Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement
Published Date: 2023-04-18 21:

## **4. Printing Categorized Articles**
This section of the code is responsible for printing the details of each categorized article. It outputs the title, content (or a default message if unavailable), link, published date, media URL, and the assigned category for each article. Additionally, it provides a summary of the total number of articles in each category.

In [11]:
print("Categorized Articles:")
for article_info in categorized_articles:
    print(f"Title: {article_info['title']}")
    print(f"Content: {article_info['content']}")
    print(f"Link: {article_info['link']}")
    print(f"Published Date: {article_info['published_date']}")
    print(f"Media URL: {article_info['media_url']}")
    print(f"Category: {article_info['category']}")
    print("-" * 50)

Categorized Articles:
Title: Some on-air claims about Dominion Voting Systems were false, Fox News acknowledges in statement after deal is announced
Content: No description
Link: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/index.html
Published Date: 2023-04-19 12:44:51
Media URL: None
Category: political
--------------------------------------------------
Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Content: No description
Link: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/h_8d51e3ae2714edaa0dace837305d03b8
Published Date: 2024-10-09 10:50:14.301041
Media URL: None
Category: political
--------------------------------------------------
Title: Here are the 20 specific Fox broadcasts and tweets Dominion says were defamatory
Content: • Fox-Dominion trial delay 'is not unusual,' judge says
• Fox News' defamation battle isn't stopping Trump's election lies
Link: https://www.cn

# Rough Works for trying out things

In [None]:
import json
from collections import defaultdict
import re  # For checking if summary or title contains links

# Load the category keywords
with open('category.json', 'r') as f:
    category_keywords = json.load(f)

# Function to check if a string contains a link (URL)
def contains_link(text):
    # Regex pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return bool(url_pattern.search(text))

# Function to categorize articles
def categorize_article(title, summary=None, categories=None):
    # Initialize the article's category as 'Others'
    category = 'Others'
    
    # Exclude title and summary from consideration if they contain links
    if contains_link(title):
        title = None
    if summary and contains_link(summary):
        summary = None
    if categories and contains_link(categories):
        categories = None

    # Check if any keyword matches in the title, summary, or categories
    for cat, keywords in category_keywords.items():
        if any(
            keyword.lower() in (title or '') or
            keyword.lower() in (summary or '') or
            keyword.lower() in (categories or '')
            for keyword in keywords
        ):
            category = cat
            break

    # Return the determined category
    return category

# Count the valid categories and reject invalid strings
category_counts = defaultdict(int)
all_articles = session.query(Article).all()

# List to hold categorized article information
categorized_articles = []

for article in all_articles:
    # Get the article attributes
    article_title = article.title
    article_summary = getattr(article, 'summary', None)
    article_categories = getattr(article, 'categories', None)
    
    # Categorize the article based on title, summary, and categories (if valid)
    category = categorize_article(article_title, article_summary, article_categories)
    
    # Validate category (reject invalid categories)
    if category in category_keywords.keys():
        category_counts[category] += 1
    else:
        # Assign to 'Others' if invalid
        category = 'Others'
        category_counts['Others'] += 1

    # Store article details after categorization
    categorized_articles.append({
        'title': article.title,
        'content': article_summary if article_summary else 'Unknown',
        'category': category,
        'link': getattr(article, 'link', 'Unknown'),
        'published_date': getattr(article, 'pub_date', 'Unknown'),
        'media_url': getattr(article, 'media_url', 'Unknown')
    })

# Print each categorized article's details
print("Categorized Articles:")
for article_info in categorized_articles:
    print(f"Title: {article_info['title']}")
    print(f"Content: {article_info['content']}")
    print(f"Link: {article_info['link']}")
    print(f"Published Date: {article_info['published_date']}")
    print(f"Media URL: {article_info['media_url']}")
    print(f"Category: {article_info['category']}")
    print("-" * 50)

# Print category summary
print(f"\nTotal number of categories: {len(category_counts)}")
print("Category breakdown:")

for category, count in category_counts.items():
    print(f"{category}: {count} articles")


In [10]:
%pip install celery redis

Note: you may need to restart the kernel to use updated packages.


In [2]:
from sqlalchemy import create_engine, Column, Integer, String, DateTime, ForeignKey
from sqlalchemy.orm import relationship, sessionmaker
from sqlalchemy.ext.declarative import declarative_base
from datetime import datetime

Base = declarative_base()

# Define the Articles table
class Article(Base):
    __tablename__ = 'articles'
    
    article_id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    link = Column(String, nullable=False)
    pub_date = Column(DateTime, nullable=True)  # Allow null if pub_date is missing
    summary = Column(String, nullable=True)
    category = Column(String, nullable=True)
    
    media = relationship("Media", back_populates="article")

# Define the Media table
class Media(Base):
    __tablename__ = 'media'
    
    media_id = Column(Integer, primary_key=True)
    article_id = Column(Integer, ForeignKey('articles.article_id'))
    media_url = Column(String, nullable=False)
    media_type = Column(String, nullable=True)
    
    article = relationship("Article", back_populates="media")

# Create the database connection
def get_engine():
    # Replace the below URL with your actual database URL
    DATABASE_URL = "postgresql://postgres:root@localhost/Data_1"
    engine = create_engine(DATABASE_URL)
    return engine

# Create the tables
engine = get_engine()
Base.metadata.create_all(engine)


  Base = declarative_base()


In [3]:
import feedparser
from sqlalchemy.orm import sessionmaker

# Define RSS feeds to parse
rss_urls = [
    "http://rss.cnn.com/rss/cnn_topstories.rss",
    "http://qz.com/feed",
    "http://feeds.foxnews.com/foxnews/politics",
    "http://feeds.reuters.com/reuters/businessNews",
    "http://feeds.feedburner.com/NewshourWorld",
    "https://feeds.bbci.co.uk/news/world/asia/india/rss.xml"
]

# Set up the database session
Session = sessionmaker(bind=engine)
session = Session()

# Function to parse RSS feeds and insert data into the database
def parse_rss(url):
    feed = feedparser.parse(url)
    
    for entry in feed.entries:
        title = entry.title
        link = entry.link
        pub_date = entry.published_parsed if 'published_parsed' in entry else entry.get('updated_parsed', None)
        if pub_date:
            pub_date = datetime(*pub_date[:6])
        else:
            pub_date = datetime.now()  # Set to current timestamp if not available

        description = entry.get('description', 'No description')
        category = entry.get('category', None)

        # Check if the article already exists
        existing_article = session.query(Article).filter_by(title=title).first()

        if existing_article is None:
            # Insert new article
            new_article = Article(
                title=title,
                link=link,
                pub_date=pub_date,
                summary=description,
                category=category
            )
            session.add(new_article)
            
            # Commit to save changes
            try:
                session.commit()
                print(f"Inserted article: {title}")
            except Exception as e:
                session.rollback()
                print(f"Failed to insert article {title}: {e}")
        else:
            print(f"Article already exists: {title}")

# Loop through all RSS feeds
for url in rss_urls:
    parse_rss(url)

# Close the session after completion
session.close()


Article already exists: Some on-air claims about Dominion Voting Systems were false, Fox News acknowledges in statement after deal is announced
Article already exists: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Article already exists: Here are the 20 specific Fox broadcasts and tweets Dominion says were defamatory
Article already exists: Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'
Article already exists: 'Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement
Article already exists: Millions in the US could face massive consequences unless McCarthy can navigate out of a debt trap he set for Biden
Article already exists: White homeowner accused of shooting a Black teen who rang his doorbell turns himself in to face criminal charges
Article already exists: Newly released video shows scene of Jeremy Renner's snowplow accident
Article already exists: Jake Gy

In [36]:
import json
from collections import defaultdict

# Load the category keywords
with open('category.json', 'r') as f:
    category_keywords = json.load(f)

# Function to categorize articles
def categorize_article(title, content=None):
    # Initialize the article's category as 'Others'
    category = 'Others'
    
    # Check if any keyword matches in the title or content/description
    for cat, keywords in category_keywords.items():
        if any(keyword.lower() in title.lower() or (content and keyword.lower() in content.lower()) for keyword in keywords):
            category = cat
            break

    # Return the determined category
    return category

# Count the valid categories and reject invalid strings
category_counts = defaultdict(int)
all_articles = session.query(Article).all()

# List to hold categorized article information
categorized_articles = []

for article in all_articles:
    # Use content or another attribute instead of 'description'
    article_content = getattr(article, 'title', None)  # Replace 'content' with the correct field if needed
    
    # Categorize the article
    category = categorize_article(article.title, article_content)
    
    # Validate category (reject invalid categories)
    if category in category_keywords.keys():
        category_counts[category] += 1
    else:
        # Assign to 'Others' if invalid
        category = 'Others'
        category_counts['Others'] += 1

    # Store article details after categorization
    categorized_articles.append({
        'title': article.title,
        'content': getattr(article, 'summary', 'Unknown'),
        'category': category,
        'link': getattr(article, 'link', 'Unknown'),
        'published_date': getattr(article, 'pub_date', 'Unknown')
    })

# Print each categorized article's details
print("Categorized Articles:")
for article_info in categorized_articles:
    print(f"Title: {article_info['title']}")
    print(f"Content: {article_info['content']}")
    print(f"Link: {article_info['link']}")
    print(f"Published Date: {article_info['published_date']}")
    print(f"Category: {article_info['category']}")
    print("-" * 50)

# Print category summary
print(f"\nTotal number of categories: {len(category_counts)}")
print("Category breakdown:")

for category, count in category_counts.items():
    print(f"{category}: {count} articles")


Categorized Articles:
Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Content: No description
Link: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/h_8d51e3ae2714edaa0dace837305d03b8
Published Date: None
Category: political
--------------------------------------------------
Title: Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'
Content: The judge just announced in court that a settlement has been reached in the historic defamation case between Fox News and Dominion Voting Systems.
Link: https://www.cnn.com/2023/04/18/media/fox-dominion-settlement/index.html
Published Date: 2023-04-19 08:28:17
Category: Others
--------------------------------------------------
Title: 'Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement
Content: A settlement has been reached in Dominion Voting Systems' defamation case against Fox News, the judge fo

In [31]:
categorized_articles = session.query(Article).filter(Article.category != None).all()

for article in categorized_articles:
    print(f"Title: {article.title}")
    print(f"Category: {article.category}")
    print(f"Link: {article.link}")
    print(f"Published Date: {article.pub_date}")
    print(f"Summary: {article.summary}")
    print("-" * 80)


Title: Dominion still has pending lawsuits against election deniers such as Rudy Giuliani and Sidney Powell
Category: Others
Link: https://www.cnn.com/business/live-news/fox-news-dominion-trial-04-18-23/h_8d51e3ae2714edaa0dace837305d03b8
Published Date: None
Summary: No description
--------------------------------------------------------------------------------
Title: Judge in Fox News-Dominion defamation trial: 'The parties have resolved their case'
Category: Others
Link: https://www.cnn.com/2023/04/18/media/fox-dominion-settlement/index.html
Published Date: 2023-04-19 08:28:17
Summary: The judge just announced in court that a settlement has been reached in the historic defamation case between Fox News and Dominion Voting Systems.
--------------------------------------------------------------------------------
Title: 'Difficult to say with a straight face': Tapper reacts to Fox News' statement on settlement
Category: Others
Link: https://www.cnn.com/videos/politics/2023/04/18/jake-tap

In [24]:
# Assuming 'session' is already defined and connected to the database
# and the Article model has 'category' field

from collections import defaultdict

# Initialize a dictionary to hold category counts
category_counts = defaultdict(int)

# Get all articles with a categorized field (not None)
categorized_articles = session.query(Article).filter(Article.category != None).all()

# Loop through all articles and count categories
for article in categorized_articles:
    category_counts[article.category] += 1

# Count the total number of categories
total_categories = len(category_counts)

# Print out the total categories and counts
print(f"Total number of categories: {total_categories}")
print("Category breakdown:")

for category, count in category_counts.items():
    print(f"{category}: {count} articles")


Total number of categories: 132
Category breakdown:
Others: 110 articles
nvidia: 3 articles
business finance: 1 articles
health: 1 articles
costco: 3 articles
ftx: 1 articles
auction: 1 articles
tyson foods: 2 articles
popular culture: 1 articles
donald trump: 2 articles
super micro computer: 1 articles
jetblue: 1 articles
charging stations: 1 articles
fisker: 2 articles
frontier airlines: 1 articles
real estate: 1 articles
43c1ad9c-6e95-5b69-a21f-51aba77a1f7a: 1 articles
elon musk: 1 articles
panera bread: 1 articles
tesla: 1 articles
boeing: 1 articles
bitcoin: 2 articles
martin shkreli: 1 articles
endless shrimp: 1 articles
amazon: 1 articles
kamala harris: 2 articles
cryptocurrencies: 1 articles
dow jones industrial average: 1 articles
goldman sachs: 1 articles
patrick drahi: 1 articles
solana: 1 articles
jamie dimon: 1 articles
vehicle insurance in the united states: 1 articles
toyota corolla: 1 articles
crypto: 1 articles
billionaires: 1 articles
life extensionists: 1 articles
sp

In [6]:
%pip install celery


Note: you may need to restart the kernel to use updated packages.


In [7]:
from celery import Celery

app = Celery('news_parser', broker='redis://localhost:6379/0')

@app.task
def process_rss_feed(url):
    parse_rss(url)


In [8]:
celery -A news_parser worker --loglevel=info


SyntaxError: invalid syntax (4125907070.py, line 1)

In [4]:
from textblob import TextBlob  # or use spaCy/NLTK
def categorize_article(title, summary):
    if 'technology' in title.lower() or 'technology' in summary.lower():
        return 'Technology'
    elif 'politics' in title.lower() or 'politics' in summary.lower():
        return 'Politics'
    return 'General'


In [5]:
import logging
logging.basicConfig(filename='rss_parser.log', level=logging.INFO)

# Inside your try-except block
try:
    session.commit()
    logging.info(f"Inserted article: {title}")
except Exception as e:
    session.rollback()
    logging.error(f"Failed to insert article {title}: {e}")


NameError: name 'title' is not defined