# Real-Time NewsAPI ETL & Analysis

## Project Overview
This project demonstrates an ETL (Extract, Transform, Load) pipeline that fetches live news data from **NewsAPI** on each run, transforms it using **Pandas**, loads it into a **PostgreSQL** database, and plots article counts by source and by publication hour.

## Tech Stack
- **Source**: NewsAPI (REST API)
- **Processing**: Python (Pandas)
- **Storage**: PostgreSQL
- **Visualization**: Matplotlib/Seaborn

---


In [None]:
# Install required packages (run once)
!pip install -q requests pandas sqlalchemy psycopg2-binary matplotlib seaborn


## 1. Configuration
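Load the API key and database credentials from environment variables. The fallback values in the code are placeholders for local development only; set the real values in the environment rather than editing the notebook.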


In [None]:
import os
import requests
import pandas as pd
from sqlalchemy import create_engine

# API Configuration
API_KEY = os.getenv("NEWS_API_KEY", "your_api_key_here")
BASE_URL = "https://newsapi.org/v2/top-headlines"

# Database Configuration
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = os.getenv("DB_PORT", "5432")
DB_NAME = os.getenv("DB_NAME", "newsdb")
DB_USER = os.getenv("DB_USER", "postgres")
DB_PASS = os.getenv("DB_PASS", "password")

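# Create DB connection. URL.create escapes special characters
# (such as '@' or ':') in the password, which an f-string would not.
from sqlalchemy.engine import URL

connection_uri = URL.create(
    "postgresql+psycopg2",
    username=DB_USER,
    password=DB_PASS,
    host=DB_HOST,
    port=int(DB_PORT),
    database=DB_NAME,
)
engine = create_engine(connection_uri)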

print("Configuration loaded.")
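
The placeholder defaults above keep the notebook easy to run locally, but they also let a missing variable go unnoticed until the API or database rejects the request. A minimal sketch of a stricter loader follows. The helper name `require_env` and the `DEMO_*` variable names are hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Demo with hypothetical variable names so real settings are not touched
os.environ["DEMO_NEWS_API_KEY"] = "abc123"
demo_key = require_env("DEMO_NEWS_API_KEY")

os.environ.pop("DEMO_MISSING_VAR", None)
try:
    require_env("DEMO_MISSING_VAR")
    missing_detected = False
except RuntimeError:
    missing_detected = True
```

Failing at configuration time makes the cause obvious, whereas a placeholder key only surfaces later as an HTTP error from the API.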

## 2. Extract (Fetch Data)
Fetch top technology headlines from the US.


In [None]:
def fetch_news(api_key, category="technology", country="us"):
    params = {
        "country": country,
        "category": category,
        "apiKey": api_key
    }
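    try:
        # A timeout prevents the cell from hanging if the API is unreachable
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError) as e:
        # RequestException covers network and HTTP errors; ValueError covers invalid JSON
        print(f"Error fetching news: {e}")
        return None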

raw_data = fetch_news(API_KEY)
if raw_data and "articles" in raw_data:
    print(f"Fetched {len(raw_data['articles'])} articles.")
else:
    print("No data fetched.")
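
NewsAPI responses carry a `status` field (`ok` or `error`), and error payloads include a `code` and a `message`. The check above only looks for the `articles` key, so a small validation step can give clearer failures. This is a sketch only: `extract_articles` is a hypothetical helper, and both payloads are made-up samples shaped like NewsAPI responses.

```python
def extract_articles(payload):
    """Return the article list from a NewsAPI-style payload, or raise on an error payload."""
    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object")
    if payload.get("status") != "ok":
        raise ValueError(f"API error {payload.get('code')}: {payload.get('message')}")
    # Treat a missing or null article list as empty
    return payload.get("articles") or []

# Made-up sample payloads for illustration
ok_payload = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example headline"}]}
error_payload = {"status": "error", "code": "apiKeyInvalid", "message": "Your API key is invalid."}

articles = extract_articles(ok_payload)

try:
    extract_articles(error_payload)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```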

## 3. Transform (Clean Data)
Convert the articles to a DataFrame and flatten the nested `source` field into a `source_name` column.


In [None]:
# .get() also skips an empty article list, which would otherwise raise a KeyError on 'source'
if raw_data and raw_data.get("articles"):
    df = pd.DataFrame(raw_data["articles"])
    
    # Flatten 'source' column (extract 'name')
    df['source_name'] = df['source'].apply(lambda x: x.get('name') if isinstance(x, dict) else None)
    
    # Select relevant columns
    cols = ['source_name', 'author', 'title', 'description', 'url', 'publishedAt', 'content']
    df_clean = df[cols].copy()
    
    # Convert publishedAt to timezone-aware UTC datetimes
    df_clean['publishedAt'] = pd.to_datetime(df_clean['publishedAt'], utc=True)
    
    print("Data transformed successfully.")
    display(df_clean.head())
else:
    df_clean = pd.DataFrame()
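
As an alternative to the `apply` call above, `pd.json_normalize` can flatten the nested `source` object in one step, producing dotted column names such as `source.name`. The sketch below uses made-up sample records shaped like NewsAPI articles and assumes every record's `source` is a dictionary.

```python
import pandas as pd

# Made-up sample shaped like NewsAPI articles
sample_articles = [
    {"source": {"id": "example-id", "name": "Example News"},
     "title": "First", "publishedAt": "2024-01-01T09:15:00Z"},
    {"source": {"id": None, "name": "Another Site"},
     "title": "Second", "publishedAt": "2024-01-01T17:05:00Z"},
]

# Nested keys become dotted columns: 'source.id', 'source.name'
flat = pd.json_normalize(sample_articles)
flat = flat.rename(columns={"source.name": "source_name"})

# Parse the ISO 8601 timestamps as timezone-aware UTC datetimes
flat["publishedAt"] = pd.to_datetime(flat["publishedAt"], utc=True)
```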

## 4. Load (Save to DB)
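Load the transformed data into PostgreSQL. Because `if_exists='replace'` is used, each run drops and recreates the `tech_news` table, so only the most recent batch is kept.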


In [None]:
if not df_clean.empty:
    try:
        df_clean.to_sql('tech_news', engine, if_exists='replace', index=False)
        print("Data loaded into table 'tech_news'.")
    except Exception as e:
        print(f"Error loading to DB: {e}")
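
If the goal is to accumulate articles across runs instead of replacing the table, one option is to append only rows whose `url` is not already stored. The sketch below illustrates the idea with an in-memory SQLite database as a stand-in for PostgreSQL, so it runs without a server. `append_new_articles` is a hypothetical helper, the table-existence check is SQLite-specific, and the sample rows are made up.

```python
import sqlite3
import pandas as pd

def append_new_articles(df, conn, table="tech_news"):
    """Append only rows whose 'url' is not already stored; return the number added."""
    # SQLite-specific check for whether the table exists yet
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if exists:
        known = set(pd.read_sql(f'SELECT url FROM "{table}"', conn)["url"])
    else:
        known = set()

    # Drop duplicates within the batch, then rows already in the table
    fresh = df.drop_duplicates(subset="url")
    fresh = fresh[~fresh["url"].isin(known)]
    if not fresh.empty:
        fresh.to_sql(table, conn, if_exists="append", index=False)
    return len(fresh)

conn = sqlite3.connect(":memory:")
batch_1 = pd.DataFrame({"url": ["https://example.com/a", "https://example.com/b"],
                        "title": ["A", "B"]})
batch_2 = pd.DataFrame({"url": ["https://example.com/b", "https://example.com/c"],
                        "title": ["B", "C"]})

added_first = append_new_articles(batch_1, conn)
added_second = append_new_articles(batch_2, conn)
total_rows = conn.execute("SELECT COUNT(*) FROM tech_news").fetchone()[0]
```

On PostgreSQL, a unique constraint on `url` combined with `INSERT ... ON CONFLICT DO NOTHING` would be a more robust way to get the same effect, since it avoids the read-then-write race.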

## 5. Analysis & Visualization
Analyze the data directly from the database.


In [None]:
# Analysis 1: Top News Sources
query_sources = """
SELECT source_name, COUNT(*) as article_count 
FROM tech_news 
GROUP BY source_name 
ORDER BY article_count DESC 
LIMIT 10
"""

# Import plotting libraries outside the try block so later cells can rely on them
import matplotlib.pyplot as plt
import seaborn as sns

try:
    df_sources = pd.read_sql(query_sources, engine)

    plt.figure(figsize=(10, 6))
    sns.barplot(x='article_count', y='source_name', data=df_sources, palette='coolwarm')
    plt.title('Top 10 Tech News Sources')
    plt.xlabel('Number of Articles')
    plt.ylabel('Source')
    plt.show()
except Exception as e:
    print(f"Analysis failed: {e}")

In [None]:
# Analysis 2: Articles by Hour (Publication Time)
query_time = """
SELECT EXTRACT(HOUR FROM "publishedAt" AT TIME ZONE 'UTC') AS hour, COUNT(*) AS count
FROM tech_news
GROUP BY hour
ORDER BY hour
"""

try:
    df_time = pd.read_sql(query_time, engine)
    
    plt.figure(figsize=(10, 5))
    sns.lineplot(x='hour', y='count', data=df_time, marker='o')
    plt.title('News Publication Frequency by Hour (UTC)')
    plt.xlabel('Hour of Day')
    plt.ylabel('Number of Articles')
    plt.grid(True)
    plt.xticks(range(0, 24))
    plt.show()
except Exception as e:
    print(f"Time analysis failed: {e}")
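
The SQL query above returns rows only for hours that have at least one article, so the line plot connects the remaining points straight across any gaps. A minimal sketch of the same aggregation in Pandas, filling missing hours with zero, is shown below. `articles_per_hour` is a hypothetical helper and the timestamps are made-up samples.

```python
import pandas as pd

def articles_per_hour(published_at):
    """Count articles per UTC hour, returning all 24 hours with zeros included."""
    hours = pd.to_datetime(published_at, utc=True).dt.hour
    # reindex adds the hours that have no articles, with a count of 0
    return hours.value_counts().reindex(range(24), fill_value=0)

# Made-up sample timestamps in the ISO 8601 format NewsAPI uses
sample = pd.Series([
    "2024-01-01T09:15:00Z",
    "2024-01-01T09:45:00Z",
    "2024-01-01T17:05:00Z",
])
hourly = articles_per_hour(sample)
```

The resulting series has one entry per hour from 0 to 23, which also makes it a convenient cross-check against the database query.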

## Conclusion
This notebook demonstrated a complete ETL pipeline:
1.  **Extracted** live data from an external API.
2.  **Transformed** nested JSON into a flat, structured format.
3.  **Loaded** the clean data into a relational database.
4.  **Analyzed** the stored data to derive insights about news sources and timing.
