# News Scraping

In [3]:
from datetime import date, timedelta

import sys
sys.path.append('../../')
from newscrape.config import CONFIG
CONFIG.load('../../config.toml')
from newscrape.db import NewsDBClient
from newscrape.webdriver import WebDriver
from newscrape.scraper.headline import NewsHeadlinePicker
from newscrape.scraper import NewsScraper

## Create a News Scraper

In [4]:
scraper = NewsScraper(
    db_client=NewsDBClient.from_host_and_port(
        database_name=CONFIG.MONGODB_DATABASE_NAME,
        host=CONFIG.MONGODB_HOST,
        port=CONFIG.MONGODB_PORT,
    ),
    web_driver=WebDriver.on_port(0),
    headline_picker=NewsHeadlinePicker(),
    n_workers=10
)

Initializing a news scraper requires the following parameters:
- `db_client`: A MongoDB client that handles the news documents in the database.
- `web_driver`: A Chrome web driver running in the background. It lets us access websites for which a simple GET request fails.
- `headline_picker`: The news headline is usually wrapped in an `h1` HTML tag. In practice, however, some websites have multiple `h1` tags, some of which contain text other than the headline, so we need to pick the correct one. To do this automatically, the headline picker is powered by GPT.
- `n_workers`: Maximum number of workers in the thread pool. The pool executor sends requests concurrently to save time.
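To make the multiple-`h1` problem concrete, here is a minimal sketch that lists every `h1` candidate on a page using only the standard library. The `H1Collector` class and the sample HTML are made up for illustration; this is not how `NewsHeadlinePicker` is implemented, it only shows the kind of ambiguous input such a picker has to resolve.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element in a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # greater than 0 while the parser is inside an <h1>
        self._buffer = []     # text fragments of the <h1> currently being read
        self.candidates = []  # cleaned text of each <h1> found so far

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.candidates.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Made-up page with three <h1> tags, only one of which is the headline
sample_html = """
<header><h1>Example Daily</h1></header>
<article><h1>PwC publishes <em>digital asset</em> report</h1></article>
<aside><h1>Subscribe</h1></aside>
"""

collector = H1Collector()
collector.feed(sample_html)
print(collector.candidates)
```

Here the site name and a call-to-action sit in `h1` tags alongside the real headline, which is why a selection step is needed after extraction.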

## Scrape News

The goal is to scrape news published in recent days.

Get the current date:

In [5]:
today = date.today()
today

datetime.date(2023, 7, 14)

For example, suppose you want to search for news that
- matches the query `'Pwc aspen digital report'`,
- was published in the past `3` days (today and the two days before it), and
- is written in English.

Then call:

In [6]:
scraper.scrape_news(
    query='Pwc aspen digital report',
    date_start=today - timedelta(days=2),
    date_end=today,
    language='en'
)
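The date arithmetic is easy to get off by one, so here is a small sketch of how `timedelta(days=2)` corresponds to a 3-day window. It assumes that both `date_start` and `date_end` are treated as inclusive, which is what the example above implies but is not confirmed here; `dates_in_window` is a hypothetical helper, not part of `newscrape`.

```python
from datetime import date, timedelta


def dates_in_window(date_start, date_end):
    """List every calendar date from date_start to date_end, both inclusive (assumed)."""
    n_days = (date_end - date_start).days + 1
    return [date_start + timedelta(days=i) for i in range(n_days)]


# Fixed date matching the notebook output above, so the result is reproducible
example_today = date(2023, 7, 14)
window = dates_in_window(example_today - timedelta(days=2), example_today)
print(window)
```

Under that assumption, a window of `n` days ending today starts at `today - timedelta(days=n - 1)`.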

The following are more examples:

In [7]:
scraper.scrape_news(
    query='Pwc digital asset custody report',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='en'
)

scraper.scrape_news(
    query='Pwc aspen digital',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='en'
)

scraper.scrape_news(
    query='羅兵咸永道 aspen digital',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='zh'
)

scraper.scrape_news(
    query='羅兵咸永道 數字資產託管狀況報告',
    date_start=today - timedelta(days=5),
    date_end=today,
    language='zh'
)
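The four calls above differ only in their query and language, so they could also be driven from a list. The sketch below assumes only the keyword arguments already shown for `scrape_news`; `scrape_all` and `record_call` are hypothetical helpers, and `record_call` stands in for the real method so that the example runs without a database or a browser.

```python
from datetime import date, timedelta


def scrape_all(scrape, queries, date_start, date_end):
    """Call `scrape` once per (query, language) pair, sharing one date range."""
    for query, language in queries:
        scrape(
            query=query,
            date_start=date_start,
            date_end=date_end,
            language=language,
        )


queries = [
    ('Pwc digital asset custody report', 'en'),
    ('Pwc aspen digital', 'en'),
    ('羅兵咸永道 aspen digital', 'zh'),
    ('羅兵咸永道 數字資產託管狀況報告', 'zh'),
]

# Stand-in for scraper.scrape_news: it only records the arguments it receives
calls = []


def record_call(**kwargs):
    calls.append(kwargs)


today = date.today()
scrape_all(record_call, queries, today - timedelta(days=5), today)
print(len(calls))
```

In the notebook, passing `scraper.scrape_news` in place of `record_call` would issue the same four searches.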