# News Scraping

In [11]:
import requests
from datetime import datetime

## Search Google by URL

When we type something into Google's search bar, we are essentially constructing a URL that leads to a results page. For example, typing the word *apple* into the search bar leads to `https://www.google.com/search?q=apple`, which is built on the following base URL.

In [10]:
URL = "https://www.google.com"

Note that we can do much more than search for a string (the search bar input), which is what we usually do. We can also add **parameters** to the URL to place further constraints on the search. For example, we can search for news within a particular date range.

See this [link](https://stenevang.wordpress.com/2013/02/22/google-advanced-power-search-url-request-parameters/) for detailed information about Google's search URL request parameters.

Some parameters of interest:
- `tbm`, TBM (Term By Method), e.g., `tbm=nws` will search for news
- `tbs`, TBS (Term By Search), e.g., 
  - `tbs=cdr:1,cd_min:3/2/1984,cd_max:6/5/1987` specifies a range from March 2, 1984, to June 5, 1987
  - `tbs=sbd:0` sorts the results by relevancy
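The date-range form of `tbs` shown above can be assembled from Python `date` objects. The following is a minimal sketch, assuming a hypothetical helper named `build_tbs_range` (it is not part of the original notebook) and the month/day/year form without zero padding used in the example:

```python
from datetime import date


def build_tbs_range(start: date, end: date) -> str:
    """Return a `tbs` value restricting results to the range [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")

    def fmt(d: date) -> str:
        # month/day/year without zero padding, e.g. 3/2/1984
        return f"{d.month}/{d.day}/{d.year}"

    return f"cdr:1,cd_min:{fmt(start)},cd_max:{fmt(end)}"


tbs = build_tbs_range(date(1984, 3, 2), date(1987, 6, 5))
print(tbs)  # cdr:1,cd_min:3/2/1984,cd_max:6/5/1987
```

Passing the same date as both `start` and `end` restricts the search to a single day.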

```{tip}
Different parameters are joined with the `&` symbol, while multiple options within `tbs` are separated by commas.
```
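To illustrate how parameters are joined, the standard library's `urllib.parse.urlencode` can build the query string from a dictionary. This is only a sketch with illustrative values, not the approach used in the rest of this notebook. The `safe` argument keeps `:`, `,`, and `/` unescaped so that the result stays readable.

```python
from urllib.parse import urlencode

# illustrative parameter values taken from the examples above
params = {
    "q": "apple",
    "tbm": "nws",
    "tbs": "cdr:1,cd_min:3/2/1984,cd_max:6/5/1987",
}

# urlencode joins key=value pairs with "&"
query_string = urlencode(params, safe=":,/")
example_url = "https://www.google.com/search?" + query_string
print(example_url)
# https://www.google.com/search?q=apple&tbm=nws&tbs=cdr:1,cd_min:3/2/1984,cd_max:6/5/1987
```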

In [26]:
from urllib.parse import quote_plus


def create_search_url(query: str, date_str: str) -> str:
    """Create a Google News search URL for a query on a single date.

    Parameters
    ----------
    query : str
        Search terms, as they would be typed into the search bar.
    date_str : str
        Date to search, in ``YYYY-MM-DD`` format (e.g., ``"2022-9-1"``).

    Returns
    -------
    str
        Search URL restricted to news from that date, sorted by relevancy.
    """

    # base URL
    url = URL

    # search content (URL-encoded so that spaces and symbols are safe)
    url += f"/search?q={quote_plus(query)}"

    # search for news
    url += "&tbm=nws"

    # get correct date format and then search by date
    date = datetime.strptime(date_str, "%Y-%m-%d")
    query_date_str = date.strftime("%m/%d/%Y")
    url += f"&tbs=cdr:1,cd_min:{query_date_str},cd_max:{query_date_str}"

    # sort by relevancy
    url += ",sbd:0"

    return url

If we want to search for news about Apple on September 1, 2022, then the URL is:

In [28]:
create_search_url("apple", "2022-9-1")

'https://www.google.com/search?q=apple&tbm=nws&tbs=cdr:1,cd_min:09/01/2022,cd_max:09/01/2022,sbd:0'
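As a quick sanity check, the generated URL can be split back into its parameters with the standard library. The sketch below hard-codes the URL shown above (under the assumed name `example_url`) so that it runs on its own:

```python
from urllib.parse import parse_qs, urlsplit

# the URL produced above, copied verbatim
example_url = (
    "https://www.google.com/search?q=apple&tbm=nws"
    "&tbs=cdr:1,cd_min:09/01/2022,cd_max:09/01/2022,sbd:0"
)

# parse_qs maps each parameter name to a list of values
params = parse_qs(urlsplit(example_url).query)

# split the comma-separated tbs options into key/value pairs
tbs_options = dict(item.split(":", 1) for item in params["tbs"][0].split(","))

print(params["q"], params["tbm"])  # ['apple'] ['nws']
print(tbs_options)
# {'cdr': '1', 'cd_min': '09/01/2022', 'cd_max': '09/01/2022', 'sbd': '0'}
```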

## Get Headlines

In [None]:
headers = {
    "User-Agent": "Firefox"
}

# res = requests.get(url, headers=headers)