# News Scraping

In [121]:
import requests
from bs4 import BeautifulSoup, Tag
from datetime import datetime

## Search Google by URL

When we type something into Google's search bar, we are essentially building a URL that leads to a results page. For example, `https://www.google.com/search?q=apple` is the page we reach by typing the word *apple* into the search bar. Every search URL starts from the same base:

In [122]:
GOOGLE = "https://www.google.com"

Note that we can do much more than search for a string typed into the search bar. We can also add **parameters** to the URL to put further constraints on the search. For example, we can restrict a news search to a particular date range.

See this [link](https://stenevang.wordpress.com/2013/02/22/google-advanced-power-search-url-request-parameters/) for detailed information about Google's search URL request parameters.

Some parameters of interest:
- `tbm`, TBM (Term By Method), e.g., `tbm=nws` will search for news
- `tbs`, TBS (Term By Search), e.g., 
  - `tbs=cdr:1,cd_min:3/2/1984,cd_max:6/5/1987` specifies a range from March 2, 1984, to June 5, 1987
  - `tbs=sbd:0` sorts the results by relevancy
- `lr`, language, e.g. 
  - `lr=lang_en` for English
  - `lr=lang_zh-CN` for Chinese

```{tip}
Different parameters are joined with the `&` symbol. Multiple options inside a single `tbs` value are separated by commas.
```
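As a minimal sketch of how such a parameter string can be assembled, the standard library's `urllib.parse.urlencode` joins key/value pairs with `&` and escapes unsafe characters. The query text `apple watch` is only an illustration, and passing `safe=":,/"` is a choice made here to keep the `tbs` value readable.

```python
from urllib.parse import urlencode

# Parameter names and values are the ones listed above;
# the query text "apple watch" is purely illustrative.
params = {
    "q": "apple watch",
    "tbm": "nws",
    "tbs": "cdr:1,cd_min:3/2/1984,cd_max:6/5/1987",
    "lr": "lang_en",
}

# urlencode joins the pairs with "&" and turns the space into "+".
# safe=":,/" leaves the separators inside the tbs value unescaped.
query_string = urlencode(params, safe=":,/")
search_url = "https://www.google.com/search?" + query_string
print(search_url)
```

Building the string from a dictionary keeps each parameter on its own line, which makes it easier to add or drop constraints later.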

In [123]:
from urllib.parse import quote_plus


def create_search_url(query: str, date_str: str) -> str:
    """Create a Google News search URL for a query on a single date.

    Parameters
    ----------
    query : str
        The text typed into the search bar.
    date_str : str
        Date string with format "YYYY-mm-dd".

    Returns
    -------
    str
        Search URL.
    """

    # base URL
    url = GOOGLE

    # search content (URL-encoded so that spaces and symbols are safe)
    url += f"/search?q={quote_plus(query)}"

    # search for news
    url += "&tbm=nws"

    # convert the date to MM/DD/YYYY and then search by date
    date = datetime.strptime(date_str, "%Y-%m-%d")
    query_date_str = date.strftime("%m/%d/%Y")
    url += f"&tbs=cdr:1,cd_min:{query_date_str},cd_max:{query_date_str}"

    # sort by relevancy
    url += ",sbd:0"

    # we want results in English
    url += "&lr=lang_en"

    return url

To search for news about Apple published on September 1, 2022, the URL is:

In [124]:
url = create_search_url("apple", "2022-9-1")
url

'https://www.google.com/search?q=apple&tbm=nws&tbs=cdr:1,cd_min:09/01/2022,cd_max:09/01/2022,sbd:0&lr=lang_en'
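One way to sanity-check a generated URL is to parse it back into its parameters. The sketch below uses only the standard library and the URL shown above; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

example_url = (
    "https://www.google.com/search?q=apple&tbm=nws"
    "&tbs=cdr:1,cd_min:09/01/2022,cd_max:09/01/2022,sbd:0&lr=lang_en"
)

# Split the query string on "&" into a dict of lists, one list per parameter.
params = parse_qs(urlsplit(example_url).query)

# The tbs value packs several comma-separated "key:value" options together.
tbs_options = dict(option.split(":", 1) for option in params["tbs"][0].split(","))

print(params["q"], params["tbm"], params["lr"])
print(tbs_options)
```

If a parameter is missing or a date is malformed, it shows up here immediately, before any request is sent.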

## Find Links to Webpages

We can use the `requests.get` function to request the content of a webpage.

```{tip}
Always send a `User-Agent` **header** that identifies the request as coming from a browser; otherwise the request may be denied.
```

In [125]:
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:106.0) Gecko/20100101 Firefox/106.0"
}

res = requests.get(url, headers=headers)
res

<Response [200]>

The status code 200 means that the request succeeded. Alternatively, we can check this by examining whether `res.ok` is `True`:

In [126]:
res.ok

True
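When many pages are requested in a row, it can help to fail loudly instead of checking `res.ok` by hand. The following is a minimal sketch, not a definitive implementation: `fetch` is a hypothetical helper, and the 10-second timeout is an arbitrary assumption. It relies on `requests.Session` to reuse the same headers and on `Response.raise_for_status` to turn 4xx/5xx responses into exceptions.

```python
import requests

# A session sends the same headers with every request it makes.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:106.0) "
                  "Gecko/20100101 Firefox/106.0"
})


def fetch(url: str, session: requests.Session, timeout: float = 10.0) -> requests.Response:
    """Hypothetical helper: GET a page and raise on HTTP errors."""
    # The timeout (an assumed value) stops the call from hanging forever.
    res = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    res.raise_for_status()
    return res
```

Calling `fetch(url, session)` would then either return a successful response or raise `requests.HTTPError`, which the caller can catch and log.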

Parse the raw HTML content using `bs4.BeautifulSoup`:

In [127]:
soup = BeautifulSoup(res.content, "html.parser")

By inspecting the HTML structure of the Google search results page, we find that each result link sits inside a tag with `class=SoaBEf`, and that these tags are descendants of a tag with `id=search`. This class name reflects the page layout at the time of writing and may change.

In [128]:
# find search tag
search_tag = soup.find(id="search")

# find tags containing links to webpages
tags = search_tag.find_all(attrs={"class": "SoaBEf"})

# extract links, skipping any result block without an <a> tag
links = []
for tag in tags:
    inner_tag = tag.find("a")
    if inner_tag is not None:
        links.append(inner_tag.get("href"))

links

['https://www.macrumors.com/2022/09/01/apple-watch-pro-bad-news-for-band-collections/',
 'https://techcrunch.com/2022/09/01/apple-settles-lawsuit-with-developer-over-app-store-rejections-and-scams/',
 'https://www.macrumors.com/2022/09/01/apple-settles-flicktype-lawsuit/',
 'https://www.ft.com/content/75891d95-4432-4571-83df-b4cdf82d5da5',
 'https://www.gov.uk/find-digital-market-research/mobile-ecosystems-market-study-appendix-i-apples-restrictions-on-cloud-gaming-2022-cma',
 'https://www.computerworld.com/article/3672238/will-apple-make-the-app-store-more-business-friendly.html',
 'https://www.computerworld.com/article/3672111/apple-pushes-out-emergency-updates-to-address-zero-day-exploits.html',
 'https://www.businessinsider.com/guides/tech/apple-event-september-what-to-expect-2022-09',
 'https://9to5mac.com/2022/09/01/samsung-trolls-iphone-14-apple-ad/',
 'https://www.livemint.com/technology/gadgets/iphone-14-max-could-come-with-iphone-14-plus-moniker-report-11662038925215.html']
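The extraction logic can also be tried offline on a small HTML snippet, which is handy when the live page is unavailable or its layout has changed. In the sketch below, the sample HTML is made up to mimic the structure described above, and `extract_links` is a hypothetical helper that returns an empty list when the `id=search` container is missing.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the structure described above.
sample_html = """
<div id="search">
  <div class="SoaBEf"><a href="https://example.com/story-1">Story 1</a></div>
  <div class="SoaBEf"><span>No link here</span></div>
  <div class="SoaBEf"><a href="https://example.com/story-2">Story 2</a></div>
</div>
"""


def extract_links(html: str, result_class: str = "SoaBEf") -> list:
    """Hypothetical helper: collect hrefs from result blocks inside #search."""
    soup = BeautifulSoup(html, "html.parser")
    search_tag = soup.find(id="search")
    if search_tag is None:
        # Layout changed or the request was blocked: nothing to extract.
        return []
    links = []
    for tag in search_tag.find_all(class_=result_class):
        # href=True only matches <a> tags that actually carry a link.
        anchor = tag.find("a", href=True)
        if anchor is not None:
            links.append(anchor["href"])
    return links


print(extract_links(sample_html))
```

Keeping the class name as a parameter means only one argument needs to change if the page layout is updated.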

## Scrape Headlines

Now, we want to access each webpage via its link and scrape its heading (news headline).

In [129]:
link = links[0]
res = requests.get(link, headers=headers)
res.ok

True

Make another soup for the webpage:

In [130]:
webpage_soup = BeautifulSoup(res.content, "html.parser")

```{tip}
On most webpages, the top-level heading is contained in an `<h1>` tag.
```
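Not every page has an `<h1>`, so a scraper that visits many links may want a fallback. The sketch below is one possible approach, assuming the page's `<title>` is an acceptable substitute; `extract_headline` is a hypothetical helper and the HTML strings are made up for illustration.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_headline(html: str) -> Optional[str]:
    """Hypothetical helper: return the <h1> text, else the <title>, else None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1")
    if tag is None:
        # Assumed fallback: use the page title when there is no <h1>.
        tag = soup.find("title")
    if tag is None:
        return None
    # strip=True removes surrounding whitespace and newlines.
    return tag.get_text(strip=True)


# Made-up pages for illustration.
with_h1 = "<html><head><title>Site</title></head><body><h1> Big News </h1></body></html>"
without_h1 = "<html><head><title>Site</title></head><body><p>Text</p></body></html>"

print(extract_headline(with_h1))
print(extract_headline(without_h1))
```

Returning `None` instead of raising lets a loop over many links skip pages where no headline can be found.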

In [136]:
headline_tag: Tag = webpage_soup.find("h1")
headline = headline_tag.text
headline

"Apple Watch 'Pro' Could Be Bad News for Band Collections"