# Purpose

The purpose of this notebook is to investigate options for programmatic access to news.

# News API

[News API](https://newsapi.org) is available in free and paid versions, though the paid version seems [prohibitively expensive](https://newsapi.org/pricing).  It seems extremely well [documented](https://newsapi.org/docs), provides links to articles and images, and has an [unofficial Python client](https://github.com/mattlisiv/newsapi-python) for easier querying.  There are two main [endpoints](https://newsapi.org/docs/endpoints) for searching, each with a corresponding method on the `NewsApiClient` instance.


## Everything

This endpoint searches all articles from over 80,000 different sources and would be ideal for news discovery, e.g. querying for all articles which mention Bridgestone.  There are many request parameters, documented [here](https://newsapi.org/docs/endpoints/everything), each of which can be passed as a keyword argument to the method.  It returns a dictionary with a lot of relevant information for each article, including title, author, source, short description, URL, and a truncated version of the content.  Below, I'll do a quick search for mentions of Bridgestone within the last week and sort by popularity.
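
The one-week window used in the search below is hard-coded, so here is a minimal sketch of computing it instead. The helper name `date_window` is my own (not part of the client), and I'm assuming that plain ISO date strings are acceptable for `from_param` and `to`, as they are in the query below.

```python
from datetime import date, timedelta


def date_window(days, end=None):
    """Return (from_param, to) as ISO date strings covering the last `days` days."""
    end = end or date.today()
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()


# Reproduce the window used in this notebook (end date fixed for repeatability)
from_param, to = date_window(7, end=date(2023, 5, 11))
print(from_param, to)  # 2023-05-04 2023-05-11
```

The two strings could then be passed as `from_param=from_param, to=to` in the `get_everything` call.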

In [14]:
from newsapi import NewsApiClient

with open("news-api-key.txt", "r") as f:
    # Strip the trailing newline that most editors append to the key file
    key = f.read().strip()

# Initialize the client
newsapi = NewsApiClient(api_key=key)

In [30]:
recent_bs_results = newsapi.get_everything(q="bridgestone", language="en", from_param="2023-05-04", to="2023-05-11", sort_by="popularity")
recent_bs_results

{'status': 'ok',
 'totalResults': 120,
 'articles': [{'source': {'id': None, 'name': 'Pitchfork'},
   'author': 'Matthew Strauss',
   'title': '50 Cent Announces Get Rich or Die Tryin’ Anniversary Tour',
   'description': 'Busta Rhymes and Jeremih will join the New York rapper at various stops on the Final Lap Tour',
   'url': 'https://pitchfork.com/news/50-cent-announces-get-rich-or-die-tryin-anniversary-tour/',
   'urlToImage': 'https://media.pitchfork.com/photos/6453dcb5d014feef88cca00a/16:9/w_1280,c_limit/50-Cent.jpg',
   'publishedAt': '2023-05-04T16:57:15Z',
   'content': 'Curtis 50 Cent Jackson is going on tour to celebrate the 20th anniversary of his iconic debut, Get Rich or Die Tryin. The New York rappers Final Lap Tour begins in July and includes support from Bust… [+3187 chars]'},
  {'source': {'id': None, 'name': 'HYPEBEAST'},
   'author': 'info@hypebeast.com (Hypebeast)',
   'title': '50 Cent Celebrate 20 Years of \'Get Rich or Die Trying\' With “The Final Lap Tour"',
   

Unsurprisingly, it looks like it's finding a lot of articles mentioning Bridgestone Arena rather than Bridgestone Corporation.  This would require some additional filtering or more careful querying to yield the desired results.
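
As a rough sketch of the "additional filtering" idea, the function below keeps an article only if it mentions Bridgestone somewhere other than in the phrase "Bridgestone Arena". The function name and patterns are my own, and the check only sees the title, description, and truncated content, so an article that mentions the company further down the page would be missed.

```python
import re

ARENA_PATTERN = re.compile(r"bridgestone\s+arena", re.IGNORECASE)
COMPANY_PATTERN = re.compile(r"bridgestone", re.IGNORECASE)


def mentions_company(article):
    """True if the article mentions Bridgestone outside of 'Bridgestone Arena'."""
    fields = ("title", "description", "content")
    # Fields can be None in the API response, so fall back to an empty string
    text = " ".join(str(article.get(field) or "") for field in fields)
    # Remove the arena phrase first, then look for any remaining mention
    without_arena = ARENA_PATTERN.sub(" ", text)
    return bool(COMPANY_PATTERN.search(without_arena))


# Hypothetical articles, shaped like the entries in the response above
samples = [
    {"title": "Tour stops at Bridgestone Arena", "description": None, "content": ""},
    {"title": "Bridgestone funding guayule R&D", "description": None, "content": ""},
]
print([mentions_company(a) for a in samples])  # [False, True]
```

With the real response this would be applied as `[a for a in recent_bs_results["articles"] if mentions_company(a)]`. The Everything documentation also describes quoted phrases and `NOT` in the `q` parameter, so a query along the lines of `bridgestone NOT arena` may be worth trying before filtering client-side.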

In [31]:
recent_bs_results["articles"][4]

{'source': {'id': None, 'name': 'Brooklyn Vegan'},
 'author': 'Bill Pearis',
 'title': "Depeche Mode enlist Miss Grit, Matthew Herbert & more for 'Ghosts Again' Remix EP (listen)",
 'description': 'The remix EP also features Chris Liebing & Luke Slater, Rival Consoles, Massano, and more \nContinue reading…',
 'url': 'https://www.brooklynvegan.com/depeche-mode-enlist-miss-grit-matthew-herbert-more-for-ghosts-again-remix-ep-listen/',
 'urlToImage': 'https://townsquare.media/site/838/files/2023/05/attachment-depechemode-msg-54.jpg?w=1200&h=0&zc=1&s=0&a=t&q=89',
 'publishedAt': '2023-05-05T15:00:19Z',
 'content': "Depeche Mode have shared the Ghosts Again (Remixes) EP, which features eight new reworks of the lead single from the synthpop group's new album Memento Mori. Remixers include Miss Grit, Matthew Herbe… [+4229 chars]"}

In [41]:
recent_bs_results["articles"][4]["content"]

"Depeche Mode have shared the Ghosts Again (Remixes) EP, which features eight new reworks of the lead single from the synthpop group's new album Memento Mori. Remixers include Miss Grit, Matthew Herbe… [+4229 chars]"

Unfortunately, it looks like the content provided is truncated to the first 200 characters.
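
The truncated content ends with a marker such as `[+4229 chars]`, which at least tells us how much text is missing. Here is a small sketch that reads this marker to estimate the full article length; the helper name and the regular expression are my own and assume the marker always has the form shown in the output above.

```python
import re

# Matches the trailing "[+1234 chars]" marker seen in the truncated content
TRUNCATION_PATTERN = re.compile(r"\[\+(\d+) chars\]\s*$")


def estimated_full_length(content):
    """Return the approximate full length in characters, or None if not truncated."""
    match = TRUNCATION_PATTERN.search(content or "")
    if match is None:
        return None
    # Drop the marker plus the ellipsis and space that precede it
    visible = content[: match.start()].rstrip(" …")
    return len(visible) + int(match.group(1))


print(estimated_full_length("Short teaser… [+10 chars]"))  # 22
print(estimated_full_length("Complete text"))  # None
```

This could be used to decide which articles are worth fetching in full from their `url`.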

## Top Headlines

This endpoint is intended for pulling live top and breaking headlines for a country, specific category, source, etc.  Keyword searching is also available.  Here are the current top headlines.

In [42]:
top_headlines = newsapi.get_top_headlines(language="en", country="us")

In [43]:
top_headlines

{'status': 'ok',
 'totalResults': 33,
 'articles': [{'source': {'id': None, 'name': 'Yahoo Entertainment'},
   'author': 'Alexandra Canal',
   'title': 'Disney earnings miss estimates as streaming losses narrow, parks soar - Yahoo Finance',
   'description': "Disney reported quarterly earnings after the bell on Wednesday. Here's what to know.",
   'url': 'https://finance.yahoo.com/news/disney-earnings-second-quarter-2023-may-10-200858196.html',
   'urlToImage': 'https://s.yimg.com/ny/api/res/1.2/1tAF0bgbcfKqo2lunFE0PA--/YXBwaWQ9aGlnaGxhbmRlcjt3PTEyMDA7aD04MDA-/https://s.yimg.com/os/creatr-uploaded-images/2023-05/b42c5b60-eddb-11ed-82fe-2902f930ab12',
   'publishedAt': '2023-05-10T20:08:58Z',
   'content': 'Disney (DIS) reported quarterly results after the bell on Wednesday that showed earnings per share missed estimates by a penny while streaming losses narrowed as the company continues efforts to slas… [+4535 chars]'},
  {'source': {'id': 'the-washington-post', 'name': 'The Washington

## Conclusions

The API is well documented and easy to use, though the lack of full article content is disappointing.  It also appears that NPR is not an available source, which is surprising.
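
To check for a particular source such as NPR, one option is to search the list returned by the client's `get_sources` method. The sketch below works on a response dictionary with a `sources` list of `id`/`name` entries; the helper name and the sample data are hypothetical, and the real call is left commented out since it needs an API key.

```python
def find_sources(sources_response, name_fragment):
    """Return the sources whose id or name contains `name_fragment` (case-insensitive)."""
    fragment = name_fragment.lower()
    matches = []
    for source in sources_response.get("sources", []):
        haystack = f"{source.get('id') or ''} {source.get('name') or ''}".lower()
        if fragment in haystack:
            matches.append(source)
    return matches


# Hypothetical response, trimmed to the two fields used here
sample_response = {
    "status": "ok",
    "sources": [
        {"id": "the-washington-post", "name": "The Washington Post"},
        {"id": "abc-news", "name": "ABC News"},
    ],
}
print(find_sources(sample_response, "npr"))  # []

# With the real client:
# sources_response = newsapi.get_sources(language="en", country="us")
# find_sources(sources_response, "npr")
```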

# Google News

The [Google News API](https://github.com/Iceloof/GoogleNews) is basically a Python wrapper for searching the news on Google.  It's pretty intuitive to work with and provides real-time results with no API key or subscription required.  There are two slightly different options: one searches [news.google.com](https://news.google.com) and the other searches Google and returns results from the News tab.

## Instantiate

While setting up the object, you can specify some preferences about search results.  These don't seem to be very well documented, but there are some examples in the [README](https://github.com/Iceloof/GoogleNews) and I also found a [guide on Medium](https://medium.com/analytics-vidhya/googlenews-api-live-news-from-google-news-using-python-b50272f0a8f0).  Here, I'll ask for English results from the US, limited to the last 30 days.


In [50]:
from GoogleNews import GoogleNews

googlenews = GoogleNews(lang="en", region="US", period="30d")


## `get_news()`

The `get_news` method queries [news.google.com](https://news.google.com).  To turn the results into a pandas dataframe, simply pass them to `pd.DataFrame`.

In [53]:
import pandas as pd

googlenews.get_news("Bridgestone")
result = googlenews.result()
result_df = pd.DataFrame(result)
result_df

Unnamed: 0,title,media,date,datetime,desc,link,img,site
0,Vanderbilt moves Friday’s commencement ceremon...,WSMV,9 mins ago,2023-05-11 15:36:57.465540,Vanderbilt University has moved Friday's under...,https://www.wsmv.com/2023/05/11/vanderbilt-mov...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////...",
1,Bridgestone funding guayule R&D at U. of Arizona,Tire Business,2 hours ago,2023-05-11 13:45:58.959690,Bridgestone Americas is backing a $70 million ...,https://www.tirebusiness.com/manufacturers/bri...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////...",
2,Low Rolling Resistance Truck And Bus Radial Ti...,Cottonwood Holladay Journal,6 hours ago,2023-05-11 09:45:58.970879,Low Rolling Resistance Truck And Bus Radial Ti...,https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////...",
3,Intelligent Rubber Track Market 2023 Trending ...,Cottonwood Holladay Journal,7 hours ago,2023-05-11 08:45:58.980556,"Between 2023 and 2029, MarketsandResearch.biz ...",https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////...",
4,Diagonal Tire Market Current Scope 2023 – Brid...,Cottonwood Holladay Journal,8 hours ago,2023-05-11 07:45:58.991063,The most recent research study namely Diagonal...,https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////...",
...,...,...,...,...,...,...,...,...
205,Firestone's new FD694 drive radial truck tire ...,Fleet Equipment Magazine,Apr 17,NaT,,news.google.com/./articles/CBMiRGh0dHBzOi8vd3d...,https://encrypted-tbn3.gstatic.com/faviconV2?u...,
206,Time to change the tires – Sheridan Media,Sheridan Media,Yesterday,2023-05-10 15:53:44.481373,,news.google.com/./articles/CBMiP2h0dHBzOi8vc2h...,https://encrypted-tbn3.gstatic.com/faviconV2?u...,
207,Black Keys are playing a free show in Nashvill...,Tennessean,2 days ago,2023-05-09 15:53:44.481608,,news.google.com/./articles/CBMikQFodHRwczovL3d...,https://encrypted-tbn3.gstatic.com/faviconV2?u...,
208,Global Aircraft Tires Market Size/Share Projec...,GlobeNewswire,2 days ago,2023-05-09 15:53:44.481846,,news.google.com/./articles/CBMixgFodHRwczovL3d...,https://lh3.googleusercontent.com/mJI4NoxATIZq...,


These results look quite relevant and very recent; some were published within the last couple of hours.  I cannot programmatically access the full content, though each result includes a direct link to the article.
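
Some of the links in the later rows above come back without a scheme (e.g. `news.google.com/./articles/...`), which would trip up anything that expects a full URL. A minimal sketch of a fix is below; the helper name is my own, and I'm assuming that prefixing `https://` is enough to make those links resolvable.

```python
def normalize_link(link):
    """Prefix scheme-less links with https:// and leave full URLs untouched."""
    link = (link or "").strip()
    if not link or link.startswith(("http://", "https://")):
        return link
    return "https://" + link


print(normalize_link("news.google.com/./articles/abc"))  # https://news.google.com/./articles/abc
print(normalize_link("https://www.tirebusiness.com/x"))  # https://www.tirebusiness.com/x
```

On the dataframe this would be applied with `result_df["link"] = result_df["link"].map(normalize_link)`.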

## `search()`

The `search` method simply performs a Google search and returns results from the "News" tab, but otherwise works exactly the same.

In [58]:
googlenews = GoogleNews(lang="en", region="US", period="30d")
googlenews.search("Bridgestone")

result = googlenews.result()
result_df = pd.DataFrame(result)
result_df

Unnamed: 0,title,media,date,datetime,desc,link,img
0,Two Wheeler Tires Market Growing at a CAGR 73....,StreetBuzz,1 hour ago,2023-05-12 08:01:39.072845,"New Jersey, United States,- Mr Accuracy Report...",https://www.streetbuzz.pk/uncategorised/two-wh...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
1,Slick Tires Market Witness Highest Growth in n...,thenelsonpost,1 hour ago,2023-05-12 08:01:39.082340,Report Description: Global Market Vision added...,https://thenelsonpost.ca/news/292271/slick-tir...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
2,All Steel Radial Tyres Market 2023 Growing Dem...,Cottonwood Holladay Journal,6 hours ago,2023-05-12 03:01:39.091727,"global All Steel Radial Tyres market Size, Sta...",https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
3,Premium Tire Market Size and Forecast to 2030 ...,StreetBuzz,0 hours ago,2023-05-12 09:01:39.093074,Premium Tire Market Report Coverage: Key Growt...,https://www.streetbuzz.pk/uncategorised/premiu...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
4,U of A teams with Bridgestone to give desert r...,Arizona Daily Star,3 hours ago,2023-05-12 06:01:39.094459,The University of Arizona and tire maker Bridg...,https://tucson.com/news/local/u-of-a-teams-wit...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
5,Vanderbilt undergraduate commencement ceremony...,WSMV,4 hours ago,2023-05-12 05:01:39.095662,Vanderbilt University has moved Friday's under...,https://www.wsmv.com/video/2023/05/11/vanderbi...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
6,Automotive Green Tires Market size to grow at ...,The Stark County News,6 hours ago,2023-05-12 03:01:39.097058,"NEW YORK, May 11, 2023 /PRNewswire/ — Technavi...",https://countyenews.com/automotive-green-tires...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
7,Vanderbilt moves Friday’s commencement ceremon...,WSMV,7 hours ago,2023-05-12 02:01:39.098809,Vanderbilt University has moved Friday's under...,https://www.wsmv.com/2023/05/11/vanderbilt-mov...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
8,Bridgestone funding guayule R&D at U. of Arizona,Tire Business,0 hours ago,2023-05-12 09:01:39.100528,Bridgestone Americas is backing a $70 million ...,https://www.tirebusiness.com/manufacturers/bri...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
9,Low Rolling Resistance Truck And Bus Radial Ti...,Cottonwood Holladay Journal,3 hours ago,2023-05-12 06:01:39.101696,Low Rolling Resistance Truck And Bus Radial Ti...,https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."


Results are pretty similar here: still highly relevant, with no article content available, though each result does include a direct link.

### Pagination

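It looks like by default it only provides the first page of results.  Each subsequent page can be accessed via the `getpage` method.  Note that the rows shown below are identical to the first page, which suggests that `result()` returns the accumulated results rather than only the newly fetched page.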

In [60]:
googlenews.getpage(2)
result2 = googlenews.result()
result2_df = pd.DataFrame(result2)
result2_df

Unnamed: 0,title,media,date,datetime,desc,link,img
0,Two Wheeler Tires Market Growing at a CAGR 73....,StreetBuzz,1 hour ago,2023-05-12 08:01:39.072845,"New Jersey, United States,- Mr Accuracy Report...",https://www.streetbuzz.pk/uncategorised/two-wh...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
1,Slick Tires Market Witness Highest Growth in n...,thenelsonpost,1 hour ago,2023-05-12 08:01:39.082340,Report Description: Global Market Vision added...,https://thenelsonpost.ca/news/292271/slick-tir...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
2,All Steel Radial Tyres Market 2023 Growing Dem...,Cottonwood Holladay Journal,6 hours ago,2023-05-12 03:01:39.091727,"global All Steel Radial Tyres market Size, Sta...",https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
3,Premium Tire Market Size and Forecast to 2030 ...,StreetBuzz,0 hours ago,2023-05-12 09:01:39.093074,Premium Tire Market Report Coverage: Key Growt...,https://www.streetbuzz.pk/uncategorised/premiu...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
4,U of A teams with Bridgestone to give desert r...,Arizona Daily Star,3 hours ago,2023-05-12 06:01:39.094459,The University of Arizona and tire maker Bridg...,https://tucson.com/news/local/u-of-a-teams-wit...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
5,Vanderbilt undergraduate commencement ceremony...,WSMV,4 hours ago,2023-05-12 05:01:39.095662,Vanderbilt University has moved Friday's under...,https://www.wsmv.com/video/2023/05/11/vanderbi...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
6,Automotive Green Tires Market size to grow at ...,The Stark County News,6 hours ago,2023-05-12 03:01:39.097058,"NEW YORK, May 11, 2023 /PRNewswire/ — Technavi...",https://countyenews.com/automotive-green-tires...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
7,Vanderbilt moves Friday’s commencement ceremon...,WSMV,7 hours ago,2023-05-12 02:01:39.098809,Vanderbilt University has moved Friday's under...,https://www.wsmv.com/2023/05/11/vanderbilt-mov...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
8,Bridgestone funding guayule R&D at U. of Arizona,Tire Business,0 hours ago,2023-05-12 09:01:39.100528,Bridgestone Americas is backing a $70 million ...,https://www.tirebusiness.com/manufacturers/bri...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
9,Low Rolling Resistance Truck And Bus Radial Ti...,Cottonwood Holladay Journal,3 hours ago,2023-05-12 06:01:39.101696,Low Rolling Resistance Truck And Bus Radial Ti...,https://www.cottonwoodholladayjournal.com/2023...,"data:image/gif;base64,R0lGODlhAQABAIAAAP//////..."
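
Building on the pagination call above, here is a sketch of collecting several pages and dropping repeated links. It only relies on the `getpage` and `result` methods used in this notebook; the helper name is my own, and it assumes that `result()` returns everything fetched so far as a list of dictionaries with a `link` key.

```python
def collect_pages(client, last_page):
    """Fetch pages 2..last_page and return the results de-duplicated by link."""
    for page in range(2, last_page + 1):
        client.getpage(page)

    seen = set()
    unique = []
    for item in client.result():
        link = item.get("link")
        if link in seen:
            continue  # the same article was already collected
        seen.add(link)
        unique.append(item)
    return unique


# With the real client (after calling googlenews.search("Bridgestone")):
# all_results_df = pd.DataFrame(collect_pages(googlenews, last_page=3))
```

Note that this de-duplicates on the exact link only, so the same story published under two different URLs would still appear twice.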


## Conclusions

The [Google News API](https://github.com/Iceloof/GoogleNews) is far more intuitive to use since it simply provides programmatic access to Google search.  It is free, with no API key required, and you have access to real-time results.  There is no programmatic access to full article content, but direct links to the articles are provided.

# Newspaper3k

[Newspaper3k](https://github.com/codelucas/newspaper) is not a news API but rather a library which provides content scraping from websites, specifically for "extracting and curating articles".  In combination with the article links provided by a news API, it has the potential to be very powerful.  

## Parse

To access an article, you simply provide the direct url, use the `download` method to download, and then use the `parse` method.  Once parsed, you have access to many of the components including `authors`, `publish_date`, `top_image`, and even full text of the article via `text`.  

In [82]:
from newspaper import Article

# Instantiate the article from a result link
url = result_df.link[8]
article = Article(url)

# Download the page
article.download()

# Parse the article contents
article.parse()

In [83]:
msg = f"""
Authors: {article.authors}
Publish Date: {article.publish_date}
Top Image Link: {article.top_image}
Full Text: {article.text}
"""

print(msg)


Authors: ['Tire Business']
Publish Date: 2023-05-11 13:30:30-04:00
Top Image Link: https://s3-prod.tirebusiness.com/s3fs-public/styles/1200x630/public/Ogden_UnivArizona-main_i.jpg?h=b36a5571
Full Text: TUCSON, Ariz. — Bridgestone Americas Inc. is committing $35 million to help fund research at the University of Arizona into the viability of the guayule shrub as a potential natural rubber source.

The funding — Bridgestone's contribution plus a $35 million grant from the U.S. Department of Agriculture (USDA) — will help the university's Department of Chemical and Environmental Engineering conduct research over the coming five years into the development and refinement of growing guayule with "climate-smart" practices, according to Kim Ogden, head of the department.

The research project and the funding were disclosed May 8 by Robert Bonnie, USDA undersecretary for farm production and conservation, at a news conference held at the University of Arizona's Controlled Environment Agricultur
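
To combine this with the links from a news API, the download-and-parse steps above can be wrapped in a loop that tolerates failures. This is only a sketch: the helper name is my own, `article_factory` is expected to be something like newspaper's `Article` class (any callable returning an object with `download`, `parse`, and `text` will do), and the broad `except` is a simplification for exploration.

```python
def fetch_full_text(urls, article_factory):
    """Return a dict mapping each URL to its article text, or None on failure."""
    texts = {}
    for url in urls:
        article = article_factory(url)
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: keep going if one article fails
            print(f"Skipping {url}: {exc}")
            texts[url] = None
            continue
        texts[url] = article.text
    return texts


# With the objects from this notebook:
# from newspaper import Article
# full_texts = fetch_full_text(result_df["link"].head(5), Article)
```

Passing the class in as an argument keeps the helper independent of the scraping library, which also makes it easy to try out with a stand-in object.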

## NLP

It looks like there are also some built-in NLP capabilities, like keywords and summarization.  

**Note**: in order to get this to work, I had to install the NLP dependencies with the following commands on macOS:

```zsh
% brew install libxml2 libxslt

% brew install libtiff libjpeg webp little-cms2

% pip3 install newspaper3k

% curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
```

In [84]:
article.nlp()

msg = f"""
Keywords: {article.keywords}
Summary: {article.summary}
"""

print(msg)


Keywords: ['arizona', 'university', 'guayule', 'rubber', 'natural', 'bridgestone', 'rd', 'usda', 'funding', 'ogden', 'dierig', 'research']
Summary: TUCSON, Ariz. — Bridgestone Americas Inc. is committing $35 million to help fund research at the University of Arizona into the viability of the guayule shrub as a potential natural rubber source.
Bridgestone has been working with guayule in Arizona since 2012 at the company's 280-acre farm in Eloy, about halfway between Phoenix and Tucson.
The Emergency Rubber Act, passed by Congress in 1942, directed scientists to find alternative sources for rubber, Dierig said, and guayule was in the mix.
"Finding research-based solutions that have a global impact is an ideal expression of the University of Arizona's mission," University of Arizona President Robert Robbins said.
I look forward to seeing new, sustainable tires on the road soon, knowing the University of Arizona helped get them there."



## Newspaper

You can also access an entire "newspaper" from a top level url and navigate through headlines, categories, etc.

In [34]:
import newspaper

wp_paper = newspaper.build("https://www.washingtonpost.com")

In [36]:
for article in wp_paper.articles[0:3]:
    article.download()
    article.parse()
    article.nlp()
    print("Title:", article.title)
    print("Summary:", article.summary)
    print("\n\n")

Title: Biden faces major challenges with the end of the Title 42 border policy
Summary: “The border is not open.”Late Thursday, a federal judge in Florida temporarily blocked the Biden administration from quickly releasing migrants from crowded Border Patrol holding facilities.
The Border Patrol is already on track to exceed the 2.1 million apprehensions last year, federal data show.
Eight of nine border patrol sectors said their holding cells were stretched beyond their limits.
The Title 42 border policy expired at 11:59 p.m. (Video: Reuters)In Brownsville on Thursday, migrants streamed across the border to enter the U.S. before new immigration rules took effect.


Title: U.S. sees record migration influx as pandemic border restrictions lift
Summary: ArrowRight Illegal border crossings have topped 10,000 per day this week, the highest levels ever, as the Title 42 border policy expired at 11:59 p.m. Thursday.
The Title 42 border policy expired at 11:59 p.m. (Video: Reuters)The White