# Purpose

As the filename implies, this notebook checks whether I can capture the links to the full articles from the News API. The links will be included in the tweet body so readers can view the full article if they wish.

In [6]:
import re
import requests

from env import API_KEY

I'll make a request to the News API and see if I get a good response.

In [2]:
url = f"https://newsapi.org/v2/top-headlines?country=us&apiKey={API_KEY}"

response = requests.get(url)

response

<Response [200]>

Looks good! Let's see what's inside the JSON object.

In [3]:
response.json()

{'status': 'ok',
 'totalResults': 35,
 'articles': [{'source': {'id': 'politico', 'name': 'Politico'},
   'author': None,
   'title': 'US declines to rule out hitting targets in Iran, Jake Sullivan says - POLITICO',
   'description': 'He said that American military action in the region will continue.',
   'url': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490',
   'urlToImage': 'https://static.politico.com/9d/49/57de150f49e0ba328c0e20d028cf/iraq-us-airstrikes-91401.jpg',
   'publishedAt': '2024-02-04T20:20:16Z',
   'content': 'When asked by Fox News Sunday host Shannon Bream if strikes within Iran are on or off the table, National Security Council spokesperson John Kirby said that what you saw on Friday night was just the … [+2367 chars]'},
  {'source': {'id': 'the-wall-street-journal',
    'name': 'The Wall Street Journal'},
   'author': 'Joseph De Avila',
   'title': "Facebook Turns 20: From Mark Zuckerberg's Harvard Dorm Room to the Metaverse - The W
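
A 200 response only confirms that the HTTP request succeeded, so it can be worth checking the payload's own `status` field before using the articles. The helper below is a minimal sketch: `extract_articles` is a hypothetical name, and the `message` key read in the error branch is an assumption about how the API reports errors, not something shown in the output above.

```python
def extract_articles(payload):
    """Return the list of articles from a parsed News API payload.

    Raises ValueError when the payload does not report status 'ok'.
    """
    if payload.get('status') != 'ok':
        # 'message' is assumed to hold the API's error description.
        raise ValueError(f"News API error: {payload.get('message', 'unknown error')}")
    # Fall back to an empty list if the 'articles' key is missing.
    return payload.get('articles', [])


# Hand-written payload shaped like the response shown above.
sample = {
    'status': 'ok',
    'totalResults': 1,
    'articles': [{'source': {'id': 'politico', 'name': 'Politico'},
                  'title': 'US declines to rule out hitting targets in Iran, Jake Sullivan says - POLITICO',
                  'url': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490'}],
}
print(len(extract_articles(sample)))
```

In this notebook it would be called as `extract_articles(response.json())`.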

Easy enough. I think I'll grab the source name and the URL for each article. Then I can format my tweets like this: Source - Headline. [Link]

In [5]:
for article in response.json()['articles']:
    print(article['source']['name'])
    print(article['url'])
    print()

Politico
https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490

The Wall Street Journal
https://www.wsj.com/tech/facebook-turns-20-from-mark-zuckerbergs-harvard-dorm-room-to-the-metaverse-817a73da

USA Today
https://www.usatoday.com/story/news/politics/2024/02/04/gov-noem-banned-from-south-dakota-reservation-following-border-remarks/72472410007/

[Removed]
https://removed.com

The Hill
https://thehill.com/homenews/state-watch/4447433-former-trump-official-dies-after-being-shot-in-dc-carjacking-spree/

BBC News
https://www.bbc.com/news/world-africa-68196412

CNN
https://www.cnn.com/2024/02/04/us/california-atmospheric-river-flooding/index.html

The Times of Israel
https://www.timesofisrael.com/after-ben-gvir-pans-biden-netanyahu-lauds-white-house-support-during-war/

FRANCE 24 English
https://www.france24.com/en/environment/20240204-air-pollution-factor-spiking-cancer-cases-who-says

Deadline
http://deadline.com/2024/02/snl-ayo-edebiri-seemingly-addresses-contr
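
The `[Removed]` source paired with the `https://removed.com` link in the output above looks like a placeholder for an article that is no longer available. One option is to filter such entries out explicitly. This is only a sketch: `is_removed` is a hypothetical helper, and matching on these two literal values is an assumption based solely on the single placeholder seen here.

```python
def is_removed(article):
    """Return True for placeholder entries like the '[Removed]' one above."""
    return (article['source']['name'] == '[Removed]'
            or article['url'] == 'https://removed.com')


# Two hand-written entries mirroring the printed output above.
sample_articles = [
    {'source': {'name': 'Politico'},
     'url': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490'},
    {'source': {'name': '[Removed]'}, 'url': 'https://removed.com'},
]

# Keep only the entries that are not placeholders.
kept = [a for a in sample_articles if not is_removed(a)]
for article in kept:
    print(article['source']['name'])
    print(article['url'])
```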

That was the easy part. The difficult part is reworking my scripts to accommodate a new data structure. I will mirror the JSON structure: each article will be its own dictionary in the list, containing the source, link, and headline information.

I will re-format the existing function in this notebook to allow for the collection of this additional information.

In [10]:
def get_headlines(key=API_KEY):
    """Retrieves the latest breaking news headlines from the News API.

    Uses regex to remove the source from the end of the headline.

    Returns a list of dictionaries (title, source, link), whose titles are
    fed into ChatGPT for meme creation."""

    url = f"https://newsapi.org/v2/top-headlines?country=us&apiKey={key}"
    response = requests.get(url)

    articles = []
    # Capture everything before a trailing " - Source" suffix of up to 25 characters.
    regexp = r"^(.*?)\s\-\s.{0,25}$"

    for article in response.json()['articles']:
        try:
            title = re.search(regexp, article['title']).groups()[0]
            source = article['source']['name']
            link = article['url']
            articles.append({'title': title, 'source': source, 'link': link})
        except (AttributeError, KeyError, TypeError):
            # No regex match (e.g. a '[Removed]' placeholder) or a missing field.
            print('Article removed. Continuing.')
            continue

    print('All headlines retrieved! Moving on to descriptions..')
    return articles
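
To see what the regular expression does in isolation, the sketch below runs the same pattern on two hand-written titles: one taken from the response earlier in this notebook, and a bare `[Removed]` string, which is an assumption about what a placeholder title looks like.

```python
import re

# Same pattern as in get_headlines above.
regexp = r"^(.*?)\s\-\s.{0,25}$"

samples = [
    'US declines to rule out hitting targets in Iran, Jake Sullivan says - POLITICO',
    '[Removed]',
]

for raw in samples:
    match = re.search(regexp, raw)
    # re.search returns None when there is no " - suffix" to strip,
    # which is what sends get_headlines into its except branch.
    cleaned = match.group(1) if match else None
    print(repr(cleaned))
```

Because the capture group is lazy, the pattern cuts at the first spaced hyphen that is followed by 25 characters or fewer. A headline that contains its own ` - ` close to the end would therefore lose more than just the source name.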

In [11]:
test_art = get_headlines()

test_art

Article removed. Continuing.
All headlines retrieved! Moving on to descriptions..


[{'title': 'US declines to rule out hitting targets in Iran, Jake Sullivan says',
  'source': 'Politico',
  'link': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490'},
 {'title': "Facebook Turns 20: From Mark Zuckerberg's Harvard Dorm Room to the Metaverse",
  'source': 'The Wall Street Journal',
  'link': 'https://www.wsj.com/tech/facebook-turns-20-from-mark-zuckerbergs-harvard-dorm-room-to-the-metaverse-817a73da'},
 {'title': 'South Dakota tribe banned Gov. Noem from reservation over comments on U.S.-Mexico border',
  'source': 'USA Today',
  'link': 'https://www.usatoday.com/story/news/politics/2024/02/04/gov-noem-banned-from-south-dakota-reservation-following-border-remarks/72472410007/'},
 {'title': 'Former Trump official dies after being shot in DC carjacking spree',
  'source': 'The Hill',
  'link': 'https://thehill.com/homenews/state-watch/4447433-former-trump-official-dies-after-being-shot-in-dc-carjacking-spree/'},
 {'title': "Hage Geingob deat

Great success. This has cascading consequences in my existing pipeline. The function now returns a list of dictionaries instead of a simple list of headline strings, so the later functions in the pipeline need to access the title of each article explicitly. Looking at my pipeline, this affects the get_meme_desc function immediately and will require me to alter the two following scripts as well.
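
One low-effort way to adapt a downstream function that still expects plain headline strings is to pull the titles out of the new structure first. This is a minimal sketch with hand-written sample data. How `get_meme_desc` actually consumes its input is not shown in this notebook, so the list comprehension only illustrates the access pattern.

```python
# Sample shaped like the output of get_headlines above.
articles = [
    {'title': 'US declines to rule out hitting targets in Iran, Jake Sullivan says',
     'source': 'Politico',
     'link': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490'},
    {'title': 'Former Trump official dies after being shot in DC carjacking spree',
     'source': 'The Hill',
     'link': 'https://thehill.com/homenews/state-watch/4447433-former-trump-official-dies-after-being-shot-in-dc-carjacking-spree/'},
]

# Recover the old "list of headline strings" shape from the list of dicts.
titles = [article['title'] for article in articles]
print(titles)
```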

Let me test some string concatenation shenanigans to make sure I can properly edit the new tweets to contain the additional bits of information.

In [17]:
hashtags = '#currentevents #breaking #memenews'

for art in test_art:
    print()
    full_string = f"{art['source']} | {art['title']} {hashtags}\n\n{art['link']}"
    print(full_string)
    print(len(full_string))


Politico | US declines to rule out hitting targets in Iran, Jake Sullivan says #currentevents #breaking #memenews

https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490
194

The Wall Street Journal | Facebook Turns 20: From Mark Zuckerberg's Harvard Dorm Room to the Metaverse #currentevents #breaking #memenews

https://www.wsj.com/tech/facebook-turns-20-from-mark-zuckerbergs-harvard-dorm-room-to-the-metaverse-817a73da
247

USA Today | South Dakota tribe banned Gov. Noem from reservation over comments on U.S.-Mexico border #currentevents #breaking #memenews

https://www.usatoday.com/story/news/politics/2024/02/04/gov-noem-banned-from-south-dakota-reservation-following-border-remarks/72472410007/
276

The Hill | Former Trump official dies after being shot in DC carjacking spree #currentevents #breaking #memenews

https://thehill.com/homenews/state-watch/4447433-former-trump-official-dies-after-being-shot-in-dc-carjacking-spree/
230

BBC News | Hage Geingob deat

This output is great. I've slightly altered it in the official script, but this investigation let me weigh my options and settle on a tweet format I like.
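
Some of the printed lengths above come close to the 280-character tweet limit, so a guard may be useful before posting. The sketch below is an illustration under stated assumptions, not the official script: `build_tweet` and `MAX_LEN` are hypothetical names, and the length is measured with a plain `len`, which is conservative if the platform counts links at a shorter fixed length.

```python
MAX_LEN = 280  # assumed standard tweet length limit
HASHTAGS = '#currentevents #breaking #memenews'


def build_tweet(art, max_len=MAX_LEN):
    """Format one article dict as a tweet, dropping hashtags if it is too long."""
    with_tags = f"{art['source']} | {art['title']} {HASHTAGS}\n\n{art['link']}"
    if len(with_tags) <= max_len:
        return with_tags
    # Fall back to a shorter form without hashtags.
    return f"{art['source']} | {art['title']}\n\n{art['link']}"


art = {'title': 'US declines to rule out hitting targets in Iran, Jake Sullivan says',
       'source': 'Politico',
       'link': 'https://www.politico.com/news/2024/02/04/us-strikes-iran-jake-sullivan-00139490'}
tweet = build_tweet(art)
print(tweet)
print(len(tweet))
```

The fallback only removes the hashtags, so a very long headline or link could still exceed the limit and would need further handling.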