In [1]:
from requests import get
import pandas as pd
import time
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import re


# Using a sitemap

Larger sites will usually have a sitemap: a list of URLs that makes it easier for web crawlers to index pages. If it's available, you can usually locate a site's sitemap by navigating to `[domain name here]/robots.txt`. For instance, the `robots.txt` page for `apnews.com` is `https://apnews.com/robots.txt`. If you go to that URL, you'll see a section listing sitemaps. Follow the first link to https://apnews.com/ap-sitemap.xml and you'll land on the top-level sitemap page.

This is an XML-formatted document, which you might notice looks pretty similar to the HTML code we've viewed in the past. We won't be able to use the SelectorGadget on these pages, but we will be able to use BeautifulSoup to parse them and extract information using expressions that are very similar to the ones we use to parse an HTML document.



Following one of the links on this page will take you to another sitemap listing all of the AP articles published in a single month. So you can think of the structure of this site as a hierarchy like this:

![image.png](https://mermaid.ink/img/pako:eNpl0UFrwjAUB_CvEt7Ziu10sB4G1U5QcJftZnZ4tK820DQlfcUN8bsvRgkt5pT88s__kHeBwpQEKVSNORc1WhbfuWyFW_kRu6hXTBq7-a9ufkQUvYtspFGySJbx4s3f3h9lPvRxzCyroiERT3wbPHl4fuP1c-eo0kc2T5F4FFn79l1of5n4Pvjy4Rvvh-CriX8Gf3UOM9BkNarS_dLllpPANWmSkLptZSz1LEG2V5fEgc3XX1tAynagGVgznGpIK2x6dxq6EplyhSeLOiiVio093Mfgp3H9Bz2retE?type=png)

The Associated Press is a massive website, but if we understand how to navigate its sitemap, we can gather pretty much anything we want from it very efficiently.
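The `robots.txt` lookup described above can also be done in code. The sketch below is a minimal illustration that works on a hand-written, hypothetical `robots.txt` body (the `example.com` URLs and the `find_sitemaps` helper are assumptions, not anything taken from the AP site), so it runs without sending a request:

```python
# Hypothetical robots.txt content, shaped like what a site might serve.
robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap-index.xml
sitemap: https://example.com/news-sitemap.xml
"""

def find_sitemaps(robots_text):
    """Return the URL from every 'Sitemap:' line in a robots.txt body."""
    sitemaps = []
    for line in robots_text.splitlines():
        # split only on the first colon so the URL's own "https:" survives
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps

sitemap_links = find_sitemaps(robots_txt)
print(sitemap_links)
```

For a live site you would pass in the text of the downloaded `robots.txt` page instead of the sample string.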

We'll pull a single sitemap for a single month, but note that you could easily write a loop to navigate through every sitemap and extract every link. Go ahead and take a look at the map for <a href="https://apnews.com/ap-sitemap-202410.xml">stories from October 2024 here</a> (this will probably load slowly!).

You should see a long page of raw markup.

This is XML formatting. It's a close relative of HTML, and you might recognize some clear similarities (especially the use of nested `<tags>` to hold data), but XML doesn't provide any visual formatting. It's just a way to store and share data.
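To make the shape of a sitemap concrete, here is a small hand-written fragment in the standard sitemap layout. The two `example.com` entries are hypothetical, and the fragment is far shorter than a real monthly sitemap. The standard-library `xml.etree.ElementTree` module is used here only so the sketch runs without a network request or an extra parser; the rest of this walkthrough uses BeautifulSoup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap fragment: each <url> entry wraps one <loc> link.
sample_sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/article/first-story</loc>
    <lastmod>2024-10-01T12:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/hub/some-topic</loc>
    <lastmod>2024-10-02T08:30:00Z</lastmod>
  </url>
</urlset>
"""

# ElementTree prefixes every tag with the namespace declared on <urlset>.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

root = ET.fromstring(sample_sitemap)
sample_locs = [loc.text for loc in root.iter(NS + "loc")]
print(sample_locs)
```

The nested tags are the whole story: the data we care about is just the text inside each `<loc>` node.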

Luckily, we can still parse XML data using BeautifulSoup, and this output is ultimately probably easier to navigate because it's generally much simpler. We'll start by sending a get request to the URL for this sitemap, and then we'll parse the result in *almost* the same way we've been parsing HTML code. The only difference here is that we'll need to use the `features='xml'` optional argument for the `BeautifulSoup` function.


In [2]:
# get the sitemap
october_sitemap = get('https://apnews.com/ap-sitemap-202410.xml')
# parse the content as an XML document
sitemap = BeautifulSoup(october_sitemap.content, features="xml") # note the features="xml" option!

The link to each article for this month is stored in a `loc` node. So we can get all of the URLs with a simple selector expression: 

In [3]:
# select all <loc> nodes
url_nodes = sitemap.select('loc')

# loop through the entire list and just get the link
urls = [i.get_text() for i in url_nodes]

# print the last 5
urls[-5:]


['https://apnews.com/article/nba-paul-george-sixers-a732b5c52cbb2c2363a1205c3458bfa7',
 'https://apnews.com/article/ap-top-photos-of-week-4c31f109609a1ca2aa5165387c875259',
 'https://apnews.com/baltimore-orioles-mlb-playoffs-9d77cb92b2594884a73169f0e8ad7263',
 'https://apnews.com/article/mexico-foreign-investment-amazon-natural-gas-cruise-ships-ac2476ac1d76c9ff6246ade660aa7882',
 'https://apnews.com/article/fifa-israel-palestinian-soccer-5b78239c70c568877c89695e2dcd0721']

## Filtering with a regular expression

After inspecting the sitemap, you might notice that there are some slight differences in the structure of each URL. Links with `hub` in their path, like https://apnews.com/hub/mideast-wars, will take you to a landing page with links to additional articles about a topic, whereas links with `article` in the path take you to an actual news story. Let's assume for my project that I only want the **articles** and not any of the links to videos or 'hubs'.

I can use a regular expression to detect the URLs that have "article" as part of their path and create a list with exclusively "article" links. We'll talk a bit more about regular expressions at a later date, but for now it's sufficient to say they're a way of flexibly identifying strings of text. The regular expression we need here is really simple: we just need something that catches the string "/article/". The code below searches for that string in a URL, and will return `True` if it's present and `False` if it's absent:

In [11]:
url = "https://apnews.com/article/fifa-israel-palestinian-soccer-5b78239c70c568877c89695e2dcd0721"
bool(re.search("/article/", url))

True

In [12]:
url = 'https://apnews.com/hub/mideast-wars'
bool(re.search("/article/", url))

False

So, we can just use a list comprehension to filter out URLs that don't link to an article:

In [13]:
article_urls  = [i for i in urls if bool(re.search("/article/", i)) ]

len(article_urls) # a lot !

8489

That's not an impossible number to scrape or anything, but we'll limit our results a little further by trying to identify links with text that relates to economic coverage. Since the AP link includes some description of the article's topic in the URL itself, we can use this to only get articles related to a particular topic. We'll use another regular expression, but now we'll only grab URLs if they include the string "inflation".

In [119]:
inflation_urls = [i for i in article_urls if bool(re.search("inflation", i)) ]
len(inflation_urls) # much more reasonable

26

Regular expressions offer a lot of additional flexibility for searching texts. For instance, we can use the "|" symbol in our expression as a logical OR operator. So "this|that" would search for the string "this" OR the string "that". So, if we wanted all the articles mentioning inflation OR unemployment, we could just write:

In [120]:
inflation_or_unemployment = [i for i in article_urls if bool(re.search("inflation|unemployment", i)) ]
len(inflation_or_unemployment) # a few more than "inflation" alone

34
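When the same patterns are applied to many URLs, it can be tidier to compile them once and reuse them. The sketch below combines the "/article/" check and the topic check on a short list of hypothetical `example.com` URLs. The `re.IGNORECASE` flag is an optional extra that is not used elsewhere in this walkthrough; it is included on the assumption that some slugs might contain capital letters.

```python
import re

# Hypothetical URLs for illustration only.
sample_urls = [
    "https://example.com/article/inflation-rates-fed-0001",
    "https://example.com/article/Unemployment-claims-0002",
    "https://example.com/article/nba-playoffs-0003",
    "https://example.com/hub/inflation",
]

# compile each pattern once, then reuse it for every URL
article_pattern = re.compile(r"/article/")
topic_pattern = re.compile(r"inflation|unemployment", flags=re.IGNORECASE)

matches = [
    u for u in sample_urls
    if article_pattern.search(u) and topic_pattern.search(u)
]
print(matches)
```

The hub link is dropped by the first pattern and the NBA story by the second, leaving only the two economic articles.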

Now, with our list of urls in hand, we just need a function that will take a URL and output a formatted dictionary from the response. I've gone ahead and written that function here. 

In [122]:
def ap_parser(url):
    """Download one AP article and return its key fields as a dictionary."""
    site = get(url)
    content = BeautifulSoup(site.content, "html.parser")
    # the publication time is stored as milliseconds since the unix epoch
    timestamp = int(content.select_one('.Page-content bsp-timestamp').get('data-timestamp'))
    output = {
        'url' : site.url,
        'tags' : ', '.join([i.get('content') for i in content.select('meta[property="article:tag"]')]),
        'section' : ' '.join([i.get('content') for i in content.select('meta[property="article:section"]')]),
        'authors': ', '.join([i.get_text() for i in content.select('.Page-authors .Link')]),
        'article_text' : ' '.join([i.get_text() for i in content.select('.Page-content .RichTextBody p')]),
        'pubdate' : time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(timestamp/1000)), # convert milliseconds to seconds, then format in local time
        'headline' : ' '.join([i.get_text() for i in content.select('.Page-headline')])
    }
    return output
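The `pubdate` line is the least obvious part of this function, so here is the conversion on its own. This is a minimal sketch: the helper name `format_epoch_ms` is hypothetical, and it formats in UTC rather than local time (unlike `ap_parser`, which uses `time.localtime`) so that the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone

def format_epoch_ms(timestamp_ms):
    """Convert milliseconds since the unix epoch to a 'YYYY-MM-DD HH:MM:SS' string in UTC."""
    seconds = timestamp_ms / 1000  # epoch helpers expect seconds, not milliseconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

formatted = format_epoch_ms(1704067200000)
print(formatted)  # 2024-01-01 00:00:00
```

Forgetting to divide by 1000 is the usual mistake here: the raw value would be read as a date tens of thousands of years in the future.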

So the last step is just to loop through the list of URLs to collect our data.

In [123]:
articles_list = []
for i in inflation_urls:
    print(i, end='\r')
    articles_list.append(ap_parser(i))
    time.sleep(.3)



https://apnews.com/article/stocks-markets-china-inflation-rates-44ade1d46d9dc7ed714934238e810f65e55d81df459792a038ea9e321800
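One caveat with a loop like this: if a single page is missing an element the parser expects (for example, if `select_one` finds nothing and returns `None`), the resulting exception stops the whole run. A possible pattern, sketched below, is to record failures and keep going. `scrape_all` and `fake_parser` are hypothetical names, and the stand-in parser is used in place of `ap_parser` so the sketch runs without any requests.

```python
import time

def scrape_all(urls, parser, delay=0.3):
    """Apply `parser` to each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(parser(url))
        except Exception as err:  # deliberately broad: one bad page should not end the run
            failures.append((url, repr(err)))
        time.sleep(delay)  # pause between requests
    return results, failures

def fake_parser(url):
    """Stand-in for a real parser; fails on any URL containing 'broken'."""
    if "broken" in url:
        raise AttributeError("expected element not found")
    return {"url": url, "headline": "placeholder"}

demo_urls = [
    "https://example.com/article/ok-1",
    "https://example.com/article/broken-2",
    "https://example.com/article/ok-3",
]
results, failures = scrape_all(demo_urls, fake_parser, delay=0)
print(len(results), "parsed;", len(failures), "failed")
```

The `failures` list can be inspected afterwards to decide whether those pages need a different selector or can simply be skipped.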

And finally, we'll put our results into a data frame:

In [124]:
articles_df = pd.DataFrame(articles_list)
articles_df.head()

| | url | tags | section | authors | article_text | pubdate | headline |
|---|---|---|---|---|---|---|---|
| 0 | https://apnews.com/article/trump-economy-tarif... | Joe Biden, Paul Ashworth, General news, Inside... | Business | PAUL WISEMAN | WASHINGTON (AP) — President-elect Donald Trump... | 2024-12-23 08:40:04 | An analyst looks ahead to how the US economy m... |
| 1 | https://apnews.com/article/new-york-ny-inflati... | General news, Kathy Hochul, New York City Wire... | U.S. News | ANTHONY IZAGUIRRE | ALBANY, N.Y. (AP) — New Yorkers could get “Inf... | 2024-12-09 17:42:51 | New Yorkers could get ‘Inflation Refund’ check... |
| 2 | https://apnews.com/article/inflation-prices-ec... | Inflation, General news, Federal Reserve Syste... | Business | PAUL WISEMAN | WASHINGTON (AP) — Wholesale costs in the Unite... | 2024-12-12 09:08:25 | US wholesale inflation accelerated in November... |
| 3 | https://apnews.com/article/small-business-infl... | Inflation, Small business, U.S. Department of ... | Business | THE ASSOCIATED PRESS | A look at some of the key business events and ... | 2024-12-06 14:00:39 | Next Week: small business index, consumer pric... |
| 4 | https://apnews.com/article/stock-market-inflat... | National, Financial markets, Warren Buffett, G... | Business | ALEX VEIGA | Stock indexes closed mostly lower Tuesday as t... | 2024-12-31 17:39:38 | Stock market today: Wall Street indexes lose g... |


## Scaling up

We've got coverage of inflation for October 2024, but normally what we want to do is track coverage of a news story over a longer period of time. What if we wanted to track coverage of inflation through all of 2024? To start, we'd want to go back to that original sitemap at the top of the hierarchy. Then we would want to get a list of all monthly sitemaps for 2024, and then we'd want to visit each one, identify the relevant article urls, and then scrape all of the articles from each one.


![image.png](https://mermaid.ink/img/pako:eNpl0UFrwjAUB_CvEt7Ziu10sB4G1U5QcJftZnZ4tK820DQlfcUN8bsvRgkt5pT88s__kHeBwpQEKVSNORc1WhbfuWyFW_kRu6hXTBq7-a9ufkQUvYtspFGySJbx4s3f3h9lPvRxzCyroiERT3wbPHl4fuP1c-eo0kc2T5F4FFn79l1of5n4Pvjy4Rvvh-CriX8Gf3UOM9BkNarS_dLllpPANWmSkLptZSz1LEG2V5fEgc3XX1tAynagGVgznGpIK2x6dxq6EplyhSeLOiiVio093Mfgp3H9Bz2retE?type=png)



So, to scrape articles from multiple months, we can start with `ap-sitemap.xml`, then use that to identify sitemaps for each month, then use those sitemaps to identify articles across multiple dates.

We'll start by getting the entire list of sitemaps:

In [125]:
sitemap = get('https://apnews.com/ap-sitemap.xml')

overall_sitemap = BeautifulSoup(sitemap.content, features="xml") 
# select all <loc> nodes
overall_nodes = overall_sitemap.select('loc')
# loop through the entire list and just get the link
sitemap_urls = [i.get_text() for i in overall_nodes]

Now, we can use another regular expression. This time, we're going to identify all of the sitemaps for 2024:

In [None]:
# keep only the monthly sitemaps whose URL marks them as 2024
year24_urls = [i for i in sitemap_urls if bool(re.search('https://apnews.com/ap-sitemap-2024', i))]

year24_urls # looking at the result:
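If you wanted a range of months rather than a whole calendar year, one option is to capture the date stamp from each sitemap URL and compare it against a start and end month. The sketch below assumes the monthly sitemaps all follow the `ap-sitemap-YYYYMM.xml` naming seen in the October 2024 link. The list of URLs is a hand-written sample (the `latest` entry in particular is made up to show a non-matching name), and `sitemaps_between` is a hypothetical helper.

```python
import re

# Sample URLs following the naming pattern seen above; not fetched from the site.
sample_sitemap_urls = [
    "https://apnews.com/ap-sitemap-202312.xml",
    "https://apnews.com/ap-sitemap-202401.xml",
    "https://apnews.com/ap-sitemap-202410.xml",
    "https://apnews.com/ap-sitemap-latest.xml",
]

# capture the four-digit year and two-digit month at the end of the URL
month_pattern = re.compile(r"ap-sitemap-(\d{4})(\d{2})\.xml$")

def sitemaps_between(urls, start, end):
    """Keep sitemap URLs whose YYYYMM stamp falls between start and end (inclusive)."""
    selected = []
    for url in urls:
        match = month_pattern.search(url)
        # fixed-width digit strings compare correctly as plain strings
        if match and start <= match.group(1) + match.group(2) <= end:
            selected.append(url)
    return selected

sitemaps_2024 = sitemaps_between(sample_sitemap_urls, "202401", "202412")
print(sitemaps_2024)
```

URLs that do not match the pattern at all, like the `latest` entry, are skipped rather than raising an error.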

In [127]:
all_article_urls = []

# visit each 2024 monthly sitemap and collect its article links
for sitemap_url in year24_urls:
    monthly_sitemap = get(sitemap_url)
    sitemap = BeautifulSoup(monthly_sitemap.content, features="xml")
    url_nodes = sitemap.select('loc')
    urls = [i.get_text() for i in url_nodes]
    # keep only links to actual articles
    article_urls = [i for i in urls if bool(re.search("/article/", i))]
    all_article_urls.extend(article_urls)
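When links from several sitemaps are combined into one list, it may be worth checking for repeats before scraping, so that no page is requested twice. Whether the AP sitemaps actually overlap is not something established here; the sketch below only shows one way to drop duplicates while keeping the original order, using hypothetical URLs and a hypothetical helper name.

```python
def dedupe_keep_order(urls):
    """Remove repeated URLs while preserving first-seen order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(urls))

# Hypothetical combined list in which one link shows up twice.
combined = [
    "https://example.com/article/story-a",
    "https://example.com/article/story-b",
    "https://example.com/article/story-a",
    "https://example.com/article/story-c",
]

unique_urls = dedupe_keep_order(combined)
print(len(combined), "->", len(unique_urls))
```

Using `set()` would also remove duplicates, but it would not keep the links in their original order.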

    

In [128]:


select_urls = [i for i in all_article_urls if bool(re.search("inflation", i)) ]
len(select_urls)

464

Now, we can loop through this new list of articles and process the results. We'll just do the first twenty, but doing more is just a matter of removing the `[:20]` part of this code.

In [129]:
articles_list = []
for i in select_urls[:20]:
    print(i, end='\r')
    articles_list.append(ap_parser(i))
    time.sleep(.1)


articles24_df = pd.DataFrame(articles_list)

https://apnews.com/article/jobs-economy-unemployment-inflation-rates-federal-reserve-ad0fd064a35e8a93bc15eaf6ea982c1c9f

In [131]:
articles24_df

| | url | tags | section | authors | article_text | pubdate | headline |
|---|---|---|---|---|---|---|---|
| 0 | https://apnews.com/article/stock-market-tokyo-... | Government regulations, Federal Reserve System... | Business | STAN CHOE | NEW YORK (AP) — Wall Street closed its 10th wi... | 2024-01-12 17:21:35 | Stock market today: Wall Street closes out its... |
| 1 | https://apnews.com/article/federal-reserve-inf... | General news, Federal Reserve System, Inflation | Business | PAUL WISEMAN | WASHINGTON (AP) — The Federal Reserve’s policy... | 2024-01-03 20:24:55 | Federal Reserve minutes: Officials saw inflati... |
| 2 | https://apnews.com/article/federal-reserve-inf... | Government regulations, General news, AP Top N... | Business | CHRISTOPHER RUGABER | WASHINGTON (AP) — Chair Jerome Powell will ent... | 2024-01-29 14:39:43 | Inflation has slowed. Now the Federal Reserve ... |
| 3 | https://apnews.com/article/european-central-ba... | General news, Inflation, European Central Bank... | Business | DAVID MCHUGH | FRANKFURT, Germany (AP) — European Central Ban... | 2024-01-25 11:28:04 | Financial markets are jonesing for interest ra... |
| 4 | https://apnews.com/article/stock-market-today-... | Financial markets, General news, Financial/Bus... | Business | THE ASSOCIATED PRESS | Asian shares dropped Wednesday after Wall Stre... | 2024-01-02 23:21:56 | Stock market today: Asian markets track Wall S... |
| 5 | https://apnews.com/article/irs-tax-biden-infla... | Internal Revenue Service, General news | U.S. News | FATIMA HUSSEIN | WASHINGTON (AP) — The IRS says it has collecte... | 2024-01-12 08:02:15 | IRS says it collected $360 million more from r... |
| 6 | https://apnews.com/article/stock-markets-infla... | General news, Tokyo, International News, Inter... | Business | YURI KAGEYAMA | TOKYO (AP) — Asian shares were trading mostly ... | 2024-01-16 22:03:12 | Stock market today: Asian shares mostly fall a... |
| 7 | https://apnews.com/article/stock-market-rates-... | Financial markets, General news, Electric vehi... | Business | YURI KAGEYAMA | TOKYO (AP) — Asian stocks slipped on Thursday,... | 2024-01-03 22:41:54 | Stock market today: Asian shares slip, echoing... |
| 8 | https://apnews.com/article/imf-inflation-econo... | Economy, General news, Inflation, Internationa... | Business | PAUL WISEMAN | WASHINGTON (AP) — The International Monetary F... | 2024-01-30 08:01:11 | IMF sketches a brighter view of global economy... |
| 9 | https://apnews.com/article/argentina-inflation... | Latin America, General news, Inflation, Argentina | World News | DÉBORA REY | BUENOS AIRES, Argentina (AP) — Argentina’s ann... | 2024-01-11 19:17:29 | Argentina’s annual inflation soars to 211.4%, ... |


## Saving the results

Finally, we just did a pretty time-consuming scrape of a lot of articles. We don't want to have to do this again, so it probably makes sense to store our results permanently. We can use a .csv file to save our data: 

In [132]:
articles24_df.to_csv('inflation_articles.csv', index=False)

Then when we want to read this back in later, we can just use:

In [133]:
articles = pd.read_csv('inflation_articles.csv')
articles.head()

| | url | tags | section | authors | article_text | pubdate | headline |
|---|---|---|---|---|---|---|---|
| 0 | https://apnews.com/article/stock-market-tokyo-... | Government regulations, Federal Reserve System... | Business | STAN CHOE | NEW YORK (AP) — Wall Street closed its 10th wi... | 2024-01-12 17:21:35 | Stock market today: Wall Street closes out its... |
| 1 | https://apnews.com/article/federal-reserve-inf... | General news, Federal Reserve System, Inflation | Business | PAUL WISEMAN | WASHINGTON (AP) — The Federal Reserve’s policy... | 2024-01-03 20:24:55 | Federal Reserve minutes: Officials saw inflati... |
| 2 | https://apnews.com/article/federal-reserve-inf... | Government regulations, General news, AP Top N... | Business | CHRISTOPHER RUGABER | WASHINGTON (AP) — Chair Jerome Powell will ent... | 2024-01-29 14:39:43 | Inflation has slowed. Now the Federal Reserve ... |
| 3 | https://apnews.com/article/european-central-ba... | General news, Inflation, European Central Bank... | Business | DAVID MCHUGH | FRANKFURT, Germany (AP) — European Central Ban... | 2024-01-25 11:28:04 | Financial markets are jonesing for interest ra... |
| 4 | https://apnews.com/article/stock-market-today-... | Financial markets, General news, Financial/Bus... | Business | THE ASSOCIATED PRESS | Asian shares dropped Wednesday after Wall Stre... | 2024-01-02 23:21:56 | Stock market today: Asian markets track Wall S... |


(In cases where you need to scrape a lot of pages or send a lot of API calls, it might make sense to separate the data collection code from the analysis code so that you can avoid accidentally overwriting data or re-running a time-consuming chunk of code.)
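One way to guard against re-running a slow scrape is to check whether the saved file already exists and only collect the data when it does not. The sketch below uses the standard-library `csv` module instead of pandas so that it runs on its own; the same idea applies with `pd.read_csv` and `to_csv`. The names `load_or_collect` and `fake_collect` are hypothetical, and the helper assumes the collected list is non-empty and that every row has the same keys.

```python
import csv
import os
import tempfile

def load_or_collect(path, collect):
    """Read rows from `path` if it exists; otherwise call `collect()` and save the result."""
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = collect()  # the slow step, e.g. a scraping loop
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def fake_collect():
    """Stand-in for a slow scrape; records each time it is called."""
    calls.append(1)
    return [{"url": "https://example.com/article/a", "headline": "A"}]

cache_path = os.path.join(tempfile.mkdtemp(), "articles_cache.csv")
first = load_or_collect(cache_path, fake_collect)   # collects and saves
second = load_or_collect(cache_path, fake_collect)  # reads the saved file
print(len(calls))  # 1
```

Note that values read back from a CSV are always strings, so numeric or date columns would need converting after loading.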

### Time permitting: Locate the sitemap for `dbknews.com` and use it to scrape the text of 20 articles from this month.

The basic steps here are:
1. Check for a robots.txt page by going to `[website_name]/robots.txt`, e.g. dbknews.com/robots.txt
2. If a sitemap is listed, copy that URL and send a get request to retrieve the data
3. Parse the sitemap with BeautifulSoup (remember, you'll generally want to use the XML parser) and extract the links
4. Filter the links (if necessary) and then write a loop that sends a get request to each one and downloads/processes the result into a dictionary, then places that dictionary in a list
5. Use `pd.DataFrame()` to turn your list into a pandas dataframe and then save the result as a .csv file.