# Some free APIs that you could consider for this course

This is by no means an exhaustive list! I'm a political scientist, and my selections here reflect that bias. If politics isn't what you're interested in, you should go looking for an alternative. There are lots of freely available APIs.

These sources have Application Programming Interfaces that you can access using R code. Some of them may have an R "wrapper" that simplifies the process of gathering data from them and putting it into a data frame. I've listed some of the more useful ones here, but keep in mind that many legislatures, government agencies, or international organizations offer an API now, so don't treat this as a definitive list.

[Check this list of free APIs](https://github.com/public-apis/public-apis) 

-   [Data.gov](https://api.data.gov/): Data from U.S. Federal Agencies

-   [BLS](https://www.bls.gov/developers/): U.S. Bureau of Labor Statistics

-   [UCDP](https://ucdp.uu.se/): Global data on violent conflict and protest. Updates yearly.

-   [ACLED](https://apidocs.acleddata.com/): Violent conflict and protest. Updates weekly.

-   [Congress.gov](https://gpo.congress.gov/): U.S. Congress

-   [UK Parliament](https://developer.parliament.uk/): Lots of legislative bodies have an API now, so I won't list them all here.

-   [World Bank](https://datatopics.worldbank.org/world-development-indicators/): A good source for background data on things like GDP and literacy, especially the development indicators.

-   [Manifesto Project](https://manifesto-project.wzb.eu/information/documents/api): Party manifestos from across the world. Many of these have been split into sentences, and each statement has been manually categorized by topic.

-   [OECD](https://data.oecd.org/api/): Mostly economic data on OECD member states.

-   [Spotify](https://developer.spotify.com/documentation/web-api): This requires an access token. You can see a guide for how this might be used at https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b

-   [Reddit](https://www.reddit.com/dev/api): See https://www.jcchouinard.com/reddit-api-without-api-credentials/ for an example of how this might work.

-   [Housing Market API](https://documenter.getpostman.com/view/9197254/UVsFz93V#quickstart)

-   [Delphi CovidCast](https://cmu-delphi.github.io/delphi-epidata/)
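
To make the idea of calling an API from code a little more concrete, here is a minimal sketch in Python (the language used in the scraping example later in this document) aimed at the World Bank's indicators API. The base URL, the indicator code `NY.GDP.MKTP.CD` (GDP in current US dollars), and the helper names `build_url`, `parse_payload`, and `fetch_indicator` are my own assumptions for illustration, so check the API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for version 2 of the World Bank indicators API.
BASE_URL = "https://api.worldbank.org/v2"


def build_url(country, indicator, per_page=100):
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_payload(payload):
    """Flatten the JSON response into a list of plain dictionaries.

    The response is assumed to be a two-element list: paging metadata
    first, then the list of observations (or None when there are none).
    """
    _metadata, observations = payload
    return [
        {
            "country": obs["country"]["value"],
            "date": obs["date"],
            "value": obs["value"],
        }
        for obs in observations or []
    ]


def fetch_indicator(country, indicator):
    """Download and parse one indicator (this makes a network request)."""
    with urlopen(build_url(country, indicator), timeout=30) as response:
        return parse_payload(json.load(response))


# Example call, left as a comment because it needs network access:
# rows = fetch_indicator("US", "NY.GDP.MKTP.CD")
# pd.DataFrame(rows) would then turn the records into a data frame.
```

Keeping the download and the parsing in separate functions makes it possible to check the parsing step against a saved response without hitting the API every time.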



# Tips on finding a website to scrape

If you're interested in going the web scraping route, I would suggest looking for a site that is going to be reasonably easy to scrape. Simpler is better! Some sites may even have a "simplified version for slow connections"; these are going to be much easier to work with than sites that have really complicated HTML. Also keep in mind that sites that require you to enter a password or log in are generally not going to be amenable to web scraping.

If you're planning to scrape text data, then you want to look for a site where you can get a lot of links. The websites for major news outlets are always a good option. Blogs and press release pages can also be good. For instance, every member of the U.S. Senate has a website, and most of them post press releases on a somewhat regular basis. These will usually have a consistent HTML structure that you can parse.

## Using a sitemap

Larger sites will usually have a sitemap: a list of URLs that makes it easier for web crawlers to index pages. For instance, the Associated Press has a sitemap for [stories from October 2024 here](https://apnews.com/ap-sitemap-202410.xml) (this will probably load slowly!).
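
If you don't know where a site keeps its sitemap, its `robots.txt` file often lists one or more on `Sitemap:` lines. The sketch below uses Python's built-in `urllib.robotparser` to pull those lines out. The `robots.txt` content and the helper name `find_sitemaps` are made up for illustration; a real site's file will look different.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/news-sitemap.xml
"""


def find_sitemaps(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when there are no Sitemap lines
    return parser.site_maps() or []


sitemaps = find_sitemaps(EXAMPLE_ROBOTS_TXT)
print(sitemaps)
```

For a real site, you would first download the file (for example from `https://example.com/robots.txt`) and pass its text to the function.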

I can use the sitemap to create a list of links, write a scraper that can extract the text from each page, and then create a loop that will extract the relevant text data from each article. I've included an example of doing this for a handful of articles below.

In [None]:
# import packages
from bs4 import BeautifulSoup
from requests import get
# lxml is never called directly, but BeautifulSoup needs it installed for the "xml" parser
import lxml
import re
import pandas as pd
import numpy as np

In [None]:
# get the sitemap
october_sitemap = get('https://apnews.com/ap-sitemap-202410.xml')


In [None]:
# parse the content as an XML document
sitemap = BeautifulSoup(october_sitemap.content, features="xml")

# select all <loc> nodes that are descendants of a url node
url_nodes = sitemap.select('url loc')

# loop through the entire list and just get the link
urls = [i.get_text() for i in url_nodes]

The links here point to a bunch of different page types, but maybe I only want the articles and not the links to videos or 'hubs'. I can use a regular expression to detect the URLs that have "article" as part of their path and create a list with only these links:

In [None]:
# keep only the URLs that contain "/article/"
article_urls = [url for url in urls if re.search("/article/", url)]
article_urls[:5]
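
A regular expression search will match "/article/" anywhere in the URL. If that turns out to be too loose, a slightly stricter variant is to check only the path component. This is a minimal sketch under the assumption that article pages always start with `/article/`. The sample URLs and the helper name `is_article` are made up for illustration.

```python
from urllib.parse import urlparse


def is_article(url):
    """Return True when the URL's path starts with /article/."""
    return urlparse(url).path.startswith("/article/")


# Made-up URLs, for illustration only.
sample_urls = [
    "https://apnews.com/article/example-story-123",
    "https://apnews.com/video/example-clip-456",
    "https://apnews.com/hub/example-topic",
]

article_only = [url for url in sample_urls if is_article(url)]
print(article_only)
```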

In [None]:
articles = pd.DataFrame(article_urls, columns = ['url'])

# create empty columns; None gives an object dtype that can hold strings later
articles["headline"] = None
articles["article_text"] = None

articles.head()

Now I would just need to write a loop to visit each of these urls, extract the information I'm interested in, and put the result in a dataframe. As a courtesy, you probably should also try to limit the frequency of your requests. A simple way to do this is to put a `time.sleep()` function inside your loop, which will cause it to pause for a number of seconds after each iteration.

In the interest of speeding things along, I'm just going to grab the first 5 URLs here, but ideally we would want to capture everything and then store it somewhere.


In [None]:
import time

for i in range(5):
    # visit url i
    req = get(articles['url'][i])
    # parse the HTML, naming the parser explicitly to avoid a bs4 warning
    article = BeautifulSoup(req.content, "lxml")
    # get the headline and place it in row i
    articles.loc[i, "headline"] = ' '.join(node.get_text() for node in article.select("h1.Page-headline"))
    # get the paragraph text and place it in row i
    articles.loc[i, "article_text"] = ' '.join(node.get_text() for node in article.select(".Page-storyBody p"))
    # pause for one second after each iteration of the loop
    time.sleep(1)


Once this runs, I should have article text and headlines in my data frame, which I can now use for further analysis. (Note: I would probably also want to write some code to get the publication date and maybe the author name here as well.)

In [None]:
articles.head()
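
For the "capture everything and store it somewhere" case, one option is to write each result to a CSV file as soon as it is scraped, so that a crash halfway through doesn't lose everything. This is a minimal sketch, not a definitive implementation. The function name `scrape_to_csv`, its parameters, and the stand-in scraper `fake_scrape` are all hypothetical, and the URLs are made up so that the example runs without network access.

```python
import csv
import os
import tempfile
import time


def scrape_to_csv(urls, scrape_one, out_path, delay=1.0):
    """Scrape each URL and write the result to a CSV file as we go.

    scrape_one is any function that takes a URL and returns a dict with
    'headline' and 'article_text' keys. Pages that raise an error are
    recorded and skipped so one bad page does not stop the whole run.
    """
    failed = []
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "headline", "article_text"])
        writer.writeheader()
        for url in urls:
            try:
                result = scrape_one(url)
            except Exception as err:  # broad on purpose: log it and keep going
                failed.append((url, str(err)))
            else:
                writer.writerow({
                    "url": url,
                    "headline": result["headline"],
                    "article_text": result["article_text"],
                })
                f.flush()  # push the row to disk right away
            # pause between requests as a courtesy to the site
            time.sleep(delay)
    return failed


def fake_scrape(url):
    """Stand-in scraper so the sketch runs offline."""
    if "broken" in url:
        raise ValueError("could not download page")
    return {"headline": "Headline for " + url, "article_text": "Body text."}


out_path = os.path.join(tempfile.mkdtemp(), "articles.csv")
failed = scrape_to_csv(
    ["https://example.com/a", "https://example.com/broken"],
    fake_scrape,
    out_path,
    delay=0,
)

# read the file back to confirm what was stored
with open(out_path, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
```

In a real run, `scrape_one` would wrap the `get` and `BeautifulSoup` steps from the loop above, and the `delay` would stay at one second or more.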