# Some free APIs that you could consider for this course

This is by no means an exhaustive list! I'm a political scientist, and my selections here reflect that bias. If you're not interested in that, then you should go looking for an alternative. There are lots of freely available APIs.

These sources have Application Programming Interfaces that you can access from Python. Some of them may have a Python "wrapper" that simplifies the process of gathering data from them and putting it into a data frame. I've listed some of the more useful ones here, but keep in mind that many legislatures, government agencies, or international organizations offer an API now, so don't treat this as a definitive list.

[Check this list of free APIs](https://github.com/public-apis/public-apis). Some of them are novelties (I don't think you can really do a viable project with the catfacts API) but many of them are excellent options for outside-the-box projects.

-   [Data.gov](https://api.data.gov/): A single API key that works for multiple U.S. federal government data APIs. There are tons of options here.

-   [BLS](https://www.bls.gov/developers/): U.S. Bureau of Labor Statistics

-   [FRED](https://fred.stlouisfed.org/docs/api/fred/): Federal Reserve Bank of St. Louis. A wide variety of economic indicators (some overlap with BLS data here, but in a format that I think is much easier to navigate)

-   [UCDP](https://ucdp.uu.se/) : Global data on violent conflict and protest. Updates yearly.

-   [ACLED](https://apidocs.acleddata.com/) : Global coverage of violent conflict and protest. Updates weekly.

-   [Congress.gov](https://gpo.congress.gov/) : U.S. Congress. Data on bills, members, committees, etc. 

-   [UK Parliament](https://developer.parliament.uk/): (really, lots of legislative bodies have an API now, I won't list them all here)

-   [World Bank](https://datatopics.worldbank.org/world-development-indicators/): The World Bank's data (especially the development indicators) is a good source for cross-national data on things like GDP, literacy, etc.

-   [Manifesto Project](https://manifesto-project.wzb.eu/information/documents/api): Party manifestos from across the world. Many of these have been split by sentence, and each statement has then been manually categorized by topic.

-   [OECD](https://data.oecd.org/api/) : Mostly economic data on OECD member states.

-   [Spotify](https://developer.spotify.com/documentation/web-api): This requires an access token. You can see a guide for how this might be used at https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b



## Social Media

(Be warned: collecting data from these sources may be a bit more complicated than some of the more user-friendly options above, and they may become unusable in the near future)

In the recent past we probably would have spent more time working on collecting data from social media, but in the last few years the most popular platforms have gotten rid of or seriously restricted access for non-commercial users. However, there are (as of this writing) a few sites that are still relatively amenable to this kind of research if you're interested in pursuing it. 


-   [Bluesky API](https://atproto.blue/en/latest/readme.html) Bluesky allows anyone with an account to access their API and they currently have extremely high rate limits. There's a bit of a learning curve with this one, but feel free to reach out if you want some help getting started.

-   [Reddit](https://www.reddit.com/dev/api) (see [here](https://www.jcchouinard.com/reddit-api-without-api-credentials/) for an example of how this might work)

-   [Nitter, a JavaScript-free mirror of Twitter](https://nitter.net/), went down in 2024 but came back online in early 2025. [There's a Python package](https://github.com/bocchilorenzo/ntscraper) that will help you retrieve data from the site.


## APIs with wrapper packages

There are Python wrappers for some widely used APIs that can simplify or automate parts of the process of setting up queries or processing data. For example:

- [wbgapi](https://blogs.worldbank.org/en/opendata/introducing-wbgapi-new-python-package-accessing-world-bank-data) for the World Bank API
- [LyricsGenius](https://lyricsgenius.readthedocs.io/en/master/) for the Genius Lyrics API
- [census](https://pypi.org/project/census/) for the Census API
- [gdeltdoc](https://github.com/alex9smith/gdelt-doc-api) for the Global Database of Events Language and Tone
- [nba_api](https://pypi.org/project/nba_api/) an interface to the NBA stats API

Keep in mind that these are basically just collections of functions that send HTTP requests. They may simplify your life, but they don't do anything you don't already know how to do.
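To make that point concrete, here is a rough sketch of what a tiny wrapper around the World Bank API might look like. It is not how `wbgapi` is actually written. The function names are my own, the sample payload at the bottom is made up for illustration, and I'm using the standard library's `urllib` instead of `requests` so that it runs without installing anything.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, start_year, end_year):
    """Build the request URL for one country and one indicator (hypothetical helper)."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": 1000}
    )
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def flatten_records(payload):
    """Turn the API's [metadata, records] response into a list of flat dicts."""
    # error responses and empty results have no usable second element
    records = payload[1] if len(payload) > 1 and payload[1] else []
    return [
        {"country": r["country"]["value"], "date": r["date"], "value": r["value"]}
        for r in records
    ]


def get_indicator(country, indicator, start_year, end_year):
    """Send the request and return flat rows (needs an internet connection)."""
    url = build_indicator_url(country, indicator, start_year, end_year)
    with urlopen(url, timeout=30) as response:
        return flatten_records(json.load(response))


# Made-up payload in the same shape as a real response, so this runs offline
sample_payload = [
    {"page": 1, "pages": 1, "total": 2},
    [
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2021", "value": 2.0},
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2020", "value": 1.5},
    ],
]
rows = flatten_records(sample_payload)
```

With a connection, something like `get_indicator("BR", "NY.GDP.MKTP.CD", 2000, 2010)` would return rows in the same shape, and `pd.DataFrame(rows)` would turn them into a data frame.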


# Tips on finding a website to scrape

If you're interested in going the webscraping route, I would suggest looking for a site that is going to be reasonably easy to scrape. Simpler is better! Some sites may even have a "simplified version for slow connections", and these are going to be much easier to work with than sites that have really complicated HTML. Also keep in mind that sites that require you to enter a password or log in are generally not going to be amenable to web scraping.

If you're planning to scrape text data, then you want to look for a site where you can get a lot of links. The websites for major news outlets are always a good option. Blogs and press release pages can also be good: for instance, every member of the U.S. Senate has a website, and most of them post press releases on a somewhat regular basis. These will usually have a consistent HTML structure that you can parse.
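As a rough illustration of "getting a lot of links", here is a sketch that pulls every link out of an index page and keeps only the ones that look like press releases. The HTML snippet, the `example.gov` address, and the `/press-releases/` path are all made up, and `LinkCollector` is just a name I chose. I'm using the standard library's `HTMLParser` here so the example runs without extra packages; with BeautifulSoup you would get the same links from `soup.select("a[href]")`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # turn relative paths like "/about" into full URLs
                self.links.append(urljoin(self.base_url, href))


# Made-up index page; a real one would come from a get() request
sample_html = """
<ul class="press-releases">
  <li><a href="/news/press-releases/2024-10-01-first">First release</a></li>
  <li><a href="/news/press-releases/2024-10-02-second">Second release</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""

collector = LinkCollector("https://www.example.gov")
collector.feed(sample_html)

# keep only the links whose path marks them as press releases
press_links = [url for url in collector.links if "/press-releases/" in url]
```

If the press release links on a site share a recognizable path like this, a single index page (or a few pages of it) is often enough to build the full list of URLs to scrape.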

## Using a sitemap

Larger sites will usually have a sitemap that acts as a map of URLs to make it easier for web crawlers to index pages. For instance, the Associated Press has a map for [stories from October 2024 here](https://apnews.com/ap-sitemap-202410.xml) (this will probably load slowly!).

If a sitemap exists, you can usually find a link to it on the site's `robots.txt` page. Here's what that looks like for the AP: https://apnews.com/robots.txt
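You don't have to read `robots.txt` by eye. Python's standard library includes `urllib.robotparser`, which can list the sitemaps a file advertises and tell you whether a given path is off limits to crawlers. The sketch below parses a made-up `robots.txt` (the `example.com` addresses are placeholders, not the AP's real file) so that it runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
sample_robots = """\
User-agent: *
Disallow: /search
Sitemap: https://www.example.com/sitemap-index.xml
Sitemap: https://www.example.com/news-sitemap.xml
"""

parser = RobotFileParser()
# parse() takes the file's lines; for a live site you could instead call
# parser.set_url("https://.../robots.txt") followed by parser.read()
parser.parse(sample_robots.splitlines())

# list of sitemap URLs found in the file (None if the file lists none)
sitemaps = parser.site_maps()

# check whether a generic crawler ("*") is allowed to visit a page
allowed = parser.can_fetch("*", "https://www.example.com/article/some-story")
blocked = parser.can_fetch("*", "https://www.example.com/search/results")
```

Checking `can_fetch` before you start a scraping loop is also a reasonable courtesy, since it tells you which parts of a site the owner has asked crawlers to avoid.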

I can use the sitemap to create a list of links, write a scraper that can extract the text from each page, and then create a loop that will extract the relevant text data from each article. I've included an example of doing this for a handful of articles below.

In [1]:
# import packages 
from bs4 import BeautifulSoup
from requests import get
# lxml is never called directly, but BeautifulSoup needs it installed to parse XML
import lxml
import re
import pandas as pd
import numpy as np

In [2]:
# get the sitemap
october_sitemap = get('https://apnews.com/ap-sitemap-202410.xml')


In [3]:
# parse the content as an XML document
sitemap = BeautifulSoup(october_sitemap.content, features="xml")

# select all <loc> nodes that are descendants of a url node
url_nodes = sitemap.select('url loc')

# loop through the entire list and just get the link
urls = [i.get_text() for i in url_nodes]

The links here cover a bunch of different page types, but maybe I only want the articles and not the links to videos or 'hubs'. I can use a regular expression to detect the URLs that have "article" as part of their path and create a list with only these links:

In [4]:
# keep only the urls whose path contains "/article/"
article_urls = [i for i in urls if re.search("/article/", i)]
article_urls[:5]

['https://apnews.com/article/2024-china-open-medvedev-monfils-extraordinary-photo',
 'https://apnews.com/article/2024-catalonia-castells-human-tower-extraordinary-photo',
 'https://apnews.com/article/2024-czech-republic-miner-extraordinary-photo',
 'https://apnews.com/article/2024-cuba-power-outage-fisherman-light-extraordinary-photo',
 'https://apnews.com/article/greece-migration-european-union-policy-6e4dff2bb4e88a4c24d6dd056f7f6b22']

In [5]:
articles = pd.DataFrame(article_urls, columns = ['url'])

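# start with empty (NaN) columns; object dtype lets us fill them with strings later
articles["headline"] = pd.Series(np.nan, index=articles.index, dtype="object")
articles["article_text"] = pd.Series(np.nan, index=articles.index, dtype="object")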

articles.head()

| | url | headline | article_text |
|---|---|---|---|
| 0 | https://apnews.com/article/2024-china-open-med... | NaN | NaN |
| 1 | https://apnews.com/article/2024-catalonia-cast... | NaN | NaN |
| 2 | https://apnews.com/article/2024-czech-republic... | NaN | NaN |
| 3 | https://apnews.com/article/2024-cuba-power-out... | NaN | NaN |
| 4 | https://apnews.com/article/greece-migration-eu... | NaN | NaN |


Now I would just need to write a loop to visit each of these urls, extract the information I'm interested in, and put the result in a dataframe. As a courtesy, you probably should also try to limit the frequency of your requests. A simple way to do this is to put a `time.sleep()` function inside your loop, which will cause it to pause for a number of seconds after each iteration.

In the interest of speeding things along, I'm just going to grab the first 5 URLs here, but ideally we would want to capture everything and then store it somewhere.


In [None]:
import time
for i in range(5):
    # visit url i
    req = get(articles['url'][i])
    # parse the html (naming the parser explicitly avoids a BeautifulSoup warning)
    article = BeautifulSoup(req.content, features="lxml")
    # get the headline and place it in row i
    articles.loc[i, "headline"] = ' '.join([node.get_text() for node in article.select("h1.Page-headline")])
    # get the text and place it in row i
    articles.loc[i, "article_text"] = ' '.join([node.get_text() for node in article.select(".Page-storyBody p")])
    # pause for one second after each iteration of the loop:
    time.sleep(1)


Once this runs, I should have article text and headlines in my data frame, which I can now use for further analysis. (Note: I would probably also want to write some code to get the publication date and maybe the author name here as well.)

In [7]:
articles.head()

| | url | headline | article_text |
|---|---|---|---|
| 0 | https://apnews.com/article/2024-china-open-med... | An AP photographer focuses his camera on peopl... | BEIJING (AP) — Over the past two and a half de... |
| 1 | https://apnews.com/article/2024-catalonia-cast... | An AP photographer gets the light just right f... | TARRAGONA, Spain (AP) — Associated Press photo... |
| 2 | https://apnews.com/article/2024-czech-republic... | In one portrait, an AP photographer tells the ... | STONAVA, Czech Republic (AP) — AP photographer... |
| 3 | https://apnews.com/article/2024-cuba-power-out... | In storm and blackout, an AP photographer find... | HAVANA, Cuba (AP) — Ramon Espinosa started wor... |
| 4 | https://apnews.com/article/greece-migration-eu... | 4 people die in a migrant boat accident off a ... | ATHENS, Greece (AP) — Four people, including t... |
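If you scale this up to the full list of URLs, you'll want to store results as you go so that a crash on page 900 doesn't cost you the first 899. Here's one possible sketch using the standard library's `csv` module. The helper names (`append_row`, `already_scraped`) and the file name in the usage note are my own choices, not part of the example above.

```python
import csv
import os

FIELDS = ["url", "headline", "article_text"]


def append_row(path, row):
    """Append one scraped article to a CSV file, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module handle line breaks inside article text
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def already_scraped(path):
    """Return the set of URLs already saved, so a restarted loop can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["url"] for row in csv.DictReader(f)}
```

Inside the scraping loop, you could skip any URL that is in `already_scraped("ap_articles.csv")` and call `append_row("ap_articles.csv", {...})` after each page. When you're done, `pd.read_csv("ap_articles.csv")` loads everything back into a data frame.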


## Sites that are difficult to scrape

In general, you won't be able to scrape sites that require you to log in. You also might find it's difficult (although not impossible) to scrape sites that have a lot of interactivity or animations. Often we'll get around these problems by using a package like [Selenium](https://selenium-python.readthedocs.io/) to automate a web browser, but this is a more complicated undertaking compared to just sending `get` requests.
