*Written by Gregory Palermo, 2018-06-30*

This notebook queries the NYTimes Article Search API for mentions of a particular center.
It returns the metadata associated with each article in a pandas DataFrame.
The "web_url" field can be used later to scrape the full text of each article.

It has been adapted from Laura Nelson's "Analyzing Complex Digitized Data Course" (Northeastern University).

Next Steps:
1. Query Refinement
    - Figure out what keywords to use in addition to the center's name. The full name alone won't work, since centers are more often described as "a detention center in Artesia, N.M." than as "Artesia Family Residential Center".
    - Incorporate Patrick Juola's Levenshtein distance for fuzzy matching?
2. Use code written by Sydney colleague (forgot her name!) to use the URLs to scrape full text of articles.
3. Scale this up to work for a list of centers rather than just one. Note that the API allows only 1000 queries/day by default; a higher limit requires contacting NYT (otherwise, they assume commercial use).
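
For the fuzzy-matching idea in step 1, here is a minimal sketch of what a Levenshtein-based check could look like. This is a generic textbook implementation, not Patrick Juola's code; the helper `fuzzy_contains` and its `max_distance` threshold are hypothetical names chosen for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # at the start of pass i, previous[j] is the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_contains(text, phrase, max_distance=2):
    """True if some run of words in text is within max_distance edits of phrase."""
    words = text.lower().split()
    target = phrase.lower()
    size = len(target.split())
    for start in range(len(words) - size + 1):
        window = " ".join(words[start:start + size])
        if levenshtein(window, target) <= max_distance:
            return True
    return False
```

Something like `fuzzy_contains(snippet, "detention center")` could then be run over each article's snippet, though the right threshold would need experimenting.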


In [1]:
#Importing relevant packages
from __future__ import division
import requests
import json
import math
import time # for pauses between calls
import pandas # for dataframes
import csv # to write to file


In [48]:
# Forming the API Query
# NYTimes API documentation available at https://developer.nytimes.com/
# key can be requested from https://developer.nytimes.com/signup
# API Limits: 1 query/sec; 1000 queries/day by default
key = "b491e3c62bd841a1a9ca3acf746be117" # please change!
base_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


In [20]:
# Search parameters: change the "q" here for whichever center (still need to figure out which keywords to use)
search_params = {"q": '"Artesia"'+ " " + '"detention+center"',
                 "api-key": key,
                 "begin_date": "20180501", # Beginning of May seems like a reasonable window for this...
                 "end_date": "20180629", # change this to today's date!
                 #"document_type": "article"
                }

# Initial request to see how many hits there will be before querying page by page
request = requests.get(base_url,params=search_params)

# converting the response to JSON
initial_data = json.loads(request.text)

# How many hits are there for this query?
hits = initial_data['response']['meta']['hits']
print("number of hits:", str(hits))

# How many pages of results? (NYTimes gives 10 results per page)
# necessary in order to loop through pages of results later
pages = int(math.ceil(hits/10))
print("number of pages: ", str(pages))

number of hits: 2
number of pages:  1
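
The page calculation above relies on true division (hence the `from __future__ import division` import). As a sketch of an alternative, the same ceiling can be computed with integer arithmetic only, so it behaves the same way regardless of how `/` is defined. The function name `page_count` is my own, not part of the API.

```python
def page_count(hits, per_page=10):
    """Number of result pages needed for `hits` results (ceiling division)."""
    return (hits + per_page - 1) // per_page


print(page_count(2))   # 2 hits fit on one page
print(page_count(11))  # 11 hits need a second page
```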


Check the request URL if you need to debug a more complicated query syntax:

In [21]:
request.url

'https://api.nytimes.com/svc/search/v2/articlesearch.json?q=%22Artesia%22+%22detention%2Bcenter%22&api-key=b491e3c62bd841a1a9ca3acf746be117&begin_date=20180501&end_date=20180629'
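
Note the `%2B` in the URL above: the "+" typed inside the query string is encoded as a literal plus sign, not as a space. A small sketch using the standard library's `urlencode` (assuming Python 3) reproduces that encoding and shows what a plain space would give instead. Whether the API treats the two queries differently is something I have not checked.

```python
from urllib.parse import urlencode

# the query as written above, with a "+" inside the quoted phrase
literal_plus = urlencode({"q": '"Artesia" "detention+center"'})

# the same query with a plain space inside the quoted phrase
with_space = urlencode({"q": '"Artesia" "detention center"'})

print(literal_plus)  # q=%22Artesia%22+%22detention%2Bcenter%22
print(with_space)    # q=%22Artesia%22+%22detention+center%22
```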

The code below creates an empty dataframe, loops through the pages of results calculated above, and adds the docs from each page to that dataframe.

In [22]:
# Creating an empty dataframe to store the data
articles = pandas.DataFrame()

# looping through the pages of results and adding them to df
for i in range(pages):
    print("collecting page", str(i)) # shows in console
    
    # setting the page parameter
    search_params['page'] = i
    
    # making the request
    r = requests.get(base_url, params = search_params)
    
    # getting the text and converting it to a dictionary
    data = json.loads(r.text)
    
    # getting the docs from the dictionary (which also includes...)
    docs = data['response']['docs']
    df_temp = pandas.DataFrame(docs)
    
    #adding those docs to the master dataframe
    articles = pandas.concat([articles,df_temp],ignore_index=True)
    
    time.sleep(1) # pause between calls to prevent refusal

print('done')

collecting page 0
done
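
When scaling up, a page request may occasionally come back without the expected `response` key (for example, if the rate limit is hit). Below is a rough sketch of the same loop with a simple retry. It assumes a hypothetical `fetch_page` callable that takes a page number and returns the decoded JSON dictionary; `max_retries` and the growing pause are illustrative choices, not anything the API prescribes.

```python
import time


def collect_docs(fetch_page, pages, pause=1.0, max_retries=3):
    """Collect the 'docs' lists from every page of results into one list."""
    all_docs = []
    for page in range(pages):
        for attempt in range(max_retries):
            data = fetch_page(page)
            # a usable response is assumed to carry response -> docs
            if "response" in data and "docs" in data["response"]:
                all_docs.extend(data["response"]["docs"])
                break
            # otherwise wait a bit longer before retrying the same page
            time.sleep(pause * (attempt + 1))
        else:
            raise RuntimeError("page %d failed after %d attempts" % (page, max_retries))
        time.sleep(pause)  # stay under the 1 query/sec limit
    return all_docs
```

With the variables from this notebook, `fetch_page` could be something like `lambda p: requests.get(base_url, params=dict(search_params, page=p)).json()`, and the returned list can be passed straight to `pandas.DataFrame`.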


This query finds two articles mentioning Artesia since the beginning of May. (There are many more from before then, FYI.)

In [46]:
# printing out the article URLs
for i in articles['web_url']:
    print(i)

https://www.nytimes.com/2018/06/25/opinion/family-detention-immigration.html
https://www.nytimes.com/2018/06/22/opinion/children-detention-trump-executive-order.html


Here's what's included by default in responses.
The API doc has a full list of available metadata that can be included in the query.

In [27]:
list(articles)

['_id',
 'blog',
 'byline',
 'document_type',
 'headline',
 'keywords',
 'multimedia',
 'news_desk',
 'print_page',
 'pub_date',
 'score',
 'snippet',
 'source',
 'type_of_material',
 'uri',
 'web_url',
 'word_count']
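
The `csv` import at the top is meant for writing results to file. Here is a minimal sketch of how a few of these columns could be written out. It assumes that nested fields such as `headline` are dictionaries with a `"main"` key (worth confirming against a real response); the function name `docs_to_csv`, the chosen fields, and the sample record are made up for illustration.

```python
import csv
import io


def docs_to_csv(docs, out_file, fields=("pub_date", "headline", "web_url", "word_count")):
    """Write one row per doc, flattening nested dict values to their 'main' entry."""
    writer = csv.writer(out_file)
    writer.writerow(fields)
    for doc in docs:
        row = []
        for field in fields:
            value = doc.get(field, "")
            # assumption: nested values (e.g. headline) are dicts with a "main" key
            if isinstance(value, dict):
                value = value.get("main", "")
            row.append(value)
        writer.writerow(row)


# made-up record for illustration, not real API output
sample_docs = [
    {"pub_date": "2018-06-25", "headline": {"main": "Example headline"},
     "web_url": "https://example.com/a", "word_count": 900},
]
buffer = io.StringIO()
docs_to_csv(sample_docs, buffer)
print(buffer.getvalue())
```

For the dataframe built above, `articles.to_dict("records")` would give the list of docs, and an open file object can stand in for the `StringIO` buffer.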