# NY Times API Scrape

The tools in this notebook scrape news articles using the NY Times API. The following code makes requests to the API, collects up to 1,000 of the most recent news articles, and parses each article's headline, snippet, and URL into a DataFrame.

Please visit https://developers.nytimes.com/ before using this tool to review the NY Times API terms of service, obtain your personal NY Times developer API key (free), and research any additional information relating to the use of the API.

## Imports

In [64]:
import requests
import pandas as pd
import time 

## NY Times API Scrape

Enter your API key and the topic you wish to search for, as strings, in the cell below.

In [1]:
api_key = ''
topic = 'fire'
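As an alternative to pasting the key into the cell, the sketch below reads it from an environment variable so the key is not saved with the notebook. The variable name `NYT_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this example; they are not part of the NY Times API.

```python
import os

def load_api_key(env_var="NYT_API_KEY"):
    """Return the API key stored in an environment variable, or '' if unset."""
    # NYT_API_KEY is a hypothetical variable name used only in this sketch.
    return os.environ.get(env_var, "").strip()

api_key = load_api_key()
```

If the variable is not set, `api_key` is an empty string, the same as the default in the cell above.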

This function makes multiple requests to the API and pulls up to 1,000 of the most recent articles (10 per request) for the topic designated above. This is the maximum the API allows; adjust the range in the function to collect fewer. The output is a list of dictionaries containing all article-specific information from the JSON response.

In [77]:
def nytimes_api_scrape(topic, api_key):
    article_list = []
    # adjust range to scrape fewer than 1,000
    for i in range(100):
        
        # gives an update every 100 articles
        if i % 10 == 0:
            print('{} articles gathered so far'.format(i*10))
        # 'page' selects which block of 10 results to return
        url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?q=' + topic + '&page=' + str(i) + '&api-key=' + api_key
        res = requests.get(url)
        
        # checks to see if request was a success and adds to list
        if res.status_code == 200:
            the_json = res.json()
            article_list.extend(the_json['response']['docs'])
        
        else:
            print('Bad request status {}'.format(res.status_code))
            break
        
        # intentionally delay requests to the server
        time.sleep(5)
    
    print('You gathered {} articles about {}'.format(len(article_list), topic))    
    return article_list

**Important Note:** total run time of the function is about 8.5 minutes (100 requests with a 5-second pause after each, plus the time the requests themselves take).
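The function above builds its URL by string concatenation, so a topic containing characters such as `&` or spaces could alter the query string. The sketch below shows one way to assemble the same endpoint with percent-encoded parameters using the standard library. The helper name `build_search_url` and the placeholder key `YOUR_KEY` are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

def build_search_url(topic, api_key, page):
    """Build an Article Search URL with the query string percent-encoded."""
    # 'q' is the search term, 'page' is the zero-based page of 10 results
    params = {'q': topic, 'page': page, 'api-key': api_key}
    return BASE_URL + '?' + urlencode(params)

# 'YOUR_KEY' is a placeholder, not a working key
print(build_search_url('forest fire', 'YOUR_KEY', 2))
```

Passing the same dictionary as `requests.get(BASE_URL, params=params)` should have an equivalent effect, since `requests` encodes the `params` argument itself.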

In [62]:
fire_articles = nytimes_api_scrape(topic, api_key)

0 articles gathered so far
You gathered 30 articles about fire


In [75]:
fire_articles[0].keys()

dict_keys(['web_url', 'snippet', 'lead_paragraph', 'print_page', 'blog', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'score', 'uri'])

This function takes the headline, snippet, and URL for each article and returns them as a DataFrame.

In [70]:
def to_df(article_list):
    key_list = []
    for i in range(len(article_list)):
        key_list.append({
            'headline': article_list[i]['headline']['main'],
            'snippet': article_list[i]['snippet'],
            'web_url': article_list[i]['web_url']
        })
    df = pd.DataFrame(key_list)
    return df

In [73]:
df = to_df(fire_articles)
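The extraction step in `to_df` indexes each key directly, which raises a `KeyError` if a record lacks one of them. The sketch below is a more defensive variant, assuming some records might be missing a field; `extract_fields` is a hypothetical helper and the sample records are made up for illustration, not real API output.

```python
def extract_fields(article_list):
    """Return one dict per article with headline, snippet, and web_url."""
    rows = []
    for article in article_list:
        # fall back to an empty dict when 'headline' is missing or None
        headline = article.get('headline') or {}
        rows.append({
            'headline': headline.get('main'),
            'snippet': article.get('snippet'),
            'web_url': article.get('web_url'),
        })
    return rows

# made-up sample records; the second one has no headline or snippet
sample = [
    {'headline': {'main': 'Example headline'},
     'snippet': 'Example snippet',
     'web_url': 'https://example.com/a'},
    {'web_url': 'https://example.com/b'},
]
rows = extract_fields(sample)
```

The resulting list can be passed to `pd.DataFrame(rows)`, in the same way `to_df` builds its DataFrame; missing fields appear as `None`.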

## Output

In [78]:
# save to csv for use in another notebook
df.to_csv('./datasets/{}'.format(topic))
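The save step above assumes the `./datasets` folder already exists and writes a file with no extension. The sketch below is one way to prepare the output location first; `csv_path` is a hypothetical helper, and adding a `.csv` suffix is an assumption about how the file should be named.

```python
from pathlib import Path

def csv_path(topic, out_dir='./datasets'):
    """Create out_dir if needed and return the path '<out_dir>/<topic>.csv'."""
    folder = Path(out_dir)
    # create the folder (and any parents) only if it does not exist yet
    folder.mkdir(parents=True, exist_ok=True)
    return folder / '{}.csv'.format(topic)
```

With this helper, the save could be written as `df.to_csv(csv_path(topic), index=False)`, where `index=False` leaves the row index out of the file.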