# Collecting Article Data from The New York Times Archive

The NYT Archive API provides article data covering more than 150 years. This notebook extracts selected fields from the API response for a chosen period of time so that they can be used in later analysis.

#### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import json
import configparser
import requests
import time
import os

#### Reading the API key with `configparser`

In [2]:
configs = configparser.ConfigParser()
configs.read('News_API.ini')
API_KEY = configs['news_API']['nytimes_archive']

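For reference, the lookup above expects `News_API.ini` to contain a `[news_API]` section with an `nytimes_archive` key. The sketch below shows that layout using an in-memory string and a placeholder key value (both are assumptions for illustration; the real file is not shown in this notebook). Note that `configs.read()` silently skips a file it cannot find, so a wrong path only surfaces later as a `KeyError` on `'news_API'`.

```python
import configparser

# Layout that configs['news_API']['nytimes_archive'] expects.
# The value below is a placeholder, not a real API key.
sample_ini = """
[news_API]
nytimes_archive = YOUR_API_KEY_HERE
"""

configs = configparser.ConfigParser()
configs.read_string(sample_ini)  # the notebook calls configs.read('News_API.ini') instead
api_key = configs['news_API']['nytimes_archive']
print(api_key)
```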
#### Simple Data Extractor

The NYT Archive API returns one month of archive data per call, in JSON format. The simple function below requests the JSON for a single month of articles.

In [3]:
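def send_request(yr, month, api_key):
  '''Send a request to the NYT Archive API for a given year and month and return the parsed JSON, or None on failure'''

  base_url = 'https://api.nytimes.com/svc/archive/v1'
  full_url = base_url + '/' + yr + '/' + month + '.json?api-key=' + api_key

  try:
    response = requests.get(full_url, timeout=60)
    response.raise_for_status()  # treat HTTP error statuses as failures
    return response.json()
  except (requests.RequestException, ValueError):
    # network problem, HTTP error status, or a body that is not valid JSON
    return None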

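The URL construction in `send_request()` can be checked without making a network call. The sketch below is a hypothetical helper, `build_archive_url()` (not part of the notebook), that reuses the same base URL, validates the month, and strips any leading zero so that `'06'` and `'6'` produce the same URL, matching how the notebook passes `str(6)`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.nytimes.com/svc/archive/v1'  # same base URL as send_request()

def build_archive_url(year, month, api_key):
    """Hypothetical helper: validate the inputs and build the Archive API URL."""
    year, month = int(year), int(month)
    if not 1 <= month <= 12:
        raise ValueError('month must be between 1 and 12, got {}'.format(month))
    # the month is written without a leading zero, as in send_request(str(2021), str(6), ...)
    query = urlencode({'api-key': api_key})
    return '{}/{}/{}.json?{}'.format(BASE_URL, year, month, query)

# placeholder key, for illustration only
print(build_archive_url('2021', '06', 'YOUR_API_KEY_HERE'))
```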
Sample output of the `send_request()` function:

In [4]:
nyt_articles = send_request(str(2021), str(6), API_KEY)
len(nyt_articles['response']['docs'])

4260

The June 2021 archive contains information about 4260 articles. The dictionary at index 10 is printed to show which fields are available for further extraction and analysis.

In [5]:
nyt_articles['response']['docs'][10]

{'abstract': 'When your therapist is a bot, you can reach it at 2 a.m. But will it really understand your problems?',
 'web_url': 'https://www.nytimes.com/2021/06/01/health/artificial-intelligence-therapy-woebot.html',
 'snippet': 'When your therapist is a bot, you can reach it at 2 a.m. But will it really understand your problems?',
 'lead_paragraph': '“I understand that you’re experiencing a relationship problem, is that right?”',
 'print_section': 'D',
 'print_page': '1',
 'source': 'The New York Times',
 'multimedia': [{'rank': 0,
   'subtype': 'xlarge',
   'caption': None,
   'credit': None,
   'type': 'image',
   'url': 'images/2021/06/01/science/01SCI-WOEBOT/01SCI-WOEBOT-articleLarge.jpg',
   'height': 548,
   'width': 600,
   'subType': 'xlarge',
   'crop_name': 'articleLarge',
   'legacy': {'xlarge': 'images/2021/06/01/science/01SCI-WOEBOT/01SCI-WOEBOT-articleLarge.jpg',
    'xlargewidth': 600,
    'xlargeheight': 548}},
  {'rank': 0,
   'subtype': 'jumbo',
   'caption': None,

### Extracting Useful Information From the Data Dictionary

The following information is extracted from the API response:
*pub_date, document_type, news_desk, section_name, subsection_name, abstract, headline,* the full name of the *writer,* selected *keywords, word_count* and *web_url*.

In [6]:
def manage_missing_fields(doc_field, article):
  ''' some fields are missing from some article dictionaries; return the field value if present, otherwise nan '''
  return article.get(doc_field, np.nan)

In [7]:
def is_main(article):
  ''' check that the headline is a dictionary containing a 'main' entry '''
  headline = article.get('headline')
  return isinstance(headline, dict) and 'main' in headline

In [8]:
def extracting_headline(article):
  ''' return the main headline if it is available, otherwise nan '''
  if is_main(article):
    value = article['headline']['main']
  else:
    value = np.nan
  return value

In [9]:
def writer_name(article):
  ''' return the first listed writer's name as "First Middle Last", or None if it is not available '''
  byline = article.get('byline') or {}
  writer_detail = byline.get('person') or []
  has_name = len(writer_detail) > 0 and isinstance(writer_detail[0], dict) and 'firstname' in writer_detail[0]

  if has_name:
    person = writer_detail[0]
    # skip missing or empty name parts so that None values do not break the join
    parts = [person.get('firstname'), person.get('middlename'), person.get('lastname')]
    FullName = ' '.join(part for part in parts if part) or None
  else:
    FullName = None

  return FullName

In [10]:
def keyword_parser(keyword_type, article):
  ''' return a list of the keyword values that match the given keyword type '''
  kw_list = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == keyword_type]
  return kw_list

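To make the keyword filtering concrete, the sketch below runs the same list comprehension on a hand-made article fragment. The sample dictionary is an illustration modeled on the `name`/`value` keys that `keyword_parser()` reads; it is not actual API output. A keyword type with no matches yields an empty list.

```python
def keyword_parser(keyword_type, article):
    """Mirrors the notebook's keyword_parser()."""
    return [kw['value'] for kw in article['keywords'] if kw['name'] == keyword_type]

# Hand-made article fragment for illustration; not actual API output.
sample_article = {
    'keywords': [
        {'name': 'subject', 'value': 'Tennis'},
        {'name': 'persons', 'value': 'Williams, Serena'},
        {'name': 'subject', 'value': 'French Open (Tennis)'},
    ]
}

subjects = keyword_parser('subject', sample_article)
locations = keyword_parser('glocations', sample_article)
print(subjects)   # ['Tennis', 'French Open (Tennis)']
print(locations)  # []
```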
The function below extracts all of this information from the API response and stores it in a DataFrame.

In [11]:
def article_parser(nyt_articles):
  ''' parse the fields of interest from the NYT Archive response and return them as a DataFrame '''
  
  # defining empty fields
  data = {'pub_date': [],
          'document_type':[],
          'news_desk':[],
          'section_name': [], 'subsection_name':[],
          'abstract': [], 'headline': [],
          'writer': [],
          'key_subject': [], 'key_glocations': [], 'key_persons': [], 'key_organizations': [], 
          'word_count':[],
          'web_url': []}

  articles = nyt_articles['response']['docs']

  for article in articles:
    data['abstract'].append(article['abstract'])
    data['headline'].append(extracting_headline(article))
    data['writer'].append(writer_name(article))
    data['document_type'].append(article['document_type'])
    data['news_desk'].append(article['news_desk'])
    data['key_subject'].append(keyword_parser('subject', article))
    data['key_glocations'].append(keyword_parser('glocations', article))
    data['key_persons'].append(keyword_parser('persons', article))
    data['key_organizations'].append(keyword_parser('organizations', article))
    data['section_name'].append(article['section_name'])
    data['subsection_name'].append(manage_missing_fields('subsection_name', article))
    # pub_date is assumed to be an ISO 8601 timestamp, which pandas parses without an explicit format
    data['pub_date'].append(pd.to_datetime(article['pub_date']).date())
    data['web_url'].append(article['web_url'])
    data['word_count'].append(article['word_count'])
    
  df = pd.DataFrame(data)
  print('{} articles successfully parsed to the table'.format(len(articles)))
  
  return df

Viewing the sample table output

In [12]:
df = article_parser(nyt_articles)
df.head()

4260 articles successfully parsed to the table


| | pub_date | document_type | news_desk | section_name | subsection_name | abstract | headline | writer | key_subject | key_glocations | key_persons | key_organizations | word_count | web_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-01 | article | Foreign | World | Middle East | In Israel’s newspapers — as fractured as its e... | Glum to Gleeful, Israeli Media React to Possib... | Adam Rasgon | [] | [Israel] | [Bennett, Naftali, Netanyahu, Benjamin] | [Haaretz, Israel Hayom, Maariv, Yediot Aharono... | 984 | https://www.nytimes.com/2021/05/31/world/middl... |
| 1 | 2021-06-01 | article | Sports | Sports | Tennis | Playing in a night session in Paris, Williams ... | Serena Williams Wins in the First Round at the... | Ben Rothenberg | [French Open (Tennis), Tennis] | [] | [Begu, Irina-Camelia, Williams, Serena] | [] | 722 | https://www.nytimes.com/2021/05/31/sports/tenn... |
| 2 | 2021-06-01 | article | Express | U.S. | | A bill that passed the General Assembly with b... | Illinois Lawmakers Bar Police From Using Decep... | Michael Levenson | [Confessions, State Legislatures, False Arrest... | [Chicago (Ill), Illinois] | [Dassey, Brendan, Durkin, Jim, Foxx, Kim] | [Innocence Project, Police Department (Chicago... | 880 | https://www.nytimes.com/2021/05/31/us/Chicago-... |
| 3 | 2021-06-01 | article | Games | Crosswords & Games | | Finn Vigeland offers us a refreshing snack of ... | With the Candlestick in the Study | Deb Amlen | [Crossword Puzzles] | [] | [] | [] | 658 | https://www.nytimes.com/2021/05/31/crosswords/... |
| 4 | 2021-06-01 | multimedia | Climate | Climate | | There are few things Americans can agree on th... | How Do Animals Safely Cross a Highway? Take a ... | Catrin Einhorn | [ANIMALS, Animal Migration, Roads and Traffic,... | [United States, California, Florida, Montana, ... | [] | [] | 0 | https://www.nytimes.com/interactive/2021/05/31... |

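The four keyword columns hold a Python list per row. As a hedged illustration of how such columns can be used, the sketch below counts subjects with `DataFrame.explode()` on a few hand-made rows (the sample data is invented for illustration and is not API output).

```python
import pandas as pd

# Hand-made rows that mimic the list-valued keyword columns; not actual API output.
df = pd.DataFrame({
    'headline': ['Article A', 'Article B', 'Article C'],
    'key_subject': [['Tennis', 'French Open (Tennis)'], [], ['Tennis']],
})

# explode() gives each list element its own row; empty lists become NaN and are dropped here
subject_counts = df.explode('key_subject')['key_subject'].dropna().value_counts()
print(subject_counts)
```

Rows with an empty keyword list contribute nothing to the counts.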

### Extracting Articles for a Period of Time

The function below requests and parses article information for every month in a date range. The NYT archive is publicly accessible and contains articles going back to 1851.

In [13]:
def get_articles(start_date, end_date, api_key, file_path):
  ''' request the articles for each month in a date range and save each month as a .csv file in a directory '''
  
  # create the output directory (and any missing parents) if it does not exist
  os.makedirs(file_path, exist_ok=True)
  
  # create possible Year - Month combinations
  ym_list = pd.date_range(start_date, end_date, freq='MS').strftime("%Y-%m").tolist()
  
  # make one API call for each year-month in the requested range
  for ym in ym_list:
    ym_spl = ym.split('-')
    
    csv_name = 'nyt_arc' + ym + '.csv'
    csv_path = os.path.join(file_path, csv_name)
    if os.path.exists(csv_path):
      print(csv_name + ' already exists \n')      
      
    else:
      nyt_articles = send_request(str(ym_spl[0]), str(int(ym_spl[1])), api_key)
      
      if nyt_articles is not None:
        df_article = article_parser(nyt_articles)
        
        df_article.to_csv(csv_path, index=False)
        print('Saving ' + csv_name + '\n')
      else:
        print('Request for ' + ym + ' failed; skipping \n')
      time.sleep(5)  # pause between consecutive API calls
        
  print('---- Completing parsing the articles ----')

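One detail of `get_articles()` worth checking is how `pd.date_range(..., freq='MS')` builds the month list: it only yields month-start dates that fall inside the range, so a `start_date` that is not the first of a month skips that month. The sketch below illustrates this; the `pd.period_range` line is one possible alternative and an assumption on my part, not the notebook's method.

```python
import pandas as pd

# Start on the first of the month: both months are listed
full_months = pd.date_range('2020-03-01', '2020-04-15', freq='MS').strftime('%Y-%m').tolist()

# Start mid-month: 'MS' only yields month starts inside the range, so March is skipped
skipped = pd.date_range('2020-03-15', '2020-04-15', freq='MS').strftime('%Y-%m').tolist()

# Possible alternative (assumption): monthly periods cover partially included months too
periods = pd.period_range('2020-03-15', '2020-04-15', freq='M').strftime('%Y-%m').tolist()

print(full_months)  # ['2020-03', '2020-04']
print(skipped)      # ['2020-04']
print(periods)      # ['2020-03', '2020-04']
```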
Extracting article information using the `get_articles()` function

In [14]:
get_articles('2020-03-01', '2020-04-15', API_KEY, file_path = 'NYT_Archive_data')

4883 articles successfully parsed to the table
Saving nyt_arc2020-03.csv

5019 articles successfully parsed to the table
Saving nyt_arc2020-04.csv

---- Completing parsing the articles ----

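When the saved files are read back, the list-valued keyword columns come back as plain strings such as `"['Tennis']"`, because CSV has no list type. The sketch below shows one way to restore them with `ast.literal_eval`, using a hand-made stand-in for a monthly file written to a temporary directory (the sample rows and the `LIST_COLUMNS` name are assumptions for illustration).

```python
import ast
import os
import tempfile

import pandas as pd

LIST_COLUMNS = ['key_subject', 'key_persons']  # subset of the list-valued columns

# Hand-made stand-in for one monthly file; not actual API output.
sample = pd.DataFrame({
    'headline': ['Article A', 'Article B'],
    'key_subject': [['Tennis'], []],
    'key_persons': [['Williams, Serena'], []],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'nyt_arc2020-03.csv')
    sample.to_csv(csv_path, index=False)

    # default read: the list columns are strings
    raw = pd.read_csv(csv_path)
    # literal_eval turns the text "['Tennis']" back into the list ['Tennis']
    restored = pd.read_csv(csv_path, converters={col: ast.literal_eval for col in LIST_COLUMNS})

print(type(raw['key_subject'][0]).__name__, type(restored['key_subject'][0]).__name__)
```

The same `converters` argument can be applied to every file in `NYT_Archive_data` before concatenating the monthly tables.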