# Potential news data providers
[News API](https://newsapi.org/)

[GDELT](https://www.gdeltproject.org/)

[MediaCloud](https://mediacloud.org/)

[Internet Archive](https://archive.org/web/)

[Common Crawl](https://commoncrawl.org/)

For this exercise, I'll look for all New York Times stories published in March 2020. Since NYT has a comprehensive archive API, there's a solid ground truth to compare each provider against.

# Ground truth data

From the [New York Times Archive API](https://developer.nytimes.com/docs/archive-product/1/overview)

In [1]:
import os
import requests
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
r = requests.get("https://api.nytimes.com/svc/archive/v1/2020/3.json?api-key={}".format(os.environ['API_KEY_NYT']))
response = r.json()

In [3]:
ground_truth = len(response['response']['docs'])

In [4]:
ground_truth

7543
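
A raw count is enough for a first comparison, but matching providers story by story needs the article links themselves. A minimal sketch, assuming each entry in `response['response']['docs']` carries a `web_url` field as the Archive API documentation describes; `extract_urls` and the sample payload are hypothetical and only illustrate the shape:

```python
def extract_urls(archive_response):
    """Return the set of article URLs in an Archive API response dict."""
    docs = archive_response.get("response", {}).get("docs", [])
    # Skip docs without a URL rather than failing on a missing key
    return {doc["web_url"] for doc in docs if doc.get("web_url")}


# Made-up payload with the same nesting as the real response
sample = {"response": {"docs": [
    {"web_url": "https://www.nytimes.com/2020/03/01/example-a.html"},
    {"web_url": "https://www.nytimes.com/2020/03/01/example-a.html"},
    {"web_url": "https://www.nytimes.com/2020/03/02/example-b.html"},
    {"headline": "doc without a URL"},
]}}

ground_truth_urls = extract_urls(sample)
print(len(ground_truth_urls))
```

With the real data this would be `extract_urls(response)`, giving a set that other providers' URLs can be intersected with.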

# News API

In [13]:
from newsapi import NewsApiClient

In [10]:
newsapi = NewsApiClient(api_key=os.environ['API_KEY_NEWS'])
all_articles = newsapi.get_everything(domains='Nytimes.com',
                                      from_param="2020-03-06",
                                      to="2020-03-31")

In [11]:
all_articles

{'status': 'ok', 'totalResults': 0, 'articles': []}

# GDELT

In [22]:
import pandas as pd
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

In [17]:
GDELT_URL = 'http://data.gdeltproject.org/events/'
date_range = pd.date_range(start="2020-03-01",end="2020-03-31")
date_range_encoded = [10000*dt_time.year + 100*dt_time.month + dt_time.day for dt_time in date_range]

In [20]:
uris = ['{}.export.CSV.zip'.format(i) for i in date_range_encoded]
urls = ['{0}{1}'.format(GDELT_URL, i) for i in uris]

In [23]:
df_all = []

for i in urls:
    page = urlopen(i)
    zipfile = ZipFile(BytesIO(page.read()))
    filename = zipfile.namelist()[0]
    df = pd.read_csv(zipfile.open(filename), sep='\t', header=None)
    # Column 57 holds the source URL; na=False skips rows where it is missing
    nyt_links = df[df[57].str.contains('www.nytimes', na=False, regex=False)][57]
    df_filtered = pd.DataFrame({'urls': nyt_links, 'date': filename})
    df_all.append(df_filtered)

In [24]:
df_all = pd.concat(df_all)

In [27]:
df_all.drop_duplicates()

Unnamed: 0,urls,date
172,https://www.nytimes.com/2020/02/29/health/fda-...,20200301.export.CSV
4706,https://www.nytimes.com/2020/02/29/us/politics...,20200301.export.CSV
6146,https://www.nytimes.com/reuters/2020/02/29/wor...,20200301.export.CSV
8634,https://www.nytimes.com/2020/02/29/us/politics...,20200301.export.CSV
10760,https://www.nytimes.com/2020/02/29/opinion/sun...,20200301.export.CSV
...,...,...
153609,https://www.nytimes.com/2020/03/31/health/coro...,20200331.export.CSV
153639,https://www.nytimes.com/aponline/2020/03/31/wo...,20200331.export.CSV
153700,https://www.nytimes.com/2020/03/31/us/coronavi...,20200331.export.CSV
154583,https://www.nytimes.com/2020/03/31/us/politics...,20200331.export.CSV
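
The listing above includes URLs dated 2020/02/29 and wire copy under `/reuters/` and `/aponline/`, and `drop_duplicates()` compares (url, date) pairs, so a URL seen in two daily files stays in twice. A rough sketch of counting unique URLs whose path carries a March 2020 date, using only the standard library. It assumes the `/YYYY/MM/DD/` pattern visible in the output holds for all article URLs; `normalize` and `unique_march_urls` are hypothetical helpers and the sample URLs are made up:

```python
import re
from urllib.parse import urlsplit

# Matches /2020/03/DD/ anywhere in the path, so wire paths such as
# /reuters/2020/03/... are kept as well
MARCH_2020 = re.compile(r"/2020/03/\d{2}/")


def normalize(url):
    """Drop scheme, query string and fragment; lowercase the host."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def unique_march_urls(urls):
    """Return normalized URLs whose path carries a March 2020 date."""
    return {normalize(u) for u in urls if MARCH_2020.search(urlsplit(u).path)}


# Made-up URLs for illustration only
sample = [
    "https://www.nytimes.com/2020/02/29/health/example.html",
    "https://www.nytimes.com/2020/03/31/us/example.html",
    "http://www.nytimes.com/2020/03/31/us/example.html?smid=tw",
    "https://www.nytimes.com/reuters/2020/03/31/world/example.html",
]
print(sorted(unique_march_urls(sample)))
```

Against the notebook's data the call would be `unique_march_urls(df_all['urls'])`. Whether wire stories should count toward the comparison is a judgment call that depends on what the ground truth includes.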


# Media Cloud

In [28]:
import datetime
import mediacloud.api

In [29]:
mc = mediacloud.api.MediaCloud(os.environ['API_KEY_MC'])

In [34]:
stories = mc.storyCount("media_id:1", solr_filter=mc.publish_date_query(datetime.date(2020,3,1), datetime.date(2020,3,31)))

In [36]:
stories

{'count': 5943}
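
To put that count next to the ground truth, a small sketch of a coverage ratio. It uses the two counts reported in this notebook and treats them as comparable, which assumes both sources count the same kind of item; `coverage` is a hypothetical helper name:

```python
def coverage(found, total):
    """Fraction of ground-truth stories that a provider returned."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


ground_truth = 7543      # NYT Archive API count from above
mediacloud_count = 5943  # stories['count'] from Media Cloud

print("{:.1%}".format(coverage(mediacloud_count, ground_truth)))
```

A ratio of counts says nothing about overlap: some Media Cloud stories may not appear in the NYT archive at all, so a URL-level match would be the stricter check.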

# Internet Archive

In [102]:
URL = 'https://web.archive.org/cdx/search/cdx'
params = {"url": "nytimes.com",
          "matchType": "domain",
          "from": date_range_encoded[0],
          "to": date_range_encoded[-1],
          "output": "json"}
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}

In [103]:
response = requests.get(URL, params=params, headers=headers).json()

In [104]:
col_names = response[0]

In [105]:
# The first row is the header; column 2 of each remaining row is the captured URL
urls = [row[2] for row in response[1:]]

In [106]:
len(urls)

94467

In [107]:
article_urls = [i for i in urls if '2020/03' in i]

In [108]:
len(article_urls)

220
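
Only 220 of the 94,467 captures contain `2020/03`, so the domain-wide query mostly returns captures that are not March articles. One possible refinement is to ask the CDX server for a narrower slice. The sketch below only builds the query and has not been run against the live service; it assumes the `matchType=prefix`, `fl`, `filter` and `collapse` parameters behave as the CDX server documentation describes, and `build_cdx_params` is a hypothetical helper:

```python
from urllib.parse import urlencode

CDX_URL = "https://web.archive.org/cdx/search/cdx"


def build_cdx_params(prefix, start, end):
    """Build CDX query parameters for captures under a URL prefix."""
    return {
        "url": prefix,
        "matchType": "prefix",         # match every URL starting with prefix
        "from": start,
        "to": end,
        "output": "json",
        "fl": "original,timestamp",    # return only these two fields
        "filter": "statuscode:200",    # keep successful captures
        "collapse": "urlkey",          # one row per distinct URL
    }


params = build_cdx_params("nytimes.com/2020/03/", 20200301, 20200331)
print(CDX_URL + "?" + urlencode(params))
```

The dictionary can be passed as `requests.get(CDX_URL, params=params, headers=headers)`. Two caveats: `from` and `to` restrict capture time rather than publication time, so articles first archived after March would be missed, and the prefix would not match wire paths such as `/reuters/2020/03/`.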

# Common Crawl