# Example of downloading articles

`nytsnippetgetter.py` must be on the Python module search path (for example via `PYTHONPATH`) or in the current working directory.

The function signature and its parameters are described below:

```
get_data(TOPICS, NDOCS=None, BEGINDATE=None, ENDDATE=None, VERBOSE=0, LIMITS=False,
    FILENAME=None):
    # Downloads data about articles from nytimes.com
    #   - TOPICS = list of topics, e.g. ["economics", "global warming"]
    #   - NDOCS = list of integers which sets the number of pages to download
    #     for the topics, e.g. [10, 15]. One page is equal to 10 articles.
    #     Should be either the same length as TOPICS or a list with a single
    #     number that applies to all topics.
    #   - BEGINDATE, ENDDATE = integers which limit the published date range, YYYYMMDD
    #   - VERBOSE = boolean, display links
    #   - LIMITS = boolean, display the number of pages available for each topic.
    #     If True then data is not downloaded.
    #   - FILENAME = if not None, saves a local copy with the date prepended,
    #     e.g. 2016-05-05-example.json
```
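The rule for `NDOCS` (either one entry per topic or a single entry shared by all topics) can be illustrated with a minimal sketch. The helper `expand_ndocs` is hypothetical and is not part of `nytsnippetgetter`; it only mirrors the behaviour described in the comments above.

```python
def expand_ndocs(topics, ndocs):
    """Return one count per topic (hypothetical helper, not part of nytsnippetgetter)."""
    if len(ndocs) == 1:
        # A single number applies to every topic.
        return list(ndocs) * len(topics)
    if len(ndocs) != len(topics):
        raise ValueError("NDOCS must have one entry, or one entry per topic")
    return list(ndocs)


print(expand_ndocs(["economics", "politics", "guns"], [1500]))  # [1500, 1500, 1500]
print(expand_ndocs(["economics", "politics"], [10, 15]))        # [10, 15]
```

Validating the length up front, as assumed here, makes a mismatch fail before any download starts.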

Some usage examples follow.

In [6]:
from nytsnippetgetter import get_data

# See how many pages are available for the topics. One page is equivalent to 10 articles.
topics = ['economics', 'politics', 'espionage', 'global+warming', 'donald+trump',
          'hillary+clinton', 'bernie+sanders', 'guns', 'cancer', 'sex']

get_data(topics, BEGINDATE=20140101, LIMITS=True)

Topics:  ['economics', 'politics', 'espionage', 'global+warming', 'donald+trump', 'hillary+clinton', 'bernie+sanders', 'guns', 'cancer', 'sex']
Documents:  [4664, 19720, 1177, 2087, 5884, 6370, 2982, 3469, 7873, 10317]
Date range:  2014-01-01 -> 2016-06-08 

Total documents:  64543
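
The output above prints the integer `BEGINDATE=20140101` as `2014-01-01`. How `get_data` does this internally is not shown here; the following is only a sketch of one way to convert and validate a `YYYYMMDD` integer with the standard library, using a hypothetical helper `parse_yyyymmdd`.

```python
from datetime import datetime


def parse_yyyymmdd(value):
    """Convert an integer such as 20140101 to a datetime.date (hypothetical helper)."""
    # strptime raises ValueError if the digits do not form a valid date.
    return datetime.strptime(str(value), "%Y%m%d").date()


print(parse_yyyymmdd(20140101).isoformat())  # 2014-01-01
```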


In [5]:
# See how many articles are available between 2014-01-01 and today
get_data(None, BEGINDATE=20140101, LIMITS=True)

Topics:  ['']
Documents:  [314552]
Date range:  2014-01-01 -> 2016-06-08 

Total documents:  314552


In [7]:
# Download articles. The BEGINDATE and ENDDATE format is YYYYMMDD.
# If FILENAME is not None, a local copy is saved with the date prepended,
# e.g. 2016-05-05-example.json.
#
# ndocs specifies the number of articles to download for each topic. You can give a
# single number, as below, or specify a distinct number for each topic in the list.
ndocs = [1500]
articles = get_data(topics, ndocs, BEGINDATE=20140101, FILENAME='example.json')

Topics:  ['economics', 'politics', 'espionage', 'global+warming', 'donald+trump', 'hillary+clinton', 'bernie+sanders', 'guns', 'cancer', 'sex']
Documents:  [1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500]
Date range:  2014-01-01 -> 2016-06-08 

Total documents:  15000
Started download...
economics is done | 1500/15000
politics is done | 1500/15000
espionage is done | 1500/15000
global+warming is done | 1500/15000
donald+trump is done | 1500/15000
hillary+clinton is done | 1500/15000
bernie+sanders is done | 1500/15000
guns is done | 1500/15000
cancer is done | 1500/15000
sex is done | 1500/15000

Total documents returned:  15000

Done in  733.4983870983124 seconds


In [6]:
articles[0]

{'abstract': None,
 'author': 'THE ASSOCIATED PRESS',
 'date_modified': '2016-05-10T21:58:07Z',
 'date_published': '2016-05-10T13:19:32Z',
 'keywords': [],
 'lead_paragraph': 'Voters in West Virginia and Nebraska are casting primary ballots on Tuesday, but the only competitive presidential race that remains in the Democratic contest in West Virginia.',
 'nytclass': '',
 'section_name': {'content': 'us',
  'display_name': 'U.S.',
  'url': 'http://www.nytimes.com/section/us'},
 'snippet': "voted for the first time at an elementary school in nearby Scott Depot, West Virginia. The 18-year-old, who plans to study <strong>economics</strong> at Columbia University, voted for Clinton saying he doesn't like Sanders' proposal to raise the",
 'title': 'Voters in West Virginia and Nebraska Cast Primary Ballots',
 'user_topic': 'economics',
 'weburl': 'http://www.nytimes.com/aponline/2016/05/10/us/politics/ap-us-2016-election-voter-voices.html'}
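
The `snippet` field in the record above contains `<strong>` tags around the matched search term. If plain text is needed, the tags can be removed with a regular expression. This is a small sketch; `strip_highlight` is a hypothetical helper name, and it assumes `<strong>` is the only markup present in snippets.

```python
import re


def strip_highlight(snippet):
    """Remove <strong> and </strong> markers, keeping the highlighted text."""
    return re.sub(r"</?strong>", "", snippet)


sample = "plans to study <strong>economics</strong> at Columbia University"
print(strip_highlight(sample))  # plans to study economics at Columbia University
```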

In [7]:
len(set([x['snippet'] for x in articles])), len(articles)

(8477, 15000)
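
The counts above show only 8477 distinct snippets among 15000 downloaded records, so many records share a snippet (possibly because one article matches several topics, though that is an assumption). A minimal sketch of dropping the repeats while preserving order is shown below; `dedupe_by_snippet` is a hypothetical helper and the sample records are made up for illustration.

```python
def dedupe_by_snippet(articles):
    """Keep the first article for each distinct snippet, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        snippet = article["snippet"]
        if snippet not in seen:
            seen.add(snippet)
            unique.append(article)
    return unique


# Made-up records with the same keys as the real ones.
sample = [
    {"snippet": "a", "user_topic": "economics"},
    {"snippet": "b", "user_topic": "politics"},
    {"snippet": "a", "user_topic": "politics"},
]
print(len(dedupe_by_snippet(sample)))  # 2
```

Keeping the first occurrence means the surviving record carries the `user_topic` of whichever topic was downloaded first.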

# Load saved data

In [9]:
# Load saved data into python
import json

with open('data/2016-06-08-example.json') as json_data:
    articles = json.load(json_data)["data"]

articles[0]
len(articles), len(set([ x['snippet'] for x in articles]))

(15000, 8477)
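
Once the records are loaded, a quick sanity check is to count them per `user_topic`. The sketch below uses `collections.Counter` on made-up records; `count_by_topic` is a hypothetical helper, and with the real data you would pass in the `articles` list loaded above.

```python
from collections import Counter


def count_by_topic(articles):
    """Count records per user_topic (hypothetical helper)."""
    return Counter(article["user_topic"] for article in articles)


# Made-up records with the same key as the real ones.
sample = [
    {"user_topic": "economics"},
    {"user_topic": "politics"},
    {"user_topic": "economics"},
]
print(count_by_topic(sample).most_common())  # [('economics', 2), ('politics', 1)]
```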