<img src="http://hilpisch.com/tpq_logo.png" width="36%" align="right" style="vertical-align: top;">

# Dow Jones DNA NLP Case Study

_Based on news articles related to Hurricane Harvey._

**Data Retrieval**

Dr Yves J Hilpisch | Michael Schwed

The Python Quants GmbH

## The Imports

In [1]:
import os
import sys
sys.path.append('../../modules')

In [2]:
import json
import nltk
import pickle
import tpqdna
import warnings
warnings.simplefilter('ignore')

## Snapshot Creation

### Authentication

In [3]:
with open('../dna_api_key.pkl', 'rb') as f:
    api_key = pickle.load(f)
headers = {
    'user-key': api_key,
    'content-type': 'application/json',
    'cache-control': 'no-cache'
}

### Specification

In [4]:
where = '(body like "%Hurricane Harvey%") '
where += 'AND language_code="en" '
where += 'AND publication_date >= "2017-08-01 00:00:00" '
where += 'AND publication_date <= "2017-12-31 00:00:00" '
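
When a filter string is assembled by concatenation, a repeated or missing clause is easy to overlook. The following is a minimal sketch of a helper that joins clauses with `AND` and drops exact duplicates. `build_where` and `where_sketch` are hypothetical names introduced here for illustration; they are not part of `tpqdna` or the DNA API.

```python
def build_where(clauses):
    """Join non-empty filter clauses with AND, dropping exact duplicates."""
    unique = []
    for clause in clauses:
        clause = clause.strip()
        if clause and clause not in unique:
            unique.append(clause)
    return ' AND '.join(unique)


# Hypothetical usage with the same filter conditions as in the cell above.
where_sketch = build_where([
    '(body like "%Hurricane Harvey%")',
    'language_code="en"',
    'language_code="en"',  # repeated on purpose; the helper drops it
    'publication_date >= "2017-08-01 00:00:00"',
    'publication_date <= "2017-12-31 00:00:00"',
])
print(where_sketch)
```

Keeping the clauses in a list also makes it simple to add or remove a condition without touching the string syntax.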

In [5]:
includes = {}
excludes = {}
limit = 250

In [6]:
query = {'query':
           {'where': where,
            'includes': includes,
            'excludes': excludes,
            'limit': limit
           }}

In [7]:
query = json.dumps(query)
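
Because the query is sent as a plain JSON string, a misspelled key would only surface on the server side, if at all. The sketch below checks a serialized payload against an expected key set before submission. The set in `EXPECTED_KEYS` is an assumption based on the keys used in this notebook, and `check_query` is a hypothetical helper, not a `tpqdna` function.

```python
import json

# Assumed key set, taken from the query cell in this notebook.
EXPECTED_KEYS = {'where', 'includes', 'excludes', 'limit'}


def check_query(payload):
    """Return keys that are missing from or unexpected in the 'query' object."""
    inner = json.loads(payload)['query']
    return set(inner) ^ EXPECTED_KEYS  # symmetric difference


good = json.dumps({'query': {'where': 'language_code="en"',
                             'includes': {}, 'excludes': {}, 'limit': 250}})
bad = json.dumps({'query': {'where': 'language_code="en"',
                            'includes': {}, 'exludes': {}, 'limit': 250}})

problems_good = check_query(good)
problems_bad = check_query(bad)
print(problems_good)
print(sorted(problems_bad))
```

An empty result means the payload carries exactly the expected keys; anything else names the keys to inspect.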

In [8]:
%time qurl = tpqdna.create_snapshot(query, headers)

{'data': {'attributes': {'current_state': 'JOB_QUEUED', 'extraction_type': 'documents'}, 'id': 'dj-synhub-extraction-feccd780582a0af8b40e86439b3ee921-hdlz7k82ki', 'type': 'snapshot'}, 'links': {'self': 'https://api.dowjones.com/alpha/extractions/documents/dj-synhub-extraction-feccd780582a0af8b40e86439b3ee921-hdlz7k82ki'}}
CPU times: user 40 ms, sys: 4 ms, total: 44 ms
Wall time: 16.1 s


In [9]:
%time fl = tpqdna.run_snapshot(qurl, headers)

Job status changed:
JOB_QUEUED
Job status changed:
JOB_VALIDATING
Job status changed:
JOB_STATE_RUNNING
CPU times: user 2.8 s, sys: 224 ms, total: 3.02 s
Wall time: 1h 45min 15s
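
The implementation of `tpqdna.run_snapshot` is not shown in this notebook. The output above suggests a loop that polls the job and reports each state change, and the sketch below illustrates that general pattern only. `wait_for_job`, `get_state`, and the terminal state name `JOB_STATE_DONE` are assumptions made for this example; the status source is a stub rather than a real API call.

```python
import time


def wait_for_job(get_state, done_states=('JOB_STATE_DONE',),
                 poll_seconds=5.0, max_polls=1000):
    """Poll get_state() until a terminal state is seen; report changes."""
    last = None
    for _ in range(max_polls):
        state = get_state()
        if state != last:
            print('Job status changed:')
            print(state)
            last = state
        if state in done_states:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError('job did not finish within the polling budget')


# Stubbed status source standing in for a real status request.
states = iter(['JOB_QUEUED', 'JOB_QUEUED', 'JOB_VALIDATING',
               'JOB_STATE_RUNNING', 'JOB_STATE_DONE'])
final_state = wait_for_job(lambda: next(states), poll_seconds=0.0)
```

With a wall time of well over an hour for this snapshot, a generous polling interval and an explicit upper bound on the number of polls are sensible choices.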


## Data Paths

In [10]:
project = 'harvey_{}'.format(limit)

In [11]:
base_path = os.path.abspath('../../')

In [12]:
data_path = os.path.join(base_path, 'data_harvey')
if not os.path.isdir(data_path):
    os.mkdir(data_path)

In [13]:
meta_path = os.path.join(data_path, 'meta')
if not os.path.isdir(meta_path):
    os.mkdir(meta_path)
fn = os.path.join(meta_path, 'file_list_{}.pkl'.format(project))
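
The check-then-create pattern for directories recurs several times in this notebook. As a sketch, it can be folded into one helper based on `os.makedirs` with `exist_ok=True`, which also creates missing parent directories. `ensure_dir` is a hypothetical name, and the example runs in a temporary directory rather than in the project tree.

```python
import os
import tempfile


def ensure_dir(*parts):
    """Join the path parts, create the directory if needed, return the path."""
    path = os.path.join(*parts)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


with tempfile.TemporaryDirectory() as root:
    meta_dir = ensure_dir(root, 'data_harvey', 'meta')
    meta_dir_again = ensure_dir(root, 'data_harvey', 'meta')  # idempotent
    created = os.path.isdir(meta_dir)
```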

In [14]:
# with open(fn, 'wb') as f:
#     pickle.dump(fl, f)

In [15]:
with open(fn, 'rb') as f:
    fl = pickle.load(f)
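
Persisting the file list means the long-running snapshot job does not need to be repeated in a later session. The sketch below shows the write and read steps together as a round trip. The content of `file_list` is made up for illustration; the actual structure returned by `tpqdna.run_snapshot` is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the file list of a finished snapshot.
file_list = ['https://example.com/part-000.avro',
             'https://example.com/part-001.avro']

with tempfile.TemporaryDirectory() as tmp:
    fn_sketch = os.path.join(tmp, 'file_list_harvey_250.pkl')
    with open(fn_sketch, 'wb') as f:
        pickle.dump(file_list, f)      # write once after the job finishes
    with open(fn_sketch, 'rb') as f:
        restored = pickle.load(f)      # reload in later sessions
```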

## Data Retrieval

In [16]:
snapshot_path = os.path.join(data_path, 'snapshot')
if not os.path.isdir(snapshot_path):
    os.mkdir(snapshot_path)

In [17]:
%time tpqdna.download_snapshots(fl, snapshot_path, headers)

CPU times: user 1.1 s, sys: 68 ms, total: 1.17 s
Wall time: 31.6 s


In [18]:
%time data = tpqdna.avro2dataframe(snapshot_path)

CPU times: user 208 ms, sys: 4 ms, total: 212 ms
Wall time: 216 ms


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 35 columns):
action                       250 non-null object
an                           250 non-null object
art                          250 non-null object
body                         250 non-null object
byline                       250 non-null object
company_codes                250 non-null object
company_codes_about          250 non-null object
company_codes_association    250 non-null object
company_codes_lineage        250 non-null object
company_codes_occur          250 non-null object
company_codes_relevance      250 non-null object
copyright                    250 non-null object
credit                       250 non-null object
currency_codes               250 non-null object
dateline                     10 non-null object
document_type                250 non-null object
industry_codes               250 non-null object
ingestion_datetime           250 non-null int64
language_code  

In [20]:
fn = os.path.join(snapshot_path, 'snapshot_{}.h5'.format(project))

In [21]:
data.to_hdf(fn, key='data', complevel=5, complib='blosc')
