# Snapshot Extraction

This notebook shows how to run a Snapshot Extraction operation with minimal steps and a simple query.

In this notebook...
* [Dependencies and Initialisation](#dependencies-and-initialisation)
* [The Where Statement](#the-where-statement)
* [Extraction Query Options](#extraction-query-options)
* [Running the Extraction Operation](#running-the-extraction-operation-decremental)
* [Next Steps](#next-steps)

## Dependencies and Initialisation
Import statements and environment initialisation using the package `dotenv`. More details in the [Configuration notebook](0.2_configuration.ipynb).

In [1]:
from factiva.news import Snapshot
from dotenv import load_dotenv
load_dotenv()
print('Done!')

Done!


## The Where Statement

This notebook uses a simple query for illustration purposes. For more tips about queries, or guidance on how to build complex or large queries, check out the [query reference](2.1_complex_large_queries.ipynb) notebook.

In [6]:
where_statement = """
       REGEXP_CONTAINS(CONCAT(title, ' ', IFNULL(snippet, ''), ' ', IFNULL(body, '')), r'(?i)(\\b)(kill\\w{0,}|skirmis\\w{0,}|attack\\w{0,}|fight\\w{0,}|bomb\\w{0,}|explod\\w{0,}|clash\\w{0,}|explos\\w{0,}|die\\w{0,}|injur\\w{0,}|dead\\w{0,}|death\\w{0,}|wounded|massacre\\w{0,})(\\b)')
       AND language_code='en' 
       AND publication_date  >= '2015-01-01 00:00:00' 
       AND LOWER(source_code) IN 
             ('aprs','lba','xnews','afpr','afnws','ajazen','bbcsup',
             'bbcapp','bbcmep','bbceup','bbcap','bbcca',
             'bbcmnf','bbcmap','bbcukb','bbcukb','bbcsap',
             'bbccau','bbcmm')
       AND 
       REGEXP_CONTAINS(region_codes, r'(?i)(^|,)(africaz|asiaz|apacz|ausnz|balkz|baltst|caribz|ceafrz|camz|casiaz|eeurz|ussrz|dvpcoz|eafrz|easiaz|eecz|indochz|lamz|medz|meastz|nafrz|pacisz|samz|sasiaz|seasiaz|souafrz|wafrz|wasiaz)($|,)')
       AND
       REGEXP_CONTAINS(subject_codes, r'(?i)(^|,)(gairf|gdef|gcoup|garmy|gcivds|gdeath|gpol|gdrug|gvote|gglobe|gimm|ghum|gdip|gesanc|gkdnap|gvio|gmurd|gntdis|nsum|gcat|gpir|grisk|grobb|gshoo|gsec|gterr|gtortu|gwar)($|,)')
"""

s = Snapshot(query=where_statement)
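
The two kinds of pattern in the where statement (a keyword search over the article text, and a lookup inside a comma-separated code list) can be tried locally before submitting anything. The block below is a minimal sketch using Python's `re` module with shortened patterns. It assumes that `re` behaves like the service's regular expression engine for these simple expressions, and the helper names `matches_keywords` and `has_region` are hypothetical. Note that the where statement above is a regular (non-raw) Python string, so `\\b` reaches the service as `\b`.

```python
import re

# Shortened keyword pattern, same shape as the one in the where statement.
# Assumption: Python's `re` gives the same result as the service for this pattern.
KEYWORDS = re.compile(r'(?i)(\b)(kill\w{0,}|attack\w{0,}|bomb\w{0,}|wounded)(\b)')

# Shortened code-list pattern: the code must be a whole item of a
# comma-separated list, not a fragment of a longer code.
REGIONS = re.compile(r'(?i)(^|,)(africaz|asiaz)($|,)')


def matches_keywords(title, snippet=None, body=None):
    """Hypothetical helper mirroring CONCAT(title, ' ', IFNULL(snippet, ''), ' ', IFNULL(body, ''))."""
    text = ' '.join([title, snippet or '', body or ''])
    return KEYWORDS.search(text) is not None


def has_region(region_codes):
    """Hypothetical helper: True when one of the listed codes is a full list item."""
    return REGIONS.search(region_codes) is not None
```

The leading `\b` keeps a word such as "skill" from matching on "kill", and the `(^|,)` and `($|,)` groups keep a code from matching inside a longer one.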

In [15]:
s

<class 'factiva.news.snapshot.snapshot.Snapshot'>
  |-user_key: <class 'factiva.core.auth.userkey.UserKey'>
  |-key = ****************************afc8
  |-cloud_token = **Not Fetched**
  |-log = <Logger factiva.core.log (DEBUG)>
  |-account_name = 
  |-account_type = 
  |-active_products = 
  |-max_allowed_concurrent_extractions = 0
  |-max_allowed_extracted_documents = 0
  |-max_allowed_extractions = 0
  |-currently_running_extractions = 0
  |-total_downloaded_bytes = 0
  |-total_extracted_documents = 0
  |-total_extractions = 0
  |-total_stream_instances = 0
  |-total_stream_subscriptions = 0
  |-enabled_company_identifiers = []
  |-remaining_documents = 0
  |-remaining_extractions = 0

  |-query: <class 'factiva.news.snapshot.query.SnapshotQuery'>
  |    |-where: 
       REGEXP_CONTAINS(CONCAT(title, ' ', IFNULL(snippet, ''), ' ', IFNULL(b...
  |    |-...
  |    |-limit = 0
  |    |-file_format = avro
  |    |-frequency = MONTH
  |    |-date_field = publication_datetime
  |    |-top 

## Extraction Query Options

An extraction query can use more parameters:

* **`file_format`**: _Optional_, _Default: `'avro'`_. File format to be used for Extractions. Possible values are `'avro'`, `'csv'` or `'json'`. Used only by the Extraction operation.
* **`limit`**: _Optional_, _Default: `0` (No limit)_. Positive integer that limits the number of documents to extract. Used only by the Extraction operation.

In [7]:
s.query.file_format = 'avro'
# s.query.limit = 1000     # Uncomment this line to set a max number of extracted documents
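
Since an extraction consumes allowance, it can help to check these two options before assigning them. The following is a minimal sketch of such a check, based only on the values listed above. `check_extraction_options` is a hypothetical helper and is not part of the `factiva` package.

```python
# Values taken from the option list above.
ALLOWED_FORMATS = ('avro', 'csv', 'json')


def check_extraction_options(file_format='avro', limit=0):
    """Hypothetical helper: validate the options and return them as a dict."""
    if file_format not in ALLOWED_FORMATS:
        raise ValueError(f'file_format must be one of {ALLOWED_FORMATS}')
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 0:
        raise ValueError('limit must be 0 (no limit) or a positive integer')
    return {'file_format': file_format, 'limit': limit}
```

The returned values can then be assigned to `s.query.file_format` and `s.query.limit` as in the cell above.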

## Running the Extraction Operation **(decremental)**

This operation builds a collection of files containing the articles selected according to query conditions.

**This operation will decrement one extraction from your allowance, and can take several minutes to complete.**

Because an Extraction job decrements the account's allowance, it is highly recommended to run it only after verifying, with [Explain](1.4_snapshot_explain.ipynb) and/or [Analytics](1.5_snapshot_analytics.ipynb) jobs, that the same query returns values in line with the expected volumes.

The `<Snapshot>.process_extraction()` function submits the job, monitors it and downloads the content in a single call. If a more manual process is required (send job, monitor job, get results), please see the [detailed package documentation](https://factiva-news-python.readthedocs.io/).

To review the **Snapshot History**, please see the notebook [User Statistics](1.1_user_statistics.ipynb).

In [8]:
%%time
s.process_extraction(download_path='./new_fact/')
print('Done!')

Done!
CPU times: user 33.1 s, sys: 14.7 s, total: 47.7 s
Wall time: 28min 27s
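
Once the call returns, a quick way to confirm that something was written is to list the contents of the download folder. The block below is a minimal sketch using only the standard library. `summarise_download` is a hypothetical helper, and the file names and extensions produced by the extraction are not assumed here.

```python
from pathlib import Path


def summarise_download(download_path):
    """Hypothetical helper: return (file name, size in bytes) pairs, sorted by name.

    Only files directly under download_path are listed; a missing folder
    gives an empty list.
    """
    folder = Path(download_path)
    if not folder.is_dir():
        return []
    return sorted((p.name, p.stat().st_size) for p in folder.iterdir() if p.is_file())
```

For the run above, the folder to inspect would be `./new_fact/`, for example `summarise_download('./new_fact/')`.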


## Next Steps

* Work with the downloaded files as described in the [snapshot files](1.9_snapshot_files.ipynb) notebook
* Create a new Stream