## Retrieve crawled web pages (2009-present) via Commoncrawl/AWS/Athena

This notebook uses Kiara to find and retrieve web pages indexed by [Commoncrawl](https://commoncrawl.org/). A SQL query run against the Commoncrawl index returns the index records of the matching web pages, and those records are then used to retrieve the pages themselves.

The query step may be billed by AWS, since it uses Athena to scan the Commoncrawl index data (5.00 USD per terabyte scanned, as of Oct. 2023).
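To get a feel for the billing, the sketch below turns a number of scanned bytes into an approximate cost. It assumes the 5.00 USD per terabyte figure quoted above and 1 TB = 10**12 bytes; both are assumptions to check against current AWS pricing, and `estimate_athena_cost` is a hypothetical helper, not part of Kiara.

```python
# Assumptions: 5.00 USD per terabyte scanned (the Oct. 2023 figure quoted above)
# and 1 TB = 10**12 bytes. Check both against current AWS pricing.
USD_PER_TB = 5.00
BYTES_PER_TB = 10**12


def estimate_athena_cost(bytes_scanned, usd_per_tb=USD_PER_TB):
    """Return the approximate cost in USD of scanning `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * usd_per_tb


# Example: a query that scans 200 GB of index data.
print(f"{estimate_athena_cost(200 * 10**9):.2f} USD")
```

Restricting a query to fewer crawls reduces the amount of data scanned, and therefore the cost.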

In [3]:
from kiara.api import KiaraAPI
kiara = KiaraAPI.instance()

### AWS Credentials

In [8]:
aws_access_key_id = 'myawsaccesskey'
aws_secret_access_key = 'myawssecretaccesskey'
aws_s3_bucket = 'mys3bucket'
# the name of the database to create
db_name = 'cc_test'
# the name of the table to store query results
table_name = 'mossfon'
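Hard-coding credentials in a notebook makes it easy to leak them when the notebook is shared. A minimal sketch of an alternative is to read them from environment variables. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS names; `CC_S3_BUCKET` and the helper `load_aws_settings` are hypothetical names chosen for this illustration.

```python
import os


def load_aws_settings(env=os.environ):
    """Collect AWS settings from environment variables, failing early if one is missing."""
    names = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "aws_s3_bucket": "CC_S3_BUCKET",  # hypothetical variable name
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}


# Demonstration with a plain dict standing in for os.environ.
demo_env = {
    "AWS_ACCESS_KEY_ID": "myawsaccesskey",
    "AWS_SECRET_ACCESS_KEY": "myawssecretaccesskey",
    "CC_S3_BUCKET": "mys3bucket",
}
settings = load_aws_settings(demo_env)
```

The returned dictionary uses the same keys as the job inputs built later in this notebook, so its values can be copied into them directly.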

### Query set-up to scan crawled data files

In the following example, Athena scans the Commoncrawl index for web pages under the registered domain "mossfon.com" and limits the results to 10. Since no period is specified, the query covers the whole Commoncrawl data set from 2009 to the present. See https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format for more information and more example queries.

In [4]:
query = f"""
SELECT *
FROM {db_name}.{table_name}
WHERE url_host_registered_domain = 'mossfon.com'
LIMIT 10
"""
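Because the query above names no period, it scans every crawl. The sketch below shows one way to narrow it, assuming the table exposes the `crawl` and `subset` partition columns of Common Crawl's columnar index. `build_cc_query` is a hypothetical helper, and `CC-MAIN-2023-40` is used only as an example crawl identifier.

```python
def build_cc_query(db_name, table_name, domain, crawl=None, subset=None, limit=10):
    """Build the index query, optionally restricted to one crawl and one subset.

    Values are interpolated without escaping, so they must come from a trusted source.
    """
    conditions = [f"url_host_registered_domain = '{domain}'"]
    if crawl is not None:
        conditions.append(f"crawl = '{crawl}'")    # assumed partition column
    if subset is not None:
        conditions.append(f"subset = '{subset}'")  # assumed partition column
    where = "\n  AND ".join(conditions)
    return f"SELECT *\nFROM {db_name}.{table_name}\nWHERE {where}\nLIMIT {int(limit)}"


# Same query as above, restricted to a single crawl and to successful fetches.
narrow_query = build_cc_query(
    "cc_test", "mossfon", "mossfon.com", crawl="CC-MAIN-2023-40", subset="warc"
)
print(narrow_query)
```

With `crawl` and `subset` left at `None`, the helper produces the same statement as the hand-written query.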

### I. Execute Athena query to find content on Commoncrawl

In [4]:
! kiara operation explain onboard.run_cc_query


╭─ Operation: onboard.run_cc_query ────────────────────────────────────────────╮
│                                                                              │
│   Documentation   Execute a Common Crawl indexes query via Amazon Web        │
│                   Services (AWS) and Athena.                                 │
│                                                                              │
│                   This process requires an AWS account and an S3 bucket.     │
│                   It may trigger some fees billed by AWS.                    │
│                   Additional information on the process followed available   │
│                   at:                                                        │
│                   https://commoncrawl.org/2018/03/index-to-warc-files-and…   │
│                                                                              │
│   Inputs          …

In [5]:
inputs = {
    "aws_access_key_id": aws_access_key_id,
    "aws_secret_access_key": aws_secret_access_key,
    "aws_s3_bucket": aws_s3_bucket,
    "query": query,
    "db_name": db_name,
    "table_name": table_name,
}

In [6]:
query_id = kiara.run_job('onboard.run_cc_query', inputs=inputs)

In [7]:
query_id

In [8]:
query_id['cc_query_id'].data

'2c4f7358-b8fb-4f16-9606-76c87af347fa'

### II. Get query execution status

Before retrieving the results, check that the query execution process is finished.

In [5]:
! kiara operation explain onboard.get_cc_query_status


╭─ Operation: onboard.get_cc_query_status ─────────────────────────────────────╮
│                                                                              │
│   Documentation   Get the status of a Common Crawl indexes query.            │
│                                                                              │
│   Inputs                                                                     │
│                     field                                                    │
│                     name        type     descrip…   Required   Default       │
│                    ──────────────────────────────────────────────────────    │
│                     cc_query_   string   AWS/Ath…   yes

In [9]:
# TODO: find out why kiara raises an error when the query id is passed directly as a string,
# e.g. "cc_query_id": '9b18ffd9-98b8-482a-9fdd-803fd7636b91'

inputs = {
    "aws_access_key_id": aws_access_key_id,
    "aws_secret_access_key": aws_secret_access_key,
    # pass the kiara value returned by the previous job, not the raw string
    "cc_query_id": query_id['cc_query_id'],
}

In [10]:
# re-run this cell until its output shows 'State': 'SUCCEEDED' before continuing with the notebook
query_status = kiara.run_job('onboard.get_cc_query_status', inputs=inputs)
query_status.get_value_data('cc_query_status').dict_data['QueryExecution']['Status']

{'State': 'SUCCEEDED',
 'SubmissionDateTime': datetime.datetime(2023, 10, 9, 16, 50, 53, 844000, tzinfo=tzlocal()),
 'CompletionDateTime': datetime.datetime(2023, 10, 9, 16, 50, 57, 349000, tzinfo=tzlocal())}
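Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is a generic helper that takes any function returning the current state. `wait_for_query` is a hypothetical name, and the terminal states listed are the ones Athena reports (`SUCCEEDED`, `FAILED`, `CANCELLED`).

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(get_state, poll_seconds=2.0, timeout_seconds=300.0, sleep=time.sleep):
    """Call `get_state()` until a terminal state is reported or the timeout expires."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"query still {state} after {waited:.0f} s")
        sleep(poll_seconds)
        waited += poll_seconds


# Possible use with the status job shown above (not executed here):
#
# def current_state():
#     status = kiara.run_job('onboard.get_cc_query_status', inputs=inputs)
#     data = status.get_value_data('cc_query_status').dict_data
#     return data['QueryExecution']['Status']['State']
#
# final_state = wait_for_query(current_state)
```

This assumes each call re-runs the status job rather than returning a cached result, which is worth verifying before relying on it. A final state of `FAILED` or `CANCELLED` means there are no results to retrieve.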

### III. Retrieve query results

The query result (if any) contains the index records needed to locate the content of the matching web pages.

In [6]:
! kiara operation explain onboard.get_cc_query_result


╭─ Operation: onboard.get_cc_query_result ─────────────────────────────────────╮
│                                                                              │
│   Documentation   Get the result of a Common Crawl documents indexes         │
│                   query.                                                     │
│                                                                              │
│   Inputs                                                                     │
│                     field                                                    │
│                     name        type     descrip…   Required   Default       │
│                    ──────────────────────────────────────────────────────    │

In [11]:
# `inputs` still holds the credentials and query id defined for the status check
query_result = kiara.run_job('onboard.get_cc_query_result', inputs=inputs)

In [12]:
# inputs for the page retrieval step (section IV)
inputs = {
    "cc_query_result": query_result['cc_query_result'],
}

### IV. Get web pages content

With the exact location of each page in the Commoncrawl compressed archive files, it is now possible to retrieve the content of the web pages.
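To illustrate what "exact location" means, here is a sketch of how one index record can be turned into a request for a single page. It assumes the record carries the `warc_filename`, `warc_record_offset` and `warc_record_length` columns of Common Crawl's columnar index, and that each WARC record is stored as its own gzip member. It is not necessarily how `onboard.get_cc_pages` works internally; both helper names and the file name are hypothetical.

```python
import gzip

CC_DATA_URL = "https://data.commoncrawl.org/"  # public HTTP endpoint for Common Crawl files


def warc_range_request(warc_filename, offset, length):
    """Return (url, headers) for an HTTP GET that fetches a single WARC record."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    url = CC_DATA_URL + warc_filename.lstrip("/")
    return url, {"Range": f"bytes={offset}-{last_byte}"}


def extract_record(archive_bytes, offset, length):
    """Decompress one gzip member cut out of a larger archive."""
    return gzip.decompress(archive_bytes[offset:offset + length])


# Offline demonstration with a fake two-record archive (no network access needed).
first = gzip.compress(b"WARC/1.0 record one")
second = gzip.compress(b"WARC/1.0 record two")
archive = first + second
record = extract_record(archive, len(first), len(second))

# Hypothetical file name, for illustration only.
url, headers = warc_range_request("crawl-data/example.warc.gz", len(first), len(second))
```

Sending the returned `Range` header with a GET request to `url` would download only the bytes of that record, which can then be decompressed as in `extract_record`.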

In [7]:
! kiara operation explain onboard.get_cc_pages


╭─ Operation: onboard.get_cc_pages ────────────────────────────────────────────╮
│                                                                              │
│   Documentation   Get the web pages from common crawl indexes.               │
│                                                                              │
│   Inputs                                                                     │
│                     field                                                    │
│                     name        type   descript…   Required   Default        │
│                    ──────────────────────────────────────────────────────    │
│                     cc_query_   dict   Web pages   yes

In [13]:
# optional: keep the raw index records at hand for inspection
res = query_result.get_value_data('cc_query_result')

In [14]:
query_pages = kiara.run_job('onboard.get_cc_pages', inputs=inputs)

In [15]:
query_pages