# Searching Common Crawl Index

This notebook explores different ways of using the Common Crawl index:

* [Comcrawl library](#Using-comcrawl)
* [CDX Toolkit](#Using-cdx-toolkit)
* [Querying HTTP Endpoint directly](#Requesting-CDX-endpoint-Directly)

See [the related article](https://skeptric.com/searching-100b-pages-cdx/) and [Jupyter notebook](https://skeptric.com/notebooks/Searching%20Common%20Crawl%20Index.ipynb).

In [1]:
import requests
import warcio
from contextlib import closing
from bs4 import BeautifulSoup
import json

import logging
from IPython.display import HTML
import pandas as pd

# Using [comcrawl](https://github.com/michaelharms/comcrawl)

In [2]:
#! python -m pip install comcrawl

In [3]:
from comcrawl import IndexClient

In [4]:
client = IndexClient(['2020-10', '2020-16'])

In [5]:
client.search('https://www.reddit.com/r/dataisbeautiful/*')

In [6]:
pd.DataFrame(client.results).head()

Unnamed: 0,urlkey,timestamp,offset,status,digest,redirect,length,mime-detected,filename,mime,url,languages,charset
0,"com,reddit)/r/dataisbeautiful/comments/2wlsvz/...",20200217065457,13689701,301,3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ,https://www.reddit.com/r/dataisbeautiful/comme...,679,application/octet-stream,crawl-data/CC-MAIN-2020-10/segments/1581875141...,unk,http://www.reddit.com/r/dataisbeautiful/commen...,,
1,"com,reddit)/r/dataisbeautiful/comments/2wlsvz/...",20200217065459,915522267,200,L4C22PRVUOGG22PXMKSB7KYVCWQUKEQ7,,74716,text/html,crawl-data/CC-MAIN-2020-10/segments/1581875141...,text/html,https://www.reddit.com/r/dataisbeautiful/comme...,eng,UTF-8
2,"com,reddit)/r/dataisbeautiful/comments/7f2sfy/...",20200223060640,884674375,200,GEWEQE4I2JOSKTL3QXPEI7FXVI3BP52O,,29470,text/html,crawl-data/CC-MAIN-2020-10/segments/1581875145...,text/html,https://www.reddit.com/r/dataisbeautiful/comme...,eng,UTF-8
3,"com,reddit)/r/dataisbeautiful/comments/7jbefu/...",20200217195615,890110347,200,42HZLBLZI5DQYGQAZNUAQ5NRCMEEVERW,,21516,text/html,crawl-data/CC-MAIN-2020-10/segments/1581875143...,text/html,https://www.reddit.com/r/dataisbeautiful/comme...,eng,UTF-8
4,"com,reddit)/r/dataisbeautiful/comments/8f1rk7/...",20200222202649,859518253,200,IDKDLHSVB7YH3L2AUIMKPJFER3VLBZRU,,95956,text/html,crawl-data/CC-MAIN-2020-10/segments/1581875145...,text/html,https://www.reddit.com/r/dataisbeautiful/comme...,eng,UTF-8


Only download the first couple of 'ok' results

In [7]:
client.results = [res for res in client.results if res['status'] == '200'][:2]

In [8]:
client.download()

In [9]:
client.results[0]['url']

'https://www.reddit.com/r/dataisbeautiful/comments/2wlsvz/why_the_mlb_rule_changes_since_2004_game_time_is/'

In [10]:
html = client.results[0]['html']

In [11]:
soup = BeautifulSoup(html, 'html5lib')

In [12]:
soup.head.title.text

'Why the MLB rule changes: Since 2004, game time is up 10%, while runs are down 13% [OC] : dataisbeautiful'

In [13]:
soup.find('div', {'class': 'usertext-body'}).p.text

'A place for visual representations of data: Graphs, charts, maps, etc.'

# Using [cdx-toolkit](https://github.com/cocrawler/cdx_toolkit)

In [14]:
#!python -m pip install cdx_toolkit

In [15]:
import cdx_toolkit

In [16]:
url = 'https://www.reddit.com/r/dataisbeautiful/*'

In [17]:
cdx = cdx_toolkit.CDXFetcher(source='cc')

Note: the Python API uses from_ts, rather than the from argument used in the CLI.

In [18]:
objs = list(cdx.iter(url, from_ts='202002', to='202006', limit=5, filter='=status:200'))

In [19]:
pd.DataFrame([o.data for o in objs])

Unnamed: 0,urlkey,timestamp,offset,status,languages,digest,length,mime-detected,filename,charset,mime,url
0,"com,reddit)/r/dataisbeautiful/comments/27dx4q/...",20200527135643,882602699,200,eng,XXI6CLICLXUYYPAVXBBT2YRAGKF5R32E,78176,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347394...,UTF-8,text/html,https://www.reddit.com/r/dataisbeautiful/comme...
1,"com,reddit)/r/dataisbeautiful/comments/2p2s7m/...",20200525103507,805396766,200,eng,LVX34MXYNQDODM4GXOVR5I4HJRUPWVQF,84555,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347388...,UTF-8,text/html,https://www.reddit.com/r/dataisbeautiful/comme...
2,"com,reddit)/r/dataisbeautiful/comments/2r3jnk/...",20200527095258,908079074,200,eng,MQWZSEJ6WUNC2VW3CNSTCYW7DDJWZGH3,45865,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347392...,UTF-8,text/html,https://www.reddit.com/r/dataisbeautiful/comme...
3,"com,reddit)/r/dataisbeautiful/comments/2w392n/...",20200527230701,859269754,200,eng,HA3GBWJZNLL3TKYOGGRRGK5R6WDDJYWD,24851,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347396...,UTF-8,text/html,https://www.reddit.com/r/dataisbeautiful/comme...
4,"com,reddit)/r/dataisbeautiful/comments/322lbk/...",20200526005126,878218880,200,eng,Q7U6FNDLWZ2IY34L2BHOPGKAWTPRWZKI,55033,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347390...,UTF-8,text/html,https://www.reddit.com/r/dataisbeautiful/comme...


In [20]:
print(pd.DataFrame([o.data for o in objs]).to_markdown())

|    | urlkey                                                                                          |      timestamp |    offset |   status | languages   | digest                           |   length | mime-detected   | filename                                                                                                      | charset   | mime      | url                                                                                                         |
|---:|:------------------------------------------------------------------------------------------------|---------------:|----------:|---------:|:------------|:---------------------------------|---------:|:----------------|:--------------------------------------------------------------------------------------------------------------|:----------|:----------|:------------------------------------------------------------------------------------------------------------|
|  0 | com,reddit)/r/dataisbeautiful/comments/27dx4q/distribut

In [21]:
html = objs[0].content

None


In [22]:
soup = BeautifulSoup(html, 'html5lib')

In [23]:
soup.head.title.text

'Distribution of results of the Matura (high school exit exam) in Poland in 2013. The minimum score to pass is 30%. : dataisbeautiful'

In [24]:
soup.find('div', {'class': 'usertext-body'}).p.text

'A place to share and discuss visual representations of data: Graphs, charts, maps, etc.'

In [25]:
o = objs[0]

In [26]:
o.warc_record.rec_headers.get_header('WARC-Target-URI')

'https://www.reddit.com/r/dataisbeautiful/comments/27dx4q/distribution_of_results_of_the_matura_high_school/'

# Requesting CDX endpoint Directly

We can request the [Index directly](https://index.commoncrawl.org/) using [pywb's CDX API](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference).

But first we need to know what indexes are available.

In [28]:
cdx_indexes = requests.get('https://index.commoncrawl.org/collinfo.json').json()

In [29]:
pd.options.display.max_colwidth=150
pd.options.display.max_rows=6

In [30]:
pd.DataFrame(cdx_indexes)

Unnamed: 0,id,name,timegate,cdx-api
0,CC-MAIN-2020-24,May 2020 Index,https://index.commoncrawl.org/CC-MAIN-2020-24/,https://index.commoncrawl.org/CC-MAIN-2020-24-index
1,CC-MAIN-2020-16,March 2020 Index,https://index.commoncrawl.org/CC-MAIN-2020-16/,https://index.commoncrawl.org/CC-MAIN-2020-16-index
2,CC-MAIN-2020-10,February 2020 Index,https://index.commoncrawl.org/CC-MAIN-2020-10/,https://index.commoncrawl.org/CC-MAIN-2020-10-index
...,...,...,...,...
69,CC-MAIN-2012,Index of 2012 ARC files,https://index.commoncrawl.org/CC-MAIN-2012/,https://index.commoncrawl.org/CC-MAIN-2012-index
70,CC-MAIN-2009-2010,Index of 2009 - 2010 ARC files,https://index.commoncrawl.org/CC-MAIN-2009-2010/,https://index.commoncrawl.org/CC-MAIN-2009-2010-index
71,CC-MAIN-2008-2009,Index of 2008 - 2009 ARC files,https://index.commoncrawl.org/CC-MAIN-2008-2009/,https://index.commoncrawl.org/CC-MAIN-2008-2009-index


In [31]:
print(pd.DataFrame(cdx_indexes).tail(1).to_markdown())

|    | id                | name                           | timegate                                         | cdx-api                                               |
|---:|:------------------|:-------------------------------|:-------------------------------------------------|:------------------------------------------------------|
| 71 | CC-MAIN-2008-2009 | Index of 2008 - 2009 ARC files | https://index.commoncrawl.org/CC-MAIN-2008-2009/ | https://index.commoncrawl.org/CC-MAIN-2008-2009-index |


In [32]:
api_url = cdx_indexes[0]['cdx-api']
api_url

'https://index.commoncrawl.org/CC-MAIN-2020-24-index'
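
Taking the first entry gives the most recent crawl. To target a particular crawl instead, the list can be searched by `id`. This is a minimal sketch: `cdx_api_for` is a hypothetical helper, and `sample_indexes` holds two entries copied from the table above so the example runs without a network call.

```python
def cdx_api_for(indexes, crawl_id):
    """Return the CDX API URL of the crawl whose 'id' matches crawl_id."""
    for index in indexes:
        if index['id'] == crawl_id:
            return index['cdx-api']
    raise KeyError(f'No index with id {crawl_id!r}')


# Entries in the shape returned by collinfo.json (copied from the output above)
sample_indexes = [
    {'id': 'CC-MAIN-2020-24',
     'name': 'May 2020 Index',
     'timegate': 'https://index.commoncrawl.org/CC-MAIN-2020-24/',
     'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2020-24-index'},
    {'id': 'CC-MAIN-2020-16',
     'name': 'March 2020 Index',
     'timegate': 'https://index.commoncrawl.org/CC-MAIN-2020-16/',
     'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2020-16-index'},
]

print(cdx_api_for(sample_indexes, 'CC-MAIN-2020-16'))
```

With the live data, `cdx_api_for(cdx_indexes, 'CC-MAIN-2020-16')` would be the equivalent call.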

## Basic usage

In [33]:
r = requests.get(api_url,
                 params = {
                     'url': 'reddit.com',
                     'limit': 10,
                     'output': 'json'
                 })

In [34]:
records = [json.loads(line) for line in r.text.split('\n') if line]

In [35]:
pd.DataFrame(records)

Unnamed: 0,urlkey,timestamp,offset,status,languages,digest,length,mime-detected,filename,charset,mime,url,redirect
0,"com,reddit)/",20200525024432,873986269,200,eng,C6Y4VCGYLE3NGEWLJNONES6JMNA74IA3,40851,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347387155.10/warc/CC-MAIN-20200525001747-20200525031747-00335.warc.gz,UTF-8,text/html,https://www.reddit.com/,
1,"com,reddit)/",20200526071834,787273867,200,eng,PHMHCKU365PLDN5UQETZVR4UGMSPDXQJ,42855,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347390448.11/warc/CC-MAIN-20200526050333-20200526080333-00335.warc.gz,UTF-8,text/html,https://www.reddit.com/,
2,"com,reddit)/",20200526163829,3815970,200,,X67YXUXXE5GQPMJKMEE6555BNFPIER7L,35345,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347391277.13/robotstxt/CC-MAIN-20200526160400-20200526190400-00048.warc.gz,,text/html,https://www.reddit.com,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7,"com,reddit)/",20200528125122,12374752,301,,3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ,616,application/octet-stream,crawl-data/CC-MAIN-2020-24/segments/1590347396089.30/crawldiagnostics/CC-MAIN-20200528104652-20200528134652-00582.warc.gz,,unk,http://www.reddit.com/,https://www.reddit.com/
8,"com,reddit)/",20200528125122,889368118,200,eng,7CF6J2D6SHWFD35MEQI43NNGR2W4SHHR,41402,text/html,crawl-data/CC-MAIN-2020-24/segments/1590347396089.30/warc/CC-MAIN-20200528104652-20200528134652-00335.warc.gz,UTF-8,text/html,https://www.reddit.com/,
9,"com,reddit)/",20200528192150,13537156,301,,3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ,618,application/octet-stream,crawl-data/CC-MAIN-2020-24/segments/1590347399830.24/crawldiagnostics/CC-MAIN-20200528170840-20200528200840-00582.warc.gz,,unk,http://www.reddit.com/,https://www.reddit.com/


In [36]:
print(pd.DataFrame(records).head().to_markdown())

|    | urlkey       |      timestamp |    offset |   status | languages   | digest                           |   length | mime-detected   | filename                                                                                                           | charset   | mime      | url                     |   redirect |
|---:|:-------------|---------------:|----------:|---------:|:------------|:---------------------------------|---------:|:----------------|:-------------------------------------------------------------------------------------------------------------------|:----------|:----------|:------------------------|-----------:|
|  0 | com,reddit)/ | 20200525024432 | 873986269 |      200 | eng         | C6Y4VCGYLE3NGEWLJNONES6JMNA74IA3 |    40851 | text/html       | crawl-data/CC-MAIN-2020-24/segments/1590347387155.10/warc/CC-MAIN-20200525001747-20200525031747-00335.warc.gz      | UTF-8     | text/html | https://www.reddit.com/ |        nan |
|  1 | com,reddit)/ | 20200526071834 | 7

## Filters and fields

Let's use a few of the bells and whistles from the API.

Particularly interesting are the [filters](https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter), which let us fetch only the rows we need.

In [37]:
r = requests.get(api_url,
                 params = {
                     'url': 'https://www.reddit.com/r/',
                     'matchType': 'prefix',
                     'limit': 10,
                     'output': 'json',
                     'fl': 'url,filename,offset,length',
                     'filter': ['=status:200', '=mime-detected:text/html', '~url:.*/comments/']
                 })

In [38]:
r.raise_for_status()

In [39]:
pd.DataFrame([json.loads(line) for line in r.text.split('\n') if line])

Unnamed: 0,url,filename,offset,length
0,https://www.reddit.com/r/0xbitcoin/comments/8o06dk/links_to_the_newestbest_miners_for_nvidia_amd/,crawl-data/CC-MAIN-2020-24/segments/1590347401260.16/warc/CC-MAIN-20200529023731-20200529053731-00112.warc.gz,873475618,30260
1,https://www.reddit.com/r/100yearsago/comments/ghkkz1/may_11th_1920_first_nsdap_advertising_posters_in/?ref_source=embed&ref=share,crawl-data/CC-MAIN-2020-24/segments/1590347392142.20/warc/CC-MAIN-20200527075559-20200527105559-00198.warc.gz,880229230,32606
2,https://www.reddit.com/r/2007scape/comments/6250um/thinking_about_returning_to_osrs/,crawl-data/CC-MAIN-2020-24/segments/1590347391923.3/warc/CC-MAIN-20200526222359-20200527012359-00533.warc.gz,895963534,24631
...,...,...,...,...
7,https://www.reddit.com/r/2darkpark/comments/frm60b/so_how_is_everyone_doing/,crawl-data/CC-MAIN-2020-24/segments/1590348492295.88/warc/CC-MAIN-20200604223445-20200605013445-00195.warc.gz,851246292,17847
8,https://www.reddit.com/r/2healthbars/comments/8tg1y7/the_heel_of_these_heels_are_heels/,crawl-data/CC-MAIN-2020-24/segments/1590347415315.43/warc/CC-MAIN-20200601071242-20200601101242-00439.warc.gz,855249693,31284
9,https://www.reddit.com/r/3Dprinting/comments/3pf96w/troubleshooting_proximity_sensor/,crawl-data/CC-MAIN-2020-24/segments/1590347445880.79/warc/CC-MAIN-20200604161214-20200604191214-00560.warc.gz,842910073,22084
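
The query above can be wrapped into two small helpers, one to build the parameters and one to parse the newline-delimited JSON that comes back. This is only a sketch: `cdx_params` and `parse_json_lines` are hypothetical names, and `urlencode(..., doseq=True)` is used here just to show that a list of filters ends up as a repeated `filter` parameter in the query string.

```python
import json
from urllib.parse import urlencode


def cdx_params(url, filters=(), fields=None, match_type=None, limit=None):
    """Build a parameter dict for a CDX query that returns JSON lines."""
    params = {'url': url, 'output': 'json'}
    if match_type:
        params['matchType'] = match_type
    if limit is not None:
        params['limit'] = limit
    if fields:
        params['fl'] = ','.join(fields)
    if filters:
        # A list value is sent as one 'filter' parameter per entry
        params['filter'] = list(filters)
    return params


def parse_json_lines(text):
    """Parse a response body holding one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


params = cdx_params('https://www.reddit.com/r/',
                    filters=['=status:200', '~url:.*/comments/'],
                    fields=['url', 'filename', 'offset', 'length'],
                    match_type='prefix',
                    limit=10)
query = urlencode(params, doseq=True)
print(query)

# A single made-up response line in the same shape as the real output
sample_body = '{"url": "https://www.reddit.com/", "offset": "873986269", "length": "40851"}\n'
print(parse_json_lines(sample_body))
```

The resulting dict can be passed straight to `requests.get(api_url, params=params)`, and `parse_json_lines(r.text)` then gives the list of records.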


## Pagination

The [introductory blog post to CDX on Common Crawl](https://commoncrawl.org/2015/04/announcing-the-common-crawl-index/) mentions it's paginated to 15,000 results by default.

Let's test that.

In [40]:
r = requests.get(api_url,
                 params = {
                     'url': '*.wikipedia.org',
                     'output': 'json',
                     'showNumPages': True,
                 })

* pageSize is the number of (compressed) blocks per page
* blocks is the total number of compressed blocks
* pages = ceil(blocks / pageSize)


In [41]:
num_pages = r.json()
num_pages

{'pageSize': 5, 'blocks': 2044, 'pages': 409}

In [42]:
import math

In [43]:
math.ceil(num_pages['blocks'] / num_pages['pageSize']) == num_pages['pages']

True
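
The page count can then drive a loop over every page. The sketch below separates the arithmetic from the fetching so it runs offline: `page_count` and `iter_pages` are hypothetical helpers, and `fake_fetch` stands in for a function that would request one page from the API with the `page` parameter. Pages are assumed to be numbered from 0, which is how the server reports them.

```python
import math


def page_count(blocks, page_size):
    """Number of pages needed to cover all blocks (the last page may be partial)."""
    return math.ceil(blocks / page_size)


def iter_pages(fetch_page, n_pages):
    """Yield records from pages 0 .. n_pages - 1, in order.

    fetch_page(page) is expected to return a list of records for that page.
    """
    for page in range(n_pages):
        yield from fetch_page(page)


n_pages = page_count(blocks=2044, page_size=5)
print(n_pages)


def fake_fetch(page):
    # Stand-in for a real request; returns one dummy record per page
    return [{'page': page}]


first_three = list(iter_pages(fake_fetch, 3))
print(first_three)
```

In real use `fetch_page` would wrap `requests.get(api_url, params={..., 'page': page})` and parse the JSON lines of the response.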

In [45]:
r = requests.get(api_url,
                 params = {
                     'url': '*.wikipedia.org',
                     'output': 'json',
                 })

In [46]:
results = [json.loads(line) for line in r.text.split('\n') if line]


In [47]:
len(results)

14735

In [48]:
results[-1]

{'urlkey': 'org,wikipedia,ace)/wiki/geurija_katolik_roma',
 'timestamp': '20200606214423',
 'offset': '227178899',
 'status': '200',
 'languages': 'nno,roh,srp',
 'digest': 'SH3WZL442PB2DKYVIFADVHJU6JC2THSA',
 'length': '18538',
 'mime-detected': 'text/html',
 'filename': 'crawl-data/CC-MAIN-2020-24/segments/1590348519531.94/warc/CC-MAIN-20200606190934-20200606220934-00311.warc.gz',
 'charset': 'UTF-8',
 'mime': 'text/html',
 'url': 'https://ace.wikipedia.org/wiki/Geurija_Katolik_Roma'}

We can adjust the pageSize (in blocks) as well

In [49]:
r = requests.get(api_url,
                 params = {
                     'url': '*.wikipedia.org',
                     'output': 'json',
                     'page': 3,
                     'pageSize': 1,
                 })

In [50]:
results2 = [json.loads(line) for line in r.text.split('\n') if line]

With a pageSize of 1 block we get about 3,000 results per page.

In [51]:
len(results2)

3000

In [52]:
results[0]

{'urlkey': 'org,wikipedia)/',
 'timestamp': '20200524210621',
 'offset': '3147602',
 'status': '301',
 'digest': 'C4WTJB6KZKE6XGJGU4MBB2U4ON7YIZTW',
 'redirect': 'https://www.wikipedia.org/',
 'length': '938',
 'mime-detected': 'text/html',
 'filename': 'crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/robotstxt/CC-MAIN-20200524210325-20200525000325-00349.warc.gz',
 'mime': 'text/html',
 'url': 'https://wikipedia.org'}

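Pages are numbered from 0, so page 3 with a pageSize of 1 should be the fourth of the five blocks in the default first page, and every one of its results should already be in results.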

In [53]:
[r for r in results2 if r not in results]

[]

Going past the last page

In [54]:
r = requests.get(api_url,
                 params = {
                     'url': '*.wikipedia.org',
                     'output': 'json',
                     'page': 409,
                 })

In [55]:
r.status_code

400

In [56]:
print(r.text)

<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="/static/__shared/shared.css"/>
</head>
<body>
<h2>Common Crawl Index Server Error</h2>
<b>Page 409 invalid: First Page is 0, Last Page is 408</b>

</body>
</html>


## An empty request

In [57]:
r = requests.get(api_url,
                 params = {
                     'url': 'skeptric.com/*',
                     'output': 'json',
                 })

In [58]:
r.status_code

404

In [59]:
r.json()

{'error': 'No Captures found for: skeptric.com/*'}
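
Since a query with no captures comes back as a 404 rather than an empty 200, it can be convenient to map the status codes seen above onto Python values. A minimal sketch, where `records_from_response` is a hypothetical helper and the assumption is that a 404 from a valid index endpoint always means "no captures":

```python
import json


def records_from_response(status_code, text):
    """Turn a CDX response (status code and body) into a list of records."""
    if status_code == 404:
        # Assumption: the index URL is valid, so 404 means no captures
        return []
    if status_code != 200:
        # For example 400 when asking for a page past the last one
        raise RuntimeError(f'CDX request failed with status {status_code}')
    return [json.loads(line) for line in text.splitlines() if line]


no_captures = records_from_response(
    404, '{"error": "No Captures found for: skeptric.com/*"}')
print(no_captures)

one_capture = records_from_response(
    200, '{"url": "https://www.reddit.com/", "status": "200"}\n')
print(one_capture)
```

With a `requests` response `r` this would be called as `records_from_response(r.status_code, r.text)`.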

# Retrieving content

In [60]:
record = records[0]

In [61]:
record

{'urlkey': 'com,reddit)/',
 'timestamp': '20200525024432',
 'offset': '873986269',
 'status': '200',
 'languages': 'eng',
 'digest': 'C6Y4VCGYLE3NGEWLJNONES6JMNA74IA3',
 'length': '40851',
 'mime-detected': 'text/html',
 'filename': 'crawl-data/CC-MAIN-2020-24/segments/1590347387155.10/warc/CC-MAIN-20200525001747-20200525031747-00335.warc.gz',
 'charset': 'UTF-8',
 'mime': 'text/html',
 'url': 'https://www.reddit.com/'}

In [62]:
data_url = 'https://commoncrawl.s3.amazonaws.com/' + record['filename']
data_url

'https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347387155.10/warc/CC-MAIN-20200525001747-20200525031747-00335.warc.gz'

Use a [Range header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) to get just the data we need.

In [63]:
# Range is inclusive at both ends, so the last byte is offset + length - 1
headers = {'Range': f'bytes={int(record["offset"])}-{int(record["offset"]) + int(record["length"]) - 1}'}
headers

{'Range': 'bytes=873986269-874027119'}
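
HTTP byte ranges include both the first and the last position, so a record of `length` bytes starting at `offset` ends at `offset + length - 1`. A small sketch of a reusable helper (`byte_range_header` is a hypothetical name), using the offset and length of the record above:

```python
def byte_range_header(offset, length):
    """Build a Range header covering exactly `length` bytes from `offset`.

    The index returns offset and length as strings, so both are converted.
    """
    start = int(offset)
    end = start + int(length) - 1  # the end position is inclusive
    return {'Range': f'bytes={start}-{end}'}


range_header = byte_range_header('873986269', '40851')
print(range_header)
```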

In [64]:
r = requests.get(data_url, headers=headers)

In [68]:
import zlib

In [70]:
data = zlib.decompress(r.content)

error: Error -3 while decompressing data: incorrect header check

Plain zlib.decompress expects a zlib header, but each record in the WARC file is compressed as its own gzip member, so the default call fails with an incorrect header check.

To read gzip data with zlib we need to [set the wbits](https://stackoverflow.com/a/22310760).

In [71]:
data = zlib.decompress(r.content, wbits = zlib.MAX_WBITS | 16)
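
The same behaviour can be reproduced offline. The sketch below assumes, as is the case for WARC files compressed record by record, that the archive is a concatenation of independent gzip members; the two tiny records are made up for illustration. Slicing out one member by offset and length mimics the range request.

```python
import gzip
import zlib

# Two stand-in records, each compressed as its own gzip member
first = gzip.compress(b'WARC/1.0\r\nWARC-Type: request\r\n\r\n')
second = gzip.compress(b'WARC/1.0\r\nWARC-Type: response\r\n\r\n')
archive = first + second

# Offset and length of the second record, as an index would report them
offset, length = len(first), len(second)
chunk = archive[offset:offset + length]

# Adding 16 to wbits tells zlib to expect a gzip header and trailer
decoded = zlib.decompress(chunk, wbits=zlib.MAX_WBITS | 16)
print(decoded.decode('utf-8'))

# The gzip module handles a single complete member as well
same = gzip.decompress(chunk)
```

Because the slice starts exactly at a member boundary, it is a complete gzip stream in its own right, which is why either call works on it.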

In [72]:
print(data.decode('utf-8'))

WARC/1.0
WARC-Type: response
WARC-Date: 2020-05-25T02:44:32Z
WARC-Record-ID: <urn:uuid:fa7c243e-d055-469b-bb4f-aa8580bc8330>
Content-Length: 238774
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:2a234f6f-6796-4962-8c6f-84a6fe8b8945>
WARC-Concurrent-To: <urn:uuid:b7ec4524-bc4a-4da1-906b-6c53f9c9836e>
WARC-IP-Address: 199.232.65.140
WARC-Target-URI: https://www.reddit.com/
WARC-Payload-Digest: sha1:C6Y4VCGYLE3NGEWLJNONES6JMNA74IA3
WARC-Block-Digest: sha1:HJ6BA5YAW24SEPDAYA5NUAXA6RG2UBBJ
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Connection: keep-alive
X-Crawler-Content-Length: 41748
Content-Length: 237219
Content-Type: text/html; charset=UTF-8
x-ua-compatible: IE=edge
x-frame-options: SAMEORIGIN
x-content-type-options: nosniff
x-xss-protection: 1; mode=block
X-Crawler-Content-Encoding: gzip
cache-control: max-age=0, must-revalidate
X-Moose: majestic
Accept-Ranges: bytes
Date: Mon, 25 May 2020 02:44:32 GMT
Via: 1.1 va