
# Common Crawl data example

Common Crawl provides instructions for getting started with their data at https://commoncrawl.org/get-started and an overview of crawls at https://commoncrawl.org/overview.

Download the gzip-compressed listing of paths to the web archive (WARC) files of a recent crawl (CC-MAIN-2024-51).

In [1]:
!wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/warc.paths.gz

--2025-01-12 13:37:29--  https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/warc.paths.gz
Resolving data.commoncrawl.org (data.commoncrawl.org)... 18.172.170.85, 18.172.170.86, 18.172.170.105, ...
Connecting to data.commoncrawl.org (data.commoncrawl.org)|18.172.170.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180795 (177K) [binary/octet-stream]
Saving to: ‘warc.paths.gz’


2025-01-12 13:37:29 (7.93 MB/s) - ‘warc.paths.gz’ saved [180795/180795]



Count the lines in the listing. Each line is the path of one WARC file.

In [2]:
!zcat warc.paths.gz | wc -l

90000


Get the first listed path

In [3]:
!zcat warc.paths.gz | head -n 1

crawl-data/CC-MAIN-2024-51/segments/1733066035857.0/warc/CC-MAIN-20241201162023-20241201192023-00000.warc.gz
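The paths in the listing are relative, so a download URL is formed by putting `https://data.commoncrawl.org/` in front of a path. The same steps can be sketched in Python with the standard `gzip` module. The helper names `read_paths` and `to_url` are made up for this illustration.

```python
import gzip

# Prefix used by the wget commands in this notebook.
BASE_URL = 'https://data.commoncrawl.org/'

def read_paths(listing_file):
    """Return the non-empty lines of a gzip-compressed path listing."""
    with gzip.open(listing_file, 'rt', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

def to_url(path):
    """Turn a relative path from the listing into a download URL."""
    return BASE_URL + path
```

With `warc.paths.gz` downloaded as above, `len(read_paths('warc.paths.gz'))` should agree with the `wc -l` count, and `to_url(read_paths('warc.paths.gz')[0])` should give the URL of the first WARC file.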


Download that WARC file (about 1 GB compressed)

In [4]:
!wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/segments/1733066035857.0/warc/CC-MAIN-20241201162023-20241201192023-00000.warc.gz

--2025-01-12 13:41:18--  https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-51/segments/1733066035857.0/warc/CC-MAIN-20241201162023-20241201192023-00000.warc.gz
Resolving data.commoncrawl.org (data.commoncrawl.org)... 18.161.6.27, 18.161.6.34, 18.161.6.121, ...
Connecting to data.commoncrawl.org (data.commoncrawl.org)|18.161.6.27|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1013005626 (966M) [application/octet-stream]
Saving to: ‘CC-MAIN-20241201162023-20241201192023-00000.warc.gz’


2025-01-12 13:41:35 (59.2 MB/s) - ‘CC-MAIN-20241201162023-20241201192023-00000.warc.gz’ saved [1013005626/1013005626]



Decompress the [gzip](https://en.wikipedia.org/wiki/Gzip) file. Note that `gunzip` replaces the `.warc.gz` file with the unpacked `.warc` file.

In [5]:
!gunzip CC-MAIN-20241201162023-20241201192023-00000.warc.gz

Check the size of the unpacked data

In [8]:
!du -h CC-MAIN-20241201162023-20241201192023-00000.warc

4.7G	CC-MAIN-20241201162023-20241201192023-00000.warc
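Unpacking to disk is not strictly needed for a quick look. As a rough sketch, Python's standard `gzip` module can stream a compressed file and stop after a few lines. The helper name `head_gz` is hypothetical, and the sketch assumes the compressed file is still on disk (for example, kept with `gunzip -k`).

```python
import gzip
from itertools import islice

def head_gz(path, n=100):
    """Return the first n lines of a gzip file, decoded leniently."""
    with gzip.open(path, 'rb') as f:
        # WARC payloads can be binary, so undecodable bytes are replaced
        # instead of raising an error.
        return [line.decode('utf-8', errors='replace').rstrip('\r\n')
                for line in islice(f, n)]
```

Only as much data as is needed to produce the first `n` lines is decompressed, so this stays fast even for a file of several gigabytes.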


Get the first 100 lines of the WARC file

In [11]:
!head -n 100 CC-MAIN-20241201162023-20241201192023-00000.warc

WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2024-12-01T16:20:23Z
WARC-Record-ID: <urn:uuid:c5105f63-116c-4094-a0f0-f5beb32a3094>
Content-Length: 490
Content-Type: application/warc-fields
WARC-Filename: CC-MAIN-20241201162023-20241201192023-00000.warc.gz

isPartOf: CC-MAIN-2024-51
publisher: Common Crawl
description: Wide crawl of the web for December 2024
operator: Common Crawl Admin (info@commoncrawl.org)
hostname: ip-10-67-67-75
software: Apache Nutch 1.20 (modified, https://github.com/commoncrawl/nutch/)
robots: checked via crawler-commons 1.5-SNAPSHOT (https://github.com/crawler-commons/crawler-commons)
format: WARC File Format 1.1
conformsTo: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/


WARC/1.0
WARC-Type: request
WARC-Date: 2024-12-01T17:20:07Z
WARC-Record-ID: <urn:uuid:e26f59d1-67ee-4db5-9346-5faad4b8b4b3>
Content-Length: 272
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:c5105f63-116c-4094-a0
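As the output shows, each WARC record starts with a version line followed by `Name: value` header lines, and a blank line separates the headers from the record content. The sketch below parses one such header block into a dictionary to make that layout concrete. It is a simplified illustration (it ignores details such as continuation lines), and `parse_warc_headers` is a made-up name; for real work a dedicated WARC library is the safer choice.

```python
def parse_warc_headers(block):
    """Parse one WARC header block (text up to the first blank line)."""
    lines = block.strip().splitlines()
    version = lines[0].strip()          # e.g. 'WARC/1.0'
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break                       # a blank line ends the header block
        name, _, value = line.partition(':')
        fields[name.strip()] = value.strip()
    return version, fields

# Header lines copied from the first record shown above.
sample = """WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2024-12-01T16:20:23Z
Content-Length: 490
Content-Type: application/warc-fields
"""
version, fields = parse_warc_headers(sample)
print(version, fields['WARC-Type'], fields['Content-Length'])
# prints: WARC/1.0 warcinfo 490
```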

Grab the content of the first English-language HTML page served as UTF-8, using the [warcio](https://github.com/webrecorder/warcio) package

In [12]:
!pip install warcio

Collecting warcio
  Downloading warcio-1.7.5-py2.py3-none-any.whl.metadata (16 kB)
Downloading warcio-1.7.5-py2.py3-none-any.whl (40 kB)
Installing collected packages: warcio
Successfully installed warcio-1.7.5


In [30]:
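from warcio.archiveiterator import ArchiveIterator

with open('CC-MAIN-20241201162023-20241201192023-00000.warc', 'rb') as f:
  for record in ArchiveIterator(f):
    # Only 'response' records carry the fetched page content
    if record.rec_type != 'response':
      continue
    # Keep only pages whose HTTP header declares UTF-8 HTML (exact string match)
    if record.http_headers.get_header('Content-Type') != 'text/html; charset=utf-8':
      continue
    # Keep only pages whose HTTP header declares English content
    if record.http_headers.get_header('Content-Language') != 'en':
      continue
    html = record.content_stream().read().decode('utf-8')
    print(html[:1000])  # show the first 1000 characters
    break  # stop after the first matching page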

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" version="XHTML+RDFa 1.0" dir="ltr"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/terms/"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:og="http://ogp.me/ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:sioc="http://rdfs.org/sioc/ns#"
  xmlns:sioct="http://rdfs.org/sioc/types#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<link rel="shortcut icon" href="http://dsg.ac.upc.edu/sites/default/files/dsg_icon_0.ico" type="image/vnd.microsoft.icon" />
  <title>Biblio | Distributed Systems Group</title>
  <style type="text/css" m
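The exact string comparison on `Content-Type` in the cell above skips pages whose header differs only in case or spacing, such as `text/html;charset=UTF-8`. One possible way to make the check more tolerant is to parse the header value with the standard library's `email.message.Message`. The helper names `parse_content_type` and `is_utf8_html` are hypothetical and not part of warcio.

```python
from email.message import Message

def parse_content_type(value):
    """Split a Content-Type header into (media type, charset), lowercased."""
    if not value:
        return None, None
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() returns None when no charset parameter is given
    return msg.get_content_type(), msg.get_content_charset()

def is_utf8_html(value):
    """True if the header value declares HTML with a UTF-8 charset."""
    media_type, charset = parse_content_type(value)
    return media_type == 'text/html' and charset == 'utf-8'
```

In the loop, `is_utf8_html(record.http_headers.get_header('Content-Type'))` could then stand in for the string comparison. This changes which page is found first, so the output would not necessarily match the one shown here.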

Extract text content from HTML using BeautifulSoup

In [25]:
!pip install beautifulsoup4



In [32]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so the result does not depend on which parsers are installed
text = soup.get_text()
print(text)






Biblio | Distributed Systems Group



















Skip to main content







Distributed Systems Group 

Main menulogin
  




Home
People
Research
PublicationsAuthors
Keywords

Projects
Location
Announcements
Weekly Meetings
Software
Former Members
 




You are hereHome 
 Biblio
 



Found 13 results Author Title  [ Type]  Year Filters: Author is René Brunner  [Clear All Filters]Conference PaperR.  Brunner, Chao, I., Chacin, P., Freitag, F., Navarro, L., Ardaiz, O., Joita, L., and Rana, O. F., “Assessing a Distributed Market Infrastructure for Economics-Based Service Selection”, in GADA’07 On the Move to Meaningful Internet Systems, Vilamoura, Portugal, 2007, Springer., vol. 4804, pp. 1403–1416.P.  Chacin, León, X., Brunner, R., Freitag, F., and Navarro, L., “Core Services for Grid Markets”, in The CoreGRID Symposium (CGSYMP 2008), Las Palmas de Gran Canaria, Spain, 2008.R.  Brunner and Freitag, F., “Elaborating a Decentralized Market Information System”, in On the Move to 
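The extracted text contains long runs of blank lines where the HTML had markup but no text. A small post-processing step can tidy this up; the sketch below uses plain string operations, and `collapse_blank_lines` is a made-up helper name rather than a BeautifulSoup function.

```python
def collapse_blank_lines(text):
    """Strip whitespace from each line and drop the lines left empty."""
    stripped = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in stripped if line)

# A short excerpt in the style of the output above.
sample = '\n\n\nBiblio | Distributed Systems Group\n\n\n\nSkip to main content\n\n  \nHome\n'
print(collapse_blank_lines(sample))
```

Applied to the `text` variable from the cell above, `collapse_blank_lines(text)` would keep one line per text fragment. It does not repair words that were joined during extraction, such as `Main menulogin`.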