# Web crawl tasks (more to come)

The objective of these exercises is to learn how to extract searchable data from two popular sources: CommonCrawl and Wikimedia. In the lecture, we saw that CommonCrawl is a great source of web crawl data, and we also became convinced that running our own crawl at a large scale is nearly impossible. The latest CommonCrawl release is from January 2017 and can be found here:

http://commoncrawl.org/2017/02/january-2017-crawl-archive-now-available/

The WET files are of particular interest to us, as they contain the plain text extracted from the crawled pages rather than the raw HTML. We will focus on these text files for now.
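
Once a WET file has been downloaded, a quick way to get a feel for it is to count how many text records it holds. The following is a minimal sketch, assuming that each extracted page is stored as a WARC record whose header contains the line `WARC-Type: conversion`; the function name `count_wet_records` is made up for this example.

```shell
# Sketch: count the plain-text records in one gzipped WET file.
# Assumption: every extracted page is a record whose header has the
# line "WARC-Type: conversion". count_wet_records is a hypothetical name.
count_wet_records() {
    local wet_file="$1"
    # gzip -dc writes the decompressed file to stdout (like zcat);
    # grep -c prints the number of matching lines.
    gzip -dc "$wet_file" | grep -c '^WARC-Type: conversion'
}
```

Call it as `count_wet_records some_file.warc.wet.gz`, where the file name is a placeholder for whichever file you downloaded.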

## Task 1

Using the `wet.paths.gz` file, a bash script, and the `curl` or `wget` programs, download the first few files from the archive. We went through this in some detail during the exercises. What you want to do is:

1. Write a bash script `download_cc.sh` using, for example, `gedit`
2. In the bash script, loop over the first few lines of `wet.paths.gz` and use `curl -O` to download the links
3. Run `bash download_cc.sh` to execute the script

A few hints from the hands-on session:

    zcat wet.paths.gz | head -n 3
    curl -O https://....
    
...and a bash script to get you started:

```
for line in $(zcat wet.paths.gz | head -n 3)
do
    echo "This is one line: $line"
done
```
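
One possible way to grow the starter loop into a downloader is sketched below. It assumes that the lines in `wet.paths.gz` are relative paths that must be joined with a download prefix; that prefix is not given here, so it is passed in as a parameter and you need to supply the right value yourself. The function names `list_urls` and `download_all` are made up for this sketch.

```shell
# Sketch: build full URLs from the first N entries of a gzipped path list,
# then download them. list_urls and download_all are hypothetical names.
# Assumption: the entries are relative paths that need a base URL in front.
list_urls() {
    local paths_file="$1" count="$2" base_url="$3"
    # gzip -dc is equivalent to zcat; head keeps only the first $count lines.
    gzip -dc "$paths_file" | head -n "$count" | while IFS= read -r path
    do
        # ${base_url%/} drops one trailing slash so we never emit "//".
        printf '%s/%s\n' "${base_url%/}" "$path"
    done
}

download_all() {
    # Reads one URL per line from stdin and fetches each with curl -O,
    # which saves the file under its remote name in the current directory.
    local url
    while IFS= read -r url
    do
        curl -O "$url"
    done
}
```

With the prefix stored in a variable of your own, for example `BASE_URL`, the two pieces combine as `list_urls wet.paths.gz 3 "$BASE_URL" | download_all`. Keeping URL construction separate from downloading lets you print and check the URLs before fetching anything.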

## Task 2

### Part A) Get a Wikimedia dump of your choosing from https://dumps.wikimedia.org/backup-index.html
For example, the Finnish Wikiquote dump (https://dumps.wikimedia.org/fiwikiquote/20170301/) is called `fiwikiquote`.
On the dump page, pick the file whose description reads something like *"All pages, current versions only"*.

Which did you select?

### Part B) Once downloaded, unpack the bz2 file using the following command:

The file may have the suffix `.bz2` rather than `.gz`. This is simply a different compression format, and it is handled with the program `bunzip2`. So:

    bunzip2 file_of_your_choice.bz2

Example:

    [guest@drain1 Downloads]$ bunzip2 fiwikiquote-20170301-pages-meta-current.xml.bz2
    [guest@drain1 Downloads]$ ls
    fiwikiquote-20170301-pages-meta-current.xml
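
Note that in the example above the `.bz2` file is gone after unpacking. If you would rather keep the compressed original, here is a minimal sketch using the `-k` (keep) option of `bunzip2`; the helper name `unpack_keep` is made up for this example.

```shell
# Sketch: unpack a .bz2 file but keep the compressed original next to it.
# unpack_keep is a hypothetical helper name, not part of the course material.
unpack_keep() {
    local archive="$1"
    # -k (--keep) stops bunzip2 from deleting the .bz2 file.
    bunzip2 -k "$archive"
    # Print the name of the unpacked file (the archive name without .bz2).
    printf '%s\n' "${archive%.bz2}"
}
```

To only peek at the contents without unpacking anything, `bzcat file_of_your_choice.bz2 | head` prints the first lines directly.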

### Part C) Extract plaintext from your dump, using the command below

**Note 1:** We are at the mercy of what Wikimedia gives us. [mwlib](http://mwlib.readthedocs.io/en/latest/index.html) is a Python library for working with Mediawiki XML files, but it so happens that `mwlib` only works in Python 2 and not Python 3. This is why you will run the following command with `python` and not `python3` as we are used to.

**Note 2:** We fixed `mwlib` on the course server, so this exercise works now.

Download the following script: http://bionlp-www.utu.fi/.mjluot/wikiripper_tm_course.py

Then run it on your unpacked dump:

    python wikiripper_tm_course.py --plaintext fiwikiquote-20170301-pages-meta-current.xml > extracted_plaintext.txt

What does the output look like? Do you find it usable?
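
To answer these questions you do not need to open the whole file in an editor. A minimal sketch of a quick preview, with the made-up helper name `preview_text`, could look like this:

```shell
# Sketch: quick look at an extracted plaintext file.
# preview_text is a hypothetical helper name.
preview_text() {
    local text_file="$1" lines="${2:-20}"
    # Reading from stdin keeps the file name out of the wc output;
    # tr strips the padding that some wc versions add.
    echo "lines: $(wc -l < "$text_file" | tr -d ' ')"
    echo "words: $(wc -w < "$text_file" | tr -d ' ')"
    # First few lines, to judge how clean the text is.
    head -n "$lines" "$text_file"
}
```

For example, `preview_text extracted_plaintext.txt 40` prints the line and word counts followed by the first 40 lines.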
