# Sample Dataset Collection
We recommend running the online [Kaggle notebook](https://www.kaggle.com/jacksoncrow/data-collection-demo) to avoid the effort of setting up the environment. If you prefer to run the collection locally, the Kaggle notebook is still a handy reference, since it sets up the environment from scratch.

## Constant definition

First of all, we need to specify the input parameters of the script: the input file listing the articles we want to download, and the parameters that control how they are downloaded.

### Input file
The input file should contain article ids, one per line. By article id we mean the last part of the article URL. For example, for the article with URL https://en.wikipedia.org/wiki/The_Relapse on English Wikipedia, the id is `The_Relapse`. Note that all article ids in a file must come from the same Wikipedia edition, i.e. either all English or all Ukrainian.
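
As a small illustration, an article id can be derived from an article URL roughly as follows. The helper `article_id_from_url` is hypothetical and not part of the collection script; it only assumes the `/wiki/<id>` URL layout described above.

```python
from urllib.parse import urlparse

def article_id_from_url(url):
    """Return the part of a Wikipedia article URL that follows '/wiki/'."""
    path = urlparse(url).path              # e.g. "/wiki/The_Relapse"
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a Wikipedia article URL: " + url)
    return path[len(prefix):]

print(article_id_from_url("https://en.wikipedia.org/wiki/The_Relapse"))  # The_Relapse
```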

### Parameters
Next, you need to specify several parameters to fine-tune the collection script. At a high level, the script takes any previously downloaded information into account to improve performance. In other words, once you have downloaded a dataset from scratch, updating it takes very little time, because most of the data is already downloaded and unmodified since the last collection.
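
The caching idea can be pictured with the minimal sketch below. It is not the script's actual implementation: the helper `needs_download` and the file name used here are assumptions for illustration, and the real `invalidate_*` flags may behave differently in detail.

```python
import os
import tempfile

def needs_download(path, invalidate_cache=False):
    """Return True when `path` must be (re)downloaded."""
    return invalidate_cache or not os.path.exists(path)

with tempfile.TemporaryDirectory() as tmp:
    cached_file = os.path.join(tmp, "text.json")
    print(needs_download(cached_file))                         # True: nothing cached yet
    open(cached_file, "w").close()                             # simulate a finished download
    print(needs_download(cached_file))                         # False: the cached copy is reused
    print(needs_download(cached_file, invalidate_cache=True))  # True: a refresh is forced
```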

Because the script relies on this cache, you can interrupt and restart it at any time without starting everything from scratch. You can also specify precisely what the script should do:
1. download missing articles and images from the input file
2. check that the image metadata of already downloaded articles is up to date
3. force a redownload of all images and/or image metadata and/or article text content

You can also run the script from multiple notebooks or consoles with the same output directory to parallelise the collection process. This significantly reduces the download time, but make sure that no two (offset, limit) ranges overlap. We plan to add multithreading support in the future; for now, this is the only workaround. If you need more details on the parameters, please refer to the documentation in the corresponding Python file.
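
One way to obtain non-overlapping (offset, limit) pairs for several parallel runs is sketched below. The helper `split_ranges` is hypothetical, and it assumes that `offset` is the index of the first article in the input file and `limit` is the number of articles to process.

```python
def split_ranges(total, workers):
    """Split `total` articles into contiguous, non-overlapping (offset, limit) pairs."""
    base, extra = divmod(total, workers)
    ranges, offset = [], 0
    for i in range(workers):
        # The first `extra` workers take one additional article each.
        limit = base + (1 if i < extra else 0)
        if limit:
            ranges.append((offset, limit))
        offset += limit
    return ranges

print(split_ranges(total=10, workers=3))  # [(0, 4), (4, 3), (7, 3)]
```

Each pair can then be passed as `offset` and `limit` to a separate notebook or console.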

In [10]:
0) data/The_Relapse

1) data/Pendle_witches

2) data/Kylfings

3) data/Rampart_Dam

4) data/The_Lucy_poems

'data/'

In [1]:
import reader
import data_preprocessor

Using TensorFlow backend.


In [2]:
filename = 'featured_articles_list.tsv'
out_dir = 'data/'  # alternative: '../WikiImageRecommendation/data/'
invalidate_caption_cache = False
invalidate_headings_cache = True

query_params = reader.QueryParams(
    out_dir = out_dir,
    debug_info = True,
    offset = 0,
    limit = 5,
    invalidate_img_cache = False,
    invalidate_text_cache = False,
    invalidate_img_meta_cache = True,
    invalidate_oudated_img_meta_cache = False,
    only_update_cached_pages = False,
    language_code = 'en',
)

## Data Collection 1. Main Part
In this section, the script does the major part of the work: for each specified article, it downloads the textual content, all images, and some image metadata, such as the description parsed from the Wikimedia Commons page. For details on what is collected and how the dataset is structured, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [3]:
%%time
reader.query(filename=filename, params=query_params)

Downloading... offset=0, limit=5

0) data/The_Relapse
Updating image metadata

1) data/Pendle_witches
Updating image metadata

2) data/Kylfings
Updating image metadata

3) data/Rampart_Dam
Updating image metadata

4) data/The_Lucy_poems
Updating image metadata

Downloaded 0 images, where 0 of them unavailable from commons
CPU times: user 1.95 s, sys: 36.4 ms, total: 1.99 s
Wall time: 22.4 s


## Data Preprocessing 1. Removing images not available on Commons
Before proceeding to the costly operation of downloading and parsing additional image captions, we first remove all images that are not available from Wikimedia Commons. Usually, these are images that were licensed only for use in a specific article and are not publicly available. They constitute around 5-7% of the pictures, so for now we simply remove them. Later we might investigate their licensing conditions and, if allowed, include them in the dataset.

In [4]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='on_commons',
    predicate=lambda x: x['on_commons']
)

0 data/The_Lucy_poems
1 data/Rampart_Dam
2 data/The_Relapse
3 data/Pendle_witches
4 data/Kylfings
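
Conceptually, the filtering step above keeps the metadata records that satisfy `predicate` and then drops the helper field named by `field_to_remove`. The sketch below illustrates that idea on toy data; `filter_records` is a hypothetical stand-in, not the actual body of `data_preprocessor.filter_img_metadata`, and the second image record is made up.

```python
def filter_records(records, predicate, field_to_remove):
    """Keep records matching `predicate`, without the `field_to_remove` key."""
    kept = []
    for record in records:
        if predicate(record):
            cleaned = {k: v for k, v in record.items() if k != field_to_remove}
            kept.append(cleaned)
    return kept

images = [
    {"title": "John_Vanbrugh.jpg", "on_commons": True},
    {"title": "Some_fair_use_poster.jpg", "on_commons": False},  # made-up entry
]
print(filter_records(images, lambda x: x["on_commons"], "on_commons"))
# [{'title': 'John_Vanbrugh.jpg'}]
```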


## Data Collection 2. Image Captions
This part is separated from the main pipeline because it is very time-consuming, so it should be run with care and only when required. The function first tries to parse as many captions as possible with a fast but unreliable approach. It then gathers all remaining captions with a time-consuming method, which downloads the HTML preview page of each image in the article. Moreover, the preview content is generated dynamically by JavaScript, so the script has to execute that code when loading the page.

Again, if you need further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).
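
The two-pass strategy described above can be sketched as follows. This is only an outline under stated assumptions: `collect_captions`, `fast_parse` and `slow_parse` are hypothetical names, and the stub parsers stand in for the real fast parser and the JavaScript-rendered preview download.

```python
def collect_captions(filenames, fast_parse, slow_parse, cache=None):
    """Fill `cache` with captions, using the slow parser only as a fallback."""
    cache = {} if cache is None else cache
    remaining = []
    # Pass 1: try the cheap parser on every image that is not cached yet.
    for name in filenames:
        if name in cache:
            continue
        caption = fast_parse(name)
        if caption is None:
            remaining.append(name)
        else:
            cache[name] = caption
    # Pass 2: fall back to the expensive parser for whatever is left.
    for name in remaining:
        cache[name] = slow_parse(name)
    return cache

fast = {"a.jpg": "Caption A"}.get   # stub: returns None for unknown names
slow_calls = []

def slow(name):
    slow_calls.append(name)         # record how often the slow path is taken
    return "Caption for " + name

result = collect_captions(["a.jpg", "b.jpg", "c.jpg"], fast, slow, cache={"c.jpg": "Cached"})
print(result)
print(slow_calls)  # ['b.jpg']
```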

In [5]:
%%time
reader.query_img_captions(
    filename=filename,
    out_dir=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    language_code=query_params.language_code,
    invalidate_cache=invalidate_caption_cache,
    debug_info=query_params.debug_info,
)

Querying available captions with fast approach

0 data/The_Relapse
1 data/Pendle_witches
2 data/Kylfings
3 data/Rampart_Dam
4 data/The_Lucy_poems

Querying remaining unparsed caption with time-consuming approach


0) data/The_Relapse
Downloading captions for https://en.wikipedia.org/wiki/The_Relapse#/media/File:Colley_Cibber_as_Lord_Foppington_clipped.jpg
Skipping cached caption Colley_Cibber_as_Lord_Foppington_in_The_Relapse_by_John_Vanbrugh_engraving.jpg
Skipping cached caption John_Vanbrugh.jpg
Skipping cached caption Love'sLastShift_characters.png
Skipping cached caption Relapse_characters.png
Skipping cached caption William_Powell_Frith_The_Relapse_Midnight_Alarm_3-3.jpg
Skipping known icon Commons-logo.svg
Skipping known icon Cscr-featured.svg

1) data/Pendle_witches
Skipping cached caption Alice_Nutter_Statue.tif
Skipping cached caption ChattoxFamily.png
Skipping cached caption DemdikeFamily.png
Skipping known icon England_relief_location_map.jpg
Skipping cached caption Lancaste

## Data Preprocessing 2. Removing icons
Most commonly, an icon is an auxiliary image that represents a particular template or category. It is not directly linked to the content described in the article, so we remove icons as noisy data. We distinguish them from other images under the assumption that a user cannot load a preview for an icon on a Wiki page. That is, if you click on an icon in your browser, it will either do nothing or redirect you to another page, whereas for images used in the article it will load a full-screen preview. Although this approach does not work in 100% of cases, it is the most reliable way of identifying icons that we have found so far.

So in this part, we remove all images which were identified as icons.

In [6]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='is_icon',
    predicate=lambda x: not x['is_icon']
)

0 data/The_Lucy_poems
1 data/Rampart_Dam
2 data/The_Relapse
3 data/Pendle_witches
4 data/Kylfings


## Data Collection 3. Image Headings
This is another separate step that queries extra metadata for images, although it might be merged into the main pipeline later on. For each image in the article, it parses all of its parent headings. In other words, if a picture is located in a block with the title `<h3>Title 3</h3>`, then the `headings` field of the metadata will contain the headings of the first, second and third level respectively, i.e. `["Title 1", "Title 2", "Title 3"]`. We parse the entire tree because headings only make sense, and only add information, together with that context. Thus you might consider joining all of them into a single space-separated descriptive sentence.

Again, if you need further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).
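
To make the idea of parent headings concrete, here is a minimal sketch built on the standard-library `html.parser`. It is not the parser used by `reader.query_img_headings`; the class `HeadingTracker` and the toy HTML are assumptions for illustration only.

```python
from html.parser import HTMLParser

class HeadingTracker(HTMLParser):
    """Records, for every <img>, the titles of the headings it is nested under."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (level, title) of the currently open sections
        self.current = None   # level of the heading tag being read, if any
        self.images = {}      # img src -> list of parent heading titles

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            # A new heading closes every section at the same or a deeper level.
            while self.stack and self.stack[-1][0] >= level:
                self.stack.pop()
            self.stack.append((level, ""))
            self.current = level
        elif tag == "img":
            src = dict(attrs).get("src", "")
            self.images[src] = [title for _, title in self.stack]

    def handle_endtag(self, tag):
        if self.current is not None and tag == "h%d" % self.current:
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            level, title = self.stack[-1]
            self.stack[-1] = (level, title + data)

page = ("<h1>Title 1</h1><h2>Title 2</h2><h3>Title 3</h3><img src='a.jpg'>"
        "<h2>Other</h2><img src='b.jpg'>")
tracker = HeadingTracker()
tracker.feed(page)
print(tracker.images["a.jpg"])            # ['Title 1', 'Title 2', 'Title 3']
print(" ".join(tracker.images["b.jpg"]))  # Title 1 Other
```

The last line shows the space-separated joining suggested above.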

In [7]:
%%time
reader.query_img_headings(
    filename=filename,
    out_dir=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    language_code=query_params.language_code,
    invalidate_cache=invalidate_headings_cache,
    debug_info=query_params.debug_info,
)

0 data/The_Relapse
1 data/Pendle_witches
2 data/Kylfings
3 data/Rampart_Dam
4 data/The_Lucy_poems
CPU times: user 1.11 s, sys: 24 ms, total: 1.13 s
Wall time: 2.75 s


## Data Preprocessing 3. Generating visual features
Lastly, to make the dataset more time- and space-efficient to use, we calculate visual features for every image and record them in the dataset. By doing so, we
* save space: a raw image of shape (600, 600, 3) occupies about 500 times more space than a visual feature vector with 2048 elements, while the vector retains the information that is most useful for downstream tasks
* save time: calculating those features from scratch is very time-consuming, so storing them in the dataset spares every user of the dataset from recalculating them

For feature generation we used `ResNet152` pretrained on the `ImageNet` dataset. The features are the output of the last hidden layer of the network, which has shape (19, 24, 2048); we then reduce it to a vector of 2048 items with a max-pooling operation. That vector serves as the feature vector of each image.

We understand that this representation might not be ideal in your scenario, but it seems to be useful in a variety of situations. If you need to calculate features in another way, just modify this last step.
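
The space estimate and the pooling step can be checked with a few lines of NumPy. This is only a sketch: a random array stands in for the real network output, and the assumption is that max-pooling is taken over the two spatial axes.

```python
import numpy as np

# Space saving: values in a raw (600, 600, 3) image vs. a 2048-element vector.
raw_values = 600 * 600 * 3
feature_values = 2048
print(raw_values / feature_values)   # 527.34375, i.e. roughly 500x

# Random stand-in for a network output of shape (19, 24, 2048).
rng = np.random.default_rng(0)
feature_map = rng.random((19, 24, 2048), dtype=np.float32)

# Global max-pooling: keep the maximum of each channel over both spatial axes.
feature_vector = feature_map.max(axis=(0, 1))
print(feature_vector.shape)          # (2048,)
```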

In [8]:
%%time
data_preprocessor.generate_visual_features(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    debug_info=query_params.debug_info,
)

0 data/The_Lucy_poems
1 data/Rampart_Dam
2 data/The_Relapse
3 data/Pendle_witches
4 data/Kylfings
CPU times: user 5min 48s, sys: 3.96 s, total: 5min 52s
Wall time: 1min 3s


## Dataset Examinations
### text.json file
This file contains the textual information of the article: its content in wikitext and HTML format, title, id, and URL. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [11]:
import json
import pprint

text_path = out_dir + "The_Relapse" + "/text.json"
pp = pprint.PrettyPrinter(indent=2)
with open(text_path) as json_file:
    # The file stores a JSON-encoded string, hence the double decoding
    data = json.loads(json.load(json_file))

# Truncate the long fields on a shallow copy so that `data` stays intact
print_data = dict(data)
print_data['wikitext'] = print_data['wikitext'][:5000]
print_data['html'] = print_data['html'][:5000]

pp.pprint(print_data)

{ 'html': '\n'
          '<!DOCTYPE html>\n'
          '<html class="client-nojs" lang="en" dir="ltr">\n'
          '<head>\n'
          '<meta charset="UTF-8"/>\n'
          '<title>The Relapse - Wikipedia</title>\n'
          '<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"Xn6sgQpAMFoAAK2rPZEAAAAL","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"The_Relapse","wgTitle":"The '
          'Relapse","wgCurRevisionId":934587489,"wgRevisionId":934587489,"wgArticleId":216855,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive '
          'template wayback links","EngvarB from July 2019","Use d

### meta.json file
This file contains the visual features of all of the article's images, as well as some image metadata: the description from Wikimedia Commons, the caption from the article, the title, URL, and filename. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [12]:
import json
import pprint

meta_path = out_dir + "The_Relapse" + "/img/meta.json"
pp = pprint.PrettyPrinter(indent=2)
data = None
with open(meta_path) as json_file:
    data = json.loads(json.load(json_file))['img_meta']

print_data = data
for i in range(len(print_data)):
    if 'features' in print_data[i]:
        print_data[i]['features'] = print_data[i]['features'][:10]
print_data = {i:x for i,x in enumerate(print_data)}

pp.pprint(print_data)

{ 0: { 'description': 'Clipped version of Engraving of a painting of the '
                      'English actor Colley Cibber as Lord Foppington in the '
                      'Restoration comedy The Relapse (1696) by John Vanbrugh',
       'features': [ '9.525463',
                     '14.151308',
                     '8.883457',
                     '2.0813723',
                     '2.3151853',
                     '18.83172',
                     '6.1366034',
                     '4.4346333',
                     '11.544488',
                     '26.044575'],
       'filename': 'b647ad40095d319b81f188def08e9ac0.jpg',
       'headings': ['The Relapse', 'External links'],
       'title': 'Colley Cibber as Lord Foppington clipped.jpg',
       'url': 'https://en.wikipedia.org/wiki/File%3AColley_Cibber_as_Lord_Foppington_clipped.jpg'},
  1: { 'caption': "Young Colley Cibber as Vanbrugh's Lord Foppington, "
                  '"brutal, evil, and smart".',
       'description': 'Engravin