# Sample Dataset Collection
## Constant definition

First of all, we need to specify some input parameters for our script: the input file listing the articles we want to download, plus the parameters that control how they are downloaded.

### Input file
It should be a file with article ids, one per line. By article id we mean the last part of the article's URL. For example, for the article with the URL https://en.wikipedia.org/wiki/The_Relapse on English Wikipedia, the id is `The_Relapse`. Please note that all article ids in the file must come from the same Wikipedia, i.e. either all English or all Ukrainian.
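As a small illustration of the id convention above, the sketch below derives an article id from a URL. The helper name `article_id_from_url` is hypothetical and not part of the collection script. Decoding percent-escapes is an assumption, based on the non-encoded Cyrillic titles that appear in this notebook's output.

```python
from urllib.parse import urlparse, unquote

def article_id_from_url(url):
    """Return the part of a Wikipedia article URL that follows '/wiki/'.

    Hypothetical helper for illustration; percent-decoding is an assumption.
    """
    path = urlparse(url).path              # e.g. "/wiki/The_Relapse"
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError("not a Wikipedia article URL: " + url)
    return unquote(path[len(prefix):])

urls = ["https://en.wikipedia.org/wiki/The_Relapse"]
# One id per line, which is the layout the input file expects.
print("\n".join(article_id_from_url(u) for u in urls))
```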

### Parameters
Then you need to specify a variety of parameters to fine-tune the collection script. At a high level, the script takes any previously downloaded information into account in order to increase performance. In other words, once you have downloaded the dataset from scratch, updating it will take very little time, since most of the data will already be downloaded and unmodified since the last collection.

Because the script relies on this cache, you can interrupt and restart it at any time without starting everything from scratch. You can also specify what exactly you want the script to do:
* download missing articles and images from the input file
* check that the image metadata of already downloaded articles is up to date
* force a redownload of all images and/or image metadata and/or article text content

You can also run the script from multiple notebooks/consoles with the same output directory in order to parallelize the collection process. This will significantly reduce the download time, but you need to make sure that the (offset, limit) ranges of the different runs do not overlap. We are planning to add multithreading support later on, so for now this is the only workaround. If you need more details on the parameters, please refer to the documentation in the corresponding Python file.
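When splitting the work across several notebooks or consoles, one way to obtain non-overlapping ranges is to compute them up front. The sketch below is only an illustration: `split_ranges` is a hypothetical helper, and it assumes that `offset` is the index of the first article to process and `limit` is the number of articles to process from there.

```python
def split_ranges(total, workers):
    """Split `total` articles into contiguous, non-overlapping (offset, limit) pairs.

    Hypothetical helper; assumes offset = start index and limit = item count.
    """
    base, extra = divmod(total, workers)
    ranges = []
    offset = 0
    for i in range(workers):
        # The first `extra` workers take one additional article each.
        limit = base + (1 if i < extra else 0)
        if limit:
            ranges.append((offset, limit))
        offset += limit
    return ranges

print(split_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Each pair can then be used as the `offset` and `limit` of one run, all pointing at the same output directory.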

In [1]:
import reader
import data_preprocessor

Using TensorFlow backend.


In [2]:
filename = 'input.tsv'
out_dir = './data_uk/' 
invalidate_caption_cache = True

query_params = reader.QueryParams(
    out_dir = out_dir,
    debug_info = True,
    offset = 0,
    limit = 5,
    invalidate_img_cache = False,
    invalidate_text_cache = False,
    invalidate_img_meta_cache = False,
    invalidate_oudated_img_meta_cache = True,
    only_update_cached_pages = False,
    language_code = 'uk',
)

## Data Collection 1. Main Part
In this section the script does the major part of the work. For each article specified, it downloads the textual content, all of the article's images, and some image metadata, such as the description parsed from the image's Wikimedia Commons page. For details about what is collected and how the dataset is structured, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [3]:
%%time
reader.query(filename=filename, params=query_params)

Downloading... offset=0, limit=3

0) data_uk/Атомний_підводний_човен_з_балістичними_ракетами
Downloading text.json
Updating image metadata
Downloading image Akula_(Typhoon)_class_submarine_DD-ST-85-06625.jpg
Downloading image Artist_rendering_of_a_Columbia-class_ballistic_missile_submarine,_2019_(190306-N-N0101-125).jpg
Downloading image B05_SLBM.jpg
Downloading image Commons-logo.svg
Downloading image Delta-II_class_nuclear-powered_ballistic_missle_submarine_2.jpg
Downloading image FA_gold_ukr.png
Downloading image FS_Redoutable.jpg
Downloading image Flag_of_France.svg
Downloading image Flag_of_India.svg
Downloading image Flag_of_North_Korea.svg
Downloading image Flag_of_Russia.svg
Downloading image Flag_of_the_People's_Republic_of_China.svg
Downloading image Flag_of_the_Soviet_Union.svg
Downloading image Flag_of_the_United_Kingdom.svg
Downloading image Flag_of_the_United_States.svg
Downloading image Jin_(Type_094)_Class_Ballistic_Missile_Submarine.JPG
Downloading image Ohio-class_sub



Updating image metadata
Downloading image 12_Bandery_Street,_Lviv_(04).jpg
Downloading image AndriiSadovyi.JPG
Downloading image Boxed_East_arrow.svg
Downloading image Cables_Lviv.jpg
Downloading image Cartography_of_Europe.svg
Downloading image City_locator_0.svg
Downloading image Coat_of_Arms_of_Lviv_Oblast_SVG.svg
Downloading image Coat_of_arms_of_Lviv.svg
Downloading image Coat_of_arms_of_Ukraine.svg
HTTP Error 404: Not Found
Downloading image Coin_of_Ukraine_Lviv_A.jpg
Downloading image Coin_of_Ukraine_Lviv_R.jpg
Downloading image Commons-logo.svg
Downloading image Compass_rose_pale-50x50.png
Downloading image E14101.jpg
Downloading image East.svg
HTTP Error 404: Not Found
Downloading image Electron_1179_(2).jpg
Downloading image Europe_relief_laea_location_map.jpg
Downloading image FA_gold_ukr.png
Downloading image Flag_of_Albania.svg
Downloading image Flag_of_Austria.svg
Downloading image Flag_of_Belarus.svg
Downloading image Flag_of_Bosnia_and_Herzegovina.svg
Downloading image 



Downloading image Wikinews-logo.svg
Downloading image Wikiquote-logo.svg
Downloading image Wikisource-logo.svg
Downloading image Wikivoyage-Logo-v3-icon.svg
Downloading image Wiktionary-logo.svg
Downloading image ВП_День_батяра_2.jpg
Downloading image Великий_герб.png
HTTP Error 404: Not Found
Downloading image Великий_герб_Львова.png
Downloading image Вигляд_на_північну_частину_міста_з_гори_Лева.jpg
Downloading image Ворота_в_парк_культуры_Львов.jpg
Downloading image Вулиця_Митрополита_Андрея.jpg
Downloading image Вулиця_Староєврейська.jpg
Downloading image Завжди_вірні.jpg
Downloading image Королевские_покои_Черная_каменица.jpg
Downloading image Лвов_Галиција.jpg
Downloading image Логотип_Львова.png
Downloading image Львовский_лев_001.JPG
Downloading image Львовский_пивзавод_3.jpg
Downloading image Львівська_опер_-_нічна_панорама.jpg
Downloading image Панельный_Сихов.jpg
Downloading image Панорама_міста_Львова_з_вулиці_Лукаша,_1.jpg
Downloading image Сихівський_район_Львова.jpg
Downl

## Data Preprocessing 1. Removing images not available on Commons
Before proceeding to the costly operation of downloading and parsing additional image captions, we first remove all images that are not available in the Wikimedia Commons dataset. Usually, these are images that were licensed only for use in a specific article and are not publicly available. They constitute around 5-7% of the images, so for now we simply remove them, but later we might investigate the licensing conditions and, if allowed, include them in the dataset.

In [4]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='on_commons',
    predicate=lambda x: x['on_commons']
)

0 ./data_uk/Львів
1 ./data_uk/Ервін_Шредінгер
2 ./data_uk/Атомний_підводний_човен_з_балістичними_ракетами
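
To make the roles of `predicate` and `field_to_remove` more concrete, here is a minimal sketch of the kind of filtering the call above expresses. It is not the actual implementation of `data_preprocessor.filter_img_metadata`. The helper `filter_records` is hypothetical, and dropping the helper field from the surviving records is an assumption suggested by the parameter name.

```python
def filter_records(records, predicate, field_to_remove):
    """Keep records accepted by `predicate`, then drop the helper field.

    Hypothetical sketch, not the data_preprocessor implementation.
    """
    kept = []
    for rec in records:
        if predicate(rec):
            # Assumption: the flag is no longer needed once filtering is done.
            kept.append({k: v for k, v in rec.items() if k != field_to_remove})
    return kept

sample = [
    {"title": "a.jpg", "on_commons": True},
    {"title": "b.jpg", "on_commons": False},
]
print(filter_records(sample, lambda x: x["on_commons"], "on_commons"))
# [{'title': 'a.jpg'}]
```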


## Data Collection 2. Image Captions
This part is separated from the main pipeline because it is very time-consuming and should be used carefully, only when required. The function first tries to parse as many captions as possible with a fast but unreliable approach. After that, it gathers all remaining captions with a time-consuming approach, which is to download the HTML preview page for each image in the article. Furthermore, that content is generated dynamically by JavaScript, so the generating code has to be executed internally when the page is loaded.

Again, if you need further details, please refer to [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset)
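
The two-stage strategy described above can be sketched as follows. This is an illustration only, not the code behind `reader.query_img_captions`: the function `collect_captions` and the `fast_lookup`/`slow_lookup` callables are hypothetical, and the default of five attempts is an assumption inferred from the `RETRY 0` to `RETRY 4` lines in the log output of this notebook.

```python
def collect_captions(images, fast_lookup, slow_lookup, max_retries=5):
    """Try a cheap bulk lookup first, then fall back to a slow per-image lookup.

    Hypothetical sketch; both lookups are supplied by the caller.
    """
    captions = dict(fast_lookup(images))        # stage 1: fast but incomplete
    for name in images:
        if name in captions:
            continue
        for _attempt in range(max_retries):     # stage 2: slow, with retries
            caption = slow_lookup(name)
            if caption is not None:
                captions[name] = caption
                break
    return captions

# Stand-in lookups for demonstration only.
def fake_fast(images):
    return {name: "fast caption" for name in images if name.endswith(".jpg")}

calls = []
def fake_slow(name):
    calls.append(name)
    # Pretend the page only renders successfully on the third attempt.
    return "slow caption" if len(calls) >= 3 else None

print(collect_captions(["a.jpg", "b.png"], fake_fast, fake_slow))
```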

In [3]:
%%time
reader.query_img_captions(
    filename=filename,
    out_dir=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    language_code=query_params.language_code,
    invalidate_cache=invalidate_caption_cache,
    debug_info=query_params.debug_info,
)

Querying available captions with fast approach

0 data_uk/Атомний_підводний_човен_з_балістичними_ракетами
1 data_uk/Львів
2 data_uk/Ервін_Шредінгер

Querying remaining unparsed caption with time-consuming approach


0) data_uk/Атомний_підводний_човен_з_балістичними_ракетами
Skipping cached caption Akula_(Typhoon)_class_submarine_DD-ST-85-06625.jpg
Skipping cached caption Artist_rendering_of_a_Columbia-class_ballistic_missile_submarine,_2019_(190306-N-N0101-125).jpg
Skipping cached caption B05_SLBM.jpg
Skipping known icon Commons-logo.svg
Skipping cached caption Delta-II_class_nuclear-powered_ballistic_missle_submarine_2.jpg
Downloading captions for https://uk.wikipedia.org/wiki/Атомний_підводний_човен_з_балістичними_ракетами#/media/Файл:FA_gold_ukr.png
RETRY 0  |||  FA_gold_ukr.png
RETRY 1  |||  FA_gold_ukr.png
RETRY 2  |||  FA_gold_ukr.png
RETRY 3  |||  FA_gold_ukr.png
RETRY 4  |||  FA_gold_ukr.png
Skipping cached caption FS_Redoutable.jpg
Skipping known icon Flag_of_France.svg
Skippi

Skipping known icon Flag_of_Poland.svg
Skipping known icon Flag_of_Portugal.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Romania.svg
Skipping known icon Flag_of_Russia.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Serbia.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Spain.svg
Skipping known icon Flag_of_Sweden.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Turkey.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Ukraine.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_Uzbekistan.svg
Skipping known icon Flag_of_the_Czech_Republic.svg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Flag_of_the_Netherlands.svg
Skipping known icon Flag_of_the_People's_Republic_of_China.svg
Skipping known icon Flag_of_the_United_Kingdom.svg
Downloading caption



RETRY 0  |||  Wiktionary-logo.svg
RETRY 1  |||  Wiktionary-logo.svg
RETRY 2  |||  Wiktionary-logo.svg
RETRY 3  |||  Wiktionary-logo.svg
RETRY 4  |||  Wiktionary-logo.svg
Skipping cached caption ВП_День_батяра_2.jpg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Великий_герб_Львова.png
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Вигляд_на_північну_частину_міста_з_гори_Лева.jpg
Skipping cached caption Ворота_в_парк_культуры_Львов.jpg
Skipping cached caption Вулиця_Митрополита_Андрея.jpg
Skipping cached caption Вулиця_Староєврейська.jpg
Skipping cached caption Завжди_вірні.jpg
Skipping cached caption Королевские_покои_Черная_каменица.jpg
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Лвов_Галиција.jpg
Skipping cached caption Логотип_Львова.png
Downloading captions for https://uk.wikipedia.org/wiki/Львів#/media/Файл:Львовский_лев_001.JPG
Skipping cached caption Львовский_пивзавод_3.jpg
Downloading captions for

## Data Preprocessing 2. Removing icons
Most commonly, an icon is an auxiliary image that represents a particular template or category. It is not directly linked to the content described in the article, so we remove icons as noisy data. We distinguish icons from other images under the assumption that a user cannot load a preview for an icon on a Wiki page. That is, if you click on an icon in your browser, it will either do nothing or redirect you to another page, whereas an image used in the article will open a full-screen preview. This approach does not work in 100% of cases, but so far we have found it to be the most reliable way to identify icons.

So in this part, we remove all images that were identified as icons.

In [4]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='is_icon',
    predicate=lambda x: not x['is_icon']
)

0 ./data_uk/Львів
1 ./data_uk/Ервін_Шредінгер
2 ./data_uk/Атомний_підводний_човен_з_балістичними_ракетами


In [9]:
1080000/ 2048

527.34375

## Data Preprocessing 3. Generating visual features
Lastly, in order to make the dataset more time- and space-efficient to use, we calculate visual features for every image and record them in the dataset. By doing so, we
* save space: a raw image of shape (600, 600, 3) holds roughly 500 times more values than a visual feature vector with 2048 elements (600 × 600 × 3 = 1,080,000 versus 2048), while the features preserve much of the information that is useful for downstream tasks
* save time: calculating those features from scratch is a very time-consuming process, so having them saved in the dataset means that users of the dataset do not need to calculate them again

For feature generation we used `ResNet152` pretrained on the `ImageNet` dataset. The features are taken from the output of the last hidden layer of the network, which has the shape (19, 24, 2048), and are then reduced to a vector of 2048 items by a max-pooling operation. That vector of 2048 items serves as the feature vector for each image.

We understand that this representation might not be ideal in your scenario, but it seems to be useful in various situations. If you need to calculate features in another way, please just modify this last step.
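
The pooling step and the space estimate above can be illustrated with NumPy. This is a minimal sketch that uses a random array in place of a real network output, so only the shapes and the arithmetic are meaningful. It is not the code used inside `data_preprocessor.generate_visual_features`.

```python
import numpy as np

# Stand-in for the network output of one image, shape (19, 24, 2048).
rng = np.random.default_rng(0)
feature_map = rng.random((19, 24, 2048), dtype=np.float32)

# Global max-pooling: keep the maximum of each channel over both spatial axes.
feature_vector = feature_map.max(axis=(0, 1))
print(feature_vector.shape)                  # (2048,)

# Number of values in a raw (600, 600, 3) image relative to the feature vector.
print(600 * 600 * 3 / feature_vector.size)   # 527.34375
```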

In [5]:
%%time
data_preprocessor.generate_visual_features(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    debug_info=query_params.debug_info,
)

0 ./data_uk/Львів
1 ./data_uk/Ервін_Шредінгер
2 ./data_uk/Атомний_підводний_човен_з_балістичними_ракетами
CPU times: user 15min 16s, sys: 8.48 s, total: 15min 24s
Wall time: 2min 38s


## Dataset Examinations
### text.json file
This file contains the textual information of the article: its content in wikitext and HTML format, the article title, id, and URL. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [7]:
import json
import pprint

text_path = out_dir + "Атомний_підводний_човен_з_балістичними_ракетами" + "/text.json"
pp = pprint.PrettyPrinter(indent=2)
data = None
with open(text_path) as json_file:
    data = json.loads(json.load(json_file))

print_data = data
print_data['wikitext'] = print_data['wikitext'][:5000]
print_data['html'] = print_data['html'][:5000]

pp.pprint(print_data)

{ 'html': '\n'
          '<!DOCTYPE html>\n'
          '<html class="client-nojs" lang="uk" dir="ltr">\n'
          '<head>\n'
          '<meta charset="UTF-8"/>\n'
          '<title>Атомний підводний човен з балістичними ракетами — '
          'Вікіпедія</title>\n'
          '<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":[",\\t.","\xa0'
          '\\t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","січень","лютий","березень","квітень","травень","червень","липень","серпень","вересень","жовтень","листопад","грудень"],"wgRequestId":"Xnw@nQpAMMMAAXiJqyYAAACW","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Атомний_підводний_човен_з_балістичними_ракетами","wgTitle":"Атомний '
          'підводний човен з балістичними '
          'ракетами","wgCurRevisionId":27509404,"wgRevisionId":27509404,"wgArticleId":2983826,"wgIsArticle":!0,"wg

### meta.json file
This file contains the visual features of all of the article's images, as well as image metadata such as the description from the Commons dataset, the caption from the article, the title, URL, and filename. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).
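
In the sample output of this notebook, the `features` values are printed as strings (for example `'8.060253'`), so they likely need to be converted to numbers before use. The sketch below shows one way to do that on a hand-written record. The record is a shortened stand-in for illustration and was not loaded from `meta.json`.

```python
# Shortened stand-in for one entry of 'img_meta'; values copied from the sample output.
record = {
    "title": "Akula (Typhoon) class submarine DD-ST-85-06625.jpg",
    "features": ["8.060253", "3.4860282", "7.7391806"],
}

# Convert the string-encoded feature values to floats.
features = [float(value) for value in record["features"]]
print(features)  # [8.060253, 3.4860282, 7.7391806]
```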

In [10]:
import json
import pprint

meta_path = out_dir + "Атомний_підводний_човен_з_балістичними_ракетами" + "/img/meta.json"
pp = pprint.PrettyPrinter(indent=2)
data = None
with open(meta_path) as json_file:
    data = json.loads(json.load(json_file))['img_meta']

print_data = data
for i in range(len(print_data)):
    if 'features' in print_data[i]:
        print_data[i]['features'] = print_data[i]['features'][:10]
print_data = {i:x for i,x in enumerate(print_data)}

pp.pprint(print_data)

{ 0: { 'caption': 'Радянські важкі атомні ракетні підводні крейсери проєкту '
                  '941, відомі як «Акули»,\xa0— найбільші в світі підводні '
                  'човни',
       'description': 'English: A starboard quarter view of a Soviet Project '
                      '941 "Akula" class (NATO reporting name: "Typhoon") '
                      'ballistic missile submarine underway.',
       'features': [ '8.060253',
                     '3.4860282',
                     '7.7391806',
                     '18.305214',
                     '3.4404202',
                     '7.2627435',
                     '8.140114',
                     '3.7799273',
                     '16.627892',
                     '2.8433392'],
       'filename': '05adc5f7010870f755bb9390121bc9ab.jpg',
       'title': 'Akula (Typhoon) class submarine DD-ST-85-06625.jpg',
       'url': 'https://uk.wikipedia.org/wiki/%D0%A4%D0%B0%D0%B9%D0%BB%3AAkula_%28Typhoon%29_class_submarine_DD-ST-85-06625.jpg'},
  