# Sample Dataset Collection
Note that we provide an online [Kaggle notebook](https://www.kaggle.com/jacksoncrow/data-collection-demo) so that you can skip setting up an environment. If you want to run this locally, you can set up the environment with our Docker image; see the instructions in [the GitHub repository](https://github.com/OlehOnyshchak/WikipediaMultimodalDownloader#docker).

## Constant definition

First of all, we need to specify the input parameters for our script: an input file listing the articles we want to download, plus parameters that control how they are downloaded.

### Input file
It should be a file with one article id per line. By article id we mean the last part of the article's URL. For example, for the article at https://en.wikipedia.org/wiki/The_Relapse on English Wikipedia, the id is `The_Relapse`. Note that all article ids in a file must come from the same Wikipedia, i.e. either all English or all Ukrainian.

Below is an example of how we set up a sample input file.

In [None]:
articles = ['Butch_Levy']

!rm input.txt > /dev/null 2>&1
article_to_examine = articles[0]
for a in articles:
    ! echo $a >> input.txt
! cat input.txt
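
If you start from article URLs rather than ids, the ids can be derived programmatically. The following is a minimal sketch using only the standard library; the helpers `article_id_from_url` and `language_of` are hypothetical names introduced for illustration and are not part of the downloader.

```python
from urllib.parse import urlparse

def article_id_from_url(url: str) -> str:
    """Return everything after '/wiki/' in the URL path (hypothetical helper)."""
    path = urlparse(url).path            # e.g. '/wiki/The_Relapse'
    return path.partition('/wiki/')[2]   # keeps titles that contain '/'

def language_of(url: str) -> str:
    """Return the Wikipedia language subdomain, e.g. 'en' or 'uk'."""
    return urlparse(url).netloc.split('.')[0]

urls = [
    'https://en.wikipedia.org/wiki/The_Relapse',
    'https://en.wikipedia.org/wiki/Butch_Levy',
]

# All ids in one input file must come from the same Wikipedia.
assert len({language_of(u) for u in urls}) == 1

article_ids = [article_id_from_url(u) for u in urls]
print('\n'.join(article_ids))  # one id per line, as expected in input.txt
```

Writing the printed lines to `input.txt` gives a file in the format described above.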

### Parameters
Next, you need to specify a variety of parameters to fine-tune the collection script. At a high level, the script takes any previously downloaded information into account to improve performance. In other words, once you have downloaded a dataset from scratch, updating it takes very little time, because most of the data is already downloaded and has not changed since the last collection.

Because the script leverages this cache, you can interrupt and restart the collection at any time without starting from scratch. You can also specify precisely what the script should do: 1) download missing articles and images from the input file; 2) check that the image metadata of already downloaded articles is up to date; 3) force a redownload of all images and/or image metadata and/or article text content. You can also run the script from multiple notebooks or consoles with the same output directory to parallelise the collection. That will significantly reduce the download time, but make sure that no two runs use overlapping (offset, limit) ranges. We plan to add multithreading support in the future; for now this is the only workaround. If you need more details on the parameters, please refer to the documentation in the corresponding Python file.
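
When running several instances in parallel, one way to avoid overlapping ranges is to compute them up front. The sketch below assumes that `offset` is the index of the first article in the input file to process and `limit` is the number of articles to process from there; this reading of the parameters is an assumption, so please check the documentation in `reader.py`. The function `split_ranges` is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple

def split_ranges(total: int, workers: int) -> List[Tuple[int, int]]:
    """Split `total` articles into non-overlapping (offset, limit) pairs."""
    base, extra = divmod(total, workers)
    ranges = []
    offset = 0
    for i in range(workers):
        # The first `extra` workers take one additional article each.
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            continue  # more workers than articles: skip empty ranges
        ranges.append((offset, limit))
        offset += limit
    return ranges

print(split_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Each notebook or console would then use one of the pairs as its `offset` and `limit`.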

In [None]:
import reader
import data_preprocessor

In [None]:
## Please refer to reader.py and data_preprocessor.py for documentation

filename = 'input.txt' # 'featured_articles_list.tsv' # 'input.txt'
out_dir = '/home/oleh/data_docker/'  #'../WikiImageRecommendation/data/' 

invalidate_headings_cache = True
invalidate_parsed_titles_cache = False
invalidate_visual_features_cache = False

query_params = reader.QueryParams(
    out_dir = out_dir,
    debug_info = True,
    offset = 0,
    limit = None,
    invalidate_cache = reader.InvalidateCacheParams(
        img_cache = False,
        text_cache = False,
        caption_cache = False,
        img_meta_cache = False,
        oudated_img_meta_cache = True,  
    ),
    only_update_cached_pages = False,
    fill_property= reader.FillPropertyParams(
        img_caption = True,
        img_description = True,
        text_wikitext = False,
        text_html = True,
    ),
    language_code = 'en',
    early_icons_removal = True,
)

## Data Collection
In this section, the script downloads all the required data. That is, for each specified article it downloads its textual content, all of its images, and some image metadata, such as the description parsed from the Wikimedia Commons page. For details about what is collected and how the dataset is structured, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [None]:
%%time
reader.query(filename=filename, params=query_params)

## Data Preprocessing 1. Removing images not available on Commons
Before proceeding to the costly operation of downloading and parsing additional image captions, we first remove all images that are not available in the Wikimedia Commons dataset. Usually, these are images licensed only for use in a specific article and not publicly available. They constitute around 5-7% of pictures, so for now we simply remove them. Later we might investigate their licensing conditions and, if allowed, include them in the dataset.

In [None]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='on_commons',
    predicate=lambda x: ('on_commons' not in x) or x['on_commons'],
)
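
To make the predicate easier to read, here is a small standalone sketch of how it behaves on a few records. The sample records are made up, and `filter_records` is a hypothetical stand-in: we assume that the real function keeps the records for which the predicate returns `True` and then drops the `field_to_remove` key, but please check `data_preprocessor.py` for the actual behaviour.

```python
def keep_on_commons(record: dict) -> bool:
    """Same logic as the lambda above: keep records that lack the flag or have it set."""
    return ('on_commons' not in record) or record['on_commons']

def filter_records(records, field_to_remove, predicate):
    """Hypothetical stand-in for the filtering step (assumed semantics)."""
    kept = [dict(r) for r in records if predicate(r)]
    for r in kept:
        r.pop(field_to_remove, None)  # the helper field is no longer needed
    return kept

# Made-up metadata records for illustration only.
sample_meta = [
    {'filename': 'a.jpg', 'on_commons': True},
    {'filename': 'b.jpg', 'on_commons': False},
    {'filename': 'c.jpg'},  # flag missing, e.g. already filtered earlier
]

filtered = filter_records(sample_meta, 'on_commons', keep_on_commons)
print([r['filename'] for r in filtered])  # ['a.jpg', 'c.jpg']
```

The `'on_commons' not in x` branch lets the filter run again on data where the field is already gone without discarding those records.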

## Data Preprocessing 2. Removing icons
Most commonly, an icon is an auxiliary image that represents a particular template or category. It is not directly linked to the content of the article, so we remove icons as noisy data. We distinguish them from other images under the assumption that a user cannot open a preview of an icon on a wiki page. That is, if you click on an icon in your browser, it will either do nothing or redirect you to another page, whereas clicking on an image used in the article loads a full-screen preview. This approach does not work in 100% of cases, but so far it is the most reliable way of identifying icons that we have found.

So in this part, we remove all images which were identified as icons.

In [None]:
data_preprocessor.filter_img_metadata(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit, 
    debug_info=query_params.debug_info,
    field_to_remove='is_icon',
    predicate=lambda x: ('is_icon' not in x) or (not x['is_icon']),
)

## Data Preprocessing 3. Parsing Image Headings
For each image in the article, this step parses all of its parent headings from the article's HTML. In other words, if a picture is located in a block titled `<h3>Title_3</h3>`, then the `headings` field of the metadata will contain the headings of the first, second and third level respectively, i.e. `["Title_1", "Title_2", "Title_3"]`. We parse the entire tree because headings only make sense, and add information, with all of that context. You might therefore consider joining them into a single space-separated descriptive sentence.

Again, if you need further details, please refer to [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset)
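
The idea of a heading path can be sketched as follows. This is only an illustration of the concept, not the parser used in `data_preprocessor.py`; the helper `push_heading` and the list of headings are hypothetical.

```python
from typing import List, Tuple

Heading = Tuple[int, str]  # (level, title), e.g. (3, "Title_3") for <h3>Title_3</h3>

def push_heading(path: List[Heading], level: int, title: str) -> List[Heading]:
    """Return the heading path after a new heading of the given level is seen."""
    # Keep only the ancestors, i.e. headings of a strictly higher rank.
    ancestors = [(lvl, t) for lvl, t in path if lvl < level]
    return ancestors + [(level, title)]

path: List[Heading] = []
for level, title in [(1, "Title_1"), (2, "Title_2"), (3, "Title_3")]:
    path = push_heading(path, level, title)

headings = [title for _, title in path]
print(headings)            # ['Title_1', 'Title_2', 'Title_3']
print(" ".join(headings))  # Title_1 Title_2 Title_3
```

The last line shows the space-separated sentence suggested above.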

In [None]:
%%time
data_preprocessor.parse_image_headings(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    invalidate_cache=invalidate_headings_cache,
    debug_info=query_params.debug_info,
    language_code=query_params.language_code,
)

## Data Preprocessing 4. Generating visual features
Lastly, to make the dataset more time- and space-efficient to use, we calculate visual features for every image and record them in the dataset. By doing so, we
* save space: a raw image of shape (600,600,3) occupies about 500 times more space than a visual feature vector with 2048 elements. At the same time, the vector is intended to preserve the information that is useful for downstream tasks.
* save time: calculating these features from scratch is very time-consuming. With the features saved in the dataset, its users do not need to calculate them again.
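
The space estimate above can be checked with a quick calculation. It assumes that each pixel channel and each feature element is stored with the same number of bytes, which is a simplification.

```python
raw_values = 600 * 600 * 3   # values in a raw image of shape (600, 600, 3)
feature_values = 2048        # elements in the feature vector

ratio = raw_values / feature_values
print(f"{raw_values} / {feature_values} = {ratio:.1f}")  # 1080000 / 2048 = 527.3
```

So the raw image holds roughly 500 times more values than the feature vector.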

For feature generation we used `ResNet152` pretrained on the `ImageNet` dataset. We take the output of the last hidden layer of the network, a feature map of shape (19, 24, 2048), and reduce it to a vector of 2048 items with a max-pooling operation. That vector of 2048 items serves as our feature vector for each image.
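
The pooling step can be illustrated without the network itself. The sketch below uses random numbers in place of real ResNet152 activations and plain Python lists instead of tensors, so it only shows the shape reduction from (19, 24, 2048) to 2048; `global_max_pool` is a hypothetical helper.

```python
import random

def global_max_pool(feature_map):
    """Reduce an (H, W, C) nested list to C values: the max over all H*W positions."""
    positions = [pixel for row in feature_map for pixel in row]  # H*W vectors of length C
    # zip(*positions) groups the values of each channel across all positions.
    return [max(channel) for channel in zip(*positions)]

random.seed(0)
H, W, C = 19, 24, 2048
# Random stand-in for the network output; real activations would come from the model.
feature_map = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

features = global_max_pool(feature_map)
print(len(features))  # 2048
```

With NumPy arrays the same reduction would typically be written as `feature_map.max(axis=(0, 1))`.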

We understand that this representation might not be ideal in your scenario, but it seems to be useful in a variety of situations. If you need to calculate features in another way, please modify this last step.

In [None]:
%%time
data_preprocessor.generate_visual_features(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    invalidate_cache=invalidate_visual_features_cache,
    debug_info=query_params.debug_info,
)

## Data Preprocessing 5. Parse image titles
Image titles often contain a short, meaningful description of an image. However, they are commonly written with words _separated_by_underscore_symbol_, in _camelCase_, or simply _writtenwithoutspaces_. To extract useful features from a title, we try to guess its separate words and record them in the `processed_title` field. We do this with `redditscore.tokenizer`, which parses the string based on known words and their frequency in the language. In other words, if a string can be split in several possible ways, the split with more frequently used terms takes priority.
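
To give an intuition for the three cases, here is a toy sketch. It is not the `redditscore` implementation: `split_explicit` and `segment` are hypothetical helpers, and `WORD_COUNTS` is a made-up frequency table that stands in for real language statistics.

```python
import math
import re

def split_explicit(title: str) -> list:
    """Split on underscores/hyphens and on lowercase-to-uppercase (camelCase) boundaries."""
    spaced = re.sub(r'[_\-]+', ' ', title)
    spaced = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', spaced)
    return spaced.split()

# Made-up word counts for illustration only.
WORD_COUNTS = {'written': 50, 'with': 500, 'out': 300, 'without': 200, 'spaces': 40}

def segment(text: str, counts=WORD_COUNTS, max_len: int = 20) -> list:
    """Split `text` into the word sequence with the highest total log-probability."""
    total = sum(counts.values())
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts:
                score = math.log(counts[word] / total)
            else:
                score = -20.0 * len(word)  # heavy penalty for unknown chunks
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [word])
    return best[len(text)][1]

print(split_explicit('separated_by_underscore'))  # ['separated', 'by', 'underscore']
print(split_explicit('camelCase'))                # ['camel', 'Case']
print(segment('writtenwithoutspaces'))            # ['written', 'without', 'spaces']
```

In the last example both `with` + `out` and `without` are valid splits; with these toy counts the single word `without` is more probable than the pair, so it takes priority.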

In [None]:
%%time
data_preprocessor.tokenize_image_titles(
    data_path=query_params.out_dir,
    offset=query_params.offset,
    limit=query_params.limit,
    invalidate_cache=invalidate_parsed_titles_cache,
    debug_info=query_params.debug_info,
)

## Dataset Examinations
### text.json file
This file contains the article's textual information: the content of the article in wikitext and HTML format, plus the article title, id, and URL. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [None]:
import json
import pprint

text_path = out_dir + article_to_examine + "/text.json"
pp = pprint.PrettyPrinter(indent=2)
data = None
with open(text_path) as json_file:
    data = json.loads(json.load(json_file))

print_data = data
if 'wikitext' in print_data:
    print_data['wikitext'] = print_data['wikitext'][:5000]

if 'html' in print_data:
    print_data['html'] = print_data['html'][:5000]

pp.pprint(print_data)
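
Note that the cell above calls `json.loads` on the result of `json.load`. This suggests that the file stores a JSON document that was itself serialized as a JSON string, so it has to be decoded twice. The sketch below reproduces that round trip in memory; the record is made up for illustration.

```python
import io
import json

record = {'title': 'Butch Levy', 'id': 'Butch_Levy'}  # hypothetical content

# Encode twice: the inner dumps produces a string, the outer one quotes it.
file_content = json.dumps(json.dumps(record))

# First decode returns a plain string, second decode returns the dictionary.
once = json.load(io.StringIO(file_content))
twice = json.loads(once)

print(type(once).__name__, type(twice).__name__)  # str dict
```

If a single `json.load` already returns a dictionary for your files, the second decode is not needed.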

### meta.json file
This file contains the visual features of all the article's images, as well as image metadata such as the description from the Commons dataset, the caption from the article, the title, URL and filename. For further details, please refer to the [Kaggle Dataset Page](https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset).

In [None]:
import json
import pprint

meta_path = out_dir + article_to_examine + "/img/meta.json"
pp = pprint.PrettyPrinter(indent=2)
data = None
with open(meta_path) as json_file:
    data = json.loads(json.load(json_file))['img_meta']

print_data = data
for i in range(len(print_data)):
    if 'features' in print_data[i]:
        print_data[i]['features'] = print_data[i]['features'][:10]
print_data = {i:x for i,x in enumerate(print_data)}

pp.pprint(print_data)