## Tutorial: Adding a new dataset (Arabic as an example)

This tutorial walks you step by step through using the Datasets library (previously nlp) from HuggingFace to add your own dataset by uploading it with the datasets CLI as a user or an organization. The library provides easy, shareable access to datasets. It also helps users work around RAM limitations through smart memory mapping.

## Wikipedia dump corpus

This tutorial uses a corpus originally downloaded from the following page: [Wiki_corpus](https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/)

It is a corpus extracted from Wikipedia dumps. The Arabic corpus consists of 131 M tokens and its size is 3.3 GB. The file is in XML format, so we use Perl to parse the XML document and extract the text.

Here is the paper that explains how the corpus was constructed:

*  D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
    *  In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC'12), 2012
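
The tutorial itself extracts the text with Perl, and that script is not shown here. As a rough illustration of the same step, the sketch below streams through an XML file with Python's standard library and yields the text of each article. The element name `content` and the sample document are assumptions made for the example, not a confirmed description of the dump's schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_article_texts(source, tag="content"):
    """Yield the text of every <tag> element while streaming through the XML."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            text = "".join(elem.itertext()).strip()
            if text:
                yield text
            # Release the memory held by the element we just processed.
            elem.clear()


# Tiny stand-in for the real dump; the element names are hypothetical.
sample = io.BytesIO(
    "<wikipedia><article><content>مرحبا بالعالم</content></article>"
    "<article><content>نص ثان</content></article></wikipedia>".encode("utf-8")
)
texts = list(iter_article_texts(sample))
print(texts)
```

Streaming with `iterparse` avoids loading a multi-gigabyte file into memory at once, which matters for a dump of this size.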


In [None]:
# install datasets 
!pip install datasets

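# Make sure that we have a recent version of pyarrow (at least 0.16) in the session before we continue - otherwise restart the Colab runtime to activate the newest version
import pyarrow
# Compare (major, minor) together so that releases such as 1.0 or 2.0 are not mistaken for old versions
major, minor = (int(part) for part in pyarrow.__version__.split('.')[:2])
if (major, minor) < (0, 16):
    import os
    os.kill(os.getpid(), 9)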

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/1a/38/0c24dce24767386123d528d27109024220db0e7a04467b658d587695241a/datasets-1.1.3-py3-none-any.whl (153kB)

In [None]:
import datasets
import os; import psutil; import timeit
from datasets import load_dataset


In [None]:
## Clone the datasets library because we are going to use its datasets-cli script
!git clone https://github.com/huggingface/datasets.git

Cloning into 'datasets'...
remote: Enumerating objects: 17836, done.
remote: Total 17836 (delta 0), reused 0 (delta 0), pack-reused 17836
Receiving objects: 100% (17836/17836), 39.79 MiB | 32.91 MiB/s, done.
Resolving deltas: 100% (7160/7160), done.


## Test your dataset script

We first need to write a dataset script that defines our dataset and its specifications. Examples of dataset scripts written for the datasets library are available here: [Datasets_Script](https://github.com/huggingface/datasets/tree/master/datasets). It is important to test the script you write. If the script works properly, you can upload it to the HuggingFace website under your name.

#### - How to test your script?

Create a folder for your dataset script under the datasets path in the datasets library folder, for example: (datasets/datasets/your_dataset_name). Then create your dataset script in that folder (your_dataset_name.py).

In [2]:
#@title
%%html
<div style="background-color: pink;">
  The dataset script used in this notebook is adapted from the <a href="https://github.com/huggingface/nlp/tree/master/datasets/bookcorpus">bookcorpus</a> dataset script in huggingface/nlp, with some modifications, because the two datasets are similar.
  Click (SHOW CODE) if you want to see the dataset script used with the WikiArabic dataset.
</div>


""""
This is the script I used for this dataset :

# coding=utf-8
# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace NLP Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lint as: python3
""" Wikipedia Arabic corpus."""

from __future__ import absolute_import, division, print_function

import glob
import os
import re

import nlp


_DESCRIPTION = """\
The corpus is downloaded from the following page: https://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/ \
It is a corpus extracted from Wikipedia dumps. \
The Arabic corpus consists of 131 M tokens and its size is 2.6G. \
The file is in XML format.
Therefore, we use Perl to parse the XML document and extract the text.\
"""

_CITATION = """\
@inproceedings{goldhahn-etal-2012-building,
    title = "Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages",
    author = "Goldhahn, Dirk  and
      Eckart, Thomas  and
      Quasthoff, Uwe",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf",
    pages = "759--765",
}
"""

URL = "https://drive.google.com/uc?export=download&id=1VqbVq2FDg8kVdeSm6EyyavRlb1JXYYos"

class WikiArabConfig(nlp.BuilderConfig):
    """BuilderConfig for WikiArabic corpus."""

    def __init__(self, **kwargs):
        """BuilderConfig for WikiArabic.
        Args:
        **kwargs: keyword arguments forwarded to super.
        """
        super(WikiArabConfig, self).__init__(
            version=nlp.Version("1.0.0", "New split API (https://tensorflow.org/datasets/splits)"), **kwargs
        )


class WikiArab(nlp.GeneratorBasedBuilder):
    """WikiArabic dataset."""

    BUILDER_CONFIGS = [WikiArabConfig(name="plain_text", description="Plain text",)]

    def _info(self):
        return nlp.DatasetInfo(
            description=_DESCRIPTION,
            features=nlp.Features({"text": nlp.Value("string"),}),
            supervised_keys=None,
            homepage="https://linguatools.org/",
            citation=_CITATION,
        )

    def _vocab_text_gen(self, archive):
        for _, ex in self._generate_examples(archive):
            yield ex["text"]

    def _split_generators(self, dl_manager):
        arch_path = dl_manager.download_and_extract(URL)

        return [
            nlp.SplitGenerator(name=nlp.Split.TRAIN, gen_kwargs={"directory": arch_path}),
        ]

    def _generate_examples(self, directory):
        index=directory.rfind("datasets")
        index=index+8
        url=directory[:index]
        direct_name=directory[index+1:]
        directory=url

        files = [
            os.path.join(directory, direct_name),
        ]

        _id = 0
        for txt_file in files:
            with open(txt_file, mode="r", encoding="utf-8") as f:
                for line in f:
                    yield _id, {"text": line.strip()}
                    _id += 1
""""
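
The core of the script above is `_generate_examples`, which turns every line of the extracted text file into one example with a running integer id. The standalone sketch below reproduces only that logic so it can be tried without the library; the function name `generate_examples` and the sample file are hypothetical and exist just for this illustration.

```python
import os
import tempfile


def generate_examples(paths):
    """Yield (id, example) pairs, one per line, mirroring `_generate_examples`."""
    _id = 0
    for path in paths:
        # The encoding is set explicitly so Arabic text is decoded
        # the same way on every platform.
        with open(path, mode="r", encoding="utf-8") as f:
            for line in f:
                yield _id, {"text": line.strip()}
                _id += 1


# Write a two-line sample file and run the generator over it.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("السطر الأول\nالسطر الثاني\n")
    examples = list(generate_examples([sample_path]))

print(examples)
```

Each yielded pair carries a unique key and a dictionary whose fields match the `features` declared in `_info`, here a single `text` string.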







In [None]:
## Upload the dataset script to your Colab session (it is important that the dataset script has the same name as the folder that contains it, e.g. ar/ar.py)
## The Python script is named (ar.py)

import os
os.chdir("/content")
from google.colab import files
uploaded = files.upload()
!ls

## Make sure that the python file is in the correct path (/ar/ar.py)

Saving ar.py to ar.py
ar.py  datasets  drive	nlp  sample_data
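
The listing above shows `ar.py` directly in `/content`, while the test command that follows expects `/content/ar/ar.py`. The notebook does not show how the file was moved, so the following is only a sketch of one way to do it. The helper `place_script` is hypothetical and not part of the datasets library, and the demonstration runs in a temporary directory rather than `/content`.

```python
import os
import shutil
import tempfile


def place_script(script_path, root):
    """Move `<name>.py` to `<root>/<name>/<name>.py` and return the new path."""
    name = os.path.splitext(os.path.basename(script_path))[0]
    target_dir = os.path.join(root, name)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, name + ".py")
    shutil.move(script_path, target)
    return target


# Demonstration in a temporary directory; in Colab the root would be "/content".
with tempfile.TemporaryDirectory() as root:
    uploaded = os.path.join(root, "ar.py")
    with open(uploaded, "w") as f:
        f.write("# dataset script placeholder\n")
    new_path = place_script(uploaded, root)
    relative = os.path.relpath(new_path, root).replace(os.sep, "/")
    moved_ok = os.path.isfile(new_path) and not os.path.exists(uploaded)

print(relative)
```

In the Colab session this would correspond to calling `place_script("/content/ar.py", "/content")`.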


In [None]:
## Test whether the dataset script works properly in terms of defining, downloading, and splitting the data

!python /content/datasets/datasets-cli test /content/ar/


2020-12-06 09:16:30.657453: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Checking /content/ar/ar.py for additional imports.
Lock 140611373389024 acquired on /content/ar/ar.py.lock
Creating main folder for dataset /content/ar/ar.py at /root/.cache/huggingface/modules/datasets_modules/datasets/ar
Creating specific version folder for dataset /content/ar/ar.py at /root/.cache/huggingface/modules/datasets_modules/datasets/ar/0a3cb7b758363ce23a7007fe8328c685a987a71225f6cb283d894203de8a127c
Copying script file from /content/ar/ar.py to /root/.cache/huggingface/modules/datasets_modules/datasets/ar/0a3cb7b758363ce23a7007fe8328c685a987a71225f6cb283d894203de8a127c/ar.py
Couldn't find dataset infos file at /content/ar/dataset_infos.json
Creating metadata file for dataset /content/ar/ar.py at /root/.cache/huggingface/modules/datasets_modules/datasets/ar/0a3cb7b758363ce23a7007fe8328c685a987a71225f6cb283d894203de8a127c/ar.json

## Upload your dataset script to the HuggingFace website


In [None]:
## If the test was successful, the next step is to upload your script under your name to the HuggingFace server. Therefore, you need to create an account on HuggingFace.

## After creating an account, log in with the following command:

!python /content/datasets/datasets-cli login

2020-12-06 09:30:32.442219: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1

        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

        
Username: k-halid
Password: 
Login successful
Your token: eQtxIOnQoAIHpzdrSvgPtyqXHiHFMwEuduQpvcyOLwYDRxrMcPgTQPzutEzFZHDKRzXLVfrCMrPmJPODBTknqkVpOjfeRffyYTZkFoyvVJWijXeIgBuvoJtaGsnHKwlF 

Your token has been saved to /root/.huggingface/token


In [None]:
!python /content/datasets/datasets-cli s3_datasets ls

2020-12-06 09:30:58.032904: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
No shared file yet


In [None]:
## Use the following command to upload your dataset script under your name on HuggingFace

!python /content/datasets/datasets-cli upload_dataset /content/ar/

### It will ask whether you want to save the file under the filename (ar/ar.py), or another name based on your dataset script name. Press Y to confirm

2020-12-06 09:31:06.030623: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
About to upload file /content/ar/ar.py to S3 under filename ar/ar.py and namespace k-halid
About to upload file /content/ar/ar.py.lock to S3 under filename ar/ar.py.lock and namespace k-halid
Proceed? [Y/n] Y
Uploading... This might take a while if files are large
Your file now lives at:
https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/k-halid/ar/ar.py
Your file now lives at:
https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/k-halid/ar/ar.py.lock


## Load your dataset from the HuggingFace website using the Datasets library

In [None]:
## Finally, you can load the dataset by using your user name (k-halid) together with the name of the dataset script folder (k-halid/ar)
## Anyone else can also use the dataset if you give them its path on the HuggingFace website ('k-halid/ar')

mem_before = psutil.Process(os.getpid()).memory_info().rss >> 20

## If you uploaded your dataset under your own name on HuggingFace, replace "k-halid/ar" with "your_account_name/dataset_script_folder_name"
wiki = load_dataset('k-halid/ar', split='train')
mem_after = psutil.Process(os.getpid()).memory_info().rss >> 20

print(f"RAM memory used: {(mem_after - mem_before)} MB")


Reusing dataset aracorpus (/root/.cache/huggingface/datasets/aracorpus/plain_text/1.0.0/0a3cb7b758363ce23a7007fe8328c685a987a71225f6cb283d894203de8a127c)



RAM memory used: 73 MB


In [None]:
## The dataset would normally occupy approximately 3 GB of RAM.
## However, thanks to the efficient memory mapping of the datasets library, loading it used only 73 MB of memory.
## Next, you can manipulate your data using the datasets or transformers libraries from HuggingFace.
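
The principle behind the low memory figure can be illustrated with Python's standard `mmap` module. This is not how the datasets library is implemented internally (it builds on Apache Arrow); it is only a sketch of the idea that a file can be mapped and read piece by piece instead of being loaded whole. The file contents and the constants `ROW_WIDTH` and `NUM_ROWS` are made up for the example.

```python
import mmap
import os
import tempfile

ROW_WIDTH = 11      # len(b"row 000000\n")
NUM_ROWS = 100_000  # small stand-in for a large on-disk dataset

# Write fixed-width rows to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(NUM_ROWS):
        f.write(b"row %06d\n" % i)

# Map the file and read a single row by offset; only the pages
# that are actually touched need to be brought into memory.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 54_321 * ROW_WIDTH
        row = mm[start:start + ROW_WIDTH].decode("ascii").strip()
        mapped_size = len(mm)

os.remove(path)
print(row, mapped_size)
```

Random access by offset works here because every row has the same width; columnar formats such as Arrow store offsets explicitly to get the same effect for variable-length text.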