
# About the WikiLingua Dataset

WikiLingua consists of collaboratively written how-to guides with gold-standard summaries
across 18 languages, collected from the [WikiHow](https://www.wikihow.com/) website. The content is high-quality, since each article and summary
is written and edited by 23 people, and further reviewed by 16 people, on average. Each article describes multiple methods, broken into steps (each with an illustrative image), for completing a
procedural task, along with the corresponding short summaries. We align the text and the summary of each step across the 18 languages using the illustrative images. The dataset includes ~770k article and summary pairs.

We hope that you find this dataset useful. Please cite the following paper if you do so:

**[WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization](https://arxiv.org/abs/2010.03093)**


```
@inproceedings{ladhak-wiki-2020,
    title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
    author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
    booktitle={Findings of EMNLP, 2020},
    year={2020}
}
```

# How to Use the WikiLingua Dataset

This Colab notebook gives simple step-by-step instructions on how to use the WikiLingua dataset. In particular, we demonstrate how to align the parallel articles in different languages with the English articles.

In [None]:
import pickle
import os

# Downloading the Data

First, download the data using this [link](https://drive.google.com/drive/folders/1PFvXUOsW_KSEzFm5ixB8J8BDB8zRRfHW?usp=sharing) and upload it to your Google Drive.



The following cell mounts the contents of your Google Drive on your runtime using an authorization code.

In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


# Loading the English articles





Read the file with English articles and summaries from your Google Drive:

In [None]:
# Load the English articles and summaries (a nested dict keyed by article URL).
with open('/content/drive/My Drive/WikiLingua/english.pkl', 'rb') as f:
    english_docs = pickle.load(f)

Here is an example of a key-value pair from this dictionary:

In [None]:
list(english_docs.items())[0]

('https://www.wikihow.com/Avoid-Drinking-and-Driving',
 {'Designating a Driver': {'document': "Designating a driver is a very popular tactic to avoid drinking and driving.  It is important to plan in advance, because your brain function will slow down and your decision making skills will be impaired once you start drinking. Decide before you begin drinking that you will not drive.  Figure out who will be getting you home before you leave. Make sure this person is responsible and keep them in your sight while you are drinking.  Have their contact information handy in case you can’t find them when you are ready to leave.  Choose a friend who doesn’t drink alcohol.  You likely have someone in your friend group who doesn’t drink.  This person is the most likely to remain sober. Decide on one person who will remain sober.  You can take turns within your friend group, alternating who will be the designated driver on each occasion.  Be sure that the designated driver actually remains sober.  

**english_docs** is a dictionary where each key is the URL of a [WikiHow](https://www.wikihow.com/) article and each value is a dictionary.

The inner dictionary has section names as keys and a dictionary (with keys "document" and "summary") as values. The "document" and "summary" entries hold the text and the summary of the corresponding section of the WikiHow article.
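
To make the nesting concrete, the sketch below walks a dictionary with the same shape and flattens it into one record per section. The toy dictionary and the name `iter_sections` are made up for illustration; with the real data you would pass `english_docs` instead.

```python
def iter_sections(docs):
    """Yield (url, section_name, document, summary) for every section."""
    for url, sections in docs.items():
        for section_name, content in sections.items():
            yield url, section_name, content["document"], content["summary"]


# Toy data with the same shape as english_docs (contents are placeholders).
toy_docs = {
    "https://example.com/article": {
        "Section A": {"document": "Long text A.", "summary": "Short A."},
        "Section B": {"document": "Long text B.", "summary": "Short B."},
    }
}

records = list(iter_sections(toy_docs))
print(len(records))  # number of (article, section) pairs in the toy data
```

A flat list like this is often easier to feed into a training or evaluation loop than the nested dictionary.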

The following example shows the document and summary for the section "Explaining the Benefits of Voting" in the article [How to Convince Someone to Vote](https://www.wikihow.com/Convince-Someone-to-Vote).



**Document:**

In [None]:
english_docs["https://www.wikihow.com/Convince-Someone-to-Vote"]["Explaining the Benefits of Voting"]["document"]

"Especially if the person you're talking to wasn’t moved by your persuasive techniques, you can continue the conversation by explaining all the reasons why that person should vote. One of the biggest reasons that people don’t vote is because they don’t see the point, so you can explain that the only way they can be heard is by casting a vote. A vote isn't just a piece of paper: it’s a person’s way of weighing in on who should be running the country, so not voting is the same as throwing away their say in the matter. To make this as clear as possible, use an example that showcases two very different political candidates, and go over how the election of each candidate could change the future for a particular country. Once you’ve explained the two very different possible realities, continue by saying that voting is your way of making sure that situation A doesn’t occur, or ensuring that scenario B does come to fruition, depending on what's important to the person you're trying to persuade

**Summary:**

In [None]:
english_docs["https://www.wikihow.com/Convince-Someone-to-Vote"]["Explaining the Benefits of Voting"]["summary"]

'Tell the person that you must vote to have your voice counted. Explain that voting shapes the future of a country. Offer reasons to vote for different candidates. Explain that the person’s vote does make a difference. Drive the person to the polls.'

# Getting the Parallel English Articles for Articles in Other Languages

The dataset includes articles and summaries in 18 languages. An important characteristic of this dataset is that the articles and summaries in languages other than English come with parallel English articles and summaries. Therefore, it can be used as a benchmark to evaluate cross-lingual abstractive summarization systems.

In [None]:
# Load the Spanish articles and summaries from the same Drive folder.
with open('/content/drive/My Drive/WikiLingua/spanish.pkl', 'rb') as f:
    spanish_docs = pickle.load(f)
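
If you load several languages, a small helper can keep the path handling in one place. The sketch below is only illustrative: `load_language` and `DATA_DIR` are hypothetical names, and it assumes every language is stored as `<language>.pkl` in the same folder, as is the case for `english.pkl` and `spanish.pkl` in the cells of this notebook.

```python
import pickle
from pathlib import Path

# Assumed location; matches the folder used in the cells of this notebook.
DATA_DIR = Path("/content/drive/My Drive/WikiLingua")


def load_language(language, data_dir=DATA_DIR):
    """Load `<language>.pkl` from data_dir and return the unpickled dict."""
    path = Path(data_dir) / f"{language}.pkl"
    if not path.is_file():
        # Fail early with the full path, which helps when Drive is not mounted.
        raise FileNotFoundError(f"Expected a WikiLingua file at {path}")
    with open(path, "rb") as f:
        return pickle.load(f)


# Example usage (requires the mounted Drive folder):
# spanish_docs = load_language("spanish")
```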

**spanish_docs** is a dictionary with the same structure as **english_docs**. The only difference is that the innermost dictionary has two additional keys, "english_section_name" and "english_url", which give the section name and the URL of the parallel English article. These keys are included in the dictionaries for all languages other than English.

The following example shows the document and summary for the section "Explicar los beneficios de votar" in the article [Cómo convencer a alguien de votar](https://es.wikihow.com/convencer-a-alguien-de-votar).


In [None]:
spanish_docs["https://es.wikihow.com/convencer-a-alguien-de-votar"]['Explicar los beneficios de votar']

{'document': 'De manera especial si la persona con la que hablas no se siente entusiasmada por tus técnicas de persuasión, puedes continuar la conversación explicándole todas las razones por las que debe votar. Una razón principal por la cual las personas no votan es porque no ven la finalidad, por lo tanto, puedes explicar que la única forma en que pueden ser escuchadas es emitiendo un voto. Un voto no es solo un pedazo de papel: es la forma en que una persona opina sobre quién debe dirigir el país, por lo tanto, no votar es lo mismo que desechar una opinión sobre el tema. Para que esto sea lo más claro posible, utiliza un ejemplo que muestre a dos candidatos políticos muy diferentes y examina cómo la elección de cada candidato podría cambiar el futuro de un país en particular. Una vez que hayas explicado los dos posibles futuros distintos, puedes afirmar que votar es tu forma de asegurarte de que la situación A no ocurra o de asegurarte de que la situación B sí se cristalice, dependi

You can see the document and the summary for this section as follows:

**Document:**

In [None]:
spanish_docs["https://es.wikihow.com/convencer-a-alguien-de-votar"]['Explicar los beneficios de votar']["document"]

'De manera especial si la persona con la que hablas no se siente entusiasmada por tus técnicas de persuasión, puedes continuar la conversación explicándole todas las razones por las que debe votar. Una razón principal por la cual las personas no votan es porque no ven la finalidad, por lo tanto, puedes explicar que la única forma en que pueden ser escuchadas es emitiendo un voto. Un voto no es solo un pedazo de papel: es la forma en que una persona opina sobre quién debe dirigir el país, por lo tanto, no votar es lo mismo que desechar una opinión sobre el tema. Para que esto sea lo más claro posible, utiliza un ejemplo que muestre a dos candidatos políticos muy diferentes y examina cómo la elección de cada candidato podría cambiar el futuro de un país en particular. Una vez que hayas explicado los dos posibles futuros distintos, puedes afirmar que votar es tu forma de asegurarte de que la situación A no ocurra o de asegurarte de que la situación B sí se cristalice, dependiendo de lo qu

**Summary:**

In [None]:
spanish_docs["https://es.wikihow.com/convencer-a-alguien-de-votar"]['Explicar los beneficios de votar']["summary"]

'Dile a esa persona que debe votar para que su voz se tenga en cuenta. Explica que votar da forma al futuro de un país. Ofrece razones para votar por distintos candidatos. Explica que el voto de una persona sí hace la diferencia. Lleva a esa persona a las urnas.'

**You can see the link and the section name for the parallel document as follows:**


 **URL for the English document:**

In [None]:
parallel_english_url=spanish_docs["https://es.wikihow.com/convencer-a-alguien-de-votar"]['Explicar los beneficios de votar']["english_url"]

In [None]:
parallel_english_url

'https://www.wikihow.com/Convince-Someone-to-Vote'

**Section Name for the English document:**

In [None]:
parallel_english_sn=spanish_docs["https://es.wikihow.com/convencer-a-alguien-de-votar"]['Explicar los beneficios de votar']["english_section_name"]

In [None]:
parallel_english_sn

'Explaining the Benefits of Voting'

**You can get the parallel English document and summary as follows:**


In [None]:
english_docs[parallel_english_url][parallel_english_sn]["document"]

"Especially if the person you're talking to wasn’t moved by your persuasive techniques, you can continue the conversation by explaining all the reasons why that person should vote. One of the biggest reasons that people don’t vote is because they don’t see the point, so you can explain that the only way they can be heard is by casting a vote. A vote isn't just a piece of paper: it’s a person’s way of weighing in on who should be running the country, so not voting is the same as throwing away their say in the matter. To make this as clear as possible, use an example that showcases two very different political candidates, and go over how the election of each candidate could change the future for a particular country. Once you’ve explained the two very different possible realities, continue by saying that voting is your way of making sure that situation A doesn’t occur, or ensuring that scenario B does come to fruition, depending on what's important to the person you're trying to persuade

In [None]:
english_docs[parallel_english_url][parallel_english_sn]["summary"]

'Tell the person that you must vote to have your voice counted. Explain that voting shapes the future of a country. Offer reasons to vote for different candidates. Explain that the person’s vote does make a difference. Drive the person to the polls.'
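
The same lookup can be repeated for every section to build a list of parallel pairs. The following is a minimal sketch rather than part of the original notebook: `build_parallel_pairs` is a hypothetical helper, the toy dictionaries stand in for `spanish_docs` and `english_docs`, and skipping sections whose English counterpart cannot be found is an assumption about how you might want to handle gaps.

```python
def build_parallel_pairs(other_docs, english_docs):
    """Pair each non-English section with its parallel English section."""
    pairs = []
    for url, sections in other_docs.items():
        for section_name, content in sections.items():
            # Follow the two alignment keys into the English dictionary.
            english_sections = english_docs.get(content["english_url"], {})
            english = english_sections.get(content["english_section_name"])
            if english is None:
                continue  # assumption: skip sections with no English match
            pairs.append({
                "url": url,
                "section_name": section_name,
                "document": content["document"],
                "summary": content["summary"],
                "english_document": english["document"],
                "english_summary": english["summary"],
            })
    return pairs


# Toy stand-ins with the same shape as the real dictionaries.
toy_english = {
    "https://example.com/en": {
        "Intro": {"document": "English text.", "summary": "English summary."},
    }
}
toy_spanish = {
    "https://example.com/es": {
        "Introducción": {
            "document": "Texto en español.",
            "summary": "Resumen en español.",
            "english_url": "https://example.com/en",
            "english_section_name": "Intro",
        },
        "Sin pareja": {  # points at an English article that is not present
            "document": "Otro texto.",
            "summary": "Otro resumen.",
            "english_url": "https://example.com/missing",
            "english_section_name": "Nothing",
        },
    }
}

pairs = build_parallel_pairs(toy_spanish, toy_english)
print(len(pairs))  # → 1
```

For a cross-lingual summarization setup, one option is to use `document` as the input and `english_summary` as the target; with the real data you would call `build_parallel_pairs(spanish_docs, english_docs)`.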

In [2]:
!pip install transformers


Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [10]:
from transformers import TFAutoModel, AutoTokenizer
from clean_arabic_text import preprocess_arabic_text

In [4]:
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")



In [16]:
text = """	وتحت عنوان من الكارثه الي التحدي يبدا الكاتب عرض الكتاب الرابع  حيث يوضح كيف كانت اسراءيل فرحه بنصرها عام 67 وانها ارتاحت لاعتقادها بان هناك وقتا طويلا وطويلا جدا قبل ان يفيق العرب من صدمه 67 وكيف ان القوات الجويه للجمهوريه العربيه المتحده قد فاجاتها بعد شهر واحد من نهايه حرب 67 بهجوم جوي عنيف علي مواقعها في سيناء وكان هذا اعلانا عن بدايه حرب من نوع جديد هي حرب الاستنزاف التي استمرت حتي تم وقف اطلاق النار بين الطرفين في 8 اغسطس 1970 ثم وفاه عبدالناصر وتولي انور السادات حكم مصر واستعداده للحرب  ويتعرض الكاتب ايضا وبصوره سريعه لفلسطين والاردن وسوريا قبل ان ينتقل الي الكتاب الخامس عن حرب اكتوبر  حيث يعرض الخطط والاستعدادات المصريه ثم الاستعدادات الاسراءيليه ثم يبدا بعرض وقاءع الحرب بدايه من الضربه الجويه وانهيار خط بارليف واختراقه  ويتوقف الكاتب عند يوم 8 اكتوبر  ويقول  ان هذا اليوم كان اسوا هزيمه في تاريخ الجيش الاسراءيلي ثم ينتقل بنا المءلف الي الجبهه السوريه ثم يعود ثانيه الي يوميات الحرب حتي 7 9 اكتوبر الي 9 13 اكتوبر ثم 14 اكتوبر  ثم يعرض للثغره او ما عرف بعمليه المزرعه الصينيه يوم 16 و 15 اكتوبر والمساعدات الامريكيه الضخمه لاسراءيل  ثم بدايه الضغوط السياسيه علي الرءيس انور السادات من 17 19 اكتوبر ثم ينتقل الكاتب للاحداث التي جرت من 17 20 اكتوبر واعفاء الفريق الشاذلي من منصبه كرءيس لاركان القوات المسلحه المصريه  وتولي الفريق الجمسي بدلا منه ثم الاتجاه الي الموافقه علي طلب وقف اطلاق النار والخلاف مع سوريا بشان هذا الامر  ثم بدايه الهجوم الاسراءيلي من 19 الي 22 اكتوبر علي الضفه الغربيه لقناه السويس والعمليات النهاءيه في سوريا 14 23 اكتوبر  وكيف ان الملك حسين قرر دخول الحرب ضد اسراءيل يوم 9 اكتوبر  ثم يعرض الكاتب المعركه الخاصه بالاستيلاء علي مدينه السويس من 23 اكتوبر الي 25 اكتوبر ثم تطورات هذه المعركه  وكيف انه مع حلول يوم السابع والعشرين من اكتوبر كان الاسراءيليون قد اسروا نحو ثمانيه الاف فرد من القوات المصريه  اغلبهم من وحدات الامداد والتموين
"""
cleaned_text = preprocess_arabic_text(text)

In [18]:
len(text.split(' '))


335

In [17]:
len(cleaned_text.split(' '))

336
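
Both counts above use `split(' ')`, which splits on every single space character. Consecutive spaces therefore produce empty strings, and tabs or newlines are not treated as separators, so these numbers may reflect spacing as much as words. If the goal is a word count, `split()` with no argument is usually closer to what you want. The short example below uses a made-up string, not the Arabic text above, to show the difference.

```python
sample = "\tone  two three\n"  # leading tab, double space, trailing newline

by_space = sample.split(' ')    # splits on each single space only
by_whitespace = sample.split()  # splits on any run of whitespace

print(by_space)       # ['\tone', '', 'two', 'three\n']
print(by_whitespace)  # ['one', 'two', 'three']
```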

In [5]:
model = TFAutoModel.from_pretrained("/content/drive/MyDrive/traindmodel_19")


All model checkpoint layers were used when initializing TFBertModel.

All the layers of TFBertModel were initialized from the model checkpoint at /content/drive/MyDrive/traindmodel_19.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
