
PIAF Data Generation and Analysis

PIAF (Pour une IA Francophone, "For a Francophone AI") is a French project carried out by Etalab (the French government's open data task force) in the context of its Lab IA. PIAF's goal is to build a natively French, SQuAD-like question answering dataset. We do this by leveraging the community to create question-answer pairs with the help of our annotation platform.

This annotation platform begins with a subsample of the French Wikipedia, as described here.

This repo contains the code, created by our partner ReciTAL, that was used to generate this subsample and to compute lexical and syntactic statistics on the collected data. All of this is described in the protocol linked above.

To generate a subsample such as the one we used, follow these instructions:

0. Install requirements

Python:

  • conda env create -f environment.yml

spaCy:

  • Install the spaCy French model: python -m spacy download fr_core_news_sm
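To check that the environment and the French model load correctly, a quick sanity check (this assumes the conda environment created above is activated):

python -c "import spacy; spacy.load('fr_core_news_sm')"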

1. Download a French Wikipedia dump (only the page.sql.gz and pagelinks.sql.gz files are required)

For example, the dumps from 2020/01/20 can be used. These dumps are removed periodically; you can find the current ones at https://dumps.wikimedia.org/frwiki
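If you prefer to script the download, here is a minimal sketch using only the Python standard library; the dump date below is a placeholder, so replace it with a date actually listed on the page above:

from urllib.request import urlretrieve

# Placeholder dump date: pick one listed at https://dumps.wikimedia.org/frwiki
DATE = "20200120"
BASE = f"https://dumps.wikimedia.org/frwiki/{DATE}"

for name in (f"frwiki-{DATE}-page.sql.gz", f"frwiki-{DATE}-pagelinks.sql.gz"):
    print("downloading", name)
    urlretrieve(f"{BASE}/{name}", filename=name)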

2. Compile and launch WikipediaPagerank.java, the PageRank scorer by nayuki

This will perform 1000 iterations of PageRank and save the output in three files. You will need a recent JDK installed on your machine.

javac WikipediaPagerank.java
java -Xmx8G WikipediaPagerank frwiki-20190920-page.sql.gz frwiki-20190920-pagelinks.sql.gz 1000

The program outputs three .raw files:

  • wikipedia-pageranks.raw
  • wikipedia-pagerank-page-links.raw
  • wikipedia-pagerank-page-id-title.raw
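For intuition, here is a toy Python sketch of the power-iteration update that a PageRank scorer of this kind performs on the link graph. It is an illustration only: the Java program parses the SQL dumps and writes its own binary .raw format.

# Toy PageRank power iteration over a tiny in-memory link graph (illustrative, not the repo's code).
DAMPING = 0.85

def pagerank(links, iterations=1000):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - DAMPING) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue  # dangling-node mass is ignored here for brevity
            share = DAMPING * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        rank = new
    return rank

print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}, iterations=50))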

3. Launch dump_topn.py to select the top N (here N = 10000) articles based on the computed PageRank score:

python dump_topn.py 10000 wikipedia-pageranks.raw wikipedia-pagerank-page-id-title.raw output_path_wikipedia-pagerank-title.txt

The program outputs a single file: topN.pkl. Inside wiki-preparation/data we share a top25k.pkl with our top 25k articles from French Wikipedia.
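A hypothetical equivalent of that selection step, assuming the PageRank scores and page titles have already been parsed into dictionaries keyed by page id (the actual script reads the binary .raw files from step 2, and the pickle layout below is an assumption):

import pickle

def dump_topn(scores, titles, n, out_path="topN.pkl"):
    # scores: {page_id: pagerank_score}, titles: {page_id: title} -- hypothetical inputs
    top_ids = sorted(scores, key=scores.get, reverse=True)[:n]
    top_articles = [(pid, titles[pid], scores[pid]) for pid in top_ids]
    with open(out_path, "wb") as f:
        pickle.dump(top_articles, f)
    return top_articles

dump_topn({1: 0.4, 2: 0.1, 3: 0.5}, {1: "Paris", 2: "Lyon", 3: "France"}, n=2)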

4. Launch dump.py to query Wikipedia and obtain the actual content of the Wikipedia articles:

python dump.py topN.pkl

The program outputs two folders:

  • data/Nhtml: The content of N Wikipedia articles in HTML format
  • data/Npages: The content of N Wikipedia articles in wikitext format
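A minimal sketch of how one article's HTML and wikitext can be fetched through the MediaWiki API with requests; the titles would come from topN.pkl, and the file layout below is illustrative rather than the exact one dump.py produces:

import pathlib
import requests

API = "https://fr.wikipedia.org/w/api.php"

def fetch_article(title, html_dir="data/Nhtml", wiki_dir="data/Npages"):
    r = requests.get(API, params={
        "action": "parse", "page": title,
        "prop": "text|wikitext", "format": "json", "formatversion": 2,
    })
    parse = r.json()["parse"]
    # parse["text"] is the rendered HTML, parse["wikitext"] the raw wiki markup
    for folder, key, ext in ((html_dir, "text", ".html"), (wiki_dir, "wikitext", ".txt")):
        path = pathlib.Path(folder)
        path.mkdir(parents=True, exist_ok=True)
        (path / (title + ext)).write_text(parse[key], encoding="utf-8")

fetch_article("France")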

5. Launch compute_wiki_stats.py to calculate statistics for each article, such as text length, paragraph length, and so on.

python compute_wiki_stats.py --folder_path data/Npages --html_path data/Nhtml --output_dic_fn stats_topN.pkl

The program outputs a single file with the statistics: stats_topN.pkl
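As an illustration of what such statistics look like, here is a simplified sketch working on the wikitext files from step 4 (the exact fields stored in stats_topN.pkl are defined by the script itself):

import pathlib
import pickle

def compute_stats(folder_path="data/Npages", output_dic_fn="stats_topN.pkl"):
    stats = {}
    for path in pathlib.Path(folder_path).glob("*"):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        stats[path.stem] = {
            "text_length": len(text),
            "n_paragraphs": len(paragraphs),
            "paragraph_lengths": [len(p) for p in paragraphs],
        }
    with open(output_dic_fn, "wb") as f:
        pickle.dump(stats, f)
    return stats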

6. Launch stats_analysis_results.py to filter the articles into a JSON file

python stats_analysis_results.py --pkl_stats_dic_fn stats_topN.pkl --wiki_path data/Npages --html_path data/Nhtml --output_json_article_fn articles.json --min_paragraphs 5 --min_len_paragraphs 500 --max_len_paragraphs 1000

The program outputs the file articles.json, which is a SQuAD-compatible JSON file ready to be used by the PIAF annotation tool.
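For reference, here is a sketch of the filtering logic and of the SQuAD-style layout the annotation tool consumes (paragraphs are kept as contexts with empty qas lists, to be filled during annotation). Everything beyond the standard SQuAD fields is an assumption:

import json

def build_articles_json(articles, min_paragraphs=5, min_len_paragraphs=500,
                        max_len_paragraphs=1000, output_json_article_fn="articles.json"):
    # articles: iterable of (title, list of paragraph strings) -- hypothetical input shape
    data = []
    for title, paragraphs in articles:
        kept = [p for p in paragraphs if min_len_paragraphs <= len(p) <= max_len_paragraphs]
        if len(kept) < min_paragraphs:
            continue
        data.append({"title": title,
                     "paragraphs": [{"context": p, "qas": []} for p in kept]})
    with open(output_json_article_fn, "w", encoding="utf-8") as f:
        json.dump({"version": "1.1", "data": data}, f, ensure_ascii=False)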

7. Launch qas-analysis/divergence.py to compute the syntactic and lexical metrics on the collected data

python qas-analysis/divergence.py piaf-annotations_v1.1.json

This program outputs two PDF files:

  • hits_syntaxic.pdf: with the syntactic analysis of the PIAF dataset
  • lexical_variation_piaf_by_tokens_lemma.pdf: with the lexical analysis
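To give a feel for this kind of metric, here is a simplified sketch of one lexical measure: the share of each question's lemmas that also appear in its source paragraph, computed with the spaCy model from step 0 and plotted to a PDF with matplotlib. This is a simplification, not the repo's exact analysis:

import json
import matplotlib.pyplot as plt
import spacy

nlp = spacy.load("fr_core_news_sm")

def lemmas(text):
    return {tok.lemma_.lower() for tok in nlp(text) if tok.is_alpha}

def lexical_overlap(squad_json, out_pdf="lexical_overlap.pdf"):
    with open(squad_json, encoding="utf-8") as f:
        data = json.load(f)["data"]
    overlaps = []
    for article in data:
        for para in article["paragraphs"]:
            context_lemmas = lemmas(para["context"])
            for qa in para["qas"]:
                question_lemmas = lemmas(qa["question"])
                if question_lemmas:
                    overlaps.append(len(question_lemmas & context_lemmas) / len(question_lemmas))
    plt.hist(overlaps, bins=20)
    plt.xlabel("share of question lemmas found in the source paragraph")
    plt.ylabel("number of questions")
    plt.savefig(out_pdf)

lexical_overlap("piaf-annotations_v1.1.json")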

And now, a beautiful diagram of the whole procedure:

[piaf_code pipeline diagram]
