Data Collection

Because English Gigaword v5 is copyrighted, we describe the data processing procedure step by step.
Please download the raw data from here first, then follow our instructions to reproduce the data.

The seven distinct international sources of English newswire included in the fifth edition are the following:

After downloading, your raw data folder should look like this (see raw_data_folder.png):

Inside each folder, for example the New York Times Newswire Service folder (nyt_eng), the raw data should look like this (see nyt_raw_data.png):

Then, run the following (step 1):

python 1_data_construction.py

This script takes the seven folders as input and outputs seven .txt documents (e.g., 2_raw_afp.txt, 2_raw_nyt.txt...). Next, run the following (step 2):

python 2_data_processing_server.py

The output is seven .txt documents (e.g., 3_raw_afp.txt, 3_raw_nyt.txt...).
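
Before moving on to step 3, it can help to confirm that step 2 produced all seven files. The sketch below assumes the outputs are written to the working directory and follow the `3_raw_*.txt` naming suggested by the examples above. The helper `find_step_outputs` is hypothetical and not part of this repository.

```python
from pathlib import Path


def find_step_outputs(directory=".", prefix="3_raw_", expected=7):
    """Return the sorted output files of one step, raising if the count is off.

    Assumption: outputs live in `directory` and are named `<prefix><source>.txt`.
    """
    files = sorted(Path(directory).glob(f"{prefix}*.txt"))
    if len(files) != expected:
        raise FileNotFoundError(
            f"expected {expected} files matching {prefix}*.txt in {directory}, "
            f"found {len(files)}"
        )
    return files
```

The same check should work for the step 1 outputs by passing `prefix="2_raw_"`.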

Finally, run the following (step 3):

python 3_data_filter.py

The output, final_data_14w.txt, contains the 140,000 sentence groups mentioned in the paper.
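
The three steps can also be chained from a single script. This is a minimal sketch, assuming the three scripts sit in the current working directory next to the raw data folders and take no command-line arguments, as in the commands above. The names `STEPS`, `build_commands`, and `run_pipeline` are illustrative and not part of this repository.

```python
import subprocess
import sys

# Script names as given in steps 1-3; assumed to be in the working directory.
STEPS = [
    "1_data_construction.py",
    "2_data_processing_server.py",
    "3_data_filter.py",
]


def build_commands(steps=STEPS, python=sys.executable):
    """Return one command (as an argument list) per step."""
    return [[python, step] for step in steps]


def run_pipeline(steps=STEPS):
    """Run each step in order; check=True raises at the first failing step."""
    for command in build_commands(steps):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` runs the steps back to back and stops at the first one that exits with an error, so a later step never starts from incomplete inputs.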
