This repository contains the code for our project based on the EMNLP 2019 paper Text Summarization with Pretrained Encoders, done as part of the course [BITS F312] Neural Networks and Fuzzy Logic at BITS Pilani.
The original repository for the paper can be found here.
.
├── Custom_Dataset_BBC.ipynb
├── LICENSE
├── Model_Training_and_Graph_Plotting.ipynb
├── README.md
├── Rouge_Score_Evaluation.ipynb
├── Summary_Generation.ipynb
├── bert_data
├── custom_data_training
│   └── data_builder.py
├── json_data
├── logs
├── models
├── raw_data
├── requirements.txt
├── results
├── src
│   ├── cal_rouge.py
│   ├── distributed.py
│   ├── models
│   │   ├── __init__.py
│   │   ├── adam.py
│   │   ├── data_loader.py
│   │   ├── decoder.py
│   │   ├── encoder.py
│   │   ├── loss.py
│   │   ├── model_builder.py
│   │   ├── neural.py
│   │   ├── optimizers.py
│   │   ├── predictor.py
│   │   ├── reporter.py
│   │   ├── reporter_ext.py
│   │   ├── trainer.py
│   │   └── trainer_ext.py
│   ├── others
│   │   ├── __init__.py
│   │   ├── logging.py
│   │   ├── pyrouge.py
│   │   ├── tokenization.py
│   │   └── utils.py
│   ├── post_stats.py
│   ├── prepro
│   │   ├── __init__.py
│   │   ├── data_builder.py
│   │   ├── smart_common_words.txt
│   │   └── utils.py
│   ├── preprocess.py
│   ├── train.py
│   ├── train_abstractive.py
│   ├── train_extractive.py
│   └── translate
│       ├── __init__.py
│       ├── beam.py
│       └── penalties.py
└── urls
train.py - Contains the main training workflow of the models.
train_abstractive.py - Contains the training, validation and testing workflow of the abstractive model in a distributed manner.
train_extractive.py - Contains the training, validation and testing workflow of the extractive model in a distributed manner.
preprocess.py - Wraps around the functions defined in data_builder.py and provides a workflow for preprocessing the dataset from raw input files into a form that can be fed to the model.
distributed.py - Contains helper functions to support distributed training using multiple GPUs.
cal_rouge.py - Calculates the ROUGE scores of the generated summaries.
adam.py - Implements the Adam optimization algorithm.
data_loader.py - Loads the dataset and iterates over it in batches.
decoder.py - Defines the structure of the Transformer decoder network with an attention mechanism.
encoder.py - Defines the structure of the Transformer encoder network.
loss.py - Handles the details of loss computation during training.
model_builder.py - Integrates the encoder and decoder architectures, along with the optimizer, to define the model.
neural.py - Contains implementation of feedforward and multi-head attention layers.
optimizers.py - Contains the controller class for optimization and functions that update the model parameters based on the current gradients.
predictor.py - Translates the generated output and gives the predicted summary.
reporter_ext.py and reporter.py - Provide functionality for reporting metrics during the training steps.
trainer_ext.py and trainer.py - Include functions that control the training process and define the workflow.
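At a high level, model_builder.py wires a BERT encoder to a sentence-level classifier (extractive) or a Transformer decoder (abstractive). The sketch below illustrates only the extractive scoring idea: each sentence is scored by classifying the hidden state at its [CLS] position. It uses the Hugging Face transformers package for brevity, which is an assumption for illustration; the repository builds its own encoder and classifier in src/models/.

```python
# Illustrative sketch only (not the repository's implementation): score each sentence
# for extractive summarization by classifying BERT's hidden state at that sentence's
# [CLS] position. The real model is assembled in src/models/model_builder.py.
import torch
import torch.nn as nn
from transformers import BertModel  # assumption: the transformers package is installed


class TinyExtractiveScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, cls_positions):
        # hidden: (batch, seq_len, hidden_size)
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Gather the vector at each sentence's [CLS] token: (batch, n_sents, hidden_size)
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(1)
        cls_states = hidden[batch_idx, cls_positions]
        # One relevance score per sentence; the top-scoring sentences form the summary.
        return torch.sigmoid(self.classifier(cls_states)).squeeze(-1)
```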
- Train the models and plot the relevant metrics (loss/F1/accuracy, etc.) with respect to epochs.
- Compute and report the ROUGE scores for your trained model.
- Select 3-4 example articles and generate their summaries using your trained models, both extractive and abstractive.
- Use a custom summarization dataset of your choice, train the model on this data and report your findings.
Optionally, create a virtual environment on your system and activate it.
To run the code, first clone the repository by running the following command in Git Bash:
git clone https://github.com/AnushkaDayal/PreSumm_NNFL.git
Alternatively, you can download the code as a .zip archive and extract the files.
Switch to the cloned directory:
cd PreSumm_NNFL
To install the requirements, run the following command:
pip install -r requirements.txt
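After the install finishes, an optional sanity check (not part of the repository) is to confirm that PyTorch imports correctly and whether a CUDA GPU is visible, since the training steps below assume GPU access:

```python
# Optional environment check: verify the installed PyTorch build and GPU visibility.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```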
Unzip the zip file and put all .pt files into bert_data.
Download and unzip the stories directories from here for both CNN and Daily Mail. Put all .story files into one directory (e.g. ../raw_stories).
We will need Stanford CoreNLP to tokenize the data. Download it here and unzip it. Then add the following command to your bash_profile:
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
replacing /path/to/ with the path to the directory where you saved stanford-corenlp-full-2017-06-09.
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
RAW_PATH is the directory containing the story files (../raw_stories), and TOKENIZED_PATH is the target directory in which to save the tokenized files (../merged_stories_tokenized).
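To confirm the tokenization worked, you can peek at one of the generated files; they should be CoreNLP JSON documents containing a list of sentences. The snippet below is only an optional check, and the path is an example that may need adjusting:

```python
# Quick sanity check on the tokenize step (paths are examples; adjust to your setup).
import glob
import json

path = sorted(glob.glob("../merged_stories_tokenized/*.json"))[0]
with open(path) as f:
    doc = json.load(f)

print(path)
print("Top-level keys:", list(doc.keys()))               # CoreNLP output typically has a "sentences" list
print("Sentences in this document:", len(doc.get("sentences", [])))
```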
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH
RAW_PATH is the directory containing the tokenized files (../merged_stories_tokenized), JSON_PATH is the target directory in which to save the generated json files (../json_data/cnndm), and MAP_PATH is the directory containing the urls files (../urls).
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
JSON_PATH is the directory containing the json files (../json_data), and BERT_DATA_PATH is the target directory in which to save the generated binary files (../bert_data).
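Each generated .pt file is a shard of preprocessed examples that can be loaded with torch.load. An optional way to inspect a shard is shown below; the filename is an example, so use any .pt file that appears in your bert_data directory:

```python
# Inspect one preprocessed shard (the filename is an example; pick any .pt file in ../bert_data).
import torch

shard = torch.load("../bert_data/cnndm.train.0.bert.pt")
print("Examples in this shard:", len(shard))
print("Fields per example:", sorted(shard[0].keys()))
```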
Task-1 includes training the model on the CNN/DailyMail data and plotting the relevant graphs. Follow the steps given in the Model_Training_and_Graph_Plotting.ipynb file to accomplish this task.
Task-2 includes calculating the ROUGE scores on the test dataset using our trained model. You can download our custom trained models from the Pre-trained Models section below. Follow the steps in Rouge_Score_Evaluation.ipynb to accomplish this task.
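The notebook drives the repository's pyrouge-based evaluation in src/cal_rouge.py. If you only want a quick standalone sanity check on a handful of summaries, a generic package such as rouge-score can be used instead; this is an assumption for illustration, not the repository's evaluation pipeline:

```python
# Standalone ROUGE sanity check using the rouge-score package (pip install rouge-score).
# This is not the repository's pyrouge-based pipeline; it is only a quick approximation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "police arrested the suspect on friday after a short chase ."
candidate = "the suspect was arrested by police following a brief chase ."

for name, score in scorer.score(reference, candidate).items():
    print(f"{name}: F1 = {score.fmeasure:.4f}")
```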
The results we obtained on the respective test sets were as follows:
| Models | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| BertSumExt (CNN/DM) | 42.37 | 19.59 | 38.76 |
| BertSumExt (BBC) | 35.96 | 13.79 | 32.42 |
| BertSumExtAbs (CNN/DM) | 30.65 | 10.98 | 28.86 |
Task-3 includes generating summaries from raw input. We have provided the raw input we used in the raw_data folder. Follow the steps mentioned in Summary_Generation.ipynb to generate summaries from this raw input.
For abstractive summarization: each line in your input raw text file must be a single document.
For extractive summarization: you must insert [CLS] [SEP] as your sentence boundaries (see the sketch below).
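A hypothetical helper like the one below (not part of the repository) shows how such an extractive-ready input file could be written, with one document per line and [CLS] [SEP] between sentences; the output path is only an example:

```python
# Hypothetical helper: build a raw input file for the extractive setting,
# one document per line, with "[CLS] [SEP]" marking sentence boundaries.
documents = [
    ["The first sentence of article one.", "The second sentence of article one."],
    ["Article two consists of a single sentence."],
]

with open("../raw_data/my_input.txt", "w") as f:  # example path
    for sentences in documents:
        f.write(" [CLS] [SEP] ".join(sentences) + "\n")
```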
Task-4 is about training the model on a custom dataset. We chose the BBC Extractive dataset for this purpose. The dataset can be downloaded from here. Our custom trained model can be found in the Pre-trained Models section below. The steps for the training are mentioned in the Custom_Dataset_BBC.ipynb file.
To be able to follow the preprocessing steps above for your custom dataset, replace the data_builder.py file in the src/prepro directory with the one in custom_data_training/.
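One simple way to do this replacement is the illustrative snippet below, assuming you run it from the repository root and the paths match the directory tree shown above:

```python
# Copy the BBC-specific preprocessing over the default one before running the
# preprocessing steps above (run from the repository root).
import shutil

shutil.copy("custom_data_training/data_builder.py", "src/prepro/data_builder.py")
```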
Custom trained BertSumExt on CNN/DM dataset