The goal of this repo is to compare summarization results from `facebook/bart-large-cnn` used through the Transformers pipeline API against a custom summarizer trained on a news dataset available through TensorFlow Datasets.
Text summarization is an important task in natural language processing: condensing a piece of text into a shorter version while retaining the key information. This repository contains two approaches to text summarization:

- Using the pre-trained `facebook/bart-large-cnn` model with Hugging Face's Transformers library.
- Training a custom summarization model on a news dataset available through TensorFlow Datasets.
- Python 3.6 or later
- TensorFlow 2.x
- Transformers
- Requests (for fetching text from a remote source)
Install the required libraries using `pip`:

```bash
pip install tensorflow transformers requests
```
```
short-summarizer/
│
├── pre_trained_summarizer/
│   ├── pre_trained_summarizer.py   # Summarization using the pre-trained model
│
├── custom_summarizer/
│   ├── data_preprocessing.py       # Data preprocessing
│   ├── model.py                    # Custom summarizer model definition
│   ├── train.py                    # Training the custom summarizer
│   ├── evaluate.py                 # Evaluating the custom summarizer
│
└── README.md
```
Navigate to the `pre_trained_summarizer` directory. To summarize text using the `facebook/bart-large-cnn` pre-trained model, run:

```bash
python pre_trained_summarizer.py --url <URL_OF_TEXT_FILE>
```

Replace `<URL_OF_TEXT_FILE>` with the URL of the text file you want to summarize.
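For reference, a minimal sketch of how such a script can be built with the Transformers pipeline API is shown below. The actual `pre_trained_summarizer.py` in this repo may differ; the argument parsing and generation settings here are illustrative assumptions.

```python
# Illustrative sketch only -- the real pre_trained_summarizer.py may be structured differently.
import argparse

import requests
from transformers import pipeline


def main():
    parser = argparse.ArgumentParser(description="Summarize text fetched from a URL.")
    parser.add_argument("--url", required=True, help="URL of the text file to summarize")
    args = parser.parse_args()

    # Fetch the raw text from the remote source.
    text = requests.get(args.url, timeout=30).text

    # Load the pre-trained BART summarization pipeline.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

    # BART's input length is limited (~1024 tokens), so long documents are truncated here.
    summary = summarizer(text, max_length=130, min_length=30, truncation=True)
    print(summary[0]["summary_text"])


if __name__ == "__main__":
    main()
```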
Navigate to the `custom_summarizer` directory.

- Data Preprocessing: Run `data_preprocessing.py` to download and preprocess the dataset (a rough sketch of this step follows the list):

  ```bash
  python data_preprocessing.py
  ```

- Training: Run `train.py` to train the custom summarization model:

  ```bash
  python train.py
  ```

- Evaluation: After training the model, use `evaluate.py` to evaluate it on test data:

  ```bash
  python evaluate.py
  ```
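As a rough illustration of what the data preprocessing step might involve, here is a minimal sketch that loads a news dataset from TensorFlow Datasets and tokenizes article/summary pairs. The dataset name (`cnn_dailymail`), the tokenizer choice, and the field handling are assumptions for illustration; the actual `data_preprocessing.py` may differ.

```python
# Illustrative sketch only -- the dataset name and tokenizer are assumptions,
# since the README does not name the exact TFDS news dataset used.
import tensorflow_datasets as tfds
from transformers import AutoTokenizer

# Example: the CNN/DailyMail news dataset, a common choice for summarization.
dataset, info = tfds.load("cnn_dailymail", with_info=True, as_supervised=True)
train_ds = dataset["train"]

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def preprocess(article, highlights):
    # Tokenize an (article, summary) pair into fixed-length token ID sequences.
    inputs = tokenizer(article.numpy().decode("utf-8"),
                       max_length=1024, truncation=True)
    targets = tokenizer(highlights.numpy().decode("utf-8"),
                        max_length=128, truncation=True)
    return inputs["input_ids"], targets["input_ids"]

# Materialize a small preprocessed sample (a full pipeline would batch and cache this).
sample = [preprocess(article, highlights) for article, highlights in train_ds.take(4)]
```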
After generating summaries using both approaches, you can manually compare the quality of the summaries by reading them. Additionally, you can compute ROUGE scores to quantitatively measure the performance of the summarizers.
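One way to compute ROUGE is with the `rouge-score` package (not listed in the requirements above, so install it separately with `pip install rouge-score`). The reference and summary strings below are illustrative placeholders:

```python
# Minimal sketch: comparing two generated summaries against one reference summary with ROUGE.
from rouge_score import rouge_scorer

reference = "The city council approved the new transit budget on Tuesday."
pretrained_summary = "City council approved the transit budget Tuesday."
custom_summary = "The council passed a budget for transit."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

for name, summary in [("pre-trained", pretrained_summary), ("custom", custom_summary)]:
    scores = scorer.score(reference, summary)
    # Report the F1 measure for each ROUGE variant.
    print(name, {metric: round(score.fmeasure, 3) for metric, score in scores.items()})
```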
Contributions are welcome! Please read the contribution guidelines first.
This project is licensed under the MIT License - see the LICENSE file for details.