# Setup

First, **change runtime type** to GPU.

Install Mallet (used for LDA) in the Colab runtime:

In [None]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip mallet-2.0.8.zip

Mount Google Drive and change to your working directory:

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/MMRSumm/

/content/drive/MyDrive/MMRSumm


In [4]:
# "!export" only affects a throwaway subshell; %env persists for later cells
%env PYTHONPATH=.

Install libraries:

In [None]:
!pip install pyLDAvis

In [None]:
!pip install -r requirements.txt

In [None]:
!pip install "click==7.1.1"

In [None]:
!python3 -m spacy download en_core_web_sm

Now **restart** the runtime.

Then run the next three cells to remount the drive, return to the working directory, and set PYTHONPATH again.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
%cd /content/drive/MyDrive/MMRSumm/

/content/drive/MyDrive/MMRSumm


In [7]:
# "!export" only affects a throwaway subshell; %env persists for later cells
%env PYTHONPATH=.

Now run the script for the task you need: optimization or summarization.

# Optimization (optional)

Optimization requires an existing output set: either the completed set or at least 100 samples. If you have neither, run the MMR combination script with the --models_only argument.

In [None]:
!python mmr_combination_multinews.py --cloud --mmr_reduction --output_file outputs/mmr_sum_set.jsonl --models_only --run_until 100
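
Before moving on to optimization, it can help to confirm that the output set really holds at least 100 usable samples. The following is a minimal sketch, assuming the output file stores one JSON record per line; the helper names `count_valid_records` and `ready_for_optimization` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path

MIN_SAMPLES = 100  # minimum sample count stated in the text above


def count_valid_records(jsonl_path):
    """Count non-empty lines that parse as JSON; a missing file counts as 0."""
    path = Path(jsonl_path)
    if not path.exists():
        return 0
    count = 0
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a truncated or corrupted line
            count += 1
    return count


def ready_for_optimization(jsonl_path, minimum=MIN_SAMPLES):
    """Return True when the file holds at least `minimum` parseable records."""
    return count_valid_records(jsonl_path) >= minimum
```

For example, `ready_for_optimization("outputs/mmr_sum_set.jsonl")` returns False until enough samples have been written. Lines that fail to parse are skipped rather than counted, so an interrupted run that left a partial last line does not inflate the total.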

Now you can run the optimization script. Use the arguments to select your summary set input file, your similarity measure, your Neptune project name, and your Neptune API token.

In [None]:
!python optimize_add_mmr_to_models.py --sim doc2vec --input_file outputs/mmr_sum_set.jsonl

# Summarization

Now you can run the summarization script according to your needs. This is a very large and time-consuming task, so you may want to split the work into stages. If you are running the script on the entire set, we recommend generating the output set with all of the model outputs and MMR outputs before calculating their ROUGE scores.

The full run will likely take a few days. Use the --run_from argument to resume where you left off.

In [8]:
!python mmr_combination_multinews.py --cloud --mmr_reduction --output_file outputs/mmr_sum_set.jsonl --run_from 1 --run_until 10

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
tcmalloc: large alloc 3600007168 bytes == 0x55ee5406a000 @  0x7ff388f58001 0x7ff32cc5954f 0x7ff32cca9b58 0x7ff32ccadb17 0x7ff32cd4c203 0x55ee4af50d54 0x55ee4af50a50 0x55ee4afc5105 0x55ee4afbf4ae 0x55ee4af523ea 0x55ee4afc132a 0x55ee4afbf4ae 0x55ee4af523ea 0x55ee4afc132a 0x55ee4af5230a 0x55ee4afc47f0 0x55ee4afbf4ae 0x55ee4af523ea 0x55ee4afc47f0 0x55ee4afbf4ae 0x55ee4afbf1b3 0x55ee4afbd55b 0x55ee4b066032 0x55ee4af52ffa 0x55ee4afc61ba 0x55ee4afbf4ae 0x55ee4af523ea 0x55ee4afc47f0 0x55ee4af5230a 0x55ee4afc060e 0x55ee4af5230a
cuda
Downloading: 100% 1.12k/1.12k [00:00<00:00,

For example, if the last line written to the JSONL file is line 1023 (determined by manually opening the output file and checking), rerun the script with a --run_from argument of 1024. The script appends the additional output to the same output file. Update the --run_from argument on each rerun, and continue until the entire file has been written.
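
Instead of opening the file by hand, the next --run_from value can be derived from the line count. This is a sketch under stated assumptions: the output file holds one record per line, the run started at index 1, and --run_until takes an absolute, inclusive index (the source does not confirm this). The helpers `next_run_from` and `resume_command` are hypothetical names, not part of the repository.

```python
import shlex
from pathlib import Path


def next_run_from(output_path):
    """Return the line count plus one, or 1 if the file does not exist yet."""
    path = Path(output_path)
    if not path.exists():
        return 1
    with path.open("r", encoding="utf-8") as handle:
        written = sum(1 for line in handle if line.strip())
    return written + 1


def resume_command(output_path, stage_size):
    """Build the command line for the next stage of `stage_size` samples.

    Assumes --run_until is an absolute, inclusive index.
    """
    start = next_run_from(output_path)
    args = [
        "python", "mmr_combination_multinews.py",
        "--cloud", "--mmr_reduction",
        "--output_file", str(output_path),
        "--run_from", str(start),
        "--run_until", str(start + stage_size - 1),
    ]
    return shlex.join(args)
```

Calling `print(resume_command("outputs/mmr_sum_set.jsonl", stage_size=100))` produces a command that can be pasted into a cell after `!`. Building the string from a list with `shlex.join` keeps paths with spaces correctly quoted.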

# Alternative Summarization (optional)

Alternatively, the script provides arguments that split the work into separate stages.

The --models_only argument will generate the model outputs only.

In [9]:
!python mmr_combination_multinews.py --cloud --models_only --mmr_reduction --output_file outputs/mmr_sum_set.jsonl --run_until 10

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
tcmalloc: large alloc 3600007168 bytes == 0x55da387a2000 @  0x7f8161548001 0x7f810524954f 0x7f8105299b58 0x7f810529db17 0x7f810533c203 0x55da2f692d54 0x55da2f692a50 0x55da2f707105 0x55da2f7014ae 0x55da2f6943ea 0x55da2f70332a 0x55da2f7014ae 0x55da2f6943ea 0x55da2f70332a 0x55da2f69430a 0x55da2f7067f0 0x55da2f7014ae 0x55da2f6943ea 0x55da2f7067f0 0x55da2f7014ae 0x55da2f7011b3 0x55da2f6ff55b 0x55da2f7a8032 0x55da2f694ffa 0x55da2f7081ba 0x55da2f7014ae 0x55da2f6943ea 0x55da2f7067f0 0x55da2f69430a 0x55da2f70260e 0x55da2f69430a
cuda
Some weights o

The --mmr_only argument adds MMR output to an existing model outputs file. Omit the --mmr_reduction argument here if you already used it in the models-only run.

In [10]:
!python mmr_combination_multinews.py --cloud --mmr_only --input_file outputs/mmr_sum_set.jsonl --output_file outputs/mmr_sum_set_complete.jsonl --run_until 10

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
tcmalloc: large alloc 3600007168 bytes == 0x5571c36a2000 @  0x7ff655435001 0x7ff5f913654f 0x7ff5f9186b58 0x7ff5f918ab17 0x7ff5f9229203 0x5571b9ed5d54 0x5571b9ed5a50 0x5571b9f4a105 0x5571b9f444ae 0x5571b9ed73ea 0x5571b9f4632a 0x5571b9f444ae 0x5571b9ed73ea 0x5571b9f4632a 0x5571b9ed730a 0x5571b9f497f0 0x5571b9f444ae 0x5571b9ed73ea 0x5571b9f497f0 0x5571b9f444ae 0x5571b9f441b3 0x5571b9f4255b 0x5571b9feb032 0x5571b9ed7ffa 0x5571b9f4b1ba 0x5571b9f444ae 0x5571b9ed73ea 0x5571b9f497f0 0x5571b9ed730a 0x5571b9f4560e 0x5571b9ed730a
cuda
Ind_number: 1


The --rouge_only argument calculates the ROUGE scores of the output file.

In [11]:
!python mmr_combination_multinews.py --cloud --input_file outputs/mmr_sum_set_complete.jsonl --rouge_only --rouge_file outputs/rouge_test.txt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
tcmalloc: large alloc 3600007168 bytes == 0x55fe76562000 @  0x7fe9ca21c001 0x7fe96df1d54f 0x7fe96df6db58 0x7fe96df71b17 0x7fe96e010203 0x55fe6cfd1d54 0x55fe6cfd1a50 0x55fe6d046105 0x55fe6d0404ae 0x55fe6cfd33ea 0x55fe6d04232a 0x55fe6d0404ae 0x55fe6cfd33ea 0x55fe6d04232a 0x55fe6cfd330a 0x55fe6d0457f0 0x55fe6d0404ae 0x55fe6cfd33ea 0x55fe6d0457f0 0x55fe6d0404ae 0x55fe6d0401b3 0x55fe6d03e55b 0x55fe6d0e7032 0x55fe6cfd3ffa 0x55fe6d0471ba 0x55fe6d0404ae 0x55fe6cfd33ea 0x55fe6d0457f0 0x55fe6cfd330a 0x55fe6d04160e 0x55fe6cfd330a
cuda
Ind_number: 1


Or run all three stages at once by passing the --rouge argument.

In [12]:
!python mmr_combination_multinews.py --cloud --rouge --mmr_reduction --output_file outputs/mmr_sum_set_complete.jsonl --run_until 10 --rouge_file outputs/rouge_test.txt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
tcmalloc: large alloc 3600007168 bytes == 0x555e34400000 @  0x7ffb8101b001 0x7ffb24d1c54f 0x7ffb24d6cb58 0x7ffb24d70b17 0x7ffb24e0f203 0x555e2ae70d54 0x555e2ae70a50 0x555e2aee5105 0x555e2aedf4ae 0x555e2ae723ea 0x555e2aee132a 0x555e2aedf4ae 0x555e2ae723ea 0x555e2aee132a 0x555e2ae7230a 0x555e2aee47f0 0x555e2aedf4ae 0x555e2ae723ea 0x555e2aee47f0 0x555e2aedf4ae 0x555e2aedf1b3 0x555e2aedd55b 0x555e2af86032 0x555e2ae72ffa 0x555e2aee61ba 0x555e2aedf4ae 0x555e2ae723ea 0x555e2aee47f0 0x555e2ae7230a 0x555e2aee060e 0x555e2ae7230a
cuda
Downloading: 4