<a href="https://colab.research.google.com/github/ContextLab/abstract2paper/blob/main/resources/abstract2paper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to *abstract2paper*!
Author: [Jeremy R. Manning](http://www.context-lab.com/)

## Step right up, step right up!
<img src='https://media1.giphy.com/media/mL40PfXA394KA/giphy.gif' width='250px'>

**Writing papers got you down?** Come on in, friend!  Give my good ole' Abstract2papers Cure-All a quick try!  Enter your abstract into the little doohicky here, and quicker'n you can blink your eyes<sup>1</sup>, a shiny new paper'll come right out for ya!  What are you waiting for?

## How does it work, you ask?

Really it's quite simple.  We put in a smidgen of [this](https://huggingface.co/transformers/model_doc/gpt_neo.html), a pinch of [that](https://www.tug.org/texlive/), plus a dab of our special [*secret ingredient*](https://www.youtube.com/watch?v=dQw4w9WgXcQ), and **poof!** that's how the sausage is made.

## No really, how does it work?

Ok, if you really want to know, all I'm doing here is using the [Hugging Face](https://huggingface.co/) [implementation](https://huggingface.co/transformers/model_doc/gpt_neo.html) of [GPT-Neo](https://github.com/EleutherAI/gpt-neo), which is itself a tweaked version of [GPT-3](https://arxiv.org/abs/2005.14165) that is pre-trained on the [Pile](https://pile.eleuther.ai/) dataset.

The text you input is used as a prompt for GPT-Neo; to generate a document containing an additional *n* words (strictly speaking, *n* tokens, which are words or pieces of words), the model simply "predicts" the next *n* tokens that will come after the specified prompt.
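
For the curious, here's a toy sketch of that "predict the next bit, tack it on, repeat" loop.  Everything in it is made up for illustration: `TOY_MODEL` and `generate` are hypothetical stand-ins, and the toy only looks at the single previous word, whereas GPT-Neo conditions on the whole preceding context.

```python
import random

# Toy stand-in for a language model: maps the previous token to candidate
# next tokens.  (Hypothetical data, not anything GPT-Neo actually stores.)
TOY_MODEL = {
    "we": ["show", "find"],
    "show": ["that"],
    "find": ["that"],
    "that": ["models", "we"],
    "models": ["scale", "we"],
    "scale": ["well"],
    "well": ["and"],
    "and": ["we"],
}

def generate(prompt, n_new, seed=0):
    """Append n_new tokens to the prompt, one prediction at a time."""
    rng = random.Random(seed)
    tokens = prompt.split()
    for _ in range(n_new):
        candidates = TOY_MODEL[tokens[-1]]     # "predict" from what came before
        tokens.append(rng.choice(candidates))  # sample one continuation
    return " ".join(tokens)

paper = generate("we show", n_new=6)
print(paper)
```

The prompt always survives intact at the start of the output; only the continuation is sampled.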

With a little help from some basic [LaTeX](https://www.latex-project.org/) templates (borrowed from [Overleaf](https://www.overleaf.com)), the document is formatted and compiled into a PDF.

## Can I actually use this in real-world applications?

<img src='https://media4.giphy.com/media/3o6ozoD1ByqYv7ARIk/giphy.gif' width='250px'>

**Doubtful.**  Or at least, probably not...?  It certainly wouldn't be ethical to use this code to generate writing assignments, mass-produce papers or grant applications, etc.  Further, you'll likely find that the text produced using this approach includes stuff that's said in funny (often nonsensical) ways, follows problematic logic, incorporates biases from the training data, and so on.  Less important, but still a practical annoyance: you'll also encounter all sorts of formatting issues (although those might be easy to fix manually, and possibly even automatically with some clever tinkering).

&nbsp;
&nbsp;
&nbsp;
&nbsp;

<sup>1</sup><small>This claim rests on the assumption that you blink *really* slowly.  Depending on how much text you're trying to generate (and how long your prompt is), your paper could take anywhere from a few minutes to several hours to fully congeal.</small>

# Step 1: Setting up the environment

We're going to use the super-convenient [Davos](https://github.com/ContextLab/davos.git) package to manage our dependencies.  We need to install it and import it in order to gain access to the `smuggle` keyword.
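
If you're wondering what "smuggling" a package amounts to, a rough plain-Python approximation is "import it, and if that fails, install it and try again."  The sketch below is only that approximation: `import_or_install` is a hypothetical helper (not part of Davos), and Davos itself does more than this.

```python
import importlib
import subprocess
import sys

def import_or_install(module_name, pip_name=None):
    """Import module_name, pip-installing it first if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Install into the same interpreter that is running this notebook.
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or module_name]
        )
        importlib.invalidate_caches()  # make the fresh install importable
        return importlib.import_module(module_name)

json = import_or_install("json")  # already available, so nothing is installed
```

The optional `pip_name` argument covers packages whose install name differs from their import name.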

In [None]:
!pip install git+https://github.com/ContextLab/davos.git
import davos

Collecting git+https://github.com/ContextLab/davos.git
  Cloning https://github.com/ContextLab/davos.git to /tmp/pip-req-build-dja42xdy
  Running command git clone -q https://github.com/ContextLab/davos.git /tmp/pip-req-build-dja42xdy
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: davos
  Building wheel for davos (PEP 517) ... done
  Created wheel for davos: filename=davos-0.0.1-cp37-none-any.whl size=37733 sha256=49bbd15091590ff588bad41449ac3500f5c1fba0e1df022e351f53890d775250
  Stored in directory: /tmp/pip-ephem-wheel-cache-ufki7s0l/wheels/62/30/bc/79958ce75e105bdcf95c737091e62372429850960d6628e277
Successfully built davos
Installing collected packages: davos
Successfully installed davos-0.0.1


Next, we'll install Hugging Face's [transformers](https://huggingface.co/transformers/) library and download (and load into memory) the pre-trained GPT-Neo model.  We'll also use a pre-trained GPT-2 tokenizer for convenience.

NB: If you have access to more RAM, you may want to replace `'EleutherAI/gpt-neo-1.3B'` with `'EleutherAI/gpt-neo-2.7B'` in the last two lines of the cell below.  That will load in a fancier (but larger) model, with 2.7 billion parameters instead of a "measly" 1.3 billion parameters.

There's a lot to download (roughly 6GB for the smaller model and 12.5GB for the larger model), so it'll take a few minutes.  Now would be a good time to track down that coffee refill you've been postponing...

In [None]:
from transformers smuggle GPTNeoForCausalLM, GPT2Tokenizer
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/00/92/6153f4912b84ee1ab53ab45663d23e7cf3704161cb5ef18b0c07e207cef2/transformers-4.7.0-py3-none-any.whl (2.5MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)


# Now I'm only going to ask this once: we're going to need a little...*information* from you...
<img src='https://64.media.tumblr.com/e3c8dea30fdf2e597ce79904e8da3271/tumblr_o2xhqwU0cK1qmob6ro2_500.gif' width='400px'>

You didn't think it was going to be *that* easy, did you?  Oh...you did?  Well if you want *us* to cooperate, we're going to need a little...information...from you first.  About your paper.  Please make this easy on all of us and don't try to lie.  The machine will know.  The machine *always* knows...

Fill in the information below and you'll be well on your way to your auto-generated paper/story/grant application/speech/business plan.


In [None]:
# credit: https://arxiv.org/abs/2005.14165
title = 'Language Models are Awesome-Shot Learners'
authors = 'Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind ' \
          'Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen ' \
          'Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, ' \
          'Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, ' \
          'Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei'
text = "Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large " \
       "corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, " \
       "this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. " \
       "By contrast, humans can generally perform a new language task from only a few examples or from simple " \
       "instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up " \
       "language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness " \
       "with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language " \
       "model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its " \
       "performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or " \
       "fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. " \
       "GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, " \
       "and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, " \
       "such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same " \
       "time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some " \
       "datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, " \
       "we find that GPT-3 can generate samples of news articles which human evaluators have difficulty " \
       "distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of " \
       "GPT-3 in general."
length = 1000

The next cell is going to take a while to run.  While you're waiting, just think: it's still better than writing something on your own, isn't it?

In [None]:
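# encode the abstract as a tensor of token ids to use as the prompt
ids = tokenizer(text, return_tensors='pt')['input_ids']
# sample a continuation; max_length counts tokens (not words) and includes the prompt
tokens = model.generate(ids, do_sample=True, temperature=0.9, max_length=length)
# convert the token ids (prompt + continuation) back into text
gen_text = tokenizer.batch_decode(tokens)[0]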

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
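
Curious what `temperature=0.9` is doing in that `generate` call?  Here's a minimal sketch of temperature-scaled sampling, assuming the standard "divide the scores by the temperature, then softmax" recipe.  The function names and the three scores are invented for illustration; this is not a copy of what Hugging Face runs internally.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.9):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=0.9, rng=random):
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                      # made-up scores for three candidate tokens
cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)
choice = sample_token(logits, rng=random.Random(0))
```

Lower temperatures pile more probability onto the top-scoring token (safer, more repetitive text); higher temperatures spread it around (more adventurous, more nonsense).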


And finally, we'll get a TeX Live installation going in our Colaboratory environment and download a template for generating the final document.  (First we need to remove the model and tokenizer from RAM so that Colaboratory doesn't hit its memory limit and crash.)

In [None]:
# memory cleanup
import gc, os
del model, tokenizer  # you can comment out this line if you're not
                      # running this on a memory-limited machine
gc.collect()          # remove the (now unused) variables from memory

# install TeX Live (-y answers the confirmation prompt automatically)
!apt-get install -y texlive-latex-recommended

# download template
!git clone https://github.com/ContextLab/abstract2paper

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  fonts-droid-fallback fonts-lmodern fonts-noto-mono libcupsfilters1
  libcupsimage2 libgs9 libgs9-common libijs-0.35 libjbig2dec0 libkpathsea6
  libpotrace0 libptexenc1 libsynctex1 libtexlua52 libtexluajit2 libzzip-0-13
  lmodern poppler-data t1utils tex-common texlive-base texlive-binaries
  texlive-latex-base
Suggested packages:
  fonts-noto poppler-utils ghostscript fonts-japanese-mincho
  | fonts-ipafont-mincho fonts-japanese-gothic | fonts-ipafont-gothic
  fonts-arphic-ukai fonts-arphic-uming fonts-nanum debhelper gv
  | postscript-viewer perl-tk xpdf-reader | pdf-viewer texlive-latex-base-doc
  texlive-latex-recommended-doc texlive-pstricks
The following NEW packages will be installed:
  fonts-droid-fallback fonts-lmodern fonts-noto-mono libcupsfilters1
  libcupsimage2 libgs9 libgs9-common libijs-0.35 libjbig2dec0 libkpathsea6
  lib

In [None]:
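def texer(template, outfile='auto.tex', **kwargs):
  """Fill in a LaTeX template, compile it with pdflatex, and return the source."""
  with open(template, 'r') as f:
    lines = f.readlines()

  x = []
  for line in lines:
    # swap each <KEY> placeholder for the matching keyword argument
    for k, v in kwargs.items():
      line = line.replace(f'<{k}>', str(v))
    # escape the characters most likely to break compilation
    x.append(line.replace(' & ', ' \\& ').replace('%', '\\%'))
  source = '\n'.join(x)

  try:
    with open(outfile, 'w+') as f:
      f.write(source)

    os.system(f'pdflatex {outfile}')  # compile the source into a PDF
    os.system('rm *.log *.aux')       # tidy up pdflatex's byproducts
  except OSError:
    pass  # couldn't write the file; still return the source below

  return source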
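
`texer` only escapes `%` and space-padded `&`, so generated text containing other special characters (`$`, `_`, `#`, and friends) can still trip up `pdflatex`.  Below is a sketch of what a more thorough escaper might look like.  `escape_latex` and `LATEX_ESCAPES` are hypothetical names, and the helper isn't wired into `texer`: running both on the same text would escape `%` and `&` twice.

```python
import re

# Characters that LaTeX treats specially, mapped to their escaped forms.
LATEX_ESCAPES = {
    '\\': r'\textbackslash{}',
    '&': r'\&',
    '%': r'\%',
    '$': r'\$',
    '#': r'\#',
    '_': r'\_',
    '{': r'\{',
    '}': r'\}',
    '~': r'\textasciitilde{}',
    '^': r'\textasciicircum{}',
}
_SPECIAL = re.compile('|'.join(re.escape(c) for c in LATEX_ESCAPES))

def escape_latex(text):
    """Escape every LaTeX special character in a single pass."""
    # One pass matters: escaping characters one after another would
    # re-escape the backslashes and braces added by earlier replacements.
    return _SPECIAL.sub(lambda m: LATEX_ESCAPES[m.group()], text)

escaped = escape_latex('50% of R&D costs $3_000')
```

This should only ever be applied to the generated text, never to the template itself, since the template's own backslashes and braces are real LaTeX commands.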

Create a PDF containing your auto-generated document...

In [None]:
source = texer(os.path.join('abstract2paper', 'resources', 'template.tex'),
               TITLE=title,
               AUTHOR=authors + '\\\\Augmented by GPT-Neo',
               GEN_TEXT=gen_text + '...')

In [None]:
import pprint
pprint.pprint(source, width=80)

('\\documentclass{article}\n'
 '\n'
 '\\usepackage[utf8]{inputenc}\n'
 '\n'
 '\n'
 '\n'
 '\\title{Language Models are Awesome-Shot Learners}\n'
 '\n'
 '\\author{Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared '
 'Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, '
 'Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom '
 'Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens '
 'Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott '
 'Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec '
 'Radford, Ilya Sutskever, and Dario Amodei\\\\Augmented by GPT-Neo}\n'
 '\n'
 '\\date{\\today}\n'
 '\n'
 '\n'
 '\n'
 '\\begin{document}\n'
 '\n'
 '\n'
 '\n'
 '\\maketitle\n'
 '\n'
 '\n'
 '\n'
 'Recent work has demonstrated substantial gains on many NLP tasks and '
 'benchmarks by pre-training on a large corpus of text followed by fine-tuning '
 'on a specific task. While ty