# Scientific Paper Title Generator Using GPT-2 

This notebook outlines the steps required to generate titles for a specific body of text. In the example below, we walk through how to produce AI-generated titles from scientific research papers scraped from ArXiv, a popular open-access repository that hosts an enormous number of research papers.

Key Project Objectives:
- Data will be scraped from the Open-Access [ArXiv.org Platform](https://arxiv.org/)
- Text data will be fed to the GPT-2 Model in order to fine-tune the selected model
- Titles will be generated after Fine-tuning is completed

Source Information:
1) [ArXiv.org](https://arxiv.org/)
2) [GPT-2 by HuggingFace](https://huggingface.co/gpt2)
3) [ArXiv Scraper Library](https://github.com/Mahdisadjadi/arxivscraper)

# 1. Load Libraries

In [None]:
# Create a requirements file from the dependencies installed in the current environment
# (pip freeze writes the `package==version` format that `pip install -r` expects)
!pip freeze > requirements.txt

In [None]:
# Import libraries to be used
import numpy as np
import pandas as pd
import arxivscraper as ax
import gpt_2_simple as gpt2

# 2. Scrape & Retrieve Raw Data from ArXiv

Let's focus on physics papers, since ArXiv holds a vast number of them.

The field is selected with the scraper's `category` argument and narrowed to a subcategory with `filters`:
 - `physics` -- subcategory: Quantum Physics `quant-ph`

Let's also restrict the search to recent papers, submitted between January 2020 and August 2022.

One limitation of this library is that the scraper can only target one category at a time, so several fields have to be scraped one by one.

In [None]:
# scraper for arxiv physics
scraper = ax.Scraper(category='physics', date_from='2020-01-01',
                     date_until='2022-08-03', t=10,
                     filters={'categories':['quant-ph']})

output = scraper.scrape()
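
Because each `Scraper` call covers a single category, collecting several fields means looping over them and merging the results. The sketch below shows that pattern with a hypothetical `scrape_one` callable standing in for the real scraper call (in this notebook it would wrap something like `ax.Scraper(category=...).scrape()`). The helper names are not part of arxivscraper, the records are made up, and dropping repeated `id` values is an assumption about how cross-listed papers should be handled.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def scrape_categories(categories: Iterable[str],
                      scrape_one: Callable[[str], List[Record]]) -> List[Record]:
    """Scrape each category in turn and merge the records, dropping repeated ids."""
    seen = set()
    merged = []
    for category in categories:
        for record in scrape_one(category):
            # Cross-listed papers can show up under more than one category
            if record['id'] in seen:
                continue
            seen.add(record['id'])
            merged.append(record)
    return merged


# Stand-in for the real scraper so the example runs offline (made-up records)
def fake_scrape(category: str) -> List[Record]:
    data = {
        'physics': [{'id': 'a1', 'title': 'Paper A'}, {'id': 'b2', 'title': 'Paper B'}],
        'cs': [{'id': 'b2', 'title': 'Paper B'}, {'id': 'c3', 'title': 'Paper C'}],
    }
    return data[category]


combined = scrape_categories(['physics', 'cs'], fake_scrape)
print([r['id'] for r in combined])
```

The merged list can be passed to `pd.DataFrame` in the same way as a single scrape result.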

In [None]:
output[:1]

In [None]:
# Let's load the results into a DataFrame to see what they look like, then save them to a csv for future use
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output,columns=cols)
df.to_csv('Data/Scraped_physics_outputs.csv')
df.head()

In [None]:
# Extract title data and export to csv file
titles = df.title.tolist()
display(titles[:5])

np.savetxt('Data/Scraped_physics_titles.csv', np.array(titles), header='Titles',comments="", fmt='%s', encoding='utf-8')
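
One caveat with a plain-text export like this: titles that contain commas or line breaks are written without quoting, so a CSV reader may split or truncate them. A minimal sketch of a quoted export using Python's standard `csv` module follows. The helper name `write_titles_csv`, the example titles, and the whitespace normalization are illustrative assumptions, not part of the original pipeline.

```python
import csv
import os
import tempfile


def write_titles_csv(titles, path, header='Titles'):
    """Write one title per row, quoting fields that contain commas or quotes."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow([header])
        for title in titles:
            # Collapse line breaks and repeated spaces into single spaces
            writer.writerow([' '.join(title.split())])


# Made-up titles that would be mangled by an unquoted export
example_titles = ['Entanglement, Revisited', 'A Title Broken\n  Across Lines']
path = os.path.join(tempfile.mkdtemp(), 'titles.csv')
write_titles_csv(example_titles, path)

# Read the file back to confirm each title survives as a single field
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```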

# 3. GPT-2 Fine-Tuning on Title Data

In [None]:
model_name = "117M" # "355M" = Larger model (1.4 GB)
gpt2.download_gpt2(model_name=model_name)   # Saved into current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'Data/Scraped_physics_titles.csv',
              model_name=model_name,
              steps=1000,
              save_every=200,
              sample_every=25)

gpt2.generate(sess)

# 4. Examine Sample Outputs

In [None]:
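sample_file = 'samples/run1/samples-76'
# Read the raw sample text written during fine-tuning
with open(sample_file, 'r') as f:
    t = f.read()

# Strip the GPT-2 start/end token markers from the text
for s in ['endoftext', 'startoftext', '<|', '|>']:
    t = t.replace(s, '')
# Title-case the text, skip the first line, and print each non-empty title
for title in t.title().split('\n')[1:]:
    if title != '':
        print('- ' + title)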

# 5. Generate New Sample Titles

- A kernel restart is required after fine-tuning, before new titles are generated.
- This restriction comes from the gpt-2-simple module itself. See this [GitHub thread](https://github.com/minimaxir/gpt-2-simple/issues/80) for more information.

In [1]:
# Run this cell in a fresh kernel (after the restart) to reload the fine-tuned model;
# skip it if the model is already loaded in the current session
import gpt_2_simple as gpt2
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

Loading checkpoint checkpoint\run1\model-100
INFO:tensorflow:Restoring parameters from checkpoint\run1\model-100


In [27]:
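def generate_sample_titles(n):
    """Generate n samples and return the cleaned, title-cased titles."""
    text = gpt2.generate(sess,
                         length=40,
                         temperature=0.4,
                         nsamples=n,
                         batch_size=1,
                         return_as_list=True
                         )
    titles_lst = []
    for title in text:
        # title-case the sample (this also capitalizes the token names)
        t = title.title()
        # remove extraneous text from GPT-2 output
        t = t.replace('<|Startoftext|>','').replace('\n','').replace('| Startoftext|','')
        # only grab a single title: keep the text before the first end token,
        # or the whole string if no end token was generated within `length`
        t = t.partition('<|Endoftext|>')[0]
        if t == '':
            continue
        print(t)
        titles_lst.append(t)
    return titles_lst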

#### Single Sample Title

In [25]:
generate_sample_titles(n=1)

Quantum And Classical Information Theory: A Unified Theory Of The Quantum   Phase Transition


['Quantum And Classical Information Theory: A Unified Theory Of The Quantum   Phase Transition']

#### Multiple Sample Titles

In [28]:
generate_sample_titles(n=10)

Probabilistic Simulation Of Quantum Mechanics
A New Way Of Estimating The Entropy Of Quantum Systems
Quantum And Classical Information Processing In Quantum Mechanics
|Startoftext|>Quantum Key Distribution With A Quantum Processor
A Quantum Walker For The Quantum Computer
Theory Of Quantum Mechanics And The Quantum Contextuality
>Quantum Information Retrieval With A Quantum Processor
A New Approach To Quantum Estimation
Quantum Key Distribution With An Atomic Engine


['Probabilistic Simulation Of Quantum Mechanics',
 'A New Way Of Estimating The Entropy Of Quantum Systems',
 'Quantum And Classical Information Processing In Quantum Mechanics',
 '|Startoftext|>Quantum Key Distribution With A Quantum Processor',
 'A Quantum Walker For The Quantum Computer',
 'Theory Of Quantum Mechanics And The Quantum Contextuality',
 '>Quantum Information Retrieval With A Quantum Processor',
 'A New Approach To Quantum Estimation',
 'Quantum Key Distribution With An Atomic Engine']
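
A few of the titles above still carry fragments of the start token (`|Startoftext|>` or a lone `>`). One possible post-processing step is a regular expression that strips any leading run of token debris. The helper `strip_token_debris` below is a hypothetical addition, and its pattern is a guess based on the fragments seen in this output rather than an exhaustive rule.

```python
import re

# Leading run of '<', '|', '>' characters, whitespace, or the word "startoftext"
_DEBRIS = re.compile(r'^(?:[<|>\s]|startoftext)+', flags=re.IGNORECASE)


def strip_token_debris(title):
    """Remove start-token fragments from the beginning of a generated title."""
    return _DEBRIS.sub('', title).strip()


# Three titles copied from the output above
raw_titles = [
    '|Startoftext|>Quantum Key Distribution With A Quantum Processor',
    '>Quantum Information Retrieval With A Quantum Processor',
    'A New Approach To Quantum Estimation',
]
cleaned = [strip_token_debris(t) for t in raw_titles]
for t in cleaned:
    print(t)
```

Titles without debris pass through unchanged, so the helper could be applied to every entry of the list returned by `generate_sample_titles`.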