# Scientific Paper Title Generator Using GPT-2 

The following Notebook outlines the steps required to perform a title generation specific to a body of text. \
In the case below, we will be walking through how to produce an AI generated Title based on Scientific research texts scraped from ArXiv - a popular scientific community forum housing an enormous number of research texts

Key Project Objectives:
- Data will be scraped from the Open-Access [ArXiv.org Platform](https://arxiv.org/)
- Text data will be fed to the GPT-2 Model in order to fine-tune the selected model
- Titles will be generated after Fine-tuning is completed

Source Information:
1) [ArXiv.org](https://arxiv.org/)
2) [GPT-2 by HuggingFace](https://huggingface.co/gpt2)
3) [ArXiv Scraper Library](https://github.com/Mahdisadjadi/arxivscraper)

# 1. Load Libraries

In [None]:
# Create Requirements file from Dependencies used in Current Environment
!pip list > requirements.txt

In [2]:
# Import libraries to be used
import numpy as np
import pandas as pd
import arxivscraper as ax
import gpt_2_simple as gpt2

# 2. Scrape & Retrieve Raw Data from ArXiv

Let's look at physics as the main type of research paper as ArXiv has a vast number of those in its database

These are specified through the scraper with the `category` tag:
 - `physics` -- subcategory: Quantum Physics/Mechanics `quant-ph`
 
 Let's also explore recent Information from the last 1-2 years
 
 A Limitation of using this library however is that the scraper can only target one specific Field at a time. Therefore, we have to scrape 1 group at a time

In [5]:
# scraper for arxiv physics
scraper = ax.Scraper(category='physics', date_from='2020-01-01',
                     date_until='2022-08-03', t=10,
                     filters={'categories':['quant-ph']})

output = scraper.scrape()

fetching up to  1000 records...
fetching up to  2000 records...
fetching up to  3000 records...
fetching up to  4000 records...
fetching up to  5000 records...
fetching up to  6000 records...
fetching up to  7000 records...
fetching up to  8000 records...
fetching up to  9000 records...
fetching up to  10000 records...
fetching up to  11000 records...
fetching up to  12000 records...
fetching up to  13000 records...
fetching up to  14000 records...
fetching up to  15000 records...
fetching up to  16000 records...
fetching up to  17000 records...
fetching up to  18000 records...
fetching up to  19000 records...
fetching up to  20000 records...
fetching up to  21000 records...
fetching is completed in 313.2 seconds.
Total number of records 3537


In [35]:
output[:1]

[{'title': 'a mathematical criterion for single photon sources used in quantum   cryptography',
  'id': '0705.1600',
  'abstract': 'a single photon source (sps) is very important for quantum computation. in particular, it is essential for secured quantum cryptography. but there is no perfect sps in reality. therefore, probabilistic sps where probability of simultaneous emission of two, three, four and more photon is less than the emission of a single photon are used. since classical photon always comes in bunch, the required single photon source must be nonclassical. in the well-known antibunched state the rate of simultaneous emission of two photon is less than that of single photon. but the requirement of quantum cryptography is a multiphoton version of the antibunched state or the higher order antibunched state. recently we have reported a mathematical criterion for higher order antibunching. here we have shown that any proposal for sps to be used in quantum cryptography should sati

In [20]:
# Let's load the results into a DataFrame to see what they look like, then save them to a csv for future use
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output,columns=cols)
df.to_csv('Data/Scraped_physics_outputs.csv')
df.head()

Unnamed: 0,id,title,categories,abstract,doi,created,updated,authors
0,705.16,a mathematical criterion for single photon sou...,quant-ph,a single photon source (sps) is very important...,,2007-05-11,,[anirban pathak]
1,706.0697,higher order antibunching in intermediate states,quant-ph,since the introduction of binomial state as an...,10.1016/j.physleta.2008.06.045,2007-06-05,,"[a verma, n k sharma, a pathak]"
2,706.2907,optimal quantum source coding with quantum inf...,quant-ph,consider many instances of an arbitrary quadri...,10.1109/tit.2009.2030494,2007-06-20,2020-09-26,"[jon yard, igor devetak]"
3,706.3445,analysis of a recent experimental test of bell...,quant-ph,a recent experiment by brida et al. (arxiv:070...,10.1140/epjd/e2007-00322-3,2007-06-23,,[emilio santos]
4,708.0859,exponential separation of quantum and classica...,quant-ph,we give the first exponential separation betwe...,,2007-08-06,,"[dmytro gavinsky, pavel pudlák]"


In [32]:
# Extract title data and export to csv file
titles = df.title.tolist()
display(titles[:5])

np.savetxt('Data/Scraped_physics_titles.csv', np.array(titles), header='Titles',comments="", fmt='%s')

['a mathematical criterion for single photon sources used in quantum   cryptography',
 'higher order antibunching in intermediate states',
 'optimal quantum source coding with quantum information at the encoder   and decoder',
 'analysis of a recent experimental test of bell inequalities violating   quantum predictions',
 'exponential separation of quantum and classical non-interactive   multi-party communication complexity']

# 2. GPT-2 FineTuning on Title Data

In [None]:
model_name = "117M" # "355M" = Larger model (1.4 GB)
gpt2.download_gpt2(model_name=model_name)   # Saved into current directory under /models/117M/

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'Data/Scraped_physics_titles.csv',
              model_name=model_name,
              steps=1000,
              save_every=200,
              sample_every=25)

gpt2.generate(sess)

Fetching checkpoint: 1.05Mit [00:00, 549Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 1.08Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 548Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [06:56, 1.20Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 1.56Git/s]                                               
Fetching model.ckpt.meta: 1.05Mit [00:00, 2.10Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 2.05Mit/s]                                                      


Loading checkpoint models\117M\model.ckpt
INFO:tensorflow:Restoring parameters from models\117M\model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 166.63it/s]


dataset has 104907 tokens
Training...
[1 | 49.03] loss=2.56 avg=2.56
[2 | 94.90] loss=2.71 avg=2.63
[3 | 135.34] loss=2.46 avg=2.58
[4 | 182.84] loss=2.38 avg=2.53
[5 | 224.21] loss=2.33 avg=2.49
[6 | 263.31] loss=2.31 avg=2.46
[7 | 302.04] loss=2.40 avg=2.45
[8 | 342.65] loss=2.28 avg=2.43
[9 | 383.11] loss=2.17 avg=2.40
[10 | 423.92] loss=2.18 avg=2.37
[11 | 469.94] loss=2.14 avg=2.35
[12 | 518.03] loss=2.18 avg=2.34
[13 | 562.58] loss=2.19 avg=2.33
[14 | 602.37] loss=2.16 avg=2.31
[15 | 643.67] loss=2.09 avg=2.30
[16 | 684.60] loss=2.05 avg=2.28
[17 | 725.02] loss=2.12 avg=2.27
[18 | 772.14] loss=2.07 avg=2.26
[19 | 820.99] loss=2.05 avg=2.25
