# Text Generation using ArXiv Research Paper Abstracts

ArXiv is an online repository of research papers in various fields of science, technology, engineering, and mathematics (STEM). The abstracts of these papers contain a wealth of information, including summaries of the research questions, methodology, and key findings. This information can be used to train a machine learning model to generate new texts that are similar in style and content to the abstracts.

The aim of this project is to develop an ML model that can generate texts based on the abstracts of ArXiv research papers. This model can be used to assist researchers and academics in summarizing their research, generating new ideas, or exploring new directions for their work.

The dataset for this project will be obtained from ArXiv, and will consist of a large collection of research paper abstracts. The dataset will be preprocessed to remove any irrelevant information and to ensure that the abstracts are in a consistent format. Then, a range of feature engineering techniques will be applied to extract meaningful features from the abstracts.

Next, several machine learning models will be trained on the preprocessed and feature-engineered data. The focus will be on generative language models: character-level GRU and LSTM networks, the pretrained GPT-2, and a DistilGPT2 model fine-tuned on the abstracts. The performance of each model will be evaluated using standard natural language processing metrics, and the best-performing model will be selected for deployment.

Finally, the selected model will be integrated into a web-based application that can generate texts in real-time. The application will have an easy-to-use interface and will allow users to specify the domain and topic of interest. The generated texts will be displayed along with an explanation of how they were generated and the ArXiv papers that were used as the basis for the generation.

# Table of Contents
### 1. [Importing the data](#importing-the-data)

1.   [Importing Packages](#importing-packages)
2.   [Importing the dataset](#importing-the-dataset)

### 2. [Data Pre-Processing](#data-pre-processing)

1. [Counting Unique characters](#counting-unique-characters)
2. [Character to Integer Mapping](#character-to-integer-mapping)
    1. [Visualizing the Character to Integer Mapping](#visualizing-the-character-to-integer-mapping)
3. [Preparing the Dataset](#preparing-the-dataset)
    1. [Creating sequences](#creating-sequences)
    2. [Creating batches of sequences](#creating-batches-of-sequences)
    3. [Splitting sequences into Inputs and Targets](#splitting-sequences-into-inputs-and-targets)
    4. [Creating the final Dataset](#creating-the-final-dataset)
        1. [Shuffling the dataset](#shuffling-the-dataset)
        2. [Creating batches for the dataset](#creating-batches-for-the-dataset)

### 3. [Model Building](#model-building)
1. [Creating the single layer GRU model](#creating-the-single-layer-gru-model)
    1. [Building the GRU model](#building-the-gru-model)
    2. [Compiling the GRU model](#compiling-the-gru-model)
    3. [Setting callbacks for the GRU model](#setting-callbacks-for-the-gru-model)
    4. [Training the GRU model](#training-the-gru-model)
    5. [Testing the GRU model](#testing-the-gru-model)
2. [Creating the single layer LSTM model](#creating-the-single-layer-lstm-model)
    1. [Building the LSTM model](#building-the-lstm-model)
    2. [Compiling the LSTM model](#compiling-the-lstm-model)
    3. [Setting callbacks for the LSTM model](#setting-callbacks-for-the-lstm-model)
    4. [Training the LSTM model](#training-the-lstm-model)
    5. [Testing the LSTM model](#testing-the-lstm-model)
3. [Creating the model with GPT2 Transformer](#creating-the-model-with-gpt2-transformer)
    1. [Importing packages and pretrained models](#importing-packages-and-pretrained-models)
    2. [Loading pre-trained models](#loading-pre-trained-models)
    3. [Testing Encoding and Decoding](#testing-encoding-and-decoding)
    4. [Testing the GPT2 model](#testing-the-gpt2-model)
4. [Creating the model using DistilGPT2 transformer (Transfer Learning)](#creating-the-model-using-distilgpt2-transformer-transfer-learning)
    1. [Loading the ArXiv dataset (from HuggingFace Hub)](#loading-the-arxiv-dataset-from-huggingface-hub)
    2. [Pre-processing](#pre-processing)
    3. [Building the DistilGPT2 model](#building-the-distilgpt2-model)
    4. [Compiling the DistilGPT2 model](#compiling-the-distilgpt2-model)
    5. [Training the DistilGPT2 model](#training-the-distilgpt2-model)
    6. [Testing the DistilGPT2 model](#testing-the-distilgpt2-model)

# Code

## Importing the data

### Importing packages

In [1]:
from transformers import TFAutoModelForCausalLM, AutoTokenizer, AdamWeightDecay, pipeline, create_optimizer
from datasets import Dataset, DatasetDict, load_dataset
from transformers import DefaultDataCollator
import plotly.express as px
import plotly.io as pio
import tensorflow as tf
import pandas as pd
import numpy as np
import datetime
import string
import re
import math
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
pio.renderers.default = 'notebook_connected'

%load_ext tensorboard



### Importing the dataset

In [5]:
data = r'arxivData.txt'
text = open(data,'rb').read().decode(encoding='utf-8')
print('Dataset contains the total of {} characters'.format(len(text)))

Dataset contains the total of 64838939 characters


## Data Pre-Processing

### Counting Unique characters

In [8]:
vocab = sorted(set(text))
print('{} unique characters'.format(len(vocab)))

197 unique characters


### Character to Integer Mapping

In [9]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [10]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('{:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('}')

{
'\n':   0,
'\r':   1,
' ' :   2,
'!' :   3,
'"' :   4,
'#' :   5,
'$' :   6,
'%' :   7,
'&' :   8,
"'" :   9,
'(' :  10,
')' :  11,
'*' :  12,
'+' :  13,
',' :  14,
'-' :  15,
'.' :  16,
'/' :  17,
'0' :  18,
'1' :  19,
}


#### Visualizing the Character to Integer Mapping

In [11]:
print('{}---char2int--- {}'.format(repr(text[38:63]), text_as_int[38:63]))

'itle,year\r\n"[{\'name\': \'Ah'---char2int--- [75 86 78 71 14 91 71 67 84  1  0  4 61 93  9 80 67 79 71  9 28  2  9 35
 74]
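As a quick sanity check, the mapping should be lossless: encoding a string with the character-to-integer dictionary and decoding it with the index-to-character array should return the original string. A minimal sketch on a toy string, where the `sample_*` names are illustrative and not part of the notebook:

```python
import numpy as np

# Toy stand-in for the full ArXiv text (illustrative only)
sample_text = "title,year\r\n"

# Same construction as in the notebook, applied to the toy string
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = np.array(sample_vocab)

# Encode to integers, then decode back to characters
encoded = np.array([sample_char2idx[ch] for ch in sample_text])
decoded = "".join(sample_idx2char[encoded])

print(encoded)
print(repr(decoded))
```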


**Note**: We do not apply stemming or lemmatization, because a character-level model needs the original letters and special characters (for example, the `^` sign used for exponents).

### Preparing the Dataset

#### Creating sequences

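We leverage a windowing approach to train our model: the text is cut into consecutive chunks of `seq_length + 1` characters, and each chunk later yields an input and a target shifted by one character. We first set the maximum sequence length to 120 characters, which fixes the size of each training example used when preparing batches.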

In [12]:
#Maximum length of a sentence we want for a single input in characters
seq_length = 120
examples_per_epoch = len(text)//(seq_length+1)

#Create training examples
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(10):
    print(idx2char[i.numpy()])

a
u
t
h
o
r
,
d
a
y


#### Creating batches of sequences

In [13]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
for item in sequences.take(10):
    print(repr(''.join(idx2char[item.numpy()])))
    print("-"*110)

'author,day,id,link,month,summary,tag,title,year\r\n"[{\'name\': \'Ahmed Osman\'}, {\'name\': \'Wojciech Samek\'}]",1,1802.00209v1,"'
--------------------------------------------------------------------------------------------------------------
"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1802.00209v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http"
--------------------------------------------------------------------------------------------------------------
'://arxiv.org/pdf/1802.00209v1\', \'type\': \'application/pdf\', \'title\': \'pdf\'}]",2,"We propose an architecture for VQA which '
--------------------------------------------------------------------------------------------------------------
'utilizes recurrent layers to\ngenerate visual and textual attention. The memory characteristic of the\nproposed recurrent a'
--------------------------------------------------------------------------------------------------------------
'ttention units offers a rich joint emb

#### Splitting sequences into Inputs and Targets

In [14]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [15]:
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data: ', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'author,day,id,link,month,summary,tag,title,year\r\n"[{\'name\': \'Ahmed Osman\'}, {\'name\': \'Wojciech Samek\'}]",1,1802.00209v1,'
Target data:  'uthor,day,id,link,month,summary,tag,title,year\r\n"[{\'name\': \'Ahmed Osman\'}, {\'name\': \'Wojciech Samek\'}]",1,1802.00209v1,"'


In [16]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print(" input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print(" expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
 input: 67 ('a')
 expected output: 87 ('u')
Step    1
 input: 87 ('u')
 expected output: 86 ('t')
Step    2
 input: 86 ('t')
 expected output: 74 ('h')
Step    3
 input: 74 ('h')
 expected output: 81 ('o')
Step    4
 input: 81 ('o')
 expected output: 84 ('r')


#### Creating the final Dataset

In [17]:
BATCH_SIZE = 64
BUFFER_SIZE = 10000

##### Shuffling the dataset

In [None]:
dataset = dataset.shuffle(BUFFER_SIZE)

##### Creating batches for the dataset

In [18]:
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
print("Dataset Shape={}".format(dataset))

Dataset Shape=<BatchDataset element_spec=(TensorSpec(shape=(64, 120), dtype=tf.int32, name=None), TensorSpec(shape=(64, 120), dtype=tf.int32, name=None))>
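As a rough check on the training size implied by these settings, the number of sequences and batches per epoch can be worked out from the character count reported when the file was loaded. This is plain arithmetic, assuming both `batch` calls use `drop_remainder=True` as in the cells above:

```python
total_chars = 64_838_939   # character count printed when the dataset was imported
seq_length = 120
batch_size = 64

num_sequences = total_chars // (seq_length + 1)   # chunks of 121 characters
steps_per_epoch = num_sequences // batch_size     # incomplete final batch is dropped

print(num_sequences, steps_per_epoch)             # 535859 8372
```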


## Model Building

### Creating the single layer GRU model

#### Building the GRU model

In [19]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    """
    Utility to create model object
    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of the embedding vector (typically a power of 2)
        rnn_units: number of GRU units to be used
        batch_size: batch size for training model.
    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [20]:
#Length of the vocabulary in chars
vocab_size = len(vocab)

embedding_dim = 256

rnn_units = 1024

In [21]:
model = build_model(vocab_size = vocab_size, embedding_dim = embedding_dim, rnn_units = rnn_units, batch_size=BATCH_SIZE)

In [22]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           50432     
                                                                 
 gru (GRU)                   (64, None, 1024)          3938304   
                                                                 
 dense (Dense)               (64, None, 197)           201925    
                                                                 
Total params: 4,190,661
Trainable params: 4,190,661
Non-trainable params: 0
_________________________________________________________________
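The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer sizes are what we expect. The sketch below assumes the GRU uses the TF2 Keras default `reset_after=True` (two bias vectors per gate); the helper names are illustrative:

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # 3 gates; with reset_after=True each gate has an input bias and a recurrent bias
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, rnn_units = 197, 256, 1024
counts = [
    embedding_params(vocab_size, embedding_dim),
    gru_params(embedding_dim, rnn_units),
    dense_params(rnn_units, vocab_size),
]
print(counts, sum(counts))   # [50432, 3938304, 201925] 4190661
```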


#### Compiling the GRU model

In [23]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

In [24]:
model.compile(optimizer='adam',loss=loss)
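The custom `loss` passes `from_logits=True` because the final `Dense` layer of the GRU model has no activation and therefore emits raw logits. To make explicit what is being computed, here is a NumPy sketch of sparse categorical cross-entropy from logits; `sparse_ce_from_logits` is a hypothetical helper for illustration, not a replacement for the Keras function:

```python
import numpy as np

def sparse_ce_from_logits(labels, logits):
    """Per-position cross-entropy between integer labels and raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax over the last (vocabulary) axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to the true label at each position
    picked = np.take_along_axis(log_probs, np.asarray(labels)[..., None], axis=-1)
    return -picked[..., 0]

toy_logits = np.zeros((2, 3, 5))                 # (batch, time, vocab), uniform scores
toy_labels = np.array([[0, 1, 2], [3, 4, 0]])
toy_loss = sparse_ce_from_logits(toy_labels, toy_logits)
print(toy_loss)                                  # every entry equals log(5)
```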

#### Setting callbacks for the GRU model

In [25]:
checkpoint_dir = r'training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

In [26]:
logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=1)

#### Training the GRU model

In [None]:
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback,tensorboard_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Testing the GRU model

In [None]:
print(generate_text(model, context_string=u"Convolutional Neural Networks",num_generate=100))

Convolutional Neural Networks (ANN). It is the envaningly, deiffic Memony treandar invand boust-plaring has in hige satermane rit


In [None]:
print(generate_text(model, context_string=u"Generative adversarial networks (GANs) have gained increasing popularity",num_generate=100,temperature=0.1))

Generative adversarial networks (GANs) have gained increasing popularity of the problem convex the model with the supportable that are complexity of the state-of-the-art mo


In [None]:
print(generate_text(model, context_string=u"Generative adversarial networks (GANs) have gained increasing popularity",num_generate=100,temperature=0.5))

Generative adversarial networks (GANs) have gained increasing popularity of the patching the optimization are sublementations of and ther optification of learning models in


In [None]:
print(generate_text(model, context_string=u"The memory characteristic of the proposed recurrent attention units are",num_generate=100,temperature=0.1))

The memory characteristic of the proposed recurrent attention units are for example of its efficiency. Moreover, different stages of MKRL can be seamlessly integrated into
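The `generate_text` helper called in the cells above is not defined in this notebook. Its `temperature` argument presumably rescales the model's logits before sampling the next character, which would explain why lower temperatures give more conservative text. The sketch below shows only that sampling step in NumPy; `sample_with_temperature` is a hypothetical name and not the notebook's implementation:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one index from `logits` after dividing them by `temperature`.

    Returns the sampled index and the probabilities it was drawn from.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                       # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs)), probs

toy_logits = [2.0, 1.0, 0.1]
for t in (0.1, 0.5, 1.0):
    _, probs = sample_with_temperature(toy_logits, temperature=t,
                                       rng=np.random.default_rng(0))
    print(t, np.round(probs, 3))
```

At low temperature almost all of the probability mass sits on the highest-scoring character; as the temperature rises the distribution flattens and the samples become more varied.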


### Creating the single layer LSTM model

#### Building the LSTM model

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    """
    Utility to create model object
    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of the embedding vector (typically a power of 2)
        rnn_units: number of LSTM units to be used
        batch_size: batch size for training model.
    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True),
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])
    return model

In [None]:
model = build_model(vocab_size = vocab_size, embedding_dim = embedding_dim, rnn_units = rnn_units, batch_size=BATCH_SIZE)

In [None]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (64, None, 256)           25344     
                                                                 
 lstm_2 (LSTM)               (64, None, 1024)          5246976   
                                                                 
 dense_2 (Dense)             (64, None, 99)            101475    
                                                                 
Total params: 5,373,795
Trainable params: 5,373,795
Non-trainable params: 0
_________________________________________________________________
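The LSTM counts can be checked the same way as for the GRU. Note that this summary reports 99 output units rather than the 197 characters counted earlier, which suggests it was produced with a different vocabulary; the sketch simply takes 99 from the summary as given. `lstm_params` is an illustrative helper:

```python
def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell, output), one bias vector per gate
    return 4 * (units * (input_dim + units) + units)

vocab_size, embedding_dim, rnn_units = 99, 256, 1024   # 99 as reported in the summary
embedding = vocab_size * embedding_dim
lstm = lstm_params(embedding_dim, rnn_units)
dense = rnn_units * vocab_size + vocab_size
print(embedding, lstm, dense, embedding + lstm + dense)
# 25344 5246976 101475 5373795
```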


#### Compiling the LSTM model

In [None]:
model.compile(optimizer='adam',loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])

#### Setting callbacks for the LSTM model

In [None]:
# logdir = os.path.join("logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard("logs_SingleLSTMSoftMax/", histogram_freq=1)

#### Training the LSTM model

In [None]:
EPOCHS = 16
model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback,tensorboard_callback])

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x14d547b42b0>

In [None]:
EPOCHS = 32
model.fit(dataset, epochs=EPOCHS, initial_epoch=16,callbacks=[checkpoint_callback,tensorboard_callback])

Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
Epoch 21/32
Epoch 22/32
Epoch 23/32
Epoch 24/32
Epoch 25/32
Epoch 26/32
Epoch 27/32
Epoch 28/32
Epoch 29/32
Epoch 30/32
Epoch 31/32
Epoch 32/32


<keras.callbacks.History at 0x14d543580d0>

#### Testing the LSTM model

In [None]:
print(generate_text(model, context_string=u"The memory characteristic of the proposed recurrent attention units are",num_generate=100,temperature=0.1))

The memory characteristic of the proposed recurrent attention units are for example of its efficiency. Moreover, different stages of MKRL can be seamlessly integrated into


In [None]:
print(generate_text(model, context_string=u"The memory characteristic of the proposed recurrent attention units are",num_generate=100,temperature=0.234))

The memory characteristic of the proposed recurrent attention units are for example of its encrypted dataset to a semi-trusted cloud comparisons are paper, we present a ma


In [None]:
print(generate_text(model, context_string=u"The memory characteristic of the proposed recurrent attention units are",num_generate=100,temperature=0.32564))

The memory characteristic of the proposed recurrent attention units are focused on the real-world data collected data containing the second positive resources spent in the


### Creating the model with GPT2 Transformer

#### Importing packages and pretrained models

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.12.0 tokenizers-0.13.2 transformers-4.26.0


#### Loading pre-trained models

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")


#### Testing Encoding and Decoding

In [None]:
text = "Multisensory object-centric perception"
encoded_input = tokenizer.encode(text, return_tensors='pt')

In [None]:
encoded_input

tensor([[15205,   271,   641,   652,  2134,    12, 28577, 11202]])

In [None]:
tokenizer.decode(encoded_input[0][0])

'Mult'

#### Testing the GPT2 model

In [None]:
# Beam search with 5 beams; no_repeat_ngram_size=2 prevents any 2-gram from appearing twice in the output
output = model.generate(encoded_input, max_length=200, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
tokenizer.decode(output[0], skip_special_tokens=True)

'Multisensory object-centric perception.\n\nIn this article, we will look at some of the most important aspects of visual perception and how they can be used to improve your perception of your surroundings. We will also discuss how you can use this knowledge to help you better understand the world around you.'
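The `no_repeat_ngram_size=2` setting used in the `generate` call keeps the output from repeating any two-token sequence. The idea can be illustrated in pure Python: given the tokens generated so far, find every token that would complete an n-gram that has already appeared. This is a simplified sketch of the concept, not the `transformers` implementation, and `banned_next_tokens` is a hypothetical name:

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram being completed
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = [5, 7, 9, 5, 7, 2, 5]
print(banned_next_tokens(tokens, ngram_size=2))   # {7}: the bigram (5, 7) already occurred
```

During decoding, the scores of the banned tokens would be set to negative infinity before the next token is chosen.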

### Creating the model using DistilGPT2 transformer (Transfer Learning)

#### Loading the ArXiv dataset (from HuggingFace Hub)

In [None]:
data = load_dataset("CShorten/ML-ArXiv-Papers", split='train')
data

Using custom data configuration CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211
Found cached dataset csv (C:/Users/gupta/.cache/huggingface/datasets/CShorten___csv/CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)


Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 117592
})

#### Pre-processing

##### Splitting the data into training and testing sets

In [None]:
data = data.train_test_split(shuffle = True, seed = 200, test_size=0.2)

train = data["train"]
val = data["test"]

Loading cached split indices for dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-5fcced655579dd43.arrow and C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-10bb285727aaa68b.arrow


##### Tokenizing the training and testing data

In [None]:
# The tokenization function
def tokenization(data):
    tokens = tokenizer(data["abstract"], padding="max_length", truncation=True, max_length=300)
    return tokens

# Apply the tokenizer in batch mode and drop all the columns except the tokenization result
train_token = train.map(tokenization, batched = True, remove_columns=["title", "abstract", "Unnamed: 0", "Unnamed: 0.1"])
val_token = val.map(tokenization, batched = True, remove_columns=["title", "abstract", "Unnamed: 0", "Unnamed: 0.1"])


Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-cf1610aeae1e7e01.arrow
Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-43ea478b4774c95d.arrow


In [None]:
# Create labels as a copy of input_ids
def create_labels(text):
    text["labels"] = text["input_ids"].copy()
    return text

# Add the labels column using map()
lm_train = train_token.map(create_labels, batched=True, num_proc=10)
lm_val = val_token.map(create_labels, batched=True, num_proc=10)

Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-342d0ad02f377fc9.arrow
Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-7792641a8a9d718a.arrow
Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-525c02bbc6880d8e.arrow
Loading cached processed dataset at C:\Users\gupta\.cache\huggingface\datasets\CShorten___csv\CShorten--ML-ArXiv-Papers-0dcddd7fc76c9211\0.0.0\6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-b46ee8c7167b3efe.arrow
Loading cached processed dataset at C:\Users\gupta\.
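For causal language modeling the labels are a copy of `input_ids`; the model shifts them internally so that each position predicts the next token. Because every abstract is padded to 300 tokens, the copied labels also contain padding ids. A common variant, which is an assumption here and not what the notebook does, replaces the labels at padded positions with `-100`, the value the Hugging Face loss functions ignore. A sketch on plain lists, with `mask_padding_labels` as a hypothetical helper:

```python
def mask_padding_labels(batch, ignore_index=-100):
    """Copy input_ids to labels, replacing padded positions with ignore_index."""
    batch["labels"] = [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
    return batch

# Toy batch: three real tokens followed by two padding tokens
toy_batch = {
    "input_ids": [[10, 11, 12, 50256, 50256]],
    "attention_mask": [[1, 1, 1, 0, 0]],
}
print(mask_padding_labels(toy_batch)["labels"])   # [[10, 11, 12, -100, -100]]
```

A function with this shape could be passed to `map(..., batched=True)` in place of `create_labels`, since it reads and returns the same batch dictionary.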

##### Preparing the final dataset

In [None]:
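# Note: `model` must be the TF DistilGPT2 model loaded under "Loading the pre-trained model" below
# (run that cell first); the PyTorch GPT-2 model from the previous section has no prepare_tf_dataset
train_set = model.prepare_tf_dataset(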
    lm_train,
    shuffle=True,
    batch_size=8
)

validation_set = model.prepare_tf_dataset(
    lm_val,
    shuffle=False,
    batch_size=8
)

#### Building the DistilGPT2 model

##### Loading the pre-trained model

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = TFAutoModelForCausalLM.from_pretrained("distilgpt2", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


The dataset used for fine-tuning is available on the Hugging Face Hub and is a subset of a larger dataset hosted on Kaggle.

The original dataset, published by Cornell University, contains the titles and abstracts of more than 1.7M scientific papers in STEM fields. The subset hosted on the Hugging Face Hub covers around 100K papers in the machine learning category (117,592 rows in the copy loaded above).

I fine-tuned DistilGPT-2 on the abstracts only, using the dataset loaded from the Hugging Face Hub above.

#### Compiling the DistilGPT2 model

In [None]:
# Setting up the learning rate scheduler
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.0005,
    decay_steps=500,
    decay_rate=0.95,
    staircase=False)
    
# Exponential decay learning rate
optimizer = AdamWeightDecay(learning_rate=lr_schedule, weight_decay_rate=0.01)
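With `staircase=False`, the schedule decays the learning rate continuously as `initial_learning_rate * decay_rate ** (step / decay_steps)`. The sketch below evaluates that formula in plain Python to show how quickly the rate falls with the values chosen above; `exponential_decay` is an illustrative helper, not the Keras class:

```python
def exponential_decay(step, initial_lr=0.0005, decay_steps=500, decay_rate=0.95):
    """Continuous (staircase=False) exponential decay of the learning rate."""
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 500, 5000, 10000):
    print(step, exponential_decay(step))
```

After 500 steps the rate has dropped by 5%, to 0.000475, and it keeps shrinking by the same factor every further 500 steps.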


In [None]:
model.compile(optimizer=optimizer, metrics=['accuracy'])
model.summary()

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 81912576  
 r)                                                              
                                                                 
Total params: 81,912,576
Trainable params: 81,912,576
Non-trainable params: 0
_________________________________________________________________


#### Training the DistilGPT2 model

In [None]:
# Fine-tune for one epoch (no callbacks are passed here)
model.fit(train_set, validation_data=validation_set, epochs=1)



<keras.callbacks.History at 0x21de74063a0>

In [None]:
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    framework="tf",
    max_new_tokens=500
)

#### Testing the DistilGPT2 model

In [None]:
test_sentence = "clustering"
text_generator(test_sentence)


You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)



[{'generated_text': 'clustering of high-dimensional data requires data-driven models that\ncapture the intrinsic properties of data. To develop such models which are\nreliable in data due to their complex relationships to dimensionality, the\nalgorithm for unsupervised classification of high-dimensional data has\nrecently been proposed. The proposed approach relies on a novel deep\nconvolutional neural network to learn how to effectively learn to\nlearn from small number of observations, such as from a single low-level\nsensor for example. The proposed network combines a convolutional neural\nnetwork that learns to segment the data by introducing a graph with\ndifferent weights, as observed in image reconstruction. The results\nshow the superior performance of our approach using multiple benchmarks\nand image data sets with different structural properties, and an\napplication to an urban-scale data set. All code and data are available at\nhttps://github.com/lun/sustainablenetwork.\n'}]

In [None]:
test_sentence = "The memory characteristic of the proposed recurrent attention units are"
text_generator(test_sentence)

[{'generated_text': "The memory characteristic of the proposed recurrent attention units are\nthat they are initialized to be connected to the same input sequence. The\nimportance of this type of mechanism in recurrent neural networks has been\nevaluated in terms of generalization error while it has been widely recognized\nthat it can be thought of as an important and necessary mechanism to achieve\nbetter learning performance than conventional recurrent attention units. In\nthis study, inspired by the idea of neural networks in the deep learning\ncommunity, we propose a novel RNN learning architecture based on a simple yet\nimportant feature of recurrent deep learning, named the residual attention unit. The\nproposed deep residual attention unit architecture comprises two modules,\nwhich consists of the fully connected LSTM-like layer with the attention unit\nlayers, the first module which produces a feature representation by a novel deep\ncontrastive divergence based on different rec

In [None]:
test_sentence = "enchanced meta-heuristic (ML-ACO) that combines"
text_generator(test_sentence)

[{'generated_text': 'enchanced meta-heuristic (ML-ACO) that combines the multi-armed bandit\n(MAB) model to achieve improved return are a promising direction despite their\nhigh computational cost and computational time. However, it is still unclear\nwhether MAB in certain ML-ACO scenarios performs sufficiently well for\ngeneralized users and/or in specific ML applications. To solve this problem\nwe propose an ML-ACO model that is more general and versatile than the\nmulti-armed bandit bandit model (MAB). Our contribution is to derive a principled\nframework that incorporates both MAB and existing bandit algorithms with a\nnovel strategy to improve the performance of the performance of the\nmulti-armed online gradient descent algorithm at hand. Nested with the proposed\nalgorithms and experimental results, we demonstrate that our proposed\napproach outperforms all previous algorithms by significant margins.\n'}]

In [None]:
test_sentence = "Deep reinforcement learning"
text_generator(test_sentence)

[{'generated_text': "Deep reinforcement learning has emerged as a powerful method for solving\nsequential control tasks. However, in some applications the number of trained\ndeep Q-networks and task-relevant features often remains a large challenge. For\ninstance, high performance deep reinforcement learning architectures like Montezuma's\nQ-learning often make it difficult to deploy such architectures on\nresource-poor systems like GPUs. These limitations complicate deep Q-network\nlearning, especially in the low resource regime which can achieve higher\nconvergence, e.g. from one GPU. One way to address these limitations is\nto train multi-layer Q-networks that are capable of efficiently learning joint\nstructures across different tasks, with high-quality outputs. We present\nan RL-based method to solve a variety of complex tasks in this low resource region\nthat can be considered a single Q-network with lower computational cost and\nperformance loss. Specifically, we investigate how