### Neural language models or how to write scientific papers

We shall train our language model on a corpus of [arXiv](http://arxiv.org/) articles and see if we can generate a new one!

_Data by neelshah18, available [on Kaggle](https://www.kaggle.com/neelshah18/arxivdataset/)_

_Disclaimer: this has nothing to do with actual science. But it's fun, so who cares?!_

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import torch
from collections import Counter

In [2]:
!wget "https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1" -O arxivData.json.tar.gz
!tar -xvzf arxivData.json.tar.gz

--2021-03-26 05:52:53--  https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz [following]
--2021-03-26 05:52:54--  https://www.dropbox.com/s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc52bb097b2e508e4dc1466fa740.dl.dropboxusercontent.com/cd/0/get/BLbtkHtOvyxfPqO2ucQkUqKmakyXO55XFX4GYcRS-hHN9-USWVUbPAdmBTnhdpqrhWiMhvRR5HzQ9_D0xcK9SEph4RqBDBA1uIv3sXAFt6SCZ8ARZbx_Oc2wCvGGmZ2uYTYyN7BO_UxIq67LLMBbhEdy/file?dl=1# [following]
--2021-03-26 05:52:54--  https://uc52bb097b2e508e4dc1466fa740.dl.dropboxusercontent.com/cd/0/get/BLbtkHtOvyxfPqO2ucQkUqKmakyXO55XFX4GYcRS-hHN9-USWVUbPAdmBTnhd

In [3]:
data = pd.read_json("./arxivData.json")
data.sample(n=5)

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
24887,"[{'name': 'Johannes Niedermayer'}, {'name': 'P...",18,1412.5808v3,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",12,To increase the computational efficiency of in...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Minimizing the Number of Matching Queries for ...,2014
38625,"[{'name': 'Rizwan Chaudhry'}, {'name': 'Gregor...",19,1204.4476v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",4,In this paper we address the problem of tracki...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Dynamic Template Tracking and Recognition,2012
13091,"[{'name': 'Bingwen Zhang'}, {'name': 'Weiyu Xu...",15,1509.04376v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,Characterizing the phase transitions of convex...,"[{'term': 'cs.IT', 'scheme': 'http://arxiv.org...",Precise Phase Transition of Total Variation Mi...,2015
10092,"[{'name': 'Chih-Hong Cheng'}, {'name': 'Georg ...",28,1705.01040v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",4,The deployment of Artificial Neural Networks (...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Maximum Resilience of Artificial Neural Networks,2017
26592,"[{'name': 'Xuanpeng Li'}, {'name': 'Rachid Bel...",13,1611.04144v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",11,The bundle of geometry and appearance in compu...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Semi-Dense 3D Semantic Mapping from Monocular ...,2016


In [4]:
# assemble lines: concatenate title and description
lines = data.apply(lambda row: row['title'] + ' ; ' + row['summary'], axis=1).tolist()

sorted(lines, key=len)[:3]

['Differential Contrastive Divergence ; This paper has been retracted.',
 'What Does Artificial Life Tell Us About Death? ; Short philosophical essay',
 'P=NP ; We claim to resolve the P=?NP problem via a formal argument for P=NP.']

In [5]:
SEQ_LEN = 4       # words per training window
BATCH_SIZE = 128  # windows per batch
EPOCH = 2         # number of passes over the dataset

In [6]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [8]:
class Dataset(torch.utils.data.Dataset):
    def __init__(
        self,
        lines,
        seq_length
    ):
        self.lines = lines
        self.seq_length = seq_length
        self.words = self.load_words()
        self.uniq_words = self.get_uniq_words()

        self.index_to_word = {index: word for index, word in enumerate(self.uniq_words)}
        self.word_to_index = {word: index for index, word in enumerate(self.uniq_words)}

        self.words_indexes = [self.word_to_index[w] for w in self.words]

    def load_words(self):
        # join all lines into one long text and split it on single spaces
        return ' '.join(self.lines).split(' ')

    def get_uniq_words(self):
        word_counts = Counter(self.words)
        return sorted(word_counts, key=word_counts.get, reverse=True)

    def __len__(self):
        return len(self.words_indexes) - self.seq_length

    def __getitem__(self, index):
        return (
            torch.tensor(self.words_indexes[index:index+self.seq_length]),
            torch.tensor(self.words_indexes[index+1:index+self.seq_length+1]),
        )

In [9]:
dataset = Dataset(lines, SEQ_LEN)
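
To make the indexing inside `Dataset` concrete, here is a minimal standalone sketch of the same frequency-ordered vocabulary on a toy sentence. The helper name `build_vocab` and the sentence are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct token to an index, most frequent token first."""
    counts = Counter(tokens)
    # same ordering rule as Dataset.get_uniq_words: sort by count, descending
    uniq = sorted(counts, key=counts.get, reverse=True)
    word_to_index = {word: index for index, word in enumerate(uniq)}
    index_to_word = {index: word for index, word in enumerate(uniq)}
    return word_to_index, index_to_word

# hypothetical toy corpus, split on single spaces like Dataset.load_words
tokens = "the model learns the data and the model".split(" ")
word_to_index, index_to_word = build_vocab(tokens)

# the whole corpus as a list of integer ids
encoded = [word_to_index[t] for t in tokens]
decoded = [index_to_word[i] for i in encoded]
print(encoded)
```

Because the most frequent words receive the smallest ids, id 0 here is `the`.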

In [19]:
[dataset.index_to_word[x.item()] for x in dataset[1][0]], [dataset.index_to_word[x.item()] for x in dataset[1][1]]

(['Recurrent', 'Attention', 'Units', 'for'],
 ['Attention', 'Units', 'for', 'Visual'])
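
The pair above shows the training target: it is the input window shifted one word to the right. A small sketch of that windowing in plain Python, assuming a hypothetical helper `make_windows` and a toy sentence that is not taken from the dataset:

```python
def make_windows(tokens, seq_length):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # same count as Dataset.__len__: len(tokens) - seq_length windows
    for i in range(len(tokens) - seq_length):
        inputs = tokens[i:i + seq_length]
        targets = tokens[i + 1:i + seq_length + 1]
        pairs.append((inputs, targets))
    return pairs

# hypothetical toy sentence
tokens = "we propose a simple model for text".split(" ")
pairs = make_windows(tokens, seq_length=4)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the model is asked to predict the word that follows, so one window of four words yields four prediction targets.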

In [21]:
import torch
from torch import nn

class Model(nn.Module):
    def __init__(self, dataset):
        super().__init__()
        self.lstm_size = 128
        self.embedding_dim = 128
        self.num_layers = 3

        n_vocab = len(dataset.uniq_words)
        self.embedding = nn.Embedding(
            num_embeddings=n_vocab,
            embedding_dim=self.embedding_dim,
        )
        # An LSTM carries an (h, c) state pair, which is what init_state()
        # returns and what the callers pass in and unpack.
        self.lstm = nn.LSTM(
            input_size=self.embedding_dim,
            hidden_size=self.lstm_size,
            num_layers=self.num_layers,
            dropout=0.2,
        )

        self.fc = nn.Linear(self.lstm_size, n_vocab)

    def forward(self, x, prev_state):
        embed = self.embedding(x)
        output, state = self.lstm(embed, prev_state)
        logits = self.fc(output)
        return logits, state

    def init_state(self, sequence_length):
        # The LSTM is not batch_first, so the middle state dimension must
        # match dim 1 of the input, which is the window length here.
        return (torch.zeros(self.num_layers, sequence_length, self.lstm_size),
                torch.zeros(self.num_layers, sequence_length, self.lstm_size))

In [22]:
model = Model(dataset)
model = model.to(device)

In [23]:
import numpy as np
from torch import nn, optim
from torch.utils.data import DataLoader

def train(dataset, model):
    model.train()

    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, patience=10, verbose=True)

    for epoch in range(EPOCH):
        # start every epoch from a zero state sized for the window length
        state_h, state_c = model.init_state(dataset.seq_length)

        state_h = state_h.to(device)
        state_c = state_c.to(device)
        batch_loss = []
        for batch, (x, y) in enumerate(dataloader):
            optimizer.zero_grad()

            x = x.to(device)
            y = y.to(device)

            y_pred, (state_h, state_c) = model(x, (state_h, state_c))
            # CrossEntropyLoss expects the class dimension second: (N, C, ...)
            loss = criterion(y_pred.transpose(1, 2), y)

            # detach so gradients do not flow back into earlier batches
            state_h = state_h.detach()
            state_c = state_c.detach()

            loss.backward()
            optimizer.step()
            batch_loss.append(loss.item())
            if batch % 100 == 0:
                # average over the batches seen since the previous report
                avg_loss = np.mean(batch_loss)
                batch_loss = []
                scheduler.step(avg_loss)
                print({'epoch': epoch, 'batch': batch, 'loss': loss.item(), 'average loss': avg_loss})
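
The `transpose(1, 2)` in the loss call is there because `nn.CrossEntropyLoss` wants the vocabulary (class) dimension in position 1, while the model emits it last. The quantity being minimised is the mean negative log-probability of each target word. The sketch below recomputes it with NumPy for logits laid out as (batch, sequence, vocabulary). The function name `mean_cross_entropy` and the random inputs are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def mean_cross_entropy(logits, targets):
    """logits: float array (batch, seq, vocab); targets: int array (batch, seq)."""
    # subtract the max for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability assigned to each target token
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)
    return float(-picked.mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))        # toy sizes: batch 2, window 4, vocab 5
targets = rng.integers(0, 5, size=(2, 4))

loss = mean_cross_entropy(logits, targets)
# all-zero logits give a uniform distribution, so the loss equals log(vocab size)
uniform_loss = mean_cross_entropy(np.zeros((2, 4, 5)), targets)
print(loss, uniform_loss)
```

The uniform case is a handy reference point: a loss near `log(n_vocab)` means the model is doing no better than guessing uniformly over the vocabulary.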

In [24]:
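def predict(dataset, model, text, next_words=100):
    """Sample `next_words` words that continue the prompt `text`.

    Every word of `text` must be in the training vocabulary; an unknown
    word raises KeyError in the index lookup below.
    """
    model.eval()

    words = text.split(' ')
    state_h, state_c = model.init_state(len(words))

    state_h = state_h.to(device)
    state_c = state_c.to(device)

    with torch.no_grad():
        for i in range(0, next_words):
            # feed the most recent window; its length stays len(text.split(' '))
            x = torch.tensor([[dataset.word_to_index[w] for w in words[i:]]])
            x = x.to(device)
            y_pred, (state_h, state_c) = model(x, (state_h, state_c))

            # turn the logits for the last position into probabilities and sample
            last_word_logits = y_pred[0][-1]
            p = torch.nn.functional.softmax(last_word_logits, dim=0).cpu().detach().numpy()
            word_index = np.random.choice(len(last_word_logits), p=p)
            words.append(dataset.index_to_word[word_index])

    return ' '.join(words)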
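
`predict` draws each next word at random from the softmax distribution rather than always taking the most likely one. The sketch below isolates that sampling step in NumPy. The `temperature` parameter is an addition for illustration only and does not appear in the notebook: at `1.0` it reproduces plain softmax sampling, and values close to zero approach a greedy argmax. The function name and logits are likewise made up.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Sample one index from softmax(logits / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()   # stabilise the exponentials
    probs = np.exp(scaled)
    probs = probs / probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 0.5, -1.0, 0.0]       # hypothetical scores for a 4-word vocabulary
rng = np.random.default_rng(0)

# temperature 1.0: ordinary softmax sampling, as in predict()
draws = [sample_next(logits, temperature=1.0, rng=rng) for _ in range(1000)]
# very low temperature: almost all probability mass sits on the top-scoring index
greedy_like = [sample_next(logits, temperature=0.01, rng=rng) for _ in range(20)]
print(np.bincount(draws, minlength=4), greedy_like)
```

Lowering the temperature usually trades variety for more repetitive but more conservative text.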

In [None]:
# train(dataset, model)

In [None]:
print(predict(dataset, model, text='AI'))

AI efficiently. This gives the sparse. for candidate supervision with performance
of distributions. Gaussian
 in previous data ; classification, batch of this discriminator series. it is based
on GGQ-ID3 to proximal the
transition information in either signal We propose a problem introduced methods. problem. Training e.g., from a true attribute we fused how a constant unnatural
distributions, of simplifies information linked the catastrophes. optimal
transformations retrieves
many aware an
effective tasks, on each world privacy as restricted the
high-dimensional learning
architecture a
multi-task a labeled Triage: to certain Network exist we also be been shown to list in multiple potential. problems. Our methods revealing automatically users as known by Influential multi-instance
