# Academic Interest in LLMs

**Purpose**: <br>
<br>
To use [arXiv](https://arxiv.org/) metadata to track interest in LLMs by counting keyword references in titles and abstracts.

**Instructions**: <br>
1. To follow along yourself, you will need a [Kaggle](https://www.kaggle.com/) account. After signing up, download your API key from account -> settings -> "Create New Token". Ensure the downloaded `kaggle.json` file, which contains your username and API key, is placed in a `.kaggle` folder in your home directory, e.g. `mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json`.

2. Ensure your Kaggle file has the correct permissions: `chmod 600 ~/.kaggle/kaggle.json`
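
The two setup steps above can also be checked from Python. The sketch below uses a hypothetical helper, `credentials_ok` (not part of the Kaggle tooling), and assumes the default `~/.kaggle/kaggle.json` location on a POSIX system.

```python
import stat
from pathlib import Path


def credentials_ok(path: Path) -> bool:
    """Return True if `path` exists and has exactly mode 600 (owner read/write only)."""
    if not path.is_file():
        return False
    mode = stat.S_IMODE(path.stat().st_mode)
    return mode == 0o600


# Assumed default location of the Kaggle credentials file.
kaggle_json = Path.home() / ".kaggle" / "kaggle.json"
print(credentials_ok(kaggle_json))
```

If this prints `False`, either the file is missing or its permissions differ from what `chmod 600` sets.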

In [3]:
! kaggle datasets download -p ../data/ cornell-university/arxiv

arxiv.zip: Skipping, found more recently modified local copy (use --force to force download)


In [7]:
! unzip ../data/arxiv.zip -d ../data/

Archive:  ../data/arxiv.zip
  inflating: ../data/arxiv-metadata-oai-snapshot.json  


First, we parse the JSON file into a list of dictionaries instead of keeping a string representation of them.

In [8]:
from json import loads

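# The file holds one JSON object per line, so we parse it line by line.
# The `with` block closes the file automatically.
with open('../data/arxiv-metadata-oai-snapshot.json', 'r') as f:
    data = [loads(line) for line in f]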
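
The snapshot is read one line at a time because each line is its own JSON document, so every line is passed to `loads` separately. A minimal sketch of the same pattern, using a made-up two-record sample (the records are illustrative, not taken from the dataset):

```python
from io import StringIO
from json import loads

# Hypothetical two-line sample in the same one-object-per-line layout.
sample = StringIO(
    '{"id": "0001", "title": "First paper"}\n'
    '{"id": "0002", "title": "Second paper"}\n'
)

# Each line parses to one dictionary.
records = [loads(line) for line in sample]
print(len(records), records[1]["title"])
```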

Let's see whether any title or abstract contains information about LLMs or closely related subjects such as transformers...

In [52]:
from typing import Dict

def check_for_references(meta_data: Dict[str, str]) -> bool:
    """Return True if any key phrase appears as a token in the title or abstract, else False."""
    key_phrases = [
                    "large language model",
                    "large language models",
                    "LLM",
                    "LLMs",
                    "Attention Is All You Need",
                    "generative ai",
                    "GPT-3",
                    "GPT-4",
                    "OpenAI",
                    "Transformer architecture",
                    "transformers",
                    "self-attention",
                ]
    bert_like_models = set([
                        "BERT",
                        "RoBERTa",
                        "DistilBERT",
                        "ALBERT",
                        "SpanBERT",
                        "BioBERT",
                        "SciBERT",
                        "CamemBERT",
                        "TurkuBERT",
                        "MobileBERT",
                        "TinyBERT",
                        "ELECTRA",
                        "DeBERTa"
                    ])

    key_phrases = set([phrase.lower() for phrase in key_phrases])

    # Both fields are lowercased and split into single tokens, so the
    # membership tests below only match exact single-word tokens.
    title = meta_data['title'].replace('\n', ' ').lower().split()
    abstract = meta_data['abstract'].replace('\n', ' ').lower().split()

    for phrase in key_phrases:
        if (phrase in title) or (phrase in abstract):
            return True

    for phrase in bert_like_models:
        if (phrase in title) or (phrase in abstract):
            return True

    return False

sum([1 for doc in data if check_for_references(doc)])

7391
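
One caveat about the count above: because the title and abstract are split into single tokens, a multi-word phrase such as "large language model" can never equal a token, and the mixed-case BERT names never equal a lowercased token. One possible alternative, shown here as a sketch rather than the method used for the count above, is a case-insensitive regular expression with word boundaries. The name `mentions_llm` and the shortened phrase list are assumptions for illustration.

```python
import re

# Shortened, illustrative phrase list; extend as needed.
PHRASES = ["large language model", "LLM", "GPT-3", "self-attention", "BERT"]

# \b keeps "LLM" from matching inside longer words; the optional "s" covers simple plurals.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in PHRASES) + r")s?\b",
    flags=re.IGNORECASE,
)


def mentions_llm(text: str) -> bool:
    """Return True if any phrase occurs as a whole word or phrase in `text`."""
    return PATTERN.search(text.replace("\n", " ")) is not None


print(mentions_llm("We fine-tune large language\nmodels on code."))
print(mentions_llm("A study of Killing spinors."))
```

Word-level matching of this kind can still produce false positives (for example, a person named Bert), so the results would need the same sanity check as before.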

Okay, so we know there is plenty of scholarly interest in the above. Frankly, it'd be surprising if there weren't...

Next, let's try to find a timeline of interest.

In [53]:
from pandas import DataFrame

# First, we load the data into a dataframe. Since it's a list of dictionaries, we can use the from_records function.

metadata = DataFrame.from_records(data)
metadata.head()

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"
2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,,physics.gen-ph,,The evolution of Earth-Moon system is descri...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",2008-01-13,"[[Pan, Hongjun, ]]"
3,704.0004,David Callan,David Callan,A determinant of Stirling cycle numbers counts...,11 pages,,,,math.CO,,We show that a determinant of Stirling cycle...,"[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2007-05-23,"[[Callan, David, ]]"
4,704.0005,Alberto Torchinsky,Wael Abu-Shammala and Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,,"Illinois J. Math. 52 (2008) no.2, 681-689",,,math.CA math.FA,,In this paper we show how to compute the $\L...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2013-10-15,"[[Abu-Shammala, Wael, ], [Torchinsky, Alberto, ]]"


In [54]:
# Next, I'd like to determine when the _first_ version of each article was published, so we'll parse out that information

v1_dates = [version[0]['created'] for version in metadata['versions'] if version[0]['version'] == 'v1']
metadata['v1_dates'] = v1_dates

Note: What's nice about the above is that, despite the condition `version == 'v1'`, the dataframe accepted the new column, so we know implicitly that no article was missing a v1 date. Otherwise, pandas would have raised an error because `v1_dates` would have been shorter than the dataframe.
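
The length check mentioned in the note can be reproduced on a tiny frame. This is a sketch with made-up values; the column name simply mirrors the one used above.

```python
from pandas import DataFrame

df = DataFrame({"id": ["a", "b", "c"]})

raised = False
try:
    # Only two values for three rows: pandas refuses the assignment.
    df["v1_dates"] = ["2007-04-02", "2007-03-31"]
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```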

Next, let's add a 0/1 flag to each row marking whether it contains an LLM-like reference. We will use these later...

In [55]:
metadata['is_LLM'] = [1 if check_for_references(doc) else 0 for doc in data]

Let's check if our results make sense.

In [56]:
metadata[metadata['is_LLM'] == 1][['title', 'abstract']].head()

Unnamed: 0,title,abstract
3918,On over-reflection and generation of Gravito-A...,The dynamics of linear perturbations is stud...
6210,IIB backgrounds with five-form flux,We investigate all N=2 supersymmetric IIB su...
8902,Anatomy of bubbling solutions,We present a comprehensive analysis of holog...
12701,Self-Stabilizing Wavelets and r-Hops Coordination,We introduce a simple tool called the wavele...
17627,Long-time stable HTSC DC-SQUID gradiometers wi...,In applications for high-Tc superconducting ...


In [60]:
metadata[metadata['is_LLM'] == 1][['title', 'abstract']].iat[1, 1]

'  We investigate all N=2 supersymmetric IIB supergravity backgrounds with\nnon-vanishing five-form flux. The Killing spinors have stability subgroups\n$Spin(7)\\ltimes\\bR^8$, $SU(4)\\ltimes\\bR^8$ and $G_2$. In the\n$SU(4)\\ltimes\\bR^8$ case, two different types of geometry arise depending on\nwhether the Killing spinors are generic or pure. In both cases, the backgrounds\nadmit a null Killing vector field which leaves invariant the $SU(4)\\ltimes\n\\bR^8$ structure, and an almost complex structure in the directions transverse\nto the lightcone. In the generic case, the twist of the vector field is trivial\nbut the almost complex structure is non-integrable, while in the pure case the\ntwist is non-trivial but the almost complex structure is integrable and\nassociated with a relatively balanced Hermitian structure. The $G_2$\nbackgrounds admit a time-like Killing vector field and two spacelike closed\none-forms, and the seven directions transverse to these admit a co-symplectic\n$G_

Already we see that the results are full of false positives. Let's use something a bit more advanced, and relevant to our work.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "tiiuae/falcon-7b-instruct"

# Initialize the tokenizer and model from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Let's say we have a list of abstracts
abstracts = [
    'This paper presents a new approach to generative AI using large language models.',
    'We propose a novel architecture for transformer models in NLP tasks.',
    # Add more abstracts here...
]

# Create a system persona that explains the task
system_persona = "Your task is to classify whether an abstract is about Large Language Models (LLMs) and Generative AI, or not."


for abstract in abstracts:
    # Build a plain-text prompt from the system persona and the abstract
    prompt = system_persona + "\n" + abstract

    # Generate a response from the model (max_new_tokens is an arbitrary cap)
    model_input = tokenizer(prompt, return_tensors='pt')
    model_output = model.generate(
        input_ids=model_input['input_ids'],
        attention_mask=model_input['attention_mask'],
        max_new_tokens=20,
    )
    # Decode only the newly generated tokens, not the echoed prompt
    response = tokenizer.decode(model_output[0, model_input['input_ids'].shape[-1]:], skip_special_tokens=True)

    # Print the response
    print(response)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (Falcon is not a GPT-2 checkpoint, so we use the Auto classes)
model = AutoModelForCausalLM.from_pretrained('tiiuae/falcon-7b-instruct')
tokenizer = AutoTokenizer.from_pretrained('tiiuae/falcon-7b-instruct')

# Generate text
def generate_text(input_text):
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    output = model.generate(input_ids, max_length=1024, do_sample=True)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

input_text = "Hello, how are you?"
output_text = generate_text(input_text)
print(output_text)

Alright, now let's start grouping by monthly counts using my favorite pandas function, `pd.Grouper`. For more, you can check my article [here](https://benjaminlabaschin.com/pandas-functions-advanced-groupbys-with-grouper-assign-and-query/).

In [30]:
from pandas import Grouper, to_datetime

# first let's convert to a datetime object
metadata['v1_dates'] = to_datetime(metadata['v1_dates'])

# Group by calendar month ('M' is the month-end alias; pandas 2.2+ spells it 'ME')
metadata.groupby(Grouper(key='v1_dates', freq='M'))

0         2007-04-02 19:18:42
1         2007-03-31 02:26:18
2         2007-04-01 20:46:54
3         2007-03-31 03:16:14
4         2007-04-02 18:09:58
                  ...        
2263487   1996-08-26 15:08:35
2263488   1996-08-31 17:34:38
2263489   1996-09-03 14:08:26
2263490   1996-09-18 07:57:29
2263491   1996-09-25 14:17:09
Name: v1_dates, Length: 2263492, dtype: datetime64[ns]
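
To make the monthly grouping concrete, here is a minimal sketch on a made-up three-row frame. The toy data is hypothetical; only the column names mirror the ones above. It passes a `MonthEnd` offset object instead of a string alias, since the accepted spelling of the monthly alias differs between pandas versions.

```python
from pandas import DataFrame, Grouper, to_datetime
from pandas.tseries.offsets import MonthEnd

# Hypothetical toy data: three submissions, two of them flagged as LLM-related.
toy = DataFrame({
    "v1_dates": to_datetime(["2023-01-05", "2023-01-20", "2023-02-11"]),
    "is_LLM": [1, 0, 1],
})

# Sum the flag within each calendar month.
monthly = toy.groupby(Grouper(key="v1_dates", freq=MonthEnd()))["is_LLM"].sum()
print(monthly)
```

Applied to `metadata`, the same aggregation would give the number of flagged articles per month, which is the timeline we are after.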