# 📘 Historical Evolution of AI Research - A Decade-Wise Comparative Analysis

---

This notebook is part of the **Gemini 1.5 Long Context competition**, demonstrating how the model's long context window enables the analysis of a large set of scientific literature spanning decades. The goal is to uncover trends, paradigm shifts, and developments within the field of Artificial Intelligence (AI) by analyzing thousands of research papers, books, and conference proceedings from the 1970s to today.

---

## 📝 Introduction

The **Gemini 1.5 model**, with its breakthrough large context window of **2 million tokens**, enables the processing of vast amounts of data in a single context. In this project, we leverage this capability to analyze the evolution of scientific literature in AI over the past 50 years. This analysis covers how research trends, terminologies, and paradigms have shifted from one decade to the next, culminating in the current state of the field.

### Why this is important:

- **Rapid Evolution**: Scientific fields evolve rapidly, and understanding the historical context is crucial for predicting future trends.
- **Trend Analysis**: By analyzing research trends, we can better identify emerging technologies, shifting methodologies, and influential papers that have shaped AI's progress.
- **Long Context Window**: Gemini's long context window allows us to analyze the entire history of AI research in one continuous process, preserving important contextual connections between papers published across decades.
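
To make the long-context point concrete, the sketch below estimates how many abstracts might fit into a 2-million-token window. It assumes roughly 4 characters per token, which is a rule of thumb and not Gemini's actual tokenizer. The names `estimate_tokens`, `pack_abstracts`, and the `reserve` value are hypothetical, chosen only for illustration. For real counts, the API's own token counting should be preferred.

```python
CONTEXT_WINDOW_TOKENS = 2_000_000  # window size stated in the introduction
CHARS_PER_TOKEN = 4                # ASSUMPTION: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (heuristic only)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def pack_abstracts(abstracts, budget=CONTEXT_WINDOW_TOKENS, reserve=50_000):
    """Greedily keep abstracts, in order, until the estimated budget is used.

    `reserve` leaves room for the instructions and the model's answer
    (an assumed value, not a documented requirement).
    """
    kept, used = [], 0
    limit = budget - reserve
    for text in abstracts:
        cost = estimate_tokens(text)
        if used + cost > limit:
            break
        kept.append(text)
        used += cost
    return kept, used
```

Calling `pack_abstracts(list_of_abstracts)` returns the abstracts that fit and the estimated tokens used, which gives a quick check before building a prompt.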

## arXiv Dataset Overview

The arXiv dataset provides a large collection of AI research papers across categories such as machine learning, robotics, and natural language processing. It covers publications spanning multiple decades, and each record carries metadata such as the title, authors, abstract, publication date, and arXiv category tags.

### Why this dataset is important:

- **Historical Depth**: arXiv has accepted submissions since 1991, so the dataset supports a longitudinal study of AI's evolution from the 1990s to the present.
- **Rich Metadata**: Abstracts, category tags, and publication dates make it possible to trace trends and paradigm shifts in the field.
- **Aligned with Gemini's Capabilities**: Each record is short, structured text, so many records can be placed in a single long-context prompt and analyzed together, preserving connections across decades.

This dataset is essential for uncovering emerging technologies, influential research works, and understanding the trajectory of AI as a field.


In [1]:
try:
    import arxiv
    print("arxiv is already installed.")
except ImportError:
    # If arxiv is not installed, install it
    !pip install arxiv


In [2]:
try:
    import google.generativeai as genai
    print("Gemini API library already installed.")
except ImportError:
    !pip install google-generativeai

Gemini API library already installed.


In [3]:
# Import all necessary libraries here.
import arxiv
import pandas as pd
import time
import warnings
import google.generativeai as genai
from IPython.display import display
from kaggle_secrets import UserSecretsClient

In [4]:
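# Suppress all warnings, including the DeprecationWarning messages raised by the libraries used below.
# A blanket "ignore" filter already covers DeprecationWarning, so one call is enough.
warnings.filterwarnings("ignore")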

## 📊 Dataset Overview
The dataset used in this analysis contains a comprehensive collection of AI research papers spanning over 50 years. Sourced from arXiv.org and NeurIPS conference proceedings, it includes metadata such as titles, authors, abstracts, publication dates, and key terms. This structured information allows us to analyze the evolution of AI research and track paradigm shifts within the field.

### Why this dataset is valuable:
- **Comprehensive Coverage**: The dataset spans multiple decades, capturing the progression of AI research from its early stages to the present.
- **Insightful Metadata**: Abstracts, category tags, and publication dates provide context for understanding research trends and identifying influential papers.
- **Enabling Long-Context Analysis**: The records are short, structured text, so a large number of them fit in Gemini's context window at once, giving a holistic view of AI's development and preserving connections across decades of research.

This dataset is essential for identifying patterns, understanding the evolution of terminology, and uncovering the emerging technologies that shape the future of AI.

In [5]:
# Query for AI-related papers from the arXiv API
search_query = 'cat:cs.AI OR cat:stat.ML OR cat:cs.LG'

# Initialize an empty list to store papers
papers = []
total_results = 1000  # Total number of results to retrieve
batch_size = 100  # Number of results requested per API call

# The client pages through the results and waits between requests.
client = arxiv.Client(page_size=batch_size, delay_seconds=3, num_retries=3)

# Build one search for all results. Creating a new Search per batch without
# an offset would fetch the same first page every time.
search = arxiv.Search(
    query=search_query,
    max_results=total_results,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

try:
    # Iterate over the results; the client fetches further pages as needed
    for result in client.results(search):
        papers.append({
            'title': result.title,
            'authors': [author.name for author in result.authors],
            'abstract': result.summary,
            'published': result.published,
            'categories': result.categories,
            'pdf_url': result.pdf_url
        })
except arxiv.UnexpectedEmptyPageError as e:
    print(f"Empty page encountered: {e}. Keeping the papers fetched so far.")
except Exception as e:
    print(f"An error occurred: {e}")

# Display the total number of papers retrieved
print(f"Total papers retrieved: {len(papers)}")

Total papers retrieved: 1000


In [6]:
# After fetching the papers, save the metadata to a CSV file for easier use in Kaggle working directory (/kaggle/working)
df = pd.DataFrame(papers)
df.to_csv('/kaggle/working/arxiv_ai_papers.csv', index=False)
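
One side effect of saving to CSV is that list-valued columns such as `authors` and `categories` come back as plain strings (for example `"['cs.LG', 'cs.AI']"`) when the file is later read with `pd.read_csv`. A minimal sketch of restoring them, assuming the cells hold Python list literals as written by `to_csv`; `parse_list_cell` is a hypothetical helper name, not part of pandas or arxiv.

```python
import ast


def parse_list_cell(cell):
    """Turn a stringified list such as "['cs.LG', 'cs.AI']" back into a list.

    Returns an empty list for missing or unparseable cells.
    """
    if isinstance(cell, list):
        return cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


# Possible usage on the DataFrame loaded from the CSV (not executed here):
# df['categories'] = df['categories'].apply(parse_list_cell)
# df['authors'] = df['authors'].apply(parse_list_cell)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.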

In [7]:
# Organize by decades to fit our analysis.
df = pd.read_csv('/kaggle/working/arxiv_ai_papers.csv')

# Convert the published date to a datetime format
df['published'] = pd.to_datetime(df['published'])

# Extract the year and create a decade column
df['year'] = df['published'].dt.year
df['decade'] = (df['year'] // 10) * 10

# Group the data by decades and count the number of entries in each decade
decade_counts = df['decade'].value_counts().sort_index()

# Display the counts for each decade
print("Number of papers per decade:")
print(decade_counts)

# Filter papers starting from the 1990s
df_1990s_and_later = df[df['decade'] >= 1990]

display(df_1990s_and_later.head(10))

Number of papers per decade:
decade
2020    1000
Name: count, dtype: int64


| | title | authors | abstract | published | categories | pdf_url | year | decade |
|---|---|---|---|---|---|---|---|---|
| 0 | Prioritized Generative Replay | ['Renhao Wang', 'Kevin Frans', 'Pieter Abbeel'... | Sample-efficient online reinforcement learning... | 2024-10-23 17:59:52+00:00 | ['cs.LG'] | http://arxiv.org/pdf/2410.18082v1 | 2024 | 2020 |
| 1 | ALTA: Compiler-Based Analysis of Transformers | ['Peter Shaw', 'James Cohan', 'Jacob Eisenstei... | We propose a new programming language called A... | 2024-10-23 17:58:49+00:00 | ['cs.LG', 'cs.AI', 'cs.CL'] | http://arxiv.org/pdf/2410.18077v1 | 2024 | 2020 |
| 2 | Leveraging Skills from Unlabeled Prior Data fo... | ['Max Wilcoxson', 'Qiyang Li', 'Kevin Frans', ... | Unsupervised pretraining has been transformati... | 2024-10-23 17:58:45+00:00 | ['cs.LG', 'cs.AI', 'stat.ML'] | http://arxiv.org/pdf/2410.18076v1 | 2024 | 2020 |
| 3 | ProFL: Performative Robust Optimal Federated L... | ['Xue Zheng', 'Tian Xie', 'Xuwei Tan', 'Aylin ... | Performative prediction (PP) is a framework th... | 2024-10-23 17:57:14+00:00 | ['cs.LG', 'cs.IT', 'math.IT'] | http://arxiv.org/pdf/2410.18075v1 | 2024 | 2020 |
| 4 | UnCLe: Unsupervised Continual Learning of Dept... | ['Suchisrit Gangopadhyay', 'Xien Chen', 'Micha... | We propose UnCLe, a standardized benchmark for... | 2024-10-23 17:56:33+00:00 | ['cs.CV', 'cs.LG'] | http://arxiv.org/pdf/2410.18074v1 | 2024 | 2020 |
| 5 | TP-Eval: Tap Multimodal LLMs' Potential in Eva... | ['Yuxuan Xie', 'Tianhua Li', 'Wenqi Shao', 'Ka... | Recently, multimodal large language models (ML... | 2024-10-23 17:54:43+00:00 | ['cs.CV', 'cs.AI', 'cs.CL'] | http://arxiv.org/pdf/2410.18071v1 | 2024 | 2020 |
| 6 | Training Free Guided Flow Matching with Optima... | ['Luran Wang', 'Chaoran Cheng', 'Yizhen Liao',... | Controlled generation with pre-trained Diffusi... | 2024-10-23 17:53:11+00:00 | ['cs.LG', 'cs.AI'] | http://arxiv.org/pdf/2410.18070v1 | 2024 | 2020 |
| 7 | Beyond position: how rotary embeddings shape r... | ['Valeria Ruscio', 'Fabrizio Silvestri'] | Rotary Positional Embeddings (RoPE) enhance po... | 2024-10-23 17:48:28+00:00 | ['cs.LG', 'cs.AI'] | http://arxiv.org/pdf/2410.18067v1 | 2024 | 2020 |
| 8 | The Double-Edged Sword of Behavioral Responses... | ['Raman Ebrahimi', 'Kristen Vaccaro', 'Parinaz... | When humans are subject to an algorithmic deci... | 2024-10-23 17:42:54+00:00 | ['cs.LG', 'cs.GT', 'cs.HC'] | http://arxiv.org/pdf/2410.18066v1 | 2024 | 2020 |
| 9 | SPIRE: Synergistic Planning, Imitation, and Re... | ['Zihan Zhou', 'Animesh Garg', 'Dieter Fox', '... | Robot learning has proven to be a general and ... | 2024-10-23 17:42:07+00:00 | ['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG'] | http://arxiv.org/pdf/2410.18065v1 | 2024 | 2020 |
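
The decade counts above show that every retrieved paper falls in the 2020s, because sorting by submission date returns only the most recent entries. One possible way to sample earlier decades is to add a `submittedDate` range filter to the arXiv query string. The sketch below only builds the query strings and makes no network calls. `decade_query` is a hypothetical helper, and the exact date-range behavior should be checked against the arXiv API user manual before relying on it.

```python
BASE_QUERY = "cat:cs.AI OR cat:stat.ML OR cat:cs.LG"  # same categories as the query above


def decade_query(decade: int, base: str = BASE_QUERY) -> str:
    """Build an arXiv query restricted to one decade.

    Dates use the YYYYMMDDHHMM form of the arXiv `submittedDate` filter.
    """
    start = f"{decade}01010000"      # 1 January, 00:00 of the first year
    end = f"{decade + 9}12312359"    # 31 December, 23:59 of the last year
    return f"({base}) AND submittedDate:[{start} TO {end}]"


# Print one query per decade of interest
for decade in (1990, 2000, 2010, 2020):
    print(decade, decade_query(decade))
```

Each string could then be passed as the `query` argument of `arxiv.Search`, with a smaller `max_results` per decade so that the sample is spread across time.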


## 🔑 Authenticate the Gemini 1.5 API

To use Gemini 1.5, the notebook first authenticates with the API through Kaggle's user secrets, which keeps the API key out of the notebook source. Before running the cell below, make sure you have API access and have stored your key as described in these steps:

* Sign in to Gemini Platform: Visit [Gemini AI](https://ai.google/) and log in with your account.
* Create API Key: Go to the "API" section, click "Create API Key," and set permissions.
* Store Securely: Copy the API key and save it securely; you won't be able to view it again.
* Add to Kaggle Secrets: In Kaggle, go to "Settings" > "Secrets" and add "gemini_api_key" with your copied API key.

In [8]:
user_secrets = UserSecretsClient()
gemini_api_key = user_secrets.get_secret("gemini_api_key")

# Configure the API client
genai.configure(api_key=gemini_api_key)
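
Outside Kaggle, the `kaggle_secrets` module is not available. A minimal sketch of a fallback, assuming the key is also exported as an environment variable; the variable name `GEMINI_API_KEY` and the helper `load_gemini_api_key` are hypothetical names chosen for this example.

```python
import os


def load_gemini_api_key(secret_name="gemini_api_key", env_var="GEMINI_API_KEY"):
    """Return the API key from Kaggle secrets, else from an environment variable."""
    try:
        # Only importable inside a Kaggle notebook session
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # ASSUMPTION: outside Kaggle (or if the secret lookup fails),
        # the key is provided through the environment instead.
        key = os.environ.get(env_var)
        if not key:
            raise RuntimeError(
                f"No API key found in Kaggle secrets or in ${env_var}."
            )
        return key
```

The returned value could be passed to `genai.configure(api_key=...)` in place of the direct secret lookup.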