# Scientific Information Retrieval System with arXiv

- Zaldumbide Danna

## 2. Downloading and loading the dataset

The original dataset is available on Kaggle:  
🔗 https://www.kaggle.com/datasets/Cornell-University/arxiv

**Key file:** `arxiv-metadata-oai-snapshot.json` (~3GB)  
To avoid memory and storage problems, we work with a subset of roughly 1% of the dataset (the first 20,000 records).

The file contains the metadata of scientific articles, with fields such as:
- `id`: unique identifier
- `title`: article title
- `abstract`: summary
- `authors`: list of authors
- `categories`: subject tags
- `update_date`: date of the last update


In [1]:
import kagglehub
import pandas as pd
import json
from tqdm import tqdm

# Download the latest version of the arXiv dataset
path = kagglehub.dataset_download("Cornell-University/arxiv")
print("Ruta de los archivos del dataset:", path)

# Load a subset of the file (JSON Lines format: one record per line)
file_path = path + "/arxiv-metadata-oai-snapshot.json"
subset_size = 20000  # About 1% of the full dataset (~2M records)

data = []
with open(file_path, 'r', encoding='utf-8') as f:
    for i, line in tqdm(enumerate(f), total=subset_size):
        if i >= subset_size:
            break
        data.append(json.loads(line))

# Build a DataFrame with the key fields
df = pd.DataFrame(data)
df = df[["id", "title", "abstract"]]
df.head()


Ruta de los archivos del dataset: /kaggle/input/arxiv


100%|██████████| 20000/20000 [00:00<00:00, 27766.04it/s]


| | id | title | abstract |
|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... |
| 2 | 0704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... |
| 3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... |
| 4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... |
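
The loading cell above keeps the first 20,000 lines of the file, so the subset consists of the earliest records in the snapshot rather than a sample spread across it. If a more representative subset is wanted, one option is reservoir sampling, which draws a uniform random sample in a single pass without holding the whole file in memory. The following is a minimal sketch and not part of the original notebook; `reservoir_sample_jsonl` and its `seed` parameter are hypothetical names, and the toy input stands in for the real file.

```python
import json
import random

def reservoir_sample_jsonl(lines, k, seed=0):
    """Return k records drawn uniformly at random from an iterable of JSON lines."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)        # fill the reservoir with the first k lines
        else:
            j = rng.randint(0, i)      # inclusive bounds: 0..i
            if j < k:
                sample[j] = line       # keep the new line with probability k / (i + 1)
    return [json.loads(line) for line in sample]

# Toy input standing in for the lines of the snapshot file (assumption)
toy_lines = [json.dumps({"id": str(n)}) for n in range(1000)]
records = reservoir_sample_jsonl(toy_lines, k=5)
print(len(records))
```

With the real dataset, `lines` would be the open file handle, at the cost of reading the whole file once.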


## 3. Document preprocessing

Preprocessing converts the raw text into a cleaner, more structured form for the retrieval models.

Steps performed:
- Concatenating `title` and `abstract` into a single indexable text
- Lowercasing
- Removing punctuation
- Tokenization
- Removing stopwords and non-alphabetic tokens


In [2]:
import re
import string
import nltk
from nltk.corpus import stopwords

# Download the required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')


stop_words = set(stopwords.words('english'))
punct_table = str.maketrans('', '', string.punctuation)

def clean_text(text):
    text = text.lower()  # lowercase
    text = text.translate(punct_table)  # strip punctuation (hyphenated words are merged, e.g. "earth-moon" -> "earthmoon")
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words and t.isalpha()]  # drop stopwords and non-alphabetic tokens
    return tokens

# Create a new column with the processed text
def combinar_y_procesar(row):
    combinado = f"{row['title']} {row['abstract']}"
    return clean_text(combinado)

tqdm.pandas()
df['tokens'] = df.progress_apply(combinar_y_procesar, axis=1)
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
100%|██████████| 20000/20000 [00:13<00:00, 1535.54it/s]


| | id | title | abstract | tokens |
|---|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | [calculation, prompt, diphoton, production, cr... |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | [sparsitycertifying, graph, decompositions, de... |
| 2 | 0704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | [evolution, earthmoon, system, based, dark, ma... |
| 3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | [determinant, stirling, cycle, numbers, counts... |
| 4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | [dyadic, lambdaalpha, lambdaalpha, paper, show... |
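
The tokens in the output above include merged forms such as `sparsitycertifying` and `earthmoon`, because punctuation is deleted before tokenization. A possible alternative is to replace punctuation with spaces so that hyphenated compounds split into their parts. The sketch below is only illustrative: `clean_text_spaced` is a hypothetical name, it tokenizes with `str.split` instead of NLTK, and `STOP_WORDS` is a tiny stand-in for the NLTK English stopword list.

```python
import string

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "of", "a", "is", "on", "in", "and"}

# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text_spaced(text):
    text = text.lower().translate(PUNCT_TO_SPACE)
    tokens = text.split()
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

print(clean_text_spaced("The evolution of the Earth-Moon system"))
# ['evolution', 'earth', 'moon', 'system']
```

Whether splitting is preferable depends on the queries: a user typing "earth moon" would match the split tokens but not the merged one.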


## 4. Document indexing

This section implements the classical retrieval models:

- **TF-IDF**: represents documents and queries as weighted term vectors.
- **BM25**: probabilistic model that weights terms by their frequency and normalizes for document length.

Both models require a corpus of tokenized documents.
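
To make the BM25 description concrete, here is a minimal from-scratch sketch of the scoring over a toy tokenized corpus. It uses the common Okapi form with a smoothed, always-positive IDF; the exact IDF variant and default parameters inside `rank_bm25` may differ, so this is an illustration of the idea rather than a reimplementation of that library. `bm25_scores`, the toy corpus, and the values `k1=1.5` and `b=0.75` are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))          # number of documents containing each term
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

toy_corpus = [
    ["quantum", "chromodynamics", "lattice"],
    ["graph", "decompositions", "algorithm"],
    ["quantum", "field", "theory"],
]
toy_scores = bm25_scores(["quantum", "chromodynamics"], toy_corpus)
print(toy_scores)
```

The first document matches both query terms and scores highest, the third matches only `quantum`, and the second scores zero.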


### TF-IDF

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join tokens back into text for sklearn's vectorizer
df['text'] = df['tokens'].apply(lambda tokens: ' '.join(tokens))

# TF-IDF vectorization
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(df['text'])

# Dictionary mapping document index to ID
doc_index_to_id = df['id'].to_dict()


### BM25

In [4]:
!pip install rank_bm25



In [5]:
from rank_bm25 import BM25Okapi

# Requires the earlier cells that create df and df['tokens']

# Build the BM25 index
bm25 = BM25Okapi(df['tokens'].tolist())

# BM25 search function
def search_bm25(query, top_k=10):
    query_tokens = clean_text(query)  # Same preprocessing as the corpus
    scores = bm25.get_scores(query_tokens)

    # Indices of the highest-scoring documents
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    resultados = []
    for i in top_indices:
        doc = df.iloc[i]
        resultados.append({
            "id": doc["id"],
            "title": doc["title"],
            "abstract": doc["abstract"][:300] + "..."  # Snippet
        })
    return resultados

# Example query
resultados = search_bm25("quantum chromodynamics", top_k=3)
for r in resultados:
    print(f"[{r['id']}] {r['title']}\n→ {r['abstract']}\n")


[0705.3170] Two interacting GL-equations in High-T$_c$ superconductivity and quantum
  chromodynamics
→   The possible connection between High-T$_c$ superconductivity and quantum
chromodynamics is considered that is based on two interacting Ginzburg-Landau
equations. For High-T$_c$ superconductivity these two equations describe Cooper
electrons interacting with different kind of quasi particles (phono...

[0705.4356] Monte Carlo Methods in Quantum Field Theory
→   In these lecture notes some applications of Monte Carlo integration methods
in Quantum Field Theory - in particular in Quantum Chromodynamics - are
introduced and discussed.
...

[0707.0502] Deflated GMRES for Systems with Multiple Shifts and Multiple Right-Hand
  Sides
→   We consider solution of multiply shifted systems of nonsymmetric linear
equations, possibly also with multiple right-hand sides. First, for a single
right-hand side, the matrix is shifted by several multiples of the identity.
Such problems arise in a numbe

## Vector index with embeddings and FAISS

In this section we implement a retrieval system based on semantic embeddings and vector search. Unlike traditional models such as TF-IDF or BM25, which match on exact terms, embeddings capture the meaning of the text in a high-dimensional vector space.

### Steps:
1. We generate embeddings of the preprocessed text (title + abstract) using a pretrained `SentenceTransformers` model.
2. We build an efficient search index using FAISS.
3. We implement a retrieval function `search_faiss(query, top_k=10)` that returns the documents most similar to a query by Euclidean distance in the embedding space.
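
A flat L2 index amounts to an exhaustive comparison of the query against every stored vector. The pure-Python sketch below illustrates that idea on three toy vectors, and also checks a useful identity: when vectors are normalized to unit length, the squared L2 distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine similarity give the same order. Whether the embeddings in this notebook are unit-normalized is not verified here; the helper names and vectors are assumptions for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_unit(a, b):
    # Dot product; equals cosine similarity only for unit-length vectors
    return sum(x * y for x, y in zip(a, b))

def flat_l2_search(query, vectors, top_k):
    """Brute-force nearest neighbours by squared L2 distance (ordering matches L2)."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_squared(query, vectors[i]))
    return ranked[:top_k]

vectors = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.1, 0.0])

by_l2 = flat_l2_search(query, vectors, top_k=3)
by_cos = sorted(range(len(vectors)), key=lambda i: -cosine_unit(query, vectors[i]))
print(by_l2, by_cos)
```

If the embeddings are not normalized, the two rankings can differ, and an inner-product index over normalized vectors would be the usual way to search by cosine similarity.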


In [6]:
!pip install faiss-cpu sentence-transformers




In [21]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# A lightweight but effective model for semantic tasks
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings from the 'text' field (already preprocessed)
corpus = df['text'].tolist()
embeddings = model.encode(corpus, show_progress_bar=True)

# Convert to float32, as FAISS requires
embeddings = np.array(embeddings).astype('float32')

# Build a flat FAISS index (exhaustive search by L2 distance)
index_faiss = faiss.IndexFlatL2(embeddings.shape[1])
index_faiss.add(embeddings)

# Mapping from index position to document ID
index_to_id = df['id'].tolist()

df['embeddings'] = list(embeddings)
df.head()


Batches:   0%|          | 0/625 [00:00<?, ?it/s]

| | id | title | abstract | tokens | text | embeddings |
|---|---|---|---|---|---|---|
| 0 | 0704.0001 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | [calculation, prompt, diphoton, production, cr... | calculation prompt diphoton production cross s... | [-0.10475907, -0.0015846514, 0.014966698, 0.03... |
| 1 | 0704.0002 | Sparsity-certifying Graph Decompositions | We describe a new algorithm, the $(k,\ell)$-... | [sparsitycertifying, graph, decompositions, de... | sparsitycertifying graph decompositions descri... | [-0.005846401, 0.03588525, 0.032470725, -0.083... |
| 2 | 0704.0003 | The evolution of the Earth-Moon system based o... | The evolution of Earth-Moon system is descri... | [evolution, earthmoon, system, based, dark, ma... | evolution earthmoon system based dark matter f... | [-0.02816419, -0.053896394, 0.061715446, 0.014... |
| 3 | 0704.0004 | A determinant of Stirling cycle numbers counts... | We show that a determinant of Stirling cycle... | [determinant, stirling, cycle, numbers, counts... | determinant stirling cycle numbers counts unla... | [-0.034201052, -0.0055398406, -0.042189218, -0... |
| 4 | 0704.0005 | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | In this paper we show how to compute the $\L... | [dyadic, lambdaalpha, lambdaalpha, paper, show... | dyadic lambdaalpha lambdaalpha paper show comp... | [0.009498669, -0.00798395, -0.04554182, -0.029... |


## Information retrieval: TF-IDF, BM25 and FAISS

Next, we implement three retrieval functions to compare different approaches:

- `search_tfidf`: based on term frequency with TF-IDF.
- `search_bm25`: probabilistic ranking model.
- `search_faiss`: semantic search with embeddings and FAISS.

Each function returns the documents most relevant to a text query.


In [22]:
def search_faiss(query, top_k=10):
    query_embedding = model.encode([query]).astype('float32')
    distances, indices = index_faiss.search(query_embedding, top_k)

    resultados = []
    for idx in indices[0]:
        doc = df.iloc[idx]
        resultados.append({
            "id": doc["id"],
            "title": doc["title"],
            "abstract": doc["abstract"][:300] + "..."
        })
    return resultados

# Example query
resultados = search_faiss("machine learning for particle physics", top_k=3)
for r in resultados:
    print(f"[{r['id']}] {r['title']}\n→ {r['abstract']}\n")


[0707.0930] Bayesian Learning of Neural Networks for Signal/Background
  Discrimination in Particle Physics
→   Neural networks are used extensively in classification problems in particle
physics research. Since the training of neural networks can be viewed as a
problem of inference, Bayesian learning of neural networks can provide more
optimal and robust results than conventional learning methods. We have
...

[0708.1161] A threshold-improved narrow-width approximation for BSM physics
→   A modified narrow-width approximation that allows for O(Gamma/M)-accurate
predictions for resonant particle decay with similar intermediate masses is
proposed and applied to MSSM processes to demonstrate its importance for
searches for particle physics beyond the Standard Model.
...

[0704.0760] Search for Heavy, Long-Lived Particles that Decay to Photons at CDF II
→   We present the first search for heavy, long-lived particles that decay to
photons at a hadron collider. We use a sample of photon+jet

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

def search_tfidf(query, top_k=10):
    # Preprocess the query like the corpus so both share the same vocabulary
    query_vec = vectorizer_tfidf.transform([' '.join(clean_text(query))])
    similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = similarities.argsort()[::-1][:top_k]

    resultados = []
    for i in top_indices:
        doc = df.iloc[i]
        resultados.append({
            "id": doc["id"],
            "title": doc["title"],
            "abstract": doc["abstract"][:300] + "..."
        })
    return resultados


In [24]:
def search_bm25(query, top_k=10):
    query_tokens = clean_text(query)
    scores = bm25.get_scores(query_tokens)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

    resultados = []
    for i in top_indices:
        doc = df.iloc[i]
        resultados.append({
            "id": doc["id"],
            "title": doc["title"],
            "abstract": doc["abstract"][:300] + "..."
        })
    return resultados


In [25]:
query = "machine learning for particle physics"

print("=== TF-IDF ===")
for r in search_tfidf(query, 3):
    print(f"[{r['id']}] {r['title']}\n→ {r['abstract']}\n")

print("=== BM25 ===")
for r in search_bm25(query, 3):
    print(f"[{r['id']}] {r['title']}\n→ {r['abstract']}\n")

print("=== FAISS ===")
for r in search_faiss(query, 3):
    print(f"[{r['id']}] {r['title']}\n→ {r['abstract']}\n")


=== TF-IDF ===
[0704.3453] An Adaptive Strategy for the Classification of G-Protein Coupled
  Receptors
→   One of the major problems in computational biology is the inability of
existing classification models to incorporate expanding and new domain
knowledge. This problem of static classification models is addressed in this
paper by the introduction of incremental learning for problems in
bioinformatic...

[0705.2318] Statistical Mechanics of Nonlinear On-line Learning for Ensemble
  Teachers
→   We analyze the generalization performance of a student in a model composed of
nonlinear perceptrons: a true teacher, ensemble teachers, and the student. We
calculate the generalization error of the student analytically or numerically
using statistical mechanics in the framework of on-line learning...

[0704.3905] Ensemble Learning for Free with Evolutionary Algorithms ?
→   Evolutionary Learning proceeds by evolving a population of classifiers, from
which it generally returns (with some notab

## Integrating the RAG module (Retrieval-Augmented Generation)

In this section we integrate a language model that generates answers grounded in the most relevant documents retrieved by the FAISS vector index.

### RAG module flow:
1. We retrieve the 3 most relevant documents using `search_faiss`.
2. We build a prompt containing the titles and abstract snippets (first 300 characters) of those documents as context.
3. We pass that context to a generative model (LLM), which produces an answer to the query.
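
Step 2 of the flow can be sketched in isolation, without any retriever or API call, which makes the prompt format easy to inspect and test. The function below is an illustration only: `build_rag_prompt`, its `max_chars` parameter, the English prompt wording, and the placeholder documents are all assumptions, not the notebook's actual code.

```python
def build_rag_prompt(query, docs, max_chars=300):
    """Format retrieved documents and the question into a single prompt string."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        snippet = doc["abstract"][:max_chars]   # cap each abstract to bound prompt size
        blocks.append(f"Document {i}:\nTitle: {doc['title']}\nAbstract: {snippet}")
    context = "\n\n".join(blocks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the documents above."
    )

# Stand-in for retriever output; titles and abstracts are placeholders
fake_docs = [
    {"title": "Paper A", "abstract": "First placeholder abstract."},
    {"title": "Paper B", "abstract": "Second placeholder abstract."},
]
prompt = build_rag_prompt("example question", fake_docs)
print(prompt)
```

Keeping prompt construction separate from the model call also makes it simple to swap the retriever (for example BM25 instead of FAISS) without touching the generation step.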


In [12]:
pip install python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.1.1-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.1-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.1


In [13]:
pip install openai




In [17]:
import os
import openai
# Read the key from the environment; never hardcode secrets in a notebook
openai.api_key = os.getenv("OPENAI_API_KEY")


In [20]:
import openai

def rag_answer(query, top_k=3):
    resultados = search_faiss(query, top_k=top_k)

    # Build the context from the retrieved results
    contexto = ""
    for i, doc in enumerate(resultados, start=1):
        contexto += f"Documento {i}:\nTítulo: {doc['title']}\nResumen: {doc['abstract']}\n\n"

    # Build the input prompt
    prompt = (
        f"Contexto:\n{contexto}\n"
        f"Pregunta: {query}\n"
        f"Con base en los documentos anteriores, proporciona una respuesta clara, útil y justificada."
    )

    # Send to OpenAI (gpt-3.5-turbo or gpt-4)
    client = openai.OpenAI(api_key=openai.api_key) # Create an OpenAI client
    response = client.chat.completions.create( # Use the new chat completions method
        model="gpt-3.5-turbo",  # Can be changed to "gpt-4"
        messages=[
            {"role": "system", "content": "Eres un experto en física de partículas que responde con base en documentos científicos recuperados."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )

    return {
        "contexto": contexto,
        "respuesta": response.choices[0].message.content # Access the response content
    }

# Example query
resultado_rag = rag_answer("higgs boson decay", top_k=3)
print("=== CONTEXTO ===\n", resultado_rag["contexto"])
print("=== RESPUESTA ===\n", resultado_rag["respuesta"])

=== CONTEXTO ===
 Documento 1:
Título: Invisibly decaying Higgs boson in the Littlest Higgs model with T-parity
Resumen:   We show that there are regions in the parameter space of the Littlest Higgs
model with T-parity, allowed by electroweak precision data, where the Higgs
boson can decay invisibly into a pair of heavy photons A_H with a substantial
branching ratio. For a symmetry breaking scale f in the range 450-60...

Documento 2:
Título: Effect of Charged Scalar Loops on Photonic Decays of a Fermiophobic
  Higgs
Resumen:   Higgs bosons with very suppressed couplings to fermions ("Fermiophobic Higgs
bosons", h_f) can decay to two photons (\gamma\gamma) with a branching ratio
significantly larger than that expected for the Standard Model Higgs boson for
m_{h_f}<150 GeV. Such a particle would give a clear signal at the...

Documento 3:
Título: Search for invisibly decaying Higgs bosons in e+e- -> Zoho production at
  sqrt(s) = 183 - 209 GeV
Resumen:   A search is performed for Higgs 

## Comparative evaluation of the retrieval models

This section compares the results of the three implemented approaches (TF-IDF, BM25 and FAISS). We analyze:

- Documents shared in the top 10.
- Differences in ranking order.
- Overlap counts per query.
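
Two simple measures cover the first two points: Jaccard similarity for how many documents two result lists share, and paired rank positions for how differently they order the shared documents. The sketch below is an illustration, with `jaccard` and `rank_pairs` as hypothetical helper names. The IDs are the first three results that this notebook's output reports for TF-IDF and FAISS on the query "diphoton production cross sections".

```python
def jaccard(a, b):
    """Share of distinct documents that appear in both lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_pairs(list_a, list_b):
    """For documents in both lists, map ID -> (rank in list_a, rank in list_b), 1-based."""
    pos_b = {doc: rank for rank, doc in enumerate(list_b, start=1)}
    return {doc: (rank, pos_b[doc])
            for rank, doc in enumerate(list_a, start=1) if doc in pos_b}

tfidf_top3 = ['0705.3804', '0704.0001', '0707.2294']
faiss_top3 = ['0705.3804', '0708.1277', '0704.0001']

print(jaccard(tfidf_top3, faiss_top3))     # 2 shared out of 4 distinct -> 0.5
print(rank_pairs(tfidf_top3, faiss_top3))  # {'0705.3804': (1, 1), '0704.0001': (2, 3)}
```

Here both models agree on the top document, while `0704.0001` drops from second place under TF-IDF to third under FAISS.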


In [26]:
def comparar_modelos(query, top_k=10):
    tfidf_docs = [r['id'] for r in search_tfidf(query, top_k)]
    bm25_docs = [r['id'] for r in search_bm25(query, top_k)]
    faiss_docs = [r['id'] for r in search_faiss(query, top_k)]

    # Documents in common
    comunes_tfidf_bm25 = set(tfidf_docs) & set(bm25_docs)
    comunes_tfidf_faiss = set(tfidf_docs) & set(faiss_docs)
    comunes_bm25_faiss = set(bm25_docs) & set(faiss_docs)
    comunes_todos = set(tfidf_docs) & set(bm25_docs) & set(faiss_docs)

    print(f"\n Consulta: {query}")
    # Only the first three IDs of each top-k list are printed
    print(f"Top-{top_k} TF-IDF: {tfidf_docs[:3]}")
    print(f"Top-{top_k} BM25:   {bm25_docs[:3]}")
    print(f"Top-{top_k} FAISS:  {faiss_docs[:3]}")
    print("\n Coincidencias:")
    print(f"- TF-IDF ∩ BM25:     {len(comunes_tfidf_bm25)}")
    print(f"- TF-IDF ∩ FAISS:    {len(comunes_tfidf_faiss)}")
    print(f"- BM25 ∩ FAISS:      {len(comunes_bm25_faiss)}")
    print(f"- Todos en común:    {len(comunes_todos)}")

    return {
        "TF-IDF": tfidf_docs,
        "BM25": bm25_docs,
        "FAISS": faiss_docs,
        "comunes": {
            "tfidf_bm25": comunes_tfidf_bm25,
            "tfidf_faiss": comunes_tfidf_faiss,
            "bm25_faiss": comunes_bm25_faiss,
            "todos": comunes_todos
        }
    }


In [27]:
consultas = [
    "diphoton production cross sections",
    "quantum chromodynamics",
    "higgs boson decay",
    "machine learning for particle physics",
    "top quark production"
]

for consulta in consultas:
    comparar_modelos(consulta, top_k=10)



 Consulta: diphoton production cross sections
Top-10 TF-IDF: ['0705.3804', '0704.0001', '0707.2294']
Top-10 BM25:   ['0705.3804', '0704.0001', '0705.4313']
Top-10 FAISS:  ['0705.3804', '0708.1277', '0704.0001']

 Coincidencias:
- TF-IDF ∩ BM25:     7
- TF-IDF ∩ FAISS:    4
- BM25 ∩ FAISS:      3
- Todos en común:    3

 Consulta: quantum chromodynamics
Top-10 TF-IDF: ['0705.4356', '0705.3170', '0708.0047']
Top-10 BM25:   ['0705.3170', '0705.4356', '0707.0502']
Top-10 FAISS:  ['0705.4356', '0707.1065', '0708.0012']

 Coincidencias:
- TF-IDF ∩ BM25:     8
- TF-IDF ∩ FAISS:    3
- BM25 ∩ FAISS:      3
- Todos en común:    3

 Consulta: higgs boson decay
Top-10 TF-IDF: ['0704.2000', '0705.2709', '0707.1591']
Top-10 BM25:   ['0707.1591', '0705.1259', '0705.2709']
Top-10 FAISS:  ['0707.1591', '0708.1939', '0707.0373']

 Coincidencias:
- TF-IDF ∩ BM25:     7
- TF-IDF ∩ FAISS:    3
- BM25 ∩ FAISS:      5
- Todos en común:    3

 Consulta: machine learning for particle physics
Top-10 TF-IDF: [

## Comparison of results across retrieval models

Below are the results of comparing the three retrieval approaches: TF-IDF, BM25 and FAISS. For five specific queries we evaluated:

- How many documents are shared between models?
- Which documents are unique to each model?
- How much agreement is there in the top 10?

This shows how each technique prioritizes different aspects of the documents:
- TF-IDF and BM25 work on term frequencies.
- FAISS uses semantic embeddings (more abstract).
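
The question of which documents are unique to each model can be answered with a small helper over the per-model ID lists. This is a sketch under assumptions: `unique_per_model` is a hypothetical function, and the input uses only the first three IDs that the notebook output reports for the query "diphoton production cross sections", not the full top 10.

```python
def unique_per_model(results):
    """Map each model name to the IDs that no other model returned."""
    unique = {}
    for name, docs in results.items():
        others = set()
        for other_name, other_docs in results.items():
            if other_name != name:
                others.update(other_docs)
        unique[name] = [doc for doc in docs if doc not in others]
    return unique

top3 = {
    "TF-IDF": ['0705.3804', '0704.0001', '0707.2294'],
    "BM25":   ['0705.3804', '0704.0001', '0705.4313'],
    "FAISS":  ['0705.3804', '0708.1277', '0704.0001'],
}
uniques = unique_per_model(top3)
print(uniques)
# {'TF-IDF': ['0707.2294'], 'BM25': ['0705.4313'], 'FAISS': ['0708.1277']}
```

To apply it to the dictionary returned by `comparar_modelos`, pass only the `"TF-IDF"`, `"BM25"` and `"FAISS"` entries, since the `"comunes"` entry holds sets of shared IDs rather than a ranked list.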

The table summarizes the number of documents **in common** between the models for each query:


In [28]:
comparaciones = []

for consulta in consultas:
    resultado = comparar_modelos(consulta, top_k=10)
    comparaciones.append({
        "consulta": consulta,
        "tfidf_bm25": len(resultado["comunes"]["tfidf_bm25"]),
        "tfidf_faiss": len(resultado["comunes"]["tfidf_faiss"]),
        "bm25_faiss": len(resultado["comunes"]["bm25_faiss"]),
        "todos": len(resultado["comunes"]["todos"])
    })




In [29]:
import pandas as pd

df_comparacion = pd.DataFrame(comparaciones)
df_comparacion


| | consulta | tfidf_bm25 | tfidf_faiss | bm25_faiss | todos |
|---|---|---|---|---|---|
| 0 | diphoton production cross sections | 7 | 4 | 3 | 3 |
| 1 | quantum chromodynamics | 8 | 3 | 3 | 3 |
| 2 | higgs boson decay | 7 | 3 | 5 | 3 |
| 3 | machine learning for particle physics | 6 | 0 | 1 | 0 |
| 4 | top quark production | 4 | 3 | 5 | 2 |
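
The per-query counts in the table can be condensed into a mean overlap per model pair. The sketch below copies the counts from the table by hand (in the notebook they could be read from `df_comparacion` instead) and averages them over the five queries; the `overlaps` and `means` names are assumptions.

```python
from statistics import mean

# Overlap counts per query, copied from the table above (top_k = 10)
overlaps = {
    "tfidf_bm25":  [7, 8, 7, 6, 4],
    "tfidf_faiss": [4, 3, 3, 0, 3],
    "bm25_faiss":  [3, 3, 5, 1, 5],
    "todos":       [3, 3, 3, 0, 2],
}
top_k = 10

means = {pair: mean(counts) for pair, counts in overlaps.items()}
for pair, value in means.items():
    print(f"{pair}: mean overlap {value:.1f} of {top_k}")
```

On these five queries the two lexical models share on average 6.4 of their top 10 documents, while each shares noticeably fewer with FAISS (2.6 for TF-IDF and 3.4 for BM25). With only five queries, these averages are indicative rather than conclusive.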
