# Zero-shot classification

<b>Goal</b>: Learn zero-shot classification of articles by topic and test how titles affect the classification results. We keep working with the same sample, because we previously saw 2 clusters in it. Since `subject` is the same for every record there, it will be interesting to add another batch of data on a different topic later.

<b>Definition:</b> Zero-shot classification is a method of automatically assigning text (or other data) to predefined categories without first training the model on those categories. In other words, the model has not seen examples of these classes during training, but it can still make a prediction based on its general understanding of language.

## Key points:

- No labels are needed for training, unlike traditional classifiers (`Logistic Regression`, `BERT fine-tuned`)
- Uses large pretrained language models (such as `BART`, `RoBERTa`, `T5`) that were trained on a huge text corpus and can capture the meaning of sentences.
- How it works: the model takes a text and the candidate categories and scores how well the text matches each category. In the Hugging Face pipeline, each category is turned into a hypothesis such as "This example is {label}." and scored by a natural language inference (NLI) model.
- Pros:
    - No need to label a dataset
    - Fast to set up
- Cons:
    - Accuracy is usually lower than that of a model trained on the specific data. It can make mistakes when the categories are too specific or too similar to each other.
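The scoring step described above can be sketched without any model. This is only an illustration, assuming an NLI model has already produced one entailment logit per candidate label; the logits and label names below are hypothetical, and `rank_labels` is not part of any library.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(entailment_logits):
    """Turn per-label entailment logits into (label, score) pairs, best first."""
    labels = list(entailment_logits)
    scores = softmax([entailment_logits[label] for label in labels])
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for the hypotheses "This example is instrumentation."
# and "This example is cosmology." (made-up numbers, not model output).
example_logits = {"instrumentation": 2.0, "cosmology": 0.0}
ranking = rank_labels(example_logits)
print(ranking[0][0])
```

In single-label mode the scores sum to one across the candidate labels, so a high score only means "more plausible than the other candidates", not "correct".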

In [1]:
# import the required libraries
import pandas as pd
import numpy as np
from transformers import pipeline
from tqdm import tqdm

In [3]:
import sys
import os

# Absolute path to the project root
project_root = os.path.abspath("..")  # assumes notebooks/ sits directly inside the root
if project_root not in sys.path:
    sys.path.insert(0, project_root)

from utils.config_loader import load_config

# ENV = "local"           # switch the environment here
ENV = "prod"
CONFIG = load_config(ENV, project_root=project_root)

print(f"Running in {ENV.upper()} mode")
print(CONFIG)

Running in PROD mode
{'DATA_PATH': None, 'USE_GPU': True, 'EMBEDDING_MODEL': 'sentence-transformers/all-mpnet-base-v2', 'MONGO_URI': 'mongodb://localhost:27017', 'QDRANT_URL': 'http://localhost:6333'}


In [4]:
# check CUDA availability
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Current device:", torch.cuda.current_device())
    print("Device name:", torch.cuda.get_device_name(0))

PyTorch version: 2.9.0+cu128
CUDA available: True
Device count: 1
Current device: 0
Device name: NVIDIA GeForce RTX 3060


In [5]:
print(CONFIG['EMBEDDING_MODEL'])

sentence-transformers/all-mpnet-base-v2


In [6]:
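# create the classifier
# NOTE: CONFIG['EMBEDDING_MODEL'] is a sentence-embedding checkpoint, not an NLI model,
# so the classification head is newly initialized (see the warnings in the output)
classifier = pipeline("zero-shot-classification", model=CONFIG['EMBEDDING_MODEL'])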

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

Some weights of MPNetForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-mpnet-base-v2 and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cuda:0
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.
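The last warning means the pipeline could not find an entailment class in the model's `label2id` mapping. The sketch below approximates that lookup on plain dictionaries; it is an assumption about how the check behaves, not a copy of the `transformers` source, and both mappings are illustrative. A checkpoint fine-tuned for NLI (for example `facebook/bart-large-mnli`) would normally expose an entailment label, while a generic sequence-classification head does not.

```python
def find_entailment_id(label2id):
    """Return the id of the first label that starts with 'entail', else -1."""
    for label, idx in label2id.items():
        if label.lower().startswith("entail"):
            return idx
    return -1


# Illustrative mappings: an NLI-style config and a generic two-class head.
nli_style = {"contradiction": 0, "neutral": 1, "entailment": 2}
generic = {"LABEL_0": 0, "LABEL_1": 1}

print(find_entailment_id(nli_style), find_entailment_id(generic))
```

When the lookup falls back to -1 and the head weights are randomly initialized, the scores carry no information about the candidate labels, so the predictions further down should be read with that in mind.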


In [7]:
df = pd.read_csv('preprocessed_abstracts.csv')
df.head(5)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S..."
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub..."
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard..."
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '..."
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans..."


In [8]:
# use the original text plus the title
df['combined'] = df['title'] + '. ' + df['summary']

In [9]:
# check whether the model sees our 2 clusters; the labels are placeholders for now
def predict_labels(texts):
    """Return the top zero-shot label for each text in `texts`."""
    pred_labels = []
    candidate_labels = ["astro-ph.IM", "Other Physics"]
    # for a quick test, iterate over texts[:10] instead
    for text in tqdm(texts):
        result = classifier(text, candidate_labels)
        top_label = result['labels'][0]  # most probable category
        pred_labels.append(top_label)
    return pred_labels

In [15]:
# predict_labels(df.combined)
# predict_labels(df.summary)

In [10]:
pred_labels_comb = predict_labels(df.combined)   # predict for abstracts + titles

  1%|          | 9/1000 [00:00<00:41, 23.97it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
100%|██████████| 1000/1000 [00:30<00:00, 32.96it/s]
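The warning in the output suggests passing texts in batches rather than one at a time. A minimal sketch of that idea follows, assuming the classifier accepts a list of texts and returns one result dict per text; `batched`, `predict_in_batches` and `fake_classify` are hypothetical helpers, and the stand-in classifier exists only so the example runs without a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(classify, texts, candidate_labels, batch_size=16):
    """Call `classify` once per batch and collect the top label of each text."""
    top_labels = []
    for batch in batched(list(texts), batch_size):
        results = classify(batch, candidate_labels)
        top_labels.extend(result["labels"][0] for result in results)
    return top_labels


# Stand-in for the real pipeline: always ranks the labels in the given order.
def fake_classify(batch, candidate_labels):
    return [{"labels": list(candidate_labels)} for _ in batch]


demo = predict_in_batches(fake_classify, ["a", "b", "c"], ["x", "y"], batch_size=2)
print(demo)
```

With the real pipeline, `classify` would be the `classifier` object; whether batching actually speeds things up depends on text lengths and GPU memory, so it is worth timing on a small subset first.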


In [12]:
df['predicted_labels_combined'] = pred_labels_comb 

In [13]:
df.head(10)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined,combined,predicted_labels_combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...",Muon Track Reconstruction and Data Selection T...,Other Physics
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub...","An update on the SCUBA-2 project. SCUBA-2, whi...",Other Physics
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard...",Feasibility study of a Laue lens for hard X-ra...,Other Physics
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '...",Search for Extra-Terrestrial planets: The DARW...,Other Physics
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans...",Wavefront sensor based on varying transmission...,Other Physics
5,http://arxiv.org/abs/physics/0510226v1,2005-10-25T15:55:59Z,2005-10-25T15:55:59Z,An analysis of stellar interferometers as wave...,This paper presents the basic principle and th...,,10.1364/AO.44.004733,"['http://dx.doi.org/10.1364/AO.44.004733', 'ht...",12 figures,"Appl.Opt. 44:4733-4744,2005",http://arxiv.org/abs/physics/0510226v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'present', 'basic', 'principle', 'th...","['analysis', 'stellar', 'interferometer', 'wav...","['analysis', 'stellar', 'interferometer', 'wav...",An analysis of stellar interferometers as wave...,Other Physics
6,http://arxiv.org/abs/physics/0511231v1,2005-11-28T15:26:24Z,2005-11-28T15:26:24Z,Conceptual design of a phase shifting telescop...,This paper deals with the theoretical principl...,,10.1016/j.optcom.2005.11.061,['http://dx.doi.org/10.1016/j.optcom.2005.11.0...,17 pages and 5 figures,"Opt.Commun.261:34-42,2006",http://arxiv.org/abs/physics/0511231v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'deal', 'theoretical', 'principle', ...","['conceptual', 'design', 'phase', 'shift', 'te...","['conceptual', 'design', 'phase', 'shift', 'te...",Conceptual design of a phase shifting telescop...,Other Physics
7,http://arxiv.org/abs/astro-ph/0512053v1,2005-12-02T02:47:13Z,2005-12-02T02:47:13Z,Atmospheric Biomarkers and their Evolution ove...,The search for life on extrasolar planets is b...,"[{'name': 'L. Kaltenegger'}, {'name': 'K. Juck...",10.1017/S1743921306009422,['http://dx.doi.org/10.1017/S1743921306009422'...,for high resolution images see\n http://cfa-w...,"IAU Symp.200:1.259-1.264,20065",http://arxiv.org/abs/astro-ph/0512053v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['search', 'life', 'extrasolar', 'planet', 'ba...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...",Atmospheric Biomarkers and their Evolution ove...,Other Physics
8,http://arxiv.org/abs/astro-ph/0606733v1,2006-06-29T19:04:52Z,2006-06-29T19:04:52Z,Characteristics of proposed 3 and 4 telescope ...,The Darwin and TPF-I missions are Infrared fre...,"[{'name': 'L. Kaltenegger'}, {'name': 'M. Frid...",10.1017/S1743921306009410,['http://dx.doi.org/10.1017/S1743921306009410'...,"4 pages, 2 figures",Proceedings IAUC200: Direct Imaging of Exoplan...,http://arxiv.org/abs/astro-ph/0606733v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['Darwin', 'TPF', 'mission', 'Infrared', 'free...","['Characteristics', 'propose', 'telescope', 'c...","['Characteristics', 'propose', 'telescope', 'c...",Characteristics of proposed 3 and 4 telescope ...,Other Physics
9,http://arxiv.org/abs/astro-ph/0606762v1,2006-06-30T18:24:16Z,2006-06-30T18:24:16Z,Interferometric Space Missions for the Search ...,The requirements on space missions designed to...,"[{'name': 'L Kaltenegger'}, {'name': 'M. Fridl...",10.1007/s10509-006-9183-z,['http://dx.doi.org/10.1007/s10509-006-9183-z'...,"21 pages, 8 figures; TBP in Astrophysics and S...","Astrophys.Space Sci.306:147-158,2006",http://arxiv.org/abs/astro-ph/0606762v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['requirement', 'space', 'mission', 'design', ...","['Interferometric', 'Space', 'Missions', 'Sear...","['Interferometric', 'Space', 'Missions', 'Sear...",Interferometric Space Missions for the Search ...,Other Physics
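All ten rows shown above receive `Other Physics`, so it is worth checking whether the predictions have collapsed onto a single label. The helper below is a small sketch of that check; `label_share` is a hypothetical name and the demo list is made-up data standing in for `pred_labels_comb`.

```python
from collections import Counter


def label_share(labels):
    """Return each label's share of the predictions as a dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Made-up predictions; in the notebook this would be pred_labels_comb.
demo_predictions = ["Other Physics"] * 9 + ["astro-ph.IM"]
shares = label_share(demo_predictions)
print(shares)
```

A share close to 1.0 for one label would suggest the classifier is not separating the two clusters at all.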


In [19]:
# pred_labels_summary = predict_labels(df.summary)   # predict for abstracts only
# df['predicted_labels_summary'] = pred_labels_summary

In [20]:
# compare predictions with and without titles
# (this measures agreement between the two runs, not accuracy against true labels)
# agreement = (df['predicted_labels_combined'] == df['predicted_labels_summary']).mean()
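The comparison of predictions with and without titles boils down to the fraction of rows where two label lists match. A plain-Python sketch of that computation is given here; `agreement_rate` is a hypothetical helper and the label lists are made-up examples.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of positions where the two label sequences are equal."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("label sequences must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up predictions from two runs over the same four texts.
with_titles = ["IM", "IM", "Other", "Other"]
without_titles = ["IM", "Other", "Other", "Other"]
rate = agreement_rate(with_titles, without_titles)
print(rate)
```

The same function gives accuracy when one of the lists holds the true categories instead of a second set of predictions.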

## Building a diverse sample of sources

In [14]:
# !pip install pymongo

In [15]:
# try reading the data from the MongoDB database
from pymongo import MongoClient
db = MongoClient("mongodb://localhost:27017/", uuidRepresentation="standard").arxiv

In [16]:
collection = db.articles
collection.count_documents({})

171298

In [17]:
cats = [
    "astro-ph.IM",
    "astro-ph.CO",
    "astro-ph.EP",
    "astro-ph.GA",
    "astro-ph.HE",
    "astro-ph.SR",
]
cursor = collection.find({"primary_category": {"$in": cats}})
m_df = pd.DataFrame(list(cursor))
m_df.drop(columns=['_id'], inplace=True, errors='ignore')

In [18]:
collection.count_documents({"primary_category": {"$in": cats}})

151256

In [19]:
import pprint
pprint.pprint(collection.find_one())

{'_id': '1c528260-7f0d-5142-a1ec-6253f5f9325d',
 'author': [{'name': 'The AMANDA Collaboration'}, {'name': 'J. Ahrens'}],
 'category': ['astro-ph', 'astro-ph.IM'],
 'comment': '40 pages, 16 Postscript figures, uses elsart.sty',
 'doi': '10.1016/j.nima.2004.01.065',
 'id': 'http://arxiv.org/abs/astro-ph/0407044v1',
 'journal_ref': 'Nucl.Instrum.Meth.A524:169-194,2004',
 'link_alternate': 'http://arxiv.org/abs/astro-ph/0407044v1',
 'link_related': ['http://dx.doi.org/10.1016/j.nima.2004.01.065',
                  'http://arxiv.org/pdf/astro-ph/0407044v1'],
 'primary_category': 'astro-ph',
 'published': '2004-07-02T10:17:39Z',
 'summary': 'The Antarctic Muon And Neutrino Detector Array (AMANDA) is a '
            'high-energy\n'
            'neutrino telescope operating at the geographic South Pole. It is '
            'a lattice of\n'
            'photo-multiplier tubes buried deep in the polar ice between 1500m '
            'and 2000m.\n'
            'The primary goal of this detector 

In [20]:
m_df.count()

id                  151256
updated             151256
published           151256
title               151256
summary             151256
author              151256
doi                 127952
link_related        151256
comment             142604
link_alternate      151256
primary_category    151256
category            151256
journal_ref          43802
dtype: int64

In [21]:
# aggregate count per category
# named agg_pipeline so it does not shadow `pipeline` imported from transformers
agg_pipeline = [
    {"$group": {"_id": "$primary_category", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}  # optional: sort descending
]

counts = list(collection.aggregate(agg_pipeline))

for record in counts:
    print(record["_id"], record["count"])

astro-ph.SR 30876
astro-ph.GA 30226
astro-ph.HE 29819
astro-ph.CO 25215
astro-ph.EP 18783
astro-ph.IM 16337
gr-qc 6985
hep-ph 4523
hep-th 1963
nucl-th 1087
physics.ins-det 740
physics.space-ph 634
physics.plasm-ph 564
physics.flu-dyn 285
physics.hist-ph 268
hep-ex 240
physics.pop-ph 195
nucl-ex 158
physics.comp-ph 148
physics.atom-ph 146
physics.ao-ph 139
physics.geo-ph 130
physics.optics 130
physics.ed-ph 105
physics.chem-ph 101
physics.data-an 85
quant-ph 82
physics.gen-ph 77
astro-ph 76
cs.LG 74
nlin.CD 72
cond-mat.stat-mech 68
math.NA 66
math-ph 61
physics.soc-ph 55
cs.CV 52
cond-mat.mtrl-sci 47
stat.ME 45
cs.DC 43
eess.SP 37
cs.IT 34
math.DS 33
physics.class-ph 33
stat.AP 31
cs.DL 26
cs.RO 26
stat.CO 24
cond-mat.supr-con 23
stat.ML 22
physics.app-ph 17
math.AP 17
cond-mat.mes-hall 15
eess.IV 15
eess.SY 15
cond-mat.soft 13
cs.CE 12
cond-mat.quant-gas 12
q-bio.PE 11
physics.bio-ph 11
math.OC 9
math.ST 9
hep-lat 8
physics.atm-clus 8
cs.MS 7
cs.AI 7
math.CA 7
physics.acc-ph 6
physics.
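For readers less familiar with MongoDB aggregation, the `$group` plus `$sort` stages above compute the same thing as a counter over one field. The following sketch shows the in-memory equivalent, assuming the documents are already loaded as dictionaries; `count_by_field` and the sample documents are hypothetical.

```python
from collections import Counter


def count_by_field(docs, field):
    """Count documents per value of `field`, most frequent first."""
    counts = Counter(doc.get(field) for doc in docs)
    return counts.most_common()


# Made-up documents with only the field that matters here.
demo_docs = [
    {"primary_category": "astro-ph.SR"},
    {"primary_category": "gr-qc"},
    {"primary_category": "astro-ph.SR"},
]
per_category = count_by_field(demo_docs, "primary_category")
print(per_category)
```

Doing the grouping inside MongoDB avoids transferring all 171298 documents to the client, which is why the notebook uses the aggregation instead.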

In [22]:
# choose your category
category = "q-bio.GN"

# find all records in that category
cursor = collection.find({"primary_category": category})

# print first few results nicely
for doc in cursor.limit(5):
    pprint.pprint(doc)

{'_id': 'c2acfe27-e626-5586-a4b8-90cd77646e8e',
 'author': [{'name': 'Aaron Golden'},
            {'name': 'S. George Djorgovski'},
            {'name': 'John M. Greally'}],
 'category': ['q-bio.GN', 'astro-ph.IM'],
 'comment': '11 pages, 1 figure, accepted for publication in Genome Biology',
 'id': 'http://arxiv.org/abs/1308.3277v1',
 'link_alternate': 'http://arxiv.org/abs/1308.3277v1',
 'link_related': 'http://arxiv.org/pdf/1308.3277v1',
 'primary_category': 'q-bio.GN',
 'published': '2013-08-15T00:01:05Z',
            'from\n'
            'high-throughput DNA sequencing data are being supplanted by a '
            'second deluge, of\n'
            'cliches bemoaning our collective scientific fate unless we '
            'address the genomic\n'
            "data `tsunami'. It is imperative that we explore the many facets "
            'of the genome,\n'
            'not just sequence but also transcriptional and epigenetic '
            'variability,\n'
            'integrating thes

In [23]:
# target number per category
N = 150

sampled_docs = []
for cat in cats:
    agg_pipeline = [
        {"$match": {"primary_category": cat}},
        {"$sample": {"size": N}}  # random sample of N docs
    ]
    sampled_docs.extend(list(collection.aggregate(agg_pipeline)))

df_test = pd.DataFrame(sampled_docs)
df_test.drop(columns=['_id'], inplace=True, errors='ignore')

print(df_test["primary_category"].value_counts())
print(df_test.shape)

primary_category
astro-ph.IM    150
astro-ph.CO    150
astro-ph.EP    150
astro-ph.GA    150
astro-ph.HE    150
astro-ph.SR    150
Name: count, dtype: int64
(900, 13)
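The `$sample` stage draws a fresh random subset on every run, so rerunning the cell changes the sample. If reproducibility matters, one option is to sample on the client with a fixed seed. The sketch below assumes the candidate documents fit in memory; `sample_per_category`, its `seed` parameter and the demo documents are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_category(docs, field, n, seed=0):
    """Draw up to `n` documents per value of `field`, reproducibly."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    sampled = []
    # Sort the categories so the draw order does not depend on input order.
    for category in sorted(groups):
        members = groups[category]
        sampled.extend(rng.sample(members, min(n, len(members))))
    return sampled


# Made-up documents: five in one category, three in another.
demo_docs = (
    [{"id": i, "primary_category": "astro-ph.IM"} for i in range(5)]
    + [{"id": i, "primary_category": "astro-ph.CO"} for i in range(5, 8)]
)
first = sample_per_category(demo_docs, "primary_category", n=2, seed=42)
second = sample_per_category(demo_docs, "primary_category", n=2, seed=42)
print(len(first), first == second)
```

The trade-off is memory: this variant needs every candidate document on the client, whereas `$sample` only returns the requested `N` per category.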


In [24]:
df_test.head(10)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category
0,http://arxiv.org/abs/1501.07239v1,2015-01-28T19:09:07Z,2015-01-28T19:09:07Z,Molecfit: A general tool for telluric absorpti...,Context: The interaction of the light from ast...,"[{'name': 'A. Smette'}, {'name': 'H. Sana'}, {...",10.1051/0004-6361/201423932,[http://dx.doi.org/10.1051/0004-6361/201423932...,"18 pages, 13 figures, 5 tables, accepted for p...","A&A 576, A77 (2015)",http://arxiv.org/abs/1501.07239v1,astro-ph.IM,[astro-ph.IM]
1,http://arxiv.org/abs/2209.15395v1,2022-09-30T11:50:19Z,2022-09-30T11:50:19Z,"EMIR, the near-infrared camera and multi-objec...","We present EMIR, a powerful near-infrared (NIR...","[{'name': 'F. Garzón'}, {'name': 'M. Balcells'...",10.1051/0004-6361/202244729,[http://dx.doi.org/10.1051/0004-6361/202244729...,"10 pages, 11 figures","A&A 667, A107 (2022)",http://arxiv.org/abs/2209.15395v1,astro-ph.IM,[astro-ph.IM]
2,http://arxiv.org/abs/1802.00372v1,2018-02-01T16:08:12Z,2018-02-01T16:08:12Z,Development of a Lunar Scintillometer as part ...,Ground layer turbulence is a very important si...,"[{'name': 'Avinash Surendran'}, {'name': 'Padm...",10.1007/s10686-017-9567-9,"[http://dx.doi.org/10.1007/s10686-017-9567-9, ...",,"Surendran, A., Parihar, P.S., Banyal, R.K. et ...",http://arxiv.org/abs/1802.00372v1,astro-ph.IM,[astro-ph.IM]
3,http://arxiv.org/abs/2103.08136v1,2021-03-15T04:56:14Z,2021-03-15T04:56:14Z,A correction method for the telluric absorptio...,Observing a telluric standard star for correct...,"[{'name': 'Kai-Xing Lu'}, {'name': 'Zhi-Xiang ...",10.1088/1674-4527/21/7/183,"[http://dx.doi.org/10.1088/1674-4527/21/7/183,...","15 pages, 10 figure, accepted for publication ...",,http://arxiv.org/abs/2103.08136v1,astro-ph.IM,[astro-ph.IM]
4,http://arxiv.org/abs/1812.00785v4,2019-01-23T17:41:50Z,2018-11-30T17:07:41Z,QUBIC: Exploring the primordial Universe with ...,"In this paper we describe QUBIC, an experiment...","[{'name': 'Aniello Mennella'}, {'name': 'Peter...",10.3390/universe5020042,"[http://dx.doi.org/10.3390/universe5020042, ht...","Proceedings of the 2018 ICNFP conference, Cret...","Universe 2019, 5, 42",http://arxiv.org/abs/1812.00785v4,astro-ph.IM,[astro-ph.IM]
5,http://arxiv.org/abs/2208.04673v1,2022-08-09T11:35:16Z,2022-08-09T11:35:16Z,In-flight performance of the NIRSpec Micro Shu...,The NIRSpec instrument on the James Webb Space...,"[{'name': 'Timothy D. Rawle', 'affiliation': '...",,http://arxiv.org/pdf/2208.04673v1,"15 pages, 6 figures, to appear in Proceedings ...",,http://arxiv.org/abs/2208.04673v1,astro-ph.IM,[astro-ph.IM]
6,http://arxiv.org/abs/2310.00380v1,2023-09-30T13:56:02Z,2023-09-30T13:56:02Z,Review of Image Processing Methods in Solar Ph...,"With the exponential growth in data volume, es...","[{'name': 'Mohsen Javaherian'}, {'name': 'Zahr...",10.22128/ijaa.2023.711.1155,[http://dx.doi.org/10.22128/ijaa.2023.711.1155...,,"Iranian Journal of Astronomy and Astrophysics,...",http://arxiv.org/abs/2310.00380v1,astro-ph.IM,"[astro-ph.IM, astro-ph.GA, astro-ph.SR]"
7,http://arxiv.org/abs/2108.03522v2,2021-08-10T14:38:16Z,2021-08-07T20:58:48Z,Space-based weather observatory at Earth-Moon ...,Lunar hematite is formed by the oxidation of i...,"[{'name': 'Saurabh Gore'}, {'name': 'Manuel Nt...",10.13140/RG.2.2.12897.84322,[http://dx.doi.org/10.13140/RG.2.2.12897.84322...,,,http://arxiv.org/abs/2108.03522v2,astro-ph.IM,"[astro-ph.IM, astro-ph.EP, physics.space-ph]"
8,http://arxiv.org/abs/2008.09133v1,2020-08-20T18:00:22Z,2020-08-20T18:00:22Z,Mock catalogs for the extragalactic X-ray sky:...,"We present a series of new, publicly available...","[{'name': 'Stefano Marchesi'}, {'name': 'Rober...",10.1051/0004-6361/202038622,[http://dx.doi.org/10.1051/0004-6361/202038622...,"19 pages, 13 figures. Accepted for publication...","A&A 642, A184 (2020)",http://arxiv.org/abs/2008.09133v1,astro-ph.IM,"[astro-ph.IM, astro-ph.GA, astro-ph.HE]"
9,http://arxiv.org/abs/1709.05085v1,2017-09-15T07:35:43Z,2017-09-15T07:35:43Z,Spectral-line Observations Using a Phased Arra...,We present first results from pilot observatio...,"[{'name': 'Tristan Reynolds'}, {'name': 'Liste...",10.1017/pasa.2017.45,"[http://dx.doi.org/10.1017/pasa.2017.45, http:...","14 pages, 13 figures, accepted for publication...",,http://arxiv.org/abs/1709.05085v1,astro-ph.IM,[astro-ph.IM]


In [25]:
df_test_shuffled = df_test.sample(frac=1).reset_index(drop=True)
df_test_shuffled.head(15)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category
0,http://arxiv.org/abs/1509.01427v1,2015-09-04T12:29:46Z,2015-09-04T12:29:46Z,Radio Observations of the Pulsar Wind Nebula H...,Based on its energy-dependent morphology the i...,"[{'name': 'Iurii Sushch'}, {'name': 'Igor Oya'...",,http://arxiv.org/pdf/1509.01427v1,"9 pages, 3 figures, 1 table. In Proceedings of...",,http://arxiv.org/abs/1509.01427v1,astro-ph.HE,[astro-ph.HE]
1,http://arxiv.org/abs/1502.02562v2,2015-02-18T17:19:03Z,2015-02-09T16:57:20Z,Faint AGNs at z>4 in the CANDELS GOODS-S field...,In order to derive the AGN contribution to the...,"[{'name': 'E. Giallongo'}, {'name': 'A. Grazia...",10.1051/0004-6361/201425334,[http://dx.doi.org/10.1051/0004-6361/201425334...,"15 pages, 8 figures, A&A accepted, updated fig...","A&A 578, A83 (2015)",http://arxiv.org/abs/1502.02562v2,astro-ph.CO,"[astro-ph.CO, astro-ph.GA]"
2,http://arxiv.org/abs/1908.04505v1,2019-08-13T06:07:14Z,2019-08-13T06:07:14Z,Homogeneously derived transit timings for 17 e...,We homogeneously analyse $\sim 3.2\times 10^5$...,"[{'name': 'R. V. Baluev'}, {'name': 'E. N. Sok...",10.1093/mnras/stz2620,"[http://dx.doi.org/10.1093/mnras/stz2620, http...","19 pages, 4 figures, 6 tables; revised manuscr...","Mon. Not. R. Astron. Soc., 2019, V. 490 (1), P...",http://arxiv.org/abs/1908.04505v1,astro-ph.EP,[astro-ph.EP]
3,http://arxiv.org/abs/1206.0764v2,2012-09-20T15:39:26Z,2012-06-04T21:00:56Z,Probing the Structure of Jet Driven Core-Colla...,Times of arrival of high energy neutrinos enco...,"[{'name': 'Imre Bartos'}, {'name': 'Basudeb Da...",10.1103/PhysRevD.86.083007,"[http://dx.doi.org/10.1103/PhysRevD.86.083007,...",,"Phys. Rev. D 86, 083007, 2012",http://arxiv.org/abs/1206.0764v2,astro-ph.HE,"[astro-ph.HE, hep-ex, hep-ph]"
4,http://arxiv.org/abs/2012.08433v1,2020-12-15T17:17:51Z,2020-12-15T17:17:51Z,Control and systems software for the Cosmology...,The Cosmology Large Angular Scale Surveyor (CL...,"[{'name': 'Matthew A. Petroff'}, {'name': 'Joh...",10.1117/12.2561609,"[http://dx.doi.org/10.1117/12.2561609, http://...","19 pages, 8 figures, to appear in Proc. SPIE","Proc. SPIE 11452, Software and Cyberinfrastruc...",http://arxiv.org/abs/2012.08433v1,astro-ph.IM,[astro-ph.IM]
5,http://arxiv.org/abs/1312.1691v1,2013-12-05T21:00:02Z,2013-12-05T21:00:02Z,Transient jet formation and state transitions ...,"Magnetically arrested accretion discs (MADs), ...","[{'name': 'Jason Dexter'}, {'name': 'Jonathan ...",10.1093/mnras/stu581,"[http://dx.doi.org/10.1093/mnras/stu581, http:...","5 pages, 3 figures, submitted to MNRAS Letters",,http://arxiv.org/abs/1312.1691v1,astro-ph.HE,[astro-ph.HE]
6,http://arxiv.org/abs/1102.5094v1,2011-02-24T21:00:03Z,2011-02-24T21:00:03Z,Reassessing The Fundamentals: New Constraints ...,The ages and masses of neutron stars (NSs) are...,{'name': 'Bulent Kiziltan'},10.1063/1.3629483,"[http://dx.doi.org/10.1063/1.3629483, http://a...","4 pages, 4 figures; To appear in the AIP proce...",,http://arxiv.org/abs/1102.5094v1,astro-ph.GA,"[astro-ph.GA, stat.AP]"
7,http://arxiv.org/abs/1808.04766v1,2018-08-14T16:06:41Z,2018-08-14T16:06:41Z,Early formation of carbon monoxide in the Cent...,We present near-infrared spectroscopy of the N...,"[{'name': 'D. P. K. Banerjee', 'affiliation': ...",10.1093/mnras/sty2255,"[http://dx.doi.org/10.1093/mnras/sty2255, http...","MNRAS, in press. Accepted 2018 August 13",,http://arxiv.org/abs/1808.04766v1,astro-ph.SR,[astro-ph.SR]
8,http://arxiv.org/abs/1211.2806v2,2013-06-27T11:08:15Z,2012-11-12T21:00:01Z,Afterglow emission in Gamma-Ray Bursts: I. Pai...,Forward shocks caused by the interaction betwe...,"[{'name': 'L. Nava'}, {'name': 'L. Sironi'}, {...",10.1093/mnras/stt872,"[http://dx.doi.org/10.1093/mnras/stt872, http:...",MNRAS in press. Reverse shock and pre-accelera...,,http://arxiv.org/abs/1211.2806v2,astro-ph.HE,[astro-ph.HE]
9,http://arxiv.org/abs/1702.02586v2,2020-02-13T19:10:16Z,2017-02-08T19:11:59Z,Constraints on the Intergalactic Magnetic Fiel...,Pair creation on the cosmic infrared backgroun...,"[{'name': 'Paul Tiede'}, {'name': 'Avery E. Br...",10.3847/1538-4357/ab737e,"[http://dx.doi.org/10.3847/1538-4357/ab737e, h...","12 pages, 14 figures, 1 appendix. Accepted to ApJ",,http://arxiv.org/abs/1702.02586v2,astro-ph.HE,"[astro-ph.HE, astro-ph.CO]"


In [26]:
df_test_shuffled.to_csv("raw_radnom_astro_sample150.csv", index=False)