# Zero-shot classification

<b>Goal</b>: Learn zero-shot classification of articles by topic and check how titles affect the classification results. We keep working with the same sample, because we previously saw 2 clusters in it. Since `subject` is the same everywhere in that sample, it is worth adding another batch of data on a different topic later.

<b>Definition:</b> Zero-shot classification is a method for automatically assigning text (or other data) to predefined categories without first training the model on those categories. In other words, the model saw no examples of these classes during training, but can still make a prediction by relying on its general understanding of language.

## Key points:

- No labeled training data is needed, unlike traditional classifiers (for example `Logistic Regression` or a fine-tuned `BERT`).
- It uses large pretrained language models (such as `BART`, `RoBERTa`, `T5`) that were trained on a huge text corpus and can capture the meaning of sentences.
- How it works: the model takes the texts and the candidate categories and computes how well each text matches each category.
- Pros:
    - No need to label a dataset
    - Fast to set up, since there is no training step
- Cons:
    - Accuracy is usually lower than that of a model trained on task-specific data. It can make mistakes when the categories are too specific or too similar.
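
The scoring step described above can be sketched as follows. This assumes the NLI-based approach used by the Hugging Face `zero-shot-classification` pipeline in its default single-label mode: each candidate label is inserted into a hypothesis such as "This example is {label}.", the model scores how strongly the text entails that hypothesis, and the entailment logits are normalized across labels with a softmax. The helper name `rank_labels` and the logits are made up for illustration; they are not real model output.

```python
import numpy as np

def rank_labels(entailment_logits, candidate_labels):
    """Softmax the entailment logits across labels and sort labels by score."""
    logits = np.asarray(entailment_logits, dtype=float)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    scores = exp / exp.sum()
    order = np.argsort(-scores)  # indices from highest to lowest score
    return {
        "labels": [candidate_labels[i] for i in order],
        "scores": [float(scores[i]) for i in order],
    }

# Hypothetical logits, invented for illustration (not produced by a model)
result = rank_labels([2.0, 0.5], ["astro-ph.IM", "Other Physics"])
print(result["labels"][0])
```

The returned dictionary mirrors the `labels` and `scores` keys that the notebook later reads from the pipeline result.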

In [21]:
!pip install tqdm




In [22]:
pip install ipywidgets tqdm --upgrade


Note: you may need to restart the kernel to use updated packages.


In [23]:
# import the required libraries
import pandas as pd
import numpy as np
from transformers import pipeline
from tqdm import tqdm

In [24]:
# check that CUDA is available
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
print("Current device:", torch.cuda.current_device())
print("Device name:", torch.cuda.get_device_name(0))

PyTorch version: 2.8.0+cu128
CUDA available: True
Device count: 1
Current device: 0
Device name: NVIDIA GeForce RTX 3060


In [25]:
# create the classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cuda:0


In [26]:
df = pd.read_csv('preprocessed_abstracts.csv')
df.head(5)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S..."
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub..."
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard..."
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '..."
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans..."


In [27]:
# use the original text together with the title
df['combined'] = df['title'] + '. ' + df['summary']

In [28]:
# check whether the model sees our 2 clusters; the labels are arbitrary for now
def prediсt_labels(my_df):
    pred_labels = []
    candidate_labels = ["astro-ph.IM", "Other Physics"]
    for text in tqdm(my_df):
        # debug variant: for i, text in tqdm(enumerate(my_df[:10])):
        result = classifier(text, candidate_labels)
        top_label = result['labels'][0]  # the most likely category
        pred_labels.append(top_label)
        # debug: print(i, top_label)
    return pred_labels
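
The loop above calls the classifier once per text. A possible variant, sketched here under assumptions, passes the whole list at once so that the pipeline can batch the inputs; Hugging Face pipelines accept a `batch_size` argument and return one result dictionary per input when given a list. The names `predict_top_labels` and `fake_classifier` are hypothetical: the stub only stands in for the real `classifier` so that the snippet runs without downloading a model.

```python
def predict_top_labels(texts, classifier, candidate_labels, batch_size=16):
    """Return the highest-scoring label for each text."""
    # Given a list, the pipeline returns a list with one dict per text
    results = classifier(
        list(texts), candidate_labels=candidate_labels, batch_size=batch_size
    )
    return [r["labels"][0] for r in results]

def fake_classifier(texts, candidate_labels, batch_size=1):
    # Hypothetical stand-in for the real pipeline: ranks the first label
    # on top whenever the text mentions "telescope"
    out = []
    for t in texts:
        if "telescope" in t.lower():
            labels = list(candidate_labels)
        else:
            labels = list(reversed(candidate_labels))
        out.append({"sequence": t, "labels": labels, "scores": [0.9, 0.1]})
    return out

demo = predict_top_labels(
    ["A new telescope design", "Quark masses"],
    fake_classifier,
    ["astro-ph.IM", "Other Physics"],
)
print(demo)
```

With the real pipeline, the call would presumably look like `predict_top_labels(df["combined"], classifier, ["astro-ph.IM", "Other Physics"])`; whether batching actually speeds things up depends on the GPU and the text lengths.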

In [29]:
# prediсt_labels(df.combined)
# prediсt_labels(df.summary)

In [30]:
pred_labels_comb = prediсt_labels(df.combined)   # predict for abstracts + titles
# df['predicted_labels_combined'] = pred_labels_comb 

100%|██████████| 1000/1000 [01:24<00:00, 11.88it/s]


In [31]:
df.head(10)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined,combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...",Muon Track Reconstruction and Data Selection T...
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub...","An update on the SCUBA-2 project. SCUBA-2, whi..."
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard...",Feasibility study of a Laue lens for hard X-ra...
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '...",Search for Extra-Terrestrial planets: The DARW...
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans...",Wavefront sensor based on varying transmission...
5,http://arxiv.org/abs/physics/0510226v1,2005-10-25T15:55:59Z,2005-10-25T15:55:59Z,An analysis of stellar interferometers as wave...,This paper presents the basic principle and th...,,10.1364/AO.44.004733,"['http://dx.doi.org/10.1364/AO.44.004733', 'ht...",12 figures,"Appl.Opt. 44:4733-4744,2005",http://arxiv.org/abs/physics/0510226v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'present', 'basic', 'principle', 'th...","['analysis', 'stellar', 'interferometer', 'wav...","['analysis', 'stellar', 'interferometer', 'wav...",An analysis of stellar interferometers as wave...
6,http://arxiv.org/abs/physics/0511231v1,2005-11-28T15:26:24Z,2005-11-28T15:26:24Z,Conceptual design of a phase shifting telescop...,This paper deals with the theoretical principl...,,10.1016/j.optcom.2005.11.061,['http://dx.doi.org/10.1016/j.optcom.2005.11.0...,17 pages and 5 figures,"Opt.Commun.261:34-42,2006",http://arxiv.org/abs/physics/0511231v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'deal', 'theoretical', 'principle', ...","['conceptual', 'design', 'phase', 'shift', 'te...","['conceptual', 'design', 'phase', 'shift', 'te...",Conceptual design of a phase shifting telescop...
7,http://arxiv.org/abs/astro-ph/0512053v1,2005-12-02T02:47:13Z,2005-12-02T02:47:13Z,Atmospheric Biomarkers and their Evolution ove...,The search for life on extrasolar planets is b...,"[{'name': 'L. Kaltenegger'}, {'name': 'K. Juck...",10.1017/S1743921306009422,['http://dx.doi.org/10.1017/S1743921306009422'...,for high resolution images see\n http://cfa-w...,"IAU Symp.200:1.259-1.264,20065",http://arxiv.org/abs/astro-ph/0512053v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['search', 'life', 'extrasolar', 'planet', 'ba...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...",Atmospheric Biomarkers and their Evolution ove...
8,http://arxiv.org/abs/astro-ph/0606733v1,2006-06-29T19:04:52Z,2006-06-29T19:04:52Z,Characteristics of proposed 3 and 4 telescope ...,The Darwin and TPF-I missions are Infrared fre...,"[{'name': 'L. Kaltenegger'}, {'name': 'M. Frid...",10.1017/S1743921306009410,['http://dx.doi.org/10.1017/S1743921306009410'...,"4 pages, 2 figures",Proceedings IAUC200: Direct Imaging of Exoplan...,http://arxiv.org/abs/astro-ph/0606733v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['Darwin', 'TPF', 'mission', 'Infrared', 'free...","['Characteristics', 'propose', 'telescope', 'c...","['Characteristics', 'propose', 'telescope', 'c...",Characteristics of proposed 3 and 4 telescope ...
9,http://arxiv.org/abs/astro-ph/0606762v1,2006-06-30T18:24:16Z,2006-06-30T18:24:16Z,Interferometric Space Missions for the Search ...,The requirements on space missions designed to...,"[{'name': 'L Kaltenegger'}, {'name': 'M. Fridl...",10.1007/s10509-006-9183-z,['http://dx.doi.org/10.1007/s10509-006-9183-z'...,"21 pages, 8 figures; TBP in Astrophysics and S...","Astrophys.Space Sci.306:147-158,2006",http://arxiv.org/abs/astro-ph/0606762v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['requirement', 'space', 'mission', 'design', ...","['Interferometric', 'Space', 'Missions', 'Sear...","['Interferometric', 'Space', 'Missions', 'Sear...",Interferometric Space Missions for the Search ...


In [32]:
# pred_labels_summary = prediсt_labels(df.summary)   # predict for abstracts only
# df['predicted_labels_summary'] = pred_labels_summary

In [33]:
# compare predictions with and without titles
# (this is the share of rows where the two runs agree, not accuracy against true labels)
# agreement = (df['predicted_labels_combined'] == df['predicted_labels_summary']).mean()
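
The comparison in the commented-out cell can be sketched on a toy frame. The rows below are invented for illustration and are not results from the model; the column names follow the ones used in the commented-out assignments earlier in the notebook. The equality check measures how often the two runs agree, and a cross-tabulation against `primary_category` shows how predictions are distributed over the known categories.

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "primary_category": ["astro-ph", "astro-ph", "physics.optics", "physics.optics"],
    "predicted_labels_combined": ["astro-ph.IM", "astro-ph.IM", "Other Physics", "astro-ph.IM"],
    "predicted_labels_summary": ["astro-ph.IM", "Other Physics", "Other Physics", "astro-ph.IM"],
})

# Share of rows where predictions with and without the title coincide
agreement = (toy["predicted_labels_combined"] == toy["predicted_labels_summary"]).mean()

# Counts of predicted labels per known primary category
table = pd.crosstab(toy["primary_category"], toy["predicted_labels_combined"])

print(agreement)
print(table)
```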

## Building a diverse list of sources

In [34]:
!pip install pymongo




In [35]:
# try reading the data from the MongoDB database
from pymongo import MongoClient
db = MongoClient("mongodb://localhost:27017/", uuidRepresentation="standard").arxiv

In [36]:
collection = db.articles
collection.count_documents({})

171298

In [37]:
cats = [
    "astro-ph.IM",
    "astro-ph.CO",
    "astro-ph.EP",
    "astro-ph.GA",
    "astro-ph.HE",
    "astro-ph.SR",
]
cursor = collection.find({"primary_category": {"$in": cats}})
m_df = pd.DataFrame(list(cursor))
m_df.drop(columns=['_id'], inplace=True, errors='ignore')
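
Loading every field of every matching document can be heavy for a collection of this size. One option, sketched here as an assumption rather than as the notebook's method, is to pass a projection as the second argument of `collection.find()` so that only the needed fields are returned. The helper `build_find_args` is hypothetical and only builds the two dictionaries, so it runs without a database connection.

```python
def build_find_args(categories, fields):
    """Build the filter and projection dictionaries for collection.find()."""
    query = {"primary_category": {"$in": list(categories)}}
    projection = {name: 1 for name in fields}  # include only these fields
    projection["_id"] = 0  # leave out MongoDB's internal id
    return query, projection

query, projection = build_find_args(
    ["astro-ph.IM", "astro-ph.CO"],
    ["title", "summary", "primary_category"],
)
print(query)
print(projection)

# With a live connection (assumed, as in the cell above):
# m_df = pd.DataFrame(list(collection.find(query, projection)))
```

Excluding `_id` in the projection also makes the later `drop(columns=['_id'])` call unnecessary.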

In [38]:
import pprint
pprint.pprint(collection.find_one())

{'_id': '1c528260-7f0d-5142-a1ec-6253f5f9325d',
 'author': [{'name': 'The AMANDA Collaboration'}, {'name': 'J. Ahrens'}],
 'category': ['astro-ph', 'astro-ph.IM'],
 'comment': '40 pages, 16 Postscript figures, uses elsart.sty',
 'doi': '10.1016/j.nima.2004.01.065',
 'id': 'http://arxiv.org/abs/astro-ph/0407044v1',
 'journal_ref': 'Nucl.Instrum.Meth.A524:169-194,2004',
 'link_alternate': 'http://arxiv.org/abs/astro-ph/0407044v1',
 'link_related': ['http://dx.doi.org/10.1016/j.nima.2004.01.065',
                  'http://arxiv.org/pdf/astro-ph/0407044v1'],
 'primary_category': 'astro-ph',
 'published': '2004-07-02T10:17:39Z',
 'summary': 'The Antarctic Muon And Neutrino Detector Array (AMANDA) is a '
            'high-energy\n'
            'neutrino telescope operating at the geographic South Pole. It is '
            'a lattice of\n'
            'photo-multiplier tubes buried deep in the polar ice between 1500m '
            'and 2000m.\n'
            'The primary goal of this detector 

In [41]:
m_df.count()

id                  151256
updated             151256
published           151256
title               151256
summary             151256
author              151256
doi                 127952
link_related        151256
comment             142604
link_alternate      151256
primary_category    151256
category            151256
journal_ref          43802
dtype: int64

In [42]:
# aggregate count per category
# named count_pipeline so it does not shadow transformers.pipeline imported earlier
count_pipeline = [
    {"$group": {"_id": "$primary_category", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}  # optional: sort descending
]

counts = list(collection.aggregate(count_pipeline))

for record in counts:
    print(record["_id"], record["count"])

astro-ph.SR 30876
astro-ph.GA 30226
astro-ph.HE 29819
astro-ph.CO 25215
astro-ph.EP 18783
astro-ph.IM 16337
gr-qc 6985
hep-ph 4523
hep-th 1963
nucl-th 1087
physics.ins-det 740
physics.space-ph 634
physics.plasm-ph 564
physics.flu-dyn 285
physics.hist-ph 268
hep-ex 240
physics.pop-ph 195
nucl-ex 158
physics.comp-ph 148
physics.atom-ph 146
physics.ao-ph 139
physics.geo-ph 130
physics.optics 130
physics.ed-ph 105
physics.chem-ph 101
physics.data-an 85
quant-ph 82
physics.gen-ph 77
astro-ph 76
cs.LG 74
nlin.CD 72
cond-mat.stat-mech 68
math.NA 66
math-ph 61
physics.soc-ph 55
cs.CV 52
cond-mat.mtrl-sci 47
stat.ME 45
cs.DC 43
eess.SP 37
cs.IT 34
math.DS 33
physics.class-ph 33
stat.AP 31
cs.DL 26
cs.RO 26
stat.CO 24
cond-mat.supr-con 23
stat.ML 22
physics.app-ph 17
math.AP 17
cond-mat.mes-hall 15
eess.IV 15
eess.SY 15
cond-mat.soft 13
cs.CE 12
cond-mat.quant-gas 12
q-bio.PE 11
physics.bio-ph 11
math.ST 9
math.OC 9
hep-lat 8
physics.atm-clus 8
cs.MS 7
cs.AI 7
math.CA 7
physics.acc-ph 6
physics.

In [43]:
# choose your category
category = "q-bio.GN"

# find all records in that category
cursor = collection.find({"primary_category": category})

# print first few results nicely
for doc in cursor.limit(5):
    pprint.pprint(doc)

{'_id': 'c2acfe27-e626-5586-a4b8-90cd77646e8e',
 'author': [{'name': 'Aaron Golden'},
            {'name': 'S. George Djorgovski'},
            {'name': 'John M. Greally'}],
 'category': ['q-bio.GN', 'astro-ph.IM'],
 'comment': '11 pages, 1 figure, accepted for publication in Genome Biology',
 'id': 'http://arxiv.org/abs/1308.3277v1',
 'link_alternate': 'http://arxiv.org/abs/1308.3277v1',
 'link_related': 'http://arxiv.org/pdf/1308.3277v1',
 'primary_category': 'q-bio.GN',
 'published': '2013-08-15T00:01:05Z',
            'from\n'
            'high-throughput DNA sequencing data are being supplanted by a '
            'second deluge, of\n'
            'cliches bemoaning our collective scientific fate unless we '
            'address the genomic\n'
            "data `tsunami'. It is imperative that we explore the many facets "
            'of the genome,\n'
            'not just sequence but also transcriptional and epigenetic '
            'variability,\n'
            'integrating thes

In [44]:
# target number per category
N = 150

sampled_docs = []
for cat in cats:
    # named sample_pipeline so it does not shadow transformers.pipeline imported earlier
    sample_pipeline = [
        {"$match": {"primary_category": cat}},
        {"$sample": {"size": N}}  # random sample of N docs
    ]
    sampled_docs.extend(list(collection.aggregate(sample_pipeline)))

df_test = pd.DataFrame(sampled_docs)
df_test.drop(columns=['_id'], inplace=True, errors='ignore')

print(df_test["primary_category"].value_counts())
print(df_test.shape)

primary_category
astro-ph.IM    150
astro-ph.CO    150
astro-ph.EP    150
astro-ph.GA    150
astro-ph.HE    150
astro-ph.SR    150
Name: count, dtype: int64
(900, 13)
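
The `$sample` stage draws a different random subset on every run. If a reproducible sample is preferred, one alternative is to draw it in pandas from a frame that is already loaded, using `groupby(...).sample(...)` with a fixed `random_state`. This is a sketch on toy data; the frame, the seed 42 and the name `n_per_cat` are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the documents loaded from MongoDB
toy = pd.DataFrame({
    "primary_category": ["astro-ph.IM"] * 5 + ["astro-ph.CO"] * 4,
    "title": [f"paper {i}" for i in range(9)],
})

n_per_cat = 3  # target number per category

sample = (
    toy.groupby("primary_category")
       .sample(n=n_per_cat, random_state=42)  # same rows on every run
       .sample(frac=1, random_state=42)       # shuffle the combined sample
       .reset_index(drop=True)
)

print(sample["primary_category"].value_counts())
print(sample.shape)
```

The trade-off is memory: this requires the full set of candidate rows in a DataFrame, whereas `$sample` does the selection on the database side.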


In [45]:
df_test.head(10)

Unnamed: 0,id,updated,published,title,summary,author,comment,link_alternate,link_related,primary_category,category,doi,journal_ref
0,http://arxiv.org/abs/1902.06398v1,2019-02-18T04:24:39Z,2019-02-18T04:24:39Z,Terahertz Atmospheric Windows for High Angular...,"Atmospheric transmission from Dome A, Antarcti...","[{'name': 'Hiroshi Matsuo'}, {'name': 'Sheng-C...","6 pages, 3 figures, to appear in Advances in P...",http://arxiv.org/abs/1902.06398v1,http://arxiv.org/pdf/1902.06398v1,astro-ph.IM,[astro-ph.IM],,
1,http://arxiv.org/abs/1308.1833v1,2013-08-08T12:58:52Z,2013-08-08T12:58:52Z,The prototyping/early construction phase of th...,The Prototyping phase of the BAIKAL-GVD projec...,"[{'name': 'A. D. Avrorin', 'affiliation': 'The...","Proceedings of the RICAP 2013 Conference, Rome...",http://arxiv.org/abs/1308.1833v1,"[http://dx.doi.org/10.1016/j.nima.2013.10.064,...",astro-ph.IM,"[astro-ph.IM, astro-ph.HE]",10.1016/j.nima.2013.10.064,
2,http://arxiv.org/abs/1105.0282v1,2011-05-02T09:59:41Z,2011-05-02T09:59:41Z,Timing analysis techniques at large core dista...,We present an analysis technique that uses the...,"[{'name': 'V. Stamatescu'}, {'name': 'G. P. Ro...",Published in Astroparticle Physics,http://arxiv.org/abs/1105.0282v1,[http://dx.doi.org/10.1016/j.astropartphys.201...,astro-ph.IM,"[astro-ph.IM, astro-ph.CO, astro-ph.GA, astro-...",10.1016/j.astropartphys.2011.03.008,"Astroparticle Physics 34 (2011), pp. 886-896"
3,http://arxiv.org/abs/1102.0815v1,2011-02-04T00:01:03Z,2011-02-04T00:01:03Z,A testable conventional hypothesis for the DAM...,The annual modulation signal observed by the D...,{'name': 'David Nygren'},"Nine pages, two figures",http://arxiv.org/abs/1102.0815v1,http://arxiv.org/pdf/1102.0815v1,astro-ph.IM,"[astro-ph.IM, astro-ph.HE]",,
4,http://arxiv.org/abs/2012.12680v1,2020-12-23T14:15:02Z,2020-12-23T14:15:02Z,SOXS: Effects on optical performances due to g...,SOXS (Son Of X-Shooter) is the new medium reso...,"[{'name': 'Ricardo Zanmar Sanchez'}, {'name': ...",SPIE Astronomical Telescopes + Instrumentation...,http://arxiv.org/abs/2012.12680v1,http://arxiv.org/pdf/2012.12680v1,astro-ph.IM,[astro-ph.IM],,
5,http://arxiv.org/abs/2304.04154v1,2023-04-09T04:00:56Z,2023-04-09T04:00:56Z,Review of X-ray pulsar spacecraft autonomous n...,This article provides a review on X-ray pulsar...,"[{'name': 'Yidi Wang'}, {'name': 'Wei Zheng'},...",has been accepted by Chinese Journal of Aerona...,http://arxiv.org/abs/2304.04154v1,"[http://dx.doi.org/10.1016/j.cja.2023.03.002, ...",astro-ph.IM,"[astro-ph.IM, cs.SY, eess.SY]",10.1016/j.cja.2023.03.002,"Chinese Journal of Aeronautics, 2023"
6,http://arxiv.org/abs/1910.04847v2,2020-09-08T13:50:18Z,2019-10-10T20:57:37Z,A Simulated Annealing algorithm to quantify pa...,"We develop an optimization algorithm, using si...","[{'name': 'Maria Chira'}, {'name': 'Manolis Pl...","18 pages, 20 figures",http://arxiv.org/abs/1910.04847v2,"[http://dx.doi.org/10.1093/mnras/stz2885, http...",astro-ph.IM,"[astro-ph.IM, astro-ph.CO]",10.1093/mnras/stz2885,"MNRAS 490, 2019, 5904 -5920"
7,http://arxiv.org/abs/1411.5320v1,2014-11-19T19:15:33Z,2014-11-19T19:15:33Z,Detrending algorithms in large time-series: Ap...,Certain instrumental effects and data reductio...,"[{'name': 'D. del Ser'}, {'name': 'O. Fors'}, ...","Proceedings of the Living Together: Planets, H...",http://arxiv.org/abs/1411.5320v1,http://arxiv.org/pdf/1411.5320v1,astro-ph.IM,"[astro-ph.IM, astro-ph.EP]",,
8,http://arxiv.org/abs/2004.09841v1,2020-04-21T09:20:06Z,2020-04-21T09:20:06Z,White Paper: ARIANNA-200 high energy neutrino ...,"The proposed ARIANNA-200 neutrino detector, lo...","[{'name': 'A. Anker'}, {'name': 'P. Baldi'}, {...",,http://arxiv.org/abs/2004.09841v1,http://arxiv.org/pdf/2004.09841v1,astro-ph.IM,[astro-ph.IM],,
9,http://arxiv.org/abs/1809.08969v2,2018-10-16T12:29:12Z,2018-09-24T14:35:50Z,Experimental results from the ST7 mission on L...,The Space Technology 7 Disturbance Reduction S...,"[{'name': 'G Anderson'}, {'name': 'J Anderson'...",,http://arxiv.org/abs/1809.08969v2,"[http://dx.doi.org/10.1103/PhysRevD.98.102005,...",astro-ph.IM,[astro-ph.IM],10.1103/PhysRevD.98.102005,"Phys. Rev. D 98, 102005 (2018)"


In [46]:
df_test_shuffled = df_test.sample(frac=1).reset_index(drop=True)
df_test_shuffled.head(15)

Unnamed: 0,id,updated,published,title,summary,author,comment,link_alternate,link_related,primary_category,category,doi,journal_ref
0,http://arxiv.org/abs/2101.03082v1,2021-01-08T16:17:24Z,2021-01-08T16:17:24Z,Timing techniques applied to distributed modul...,The HERMES-TP/SP (High Energy Rapid Modular En...,"[{'name': 'A. Sanna'}, {'name': 'A. F. Gambino...","19 pages, 13 figures, Proceedings of SPIE Astr...",http://arxiv.org/abs/2101.03082v1,"[http://dx.doi.org/10.1117/12.2561758, http://...",astro-ph.HE,"[astro-ph.HE, astro-ph.IM]",10.1117/12.2561758,"Proceedings Volume 11444, Space Telescopes and..."
1,http://arxiv.org/abs/2308.09646v1,2023-08-18T16:06:41Z,2023-08-18T16:06:41Z,Large Interferometer For Exoplanets (LIFE). X....,The next generation of space-based observatori...,"[{'name': 'Óscar Carrión-González'}, {'name': ...","Accepted for publication in A&A. 14 pages, 5 T...",http://arxiv.org/abs/2308.09646v1,[http://dx.doi.org/10.1051/0004-6361/202347027...,astro-ph.EP,"[astro-ph.EP, astro-ph.IM]",10.1051/0004-6361/202347027,"A&A 678, A96 (2023)"
2,http://arxiv.org/abs/1303.5932v1,2013-03-24T09:40:20Z,2013-03-24T09:40:20Z,Europium s-process signature at close-to-solar...,Individual mainstream stardust silicon carbide...,"[{'name': 'Janaina N. Avila'}, {'name': 'Trevo...","19 pages, 4 figures, 1 table. Accepted for pub...",http://arxiv.org/abs/1303.5932v1,[http://dx.doi.org/10.1088/2041-8205/768/1/L18...,astro-ph.SR,[astro-ph.SR],10.1088/2041-8205/768/1/L18,Ap. J. Lett. 768 (2013) L18
3,http://arxiv.org/abs/0908.2534v2,2009-10-14T09:20:41Z,2009-08-18T09:59:36Z,On the possibility of a maximum fundamental de...,With this note we want to point out that alrea...,{'name': 'Gustaf Rydbeck'},"5 pages, 2 figures. Replaced since the Abstrac...",http://arxiv.org/abs/0908.2534v2,http://arxiv.org/pdf/0908.2534v2,astro-ph.CO,[astro-ph.CO],,
4,http://arxiv.org/abs/1112.2029v2,2012-04-06T03:29:45Z,2011-12-09T07:13:38Z,Gamma-Ray Bursts: the Isotropic-Equivalent-Ene...,Gamma-ray bursts (GRBs) are brief but intense ...,"[{'name': 'Shi-Wei Wu'}, {'name': 'Dong Xu'}, ...","6 pages, 10 figures. Accepted for publication ...",http://arxiv.org/abs/1112.2029v2,[http://dx.doi.org/10.1111/j.1365-2966.2012.21...,astro-ph.HE,[astro-ph.HE],10.1111/j.1365-2966.2012.21068.x,
5,http://arxiv.org/abs/1804.06908v1,2018-04-18T20:28:19Z,2018-04-18T20:28:19Z,The Connection Between Different Tracers Of Th...,"Using visible, radio, microwave, and sub-mm da...","[{'name': 'Johnathan S. Rice'}, {'name': 'S. R...",To be published in ApJ,http://arxiv.org/abs/1804.06908v1,"[http://dx.doi.org/10.3847/1538-4357/aabae7, h...",astro-ph.GA,"[astro-ph.GA, astro-ph.SR]",10.3847/1538-4357/aabae7,
6,http://arxiv.org/abs/2003.06011v1,2020-03-12T20:42:25Z,2020-03-12T20:42:25Z,Performance limits of adaptive-optics/high-con...,Advanced AO systems will likely utilise Pyrami...,"[{'name': 'Carlos M. Correia'}, {'name': 'Oliv...","12 pages, 13 figures",http://arxiv.org/abs/2003.06011v1,"[http://dx.doi.org/10.1093/mnras/staa843, http...",astro-ph.IM,[astro-ph.IM],10.1093/mnras/staa843,
7,http://arxiv.org/abs/1108.3344v2,2011-10-18T09:36:13Z,2011-08-16T20:05:12Z,An examination of magnetized outflows from act...,We present 3D adaptive mesh refinement MHD sim...,"[{'name': 'P. M. Sutter'}, {'name': 'H. -Y. Ya...","24 pages, 26 figures, 8 tables. Slight adjustm...",http://arxiv.org/abs/1108.3344v2,[http://dx.doi.org/10.1111/j.1365-2966.2011.19...,astro-ph.CO,[astro-ph.CO],10.1111/j.1365-2966.2011.19875.x,
8,http://arxiv.org/abs/1401.4738v1,2014-01-19T21:08:50Z,2014-01-19T21:08:50Z,Planetary internal structures,This chapter reviews the most recent advanceme...,"[{'name': 'I. Baraffe'}, {'name': 'G. Chabrier...","24 pages, 8 figures, Accepted for publication ...",http://arxiv.org/abs/1401.4738v1,[http://dx.doi.org/10.2458/azu_uapress_9780816...,astro-ph.EP,[astro-ph.EP],10.2458/azu_uapress_9780816531240-ch033,
9,http://arxiv.org/abs/1608.00983v2,2016-09-06T13:29:26Z,2016-08-02T20:02:28Z,NICIL: A stand alone library to self-consisten...,"In this paper, we introduce Nicil: Non-Ideal m...",{'name': 'James Wurster'},13 pgs; accepted to PASA; source code for NICI...,http://arxiv.org/abs/1608.00983v2,"[http://dx.doi.org/10.1017/pasa.2016.34, http:...",astro-ph.IM,"[astro-ph.IM, astro-ph.SR]",10.1017/pasa.2016.34,


In [47]:
df_test_shuffled.to_csv("raw_radnom_astro_sample150.csv", index=False)