# Zero-shot classification

<b>Goal</b>: Learn zero-shot classification of articles by topic and check how titles affect the classification results. We keep working with the same sample, because we previously saw 2 clusters in it. Since `subject` is the same everywhere in that sample, it would be interesting to add another batch of data on a different topic later.

<b>Definition: </b> Zero-shot classification is a method of automatically assigning text (or other data) to predefined categories without training the model on those categories beforehand. In other words, the model saw no examples of these classes during training, but it can still make a prediction by relying on its general understanding of language.

## Key points:

- No labelled training data is needed, unlike traditional classifiers (`Logistic Regression`, fine-tuned `BERT`)
- Uses large pretrained language models (such as `BART`, `RoBERTa`, `T5`) that were trained on a huge text corpus and capture the meaning of sentences.
- How it works: it takes the texts and the categories and computes how well each text matches each category.
- Pros:
    - No need to label the dataset
    - Quick to set up
- Cons:
    - Accuracy is usually lower than that of a model trained on the specific data. It can also make mistakes when the categories are too specific or too similar.
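
The "how it works" point can be made concrete. With NLI-based models such as `facebook/bart-large-mnli`, the Hugging Face zero-shot pipeline turns each candidate label into a hypothesis sentence (by default with the template `"This example is {}."`), scores how strongly the text entails it, and, in single-label mode, applies a softmax over those entailment scores. The sketch below reproduces only that last arithmetic step. The logits are made-up numbers for illustration, not model output, and the helper names (`build_hypotheses`, `rank_labels`) and the label wording are hypothetical.

```python
import math

HYPOTHESIS_TEMPLATE = "This example is {}."  # default template of the HF zero-shot pipeline


def build_hypotheses(candidate_labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(candidate_labels, entailment_logits):
    """Return (label, score) pairs sorted from most to least likely."""
    scores = softmax(entailment_logits)
    return sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)


labels = ["astrophysics instrumentation", "other physics"]
# Hypothetical entailment logits, one per label (NOT real model output).
fake_logits = [2.0, 0.5]

print(build_hypotheses(labels))
for label, score in rank_labels(labels, fake_logits):
    print(f"{label}: {score:.3f}")
```

Because the label text is inserted into a sentence, descriptive labels in plain language are likely to work better than opaque codes.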

In [52]:
# the progress-bar package is called tqdm (tdqm is a different PyPI package)
!pip install tqdm

In [56]:
pip install ipywidgets tqdm --upgrade

Note: you may need to restart the kernel to use updated packages.


In [57]:
# import the required libraries
import pandas as pd
import numpy as np
from transformers import pipeline
from tqdm import tqdm

In [40]:
# create the classifier
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

Device set to use cpu


In [41]:
df = pd.read_csv('preprocessed_abstracts.csv')
df.head(5)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S..."
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub..."
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard..."
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '..."
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans..."


In [42]:
# use the original text plus the title
df['combined'] = df['title'] + '. ' + df['summary']

In [62]:
# check whether the model sees our 2 clusters; the labels are arbitrary placeholders for now
def predict_labels(texts, candidate_labels=("astro-ph.IM", "Other Physics")):
    """Return the most likely candidate label for each text."""
    pred_labels = []
    for text in tqdm(texts):
        result = classifier(text, list(candidate_labels))
        top_label = result['labels'][0]  # labels are sorted by score, so the first is the most likely
        pred_labels.append(top_label)
    return pred_labels

In [63]:
# predict_labels(df.combined)
# predict_labels(df.summary)

In [64]:
pred_labels_comb = predict_labels(df.combined)   # predict for abstracts + titles
# df['predicted_labels_combined'] = pred_labels_comb

  1%|█▊                                                                                                                                                        | 12/1000 [00:48<1:05:57,  4.01s/it]


KeyboardInterrupt: 
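
The progress bar above shows roughly 4 s per text on CPU, so all 1000 rows would take about an hour, and interrupting the loop discards everything computed so far. One possible workaround, sketched below, is to process the texts in chunks and keep the partial results. `classify_in_chunks` and `fake_classifier` are hypothetical names. The sketch only assumes that the classifier can be called as `classifier(list_of_texts, candidate_labels)` and returns one dict with a `labels` list per text, which is how the `transformers` zero-shot pipeline behaves for list input.

```python
def classify_in_chunks(texts, classifier, candidate_labels, chunk_size=16, max_texts=None):
    """Classify texts chunk by chunk and return the labels collected so far,
    even if the run is interrupted (Ctrl+C or the Jupyter stop button)."""
    texts = list(texts)[:max_texts]  # max_texts=None keeps every text
    pred_labels = []
    try:
        for start in range(0, len(texts), chunk_size):
            chunk = texts[start:start + chunk_size]
            results = classifier(chunk, candidate_labels)
            if isinstance(results, dict):  # guard: a single text may come back as a bare dict
                results = [results]
            pred_labels.extend(r["labels"][0] for r in results)
    except KeyboardInterrupt:
        print(f"Interrupted after {len(pred_labels)} of {len(texts)} texts")
    return pred_labels


def fake_classifier(texts, candidate_labels):
    """Stand-in for the real pipeline so the sketch runs without downloading a model.
    It 'predicts' the first label whenever the text mentions a telescope."""
    out = []
    for text in texts:
        hit = "telescope" in text.lower()
        ordered = candidate_labels if hit else candidate_labels[::-1]
        out.append({"sequence": text, "labels": list(ordered), "scores": [0.9, 0.1]})
    return out


demo_texts = ["Conceptual design of a phase shifting telescope", "A note on muon tracks"]
demo_labels = ["astro-ph.IM", "Other Physics"]
print(classify_in_chunks(demo_texts, fake_classifier, demo_labels, chunk_size=1))
```

With the real pipeline, a call such as `classify_in_chunks(df['combined'], classifier, ["astro-ph.IM", "Other Physics"], max_texts=50)` would try a small sample first before committing to the full run.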

In [46]:
df.head(10)

Unnamed: 0,id,updated,published,title,summary,author,doi,link_related,comment,journal_ref,link_alternate,primary_category,category,author.name,author.affiliation,summary_tokens,title_tokens,tokens_combined,combined
0,http://arxiv.org/abs/astro-ph/0407044v1,2004-07-02T10:17:39Z,2004-07-02T10:17:39Z,Muon Track Reconstruction and Data Selection T...,The Antarctic Muon And Neutrino Detector Array...,"[{'name': 'The AMANDA Collaboration'}, {'name'...",10.1016/j.nima.2004.01.065,['http://dx.doi.org/10.1016/j.nima.2004.01.065...,"40 pages, 16 Postscript figures, uses elsart.sty","Nucl.Instrum.Meth.A524:169-194,2004",http://arxiv.org/abs/astro-ph/0407044v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['Antarctic', 'Muon', 'Neutrino', 'Detector', ...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...","['Muon', 'Track', 'Reconstruction', 'Data', 'S...",Muon Track Reconstruction and Data Selection T...
1,http://arxiv.org/abs/astro-ph/0410439v1,2004-10-19T14:47:51Z,2004-10-19T14:47:51Z,An update on the SCUBA-2 project,"SCUBA-2, which replaces SCUBA (the Submillimet...","[{'name': 'Michael Audley', 'affiliation': 'UK...",10.1117/12.551259,"['http://dx.doi.org/10.1117/12.551259', 'http:...","16 pages, 14 figures, Invited talk at SPIE Gla...",,http://arxiv.org/abs/astro-ph/0410439v1,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['replace', 'SCUBA', 'Submillimeter', 'Common'...","['update', 'project']","['update', 'project', 'replace', 'SCUBA', 'Sub...","An update on the SCUBA-2 project. SCUBA-2, whi..."
2,http://arxiv.org/abs/astro-ph/0411574v3,2011-01-05T18:55:32Z,2004-11-19T15:00:42Z,Feasibility study of a Laue lens for hard X-ra...,We report on the feasibility study of a Laue l...,"[{'name': 'A. Pisa', 'affiliation': 'Universit...",10.1117/12.563052,"['http://dx.doi.org/10.1117/12.563052', 'http:...","10 pages, corrected Fig. 1b and Fig. 2, which ...","SPIE Proc., 5536, 39 (2004)",http://arxiv.org/abs/astro-ph/0411574v3,astro-ph,"['astro-ph', 'astro-ph.IM']",,,"['report', 'feasibility', 'study', 'Laue', 'le...","['feasibility', 'study', 'Laue', 'lens', 'hard...","['feasibility', 'study', 'Laue', 'lens', 'hard...",Feasibility study of a Laue lens for hard X-ra...
3,http://arxiv.org/abs/astro-ph/0504497v1,2005-04-22T12:39:07Z,2005-04-22T12:39:07Z,Search for Extra-Terrestrial planets: The DARW...,The DARWIN mission is an Infrared free flying ...,,,http://arxiv.org/pdf/astro-ph/0504497v1,"PhD thesis 2004, Karl Franzens Univ. Graz, 177...",,http://arxiv.org/abs/astro-ph/0504497v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",Lisa Kaltenegger,,"['DARWIN', 'mission', 'Infrared', 'free', 'fly...","['search', 'extra', 'terrestrial', 'planet', '...","['search', 'extra', 'terrestrial', 'planet', '...",Search for Extra-Terrestrial planets: The DARW...
4,http://arxiv.org/abs/physics/0510224v1,2005-10-25T15:36:07Z,2005-10-25T15:36:07Z,Wavefront sensor based on varying transmission...,The use of Wavefront Sensors (WFS) is nowadays...,,10.1080/09500340500073495,['http://dx.doi.org/10.1080/09500340500073495'...,"2 tables, 6 figures","J.Mod.Opt. 52:1917-1931,2005",http://arxiv.org/abs/physics/0510224v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['use', 'Wavefront', 'Sensors', 'WFS', 'nowada...","['wavefront', 'sensor', 'base', 'vary', 'trans...","['wavefront', 'sensor', 'base', 'vary', 'trans...",Wavefront sensor based on varying transmission...
5,http://arxiv.org/abs/physics/0510226v1,2005-10-25T15:55:59Z,2005-10-25T15:55:59Z,An analysis of stellar interferometers as wave...,This paper presents the basic principle and th...,,10.1364/AO.44.004733,"['http://dx.doi.org/10.1364/AO.44.004733', 'ht...",12 figures,"Appl.Opt. 44:4733-4744,2005",http://arxiv.org/abs/physics/0510226v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'present', 'basic', 'principle', 'th...","['analysis', 'stellar', 'interferometer', 'wav...","['analysis', 'stellar', 'interferometer', 'wav...",An analysis of stellar interferometers as wave...
6,http://arxiv.org/abs/physics/0511231v1,2005-11-28T15:26:24Z,2005-11-28T15:26:24Z,Conceptual design of a phase shifting telescop...,This paper deals with the theoretical principl...,,10.1016/j.optcom.2005.11.061,['http://dx.doi.org/10.1016/j.optcom.2005.11.0...,17 pages and 5 figures,"Opt.Commun.261:34-42,2006",http://arxiv.org/abs/physics/0511231v1,physics.optics,"['physics.optics', 'astro-ph', 'astro-ph.IM']",Francois Henault,,"['paper', 'deal', 'theoretical', 'principle', ...","['conceptual', 'design', 'phase', 'shift', 'te...","['conceptual', 'design', 'phase', 'shift', 'te...",Conceptual design of a phase shifting telescop...
7,http://arxiv.org/abs/astro-ph/0512053v1,2005-12-02T02:47:13Z,2005-12-02T02:47:13Z,Atmospheric Biomarkers and their Evolution ove...,The search for life on extrasolar planets is b...,"[{'name': 'L. Kaltenegger'}, {'name': 'K. Juck...",10.1017/S1743921306009422,['http://dx.doi.org/10.1017/S1743921306009422'...,for high resolution images see\n http://cfa-w...,"IAU Symp.200:1.259-1.264,20065",http://arxiv.org/abs/astro-ph/0512053v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['search', 'life', 'extrasolar', 'planet', 'ba...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...","['Atmospheric', 'Biomarkers', 'Evolution', 'Ge...",Atmospheric Biomarkers and their Evolution ove...
8,http://arxiv.org/abs/astro-ph/0606733v1,2006-06-29T19:04:52Z,2006-06-29T19:04:52Z,Characteristics of proposed 3 and 4 telescope ...,The Darwin and TPF-I missions are Infrared fre...,"[{'name': 'L. Kaltenegger'}, {'name': 'M. Frid...",10.1017/S1743921306009410,['http://dx.doi.org/10.1017/S1743921306009410'...,"4 pages, 2 figures",Proceedings IAUC200: Direct Imaging of Exoplan...,http://arxiv.org/abs/astro-ph/0606733v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['Darwin', 'TPF', 'mission', 'Infrared', 'free...","['Characteristics', 'propose', 'telescope', 'c...","['Characteristics', 'propose', 'telescope', 'c...",Characteristics of proposed 3 and 4 telescope ...
9,http://arxiv.org/abs/astro-ph/0606762v1,2006-06-30T18:24:16Z,2006-06-30T18:24:16Z,Interferometric Space Missions for the Search ...,The requirements on space missions designed to...,"[{'name': 'L Kaltenegger'}, {'name': 'M. Fridl...",10.1007/s10509-006-9183-z,['http://dx.doi.org/10.1007/s10509-006-9183-z'...,"21 pages, 8 figures; TBP in Astrophysics and S...","Astrophys.Space Sci.306:147-158,2006",http://arxiv.org/abs/astro-ph/0606762v1,astro-ph,"['astro-ph', 'astro-ph.EP', 'astro-ph.IM']",,,"['requirement', 'space', 'mission', 'design', ...","['Interferometric', 'Space', 'Missions', 'Sear...","['Interferometric', 'Space', 'Missions', 'Sear...",Interferometric Space Missions for the Search ...


In [47]:
# pred_labels_summary = predict_labels(df.summary)   # predict for abstracts only
# df['predicted_labels_summary'] = pred_labels_summary

In [48]:
# compare predictions with and without titles (agreement between the two runs)
# agreement = (df['predicted_labels_combined'] == df['predicted_labels_summary']).mean()
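
Comparing the two prediction columns with each other measures how often the two runs agree, not accuracy; accuracy would need reference labels. A minimal pure-Python sketch of the agreement check follows. The label lists are invented stand-ins for the two prediction columns, and the function names are hypothetical.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Share of positions where both prediction lists give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("prediction lists must have the same length")
    if not labels_a:
        raise ValueError("prediction lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def disagreement_counts(labels_a, labels_b):
    """Count each (label_a, label_b) pair where the two runs disagree."""
    return Counter((a, b) for a, b in zip(labels_a, labels_b) if a != b)


# Invented example predictions (NOT results from the notebook).
with_titles = ["astro-ph.IM", "astro-ph.IM", "Other Physics", "astro-ph.IM"]
without_titles = ["astro-ph.IM", "Other Physics", "Other Physics", "astro-ph.IM"]

print(agreement_rate(with_titles, without_titles))
print(disagreement_counts(with_titles, without_titles))
```

The disagreement counts show in which direction adding the title shifts the predictions, which a single agreement number hides.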

## ToDo: add more texts on different topics