Publication topic: Computation and Language (cs.CL), covering LLMs, machine translation, sentiment analysis, speech recognition, etc.

### 1. Fetching data from the arXiv API

In [50]:
import arxiv
import pandas as pd

def fetch_arxiv_papers(query="cat:cs.CL", max_results=300):
    client = arxiv.Client()
    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate  # sort by submission date
    )

    papers_data = []
    
    print(f"Fetching {max_results} papers...")
    
    results = client.results(search)
    for result in results:
        papers_data.append({
            "title": result.title,
            "authors": ", ".join([author.name for author in result.authors]),
            "summary": result.summary.replace("\n", " "),
            "url": result.entry_id
        })
        
    df = pd.DataFrame(papers_data)
    print(f"Fetched {len(df)} papers.")
    return df

df_papers = fetch_arxiv_papers()

df_papers.head()

Fetching 300 papers...
Fetched 300 papers.


| | title | authors | summary | url |
|---|---|---|---|---|
| 0 | Revisiting Generalization Across Difficulty Le... | Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana ... | We investigate how well large language models ... | http://arxiv.org/abs/2511.21692v1 |
| 1 | ToolOrchestra: Elevating Intelligence via Effi... | Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Li... | Large language models are powerful generalists... | http://arxiv.org/abs/2511.21689v1 |
| 2 | G$^2$VLM: Geometry Grounded Vision Language Mo... | Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran,... | Vision-Language Models (VLMs) still lack robus... | http://arxiv.org/abs/2511.21688v1 |
| 3 | Matrix: Peer-to-Peer Multi-Agent Synthetic Dat... | Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh,... | Synthetic data has become increasingly importa... | http://arxiv.org/abs/2511.21686v1 |
| 4 | The author is dead, but what if they never liv... | Anna Marklová, Ondřej Vinš, Martina Vokáčová, ... | Large language models are increasingly capable... | http://arxiv.org/abs/2511.21629v1 |


### 2. Splitting into chunks

In [52]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def create_documents_and_chunks(df):
    documents = []
    
    for _, row in df.iterrows():
        content = f"Title: {row['title']}\nAuthors: {row['authors']}\nAbstract: {row['summary']}"
        documents.append(Document(page_content=content, metadata={"source": row["url"], "title": row["title"]}))

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2500,
        chunk_overlap=250,
        separators=["\n\n", "\n", " ", ""]
    )
    
    chunks = text_splitter.split_documents(documents)
    
    print(f"Source documents: {len(documents)}")
    print(f"Number of chunks: {len(chunks)}")
    return chunks

chunks = create_documents_and_chunks(df_papers)

print(f"\nSample chunk:\n{chunks[0].page_content[:500]}...")

Source documents: 300
Number of chunks: 300

Sample chunk:
Title: Revisiting Generalization Across Difficulty Levels: It's Not So Easy
Authors: Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address t...
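
The chunk count equals the document count because every document (title, authors, abstract) fit into a single 2500-character chunk. The sketch below illustrates how `chunk_size` and `chunk_overlap` interact, using a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first tries to split on the listed separators, and `split_fixed` is a hypothetical helper introduced only for illustration.

```python
def split_fixed(text: str, chunk_size: int = 2500, chunk_overlap: int = 250) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a simplified
    illustration, not LangChain's separator-aware splitting.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short documents stay in one piece

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


short_doc = "x" * 1800   # shorter than chunk_size: one chunk
long_doc = "y" * 6000    # longer than chunk_size: several overlapping chunks

print(len(split_fixed(short_doc)))
print([len(c) for c in split_fixed(long_doc)])
```

With inputs this short, the choice of `chunk_overlap` has no effect on the result. It only matters once a document exceeds `chunk_size`.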


### 3. Text vectorization

In [121]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

vector_store = FAISS.from_documents(chunks, embedding_model)

retriever = vector_store.as_retriever(search_kwargs={"k": 3})

FAISS (Facebook AI Similarity Search) is a library for finding similar vectors.
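
As a rough mental model, an exhaustive similarity search compares the query vector with every stored vector and keeps the `k` closest ones. The sketch below does this in plain Python on toy 2-dimensional vectors. FAISS does the same job far more efficiently and also offers approximate indexes. The use of Euclidean distance is an assumption about the store's default metric, and `l2_distance` and `top_k` are hypothetical helper names.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
doc_vecs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
query_vec = [1.0, 0.0]

nearest = top_k(query_vec, doc_vecs, k=3)
print([i for i, _ in nearest])
```

The `k=3` here mirrors `search_kwargs={"k": 3}` in the retriever: only the three nearest chunks are passed on as context.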

### 4. Integration with the LLM

In [122]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_huggingface import HuggingFacePipeline
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

model_id = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    truncation=True,
    model_kwargs={"max_input_length": 512},
    temperature=0.1,  # only takes effect when do_sample=True; decoding is greedy by default
    repetition_penalty=1.15
)

llm = HuggingFacePipeline(pipeline=pipe)

template = """You are a helpful research assistant specializing in Computational Linguistics.
Use the following pieces of retrieved context to answer the user's question. 
If you don't know the answer based on the context, just say that you don't know based on the available papers.

Context:
{context}

Question:
{question}

Answer concisely and specifically citing the title of the paper if possible:"""

prompt = ChatPromptTemplate.from_template(template)

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Device set to use mps:0
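
The data flow through `rag_chain` can be sketched in plain Python: the same question goes both to the retriever and, unchanged, into the prompt. This is a simplified illustration with stand-in functions, not LangChain's implementation. `fake_retriever`, `fake_llm`, `build_prompt`, `run_chain`, and the shortened `TEMPLATE` are all hypothetical.

```python
TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector-store retriever; returns page contents directly.
    return [
        "Title: Paper A\nAbstract: toy abstract about tokenizers.",
        "Title: Paper B\nAbstract: toy abstract about agents.",
    ]


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the text2text pipeline; reports the prompt size instead of generating.
    return f"(model output for a {len(prompt_text)}-character prompt)"


def build_prompt(question: str, retrieve) -> str:
    docs = retrieve(question)                 # "context": retriever ...
    context = "\n\n".join(docs)               # ... | format_docs
    # "question": RunnablePassthrough() forwards the question unchanged
    return TEMPLATE.format(context=context, question=question)


def run_chain(question: str) -> str:
    prompt_text = build_prompt(question, fake_retriever)  # | prompt
    return fake_llm(prompt_text)                          # | llm | StrOutputParser()


print(run_chain("How are large language models optimized?"))
```

Because the generated answer depends entirely on what the retriever returns, printing the retrieved chunks for a query is a quick way to tell retrieval problems apart from generation problems.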


### 5.1 Typo handling

In [123]:
from rapidfuzz import process, fuzz

corpus_words = set()
if 'title' in df_papers.columns:
    for title in df_papers['title']:
        words = str(title).lower().split()
        corpus_words.update(words)
    
corpus_list = list(corpus_words)

print(f"Typo-correction vocabulary size: {len(corpus_list)} words")

def correct_query_typos(query, threshold=85):
    """Replace each query word with its closest vocabulary word if the match score reaches the threshold."""
    if not query or not corpus_list:
        return query

    words = query.split()
    corrected_words = []
    
    for word in words:
        # Leave short words and numbers untouched
        if len(word) < 4 or word.isdigit():
            corrected_words.append(word)
            continue
            
        # Best fuzzy match among the words taken from paper titles
        match = process.extractOne(word.lower(), corpus_list, scorer=fuzz.WRatio)
        
        if match:
            found_word, score, _ = match
            if score >= threshold:
                corrected_words.append(found_word)
            else:
                corrected_words.append(word)
        else:
            corrected_words.append(word)
            
    corrected_query = " ".join(corrected_words)
    
    if corrected_query.lower() != query.lower():
        print(f"Typo/substitution detected: '{query}' -> '{corrected_query}'")
    
    return corrected_query

Typo-correction vocabulary size: 1452 words


### 5.2 Testing on 5 examples

In [124]:
import textwrap

def compare_results(query, show_correction=True):
    print("="*80)
    
    try:
        final_query = correct_query_typos(query) if show_correction else query
    except Exception:
        final_query = query
        
    print(f"USER QUERY: {final_query}")
    
    print("\n--- [RAG Answer] ---")
    try:
        rag_response = rag_chain.invoke(final_query)
        print(textwrap.fill(rag_response, width=80))
    except Exception as e:
        print(f"Error RAG: {e}")
    
    print("\n--- [Pure LLM (No Context)] ---")
    try:
        pure_prompt = f"Question: {final_query}\nAnswer:"
        pure_response = llm.invoke(pure_prompt)
        print(textwrap.fill(pure_response, width=80))
    except Exception as e:
        print(f"Error Pure LLM: {e}")
    print("="*80 + "\n")

print("TEST 1: General Trend Extraction")
compare_results("What are the main topics discussed in these papers?")

print("TEST 2: Typo Handling")
compare_results("How are large languge models optimized?")

print("TEST 3: Specific Task Inquiry")
compare_results("Are there any papers about emotion recognition or sentiment analysis?")

print("TEST 4: Methodology Extraction")
compare_results("What datasets are used for evaluation in the retrieved papers?")

print("TEST 5: Specific Architecture Search")
compare_results("Is there any mention of reasoning capabilities in LLMs?")

TEST 1: General Trend Extraction
Typo/substitution detected: 'What are the main topics discussed in these papers?' -> 'what are the domain-adaptive topic is in the papers'
USER QUERY: what are the domain-adaptive topic is in the papers

--- [RAG Answer] ---
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

--- [Pure LLM (No Context)] ---
Domain-Adaptive Topic Models

TEST 2: Typo Handling
Typo/substitution detected: 'How are large languge models optimized?' -> 'How are large language models optimized?'
USER QUERY: How are large language models optimized?

--- [RAG Answer] ---
Abstract: We introduce a new tokenizer for language models that minimizes the
average tokens per character, thereby reducing the number of tokens needed to
represent text during training and to generate text during inference. Our
method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by
casting a length-weighted objective maximization as a graph partitioning pro

As the comparison shows, RAG gives more complete and specific answers, while the "pure" LLM produces generic phrases (for example, in Tests 3 and 5).  
With RAG, the model is able to cite papers (Test 2).

*How the typo-correction module performed*
- In Test 2 the system corrected languge -> language, which made it possible to find the relevant paper.
- Problem: the correction vocabulary was built only from paper titles, so it lacks common words (main, used). In addition, `fuzz.WRatio` takes partial (substring) matches into account, so a short word can score highly against a longer title word that contains it.
    - In Test 1 the word main was wrongly replaced with domain-adaptive, because that was the most similar word found in the titles.
    - In Test 4 the word used was replaced with confused.
  
=> The mechanism works, but the vocabulary needs to be extended with general English vocabulary so that simple queries are not distorted.
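
One possible way to apply this conclusion is sketched below: words already present in a general English word list or in the corpus vocabulary are left alone, and the rest are matched by whole-string similarity. To stay within the standard library, the sketch uses `difflib` in place of `rapidfuzz`; in `rapidfuzz` the analogous choice would likely be `fuzz.ratio` rather than `fuzz.WRatio`. The word lists, the `cutoff` value, and the name `correct_query` are assumptions made for illustration.

```python
import difflib
import string

# Hypothetical stand-ins: in the notebook, corpus_vocab would come from the paper
# titles and common_words from a general English word list.
corpus_vocab = ["language", "models", "domain-adaptive", "confused", "tokenizer"]
common_words = {"what", "are", "the", "main", "used", "how", "large", "for"}


def correct_query(query: str, cutoff: float = 0.8) -> str:
    """Correct only words that are unknown to both word lists."""
    corrected = []
    for token in query.split():
        core = token.rstrip(string.punctuation)  # word without trailing punctuation
        tail = token[len(core):]                 # the punctuation that was cut off
        lower = core.lower()

        # Keep short words, common English words, and exact vocabulary hits as typed.
        if len(core) < 4 or lower in common_words or lower in corpus_vocab:
            corrected.append(token)
            continue

        # Whole-string similarity, so "main" cannot match "domain-adaptive".
        matches = difflib.get_close_matches(lower, corpus_vocab, n=1, cutoff=cutoff)
        corrected.append(matches[0] + tail if matches else token)
    return " ".join(corrected)


print(correct_query("How are large languge models optimized?"))
print(correct_query("What are the main topics discussed in these papers?"))
```

On these toy word lists the misspelling from Test 2 is still corrected, while the query from Test 1 is returned unchanged.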

*Summary:*
1. RAG outperforms the pure LLM on tasks that require precise facts and work with a specific document collection.
2. The main weakness of RAG is errors at the retrieval stage, which lead to wrong answers generated from irrelevant context.
3. The typo-correction step makes the RAG system tolerant of misspelled queries (Test 2), although with the current title-only vocabulary it can also distort correctly spelled words (Tests 1 and 4).
4. A pure LLM without context is unsuitable for tasks that depend on specific data and produces only generic, predictable answers.
