<a href="https://colab.research.google.com/github/21092004Goda/data_anal/blob/main/RAG_system_lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Установка зависимостей**

In [61]:
%%capture
!pip install --quiet langchain_huggingface
!pip install --quiet sentence-transformers
!pip install --quiet faiss-cpu
!pip install --quiet arxiv

# **Классы**

## **Извлечение статей**

In [62]:
import arxiv
import time
import math
from typing import List, Dict, Any, Iterator

class ArxivTopicFetcher:

    def __init__(self, max_results_per_request: int = 100, delay_seconds: float = 3.0):
        self.client = arxiv.Client(
            page_size=max_results_per_request,
            delay_seconds=delay_seconds,
            num_retries=3
        )
        self.common_categories = {
            'machine_learning': 'cs.LG',
            'artificial_intelligence': 'cs.AI',
            'computer_vision': 'cs.CV',
            'nlp': 'cs.CL',
            'robotics': 'cs.RO',
            'databases': 'cs.DB',
            'security': 'cs.CR',
            'networks': 'cs.NI',
            'algorithms': 'cs.DS',
            'hci': 'cs.HC'
        }

    def _get_sort_criterion(self, sort_by: str) -> arxiv.SortCriterion:
        sort_map = {
            'relevance': arxiv.SortCriterion.Relevance,
            'lastUpdatedDate': arxiv.SortCriterion.LastUpdatedDate,
            'submittedDate': arxiv.SortCriterion.SubmittedDate
        }
        return sort_map.get(sort_by, arxiv.SortCriterion.SubmittedDate)

    def build_query(self, category: str) -> str:
        if category in self.common_categories:
            category = self.common_categories[category]
        return f"cat:{category}"

    def fetch_articles_paged(self, query: str, total_results: int = 2000,
                            sort_by: str = 'submittedDate',
                            sort_order: str = 'descending',
                            batch_size: int = 500) -> List[Dict[str, Any]]:

        batch_size = min(batch_size, 2000)

        all_articles = []
        total_fetched = 0

        iterations = math.ceil(total_results / batch_size) if batch_size > 0 else 0

        for iteration in range(iterations):
            remaining = total_results - total_fetched
            current_batch = min(batch_size, remaining)

            if current_batch <= 0:
                break

            start_num = total_fetched
            end_num = start_num + current_batch - 1
            print(f"Загружаю статьи {start_num}-{end_num} (пачка {iteration + 1}/{iterations})...")

            sort_order_obj = (arxiv.SortOrder.Descending if sort_order == 'descending'
                             else arxiv.SortOrder.Ascending)

            try:
                search = arxiv.Search(
                    query=query,
                    max_results=current_batch,
                    sort_by=self._get_sort_criterion(sort_by),
                    sort_order=sort_order_obj
                )

                batch_articles = []
                results = list(self.client.results(search))

                for result in results:
                    batch_articles.append({
                        'arxiv_id': result.entry_id.split('/')[-1],
                        'title': result.title,
                        'authors': [a.name for a in result.authors],
                        'abstract': result.summary.replace('\n', ' '),
                        'published': result.published.date() if result.published else None,
                        'categories': result.categories,
                        'pdf_url': result.pdf_url
                    })

                all_articles.extend(batch_articles)
                total_fetched += len(batch_articles)

                print(f"Получено статей в пачке: {len(batch_articles)}. Всего: {total_fetched}")

                if len(batch_articles) < current_batch:
                    print("Достигнут конец списка результатов.")
                    break

                if iteration < iterations - 1 and len(batch_articles) > 0:
                    time.sleep(2.0)
                    print(f"Пауза 2 секунды перед следующей пачкой...")

            except Exception as e:
                print(f"Ошибка при загрузке пачки {iteration + 1}: {e}")
                continue

        print(f"\n✅ Загрузка завершена. Всего получено статей: {total_fetched}")
        return all_articles

    def fetch_articles_streaming(self, query: str, max_results: int = 1000,
                                sort_by: str = 'submittedDate',
                                sort_order: str = 'descending',
                                batch_size: int = 100) -> Iterator[Dict[str, Any]]:

        batch_size = min(batch_size, 2000)
        fetched = 0

        while fetched < max_results:
            remaining = max_results - fetched
            current_batch = min(batch_size, remaining)

            if current_batch <= 0:
                break

            search = arxiv.Search(
                query=query,
                max_results=current_batch,
                sort_by=self._get_sort_criterion(sort_by),
                sort_order=(arxiv.SortOrder.Descending if sort_order == 'descending'
                           else arxiv.SortOrder.Ascending)
            )

            try:
                batch_count = 0
                for result in self.client.results(search):
                    article = {
                        'arxiv_id': result.entry_id.split('/')[-1],
                        'title': result.title,
                        'authors': [a.name for a in result.authors],
                        'abstract': result.summary.replace('\n', ' '),
                        'published': result.published.date() if result.published else None,
                        'categories': result.categories,
                        'pdf_url': result.pdf_url
                    }
                    yield article
                    fetched += 1
                    batch_count += 1

                    if fetched >= max_results:
                        break

                print(f"Загружено {batch_count} статей. Всего: {fetched}")

                if batch_count < current_batch:
                    print("Достигнут конец списка результатов.")
                    break

                if fetched < max_results:
                    time.sleep(2.0)

            except Exception as e:
                print(f"Ошибка при загрузке: {e}")
                break

        print(f"Завершено. Всего загружено: {fetched}")

    def fetch_articles(self, query: str, max_results: int = 50,
                      sort_by: str = 'submittedDate',
                      sort_order: str = 'descending',
                      **kwargs) -> List[Dict[str, Any]]:

        batch_size = kwargs.get('batch_size', min(max_results, 100))

        return self.fetch_articles_paged(
            query=query,
            total_results=max_results,
            sort_by=sort_by,
            sort_order=sort_order,
            batch_size=batch_size
        )

    def fetch_by_category(self, category: str, max_results: int = 50,
                         batch_size: int = 100, **kwargs) -> List[Dict[str, Any]]:

        query = self.build_query(category)

        sort_by = kwargs.get('sort_by', 'submittedDate')
        sort_order = kwargs.get('sort_order', 'descending')

        return self.fetch_articles_paged(
            query=query,
            total_results=max_results,
            sort_by=sort_by,
            sort_order=sort_order,
            batch_size=batch_size
        )

    def print_summary(self, articles: List[Dict[str, Any]], n: int = 5):
        if not articles:
            print("Пусто.")
            return

        print("\n=== Короткий обзор ===")
        for idx, a in enumerate(articles[:n]):
            print(f"\n{idx+1}. {a['title']}")
            print("   Авторы:", ", ".join(a['authors'][:3]) + (" и др." if len(a['authors']) > 3 else ""))
            print("   Дата:", a['published'])
            print("   ID:", a['arxiv_id'])
            print("   Категории:", ", ".join(a['categories']))
            print("   Абстракт:", a['abstract'][:200], "...")

## **Разбиение на чанки**

In [63]:
import pandas as pd
import matplotlib.pyplot as plt

class TextChunker:

    def __init__(self, chunk_size: int = 500, overlap: int = 50):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_text(self, text: str):
        if not text or not isinstance(text, str):
            return []

        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + self.chunk_size
            chunk_words = words[start:end]
            if not chunk_words:
                break
            chunks.append(" ".join(chunk_words))
            start = end - self.overlap

        return chunks

    def chunk_many(self, texts):
        return [self.chunk_text(t) for t in texts]

    def to_dataframe(self, articles):
        rows = []
        for a in articles:
            chunks = self.chunk_text(a["abstract"])
            rows.append({
                "id": a["arxiv_id"],
                "title": a["title"],
                "authors": ", ".join(a["authors"]),
                "published": a["published"],
                "categories": ", ".join(a["categories"]),
                "pdf_url": a["pdf_url"],
                "abstract": a["abstract"],
                "chunks": chunks
            })
        return pd.DataFrame(rows)

    def chunk_statistics(self, df: pd.DataFrame, plot: bool = False):
        chunk_counts = df["chunks"].apply(len)
        stats = {
            "Total articles": len(df),
            "Total chunks": int(chunk_counts.sum()),
            "Min chunks per article": int(chunk_counts.min()),
            "Max chunks per article": int(chunk_counts.max()),
            "Mean chunks per article": float(chunk_counts.mean()),
            "Median chunks per article": float(chunk_counts.median())
        }

        print("\n=== Chunking Statistics ===")
        for k, v in stats.items():
            print(f"{k}: {v}")

        if plot:
            plt.figure(figsize=(8,4))
            plt.hist(chunk_counts, bins=range(1, chunk_counts.max()+2), alpha=0.7, color='skyblue', edgecolor='black')
            plt.title("Distribution of Chunks per Article")
            plt.xlabel("Number of Chunks")
            plt.ylabel("Number of Articles")
            plt.xticks(range(1, chunk_counts.max()+2))
            plt.show()

        return stats


## **Векторизация текста**

In [64]:
import numpy as np
import pandas as pd
import faiss
from langchain_huggingface import HuggingFaceEmbeddings

class ArxivVectorPipeline:

    def __init__(
        self,
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
        normalize=False
    ):
        self.embeddings_model = HuggingFaceEmbeddings(
            model_name=model_name,
            model_kwargs={"device": device},
            encode_kwargs={"normalize_embeddings": normalize}
        )

        self.index = None
        self.embedding_dim = None
        self.chunks_df = None
        self.embeddings = None

    def _flatten_chunks(self, df: pd.DataFrame):
        rows = []
        for _, row in df.iterrows():
            base = {
                "id": row["id"],
                "title": row["title"],
                "authors": row["authors"],
                "published": row["published"],
                "categories": row["categories"],
                "pdf_url": row["pdf_url"],
                "abstract": row["abstract"]
            }

            for idx, ch in enumerate(row["chunks"]):
                rows.append({
                    **base,
                    "chunk_id": idx,
                    "text_chunk": ch
                })

        return pd.DataFrame(rows)

    def _embed(self, texts):
        if not texts:
            return np.array([])

        emb = self.embeddings_model.embed_documents(texts)
        return np.array(emb, dtype=np.float32)

    def build(self, df: pd.DataFrame):
        self.chunks_df = self._flatten_chunks(df)
        texts = self.chunks_df["text_chunk"].tolist()
        self.embeddings = self._embed(texts)
        self.embedding_dim = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(self.embedding_dim)
        self.index.add(self.embeddings)
        return self

    def search(self, query: str, top_k: int = 5):
        if self.index is None:
            raise ValueError("Индекс пуст. Сначала вызови build().")

        q_emb = self._embed([query])
        distances, indices = self.index.search(q_emb, top_k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            row = self.chunks_df.iloc[int(idx)]
            results.append({
                "distance": float(dist),
                "chunk_id": int(row["chunk_id"]),
                "text_chunk": row["text_chunk"],
                "article": {
                    "id": row["id"],
                    "title": row["title"],
                    "authors": row["authors"],
                    "categories": row["categories"],
                    "published": row["published"],
                    "abstract": row["abstract"],
                    "pdf_url": row["pdf_url"]
                }
            })

        return results


## **Интеграция с LLM**

In [65]:
from google import genai


class LLMClient:

    def __init__(self, api_key: str, model: str = "gemini-2.5-flash"):
        self.client = genai.Client(api_key=api_key)
        self.model = model

    def ask(self, prompt: str) -> str:

        response = self.client.models.generate_content(
            model=self.model,
            contents=f'"role": "user", "content": "{prompt}"'
        )
        return response.text


## **RAG-QA по корпусу статей**

In [66]:
class ArxivQA:

    def __init__(self, vector_pipeline, llm_client, top_k=5):
        self.vec = vector_pipeline
        self.llm = llm_client
        self.top_k = top_k

    def answer(self, query: str) -> str:
        hits = self.vec.search(query, top_k=self.top_k)
        context = "\n\n".join(chunk["text_chunk"] for chunk in hits)

        prompt = f"""
You are a smart assistant. Here is the context from scientific articles:

{context}

Now answer the user's question:
{query}

Answer clearly and concisely.
        """

        return self.llm.ask(prompt)


## **поиск и RAG**

In [67]:
import numpy as np


class SearchEngine:

    def __init__(self, vector_pipeline, llm_client):
        self.vec = vector_pipeline
        self.llm = llm_client
        self._build_spell_corpus()

    def _build_spell_corpus(self):
        words = set()
        for txt in self.vec.chunks_df["text_chunk"]:
            for w in txt.lower().split():
                if w.isalpha():
                    words.add(w)
        self.corpus_words = list(words)

    def vector_search(self, query: str, top_k: int = 5):
        return self.vec.search(query, top_k)

    def build_prompt(self, query: str, context: str, template=None):
        if template is None:
            template = """
User query: "{query}"

Use ONLY this context:
-----------------
{context}
-----------------

Answer clearly and factually.
"""
        return template.format(query=query, context=context)

    def ask_rag(self, query: str, template=None, top_k=5):
        hits = self.vector_search(query, top_k)
        context = "\n\n".join(h["text_chunk"] for h in hits)
        prompt = self.build_prompt(query, context, template)
        return self.llm.ask(prompt)

    def ask_vanilla(self, query: str):
        return self.llm.ask(query)


# **Проверка выполнение**

In [69]:
fetcher = ArxivTopicFetcher()
articles = fetcher.fetch_by_category(
    category='machine_learning',
    max_results=4000,
    batch_size=1000
)
fetcher.print_summary(articles, n=3)

Загружаю статьи 0-999 (пачка 1/4)...
Получено статей в пачке: 1000. Всего: 1000
Пауза 2 секунды перед следующей пачкой...
Загружаю статьи 1000-1999 (пачка 2/4)...
Получено статей в пачке: 1000. Всего: 2000
Пауза 2 секунды перед следующей пачкой...
Загружаю статьи 2000-2999 (пачка 3/4)...
Получено статей в пачке: 1000. Всего: 3000
Пауза 2 секунды перед следующей пачкой...
Загружаю статьи 3000-3999 (пачка 4/4)...
Получено статей в пачке: 1000. Всего: 4000

✅ Загрузка завершена. Всего получено статей: 4000

=== Короткий обзор ===

1. ThetaEvolve: Test-time Learning on Open Problems
   Авторы: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng и др.
   Дата: 2025-11-28
   ID: 2511.23473v1
   Категории: cs.LG, cs.CL
   Абстракт: Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open ...

2. SmallWorlds: Assessing Dynamics Understanding of World Models in Iso

In [70]:
chunker = TextChunker(chunk_size=150, overlap=15)

df = chunker.to_dataframe(articles)

print(df.head())
print(df["chunks"].iloc[0][:2])

             id                                              title  \
0  2511.23473v1   ThetaEvolve: Test-time Learning on Open Problems   
1  2511.23465v1  SmallWorlds: Assessing Dynamics Understanding ...   
2  2511.23455v1  The Price of Progress: Algorithmic Efficiency ...   
3  2511.23449v1  Physics-Informed Neural Networks for Thermophy...   
4  2511.23443v1  Provable Benefits of Sinusoidal Activation for...   

                                             authors   published  \
0  Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva X...  2025-11-28   
1  Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Y...  2025-11-28   
2  Hans Gundlach, Jayson Lynch, Matthias Mertens,...  2025-11-28   
3                         Ali Waseem, Malcolm Mielle  2025-11-28   
4                         Tianlong Huang, Zhiyuan Li  2025-11-28   

                   categories                             pdf_url  \
0                cs.LG, cs.CL  https://arxiv.org/pdf/2511.23473v1   
1                       cs.LG  h

In [71]:
stats = chunker.chunk_statistics(df)
print(stats)


=== Chunking Statistics ===
Total articles: 4000
Total chunks: 7356
Min chunks per article: 1
Max chunks per article: 3
Mean chunks per article: 1.839
Median chunks per article: 2.0
{'Total articles': 4000, 'Total chunks': 7356, 'Min chunks per article': 1, 'Max chunks per article': 3, 'Mean chunks per article': 1.839, 'Median chunks per article': 2.0}


In [72]:
pipeline = ArxivVectorPipeline(device="cpu")
pipeline.build(df)

results = pipeline.search("reinforcement learning for robots", top_k=5)

for r in results:
    print("\n---")
    print("Distance:", r["distance"])
    print("Chunk:", r["text_chunk"][:200], "…")
    print("Article title:", r["article"]["title"])
    print("ID:", r["article"]["id"])


---
Distance: 0.9049038290977478
Chunk: We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also the …
Article title: First-order Sobolev Reinforcement Learning
ID: 2511.19165v1

---
Distance: 0.9049038290977478
Chunk: We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also the …
Article title: First-order Sobolev Reinforcement Learning
ID: 2511.19165v1

---
Distance: 0.9049038290977478
Chunk: We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also the …
Article title: First-order Sobolev Reinforcement Learning
ID: 2511.19165v1

---
Distance: 0.9049038290977478
Chunk: We

In [73]:
llm = LLMClient(api_key="AIzaSyAMrgl2EgMFzZsQvPruQAP3skf5vf6rx5c")

summary = llm.ask(
    f"Сделай короткий хардкорный конспект статьи: {articles[0]['abstract']}"
)

print(summary)


**ThetaEvolve: Хардкорный конспект**

ThetaEvolve — открытый фреймворк, упрощающий и расширяющий AlphaEvolve для масштабирования in-context learning и RL-обучения *во время тестирования*, позволяя LLM непрерывно обучаться и интернализировать стратегии улучшения открытых оптимизационных задач.

**Ключевые отличия от AlphaEvolve (закрытая, ансамбли LLM, только инференс):**

*   **Модель:** Использует **одну LLM** (даже малую открытую, DeepSeek-R1-0528-Qwen3-8B).
*   **Обучение:** **RL во время тестирования**, модель учится, а не просто выводит.
*   **Особенности:** Большая база программ (эксплорация), батчевая выборка (пропускная способность), "ленивые" штрафы (от застоя), опциональное формирование награды (стабильность обучения).

**Результаты:**

*   **Новые границы:** Первая открытая система, позволившая *малой LLM* (DeepSeek-8B) достичь **новых наилучших известных границ** для проблем AlphaEvolve (упаковка кругов, неравенство первой автокорреляции).
*   **Превосходство RL:** RL-обуче

In [74]:
llm_model = LLMClient(api_key="AIzaSyAMrgl2EgMFzZsQvPruQAP3skf5vf6rx5c")

qa = ArxivQA(pipeline, llm_model)

# Каковы новейшие теоретические основы, объясняющие, почему и когда глубокие нейронные сети хорошо обобщаются, несмотря на наличие миллионов параметров?
ans = qa.answer("What are the latest theoretical frameworks explaining why and when deep neural networks generalize well despite having millions of parameters?")
print(ans)


While a unified theory is still lacking, one recent theoretical framework aims to explain why deep neural networks generalize well, especially in overparameterized settings:

1.  **Linear Stability Framework:** This framework analyzes the behavior of optimization algorithms like SGD, random perturbations, and Sharpness-Aware Minimization (SAM).
2.  **Coherence Measure:** Central to this framework is a coherence measure that quantifies how gradient curvature aligns across data points. This measure helps to reveal why certain flatter or simpler minima are stable and favored during training, which is linked to better generalization.

This approach suggests that the dynamics of optimization algorithms inherently prefer solutions with specific curvature properties that promote generalization.


In [75]:
search = SearchEngine(pipeline, llm_model)

resp = search.ask_rag("What environments were used to test the RL system?")
print(resp)

The RL system was tested in classic control, Atari games, and MuJoCo environments.


In [76]:
q = "In which areas of domain knowledge does the system leverage expertise to improve RL learning?"

print("=== Vanilla LLM ===")
print(search.ask_vanilla(q))

print("\n=== RAG ===")
print(search.ask_rag(q))


=== Vanilla LLM ===
Domain knowledge is leveraged in Reinforcement Learning (RL) across various areas to significantly improve learning efficiency, safety, robustness, and overall performance. Instead of starting from scratch (tabula rasa learning), incorporating human expertise or existing knowledge can guide the agent, prevent catastrophic failures, and shape the learning process.

Here are the key areas where domain knowledge is leveraged in RL:

1.  **Reward Function Design:**
    *   **How:** Domain experts define and shape the reward signal to guide the agent towards desired behaviors and outcomes. This involves understanding the true objectives, intermediate milestones, and undesirable states.
    *   **Leverage:**
        *   **Goal Specification:** Clearly defining what "success" looks like (e.g., scoring points in a game, reaching a target without collision).
        *   **Shaping Rewards:** Adding dense, well-structured reward signals for progress towards the goal, even if t

In [77]:
prompt_1 = """
Provide a scientific explanation grounded strictly in the supplied corpus.

Query: "{query}"

Base your answer ONLY on the information in:
{context}

If details are absent in the corpus, respond that no relevant evidence is present.
"""
print(search.ask_rag("What advantages does the RL system demonstrate in classic control, Atari, and MuJoCo environments?", template=prompt_1))


In classic control, Atari games, and MuJoCo environments, the RL system demonstrates the following advantages:

*   It outperforms all baselines on eight out of fifteen tasks when compared to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO).
*   It demonstrates substantial gains when provided with domain knowledge.


In [78]:
prompt_2 = """
Your task is to extract factual statements from the given context
and use only those facts to answer the user’s question.

Question: "{query}"

Relevant extracted facts must come solely from:
{context}

Do not infer or extend beyond what is explicitly stated.
"""
print(search.ask_rag("Which types of RL tasks are mentioned as test cases for the system?", template=prompt_2))


The types of RL tasks mentioned as test cases for the system are classic control, Atari games, and MuJoCo environments.


In [79]:
prompt_3 = """
Explain the answer with simple analogies suitable for a newcomer,
but rely strictly on data from the context.

Question: "{query}"

Context to use:
{context}

If the context lacks information needed for the answer,
state that the corpus does not cover this topic.
"""
print(search.ask_rag("How does combining semantic information with numerical metrics affect the performance of the RL system?", template=prompt_3))


Based on the context, combining semantic information with numerical metrics positively affects the performance of the RL system.

Imagine you're trying to build a LEGO model:
*   **Numerical metrics** are like the instruction manual that tells you exactly "use 2x4 block here," "use 1x1 stud there." It's precise, but sometimes doesn't tell you *why* a certain piece goes where it does or how it contributes to the overall structure.
*   **Semantic information (or domain knowledge)** is like having an experienced LEGO builder guide you, saying things like, "This section is the wing, so it needs to be light and stable," or "This piece will make the model more balanced." It gives context and understanding beyond just the numbers.

When you **combine both** (follow the precise instructions AND understand the purpose and function of each part), the RL system performs much better. Specifically, the text states it:

*   **Outperforms other systems:** It "outperforms all baselines on eight out of

In [80]:
prompt_4 = """
Compose a brief analytical report (3–4 sentences)
based exclusively on the provided material.

Question: "{query}"

Use ONLY the content below:
{context}

Avoid adding external knowledge or assumptions.
"""
print(search.ask_rag("What is the main goal of the SmallWorlds benchmark described in the paper?", template=prompt_4))


The SmallWorld Benchmark was introduced to address the lack of a unified and controlled setting for systematically evaluating world models and assessing their ability to capture underlying environment dynamics. Its main goal is to assess world model capability under isolated and precisely controlled dynamics, without relying on handcrafted reward signals. Using this benchmark, experiments reveal how effectively models capture environment structure and how their predictions deteriorate over extended rollouts. This highlights the strengths and limitations of current modeling paradigms and offers insights for future improvement in representation learning and dynamics modeling.
