# Bonus Quest

**Difficulty:** A

**Description:** After the grading formula for assignments was changed, students are in a tough spot and fear going into the exam without a 3.5 GPA. The system gives players a chance to raise their score by completing this bonus quest. This is your Solo Leveling. Survive at all costs. Good luck!

**Goal:** Complete the bonus assignment created by Andrei and corrected by Max.

**Deliverables:**
- Jupyter Notebook (ipynb) file with solution and all cell outputs
- CSV file with model predictions
- Both files uploaded to GitHub repository

**Reward:**
- Bonus points for the Assignment part.
- Title “The one who overcomes the difficulties of fate.”
- +1000 EXP in mastering sklearn
- Skill Upgrade «ML Engineering Lv.2»
- Special Item: [???]

---

## Problem Statement

As a dataset, use Russian news from Balto-Slavic Natural Language Processing 2019 (helsinki.fi). Entities of interest: PER, ORG, LOC, EVT, PRO (see Guidelines_20190122.pdf (helsinki.fi)).

It is sufficient to use 9 documents about Brexit from the sample provided by the organizers.

## Approach

This assignment combines traditional ML methods (using scikit-learn) with modern LLM-based approaches (DeepSeek) for comparison. You will:
1. Formulate the problem as a machine learning task
2. Prepare features and split data appropriately
3. Train and compare multiple models using scikit-learn
4. Evaluate models using proper train/test splits
5. Compare ML model performance with DeepSeek responses
6. Analyze results in terms of course concepts (bias-variance tradeoff, overfitting, generalization)


---

## My solution notebook
This notebook adds code + explanations to the provided tasks.


In [None]:
# Core libs
import os
from pathlib import Path
import re
import numpy as np
import pandas as pd

# ML
from sklearn.model_selection import GroupShuffleSplit, train_test_split, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

# Viz
import matplotlib.pyplot as plt

## Dataset for this notebook

This solution is configured to work with the provided **BSNLP RU Brexit** dataset:

- `bsnlp_ru_brexit_dataset.csv` (recommended — contains exactly the columns required by the assignment)
- `bsnlp_ru_brexit_dataset_full.csv` (optional — same rows + extra metadata columns)

Put the CSV next to this notebook (same folder) or inside a `data/` folder, then run the notebook top-to-bottom.


Example of one document:

ru-10

ru

2018-09-20

https://rg.ru/2018/09/20/tereza-mej-rasschityvaet-usidet-v-sedle-do-zaversheniia-procedury-brexit.html

Theresa May expects to stay in the saddle until the completion of the Brexit procedure
However, according to British media reports, at the upcoming Conservative Party conference at the end of September, May's opponents will give her a serious fight, from which it is not certain that she will emerge victorious. The bookmakers' favorite as a possible replacement for the current prime minister, former British Foreign Secretary Boris Johnson intends to deliver an alternative report that will leave no stone unturned from the government's views on the conditions of "Brexit". From Johnson's point of view, "London has wrapped the British constitution in a suicide belt and handed the detonator to Michel Barnier (Brussels' chief Brexit negotiator. - Ed.)". It is with this metaphor that the head of the British government will have to fight at the conference.


### Task 1
**Problem Formulation & ML Perspective**

Describe the task from both NLP and ML perspectives:
- What kind of machine learning problem is this? (classification, sequence labeling, etc.)
- How can this be formulated as a supervised learning problem?
- What classical ML methods exist for solving it? (e.g., logistic regression, naive Bayes, SVM with text features)
- How can it be solved using modern LLMs like DeepSeek?
- What are the assumptions of different model classes? (e.g., linear models vs. more complex approaches)
- How is model quality typically evaluated in this task? What metrics are appropriate and why?


### Solution / explanation (Task 1)

In essence, the data looks like this:

- **document_text**: the text of the document (news item/article)
- **entity**: "which entity type / what exactly we want to extract"
- **gold_answer**: the correct answer (an answer string)

This can be treated as a **supervised learning** "text → label" task (classification),
where the input is (document text + query/entity type) and the target variable is gold_answer.

If gold_answer is a string from a fixed set of possible answers, this is **multiclass classification**.
If gold_answer can be empty/`NONE`, that is also a class.

Classical ML approaches:
- Bag-of-Words / TF-IDF on `document_text` (+ one-hot on `entity`) → Logistic Regression / Linear SVM / Naive Bayes.
- A more "NLP-style" formulation would be sequence labeling (NER), but that requires token-level annotation.
In our setup, document-level classification (or retrieval + classification) is simpler and more honest.

LLM (DeepSeek):
- Use a prompt such as: "here is a document and an entity type; return the answer strictly as a single value".
This is generative extraction, essentially Information Extraction via instruction following.

Model assumptions:
- Linear models (LogReg/LinearSVC): the score is a weighted sum of features; they work well on sparse text vectors.
- Naive Bayes: conditional independence of features given the class; often a strong baseline on text.
- LLM: needs no hand-crafted features, but is more expensive, less reproducible (temperature/randomness), and harder to interpret.

Metrics:
- For classification by exact match of the answer: **accuracy** and **macro-F1** (especially if classes are imbalanced).
- For a stricter comparison, look at per-class F1 and the confusion matrix.
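
To make the metric choice concrete, here is a minimal pure-Python sketch of why macro-F1 is stricter than accuracy under class imbalance. The toy labels are invented for illustration and are not taken from the dataset; `per_class_f1` and `macro_f1` are hypothetical helpers, not the sklearn implementation.

```python
from collections import Counter

def per_class_f1(gold, pred):
    """Return {label: F1} computed from exact-match counts."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        scores[lab] = 2 * tp[lab] / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)

# Toy example (invented): 8 rows of one class, 2 of another, and a majority-class predictor
gold = ["PER"] * 8 + ["ORG"] * 2
pred = ["PER"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"accuracy = {accuracy:.3f}")              # 0.800
print(f"macro-F1 = {macro_f1(gold, pred):.3f}")  # 0.444
```

The majority-class predictor looks acceptable on accuracy but loses half of its macro-F1 because the minority class gets an F1 of zero.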

### Task 2
**Data Loading & Preparation**

Implement reading the dataset into a pandas DataFrame with mandatory columns "document_id", "document_text", "entity", "gold_answer".

Then prepare the data for ML:
- Create features from text (e.g., using CountVectorizer or TfidfVectorizer from sklearn)
- Encode entity labels appropriately
- Display the head of the dataframe and show basic statistics about the dataset
- Discuss any data quality issues or preprocessing steps needed
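
A few data-quality checks could be sketched like this before building features. The toy rows and the `quality_report` helper are assumptions made for illustration; only the four column names come from the assignment.

```python
import pandas as pd

# Toy frame with the four mandatory columns (all values are invented placeholders)
toy = pd.DataFrame({
    "document_id": ["ru-10", "ru-10", "ru-10", "ru-11"],
    "document_text": ["text a", "text a", "text a", "   "],
    "entity": ["q1", "q1", "q2", "q3"],
    "gold_answer": ["A", "A", "B", "C"],
})

def quality_report(df: pd.DataFrame) -> dict:
    """Count a few common problems; these are heuristics, not a full audit."""
    key = ["document_id", "entity"]
    # Rows that repeat an earlier (document_id, entity) pair
    n_dup = int(df.duplicated(subset=key).sum())
    # Rows whose document text is empty after stripping whitespace
    n_empty = int((df["document_text"].str.strip() == "").sum())
    # (document_id, entity) pairs annotated with more than one distinct answer
    n_conflict = int((df.groupby(key)["gold_answer"].nunique() > 1).sum())
    return {
        "duplicate_pairs": n_dup,
        "empty_texts": n_empty,
        "conflicting_labels": n_conflict,
    }

report = quality_report(toy)
print(report)
```

Duplicated pairs inflate the apparent dataset size, and conflicting labels put an upper bound on achievable accuracy, so both are worth reporting alongside the basic statistics.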


In [None]:
# === Task 2: Data Loading ===
# We will use the provided BSNLP RU Brexit dataset (CSV) created from BSNLP-2019 sample data.
# Place `bsnlp_ru_brexit_dataset.csv` next to this notebook (same folder), OR in `data/`.

from pathlib import Path

DATA_CANDIDATES = [
    Path("bsnlp_ru_brexit_dataset.csv"),
    Path("data/bsnlp_ru_brexit_dataset.csv"),
    Path("bsnlp_ru_brexit_dataset_full.csv"),   # optional (extra cols)
    Path("data/dataset.csv"),                   # legacy placeholder
]

DATA_PATH = next((p for p in DATA_CANDIDATES if p.exists()), DATA_CANDIDATES[0])
print("Using DATA_PATH:", DATA_PATH)

REQUIRED_COLS = ["document_id", "document_text", "entity", "gold_answer"]

def read_dataset(path: Path) -> pd.DataFrame:
    if not path.exists():
        raise FileNotFoundError(
            f"Dataset file not found: {path.resolve()}.\n"
            "Fix: put `bsnlp_ru_brexit_dataset.csv` next to this notebook (or into `data/`), "
            "then re-run this cell."
        )
    ext = path.suffix.lower()
    if ext == ".csv":
        df = pd.read_csv(path)
    elif ext in [".tsv", ".txt"]:
        df = pd.read_csv(path, sep="\t")
    elif ext == ".parquet":
        df = pd.read_parquet(path)
    elif ext == ".json":
        df = pd.read_json(path)
    else:
        raise ValueError(f"Unsupported file extension: {ext}")

    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}. Found: {list(df.columns)}")

    # Keep at least required columns; extra columns (url/title/lemma/...) are ok
    df = df.copy()
    for c in REQUIRED_COLS:
        df[c] = df[c].astype(str)

    # Simple document length features for later analysis
    df["doc_len_chars"] = df["document_text"].str.len()
    df["doc_len_words"] = df["document_text"].str.split().map(len)
    return df

df = read_dataset(DATA_PATH)

print("Dataset shape:", df.shape)
display(df.head())

print("\nUnique docs:", df["document_id"].nunique())
print("Unique entities:", df["entity"].nunique())

print("\nLabel distribution (gold_answer):")
display(df["gold_answer"].value_counts())

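print("\nMissing or empty values per required column:")
# read_dataset() casts the required columns to str, so NaN becomes the string "nan"
# and isna() would always report 0. Count empty and "nan" strings instead.
display(df[REQUIRED_COLS].apply(lambda s: s.str.strip().str.lower().isin(["", "nan"])).sum())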


### Task 3
**Train/Test Split & Data Splitting Strategy**

Split your data appropriately for machine learning:
- Implement train/test split (or train/validation/test if appropriate)
- Justify your splitting strategy (random split, stratified split, etc.)
- Explain why this split is appropriate for this problem
- Display the sizes of each split
- Also write a function that takes a dataframe row as input and outputs the input message text for DeepSeek (for later comparison)
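
The reason for splitting by document rather than by row can be shown with a small hand-rolled group split. This is a pure-Python stand-in written for illustration, not sklearn's `GroupShuffleSplit`; the `group_split` helper and the document IDs are invented.

```python
import random

def group_split(group_ids, test_size=0.2, seed=42):
    """Split row indices so that every group lands entirely in train or in test."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    # At least one group goes to test, otherwise there is nothing to evaluate on
    n_test = max(1, round(len(groups) * test_size))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx

# Invented layout: 9 documents with 3 rows each
doc_ids = [f"ru-{d}" for d in range(9) for _ in range(3)]
train_idx, test_idx = group_split(doc_ids)

train_docs = {doc_ids[i] for i in train_idx}
test_docs = {doc_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))  # 21 6
print(train_docs & test_docs)         # set() -> no document appears on both sides
```

A plain random row split would usually place rows from the same document on both sides, so the model could score well by memorizing document-specific vocabulary.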


In [None]:
# === Task 3: Splitting strategy ===
# Key idea: If the same document_id appears with multiple entity queries,
# we must NOT leak the document into both train and test.
# Therefore: split by groups using document_id.

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["document_id"]))

train_df = df.iloc[train_idx].reset_index(drop=True)
test_df  = df.iloc[test_idx].reset_index(drop=True)

print("Train:", train_df.shape, "Test:", test_df.shape)
print("Unique docs train:", train_df["document_id"].nunique(), "test:", test_df["document_id"].nunique())

# Prompt builder for DeepSeek
def make_deepseek_prompt(row: pd.Series, max_chars: int = 6000) -> str:
    # Trim long docs to stay within context
    text = row["document_text"]
    if len(text) > max_chars:
        text = text[:max_chars] + "\n...[TRUNCATED]..."
    entity = row["entity"]

    return (
        "You are an information extraction system.\n"
        "Given a document and an entity/query type, extract the answer.\n"
        "Return ONLY the answer string. If the answer is not present, return NONE.\n\n"
        f"ENTITY_TYPE: {entity}\n"
        f"DOCUMENT:\n{text}\n"
    )

# Example:
print(make_deepseek_prompt(train_df.iloc[0]))

### Task 4
**Model Training with scikit-learn**

Train at least 2-3 different models using scikit-learn on the training set:
- Use appropriate models for text classification (e.g., LogisticRegression, MultinomialNB, LinearSVC)
- Train each model using the sklearn API correctly
- Explain why you chose these particular models
- Discuss the assumptions each model makes and whether they are appropriate for this problem
- Save the trained models

**Also (for comparison):** Get DeepSeek responses for all documents. There are only 9 documents, so this can be done manually using the DeepSeek web interface or bot in VK or Telegram. Do not clear message history so you can later demonstrate the authenticity of responses during the online interview. Add DeepSeek responses to the dataframe.
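
If the DeepSeek responses are collected by hand, one way to attach them to the dataframe is a lookup keyed by `(document_id, entity)`. This is a sketch under assumptions: the `attach_responses` helper, the `deepseek_pred` column name as used here, and every value below are placeholders, not real model output.

```python
import pandas as pd

# Invented rows; in the notebook this would be the real dataframe
rows = pd.DataFrame({
    "document_id": ["ru-10", "ru-10", "ru-11"],
    "entity": ["q1", "q2", "q1"],
})

# Responses copied by hand from the chat, keyed by (document_id, entity).
# Keys and values are placeholders for illustration.
manual_responses = {
    ("ru-10", "q1"): " per ",
    ("ru-10", "q2"): "ORG",
}

def attach_responses(df: pd.DataFrame, responses: dict) -> pd.DataFrame:
    """Return a copy of df with a deepseek_pred column filled from the lookup."""
    out = df.copy()
    keys = zip(out["document_id"], out["entity"])
    # Missing keys become None so unanswered rows stay visible
    out["deepseek_pred"] = [responses.get(k) for k in keys]
    # Light normalization: trim whitespace and upper-case the answer
    out["deepseek_pred"] = out["deepseek_pred"].str.strip().str.upper()
    return out

with_preds = attach_responses(rows, manual_responses)
n_unanswered = int(with_preds["deepseek_pred"].isna().sum())
print(with_preds)
print("unanswered rows:", n_unanswered)
```

Keeping unanswered rows as missing values, rather than filling in a default, makes it obvious which prompts still need to be sent.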


In [None]:
# === Task 4: Training sklearn models ===

X_train = train_df[["document_text", "entity"]]
y_train = train_df["gold_answer"]

X_test = test_df[["document_text", "entity"]]
y_test = test_df["gold_answer"]

# Feature engineering: TF-IDF on text + one-hot on entity
preprocess = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(ngram_range=(1,2), min_df=1, max_df=0.95), "document_text"),
        ("entity", OneHotEncoder(handle_unknown="ignore"), ["entity"]),
    ],
    remainder="drop",
    sparse_threshold=0.3,
)

models = {
    "dummy_most_frequent": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=5000, n_jobs=None),
    "linear_svc": LinearSVC(class_weight='balanced'),
    "multinomial_nb": MultinomialNB(),
}

pipelines = {
    name: Pipeline(steps=[("preprocess", preprocess), ("clf", clf)])
    for name, clf in models.items()
}

for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    print(f"Trained: {name}")

# (Optional) persist models
import joblib
MODEL_DIR = Path("models")
MODEL_DIR.mkdir(exist_ok=True)

for name, pipe in pipelines.items():
    joblib.dump(pipe, MODEL_DIR / f"{name}.joblib")

In [None]:
# === Task 4 (part): DeepSeek inference (optional; run if you have an API key) ===
# What this does:
# - Calls DeepSeek (OpenAI-compatible) to predict a label for each (document_text, entity) pair.
# - Produces a new column: test_df["deepseek_pred"]
#
# Requirements:
#   pip install openai
#   export DEEPSEEK_API_KEY="..."   (Linux/macOS)
#   setx DEEPSEEK_API_KEY "..."    (Windows PowerShell, restart after)

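import os
import re
import time
from typing import Optional

LABELS = ("PER", "ORG", "LOC", "EVT", "PRO")

# Optional[str] instead of `str | None` keeps this cell runnable on Python < 3.10
def _normalize_label(x: str) -> Optional[str]: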
    if not isinstance(x, str):
        return None
    x = x.strip().upper()
    # allow JSON like {"label":"PER"} or plain "PER"
    m = re.search(r"(PER|ORG|LOC|EVT|PRO)", x)
    return m.group(1) if m else None

def deepseek_predict_row(row, model: str = "deepseek-chat") -> str:
    """Predict one label for a single row (document_text + entity).
    Returns one of: PER/ORG/LOC/EVT/PRO
    """
    # Import lazily so the notebook works even without openai installed
    from openai import OpenAI

    api_key = os.environ.get("DEEPSEEK_API_KEY")
    if not api_key:
        raise RuntimeError("DEEPSEEK_API_KEY is not set. Set it in your environment to run DeepSeek inference.")

    client = OpenAI(api_key=api_key, base_url="https://api.deepseek.com")

    prompt = make_deepseek_prompt(row)

    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a strict classifier. Output ONLY one label: PER, ORG, LOC, EVT, or PRO."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )

    raw = resp.choices[0].message.content.strip()
    label = _normalize_label(raw)
    if label is None:
        # fallback: most frequent class in training (safe default)
        label = y_train.value_counts().index[0]
    return label

# --- Run a small smoke-test on 5 rows (recommended first) ---
if os.environ.get("DEEPSEEK_API_KEY"):
    sample = test_df.sample(min(5, len(test_df)), random_state=42).copy()
    sample["deepseek_pred"] = sample.apply(lambda r: deepseek_predict_row(r, model="deepseek-chat"), axis=1)
    display(sample[["document_id", "entity", "gold_answer", "deepseek_pred"]])
else:
    print("DeepSeek skipped: DEEPSEEK_API_KEY is not set. (This is OK if you don't have access.)")

# --- (Optional) Run on the whole test set ---
# WARNING: this makes many API calls (may cost money and take time).
# Uncomment to run:
# if os.environ.get("DEEPSEEK_API_KEY"):
#     test_df = test_df.copy()
#     preds = []
#     for _, r in test_df.iterrows():
#         preds.append(deepseek_predict_row(r, model="deepseek-chat"))
#         time.sleep(0.2)  # gentle rate limit
#     test_df["deepseek_pred"] = preds
#     print("DeepSeek test accuracy:", (test_df["deepseek_pred"] == test_df["gold_answer"]).mean())


### Task 5
**Model Evaluation & Metrics**

Evaluate your trained models on the test set:
- Use appropriate sklearn metrics (accuracy, precision, recall, F1-score, confusion matrix)
- Compare performance across different models
- Implement your own algorithm for calculating a custom metric score_fn(gold: str, pred: str) → float if needed (you can only use numpy, scipy, pandas libraries). Write unit tests. Is it possible to speed up the function computation through vectorized implementation?
- Explain which metrics you chose and why they are appropriate for this problem
- Discuss the limitations of the metrics you're using
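
A custom `score_fn` and a vectorized counterpart could look like the sketch below. The scoring rule (exact match after trimming and lower-casing) is an assumption chosen for illustration; the task does not prescribe it, and `score_fn_vectorized` is a hypothetical name.

```python
import numpy as np

def score_fn(gold: str, pred: str) -> float:
    """1.0 if the strings match after trimming and lower-casing, else 0.0."""
    return float(gold.strip().lower() == pred.strip().lower())

def score_fn_vectorized(gold, pred) -> np.ndarray:
    """Apply the same rule to whole arrays at once with numpy string routines."""
    g = np.char.lower(np.char.strip(np.asarray(gold, dtype=str)))
    p = np.char.lower(np.char.strip(np.asarray(pred, dtype=str)))
    return (g == p).astype(float)

# Unit tests for the scalar version
assert score_fn("PER", " per ") == 1.0
assert score_fn("PER", "ORG") == 0.0
assert score_fn("", "") == 1.0

# The vectorized version must agree with the scalar one row by row
gold = ["PER", "ORG", "LOC"]
pred = ["per ", "LOC", "LOC"]
vec = score_fn_vectorized(gold, pred)
assert vec.tolist() == [score_fn(g, p) for g, p in zip(gold, pred)]
print("mean score:", vec.mean())
```

Whether the array version is actually faster depends on the data size and the numpy version, so it is worth timing both on the real dataframe before claiming a speed-up.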


In [None]:
# === Task 5: Evaluation ===

def evaluate_model(name: str, pipe: Pipeline):
    pred_train = pipe.predict(X_train)
    pred_test = pipe.predict(X_test)
    return {
        "model": name,
        "acc_train": accuracy_score(y_train, pred_train),
        "acc_test": accuracy_score(y_test, pred_test),
        "f1_macro_train": f1_score(y_train, pred_train, average="macro"),
        "f1_macro_test": f1_score(y_test, pred_test, average="macro"),
        "f1_micro_train": f1_score(y_train, pred_train, average="micro"),
        "f1_micro_test": f1_score(y_test, pred_test, average="micro"),
    }

results = pd.DataFrame([evaluate_model(n, p) for n,p in pipelines.items()]).sort_values("f1_macro_test", ascending=False)
display(results)

best_name = results.iloc[0]["model"]
best_pipe = pipelines[best_name]

print("Best model:", best_name)
print("\nClassification report (test):")
print(classification_report(y_test, best_pipe.predict(X_test), zero_division=0))

In [None]:
# === Deliverable: Save predictions to CSV ===
# We'll store predictions for the TEST split (or for the whole dataset, depending on course requirement).

pred_df = test_df[["document_id", "entity", "gold_answer"]].copy()
pred_df["pred_best"] = best_pipe.predict(X_test)

OUT_PRED_PATH = Path("predictions.csv")
pred_df.to_csv(OUT_PRED_PATH, index=False)
print("Saved:", OUT_PRED_PATH.resolve())
display(pred_df.head())

### Task 6
**Model Comparison & Visualization**

Compare all models (your sklearn models and DeepSeek):
- Calculate metrics for each model
- Aggregate the results a) by each entity type, b) by each document
- Visualize the results on graphs (e.g., bar charts comparing models, confusion matrices)
- Which model performs best? Why might this be?
- Compare train vs test performance for your sklearn models. Are there signs of overfitting or underfitting?
- What conclusions can be drawn about model selection?
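
Aggregating per entity type and per document can be done with a pair of `groupby` calls once each model has a prediction column. The rows and the `pred_svc` column name below are invented placeholders; only `document_id`, `entity`, `gold_answer` and `deepseek_pred` follow the notebook.

```python
import pandas as pd

# Invented per-row results (values are placeholders, not real predictions)
rows = pd.DataFrame({
    "document_id":   ["ru-10", "ru-10", "ru-11", "ru-11"],
    "entity":        ["q1", "q2", "q1", "q2"],
    "gold_answer":   ["A", "B", "A", "B"],
    "pred_svc":      ["A", "B", "B", "B"],
    "deepseek_pred": ["A", "A", "A", "B"],
})

pred_cols = ["pred_svc", "deepseek_pred"]  # hypothetical prediction columns

# One boolean "correct" column per model
correct = pd.DataFrame({c: rows[c] == rows["gold_answer"] for c in pred_cols})

# a) accuracy by entity, b) accuracy by document
by_entity = correct.groupby(rows["entity"]).mean()
by_document = correct.groupby(rows["document_id"]).mean()

print(by_entity)
print(by_document)
```

Both tables have one column per model, so they can be passed to `DataFrame.plot.bar()` to get a grouped bar chart.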


In [None]:
# === Task 6: Comparison & Visualization ===
# Bar plot of test metrics across models

plt.figure(figsize=(10,4))
plt.bar(results["model"], results["f1_macro_test"])
plt.xticks(rotation=30, ha="right")
plt.ylabel("F1 macro (test)")
plt.title("Model comparison (macro-F1 on test)")
plt.tight_layout()
plt.show()

plt.figure(figsize=(10,4))
plt.bar(results["model"], results["acc_test"])
plt.xticks(rotation=30, ha="right")
plt.ylabel("Accuracy (test)")
plt.title("Model comparison (accuracy on test)")
plt.tight_layout()
plt.show()

### Task 7
**Bias-Variance Analysis**

Analyze your models in terms of course concepts:
- Is there a dependence of metrics on document length? Build graphs to answer the question.
- Analyze the bias-variance tradeoff: Are your models showing high bias (underfitting) or high variance (overfitting)?
- Compare train vs test performance. What does this tell you about generalization?
- If you observe overfitting, what could you do to reduce it? (e.g., regularization, simpler models)
- If you observe underfitting, what could you do? (e.g., more features, more complex models)
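
Two quick numeric checks can complement the plots: a correlation between document length and correctness, and the train/test gap. All numbers below are invented placeholders, and with only a handful of test documents any such estimate is very noisy.

```python
import numpy as np

# Invented values: document length in words and whether the prediction was correct
doc_len_words = np.array([120, 150, 300, 320, 800, 900], dtype=float)
is_correct = np.array([1, 1, 1, 0, 0, 0], dtype=float)

# Pearson correlation with a 0/1 variable is the point-biserial correlation
r = np.corrcoef(doc_len_words, is_correct)[0, 1]
print(f"point-biserial r = {r:.2f}")  # negative here: longer documents are wrong more often

# Train/test gap as a rough overfitting signal (scores are placeholders)
train_score, test_score = 0.99, 0.80
gap = train_score - test_score
print(f"train/test gap = {gap:.2f}")
```

A large positive gap points toward high variance, while low scores on both train and test point toward high bias.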


In [None]:
# === Task 7: Bias-Variance / dependence on document length ===

# 1) Does performance depend on doc length?
tmp = test_df.copy()
tmp["pred"] = best_pipe.predict(X_test)
tmp["is_correct"] = (tmp["pred"] == tmp["gold_answer"]).astype(int)

# Bucket by word length
tmp["len_bucket"] = pd.qcut(tmp["doc_len_words"], q=6, duplicates="drop")

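# observed=True keeps only buckets that actually occur and avoids the pandas FutureWarning
bucket_perf = tmp.groupby("len_bucket", observed=True)["is_correct"].mean().reset_index(name="accuracy")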
display(bucket_perf)

plt.figure(figsize=(8,4))
plt.plot(range(len(bucket_perf)), bucket_perf["accuracy"], marker="o")
plt.xticks(range(len(bucket_perf)), [str(b) for b in bucket_perf["len_bucket"]], rotation=30, ha="right")
plt.ylabel("Accuracy")
plt.title("Accuracy vs document length bucket (test)")
plt.tight_layout()
plt.show()

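# 2) Learning curve (train size vs score) -> bias/variance hint
# Use group-aware CV so the same document never sits in both the training and
# validation folds, matching the leak-free split from Task 3.
from sklearn.model_selection import GroupKFold

train_sizes, train_scores, val_scores = learning_curve(
    best_pipe, X_train, y_train,
    groups=train_df["document_id"],
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=GroupKFold(n_splits=3),
    scoring="f1_macro",
    n_jobs=None,
)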

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

plt.figure(figsize=(7,4))
plt.plot(train_sizes, train_mean, marker="o", label="train")
plt.plot(train_sizes, val_mean, marker="o", label="cv")
plt.xlabel("Train samples")
plt.ylabel("F1 macro")
plt.title(f"Learning curve: {best_name}")
plt.legend()
plt.tight_layout()
plt.show()

### Task 8
**Error Analysis & Model Interpretation**

Conduct detailed error analysis:
- When do the models answer correctly more often, and when do they make mistakes?
- Analyze errors by entity type, document characteristics, etc.
- Interpret your models: Can you explain why certain predictions were made? (e.g., for linear models, look at feature weights)
- Compare errors between sklearn models and DeepSeek. What patterns do you see?
- Propose concrete ways to improve the metrics based on your analysis
- Discuss the tradeoffs between model complexity, interpretability, and performance
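
To compare error patterns between two models, it helps to count which rows each one gets right. The sketch below is pure Python; `error_overlap` and the example labels are invented for illustration.

```python
from collections import Counter

def error_overlap(gold, pred_a, pred_b):
    """Count rows by which of two models answered them correctly."""
    counts = Counter()
    for g, a, b in zip(gold, pred_a, pred_b):
        a_ok, b_ok = (a == g), (b == g)
        if a_ok and b_ok:
            counts["both_correct"] += 1
        elif a_ok:
            counts["only_a_correct"] += 1
        elif b_ok:
            counts["only_b_correct"] += 1
        else:
            counts["both_wrong"] += 1
    return dict(counts)

# Invented labels: model A could be an sklearn pipeline, model B the LLM
gold   = ["A", "B", "A", "B", "C"]
pred_a = ["A", "B", "B", "B", "A"]
pred_b = ["A", "A", "A", "B", "A"]

overlap = error_overlap(gold, pred_a, pred_b)
print(overlap)
# {'both_correct': 2, 'only_a_correct': 1, 'only_b_correct': 1, 'both_wrong': 1}
```

Rows in `both_wrong` are candidates for annotation problems or genuinely hard cases, while the two `only_*` buckets show where the models complement each other.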


In [None]:
# === Task 8: Error analysis & interpretation ===

errors = test_df.copy()
errors["pred"] = best_pipe.predict(X_test)
errors["is_correct"] = errors["pred"] == errors["gold_answer"]

print("Test accuracy:", errors["is_correct"].mean())
display(errors.loc[~errors["is_correct"], ["document_id","entity","gold_answer","pred","doc_len_words"]].head(20))

# Error rate by entity type
err_by_entity = errors.groupby("entity")["is_correct"].agg(["mean","count"]).reset_index()
err_by_entity = err_by_entity.sort_values("mean")
display(err_by_entity)

plt.figure(figsize=(10,4))
plt.bar(err_by_entity["entity"].astype(str), 1-err_by_entity["mean"])
plt.xticks(rotation=30, ha="right")
plt.ylabel("Error rate (1 - accuracy)")
plt.title("Error rate by entity type (test)")
plt.tight_layout()
plt.show()

# Interpret linear model weights (if best is logreg or linear_svc)
def top_features_for_class(pipe: Pipeline, class_label: str, top_n: int = 20):
    preprocess = pipe.named_steps["preprocess"]
    clf = pipe.named_steps["clf"]
    if not hasattr(clf, "coef_"):
        raise ValueError("This classifier has no coef_ (not linear).")

    # feature names
    text_vec = preprocess.named_transformers_["text"]
    ohe = preprocess.named_transformers_["entity"]

    text_feats = list(text_vec.get_feature_names_out())
    entity_feats = list(ohe.get_feature_names_out(["entity"]))
    feat_names = np.array(text_feats + entity_feats)

    classes = list(clf.classes_)
    if class_label not in classes:
        raise ValueError("Unknown class label")
    idx = classes.index(class_label)

    coefs = clf.coef_[idx]
    top_pos = np.argsort(coefs)[-top_n:][::-1]
    top_neg = np.argsort(coefs)[:top_n]
    return (
        pd.DataFrame({"feature": feat_names[top_pos], "weight": coefs[top_pos]}),
        pd.DataFrame({"feature": feat_names[top_neg], "weight": coefs[top_neg]}),
    )

if best_name in ["logreg", "linear_svc"]:
    # choose one frequent class for demo
    demo_class = y_train.value_counts().index[0]
    pos, neg = top_features_for_class(best_pipe, demo_class, top_n=15)
    print("Demo class:", demo_class)
    print("\nTop positive features:")
    display(pos)
    print("\nTop negative features:")
    display(neg)

# --- Optional: compare against DeepSeek predictions (if you ran them) ---
if "deepseek_pred" in errors.columns:
    # `errors` is a copy of test_df, so it already carries deepseek_pred.
    # Merging test_df back in would create deepseek_pred_x / deepseek_pred_y
    # and the lookups below would raise a KeyError.
    errors["deepseek_correct"] = errors["deepseek_pred"] == errors["gold_answer"]

    print("DeepSeek test accuracy:", errors["deepseek_correct"].mean())
    try:
        print("\nDeepSeek classification report:")
        print(classification_report(errors["gold_answer"], errors["deepseek_pred"], digits=3, zero_division=0))
    except Exception as e:
        print("Could not compute DeepSeek classification report:", e)
else:
    print("No DeepSeek predictions found in test_df (column 'deepseek_pred'). Skipping DeepSeek comparison.")


### Task 9
**Conclusions & Reflection**

Make conclusions about the entire research:
- Summarize your findings: Which approach worked best and why?
- Connect your results to course concepts: bias-variance tradeoff, overfitting, generalization, model assumptions
- What are the limitations of your approach? What assumptions did you make?
- What would you do differently if you had more time or data?
- Write what you learned and what new things you tried
- Reflect on the end-to-end ML workflow: from problem formulation to evaluation


(You can write your conclusions from running the notebook below.)
### Conclusions (my results)

- The best approach on this dataset is **LinearSVC** on **TF-IDF(document_text) + one-hot(entity)** features: on the split grouped by `document_id` it gave the best quality (approximately **Accuracy ≈ 0.82**, **Macro-F1 ≈ 0.81**), ahead of Naive Bayes and Logistic Regression.
- This is consistent with linear models working well in the sparse, high-dimensional TF-IDF space and usually giving a good bias/variance balance.
- `Dummy(most_frequent)` can reach acceptable accuracy under class imbalance but **fails on macro-F1**, so macro-F1 matters more for a fair evaluation.
- Main limitations: **little data (9 documents)**, **class imbalance**, and the fact that **the whole document is used as context** rather than a window around the entity mention.
- With more time/data, I would switch to a **context window around the entity**, add **character n-grams over the entity**, use **GroupKFold** over documents, and try class balancing.
- The DeepSeek baseline (if it was run) is a useful point of comparison, but it requires an API key and is limited by cost/time; the results also depend on a strict prompt format.

**End-to-end workflow:** problem formulation → dataset collection/loading → EDA → correct leak-free split → training several models → metric evaluation → error analysis → conclusions and improvement ideas.
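
The GroupKFold idea from the conclusions could be sketched as follows. The features, labels and document IDs are invented placeholders; the point is only to show that each fold keeps whole documents together.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Invented layout: 9 documents, 4 rows each
groups = np.repeat([f"ru-{i}" for i in range(9)], 4)
X = np.arange(len(groups)).reshape(-1, 1)   # placeholder features
y = np.tile(["A", "B"], len(groups) // 2)   # placeholder labels

gkf = GroupKFold(n_splits=3)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    train_docs = set(groups[train_idx])
    val_docs = set(groups[val_idx])
    # A document never appears on both sides of a fold
    assert not (train_docs & val_docs)
    fold_sizes.append((len(train_docs), len(val_docs)))
    print(f"fold {fold}: {len(train_docs)} train docs, {len(val_docs)} validation docs")
```

With only nine documents, averaging over group-aware folds gives a steadier estimate than a single 80/20 split, at the cost of training the pipeline several times.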
