In [1]:
import gensim.downloader as api
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

from habr_article_analyzer.data import load_dataset_from_zst
from habr_article_analyzer.data_loader import HabrDataset
from habr_article_analyzer.models.baseline.baseline import BaselineWord2VecKNN
from habr_article_analyzer.models.encoders.word2vec_encoder import (
    BilingualWord2VecEncoder,
)
from habr_article_analyzer.settings import data_settings, settings

# Baseline

*Author: Nikita Zolin*

The goal of this notebook is to build a baseline model. The task is to predict, for a given text, the probability that it belongs to a given hub. The model works as follows:

1. The model takes two inputs, a text and a hub, and maps each to a vector. Model `A` maps the text to a vector in $\mathbb{R}^n$, and model `B` maps the hub to a vector in $\mathbb{R}^m$.
2. The two vectors are concatenated into a single vector in $\mathbb{R}^{n+m}$.
3. Model `C` estimates the probability from this concatenated vector.
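
To make the three steps concrete, here is a minimal sketch with random stand-in vectors. It is not the `BaselineWord2VecKNN` implementation: the dimensions, the random data, and the use of scikit-learn's `KNeighborsClassifier` with `n_neighbors=5` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

n, m = 4, 3        # hypothetical sizes of the text and hub vectors
n_samples = 40

# Step 1: stand-ins for the outputs of models A and B (random, for illustration only)
text_vecs = rng.normal(size=(n_samples, n))
hub_vecs = rng.normal(size=(n_samples, m))
labels = np.arange(n_samples) % 2  # alternating 0/1 so both classes are present

# Step 2: concatenate into one vector of size n + m per (text, hub) pair
features = np.hstack([text_vecs, hub_vecs])

# Step 3: model C estimates the probability from the concatenated vector
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(features, labels)

# Column 1 of predict_proba is the estimated probability of label 1
proba = knn.predict_proba(features[:3])[:, 1]
print(features.shape, proba)
```

With a plain KNN classifier, each probability is the fraction of positive labels among the nearest neighbours, so with `n_neighbors=5` it can only take the values 0, 0.2, 0.4, 0.6, 0.8, or 1.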

In this notebook we use pretrained word2vec-style word embeddings for models `A` and `B`, and a KNN model for model `C`.

Before fitting the model, we need to build the dataset. Since this is only a baseline, we simply take one positive hub and three random negative hubs for each text. This logic is implemented in [data_utils](../../src/habr_article_analyzer/data_utils/) and was run as a module before the code below.
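
The sampling scheme can be sketched as follows. This is an illustration only, not the code in `data_utils`: the function name `build_examples`, the hub names, and the choice to exclude all of an article's own hubs from the negative candidates are assumptions.

```python
import random

def build_examples(text, article_hubs, all_hubs, n_negatives=3, rng=None):
    """Return (text, hub, label) rows: one positive and n_negatives random negatives."""
    rng = rng or random.Random()
    # One positive: a hub the article actually belongs to
    positive = rng.choice(sorted(article_hubs))
    # Negatives: hubs the article does not belong to, sampled without replacement
    candidates = sorted(set(all_hubs) - set(article_hubs))
    negatives = rng.sample(candidates, k=n_negatives)
    rows = [(text, positive, 1)]
    rows += [(text, hub, 0) for hub in negatives]
    return rows

# Hypothetical hub names, for illustration only
all_hubs = ["python", "machine_learning", "devops", "java", "infosecurity", "career"]
rows = build_examples(
    "some article text",
    {"python", "machine_learning"},
    all_hubs,
    rng=random.Random(42),
)
for row in rows:
    print(row)
```

With this scheme, about one quarter of the rows have label 1, which is worth remembering when reading accuracy numbers later on.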

Next, we need pretrained word embeddings. Let's download them.

## Encoders

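We take small pretrained models so that the notebook can run locally: GloVe vectors (`glove-wiki-gigaword-300`) for English and word2vec vectors (`word2vec-ruscorpora-300`) for Russian.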

In [2]:
kv_en = api.load("glove-wiki-gigaword-300")

In [3]:
kv_ru = api.load("word2vec-ruscorpora-300")
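
A common way to turn word vectors into a text vector is to average the vectors of the known tokens. The sketch below assumes this averaging scheme and a simple fallback between the two vocabularies; the actual `BilingualWord2VecEncoder` may tokenize and combine vectors differently. Plain dictionaries stand in for `kv_ru` and `kv_en` (gensim `KeyedVectors` support the same `in` and `[]` operations), and `encode_mean` is a hypothetical helper.

```python
import numpy as np

def encode_mean(tokens, kv_primary, kv_fallback, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = []
    for token in tokens:
        if token in kv_primary:
            vectors.append(kv_primary[token])
        elif token in kv_fallback:
            vectors.append(kv_fallback[token])
        # Tokens missing from both vocabularies are skipped.
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 3-dimensional "embeddings" (hypothetical values)
toy_ru = {"кот": np.array([1.0, 0.0, 0.0])}
toy_en = {"cat": np.array([0.0, 1.0, 0.0])}

mixed = encode_mean(["кот", "cat", "unknown"], toy_ru, toy_en, dim=3)
empty = encode_mean(["unknown"], toy_ru, toy_en, dim=3)
print(mixed)  # [0.5 0.5 0. ]
print(empty)  # [0. 0. 0.]
```

Averaging across two independently trained vector spaces is crude, since the spaces are not aligned with each other, but it may be acceptable for a baseline.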

## Model run

Now we can fit the model and check the results.

In [4]:
# Prepare mini-dataset for local run
np.random.seed(data_settings.random_seed)

dataset = HabrDataset(
    path=settings.raw_data_dir / "train_with_negatives.jsonl.zst",
    columns=["text", "hub", "label"],
    batch_size=data_settings.batch_size,
)

selected_rows = []
for batch_df in dataset:
    if np.random.rand() < 0.1:  # Keep each batch with probability 0.1 (about 10% of batches)
        selected_rows.append(batch_df)

train_df_sample = pd.concat(selected_rows, ignore_index=True)

Reading dataset: 1751498it [01:26, 20150.27it/s]


In [5]:
encoder = BilingualWord2VecEncoder(kv_ru=kv_ru, kv_en=kv_en)

model = BaselineWord2VecKNN(text_encoder=encoder, hub_encoder=encoder)

model.fit(
    train_df_sample["text"].tolist(),
    train_df_sample["hub"].tolist(),
    train_df_sample["label"].tolist(),
)

And let's check the result:

In [6]:
test_df = load_dataset_from_zst(settings.raw_data_dir / "test_with_negatives.jsonl.zst")

# Use only the first 1000 test rows so the notebook can run locally
test_texts = test_df["text"].tolist()[:1000]
test_hubs = test_df["hub"].tolist()[:1000]
test_labels = test_df["label"].tolist()[:1000]


probas = [model.predict_proba(text, hub) for text, hub in zip(test_texts, test_hubs)]

list(zip(test_labels[:10], probas[:10]))

Reading records: 437367it [00:31, 13881.88it/s]


[(1, 0.7991020106606943),
 (1, 1.0),
 (1, 0.7991020106606943),
 (1, 0.7991020106606943),
 (0, 0.7991020106606943),
 (0, 0.7991020106606943),
 (0, 0.7991020106606943),
 (0, 0.20176122508297176),
 (0, 0.0),
 (1, 0.20895760975438457)]

## Evaluation

Let's evaluate the model with some standard metrics:

In [7]:
probas = np.array(probas)
labels = np.array(test_labels)

# ROC AUC
roc_auc = roc_auc_score(labels, probas)
print(f"ROC AUC: {roc_auc:.4f}")

# Log Loss
ll = log_loss(labels, probas)
print(f"Log Loss: {ll:.4f}")

# Accuracy
preds = (probas >= 0.5).astype(int)
acc = accuracy_score(labels, preds)
print(f"Accuracy: {acc:.4f}")

ROC AUC: 0.8016
Log Loss: 1.5834
Accuracy: 0.7430
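
One thing to keep in mind when reading the log loss: KNN-style probabilities can be exactly 0 or 1 (as in the sample predictions above), and log loss penalizes confident mistakes very heavily. The sketch below computes binary log loss by hand on made-up numbers to show the effect. The helper `binary_log_loss` and the clipping value `eps=1e-15` are assumptions for illustration; scikit-learn's `log_loss` also clips probabilities away from 0 and 1, with an epsilon that depends on the library version.

```python
import numpy as np

def binary_log_loss(labels, probas, eps=1e-15):
    """Mean negative log-likelihood for binary labels, with clipping to avoid log(0)."""
    labels = np.asarray(labels, dtype=float)
    probas = np.clip(np.asarray(probas, dtype=float), eps, 1 - eps)
    losses = labels * np.log(probas) + (1 - labels) * np.log(1 - probas)
    return float(-np.mean(losses))

labels = [1, 1, 0, 0]

# Moderately confident and always on the right side of 0.5
soft = [0.8, 0.8, 0.2, 0.2]
# Fully confident, with a single wrong prediction (the last one)
hard_one_wrong = [1.0, 1.0, 0.0, 1.0]

soft_loss = binary_log_loss(labels, soft)
hard_loss = binary_log_loss(labels, hard_one_wrong)
print(f"soft: {soft_loss:.4f}")            # every term is -ln(0.8), about 0.2231
print(f"hard, one wrong: {hard_loss:.4f}")  # dominated by the single confident mistake
```

A single prediction of 1.0 on a negative example outweighs all the other terms, so a high log loss next to a reasonable ROC AUC may reflect overconfident probabilities rather than poor ranking.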


At this stage it is hard to say whether these results are good, because this is our first model. What matters is that we now have reference values to compare future models against. Our main goal, however, is to compare models using one specific metric, which will be presented in a separate notebook.