<a href="https://colab.research.google.com/github/Jordy-20035/hws/blob/hw2/%D0%9A%D0%BB%D0%B0%D1%81%D1%81%D0%B8%D1%84%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F_%D1%82%D0%B5%D0%BA%D1%81%D1%82%D0%BE%D0%B2_HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip uninstall pymorphy2 -y

Found existing installation: pymorphy2 0.9.1
Uninstalling pymorphy2-0.9.1:
  Successfully uninstalled pymorphy2-0.9.1


In [None]:
!pip install corus py imbalanced-learn pymorphy3 pymorphy3-dicts-ru joblib tqdm

Collecting corus
  Downloading corus-0.10.0-py3-none-any.whl.metadata (31 kB)
Collecting py
  Downloading py-1.11.0-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting pymorphy3
  Downloading pymorphy3-2.0.3-py3-none-any.whl.metadata (1.9 kB)
Collecting pymorphy3-dicts-ru
  Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting dawg2-python>=0.8.0 (from pymorphy3)
  Downloading dawg2_python-0.9.0-py3-none-any.whl.metadata (7.5 kB)
Downloading corus-0.10.0-py3-none-any.whl (83 kB)
Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)
Downloading pymorphy3-2.0.3-py3-none-any.whl (53 kB)

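Corus: Provides a ready-made loader (load_lenta) for the Lenta.ru dataset, so the CSV does not have to be parsed by hand.

Pymorphy3: Dictionary-based Russian lemmatizer, chosen over Stanza for speed (the 10x figure is not benchmarked in this notebook).

imbalanced-learn: Provides SMOTE, used below to oversample minority classes.

joblib: Can parallelize lemmatization; the preprocessing below runs serially and does not use it.

tqdm: Shows a progress bar during preprocessing.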

In [None]:
import random
import numpy as np
import pandas as pd
import corus
import re
import nltk
import string
import pymorphy3
from pymorphy3 import MorphAnalyzer
from multiprocessing import Pool
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from imblearn.over_sampling import SMOTE

Fixed Random Seeds: Make data splits, SMOTE sampling, and model initialization repeatable across runs.

Justification: ML experiments are only comparable when they can be reproduced.

In [None]:
# Fix random seed for reproducibility
random_state = 42
np.random.seed(random_state)
random.seed(random_state)
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Initialize pymorphy3 lemmatizer
morph = pymorphy3.MorphAnalyzer()

In [None]:
# Function to lemmatize text using pymorphy3
def lemmatize_text(text):
    return " ".join([morph.parse(word)[0].normal_form for word in text.split()])
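
News text repeats the same word forms many times, so caching the per-word result may save time. The sketch below wraps any per-word normalizer in functools.lru_cache. The names make_cached_lemmatizer and toy_normalize are hypothetical; toy_normalize is a stand-in so the example runs without pymorphy3, and in this notebook the normalizer would be something like lambda w: morph.parse(w)[0].normal_form. The actual speedup has not been measured here.

```python
from functools import lru_cache


def make_cached_lemmatizer(normalize, maxsize=200_000):
    """Wrap a per-word normalizer so each distinct word is analyzed only once."""
    cached = lru_cache(maxsize=maxsize)(normalize)

    def lemmatize(text):
        # Same whitespace tokenization as lemmatize_text above.
        return " ".join(cached(word) for word in text.split())

    return lemmatize


# Stand-in normalizer for demonstration only (assumption, not pymorphy3).
calls = []


def toy_normalize(word):
    calls.append(word)  # record every real (uncached) call
    return word.lower()


lemmatize_cached = make_cached_lemmatizer(toy_normalize)
result = lemmatize_cached("Кот кот Кот собака")
print(result, len(calls))
```

The second occurrence of "Кот" is served from the cache, so the normalizer runs once per distinct word form rather than once per token.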

In [None]:
# Download the Lenta dataset
!wget -q https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
from corus import load_lenta

path = 'lenta-ru-news.csv.gz'
records = load_lenta(path)  # Lazy loading

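Sampling: Reads only the first 10,000 records, in file order, to keep experiments fast.

Progress Tracking: tqdm is used later, during preprocessing; the loader itself shows no progress bar.

Random Sampling: The loader does not sample randomly or by class. Stratification is applied later, in train_test_split, so the splits keep the class distribution of these 10,000 records.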

In [None]:
# Load dataset into DataFrame
def load_lenta_data(limit=10000):
    data = []
    for i, record in enumerate(records):
        if i >= limit:
            break
        data.append((record.title, record.text, record.topic))
    return pd.DataFrame(data, columns=["title", "text", "topic"])

df = load_lenta_data()
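
The loader above keeps the first `limit` records in file order. If a uniformly random subset is wanted instead, reservoir sampling (Algorithm R) can draw one from a lazy iterator in a single pass without holding the whole dataset in memory. This is a minimal sketch; reservoir_sample is a hypothetical helper, and range(1000) stands in for the record iterator. Applied to load_lenta(path), it would need one full pass over the file.

```python
import random


def reservoir_sample(iterable, k, seed=42):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10)
print(sample)
```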

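Cleaning: Lowercases the text and removes digits and ASCII punctuation. string.punctuation covers ASCII characters only, so typographic marks such as « and » stay in the text.

Lemmatization: Uses pymorphy3 to reduce each word to its dictionary form, taking the first (highest-scored) parse.

Justification: Normalizing case and word forms shrinks the vocabulary, which usually helps bag-of-words models.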

In [None]:
# Function for basic text preprocessing
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"\d+", "", text)  # Remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    return lemmatize_text(text)  # Lemmatize text
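
A possible variant of the cleaning step, shown without lemmatization so it runs on its own: a Unicode-aware regular expression removes every non-word character, including guillemets, and replaces it with a space so that neighboring words are not glued together. clean_text is a hypothetical name, and the sample sentence is made up for illustration. One consequence of this design is that hyphenated words are split into two tokens.

```python
import re

_DIGITS = re.compile(r"\d+")
# \w is Unicode-aware for str patterns, so Cyrillic letters are kept;
# the underscore counts as \w and is removed explicitly.
_NON_WORD = re.compile(r"[^\w\s]|_")


def clean_text(text):
    """Lowercase, drop digits and all punctuation, collapse whitespace."""
    text = text.lower()
    text = _DIGITS.sub(" ", text)
    text = _NON_WORD.sub(" ", text)
    return " ".join(text.split())


cleaned = clean_text("«Газпром» продал 25% акций, сообщил пресс-секретарь.")
print(cleaned)
```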

In [None]:
# Remove missing and empty values
df = df.dropna()
df = df[df["text"].str.strip() != ""]

In [None]:
from tqdm import tqdm  # Progress bar for tracking execution time

tqdm.pandas()  # Enable progress_apply

df["processed_text"] = df["text"].progress_apply(preprocess_text)

100%|██████████| 10000/10000 [05:29<00:00, 30.36it/s]


In [None]:
# Encode categorical topics
df["processed_topic"] = df["topic"].astype("category").cat.codes

In [None]:
# Filter out underrepresented categories
category_counts = df["processed_topic"].value_counts()
valid_categories = category_counts[category_counts >= 2].index
df = df[df["processed_topic"].isin(valid_categories)]

In [None]:
# Hold out 40% of the data; it is split into validation and test sets further below
X_train, X_temp, y_train, y_temp = train_test_split(
    df["processed_text"], df["processed_topic"], test_size=0.4, stratify=df["processed_topic"], random_state=random_state
)

max_df=0.9: Ignores terms that appear in >90% of documents.

min_df=2: Ignores terms that appear in < 2 documents.

ngram_range=(1, 2): Captures unigrams and bigrams.

sublinear_tf=True: Replaces the raw term count tf with 1 + log(tf), which dampens the influence of terms repeated many times within one document.

Justification: TF-IDF is effective for text classification tasks.

In [None]:
# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_df=0.9, min_df=2, ngram_range=(1, 2), sublinear_tf=True)
X_train_tfidf = vectorizer.fit_transform(X_train)  # Transform training data
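
To make the weighting concrete, the sketch below computes the two factors by hand, assuming scikit-learn's defaults (natural logarithm, smooth_idf=True). The function names are hypothetical, and the final L2 normalization of each document vector is left out.

```python
import math


def sublinear_tf(raw_count):
    """Term-frequency factor with sublinear_tf=True: 1 + ln(tf) for tf > 0."""
    return 1.0 + math.log(raw_count) if raw_count > 0 else 0.0


def smooth_idf(n_docs, doc_freq):
    """Inverse document frequency with smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


# A term repeated 100 times gets far less than 100x the weight of a single mention.
for count in (1, 10, 100):
    print(count, round(sublinear_tf(count), 3))

# Rarer terms receive a larger idf than common ones.
print(smooth_idf(n_docs=6000, doc_freq=5), smooth_idf(n_docs=6000, doc_freq=3000))
```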

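SMOTE Choice: Generates synthetic samples for minority classes by interpolating between existing samples of the same class.

k_neighbors=1: Lets SMOTE run on classes that have as few as 2 training samples, since each sample needs at least k_neighbors same-class neighbors. It does not by itself reduce overfitting.

Justification: SMOTE can improve minority-class performance on imbalanced datasets. It is fit on the training split only.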

In [None]:
# Apply SMOTE for balancing classes
smote = SMOTE(sampling_strategy='auto', random_state=random_state, k_neighbors=1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_tfidf, y_train.to_numpy().ravel())
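
The core of SMOTE is a linear interpolation between a minority sample and one of its same-class neighbors. The sketch below illustrates that single step in plain Python for the k_neighbors=1 case, where the neighbor is always the nearest one. It is not imbalanced-learn's implementation; smote_like_sample and nearest_neighbor are hypothetical helpers and the two points are toy data.

```python
import random


def nearest_neighbor(point, candidates):
    """Return the candidate with the smallest squared Euclidean distance to point."""
    return min(candidates, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))


def smote_like_sample(minority, rng):
    """Create one synthetic point on the segment between a sample and its nearest neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbor = nearest_neighbor(base, others)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbor)]


rng = random.Random(42)
minority = [[0.0, 1.0], [1.0, 3.0]]  # a class with only two samples
synthetic = smote_like_sample(minority, rng)
print(synthetic)
```

With only two samples in a class, every synthetic point lies on the line segment joining them, which is why very small classes gain little new information from oversampling.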

Rare Class Filtering: Drops topics with fewer than 2 samples in the held-out 40% to prevent stratification errors.

Justification: Stratified splitting requires at least 2 samples per class, so that each class can appear in both the validation and test sets.

In [None]:
# Ensure all classes in y_temp have at least 2 instances before splitting
class_counts = y_temp.value_counts()
valid_classes = class_counts[class_counts >= 2].index
y_temp_filtered = y_temp[y_temp.isin(valid_classes)]
X_temp_filtered = X_temp.loc[y_temp_filtered.index]

if len(np.unique(y_temp_filtered)) > 1:
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp_filtered, y_temp_filtered, test_size=0.5, stratify=y_temp_filtered, random_state=random_state
    )
else:
    raise ValueError("Not enough instances per class in y_temp for stratified split.")

In [None]:
# Transform validation and test sets using the same TF-IDF vectorizer
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

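Dummy Accuracy: A majority-class baseline. It is fit on the SMOTE-balanced training set, where every class has the same count, so "most frequent" is a tie and the predicted label is arbitrary. A score of 0.0 means the label it picked does not occur in the test set.

Justification: A trained model should clearly beat a constant prediction.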

In [None]:
# Dummy classifier for baseline comparison
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_resampled, y_train_resampled)
y_pred_dummy = dummy.predict(X_test_tfidf)
print("Dummy accuracy:", accuracy_score(y_test, y_pred_dummy))

Dummy accuracy: 0.0
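
The sketch below shows, on made-up labels, how a majority-class baseline depends on the label counts it is fit on. majority_baseline_accuracy is a hypothetical helper, not part of scikit-learn, and the toy labels do not come from the dataset. In this notebook, fitting the baseline on y_train rather than y_train_resampled would be one way to get the true majority class.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Predict the most common training label for every test item."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return majority_label, hits / len(y_test)


# Imbalanced training labels: class 0 is the clear majority.
y_train_toy = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_test_toy = [0, 0, 0, 1, 2]
label, acc = majority_baseline_accuracy(y_train_toy, y_test_toy)
print(label, acc)

# After balancing, every class has the same count, so there is no real majority.
balanced_counts = Counter([0, 0, 1, 1, 2, 2])
print(len(set(balanced_counts.values())) == 1)
```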


Logistic Regression: A strong linear baseline for sparse TF-IDF features.

Random Forest: Captures non-linear feature interactions; averaging over many trees limits overfitting.

Naïve Bayes: Fast and well suited to high-dimensional text data.

Justification: All three are widely used for text classification.

In [None]:
# Define classification models
models = {
    "Logistic Regression": LogisticRegression(C=1.0, solver="liblinear", random_state=random_state),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=random_state),
    "Naïve Bayes": MultinomialNB()
}

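Justification: Cross-validation gives a more stable estimate than a single split. Note that the folds here are drawn from the already resampled training set, so a synthetic SMOTE point in a training fold can be derived from a sample in the validation fold, and the scores are likely optimistic.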

In [None]:
# Evaluate models using cross-validation
for model_name, model in models.items():
    pipeline = Pipeline([
        ("classifier", model)
    ])
    scores = cross_val_score(pipeline, X_train_resampled, y_train_resampled, cv=5, scoring="accuracy")
    print(f"{model_name} mean accuracy: {np.mean(scores):.4f}")

Logistic Regression mean accuracy: 0.9473
Random Forest mean accuracy: 0.9394
Naïve Bayes mean accuracy: 0.9192


Justification: Logistic Regression is chosen as the final model because it has the highest cross-validation accuracy.

In [None]:
# Train final model on best-performing classifier
final_pipeline = Pipeline([
    ("classifier", LogisticRegression(C=1.0, solver="liblinear", random_state=random_state))
])
final_pipeline.fit(X_train_resampled, y_train_resampled)
y_pred = final_pipeline.predict(X_test_tfidf)
print("Final accuracy:", accuracy_score(y_test, y_pred))

Final accuracy: 0.851


Summary of Results
Cross-Validation Accuracies:

Logistic Regression: 0.9473

Random Forest: 0.9394

Naïve Bayes: 0.9192

Final Test Accuracy: 0.851