# Text-Classification Baseline with TF–IDF and Classical Classifiers

This notebook builds and evaluates several classical classification pipelines on the preprocessed news dataset:  
1. TF–IDF Vectorization  
2. Model training and validation  
3. Baseline selection  
4. Final evaluation on hold-out test set  

In [11]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
import numpy as np
import joblib

## Data Loading

In [2]:
fields = ['text_clean', 'topic']

train_df = pd.read_csv('../data/train.csv',
                       dtype={'topic': object,
                              'text_clean': object},
                       usecols=fields)
val_df = pd.read_csv('../data/val.csv',
                     dtype={'topic': object,
                            'text_clean': object},
                     usecols=fields)
test_df = pd.read_csv('../data/test.csv',
                      dtype={'topic': object,
                             'text_clean': object},
                      usecols=fields)

In [5]:
label_map = {k: v for k, v in enumerate(sorted(train_df['topic'].unique()))}
joblib.dump(label_map, '../models/label_map.joblib')

['../models/label_map.joblib']

In [24]:
label_map

{0: 'Бывший СССР',
 1: 'Дом',
 2: 'Из жизни',
 3: 'Интернет и СМИ',
 4: 'Культура',
 5: 'Мир',
 6: 'Наука и техника',
 7: 'Путешествия',
 8: 'Россия',
 9: 'Силовые структуры',
 10: 'Спорт',
 11: 'Ценности',
 12: 'Экономика'}

In [20]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_df['topic'].unique())

In [28]:
X_train, X_val, X_test = train_df['text_clean'], val_df['text_clean'], test_df['text_clean']
y_train, y_val, y_test = le.transform(train_df['topic']), le.transform(val_df['topic']), le.transform(test_df['topic'])
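As a rough illustration of why the `label_map` dictionary and the `LabelEncoder` agree, the sketch below uses a few made-up English topic names (hypothetical, not the notebook's data). `LabelEncoder` stores its classes in sorted order, so enumerating the sorted unique labels should give the same integer codes.

```python
from sklearn.preprocessing import LabelEncoder

# Toy topic labels (hypothetical; the notebook uses train_df['topic']).
topics = ["sport", "economy", "culture", "sport", "economy"]

le_demo = LabelEncoder()
le_demo.fit(topics)

# Same construction as label_map above: index -> label in sorted order.
label_map_demo = {i: label for i, label in enumerate(sorted(set(topics)))}

# Encode to integers and decode back to the original strings.
encoded = le_demo.transform(topics)
decoded = le_demo.inverse_transform(encoded)

print(encoded)
print(label_map_demo)
```

Because both mappings rely on the same sort order, either one can be used to turn predicted integers back into topic names.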

## 1. TF–IDF Vectorization

In [30]:
vect = TfidfVectorizer(
    max_features=20000,
    ngram_range=(1, 2),
    token_pattern=r'\b\w+\b'
)
X_train_tfidf = vect.fit_transform(X_train)
X_val_tfidf = vect.transform(X_val)
X_test_tfidf = vect.transform(X_test)
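The fit/transform split above can be seen on a small scale in the following sketch. The documents are invented for illustration; only the `ngram_range` and `token_pattern` settings are taken from the cell above. The vocabulary is learned from the training documents only, so terms that appear solely in later documents are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical, for illustration only).
train_docs = ["stocks rise as markets rally", "team wins the final match"]
new_docs = ["markets rally after the match", "unseen words only"]

vect_demo = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"\b\w+\b")

# The vocabulary (unigrams and bigrams) is learned here and nowhere else.
X_train_demo = vect_demo.fit_transform(train_docs)

# transform() reuses that vocabulary; unknown terms contribute nothing.
X_new_demo = vect_demo.transform(new_docs)

vocab_size = len(vect_demo.vocabulary_)
print(X_train_demo.shape, X_new_demo.shape, vocab_size)
```

A document made up entirely of unseen terms ends up as an all-zero row, which is one reason to fit the vectorizer on a training set that is representative of the text seen later.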

## Candidate Models

Compare the following classifiers:  
- **Logistic Regression**
- **Linear SVM**  
- **Multinomial NB**  
- **Random Forest**  
- **k-Nearest Neighbors**  

In [31]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    'LogReg': LogisticRegression(solver='saga', max_iter=1000, random_state=42, n_jobs=-1),
    'SVC': LinearSVC(max_iter=10000),
    'NB': MultinomialNB(alpha=0.1),
    'RF': RandomForestClassifier(n_estimators=200, max_depth=20, n_jobs=-1),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

## 2. Model training and validation  
For each candidate model:  
1. Fit on TF–IDF features of training data.  
2. Predict on validation split.  
3. Compute **accuracy**, **macro-F1**, and **weighted-F1** scores.  
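
To make the difference between the three scores concrete, here is a minimal sketch on invented, imbalanced labels (not the notebook's data). Macro-F1 averages the per-class F1 scores with equal weight, while weighted-F1 weights each class by its support, so a poorly predicted rare class lowers macro-F1 more.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): class 0 dominates, class 1 is rare.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Unweighted mean of the per-class F1 scores.
f1_macro_demo = f1_score(y_true_demo, y_pred_demo, average="macro")

# Mean of the per-class F1 scores, weighted by class support.
f1_weighted_demo = f1_score(y_true_demo, y_pred_demo, average="weighted")

print(acc_demo, f1_macro_demo, f1_weighted_demo)
```

On this toy example macro-F1 comes out lower than both accuracy and weighted-F1, because the rare class is only half recovered.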

In [32]:
results = []

for name, model in models.items():
    print(f'fitting {name}')
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_val_tfidf)

    acc = accuracy_score(y_val, y_pred)
    f1_macro = f1_score(y_val, y_pred, average='macro')
    f1_weighted = f1_score(y_val, y_pred, average='weighted')

    results.append({
        'model': name,
        'accuracy': acc,
        'f1_macro': f1_macro,
        'f1_weighted': f1_weighted
    })

# Sort once by macro-F1 (descending), then display the table.
df_results = pd.DataFrame(results).sort_values('f1_macro', ascending=False)
df_results

fitting LogReg
fitting SVC
fitting NB
fitting RF
fitting KNN


    model  accuracy  f1_macro  f1_weighted
0  LogReg  0.755383  0.732552     0.754426
1     SVC  0.751107  0.730963     0.749917
2      NB  0.692399  0.654706     0.690256
3      RF  0.450096  0.326722     0.397466
4     KNN  0.152505  0.115004     0.138313


## 3. Baseline selection  

Select the model with the highest macro-F1 on the validation set, and construct a `Pipeline` combining the TF–IDF vectorizer and the chosen classifier.

In [33]:
best_model = df_results.loc[df_results['f1_macro'].idxmax(), 'model']
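
The selection step can be checked in isolation with the sketch below. It rebuilds a small frame from the macro-F1 values reported in the validation table and shuffles the rows on purpose (the shuffle is an assumption added for illustration). `idxmax` returns the index label of the maximum, and `.loc` looks rows up by label, so the result should not depend on row order.

```python
import pandas as pd

# Macro-F1 values copied from the validation results above.
scores = pd.DataFrame({
    "model": ["LogReg", "SVC", "NB", "RF", "KNN"],
    "f1_macro": [0.732552, 0.730963, 0.654706, 0.326722, 0.115004],
})

# Shuffle rows to show that the lookup is label-based, not position-based.
shuffled = scores.sample(frac=1, random_state=0)

# idxmax() gives the index label of the best row; .loc resolves that label.
best_name = shuffled.loc[shuffled["f1_macro"].idxmax(), "model"]
print(best_name)
```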

In [34]:
best_pipeline = Pipeline([
    ('tfidf', vect),
    ('clf', models[best_model])
])
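
The pipeline above wraps a vectorizer and a classifier that were already fitted separately. The sketch below, on invented documents and labels, suggests why that is expected to work: calling `predict` on the pipeline transforms the raw text with the fitted vectorizer and passes the result to the fitted classifier, which matches doing the two steps by hand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 0 = sport, 1 = economy.
docs = [
    "goal scored in the match",
    "team wins the cup",
    "stocks fall sharply",
    "bank raises interest rates",
]
labels = [0, 0, 1, 1]

# Fit the two steps separately, as the notebook does.
vect_demo = TfidfVectorizer()
X_demo = vect_demo.fit_transform(docs)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, labels)

# Wrap the already-fitted steps; no further call to fit() is made.
pipe_demo = Pipeline([("tfidf", vect_demo), ("clf", clf_demo)])

new_docs = ["the team lost the match", "interest rates fall"]
manual_pred = clf_demo.predict(vect_demo.transform(new_docs))
pipe_pred = pipe_demo.predict(new_docs)

print(manual_pred, pipe_pred)
```

Bundling both steps into one object means raw text can be passed straight to `predict`, and the vectorizer and classifier are saved and loaded together.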

Run the baseline pipeline on example sentences to verify that predicted topics align with expectations.

In [36]:
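texts = [
    # "Economic results for the first quarter exceeded forecasts."
    "Экономические итоги первого квартала перевыполнили прогнозы.",
    # "The director's new film will be released in theaters this summer."
    "Новый фильм режиссёра выйдет в прокат этим летом."
]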

predicted_topics = le.inverse_transform(best_pipeline.predict(texts))
predicted_topics

array(['Россия', 'Культура'], dtype=object)

In [5]:
# best_pipeline = joblib.load('../models/tfidf_logreg_pipeline(best).joblib')

## 4. Final evaluation on hold-out test set  

Evaluate the best pipeline on the held-out test set.  
Report per-class precision, recall, and F1 to estimate how the model performs on unseen data.
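
When there are many classes, it can help to load the per-class metrics into a DataFrame and sort them, so the weakest classes stand out. The sketch below does this on invented labels and class names (all hypothetical), using the `output_dict=True` option of `classification_report`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels and class names (hypothetical, for illustration only).
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
names = ["culture", "economy", "sport"]

# output_dict=True returns nested dicts instead of a formatted string.
report = classification_report(
    y_true_demo, y_pred_demo, target_names=names, output_dict=True
)

# Keep only the per-class rows and sort by F1, weakest class first.
per_class = pd.DataFrame({name: report[name] for name in names}).T
per_class = per_class.sort_values("f1-score")
print(per_class)
```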

In [40]:
y_pred = best_pipeline.predict(X_test)

# Pass the class names as a list, in the same sorted order the encoder uses.
print(classification_report(y_test, y_pred, target_names=list(label_map.values())))

                   precision    recall  f1-score   support

      Бывший СССР       0.81      0.82      0.81      2905
              Дом       0.75      0.82      0.78       363
         Из жизни       0.62      0.82      0.71      2857
   Интернет и СМИ       0.64      0.64      0.64      2534
         Культура       0.85      0.79      0.82      2267
              Мир       0.80      0.79      0.79      7177
  Наука и техника       0.81      0.79      0.80      3455
      Путешествия       0.92      0.42      0.58      1278
           Россия       0.64      0.76      0.70      7021
Силовые структуры       0.51      0.40      0.44      1531
            Спорт       0.98      0.94      0.96      3207
         Ценности       0.94      0.55      0.70      1153
        Экономика       0.84      0.77      0.80      4008

         accuracy                           0.75     39756
        macro avg       0.78      0.71      0.73     39756
     weighted avg       0.77      0.76      0.75     39756

In [17]:
joblib.dump(best_pipeline, f'../models/tfidf_{best_model.lower()}_pipeline(best).joblib')

['../models/tfidf_logreg_pipeline(best).joblib']
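
A quick way to gain confidence in a saved pipeline is a round-trip check: dump it, load it back, and compare predictions. The sketch below is a minimal version of that idea on invented data, writing to a temporary directory rather than `../models/`; the file name and the toy pipeline are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 0 = sport, 1 = economy.
docs = ["team wins the cup", "goal in the final", "stocks fall", "bank raises rates"]
labels = [0, 0, 1, 1]

pipe_demo = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
pipe_demo.fit(docs, labels)

# Save to a temporary location and load the pipeline back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.joblib")
    joblib.dump(pipe_demo, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(docs) == pipe_demo.predict(docs)).all())
print(same_predictions)
```

Files written by `joblib` are pickle-based, so they are generally best loaded with the same scikit-learn version that produced them.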