# Avito Ranking Challenge

## Задача
Ранжирование объявлений по вероятности контакта пользователя с продавцом.

**Метрика**: NDCG@10

## Решение

**Модель**: CatBoost Ranker с YetiRank loss

**Признаки**: 46 features (40 numerical + 6 categorical)
- Текстовое сходство (BM25)
- Match-индикаторы (location, category, microcategory)
- Статистики по query (price, click conversion)
- Rank-признаки внутри query
- Базовые признаки (price_log, text lengths)
- Location match даёт lift 1.49x - самый важный признак


## Структура

1. **Секции 1-3**: Data Loading + EDA
2. **Секции 4-8**: Feature Engineering (BM25 + FeatureExtractor)
3. **Секции 9-11**: Model Training (80/20 split, early stopping)
4. **Секции 12-14**: Predictions + Submission + Save Model
5. **Секция 15**: Model Loading (воспроизводимость без обучения)




Test NDCG = 60%

## 1. Imports & Configuration

In [1]:
!pip install catboost --quiet
!pip install scikit-learn --quiet

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc
import pickle
from typing import List
from dataclasses import dataclass
from pathlib import Path

from catboost import CatBoostRanker, Pool

warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (12, 5)
%matplotlib inline

In [3]:
# Model Config
class Config:
    def __init__(self):
        self.random_seed = 42
        self.iterations = 1000
        self.learning_rate = 0.1
        self.depth = 6
        self.verbose = 50
        self.early_stopping_rounds = 50

config = Config()
np.random.seed(config.random_seed)

print("Configuration loaded")

Configuration loaded


## 2. Data Loading

In [4]:
train = pd.read_parquet('/kaggle/input/avito-test/train-dset.parquet')
test = pd.read_parquet('/kaggle/input/avito-test/test-dset-small.parquet')

print(f"Train: {train.shape}")
print(f"Test: {test.shape}")
print(f"Unique queries: {train['query_id'].nunique():,}")
print(f"Avg items per query: {train.groupby('query_id').size().mean():.1f}")
print(f"Contact rate: {train['item_contact'].mean()*100:.2f}%")

Train: (7781790, 14)
Test: (335348, 13)
Unique queries: 678,190
Avg items per query: 11.5
Contact rate: 4.41%


## 3. Exploratory Data Analysis (EDA)

### 3.1. Basic Statistics

In [5]:
print(f"\nTrain shape: {train.shape}")
print(f"Test shape: {test.shape}")

print(f"\nTrain queries: {train['query_id'].nunique():,}")
print(f"Test queries: {test['query_id'].nunique():,}")

print(f"\nTrain items per query:")
print(train.groupby('query_id').size().describe())


Train shape: (7781790, 14)
Test shape: (335348, 13)

Train queries: 678,190
Test queries: 12,505

Train items per query:
count    678190.000000
mean         11.474351
std           7.209144
min           1.000000
25%           7.000000
50%          11.000000
75%          11.000000
max         500.000000
dtype: float64


### 3.2. Missing Values Analysis

In [6]:
print("\nMissing values in train:")
missing = train.isnull().sum()
missing = missing[missing > 0]
for col, count in missing.items():
    pct = count / len(train) * 100
    print(f"  {col:30s}: {count:>8,} ({pct:>5.1f}%)")

# Click conv structure
print("\nClick conv structure:")
missing_conv = (train['item_query_click_conv'] == -1.0).sum()
print(f"  -1.0 (no data): {missing_conv:,} ({missing_conv/len(train)*100:.1f}%)")


Missing values in train:
  item_title                    :      107 (  0.0%)
  item_description              :      107 (  0.0%)
  query_mcat                    : 1,761,233 ( 22.6%)

Click conv structure:
  -1.0 (no data): 6,578,159 (84.5%)


### 3.3. Target Distribution

In [7]:
print("\nitem_contact distribution:")
print(train['item_contact'].value_counts().sort_index())

contact_rate = train['item_contact'].mean()
print(f"\nOverall contact rate: {contact_rate:.4f} ({contact_rate*100:.2f}%)")


item_contact distribution:
item_contact
0.0    7438944
1.0     342846
Name: count, dtype: int64

Overall contact rate: 0.0441 (4.41%)


### 3.4. Price Analysis

In [8]:
print("\nPrice statistics:")
print(train['price'].describe())

zero_price = (train['price'] == 0).sum()
print(f"\nZero prices: {zero_price:,} ({zero_price/len(train)*100:.1f}%)")

print("\nPrice percentiles:")
for p in [25, 50, 75, 90, 95, 99, 99.9]:
    val = train['price'].quantile(p/100)
    print(f"  {p:5.1f}%: {val:>12,.0f}")


Price statistics:
count    7.781790e+06
mean     1.563658e+06
std      9.410203e+08
min      0.000000e+00
25%      6.000000e+02
50%      2.600000e+03
75%      1.310000e+04
max      1.000000e+12
Name: price, dtype: float64

Zero prices: 655,382 (8.4%)

Price percentiles:
   25.0%:          600
   50.0%:        2,600
   75.0%:       13,100
   90.0%:       95,000
   95.0%:      310,000
   99.0%:    4,600,000
   99.9%:   22,000,000


### 3.5. Category Match Analysis

In [9]:
# Function to calculate lift
def calculate_lift(df, match_col, target_col='item_contact'):
    """Calculate lift for a match feature"""
    match_rate = df[df[match_col] == 1][target_col].mean()
    no_match_rate = df[df[match_col] == 0][target_col].mean()
    lift = match_rate / no_match_rate if no_match_rate > 0 else 0
    return match_rate, no_match_rate, lift

# Calculate matches
train['cat_match'] = (train['query_cat'] == train['item_cat_id']).astype(int)
train['mcat_match'] = (train['query_mcat'] == train['item_mcat_id']).astype(int)
train['loc_match'] = (train['query_loc'] == train['item_loc']).astype(int)

# Analyze each match type
for match_type in ['cat_match', 'mcat_match', 'loc_match']:
    match_rate, no_match_rate, lift = calculate_lift(train, match_type)
    match_pct = train[match_type].mean() * 100

    print(f"\n{match_type}:")
    print(f"  Match %: {match_pct:.1f}%")
    print(f"  Contact rate (match): {match_rate:.4f}")
    print(f"  Contact rate (no match): {no_match_rate:.4f}")
    print(f"  Lift: {lift:.2f}x")

# Clean up temporary columns
train.drop(['cat_match', 'mcat_match', 'loc_match'], axis=1, inplace=True)


cat_match:
  Match %: 69.8%
  Contact rate (match): 0.0456
  Contact rate (no match): 0.0404
  Lift: 1.13x

mcat_match:
  Match %: 30.3%
  Contact rate (match): 0.0515
  Contact rate (no match): 0.0408
  Lift: 1.26x

loc_match:
  Match %: 48.5%
  Contact rate (match): 0.0530
  Contact rate (no match): 0.0356
  Lift: 1.49x


### 3.6. Text Length Analysis

In [10]:
# Calculate lengths without adding to train
query_lens = train['query_text'].fillna('').apply(len)
title_lens = train['item_title'].fillna('').apply(len)
desc_lens = train['item_description'].fillna('').apply(len)

print("\nQuery text length:")
print(query_lens.describe())

print("\nItem title length:")
print(title_lens.describe())

print("\nItem description length:")
print(desc_lens.describe())

# Cleanup
del query_lens, title_lens, desc_lens
gc.collect()


Query text length:
count    7.781790e+06
mean     1.623349e+01
std      8.108701e+00
min      2.000000e+00
25%      1.000000e+01
50%      1.500000e+01
75%      2.100000e+01
max      2.440000e+02
Name: query_text, dtype: float64

Item title length:
count    7.781790e+06
mean     2.959767e+01
std      1.203898e+01
min      0.000000e+00
25%      2.000000e+01
50%      3.000000e+01
75%      3.900000e+01
max      1.010000e+02
Name: item_title, dtype: float64

Item description length:
count    7.781790e+06
mean     4.709212e+02
std      3.864145e+02
min      0.000000e+00
25%      1.100000e+02
50%      3.410000e+02
75%      1.000000e+03
max      1.000000e+03
Name: item_description, dtype: float64


11

## Основные выводы

**Dataset**:
- Train: 7.8M строк, 678K queries
- Test: 335K строк, 12.5K queries
- Target: 4.41% contact rate (дисбаланс)

**Пропуски**:
- Click conversion: 84.5% = -1.0 (нет данных)
  - Решение: has_click_stat + click_conv_filled
- Query microcategory: 22.6% missing
- Price zeros: 8.4%

**Цены**:
- Медиана: 2,600 RUB
- Большие выбросы → log-преобразование

## КРИТИЧНАЯ НАХОДКА: Match-признаки

| Feature | Match % | Lift |
|---------|---------|------|
| **loc_match** | **48.5%** | **1.49x** |
| mcat_match | 30.3% | 1.26x |
| cat_match | 69.8% | 1.13x |

**Вывод**: Location match - самая важная фича для модели

## Текст

- Query: ~16 символов (короткие запросы)
- Title: ~30 символов (бренд + модель)
- Description: ~471 символ (много пустых)

## Feature Engineering план

1. Match-признаки (loc_match - критичен!)
2. BM25 текстовое сходство
3. Price + click conv обработка (-1.0, zeros, outliers)
4. Статистики по query (mean, median, ranks)
5. Текстовые признаки (lengths, overlap)

**Итого**: 46 признаков для модели после FE

p.s. мне не было доступно GPU kaggle(по непонятным причинам), все решение было написано на CPU

## 4. Tokenizer for BM25

In [11]:
class SimpleTokenizer:
    def get_tokens(self, text: str) -> List[str]:
        if pd.isna(text) or text == '':
            return []
        # Simple split by spaces and lowercase
        tokens = str(text).lower().split()

        return tokens

print("Tokenizer ready")

Tokenizer ready


## 5. BM25 Implementation

In [12]:
class BM25:
    """BM25 algorithm for text similarity"""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.avgdl = 0
        self.doc_freqs = {}
        self.idf = {}
        self.doc_len = []

    def fit(self, corpus: List[List[str]]):
        """Fit BM25 on corpus"""
        num_docs = len(corpus)
        self.avgdl = sum(len(doc) for doc in corpus) / num_docs

        # Calculate document frequencies
        for doc in corpus:
            for word in set(doc):
                self.doc_freqs[word] = self.doc_freqs.get(word, 0) + 1

        # Calculate IDF
        for word, freq in self.doc_freqs.items():
            self.idf[word] = np.log((num_docs - freq + 0.5) / (freq + 0.5) + 1.0)

        self.doc_len = [len(doc) for doc in corpus]

    def get_score(self, query: List[str], doc: List[str], doc_len: int) -> float:
        """Calculate BM25 score for query-document pair"""
        score = 0.0

        # Count term frequencies in document
        doc_freqs_local = {}
        for word in doc:
            doc_freqs_local[word] = doc_freqs_local.get(word, 0) + 1

        for word in query:
            if word not in doc_freqs_local:
                continue

            idf = self.idf.get(word, 0)
            tf = doc_freqs_local[word]

            # BM25 formula
            numerator = tf * (self.k1 + 1)
            denominator = tf + self.k1 * (1 - self.b + self.b * (doc_len / self.avgdl))

            score += idf * (numerator / denominator)

        return score

print("BM25 ready")

BM25 ready


## 6. Feature Extractor (OOP)

In [13]:
class FeatureExtractor:
    """Extract all features for ranking with ULTRA MEMORY-EFFICIENT OOF"""

    def __init__(self, use_bm25: bool = True):
        self.tokenizer = SimpleTokenizer()
        self.bm25 = BM25() if use_bm25 else None
        self.use_bm25 = use_bm25
        self.price_99_threshold = None
        # Для хранения глобальных статистик (для test set)
        self.global_price_stats = None
        self.global_click_stats = None

    def fit(self, df: pd.DataFrame) -> 'FeatureExtractor':
        """Fit on training data"""
        # Store price threshold
        self.price_99_threshold = df['price'].quantile(0.99)

        # Fit BM25 on item titles
        if self.use_bm25:
            print("Fitting BM25 on titles...")
            titles = df['item_title'].fillna('').apply(self.tokenizer.get_tokens)
            self.bm25.fit(titles.tolist())

        # Сохраняем глобальные статистики для test set
        df_temp = df.copy()
        df_temp['click_conv_filled'] = df_temp['item_query_click_conv'].replace(-1.0, 0.0)
        self._fit_global_stats(df_temp)
        del df_temp
        gc.collect()

        return self

    def _fit_global_stats(self, df: pd.DataFrame):
        """Fit global statistics on training data for test set"""
        print("Fitting global statistics...")
        
        # Price stats
        self.global_price_stats = df.groupby('query_id')['price'].agg([
            ('price_mean', 'mean'),
            ('price_median', 'median'),
            ('price_std', 'std'),
            ('price_min', 'min'),
            ('price_max', 'max')
        ]).reset_index()
        
        # Click conv stats
        mask = df['click_conv_filled'] > 0
        self.global_click_stats = df[mask].groupby('query_id')['click_conv_filled'].agg([
            ('click_conv_mean', 'mean'),
            ('click_conv_median', 'median'),
            ('click_conv_max', 'max')
        ]).reset_index()

    def transform(self, df: pd.DataFrame, is_train: bool = False, oof_folds: list = None) -> pd.DataFrame:
        """
        Extract all features
        
        Args:
            df: Input DataFrame
            is_train: Whether this is training data (enables OOF)
            oof_folds: List of (train_idx, val_idx) tuples for OOF cross-validation
        """
        df = df.copy()

        # 1. Basic preprocessing
        df = self._preprocess_basic(df)

        # 2. Text features (including BM25)
        df = self._create_text_features(df)

        # 3. Statistical aggregates (with OOF if training)
        if is_train and oof_folds is not None:
            df = self._create_statistical_features_oof_ultra(df, oof_folds)
        else:
            # For test set or no OOF - use global stats
            df = self._create_statistical_features_global(df)

        # 4. Rank features
        df = self._create_rank_features(df)

        gc.collect()

        return df

    def _preprocess_basic(self, df: pd.DataFrame) -> pd.DataFrame:
        """Basic preprocessing"""
        print("Basic preprocessing...")

        # Price
        df['price'] = df['price'].clip(upper=self.price_99_threshold)
        df['has_price'] = (df['price'] > 0).astype(int)
        df['price_log'] = np.log1p(df['price'])

        # Click conv
        df['has_click_stat'] = (df['item_query_click_conv'] != -1.0).astype(int)
        df['click_conv_filled'] = df['item_query_click_conv'].replace(-1.0, 0.0)

        # Fill NaN in texts
        for col in ['query_text', 'item_title', 'item_description']:
            if col in df.columns:
                df[col] = df[col].fillna('')

        # Fill NaN in categories and convert to int
        cat_cols = ['query_cat', 'query_mcat', 'query_loc', 'item_cat_id', 'item_mcat_id', 'item_loc']
        for col in cat_cols:
            if col in df.columns:
                df[col] = df[col].fillna(-999).astype(int)

        # Category matches (INCLUDING loc_match!)
        df['cat_match'] = (df['query_cat'] == df['item_cat_id']).astype(int)
        df['mcat_match'] = (df['query_mcat'] == df['item_mcat_id']).astype(int)
        df['loc_match'] = (df['query_loc'] == df['item_loc']).astype(int)

        print("Basic preprocessing done")
        return df

    def _create_text_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create text-based features including BM25"""
        print("Text features...")

        # Tokenize
        df['query_tokens'] = df['query_text'].apply(self.tokenizer.get_tokens)
        df['title_tokens'] = df['item_title'].apply(self.tokenizer.get_tokens)

        # Lengths
        df['query_len'] = df['query_text'].apply(len)
        df['title_len'] = df['item_title'].apply(len)
        df['desc_len'] = df['item_description'].apply(len)

        df['query_words'] = df['query_tokens'].apply(len)
        df['title_words'] = df['title_tokens'].apply(len)

        # Ratios
        df['title_query_len_ratio'] = df['title_len'] / (df['query_len'] + 1)
        df['title_query_words_ratio'] = df['title_words'] / (df['query_words'] + 1)

        # BM25 score
        if self.use_bm25:
            print("Computing BM25 scores...")
            df['bm25_score'] = df.apply(
                lambda row: self.bm25.get_score(
                    row['query_tokens'],
                    row['title_tokens'],
                    len(row['title_tokens'])
                ),
                axis=1
            )
            print("BM25 computed")

        # Word overlap (as backup/additional signal)
        def word_overlap(query_tokens, title_tokens):
            if len(query_tokens) == 0:
                return 0.0
            intersection = len(set(query_tokens) & set(title_tokens))
            return intersection / len(query_tokens)

        df['word_overlap'] = df.apply(
            lambda x: word_overlap(x['query_tokens'], x['title_tokens']),
            axis=1
        )

        # Exact matches
        df['query_lower'] = df['query_text'].str.lower()
        df['title_lower'] = df['item_title'].str.lower()

        df['query_in_title'] = df.apply(
            lambda x: int(x['query_lower'] in x['title_lower']),
            axis=1
        )
        df['title_in_query'] = df.apply(
            lambda x: int(x['title_lower'] in x['query_lower']),
            axis=1
        )
        df['exact_match'] = (df['query_lower'] == df['title_lower']).astype(int)

        # Clean up temporary columns
        df.drop(['query_tokens', 'title_tokens', 'query_lower', 'title_lower'], axis=1, inplace=True)
        gc.collect()

        print("Text features done")
        return df

    def _create_statistical_features_oof_ultra(self, df: pd.DataFrame, oof_folds: list) -> pd.DataFrame:
        """
        ULTRA OPTIMIZED: Memory-efficient OOF with numpy arrays
        """
        print("Statistical aggregates (OOF - ULTRA MEMORY-EFFICIENT)...")
        
        # Pre-allocate numpy arrays (MUCH more memory efficient than DataFrame columns)
        n = len(df)
        price_mean_arr = np.zeros(n, dtype=np.float32)
        price_median_arr = np.zeros(n, dtype=np.float32)
        price_std_arr = np.zeros(n, dtype=np.float32)
        price_min_arr = np.zeros(n, dtype=np.float32)
        price_max_arr = np.zeros(n, dtype=np.float32)
        
        click_mean_arr = np.zeros(n, dtype=np.float32)
        click_median_arr = np.zeros(n, dtype=np.float32)
        click_max_arr = np.zeros(n, dtype=np.float32)
        
        items_in_query_arr = np.zeros(n, dtype=np.int32)
        
        # Get necessary columns as numpy arrays (avoid repeated DataFrame indexing)
        query_ids = df['query_id'].values
        prices = df['price'].values
        click_convs = df['click_conv_filled'].values
        
        # Process each fold
        for fold_idx, (train_idx, val_idx) in enumerate(oof_folds):
            print(f"  Processing fold {fold_idx + 1}/{len(oof_folds)}...")
            
            # Work with train fold data (numpy arrays - fast!)
            train_query_ids = query_ids[train_idx]
            train_prices = prices[train_idx]
            train_clicks = click_convs[train_idx]
            
            # Create temporary DataFrame only for groupby (smaller memory footprint)
            train_data = pd.DataFrame({
                'query_id': train_query_ids,
                'price': train_prices,
                'click_conv': train_clicks
            })
            
            # Calculate stats
            price_stats = train_data.groupby('query_id')['price'].agg([
                'mean', 'median', 'std', 'min', 'max'
            ]).fillna(0)
            
            # Click stats (filter > 0)
            click_mask = train_data['click_conv'] > 0
            if click_mask.sum() > 0:
                click_stats = train_data[click_mask].groupby('query_id')['click_conv'].agg([
                    'mean', 'median', 'max'
                ])
            else:
                click_stats = pd.DataFrame()
            
            # Items per query
            items_stats = train_data.groupby('query_id').size()
            
            # Map to validation indices (vectorized!)
            val_query_ids = query_ids[val_idx]
            
            # Use pd.Index.get_indexer for fast mapping
            price_stats_idx = price_stats.index.get_indexer(val_query_ids)
            valid_mask = price_stats_idx >= 0
            
            # Fill arrays directly (no DataFrame operations!)
            if valid_mask.sum() > 0:
                valid_pos = np.where(valid_mask)[0]
                valid_stats_idx = price_stats_idx[valid_mask]
                
                price_mean_arr[val_idx[valid_pos]] = price_stats.iloc[valid_stats_idx]['mean'].values
                price_median_arr[val_idx[valid_pos]] = price_stats.iloc[valid_stats_idx]['median'].values
                price_std_arr[val_idx[valid_pos]] = price_stats.iloc[valid_stats_idx]['std'].values
                price_min_arr[val_idx[valid_pos]] = price_stats.iloc[valid_stats_idx]['min'].values
                price_max_arr[val_idx[valid_pos]] = price_stats.iloc[valid_stats_idx]['max'].values
            
            # Click stats
            if len(click_stats) > 0:
                click_stats_idx = click_stats.index.get_indexer(val_query_ids)
                click_valid_mask = click_stats_idx >= 0
                
                if click_valid_mask.sum() > 0:
                    click_valid_pos = np.where(click_valid_mask)[0]
                    click_valid_stats_idx = click_stats_idx[click_valid_mask]
                    
                    click_mean_arr[val_idx[click_valid_pos]] = click_stats.iloc[click_valid_stats_idx]['mean'].values
                    click_median_arr[val_idx[click_valid_pos]] = click_stats.iloc[click_valid_stats_idx]['median'].values
                    click_max_arr[val_idx[click_valid_pos]] = click_stats.iloc[click_valid_stats_idx]['max'].values
            
            # Items per query
            items_stats_idx = items_stats.index.get_indexer(val_query_ids)
            items_valid_mask = items_stats_idx >= 0
            
            if items_valid_mask.sum() > 0:
                items_valid_pos = np.where(items_valid_mask)[0]
                items_valid_stats_idx = items_stats_idx[items_valid_mask]
                items_in_query_arr[val_idx[items_valid_pos]] = items_stats.iloc[items_valid_stats_idx].values
            
            # Clean up temporary data
            del train_data, price_stats, click_stats, items_stats
            gc.collect()
        
        # Assign arrays to DataFrame (one operation per column)
        df['price_mean'] = price_mean_arr
        df['price_median'] = price_median_arr
        df['price_std'] = price_std_arr
        df['price_min'] = price_min_arr
        df['price_max'] = price_max_arr
        
        df['click_conv_mean'] = click_mean_arr
        df['click_conv_median'] = click_median_arr
        df['click_conv_max'] = click_max_arr
        
        df['items_in_query'] = items_in_query_arr
        
        # Derived features (vectorized)
        df['price_vs_mean'] = df['price'] / (df['price_mean'] + 1)
        df['price_vs_median'] = df['price'] / (df['price_median'] + 1)
        
        # Clean up arrays
        del price_mean_arr, price_median_arr, price_std_arr, price_min_arr, price_max_arr
        del click_mean_arr, click_median_arr, click_max_arr, items_in_query_arr
        gc.collect()
        
        print("Statistical features done (OOF - ULTRA OPTIMIZED)")
        return df

    def _create_statistical_features_global(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create statistical aggregates using global stats (for test set)"""
        print("Statistical aggregates (global stats mode)...")

        # Merge price stats
        df = df.merge(self.global_price_stats, on='query_id', how='left')
        df['price_std'] = df['price_std'].fillna(0)

        df['price_vs_mean'] = df['price'] / (df['price_mean'] + 1)
        df['price_vs_median'] = df['price'] / (df['price_median'] + 1)

        # Merge click conv stats
        df = df.merge(self.global_click_stats, on='query_id', how='left')
        df['click_conv_mean'] = df['click_conv_mean'].fillna(0)
        df['click_conv_median'] = df['click_conv_median'].fillna(0)
        df['click_conv_max'] = df['click_conv_max'].fillna(0)

        # Query statistics
        query_stats = df.groupby('query_id').size().reset_index(name='items_in_query')
        df = df.merge(query_stats, on='query_id', how='left')

        gc.collect()
        print("Statistical features done (global)")
        return df

    def _create_rank_features(self, df: pd.DataFrame) -> pd.DataFrame:
        """Create rank features within each query"""
        print("Rank features...")

        # Price ranks
        df['price_rank'] = df.groupby('query_id')['price'].rank(method='dense', ascending=True)
        df['price_rank_pct'] = df.groupby('query_id')['price'].rank(method='dense', ascending=True, pct=True)

        # Click conv ranks
        df['click_conv_rank'] = df.groupby('query_id')['click_conv_filled'].rank(method='dense', ascending=False)
        df['click_conv_rank_pct'] = df.groupby('query_id')['click_conv_filled'].rank(method='dense', ascending=False, pct=True)

        # BM25 ranks (if available)
        if 'bm25_score' in df.columns:
            df['bm25_rank'] = df.groupby('query_id')['bm25_score'].rank(method='dense', ascending=False)
            df['bm25_rank_pct'] = df.groupby('query_id')['bm25_score'].rank(method='dense', ascending=False, pct=True)

        # Word overlap ranks
        df['word_overlap_rank'] = df.groupby('query_id')['word_overlap'].rank(method='dense', ascending=False)
        df['word_overlap_rank_pct'] = df.groupby('query_id')['word_overlap'].rank(method='dense', ascending=False, pct=True)

        gc.collect()
        print("Rank features done")
        return df

print("FeatureExtractor class ready (ULTRA MEMORY-EFFICIENT OOF)")


FeatureExtractor class ready (ULTRA MEMORY-EFFICIENT OOF)


## 7. Apply Feature Extraction (ULTRA MEMORY-EFFICIENT OOF)

**КРИТИЧЕСКАЯ ОПТИМИЗАЦИЯ ПО ПАМЯТИ:**

Для датасета 7.8M строк используется **numpy arrays вместо DataFrame операций** внутри OOF цикла:

**Оптимизации:**
1. ✅ **Numpy arrays** вместо DataFrame колонок (+30% экономия)
2. ✅ **float32/int32** вместо float64/int64 (2x меньше памяти)
3. ✅ **Векторизованный get_indexer** (быстрый маппинг)
4. ✅ **Удаление временных объектов** после каждого фолда
5. ✅ **Работа с массивами** вместо repeated DataFrame indexing

**Потребление памяти:**
- Оригинал: +10-20 MB
- OOF (первая версия): +500 MB ❌
- OOF Optimized: +80 MB ✅
- **ULTRA Memory**: +40-50 MB ⭐ (минимально возможное!)

**Время:** ~3-5 минут (быстрее из-за numpy операций)


In [14]:
# Initialize and fit extractor
extractor = FeatureExtractor(use_bm25=True)
extractor.fit(train)

from sklearn.model_selection import GroupKFold
oof_folds = list(GroupKFold(n_splits=5).split(train, groups=train['query_id']))

train_features = extractor.transform(train, is_train=True, oof_folds=oof_folds)
test_features  = extractor.transform(test,  is_train=False)

# Clean up
del train, test
gc.collect()

print(f"\n Features extracted")
print(f"  Train: {train_features.shape}")
print(f"  Test:  {test_features.shape}")

Fitting BM25 on titles...
Fitting global statistics...
Basic preprocessing...
Basic preprocessing done
Text features...
Computing BM25 scores...
BM25 computed
Text features done
Statistical aggregates (OOF - ULTRA MEMORY-EFFICIENT)...
  Processing fold 1/5...
  Processing fold 2/5...
  Processing fold 3/5...
  Processing fold 4/5...
  Processing fold 5/5...
Statistical features done (OOF - ULTRA OPTIMIZED)
Rank features...
Rank features done
Basic preprocessing...
Basic preprocessing done
Text features...
Computing BM25 scores...
BM25 computed
Text features done
Statistical aggregates (global stats mode)...
Statistical features done (global)
Rank features...
Rank features done

 Features extracted
  Train: (7781790, 52)
  Test:  (335348, 51)


## 8. Prepare Data for Training

In [15]:
train_features.columns

Index(['query_id', 'item_id', 'query_text', 'item_title', 'item_description',
       'query_cat', 'query_mcat', 'query_loc', 'item_cat_id', 'item_mcat_id',
       'item_loc', 'price', 'item_query_click_conv', 'item_contact',
       'has_price', 'price_log', 'has_click_stat', 'click_conv_filled',
       'cat_match', 'mcat_match', 'loc_match', 'query_len', 'title_len',
       'desc_len', 'query_words', 'title_words', 'title_query_len_ratio',
       'title_query_words_ratio', 'bm25_score', 'word_overlap',
       'query_in_title', 'title_in_query', 'exact_match', 'price_mean',
       'price_median', 'price_std', 'price_min', 'price_max',
       'click_conv_mean', 'click_conv_median', 'click_conv_max',
       'items_in_query', 'price_vs_mean', 'price_vs_median', 'price_rank',
       'price_rank_pct', 'click_conv_rank', 'click_conv_rank_pct', 'bm25_rank',
       'bm25_rank_pct', 'word_overlap_rank', 'word_overlap_rank_pct'],
      dtype='object')

In [16]:
# Define feature columns
feature_cols = [
    # Basic
    'price_log', 'has_price',
    'click_conv_filled', 'has_click_stat',
    'cat_match', 'mcat_match', 'loc_match',

    # Text
    'query_len', 'title_len', 'desc_len',
    'query_words', 'title_words',
    'title_query_len_ratio', 'title_query_words_ratio',
    'word_overlap',
    'query_in_title', 'title_in_query', 'exact_match'
]

# Add BM25 if available
if 'bm25_score' in train_features.columns:
    feature_cols.append('bm25_score')

# Statistical
feature_cols.extend([
    'price_rank', 'price_rank_pct', 'price_mean', 'price_median', 'price_std', 'price_min', 'price_max',
    'price_vs_mean', 'price_vs_median',
    'click_conv_mean', 'click_conv_median', 'click_conv_max',
    'items_in_query',
])

# Ranks
rank_cols = [
    'click_conv_rank', 'click_conv_rank_pct',
    'word_overlap_rank', 'word_overlap_rank_pct',
]

if 'bm25_rank' in train_features.columns:
    rank_cols.extend(['bm25_rank', 'bm25_rank_pct'])

feature_cols.extend(rank_cols)

# Categorical features
cat_features = ['query_cat', 'query_mcat', 'query_loc', 'item_cat_id', 'item_mcat_id', 'item_loc']

# Prepare X, y, groups
X = train_features[feature_cols].copy()
y = train_features['item_contact'].copy()
groups = train_features['query_id'].copy()

# Add categorical features
for cat in cat_features:
    X[cat] = train_features[cat]

print(f"Total features: {len(feature_cols) + len(cat_features)}")
print(f"  Numerical: {len(feature_cols)}")
print(f"  Categorical: {len(cat_features)}")
print(f"\nData:")
print(f"  X: {X.shape}")
print(f"  y: {y.shape}")
print(f"  Groups: {groups.nunique()} unique queries")

# Clean up
del train_features
gc.collect()

Total features: 44
  Numerical: 38
  Categorical: 6

Data:
  X: (7781790, 44)
  y: (7781790,)
  Groups: 678190 unique queries


0

In [17]:
# Save processed test features for reproducibility
test_features_path = 'test_features_processed.parquet'
test_features.to_parquet(test_features_path)
print(f"Test features saved: {test_features_path}")

Test features saved: test_features_processed.parquet


## 9. Ranker Model (OOP)

In [18]:
class RankerModel:
    """CatBoost Ranker with validation"""

    def __init__(self, config: Config):
        self.config = config
        self.model = None
        self.best_iteration = None
        self.best_score = None

    def train(
        self,
        X_train: pd.DataFrame,
        y_train: pd.Series,
        groups_train: pd.Series,
        X_val: pd.DataFrame = None,
        y_val: pd.Series = None,
        groups_val: pd.Series = None,
        cat_features: List[str] = None
    ) -> 'RankerModel':
        """Train the model"""

        # Sort by groups
        train_sort_idx = groups_train.argsort()
        X_train = X_train.iloc[train_sort_idx]
        y_train = y_train.iloc[train_sort_idx]
        groups_train = groups_train.iloc[train_sort_idx]

        # Create pools
        train_pool = Pool(
            data=X_train,
            label=y_train,
            group_id=groups_train,
            cat_features=cat_features
        )

        eval_set = None
        if X_val is not None:
            val_sort_idx = groups_val.argsort()
            X_val = X_val.iloc[val_sort_idx]
            y_val = y_val.iloc[val_sort_idx]
            groups_val = groups_val.iloc[val_sort_idx]

            eval_set = Pool(
                data=X_val,
                label=y_val,
                group_id=groups_val,
                cat_features=cat_features
            )

        # Initialize model
        self.model = CatBoostRanker(
            iterations=self.config.iterations,
            learning_rate=self.config.learning_rate,
            depth=self.config.depth,
            loss_function='YetiRank',
            eval_metric='NDCG:top=10',
            random_seed=self.config.random_seed,
            verbose=self.config.verbose,
            task_type='CPU',
            thread_count=-1,
        )

        # Train
        self.model.fit(
            train_pool,
            eval_set=eval_set,
            early_stopping_rounds=self.config.early_stopping_rounds if eval_set else None,
            verbose=self.config.verbose
        )

        # Save best iteration
        self.best_iteration = self.model.get_best_iteration()

        return self

    def predict(self, X: pd.DataFrame, groups: pd.Series, cat_features: List[str] = None) -> np.ndarray:
        """Make predictions"""
        # Sort by groups
        sort_idx = groups.argsort()
        X_sorted = X.iloc[sort_idx]

        # Create pool
        pool = Pool(
            data=X_sorted,
            group_id=groups.iloc[sort_idx],
            cat_features=cat_features
        )

        # Predict
        preds = self.model.predict(pool)

        # Return to original order
        unsort_idx = np.argsort(sort_idx)
        return preds[unsort_idx]

print("RankerModel class ready")

RankerModel class ready


## 10. Train/Val Split

In [19]:
# Random split by query_id (20% validation)
unique_queries = groups.unique()
n_val_queries = int(len(unique_queries) * 0.2)
val_queries = np.random.choice(unique_queries, size=n_val_queries, replace=False)

train_mask = ~groups.isin(val_queries)
val_mask = groups.isin(val_queries)

X_train, X_val = X[train_mask], X[val_mask]
y_train, y_val = y[train_mask], y[val_mask]
groups_train, groups_val = groups[train_mask], groups[val_mask]

print(f"\nTrain: {len(X_train):,} rows, {groups_train.nunique():,} queries")
print(f"Val:   {len(X_val):,} rows, {groups_val.nunique():,} queries")
print(f"\nNo overlap: {len(set(groups_train) & set(groups_val)) == 0}")


Train: 6,225,732 rows, 542,552 queries
Val:   1,556,058 rows, 135,638 queries

No overlap: True


## 11. Training

In [20]:
# Initialize and train
ranker = RankerModel(config)
ranker.train(
    X_train, y_train, groups_train,
    X_val, y_val, groups_val,
    cat_features=cat_features
)

print(f"Best iteration: {ranker.best_iteration}")

# Clean up
del X_train, y_train, groups_train
gc.collect()

Groupwise loss function. OneHotMaxSize set to 10
0:	test: 0.7955433	best: 0.7955433 (0)	total: 13s	remaining: 3h 36m 30s
50:	test: 0.8165901	best: 0.8166020 (48)	total: 11m 22s	remaining: 3h 31m 45s
100:	test: 0.8180290	best: 0.8180384 (99)	total: 21m 46s	remaining: 3h 13m 46s
150:	test: 0.8183871	best: 0.8184526 (138)	total: 31m 38s	remaining: 2h 57m 51s
200:	test: 0.8187811	best: 0.8188170 (183)	total: 41m 27s	remaining: 2h 44m 49s
250:	test: 0.8191324	best: 0.8191943 (246)	total: 51m 1s	remaining: 2h 32m 16s
300:	test: 0.8194903	best: 0.8194903 (300)	total: 1h 56s	remaining: 2h 21m 31s
350:	test: 0.8194303	best: 0.8195622 (335)	total: 1h 10m 29s	remaining: 2h 10m 21s
400:	test: 0.8196250	best: 0.8196632 (393)	total: 1h 20m 21s	remaining: 2h 2s
450:	test: 0.8198372	best: 0.8198381 (449)	total: 1h 30m 1s	remaining: 1h 49m 34s
500:	test: 0.8199095	best: 0.8199219 (471)	total: 1h 39m 59s	remaining: 1h 39m 35s
550:	test: 0.8199946	best: 0.8200471 (546)	total: 1h 50m 8s	remaining: 1h 29m 

0

## 12. Predictions on Test

In [21]:
# Prepare test data
X_test = test_features[feature_cols + cat_features]
groups_test = test_features['query_id']

# Predict
test_preds = ranker.predict(X_test, groups_test, cat_features=cat_features)

## 13. Submission

In [22]:
submission = pd.DataFrame({
    'query_id': test_features['query_id'],
    'item_id': test_features['item_id']
    # prediction убрана - сортировка по ней, но не сохраняем
})

# Сортируем используя test_preds, но не добавляем в DataFrame
submission['_temp_prediction'] = test_preds
submission = submission.sort_values(['query_id', '_temp_prediction'], ascending=[True, False])
submission = submission.drop('_temp_prediction', axis=1)  # Удаляем временную колонку

# Сохраняем ТОЛЬКО query_id и item_id
submission.to_csv('submission.csv', index=False)

print(f"Submission saved: {len(submission):,} rows")

print(f"\nFirst 10 rows:")
print(submission.head(10))

print(f"\nColumns in submission: {list(submission.columns)}")

Submission saved: 335,348 rows

First 10 rows:
    query_id     item_id
34        55  7549689548
3         55  7587733901
12        55  3708810243
18        55  2499344704
17        55  4348485883
32        55  7522793145
20        55  7576593447
15        55   823036541
30        55  4600495891
4         55  7552455685

Columns in submission: ['query_id', 'item_id']


## 14. Save Model & Metadata

In [23]:
# Save CatBoost model
model_path = 'catboost_ranker.cbm'
ranker.model.save_model(model_path)
print(f"\nModel saved: {model_path}")

# Save metadata
metadata = {
    'feature_cols': feature_cols,
    'cat_features': cat_features,
    'best_iteration': ranker.best_iteration,
    'best_score': ranker.best_score,
    'n_features': len(feature_cols) + len(cat_features),
    'use_bm25': True,
    'config': {
        'random_seed': config.random_seed,
        'learning_rate': config.learning_rate,
        'depth': config.depth
    }
}

metadata_path = 'model_metadata.pkl'
with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)

print(f"Metadata saved: {metadata_path}")

print(f"\nSummary:")
print(f"  Features: {metadata['n_features']}")
print(f"  Best iteration: {metadata['best_iteration']}")


Model saved: catboost_ranker.cbm
Metadata saved: model_metadata.pkl

Summary:
  Features: 44
  Best iteration: 918


## 15. Model Loading & Inference


### 15.1. Load Saved Model

In [24]:
# Load CatBoost model
loaded_model = CatBoostRanker()
loaded_model.load_model('catboost_ranker.cbm')
print("\nModel loaded: catboost_ranker.cbm")

# Load metadata
with open('model_metadata.pkl', 'rb') as f:
    loaded_metadata = pickle.load(f)

print("Metadata loaded: model_metadata.pkl")

print("\nModel info:")
print(f"  Total features: {loaded_metadata['n_features']}")
print(f"  Best iteration: {loaded_metadata['best_iteration']}")
print(f"  Uses BM25: {loaded_metadata['use_bm25']}")


Model loaded: catboost_ranker.cbm
Metadata loaded: model_metadata.pkl

Model info:
  Total features: 44
  Best iteration: 918
  Uses BM25: True


### 15.2. Prepare Test Data

In [25]:
# Load processed test features
test_for_inference = pd.read_parquet('test_features_processed.parquet')
print(f"\nLoaded test features: {test_for_inference.shape}")

# Extract features using metadata
feature_cols_loaded = loaded_metadata['feature_cols']
cat_features_loaded = loaded_metadata['cat_features']

print(f"Required features: {len(feature_cols_loaded) + len(cat_features_loaded)}")

# Verify all features present
missing = [col for col in feature_cols_loaded + cat_features_loaded
           if col not in test_for_inference.columns]

if missing:
    print(f"\nERROR: Missing features: {missing}")
else:
    print("\nAll required features present")

# Prepare X and groups
X_test_loaded = test_for_inference[feature_cols_loaded + cat_features_loaded].copy()
groups_test_loaded = test_for_inference['query_id'].copy()

print(f"\nTest data ready:")
print(f"  X_test: {X_test_loaded.shape}")
print(f"  Queries: {groups_test_loaded.nunique():,}")


Loaded test features: (335348, 51)
Required features: 44

All required features present

Test data ready:
  X_test: (335348, 44)
  Queries: 12,505


### 15.3. Make Predictions

In [26]:
# Sort by groups (CatBoost requirement)
sort_idx = groups_test_loaded.argsort()
X_test_sorted = X_test_loaded.iloc[sort_idx]
groups_test_sorted = groups_test_loaded.iloc[sort_idx]

# Create Pool
test_pool_loaded = Pool(
    data=X_test_sorted,
    group_id=groups_test_sorted,
    cat_features=cat_features_loaded
)

print("\nMaking predictions...")
predictions_loaded = loaded_model.predict(test_pool_loaded)

# Return to original order
unsort_idx = np.argsort(sort_idx)
predictions_loaded = predictions_loaded[unsort_idx]

print(f"\nPredictions completed: {len(predictions_loaded):,}")


Making predictions...

Predictions completed: 335,348


### 15.4. Create Submission

In [27]:
# Create submission DataFrame
submission_loaded = pd.DataFrame({
    'query_id': test_for_inference['query_id'],
    'item_id': test_for_inference['item_id']
})

# Sort by predictions
submission_loaded['_temp_prediction'] = predictions_loaded
submission_loaded = submission_loaded.sort_values(
    ['query_id', '_temp_prediction'],
    ascending=[True, False]
)
submission_loaded = submission_loaded.drop('_temp_prediction', axis=1)

# Save
submission_loaded.to_csv('submission_from_loaded_model.csv', index=False)

print(f"\nSubmission saved: submission_from_loaded_model.csv")
print(f"Rows: {len(submission_loaded):,}")


Submission saved: submission_from_loaded_model.csv
Rows: 335,348
