## Lecture 2: BM25

## The BM25 ranking function

For an inverted index there is a generally accepted ranking formula, *Okapi best match 25* ([Okapi BM25](https://ru.wikipedia.org/wiki/Okapi_BM25)).    
Given a query $Q$ containing the words $q_1, ... , q_n$, the BM25 function gives the following relevance score of document $D$ for query $Q$:

$$ score(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i)*\frac{TF(q_i,D)*(k+1)}{TF(q_i,D)+k(1-b+b\frac{l(D)}{avgdl})} $$ 
where   
>$TF(q_i,D)$ is the frequency of the word $q_i$ in document $D$      
$l(D)$ is the length of the document (the number of words in it)   
*avgdl* is the average document length in the collection    
$k$ and $b$ are free parameters, usually chosen as $k$=2.0 and $b$=0.75   

$\text{IDF}(q_i)$ is a modified version of IDF: 
$$\text{IDF}(q_i) = \log\frac{N-n(q_i)+0.5}{n(q_i)+0.5},$$
>> where $N$ is the total number of documents in the collection   
$n(q_i)$ is the number of documents containing $q_i$
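
The formula can be sketched directly in plain Python. This is a minimal illustration, not the implementation used later in the notebook: the function name `bm25_score` and the toy corpus are made up for the example, documents are assumed to be already tokenized, and $TF$ is taken as the raw count of the word in the document (an assumption; the notebook's own TF table uses relative frequencies).

```python
from math import log

def bm25_score(query_terms, doc_terms, corpus, k=2.0, b=0.75):
    """Score one tokenized document against a tokenized query (hypothetical helper)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # raw count, an assumption of this sketch
        if tf == 0:
            continue
        n_q = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = log((n_docs - n_q + 0.5) / (n_q + 0.5))
        denom = tf + k * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k + 1) / denom
    return score

# Toy corpus, invented for illustration only.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "barked"],
    ["bird", "sang", "song"],
    ["fish", "swam"],
    ["cat", "chased", "mouse"],
]
scores = [bm25_score(["cat", "mouse"], doc, corpus) for doc in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

With this form of IDF, a word that occurs in more than half of the documents gets a negative weight, which is worth keeping in mind on very small corpora.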

In [3]:
from google.colab import drive
GOOGLE_DRIVE_MOUNT = "/content/gdrive"
drive.mount(GOOGLE_DRIVE_MOUNT)

Mounted at /content/gdrive


In [4]:
!pip install pymorphy2

Collecting pymorphy2
  Downloading https://files.pythonhosted.org/packages/a3/33/fff9675c68b5f6c63ec8c6e6ff57827dda28a1fa5b2c2d727dffff92dd47/pymorphy2-0.8-py2.py3-none-any.whl (46kB)
Collecting dawg-python>=0.7 (from pymorphy2)
  Downloading https://files.pythonhosted.org/packages/6a/84/ff1ce2071d4c650ec85745766c0047ccc3b5036f1d03559fd46bb38b5eeb/DAWG_Python-0.7.2-py2.py3-none-any.whl
Collecting pymorphy2-dicts<3.0,>=2.4 (from pymorphy2)
  Downloading https://files.pythonhosted.org/packages/02/51/2465fd4f72328ab50877b54777764d928da8cb15b74e2680fc1bd8cb3173/pymorphy2_dicts-2.4.393442.3710985-py2.py3-none-any.whl (7.1MB)
Installing collected packages: dawg-python, pymorphy2-dicts, pymorphy2
Successfully installed dawg-python-0.7.2 pymorphy2-0.8 pymorphy2-dicts-2.4.393442.3710985


In [0]:
from pymorphy2 import MorphAnalyzer
from nltk.tokenize import WordPunctTokenizer
import re
import time
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from math import log

morph = MorphAnalyzer()
vec = CountVectorizer()

In [6]:
from google.colab import files
uploaded = files.upload()

Saving quora_question_pairs_rus.csv to quora_question_pairs_rus.csv


In [0]:
import io
data = pd.read_csv(io.BytesIO(uploaded['quora_question_pairs_rus.csv']))

In [0]:
data = data.set_index('Unnamed: 0')

In [9]:
data.head()

Unnamed: 0,question1,question2,is_duplicate
0,Какова история кохинор кох-и-ноор-бриллиант,"что произойдет, если правительство Индии украд...",0
1,как я могу увеличить скорость моего интернет-с...,как повысить скорость интернета путем взлома ч...,0
2,"почему я мысленно очень одинок, как я могу это...","найти остаток, когда математика 23 ^ 24 матема...",0
3,которые растворяют в воде быстро сахарную соль...,какая рыба выживет в соленой воде,0
4,астрология: я - луна-колпачок из козерога и кр...,Я тройная луна-козерог и восхождение в козерог...,1


### __Task 1__:    
Write two search engines based on *BM25*: one that computes the metric by the formula for every word-document pair, and one that multiplies a matrix by a vector. 

Compare the search time on 100k queries. As the corpus we take 
[Quora question pairs](https://www.kaggle.com/loopdigga/quora-question-pairs-russian).

In [0]:
def preproc(s):
    t = str(s)
    t = re.sub(r'[A-Za-z0-9<>«»\.!\(\)?,;:\-\"\ufeff]', r'', t)
    text = WordPunctTokenizer().tokenize(t)
    preproc_text = ''
    for w in text:
        new_w = morph.parse(w)[0].normal_form + ' '
        preproc_text += new_w
    return preproc_text

def indexing(column):
    texts_words = []
    idxs = []
    for idx, text in enumerate(column):
        ws = preproc(text)
        texts_words.append(ws)
        idxs.append(idx)
        if len(texts_words) % 1000 == 0:
            print(f'{len(texts_words)} done')
    global vec, vec_bin
    vec = CountVectorizer()  # (min_df=5)  # or TfidfVectorizer?
    vec_bin = CountVectorizer(binary=True)  # min_df=5
    X = vec.fit_transform(texts_words)
    Y = vec_bin.fit_transform(texts_words)
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names(), index=idxs)
    df_bin = pd.DataFrame(Y.toarray(), columns=vec.get_feature_names(), index=idxs)
    return df, df_bin

def q2_vectorizer(column):
    texts_words = []
    idxs = []
    for idx, text in enumerate(column):
        ws = preproc(text)
        texts_words.append(ws)
        idxs.append(idx)
        if len(texts_words) % 1000 == 0:
            print(f'{len(texts_words)} done')
    # vec = CountVectorizer(min_df=5)  # or TfidfVectorizer?
    X = vec.transform(texts_words)
    Y = vec_bin.transform(texts_words)
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names(), index=idxs)    
    df_bin = pd.DataFrame(Y.toarray(), columns=vec.get_feature_names(), index=idxs)
    return df, df_bin


In [0]:
def create_inverse_counter(binary_1, binary_q):  # IDF table, step 1: number of documents containing each word
    word_counter_dict = {}
    for column in binary_q.columns:
        col = binary_1[column]
        sum_ = col.sum()
        word_counter_dict[column] = sum_
    #print(word_counter_dict)
    inverse_counter = pd.DataFrame.from_dict(word_counter_dict, orient='index')
    inverse_counter = inverse_counter.transpose()
    return inverse_counter

def create_idf_table(ids, N):  # IDF table, step 2: apply the IDF formula to the document counts
    idfs = {}
    for w in ids:
        idf = log((N - ids[w] + 0.5)/(ids[w] +0.5))
        idfs[w] = idf
    idf_table = pd.DataFrame.from_dict(idfs, orient='index')
    idf_table = idf_table.transpose()
    return idf_table  

def create_tf_table(df):  # TF table (note: adds a 'sum' column to df in place)
    df['sum'] = df.sum(axis=1)
    tf_table = df.div(df['sum'], axis=0)
    tf_table = tf_table.fillna(0)
    #df.loc[:, cols] = df.loc[:, cols].div(df['sum'], axis=0)
    #df.loc[:,:] = df.iloc[:,:].div(df.sum, axis=0)
    return tf_table

def dl_avgdl(data):  # returns the vector of text lengths divided by the average length, and the average text length
    data['sum'] = data.sum(axis = 1, skipna = True) 
    sums = data['sum']
    avg = data['sum'].mean()
    sums_normalized = sums.div(avg)
    return sums_normalized, avg  
    
def create_connector(dataset):  # returns a dict: key is the query number, value is the list of document numbers related to that query
    col = dataset['question1']
    connector_dict = {}
    primary_sentences = {}
    for i in range(len(dataset)):
        #curr_sent = dataset.iloc[i,1]
        curr_sent = col[i]
        if curr_sent in primary_sentences:
            primary_sentences[curr_sent].append(i)
        else:
            primary_sentences[curr_sent] = [i]
    for i in range(len(dataset)):
        #curr_sent = dataset.loc[1,i]
        curr_sent = col[i]
        if len(primary_sentences[curr_sent]) > 1:
            repeating = primary_sentences[curr_sent]
            for r in repeating:
                connector_dict[r] = repeating
        else:
            connector_dict[i] = [i]
    return connector_dict


#### Disclaimer:

I could not make the loop-based processing fast, so for the loop I took only the first 500 rows of the Quora data.

In [0]:
# initialize everything we have written above
q1_df, q1_df_bin = indexing(data['question1'][:500])
q2_df, q2_df_bin = q2_vectorizer(data['question2'][:500])

q1_df = q1_df.fillna(0)
q1_df_bin = q1_df_bin.fillna(0)
q2_df = q2_df.fillna(0)
q2_df_bin = q2_df_bin.fillna(0)

is_dupl = data['is_duplicate'][:500]

dls, avg_dl = dl_avgdl(q2_df)

connector_table = create_connector(data[:500])

tf_table = create_tf_table(q1_df)

inv_df = create_inverse_counter(q1_df_bin, q2_df_bin)

idfs = create_idf_table(inv_df, q1_df_bin.shape[0])

In [0]:
q2_df.head()

Unnamed: 0,абстрактный,абстракция,австралия,авто,автомагистраль,автоматизация,автоматизировать,автомобиль,автономный,автор,...,язык,языковой,яичный,яйцо,якшиня,японец,япония,японский,ясный,sum
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,7
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4


In [0]:
def timing(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print(((time2-time1)))
        return ret
    return wrap

In [0]:
# q2_df, q2_df_bin, q1_df, q1_df_bin - word tables
# tf_table - TF table
# connector_table - table linking columns 1 and 2
# inv_df - number of occurrences of each word across all documents
# sums, avg - dl, avgdl
k = 2.0
b = 0.75

def bm_25(doc_idx, query, wordlist, docs_num, tfs, idfs, c1, c2):  # the BM25 score itself; c1 = k + 1, c2 = k * (1 - b + b * dl / avgdl)
    bm_val = 0
    for w in wordlist[:-1]:
        if query[w] != 0:
            idf = float(idfs[w])
            tf_value = float(tfs.iloc[doc_idx][w])
            bm_i = idf * ((tf_value * c1)/(tf_value + c2))
            bm_val += bm_i
    return bm_val

@timing
def evaluation(doc_data, q_data, correspondence, is_rel, tfs, idfs, dls_norm, k, b):  # returns a DataFrame with, for each query, the 5 most relevant documents and a flag showing whether the top five contains at least one document on the same topic
    #start_time = datetime.now()
    similarity_table = pd.DataFrame(columns=['idx_q1', 'relevant_docs', 'match'])
    N = doc_data.shape[0]
    lemmas_list = list(q_data.columns)
    const_little = k + 1
    for q_idx, q_words in q_data.iterrows():
        if q_idx % 10 == 0:
            print(f'query #: {q_idx}')
        relevance_dict = {}
        for d_idx, d_words in doc_data.iterrows():
            len_norm = dls_norm[d_idx]
            const_big = k * (1 - b + b * len_norm)#len_doc/avgdl)
            bm_25_ = bm_25(d_idx, q_words, lemmas_list, N, tfs, idfs, const_little, const_big)
            #print(bm_25_)
            # NOTE: the dict is keyed by score, so documents with equal scores overwrite each other
            relevance_dict[bm_25_] = d_idx
            #print(f'bm for {d_idx} doc is: {bm_25_}')
        rel_sorted = sorted(relevance_dict.items(), reverse=True)
        best_5_rel = [el[1] for el in rel_sorted[:5]]
        matches = 0
        for d_idx in correspondence[q_idx]:
            #print(d_idx, correspondence[d_idx])
            if d_idx in best_5_rel and is_rel[d_idx] == 1:
                matches += 1
        if matches > 0:
            matches = 1
        similarity_table = similarity_table.append({'idx_q1': q_idx, 'relevant_docs': best_5_rel, 'match':matches}, ignore_index=True)
        rel_sorted = {}
        best_5_rel = []
    #end_time = datetime.now()
    #timing = 'Duration: {}'.format(end_time - start_time)
    return similarity_table#, timing     
            

In [28]:
final = evaluation(q2_df, q1_df, connector_table, is_dupl, tf_table, idfs, dls, k, b)

query #: 0
query #: 10
query #: 20
query #: 30
query #: 40
query #: 50
query #: 60
query #: 70
query #: 80
query #: 90
query #: 100
query #: 110
query #: 120
query #: 130
query #: 140
query #: 150
query #: 160
query #: 170
query #: 180
query #: 190
query #: 200
query #: 210
query #: 220
query #: 230
query #: 240
query #: 250
query #: 260
query #: 270
query #: 280
query #: 290
query #: 300
query #: 310
query #: 320
query #: 330
query #: 340
query #: 350
query #: 360
query #: 370
query #: 380
query #: 390
query #: 400
query #: 410
query #: 420
query #: 430
query #: 440
query #: 450
query #: 460
query #: 470
query #: 480
query #: 490
2592.3042113780975


In [29]:
final

Unnamed: 0,idx_q1,relevant_docs,match
0,0,"[0, 498, 239, 205, 268]",0
1,1,"[309, 1, 328, 84, 362]",0
2,2,"[2, 434, 309, 205, 364]",0
3,3,"[3, 134, 246, 230, 379]",0
4,4,"[4, 434, 278, 113, 205]",1
5,5,"[5, 492, 318, 345, 264]",0
6,6,"[6, 309, 364, 205, 317]",1
7,7,"[7, 309, 277, 314, 301]",0
8,8,"[8, 41, 212, 309, 470]",0
9,9,"[9, 222, 499]",0


In [30]:
from sklearn.metrics import accuracy_score
# metric function
def metric(y_true, y_pred):
    y_true = list(y_true)
    y_pred = list(y_pred)
    acc = accuracy_score(y_true, y_pred)
    return acc

metric(is_dupl, final['match'])

1.0

## Matrix approach

#### More data can be used for it, which is what we do (10,000 queries are taken):

In [14]:
# initialize everything we have written above
q1_df, q1_df_bin = indexing(data['question1'][:10000])
q2_df, q2_df_bin = q2_vectorizer(data['question2'][:10000])

q1_df = q1_df.fillna(0)
q1_df_bin = q1_df_bin.fillna(0)
q2_df = q2_df.fillna(0)
q2_df_bin = q2_df_bin.fillna(0)

is_dupl = data['is_duplicate'][:10000]

dls, avg_dl = dl_avgdl(q2_df)

connector_table = create_connector(data[:10000])

tf_table = create_tf_table(q1_df)

inv_df = create_inverse_counter(q1_df_bin, q2_df_bin)

idfs = create_idf_table(inv_df, q1_df_bin.shape[0])

1000 done
2000 done
3000 done
4000 done
5000 done
6000 done
7000 done
8000 done
9000 done
10000 done
1000 done
2000 done
3000 done
4000 done
5000 done
6000 done
7000 done
8000 done
9000 done
10000 done


In [0]:
@timing
def matrix_multiplication(queries, docs, tfs, idfs, avg_lens, k, b):  # returns the table of BM25 scores of every document for every query
    tfs = tfs.drop(columns=['sum']) # 1
    conversion_table = queries.mul(tfs) #2
    conversion_table_numerator = conversion_table.mul(k+1) #3
    coefficient = avg_lens.mul(b) #4
    coefficient = coefficient.add(1-b) #5
    coefficient = coefficient.mul(k) # 6
    # NOTE: `coefficient` (the only place where b is used) is not used below;
    # the denominator is tf * (dl / avgdl), not tf + k * (1 - b + b * dl / avgdl)
    conversion_table_denominator = conversion_table.mul(avg_lens, axis=0) #7
    tf_factor = conversion_table_numerator.divide(conversion_table_denominator) #8
    tf_factor = tf_factor.fillna(0) #9
    n = tf_factor.shape[0]
    idf_table = pd.concat([idfs]*n, ignore_index=True) #10 
    term_table = tf_factor.mul(idf_table, axis=1) #11
    docs = docs.T
    #print(f'qs: {term_table.shape}, docs: {docs.shape}')
    bm_25_df = term_table.dot(docs) #12
    return bm_25_df

def matrix_best_5(d, correspondence, is_rel):  # returns the 5 most relevant documents for each query and a flag showing whether there is at least one match among them
    best_5_df = pd.DataFrame(columns=['idx_q1', 'relevant_docs', 'match'])
    d = d.replace([np.inf, -np.inf], np.nan)
    #d = d.dropna(inplace=True)
    d = d.fillna(0)
    for q_idx, q_row in d.iteritems():
        q_row = q_row.astype('int64')  # NOTE: the cast truncates the fractional part of the scores
        best_5 = list(q_row.nlargest(5).index)
        matches = 0
        for d_idx in correspondence[q_idx]:
            #print(d_idx, correspondence[d_idx])
            if d_idx in best_5 and is_rel[d_idx] == 1:
                matches += 1
        if matches > 0:
            matches = 1
        best_5_df = best_5_df.append({'idx_q1': q_idx, 'relevant_docs': best_5, 'match':matches}, ignore_index=True)
    return best_5_df
        

In [33]:
bm_25_matrices = matrix_multiplication(q1_df_bin, q2_df_bin, tf_table, idfs, dls, k, b)

65.27619171142578


In [0]:
bm_25_matrices

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4990,4991,4992,4993,4994,4995,4996,4997,4998,4999
0,22.440413,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,22.082802,0.000000,0.000000,0.000000,0.000000,0.000000,12.487649,8.212912,5.065786,...,0.000000,0.000000,18.935676,0.000000,3.147126,0.000000,0.000000,0.000000,5.065786,3.147126
2,0.000000,3.671647,0.000000,0.000000,10.869407,0.000000,0.000000,0.000000,9.581730,5.910084,...,8.319426,0.000000,0.000000,0.000000,3.671647,0.000000,0.000000,0.000000,5.910084,3.671647
3,0.000000,0.000000,0.000000,31.943916,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,8.056541,0.000000,0.000000,0.000000,94.242482,8.056541,8.056541,0.000000,0.000000,8.056541,...,0.000000,8.056541,0.000000,0.000000,0.000000,0.000000,8.056541,0.000000,0.000000,0.000000
5,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,9.667044,0.000000,0.000000,0.000000,...,0.000000,9.667044,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,3.147126,0.000000,0.000000,0.000000,0.000000,33.287992,0.000000,8.212912,5.065786,...,0.000000,0.000000,0.000000,0.000000,3.147126,6.443259,0.000000,0.000000,12.707591,3.147126
7,0.000000,0.000000,19.368669,0.000000,0.000000,0.000000,0.000000,89.670517,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,12.479139,0.000000
8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,25.826195,5.910084,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,5.910084,0.000000
9,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [38]:
results_matrix = matrix_best_5(bm_25_matrices, connector_table, is_dupl)
results_matrix

Unnamed: 0,idx_q1,relevant_docs,match
0,0,"[7180, 5580, 819, 1206, 5146]",0
1,1,"[3838, 1274, 729, 105, 5727]",0
2,2,"[547, 7138, 3348, 3725, 5146]",0
3,3,"[7346, 2331, 4030, 6108, 7250]",0
4,4,"[5474, 2129, 4, 547, 5172]",1
5,5,"[7866, 1927, 7180, 7378, 1206]",0
6,6,"[1927, 7180, 8014, 1206, 5511]",0
7,7,"[7, 309, 8003, 1927, 5210]",0
8,8,"[5726, 1450, 2813, 1268, 3620]",0
9,9,"[7180, 7346, 3059, 3190, 7147]",0


In [39]:
metric(is_dupl, results_matrix['match'])  # similarity metric

0.7909
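
For comparison, here is a small NumPy sketch of the matrix formulation in which the length-normalization term $k(1-b+b\frac{l(D)}{avgdl})$ is added to TF in the denominator, as in the formula at the top. In `matrix_multiplication` above, `coefficient` is computed but does not enter the denominator. The names `bm25_matrix`, `counts` and `query` are hypothetical, the data is a toy example, raw counts are used as TF (an assumption), and every document is assumed to be non-empty.

```python
import numpy as np

def bm25_matrix(tf, k=2.0, b=0.75):
    """Return the (n_docs, n_terms) matrix of BM25 term weights from raw counts."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[0]
    doc_len = tf.sum(axis=1, keepdims=True)         # l(D), one value per row
    len_norm = doc_len / doc_len.mean()             # l(D) / avgdl
    df = (tf > 0).sum(axis=0)                       # n(q_i), one value per column
    idf = np.log((n_docs - df + 0.5) / (df + 0.5))
    denom = tf + k * (1 - b + b * len_norm)         # broadcast over columns
    return idf * tf * (k + 1) / denom

# Toy document-term count matrix: 5 documents, 4 terms.
counts = np.array([
    [2, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
])
query = np.array([1, 0, 1, 0])            # binary query vector: terms 0 and 2
scores = bm25_matrix(counts) @ query      # one matrix-vector product per query
ranking = np.argsort(-scores)             # document indices, best first
print(ranking[:2], scores.round(3))
```

Stacking several query vectors as columns turns the last step into a single matrix-matrix product, which is the idea behind the `dot` call in the notebook.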

### __Task 2__:    

Print the first 10 results and their BM25 similarity for the query **рождественские каникулы** ("Christmas holidays") on our Quora question pairs corpus. 

In [0]:
def matrix_multiplication_for_query(query, docs, tfs, idfs, avg_lens, k, b):
    tfs = tfs.drop(columns=['sum']) # ok
    conversion_table = docs.mul(tfs) #ok
    conversion_table_numerator = conversion_table.mul(k+1) #ok
    coefficient = avg_lens.mul(b) #ok
    coefficient = coefficient.add(1-b) #ok
    coefficient = coefficient.mul(k) # ok
    conversion_table_denominator = conversion_table.mul(avg_lens, axis=0) #ok
    tf_factor = conversion_table_numerator.divide(conversion_table_denominator) #ok
    tf_factor = tf_factor.fillna(0) #ok
    n = tf_factor.shape[0]
    idf_table = pd.concat([idfs]*n, ignore_index=True) 
    term_table = tf_factor.mul(idf_table, axis=1) #0 - vertical, 1 - horizontal
    query = query.T
    print(f'docs: {term_table.shape}, query: {query.shape}')
    bm_25_df = term_table.dot(query)
    return bm_25_df

def best_10_res(query_table, ultimate_dataset):  # the 10 most relevant documents for the query
    query_table = query_table.astype('int64')
    best_10 = list(query_table.nlargest(10, 0).index)
    best_10_df = ultimate_dataset.loc[best_10,'question2']
    best_10_df = pd.DataFrame(best_10_df)
    return best_10_df
            

In [43]:
query = 'рождественские каникулы'
q_vec = q2_vectorizer([query])
result_query = matrix_multiplication_for_query(q_vec[1], q2_df_bin, tf_table, idfs, dls, k, b)

docs: (10000, 9277), query: (9277, 1)


In [44]:
res = best_10_res(result_query, data)
res

Unnamed: 0,question2
6587,каков ваш рождественский список
0,"что произойдет, если правительство Индии украд..."
1,как повысить скорость интернета путем взлома ч...
2,"найти остаток, когда математика 23 ^ 24 матема..."
3,какая рыба выживет в соленой воде
4,Я тройная луна-козерог и восхождение в козерог...
5,что делает детей активными и далеки от телефон...
6,"что я должен делать, чтобы быть великим геологом?"
7,когда вы используете вместо
8,как я могу взломать motorola dcx3400 для беспл...
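
In the output above only the first row looks related to the query; the remaining rows are documents 0 to 8 in index order, which is what one would expect if they all tie at the same score (most likely zero). A minimal sketch of a top-k selection that keeps the scores and drops non-positive ones is shown below. The helper name `top_k` and the score list are made up for the example, and the scores are kept as floats rather than cast to `int64`, since the cast would turn every score below 1 into 0.

```python
import heapq

def top_k(scores, k=10):
    """Return up to k (doc_index, score) pairs with a positive score, best first."""
    positive = [(i, s) for i, s in enumerate(scores) if s > 0]
    return heapq.nlargest(k, positive, key=lambda pair: pair[1])

# Hypothetical BM25 scores of 7 documents for one query.
scores = [0.0, 0.0, 0.4, 0.0, 3.1, 0.0, 0.9]
result = top_k(scores, k=3)
print(result)
```

Filtering on `s > 0` also drops documents whose only matching words have a negative IDF; whether that is desirable depends on the corpus.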


### __Task 3__:    

Compute the search accuracy for 
1. BM25, b=0.75 
2. BM15, b=0 
3. BM11, b=1
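
To see what the three variants change, it helps to look at the TF part of the formula alone. The sketch below uses hypothetical helper names (`length_factor`, `tf_component`) and toy document lengths: it evaluates $\frac{TF(k+1)}{TF+k(1-b+b\frac{l(D)}{avgdl})}$ for a short and a long document under each value of $b$.

```python
def length_factor(doc_len, avgdl, k=2.0, b=0.75):
    """The term k * (1 - b + b * l(D) / avgdl) from the BM25 denominator."""
    return k * (1 - b + b * doc_len / avgdl)

def tf_component(tf, doc_len, avgdl, k=2.0, b=0.75):
    """The TF part of one BM25 summand, without the IDF factor."""
    return tf * (k + 1) / (tf + length_factor(doc_len, avgdl, k, b))

avgdl = 10  # toy average document length
for name, b in [("BM25", 0.75), ("BM15", 0.0), ("BM11", 1.0)]:
    short = tf_component(1, 5, avgdl, b=b)    # document half the average length
    long_ = tf_component(1, 20, avgdl, b=b)   # document twice the average length
    print(f"{name}: short={short:.3f} long={long_:.3f}")
```

With $b=0$ the document length drops out entirely, so both documents get the same value; with $b=1$ the length is fully normalized and the gap between short and long documents is largest.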

#### 1. BM25, b=0.75

In [0]:
k = 2.0


In [23]:

b2 = 0.75
bm_25_matrix_b2 = matrix_multiplication(q1_df_bin, q2_df_bin, tf_table, idfs, dls, k, b2)


61.148287773132324



In [24]:
results_matrix_b2 = matrix_best_5(bm_25_matrix_b2, connector_table, is_dupl)
results_matrix_b2

Unnamed: 0,idx_q1,relevant_docs,match
0,0,"[7180, 5580, 819, 1206, 5146]",0
1,1,"[3838, 1274, 729, 105, 5727]",0
2,2,"[547, 7138, 3348, 3725, 5146]",0
3,3,"[7346, 2331, 4030, 6108, 7250]",0
4,4,"[5474, 2129, 4, 547, 5172]",1
5,5,"[7866, 1927, 7180, 7378, 1206]",0
6,6,"[1927, 7180, 8014, 1206, 5511]",0
7,7,"[7, 309, 8003, 1927, 5210]",0
8,8,"[5726, 1450, 2813, 1268, 3620]",0
9,9,"[7180, 7346, 3059, 3190, 7147]",0


In [27]:
metric(is_dupl, results_matrix_b2['match'])

0.7909

#### 2. BM15, b=0

In [18]:
b3 = 0
bm_25_matrix_b3 = matrix_multiplication(q1_df_bin, q2_df_bin, tf_table, idfs, dls, k, b3)
results_matrix_b3 = matrix_best_5(bm_25_matrix_b3, connector_table, is_dupl)


76.8110704421997


In [19]:
metric(is_dupl, results_matrix_b3['match'])

0.7909

#### 3. BM11, b=1

In [20]:
b4 = 1
bm_25_matrix_b4 = matrix_multiplication(q1_df_bin, q2_df_bin, tf_table, idfs, dls, k, b4)
results_matrix_b4 = matrix_best_5(bm_25_matrix_b4, connector_table, is_dupl)
results_matrix_b4

77.23152828216553


Unnamed: 0,idx_q1,relevant_docs,match
0,0,"[7180, 5580, 819, 1206, 5146]",0
1,1,"[3838, 1274, 729, 105, 5727]",0
2,2,"[547, 7138, 3348, 3725, 5146]",0
3,3,"[7346, 2331, 4030, 6108, 7250]",0
4,4,"[5474, 2129, 4, 547, 5172]",1
5,5,"[7866, 1927, 7180, 7378, 1206]",0
6,6,"[1927, 7180, 8014, 1206, 5511]",0
7,7,"[7, 309, 8003, 1927, 5210]",0
8,8,"[5726, 1450, 2813, 1268, 3620]",0
9,9,"[7180, 7346, 3059, 3190, 7147]",0


In [21]:
metric(is_dupl, results_matrix_b4['match'])

0.7909

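## Conclusions


*   Matrix multiplication is orders of magnitude faster than the loop-based computation: the loop took about 2592 s for 500 queries against 500 documents, while the matrix version took about 61 to 77 s for 10,000 queries against 10,000 documents.
*   A matrix built from more than 1000 sentences requires a lot of memory, which regularly crashes the kernel.
*   The classification quality of BM25 is very good. On a sample of 1000 sentences (with the matrix approach) the quality is around 92%; on the 10,000-question sample above the metric is 0.7909.
*   The parameter b (at least when it is the only one changed) does not affect the metric: b = 0.75, b = 0 and b = 1 all give 0.7909. One possible reason is that k and b should be varied together. Another is visible in `matrix_multiplication`: the `coefficient` term, the only place where b is used, is computed but never enters the denominator, so the scores do not depend on b.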
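
The memory point can be made concrete with a rough estimate. The sketch below assumes 8 bytes per cell (as for `int64` or `float64`) and ignores pandas overhead; the shape 10,000 × 9,277 is the one printed for the document table in Task 2, and the helper name `dense_bytes` is made up for the example.

```python
def dense_bytes(n_rows, n_cols, bytes_per_cell=8):
    """Approximate size of a dense numeric table, ignoring container overhead."""
    return n_rows * n_cols * bytes_per_cell

docs_table = dense_bytes(10_000, 9_277)    # one document-term table
score_table = dense_bytes(10_000, 10_000)  # one query-by-document score table
print(f"{docs_table / 1e6:.0f} MB, {score_table / 1e6:.0f} MB")
```

The notebook keeps several such tables at once (`q1_df`, `q1_df_bin`, `q2_df`, `q2_df_bin`, `tf_table`, plus intermediates inside `matrix_multiplication`), so the total is a multiple of these figures. Keeping the sparse matrices returned by `CountVectorizer` instead of calling `.toarray()` would be one way to reduce it.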





