Implementing a QA system with TF-IDF

In [2]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=f18dd491a169affacbe3853d32cab9262144f525522dd85ffc5f7e486135b014
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [3]:
# An example question
# and the Wikipedia pages returned for it.
import wikipedia as wiki

k = 5
question = "What are the tourist hotspots in Portugal?"

results = wiki.search(question, results=k)
print('Question:', question)
print('Pages:  ', results)

Question: What are the tourist hotspots in Portugal?
Pages:   ['Tourist attraction', 'Portugal', 'Porto', 'Goa', 'Algarve']


In [4]:
import json
import numpy as np
import pandas as pd

In [5]:
import zipfile  # for unpacking the dataset
from google.colab import drive
drive.mount('/content/drive/')

# unpack the dataset
file = '/content/drive/My Drive/Dataset/SQuAD.zip'
with zipfile.ZipFile(file, mode='r') as archive:
  archive.extractall()

Mounted at /content/drive/


In [6]:
def squad_json_to_dataframe(file_path, record_path=['data','paragraphs','qas','answers']):
    """
    file_path: path to the json file.
    record_path: path to the deepest level in the json file, defaults to
    ['data','paragraphs','qas','answers']
    """
    # read the file with a context manager so the handle is closed
    with open(file_path, encoding='utf-8') as f:
        file = json.load(f)
    # parse the different levels of the json file
    m = pd.json_normalize(file, record_path[:-1])  # one row per question
    r = pd.json_normalize(file, record_path[:-2])  # one row per paragraph
    # combine them into one dataframe: repeat each context once per question
    idx = np.repeat(r['context'].values, r.qas.str.len())
    m['context'] = idx
    data = m[['id','question','context','answers']].set_index('id').reset_index()
    # number the contexts in order of first appearance
    data['c_id'] = data['context'].factorize()[0]
    return data
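The flattening step above can be tried on a tiny input. The sketch below is only an illustration: the `toy` dictionary is a made-up miniature that assumes the same nesting as SQuAD (articles, paragraphs, questions), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the SQuAD layout: 1 article, 2 paragraphs, 3 questions.
toy = {
    "data": [{
        "title": "Toy",
        "paragraphs": [
            {"context": "Lisbon is the capital of Portugal.",
             "qas": [
                 {"id": "q1", "question": "What is the capital of Portugal?",
                  "answers": [{"answer_start": 0, "text": "Lisbon"}]},
                 {"id": "q2", "question": "Which country has Lisbon as its capital?",
                  "answers": [{"answer_start": 25, "text": "Portugal"}]},
             ]},
            {"context": "Porto lies on the Douro river.",
             "qas": [
                 {"id": "q3", "question": "Which river runs through Porto?",
                  "answers": [{"answer_start": 18, "text": "Douro"}]},
             ]},
        ],
    }]
}

# One row per question and one row per paragraph.
questions = pd.json_normalize(toy, ["data", "paragraphs", "qas"])
paragraphs = pd.json_normalize(toy, ["data", "paragraphs"])

# Repeat each paragraph text once per question asked about it.
repeats = paragraphs["qas"].str.len().to_numpy()
questions["context"] = np.repeat(paragraphs["context"].to_numpy(), repeats)

# Integer id per distinct context, in order of first appearance.
questions["c_id"] = questions["context"].factorize()[0]
print(questions[["id", "c_id"]])
```

The first two questions share one context and therefore one `c_id`, which is what later lets a question be scored against the document it came from.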

In [7]:
file_path = '/content/train-v1.1.json'
data = squad_json_to_dataframe(file_path)
data

,id,question,context,answers,c_id
0,5733be284776f41900661182,To whom did the Virgin Mary allegedly appear i...,"Architecturally, the school has a Catholic cha...","[{'answer_start': 515, 'text': 'Saint Bernadet...",0
1,5733be284776f4190066117f,What is in front of the Notre Dame Main Building?,"Architecturally, the school has a Catholic cha...","[{'answer_start': 188, 'text': 'a copper statu...",0
2,5733be284776f41900661180,The Basilica of the Sacred heart at Notre Dame...,"Architecturally, the school has a Catholic cha...","[{'answer_start': 279, 'text': 'the Main Build...",0
3,5733be284776f41900661181,What is the Grotto at Notre Dame?,"Architecturally, the school has a Catholic cha...","[{'answer_start': 381, 'text': 'a Marian place...",0
4,5733be284776f4190066117e,What sits on top of the Main Building at Notre...,"Architecturally, the school has a Catholic cha...","[{'answer_start': 92, 'text': 'a golden statue...",0
...,...,...,...,...,...
87594,5735d259012e2f140011a09d,In what US state did Kathmandu first establish...,"Kathmandu Metropolitan City (KMC), in order to...","[{'answer_start': 229, 'text': 'Oregon'}]",18890
87595,5735d259012e2f140011a09e,What was Yangon previously known as?,"Kathmandu Metropolitan City (KMC), in order to...","[{'answer_start': 414, 'text': 'Rangoon'}]",18890
87596,5735d259012e2f140011a09f,With what Belorussian city does Kathmandu have...,"Kathmandu Metropolitan City (KMC), in order to...","[{'answer_start': 476, 'text': 'Minsk'}]",18890
87597,5735d259012e2f140011a0a0,In what year did Kathmandu create its initial ...,"Kathmandu Metropolitan City (KMC), in order to...","[{'answer_start': 199, 'text': '1975'}]",18890


In [8]:
# How many documents do we have?
data['c_id'].unique().size

18891

**Getting the unique documents**

Select the unique documents in our data. This is the list of documents that will be searched for answers.



In [9]:
documents = data[['context', 'c_id']].drop_duplicates().reset_index(drop=True)
documents

,context,c_id
0,"Architecturally, the school has a Catholic cha...",0
1,"As at most other universities, Notre Dame's st...",1
2,The university is the major seat of the Congre...,2
3,The College of Engineering was established in ...,3
4,All of Notre Dame's undergraduate students are...,4
...,...,...
18886,"Institute of Medicine, the central college of ...",18886
18887,Football and Cricket are the most popular spor...,18887
18888,The total length of roads in Nepal is recorded...,18888
18889,The main international airport serving Kathman...,18889


In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# TF-IDF configuration
tfidf_configs = {
    'lowercase': True,
    'analyzer': 'word',
    'stop_words': 'english',
    'binary': True,
    'max_df': 0.9,
    'max_features': 10_000
}
# number of documents to retrieve
retriever_configs = {
    'n_neighbors': 10,
    'metric': 'cosine'
}

# define our pipeline
embedding = TfidfVectorizer(**tfidf_configs)
retriever = NearestNeighbors(**retriever_configs)

In [11]:
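# Fit the retriever on the document vectors. NearestNeighbors ignores the
# second argument; the row positions it returns coincide with 'c_id' because
# `documents` keeps the contexts in order of first appearance.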
X = embedding.fit_transform(documents['context'])
retriever.fit(X, documents['c_id'])
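The same fit-then-query flow can be followed on a few sentences. This is a minimal sketch on a made-up corpus: `toy_docs`, `toy_embedding` and `toy_retriever` are hypothetical names, and the settings are a reduced version of the configuration above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up corpus, for illustration only.
toy_docs = [
    "Lisbon has a metro and a historic tram network.",
    "The Algarve coast attracts many tourists every summer.",
    "Kathmandu is the capital of Nepal.",
]

# Vectorize the documents and index the vectors for cosine-distance search.
toy_embedding = TfidfVectorizer(lowercase=True, stop_words="english", binary=True)
toy_X = toy_embedding.fit_transform(toy_docs)
toy_retriever = NearestNeighbors(n_neighbors=2, metric="cosine")
toy_retriever.fit(toy_X)

# The query must be transformed with the vectorizer fitted on the documents.
query = toy_embedding.transform(["Where do tourists go in the Algarve?"])
distances, indices = toy_retriever.kneighbors(query)

# `indices` holds row positions in `toy_docs`; a cosine distance of 1.0
# means the query and the document share no vocabulary term.
print(indices[0], distances[0])
```

`kneighbors` returns row positions rather than labels, so mapping a result back to a document only works when the row order of the fitted matrix matches the row order of the document table.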

Let's test the vectorizer: which information does our model use to build the vector?

In [12]:
def transform_text(vectorizer, text):
    '''
    Print the text and the terms of its TF-IDF vector.
    vectorizer: a fitted sklearn vectorizer
    text: str
    '''
    print('Text:', text)
    vector = vectorizer.transform([text])
    # map the nonzero entries of the vector back to vocabulary terms
    vector = vectorizer.inverse_transform(vector)
    print('Vector:', vector)

In [13]:
# vectorize the question
transform_text(embedding, question)

Text: What are the tourist hotspots in Portugal?
Vector: [array(['tourist', 'portugal'], dtype='<U18')]


Which document is most similar to this question?

In [14]:
# predict the most similar document
X = embedding.transform([question])
c_id = retriever.kneighbors(X, return_distance=False)[0][0]
selected = documents.iloc[c_id]['context']

# vectorize the document
transform_text(embedding, selected)

Text: The two largest metropolitan areas have subway systems: Lisbon Metro and Metro Sul do Tejo in the Lisbon Metropolitan Area and Porto Metro in the Porto Metropolitan Area, each with more than 35 km (22 mi) of lines. In Portugal, Lisbon tram services have been supplied by the Companhia de Carris de Ferro de Lisboa (Carris), for over a century. In Porto, a tram network, of which only a tourist line on the shores of the Douro remain, began construction on 12 September 1895 (a first for the Iberian Peninsula). All major cities and towns have their own local urban transport network, as well as taxi services.
Vector: [array(['urban', 'transport', 'towns', 'tourist', 'systems', 'supplied',
       'subway', 'shores', 'services', 'september', 'remain', 'portugal',
       'porto', 'peninsula', 'network', 'mi', 'metropolitan', 'metro',
       'major', 'local', 'lisbon', 'lines', 'line', 'largest', 'km',
       'iberian', 'construction', 'cities', 'century', 'began', 'areas',
       'area', 
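To see why a particular document was retrieved, it can help to list the vocabulary terms that the question and the document have in common. The helper below is a hypothetical sketch (`shared_terms` is not part of the notebook), shown on a made-up two-sentence corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, for illustration only.
corpus = [
    "In Porto only a tourist tram line remains. "
    "Portugal has subway systems in Lisbon and Porto.",
    "Kathmandu is the capital of Nepal.",
]
vec = TfidfVectorizer(stop_words="english", binary=True).fit(corpus)

def shared_terms(vectorizer, question, document):
    """Return the sorted vocabulary terms present in both texts."""
    q_terms = vectorizer.inverse_transform(vectorizer.transform([question]))[0]
    d_terms = vectorizer.inverse_transform(vectorizer.transform([document]))[0]
    # Convert numpy strings to plain str before intersecting.
    return sorted({str(t) for t in q_terms} & {str(t) for t in d_terms})

overlap = shared_terms(vec, "What are the tourist hotspots in Portugal?", corpus[0])
print(overlap)
```

Only terms that survive stop-word removal and exist in the fitted vocabulary can contribute, so a word such as "hotspots" that never occurs in the corpus plays no role in the match.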

In [15]:
# make predictions for every question
X = embedding.transform(data['question'])
y_test = data['c_id']
y_pred = retriever.kneighbors(X, return_distance=False)

In [17]:
# top documents predicted for each question
y_pred

array([[    0,  3694, 10613, ..., 17590,  6913,  6912],
       [    7,  1469,     2, ...,    29, 14201,    17],
       [   38,  1469, 14152, ...,    28,     7, 14201],
       ...,
       [18890, 18884, 18836, ..., 12302, 18837,  4200],
       [18890,  3537, 18841, ..., 16014, 18884, 10882],
       [12592, 12591, 12598, ..., 12593, 12600, 12588]])

In [18]:
def top_accuracy(y_true, y_pred) -> float:
    right, count = 0, 0
    for i, y_t in enumerate(y_true):
        count += 1
        if y_t in y_pred[i]:
            right += 1
    return right / count if count > 0 else 0
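The same metric can also be computed without a Python loop, and restricted to the first k retrieved documents. This is a sketch under the assumption that `y_pred` is a 2-D array with one row of ranked document ids per question; `top_k_accuracy` is a hypothetical name, not a function used elsewhere in the notebook.

```python
import numpy as np

def top_k_accuracy(y_true, y_pred, k=None):
    """Share of rows whose true id appears among the first k predictions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if len(y_true) == 0:
        return 0.0
    if k is not None:
        y_pred = y_pred[:, :k]  # keep only the k best-ranked candidates
    # Compare every candidate with the true id of its row via broadcasting.
    hits = (y_pred == y_true[:, None]).any(axis=1)
    return float(hits.mean())

# Made-up predictions: 3 questions, 2 ranked candidates each.
toy_true = [0, 1, 2]
toy_pred = [[0, 5], [3, 1], [4, 6]]
print(top_k_accuracy(toy_true, toy_pred))       # hit in rows 0 and 1
print(top_k_accuracy(toy_true, toy_pred, k=1))  # hit in row 0 only
```

Calling it with several values of `k` on the same `y_pred` shows how much of the score depends on returning a long candidate list.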

In [19]:
# Achieved top-10 accuracy
acc = top_accuracy(y_test, y_pred)
print('Accuracy:', f'{acc:.4f}')
print('Quantity:', int(acc*len(y_pred)), 'from', len(y_pred))

Accuracy: 0.7148
Quantity: 62615 from 87599


TF-IDF clearly has several limitations:

The algorithm can only compute similarity between questions and documents that share the same words, so it cannot capture synonyms.

It also has no notion of what the words mean.

In addition, we had to return a list of several documents in response to each question.

Still, we reached a fairly good top-10 accuracy of about 71%: for 62,615 of the 87,599 questions, the correct document is among the 10 retrieved.

One advantage of this approach is its speed.
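The synonym limitation is easy to reproduce. In the sketch below, on a made-up pair of sentences that state the same fact with different words, a question about a "car" gets zero similarity with the sentence that says "automobile".

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sentences, for illustration only.
docs = [
    "The automobile was repaired at the garage.",
    "The car was repaired at the garage.",
]
vec = TfidfVectorizer(stop_words="english").fit(docs)

# After stop-word removal, "car" is the only question term in the vocabulary.
q = vec.transform(["Who fixed the car?"])
sims = cosine_similarity(q, vec.transform(docs))[0]

# sims[0] is exactly 0: "automobile" and "car" are unrelated columns.
# sims[1] is positive because the literal token "car" is shared.
print(sims)
```

The word "fixed" does not help either, since neither sentence contains it, even though "repaired" means the same thing.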

In [24]:
# helper functions for using the QA system
def pred_top1_answer(question):
  '''
  Print the top-1 document matching the question.
  '''
  X = embedding.transform([question])
  c_id = retriever.kneighbors(X, return_distance=False)[0][0]
  selected = documents.iloc[c_id]['context']

  print(selected)

def pred_top10_answers(question):
  '''
  Print the top-10 documents matching the question.
  '''
  X = embedding.transform([question])
  answers = retriever.kneighbors(X, return_distance=False)[0]
  # number the results starting from 1
  for count, answer in enumerate(answers, start=1):
    print(count, documents.iloc[answer]['context'], '\n')

In [29]:
# Final demonstration of the QA system

question = 'How did the Cold War start?'

pred_top1_answer(question)

During the Cold War, a principal focus of Canadian defence policy was contributing to the security of Europe in the face of the Soviet military threat. Toward that end, Canadian ground and air forces were based in Europe from the early 1950s until the early 1990s.


In [30]:
pred_top10_answers(question)

1 During the Cold War, a principal focus of Canadian defence policy was contributing to the security of Europe in the face of the Soviet military threat. Toward that end, Canadian ground and air forces were based in Europe from the early 1950s until the early 1990s. 

2 Tito was notable for pursuing a foreign policy of neutrality during the Cold War and for establishing close ties with developing countries. Tito's strong belief in self-determination caused early rift with Stalin and consequently, the Eastern Bloc. His public speeches often reiterated that policy of neutrality and cooperation with all countries would be natural as long as these countries did not use their influence to pressure Yugoslavia to take sides. Relations with the United States and Western European nations were generally cordial. 

3 However, since the end of the Cold War, as the North Atlantic Treaty Organization (NATO) has moved much of its defence focus "out of area", the Canadian military has also become more