From cf1665d9f569b483600f512de1171e493cc1682e Mon Sep 17 00:00:00 2001
From: DanAnastasyev
Date: Mon, 17 Dec 2018 16:34:05 +0300
Subject: [PATCH] Week 13

---
 README.md | 14 +
 Week 13/README.md | 6 +
 .../Week_13_Dialogue_Systems_(Part_2).ipynb | 925 ++++++++++++++++++
 3 files changed, 945 insertions(+)
 create mode 100644 Week 13/README.md
 create mode 100644 Week 13/Week_13_Dialogue_Systems_(Part_2).ipynb

diff --git a/README.md b/README.md
index c5a43d9..603d4df 100644
--- a/README.md
+++ b/README.md
@@ -176,3 +176,17 @@ Goal-orientied dialogue systems. Implemention of the multi-task model: intent cl
+### Week 13: *Dialogue Systems: Part 2*
+General conversation dialogue systems and DSSMs. Implementation of question answering model on SQuAD dataset and chit-chat model on OpenSubtitles dataset.
+
+
+
+
+ + Run in Google Colab + + + View source on GitHub +
+
diff --git a/Week 13/README.md b/Week 13/README.md
new file mode 100644
index 0000000..b13817d
--- /dev/null
+++ b/Week 13/README.md
@@ -0,0 +1,6 @@
+# Grading
+
+- Tasks completed during the seminar: 5 points
+- Implement hard negatives mining: +2 points
+- Implement semi-hard negatives mining: +1.5 points
+- Implement the chit-chat bot: +3 points
diff --git a/Week 13/Week_13_Dialogue_Systems_(Part_2).ipynb b/Week 13/Week_13_Dialogue_Systems_(Part_2).ipynb
new file mode 100644
index 0000000..e765498
--- /dev/null
+++ b/Week 13/Week_13_Dialogue_Systems_(Part_2).ipynb
@@ -0,0 +1,925 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "Week 13 - Dialogue Systems (Part 2).ipynb",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": []
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "metadata": {
+ "id": "Z4WlMyJVRkzQ",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "!pip -qq install torch==0.4.1\n",
+ "!pip -qq install torchtext==0.3.1\n",
+ "!pip -qq install spacy==2.0.16\n",
+ "!pip -qq install gensim==3.6.0\n",
+ "!python -m spacy download en\n",
+ "!wget -O squad.zip -qq --no-check-certificate \"https://drive.google.com/uc?export=download&id=1h8dplcVzRkbrSYaTAbXYEAjcbApMxYQL\"\n",
+ "!unzip squad.zip\n",
+ "!wget -O opensubs.zip -qq --no-check-certificate \"https://drive.google.com/uc?export=download&id=1x1mNHweP95IeGFbDJPAI7zffgxrbqb7b\"\n",
+ "!unzip opensubs.zip"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "UvJKy3mtVOpw",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "import torch\n",
+ "import torch.nn as nn\n",
+ "import torch.nn.functional as F\n",
+ "import torch.optim as optim\n",
+ "\n",
+ "\n",
+ "if torch.cuda.is_available():\n",
+ "    from torch.cuda import FloatTensor, LongTensor\n",
+ "    DEVICE = 
torch.device('cuda')\n",
+ "else:\n",
+ "    from torch import FloatTensor, LongTensor\n",
+ "    DEVICE = torch.device('cpu')\n",
+ "\n",
+ "np.random.seed(42)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "LKCD9Pt4Wupj",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# General Conversation"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "C3fbABnaXFyk",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Today we look at how a chit-chat bot works.\n",
+ "\n",
+ "![](https://meduza.io/image/attachments/images/002/547/612/large/RLnxN4VdUmWFcBp8GjxUmA.jpg =x200) \n",
+ "*From [«Алиса, за мной следит ФСБ?»: в соцсетях продолжают издеваться над голосовым помощником «Яндекса»](https://meduza.io/shapito/2017/10/11/alisa-za-mnoy-sledit-fsb-v-sotssetyah-prodolzhayut-izdevatsya-nad-golosovym-pomoschnikom-yandeksa)*\n",
+ "\n",
+ "We have already discussed Seq2Seq models, which could be used to implement a chit-chat bot. However, they have a drawback: the chance of generating something ungrammatical is high. Much like those pirozhki poems.\n",
+ "\n",
+ "That is why one almost always takes a different route - ranking instead of generation. A large database of responses is prepared in advance, and each time the one that best fits the context is selected."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "3mOxTaTjD0-p",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## DSSM\n",
+ "\n",
+ "For this, DSSMs (Deep Structured Semantic Models) are used:\n",
+ "\n",
+ "![](https://qph.fs.quoracdn.net/main-qimg-b90431ff9b4c60c5d69069d7bc048ff0) \n",
+ "*From [What are Siamese neural networks, what applications are they good for, and why?](https://www.quora.com/What-are-Siamese-neural-networks-what-applications-are-they-good-for-and-why)*\n",
+ "\n",
+ "The network (usually) consists of a pair of towers: the left one encodes the query, the right one the response. The goal is to learn a similarity score between a query and a response.\n",
+ "\n",
+ "Then a large corpus of query-response pairs is collected (a query can be a single question or a context, i.e. the last few questions/responses). \n",
+ "\n",
+ "The response vectors are precomputed once; each new query is encoded with the query (left) tower, and the nearest among the precomputed vectors is retrieved."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "JSz9ALe3w1v1",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Data\n",
+ "\n",
+ "To begin with, we will use the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/). Strictly speaking, the task there is to find the answer to a question in a text. We will simply pick the sentence of the text that is closest to the question.\n",
+ "\n",
+ "*This part of the notebook is heavily based on [the YSDA course notebook](https://github.com/yandexdataschool/nlp_course/blob/master/week10_dialogue/seminar.ipynb)*."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "_6OqXUJxjnX4",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "train_data = pd.read_json('train.json')\n",
+ "test_data = pd.read_json('test.json')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "s0DPvn5djkha",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "row = train_data.iloc[40]\n",
+ "print('QUESTION:', row.question, '\\n')\n",
+ "for i, cand in enumerate(row.options):\n",
+ "    print('[ ]' if i not in row.correct_indices else '[v]', cand)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "KIeJIQ7ex4nB",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Let's tokenize the sentences:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "udkrpaSHq7mg",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import spacy\n",
+ "\n",
+ "spacy = spacy.load('en')\n",
+ "\n",
+ 
"train_data.question = train_data.question.apply(lambda text: [tok.text.lower() for tok in spacy.tokenizer(text)])\n",
+ "train_data.options = train_data.options.apply(lambda options: [[tok.text.lower() for tok in spacy.tokenizer(text)] for text in options])\n",
+ "\n",
+ "test_data.question = test_data.question.apply(lambda text: [tok.text.lower() for tok in spacy.tokenizer(text)])\n",
+ "test_data.options = test_data.options.apply(lambda options: [[tok.text.lower() for tok in spacy.tokenizer(text)] for text in options])"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "tU5jRHpax7Ot",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "We do not have that much data, so instead of training everything from scratch we will use pretrained embeddings right away:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "2mQjmbvs-jFQ",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import gensim.downloader as api\n",
+ "\n",
+ "w2v_model = api.load('glove-wiki-gigaword-100')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "dC8Db3DKyDSz",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "**Task** Build a matrix of pretrained embeddings for the most frequent words in the dataset."
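The embedding-matrix task above boils down to: count word frequencies, keep the frequent words, and stack their pretrained vectors, reserving rows 0 and 1 for padding and unknown words. A minimal sketch with a hypothetical `build_embedding_matrix` helper; it only assumes the pretrained model supports `word in w2v` and `w2v[word]` lookups, as gensim's `KeyedVectors` does:

```python
import numpy as np
from collections import Counter


def build_embedding_matrix(texts, w2v, dim, min_freq=5):
    """Collect frequent words and stack their pretrained vectors.

    Index 0 is reserved for '<pad>', index 1 for '<unk>'; both start as zeros.
    """
    counts = Counter(word for text in texts for word in text)

    word2ind = {'<pad>': 0, '<unk>': 1}
    vectors = [np.zeros(dim), np.zeros(dim)]

    for word, freq in counts.most_common():
        if freq < min_freq:
            break  # most_common is sorted, the rest is even rarer
        if word in word2ind:
            continue
        word2ind[word] = len(word2ind)
        # Back off to a zero vector for words missing from the pretrained model
        vectors.append(np.asarray(w2v[word]) if word in w2v else np.zeros(dim))

    return word2ind, np.stack(vectors)
```

With the real `w2v_model`, `dim` would be `w2v_model.vectors.shape[1]` and `texts` would iterate over both the questions and the options.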
+ ]
+ },
+ {
+ "metadata": {
+ "id": "AinFu7nb6DSf",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "from collections import Counter\n",
+ "\n",
+ "\n",
+ "def build_word_embeddings(data, w2v_model, min_freq=5):\n",
+ "    words = Counter()\n",
+ "    \n",
+ "    for text in data.question:\n",
+ "        for word in text:\n",
+ "            words[word] += 1\n",
+ "    \n",
+ "    for options in data.options:\n",
+ "        for text in options:\n",
+ "            for word in text:\n",
+ "                words[word] += 1\n",
+ "    \n",
+ "    word2ind = {\n",
+ "        '<pad>': 0,\n",
+ "        '<unk>': 1\n",
+ "    }\n",
+ "    \n",
+ "    embeddings = [\n",
+ "        np.zeros(w2v_model.vectors.shape[1]),\n",
+ "        np.zeros(w2v_model.vectors.shape[1])\n",
+ "    ]\n",
+ "    \n",
+ "    \n",
+ "\n",
+ "    return word2ind, np.array(embeddings)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "rKJhqyV__jhl",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "word2ind, embeddings = build_word_embeddings(train_data, w2v_model, min_freq=8)\n",
+ "print('Vocab size =', len(word2ind))"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "M-_xdNG9yYka",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "We will use the following class to generate batches:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "4IoQrJ5rTSSN",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "import random\n",
+ "import math\n",
+ "\n",
+ "\n",
+ "def to_matrix(lines, word2ind):\n",
+ "    max_sent_len = max(len(line) for line in lines)\n",
+ "    matrix = np.zeros((len(lines), max_sent_len))\n",
+ "\n",
+ "    for batch_ind, line in enumerate(lines):\n",
+ "        matrix[batch_ind, :len(line)] = [word2ind.get(word, 1) for word in line]\n",
+ "\n",
+ "    return LongTensor(matrix)\n",
+ "\n",
+ "\n",
+ "class BatchIterator():\n",
+ "    def __init__(self, data, batch_size, word2ind, shuffle=True):\n",
+ "        self._data = data\n",
+ "        self._num_samples = len(data)\n",
+ "        
self._batch_size = batch_size\n",
+ "        self._word2ind = word2ind\n",
+ "        self._shuffle = shuffle\n",
+ "        self._batches_count = int(math.ceil(len(data) / batch_size))\n",
+ "    \n",
+ "    def __len__(self):\n",
+ "        return self._batches_count\n",
+ "    \n",
+ "    def __iter__(self):\n",
+ "        return self._iterate_batches()\n",
+ "\n",
+ "    def _iterate_batches(self):\n",
+ "        indices = np.arange(self._num_samples)\n",
+ "        if self._shuffle:\n",
+ "            np.random.shuffle(indices)\n",
+ "\n",
+ "        for start in range(0, self._num_samples, self._batch_size):\n",
+ "            end = min(start + self._batch_size, self._num_samples)\n",
+ "\n",
+ "            batch_indices = indices[start: end]\n",
+ "\n",
+ "            batch = self._data.iloc[batch_indices]\n",
+ "            questions = batch['question'].values\n",
+ "            correct_answers = np.array([\n",
+ "                row['options'][random.choice(row['correct_indices'])]\n",
+ "                for i, row in batch.iterrows()\n",
+ "            ])\n",
+ "            wrong_answers = np.array([\n",
+ "                row['options'][random.choice(row['wrong_indices'])]\n",
+ "                for i, row in batch.iterrows()\n",
+ "            ])\n",
+ "\n",
+ "            yield {\n",
+ "                'questions': to_matrix(questions, self._word2ind),\n",
+ "                'correct_answers': to_matrix(correct_answers, self._word2ind),\n",
+ "                'wrong_answers': to_matrix(wrong_answers, self._word2ind)\n",
+ "            }"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "as5kgtjLyRE6",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "train_iter = BatchIterator(train_data, 64, word2ind)\n",
+ "test_iter = BatchIterator(test_data, 128, word2ind)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "27zeQlTJzU1x",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "It simply samples sequences of questions with correct and wrong answers to them:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "F-ohzmkXzO0e",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "batch = 
next(iter(train_iter))\n",
+ "\n",
+ "batch"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "BSXFV1CTyfGo",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Model\n",
+ "\n",
+ "**Task** Implement an encoder model for texts - a tower of the DSSM model.\n",
+ "\n",
+ "*It does not have to be a complex model; a convolutional one is quite enough and will train much faster.*"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Ve7WZ4-dvbuw",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "class Encoder(nn.Module):\n",
+ "    def __init__(self, embeddings, hidden_dim=128, output_dim=128):\n",
+ "        super().__init__()\n",
+ "        \n",
+ "        \n",
+ "    \n",
+ "    def forward(self, inputs):\n",
+ "        "
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "ZFBAUGGfzFwF",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Triplet Loss\n",
+ "\n",
+ "We want more than an encoder that just builds sentence embeddings: the vectors of correct answers should be pulled towards their questions, and the wrong ones pushed away. A common choice for this is the *Triplet Loss*:\n",
+ "\n",
+ "$$ L = \\frac 1N \\sum_{(q, a^+, a^-)} \\max(0, \\space \\delta - sim[V_q(q), V_a(a^+)] + sim[V_q(q), V_a(a^-)]),$$\n",
+ "\n",
+ "where\n",
+ "* $sim[a, b]$ is a similarity function (e.g., dot product or cosine similarity)\n",
+ "* $\\delta$ is a hyperparameter of the model. If $sim[a, b]$ is linear in $b$, all $\\delta > 0$ are equivalent.\n",
+ "\n",
+ "![img](https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/margin.png)\n",
+ "\n",
+ "**Task** Implement the triplet loss as well as recall@1 - the fraction of cases in which the correct answer was closer than the wrong one."
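A minimal numpy sketch of the two quantities defined above - triplet loss with dot-product similarity and recall@1. The names are illustrative; the assignment itself asks for a torch implementation inside the model:

```python
import numpy as np


def triplet_loss(q, a_pos, a_neg, delta=1.0):
    """max(0, delta - sim(q, a+) + sim(q, a-)), averaged over the batch.

    q, a_pos, a_neg: [batch_size, emb_dim] arrays; sim is the dot product.
    """
    sim_pos = np.sum(q * a_pos, axis=-1)
    sim_neg = np.sum(q * a_neg, axis=-1)
    return np.maximum(0., delta - sim_pos + sim_neg).mean()


def recall_at_1(q, a_pos, a_neg):
    """Fraction of cases where the correct answer is closer than the wrong one."""
    sim_pos = np.sum(q * a_pos, axis=-1)
    sim_neg = np.sum(q * a_neg, axis=-1)
    return np.mean(sim_pos > sim_neg)
```

The torch version is near-identical: `(q * a_pos).sum(-1)` and `F.relu(delta - sim_pos + sim_neg).mean()`.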
+ ]
+ },
+ {
+ "metadata": {
+ "id": "GdHJK51dz0CB",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "class DSSM(nn.Module):\n",
+ "    def __init__(self, question_encoder, answer_encoder):\n",
+ "        super().__init__()\n",
+ "        self.question_encoder = question_encoder\n",
+ "        self.answer_encoder = answer_encoder\n",
+ "    \n",
+ "    def forward(self, questions, correct_answers, wrong_answers):\n",
+ "        \n",
+ "\n",
+ "    def calc_triplet_loss(self, question_embeddings, correct_answer_embeddings, wrong_answer_embeddings, delta=1.0):\n",
+ "        \"\"\"Returns the triplet loss based on the equation above\"\"\"\n",
+ "    \n",
+ "    \n",
+ "    def calc_recall_at_1(self, question_embeddings, correct_answer_embeddings, wrong_answer_embeddings):\n",
+ "        \"\"\"Returns the number of cases in which the correct answer was more similar than the incorrect one\"\"\"\n",
+ "    \n",
+ "    \n",
+ "    @staticmethod\n",
+ "    def similarity(question_embeddings, answer_embeddings):\n",
+ "        \"\"\"Returns sim[a, b]\"\"\"\n",
+ "        "
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "ZWvlMTWpxTvJ",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "class ModelTrainer():\n",
+ "    def __init__(self, model, optimizer):\n",
+ "        self._model = model\n",
+ "        self._optimizer = optimizer\n",
+ "    \n",
+ "    def on_epoch_begin(self, is_train, name, batches_count):\n",
+ "        \"\"\"\n",
+ "        Initializes metrics\n",
+ "        \"\"\"\n",
+ "        self._epoch_loss = 0\n",
+ "        self._correct_count, self._total_count = 0, 0\n",
+ "        self._is_train = is_train\n",
+ "        self._name = name\n",
+ "        self._batches_count = batches_count\n",
+ "        \n",
+ "        self._model.train(is_train)\n",
+ "    \n",
+ "    def on_epoch_end(self):\n",
+ "        \"\"\"\n",
+ "        Outputs final metrics\n",
+ "        \"\"\"\n",
+ "        return '{:>5s} Loss = {:.5f}, Recall@1 = {:.2%}'.format(\n",
+ "            self._name, self._epoch_loss / self._batches_count, self._correct_count / self._total_count\n",
+ "        )\n",
+ "    \n",
+ "    def 
on_batch(self, batch):\n", + " \"\"\"\n", + " Performs forward and (if is_train) backward pass with optimization, updates metrics\n", + " \"\"\"\n", + " \n", + " question_embs, correct_answer_embs, wrong_answer_embs = self._model(\n", + " batch['questions'], batch['correct_answers'], batch['wrong_answers']\n", + " )\n", + " loss = self._model.calc_triplet_loss(question_embs, correct_answer_embs, wrong_answer_embs)\n", + " correct_count = self._model.calc_recall_at_1(question_embs, correct_answer_embs, wrong_answer_embs)\n", + " total_count = len(batch['questions'])\n", + " \n", + " self._correct_count += correct_count\n", + " self._total_count += total_count\n", + " self._epoch_loss += loss.item()\n", + " \n", + " if self._is_train:\n", + " self._optimizer.zero_grad()\n", + " loss.backward()\n", + " nn.utils.clip_grad_norm_(self._model.parameters(), 1.)\n", + " self._optimizer.step()\n", + "\n", + " return '{:>5s} Loss = {:.5f}, Recall@1 = {:.2%}'.format(\n", + " self._name, loss.item(), correct_count / total_count\n", + " )" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "ecI_vVBgzpVn", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "import math\n", + "from tqdm import tqdm\n", + "tqdm.get_lock().locks = []\n", + "\n", + "\n", + "def do_epoch(trainer, data_iter, is_train, name=None):\n", + " trainer.on_epoch_begin(is_train, name, batches_count=len(data_iter))\n", + " \n", + " with torch.autograd.set_grad_enabled(is_train):\n", + " with tqdm(total=len(data_iter)) as progress_bar:\n", + " for i, batch in enumerate(data_iter):\n", + " batch_progress = trainer.on_batch(batch)\n", + "\n", + " progress_bar.update()\n", + " progress_bar.set_description(batch_progress)\n", + " \n", + " epoch_progress = trainer.on_epoch_end()\n", + " progress_bar.set_description(epoch_progress)\n", + " progress_bar.refresh()\n", + "\n", + " \n", + "def fit(trainer, train_iter, epochs_count=1, val_iter=None):\n", + " 
for epoch in range(epochs_count):\n",
+ "        name_prefix = '[{} / {}] '.format(epoch + 1, epochs_count)\n",
+ "        do_epoch(trainer, train_iter, is_train=True, name=name_prefix + 'Train:')\n",
+ "        \n",
+ "        if val_iter is not None:\n",
+ "            do_epoch(trainer, val_iter, is_train=False, name=name_prefix + '  Val:')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "K1GvkkLh70S6",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Finally, let's start training the model:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "ipXj9pOD2afY",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "embeddings = FloatTensor(embeddings)\n",
+ "\n",
+ "model = DSSM(\n",
+ "    Encoder(embeddings),\n",
+ "    Encoder(embeddings)\n",
+ ").to(DEVICE)\n",
+ "\n",
+ "optimizer = optim.Adam(model.parameters())\n",
+ "\n",
+ "trainer = ModelTrainer(model, optimizer)\n",
+ "\n",
+ "fit(trainer, train_iter, epochs_count=30, val_iter=test_iter)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "g_AHIoCB73XA",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### Prediction accuracy\n",
+ "\n",
+ "Let's evaluate how well the model picks the correct answer.\n",
+ "\n",
+ "**Task** For each question, find the index of the answer chosen by the network:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "QrA8N0zDJczj",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "predictions = []\n",
+ "\n",
+ "    \n",
+ "accuracy = np.mean([\n",
+ "    answer in correct_ind\n",
+ "    for answer, correct_ind in zip(predictions, test_data['correct_indices'].values)\n",
+ "])\n",
+ "print(\"Accuracy: %0.5f\" % accuracy)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "HZijVoOTLcCD",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "def draw_results(question, possible_answers, predicted_index, 
correct_indices):\n",
+ "    print(\"Q:\", ' '.join(question), end='\\n\\n')\n",
+ "    for i, answer in enumerate(possible_answers):\n",
+ "        print(\"#%i: %s %s\" % (i, '[*]' if i == predicted_index else '[ ]', ' '.join(answer)))\n",
+ "    \n",
+ "    print(\"\\nVerdict:\", \"CORRECT\" if predicted_index in correct_indices else \"INCORRECT\", \n",
+ "          \"(ref: %s)\" % correct_indices, end='\\n' * 3)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "-_pYbNcsLfli",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "for i in [1, 100, 1000, 2000, 3000, 4000, 5000]:\n",
+ "    draw_results(test_data.iloc[i].question, test_data.iloc[i].options,\n",
+ "                 predictions[i], test_data.iloc[i].correct_indices)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "T5OC4EGt8HAo",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Hard-negatives mining\n",
+ "\n",
+ "In practice, we usually have no explicit negative examples at all.\n",
+ "\n",
+ "For instance, given a corpus of dialogues - where do we get negative examples for the responses?\n",
+ "\n",
+ "This is what *hard-negatives mining* is for. The closest of the wrong examples in the batch is taken as the negative one:\n",
+ "$$a^-_{hard} = \\underset {a^-}{argmax} \\space sim[V_q(q), V_a(a^-)]$$\n",
+ "\n",
+ "Wrong here means every answer except the correct one :)\n",
+ "\n",
+ "It is implemented roughly like this:\n",
+ "* A batch consists of correct question-answer pairs.\n",
+ "* Embeddings are computed for all questions and all answers.\n",
+ "* We already have the positive examples - it remains, for each question, to find the answers most similar to it that were meant for other questions.\n",
+ "\n",
+ "**Task** Update `DSSM` to do hard-negatives mining inside it.\n",
+ "\n",
+ "*You may need to normalize the vectors with `F.normalize` before computing the `similarity`*"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "tY_BebgAOY70",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "class DSSM(nn.Module):\n",
+ "    def __init__(self, question_encoder, answer_encoder):\n",
+ "        super().__init__()\n",
+ "        self.question_encoder = question_encoder\n",
+ "        self.answer_encoder = answer_encoder\n",
+ "    \n",
+ "    def forward(self, questions, correct_answers, wrong_answers):\n",
+ "        \"\"\"Ignore wrong_answers, they are here just for compatibility's sake\"\"\"\n",
+ "        \n",
+ "\n",
+ "    def calc_triplet_loss(self, question_embeddings, answer_embeddings, delta=1.0):\n",
+ "        \"\"\"Returns the triplet loss based on the equation above\"\"\"\n",
+ "    \n",
+ "    \n",
+ "    def calc_recall_at_1(self, question_embeddings, correct_answer_embeddings, wrong_answer_embeddings):\n",
+ "        \"\"\"Returns the number of cases in which the correct answer was more similar than the incorrect one\"\"\"\n",
+ "    \n",
+ "    \n",
+ "    @staticmethod\n",
+ "    def similarity(question_embeddings, answer_embeddings):\n",
+ "        "
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "0AIvf3Zvtabs",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "model = DSSM(\n",
+ "    
question_encoder=Encoder(embeddings),\n",
+ "    answer_encoder=Encoder(embeddings)\n",
+ ").to(DEVICE)\n",
+ "\n",
+ "optimizer = optim.Adam(model.parameters())\n",
+ "\n",
+ "trainer = ModelTrainer(model, optimizer)\n",
+ "\n",
+ "fit(trainer, train_iter, epochs_count=30, val_iter=test_iter)"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "pXaL1iRz-zEG",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "**Task** There is also a semi-hard negatives variant - the negative example is the best one among those whose similarity to the question is lower than that of the positive example. Try to implement it."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "edUVeXG0_lCa",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Chit-Chat Bot\n",
+ "\n",
+ "To build a chit-chat bot, we need a decent corpus of dialogues. For example, OpenSubtitles."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "s3YmDo9Z_xG-",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "!head train.txt"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "A0R77ezW_1Wd",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Well, a more or less decent one.\n",
+ "\n",
+ "Let's read the dataset."
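The in-batch hard- and semi-hard-negative selection described above can be sketched in plain numpy. A hypothetical `mine_negatives` helper, assuming one correct answer per question and dot-product similarity; the assignment asks for the same logic on torch tensors inside `DSSM`:

```python
import numpy as np


def mine_negatives(question_embs, answer_embs, semi_hard=False):
    """For each question i, pick a negative index among answers j != i.

    hard:      argmax_j sim(q_i, a_j)
    semi-hard: best j with sim(q_i, a_j) < sim(q_i, a_i)
               (falls back to hard if no such j exists).
    """
    sim = question_embs @ answer_embs.T      # [batch, batch] similarity matrix
    pos = np.diag(sim).copy()                # similarity with the own (correct) answer
    np.fill_diagonal(sim, -np.inf)           # exclude the positive itself

    if semi_hard:
        # Keep only the answers that are still less similar than the positive one
        masked = np.where(sim < pos[:, None], sim, -np.inf)
        fallback = np.argmax(sim, axis=1)    # plain hard negative as a fallback
        has_semi = np.isfinite(masked).any(axis=1)
        return np.where(has_semi, np.argmax(masked, axis=1), fallback)

    return np.argmax(sim, axis=1)
```

In torch this is the same: a `questions @ answers.t()` matrix, the diagonal masked out, `argmax` per row, then `answer_embs[neg_indices]` as the wrong-answer embeddings.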
+ ] + }, + { + "metadata": { + "id": "SwHYmb28HPUn", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "from nltk import wordpunct_tokenize\n", + "\n", + "def read_dataset(path):\n", + " data = []\n", + " with open(path) as f:\n", + " for line in tqdm(f):\n", + " query, response = line.strip().split('\\t')\n", + " data.append((\n", + " wordpunct_tokenize(query.strip()),\n", + " wordpunct_tokenize(response.strip())\n", + " ))\n", + " return data\n", + "\n", + "train_data = read_dataset('train.txt')\n", + "val_data = read_dataset('valid.txt')\n", + "test_data = read_dataset('test.txt')" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "WTfkv17tHM3I", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "from torchtext.data import Field, Example, Dataset, BucketIterator\n", + "\n", + "query_field = Field(lower=True)\n", + "response_field = Field(lower=True)\n", + "\n", + "fields = [('query', query_field), ('response', response_field)]\n", + "\n", + "train_dataset = Dataset([Example.fromlist(example, fields) for example in train_data], fields)\n", + "val_dataset = Dataset([Example.fromlist(example, fields) for example in val_data], fields)\n", + "test_dataset = Dataset([Example.fromlist(example, fields) for example in test_data], fields)\n", + "\n", + "query_field.build_vocab(train_dataset, min_freq=5)\n", + "response_field.build_vocab(train_dataset, min_freq=5)\n", + "\n", + "print('Query vocab size =', len(query_field.vocab))\n", + "print('Response vocab size =', len(response_field.vocab))\n", + "\n", + "train_iter, val_iter, test_iter = BucketIterator.splits(\n", + " datasets=(train_dataset, val_dataset, test_dataset), batch_sizes=(512, 1024, 1024), \n", + " shuffle=True, device=DEVICE, sort=False\n", + ")" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "M-Ok7--NHUX0", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + 
"**Task** Implement the chit-chat bot by analogy with what we have already written."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "n2x9-j4oz08p",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Additional Materials\n",
+ "\n",
+ "## Papers\n",
+ "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013 [[pdf]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) \n",
+ "Deep Learning and Continuous Representations for Natural Language Processing, Microsoft tutorial [[pdf]](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/NAACL-HLT-2015_tutorial.pdf)\n",
+ "\n",
+ "## Blog Posts\n",
+ "[Neural conversational models: как научить нейронную сеть светской беседе](https://habr.com/company/yandex/blog/333912/) \n",
+ "[Искусственный интеллект в поиске. Как Яндекс научился применять нейронные сети, чтобы искать по смыслу, а не по словам](https://habr.com/company/yandex/blog/314222/) \n",
+ "[Triplet loss, Olivier Moindrot](https://omoindrot.github.io/triplet-loss)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "gjkS1Kpkz4Aa",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Submission\n",
+ "\n",
+ "[Submission form](https://goo.gl/forms/bf2auPe8FL5C0jzp2) \n",
+ "[Feedback](https://goo.gl/forms/9aizSzOUrx7EvGlG3)"
+ ]
+ }
+ ]
+}
\ No newline at end of file
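As a closing illustration of the retrieval step from the DSSM section: the answer vectors are precomputed and normalized once, then each encoded query is answered by a nearest-neighbor search. A hypothetical sketch (function names are illustrative, not part of the notebook):

```python
import numpy as np


def build_answer_index(answer_embs):
    """L2-normalize precomputed answer vectors so dot product == cosine similarity."""
    norms = np.linalg.norm(answer_embs, axis=1, keepdims=True)
    return answer_embs / np.maximum(norms, 1e-12)


def retrieve(query_emb, answer_index, top_k=1):
    """Return indices of the top_k most similar answers for one query vector."""
    query_emb = query_emb / max(np.linalg.norm(query_emb), 1e-12)
    scores = answer_index @ query_emb            # cosine similarity per answer
    return np.argsort(-scores)[:top_k]           # best-first indices
```

For large answer databases the brute-force matrix product is typically replaced by an approximate nearest-neighbor index, but the interface stays the same.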