## How do we convert words and sentences to numbers while retaining their meaning?

- Computers can only work with numbers, so text has to be encoded numerically
- Let's look at several ways to represent words and sentences as numbers
  - [Co-occurrence Matrix](https://colab.research.google.com/drive/1j3YPuv97Z-_bFhot2lIY7jpSfTjrrhi3#scrollTo=hLGshkOupgBq)
  - Term-Frequency
  - Term-Document
  - TF-IDF
  - Pointwise Mutual Information <br>
**Dense Vectors**<br>
  - SVDs 
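As a rough illustration of two of the representations listed above, the sketch below builds a term-document count matrix and a plain TF-IDF weighting using only the standard library. The three toy sentences and the helper `weight` are made up for this example, and the formula used is the unsmoothed `tf * log(N / df)`; libraries such as scikit-learn typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Term-document counts: one row per document, one column per vocabulary term.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: the number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Unsmoothed TF-IDF: raw term count times log(N / document frequency).
n_docs = len(docs)
tfidf = [
    [tf * math.log(n_docs / doc_freq[j]) for j, tf in enumerate(row)]
    for row in counts
]


def weight(doc_index, term):
    """Hypothetical helper: look up the TF-IDF weight of a term in a document."""
    return tfidf[doc_index][vocab.index(term)]


for term in ("the", "cat", "mat"):
    print(term, round(weight(0, term), 3))
```

A word that occurs in every document ("the") gets a weight of zero, while a word unique to one document ("mat") gets the largest weight, which is the intuition behind using TF-IDF to find the words that distinguish one question from another.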
  
  

## FAQ task

In this tutorial we'll describe how to build an FAQ model based on the config deeppavlov/configs/faq/tfidf_logreg_en_faq.json.
<br>First of all, we need a training dataset of FAQs.

Data Source: https://www.sce.cornell.edu/ol/faq.php#1

E.g. <br>
Q: Will classes meet at a specific time? <br>
A: With online learning, you may view the course materials on your own schedule. The content will be available to all students 24/7. However, you may be required to meet at set times with the faculty member.



       

![alt text](https://cdn-images-1.medium.com/max/600/1*e327fAqaxxifxcczLKwsdA.png)




**Note:** Please install all the necessary requirements using the command:

>\>\> python -m deeppavlov install tfidf_logreg_en_faq.json

In [4]:
# Mount Google Drive so that we can store files "locally"
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'fastai-v3/'

Mounted at /content/gdrive


In [0]:
# A leading ! runs the rest of the line as a command-line command

# In production/research settings, you want to focus on the task at hand, not on
# installing and figuring out software dependencies. DevOps is a very good skill
# to have, but one should pick one's fights wisely.
!pip install deeppavlov

Collecting deeppavlov
  Downloading https://files.pythonhosted.org/packages/1c/d9/2c75603f26e59b2f058b057b26011bb1087a278b9a7c6173a930d75efd89/deeppavlov-0.3.0-py3-none-any.whl (677kB)
     |████████████████████████████████| 686kB 2.8MB/s 
Collecting numpy==1.14.5 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/68/1e/116ad560de97694e2d0c1843a7a0075cc9f49e922454d32f49a80eb6f1f2/numpy-1.14.5-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
     |████████████████████████████████| 12.2MB 38.7MB/s 
Collecting tqdm==4.23.4 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/93/24/6ab1df969db228aed36a648a8959d1027099ce45fad67532b9673d533318/tqdm-4.23.4-py2.py3-none-any.whl (42kB)
     |████████████████████████████████| 51kB 17.5MB/s 
Collecting scipy==1.1.0 (from deeppavlov)
  Downloading https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a1730/scipy-1.1.0-cp36-c

In [0]:
!python -m deeppavlov install gdrive/My\ Drive/Colab\ Notebooks/IndabaXng/tfidf_logreg_en_faq.json

Collecting spacy==2.1.3
  Downloading https://files.pythonhosted.org/packages/52/da/3a1c54694c2d2f40df82f38a19ae14c6eb24a5a1a0dae87205ebea7a84d8/spacy-2.1.3-cp36-cp36m-manylinux1_x86_64.whl (27.7MB)
     |████████████████████████████████| 27.7MB 1.5MB/s 
Collecting thinc<7.1.0,>=7.0.2 (from spacy==2.1.3)
  Downloading https://files.pythonhosted.org/packages/a9/f1/3df317939a07b2fc81be1a92ac10bf836a1d87b4016346b25f8b63dee321/thinc-7.0.4-cp36-cp36m-manylinux1_x86_64.whl (2.1MB)
     |████████████████████████████████| 2.1MB 31.1MB/s 
Collecting wasabi<1.1.0,>=0.2.0 (from spacy==2.1.3)
  Downloading https://files.pythonhosted.org/packages/f4/c1/d76ccdd12c716be79162d934fe7de4ac8a318b9302864716dde940641a79/wasabi-0.2.2-py3-none-any.whl
Collecting blis<0.3.0,>=0.2.2 (from spacy==2.1.3)
  Downloading https://files.pythonhosted.org/packages/34/46/b1d0bb71d308e820ed30316c5f0a017cb5ef5f4324bcbc7da3cf9d3b075c/blis-0.2.4-cp36-cp36m-manylinux1_x86_64.whl (3.2MB)

In [0]:
import pandas as pd
FAQ_DATASET_URL = 'https://s3.amazonaws.com/mlnlpdatasets/cornell_faq30.csv'
faq_dataset = pd.read_csv(FAQ_DATASET_URL)
faq_dataset

In [0]:
import deeppavlov
from deeppavlov.models.tokenizers.spacy_tokenizer import StreamSpacyTokenizer
from deeppavlov.models.sklearn import SklearnComponent
# from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator
from deeppavlov.core.data.utils import download_decompress


In [0]:
import numpy as np


# Our data might contain some missing values.
# In our case, we want to discard QA pairs missing any value.

data_unclean = faq_dataset
print(data_unclean.shape)
# Treat the two-space placeholder as missing. Assigning the result back avoids
# relying on in-place replacement through a column selection.
data_unclean['question'] = data_unclean['question'].replace('  ', np.nan)
data_unclean['answer'] = data_unclean['answer'].replace('  ', np.nan)


data = data_unclean.dropna()
print(data.shape)
x = data['question']
y = data['answer']

# Pair each question with its answer
train_xy_tuples = list(zip(x, y))

dataset = dict()
dataset["train"] = train_xy_tuples
dataset["valid"] = []
dataset["test"] = []

(1597, 3)
(1595, 3)


In [0]:
train_xy_tuples[2]

In [0]:
# Read FAQ data
# reader = FaqDatasetReader()
# faq_data = reader.read(data_url=FAQ_DATASET_URL, x_col_name='question', y_col_name='answer')
# iterator = DataLearningIterator(data=dataset)

# x,y = iterator.get_instances()

In [0]:
x[0]

## Train FAQ

Let's consider a simple case for the FAQ model (more complex pipeline models can be found at the end):
1. TF-IDF vectorizer on lemmatized questions
2. Logistic regression classifier
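For comparison, the two steps above can be sketched with scikit-learn directly, without the DeepPavlov wrappers. This is only a minimal sketch under stated assumptions: the three question/answer pairs are made up for illustration, lemmatization is skipped, and `top_answers` is a hypothetical helper that ranks answers by `predict_proba`. It is not the DeepPavlov pipeline itself.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up FAQ pairs, for illustration only (not the tutorial dataset).
questions = [
    "Will classes meet at a specific time?",
    "How do I pay my tuition fees?",
    "Can I get a refund if I drop a course?",
]
answers = [
    "You may view the course materials on your own schedule.",
    "Tuition can be paid online.",
    "Refunds depend on the drop date.",
]

# Step 1: TF-IDF features (no lemmatization in this sketch).
# Step 2: logistic regression, where every distinct answer is one class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1000, max_iter=1000))
model.fit(questions, answers)


def top_answers(question, n=2):
    """Hypothetical helper: return the n most probable answers, best first."""
    probabilities = model.predict_proba([question])[0]
    best = np.argsort(probabilities)[::-1][:n]
    return [model.classes_[i] for i in best]


print(top_answers("pay tuition"))
```

Treating each answer as a class keeps the model simple, but it also means the classifier can only ever return answers it saw during training.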

In [0]:
# create tokenizer
tokenizer = StreamSpacyTokenizer(lemmas=True)
x_tokenized = tokenizer(x)



x_tokenized

[['baby',
  'have',
  'catarrh',
  'and',
  'mild',
  'cough',
  'give',
  'vitamin',
  'c',
  'this',
  'morning',
  'and',
  'cover',
  'in',
  'warm',
  'water',
  'for',
  'some',
  'minute',
  'kindly',
  'advice',
  'what',
  'can',
  'use',
  'or',
  'do'],
 ['baby',
  'of',
  'and',
  'take',
  'opv',
  'on',
  'Saturday',
  'and',
  'some',
  'hour',
  'later',
  'start',
  'vomit',
  'and',
  'purge',
  'and',
  'take',
  'to',
  'hospital',
  'where',
  'be',
  'give',
  'zinc',
  'Para',
  'and',
  'pemadex',
  'injection',
  'a',
  'test',
  'to',
  'run',
  'the',
  'vomiting',
  'stop',
  'but',
  'on',
  'Sunday',
  'start',
  'purge',
  'water',
  'till',
  'now',
  'On',
  'Monday',
  'go',
  'back',
  'with',
  'the',
  'test',
  'result',
  'and',
  'ORS',
  'and',
  'coartem',
  'tab',
  'be',
  'give',
  'to',
  'be',
  'still',
  'stool',
  'water',
  'what',
  'do',
  'do',
  'stool',
  'be',
  'much'],
 ['old',
  'son',
  'pour',
  'angel',
  'baby',
  'powder'

In [0]:
# Calling the tokenizer on a list of token lists joins each list back into one string
x_tokens_joined = tokenizer(x_tokenized)
# Fit the TF-IDF vectorizer on the training FAQ questions
vectorizer = SklearnComponent(model_class="sklearn.feature_extraction.text:TfidfVectorizer",
                              save_path='/content/gdrive/My Drive/Colab Notebooks/tfidf.pkl',
                              infer_method='transform')
X = vectorizer.fit(x_tokens_joined)

vocab = vectorizer.model.get_feature_names()
len(vocab)

In [0]:
# Now collect (x, y) pairs: x_train - vectorized questions, y_train - answers from the FAQ
x_train = vectorizer(x_tokens_joined)
y_train = y 

# Let's use the top 2 answers for each incoming question (top_n param)
clf = SklearnComponent(model_class="sklearn.linear_model:LogisticRegression",
                       top_n=2,
                       C=1000,  # scikit-learn names the inverse regularization strength `C` (capital)
                       penalty='l2', 
                       save_path='/content/gdrive/My Drive/Colab Notebooks/tfidf_logreg_classifier_en_mipt_faq.pkl',
                       infer_method='predict')
clf.fit(x_train, y_train)

## Test FAQ

In [0]:
test_questions = ["begin my online class early"]
tokenized_test_questions = tokenizer(test_questions)
joined_test_q_tokens = tokenizer(tokenized_test_questions)
test_q_vectorized = vectorizer(joined_test_q_tokens)
answers = clf(test_q_vectorized)

Now we have the output of the FAQ model: the predicted answers.
<br>
Answers:

In [0]:
for i, answer in enumerate(answers):
    print('Answers {}:\n{}\n'.format(i, answer))

NameError: ignored

## Discussion/QA

- How can we evaluate our model's accuracy? BLEU? LogisticRegression accuracy stats?
- How can we improve our model?
- How else can we frame the problem?
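
One possible starting point for the evaluation question is top-1 accuracy on questions the model has not seen: the fraction of held-out questions for which the predicted answer is exactly the expected one. The sketch below assumes a hypothetical `predict` callable that maps one question to one answer; `toy_predict` and the held-out pairs are invented stand-ins, not the trained model or real data. Because the classifier can only return answers seen in training, the held-out set should contain rephrased questions for known answers.

```python
def top1_accuracy(predict, labelled_questions):
    """Fraction of questions for which predict(question) equals the expected answer."""
    if not labelled_questions:
        raise ValueError("need at least one labelled question")
    hits = sum(
        1 for question, expected in labelled_questions if predict(question) == expected
    )
    return hits / len(labelled_questions)


# Hypothetical stand-in for the trained FAQ model (an assumption for this sketch).
def toy_predict(question):
    if "early" in question:
        return "Yes, at any time."
    return "Contact the registrar."


# Invented held-out pairs: rephrased questions with their expected answers.
held_out = [
    ("begin my online class early", "Yes, at any time."),
    ("can I start early?", "Yes, at any time."),
    ("how do I enrol?", "Contact the registrar."),
    ("what does it cost?", "See the fees page."),
]

score = top1_accuracy(toy_predict, held_out)
print(score)  # 3 of 4 correct -> 0.75
```

Exact-match accuracy is strict; a metric such as BLEU would instead give partial credit for overlapping wording, which matters more for generated answers than for answers selected from a fixed list.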