## How do we convert words and sentences into numbers while retaining their meaning?

- Computers can only work with numbers.
- Let's look at several ways to represent words and sentences numerically:
  - [Co-occurrence Matrix](https://colab.research.google.com/drive/1j3YPuv97Z-_bFhot2lIY7jpSfTjrrhi3#scrollTo=hLGshkOupgBq)
  - Term-Frequency
  - Term-Document
  - TF-IDF
  - Pointwise Mutual Information

**Dense Vectors**
  - SVD (singular value decomposition)
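To make the first and last sparse methods concrete, here is a minimal sketch of a co-occurrence count followed by positive pointwise mutual information (PPMI). The three sentences, the window size of 2, and the function names are assumptions chosen for illustration; the dense step (SVD over the resulting matrix) is not shown.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, already lower-cased and tokenised).
corpus = [
    "how do i enroll in a class".split(),
    "how do i drop a class".split(),
    "how much does a class cost".split(),
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

def ppmi(counts):
    """Positive PMI: max(0, log2(P(w, c) / (P(w) * P(c))))."""
    total = sum(counts.values())
    word_totals, context_totals = Counter(), Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    scores = {}
    for (word, context), n in counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        scores[(word, context)] = max(0.0, pmi)
    return scores

counts = cooccurrence_counts(corpus)
scores = ppmi(counts)
print(counts[("how", "do")])  # 2: "how" and "do" are neighbours in two sentences
print(scores[("enroll", "in")] > 0)  # True: the pair occurs more often than chance predicts
```

Raw counts favour frequent words such as "a"; PPMI rescales each count by how often the two words would co-occur if they were independent, which is why it is often preferred over the raw matrix.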
  
  

## FAQ task

In this tutorial we'll describe how to build an FAQ model based on the config deeppavlov/configs/faq/tfidf_logreg_en_faq.json.
<br>First of all, we need a training dataset of FAQ question-answer pairs.

Data Source: https://www.sce.cornell.edu/ol/faq.php#1

E.g. <br>
Q: Will classes meet at a specific time? <br>
A: With online learning, you may view the course materials on your own schedule. The content will be available to all students 24/7. However, you may be required to meet at set times with the faculty member.



       

![alt text](https://cdn-images-1.medium.com/max/600/1*e327fAqaxxifxcczLKwsdA.png)




**Note:** Please install all necessary requirements using the command:

>\>\> python -m deeppavlov install tfidf_logreg_en_faq.json

In [2]:
# Mount Google Drive so that we can store files as if they were local
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'fastai-v3/'

Mounted at /content/gdrive


In [3]:
# In Colab, a leading ! runs the rest of the line as a shell command.

# In production/research settings, you want to focus on the task at hand, not on
# installing software and figuring out dependencies. DevOps is a very good skill
# to have, but one should pick one's battles wisely.
!pip install deeppavlov



In [5]:
!python -m deeppavlov install gdrive/My\ Drive/Colab\ Notebooks/IndabaXng/tfidf_logreg_en_faq.json

Collecting en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1MB)
     |████████████████████████████████| 11.1MB 1.7MB/s
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/39/ea/3b/507f7df78be8631a7a3d7090962194cf55bc1158572c0be77f
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.0.0
    Uninstalling en-core-web-sm-2.0.0:
      Successfully uninstalled en-core-web-sm-2.0.0
Successfully installed en-core-web-sm-2.1.0


In [6]:
import pandas as pd
FAQ_DATASET_URL = 'https://s3.amazonaws.com/mlnlpdatasets/cornell_faq30.csv'
faq_dataset = pd.read_csv(FAQ_DATASET_URL)
faq_dataset

| | question | answer |
|---|---|---|
| 0 | Do I have to be a Cornell student to take an o... | No, SCE has an open-admissions policy for all ... |
| 1 | How old do I have to be to take an online clas... | You must be at least a high school sophomore, ... |
| 2 | How many classes can I take?\n | Because of the intense nature of study during ... |
| 3 | How do I enroll and register?\n | If you are a high school student, see the onli... |
| 4 | I have registered for an online course. What h... | If this is the first time you enrolled in a cl... |
| 5 | What if I change my mind and want to drop my c... | You may drop a class by completing a change-in... |
| 6 | What are the technical requirements for an onl... | Most online courses are offered through Cornel... |
| 7 | What do I do if I need technical assistance? | If you are having technical problems your firs... |
| 8 | How do I get access to the class materials?\n | Once you receive your NetID, the faculty membe... |
| 9 | How much does an online class cost? | See the tuition and fees page on the Summer Se... |


In [7]:
import deeppavlov
from deeppavlov.models.tokenizers.spacy_tokenizer import StreamSpacyTokenizer
from deeppavlov.models.sklearn import SklearnComponent
# from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator
from deeppavlov.core.data.utils import download_decompress


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/nonbreaking_prefixes.zip.


In [8]:
import numpy as np


# Our data might contain some missing values.
# In our case, we want to discard QA pairs with any missing value.

data_unclean = faq_dataset
print(data_unclean.shape)
# Treat cells that contain only a two-space placeholder as missing. The result is
# assigned back to the column because newer pandas versions warn about
# inplace=True on a selected column.
data_unclean['question'] = data_unclean['question'].replace('  ', np.nan)
data_unclean['answer'] = data_unclean['answer'].replace('  ', np.nan)


data = data_unclean.dropna()
print(data.shape)
x = data['question']
y = data['answer']

# Pair each question with its answer
train_xy_tuples = list(zip(x, y))

dataset = dict()
dataset["train"] = train_xy_tuples
dataset["valid"] = []
dataset["test"] = []

(29, 2)
(29, 2)


In [9]:
train_xy_tuples[2]

('How many classes can I take?\n',
 'Because of the intense nature of study during the Summer and Winter Sessions, students may enroll in no more than eight credits during a six-week Summer Session and no more than four credits during the Winter Session.')

In [0]:
# Read FAQ data
# reader = FaqDatasetReader()
# faq_data = reader.read(data_url=FAQ_DATASET_URL, x_col_name='question', y_col_name='answer')
# iterator = DataLearningIterator(data=dataset)

# x,y = iterator.get_instances()

In [11]:
x[0]

"Do I have to be a Cornell student to take an online course offered by Cornell's School of Continuing Education and Summer Sessions (SCE)?"

## Train FAQ

Let's consider a simple case for the FAQ model (more complex pipeline models appear at the end):
1. TF-IDF vectorizer on lemmatized questions
2. Logistic regression classifier
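To show what step 1 computes, here is a small pure-Python sketch of TF-IDF on three of the lemmatized questions from this dataset. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation; the function name `tfidf` is an assumption for illustration and this is not the DeepPavlov implementation.

```python
import math
from collections import Counter

# Three lemmatized questions from the FAQ dataset used in this tutorial.
docs = [
    ["how", "many", "class", "can", "take"],
    ["how", "do", "enroll", "and", "register"],
    ["how", "much", "do", "an", "online", "class", "cost"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

vectors = tfidf(docs)
first = vectors[0]
# "how" occurs in every question, so it is weighted lower than the rarer "many".
print(first["how"] < first["many"])  # True
```

Words shared by every question carry little information about which answer to pick, and the idf term is what pushes their weight down.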

In [17]:
# Create a tokenizer that lemmatizes each question, then apply it to all questions
tokenizer = StreamSpacyTokenizer(lemmas=True)
x_tokenized = tokenizer(x)



x_tokenized

[['Do',
  'have',
  'to',
  'be',
  'a',
  'Cornell',
  'student',
  'to',
  'take',
  'an',
  'online',
  'course',
  'offer',
  'by',
  'Cornell',
  'School',
  'of',
  'Continuing',
  'Education',
  'and',
  'Summer',
  'Sessions',
  'SCE'],
 ['how',
  'old',
  'do',
  'have',
  'to',
  'be',
  'to',
  'take',
  'an',
  'online',
  'class'],
 ['how', 'many', 'class', 'can', 'take'],
 ['how', 'do', 'enroll', 'and', 'register'],
 ['have',
  'register',
  'for',
  'an',
  'online',
  'course',
  'what',
  'happen',
  'next'],
 ['what', 'if', 'change', 'mind', 'and', 'want', 'to', 'drop', 'class'],
 ['what',
  'be',
  'the',
  'technical',
  'requirement',
  'for',
  'an',
  'online',
  'course'],
 ['what', 'do', 'do', 'if', 'need', 'technical', 'assistance'],
 ['how', 'do', 'get', 'access', 'to', 'the', 'class', 'material'],
 ['how', 'much', 'do', 'an', 'online', 'class', 'cost'],
 ['when', 'do', 'have', 'to', 'pay'],
 ['be', 'financial', 'aid', 'available'],
 ['Will', 'get', 'money', 

In [27]:
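# Calling the tokenizer on lists of tokens joins each list back into one string,
# which is the input format TfidfVectorizer expects
x_tokens_joined = tokenizer(x_tokenized)
# Fit the TF-IDF vectorizer on the training FAQ questions
vectorizer = SklearnComponent(model_class="sklearn.feature_extraction.text:TfidfVectorizer",
                              save_path='/content/gdrive/My Drive/Colab Notebooks/tfidf.pkl',
                              infer_method='transform')
X = vectorizer.fit(x_tokens_joined)

# Note: scikit-learn 1.0 deprecated get_feature_names() in favour of
# get_feature_names_out(), and 1.2 removed it
vocab = vectorizer.model.get_feature_names()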
len(vocab)
X

2019-05-11 07:51:07.182 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2019-05-11 07:51:07.186 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.feature_extraction.text:TfidfVectorizer


In [0]:
x_train = vectorizer(x_tokens_joined)
y_train = y 

In [31]:
# We now have (x, y) pairs: x_train holds the vectorized questions, y_train the FAQ answers


# Let's use the top 2 answers for each incoming question (top_n param)
clf = SklearnComponent(model_class="sklearn.linear_model:LogisticRegression",
                       top_n=2,
                       C=1000,  # scikit-learn spells the inverse regularization strength with a capital C
                       penalty='l2', 
                       save_path='/content/gdrive/My Drive/Colab Notebooks/tfidf_logreg_classifier_en_mipt_faq.pkl',
                       infer_method='predict')
clf.fit(x_train, y_train)


2019-05-11 07:55:28.53 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.linear_model:LogisticRegression from scratch
2019-05-11 07:55:28.57 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.linear_model:LogisticRegression
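A note on the "top 2 answers" idea: scikit-learn's `LogisticRegression.predict` returns a single label per input row, so ranking several candidate answers requires class probabilities. The sketch below shows only the ranking step. The helper `top_n_answers` and the sample values are hypothetical; in practice the inputs would presumably come from `clf.model.classes_` and one row of `clf.model.predict_proba(...)`, assuming `clf.model` exposes the fitted estimator the way `vectorizer.model` does above.

```python
def top_n_answers(classes, probabilities, n=2):
    """Return the n (answer, probability) pairs with the highest probability."""
    ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical values standing in for the classifier's classes and one row of
# predicted probabilities; they are not output from the model trained above.
classes = ["answer A", "answer B", "answer C"]
probabilities = [0.15, 0.60, 0.25]

print(top_n_answers(classes, probabilities))
# [('answer B', 0.6), ('answer C', 0.25)]
```

Returning the probability alongside each answer also makes it possible to reject a question when even the best candidate scores below some threshold.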


## Test FAQ

In [26]:
test_questions = ["begin my online class early"]
tokenized_test_questions = tokenizer(test_questions)
joined_test_q_tokens = tokenizer(tokenized_test_questions)
test_q_vectorized = vectorizer(joined_test_q_tokens)
answers = clf(test_q_vectorized)

TypeError: ignored

Now we have the full output of the FAQ model: answers and scores.
<br>
Answers:

In [0]:
for i, answer in enumerate(answers):
    print('Answers {}:\n{}\n'.format(i, answer))

NameError: ignored

## Discussion/QA

- How can we evaluate our model's accuracy? BLEU? LogisticRegression accuracy stats?
- How can we improve our model?
- How else can we frame the problem?
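As one possible starting point for the first and third questions, the sketch below frames the task as nearest-neighbour retrieval and scores it with top-1 accuracy on held-out paraphrases. Because the model selects an existing answer rather than generating text, exact-match accuracy is arguably a more natural metric than BLEU. Everything here is an assumption for illustration: the two-entry FAQ, the hand-written paraphrases, the plain bag-of-words cosine similarity (a simplified stand-in for TF-IDF), and the function names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, faq):
    """Return the answer whose stored question is most similar to `question`."""
    query = Counter(question.lower().split())
    best = max(faq, key=lambda qa: cosine(query, Counter(qa[0].lower().split())))
    return best[1]

def top1_accuracy(predict, labelled_questions):
    """Fraction of questions for which `predict` returns the expected answer."""
    hits = sum(predict(q) == expected for q, expected in labelled_questions)
    return hits / len(labelled_questions)

# Hypothetical FAQ entries and hand-written paraphrases, for illustration only.
faq = [
    ("how much does an online class cost", "See the tuition and fees page."),
    ("how do i enroll and register", "See the online registration instructions."),
]
held_out = [
    ("what is the cost of a class", "See the tuition and fees page."),
    ("where do i register", "See the online registration instructions."),
]

accuracy = top1_accuracy(lambda q: retrieve(q, faq), held_out)
print(accuracy)  # 1.0
```

The same `top1_accuracy` helper could wrap the TF-IDF plus logistic regression model instead, which would allow the two framings to be compared on one set of paraphrased questions.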