# Hypothesis

Define a Question & Answer (Q&A) system for a document ([recommended document](http://aepd.es/sites/default/files/2019-12/ai-definition.pdf)).

To achieve this objective, the system uses the pretrained [large BERT model from Hugging Face](https://huggingface.co/transformers/pretrained_models.html), spaCy to handle the document (recall that BERT accepts at most 512 tokens), and gTTS to make the model speak the answer.

It is important to know BERT's limitations. It can only handle input sequences of at most 512 tokens: its position embeddings cover 512 positions, and self-attention memory grows quadratically with sequence length.

With this in mind, and since we are dealing with an entire document, the text must be split into chunks of at most 512 tokens.

Next, spaCy is used to score the similarity between the question and each chunk (a regression-style task), and the three chunks most likely to contain the answer are selected.

BERT then only processes these three chunks, which makes the code time-efficient and applicable to texts of any length.

## LOAD PACKAGES (transformers & spaCy) and model (Large BERT fine-tuned on SQuAD v1.1)

Apache Tika is also used as a pre-processing step so that the system works with any file format.

### Huggingface **Transformers**

In [1]:
!pip install transformers==3



In [2]:
import torch

### Huggingface **Model BERT**

Load the model: 24-layer, 1024-hidden, 16-heads, 340M parameters. Model fine-tuned on SQuAD

In [3]:
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Load tokenizer

In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


### SPACY

In [5]:
!pip install -U spacy
import spacy


In [6]:
import spacy.cli
spacy.cli.download("en_core_web_lg")
import en_core_web_lg
nlp = en_core_web_lg.load()

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


### Apache Tika (source document pre-process)

In [7]:
!pip install tika

Successfully installed tika-1.24


## Define function Q&A

In [8]:
def answer_question(question, answer_text):
    '''
    Takes a `question` string and an `answer_text` string (which contains the
    answer), and identifies the words within the `answer_text` that are the
    answer. Prints them out.
    '''
    # Tokenize: Apply the tokenizer to the text (question & answer), treating them as a text-pair.
    input_ids = tokenizer.encode(question, answer_text)

    # Segment IDs: Search inside the input_ids the "[SEP]" token to split both inputs.
    sep_index = input_ids.index(tokenizer.sep_token_id)

    # The number of segment A (question) tokens includes the "[SEP]" token itself.
    num_seg_a = sep_index + 1

    # The remainder are segment B (answer text).
    num_seg_b = len(input_ids) - num_seg_a

    # Construct the list of segment IDs: 0 for question tokens, 1 for answer_text tokens.
    segment_ids = [0]*num_seg_a + [1]*num_seg_b

    # There should be a segment_id for every input token.
    assert len(segment_ids) == len(input_ids)

    # Evaluate the question and the sentence containing the answer through the model.
    # Gradients are not needed for inference, so disable them to save memory.
    with torch.no_grad():
        start_scores, end_scores = model(torch.tensor([input_ids]), # The tokens representing input text.
                                         token_type_ids=torch.tensor([segment_ids])) # The segment IDs to differentiate question from answer_text

    # Reconstruct Answer: Find the tokens with the highest "start" and "end" scores.
    answer_start = int(torch.argmax(start_scores))
    answer_end = int(torch.argmax(end_scores))

    # Get the string versions of the input tokens.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Start with the first token.
    answer = tokens[answer_start]

    # Post-process: Select the remaining answer tokens and join them with whitespace.
    for i in range(answer_start + 1, answer_end + 1):
        
        # If it's a subword token, then recombine it with the previous token.
        if tokens[i][0:2] == '##':
            answer += tokens[i][2:]
        
        # Otherwise, add a space then the token.
        else:
            answer += ' ' + tokens[i]

    #print("Answer: ", answer)
    #print(answer)
    return answer
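
Taking the two argmax values independently can yield an end index that falls before the start index, in which case the loop above returns only the start token. The sketch below shows one common way to pick a consistent span instead. It is a minimal illustration, not part of the notebook: `best_span` and `max_answer_len` are hypothetical names, and the inputs are assumed to be plain lists of floats (for example `start_scores[0].tolist()`).

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximising start_scores[start] + end_scores[end]
    subject to start <= end and a span of at most max_answer_len tokens."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start position.
        last = min(len(end_scores), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy scores: the independent argmax would give start=2, end=0 (invalid).
start = [0.1, 0.2, 5.0, 0.3]
end = [4.0, 0.1, 0.2, 3.0]
print(best_span(start, end))
```

The exhaustive search is quadratic in the worst case, but with at most 512 tokens and a bounded answer length it stays cheap compared with the model call.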

### Analysis 

To better understand the function `answer_question`, walk through its tokenization steps on a small example:

In [9]:
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"

In [10]:
input_text = ["[CLS]","Who", "was", "Jim", "Henson", "?", "[SEP]", "Jim", "Henson", "was", "a", "nice", "puppet", "[SEP]"] 

In [11]:
input_ids = tokenizer.encode(question, text)
print(input_text)
print(input_ids)

['[CLS]', 'Who', 'was', 'Jim', 'Henson', '?', '[SEP]', 'Jim', 'Henson', 'was', 'a', 'nice', 'puppet', '[SEP]']
[101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 27227, 2001, 1037, 3835, 13997, 102]


In [12]:
sep_token = input_ids.index(tokenizer.sep_token_id)
segment_ids = [0]*(sep_token + 1) + [1]*(len(input_ids) - (sep_token + 1))

In [13]:
print(segment_ids)

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
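
The segment-ID construction walked through above can be wrapped in a small reusable function. This is only a sketch: `build_segment_ids` is a hypothetical helper name, and the default `sep_token_id=102` is an assumption taken from the IDs printed above (in the notebook it would come from `tokenizer.sep_token_id`).

```python
def build_segment_ids(input_ids, sep_token_id=102):
    """Return 0 for every token up to and including the first [SEP], 1 for the rest."""
    first_sep = input_ids.index(sep_token_id)
    num_seg_a = first_sep + 1
    return [0] * num_seg_a + [1] * (len(input_ids) - num_seg_a)


# Token IDs copied from the printed output above.
ids = [101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 27227, 2001, 1037, 3835, 13997, 102]
print(build_segment_ids(ids))
```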


### Spacy: selection of the paragraph/s

Define functions to pre-process the chunks for the regression task:


In [14]:
def process_text(text):
    '''
    Before the regression NLP task, it is necessary to
    pre-process the input by removing stop words,
    punctuation and pronouns.
    '''
    doc = nlp(text.lower())
    result = []
    for token in doc:
        if token.text in nlp.Defaults.stop_words:
            continue
        if token.is_punct:
            continue
        # spaCy v2 lemmatizes pronouns to "-PRON-"; spaCy v3 does not, so also check the POS tag.
        if token.lemma_ == "-PRON-" or token.pos_ == "PRON":
            continue
        result.append(token.lemma_)
    return " ".join(result)

In [15]:
def calculate_similarity(text1, text2):
    '''
    Takes the question and compares it with a
    chunk of the long document to calculate
    their similarity (typically between 0 and 1),
    treated here as a regression NLP task.
    '''
    base = nlp(process_text(text1))
    compare = nlp(process_text(text2))
    return base.similarity(compare)
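
By default, spaCy's `Doc.similarity` compares the two documents' vectors (each the average of its token vectors) with cosine similarity. The sketch below reproduces that computation in plain Python on toy 2-D vectors, to make the score easier to interpret. The helper names and the toy vectors are illustrative assumptions, not spaCy APIs.

```python
import math


def average_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "word vectors" for a question and a chunk.
question_vec = average_vector([[1.0, 0.0], [0.0, 1.0]])
chunk_vec = average_vector([[2.0, 2.0]])
print(cosine_similarity(question_vec, chunk_vec))
```

Cosine similarity can be negative when vectors point in opposite directions, which is why the 0 to 1 range is typical rather than guaranteed.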

## Q&A DEMO of BERT

### Define text corpus

The Tika module converts unstructured content from any format (.docx, .ppt, .pdf) into plain text for the spaCy and BERT functions.

In [16]:
from tika import parser

In [18]:
file = parser.from_file("/content/ai-definition.pdf") # Input to the document

# get the content of the pdf file
text = file['content']

2021-04-05 11:47:28,621 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar to /tmp/tika-server.jar.
2021-04-05 11:47:29,291 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.24/tika-server-1.24.jar.md5 to /tmp/tika-server.jar.md5.
2021-04-05 11:47:29,765 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


In [19]:
# Split the document into chunks of at most n characters
n = 2200 # Character budget chosen to keep each chunk under 512 tokens

words = iter(text.split())
lines, current = [], next(words)
for word in words:
    if len(current) + 1 + len(word) > n:
        lines.append(current)
        current = word
    else:
        current += " " + word
lines.append(current) #store the chunks inside an array 
print("Number of chunks:" ,len(lines))

Number of chunks: 11
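
The 2200-character budget is a proxy for the token limit. A chunker can also count tokens directly, as sketched below. This is a minimal illustration under stated assumptions: `chunk_by_tokens` and its `count_tokens` parameter are hypothetical names, and the default counter simply counts whitespace-separated words. With the notebook's tokenizer one could pass `lambda s: len(tokenizer.tokenize(s))` instead.

```python
def chunk_by_tokens(text, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack whitespace-separated words into chunks whose
    token count (as measured by count_tokens) is at most max_tokens."""
    chunks, current = [], []
    for word in text.split():
        candidate = current + [word]
        # Close the current chunk when adding the word would exceed the budget.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_by_tokens("one two three four five", max_tokens=2))
```

Re-counting the candidate chunk for every word is simple but repeats work; for long documents an incremental count would be faster.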


Sanity check: verify that the longest chunk stays within the 512-token limit, because the model cannot handle longer input sequences.

In [20]:
max_len = 0

# For every sentence...
for sent in lines:
  
    # Tokenize the text and add "[CLS]" and "[SEP]" tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  481
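
Note that the check above encodes each chunk on its own, whereas `answer_question` encodes the question and the chunk together as `[CLS] question [SEP] chunk [SEP]`. The arithmetic below sketches the combined length. The function names are hypothetical, and the 8-token question length is an illustrative assumption, not a measured value.

```python
def pair_length(num_question_tokens, num_chunk_tokens):
    """Length of '[CLS] question [SEP] chunk [SEP]' given content-token counts."""
    return 1 + num_question_tokens + 1 + num_chunk_tokens + 1


def fits_model(num_question_tokens, num_chunk_tokens, max_len=512):
    """True if the encoded question/chunk pair fits within max_len tokens."""
    return pair_length(num_question_tokens, num_chunk_tokens) <= max_len


# The longest chunk above encodes to 481 IDs including [CLS] and [SEP],
# that is, 479 content tokens. The question length of 8 is assumed.
print(pair_length(8, 479), fits_model(8, 479))
```

With these numbers there is room for a question of up to 30 tokens before the pair exceeds 512.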


## Spacy results + Answer (From 3 probable sentences)

Feed BERT the most probable chunk and evaluate it; if no answer is found (the returned span contains "[CLS]"), feed BERT the second most probable chunk, and then the third if needed.

In [21]:
question = "What does it mean narrow AI?"

In [22]:
# Initialize the list where the similarity scores are saved.
scores = []

for sent in lines:
  scores.append(calculate_similarity(question, sent))


a = scores.index(max(scores))                          # Index most probable paragraph for the answer
answer_fst = answer_question(question, lines[a])       # Get the answer

if "[CLS]" in answer_fst:                              # Check the 1st answer
  scores_3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]  # Indices of the 3 best chunks (robust to tied scores)
  answer_fst = answer_question(question, lines[scores_3[1]])
  if "[CLS]" in answer_fst:                            # Check the 2nd answer 
    answer_fst = answer_question(question, lines[scores_3[2]])
    print ("3rd answer:", answer_fst)
  else:
    print ("2nd answer:", answer_fst)

else:
  print("1st answer:", answer_fst)

1st answer: systems that can perform one or few specific tasks
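
The nested fallback above can be generalised to any number of candidate chunks. The sketch below is one possible shape for it, with the scorer and the reader passed in as functions so it can be run without the models. All names here (`answer_with_fallback`, `overlap`, `toy_reader`) and the toy chunks are hypothetical; in the notebook, `score_fn` would be `calculate_similarity` and `read_fn` would be `answer_question`.

```python
def answer_with_fallback(question, chunks, score_fn, read_fn, top_k=3, no_answer_marker="[CLS]"):
    """Try the top_k highest-scoring chunks in order and return (rank, answer).
    If every candidate contains the no-answer marker, the last one is returned."""
    scores = [score_fn(question, chunk) for chunk in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
    answer = ""
    for rank, idx in enumerate(ranked, start=1):
        answer = read_fn(question, chunks[idx])
        if no_answer_marker not in answer:
            return rank, answer
    return len(ranked), answer


# Toy stand-ins for the spaCy scorer and the BERT reader.
def overlap(question, chunk):
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def toy_reader(question, chunk):
    return chunk if "specific" in chunk else "[CLS]"


chunks = ["narrow ai is discussed here", "unrelated text", "narrow ai means specific tasks"]
print(answer_with_fallback("what is narrow ai", chunks, overlap, toy_reader))
```

In this toy run the best-scoring chunk yields no answer, so the second-ranked chunk provides it.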


## BERT can speak

Building on the AIDL speech module, the knowledge gained there is put to use to make this NLP task multidisciplinary: BERT is made to speak by importing the gTTS package:

### Install gTTS package

In [23]:
!pip install gTTS

Successfully installed gTTS-2.2.2


In [24]:
import textwrap

### Audio Answer

In [25]:
question = "What does it mean narrow AI?"

In [26]:
#@title BERT says:
# Initialize the list where the similarity scores are saved.
scores = []

print("Bert is reading...😎")

for sent in lines:
  scores.append(calculate_similarity(question, sent))


a = scores.index(max(scores))                       # Index most probable paragraph for the answer

wrapper = textwrap.TextWrapper(width=150)           # width must be an int; used by the optional prints below
answer_fst = answer_question(question, lines[a])    # Get the answer

print("Bert is thinking...🤓")

if "[CLS]" in answer_fst:                              # Check the answer
  scores_3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]  # Indices of the 3 best chunks (robust to tied scores)
  answer_fst = answer_question(question, lines[scores_3[1]])
  if "[CLS]" in answer_fst:                            # Check the 2nd answer 
    answer_fst = answer_question(question, lines[scores_3[2]])
    print ("3rd answer:", answer_fst)
    print(" ")
    #print(wrapper.fill(lines[scores_3[2]]))
  else:
    print("2nd answer:", answer_fst)
    print(" ")
    #print(wrapper.fill(lines[scores_3[1]]))
else:
  print("1st answer 🧐:", answer_fst)
  print(" ")
  #print(wrapper.fill(lines[a]))

#@title Voice
from gtts import gTTS #Import Google Text to Speech
from IPython.display import Audio #Import Audio method from IPython's Display Class
tts = gTTS(answer_fst) #Provide the string to convert to speech
tts.save('1.mp3') #gTTS writes MP3 data, so save the speech as an .mp3 file
sound_file = '1.mp3'
Audio(sound_file, autoplay=True) 

#autoplay=True plays the sound automatically
#To avoid playing the sound automatically, pass autoplay=False.

Bert is reading...😎
Bert is thinking...🤓
1st answer 🧐: systems that can perform one or few specific tasks
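
Before handing the answer to gTTS, it may help to strip any leftover special tokens and to fall back to a spoken message when nothing remains, since an empty string is not expected to be accepted as speech input. The helper below is a hypothetical sketch (`prepare_for_speech` and the fallback sentence are assumptions, not part of the notebook or of gTTS).

```python
def prepare_for_speech(answer, fallback="Sorry, I could not find an answer."):
    """Remove BERT special tokens and collapse whitespace; use a fallback if nothing is left."""
    cleaned = answer.replace("[CLS]", " ").replace("[SEP]", " ")
    cleaned = " ".join(cleaned.split())
    return cleaned if cleaned else fallback


print(prepare_for_speech("systems that can perform one or few specific tasks"))
print(prepare_for_speech("[CLS]"))
```

In the cell above this would be used as `gTTS(prepare_for_speech(answer_fst))`.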
 
