
# TensorFlow Hub - Questions and answers

- Based on: https://www.tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa?hl=pt-br
- Simple Neighbors library: https://pypi.org/project/simpleneighbors/

# Installing and importing the libraries

In [None]:
!pip install -q tensorflow_text
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm



In [None]:
import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib.request
from IPython.display import HTML, display
from tqdm.notebook import tqdm
import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
nltk.download('punkt')
print('TensorFlow version: ', tf.__version__)
print('TensorFlow Hub version:', hub.__version__)

TensorFlow version:  2.9.1
TensorFlow Hub version: 0.12.0


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# SQuAD dataset

- Documentation: https://rajpurkar.github.io/SQuAD-explorer/

In [None]:
squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'
squad_json = json.load(urllib.request.urlopen(squad_url))

In [None]:
squad_json

In [None]:
nltk.tokenize.sent_tokenize('Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.')

['Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other.',
 'A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.']

In [None]:
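def extrair_sentencas(squad):
  # Returns unique (sentence, full paragraph context) pairs.
  todas_sentencas = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentencas = nltk.tokenize.sent_tokenize(paragraph['context'])
      # Pair every sentence with the paragraph it came from.
      todas_sentencas.extend(zip(sentencas, [paragraph['context']] * len(sentencas)))
  # set() removes duplicates, so the order of the result is arbitrary.
  return list(set(todas_sentencas))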

In [None]:
sentencas = extrair_sentencas(squad_json)

In [None]:
len(sentencas)

10452

In [None]:
sentencas[0:5]

[("Other main architectural attractions include the Castle Square with the Royal Castle and the iconic King Sigismund's Column, St. John's Cathedral, Market Square, palaces, churches and mansions all displaying a richness of colour and architectural detail.",
  "The city is the seat of a Roman Catholic archdiocese (left bank of the Vistula) and diocese (right bank), and possesses various universities, most notably the Polish Academy of Sciences and the University of Warsaw, two opera houses, theatres, museums, libraries and monuments. The historic city-centre of Warsaw with its picturesque Old Town in 1980 was listed as a UNESCO World Heritage Site. Other main architectural attractions include the Castle Square with the Royal Castle and the iconic King Sigismund's Column, St. John's Cathedral, Market Square, palaces, churches and mansions all displaying a richness of colour and architectural detail. Buildings represent examples of nearly every European architectural style and historica

In [None]:
def extrair_perguntas_respostas(squad):
  # Returns unique (question, answer text) pairs.
  perguntas_respostas = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        # Skip questions without answers; keep only the first listed answer.
        if qas['answers']:
          perguntas_respostas.append((qas['question'], qas['answers'][0]['text']))
  return list(set(perguntas_respostas))

In [None]:
perguntas_respostas = extrair_perguntas_respostas(squad_json)

In [None]:
len(perguntas_respostas)

10552

In [None]:
perguntas_respostas[0:10]

[('What did Davies call the System ', 'packet switching'),
 ("Which country does the Rhine encounter it's main tributaries?", 'Germany'),
 ('What is the United States at risk for because of the recession of 2008?',
  'boom-and-bust cycles'),
 ('How can you find the absolute age of sedimentary rock units which do not contain radioactive isotopes?',
  'Dating of lava and volcanic ash layers found within a stratigraphic sequence'),
 ('What is the name of world renowned cellist is a former Harvard student?',
  'Yo Yo Ma'),
 ('Who hosts the weeknight talk show featured on ABC networks?',
  'Jimmy Kimmel'),
 ('What is the shortened name of the annual yacht race that takes place?',
  'Transpac'),
 ('What is a complex net of contracts and other legal obligations?',
  'A construction project'),
 ('Where was Genghis Khan buried?',
  'somewhere in Mongolia at an unknown location'),
 ('What lands were reserved for natives?',
  'west of the Appalachian Mountains')]
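
The nested loops in `extrair_perguntas_respostas` follow the SQuAD v1.1 layout: `data`, then `paragraphs`, then `qas`, then `answers`. As a minimal sketch, the toy dictionary below mimics that layout so the extraction logic can be checked without downloading anything. The content of `toy_squad` is made up for illustration, and the function is repeated only to keep the block self-contained.

```python
# Toy stand-in for the SQuAD v1.1 JSON layout (hypothetical content).
toy_squad = {
    'data': [
        {
            'title': 'Example',
            'paragraphs': [
                {
                    'context': 'Paris is the capital of France. It lies on the Seine.',
                    'qas': [
                        {
                            'id': 'q1',
                            'question': 'What is the capital of France?',
                            'answers': [
                                {'text': 'Paris', 'answer_start': 0},
                                {'text': 'Paris', 'answer_start': 0},
                            ],
                        },
                        {
                            'id': 'q2',
                            'question': 'Which river is mentioned?',
                            'answers': [],  # skipped: no answer available
                        },
                    ],
                }
            ],
        }
    ]
}


def extrair_perguntas_respostas(squad):
    """Same logic as the notebook cell: keep the first answer of each question."""
    perguntas_respostas = []
    for data in squad['data']:
        for paragraph in data['paragraphs']:
            for qas in paragraph['qas']:
                if qas['answers']:
                    perguntas_respostas.append(
                        (qas['question'], qas['answers'][0]['text']))
    return list(set(perguntas_respostas))


toy_pairs = extrair_perguntas_respostas(toy_squad)
print(toy_pairs)
```

Only the first entry of `answers` is kept, so questions with several reference answers contribute a single pair.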

In [None]:
print('Sentence and context:\n')
sentenca = random.choice(sentencas)
print('Sentence: ')
pprint.pprint(sentenca[0])
print('\nContext:\n')
pprint.pprint(sentenca[1])
print()

Sentence and context:

Sentence: 
('The astronauts would begin wearing a new Apollo spacesuit, designed to '
 'accommodate lunar extravehicular activity (EVA).')

Context:

('After an unmanned LM test flight AS-206, a crew would fly the first Block II '
 'CSM and LM in a dual mission known as AS-207/208, or AS-278 (each spacecraft '
 'would be launched on a separate Saturn IB.) The Block II crew positions were '
 'titled Commander (CDR) Command Module Pilot (CMP) and Lunar Module Pilot '
 '(LMP). The astronauts would begin wearing a new Apollo spacesuit, designed '
 'to accommodate lunar extravehicular activity (EVA). The traditional visor '
 'helmet was replaced with a clear "fishbowl" type for greater visibility, and '
 'the lunar surface EVA suit would include a water-cooled undergarment.')



# Building the index

In [None]:
# More models: https://tfhub.dev/s?dataset=squad
caminho_modelo = 'https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3'
modelo = hub.load(caminho_modelo)

In [None]:
sentencas[0][0]

"Other main architectural attractions include the Castle Square with the Royal Castle and the iconic King Sigismund's Column, St. John's Cathedral, Market Square, palaces, churches and mansions all displaying a richness of colour and architectural detail."

In [None]:
sentencas[0][1]

"The city is the seat of a Roman Catholic archdiocese (left bank of the Vistula) and diocese (right bank), and possesses various universities, most notably the Polish Academy of Sciences and the University of Warsaw, two opera houses, theatres, museums, libraries and monuments. The historic city-centre of Warsaw with its picturesque Old Town in 1980 was listed as a UNESCO World Heritage Site. Other main architectural attractions include the Castle Square with the Royal Castle and the iconic King Sigismund's Column, St. John's Cathedral, Market Square, palaces, churches and mansions all displaying a richness of colour and architectural detail. Buildings represent examples of nearly every European architectural style and historical period. Warsaw provides many examples of architecture from the gothic, renaissance, baroque and neoclassical periods, and around a quarter of the city is filled with luxurious parks and royal gardens."

In [None]:
encodings = modelo.signatures['response_encoder'](input = tf.constant([sentencas[0][0]]), 
                                                  context = tf.constant([sentencas[0][1]]))

In [None]:
len(encodings['outputs'][0])

512

In [None]:
indice = simpleneighbors.SimpleNeighbors(len(encodings['outputs'][0]), metric = 'angular')
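
With `metric = 'angular'`, and assuming the Annoy backend installed at the top of this notebook is the one in use, items are ranked by the Euclidean distance between length-normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). A smaller distance therefore means a higher cosine similarity. The brute-force sketch below illustrates that ranking on tiny made-up vectors. It is not how Annoy searches (Annoy uses an approximate, tree-based search), and `distancia_angular` and `vizinhos_mais_proximos` are hypothetical helper names.

```python
import math


def distancia_angular(u, v):
    """Euclidean distance between the unit-length versions of u and v."""
    norma_u = math.sqrt(sum(x * x for x in u))
    norma_v = math.sqrt(sum(x * x for x in v))
    cosseno = sum(x * y for x, y in zip(u, v)) / (norma_u * norma_v)
    # Clamp to guard against tiny floating-point overshoots.
    cosseno = max(-1.0, min(1.0, cosseno))
    return math.sqrt(2.0 * (1.0 - cosseno))


def vizinhos_mais_proximos(consulta, itens, n):
    """Return the n item names closest to `consulta`, nearest first (exhaustive search)."""
    ordenados = sorted(itens, key=lambda par: distancia_angular(consulta, par[1]))
    return [nome for nome, _ in ordenados[:n]]


# Made-up 2-D vectors; the real embeddings have 512 dimensions.
itens = [
    ('same direction', [2.0, 0.0]),
    ('orthogonal', [0.0, 3.0]),
    ('opposite', [-1.0, 0.0]),
]
print(vizinhos_mais_proximos([1.0, 0.0], itens, 2))
```

Vector length does not affect the ranking: `[2.0, 0.0]` is at distance 0 from the query `[1.0, 0.0]` because both point in the same direction.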

In [None]:
batch_size = 100
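# zip over a single shared iterator yields only full batches, so the last
# len(sentencas) % batch_size pairs are left out of the index.
slices = zip(*(iter(sentencas),) * batch_size)
num_batches = len(sentencas) // batch_size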
num_batches

104

In [None]:
104 * 100

10400
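
The two numbers above show that `zip(*(iter(sentencas),) * batch_size)` only produces full batches: with 10452 sentences and a batch size of 100, the last 52 pairs never reach the index. A minimal sketch of a batching helper that also yields the final partial batch follows. `em_lotes` is a hypothetical name, not part of the original notebook, and the integer list merely stands in for `sentencas`.

```python
import math
from itertools import islice


def em_lotes(itens, tamanho):
    """Yield consecutive lists of up to `tamanho` items, including the last partial one."""
    iterador = iter(itens)
    while True:
        lote = list(islice(iterador, tamanho))
        if not lote:
            return
        yield lote


# Stand-in data with the same length as `sentencas` in this notebook.
itens_exemplo = list(range(10452))
lotes = list(em_lotes(itens_exemplo, 100))
total_lotes = math.ceil(len(itens_exemplo) / 100)

print(len(lotes), len(lotes[-1]))
```

In the indexing loop, `slices` could then be `em_lotes(sentencas, batch_size)`, with `total = math.ceil(len(sentencas) / batch_size)` for the progress bar.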

In [None]:
for s in tqdm(slices, total = num_batches):
  # Split the batch into sentences (responses) and their paragraph contexts.
  batch_resposta = [r for r, c in s]
  batch_contexto = [c for r, c in s]
  encodings = modelo.signatures['response_encoder'](input=tf.constant(batch_resposta), context=tf.constant(batch_contexto))
  # Store each sentence in the index under its embedding.
  for batch_index, batch in enumerate(batch_resposta):
    indice.add_one(batch, encodings['outputs'][batch_index])
indice.build()


# Viewing the results

In [None]:
numero_resultados = 10
pergunta_resposta = random.choice(perguntas_respostas)
print(pergunta_resposta)

('What monarchy did western troops protect?', 'Saudi')


In [None]:
def mostra_resultados(texto_consulta, resposta):
  embedding_consulta = modelo.signatures['question_encoder'](tf.constant([texto_consulta]))['outputs'][0]
  #print(embedding_consulta)
  resultados_busca = indice.nearest(embedding_consulta, n = numero_resultados)
  
  resultado_formatado = '''
  <p>Random question selected from SQuAD</p>
  <p>&nbsp;&nbsp;<b>%s</b></p>
  <p>Answer:</p>
  <p>&nbsp;&nbsp;<b>%s</b></p>
  ''' % (texto_consulta, resposta)

  resultado_formatado += '<ol>'
  for s in resultados_busca:
    resultado_formatado += '<li>'
    resultado_formatado += s
    resultado_formatado += '</li>'
  resultado_formatado += '</ol>'

  display(HTML(resultado_formatado))
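
The function above concatenates the retrieved sentences straight into HTML. SQuAD text is plain prose, so this usually renders fine, but characters such as `<` or `&` would be interpreted as markup. Below is a minimal sketch of the list-building step using `html.escape` from the standard library. `formata_lista` is a hypothetical helper, not part of the original notebook.

```python
import html


def formata_lista(resultados):
    """Build an ordered HTML list, escaping each result so it renders as text."""
    itens = ''.join('<li>%s</li>' % html.escape(r) for r in resultados)
    return '<ol>%s</ol>' % itens


lista_html = formata_lista(['a < b & c', 'plain sentence'])
print(lista_html)
```

The same escaping could be applied to the question and answer before they are interpolated into the template.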

In [None]:
mostra_resultados(pergunta_resposta[0], pergunta_resposta[1])
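
Looking at one random question gives only a qualitative impression. One rough way to quantify retrieval quality is to check, over many question/answer pairs, whether the answer text appears in any of the top results. The sketch below assumes a `busca` callable that maps a question to a list of sentences. `taxa_de_acerto` and the toy `busca_falsa` are hypothetical names used for illustration, and substring matching is a simplification rather than an official SQuAD metric.

```python
def taxa_de_acerto(pares, busca, n=10):
    """Fraction of (question, answer) pairs whose answer occurs in the top-n results."""
    if not pares:
        return 0.0
    acertos = 0
    for pergunta, resposta in pares:
        resultados = busca(pergunta)[:n]
        # Case-insensitive substring match against each retrieved sentence.
        if any(resposta.lower() in r.lower() for r in resultados):
            acertos += 1
    return acertos / len(pares)


# Toy stand-in for the real search: always returns the same two sentences.
def busca_falsa(pergunta):
    return ['Paris is the capital of France.', 'The Rhine flows through Germany.']


pares_exemplo = [
    ('What is the capital of France?', 'Paris'),
    ('Where was Genghis Khan buried?', 'somewhere in Mongolia'),
]
print(taxa_de_acerto(pares_exemplo, busca_falsa, n=2))
```

In this notebook, `busca` could wrap the same two steps used in `mostra_resultados`: encode the question with `modelo.signatures['question_encoder']` and pass the embedding to `indice.nearest`.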