## Question & Answer

A Question-Answering (QA) Transformer is a model that answers questions posed in natural language, typically by locating the answer inside a supplied context. Within Natural Language Processing (NLP), there are several reasons to build one:

1. **Question-Answering Systems:** QA Transformers are designed to provide accurate and contextually relevant answers to questions posed in natural language. These systems have a wide range of practical applications, including chatbots, virtual assistants, customer support, and information retrieval.

2. **Information Retrieval:** QA Transformers can be used to search through large corpora of text and extract precise answers to user queries. This can improve the efficiency and effectiveness of information retrieval systems.

3. **Document Summarization:** QA Transformers can approximate a summary of a long document by answering targeted questions about its content. This helps users find the key points and relevant information in a text quickly.

4. **Education and E-Learning:** QA Transformers can be integrated into educational platforms to provide instant answers and explanations to students' questions. They can also help with the automatic generation of quiz questions and answers.

5. **Content Generation:** QA Transformers can assist in content generation by automatically answering questions based on available knowledge. This can be useful for generating FAQs, product descriptions, and informative articles.

6. **Customer Support:** Many companies use QA systems to automate responses to frequently asked questions, freeing up human agents to handle more complex queries and providing customers with quick solutions.

7. **Medical Diagnosis:** QA Transformers can assist medical professionals by answering questions related to patient records, medical literature, and diagnostic information, potentially leading to faster and more accurate diagnoses.

8. **Legal and Compliance:** In the legal field, QA Transformers can be used to search and extract information from legal documents, assisting lawyers in their research and case preparation.

9. **Language Translation:** QA Transformers can be used to answer questions about language translation, helping users understand the meaning of words, phrases, or sentences in different languages.

10. **Scientific Research:** QA Transformers can support researchers by answering questions related to scientific literature, allowing them to quickly access relevant information for their studies.

11. **Decision Support:** QA Transformers can aid in decision-making processes by providing answers to questions related to data analysis, market research, and business intelligence.

12. **Accessibility:** QA Transformers can improve accessibility for individuals with disabilities by providing spoken or written answers to their questions, helping them access information more easily.

Overall, QA Transformers can improve information retrieval, automation, and user interaction across many domains, which makes them a valuable building block for intelligent systems and applications. Their key advantage is the ability to give accurate, context-aware answers to questions asked in natural language.

---
Exercise:

Now, as a data scientist with NLP expertise, you are asked to build a model that can answer questions in Spanish. Your stakeholders will give you an article and a question, and your model should answer it.

In [1]:
!pip install requests beautifulsoup4
!pip install transformers torch
!pip install sentencepiece
!pip install nltk scikit-learn pandas



In [2]:
import string

import nltk
import pandas as pd
import requests
import torch
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertForQuestionAnswering, BertTokenizer, pipeline

# Download the NLTK data used by the tokenizers and the stopword list
# (recent NLTK releases may also require "punkt_tab").
nltk.download("punkt")
nltk.download("stopwords")



In [3]:
# Article URL
url = "https://time.com/collection/time100-ai/6309026/geoffrey-hinton/"

# Send an HTTP request to fetch the page content
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page's HTML with BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")

    # Locate the article body (inspect the page's HTML to find the right structure)
    article_content = soup.find("div", {"class": "article-content"})

    # Extract the article text, one paragraph per line;
    # fall back to an empty string if the container is missing
    paragraphs = article_content.find_all("p") if article_content else []
    article_text = "".join(p.get_text() + "\n" for p in paragraphs)

    # Print the article text
    print(article_text)
else:
    print("Error fetching the page:", response.status_code)

Over the course of February, Geoffrey Hinton, one of the most influential AI researchers of the past 50 years, had a “slow eureka moment.”
Hinton, 76, has spent his career trying to build AI systems that model the human brain, mostly in academia before joining Google in 2013. He had always believed that the brain was better than the machines that he and others were building, and that by making them more like the brain, they would improve. But in February, he realized “the digital intelligence we’ve got now may be better than the brain already. It’s just not scaled up quite as big.” 
Developers around the world are currently racing to build the biggest AI systems that they can. Given the current rate at which AI companies are increasing the size of models, it could be less than five years until AI systems have 100 trillion connections—roughly as many as there are between neurons in the human brain.
Alarmed, Hinton left his post as VP and engineering fellow in May and gave a flurry of in

In [4]:
question = "How is Geoffrey Hinton?"


### Sentence Retrieval with TF-IDF

In [5]:
# Define the text and the question
texto = article_text
question = "How is Geoffrey Hinton?"

In [6]:
# Split the text into sentences
oraciones = sent_tokenize(texto)

In [7]:
# Preprocessing function: tokenize, lowercase, drop punctuation and stopwords, stem
STOPWORDS = set(stopwords.words('english'))  # built once instead of once per word
STEMMER = PorterStemmer()

def preprocesar(text):
    palabras = word_tokenize(text)
    palabras = [word.lower() for word in palabras if word not in string.punctuation]
    palabras = [word for word in palabras if word not in STOPWORDS]
    palabras = [STEMMER.stem(word) for word in palabras]
    return " ".join(palabras)

In [8]:
# Preprocess every sentence
oraciones_preprocesadas = [preprocesar(oracion) for oracion in oraciones]
# Preprocess the question
pregunta_preprocesada = preprocesar(question)

In [9]:
# Build the TF-IDF matrix for the sentences and the question (the question is the last row)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(oraciones_preprocesadas + [pregunta_preprocesada])

In [10]:
# Compute the cosine similarity between the question and every sentence
similarity_scores = cosine_similarity(tfidf_matrix[-1:], tfidf_matrix[:-1]).flatten()
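
To make the ranking step concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. It is only an illustration of the idea, not scikit-learn's implementation: the helper names `tfidf_vectors` and `cosine` and the toy sentences are assumptions, and the smoothed IDF formula is the one scikit-learn documents for `smooth_idf=True`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy, already-tokenized sentences and query (illustrative data only).
sentences = [
    "hinton left google in may".split(),
    "ai systems may reach 100 trillion connections".split(),
]
query = "when did hinton leave google".split()

# The query is vectorized together with the sentences, as in the notebook.
*sentence_vecs, query_vec = tfidf_vectors(sentences + [query])
scores = [cosine(query_vec, vec) for vec in sentence_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
```

A sentence that shares no terms with the query scores exactly zero, so taking the maximum always returns some sentence even when nothing is relevant; checking that the top score is above zero is a cheap safeguard.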

In [11]:
# Find the index of the most similar sentence
most_similar_sentence_index = similarity_scores.argmax()

# Return the corresponding sentence as the answer
respuesta = oraciones[most_similar_sentence_index]
print("Question:", question)
print("Answer:", respuesta)

Question: How is Geoffrey Hinton?
Answer: Over the course of February, Geoffrey Hinton, one of the most influential AI researchers of the past 50 years, had a “slow eureka moment.”
Hinton, 76, has spent his career trying to build AI systems that model the human brain, mostly in academia before joining Google in 2013.


In [12]:
# Load the BERT model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
# Prepare the model input (question + context, truncated to 512 tokens)
inputs = tokenizer.encode_plus(question, texto, add_special_tokens=True, return_tensors="pt", truncation=True, max_length=512)


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
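
Because the input is truncated to 512 tokens, any answer located later in the article is never seen by the model. One common workaround is to split the context into overlapping windows and run the model on each. The sketch below shows only the windowing logic on a plain list; `sliding_windows`, `max_len`, and `stride` are illustrative names, and the integer list stands in for real token ids. Hugging Face tokenizers offer `stride` and `return_overflowing_tokens=True` for the same purpose.

```python
def sliding_windows(tokens, max_len, stride):
    """Split `tokens` into windows of at most `max_len` items.

    Consecutive windows share `stride` tokens, so text near a window
    boundary also appears at the start of the next window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


context_tokens = list(range(10))  # stand-in for real token ids
windows = sliding_windows(context_tokens, max_len=4, stride=2)
```

In a real QA setting the window budget also has to leave room for the question and the special tokens, so the usable context length per window is smaller than the model's 512-token limit.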


In [14]:
# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Extract the start and end indices of the answer (end is exclusive, hence the + 1)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
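
Taking the arg-max of the start and end logits independently can place the end before the start, which yields an empty answer. A common alternative is to search for the highest-scoring pair with `start <= end` and a bounded length. The following is a minimal sketch of that search on plain Python lists; `best_span` and `max_answer_len` are hypothetical names, and the scores are made-up values chosen to show the failure case.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end), score) maximizing start + end score.

    Only spans with start <= end < start + max_answer_len are considered.
    `end` is inclusive.
    """
    best = None
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score


# Made-up scores: the independent arg-max picks start=2 and end=0.
start_scores = [0.1, 0.2, 5.0, 0.3]
end_scores = [4.0, 0.1, 0.5, 3.0]

naive_start = start_scores.index(max(start_scores))
naive_end = end_scores.index(max(end_scores))
span, score = best_span(start_scores, end_scores)
```

With the notebook's tensors, the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. Since the returned `end` is inclusive, the matching slice of token ids is `[start:end + 1]`.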

In [15]:
answer_tokens = inputs["input_ids"][0][answer_start:answer_end]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print("Answer:", answer)

Answer: one of the most influential ai researchers of the past 50 years


With Transformers Question Answering

In [16]:
#hide_output
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [17]:
article_text = article_text[:512]

outputs = classifier(article_text)
pd.DataFrame(outputs)

      label     score
0  POSITIVE  0.787146


In [18]:
reader = pipeline("question-answering")
question = "¿Cómo está Geoffrey Hinton?"
outputs = reader(question=question, context=article_text)
pd.DataFrame([outputs])

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


      score  start  end       answer
0  0.233791    118  129  slow eureka
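
The default checkpoint here, `distilbert-base-cased-distilled-squad`, was distilled on SQuAD, an English dataset, while the question is in Spanish, and the returned score is low. Two small safeguards are sketched below under stated assumptions: passing an explicit model and revision to `pipeline`, and rejecting answers below a confidence threshold. `SPANISH_QA_MODEL` is a hypothetical placeholder, not a real checkpoint name, and the 0.5 threshold is an arbitrary illustrative value that would need tuning.

```python
# Hypothetical placeholder: replace with a real Spanish QA checkpoint.
SPANISH_QA_MODEL = "your-org/spanish-qa-model"


def reader_kwargs(model_name, revision=None):
    """Collect the arguments for an explicitly pinned QA pipeline."""
    kwargs = {"task": "question-answering", "model": model_name}
    if revision is not None:
        kwargs["revision"] = revision
    return kwargs


def build_reader(model_name, revision=None):
    """Create the pipeline; the import is lazy so the helpers above
    can be used without loading transformers."""
    from transformers import pipeline

    return pipeline(**reader_kwargs(model_name, revision))


def accept_answer(result, min_score=0.5):
    """Return the answer text, or None when the model is not confident.

    `result` is the dict returned by the QA pipeline, with the keys
    "score", "start", "end", and "answer". The threshold is an assumption.
    """
    return result["answer"] if result["score"] >= min_score else None


# The result reported in the output table above.
result = {"score": 0.233791, "start": 118, "end": 129, "answer": "slow eureka"}
answer = accept_answer(result)
```

With the reported score of about 0.23, `accept_answer` returns `None`, which signals that the answer should not be shown to stakeholders as is.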


In [19]:
summarizer = pipeline("summarization")
outputs = summarizer(article_text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your min_length=56 must be inferior than your max_length=45.


 Geoffrey Hinton, 76, has spent his career trying to build AI systems that model the human brain. Hinton joined Google in 2013. He realized “the digital intelligence we’ve got now is
