# Proof of concept

In this notebook we present the implementation of a conversational system that answers questions about a list of documents (partially adapted from [chat-langchain](https://github.com/hwchase17/chat-langchain)).

The process can be split into the following systems:

1. _Information Retrieval_ (IR): for each question **q**, the IR system finds the set of documents **D** that contains the answer.
2. _Question Answering_ (QA): the QA system generates the answer to question **q** using the information present in the set of documents **D**.

### Setup

In [2]:
import os
import pickle
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Optional

import gradio as gr
import numpy as np
import openai
import pandas as pd
from sentence_transformers import SentenceTransformer, util
from tqdm import tqdm

openai.api_key = os.getenv("OPENAI_API_KEY")

# Data

In [3]:
dataset_path = "data/claims_parse_urls-6m-hard_clean.json"
claims = pd.read_json(dataset_path)

In [3]:
raw_documents = []
for _, claim in tqdm(claims.iterrows(),total=len(claims)):
    text = claim["text"]
    metadata = {
        "id": claim["_id"],
        "claimReviewed": claim["claimReviewed"],
        "url": claim["url"],
    }
    raw_documents.append({"text":text,"metadata":metadata})

100%|████████████████████████████████████████████████████████████████████████| 18951/18951 [00:00<00:00, 24624.48it/s]


The first **design decision** is how to create the documents that will serve as sources for the QA system.

Here, each article is split into segments of at most _chunk_size_ characters (1000), and consecutive segments overlap by at most _chunk_overlap_ characters (200), like a sliding window.

To keep the segments semantically coherent and syntactically correct, the text is split recursively, trying the separators in order from coarsest to finest (paragraph break, line break, space, single character).
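
To make the roles of _chunk_size_ and _chunk_overlap_ concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not the recursive splitter used in this notebook: it ignores separators entirely, and the helper name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Cut `text` into windows of at most `size` characters.

    Consecutive windows start `size - overlap` characters apart
    (hypothetical helper, for illustration only).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


toy_chunks = sliding_window_chunks("abcdefghij", size=4, overlap=2)
print(toy_chunks)
```

Each window repeats the last two characters of the previous one. The splitter in this notebook differs in that it cuts only at separators, so its chunks vary in length and the overlap is an upper bound rather than an exact value.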

In [4]:
chunk_size = 1000
chunk_overlap = 200

separators = ["\n\n", "\n", " ", ""]

In [4]:
def join_docs(docs: List[str], separator: str) -> Optional[str]:
    text = separator.join(docs)
    text = text.strip()
    if text == "":
        return None
    else:
        return text

def merge_splits(splits: Iterable[str], separator: str) -> List[str]:
    # We now want to combine these smaller pieces into medium size
    # chunks to send to the LLM.
    separator_len = len(separator)

    docs = []
    current_doc: List[str] = []
    total = 0
    for d in splits:
        _len = len(d)
        if (
            total + _len + (separator_len if len(current_doc) > 0 else 0)
            > chunk_size
        ):
            if total > chunk_size:
                print(
                    f"Created a chunk of size {total}, "
                    f"which is longer than the specified {chunk_size}"
                )
            if len(current_doc) > 0:
                doc = join_docs(current_doc, separator)
                if doc is not None:
                    docs.append(doc)
                # Keep on popping if:
                # - we have a larger chunk than in the chunk overlap
                # - or if we still have any chunks and the length is long
                while total > chunk_overlap or (
                    total + _len + (separator_len if len(current_doc) > 0 else 0)
                    > chunk_size
                    and total > 0
                ):
                    total -= len(current_doc[0]) + (
                        separator_len if len(current_doc) > 1 else 0
                    )
                    current_doc = current_doc[1:]
        current_doc.append(d)
        total += _len + (separator_len if len(current_doc) > 1 else 0)
    doc = join_docs(current_doc, separator)
    if doc is not None:
        docs.append(doc)
    return docs

def split_text(text: str) -> List[str]:
    """Split incoming text and return chunks."""
    final_chunks = []
    # Get appropriate separator to use
    for _s in separators:
        if _s == "":
            separator = _s
            break
        if _s in text:
            separator = _s
            break
    # Now that we have the separator, split the text
    if separator:
        splits = text.split(separator)
    else:
        splits = list(text)
    # Now go merging things, recursively splitting longer texts.
    _good_splits = []
    for s in splits:
        if len(s) < chunk_size:
            _good_splits.append(s)
        else:
            if _good_splits:
                merged_text = merge_splits(_good_splits, separator)
                final_chunks.extend(merged_text)
                _good_splits = []
            other_info = split_text(s)
            final_chunks.extend(other_info)
    if _good_splits:
        merged_text = merge_splits(_good_splits, separator)
        final_chunks.extend(merged_text)
    return final_chunks

In [6]:
documents = []
for document in tqdm(raw_documents):
    text = document["text"]
    metadata = document["metadata"]

    splits = split_text(text)
    for chunk in splits:
        new_doc = {"text":chunk,"metadata":metadata}
        documents.append(new_doc)

100%|████████████████████████████████████████████████████████████████████████| 18951/18951 [00:00<00:00, 42260.75it/s]


# IR

The IR system retrieves the set of documents that contains the answer.

All documents and the query are represented as dense vectors obtained from a _SentenceTransformer_ model. The relevance of a document $d$ to the query $q$ is given by the dot product of their vectors, $d\cdot q$.
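
As a small illustration of ranking by dot product, the sketch below scores three toy document vectors against a toy query with NumPy. The vectors are invented for the example; real embeddings come from the _SentenceTransformer_ model and have many more dimensions.

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up values, for illustration only).
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [2.0, 1.0, 0.0],
])
toy_query = np.array([1.0, 1.0, 0.0])

# One dot product per document: toy_scores[i] = toy_docs[i] . toy_query
toy_scores = toy_docs @ toy_query

# Sort by descending score and keep the two best documents.
toy_top2 = np.argsort(-toy_scores)[:2]
print(toy_scores, toy_top2)
```

Note that the dot product grows with vector length, so a longer vector can outrank a better-aligned one. Dividing by the vector norms would give cosine similarity instead.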

### Embeddings

In [7]:
model_name = "paraphrase-multilingual-MiniLM-L12-v2"

In [8]:
model = SentenceTransformer(model_name)

In [11]:
vectorstore = model.encode([doc["text"] for doc in documents],show_progress_bar=True)

Batches:   0%|          | 0/3030 [00:00<?, ?it/s]

In [12]:
with open("data/vectorstore.pkl", "wb") as f:
    pickle.dump(vectorstore, f)

#### If you have already encoded the documents, you can load the index by running the following cell

In [14]:
with open("data/vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)
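
One way to avoid choosing by hand between the encoding cell and the loading cell is a small load-or-compute helper. This is a sketch only: `load_or_compute` and `fake_encode` are hypothetical names, and the demo uses a temporary file instead of `data/vectorstore.pkl`.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_compute(path: str, compute: Callable[[], Any]) -> Any:
    """Return the object pickled at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Stand-in for the expensive encoding step; records how often it runs.
calls = []


def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "vectorstore.pkl")
    first = load_or_compute(cache_path, fake_encode)   # computes and saves
    second = load_or_compute(cache_path, fake_encode)  # loads from disk

print(first == second, len(calls))
```

In this notebook the `compute` argument would wrap the `model.encode(...)` call. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.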

### Search

In [17]:
class Index:

    def __init__(
        self,
        documents: List[Dict[str,Any]],
        vectorstore: np.ndarray,
        model: SentenceTransformer
    ):
        self.documents = documents
        self.vectorstore = vectorstore
        self.model = model
    
    def search(
        self,
        query: str,
        k: int=4
    ):
        """Retrieves the top k documents from the index, by relevance to the query"""
        query_embedding = self.model.encode(query)
        scores = util.dot_score(query_embedding, self.vectorstore)
        scores = scores.squeeze()

        # Bigger is better
        topk = (-scores).argsort()[:k]

        return [{**self.documents[i],"score":scores[i].item() } for i in topk]

In [18]:
index = Index(documents,vectorstore,model)

# QA

In [21]:
class LLM(ABC):
    
    @abstractmethod
    def completion(self, prompt: str) -> str:
        """Generate text"""

class GPT(LLM):

    def __init__(
        self,
        model: str = "text-davinci-003",
        temperature= 0.0,
        max_tokens=256,
        top_p= 1,
        frequency_penalty = 0,
        presence_penalty= 0,
        n= 1,
        logit_bias: Optional[Dict[str, float]] = None
    ):
        self.params = {
            "model":model, 
            "temperature":temperature, 
            "max_tokens":max_tokens, 
            "top_p":top_p, 
            "frequency_penalty":frequency_penalty, 
            "presence_penalty":presence_penalty, 
            "n":n, 
            # Avoid a shared mutable default: build a fresh dict per instance
            "logit_bias":logit_bias or {},
        }

    def completion(
        self,
        prompt
    ):
        response = openai.Completion.create(
            prompt=prompt,
            **self.params
        )
        return response.choices[0].text

In [22]:
llm = GPT()

# Chat

### Prompts

In [27]:
document_separator = '\n\n'

In [28]:
question_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""

history_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
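
To show how these templates become prompts, the sketch below fills shortened stand-ins with toy values. The `toy_*` names and texts are invented for the example; only the placeholder names (`context`, `question`, `chat_history`) match the templates defined above.

```python
# Shortened stand-ins for the templates above (same placeholders, less text).
toy_question_template = "{context}\n\nQuestion: {question}\nHelpful Answer:"
toy_history_template = (
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\nStandalone question:"
)

# Retrieved chunks are joined with a blank line, like document_separator.
toy_contexts = ["First retrieved chunk.", "Second retrieved chunk."]
toy_prompt = toy_question_template.format(
    context="\n\n".join(toy_contexts),
    question="Is the claim true?",
)

# Follow-up questions are first rewritten using the previous turns.
toy_followup = toy_history_template.format(
    chat_history="Q: Is the claim true?\nA: No.",
    question="Why not?",
)
print(toy_prompt)
print(toy_followup)
```

Both prompts end with an open label (`Helpful Answer:` or `Standalone question:`), so a completion model continues the text with the answer or the rewritten question.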

### Chatbot

In [29]:
class ChatQA:

    def __init__(
        self,
        ir,
        qa
    ):
        self.ir = ir
        self.qa = qa

        self.history = []

    def reset(self):
        self.history = []
        
    def follow_up_query(self,question):
        prompt = history_template.format(chat_history="\n".join(self.history),question=question)
        query = self.qa.completion(prompt)
        print(query)
        return query

    def __call__(
        self,
        question:str
    ):
        if len(self.history):
            query = self.follow_up_query(question)
        else:
            query = question

        documents = self.ir.search(query)

        contexts = [document["text"] for document in documents]
        context = document_separator.join(contexts)
        prompt =  question_template.format(context=context,question=query)

        answer = self.qa.completion(prompt)

        self.history.append("\n".join([prompt,answer]))

        urls = [document["metadata"]["url"] for document in documents]
        urls = list(dict.fromkeys(urls))
        citation = [f'{i+1}. {url}' for i,url in enumerate(urls)]

        return "\n".join([answer,*citation])
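
Several retrieved chunks can come from the same article, so the citation list uses `dict.fromkeys` to drop repeated URLs while keeping their ranking order. A minimal illustration with made-up URLs:

```python
# Made-up URLs: the first and third chunk come from the same article.
toy_urls = [
    "https://a.example/1",
    "https://b.example/2",
    "https://a.example/1",
]

# dict keys are unique and keep insertion order, unlike a set.
unique_urls = list(dict.fromkeys(toy_urls))
toy_citation = [f"{i + 1}. {url}" for i, url in enumerate(unique_urls)]
print("\n".join(toy_citation))
```

This is why some answers in the examples cite fewer than four sources even though `search` returns four chunks by default.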

In [30]:
chat = ChatQA(index,llm)

### Examples (questions in Spanish)

In [31]:
question = 'Reaccionó Piqué al tema de Shakira diciendo que "esto lo he aguantado durante 15 años"?'

In [32]:
print(chat(question))

 No, ese no es el contenido del vídeo original. El vídeo original no tiene relación con la cantante y el audio que se reproduce es de un excompañero de equipo de Piqué.
1. https://www.newtral.es/comunicado-casio-shakira-pique-bulo/20230118/
2. https://observador.pt/factchecks/fact-check-casio-comparou-bateria-dos-seus-relogios-a-relacao-de-shakira-e-pique/
3. https://maldita.es/malditobulo/20230116/pique-reacciona-directo-shakira/


In [39]:
question = 'Es cierto que el zumo de limón en ayunas adelgaza?'

In [44]:
print(chat(question))

 No hay suficiente evidencia científica para respaldar esta afirmación.
1. https://fullfact.org/health/lemons-and-cancer/
2. https://www.thip.media/health-news-fact-check/fact-check-can-lemon-and-baking-soda-act-as-an-effective-teeth-whitener/36654/
3. https://newschecker.in/fact-check/ganesh-chaturthi-overseas/


In [59]:
question = 'Porque fue Nicolás Maduro atacado por un grupo de personas en San Felix (Venezuela) en el 2022?'
print(chat(question))

 El ataque a Nicolás Maduro no ocurrió en el 2022, sino en el 2017. Fue durante la conmemoración de los 200 años de la Batalla de San Félix.
1. https://colombiacheck.com/chequeos/estados-unidos-no-ha-anunciado-que-maduro-ha-sido-declarado-objetivo-en-enero-de-2023
2. https://www.newtral.es/ataque-nicolas-maduro-venezuela/20220830/


In [60]:
question = 'Cual es el político español que más cobra?'
print(chat(question))

 No sé.
1. https://www.newtral.es/votos-socios-investidura-jaume-asens-factcheck/20221104/
2. https://www.polygraph.info/a/fact-check-before-ouster-peru-castillo-misled-about-closing-congress/6869620.html
3. https://www.newtral.es/pedro-sanchez-candidato-partido-socialista-europeo-internacional/20221026/
4. https://colombiacheck.com/chequeos/venezolano-no-cometio-el-primer-atraco-en-islandia-se-trata-de-una-noticia-falsa-de-una


### Examples (questions in English)

In [37]:
question = 'Did Piqué react to the Shakira issue by saying that "I\'ve put up with this for 15 years"?'
print(chat(question))

 No, he did not say that.
1. https://www.newtral.es/comunicado-casio-shakira-pique-bulo/20230118/
2. https://observador.pt/factchecks/fact-check-casio-comparou-bateria-dos-seus-relogios-a-relacao-de-shakira-e-pique/
3. https://www.verificat.cat/fact-check/pique-no-va-posar-la-canco-de-shakira-i-bizarrap-en-un-directe-de-la-kings-league-ni-va-arribar-a-lestadi-escoltant-la-son-muntatges


In [38]:
question = 'Is it true that lemon juice on an empty stomach is slimming?'
print(chat(question))

 Is it true that drinking lemon juice on an empty stomach can help with weight loss?
 No, that's not true. A clinical dietician told Lead Stories that drinking lemon juice on an empty stomach is not recommended and poses a danger to a person's health. A medical doctor specializing in weight loss confirmed this, telling Lead Stories that this juice mix won't cause significant weight loss and that weight loss is best achieved through healthy diet, exercise and maintaining healthy body weight.
1. https://fullfact.org/health/lemons-and-cancer/
2. https://leadstories.com/hoax-alert/2022/10/fact-check-these-fruits-will-not-cause-you-to-lose-15-pounds-in-21-days.html
3. https://leadstories.com/hoax-alert/2022/08/fact-check-losing-20-pounds-in-20-days-and-preventing-diabetes-with-this-juice-is-not-practical.html


In [39]:
question = 'Why was Nicolás Maduro attacked by a group of people in San Felix (Venezuela) in 2022?'
print(chat(question))

 What was the reason for the attack on Nicolás Maduro in San Felix, Venezuela in 2022?
 The attack on Nicolás Maduro in San Felix, Venezuela in 2017 was to commemorate the 200th anniversary of the Battle of San Felix.
1. https://colombiacheck.com/chequeos/estados-unidos-no-ha-anunciado-que-maduro-ha-sido-declarado-objetivo-en-enero-de-2023
2. https://www.newtral.es/ataque-nicolas-maduro-venezuela/20220830/
3. https://colombiacheck.com/chequeos/es-falso-que-petro-haya-advertido-que-invadiria-venezuela-tras-ataque-militares


In [40]:
question = 'Who is the highest paid Spanish politician?'
print(chat(question))

 Who is the highest paid Spanish politician?
 No sé.
1. https://www.newtral.es/votos-socios-investidura-jaume-asens-factcheck/20221104/
2. https://malayalam.indiatoday.in/fact-check/story/fact-check-sonia-gandhi-indeed-worlds-fourth-richest-politician-452005-2022-09-27
3. https://factual.afp.com/doc.afp.com.337C3GQ
4. https://www.newtral.es/presion-fiscal-pp-sanchez-antonio-sanz-factcheck/20220928/


# Demo

In [34]:
def reset_chat():
    chat.reset()
    return ''

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    chat = ChatQA(index,llm)
    msg = gr.Textbox(placeholder="Enter a question and press enter")
    clear = gr.Button("Clear")

    def respond(question, chat_history):
        bot_message = chat(question)
        chat_history.append((question, bot_message))
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(reset_chat, None, chatbot, queue=False)

demo.launch();

Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.
