In [1]:
%load_ext autoreload
%autoreload 2

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juansensio/blog/blob/master/117_langchain/117_langchain.ipynb)

# LangChain 🦜🔗

With the current boom in language models, hundreds of new tools and applications are appearing to harness the power of these neural networks. One of them seems to stand out from the rest: [LangChain](https://docs.langchain.com/docs/). In this post we will see what it is and how to use it.

## What is LangChain?

According to its [documentation](https://docs.langchain.com/docs/), LangChain is a framework for developing applications powered by language models. Its tools make it possible, on the one hand, to connect language models to other data sources (for example your own documents, databases or emails) and, on the other hand, to let these models interact with their environment (for example by sending emails or calling web APIs). LangChain offers Python and JavaScript libraries to ease the development of these applications; this post focuses on the Python library.

## A practical example

We will start with a practical example of using LangChain to answer questions about a document, and then look in detail at the different components and how they work.

> We will use the paper [On the Measure of Intelligence](https://arxiv.org/pdf/1911.01547.pdf) by François Chollet (2019) as our document.

The first thing we need is to install the LangChain library:

```bash
pip install langchain
```

In [2]:
import langchain

langchain.__version__

'0.0.160'

First we need a model. For that we will use [Hugging Face](https://huggingface.co/).

In [3]:
from langchain import HuggingFacePipeline

# llm = HuggingFacePipeline.from_model_id(
#     model_id="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5", 
#     task="text-generation", 
#     model_kwargs={"temperature": 0.9, "max_length": 1024},
#     # model_kwargs={"temperature": 0.9, "max_length": 1024, 'device_map': 'auto'},
#     # device=0
# )

# NOTE: max_length must be large enough to hold the document chunk(s) + the prompt + the system prompt + the generated answer!
llm = HuggingFacePipeline.from_model_id(
    model_id="bigscience/bloom-1b7",
    task="text-generation",
    model_kwargs={"temperature": 0, "max_length": 2048, "device_map": "sequential"},
    device=0,
)
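
The warning in the comment above can be made concrete with a little arithmetic. The sketch below is plain Python and does not call LangChain; the helper `fits_in_context` and the token counts are hypothetical, chosen only to illustrate the budget.

```python
def fits_in_context(chunk_tokens, prompt_tokens, answer_tokens, max_length=2048):
    """Return True if retrieved text, prompt and generated answer fit in max_length tokens."""
    return chunk_tokens + prompt_tokens + answer_tokens <= max_length

# Hypothetical numbers: four retrieved chunks, a 150-token prompt
# and room for a 256-token answer.
print(fits_in_context(chunk_tokens=4 * 400, prompt_tokens=150, answer_tokens=256))  # True: 2006 <= 2048
print(fits_in_context(chunk_tokens=4 * 500, prompt_tokens=150, answer_tokens=256))  # False: 2406 > 2048
```

If the budget is exceeded, the usual options are smaller chunks, fewer retrieved chunks, or a model with a longer context.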

The next step is to build our `prompt`. For that we will use a prompt template.

In [4]:
from langchain import PromptTemplate

# template = """<|prompter|>{question}<|endoftext|><|assistant|>"""

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
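
Conceptually, a prompt template is a string with named placeholders that are filled in before the text reaches the model. The following sketch reproduces that idea with Python's built-in `str.format`; it is an illustration, not LangChain's implementation, and `fill_template` is a hypothetical helper.

```python
template = """Question: {question}

Answer: Let's think step by step."""

def fill_template(template, **variables):
    """Substitute named placeholders such as {question} with the given values."""
    return template.format(**variables)

filled = fill_template(template, question="What is LangChain?")
print(filled)
```

With the `prompt` object defined above, `prompt.format(question=...)` should produce the same kind of filled-in text.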

Once we have our model and prompt, we can create our first `chain`.

In [5]:
from langchain import LLMChain

llm_chain = LLMChain(prompt=prompt, llm=llm)

question = "¿Quién es Juan Sensio?"

print(llm_chain.run(question))



 First, let's look at the first letter of the name. The first letter of Sensio is S. The second letter of Sensio is I. The third letter of Sensio is O. The fourth letter of Sensio is I. The fifth letter of Sensio is O. The sixth letter of Sensio is I. The seventh letter of Sensio is O. The eighth letter of Sensio is I. The ninth letter of Sensio is O. The tenth letter of Sensio is I. The eleventh letter of Sensio is O. The twelfth letter of Sensio is I. The thirteenth letter of Sensio is O. The fourteenth letter of Sensio is I. The fifteenth letter of Sensio is O. The sixteenth letter of Sensio is I. The seventeenth letter of Sensio is O. The eighteenth letter of Sensio is I. The nineteenth letter of Sensio is O. The twentieth letter of Sensio is I. The twenty-first letter of Sensio is O. The twenty-second letter of Sensio is I. The twenty-third letter of Sensio is O. The twenty-fourth letter of Sensio is I. The twenty-fifth letter of Sensio is O. The twenty-sixth letter of Sensio is I
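
A chain of this kind can be thought of as two steps glued together: fill in the prompt, then pass the result to the model. The sketch below shows that composition in plain Python with a stand-in model; `EchoModel` and `SimpleChain` are hypothetical names, not LangChain classes.

```python
class EchoModel:
    """Stand-in for a language model: reports the prompt length instead of generating text."""

    def __call__(self, text):
        return f"(model received {len(text)} characters)"


class SimpleChain:
    """Formats a template with the question, then calls the model on the result."""

    def __init__(self, template, model):
        self.template = template
        self.model = model

    def run(self, question):
        prompt_text = self.template.format(question=question)
        return self.model(prompt_text)


chain = SimpleChain("Question: {question}\n\nAnswer: Let's think step by step.", EchoModel())
print(chain.run("What is LangChain?"))
```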

Now let's try to extract information from our PDF.

In [21]:
from langchain.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/1911.01547.pdf")
# loader = PyPDFLoader('1911.01547.pdf')
# pages = loader.load_and_split()

In [8]:
# raw_text = ''
# for page in pages:
#     raw_text += page.page_content
# raw_text

'On the Measure of Intelligence\nFranc ¸ois Chollet\x03\nGoogle, Inc.\nfchollet@google.com\nNovember 5, 2019\nAbstract\nTo make deliberate progress towards more intelligent and more human-like artiﬁcial\nsystems, we need to be following an appropriate feedback signal: we need to be able to\ndeﬁne and evaluate intelligence in a way that enables comparisons between two systems,\nas well as comparisons with humans. Over the past hundred years, there has been an abun-\ndance of attempts to deﬁne and measure intelligence, across both the ﬁelds of psychology\nand AI. We summarize and critically assess these deﬁnitions and evaluation approaches,\nwhile making apparent the two historical conceptions of intelligence that have implicitly\nguided them. We note that in practice, the contemporary AI community still gravitates to-\nwards benchmarking intelligence by comparing the skill exhibited by AIs and humans at\nspeciﬁc tasks, such as board games and video games. We argue that solely measuring 

In [22]:
document = loader.load()

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


In [24]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=256, 
    chunk_overlap=64,
    # separator="\n",
    # length_function=len
)

# texts = text_splitter.split_text(raw_text)
documents = text_splitter.split_documents(document)

len(documents)

Created a chunk of size 1995, which is longer than the specified 256
Created a chunk of size 544, which is longer than the specified 256
Created a chunk of size 327, which is longer than the specified 256
Created a chunk of size 421, which is longer than the specified 256
Created a chunk of size 387, which is longer than the specified 256
Created a chunk of size 639, which is longer than the specified 256
Created a chunk of size 947, which is longer than the specified 256
Created a chunk of size 633, which is longer than the specified 256
Created a chunk of size 722, which is longer than the specified 256
Created a chunk of size 641, which is longer than the specified 256
Created a chunk of size 574, which is longer than the specified 256
Created a chunk of size 626, which is longer than the specified 256
Created a chunk of size 766, which is longer than the specified 256
Created a chunk of size 476, which is longer than the specified 256
Created a chunk of size 692, which is longer th

505
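
The warnings above appear because a character splitter of this kind first cuts the text at a separator (a blank line, `"\n\n"`, is the default in `CharacterTextSplitter`) and only then packs the pieces into chunks. A piece with no separator inside it cannot be cut further, so it is emitted whole even when it exceeds `chunk_size`. The sketch below is a simplified, plain-Python illustration of that behaviour, not the library's code; `split_on_separator` is a hypothetical helper and chunk overlap is omitted.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole, which is why
    oversized chunks can appear. Chunk overlap is left out for brevity.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\nshort two\n\n" + "x" * 40 + "\n\nshort three"
for chunk in split_on_separator(text, chunk_size=25):
    print(len(chunk), repr(chunk[:20]))
```

Here the 40-character run of `x` has no blank line inside it, so it survives as one chunk larger than the requested 25 characters. Setting a finer `separator` (such as the commented-out `"\n"`) is one way to get closer to the requested size.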

In [25]:
documents[0].page_content

'9 1 0 2\n\nv o N 5 2\n\n] I\n\nA . s c [\n\n2 v 7 4 5 1 0 . 1 1 9 1 : v i X r a\n\nOn the Measure of Intelligence\n\nFranc¸ois Chollet ∗ Google, Inc. fchollet@google.com\n\nNovember 5, 2019\n\nAbstract'

In [27]:
documents[1].page_content

'To make deliberate progress towards more intelligent and more human-like artiﬁcial systems, we need to be following an appropriate feedback signal: we need to be able to deﬁne and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abun- dance of attempts to deﬁne and measure intelligence, across both the ﬁelds of psychology and AI. We summarize and critically assess these deﬁnitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates to- wards benchmarking intelligence by comparing the skill exhibited by AIs and humans at speciﬁc tasks, such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experie

Next, we convert each chunk into an embedding.

In [29]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

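# embed_documents expects a list of strings, so pass the text of each Document
query_result = embeddings.embed_documents([doc.page_content for doc in documents[:1]])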
# query_result = embeddings.embed_query(texts[0])

# doc_result = embeddings.embed_documents([text])

query_result

In [30]:
from langchain.vectorstores import Chroma

# vectorstore = Chroma.from_texts(texts, embeddings)
vectorstore = Chroma.from_documents(documents, embeddings)

Using embedded DuckDB without persistence: data will be transient
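
A vector store keeps one embedding per chunk and, at query time, returns the chunks whose vectors are closest to the embedding of the question. The sketch below illustrates that lookup in plain Python. Cosine similarity is used as one common choice (the metric Chroma actually applies may differ), and the three-dimensional vectors, the texts and the `top_k` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(stored, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical 3-dimensional embeddings; real models produce hundreds of dimensions.
stored = [
    ("definition of intelligence", [0.9, 0.1, 0.0]),
    ("author and affiliation", [0.0, 0.2, 0.9]),
    ("benchmarks and board games", [0.5, 0.8, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], stored, k=2))
```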


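A memory object could keep the chat history for us. It is left commented out here, and the history is passed to the chain by hand instead.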

In [16]:
# from langchain.memory import ConversationBufferMemory

# memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [34]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)

In [35]:
chat_history = []
query = "Who is the author of the paper?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'\n\nMarcus Hutter, Shane Legg, and Marcus Hutter. Universal intelligence: A deﬁnition of machine intelligence. 2007.\n\nA:\n\nThe paper is by Marcus Hutter, Shane Legg, and Marcus Hutter.'

In [36]:
# chat_history = [(query, result["answer"])]
chat_history = []
query = "What is the definition of intelligence?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'\n\nIntelligence is the efﬁciency with which a learning system turns experience and priors into skill at previously unknown tasks.\n\nA:\n\nThe answer is that intelligence is the ability to learn new things. This is a very broad term, and it is not limited to the ability to learn new things. It is also not limited to the ability to learn new things in a particular domain. It is also not limited to the ability to learn new things in a particular way. It is also not limited to the ability to learn new things in a particular way in a particular domain. It is also not limited to the ability to learn new things in a particular way in a particular domain in a particular way. It is also not limited to the ability to learn new things in a particular way in a particular domain in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way in a particular way 

In [38]:
result['source_documents'][1].page_content

'56\n\nIntelligence is the efﬁciency with which a learning system turns experience and priors\n\ninto skill at previously unknown tasks.\n\nAs such, a measure of intelligence must account for priors, experience, and general-\n\nization difﬁculty.'
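
In the two calls above the history is reset to an empty list each time, so the chain treats every question as the first one. To carry context across turns, each (question, answer) pair would be appended to `chat_history`, as the commented-out line hints. A minimal sketch of that loop follows, with a hypothetical `fake_qa` standing in for the real chain so that it runs without a model:

```python
def fake_qa(inputs):
    """Stand-in for the `qa` chain: reports how many previous turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} given {turns} previous turn(s)"}


chat_history = []
for query in ["Who is the author of the paper?", "What is the definition of intelligence?"]:
    result = fake_qa({"question": query, "chat_history": chat_history})
    # Append the finished turn so the next question can refer back to it.
    chat_history.append((query, result["answer"]))

print(len(chat_history))
```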

## Components

- [Data loading](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html)

Example: multiple chains for Auto-GPT