# LangChain 101: Ruby on Rails for Generative AI

[LangChain](https://langchain.com/) is a popular library for building chatbots and other LLM applications that efficiently connects many tools. It is known as an easy way to produce chatbots, but it is much more: an **onramp** to generative AI. LangChain is to generative artificial intelligence what [Ruby on Rails](https://en.wikipedia.org/wiki/Ruby_on_Rails), which powered Web 2.0, was to web development. It has enabled the rapid adoption of LLM technology across the web, the startup ecosystem and the modern enterprise (big companies!).

#### Note: Credit to Ivan Reznikov

This content is indebted to [Ivan Reznikov](https://linkedin.com/in/reznikovivan), who created the excellent [LangChain 101 Course](https://pub.towardsai.net/langchain-101-part-1-building-simple-q-a-app-90d9c4e815f3).

## Question & Answer (Q&A) using Retrieval Augmented Generation (RAG) with LangChain

The most popular technique using large language models and LangChain is Retrieval Augmented Generation (RAG). Documents are indexed using a technique called _vector search_. The segments most relevant to a question are then retrieved, placed before the question in a request, and submitted to an LLM from a provider like [OpenAI](https://platform.openai.com/docs/introduction). **Talk is cheap**. We're going to start out by building a RAG engine using [ChromaDB](https://www.trychroma.com/), which is easy to get started with.
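
To make the retrieve-then-prompt flow concrete, here is a minimal sketch in plain Python. It is not how LangChain or Chroma work internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the segments, function names and prompt layout are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, segments: list[str], k: int = 2) -> list[str]:
    """Return the k segments most similar to the question."""
    q = embed(question)
    ranked = sorted(segments, key=lambda s: cosine(q, embed(s)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved segments before the question, as RAG does."""
    joined = "\n\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Hypothetical document segments, made up for this example.
segments = [
    "A network motif is a subgraph pattern that recurs in a network.",
    "Rails is a web framework written in Ruby.",
    "A temporal network motif also orders its edges in time.",
]
question = "What is a network motif?"
top_segments = retrieve(question, segments, k=2)
prompt = build_prompt(question, top_segments)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM. A real system swaps `embed` for a learned embedding model and the sorted list for a vector index.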

#### Note: Because the simple, local version of Chroma can be more difficult to scale, we will later use OpenSearch via Docker to work with larger sets of documents.

In [2]:
import logging
import os
from typing import Any, Dict, List, Optional, Type

import chromadb
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.embeddings import CacheBackedEmbeddings, OpenAIEmbeddings
from langchain.embeddings.base import Embeddings
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory
from langchain.schema import Document
from langchain.storage import LocalFileStore
from langchain.vectorstores import Chroma

### Q&A [nerding out] about Network Motifs

I am obsessed with _network motifs_: statistically significant patterns, called _graphlets_, that appear in the _graphs_ underlying _complex networks_ (a graph in the real world is called a _network_). The figures below show simple motifs and a more complex heterogeneous, temporal network motif.

<br />
<center>
    <img src="../images/5-graphlets.png" width="600px" />
</center>
<br />
<center>
    <a href="https://www.nature.com/articles/s41598-020-69795-1">Exploiting graphlet decomposition to explain the structure of complex networks: the GHuST framework</a>, Espejo et al., 2020
</center>
<br /><br />

<center><img src="../images/temporal-motifs.png" width="400px" /></center>
<br />
<center><a href="https://snap.stanford.edu/temporal-motifs/">Motifs in temporal networks</a>, Ashwin Paranjape, Austin R. Benson, and Jure Leskovec, 2017</center>
<br />
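
As a rough illustration of what "statistically significant pattern" means, the sketch below counts one simple motif (the triangle) in a small graph and compares it with the average count in random graphs that have the same number of nodes and edges. The graph and the function names are hypothetical. The uniform random null model is a simplifying assumption, since motif studies typically use null models that also preserve node degrees.

```python
import random
from itertools import combinations


def count_triangles(edges: set[frozenset]) -> int:
    """Count 3-node cliques (triangles) in an undirected graph."""
    nodes = sorted({node for edge in edges for node in edge})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edges
    )


def mean_random_triangles(
    nodes: list[int], n_edges: int, samples: int = 200, seed: int = 0
) -> float:
    """Average triangle count over random graphs with the same size."""
    rng = random.Random(seed)
    possible = [frozenset(pair) for pair in combinations(nodes, 2)]
    counts = [
        count_triangles(set(rng.sample(possible, n_edges)))
        for _ in range(samples)
    ]
    return sum(counts) / samples


# A hypothetical 5-node graph: two triangles sharing the edge (2, 3).
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]}
observed = count_triangles(edges)
expected = mean_random_triangles([1, 2, 3, 4, 5], n_edges=len(edges))
print(f"observed triangles: {observed}, random-graph average: {expected:.2f}")
```

A pattern counts as a motif only when the observed count is well above what the null model produces. In a graph this small the difference is not meaningful, so the example shows the procedure rather than a real finding.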

If you run the [`./download.sh`](download.sh) script in a terminal window, it will download a tarball of 25 network motif papers and extract them into the `data/Network_Motifs/` folder. We will use ChromaDB to implement RAG over these papers before getting into a deeper explanation of what is going on and what is available in LangChain.

#### Load a copy of one of my Dropbox folders with academic papers

In [5]:
PAPER_FOLDER = f"{os.getcwd()}/../data/Network_Motifs/"

#### Verify papers directory

In [6]:
paper_count = len(os.listdir(PAPER_FOLDER))
print(f"You have {paper_count:,} Network Motif PDFs in `{PAPER_FOLDER}`.")

You have 25 Network Motif PDFs in `/home/jovyan/work/Part 1. Langchain/../data/Network_Motifs/`.


#### Load our OpenAI key

In [7]:
# Set in env/openai.env
openai_api_key = os.environ.get("OPENAI_API_KEY")
if not openai_api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

#### Load all PDFs from academic paper folder

In [8]:
loader = PyPDFDirectoryLoader(PAPER_FOLDER, silent_errors=True)
docs = loader.load()
print(f"You have {len(docs)} document segments in `{PAPER_FOLDER}`.")

You have 661 document segments in `/home/jovyan/work/Part 1. Langchain/../data/Network_Motifs/`.


#### How many papers on network motifs?

In [9]:
motif_docs = [doc for doc in docs if "motif" in doc.page_content]
motif_doc_count = len(motif_docs)
paper_count = len(set(doc.metadata["source"] for doc in motif_docs))
print(
    f"You have {paper_count} papers mentioning network motifs split across {motif_doc_count} document segments in `{PAPER_FOLDER}`."
)

You have 20 papers mentioning network motifs split across 321 document segments in `/home/jovyan/work/Part 1. Langchain/../data/Network_Motifs/`.


#### Embed them with the OpenAI ada model and cache the embeddings locally

In [10]:
embeddings = OpenAIEmbeddings()
fs = LocalFileStore("./data/embedding_cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    embeddings, fs, namespace=embeddings.model
)
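
The idea behind a cache-backed embedder can be sketched without LangChain. The class below is a toy, not LangChain's implementation: it keys each text by a hash, prefixed with a namespace so that vectors from different models do not collide, and calls the embedding function only on a cache miss. `CachedEmbedder`, `fake_embed` and the in-memory `dict` store are all assumptions for illustration. The real cell above stores bytes on disk through `LocalFileStore`.

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Toy cache in front of an embedding function (a dict, not a file store)."""

    def __init__(self, embed_fn: Callable[[str], list[float]], namespace: str) -> None:
        self.embed_fn = embed_fn
        self.namespace = namespace
        self.store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying function was called

    def _key(self, text: str) -> str:
        """Build a cache key from the namespace and a hash of the text."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Embed each text, reusing cached vectors where available."""
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:
                self.misses += 1
                self.store[key] = self.embed_fn(text)
            vectors.append(self.store[key])
        return vectors


def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a paid embedding API call."""
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(fake_embed, namespace="fake-model")
embedder.embed_documents(["network motif", "graphlet"])
embedder.embed_documents(["network motif", "temporal motif"])
print(embedder.misses)  # 3: "network motif" was served from the cache
```

With a paid embedding API, this is what keeps re-running the notebook from re-embedding all 661 segments each time.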

#### Load our document set into Chroma

In [28]:
db = Chroma.from_documents(docs, cached_embedder)

In [31]:
# Run a similarity search; keep the hits separate from the loaded `docs`
query = "What is a network motif?"
results = db.similarity_search(query)
results[0]

Document(page_content='8How do we findmodules of network motifs?', metadata={'page': 7, 'source': '/home/jovyan/work/Part 1. Langchain/../data/Network_Motifs/Higher-Order Organization of Complex Networks - Slides from 2016.pdf'})

In [None]:
# Here we use Chroma - we will show you OpenSearch afterwards.
# Index only the motif-related segments and persist the index under `data/`.
vectordb = Chroma.from_documents(
    motif_docs, embedding=cached_embedder, persist_directory="data"
)
vectordb.persist()

# What is LangChain?

#### Note: the following LangChain introduction is originally by [Ivan Reznikov](https://linkedin.com/in/rez) in [LangChain 101: Part 1. Building Simple Q&A App](https://pub.towardsai.net/langchain-101-part-1-building-simple-q-a-app-90d9c4e815f3).

Today, we will discuss the following topics:

* What exactly is LangChain?
* LangChain’s fundamental concepts and components
* How to build a basic LangChain application

"Lang" stands for language, which is the primary focus of LangChain. "Chain", with its connotation of connecting things, refers to the chain component used in LangChain. Chains are sequences of instructions that the framework executes to perform a task. This simplifies the use of Large Language Models (LLMs) for specific tasks and enables you to combine the power of LLMs with other programming techniques.
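
The notion of a chain as a sequence of steps can be sketched in a few lines of plain Python. This is not LangChain's API: `make_chain`, `fill_prompt`, `fake_llm` and `clean_output` are hypothetical names, and the "model" simply echoes its prompt so the example runs without an API key.

```python
from typing import Callable

Step = Callable[[str], str]


def make_chain(*steps: Step) -> Step:
    """Compose steps so each one receives the previous step's output."""

    def run(value: str) -> str:
        for step in steps:
            value = step(value)
        return value

    return run


def fill_prompt(question: str) -> str:
    """Wrap the user's question in a prompt template."""
    return f"Answer in one sentence: {question}"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it just echoes the prompt."""
    return f"  [model output for: {prompt}]  "


def clean_output(text: str) -> str:
    """Post-process the raw model output."""
    return text.strip()


qa_chain = make_chain(fill_prompt, fake_llm, clean_output)
answer = qa_chain("What is a network motif?")
print(answer)
```

Replacing `fake_llm` with a real model call, or adding a retrieval step in front of `fill_prompt`, changes what the chain does without changing how it is assembled.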

I’ve been asked how LangChain differs from ChatGPT or an LLM. The following table highlights the differences:

<pre><code>
+==========+========================+====================+====================+
|          | LangChain              | LLM                | ChatGPT            | 
+==========+========================+====================+====================+
| Type     | Framework              | Model              | Model              | 
+----------+------------------------+--------------------+--------------------+
| Purpose  | Build applications     | Generate text      | Generate chat      | 
|          | with LLMs              |                    | conversations      | 
+----------+------------------------+--------------------+--------------------+
| Features | Chains, prompts, LLMs, | Large dataset of   | Large dataset of   | 
|          | memory, index, agents  | text and code      | chat conversations | 
+----------+------------------------+--------------------+--------------------+
| Pros     | Can combine LLMs with  | Generates nearly   | Generates realistic| 
|          | programming techniques | human-quality text | chat conversations | 
+----------+------------------------+--------------------+--------------------+
| Cons     | Requires some          | Not as easy to use | Not as versatile   | 
|          | programming knowledge  | for specific tasks | as LangChain       | 
+----------+------------------------+--------------------+--------------------+
</code></pre>

## OpenAI

https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo