# Step-by-Step Guide to Building a Chatbot with LangChain

Running this notebook cost about $0.02 using the OpenAI API.

In [1]:
#pip install beautifulsoup4

## 1. Setup & Imports

In [2]:
# Basics
import os
import pandas as pd

from collections import namedtuple
import logging

from getpass import getpass

# Scraping
import re
from bs4 import BeautifulSoup


# Langchain imports
from langchain.chat_models import ChatOpenAI
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain, ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import SVMRetriever, MultiQueryRetriever
from langchain.memory import ConversationBufferMemory
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma


In [3]:
# Prompt for the key instead of hardcoding it in the notebook
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')

## 2. Data Loading

In [37]:
# For web scraping
#loader = WebBaseLoader("https://docs.your-documentation-link.com/")
#data = loader.load()

### Get data from docs
In this example, I chose to work with the LangChain [documentation](https://github.com/langchain-ai/langchain/tree/master/docs).

In [5]:
def retrieve_mdx_files_content(directory_path):
    """
    Retrieve the content of .mdx files from a specified directory and its subdirectories.
    
    Args:
    - directory_path (str): The path to the directory to search for .mdx files.
    
    Returns:
    - pd.DataFrame: A pandas DataFrame containing the filename, filepath, and content of each .mdx file.
    """
    data = {
        "Filename": [],
        "Filepath": [],
        "Content": []
    }
    
    # Walking through the directory
    for foldername, subfolders, filenames in os.walk(directory_path):
        for filename in filenames:
            if filename.endswith(".mdx"):
                filepath = os.path.join(foldername, filename)
                with open(filepath, 'r', encoding='utf-8') as file:
                    content = file.read()
                    data["Filename"].append(filename)
                    data["Filepath"].append(filepath)
                    data["Content"].append(content)
    
    return pd.DataFrame(data)
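
To check the directory walk without a local checkout of the docs, the same logic can be tried on a throwaway folder. This is a minimal sketch using only the standard library: `collect_mdx_records` is a hypothetical helper (not part of the notebook) that mirrors `retrieve_mdx_files_content` with `pathlib`, and the file names are made up for the test.

```python
import tempfile
from pathlib import Path


def collect_mdx_records(directory_path):
    """Hypothetical pathlib variant: one dict per .mdx file, searched recursively."""
    records = []
    for path in Path(directory_path).rglob("*.mdx"):
        records.append({
            "Filename": path.name,
            "Filepath": str(path),
            "Content": path.read_text(encoding="utf-8"),
        })
    return records


# Build a tiny fake docs tree: two .mdx files (one nested) and one file to ignore
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / "intro.mdx").write_text("# Intro\n", encoding="utf-8")
    (root / "guides" / "setup.mdx").write_text("# Setup\n", encoding="utf-8")
    (root / "notes.txt").write_text("ignored", encoding="utf-8")
    records = collect_mdx_records(root)

found = sorted(r["Filename"] for r in records)
print(found)
```

Passing `records` to `pd.DataFrame` should give the same three columns as the function above.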


In [6]:
# Retrieve .mdx file contents from the specified directory
directory_path = 'C:\\Users\\Nathan.destrez\\Documents\\GitHub\\virtual-assistant\\Chat_bot_proto\\docs'  # Replace this with your directory path
mdx_dataframe = retrieve_mdx_files_content(directory_path)


In [7]:
mdx_dataframe

Unnamed: 0,Filename,Filepath,Content
0,installation.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,"# Installation\n\nimport Installation from ""@s..."
1,introduction.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 0\n---\n\n# Introductio...
2,quickstart.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Quickstart\n\n## Installation\n\nTo install ...
3,index.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 6\n---\n\nimport DocCar...
4,index.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 3 \n---\n# Comparison E...
...,...,...,...
88,analyze_document.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Analyze Document\n\nThe AnalyzeDocumentChain...
89,chat_vector_db.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 2\n---\n\n# Store and r...
90,multi_retrieval_qa_router.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Dynamically select from multiple retrievers\...
91,question_answering.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# QA over in-memory documents\n\nHere we walk ...


In [8]:
mdx_dataframe.iloc[1]['Content']

'---\nsidebar_position: 0\n---\n\n# Introduction\n\n**LangChain** is a framework for developing applications powered by language models. It enables applications that are:\n- **Data-aware**: connect a language model to other sources of data\n- **Agentic**: allow a language model to interact with its environment\n\nThe main value props of LangChain are:\n1. **Components**: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not\n2. **Off-the-shelf chains**: a structured assembly of components for accomplishing specific higher-level tasks\n\nOff-the-shelf chains make it easy to get started. For more complex applications and nuanced use-cases, components make it easy to customize existing chains or build new ones.\n\n## Get started\n\n[Here’s](/docs/get_started/installation.html) how to install LangChain, set up your environment, a

In [9]:
mdx_dataframe.iloc[87]

Filename                                              api.mdx
Filepath    C:\Users\Nathan.destrez\Documents\GitHub\virtu...
Content     ---\nsidebar_position: 0\n---\n# API chains\nA...
Name: 87, dtype: object

## 3. Data cleaning

In [10]:
def clean_text(text):
    """
    Cleans the provided text by:
    - Removing HTML tags and content
    - Removing Markdown-specific syntax (this also drops the text inside bold, italic, link, and heading markup)
    - Decoding common HTML entities (&amp;, &lt;, &gt;)
    - Removing URLs
    - Removing extra white spaces
    
    Args:
    - text (str): The input string to be cleaned.
    
    Returns:
    - str: The cleaned string.
    """
    
    # 1. Remove HTML tags using BeautifulSoup
    soup = BeautifulSoup(text, "html.parser")
    no_html = soup.get_text(separator=' ')
    
    # 2. Remove Markdown Syntax
    # Common markdown syntax: **bold**, *italic*, # Headings, ![alt_text](url), [text](url)
    no_markdown = re.sub(r'\!\[.*?\]\(.*?\)|\[(.*?)\]\(.*?\)|\*\*.*?\*\*|\*.*?\*|#[^\n]*', '', no_html)
    
    # 3. Decode common HTML entities (can be expanded further)
    no_unicode = re.sub(r'&amp;', '&', no_markdown)
    no_unicode = re.sub(r'&lt;', '<', no_unicode)
    no_unicode = re.sub(r'&gt;', '>', no_unicode)
    
    # 4. Remove URLs
    no_urls = re.sub(r'http[s]?://\S+', '', no_unicode)
    
    # 5. Remove extra white spaces
    clean_string = ' '.join(no_urls.split())
    
    return clean_string


In [11]:
# Apply the cleaning function to the entire 'Content' column
mdx_dataframe['Cleaned_Content'] = mdx_dataframe['Content'].apply(clean_text)

In [12]:
mdx_dataframe

Unnamed: 0,Filename,Filepath,Content,Cleaned_Content
0,installation.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,"# Installation\n\nimport Installation from ""@s...","import Installation from ""@snippets/get_starte..."
1,introduction.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 0\n---\n\n# Introductio...,--- sidebar_position: 0 --- is a framework for...
2,quickstart.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Quickstart\n\n## Installation\n\nTo install ...,To install LangChain run: import Tabs from '@t...
3,index.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 6\n---\n\nimport DocCar...,--- sidebar_position: 6 --- import DocCardList...
4,index.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 3 \n---\n# Comparison E...,--- sidebar_position: 3 --- Comparison evaluat...
...,...,...,...,...
88,analyze_document.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Analyze Document\n\nThe AnalyzeDocumentChain...,The AnalyzeDocumentChain can be used as an end...
89,chat_vector_db.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,---\nsidebar_position: 2\n---\n\n# Store and r...,--- sidebar_position: 2 --- The Conversational...
90,multi_retrieval_qa_router.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# Dynamically select from multiple retrievers\...,This notebook demonstrates how to use the `Rou...
91,question_answering.mdx,C:\Users\Nathan.destrez\Documents\GitHub\virtu...,# QA over in-memory documents\n\nHere we walk ...,Here we walk through how to use LangChain for ...


In [13]:
mdx_dataframe.iloc[1]["Cleaned_Content"]

'--- sidebar_position: 0 --- is a framework for developing applications powered by language models. It enables applications that are: - : connect a language model to other sources of data - : allow a language model to interact with its environment The main value props of LangChain are: 1. : abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not 2. : a structured assembly of components for accomplishing specific higher-level tasks Off-the-shelf chains make it easy to get started. For more complex applications and nuanced use-cases, components make it easy to customize existing chains or build new ones. how to install LangChain, set up your environment, and start building. We recommend following our guide to familiarize yourself with the framework by building your first LangChain application. _: These docs are for the LangChain 
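
The cleaned text above has lost the bolded terms (for example, the sentence now starts with "is a framework") and still carries the `--- sidebar_position: 0 ---` front matter, because the regex in `clean_text` deletes whole matches rather than just the markup. A possible alternative, sketched below, keeps the inner text of bold, italic, link, and heading markup and strips leading front matter. `strip_markdown_keep_text` is a hypothetical helper, not the notebook's function, and it has known limits: `*` bullet markers and `#` comments inside code snippets would also be touched.

```python
import re


def strip_markdown_keep_text(text):
    """Hypothetical variant of the Markdown step that preserves the marked-up text."""
    # Drop a YAML front matter block at the very start of the file
    text = re.sub(r'\A---\n.*?\n---\n', '', text, flags=re.DOTALL)
    # Images carry no readable text: remove them entirely
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)
    # Links: keep the link text, drop the target
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # Bold first, then italic: keep the inner text
    text = re.sub(r'\*\*(.+?)\*\*', r'\1', text)
    text = re.sub(r'\*(.+?)\*', r'\1', text)
    # Headings: drop the leading '#' markers but keep the title
    text = re.sub(r'^#+\s*', '', text, flags=re.MULTILINE)
    # Collapse whitespace, as clean_text does
    return ' '.join(text.split())


sample = (
    "---\nsidebar_position: 0\n---\n\n# Introduction\n\n"
    "**LangChain** is a framework. See [the guide](/docs/x.html)."
)
cleaned = strip_markdown_keep_text(sample)
print(cleaned)
```

Keeping these words matters for retrieval, since the embedded text is what the vector store searches over.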

In [14]:
mdx_dataframe.iloc[2]["Cleaned_Content"]  # check whether code snippets are still present

'To install LangChain run: import Tabs from \'@theme/Tabs\'; import TabItem from \'@theme/TabItem\'; import Install from "@snippets/get_started/quickstart/installation.mdx" For more details, see our . Using LangChain will usually require integrations with one or more model providers, data stores, APIs, etc. For this example, we\'ll use OpenAI\'s model APIs. import OpenAISetup from "@snippets/get_started/quickstart/openai_setup.mdx" Now we can start building our language model application. LangChain provides many modules that can be used to build language model applications. Modules can be used as stand-alones in simple applications and they can be combined for more complex use cases. The core building block of LangChain applications is the LLMChain. This combines three things: - LLM: The language model is the core reasoning engine here. In order to work with LangChain, you need to understand the different types of language models and how to work with them. - Prompt Templates: This prov

In [15]:
# Clean the Filename column
def remove_mdx_extension(df):
    df['Filename'] = df['Filename'].str.replace('.mdx', '', regex=False)
    return df


In [16]:
mdx_dataframe = remove_mdx_extension(mdx_dataframe)

### Turn into a list

In [17]:
all_txt = [f"{row['Filename']} {row['Cleaned_Content']}" for index, row in mdx_dataframe.iterrows()]

In [18]:
len(all_txt)  # should equal the number of rows in the DataFrame

93

In [19]:
all_txt[1]

'introduction --- sidebar_position: 0 --- is a framework for developing applications powered by language models. It enables applications that are: - : connect a language model to other sources of data - : allow a language model to interact with its environment The main value props of LangChain are: 1. : abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not 2. : a structured assembly of components for accomplishing specific higher-level tasks Off-the-shelf chains make it easy to get started. For more complex applications and nuanced use-cases, components make it easy to customize existing chains or build new ones. how to install LangChain, set up your environment, and start building. We recommend following our guide to familiarize yourself with the framework by building your first LangChain application. _: These docs are for t

### Format text

In [25]:
Document = namedtuple("Document", ["page_content", "metadata", "ids"])

all_dtxt = [
    Document(
        page_content=txt, 
        metadata={'title': txt.split(' ')[0]},  # Use the first word (the filename) as the title
        ids=f"v{i+1}"
    ) 
    for i, txt in enumerate(all_txt)
]

In [27]:
all_dtxt[0]

Document(page_content='installation import Installation from "@snippets/get_started/installation.mdx"', metadata={'title': 'installation'}, ids='v1')

## 4. Data Storing & Vectorizing

In [28]:
vectorstore = Chroma.from_documents(documents=all_dtxt, embedding=OpenAIEmbeddings())

In [29]:
# Get the first stored entry
vectors = vectorstore.get(limit=1)
vectors

{'ids': ['4ef7fe1e-35f2-11ee-ad0c-c43d1a21bde3'],
 'embeddings': None,
 'metadatas': [{'title': 'installation'}],
 'documents': ['installation import Installation from "@snippets/get_started/installation.mdx"']}
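
The id returned above is a generated UUID rather than `v1`, which suggests the `ids` field on the namedtuple is not picked up when the store is built. One way to supply custom ids, assuming the installed LangChain version's `Chroma.from_documents` accepts an `ids` keyword (check this against your version), is to collect them into a parallel list. The sketch below only builds and validates that list; the store call is left as a comment because it needs an API key, and the two documents are shortened placeholders.

```python
from collections import namedtuple

Document = namedtuple("Document", ["page_content", "metadata", "ids"])

# Shortened placeholder documents in the same shape as all_dtxt
docs = [
    Document(page_content="installation ...", metadata={"title": "installation"}, ids="v1"),
    Document(page_content="introduction ...", metadata={"title": "introduction"}, ids="v2"),
]

# One id per document, in the same order as the documents
custom_ids = [doc.ids for doc in docs]

# Ids must be unique, otherwise entries would collide in the store
if len(custom_ids) != len(set(custom_ids)):
    raise ValueError("duplicate ids found")

# Assumed call, not executed here:
# vectorstore = Chroma.from_documents(
#     documents=docs, embedding=OpenAIEmbeddings(), ids=custom_ids
# )
print(custom_ids)
```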

## 5. Information Retrieval

Define a custom retriever (here, an SVM-based one), or use the retriever provided by the vector store.

In [31]:
svm_retriever = SVMRetriever.from_documents(all_dtxt, OpenAIEmbeddings())
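
Conceptually, a retriever ranks stored document vectors against the query vector and returns the closest ones. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. It is not how `SVMRetriever` or Chroma work internally (the actual metric depends on the retriever and store configuration), and the vectors are invented for illustration rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the k document names whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy vectors standing in for document embeddings (hypothetical values)
toy_vectors = {
    "installation": [0.9, 0.1, 0.0],
    "memory": [0.1, 0.9, 0.2],
    "agents": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # toy vector standing in for an embedded question

results = top_k(query, toy_vectors, k=2)
print(results)
```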

## 6. Answer Generation

In [38]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever(), chain_type="stuff")


## 7. Customizations (Optional)
- Prompt Customization: Customize the prompt for your chatbot if needed.
- Return Source Documents: If you want to return the full set of retrieved documents for answer generation, set `return_source_documents=True` in the `RetrievalQA` chain.
- Return Citations: For answer citations, use `RetrievalQAWithSourcesChain`.
- Customizing Retrieved Document Processing: Choose different methods to pass retrieved documents to the LLM (such as the `stuff` chain type).
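
To make the `stuff` option concrete: it places all retrieved documents into a single prompt alongside the question. The sketch below shows that idea in plain Python. `build_stuff_prompt` and the prompt wording are hypothetical and only illustrate the shape of the input; LangChain's own template differs.

```python
def build_stuff_prompt(question, documents):
    """Hypothetical helper: put every retrieved document into one prompt."""
    # Join the retrieved texts into one context block, separated by blank lines
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder texts standing in for retrieved documents
retrieved = [
    "LangChain is a framework for developing applications powered by language models.",
    "Off-the-shelf chains make it easy to get started.",
]
prompt = build_stuff_prompt("What is LangChain?", retrieved)
print(prompt)
```

Because every document lands in one prompt, this approach is simple but is bounded by the model's context window, which is one reason to consider other methods for large result sets.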

## 8. Implementing Conversations

In [33]:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
retriever = vectorstore.as_retriever()
chat = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)
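
For intuition, a conversational retrieval chain keeps the earlier turns and uses them to rewrite a follow-up into a standalone question before retrieving. The sketch below mimics that flow without calling any model. `SimpleBufferMemory`, `build_condense_prompt`, and the prompt wording are hypothetical stand-ins, not LangChain classes or templates.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in that stores every turn verbatim."""

    def __init__(self):
        self.turns = []  # list of (question, answer) pairs

    def save_turn(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        # Render the history as alternating Human/Assistant lines
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory, follow_up):
    """Build the text a model would receive to rewrite the follow-up question."""
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"{memory.as_text()}\n\n"
        f"Follow-up question: {follow_up}\nStandalone question:"
    )


memory_sketch = SimpleBufferMemory()
memory_sketch.save_turn("What is langchain", "LangChain is a framework for LLM apps.")
condense_prompt = build_condense_prompt(memory_sketch, "Does it support memory?")
print(condense_prompt)
```

A buffer like this grows with every turn, so long conversations eventually need truncation or summarization.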


## 9. Use the Chatbot

In [34]:
result = chat({"question": "What is langchain"})
print(result['answer'])


LangChain is a framework for developing applications powered by language models. It provides standard, extendable interfaces and external integrations for interfacing with language models and application-specific data. LangChain allows you to construct sequences of calls and choose which tools to use based on high-level directives. It also enables you to persist application state between runs of a chain and log and stream intermediate steps of any chain. LangChain offers off-the-shelf chains for easy application development and customization options for more complex use cases. It is part of a rich ecosystem of tools and has a supportive community.


In [35]:
result = chat({"question": "Can I define custom ids in chroma DB"})
print(result['answer'])

Yes, custom IDs can be defined in Chroma DB. Chroma DB allows you to assign custom IDs to documents or data points in your database. These custom IDs can be used to uniquely identify and retrieve specific documents or data points when querying the database.


In [36]:
result = chat({"question": "Tell me interesting facts about Langchain framework that might help me developping my own chatbot on custom data"})
print(result['answer'])

Here are some interesting facts about the LangChain framework that might help you develop your own chatbot on custom data:

1. Memory System: LangChain provides utilities for adding memory to a conversational system. The memory system supports reading and writing actions, allowing the chatbot to refer to past interactions and store information for future runs.

2. Data Structures and Algorithms: LangChain offers various data structures and algorithms for working with memory types. These structures and algorithms help in organizing and querying chat messages, allowing you to retrieve relevant information efficiently.

3. Integration with Model Providers: LangChain allows you to integrate with different model providers, such as OpenAI's model APIs. This integration enables you to leverage pre-trained language models for your chatbot.

4. LLMs and Chat Models: LangChain supports two types of models - LLMs (Language Model) and Chat Models. LLMs are text completion models that take a string

In [39]:
result = chat({"question": "Give me example of cool use case where langchain is useful"})
print(result['answer'])

One cool use case where the LangChain framework is useful is in question answering over a list of documents. With LangChain, you can connect a language model to other sources of data and allow it to interact with its environment. This means you can build applications that can answer questions by analyzing a collection of documents.

For example, let's say you have a large database of scientific research papers. You can use LangChain to create a question answering system that takes a user's question as input and searches through the research papers to find the most relevant information. The system can then provide the user with a concise and accurate answer to their question.

This use case can be particularly useful in fields like academia, where researchers often need to sift through a large amount of information to find answers to their questions. LangChain simplifies the process by automating the search and analysis of documents, making it faster and more efficient for researchers t