In [19]:
!pip -q install langchain openai tiktoken chromadb


### Unstructured

Unstructured is a toolkit designed to connect large language models (LLMs) to various data sources. It provides a set of tools for extracting and transforming complex data from different file formats, such as PDFs, Word documents, and Markdown, into a format that can be used by LLMs. The toolkit includes an open-source library, a paid API, and an upcoming enterprise platform. The main goals of Unstructured are:

- **Open Source**: To provide a simple and reliable entry point for individual developers to build prototype applications using LLMs.
- **API**: To offer a production-ready API with premium features for development teams, including advanced models for PDF and image processing, improved table and image extraction, and additional chunking and metadata capabilities.
- **Enterprise Platform**: To provide a full-featured ETL (Extract, Transform, Load) experience with supported upstream and downstream connectors, job scheduling, and incremental data loads, all aimed at automating the LLM ETL process.

### Pandoc

Pandoc is a Haskell library and command-line tool for converting files between different markup formats. It can convert between numerous formats, including Markdown, HTML, LaTeX, and Word docx, and can also produce PDF output. Pandoc has a modular design, consisting of readers that parse input formats and writers that convert the parsed data into target formats. This architecture allows it to perform a wide range of conversions.

Pandoc is often used for tasks such as:

- **Format Conversion**: Converting documents between different formats, such as Markdown to HTML or LaTeX to PDF.
- **Document Processing**: Extracting structural elements from documents, such as headers, paragraphs, and tables.
- **Customization**: Allowing users to run custom filters to modify the intermediate abstract syntax tree (AST) during the conversion process.

In summary, Unstructured is focused on preparing data for use with large language models, while Pandoc is a more general-purpose tool for converting and processing documents between different formats.
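
The reader, filter, and writer stages described above can be pictured with a toy pipeline. This is only a sketch of the idea in plain Python, not Pandoc's actual API: the functions `read_markdown`, `uppercase_headers`, and `write_html` and the tuple-based AST are hypothetical names invented for illustration.

```python
# Toy illustration of a reader -> AST -> filter -> writer pipeline.
# NOT Pandoc's API; every name below is hypothetical.

def read_markdown(text):
    """Reader: parse a tiny Markdown subset into a list of (kind, text) nodes."""
    ast = []
    for line in text.splitlines():
        if line.startswith("# "):
            ast.append(("header", line[2:].strip()))
        elif line.strip():
            ast.append(("para", line.strip()))
    return ast

def uppercase_headers(ast):
    """Filter: modify the AST between reading and writing."""
    return [(kind, text.upper() if kind == "header" else text) for kind, text in ast]

def write_html(ast):
    """Writer: render the AST as HTML."""
    tags = {"header": "h1", "para": "p"}
    return "\n".join(f"<{tags[kind]}>{text}</{tags[kind]}>" for kind, text in ast)

source = "# Lean Canvas\nA one-page business model."
html = write_html(uppercase_headers(read_markdown(source)))
print(html)
```

Because the filter only sees the intermediate tree, the same filter could in principle sit in front of any writer, which is the point of the modular design.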

In [20]:
!pip -q install unstructured pandoc

In [21]:
!pip show langchain

Name: langchain
Version: 0.2.5
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: langchain-community


In [None]:
!wget -q https://www.dropbox.com/s/zdwh3dy5jc7xq94/ash_maurya.zip
!unzip -q -o ash_maurya.zip -d ash_videos  # -o overwrites existing files without prompting

In [None]:
!ls ash_videos

# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple Files - Text + Epub
- ChromaDB
- gpt-3.5-turbo API
- OpenAI Embeddings


## Setting up LangChain


In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"

In [None]:
! pip install langchain_community

In [None]:
# Integrations live in langchain_community as of LangChain 0.2;
# the old langchain.* import paths still resolve but emit deprecation warnings.
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.llms import OpenAI
from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader, PyPDFLoader, DirectoryLoader


## Load and process multiple documents

In [None]:
# Load and process the text files
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./ash_videos/ash_maurya/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()

In [None]:
len(documents)
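
To make the loading step less opaque, here is a rough standard-library sketch of what a directory loader does conceptually: match files by a glob pattern and keep each file's text together with its source path. The helper `load_text_files` and the dict layout are assumptions for illustration. LangChain returns `Document` objects, not dicts, and its loader has more options than shown here.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt"):
    """Return one record per matching file: its text plus its source path."""
    records = []
    for path in sorted(Path(directory).glob(pattern)):
        records.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return records

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first transcript", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second transcript", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored by the pattern", encoding="utf-8")
    demo_docs = load_text_files(tmp)

print(len(demo_docs))
```

Keeping the source path in the metadata is what later lets the notebook print which file each answer came from.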

## Load EPUB

Skip the cells in this section if you don't have an EPUB file.

# EPUB

**EPUB (short for Electronic Publication) is a widely used file format for digital books and publications. It is designed to be flexible and adaptable, allowing the content to be easily read and displayed on various devices, including e-readers, smartphones, and tablets.**

### Key Features of EPUB Files

1. **Reflowable Content**: EPUB files can adjust their layout to fit different screen sizes, making them ideal for reading on devices with varying screen dimensions.
2. **XML-Based**: An EPUB file is a ZIP archive whose content documents are XHTML and whose packaging and metadata files are XML, so the content is structured and easily parseable.
3. **Support for Multimedia**: EPUB files can include multimedia elements such as images, audio, and video, enhancing the reading experience.
4. **Metadata and Annotations**: EPUB files can contain metadata, such as author and title information, and support for annotations, making them useful for educational and research purposes.
5. **Open Standard**: EPUB is an open standard, originally developed by the International Digital Publishing Forum (IDPF) and maintained by the W3C since the two organizations combined in 2017, so it is widely supported across devices and software.

### Creating and Opening EPUB Files

1. **Creation**: EPUB files can be created using software such as Adobe InDesign, which allows users to design and export their content in the EPUB format.
2. **Opening**: EPUB files can be opened and read using various e-readers and software, including Adobe Digital Editions, Apple Books (formerly iBooks), and many other e-reader applications.

### Advantages and Uses

1. **Portability**: EPUB files are highly portable and can be easily transferred between devices, making them ideal for reading on the go.
2. **Accessibility**: EPUB files can be optimized for accessibility, including support for screen readers and other assistive technologies.
3. **Flexibility**: EPUB files can be used for a wide range of content, from novels and textbooks to comics and technical publications.

### Tools and Resources

1. **EPUBCheck**: A free, open-source tool for validating EPUB files against the EPUB standard.
2. **Adobe Digital Editions**: A popular application for reading and managing EPUB files, available for both Windows and Mac.
3. **EPUBBooks**: A website offering a large collection of free EPUB and Kindle eBooks for download.

Overall, EPUB is a versatile and widely supported format that has become a standard in the digital publishing industry.
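
At the container level an EPUB is a ZIP archive with a `mimetype` entry and a `META-INF/container.xml` file that points to the package document. The sketch below builds a skeletal archive in memory and reads that pointer back using only the standard library. The archive is deliberately incomplete (it is not a valid, readable book), and the path `OEBPS/content.opf` is just a common convention chosen for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

# Build a skeletal EPUB-like archive in memory (not a complete book).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("mimetype", "application/epub+zip")
    zf.writestr("META-INF/container.xml", CONTAINER_XML)

# Read it back the way a parser would start: check the mimetype,
# then follow container.xml to the package document.
with zipfile.ZipFile(buffer) as zf:
    mimetype = zf.read("mimetype").decode("ascii")
    root = ET.fromstring(zf.read("META-INF/container.xml"))

ns = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
opf_path = root.find("c:rootfiles/c:rootfile", ns).get("full-path")
print(mimetype, opf_path)
```

Pointing `zipfile.ZipFile` at a real `.epub` path and listing `namelist()` is a quick way to inspect a book before handing it to a loader.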

In [None]:
!pip install pandoc==2.3


In [None]:
!pip install pypandoc

from langchain_community.document_loaders import UnstructuredEPubLoader

loader = UnstructuredEPubLoader("/content/Running Lean_ Iterate from Plan A to a Plan That Works (Lean Series) - Maurya, Ash.epub")  #, mode="elements")

epub_data = loader.load()

In [None]:
len(epub_data)

In [None]:
# Split the documents into overlapping chunks of roughly 1,000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts_01 = text_splitter.split_documents(documents)
texts_02 = text_splitter.split_documents(epub_data)

In [None]:
len(texts_01), len(texts_02)
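
The effect of `chunk_size=1000` and `chunk_overlap=200` can be seen with a simplified fixed-window splitter. This is a sketch under a strong assumption: it cuts purely by character count, whereas `RecursiveCharacterTextSplitter` tries to break on separators such as paragraph and line boundaries first. The function `chunk_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters, each starting
    chunk_overlap characters before the previous window ended."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 2,500 characters of repeating digits so overlaps are easy to compare.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.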

In [None]:
texts = texts_01 + texts_02

In [None]:
len(texts)

In [None]:
texts[370]

## OpenAI Embeddings

In [None]:
# Create the OpenAI embeddings client; vectors are computed via the API when texts are embedded
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

## Create the DB

In [None]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# Use the OpenAI embeddings created above
embedding = embeddings

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)

## Make a retriever

In [None]:
retriever = vectordb.as_retriever()

In [None]:
# invoke() is the current retriever entry point; get_relevant_documents() is deprecated
docs = retriever.invoke("What is product market fit?")

In [None]:
len(docs)

In [None]:
docs[0]

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})

In [None]:
retriever.search_type

In [None]:
retriever.search_kwargs
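
What a similarity retriever with `k=3` does can be sketched in a few lines: score every stored vector against the query vector and keep the three best. The example below uses cosine similarity and made-up three-dimensional vectors purely for illustration. The distance metric Chroma actually applies depends on how the collection is configured, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=3):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings"; the numbers are invented for the demo.
store = [
    ("product market fit", [0.9, 0.1, 0.0]),
    ("customer interviews", [0.1, 0.9, 0.0]),
    ("pricing", [0.0, 0.2, 0.9]),
    ("traction metrics", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], store, k=3)
print([text for text, _ in results])
```

A vector database does the same ranking, but with an index so it does not have to compare the query against every stored vector.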

## Make a chain

In [None]:
# create the chain to answer questions

llm = ChatOpenAI(temperature = 0.0)

qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)


In [None]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)

In [None]:
qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template = '''
Your name is Ash Maurya. You are an expert at Lean Startups.
Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always answer from the perspective of being Ash Maurya.
----------------
{context}'''
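
The `{context}` placeholder in the template is where the retrieved chunks end up. With the "stuff" chain type, all retrieved documents are concatenated into a single string and substituted there. The sketch below mimics that step with plain string formatting. The helper `stuff_documents`, the dict-based documents, and the blank-line separator are assumptions for illustration, not LangChain's internals.

```python
SYSTEM_TEMPLATE = """Use the following pieces of context to answer the question.
----------------
{context}"""

def stuff_documents(docs, separator="\n\n"):
    """Join the text of every retrieved document into one context string."""
    return separator.join(doc["page_content"] for doc in docs)

# Stand-ins for retrieved chunks (real ones are LangChain Document objects).
retrieved = [
    {"page_content": "Chunk one about pivots."},
    {"page_content": "Chunk two about interviews."},
]
system_prompt = SYSTEM_TEMPLATE.format(context=stuff_documents(retrieved))
print(system_prompt)
```

Because everything is packed into one prompt, the number and size of retrieved chunks must fit within the model's context window, which is one reason to cap `k`.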

In [None]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
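
Several retrieved chunks often come from the same file, so the source list printed above can repeat paths. One possible refinement, sketched here with a hypothetical helper `unique_sources` and plain dicts standing in for LangChain `Document` objects, is to list each source once in first-seen order.

```python
def unique_sources(source_documents):
    """Return each source path once, preserving first-seen order."""
    seen = []
    for doc in source_documents:
        source = doc["metadata"].get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen

# Fake response shaped like the chain output, for demonstration only.
fake_response = {
    "result": "An answer.",
    "source_documents": [
        {"metadata": {"source": "a.txt"}},
        {"metadata": {"source": "b.txt"}},
        {"metadata": {"source": "a.txt"}},
    ],
}
sources = unique_sources(fake_response["source_documents"])
print(sources)
```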

In [None]:
# full example
query = "What is product market fit?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
# break it down
query = "When should you quit or pivot?"
llm_response = qa_chain(query)
process_llm_response(llm_response)
# llm_response

In [None]:
query = "What is the purpose of a customer interview?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What is your name?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What are the customer interviewing techniques?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "Do you like the color blue?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

In [None]:
query = "What books did you write?"
llm_response = qa_chain(query)
process_llm_response(llm_response)