# Document Loaders

## Text Loading

In [2]:
from langchain_community.document_loaders import TextLoader

In [4]:
loader = TextLoader('speech.txt')
documents = loader.load()

In [5]:
documents

[Document(metadata={'source': 'speech.txt'}, page_content='Malcolm X was one of the most dynamic, dramatic and influential figures of the civil rights era. He was an apostle of black nationalism, self respect, and uncompromising resistance to white oppression. Malcolm X was a polarizing figure who both energized and divided African Americans, while frightening and alienating many whites. He was an unrelenting truth-teller who declared that the mainstream civil rights movement was naïve in hoping to secure freedom through integration and nonviolence. The blazing heat of Malcolm X\'s rhetoric sometimes overshadowed the complexity of his message, especially for those who found him threatening in the first place. Malcolm X was assassinated at age 39, but his political and cultural influence grew far greater in the years after his death than when he was alive.\n\nMalcolm X is now popularly seen as one of the two great martyrs of the 20th century black freedom struggle, the other being his o
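
As the output above shows, the loader returns a list with one document for the whole file, carrying the file path under `metadata['source']`. The following is a minimal standard-library sketch of that idea, not LangChain's implementation; `SimpleDocument` and `load_text` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not the real LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("First paragraph.\n\nSecond paragraph.", encoding="utf-8")
    loaded = load_text(sample_path)

print(len(loaded), loaded[0].metadata["source"].endswith("sample.txt"))
```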

## PDF Loading

In [7]:
from langchain_community.document_loaders import PyPDFLoader

In [9]:
loader = PyPDFLoader('random.pdf')
documents = loader.load()

In [10]:
documents

[Document(metadata={'source': 'random.pdf', 'page': 0}, page_content='TECHNOLOGY READINESS LEVELS\nFOR MACHINE LEARNING SYSTEMS\nAlexander Lavin∗\nPasteur LabsCiarán M. Gilligan-Lee\nSpotifyAlessya Visnjic\nWhyLabsSiddha Ganju\nNvidiaDava Newman\nMIT\nAtılım Güne¸ s Baydin\nUniversity of OxfordSujoy Ganguly\nUnity AIDanny Lange\nUnity AIAmit Sharma\nMicrosoft Research\nStephan Zheng\nSalesforce ResearchEric P. Xing\nPetuumAdam Gibson\nKonduitJames Parr\nNASA Frontier Development Lab\nChris Mattmann\nNASA Jet Propulsion LabYarin Gal\nAlan Turing Institute\nABSTRACT\nThe development and deployment of machine learning (ML) systems can be executed easily with\nmodern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can\nlead to technical debt, scope creep and misaligned objectives, model misuse and failures, and\nexpensive consequences. Engineering systems, on the other hand, follow well-deﬁned processes\nand testing standards to streamline development 

In [11]:
type(documents[0])

langchain_core.documents.base.Document

## Web-Based Loaders

In [15]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

In [16]:
loader = WebBaseLoader(
    web_path="https://en.wikipedia.org/wiki/Chicago",
    # Parse only the elements whose CSS class is "mw-body"
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_="mw-body")),
)
page = loader.load()

In [17]:
page

[Document(metadata={'source': 'https://en.wikipedia.org/wiki/Chicago'}, page_content='\n\n\n\n\n\nToggle the table of contents\n\n\n\n\n\n\n\nChicago\n\n\n\n226 languages\n\n\n\n\nAcèhAfrikaansAlemannischአማርኛAnarâškielâÆngliscالعربيةAragonésܐܪܡܝܐԱրեւմտահայերէնArmãneashtiArpetanAsturianuAtikamekwAvañe\'ẽAzərbaycancaتۆرکجهBasa BaliBamanankanবাংলা閩南語 / Bân-lâm-gúБашҡортсаБеларускаяБеларуская (тарашкевіца)भोजपुरीBikol CentralBislamaБългарскиBoarischBosanskiBrezhonegБуряадCatalàЧӑвашлаCebuanoČeštinaChavacano de ZamboangaChi-ChewaChiShonaCorsuCymraegDagbanliDanskDavvisámegiellaDeitschDeutschDiné bizaadDolnoserbskiडोटेलीEestiΕλληνικάEmiliàn e rumagnòlЭрзяньEspañolEsperantoEstremeñuEuskaraفارسیFiji HindiFøroysktFrançaisFryskFulfuldeFurlanGaeilgeGaelgGàidhligGalego贛語Gĩkũyũ𐌲𐌿𐍄𐌹𐍃𐌺客家語 / Hak-kâ-ngîХальмг한국어HausaHawaiʻiՀայերենहिन्दीHornjoserbsceHrvatskiIdoIgboIlokanoBahasa IndonesiaInterlinguaInterlingueᐃᓄᒃᑎᑐᑦ / inuktitutИронIsiXhosaIsiZuluÍslenskaItalianoעבריתJawaಕನ್ನಡKapampanganКъарачай-малкъарქარ
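
The `parse_only` filter keeps only the elements that carry the given class, so text outside them never reaches `page_content`. Here is a rough standard-library sketch of that filtering idea; it is not how BeautifulSoup works internally, and `ClassTextExtractor` plus the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect text only from inside elements that carry `target_class`."""

    def __init__(self, target_class: str):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


html = """
<html><body>
  <nav class="sidebar">Toggle the table of contents</nav>
  <main class="mw-body extra"><h1>Chicago</h1><p>Chicago is a city<br>in Illinois.</p></main>
  <footer>Privacy policy</footer>
</body></html>
"""

extractor = ClassTextExtractor("mw-body")
extractor.feed(html)
print(extractor.parts)
```

Any navigation text that sits inside the selected element is still kept, which is one possible reason the real output above contains menu and language-list text.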

# Text Transformation

#### Reading PDF Data

In [26]:
from langchain_community.document_loaders import PyPDFLoader

In [27]:
loader = PyPDFLoader('random.pdf')
documents = loader.load()

In [28]:
documents

[Document(metadata={'source': 'random.pdf', 'page': 0}, page_content='TECHNOLOGY READINESS LEVELS\nFOR MACHINE LEARNING SYSTEMS\nAlexander Lavin∗\nPasteur LabsCiarán M. Gilligan-Lee\nSpotifyAlessya Visnjic\nWhyLabsSiddha Ganju\nNvidiaDava Newman\nMIT\nAtılım Güne¸ s Baydin\nUniversity of OxfordSujoy Ganguly\nUnity AIDanny Lange\nUnity AIAmit Sharma\nMicrosoft Research\nStephan Zheng\nSalesforce ResearchEric P. Xing\nPetuumAdam Gibson\nKonduitJames Parr\nNASA Frontier Development Lab\nChris Mattmann\nNASA Jet Propulsion LabYarin Gal\nAlan Turing Institute\nABSTRACT\nThe development and deployment of machine learning (ML) systems can be executed easily with\nmodern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can\nlead to technical debt, scope creep and misaligned objectives, model misuse and failures, and\nexpensive consequences. Engineering systems, on the other hand, follow well-deﬁned processes\nand testing standards to streamline development 

#### Applying Text Splitters

- Recursive Character Text Splitter

In [31]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

In [33]:
docs = text_splitter.split_documents(documents)

In [34]:
docs

[Document(metadata={'source': 'random.pdf', 'page': 0}, page_content='TECHNOLOGY READINESS LEVELS\nFOR MACHINE LEARNING SYSTEMS\nAlexander Lavin∗\nPasteur LabsCiarán M. Gilligan-Lee\nSpotifyAlessya Visnjic\nWhyLabsSiddha Ganju\nNvidiaDava Newman\nMIT\nAtılım Güne¸ s Baydin\nUniversity of OxfordSujoy Ganguly\nUnity AIDanny Lange\nUnity AIAmit Sharma\nMicrosoft Research\nStephan Zheng\nSalesforce ResearchEric P. Xing\nPetuumAdam Gibson\nKonduitJames Parr\nNASA Frontier Development Lab\nChris Mattmann\nNASA Jet Propulsion LabYarin Gal\nAlan Turing Institute\nABSTRACT'),
 Document(metadata={'source': 'random.pdf', 'page': 0}, page_content='Alan Turing Institute\nABSTRACT\nThe development and deployment of machine learning (ML) systems can be executed easily with\nmodern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can\nlead to technical debt, scope creep and misaligned objectives, model misuse and failures, and\nexpensive consequences. Engineering s
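
The recursive strategy tries coarse separators first and falls back to finer ones only for pieces that are still too long. The sketch below is a simplified pure-Python version of that idea, not the library's code: it ignores `chunk_overlap`, drops separators at chunk boundaries, and the name `recursive_split` and the separator order are assumptions.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta iota kappa lambda.\n\nMu."
chunks = recursive_split(text, chunk_size=20)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits in one chunk, while the long middle paragraph is broken on spaces because no paragraph or line break is available inside it.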

- Character Text Splitter

In [42]:
from langchain_text_splitters import CharacterTextSplitter
splitter = CharacterTextSplitter(chunk_size=200)  # splits only on the default separator "\n\n"

In [43]:
doc2 = splitter.split_documents(documents=documents)

In [44]:
doc2

[Document(metadata={'source': 'random.pdf', 'page': 0}, page_content='TECHNOLOGY READINESS LEVELS\nFOR MACHINE LEARNING SYSTEMS\nAlexander Lavin∗\nPasteur LabsCiarán M. Gilligan-Lee\nSpotifyAlessya Visnjic\nWhyLabsSiddha Ganju\nNvidiaDava Newman\nMIT\nAtılım Güne¸ s Baydin\nUniversity of OxfordSujoy Ganguly\nUnity AIDanny Lange\nUnity AIAmit Sharma\nMicrosoft Research\nStephan Zheng\nSalesforce ResearchEric P. Xing\nPetuumAdam Gibson\nKonduitJames Parr\nNASA Frontier Development Lab\nChris Mattmann\nNASA Jet Propulsion LabYarin Gal\nAlan Turing Institute\nABSTRACT\nThe development and deployment of machine learning (ML) systems can be executed easily with\nmodern tools, but the process is typically rushed and means-to-an-end. The lack of diligence can\nlead to technical debt, scope creep and misaligned objectives, model misuse and failures, and\nexpensive consequences. Engineering systems, on the other hand, follow well-deﬁned processes\nand testing standards to streamline development 
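
The first chunk above is far longer than 200 characters. A likely explanation is that `CharacterTextSplitter` splits on a single separator (`"\n\n"` by default) and the extracted PDF text contains only single newlines, so there is nowhere to cut. The sketch below imitates that behaviour in plain Python; `split_on_separator` and the sample strings are hypothetical.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Split only on `separator`, then pack pieces into chunks up to `chunk_size`.

    A single piece longer than `chunk_size` is kept whole, because this
    strategy has no finer separator to fall back on.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


pdf_like = "TITLE LINE\nAuthor One\nAuthor Two\n" + "word " * 60  # single newlines only
speech_like = "Short intro.\n\n" + "x" * 300 + "\n\nShort outro."

oversized = split_on_separator(pdf_like, chunk_size=200)
mixed = split_on_separator(speech_like, chunk_size=200)
print(len(oversized), len(oversized[0]), [len(c) for c in mixed])
```

If strict size limits matter, a separator that actually occurs in the text (for example `"\n"`) or the recursive splitter is usually the safer choice.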

- HTML Text Splitter

In [46]:
# from langchain_text_splitters import HTMLHeaderTextSplitter

# html_string = """
# <article><nav class="theme-doc-breadcrumbs breadcrumbsContainer_Z_bl" aria-label="Breadcrumbs"><ul class="breadcrumbs" itemscope="" itemtype="https://schema.org/BreadcrumbList"><li class="breadcrumbs__item"><a aria-label="Home page" class="breadcrumbs__link" href="/v0.2/"><svg viewBox="0 0 24 24" class="breadcrumbHomeIcon_YNFT"><path d="M10 19v-5h4v5c0 .55.45 1 1 1h3c.55 0 1-.45 1-1v-7h1.7c.46 0 .68-.57.33-.87L12.67 3.6c-.38-.34-.96-.34-1.34 0l-8.36 7.53c-.34.3-.13.87.33.87H5v7c0 .55.45 1 1 1h3c.55 0 1-.45 1-1z" fill="currentColor"></path></svg></a></li><li itemscope="" itemprop="itemListElement" itemtype="https://schema.org/ListItem" class="breadcrumbs__item"><a class="breadcrumbs__link" itemprop="item" href="/v0.2/docs/integrations/components/"><span itemprop="name">Components</span></a><meta itemprop="position" content="1"></li><li itemscope="" itemprop="itemListElement" itemtype="https://schema.org/ListItem" class="breadcrumbs__item"><a class="breadcrumbs__link" itemprop="item" href="/v0.2/docs/integrations/llms/"><span itemprop="name">LLMs</span></a><meta itemprop="position" content="2"></li><li itemscope="" itemprop="itemListElement" itemtype="https://schema.org/ListItem" class="breadcrumbs__item breadcrumbs__item--active"><span class="breadcrumbs__link" itemprop="name">PipelineAI</span><meta itemprop="position" content="3"></li></ul></nav><div class="tocCollapsible_ETCw theme-doc-toc-mobile tocMobile_ITEo"><button type="button" class="clean-btn tocCollapsibleButton_TO0P">On this page</button></div><div class="theme-doc-markdown markdown"><h1>PipelineAI</h1><blockquote><p><a href="https://pipeline.ai" target="_blank" rel="noopener noreferrer">PipelineAI</a> allows you to run your ML models at scale in the cloud. 
It also provides API access to <a href="https://pipeline.ai" target="_blank" rel="noopener noreferrer">several LLM models</a>.</p></blockquote><p>This notebook goes over how to use Langchain with <a href="https://docs.pipeline.ai/docs" target="_blank" rel="noopener noreferrer">PipelineAI</a>.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="pipelineai-example">PipelineAI example<a href="#pipelineai-example" class="hash-link" aria-label="Direct link to PipelineAI example" title="Direct link to PipelineAI example">​</a></h2><p><a href="https://docs.pipeline.ai/docs/langchain" target="_blank" rel="noopener noreferrer">This example shows how PipelineAI integrated with LangChain</a> and it is created by PipelineAI.</p><h2 class="anchor anchorWithStickyNavbar_LWe7" id="setup">Setup<a href="#setup" class="hash-link" aria-label="Direct link to Setup" title="Direct link to Setup">​</a></h2><p>The <code>pipeline-ai</code> library is required to use the <code>PipelineAI</code> API, AKA <code>Pipeline Cloud</code>. 
Install <code>pipeline-ai</code> using <code>pip install pipeline-ai</code>.</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color: #9CDCFE; --prism-background-color: #222222;"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token comment" style="color: rgb(106, 153, 85);"># Install the package</span><span class="token plain"></span><br></span><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain"></span><span class="token operator" style="color: rgb(212, 212, 212);">%</span><span class="token plain">pip install </span><span class="token operator" style="color: rgb(212, 212, 212);">-</span><span class="token operator" style="color: rgb(212, 212, 212);">-</span><span class="token plain">upgrade </span><span class="token operator" style="color: rgb(212, 212, 212);">-</span><span class="token operator" style="color: rgb(212, 212, 212);">-</span><span class="token plain">quiet  pipeline</span><span class="token operator" style="color: rgb(212, 212, 212);">-</span><span class="token plain">ai</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="example">Example<a href="#example" class="hash-link" aria-label="Direct link to 
Example" title="Direct link to Example">​</a></h2><h3 class="anchor anchorWithStickyNavbar_LWe7" id="imports">Imports<a href="#imports" class="hash-link" aria-label="Direct link to Imports" title="Direct link to Imports">​</a></h3><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color: #9CDCFE; --prism-background-color: #222222;"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token keyword" style="color: rgb(86, 156, 214);">import</span><span class="token plain"> os</span><br></span><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain" style="display: inline-block;"></span><br></span><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain"></span><span class="token keyword" style="color: rgb(86, 156, 214);">from</span><span class="token plain"> langchain_community</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token plain">llms </span><span class="token keyword" style="color: rgb(86, 156, 214);">import</span><span class="token plain"> PipelineAI</span><br></span><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain"></span><span class="token keyword" style="color: rgb(86, 156, 214);">from</span><span class="token plain"> langchain_core</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token plain">output_parsers </span><span class="token keyword" style="color: rgb(86, 156, 214);">import</span><span class="token plain"> StrOutputParser</span><br></span><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain"></span><span class="token keyword" style="color: rgb(86, 156, 214);">from</span><span class="token plain"> langchain_core</span><span 
class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token plain">prompts </span><span class="token keyword" style="color: rgb(86, 156, 214);">import</span><span class="token plain"> PromptTemplate</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><div style="padding-top:1.3rem;background:var(--prism-background-color);color:var(--prism-color);margin-top:calc(-1 * var(--ifm-leading) - 5px);margin-bottom:var(--ifm-leading);box-shadow:var(--ifm-global-shadow-lw);border-bottom-left-radius:var(--ifm-code-border-radius);border-bottom-right-radius:var(--ifm-code-border-radius)"><b style="padding-left:0.65rem;margin-bottom:0.45rem;margin-right:0.5rem">API Reference:</b><span><a href="https://api.python.langchain.com/en/latest/llms/langchain_community.llms.pipelineai.PipelineAI.html">PipelineAI</a> | </span><span><a href="https://api.python.langchain.com/en/latest/output_parsers/langchain_core.output_parsers.string.StrOutputParser.html">StrOutputParser</a> | </span><span><a href="https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html">PromptTemplate</a></span></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="set-the-environment-api-key">Set the Environment API Key<a href="#set-the-environment-api-key" class="hash-link" aria-label="Direct link to Set the Environment API Key" title="Direct link to Set the Environment API 
Key">​</a></h3><p>Make sure to get your API key from PipelineAI. Check out the <a href="https://docs.pipeline.ai/docs/cloud-quickstart" target="_blank" rel="noopener noreferrer">cloud quickstart guide</a>. You'll be given a 30 day free trial with 10 hours of serverless GPU compute to test different models.</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color: #9CDCFE; --prism-background-color: #222222;"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain">os</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token plain">environ</span><span class="token punctuation" style="color: rgb(212, 212, 212);">[</span><span class="token string" style="color: rgb(206, 145, 120);">"PIPELINE_API_KEY"</span><span class="token punctuation" style="color: rgb(212, 212, 212);">]</span><span class="token plain"> </span><span class="token operator" style="color: rgb(212, 212, 212);">=</span><span class="token plain"> </span><span class="token string" style="color: rgb(206, 145, 120);">"YOUR_API_KEY_HERE"</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h2 class="anchor anchorWithStickyNavbar_LWe7" id="create-the-pipelineai-instance">Create the PipelineAI instance<a 
href="#create-the-pipelineai-instance" class="hash-link" aria-label="Direct link to Create the PipelineAI instance" title="Direct link to Create the PipelineAI instance">​</a></h2><p>When instantiating PipelineAI, you need to specify the id or tag of the pipeline you want to use, e.g. <code>pipeline_key = "public/gpt-j:base"</code>. You then have the option of passing additional pipeline-specific keyword arguments:</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color: #9CDCFE; --prism-background-color: #222222;"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color: rgb(156, 220, 254);"><span class="token plain">llm </span><span class="token operator" style="color: rgb(212, 212, 212);">=</span><span class="token plain"> PipelineAI</span><span class="token punctuation" style="color: rgb(212, 212, 212);">(</span><span class="token plain">pipeline_key</span><span class="token operator" style="color: rgb(212, 212, 212);">=</span><span class="token string" style="color: rgb(206, 145, 120);">"YOUR_PIPELINE_KEY"</span><span class="token punctuation" style="color: rgb(212, 212, 212);">,</span><span class="token plain"> pipeline_kwargs</span><span class="token operator" style="color: rgb(212, 212, 212);">=</span><span class="token punctuation" style="color: rgb(212, 212, 212);">{</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token punctuation" style="color: rgb(212, 212, 212);">.</span><span class="token punctuation" style="color: rgb(212, 212, 212);">}</span><span class="token punctuation" style="color: rgb(212, 212, 212);">)</span><br></span></code></pre><div class="buttonGroup__atx"><button type="button" class="clean-btn" aria-label="Toggle word wrap" title="Toggle word 
wrap"><svg viewBox="0 0 24 24" class="wordWrapButtonIcon_Bwma" aria-hidden="true"><path fill="currentColor" d="M4 19h6v-2H4v2zM20 5H4v2h16V5zm-3 6H4v2h13.25c1.1 0 2 .9 2 2s-.9 2-2 2H15v-2l-3 3l3 3v-2h2c2.21 0 4-1.79 4-4s-1.79-4-4-4z"></path></svg></button><button type="button" aria-label="Copy code to clipboard" title="Copy" class="clean-btn"><span class="copyButtonIcons_eSgA" aria-hidden="true"><svg viewBox="0 0 24 24" class="copyButtonIcon_y97N"><path fill="currentColor" d="M19,21H8V7H19M19,5H8A2,2 0 0,0 6,7V21A2,2 0 0,0 8,23H19A2,2 0 0,0 21,21V7A2,2 0 0,0 19,5M16,1H4A2,2 0 0,0 2,3V17H4V3H16V1Z"></path></svg><svg viewBox="0 0 24 24" class="copyButtonSuccessIcon_LjdS"><path fill="currentColor" d="M21,7L9,19L3.5,13.5L4.91,12.09L9,16.17L19.59,5.59L21,7Z"></path></svg></span></button></div></div></div><h3 class="anchor anchorWithStickyNavbar_LWe7" id="create-a-prompt-template">Create a Prompt Template<a href="#create-a-prompt-template" class="hash-link" aria-label="Direct link to Create a Prompt Template" title="Direct link to Create a Prompt Template">​</a></h3><p>We will create a prompt template for Question and Answer.</p><div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color: #9CDCFE; --prism-background-color: #222222;"><div class="codeBlockContent_biex"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar"><code class="codeBlockLines_e6Vv">
# """

In [47]:
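# splitter = HTMLHeaderTextSplitter(headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
# html_docs = splitter.split_text(html_string)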
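
Header-based splitting groups the body text under the headers that precede it, so each chunk knows which section it came from. The following standard-library sketch illustrates the idea only; `HeaderSectionParser` is a hypothetical class, and the short HTML string is an invented, much reduced version of the page pasted above.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Group text under the most recent h1/h2/h3 headers."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags
        self.in_header = None       # header tag currently open, if any
        self.current_headers = {}   # e.g. {"h1": "Title", "h2": "Setup"}
        self.sections = []          # list of (headers dict, text) tuples
        self._text = []

    def _flush(self):
        text = " ".join(self._text).strip()
        if text:
            self.sections.append((dict(self.current_headers), text))
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()
            # A new header replaces headers at the same or a deeper level.
            level = self.header_tags.index(tag)
            for deeper in self.header_tags[level:]:
                self.current_headers.pop(deeper, None)
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_header:
            previous = self.current_headers.get(self.in_header, "")
            self.current_headers[self.in_header] = (previous + " " + data).strip()
        else:
            self._text.append(data)

    def close(self):
        super().close()
        self._flush()


sample_html = (
    "<h1>PipelineAI</h1><p>Run ML models at scale.</p>"
    "<h2>Setup</h2><p>Install the library.</p>"
    "<h2>Example</h2><h3>Imports</h3><p>Import the modules.</p>"
)
parser = HeaderSectionParser()
parser.feed(sample_html)
parser.close()
for headers, text in parser.sections:
    print(headers, "->", text)
```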

- Recursive JSON Splitter

In [49]:
random_json = {
    "user": {
        "id": 12345,
        "name": "John Doe",
        "contact": {
            "email": "john.doe@example.com",
            "phone": "555-1234",
            "address": {
                "street": "123 Elm St",
                "city": "Somewhere",
                "state": "CA",
                "zip": "90210"
            }
        },
        "preferences": {
            "notifications": {
                "email": True,
                "sms": False,
                "push": {
                    "enabled": True,
                    "sound": "default"
                }
            },
            "language": "en",
            "timezone": "PST"
        },
        "subscriptions": [
            {
                "id": 1,
                "name": "Newsletter",
                "type": "email",
                "status": "subscribed"
            },
            {
                "id": 2,
                "name": "Promo Alerts",
                "type": "sms",
                "status": "unsubscribed"
            }
        ]
    },
    "session": {
        "token": "abc123xyz456",
        "expires": "2024-12-31T23:59:59Z",
        "last_login": {
            "ip": "192.168.1.1",
            "location": {
                "country": "USA",
                "city": "Somewhere",
                "coordinates": {
                    "lat": 34.0522,
                    "long": -118.2437
                }
            }
        }
    },
    "shopping_cart": {
        "items": [
            {
                "id": "9876",
                "name": "Widget A",
                "quantity": 2,
                "price": 19.99,
                "attributes": {
                    "color": "red",
                    "size": "M"
                }
            },
            {
                "id": "5432",
                "name": "Gadget B",
                "quantity": 1,
                "price": 99.95,
                "attributes": {
                    "warranty": "2 years"
                }
            }
        ],
        "total_price": 139.93,
        "currency": "USD"
    }
}


In [50]:
from langchain_text_splitters import RecursiveJsonSplitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=100)
json_chunks = json_splitter.split_json(random_json)

In [51]:
json_chunks

[{'user': {'id': 12345,
   'name': 'John Doe',
   'contact': {'email': 'john.doe@example.com'}}},
 {'user': {'contact': {'phone': '555-1234',
    'address': {'street': '123 Elm St'}}}},
 {'user': {'contact': {'address': {'city': 'Somewhere',
     'state': 'CA',
     'zip': '90210'}}}},
 {'user': {'preferences': {'notifications': {'email': True,
     'sms': False,
     'push': {'enabled': True, 'sound': 'default'}}}}},
 {'user': {'preferences': {'language': 'en', 'timezone': 'PST'}}},
 {'user': {'subscriptions': [{'id': 1,
     'name': 'Newsletter',
     'type': 'email',
     'status': 'subscribed'},
    {'id': 2,
     'name': 'Promo Alerts',
     'type': 'sms',
     'status': 'unsubscribed'}]}},
 {'session': {'token': 'abc123xyz456', 'expires': '2024-12-31T23:59:59Z'}},
 {'session': {'last_login': {'ip': '192.168.1.1'}}},
 {'session': {'last_login': {'location': {'country': 'USA',
     'city': 'Somewhere'}}}},
 {'session': {'last_login': {'location': {'coordinates': {'lat': 34.0522,
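
Each chunk in the output keeps the full key path (`user` → `contact` → ...) down to the values it holds, so a chunk stays meaningful on its own. The greedy sketch below shows one way to get that shape with the standard library; it is an assumption-laden illustration, not the algorithm `RecursiveJsonSplitter` uses, and `set_nested` and `split_json` are hypothetical helpers.

```python
import json


def set_nested(target: dict, path: list[str], value) -> None:
    """Write `value` into `target` at the nested key `path`, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: dict, max_chunk_size: int) -> list[dict]:
    """Greedily pack leaves into chunks, keeping each leaf's full key path."""
    chunks: list[dict] = [{}]

    def size(obj) -> int:
        return len(json.dumps(obj))

    def walk(node: dict, path: list[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            if isinstance(value, dict) and value:
                walk(value, new_path)  # descend; lists and scalars count as leaves
                continue
            trial = json.loads(json.dumps(chunks[-1]))  # simple deep copy
            set_nested(trial, new_path, value)
            if chunks[-1] and size(trial) > max_chunk_size:
                # Adding this leaf would overflow the chunk: start a new one.
                chunks.append({})
                set_nested(chunks[-1], new_path, value)
            else:
                chunks[-1] = trial

    walk(data, [])
    return [c for c in chunks if c]


sample = {
    "user": {"id": 12345, "name": "John Doe",
             "contact": {"email": "john.doe@example.com", "phone": "555-1234"}},
    "session": {"token": "abc123xyz456"},
}
pieces = split_json(sample, max_chunk_size=60)
for piece in pieces:
    print(len(json.dumps(piece)), piece)
```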
  

# Creating Embeddings

## OpenAI

In [52]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [53]:
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
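
If the variable is missing, `os.getenv` returns `None`, and assigning `None` into `os.environ` raises a `TypeError` that does not say which key is absent. A small helper that fails with a clearer message can look like the sketch below; `require_env` and `DEMO_API_KEY` are hypothetical names used only for this example.

```python
import os


def require_env(name: str) -> str:
    """Return the value of `name`, or raise a clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Demo with a made-up variable so no real key is needed.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```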

In [54]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [55]:
text = "this is a tutorial"

In [56]:
vector = embeddings.embed_query(text)

In [59]:
len(vector)

3072
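
The result is a plain list of 3072 floats. Vectors like this are usually compared with a similarity measure, and cosine similarity is a common choice. The sketch below uses tiny hand-written vectors in place of real embeddings, and the function name is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 6), orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0.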

### Using Chroma DB

- Loading Data

In [60]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('random.pdf')
final_documents = loader.load()

- Adding to Vector Store

In [61]:
from langchain_community.vectorstores import Chroma

db = Chroma.from_documents(final_documents, embeddings)
db


<langchain_community.vectorstores.chroma.Chroma at 0x1295814b0>

In [63]:
query = "The development and deployment of machine learning (ML) systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end. "
retrieval = db.similarity_search(query)
retrieval

[Document(metadata={'page': 7, 'source': 'random.pdf'}, page_content='Figure 2: Most ML and AI projects live in these sections of MLTRL, not concerned with fundamental R&D – that is,\ncompletely using existing methods and implementations, and even pretrained models. In the left diagram, the arrows\nshow a common development pattern with MLTRL in industry: projects go back to the ML toolbox to develop new\nfeatures (dashed line), and frequent, incremental improvements are often a practice of jumping back a couple levels to\nLevel 7 (which is the main systems integrations stage). At Levels 7 and 8 we stress the need for tests that run use-case\nspeciﬁc critical scenarios and data-slices, which are highlighted by a proper risk-quantiﬁcation matrix [ 22]. Cycling\nback to previous lower levels is not just a late-stage mechanism in MLTRL, but rather “switchbacks” occur throughout\nthe process (as discussed in the Methods section and throughout the text). In the right diagram we show the mor

## Ollama

In [75]:
from langchain_community.embeddings import OllamaEmbeddings
embeddings_ollama = OllamaEmbeddings(model="mxbai-embed-large")  # the default model is llama2

In [76]:
embeddings_ollama

OllamaEmbeddings(base_url='http://localhost:11434', model='mxbai-embed-large', embed_instruction='passage: ', query_instruction='query: ', mirostat=None, mirostat_eta=None, mirostat_tau=None, num_ctx=None, num_gpu=None, num_thread=None, repeat_last_n=None, repeat_penalty=None, temperature=None, stop=None, tfs_z=None, top_k=None, top_p=None, show_progress=False, headers=None, model_kwargs=None)

In [85]:
embedded_ollama = embeddings_ollama.embed_documents(["Hi my name is Omii"])  # expects a list of texts

In [88]:
len(embedded_ollama[0])

1024
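
`embed_documents` expects a list of strings. Because a Python string is itself iterable, passing a bare string would typically be treated as a sequence of one-character documents, which still yields vectors of the right length and so hides the mistake. The sketch below shows the effect with a made-up embedder; `fake_embed_documents` is hypothetical and returns meaningless 3-number vectors.

```python
def fake_embed_documents(texts):
    """Hypothetical embedder: one 3-number vector per item it iterates over."""
    return [[float(len(t)), float(t.count(" ")), 1.0] for t in texts]


as_string = fake_embed_documents("Hi my name is Omii")    # iterates over characters
as_list = fake_embed_documents(["Hi my name is Omii"])    # iterates over documents

print(len(as_string), len(as_list))
```

Checking the outer length (`len(result)`) as well as the vector length is a quick way to catch this.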

## Hugging Face

In [96]:
from dotenv import load_dotenv
import tqdm
load_dotenv()

True

In [92]:
os.environ['HF_TOKEN'] = os.getenv("HF_TOKEN")

In [97]:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings_hf = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



In [99]:
text = "This is practice."
result = embeddings_hf.embed_query(text)
len(result)

384

# Vector Databases and Retrievers

## FAISS

In [101]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loader = TextLoader('speech.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 807, which is longer than the specified 200
Created a chunk of size 582, which is longer than the specified 200
Created a chunk of size 891, which is longer than the specified 200
Created a chunk of size 419, which is longer than the specified 200


In [103]:
embeddings = OllamaEmbeddings()
db = FAISS.from_documents(docs, embeddings)

In [104]:
query = "Who is Malcolm X?"
result = db.similarity_search(query)
result[0]

Document(metadata={'source': 'speech.txt'}, page_content='As a teenager, Malcolm Little made his way to New York, where he took the street name Detroit Red and became a pimp and petty criminal. In 1946, Malcolm Little was sent to prison for burglary. He read voraciously while serving time and converted to the Black Muslim faith. He joined the Nation of Islam (NOI) and changed his name to Malcolm X, eliminating that part of his identity he called a white-imposed slave name.')
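
Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and returns the closest chunks. The sketch below shows that loop with a toy word-count embedding and Euclidean distance; both are assumptions for illustration, since the metric and index a real store uses depend on its configuration, and `toy_embed` and `top_k` are hypothetical helpers.

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Hypothetical embedding: count how often each vocabulary word appears."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k stored vectors closest to the query (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(distances)[:k]]


vocabulary = ["prison", "islam", "harlem", "speech"]
chunks = [
    "he was sent to prison and read in prison",
    "he joined the nation of islam",
    "a speech in harlem",
]
doc_vecs = [toy_embed(chunk, vocabulary) for chunk in chunks]
query_vec = toy_embed("prison", vocabulary)

best = top_k(query_vec, doc_vecs, k=2)
print([chunks[i] for i in best])
```

A real vector store replaces the linear scan with an index, but the inputs and outputs have the same shape: a query vector in, the nearest chunks out.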

## Chroma

In [3]:
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

loader = TextLoader('speech.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

Created a chunk of size 807, which is longer than the specified 200
Created a chunk of size 582, which is longer than the specified 200
Created a chunk of size 891, which is longer than the specified 200
Created a chunk of size 419, which is longer than the specified 200


In [4]:
embeddings = OllamaEmbeddings()
vectordb = Chroma.from_documents(docs, embeddings)

In [5]:
query = "Who is Malcolm X?"
result = vectordb.similarity_search(query)
result[0]

Document(metadata={'source': 'speech.txt'}, page_content='As a teenager, Malcolm Little made his way to New York, where he took the street name Detroit Red and became a pimp and petty criminal. In 1946, Malcolm Little was sent to prison for burglary. He read voraciously while serving time and converted to the Black Muslim faith. He joined the Nation of Islam (NOI) and changed his name to Malcolm X, eliminating that part of his identity he called a white-imposed slave name.')