# Knowledge Graph

**A knowledge graph is a type of database that stores information in the form of entities and their relationships.** It's a powerful tool for organizing and querying complex data, and it's widely used in artificial intelligence, natural language processing, and other fields.

Here's a breakdown of what a knowledge graph is:

**Entities**: In a knowledge graph, entities are the objects, concepts, or things that we want to store information about. These can be anything from people, places, and organizations to abstract concepts like events, ideas, or relationships.

**Relationships**: Relationships are the connections between entities. These can be simple relationships like "Person A is friends with Person B" or more complex relationships like "Company X is a subsidiary of Company Y."

**Graph Structure**: A knowledge graph is typically represented as a graph, with entities as nodes and relationships as edges between those nodes. This structure allows for efficient querying and traversal of the graph.

**Properties and Attributes**: Entities and relationships in a knowledge graph can have properties and attributes that provide additional information. For example, a person entity might have properties like "name," "age," and "occupation," while a relationship like "friendship" might have attributes like "start date" and "end date."
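
The pieces described so far (entities as nodes, relationships as edges, properties on both) can be sketched in a few lines of Python. This is only an illustrative in-memory model, not how any particular graph database is implemented; the class names `Entity`, `Relationship`, and `KnowledgeGraph`, and all sample values, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A node in the graph, with free-form properties."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, labeled edge between two entities, with its own attributes."""
    source: str
    label: str
    target: str
    attributes: dict = field(default_factory=dict)


class KnowledgeGraph:
    """Hypothetical minimal graph: a dict of nodes plus a list of edges."""

    def __init__(self):
        self.entities = {}       # entity name -> Entity
        self.relationships = []  # list of Relationship edges

    def add_entity(self, name, **properties):
        self.entities[name] = Entity(name, properties)

    def add_relationship(self, source, label, target, **attributes):
        self.relationships.append(Relationship(source, label, target, attributes))

    def neighbors(self, name, label=None):
        """Return names of entities reachable from `name` over one outgoing edge."""
        return [
            r.target
            for r in self.relationships
            if r.source == name and (label is None or r.label == label)
        ]


kg = KnowledgeGraph()
kg.add_entity("Person A", age=34, occupation="engineer")
kg.add_entity("Person B", age=29, occupation="designer")
kg.add_entity("Company X")
kg.add_entity("Company Y")
kg.add_relationship("Person A", "friends_with", "Person B", start_date="2015-06-01")
kg.add_relationship("Company X", "subsidiary_of", "Company Y")

print(kg.neighbors("Person A"))
print(kg.neighbors("Company X", "subsidiary_of"))
print(kg.entities["Person A"].properties["occupation"])
```

Storing edges in a flat list keeps the sketch short; a real system would index edges by source and label so that traversal does not scan every relationship.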

**Types of Knowledge Graphs**: There are several types of knowledge graphs, including:

* **Domain-specific knowledge graphs**: These focus on a specific domain or industry, like a knowledge graph of medical concepts or a knowledge graph of financial entities.
* **General knowledge graphs**: These cover a broad range of topics and are often used as a general-purpose knowledge base.
* **Hybrid knowledge graphs**: These combine different types of knowledge graphs, like a knowledge graph that combines medical and financial information.

**Applications of Knowledge Graphs**: Knowledge graphs have many applications, including:

* **Question answering**: Knowledge graphs can be used to answer complex questions by traversing the graph and retrieving relevant information.
* **Recommendation systems**: Knowledge graphs can be used to build recommendation systems that take into account complex relationships between entities.
* **Natural language processing**: Knowledge graphs can be used to improve natural language processing tasks like entity recognition, sentiment analysis, and text summarization.
* **Data integration**: Knowledge graphs can be used to integrate data from multiple sources, providing a unified view of the data.
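
As a rough illustration of question answering by traversal, the sketch below stores facts as (subject, predicate, object) triples and answers a multi-hop question by following a chain of predicates. The triples, the predicate names, and the helper functions `match` and `follow` are all hypothetical; production systems usually use a query language such as SPARQL or Cypher instead.

```python
# Facts stored as (subject, predicate, object) triples; all values are illustrative.
TRIPLES = [
    ("Company X", "subsidiary_of", "Company Y"),
    ("Company Y", "headquartered_in", "City Z"),
    ("Person A", "works_for", "Company X"),
    ("Person A", "friends_with", "Person B"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def follow(triples, start, predicates):
    """Follow a chain of predicates from `start`, returning the entities reached."""
    frontier = {start}
    for predicate in predicates:
        frontier = {o for s in frontier for (_, _, o) in match(triples, s, predicate)}
    return frontier


# "Where is the parent company of Person A's employer headquartered?"
answer = follow(TRIPLES, "Person A", ["works_for", "subsidiary_of", "headquartered_in"])
print(answer)
```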

Some popular examples of knowledge graphs include:

* **Google's Knowledge Graph**: A large proprietary knowledge graph that Google uses to enrich search results with facts about people, places, and things.
* **Wikidata**: A free, collaboratively edited knowledge graph run by the Wikimedia Foundation, with an emphasis on backing statements with references.
* **DBpedia**: A knowledge graph that extracts structured information from Wikipedia and makes it available for querying and analysis.


# ChromaDB

**ChromaDB (Chroma) is an open-source embedding database, often called a vector database. It is a standalone project rather than a part of LangChain, but LangChain provides a `Chroma` vector store wrapper that lets an application store text chunks as embeddings and retrieve the chunks most similar to a query.**

Here's a breakdown of what ChromaDB does:

**Embedding Storage**: ChromaDB stores embedding vectors together with the original text and its metadata (for example, the source file name). It does not store a knowledge graph of nodes and edges; stored documents are related to a query only through vector similarity.

**Database System**: ChromaDB can keep a collection in memory or persist it to a local directory, which is how this notebook uses it. The `db` folder created later in this notebook holds a `chroma.sqlite3` file and a set of binary index files.

**Features**: The ChromaDB features this notebook relies on are:

* **Document and metadata storage**: Each text chunk is stored alongside its embedding and metadata such as its source file.
* **Similarity search**: Given a query embedding, ChromaDB returns the `k` nearest stored chunks.
* **Indexing**: ChromaDB builds an approximate nearest-neighbor index (HNSW) so that a search does not have to compare the query against every stored vector.
* **Persistence**: Supplying a `persist_directory` stores the collection on disk so it can be reloaded in a later session.

**Benefits**: With ChromaDB as the vector store, a LangChain application can:

* **Ground its answers**: Retrieved chunks are passed to the LLM as context, so answers can be based on the loaded documents rather than on the model's training data alone.
* **Cover more material**: Many files can be indexed, and only the few chunks relevant to each question are sent to the model.
* **Cite sources**: Because metadata travels with each chunk, the application can report which file an answer came from.

Overall, ChromaDB is one of several vector stores that LangChain integrates with; this notebook uses it to build a retrieval-based question answering pipeline over multiple documents.
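
The notebook cells that follow use Chroma through LangChain's vector store interface. To make the underlying idea concrete, here is a minimal sketch of what a vector store does: keep (embedding, text, metadata) records and return the `k` records closest to a query embedding. `TinyVectorStore` is a hypothetical brute-force stand-in, not Chroma's implementation, and the three-dimensional vectors and file names are made up by hand; a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical stand-in for a vector database: brute-force search, no index."""

    def __init__(self):
        self.records = []  # (embedding, text, metadata) tuples

    def add(self, embedding, text, metadata=None):
        self.records.append((embedding, text, metadata or {}))

    def similarity_search(self, query_embedding, k=4):
        """Return the k stored texts whose embeddings are closest to the query."""
        scored = sorted(
            self.records,
            key=lambda rec: cosine_similarity(query_embedding, rec[0]),
            reverse=True,
        )
        return [(text, metadata) for _, text, metadata in scored[:k]]


# Hand-made 3-dimensional "embeddings"; a real model would produce these.
store = TinyVectorStore()
store.add([0.9, 0.1, 0.0], "Pando raised $30 million.", {"source": "pando.txt"})
store.add([0.1, 0.9, 0.0], "Databricks acquired Okera.", {"source": "databricks.txt"})
store.add([0.0, 0.2, 0.9], "Slack is adding AI features.", {"source": "slack.txt"})

results = store.similarity_search([1.0, 0.0, 0.0], k=2)
print(results)
```

The brute-force scan is fine for a handful of records; an index becomes worthwhile once the collection is large enough that comparing against every vector is too slow.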

In [1]:
!pip -q install langchain openai tiktoken chromadb

In [2]:
!pip show langchain

Name: langchain
Version: 0.2.3
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: aiohttp, async-timeout, langchain-core, langchain-text-splitters, langsmith, numpy, pydantic, PyYAML, requests, SQLAlchemy, tenacity
Required-by: langchain-community


In [3]:
!wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
!unzip -q new_articles.zip -d new_articles


# LangChain multi-doc retriever with ChromaDB

***New Points***
- Multiple Files
- ChromaDB
- Source info
- gpt-3.5-turbo API

## Setting up LangChain


In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

In [5]:
!pip -q install -U langchain-community



In [6]:
# In LangChain 0.2.x these integrations live in langchain_community and
# langchain_text_splitters; the old langchain.* import paths are deprecated.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import DirectoryLoader, TextLoader


## Load and process multiple documents

In [7]:
# Load and process the text files
# loader = TextLoader('single_text_file.txt')
loader = DirectoryLoader('./new_articles/', glob="./*.txt", loader_cls=TextLoader)

documents = loader.load()

In [8]:
# Split the documents into chunks of up to 1,000 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

In [9]:
len(texts)

233

In [10]:
texts[3]

Document(page_content='BrandGPT is designed for working with third parties like an agency or a contractor, who can ask questions about the company’s brand guidelines to make sure they are complying with them, May said. “We’re launching BrandGPT, which is meant to be the interface to all this brand-related security stuff that we’re doing, and as people interact with brands, they can access the style guides and better understand the brand, whether they’re a part of the company or not.\n\nThese two products are available in public beta starting today. The company launched last year and has raised $2.4 million from Bee Partners, Fyrfly Ventures and Argon Ventures.', metadata={'source': 'new_articles/05-03-nova-is-building-guardrails-for-generative-ai-content-to-protect-brand-integrity.txt'})
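
To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences before falling back to characters); the function name `chunk_text` and the sample string are assumptions for the example.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# 2,500 characters of filler text stand in for a document.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.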

## Create the DB

In [11]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

# Here we use OpenAI embeddings; in the future we will swap them for local embeddings
embedding = OpenAIEmbeddings()

vectordb = Chroma.from_documents(documents=texts,
                                 embedding=embedding,
                                 persist_directory=persist_directory)


In [12]:
# Persist the DB to disk
vectordb.persist()
vectordb = None


In [13]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory,
                  embedding_function=embedding)

## Make a retriever

In [14]:
retriever = vectordb.as_retriever()

In [15]:
docs = retriever.get_relevant_documents("How much money did Pando raise?")
# Newer LangChain releases deprecate get_relevant_documents in favor of invoke:
# docs = retriever.invoke("How much money did Pando raise?")


In [16]:
len(docs)

4

In [17]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [18]:
retriever.search_type

'similarity'

In [19]:
retriever.search_kwargs

{'k': 2}

## Make a chain

LangChain provides several search types for its retriever interface, which are used to select relevant text chunks from a vector database. Here are some examples:

1. **Similarity Search**:
   - This type of search selects text chunk vectors that are most similar to the query. It is useful when you want to retrieve the most relevant documents based on their semantic similarity to the query.

2. **Maximum Marginal Relevance (MMR) Search**:
   - This type of search optimizes for both similarity to the query and diversity among the retrieved documents. It ensures that the retrieved documents are not only relevant but also diverse in their content.

These search types are used together with LangChain chains such as `RetrievalQA` and `ConversationalRetrievalChain`: the retriever selects the chunks, and the chain passes them to the LLM as context for its answer.
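
The difference between the two search types can be seen in a small, self-contained sketch of maximal marginal relevance. This is a textbook-style greedy version written for illustration, not LangChain's or Chroma's code; the function names, the `lambda_mult` weighting, and the two-dimensional vectors are assumptions for the example. With a LangChain vector store, MMR is normally requested by passing `search_type="mmr"` to `as_retriever`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen greedily by maximal marginal relevance.

    Each step picks the candidate maximizing
    lambda_mult * sim(query, c) - (1 - lambda_mult) * max sim(c, already selected).
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # very relevant
    [1.0, 0.12],  # near-duplicate of the first
    [0.6, 0.8],   # less relevant, but different
]

similarity_only = mmr(query, candidates, k=2, lambda_mult=1.0)
diversity_weighted = mmr(query, candidates, k=2, lambda_mult=0.3)
print(similarity_only)
print(diversity_weighted)
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is a plain similarity ranking, so the near-duplicate is returned; lowering `lambda_mult` penalizes it and lets the more distinct candidate through.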

In [None]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [21]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [22]:
# full example
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


 Pando has raised a total of $45 million, with $30 million coming from their most recent Series B round.


Sources:
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
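
The same file is listed twice because both retrieved chunks come from it. If a single line per file is preferred, the sources can be deduplicated while keeping their order. The sketch below uses stand-in objects that only mimic the `.metadata` attribute of the `Document` shown earlier; `unique_sources` and the file names are hypothetical.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return each document's 'source' metadata once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops repeated keys.
    return list(dict.fromkeys(doc.metadata["source"] for doc in source_documents))


# Stand-in objects with the same .metadata shape as a retrieved document.
example_docs = [
    SimpleNamespace(metadata={"source": "new_articles/a.txt"}),
    SimpleNamespace(metadata={"source": "new_articles/a.txt"}),
    SimpleNamespace(metadata={"source": "new_articles/b.txt"}),
]
print(unique_sources(example_docs))
```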


In [23]:
# break it down
query = "What is the news about Pando?"
llm_response = qa_chain(query)
# process_llm_response(llm_response)
llm_response

{'query': 'What is the news about Pando?',
 'result': ' Pando announced that it raised $30 million in a Series B round, bringing its total raised to $45 million. The funding will be used to expand its global sales, marketing, and delivery capabilities, and Pando is open to exploring strategic partnerships and acquisitions.',
 'source_documents': [Document(page_content='Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCru

In [24]:
query = "Who led the round in Pando?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Iron Pillar and Uncorrelated Ventures.


Sources:
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


In [25]:
query = "What did databricks acquire?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Databricks acquired Okera, a data governance solution company.


Sources:
new_articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt
new_articles/05-03-databricks-acquires-ai-centric-data-governance-platform-okera.txt


In [26]:
query = "What is generative ai?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 Generative AI is a type of artificial intelligence that allows users to create new content or experiences through the use of external apps and large language models. It can be incorporated into the Slack platform in customized ways, giving users more choice and flexibility in how they bring AI into their work. 


Sources:
new_articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt
new_articles/05-04-slack-updates-aim-to-put-ai-at-the-center-of-the-user-experience.txt


In [27]:
query = "Who is CMA?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

 CMA stands for Competition and Markets Authority.


Sources:
new_articles/05-04-cma-generative-ai-review.txt
new_articles/05-04-cma-generative-ai-review.txt


In [28]:
qa_chain.retriever.search_type, qa_chain.retriever.vectorstore

('similarity',
 <langchain_community.vectorstores.chroma.Chroma at 0x791ffe367e50>)

In [29]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:
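
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into the `{context}` slot of this template in a single LLM call. The sketch below reproduces that step with plain string formatting. The template text is copied from the output above; the function name `build_stuff_prompt`, the sample chunks, and the blank-line separator between chunks are assumptions for illustration.

```python
# Template text copied from the printed prompt template.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join every retrieved chunk into one context string and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Shortened, illustrative stand-ins for two retrieved chunks.
retrieved_chunks = [
    "Pando today announced that it raised $30 million in a Series B round.",
    "Iron Pillar and Uncorrelated Ventures led the round.",
]
prompt = build_stuff_prompt(retrieved_chunks, "How much money did Pando raise?")
print(prompt)
```

Because every chunk goes into one prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window is used.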


## Deleting the DB

In [30]:
!zip -r db.zip ./db

updating: db/ (stored 0%)
updating: db/chroma.sqlite3 (deflated 39%)
updating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/ (stored 0%)
updating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/data_level0.bin (deflated 100%)
updating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/header.bin (deflated 61%)
updating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/length.bin (deflated 99%)
updating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/link_lists.bin (stored 0%)


In [31]:
# To clean up, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# delete the directory
!rm -rf db/
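
Outside a notebook, where `!zip` and `!rm` are not available, the same archive, delete, and restore steps can be done with Python's standard library. This is a sketch run against a throwaway temporary directory rather than the real `db` folder; the helper names `archive_and_remove` and `restore` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def archive_and_remove(db_dir, archive_base):
    """Zip db_dir into archive_base + '.zip', then delete db_dir. Returns the zip path."""
    db_dir = Path(db_dir)
    zip_path = shutil.make_archive(
        str(archive_base), "zip", root_dir=str(db_dir.parent), base_dir=db_dir.name
    )
    shutil.rmtree(db_dir)
    return Path(zip_path)


def restore(zip_path, target_parent):
    """Unpack the archive so the original directory reappears under target_parent."""
    shutil.unpack_archive(str(zip_path), extract_dir=str(target_parent))


# Demonstration on a throwaway directory with a placeholder file.
workdir = Path(tempfile.mkdtemp())
demo_db = workdir / "db"
demo_db.mkdir()
(demo_db / "chroma.sqlite3").write_text("placeholder contents")

zip_path = archive_and_remove(demo_db, workdir / "db")
removed = not demo_db.exists()
restore(zip_path, workdir)
restored = (demo_db / "chroma.sqlite3").read_text()
print(zip_path.name, removed, restored)

shutil.rmtree(workdir)  # clean up the temporary directory
```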

## Starting again: loading the DB

Restart the runtime, then restore the zipped database and reload it from disk.

In [1]:
!unzip db.zip

Archive:  db.zip
  inflating: db/chroma.sqlite3
  inflating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/data_level0.bin
  inflating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/header.bin
  inflating: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/length.bin
 extracting: db/35238da2-46ed-43d3-872f-4ab389f3c4f4/link_lists.bin


In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your_openai_api_key"

In [3]:
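# In LangChain 0.2.x these integrations live in langchain_community;
# the old langchain.* import paths are deprecated.
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings

from langchain_community.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA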

In [8]:
persist_directory = 'db'
embedding = OpenAIEmbeddings()

vectordb2 = Chroma(persist_directory=persist_directory,
                   embedding_function=embedding)

retriever = vectordb2.as_retriever(search_kwargs={"k": 2})


In [9]:
# Set up the turbo LLM
turbo_llm = ChatOpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)


In [10]:
# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=turbo_llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [11]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])

In [12]:
# full example
query = "How much money did Pando raise?"
llm_response = qa_chain(query)
process_llm_response(llm_response)


Pando raised $30 million in its Series B round, bringing its total raised amount to $45 million.


Sources:
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt
new_articles/05-03-ai-powered-supply-chain-startup-pando-lands-30m-investment.txt


### Chat prompts

In [13]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)

Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}


In [14]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.messages[1].prompt.template)

{question}
