<a href="https://colab.research.google.com/github/Asish-baidya29/IPL_MATCHS_PREDICTOR/blob/master/Copy_of_GraphRAG_Neo4j_(Updated).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### What is LangChain?
LangChain is an open-source framework for building LLM-powered applications. Its core components are Models (LLM integration), Prompt Templates (structured prompts), Memory (context retention), Indexes & Retrievers (document retrieval), Agents (dynamic decision-making), and Chains (workflow composition). For RAG applications it covers document ingestion, retrieval, and contextualized response generation on top of vector databases and embeddings, and it integrates with external data sources. 🚀

Besides the core `langchain` package, this notebook installs the following LangChain packages:

1. **`langchain-community`** – Contains community-maintained integrations and tools for working with various LLM providers, databases, and APIs.  
2. **`langchain-experimental`** – Includes experimental and early-stage features for advanced LangChain applications, such as novel retrieval methods and agent capabilities.  
3. **`langchain-groq`** – Provides integration with **Groq's LLMs**, enabling fast and efficient model inference.  
4. **`langchain-huggingface`** – Facilitates the use of **Hugging Face models** (transformers, embeddings, and pipelines) within LangChain applications. 🚀

## Imports

In [1]:
!pip install --upgrade --quiet langchain langchain-community langchain-experimental langchain-groq langchain-huggingface
!pip install --upgrade --quiet  sentence-transformers
!pip install --upgrade --quiet transformers
!pip install --upgrade --quiet neo4j tiktoken yfiles_jupyter_graphs
!pip install --upgrade --quiet pypdf


## Uploading the PDF

In [2]:
# Imports
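# Document loaders live in langchain-community (installed above); the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import PyPDFLoader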
from langchain.text_splitter import RecursiveCharacterTextSplitter
from google.colab import files
import os

# Upload the PDF file using Google Colab's file upload utility
uploaded = files.upload()

# Get the file path
pdf_path = list(uploaded.keys())[0]

# Load the PDF using Langchain's PyPDFLoader
loader = PyPDFLoader(pdf_path)
documents = loader.load()


Saving Answers Philosophy.pdf to Answers Philosophy.pdf


In [3]:
#type(documents)

In [4]:
#documents[0]

In [5]:
#len(documents)

## Setting up the Development Environment

### Environment in a Development Project
In a development project, an **environment** is the configured setup in which software is developed, tested, and deployed. **Environment variables** keep sensitive values such as API keys out of the source code. In **Google Colab**, store these values as **secrets**, read them with `userdata.get("API_KEY")`, and copy them into `os.environ` at runtime instead of hardcoding them in a cell (writing `os.environ["API_KEY"] = "your_key"` in the notebook would expose the key to anyone who can see it). This keeps keys out of the shared notebook and makes configuration easy to change between environments. 🚀

In [6]:
import os
from google.colab import userdata

os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["NEO4J_URI"] = userdata.get('NEO4J_URI')
os.environ["NEO4J_USERNAME"] = userdata.get('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = userdata.get('NEO4J_PASSWORD')
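A small guard can make a missing secret fail early with a readable message. The sketch below is only an illustration: `require_env` is a hypothetical helper (not part of Colab or LangChain) that checks that the variables set in the cell above are present and non-empty.

```python
import os

def require_env(names):
    """Return a dict of the requested environment variables.

    Hypothetical helper: raises a single error listing every missing or
    empty variable instead of failing later inside a client library.
    """
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

A call such as `require_env(["GROQ_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"])` right after the cell above would surface a misconfigured secret before any model or database call is made.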

In [7]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

from typing import Tuple, List, Optional

from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables import ConfigurableField

from yfiles_jupyter_graphs import GraphWidget
from neo4j import GraphDatabase

from langchain_community.vectorstores import Neo4jVector
from langchain_community.graphs import Neo4jGraph

from langchain_huggingface import HuggingFaceEmbeddings

In [8]:
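try:
    import google.colab
    from google.colab import output
    # Custom widget support is needed to render the graph widget in Colab
    output.enable_custom_widget_manager()
except ImportError:
    # Not running in Colab, so there is no widget manager to enable
    pass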

In [9]:
from langchain_core.runnables import (
    RunnableBranch,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)

## Extracting Text from Wikipedia Pages
Using `WikipediaLoader` from LangChain

In [None]:
# from langchain_community.document_loaders import WikipediaLoader
# raw_documents = WikipediaLoader(query="The Merchant of Venice").load()





In [None]:
# len(raw_documents)

24

In [None]:
# raw_documents[0]

Document(metadata={'title': 'The Merchant of Venice', 'summary': 'The Merchant of Venice is a play by William Shakespeare, believed to have been written between 1596 and 1598. A merchant in Venice named Antonio defaults on a large loan taken out on behalf of his dear friend, Bassanio, and provided by a Jewish moneylender, Shylock, with seemingly inevitable fatal consequences.\nAlthough classified as a comedy in the First Folio and sharing certain aspects with Shakespeare\'s other romantic comedies, the play is most remembered for its dramatic scenes, and it is best known for the character Shylock and his famous demand for a "pound of flesh".\nThe play contains two famous speeches, that of Shylock, "Hath not a Jew eyes?" on the subject of humanity, and that of Portia on "the quality of mercy".  Debate exists on whether the play is anti-Semitic, with Shylock\'s insistence on his legal right to the pound of flesh being in opposition to his seemingly universal plea for the rights of all pe

## Constants

In [10]:
chunk_size = 512
chunk_overlap = 24

model_name = "deepseek-r1-distill-llama-70b"
embedding_model = "sentence-transformers/all-mpnet-base-v2"
temperature = 0.2
tokens_per_minute = 900

## Text Splitting using Recursive Character Text Splitter

In [None]:
# # For Wikipedia
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# documents = text_splitter.split_documents(raw_documents[:4])

In [11]:
# For PDF (Custom Upload)
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
document_chunks = text_splitter.split_documents(documents)
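To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and line breaks); `split_with_overlap` is a hypothetical name, and the sketch only shows how consecutive chunks share `chunk_overlap` characters.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=24):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary visible in both chunks.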

In [None]:
#document_chunks

## Initializing a Large Language Model (LLM) and Graph Transformer Instance

In [12]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name=model_name,
    temperature=temperature,
    max_tokens=None,
    groq_api_key=os.environ["GROQ_API_KEY"],
    timeout=60,
)

In [13]:
# Import the LLMGraphTransformer for converting text into a structured graph
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Initialize the Graph Transformer with a Large Language Model (LLM)
llm_transformer = LLMGraphTransformer(llm=llm)

# Convert a list of textual documents into a structured graph representation
graph_documents = llm_transformer.convert_to_graph_documents(document_chunks)

# The output 'graph_documents' contains entities (nodes) and their relationships (edges),
# which can be used for knowledge graph construction, search, and reasoning.


In [15]:
graph_documents

[GraphDocument(nodes=[Node(id='Darśana', type='Concept', properties={}), Node(id='Philosophy Of Language', type='Concept', properties={}), Node(id='Wittgenstein', type='Person', properties={}), Node(id='Frege', type='Person', properties={}), Node(id='Epistemology', type='Concept', properties={}), Node(id='Theory Of Knowledge', type='Concept', properties={})], relationships=[Relationship(source=Node(id='Darśana', type='Concept', properties={}), target=Node(id='Philosophy Of Language', type='Concept', properties={}), type='SIMILAR_TO', properties={}), Relationship(source=Node(id='Wittgenstein', type='Person', properties={}), target=Node(id='Philosophy Of Language', type='Concept', properties={}), type='ASSOCIATED_WITH', properties={}), Relationship(source=Node(id='Frege', type='Person', properties={}), target=Node(id='Philosophy Of Language', type='Concept', properties={}), type='ASSOCIATED_WITH', properties={}), Relationship(source=Node(id='Darśana', type='Concept', properties={}), targ
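Each `GraphDocument` in the output above is essentially a set of typed nodes plus directed, typed relationships. The plain-Python sketch below mirrors three of the printed relationships as (source, type, target) triples; the `Triple` tuple and the `neighbors` helper are hypothetical names used for illustration, not LangChain classes.

```python
from collections import namedtuple

# One directed, typed edge between two node ids
Triple = namedtuple("Triple", ["source", "type", "target"])

# Copied from the first relationships in the printed output
triples = [
    Triple("Darśana", "SIMILAR_TO", "Philosophy Of Language"),
    Triple("Wittgenstein", "ASSOCIATED_WITH", "Philosophy Of Language"),
    Triple("Frege", "ASSOCIATED_WITH", "Philosophy Of Language"),
]

def neighbors(node_id, triples):
    """Render every relationship that touches node_id, in either direction."""
    return [
        f"{t.source} - {t.type} -> {t.target}"
        for t in triples
        if node_id in (t.source, t.target)
    ]

print(neighbors("Wittgenstein", triples))
# ['Wittgenstein - ASSOCIATED_WITH -> Philosophy Of Language']
```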

In [16]:
# Initializing Neo4j Instance
graph = Neo4jGraph()



In [17]:
# Adding the graph created above to the Neo4j cloud instance
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True, # Adds a shared __Entity__ label to every node (used by the full-text index created later)
    include_source=True # Stores each source chunk as a Document node linked to its entities via MENTIONS, for traceability
)

In [18]:
# directly show the graph resulting from the given Cypher query
default_cypher = "MATCH (s)-[r:!MENTIONS]->(t) RETURN s,r,t LIMIT 50"

In [19]:
from yfiles_jupyter_graphs import GraphWidget
from neo4j import GraphDatabase

In [20]:
# Visualizing the graph through GraphWidget
def showGraph(cypher: str = default_cypher):
    # Open a Neo4j driver and session, and close both once the result is loaded
    with GraphDatabase.driver(
        uri=os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    ) as driver:
        with driver.session() as session:
            widget = GraphWidget(graph=session.run(cypher).graph())
    # Label each node with its `id` property
    widget.node_label_mapping = 'id'
    display(widget)
    return widget

In [21]:
showGraph()

GraphWidget(layout=Layout(height='800px', width='100%'))

GraphWidget(layout=Layout(height='800px', width='100%'))

## Creating Sentence Embeddings

In [22]:
# Creating a sentence-embedding model instance from Hugging Face
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model,
    model_kwargs={'device': 'cpu'},
)


In [23]:
from langchain_community.vectorstores import Neo4jVector

# Use the embeddings with Neo4jVector
vector_index = Neo4jVector.from_existing_graph(
    embeddings,
    search_type="hybrid",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding"
)
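The vector half of the hybrid search ranks chunks by how close their embeddings are to the query embedding. As a rough illustration, the sketch below computes cosine similarity, one common closeness measure, over toy 3-dimensional vectors. The vectors are made up for the example (they are not real `all-mpnet-base-v2` embeddings), `cosine_similarity` is a hypothetical helper, and the actual scoring happens inside Neo4j.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, assumed for illustration only
query = [1.0, 0.0, 1.0]
toy_chunks = {
    "chunk_a": [1.0, 0.0, 1.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "chunk_c": [1.0, 1.0, 0.0],  # partially aligned
}

# Most similar chunk first
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(query, toy_chunks[name]),
    reverse=True,
)
print(ranked)  # ['chunk_a', 'chunk_c', 'chunk_b']
```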

In [24]:
graph.query("CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

[]

## Extracting Entities (Nodes) from the Input Text

In [25]:
from pydantic import BaseModel, Field
# Extract entities from text
class Entities(BaseModel):
    """Identifying information about entities."""

    names: List[str] = Field(
        ...,
        description="All the person, organization, or business entities that "
        "appear in the text",
    )

In [26]:
# Creating Prompt Templates using Langchain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are extracting organization and person entities from the text.",
        ),
        (
            "human",
            "Use the given format to extract information from the following "
            "input: {question}",
        ),
    ]
)

In [27]:
entity_chain = prompt | llm.with_structured_output(Entities)

In [28]:
entity_chain.invoke({"question": "What was Shylock's approach to the elopement of Jessica with Lorenzo?"}).names

['Shylock', 'Jessica', 'Lorenzo']

## Graph Retrieval from the Question

In [29]:
# Generates a full-text search query with fuzzy matching (~2) for Neo4j by sanitizing input and combining words using AND.
from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars

def generate_full_text_query(input: str) -> str:
    # Strip Lucene special characters, then drop empty tokens
    words = [el for el in remove_lucene_chars(input).split() if el]
    # ~2 tolerates up to two character edits per word; AND requires every word.
    # An input with no usable words yields "" instead of raising an IndexError.
    return " AND ".join(f"{word}~2" for word in words)
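To see the shape of the generated query without a LangChain install, the sketch below swaps `remove_lucene_chars` for a simplified stand-in, `strip_special`. That helper is hypothetical: it replaces a handful of Lucene operator characters with spaces, and the real function's character list may differ.

```python
import re

def strip_special(text):
    """Simplified stand-in for remove_lucene_chars (assumed character list)."""
    return re.sub(r'[+\-&|!(){}\[\]^"~*?:\\/]', " ", text)

def fuzzy_and_query(text):
    """Join every remaining word with AND and allow two edits per word (~2)."""
    words = strip_special(text).split()
    return " AND ".join(f"{word}~2" for word in words)

print(fuzzy_and_query("Ludwig Wittgenstein"))  # Ludwig~2 AND Wittgenstein~2
```

Because every word must match (`AND`), a multi-word entity name only hits nodes whose `id` contains all of its words, while `~2` still forgives small spelling differences.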

In [30]:
# Full text index query
def structured_retriever(question: str) -> str:
    result = ""
    entities = entity_chain.invoke({"question": question})
    for entity in entities.names:
        response = graph.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL {
              WITH node
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION ALL
              WITH node
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        # End every line with a newline so results for different entities don't run together
        result += "".join(f"{el['output']}\n" for el in response)
    return result

In [32]:
print(structured_retriever("What is vrtti?"))






## Combining results from a structured retriever and a vector-based similarity search


In [33]:
# Retrieves structured and unstructured data based on the input question.

def retriever(question: str):
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    final_data = f"""Structured data:
      {structured_data}
      Unstructured data:
      {"#Document ".join(unstructured_data)}
          """
    return final_data
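The same merge step can be tried without Neo4j or an embedding model by passing the two lookups in as functions. This is a sketch under assumptions: `build_context` and the two placeholder lambdas are hypothetical names, and the placeholder strings stand in for real retrieval results.

```python
def build_context(question, structured_fn, vector_fn):
    """Merge graph facts and text chunks into one context string."""
    structured = structured_fn(question)  # e.g. "A - REL -> B" lines from the graph
    chunks = vector_fn(question)          # e.g. page_content of the most similar chunks
    return (
        "Structured data:\n"
        f"{structured}\n"
        "Unstructured data:\n"
        + "#Document ".join(chunks)
    )

# Placeholder lookups so the sketch runs offline
context = build_context(
    "What is vrtti?",
    structured_fn=lambda q: "A - REL -> B",
    vector_fn=lambda q: ["first chunk ", "second chunk"],
)
print(context)
```

Injecting the lookups this way also makes it easy to test the formatting separately from the database and the embedding model.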

In [34]:
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

In [35]:
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

In [36]:
def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer

In [37]:
_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | llm
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x : x["question"]),
)
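The branch above boils down to a single `if`: condense the question when there is chat history, otherwise pass it through. A plain-Python sketch of that decision follows; `make_search_query` and `fake_condense` are hypothetical names, and `fake_condense` is a placeholder for the prompt, LLM, and parser pipeline.

```python
def make_search_query(condense):
    """Return a function that picks the search query for one chain input.

    condense is any callable (chat_history, question) -> standalone question.
    """
    def search_query(inputs):
        history = inputs.get("chat_history")
        if history:  # follow-up question: rewrite it using the history
            return condense(history, inputs["question"])
        return inputs["question"]  # first question: use it unchanged
    return search_query

# Placeholder condenser so the sketch runs without an LLM
def fake_condense(history, question):
    return f"{question} (given {len(history)} earlier turn(s))"

search_query = make_search_query(fake_condense)
print(search_query({"question": "What is vrtti?"}))  # What is vrtti?
```

An empty or missing `chat_history` takes the pass-through path, which matches the `bool(x.get("chat_history"))` check in the cell above.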

In [38]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
Use natural language and be concise.
Answer:"""

In [39]:
prompt = ChatPromptTemplate.from_template(template)

In [40]:
# Creates a processing chain where a search query is retrieved, passed to a prompt, sent to an LLM, and then parsed into a string output.
chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            # Pass only the question text to the prompt, not the whole input dict
            "question": RunnableLambda(lambda x: x["question"]),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)
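The `|` operator in the chain above composes steps so that each one receives the previous step's output. The toy class below imitates that idea in plain Python; it is an illustration of the composition pattern only, not how LangChain implements runnables, and `Step`, `fill_prompt`, `fake_llm`, and `parse` are hypothetical names.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its result to b
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)

fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: prompt.upper())  # placeholder for the model call
parse = Step(lambda text: text.strip())         # placeholder for the output parser

pipeline = fill_prompt | fake_llm | parse
result = pipeline.invoke({"context": "c", "question": "q"})
print(result)
```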

In [43]:
chain.invoke({"question": "What is vrtti?"})

Search query: What is vrtti?




'<think>\nOkay, I need to answer the question "What is vrtti?" based on the provided context. Let me go through the structured and unstructured data to gather the necessary information.\n\nLooking at the structured data, I see that Vṛtti is categorized into types: Primary, Secondary, and Suggestive. It\'s also divided into Primary Signification Function, Secondary Signification Function, and Suggestive Signification Function. Additionally, Secondary Meaning (Lakṣyārtha) is linked to Verbal Cognition via the Secondary Signification Function.\n\nIn the unstructured data, the answer to question 7 explains that Vṛtti refers to the "signification function" of a word, which can be primary (abhidhā), secondary (lakṣaṇā), or suggestive (vyañjanā). It also details how each function leads to different types of meanings: primary meaning (abhidheya), secondary meaning (lakṣyārtha), and suggested meaning (vyaṅgyārtha).\n\nSo, putting this together, Vṛtti is about how words convey meaning through th

In [None]:
# chain.invoke(
#     {
#         "question": "When was she born?",
#         "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
#     }
# )