An improved version that integrates the ChromaDB API and the Mistral API, both accessed through LangChain

In [1]:
import chromadb

# the client is used to store the critic reviews in the database
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

In [2]:
from chromadb.utils import embedding_functions

# embedding function passed to create_collection below
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

  from .autonotebook import tqdm as notebook_tqdm


Created collection (name: test_<name_of_the_game>)

- cyberpunk2077 "test_cyberpunk2077"
- Monster Hunter World "test_mhw"
- dota2 "test_dota2"

In [3]:
collection = chroma_client.get_collection('test_cyberpunk2077')

Exception: {"error":"ValueError('Collection test_cyberpunk2077 does not exist.')"}

Create collection

In [6]:
collection = chroma_client.create_collection(name="test_cyberpunk2077", embedding_function=sentence_transformer_ef)
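
The failed `get_collection` call followed by `create_collection` can be folded into a single step. A minimal sketch, assuming the installed chromadb client exposes `get_or_create_collection` (recent releases do); the helper name `open_collection` is hypothetical and not part of chromadb:

```python
def open_collection(client, name, embedding_function=None):
    """Return the named collection, creating it first if it does not exist."""
    kwargs = {"name": name}
    if embedding_function is not None:
        # Only pass the embedding function when one is supplied, so the
        # client's default is used otherwise.
        kwargs["embedding_function"] = embedding_function
    return client.get_or_create_collection(**kwargs)
```

In this notebook it would be called as `open_collection(chroma_client, "test_cyberpunk2077", sentence_transformer_ef)`.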

In [7]:
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from pathlib import Path
from langchain.text_splitter import CharacterTextSplitter

file_dir_path = Path("cyberpunk_2077/")

loader = DirectoryLoader(str(file_dir_path), glob="./*.txt", loader_cls=TextLoader)
docs = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(docs)

# create documents list and metadata list for Chroma
documents = []
metadata = []
for doc in docs:
    documents.append(doc.page_content)
    metadata.append(doc.metadata)

Created a chunk of size 1113, which is longer than the specified 1000
Created a chunk of size 1003, which is longer than the specified 1000
Created a chunk of size 1141, which is longer than the specified 1000
Created a chunk of size 1281, which is longer than the specified 1000
Created a chunk of size 1113, which is longer than the specified 1000
Created a chunk of size 1121, which is longer than the specified 1000
Created a chunk of size 1075, which is longer than the specified 1000
Created a chunk of size 1104, which is longer than the specified 1000
Created a chunk of size 1143, which is longer than the specified 1000
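
The warnings above appear because `CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default) and never cuts inside a piece, so a paragraph longer than `chunk_size` is emitted whole. The following is a simplified illustration of that behaviour, not LangChain's actual implementation: it ignores `chunk_overlap`, and `split_on_separator` is a hypothetical name.

```python
def split_on_separator(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters; a single oversized piece is kept whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new chunk and
            # must be accepted even if it is longer than chunk_size.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

If a hard upper bound on chunk length matters, LangChain's `RecursiveCharacterTextSplitter` is worth considering, since it falls back to finer separators when a piece is too long.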


In [8]:
# add the documents to the collection
collection.add(
    documents=documents,
    ids=[str(i) for i in range(len(documents))],
    metadatas=metadata
)

In [9]:
print(len(documents))

113


---

In [10]:
collection.peek()

{'ids': ['0', '1', '10', '100', '101', '102', '103', '104', '105', '106'],
 'embeddings': [[-0.007072583306580782,
   -0.04239673539996147,
   -0.008007119409739971,
   -0.03966406732797623,
   0.05231420323252678,
   0.04659915715456009,
   -0.08682898432016373,
   0.02861125022172928,
   -0.051696743816137314,
   0.04130517318844795,
   -0.09417585283517838,
   0.05365785211324692,
   0.014351348392665386,
   -0.029767761006951332,
   0.04762212187051773,
   -0.059462256729602814,
   0.08518379926681519,
   -0.0928773283958435,
   0.08705755323171616,
   -0.05326329544186592,
   -0.04379286244511604,
   -0.06579174846410751,
   -0.00963597372174263,
   0.03293171152472496,
   -0.005562383681535721,
   0.009120315313339233,
   -0.11124558001756668,
   -0.02571932226419449,
   -0.07856151461601257,
   0.03757723048329353,
   -0.0726824700832367,
   0.010048889555037022,
   -0.05138903483748436,
   0.031233541667461395,
   -0.08064968138933182,
   -0.05645805969834328,
   0.064255930483
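
The ids in the `peek()` output are listed as '0', '1', '10', '100', ..., which is consistent with plain string ordering: strings compare character by character, so '10' sorts before '2'. If a numeric-looking order is wanted, zero-padding the ids is one option. A small sketch; `padded_ids` is a hypothetical helper and the padding width is derived from the document count:

```python
def padded_ids(n):
    """Return n string ids, zero-padded so string order equals numeric order."""
    width = len(str(max(n - 1, 0)))
    return [f"{i:0{width}d}" for i in range(n)]

plain = [str(i) for i in range(113)]   # ids as generated in the notebook
print(sorted(plain)[:4])               # string-sorted: '0', '1', '10', '100'
print(sorted(padded_ids(113))[:4])     # padded: '000', '001', '002', '003'
```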

---

LLM retrieval with ChromaDB running in Docker

In [11]:
from langchain_community.llms import Ollama

In [12]:
llm = Ollama(model="llama2")        # assumes Ollama is serving on its default port, 11434

In [13]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

db = Chroma(collection_name="test_cyberpunk2077", client=chroma_client, embedding_function=embedding_function)

In [14]:
# make a chain

from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    return_source_documents=True
)

In [15]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
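
When several retrieved chunks come from the same review file, `process_llm_response` prints that path more than once. A minimal sketch of an order-preserving de-duplication; `unique_sources` is a hypothetical helper, and it only assumes each source document has a `metadata` dict with a `'source'` key, as in the function above:

```python
def unique_sources(source_documents):
    """Return each document's metadata['source'] once, in first-seen order."""
    seen = {}
    for doc in source_documents:
        # dict keys keep insertion order, which gives an ordered set.
        seen.setdefault(doc.metadata.get("source", "unknown"), None)
    return list(seen)
```

The loop in `process_llm_response` could then iterate over `unique_sources(llm_response["source_documents"])` instead of the raw list.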

In [20]:
prompt = 'Narrative'

# retriever
retriever = db.as_retriever(search_kwargs={'k': 5})     # define number of documents to retrieve

docs = retriever.get_relevant_documents(prompt)

print(len(docs))
print('\n\n')
for doc in docs:
    print(doc.page_content)
    print()
    print('Source:', end='')
    print(doc.metadata['source'])
    print('\n\n')

5



I spent a lot of my playtime following side-quest threads like this one, excited about the premise and hoping to find something as interesting or fun or rewarding at the end and, in many cases, I did. But now, after finishing the main story, I can't see how most of those activities fit into the overall narrative or the character I was playing. The main story doesn't even gel with itself.

Cyberpunk 2077 draws heavily from its source material, with everything from the world itself to the life and death of Johnny Silverhand coming from its pen-and-paper inspiration. But unlike in a tabletop RPG, you aren't playing a role of your own creation in Cyberpunk 2077; you're playing V, and this is V's story, not yours. I often felt like I was role-playing two different characters: one V for the side quests and one more limited V for the main story.

Source:cyberpunk_2077/cyberpunk_2077_04.txt



That said, this structure does misfire slightly in how it’s organized and presented. The mission

---

In [9]:
# create the chain to answer questions
prompt = 'What is the game about?'

chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

# full example
llm_response = chain.invoke(prompt)
llm_response

{'query': 'What is the game about?',
 'result': "Based on the given context, the game Dota 2 can be described as a multiplayer online battle arena (MOBA) game that combines elements of RPGs and RTS games. The game is set in a fantasy world where two teams, Radiant and Dire, compete to destroy each other's base by controlling a map filled with creeps, towers, and other structures. Players can choose from a diverse pool of characters, called heroes, each with their own abilities and playstyles. The game is known for its complexity and depth, requiring players to work together as a team and make strategic decisions in real-time to emerge victorious.",
 'source_documents': [Document(page_content="Play a match every evening for a couple of weeks, and you start to see how Dota 2's wealth of disparate systems and mechanics combine into their own harmony, and you begin to understand how there are hundreds of elements that affect the game. Dota 2 is a tense war of accumulation and attrition. Th

In [10]:
prompt = \
'''You are a gamer who is reading reviews of a game to understand its characteristics and then decide whether or not to purchase it. Generate one short sentence for each of the seven aspects in ['Gameplay', 'Audio', 'Graphics', 'Community', 'Performance', 'Bug', 'Suggestion']. Output them in JSON format as {'ASPECT': 'SUMMARY'}. Output 'NA' as the 'SUMMARY' if the reviews do not contain content related to that 'ASPECT'. Do not output anything except the JSON.'''

# chain.invoke replaces the deprecated direct call chain(prompt)
llm_response = chain.invoke(prompt)
process_llm_response(llm_response)


{
"ASPECT": "Gameplay",
"SUMMARY": "Dota 2 is a complex and rewarding game that requires dedication and teamwork, with a steep learning curve and exciting matches."
}

{
"ASPECT": "Audio",
"SUMMARY": "The game features a variety of sounds and music that enhance the overall experience, but there is no detailed information on sound quality or music composition."
}

{
"ASPECT": "Graphics",
"SUMMARY": "Dota 2's graphics are impressive, with detailed character models and environments, but there is no mention of the resolution or frame rate."
}

{
"ASPECT": "Community",
"SUMMARY": "The game has a large and active community, with many pro players and teams, but there is no information on the community's overall quality or behavior."
}

{
"ASPECT": "Performance",
"SUMMARY": "There is no information on the game's performance, such as frame rate or load times, in the reviews provided."
}

{
"ASPECT": "Bug",
"SUMMARY": "The review does not mention any bugs or technical issues with the game."
}

{

---

In [21]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_core.prompts import PromptTemplate

# system_template = \
# '''You are a reviewer of the game. Use the following pieces of context to answer any question about the game.
# If you don't know the answer, just say 'NA'. Do NOT try to make up an answer.
# ---
# {context}'''

# Prompt: e.g. write a summary of the game covering some predefined aspects:
# Gameplay, Graphics, Sound, Performance, Bug, Suggestion, Price, Overall

# TODO: fine-tune the prompt to use the theory I stated below
prompt_template = \
'''You are reading reviews of a game to understand the characteristics of the game. Use the following pieces of context to answer the user's question.

{summaries}

Question: {question}

If you don't know the answer, output a JSON in which every value is 'NA'. Do NOT try to make up an answer.
Only output the JSON. Do NOT output other text.'''

my_question = \
'''Extract the following aspects of the game from the reviews. Output a json with each of the aspects as key, and the extracted information as the value. The format of the json is {"ASPECT":"INFORMATION"}. The aspects are: ['Gameplay', 'Narrative', 'Accessibility', 'Sound', 'Graphics & Art Design', 'Performance', 'Bug', 'Suggestion', 'Price', 'Overall']
'''

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=prompt_template,
            input_variables=["summaries", "question"],
        )
    },
    return_source_documents=True,
)

In [22]:
response = chain.invoke(
    {
        'question': my_question
    }
)

response

{'question': 'Extract the following aspects of the game from the reviews. Output a json with each of the aspects as key, and the extracted information as the value. The format of the json is {"ASPECT":"INFORMATION"}. The aspects are: [\'Gameplay\', \'Narrative\', \'Accessibility\', \'Sound\', \'Graphics & Art Design\', \'Performance\', \'Bug\', \'Suggestion\', \'Price\', \'Overall\']\n',
 'answer': 'Here is the extracted information from the reviews:\n\n{\n"Gameplay": "Dota 2 is complicated, exhausting, and sometimes cruel, but its many complexities form an incredibly satisfying and exciting multiplayer game.",\n"Narrative": "There are few games as worthy of your time investment as this.",\n"Accessibility": "This is a free-to-play game that feels generous, with profits being dished out to item creators and the eSports teams, and Valve is already in the habit of running seasonal events that award you with heaps of new items.",\n"Sound": "The experience of playing Dota changes day by day

In [13]:
print(response['answer'])

{
"Gameplay": "Dota 2 is complicated, exhausting, and sometimes cruel, but its many complexities form an incredibly satisfying and exciting multiplayer game."
"Graphics": "Intricate map offers a wealth of strategic possibilities"
"Sound": "Matches are incredibly exciting and unpredictable"
"Performance": "Will take over your life"
"Bug": "NA"
"Suggestion": "NA"
"Price": "Free-to-play business model is fair"
"Overall": "Superb"
}
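
The answer printed above is not valid JSON because the commas between entries are missing, so `json.loads` would reject it. A hedged sketch of a tolerant fallback that reads one `"key": "value"` pair per line when strict parsing fails; `parse_loose_json` is a hypothetical helper, it only handles flat objects with string values, and it does not unescape quotes inside values:

```python
import json
import re

# One "key": "value" pair per line, with an optional trailing comma.
PAIR = re.compile(r'^\s*"([^"]+)"\s*:\s*"(.*)"\s*,?\s*$')

def parse_loose_json(text):
    """Parse a flat JSON object of string values; if strict parsing fails,
    fall back to reading one "key": "value" pair per line."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    result = {}
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result
```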


---

Chain of thought (break a large task into smaller tasks)

- Prompt the LLM separately for each aspect of the game.
- Then ask the LLM to condense those answers into a single, better-summarized JSON.

This performs better, because each prompt lets the model concentrate on the knowledge relevant to a single task.
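
The two steps above can be sketched independently of any particular model. `summarize_by_aspect`, `ask_about_aspect` and `ask_for_summary` are hypothetical names: the two callables stand in for the retrieval chain and the final summarizing prompt, and the lambdas below are toy stand-ins, not real model calls.

```python
def summarize_by_aspect(aspects, ask_about_aspect, ask_for_summary):
    """Two-step flow: one focused question per aspect, then one final
    call that condenses all per-aspect answers."""
    per_aspect = {aspect: ask_about_aspect(aspect) for aspect in aspects}
    return per_aspect, ask_for_summary(per_aspect)

# Toy stand-ins so the flow can be run without a model.
answers, summary = summarize_by_aspect(
    ["Gameplay", "Sound"],
    ask_about_aspect=lambda aspect: f"notes on {aspect}",
    ask_for_summary=lambda collected: ", ".join(sorted(collected)),
)
print(answers)
print(summary)
```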

In [14]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_core.prompts import PromptTemplate

# system_template = \
# '''You are a reviewer of the game. Use the following pieces of context to answer any question about the game.
# If you don't know the answer, just say 'NA'. Do NOT try to make up an answer.
# ---
# {context}'''

# Prompt: e.g. write a summary of the game covering some predefined aspects:
# Gameplay, Graphics, Sound, Performance, Bug, Suggestion, Price, Overall

# TODO: fine-tune the prompt to use the theory I stated below

prompt_template = \
'''You are reading reviews of a game to understand the characteristics of the game. Use the following pieces of context to answer the user's question.

{summaries}

Question: {question}

If you don't know the answer, output only "NA". Do NOT try to make up an answer. Do NOT output other text.'''

my_question_template = \
'''Extract the following aspect of the game from the reviews. Output a paragraph of fewer than 200 words. The aspect is: '''

aspects = ['Gameplay', 'Narrative', 'Accessibility', 'Sound', 'Graphics & Art Design', 'Performance', 'Bug', 'Suggestion', 'Price', 'Overall']
aspects_response = {k: '' for k in aspects}

# the chain does not depend on the aspect, so build it once outside the loop
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=prompt_template,
            input_variables=["summaries", "question"],
        )
    },
    return_source_documents=True,
)

for aspect in aspects:
    my_question = my_question_template + aspect
    print(my_question)

    response = chain.invoke({'question': my_question})

    print(response)
    print('\n\n')
    print(response['answer'])
    aspects_response[aspect] = response['answer']
    print('\n\n')
    print(response['source_documents'])

    print('\n\n\n')

Extract the the following aspect of the game from the reviews. Output a paragraph with less than 200 words. The aspect is: Gameplay


{'question': 'Extract the the following aspect of the game from the reviews. Output a paragraph with less than 200 words. The aspect is: Gameplay', 'answer': 'Gameplay: According to the reviews, Dota 2 has a complex and satisfying gameplay experience with near-infinite depth and variety that rewards dedication and teamwork. The game can be confusing, infuriating, disorienting, and unforgiving at times, but this is part of its appeal. Players must learn to effectively juggle both broad strokes and finer details to succeed in the game. Despite being notorious for its difficulty, Dota 2 has become a world-class competitive game at the highest levels.', 'sources': '', 'source_documents': [Document(page_content="Dota 2 review\nBy Arthur Gies  Aug 15, 2017, 4:45pm EDT\n\nccordingAccording to Steam, I have spent 4,749 hours in Dota 2.\n\nFor perspective, this is approximately 198 days, which is more than 28 weeks, and more than six months. That's more than I've played any other game, ever. So

In [15]:
for k, v in aspects_response.items():
    print(k)
    print(v)
    print('\n\n')

Gameplay
Gameplay: According to the reviews, Dota 2 has a complex and satisfying gameplay experience with near-infinite depth and variety that rewards dedication and teamwork. The game can be confusing, infuriating, disorienting, and unforgiving at times, but this is part of its appeal. Players must learn to effectively juggle both broad strokes and finer details to succeed in the game. Despite being notorious for its difficulty, Dota 2 has become a world-class competitive game at the highest levels.



Narrative
NA. None of the reviews provide information about the narrative of Dota 2.



Accessibility
Accessible: NA

The reviewers mention that Dota 2 is complicated, exhausting, and sometimes cruel, but they also acknowledge that it offers a wealth of customization options, including new taunts, announcers, and HUD skins created by the community. However, they do not mention accessibility as a particular aspect of the game. Therefore, the answer is "NA".



Sound
NA. According to the 
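
Several answers above do not follow the "output only NA" instruction: they start with "NA." or "Accessible: NA" and then add an explanation. One option is to collapse such answers to a bare "NA" before they are passed on. This is a heuristic sketch only; `normalize_na` is a hypothetical helper and the pattern is an assumption based on the outputs shown here.

```python
import re

# First line is "NA", optionally behind a "Label:" prefix such as "Accessible:".
NA_LINE = re.compile(r'^(?:[\w &]+:\s*)?NA\b')

def normalize_na(answer):
    """Collapse answers whose first line is 'NA' (optionally behind a
    'Label:' prefix) to the bare marker 'NA'; return others stripped."""
    stripped = answer.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return "NA" if NA_LINE.match(first_line) else stripped
```

It could be applied as `aspects_response = {k: normalize_na(v) for k, v in aspects_response.items()}` before the summarizing step.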

In [16]:
str(aspects_response)

'{\'Gameplay\': \'Gameplay: According to the reviews, Dota 2 has a complex and satisfying gameplay experience with near-infinite depth and variety that rewards dedication and teamwork. The game can be confusing, infuriating, disorienting, and unforgiving at times, but this is part of its appeal. Players must learn to effectively juggle both broad strokes and finer details to succeed in the game. Despite being notorious for its difficulty, Dota 2 has become a world-class competitive game at the highest levels.\', \'Narrative\': \'NA. None of the reviews provide information about the narrative of Dota 2.\', \'Accessibility\': \'Accessible: NA\\n\\nThe reviewers mention that Dota 2 is complicated, exhausting, and sometimes cruel, but they also acknowledge that it offers a wealth of customization options, including new taunts, announcers, and HUD skins created by the community. However, they do not mention accessibility as a particular aspect of the game. Therefore, the answer is "NA".\', 
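
`str()` on a dict produces a Python repr with single quotes and escaped characters, as visible above. If the per-aspect answers are going to be pasted into a prompt, `json.dumps` gives a cleaner rendering that is also valid JSON. A small sketch with made-up sample values (not taken from the reviews):

```python
import json

# Made-up sample values, for illustration only.
sample = {"Gameplay": "Complex but rewarding.", "Narrative": "NA"}

as_repr = str(sample)                                        # Python repr, single-quoted
as_json = json.dumps(sample, indent=2, ensure_ascii=False)   # valid JSON, double-quoted

print(as_repr)
print(as_json)
```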

In [17]:
from langchain_core.prompts import ChatPromptTemplate

system_template = \
'''You are reading reviews of a game to understand the characteristics of the game. Use the following pieces of context to answer the user's question.
'''

summary_template = \
'''Extract the following aspects of the game from the reviews, and write a short description of about 20 words for each aspect. The aspects are: [Gameplay, Graphics, Sound, Performance, Bug, Suggestion, Price, Overall]. Output a JSON with each of the aspects as key, and the information as the value. Only output the JSON. Do NOT output other text.

The context is wrapped by three consecutive apostrophes. The context is as follows:
\'\'\'
{context}
\'\'\'
'''

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", summary_template),
])

chain = chat_prompt | llm
response = chain.invoke({"context": str(aspects_response)})

print(response)

Here is the JSON output based on the reviews provided:

{
"Gameplay": "Complex and satisfying gameplay experience with near-infinite depth and variety that rewards dedication and teamwork.",
"Graphics & Art Design": "Crisp and readable, making it possible to tell what's going on even in a massive brawl...",
"Accessibility": "NA",
"Sound": "NA",
"Performance": "NA",
"Bug": "NA",
"Suggestion": "Yes",
"Price": "FREE" ,
"Overall": "A challenging and rewarding game that offers a unique and exciting multiplayer experience."
}
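
The keys in the output above do not match the aspects requested in the prompt: "Graphics" is missing, while "Graphics & Art Design" and "Accessibility" were not asked for. A short sketch of a check that makes such drift visible; `check_aspect_keys` is a hypothetical helper:

```python
def check_aspect_keys(result, expected):
    """Report which expected aspects are missing and which keys are extra."""
    missing = [aspect for aspect in expected if aspect not in result]
    extra = [key for key in result if key not in expected]
    return missing, extra

expected = ["Gameplay", "Graphics", "Sound", "Performance",
            "Bug", "Suggestion", "Price", "Overall"]
# Keys as they appear in the response above (values omitted).
returned = dict.fromkeys(
    ["Gameplay", "Graphics & Art Design", "Accessibility", "Sound",
     "Performance", "Bug", "Suggestion", "Price", "Overall"], "")

print(check_aspect_keys(returned, expected))
```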


In [20]:
from langchain_core.prompts import ChatPromptTemplate

system_template = \
'''You are reading reviews of a game to understand the characteristics of the game. Use the following pieces of context to answer the user's question.
'''

summary_template = \
'''Extract the following aspects of the game from the reviews, and provide a list of keywords for each aspect, each keyword at most 5 words long. The aspects are: {aspects}. Output a JSON with each of the aspects as key, and the list of keywords as the value. Only output the JSON. Do NOT output other text.

The context is wrapped by three consecutive apostrophes. The context is as follows:
\'\'\'
{context}
\'\'\'
'''

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", summary_template),
])

chain = chat_prompt | llm
response = chain.invoke({"context": str(aspects_response), "aspects": aspects})

print(response)

Here is the JSON output:

{
"Gameplay": ["Complex", "Satisfying", "Rewards dedication and teamwork"],
"Narrative": ["None"],
"Accessibility": ["NA"],
"Sound": ["NA"],
"Graphics & Art Design": ["Crisp", "Readable", "Well-done but not particularly innovative or groundbreaking"],
"Performance": ["NA"],
"Bug": ["NA"],
"Suggestion": ["Yes"],
"Price": ["FREE"] ,
"Overall": ["Complex and unforgiving game that rewards dedication and teamwork with a brilliantly social experience."]
}
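
Despite the "Only output the JSON" instruction, the reply begins with a sentence of preamble, so it cannot be passed straight to `json.loads`. A minimal sketch of pulling the first JSON object out of such a reply using only the standard library; `extract_json_object` is a hypothetical helper:

```python
import json

def extract_json_object(text):
    """Return the first JSON object embedded in text, or None if no
    parseable object is found."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try again from the next opening brace.
        start = text.find("{", start + 1)
    return None
```

Applied to the reply above, `extract_json_object(response)` would skip the "Here is the JSON output:" line and return the dict of keyword lists.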
