# Retrieval and Generation

**Vector Database (Vector DB)**
Resources
- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores)
  - [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store that stores embedded data and performs similarity search.
    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)

# Preface
## Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)

In [None]:
from importlib.metadata import version
# #!pip install langchain
# # Select langchain to 0.1.20
# try:
#     assert version('langchain') == '0.1.20'
# except:
#     !pip install langchain==0.1.20
# print('langchain package version',version('langchain'))

!pip install --upgrade langchain
print('langchain package version',version('langchain'))

langchain package version 0.3.4


In [None]:
# # Select langchain-huggingface to 0.0.3
# try:
#     assert version('langchain-huggingface') == '0.0.3'
# except:
#     !pip install -qU langchain-huggingface==0.0.3
# print('langchain-huggingface package version',version('langchain-huggingface'))

!pip install -qU langchain-huggingface
print('langchain-huggingface version',version('langchain-huggingface'))

langchain-huggingface version 0.1.0


In [None]:
# # Select langchain-chroma to 0.1.3
# try:
#     assert version('langchain_chroma') == '0.1.3'
# except:
#     !pip install -qU langchain_chroma==0.1.3
# print('langchain_chroma package version',version('langchain_chroma'))

# try:
#     assert version('langchain_community') == '0.0.38'
# except:
#     !pip install -qU langchain_community==0.0.38
# print('langchain_community package version',version('langchain_community'))

!pip install -qU langchain_chroma
print('langchain_chroma version',version('langchain_chroma'))
!pip install -qU langchain_community
print('langchain_community version',version('langchain_community'))

langchain_chroma version 0.1.4
langchain_community version 0.3.3


In [None]:
# try:
#     assert version('llama-cpp-python') == '0.2.74'
# except:
#     !pip install -qU llama-cpp-python==0.2.74
# print('llama-cpp-python package version',version('llama-cpp-python'))

# !pip install -qU llama-cpp-python
# print('llama-cpp-python package version',version('llama-cpp-python'))

In [None]:
# !pip install datamodel_code_generator
# print('datamodel_code_generator package version',version('datamodel_code_generator'))

In [None]:
# OpenAI
# Pin openai to 1.42.0
try:
    print('openai package version', version('openai'))
    assert version('openai') == '1.42.0'
except Exception:  # package missing, or installed at a different version
    !pip install openai==1.42.0

openai package version 1.42.0
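
The commented-out cells above all follow the same pattern: compare the installed version against a pin and reinstall on a mismatch. A minimal sketch of that check as a reusable helper follows. `is_pinned` is a hypothetical name, not part of any library, and it only reports the mismatch, so the actual `pip install` step stays in the notebook cell.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pinned(package: str, expected: str) -> bool:
    """Return True only if `package` is installed at exactly `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed at all
        return False


# Decide whether a reinstall is needed (pin taken from the cell above)
if not is_pinned("openai", "1.42.0"):
    print("openai is missing or not at 1.42.0; run: pip install openai==1.42.0")
```

Catching `PackageNotFoundError` explicitly separates "not installed" from other failures, which a bare `except:` would hide.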


In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
os.getcwd()

'/content'

In [None]:
!dir

chroma	sample_data


# Connect to VectorDB & LLM Agent
## Connect to VectorDB (Chroma)

In [None]:
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint

collection_name = "collection_postings"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
persistent_client = chromadb.PersistentClient()
print(persistent_client.list_collections())

vector_store = Chroma(client=persistent_client,
                      collection_name=collection_name,
                      embedding_function=embeddings)

# try:
#   if collection_name in persistent_client.list_collections()[0].name:
#       print(f"Collection '{collection_name}' exists!")
#       # Get the existing collection
#       # vector_store = persistent_client.get_collection(collection_name)
#       vector_store = Chroma(client=persistent_client,
#                             collection_name=collection_name,
#                             embedding_function=embeddings)
# except:
#     print(f"Collection '{collection_name}' does not exist!")

[Collection(id=9799cf17-fa1a-462c-818a-b8625701e935, name=collection_postings)]
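
The commented-out existence check above looks only at the first entry of `list_collections()` and uses a substring test, so it can miss a collection or match the wrong one. A minimal sketch of a check over every entry follows. `collection_exists` is a hypothetical helper, and it assumes `list_collections()` may return either objects with a `.name` attribute (as in the output above) or plain name strings, depending on the chromadb release.

```python
from types import SimpleNamespace


def collection_exists(collections, name: str) -> bool:
    """Check every listed collection for an exact name match."""
    for item in collections:
        # Accept either an object exposing `.name` or a bare string
        if getattr(item, "name", item) == name:
            return True
    return False


# Stand-ins for what persistent_client.list_collections() might return
as_objects = [SimpleNamespace(name="other"), SimpleNamespace(name="collection_postings")]
as_names = ["other", "collection_postings"]

print(collection_exists(as_objects, "collection_postings"))
print(collection_exists(as_names, "postings"))
```

In the notebook this would be called as `collection_exists(persistent_client.list_collections(), collection_name)`.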




In [None]:
# Inspect the data stored in vector_store

# Fetch everything once; by default only ids, documents and metadatas are returned
documents = vector_store.get()

# Print the documents
print(documents)

# Pull out the stored vectors and ids under new names, so the `embeddings`
# model passed to Chroma above is not overwritten
stored_embeddings = documents['embeddings']
ids = documents['ids']

# Print the embeddings (None unless requested with include=["embeddings"])
print(stored_embeddings)

# Print the ids
ids

{'ids': [], 'embeddings': None, 'documents': [], 'uris': None, 'data': None, 'metadatas': [], 'included': [<IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}
None


[]
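
The dictionary printed above is easier to read as a short summary. The sketch below works on the same dictionary shape; `summarize_store` is a hypothetical helper, and the sample dictionary is a simplified copy of the output above rather than a live call to `vector_store.get()`.

```python
def summarize_store(dump: dict) -> dict:
    """Condense the result of vector_store.get() into a few facts."""
    ids = dump.get("ids") or []
    return {
        "n_documents": len(ids),
        # Embeddings are None unless they were explicitly requested
        "has_embeddings": dump.get("embeddings") is not None,
        "is_empty": len(ids) == 0,
    }


# Simplified copy of the output shown above
dump = {"ids": [], "embeddings": None, "documents": [], "metadatas": []}
summary = summarize_store(dump)
print(summary)
```

An empty `ids` list means the collection currently holds no documents, so a later similarity search would have nothing to return.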


In [None]:
# # Use the `as_retriever()` function to use it as a retriever in LangChain
# retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2}) #search_kwargs={"k": 2, "fetch_k": 50}

# retriever

## Connect to Agent (Call OpenAI API)

In [None]:
import openai
from google.colab import userdata
#initiate the OpenAI client using the API key
# openai_api_key = os.environ["OPENAI_API_KEY"]
openai_api_key = userdata.get('OPENAI_API_KEY')
client = openai.OpenAI(api_key=openai_api_key)
client

<openai.OpenAI at 0x78739c66ee00>

## Need modification !!!!!

# Retrieval and Generation Application

## Prepare Prompt

In [None]:
# extraction_prompt = ''' You are a career counselor who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
#     1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer
#     2. Specification: The job post information (enclosed in the <specification> tag below) that might best meet your client's requirements

# Upon receiving the information above, proceed with the following procedure:
# Step 1. Analyze your client's abilities, including hard and soft skills.
# Step 2. Analyze the skills needed for the best possible jobs in the job specification
# Step 3. Summarize your client's strengths that are already sufficient for the job application.
# Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
# Step 5. Finally, give them advice on how to get the jobs mentioned in the job specification according to the reasoning above.

# To give your client professional advice, you MUST give the following feedback:
# 1. Job Position: the best possible job position or title you suggest your client pursue.
# 2. Strengths: your client's strengths compared to the job posts
# 3. Weaknesses: your client's weaknesses compared to the job posts
# 4. Strategies: the methods you suggest to get the jobs mentioned in the job posts

# FINAL note:
# 1. If you cannot find the relevant information in the client's question or job specification for your reasoning, just leave it blank ("").
# 2. Always give advice according to the information given to you (Question and Job Specification). DO NOT make up answers beyond that information!

# Question:
#     <query>{query}</query>
# Job Post Information:
#     <specification>{specification}</specification>
# Advice:
# '''

In [None]:
extraction_prompt = ''' You are a career counselor who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in the <specification> tag below) that might best meet your client's requirements

Upon receiving the information above, proceed with the following procedure:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs mentioned in the job specification according to the reasoning above.

Question:
    <query>{query}</query>
Job Post Information:
    <specification>{specification}</specification>
Advice:
'''

## Prepare Input Query

In [None]:
query = "I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. I am now seeking an entry-level data scientist or data analyst role."

## Search Results based on Query

In [None]:
# results = retriever.invoke(query) #filter={"source": "news"}
# results

In [None]:
# Chroma returns a distance with each hit: lower scores mean closer matches
results = vector_store.similarity_search_with_score(
    query, k=5,  # filter={"title": {"$in": keywords}}
)
specification = ""
for i, (res, score) in enumerate(results):
    print(f"[{i}][SIM={score:.3f}] {res.metadata['title']}\n---------------------\n"
          f"{res.page_content}\n--------------------\n"
          f"[{res.metadata}]\n\n")
    # Append each posting on its own lines so consecutive posts do not run together
    specification += 'Title: ' + res.metadata['title'] + '\n' + res.page_content + '\n\n'

In [None]:
print(specification)
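
The step that turns search hits into the `specification` string can be separated from the printing and tested on its own. The sketch below assumes each hit is a `(document, score)` pair where the document has `page_content` and a `metadata` dict with a `title` key, as in the loop above. `Doc` is a stand-in class and `build_specification` is a hypothetical helper; neither comes from LangChain, and the two sample postings are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_specification(results) -> str:
    """Join the retrieved postings into one block, one posting per paragraph."""
    blocks = []
    for doc, _score in results:
        title = doc.metadata.get("title", "")
        blocks.append(f"Title: {title}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Invented sample hits, for illustration only
sample = [
    (Doc("Build dashboards.", {"title": "Data Analyst"}), 0.41),
    (Doc("Train models.", {"title": "Data Scientist"}), 0.57),
]
print(build_specification(sample))
```

With no hits the helper returns an empty string, which makes an empty retrieval easy to detect before the prompt is built.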




## Get Final Response

In [None]:
prompt_all = extraction_prompt.format(query=query, specification=specification)
print(prompt_all)

 You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meets your client's requirements

Upon receiving your aforementioned information, you need to proceed with the following precedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs mentioned in job specification according t

In [None]:
import tiktoken

# Count the tokens a given text occupies for a given model
def count_tokens(text, model="gpt-3.5-turbo-instruct"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Count the number of tokens in the prompt
prompt_tokens = count_tokens(prompt_all)
print(f"total prompt tokens = {prompt_tokens}")

# Token limit for gpt-3.5-turbo-instruct (prompt + response)
token_limit = 4097

# Ensure the total tokens (prompt + response) stay within the limit
# Assume the model may generate a maximum of 1000 tokens in the response
response_max_tokens = 1000
if prompt_tokens + response_max_tokens > token_limit:
    print('total token size exceeds limit, start trimming!')
    # Number of tokens the prompt may occupy
    max_prompt_tokens = token_limit - response_max_tokens

    # Trim by tokens, not characters: encode, slice the token list, decode
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
    prompt_all = encoding.decode(encoding.encode(prompt_all)[:max_prompt_tokens])

    # Notify user about trimming and show the trimmed prompt
    print(f"Prompt trimmed from {prompt_tokens} to {max_prompt_tokens} tokens.")
    print("final prompt_all:\n", prompt_all)
else:
    print("total token size doesn't exceed limit, good job!")



total prompt tokens = 300
total token size doesn't exceed limit, good job!
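
Trimming to a token budget means slicing the token list rather than the string, because `text[:n]` counts characters. Cutting the finished prompt from the end also removes the closing tags and the final `Advice:` cue, so it may be preferable to trim only the retrieved text before formatting. The sketch below shows that idea with a tokenizer passed in as two functions. `trim_to_tokens` is a hypothetical helper, and the whitespace tokenizer is a toy stand-in so the example runs without tiktoken.

```python
def trim_to_tokens(text, max_tokens, encode, decode):
    """Return `text` cut to at most `max_tokens` tokens."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


# Toy tokenizer: one token per whitespace-separated word
def toy_encode(text):
    return text.split()


def toy_decode(tokens):
    return " ".join(tokens)


token_limit = 4097          # values taken from the cell above
response_max_tokens = 1000
max_prompt_tokens = token_limit - response_max_tokens

long_text = "word " * 5000
trimmed = trim_to_tokens(long_text, max_prompt_tokens, toy_encode, toy_decode)
print(len(toy_encode(trimmed)))
```

With tiktoken, the same helper would take `encoding.encode` and `encoding.decode` from `tiktoken.encoding_for_model(...)`. The budget passed in would also need to leave room for the fixed instruction text of the prompt.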


In [None]:
response = client.completions.create(model="gpt-3.5-turbo-instruct",
                                     prompt=prompt_all,
                                     max_tokens=response_max_tokens)
print(response.choices[0].text)


Step 1. Analyze your client's abilities, including hard and soft skills.
Based on the information provided, your client has a Bachelor's degree in Computer Science with a focus on data analysis and machine learning. They have strong skills in Python, including building predictive models and analyzing large datasets. They also have soft skills such as problem-solving and critical thinking.

Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
From the job specification, the desired skills for an entry-level data scientist or data analyst position include knowledge of programming languages such as Python and experience with machine learning and data analysis. They are also looking for candidates with good communication and problem-solving skills.

Step 3. Summarize your client's strengths that are already sufficient for the job application.
Your client's strengths align with the skills listed in the job specification. They have a strong fou
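
The response shown above stops mid-sentence. One possible cause is the `max_tokens` cap, although the captured output may simply have been cut off when the notebook was exported. The Completions API reports `finish_reason == "length"` when generation stops at the cap, so a small wrapper can surface that. The sketch below is an assumption-laden illustration: `complete` is a hypothetical helper, and the fake client exists only so the example runs without an API key.

```python
from types import SimpleNamespace


def complete(client, prompt, max_tokens=1000, model="gpt-3.5-turbo-instruct"):
    """Call the completions endpoint; return (text, was_cut_off_by_max_tokens)."""
    response = client.completions.create(
        model=model, prompt=prompt, max_tokens=max_tokens
    )
    choice = response.choices[0]
    return choice.text.strip(), choice.finish_reason == "length"


# Fake client that mimics the response shape (illustration only)
def _fake_create(model, prompt, max_tokens):
    choice = SimpleNamespace(text=" Step 1. Analyze", finish_reason="length")
    return SimpleNamespace(choices=[choice])


fake_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))

text, truncated = complete(fake_client, "example prompt")
if truncated:
    print("Response hit max_tokens; consider a larger budget or a shorter prompt.")
```

In the notebook, `complete(client, prompt_all, response_max_tokens)` would use the real `client` created earlier.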

# What If: Generation without Retrieval

In [None]:
extraction_prompt = ''' You are a career counselor who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer


Upon receiving the information above, proceed with the following procedure:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs.

Question:
    <query>{query}</query>

Advice:
'''

prompt_all = extraction_prompt.format(query=query)
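
The two prompt variants in this notebook differ mainly in whether the `<specification>` block is present. A minimal sketch of one builder for the variable tail of the prompt follows, which keeps the with-retrieval and without-retrieval runs comparable. `build_prompt_tail` is a hypothetical helper, and it covers only the question and specification part, not the instruction text.

```python
def build_prompt_tail(query, specification=None):
    """Build the Question / Job Post Information / Advice part of the prompt."""
    lines = ["Question:", f"    <query>{query}</query>"]
    if specification:
        # Include the retrieved postings only when there are any
        lines += [
            "Job Post Information:",
            f"    <specification>{specification}</specification>",
        ]
    lines.append("Advice:")
    return "\n".join(lines)


with_retrieval = build_prompt_tail("I know Python.", "Title: Data Analyst")
without_retrieval = build_prompt_tail("I know Python.")
print(without_retrieval)
```

Passing an empty or missing specification yields the no-retrieval variant, so the same call site serves both experiments.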

In [None]:
response = client.completions.create(model="gpt-3.5-turbo-instruct",
                                     prompt=prompt_all,
                                     max_tokens=response_max_tokens)
print(response.choices[0].text)

After analyzing your abilities, I can say that you have a strong foundation for a career in data science. With your Bachelor degree in Computer Science and your proficiency in Python, as well as your good grades in machine learning and deep learning, you possess the necessary technical skills for an entry-level data scientist or data analyst role.

To increase your chances of landing your dream job, I would suggest improving your soft skills such as communication, problem-solving, and teamwork. These skills are highly valued in the data science industry and will make you stand out among other candidates.

Additionally, you can continue to enhance your technical skills by taking online courses or participating in coding workshops to stay updated with the latest technologies and techniques in data science.

I would also recommend networking and attending data science events or conferences to expand your professional connections and gain insights into the industry.

Finally, when applying