# Retrieval and Generation

**Vector Database (Vector DB)**
Resources 
- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores)
  - [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store that stores embedded data and performs similarity search.
    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)

# Preface
## Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)

In [1]:
from importlib.metadata import version
# #!pip install langchain
# # Select langchain to 0.1.3
# try:
#     assert version('langchain') == '0.1.20'
# except:
#     !pip install langchain==0.1.20
# print('langchain package version',version('langchain'))

!pip install --upgrade langchain
print('langchain package version',version('langchain'))

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.12 (from langchain)
  Downloading langchain_core-0.3.13-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.0-py3-none-any.whl.metadata (2.3 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from langchain)
  Downloading pydantic-2.9.2-py3-none-any.whl.metadata (149 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.137-py3-none-any.whl.metadata (13 kB)
Collecting requests-toolbelt<2.0.0,>=1.0.0 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading requests_toolbelt-1.0.0-py2.py3-none-any.whl.metadata (14 kB)
Collecting pydantic-core==2.23.4 (from pydantic<3.0.0,>=2.7.4->langchain)
  Downloading pydantic_core-2.23.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.

In [2]:
# # Select langchain-huggingface to 0.0.3
# try:
#     assert version('langchain-huggingface') == '0.0.3'
# except:
#     !pip install -qU langchain-huggingface==0.0.3
# print('langchain-huggingface package version',version('langchain-huggingface'))

!pip install -qU langchain-huggingface
print('langchain-huggingface version',version('langchain-huggingface'))

langchain-huggingface version 0.1.0


In [3]:
# # Select langchain-chroma to 0.1.3
# try:
#     assert version('langchain_chroma') == '0.1.3'
# except:
#     !pip install -qU langchain_chroma==0.1.3
# print('langchain_chroma package version',version('langchain_chroma'))

# try:
#     assert version('langchain_community') == '0.0.38'
# except:
#     !pip install -qU langchain_community==0.0.38
# print('langchain_community package version',version('langchain_community'))

!pip install -qU langchain_chroma
print('langchain_chroma version',version('langchain_chroma'))
!pip install -qU langchain_community
print('langchain_community version',version('langchain_community'))

langchain_chroma version 0.1.4
langchain_community version 0.3.3


In [4]:
# try:
#     assert version('llama-cpp-python') == '0.2.74'
# except:
#     !pip install -qU llama-cpp-python==0.2.74
# print('llama-cpp-python package version',version('llama-cpp-python'))

# !pip install -qU llama-cpp-python
# print('llama-cpp-python package version',version('llama-cpp-python'))

In [5]:
# !pip install datamodel_code_generator
# print('datamodel_code_generator package version',version('datamodel_code_generator'))

In [6]:
# OpenAI 
# Update OpenAI to 1.42.0
try:
    print('openai package version',version('openai'))
    assert version('openai') == '1.42.0'
except:
    !pip install openai==1.42.0

openai package version 1.30.1
Defaulting to user installation because normal site-packages is not writeable
Collecting openai==1.42.0
  Downloading openai-1.42.0-py3-none-any.whl.metadata (22 kB)
Collecting jiter<1,>=0.4.0 (from openai==1.42.0)
  Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.42.0-py3-none-any.whl (362 kB)
Downloading jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
Installing collected packages: jiter, openai
[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
crewai 0.30.11 requires langchain<0.2.0,>=0.1.10, but you have langchain 0.3.4 which is incompatible.
embedchain 0.1.110 requires langchain<0.2.0,>=0.1.4, but you have langchain 0.3.4 which is incompatible.
langchain-openai 0.1.7 requires langchain-core<0.3,>=0.1.46, but you hav

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
os.getcwd()

'/work/files/workspace'

# Connect to VectorDB & LLM Agent
## Connect to VectorDB (Chroma)

In [8]:
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint

collection_name = "collection_postings"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
persistent_client = chromadb.PersistentClient()

if collection_name in persistent_client.list_collections()[0].name:
    print(f"Collection '{collection_name}' exists!")
    # Get the existing collection
    # vector_store = persistent_client.get_collection(collection_name)
    vector_store = Chroma(client=persistent_client,
                          collection_name=collection_name,
                          embedding_function=embeddings)
else:
    print(f"Collection '{collection_name}' does not exist!")

Collection 'collection_postings' exists!


In [9]:
# # Use the `as_retriever()` function to use it as a retriever in LangChain
# retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 2}) #search_kwargs={"k": 2, "fetch_k": 50}

# retriever

## Connect to Agent (Call OpenAI API)

In [10]:
import openai

#initiate the OpenAI client using the API key
openai_api_key = os.environ["OPENAI_API_KEY"]
client = openai.OpenAI(api_key=openai_api_key)
client

<openai.OpenAI at 0x7fa994d91750>

## Need modification !!!!!

# Retrieval and Generation Application

## Prepare Prompt

In [11]:
# extraction_prompt = ''' You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
#     1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
#     2. Specification: The job post information (enclosed in <specification> tag below) that might best meets your client's requirements

# Upon receiving your aforementioned information, you need to proceed with the following precedures:
# Step 1. Analyze your client's abilities, including hard and soft skills.
# Step 2. Analyze the skills needed for the best possible jobs in the job specification
# Step 3. Summarize your client's strengths that are already sufficient for the job application.
# Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
# Step 5. Finally, give them advice how to get the jobs mentioned in job specification according the reasoning above.

# To give your client a professional advice, you MUST give the following feedback:
# 1. Job Position: the best possible job position or title you suggest your client to pursue.
# 2. Strengths: your client's strengths compared to the job posts
# 3. Weaknesses: your client's weaknesses compared to the job posts
# 4. Strateries: the methods you suggest to get the jobs mentioned in job posts 

# FINAL note:
# 1. If you cannot find the relevant informaiton in client's question or job specification for your reasoning, just leave it blank (""). 
# 2. Always give advice according to the information given to you (Question and Job Specification), DO NOT make up answer other than those information!

# Question:
#     <query>{query}</query>
# Job Post Information:
#     <specification>{specification}</specification>
# Advice:
# '''

In [12]:
extraction_prompt = ''' You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meets your client's requirements

Upon receiving your aforementioned information, you need to proceed with the following precedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs mentioned in job specification according the reasoning above. 

Question:
    <query>{query}</query>
Job Post Information:
    <specification>{specification}</specification>
Advice:
'''

## Preprare Input Query

In [22]:
query = "I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. I am now seeking an entry-level data scientist or data analyst role."

## Search Results based on Query

In [14]:
# results = retriever.invoke(query) #filter={"source": "news"}
# results

In [23]:
results = vector_store.similarity_search_with_score(
    query , k=5, #filter={"title": {"$in": keywords}}
)
i=0
specification = ""
for res, score in results:
    print(f"[{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    specification += ('Title: ' + res.metadata['title'] +'\n ' + res.page_content)
    i+=1

[0][SIM=0.641125] Data Scientist (6+ years) (Fulltime)
---------------------
           Job Title: Data ScientistLocation: Bentonville, AR (Onsite)Fulltime  Mode of interview: Video Call Must have skills : AI/ML models using Google Cloud Platform Relevant Experience: 6+ years Education: Bachelor’s Degree or above  Roles & Responsibilities · Proven experience in deploying real-time AI/ML models using Google Cloud Platform.· Strong programming skills in Python and PySpark.· Proficiency with SQL and relational databases, data warehouses, and BigQuery.· Experience in scaling marketing-related AI/ML solutions such as cross/upsell, recommended systems, and category propensity.· Experience in deploying and managing Large scale Machine Learning Models is a plus· Expertise with classical ML algorithm like K-NN, LSH, logistic regression, linear regression, SVM, Random forest and clustering.· Good understanding of ML & DL algorithms and frameworks (Scikit-learn,Spacy, Tensorflow/Keras/ PyTorch)· 

In [24]:
print(specification)

Title: Data Scientist (6+ years) (Fulltime)
 Job Title: Data ScientistLocation: Bentonville, AR (Onsite)Fulltime  Mode of interview: Video Call Must have skills : AI/ML models using Google Cloud Platform Relevant Experience: 6+ years Education: Bachelor’s Degree or above  Roles & Responsibilities · Proven experience in deploying real-time AI/ML models using Google Cloud Platform.· Strong programming skills in Python and PySpark.· Proficiency with SQL and relational databases, data warehouses, and BigQuery.· Experience in scaling marketing-related AI/ML solutions such as cross/upsell, recommended systems, and category propensity.· Experience in deploying and managing Large scale Machine Learning Models is a plus· Expertise with classical ML algorithm like K-NN, LSH, logistic regression, linear regression, SVM, Random forest and clustering.· Good understanding of ML & DL algorithms and frameworks (Scikit-learn,Spacy, Tensorflow/Keras/ PyTorch)· Experience in deep learning Algorithm s lik

## Get Final Response

In [25]:
prompt_all = extraction_prompt.format(query=query, specification=specification)
print(prompt_all)

 You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer


Upon receiving your aforementioned information, you need to proceed with the following precedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs.

Question:
    <query>I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to

In [26]:
import tiktoken

# Define a function to count tokens for a given prompt and model
def count_tokens(text, model="gpt-3.5-turbo-instruct"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Count the number of tokens in the prompt
prompt_tokens = count_tokens(prompt_all);print(f"total prompt tokens = {prompt_tokens}")

# Token limit for gpt-3.5-turbo-instruct
token_limit = 4097

# Ensure the total tokens (prompt + response) is within the limit
# Assume you want the model to generate a maximum of 1000 tokens in the response
response_max_tokens = 1000
if prompt_tokens + response_max_tokens > token_limit:
    print('total token size exceeds limit, start trimming!')
    # Calculate the allowable prompt length
    max_prompt_tokens = token_limit - response_max_tokens

    # Trim the prompt to fit within the token limit
    trimmed_prompt = prompt_all[:max_prompt_tokens]

    # Notify user about trimming
    print(f"Prompt trimmed from {prompt_tokens} to {max_prompt_tokens} tokens.")
    print("final prompt_all:\n",prompt_all)

    # Update the prompt with the trimmed version
    prompt_all = trimmed_prompt
else:
    print('total token size doesn\'t  exceeds limit, good job!')



total prompt tokens = 245
total token size doesn't  exceeds limit, good job!


In [27]:
response = client.completions.create(model="gpt-3.5-turbo-instruct",  
                                     prompt=prompt_all,
                                     max_tokens=response_max_tokens) 
print(response.choices[0].text)

Step 1. Analyze your client's abilities, including hard and soft skills.

Your client has a Bachelor degree in Computer Science with a focus on machine learning and deep learning. They also have experience working on projects that involve building predictive models and analyzing large datasets. Based on this, it can be inferred that your client has strong analytical and problem-solving skills, proficiency in programming languages such as Python, and the ability to handle and make sense of large amounts of data.

Step 2. Analyze and summarize the skills needed for the best possible jobs

For entry-level data scientist or data analyst roles, the following skills are typically required:
1. Proficiency in programming languages such as Python, R, and SQL
2. Knowledge of data analysis and machine learning techniques
3. Understanding of statistics and probability
4. Ability to work with large datasets
5. Problem-solving and critical thinking skills
6. Attention to detail and accuracy
7. Effec

# What If: Generation without Application

In [28]:
extraction_prompt = ''' You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer


Upon receiving your aforementioned information, you need to proceed with the following precedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs.

Question:
    <query>{query}</query>

Advice:
'''

prompt_all = extraction_prompt.format(query=query)

In [29]:
response = client.completions.create(model="gpt-3.5-turbo-instruct",  
                                     prompt=prompt_all,
                                     max_tokens=response_max_tokens) 
print(response.choices[0].text)

    Step 1. Your client's hard skills include a Bachelor degree in Computer Science and proficiency in Python, machine learning and deep learning, as well as experience in building predictive models and analyzing large datasets.

    Step 2. The skills needed for an entry-level data scientist or data analyst role include knowledge of programming languages such as Python, R or SQL, knowledge of statistics and machine learning, and experience in data analysis and data visualization.

    Step 3. Your client's strengths that are already sufficient for the job application include their Bachelor degree in Computer Science and their proficiency in Python, machine learning and deep learning.

    Step 4. Your client's weaknesses that they need to improve on in order to meet the job requirements include their lack of experience in other programming languages such as R or SQL, and their need to gain more experience in data analysis and data visualization.

    Step 5. My advice for your client 