# Knowledge Base (KB) with Retrieval Application

**Vector Database (Vector DB) / Vectorstore**

- [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store holds embedded data and performs similarity search over it.

Resources

- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores): How to build a vector DB with LangChain

    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)
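
To make "stores embedded data and performs similarity search" concrete, here is a minimal sketch of the idea in plain Python. It is a toy and not how Chroma, Milvus, or Elasticsearch are implemented: the three-dimensional vectors, the `toy_store` dictionary, and the `top_k` helper are all assumptions made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Rank every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical "embeddings"; a real encoder would produce hundreds of dimensions
toy_store = {
    "data_scientist": [0.9, 0.1, 0.0],
    "data_analyst": [0.8, 0.3, 0.1],
    "receptionist": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

nearest = top_k(query_vec, toy_store, k=2)
for doc_id, score in nearest:
    print(f"{doc_id}: {score:.3f}")
```

A real vector store replaces the exhaustive loop in `top_k` with an approximate nearest-neighbour index so that search stays fast as the collection grows.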

# Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)

In [None]:
from importlib.metadata import version
# !pip install langchain
# Pin langchain to 0.1.20: reinstall if the package is missing or the version differs
try:
    print('langchain package version', version('langchain'))
    assert version('langchain') == '0.1.20'
except Exception:
    !pip install langchain==0.1.20

# !pip install --upgrade langchain
# print('langchain package version',version('langchain'))

langchain package version 0.1.20


In [None]:
#!pip install -qU langchain-huggingface
# Pin langchain-huggingface to 0.0.3: reinstall if the package is missing or the version differs
try:
    print('langchain-huggingface package version', version('langchain-huggingface'))
    assert version('langchain-huggingface') == '0.0.3'  # '0.2.11'
except Exception:
    !pip install langchain-huggingface==0.0.3
    #0.2.11 (if any)

# !pip install -qU langchain-huggingface
# print('langchain-huggingface package version',version('langchain-huggingface'))

langchain-huggingface package version 0.0.3


In [None]:
# Pin langchain-chroma to 0.1.3: reinstall if the package is missing or the version differs
try:
    print('langchain_chroma package version', version('langchain_chroma'))
    assert version('langchain_chroma') == '0.1.3'
except Exception:
    !pip install langchain_chroma==0.1.3

# !pip install -qU langchain_chroma==0.1.3
# print('langchain_chroma package version',version('langchain_chroma'))

langchain_chroma package version 0.1.3
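
The three cells above repeat the same check-then-install pattern. As a sketch, the check itself could be factored into one helper; `is_pinned` is a hypothetical name introduced here, and the installation step is deliberately left out because `!pip` only works inside a notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def is_pinned(package, expected):
    """Return True only if `package` is installed at exactly version `expected`."""
    try:
        return version(package) == expected
    except PackageNotFoundError:
        # The package is not installed at all
        return False

# The pins used in this notebook
pins = [
    ("langchain", "0.1.20"),
    ("langchain-huggingface", "0.0.3"),
    ("langchain-chroma", "0.1.3"),
]
for package, expected in pins:
    status = "ok" if is_pinned(package, expected) else "needs install"
    print(f"{package}=={expected}: {status}")
```

Catching `PackageNotFoundError` explicitly, instead of a bare `except`, keeps unrelated errors visible.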


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
#check current directory
os.getcwd()

'/content'

In [None]:
# check folders / files in current directory
!dir

chroma	drive  postings.csv  sample_data


In [None]:
import langchain_chroma
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Import Data
- Data source: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)

**Features of Use**
- id:
  1. job_id  

- Main data  

  **Texts**  
  1. _description_
  1. _skills_desc_  

  **Vector**  
  1. title
  2. _description_
  3. _skills_desc_  
  
- Meta data:  
  1. title
  1. location
  1. min_salary
  1. pay_period
  1. job_posting_url
  
  


## Download Data Directly from Kaggle

Source:
[How to Load Kaggle Datasets Directly Into Google Colab?](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/#:~:text=By%20uploading%20API%20credentials%20and,(CLI)%20within%20Google%20Colab.)

In [None]:
import kagglehub

# # Download latest version
# path = kagglehub.dataset_download("arshkon/linkedin-job-postings")

# Download a single file
df_path = kagglehub.dataset_download('arshkon/linkedin-job-postings', path='postings.csv', force_download=True)

print("Path to dataset files:", df_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/arshkon/linkedin-job-postings?dataset_version_number=13&file_name=postings.csv...


100%|██████████| 147M/147M [00:04<00:00, 36.9MB/s]

Path to dataset files: /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv





In [None]:
# Unzip the file at df_path and save it to the current working directory
import zipfile

def unzip_file(zip_filepath, extract_path):
    try:
        with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
            zip_ref.extractall(extract_path)
        print(f"Successfully unzipped {zip_filepath} to {extract_path}")
    except FileNotFoundError:
        print(f"Error: File not found at {zip_filepath}")
    except zipfile.BadZipFile:
        print(f"Error: Invalid zip file at {zip_filepath}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


unzip_file(df_path, "/content")

Successfully unzipped /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv to /content


In [None]:
# df = pd.read_csv('postings.csv', delimiter=',') #, engine="python"
df = pd.read_csv('postings.csv', delimiter=',')

print(f"The variables (features) of the data:\n{df.columns}")
df.head(5)

The variables (features) of the data:
Index(['job_id', 'company_name', 'title', 'description', 'max_salary',
       'pay_period', 'location', 'company_id', 'views', 'med_salary',
       'min_salary', 'formatted_work_type', 'applies', 'original_listed_time',
       'remote_allowed', 'job_posting_url', 'application_url',
       'application_type', 'expiry', 'closed_time',
       'formatted_experience_level', 'skills_desc', 'listed_time',
       'posting_domain', 'sponsored', 'work_type', 'currency',
       'compensation_type', 'normalized_salary', 'zip_code', 'fips'],
      dtype='object')


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [None]:
# df['job_post'] = df['description'].astype('str') + df['skills_desc'].astype('str')
# df[['description', 'skills_desc', 'job_post']].head()

In [None]:
df['job_post'] = np.where(df['skills_desc'].isna(), df['description'].astype('str'), df['skills_desc'].astype('str'))
df[['description', 'skills_desc', 'job_post']].head()

Unnamed: 0,description,skills_desc,job_post
0,Job descriptionA leading real estate firm in N...,Requirements: \n\nWe are seeking a College or ...,Requirements: \n\nWe are seeking a College or ...
1,"At Aspen Therapy and Wellness , we are committ...",,"At Aspen Therapy and Wellness , we are committ..."
2,The National Exemplar is accepting application...,We are currently accepting resumes for FOH - A...,We are currently accepting resumes for FOH - A...
3,Senior Associate Attorney - Elder Law / Trusts...,This position requires a baseline understandin...,This position requires a baseline understandin...
4,Looking for HVAC service tech with experience ...,,Looking for HVAC service tech with experience ...


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 32 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   company_name                122130 non-null  object 
 2   title                       123849 non-null  object 
 3   description                 123842 non-null  object 
 4   max_salary                  29793 non-null   float64
 5   pay_period                  36073 non-null   object 
 6   location                    123849 non-null  object 
 7   company_id                  122132 non-null  float64
 8   views                       122160 non-null  float64
 9   med_salary                  6280 non-null    float64
 10  min_salary                  29793 non-null   float64
 11  formatted_work_type         123849 non-null  object 
 12  applies                     23320 non-null   float64
 13  original_liste

## Select Data
**Purpose**:
- Insert data samples into Vector Database (VectorDB), working as knowledge base

**Note**:
- For demonstration, I pick 50% of the data from postings that match specific job titles and 50% from other, randomly chosen job titles

In [None]:
# User-defined settings

## Job titles you're interested in
keywords = ['data science', 'data scientist', 'data analyst']
# keywords = ['data analytics', 'data scientist', 'hr analytics', 'hr data analyst']

## Number of relevant / irrelevant samples
N = 300

In [None]:
condition = df['title'].str.contains('|'.join(keywords), case=False, na=False)
df_ds = df[condition]
# 'case=False' makes the search case-insensitive
# 'na=False' ensures that NaN values are not considered in the search

N_ds = min(len(df_ds), N)
df_ds = df_ds.sample(n=N_ds)
print(f"There're {N_ds} samples searched according to keywords.")
df_ds.head(10)

There're 300 samples searched according to keywords.


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,job_post
55869,3901936781,Cybotic System,Loans Transformation Data Analyst,"Only w2\nMandatory Skills:-SQL, SharePoint, Fi...",,,New York City Metropolitan Area,90644499.0,8.0,,...,1713470000000.0,,0,CONTRACT,,,,,,"Only w2\nMandatory Skills:-SQL, SharePoint, Fi..."
51643,3901391072,DataAnnotation,Data Analyst,DataAnnotation is committed to creating qualit...,,,"Fairfax, VA",96139831.0,62.0,,...,1713460000000.0,,0,CONTRACT,,,,22030.0,51600.0,DataAnnotation is committed to creating qualit...
33956,3895242027,Coders Data,Data Analyst,The ideal candidate will use their passion for...,,,New York City Metropolitan Area,14508644.0,10.0,,...,1712866000000.0,,0,FULL_TIME,,,,,,The ideal candidate will use their passion for...
120047,3906227684,Wells Fargo,Data Analyst (contract),Title: Data Analyst\n\nLocation: \n\nCharlotte...,,,"Charlotte, NC",1235.0,3.0,,...,1713563000000.0,wellsfargo.talentnet.community,0,CONTRACT,,,,28202.0,37119.0,Title: Data Analyst\n\nLocation: \n\nCharlotte...
107168,3905331454,CHRISTUS Health,Information Technology Data Analyst II - Profe...,Description\n\nSummary:\n\nPossesses knowledge...,,,"Irving, TX",10769.0,4.0,,...,1713480000000.0,careers.christushealth.org,0,FULL_TIME,,,,75038.0,48113.0,Description\n\nSummary:\n\nPossesses knowledge...
47756,3901170582,"Eos Energy Enterprises, Inc.",Projects & Services Business Data Analyst Inte...,"At Eos, we believe in great chemistry. The com...",,,"New Jersey, United States",2442165.0,104.0,,...,1713280000000.0,eosenergystorage.wd1.myworkdayjobs.com,0,FULL_TIME,,,,,,"At Eos, we believe in great chemistry. The com..."
21524,3889440979,Merck,"Associate Director, Data Science",Job Description\n\nWithin our Commercial Analy...,235100.0,YEARLY,"Rahway, NJ",1486.0,6.0,,...,1712656000000.0,jobs.merck.com,0,FULL_TIME,USD,BASE_SALARY,192250.0,7065.0,34039.0,Job Description\n\nWithin our Commercial Analy...
40647,3898174637,Experis,Talent Management Consultant / Data Analyst,"Our client, Pharmaceutical Industry is seeking...",,,"Princeton, NJ",2203697.0,45.0,,...,1713216000000.0,click.appcast.io,0,CONTRACT,,,,8540.0,34021.0,"Our client, Pharmaceutical Industry is seeking..."
93011,3904708709,Aditi Consulting,Data Analyst 1,Summary:\n\nThe main function of the Data Anal...,,,"Plano, TX",2985733.0,4.0,,...,1713564000000.0,www2.jobdiva.com,0,FULL_TIME,,,,75023.0,48085.0,Summary:\n\nThe main function of the Data Anal...
21047,3889414779,Talentify.io,Data Analyst -AITS (Hybrid/Remote),Employer Industry: Higher Education\n\nWhy Con...,,YEARLY,United States,2204084.0,5.0,80000.0,...,1712654000000.0,tfy.io,0,FULL_TIME,USD,BASE_SALARY,80000.0,,,Employer Industry: Higher Education\n\nWhy Con...


In [None]:
df_others = df[~condition].sample(n=N)
print(f"There're {len(df_others)} samples searched not contained in keywords.")
df_others.head(10)

There're 300 samples searched not contained in keywords.


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,job_post
60985,3901984718,Microsoft,"Energy Program Manager, Cloud Operations + Inn...",Microsoft’s Cloud Operations + Innovation (CO+...,,,"Atlanta, GA",1035.0,5.0,,...,1713497000000.0,careers.microsoft.com,0,FULL_TIME,,,,30303.0,13121.0,Microsoft’s Cloud Operations + Innovation (CO+...
94283,3904724144,Emory Healthcare,"LPN, PRN, Telehealth Nursing, Northlake",Overview\n\nHeart. Soul. Expertise. That’s wha...,,,"Atlanta, GA",3888.0,4.0,,...,1713566000000.0,nursing-emory.icims.com,0,PART_TIME,,,,30303.0,13121.0,Overview\n\nHeart. Soul. Expertise. That’s wha...
49911,3901363186,University of Michigan,Women's Basketball Assistant Coach,Summary\n\nThe University of Michigan-Dearborn...,,,"Dearborn, MI",2516.0,9.0,,...,1713454000000.0,careers.umich.edu,0,PART_TIME,,,,48120.0,26163.0,Summary\n\nThe University of Michigan-Dearborn...
11535,3887495476,Oak Street Health,Patient Relations Manager- Bilingual/Spanish R...,Description\n\nCompany: Oak Street Health\n\nT...,,,"Chicago, IL",3029525.0,14.0,,...,1712349000000.0,jobs.jobvite.com,0,FULL_TIME,,,,60601.0,17031.0,Description\n\nCompany: Oak Street Health\n\nT...
64663,3902360755,Novant Health,Student Radiologic Technologist I,Overview\n\nThe Student Radiologic Technologis...,,,"Charlotte, NC",9759.0,3.0,,...,1713539000000.0,external-novanthealth.icims.com,0,INTERNSHIP,,,,28202.0,37119.0,Overview\n\nThe Student Radiologic Technologis...
27455,3891274519,Akima,Detention Officer,"Overview\n\nWork Where it Matters\n\nAGS, an A...",,,"Batavia, NY",3760423.0,5.0,,...,1712893000000.0,rr.jobsyn.org,0,FULL_TIME,,,,14020.0,36037.0,"Overview\n\nWork Where it Matters\n\nAGS, an A..."
28214,3891829084,Path Engineering,Full Time Designer,Role\n\nAs a Full-Time Designer (also known as...,,,"Darien, WI",82500012.0,1.0,,...,1712903000000.0,www.adzuna.com,0,FULL_TIME,,,,53114.0,55127.0,Role\n\nAs a Full-Time Designer (also known as...
93860,3904720633,Darigold,Inventory Accountant,Darigold is seeking an Inventory Accountant to...,,,"Seattle, WA",281848.0,4.0,,...,1713565000000.0,recruiting2.ultipro.com,0,FULL_TIME,,,,98101.0,53033.0,Darigold is seeking an Inventory Accountant to...
107929,3905335633,"CivicMinds, Inc",Commercial Lines Account Manager,Job Responsibilities:all aspects of our client...,100000.0,YEARLY,"New York, United States",17880405.0,5.0,,...,1713480000000.0,,0,FULL_TIME,USD,BASE_SALARY,92500.0,,,Job Responsibilities:all aspects of our client...
27751,3891285339,H&R Block,Receptionist,What You'll Bring To The Team...\n\nAnswer pho...,,,"Fort Worth, TX",3671.0,5.0,,...,1712896000000.0,sjobs.brassring.com,0,FULL_TIME,,,,76102.0,48439.0,What You'll Bring To The Team...\n\nAnswer pho...


In [None]:
df_select = pd.concat([df_ds, df_others])

print(f"There're totally {len(df_select)} samples for inserting VectorDB.")

There're totally 600 samples for inserting VectorDB.
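
The selection above joins the keywords into a regular expression and samples without a seed, so each run draws different rows. The sketch below shows the same 50/50 idea on a toy list of titles, with two additions that are assumptions of this example and not part of the original cells: keywords are escaped with `re.escape`, and a `seed` makes the sample repeatable. The helper names are hypothetical.

```python
import random
import re

def split_by_keywords(titles, keywords):
    """Split titles into those matching any keyword (case-insensitive) and the rest."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    matched = [t for t in titles if pattern.search(t)]
    others = [t for t in titles if not pattern.search(t)]
    return matched, others

def balanced_sample(titles, keywords, n, seed=0):
    """Draw up to n matching and up to n non-matching titles, reproducibly."""
    matched, others = split_by_keywords(titles, keywords)
    rng = random.Random(seed)
    return (
        rng.sample(matched, min(n, len(matched))),
        rng.sample(others, min(n, len(others))),
    )

# Toy titles for illustration only
toy_titles = [
    "Data Analyst",
    "Senior DATA SCIENTIST",
    "Receptionist",
    "Inventory Accountant",
    "C++ Developer",
]
toy_keywords = ["data scientist", "data analyst"]

relevant, irrelevant = balanced_sample(toy_titles, toy_keywords, n=2, seed=42)
print(relevant)
print(irrelevant)
```

With pandas, the analogous knob is the `random_state` argument of `DataFrame.sample`.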


# Create VectorDB

## Create Indexer: Construct the Container (aka. collection) for VectorDB

In [None]:
encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [None]:
encoder

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [None]:
import chromadb
collection_name = "collection_postings"

persistent_client = chromadb.PersistentClient()
# print("check all functions/attribute for Chroma:\n", dir(persistent_client))



In [None]:
if not persistent_client.list_collections():
    print('empty')

In [None]:
if not persistent_client.list_collections():
    print(f"{collection_name} not in collection (DB) yet!")
else:
    # Compare against the names of all existing collections, not only the first one
    existing_names = [collection.name for collection in persistent_client.list_collections()]
    if collection_name in existing_names:
        print(f"{collection_name} is already in collection and would be deleted!")
        persistent_client.delete_collection(collection_name)

print(f"Create collection: {collection_name}!")
vector_store = Chroma(
    client=persistent_client,
    collection_name=collection_name,
    embedding_function=encoder,
    persist_directory="./chroma_langchain_db",  # save data locally; remove if not necessary
)

#print("check all functions/attribute for Chroma:\n", dir(vector_store))

# if collection_name in vector_store.list_collections():
#     vector_store.delete_collection(collection_name)


collection_postings is already in collection and would be deleted!
Create collection: collection_postings!


## Indexing: Insert Data into VectorDB


In [None]:
from langchain_core.documents import Document
ids = []
documents = []
for index, row in df_select.iterrows():
    id_current = str(index)
    ids.append(id_current)

    min_salary = row['min_salary']
    if pd.isna(min_salary):
        min_salary = 0  # or any default value you prefer

    document_current = Document(
        page_content=row['job_post'],
        metadata={"title": row['title'],
                  "location": row['location'],
                  "min_salary": min_salary,
                  "pay_period": row['pay_period'],
                  "job_posting_url": row['job_posting_url'],
                 },
        id=row['job_id'],
    )
    documents.append(document_current)
print(f"There are {len(documents)}.\n")
print(f'Example document content:\n{document_current}')

There are 600.

Example document content:
page_content="Assistant Store Manager (Merchandising) Contribute to our mission to improve Health and Wellness in your community. Become a Rite Aid Assistant Store Manager over Merchandising, today!\n\nAs an Assistant Store Manager over Merchandising, you are responsible for day-to-day operations, care, and service provided to our customer & patients. Your purpose is to lead your store to success through mentoring, coaching, and optimizing processes. Your experience with plan-o-grams, signage, and merchandising are paramount to being successful in this role.\n\nWhy You Will THRIVE Here\n\nYou love to help people.\n\nYou are a natural leader who builds solid professional relationships.\n\nYou enjoy motivating those around you, and you’re good at it.\n\nYou consistently exceed people's expectations.\n\nYou are passionate about providing top-notch customer service.\n\nYou have a keen sense for optimizing, organizing, and visualizing.\n\nDay-in-the
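
The loop above replaces a missing `min_salary` with 0 but passes the other metadata fields through unchanged, so fields such as `pay_period` can still carry pandas' NaN. Whether that causes trouble depends on the store and on how the field is filtered later. A minimal sketch of a general clean-up step follows; `clean_metadata` and the `"UNKNOWN"` default are hypothetical choices for illustration.

```python
import math

def clean_metadata(raw, defaults=None):
    """Replace None/NaN values with a per-field default (empty string if none given)."""
    defaults = defaults or {}
    cleaned = {}
    for key, value in raw.items():
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        cleaned[key] = defaults.get(key, "") if is_missing else value
    return cleaned

# Toy row mimicking a posting with no salary information
toy_row = {
    "title": "Data Analyst",
    "location": "Plano, TX",
    "min_salary": float("nan"),
    "pay_period": float("nan"),
}
toy_clean = clean_metadata(toy_row, defaults={"min_salary": 0, "pay_period": "UNKNOWN"})
print(toy_clean)
```

Keeping each field at a single type (numbers for `min_salary`, strings for `pay_period`) also makes later metadata filters easier to reason about.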

In [None]:
import time
start = time.time()
vector_store.add_documents(documents=documents, ids=ids)
end = time.time()
print('Time spent (min.) for data insertion: \t', (end-start)/60)

Time spent (min.) for data insertion: 	 19.352565836906432
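
The single `add_documents` call above ran for roughly 19 minutes without any feedback. One option, sketched here under assumptions, is to insert in batches and report progress after each one. `insert_fn` is a hypothetical callback standing in for the real insertion call, and the demonstration below uses a plain list so that the block runs without a vector store.

```python
import time

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def insert_in_batches(items, insert_fn, batch_size=100):
    """Call insert_fn(batch) for each batch and print progress; returns the count inserted."""
    started = time.time()
    inserted = 0
    for batch in batched(items, batch_size):
        insert_fn(batch)
        inserted += len(batch)
        print(f"{inserted}/{len(items)} inserted, {time.time() - started:.1f}s elapsed")
    return inserted

# Stand-in for a vector store: a list that simply collects (id, text) pairs
toy_sink = []
toy_items = [(str(i), f"job post {i}") for i in range(250)]
toy_total = insert_in_batches(toy_items, toy_sink.extend, batch_size=100)
```

In this notebook the callback could unpack each batch of pairs into the `documents` and `ids` arguments of `vector_store.add_documents`; batching does not make embedding faster, it only makes the wait observable and restartable.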


## Search: VectorDB as Retriever

### Vector Search (Similarity Search)
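According to the query, find the k=10 most similar job posts. The score returned by `similarity_search_with_score` is a distance, so lower values mean closer matches.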

In [None]:
query = "I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an entry-level data scientist role where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field."
print(query)

I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an entry-level data scientist role where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field.


In [None]:
results = vector_store.similarity_search_with_score(
    query , k=10,
)
i =1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1

* [1][SIM=0.460155] Data scientist intern
---------------------
           About the Role – Data Scientist InternAt AstrumU we are focused on our customers and users’ needs, relying heavily on data for productdevelopment, customer engagement, and education marketplace optimization. Our entire companyis oriented around Data Science and Artificial Intelligence. It is essentially the core of our business.Our mission is to power AstrumU’s platform using an AI engine that measures the multi-dimensionaleducation and professional journey of students and provides skill-based career preparationrecommendations. Our Data Science team leverages their knowledge of machine learning, NLP,and data engineering to map the unique offerings in education programs and translates learningoutcomes into ‘workforce attributes’ as a common language between education and industry. Thisopportunity is unique and challenging because you will be dealing with an unprecedented scope ofdata in delivering career outcome-
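
The printed `SIM` value is easier to interpret once it is related to cosine similarity. The sketch below assumes two things that this notebook does not state explicitly: that the collection uses Chroma's default squared-L2 distance, and that the embeddings are unit-length (the encoder printout above does end with a `Normalize()` layer). Under those assumptions, distance and cosine similarity are linked by `distance = 2 - 2 * cosine`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))

def distance_to_cosine(distance):
    """Invert distance = 2 - 2 * cosine (valid only for unit-length vectors)."""
    return 1 - distance / 2

# Arbitrary toy vectors, normalized to unit length
vec_a = normalize([1.0, 2.0, 3.0])
vec_b = normalize([2.0, 1.0, 0.5])

toy_distance = squared_l2(vec_a, vec_b)
toy_cosine = dot(vec_a, vec_b)
print(f"squared L2 distance: {toy_distance:.6f}")
print(f"2 - 2 * cosine:      {2 - 2 * toy_cosine:.6f}")
```

If both assumptions hold here, a score of 0.460155 would correspond to a cosine similarity of about 0.77; if the collection was created with a different distance metric, the conversion does not apply.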

### Vector Search with Filtering 1
According to the query, find the k=3 best-matching job posts whose minimum salary is greater than (`$gt`) 100000

In [None]:
results = vector_store.similarity_search_with_score(
    query , k=3, filter={"min_salary": {"$gt": 100000}}
)  # operators: $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
i = 1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1

* [1][SIM=0.529117] Data Scientist
---------------------
           Job Number: R0195929

Data Scientist

The Opportunity: 

As a data scientist, you’re excited at the prospect of unlocking the secrets held by a data set, and you’re fascinated by the possibilities presented by IoT, machine learning, and artificial intelligence. In an increasingly connected world, massive amounts of structured and unstructured data open new opportunities. As a data scientist at Booz Allen, you can turn these complex data sets into useful information to solve global challenges. Across private and public sectors from fraud detection to cancer research, to national intelligence, we need a data scientist like you to help find the answers in the data.

On our team, you’ll use your data and analysis skills to improve Veteran access to healthcare. You’ll work closely with your customer to understand their questions and needs, then dig into their data-rich environment to find the pieces of their information puz
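
To show what a filter such as `{"min_salary": {"$gt": 100000}}` selects, here is a toy evaluator for the operators listed in the cell above. It only illustrates the semantics on plain dictionaries and is not Chroma's implementation; the sample metadata values are invented.

```python
import operator

# Map each filter operator to a Python comparison
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if every field condition in `where` holds for `metadata`."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # shorthand: {"field": value}
        for op, operand in condition.items():
            if not OPERATORS[op](metadata[field], operand):
                return False
    return True

# Invented metadata; a min_salary of 0 stands for "salary was missing"
toy_posts = [
    {"title": "Data Scientist", "min_salary": 120000},
    {"title": "Data Analyst", "min_salary": 0},
    {"title": "Associate Director, Data Science", "min_salary": 150000},
    {"title": "Receptionist", "min_salary": 35000},
]
toy_where = {"min_salary": {"$gt": 100000}}
toy_selected = [p["title"] for p in toy_posts if matches(p, toy_where)]
print(toy_selected)
```

Because missing salaries were stored as 0 during indexing, postings without salary data can never pass a `$gt` filter, which is worth keeping in mind given how few rows have `min_salary` populated.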

# Retrieval and Generation

**Vector Database (Vector DB)**
Resources
- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores)
  - [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store that stores embedded data and performs similarity search.
    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)

# Preface
## Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)
- [Gemini API Python quickstart](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/tutorials/quickstart_colab.ipynb#scrollTo=-QhPWE1lwZHH)

In [None]:
# Gemini
!pip install -q -U google-generativeai

In [None]:
!pip install tiktoken



# Connect to VectorDB & LLM Agent
## Connect to VectorDB (Chroma)

In [None]:
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint

collection_name = "collection_postings"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
persistent_client = chromadb.PersistentClient()
print(persistent_client.list_collections())

vector_store = Chroma(client=persistent_client,
                      collection_name=collection_name,
                      embedding_function=embeddings)

# try:
#   if collection_name in persistent_client.list_collections()[0].name:
#       print(f"Collection '{collection_name}' exists!")
#       # Get the existing collection
#       # vector_store = persistent_client.get_collection(collection_name)
#       vector_store = Chroma(client=persistent_client,
#                             collection_name=collection_name,
#                             embedding_function=embeddings)
# except:
#     print(f"Collection '{collection_name}' does not exist!")

[Collection(id=12fd0f52-99aa-4a0f-84ae-3697e1fd6e3f, name=collection_postings)]


In [None]:
# Inspect the data stored in vector_store

# Get all stored records (ids, documents, and metadata) in a single call
stored = vector_store.get()

# Print the records
print(stored)

# Embeddings are not returned by default, so this value is None;
# pass include=["embeddings"] to get() to retrieve them
stored_embeddings = stored['embeddings']
ids = stored['ids']

# Print the embeddings
print(stored_embeddings)

# Print the ids
ids

{'ids': ['55869', '51643', '33956', '120047', '107168', '47756', '21524', '40647', '93011', '21047', '91898', '29113', '22377', '29908', '8457', '99384', '56343', '6588', '389', '29848', '17846', '31328', '38456', '120555', '49843', '28519', '62890', '55921', '78590', '6182', '76229', '23944', '64671', '6531', '80498', '11838', '61641', '50935', '41371', '90659', '48033', '42920', '31316', '46390', '122094', '2658', '9344', '6455', '31314', '41468', '6184', '14019', '101917', '44303', '93566', '51969', '23891', '93501', '2887', '26433', '48791', '96056', '31891', '41797', '13246', '116996', '47620', '67677', '12953', '61934', '24094', '29254', '56044', '76171', '32838', '52419', '6705', '86214', '75010', '83764', '67493', '95865', '95577', '56018', '109667', '41919', '56732', '51895', '67117', '33829', '69503', '115363', '6777', '27175', '77091', '73646', '3730', '60859', '90085', '53326', '56428', '100893', '64856', '25425', '102639', '81538', '115052', '6780', '70554', '42405', '2101

['55869',
 '51643',
 '33956',
 '120047',
 '107168',
 '47756',
 '21524',
 '40647',
 '93011',
 '21047',
 '91898',
 '29113',
 '22377',
 '29908',
 '8457',
 '99384',
 '56343',
 '6588',
 '389',
 '29848',
 '17846',
 '31328',
 '38456',
 '120555',
 '49843',
 '28519',
 '62890',
 '55921',
 '78590',
 '6182',
 '76229',
 '23944',
 '64671',
 '6531',
 '80498',
 '11838',
 '61641',
 '50935',
 '41371',
 '90659',
 '48033',
 '42920',
 '31316',
 '46390',
 '122094',
 '2658',
 '9344',
 '6455',
 '31314',
 '41468',
 '6184',
 '14019',
 '101917',
 '44303',
 '93566',
 '51969',
 '23891',
 '93501',
 '2887',
 '26433',
 '48791',
 '96056',
 '31891',
 '41797',
 '13246',
 '116996',
 '47620',
 '67677',
 '12953',
 '61934',
 '24094',
 '29254',
 '56044',
 '76171',
 '32838',
 '52419',
 '6705',
 '86214',
 '75010',
 '83764',
 '67493',
 '95865',
 '95577',
 '56018',
 '109667',
 '41919',
 '56732',
 '51895',
 '67117',
 '33829',
 '69503',
 '115363',
 '6777',
 '27175',
 '77091',
 '73646',
 '3730',
 '60859',
 '90085',
 '53326',
 '5642

## Connect to Agent (Call Gemini API)

In [None]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro')

# Retrieval and Generation Application

## Prepare Prompt

In [None]:
extraction_prompt = ''' You are a career counselor who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meet your client's requirements

Upon receiving the aforementioned information, proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification.
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs mentioned in the job specification according to the reasoning above.

Question:
    <query>{query}</query>
Job Post Information:
    <specification>{specification}</specification>
Advice:
'''

## Prepare Input Query

In [None]:
query = "I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. I am now seeking an entry-level data scientist or data analyst role."

## Search Results based on Query

In [None]:
results = vector_store.similarity_search_with_score(
    query , k=5, #filter={"title": {"$in": keywords}}
)
i=0
specification = ""
for res, score in results:
    print(f"[{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    specification += ('Title: ' + res.metadata['title'] +'\n ' + res.page_content)
    i+=1

[0][SIM=0.641125] Data Scientist (6+ years) (Fulltime)
---------------------
           Job Title: Data ScientistLocation: Bentonville, AR (Onsite)Fulltime  Mode of interview: Video Call Must have skills : AI/ML models using Google Cloud Platform Relevant Experience: 6+ years Education: Bachelor’s Degree or above  Roles & Responsibilities · Proven experience in deploying real-time AI/ML models using Google Cloud Platform.· Strong programming skills in Python and PySpark.· Proficiency with SQL and relational databases, data warehouses, and BigQuery.· Experience in scaling marketing-related AI/ML solutions such as cross/upsell, recommended systems, and category propensity.· Experience in deploying and managing Large scale Machine Learning Models is a plus· Expertise with classical ML algorithm like K-NN, LSH, logistic regression, linear regression, SVM, Random forest and clustering.· Good understanding of ML & DL algorithms and frameworks (Scikit-learn,Spacy, Tensorflow/Keras/ PyTorch)· 

In [None]:
print(specification)

Title: Data Scientist (6+ years) (Fulltime)
 Job Title: Data ScientistLocation: Bentonville, AR (Onsite)Fulltime  Mode of interview: Video Call Must have skills : AI/ML models using Google Cloud Platform Relevant Experience: 6+ years Education: Bachelor’s Degree or above  Roles & Responsibilities · Proven experience in deploying real-time AI/ML models using Google Cloud Platform.· Strong programming skills in Python and PySpark.· Proficiency with SQL and relational databases, data warehouses, and BigQuery.· Experience in scaling marketing-related AI/ML solutions such as cross/upsell, recommended systems, and category propensity.· Experience in deploying and managing Large scale Machine Learning Models is a plus· Expertise with classical ML algorithm like K-NN, LSH, logistic regression, linear regression, SVM, Random forest and clustering.· Good understanding of ML & DL algorithms and frameworks (Scikit-learn,Spacy, Tensorflow/Keras/ PyTorch)· Experience in deep learning Algorithm s lik
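
The loop above appends each posting directly after the previous one, so consecutive postings run together in the printed specification. One possible alternative, sketched here with a hypothetical `Posting` stand-in for the retrieved documents, is to build one block per posting and join the blocks with a blank line:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    """Hypothetical stand-in for a retrieved document (content plus metadata)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def build_specification(postings: List[Posting]) -> str:
    """Format each posting as 'Title: ...' plus its text, separated by blank lines."""
    blocks = [
        f"Title: {p.metadata.get('title', 'Unknown')}\n{p.page_content.strip()}"
        for p in postings
    ]
    return "\n\n".join(blocks)


# Illustrative postings only; not taken from the dataset.
demo = [
    Posting("Python and SQL required.", {"title": "Data Analyst"}),
    Posting("  ML experience preferred. ", {"title": "Data Scientist"}),
]
demo_spec = build_specification(demo)
print(demo_spec)
```

Clear separators may make it easier for the model to tell where one job post ends and the next begins.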

## Get Final Response

In [None]:
prompt_all = extraction_prompt.format(query=query, specification=specification)
print(prompt_all)

 You are a carear consuler who helps job seekers to find their dream jobs, you give professional advice tailored to the need of your client (i.e., job seeker) according to the following information:
    1. Query: Your client's question (enclosed in <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in <specification> tag below) that might best meets your client's requirements

Upon receiving your aforementioned information, you need to proceed with the following precedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice how to get the jobs mentioned in job specification according t
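
The prompt is filled in with plain `str.format`, which raises a `KeyError` if a placeholder is left without a value. As a rough sketch, the standard-library `string.Formatter` can list the placeholders a template expects before formatting; `mini_template` below is a shortened, hypothetical version of the prompt, not the full text used above:

```python
from string import Formatter
from typing import Set


def template_fields(template: str) -> Set[str]:
    """Return the names of all {placeholders} found in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Shortened stand-in for the notebook's prompt template.
mini_template = (
    "Question:\n"
    "    <query>{query}</query>\n"
    "Job Post Information:\n"
    "    <specification>{specification}</specification>\n"
    "Advice:\n"
)

fields = template_fields(mini_template)
filled = mini_template.format(
    query="entry-level data analyst role",
    specification="Title: Data Analyst",
)
print(sorted(fields))
print(filled)
```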

In [None]:
response = model.generate_content(prompt_all)
print(response.text)

**1. Client's Abilities**

* **Hard Skills:**
    * Python programming
    * Machine learning (K-NN, LSH, logistic regression, linear regression, SVM, Random forest, clustering)
    * Deep learning (MLP, CNN, RNN, LSTMs, GANs, Transformers, LLMs)
    * SQL and relational databases
    * Data cleaning, preprocessing, and feature engineering
* **Soft Skills:**
    * Problem-solving and analytical skills
    * Strong written and verbal communication skills

**2. Skills Needed for the Best Possible Jobs**

**Data Scientist (6+ years)**
* 6+ years of experience in deploying AI/ML models using Google Cloud Platform
* Expertise with classical ML algorithms and DL algorithms
* Experience in deploying and managing large-scale ML models
* Proficiency in collecting data, data cleaning, preprocessing, and feature engineering
* Excellent programming skills in Python
* Expertise in Google Cloud and operationalization of models using MLOPs
* Experience in scheduling jobs for automated training and in

# What If: Generation without Retrieval

In [None]:
extraction_prompt = ''' You are a career counselor who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer

Upon receiving the aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs.
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs.

Question:
    <query>{query}</query>

Advice:
'''

prompt_all = extraction_prompt.format(query=query)

In [None]:
response = model.generate_content(prompt_all)
print(response.text)

**Step 1: Analyze your client's abilities, including hard and soft skills.**

**Hard skills:**

* Python programming
* Machine learning
* Deep learning
* Data analysis

**Soft skills:**

* Communication skills
* Teamwork skills
* Problem-solving skills

**Step 2: Analyze and summarize the skills needed for the best possible jobs**

**Data Scientist:**

* **Hard skills:** Python programming, machine learning, deep learning, data analysis, statistics, data visualization
* **Soft skills:** Communication skills, teamwork skills, problem-solving skills

**Data Analyst:**

* **Hard skills:** Python programming, data analysis, data visualization, SQL
* **Soft skills:** Communication skills, teamwork skills, problem-solving skills

**Step 3: Summarize your client's strengths that are already sufficient for the job application.**

* Strong programming skills in Python
* Good understanding of machine learning and deep learning
* Experience applying these skills to real-world projects

**Step 4: 
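
The two runs differ only in whether the retrieved job posts are placed in the prompt. A minimal sketch of keeping both variants in one helper, where `build_prompt` and its abbreviated wording are hypothetical and not the notebook's actual template:

```python
from typing import Optional


def build_prompt(query: str, specification: Optional[str] = None) -> str:
    """Build the question part of the prompt, adding job posts only if given."""
    parts = ["Question:", f"    <query>{query}</query>"]
    if specification:
        # Retrieval-augmented variant: include the retrieved job posts.
        parts += [
            "Job Post Information:",
            f"    <specification>{specification}</specification>",
        ]
    parts.append("Advice:")
    return "\n".join(parts)


with_retrieval = build_prompt("data analyst role", "Title: Data Analyst")
without_retrieval = build_prompt("data analyst role")
print(with_retrieval)
print(without_retrieval)
```

Using one helper for both cases keeps the comparison fair, since everything except the retrieved context stays identical.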

# Store Data to Your Google Drive
Save your IPython (Jupyter) notebooks and data files from Google Colab to Google Drive.

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Create a Folder in Google Drive

In [None]:
import os

def create_folder_in_drive(folder_name):
    # Define the path for the new folder
    folder_path = f'/content/drive/MyDrive/{folder_name}'

    # Create the folder if it doesn't exist
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f'Folder "{folder_name}" created in Google Drive.')
    else:
        print(f'Folder "{folder_name}" already exists.')

# Create the folder
folder_name = 'Tutorial - LinkedIn Job Posting with GenAI'
create_folder_in_drive(folder_name)

subfolder_name = ''


Folder "Tutorial - LinkedIn Job Posting with GenAI" already exists.
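
The same create-if-missing logic can also be written with `pathlib`. The sketch below is an alternative under stated assumptions, not the notebook's method: `ensure_folder` is a hypothetical helper, and a temporary directory stands in for `/content/drive/MyDrive` so the code runs outside Colab.

```python
import tempfile
from pathlib import Path


def ensure_folder(base_dir: str, folder_name: str) -> Path:
    """Create base_dir/folder_name if needed and return its path."""
    folder = Path(base_dir) / folder_name
    # exist_ok=True makes repeated calls safe; parents=True creates missing parents.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for the mounted Drive root.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_folder(tmp, "Tutorial - demo")
    created_again = ensure_folder(tmp, "Tutorial - demo")
    folder_ok = created.is_dir() and created == created_again

print(folder_ok)
```

In Colab, `base_dir` would be `'/content/drive/MyDrive'` after the Drive has been mounted.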


## Save the Notebook to the Created Folder

In [None]:
import shutil

def save_notebook_to_drive(file_name, folder_name):
    source_path = f'/content/{file_name}'  # Current location of the notebook
    destination_path = f'/content/drive/MyDrive/{folder_name}/{file_name}'  # Destination path in Drive

    # Copy the notebook to the destination
    shutil.copy(source_path, destination_path)
    print(f'Notebook "{file_name}" saved to "{folder_name}" in Google Drive.')

# Save a file from /content to Drive (here the postings dataset; replace with the name of the file you want to save)
file_name = 'postings.csv'
save_notebook_to_drive(file_name, folder_name)

Notebook "postings.csv" saved to "Tutorial - LinkedIn Job Posting with GenAI" in Google Drive.
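
`shutil.copy` raises an error if the source file is missing or the destination folder does not exist. A slightly more defensive variant might look like the sketch below; `copy_file_to_folder` is a hypothetical helper, and temporary paths replace `/content` and the Drive folder so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_file_to_folder(source: str, destination_folder: str) -> Path:
    """Copy source into destination_folder, creating the folder if needed."""
    src = Path(source)
    if not src.is_file():
        raise FileNotFoundError(f"Source file not found: {src}")
    dest_dir = Path(destination_folder)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as modification time.
    return Path(shutil.copy2(src, dest_dir / src.name))


# Temporary paths stand in for /content and the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("title\nData Analyst\n")
    copied = copy_file_to_folder(str(sample), str(Path(tmp) / "backup"))
    copied_name = copied.name
    copied_text = copied.read_text()

print(copied_name)
```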
