# 1️⃣ Knowledge Base (KB) with Retrieval Application

**Vector Database (Vector DB) / Vectorstore**

- [Vectorstores](https://python.langchain.com/v0.2/docs/integrations/vectorstores/): A vector store that stores embedded data and performs similarity search.

**Resources**

- [How-to guides](https://python.langchain.com/v0.2/docs/how_to/#vector-stores): How to build a vector DB with LangChain

    1. [Elasticsearch](https://python.langchain.com/v0.2/docs/integrations/vectorstores/elasticsearch/)
    2. [Milvus](https://python.langchain.com/v0.2/docs/integrations/vectorstores/milvus/)
    3. [Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/): [langchain-chroma](https://pypi.org/project/langchain-chroma/)

# 1.1. Environment Setup


In [7]:
from importlib.metadata import version
# !pip install langchain
# Pin langchain to 0.1.20
try:
    print('langchain package version',version('langchain'))
    assert version('langchain') == '0.1.20'
except Exception:  # package missing or a different version installed
    !pip install langchain==0.1.20

# !pip install --upgrade langchain
# print('langchain package version',version('langchain'))

langchain package version 0.3.4
Collecting langchain==0.1.20
  Downloading langchain-0.1.20-py3-none-any.whl.metadata (13 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.1.20)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain-community<0.1,>=0.0.38 (from langchain==0.1.20)
  Downloading langchain_community-0.0.38-py3-none-any.whl.metadata (8.7 kB)
Collecting langchain-core<0.2.0,>=0.1.52 (from langchain==0.1.20)
  Downloading langchain_core-0.1.52-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain==0.1.20)
  Downloading langchain_text_splitters-0.0.2-py3-none-any.whl.metadata (2.2 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain==0.1.20)
  Downloading marshmallow-3.23.0-py3-none-any.whl.metadata (7.6 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain==0.1.20)
  Downloading typing_inspect-0.9.0-py3-none-any.whl

In [8]:
#!pip install -qU langchain-huggingface
# Pin langchain-huggingface to 0.0.3
try:
    print('langchain-huggingface package version',version('langchain-huggingface'))
    assert version('langchain-huggingface') =='0.0.3'#'0.2.11'
except Exception:  # package missing or a different version installed
    !pip install langchain-huggingface==0.0.3
    #0.2.11 (if any)

# !pip install -qU langchain-huggingface
# print('langchain-huggingface package version',version('langchain-huggingface'))

langchain-huggingface package version 0.0.3


In [9]:
# Pin langchain-chroma to 0.1.3
try:
    print('langchain_chroma package version',version('langchain_chroma'))
    assert version('langchain_chroma') == '0.1.3'
except Exception:  # package missing or a different version installed
    !pip install langchain_chroma==0.1.3

# !pip install -qU langchain_chroma==0.1.3
# print('langchain_chroma package version',version('langchain_chroma'))

langchain_chroma package version 0.1.3
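
The three pinning cells above repeat the same check. As a minimal sketch, that check could be factored into one helper. The name `is_pinned` is hypothetical and not part of the original notebook; it relies only on the standard-library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def is_pinned(package: str, wanted: str) -> bool:
    """Return True only if `package` is installed at exactly version `wanted`."""
    try:
        return version(package) == wanted
    except PackageNotFoundError:
        # The package is not installed at all.
        return False


# In the notebook, a False result would be followed by, for example:
#   !pip install langchain==0.1.20
for name, wanted in [("langchain", "0.1.20"), ("langchain-chroma", "0.1.3")]:
    print(name, "pinned:", is_pinned(name, wanted))
```

Catching `PackageNotFoundError` explicitly separates "not installed" from other failures, which a bare `except:` would hide.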


In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import os
#check current directory
os.getcwd()

'/content'

In [11]:
# check folders / files in current directory
!dir

sample_data


In [12]:
import langchain_chroma
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# 1.2. Import Data
**Data**
- Source: [LinkedIn Job Postings (2023 - 2024)](https://www.kaggle.com/datasets/arshkon/linkedin-job-postings)
- Full Downloaded Folder:
  - archive
    - **postings.csv** ✅ (selected for this project!)
    - mappings
      - skills.csv
      - industries.csv
    - jobs
      - salaries.csv
      - job_skills.csv
      - job_industries.csv
      - benefits.csv
    - companies
      - employee_counts.csv
      - company_specialities.csv
      - company_industries.csv
      - companies.csv



**Selected Features in postings.csv**
- **ID**:
  - job_id  

- **Main data**: embedding (encoded from texts)
  1. description
  1. skills_desc
  
- **Meta data**
  1. title
  1. location
  1. min_salary
  1. pay_period
  1. job_posting_url
  
  


## 1.2.1. Download Data Directly from Kaggle

Tutorial:
[How to Load Kaggle Datasets Directly Into Google Colab?](https://www.analyticsvidhya.com/blog/2021/06/how-to-load-kaggle-datasets-directly-into-google-colab/#:~:text=By%20uploading%20API%20credentials%20and,(CLI)%20within%20Google%20Colab.)

In [None]:
import kagglehub

# # Download latest version
# path = kagglehub.dataset_download("arshkon/linkedin-job-postings")

# Download a single file
df_path = kagglehub.dataset_download('arshkon/linkedin-job-postings', path='postings.csv', force_download=True)

print("Path to dataset files:", df_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/arshkon/linkedin-job-postings?dataset_version_number=13&file_name=postings.csv...


100%|██████████| 147M/147M [00:04<00:00, 36.9MB/s]

Path to dataset files: /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv





In [13]:
# Unzip the file at df_path and save it to extract_path
import zipfile

def unzip_file(zip_filepath, extract_path):
    try:
        with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
            zip_ref.extractall(extract_path)
        print(f"Successfully unzipped {zip_filepath} to {extract_path}")
    except FileNotFoundError:
        print(f"Error: File not found at {zip_filepath}")
    except zipfile.BadZipFile:
        print(f"Error: Invalid zip file at {zip_filepath}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


unzip_file(df_path, "/content")

Successfully unzipped /root/.cache/kagglehub/datasets/arshkon/linkedin-job-postings/versions/13/postings.csv to /content


In [16]:
df = pd.read_csv('postings.csv', delimiter=',')
print(f"The data has {df.shape[0]} observations with {df.shape[1]} variables")
print(f"The variables (features) of the data:\n{df.columns}")
df.head(5)

The data has 123849 observations with 31 variables
The variables (features) of the data:
Index(['job_id', 'company_name', 'title', 'description', 'max_salary',
       'pay_period', 'location', 'company_id', 'views', 'med_salary',
       'min_salary', 'formatted_work_type', 'applies', 'original_listed_time',
       'remote_allowed', 'job_posting_url', 'application_url',
       'application_type', 'expiry', 'closed_time',
       'formatted_experience_level', 'skills_desc', 'listed_time',
       'posting_domain', 'sponsored', 'work_type', 'currency',
       'compensation_type', 'normalized_salary', 'zip_code', 'fips'],
      dtype='object')


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [17]:
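# Fill missing values first so that NaN is not turned into the literal text 'nan',
# and join the two fields with a space so their words do not run together.
df['job_post'] = (df['description'].fillna('') + ' ' + df['skills_desc'].fillna('')).str.strip()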
df[['description', 'skills_desc', 'job_post']].head()

Unnamed: 0,description,skills_desc,job_post
0,Job descriptionA leading real estate firm in N...,Requirements: \n\nWe are seeking a College or ...,Job descriptionA leading real estate firm in N...
1,"At Aspen Therapy and Wellness , we are committ...",,"At Aspen Therapy and Wellness , we are committ..."
2,The National Exemplar is accepting application...,We are currently accepting resumes for FOH - A...,The National Exemplar is accepting application...
3,Senior Associate Attorney - Elder Law / Trusts...,This position requires a baseline understandin...,Senior Associate Attorney - Elder Law / Trusts...
4,Looking for HVAC service tech with experience ...,,Looking for HVAC service tech with experience ...


In [18]:
# df['job_post'] = np.where(df['skills_desc'].isna(), df['description'].astype('str'), df['skills_desc'].astype('str'))
# df[['description', 'skills_desc', 'job_post']].head()

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 123849 entries, 0 to 123848
Data columns (total 32 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   job_id                      123849 non-null  int64  
 1   company_name                122130 non-null  object 
 2   title                       123849 non-null  object 
 3   description                 123842 non-null  object 
 4   max_salary                  29793 non-null   float64
 5   pay_period                  36073 non-null   object 
 6   location                    123849 non-null  object 
 7   company_id                  122132 non-null  float64
 8   views                       122160 non-null  float64
 9   med_salary                  6280 non-null    float64
 10  min_salary                  29793 non-null   float64
 11  formatted_work_type         123849 non-null  object 
 12  applies                     23320 non-null   float64
 13  original_liste

## 1.2.2. Select Data to Ingest into the Vector DB

**Purpose**:
- Insert data samples into the vector database (VectorDB), which serves as the knowledge base

**Note**:
- Because data ingestion is very time-consuming, here I only select N*2 samples (default: N=200) for demonstration (approximately 15 min. for 400 samples).

- In the sample, I pick 50% of the data from postings whose job titles match specific keywords (choose them yourself!), and the other 50% at random from the remaining job titles.

### 💡 Customize Yourself!

In [20]:
# define the job-title keywords you are interested in
keywords = ['data science', 'data scientist', 'data analyst']

## Number of relevant / irrelevant samples
N = 200

In [21]:
# select the data with the job titles which contain the keywords you define
condition = df['title'].str.contains('|'.join(keywords), case=False, na=False)
df_ds = df[condition]
# 'case=False' makes the search case-insensitive
# 'na=False' ensures that NaN values are not considered in the search

N_ds = min(len(df_ds), N)
df_ds = df_ds.sample(n=N_ds)
print(f"There're {N_ds} samples searched according to keywords.")
df_ds.head(10)

There're 300 samples searched according to keywords.


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,job_post
23944,3890892407,TalentBridge,Data Analyst,Data analysts play a crucial role in organizat...,47.0,HOURLY,New York City Metropolitan Area,95313.0,42.0,,...,1712855000000.0,,0,CONTRACT,USD,BASE_SALARY,92560.0,,,Data analysts play a crucial role in organizat...
75963,3903461431,INSPYR Solutions,Sr. Data Analyst,"Title:Sr. Data Analyst\n\nLocation: Deerfield,...",75.0,HOURLY,"Deerfield Beach, FL",1742506.0,30.0,,...,1713458000000.0,www2.jobdiva.com,0,FULL_TIME,USD,BASE_SALARY,145600.0,33441.0,12011.0,"Title:Sr. Data Analyst\n\nLocation: Deerfield,..."
29428,3894294473,Aditi Consulting,Data Analyst I,Responsibilities:\n\nPerforms complex data ana...,,,"Austin, TX",2985733.0,4.0,,...,1712865000000.0,www2.jobdiva.com,0,FULL_TIME,,,,78701.0,48453.0,Responsibilities:\n\nPerforms complex data ana...
11779,3887499080,Athene,Sales Data Analyst/Consultant,"We are Athene. At Athene, we are driven to do ...",,,"West Des Moines, IA",3362620.0,6.0,,...,1712348000000.0,athene.wd5.myworkdayjobs.com,0,FULL_TIME,,,,50265.0,19153.0,"We are Athene. At Athene, we are driven to do ..."
28674,3894213012,Federal Transit Administration,Transportation Program Specialist/ Transportat...,Summary\n\nThis is a public notice for FTA aut...,,,"Washington, DC",2204354.0,4.0,,...,1712856000000.0,www.usajobs.gov,0,FULL_TIME,,,,20001.0,11001.0,Summary\n\nThis is a public notice for FTA aut...
71952,3902865990,Fixity Technologies,Data Analyst (BANKING DOMAIN ONLY),Position Title: Data Analyst (BANKING DOMAIN O...,55.0,HOURLY,"Jersey City, NJ",74777356.0,4.0,,...,1713567000000.0,,0,CONTRACT,USD,BASE_SALARY,109200.0,7302.0,34017.0,Position Title: Data Analyst (BANKING DOMAIN O...
95865,3904938734,SS&C Technologies,LLM Data Scientist,SS&C is a global provider of investment and fi...,,,"Boston, MA",7684.0,60.0,,...,1713453000000.0,wd1.myworkdaysite.com,0,FULL_TIME,,,,2108.0,25025.0,SS&C is a global provider of investment and fi...
84694,3904362210,Dice,BI Data Analyst - AWS QuickSight,Dice is the leading career destination for tec...,,,"Tempe, AZ",6849.0,27.0,,...,1713390000000.0,click.appcast.io,0,FULL_TIME,,,,85281.0,4013.0,Dice is the leading career destination for tec...
91395,3904507677,Genpact,Data Analyst – Banking,"Role: Data Analyst – Banking Location: NYC, NY...",120000.0,YEARLY,"Tampa, FL",210064.0,3.0,,...,1713539000000.0,,0,FULL_TIME,USD,BASE_SALARY,115000.0,33602.0,12057.0,"Role: Data Analyst – Banking Location: NYC, NY..."
12953,3887710955,Trimble Inc.,Data Science/Data Engineering Intern,Your Title: Data Engineer / Data Scientist Int...,,,"Minnetonka, MN",5160.0,19.0,,...,1712351000000.0,trimble.wd1.myworkdayjobs.com,0,INTERNSHIP,,,,55345.0,27053.0,Your Title: Data Engineer / Data Scientist Int...
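
The keyword filter above joins the keywords with `|` and passes the result to `str.contains`, which treats it as a regular expression. That works for plain words, but a keyword such as `c++` would be read as regex syntax instead of literal text. The sketch below escapes each keyword first. The helper name `build_title_pattern` and the sample titles are hypothetical.

```python
import re


def build_title_pattern(keywords):
    """Join keywords into one alternation pattern, escaping regex metacharacters."""
    return "|".join(re.escape(keyword) for keyword in keywords)


keywords = ["data scientist", "c++ developer"]
pattern = build_title_pattern(keywords)

titles = ["Senior C++ Developer", "Data Scientist II", "Accountant"]
matches = [t for t in titles if re.search(pattern, t, flags=re.IGNORECASE)]
print(matches)
```

The same pattern can be passed to pandas unchanged, for example `df['title'].str.contains(pattern, case=False, na=False)`.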


In [22]:
# randomly pick data with other job titles
df_others = df[~condition].sample(n=N)
print(f"There're {len(df_others)} samples searched not contained in keywords.")
df_others.head(10)

There're 300 samples searched not contained in keywords.


Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips,job_post
45178,3900953334,"TalentBurst, an Inc 5000 company",Contracts Administrator 5,Title: Contracts Administrator\n\nDuration: 6 ...,,,"Costa Mesa, CA",122451.0,31.0,,...,1713390000000.0,www1.jobdiva.com,0,CONTRACT,,,,92626.0,6059.0,Title: Contracts Administrator\n\nDuration: 6 ...
34400,3895488776,Connected Health Care,"Travel - Registered Nurse - Med/Surg * $2,200/Wk*",Job Description\n\nTravel Contract for Registe...,,HOURLY,"Augusta, GA",72610772.0,5.0,61.0,...,1712884000000.0,jobs.connectedhealthcare.com,0,FULL_TIME,USD,BASE_SALARY,126880.0,30901.0,13245.0,Job Description\n\nTravel Contract for Registe...
101191,3905243299,DNA Partners,Trust & Estates Paralegal,Trust & Estates ParalegalSalary: $80K - $100K\...,100000.0,YEARLY,"Roseland, NJ",98211233.0,3.0,,...,1713465000000.0,,0,FULL_TIME,USD,BASE_SALARY,90000.0,7068.0,34013.0,Trust & Estates ParalegalSalary: $80K - $100K\...
102127,3905259695,Elgen Staffing,Personal Injury Attorney,"Handle New Jersey personal injury cases, such ...",,,"Hackensack, NJ",28126126.0,2.0,,...,1713471000000.0,,0,FULL_TIME,,,,7601.0,34003.0,"Handle New Jersey personal injury cases, such ..."
119912,3906226886,Camping World,"RV Sales Associate - Ft Myers, FL","Camping World Holdings, Inc., headquartered in...",,,"Fort Myers, FL",23441.0,4.0,,...,1713563000000.0,recruiting.adp.com,0,FULL_TIME,,,,33900.0,,"Camping World Holdings, Inc., headquartered in..."
316,3774103540,Commercial Partners of Texas,Commercial Loan Officer,Company Description\n Commercial Partners of T...,,,"Texas, United States",3478848.0,2.0,,...,1712865000000.0,,0,CONTRACT,,,,,,Company Description\n Commercial Partners of T...
95106,3904923642,Macy's,"Asset Protection Detective, Firewheel Town Cen...",Are you ready to bring your amazing self to wo...,,,"Garland, TX",2300.0,5.0,,...,1713451000000.0,jsv3.recruitics.com,0,FULL_TIME,,,,75040.0,48113.0,Are you ready to bring your amazing self to wo...
20022,3889125533,Columbia Sportswear Company,Co-Manager,Total Rewards: Quarterly Bonus Program to awar...,91700.0,YEARLY,"Castle Rock, CO",7003.0,6.0,,...,1712639000000.0,columbiasportswearcompany.wd5.myworkdayjobs.com,0,FULL_TIME,USD,BASE_SALARY,75350.0,80104.0,8035.0,Total Rewards: Quarterly Bonus Program to awar...
33926,3895236154,H&M,H&M Retail Sales Associate,"Job Description\n\nAt H&M, as a Retail Sales A...",,,"Murfreesboro, TN",5328.0,3.0,,...,1712866000000.0,jobs.smartrecruiters.com,0,PART_TIME,,,,37127.0,,"Job Description\n\nAt H&M, as a Retail Sales A..."
74176,3902948157,Apex Systems,Network Engineer,Job#: 2025001\n\nJob Description:\n\nThis posi...,,,"Birmingham, AL",4787.0,13.0,,...,1713281000000.0,www.apexsystems.com,0,FULL_TIME,,,,35203.0,1073.0,Job#: 2025001\n\nJob Description:\n\nThis posi...


In [23]:
# combine the two selected data sets together
df_select = pd.concat([df_ds, df_others])

print(f"There're totally {len(df_select)} samples for inserting VectorDB.")

There're totally 600 samples for inserting VectorDB.


# 1.3 Build a VectorDB

## 1.3.1. Create a Container

**Container (aka. collection)**
- To build a VectorDB, you first create a container inside it: a collection that stores and organizes vectors of a similar type, allowing efficient, relevant searches.
- The created collection needs a specified encoder function, so it knows how to encode the data into embeddings
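
To make the roles of the collection and its encoder concrete, here is a toy sketch in plain Python. It is not how Chroma is implemented: `toy_encode`, `ToyCollection`, and the six-word vocabulary are hypothetical stand-ins for a real embedding model and vector index.

```python
import math

# Tiny fixed vocabulary; a real encoder maps text to dense vectors instead.
VOCAB = ["python", "data", "analyst", "nurse", "hospital", "sales"]


def toy_encode(text):
    """Hypothetical encoder: vocabulary word counts, scaled to unit length."""
    words = text.lower().split()
    vec = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyCollection:
    """Stores (id, vector, text, metadata) rows and searches them by similarity."""

    def __init__(self, encoder):
        self.encoder = encoder
        self.rows = []

    def add(self, doc_id, text, metadata=None):
        # The collection encodes the text itself, using the encoder it was given.
        self.rows.append((doc_id, self.encoder(text), text, metadata or {}))

    def search(self, query, k=1):
        q = self.encoder(query)
        scored = [(sum(a * b for a, b in zip(q, vec)), doc_id)
                  for doc_id, vec, _, _ in self.rows]
        scored.sort(reverse=True)  # highest dot product (cosine) first
        return [doc_id for _, doc_id in scored[:k]]


collection = ToyCollection(toy_encode)
collection.add("a", "python data analyst statistics", {"title": "Data Analyst"})
collection.add("b", "registered nurse hospital care", {"title": "Nurse"})
print(collection.search("data analyst with python", k=1))
```

The query is encoded with the same function as the stored documents, which is why the collection has to know its encoder up front.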

In [26]:
# specify and download the encoder from hugging face platform
encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
encoder

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [27]:
import chromadb
collection_name = "collection_postings"

persistent_client = chromadb.PersistentClient()
# print("check all functions/attribute for Chroma:\n", dir(persistent_client))

# list_collections() may return Collection objects or plain names depending on the chromadb version
existing_names = [getattr(c, "name", c) for c in persistent_client.list_collections()]
if collection_name not in existing_names:
    print(f"{collection_name} not in collection (DB) yet!")
else:
    print(f"{collection_name} is already in collection and would be deleted!")
    persistent_client.delete_collection(collection_name)

print(f"Create collection: {collection_name}!")
vector_store = Chroma(
    client=persistent_client,
    collection_name=collection_name,
    embedding_function=encoder,
    persist_directory="./chroma_langchain_db",  # save data locally; remove if not necessary
)

#print("check all functions/attribute for Chroma:\n", dir(vector_store))

# if collection_name in vector_store.list_collections():
#     vector_store.delete_collection(collection_name)


collection_postings not in collection (DB) yet!
Create collection: collection_postings!


## 1.3.2. Indexing: Insert Data into VectorDB


In [29]:
from langchain_core.documents import Document

# create list of documents (each document is a chunk) with specified ID
# they will be later ingested into the VectorDB
ids = []
documents = []
for index, row in df_select.iterrows():
    id_current = str(index)
    ids.append(id_current)

    min_salary = row['min_salary']
    if pd.isna(min_salary):
        min_salary = 0  # or any default value you prefer

    document_current = Document(
        page_content=row['job_post'], #Main Data (the one be encoded)
        metadata={"title": row['title'],
                  "location": row['location'],
                  "min_salary": min_salary,
                  "pay_period": row['pay_period'],
                  "job_posting_url": row['job_posting_url'],
                 },
        id=str(row['job_id']),  # document IDs are expected to be strings
    )
    documents.append(document_current)
print(f"There are {len(documents)} documents (chunks).\n")
print(f'Example document content:\n{document_current}')

There are 600 documents (chunks).

Example document content:
page_content='The Senior Scientist is responsible for all steps of product development from ideation phase to final phase of formulation. Primary responsibilities will be the innovation of new products, managing new development, and reformulation in the hair care category.
The successful candidate should be an experienced practitioner excited by hands-on laboratory work in a dynamic R&D environment in the cosmetics industry, with broad skill set and experience in hair care formulations the work enables development and commercialization of new products, qualification of starting materials, and helps ensure safety, quality, and regulatory compliance of the innovation portfolio and the base business.
Formulation:You schedule daily activity, including prioritizing projects. You also schedule your direct reports daily activity and assist in those experimental designs.You are quite familiar with skin and hair care technologies (sur
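
The loop above replaces a missing `min_salary` with 0, but other metadata fields such as `pay_period` can still be NaN. As a sketch, a small helper can handle all fields the same way. `clean_metadata` is a hypothetical name, and dropping keys that have no default is an assumption about what is wanted here, not something the notebook prescribes.

```python
import math


def clean_metadata(raw, defaults=None):
    """Replace None/NaN values with defaults; drop missing keys that have no default."""
    defaults = defaults or {}
    cleaned = {}
    for key, value in raw.items():
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if not missing:
            cleaned[key] = value
        elif key in defaults:
            cleaned[key] = defaults[key]
    return cleaned


row = {"title": "Data Analyst", "min_salary": float("nan"), "pay_period": float("nan")}
print(clean_metadata(row, defaults={"min_salary": 0}))
```

In the notebook, the result could be passed as `metadata=clean_metadata({...}, defaults={"min_salary": 0})` when building each `Document`.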

In [30]:
import time
start = time.time()
vector_store.add_documents(documents=documents, ids=ids)
end = time.time()
print('Time spent (min.) for data insertion: \t', (end-start)/60)

Time spent (min.) for data insertion: 	 19.958421730995177
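
Since one `add_documents` call for all documents takes many minutes and gives no feedback, inserting in batches makes progress visible. The sketch below shows only the batching logic. `batched` is a hypothetical helper, and the strings stand in for the real `Document` objects.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


documents = [f"doc-{i}" for i in range(10)]  # placeholders for Document objects
ids = [str(i) for i in range(10)]

for doc_batch, id_batch in zip(batched(documents, 4), batched(ids, 4)):
    # In the notebook this line would be:
    #   vector_store.add_documents(documents=doc_batch, ids=id_batch)
    print(f"inserting {len(doc_batch)} documents (ids {id_batch[0]}..{id_batch[-1]})")
```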


In [50]:
# Check the data in vector_store.

# Get all the documents in the vector store
documents = vector_store.get(include=["documents", "metadatas", "embeddings"])


doc = documents["documents"][0]
metadata = documents["metadatas"][0]
embedding = documents["embeddings"][0]
print("First Document Metadata:\n", metadata)
print("First Document:\n", doc)
print(f"First Document Embedding (with vector len={len(embedding)}):\n", embedding)


First Document Metadata:
 {'job_posting_url': 'https://www.linkedin.com/jobs/view/3890892407/?trk=jobs_biz_prem_srch', 'location': 'New York City Metropolitan Area', 'min_salary': 42.0, 'pay_period': 'HOURLY', 'title': 'Data Analyst'}
First Document:
 Data analysts play a crucial role in organizations by conducting in-depth analysis across the data lifecycle, encompassing requirements gathering, activities, and design phases. They are tasked with developing robust reporting capabilities, monitoring performance metrics, and implementing quality control measures to drive continuous improvement.Job Duties:Interpret data and derive actionable insights by applying statistical techniques to analyze results effectively.Develop and deploy databases, data collection systems, and advanced analytics strategies to enhance statistical efficiency and ensure data quality.Source data from primary and secondary sources, and maintain databases and data systems to ensure accessibility and reliability.Ide

# 1.4 Search Engine: VectorDB as Retriever

## 1.4.1. Vector Search (Similarity Search)

**Similarity Search**
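- Find the top-k stored embeddings that are closest to the query embedding
- score: the distance returned by Chroma's `similarity_search_with_score` (squared L2 by default, so a lower score means a closer match), not a cosine similarity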
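
For unit-length embeddings, squared L2 distance and cosine similarity are directly related: d = 2 - 2·cos. The sketch below checks this numerically and converts a score back to a cosine. It assumes the collection uses Chroma's default distance and that the encoder normalizes its outputs (the encoder printed earlier ends with a `Normalize()` module). The helper names are hypothetical.

```python
import math


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # Plain dot product; equals cosine similarity because both inputs are unit vectors.
    return sum(x * y for x, y in zip(a, b))


def distance_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


a = normalise([1.0, 2.0, 3.0])
b = normalise([2.0, 1.0, 0.5])
print(math.isclose(squared_l2(a, b), 2.0 - 2.0 * cosine(a, b)))
print(round(distance_to_cosine(0.558345), 3))
```

Under these assumptions, a reported score of 0.558 corresponds to a cosine similarity of roughly 0.72.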


**Application**
- According to your query (autobiography), find k=10 most suitable job posts.

### 💡 Customize Yourself!

**Prepare a query**: self description
- Provide a brief autobiography for your consultant's reference. Imagine you're preparing your resume: what information should you include? (E.g., education, experience, abilities, personality, the job position you're looking for, etc.)
- The words (texts only) need not be too long (< 500 words)

In [33]:
query = "I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an role related to data analysis where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field. What jobs are most suitable for me?"
print(query)

I am a passionate job seeker with a strong desire to embark on a career in data science. Having recently graduated with a degree in Computer Science, I have honed my skills in Python programming and developed a deep interest in machine learning. During my studies, I immersed myself in various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. My journey into data science has been driven by a fascination with uncovering hidden patterns in data and using these insights to solve real-world problems. I am now seeking an role related to data analysis where I can leverage my Python expertise and enthusiasm for machine learning to contribute to a dynamic team, learn from experienced professionals, and continue to grow my skills in this exciting field. What jobs are most suitable for me?


In [34]:
results = vector_store.similarity_search_with_score(
    query , k=10,
)
i =1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1

* [1][SIM=0.558345] Data Scientist II
---------------------
           Job Opportunity: Data Scientist II

Our client in the Insurance industry is seeking a talented Data Scientist II to join their Advanced Analytics department.

Role & Responsibilities:
Develop and implement data analytics strategies to drive business insightsCreate predictive models and algorithms to optimize decision-making processesAnalyze large datasets to identify trends and patterns for actionable recommendationsCollaborate with cross-functional teams to support data-driven decision-makingPresent findings and recommendations to senior management
Key Skills:
Strong experience in data science and analytics - 4 years minimumProficiency in PythonAdvanced knowledge of statistical analysis and data visualization toolsAbility to work with large datasets and databasesExcellent communication and presentation skillsKnowledge in AI/MLWorking experience with Databricks, Azure ML, and Azure CloudWroking experience with healt

## 1.4.2. Vector Search with Filtering

**Filtering**  
- You can filter out some job posts based on the condition you set on the _Meta Data_.

**Application**
- According to the query, find k=5 best possible job posts which have minimum salary greater than (gt) 100000
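
To illustrate what such a metadata filter means, the sketch below evaluates a filter dictionary against plain metadata dictionaries. It mimics the operator syntax only and is not Chroma's implementation. The `matches` helper and the sample posts are hypothetical.

```python
import operator

# Comparison operators used in filter dictionaries, mapped to Python functions.
OPERATORS = {
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$eq": operator.eq, "$ne": operator.ne,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if `metadata` satisfies every condition in `where`."""
    for field, condition in where.items():
        if field == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif field == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if field not in metadata:
                return False
            # A bare value is shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            value = metadata[field]
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
    return True


posts = [
    {"title": "A", "min_salary": 120000.0, "pay_period": "YEARLY"},
    {"title": "B", "min_salary": 75.0, "pay_period": "HOURLY"},
    {"title": "C", "min_salary": 0, "pay_period": "YEARLY"},
]
yearly_high = {"min_salary": {"$gt": 100000}}
hourly_high = {"$and": [{"pay_period": "HOURLY"}, {"min_salary": {"$gte": 50}}]}
print([p["title"] for p in posts if matches(p, yearly_high)])
print([p["title"] for p in posts if matches(p, hourly_high)])
```

Because `min_salary` mixes hourly and yearly amounts, a single numeric threshold silently excludes every hourly post. Combining it with a `pay_period` condition, as in `hourly_high`, makes the intent explicit.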

In [35]:
results = vector_store.similarity_search_with_score(
    query , k=5, filter={"min_salary": {"$gt": 100000}}
)  # operators: $gt, $gte, $lt, $lte, $ne, $eq, $in, $nin
i = 1
for res, score in results:
    print(f"* [{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    i +=1

* [1][SIM=0.558345] Data Scientist II
---------------------
           Job Opportunity: Data Scientist II

Our client in the Insurance industry is seeking a talented Data Scientist II to join their Advanced Analytics department.

Role & Responsibilities:
Develop and implement data analytics strategies to drive business insightsCreate predictive models and algorithms to optimize decision-making processesAnalyze large datasets to identify trends and patterns for actionable recommendationsCollaborate with cross-functional teams to support data-driven decision-makingPresent findings and recommendations to senior management
Key Skills:
Strong experience in data science and analytics - 4 years minimumProficiency in PythonAdvanced knowledge of statistical analysis and data visualization toolsAbility to work with large datasets and databasesExcellent communication and presentation skillsKnowledge in AI/MLWorking experience with Databricks, Azure ML, and Azure CloudWroking experience with healt

# 2️⃣ Retrieval and Generation


# 2.1. Environment Setup

Sources  
- [langchain-chroma](https://pypi.org/project/langchain-chroma/)
- [Gemini API Python quickstart](https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/tutorials/quickstart_colab.ipynb#scrollTo=-QhPWE1lwZHH)

In [36]:
# install package for Google Gemini
!pip install -q -U google-generativeai

In [37]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.8.0


# 2.2. Connect to VectorDB & LLM Agent


## 2.2.1. Connect to VectorDB (Chroma)

Once the VectorDB has been built, you can reconnect to it by specifying the collection name.

In [39]:
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint

collection_name = "collection_postings"

#encoder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

persistent_client = chromadb.PersistentClient()
print(persistent_client.list_collections())

vector_store = Chroma(client=persistent_client,
                      collection_name=collection_name,
                      embedding_function=encoder)


[Collection(id=22242d9e-6d04-4685-ad52-9afd020857cb, name=collection_postings)]


In [53]:
# # Check the data in vector_store.

# # Get all the documents in the vector store
# documents_with_embeddings = vector_store.get(include=["documents", "metadatas", "embeddings"])


# doc = documents_with_embeddings["documents"][0]
# metadata = documents_with_embeddings["metadatas"][0]
# embedding = documents_with_embeddings["embeddings"][0]
# print("First Document:\n", doc)
# print("First Document Metadata:\n", metadata)
# print(f"First Document Embedding (with vector len={len(embedding)}):\n", embedding)


## 2.2.2. Connect to Agent (Call Gemini API)

In [51]:
# Import the Python SDK
import google.generativeai as genai
# Used to securely store your API key
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GEMINI_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-pro')
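
`google.colab.userdata` is only available inside Colab. A minimal sketch of a fallback for other environments follows, assuming the key is exported as an environment variable. The helper name `load_api_key` and the choice to reuse the name `GEMINI_API_KEY` for the environment variable are illustrative assumptions, not part of the original notebook.

```python
import os

def load_api_key(name="GEMINI_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is not accessible:
        # fall back to an environment variable of the same name.
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}.")
    return key
```

The returned value could then be passed to `genai.configure(api_key=...)` exactly as in the cell above.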

# 2.3. Retrieval and Generation Application

## 2.3.1. Prepare Prompt

**Purpose**

Give instructions to the AI assistant:
1. Role: career counselor
1. Tasks: how the AI assistant should respond to the query
1. Context: relevant information that the AI assistant can base its response on
  - query
  - specification: retrieved job posts
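
The three parts listed above (role, tasks, context) can be sketched as a small template builder. This is only an illustration of the structure: the names `ROLE`, `TASKS`, and `build_prompt`, and the shortened wording, are assumptions and not the prompt used in this notebook.

```python
# Illustrative role and task texts (hypothetical, shortened for the example).
ROLE = "You are a career counselor who helps job seekers find suitable jobs."
TASKS = [
    "Analyze the client's skills.",
    "Summarize the skills required by the job posts.",
    "Advise the client on how to close the gap.",
]

def build_prompt(query, specification):
    """Assemble role, tasks, and context into one prompt string."""
    steps = "\n".join(f"Step {n}. {task}" for n, task in enumerate(TASKS, start=1))
    return (
        f"{ROLE}\n\n"
        f"{steps}\n\n"
        f"<query>{query}</query>\n"
        f"<specification>{specification}</specification>\n"
        "Advice:\n"
    )

print(build_prompt("I know Python.", "Title: Data Analyst"))
```

Wrapping the context in tags such as `<query>` makes it easier for the model to tell the instructions apart from the client's text.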

**Resources**
- [**What is Prompt Engineering?**](https://www.datacamp.com/blog/what-is-prompt-engineering-the-future-of-ai-communication)
 > **Prompt engineering** is the practice of designing and refining prompts—questions or instructions—to elicit specific responses from AI models.
- [Prompt Optimization Techniques: Prompt Engineering for Everyone](https://www.datacamp.com/blog/prompt-optimization-techniques)

### 💡 Customize Yourself!

In [52]:
extraction_prompt = ''' You are a career consultant who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in the <specification> tag below) that might best meet your client's requirements

Upon receiving the aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification.
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs mentioned in the job specification according to the reasoning above.

Question:
    <query>{query}</query>
Job Post Information:
    <specification>{specification}</specification>
Advice:
'''

## 2.3.2. Prepare Input Query


### 💡 Customize Yourself!

In [54]:
# This query is the same as the one in section 1.4.1.
query = "I recently graduated with a Bachelor degree in Computer Science, I use Python and have good grades in machine learning and deep learning. I had various projects that allowed me to apply these skills, from building predictive models to analyzing large datasets. I am now seeking an entry-level data scientist or data analyst role."

## 2.3.3. Search Results based on Query

In [57]:
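# Retrieve the 5 job posts closest to the query.
# Note: Chroma reports a distance here, so a lower score should mean a closer match.
results = vector_store.similarity_search_with_score(
    query, k=5,  # filter={"title": {"$in": keywords}}
)
# Concatenate the retrieved posts into one context string for the prompt.
specification = ""
for i, (res, score) in enumerate(results):
    print(f"[{i}][SIM={score:3f}] {res.metadata['title']}\n---------------------\n \
          {res.page_content} \n--------------------\n \
           [{res.metadata}]\n\n")
    specification += ('\nTitle: ' + res.metadata['title'] +'\n ' + res.page_content)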

[0][SIM=0.632815] Content Development | Data Science & NLP
---------------------
           We are looking to hire a Course Developer for the Mentioned "Computer Science Branches". Particular candidates will be responsible for Developing, Reviewer and questioning the relevant details of the subject. Please have a look and update with your best.
Subjects:Data Science Tools and TechnologiesNatural Language Processing (NLP)
Course 1: Data Science Tools and TechnologiesQualification Requirements:Master’s degree or higher in Data Science, Computer Science, or a related technical field.Certifications in data science tools and technologies, such as Python, R, SQL, or cloud platforms.Published work or contributions to open-source projects related to data science methodologies.Evidence of continuous learning and skill development in emerging data science technologies.
Experience Requirements:At least 4 years of professional experience in data science, with a strong portfolio demonstrating exper
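
Conceptually, the search above embeds the query and ranks the stored vectors by how close they are to it. A minimal pure-Python sketch with made-up 3-dimensional vectors is shown below. Real embeddings have hundreds of dimensions, and the helper names and the choice of cosine distance are assumptions for illustration; the metric actually used depends on how the collection was configured.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2):
    """Return the k (title, distance) pairs with the smallest distance."""
    scored = [(title, cosine_distance(query_vec, vec)) for title, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy "embeddings" (invented values, for illustration only).
docs = {
    "Data Analyst": [0.9, 0.1, 0.0],
    "Data Scientist": [0.8, 0.6, 0.0],
    "Chef": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for title, dist in top_k(query_vec, docs):
    print(f"{dist:.3f}  {title}")
```

A vector store does the same ranking, typically with an approximate index so that it does not have to compare the query against every stored vector.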

In [58]:
print(specification)


Title: Content Development | Data Science & NLP
 We are looking to hire a Course Developer for the Mentioned "Computer Science Branches". Particular candidates will be responsible for Developing, Reviewer and questioning the relevant details of the subject. Please have a look and update with your best.
Subjects:Data Science Tools and TechnologiesNatural Language Processing (NLP)
Course 1: Data Science Tools and TechnologiesQualification Requirements:Master’s degree or higher in Data Science, Computer Science, or a related technical field.Certifications in data science tools and technologies, such as Python, R, SQL, or cloud platforms.Published work or contributions to open-source projects related to data science methodologies.Evidence of continuous learning and skill development in emerging data science technologies.
Experience Requirements:At least 4 years of professional experience in data science, with a strong portfolio demonstrating expertise in data manipulation, analysis, and v
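
Five full job posts can make the context long. If the prompt grows beyond what the model accepts, one option is to cap the length of each post before concatenating. The sketch below uses a character budget as a rough stand-in for tokens; the helper name `build_specification` and the limit of 1500 characters are arbitrary assumptions.

```python
def build_specification(posts, max_chars_per_post=1500):
    """Join (title, content) pairs, trimming each content to a character budget."""
    parts = []
    for title, content in posts:
        trimmed = content[:max_chars_per_post]
        if len(content) > max_chars_per_post:
            trimmed += " ..."  # mark that the post was cut
        parts.append(f"\nTitle: {title}\n {trimmed}")
    return "".join(parts)

# Hypothetical posts, for illustration only.
posts = [("Data Analyst", "SQL and dashboards. " * 200), ("Data Scientist", "Python and ML.")]
print(len(build_specification(posts)))
```

With the search results above, `posts` could be built as `[(res.metadata['title'], res.page_content) for res, _ in results]`.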

## 2.3.4. Get Advice from Career Consultant

In [59]:
#Give your Career Consultant your query (and the relevant job posts)
prompt_all = extraction_prompt.format(query=query, specification=specification)
print(prompt_all)

 You are a career consultant who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer
    2. Specification: The job post information (enclosed in the <specification> tag below) that might best meet your client's requirements

Upon receiving the aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs in the job specification.
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs mentioned in the job specification according

In [60]:
response = model.generate_content(prompt_all)
print(response.text)

**Step 1: Analyze your client's abilities, including hard and soft skills**

**Hard skills:**
* Proficient in Python
* Strong in machine learning and deep learning
* Experience with predictive modeling and data analysis
* Familiar with SQL and data visualization tools

**Soft skills:**
* Excellent problem-solving skills
* Strong communication and presentation skills

**Step 2: Analyze and summarize the skills needed for the best possible jobs in the job specification**

**Data Scientist II**
* 4+ years of experience in data science and analytics
* Proficiency in Python, statistical analysis, and data visualization tools
* Ability to work with large datasets and databases
* Excellent communication and presentation skills

**Data Scientist (6+ years)**
* 6+ years of experience in data science
* Proven experience in deploying real-time AI/ML models using Google Cloud Platform
* Strong programming skills in Python and PySpark
* Expertise with classical ML algorithm and deep learning algori
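
The `response.text` accessor can raise a `ValueError` when the response contains no usable text, for example when the output was blocked. A small defensive wrapper is sketched below; `safe_text` is a hypothetical helper, and the stand-in class only imitates a blocked response so that the sketch runs without calling the API.

```python
def safe_text(response, fallback="(no text returned)"):
    """Return response.text, or a fallback message if the accessor raises ValueError."""
    try:
        return response.text
    except ValueError as err:
        return f"{fallback}: {err}"

class FakeBlockedResponse:
    """Stand-in for a response without text (illustration only)."""
    @property
    def text(self):
        raise ValueError("response was blocked")

print(safe_text(FakeBlockedResponse()))
```

In the notebook, `print(safe_text(response))` could replace `print(response.text)`.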

# 2.4. What If: Generation without Retrieved Context

In [61]:
extraction_prompt = ''' You are a career consultant who helps job seekers find their dream jobs. You give professional advice tailored to the needs of your client (i.e., the job seeker) according to the following information:
    1. Query: Your client's question (enclosed in the <query> tag below) that you need to answer


Upon receiving the aforementioned information, you need to proceed with the following procedures:
Step 1. Analyze your client's abilities, including hard and soft skills.
Step 2. Analyze and summarize the skills needed for the best possible jobs.
Step 3. Summarize your client's strengths that are already sufficient for the job application.
Step 4. Summarize your client's weaknesses that they need to improve in order to meet the job requirements.
Step 5. Finally, give them advice on how to get the jobs.

Question:
    <query>{query}</query>

Advice:
'''

prompt_all = extraction_prompt.format(query=query)

In [62]:
response = model.generate_content(prompt_all)
print(response.text)

**Step 1: Analyze your client's abilities**
**1.1 Hard skills**
* Proficiency in Python programming language
* Machine learning and deep learning techniques
* Data analysis and visualization skills
* Experience in building predictive models
* Experience in analyzing large datasets

**1.2 Soft skills**
* Communication skills
* Problem-solving skills
* Analytical skills
* Teamwork skills
* Detail-oriented

**Step 2: Analyze and summarize the skills needed for the best possible jobs**
* Entry-level data scientist roles typically require a strong foundation in statistics, machine learning, and programming.
* Entry-level data analyst roles generally involve data cleaning, data analysis, and visualization.

**Step 3: Summarize your client's strengths that are already sufficient for the job application**
* Strong academic background in computer science with a focus on machine learning and deep learning
* Hands-on experience in applying machine learning and deep learning skills to real-world p
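
One rough way to compare the two responses is to check which retrieved job titles each of them mentions: a response generated with retrieved context can refer to concrete postings, while a response without context can only speak in general terms. The sketch below uses plain substring matching, which is crude; `mentioned_titles` and the sample strings are illustrative assumptions.

```python
def mentioned_titles(response_text, titles):
    """Return the titles that appear (case-insensitively) in the response text."""
    lowered = response_text.lower()
    return [t for t in titles if t.lower() in lowered]

# Shortened sample texts, for illustration only.
titles = ["Data Scientist II", "Course Developer"]
with_context = "**Data Scientist II**\n* 4+ years of experience in data science"
without_context = "Entry-level data scientist roles typically require statistics."

print(mentioned_titles(with_context, titles))
print(mentioned_titles(without_context, titles))
```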

# 3️⃣ Store Data to your Google Drive

Save your files from Google Colab to Google Drive.

**Note**
- Your data is lost once you close the Colab page unless you save it somewhere else (e.g., Google Drive or GitHub).


**Save to Google Drive**
- To save this IPython (Jupyter) notebook, select:
  > File > Save a copy in Drive
- To save a specific file (e.g., a CSV file), use the code below.
- Alternatively, after mounting the Drive here, manually drag the file or folder (e.g., the chroma folder) to the target folder (MyDrive).

## Mount Google Drive

In [63]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


## Create a Folder in Google Drive


In [66]:
import os

def create_folder_in_drive(folder_name):
    # Define the path for the new folder (under the MyDrive folder)
    folder_path = f'/content/drive/MyDrive/{folder_name}'

    # Create the folder if it doesn't exist
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f'Folder "{folder_name}" created in Google Drive.')
    else:
        print(f'Folder "{folder_name}" already exists.')

# Create the folder in your Google Drive
folder_name = 'Tutorial - LinkedIn Job Posting with GenAI'
create_folder_in_drive(folder_name)

subfolder_name = ''


Folder "Tutorial - LinkedIn Job Posting with GenAI" already exists.
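
The same check-then-create logic can also be written with `pathlib`, where `exist_ok=True` makes the call safe to repeat. This is an alternative sketch, not the notebook's method: `ensure_folder` is a hypothetical helper, and a temporary directory stands in for `/content/drive/MyDrive` so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

def ensure_folder(base_dir, folder_name):
    """Create base_dir/folder_name if needed and return its path."""
    folder = Path(base_dir) / folder_name
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder

# A temporary directory stands in for /content/drive/MyDrive here.
with tempfile.TemporaryDirectory() as base:
    folder = ensure_folder(base, "Tutorial - LinkedIn Job Posting with GenAI")
    print(folder.is_dir())
```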


## Save a File to the Created Folder

In [65]:
import shutil

def save_file_to_drive(file_name, folder_name):
    source_path = f'/content/{file_name}'  # Current location of the file in Colab
    destination_path = f'/content/drive/MyDrive/{folder_name}/{file_name}'  # Destination path in Drive

    # Copy the file to the destination
    shutil.copy(source_path, destination_path)
    print(f'File "{file_name}" saved to "{folder_name}" in Google Drive.')

# Save a file from the Colab working directory (replace 'postings.csv' with your own file name)
file_name = 'postings.csv'
save_file_to_drive(file_name, folder_name)

File "postings.csv" saved to "Tutorial - LinkedIn Job Posting with GenAI" in Google Drive.
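
`shutil.copy` handles single files. For a whole folder, such as the chroma folder mentioned above, `shutil.copytree` can be used instead of dragging it by hand. The sketch below is an assumption about how one might do this: `save_folder` is a hypothetical helper, and temporary directories stand in for the Colab and Drive paths.

```python
import shutil
import tempfile
from pathlib import Path

def save_folder(source_dir, destination_dir):
    """Copy a whole folder tree, overwriting files that already exist."""
    # dirs_exist_ok requires Python 3.8 or newer.
    shutil.copytree(source_dir, destination_dir, dirs_exist_ok=True)
    return Path(destination_dir)

# Temporary directories stand in for /content/chroma and the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "chroma"
    src.mkdir()
    (src / "index.bin").write_text("demo")
    dst = save_folder(src, Path(tmp) / "drive" / "chroma")
    print(sorted(p.name for p in dst.iterdir()))
```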
