### Utilize the provided jobs data


In [1]:
!pip install pandas numpy beautifulsoup4 sentence-transformers transformers pinecone-client tqdm



In [2]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from pinecone import Pinecone, ServerlessSpec
from tqdm import tqdm 
import json
import pickle
import re
from confige import *
# Suppress Warnings
import warnings
# Suppress all warnings globally
warnings.filterwarnings('ignore')

In [3]:
df = pd.read_csv(sampled_jobs)
df.head()

Unnamed: 0,job_title,description,requirements,career_level
0,Senior Developer Relations Manager,<p>Senior Developer Relations Manager page is ...,,Not specified
1,Costing Manager - Cairo,"<ul>\n<li>Supervise, design and implement a co...",<ul>\n<li>Bachelor’s degree in Accounting</li>...,Manager
2,Banquet Supervisor,Mandarin Oriental Hotel GroupMandarin Oriental...,,Experienced (Non-Manager)
3,Trade Finance & Credit Collection,<p><b>About Us</b></p><br><p>Alfa Laval is a l...,,Not specified
4,Taste & Wellbeing Creative Marketing Associate...,Join us and celebrate the beauty of human expe...,,Not specified


In [24]:
df.shape

(40000, 4)

In [4]:

# fill NaN values in specific columns
df['requirements'] = df['requirements'].fillna('Not specified')
# Convert text to lowercase and remove unnecessary spaces or punctuation
df['description'] = df['description'].str.lower().str.replace(r'\s+', ' ', regex=True).str.strip()
df['requirements'] = df['requirements'].str.lower().str.replace(r'\s+', ' ', regex=True).str.strip()
# Function to clean HTML content
def clean_html(text):
    if isinstance(text, str):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text(separator=" ")
    return text

df['description'] = df['description'].apply(clean_html)
df['requirements'] = df['requirements'].apply(clean_html)

# Function to remove newline characters
def remove_newlines(text):
    if isinstance(text, str):
        return re.sub(r'\n', '', text)
    return text

# Apply the function to all elements in the DataFrame
df = df.map(remove_newlines)

print(df.shape)
df.head()

(40000, 4)


Unnamed: 0,job_title,description,requirements,career_level
0,Senior Developer Relations Manager,senior developer relations manager page is loa...,not specified,Not specified
1,Costing Manager - Cairo,"supervise, design and implement a consistent...",bachelor’s degree in accounting 10+ years ...,Manager
2,Banquet Supervisor,mandarin oriental hotel groupmandarin oriental...,not specified,Experienced (Non-Manager)
3,Trade Finance & Credit Collection,about us alfa laval is a leading global provid...,not specified,Not specified
4,Taste & Wellbeing Creative Marketing Associate...,join us and celebrate the beauty of human expe...,not specified,Not specified


In [5]:
# Combine relevant columns into a single text field for embedding
df['job_details'] = 'job_title:'+ df['job_title'] + ' - description:' + df['description'] + ' - requirements:' + df['requirements'] + ' - career_level:' + df['career_level']
df['job_details'].iloc[0]

"job_title:Senior Developer Relations Manager - description:senior developer relations manager page is loaded  senior developer relations manager locationsuae, dubai time typefull time posted onposted 3 days ago job requisition idjr1977176 we are looking for a senior developer relations manager to lead and expand nvidia’s engagements with isvs, startups, and universities in the middle east and africa region. this is a dynamic role responsible for forging strategic partnerships with various developer personas, guiding them to resolve their most challenging computing problems using nvidia’s technologies and platforms. in this position, you will serve as an nvidia expert in driving engagement to increase nvidia sdk adoption. you will play a significant role in defining use cases for nvidia’s products across industry metaverse, computer vision, and more. ideal candidates will have a technical background in al/ml, accelerated computing, computer vision, and simulation.  what you will be doi

In [28]:
df.head()

Unnamed: 0,job_title,description,requirements,career_level,job_details
0,Senior Developer Relations Manager,Senior Developer Relations Manager page is loa...,Not specified,Not specified,job_title:Senior Developer Relations Manager -...
1,Costing Manager - Cairo,"Supervise, design and implement a consistentl...",Bachelor’s degree in Accounting 10+ years pr...,Manager,job_title:Costing Manager - Cairo - descriptio...
2,Banquet Supervisor,Mandarin Oriental Hotel GroupMandarin Oriental...,Not specified,Experienced (Non-Manager),job_title:Banquet Supervisor - description:Man...
3,Trade Finance & Credit Collection,About Us Alfa Laval is a leading global provid...,Not specified,Not specified,job_title:Trade Finance & Credit Collection - ...
4,Taste & Wellbeing Creative Marketing Associate...,Join us and celebrate the beauty of human expe...,Not specified,Not specified,job_title:Taste & Wellbeing Creative Marketing...


### Implement document embedding using Sentence Transformers

In [6]:
# Load the SBERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Encode the job details into dense vectors
df['job_vectors'] = df['job_details'].apply(lambda x: model.encode(x))

In [7]:
df.to_csv('./data/vector_jobs.csv')

In [8]:
df.head()

Unnamed: 0,job_title,description,requirements,career_level,job_details,job_vectors
0,Senior Developer Relations Manager,senior developer relations manager page is loa...,not specified,Not specified,job_title:Senior Developer Relations Manager -...,"[-0.29941955, -0.152443, 0.03969142, -0.082494..."
1,Costing Manager - Cairo,"supervise, design and implement a consistent...",bachelor’s degree in accounting 10+ years ...,Manager,job_title:Costing Manager - Cairo - descriptio...,"[-0.36262524, 0.28939262, -0.045071788, -0.048..."
2,Banquet Supervisor,mandarin oriental hotel groupmandarin oriental...,not specified,Experienced (Non-Manager),job_title:Banquet Supervisor - description:man...,"[0.04397667, 0.20301022, 0.10298208, 0.0690942..."
3,Trade Finance & Credit Collection,about us alfa laval is a leading global provid...,not specified,Not specified,job_title:Trade Finance & Credit Collection - ...,"[-0.34001347, -0.13443585, -0.32917526, -0.230..."
4,Taste & Wellbeing Creative Marketing Associate...,join us and celebrate the beauty of human expe...,not specified,Not specified,job_title:Taste & Wellbeing Creative Marketing...,"[0.14135568, 0.15496188, 0.25110003, -0.143409..."


In [9]:

pc = Pinecone(api_key="e355e8c2-fae7-43a7-91d6-e318dd10f982")

In [10]:
# Initialize Pinecone index
index_name = "job-advice"
if index_name in pc.list_indexes():
    print(f"Index '{index_name}' already exists.")
else:
    try:
        pc.create_index(name=index_name, dimension=384, metric = "cosine",  spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    ) )
        print(f"Index '{index_name}' created.")
    except Exception as e:
        print(f"Error creating index: {e}")
        print(index_name)

index = pc.Index(index_name)

Error creating index: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-07', 'X-Cloud-Trace-Context': '2264e88b827a62b0b57bf272cbfec7f6', 'Date': 'Mon, 19 Aug 2024 11:13:56 GMT', 'Server': 'Google Frontend', 'Content-Length': '196', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"FORBIDDEN","message":"Request failed. You've reached the max serverless indexes allowed in project Default (5). To add more serverless indexes, upgrade your plan."},"status":403}

job-advice


In [11]:
# Index job postings
upserts = [(str(i), vector) for i, vector in enumerate(df['job_vectors'])]

# Use tqdm to add a progress bar
for i in tqdm(range(0, len(upserts), 1000)):  
    # Process in batches of 1000 vectors
    batch = upserts[i:i + 1000]  
    index.upsert(vectors=batch)

print("Job postings indexed.")

100%|██████████| 40/40 [05:57<00:00,  8.94s/it]

Job postings indexed.





In [12]:
index_metadata = {
    "index_name": index_name,
    "dimension": 384  # Replace with your index dimension
}

# Save metadata to a file
with open('./data/index_metadata.json', 'w') as f:
    json.dump(index_metadata, f)

import pickle

# Save vectors
with open('./data/index_vectors.pkl', 'wb') as f:
    pickle.dump(upserts, f)    

In [14]:
# Load vectors
with open('./data/index_vectors.pkl', 'rb') as f:
    upserts = pickle.load(f)

# Re-index the vectors in Pinecone (if needed)
for i in tqdm(range(0, len(upserts), 1000)):
    batch = upserts[i:i + 1000]
    index.upsert(vectors=batch)

100%|██████████| 40/40 [05:27<00:00,  8.18s/it]


In [34]:
# Get index statistics and metadata
index_stats = index.describe_index_stats()
print(index_stats)

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 40000}},
 'total_vector_count': 40000}


In [35]:
# Replace 'vector_id' with the actual ID of the vector you want to retrieve
vector_id = '1'
fetched_vector = index.fetch(ids=[vector_id])
print(fetched_vector)

{'namespace': '',
 'usage': {'read_units': 1},
 'vectors': {'1': {'id': '1',
                   'values': [-0.362625241,
                              0.28939262,
                              -0.0450717881,
                              -0.0480320081,
                              0.0955152363,
                              -0.222818762,
                              0.0772963241,
                              -0.0981858075,
                              -0.00208469294,
                              0.244954824,
                              -0.0822946429,
                              -0.00831546262,
                              -0.0193811785,
                              -0.030132331,
                              -0.148294181,
                              -0.254513651,
                              0.000565566123,
                              -0.0437088795,
                              -0.0958576277,
                              -0.0932767466,
                              0.

In [36]:
user_query = "Looking for a Senior Developer Relations Manager with experience in AI"

query_embedding = model.encode(user_query)

print(query_embedding)

[-3.30146044e-01 -6.35749876e-01  2.45703254e-02 -7.21853256e-01
  4.31780636e-01  4.18231279e-01 -2.84826100e-01  1.26781076e-01
 -8.79529044e-02  2.46806517e-01 -2.46940464e-01  4.01146412e-01
  1.09041981e-01  1.32414237e-01  2.61295050e-01  7.94001818e-01
 -2.68279105e-01  1.85125172e-01  3.17614436e-01 -7.30682090e-02
 -4.49825644e-01 -2.63411969e-01  2.21017838e-01 -5.24516046e-01
  1.14148632e-01  2.20390961e-01  3.85786951e-01 -3.76119822e-01
  1.16747245e-01 -3.59379619e-01  7.71762192e-01 -2.02731937e-01
  8.89835477e-01  2.17707247e-01 -2.21728371e-03  2.78088778e-01
 -5.15913010e-01 -6.22332469e-02 -1.35260716e-01 -1.38729095e-01
  1.83490649e-01  3.05478394e-01  1.62178129e-01 -1.92333907e-01
 -1.15872368e-01  1.85125023e-01 -1.90353855e-01 -4.48482245e-01
 -6.87094182e-02 -2.16847397e-02 -8.25686306e-02 -4.38451648e-01
  3.27161491e-01  1.10650972e-01 -2.64415234e-01  5.75312614e-01
  3.48765194e-01  3.78210604e-01 -1.97401598e-01 -2.17200011e-01
  3.06398600e-01 -3.14502

In [37]:
user_query = "Looking for a Senior Developer Relations Manager with experience in AI"

query_embedding = model.encode(user_query).tolist()

# print(query_embedding)

results = index.query(vector=query_embedding, top_k=3)

In [38]:
print(results)

{'matches': [{'id': '0', 'score': 0.650319338, 'values': []},
             {'id': '26741', 'score': 0.617222548, 'values': []},
             {'id': '14067', 'score': 0.615976036, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}


In [15]:
def search_jobs(query, model, index, df, top_k=5):
    """Search for relevant job postings based on a user query."""
    # Encode the query to get the vector
   
    query_embedding = model.encode(user_query).tolist()

    # print(query_embedding)

    results = index.query(vector=query_embedding, top_k=5)

    # Process the results
    job_ids = [int(match['id']) for match in results['matches']]
    
    # Retrieve job details from the DataFrame
    relevant_jobs = df.iloc[job_ids]
    
    return relevant_jobs

# Example user query
user_query = "Looking for a Senior Developer Relations Manager with experience in AI"
relevant_jobs_df = search_jobs(user_query, model, index, df)

pred_docs = []
# Display the relevant job postings
for _, row in relevant_jobs_df.iterrows():
    pred_docs.append(row['job_details'])
    # print(f"Job Details: {row['job_details']}")
results_df = pd.DataFrame(pred_docs, columns= ['pred_details'])
results_df


Unnamed: 0,pred_details
0,job_title:Senior Developer Relations Manager -...
1,job_title:Business Development Specialist (AI ...
2,job_title:Virtual Technology and AI Support Of...
3,job_title:Software Project Manager - Enterpris...
4,job_title:Salesforce Developer-Intern - descri...


In [17]:
pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Downloading shards: 100%|██████████| 2/2 [23:21<00:00, 700.97s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:15<00:00,  7.50s/it]


In [23]:
def generate_personalized_response(user_query, relevant_jobs, generative_model):
    """Generate a personalized response using a generative model."""
    # Prepare the context for the generative model
    context = f"User is looking for: {user_query}. Here are some relevant job postings:\n\n"
    for _, row in relevant_jobs.iterrows():
        context += f"- {row['job_details']}\n"

    # Use the generative model to create a personalized response
    personalized_response = generative_model(
        context,
        max_length=1024,  # Adjust max_length to handle longer texts
        max_new_tokens=200,  # Limit the length of the generated content
        do_sample=True,
        num_return_sequences=1
    )[0]['generated_text']
    
    return personalized_response

In [27]:
# Example user query
user_query = "AI Engineer "
relevant_jobs_df = search_jobs(user_query, model, index, df)
# Generate personalized response
personalized_response = generate_personalized_response(user_query, relevant_jobs_df, pipe)

# Display the personalized response
print(personalized_response)

Both `max_new_tokens` (=200) and `max_length`(=1024) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


User is looking for: AI Engineer . Here are some relevant job postings:

- job_title:AI Test Engineer - description:we are seeking a talented and experienced ai tester to join our dynamic team. as an ai tester, you will be responsible for ensuring the quality, reliability, and performance of our ai systems across various applications. you will collaborate closely with our development, data science, and engineering teams to design and execute comprehensive testing strategies. responsibilities: · conduct thorough testing of ai models, algorithms, and applications to identify defects, errors, and performance issues. · validate data quality, integrity, and relevance for ai training and inference. · evaluate and benchmark ai models against specified requirements and industry standards. · test for bias, fairness, and ethical considerations in ai decision-making processes. · develop tools, frameworks, and methodologies to support automated testing and continuous integration of ai solutions. ·

In [28]:
# Define the chatbot function
def chatbot(user_query, search_model, search_index, job_df):
    """Handle user queries and provide responses."""
    print(f"User Query: {user_query}")
    
    # Perform job search
    relevant_jobs_df = search_jobs(user_query, search_model, search_index, job_df)
    
    # Generate personalized response
    response = generate_personalized_response(user_query, relevant_jobs_df, pipe)
    
    return response

In [29]:
 # Get chatbot response
response = chatbot(user_query, model, index, df)
print(f"Chatbot Response: {response}")

User Query: AI Engineer 


Both `max_new_tokens` (=200) and `max_length`(=1024) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Chatbot Response: User is looking for: AI Engineer . Here are some relevant job postings:

- job_title:AI Test Engineer - description:we are seeking a talented and experienced ai tester to join our dynamic team. as an ai tester, you will be responsible for ensuring the quality, reliability, and performance of our ai systems across various applications. you will collaborate closely with our development, data science, and engineering teams to design and execute comprehensive testing strategies. responsibilities: · conduct thorough testing of ai models, algorithms, and applications to identify defects, errors, and performance issues. · validate data quality, integrity, and relevance for ai training and inference. · evaluate and benchmark ai models against specified requirements and industry standards. · test for bias, fairness, and ethical considerations in ai decision-making processes. · develop tools, frameworks, and methodologies to support automated testing and continuous integration 