
## Problem Statement


**Job hunting is not easy**

### Use Case Context

Job Assistant is an AI-powered toolkit designed to support job seekers through the modern recruitment process. In today's job market, success requires more than experience and qualifications: it demands strategic customization, fast adaptation, and effective communication. This project aims to make that process easier, faster, and more accessible for candidates navigating career transitions.

This notebook serves as an early prototype for a stable application that assists with analyzing job descriptions, tailoring resumes, and preparing candidates for interviews using advanced large language models.

**Motivation & Context**

After being laid off, I found myself facing the same challenges as millions of other professionals: needing to explore new roles while ensuring that my resume, skills, and application strategy were aligned with evolving hiring expectations.

While Applicant Tracking Systems (ATS) have long used automation to filter candidates, job seekers now have access to their own arsenal of AI tools to level the playing field. This project reflects that shift, leveraging AI to help candidates strategize, tailor their applications, and present themselves competitively in a market where efficiency and precision matter more than ever.




### **Features**

🔍 Job Description Analysis
- Extract core responsibilities and skills
- Identify keywords and competencies
- Assess seniority and role expectations
- Summarize qualifications into actionable insights

📄 Resume Review & Enhancement
- Evaluate alignment between resume and target role
- Generate tailored bullet points and work summaries
- Suggest improvements for clarity, impact, and ATS compatibility

🤖 AI-Driven Workflow Automation (Future)
- Auto-generate resume variants per job
- Track applications and versioning
- Integrate with job platforms and APIs

🧠 Interview Preparation
- Role-specific behavioral and technical questions
- STAR-based answer guidance
- Prompts for self-reflection and story crafting



### Objective

* 🧱 Architecture (Prototype): this project is currently implemented as a Jupyter notebook that orchestrates interactions between:
    * LLMs for text analysis and generation
    * Prompt-based evaluation of job descriptions
    * A RAG pipeline
    * A vector database
    * Custom utility functions for parsing, extraction, and formatting

* Long-term goals include:
    * Converting the notebook into a modular application service
    * Building a UI for non-technical users
    * Creating a persistent system for resume templates and job history

### 🏗️ Installation

Requirements
- Python 3.10+
- Jupyter Notebook/Lab or Google Colab
- Access to an LLM provider (OpenAI, Google, etc.)


### 🧪 Usage

The notebook is organized as a sequence of steps:
1. Import a job description
2. Run the job analysis cell
3. Upload the resume in PDF format
4. Run resume alignment and recommendations
5. (Optional) Generate interview questions

Each step outputs structured insights intended for immediate application or further iteration.

### 📈 Roadmap

* Modular Python package
* RAG Agent for automation
* CLI for job analysis
* Resume builder with AI-assisted templates
* Frontend UI
* Integration with job boards and ATS tools
* Export to PDF, DOCX, and LinkedIn text blocks


### 🤝 Contributing

Contributions, suggestions, and collaborations are welcome.
If you've recently experienced a layoff or are actively job hunting, your perspective is especially valuable in shaping this tool.


### 💬 Why This Matters

The job market is evolving rapidly. Candidates must now manage personal branding, narrative framing, research, and continuous refinement, often under pressure and uncertainty.

This project attempts to remove some of that burden.

By combining empathy, technology, and practical workflows, Job Assistant aims to help people rebuild confidence, highlight their strengths, and secure meaningful work.

## Libraries

In [72]:
# Install required libraries
!pip install -q langchain_community==0.3.27 \
              langchain==0.3.27 \
              chromadb==1.0.15 \
              pymupdf==1.26.3 \
              tiktoken==0.9.0 \
              datasets==4.0.0 \
              evaluate==0.4.5 \
              langchain_openai==0.3.30 \
              langchain-chroma \
              langchain-google-genai \
              langchain_core




In [73]:
# Import core libraries
import os                                                                       # Interact with the operating system (e.g., set environment variables)
import json                                                                     # Read/write JSON data
import requests                                                                 # Make HTTP requests (e.g., API calls)

# Import libraries for working with PDFs and OpenAI
from langchain_community.document_loaders import PyMuPDFLoader                  # Load and extract text from PDF files (langchain.document_loaders is deprecated)
# from langchain_community.document_loaders import PyPDFLoader                  # Alternative PDF loader
from openai import OpenAI                                                       # Access OpenAI's models and services


# Import libraries for processing dataframes and text
import tiktoken                                                                 # Tokenizer used for counting and splitting text for models
import pandas as pd                                                             # Load, manipulate, and analyze tabular data

# Import LangChain components for data loading, chunking, embedding, and vector DBs
from langchain.text_splitter import RecursiveCharacterTextSplitter              # Break text into overlapping chunks for processing
#from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings                                   # Create vector embeddings using OpenAI's models  # type: ignore
#from langchain.vectorstores import Chroma                                      # Store and search vector embeddings using Chroma DB  # type: ignore
from langchain_chroma import Chroma                                             # Store and search vector embeddings using Chroma DB  # type: ignore
from datasets import Dataset                                                    # Used to structure the input (questions, answers, contexts etc.) in tabular format
from langchain_openai import ChatOpenAI                                         # This is needed since LLM is used in metric computation

from langchain_google_genai import ChatGoogleGenerativeAI                       # Google Gemini
from langchain.memory import ConversationBufferMemory

## Loading OpenAI and Google API configuration

In [74]:
import json # Import the json module

# Load the JSON file and extract values
file_name = "../Config.json"                                                    # Path to the configuration file
with open(file_name, 'r') as file:                                              # Open the config file in read mode
    config = json.load(file)                                                    # Load the JSON content as a dictionary
    OPENAI_API_KEY = config.get("OPENAI_API_KEY")                               # Extract the OpenAI API key from the config
    OPENAI_API_BASE = config.get("OPENAI_API_BASE")                             # Extract the OpenAI base URL from the config
    GOOGLE_API_KEY = config.get("GOOGLE_API_KEY")                               # Extract the Google API key from the config

# Store API credentials in environment variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY                                   # Set API key as environment variable
os.environ["OPENAI_BASE_URL"] = OPENAI_API_BASE                                 # Set API base URL as environment variable
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

# Initialize OpenAI client
client = OpenAI()
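
`config.get(...)` returns `None` for a missing key, and assigning `None` to `os.environ` raises a `TypeError` that does not name the missing key. As a minimal sketch, the loader could validate the file up front. The helper name `load_config` and the constant `REQUIRED_KEYS` are hypothetical, and the demo values are placeholders rather than real credentials.

```python
import json
import os
import tempfile

# Keys this notebook reads from Config.json
REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_API_BASE", "GOOGLE_API_KEY")


def load_config(path, required=REQUIRED_KEYS):
    """Load a JSON config file and fail early if a required key is missing or empty."""
    with open(path, "r", encoding="utf-8") as fh:
        config = json.load(fh)
    missing = [key for key in required if not config.get(key)]
    if missing:
        raise KeyError(f"Missing config values: {', '.join(missing)}")
    return {key: config[key] for key in required}


# Demonstration with a throwaway file and placeholder values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Config.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(
            {
                "OPENAI_API_KEY": "sk-demo",
                "OPENAI_API_BASE": "https://example.invalid/v1",
                "GOOGLE_API_KEY": "g-demo",
            },
            fh,
        )
    demo_config = load_config(demo_path)
```

The returned dictionary can then be copied into `os.environ` in a loop, knowing every value is a non-empty string.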

## Question Answering using LLM

### Question Answering with OpenAI



In [75]:
# Decorator that cleans text output:
# collapses runs of whitespace (including line breaks) and re-wraps the text at 100 columns.
import functools, re, textwrap
def clean_output(func):
  @functools.wraps(func)                                                        # Preserve the wrapped function's name and docstring
  def wrapper(*args, **kwargs):
    result = func(*args, **kwargs)

    # Only clean string results; return anything else unchanged
    if isinstance(result, str):
      output = ' '.join(result.split())                                         # ' '.join(result) would insert a space between every character
      output = textwrap.fill(output, width=100)

      return output
    return result
  return wrapper

# Define a function to get the response from the OpenAI LLM ChatBot
# @clean_output
def response(user_prompt, max_tokens=300, temperature=0.5, top_p=0.9):
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "user",
             "content": user_prompt                                             # User prompt is the input/query to respond to
             }
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply
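
The cleaning decorator can be tried offline before it is attached to a function that calls the API. The sketch below is a self-contained version of the same pattern applied to a stub; `fake_response` is a hypothetical stand-in for the real call, not part of the notebook.

```python
import functools
import textwrap


def clean_output(func):
    """Collapse whitespace in a string result and wrap it to 100 columns."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, str):
            return textwrap.fill(" ".join(result.split()), width=100)
        return result  # Non-string results pass through unchanged
    return wrapper


@clean_output
def fake_response(user_prompt):
    # Hypothetical stub standing in for the chat completion call
    return f"Echo:\n\n   {user_prompt}   \n"


cleaned = fake_response("Does the candidate   meet the required skills?")
```

Note that collapsing whitespace also flattens any Markdown lists in a reply, so the decorator suits short prose answers better than structured ones.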

### Defining Recruiter Queries and Storing the Responses

In [1]:
q1 = "Does the candidate meet the required skills?"
q2 = "Is the candidate a good fit for the job position?"
q3 = "Analyse Candidate Strengths for the job position"
q4 = "Analyse Candidate Opportunities to improve based on the job description"
q5 = "Show match details"
q6 = "Create a cover letter"
q7 = "Help me to stand out"

set_of_questions = [q1, q2, q3, q4, q5, q6, q7]



### Defining Job Description

In [108]:
# Mock Job Description

jd = """
About the job

Nimda LLC Security provides trusted cybersecurity expertise, solutions and services that help organizations make better decisions and minimize risk. By taking a three-tiered, holistic approach for evaluating security posture and ecosystems, GuidePoint enables some of the nation's top organizations, such as Fortune 500 companies and U.S. government agencies, to identify threats, optimize resources and integrate best-fit solutions that mitigate risk.

General Description

We are seeking a Senior Application Security Engineer to strengthen the integrity of our software development lifecycle and safeguard our products from emerging threats. In this role, you'll collaborate across Engineering, DevOps, AI/ML, and Security teams to embed security considerations from the earliest stages of design through to production and beyond, leveraging artificial intelligence and advanced automation to enhance security effectiveness and efficiency.

The Senior Application Security Engineer will demonstrate deep expertise in application security, secure software design, and cloud-native architecture, leading by influence to embed security into the development lifecycle. This role will drive the adoption of modern DevSecOps practices, threat modeling, and secure coding programs that align with engineering goals and business priorities.

This position is 100% remote, and the Senior Application Security Engineer must be self-directed, able to work both individually and as part of a multifunctional team and possess the necessary written and verbal communication skills to interact effectively with team members, IT, AI/ML teams, and other internal customers.

Roles and Responsibilities:

Embed security into development: Collaborate closely with engineering teams to embed security throughout the SDLC (from architectural design and code implementation to CI/CD pipelines and peer reviews), ensuring security is an integral part of how software is built and shipped. Conduct thorough pull request reviews with security focus and contribute to secure coding practices through hands-on development experience.
Find and fix vulnerabilities: Use both automated and manual approaches to identify risks, including static and dynamic analysis (SAST/DAST), manual code reviews, and penetration testing of web and mobile applications. Implement AI-powered vulnerability discovery and intelligent prioritization systems to enhance detection capabilities.
Automate security checks: Integrate a robust set of security practices into the CI/CD workflow, covering Software Composition Analysis (SCA), secret detection, Infrastructure as Code (IaC) scanning, container/image scanning, and dependency monitoring, to address issues early and continuously. Design and implement advanced automation workflows using modern development practices and AI/ML frameworks.
Monitor and respond in production: Implement security telemetry and runtime monitoring to maintain visibility into production environments, detect threats in real time, and support rapid response during security incidents. Deploy AI-driven anomaly detection and automated response capabilities.
Guide secure design: Participate in architecture reviews and threat modeling sessions to proactively identify design-level risks and shape secure system patterns. Apply deep knowledge of cybersecurity standards (NIST, ISO 27001, SOC 2) and current threat landscape to guide security architecture decisions.
Respond to security incidents: Work alongside engineering and security teams during investigations, remediation efforts, and post-incident reviews to strengthen long-term resilience.
Shape the program: Define and track meaningful application security metrics to measure impact, influence priorities, and drive continuous improvement. Beyond metrics and tooling, lead efforts in secure development practices, risk-based decision-making, and help foster a culture where security is thoughtfully embedded. Leverage AI-driven analytics for program insights and automated reporting.
"""

### Resume Data Extraction

In [78]:

pdf_loader = PyMuPDFLoader("../candidate_resume.pdf") # Load the PDF
data = pdf_loader.load()
resume_text = "\n".join(doc.page_content for doc in data)
print(resume_text)

Jonathan Angeles 
CYBERSECURITY & DATA SCIENCE PROFESSIONAL 
 
 Linkedin
 9177022992
 Tulsa, OK, USA
 jonathan@nimda.sh 
 
 
 
Summary 
Driven Cybersecurity and Data Science professional with a decade of experience spearheading secure 
software development, risk management, and governance initiatives. Expert in applying cutting-edge data 
science and AI/ML methodologies to enhance cybersecurity posture, from vulnerability management to 
threat modeling. Committed to continuous learning, as evidenced by a Master's in Cybersecurity, 
post-graduate studies in AI/ML and Generative AI, and a comprehensive suite of industry certifications, 
including OSCP and AWS Cloud Practitioner. 
Skills & Training 
 
DS/AI/ML  
Core Security 
SecDevOps / Appsec 
●
Supervised Machine Learning:
Linear regression, logistic
regression, and Decision Tree
●
Ensemble Techniques
●
Unsupervised Machine Learning
●
EDA, Data Preprocessing &
Customer Profiling, and Feature
Engineer

### Prompt Engineering for Base Questions

In [109]:
system_prompt = ("""
    You are a Technical Cybersecurity Recruiter assistant specialized in finding the best-fit candidate for a job position based on the job description and the candidate's resume.

    Your primary goal is to offer accurate, concise, and helpful information for the candidate's understanding only.
    You must strictly adhere to the following guidelines:
    - Provide answers based solely on general job description knowledge.
    - Do NOT provide personalized recommendations for individuals outside of the job description and candidate background.
    - Always maintain a professional, informative, and neutral tone.
    - Clarify that your responses are for informational purposes only and not a substitute for professional recruiter consultation.
    - Focus on explaining and analysing candidate technical skills, soft skills, Professional Experience, Academic Experience, Certifications and Tools the candidate has experience with.
    """)

user_prompt = """ Your tasks are the following:

### 1 Consider the following candidate resume: {}

### 2 Consider the following job description: {}

### 3 Answer the following question: {}

"""

In [80]:
print(user_prompt)

 Your tasks are the following:

### 1 Consider the following candidate resume: {}

### 2 Consider the following job description: {}

### 3 Answer the following question: {}




### Defining the function to Generate a Response From the LLM with Prompt Engineering


In [111]:
# Define a function to get a response from the OpenAI chat model
def response_with_prompt_eng(system_prompt, user_query, max_tokens=500, temperature=0.5, top_p=0.9):
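    # Fill the template into a local variable. Reassigning the global `user_prompt` would strip its
    # placeholders after the first call, so every later call would silently resend the first question.
    formatted_prompt = user_prompt.format(resume_text, jd, user_query)
    # Create a chat completion using the OpenAI client
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": system_prompt},                       # System prompt sets the assistant's behavior
            {"role": "user", "content": formatted_prompt}                       # Formatted prompt carries the resume, job description, and question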
        ],
        max_tokens=max_tokens,                                                  # Max number of tokens to generate in the response
        temperature=temperature,                                                # Controls randomness in output (0 = deterministic)
        top_p=top_p                                                             # Controls diversity via nucleus sampling
    )
    return completion.choices[0].message.content                                # Return the text content from the model's reply
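
Rather than running one cell per question, the template can be kept immutable and filled once per query inside a loop. The sketch below assumes named placeholders instead of the positional `{}` used in the notebook. `PROMPT_TEMPLATE`, `build_prompt`, `answer_all`, and the stub `echo_last_line` are hypothetical names, and `ask` stands for any callable that sends a prompt to a model and returns text.

```python
PROMPT_TEMPLATE = """Your tasks are the following:

### 1 Consider the following candidate resume: {resume}

### 2 Consider the following job description: {jd}

### 3 Answer the following question: {question}
"""


def build_prompt(resume, jd, question):
    """Return a freshly formatted prompt; the template itself is never modified."""
    return PROMPT_TEMPLATE.format(resume=resume, jd=jd, question=question)


def answer_all(questions, resume, jd, ask):
    """Map each question to the reply produced by `ask` for its own prompt."""
    return {q: ask(build_prompt(resume, jd, q)) for q in questions}


def echo_last_line(prompt):
    # Hypothetical stub in place of a model call: returns the question line
    return prompt.strip().splitlines()[-1]


answers = answer_all(
    ["Show match details", "Create a cover letter"],
    resume="resume text",
    jd="job text",
    ask=echo_last_line,
)
```

Because each prompt is built from the untouched template, every question reaches the model with its own text, which is easy to confirm with the stub before spending API calls.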

### Response Q1 From the LLM with Prompt Engineering

In [82]:
response_with_prompt_eng_q1 = response_with_prompt_eng(system_prompt, user_query=q1)
print(q1)
print(response_with_prompt_eng_q1)

Does the candidate meet the required skills?
Based on the information provided in Jonathan Angeles' resume, he appears to be a strong candidate for the Senior Application Security Engineer position at GuidePoint Security. Here's a breakdown of his qualifications in relation to the job description:

1. **Application Security and Secure Software Design**: Jonathan has extensive experience in application security engineering, as demonstrated in his roles at City National Bank and BBVA USA, where he conducted penetration testing, security code reviews, and designed security programs using frameworks like OWASP SAMM.

2. **DevSecOps Practices**: His experience includes embedding security into the software development lifecycle (SDLC), particularly noted during his tenure at Spoton and City National Bank where he implemented security tools within the SDLC and automated security checks in CI/CD pipelines.

3. **AI/ML Integration in Security**: Jonathan has a strong background in AI/ML, havi

### Response Q2 From the LLM with Prompt Engineering


In [83]:
response_with_prompt_eng_q2 = response_with_prompt_eng(system_prompt, user_query=q2)
print(q2)
print(response_with_prompt_eng_q2)

Is the candidate a good fit for the job position?
Based on the provided resume of Jonathan Angeles and the job description for the Senior Application Security Engineer position at GuidePoint Security, Jonathan appears to be a strong candidate who meets many of the required skills and qualifications for the role. Here's an analysis of his alignment with the job requirements:

1. **Application Security and Secure Software Design**: 
   - Jonathan has extensive experience as an Application Security Engineer at City National Bank and BBVA USA, where he performed penetration testing, security code reviews, and managed application security vulnerability programs. This directly aligns with the job requirement to have deep expertise in application security and secure software design.

2. **DevSecOps and Secure Coding Practices**:
   - His roles involved embedding security into the SDLC, conducting secure code reviews, and implementing security tools and practices within CI/CD workflows, as see

### Response Q3 From the LLM with Prompt Engineering


In [84]:
response_with_prompt_eng_q3 = response_with_prompt_eng(system_prompt, user_query=q3)
print(q3)
print(response_with_prompt_eng_q3)

Analyse Candidate Strengths for the job position
Based on the provided resume of Jonathan Angeles and the job description for the Senior Application Security Engineer position at GuidePoint Security, it appears that Jonathan is a strong candidate who meets the required skills for the role. Here's a detailed analysis:

1. **Application Security and Secure Software Design Expertise:**
   - Jonathan has extensive experience in application security engineering, demonstrated through roles at City National Bank, Seegrid Robotics Automation, and BBVA USA. His responsibilities included conducting secure code reviews, application security testing, and vulnerability management, all crucial for the Senior Application Security Engineer position.

2. **Experience with AI/ML in Security:**
   - The job description emphasizes leveraging AI/ML for enhancing security. Jonathan has a strong background in applying AI/ML methodologies within cybersecurity frameworks, as evidenced by his role at Seegrid R

### Response Q4 From the LLM with Prompt Engineering


In [85]:
response_with_prompt_eng_q4 = response_with_prompt_eng(system_prompt, user_query=q4)
print(q4)
print(response_with_prompt_eng_q4)

Analyse Candidate Opportunities to improve based on the job description
Based on the candidate Jonathan Angeles' resume, he appears to meet and exceed the required skills for the position of Senior Application Security Engineer at GuidePoint Security. Here's a detailed analysis:

1. **Application Security and Secure Software Design**: Jonathan has extensive experience in application security engineering roles at City National Bank, BBVA USA, and Seegrid Robotics Automation. His work included conducting secure code reviews, application security testing, vulnerability management, and designing security programs which align closely with the job description's requirements for embedding security into the software development lifecycle (SDLC).

2. **DevSecOps Practices**: Jonathan's experience as a DevSecOps Engineer at Spoton demonstrates his capability in integrating security practices within the SDLC, automating security checks, and collaborating with product teams to enhance securi

### Response Q5 From the LLM with Prompt Engineering


In [113]:
response_with_prompt_eng_q5 = response_with_prompt_eng(system_prompt, user_query=q5)
print(q5)
print(response_with_prompt_eng_q5)

Show match details
Based on the provided job description for the Senior Application Security Engineer position at Nimda LLC Security and the resume of Jonathan Angeles, there is a strong alignment between the candidate's experience and the job requirements. Here are the match details:

**1. Application Security and Secure Software Design:**
   - **Job Requirement:** Deep expertise in application security and secure software design.
   - **Candidate Experience:** Jonathan has extensive experience as an Application Security Engineer at City National Bank and BBVA USA, where he performed penetration testing, security code reviews, and designed security programs based on OWASP frameworks. This directly aligns with the job's focus on secure software design and application security.

**2. DevSecOps and SDLC Integration:**
   - **Job Requirement:** Drive the adoption of modern DevSecOps practices and embed security into the development lifecycle.
   - **Candidate Experience:** At Spoton, 

### Response Q6 From the LLM with Prompt Engineering


In [114]:
response_with_prompt_eng_q6 = response_with_prompt_eng(system_prompt, user_query=q6)
print(q6)
print(response_with_prompt_eng_q6)

Create a cover letter
Based on the resume of Jonathan Angeles and the job description provided for the Senior Application Security Engineer position at Nimda LLC Security, here are the match details:

**Technical Skills and Experience:**
1. **Application Security and SDLC Integration:**
   - Jonathan has extensive experience in embedding security practices within the SDLC, as seen in his roles at City National Bank and Seegrid Robotics Automation. He has performed secure code reviews, application security testing, and collaborated with engineering teams, which aligns well with the responsibilities of ensuring security throughout the SDLC as described in the job description.

2. **AI/ML Integration in Security:**
   - Jonathan's background in applying AI/ML methodologies to enhance cybersecurity, including AI-specific threat modeling and risk assessments, matches the job requirement of leveraging artificial intelligence to enhance security effectiveness and efficiency.

3. **DevSecOps a

### Response Q7 From the LLM with Prompt Engineering


In [115]:
response_with_prompt_eng_q7 = response_with_prompt_eng(system_prompt, user_query=q7)
print(q7)
print(response_with_prompt_eng_q7)

Help me to stand out
### Match Details between Jonathan Angeles' Resume and the Job Description for Senior Application Security Engineer at Nimda LLC Security

#### Technical Skills and Experience
1. **Secure Software Development Lifecycle (SDLC) Expertise:**
   - **Resume:** Jonathan has extensive experience in embedding security best practices across the ML and software development lifecycle, including secure code reviews, application security testing, and vulnerability management in various roles.
   - **Job Requirement:** The role requires embedding security into the SDLC from design to production, which matches Jonathan's demonstrated capabilities.

2. **Application Security and DevSecOps:**
   - **Resume:** Experience as an Application Security Engineer and DevSecOps Engineer, implementing security programs, and automating security checks within CI/CD pipelines.
   - **Job Requirement:** Needs a candidate who can automate security checks and integrate security practices into the 

#### Storing the generated output


In [117]:
# A Dataframe to store the responses
responses = pd.DataFrame({
    "Question": [q1, q2, q3, q4,q5,q6,q7],
    "Base Prompt Response": [response_with_prompt_eng_q1, response_with_prompt_eng_q2, response_with_prompt_eng_q3, response_with_prompt_eng_q4, response_with_prompt_eng_q5, response_with_prompt_eng_q6, response_with_prompt_eng_q7]
    }
)

responses

| | Question | Base Prompt Response |
|---|---|---|
| 0 | Does the candidate meet the required skills? | Based on the information provided in Jonathan ... |
| 1 | Is the candidate a good fit for the job position? | Based on the provided resume of Jonathan Angel... |
| 2 | Analyse Candidate Strengths for the job position | Based on the provided resume of Jonathan Angel... |
| 3 | Analyse Candidate Opportunities to improve bas... | Based on the candidate Jonathan Angeles' resum... |
| 4 | Show match details | Based on the provided job description for the ... |
| 5 | Create a cover letter | Based on the resume of Jonathan Angeles and th... |
| 6 | Help me to stand out | ### Match Details between Jonathan Angeles' Re... |
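
The DataFrame lives only in memory and its display truncates long text. If the full responses need to outlive the session, one option is to write them to CSV. pandas offers `DataFrame.to_csv` for this; the sketch below uses only the standard library so it runs anywhere, and the helper name `responses_to_csv` is hypothetical.

```python
import csv
import io


def responses_to_csv(questions, answers):
    """Serialize question/response pairs to CSV text, quoting multi-line answers."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Question", "Base Prompt Response"])
    writer.writeheader()
    for question, answer in zip(questions, answers):
        writer.writerow({"Question": question, "Base Prompt Response": answer})
    return buffer.getvalue()


# Round trip with a multi-line answer that also contains a comma
csv_text = responses_to_csv(["Show match details"], ["Line one\nLine two, with a comma"])
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

The CSV writer quotes fields containing line breaks or commas, so Markdown-formatted model replies survive the round trip intact.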


## Data Preparation for RAG (Retrieval Augmented Generation)

### Loading the Resume Data

In [119]:
resume_pdf = "../candidate_resume.pdf"
pdf_loader = PyMuPDFLoader(resume_pdf) # Load the PDF
data = pdf_loader.load()

### Data Overview

In [120]:

for page_number, page in enumerate(data, start=1):
  print(f"Page Number: {page_number}")
  print(page.page_content)


Page Number: 1
Jonathan Angeles 
CYBERSECURITY & DATA SCIENCE PROFESSIONAL 
 
 Linkedin
 9177022992
 Tulsa, OK, USA
 jonathan@nimda.sh 
 
 
 
Summary 
Driven Cybersecurity and Data Science professional with a decade of experience spearheading secure 
software development, risk management, and governance initiatives. Expert in applying cutting-edge data 
science and AI/ML methodologies to enhance cybersecurity posture, from vulnerability management to 
threat modeling. Committed to continuous learning, as evidenced by a Master's in Cybersecurity, 
post-graduate studies in AI/ML and Generative AI, and a comprehensive suite of industry certifications, 
including OSCP and AWS Cloud Practitioner. 
Skills & Training 
 
DS/AI/ML  
Core Security 
SecDevOps / Appsec 
●
Supervised Machine Learning:
Linear regression, logistic
regression, and Decision Tree
●
Ensemble Techniques
●
Unsupervised Machine Learning
●
EDA, Data Preprocessing &
Customer Profiling, and Fe

### Data Chunking
#### Chunk the PDF into Manageable Text Sections Using a Token-Based Splitter


In [121]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder( # Split text recursively, measuring chunk size in tokens with OpenAI's tiktoken encoder for compatibility with OpenAI embeddings
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=10,
)

In [122]:
document_chunks = text_splitter.split_documents(data) # Split the loaded pages into Document chunks
#document_chunks = pdf_loader.load_and_split(text_splitter)


In [123]:
len(document_chunks) # Check the length

5
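
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window sketch. It is not the algorithm LangChain uses: `RecursiveCharacterTextSplitter` splits on separators recursively and only measures length in tokens. Whitespace-separated words stand in for tiktoken tokens, and `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last window already reaches the end of the input
    return chunks


# Words stand in for tokens in this illustration
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
```

With `chunk_size=4` and `chunk_overlap=1`, each window starts three tokens after the previous one, so neighbouring chunks share exactly one token. The notebook's settings (512 and 10) follow the same idea at a larger scale.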

### Embedding



#### Embeddings from OpenAI

In [124]:
# Initialize the OpenAI Embeddings model with API credentials
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY,                                              # Your OpenAI API key for authentication
    openai_api_base=OPENAI_API_BASE                                             # The OpenAI API base URL endpoint
)

# Generate embeddings (vector representations) for the first two document chunks
embedding_1 = embedding_model.embed_query(document_chunks[0].page_content)      # Embedding for chunk 0
embedding_2 = embedding_model.embed_query(document_chunks[1].page_content)      # Embedding for chunk 1

# Check and print the dimension (length) of the embedding vector
print("Dimension of the embedding vector ", len(embedding_1))                   # 1536 for text-embedding-ada-002, the default model used here

Dimension of the embedding vector  1536


In [125]:
# Verify if both embeddings have the same dimension (should be True)
len(embedding_1) == len(embedding_2)


True
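
Matching dimensions matter because embeddings are compared element by element. Cosine similarity is one common comparison; the vector store may use a different distance by default, so treat this as an illustration of the idea rather than a description of Chroma's internals. The helper `cosine_similarity` is hypothetical and uses tiny hand-made vectors instead of real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Calling `cosine_similarity(embedding_1, embedding_2)` in the notebook would give a rough sense of how related the first two resume chunks are.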

In [126]:

# Return/display the two embedding vectors for further inspection or use
embedding_1, embedding_2

([-0.009754889644682407,
  -0.010497037321329117,
  0.018694086000323296,
  -0.012596444226801395,
  -0.003272802336141467,
  0.020419077947735786,
  -0.023762082681059837,
  0.009006056934595108,
  -0.01715630292892456,
  -0.017396999523043633,
  0.017356883734464645,
  0.0062480769120156765,
  0.008658383972942829,
  0.0024186645168811083,
  0.003386464435607195,
  0.011426392942667007,
  0.03203936293721199,
  -0.006094298791140318,
  0.004212186671793461,
  -0.01158685702830553,
  -0.040570713579654694,
  -0.0004981078091077507,
  0.004125268664211035,
  -0.0015895990654826164,
  -0.0067094117403030396,
  -0.0026894479524344206,
  -0.008123503066599369,
  -0.048059046268463135,
  -0.0009644570527598262,
  -0.0192958265542984,
  0.024831844493746758,
  0.005088054109364748,
  -0.0032377007883042097,
  -0.007688912563025951,
  -0.03431260585784912,
  -0.004065094515681267,
  0.0024654665030539036,
  -0.014735967852175236,
  0.03415214270353317,
  0.005051281303167343,
  0.00276466552

### Vector Database



#### Setup Vector Database Directory

In [127]:
out_dir = 'vectorDB'    # directory where the vector database is persisted

# Create the directory if it does not already exist
os.makedirs(out_dir, exist_ok=True)

In [128]:
# Initialize an empty Chroma vector store for persistence
vectorstore = Chroma(
    embedding_function=embedding_model,
    persist_directory=out_dir
)

# Define a batch size for adding documents to avoid hitting API token limits
batch_size = 300

# Add documents in batches
for i in range(0, len(document_chunks), batch_size):
    batch = document_chunks[i : i + batch_size]
    print(f"Adding batch {i // batch_size + 1}/{(len(document_chunks) -1) // batch_size + 1} with {len(batch)} documents...")
    vectorstore.add_documents(documents=batch)

print("Vector store created and saved.")

Adding batch 1/1 with 5 documents...
Vector store created and saved.
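
The batching arithmetic used in the loop above can be checked in isolation. This is a small sketch with a hypothetical `iter_batches` helper and placeholder strings instead of real document chunks; it shows why 5 chunks with a batch size of 300 produce a single batch.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

chunks = [f"chunk-{i}" for i in range(5)]   # stand-in for the 5 document chunks
total = math.ceil(len(chunks) / 300)        # equals (len - 1) // 300 + 1 for non-empty lists
for number, batch in enumerate(iter_batches(chunks, 300), start=1):
    print(f"batch {number}/{total}: {len(batch)} items")  # batch 1/1: 5 items
```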


#### Load Vector Database

In [129]:
vectorstore = Chroma(
    persist_directory=out_dir,
    embedding_function=embedding_model
)

In [104]:
vectorstore.embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x13c2a4bb0>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x13c2a58a0>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base='https://aibe.mygreatlearning.com/openai/v1', openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [130]:
vectorstore.similarity_search(q1, k=3)

[Document(id='cf97fa52-2974-4146-960a-71c606716df6', metadata={'creationDate': '', 'modDate': '', 'source': '../candidate_resume.pdf', 'creator': '', 'title': 'Jonathan_Angeles_Resume_v1.3.docx', 'page': 0, 'producer': 'Skia/PDF m144 Google Docs Renderer', 'subject': '', 'total_pages': 3, 'creationdate': '', 'keywords': '', 'moddate': '', 'author': '', 'file_path': '../candidate_resume.pdf', 'format': 'PDF 1.4', 'trapped': ''}, page_content='Tools \n●\u200b\nScikit-learn,  Pandas, Matplotlib, \n●\u200b\nTensorFlow, Jupyter Notebook, \nNumPy, Scipy, Google Colab,  \nHadoop  and RAG Pipelines \n●\u200b\nSecurity Information Systems  \n●\u200b\nNetwork Security Architecture  \n●\u200b\nComputer Forensics  \n●\u200b\nLinux Security  \n●\u200b\nVulnerability Management Process \n●\u200b\nSoftware Security Program \n \n                 Programing Language  \n●\u200b\nShell Bash Scripting \n●\u200b\nPython \n●\u200b\nJAVA \n \nSoft Skills \n●\u200b\nCuriosity, Teamwork

#### Convert Vector Database into a Retriever and Retrieve Relevant Documents


### Retriever

In [131]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
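
Conceptually, `search_type="similarity"` with `k=3` returns the three stored chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates what `k` controls. The dot-product scoring and the `top_k` helper are assumptions for illustration; the metric actually used depends on how the Chroma collection is configured.

```python
import heapq

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks with the highest dot product against the query."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    return heapq.nlargest(k, range(len(chunk_vecs)), key=scores.__getitem__)

query = [1.0, 0.0]
chunks = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0], [0.8, 0.6]]  # toy 2-dimensional "embeddings"
print(top_k(query, chunks, k=3))  # [2, 3, 1]
```

Raising `k` sends more chunks to the LLM, which can add useful context but also uses more of the prompt budget.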

### RAG System and User Prompt Template

In [144]:
rag_system_prompt = """
### ROLE
You are an expert Technical Cybersecurity Recruiter. You possess deep knowledge of cybersecurity frameworks (NIST, ISO), certifications (CISSP, CEH, OSCP), and technical tools (SIEM, EDR, Firewalls).

### TASK
Your task is to evaluate a candidate's resume against a specific job description to determine their technical and cultural fit.

### INPUT DATA
1. Job Description (JD): provided in the context.
2. Candidate Resume: provided in the context.

### GUIDELINES & CONSTRAINTS
1. **Source of Truth:** You must answer based *strictly* on the provided Resume and Job Description. Do not use outside knowledge to fill in gaps about the candidate's experience.
2. **Missing Information:** If the resume does not mention a specific skill required by the JD, explicitly state: "Not mentioned in resume." Do not infer or guess.
3. **Tone:** Professional, objective, and analytical.
4. **Disclaimer:** Always end with: "This analysis is generated by AI for informational purposes and should be verified by a human recruiter."

### ANALYSIS REQUIRED
Analyze the match based on the following dimensions:
- **Hard Skills Match:** Compare technical skills (Languages, Security Tools, Platforms).
- **Certifications:** Check for mandatory vs. preferred certifications.
- **Experience Level:** Compare years of experience and seniority.
- **Gap Analysis:** Explicitly list what the candidate is missing based on the JD.

### OUTPUT FORMAT
Provide the response in Markdown format with clear headers.
"""



In [157]:
rag_user_prompt_template = """
    Please analyze the following candidate based on the job requirements provided.

    CONTEXT DATA:

    <JobDescription>
    {job_description_context}
    </JobDescription>

    <CandidateResume>
    {resume_context}
    </CandidateResume>

    ANALYSIS REQUEST:
    Based on the context above, answer the following question:
    <Query>
    {question}
    </Query>
    """



### RAG Response Function

In [162]:
def generate_rag_response(user_query: str, k=3, max_tokens=600, temperature=0, top_p=0.95):
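    # The prompt templates, `jd`, `client`, and `vectorstore` are module-level names;
    # reading them does not require a `global` statement

    # Retrieve the k most similar chunks for the user query. The vector store is queried
    # directly so that `k` is honored; whether retriever.invoke() accepts a per-call `k`
    # depends on the installed LangChain version
    context_documents = vectorstore.similarity_search(user_query, k=k)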
    # Extract page content from the retrieved documents to form a list of contexts
    context_list = [d.page_content for d in context_documents]

    # Combine all retrieved document chunks into a single context string
    context_for_query = '. '.join(context_list)
    # Format the user prompt using the RAG user prompt template, including the retrieved context and original query
    user_message = rag_user_prompt_template.format(resume_context=context_for_query,job_description_context=jd, question=user_query)

    try:
        # Generate a response using the OpenAI chat model with the system and formatted user messages
        response = client.chat.completions.create(
            model="gpt-4-turbo", # Specify the model to use
            messages=[
                {"role": "system", "content": rag_system_prompt}, # System role defines the AI's behavior
                {"role": "user", "content": user_message}        # User role provides the query and context
            ],
            max_tokens=max_tokens,     # Set the maximum number of tokens for the generated response
            temperature=temperature,   # Control the randomness of the output (0 makes it more deterministic)
            top_p=top_p                # Control diversity via nucleus sampling
        )
        # Extract and strip whitespace from the generated text content
        response = response.choices[0].message.content.strip()
    except Exception as e:
        # Catch any exceptions during the API call and print the error
        print(f"Error Found:\n {e}")
        # Return a 2-tuple so callers that unpack (response, context) do not fail
        return None, context_for_query

    # Return the generated response and the context used for the query
    return response, context_for_query
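
Joining chunks with `'. '` can blur where one retrieved chunk ends and the next begins. The sketch below shows one possible alternative: label each chunk, then fill a shortened stand-in for `rag_user_prompt_template`. The `build_context` helper, the separator, and all input strings are hypothetical placeholders, not part of the notebook.

```python
def build_context(chunks, separator="\n\n---\n\n"):
    """Join chunk texts, labeling each one so its boundaries stay visible (hypothetical helper)."""
    labeled = [f"[Chunk {i}]\n{text.strip()}" for i, text in enumerate(chunks, start=1)]
    return separator.join(labeled)

# Shortened stand-in for rag_user_prompt_template
template = """<JobDescription>
{job_description_context}
</JobDescription>
<CandidateResume>
{resume_context}
</CandidateResume>
<Query>
{question}
</Query>"""

# Placeholder inputs for illustration only
context = build_context(["Python, JAVA  ", "Burp Suite, {ZAP} proxy"])
user_message = template.format(
    job_description_context="(job description text)",
    resume_context=context,   # braces inside a value are inserted literally, not re-parsed
    question="Which security tools does the candidate list?",
)
print(user_message)
```

Because `str.format` only parses the template itself, curly braces that happen to appear in a resume or job description do not break the substitution.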


## Question Answering using RAG

### Response Q1 From the LLM with RAG and Prompt Engineering


In [163]:
response_with_rag_1, q1_context  = generate_rag_response(q1)
print(response_with_rag_1)

### Hard Skills Match

**Programming Languages and Tools:**
- **Languages:** The candidate has experience with Python, Java, and Shell Bash Scripting, which are commonly used in application security roles. The job description does not specify required programming languages, but these are generally relevant.
- **Security Tools:** The candidate has extensive experience with a variety of security tools including Snyk Code, Snyk Security Container, InsightAppsec Rapid7, Jenkins, Kubernetes, Docker, DefectDojo, Kali Linux, Burp Suite Proxy, and ZAP proxy. These tools align well with the job description's requirements for security checks, vulnerability management, and automation within CI/CD pipelines.

**Security Practices and Frameworks:**
- The candidate has experience with CI/CD security checks (SCA, SAST, DAST), threat modeling (STRIDE), and software security frameworks (OWASP SAMM, MAESTRO). These directly match the job description's emphasis on embedding security into the SDLC, conduc

### Response Q2 From the LLM with RAG and Prompt Engineering



In [None]:
response_with_rag_2, q2_context  = generate_rag_response(q2)
print(response_with_rag_2 )

### Store RAG Responses

In [39]:
responses["Response with RAG"] = [response_with_rag_1, response_with_rag_2, response_with_rag_3, response_with_rag_4]
responses["Question Context"] = [q1_context, q2_context, q3_context, q4_context]
responses

| Unnamed: 0 | Question | Base Prompt Response | Response with Prompt Engineering | Response with RAG | Question Context |
|---|---|---|---|---|---|
| 0 | What is the protocol for managing sepsis in a ... | The management of sepsis in a critical care se... | Sepsis is a life-threatening condition trigger... | The management of sepsis in a critical care un... | Parenteral antibiotics should be given after s... |
| 1 | What are the common symptoms for appendicitis,... | Appendicitis is an inflammation of the appendi... | Appendicitis is an inflammation of the appendi... | Common symptoms of appendicitis typically begi... | Etiology\nAppendicitis is thought to result fr... |
| 2 | What are the effective treatments or solutions... | Sudden patchy hair loss, which appears as loca... | Sudden patchy hair loss that appears as locali... | Sudden patchy hair loss, commonly seen as loca... | squaric acid dibutylester), or psoralen plus u... |
| 3 | What treatments are recommended for a person w... | Treatment for brain injuries, whether they res... | Treatment for brain injuries, whether temporar... | For a person who has sustained a traumatic bra... | Chapter 324. Traumatic Brain Injury\nIntroduct... |


# **Defining the LLM-as-a-Judge Evaluation Function**

### Output Evaluation

### Prompt for Evaluation

In [40]:

groundedness_rater_system_message_test = """
You are an expert at evaluating the factual groundedness of a generated answer with respect to a given context. Your task is to determine if every fact in the 'Answer' is directly supported by the information presented in the 'Context'.

Follow these strict rules:
1.  **Fact-checking**: Scrutinize each factual statement in the 'Answer'.
2.  **Context-only**: If a fact in the 'Answer' cannot be directly and explicitly found or inferred from the 'Context', then the Answer is NOT grounded.
3.  **No External Knowledge**: Do NOT use any external knowledge. Base your judgment SOLELY on the provided 'Context'.
4.  **Output Format**: Provide your assessment as a single word: 'Grounded' if all facts are supported by the context, or 'Not Grounded' if even one fact is not supported by the context.

Example:
Context: 'The Eiffel Tower is in Paris. It was built by Gustave Eiffel.'
Answer: 'The Eiffel Tower is in Paris and is a famous landmark.'
Result: Grounded

Context: 'The Eiffel Tower is in Paris. It was built by Gustave Eiffel.'
Answer: 'The Eiffel Tower is in Rome and is a famous landmark.'
Result: Not Grounded

Context: 'The Eiffel Tower is in Paris.'
Answer: 'The Eiffel Tower is in Paris and was designed by Gustave Eiffel.'
Result: Not Grounded (because 'designed by Gustave Eiffel' is not explicitly in the context given)
"""




In [41]:
relevance_rater_system_message_test = """
You are an expert at evaluating the relevance of a generated 'Answer' to a given 'Question'. Your task is to determine if the 'Answer' directly and comprehensively addresses all aspects of the 'Question'.

Follow these strict rules:
1.  **Directness**: The 'Answer' must directly respond to the 'Question'. Avoid considering any information in the answer that does not pertain to the question.
2.  **Completeness**: If the 'Question' asks for multiple pieces of information, the 'Answer' should ideally provide all of them to be considered highly relevant.
3.  **No Extraneous Information**: The 'Answer' should not include significant information that is unrelated to the 'Question'.
4.  **Output Format**: Provide your assessment as a single word: 'Relevant' if the answer directly and thoroughly addresses the question, or 'Not Relevant' if it misses key aspects, contains significant irrelevant information, or does not answer the question directly.

Example:
Question: 'What are the common symptoms for appendicitis?'
Answer: 'Appendicitis commonly presents with abdominal pain, loss of appetite, nausea, and vomiting.'
Result: Relevant

Question: 'What are the common symptoms for appendicitis and its primary treatment?'
Answer: 'Appendicitis commonly presents with abdominal pain, loss of appetite, nausea, and vomiting.'
Result: Not Relevant (as it misses the treatment aspect)

Question: 'What are the common symptoms for appendicitis?'
Answer: 'The primary treatment for appendicitis is surgery. The appendix is a small, finger-like organ.'
Result: Not Relevant (as it does not address the symptoms)
"""

In [42]:
# Prompt to evaluate how relevant the answer is to the original question
relevance_rater_system_message = """
You will be presented with a ###Question, the ###Context used by the AI system to generate a response, and the AI-generated ###Answer.

Your task is to judge the extent to which the ###Answer is relevant to the ###Question, considering whether it directly addresses the key aspects of the ###Question based on the provided ###Context.

Rate the relevance as follows:
- Rate 1 - The ###Answer is not relevant to the ###Question at all.
- Rate 2 - The ###Answer is only slightly relevant to the **###Question**, missing key aspects.
- Rate 3 - The ###Answer is moderately relevant, addressing some parts of the **###Question** but leaving out important details.
- Rate 4 - The ###Answer is mostly relevant, covering key aspects but with minor gaps.
- Rate 5 - The ###Answer is fully relevant, directly answering all important aspects of the **###Question** with appropriate details from the **###Context**.

The final output should be a single overall rating in the range of 1 to 5, along with a brief explanation of the rationale for the rating.
"""

# Prompt to evaluate how well the answer is grounded in the provided context
groundedness_rater_system_message = """
You will be presented with a ###Question, the ###Context used by the AI system, and the AI-generated ###Answer.

Your task is to judge the extent to which the ###Answer is derived from the ###Context.

Rate it 1 - if the ###Answer is not derived from the ###Context at all
Rate it 2 - if the ###Answer is derived from the ###Context only to a limited extent
Rate it 3 - if the ###Answer is derived from the ###Context to a good extent
Rate it 4 - if the ###Answer is mostly derived from the ###Context
Rate it 5 - if the ###Answer is derived from the ###Context completely

The final output should be a single overall rating in the range of 1 to 5, along with a brief explanation of the rationale for the rating.
"""

In [43]:
user_message_template = """
    ###Question
    {question}

    ###Context
    {context}

    ###Answer
    {answer}
    """

## Google Gemini as Judge

In [44]:
def rate_ground_relevace_responses(question, response, context_for_query=""):
    """
    Evaluates the groundedness and relevance of a generated response using a Gemini LLM as a judge.

    Args:
        question (str): The original question asked by the user.
        response (str): The AI-generated answer.
        context_for_query (str, optional): The context used to generate the answer. Defaults to "".

    Returns:
        tuple: A tuple containing the groundedness rating and the relevance rating from the LLM.
    """
    # Format the user message prompt with the provided question, context, and answer
    filled_user_message_prompt = user_message_template.format(context=context_for_query, question=question, answer=response)

    # Initialize the Google Gemini LLM used as the judge
    gemini_llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-pro",
        temperature=0.2  # Low temperature for more consistent evaluation results
    )

    # Prepare messages for groundedness evaluation
    messages_ground = [
                ("system", f"{groundedness_rater_system_message}"), # System prompt for groundedness
                ("human", f"{filled_user_message_prompt}")            # User message containing question, context, and answer
            ]

    # Invoke the Gemini LLM to rate the groundedness of the response
    groundedness_response = gemini_llm.invoke(messages_ground).content

    # Prepare messages for relevance evaluation
    messages_relevance = [
                ("system", f"{relevance_rater_system_message}"),   # System prompt for relevance
                ("human", f"{filled_user_message_prompt}")             # User message containing question, context, and answer
            ]

    # Invoke the Gemini LLM to rate the relevance of the response
    relevance_response = gemini_llm.invoke(messages_relevance).content

    # Return both the groundedness and relevance ratings
    return groundedness_response, relevance_response
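
The judge returns free text, and the rating line is not formatted consistently: the outputs in this notebook include `Rating: 2`, `**Rating: 1**`, and `**Rating:** 5`. To compare scores numerically, the integer has to be extracted first. The following is a minimal sketch using a hypothetical `parse_rating` helper; it assumes the word "Rating" always precedes the score.

```python
import re

def parse_rating(judge_text):
    """Return the first 1-5 rating that follows the word 'Rating', or None if absent (hypothetical helper)."""
    match = re.search(r"Rating\W*([1-5])\b", judge_text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

# The three formats seen in the judge outputs of this notebook
for sample in ("Rating: 2\n\nRationale: ...", "**Rating: 1**", "**Rating:** 5"):
    print(parse_rating(sample))  # prints 2, then 1, then 5
```

Returning `None` instead of raising lets the caller decide how to handle a judge reply that does not follow the expected format.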

#### **Evaluation 1: Base Prompt Response Evaluation**

In [45]:
ground_rate, relevance_rate = rate_ground_relevace_responses(q1, base_prompt_response_1, q1_context)

print(f"Groundedness Rate:\n{ground_rate}")
print('\n\n******************************\n\n')
print(f"Relevance Rate:\n{relevance_rate}")



Groundedness Rate:
Rating: 2

Rationale: The answer is derived from the context to only a limited extent. While the context does mention obtaining blood cultures before administering antibiotics and the importance of prompt empiric antibiotic therapy, these are the only two points that overlap. The answer is structured around the "Hour-1 Bundle" from the Surviving Sepsis Campaign, which is not mentioned in the context. Furthermore, key recommendations in the answer, such as measuring lactate, fluid resuscitation with 30 mL/kg crystalloid, and using vasopressors, are not present in the provided text. Conversely, the answer omits the majority of the specific protocols detailed in the context, such as specific antibiotic regimens, source control (surgery), glucose management, and corticosteroid therapy.


******************************


Relevance Rate:
**Rating: 1**

**Rationale:**
The answer is not relevant because it does not use the information provided in the context. The context det

#### **Evaluation 2: Prompt Engineering Response Evaluation**

In [48]:
ground_rate, relevance_rate = rate_ground_relevace_responses(q1, response_with_prompt_eng_1, q1_context)

print(f"Groundedness Rate:\n{ground_rate}")
print('\n\n******************************\n\n')
print(f"Relevance Rate:\n{relevance_rate}")

Groundedness Rate:
Rating: 2

Rationale: The answer is derived from the context only to a limited extent. While the context does mention taking blood cultures and administering antibiotics promptly, the answer introduces a significant amount of information not present in the provided text, such as the "Sepsis Six" bundle, the SOFA score, and serum lactate measurement. Furthermore, the answer omits the majority of the detailed protocols described in the context, including specific antibiotic regimens, source control (draining abscesses), glucose management, and corticosteroid therapy.


******************************


Relevance Rate:
**Rating: 2**

**Rationale:**
The answer is only slightly relevant because it provides a very general overview of sepsis management that is not derived from the specific, detailed protocol outlined in the provided context. The context gives explicit information on antibiotic regimens (e.g., gentamicin plus a 3rd-generation cephalosporin), the necessity of 

#### **Evaluation 3: RAG Response Evaluation**

In [47]:
ground_rate, relevance_rate = rate_ground_relevace_responses(q1, response_with_rag_1, q1_context)

print(f"Groundedness Rate:\n{ground_rate}")
print('\n\n******************************\n\n')
print(f"Relevance Rate:\n{relevance_rate}")

Groundedness Rate:
Rating: 5

Rationale: The answer is a well-structured and comprehensive summary of the information provided in the context. Each point in the answer, from empiric antibiotic therapy and source control to supportive care like glucose management and emerging therapies, is directly and accurately extracted from the provided text from the Merck Manual. The answer does not introduce any outside information and faithfully represents the protocol described in the source material.


******************************


Relevance Rate:
**Rating:** 5

**Rationale:** The answer is fully relevant. It directly addresses the question by outlining a clear, step-by-step protocol for managing sepsis based entirely on the provided context. It accurately covers all the key aspects mentioned, including antibiotic therapy, source control, supportive care (glucose management, corticosteroids), and emerging therapies, presenting them in a logical and easy-to-understand format.
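
To make the three evaluations easier to compare, the ratings reported by the judge above can be placed side by side. This is a small sketch that copies the question 1 scores by hand into a dictionary; it covers only that one question and is not a general benchmark.

```python
# Ratings copied from the three judge outputs above for question 1
ratings = {
    "Base prompt":        {"groundedness": 2, "relevance": 1},
    "Prompt engineering": {"groundedness": 2, "relevance": 2},
    "RAG":                {"groundedness": 5, "relevance": 5},
}

print(f"{'Approach':<20}{'Groundedness':>14}{'Relevance':>11}")
for approach, scores in ratings.items():
    print(f"{approach:<20}{scores['groundedness']:>14}{scores['relevance']:>11}")

# Approach with the highest combined score
best = max(ratings, key=lambda name: sum(ratings[name].values()))
print("Highest combined score:", best)  # RAG
```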


## Actionable Insights and Business Recommendations

- During this project, I developed and evaluated a RAG-based AI solution that uses renowned medical manuals (provided as "medical_diagnosis_manual.pdf") to address healthcare challenges. The evaluation focused on two RAG metrics, groundedness and relevance, with ChatGoogleGenerativeAI integrated into the pipeline as the judge.
- The high **groundedness and relevance** scores obtained with RAG indicate that the solution effectively leverages the provided context to answer questions accurately and comprehensively.

- The low **groundedness and relevance** scores obtained with the base prompt and with prompt engineering highlight where the LLM struggles without explicit context or proper guidance. These are opportunities for improvement, for example by trying different prompt engineering techniques, a different LLM, or a different embedding model.
- The RAG model (Response with RAG) significantly outperforms both the base prompt and prompt-engineered responses in groundedness and relevance: for question 1 the judge rated it 5 on both metrics, versus 2 and 1 for the base prompt and 2 and 2 for the prompt-engineered response. This is a crucial insight: RAG effectively uses the external knowledge base.

- Importance of Context: The poorer performance of the non-RAG models shows that without specific, relevant context, even advanced LLMs can hallucinate or provide generic answers that lack specific medical authority.

- Value of Prompt Engineering: Even though it did not match RAG, prompt engineering can improve the initial general answers, showing the importance of guiding the LLM's behavior.


- Implementing a RAG architecture dramatically enhances the accuracy and context-specificity of medical information retrieval, directly addressing the challenge of information overload.
- Relying solely on general LLM knowledge can lead to less grounded and relevant medical responses, potentially risking misinformation.
- The quality of the provided context (Merck Manuals) is vital for ensuring high-quality, authoritative responses from the RAG system.

- Prioritize the deployment of the RAG-based AI solution in healthcare settings to provide medical professionals with highly accurate, context-specific information derived from trusted sources like the Merck Manuals. This will streamline decision-making and reduce the risk of information overload.
- Invest in expanding and maintaining the quality of the medical knowledge base (vector database). Regularly update the ingested manuals and documents to ensure the RAG system always provides the most current and comprehensive medical information.

- Develop training programs for healthcare staff on how to effectively use the RAG solution for quick access to critical information, and integrate the solution into existing clinical workflows to maximize its utility and impact on patient outcomes.

- Establish a feedback loop for continuous monitoring of the RAG system's performance, allowing for ongoing adjustments and improvements to the prompts, retrieval mechanisms, and knowledge base based on user experience and emerging medical knowledge.

