# LangChain-based Interview Agent

This notebook builds an AI Interview Agent that generates interview questions dynamically from a provided job posting and candidate resume. LangChain supplies the building blocks for document parsing, output processing, and LLM interaction, which we combine into a complete question-generation pipeline.

LangChain does not tie us to a specific vendor SDK, so we can select and integrate any LLM along with its functions and tools.

## 1. Information extraction and parsing

This section covers parsing and extracting essential information from the provided documents (such as the job post and candidate resume) to build the AI Interview Agent.

**1.1. Loading model environment and API keys**

For this project, we use OpenAI's **GPT-4o** model for information extraction and question generation. The cell below reads the API key from Colab's secret storage, so the key never appears in the notebook itself.


In [2]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
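
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of a fallback, assuming the key is stored in an environment variable of the same name; `load_api_key` is a hypothetical helper, not part of the original notebook:

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # google.colab can only be imported inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key lives in an environment variable
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


# Usage (commented out so the sketch runs without a real key):
# OPENAI_API_KEY = load_api_key()
```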

**1.2. Loading libraries for document parsing**

We use the **PyMuPDF** library to parse PDF documents and **python-docx** to read Word documents.
PyMuPDF extracts text accurately while keeping the original line layout, which suits job posts and resumes, since both are commonly distributed as PDFs.

In [3]:
!pip install -q pymupdf python-docx

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m19.8/19.8 MB[0m [31m80.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.3/244.3 kB[0m [31m17.4 MB/s[0m eta [36m0:00:00[0m
[?25h

In [4]:
import fitz  # PyMuPDF
import docx
from pathlib import Path

In [5]:
def extract_text_from_file(file_path: str) -> str:
    """Return the plain text of a .pdf, .docx, or .txt file."""
    file_path = Path(file_path)
    suffix = file_path.suffix.lower()

    if suffix == '.pdf':
        # "text" mode returns plain text and keeps the original line breaks
        with fitz.open(file_path) as doc:
            return "".join(page.get_text("text") for page in doc)

    elif suffix == '.docx':
        doc = docx.Document(file_path)
        return "\n".join(para.text for para in doc.paragraphs)

    elif suffix == '.txt':
        return file_path.read_text(encoding="utf-8")

    else:
        raise ValueError(f"Unsupported file type: {file_path.suffix}")
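
Text extracted from PDFs often carries trailing spaces and long runs of blank lines. As an optional step that the original notebook does not include, the sketch below shows one way to tidy the extracted text before sending it to the model; `normalize_extracted_text` is a hypothetical helper name.

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Strip trailing whitespace per line and collapse long runs of blank lines."""
    # Remove trailing whitespace on every line
    lines = [line.rstrip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


sample = "Name  \n\n\n\nSkills: Python \n"
print(normalize_extracted_text(sample))
```

Whether this step is worthwhile depends on the documents; it mainly reduces the number of tokens sent to the model.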

**1.3. Using functions for information extraction**

In the cells below, we define three specialized functions to extract relevant information from:
- The candidate's resume
- The job description
- The company profile

To keep the function definitions organized, we use **Pydantic** classes, which let us describe each function's fields without hand-writing the underlying JSON schema.
The utility function `convert_pydantic_to_openai_function` converts each Pydantic class into the OpenAI function format, so the model can return structured data that matches the class.
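
For context, an OpenAI function definition is a JSON object with a `name`, a `description`, and a JSON Schema under `parameters`. The sketch below hand-builds roughly what the utility spares us from writing; `build_function_schema` is a hypothetical helper, and the exact output of `convert_pydantic_to_openai_function` may differ in its details.

```python
import json


def build_function_schema(name: str, description: str, fields: dict) -> dict:
    """Build an OpenAI-style function definition from {field_name: description}.

    Assumption: every field is a required string, as in the classes below.
    """
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {
                field: {"type": "string", "description": desc}
                for field, desc in fields.items()
            },
            "required": list(fields),
        },
    }


resume_schema = build_function_schema(
    "ResumeParser",
    "Extract the following information from the resume",
    {"Name": "candidate's name", "Skills": "list technical skills"},
)
print(json.dumps(resume_schema, indent=2))
```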


In [6]:
from typing import List
from pydantic import BaseModel, Field
from langchain.utils.openai_functions import convert_pydantic_to_openai_function

In [7]:
class ResumeParser(BaseModel):
    """Extract the following information from the resume"""
    Name: str = Field(description="candidate's name")
    Education: str = Field(description="educational details such as degree, major/field of study, institution name and graduation date.")
    Work_experience: str = Field(description="Total years of experience and details including job title, company, dates, key responsibilities and achievements. Also summarize major projects, accomplishments, or publications")
    Skills: str = Field(description="list technical skills, programming languages, frameworks, tools, and other relevant skills")
    Certifications: str = Field(description="list certifications with the name, issuing organization, and issue date.")

In [8]:
class JobPostingParser(BaseModel):
    """Extract the following information from the job posting"""
    Job_title: str = Field(description="The job title for this position.")
    Responsibilities: str = Field(description="The main duties and responsibilities.")
    Required_skills: str = Field(description="The skills and qualifications explicitly required for this role.")
    Preferred_skills: str = Field(description="The skills and qualifications that are preferred but not mandatory.")
    Required_experience: str = Field(description="The level of experience required, including years and specific areas.")
    Location: str = Field(description="The location, including remote options if available.")

In [9]:
class CompanyProfile(BaseModel):
    """Extract the following information from the company profile"""
    Company_Name: str = Field(description="The name of the company.")
    Industry: str = Field(description="The industry or industries this company operates within.")
    Mission: str = Field(description="The mission statement that explains the company's purpose and objectives.")
    Vision: str = Field(description="The vision statement that describes the company’s long-term goals.")
    Core_values: str = Field(description="The key principles or beliefs that define the company’s identity.")
    Company_culture: str = Field(description="A description of the work environment, team dynamics, and company atmosphere")

In [10]:
resume_text = extract_text_from_file('/content/Atanu_Dahari_Resume.pdf')
job_posting_text = extract_text_from_file('/content/Senior AI Research Engineer.docx')
company_profile_text = extract_text_from_file('/content/NRG company values.txt')

In [11]:
!pip install -q langchain_community

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m27.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m44.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m408.7/408.7 kB[0m [31m19.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m73.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.5/49.5 kB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25h

In [12]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

In [13]:
model = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=OPENAI_API_KEY
)



In [14]:
extracting_functions = [
    convert_pydantic_to_openai_function(ResumeParser),
    convert_pydantic_to_openai_function(JobPostingParser),
    convert_pydantic_to_openai_function(CompanyProfile),
]



In [15]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, based on the function requested. If not explicitly provided do not guess."),
    ("user", "{input}")
])

In [16]:
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

In [17]:
extraction_model = model.bind(functions=extracting_functions)

In [18]:
extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()

**1.4. Extraction output**

Below is the output from the extraction chain. The chain combines the prompt, the model, and the output parser into a single pipeline: each document's text is passed in, the model answers with a function call, and the parser turns that call's arguments into a Python dictionary ready for the question-generation step.
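
To make the composition concrete, here is a toy illustration of how a `|` operator can chain steps so that each step's output feeds the next. It is a sketch in plain Python, not LangChain's implementation; the `Step` class and the three stand-in steps are hypothetical.

```python
import json


class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Composing two steps yields a new step that runs them in order
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for the prompt, the model, and the output parser
prompt_step = Step(lambda d: f"Extract fields from: {d['input']}")
model_step = Step(lambda p: {"prompt": p, "arguments": '{"Name": "Jane Doe"}'})
parser_step = Step(lambda m: json.loads(m["arguments"]))

toy_chain = prompt_step | model_step | parser_step
result = toy_chain.invoke({"input": "Jane Doe, ML engineer"})
print(result)
```

The fake model step returns a fixed JSON string so the example runs offline; in the real chain that string comes from the model's function call.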

In [19]:
# The prompt template expects its text under the "input" key
resume_info = extraction_chain.invoke({"input": resume_text})
job_posting_info = extraction_chain.invoke({"input": job_posting_text})
company_profile_info = extraction_chain.invoke({"input": company_profile_text})

In [20]:
resume_info

{'Name': 'ATANU DAHARI',
 'Education': 'Master of Computer Science (MCS) from Rice University, Houston, Texas (August 2021 – December 2022), CGPA: 3.43 / 4.0. Relevant coursework: Deep Learning, Computational Statistics, R, Database management Systems, Web development. Bachelor of Technology in Electronics and Communication Technology (B. Tech in ECE) from Vellore Institute of Technology, Vellore, India (July 2017 - June 2021), CGPA: 8.62 / 10.0. Relevant coursework: Data Structures and Algorithms, Cloud Computing, Object oriented programming, Advanced Java.',
 'Work_experience': 'Machine Learning Engineer at Circle.ooo, Houston, TX (February 2024 – Present): Created an AI assistant using RAG for an event hosting platform, integrated data from various APIs and vector databases, automated event creation, resulting in a 25% increase in user interaction. AI Engineer at WarrantyMe, Houston, TX (May 2023 – January 2024): Developed a warranty information extraction system, extracted warranty

In [21]:
job_posting_info

{'Job_title': 'Senior AI Research Engineer',
 'Responsibilities': 'Advanced Model Architecture: Develop and optimize transformer-based models for NLP tasks, ensuring scalability and efficiency in the cloud environment (Azure/AWS). Generative AI Development: Build and refine generative models, particularly LLMs, to generate coherent and contextually appropriate text for applications like customer care and chatbot automation. AI Integration: Work on integrating Azure OpenAI, Assistant API, and other LLM solutions into practical AI-driven products, such as chatbots and virtual assistants. Research and Innovation: Lead AI research initiatives focusing on cutting-edge transformer and NLP methodologies, exploring new applications and improving model performance. Technical Leadership: Provide leadership in PyTorch, transformers, and LLMs, guiding the development team through complex training and deployment challenges. Collaboration & Mentorship: Collaborate with cross-functional teams to inco

In [22]:
company_profile_info

{'Company_Name': 'NRG',
 'Industry': 'Energy and Home Services',
 'Mission': 'Driven by the idea of a smarter, cleaner future, focusing on innovative solutions that make customers’ lives easier by helping them power, protect, and intelligently manage their homes and businesses.',
 'Vision': 'Creating possibilities to empower the millions of customers we serve and communities where they live and work.'}

## 2. Interview question script generation

This section generates a personalized interview question script from the information extracted above. The prompts ask for a balanced mix of technical and behavioral questions, so the interview assesses both skill fit and cultural alignment with the company.


In [23]:
technical_questions_prompt = """Generate {num_questions} technical interview questions based on the job requirements and the candidate's resume:

        Requirements:
        1. Questions should test both job requirements and candidate skills.
        2. Include a mix of difficulty levels.
        3. Make them specific to the candidate's experience level."""

behavioral_questions_prompt = """Generate {num_questions} behavioral interview questions based on the candidate's resume and the company profile:

        Requirements:
        1. Questions should assess cultural fit for the role.
        2. Questions should reveal candidate's alignment with company values."""

In [29]:
from langchain_core.messages import SystemMessage, HumanMessage
from langchain.chat_models import ChatOpenAI

technical_num_questions = 5

# Format the technical questions prompt content separately
technical_system_content = technical_questions_prompt.format(num_questions = technical_num_questions)

# Define the user message content using f-string formatting
technical_user_content = f"""
Candidate's resume:
Name: {resume_info.get('Name', 'N/A')}
Education: {resume_info.get('Education', 'N/A')}
Work Experience: {resume_info.get('Work_experience', 'N/A')}
Skills: {resume_info.get('Skills', 'N/A')}

Job description:
Job Title: {job_posting_info.get('Job_title', 'N/A')}
Responsibilities: {job_posting_info.get('Responsibilities', 'N/A')}
Required Skills: {job_posting_info.get('Required_skills', 'N/A')}
Preferred Skills: {job_posting_info.get('Preferred_skills', 'N/A')}
Required Experience: {job_posting_info.get('Required_experience', 'N/A')}
"""

behavioral_num_questions = 5

# Format the behavioral questions prompt content separately
behavioral_system_content = behavioral_questions_prompt.format(num_questions = behavioral_num_questions)

behavioral_user_content = f"""
Candidate's resume:
Name: {resume_info.get('Name', 'N/A')}
Work Experience: {resume_info.get('Work_experience', 'N/A')}
Skills: {resume_info.get('Skills', 'N/A')}

Company Profile:
Company Name: {company_profile_info.get('Company_Name', 'N/A')}
Industry: {company_profile_info.get('Industry', 'N/A')}
Mission: {company_profile_info.get('Mission', 'N/A')}
Core Values: {company_profile_info.get('Core_values', 'N/A')}
Company Culture: {company_profile_info.get('Company_culture', 'N/A')}
    """

# Build the message list for each question type
technical_messages = [
    SystemMessage(content=technical_system_content),
    HumanMessage(content=technical_user_content)
]

behavioral_messages = [
    SystemMessage(content=behavioral_system_content),
    HumanMessage(content=behavioral_user_content)
]

# Instantiate the chat model
generation_model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    api_key=OPENAI_API_KEY
)

# Invoke the model on each message list to generate interview questions
technical_response = generation_model.invoke(technical_messages)
behavioral_response = generation_model.invoke(behavioral_messages)
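
The two f-string blocks above repeat the same "label: value, or N/A if missing" pattern. One possible way to factor it out is sketched below; `format_section` and the sample dictionary are hypothetical and only illustrate the idea.

```python
def format_section(title: str, info: dict, fields: dict) -> str:
    """Render a titled block of 'Label: value' lines, using 'N/A' for missing keys.

    `fields` maps the label shown in the prompt to the key in `info`.
    """
    lines = [f"{title}:"]
    for label, key in fields.items():
        lines.append(f"{label}: {info.get(key, 'N/A')}")
    return "\n".join(lines)


# Illustrative data only; the notebook would pass resume_info here
sample_resume_info = {"Name": "Jane Doe", "Skills": "Python, PyTorch"}

section = format_section(
    "Candidate's resume",
    sample_resume_info,
    {"Name": "Name", "Work Experience": "Work_experience", "Skills": "Skills"},
)
print(section)
```

Missing keys fall back to `N/A`, which matches the `.get(..., 'N/A')` calls in the cell above.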

In [33]:
# `technical_response` and `behavioral_response` are AIMessage objects
def clean_response(response):
    """Return the response text with blank lines and surrounding whitespace removed."""
    # Use the message's content attribute; fall back to the value itself for plain strings
    content = response.content if hasattr(response, 'content') else response
    cleaned_questions = []
    for line in content.split('\n'):
        # Remove leading/trailing whitespace from each line
        cleaned_line = line.strip()
        # Keep only non-empty lines
        if cleaned_line:
            cleaned_questions.append(cleaned_line)
    # Join the remaining lines with line breaks for a more readable format
    return '\n'.join(cleaned_questions)

# Clean the technical and behavioral responses
cleaned_technical_questions = clean_response(technical_response)
cleaned_behavioral_questions = clean_response(behavioral_response)

# Combine the responses in a neat format
combined_cleaned_output = f"""
Technical Questions:
{cleaned_technical_questions}

Behavioral Questions:
{cleaned_behavioral_questions}
"""

# Display the cleaned output
print(combined_cleaned_output)



Technical Questions:
1. **Transformer-based Model Development:**
Given your experience with developing AI assistants using RAG, how would you approach optimizing a transformer-based model for a customer care chatbot to ensure both scalability and efficiency? Please detail any specific techniques or tools you would utilize, especially within a cloud environment like Azure or AWS.
2. **Generative AI and LLMs:**
You've worked on extracting warranty information using transformer models. Could you describe how you would refine a large language model (LLM) to generate contextually appropriate responses, particularly for complex customer queries? What strategies would you employ for fine-tuning and regularization in PyTorch?
3. **AI Integration and API Utilization:**
With your background in integrating data from various APIs, how would you go about integrating Azure OpenAI and Assistant API into an AI-driven product, such as a virtual assistant? What challenges might arise, and how would you