# RAG System for CV Analysis

This notebook implements a Retrieval-Augmented Generation (RAG) system using LangChain and HuggingFace to analyze CV/Resume documents. The system allows you to ask natural language questions about a CV and get relevant answers with source citations.

## Features:
- PDF document loading and processing
- Text chunking and embedding generation
- Vector similarity search with FAISS
- Question-answering with source attribution
- Interactive Q&A session
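
To make the retrieval feature concrete: similarity search ranks the stored chunk vectors by their distance to the query vector and returns the closest few. The following is a toy sketch only, assuming hand-written 3-dimensional vectors in place of real embeddings and a hypothetical `top_k` helper that is not part of this notebook; FAISS performs the same kind of ranking far more efficiently.

```python
import math


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    # Sort by distance (ties broken by index) and keep the k nearest.
    return [idx for _, idx in sorted(distances)[:k]]


# Hand-written vectors standing in for real chunk embeddings (assumption).
chunk_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.8, 0.2, 0.1],  # chunk 2
]
query_vector = [1.0, 0.0, 0.0]

print(top_k(query_vector, chunk_vectors))
```

The indices returned here play the role of the `k` chunks that the retriever later passes to the language model as context.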

## Requirements:
- LangChain community packages
- HuggingFace Transformers
- FAISS vector store
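- PyMuPDF for PDF processing
- sentence-transformers for the embedding model
- python-dotenv (optional) for loading a `.env` file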

## 1. Import Required Libraries

First, let's install the dependencies and import the libraries our RAG system needs. The install line includes python-dotenv so that secrets can be loaded from a `.env` file instead of being written into the notebook:

In [None]:
%pip install langchain langchain-community transformers faiss-cpu pymupdf sentence-transformers python-dotenv

import os
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.prompts import PromptTemplate

# Optional: Load environment variables from .env file
try:
    from dotenv import load_dotenv
    # load_dotenv() returns False when it found no variables to load
    if load_dotenv():
        print("✅ Environment variables loaded from .env file")
    else:
        print("ℹ️  No .env values found. Using system environment variables only.")
except ImportError:
    print("ℹ️  python-dotenv not available. Using system environment variables only.")

print("✅ All libraries imported successfully!")

## 2. Configuration Setup

Define the configuration parameters for our RAG system:

**🔒 Security Best Practices:**
- **Never hardcode API tokens** in notebooks or code
- Use environment variables, config files, or secure vaults
- Add sensitive files to `.gitignore`
- Use different methods based on your environment
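
One practical consequence of these practices: a token should never be printed in full, because notebook outputs are saved together with the file. A minimal sketch, assuming a hypothetical helper named `mask_secret` (not part of this notebook or of any library) and a made-up example token:

```python
def mask_secret(value, visible=4):
    """Return a display-safe version of a secret (hypothetical helper)."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal anything safely: hide every character.
        return "*" * len(value)
    # Keep only the last few characters so the token can still be recognised.
    return "*" * (len(value) - visible) + value[-visible:]


# The token below is an invented placeholder, not a real credential.
print(mask_secret("hf_exampleToken1234"))
print(mask_secret(None))
```

Printing only a set/not-set flag, as the configuration summary in this notebook does, is an equally safe choice.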

**Choose one of the following methods:**

In [None]:
import os

# --- Configuration ---
# Choose ONE of the following methods based on your environment:

# METHOD 1: Environment Variables (Recommended for production)
# Set these in your system environment or .env file
PDF_PATH = os.getenv('PDF_PATH', './sample_cv.pdf')  # Default fallback
HF_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')

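# METHOD 2: Load from .env file (Recommended for development)
# The import cell already calls load_dotenv() when python-dotenv is installed,
# so METHOD 1 picks up values from .env automatically.
# Uncomment the following only if you skipped that cell:
# from dotenv import load_dotenv
# load_dotenv()
# HF_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')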

# METHOD 3: Load from config file (Alternative approach)
# config_file = 'config.json'  # Create this file separately
# if os.path.exists(config_file):
#     import json
#     with open(config_file, 'r') as f:
#         config = json.load(f)
#     HF_API_TOKEN = config.get('huggingface_token')
#     PDF_PATH = config.get('pdf_path', './sample_cv.pdf')

# METHOD 4: Interactive input (For testing/demo purposes)
# HF_API_TOKEN = input("Enter your HuggingFace API token: ")
# PDF_PATH = input("Enter path to your CV PDF: ")

# Set the API token in environment
if HF_API_TOKEN:
    os.environ["HUGGINGFACEHUB_API_TOKEN"] = HF_API_TOKEN
    print("✅ HuggingFace API token loaded successfully")
else:
    print("⚠️  No HuggingFace API token found. Please set HUGGINGFACEHUB_API_TOKEN environment variable.")
    print("   You can also create a .env file or use one of the other methods above.")

# Model configurations (these can be public)
HF_LLM_MODEL = "google/flan-t5-small"  # Language model for Q&A
HF_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Embedding model

print(f"\n📁 Configuration Summary:")
print(f"PDF Path: {PDF_PATH}")
print(f"LLM Model: {HF_LLM_MODEL}")
print(f"Embedding Model: {HF_EMBEDDING_MODEL}")
print(f"API Token: {'✅ Set' if HF_API_TOKEN else '❌ Not found'}")

### 🔧 Setup Instructions for Each Method

#### Method 1: Environment Variables (Recommended)

**Windows (Command Prompt):**
```bat
REM Set temporarily (current session only)
set HUGGINGFACEHUB_API_TOKEN=your_token_here
set PDF_PATH=C:/path/to/your/cv.pdf

REM Set permanently (takes effect in newly opened Command Prompt windows)
setx HUGGINGFACEHUB_API_TOKEN "your_token_here"
setx PDF_PATH "C:/path/to/your/cv.pdf"
```

**Linux/Mac:**
```bash
# Add to ~/.bashrc or ~/.zshrc
export HUGGINGFACEHUB_API_TOKEN="your_token_here"
export PDF_PATH="/path/to/your/cv.pdf"

# Then reload: source ~/.bashrc
```

#### Method 2: .env File (Development)

1. Install python-dotenv: `pip install python-dotenv`
2. Create `.env` file in your project root:
```
HUGGINGFACEHUB_API_TOKEN=your_token_here
PDF_PATH=./your_cv.pdf
```
3. Add `.env` to your `.gitignore` file
4. Re-run the import cell, which calls `load_dotenv()` automatically; uncomment the dotenv lines in the configuration cell only if you skipped it

#### Method 3: Config File

1. Create `config.json` (add to `.gitignore`):
```json
{
    "huggingface_token": "your_token_here",
    "pdf_path": "./your_cv.pdf"
}
```
2. Uncomment the config file lines in the code above

#### Method 4: Interactive Input
- Good for demos or one-time use
- Token won't be saved anywhere
- Uncomment the input lines in the code above
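
For interactive entry, `getpass` from the standard library is usually preferable to `input()` because it does not echo the typed token into the cell output. A minimal sketch, assuming a hypothetical helper `resolve_token` that checks the environment first and only prompts as a fallback; the `prompt` parameter is injectable so the function can be exercised without a terminal:

```python
import os
from getpass import getpass


def resolve_token(env=None, prompt=getpass):
    """Return the token from the environment, else ask for it interactively.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    if token:
        return token
    # getpass hides the typed characters, unlike input().
    token = prompt("Enter your HuggingFace API token: ").strip()
    return token or None


# Example with an explicit mapping instead of the real environment
# (the value is an invented placeholder):
print(resolve_token(env={"HUGGINGFACEHUB_API_TOKEN": "hf_example"}))
```

In the configuration cell this could replace the commented `input(...)` line, for example `HF_API_TOKEN = resolve_token()`.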

In [None]:
# === SETUP HELPER: Create .env file template ===
# Run this cell to create a template .env file

env_template = """# HuggingFace API Token - Get from https://huggingface.co/settings/tokens
HUGGINGFACEHUB_API_TOKEN=your_token_here

# Path to your CV PDF file
PDF_PATH=./sample_cv.pdf

# Optional: Other configurations
# HF_LLM_MODEL=google/flan-t5-small
# HF_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
"""

# Create .env template file
with open('.env.template', 'w') as f:
    f.write(env_template)

print("✅ Created .env.template file")
print("📝 Steps to use:")
print("   1. Copy .env.template to .env")
print("   2. Edit .env and add your actual token")
print("   3. Install python-dotenv: pip install python-dotenv")
print("   4. Re-run the import cell so load_dotenv() picks up the new .env file")

# Create .gitignore if it doesn't exist
gitignore_content = """# Environment variables
.env
config.json

# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python

# Jupyter
.ipynb_checkpoints/

# IDE
.vscode/
.idea/

# OS
.DS_Store
Thumbs.db
"""

if not os.path.exists('.gitignore'):
    with open('.gitignore', 'w') as f:
        f.write(gitignore_content)
    print("✅ Created .gitignore file")
else:
    print("ℹ️  .gitignore already exists")

In [None]:
# === SECURITY VALIDATION ===
# This cell checks if you're following security best practices

def validate_security():
    issues = []
    warnings = []
    
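    # Scanning str(globals()) for 'hf_' is unreliable: a token loaded from the
    # environment and the notebook's own input history both match, so that
    # check would always fire. Flag an unreplaced placeholder value instead.
    if os.getenv('HUGGINGFACEHUB_API_TOKEN') == 'your_token_here':
        issues.append("🚨 HUGGINGFACEHUB_API_TOKEN still holds the placeholder value")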
    
    # Check if .env exists and is in gitignore
    if os.path.exists('.env'):
        if os.path.exists('.gitignore'):
            with open('.gitignore', 'r') as f:
                gitignore_content = f.read()
            if '.env' not in gitignore_content:
                issues.append("🚨 .env file exists but not in .gitignore")
        else:
            issues.append("🚨 .env file exists but no .gitignore found")
    
    # Check environment variable
    if os.getenv('HUGGINGFACEHUB_API_TOKEN'):
        if os.getenv('HUGGINGFACEHUB_API_TOKEN').startswith('hf_'):
            print("✅ Valid HuggingFace token format detected")
        else:
            warnings.append("⚠️  Token format doesn't match expected HuggingFace format")
    else:
        warnings.append("⚠️  No HUGGINGFACEHUB_API_TOKEN environment variable found")
    
    # Print results
    if not issues and not warnings:
        print("🎉 All security checks passed!")
    else:
        if issues:
            print("🚨 SECURITY ISSUES FOUND:")
            for issue in issues:
                print(f"   {issue}")
        if warnings:
            print("⚠️  WARNINGS:")
            for warning in warnings:
                print(f"   {warning}")
    
    return len(issues) == 0

validate_security()

## 3. Define RAG System Builder Function

This function builds the complete RAG system pipeline:
1. **Document Loading**: Load PDF using PyMuPDFLoader
2. **Text Splitting**: Break document into manageable chunks
3. **Embeddings**: Generate vector embeddings for semantic search
4. **Vector Store**: Create FAISS index for similarity search
5. **LLM Setup**: Initialize the language model pipeline
6. **QA Chain**: Configure the retrieval-based question answering system
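
The text-splitting step is easier to reason about with a small example. The sketch below is a simplified fixed-window splitter, not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to cut at paragraph, line, and word boundaries, so its chunks are generally shorter than `chunk_size`. The helper name `sliding_chunks` and the sample text are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 120  # 1200 characters of stand-in text
pieces = sliding_chunks(sample)

print([len(p) for p in pieces])
# The tail of one chunk is repeated at the head of the next one.
print(pieces[0][-100:] == pieces[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.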

In [None]:
def build_rag_system(pdf_path: str):
    """
    Build the RAG pipeline for a single PDF and return a RetrievalQA chain,
    or None if any stage (loading, embedding, indexing, LLM setup) fails.
    """
    print(f"--- Starting RAG System Build for: {pdf_path} ---")

    # 1. Document Loader
    try:
        print("Loading document with PyMuPDFLoader...")
        loader = PyMuPDFLoader(pdf_path)
        documents = loader.load()
        print(f"Loaded {len(documents)} pages from the PDF.")
    except Exception as e:
        print(f"Error loading PDF: {e}")
        return None

    # 2. Text Splitting: small chunks keep the combined prompt short for the LLM
    print("Splitting documents into smaller chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,  # Maximum chunk length in characters (length_function=len)
        chunk_overlap=100,  # Characters shared between neighbouring chunks
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} text chunks.")

    # 3. Embedding
    print(f"Generating embeddings using: {HF_EMBEDDING_MODEL}...")
    try:
        embeddings = HuggingFaceEmbeddings(model_name=HF_EMBEDDING_MODEL)
    except Exception as e:
        print(f"Error initializing embeddings: {e}")
        return None

    # 4. Vector Store
    print("Creating FAISS vector store...")
    try:
        vector_store = FAISS.from_documents(chunks, embeddings)
        print("FAISS vector store created successfully.")
    except Exception as e:
        print(f"Error creating vector store: {e}")
        return None

    # 5. LLM Setup with better configuration
    print(f"Initializing LLM with model: {HF_LLM_MODEL}...")
    try:
        tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)
        model = AutoModelForSeq2SeqLM.from_pretrained(HF_LLM_MODEL)
        
        # Low-temperature sampling keeps answers focused but not fully
        # reproducible; set do_sample=False for deterministic output
        pipe = pipeline(
            "text2text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=150,  # Upper bound on the generated answer length
            temperature=0.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
        llm = HuggingFacePipeline(pipeline=pipe)
        print("LLM initialized successfully.")
    except Exception as e:
        print(f"Error initializing LLM: {e}")
        return None

    # 6. Custom prompt template; built line by line so that the function's
    #    indentation does not leak into the prompt text
    prompt_template = (
        "Use the following pieces of context to answer the question.\n"
        "If you don't know the answer, just say that you don't know.\n\n"
        "Context: {context}\n\n"
        "Question: {question}\n\n"
        "Answer:"
    )
    
    PROMPT = PromptTemplate(
        template=prompt_template, 
        input_variables=["context", "question"]
    )

    # 7. Create RetrievalQA chain with custom prompt
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 2}  # Limit to 2 most relevant chunks
        ),
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )
    print("RetrievalQA chain configured with custom prompt.")
    print("--- RAG System Build Complete ---")
    return qa_chain

## 4. Define Answer Formatting Function

This function formats the RAG system's responses in a user-friendly way, showing both the answer and the source documents used to generate the response.
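
It helps to know the shape of the value being formatted. With `return_source_documents=True`, the chain returns a dictionary with the keys `query`, `result`, and `source_documents`. The sketch below mimics that shape with invented stand-in data; `FakeDoc` and `summarize_sources` are hypothetical names used only for illustration, and the real documents are LangChain `Document` objects.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, max_chars=60):
    """Return one short line per source document in a result dict."""
    lines = []
    for i, doc in enumerate(result.get("source_documents", []), 1):
        page = doc.metadata.get("page", "Unknown")  # raw metadata value
        snippet = doc.page_content.strip()[:max_chars]
        lines.append(f"Source {i} (page {page}): {snippet}")
    return lines


# Invented example data in the same shape as a real chain result.
fake_result = {
    "query": "What are the candidate's skills?",
    "result": "Python and SQL",
    "source_documents": [FakeDoc("Skills: Python, SQL", {"page": 0})],
}

print(summarize_sources(fake_result))
```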

In [None]:
def format_answer(result):
    """
    Format the answer in a more readable way.
    """
    print("\n--- Answer ---")
    answer = result["result"].strip()
    if answer:
        print(answer)
    else:
        print("I couldn't find a specific answer to your question.")
    
    print("\n--- Source Information ---")
    if "source_documents" in result:
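        for i, doc in enumerate(result["source_documents"], 1):
            # PyMuPDFLoader stores a 0-based page index; display it 1-based
            page = doc.metadata.get('page')
            page_num = page + 1 if isinstance(page, int) else 'Unknown'
            print(f"Source {i} (Page {page_num}):")
            # Show up to 300 characters of the chunk for context
            content = doc.page_content.strip()
            preview = content[:300] + ("..." if len(content) > 300 else "")
            print(f"  {preview}")
            print("-" * 50)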
    else:
        print("No source documents found.")

## 5. Main Execution Logic

Now let's build the RAG system and start the interactive Q&A session. 

**Before running this cell:**
1. Make sure the PDF path in the configuration is correct
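2. A HuggingFace API token is optional for the default setup: both configured models are public and run locally, so a token should only be needed if you switch to a gated or private model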
3. The PDF file should be accessible from the specified path

In [None]:
# Check if PDF file exists
if not os.path.exists(PDF_PATH):
    print(f"Error: PDF file not found at '{PDF_PATH}'.")
    print("Please update the PDF_PATH in the configuration section.")
else:
    print("PDF file found. Building RAG system...")
    qa_system = build_rag_system(PDF_PATH)
    
    if qa_system:
        print("\n--- RAG System Ready ---")
        print("You can now ask questions about the CV document.")
        print("\nSample questions you can ask:")
        print("- Who is the candidate?")
        print("- What are the candidate's skills?")
        print("- What is the candidate's education background?")
        print("- What projects has the candidate worked on?")
        print("- What programming languages does the candidate know?")
        print("- What is the candidate's work experience?")
    else:
        print("\nFailed to initialize RAG system.")
        print("Please check the error messages above and try again.")

## 6. Interactive Q&A Session

Run the following cells to ask questions about the CV. Each cell demonstrates a different type of query you can make to the RAG system.
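
The query cells below all follow the same pattern: check that the system exists, call `invoke` with a `{"query": ...}` dictionary, and format the result. A minimal sketch of that pattern as a reusable function, assuming a hypothetical helper `ask` and a stand-in `EchoChain` class so the example runs without loading any model:

```python
def ask(chain, question):
    """Run one query and return (answer, sources); hypothetical helper."""
    if chain is None:
        raise RuntimeError("RAG system not initialized")
    result = chain.invoke({"query": question})
    return result["result"].strip(), result.get("source_documents", [])


class EchoChain:
    """Stand-in chain that echoes the question back (illustration only)."""

    def invoke(self, inputs):
        return {"result": f" echo: {inputs['query']} ", "source_documents": []}


answer, sources = ask(EchoChain(), "Who is the candidate?")
print(answer)
```

With the real system, `ask(qa_system, "...")` would return the model's answer and the retrieved chunks in the same way.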

In [None]:
# Example Query 1: Ask about the candidate's identity
if 'qa_system' in locals() and qa_system:
    question = "Who is the candidate? What is their name and professional title?"
    print(f"Question: {question}")
    try:
        result = qa_system.invoke({"query": question})
        format_answer(result)
    except Exception as e:
        print(f"Error during query: {e}")
else:
    print("RAG system not initialized. Please run the previous cells first.")

In [None]:
# Example Query 2: Ask about technical skills
if 'qa_system' in locals() and qa_system:
    question = "What are the candidate's technical skills and programming languages?"
    print(f"Question: {question}")
    try:
        result = qa_system.invoke({"query": question})
        format_answer(result)
    except Exception as e:
        print(f"Error during query: {e}")
else:
    print("RAG system not initialized. Please run the previous cells first.")

In [None]:
# Custom Query: Ask your own question
if 'qa_system' in locals() and qa_system:
    # You can modify this question to ask anything about the CV
    custom_question = "What is the candidate's education background?"
    
    print(f"Custom Question: {custom_question}")
    try:
        result = qa_system.invoke({"query": custom_question})
        format_answer(result)
    except Exception as e:
        print(f"Error during query: {e}")
        
    print("\n" + "="*60)
    print("To ask a different question, modify the 'custom_question' variable above and run this cell again.")
else:
    print("RAG system not initialized. Please run the previous cells first.")

## 7. Continuous Interactive Session

For a more interactive experience, run the following cell to start a continuous Q&A session where you can ask multiple questions.

In [None]:
# Continuous Interactive Session
if 'qa_system' in locals() and qa_system:
    print("Starting interactive Q&A session...")
    print("Type 'exit' or 'quit' to end the session.")
    print("Type 'help' to see sample questions.")
    
    sample_questions = [
        "Who is the candidate?",
        "What are the candidate's skills?",
        "What is the candidate's education background?",
        "What projects has the candidate worked on?",
        "What programming languages does the candidate know?",
        "What is the candidate's work experience?",
        "What certifications does the candidate have?",
        "What are the candidate's achievements?"
    ]
    
    session_active = True
    question_count = 0
    
    while session_active and question_count < 10:  # Limit to 10 questions for notebook
        # Strip surrounding whitespace so inputs like " exit " are recognised
        user_query = input("\nEnter your question (or 'exit' to quit): ").strip()
        
        if user_query.lower() in ["exit", "quit"]:
            print("Exiting RAG Q&A session. Goodbye!")
            session_active = False
            break
            
        if user_query.lower() == "help":
            print("\nSample questions you can ask:")
            for i, q in enumerate(sample_questions, 1):
                print(f"{i}. {q}")
            continue
            
        if user_query.strip():
            try:
                print("Querying RAG system...")
                result = qa_system.invoke({"query": user_query})
                format_answer(result)
                question_count += 1
            except Exception as e:
                print(f"An error occurred during query: {e}")
                print("Please try rephrasing your question.")
        else:
            print("Please enter a non-empty question.")
    
    if question_count >= 10:
        print("\nReached maximum number of questions for this session.")
        print("To continue, restart this cell.")
else:
    print("RAG system not initialized. Please run the previous cells first.")

## Conclusion and Next Steps

Congratulations! You've successfully built and tested a RAG system for CV analysis. Here are some ways to improve and extend this system:

### Potential Improvements:
1. **Better Models**: Try larger language models like `google/flan-t5-base` or `google/flan-t5-large`
2. **Enhanced Chunking**: Experiment with different chunk sizes and overlap values
3. **Multiple Documents**: Extend to handle multiple CVs for comparison
4. **Structured Output**: Parse CV sections (education, experience, skills) separately
5. **Metadata Filtering**: Tag chunks with their CV section and restrict retrieval to a specific section
6. **Performance Optimization**: Implement caching for frequently asked questions
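
As an illustration of the caching idea in item 6, the sketch below memoizes results by normalized question text. `CachedQA` and `CountingChain` are hypothetical names; the wrapper only assumes that the wrapped object has an `invoke` method taking a `{"query": ...}` dictionary, as the chain in this notebook does.

```python
class CachedQA:
    """Memoize answers by normalized question text (hypothetical wrapper)."""

    def __init__(self, chain):
        self.chain = chain
        self.cache = {}

    def invoke(self, inputs):
        # Lowercase and collapse whitespace so trivial variants share a key.
        key = " ".join(inputs["query"].lower().split())
        if key not in self.cache:
            self.cache[key] = self.chain.invoke(inputs)
        return self.cache[key]


class CountingChain:
    """Stand-in for the real chain; counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def invoke(self, inputs):
        self.calls += 1
        return {"result": "stub answer", "source_documents": []}


backend = CountingChain()
cached = CachedQA(backend)
cached.invoke({"query": "What are the skills?"})
cached.invoke({"query": "  what are  the SKILLS? "})

print(backend.calls)
```

Because the pipeline in this notebook samples its output (`do_sample=True`), caching also pins the first sampled answer for a given question, which may or may not be what you want.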

### Usage Tips:
- For better results, ask specific questions rather than general ones
- The system works best with well-structured PDF documents
- Try different phrasings if you don't get the expected answer
- Check the source documents to understand how the answer was derived

### Common Issues:
- **PDF Loading Errors**: Ensure the PDF path is correct and accessible
- **Model Loading Issues**: Check your internet connection and HuggingFace token
- **Memory Issues**: For large documents, consider reducing chunk size or using a smaller embedding model