# Résumé–Job Description Matching with PDF Parsing & Sentence Transformers

## Purpose  
Extract text from a PDF résumé (with a built-in OCR fallback), clean the text, and score its similarity to a job description using a fine-tuned Sentence Transformer model (`anass1209/resume-job-matcher-all-MiniLM-L6-v2`).

## Pipeline  

**PDF → Text**  
- Try direct text extraction via `pypdf`.  
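- If extraction fails or yields 50 characters or fewer, run OCR with `pdf2image` + `pytesseract` (default languages `eng+ara`). OCR requires the Poppler and Tesseract binaries, plus the Tesseract language data for each language used.  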
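
The fallback rule above can be sketched as a small predicate. This is a minimal illustration rather than the notebook's own code: `needs_ocr` and `MIN_TEXT_CHARS` are hypothetical names, and the 50-character threshold mirrors the `len(text) > 50` check in `pdf_to_text`.

```python
from typing import Optional

# Assumed threshold, mirroring the `len(text) > 50` check in pdf_to_text.
MIN_TEXT_CHARS = 50


def needs_ocr(extracted_text: Optional[str], min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Return True when direct extraction produced too little text to trust."""
    text = (extracted_text or "").strip()
    return len(text) <= min_chars


# A scanned PDF usually has no text layer, so extraction returns an empty string.
print(needs_ocr(""))
# A text-based PDF yields enough characters to skip OCR.
print(needs_ocr("Python developer with five years of experience in backend services."))
```

Keeping the decision in one place makes the threshold easy to tune if short but valid résumés are being sent to OCR unnecessarily.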

**Text Cleaning**  
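- Convert carriage returns to newlines, collapse runs of spaces and tabs into a single space, and reduce three or more consecutive newlines to one blank line.  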
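
To make the cleaning rules concrete, here is a small sketch applied to a messy sample string. `normalize_whitespace` is a hypothetical stand-in for the notebook's `clean_text`; as an assumption, it treats a Windows `\r\n` pair as a single line break.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Apply the cleaning rules in order and trim the result."""
    text = re.sub(r"\r\n?", "\n", text)      # CRLF or bare CR becomes one newline
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs become one space
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines become one blank line
    return text.strip()


raw = "Skills:\t Python,   SQL\r\n\r\n\r\n\r\nExperience  "
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```

The sample collapses to two lines separated by a single blank line, which is the form passed to the encoder.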

**Embedding**  
- Encode both the cleaned résumé text and the job description with the Sentence Transformer model.  

**Similarity**  
- Compute cosine similarity between the two embeddings to get a match score in the range -1 to 1; higher values mean the texts are closer in meaning.  
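
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on toy vectors; `cosine_similarity` is an illustrative helper, not the `util.pytorch_cos_sim` call the notebook uses, and the 3-dimensional vectors stand in for the model's 384-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


resume_vec = [0.2, 0.1, 0.7]  # toy stand-in for a résumé embedding
job_vec = [0.4, 0.2, 1.4]     # same direction as resume_vec, twice the length
print(round(cosine_similarity(resume_vec, job_vec), 4))
```

Because the score depends only on direction, scaling a vector does not change it, which is why the two toy vectors above score as a perfect match.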

**Optional Demo**  
- Encode sample sentences and compute an embedding similarity matrix. The demo cell is commented out; uncomment it to run.

## Inputs  

- `resume_pdf_path` — PDF file containing the CV/résumé.  
- `job_text_path` — text file containing the job description.  

## Outputs  

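- **Similarity Score:** cosine similarity between the résumé and job description, printed to four decimal places (for example, `0.7233`).  
- **Embedding Shapes:** printed by the optional demo for inspection, for example `(3, 384)` for three samples and `torch.Size([3, 3])` for their similarity matrix.  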

## Usage  

Run the cells sequentially:  

1. Adjust `resume_pdf_path` and `job_text_path` to your files.  
2. The notebook will extract, clean, embed, and print the similarity score.  
3. Optionally, test with a provided list of sample sentences to see the model’s embedding behavior.
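
Before running the pipeline on your own files, it can help to confirm that both paths point at usable inputs. The following is a minimal sketch under that assumption; `check_inputs` is a hypothetical helper that is not part of the notebook, and the file names in the example call are placeholders.

```python
from pathlib import Path
from typing import List


def check_inputs(resume_pdf_path: str, job_text_path: str) -> List[str]:
    """Return a list of problems; an empty list means both inputs look usable."""
    problems = []
    resume = Path(resume_pdf_path)
    job = Path(job_text_path)
    if not resume.is_file():
        problems.append(f"résumé not found: {resume}")
    elif resume.suffix.lower() != ".pdf":
        problems.append(f"résumé is not a PDF: {resume}")
    if not job.is_file():
        problems.append(f"job description not found: {job}")
    elif job.stat().st_size == 0:
        problems.append(f"job description is empty: {job}")
    return problems


# Placeholder paths: replace them with your own files.
for problem in check_inputs("missing_resume.pdf", "missing_job_desc.txt"):
    print(problem)
```

Returning a list rather than raising on the first problem lets you see every issue with the inputs in one pass.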


In [9]:
!pip install -U sentence-transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting sentence-transformers
  Downloading sentence_transformers-5.1.0-py3-none-any.whl.metadata (16 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_

In [6]:
!pip install -U transformers sentence-transformers

Collecting transformers
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.34.6-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Downloading transformers-4.56.1-py3-none-any.whl (11.6 MB)
Downloading huggingface_hub-0.34.6-py3-none-any.whl (562 kB)
Downloading tokenizers-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

In [10]:
!pip install -U "transformers==4.44.2" "sentence-transformers==3.0.1"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting transformers==4.44.2
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.44.2)
  Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.44.2-py3-none-any.whl (9.5 MB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Downloading tokenizers-0.19.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 M

In [12]:
from pathlib import Path
import re

# ========= 1) PDF to Text ==========
from pypdf import PdfReader
from pdf2image import convert_from_path
import pytesseract

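def pdf_to_text(pdf_path: str, ocr_lang: str = "eng+ara") -> str:
    """Extract text from a PDF, falling back to OCR when the text layer is missing or too short."""
    # First attempt: read the embedded text layer with pypdf.
    try:
        reader = PdfReader(pdf_path)
        texts = []
        for page in reader.pages:
            txt = page.extract_text() or ""
            texts.append(txt)
        text = "\n".join(texts).strip()
        if len(text) > 50:  # enough text to trust the direct extraction
            return text
    except Exception:
        pass  # direct extraction failed: fall through to OCR

    # Fallback: rasterize each page of the same PDF and run Tesseract OCR on it.
    images = convert_from_path(pdf_path, dpi=300)
    ocr_text = []
    for img in images:
        txt = pytesseract.image_to_string(img, lang=ocr_lang)
        ocr_text.append(txt)
    return "\n".join(ocr_text).strip()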

# ========= 2) Text Cleaning ==========
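def clean_text(s: str) -> str:
    """Normalize line endings and collapse redundant whitespace."""
    s = s.replace("\r\n", "\n").replace("\r", "\n")  # CRLF counts as one line break
    s = re.sub(r"[ \t]+", " ", s)                    # runs of spaces/tabs -> one space
    s = re.sub(r"\n{3,}", "\n\n", s)                 # 3+ newlines -> one blank line
    return s.strip()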

In [13]:
from sentence_transformers import SentenceTransformer, util

# 1) Load the model once
model = SentenceTransformer("anass1209/resume-job-matcher-all-MiniLM-L6-v2")

# 2) Use your functions to extract and clean text from PDF
resume_pdf_path = "/kaggle/input/pdfcvjd/Malak Ahmed.pdf"  # path to the résumé PDF
job_text_path = "/kaggle/input/resume-job-matcher/job_desc.txt"  # path to the job description text file

resume_text_raw = pdf_to_text(resume_pdf_path)      # convert the PDF to text
resume_text = clean_text(resume_text_raw)           # clean the text
with open(job_text_path, encoding="utf-8") as f:    # read the job description
    jd_text = f.read()

# 3) Encode both texts to embeddings
resume_emb = model.encode(resume_text, convert_to_tensor=True)
jd_emb = model.encode(jd_text, convert_to_tensor=True)

# 4) Compute similarity
similarity = util.pytorch_cos_sim(resume_emb, jd_emb)
print(f"Similarity score: {similarity.item():.4f}")


Similarity score: 0.7233


In [4]:
# from sentence_transformers import SentenceTransformer

# # Download from the 🤗 Hub
# model = SentenceTransformer("anass1209/resume-job-matcher-all-MiniLM-L6-v2")
# # Run inference
# sentences = [
#     'Developed and maintained core backend services using Python and Django, focusing on scalability and efficiency. Implemented RESTful APIs for data retrieval and manipulation.  Worked extensively with PostgreSQL for data storage and retrieval.  Responsible for optimizing database queries and improving API response times.  Experience with model fine-tuning for semantic search and document retrieval using pre-trained embedding models like Sentence Transformers or similar libraries, specifically for improving the relevance of search results and document matching within the web application.  Experience using vector databases (e.g., ChromaDB, Weaviate) preferred.',
#     '## Senior Backend Engineer\n\n*   **ABC Corp** | 2020 - Present\n*   Led development of a new REST API for user authentication and profile management using Python and Django.\n*   Managed a PostgreSQL database, optimizing queries and schema design for improved performance, resulting in a 20% reduction in average API response time.\n*   Improved system scalability through efficient code design and load balancing techniques.\n*   Experience using pre-trained embedding models (BERT) for natural language processing tasks to improve search accuracy, with focus on keyphrase extraction and content similarity comparison for the recommendations engine. Proficient in Flask.',
#     "PhD in Computer Science, University of California, Berkeley (2018-2023). Dissertation: 'Adversarial Robustness in NLP for Cybersecurity Applications.' Focused on fine-tuning BERT for malware detection and social engineering attacks. Proficient in Python, TensorFlow, and AWS. Published in top-tier NLP and security conferences. Experienced with large datasets and model evaluation metrics.\n\nMaster of Science in Cybersecurity, Johns Hopkins University (2016-2018). Relevant coursework included Machine Learning, Data Mining, and Network Security. Developed a system for anomaly detection using a recurrent neural network (RNN). Familiar with Python and cloud computing platforms. Good understanding of NLP concepts, but limited experience fine-tuning transformer models. Strong understanding of Information Security Principles.\n\nBachelor of Science in Computer Engineering, Carnegie Mellon University (2012-2016). Relevant coursework: Artificial Intelligence, Database Management, and Software Engineering. Project experience: Developed a web application using Python. No direct experience with fine-tuning NLP models, but a strong foundation in programming and data structures.  Familiar with cloud infrastructure concepts. Possess CISSP certification.",
# ]
# embeddings = model.encode(sentences)
# print(embeddings.shape)
# # (3, 384)

# # Get the similarity scores for the embeddings
# similarities = model.similarity(embeddings, embeddings)
# print(similarities.shape)
# # torch.Size([3, 3])


(3, 384)
torch.Size([3, 3])


In [14]:
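# Score a plain-text résumé against the same job description.
resume_text = Path("/kaggle/input/ai-resume/resume.txt").read_text(encoding="utf-8")
jd_text = Path("/kaggle/input/resume-job-matcher/job_desc.txt").read_text(encoding="utf-8")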

resume_emb = model.encode(resume_text, convert_to_tensor=True)
jd_emb = model.encode(jd_text, convert_to_tensor=True)

In [15]:

similarity = util.pytorch_cos_sim(resume_emb, jd_emb)
print(f"Similarity score: {similarity.item():.4f}")


Similarity score: 0.8701


In [16]:
from sentence_transformers import SentenceTransformer, util

# 1) Load the model once
model = SentenceTransformer("anass1209/resume-job-matcher-all-MiniLM-L6-v2")

# 2) Use your functions to extract and clean text from PDF
resume_pdf_path = "/kaggle/input/pdfcvjd/Malak Ahmed.pdf"  # path to the résumé PDF
job_text_path = "/kaggle/input/jobdescription/joc-desc-flutter.txt"  # path to the job description text file

resume_text_raw = pdf_to_text(resume_pdf_path)      # convert the PDF to text
resume_text = clean_text(resume_text_raw)           # clean the text
with open(job_text_path, encoding="utf-8") as f:    # read the job description
    jd_text = f.read()

# 3) Encode both texts to embeddings
resume_emb = model.encode(resume_text, convert_to_tensor=True)
jd_emb = model.encode(jd_text, convert_to_tensor=True)

# 4) Compute similarity
similarity = util.pytorch_cos_sim(resume_emb, jd_emb)
print(f"Similarity score: {similarity.item():.4f}")


Similarity score: 0.6817


In [17]:
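# Score the plain-text résumé against the Flutter job description.
resume_text = Path("/kaggle/input/ai-resume/resume.txt").read_text(encoding="utf-8")
jd_text = Path("/kaggle/input/jobdescription/joc-desc-flutter.txt").read_text(encoding="utf-8")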

resume_emb = model.encode(resume_text, convert_to_tensor=True)
jd_emb = model.encode(jd_text, convert_to_tensor=True)

In [18]:

similarity = util.pytorch_cos_sim(resume_emb, jd_emb)
print(f"Similarity score: {similarity.item():.4f}")


Similarity score: 0.6816
