# PDF Text Extraction for CV Analysis

This script processes CVs stored in PDF format, organized into directories by job categories.

## Part I: Text Extraction with PyMuPDF

### fitz Library

- `fitz` is the import name of PyMuPDF, a Python binding for the MuPDF library. It can extract text from PDFs that contain selectable text.

### Workflow

1. **Directory Traversal**: The script walks through each directory starting from the specified base path, processing only `.pdf` files.

2. **Text Extraction**: For each PDF, the script attempts to extract text using PyMuPDF (`fitz`). It iterates through all pages of the document, concatenating the text it finds.

3. **Skipped Documents**: If fitz is unable to extract text, which suggests the document may contain scanned images instead of text, the file is noted and skipped.

4. **Output**: Extracted text is cleaned and written to a CSV file, one row per PDF, with the columns `Category`, `Filename`, and `Text`.

### Notes

- The script does not handle OCR and will not extract text from scanned image PDFs.
- It assumes all text-based content in a PDF is extractable, which may not be true for PDFs with complex encodings or security restrictions.
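
The "noted and skipped" rule from the workflow can be approximated by treating a document as scanned when its pages yield almost no text. The sketch below is only an illustration: `looks_scanned` and the `min_chars` threshold are hypothetical names and values, not part of PyMuPDF, and the sample page texts are made up.

```python
def looks_scanned(page_texts, min_chars=20):
    """Heuristic: True when the pages yield (almost) no extractable text.

    `page_texts` is a list of strings, one per page, as returned by a
    text extractor. `min_chars` is an arbitrary illustrative threshold.
    """
    # Count only non-whitespace characters so blank lines do not count as text
    total = sum(len("".join(text.split())) for text in page_texts)
    return total < min_chars


skipped = []  # filenames noted as probably scanned
samples = {
    "scan.pdf": ["", " \n "],
    "text.pdf": ["ACCOUNTANT Summary Financial Accountant", "Experience"],
}
for name, pages in samples.items():
    if looks_scanned(pages):
        skipped.append(name)

print(skipped)
```

A threshold above zero also catches PDFs whose only text is a stray page number or watermark, at the cost of possibly skipping very short but valid documents.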


In [48]:
import os
import fitz  # PyMuPDF
import csv
import spacy
import pandas as pd

In [43]:
def clean_text(text):
    # Perform basic text cleaning
    text = text.replace('\n', ' ')  # Replace new lines with spaces
    text = ' '.join(text.split())   # Remove extra spaces
    return text
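
As a quick check of the cleaning step: `str.split()` with no arguments already splits on any run of whitespace, including newlines and tabs, so the same result can be obtained in one expression. The name `clean_text_compact` is hypothetical and used only for this sketch.

```python
def clean_text_compact(text):
    # split() with no arguments treats newlines, tabs and repeated spaces alike
    return " ".join(text.split())


sample = "SENIOR ACCOUNTANT\n\nProfessional   Summary\t Senior accountant "
cleaned = clean_text_compact(sample)
print(cleaned)
```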

In [44]:
def extract_text_from_pdf(pdf_path, writer):
    # Open the PDF file and concatenate the text from all pages
    with fitz.open(pdf_path) as pdf:
        full_text = ''.join(page.get_text() for page in pdf)

    # Clean the extracted text
    clean_full_text = clean_text(full_text)

    # No extractable text usually means scanned images: note the file and skip it
    if not clean_full_text:
        print(f'Skipped (no extractable text): {pdf_path}')
        return

    # Extract category and filename for CSV
    category = os.path.basename(os.path.dirname(pdf_path))
    filename = os.path.basename(pdf_path)

    # Write to CSV
    writer.writerow([category, filename, clean_full_text])

In [45]:
def process_pdf_directories(base_path, output_csv_path):
    # Open the output CSV file
    with open(output_csv_path, mode='w', newline='', encoding='utf-8') as csvfile:
        # Create a CSV writer
        csv_writer = csv.writer(csvfile)
        # Write header
        csv_writer.writerow(['Category', 'Filename', 'Text'])
        
        # Walk through the base directory
        for root, dirs, files in os.walk(base_path):
            for file in files:
                if file.lower().endswith('.pdf'):
                    # Construct the full path of the PDF file
                    pdf_path = os.path.join(root, file)
                    # Extract text and write to CSV
                    extract_text_from_pdf(pdf_path, csv_writer)
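
The traversal above could also be written with `pathlib`, which keeps the category lookup and the extension check in one place. This is a sketch under assumptions: `iter_pdfs` is a hypothetical helper, and the directory and file names below are throwaway examples created in a temporary folder.

```python
import tempfile
from pathlib import Path


def iter_pdfs(base_path):
    """Yield (category, path) for every PDF under base_path, at any depth."""
    for path in sorted(Path(base_path).rglob("*")):
        # Compare the suffix case-insensitively so 'CV.PDF' is also picked up
        if path.is_file() and path.suffix.lower() == ".pdf":
            # The category is the name of the directory that holds the file
            yield path.parent.name, path


# Tiny throwaway tree to show the behaviour
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["ACCOUNTANT/a.pdf", "ENGINEERING/b.PDF", "ENGINEERING/notes.txt"]:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"")
    found = [(category, path.name) for category, path in iter_pdfs(tmp)]

print(found)
```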

In [46]:
base_path = './data'
output_csv_path = './processed_resumes.csv'
process_pdf_directories(base_path, output_csv_path)

In [49]:
nlp = spacy.load("en_core_web_sm")


In [59]:
def preprocess_text(text):
    # Convert non-string values to strings
    text = str(text)
    # Process the text with spaCy. This runs the entire pipeline.
    doc = nlp(text)
    # Lemmatize and lowercase each token, dropping stop words and punctuation
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop 
        and not token.is_punct
        # Remove tokens that are only spaces
        and not token.is_space
    ]
    
    # Return preprocessed tokens
    return ' '.join(tokens)

In [60]:
df = pd.read_csv('processed_resumes.csv')
df['Processed Text'] = df['Text'].apply(preprocess_text)


In [61]:
df.head()

| | Category | Filename | Text | Processed Text |
|---|---|---|---|---|
| 0 | ACCOUNTANT | 10554236.pdf | ACCOUNTANT Summary Financial Accountant specia... | accountant summary financial accountant specia... |
| 1 | ACCOUNTANT | 10674770.pdf | STAFF ACCOUNTANT Summary Highly analytical and... | staff accountant summary highly analytical det... |
| 2 | ACCOUNTANT | 11163645.pdf | ACCOUNTANT Professional Summary To obtain a po... | accountant professional summary obtain positio... |
| 3 | ACCOUNTANT | 11759079.pdf | SENIOR ACCOUNTANT Experience Company Name June... | senior accountant experience company june 2011... |
| 4 | ACCOUNTANT | 12065211.pdf | SENIOR ACCOUNTANT Professional Summary Senior ... | senior accountant professional summary senior ... |


In [62]:
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
import nltk

In [63]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Marouane\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [83]:
# Example job offer
job_offer = "We are looking for a data scientist with experience in machine learning and natural language processing"
# Clean the job offer with the same preprocessing applied to the resumes
job_offer = preprocess_text(job_offer)


# Remove verbs and pronouns from the job offer using spaCy
doc = nlp(job_offer)
job_offer = ' '.join([token.lemma_ for token in doc if token.pos_ not in ['VERB', 'PRON']])
# Tokenize the job offer
tokenized_job_offer = word_tokenize(job_offer)
# Join the tokens back into a plain-text job offer
job_offer = ' '.join(tokenized_job_offer)

In [84]:
# Rank the resumes against the job offer using BM25
def rank_resumes(job_offer, resumes):
    tokenized_corpus = [word_tokenize(doc) for doc in resumes]
    bm25 = BM25Okapi(tokenized_corpus)
    tokenized_query = word_tokenize(job_offer)
    doc_scores = bm25.get_scores(tokenized_query)
    return doc_scores
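
To make the scores returned above easier to interpret, here is a minimal pure-Python sketch of the textbook BM25 formula. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the toy corpus, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration, so the numbers will not match `BM25Okapi` exactly.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised through this term
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed, non-negative IDF (an assumption; libraries differ here)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    "data scientist machine learning python".split(),
    "senior accountant financial reporting".split(),
    "machine operator quality control".split(),
]
scores = bm25_scores("data scientist machine learning".split(), corpus)
print(scores)
```

A resume only earns score for query terms it actually contains, rare terms weigh more than common ones, and repeated occurrences of a term give diminishing returns.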


In [87]:
rank_resumes(job_offer, df['Processed Text'])


array([3.12825729, 1.4847528 , 5.66822565, ..., 4.79884232, 2.48181935,
       3.10907577])

In [88]:
# get the top 5 resumes
top_resumes = df.iloc[rank_resumes(job_offer, df['Processed Text']).argsort()[::-1][:5]]
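
The `argsort()[::-1][:5]` idiom sorts the row positions by score, reverses them so the best comes first, and keeps the first five. The same selection in plain Python, keeping each score next to its row position, might look like the sketch below; `top_k` is a hypothetical helper and the scores are made-up values.

```python
def top_k(scores, k=5):
    """Return (index, score) pairs for the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


scores = [3.13, 1.48, 5.67, 4.80, 2.48, 3.11]
best = top_k(scores, k=3)
print(best)
```

Keeping the score alongside the index makes it possible to show how close the selected resumes are to each other, which the bare index list does not reveal.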

In [89]:
top_resumes

| | Category | Filename | Text | Processed Text |
|---|---|---|---|---|
| 1464 | ENGINEERING | 12011623.pdf | ENGINEERING AND QUALITY TECHNICIAN Career Over... | engineering quality technician career overview... |
| 296 | AGRICULTURE | 81042872.pdf | RESEARCH SCIENTIST Summary Highly motivated Re... | research scientist summary highly motivated re... |
| 548 | AVIATION | 12144825.pdf | SOFTWARE ENGINEERING CO-OP Summary Highly skil... | software engineering co op summary highly skil... |
| 1526 | ENGINEERING | 28923650.pdf | THERMAL ENGINEERING INTERN Summary Graduating ... | thermal engineering intern summary graduating ... |
| 1371 | DIGITAL-MEDIA | 14036515.pdf | MONITOR TECH Summary Knowledge of modern offic... | monitor tech summary knowledge modern office m... |


In [90]:
# Find the top-ranked PDFs and copy them to a new directory
import shutil

# Create a new directory to store the top resumes
output_dir = 'top_resumes'
os.makedirs(output_dir, exist_ok=True)

# Walk the data directory once and copy every PDF whose name is in the top results
wanted = set(top_resumes['Filename'])
for root, dirs, files in os.walk(base_path):
    for file in files:
        if file in wanted:
            # Construct the full path of the PDF file and copy it
            pdf_path = os.path.join(root, file)
            shutil.copy(pdf_path, output_dir)