# Resume Parser Documentation

## Overview

This notebook provides a comprehensive Python-based solution to parse resume PDFs and extract key information, including:

1. **Full Name**
2. **Contact Information** (e.g., email, phone, LinkedIn)
3. **Summary or Objective Statement**
4. **Skills** (as a list)
5. **Work Experience** (including company, job title, dates, and responsibilities)
6. **Education** (degree, institution, dates, and additional information)
7. **Certifications and Awards** (if present)
8. **Projects** (if present)

The parser is designed to handle diverse resume formats, leveraging PDF text extraction, regular expressions, and natural language processing to identify and extract each section.

## Libraries Used

- `PyPDF2`: For opening and reading text from PDF files.
- `PyMuPDF`: For extracting text with font information, enabling more accurate section identification.
- `re`: For regular expressions to locate and match sections, dates, and other structured information.
- `spacy`: For natural language processing tasks such as named entity recognition; the `en_core_web_sm` model is loaded in the first cell.

## Notebook Structure

Each cell in the notebook is organized to focus on one key component of resume parsing:

1. **Import Libraries for PDF Extraction**: Imports all necessary libraries.
2. **Function to Extract Text from PDF (using PyPDF2)**: Reads PDF content to extract text.
3. **Function to Extract Sections with Font Information (using PyMuPDF)**: Extracts text with font details to aid in identifying sections.
4. **Function to Extract Name from Text**: Uses a regular expression to find the first line made up of two capitalized words.
5. **Extracting Summary/Objective from Resume Text**: Identifies and extracts the summary or objective section.
6. **Function to Extract Accounts (URLs and Emails) from Resume**: Extracts contact details like email and LinkedIn profile links.
7. **Function to Extract Skills from Resume Sections**: Extracts skills listed within a specific section.
8. **Function to Extract Education Details from Resume**: Identifies and extracts educational information, including degrees and institutions.
9. **Function to Extract Certifications and Awards**: Searches for certification or award-related sections and gathers details.
10. **Function to Extract Projects**: Identifies and extracts project information.
11. **Extract Work Experience from Resume**: Extracts details such as job title, company, dates, and responsibilities.

## Challenges

- **Format Variations**: Resume layouts vary widely, so this parser uses flexible regex patterns and generic keywords to identify each section across different styles and formats.
- **Two-Column Layouts**: PDFs often store text out of reading order, especially in multi-column layouts, which makes extraction difficult. The parser mitigates this by reading text through PyMuPDF's block, line, and span structure (`page.get_text("dict")`) rather than as one flat string, though highly complex layouts may still need further refinement.
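
One possible way to approach the multi-column problem is to assign text blocks to a column before reading them top to bottom. The sketch below is an illustration, not part of the notebook: `order_blocks_two_columns` and `page_width` are hypothetical names, the input is assumed to be shaped like the tuples returned by PyMuPDF's `page.get_text("blocks")` (`x0, y0, x1, y1, text, ...`), and the page is assumed to split into exactly two columns at its midpoint.

```python
def order_blocks_two_columns(blocks, page_width):
    """Return block texts in reading order: left column first, then right."""
    midpoint = page_width / 2
    left, right = [], []
    for block in blocks:
        x0, y0, text = block[0], block[1], block[4]
        # Assign each block to a column by the position of its left edge.
        (left if x0 < midpoint else right).append((y0, x0, text))
    # Within each column, read top to bottom.
    return [text for _, _, text in sorted(left) + sorted(right)]


# Synthetic blocks in (x0, y0, x1, y1, text) form, listed out of reading order
sample_blocks = [
    (320, 50, 580, 70, "SKILLS"),
    (40, 50, 300, 70, "WORK EXPERIENCE"),
    (40, 80, 300, 100, "Data Engineering Consultant"),
    (320, 80, 580, 100, "Python"),
]
ordered = order_blocks_two_columns(sample_blocks, page_width=612)
print(ordered)
```

Full-width blocks, such as a name banner across the top of the page, end up in the left column under this rule, so a real layout would likely need extra handling.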

## Instructions for Use

1. Run each code cell in sequence.
2. Set the `pdf_file` variable in the PyPDF2 text extraction cell to the path of your resume PDF.
3. The parser will output a structured dictionary containing the parsed resume information.
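
The cells documented in this file each produce their own result, so the final dictionary has to be assembled from those pieces. A minimal sketch of that step follows; `build_resume_record` is a hypothetical helper that does not appear in the notebook, and the values passed to it are the placeholder values from this document's expected output format.

```python
def build_resume_record(name, contact, summary, skills,
                        work_experience, education, certifications, projects):
    """Combine separately extracted fields into one dictionary (hypothetical helper)."""
    return {
        "Full Name": name,
        "Contact Information": contact,
        "Summary": summary,
        "Skills": list(skills),
        "Work Experience": list(work_experience),
        "Education": list(education),
        "Certifications": list(certifications),
        "Projects": list(projects),
    }


# Placeholder values only; real values would come from the extraction cells
record = build_resume_record(
    name="John Doe",
    contact={"Email": "john.doe@example.com"},
    summary="Experienced data professional...",
    skills=["Python", "SQL"],
    work_experience=[{"Company": "Example Corp", "Job Title": "Data Scientist"}],
    education=[{"Degree": "BSc. Computer Science", "Institution": "University of XYZ"}],
    certifications=["Certified Data Scientist"],
    projects=[],
)
print(sorted(record))
```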

## Expected Output Format

The parsed resume information is output as a structured dictionary with keys representing each information type:

```python
{
    "Full Name": "John Doe",
    "Contact Information": {
        "Email": "john.doe@example.com",
        "Phone": "123-456-7890",
        "LinkedIn": "linkedin.com/in/johndoe"
    },
    "Summary": "Experienced data professional...",
    "Skills": ["Python", "SQL", "Machine Learning", "Data Analysis"],
    "Work Experience": [
        {
            "Company": "Example Corp",
            "Job Title": "Data Scientist",
            "Dates": "Jan 2018 - Dec 2020",
            "Responsibilities": "Developed machine learning models..."
        }
    ],
    "Education": [
        {
            "Degree": "BSc. Computer Science",
            "Institution": "University of XYZ",
            "Dates": "2014 - 2018"
        }
    ],
    "Certifications": ["Certified Data Scientist"],
    "Projects": ["Project A: Developed a recommendation system..."]
}
```


# Import Libraries for PDF Extraction

This cell imports the necessary libraries for extracting and analyzing text from PDF documents. We use `PyPDF2` for basic PDF text extraction, `fitz` (PyMuPDF) for advanced text extraction with font information, `re` for regular expressions, and `spacy` for natural language processing. Additionally, we load the English NLP model from spaCy (`en_core_web_sm`) for named entity recognition and language processing tasks.

**Input:**
- None

**Output:**
- None (Library imports for setting up extraction and NLP tasks)


In [1]:
import PyPDF2
import fitz  # PyMuPDF
import re
import spacy

# Load the English NLP model from spaCy
nlp = spacy.load("en_core_web_sm")



# Function to Extract Text from PDF (using PyPDF2)
This function reads a PDF file and extracts plain text from it using the PyPDF2 library.

**Input:**
- `pdf_path` (str): The file path of the PDF.

**Output:**
- `text` (str): The extracted text from the PDF file.


In [2]:
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Fall back to an empty string if a page yields no text
            text += page.extract_text() or ""
    return text

# Example usage:
pdf_file = "data/Sample Resume for Assessment.pdf"
result = extract_text_from_pdf(pdf_file)
print(result)


Gilbert Adjei 
Gilbert is an experienced data professional with 6+ years of experience in data engineering, analytics & software engineering. Career assignments have
ranged from building data science solutions for startups to leading project teams with eﬀective communication & teamwork. With this background, he
is adept at picking up new skills quickly to deliver robust solutions to the most demanding of businesses. 
gilbertadjei800@gmail.com 
linkedin.com/in/gilbert-adjei-900ba2110 
github.com/GilbertAbakahAdjei 
medium.com/@gilbertadjei800 
WORK EXPERIENCE 
Data Engineering Consultant 
Entelligent 
04/2022 - Present
, 
 
Colorado, US 
Mentored data engineering apprentices & new hires to build new
climate risk measures while ensuring data quality and architecture
initiatives are met. 
Led the build out of A.I data product and a platform that supports
analytics, product development, and product delivery. 
Lead Data & Analytics Engineer 
Float (YC W'20) 
09/2021 - 04/2022
, 
 
San Franc
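
The raw output above contains typographic ligatures (for example the single character `ﬀ` in "eﬀective") and lines that hold only a stray comma or whitespace. A possible clean-up step is sketched below; `normalize_extracted_text` is a hypothetical helper, not part of the notebook, and it assumes Unicode NFKC normalization is acceptable for the resume text.

```python
import unicodedata


def normalize_extracted_text(text):
    """Expand ligatures and drop lines that contain no letters or digits."""
    # NFKC maps compatibility characters such as the "ff" ligature (U+FB00) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    cleaned_lines = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip empty lines and stray separators such as a lone ","
        if not any(ch.isalnum() for ch in stripped):
            continue
        cleaned_lines.append(stripped)
    return "\n".join(cleaned_lines)


sample = "04/2022 - Present\n, \n \nColorado, US \ne\ufb00ective communication"
cleaned = normalize_extracted_text(sample)
print(cleaned)
```

NFKC also rewrites other compatibility characters (superscripts, full-width forms), so it is worth checking that this suits the resumes being processed.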

# Function to Extract Sections with Font Information (using PyMuPDF)
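This function uses `PyMuPDF` to read every text span in the PDF together with its font size and font name, and groups the spans into sections. A span that starts with a word of at least four uppercase letters is treated as a section title; the spans that follow are stored as that section's content until the next title appears.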

**Input:**
- `pdf_path` (str): The file path of the PDF.

**Output:**
- `sections` (dict): A dictionary where each key is a section title (e.g., "Work Experience") and the value is a list of dictionaries containing text and font info.


In [3]:
def extract_sections_with_font_info(pdf_path):
    document = fitz.open(pdf_path)
    sections = {}  # Dictionary to store sections with text and font info
    current_title = None
    current_content = []

    for page_num in range(len(document)):
        page = document.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]

        for block in blocks:
            if "lines" in block:
                for line in block["lines"]:
                    for span in line["spans"]:
                        text = span["text"].strip()
                        font_size = span["size"]
                        font_name = span["font"]
                        
                        if re.match(r"^[A-Z]{4,}\b.*", text):  # Title line pattern
                            if current_title:
                                sections[current_title] = current_content
                            current_title = text
                            current_content = [{"text": text, "font_size": font_size, "font_name": font_name}]
                        else:
                            if current_title:
                                current_content.append({"text": text, "font_size": font_size, "font_name": font_name})

    if current_title:
        sections[current_title] = current_content

    document.close()
    return sections

# Example usage
extracted_sections = extract_sections_with_font_info(pdf_file)
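
To see what the title heuristic accepts, the pattern from the cell above can be tried on a few strings. This is only an illustrative check; `is_section_title` is a hypothetical wrapper and the candidate strings are made up for the example.

```python
import re

# Same pattern as in extract_sections_with_font_info
TITLE_PATTERN = re.compile(r"^[A-Z]{4,}\b.*")


def is_section_title(text):
    """Return True if the text starts with a word of four or more uppercase letters."""
    return bool(TITLE_PATTERN.match(text))


candidates = ["WORK EXPERIENCE", "SKILLS", "Work Experience", "AWS", "JSON"]
title_flags = {text: is_section_title(text) for text in candidates}
for text, flag in title_flags.items():
    print(f"{text!r}: {flag}")
```

The check shows two limits of the heuristic: headings in title case ("Work Experience") are not recognized, and an all-caps content entry of four or more letters ("JSON") would be mistaken for a new section title.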


# Function to Extract Name from Text
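This function scans the extracted text line by line and returns the first line that consists of exactly two capitalized words. It therefore assumes the name appears before any other line of that shape, such as a two-word section heading.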

**Input:**
- `text` (str): The extracted text from the resume.

**Output:**
- `name` (str): The extracted name or "Name not found" if no valid name is detected.


In [4]:
def extract_name(text):
    lines = text.splitlines()
    name_pattern = re.compile(r"^[A-Z][a-zA-Z]*\s+[A-Z][a-zA-Z]*$")
    
    for line in lines:
        line = line.strip()
        if name_pattern.match(line):
            return line

    return "Name not found"

name = extract_name(result)
print("Extracted Name:", name)


Extracted Name: Gilbert Adjei
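
Names with a middle name, a hyphen, or an apostrophe do not fit the two-word pattern. A more tolerant variant might look like the sketch below; the pattern, the `extract_name_near_top` helper, and the `max_lines` limit are assumptions for illustration and are not part of the notebook.

```python
import re

# Assumed pattern: two to four capitalized tokens, allowing hyphens, apostrophes, and initials.
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z'\-]+(?:\s+[A-Z][A-Za-z'\-.]*){1,3}$")


def extract_name_near_top(text, max_lines=5):
    """Return the first name-like line among the first few non-empty lines."""
    candidates = [line.strip() for line in text.splitlines() if line.strip()]
    for line in candidates[:max_lines]:
        if NAME_PATTERN.match(line):
            return line
    return "Name not found"


print(extract_name_near_top("Gilbert Adjei \nGilbert is an experienced data professional"))
print(extract_name_near_top("Mary-Jane O'Neil Smith\nSummary"))
```

Restricting the search to the first few lines reduces the chance of returning a section heading, at the cost of missing names that appear further down the page.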


# Extracting Summary/Objective from Resume Text
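This cell extracts the summary or objective from the resume text. It first looks for a line that matches a known summary heading (for example "Summary", "Profile", or "Objective") and captures the lines that follow until the next uppercase section title or a line ending in a period. If no such heading exists, it falls back to the first paragraph after the name line.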

**Input:**
- `text` (str): The extracted text from the resume.

**Output:**
- `summary_text` (str): The extracted summary or objective.


In [5]:
summary_title_pattern = (
    r"^(Summary|Professional Summary|Career Summary|Executive Summary|Summary of Qualifications|"
    r"Profile|Professional Profile|Personal Profile|Career Profile|Personal Summary|Overview|"
    r"Objective|Career Objective|Professional Objective|Statement|Introduction|About Me)\s*$"
)

uppercase_title_pattern = r"^[A-Z]{4,}\b.*"  # Matches lines that start with a word of at least four uppercase letters
name_pattern = r"^[A-Za-z\s]+$"  # Basic pattern to match a name (adjust as needed)

lines = result.splitlines()
summary_text = ""
name_found = False
is_capturing = False

for line in lines:
    if re.match(summary_title_pattern, line, re.IGNORECASE):
        is_capturing = True
        continue
    elif is_capturing:
        if re.match(uppercase_title_pattern, line):
            break
        elif line.strip():
            summary_text += line.strip() + " "
            if line.strip().endswith('.'):
                break
    elif re.match(name_pattern, line) and not name_found:
        name_found = True
    elif name_found and not is_capturing:
        summary_text = line.strip() + " "  # Trailing space keeps the next wrapped line from fusing with this one
        is_capturing = True

print("Extracted Summary/Objective:")
print(summary_text.strip())


Extracted Summary/Objective:
Gilbert is an experienced data professional with 6+ years of experience in data engineering, analytics & software engineering. Career assignments have ranged from building data science solutions for startups to leading project teams with eﬀective communication & teamwork. With this background, he is adept at picking up new skills quickly to deliver robust solutions to the most demanding of businesses.


# Function to Extract Accounts (URLs and Emails) from Resume
This function extracts any URLs or email addresses from the resume text. It uses regular expressions to find and categorize them.

**Input:**
- `text` (str): The extracted text from the resume.

**Output:**
- `accounts` (dict): A dictionary where keys are domains (e.g., "github", "linkedin") and values are lists of matching URLs or emails.


In [6]:
def extract_accounts_from_resume(text):
    # Groups 1-5 capture the URL parts (scheme, "www.", domain, TLD, path); group 6 captures an email address
    pattern = r'(https?://)?(www\.)?([a-zA-Z0-9-]+)(\.[a-zA-Z]+)(/\S*)?|([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
    accounts = {}

    matches = re.findall(pattern, text)
    for match in matches:
        full_url = ''.join(match[:5])  # Includes the TLD so the URL stays usable
        email = match[5]

        if full_url:
            domain_name = match[2]
            if domain_name and len(domain_name) > 1:
                accounts.setdefault(domain_name, []).append(full_url)

        if email:
            email_domain = email.split('@')[1].split('.')[0]
            accounts.setdefault(email_domain, []).append(email)

    return accounts

accounts = extract_accounts_from_resume(result)
print("accounts: ", accounts)


accounts:  {'gmail': ['gilbertadjei800@gmail.com'], 'linkedin': ['linkedin.com/in/gilbert-adjei-900ba2110'], 'github': ['github.com/GilbertAbakahAdjei'], 'medium': ['medium.com/@gilbertadjei800']}
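
The Overview lists a phone number as part of the contact information, but the function above only collects URLs and emails. A phone number could be picked up with a separate pattern, sketched below. The pattern and the `extract_phone_numbers` helper are assumptions, not part of the notebook, and they only cover numbers grouped as 3-3-4 digits with an optional country code.

```python
import re

# Assumed format: optional "+<country code>", then 3-3-4 digits separated by space, dot, or dash.
PHONE_PATTERN = re.compile(
    r"(?<!\w)(?:\+\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)


def extract_phone_numbers(text):
    """Return all substrings that look like phone numbers in the assumed format."""
    return PHONE_PATTERN.findall(text)


sample_text = (
    "John Doe\n"
    "Phone: 123-456-7890\n"
    "Alt: (555) 010-9999\n"
    "09/2021 - 04/2022\n"
    "2014 - 2018"
)
phones = extract_phone_numbers(sample_text)
print(phones)
```

The fixed 3-3-4 grouping keeps date ranges such as `09/2021 - 04/2022` from being reported as phone numbers, but it will miss numbers written in other national formats.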


# Function to Extract Skills from Resume Sections
This cell extracts skills from the resume. It looks for section titles such as "Skills" or "Technical Skills" and keeps every entry in those sections whose font size is smaller than the title's.

**Input:**
- `extracted_sections` (dict): The dictionary of sections extracted from the resume, with each section having associated content and font size.

**Output:**
- `skills` (list): A list of skills extracted from the resume.


In [7]:
skill_section_keywords = [
    "Skills", "Technical Skills", "Core Competencies", "Technical Proficiencies",
    "IT Skills", "Programming Skills", "Software Skills", "Data Skills",
    "Engineering Skills", "Hard Skills", "Technical Abilities", "Core Skills",
    "Key Skills", "Professional Skills", "Skill Set", "Competencies",
    "Expertise", "Areas of Expertise", "Technical Competencies",
    "Technical Expertise", "Proficiencies", "Core Proficiencies",
    "Technical Acumen", "Relevant Skills", "Specialized Skills",
    "Tools and Technologies"
]

skills_pattern = re.compile(r"(" + "|".join(skill_section_keywords) + ")", re.IGNORECASE)

skills = []
for title, content in extracted_sections.items():
    if skills_pattern.search(title):
        skills_items = [item for item in content if item["font_size"] < content[0]["font_size"]]
        skills.extend([item["text"] for item in skills_items])

for skill in skills:
    print(skill)


Python
SQL & MongoDB
DBT
Kubernetes
Spark
Docker
GIT
Snowﬂake
Analytics
Metabase
Airbyte
Airﬂow
AWS
Scala
A.I
Statistics
Power BI
FastAPI


# Function to Extract Education Details from Resume
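This cell finds the education section by matching its title against a list of common headings, then classifies each entry as a degree, an institution, or a date range. Degrees are recognized only by the keywords `Bachelor`, `Master`, `Doctor`, `Associate`, and `PhD`, so a degree written another way is left as `None`. Each matching section produces a single dictionary, so when a section lists several degrees only the last match for each field is kept.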

**Input:**
- `extracted_sections` (dict): The dictionary of sections extracted from the resume.

**Output:**
- `education_list` (list): A list of dictionaries containing education details such as degree, institution, and dates.


In [8]:
education_section_titles = [
    "Education", "Educational Background", "Academic Background",
    "Academic History", "Educational Qualifications", "Academic Qualifications",
    "Academic Experience", "Education and Training", "Formal Education",
    "Scholastic Background", "Scholarly Background", "Education Details",
    "Academic Record", "Academic Credentials", "Education Summary",
    "Educational Achievements", "Education History", "Professional Education",
    "Training and Education"
]

education_pattern = re.compile(r"(" + "|".join(education_section_titles) + ")", re.IGNORECASE)

education_list = []
for title, content in extracted_sections.items():
    if education_pattern.search(title):
        education_entries = [
            item["text"] for item in content if item["font_size"] < content[0]["font_size"]
        ]
        
        degree_pattern = r"(Bachelor|Master|Doctor|Associate|PhD)"
        institution = None
        degree = None
        dates = None

        for entry in education_entries:
            if re.search(degree_pattern, entry):
                degree = entry
            elif "University" in entry or "College" in entry:
                institution = entry
            elif re.search(r"\d{4}", entry):
                dates = entry

        education_list.append({"degree": degree, "institution": institution, "dates": dates})

for edu in education_list:
    print(edu)


{'degree': None, 'institution': 'University of Ghana, Legon.', 'dates': '09/2013 - 07/2017'}
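
The sample output shows `degree` as `None`, and the loop keeps only one record per section. A possible refinement is sketched below: it starts a new record whenever a field would be overwritten and recognizes a few degree abbreviations. The extended keyword list and the `group_education_entries` helper are assumptions for illustration, not part of the notebook.

```python
import re

# Assumed keyword list: the notebook's keywords plus a few common abbreviations.
DEGREE_PATTERN = re.compile(
    r"\b(?:Bachelor|Master|Doctor|Associate)|\b(?:Ph\.?D|B\.?Sc|M\.?Sc|MBA)\b"
)


def group_education_entries(entries):
    """Group classified entries into one record per degree/institution/date set."""
    empty = {"degree": None, "institution": None, "dates": None}
    records, current = [], dict(empty)
    for entry in entries:
        if DEGREE_PATTERN.search(entry):
            field = "degree"
        elif "University" in entry or "College" in entry:
            field = "institution"
        elif re.search(r"\d{4}", entry):
            field = "dates"
        else:
            continue  # Entry fits none of the three fields
        if current[field] is not None:
            # The field is already filled, so this entry starts a new record.
            records.append(current)
            current = dict(empty)
        current[field] = entry
    if any(current.values()):
        records.append(current)
    return records


sample_entries = [
    "BSc. Computer Science", "University of XYZ", "2014 - 2018",
    "MSc. Data Science", "ABC College", "2018 - 2020",
]
education_records = group_education_entries(sample_entries)
for record in education_records:
    print(record)
```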


# Function to Extract Certifications and Awards
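This function collects certifications and awards. It matches section titles against a list of common headings (for example "Certifications", "Licenses", or "Awards") and returns the text of every entry in those sections whose font size is smaller than the title's.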

**Function Name:**
- `extract_certifications`

**Input:**
- `sections` (dict): A dictionary where keys are section titles and values are lists of text elements, each with attributes like `text`, `font_size`, and `font_name`.

**Output:**
- `certifications` (list): A list of certification or award entries extracted from the resume.


In [9]:
import re

def extract_certifications(sections):
    # List of possible names for "Certification" or "Awards" sections
    certification_section_titles = [
        "Certification", "Certifications", "Certification and Licenses", 
        "Licenses and Certifications", "Licenses", "Professional Certifications",
        "Professional Licenses", "Credentials", "Awards", "Honors", 
        "Achievements", "Recognitions", "Accolades"
    ]

    # Compile the regex pattern to match any of these titles
    certification_pattern = re.compile(r"(" + "|".join(certification_section_titles) + ")", re.IGNORECASE)

    # Initialize a list to store all certification entries as strings
    certifications = []

    # Loop through sections to check for certifications and awards
    for title, content in sections.items():
        if certification_pattern.search(title):
            certifications += [
                item["text"] for item in content
                if item["font_size"] < content[0]["font_size"]  # Exclude title by font size
            ]

    return certifications

# Example usage
certifications = extract_certifications(extracted_sections)
print("Extracted Certifications and Awards:")
for certification in certifications:
    print(f"- {certification}")


Extracted Certifications and Awards:
- Winner – AfriHack Data Science Challenge, 2019.
- Higher Calculus & Functions (10/2018 - Present)
- 
- 
- Duke University Data Science Math Skills (11/2018 - Present)
- 
- University of Michigan Intro to Data Science
- (09/2018 - Present)
- 
- 
- IBM Machine Learning with Python (10/2018 - Present)
- 
- 
- IBM Data Science Professional Certiﬁcate (10/2018 - Present)
- 
- 
- Agile Scrum Foundation (08/2017 - Present)
- 
- 
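
The output above includes empty entries, and one date range appears on its own line, separated from its course title. A post-processing step along the following lines could tidy the list. `clean_entries` and the date pattern are hypothetical, and the sketch assumes date ranges are written as `MM/YYYY - MM/YYYY` or `MM/YYYY - Present`, as in the sample.

```python
import re

# Assumed shape of a stand-alone date range, with optional parentheses
DATE_ONLY = re.compile(r"^\(?\d{2}/\d{4}\s*-\s*(?:\d{2}/\d{4}|Present)\)?$")


def clean_entries(entries):
    """Drop empty entries and re-attach date ranges that were split onto their own line."""
    cleaned = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # Skip empty spans
        if DATE_ONLY.match(entry) and cleaned:
            cleaned[-1] = f"{cleaned[-1]} {entry}"  # Merge with the preceding title
        else:
            cleaned.append(entry)
    return cleaned


raw_entries = [
    "Higher Calculus & Functions (10/2018 - Present)",
    "",
    "",
    "University of Michigan Intro to Data Science",
    "(09/2018 - Present)",
    "",
]
cleaned_entries = clean_entries(raw_entries)
for entry in cleaned_entries:
    print(f"- {entry}")
```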


# Function to Extract Projects
This function identifies the "Projects" section in the resume by searching for common project-related section titles, then extracts relevant text content.

**Function Name:**
- `extract_projects`

**Input:**
- `sections` (dict): A dictionary where keys are section titles and values are lists of text elements, each with attributes like `text`, `font_size`, and `font_name`.

**Output:**
- `projects` (list): A list of project entries extracted from the resume.


In [10]:
def extract_projects(sections):
    # List of possible names for the "Projects" section
    project_section_titles = [
        "Projects", "Project Experience", "Professional Projects", "Personal Projects",
        "Key Projects", "Relevant Projects", "Work Projects", "Technical Projects",
        "Major Projects", "Significant Projects", "Project Highlights", "Selected Projects"
    ]

    # Compile the regex pattern to match any of these titles
    project_pattern = re.compile(r"(" + "|".join(project_section_titles) + ")", re.IGNORECASE)

    # Initialize a list to store all project entries as strings
    projects = []

    # Loop through sections to check for projects
    for title, content in sections.items():
        if project_pattern.search(title):
            projects += [
                item["text"] for item in content
                if item["font_size"] < content[0]["font_size"]  # Exclude title by font size
            ]

    return projects

# Example usage
projects = extract_projects(extracted_sections)
print("Extracted Projects:")
for project in projects:
    print(f"- {project}")


Extracted Projects:
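
The skills, certifications, and projects cells all follow the same steps: match a section title, then keep the entries whose font size is smaller than the title's. That shared logic could be factored into one helper, as in the sketch below. `collect_section_items` is a hypothetical name; unlike the notebook cells, this version escapes the titles with `re.escape` and skips empty text.

```python
import re


def collect_section_items(sections, section_titles):
    """Return the text of non-title entries from every section whose title matches."""
    pattern = re.compile("|".join(re.escape(t) for t in section_titles), re.IGNORECASE)
    items = []
    for title, content in sections.items():
        if not content or not pattern.search(title):
            continue
        title_size = content[0]["font_size"]  # The first item is the title itself
        items += [
            item["text"] for item in content
            if item["font_size"] < title_size and item["text"]
        ]
    return items


# Synthetic data shaped like the output of extract_sections_with_font_info
sample_sections = {
    "SKILLS": [
        {"text": "SKILLS", "font_size": 14, "font_name": "Sans-Bold"},
        {"text": "Python", "font_size": 9, "font_name": "Sans"},
        {"text": "", "font_size": 9, "font_name": "Sans"},
        {"text": "SQL & MongoDB", "font_size": 9, "font_name": "Sans"},
    ],
    "EDUCATION": [
        {"text": "EDUCATION", "font_size": 14, "font_name": "Sans-Bold"},
        {"text": "University of XYZ", "font_size": 9, "font_name": "Sans"},
    ],
}
print(collect_section_items(sample_sections, ["Skills", "Technical Skills"]))
print(collect_section_items(sample_sections, ["Projects"]))
```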


# Extract Work Experience from Resume
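This function extracts the work experience section of the resume. It matches section titles against a list of common headings and processes only the first matching section. Within that section it relies on font cues: bold text of size 10 or larger starts a new entry as the job title, the next non-bold text of that size is the company, a line containing a date or "Present" is the date range, and smaller text with more than three words is collected as responsibilities.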

**Input:**
- `sections` (dict): A dictionary of sections extracted from the resume.

**Output:**
- `work_experience` (list): A list of dictionaries with each dictionary containing job title, company, dates, and responsibilities for a work experience entry.
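
The `dates` value is kept as the raw string found in the resume, such as `04/2022 - Present`. If structured dates are needed, the string could be split as in the sketch below. `parse_month_year` and `parse_date_range` are hypothetical helpers, and the sketch assumes the `MM/YYYY - MM/YYYY` or `MM/YYYY - Present` shape; other formats raise a `ValueError`.

```python
import re
from datetime import date


def parse_month_year(token):
    """Convert 'MM/YYYY' to a date on the first of that month; 'Present' becomes None."""
    token = token.strip()
    if token.lower() == "present":
        return None
    month, year = token.split("/")
    return date(int(year), int(month), 1)


def parse_date_range(dates_text):
    """Split a 'start - end' string into a (start, end) tuple."""
    start_text, end_text = re.split(r"\s+-\s+", dates_text.strip(), maxsplit=1)
    return parse_month_year(start_text), parse_month_year(end_text)


print(parse_date_range("04/2022 - Present"))
print(parse_date_range("09/2021 - 04/2022"))
```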


In [11]:
# List of possible names for the "Work Experience" section
work_experience_titles = [
    "Work Experience", "Professional Experience", "Employment History", "Employment Experience",
    "Job History", "Work History", "Career History", "Professional Background",
    "Work Background", "Experience", "Employment", "Relevant Experience"
]

# Compile the regex pattern to match any of these titles
work_experience_pattern = re.compile(r"(" + "|".join(work_experience_titles) + ")", re.IGNORECASE)

def extract_work_experience(sections):
    work_experience = []

    for title, content in sections.items():
        if work_experience_pattern.search(title):
            current_entry = {
                'title': None,
                'company': None,
                'dates': None,
                'responsibilities': []
            }
            collecting_responsibilities = False

            for line in content[1:]:  # Skip the section title
                text = line['text'].strip()
                font_size = line['font_size']
                font_name = line['font_name']

                # Determine if the font is bold
                is_bold = "Bold" in font_name

                # Detect job title (bold text with a font size of at least 10)
                if is_bold and font_size >= 10:
                    # If a job title is already found, save the previous entry and start a new one
                    if current_entry['title']:
                        work_experience.append(current_entry)
                        current_entry = {'title': None, 'company': None, 'dates': None, 'responsibilities': []}
                    
                    current_entry['title'] = text
                    collecting_responsibilities = False

                # Detect company name (font size of at least 10, not bold, first such line after the title)
                elif font_size >= 10 and not is_bold and current_entry['title'] and not current_entry['company']:
                    current_entry['company'] = text
                    collecting_responsibilities = False

                # Detect dates (any line containing MM/YYYY, a four-digit year, or "Present")
                elif re.search(r'\b(\d{2}/\d{4}|\d{4})\b', text) or 'Present' in text:
                    current_entry['dates'] = text
                    collecting_responsibilities = True  # Start collecting responsibilities after date

                # Collect responsibilities (font size below 10 and more than three words)
                elif collecting_responsibilities and font_size < 10:
                    if len(text.split()) > 3:  # Check for more than three words
                        current_entry['responsibilities'].append(text)

            # Append the last work experience entry
            if current_entry['title']:
                work_experience.append(current_entry)
            break  # Break after finding the first matching section

    return work_experience

# Assuming `extracted_sections` is the dictionary output from the PDF processing

# Example usage
work_experience = extract_work_experience(extracted_sections)
print(work_experience)


[{'title': 'Data Engineering Consultant', 'company': 'Entelligent', 'dates': '04/2022 - Present', 'responsibilities': ['Mentored data engineering apprentices & new hires to build new', 'climate risk measures while ensuring data quality and architecture', 'Led the build out of A.I data product and a platform that supports', 'analytics, product development, and product delivery.']}, {'title': 'Lead Data & Analytics Engineer', 'company': "Float (YC W'20)", 'dates': '09/2021 - 04/2022', 'responsibilities': ['San Francisco, US · Remote', 'Led Data Science & Engineering team to build data products +', 'engage in strategic partnerships to solve cashﬂow and working', "capital problems for Africa's SMEs"]}, {'title': 'Senior Data Engineer', 'company': 'SuperFluid Labs, Ltd', 'dates': '04/2020 - 09/2021', 'responsibilities': ['Built a data product that generated credit scores and credit limits', "to be assigned to a Telco's customer base of about 25 million", 'Led data engineering, data governan