# Resume Parser Documentation

## Overview

This notebook provides a Python-based solution to parse a resume PDF document and extract key information, including:

1. **Full Name**
2. **Contact Information** (e.g., email, phone, LinkedIn)
3. **Summary or Objective Statement**
4. **Skills** (as a list)
5. **Work Experience** (including company, job title, dates, and responsibilities)
6. **Education** (degree, institution, dates, and additional information)
7. **Certifications** (if present)
8. **Projects** (if present)

The parser is designed to handle diverse resume formats: it first extracts the raw text from the PDF, then uses regular expressions to locate each section and pull out the relevant fields.
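As an illustration of the regex step, here is a minimal sketch of pulling the contact fields out of extracted text. The pattern names and the `extract_contact_info` helper are hypothetical and not part of the notebook's own cells, and the phone pattern assumes North American style numbers.

```python
import re

# Hypothetical patterns for illustration; real resumes may need more variants.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
# Assumes numbers shaped like 123-456-7890 or (123) 456-7890, with an optional country code.
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
LINKEDIN_RE = re.compile(r"(?:https?://)?(?:www\.)?linkedin\.com/in/[\w-]+", re.IGNORECASE)


def extract_contact_info(text):
    """Return the first email, phone number, and LinkedIn URL found, or None for each."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(0) if match else None

    return {
        "Email": first(EMAIL_RE),
        "Phone": first(PHONE_RE),
        "LinkedIn": first(LINKEDIN_RE),
    }


sample = "John Doe\njohn.doe@example.com | 123-456-7890\nlinkedin.com/in/johndoe"
print(extract_contact_info(sample))
```

Returning `None` for a field that is not found keeps the result shape stable even when a resume omits a phone number or LinkedIn link.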

## Libraries Used

- `PyPDF2`: For extracting text from PDF files.
- `re`: For regular expression matching to locate and extract specific resume sections.

## Assumptions

- Each section (e.g., Skills, Work Experience) is identified by common headers like "Skills" or "Work Experience."
- Contact information, such as **email** and **phone number**, typically appears near the beginning of the resume.
- Not all resumes contain all requested sections, so the parser handles missing information gracefully, returning empty fields for missing sections.
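The header-based assumption above could be sketched roughly as follows. The `SECTION_HEADERS` keywords and the `split_into_sections` helper are hypothetical names chosen for this example; a real resume set would likely need a longer keyword list.

```python
import re

# Hypothetical header keywords; extend these for the resumes you actually see.
SECTION_HEADERS = {
    "Summary": r"summary|objective|profile",
    "Skills": r"skills|technical skills",
    "Work Experience": r"work experience|professional experience|experience",
    "Education": r"education",
    "Certifications": r"certifications?|licenses",
    "Projects": r"projects|personal projects",
}


def split_into_sections(text):
    """Map each known section name to the text found under its header line."""
    # Every section starts empty, so a missing section simply stays empty.
    sections = {name: "" for name in SECTION_HEADERS}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # A line counts as a header only if the whole line matches a keyword.
        header = next(
            (name for name, pattern in SECTION_HEADERS.items()
             if re.fullmatch(pattern, stripped.rstrip(":"), re.IGNORECASE)),
            None,
        )
        if header:
            current = header
        elif current and stripped:
            sections[current] += stripped + "\n"
    return sections


sample = "Jane Roe\nSKILLS\nPython, SQL\nEDUCATION\nBSc Computer Science\n"
print(split_into_sections(sample))
```

Lines that appear before the first recognized header (typically the name and contact block) are skipped here and would be handled by separate extractors.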

## Challenges

- **Format Variations**: Resume layouts vary widely, so this parser uses flexible regex patterns and generic keywords to identify each section.
- **Two-Column Layouts**: PDFs sometimes store text in a non-linear order, especially in multi-column layouts, making extraction challenging. This parser attempts to handle these cases but may require further adjustment for highly complex layouts.
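One possible way to approach the two-column problem, shown only as a sketch: if an extractor can supply each text fragment together with its position, the fragments can be grouped by column before being read top to bottom. The `(x, y, text)` tuple format, the `page_width` parameter, and the `reorder_two_columns` helper are all assumptions made for this example; the notebook's own extraction cell returns plain text without coordinates.

```python
def reorder_two_columns(fragments, page_width):
    """Return fragment texts with the left column read fully before the right one.

    `fragments` is a hypothetical list of (x, y, text) tuples, where x is the left
    edge of the fragment and y grows downward from the top of the page.
    """
    midpoint = page_width / 2
    left = [f for f in fragments if f[0] < midpoint]
    right = [f for f in fragments if f[0] >= midpoint]
    # Within each column, read from top to bottom.
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for _, _, text in ordered]


fragments = [
    (50, 100, "WORK EXPERIENCE"),
    (350, 100, "SKILLS"),
    (50, 120, "Data Engineer"),
    (350, 120, "Python"),
]
print(reorder_two_columns(fragments, page_width=600))
```

Splitting at the page midpoint is a deliberate simplification; layouts with unequal column widths or full-width headers would need a smarter boundary.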

## Instructions for Use

1. Run each code cell in sequence.
2. Set `pdf_file` to the path of the resume PDF before running the extraction cell.
3. The parser will output a structured dictionary with the parsed resume information.

## Expected Output Format

The output is a Python dictionary with keys representing each information type, such as:
```python
{
    "Full Name": "John Doe",
    "Contact Information": {
        "Email": "john.doe@example.com",
        "Phone": "123-456-7890",
        "LinkedIn": "linkedin.com/in/johndoe"
    },
    "Summary": "Experienced data professional...",
    "Skills": ["Python", "SQL", "Machine Learning", "Data Analysis"],
    "Work Experience": [
        {
            "Company": "Example Corp",
            "Job Title": "Data Scientist",
            "Dates": "Jan 2018 - Dec 2020",
            "Responsibilities": "Developed machine learning models..."
        }
    ],
    "Education": [
        {
            "Degree": "BSc. Computer Science",
            "Institution": "University of XYZ",
            "Dates": "2014 - 2018"
        }
    ],
    "Certifications": ["Certified Data Scientist"],
    "Projects": ["Project A: Developed a recommendation system..."]
}
```


In [4]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    """Return the text of every page in the PDF, joined with newlines."""
    page_texts = []
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Collect text from every page, not just the last one
        for page in reader.pages:
            page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)

# Example usage:
pdf_file = "data/Sample Resume for Assessment.pdf"
result = extract_text_from_pdf(pdf_file)
print(result)

Gilbert Adjei 
Gilbert is an experienced data professional with 6+ years of experience in data engineering, analytics & software engineering. Career assignments have
ranged from building data science solutions for startups to leading project teams with eﬀective communication & teamwork. With this background, he
is adept at picking up new skills quickly to deliver robust solutions to the most demanding of businesses. 
gilbertadjei800@gmail.com 
linkedin.com/in/gilbert-adjei-900ba2110 
github.com/GilbertAbakahAdjei 
medium.com/@gilbertadjei800 
WORK EXPERIENCE 
Data Engineering Consultant 
Entelligent 
04/2022 - Present
, 
 
Colorado, US 
Mentored data engineering apprentices & new hires to build new
climate risk measures while ensuring data quality and architecture
initiatives are met. 
Led the build out of A.I data product and a platform that supports
analytics, product development, and product delivery. 
Lead Data & Analytics Engineer 
Float (YC W'20) 
09/2021 - 04/2022
, 
 
San Franc

In [None]:
import re

def extract_name(text):
    """Extracts the name, assuming it is the first line made of exactly two capitalized words."""
    # Split text into lines
    lines = text.splitlines()

    # Regex pattern for a typical name format (two capitalized words)
    name_pattern = re.compile(r"^[A-Z][a-zA-Z]*\s+[A-Z][a-zA-Z]*$")

    # Look for the first line that matches the name pattern
    for line in lines:
        line = line.strip()
        if name_pattern.match(line):
            return line

    return "Name not found"
name = extract_name(result)
print("Extracted Name:", name)

In [None]:
def extract_accounts_from_resume(text):
    """Group URLs and email addresses found in the text by domain name."""
    # The email alternative comes first so an address such as "john.doe@example.com"
    # is not split into URL-like pieces; in the URL alternative, the named group
    # captures the domain label (e.g., "github", "linkedin")
    pattern = re.compile(
        r'(?P<email>[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
        r'|(?:https?://)?(?:www\.)?(?P<domain>[a-zA-Z0-9-]+)\.[a-zA-Z]+(?:/\S*)?'
    )

    # Dictionary to store results, keyed by domain name
    accounts = {}

    for match in pattern.finditer(text):
        email = match.group('email')
        if email:
            # Key emails by the first label of the domain part (e.g., "gmail")
            key = email.split('@')[1].split('.')[0]
        else:
            key = match.group('domain')
            # Skip one-letter "domains" such as the "A" in "A.I"
            if len(key) <= 1:
                continue
        # match.group(0) is the full matched text, including the top-level domain
        accounts.setdefault(key, []).append(match.group(0))

    return accounts


accounts = extract_accounts_from_resume(result)
print("Extracted Accounts:", accounts)