# Resume Parsing with SpaCy

## Introduction

In this notebook, we will explore how to use **SpaCy**, a powerful NLP library in Python, to parse resumes. We'll focus on extracting key information like names, phone numbers, emails, and LinkedIn URLs.

### What is SpaCy?
SpaCy is an open-source library for advanced **Natural Language Processing (NLP)** in Python. It is designed specifically for production use and offers:
- **Pre-trained Models**: SpaCy provides models trained on large datasets to identify entities like names, organizations, and dates.
- **Efficiency and Speed**: Built with performance in mind, SpaCy is one of the fastest NLP libraries.
- **Customizable Pipelines**: You can easily add or modify components like tokenizers, taggers, and entity recognizers to fit your specific needs.
- **Wide Applications**: From text classification and sentiment analysis to information extraction and resume parsing, SpaCy is versatile and powerful.

### Why Use SpaCy for Resume Parsing?
- **Pre-trained NER models** for entity extraction
- **Interactive and customizable** for specialized tasks like resume parsing
- **Fast and efficient** processing for large datasets
- **Easy integration** with other Python libraries and tools

#### Helpful Resources:
- [SpaCy Documentation](https://spacy.io/usage)
- [Named Entity Recognition in SpaCy](https://spacy.io/usage/linguistic-features#named-entities)
- [Custom Components in SpaCy Pipelines](https://spacy.io/usage/processing-pipelines)

---

In [None]:
# Install pdfplumber, SpaCy, and the small English model if you haven't already
!pip install pdfplumber
!pip install spacy
!python3 -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy.matcher import Matcher
import re
import pdfplumber

## Step 1: Load SpaCy Model
We'll start by loading the small English model `en_core_web_sm`, which includes tokenization, POS tagging, and NER.

In [None]:
nlp = spacy.load("en_core_web_sm")

## Step 2: Extract Text from a Resume PDF
Use `pdfplumber` to extract text from the resume PDF.

In [None]:
with pdfplumber.open("examples/jakes-resume.pdf") as pdf:
    # Extract each page once, then skip pages that produced no text
    page_texts = [page.extract_text() for page in pdf.pages]
    resume_text = "\n".join(text for text in page_texts if text)

print(resume_text[:500])  # Displaying the first 500 characters of the resume
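
Text extracted from a PDF often carries stray whitespace (non-breaking spaces, runs of tabs, stacked blank lines) that can confuse later pattern matching. The following is a minimal sketch of a cleanup step, not part of the original notebook; the helper name `normalize_resume_text` and the sample string are made up for illustration.

```python
import re

def normalize_resume_text(text: str) -> str:
    """Collapse whitespace noise that PDF extraction commonly leaves behind."""
    # Replace non-breaking spaces with regular spaces.
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    lines = [line.strip() for line in text.splitlines()]
    # Allow at most one blank line between blocks of text.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()

# Made-up sample; in the notebook you would pass resume_text instead.
sample = "Jake  Ryan\u00a0\n\n\n\n123-456-7890 |\tjake@su.edu  "
print(normalize_resume_text(sample))
```

If you adopt something like this, apply it to `resume_text` before calling `nlp(...)` so that NER and the regex patterns all see the same cleaned text.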

## Step 3: Named Entity Recognition (NER)
SpaCy's NER can recognize entities like names, organizations, and more. Let's see what it can extract from the resume.

### Exercise:
After running the code, try filtering the output for a specific label (such as `ORG`, `PERSON`, or `GPE`) to see how well SpaCy detects each entity type.

In [None]:
doc = nlp(resume_text)

for ent in doc.ents:
    print(f"{ent.label_}: {ent.text}")
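
A flat list of entities is hard to scan, so it can help to group the results by label. The sketch below assumes you feed it `(label, text)` pairs; the helper `group_entities` and the sample pairs are hypothetical stand-ins, used here so the example runs without a loaded model.

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (label, text) pairs into {label: [text, ...]} in first-seen order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Hypothetical pairs standing in for ((ent.label_, ent.text) for ent in doc.ents).
sample_pairs = [
    ("PERSON", "Jake Ryan"),
    ("ORG", "Southwestern University"),
    ("GPE", "Georgetown"),
    ("ORG", "Texas A&M University"),
]

for label, texts in group_entities(sample_pairs).items():
    print(f"{label}: {texts}")
```

With the notebook's `doc`, the call would look like `group_entities((ent.label_, ent.text) for ent in doc.ents)`.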

## Step 4: Custom Pattern Matching with SpaCy's Matcher
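The pre-trained NER labels do not include phone numbers, emails, or LinkedIn URLs, so we match these with custom patterns: plain regular expressions for phone numbers and LinkedIn URLs, and SpaCy's `Matcher` for emails.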

### Example Patterns:
- **Phone Number**: Sequence of digits and optional symbols like `(`, `)`, `-`, and spaces.
- **Email**: Text patterns with `@` symbol.
- **LinkedIn URL**: URLs containing "linkedin.com/in/".

### Exercise:
If a pattern misses the phone number or LinkedIn URL in your resume, try tweaking it to improve detection.

In [None]:
# Regex pattern for phone numbers
phone_regex = r'(\+\d{1,2}\s)?(\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})'
phones = re.findall(phone_regex, resume_text)

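# Regex pattern for LinkedIn URLs. The single capture group holds the path
# (e.g. "in/<handle>"), so re.findall returns the path rather than just "in" or "pub".
linkedin_regex = r'linkedin\.com/((?:in|pub)/[A-Za-z0-9_/-]+)'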
linkedins = re.findall(linkedin_regex, resume_text)

# Extract emails using SpaCy Matcher
matcher = Matcher(nlp.vocab)
email_pattern = [
    # [A-Za-z] for the top-level domain; a "|" inside a character class would match a literal pipe
    {"TEXT": {"REGEX": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"}}
]
matcher.add("EMAIL", [email_pattern])

matches = matcher(doc)

emails = []
for match_id, start, end in matches:
    span = doc[start:end]
    emails.append(span.text)

print(f"Extracted Phones: {[phone[1] for phone in phones]}")
print(f"Extracted Emails: {emails}")
print(f"Extracted LinkedIn URLs: {[f'https://linkedin.com/{linkedin}' for linkedin in linkedins]}")
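
The `phone[1]` indexing above is needed because `re.findall` returns one tuple per match when a pattern has more than one capture group. As a small illustration on made-up sample text, the sketch below contrasts that with `re.finditer`, where `group(0)` gives the full matched string including any country code.

```python
import re

# Same phone pattern as in the cell above; it has two capture groups.
phone_regex = r'(\+\d{1,2}\s)?(\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4})'

# Made-up sample text for illustration.
sample_text = "Call +1 (555) 123-4567 or 555.987.6543"

# findall returns tuples: (country-code group, number group).
# A group that did not take part in the match shows up as "".
tuples = re.findall(phone_regex, sample_text)
print(tuples)

# finditer yields match objects; group(0) is the whole match.
full_matches = [m.group(0) for m in re.finditer(phone_regex, sample_text)]
print(full_matches)
```

Which form to prefer depends on whether you want the country code kept with the number or separated from it.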

## Step 5: Extracting Name Using NER
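SpaCy's NER can often detect the candidate's name under the `PERSON` label. Here we take the first `PERSON` entity, which assumes the candidate's name appears before any other person mentioned in the resume.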

In [None]:
name = ""
for ent in doc.ents:
    if ent.label_ == "PERSON":
        name = ent.text
        break

print(f"Extracted Name: {name}")
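
When NER finds no `PERSON` entity, `name` stays empty. One possible fallback, sketched below, relies on the assumption that many resumes put the candidate's name on the first non-empty line. The helper `guess_name_from_first_line` and its `max_words` threshold are hypothetical, and the heuristic will be wrong for resumes laid out differently.

```python
def guess_name_from_first_line(text, max_words=4):
    """Return the first non-empty line if it looks like a short name, else ''."""
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip leading blank lines
        words = candidate.split()
        # Accept only short lines whose words are alphabetic,
        # allowing periods, apostrophes, and hyphens inside a word.
        looks_like_name = len(words) <= max_words and all(
            w.replace(".", "").replace("'", "").replace("-", "").isalpha()
            for w in words
        )
        # Only the first non-empty line is ever considered.
        return candidate if looks_like_name else ""
    return ""

# Made-up sample text for illustration.
print(guess_name_from_first_line("\n  Jake Ryan \n123-456-7890"))
```

In the notebook this could be combined with the NER result as `name = name or guess_name_from_first_line(resume_text)`.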

## Step 6: Your Turn - Extract Other Entities!
Now it's your turn to extract other entities. Try finding:
1. **Organizations (ORG)**
2. **Locations (GPE)**
3. **Dates (DATE)**

### Exercise:
Modify the code below to extract these entities and see how SpaCy performs.

In [None]:
# Example for extracting organizations
organizations = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
print(f"Extracted Organizations: {organizations}")
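
Entity lists built this way often contain the same organization, place, or date several times. A minimal sketch of order-preserving de-duplication follows; the helper name `unique_in_order` and the sample list are made up for illustration, and comparing case-insensitively is an assumption you may not want for every label.

```python
def unique_in_order(items):
    """Drop duplicates (ignoring case and outer whitespace), keeping first occurrences."""
    seen = set()
    result = []
    for item in items:
        cleaned = item.strip()
        key = cleaned.casefold()
        # Skip empty strings and anything already seen.
        if key and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result

# Made-up sample standing in for the organizations list above.
sample_orgs = [
    "Southwestern University",
    "Texas A&M University",
    "southwestern university ",
    "  ",
]
print(unique_in_order(sample_orgs))
```

The same helper works for any list of strings, for example `unique_in_order(organizations)`.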

## Step 7: Combining All Extracted Information
We'll now compile all extracted details into a structured format.

In [None]:
# Use the first match of each kind, or None when nothing was found,
# so a missing field does not raise an IndexError
parsed_resume = {
    "Name": name,
    "Phone": phones[0][1] if phones else None,
    "Email": emails[0] if emails else None,
    "LinkedIn": f'https://linkedin.com/{linkedins[0]}' if linkedins else None
}

print(parsed_resume)
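
To hand the result to another tool, the dictionary can be serialized to JSON with the standard library. The sketch below uses a hypothetical dictionary with the same keys as `parsed_resume`; the values are invented for illustration.

```python
import json

# Hypothetical parsed result with the same keys as parsed_resume above.
example_parsed = {
    "Name": "Jake Ryan",
    "Phone": "123-456-7890",
    "Email": "jake@su.edu",
    "LinkedIn": "https://linkedin.com/in/jake",
}

# indent=2 makes the output readable; ensure_ascii=False keeps
# non-ASCII characters in names as-is instead of escaping them.
as_json = json.dumps(example_parsed, indent=2, ensure_ascii=False)
print(as_json)

# Round-trip check: loading the JSON gives back the same dictionary.
restored = json.loads(as_json)
```

In the notebook you would pass `parsed_resume` itself to `json.dumps`.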

## Final Thoughts
- **SpaCy** offers powerful tools for both general and custom text extraction.
- You can further enhance this by training a **custom NER model** for more specialized resume parsing.
- Integrate this into larger applications to automate resume screening.

Happy Parsing! 🚀