## **Installing libraries**

In [5]:
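# Install the Python libraries used for OCR, PDF parsing, and NLP
!pip install pytesseract Pillow pdfminer.six tesserocr regex spacy poppler-utils
!pip install transformers torch
# Download the small English spaCy model that is loaded later in the notebook
!python -m spacy download en_core_web_sm
# Install the Tesseract OCR engine (-y answers the confirmation prompt automatically)
!sudo apt-get install -y tesseract-ocr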
# Verify Tesseract installation by checking its version
!tesseract --version

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 2s (2,659 kB/s)
Selecting previously unselected package tesseract-ocr-eng.
(Reading database ... 123599 files and directories currently installed.)
Preparing to unpack .../tesseract-ocr-
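
Before running the extraction cells, it can help to confirm that the system tools are actually reachable from Python. The following is a minimal sketch using only the standard library. The `REQUIRED_TOOLS` tuple and the `find_missing_tools` helper are hypothetical names, and including `pdftotext` (the command provided by the system poppler-utils package) is an assumption, since the cells in this notebook read PDFs with pdfminer.

```python
import shutil

# Hypothetical list of command-line tools to look for. `tesseract` is the OCR
# engine installed above; `pdftotext` is an assumed extra and can be removed.
REQUIRED_TOOLS = ("tesseract", "pdftotext")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```

An empty result only shows that the executables are on `PATH`; it does not check that language data such as the English Tesseract model is installed.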

## **Data Extraction from different formats**
### This code extracts resume data from PDFs or images. It reads text from a PDF with pdfminer or from an image with Tesseract OCR, normalizes whitespace, and uses regular expressions to locate key sections such as personal information, work experience, education, skills, and certifications. spaCy adds the named entities found in the text. The extracted information is then serialized as JSON, which can be further processed or analyzed.

In [13]:
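import re
import json  # Serialize the extracted data
import spacy  # NLP library used for named-entity recognition
import pytesseract  # Python wrapper for the Tesseract OCR engine
from pdfminer.high_level import extract_text  # Text extraction from PDFs
from PIL import Image  # Image loading for OCR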

# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

# Extract resume sections and named entities from a PDF or image file
def extract_resume_data(file_path):
    # Helper: read text from a PDF (pdfminer, no OCR) or an image (Tesseract OCR)
    def ocr_text_extraction(file_path):
        lower_path = file_path.lower()  # Match extensions case-insensitively
        if lower_path.endswith('.pdf'):
            # Works for PDFs with an embedded text layer; scanned PDFs would need OCR
            return extract_text(file_path)
        elif lower_path.endswith(('.png', '.jpg', '.jpeg')):
            # Apply OCR directly to the image
            img = Image.open(file_path)
            return pytesseract.image_to_string(img)
        else:
            raise ValueError("Unsupported file format. Use PDF or image.")

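    # Helper function for text preprocessing
    def preprocess_text(text):
        # Collapse every whitespace run, newlines included, into a single space
        text = re.sub(r'\s+', ' ', text)
        text = text.strip()  # Remove leading/trailing spaces
        return text

    # Helper function to extract sections using regex
    def extract_section(text, keyword):
        # `keyword` is a '|'-separated list of headings. The alternation is not
        # grouped, so `.*?` binds only to the last alternative. Because
        # preprocess_text removes all newlines, the `\n\n` terminator never
        # matches and a successful match runs to the end of the text.
        pattern = rf'({keyword}.*?)(\n\n|\Z)'
        match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
        if match:
            return match.group(1).strip()
        return None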

    # Extract the raw text (pdfminer for PDFs, OCR for images)
    raw_text = ocr_text_extraction(file_path)

    # Preprocess the text
    clean_text = preprocess_text(raw_text)

    # Locate each section by its heading keywords (regex only)
    personal_info = extract_section(clean_text, 'Personal Information|Contact Information')
    work_experience = extract_section(clean_text, 'Work Experience|Professional Experience|Employment')
    education = extract_section(clean_text, 'Education|Academic Background|Qualifications')
    skills = extract_section(clean_text, 'Skills|Technical Skills|Core Competencies')
    certifications = extract_section(clean_text, 'Certifications|Licenses|Accreditations')

    # Run spaCy named-entity recognition over the cleaned text
    doc = nlp(clean_text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Create structured JSON output; sections that were not found become "N/A"
    resume_data = {
        "personal_information": personal_info or "N/A",
        "work_experience": work_experience or "N/A",
        "education": education or "N/A",
        "skills": skills or "N/A",
        "certifications": certifications or "N/A",
        "entities": entities  # (text, label) pairs from spaCy
    }

    # Convert to JSON format
    json_output = json.dumps(resume_data, indent=4)
    return json_output, clean_text

# Example usage
file_path = '/content/Resume in pdf format.pdf'
resume_json, extracted_text = extract_resume_data(file_path)
print(resume_json)


{
    "personal_information": "N/A",
    "work_experience": "employment benefit options. Arranged hospital-wide guest speakers symposia to educate management about new employment laws and workplace confidence and morale building techniques. Administrative tasks. Skills Type 96WPM \u2022 Proficient with Workday \u2022 Team player \u2022 Excellent time management skills \u2022 Conflict Management \u2022 Public Speaking \u2022 Data analytics Education MAY 2012 Bachelor of Arts Human Resources Management/Beachy University, Sunny, Florida Activities Literature \u2022 Environmental conservation \u2022 Art \u2022 Yoga \u2022 Skiing \u2022 Travel",
    "education": "N/A",
    "skills": "N/A",
    "certifications": "N/A",
    "entities": [
        [
            "Janna Gardner",
            "PERSON"
        ],
        [
            "4567",
            "DATE"
        ],
        [
            "Chico",
            "GPE"
        ],
        [
            "Illinois 98052",
            "ORG"
        ],
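
In the output above only `work_experience` is populated, and it runs to the end of the resume. Two properties of the cell explain this. In `rf'({keyword}.*?)'` the `|` alternation has lower precedence than concatenation, so `.*?` attaches only to the last alternative (`Employment`, `Qualifications`, `Core Competencies`, and so on), while the other alternatives must be followed immediately by `\n\n` or the end of the text. In addition, `preprocess_text` replaces every newline with a space, so the `\n\n` terminator can never match. The sketch below is one possible variant, not the notebook's method: it groups the alternatives with `(?:...)` and keeps blank lines as paragraph separators. The `normalize_whitespace` name and the sample text are assumptions made for illustration.

```python
import re


def normalize_whitespace(text):
    """Collapse spaces and tabs but keep blank lines as paragraph breaks."""
    text = re.sub(r'[ \t]+', ' ', text)       # squeeze horizontal whitespace
    text = re.sub(r'\n\s*\n+', '\n\n', text)  # any run of blank lines -> one
    return text.strip()


def extract_section(text, keyword):
    """Return the paragraph that starts at the first heading in `keyword`."""
    # (?:...) groups the '|'-separated headings so `.*?` applies to all of them
    pattern = rf'((?:{keyword}).*?)(?:\n\s*\n|\Z)'
    match = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Hypothetical sample that mimics a resume with blank lines between sections
sample = (
    "Janna Gardner\n\n"
    "Skills\nType 96WPM, Proficient with Workday\n\n"
    "Education\nMAY 2012 Bachelor of Arts\n"
)
clean = normalize_whitespace(sample)
print(extract_section(clean, 'Skills|Technical Skills|Core Competencies'))
print(extract_section(clean, 'Education|Academic Background|Qualifications'))
```

This variant still matches a keyword anywhere in the text, so a word such as "skills" inside a summary paragraph would be picked up before the real heading. Anchoring the headings to the start of a line would be a further refinement.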

## **Extracted Text**

In [14]:
extracted_text

'Janna Gardner 4567 Main Street, Chico, Illinois 98052 (716) 555-0100 j.gardner@live.com www.linkedin.com/in/j.gardner Human Resources Generalist with 6+ years of experience assisting with and fulfilling organization staffing needs and requirements. A proven track record of using my excellent personal, communication and organization skills to lead and improve HR departments, recruit excellent personnel, and improve department efficiencies. Team player with excellent communication skills, high quality of work, driven and highly self-motivated. Strong negotiating skills and business acumen and able to work independently. Experience 2014 – PRESENT Human Resources Generalist/Lamna Healthcare Company, Chico, Illinois Review, update, and revise company hiring practices, vacation, and other human resources policies to ensure compliance with OSHA and all local, state, and federal labor regulations. By creating and maintaining a positive and responsive work environment, raised employee retentio
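
The extracted text carries the contact details inline (phone number, email address) without a "Personal Information" heading, which is why that field came back as "N/A". A minimal sketch of pulling those details out with regular expressions follows. The patterns are assumptions: the phone pattern targets US-style numbers like the one in this resume, the email pattern is deliberately loose, and `extract_contact_details` is a hypothetical helper.

```python
import re

# Loose email pattern: local part, '@', then a dotted domain
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
# US-style phone pattern: optional parentheses around a 3-digit area code
PHONE_RE = re.compile(r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}')


def extract_contact_details(text):
    """Return the first email address and phone number found in `text`."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }


# Opening of the extracted text shown above
sample = (
    "Janna Gardner 4567 Main Street, Chico, Illinois 98052 "
    "(716) 555-0100 j.gardner@live.com www.linkedin.com/in/j.gardner"
)
print(extract_contact_details(sample))
```

The returned dictionary could be stored under `personal_information` in place of the "N/A" fallback.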

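## **Load a pre-trained BERT model for sequence classification**
### This cell loads the BERT tokenizer and model from the 'bert-base-uncased' checkpoint, which can be replaced with a fine-tuned model if available. Because this checkpoint has no trained classification head, the head is randomly initialized (see the warning below) and its predictions are not meaningful until the model is fine-tuned. A text-classification pipeline wraps the model and tokenizer so that text can be classified with a single call.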

In [15]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'  # Replace with your fine-tuned model name if available
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Create a pipeline for classification
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
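
The warning above means the classification head has random weights, and a stock checkpoint reports generic labels such as `LABEL_0` and `LABEL_1` by default. For the section names used later in this notebook to appear in predictions, a fine-tuned model would need a matching label mapping. The following is a sketch under that assumption: `load_section_classifier` and its `checkpoint` argument are hypothetical, no such fine-tuned checkpoint is provided here, and the function is defined but not called.

```python
# The five section names the later cells compare against
SECTION_LABELS = [
    "PERSONAL_INFORMATION",
    "WORK_EXPERIENCE",
    "EDUCATION",
    "SKILLS",
    "CERTIFICATIONS",
]
id2label = dict(enumerate(SECTION_LABELS))
label2id = {label: idx for idx, label in id2label.items()}


def load_section_classifier(checkpoint):
    """Build a pipeline from `checkpoint`, assumed to be fine-tuned on SECTION_LABELS."""
    # Imported lazily so the label mapping can be inspected without transformers
    from transformers import BertForSequenceClassification, BertTokenizer, pipeline

    tokenizer = BertTokenizer.from_pretrained(checkpoint)
    model = BertForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(SECTION_LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return pipeline('text-classification', model=model, tokenizer=tokenizer)


print(id2label)
```

Passing the mapping to a model that has not been fine-tuned would only rename the random predictions; the mapping is useful once the head has been trained on these five classes.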


## **Classifying Sections**
### The `classify_sections` function takes a block of text and a classifier and sorts the text into predefined sections. The text is split on newlines, and each resulting line is classified with the provided classifier. Lines whose label matches one of the expected names (PERSONAL_INFORMATION, WORK_EXPERIENCE, EDUCATION, SKILLS, CERTIFICATIONS) are appended to the corresponding list in a dictionary; lines with any other label are dropped. The function returns that dictionary, with one list per section.

In [10]:
def classify_sections(text, classifier):
    sentences = text.split('\n')  # Treat each line as one sentence; customize as needed

    classified_data = {
        "personal_information": [],
        "work_experience": [],
        "education": [],
        "skills": [],
        "certifications": []
    }

    for sentence in sentences:
        result = classifier(sentence)
        label = result[0]['label']

        # Add sentence to the corresponding section
        if label == "PERSONAL_INFORMATION":
            classified_data["personal_information"].append(sentence)
        elif label == "WORK_EXPERIENCE":
            classified_data["work_experience"].append(sentence)
        elif label == "EDUCATION":
            classified_data["education"].append(sentence)
        elif label == "SKILLS":
            classified_data["skills"].append(sentence)
        elif label == "CERTIFICATIONS":
            classified_data["certifications"].append(sentence)

    return classified_data
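
The if/elif chain above silently drops any line whose label is not one of the five expected names. A table-driven variant makes that visible by collecting unmatched labels. This is a sketch only: `route_by_label`, `LABEL_TO_SECTION`, and `stub_classifier` are hypothetical names, and the stub merely imitates the list-of-dicts shape that a text-classification pipeline returns so the example runs without a model.

```python
LABEL_TO_SECTION = {
    "PERSONAL_INFORMATION": "personal_information",
    "WORK_EXPERIENCE": "work_experience",
    "EDUCATION": "education",
    "SKILLS": "skills",
    "CERTIFICATIONS": "certifications",
}


def route_by_label(lines, classifier, label_to_section=LABEL_TO_SECTION):
    """Group lines by predicted label and keep track of unrecognized labels."""
    classified = {section: [] for section in label_to_section.values()}
    unmatched = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # nothing to classify on a blank line
        label = classifier(line)[0]['label']
        section = label_to_section.get(label)
        if section is None:
            unmatched.append((label, line))
        else:
            classified[section].append(line)
    return classified, unmatched


def stub_classifier(text):
    """Stand-in for a pipeline: keyword rule instead of a trained model."""
    label = "SKILLS" if "skill" in text.lower() else "LABEL_0"
    return [{"label": label, "score": 1.0}]


lines = ["Team player with excellent communication skills", "", "Chico, Illinois"]
classified, unmatched = route_by_label(lines, stub_classifier)
print(classified["skills"])
print(unmatched)
```

Inspecting `unmatched` after a run shows at a glance whether the model is emitting the section names at all.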


## **Installing transformers**
### This cell installs the `transformers` library, reloads the pre-trained BERT model and tokenizer, and rebuilds the text-classification pipeline. It then redefines `classify_sections` with one change: any line longer than 510 tokens is truncated so that, once the [CLS] and [SEP] tokens are added, it fits within BERT's 512-token limit. Each line is classified into one of five categories (personal information, work experience, education, skills, or certifications) and stored under the matching key.

In [11]:
!pip install transformers

import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'  # Replace with your fine-tuned model name if available
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

# Create a pipeline for classification
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

def classify_sections(text, classifier):
    sentences = text.split('\n')  # Treat each line as one sentence; customize as needed

    classified_data = {
        "personal_information": [],
        "work_experience": [],
        "education": [],
        "skills": [],
        "certifications": []
    }

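    for sentence in sentences:
        if not sentence.strip():
            continue  # Skip blank lines; they carry nothing to classify

        # Truncate the sentence if it exceeds the maximum length
        tokens = tokenizer.tokenize(sentence)
        if len(tokens) > 510:  # Leave room for the [CLS] and [SEP] tokens
            tokens = tokens[:510]
            # Rebuild the text only when it was truncated; otherwise the
            # uncased tokenizer would lowercase every stored sentence
            sentence = tokenizer.convert_tokens_to_string(tokens)

        result = classifier(sentence)
        label = result[0]['label']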

        # Add sentence to the corresponding section
        if label == "PERSONAL_INFORMATION":
            classified_data["personal_information"].append(sentence)
        elif label == "WORK_EXPERIENCE":
            classified_data["work_experience"].append(sentence)
        elif label == "EDUCATION":
            classified_data["education"].append(sentence)
        elif label == "SKILLS":
            classified_data["skills"].append(sentence)
        elif label == "CERTIFICATIONS":
            classified_data["certifications"].append(sentence)

    return classified_data



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Data Classification**

In [12]:
classified_data = classify_sections(extracted_text, classifier)
classified_data

{'personal_information': [],
 'work_experience': [],
 'education': [],
 'skills': [],
 'certifications': []}
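
One contributor to this result is the input itself: `extracted_text` has already had its newlines collapsed into spaces, so `text.split('\n')` yields a single very long item instead of individual sentences. A minimal sketch of an alternative splitter follows. It assumes sentences end with `.`, `!`, or `?` followed by whitespace and an uppercase letter or an opening parenthesis, which is a heuristic and not a full sentence tokenizer; `split_sentences` is a hypothetical helper.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when the next chunk looks like a new sentence."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z(])', text)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample in the style of the extracted resume text
sample = "Team player. Strong negotiating skills! Able to work independently."
for sentence in split_sentences(sample):
    print(sentence)
```

The resulting list could be passed to the classifier item by item in place of the newline split. Headings that are not followed by punctuation, such as "Experience 2014", would remain attached to the neighbouring sentence.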