<a href="https://colab.research.google.com/github/PradeepRajan24/Resume_Parser/blob/main/ResumeParser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Install the required libraries

In [2]:
!pip install langchain langchain-community langchain-core langchain_huggingface transformers pdfplumber python-docx pypandoc unstructured accelerate --upgrade --quiet

2. Import the necessary libraries

In [3]:
from huggingface_hub import notebook_login
from langchain_community.document_loaders import PDFPlumberLoader, UnstructuredWordDocumentLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_core.prompts import ChatPromptTemplate
import torch
import json
import os

3. Reading text from PDF or Word files by detecting the file type and using the appropriate document loader

In [4]:
def extract_text_from_document(file_path):
    """Return the text of a .pdf, .docx, or .doc file as a single string."""
    # Choose the loader from the lowercased file extension.
    file_extension = os.path.splitext(file_path)[1].lower()

    if file_extension == '.pdf':
        loader = PDFPlumberLoader(file_path)
    elif file_extension in ['.docx', '.doc']:
        loader = UnstructuredWordDocumentLoader(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_extension}. Only .pdf, .docx, .doc are accepted.")

    # The loader returns a list of Document objects; join their text.
    docs = loader.load()
    text = "\n".join(doc.page_content for doc in docs)
    return text
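
The dispatch above keys on the lowercased file extension. The same check can be sketched as a lookup table, which is handy for validating a path before any loader is constructed. `LOADER_BY_EXTENSION` and `loader_name_for` are hypothetical names introduced for illustration, and the loader classes are stored as plain strings so the sketch runs without LangChain installed.

```python
import os

# Hypothetical lookup table: file extension -> name of the loader class used above.
LOADER_BY_EXTENSION = {
    ".pdf": "PDFPlumberLoader",
    ".docx": "UnstructuredWordDocumentLoader",
    ".doc": "UnstructuredWordDocumentLoader",
}


def loader_name_for(file_path):
    """Return the loader class name for file_path, or raise ValueError."""
    extension = os.path.splitext(file_path)[1].lower()
    try:
        return LOADER_BY_EXTENSION[extension]
    except KeyError:
        supported = ", ".join(sorted(LOADER_BY_EXTENSION))
        raise ValueError(
            f"Unsupported file type: {extension!r}. Supported: {supported}."
        ) from None
```

Because the extension is lowercased first, a path such as `/content/CV.PDF` resolves to the same loader as `/content/CV.pdf`.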

4. Defining the role or behaviour for the LLM

In [5]:
system_prompt="""You are a highly skilled AI resume parser. Your task is to extract all relevant information from the provided resume text and format it into a structured JSON object.
"""

In [6]:
human_prompt = """
**Task:** Extract key information from the following resume text.

**Resume Text:**
{context}


**Instructions:**
Please extract the following information from the resume and structure it into a single JSON object. Ensure the JSON adheres strictly to the specified schema below. Use `null` for any missing or unavailable values.

**JSON Schema Requirements:**

* **Top-level fields:**
    * `first_name` (string): The candidate's first name.
    * `last_name` (string): The candidate's last name.
    * `email` (string): The candidate's email address.
    * `phone` (string): The candidate's phone number.
    * `summary` (string): A concise summary, objective, or personal statement from the resume.
    * `address` (object):
        * `city` (string): The city from the address.
        * `state` (string): The state from the address.
        * `country` (string): The country from the address.
    * `education_history` (array of objects):
        * Each object represents an educational entry and must contain:
            * `name` (string): Name of the institution.
            * `degree` (string): Degree obtained.
            * `from_date` (string): Start date of education (e.g., "MM-DD-YYYY" or "YYYY" or null).
            * `to_date` (string): End date of education (e.g., "MM-DD-YYYY" or "YYYY" or present or null).
    * `work_history` (array of objects):
        * Each object represents a work experience entry and must contain:
            * `company` (string): Name of the company.
            * `title` (string): Job title.
            * `from_date` (string): Start date of employment (e.g., "MM-DD-YYYY" or "YYYY" or null).
            * `to_date` (string): End date of employment (e.g., "MM-DD-YYYY" or "YYYY" or present or null).
            * `description` (string): A detailed description of responsibilities and achievements in that role.
    * `skills` (array of objects):
        * Each object represents a single skill and must contain:
            * `skill` (string): The name of the skill.

**Question:**
Provide the extracted information as a single, valid JSON object following the exact schema described above.
"""
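
The schema in the prompt names nine top-level fields. Once the model's reply has been parsed into a dictionary, a quick check for missing fields can catch replies that drifted from the schema. This is a minimal sketch, not part of the original notebook; `REQUIRED_TOP_LEVEL_FIELDS` and `missing_fields` are hypothetical names.

```python
# Top-level field names copied from the schema in the prompt.
REQUIRED_TOP_LEVEL_FIELDS = (
    "first_name",
    "last_name",
    "email",
    "phone",
    "summary",
    "address",
    "education_history",
    "work_history",
    "skills",
)


def missing_fields(parsed):
    """Return the schema's top-level fields that are absent from the parsed dict."""
    return [name for name in REQUIRED_TOP_LEVEL_FIELDS if name not in parsed]
```

A field whose value is `None` (JSON `null`) counts as present, which matches the prompt's instruction to use `null` for unavailable values.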

5. Specifying the path of the file to read and extracting its text for analysis

In [9]:
pdf_path = "/content/Pradeep_Rajan_2405_CV.pdf"
#docx_path=""
#doc_path=""

In [None]:
context = extract_text_from_document(pdf_path)
print(context)

6. Using a prompt template to combine the system role with the human input for the LLM

In [12]:
template = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", human_prompt),
])

In [13]:
complete_prompt = template.format_messages(context=context)

In [14]:
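# Replace bullet characters, keeping each message's role ("system" or "human")
# so the system prompt is not passed to the LLM as a human message.
complete_prompt = [(msg.type, msg.content.replace("•", " ")) for msg in complete_prompt]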
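
The bullet replacement above is one instance of cleaning extracted text before it reaches the model. A slightly broader cleanup could look like the sketch below. `clean_resume_text` is a hypothetical helper, and the set of bullet glyphs is an assumption about what PDF extraction tends to produce, not something taken from the original notebook.

```python
import re

# Assumed set of bullet glyphs that may appear in extracted resume text.
BULLET_CHARS = "•▪●◦"


def clean_resume_text(text):
    """Replace bullet glyphs with spaces and tidy horizontal whitespace."""
    for bullet in BULLET_CHARS:
        text = text.replace(bullet, " ")
    # Collapse runs of spaces and tabs, but keep newlines since they carry layout.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing spaces on every line.
    return "\n".join(line.strip() for line in text.splitlines())
```

Applying this to the extracted text before formatting the template would leave the prompt wording itself untouched.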


7. Loading the Meta Llama 3 70B Instruct model and its tokenizer from Hugging Face for text generation

In [None]:
notebook_login()

In [None]:
# tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# task = "text-generation"

In [None]:
# Gated model: accept the license on Hugging Face and log in above before loading.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct", device_map="auto")
task = "text-generation"
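
A 70B-parameter model is large, so a back-of-the-envelope estimate of the memory needed just to hold the weights helps decide whether a given runtime can load it. The sketch below assumes the nominal 70 billion parameters suggested by the model name and ignores activations, the KV cache, and framework overhead; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 70e9  # assumption: nominal count taken from the "70B" name

for label, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
    print(f"{label}: ~{weight_memory_gb(NOMINAL_PARAMS, nbytes):.0f} GB")
```

Under these assumptions the weights alone need roughly 140 GB at 16-bit precision, which is why the commented-out 7B model may be the more practical choice on smaller runtimes.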


In [None]:
#pip install hf_xet

In [None]:
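pipe = pipeline(
    task,
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    # Return only the generated continuation, not the prompt followed by it,
    # so the output can be parsed as JSON.
    return_full_text=False,
)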

8. Invoking the LLM to generate a JSON-formatted output based on the full prompt, and then parsing that string into a Python dictionary using json.loads()

In [None]:
llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
json_output = llm.invoke(complete_prompt)

In [None]:
print(json_output)

In [None]:
output_filename = "output.json"

In [None]:
parsed_json = json.loads(json_output)
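
`json.loads` requires the whole string to be valid JSON, but a model's reply may wrap the object in extra prose or Markdown fences. A more tolerant approach, sketched below with only the standard library, scans for the first `{` that begins a parseable object. `extract_first_json_object` is a hypothetical helper, not part of the original notebook.

```python
import json


def extract_first_json_object(text):
    """Parse and return the first JSON object embedded in text.

    Raises ValueError if no parseable object is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("No JSON object found in model output.")
```

With this helper, `parsed_json = extract_first_json_object(json_output)` could stand in for the `json.loads` call when the reply is not pure JSON.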

9. Saving the parsed JSON to an output file, which can then be downloaded

In [None]:
with open(output_filename, 'w', encoding='utf-8') as f:
    json.dump(parsed_json, f, indent=2, ensure_ascii=False)
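
The `ensure_ascii=False` argument matters for resumes that contain non-ASCII characters such as accented names. The short sketch below, using a made-up sample record, shows the difference between the default and the setting used above.

```python
import json

record = {"first_name": "José"}  # hypothetical sample value

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)   # {"first_name": "Jos\u00e9"}
print(readable)  # {"first_name": "José"}
```

Both strings load back to the same dictionary; the setting only affects how readable the saved file is, which is also why the file is opened with `encoding='utf-8'`.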