### **Methodology**

After receiving the task, I researched existing approaches to the problem.
The project can be achieved with the steps below.
1. Unstructured to Structured Data Conversion:
   Resumes, often in formats like PDF or DOCX, contain unstructured data. The initial challenge was to transform this unstructured content into a structured format like JSON.

2. Text Extraction:
   First, extracted the text from the resumes using the PyPDF2 and docx (python-docx) libraries. At this stage the text is still unstructured.
3. Leveraging Generative AI (Llama3):
   Used a generative AI model (Llama3) to extract the structured data; this step also involves prompt engineering.
4. Job Description Processing:
   Applied the same extraction process to the job description.
5. Storage of Extracted Data:
   Stored the JSON extracted from the resumes and the job description in separate files.
6. NLP-based Similarity Check:
   Used an NLP model (spaCy) to compute the similarity between the two JSON files.
7. Ranking:
   Ranked the resumes according to their spaCy similarity scores.
8. Reinforcement Learning:
   A Q-learning-based reinforcement learning approach serves as the feedback mechanism.

<img src="flow_chart_nebula.png" alt="Flow Chart" width="300" height="500" />
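
Step 8 is not shown in the cells below, so here is a minimal sketch of the tabular Q-learning update it refers to. The state name, the action names, the reward convention (+1 for positive recruiter feedback) and the values of `alpha` and `gamma` are all hypothetical choices made for illustration; they are not taken from the project.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Best value reachable from the next state under the current estimates
    best_next = max(q_table[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate reward plus discounted future value
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Hypothetical actions: which resume field to weight more heavily next time
ACTIONS = ["boost_skills", "boost_experience", "boost_qualification"]
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# Hypothetical feedback: the recruiter liked the ranking (reward = +1)
new_value = q_update(q_table, "ranking_shown", "boost_skills",
                     reward=1.0, next_state="ranking_shown", actions=ACTIONS)
print(new_value)
```

Starting from an all-zero table, a single positive reward moves the chosen action's value a fraction `alpha` of the way towards the reward, so repeated feedback gradually favours the actions that recruiters respond to.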


### **Importing necessary libraries**

In [1]:
from groq import Groq
from PyPDF2 import PdfReader
from docx import Document
import re
import spacy
import json
import pandas as pd

The process for generating the API key is described in the README file.

In [2]:
api_key = "paste_your_api_key"
client = Groq(api_key=api_key)

### **Text Extraction**

In [3]:
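def extract_from_pdf(file):
    """Return the concatenated text of every page in a PDF file."""
    pdf_text = ""
    pdf_reader = PdfReader(file)
    for page in pdf_reader.pages:
        # Fall back to an empty string if a page yields no text
        pdf_text += page.extract_text() or ""
    return pdf_text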

In [22]:
extract_from_pdf("sample_resumes/software-engineer-resume-example.pdf")

'C H A R L E S  M C T U R L A N D\nS O F T W A R E  E N G I N E E R\nC O N T A C T\ncmcturland@email.com\n(123) 456-7890\nNew York, NY\nLinkedIn\nE D U C A T I O N\nB.S.\nComputer Science\nUniversity of Pittsburgh\nSeptember 2008 - April 2012\nPittsburgh, PA\nS K I L L S\nPython (Django)\nJavascript (NodeJS ReactJS,\njQuery)\nSQL (MySQL, PostgreSQL,\nNoSQL)\nHTML5/CSS\nAWS\nUnix, GitW O R K  E X P E R I E N C E\nSoftware Engineer\nEmbark\nJanuary 2015 - current/New York, NY\nWorked with product managers to re-architect a multi-page web\napp into a single page web-app, boosting yearly revenue by $1.4M\nConstructed the logic for a streamlined ad-serving platform that\nscaled to our 35M users, which improved the page speed by 15%\nafter implementation\nTested software for bugs and operating speed, ﬁxing bugs and\ndocumenting processes to increase efﬁciency by 18%\nIterated platform for college admissions, collaborating with a group\nof 4 engineers to create features across the software\nS

In [5]:
def extract_from_doc(file):
    doc = Document(file)
    doc_text = ""
    for paragraph in doc.paragraphs:
        doc_text += paragraph.text + "\n"
    return doc_text

In [7]:
extract_from_doc("graphic.docx")

'David Lee\ndavidlee@example.com\n\nExperience:\n- Graphic Designer at CreativeStudio (2019 - Present)\n  - Designed marketing materials, including brochures, flyers, and social media graphics.\n  - Collaborated with clients to understand their design needs.\n  - Created logos and branding materials.\n\nEducation:\n- Bachelor of Fine Arts in Graphic Design, Art Institute (2015 - 2019)\n\nSkills:\n- Tools: Adobe Photoshop, Illustrator, InDesign\n- Design: Branding, Print Design, Digital Design\n'
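
Since resumes arrive as either PDF or DOCX, the two extractors above could be wrapped in a small dispatcher that picks one by file extension. This is only a sketch: `extract_text` and `demo_extractors` are hypothetical names, and the stand-in lambdas exist so the block runs on its own. In the notebook the mapping would point at `extract_from_pdf` and `extract_from_doc` instead.

```python
from pathlib import Path


def extract_text(path, extractors):
    """Pick an extractor by file extension and return the extracted text."""
    suffix = Path(path).suffix.lower()
    if suffix not in extractors:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return extractors[suffix](path)


# Stand-in extractors (hypothetical) so that the sketch is self-contained.
# In the notebook: {".pdf": extract_from_pdf, ".docx": extract_from_doc}
demo_extractors = {
    ".pdf": lambda p: f"pdf text from {p}",
    ".docx": lambda p: f"docx text from {p}",
}

print(extract_text("graphic.docx", demo_extractors))
```

Raising on unknown extensions keeps unsupported files from silently producing empty resumes further down the pipeline.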

### **Prompt Engineering**

In [8]:
def get_prompt():
    return '''You are the HR of a company. Your task is to shortlist the resumes based on the required role description. Now extract the following details from the resume in the exact JSON format below.
    Do not include any other text except the JSON. If a field is missing or unknown, use an empty string ("").

    Expected JSON format:
    {
        "Name": "",
        "email_id": "",
        "mob_number": "",
        "qualification": "",
        "experience": "",
        "skills": "",
        "certification": "",
        "achievement": ""
    }

    Resume Text:
    ============
    '''



### **Extracting the Structured Data using Llama model**

In [9]:

def extract_resume_data(text):
    prompt = get_prompt() + text

    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}]
    )

    content = response.choices[0].message.content

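    # Extract the JSON object from the response using a regex
    json_pattern = r'\{.*\}'
    match = re.search(json_pattern, content, re.DOTALL)
    if match is None:
        # The model returned no JSON object; fail with a clear message
        raise ValueError("No JSON object found in the model response")

    json_content = match.group(0)
    data = json.loads(json_content)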

    # Return both the DataFrame and JSON data
    return pd.DataFrame(data.items(), columns=["Entities", "Value"]), data


In [23]:
extract_resume_data(extract_from_pdf("sample_resumes/software-engineer-resume-example.pdf"))

(        Entities                                              Value
 0           Name                                  CHARLES MCTURLAND
 1       email_id                               cmcturland@email.com
 2     mob_number                                     (123) 456-7890
 3  qualification                           B.S. in Computer Science
 4     experience                                           7+ years
 5         skills  Python (Django), Javascript (NodeJS ReactJS, j...
 6  certification                                                   
 7    achievement  Boosting yearly revenue by $1.4M, improving pa...,
 {'Name': 'CHARLES MCTURLAND',
  'email_id': 'cmcturland@email.com',
  'mob_number': '(123) 456-7890',
  'qualification': 'B.S. in Computer Science',
  'experience': '7+ years',
  'skills': 'Python (Django), Javascript (NodeJS ReactJS, jQuery), SQL (MySQL, PostgreSQL, NoSQL), HTML5/CSS, AWS, Unix, Git',
  'certification': '',
  'achievement': 'Boosting yearly revenue by $1.4M

In [12]:
def get_job_description_prompt():
    return '''Extract the following details from the job description in the exact JSON format below.
    Do not include any other text except the JSON. If a field is missing or unknown, use an empty string ("").

    Expected JSON format:
    {
        "required_skills": "",
        "experience": "",
        "qualification": "",
        "certification": ""
    }

    Job Description:
    ============

    '''


In [13]:
def extract_job_description(text):
    prompt = get_job_description_prompt() + text

    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}]
    )

    content = response.choices[0].message.content
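    # Extract the JSON object from the response using a regex
    json_pattern = r'\{.*\}'
    match = re.search(json_pattern, content, re.DOTALL)
    if match is None:
        # The model returned no JSON object; fail with a clear message
        raise ValueError("No JSON object found in the model response")

    json_content = match.group(0)
    data = json.loads(json_content)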


    return pd.DataFrame(data.items(), columns=["Entities", "Value"]), data
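
Both extraction functions locate the JSON object with the same regex before calling `json.loads`. One possible refactoring, sketched below, is a shared helper that can also be tested without calling the API. `parse_json_block` is a hypothetical name and `sample_reply` is made-up text standing in for a model response.

```python
import json
import re


def parse_json_block(content):
    """Return the first-'{'-to-last-'}' span of a reply parsed as JSON."""
    # re.DOTALL lets '.' match newlines, so multi-line JSON is captured
    match = re.search(r'\{.*\}', content, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in the model response")
    return json.loads(match.group(0))


# Made-up reply with extra text around the JSON object
sample_reply = 'Here is the result:\n{"experience": "5+ years", "certification": ""}'
parsed = parse_json_block(sample_reply)
print(parsed["experience"])
```

The greedy pattern spans from the first `{` to the last `}`, so it tolerates leading or trailing chatter but would still fail in `json.loads` if the reply contained two separate JSON objects.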

In [14]:
def clean_text(text):
    # Replace multiple spaces or newlines with a single space
    cleaned_text = re.sub(r'\s+', ' ', text)
    cleaned_text = cleaned_text.strip()
    return cleaned_text

In [15]:
jd = "We are looking for a highly skilled Software Engineer with experience in full-stack development. The ideal candidate will be responsible for building and maintaining web applications, working across both front-end and back-end technologies, and collaborating with cross-functional teams to deliver high-quality solutions. with knowledge of mysql also , with 5+ years of experience. candidate  must have done bachelors"

In [16]:
clean_text(jd)

'We are looking for a highly skilled Software Engineer with experience in full-stack development. The ideal candidate will be responsible for building and maintaining web applications, working across both front-end and back-end technologies, and collaborating with cross-functional teams to deliver high-quality solutions. with knowledge of mysql also , with 5+ years of experience. candidate must have done bachelors'

In [17]:
extract_job_description(jd)

(          Entities                          Value
 0  required_skills  full-stack development, mysql
 1       experience                       5+ years
 2    qualification                      bachelors
 3    certification                               ,
 {'required_skills': 'full-stack development, mysql',
  'experience': '5+ years',
  'qualification': 'bachelors',
  'certification': ''})

### **spaCy Language Model**

In [18]:
nlp = spacy.load("en_core_web_md")

`resume_data.json` and `job_description.json` hold the structured data extracted from the resumes and the job description. These files are produced by the working Streamlit application; at present they contain default details. To update them, run the Streamlit application.
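
For reference, the sketch below writes two files in the shape that `rank_resumes` appears to expect, inferred from the keys it reads (`file_name`, `resume_data`, `required_skills`, `experience`, `qualification`). The resume entry is a hypothetical example, and the files go to a temporary directory so that the real ones are not overwritten.

```python
import json
import tempfile
from pathlib import Path

# resume_data.json: a list with one entry per resume (hypothetical example)
resumes = [
    {
        "file_name": "example_resume.pdf",
        "resume_data": {
            "skills": "Python, SQL",
            "experience": "3 years",
            "qualification": "B.S. in Computer Science",
        },
    }
]

# job_description.json: a single object with the extracted requirements
job_description = {
    "required_skills": "full-stack development, mysql",
    "experience": "5+ years",
    "qualification": "bachelors",
    "certification": "",
}

# Write to a temporary directory to avoid touching the real files
out_dir = Path(tempfile.mkdtemp())
resume_path = out_dir / "resume_data.json"
job_path = out_dir / "job_description.json"
resume_path.write_text(json.dumps(resumes, indent=2))
job_path.write_text(json.dumps(job_description, indent=2))
print(resume_path, job_path)
```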

In [19]:
def rank_resumes(resume_file="resume_data.json", job_desc_file="job_description.json"):
    
    with open(resume_file, "r") as rf:
        resume_data = json.load(rf)

    with open(job_desc_file, "r") as jf:
        job_desc_data = json.load(jf)

    # Extract relevant fields from the job description
    job_skills = job_desc_data.get("required_skills", "")
    job_experience = job_desc_data.get("experience", "")
    job_qualification = job_desc_data.get("qualification", "")

    # Convert job description fields to spaCy docs
    job_doc = nlp(f"Skills: {job_skills}, Experience: {job_experience}, Qualification: {job_qualification}")

    ranked_resumes = []

    # Iterate through each resume and calculate similarity score
    for resume in resume_data:
        resume_name = resume["file_name"]
        resume_skills = resume["resume_data"].get("skills", "")
        resume_experience = resume["resume_data"].get("experience", "")
        resume_qualification = resume["resume_data"].get("qualification", "")

        resume_doc = nlp(f"Skills: {resume_skills}, Experience: {resume_experience}, Qualification: {resume_qualification}")

        # Calculate similarity between resume and job description
        similarity_score = job_doc.similarity(resume_doc)
        ranked_resumes.append({
            "file_name": resume_name,
            "similarity_score": similarity_score,
            "skills": resume_skills,
            "experience": resume_experience,
            "qualification": resume_qualification
        })

    # Sort resumes based on similarity score
    ranked_resumes = sorted(ranked_resumes, key=lambda x: x["similarity_score"], reverse=True)

    # Return ranked resumes
    return ranked_resumes

In [20]:
ranked_resumes = rank_resumes("resume_data.json", "job_description.json")
for rank, resume in enumerate(ranked_resumes, 1):
    print(f"**Rank {rank}: {resume['file_name']}**")
    print(f"**Similarity Score:** {resume['similarity_score']:.2f}")

**Rank 1: marketing specialist.pdf**
**Similarity Score:** 0.87
**Rank 2: Dasariraju_Deepak_BMU.pdf**
**Similarity Score:** 0.85
**Rank 3: designer.pdf**
**Similarity Score:** 0.80
**Rank 4: Teacher.pdf**
**Similarity Score:** 0.78
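
To help interpret these scores: by default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, and for models with word vectors such as `en_core_web_md` a document vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python on toy 3-dimensional vectors. The vectors are invented for illustration; real spaCy vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy token vectors: the two "documents" share one token out of two
job_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
resume_tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

score = cosine_similarity(mean_vector(job_tokens), mean_vector(resume_tokens))
print(f"{score:.2f}")
```

Because every document built in `rank_resumes` shares the words "Skills", "Experience" and "Qualification", the averaged vectors may start out fairly close to one another, which could be one reason the scores above sit in a narrow range.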


To run the application, execute `streamlit run app.py` in a terminal.