
# Setup Langchain

In [1]:
!pip install pdfminer.six



In [2]:
!pip install langchain



# Mounting Google Drive in Colab

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Reading Text

In [4]:
# Path to the target PDF file
pdf_file = "/content/drive/MyDrive/CV_extraction_Reddot/CM_ML/Anirudh_Sarda_Resume.pdf"


In [5]:
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

# Using PDFMiner, we extract the PDF as HTML.
# This retains the formatting information of the original PDF,
# so we can later inspect features such as font size and style.
loader = PDFMinerPDFasHTMLLoader(pdf_file)

# load() returns a list of documents. Since we load a single PDF,
# we take the first item of the list (index 0).
data = loader.load()[0]
# print(data)

In [6]:
from bs4 import BeautifulSoup

# We use bs4 (BeautifulSoup) to parse the extracted HTML
soup = BeautifulSoup(data.page_content, 'html.parser')
# All of the CV's text (title, subsections, body) sits inside <div> elements,
# so we collect every <div>. Other parts of the page, such as <html> and <title>,
# are ignored this way.

total_content = soup.find_all('div')
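
For orientation, the HTML that PDFMiner emits typically wraps each run of text in a `<span>` whose `style` attribute carries the font family and size. The sketch below shows how a font size can be read from such a style string with a regular expression. The sample string and the helper name `font_size_from_style` are illustrative assumptions and are not taken from the CV.

```python
import re

# Matches e.g. "font-size:24px" and captures the digits
FONT_SIZE_RE = re.compile(r"font-size:(\d+)px")

def font_size_from_style(style):
    """Return the font size in pixels from a CSS style string, or None."""
    match = FONT_SIZE_RE.search(style or "")
    return int(match.group(1)) if match else None

# Hypothetical style value in the shape PDFMiner's HTML output usually has
sample_style = "font-family: Helvetica-Bold; font-size:24px"
print(font_size_from_style(sample_style))
```

Returning `None` for a missing or size-less style lets the caller skip that element, which is the role the `continue` statements play in the loop further down.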

The following cell walks over the extracted `<div>` elements and merges consecutive elements that share a font size into text snippets. Each snippet is stored as a `(text, font_size)` tuple.

In [7]:
# importing regular expression for pattern matching purposes
import re

current_font_size = None

# current_text contains the text written inside our currently processing tag
current_text = ''

# for collecting all snippets that have the same font size
snippets = []
for content in total_content:
    # Find the first <span> inside the current <div>; skip the <div> if it has none
    span_tag = content.find('span')
    if not span_tag:
        continue

    # Read the span's 'style' attribute; skip the <div> if it is missing
    style_tag = span_tag.get('style')
    if not style_tag:
        continue

    # Look for 'font-size:' followed by digits (\d+) and 'px' (pixels) in the style;
    # skip the <div> if no font size is given
    font_attribute = re.findall(r'font-size:(\d+)px', style_tag)
    if not font_attribute:
        continue

    # Convert the first match to an integer font size
    font_attribute = int(font_attribute[0])

    if not current_font_size:
        # First font size encountered in the document
        current_font_size = font_attribute

    if font_attribute == current_font_size:
        # Same font size: this text belongs to the current snippet
        current_text += content.text
    else:
        # The font size changed, so a new snippet starts. Store the finished snippet
        # as a (current_text, current_font_size) tuple, then start the new one.
        snippets.append((current_text, current_font_size))
        current_font_size = font_attribute
        current_text = content.text

# The loop only appends a snippet when the font size changes,
# so the last snippet has to be appended manually after the loop
snippets.append((current_text, current_font_size))
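
The grouping step above can also be expressed with `itertools.groupby`, which merges consecutive items that share a key. This is a sketch of an alternative formulation, not the notebook's own code; it assumes the font sizes have already been extracted into `(text, font_size)` pairs, and the sample data and the name `group_by_font_size` are made up for illustration.

```python
from itertools import groupby

def group_by_font_size(runs):
    """Merge consecutive (text, font_size) runs that share a font size."""
    return [
        ("".join(text for text, _ in group), size)
        for size, group in groupby(runs, key=lambda run: run[1])
    ]

# Hypothetical runs, one per <div>, in document order
sample_runs = [
    ("JANE DOE\n", 24),
    ("EDUCATION\n", 14),
    ("B.Sc in ", 10),
    ("Computer Science\n", 10),
    ("EXPERIENCE\n", 14),
]
grouped = group_by_font_size(sample_runs)
for text, size in grouped:
    print(size, repr(text))
```

Because `groupby` only merges adjacent items, the two size-14 headings stay separate snippets, which matches the behaviour of the explicit loop.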

The next cell organizes these snippets into semantic sections based on font size, on the assumption that a heading has a larger font size than its content. It builds a list, **semantic_snippets**, in which each section is stored as a separate `Document`: the heading goes into the metadata and the section text into the page content.

In [8]:
from langchain.docstore.document import Document
# The Document class represents a piece of content together with its metadata.

# Index of the current (most recent) section in semantic_snippets
cur_idx = -1

# Holds the organized sections, formed by grouping snippets
# according to their relative font size in the document
semantic_snippets = []

# Assumption: headings have a larger font size than their respective content
for snippet in snippets:
    # Each snippet is a (text, font_size) tuple.
    # If the snippet's font size > the current section's heading font => it is a new heading
    if not semantic_snippets or snippet[1] > semantic_snippets[cur_idx].metadata['heading_font']:
        # Store the heading text (snippet[0]) and its font size (snippet[1]);
        # content_font starts at 0, meaning "not set yet"
        metadata = {'heading': snippet[0], 'content_font': 0, 'heading_font': snippet[1]}

        # Copy the metadata of the loaded PDF document (`data`), e.g. its source
        metadata.update(data.metadata)

        semantic_snippets.append(Document(page_content='', metadata=metadata))
        cur_idx += 1
        continue

    # If the section's content font is not set yet, or the snippet's font size
    # <= the section's content font, the snippet belongs to the same section
    if not semantic_snippets[cur_idx].metadata['content_font'] \
            or snippet[1] <= semantic_snippets[cur_idx].metadata['content_font']:
        semantic_snippets[cur_idx].page_content += snippet[0]
        semantic_snippets[cur_idx].metadata['content_font'] \
            = max(snippet[1], semantic_snippets[cur_idx].metadata['content_font'])
        continue

    else:
        # Otherwise the snippet's font size is larger than the section's content font
        # but not larger than its heading font, so it also starts a new section
        metadata = {'heading': snippet[0], 'content_font': 0, 'heading_font': snippet[1]}
        metadata.update(data.metadata)
        semantic_snippets.append(Document(page_content='', metadata=metadata))
        cur_idx += 1
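
To make the three branches easier to follow, here is a minimal sketch of the same rule using plain dictionaries instead of `Document` objects. The dictionary layout, the function name `build_sections`, and the sample snippets are assumptions made for illustration only.

```python
def build_sections(snippets):
    """Group (text, font_size) snippets into sections by relative font size."""
    sections = []

    def new_section(text, size):
        # content_font == 0 means the section has no content yet
        return {"heading": text, "heading_font": size, "content": "", "content_font": 0}

    for text, size in snippets:
        current = sections[-1] if sections else None
        if current is None or size > current["heading_font"]:
            # Larger than the current heading: start a new section
            sections.append(new_section(text, size))
        elif not current["content_font"] or size <= current["content_font"]:
            # No content yet, or not larger than the content font: same section
            current["content"] += text
            current["content_font"] = max(size, current["content_font"])
        else:
            # Between content font and heading font: also a new section
            sections.append(new_section(text, size))
    return sections

# Hypothetical snippets chosen so that every branch is exercised
sample_snippets = [
    ("Intro\n", 14), ("body a\n", 10),
    ("Skills\n", 12), ("body b\n", 10),
    ("BIG TITLE\n", 20), ("tail\n", 9),
]
sections = build_sections(sample_snippets)
for section in sections:
    print(repr(section["heading"]), repr(section["content"]))
```

Note that each snippet is compared only with the most recent section, and the first snippet after a heading is always taken as content. A document that opens with its largest heading can therefore collect smaller headings into that first section's content.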

In [35]:
# Concatenate the text of every snippet once to rebuild the full CV text
cv_text = ""
for text, _font_size in snippets:
    cv_text += text
print(cv_text)

ANIRUDH SARDA
CS Graduate | Brac University
@ anirudhsarda20@gmail.com
Ł anirudhsarda20
¥ Sarda20
(cid:129) +8801746520929
( Dhaka, Bangladesh
EDUCATION
B.Sc in Computer Science
Brac University
x CGPA - 3.74/4.00
A’Level
Scholars International School
x GPA - 5.00/5.00
O’Level
Scholars International School
x GPA - 5.00/5.00
EXPERIENCE
( 2017 – 2021
( 2013 – 2015
( 2001 – 2013
Trainee-IT(Server and Systems)
DARAZ Bangladesh
x August 2022 – Ongoing
( Dhaka, BD
• Deployment of the system and evaluation of system performance issues
• Support the IT team in the maintenance of hardware and software
• Research unusual bugs or issues the company encounters
Student Tutor
Brac University
x Sep 2020 – Dec 2020
( Dhaka, BD
• Assist in Introduction to Computer Science Lab
• Mentor students and evaluate Assignments
PROJECTS
• E-Commerce Website [ReactJS, Firebase, Boot-
strap, Mongodb, Node.js]
Product details are fetched from our own
data created in Mongodb and after adding
products to the cart the 

# Applicant Name

In [10]:
import os

def get_file_name(file_path):
    # Return the last component of the path, e.g. the PDF's file name
    return os.path.basename(file_path)

In [11]:
applicant_name = ''

if semantic_snippets:
    # Sort the sections by heading font size, largest first
    largest_headings = sorted(semantic_snippets, key=lambda x: x.metadata['heading_font'], reverse=True)
    largest_heading_font = largest_headings[0].metadata['heading_font']
    # Keep only the headings that share the largest font size
    largest_headings = [heading for heading in largest_headings if heading.metadata['heading_font'] == largest_heading_font]

    # print("Largest Headings:")
    # for heading in largest_headings:
    #     print(heading.metadata['heading'])

    # The first of these headings is taken as the applicant's name
    applicant_name = largest_headings[0].metadata['heading']
else:
    # No headings found: fall back to the PDF's file name
    applicant_name = get_file_name(pdf_file)

print(f'Applicant Name: {applicant_name}')


Applicant Name: ANIRUDH SARDA
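
The same heuristic (largest heading wins, file name as fallback) can be sketched more compactly with `max`, which returns the first item with the largest key. The dictionaries below are hypothetical stand-ins for the `Document` metadata, and `pick_applicant_name` is an illustrative name.

```python
import os

def pick_applicant_name(sections, pdf_path):
    """Return the heading with the largest font, or the file name if there is none."""
    if not sections:
        return os.path.basename(pdf_path)
    largest = max(sections, key=lambda section: section["heading_font"])
    # Headings carry a trailing newline from the HTML, so strip it
    return largest["heading"].strip()

# Hypothetical sections and path, for illustration only
sample_sections = [
    {"heading": "JANE DOE\n", "heading_font": 24},
    {"heading": "EDUCATION\n", "heading_font": 14},
]
print(pick_applicant_name(sample_sections, "/tmp/jane_doe_cv.pdf"))
print(pick_applicant_name([], "/tmp/jane_doe_cv.pdf"))
```

This relies on the assumption that the name is set in the largest font on the page, which holds for this CV but may not hold for every layout.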



# Section Title Extraction

In [30]:
# List to store all the extracted headings
# all_headings = []

# # Loop through the semantic_snippets list to extract headings
# for doc in semantic_snippets:
#     heading = doc.metadata.get('heading', None)  # Get the heading from the metadata
#     if heading:
#         all_headings.append(heading)

# # Now all_headings contains all the extracted headings
# print(all_headings)

In [39]:
section_titles_with_content = []  # For storing titles and their corresponding content

if len(semantic_snippets) >= 2:
    sorted_snippets = sorted(semantic_snippets, key=lambda x: x.metadata['heading_font'], reverse=True)
    second_largest_heading_font = sorted_snippets[1].metadata['heading_font']
    second_largest_headings = [snippet for snippet in sorted_snippets if snippet.metadata['heading_font'] == second_largest_heading_font]

    # Extract the indexes and headings of the second-largest headings
    second_largest_headings_info = []
    for heading in second_largest_headings:
        index = snippets.index((heading.metadata['heading'], heading.metadata['heading_font']))
        second_largest_headings_info.append((index, heading.metadata['heading']))

    # Sort the extracted heading info based on the index to preserve order
    second_largest_headings_info.sort(key=lambda x: x[0])

    # Extract the content for each section
    for idx, heading_text in second_largest_headings_info:
        if idx + 1 < len(snippets):
            next_idx, next_heading_text = second_largest_headings_info[idx + 1] if idx + 1 < len(second_largest_headings_info) else (len(snippets), '')
            content = ''
            # Append the content from the current heading index to the next heading index
            for i in range(idx + 1, next_idx):
                content += snippets[i][0]

            section_titles_with_content.append((heading_text.lower(), content))

In [40]:
section_titles_with_content

[('education\n',
  'B.Sc in Computer Science\nBrac University\nx CGPA - 3.74/4.00\nA’Level\nScholars International School\nx GPA - 5.00/5.00\nO’Level\nScholars International School\nx GPA - 5.00/5.00\nEXPERIENCE\n( 2017 – 2021\n( 2013 – 2015\n( 2001 – 2013\nTrainee-IT(Server and Systems)\nDARAZ Bangladesh\nx August 2022 – Ongoing\n( Dhaka, BD\n• Deployment of the system and evaluation of system performance issues\n• Support the IT team in the maintenance of hardware and software\n• Research unusual bugs or issues the company encounters\nStudent Tutor\nBrac University\nx Sep 2020 – Dec 2020\n( Dhaka, BD\n• Assist in Introduction to Computer Science Lab\n• Mentor students and evaluate Assignments\nPROJECTS\n• E-Commerce Website [ReactJS, Firebase, Boot-\nstrap, Mongodb, Node.js]\nProduct details are fetched from our own\ndata created in Mongodb and after adding\nproducts to the cart the receipt is created.\n[PROJECT LINK]\n• Transport Ticket Management System [JavaScript,\nPHP, MySQL]\nA

In [41]:
for section in section_titles_with_content:
    print(section[0])
    print(section[1])
    print('------')

education

B.Sc in Computer Science
Brac University
x CGPA - 3.74/4.00
A’Level
Scholars International School
x GPA - 5.00/5.00
O’Level
Scholars International School
x GPA - 5.00/5.00
EXPERIENCE
( 2017 – 2021
( 2013 – 2015
( 2001 – 2013
Trainee-IT(Server and Systems)
DARAZ Bangladesh
x August 2022 – Ongoing
( Dhaka, BD
• Deployment of the system and evaluation of system performance issues
• Support the IT team in the maintenance of hardware and software
• Research unusual bugs or issues the company encounters
Student Tutor
Brac University
x Sep 2020 – Dec 2020
( Dhaka, BD
• Assist in Introduction to Computer Science Lab
• Mentor students and evaluate Assignments
PROJECTS
• E-Commerce Website [ReactJS, Firebase, Boot-
strap, Mongodb, Node.js]
Product details are fetched from our own
data created in Mongodb and after adding
products to the cart the receipt is created.
[PROJECT LINK]
• Transport Ticket Management System [JavaScript,
PHP, MySQL]
An automated system for purchasing online
bus
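
In the output above, the education section runs on through EXPERIENCE and PROJECTS. A likely cause is that the cell looks up the next heading with `second_largest_headings_info[idx + 1]`, where `idx` is a position in `snippets` rather than a position in the list of headings, so the lookup usually falls through to `len(snippets)`. The sketch below shows one way to bound each section by the heading that follows it. It is a simplified illustration, not the notebook's method: it treats every snippet of the given font size as a section heading, and the name `split_sections` and the sample data are assumptions.

```python
def split_sections(snippets, heading_size):
    """Return (title, content) pairs, each section ending where the next heading starts."""
    # Positions in `snippets` of every snippet set in the heading font size
    heading_positions = [
        i for i, (_, size) in enumerate(snippets) if size == heading_size
    ]
    sections = []
    for n, start in enumerate(heading_positions):
        # The next heading's position, or the end of the document for the last one
        if n + 1 < len(heading_positions):
            end = heading_positions[n + 1]
        else:
            end = len(snippets)
        content = "".join(text for text, _ in snippets[start + 1:end])
        sections.append((snippets[start][0].strip().lower(), content))
    return sections

# Hypothetical snippets in document order
sample_snippets = [
    ("JANE DOE\n", 24),
    ("EDUCATION\n", 14), ("B.Sc in Computer Science\n", 10),
    ("EXPERIENCE\n", 14), ("Trainee\n", 10),
    ("PROJECTS\n", 14), ("E-Commerce Website\n", 10),
]
for title, content in split_sections(sample_snippets, heading_size=14):
    print(title, repr(content))
```

Iterating with `enumerate` over the heading positions keeps the two index spaces apart: `n` indexes the headings, while `start` and `end` index `snippets`.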

# Extraction of info


In [None]:
# {
#     "candidate_info": {
#         "name": "",
#         "phone": "",
#         "email": "",
#         "present_address": "",
#         "permanent_address": ""
#     },
#     "education_info": {
#             "institution": "",
#             "department": "",
#             "cgpa": 0.0
#     },
#     "experience": 0.0,
#     "score": 0.0,
#     "rank": "--"
# }
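
As a possible starting point for filling the `candidate_info` part of the template above, the sketch below pulls an email address and a phone number out of the CV text with regular expressions. The patterns are deliberately loose, the function name `extract_candidate_info` is hypothetical, and the sample text is invented. The address fields are left empty because nothing in the steps so far identifies them.

```python
import re

# Loose patterns, assumed good enough for a first pass over CV text
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d -]{8,}\d")

def extract_candidate_info(name, text):
    """Build the candidate_info dictionary from a name and the raw CV text."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "name": name.strip(),
        "phone": phone.group(0) if phone else "",
        "email": email.group(0) if email else "",
        "present_address": "",
        "permanent_address": "",
    }

# Invented sample text in the shape of the extracted CV header
sample_text = "JANE DOE\n@ jane.doe@example.com\n+8801700000000\nDhaka, Bangladesh\n"
info = extract_candidate_info("JANE DOE\n", sample_text)
print(info)
```

In the notebook, `applicant_name` and `cv_text` from the earlier cells would be the natural arguments.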