<a href="https://colab.research.google.com/github/Ritz0820/Resume-Revealer/blob/main/PDF_REVEALER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Resume Revealer
Resume Revealer extracts, analyzes, and summarizes information from resumes. It combines natural language processing (NLP) with pretrained models: spaCy for tokenization and noun chunks, TF-IDF similarity for job-title matching, Sentence-BERT for embeddings, and Pegasus for summarization.

Overview
Resume Revealer is designed to help recruiters and job seekers review resumes faster. It extracts key information such as job titles, skills, work experience, and education from resumes in several formats (e.g., PDF, DOCX).

Features
Text Extraction: Extracts text from resumes in PDF, DOCX, and other formats using python-docx, PyMuPDF (imported as fitz), and textract.
Information Extraction: Utilizes NLP techniques to extract key information such as job titles, skills, work experience, education, and other relevant details from the resume text.
Resume Summarization: Summarizes the content of the resume to provide an abstract overview, highlighting key points and important details.
Skills Mapping: Matches extracted skills and job titles to standardized lists or databases (e.g., the O*NET database) to provide additional context.
User Interface: Provides a user-friendly interface for interacting with the application, allowing users to upload resumes, view extracted information, and access summarized content.

Dependencies
Python 3.x
spaCy (with the en_core_web_sm model)
pandas
scikit-learn
PyMuPDF (imported as fitz)
textract
python-docx
transformers
sentence-transformers
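Several of these packages are imported under a name that differs from the one passed to pip (PyMuPDF as fitz, python-docx as docx, scikit-learn as sklearn). The following is a minimal sketch for checking that the imports resolve before running the notebook. The mapping `REQUIRED` and the helper `missing_packages` are illustrative names, not part of the project.

```python
from importlib.util import find_spec

# pip package name -> import name (the two often differ)
REQUIRED = {
    "PyMuPDF": "fitz",
    "python-docx": "docx",
    "scikit-learn": "sklearn",
    "spacy": "spacy",
    "textract": "textract",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import name cannot be located."""
    return sorted(pip for pip, module in required.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed by its pip name rather than its import name.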


Usage
Upload a resume file (PDF, DOCX, etc.) to the application.
Explore the extracted information, including job titles, skills, work experience, education, and more.
Review the summarized content to gain insights into the resume content quickly and efficiently.
Customize and fine-tune the application according to specific requirements by adjusting parameters and configurations.

In [None]:
#installing libraries
!pip install python-docx
!pip install PyMuPDF
!pip install textract
!pip install transformers
!pip install sentence-transformers
!apt-get install -y poppler-utils antiword

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
antiword is already the newest version (0.37-16).
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


In [None]:
#import statements
import os
import docx
import fitz
import textract     # extracting texts from various formats
from transformers import pipeline

In [None]:
#Extracting text based on file type
def extract_text_from_resume(file_path):
    """Return the plain text of a resume, choosing the extractor by file extension."""
    extension = os.path.splitext(file_path)[1].lower()  # case-insensitive match
    if extension == '.docx':
        doc = docx.Document(file_path)
        text = "\n".join(paragraph.text for paragraph in doc.paragraphs)
    elif extension == '.pdf':
        # Close the PDF once every page has been read
        with fitz.open(file_path) as doc:
            text = "".join(page.get_text() for page in doc)
    else:
        # textract handles the remaining formats
        text = textract.process(file_path).decode()
    return text

file_path = '/content/sample_data/easy_resume_level1_a.pdf'
resume_text = extract_text_from_resume(file_path)
print(resume_text)

Prasham Sheth
Data Scientist
Phone: +1 (516) 707-1668
Email: p.d.sheth@columbia.edu
LinkedIn: http://www.linkedin.com/in/prasham-sheth
SUMMARY
● I currently work as a Data Scientist at the SLB Software Technology Innovation Center (STIC) in Menlo Park, California
● My research interests include Machine Learning and Deep Learning based approaches for solving complex problems in the
fields of Computer Vision, Prognostic and Health Management, and Time-Series Analysis. Further, I am focusing on
Hybrid modeling techniques involving Physics Informed Machine Learning
EDUCATION
Columbia University
New York, NY
Master of Science in Data Science, GPA: 4.08/4.00
Dec 2020
Coursework: Machine Learning, Applied Machine Learning, Applied Deep Learning, Statistical Inference & Modeling,
Personalization Theory, Natural Language Processing, Algorithms for Data Science, Computer Systems, Exploratory Data
Analysis and Visualization
Nirma University
Ahmedabad, India
Bachelor of Technology in Computer Engi
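
The extracted text keeps the resume's upper-case section headings (SUMMARY, EDUCATION), which gives a simple handle for the education and experience extraction listed under Features. The following is a rough sketch, assuming headings appear alone on a line in upper case. The heading vocabulary, `split_sections`, and `find_emails` are hypothetical and would need tuning for real resumes.

```python
import re

# Assumed heading vocabulary; real resumes vary widely.
SECTION_HEADINGS = {"SUMMARY", "EDUCATION", "EXPERIENCE", "SKILLS", "PROJECTS"}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_sections(text):
    """Group non-empty lines under the most recent upper-case heading."""
    sections = {"HEADER": []}
    current = "HEADER"
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.isupper() and stripped in SECTION_HEADINGS:
            current = stripped
            sections.setdefault(current, [])
        elif stripped:
            sections[current].append(stripped)
    return sections


def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    sample = "Jane Doe\nEmail: jane@example.com\nSUMMARY\nWorks on ML\nEDUCATION\nSome University\n"
    print(split_sections(sample))
    print(find_emails(sample))
```

In the notebook, `split_sections(resume_text)` would be the natural call. Lines before the first heading end up under `HEADER`.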

In [None]:
#Using spacy to detect job titles
import spacy
import pandas as pd

# Load English model
nlp = spacy.load("en_core_web_sm")

# Define a function to extract noun chunks from text
def extract_noun_chunks(text):
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]


# Define a function to extract job titles from resume text
def extract_job_title(resume_text):
    nlp_text = nlp(resume_text)

    # Tokenize the text and drop stop words
    tokens = [token.text for token in nlp_text if not token.is_stop]

    # Read the CSV file containing known job titles
    data = pd.read_csv('/content/sample_data/Job titles and industries.csv')

    # Extract the job titles from the CSV file
    job_title = data['job title'].tolist()

    print("Job Titles:", job_title)

    # Lowercased set for fast, case-insensitive membership tests
    known_titles = {str(title).lower().strip() for title in job_title}

    job_positions = set()

    # Check single-word titles (e.g., developer)
    for token in tokens:
        if token.lower() in known_titles:
            job_positions.add(token.lower())

    # Check multi-word titles via noun chunks (e.g., data scientist)
    for chunk in extract_noun_chunks(resume_text):
        chunk = chunk.lower().strip()
        if chunk in known_titles:
            job_positions.add(chunk)

    return [position.capitalize() for position in job_positions]
title=extract_job_title(resume_text)
for position in title:
    print(position)

Job Titles: ['technical support and helpdesk supervisor - county buildings, ayr soa04086', 'senior technical support engineer', 'head of it services', 'js front end engineer', 'network and telephony controller', 'privileged access management expert', 'devops engineers x 3 - global brand', 'devops engineers x 3 - global brand', 'data modeller', 'php web developer £45,000 based in london', 'devops engineers x 3 - global brand', 'devops engineers x 3 - global brand', 'solution / technical architect - ethical brand', 'lead developer - ethical brand', 'junior front-end developer', 'vb .net web developer, milton keynes, £45k', 'data scientist, newcastle, up to £40k', 'senior bi engineer', 'machine learning engineer', 'full stack developer, oxfordshire, £40k', 'c# software developer, waltham cross, £55k', 'senior data engineer', 'erp support analyst - unit4, agresso business world', 'application support analyst - cheshire - financial services', 'accountancy software trainer - manchester - rem
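
Noun chunks depend on the parser, so a multi-word title can be missed when spaCy splits it differently. As a hedged alternative, the sketch below checks every one-, two-, and three-word window of the text against a set of known titles. It uses only the standard library, and `ngrams` and `match_titles` are illustrative names rather than the notebook's method.

```python
import re


def ngrams(tokens, n):
    """Return all contiguous n-token windows joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_titles(text, known_titles, max_n=3):
    """Return known titles that occur as 1..max_n word windows in the text."""
    known = {title.lower().strip() for title in known_titles}
    # Keep characters common in titles (c#, c++, .net) and drop trailing periods
    tokens = [t.strip(".") for t in re.findall(r"[a-z0-9#+.]+", text.lower())]
    tokens = [t for t in tokens if t]
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in known and gram not in found:
                found.append(gram)
    return found


if __name__ == "__main__":
    text = "Jane Doe\nData Scientist\nformerly Machine Learning Engineer."
    print(match_titles(text, ["data scientist", "machine learning engineer", "nurse"]))
```

Long titles in the CSV that carry a location or salary (for example `'data scientist, newcastle, up to £40k'`) will not match under either approach unless they are cleaned first.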

In [None]:
import pandas as pd

# Load the O*NET-SOC 2019 occupation list
df = pd.read_csv('/content/sample_data/2019_Occupations.csv')

# Collapse detailed codes onto their base occupation: drop the last two digits,
# keep the first row per base code, then append '00' again
df['job_code'] = df['O*NET-SOC 2019 Code'].str[:-2]
df.drop_duplicates(subset='job_code', keep="first", inplace=True)
df['job_code'] = df['job_code'] + '00'

job_code = df['job_code']
job_title = df['O*NET-SOC 2019 Title']
job_description = df['O*NET-SOC 2019 Description']

# Index descriptions and titles by the normalized job code
description_title_df = pd.DataFrame({'job_description': job_description, 'job_title': job_title})
description_title_df.index = job_code
description_title_df.index.name = 'job_code'
print(description_title_df)


                                              job_description  \
job_code                                                        
11-1011.00  Determine and formulate policies and provide o...   
11-1021.00  Plan, direct, or coordinate the operations of ...   
11-1031.00  Develop, introduce, or enact laws and statutes...   
11-2011.00  Plan, direct, or coordinate advertising polici...   
11-2021.00  Plan, direct, or coordinate marketing policies...   
...                                                       ...   
55-3014.00  Target, fire, and maintain weapons used to des...   
55-3015.00  Operate and monitor communications, detection,...   
55-3016.00  Operate weapons and equipment in ground combat...   
55-3018.00  Implement unconventional operations by air, la...   
55-3019.00  All military enlisted tactical operations and ...   

                                                    job_title  
job_code                                                       
11-1011.00                
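
The code normalization in this cell is easier to see on a few rows. The sketch below reproduces the same steps in plain Python: strip the last two digits, keep the first entry per base code, and restore the `00` suffix. The rows and their placeholder titles are invented for illustration, and `base_code` and `dedupe_by_base` are hypothetical helpers.

```python
def base_code(onet_code):
    """Map a detailed code such as '15-2051.01' to its '.00' base code."""
    return onet_code[:-2] + "00"


def dedupe_by_base(rows):
    """Keep the first (code, title) pair seen for each base code."""
    kept = {}
    for code, title in rows:
        kept.setdefault(base_code(code), title)
    return kept


if __name__ == "__main__":
    # Toy rows with placeholder titles, not real O*NET data
    rows = [
        ("15-2051.00", "Occupation A"),
        ("15-2051.01", "Occupation A, specialty"),
        ("11-1011.00", "Occupation B"),
    ]
    print(dedupe_by_base(rows))
```

As with `drop_duplicates(keep="first")`, the detailed occupations are discarded, so their titles and descriptions are no longer available for matching.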

In [None]:
from sentence_transformers import SentenceTransformer

# Load SBERT model
model_name = 'bert-base-nli-mean-tokens'  # Example model, you can use other models as well
sbert_model = SentenceTransformer(model_name)

# Encode all descriptions in one batched call, then store one vector per row
embeddings = sbert_model.encode(description_title_df['job_description'].tolist())
description_title_df['embeddings'] = list(embeddings)

# description_title_df now holds the job descriptions along with their embeddings
print(description_title_df)

                                              job_description  \
job_code                                                        
11-1011.00  Determine and formulate policies and provide o...   
11-1021.00  Plan, direct, or coordinate the operations of ...   
11-1031.00  Develop, introduce, or enact laws and statutes...   
11-2011.00  Plan, direct, or coordinate advertising polici...   
11-2021.00  Plan, direct, or coordinate marketing policies...   
...                                                       ...   
55-3014.00  Target, fire, and maintain weapons used to des...   
55-3015.00  Operate and monitor communications, detection,...   
55-3016.00  Operate weapons and equipment in ground combat...   
55-3018.00  Implement unconventional operations by air, la...   
55-3019.00  All military enlisted tactical operations and ...   

                                                    job_title  \
job_code                                                        
11-1011.00              
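
The stored embeddings become useful once they are compared with an embedding of the resume. The following is a minimal sketch of ranking by cosine similarity, written with plain lists so it runs without the model. The toy vectors stand in for SBERT output, and `cosine` and `rank_by_similarity` are illustrative names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, labelled_vecs, top_k=3):
    """Return the top_k (label, score) pairs, most similar first."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in labelled_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy 2-d vectors keyed by job code; real embeddings have hundreds of dimensions
    toy = {"code-a": [1.0, 0.0], "code-b": [0.0, 1.0], "code-c": [1.0, 1.0]}
    print(rank_by_similarity([1.0, 0.1], toy, top_k=2))
```

In the notebook, the query vector would presumably be `sbert_model.encode([resume_text])[0]` and the labelled vectors would come from the `embeddings` column.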

In [None]:
dataset=pd.read_csv('/content/sample_data/skills.csv')
skills=list(dataset.columns.values)
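
When these skill names are later searched for in a job description, a plain substring test (`skill in text`) lets very short names such as `c`, `r`, or `p` match inside unrelated words. That is how output like `p, plan, eve, c, ui` can arise. The sketch below matches skills only at word boundaries. `find_skills` is a hypothetical helper, and the lookarounds are used instead of `\b` so that names like `c++` still work.

```python
import re


def find_skills(text, skills):
    """Return the skills that appear in text as whole words, case-insensitively."""
    lowered = text.lower()
    found = []
    for skill in skills:
        # No word character may touch either side of the skill name
        pattern = r"(?<!\w)" + re.escape(skill.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(skill)
    return found


if __name__ == "__main__":
    description = "Plan, direct, or coordinate policies"
    print(find_skills(description, ["p", "plan", "c", "policies", "r", "python"]))
```

On the sample description, only the names that occur as complete words are kept.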

In [None]:
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = pd.read_csv('/content/sample_data/Job titles and industries.csv')

# Extract job titles from the CSV file
job_title = data['job title'].tolist()

# Calculate TF-IDF vectors for job titles
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(job_title)

# Project the resume into the same TF-IDF vocabulary
resume_tfidf = tfidf_vectorizer.transform([resume_text])

# Calculate cosine similarity between the resume vector and every job title vector
similarities = cosine_similarity(resume_tfidf, tfidf_matrix)

# Find the index of the job title with the highest similarity
max_similarity_index = similarities.argmax()

# Extract the matched job title
matched_job_title = job_title[max_similarity_index]

# Look up the matched title among the O*NET titles (case-insensitive exact match).
# The two title lists come from different files, so there may be no match.
title_matches = description_title_df['job_title'].str.lower() == matched_job_title.lower()
matched_job_index = description_title_df.index[title_matches]

# Get the job description for the matched title; stay empty when nothing matches
matched_job_description = ""
predicted_skills = []
if not matched_job_index.empty:
    matched_job_description = description_title_df.loc[matched_job_index[0], 'job_description']
    # Match skills as whole words so that 'c' or 'r' does not match inside other words
    description_lower = matched_job_description.lower()
    predicted_skills = [
        skill for skill in skills
        if re.search(r'(?<!\w)' + re.escape(skill.lower()) + r'(?!\w)', description_lower)
    ]
else:
    print("No O*NET entry found for:", matched_job_title)

# Print the matched job title, matched job description, and predicted skills
print("Matched Job Title:", matched_job_title)
print("Matched Job Description:", matched_job_description)
print("Predicted Skills:", ", ".join(predicted_skills))

Matched Job Title: science and engineering teacher for kids
Matched Job Description: Determine and formulate policies and provide overall direction of companies or private and public sector organizations within guidelines set up by a board of directors or similar governing body. Plan, direct, or coordinate operational activities at the highest level of management with the help of subordinate executives and staff managers.
Predicted Skills: p, plan, eve, c, ui, policies, r, pr, lan


IndexError: index 0 is out of bounds for axis 0 with size 0
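
The lookup fails whenever the matched title has no exact counterpart among the O*NET titles, which is likely because the two lists come from different sources. One possible fallback is approximate string matching. The sketch below uses `difflib` from the standard library as a stand-in for the notebook's TF-IDF approach. `closest_title` and the 0.6 cutoff are assumptions.

```python
import difflib


def closest_title(query, candidate_titles, cutoff=0.6):
    """Return the candidate most similar to query, or None if none reaches the cutoff."""
    lowered = {title.lower(): title for title in candidate_titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


if __name__ == "__main__":
    onet_titles = ["Data Scientists", "Chief Executives"]
    print(closest_title("data scientist", onet_titles))
    print(closest_title("zzzz", onet_titles))
```

Returning `None` when nothing is close enough lets the caller skip the description lookup instead of indexing into an empty result.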

In [None]:
!pip install transformers
!pip install bert-extractive-summarizer

Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Installing collected packages: bert-extractive-summarizer
Successfully installed bert-extractive-summarizer-0.10.1


In [None]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

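# Load Pegasus model and tokenizer
# Note: google/pegasus-large is the general pretrained checkpoint rather than one
# fine-tuned on a summarization dataset
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

# Define a function for text summarization
def summarize_text(text, max_input_length=1024, max_length=150):
    # Tokenize the input text. The input limit is kept separate from the summary
    # limit so the resume is not cut down to the summary length before generation.
    inputs = tokenizer([text], max_length=max_input_length, return_tensors="pt", truncation=True)

    # Generate summary
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_beams=4,
        early_stopping=True,
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    return summary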

summary = summarize_text(resume_text)
print("Summary:", summary)
print("Matched Job Title:", matched_job_title)
print("Matched Job Description:", matched_job_description)
print("Predicted Skills:", ", ".join(predicted_skills))

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Summary: ANAM IQBAL Pittsburgh, PA | +1 (646) 207-4431 | anami@andrew.cmu.edu | LinkedIn | Github EDUCATION CARNEGIE MELLON UNIVERSITY (CMU) Pittsburgh, PA Master of Information Systems Management: Business Intelligence & Data Analytics (Focus in Data Science) December 2022 Relevant Coursework: Introduction to Deep Learning, Big Data & Large Scale Computing, Unstructured Data Analytics, Machine Learning for Problem Solving, Interactive Data Science, Distributed Systems LAHORE UNIVERSITY OF MANAGEMENT SCIENCES (LUMS) Lahore, Pakistan Bachelor of Science (Honors) - Management Science June 2018 SKILLS Functional: Machine Learning, Deep Learning, Natural Language Processing (NLP), Artificial Intelligence, A/
Matched Job Title: Chief Executives
Index in DataFrame: 11-1011.00
Predicted Skills: p, plan, eve, c, ui, policies, r, pr, lan
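
Pegasus accepts a bounded number of input tokens, and `truncation=True` silently drops anything past that limit. For longer resumes, one option is to summarize the text in pieces. The following is a rough sketch that splits on word count, which is only an approximation of token count. The `summarize` argument is any callable, such as the notebook's `summarize_text`. `chunk_words` and `summarize_long` are hypothetical helpers.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text, summarize, max_words=300):
    """Summarize each chunk with the supplied callable and join the results."""
    return " ".join(summarize(chunk) for chunk in chunk_words(text, max_words))


if __name__ == "__main__":
    # Stand-in summarizer that keeps the first three words of each chunk
    first_three = lambda chunk: " ".join(chunk.split()[:3])
    print(summarize_long("one two three four five six seven eight", first_three, max_words=4))
```

Chunking by words ignores section boundaries, so splitting on the resume's own headings first may give more coherent partial summaries.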
