# Resume Analysis & Candidate Ranking System
Welcome to the Resume Analysis & Candidate Ranking System, the Major Project of Batch-4 (CSM)! In today's competitive job market, organizations receive a vast number of resumes for each job opening, which makes candidate selection time-consuming and challenging. To address this, the system offers an automated way to analyze resumes and rank candidates by their suitability for a given job description.

## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Zip File Extraction](#zip-file-extraction)
4. [Information Extraction](#information-extraction)
5. [Analysis of Job Description](#analysis-of-job-description)
6. [Analysis of Candidate Resume](#analysis-of-candidate-resume)
7. [Applying LDA Model](#applying-lda-model)
8. [Calculating Similarity](#calculating-similarity)
9. [Ranking Candidates](#ranking-candidates)
10. [Clearing Extracted Files](#clearing-extracted-files)

<a id="introduction"></a>
## 1. Introduction

Imagine a scenario where a public organization, such as a government agency or a non-profit organization, is hiring for a critical role that requires specific skills and qualifications. With limited resources and a high volume of applicants, manual screening of resumes becomes impractical and inefficient. In such cases, an automated resume analysis and candidate ranking system can significantly streamline the hiring process, allowing the organization to identify top candidates quickly and efficiently.

This system can be particularly beneficial for public organizations with limited HR resources, enabling them to make data-driven decisions in their hiring process while ensuring fairness and transparency. By leveraging natural language processing techniques and machine learning algorithms, the system evaluates resumes objectively and ranks candidates based on their alignment with the job requirements.

Let's explore the components of this system and see how it can revolutionize the candidate selection process for public organizations.


<a id="system-architecture"></a>
## 2. System Architecture
The system architecture consists of several components that work together to analyze resumes and rank candidates:
- **Zip File Extraction:** Resumes are extracted from a compressed zip file for further processing.
- **Information Extraction:** Data is loaded from both the job description and resumes, and textual content is extracted for analysis.
- **Analysis of Job Description:** The job description is analyzed to identify key requirements and skills using TF-IDF.
- **Analysis of Candidate Resume:** Each candidate's resume is analyzed to extract relevant features and qualifications using TF-IDF.
- **Applying LDA Model:** The LDA (Latent Dirichlet Allocation) model is applied to uncover underlying topics within the job description.
- **Calculating Similarity:** Similarity between the job description and each resume is calculated using cosine similarity and LDA topic distributions.
- **Ranking Candidates:** Candidates are ranked based on their similarity to the job description.
- **Clearing Extracted Files:** Extracted files are cleared to maintain a tidy workspace.

Let's delve into each component in detail.


<a id="zip-file-extraction"></a>
## 3. Zip File Extraction

In this initial step, resumes are extracted from a compressed zip file for further processing. This allows us to access and analyze multiple resumes conveniently.

In [3]:
import os

# Set the working directory to the notebook's folder
os.chdir(r'D:\project\Major Project\Final Documents\Code Explanation')

In [4]:
import os
import zipfile

zip_path = 'Data/Code.zip'
extract_dir = 'Data/Extracted/'

# Unpack every file in the archive into the extraction folder
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
print(os.listdir(extract_dir))

['sample.pdf', 'sample2.pdf', 'sample3.pdf']
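`extractall` unpacks every member of the archive, whatever its type. As a minimal sketch, assuming only PDF resumes are wanted and that any folder structure inside the archive can be flattened, the extraction could be restricted as follows. The helper name `extract_pdfs` and the demo file names are hypothetical and not part of the project.

```python
import os
import tempfile
import zipfile

def extract_pdfs(zip_path, target_dir):
    """Extract only the .pdf members of zip_path into target_dir and return their names."""
    os.makedirs(target_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path, 'r') as archive:
        for member in archive.infolist():
            # Skip folders and anything that is not a PDF
            if member.is_dir() or not member.filename.lower().endswith('.pdf'):
                continue
            # Flatten any folder structure stored inside the archive
            name = os.path.basename(member.filename)
            with archive.open(member) as source, open(os.path.join(target_dir, name), 'wb') as target:
                target.write(source.read())
            extracted.append(name)
    return sorted(extracted)

# Usage with a throwaway archive (illustrative file names only)
with tempfile.TemporaryDirectory() as workspace:
    demo_zip = os.path.join(workspace, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('resumes/a.pdf', b'%PDF-1.4 placeholder')
        archive.writestr('resumes/notes.txt', b'not a resume')
    demo_files = extract_pdfs(demo_zip, os.path.join(workspace, 'out'))
    print(demo_files)
```

Filtering at extraction time keeps stray files out of the later `os.listdir` call, which otherwise passes every extracted file to the PDF reader.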


<a id="information-extraction"></a>
## 4. Information Extraction
This section focuses on loading data from both the job description and resumes. Resumes are parsed to extract textual content, which is then stored for subsequent analysis.

In [1]:
!pip install PyMuPDF


In [5]:
import fitz  # PyMuPDF
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK tokenizer models and the 'stopwords' corpus
# have already been downloaded with nltk.download().

def extract_pdf_text(pdf_path):
    """Concatenate the text of every page in a PDF."""
    with fitz.open(pdf_path) as pdf_document:
        return ''.join(page.get_text() for page in pdf_document)

def clean_text(text, stop_words):
    """Lowercase the text, keep alphanumeric tokens only, and drop stop words."""
    words = [word.lower() for word in word_tokenize(text) if word.isalnum()]
    return ' '.join(word for word in words if word not in stop_words)

def load_data(job_description_file, resume_files):
    stop_words = set(stopwords.words('english'))

    # One record per resume: the file path and its cleaned text
    resume_records = []
    for resume in resume_files:
        text = clean_text(extract_pdf_text(resume), stop_words)
        resume_records.append({'resume_name': resume, 'resume_text': text})
    print(resume_records)
    resumes_df = pd.DataFrame(resume_records)

    # The job description goes through the same cleaning steps
    job_description = clean_text(extract_pdf_text(job_description_file), stop_words)

    return job_description, resumes_df

# print(['Data/Extracted/'+x for x in os.listdir('Data/Extracted/')])
JD, resume_df = load_data('Data/sample_JD.pdf',['Data/Extracted/'+x for x in os.listdir('Data/Extracted/')])
print(JD)
resume_df

[{'resume_name': 'Data/Extracted/sample.pdf', 'resume_text': 'madhava reddy creative programmer faat madhavso2018 fap hone 9347156120 famapmarker vijayawada india falinkedin summary dedicated computer science engineering student strong passion data analysis machine learning web development seeking position apply technical skills innovative mindset solve complex problems tribute organization success education computer science engineering aiml nri institute technology z 2020 2024 pothavarappadu cgpa board intermediate education ap mpc sri chaitanya educational institutions z 2018 2020 vijayawada cgpa andhra pradesh board secondary education high school z 2017 2018 vijayawada cgpa projects data extractor tripadvisor reviews eapcet flipkart academic performance cases prediction using machine learning developed machine learning models predict cases aiding proactive public health measures online prices monitor bot designed implemented price monitoring telegram bot skills technical skills pro

| | resume_name | resume_text |
|---|---|---|
| 0 | Data/Extracted/sample.pdf | madhava reddy creative programmer faat madhavs... |
| 1 | Data/Extracted/sample2.pdf | experience education skills certifications vij... |
| 2 | Data/Extracted/sample3.pdf | experience education skills certifications vij... |
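The cleaning applied to each document (lowercasing, keeping alphanumeric tokens, dropping stop words) can be tried without PDFs or NLTK. The sketch below is a stand-in under stated assumptions: it uses a regular-expression tokenizer and a tiny hand-written stop-word list, so its results can differ from `word_tokenize` with NLTK's English list on punctuation-heavy text. The names `normalize` and `DEMO_STOP_WORDS` are hypothetical.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, split into alphanumeric tokens, and remove stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return ' '.join(token for token in tokens if token not in stop_words)

demo_clean = normalize("Experienced in Python, SQL and the basics of Machine-Learning!")
print(demo_clean)  # experienced python sql basics machine learning
```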


In [6]:
import os

print("Current Working Directory:", os.getcwd())
print("Exists:", os.path.exists('Data/Extracted'))
print("Contents:", os.listdir('Data/Extracted') if os.path.exists('Data/Extracted') else "Folder not found")


Current Working Directory: D:\project\Major Project\Final Documents\Code Explanation
Exists: True
Contents: ['sample.pdf', 'sample2.pdf', 'sample3.pdf']


<a id="analysis-of-job-description"></a>
## 5. Analysis of Job Description
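Here, we analyze the job description to identify key requirements and skills, using TF-IDF (Term Frequency-Inverse Document Frequency) to score its terms and keeping the 50 highest-scoring keywords. Because the vectorizer is fitted on the job description alone, every term receives the same IDF weight, so the scores are effectively L2-normalized term frequencies.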

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

def analyze_job_role(job_description, stop_words):
    # Add custom stop words. update() modifies the caller's set in place, so these
    # words are also filtered in any later call that reuses the same set.
    stop_words.update(["role", "responsibilities", "skills", "experience", "qualifications"])
    job_description = ' '.join(word for word in job_description.lower().split() if word not in stop_words)

    # Fit TF-IDF on the job description as a single document
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform([job_description])

    feature_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_scores = tfidf_matrix.toarray()[0]

    # Sort (keyword, score) pairs by score and keep the top 50
    sorted_keywords = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)
    top_keywords = dict(sorted_keywords[:50])

    return top_keywords

stop_words = set(stopwords.words('english'))
job_keywords = analyze_job_role(JD, stop_words)
print(job_keywords)

{'google': 0.5214500094539749, 'data': 0.21332045841298974, 'related': 0.165915912098992, 'center': 0.14221363894199315, 'engineering': 0.1185113657849943, 'equal': 0.1185113657849943, 'opportunity': 0.1185113657849943, 'status': 0.1185113657849943, 'technical': 0.1185113657849943, 'world': 0.1185113657849943, '2024': 0.09480909262799544, 'belonging': 0.09480909262799544, 'english': 0.09480909262799544, 'gender': 0.09480909262799544, 'information': 0.09480909262799544, 'infrastructure': 0.09480909262799544, 'please': 0.09480909262799544, 'resume': 0.09480909262799544, 'search': 0.09480909262799544, 'team': 0.09480909262799544, 'technology': 0.09480909262799544, 'us': 0.09480909262799544, 'users': 0.09480909262799544, 'ability': 0.07110681947099658, 'behind': 0.07110681947099658, 'challenges': 0.07110681947099658, 'current': 0.07110681947099658, 'cvs': 0.07110681947099658, 'degree': 0.07110681947099658, 'design': 0.07110681947099658, 'engineers': 0.07110681947099658, 'hire': 0.071106819
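To see what fitting TF-IDF on one document versus several does to the scores, here is a small pure-Python sketch. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are scikit-learn's documented defaults, and it splits on whitespace, so it is not a drop-in replacement for `TfidfVectorizer`. The function name `tfidf` and the sample sentences are illustrative.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2 normalization, one dict of scores per document."""
    counts = [Counter(doc.split()) for doc in documents]
    n_docs = len(documents)
    # Number of documents each term appears in
    doc_freq = Counter(term for count in counts for term in count)
    result = []
    for count in counts:
        raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, tf in count.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        result.append({t: v / norm for t, v in raw.items()})
    return result

# One document: every IDF equals 1, so scores are proportional to raw counts
single = tfidf(["data data center google"])[0]
print(round(single["data"] / single["google"], 6))  # 2.0

# Two documents: "data" appears in both and is down-weighted relative to "center"
jd = tfidf(["data data center google", "data python web"])[0]
print(jd["center"] > jd["data"] / 2)  # True
```

Fitting on the job description together with the resumes would therefore let IDF separate common terms from distinctive ones, at the cost of scores that depend on which resumes are in the batch.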

<a id="analysis-of-candidate-resume"></a>
## 6. Analysis of Candidate Resume
Each candidate's resume is analyzed in the same way to extract relevant features and qualifications: TF-IDF is fitted on the resume text on its own, and the 50 highest-scoring terms are kept as that resume's features.

In [8]:
def analyze_candidate_resume(resume, stop_words):
    # Add custom stop words. update() modifies the caller's set in place, so the
    # same set accumulates the words added by every analysis function.
    stop_words.update(["phone", "email", "address", "linkedin", "github"])
    resume = ' '.join(word for word in resume.lower().split() if word not in stop_words)

    # Fit TF-IDF on this resume as a single document
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform([resume])

    feature_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_scores = tfidf_matrix.toarray()[0]

    # Sort (term, score) pairs by score and keep the top 50
    sorted_keywords = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)
    resume_features = dict(sorted_keywords[:50])

    return resume_features

for name, text in zip(resume_df['resume_name'], resume_df['resume_text']):
    print(name)
    candidate_features = analyze_candidate_resume(text, stop_words)
    print(candidate_features, end="\n\n\n\n")

Data/Extracted/sample.pdf
{'data': 0.3258752679561411, 'learning': 0.2715627232967842, 'machine': 0.2715627232967842, '2022': 0.16293763397807054, 'cgpa': 0.16293763397807054, 'education': 0.16293763397807054, 'engineering': 0.16293763397807054, 'python': 0.16293763397807054, 'vijayawada': 0.16293763397807054, 'web': 0.16293763397807054, '2018': 0.10862508931871369, '2020': 0.10862508931871369, 'analysis': 0.10862508931871369, 'blackbucks': 0.10862508931871369, 'board': 0.10862508931871369, 'bot': 0.10862508931871369, 'cases': 0.10862508931871369, 'computer': 0.10862508931871369, 'development': 0.10862508931871369, 'engineers': 0.10862508931871369, 'india': 0.10862508931871369, 'ltd': 0.10862508931871369, 'project': 0.10862508931871369, 'pvt': 0.10862508931871369, 'science': 0.10862508931871369, 'technical': 0.10862508931871369, '2017': 0.054312544659356844, '2023': 0.054312544659356844, '2024': 0.054312544659356844, '9347156120': 0.054312544659356844, 'academic': 0.054312544659356844,
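Both analysis functions call `stop_words.update(...)` on the set they receive, so words added for the job description (such as "skills") are also filtered from every resume analyzed afterwards with the same set. Whether that carry-over is wanted is a design choice. The sketch below contrasts the in-place behavior with a copy-based alternative; the helper names are hypothetical.

```python
base_stop_words = {"the", "and"}

def extend_in_place(stop_words, extra):
    """Add words to the caller's set; the change is visible outside the function."""
    stop_words.update(extra)
    return stop_words

def extend_copy(stop_words, extra):
    """Return a new set and leave the caller's set untouched."""
    return stop_words | set(extra)

shared = set(base_stop_words)
extend_in_place(shared, ["skills"])
print("skills" in shared)  # True

isolated = set(base_stop_words)
local = extend_copy(isolated, ["skills"])
print("skills" in isolated, "skills" in local)  # False True
```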

<a id="applying-lda-model"></a>
## 7. Applying LDA Model (Demo)
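In this section, we demonstrate applying the LDA (Latent Dirichlet Allocation) model to the job keywords. Each keyword is treated as a separate one-word document, and LDA assigns every document a distribution over five topics. LDA helps uncover underlying topics within the job description.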

In [10]:
!pip install gensim


In [11]:
from gensim import corpora, models

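def apply_lda(texts):
    # texts is any iterable of strings; each string becomes one document.
    # Iterating over a dict yields its keys, so passing job_keywords turns
    # every keyword into its own one-word document.
    print('text:', texts)
    tokenized_texts = [text.lower().split() for text in texts]
    print('tokenized:', tokenized_texts)

    # Map each token to an integer id, then encode documents as bags of words
    dictionary = corpora.Dictionary(tokenized_texts)
    print(dictionary)
    corpus = [dictionary.doc2bow(text) for text in tokenized_texts]
    print(corpus)

    num_topics = 5  # Adjust the number of topics based on your requirements
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

    # Topic distribution of every document in the corpus
    corpus_lda = lda_model[corpus]

    return lda_model, corpus_lda, dictionary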

lda_model, corpus_lda, dictionary = apply_lda(job_keywords)

print("Corpus LDA:")
for doc in corpus_lda:
    print(doc)
    
print("\nDictionary:")
for word, index in dictionary.token2id.items():
    print(f"{word}: {index}")

text: {'google': 0.5214500094539749, 'data': 0.21332045841298974, 'related': 0.165915912098992, 'center': 0.14221363894199315, 'engineering': 0.1185113657849943, 'equal': 0.1185113657849943, 'opportunity': 0.1185113657849943, 'status': 0.1185113657849943, 'technical': 0.1185113657849943, 'world': 0.1185113657849943, '2024': 0.09480909262799544, 'belonging': 0.09480909262799544, 'english': 0.09480909262799544, 'gender': 0.09480909262799544, 'information': 0.09480909262799544, 'infrastructure': 0.09480909262799544, 'please': 0.09480909262799544, 'resume': 0.09480909262799544, 'search': 0.09480909262799544, 'team': 0.09480909262799544, 'technology': 0.09480909262799544, 'us': 0.09480909262799544, 'users': 0.09480909262799544, 'ability': 0.07110681947099658, 'behind': 0.07110681947099658, 'challenges': 0.07110681947099658, 'current': 0.07110681947099658, 'cvs': 0.07110681947099658, 'degree': 0.07110681947099658, 'design': 0.07110681947099658, 'engineers': 0.07110681947099658, 'hire': 0.071
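Since `apply_lda` simply iterates over `texts`, the type of the argument decides what counts as a document. The following pure-Python sketch mimics the tokenization line and a simplified bag-of-words encoding to make that visible. `build_bow` is a hypothetical stand-in for gensim's `Dictionary` and `doc2bow`, and the ids it assigns may differ from gensim's.

```python
def tokenize_documents(texts):
    """Same comprehension as in apply_lda: one token list per item of texts."""
    return [text.lower().split() for text in texts]

def build_bow(tokenized_texts):
    """Assign ids to tokens in order of first appearance and count them per document."""
    token2id = {}
    corpus = []
    for tokens in tokenized_texts:
        counts = {}
        for token in tokens:
            token_id = token2id.setdefault(token, len(token2id))
            counts[token_id] = counts.get(token_id, 0) + 1
        corpus.append(sorted(counts.items()))
    return token2id, corpus

keyword_scores = {"google": 0.52, "data": 0.21}           # illustrative values
from_dict = tokenize_documents(keyword_scores)            # iterates over the keys
from_list = tokenize_documents(["google data center"])    # one multi-word document
from_string = tokenize_documents("go da")                 # iterates over characters

print(from_dict)    # [['google'], ['data']]
print(from_list)    # [['google', 'data', 'center']]
print(from_string)  # [['g'], ['o'], [], ['d'], ['a']]

token2id, bow = build_bow(from_dict)
print(bow)          # [[(0, 1)], [(1, 1)]]
```

Passing a plain string therefore produces one single-character document per character, with empty documents for the spaces, so a full text should be wrapped in a list first.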

<a id="calculating-similarity"></a>
## 8. Calculating Similarity between JD and Resumes
We calculate the similarity between the job description and each resume as a weighted sum of two terms: the cosine similarity of the two TF-IDF score vectors (weight 0.7) and the dot product of the job and resume LDA topic distributions (weight 0.3).

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(job_keywords, resume_features, lda_model, corpus_lda, dictionary):
    # Cosine similarity between the two lists of TF-IDF scores. Both lists are in
    # descending score order and must have the same length; they are compared
    # position by position, not matched term by term.
    tfidf_cosine_similarity = cosine_similarity([list(job_keywords.values())], [list(resume_features.values())])[0][0]

    # Topic distribution of the resume keywords under the job LDA model
    resume_lda_vector = lda_model[dictionary.doc2bow(list(resume_features.keys()))]
    print(resume_lda_vector)

    # Topic distribution of the first document in the job corpus. Looking topics up
    # by id avoids mismatches when gensim omits low-probability topics.
    job_topics = dict(corpus_lda[0])
    lda_similarity = sum(topic_score * job_topics.get(topic_id, 0.0)
                         for topic_id, topic_score in resume_lda_vector)

    # Combine both similarities (you can adjust the weights based on importance)
    final_score = 0.7 * tfidf_cosine_similarity + 0.3 * lda_similarity

    return final_score

print(calculate_similarity(job_keywords, analyze_candidate_resume(resume_df['resume_text'][1],stop_words), lda_model, corpus_lda, dictionary))

[(0, 0.050042022), (1, 0.2997978), (2, 0.29979622), (3, 0.30033007), (4, 0.050033897)]
0.7119986164491449
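`calculate_similarity` compares the two TF-IDF score lists position by position, so the result reflects how similar the score profiles are rather than which terms the documents share. One possible alternative, shown here only as a sketch, aligns both score dictionaries on their combined vocabulary before taking the cosine. All function names are hypothetical, the sample scores are made up, and the 0.7/0.3 weighting is taken from the code above.

```python
import math

def cosine_on_shared_vocab(scores_a, scores_b):
    """Cosine similarity of two {term: score} dicts aligned on the union of their terms."""
    vocab = sorted(set(scores_a) | set(scores_b))
    vec_a = [scores_a.get(term, 0.0) for term in vocab]
    vec_b = [scores_b.get(term, 0.0) for term in vocab]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def topic_overlap(job_topics, resume_topics):
    """Dot product of two [(topic_id, probability), ...] lists, matched by topic id."""
    job = dict(job_topics)
    return sum(prob * job.get(topic_id, 0.0) for topic_id, prob in resume_topics)

def combined_score(tfidf_similarity, lda_similarity, tfidf_weight=0.7):
    """Weighted sum of the two similarities."""
    return tfidf_weight * tfidf_similarity + (1 - tfidf_weight) * lda_similarity

job = {"data": 0.6, "engineering": 0.8}
same_terms = {"data": 0.6, "engineering": 0.8}
other_terms = {"cooking": 1.0}

sim_same = cosine_on_shared_vocab(job, same_terms)    # close to 1: identical terms and weights
sim_other = cosine_on_shared_vocab(job, other_terms)  # 0: no shared terms
overlap = topic_overlap([(0, 0.5), (1, 0.5)], [(1, 1.0)])
print(sim_same, sim_other, overlap, combined_score(sim_same, overlap))
```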


<a id="ranking-candidates"></a>
## 9. Ranking Candidates
Candidates are ranked based on their similarity to the job description. The ranking helps identify the most suitable candidates for the position.

In [19]:
def rank_candidates(job_description, resumes, stop_words):
    job_keywords = analyze_job_role(job_description, stop_words)
    # Build the LDA model from the job keywords, as in section 7. Passing the raw
    # job_description string would make apply_lda iterate over single characters.
    lda_model, corpus_lda, dictionary = apply_lda(job_keywords)

    # Score every resume against the job description
    ranking_scores = []
    for resume in resumes["resume_text"]:
        resume_features = analyze_candidate_resume(resume, stop_words)
        score = calculate_similarity(job_keywords, resume_features, lda_model, corpus_lda, dictionary)
        ranking_scores.append(score)

    # (resume index, score) pairs, best match first
    ranked_candidates = sorted(enumerate(ranking_scores), key=lambda x: x[1], reverse=True)
    print(ranked_candidates)
    return ranked_candidates

ranked_candidates = rank_candidates(JD, resume_df, stop_words)
leaderboard_data = [(resume_df.iloc[idx]['resume_name'].split("/")[-1], score) for idx, score in ranked_candidates]
print(leaderboard_data)
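The ranking step itself comes down to sorting scores in descending order and pairing them with file names. A minimal sketch with made-up scores follows; `build_leaderboard` is a hypothetical helper and the numbers are not results from the notebook run.

```python
def build_leaderboard(names, scores):
    """Pair each name with its score and return (rank, name, score) rows, best first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ranked, start=1)]

# Illustrative values only
demo_board = build_leaderboard(["a.pdf", "b.pdf", "c.pdf"], [0.42, 0.71, 0.55])
for row in demo_board:
    print(row)
```

Because Python's sort is stable, resumes with equal scores keep their original input order.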

<a id="clearing-extracted-files"></a>
## 10. Clearing Extracted Files
Finally, we clean up by removing the extracted files to maintain a tidy workspace.

In [17]:
def clear_directory(directory):
    # Iterate over all the files in the directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            # Check if it is a file
            if os.path.isfile(file_path):
                # Delete the file
                os.remove(file_path)
                print(f"Deleted file: {file_path}")
        except Exception as e:
            print(f"Error deleting {file_path}: {e}")
            
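# Build the path with os.path.join to avoid the invalid '\E' escape in a plain string
clear_directory(os.path.join('Data', 'Extracted'))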

Deleted file: Data\Extracted\sample.pdf
Deleted file: Data\Extracted\sample2.pdf
Deleted file: Data\Extracted\sample3.pdf
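Manual cleanup can be avoided when the extracted files are only needed while the analysis runs. The sketch below assumes that is the case and extracts into a `tempfile.TemporaryDirectory`, which is deleted automatically when its `with` block ends. The function name `list_pdfs_via_temp_dir` and the demo archive are hypothetical.

```python
import os
import tempfile
import zipfile

def list_pdfs_via_temp_dir(zip_path):
    """Extract zip_path into a temporary folder and return (pdf names, folder path).

    The folder and its contents are removed as soon as the with-block exits.
    """
    with tempfile.TemporaryDirectory() as extract_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extract_dir)
        # Any processing of the extracted files has to happen inside this block
        pdf_names = sorted(f for f in os.listdir(extract_dir) if f.lower().endswith('.pdf'))
        return pdf_names, extract_dir

# Usage with a throwaway archive
with tempfile.TemporaryDirectory() as workspace:
    demo_zip = os.path.join(workspace, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as archive:
        archive.writestr('sample.pdf', b'%PDF-1.4 placeholder')
    pdf_names, used_dir = list_pdfs_via_temp_dir(demo_zip)

print(pdf_names, os.path.exists(used_dir))  # ['sample.pdf'] False
```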
