# Resume Analysis & Candidate Ranking System
Welcome to the Resume Analysis & Candidate Ranking System Major Project of Batch-4(CSM)! In today's competitive job market, organizations receive a vast number of resumes for each job opening, making the candidate selection process time-consuming and challenging. To address this issue, this system offers an automated solution for analyzing resumes and ranking candidates based on their suitability for a given job description.

## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Zip File Extraction](#zip-file-extraction)
4. [Information Extraction](#information-extraction)
5. [Analysis of Job Description](#analysis-of-job-description)
6. [Analysis of Candidate Resume](#analysis-of-candidate-resume)
7. [Applying LDA Model](#applying-lda-model)
8. [Calculating Similarity](#calculating-similarity)
9. [Ranking Candidates](#ranking-candidates)
10. [Clearing Extracted Files](#clearing-extracted-files)

<a id="introduction"></a>
## 1. Introduction

Imagine a scenario where a public organization, such as a government agency or a non-profit organization, is hiring for a critical role that requires specific skills and qualifications. With limited resources and a high volume of applicants, manual screening of resumes becomes impractical and inefficient. In such cases, an automated resume analysis and candidate ranking system can significantly streamline the hiring process, allowing the organization to identify top candidates quickly and efficiently.

This system can be particularly beneficial for public organizations with limited HR resources, enabling them to make data-driven decisions in their hiring process while ensuring fairness and transparency. By leveraging natural language processing techniques and machine learning algorithms, the system evaluates resumes objectively and ranks candidates based on their alignment with the job requirements.

Let's explore the components of this system and see how it can revolutionize the candidate selection process for public organizations.


<a id="system-architecture"></a>
## 2. System Architecture
The system architecture consists of several components that work together to analyze resumes and rank candidates:
- **Zip File Extraction:** Resumes are extracted from a compressed zip file for further processing.
- **Information Extraction:** Data is loaded from both the job description and resumes, and textual content is extracted for analysis.
- **Analysis of Job Description:** The job description is analyzed to identify key requirements and skills using TF-IDF.
- **Analysis of Candidate Resume:** Each candidate's resume is analyzed to extract relevant features and qualifications using TF-IDF.
- **Applying LDA Model:** The LDA (Latent Dirichlet Allocation) model is applied to uncover underlying topics within the job description.
- **Calculating Similarity:** Similarity between the job description and each resume is calculated using cosine similarity and LDA topic distributions.
- **Ranking Candidates:** Candidates are ranked based on their similarity to the job description.
- **Clearing Extracted Files:** Extracted files are cleared to maintain a tidy workspace.

Let's delve into each component in detail.


<a id="zip-file-extraction"></a>
## 3. Zip file extraction

In this initial step, resumes are extracted from a compressed zip file for further processing. This allows us to access and analyze multiple resumes conveniently.

In [1]:
import zipfile,os

zip_path = 'Data/Code.zip'
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('Data/Extracted/')
print(os.listdir("Data/Extracted/"))

['sample.pdf', 'sample2.pdf', 'sample3.pdf']


<a id="information-extraction"></a>
## 4. Information Extraction
This section focuses on loading data from both the job description and resumes. Resumes are parsed to extract textual content, which is then stored for subsequent analysis.

In [2]:
import pandas as pd
import fitz
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def load_data(job_description_file, resumes_file):
    resumes_df = []
    stop_words = set(stopwords.words('english'))
    for resume in resumes_file:
        text = ""
        with fitz.open(resume) as pdf_document:
            for page_number in range(pdf_document.page_count):
                page = pdf_document[page_number]
                text += page.get_text()
        words = word_tokenize(text)
        words = [word.lower() for word in words if word.isalnum()]
        words = [word for word in words if word not in stop_words]
        text = ' '.join(words)
        resumes_df.append({'resume_name':resume, 'resume_text': text})
    resumes_df = pd.DataFrame(resumes_df)

    job_descriptions = ""
    with fitz.open(job_description_file) as job_description_file:
        for page_number in range(job_description_file.page_count):
            page = job_description_file[page_number]
            job_descriptions += page.get_text()
    words = word_tokenize(job_descriptions)
    words = [word.lower() for word in words if word.isalnum()]
    words = [word for word in words if word not in stop_words]
    job_descriptions = ' '.join(words)

    return job_descriptions, resumes_df

JD, resume_df = load_data('Data/sample_JD.pdf',['Data/Extracted/'+x for x in os.listdir('Data/Extracted/')])
print(JD)
resume_df

data center engineering intern winter 2024 google mumbai maharashtra india intern apprentice minimum currently pursuing bachelor master phd degree computer science related technical field experience installing network linux system ability communicate english fluently apply please complete application 5 2024 winter internships start february 2024 weeks duration internship intended students pursuing final year bachelor master dual degree program computer science related field graduate 2024 start application process need updated cv resume current unofficial official transcript english click apply button page provide required materials appropriate sections pdfs preferred 1 resume section attach updated cv resume please ensure listed anticipated graduation date proficiency statistical software database languages resume along technical experience projects 2 education section attach current recent unofficial official transcript english degree status select attending upload transcript back job

Unnamed: 0,resume_name,resume_text
0,Data/Extracted/sample.pdf,madhava reddy creative programmer madhavso2018...
1,Data/Extracted/sample2.pdf,experience education skills certifications vij...
2,Data/Extracted/sample3.pdf,experience education skills certifications vij...


<a id="analysis-of-job-description"></a>
## 5. Analysis of Job Description
Here, we analyze the job description to identify key requirements and skills. This involves using TF-IDF (Term Frequency-Inverse Document Frequency) to extract important keywords.

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

def analyze_job_role(job_description, stop_words):
    stop_words.update(["role", "responsibilities", "skills", "experience", "qualifications"])  # Add additional custom stop words
    job_description = ' '.join([word for word in job_description.lower().split() if word not in stop_words])

    tfidf_vectorizer = TfidfVectorizer()

    tfidf_matrix = tfidf_vectorizer.fit_transform([job_description])

    feature_names = tfidf_vectorizer.get_feature_names_out()

    tfidf_scores = tfidf_matrix.toarray()[0]

    keyword_scores = dict(zip(feature_names, tfidf_scores))

    sorted_keywords = sorted(keyword_scores.items(), key=lambda x: x[1], reverse=True)

    top_keywords = dict([(keyword,score) for keyword, score in sorted_keywords[:50]])

    return top_keywords

stop_words = set(stopwords.words('english'))
job_keywords = analyze_job_role(JD, stop_words)
print(job_keywords)

{'google': 0.5045977978988233, 'data': 0.21625619909949573, 'related': 0.16819926596627444, 'center': 0.14417079939966382, 'engineering': 0.12014233283305317, 'equal': 0.12014233283305317, 'opportunity': 0.12014233283305317, 'status': 0.12014233283305317, 'technical': 0.12014233283305317, 'world': 0.12014233283305317, '2024': 0.09611386626644254, 'belonging': 0.09611386626644254, 'english': 0.09611386626644254, 'gender': 0.09611386626644254, 'information': 0.09611386626644254, 'infrastructure': 0.09611386626644254, 'please': 0.09611386626644254, 'resume': 0.09611386626644254, 'search': 0.09611386626644254, 'team': 0.09611386626644254, 'technology': 0.09611386626644254, 'us': 0.09611386626644254, 'users': 0.09611386626644254, 'ability': 0.07208539969983191, 'behind': 0.07208539969983191, 'challenges': 0.07208539969983191, 'current': 0.07208539969983191, 'cvs': 0.07208539969983191, 'degree': 0.07208539969983191, 'design': 0.07208539969983191, 'engineers': 0.07208539969983191, 'hire': 0.0

<a id="analysis-of-candidate-resume"></a>
## 6. Analysis of Candidate Resume
Each candidate's resume is analyzed to extract relevant features and qualifications. We employ TF-IDF to identify important terms within the resumes.

In [4]:
def analyze_candidate_resume(resume, stop_words):
    stop_words.update(["phone", "email", "address", "linkedin", "github"])  # Add additional custom stop words
    resume = ' '.join([word for word in resume.lower().split() if word not in stop_words])

    tfidf_vectorizer = TfidfVectorizer()

    tfidf_matrix = tfidf_vectorizer.fit_transform([resume])

    feature_names = tfidf_vectorizer.get_feature_names_out()

    tfidf_scores = tfidf_matrix.toarray()[0]

    resume_features = dict(zip(feature_names, tfidf_scores))
    
    sorted_keywords = sorted(resume_features.items(), key=lambda x: x[1], reverse=True)
    
    resume_features = dict([(keyword,score) for keyword, score in sorted_keywords[:50]])

    return resume_features

for idx in range(len(resume_df)):
    print(resume_df['resume_name'][idx])
    candidate_features = analyze_candidate_resume(resume_df['resume_text'][idx],stop_words)
    print(candidate_features, end="\n\n\n\n")

Data/Extracted/sample.pdf
{'data': 0.3283053930987497, 'learning': 0.2735878275822914, 'machine': 0.2735878275822914, '2022': 0.16415269654937484, 'cgpa': 0.16415269654937484, 'education': 0.16415269654937484, 'engineering': 0.16415269654937484, 'python': 0.16415269654937484, 'vijayawada': 0.16415269654937484, 'web': 0.16415269654937484, '2018': 0.10943513103291655, '2020': 0.10943513103291655, 'analysis': 0.10943513103291655, 'blackbucks': 0.10943513103291655, 'board': 0.10943513103291655, 'bot': 0.10943513103291655, 'cases': 0.10943513103291655, 'computer': 0.10943513103291655, 'development': 0.10943513103291655, 'engineers': 0.10943513103291655, 'india': 0.10943513103291655, 'ltd': 0.10943513103291655, 'project': 0.10943513103291655, 'pvt': 0.10943513103291655, 'science': 0.10943513103291655, 'technical': 0.10943513103291655, '2017': 0.054717565516458275, '2023': 0.054717565516458275, '2024': 0.054717565516458275, '9347156120': 0.054717565516458275, 'academic': 0.054717565516458275,

<a id="applying-lda-model"></a>
## 7. Applying LDA Model (Demo)
In this section, we demonstrate the application of the LDA (Latent Dirichlet Allocation) model on the job keywords. LDA helps uncover underlying topics within the job description.

In [5]:
from gensim import corpora, models

def apply_lda(texts):
    tokenized_texts = [text.lower().split() for text in texts]
    dictionary = corpora.Dictionary(tokenized_texts)
    
    corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

    num_topics = 5  # Adjust the number of topics based on your requirements
    lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    
    corpus_lda = lda_model[corpus]
    
    return lda_model, corpus_lda, dictionary

lda_model, corpus_lda, dictionary = apply_lda(job_keywords)

print("Corpus LDA:")
for doc in corpus_lda:
    print(doc)
    
print("\nDictionary:")
for word, index in dictionary.token2id.items():
    print(f"{word}: {index}")

Corpus LDA:
[(0, 0.100023076), (1, 0.10002085), (2, 0.5999131), (3, 0.100019895), (4, 0.100023076)]
[(0, 0.100023076), (1, 0.10002085), (2, 0.5999131), (3, 0.100019895), (4, 0.100023076)]
[(0, 0.5999131), (1, 0.10002085), (2, 0.100023076), (3, 0.100019895), (4, 0.100023076)]
[(0, 0.10002683), (1, 0.10002423), (2, 0.10002682), (3, 0.5998953), (4, 0.10002682)]
[(0, 0.100023076), (1, 0.10002085), (2, 0.100023076), (3, 0.100019895), (4, 0.5999131)]
[(0, 0.5999131), (1, 0.10002085), (2, 0.100023076), (3, 0.100019895), (4, 0.100023076)]
[(0, 0.10002309), (1, 0.100020856), (2, 0.10002308), (3, 0.1000199), (4, 0.5999131)]
[(0, 0.100023076), (1, 0.10002085), (2, 0.5999131), (3, 0.10001989), (4, 0.100023076)]
[(0, 0.10002308), (1, 0.100020856), (2, 0.5999131), (3, 0.100019895), (4, 0.10002308)]
[(0, 0.5999131), (1, 0.100020856), (2, 0.10002308), (3, 0.1000199), (4, 0.10002308)]
[(0, 0.5999131), (1, 0.10002085), (2, 0.100023076), (3, 0.100019895), (4, 0.100023076)]
[(0, 0.5999131), (1, 0.10002085

<a id="calculating-similarity"></a>
## 8. Calculating Similarity between JD and Resumes
We calculate the similarity between the job description and each resume. This involves computing cosine similarity between TF-IDF vectors and LDA topic distributions.

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(job_keywords, resume_features, lda_model, corpus_lda, dictionary):
     # Compute cosine similarity between TF-IDF vectors of job keywords and resume features
    tfidf_cosine_similarity = cosine_similarity([list(job_keywords.values())], [list(resume_features.values())])[0][0]

    # Convert resume text to LDA vector
    resume_lda_vector = lda_model[dictionary.doc2bow(resume_features.keys())]

    # Compute similarity between job LDA topics and resume LDA vector
    lda_similarity = 0
    for topic_id, topic_score in resume_lda_vector:
        lda_similarity += topic_score * corpus_lda[0][topic_id][1]  # Weighted similarity based on LDA scores

    # Combine both similarities (you can adjust the weights based on importance)
    final_score = 0.7 * tfidf_cosine_similarity + 0.3 * lda_similarity
    
    return final_score

print(calculate_similarity(job_keywords, analyze_candidate_resume(resume_df['resume_text'][1],stop_words), lda_model, corpus_lda, dictionary))

0.7175028329875784


<a id="ranking-candidates"></a>
## 9. Ranking Candidates
Candidates are ranked based on their similarity to the job description. The ranking helps identify the most suitable candidates for the position.

In [15]:
def rank_candidates(job_description, resumes, stop_words):
    job_keywords = analyze_job_role(job_description, stop_words)
    lda_model, corpus_lda, dictionary = apply_lda(job_description)

    ranking_scores = []
    for resume in resumes["resume_text"][:]:
        resume_features = analyze_candidate_resume(resume, stop_words)
        score = calculate_similarity(job_keywords, resume_features, lda_model, corpus_lda, dictionary)
        ranking_scores.append(score)
    ranked_candidates = sorted(enumerate(ranking_scores), key=lambda x: x[1], reverse=True)
    return ranked_candidates

ranked_candidates = rank_candidates(JD, resume_df, stop_words)
leaderboard_data = [(resume_df.iloc[idx]['resume_name'].split("/")[-1], score) for idx, score in ranked_candidates]
print(leaderboard_data)

[('sample.pdf', 0.7250509821243742), ('sample2.pdf', 0.7025081332002299), ('sample3.pdf', 0.7025081332002299)]


<a id="clearing-extracted-files"></a>
## 10. Clearing Extracted Files
Finally, we clean up by removing the extracted files to maintain a tidy workspace.

In [17]:
def clear_directory(directory):
    # Iterate over all the files in the directory
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        try:
            # Check if it is a file
            if os.path.isfile(file_path):
                # Delete the file
                os.remove(file_path)
                print(f"Deleted file: {file_path}")
        except Exception as e:
            print(f"Error deleting {file_path}: {e}")
            
clear_directory('Data\Extracted')

Deleted file: Data\Extracted\sample.pdf
Deleted file: Data\Extracted\sample2.pdf
Deleted file: Data\Extracted\sample3.pdf
