# AI Resume Screening with NLP and Machine Learning

This Jupyter Notebook extracts text from PDF resumes, preprocesses it, and ranks the resumes by how closely they match a given job description, using **TF-IDF (Term Frequency-Inverse Document Frequency)** vectors and **cosine similarity**.


In [1]:
# Install the libraries this notebook imports (if not already installed)
%pip install PyMuPDF nltk scikit-learn

# Import necessary libraries
import os
import fitz  # PyMuPDF for PDF extraction
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download stopwords dataset
nltk.download("stopwords")

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aksha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
## Step 1: Extract Text from PDFs

# We define a function `extract_text_from_pdf` to extract text from resumes stored as PDFs.


def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF file using PyMuPDF (fitz)."""
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text("text") + "\n"
    return text

# Path to the resume PDF (replace with your own).
# A raw string needs single backslashes; doubling them keeps both characters.
cv_path = r"C:\Users\aksha\Downloads\Data_Analyst_Cv\Akshay-Bhujbal-Resume.pdf"

# Extract text from your resume
resume_text = extract_text_from_pdf(cv_path)

# Print the first 1000 characters of extracted text
print(resume_text[:1000])

Akshay Bhujbal
Pune | akshay.bhujbal16@gmail.com | 07499902809 | github.com/AkshayBhujbal1995
linkedin.com/in/akshay-1995-bhujbal
Professional Summary
• Data Analyst with expertise in data visualization, predictive analytics, and business intelligence.
• Proficient in Python, SQL, and Machine Learning, with a strong ability to extract meaningful insights from
complex datasets.
• Adept at designing automated dashboards, optimizing business processes, and leveraging AI techniques to
drive decision-making.
• Passionate about utilizing data-driven solutions to enhance business performance and efficiency.
Technologies
Programming Languages: Python, SQL, R
Data Analysis & Visualization: Pandas, NumPy, Matplotlib, Seaborn, Power BI, Tableau
Machine Learning & AI: Supervised and Unsupervised Learning, Neural Networks, TensorFlow, Scikit-learn
Statistical & Business Analytics: SAS (Base) Hypothesis Testing, Predictive Modeling, Anomaly Detection
Databases & ETL: SQL, MySQL, Data Warehousing, ET

In [3]:
## Step 2: Preprocess Text
"""
We define a function `preprocess_text` that:
- Converts text to lowercase
- Removes punctuation
- Removes stopwords (common words like "the", "is", "and" that don’t add much meaning)
"""

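# Build the stopword set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercases text, strips punctuation, and removes English stopwords."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # Remove stopwords
    return text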

# Example job description
job_description = """
We are looking for a Data Scientist with experience in Python, Machine Learning, NLP, and SQL.
"""

# Clean the texts
clean_resume_text = preprocess_text(resume_text)
clean_job_description = preprocess_text(job_description)

# Print cleaned text samples
print(clean_resume_text[:500])
print(clean_job_description)

akshay bhujbal pune akshaybhujbal16gmailcom 07499902809 githubcomakshaybhujbal1995 linkedincominakshay1995bhujbal professional summary data analyst expertise data visualization predictive analytics business intelligence proficient python sql machine learning strong ability extract meaningful insights complex datasets adept designing automated dashboards optimizing business processes leveraging ai techniques drive decisionmaking passionate utilizing datadriven solutions enhance business performan
looking data scientist experience python machine learning nlp sql
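
The regex in `preprocess_text` deletes punctuation outright, which is why the output above contains merged tokens such as `datadriven` and `akshaybhujbal16gmailcom`. A possible variant, shown here only as a sketch, replaces punctuation with a space so hyphenated terms split into separate words. The function name `preprocess_text_spaced` and the small `DEMO_STOPWORDS` set are assumptions made so the example runs without NLTK data; they are not part of the notebook.

```python
import re

# Assumption: a tiny hand-picked stopword set so the sketch runs without NLTK data.
# The notebook itself uses nltk's stopwords.words("english").
DEMO_STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "the", "we", "with"}

def preprocess_text_spaced(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, turn punctuation into spaces, and drop stopwords."""
    text = text.lower()
    # Replace each punctuation character with a space instead of deleting it,
    # so "data-driven" becomes "data driven" rather than "datadriven".
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)

print(preprocess_text_spaced("Data-driven decision-making with Python, SQL, and NLP."))
# -> data driven decision making python sql nlp
```

Switching to this variant would change the token set and therefore the match scores reported in the following steps.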


In [4]:
## Step 3: Calculate Resume Similarity Score

# We use **TF-IDF** to convert text into numerical vectors and 
# then measure similarity using cosine similarity.


# Convert job description and resumes to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([clean_job_description, clean_resume_text])

# Compute similarity (closer to 1 = better match)
similarity_score = cosine_similarity(vectors)[0][1]
print(f"Resume Match Score: {similarity_score:.2f}")


Resume Match Score: 0.31
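
To make the score less of a black box, here is a minimal pure-Python sketch of the same idea: weight each term by its count times a smoothed inverse document frequency, scale each document vector to unit length, and take the dot product. The IDF formula `ln((1 + n) / (1 + df)) + 1` follows what I understand to be scikit-learn's default settings; the helper names `tfidf_vectors` and `cosine` and the toy documents are hypothetical. Tokenisation is a plain whitespace split, so results can differ slightly from `TfidfVectorizer`, which by default also drops single-character tokens.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Toy documents (assumptions for illustration, already "preprocessed").
job, resume_a, resume_b = tfidf_vectors([
    "data scientist python sql",
    "data analyst python sql tableau",
    "graphic designer photoshop",
])
print(f"resume_a: {cosine(job, resume_a):.2f}")
print(f"resume_b: {cosine(job, resume_b):.2f}")  # no shared terms, so 0.00
```

A resume that shares no terms with the job description scores exactly 0, and a score of 1 would require the two weighted term distributions to point in the same direction.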


In [5]:
## Step 4: Process Multiple Resumes in a Folder

# This section processes multiple resumes from a folder and ranks them based on job match score.

# Folder containing resumes (replace with your own)
resume_folder = r"C:\Users\aksha\Downloads\Data_Analyst_Cv\resumes"

# Process multiple resumes
resume_scores = {}
for resume_file in os.listdir(resume_folder):
    if not resume_file.lower().endswith(".pdf"):
        continue  # Skip anything that is not a PDF
    resume_text = extract_text_from_pdf(os.path.join(resume_folder, resume_file))
    clean_resume_text = preprocess_text(resume_text)

    # Reuse the vectorizer fitted in Step 3: transform() keeps that vocabulary
    # and those IDF weights, so words unseen during fitting are ignored here
    vectors = vectorizer.transform([clean_job_description, clean_resume_text])
    similarity_score = cosine_similarity(vectors)[0][1]

    resume_scores[resume_file] = similarity_score

# Sort resumes by highest match score
sorted_resumes = sorted(resume_scores.items(), key=lambda x: x[1], reverse=True)

# Print ranked resumes
for resume, score in sorted_resumes:
    print(f"{resume}: {score:.2f}")


Akshay-Bhujbal_Data-Analyst_CV..pdf: 0.27
Akshay_Bhujbal_DA_CV.pdf: 0.26
AkshayBhujbalResume.pdf: 0.25
Akshay_Bhujbal_Data_Analyst_Resume.pdf: 0.23
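
When a resume folder may also hold other files or subfolders, one option is to collect the PDF paths up front with `pathlib` and pass only those to `extract_text_from_pdf`. The helper below is a sketch under that assumption; the name `list_pdf_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

def list_pdf_files(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        path for path in Path(folder).iterdir()
        # Keep regular files only and match the extension case-insensitively.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned `Path` can be handed straight to `fitz.open`, and `path.name` gives the file name to use as the key in `resume_scores`.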
