<a href="https://colab.research.google.com/github/Anish32/Ai-Resume-Screener/blob/main/AI_Resume_Screener.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AI Resume Screening with NLP and Machine Learning

This Jupyter Notebook extracts text from resumes (PDFs), preprocesses the text, and ranks resumes based on how well they match a given job description using **TF-IDF (Term Frequency - Inverse Document Frequency)** and **cosine similarity**.
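
To make the scoring idea concrete before the cells below, here is a minimal from-scratch sketch of TF-IDF weighting followed by cosine similarity. It assumes whitespace tokenisation and the smoothed IDF formula that scikit-learn documents as its default; the helper names `tfidf_vectors` and `cosine` and the toy texts are illustrative only, and the notebook itself relies on `TfidfVectorizer` rather than this code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

# Toy texts (hypothetical): one job description and two resumes.
job, resume_a, resume_b = tfidf_vectors([
    "python machine learning sql",
    "python sql power bi",
    "html css web design",
])
score_a = cosine(job, resume_a)
score_b = cosine(job, resume_b)
print(f"resume_a: {score_a:.2f}, resume_b: {score_b:.2f}")
```

A resume that shares no terms with the job description scores exactly 0, while shared terms raise the score toward 1.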


In [1]:
# Install necessary libraries (if not already installed)
%pip install pandas numpy PyMuPDF nltk scikit-learn

# Import necessary libraries
import os
import fitz  # PyMuPDF for PDF extraction
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download stopwords dataset
nltk.download("stopwords")

Collecting PyMuPDF
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 38.4 MB/s eta 0:00:00
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.3


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
from google.colab import files
import fitz  # PyMuPDF

# Upload the resume PDF
uploaded = files.upload()

# Extract text from uploaded PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Get the uploaded file name
for file_name in uploaded.keys():
    resume_text = extract_text_from_pdf(file_name)
    print(resume_text[:1000])  # Print first 1000 characters


Saving pindukuru_anish.pdf to pindukuru_anish.pdf
    PINDUKURU ANISH REDDY      
    Phone: +91 9440159356                    
    E-mail: anishpindukuru@gmail.com  
LinkedIn: https://www.linkedin.com/in/anish-reddy-43a065258/ 
GitHub: https://github.com/Anish32 
 
 
Detail-oriented and results-driven Data Analyst with hands-on experience in data cleaning, analysis, visualization, and 
web development. Proficient in tools like SQL, Power BI, HTML, CSS, and MS Office, with a strong foundation in data 
science principles. Eager to apply analytical thinking and technical skills to extract actionable insights and support data- 
driven decision-making. Seeking a challenging position as a Data Analyst or Data Scientist to contribute to business 
growth while continuously learning and growing in a dynamic environment. 
ACADEMIC QUALIFICATION 
 
 
Qualification 
Institution 
Year 
Percentage/ CGPA 
B. Tech (CSE 
DATA 
SCIENCE) 
 Audisankara College Of Engineering & 
Technology Bachelor of Tec

In [7]:
## Step 2: Preprocess Text
"""
We define a function `preprocess_text` that:
- Converts text to lowercase
- Removes punctuation
- Removes stopwords (common words like "the", "is", "and" that don’t add much meaning)
"""

# Build the stopword set once; calling stopwords.words() for every token is slow.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    """Cleans and preprocesses text."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)  # Remove stopwords
    return text

# Example job description
job_description = """
We are looking for a Data Scientist with experience in Python, Machine Learning, NLP, and SQL.
"""

# Clean the texts
clean_resume_text = preprocess_text(resume_text)
clean_job_description = preprocess_text(job_description)

# Print cleaned text samples
print(clean_resume_text[:500])
print(clean_job_description)

pindukuru anish reddy phone 91 9440159356 email anishpindukurugmailcom linkedin httpswwwlinkedincominanishreddy43a065258 github httpsgithubcomanish32 detailoriented resultsdriven data analyst handson experience data cleaning analysis visualization web development proficient tools like sql power bi html css ms office strong foundation data science principles eager apply analytical thinking technical skills extract actionable insights support data driven decisionmaking seeking challenging position
looking data scientist experience python machine learning nlp sql
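
The cleaned output above shows a side effect of deleting punctuation outright: tokens fuse together, as in `anishpindukurugmailcom` and `detailoriented`. One possible variant, sketched here under the assumption that splitting on punctuation is preferable, substitutes a space instead. The function name `preprocess_text_spaced` and the tiny stopword set are hypothetical stand-ins so the example runs without NLTK.

```python
import re

# Tiny illustrative stopword set (an assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "with", "for", "to", "of"}

def preprocess_text_spaced(text):
    """Lowercase, turn punctuation into spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # Replace punctuation with a space
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

cleaned = preprocess_text_spaced("Detail-oriented analyst; e-mail: someone@example.com")
print(cleaned)
```

With this variant, "detail-oriented" becomes the two tokens "detail" and "oriented", which TF-IDF can then match against a job description individually.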


In [8]:
## Step 3: Calculate Resume Similarity Score

# TF-IDF converts each text into a numerical vector;
# cosine similarity then measures how closely the two vectors align.


# Convert job description and resumes to TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([clean_job_description, clean_resume_text])

# Compute similarity (closer to 1 = better match)
similarity_score = cosine_similarity(vectors)[0][1]
print(f"Resume Match Score: {similarity_score:.2f}")


Resume Match Score: 0.26
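
The cell above scores a single resume. A minimal sketch of how the same approach could rank several resumes against one job description follows; the function name `rank_resumes`, the file names, and the toy texts are hypothetical, and the inputs are assumed to be already cleaned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_resumes(job_text, resumes):
    """resumes maps a name to cleaned text; returns (name, score) pairs, best first."""
    names = list(resumes)
    # Fit one vocabulary over the job description and all resumes together.
    matrix = TfidfVectorizer().fit_transform([job_text] + [resumes[n] for n in names])
    # Row 0 is the job description; the remaining rows are the resumes.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

ranking = rank_resumes(
    "data scientist python machine learning nlp sql",
    {
        "a.pdf": "data analyst sql power bi html css",
        "b.pdf": "python machine learning nlp engineer",
        "c.pdf": "graphic designer photoshop illustrator",
    },
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because IDF depends on every document in the batch, scores from this sketch are comparable only within a single call, not across separate runs with different resume sets.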


In [11]:
import os
import fitz  # PyMuPDF
from google.colab import files

# Step 1: Upload multiple resumes
uploaded = files.upload()

# Step 2: Define resume text extraction
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Step 3: Simple text preprocessing (add more as needed)
def preprocess_text(text):
    return text.lower().replace('\n', ' ')

# Step 4: Simulated AI resume screening (you can replace this with an ML model)
def score_resume(text):
    # Dummy logic: total keyword occurrences (just an example).
    # str.count matches substrings, so 'ai' is also counted inside words such as "gmail" or "detail".
    keywords = ['python', 'machine learning', 'data', 'ai', 'llm']
    score = sum(text.count(keyword) for keyword in keywords)
    return score

# Step 5: Process all uploaded resumes
resume_scores = {}

for file_name in uploaded.keys():
    resume_text = extract_text_from_pdf(file_name)
    clean_text = preprocess_text(resume_text)
    score = score_resume(clean_text)
    resume_scores[file_name] = score

# Step 6: Display scores
for resume, score in resume_scores.items():
    print(f"Resume: {resume} | Score: {score}")


Saving pindukuru_anish.pdf to pindukuru_anish (1).pdf
Resume: pindukuru_anish (1).pdf | Score: 35
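
A score of 35 from this dummy logic is likely inflated, because `str.count` counts substrings: the keyword `ai` also occurs inside "gmail" and "detail", both of which appear in the extracted resume text. A minimal sketch of whole-word counting with regular-expression word boundaries is shown below; the helper name `count_keyword` and the sample sentence are illustrative assumptions.

```python
import re

def count_keyword(text, keyword):
    """Count occurrences of keyword as a whole word or phrase."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text.lower()))

sample = "detail-oriented data analyst, e-mail via gmail, building ai tools"
substring_hits = sample.count("ai")           # also matches inside other words
whole_word_hits = count_keyword(sample, "ai")  # matches the standalone token only
print(substring_hits, whole_word_hits)
```

Note that `\b` treats hyphens as boundaries, so a keyword such as "scikit-learn" still matches as a phrase.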


In [14]:
%pip install streamlit PyPDF2

Collecting streamlit
  Downloading streamlit-1.47.1-py3-none-any.whl.metadata (9.0 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit)
  Downloading watchdog-6.0.0-py3-none-manylinux2014_x86_64.whl.metadata (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.3/44.3 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Downloading streamlit-1.47.1-py3-none-any.whl (9.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.9/9.9 MB 51.8 MB/s eta 0:00:00
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 232.6/232.6 kB 20.4 MB/s eta 0:00:00
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)

In [26]:

import fitz  # PyMuPDF
from google.colab import files

# Step 1: Upload Resume
uploaded = files.upload()

# Step 2: Load and extract PDF text
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Step 3: Preprocess text (basic)
def preprocess_text(text):
    return text.lower().replace('\n', ' ').replace('\r', ' ').strip()

# Step 4: Define Job Description Keywords
job_description = """
We are looking for a Machine Learning Engineer with experience in Python, data analysis,
TensorFlow or PyTorch, and experience working with large language models (LLMs).
Knowledge of NLP, Scikit-learn, SQL, cloud platforms (AWS or GCP), and Git is a plus.
"""

job_keywords = [
    "machine learning", "python", "data analysis", "tensorflow", "pytorch",
    "llm", "nlp", "scikit-learn", "sql", "aws", "gcp", "git"
]

# Step 5: Score Resume against keywords
def score_resume(resume_text, keywords):
    score = 0
    found = []
    missing = []

    for keyword in keywords:
        # Plain substring test, so "git" also matches inside "github".
        if keyword in resume_text:
            found.append(keyword)
            score += 1
        else:
            missing.append(keyword)

    return {
        "score": score,
        "total": len(keywords),
        # Guard against an empty keyword list to avoid dividing by zero.
        "percentage": round((score / len(keywords)) * 100, 2) if keywords else 0.0,
        "found_keywords": found,
        "missing_keywords": missing
    }

# Step 6: Run ATS Checker
for file_name in uploaded.keys():
    resume_raw = extract_text_from_pdf(file_name)
    resume_clean = preprocess_text(resume_raw)
    results = score_resume(resume_clean, job_keywords)

    # Step 7: Print Report
    print(f"\n--- ATS Resume Check for: {file_name} ---")
    print(f"Score: {results['score']} / {results['total']} ({results['percentage']}%)")
    print(f"✅ Found Keywords: {', '.join(results['found_keywords'])}")
    print(f"❌ Missing Keywords: {', '.join(results['missing_keywords'])}")


Saving Anish_Resume.pdf to Anish_Resume.pdf

--- ATS Resume Check for: Anish_Resume.pdf ---
Score: 6 / 12 (50.0%)
✅ Found Keywords: machine learning, python, data analysis, scikit-learn, sql, git
❌ Missing Keywords: tensorflow, pytorch, llm, nlp, aws, gcp


In [24]:
!wget -q -O - ipv4.icanhazip.com

35.232.111.43


In [25]:
# Assumes app.py already exists in the working directory; no cell in this notebook creates it.
! streamlit run app.py --server.port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://35.232.111.43:8501

  Stopping...
  Stopping...
