# Resume Screening System

This notebook implements a beginner-friendly machine learning system that screens and ranks resumes against a job description. It covers:
1.  Data Loading & Exploration
2.  Text Cleaning & Preprocessing
3.  Job Description Parsing
4.  Similarity Scoring & Candidate Ranking
5.  Skill Extraction
6.  Gap Analysis
7.  Visualization

In [None]:
# Install the required libraries (pip skips any that are already installed).
# spacy is left out because nothing in this notebook imports it.
%pip install pandas numpy scikit-learn nltk matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import re
import string
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

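# Download NLTK data
nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases load the tokenizer models from 'punkt_tab' instead
nltk.download('punkt_tab')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize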

## 1. Data Loading

In [None]:
# Load the dataset
try:
    df = pd.read_csv('resume.csv')
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
    display(df.head())
except FileNotFoundError:
    print("Error: resume.csv not found. Please ensure the file exists in the current directory.")

## 2. Text Cleaning & Preprocessing
We clean the resume text to remove noise such as URLs, HTML tags, special characters, extra whitespace, and English stopwords.

In [None]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Lowercase the text, strip noise, and drop English stopwords."""
    if not isinstance(text, str):
        return ""

    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+', '', text)  # Remove URLs (http:// and https://)
    text = re.sub(r'<.*?>', '', text)    # Remove HTML tags
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove special chars
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra whitespace

    # Remove stopwords
    words = word_tokenize(text)
    filtered_words = [word for word in words if word not in STOP_WORDS]
    return " ".join(filtered_words)

# Apply cleaning
if 'Resume_str' in df.columns:
    df['cleaned_resume'] = df['Resume_str'].apply(clean_text)
elif 'Resume' in df.columns:
    df['cleaned_resume'] = df['Resume'].apply(clean_text)
else:
    # Fall back to the first column if neither expected name is present
    print("Neither 'Resume_str' nor 'Resume' column found. Using the first column as resume text.")
    df['cleaned_resume'] = df.iloc[:, 0].apply(clean_text)

print("Preprocessing completed.")
display(df[['cleaned_resume']].head())
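
To see what each cleaning step does, here is a small self-contained sketch that runs the same regular expressions on one made-up string. It is an illustration only: `demo_clean` and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stopword list and `word_tokenize`), and plain whitespace splitting replaces tokenization so the snippet runs without any downloads.

```python
import re

# Hypothetical mini stopword set, standing in for NLTK's English list
DEMO_STOP_WORDS = {"i", "a", "an", "the", "and", "in", "of", "with", "at", "see", "my"}

def demo_clean(text):
    """Apply the notebook's regex steps, then drop stopwords (simplified)."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)        # URLs
    text = re.sub(r'<.*?>', '', text)          # HTML tags
    text = re.sub(r'[^a-z0-9\s]', '', text)    # punctuation and symbols
    text = re.sub(r'\s+', ' ', text).strip()   # repeated whitespace
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

sample = "<b>Skills:</b> Python, C++ and scikit-learn. See my work at https://example.com/me"
cleaned_sample = demo_clean(sample)
print(cleaned_sample)  # skills python c scikitlearn work
```

Note that `C++` collapses to `c` and `scikit-learn` to `scikitlearn`, so any later keyword matching on cleaned text has to account for the stripped punctuation.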

## 3. Job Description Parsing
Enter a job description to compare resumes against.

In [None]:
# Example Job Description (You can change this)
job_description = """
We are looking for a Data Scientist with experience in Python, Machine Learning, and Deep Learning.
Key skills required: Pandas, NumPy, Scikit-Learn, TensorFlow, SQL, Data Visualization, Communication.
Experience with Natural Language Processing (NLP) is a plus.
"""

cleaned_jd = clean_text(job_description)
print("Cleaned Job Description:")
print(cleaned_jd)

## 4. Similarity Scoring
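We vectorize the job description (JD) and the resumes with TF-IDF, then use cosine similarity to score each resume against the JD. Scores range from 0 (no shared terms) to 1 (identical term weighting).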

In [None]:
# Combine JD with Resumes for Vectorization
all_docs = [cleaned_jd] + df['cleaned_resume'].tolist()

# Vectorize
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = tfidf_vectorizer.fit_transform(all_docs)

# Calculate Cosine Similarity
# The first vector (index 0) is the Job Description
cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

# Add scores to DataFrame
df['similarity_score'] = cosine_sim[0]

# Rank candidates
ranked_df = df.sort_values(by='similarity_score', ascending=False)
display(ranked_df[['Category', 'similarity_score']].head(10))
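
The scoring idea in the cell above can be sketched in plain Python on three toy documents. This is a simplified illustration, not what `TfidfVectorizer` does internally: `tfidf_vectors`, `cosine`, and the toy strings are hypothetical, tokenization is a plain whitespace split, and `max_features` is ignored. The IDF term uses the smoothed form that scikit-learn documents as its default, so treat the resulting numbers as indicative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_docs = [
    "python machine learning sql",    # stands in for the job description
    "python sql tableau reporting",   # toy resume A
    "java spring microservices",      # toy resume B
]
jd_vec, *resume_vecs = tfidf_vectors(toy_docs)
toy_scores = [cosine(jd_vec, r) for r in resume_vecs]
print([round(s, 3) for s in toy_scores])
```

Resume A shares `python` and `sql` with the toy job description and gets a positive score, while resume B shares no terms and scores exactly zero. This is why TF-IDF ranking rewards vocabulary overlap rather than meaning.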

## 5. Skill Extraction
Next, we identify which skills from a predefined list appear in each cleaned resume.

In [None]:
# Define a simple skill dictionary (expand as needed)
tech_skills = ['python', 'java', 'c++', 'sql', 'machine learning', 'deep learning', 'nlp',
               'pandas', 'numpy', 'scikit-learn', 'tensorflow', 'keras', 'pytorch', 'tableau', 'power bi']

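def extract_skills(text, skill_list):
    # The text has already been through clean_text, which strips punctuation,
    # so normalize each skill the same way ('scikit-learn' -> 'scikitlearn').
    # Caveat: 'c++' collapses to 'c' and is matched as a standalone token.
    found_skills = []
    for skill in skill_list:
        pattern = re.sub(r'[^a-z0-9\s]', '', skill.lower())
        # Word boundaries stop 'java' from matching inside 'javascript'
        if re.search(r'\b' + re.escape(pattern) + r'\b', text):
            found_skills.append(skill)
    return found_skills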

df['extracted_skills'] = df['cleaned_resume'].apply(lambda x: extract_skills(x, tech_skills))
display(df[['extracted_skills']].head())
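
When matching keywords against free text, a plain substring test and a whole-word test can disagree. The sketch below compares the two on a made-up cleaned resume; `substring_match`, `whole_word_match`, and `toy_resume` are hypothetical names used only for this illustration.

```python
import re

def substring_match(skill, text):
    """True if the skill appears anywhere, even inside a longer word."""
    return skill in text

def whole_word_match(skill, text):
    """True only if the skill appears as a complete word or phrase."""
    # \b anchors the match at word boundaries
    return re.search(r'\b' + re.escape(skill) + r'\b', text) is not None

toy_resume = "built javascript dashboards backed by mysql"

for skill in ["java", "sql", "javascript"]:
    print(skill, substring_match(skill, toy_resume), whole_word_match(skill, toy_resume))
# java True False
# sql True False
# javascript True True
```

The substring test reports `java` and `sql` even though the resume only mentions `javascript` and `mysql`, which would inflate a candidate's extracted skills.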

## 6. Gap Analysis
Identify which skills from the Job Description are missing in the top candidates.

In [None]:
jd_skills = extract_skills(cleaned_jd, tech_skills)

def find_missing_skills(resume_skills, required_skills):
    return [skill for skill in required_skills if skill not in resume_skills]

df['missing_skills'] = df['extracted_skills'].apply(lambda x: find_missing_skills(x, jd_skills))

# Display Top 5 Candidates with details
top_candidates = df.sort_values(by='similarity_score', ascending=False).head(5)
print(f"Required Skills: {jd_skills}")
display(top_candidates[['Category', 'similarity_score', 'extracted_skills', 'missing_skills']])
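
The gap computation is easy to check by hand on toy lists. The sketch below repeats the missing-skill logic and adds a coverage fraction; the `skill_gap` helper, the coverage number, and the toy lists are assumptions for illustration and are not part of the notebook's pipeline.

```python
def skill_gap(resume_skills, required_skills):
    """Return (missing skills in required order, fraction of required skills covered)."""
    have = set(resume_skills)
    missing = [s for s in required_skills if s not in have]
    if not required_skills:
        # Nothing is required, so the candidate trivially covers everything
        return missing, 1.0
    coverage = 1 - len(missing) / len(required_skills)
    return missing, coverage

toy_required = ["python", "sql", "machine learning", "tensorflow"]
toy_resume_skills = ["python", "sql", "tableau"]
toy_missing, toy_coverage = skill_gap(toy_resume_skills, toy_required)
print(toy_missing, toy_coverage)  # ['machine learning', 'tensorflow'] 0.5
```

Extra skills such as `tableau` do not affect the result, since only the skills named in the job description are counted.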

## 7. Visualization
Visualize the distribution of similarity scores.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['similarity_score'], bins=20, kde=True, color='skyblue')
plt.title('Distribution of Resume Similarity Scores')
plt.xlabel('Similarity Score')
plt.ylabel('Count')
plt.show()