# 📂 Resume Classifier with NLP

## 📘 1. Project Overview

This project aims to build a model capable of classifying resumes into job-related categories using Natural Language Processing (NLP). It will use supervised learning to predict the appropriate category of a resume based solely on its text content.

## 🎯 2. Objectives

- Load and explore a dataset of resumes
- Clean and preprocess the text data
- Vectorize the resumes into numerical features
- Train a classification model
- Evaluate model performance
- Visualize and interpret key results

## 🧾 3. Dataset Information

- **Source**: [Link to dataset](https://www.kaggle.com/datasets/youssefkhalil/resumes-images-datasets/data)
- **Fields**:
  - `Category`: Resume category label (e.g., IT, HR, Sales)
  - `Text`: Full resume content

In [4]:
import os
import kagglehub

# Expected kagglehub cache location for this dataset
# (expanduser keeps the path independent of the current working directory)
dataset_dir = os.path.expanduser("~/.cache/kagglehub/datasets/youssefkhalil/resumes-images-datasets/versions/1")

# Download the dataset only if it is not already cached
if not os.path.exists(dataset_dir):
    # dataset_download returns the local path; reuse it so later cells
    # do not depend on the hard-coded version number above
    dataset_dir = kagglehub.dataset_download("youssefkhalil/resumes-images-datasets")
    print("Dataset downloaded to:", dataset_dir)
else:
    print("Dataset already exists at:", dataset_dir)

Dataset already exists at: C:\Users\maxim/.cache/kagglehub/datasets/youssefkhalil/resumes-images-datasets/versions/1


In [6]:
import pandas as pd
df = pd.read_csv(os.path.join(dataset_dir, "resumes.csv"))

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\maxim/.cache/kagglehub/datasets/youssefkhalil/resumes-images-datasets/versions/1\\resumes.csv'
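
The `FileNotFoundError` above suggests that `resumes.csv` is not at the top level of the dataset directory. Before guessing another file name, it may help to list what the download actually contains. The helper below is a minimal sketch; `list_dataset_files` is a hypothetical name introduced for illustration, and the path is the one used in the cell above.

```python
from pathlib import Path


def list_dataset_files(root, limit=20):
    """Return up to `limit` file paths under `root`, relative to it, sorted."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    files = sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file())
    return files[:limit]


# Same location as `dataset_dir` in the download cell
dataset_dir = "~/.cache/kagglehub/datasets/youssefkhalil/resumes-images-datasets/versions/1"
for rel_path in list_dataset_files(dataset_dir):
    print(rel_path)
```

Once the real file name and subfolder are known, the `pd.read_csv` call can point at that path instead.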

## 🔍 4. Initial Exploration

Here we inspect the size of the dataset and its class distribution, then look at a few examples.
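
A first pass at these checks could look like the sketch below. Because the real file has not been loaded yet, it uses a tiny made-up DataFrame with the `Category` and `Text` columns described in section 3; the rows and the `n_words` column are illustrative assumptions, not data from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (assumption for illustration)
df = pd.DataFrame({
    "Category": ["IT", "HR", "IT", "Sales"],
    "Text": [
        "Python developer with SQL experience",
        "Recruiter focused on hiring and onboarding",
        "DevOps engineer, Linux and Docker",
        "Sales manager with negotiation skills",
    ],
})

# Size of the dataset: (rows, columns)
print(df.shape)

# Class distribution: number of resumes per category
class_counts = df["Category"].value_counts()
print(class_counts)

# Rough length of each resume in words, summarized per category
df["n_words"] = df["Text"].str.split().str.len()
print(df.groupby("Category")["n_words"].describe())

# A few example rows
print(df.head())
```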

## 🧹 5. Text Cleaning & Preprocessing

- Lowercasing
- Removing punctuation
- Removing stopwords
- Stemming or Lemmatization
- Tokenization
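
The steps above can be sketched with the standard library alone. This is a minimal illustration, not a full pipeline: the `STOPWORDS` set is a short hand-picked list (an assumption; a real run would likely use a library-provided list), and stemming or lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import string

# Small hand-picked stopword list, for illustration only
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "with", "for", "on", "is", "are"}

# Translation table that maps every ASCII punctuation character to a space
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_text("Experienced in Python, SQL and the Cloud!"))
# ['experienced', 'python', 'sql', 'cloud']
```

One caveat for resumes: blanket punctuation removal turns skills such as `C++` and `C#` into a bare `c`, so those tokens may need special handling.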

## 🔡 6. Vectorization

We convert text data into numerical format using:
- `CountVectorizer` or
- `TfidfVectorizer`
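
The difference between the two vectorizers is easiest to see on a toy corpus. The sketch below uses scikit-learn with default settings; the three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up, already-cleaned documents (assumption for illustration)
docs = [
    "python developer with sql experience",
    "recruiter with hiring and onboarding experience",
    "sales manager with negotiation experience",
]

# Raw term counts per document
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# Counts reweighted so that terms found in many documents count for less
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(X_counts.shape, X_tfidf.shape)

# Compare a word shared by all documents with a word unique to the first one
first_doc = X_tfidf[0].toarray().ravel()
print("with:  ", first_doc[tfidf_vec.vocabulary_["with"]])
print("python:", first_doc[tfidf_vec.vocabulary_["python"]])
```

In the first document both words occur once, so `CountVectorizer` treats them equally, while `TfidfVectorizer` gives `python` a higher weight than `with` because `with` appears in every document.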

## 🤖 7. Model Training

We train and test several models:
- Naive Bayes
- Logistic Regression
- Support Vector Machine (optional)
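
One way to compare these models is to wrap each in a scikit-learn pipeline with the same vectorizer and the same train/test split. The sketch below assumes TF-IDF features and uses a made-up twelve-document corpus, so the scores it prints say nothing about performance on real resumes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up corpus: four short "resumes" per category (assumption for illustration)
texts = [
    "python developer sql backend", "java engineer cloud services",
    "devops linux docker python", "software engineer python testing",
    "recruiter hiring onboarding", "hr manager payroll benefits",
    "talent acquisition recruiter interviews", "hr specialist employee relations",
    "sales manager negotiation accounts", "retail sales associate customers",
    "account executive sales pipeline", "sales representative leads targets",
]
labels = ["IT"] * 4 + ["HR"] * 4 + ["Sales"] * 4

# Stratified split keeps every category present in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
}

fitted = {}
scores = {}
for name, clf in models.items():
    # The vectorizer is fit inside the pipeline, on training data only
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(X_train, y_train)
    fitted[name] = pipe
    scores[name] = pipe.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

Keeping the vectorizer inside the pipeline avoids leaking test-set vocabulary statistics into training.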

## 📈 8. Evaluation

We use:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
- Classification Report
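
The metrics above are all available in `sklearn.metrics`. The sketch below computes them on a small set of made-up true and predicted labels, chosen so the numbers are easy to verify by hand.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels (assumption for illustration): 4 of 6 predictions are correct
y_true = ["IT", "IT", "HR", "HR", "Sales", "Sales"]
y_pred = ["IT", "HR", "HR", "HR", "Sales", "IT"]
class_names = ["HR", "IT", "Sales"]

# Fraction of predictions that match the true label
acc = accuracy_score(y_true, y_pred)
print("Accuracy:", acc)

# Rows are true classes, columns are predicted classes, in `class_names` order
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)

# Per-class precision, recall, and F1-score in one table
print(classification_report(y_true, y_pred, labels=class_names, zero_division=0))
```

Here the confusion matrix shows one IT resume predicted as HR and one Sales resume predicted as IT, which accuracy alone would not reveal.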

## 🔍 9. Interpretation

We explore the most relevant words per class and potential model biases.

## 📊 10. Visualization

Graphs that help communicate the model’s performance and dataset structure.

## ✅ 11. Conclusions

Key takeaways, strengths, limitations, and potential improvements.

## 🚀 12. Next Steps (Optional)

Ideas for future improvements:
- Use real resumes or larger datasets
- Deploy the model in a web app with Streamlit
- Add resume parsing (PDF, DOCX)