**Project objectives:**

*   Automatically classify resumes into predefined job categories.
*   Extract insights such as experience level and relevant skills.
*   Speed up recruitment by reducing manual screening of resumes.
*   Provide a scalable and accurate AI-based resume screening system using a pre-trained language model (DistilBERT).

## Import Libraries

In [1]:
import pandas as pd
import re
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from torch.utils.data import DataLoader
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset



## Import Data

In [2]:

df = pd.read_csv('UpdatedResumeDataSet.csv')
df = df[['Resume', 'Category']]

## Clean Resumes

In [3]:
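def cleanResume(txt):
    """Strip URLs, hashtags, mentions, punctuation and non-ASCII characters from resume text."""
    txt = re.sub(r'http\S+\s?', ' ', txt)      # URLs
    txt = re.sub(r'\b(?:RT|cc)\b', ' ', txt)   # standalone "RT"/"cc" tokens only, not letters inside words such as "account"
    txt = re.sub(r'#\S+\s?', ' ', txt)         # hashtags
    txt = re.sub(r'@\S+', ' ', txt)            # mentions
    # Raw string avoids the invalid-escape warning triggered by "\]"
    txt = re.sub(r'[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', txt)  # punctuation
    txt = re.sub(r'[^\x00-\x7f]', ' ', txt)    # non-ASCII characters
    txt = re.sub(r'\s+', ' ', txt)             # collapse runs of whitespace
    return txt.strip()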

df['Resume'] = df['Resume'].apply(cleanResume)
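
To see why the pattern used for `RT` and `cc` matters, the following minimal sketch compares an unanchored `RT|cc` with a word-bounded variant on a made-up sentence. The helper names (`strip_tokens_unanchored`, `strip_tokens_bounded`) and the sample text are hypothetical and exist only for this illustration.

```python
import re

def _squash(text: str) -> str:
    # Collapse runs of whitespace, as the final cleaning step does
    return re.sub(r"\s+", " ", text).strip()

def strip_tokens_unanchored(text: str) -> str:
    # Matches "RT" or "cc" anywhere, including inside longer words
    return _squash(re.sub(r"RT|cc", " ", text))

def strip_tokens_bounded(text: str) -> str:
    # \b restricts the match to standalone "RT" / "cc" tokens
    return _squash(re.sub(r"\b(?:RT|cc)\b", " ", text))

sample = "RT cc managed account access successfully"  # made-up sample text
print(strip_tokens_unanchored(sample))
print(strip_tokens_bounded(sample))
```

The unanchored version splits words such as "account" in two, which changes the tokens the model later sees; the word-bounded version leaves them intact.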



## Encode Categories

In [4]:

le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])
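
As an illustration of what the encoding step produces, this pure-Python sketch mimics the mapping that `LabelEncoder` builds: the distinct labels are sorted and each one is assigned its index. The helper `fit_label_mapping` and the sample categories are hypothetical; scikit-learn's own implementation is NumPy-based and is not reproduced here.

```python
def fit_label_mapping(labels):
    """Return (classes, to_id): sorted distinct labels and a label -> index dict."""
    classes = sorted(set(labels))
    to_id = {name: i for i, name in enumerate(classes)}
    return classes, to_id

# Made-up sample categories for illustration
categories = ["Java Developer", "Data Science", "HR", "Data Science"]

classes, to_id = fit_label_mapping(categories)
encoded = [to_id[c] for c in categories]   # analogous to fit_transform
decoded = [classes[i] for i in encoded]    # analogous to inverse_transform

print(classes)
print(encoded)
print(decoded == categories)
```

Because the ids depend on the sorted set of labels seen at fit time, the same fitted encoder has to be reused at prediction time to map ids back to names.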

## Train/Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    df['Resume'], df['Category'], test_size=0.2, random_state=42, stratify=df['Category']
)
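
The effect of `stratify` can be pictured with a small standard-library sketch: rows are grouped by class and the test fraction is drawn from each group separately, so class proportions are roughly preserved. This is an illustrative approximation under simple assumptions, not scikit-learn's algorithm; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, imbalanced labels: 50 rows of class "A" and 10 of class "B"
labels = ["A"] * 50 + ["B"] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Without stratification, a rare category could end up with very few or no rows in the test split, which would make its per-class metrics unreliable.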

## Prepare Dataset for Transformers

In [6]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

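def tokenize(batch):
    # Pad every example to max_length so all rows have the same shape,
    # regardless of how Dataset.map groups them into batches
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=512)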

train_dataset = Dataset.from_dict({'text': X_train.tolist(), 'label': y_train.tolist()})
test_dataset = Dataset.from_dict({'text': X_test.tolist(), 'label': y_test.tolist()})

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
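
What padding and truncation do to a sequence of token ids can be shown with a toy function. This is a simplified sketch, not the tokenizer's implementation: the real tokenizer also keeps the closing special token when it truncates. The function `pad_or_truncate`, the token ids, and the pad id of 0 are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions, which is why both `input_ids` and `attention_mask` are kept as tensor columns.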



## Load Pretrained DistilBERT

In [7]:

num_labels = len(le.classes_)
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=num_labels
)



## Training Arguments

In [8]:
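training_args = TrainingArguments(
    output_dir="./bert_resume_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,   # log the training loss every 100 optimizer steps
    save_steps=500,      # write a checkpoint every 500 optimizer steps
    report_to="none",    # turn off external loggers such as Weights & Biases
)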
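
It can help to check how `logging_steps` and `save_steps` relate to the length of the run. The sketch below estimates the number of optimizer steps, assuming a single device, no gradient accumulation, and a training split of 769 rows (the size of the 80% split in the original run). The function name `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    # The last, partially filled batch of each epoch still counts as one step
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(n_examples=769, batch_size=8, epochs=3)
log_events = steps // 100        # logging_steps=100
checkpoints = steps // 500       # save_steps=500

print(steps, log_events, checkpoints)
```

Under these assumptions the run is shorter than 500 steps, so no intermediate checkpoint would be written and the model is only persisted by the explicit save at the end of the notebook.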


## Trainer

In [9]:
# Fine-tune on the training split; the held-out test split is used for evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [10]:
import os
os.environ["WANDB_DISABLED"] = "true"


In [None]:
trainer.train()




## Evaluate

In [None]:
preds_output = trainer.predict(test_dataset)
y_pred = preds_output.predictions.argmax(-1)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
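
For readers unfamiliar with the layout of the confusion matrix, here is a small standard-library sketch of how the counts are formed, following the scikit-learn convention that rows are true labels and columns are predicted labels. The helpers `confusion_counts` and `accuracy` and the toy label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """matrix[t][p] = number of rows with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    # Fraction of rows where the prediction equals the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for three classes
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

matrix = confusion_counts(toy_true, toy_pred, n_classes=3)
for row in matrix:
    print(row)
print(accuracy(toy_true, toy_pred))
```

Off-diagonal entries show which categories are confused with each other, which is often more informative than the overall accuracy when there are many job categories.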

## Save

In [None]:
trainer.save_model("resume_bert_model")
tokenizer.save_pretrained("resume_bert_model")

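import pickle

# Persist the fitted LabelEncoder so predicted ids can be mapped back to category names
with open("label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)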
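
A later session could restore the saved artifacts along the following lines. This is a sketch that assumes the directory and file names used in the save step and that `transformers` and scikit-learn are installed; the function name `load_artifacts` is hypothetical.

```python
import pickle

def load_artifacts(model_dir="resume_bert_model", encoder_path="label_encoder.pkl"):
    """Reload the fine-tuned model, its tokenizer and the fitted label encoder."""
    # Imported here so that defining the function does not itself require transformers
    from transformers import (
        DistilBertForSequenceClassification,
        DistilBertTokenizerFast,
    )

    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_dir)
    with open(encoder_path, "rb") as f:
        label_encoder = pickle.load(f)  # unpickling needs scikit-learn importable

    model.eval()  # inference mode: disables dropout
    return model, tokenizer, label_encoder
```

Calling `model, tokenizer, le = load_artifacts()` would then give the three objects that the prediction step relies on. Pickle files should only be loaded from a trusted source.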


In [None]:
# Create a ZIP archive of the entire model directory
!zip -r /content/resume_bert_model.zip /content/resume_bert_model/

**Why DistilBERT:**


1.   Lightweight and fast: a distilled, smaller version of BERT, which makes it practical for production inference
2.   Strong at text classification
3.   Pre-trained contextual embeddings capture the meaning of words in context
4.   Easy to fine-tune with Hugging Face tools

## Predict


**A quick prediction on a sample resume, as a sanity check of the fine-tuned model**

In [None]:
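def predict_category(text: str) -> str:
    # Apply the same cleaning that was used on the training data
    text = cleanResume(text)

    inputs = tokenizer(
        text,
        truncation=True,
        padding=True,
        max_length=512,  # match the max_length used for training
        return_tensors="pt"
    )
    # Move the inputs to the model's device (the Trainer may have placed the model on a GPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=1).item()

    # Convert numeric label to original category
    return le.inverse_transform([pred_id])[0]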

In [None]:
myresume = """
I am a data scientist with experience in machine learning, deep learning,
computer vision, and NLP. Skilled in Python, PyTorch, and TensorFlow.
"""

category = predict_category(myresume)
print("Predicted Category:", category)
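
A single predicted label hides how confident the model is. The sketch below shows, with the standard library only, how logits can be turned into probabilities with a softmax and ranked to give the top candidates. The logits, the category names, and the helpers `softmax` and `top_k` are made up for illustration; in the notebook the same idea would likely be applied to `outputs.logits` with `torch.softmax`.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, class_names, k=3):
    """Return the k (name, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits and category names
toy_logits = [2.0, 0.5, -1.0]
toy_names = ["Data Science", "Java Developer", "HR"]

for name, prob in top_k(toy_logits, toy_names, k=2):
    print(f"{name}: {prob:.3f}")
```

Showing the runner-up categories with their probabilities makes it easier to spot resumes where the model is unsure and a manual check is worthwhile.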
