## Import Libraries

**The objectives of this project are:**


*   To automatically classify resumes into predefined job categories.

*   To extract insights like experience level and relevant skills.

*   To speed up recruitment by reducing manual screening of resumes.

*   To provide a scalable and accurate AI-based resume screening system using a pre-trained language model (DistilBERT).



In [1]:
import pandas as pd
import re
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from torch.utils.data import DataLoader
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset



## Load Data

In [2]:
# Load the resume dataset and keep only the text and label columns
df = pd.read_csv('UpdatedResumeDataSet.csv')
df = df[['Resume', 'Category']]

## Clean Resumes

In [3]:
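def cleanResume(txt):
    """Strip URLs, hashtags, mentions, punctuation, and non-ASCII characters."""
    txt = re.sub(r'http\S+\s?', ' ', txt)        # URLs
    txt = re.sub(r'\b(?:RT|cc)\b', ' ', txt)     # standalone "RT"/"cc" only, so "account" stays intact
    txt = re.sub(r'#\S+\s?', ' ', txt)           # hashtags
    txt = re.sub(r'@\S+', ' ', txt)              # mentions
    # Raw string avoids the invalid-escape warning for "\]"
    txt = re.sub(r'[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', txt)
    txt = re.sub(r'[^\x00-\x7f]', ' ', txt)      # non-ASCII characters
    txt = re.sub(r'\s+', ' ', txt)               # collapse runs of whitespace
    return txt.strip()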

df['Resume'] = df['Resume'].apply(cleanResume)
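
A quick way to sanity-check a cleaning function like this is to run it on a hand-written string. The sketch below uses a standalone copy of the same regex steps under the hypothetical name `clean_text`. It makes one assumption that may differ from the notebook's function: the `RT|cc` pattern is wrapped in word boundaries, because an unbounded pattern also matches inside ordinary words.

```python
import re
import string

def clean_text(txt: str) -> str:
    """Standalone copy of the cleaning steps (hypothetical name, for illustration)."""
    txt = re.sub(r'http\S+\s?', ' ', txt)            # URLs
    txt = re.sub(r'\b(?:RT|cc)\b', ' ', txt)         # assumption: whole-word matches only
    txt = re.sub(r'#\S+\s?', ' ', txt)               # hashtags
    txt = re.sub(r'@\S+', ' ', txt)                  # mentions
    txt = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', txt)
    txt = re.sub(r'[^\x00-\x7f]', ' ', txt)          # non-ASCII characters
    txt = re.sub(r'\s+', ' ', txt)                   # collapse whitespace
    return txt.strip()

sample = "RT @jane: Accounting lead, see https://example.com/cv #hiring café"
cleaned = clean_text(sample)
print(cleaned)        # Accounting lead see caf

# An unbounded pattern splits words that merely contain "cc"
mangled = re.sub(r'RT|cc', ' ', 'Accounting')
print(repr(mangled))  # 'A ounting'
```

The output also shows a side effect of the non-ASCII step: accented letters are dropped, so "café" becomes "caf".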


## Encode Categories

In [4]:

le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])
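
`LabelEncoder` sorts the distinct category names and assigns each one its index in that sorted order. The sketch below shows the mapping on a few toy labels (illustrative values, not the dataset's full category list). The variable `encoder` stands in for the notebook's `le`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Category' column (illustrative values)
categories = ["Data Science", "HR", "Data Science", "Testing"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ is sorted, so the integer code is the index into this array
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)          # {'Data Science': 0, 'HR': 1, 'Testing': 2}
print(codes.tolist())   # [0, 1, 0, 2]

# inverse_transform recovers the original names from integer codes
restored = encoder.inverse_transform(codes).tolist()
print(restored == categories)  # True
```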

## Train/Test Split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    df['Resume'], df['Category'], test_size=0.2, random_state=42, stratify=df['Category']
)
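
A stratified split keeps class proportions equal across the two splits, but it does not guard against identical resumes landing in both. If the dataset contains duplicate rows (not verified here), test accuracy would overstate how well the model generalizes. Below is a minimal sketch of an overlap check, using short placeholder strings and a hypothetical helper `split_overlap`. In the notebook the inputs would be `X_train` and `X_test`.

```python
def split_overlap(train_texts, test_texts):
    """Return the test texts that also occur verbatim in the training texts."""
    train_set = set(train_texts)
    return [t for t in test_texts if t in train_set]

# Placeholder data for illustration only
train_texts = ["python sql pandas", "java spring", "python sql pandas"]
test_texts = ["java spring", "network security"]

leaked = split_overlap(train_texts, test_texts)
print(f"{len(leaked)} of {len(test_texts)} test texts also appear in train")
# prints: 1 of 2 test texts also appear in train
```

If the check finds overlap, calling `df.drop_duplicates(subset='Resume')` before the split is one way to remove exact duplicates.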

## Prepare Dataset for Transformers

In [6]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True, max_length=512)

train_dataset = Dataset.from_dict({'text': X_train.tolist(), 'label': y_train.tolist()})
test_dataset = Dataset.from_dict({'text': X_test.tolist(), 'label': y_test.tolist()})

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
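
To make the `padding=True, truncation=True, max_length=512` arguments concrete, here is a plain-Python illustration of what they do to a batch of token-id lists. It is a simplified sketch, not the tokenizer's actual implementation: the hypothetical helper `pad_and_truncate` ignores special tokens such as `[CLS]` and `[SEP]` and assumes a padding id of 0.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length=4 stands in for 512
out = pad_and_truncate([[7, 8, 9, 10, 11, 12], [5, 6]], max_length=4)
print(out["input_ids"])       # [[7, 8, 9, 10], [5, 6, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Resumes longer than the limit lose everything past the cut, so any signal near the end of a long resume is not seen by the model.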


The training split contains 769 examples and the test split contains 193.

## Load Pretrained DistilBERT

In [7]:

num_labels = len(le.classes_)
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=num_labels
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
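
The classification head predicts integer ids, so it can help to store the id-to-name mapping in the model config in addition to keeping the fitted encoder. The sketch below builds the two dictionaries from a list of class names. The names are placeholders; in the notebook the list would come from `le.classes_`. `id2label` and `label2id` are configuration fields that `from_pretrained` accepts as keyword arguments.

```python
# Placeholder class names; in the notebook: [str(c) for c in le.classes_]
class_names = ["Data Science", "HR", "Python Developer"]

id2label = {i: name for i, name in enumerate(class_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label)  # {0: 'Data Science', 1: 'HR', 2: 'Python Developer'}
print(label2id)  # {'Data Science': 0, 'HR': 1, 'Python Developer': 2}

# These could then be passed when loading the model, for example:
# model = DistilBertForSequenceClassification.from_pretrained(
#     'distilbert-base-uncased',
#     num_labels=len(class_names),
#     id2label=id2label,
#     label2id=label2id,
# )
```

With the mapping saved in `config.json`, a reloaded model carries its own label names.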


## Training Arguments

In [18]:
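from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert_resume_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,   # log the training loss every 100 optimizer steps
    save_steps=500,      # checkpoint interval, in optimizer steps
    report_to="none",    # disable Weights & Biases and other logging integrations
)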


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


## Trainer

In [19]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [20]:
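import os
# Deprecated way to disable Weights & Biases logging. It should be set before
# TrainingArguments and Trainer are created; report_to="none" is the preferred option.
os.environ["WANDB_DISABLED"] = "true"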


In [21]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 2.7211        |
| 200  | 1.1727        |


TrainOutput(global_step=291, training_loss=1.498170164442554, metrics={'train_runtime': 7217.4031, 'train_samples_per_second': 0.32, 'train_steps_per_second': 0.04, 'total_flos': 305727638307840.0, 'train_loss': 1.498170164442554, 'epoch': 3.0})
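
The `global_step=291` in the output follows from the dataset size and the batch size. Below is a short worked calculation, assuming a single device and no gradient accumulation, with the 769 training examples reported earlier in the notebook:

```python
import math

n_train = 769        # training examples
batch_size = 8       # per_device_train_batch_size
epochs = 3           # num_train_epochs
logging_steps = 100
save_steps = 500

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 97 291

# Steps at which the training loss is logged
logged = list(range(logging_steps, total_steps + 1, logging_steps))
print(logged)  # [100, 200]

# Steps that fall on a multiple of save_steps
checkpoints = list(range(save_steps, total_steps + 1, save_steps))
print(checkpoints)  # []
```

With these settings only two loss values are logged, and no step falls on a multiple of `save_steps`. Lower values for both would give a finer view of training on a dataset this small.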

## Evaluate

In [22]:
preds_output = trainer.predict(test_dataset)
y_pred = preds_output.predictions.argmax(-1)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))



Accuracy: 0.9896373056994818
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         4
           1       1.00      1.00      1.00         7
           2       0.80      0.80      0.80         5
           3       1.00      1.00      1.00         8
           4       1.00      1.00      1.00         6
           5       1.00      1.00      1.00         5
           6       1.00      1.00      1.00         8
           7       1.00      1.00      1.00         7
           8       1.00      0.91      0.95        11
           9       1.00      1.00      1.00         5
          10       1.00      1.00      1.00         8
          11       1.00      1.00      1.00         6
          12       1.00      1.00      1.00         9
          13       1.00      1.00      1.00         8
          14       1.00      1.00      1.00         6
          15       1.00      1.00      1.00        17
          16       1.00     
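
The report above labels classes by their encoded integers. The sketch below shows how category names could be displayed instead, using toy labels. In the notebook, the equivalent would be passing `target_names=le.classes_` to `classification_report`.

```python
from sklearn.metrics import classification_report

# Toy labels for illustration; names are ordered to match integer codes 0, 1, 2
class_names = ["Data Science", "HR", "Testing"]
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]

# target_names replaces the integer class ids with readable names
print(classification_report(y_true, y_pred, target_names=class_names))

# output_dict=True returns the same numbers as a nested dict keyed by class name
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
print(report["Testing"]["recall"])  # 0.5
```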

## Save

In [31]:
trainer.save_model("resume_bert_model")
tokenizer.save_pretrained("resume_bert_model")

import pickle
# Use a context manager so the file is flushed and closed
with open("label_encoder.pkl", "wb") as f:
    pickle.dump(le, f)


In [32]:
# Create a ZIP archive of the entire model directory
!zip -r /content/resume_bert_model.zip /content/resume_bert_model/

  adding: content/resume_bert_model/ (stored 0%)
  adding: content/resume_bert_model/tokenizer_config.json (deflated 75%)
  adding: content/resume_bert_model/model.safetensors (deflated 8%)
  adding: content/resume_bert_model/vocab.txt (deflated 53%)
  adding: content/resume_bert_model/special_tokens_map.json (deflated 42%)
  adding: content/resume_bert_model/tokenizer.json (deflated 71%)
  adding: content/resume_bert_model/training_args.bin (deflated 54%)
  adding: content/resume_bert_model/config.json (deflated 64%)


**Why DistilBERT:**


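1.   Lightweight and fast: a distilled version of BERT that its authors report as about 40% smaller and 60% faster, which suits production use
2.   Strong at text classification once fine-tuned
3.   Pre-trained contextual representations capture the meaning of words in context
4.   Easy to fine-tune with the Hugging Face `transformers` library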


In [None]:
# py -m streamlit run main1.py


## Predict


**Spot-check predictions on sample resumes**

In [23]:
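def predict_category(text: str) -> str:
    # Apply the same cleaning used for the training data
    text = cleanResume(text)

    inputs = tokenizer(
        text,
        truncation=True,
        padding=True,
        max_length=512,  # match the max_length used when tokenizing the training set
        return_tensors="pt"
    )
    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**inputs)
        pred_id = torch.argmax(outputs.logits, dim=1).item()

    # Convert numeric label to original category
    return le.inverse_transform([pred_id])[0]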

In [24]:
myresume = """
I am a data scientist with experience in machine learning, deep learning,
computer vision, and NLP. Skilled in Python, PyTorch, and TensorFlow.
"""

category = predict_category(myresume)
print("Predicted Category:", category)


Predicted Category: Data Science


In [26]:
myresume = """
John Doe is an experienced Network Security Engineer with over 7 years of expertise in designing, implementing, and managing network security infrastructures. Specializing in safeguarding critical network systems, John has worked with various organizations to protect against cyber threats, data breaches, and unauthorized access. He is proficient in deploying firewalls, intrusion detection systems (IDS), VPNs, and network monitoring tools to ensure the integrity and security of networks.

John holds a degree in Computer Science and certifications in several cybersecurity domains, including Certified Information Systems Security Professional (CISSP), Certified Ethical Hacker (CEH), and Cisco Certified Network Associate (CCNA). He has extensive experience in troubleshooting and resolving network vulnerabilities, and has played a key role in conducting security audits and risk assessments.

Key Skills:
- Network Security Architecture
- Firewall Management and Configuration
- Intrusion Detection and Prevention Systems (IDS/IPS)
- Virtual Private Networks (VPNs)
- Security Audits and Risk Assessments
- Cybersecurity Incident Response
- Network Monitoring and Traffic Analysis
- Vulnerability Assessment and Penetration Testing
- Data Encryption and Secure Communications

Certifications:
- CISSP (Certified Information Systems Security Professional)
- CEH (Certified Ethical Hacker)
- CCNA (Cisco Certified Network Associate)
- CompTIA Security+

Education:
BSc in Computer Science, XYZ University, 2012-2016

Professional Experience:
- Network Security Engineer at ABC Corp (2016-Present)
- IT Security Specialist at DEF Solutions (2014-2016)

Languages:
- English (Fluent)
- French (Intermediate)
"""

# Now, test the model with the Network Security Engineer-focused resume
predict_category(myresume)

'Network Security Engineer'

In [28]:

sample_resume = """Experienced software engineer skilled in Python, machine learning, and data analysis. Worked on various AI projects..."""
print("Predicted Category:", predict_category(sample_resume))

Predicted Category: Python Developer
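
The last example could plausibly belong to more than one category, and `argmax` alone does not show how close the alternatives were. A softmax over the logits turns them into probabilities that can be reported alongside the label. The sketch below uses plain Python with made-up logits and placeholder class names. The helpers `softmax` and `top_k` are hypothetical; in the notebook, `torch.softmax(outputs.logits, dim=1)` would give the probabilities directly.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, class_names, k=2):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up values for illustration only
class_names = ["Data Science", "HR", "Python Developer"]
logits = [2.0, -1.0, 2.5]

top = top_k(logits, class_names)
for name, prob in top:
    print(f"{name}: {prob:.2f}")
```

A small gap between the top two probabilities is a signal that the resume sits between categories and may deserve a manual look.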
