<center>
  <a href="https://imgbb.com/"><img src="https://i.ibb.co/gwD1XY2/eurisko-logo-white-2x-png1.png" alt="eurisko-logo-white-2x-png1" border="0" /></a>
</center>
<font size=10><center>Data Analytics and AI</center></font>
<font size=6><center>Project - Deep Learning - NLP - Week 4</center>

# Project NLP Classification: Resume Classification at TechCorp
**Points: 100**

## Context:
TechCorp, a leading technology company, receives thousands of resumes every month for various job openings. The HR department faces a significant challenge in manually reviewing and shortlisting candidates, which is both time-consuming and prone to human error. Automating this process can enhance efficiency and accuracy, ensuring that the best candidates are quickly identified for **the right** job position.

## Objective:
In this assignment, you will play the role of an NLP engineer at TechCorp, tasked with developing a deep learning model to automate resume classification. Using data available online in text format, you will preprocess and clean resumes to train a model that predicts the best-matching job position based on labels of that you choose/find. This involves several steps, from data collection and preprocessing to model training and evaluation.

## Key Requirements:
- **Data Collection**: Collect/augment open source labeled text data about CVs.
- **Data Preprocessing**: Preporcess the data to fit the needed task.
- **Model Class:** Encapsulate your model within a class structure for training.
- **Evaluation Techniques:** Apply proper evaluation methods to assess your  model's classification performance.
- **README File:** Include a README file detailing the data source, preprocessing steps, model details, results, and key insights such as strengths and weaknesses.
- **Documentation:** Ensure clear documentation with explanations and comments for each function or class. Add your explanation in this Jupyter Notebook
- **Checkpoint:** Save your best model checkpoint on your drive. Share the folder of the checkpoint through google drive to our email "ai@eurisko.net" so we can evaluate your model.

### Bonus:
- For a fun test, try applying your own CV to your trained model. If your CV is in PDF format, use a PDF parser (e.g., PyPDF2, PyMuPDF, pdfminer) to convert it to text. Ensure you preprocess the text appropriately before feeding it into your model.


# **Installations And Importing necessary librairies**

In [1]:
!pip -q install datasets matplotlib seaborn
!pip -q install evaluate

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/547.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m542.7/547.8 kB[0m [31m26.4 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/116.3 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m316.1/316.1 kB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m39.9/39.9 MB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.9/64.9 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns


import torch

In [3]:
print(torch.cuda.is_available())

True


# **Manipulating datasets**

## **HuggingFace Dataset 1**

In [4]:
dataset = load_dataset('Shawn0069/resume_classification_kaggle')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/651 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/16.2M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.15M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.10M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1987 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/497 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/497 [00:00<?, ? examples/s]

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 1987
    })
    test: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 497
    })
    validation: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 497
    })
})

In [6]:
import numpy as np

print(np.unique(dataset['train']['Category']))

['ACCOUNTANT' 'ADVOCATE' 'AGRICULTURE' 'APPAREL' 'ARTS' 'AUTOMOBILE'
 'AVIATION' 'BANKING' 'BPO' 'BUSINESS-DEVELOPMENT' 'CHEF' 'CONSTRUCTION'
 'CONSULTANT' 'DESIGNER' 'DIGITAL-MEDIA' 'ENGINEERING' 'FINANCE' 'FITNESS'
 'HEALTHCARE' 'HR' 'INFORMATION-TECHNOLOGY' 'PUBLIC-RELATIONS' 'SALES'
 'TEACHER']


## **HuggingFace Dataset 2**

In [7]:
dataset2 = load_dataset('DevashishBhake/resume_section_classification')
dataset2

Downloading data:   0%|          | 0.00/8.80M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1219 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['ID', 'Category', 'Resume'],
        num_rows: 1219
    })
})

In [8]:
dataset2 = dataset2.rename_column('Resume', 'Resume_str')

In [9]:
print(np.unique(dataset2['train']['Category']))

['Accountant' 'Advocate' 'Agricultural' 'Apparel' 'Architects' 'Arts'
 'Automobile' 'Aviation' 'BPO' 'Banking' 'Building & Construction'
 'Business Development' 'Consultant' 'Designing' 'Digital Media'
 'Education' 'Engineering' 'Finance' 'Food & Beverages' 'HR'
 'Health & Fitness' 'Information Technology' 'Managment'
 'Public Relations' 'Sales']


In [10]:
import datasets
from datasets import concatenate_datasets

# Extract the desired portions from dataset3
test_portion = dataset2['train'].select(range(0, 609))
validation_portion = dataset2['train'].select(range(609, 1219))

# Concatenate the portions to the existing test and validation sets in dataset
updated_test_dataset = concatenate_datasets([dataset['test'], test_portion]) # Use concatenate_datasets to combine datasets
updated_validation_dataset = concatenate_datasets([dataset['validation'], validation_portion])

# Update the original dataset with the modified test and validation sets
updated_dataset = dataset.map(
    lambda example, idx: example,
    with_indices=True,
    batched=True,
    batch_size=1000,
    num_proc=2
)
updated_dataset['test'] = updated_test_dataset
updated_dataset['validation'] = updated_validation_dataset

Map (num_proc=2):   0%|          | 0/1987 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/497 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/497 [00:00<?, ? examples/s]

In [11]:
dataset = updated_dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 1987
    })
    test: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 1106
    })
    validation: Dataset({
        features: ['ID', 'Resume_str', 'Resume_html', 'Category', '__index_level_0__'],
        num_rows: 1107
    })
})

In [12]:
dataset['test'][1000]

{'ID': 504,
 'Resume_str': "b'Curriculum vitae\\nPersonal\\nName\\n\\nManal Hussein El-Mahdy\\n\\nNationality\\n\\nEgyptian\\n\\nAdress\\n\\nEgypt , Benha\\n\\nMarital status\\n\\nmarried\\n\\nE-mail\\n\\nmanalelmahdy@hotmail.com\\n\\nTelephone\\n\\n0020133231984\\n\\nEducation\\n11/ 1997\\n\\xef\\x82\\xb7 Master degree of radiodiagnosis Department of Diagnostic\\nRadiology , Faculty of Medicine , Zagazig university , Egypt.\\n\\xef\\x82\\xb7 The topic of master\\xe2\\x80\\x99s thesis was correlation between Abdominal\\nultrasound and CT in diagnosis of pediatric\\nAbdominal masses\\n\\xef\\x82\\xb7 Curriculum : this is a two part , each part is followed by an\\nExam . The first part has the following components : radiological\\nAnatomy, Physics , Nuclear medicine , Radioprotection ,\\nRadiological technipues , Positioning and dark room applications.\\nThe second part of the course includes all the branches of\\nPractical Radiodiagnosis , i.e neuroradiology , musculoskletal ,\\n\\n\\x0

In [13]:
dataset = dataset.remove_columns(['__index_level_0__', 'ID','Resume_html'])
dataset


DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1987
    })
    test: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1106
    })
    validation: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1107
    })
})

## **CSV Dataset 1 Kaggle**

In [14]:
! pip install -q kaggle

In [15]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"mhammadibrahim","key":"cdb2d7f3deb29c45bf6f5c7a91c5b215"}'}

In [16]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download -d jillanisofttech/updated-resume-dataset

Dataset URL: https://www.kaggle.com/datasets/jillanisofttech/updated-resume-dataset
License(s): CC0-1.0
Downloading updated-resume-dataset.zip to /content
  0% 0.00/383k [00:00<?, ?B/s]
100% 383k/383k [00:00<00:00, 72.5MB/s]


In [17]:
! unzip updated-resume-dataset.zip

Archive:  updated-resume-dataset.zip
  inflating: UpdatedResumeDataSet.csv  


In [18]:
import pandas as pd
dataset3 = pd.read_csv('UpdatedResumeDataSet.csv')

In [19]:
dataset3

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [20]:
dataset3.Category.unique()

array(['Data Science', 'HR', 'Advocate', 'Arts', 'Web Designing',
       'Mechanical Engineer', 'Sales', 'Health and fitness',
       'Civil Engineer', 'Java Developer', 'Business Analyst',
       'SAP Developer', 'Automation Testing', 'Electrical Engineering',
       'Operations Manager', 'Python Developer', 'DevOps Engineer',
       'Network Security Engineer', 'PMO', 'Database', 'Hadoop',
       'ETL Developer', 'DotNet Developer', 'Blockchain', 'Testing'],
      dtype=object)

In [21]:
dataset3 = dataset3.rename(columns={'Resume': 'Resume_str'})

In [22]:
dataset3

Unnamed: 0,Category,Resume_str
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
import datasets

def merge_and_split_datasets(dataset, dataFrame):
    # Convert dataFrame to Hugging Face dataset format
    dataFrame_dict = dataFrame.to_dict('list')
    dataFrame_hf = datasets.Dataset.from_dict(dataFrame_dict)

    # Convert Hugging Face dataset to Pandas DataFrame for splitting
    dataFrame_df = dataFrame_hf.to_pandas()

    # Split dataFrame into train, validation, and test
    train_df, temp_df = train_test_split(dataFrame_df, test_size=0.3, random_state=42,stratify=dataFrame_df['Category'])
    valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42,stratify=temp_df['Category'])

    # Convert split dataframes back to Hugging Face datasets
    train_hf = datasets.Dataset.from_pandas(train_df)
    valid_hf = datasets.Dataset.from_pandas(valid_df)
    test_hf = datasets.Dataset.from_pandas(test_df)

    # Concatenate the split datasets to the existing dataset
    updated_train_dataset = datasets.concatenate_datasets([dataset['train'], train_hf])
    updated_validation_dataset = datasets.concatenate_datasets([dataset['validation'], valid_hf])
    updated_test_dataset = datasets.concatenate_datasets([dataset['test'], test_hf])

    # Update the original dataset with the modified train, validation, and test sets
    updated_dataset = dataset.map(
        lambda example, idx: example,
        with_indices=True,
        batched=True,
        batch_size=1000,
        num_proc=2
    )
    updated_dataset['train'] = updated_train_dataset
    updated_dataset['validation'] = updated_validation_dataset
    updated_dataset['test'] = updated_test_dataset

    return updated_dataset

In [24]:
updated_dataset = merge_and_split_datasets(dataset, dataset3)

Map (num_proc=2):   0%|          | 0/1987 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1106 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1107 [00:00<?, ? examples/s]

In [25]:
dataset = updated_dataset

In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 2660
    })
    test: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 1251
    })
    validation: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 1251
    })
})

In [27]:
dataset = dataset.remove_columns(['__index_level_0__'])
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 2660
    })
    test: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1251
    })
    validation: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1251
    })
})

## **CSV Dataset 2 GitHub**

In [28]:
dataset4 = pd.read_csv('Raw_Resume.csv')
dataset4

Unnamed: 0,Category,Raw_Details
0,PeopleSoft,Anubhav Kumar Singh\n\n\n\n To work in a globa...
1,PeopleSoft,G. Ananda Rayudu \n https://www.linkedin.com/i...
2,PeopleSoft,PeopleSoft Database Administrator \n\nGangared...
3,PeopleSoft,Classification: Internal \n\nMurali \n\nExperi...
4,PeopleSoft,"Priyanka Ramadoss\n\n61/46, MountPleasant, \nC..."
...,...,...
74,Workday,Workday Integration Consultant \n\nName ...
75,Workday,S R I K A N T H ( W O R K D A Y H C M C O N...
76,Workday,WORKDAY | HCM | FCM \n\nName Role \n\n: Kumar ...
77,Workday,Venkateswarlu.B \n\n\n\nWorkday Consultant\n\n...


In [29]:
dataset4 = dataset4.rename(columns={'Raw_Details': 'Resume_str'})
dataset4

Unnamed: 0,Category,Resume_str
0,PeopleSoft,Anubhav Kumar Singh\n\n\n\n To work in a globa...
1,PeopleSoft,G. Ananda Rayudu \n https://www.linkedin.com/i...
2,PeopleSoft,PeopleSoft Database Administrator \n\nGangared...
3,PeopleSoft,Classification: Internal \n\nMurali \n\nExperi...
4,PeopleSoft,"Priyanka Ramadoss\n\n61/46, MountPleasant, \nC..."
...,...,...
74,Workday,Workday Integration Consultant \n\nName ...
75,Workday,S R I K A N T H ( W O R K D A Y H C M C O N...
76,Workday,WORKDAY | HCM | FCM \n\nName Role \n\n: Kumar ...
77,Workday,Venkateswarlu.B \n\n\n\nWorkday Consultant\n\n...


In [30]:
updated_dataset = merge_and_split_datasets(dataset, dataset4)

Map (num_proc=2):   0%|          | 0/2660 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1251 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1251 [00:00<?, ? examples/s]

In [31]:
dataset = updated_dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 2715
    })
    test: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 1263
    })
    validation: Dataset({
        features: ['Resume_str', 'Category', '__index_level_0__'],
        num_rows: 1263
    })
})

In [32]:
dataset = dataset.remove_columns(['__index_level_0__'])
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 2715
    })
    test: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1263
    })
    validation: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1263
    })
})

In [33]:
print(np.unique(dataset['train']['Category']))

['ACCOUNTANT' 'ADVOCATE' 'AGRICULTURE' 'APPAREL' 'ARTS' 'AUTOMOBILE'
 'AVIATION' 'Advocate' 'Arts' 'Automation Testing' 'BANKING' 'BPO'
 'BUSINESS-DEVELOPMENT' 'Blockchain' 'Business Analyst' 'CHEF'
 'CONSTRUCTION' 'CONSULTANT' 'Civil Engineer' 'DESIGNER' 'DIGITAL-MEDIA'
 'Data Science' 'Database' 'DevOps Engineer' 'DotNet Developer'
 'ENGINEERING' 'ETL Developer' 'Electrical Engineering' 'FINANCE'
 'FITNESS' 'HEALTHCARE' 'HR' 'Hadoop' 'Health and fitness'
 'INFORMATION-TECHNOLOGY' 'Java Developer' 'Mechanical Engineer'
 'Network Security Engineer' 'Operations Manager' 'PMO' 'PUBLIC-RELATIONS'
 'PeopleSoft' 'Python Developer' 'React JS Developer' 'SALES'
 'SAP Developer' 'SQL Developer' 'Sales' 'TEACHER' 'Testing'
 'Web Designing' 'Workday']


In [34]:
print(np.unique(dataset['validation']['Category']))

['ACCOUNTANT' 'ADVOCATE' 'AGRICULTURE' 'APPAREL' 'ARTS' 'AUTOMOBILE'
 'AVIATION' 'Accountant' 'Advocate' 'Apparel' 'Architects' 'Arts'
 'Automation Testing' 'Automobile' 'Aviation' 'BANKING' 'BPO'
 'BUSINESS-DEVELOPMENT' 'Banking' 'Blockchain' 'Building & Construction'
 'Business Analyst' 'CHEF' 'CONSTRUCTION' 'CONSULTANT' 'Civil Engineer'
 'Consultant' 'DESIGNER' 'DIGITAL-MEDIA' 'Data Science' 'Database'
 'DevOps Engineer' 'Digital Media' 'DotNet Developer' 'ENGINEERING'
 'ETL Developer' 'Electrical Engineering' 'Engineering' 'FINANCE'
 'FITNESS' 'Finance' 'Food & Beverages' 'HEALTHCARE' 'HR' 'Hadoop'
 'Health and fitness' 'INFORMATION-TECHNOLOGY' 'Java Developer'
 'Mechanical Engineer' 'Network Security Engineer' 'Operations Manager'
 'PMO' 'PUBLIC-RELATIONS' 'PeopleSoft' 'Public Relations'
 'Python Developer' 'React JS Developer' 'SALES' 'SAP Developer'
 'SQL Developer' 'Sales' 'TEACHER' 'Testing' 'Web Designing' 'Workday']


In [35]:
import numpy as np
# List of categories to remove
categories_to_remove = ['TEACHER', 'Health and fitness', 'FITNESS', 'HEALTHCARE', 'ENGINEERING',
                        'Civil Engineer', 'CHEF', 'ADVOCATE', 'AGRICULTURE', 'APPAREL',
                        'ARTS', 'AUTOMOBILE', 'AVIATION', 'Advocate', 'Arts','Aviation',
                        'Automobile','Architects','Apparel','Engineering','Building & Construction',
                        'Food & Beverages']

# Filter the dataset to remove rows with specified categories
dataset = dataset.filter(lambda example: example['Category'] not in categories_to_remove)

Filter:   0%|          | 0/2715 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1263 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1263 [00:00<?, ? examples/s]

In [36]:
# Print unique categories after filtering
print(np.unique(dataset['train']['Category']))

['ACCOUNTANT' 'Automation Testing' 'BANKING' 'BPO' 'BUSINESS-DEVELOPMENT'
 'Blockchain' 'Business Analyst' 'CONSTRUCTION' 'CONSULTANT' 'DESIGNER'
 'DIGITAL-MEDIA' 'Data Science' 'Database' 'DevOps Engineer'
 'DotNet Developer' 'ETL Developer' 'Electrical Engineering' 'FINANCE'
 'HR' 'Hadoop' 'INFORMATION-TECHNOLOGY' 'Java Developer'
 'Mechanical Engineer' 'Network Security Engineer' 'Operations Manager'
 'PMO' 'PUBLIC-RELATIONS' 'PeopleSoft' 'Python Developer'
 'React JS Developer' 'SALES' 'SAP Developer' 'SQL Developer' 'Sales'
 'Testing' 'Web Designing' 'Workday']


In [37]:
# Define a mapping of old values to new values
value_mapping = {
    'Accountant': 'Finance',
    'Business Analyst': 'Business',
    'BPO': 'Business',
    'Operations Manager': 'Management',
    'PMO': 'Management',
    'DotNet Developer': 'Backend Developer',
    'Java Developer': 'Backend Developer',
    'Python Developer': 'Backend Developer',
    'React JS Developer': 'Frontend Developer',
    'Web Designing': 'Frontend Developer',
    'SQL Developer': 'Database',
    'Automation Testing': 'Testing',
    'Hadoop': 'Data Science',
    'Banking': 'Finance',
    'DIGITAL-MEDIA': 'Media',
    'ACCOUNTANT': 'Finance',
    'PUBLIC-RELATIONS': 'Marketing',
    'Public Relations': 'Marketing',
    'SALES': 'Sales',
    'FINANCE': 'Finance',
    'BUSINESS-DEVELOPMENT': 'Business',
    "CONSULTANT": 'Consulting',
    "Consultant": 'Consulting',
    "Digital Media": "Media",
    'CONSTRUCTION': 'Construction'
    }

# Apply the mapping to the 'Category' column in all splits of the dataset
dataset = dataset.map(lambda example: {'Category': value_mapping.get(example['Category'], example['Category'])})


Map:   0%|          | 0/1743 [00:00<?, ? examples/s]

Map:   0%|          | 0/977 [00:00<?, ? examples/s]

Map:   0%|          | 0/756 [00:00<?, ? examples/s]

In [38]:
train_cat = np.unique(dataset['train']['Category'])
train_cat

array(['BANKING', 'Backend Developer', 'Blockchain', 'Business',
       'Construction', 'Consulting', 'DESIGNER', 'Data Science',
       'Database', 'DevOps Engineer', 'ETL Developer',
       'Electrical Engineering', 'Finance', 'Frontend Developer', 'HR',
       'INFORMATION-TECHNOLOGY', 'Management', 'Marketing',
       'Mechanical Engineer', 'Media', 'Network Security Engineer',
       'PeopleSoft', 'SAP Developer', 'Sales', 'Testing', 'Workday'],
      dtype='<U25')

In [39]:
val_cat = np.unique(dataset['validation']['Category'])
val_cat

array(['BANKING', 'Backend Developer', 'Blockchain', 'Business',
       'Construction', 'Consulting', 'DESIGNER', 'Data Science',
       'Database', 'DevOps Engineer', 'ETL Developer',
       'Electrical Engineering', 'Finance', 'Frontend Developer', 'HR',
       'INFORMATION-TECHNOLOGY', 'Management', 'Marketing',
       'Mechanical Engineer', 'Media', 'Network Security Engineer',
       'PeopleSoft', 'SAP Developer', 'Sales', 'Testing', 'Workday'],
      dtype='<U25')

In [40]:
# Values in train_cat but not in val_cat
in_train_not_val = np.setdiff1d(train_cat, val_cat)
print("Values in train_cat but not in val_cat:", in_train_not_val)

# Values in val_cat but not in train_cat
in_val_not_train = np.setdiff1d(val_cat, train_cat)
print("Values in val_cat but not in train_cat:", in_val_not_train)


Values in train_cat but not in val_cat: []
Values in val_cat but not in train_cat: []


In [41]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 1743
    })
    test: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 977
    })
    validation: Dataset({
        features: ['Resume_str', 'Category'],
        num_rows: 756
    })
})

In [42]:
# Get unique categories from both train and validation sets
all_categories = np.unique(np.concatenate((dataset['train']['Category'], dataset['validation']['Category'])))
label_dict = {v: i for i, v in enumerate(all_categories)}

# create a new column named label in it the index of category
def create_label(example):
  example['label'] = label_dict[example['Category']]
  return example

dataset['train'] = dataset['train'].map(create_label)
dataset['validation'] = dataset['validation'].map(create_label)
# dataset['test'] = dataset['test'].map(create_label)

Map:   0%|          | 0/1743 [00:00<?, ? examples/s]

Map:   0%|          | 0/756 [00:00<?, ? examples/s]

In [43]:
dataset['train'].features

{'Resume_str': Value(dtype='string', id=None),
 'Category': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [44]:
dataset['validation'].features

{'Resume_str': Value(dtype='string', id=None),
 'Category': Value(dtype='string', id=None),
 'label': Value(dtype='int64', id=None)}

In [45]:
dataset['train'][1700]

{'Resume_str': 'Kanumuru Deepak Reddy \n\nCAREER OBJECTIVE: \n\nTo secure a position in a reputed organization where I can efficiently contribute my knowledge and skills to the growth of the organization and build my professional career. \n\nACADEMIC QUALIFICATIONS: \n\nQualification \n\nInstitute \n\nBoard (or) \nUniversity \n\nYear of \ncompletion \n\nPercentage/ CGPA \n\nB.Tech (E.C.E) \n\nAudisankara College of Engineering & \nTechnology,Gudur. \n\nJNTU Anantapur. \n\n2018 \n\n77.3 \n\nIntermediate\n\n\tNarayana Junior college, \tNaidupet. \n\nBoard of \nIntermediate, AP. \n\n2014 \n\n89.5 \n\nSSC \n\nNavodaya High School,Naidupet \n\nBoard of Secondary education, AP. \n\n2012 \n\n6.7 \n\nPROJECT: \n\nTitle :Density based Traffic Control System USING ARDUINO. \n\nDuration:4 months. \n\nDescription: Traffic congestion is a severe problem in most of the cities across the world and it has become a nightmare for the citizens. It is caused by delay in signal, inappropriate timing of tra

# **Processing the data**

In [46]:
import re
def remove_punctuation_and_extra_spaces(text):
    # Replace multiple occurrences of ! and ? with a single occurrence
    text = re.sub(r'!+', '!', text)
    text = re.sub(r'\?+', '?', text)
    # Remove links
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove all other punctuation
    text = re.sub(r'[^\w\s!?]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove newline and carriage return characters
    text = text.replace('\n', ' ').replace('\r', ' ')
    return text


In [47]:
remove_punctuation_and_extra_spaces(dataset['train'][0]['Resume_str'])

'SALES CONSULTANT INTERIOR DESIGNER Professional Summary Resultsoriented sales professional eager to join a reputable organization Hardworking consultant gifted at turning prospects into clients by delivering exceptional presentations Engaging and personable with expertise managing key milestones and delivering exemplary customer service Highly enthusiastic with ability to absorb information rapidly and make a correct response Skills Persuasive communication Prospect qualification Retention strategies Exceptional Customer Service Sales Work History Sales Consultant Interior Designer 012018 to 122020 Company Name City State Assisted clients with budget considerations and made recommendations for furniture custom made leather sofas and accessories items Developed space planning concepts color palette selections and leather presentations Used consultative sales approach to understand customer needs and recommend relevant offerings Created detailed sales presentations to communicate produc

In [48]:
remove_punctuation_and_extra_spaces(dataset3['Resume_str'][957])

'Computer Skills â Proficient in MS office Word Basic Excel Power point Strength â Hard working Loyalty Creativity â Selfmotivated Responsible Initiative â Good people management skill positive attitude â knowledge of windows InternetEducation Details Bachelor of Electrical Engineering Electrical Engineering Nashik Maharashtra Guru Gobind Singh College of Engineering and Research Centre Diploma Electrical Engineering Nashik Maharashtra S M E S Polytechnic College Testing Engineer Skill Details EXCEL Exprience 6 months MS OFFICE Exprience 6 months WORD Exprience 6 monthsCompany Details company description Department Testing Responsibilities â To check ACB and VCB of Circuit Breaker â Following test conducted of Circuit Breaker as per drawing 1 To check breaker timing 2 To check contact resistance using contact resistance meter CRM 3 To check breaker insulation resistance IR 4 To check breaker rack out and rack in properly or not 5 To check closing and tripping operation work properly or

# **Tokenizer**

In [49]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

# **Applying the preprocessing**

In [50]:
for category, index in label_dict.items():
  print(f"Category: {category}, Index: {index}")


Category: BANKING, Index: 0
Category: Backend Developer, Index: 1
Category: Blockchain, Index: 2
Category: Business, Index: 3
Category: Construction, Index: 4
Category: Consulting, Index: 5
Category: DESIGNER, Index: 6
Category: Data Science, Index: 7
Category: Database, Index: 8
Category: DevOps Engineer, Index: 9
Category: ETL Developer, Index: 10
Category: Electrical Engineering, Index: 11
Category: Finance, Index: 12
Category: Frontend Developer, Index: 13
Category: HR, Index: 14
Category: INFORMATION-TECHNOLOGY, Index: 15
Category: Management, Index: 16
Category: Marketing, Index: 17
Category: Mechanical Engineer, Index: 18
Category: Media, Index: 19
Category: Network Security Engineer, Index: 20
Category: PeopleSoft, Index: 21
Category: SAP Developer, Index: 22
Category: Sales, Index: 23
Category: Testing, Index: 24
Category: Workday, Index: 25


In [51]:
label2id = {"BANKING": 0, "Backend Developer": 1, "Blockchain": 2, "Business": 3, "Construction": 4, "Consulting": 5,
            "DESIGNER": 6, "Data Science": 7, "Database": 8, "DevOps Engineer": 9, "ETL Developer": 10, "Electrical Engineering": 11,
            "Finance": 12, "Frontend Developer": 13, "HR": 14, "INFORMATION-TECHNOLOGY": 15, "Management": 16, "Marketing": 17,
            "Mechanical Engineer": 18, "Media": 19, "Network Security Engineer": 20, "PeopleSoft": 21, "SAP Developer": 22, "Sales": 23,
            "Testing": 24, "Workday": 25}
id2label = {v: k for k, v in label2id.items()}

In [52]:
# preprocess function
def preprocess_function(examples): # parameters dataset
    examples['Resume_str'] = [remove_punctuation_and_extra_spaces(text) for text in examples['Resume_str']] # call function remove_punctuation_and_extra_spaces
    tokenized_examples = tokenizer(examples['Resume_str'], truncation=True, padding='max_length', max_length=512) # use the tokenizer
    return tokenized_examples

In [53]:
preprocessed_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/1743 [00:00<?, ? examples/s]

Map:   0%|          | 0/977 [00:00<?, ? examples/s]

Map:   0%|          | 0/756 [00:00<?, ? examples/s]

In [54]:
print(preprocessed_dataset['train'][1700])

{'Resume_str': 'Kanumuru Deepak Reddy CAREER OBJECTIVE To secure a position in a reputed organization where I can efficiently contribute my knowledge and skills to the growth of the organization and build my professional career ACADEMIC QUALIFICATIONS Qualification Institute Board or University Year of completion Percentage CGPA BTech ECE Audisankara College of Engineering TechnologyGudur JNTU Anantapur 2018 773 Intermediate Narayana Junior college Naidupet Board of Intermediate AP 2014 895 SSC Navodaya High SchoolNaidupet Board of Secondary education AP 2012 67 PROJECT Title Density based Traffic Control System USING ARDUINO Duration4 months Description Traffic congestion is a severe problem in most of the cities across the world and it has become a nightmare for the citizens It is caused by delay in signal inappropriate timing of traffic signalling etc The delay of traffic light is hard coded and it does not depend on traffic Therefore for optimising traffic control there is an incre

In [55]:
train_dataset = preprocessed_dataset['train']
validation_dataset = preprocessed_dataset['validation']

# **Training the Model**

In [83]:
from transformers import AutoModelForSequenceClassification, BertConfig, BertForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback
import numpy as np
import evaluate
import torch
from sklearn.metrics import accuracy_score, precision_score, f1_score

# Define EarlyStoppingCallback
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,  # Number of evaluations with no improvement before stopping
    early_stopping_threshold=0.01  # Minimum improvement required to reset patience
)

# create a class Model
class ModelBERTSequenceClassifier:
    # constructor
    def __init__(self, model_name='bert-base-uncased', num_labels=26, id2label=None, label2id=None):
        self.model_name = model_name
        self.config = BertConfig.from_pretrained(model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)
        self.model = BertForSequenceClassification.from_pretrained(model_name, config=self.config)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)
        self.metric = evaluate.load("f1")
        self.trainer = None

    # train method
    def train(self, train_dataset, validation_dataset, output_dir="trainer_output", logging_dir='./logs', num_train_epochs=5, per_device_train_batch_size=16, per_device_eval_batch_size=16):
        training_args = TrainingArguments(
            output_dir=output_dir,
            evaluation_strategy="epoch",
            num_train_epochs=num_train_epochs,
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            logging_dir=logging_dir,
            logging_strategy='epoch',
            load_best_model_at_end=True, # Add this line
            metric_for_best_model="f1", # Add this line to specify which metric to use for early stopping
            save_strategy="epoch"  # Add this line to match the eval strategy

        )

        self.trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=validation_dataset,
            compute_metrics=self.compute_metrics,
            callbacks=[early_stopping]
        )

        self.trainer.train()

    # compute_metrics method
    def compute_metrics(self, eval_pred): # Added self as the first argument
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        accuracy = accuracy_score(labels, predictions)
        precision = precision_score(labels, predictions, average='weighted')  # Use weighted average for multiclass
        f1 = f1_score(labels, predictions, average='weighted')  # Use weighted average for multiclass
        return {"accuracy": accuracy, "precision": precision, "f1": f1}

    # evaluate method
    def evaluate(self):
        if self.trainer is None:
            raise ValueError("The model has not been trained yet. Please call the train method first.")
        eval_results = self.trainer.evaluate()
        return eval_results

In [84]:
# from transformers import AutoModelForSequenceClassification, BertConfig, BertForSequenceClassification, Trainer, TrainingArguments
# import numpy as np
# import evaluate
# import torch

# # create a class Model
# class ModelBERTSequenceClassifier:
#     # constructor
#     def __init__(self, model_name='bert-base-uncased', num_labels=26, id2label=None, label2id=None):
#         self.model_name = model_name
#         self.config = BertConfig.from_pretrained(model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)
#         self.model = BertForSequenceClassification.from_pretrained(model_name, config=self.config)
#         self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#         self.model.to(self.device)
#         self.metric = evaluate.load("f1")
#         self.trainer = None

#     # train method
#     def train(self, train_dataset, validation_dataset, output_dir="trainer_output", logging_dir='./logs', num_train_epochs=5, per_device_train_batch_size=16, per_device_eval_batch_size=16):
#         training_args = TrainingArguments(
#             output_dir=output_dir,
#             evaluation_strategy="epoch",
#             num_train_epochs=num_train_epochs,
#             per_device_train_batch_size=per_device_train_batch_size,
#             per_device_eval_batch_size=per_device_eval_batch_size,
#             logging_dir=logging_dir,
#             logging_strategy='epoch'
#         )

#         self.trainer = Trainer(
#             model=self.model,
#             args=training_args,
#             train_dataset=train_dataset,
#             eval_dataset=validation_dataset,
#             compute_metrics=self.compute_metrics
#         )

#         self.trainer.train()

#     # compute_metrics method
#     def compute_metrics(self, eval_pred):
#         logits, labels = eval_pred
#         predictions = np.argmax(logits, axis=-1)
#         return self.metric.compute(predictions=predictions, references=labels, average='micro')

#     # evaluate method
#     def evaluate(self):
#         if self.trainer is None:
#             raise ValueError("The model has not been trained yet. Please call the train method first.")
#         eval_results = self.trainer.evaluate()
#         return eval_results



In [85]:
classifier = ModelBERTSequenceClassifier(id2label=id2label, label2id=label2id)
classifier.train(train_dataset, validation_dataset)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,Precision,F1
1,2.6835,2.451538,0.403439,0.660374,0.407961
2,1.1988,1.266795,0.702381,0.797736,0.708736
3,0.5108,1.124602,0.748677,0.848022,0.759823
4,0.2593,1.168669,0.742063,0.858019,0.760403
5,0.149,1.181924,0.744709,0.858797,0.762974


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [86]:
eval_results = classifier.evaluate()
print(eval_results)

{'eval_loss': 1.1819238662719727, 'eval_accuracy': 0.7447089947089947, 'eval_precision': 0.8587971116184157, 'eval_f1': 0.7629736004231013, 'eval_runtime': 23.1169, 'eval_samples_per_second': 32.703, 'eval_steps_per_second': 2.076, 'epoch': 5.0}


# **Pipeline**

In [87]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [88]:
# # copy path of trainer_output to my drive

!cp -r /content/trainer_output/checkpoint-436 /content/drive/MyDrive/Update_Deep_Learning_Project

# !cp -r /content/trainer_output/checkpoint-500 /content/drive/MyDrive/Update_Deep_Learning_Project



In [65]:
# save my trained model

# classifier.trainer.save_model("my_trained_model")


In [66]:
# !cp -r /content/my_trained_model /content/drive/MyDrive/Update_Deep_Learning_Project


In [89]:
from transformers import pipeline

resume_classifier = pipeline("text-classification", model = "/content/drive/MyDrive/Update_Deep_Learning_Project/checkpoint-436", tokenizer = tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [90]:
print(preprocessed_dataset['test'][700])

{'Resume_str': 'bLEONARD HIRSCHn2356 Lucy Lane xe2x80xa2 Seymour IN xe2x80xa2 Phone 8215472214 xe2x80xa2 EmailnleohirschjumpmailcomnnOBJECTIVEnTo develop my skills and experience in Finance and Business Administration in ansuccessful organizationnnQUALIFICATIONSnxefx82xb7nnHands on experience of how investment brokerage firms function from annoperational perspectivennxefx82xb7nnFive years of sales experience in retail operations utilizing strongninterpersonal and customer service skills with a focus on customernsatisfaction and loyaltynnxefx82xb7nnExtensive accounting and MS Office application experience and preparingnpayroll for small firmsnnWORK HISTORYnKilners Trading Corporation Wire OperatorOctober 2008 Presentnxefx82xb7nnAssist brokers in nine Merrill Lynch offices in the state of South Carolinanwith entering trades worth 1 Million on average a daynnxefx82xb7nnCoordiante with all exchanges and trading desks to resolve trade relatednissuesnnxefx82xb7nnEncode 850000 worth of trade 

In [91]:
# Truncate the input text to the maximum length the model can handle
max_length = tokenizer.model_max_length  # Get the maximum length from the tokenizer
truncated_text = 'bLEONARD HIRSCHn2356 Lucy Lane xe2x80xa2 Seymour IN xe2x80xa2 Phone 8215472214 xe2x80xa2 EmailnleohirschjumpmailcomnnOBJECTIVEnTo develop my skills and experience in Finance and Business Administration in ansuccessful organizationnnQUALIFICATIONSnxefx82xb7nnHands on experience of how investment brokerage firms function from annoperational perspectivennxefx82xb7nnFive years of sales experience in retail operations utilizing strongninterpersonal and customer service skills with a focus on customernsatisfaction and loyaltynnxefx82xb7nnExtensive accounting and MS Office application experience and preparingnpayroll for small firmsnnWORK HISTORYnKilners Trading Corporation Wire OperatorOctober 2008 Presentnxefx82xb7nnAssist brokers in nine Merrill Lynch offices in the state of South Carolinanwith entering trades worth 1 Million on average a daynnxefx82xb7nnCoordiante with all exchanges and trading desks to resolve trade relatednissuesnnxefx82xb7nnEncode 850000 worth of trade corrections dailynnxefx82xb7nnLog corrections into error lognnxefx82xb7nnAssist cashiers with deposits and securities processingnnxefx82xb7nnAssist with other admin tasks when needednGenerics Matelle Inc Wire OperatorMay 2008 xe2x80x93 September 2008nnxefx82xb7nnAssist individuals with attaining a logon and signing on to LANnnxefx82xb7nnFacilitate system upgrades that occur frequentlynnxefx82xb7nnAssist individuals with any computer problems they may havennBankard Financials Assistantnxefx82xb7nnMay 2006 April 2008nnPrepare payroll using accounting software for small accounts duringnabsence of chief accountantnnx0cLEONARD HIRSCHn2356 Lucy Lane xe2x80xa2 Seymour IN xe2x80xa2 Phone 8215472214 xe2x80xa2 Emailnleohirschjumpmailcomnnxefx82xb7nnProcessed new employee information in preparation for payroll andnrecruitment activitiesnnComfort Style Incorporated Sales Representative November 2005 April 2006nxefx82xb7nnHighest sales producer for one quarternnxefx82xb7nnSold upscale menxe2x80x99s clothingnnxefx82xb7nnIncreased revenues for assigned region by 25nnxefx82xb7nnParticipated in taking store inventory semiannuallynnEDUCATIONnxefx82xb7nnBachelor of Science in Finance University of Notre Dame May 2005nnINTERESTS ACTIVITIESnxefx82xb7nnCHI PSI Social Fraternity pledge class auditor and correspondencenchairmannnxefx82xb7nnStudent Alumni AssociationnnAWARDS RECEIVEDnxefx82xb7nnDeanxe2x80x99s List 2003 and 2004nnxefx82xb7nnExcellence in Mathematics 20022003and 2004'[:max_length]

# Perform inference on the truncated text
result = resume_classifier(truncated_text)
print(result)

[{'label': 'BANKING', 'score': 0.6074850559234619}]


In [92]:
max_length = tokenizer.model_max_length  # Get the maximum length from the tokenizer
truncated_text = 'Skill Set Hadoop Map Reduce HDFS Hive Sqoop java Duration 2016 to 2017 Role Hadoop Developer Rplus offers an quick simple and powerful cloud based Solution Demand Sense to accurately predict demand for your product in all your markets which Combines Enterprise and External Data to predict demand more accurately through Uses Social Conversation and Sentiments to derive demand and Identifies significant drivers of sale out of hordes of factors that Selects the best suited model out of multiple forecasting models for each product Responsibilities â Involved in deploying the product for customers gathering requirements and algorithm optimization at backend of the product â Load and transform Large Datasets of structured semi structured â Responsible to manage data coming from different sources and application â Supported Map Reduce Programs those are running on the cluster â Involved in creating Hive tables loading with data and writing hive queries which will run internally in map reduce wayEducation Details Hadoop Developer Hadoop Developer Braindatawire Skill Details APACHE HADOOP HDFS Exprience 49 months APACHE HADOOP SQOOP Exprience 49 months Hadoop Exprience 49 months HADOOP Exprience 49 months HADOOP DISTRIBUTED FILE SYSTEM Exprience 49 monthsCompany Details company Braindatawire description Technical Skills â Programming Core Java Map Reduce Scala â Hadoop Tools HDFS Spark Map Reduce Sqoop Hive Hbase â Database MySQL Oracle â Scripting Shell Scripting â IDE Eclipse â Operating Systems Linux CentOS Windows â Source Control Git Github'[:max_length]

# Perform inference on the truncated text
result = resume_classifier(truncated_text)
print(result)

[{'label': 'Data Science', 'score': 0.8821694850921631}]


In [93]:
!pip install PyPDF2
import PyPDF2
import torch


# Function to test resumes
def test_resume(pdf_file_path, resume_classifier, tokenizer): # parameters pdf file, pipeline and tokenizer
    # Open the PDF file
    with open(pdf_file_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Extract text from all pages
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

    # Preprocess the extracted text
    preprocessed_text = remove_punctuation_and_extra_spaces(text)

    # Perform inference
    results = resume_classifier(preprocessed_text)

    # Extract predicted class and score from results
    predicted_class = results[0]['label']
    score = results[0]['score']

    # Print the results
    print(f"Predicted Class: {predicted_class}")
    print(f"Score: {score:.4f}")

    # return the results
    return predicted_class, score



In [94]:
pdf_file_path = '/content/CVMhammadIbrahim.pdf'
predicted_class, score = test_resume(pdf_file_path, resume_classifier, tokenizer)

Predicted Class: Backend Developer
Score: 0.7473


In [95]:
pdf_file_path = '/content/python.pdf'
predicted_class, score = test_resume(pdf_file_path, resume_classifier, tokenizer)

Predicted Class: Backend Developer
Score: 0.8650


In [96]:
pdf_file_path = '/content/sales-resume-example.pdf'
predicted_class, score = test_resume(pdf_file_path, resume_classifier, tokenizer)

Predicted Class: Sales
Score: 0.9627


In [97]:
from transformers import pipeline

resume_classifier2 = pipeline("text-classification", model = "/content/drive/MyDrive/Update_Deep_Learning_Project/checkpoint-500", tokenizer = tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [98]:
pdf_file_path = '/content/human-resources-resume-example.pdf'
predicted_class, score = test_resume(pdf_file_path, resume_classifier2, tokenizer)

Predicted Class: HR
Score: 0.9801


In [99]:
pdf_file_path = '/content/CVMhammadIbrahim.pdf'
predicted_class, score = test_resume(pdf_file_path, resume_classifier2, tokenizer)

Predicted Class: DevOps Engineer
Score: 0.6443
