# **Resume Categorization**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


![Resume Categorization](CreatingaPhysicianCVthatShines.jpg)

This project tackles **resume categorization** using machine learning and deep learning techniques. Recruiters often have to sift through a large number of resumes for each job opening. The app automates this step: a trained model predicts the job category a given resume belongs to, so each resume can be routed to the appropriate category quickly, saving recruiters time and resources.

## **Set Environment**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score
from pandas.plotting import scatter_matrix
from sklearn.neighbors import KNeighborsClassifier

**Dataset Acquisition:** The project uses the [Resume Dataset](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset) from Kaggle. This dataset consists of resumes categorized into 25 distinct job fields, providing a solid foundation for training our models.

In [3]:
df = pd.read_csv("/content/drive/MyDrive/UpdatedResumeDataSet.csv")
df.head()

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."


### **Displaying the distinct categories of resume**

In [4]:
print ("Displaying the distinct categories of resume:\n\n ")
print (df['Category'].unique())

Displaying the distinct categories of resume:

 
['Data Science' 'HR' 'Advocate' 'Arts' 'Web Designing'
 'Mechanical Engineer' 'Sales' 'Health and fitness' 'Civil Engineer'
 'Java Developer' 'Business Analyst' 'SAP Developer' 'Automation Testing'
 'Electrical Engineering' 'Operations Manager' 'Python Developer'
 'DevOps Engineer' 'Network Security Engineer' 'PMO' 'Database' 'Hadoop'
 'ETL Developer' 'DotNet Developer' 'Blockchain' 'Testing']


### **Displaying the number of resumes in each category**

In [5]:
print ("Displaying the distinct categories of resume and the number of records belonging to each category:\n\n")
print (df['Category'].value_counts())

Displaying the distinct categories of resume and the number of records belonging to each category:


Category
Java Developer               84
Testing                      70
DevOps Engineer              55
Python Developer             48
Web Designing                45
HR                           44
Hadoop                       42
Sales                        40
Data Science                 40
Mechanical Engineer          40
ETL Developer                40
Blockchain                   40
Operations Manager           40
Arts                         36
Database                     33
Health and fitness           30
PMO                          30
Electrical Engineering       30
Business Analyst             28
DotNet Developer             28
Automation Testing           26
Network Security Engineer    25
Civil Engineer               24
SAP Developer                24
Advocate                     20
Name: count, dtype: int64


### **Check the dataset**

In [6]:
df['Category'][0]

'Data Science'

In [7]:
df['Resume'][0]

'Skills * Programming Languages: Python (pandas, numpy, scipy, scikit-learn, matplotlib), Sql, Java, JavaScript/JQuery. * Machine learning: Regression, SVM, NaÃ¯ve Bayes, KNN, Random Forest, Decision Trees, Boosting techniques, Cluster Analysis, Word Embedding, Sentiment Analysis, Natural Language processing, Dimensionality reduction, Topic Modelling (LDA, NMF), PCA & Neural Nets. * Database Visualizations: Mysql, SqlServer, Cassandra, Hbase, ElasticSearch D3.js, DC.js, Plotly, kibana, matplotlib, ggplot, Tableau. * Others: Regular Expression, HTML, CSS, Angular 6, Logstash, Kafka, Python Flask, Git, Docker, computer vision - Open CV and understanding of Deep learning.Education Details \r\n\r\nData Science Assurance Associate \r\n\r\nData Science Assurance Associate - Ernst & Young LLP\r\nSkill Details \r\nJAVASCRIPT- Exprience - 24 months\r\njQuery- Exprience - 24 months\r\nPython- Exprience - 24 monthsCompany Details \r\ncompany - Ernst & Young LLP\r\ndescription - Fraud Investigatio

## **Text preprocessing**

**Text Preprocessing:** Raw resume text is often messy. We apply preprocessing techniques to clean the resume data, including:
- Removing URLs, RTs, hashtags, and mentions
- Eliminating special characters and non-ASCII characters
- Collapsing extra whitespace

In [8]:
import re
def cleanResume(txt):
    cleanText = re.sub(r'http\S+\s', ' ', txt) # Remove URLs that are followed by whitespace
    cleanText = re.sub(r'RT|cc', ' ', cleanText) # Remove "RT" and "cc"; note this also matches inside words (e.g. "accelerating" becomes "a elerating")
    cleanText = re.sub(r'#\S+\s', ' ', cleanText) # Remove hashtags that are followed by whitespace
    cleanText = re.sub(r'@\S+', '  ', cleanText) # Remove @mentions
    cleanText = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), ' ', cleanText) # Replace punctuation with spaces
    cleanText = re.sub(r'[^\x00-\x7f]', ' ', cleanText) # Replace non-ASCII characters with spaces
    cleanText = re.sub(r'\s+', ' ', cleanText) # Collapse runs of whitespace into a single space
    return cleanText

In [9]:
df['Resume'] = df['Resume'].apply(lambda x: cleanResume(x))

### **Check cleaned text**

In [10]:
df['Resume'][0]

'Skills Programming Languages Python pandas numpy scipy scikit learn matplotlib Sql Java JavaScript JQuery Machine learning Regression SVM Na ve Bayes KNN Random Forest Decision Trees Boosting techniques Cluster Analysis Word Embedding Sentiment Analysis Natural Language processing Dimensionality reduction Topic Modelling LDA NMF PCA Neural Nets Database Visualizations Mysql SqlServer Cassandra Hbase ElasticSearch D3 js DC js Plotly kibana matplotlib ggplot Tableau Others Regular Expression HTML CSS Angular 6 Logstash Kafka Python Flask Git Docker computer vision Open CV and understanding of Deep learning Education Details Data Science Assurance Associate Data Science Assurance Associate Ernst Young LLP Skill Details JAVASCRIPT Exprience 24 months jQuery Exprience 24 months Python Exprience 24 monthsCompany Details company Ernst Young LLP description Fraud Investigations and Dispute Services Assurance TECHNOLOGY ASSISTED REVIEW TAR Technology Assisted Review assists in a elerating the 
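
The cleaned text above contains fragments such as `a elerating`, because the `RT|cc` pattern also matches inside ordinary words. A minimal sketch of one possible variant follows, assuming the intent is to drop `RT` and `cc` only when they appear as standalone tokens. The name `clean_resume_wb` and the sample string are hypothetical and are not part of the original notebook.

```python
import re
import string

# Same punctuation set as the original cleaning step, via string.punctuation
_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def clean_resume_wb(txt):
    """Hypothetical variant of cleanResume: removes 'RT'/'cc' only as whole words."""
    txt = re.sub(r"http\S+", " ", txt)         # URLs
    txt = re.sub(r"\b(?:RT|cc)\b", " ", txt)   # standalone RT / cc tokens only
    txt = re.sub(r"#\S+", " ", txt)            # hashtags
    txt = re.sub(r"@\S+", " ", txt)            # mentions
    txt = _PUNCT.sub(" ", txt)                 # punctuation
    txt = re.sub(r"[^\x00-\x7f]", " ", txt)    # non-ASCII characters
    return re.sub(r"\s+", " ", txt).strip()    # collapse whitespace

# Invented sample text, for illustration only
sample = "RT @user accelerating review https://example.com #ml cc: team"
cleaned = clean_resume_wb(sample)
print(cleaned)  # accelerating review team
```

Switching to this variant would change the vocabulary seen by the vectorizer, so the accuracy figures reported later would need to be recomputed.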

## **Model Preprocessing**


**Data Preparation:**
- **Category Encoding:** We transform the textual categories into integer labels using Label Encoding, so that the machine learning algorithms can work with them.
- **TF-IDF Vectorization:** We convert the cleaned text into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency), which gives more weight to terms that are frequent in a given resume but rare across the rest of the dataset.
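
To make the two steps above concrete, here is a small sketch on toy data. The category names and resume snippets are invented for illustration, and the `toy_` variable names are hypothetical, chosen so they do not clash with the notebook's own variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data, invented for illustration (not taken from the resume dataset)
toy_categories = ["Data Science", "Java Developer", "Data Science"]
toy_resumes = ["python pandas sql", "java spring sql", "python keras sql"]

# Label encoding: classes are sorted alphabetically and mapped to 0..n-1
toy_le = LabelEncoder()
toy_labels = toy_le.fit_transform(toy_categories)
print(list(toy_le.classes_), list(toy_labels))

# TF-IDF: one row per document, one column per vocabulary term
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_resumes)
vocab = toy_tfidf.vocabulary_  # term -> column index

# Weights of three terms in the first document
row0 = toy_matrix[0].toarray().ravel()
w_pandas = row0[vocab["pandas"]]  # appears in 1 of 3 documents
w_python = row0[vocab["python"]]  # appears in 2 of 3 documents
w_sql = row0[vocab["sql"]]        # appears in all 3 documents
print(w_pandas > w_python > w_sql)
```

All three terms occur once in the first document, so the difference in their weights comes entirely from how many documents contain them.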

### **Category Encoding**

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(df['Category'])
df['Category'] = le.transform(df['Category'])

In [12]:
df.Category.unique()

array([ 6, 12,  0,  1, 24, 16, 22, 14,  5, 15,  4, 21,  2, 11, 18, 20,  8,
       17, 19,  7, 13, 10,  9,  3, 23])

### **TF-IDF**

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')

tfidf.fit(df['Resume'])
requredTaxt  = tfidf.transform(df['Resume'])

### **Data Splitting**

**Data Splitting:** The dataset is divided into training and testing sets to properly evaluate the performance of our models.

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(requredTaxt, df['Category'], test_size=0.2, random_state=42)
X_train.shape

(769, 7351)

In [15]:
X_test.shape

(193, 7351)
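
In the cells above, the TF-IDF vectorizer is fitted on the full dataset before the split, so vocabulary and document-frequency statistics from the test rows also shape the training features. The following is a sketch of one common alternative, not the notebook's method: split the raw text first (here with stratification) and fit the vectorizer inside a pipeline on the training part only. The toy texts, labels, and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration (not taken from the resume dataset)
toy_texts = [
    "python pandas machine learning", "statistics regression python",
    "deep learning keras python", "nlp text mining python",
    "java spring hibernate", "java servlet jsp",
    "java microservices maven", "java junit jdbc",
]
toy_y = ["Data Science"] * 4 + ["Java Developer"] * 4

# Split the raw text first; stratify keeps class proportions in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

# The pipeline fits TF-IDF on the training texts only, then the classifier
toy_pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KNeighborsClassifier(n_neighbors=1),
)
toy_pipe.fit(txt_train, lab_train)

# At test time the fitted vocabulary and IDF weights are reused unchanged
toy_acc = toy_pipe.score(txt_test, lab_test)
print(len(txt_train), len(txt_test), toy_acc)
```

With stratification, each class appears in the test part in roughly the same proportion as in the full data, which matters here because the smallest categories have only about 20 resumes.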

## **Model Development**

**Model Development:** We build and train multiple models:
- **Machine Learning Models:**
    *   **K-Nearest Neighbors (KNN):** A simple yet effective classification algorithm based on distance between points.
    *   **Support Vector Machine (SVC):** A powerful algorithm that finds an optimal hyperplane to separate classes.
    *   **Random Forest:** An ensemble method that combines multiple decision trees for more robust predictions.
- **Deep Learning Model:**
    *   **Multilayer Perceptron (MLP):** A neural network model with multiple hidden layers to learn complex patterns from the data.
- **Ensemble Model:**
    *   **Voting Classifier:** Combines the predictions from the machine learning models to achieve more accurate and robust results.
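
As a rough illustration of what the voting step does with `voting='hard'`, the sketch below takes a majority vote over per-model predictions using plain NumPy. The prediction values and the helper name `hard_vote` are hypothetical. Ties go to the lowest class label here, which to my understanding matches how scikit-learn breaks ties in hard voting.

```python
import numpy as np

def hard_vote(pred_matrix, n_classes):
    """Majority vote per sample; rows are models, columns are samples."""
    return np.array([
        np.bincount(column, minlength=n_classes).argmax()
        for column in pred_matrix.T
    ])

# Hypothetical predictions from three models for four samples
toy_preds = np.array([
    [0, 1, 2, 2],  # e.g. KNN
    [0, 1, 1, 2],  # e.g. SVC
    [0, 2, 1, 2],  # e.g. Random Forest
])

voted = hard_vote(toy_preds, n_classes=3)
print(voted)  # [0 1 1 2]
```

A single model's mistake (such as the third model on the second sample) is outvoted as long as the other two models agree.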

### **Machine Learning**

In [16]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

In [17]:
# Ensure that X_train and X_test are dense if they are sparse
X_train = X_train.toarray() if hasattr(X_train, 'toarray') else X_train
X_test = X_test.toarray() if hasattr(X_test, 'toarray') else X_test

#### **KNN**

In [18]:
# 1. Train KNeighborsClassifier
knn_model = OneVsRestClassifier(KNeighborsClassifier())
knn_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_knn = knn_model.predict(X_train)
train_accuracy_knn = accuracy_score(y_train, y_train_pred_knn)
print(f"Training Accuracy: {train_accuracy_knn:.4f}")

# Accuracy for the test set
y_pred_knn = knn_model.predict(X_test)
test_accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f" Testing Accuracy: {test_accuracy_knn:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_knn)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_knn)}")

Training Accuracy: 0.9857
 Testing Accuracy: 0.9845
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  0  6  0  0  0  0 

#### **Support Vector Machine**

In [19]:
# 2. Train SVC
svc_model = OneVsRestClassifier(SVC())
svc_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_svc = svc_model.predict(X_train)
train_accuracy_svc = accuracy_score(y_train, y_train_pred_svc)
print(f"Training Accuracy: {train_accuracy_svc:.4f}")

# Accuracy for the test set
y_pred_svc = svc_model.predict(X_test)
test_accuracy_svc = accuracy_score(y_test, y_pred_svc)
print("\nSVC Results:")
print(f"Testing Accuracy: {test_accuracy_svc:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_svc)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_svc)}")

Training Accuracy: 1.0000

SVC Results:
Testing Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  0  6

#### **Random Forest**

In [20]:
# 3. Train RandomForestClassifier
rf_model = OneVsRestClassifier(RandomForestClassifier())
rf_model.fit(X_train, y_train)

# Accuracy for the training set
y_train_pred_rf = rf_model.predict(X_train)
train_accuracy_rf = accuracy_score(y_train, y_train_pred_rf)
print(f"Training Accuracy: {train_accuracy_rf:.4f}")

# Accuracy for the test set
y_pred_rf = rf_model.predict(X_test)
test_accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("\nRandomForestClassifier Results:")
print(f"Testing Accuracy: {test_accuracy_rf:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_rf)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_rf)}")

Training Accuracy: 1.0000

RandomForestClassifier Results:
Testing Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  

#### **MLP**

In [21]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense
import numpy as np
from sklearn.metrics import classification_report

# Assuming X_train, X_test, y_train, y_test are already defined

# Define the MLP model
model = keras.Sequential([
    Dense(128, activation='relu', input_shape=(X_train.shape[1],)),  # First hidden layer (input size = number of TF-IDF features)
    Dense(64, activation='relu'),  # Second hidden layer
    Dense(len(np.unique(y_train)), activation='softmax')  # Output layer (one unit per class)
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # Use sparse_categorical_crossentropy for integer labels
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy = history.history['accuracy'][-1]  # Last epoch's training accuracy
val_accuracy = history.history['val_accuracy'][-1]  # Last epoch's validation accuracy

print(f"Final Training Accuracy: {train_accuracy:.4f}")
print(f"Final Validation Accuracy: {val_accuracy:.4f}")

# Evaluate the model on the test set
loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Make predictions
y_pred_mlp = np.argmax(model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_mlp))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 5s 151ms/step - accuracy: 0.2701 - loss: 3.1599 - val_accuracy: 0.4545 - val_loss: 2.8693
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step - accuracy: 0.6069 - loss: 2.5747 - val_accuracy: 0.6494 - val_loss: 2.0252
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.8443 - loss: 1.5311 - val_accuracy: 0.9740 - val_loss: 1.0601
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9989 - loss: 0.6230 - val_accuracy: 0.9870 - val_loss: 0.4473
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 1.0000 - loss: 0.2120 - val_accuracy: 1.0000 - val_loss: 0.1959
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 1.0000 - loss: 0.0864 - val_accuracy: 1.0000 - val_loss: 0.1014
Epoch 7/10
22/22 ━━━━━━━━

### **Ensemble Model**

#### **Voting**

In [22]:
from sklearn.ensemble import VotingClassifier

# Create a VotingClassifier with the three models
ensemble_model = VotingClassifier(estimators=[
    ('knn', knn_model),
    ('svc', svc_model),
    ('rf', rf_model)
], voting='hard')

# Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Make predictions with the ensemble model
y_pred_ensemble = ensemble_model.predict(X_test)

# Evaluate the ensemble model
ensemble_accuracy = accuracy_score(y_test, y_pred_ensemble)
print(f"Ensemble Model Accuracy: {ensemble_accuracy:.4f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_ensemble)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred_ensemble)}")

Ensemble Model Accuracy: 0.9948
Confusion Matrix:
[[ 3  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  8  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0
   0]
 [ 0  0  0  0  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0

## **Deep Learning**

### **RNN**

In [23]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#### **RNN with Adam**

In [27]:
# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(df['Resume'])
sequences = tokenizer.texts_to_sequences(df['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels (df['Category'] was already integer-encoded above, so this re-fit is an identity mapping)
le = LabelEncoder()
le.fit(df['Category'])
encoded_labels = le.transform(df['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the RNN model
rnn_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    SimpleRNN(64, return_sequences=False),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
rnn_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Train the model
history_rnn = rnn_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_rnn = history_rnn.history['accuracy'][-1]
val_accuracy_rnn = history_rnn.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_rnn:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_rnn:.4f}")

# Evaluate the model on the test set
loss_rnn, test_accuracy_rnn = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_rnn:.4f}")
print(f"Test Accuracy: {test_accuracy_rnn:.4f}")

# Make predictions
y_pred_rnn = np.argmax(rnn_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_rnn))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 12s 235ms/step - accuracy: 0.3178 - loss: 3.0118 - val_accuracy: 0.7532 - val_loss: 2.5169
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - accuracy: 0.8474 - loss: 2.1228 - val_accuracy: 0.8831 - val_loss: 1.6793
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.9381 - loss: 1.2177 - val_accuracy: 0.8961 - val_loss: 0.9940
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.9310 - loss: 0.6610 - val_accuracy: 0.9091 - val_loss: 0.6442
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - accuracy: 0.9674 - loss: 0.3852 - val_accuracy: 0.9610 - val_loss: 0.3908
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.9956 - loss: 0.2126 - val_accuracy: 0.9610 - val_loss: 0.2926
Epoch 7/10
22/22 ━━

7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 95ms/step
              precision    recall  f1-score   support

           0       0.50      0.33      0.40         3
           1       1.00      0.67      0.80         6
           2       1.00      1.00      1.00         5
           3       1.00      1.00      1.00         7
           4       1.00      1.00      1.00         4
           5       1.00      0.67      0.80         9
           6       1.00      1.00      1.00         5
           7       1.00      1.00      1.00         8
           8       1.00      0.93      0.96        14
           9       1.00      1.00      1.00         5
          10       1.00      1.00      1.00         7
          11       1.00      0.50      0.67         6
          12       0.80      1.00      0.89        12
          13       1.00      1.00      1.00         4
          14       0.58      1.00      0.74         7
          15       1.00      1.00      1.00        15
         

#### **RNN with SGD**

In [28]:
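from tensorflow.keras.optimizers import SGD

# Recompile the existing rnn_model with SGD (learning rate 0.01, momentum 0.9).
# Note: compile() does not reset the weights, so the training below continues from
# the Adam-trained weights rather than from a fresh initialization.
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)
rnn_model.compile(optimizer=sgd_optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])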

# Train the model
history_rnn = rnn_model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_rnn = history_rnn.history['accuracy'][-1]
val_accuracy_rnn = history_rnn.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_rnn:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_rnn:.4f}")

# Evaluate the model on the test set
loss_rnn, test_accuracy_rnn = rnn_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_rnn:.4f}")
print(f"Test Accuracy: {test_accuracy_rnn:.4f}")

# Make predictions
y_pred_rnn = np.argmax(rnn_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_rnn))

Epoch 1/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 87ms/step - accuracy: 0.4104 - loss: 2.6136 - val_accuracy: 0.0390 - val_loss: 3.6470
Epoch 2/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.0684 - loss: 3.3071 - val_accuracy: 0.1429 - val_loss: 3.1616
Epoch 3/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 22ms/step - accuracy: 0.0786 - loss: 3.1271 - val_accuracy: 0.0260 - val_loss: 3.2517
Epoch 4/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 0.0978 - loss: 3.0579 - val_accuracy: 0.0519 - val_loss: 3.1179
Epoch 5/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.0770 - loss: 3.0604 - val_accuracy: 0.1039 - val_loss: 3.1609
Epoch 6/100
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 0.0893 - loss: 3.0538 - val_accuracy: 0.0779 - val_loss: 3.1186
Epoch 7/100
22/22

7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 50ms/step
              precision    recall  f1-score   support

           0       1.00      0.33      0.50         3
           1       0.67      0.33      0.44         6
           2       1.00      1.00      1.00         5
           3       1.00      1.00      1.00         7
           4       1.00      1.00      1.00         4
           5       1.00      0.56      0.71         9
           6       1.00      0.80      0.89         5
           7       1.00      1.00      1.00         8
           8       1.00      0.93      0.96        14
           9       1.00      0.60      0.75         5
          10       1.00      0.71      0.83         7
          11       1.00      0.50      0.67         6
          12       0.00      0.00      0.00        12
          13       1.00      1.00      1.00         4
          14       1.00      0.43      0.60         7
          15       0.21      1.00      0.35        15
         

### **LSTM**

In [34]:
from tensorflow.keras.layers import LSTM, Embedding, Dense, GlobalMaxPool1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(df['Resume'])
sequences = tokenizer.texts_to_sequences(df['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels (df['Category'] was already integer-encoded above, so this re-fit is an identity mapping)
le = LabelEncoder()
le.fit(df['Category'])
encoded_labels = le.transform(df['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the LSTM model
lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    LSTM(64, return_sequences=True),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
lstm_model.compile(optimizer='adam',
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])

# Train the model
history_lstm = lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_lstm = history_lstm.history['accuracy'][-1]
val_accuracy_lstm = history_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_lstm:.4f}")

# Evaluate the model on the test set
loss_lstm, test_accuracy_lstm = lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_lstm:.4f}")

# Make predictions
y_pred_lstm = np.argmax(lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_lstm))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 26ms/step - accuracy: 0.0814 - loss: 3.1957 - val_accuracy: 0.2468 - val_loss: 3.0749
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.2456 - loss: 2.9919 - val_accuracy: 0.2468 - val_loss: 2.8769
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.2296 - loss: 2.7831 - val_accuracy: 0.2857 - val_loss: 2.5922
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.3308 - loss: 2.3399 - val_accuracy: 0.4545 - val_loss: 2.2734
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.5506 - loss: 1.9703 - val_accuracy: 0.5325 - val_loss: 1.8394
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - accuracy: 0.6855 - loss: 1.4866 - val_accuracy: 0.7143 - val_loss: 1.4066
Epoch 7/10
22/22 ━━━━

#### **LSTM with SGD**

In [35]:
from tensorflow.keras.optimizers import SGD

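# Recompile the existing lstm_model with SGD (learning rate 0.01, momentum 0.9).
# Note: compile() does not reset the weights, so the training below continues from
# the Adam-trained weights rather than from a fresh initialization.
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)
lstm_model.compile(optimizer=sgd_optimizer,
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])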

# Train the model
history_lstm = lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_lstm = history_lstm.history['accuracy'][-1]
val_accuracy_lstm = history_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_lstm:.4f}")

# Evaluate the model on the test set
loss_lstm, test_accuracy_lstm = lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_lstm:.4f}")

# Make predictions
y_pred_lstm = np.argmax(lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_lstm))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 35ms/step - accuracy: 0.9809 - loss: 0.2508 - val_accuracy: 0.8831 - val_loss: 0.8441
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9100 - loss: 0.6410 - val_accuracy: 0.9091 - val_loss: 0.6245
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - accuracy: 0.9758 - loss: 0.3674 - val_accuracy: 0.9740 - val_loss: 0.3776
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.9898 - loss: 0.2382 - val_accuracy: 0.9740 - val_loss: 0.2624
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step - accuracy: 0.9931 - loss: 0.1665 - val_accuracy: 0.9740 - val_loss: 0.2051
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9970 - loss: 0.1216 - val_accuracy: 0.9740 - val_loss: 0.1826
Epoch 7/10
22/22 ━━━━

### **BI-LSTM**

In [43]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, GlobalMaxPool1D, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(df['Resume'])
sequences = tokenizer.texts_to_sequences(df['Resume'])
padded_sequences = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Encode the labels
le = LabelEncoder()
le.fit(df['Category'])
encoded_labels = le.transform(df['Category'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(padded_sequences, encoded_labels, test_size=0.2, random_state=42)

# Define the BI-LSTM model
bi_lstm_model = Sequential([
    Embedding(input_dim=5000, output_dim=128, input_length=200),
    Bidirectional(LSTM(64, return_sequences=True)),
    GlobalMaxPool1D(),
    Dense(64, activation='relu'),
    Dense(len(np.unique(y_train)), activation='softmax')
])

# Compile the model
bi_lstm_model.compile(optimizer='adam',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

# Train the model
history_bi_lstm = bi_lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_bi_lstm = history_bi_lstm.history['accuracy'][-1]
val_accuracy_bi_lstm = history_bi_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_bi_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_bi_lstm:.4f}")

# Evaluate the model on the test set
loss_bi_lstm, test_accuracy_bi_lstm = bi_lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_bi_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_bi_lstm:.4f}")

# Make predictions
y_pred_bi_lstm = np.argmax(bi_lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_bi_lstm))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 7s 66ms/step - accuracy: 0.1615 - loss: 3.1908 - val_accuracy: 0.1948 - val_loss: 3.0863
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 33ms/step - accuracy: 0.3121 - loss: 2.9855 - val_accuracy: 0.2078 - val_loss: 2.7326
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 2s 45ms/step - accuracy: 0.3492 - loss: 2.4300 - val_accuracy: 0.3766 - val_loss: 2.0412
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 36ms/step - accuracy: 0.6359 - loss: 1.6027 - val_accuracy: 0.7403 - val_loss: 1.2891
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 43ms/step - accuracy: 0.8909 - loss: 0.8487 - val_accuracy: 0.8701 - val_loss: 0.7302
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 45ms/step - accuracy: 0.9691 - loss: 0.4047 - val_accuracy: 0.9221 - val_loss: 0.3561
Epoch 7/10
22/22 ━━━━

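To make the `pad_sequences(sequences, maxlen=200, padding='post', truncating='post')` call in the cell above concrete, here is a minimal pure-Python sketch of the same truncate-then-pad behavior on toy data. The helper name `pad_post` and the toy id lists are illustrative assumptions, not part of Keras or of this notebook; the real function also handles dtypes and the `'pre'` modes.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Truncate and pad at the end of each sequence (the 'post' behavior)."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                   # truncating='post': keep the first maxlen ids
        kept += [pad_value] * (maxlen - len(kept))  # padding='post': append pad values at the end
        padded.append(kept)
    return padded


# Toy token-id sequences (hypothetical values), one short and one long.
toy_sequences = [[5, 9, 2], [7, 1, 4, 8, 3, 6]]
print(pad_post(toy_sequences, maxlen=4))
# prints [[5, 9, 2, 0], [7, 1, 4, 8]]
```

With `maxlen=200`, any resume longer than 200 tokens loses everything after its 200th token, so the model only sees the opening part of long resumes.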
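### **BI-LSTM with SGD**

In [44]:
from tensorflow.keras.optimizers import SGD

# Recompile the BI-LSTM with SGD plus momentum.
# Note: compile() keeps the current weights, so this run continues from the
# Adam-trained model above rather than starting from scratch.
sgd_optimizer = SGD(learning_rate=0.01, momentum=0.9)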
bi_lstm_model.compile(optimizer=sgd_optimizer,
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

# Train the model
history_bi_lstm = bi_lstm_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Retrieve training and validation accuracy from history
train_accuracy_bi_lstm = history_bi_lstm.history['accuracy'][-1]
val_accuracy_bi_lstm = history_bi_lstm.history['val_accuracy'][-1]

print(f"Final Training Accuracy: {train_accuracy_bi_lstm:.4f}")
print(f"Final Validation Accuracy: {val_accuracy_bi_lstm:.4f}")

# Evaluate the model on the test set
loss_bi_lstm, test_accuracy_bi_lstm = bi_lstm_model.evaluate(X_test, y_test)
print(f"Test Loss: {loss_bi_lstm:.4f}")
print(f"Test Accuracy: {test_accuracy_bi_lstm:.4f}")

# Make predictions
y_pred_bi_lstm = np.argmax(bi_lstm_model.predict(X_test), axis=1)

# Print classification report
print(classification_report(y_test, y_pred_bi_lstm))

Epoch 1/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 3s 51ms/step - accuracy: 1.0000 - loss: 0.0277 - val_accuracy: 1.0000 - val_loss: 0.0402
Epoch 2/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 24ms/step - accuracy: 1.0000 - loss: 0.0246 - val_accuracy: 1.0000 - val_loss: 0.0349
Epoch 3/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 23ms/step - accuracy: 1.0000 - loss: 0.0228 - val_accuracy: 1.0000 - val_loss: 0.0322
Epoch 4/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 1.0000 - loss: 0.0190 - val_accuracy: 1.0000 - val_loss: 0.0287
Epoch 5/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 40ms/step - accuracy: 1.0000 - loss: 0.0161 - val_accuracy: 1.0000 - val_loss: 0.0267
Epoch 6/10
22/22 ━━━━━━━━━━━━━━━━━━━━ 1s 38ms/step - accuracy: 1.0000 - loss: 0.0163 - val_accuracy: 1.0000 - val_loss: 0.0240
Epoch 7/10
22/22 ━━━━

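For intuition about `SGD(learning_rate=0.01, momentum=0.9)`, the sketch below applies the momentum update rule to a one-dimensional toy loss. It assumes the classic (non-Nesterov) form, in which a velocity term accumulates past gradients; the function name `sgd_momentum_step` and the quadratic loss are hypothetical, chosen only for illustration.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - learning_rate * grad  # blend old velocity with the new gradient
    return w + velocity, velocity                          # move the weight along the velocity


# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w, velocity = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w - 3.0)
    w, velocity = sgd_momentum_step(w, velocity, grad)

print(f"w after 500 steps: {w:.6f}")
```

With `momentum=0.0` the same function reduces to plain gradient descent, `w - learning_rate * grad`.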
# **BERT language model**

In [52]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
import numpy as np
import re

In [55]:
df = pd.read_csv("/content/drive/MyDrive/UpdatedResumeDataSet.csv")
df.head()

# Preprocess Text
df['Resume'] = df['Resume'].apply(cleanResume)

# Encode labels to numerical
label_encoder = LabelEncoder()
df['Category'] = label_encoder.fit_transform(df['Category'])

# Split into train and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

In [56]:
class ResumeDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        text = str(self.dataframe.iloc[idx]['Resume'])
        label = int(self.dataframe.iloc[idx]['Category'])

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
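The `padding='max_length'`, `truncation=True`, and `return_attention_mask=True` options used in `__getitem__` can be pictured with a small pure-Python sketch. It assumes the token ids are already available as a list and uses the `[CLS]`, `[SEP]`, and `[PAD]` ids of the `bert-base-uncased` vocabulary; `encode_fixed_length` is a hypothetical helper, not a `transformers` function, and the real tokenizer additionally performs WordPiece tokenization.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the bert-base-uncased vocabulary


def encode_fixed_length(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], truncate to fit, then pad to max_length."""
    body = list(token_ids[:max_length - 2])   # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    # 0 in the mask tells the model to ignore the padding positions.
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad


ids, mask = encode_fixed_length([2023, 2003, 1037], max_length=8)
print(ids)   # prints [101, 2023, 2003, 1037, 102, 0, 0, 0]
print(mask)  # prints [1, 1, 1, 1, 1, 0, 0, 0]
```

Every item therefore has the same length, which is what lets `DataLoader` stack items into a batch tensor.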

In [57]:
# Choose a BERT model variant (e.g., 'bert-base-uncased', 'bert-large-uncased')
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

# Find the number of labels
num_labels = len(label_encoder.classes_)

# Initialize the BERT model for classification
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Maximum sequence length
max_length = 512

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [58]:
# Create custom datasets
train_dataset = ResumeDataset(train_df, tokenizer, max_length)
val_dataset = ResumeDataset(val_df, tokenizer, max_length)

# Create dataloaders for train and validation dataset
batch_size = 8
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [59]:
# Set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# AdamW optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Number of training epochs
num_epochs = 3

In [60]:
# Initialize a variable to store the final validation accuracy
final_val_accuracy = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    train_predictions = []
    train_true_labels = []

    for batch in train_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        train_loss += loss.item()
        loss.backward()
        optimizer.step()

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        train_predictions.extend(predictions.cpu().numpy())
        train_true_labels.extend(labels.cpu().numpy())

    avg_train_loss = train_loss / len(train_dataloader)
    train_accuracy = accuracy_score(train_true_labels, train_predictions)
    print(f"Epoch {epoch + 1}/{num_epochs}, Training Loss: {avg_train_loss}, Training Accuracy: {train_accuracy}")

    # Evaluation
    model.eval()
    val_loss = 0
    val_predictions = []
    val_true_labels = []

    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            val_predictions.extend(predictions.cpu().numpy())
            val_true_labels.extend(labels.cpu().numpy())

    avg_val_loss = val_loss / len(val_dataloader)
    accuracy = accuracy_score(val_true_labels, val_predictions)
    final_val_accuracy = accuracy  # Update the final accuracy after each epoch
    print(f"Epoch {epoch + 1}/{num_epochs}, Testing Loss: {avg_val_loss}, Testing Accuracy: {accuracy}")
    print(classification_report(val_true_labels, val_predictions, target_names=label_encoder.classes_))

# Print the final validation accuracy after all epochs
print(f"Final Validation Accuracy: {final_val_accuracy}")

Epoch 1/3, Training Loss: 2.997423695534775, Training Accuracy: 0.1625487646293888
Epoch 1/3, Testing Loss: 2.4730951166152955, Testing Accuracy: 0.37823834196891193
                           precision    recall  f1-score   support

                 Advocate       1.00      0.33      0.50         3
                     Arts       1.00      0.17      0.29         6
       Automation Testing       0.00      0.00      0.00         5
               Blockchain       0.00      0.00      0.00         7
         Business Analyst       0.14      0.25      0.18         4
           Civil Engineer       0.00      0.00      0.00         9
             Data Science       0.14      1.00      0.25         5
                 Database       0.00      0.00      0.00         8
          DevOps Engineer       0.00      0.00      0.00        14
         DotNet Developer       0.00      0.00      0.00         5
            ETL Developer       0.00      0.00      0.00         7
   Electrical Engineering    

In [61]:
def predict_category(resume_text, tokenizer, model, max_length, device, label_encoder):
    model.eval()
    cleaned_text = cleanResume(resume_text)
    encoding = tokenizer.encode_plus(
        cleaned_text,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    with torch.no_grad():
        output = model(input_ids, attention_mask=attention_mask)
        predicted_class = torch.argmax(output.logits, dim=-1).cpu().item()
        predicted_label = label_encoder.inverse_transform([predicted_class])[0]
    return predicted_label
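`predict_category` returns only the top label. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below shows that step in pure Python on made-up logits; `softmax`, `top_prediction`, and the three-class label list are illustrative assumptions, not part of the notebook.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return class_names[best], probs[best]


# Hypothetical logits for a three-class toy problem.
label, confidence = top_prediction([0.2, 2.5, -1.0], ["Advocate", "HR", "Testing"])
print(f"{label} ({confidence:.2f})")
```

A low top probability could be used to flag resumes for manual review instead of trusting the predicted category.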

In [62]:
# Example Usage
sample_resume = """Name: Sarah Johnson
Contact Information:

Phone: +1 555-123-4567
Email: sarah.johnson@email.com
LinkedIn: linkedin.com/in/sarahjohnson
Address: New York, NY
Professional Summary
A results-driven HR professional with over 6 years of experience in talent acquisition, employee engagement, and HR operations. Skilled in building strong teams, fostering positive work environments, and implementing HR strategies that align with organizational goals. Proficient in HR software and data-driven decision-making to improve workforce management.

Key Skills
Talent Acquisition and Recruitment
Employee Onboarding and Training
Performance Management
HR Policies and Compliance
Employee Relations and Engagement
Compensation and Benefits Administration
HR Analytics and Reporting
HR Software: SAP, Workday, BambooHR
Strong Interpersonal and Communication Skills
Professional Experience
HR Manager
BrightPath Solutions | January 2020 – Present

Led end-to-end recruitment processes, successfully hiring over 50 candidates annually across various roles.
Designed and implemented onboarding programs, reducing new hire turnover by 20%.
Developed performance management frameworks, increasing employee productivity by 15%.
Conducted employee satisfaction surveys and implemented strategies to enhance engagement.
Ensured compliance with labor laws and company policies, minimizing legal risks.
HR Generalist
Global Reach Inc. | June 2016 – December 2019

Supported HR operations, including recruitment, payroll, and benefits administration.
Assisted in developing HR policies and communicated updates to employees.
Resolved employee grievances, fostering a collaborative workplace.
Analyzed HR metrics to identify trends and presented actionable insights to management.
HR Coordinator
TalentFirst Consulting | March 2014 – May 2016

Scheduled interviews and coordinated recruitment activities.
Maintained employee records and ensured accuracy in HR databases.
Assisted in planning company events and training sessions.
Education
Bachelor’s Degree in Human Resource Management
University of California, Berkeley | 2013

Certifications

Certified Professional in Human Resources (PHR)
SHRM Certified Professional (SHRM-CP)
Advanced HR Analytics (Coursera)
Achievements
Reduced time-to-hire by 30% through process optimization.
Increased employee retention by 25% by implementing a mentorship program.
Spearheaded diversity and inclusion initiatives, leading to a 40% increase in diverse hires.
Languages
English (Fluent)
Spanish (Intermediate)
"""

predicted_category = predict_category(sample_resume, tokenizer, model, max_length, device, label_encoder)
print(f"Predicted Category: {predicted_category}")

Predicted Category: HR


## **Testing**

**Prediction:** The `pred` function below runs a resume through each trained model (KNN, SVC, Random Forest, MLP, and the ensemble) and prints one predicted category per model, so the results can be compared side by side.
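One way to summarize such a side-by-side comparison is to count how many models agree and list the ones that differ. The sketch below does this for a dictionary of per-model predictions; `summarize_predictions` and the toy predictions are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_predictions(predictions):
    """predictions maps a model name to its predicted category."""
    counts = Counter(predictions.values())
    top_category, votes = counts.most_common(1)[0]  # most frequent category and its vote count
    dissenters = sorted(name for name, category in predictions.items()
                        if category != top_category)
    return top_category, votes, dissenters


# Made-up predictions, not results from this notebook.
toy_predictions = {
    "KNN": "HR",
    "SVC": "HR",
    "Random Forest": "Business Analyst",
    "MLP": "HR",
    "Ensemble": "HR",
}
print(summarize_predictions(toy_predictions))
# prints ('HR', 4, ['Random Forest'])
```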

In [63]:
# Function to predict the category of a resume and print results for each model
def pred(input_resume):
    # Preprocess the input text (e.g., cleaning, etc.)
    cleaned_text = cleanResume(input_resume)

    # Vectorize the cleaned text using the same TF-IDF vectorizer used during training
    vectorized_text = tfidf.transform([cleaned_text])

    # Convert sparse matrix to dense
    vectorized_text = vectorized_text.toarray()

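    # Prediction
    # Note: `model` must refer to the Keras MLP here. The BERT section rebinds
    # `model` to a BertForSequenceClassification, which has no .predict() method,
    # so re-run the MLP cell (or give the MLP its own name) before calling pred().
    predicted_category_knn = knn_model.predict(vectorized_text)
    predicted_category_svc = svc_model.predict(vectorized_text)
    predicted_category_rf = rf_model.predict(vectorized_text)
    predicted_category_mlp = np.argmax(model.predict(vectorized_text), axis=1)
    predicted_category_ensemble = ensemble_model.predict(vectorized_text)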

    # Get name of predicted category for each model
    category_knn = le.inverse_transform(predicted_category_knn)[0]
    category_svc = le.inverse_transform(predicted_category_svc)[0]
    category_rf = le.inverse_transform(predicted_category_rf)[0]
    category_mlp = le.inverse_transform(predicted_category_mlp)[0]
    category_ensemble = le.inverse_transform(predicted_category_ensemble)[0]

    # Print results for each model
    print(f"KNN Model Prediction: {category_knn}")
    print(f"SVC Model Prediction: {category_svc}")
    print(f"Random Forest Model Prediction: {category_rf}")
    print(f"MLP Model Prediction: {category_mlp}")
    print(f"Ensemble Model Prediction: {category_ensemble}")

    # Return the category name predicted by the first model (as an example)
    return category_knn

In [64]:
myresume = """Name: Sarah Johnson
Contact Information:

Phone: +1 555-123-4567
Email: sarah.johnson@email.com
LinkedIn: linkedin.com/in/sarahjohnson
Address: New York, NY
Professional Summary
A results-driven HR professional with over 6 years of experience in talent acquisition, employee engagement, and HR operations. Skilled in building strong teams, fostering positive work environments, and implementing HR strategies that align with organizational goals. Proficient in HR software and data-driven decision-making to improve workforce management.

Key Skills
Talent Acquisition and Recruitment
Employee Onboarding and Training
Performance Management
HR Policies and Compliance
Employee Relations and Engagement
Compensation and Benefits Administration
HR Analytics and Reporting
HR Software: SAP, Workday, BambooHR
Strong Interpersonal and Communication Skills
Professional Experience
HR Manager
BrightPath Solutions | January 2020 – Present

Led end-to-end recruitment processes, successfully hiring over 50 candidates annually across various roles.
Designed and implemented onboarding programs, reducing new hire turnover by 20%.
Developed performance management frameworks, increasing employee productivity by 15%.
Conducted employee satisfaction surveys and implemented strategies to enhance engagement.
Ensured compliance with labor laws and company policies, minimizing legal risks.
HR Generalist
Global Reach Inc. | June 2016 – December 2019

Supported HR operations, including recruitment, payroll, and benefits administration.
Assisted in developing HR policies and communicated updates to employees.
Resolved employee grievances, fostering a collaborative workplace.
Analyzed HR metrics to identify trends and presented actionable insights to management.
HR Coordinator
TalentFirst Consulting | March 2014 – May 2016

Scheduled interviews and coordinated recruitment activities.
Maintained employee records and ensured accuracy in HR databases.
Assisted in planning company events and training sessions.
Education
Bachelor’s Degree in Human Resource Management
University of California, Berkeley | 2013

Certifications

Certified Professional in Human Resources (PHR)
SHRM Certified Professional (SHRM-CP)
Advanced HR Analytics (Coursera)
Achievements
Reduced time-to-hire by 30% through process optimization.
Increased employee retention by 25% by implementing a mentorship program.
Spearheaded diversity and inclusion initiatives, leading to a 40% increase in diverse hires.
Languages
English (Fluent)
Spanish (Intermediate)
"""

pred(myresume)

AttributeError: 'BertForSequenceClassification' object has no attribute 'predict'
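The traceback is consistent with the name `model` having been reassigned to the BERT classifier earlier in the notebook, so `model.predict` no longer reaches the MLP. A simple guard is to keep each model under its own name, for example in a dictionary. The sketch below uses two stand-in classes (`ToyMLP` and `ToyBert`, both hypothetical) to show the pattern without loading any real model.

```python
class ToyMLP:
    """Stand-in for a Keras-style model that exposes .predict()."""

    def predict(self, features):
        return [[0.1, 0.9] for _ in features]


class ToyBert:
    """Stand-in for a PyTorch-style module that is called directly."""

    def __call__(self, features):
        return [[0.3, 0.7] for _ in features]


# Distinct keys mean that loading one model can never overwrite another.
models = {"mlp": ToyMLP(), "bert": ToyBert()}


def class_scores(name, features):
    """Return per-class scores from the named model, using .predict() when it exists."""
    chosen = models[name]
    if callable(getattr(chosen, "predict", None)):
        return chosen.predict(features)
    return chosen(features)


print(class_scores("mlp", [[0.0]]))   # prints [[0.1, 0.9]]
print(class_scores("bert", [[0.0]]))  # prints [[0.3, 0.7]]
```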

In [None]:
myresume = """I am a Business Analyst specializing in developing dashboards,
reports, and data models to drive performance insights.
Proficient in Python, R, SQL, Excel, and Power BI, I excel in data
analysis, advanced analytics, and automation of data processes.
Skilled in statistical analysis and data visualization, I derive
insights for data-driven decisions. Experienced in designing and
optimizing data warehouse solutions, managing ETL processes,
and ensuring data integrity and security. Additionally, I hold a
CCNA certification from Cisco, showcasing my knowledge in
networking.
"""

pred(myresume)