# Resumama
Resumama leverages a fine-tuned T5-small transformer model to analyze job postings and generate one-page, tailored resumes optimized for each application. By predicting the fit between a candidate's long-format resume and a given job post, Resumama provides a three-tiered recommendation: "Good Fit," "Potential Fit," or "No Fit." For suitable matches, the system then generates a concise, professional resume highlighting the most relevant qualifications and experiences.

In [1]:
import torch

print("CUDA Available:", torch.cuda.is_available())
print("CUDA Version:", torch.version.cuda)
print("Device Name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")


CUDA Available: True
CUDA Version: 11.8
Device Name: NVIDIA GeForce RTX 2060


### T5-Small
T5-small is a compact version of the Text-to-Text Transfer Transformer (T5) model, developed by Google. It treats every NLP task as a text-to-text problem, enabling it to perform tasks such as translation, summarization, and classification within a unified framework. With 60 million parameters, T5-small balances computational efficiency and performance, making it suitable for resource-constrained applications while maintaining robust language understanding capabilities.

In [None]:
# model = model.to("cpu")
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to("cuda")



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Data Preparation

The data used for training contains pairs of resume texts and job descriptions, annotated with labels indicating the fit level ("Good Fit," "Potential Fit," or "No Fit"). It includes a total of 4,992 samples, with the majority labeled as "No Fit" (3,143), followed by "Potential Fit" (1,556), and "Good Fit" (1,542). Each record consists of a candidate's resume, a corresponding job description, and a fit label, allowing the model to learn the relationship between resumes and job requirements. The class imbalance in the dataset, with more "No Fit" examples, may present challenges for training, potentially requiring rebalancing techniques or weighting adjustments to improve classification performance.

In [4]:
import pandas as pd

splits = {'train': 'train.csv', 'test': 'test.csv'}
df = pd.read_csv("hf://datasets/cnamuangtoun/resume-job-description-fit/" + splits["train"])

In [5]:
df['label'].value_counts()

No Fit           3143
Potential Fit    1556
Good Fit         1542
Name: label, dtype: int64

In [None]:
from sklearn.model_selection import train_test_split
'''
preprocess_data takes in a dataframe and a tokenizer and returns a dictionary of inputs and outputs that are ready to be passed to the model
'''
# Prepare the data
def preprocess_data(df, tokenizer, max_length=512):
    inputs = []
    labels = []
    
    for _, row in df.iterrows():
        input_text = (f"Evaluate the fit level of the following resume for the job description. Understand what makes a resume a good fit."
                      f"Job Description: {row['job_description_text']} "
                      f"Resume: {row['resume_text']}")
        label_text = row['label']
        
        inputs.append(input_text)
        labels.append(label_text)

    # Tokenize the inputs and outputs
    tokenized_inputs = tokenizer(inputs, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
    tokenized_labels = tokenizer(labels, max_length=max_length, padding=True, truncation=True, return_tensors="pt")

    # Return tokenized inputs and labels
    return tokenized_inputs, tokenized_labels

# Split data into train and validation
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Tokenize
tokenizer = T5Tokenizer.from_pretrained("t5-small")
train_inputs, train_labels = preprocess_data(train_df, tokenizer)
val_inputs, val_labels = preprocess_data(val_df, tokenizer)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
train_df


Unnamed: 0,resume_text,job_description_text,label
3659,SummaryA business management graduate with sig...,Position Title: Senior Accountant Organization...,Potential Fit
1426,Professional ProfileCapable International Tax ...,RoleGaming Business AnalystResponsibilities Ab...,No Fit
497,Professional ProfileHighly motivated Sales Ass...,If you can handle the accounting responsibilit...,No Fit
2833,SummaryOrganized and motivated employee eager ...,Our client is a growing Medical Device company...,No Fit
1480,SummaryEmployed by the U.S. Navy as a Civilian...,Were seeking a detail-oriented and analytical ...,No Fit
...,...,...,...
3772,Professional ProfileExpert in Functional Testi...,Lead Software Developer New York City - Hybr...,Potential Fit
5191,SummaryI am a detail-oriented professional see...,Job Purpose: Perform designated tasks in the a...,Good Fit
5226,SummaryDedicated and focused Clerk who excels ...,Role - Business Analyst - Mobile Location - Lo...,Good Fit
5390,SummaryTo obtain a challenging and rewarding a...,"About Allvue\nWe are Allvue Systems, the leadi...",Good Fit


In [None]:
import torch
''' 
fit_level_dataset is a custom dataset class that takes in tokenized inputs and labels and returns a dictionary of input_ids, attention_mask, and labels
'''
class FitLevelDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.encodings.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings.input_ids[idx],
            "attention_mask": self.encodings.attention_mask[idx],
            "labels": self.labels.input_ids[idx],
        }

train_dataset = FitLevelDataset(train_inputs, train_labels)
val_dataset = FitLevelDataset(val_inputs, val_labels)


### Fine-Tune
Here we fine-tune a pre-trained T5-small model on our resume dataset. The T5ForConditionalGeneration class is used to load the pre-trained model, which is capable of generating text conditioned on input text. Training arguments are configured via TrainingArguments, specifying parameters like the output directory, evaluation strategy, batch size, number of epochs, learning rate, and precision mode. The Trainer class orchestrates the training process, combining the model, training arguments, and datasets (training and validation). Finally, the trainer.train() call initiates the fine-tuning process, updating the model weights based on the input data to optimize its performance.

In [9]:
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

# Load the T5 model
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Training arguments
training_args = TrainingArguments(
    output_dir="./t5-fit-level",
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    logging_dir="./logs",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=10,
    learning_rate=3e-5,
    save_total_limit=2,
    fp16=True,  # Enable mixed precision if you have GPU
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()




  0%|          | 0/49920 [00:00<?, ?it/s]

{'loss': 1.79, 'grad_norm': 2.7435436248779297, 'learning_rate': 2.9703725961538463e-05, 'epoch': 0.1}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6606224179267883, 'eval_runtime': 22.0997, 'eval_samples_per_second': 56.517, 'eval_steps_per_second': 56.517, 'epoch': 0.1}
{'loss': 0.5865, 'grad_norm': 7.690234661102295, 'learning_rate': 2.9403245192307694e-05, 'epoch': 0.2}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4432353377342224, 'eval_runtime': 21.9501, 'eval_samples_per_second': 56.902, 'eval_steps_per_second': 56.902, 'epoch': 0.2}
{'loss': 0.5665, 'grad_norm': 0.49165016412734985, 'learning_rate': 2.9102764423076925e-05, 'epoch': 0.3}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5360539555549622, 'eval_runtime': 22.294, 'eval_samples_per_second': 56.024, 'eval_steps_per_second': 56.024, 'epoch': 0.3}
{'loss': 0.5685, 'grad_norm': 14.284967422485352, 'learning_rate': 2.8802283653846156e-05, 'epoch': 0.4}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6435320973396301, 'eval_runtime': 21.9825, 'eval_samples_per_second': 56.818, 'eval_steps_per_second': 56.818, 'epoch': 0.4}
{'loss': 0.5467, 'grad_norm': 1.1861993074417114, 'learning_rate': 2.8501802884615383e-05, 'epoch': 0.5}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4955109655857086, 'eval_runtime': 22.2931, 'eval_samples_per_second': 56.026, 'eval_steps_per_second': 56.026, 'epoch': 0.5}
{'loss': 0.5284, 'grad_norm': 0.9705380797386169, 'learning_rate': 2.8201322115384614e-05, 'epoch': 0.6}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4090416431427002, 'eval_runtime': 21.988, 'eval_samples_per_second': 56.804, 'eval_steps_per_second': 56.804, 'epoch': 0.6}
{'loss': 0.5065, 'grad_norm': 3.2549123764038086, 'learning_rate': 2.790084134615385e-05, 'epoch': 0.7}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.3828349709510803, 'eval_runtime': 21.743, 'eval_samples_per_second': 57.444, 'eval_steps_per_second': 57.444, 'epoch': 0.7}
{'loss': 0.5345, 'grad_norm': 7.248378753662109, 'learning_rate': 2.760036057692308e-05, 'epoch': 0.8}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.33548596501350403, 'eval_runtime': 21.9477, 'eval_samples_per_second': 56.908, 'eval_steps_per_second': 56.908, 'epoch': 0.8}
{'loss': 0.485, 'grad_norm': 0.7842099666595459, 'learning_rate': 2.7299879807692307e-05, 'epoch': 0.9}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.3283754587173462, 'eval_runtime': 21.9023, 'eval_samples_per_second': 57.026, 'eval_steps_per_second': 57.026, 'epoch': 0.9}
{'loss': 0.4655, 'grad_norm': 1.1646045446395874, 'learning_rate': 2.699939903846154e-05, 'epoch': 1.0}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4382343292236328, 'eval_runtime': 22.1576, 'eval_samples_per_second': 56.369, 'eval_steps_per_second': 56.369, 'epoch': 1.0}
{'loss': 0.4965, 'grad_norm': 0.7365132570266724, 'learning_rate': 2.669891826923077e-05, 'epoch': 1.1}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4554312527179718, 'eval_runtime': 21.8732, 'eval_samples_per_second': 57.102, 'eval_steps_per_second': 57.102, 'epoch': 1.1}
{'loss': 0.4626, 'grad_norm': 12.034932136535645, 'learning_rate': 2.6398437500000004e-05, 'epoch': 1.2}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4630197286605835, 'eval_runtime': 21.9084, 'eval_samples_per_second': 57.01, 'eval_steps_per_second': 57.01, 'epoch': 1.2}
{'loss': 0.464, 'grad_norm': 3.2634572982788086, 'learning_rate': 2.6098557692307692e-05, 'epoch': 1.3}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.3272598087787628, 'eval_runtime': 22.309, 'eval_samples_per_second': 55.986, 'eval_steps_per_second': 55.986, 'epoch': 1.3}
{'loss': 0.4645, 'grad_norm': 9.339091300964355, 'learning_rate': 2.5798076923076926e-05, 'epoch': 1.4}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4603619873523712, 'eval_runtime': 22.2978, 'eval_samples_per_second': 56.014, 'eval_steps_per_second': 56.014, 'epoch': 1.4}
{'loss': 0.4632, 'grad_norm': 9.694141387939453, 'learning_rate': 2.5497596153846154e-05, 'epoch': 1.5}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4346650242805481, 'eval_runtime': 22.0343, 'eval_samples_per_second': 56.684, 'eval_steps_per_second': 56.684, 'epoch': 1.5}
{'loss': 0.4749, 'grad_norm': 10.63216781616211, 'learning_rate': 2.5197115384615385e-05, 'epoch': 1.6}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.36287644505500793, 'eval_runtime': 22.443, 'eval_samples_per_second': 55.652, 'eval_steps_per_second': 55.652, 'epoch': 1.6}
{'loss': 0.4244, 'grad_norm': 11.440138816833496, 'learning_rate': 2.489723557692308e-05, 'epoch': 1.7}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.3529011607170105, 'eval_runtime': 22.1025, 'eval_samples_per_second': 56.509, 'eval_steps_per_second': 56.509, 'epoch': 1.7}
{'loss': 0.401, 'grad_norm': 0.27251631021499634, 'learning_rate': 2.4596754807692308e-05, 'epoch': 1.8}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.3890533149242401, 'eval_runtime': 22.3682, 'eval_samples_per_second': 55.838, 'eval_steps_per_second': 55.838, 'epoch': 1.8}
{'loss': 0.4661, 'grad_norm': 0.22069890797138214, 'learning_rate': 2.429747596153846e-05, 'epoch': 1.9}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4156334698200226, 'eval_runtime': 21.9018, 'eval_samples_per_second': 57.027, 'eval_steps_per_second': 57.027, 'epoch': 1.9}
{'loss': 0.43, 'grad_norm': 4.968319416046143, 'learning_rate': 2.399699519230769e-05, 'epoch': 2.0}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4570123851299286, 'eval_runtime': 22.0035, 'eval_samples_per_second': 56.764, 'eval_steps_per_second': 56.764, 'epoch': 2.0}
{'loss': 0.3871, 'grad_norm': 7.406917095184326, 'learning_rate': 2.3696514423076925e-05, 'epoch': 2.1}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4928951859474182, 'eval_runtime': 22.2541, 'eval_samples_per_second': 56.124, 'eval_steps_per_second': 56.124, 'epoch': 2.1}
{'loss': 0.417, 'grad_norm': 5.651954650878906, 'learning_rate': 2.3396033653846156e-05, 'epoch': 2.2}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.49632444977760315, 'eval_runtime': 22.008, 'eval_samples_per_second': 56.752, 'eval_steps_per_second': 56.752, 'epoch': 2.2}
{'loss': 0.4545, 'grad_norm': 17.239105224609375, 'learning_rate': 2.3095552884615384e-05, 'epoch': 2.3}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4592262804508209, 'eval_runtime': 21.9869, 'eval_samples_per_second': 56.807, 'eval_steps_per_second': 56.807, 'epoch': 2.3}
{'loss': 0.4191, 'grad_norm': 21.191303253173828, 'learning_rate': 2.2795072115384615e-05, 'epoch': 2.4}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5099661350250244, 'eval_runtime': 22.0942, 'eval_samples_per_second': 56.531, 'eval_steps_per_second': 56.531, 'epoch': 2.4}
{'loss': 0.4135, 'grad_norm': 13.316132545471191, 'learning_rate': 2.2494591346153846e-05, 'epoch': 2.5}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.517138659954071, 'eval_runtime': 21.9472, 'eval_samples_per_second': 56.909, 'eval_steps_per_second': 56.909, 'epoch': 2.5}
{'loss': 0.439, 'grad_norm': 0.6280094981193542, 'learning_rate': 2.219411057692308e-05, 'epoch': 2.6}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5022063255310059, 'eval_runtime': 22.3432, 'eval_samples_per_second': 55.901, 'eval_steps_per_second': 55.901, 'epoch': 2.6}
{'loss': 0.4646, 'grad_norm': 29.929990768432617, 'learning_rate': 2.1893629807692308e-05, 'epoch': 2.7}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5101897716522217, 'eval_runtime': 22.1245, 'eval_samples_per_second': 56.453, 'eval_steps_per_second': 56.453, 'epoch': 2.7}
{'loss': 0.4927, 'grad_norm': 17.922956466674805, 'learning_rate': 2.159314903846154e-05, 'epoch': 2.8}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5237619876861572, 'eval_runtime': 21.8427, 'eval_samples_per_second': 57.182, 'eval_steps_per_second': 57.182, 'epoch': 2.8}
{'loss': 0.4743, 'grad_norm': 2.944620132446289, 'learning_rate': 2.129326923076923e-05, 'epoch': 2.9}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.4992309808731079, 'eval_runtime': 22.5948, 'eval_samples_per_second': 55.278, 'eval_steps_per_second': 55.278, 'epoch': 2.9}
{'loss': 0.4184, 'grad_norm': 22.65220832824707, 'learning_rate': 2.0992788461538462e-05, 'epoch': 3.0}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5322273969650269, 'eval_runtime': 22.0966, 'eval_samples_per_second': 56.525, 'eval_steps_per_second': 56.525, 'epoch': 3.0}
{'loss': 0.4488, 'grad_norm': 24.40729331970215, 'learning_rate': 2.0692307692307693e-05, 'epoch': 3.1}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.502545177936554, 'eval_runtime': 21.9384, 'eval_samples_per_second': 56.932, 'eval_steps_per_second': 56.932, 'epoch': 3.1}
{'loss': 0.3878, 'grad_norm': 21.42656135559082, 'learning_rate': 2.0391826923076924e-05, 'epoch': 3.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5770018100738525, 'eval_runtime': 22.2907, 'eval_samples_per_second': 56.032, 'eval_steps_per_second': 56.032, 'epoch': 3.21}
{'loss': 0.4307, 'grad_norm': 16.60619354248047, 'learning_rate': 2.0091947115384615e-05, 'epoch': 3.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.594901978969574, 'eval_runtime': 21.909, 'eval_samples_per_second': 57.009, 'eval_steps_per_second': 57.009, 'epoch': 3.31}
{'loss': 0.4472, 'grad_norm': 7.1486711502075195, 'learning_rate': 1.9791466346153846e-05, 'epoch': 3.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5911118388175964, 'eval_runtime': 22.0002, 'eval_samples_per_second': 56.772, 'eval_steps_per_second': 56.772, 'epoch': 3.41}
{'loss': 0.4543, 'grad_norm': 21.204547882080078, 'learning_rate': 1.9490985576923077e-05, 'epoch': 3.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5790989995002747, 'eval_runtime': 22.1235, 'eval_samples_per_second': 56.456, 'eval_steps_per_second': 56.456, 'epoch': 3.51}
{'loss': 0.441, 'grad_norm': 15.70605182647705, 'learning_rate': 1.919050480769231e-05, 'epoch': 3.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6186304688453674, 'eval_runtime': 22.0185, 'eval_samples_per_second': 56.725, 'eval_steps_per_second': 56.725, 'epoch': 3.61}
{'loss': 0.4583, 'grad_norm': 32.96261215209961, 'learning_rate': 1.889002403846154e-05, 'epoch': 3.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6114766597747803, 'eval_runtime': 22.339, 'eval_samples_per_second': 55.911, 'eval_steps_per_second': 55.911, 'epoch': 3.71}
{'loss': 0.4371, 'grad_norm': 24.203201293945312, 'learning_rate': 1.859014423076923e-05, 'epoch': 3.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5770112872123718, 'eval_runtime': 21.8059, 'eval_samples_per_second': 57.278, 'eval_steps_per_second': 57.278, 'epoch': 3.81}
{'loss': 0.4257, 'grad_norm': 21.992900848388672, 'learning_rate': 1.8289663461538462e-05, 'epoch': 3.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6046270728111267, 'eval_runtime': 22.2074, 'eval_samples_per_second': 56.243, 'eval_steps_per_second': 56.243, 'epoch': 3.91}
{'loss': 0.4601, 'grad_norm': 3.0674216747283936, 'learning_rate': 1.7989182692307693e-05, 'epoch': 4.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.596476674079895, 'eval_runtime': 22.2159, 'eval_samples_per_second': 56.221, 'eval_steps_per_second': 56.221, 'epoch': 4.01}
{'loss': 0.411, 'grad_norm': 0.20815302431583405, 'learning_rate': 1.7688701923076924e-05, 'epoch': 4.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5965874791145325, 'eval_runtime': 22.0798, 'eval_samples_per_second': 56.568, 'eval_steps_per_second': 56.568, 'epoch': 4.11}
{'loss': 0.4728, 'grad_norm': 30.895212173461914, 'learning_rate': 1.7388822115384616e-05, 'epoch': 4.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.5780961513519287, 'eval_runtime': 21.913, 'eval_samples_per_second': 56.998, 'eval_steps_per_second': 56.998, 'epoch': 4.21}
{'loss': 0.4356, 'grad_norm': 13.051544189453125, 'learning_rate': 1.7088341346153847e-05, 'epoch': 4.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6502951383590698, 'eval_runtime': 21.9388, 'eval_samples_per_second': 56.931, 'eval_steps_per_second': 56.931, 'epoch': 4.31}
{'loss': 0.4164, 'grad_norm': 0.314775675535202, 'learning_rate': 1.6787860576923078e-05, 'epoch': 4.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6304618120193481, 'eval_runtime': 21.9841, 'eval_samples_per_second': 56.814, 'eval_steps_per_second': 56.814, 'epoch': 4.41}
{'loss': 0.4585, 'grad_norm': 23.99639320373535, 'learning_rate': 1.6487379807692305e-05, 'epoch': 4.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6354663372039795, 'eval_runtime': 22.2945, 'eval_samples_per_second': 56.023, 'eval_steps_per_second': 56.023, 'epoch': 4.51}
{'loss': 0.4384, 'grad_norm': 0.6628952026367188, 'learning_rate': 1.61875e-05, 'epoch': 4.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6884129643440247, 'eval_runtime': 22.0866, 'eval_samples_per_second': 56.55, 'eval_steps_per_second': 56.55, 'epoch': 4.61}
{'loss': 0.43, 'grad_norm': 17.30394172668457, 'learning_rate': 1.5887019230769228e-05, 'epoch': 4.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6760469079017639, 'eval_runtime': 22.0365, 'eval_samples_per_second': 56.679, 'eval_steps_per_second': 56.679, 'epoch': 4.71}
{'loss': 0.4914, 'grad_norm': 0.911189079284668, 'learning_rate': 1.5586538461538462e-05, 'epoch': 4.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6154776215553284, 'eval_runtime': 22.1744, 'eval_samples_per_second': 56.326, 'eval_steps_per_second': 56.326, 'epoch': 4.81}
{'loss': 0.3856, 'grad_norm': 11.123381614685059, 'learning_rate': 1.5286057692307694e-05, 'epoch': 4.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6375555396080017, 'eval_runtime': 22.022, 'eval_samples_per_second': 56.716, 'eval_steps_per_second': 56.716, 'epoch': 4.91}
{'loss': 0.4351, 'grad_norm': 2.7392871379852295, 'learning_rate': 1.4986177884615385e-05, 'epoch': 5.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6614298224449158, 'eval_runtime': 21.9326, 'eval_samples_per_second': 56.947, 'eval_steps_per_second': 56.947, 'epoch': 5.01}
{'loss': 0.4605, 'grad_norm': 3.65012526512146, 'learning_rate': 1.4685697115384616e-05, 'epoch': 5.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6181291937828064, 'eval_runtime': 22.0192, 'eval_samples_per_second': 56.723, 'eval_steps_per_second': 56.723, 'epoch': 5.11}
{'loss': 0.4123, 'grad_norm': 1.5037472248077393, 'learning_rate': 1.4385216346153847e-05, 'epoch': 5.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6514842510223389, 'eval_runtime': 21.9806, 'eval_samples_per_second': 56.823, 'eval_steps_per_second': 56.823, 'epoch': 5.21}
{'loss': 0.4502, 'grad_norm': 7.10018253326416, 'learning_rate': 1.4084735576923076e-05, 'epoch': 5.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6194990873336792, 'eval_runtime': 22.2158, 'eval_samples_per_second': 56.221, 'eval_steps_per_second': 56.221, 'epoch': 5.31}
{'loss': 0.4169, 'grad_norm': 0.8300756216049194, 'learning_rate': 1.378425480769231e-05, 'epoch': 5.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6706216335296631, 'eval_runtime': 22.2043, 'eval_samples_per_second': 56.25, 'eval_steps_per_second': 56.25, 'epoch': 5.41}
{'loss': 0.4521, 'grad_norm': 0.03473956137895584, 'learning_rate': 1.3484374999999999e-05, 'epoch': 5.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6473796963691711, 'eval_runtime': 21.7941, 'eval_samples_per_second': 57.309, 'eval_steps_per_second': 57.309, 'epoch': 5.51}
{'loss': 0.3964, 'grad_norm': 21.584869384765625, 'learning_rate': 1.3183894230769232e-05, 'epoch': 5.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6779866218566895, 'eval_runtime': 22.206, 'eval_samples_per_second': 56.246, 'eval_steps_per_second': 56.246, 'epoch': 5.61}
{'loss': 0.4101, 'grad_norm': 0.17757649719715118, 'learning_rate': 1.2883413461538461e-05, 'epoch': 5.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.676101565361023, 'eval_runtime': 21.9693, 'eval_samples_per_second': 56.852, 'eval_steps_per_second': 56.852, 'epoch': 5.71}
{'loss': 0.4308, 'grad_norm': 31.48599624633789, 'learning_rate': 1.2582932692307692e-05, 'epoch': 5.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7042039632797241, 'eval_runtime': 21.9207, 'eval_samples_per_second': 56.978, 'eval_steps_per_second': 56.978, 'epoch': 5.81}
{'loss': 0.4236, 'grad_norm': 5.302909851074219, 'learning_rate': 1.2283052884615385e-05, 'epoch': 5.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6749285459518433, 'eval_runtime': 21.9768, 'eval_samples_per_second': 56.833, 'eval_steps_per_second': 56.833, 'epoch': 5.91}
{'loss': 0.4483, 'grad_norm': 13.491842269897461, 'learning_rate': 1.1982572115384615e-05, 'epoch': 6.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7090458273887634, 'eval_runtime': 22.1546, 'eval_samples_per_second': 56.377, 'eval_steps_per_second': 56.377, 'epoch': 6.01}
{'loss': 0.408, 'grad_norm': 27.98006248474121, 'learning_rate': 1.1682091346153848e-05, 'epoch': 6.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7063722014427185, 'eval_runtime': 22.2914, 'eval_samples_per_second': 56.031, 'eval_steps_per_second': 56.031, 'epoch': 6.11}
{'loss': 0.3901, 'grad_norm': 6.035536766052246, 'learning_rate': 1.1381610576923077e-05, 'epoch': 6.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7018461227416992, 'eval_runtime': 22.0611, 'eval_samples_per_second': 56.615, 'eval_steps_per_second': 56.615, 'epoch': 6.21}
{'loss': 0.4074, 'grad_norm': 6.94213342666626, 'learning_rate': 1.108173076923077e-05, 'epoch': 6.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6828694343566895, 'eval_runtime': 22.0458, 'eval_samples_per_second': 56.655, 'eval_steps_per_second': 56.655, 'epoch': 6.31}
{'loss': 0.4138, 'grad_norm': 16.285579681396484, 'learning_rate': 1.078125e-05, 'epoch': 6.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7723380923271179, 'eval_runtime': 22.2731, 'eval_samples_per_second': 56.077, 'eval_steps_per_second': 56.077, 'epoch': 6.41}
{'loss': 0.4406, 'grad_norm': 0.3997447192668915, 'learning_rate': 1.048076923076923e-05, 'epoch': 6.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.703401505947113, 'eval_runtime': 22.0481, 'eval_samples_per_second': 56.649, 'eval_steps_per_second': 56.649, 'epoch': 6.51}
{'loss': 0.481, 'grad_norm': 20.122997283935547, 'learning_rate': 1.0180288461538462e-05, 'epoch': 6.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6640209555625916, 'eval_runtime': 22.0892, 'eval_samples_per_second': 56.543, 'eval_steps_per_second': 56.543, 'epoch': 6.61}
{'loss': 0.4034, 'grad_norm': 8.880927085876465, 'learning_rate': 9.880408653846153e-06, 'epoch': 6.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6927049160003662, 'eval_runtime': 21.999, 'eval_samples_per_second': 56.775, 'eval_steps_per_second': 56.775, 'epoch': 6.71}
{'loss': 0.4816, 'grad_norm': 0.09294828027486801, 'learning_rate': 9.579927884615386e-06, 'epoch': 6.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6966079473495483, 'eval_runtime': 21.8333, 'eval_samples_per_second': 57.206, 'eval_steps_per_second': 57.206, 'epoch': 6.81}
{'loss': 0.4156, 'grad_norm': 38.10234832763672, 'learning_rate': 9.279447115384615e-06, 'epoch': 6.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6846146583557129, 'eval_runtime': 22.08, 'eval_samples_per_second': 56.567, 'eval_steps_per_second': 56.567, 'epoch': 6.91}
{'loss': 0.4612, 'grad_norm': 19.636178970336914, 'learning_rate': 8.978966346153846e-06, 'epoch': 7.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6813660860061646, 'eval_runtime': 22.0529, 'eval_samples_per_second': 56.637, 'eval_steps_per_second': 56.637, 'epoch': 7.01}
{'loss': 0.4512, 'grad_norm': 0.06610029935836792, 'learning_rate': 8.67908653846154e-06, 'epoch': 7.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7023696303367615, 'eval_runtime': 22.046, 'eval_samples_per_second': 56.654, 'eval_steps_per_second': 56.654, 'epoch': 7.11}
{'loss': 0.429, 'grad_norm': 0.08186301589012146, 'learning_rate': 8.378605769230769e-06, 'epoch': 7.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7232782244682312, 'eval_runtime': 22.2385, 'eval_samples_per_second': 56.164, 'eval_steps_per_second': 56.164, 'epoch': 7.21}
{'loss': 0.3776, 'grad_norm': 0.12057949602603912, 'learning_rate': 8.078125e-06, 'epoch': 7.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7290324568748474, 'eval_runtime': 22.1306, 'eval_samples_per_second': 56.438, 'eval_steps_per_second': 56.438, 'epoch': 7.31}
{'loss': 0.4154, 'grad_norm': 0.43057775497436523, 'learning_rate': 7.777644230769231e-06, 'epoch': 7.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6977748274803162, 'eval_runtime': 22.033, 'eval_samples_per_second': 56.688, 'eval_steps_per_second': 56.688, 'epoch': 7.41}
{'loss': 0.4094, 'grad_norm': 0.3874826431274414, 'learning_rate': 7.477764423076923e-06, 'epoch': 7.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7344380021095276, 'eval_runtime': 21.9719, 'eval_samples_per_second': 56.845, 'eval_steps_per_second': 56.845, 'epoch': 7.51}
{'loss': 0.4449, 'grad_norm': 2.303772449493408, 'learning_rate': 7.177283653846154e-06, 'epoch': 7.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7180905342102051, 'eval_runtime': 21.9357, 'eval_samples_per_second': 56.939, 'eval_steps_per_second': 56.939, 'epoch': 7.61}
{'loss': 0.4103, 'grad_norm': 29.05681610107422, 'learning_rate': 6.8768028846153845e-06, 'epoch': 7.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7321071624755859, 'eval_runtime': 21.9405, 'eval_samples_per_second': 56.927, 'eval_steps_per_second': 56.927, 'epoch': 7.71}
{'loss': 0.4792, 'grad_norm': 24.3717041015625, 'learning_rate': 6.5763221153846155e-06, 'epoch': 7.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7167288661003113, 'eval_runtime': 21.8527, 'eval_samples_per_second': 57.155, 'eval_steps_per_second': 57.155, 'epoch': 7.81}
{'loss': 0.4459, 'grad_norm': 0.044593628495931625, 'learning_rate': 6.276442307692308e-06, 'epoch': 7.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6919227242469788, 'eval_runtime': 22.3275, 'eval_samples_per_second': 55.94, 'eval_steps_per_second': 55.94, 'epoch': 7.91}
{'loss': 0.4051, 'grad_norm': 21.050241470336914, 'learning_rate': 5.975961538461538e-06, 'epoch': 8.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7503241896629333, 'eval_runtime': 22.229, 'eval_samples_per_second': 56.188, 'eval_steps_per_second': 56.188, 'epoch': 8.01}
{'loss': 0.4172, 'grad_norm': 19.936941146850586, 'learning_rate': 5.675480769230769e-06, 'epoch': 8.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7118529081344604, 'eval_runtime': 22.0121, 'eval_samples_per_second': 56.741, 'eval_steps_per_second': 56.741, 'epoch': 8.11}
{'loss': 0.4483, 'grad_norm': 8.370477676391602, 'learning_rate': 5.375e-06, 'epoch': 8.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7177258133888245, 'eval_runtime': 21.957, 'eval_samples_per_second': 56.884, 'eval_steps_per_second': 56.884, 'epoch': 8.21}
{'loss': 0.3839, 'grad_norm': 4.3004841804504395, 'learning_rate': 5.074519230769231e-06, 'epoch': 8.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7099173069000244, 'eval_runtime': 21.9688, 'eval_samples_per_second': 56.853, 'eval_steps_per_second': 56.853, 'epoch': 8.31}
{'loss': 0.3724, 'grad_norm': 26.29486656188965, 'learning_rate': 4.774639423076924e-06, 'epoch': 8.41}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7491852641105652, 'eval_runtime': 22.3609, 'eval_samples_per_second': 55.856, 'eval_steps_per_second': 55.856, 'epoch': 8.41}
{'loss': 0.4343, 'grad_norm': 24.399307250976562, 'learning_rate': 4.474158653846155e-06, 'epoch': 8.51}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7343939542770386, 'eval_runtime': 22.0142, 'eval_samples_per_second': 56.736, 'eval_steps_per_second': 56.736, 'epoch': 8.51}
{'loss': 0.4534, 'grad_norm': 4.7399210929870605, 'learning_rate': 4.173677884615385e-06, 'epoch': 8.61}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7152500152587891, 'eval_runtime': 22.2836, 'eval_samples_per_second': 56.05, 'eval_steps_per_second': 56.05, 'epoch': 8.61}
{'loss': 0.4303, 'grad_norm': 0.06552407890558243, 'learning_rate': 3.873197115384615e-06, 'epoch': 8.71}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.6990986466407776, 'eval_runtime': 21.9963, 'eval_samples_per_second': 56.782, 'eval_steps_per_second': 56.782, 'epoch': 8.71}
{'loss': 0.4362, 'grad_norm': 2.22381854057312, 'learning_rate': 3.5733173076923075e-06, 'epoch': 8.81}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7204425930976868, 'eval_runtime': 22.5087, 'eval_samples_per_second': 55.49, 'eval_steps_per_second': 55.49, 'epoch': 8.81}
{'loss': 0.3918, 'grad_norm': 16.131868362426758, 'learning_rate': 3.2728365384615385e-06, 'epoch': 8.91}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7408996820449829, 'eval_runtime': 21.9153, 'eval_samples_per_second': 56.992, 'eval_steps_per_second': 56.992, 'epoch': 8.91}
{'loss': 0.4426, 'grad_norm': 2.1150801181793213, 'learning_rate': 2.972355769230769e-06, 'epoch': 9.01}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7247517704963684, 'eval_runtime': 22.0853, 'eval_samples_per_second': 56.554, 'eval_steps_per_second': 56.554, 'epoch': 9.01}
{'loss': 0.4058, 'grad_norm': 34.52687454223633, 'learning_rate': 2.671875e-06, 'epoch': 9.11}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7186013460159302, 'eval_runtime': 22.3596, 'eval_samples_per_second': 55.86, 'eval_steps_per_second': 55.86, 'epoch': 9.11}
{'loss': 0.4265, 'grad_norm': 0.031829506158828735, 'learning_rate': 2.371995192307692e-06, 'epoch': 9.21}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7187245488166809, 'eval_runtime': 21.8401, 'eval_samples_per_second': 57.188, 'eval_steps_per_second': 57.188, 'epoch': 9.21}
{'loss': 0.4272, 'grad_norm': 27.492788314819336, 'learning_rate': 2.071514423076923e-06, 'epoch': 9.31}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7267150282859802, 'eval_runtime': 22.2912, 'eval_samples_per_second': 56.031, 'eval_steps_per_second': 56.031, 'epoch': 9.31}
{'loss': 0.43, 'grad_norm': 23.963422775268555, 'learning_rate': 1.7710336538461538e-06, 'epoch': 9.42}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7317846417427063, 'eval_runtime': 22.3081, 'eval_samples_per_second': 55.989, 'eval_steps_per_second': 55.989, 'epoch': 9.42}
{'loss': 0.4667, 'grad_norm': 28.927635192871094, 'learning_rate': 1.4705528846153846e-06, 'epoch': 9.52}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7313274145126343, 'eval_runtime': 21.9946, 'eval_samples_per_second': 56.787, 'eval_steps_per_second': 56.787, 'epoch': 9.52}
{'loss': 0.3634, 'grad_norm': 0.4396069645881653, 'learning_rate': 1.170673076923077e-06, 'epoch': 9.62}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7470300197601318, 'eval_runtime': 22.3419, 'eval_samples_per_second': 55.904, 'eval_steps_per_second': 55.904, 'epoch': 9.62}
{'loss': 0.4791, 'grad_norm': 27.615554809570312, 'learning_rate': 8.701923076923078e-07, 'epoch': 9.72}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.739204466342926, 'eval_runtime': 22.1032, 'eval_samples_per_second': 56.508, 'eval_steps_per_second': 56.508, 'epoch': 9.72}
{'loss': 0.3898, 'grad_norm': 30.569196701049805, 'learning_rate': 5.697115384615385e-07, 'epoch': 9.82}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7406260967254639, 'eval_runtime': 21.909, 'eval_samples_per_second': 57.009, 'eval_steps_per_second': 57.009, 'epoch': 9.82}
{'loss': 0.4384, 'grad_norm': 4.763878345489502, 'learning_rate': 2.692307692307692e-07, 'epoch': 9.92}


  0%|          | 0/1249 [00:00<?, ?it/s]

{'eval_loss': 0.7377873063087463, 'eval_runtime': 22.1981, 'eval_samples_per_second': 56.266, 'eval_steps_per_second': 56.266, 'epoch': 9.92}
{'train_runtime': 6253.5418, 'train_samples_per_second': 7.983, 'train_steps_per_second': 7.983, 'train_loss': 0.45593070066892183, 'epoch': 10.0}


TrainOutput(global_step=49920, training_loss=0.45593070066892183, metrics={'train_runtime': 6253.5418, 'train_samples_per_second': 7.983, 'train_steps_per_second': 7.983, 'total_flos': 6756262729482240.0, 'train_loss': 0.45593070066892183, 'epoch': 10.0})

#### Training result
The fine-tuned T5-small model achieved a training loss of 0.456, indicating that it effectively learned patterns from the training data while minimizing prediction errors. Over the course of 10 epochs, the model completed 49,920 steps, processing approximately 8 samples per second, which demonstrates efficient utilization of computational resources. The reported floating-point operations (FLOPs) totaled approximately 6.76 quadrillion, highlighting the computational intensity of the training process.

In [10]:
def predict(job_description, resume):
    input_text = (f"Evaluate the fit level of the following resume for the job description: "
                  f"Job Description: {job_description} "
                  f"Resume: {resume}")
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids,top_p=0.9)
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return prediction

# Example prediction
job_description = "Software Engineer with experience in Python and cloud computing."
resume = "Experienced developer with expertise in Python, AWS, and software design."
print(predict(job_description, resume))


Good Fit




In [10]:
torch.cuda.empty_cache()

### Job description Parsing

Here we will use Optical Character Recognition with the PaddleOCR library to extract text from an image. It initializes the PaddleOCR object. The ocr method processes the given image path and returns a structured result, from which the function extracts only the recognized text (ignoring bounding box data and confidence scores). The extracted text is concatenated into a single string and returned, providing a readable transcription of the text present in the image. This is particularly useful for converting job description images into machine-readable text for further analysis.

In [17]:
import pytesseract

# Set tesseract_cmd to the Tesseract executable path
pytesseract.pytesseract.tesseract_cmd = r"c:\users\michael\anaconda3\envs\tf-gpu\lib\site-packages"


In [None]:
from paddleocr import PaddleOCR

def extract_text_with_paddleocr(image_path):
    ocr = PaddleOCR()  # Initialize PaddleOCR
    results = ocr.ocr(image_path)
    text = [line[1][0] for line in results[0]]  # Extract the text content
    return " ".join(text)

# Example Usage
job_description = extract_text_with_paddleocr("images/job_description_image.jpg")
print(job_description)


[2024/12/07 14:44:06] ppocr DEBUG: Namespace(alpha=1.0, alphacolor=(255, 255, 255), benchmark=False, beta=1.0, binarize=False, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='C:\\Users\\Michael/.paddleocr/whl\\cls\\ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_box_type='quad', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='C:\\Users\\Michael/.paddleocr/whl\\det\\ch\\ch_PP-OCRv4_det_infer', det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=N

Here we extracts all text from a given PDF file using the PyPDF2 library. It initializes a PdfReader object to read the PDF, iterates through each page, and appends the extracted text to a string variable. Finally, the function returns the concatenated text, providing a plain-text representation of the entire resume.

In [12]:
from PyPDF2 import PdfReader
# Extract original resume text from a PDF
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    resume_text = ""
    for page in reader.pages:
        resume_text += page.extract_text()
    return resume_text


### Generate Resume
 Here we atempt to create a tailored resume. We provide an input prompt combining the job description and the original resume, instructing the model to generate a tailored version. The prompt is tokenized into a format suitable for the model and transferred to a GPU for faster processing. Using the model's generate method, it produces a single output sequence, ensuring the response stays within a specified maximum length of 512 tokens. Finally, the generated text is decoded into a readable format and returned as the tailored resume, optimized for the provided job description.

In [17]:
# Generate a tailored resume based on the job description and original resume
def generate_matching_resume(job_description, resume_text):
    # Construct the input prompt for the model
    input_text = (
        f"Generate a tailored Resume\n\n"
        f"Job Description: {job_description}\n\n"
        f"Original Resume: {resume_text}"
    )
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.to("cuda")
    
    # Generate the tailored resume
    outputs = model.generate(input_ids, max_length=512, num_return_sequences=1)
    tailored_resume = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return tailored_resume

In [None]:
job_description_image = "images/job_description_image.jpg"  # Replace with the actual job description image file path
resume_pdf_path = "resources/Resume.pdf"  # Replace with the actual resume PDF file path

# Step 1: Extract text from the job description image
job_description = extract_text_with_paddleocr(job_description_image)
print("Extracted Job Description:")
print(job_description)

# Step 2: Extract text from the resume PDF
original_resume_text = extract_text_from_pdf(resume_pdf_path)
print("\nExtracted Original Resume Text:")
print(original_resume_text)

# Step 3: Generate a tailored resume
tailored_resume = generate_matching_resume(job_description, original_resume_text)
print("\nTailored Resume:")
print(tailored_resume)

[2024/12/07 14:45:44] ppocr DEBUG: Namespace(alpha=1.0, alphacolor=(255, 255, 255), benchmark=False, beta=1.0, binarize=False, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='C:\\Users\\Michael/.paddleocr/whl\\cls\\ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_box_type='quad', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='C:\\Users\\Michael/.paddleocr/whl\\det\\ch\\ch_PP-OCRv4_det_infer', det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=N

### Conclusion
The model's underperformance in generating tailored resumes could stem from several factors. First, fine-tuning the T5-small model may have been insufficient for generating coherent, professional resumes due to the complexity of resume formatting and content prioritization. The training data may not have provided enough diverse examples of high-quality resumes and their alignment with job descriptions, limiting the model's ability to generalize effectively. Additionally, truncation of input data during tokenization could have caused the model to lose critical context from longer resumes or job descriptions

Another contributing factor could be the lack of explicit training for formatting resumes in a professional structure, as T5 is primarily optimized for text-to-text tasks rather than structured document generation. The model also likely struggled to balance creativity with precision, essential for accurately reflecting a candidate's qualifications while adhering to job requirements.

Future efforts could include expanding the dataset with more annotated examples of resumes and job descriptions, ensuring a diverse representation of industries and roles. Incorporating additional pre-processing steps, such as segmenting resumes into discrete sections, could provide the model with more structured input. Fine-tuning larger transformer models, like T5-base or T5-large, might improve performance by leveraging their greater capacity to understand and generate complex content.