# The notebook suggests the usage of Generative Pseudo-Labelling (GPL) for resume screening domain

GPL is an unsupervised method which allows us to fine-tune pre existing models to a domain not previously trained on. Since resumes don't generally have any labelled information accessible alongside them, we believe this is a perfect opportunity to assess GPL adaptation to this problem.

This notebook contains an extensive code of GPL to fine tune a pre existing model (in this case sentence transformer model - *msmarco-distilbert-base-tas-b*)

In [1]:
# Install required packages
!pip install tensorflow_text
!pip install tensorflow # to check version: 
!pip install gpl
!pip install --upgrade sentence-transformers==2.6.1
!pip install nltk
!pip install pandas



ERROR: Invalid requirement: '#'


Collecting gpl
  Using cached gpl-0.1.4-py3-none-any.whl.metadata (13 kB)
Collecting beir (from gpl)
  Using cached beir-2.0.0-py3-none-any.whl
Collecting easy-elasticsearch>=0.0.9 (from gpl)
  Using cached easy_elasticsearch-0.0.9-py3-none-any.whl.metadata (2.6 kB)
Collecting pytest (from gpl)
  Using cached pytest-8.1.1-py3-none-any.whl.metadata (7.6 kB)
Collecting elasticsearch>=7.9.1 (from easy-elasticsearch>=0.0.9->gpl)
  Using cached elasticsearch-8.13.0-py3-none-any.whl.metadata (6.3 kB)
Collecting pytrec-eval (from beir->gpl)
  Using cached pytrec_eval-0.5.tar.gz (15 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting faiss-cpu (from beir->gpl)
  Using cached faiss_cpu-1.8.0-cp39-cp39-win_amd64.whl.metadata (3.8 kB)
Collecting elasticsearch>=7.9.1 (from easy-elasticsearch>=0.0.9->gpl)
  Using cached elasticsearch-7.9.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting datasets (from beir->gpl)
  Using cached datas

  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [2889 lines of output]
      Fetching trec_eval from https://github.com/usnistgov/trec_eval/archive/v9.0.8.tar.gz.
      !!
      
              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.
      
              This deprecation is overdue, please update your project and remove deprecated
              calls to avoid build errors in the future.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        opt = self.warn_dash_deprecation(opt, section)
      running bdist_wheel
      running build
   



In [1]:
# Import required libraries
import pandas as pd 
import re 
import json
from tqdm.autonotebook import tqdm
from sentence_transformers import SentenceTransformer, util, CrossEncoder, InputExample, losses
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm
import torch
import pickle

  from tqdm.autonotebook import tqdm


In [2]:
# functions to perform pre-processing
def preprocess(text):
    text = re.sub(r"[^a-zA-Z0-9,.'?]+", ' ', str(text)) # remove special characters
    text = [text[:len(text)//2],text[len(text)//2:]] # split the text into two parts: first half and second half
    return text

def text_len(text):
    return len(text.split()) # return the length of the first half of the text

def convert_str(text):
    return str(text)+"_" # add underscore at the end of the text

In [3]:
# load the dataset
df = pd.read_csv('UpdatedResumeDataSet.csv') # read the data
df

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [4]:
'''
Preprocess the data
'''

# lower case all the text in the dataset
df = df.apply(lambda x:x.str.lower())
# Drop any null values
df = df.dropna()

# Preprocess the Text in 'Resume' and add the preprocessed text as new column 'new_text'
df['new_text'] = df.Resume.apply(preprocess)

# Explode and reset the index: Convert each item in the list under the new_text column into a separate row, effectively doubling the number of rows in the DataFrame.
df = df.explode("new_text")
df = df.reset_index(drop=True)

# Add Identifiers and Filtering
df["_id"] = df.index # add an identifier column
df['num'] = df['new_text'].apply(text_len) # add a column to store the length of the first half of the text
df = df[df['num'] < 400] # filter out the rows where the length of the first half of the text is greater than 400
df["_id"] = df["_id"].apply(convert_str) # add underscore at the end of the identifier

# adjust the column names
df['title'] = ""
df['metadata'] = ""
df['title'] = df['title'].astype(str)
df['text'] = df['new_text'].astype(str)
df['_id'] = df['_id'].astype(str)
df['concat'] = "qgen" + df["title"] + " " + df["text"] # intended for query generation, prefixed with "qgen".

# export to JSON and JSON Lines Format: useful for NLP tasks as it allows for efficient loading of large datasets line by line.
df[['_id', 'title', 'text', 'metadata']].to_json('corpus.json',orient='records')
df[['_id', 'title', 'text', 'metadata']].to_json('corpus.jsonl',orient='records',lines=True)

In [5]:
# load the data from JSON
f = open('corpus.json')
data = json.load(f)

# Write Data to JSONL: 
with open('corpus.jsonl', 'w') as outfile:
    for entry in data:
        json.dump(entry, outfile)
        outfile.write('\n')

# dump the correct format: This converts the data into JSON Lines format, where each line is a valid JSON string, useful for stream processing or line-by-line reading.
filepath = 'corpus.jsonl'

# Aggregate JSONL Back into JSON:
with open(filepath, 'r') as infile, open('output.json', 'w') as outfile:
    data = [json.loads(line) for line in infile]
    json.dump(data, outfile)


# Convert JSON to CSV
df = pd.read_json('output.json')
df['num']=df['text'].apply(text_len)
df = df[df['num']<400]
df.to_csv('final_.csv')

In [6]:
# see if I have gpu available with torch
torch.cuda.is_available()

True

## Query Generation

In [18]:
'''
- Generate queries (Resume) from a passage of text using a T5 model: The Query highlights key aspects of the resume's content, not questions but rather capture essential elements, skills, experiences, or qualifications presented in the resume.
- Transforming Resume content into queries that might be used to retrieve similar documents or information.
- 
'''
# Loading the model:
model_name = 'doc2query/msmarco-t5-base-v1' # Model specific to the task of generating queries from documents.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda()

# Prepare Text for Query Generation:
passage = df['text'].iloc[5] # select a passage from the dataset

# Tokenize the passage
inputs = tokenizer(passage, return_tensors='pt') # return the tokenized passage as PyTorch tensors

# Generate Queries: 3 queries are generated for the given passage
outputs = model.generate(
    input_ids=inputs['input_ids'].cuda(),
    attention_mask=inputs['attention_mask'].cuda(),
    max_length=64,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3
)

# Display the Original Passage and Generated Queries
print("Paragraph:")
print(passage)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')

Paragraph:
d electrical enthusiast skill details data analysis exprience less than 1 year months excel exprience less than 1 year months machine learning exprience less than 1 year months mathematics exprience less than 1 year months python exprience less than 1 year months matlab exprience less than 1 year months electrical engineering exprience less than 1 year months sql exprience less than 1 year monthscompany details company themathcompany description i am currently working with a casino based operator name not to be disclosed in macau.i need to segment the customers who visit their property based on the value the patrons bring into the company.basically prove that the segmentation can be done in much better way than the current system which they have with proper numbers to back it up.henceforth they can implement target marketing strategy to attract their customers who add value to the business.

Generated Queries:
1: what are d electrical enthusiasts skills
2: what is data analy

### Purpose of Queries in Matching Resumes to Job Descriptions

1. **Keyword Extraction and Emphasis:** Each query acts as a distilled representation of parts of the resume, emphasizing skills, experiences, or qualifications that might be relevant to potential employers or match specific job descriptions.

2. **Enhanced Searchability:** By converting sections of a resume into queries, the system can more effectively use these queries to search through job descriptions or a database of job requirements. This reverses the typical job application process, making the resumes actively "search" for matching job opportunities.

3. **Semantic Matching:** These queries help in moving beyond simple keyword matching by leveraging the T5 model's understanding of language to generate search terms that capture the meaning and context of the resume's content. This leads to more nuanced and semantically relevant matches between job descriptions and candidate profiles.

4. **Highlighting Candidate's Fit:** The generated queries can serve to pinpoint why a candidate might be a good fit for a role, highlighting specific skills or experiences in the form of searchable and matchable text snippets.

### Example Interpretation
Given the output from the T5 model in your example:
- **"hp experience required"** might highlight a specific skill or qualification mentioned in the resume, albeit in a somewhat abstract way.
- **"what is experience"** seems like a less directly applicable query but might relate to the model trying to abstract the concept of experience from the resume.
- **"what is an electrical enthusiast"** directly references a specific interest or skill area mentioned in the resume, making it a potentially useful query for matching with job descriptions looking for candidates passionate about electrical engineering.

In essence, the "query" in this matching system is a tool generated by processing the resume's text to create a bridge between the candidate's profile and potential job opportunities, enhancing the ability to match resumes with job descriptions based on deeper textual understanding.

In [19]:
# Assuming df is your DataFrame and contains the 'text' column with passages
passages = df['text'].to_numpy()
num_queries = 3
# Determine how many passages have been processed already
with open('pairs.tsv', 'r', encoding='utf-8') as file:
    processed_lines = sum(1 for line in file)
processed_passages = processed_lines // num_queries  # Assuming 3 queries per passage

# Adjusted target calculation based on remaining passages
target = (len(passages) - processed_passages) * num_queries
model = torch.nn.DataParallel(model)

batch_size = 128
count = 0  # Reset count based on the number of already processed queries
passage_batch = []

# Open the file in append mode to add missing pairs
with open('pairs.tsv', 'a', encoding='utf-8') as fp, tqdm(total=target) as progress:
    for index, passage in enumerate(passages[processed_passages:], start=processed_passages):
        if count >= target: break
        passage = passage.replace('\t', ' ').replace('\n', ' ')
        passage_batch.append(passage)
        
        if len(passage_batch) == batch_size or index == len(passages) - 1:  # Check if it's also the last batch
            inputs = tokenizer(
                passage_batch,
                truncation=True,
                padding=True,
                max_length=256,
                return_tensors='pt'
            )

            outputs = model.module.generate(input_ids=inputs['input_ids'].cuda(), 
                                            attention_mask=inputs['attention_mask'].cuda(),
                                            max_length=64, do_sample=True, top_p=0.95, num_return_sequences=num_queries)

            decoded_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)

            for i, query in enumerate(decoded_output):
                query = query.replace('\t', ' ').replace('\n', ' ')
                passage_idx = int(i / num_queries)  # Get the index of the passage to match query
                fp.write(query + '\t' + passage_batch[passage_idx] + '\n')
                count += 1

            passage_batch = []  # Clear the batch to free memory
            torch.cuda.empty_cache()  # Free GPU memory
            progress.update(len(decoded_output))


100%|██████████| 5067/5067 [27:32<00:00,  3.07it/s]


## Negative Mining

In [20]:
# Initalize the Sentence Transformer model
model = SentenceTransformer('msmarco-distilbert-base-tas-b')
model.max_seq_length = 256
model

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [21]:
# Define Generator to Read Query-Passage Pairs
def get_text():
    with open('pairs.tsv', 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
    for line in tqdm(lines):
        try:
            query, passage = line.split('\t')
            yield query, passage
        except ValueError:
            pass

# use the generator to get the query and passage
pair_gen = get_text()
for i, (query, passage) in enumerate(pair_gen):
    print(query)
    print()
    print(passage)
    break


  0%|          | 0/5068 [00:00<?, ?it/s]

what are python language skills

skills programming languages python pandas, numpy, scipy, scikit learn, matplotlib , sql, java, javascript jquery. machine learning regression, svm, na ve bayes, knn, random forest, decision trees, boosting techniques, cluster analysis, word embedding, sentiment analysis, natural language processing, dimensionality reduction, topic modelling lda, nmf , pca neural nets. database visualizations mysql, sqlserver, cassandra, hbase, elasticsearch d3.js, dc.js, plotly, kibana, matplotlib, ggplot, tableau. others regular expression, html, css, angular 6, logstash, kafka, python flask, git, docker, computer vision open cv and understanding of deep learning.education details data science assurance associate data science assurance associate ernst young llp skill details javascript exprience 24 months jquery exprience 24 months python exprience 24 monthscompany details company ernst young llp description fraud investigations and dispute services assurance technolo

In [22]:
# Re-populate the pairs list using the get_text generator
pairs = [pair for pair in get_text()]

# Initialize variables for embedding storage
passage_batch = []
id_batch = []
embeddings_store = []  # Store embeddings here
batch_size = 64

# Load the SentenceTransformer model as before
# Assuming model has been loaded here as shown in previous examples

# Process passages to avoid duplication and batch for embedding
for i, (query, passage) in enumerate(pairs):  # Now using the populated pairs list
    if passage not in passage_batch: 
        passage_batch.append(passage)
        id_batch.append(str(i))

    if len(passage_batch) == batch_size:
        # Encode passages to embeddings
        embeds = model.encode(passage_batch).tolist()
        for idx, emb in zip(id_batch, embeds):
            embeddings_store.append((idx, emb))
        passage_batch = []
        id_batch = []

# Ensure any remaining passages are processed
if passage_batch:
    embeds = model.encode(passage_batch).tolist()
    for idx, emb in zip(id_batch, embeds):
        embeddings_store.append((idx, emb))

# Save embeddings to a local file for later retrieval
with open('embeddings_store.pkl', 'wb') as f:
    pickle.dump(embeddings_store, f)

print(f"Total embeddings stored: {len(embeddings_store)}")




100%|██████████| 5068/5068 [00:00<00:00, 723929.19it/s]


Total embeddings stored: 344


In [30]:
# Loading embeddings from file
with open('embeddings_store.pkl', 'rb') as f:
    embeddings_store = pickle.load(f)

# Assuming `model` is on the same device as `embeddings_store`
device = next(model.parameters()).device

# Assuming `pairs` is already defined in your context as query-passage pairs
batch_size = 100
triplets = []

for i in tqdm(range(0, len(pairs), batch_size)):
    i_end = min(i+batch_size, len(pairs))
    queries = [pair[0] for pair in pairs[i:i_end]]
    pos_passages = [pair[1] for pair in pairs[i:i_end]]

    # Create query embeddings
    query_embs = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
    query_embs = query_embs.to(device)

    for query_idx, (query, pos_passage) in enumerate(zip(queries, pos_passages)):
        # This will store cosine similarities between the current query embedding and all passage embeddings
        cosine_similarities = []
        for _, emb_list in embeddings_store:
            # Move each tensor in emb_list to the same device as query_embs
            emb_list = [torch.tensor(emb).to(device) for emb in emb_list if isinstance(emb, torch.Tensor)]
            
            # Compute cosine similarity for each tensor in emb_list
            for emb in emb_list:
                sim = util.pytorch_cos_sim(query_embs[query_idx], emb).item()
                cosine_similarities.append(sim)
        
        # Sort passages by similarity to the query and select a negative sample
        sorted_passage_idxs = sorted(range(len(cosine_similarities)), key=lambda k: cosine_similarities[k], reverse=True)
        for idx in sorted_passage_idxs:
            # Assuming the first passage is the most similar one, we skip it to find a negative sample
            neg_passage = pairs[idx][1]
            if neg_passage != pos_passage:
                triplets.append(f"{query}\t{pos_passage}\t{neg_passage}")
                break  # Break after finding the first suitable negative

# Save the triplets to a file
with open('triplets.tsv', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(triplets))




100%|██████████| 51/51 [01:56<00:00,  2.29s/it]


## Pseudo-labeling

In [None]:
# NOTE: This part takes A LOT of time to run on a normal laptop. It is recommended to run it on a machine with a GPU with more than 6GB RAM or a cloud-based service.
'''
takes triplets of query, positive passage, and negative passage from a TSV file,
uses a CrossEncoder model to score the relevance of the positive and negative passages to the query,
calculates the margin between these scores, and saves the results to a new TSV file
'''

model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Define a Generator to Read Triplets from 'triplets.tsv'
def get_lines():
    with open('triplets.tsv', 'r', encoding='utf-8') as fp:
        lines = fp.read().split('\n')
    for line in tqdm(lines):
        q, p, n = line.split('\t')
        yield q, p, n

# Scoring Triplets and Calculating Margins
lines = get_lines()
label_lines = []

for line in lines:
    q, p, n = line
    # predict (Q, P+) and (Q, P-) scores
    p_score = model.predict((q, p))
    n_score = model.predict((q, n))
    # calculate the margin score
    margin = p_score - n_score
    label_lines.append(
        q + '\t' + p + '\t' + n + '\t' + str(margin)
    )

# Save the Results
with open("triplets_margin.tsv", 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(label_lines))



[2024-04-09 23:33:34] INFO [sentence_transformers.cross_encoder.CrossEncoder.__init__:56] Use pytorch device: cpu
Batches: 100%|██████████| 1/1 [00:00<00:00,  8.69it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 26.37it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 11.03it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 46.06it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 11.27it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 76.00it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 13.43it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 63.73it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 10.97it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 10.93it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 14.28it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 10.99it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 35.49it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 57.34it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 49.41it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 11.36it/s]
Batche

In [41]:
from tqdm.auto import tqdm
from sentence_transformers import InputExample

# Prepare Training Data
training_data = []

with open('triplets_margin.tsv', 'r', encoding='utf-8') as fp:
    lines = fp.read().split('\n')

for line in tqdm(lines):
    q, p, n, margin = line.split('\t')
    training_data.append(InputExample(
        texts=[q, p, n],
        label=float(margin)
    ))

# Intialize the Data Loader
batch_size = 32

loader = torch.utils.data.DataLoader(
    training_data, batch_size=batch_size, shuffle=True
)

# Set up the Sentence Transformer model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-tas-b')
model.max_seq_length = 256



100%|██████████| 5067/5067 [00:00<00:00, 244997.33it/s]
[2024-04-09 23:56:35] INFO [sentence_transformers.SentenceTransformer.__init__:66] Load pretrained SentenceTransformer: msmarco-distilbert-base-tas-b
[2024-04-09 23:56:35] INFO [sentence_transformers.SentenceTransformer.__init__:105] Use pytorch device: cpu


In [None]:
# Import loss function
from sentence_transformers import losses

loss = losses.MarginMSELoss(model)

# Setting Training Parameters
epochs = 10
warmup_steps = int(len(loader) * epochs * 0.1)

# Training the Model
model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path='msmarco-distilbert-base-tas-b-final',
    show_progress_bar=True
)

# Save the Model
model.save('msmarco-distilbert-base-tas-b-final') 



# Using the fine tuned model

In [31]:
resumes = ["""I have more than 3.5 years of work experience and worked as a data scientist in three different companies. I used predictive modeling, data processing, and data mining algorithms to solve challenging business problems.
My technology stack includes Python, machine learning, deep learning, time series, web scraping, Flask, FastAPI, Snowflake SQL servers, deploying production servers, Keras, TensorFlow, Hugging Face, Big Data, and Data Warehouses. In my career, I've experienced exponential growth and developed interpersonal skills, enabling me to handle projects end-to-end.
My interests lie in applied machine learning, deep neural networks, time series, and NLP within e-commerce and consumer internet. My research focuses on information retrieval, incorporating neuroscience and deep reinforcement learning.
I enjoy listening to learning courses, reading research papers on deep learning, and in my free time, keeping up with news, reading Medium blogs, and watching sci-fi films.""",

"An entrepreneur began their journey 14 years ago with the launch of a social networking site, alongside music and video streaming portals in 2006, while still in school. In 2011, during the pursuit of a Computer Science engineering degree, they joined Letsbuy, an e-commerce startup, where they were instrumental in developing and launching their mobile app and site, amidst the early stages of mobile-commerce in India. Letsbuy was later acquired by Flipkart in 2012. Additionally, they co-founded Findyahan, a services marketplace, which was eventually acquired in 2016 by Zimmber. They assumed the role of Vice President of Product & Marketing at Zimmber. Zimmber was later acquired by Quikr.",

"""I've honed my expertise in data science and machine learning, steering numerous projects across diverse industries over the past 7 years. Python serves as my primary tool, enriched by its acclaimed data science libraries. When delving into deep learning, PyTorch is my framework of choice, complemented by a robust understanding of relational databases and cloud platforms.
My professional portfolio encompasses a wide array of projects, where I've leveraged my machine learning proficiency to provide guidance and develop solutions for external clients. I specialize in advising on effective data collection and structuring methodologies.
Beyond the confines of work, I immerse myself in AI literature, contributing extensively through the publication of deep learning tutorials. My focus, particularly, lies on PyTorch, with my articles finding a home on Medium under the publication "A Coder's Guide to AI."""]

In [32]:
description = """Requirements:

Bachelor's or Master's degree in Computer Science from a prestigious institution.
1 to 5 years of experience in the design, development, and deployment of software, with a preference for experience in Statistical and Machine Learning models.
Demonstrated ability to work independently with strong problem-solving skills.
Excellent understanding of the fundamentals of Machine Learning and Artificial Intelligence, particularly in Regression, Forecasting, and Optimization.
Strong foundational knowledge in Probability, Statistics, and Operations Research/Optimization techniques.
Hands-on experience throughout the Machine Learning Lifecycle, from Exploratory Data Analysis (EDA) to model deployment.
Proficiency with data analysis tools such as Jupyter, and libraries including NumPy, Pandas, and Matplotlib.
Capable of writing reliable, maintainable, secure, and performance-optimized code.
Good understanding of Cloud Platforms and Service-Oriented Architecture (SOA) design."""

In [33]:
from model.document_score_inference import score_inference

In [34]:
for resume in resumes:
    print(score_inference(resume, description)[0])

tensor(0.8961)
tensor(0.8190)
tensor(0.8780)


### Analysis of Resume Matching Results

1. **First Resume (Score: 0.8917)**
   - **Description**: This resume describes a candidate with over 3.5 years of experience in data science across multiple organizations, with a strong technology stack that includes Python, machine learning, deep learning, and data warehousing tools.
   - **Analysis**: The high score of 0.8917 suggests that your model found a strong match between the job description's requirements (which emphasize machine learning, data analysis, and deployment skills) and the candidate's described skills and experience. The mention of Python, machine learning, deep learning, and practical applications align well with the job description, thus the high score.

2. **Second Resume (Score: 0.8050)**
   - **Description**: This resume outlines the career of an entrepreneur with a strong background in launching and developing technology-driven businesses, including significant experience in mobile-commerce.
   - **Analysis**: The score of 0.8050, while still relatively high, is the lowest among the three. This could be because the resume emphasizes entrepreneurial and product management skills more than the technical machine learning and data science skills emphasized in the job description. The model likely recognized some relevant skills but also noted the less technical focus compared to the other resumes.

3. **Third Resume (Score: 0.8749)**
   - **Description**: This resume highlights a candidate with 7 years of experience in data science and machine learning, including direct involvement in project leadership and client advisement in these areas. The candidate is proficient in Python and PyTorch and has experience with relational databases and cloud environments.
   - **Analysis**: The score of 0.8749 indicates a very good match, slightly below the first resume. This candidate’s explicit mention of Python, machine learning, and hands-on project experience aligns well with the job description. The slightly lower score compared to the first resume might be due to less explicit mention of some of the specific tools or techniques listed in the job description or perhaps a less direct match in the areas of statistical modeling and deployment.

### General Observations:
- **Model Sensitivity**: The model appears sensitive to specific keywords and the depth of technical details provided in the resumes, which is good for identifying candidates with strong technical qualifications as required by the job description.
- **Interpretability**: The scores are quite close, indicating that all candidates are potentially good matches. However, for better utility in real-world settings, you might want to refine how the model differentiates between closely scored candidates, possibly by incorporating additional criteria or fine-tuning the model's sensitivity to particular job requirements.