<a href="https://colab.research.google.com/github/Tahm24/JobFinder_FYP/blob/main/MainModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **spaCy Model Creation - Keyword Extractor (Custom Dataset)**

In [28]:
# Install spaCy and download the small English model
!pip install -q spacy==3.8.5
!python -m spacy download en_core_web_sm

# Python imports
import spacy
import json
import random
from pathlib import Path
from spacy.tokens import DocBin
from sklearn.model_selection import train_test_split
from google.colab import files

# Upload the custom dataset (JSON Lines: one JSON object per line)
uploaded = files.upload()
jsonl_file = list(uploaded.keys())[0]

# Load and validate training data
examples = []
invalid_spans = 0

with open(jsonl_file, 'r', encoding='utf-8') as f:
    for line in f:
        data = json.loads(line)
        text = data['text']
        entities = []
        for start, end, label in data['entities']:
            span_text = text[start:end].strip()
            if span_text:
                entities.append((start, end, label))
            else:
                invalid_spans += 1
        examples.append((text, {"entities": entities}))

print(f"Loaded {len(examples)} examples")
print(f"Skipped {invalid_spans} invalid spans")

# Split the data into train/dev sets (90/10)
train_data, dev_data = train_test_split(examples, test_size=0.1, random_state=42)

# Helper to convert the examples to spaCy's DocBin format
def create_docbin(data, nlp):
    db = DocBin()
    valid_docs = 0
    for text, annot in data:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            # char_span returns None when the offsets do not align with token boundaries
            span = doc.char_span(start, end, label=label)
            if span is not None:
                ents.append(span)
        if ents:
            doc.ents = ents
            db.add(doc)
            valid_docs += 1
    print(f"✔ Added {valid_docs}/{len(data)} valid examples")
    return db

# Save DocBin files
Path("data").mkdir(parents=True, exist_ok=True)
nlp_blank = spacy.blank("en")

create_docbin(train_data, nlp_blank).to_disk("data/train.spacy")
create_docbin(dev_data, nlp_blank).to_disk("data/dev.spacy")

# Generate a CPU training config optimised for efficiency (GPU was too slow)
!python -m spacy init config data/config.cfg --lang en --pipeline ner --optimize efficiency --force

# Train the model on CPU (--gpu-id -1 disables the GPU)
!python -m spacy train data/config.cfg --output model --paths.train data/train.spacy --paths.dev data/dev.spacy --gpu-id -1

print("\n Model Training Complete")

# Load model and test on sample below
nlp_trained = spacy.load("model/model-best")

test_text = """
John Doe — Backend Developer skilled in Python, Flask, and PostgreSQL.
Worked at DevSolutions Ltd from 2020-2024. MSc Computer Science, University of Manchester.
"""

doc = nlp_trained(test_text)

print("\nEntities Found:\n")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")


Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Saving full_cleaned_training_data.json to full_cleaned_training_data (1).json
✅ Loaded 59 examples
⚠️  Skipped 4 invalid spans
✔ Added 17/53 valid examples
✔ Added 3/6 valid examples
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
data/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory: model
ℹ Using CPU

✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2V
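
The conversion step above kept only 17 of 53 training examples and 3 of 6 dev examples. One possible cause, not confirmed for this dataset, is that some annotated offsets include leading or trailing whitespace: the validation loop strips the span text to check that it is non-empty but keeps the original offsets, and `doc.char_span` returns `None` for spans that do not line up with token boundaries. A minimal sketch of tightening the offsets before conversion is shown here; the helper name `trim_span` and the sample sentence are hypothetical.

```python
def trim_span(text, start, end):
    """Shrink (start, end) so the span has no leading or trailing whitespace.

    Returns None when only whitespace is left.
    """
    # Clamp to the text bounds in case an annotation overshoots.
    start = max(0, start)
    end = min(len(text), end)
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end) if start < end else None


# Hypothetical annotation: offsets (10, 19) cover "  Python " with its spaces.
sample = "Skilled in  Python and Flask."
trimmed = trim_span(sample, 10, 19)
print(trimmed, repr(sample[trimmed[0]:trimmed[1]]))
```

If trimming does not recover the missing examples, `Doc.char_span` also accepts an `alignment_mode` argument (`"strict"`, `"contract"` or `"expand"`) that snaps offsets to token boundaries; whether that is appropriate depends on how the dataset was annotated.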

In [34]:
# Test the saved model again on a longer CV text block
cv_text = """
James Carter
Email: james.carter@example.com
Phone: +44 1234 567890
LinkedIn: linkedin.com/in/jamescarter
GitHub: github.com/jcarter
Location: London, UK

Professional Summary
Machine learning engineer with 4+ years of experience in designing, developing, and maintaining scalable software solutions. Adept in Java, Python, and JavaScript frameworks, with strong skills in Agile methodologies, teamwork, and problem-solving.

Technical Skills
Languages: Java, Python, JavaScript, C++
Frameworks: React, Angular, Node.js, Django, Spring Boot
Tools & Platforms: Git, Docker, Kubernetes, AWS, Azure, Jenkins
Databases: MySQL, PostgreSQL, MongoDB

Professional Experience
Software Engineer | TechSolutions Ltd., London, UK
March 2021 – Present
•	- Developed and maintained web applications using React, Node.js, and MongoDB.
•	- Improved application performance by 25% through code optimisation and refactoring.
•	- Led integration of RESTful APIs, enhancing application scalability.
•	- Collaborated with cross-functional teams using Agile methodologies.

Junior Software Developer | Innovatech Inc., London, UK
January 2019 – February 2021
•	- Assisted in developing backend systems with Java (Spring Boot) and Python (Django).
•	- Contributed to database design and management, optimising query performance by 15%.
•	- Participated in code reviews, debugging sessions, and maintained coding standards, python flask.

Education
BSc Computer Science | University College London, UK | Graduated: 2018

Certifications
- AWS Certified Solutions Architect – Associate (2022)
- Oracle Certified Java Programmer (2021)

Projects
Inventory Management System (React, Node.js, MongoDB) – Developed a full-stack web application to manage warehouse inventory.
Chat Application (Java, Spring Boot, WebSocket) – Created a real-time chat application with secure authentication.

Languages
- English (Native)
- French (Intermediate)

"""

# Extract structured fields using regex + NER. Requires the model trained above (model/model-best).
import re

# Initialise result storage
extracted_info = {}

# ----- Regex-based fields -----
lines = [line.strip() for line in cv_text.strip().split('\n') if line.strip()]
extracted_info['Name'] = lines[0]

email_match = re.search(r'[\w\.-]+@[\w\.-]+', cv_text)
extracted_info['Email'] = email_match.group(0) if email_match else None

phone_match = re.search(r'\+?\d[\d\s\-]{7,}\d', cv_text)
extracted_info['Phone'] = phone_match.group(0) if phone_match else None

linkedin_match = re.search(r'linkedin\.com\/[^\s]+', cv_text)
extracted_info['LinkedIn'] = linkedin_match.group(0) if linkedin_match else None

prof_summary_match = re.search(r'Professional Summary\n([^\n]+)', cv_text)
if prof_summary_match:
    first_line_summary = prof_summary_match.group(1)
    profession_match = re.match(r'(.+?)(?: with|,)', first_line_summary)
    extracted_info['Profession Title'] = profession_match.group(1) if profession_match else first_line_summary

years_match = re.search(r'(\d+)\+?\s*years? of experience', cv_text, re.IGNORECASE)
extracted_info['Years of Experience'] = years_match.group(1) if years_match else None

# spaCy NER for skills, using the trained pipeline loaded above as nlp_trained
doc = nlp_trained(cv_text)

skills = sorted({ent.text for ent in doc.ents if ent.label_ == "SKILL"})
extracted_info['Skills'] = skills if skills else []


print("Extracted CV Info \n")
for key, val in extracted_info.items():
    if isinstance(val, list):
        print(f"{key}: {', '.join(val)}")
    else:
        print(f"{key}: {val}")


Extracted CV Info 

Name: James Carter
Email: james.carter@example.com
Phone: +44 1234 567890
LinkedIn: linkedin.com/in/jamescarter
Profession Title: Machine learning engineer
Years of Experience: 4
Skills: AWS Certified, Django, Node.js, •, •	
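
The skills list above contains bullet characters (`•`) that the model tagged as `SKILL`. Until the training data is improved, a small post-filter can drop entity strings that contain no letters or digits. This is a sketch under that assumption; `clean_skills` is a hypothetical helper, not part of the notebook.

```python
import re


def clean_skills(raw_skills):
    """Drop strings with no letters or digits and de-duplicate case-insensitively."""
    seen = {}
    for skill in raw_skills:
        # Remove surrounding whitespace, bullets and dashes left over from CV formatting.
        skill = skill.strip(" \t\n•-–")
        if not re.search(r"[A-Za-z0-9]", skill):
            continue
        # Keep the first spelling seen for each lower-cased key.
        seen.setdefault(skill.lower(), skill)
    return sorted(seen.values(), key=str.lower)


# Entity strings modelled on the output above, plus one case-variant duplicate.
raw = ["AWS Certified", "Django", "Node.js", "•", "•\t", "django"]
cleaned = clean_skills(raw)
print(cleaned)
```

In the cell above this would replace the set comprehension, for example `clean_skills(ent.text for ent in doc.ents if ent.label_ == "SKILL")`.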


# Export Process for Extractor

In [27]:
from google.colab import files
import shutil

# Zip the trained model so it can be loaded locally or on a server (e.g. Flask)
shutil.make_archive("ner_model", 'zip', "model/model-best")
files.download("ner_model.zip")
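
Because the archive is built from the contents of `model/model-best`, the pipeline files sit at the root of the zip. A sketch of the round trip on the receiving side follows, using a stand-in directory with a single `meta.json` so it runs without a trained model. The function name `roundtrip_archive` and the target folder name are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def roundtrip_archive(src_dir, work_dir):
    """Zip src_dir, then unpack the archive into a fresh folder and return its path."""
    archive = shutil.make_archive(str(Path(work_dir) / "ner_model"), "zip", str(src_dir))
    target = Path(work_dir) / "ner_model_unpacked"
    shutil.unpack_archive(archive, str(target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for model/model-best; a real export holds the full pipeline files.
    fake_model = Path(tmp) / "model-best"
    fake_model.mkdir()
    (fake_model / "meta.json").write_text('{"lang": "en"}', encoding="utf-8")

    unpacked = roundtrip_archive(fake_model, tmp)
    restored = sorted(p.name for p in unpacked.iterdir())
    print(restored)
    # With a real export, the unpacked folder can be passed to spacy.load(unpacked).
```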



# **Pre-Trained Model using all-mpnet-base-v2**

In [29]:
!pip install -q sentence-transformers python-docx

import torch
from sentence_transformers import SentenceTransformer, util
import logging
from transformers.utils import logging as hf_logging
import numpy as np

# Logging
logging.basicConfig(level=logging.INFO)
hf_logging.set_verbosity_info()

class CVMatcher:
    def __init__(self, model_name='all-mpnet-base-v2', device=None):
        self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
        logging.info(f"Loading model: {model_name} on {self.device}")
        self.model = SentenceTransformer(model_name, device=self.device)
        logging.info("Model loaded successfully.")

    def segment_text(self, text, max_length=256):
        segments = []
        current = []
        count = 0
        for line in text.split('\n'):
            line = line.strip()
            if not line:
                continue
            current.append(line)
            count += len(line.split())
            if count >= max_length:
                segments.append(' '.join(current))
                current = []
                count = 0
        if current:
            segments.append(' '.join(current))
        return segments

    # Similarity: highest cosine similarity between any CV segment and any job segment
    def compute_similarity(self, cv_text, job_text) -> float:
        cv_chunks = self.segment_text(cv_text)
        job_chunks = self.segment_text(job_text)

        print(f"Comparing {len(cv_chunks)} CV segments to {len(job_chunks)} job segments...")

        cv_embeddings = self.model.encode(cv_chunks, convert_to_tensor=True)
        job_embeddings = self.model.encode(job_chunks, convert_to_tensor=True)

        sim_matrix = util.cos_sim(cv_embeddings, job_embeddings).cpu().numpy()
        max_sim = np.max(sim_matrix)

        print(f"Max segment similarity: {max_sim * 100:.2f}%")
        return max_sim * 100

    # Rank jobs by similarity score, highest first
    def rank_jobs(self, cv: str, jobs: dict, top_k: int = None, verbose=True):
        print("Ranking jobs based on similarity to the CV\n")
        results = []
        for title, desc in jobs.items():
            print(f"Evaluating job: {title}")
            score = self.compute_similarity(cv, desc)
            results.append((title, score))
        ranked = sorted(results, key=lambda x: x[1], reverse=True)
        if verbose:
            print("\nFinal Job Matching Results:")
            for title, score in ranked[:top_k or len(ranked)]:
                print(f" - {title:40s}: {score:.2f}%")
        return ranked
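
To make the behaviour of `segment_text` concrete, here is a standalone copy of the same logic run on synthetic input. Two points are worth noting: `max_length` counts whitespace-separated words rather than model tokens, and a segment is only closed at the end of a line, so it can exceed `max_length`. The sample text is invented for illustration.

```python
def segment_text(text, max_length=256):
    """Group non-empty lines into segments of roughly max_length words."""
    segments, current, count = [], [], 0
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        current.append(line)
        count += len(line.split())
        # The segment closes only after a whole line pushes it past the limit.
        if count >= max_length:
            segments.append(" ".join(current))
            current, count = [], 0
    if current:
        segments.append(" ".join(current))
    return segments


# Six lines of five words each (30 words in total).
sample = "\n".join(f"line {i} has five words" for i in range(6))
word_counts = [len(s.split()) for s in segment_text(sample, max_length=12)]
print(word_counts)
print(len(segment_text(sample)))
```

With the default of 256 words, a text shorter than that comes back as a single segment, and the encoder truncates any input beyond its own maximum sequence length.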



In [30]:
from google.colab import files
import docx
import os

# Upload CV
uploaded = files.upload()
docx_filename = next(iter(uploaded))

# Read CV text
def read_docx_text(path):
    doc = docx.Document(path)
    return "\n".join([para.text for para in doc.paragraphs if para.text.strip()])

cv_text = read_docx_text(docx_filename)

# Sample job descriptions for testing
jobs = {
    "Software Engineer (React/AWS)": """
        We are seeking a full-stack software engineer to develop cloud-native applications using React, Node.js,
        and AWS. Experience with containerisation tools like Docker and CI/CD pipelines using Jenkins is required.
        Must be a strong team player and comfortable working in Agile environments.
    """,
    "Data Scientist": """
        We are looking for a Data Scientist with solid experience in Python, deep learning, and ML frameworks like
        PyTorch and TensorFlow. Knowledge in NLP, pandas, scikit-learn, and AWS is highly desirable. Candidates
        should be able to build models, run experiments, and deliver production-grade systems.
    """,
    "Random Text": "I have quick react speed on cars"
}

# Run matcher
matcher = CVMatcher()
matcher.rank_jobs(cv=cv_text, jobs=jobs)

# Cleanup uploaded file
os.remove(docx_filename)


Saving James_Carter_CV.docx to James_Carter_CV.docx


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/config.json
Model config MPNetConfig {
  "architectures": [
    "MPNetForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "mpnet",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "relative_attention_num_buckets": 32,
  "transformers_version": "4.52.2",
  "vocab_size": 30527
}

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/model.safetensors
All model checkpoint weights were used when initializing MPNetModel.

All the weights of MPNetModel were initialized from the model checkpoint at sentence-transformers/all-mpnet-base-v2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MPNetModel for predictions without further training.


loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--sentence-transformers--all-mpnet-base-v2/snapshots/12e86a3c702fc3c50205a8db88f0ec7c0b6b94a0/tokenizer_config.json
loading file chat_template.jinja from cache at None


Ranking jobs based on similarity to the CV

Evaluating job: Software Engineer (React/AWS)
Comparing 1 CV segments to 1 job segments...
Max segment similarity: 55.45%
Evaluating job: Data Scientist
Comparing 1 CV segments to 1 job segments...
Max segment similarity: 52.94%
Evaluating job: Random Text
Comparing 1 CV segments to 1 job segments...
Max segment similarity: 13.73%

Final Job Matching Results:
 - Software Engineer (React/AWS)           : 55.45%
 - Data Scientist                          : 52.94%
 - Random Text                             : 13.73%
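
In the run above each text produced a single segment, so the reported score is just the cosine similarity of two embeddings. With several segments, `compute_similarity` returns the maximum entry of the pairwise matrix, which means one well-matched pair decides the whole score. The sketch below reproduces that reduction in plain Python on made-up 2-D vectors (not real model embeddings) and contrasts it with one alternative, averaging the best match for each job segment. `mean_best_per_job_segment` is a hypothetical variation, not what the notebook does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(cv_vecs, job_vecs):
    """Rows are CV segments, columns are job segments."""
    return [[cosine(c, j) for j in job_vecs] for c in cv_vecs]


def max_score(matrix):
    """Same reduction as np.max(sim_matrix) in compute_similarity."""
    return max(max(row) for row in matrix)


def mean_best_per_job_segment(matrix):
    """For each job segment, take its best-matching CV segment, then average."""
    columns = list(zip(*matrix))
    return sum(max(col) for col in columns) / len(columns)


# Toy vectors chosen so that one job segment matches well and the other does not.
cv_vecs = [[1.0, 0.0], [0.0, 1.0]]
job_vecs = [[1.0, 0.0], [-1.0, 0.0]]
matrix = similarity_matrix(cv_vecs, job_vecs)
print(max_score(matrix), mean_best_per_job_segment(matrix))
```

On this toy input the maximum is 1.0 while the averaged variant is 0.5, because the second job segment has no good counterpart in the CV.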


# **Save / Export Model - Similarity Model**

In [31]:
matcher.model.save('cv_matcher_model')

Configuration saved in cv_matcher_model/config.json
Model weights saved in cv_matcher_model/model.safetensors
tokenizer config file saved in cv_matcher_model/tokenizer_config.json
Special tokens file saved in cv_matcher_model/special_tokens_map.json


In [32]:
import shutil
from google.colab import files

shutil.make_archive('cv_matcher_model', 'zip', 'cv_matcher_model')
files.download('cv_matcher_model.zip')
