<a href="https://colab.research.google.com/github/Niranjana-08/AI-Ascent/blob/main/notebooks/data_cleaning/data_cleaning_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Approach: compare each job description against a keyword paragraph for every sub-category.

Notebook Overview:


*   This notebook classifies job descriptions into sub-categories using Sentence Transformer embeddings based on semantic similarity.
*   It loads a cleaned dataset and a hierarchical keyword list, encoding keyword paragraphs into vectors.
*   Job descriptions are then encoded and compared to keyword embeddings using cosine similarity.
*   Each job is assigned the sub-category with the highest similarity score, and that score is kept as its confidence measure. Encoding runs on the GPU when one is available.
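
The matching step described in the bullets above can be illustrated without loading the model. This is a minimal sketch, assuming toy 3-dimensional vectors in place of real sentence embeddings; the vectors, the category names, and the helper names `cosine_similarity` and `best_category` are hypothetical and not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_category(job_vector, category_vectors):
    """Return (category_name, score) for the most similar category vector."""
    scores = {
        name: cosine_similarity(job_vector, vec)
        for name, vec in category_vectors.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy "embeddings"; real ones come from the Sentence Transformer model.
toy_categories = {
    "Category A": [1.0, 0.0, 0.0],
    "Category B": [0.0, 1.0, 0.0],
}
toy_job = [0.9, 0.1, 0.0]

label, score = best_category(toy_job, toy_categories)
print(label, round(score, 3))
```

The notebook does the same thing in bulk: `util.pytorch_cos_sim` builds the full job-by-category score matrix and `torch.max(..., dim=1)` picks the best column per row.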







1. Runs on a T4 GPU.
2. Uses the `MEGA_KEYWORDS` keyword list (`keywords_mega_changed.py`).
3. Uses Sentence Transformers for the embeddings.

## 1. Setup & Imports

In [None]:
!pip install sentence-transformers -q

In [None]:
import pandas as pd
import sys
from google.colab import drive
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm
import torch

## 2. Data Loading

Mount Google Drive and load dataset files for processing.

In [None]:
print("Mounting Google Drive")
drive.mount('/content/drive', force_remount=True)

keywords_folder_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/keywords/'
sys.path.append(keywords_folder_path)
data_file_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/data_cleaning/cleaned_for_classification.csv'

Mounting Google Drive...
Mounted at /content/drive


### Load project-specific keywords and the cleaned classification dataset.

In [None]:
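try:
    from keywords_mega_changed import MEGA_KEYWORDS
    print("keywords_mega_changed.py imported successfully.")
    df = pd.read_csv(data_file_path)
    print("Cleaned dataset loaded successfully.")
except (ImportError, FileNotFoundError) as e:
    print(f"Error: Could not load files. Details: {e}")
    raise  # re-raise with the original traceback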

keywords_mega_changed.py imported successfully.
Cleaned dataset loaded successfully.


## 3. Prepare List of Sub-Categories and Keyword Paragraphs

Flatten the nested keyword structure into two parallel lists (sub-category names and their keyword paragraphs) and build a mapping from each sub-category to its main category.

In [None]:
all_sub_categories = []
mega_keyword_paragraphs = []
sub_to_main_map = {}
for main_cat, sub_cats in MEGA_KEYWORDS.items():
    for sub_cat_name, paragraph in sub_cats.items():
        all_sub_categories.append(sub_cat_name)
        mega_keyword_paragraphs.append(paragraph)
        sub_to_main_map[sub_cat_name] = main_cat
print(f"\nCreated a flat list of {len(all_sub_categories)} sub-categories.")


Created a flat list of 38 sub-categories.
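
The loop assumes `MEGA_KEYWORDS` is a two-level dictionary: main category, then sub-category, then a keyword paragraph. The real contents of `keywords_mega_changed.py` are not shown in this notebook, so the sketch below uses a hypothetical stand-in (`TOY_KEYWORDS`) whose category names are borrowed from the results further down and whose paragraphs are invented for illustration.

```python
# Hypothetical stand-in for MEGA_KEYWORDS; the paragraph texts are made up.
TOY_KEYWORDS = {
    "Technology": {
        "Core Software & Web Development": "software engineer backend frontend web developer",
        "Data, AI & Analytics": "data scientist machine learning analytics",
    },
    "Legal": {
        "Support & Paralegal": "paralegal legal assistant case filing",
    },
}

sub_categories, paragraphs, sub_to_main = [], [], {}
for main_cat, sub_cats in TOY_KEYWORDS.items():
    for sub_cat_name, paragraph in sub_cats.items():
        sub_categories.append(sub_cat_name)   # position i in this list ...
        paragraphs.append(paragraph)          # ... matches position i here
        sub_to_main[sub_cat_name] = main_cat  # reverse lookup for later

print(sub_categories)
print(sub_to_main["Data, AI & Analytics"])
```

Because `sub_to_main` is keyed by sub-category name, this layout only works if sub-category names are unique across main categories; a repeated name would silently overwrite the earlier mapping.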


## 4. Load Model and Encode Keyword Paragraphs

Select the GPU if one is available, load the pre-trained `all-MiniLM-L6-v2` Sentence Transformer model, and encode the keyword paragraphs into vectors.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\nUsing device: {device}")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

print("Encoding sub-category keywords into vectors...")
category_embeddings = model.encode(mega_keyword_paragraphs, convert_to_tensor=True, show_progress_bar=True)



Using device: cuda



Encoding sub-category keywords into vectors...




## 5. Initial Classification on Sample Dataset

Encode a sample of job descriptions and calculate similarity scores to assign initial categories.

In [None]:
sample_df = df.head(500).copy() # Using 500 jobs to analyze scores
job_texts = sample_df['combined_text'].astype(str).tolist()

print("\nEncoding job descriptions into vectors")
job_embeddings = model.encode(job_texts, convert_to_tensor=True, show_progress_bar=True)

print("\nCalculating similarity scores")
cosine_scores = util.pytorch_cos_sim(job_embeddings, category_embeddings)

# Find the best match (highest score) for each job
top_scores, top_indices = torch.max(cosine_scores, dim=1)


Encoding job descriptions into vectors...





Calculating similarity scores...


In [None]:
# .tolist() converts the index tensor to plain Python ints before the list lookup
sample_df['sub_category'] = [all_sub_categories[i] for i in top_indices.tolist()]
sample_df['main_category'] = sample_df['sub_category'].map(sub_to_main_map)
sample_df['confidence_score'] = top_scores.cpu().numpy()

print("Classification and scoring complete.")

Classification and scoring complete.


In [None]:
print("\n--- Classification Finished! ---")
final_classification_df = sample_df[[
    'job_id',
    'title',
    'main_category',
    'sub_category',
    'confidence_score'
]].copy()

final_classification_df.head(50)


--- Classification Finished! ---


| | job_id | title | main_category | sub_category | confidence_score |
|---|---|---|---|---|---|
| 0 | 921716 | Marketing Coordinator | Human Resources | Talent Acquisition & Recruiting | 0.412866 |
| 1 | 1829192 | Mental Health Therapist/Counselor | Healthcare (Research & Admin) | Clinical & Patient Care | 0.448669 |
| 2 | 10998357 | Assitant Restaurant Manager | Consulting & Strategy | Major Consulting Firms | 0.392926 |
| 3 | 23221523 | Senior Elder Law / Trusts and Estates Associat... | Legal | Support & Paralegal | 0.425426 |
| 4 | 35982263 | Service Technician | Supply Chain & Logistics | Logistics & Operations | 0.414852 |
| 5 | 91700727 | Economic Development and Planning Intern | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.317096 |
| 6 | 103254301 | Producer | Marketing | Content, Creative & Brand | 0.675385 |
| 7 | 112576855 | Building Engineer | Technology | Infrastructure, Cloud & Operations | 0.331292 |
| 8 | 1218575 | Respiratory Therapist | Healthcare (Research & Admin) | Clinical & Patient Care | 0.331576 |
| 9 | 2264355 | Worship Leader | Human Resources | Talent Acquisition & Recruiting | 0.35951 |


In [None]:
final_classification_df[50:101]

| | job_id | title | main_category | sub_category | confidence_score |
|---|---|---|---|---|---|
| 50 | 974774701 | Blog writer and virtual assistant | Marketing | Content, Creative & Brand | 0.560186 |
| 51 | 1014822088 | Marketing Specialist | Marketing | Marketing Strategy & Analytics | 0.55512 |
| 52 | 1093227543 | Sales Associate Natural Food Products | Marketing | Marketing Strategy & Analytics | 0.354836 |
| 53 | 1129235875 | Industrial Sales Representative | Supply Chain & Logistics | Logistics & Operations | 0.34177 |
| 54 | 1143359956 | National Sales Manager | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.349123 |
| 55 | 1168783207 | Reps manager | Supply Chain & Logistics | Logistics & Operations | 0.413218 |
| 56 | 1183148438 | Montessori Lead Guide, Primary | Education & EdTech | Instructional & Curriculum Design | 0.305986 |
| 57 | 1219205895 | Director of Training | Human Resources | Core HR & Business Partnership | 0.404714 |
| 58 | 1448163866 | Office Manager | Human Resources | Core HR & Business Partnership | 0.480291 |
| 59 | 1573178251 | Social Media Coordinator | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.494537 |


In [None]:
final_classification_df[102:150]

| | job_id | title | main_category | sub_category | confidence_score |
|---|---|---|---|---|---|
| 102 | 3018278978 | Seasonal Office Administrator | Human Resources | Talent Acquisition & Recruiting | 0.414949 |
| 103 | 3040487795 | Digital Marketing Intern | Marketing | Digital & Performance Marketing | 0.48623 |
| 104 | 3045980831 | Project Engineer | Technology | Data, AI & Analytics | 0.576492 |
| 105 | 3075721793 | Architect/Project Manager | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.442046 |
| 106 | 3117273910 | Administrative Assistant | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.499775 |
| 107 | 3127577086 | Histologist - HT | Healthcare (Research & Admin) | Administration & Informatics | 0.365463 |
| 108 | 3169712432 | Salesforce Vlocity Developer | Technology | Core Software & Web Development | 0.484782 |
| 109 | 3177010992 | Customer Service Representative | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.435198 |
| 110 | 3184403524 | Events & Communications Assistant | Healthcare (Research & Admin) | Clinical & Patient Care | 0.362007 |
| 111 | 3189117072 | Client Service Associate / Practice Manager | Consulting & Strategy | Specialized & Domain-Specific Advisory | 0.50737 |


Second attempt: apply a separate confidence threshold to each main category.

## 6. Define Custom Confidence Thresholds per Category

The values in `category_thresholds` were chosen manually.

In [None]:
category_thresholds = {
    'Technology': 0.30,
    'Finance': 0.35,
    'Legal': 0.35,
    'Healthcare (Research & Admin)': 0.38,
    'Marketing': 0.35,
    'Human Resources': 0.40,
    'Education & EdTech': 0.30,
    'Consulting & Strategy': 0.35,
    'Supply Chain & Logistics': 0.45,
    'Design': 0.35,
    'Automotive': 0.15,
    'Media & Journalism': 0.10
}
print("Custom category-specific thresholds are set.")

Custom category-specific thresholds are set.
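
The notebook does not record how the threshold values were picked beyond being set by hand. One way to inform such a choice, sketched here under that assumption, is to summarize the confidence scores per main category. The ten rows are copied from the first sample output above; the helper name `summarize_scores` is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (main_category, confidence_score) pairs copied from sample rows 0-9 above.
sample_rows = [
    ("Human Resources", 0.412866),
    ("Healthcare (Research & Admin)", 0.448669),
    ("Consulting & Strategy", 0.392926),
    ("Legal", 0.425426),
    ("Supply Chain & Logistics", 0.414852),
    ("Consulting & Strategy", 0.317096),
    ("Marketing", 0.675385),
    ("Technology", 0.331292),
    ("Healthcare (Research & Admin)", 0.331576),
    ("Human Resources", 0.35951),
]

def summarize_scores(rows):
    """Group scores by category and report count, min, median, and max."""
    grouped = defaultdict(list)
    for category, score in rows:
        grouped[category].append(score)
    return {
        category: {
            "count": len(scores),
            "min": min(scores),
            "median": median(scores),
            "max": max(scores),
        }
        for category, scores in grouped.items()
    }

summary = summarize_scores(sample_rows)
for category, stats in sorted(summary.items()):
    print(f"{category}: n={stats['count']}, "
          f"min={stats['min']:.3f}, median={stats['median']:.3f}, max={stats['max']:.3f}")
```

Ten rows are far too few to set thresholds from; on the 500-row sample the same summary, read alongside the job titles, would show which categories attract weak matches.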


## 7. Full Classification on Entire Dataset

Repeat the encoding and classification procedure on the full dataset.

In [None]:
# The earlier 500-row sample (df.head(500)) was only used to inspect scores.
# Classify the full DataFrame now; the name sample_df is kept so later cells still work.
sample_df = df.copy()

job_texts = sample_df['combined_text'].astype(str).tolist()

print("\nEncoding job descriptions into vectors")
job_embeddings = model.encode(job_texts, convert_to_tensor=True, show_progress_bar=True)

print("\nCalculating similarity scores")
cosine_scores = util.pytorch_cos_sim(job_embeddings, category_embeddings)
top_scores, top_indices = torch.max(cosine_scores, dim=1)


Encoding job descriptions into vectors...





Calculating similarity scores...


In [None]:
# .tolist() converts the index tensor to plain Python ints before the list lookup
sample_df['sub_category'] = [all_sub_categories[i] for i in top_indices.tolist()]
sample_df['main_category'] = sample_df['sub_category'].map(sub_to_main_map)
sample_df['confidence_score'] = top_scores.cpu().numpy()
print("Initial classification and scoring complete.")

Initial classification and scoring complete.


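## 8. Apply Category-Specific Confidence Thresholds on the Complete Dataset

Mark a job as 'Other' if its confidence score is below the threshold for its assigned main category. Categories without an entry in `category_thresholds` fall back to a threshold of 0.5.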

In [None]:
print("\nApplying custom thresholds to filter results...")

def apply_threshold(row):
    main_cat = row['main_category']
    score = row['confidence_score']

    threshold = category_thresholds.get(main_cat, 0.5)

    if score < threshold:
        return 'Other'
    else:
        return main_cat


Applying custom thresholds to filter results...


In [None]:
sample_df['final_main_category'] = sample_df.apply(apply_threshold, axis=1)

sample_df['final_sub_category'] = sample_df.apply(
    lambda row: row['sub_category'] if row['final_main_category'] != 'Other' else 'Other',
    axis=1
)
print("Thresholding complete.")

Thresholding complete.
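
As a check on the rule, the sketch below applies the same comparison to three rows from the sample output, with thresholds copied from `category_thresholds`. It is a plain-Python restatement for illustration, not a replacement for the `DataFrame.apply` call; the function name `final_category` is hypothetical.

```python
# Thresholds copied from category_thresholds for the three categories used here.
thresholds = {
    "Human Resources": 0.40,
    "Supply Chain & Logistics": 0.45,
    "Consulting & Strategy": 0.35,
}
DEFAULT_THRESHOLD = 0.5  # fallback apply_threshold uses for unlisted categories

def final_category(main_cat, score, thresholds, default=DEFAULT_THRESHOLD):
    """Keep the category if the score reaches its threshold, else return 'Other'."""
    threshold = thresholds.get(main_cat, default)
    return main_cat if score >= threshold else "Other"

# (title, main_category, confidence_score) taken from the sample output.
rows = [
    ("Marketing Coordinator", "Human Resources", 0.412866),
    ("Service Technician", "Supply Chain & Logistics", 0.414852),
    ("Economic Development and Planning Intern", "Consulting & Strategy", 0.317096),
]

results = [(title, final_category(cat, score, thresholds)) for title, cat, score in rows]
for title, cat in results:
    print(f"{title}: {cat}")
```

The first two rows have nearly the same score (about 0.41), yet only the Human Resources row survives, because Supply Chain & Logistics has the stricter threshold of 0.45.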

## 9. Prepare Final Classified DataFrame and Save

Format the final DataFrame to include the relevant columns and save it to CSV on Google Drive.

In [None]:
print("\n--- Final Classification Finished! ---")
final_df = sample_df[[
    'job_id',
    'title',
    'final_main_category',
    'final_sub_category',
    'confidence_score'
]].copy()

final_df.rename(columns={
    'final_main_category': 'main_category',
    'final_sub_category': 'sub_category'
}, inplace=True)

final_df.head(50)


--- Final Classification Finished! ---


| | job_id | title | main_category | sub_category | confidence_score |
|---|---|---|---|---|---|
| 0 | 921716 | Marketing Coordinator | Human Resources | Talent Acquisition & Recruiting | 0.412866 |
| 1 | 1829192 | Mental Health Therapist/Counselor | Healthcare (Research & Admin) | Clinical & Patient Care | 0.448669 |
| 2 | 10998357 | Assitant Restaurant Manager | Consulting & Strategy | Major Consulting Firms | 0.392926 |
| 3 | 23221523 | Senior Elder Law / Trusts and Estates Associat... | Legal | Support & Paralegal | 0.425426 |
| 4 | 35982263 | Service Technician | Other | Other | 0.414852 |
| 5 | 91700727 | Economic Development and Planning Intern | Other | Other | 0.317096 |
| 6 | 103254301 | Producer | Marketing | Content, Creative & Brand | 0.675385 |
| 7 | 112576855 | Building Engineer | Technology | Infrastructure, Cloud & Operations | 0.331292 |
| 8 | 1218575 | Respiratory Therapist | Other | Other | 0.331576 |
| 9 | 2264355 | Worship Leader | Other | Other | 0.35951 |

In [None]:
num_rows, num_columns = final_df.shape

print("The final DataFrame has:")
print(f"- {num_rows} rows")
print(f"- {num_columns} columns")

The final DataFrame has:
- 123849 rows
- 5 columns


In [None]:
output_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/classified_jobs/classified_jobs.csv'

print(f"Saving the final classified DataFrame to: {output_path}")
final_df.to_csv(output_path, index=False)

print("\nFile saved successfully!")

Saving the final classified DataFrame to: /content/drive/My Drive/job-analysis/job-analysis-dataset/classified_jobs/classified_jobs.csv

File saved successfully!
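
`DataFrame.to_csv` does not create missing parent folders, so the save fails if the `classified_jobs/` folder does not yet exist on Drive. A minimal sketch of guarding against that with `pathlib` follows. It writes to a temporary directory with the standard `csv` module instead of Drive and pandas so that it runs anywhere; the helper name `ensure_parent_dir` is hypothetical, and the demo row is copied from the results above.

```python
import csv
import tempfile
from pathlib import Path

def ensure_parent_dir(path):
    """Create the parent folder of `path` if it is missing; return a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # The nested "classified_jobs" folder does not exist yet inside tmp.
    out_path = ensure_parent_dir(Path(tmp) / "classified_jobs" / "classified_jobs.csv")

    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["job_id", "title", "main_category", "sub_category", "confidence_score"])
        writer.writerow([103254301, "Producer", "Marketing", "Content, Creative & Brand", 0.675385])

    saved_ok = out_path.exists()
    with out_path.open(newline="") as f:
        saved_rows = list(csv.reader(f))

print(saved_ok, len(saved_rows))
```

In the notebook, the equivalent would be calling `ensure_parent_dir(output_path)` before `final_df.to_csv(output_path, index=False)`.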
