<a href="https://colab.research.google.com/github/Niranjana-08/AI-Ascent/blob/main/notebooks/data_cleaning/data_cleaning_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Approach: paragraph comparison. Each job description is compared against one keyword paragraph per sub-category.

Notebook Overview:

*   This notebook classifies job descriptions into sub-categories by semantic similarity, using Sentence Transformer embeddings.
*   It loads a cleaned dataset and a hierarchical keyword list, then encodes each sub-category's keyword paragraph into a vector.
*   Job descriptions are encoded with the same model and compared to the keyword embeddings using cosine similarity.
*   Each job is assigned the sub-category with the highest similarity score, and that score is stored as its confidence score. Encoding runs on a GPU when one is available and falls back to the CPU otherwise.







1. Runs on a T4 GPU.
2. Uses the `keywords_mega_changed` keyword list (`MEGA_KEYWORDS`).
3. Uses the `all-MiniLM-L6-v2` Sentence Transformer model.

## 1. Setup & Imports

In [None]:
!pip install sentence-transformers -q

In [None]:
import pandas as pd
import sys
from google.colab import drive
from sentence_transformers import SentenceTransformer, util
from tqdm.auto import tqdm
import torch

## 2. Data Loading

Mount Google Drive and define the paths to the keyword module and the cleaned dataset.

In [None]:
print("Mounting Google Drive")
drive.mount('/content/drive', force_remount=True)

keywords_folder_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/keywords/'
sys.path.append(keywords_folder_path)
data_file_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/data_cleaning/cleaned_for_classification.csv'

### Load project-specific keywords and the cleaned classification dataset.

In [None]:
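try:
    # keywords_mega_changed.py lives in keywords_folder_path, which was added to sys.path above
    from keywords_mega_changed import MEGA_KEYWORDS
    df = pd.read_csv(data_file_path)
except (ImportError, FileNotFoundError) as e:
    print(f"Error: Could not load files. Details: {e}")
    raise  # re-raise with the original traceback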

## 3. Prepare List of Sub-Categories and Keyword Paragraphs

Flatten the nested keyword structure into parallel lists and build a mapping from each sub-category to its main category.

In [None]:
all_sub_categories = []
mega_keyword_paragraphs = []
sub_to_main_map = {}
for main_cat, sub_cats in MEGA_KEYWORDS.items():
    for sub_cat_name, paragraph in sub_cats.items():
        all_sub_categories.append(sub_cat_name)
        mega_keyword_paragraphs.append(paragraph)
        sub_to_main_map[sub_cat_name] = main_cat
print(f"\nCreated a flat list of {len(all_sub_categories)} sub-categories.")
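
The loop above assumes that `MEGA_KEYWORDS` is a two-level dictionary: main category, then sub-category, then a keyword paragraph. The real contents live in `keywords_mega_changed.py` and are not shown here, so the following sketch uses a hypothetical stand-in (`toy_keywords` and its entries are invented for illustration) to show the shapes that come out of the flattening step.

```python
# Hypothetical stand-in for MEGA_KEYWORDS; the entries are invented for illustration.
toy_keywords = {
    "Technology": {
        "Backend Development": "server-side APIs, databases, microservices",
        "Data Science": "machine learning, statistics, model training",
    },
    "Finance": {
        "Accounting": "bookkeeping, ledgers, financial statements",
    },
}

toy_sub_categories = []
toy_paragraphs = []
toy_sub_to_main = {}
for main_cat, sub_cats in toy_keywords.items():
    for sub_cat_name, paragraph in sub_cats.items():
        toy_sub_categories.append(sub_cat_name)
        toy_paragraphs.append(paragraph)
        toy_sub_to_main[sub_cat_name] = main_cat

# Position i in toy_sub_categories corresponds to position i in toy_paragraphs,
# which is what later lets an argmax index be turned back into a name.
assert len(toy_sub_categories) == len(toy_paragraphs)

# If two main categories shared a sub-category name, the later one would
# silently overwrite the earlier one in the mapping; this check catches that.
assert len(set(toy_sub_categories)) == len(toy_sub_categories)

print(toy_sub_categories)
print(toy_sub_to_main)
```

The uniqueness check is worth running on the real keyword list as well, since the sub-category name is the only key used to recover the main category.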

## 4. Load Model and Encode Keyword Paragraphs

Select the GPU if one is available, load the pre-trained Sentence Transformer model, and encode the keyword paragraphs.

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"\nUsing device: {device}")

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

print("Encoding sub-category keywords into vectors...")
category_embeddings = model.encode(mega_keyword_paragraphs, convert_to_tensor=True, show_progress_bar=True)


## 5. Initial Classification on Sample Dataset

Encode a sample of job descriptions and calculate similarity scores to assign initial categories.

In [None]:
sample_df = df.head(500).copy() # Using 500 jobs to analyze scores
job_texts = sample_df['combined_text'].astype(str).tolist()

print("\nEncoding job descriptions into vectors")
job_embeddings = model.encode(job_texts, convert_to_tensor=True, show_progress_bar=True)

print("\nCalculating similarity scores")
cosine_scores = util.pytorch_cos_sim(job_embeddings, category_embeddings)

# Find the best match (highest score) for each job
top_scores, top_indices = torch.max(cosine_scores, dim=1)
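
As a dependency-free illustration of what the cell above computes, the following sketch uses hypothetical 3-dimensional vectors in place of real embeddings (`all-MiniLM-L6-v2` produces 384-dimensional vectors). The names `cosine`, `toy_category_vectors`, and `toy_job_vector` are illustrative and not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", one per sub-category.
toy_category_vectors = {
    "Backend Development": [1.0, 0.0, 0.0],
    "Accounting": [0.0, 1.0, 0.0],
}
# Hypothetical toy "embedding" of one job description.
toy_job_vector = [0.9, 0.1, 0.0]

# Score the job against every sub-category, then keep the best match.
scores = {name: cosine(toy_job_vector, vec) for name, vec in toy_category_vectors.items()}
best_category = max(scores, key=scores.get)
best_score = scores[best_category]

print(best_category, round(best_score, 3))
```

In the notebook the same two steps are done for all jobs at once: the similarity call returns a jobs-by-categories score matrix, and `torch.max(..., dim=1)` takes the best score and its column index for each row.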

In [None]:
# Move the indices to the CPU as plain ints before indexing the Python list
sample_df['sub_category'] = [all_sub_categories[i] for i in top_indices.cpu().tolist()]
sample_df['main_category'] = sample_df['sub_category'].map(sub_to_main_map)
sample_df['confidence_score'] = top_scores.cpu().numpy()

print("Classification and scoring complete.")

In [None]:
print("\n--- Classification Finished! ---")
final_classification_df = sample_df[[
    'job_id',
    'title',
    'main_category',
    'sub_category',
    'confidence_score'
]].copy()

final_classification_df.head(50)

In [None]:
# Rows 50-99
final_classification_df.iloc[50:100]

In [None]:
# Rows 100-149, continuing directly after the previous slice
final_classification_df.iloc[100:150]

Attempt 2: apply a separate confidence threshold to each main category.

## 6. Define Custom Confidence Thresholds per Category

The `category_thresholds` values below were chosen manually.

In [None]:
category_thresholds = {
    'Technology': 0.30,
    'Finance': 0.35,
    'Legal': 0.35,
    'Healthcare (Research & Admin)': 0.38,
    'Marketing': 0.35,
    'Human Resources': 0.40,
    'Education & EdTech': 0.30,
    'Consulting & Strategy': 0.35,
    'Supply Chain & Logistics': 0.45,
    'Design': 0.35,
    'Automotive': 0.15,
    'Media & Journalism': 0.10
}
print("Custom category-specific thresholds are set.")
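
Because the thresholds are set by hand, one possible sanity check is to compare each cutoff against the scores actually observed for that category. The sketch below is only an illustration under assumed data: `toy_results` and `toy_thresholds` are hypothetical, and the notebook does not say how its values were chosen.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (main_category, confidence_score) pairs standing in for classified jobs.
toy_results = [
    ("Technology", 0.42), ("Technology", 0.28), ("Technology", 0.35),
    ("Finance", 0.31), ("Finance", 0.47),
]
# Hypothetical subset of per-category cutoffs.
toy_thresholds = {"Technology": 0.30, "Finance": 0.35}

# Group the scores by main category.
scores_by_category = defaultdict(list)
for category, score in toy_results:
    scores_by_category[category].append(score)

# For each category, report the median score and the share of jobs
# that the cutoff would relabel as 'Other'.
summary = {}
for category, scores in scores_by_category.items():
    threshold = toy_thresholds[category]
    below = sum(score < threshold for score in scores)
    summary[category] = {
        "median": median(scores),
        "share_below_threshold": below / len(scores),
    }
    print(category, summary[category])
```

A cutoff that relabels most of a category, or almost none of it, may be worth a second look at the sample rows displayed earlier.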

## 7. Full Classification on Entire Dataset

Repeat the encoding and classification procedure on the full dataset.

In [None]:
# Use the full DataFrame (the earlier run used only df.head(500)).
# The name sample_df is kept so the cells below work unchanged.
sample_df = df.copy()

job_texts = sample_df['combined_text'].astype(str).tolist()

print("\nEncoding job descriptions into vectors")
job_embeddings = model.encode(job_texts, convert_to_tensor=True, show_progress_bar=True)

print("\nCalculating similarity scores")
cosine_scores = util.pytorch_cos_sim(job_embeddings, category_embeddings)
top_scores, top_indices = torch.max(cosine_scores, dim=1)

In [None]:
# Move the indices to the CPU as plain ints before indexing the Python list
sample_df['sub_category'] = [all_sub_categories[i] for i in top_indices.cpu().tolist()]
sample_df['main_category'] = sample_df['sub_category'].map(sub_to_main_map)
sample_df['confidence_score'] = top_scores.cpu().numpy()
print("Initial classification and scoring complete.")

## 8. Apply Category-Specific Confidence Thresholds to the Full Dataset

Mark a job as 'Other' if its confidence score is below the threshold for its main category. Main categories without an entry in `category_thresholds` fall back to a threshold of 0.5.

In [None]:
print("\nApplying custom thresholds to filter results...")

def apply_threshold(row):
    main_cat = row['main_category']
    score = row['confidence_score']

    threshold = category_thresholds.get(main_cat, 0.5)

    if score < threshold:
        return 'Other'
    else:
        return main_cat
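
To make the edge cases of this rule concrete, here is a small standalone sketch that applies the same logic to plain dictionaries instead of DataFrame rows. `apply_threshold_sketch`, `toy_thresholds`, and `toy_rows` are hypothetical names used only for this illustration; the two threshold values are copied from the dictionary defined earlier.

```python
# Two cutoffs copied from category_thresholds; the rest are omitted for brevity.
toy_thresholds = {"Technology": 0.30, "Supply Chain & Logistics": 0.45}

def apply_threshold_sketch(row, thresholds, default=0.5):
    """Return the row's main category, or 'Other' if its score is below the cutoff."""
    threshold = thresholds.get(row["main_category"], default)
    if row["confidence_score"] < threshold:
        return "Other"
    return row["main_category"]

toy_rows = [
    # Score equal to the cutoff: the comparison is strict, so the category is kept.
    {"main_category": "Technology", "confidence_score": 0.30},
    # Score below the 0.45 cutoff: relabelled as 'Other'.
    {"main_category": "Supply Chain & Logistics", "confidence_score": 0.40},
    # Category missing from the dictionary: the 0.5 default applies.
    {"main_category": "Unlisted Category", "confidence_score": 0.45},
]

results = [apply_threshold_sketch(row, toy_thresholds) for row in toy_rows]
print(results)
```

The third row shows why the default matters: any main category missing from the threshold dictionary is held to a stricter cutoff than every listed one.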

In [None]:
sample_df['final_main_category'] = sample_df.apply(apply_threshold, axis=1)

sample_df['final_sub_category'] = sample_df.apply(
    lambda row: row['sub_category'] if row['final_main_category'] != 'Other' else 'Other',
    axis=1
)
print("Thresholding complete.")

## 9. Prepare Final Classified DataFrame and Save

Select the relevant columns for the final DataFrame and save it as a CSV file on Google Drive.

In [None]:
print("\n--- Final Classification Finished! ---")
final_df = sample_df[[
    'job_id',
    'title',
    'final_main_category',
    'final_sub_category',
    'confidence_score'
]].copy()

final_df.rename(columns={
    'final_main_category': 'main_category',
    'final_sub_category': 'sub_category'
}, inplace=True)

final_df.head(50)

In [None]:
num_rows, num_columns = final_df.shape

print("The final DataFrame has:")
print(f"- {num_rows} rows")
print(f"- {num_columns} columns")

In [None]:
output_path = '/content/drive/My Drive/job-analysis/job-analysis-dataset/classified_jobs/classified_jobs.csv'

print(f"Saving the final classified DataFrame to: {output_path}")
final_df.to_csv(output_path, index=False)

print("\nFile saved successfully!")