## Code for sampling based on dissimilar job titles

**Author:** Benjamin Aw  
**Date:** 17 Sep 2021  
**Context:** We need to choose 500 job ad IDs to send to MRSD for manual tagging. Ideally, these 500 job ads should be as different from one another as possible.  
**Objective:** Select the 500 most dissimilar job ad IDs to send to MRSD  

In [12]:
import pandas as pd
import spacy
import random
import re
from tqdm.notebook import trange, tqdm
import numpy as np
import swifter

In [3]:
nlp = spacy.load('en_core_web_lg')

### Reading and cleaning data

Read the CSV file, keep only the Job ID and Title columns, and take a 10% random sample

In [4]:
mcf_df = pd.read_csv(r"..\Data\Processed\WGS_Dataset_JobInfo_precleaned.csv")
mcf_df = mcf_df[["Job_ID", "Title"]].sample(frac=0.1, random_state=1).reset_index(drop = True)

Cleaning up job titles based on eyeballing of entries

In [5]:
# Remove everything within [...] and (...), the #sgunited* hashtags and any other hashtag, leading 2-5 digit numbers, numbers of 4 or more digits, and dashes

cleaning_regex = [r'\[.*?\]', r'\(.*?\)', r'#sgunitedjobs*', r'^\d{2,5}', r'-', r'#sgunitedtraineeships', r'#sgups*', r'#sguniteds*', r'#sg', r'#(\w+)', r'\d{4,}']

# Iteratively apply each regex
for regex in cleaning_regex:
    mcf_df['Title'] = mcf_df['Title'].map(lambda text: re.sub(regex, ' ', text).strip())

# Remove non-ASCII characters (anything outside code points 0 to 127), e.g. Chinese and other special characters
mcf_df['Title'] = mcf_df['Title'].map(lambda text: re.sub(r'([^\x00-\x7F])+', ' ', text))

# Remove all other symbols 
mcf_df['Title'] = mcf_df['Title'].map(lambda text: re.sub(r'[^a-zA-Z0-9\s]',' ', text))

# Collapse runs of whitespace into a single space
mcf_df['Title'] = mcf_df['Title'].map(lambda text: re.sub(r'\s\s+', ' ', text))
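To check the cleaning rules on a few titles before running them over the whole dataframe, the steps above can be wrapped in a single function. This is a minimal sketch: `clean_title` and `CLEANING_REGEX` are hypothetical names introduced here, and the final `.strip()` is an addition that the cell above does not perform.

```python
import re

# Same patterns as the cleaning cell, written as raw strings
CLEANING_REGEX = [r'\[.*?\]', r'\(.*?\)', r'#sgunitedjobs*', r'^\d{2,5}', r'-',
                  r'#sgunitedtraineeships', r'#sgups*', r'#sguniteds*', r'#sg',
                  r'#(\w+)', r'\d{4,}']

def clean_title(text):
    """Apply the notebook's cleaning steps to a single job title."""
    # Apply each pattern in order, trimming after every pass
    for regex in CLEANING_REGEX:
        text = re.sub(regex, ' ', text).strip()
    # Drop non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # Drop remaining symbols
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Collapse repeated whitespace; the trailing strip is an addition (assumption)
    return re.sub(r'\s\s+', ' ', text).strip()

print(clean_title("[Urgent] Senior Engineer (Contract) #SGUnitedJobs"))
```

Note that the patterns are case-sensitive, so an uppercase hashtag such as `#SGUnitedJobs` is only caught by the generic `#(\w+)` rule.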

### Splitting to get initial data

In [6]:
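#setting the random seed for reproducibility
random.seed(1)
#randint includes both endpoints, so the upper bound must be the last valid row position
random_row = random.randint(0, mcf_df.shape[0] - 1)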

#selecting the first entry to be in the test set randomly
mcf_df_test = mcf_df.iloc[[random_row]]

#removing selected entry from existing data
mcf_df_existing = mcf_df.drop([random_row])
mcf_df_existing.reset_index(drop = True, inplace = True)

### Function to get text similarity between two strings

In [7]:
#find the similarity between two strings of text (despite the function name, a higher value means more similar)

def distance_text(text1, text2):
    text1 = nlp(text1)
    text2 = nlp(text2)
    return text1.similarity(text2)
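By default, spaCy's `Doc.similarity` is a cosine similarity over averaged word vectors. The sketch below shows that measure on plain Python lists so the scale is easier to interpret. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook or of spaCy's API.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    # Return 0.0 for an all-zero vector instead of dividing by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 2-d vectors standing in for document vectors (assumption)
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values near 1 indicate similar titles, so the selection loop looks for the lowest values.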

### Iterating to compare and split the datasets

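Edit: Realised that an earlier version repeated calculations. The loop below keeps a running sum of similarities in the `Distance` column, so each round only compares the remaining titles against the most recently added title.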

In [None]:
# Note: the randomly chosen seed title plus 500 rounds leaves 501 rows in mcf_df_test
for round_idx in trange(500, desc="Overall progress"):
    
    #latest job title added will act as the reference
    title_ref = mcf_df_test["Title"].iloc[-1]
    
    #similarity of every remaining title to the reference title
    similarity = mcf_df_existing["Title"].swifter.progress_bar(enable = True, desc = 'Mapping progress').apply(lambda title: distance_text(title, title_ref))
    
    if "Distance" not in mcf_df_existing.columns:
        #initialisation
        mcf_df_existing["Distance"] = similarity
    else:
        #add to the running sum of similarities to all titles selected so far
        mcf_df_existing["Distance"] = mcf_df_existing["Distance"] + similarity

    #average similarity to the titles selected so far
    mcf_df_existing["Distance_sort"] = mcf_df_existing["Distance"]/len(mcf_df_test)
    
    #sort once; the first row (lowest average similarity) is the next sample
    mcf_df_sorted = mcf_df_existing.sort_values('Distance_sort')
    mcf_df_test = pd.concat([mcf_df_test, mcf_df_sorted.iloc[[0]][['Job_ID', 'Title']]])
    
    #remaining mcf_df_existing (copied so the next round can assign columns safely)
    mcf_df_existing = mcf_df_sorted.iloc[1:, :][['Job_ID', 'Title', 'Distance']].copy()
    
    print(f'Round {round_idx} done\nNumber of entries in test: {mcf_df_test.shape[0]}\nReference title was: {title_ref}\nNumber of entries remaining: {mcf_df_existing.shape[0]}')

Overall progress:   0%|          | 0/500 [00:00<?, ?it/s]

Mapping progress:   0%|          | 0/23306 [00:00<?, ?it/s]
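The running-sum idea in the loop above can be tried out on a handful of titles without loading the spaCy model. The following is a minimal sketch under stated assumptions: `greedy_dissimilar` and `word_overlap` are hypothetical helpers, and `word_overlap` (a Jaccard overlap of word sets) is only a cheap stand-in for `distance_text`.

```python
def word_overlap(text_a, text_b):
    """Toy similarity: shared words divided by total distinct words."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def greedy_dissimilar(items, similarity, n_select, seed_index=0):
    """Greedily pick items with the lowest mean similarity to those already picked."""
    selected = [seed_index]
    # Running sum of similarities to the selected items, per remaining index
    sim_sum = {i: 0.0 for i in range(len(items)) if i != seed_index}
    while len(selected) < n_select and sim_sum:
        # Only the most recently added item needs to be compared each round
        reference = items[selected[-1]]
        for i in sim_sum:
            sim_sum[i] += similarity(items[i], reference)
        # Lowest average similarity wins
        next_index = min(sim_sum, key=lambda i: sim_sum[i] / len(selected))
        selected.append(next_index)
        del sim_sum[next_index]
    return selected

titles = ["software engineer", "senior software engineer", "staff nurse"]
print(greedy_dissimilar(titles, word_overlap, n_select=2))
```

Here `n_select` counts the seed item as well, so asking for 2 returns the seed plus one further pick.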

  


In [None]:
# Raw string so that "\500" is not read as an octal escape sequence
mcf_df_test[["Job_ID"]].to_csv(r"..\Data\Processed\500_Job_ID_Samples.csv", index=False)