## Data Augmentation

In this notebook, we aim to resolve the data quality issue through various methods of augmenting our dataset.

### A) Setting up

In [1]:
import os
os.chdir('..')

In [None]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action

In [13]:
# Download fasttext model, only run once
#from nlpaug.util.file.download import DownloadUtil
#DownloadUtil.download_fasttext(model_name = 'wiki-news-300d-1M', dest_dir = 'Models')

In [14]:
#import nltk
#nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\shaun\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [3]:
model_dir = 'Models/'

In [2]:
import pandas as pd
SSOC_2020 = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')
data = pd.read_csv('Data/Processed/Training/train-aws/train_full.csv')
extra_info = pd.read_csv('Data/Processed/MCF_Training_Set_Full.csv')

In [40]:
# with open('ssoc_autocoder/sentaugment/data/sentences.txt', 'w') as f:
#     for item in SSOC_2020['Description'][295:296]:
#         f.write("%s\n" % ''.join([i if ord(i) < 128 else ' ' for i in item]))
#         f.write("%s\n" % ''.join([i if ord(i) < 128 else ' ' for i in item]))

### B) Testing different types of augmentation

In [37]:
text = SSOC_2020['Description'][SSOC_2020['SSOC 2020'] == 25121].values[0]

In [38]:
print(text)

Software developer researches, designs and develops computer and network software or specialised utility programs. He/she analyses user needs and develops software solutions, applying principles and techniques of computer science, engineering, and mathematical analysis. He/she also updates software, enhances existing software capabilities, and develops and directs software testing and validation procedures. He/she may work with computer hardware engineers to integrate hardware and software systems, and develop specifications and performance requirements. . researching, analysing and evaluating requirements for software, web and multimedia applications. designing, and developing computer software, web and multimedia systems. designing and developing digital animations, imaging, presentations, games, audio and video clips, and Internet applications using multimedia software, tools and utilities, interactive graphics and programming languages. consulting with engineering staff to evaluate

#### 1. Using pretrained word embeddings (`fasttext`)

In [23]:
fasttext_aug = naw.WordEmbsAug(model_type = 'fasttext', 
                               model_path = model_dir + 'wiki-news-300d-1M.vec',
                               action = "substitute",
                               top_k = 5,
                               aug_p = 0.5,
                               aug_min = 10,
                               aug_max = None)

In [40]:
fasttext_augmented_text = fasttext_aug.augment(text, num_thread = 4)
print(fasttext_augmented_text)

Politician determine, formulates which directs governement policies. His / --she creates, ratifies, amends either repeals laws, pubic rules. but regulations withing the statutory whether constitutional framework. . supreme nearly whether participating in to hearings the parliament. determining, formulating in directed govenment policies. make, approving, amending even repeals rules, pulic rule with regulations. investigating issues with interest to a public with promoting on interests in the electorate nevertheless they comprise. although memebers of the government, directing juniors admins and officals with govenment departments with statutory boards in a interpretation and implemenation with governemnt procedures
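To make the substitution step more concrete, here is a minimal sketch of embedding-based word replacement using only the standard library. It is not how `nlpaug` is implemented: the vocabulary `TOY_VECTORS`, the helper names, and the treatment of `aug_p` as a per-token probability are all assumptions made for illustration. A real run would use the fastText vectors loaded above.

```python
import math
import random

# Hypothetical 2-d "embeddings"; the notebook itself uses 300-d fastText vectors.
TOY_VECTORS = {
    "develops": (0.9, 0.1),
    "builds": (0.85, 0.15),
    "creates": (0.8, 0.2),
    "software": (0.1, 0.9),
    "programs": (0.15, 0.85),
    "applications": (0.2, 0.8),
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_words(word, top_k):
    """Return the top_k vocabulary words most similar to `word`, excluding itself."""
    scores = [(cosine(TOY_VECTORS[word], vec), other)
              for other, vec in TOY_VECTORS.items() if other != word]
    return [other for _, other in sorted(scores, reverse=True)[:top_k]]

def substitute(text, aug_p=0.5, top_k=2, seed=0):
    """Replace each known token with probability aug_p by one of its top_k neighbours."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        if token in TOY_VECTORS and rng.random() < aug_p:
            out.append(rng.choice(nearest_words(token, top_k)))
        else:
            out.append(token)
    return " ".join(out)

print(nearest_words("develops", top_k=2))
print(substitute("develops software", aug_p=1.0))
```

A larger `top_k` widens the candidate pool, so substitutions drift further from the original word. That may be one reason the fastText output above reads so noisily with `aug_p = 0.5`.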


#### 2. Using back translation
Back translation means translating the whole text to another language and back to English.

In [36]:
back_translation_aug = naw.BackTranslationAug(from_model_name='facebook/wmt19-en-de', 
                                              to_model_name='facebook/wmt19-de-en',
                                              device = 'cuda',
                                              max_length = 2000)

In [39]:
backtransl_augmented_text = back_translation_aug.augment(text, num_thread = 4)
print(backtransl_augmented_text)

The legislator determines, formulates and directs government policy. He / she adopts, ratifies, amends or repeals laws, public rules and ordinances within a legal or constitutional framework. He / she presides or participates in the proceedings of parliament. He / she determines, formulates and directs government policy. He / she adopts, ratifies, amends or repeals laws, public rules and ordinances. He / she investigates matters concerning the public and promotes the interests of the constituencies he / she represents. As members of government, heads senior administrative officers and officials of government departments and legal bodies in the interpretation and implementation of government policy.
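The round trip can be sketched as the composition of two translation functions. The example below is only an illustration of the idea. `back_translate`, `toy_translator`, and the two word tables are hypothetical stand-ins for the real translation models, and the word-for-word lookup is far cruder than a neural translator.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]

def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate into a pivot language, then back into the source language."""
    return from_pivot(to_pivot(text))

def toy_translator(table: Dict[str, str]) -> Translator:
    """Build a word-for-word translator from a lookup table (illustration only)."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate

# Hypothetical lookup tables: the return trip picks a different English word.
EN_TO_DE = {"makes": "macht", "laws": "gesetze"}
DE_TO_EN = {"macht": "creates", "gesetze": "laws"}

paraphrase = back_translate("he makes laws",
                            toy_translator(EN_TO_DE),
                            toy_translator(DE_TO_EN))
print(paraphrase)
```

The paraphrase arises because the return translation does not have to pick the original wording. This is why back translation tends to preserve meaning better than word-level substitution.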


#### 3. Using synonyms

In [10]:
synonym_aug = naw.SynonymAug(aug_src = 'ppdb', 
                             model_path = model_dir + 'ppdb-2.0-tldr',
                             aug_p = 0.5,
                             aug_min = 10,
                             aug_max = None)

In [15]:
synonym_augmented_text = synonym_aug.augment(text, num_thread = 4)
print(synonym_augmented_text)

Legislator notes, formulae and directs campaigns directives. He / she enjoys, signs, modifications or repealing acts, public roles and companies inside a obligatory or constitutional foundations. . presiding over or aiding requests the factors of senators. charges, emanating and levelling territories entitled. decisions, concluding, pertaining or discontinuing bylaws, pubic enactments and authorities. checking ingredients of worry to the perceptions and consolidating the interests of the interviewees which they originated. whereas rep of the provinces, advancing high level owners and magistrates of stakeholders parties and regulatory directories during the reinterpretation and applicability of skills agreements


#### 4. Using contextual word embeddings

In [57]:
distilbert_aug = naw.ContextualWordEmbsAug(model_path = 'distilbert-base-uncased', 
                                           action = "substitute",
                                           top_k = 10,
                                           aug_p = 0.7,
                                           aug_min = 5,
                                           aug_max = None,
                                           device = 'cuda')

In [58]:
distilbert_augmented_text = distilbert_aug.augment(text, num_thread = 4)
print(distilbert_augmented_text)

he prepares, determines and manages constitutional policy. he / she establishes, ratifies, proposing or enforcing laws, establishing institutions and regulations within a legal or regulatory context.. sitting over or presiding in the sessions of legislatures. interpreting, modifying and interpreting public regulations. proposing, ratifying, enforcing or modifying laws, enforcing laws and regulations. maintaining laws of relevance to the constitution and maintaining the integrity of the parties whom they contest. as president of the legislature, appointing legislative committees and heads of public institutions and governing boards in the implementation and enforcement of laws ;


#### 5. Using sentence augmentation

In [29]:
sentence_aug = nas.ContextualWordEmbsForSentenceAug(model_path = 'distilgpt2',
                                                    min_length = 100,
                                                    max_length = 300,
                                                    top_k = 50,
                                                    top_p = .9,
                                                    device = 'cuda')

In [30]:
sentence_augmented_text = sentence_aug.augment(text, num_thread = 4)
print(sentence_augmented_text)

Using pad_token, but it is not set yet.


Legislator determines, formulates and directs government policies. He/she makes, ratifies, amends or repeals laws, public rules and regulations within a statutory or constitutional framework. . presiding over or participating in the proceedings of parliament. determining, formulating and directing government policies. making, ratifying, amending or repealing laws, public rules and regulations. investigating matters of concern to the public and promoting the interests of the constituencies which they represent. as members of the government, directing senior administrators and officials of government departments and statutory boards in the interpretation and implementation of government policies, and coordinating, or managing the functioning of government functions. , implementing, or directing legislation, public rules and regulations. investigating matters of concern to the public and promoting the interests of the constituencies which they represent. As members of the government, dire

#### 6. Using summarisation

In [41]:
summ_aug = nas.AbstSummAug(model_path = 't5-base', 
                           min_length = 50,
                           max_length = 100,
                           top_k = 20)

In [45]:
summ_augmented_text = summ_aug.augment(text, num_thread = 4)

In [46]:
summ_augmented_text

'Legislator determines, formulates and directs government policies . makes, ratifies, amends or repeals laws, public rules and regulations . investigating matters of concern to the public and promoting the interests of the constituencies which they represent .'

### C) Using GloVe embeddings to find and label more examples

In [3]:
import spacy
from spacy.language import Language
nlp = spacy.load('en_core_web_lg', disable = ['tagger', 'parser', 'ner', 'lemmatizer'])
stopwords = nlp.Defaults.stop_words

Add an additional preprocessing step that keeps only alphabetic tokens and removes stop words

In [4]:
@Language.component("additional_preprocessing")
def additional_preprocessing(doc):
    lemma_list = [tok for tok in doc
                  if tok.is_alpha and tok.text.lower() not in stopwords] 
    return lemma_list
nlp.add_pipe('additional_preprocessing', last = True)

<function __main__.additional_preprocessing(doc)>

Run the `nlp` processing pipeline over the two corpora and convert the job postings into vectors

In [8]:
SSOC_2020_nlp = list(nlp.pipe(SSOC_2020['Description']))
data_nlp = list(nlp.pipe(data['Cleaned_Description']))

In [11]:
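import numpy as np

target_vecs = []
for i, desc in enumerate(data_nlp):
    if i % 100 == 0:
        print(f'Job posting {i}/{len(data_nlp)}...\r', end = '')
    if len(desc) == 0:
        # No usable tokens left after preprocessing: fall back to a 300-d zero vector
        target_vecs.append(np.zeros(300))
    else:
        # Represent the posting as the mean of its token vectors
        target_vecs.append(np.mean([token.vector for token in desc], axis = 0))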

Job posting 42800/42842...
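The pooling step above can be sketched without spaCy or NumPy as follows. `mean_vector` is a hypothetical helper, and the short vectors are made-up values standing in for the 300-dimensional token vectors.

```python
def mean_vector(vectors, dim):
    """Average equal-length vectors element-wise; return zeros when there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Two made-up token vectors for one posting, and a posting with no usable tokens.
print(mean_vector([[1.0, 2.0], [3.0, 4.0]], dim=2))
print(mean_vector([], dim=2))
```

A zero vector has no direction, so a posting that falls back to it cannot rank as similar to any SSOC description in the later cosine-similarity step.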

In [387]:
detailed_definitions_raw = pd.read_excel('Data/Raw/SSOC2020 Detailed Definitions.xlsx', skiprows = 4)

  warn("""Cannot parse header or footer so it will be ignored""")


In [391]:
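# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
detailed_definitions = detailed_definitions_raw[(~detailed_definitions_raw['SSOC 2020'].astype('str').str.contains('X')) & (detailed_definitions_raw['SSOC 2020'].astype('str').apply(len) >= 5)].copy()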

In [419]:
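to_replace = {
    '•': '',
    '\n': '.',
    '<Blank>': '',
    r'\([A-Za-z0-9 ]+\)': ''  # parenthesised notes; raw string keeps the regex escapes intact
}

detailed_definitions['Jobs Cleaned'] = detailed_definitions['Examples of Job Classified Under this Code']

for k, v in to_replace.items():
    # regex = True is explicit because newer pandas versions default to literal matching
    detailed_definitions['Jobs Cleaned'] = detailed_definitions['Jobs Cleaned'].str.replace(k, v, regex = True)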

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_definitions['Jobs Cleaned'] = detailed_definitions['Examples of Job Classified Under this Code']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  detailed_definitions['Jobs Cleaned'] = detailed_definitions['Jobs Cleaned'].str.replace(k, v)


In [420]:
detailed_definitions['Jobs Cleaned']

4               President .  Attorney general.  Minister 
6         Director-general.  High commissioner .  Perm...
7         Chairman .  Chief executive .  Managing dire...
9           Administrator of political party organisation
11        Administrator of business association.  Admi...
                              ...                        
1599                                                     
1601                               Newspaper delivery man
1602      Parking meter reader.  Coin machine collecto...
1603                                  Labourer.  Handyman
1604                                Food delivery on foot
Name: Jobs Cleaned, Length: 997, dtype: object
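To see what the replacement rules do to a single cell, here is a small sketch that applies the same patterns with the standard `re` module. The `sample` string is a made-up example in the style of the spreadsheet, not an actual row, and `clean_job_examples` is a hypothetical helper.

```python
import re

# Same patterns as `to_replace`, applied in the same order.
REPLACEMENTS = [
    ("•", ""),                       # drop bullet characters
    ("\n", "."),                     # turn line breaks into separators
    ("<Blank>", ""),                 # drop placeholder text
    (r"\([A-Za-z0-9 ]+\)", ""),      # drop parenthesised notes
]

def clean_job_examples(raw):
    """Apply each replacement rule in turn to one cell of job title examples."""
    cleaned = raw
    for pattern, replacement in REPLACEMENTS:
        cleaned = re.sub(pattern, replacement, cleaned)
    return cleaned

sample = "• President (Head of State)\n• Attorney general"
print(repr(clean_job_examples(sample)))
```

The leftover spaces around the `.` separators explain the uneven spacing in the output above, and they are why the titles are passed through `strip()` after splitting on `.` later on.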

Write a simple function to identify the top `n` jobs that are closest to the selected SSOC

In [66]:
from sklearn.metrics.pairwise import cosine_similarity
def identify_top_n(selected,
                   data,
                   extra_info,
                   target_vecs,
                   top_n = 10,
                   threshold = 0.9):
    
    source_vec = np.array([np.mean([token.vector for token in selected], axis = 0)])
    matrix = cosine_similarity(source_vec, target_vecs)
    indices = np.apply_along_axis(lambda x: x.argsort()[-top_n:][::-1], axis = 1, arr = matrix)
    above_threshold = matrix[0][indices][0] >= threshold
    indices = [idx for idx, above in zip(indices[0], above_threshold) if above]
    if len(indices) == 0:
        print('None meet the threshold required.')
    else:
        cosine_similarity_index = 0
        for i, row in data.loc[indices, :].iterrows():
            print(f'Index: {i}')
            print(f'Cosine similarity: {matrix[0][indices][cosine_similarity_index]}')
            print(f'Predicted SSOC: {row["SSOC 2020"]}')
            print(f'Job title: {extra_info["title"][i]}')
            print(f'Description: {row["Cleaned_Description"]}')
            print('================================================================')
            cosine_similarity_index += 1
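The ranking logic inside `identify_top_n` can be reduced to the sketch below, which uses only the standard library and returns the matches instead of printing them. The function names and the tiny 2-d vectors are assumptions for illustration. The notebook itself relies on `sklearn` and 300-d vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def top_n_above_threshold(source, targets, top_n=10, threshold=0.9):
    """Return (index, similarity) pairs for the top_n targets scoring at least threshold."""
    scored = [(cosine(source, target), i) for i, target in enumerate(targets)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_n] if score >= threshold]

# Made-up vectors: index 0 is identical to the source, index 3 is a zero vector.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(top_n_above_threshold([1.0, 0.0], targets, top_n=2, threshold=0.5))
```

Note the order of operations: the `top_n` cut is applied first and the threshold second, so raising `top_n` can only add matches, while raising `threshold` can only remove them.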

In [413]:
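def find_matching_job_title(data,
                            include,
                            exclude):
    
    output = copy.deepcopy(data)
    output['title'] = output['title'].str.lower()
    
    # Start from boolean Series so the masks combine element-wise, even when `include` is empty
    include_boolean = pd.Series(False, index = output.index)
    for words in include:
        # A title matches an entry only if it contains every word in that entry
        entry_boolean = pd.Series(True, index = output.index)
        for word in words.split(' '):
            entry_boolean = entry_boolean & output['title'].str.contains(word.lower(), regex = False, na = False)
        include_boolean = include_boolean | entry_boolean
    
    # Drop any title that contains any of the excluded words
    for words in exclude:
        for word in words.split(' '):
            include_boolean = include_boolean & ~output['title'].str.contains(word.lower(), regex = False, na = False)
            
    job_titles_idx = output[include_boolean.values].index.tolist()
    return job_titles_idx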
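The matching rule is easier to reason about on a single title. The sketch below restates it in plain Python: an `include` entry matches when every one of its words appears somewhere in the title, in any order, and any `exclude` word vetoes the match. `title_matches` is a hypothetical helper, not part of the notebook.

```python
def title_matches(title, include, exclude):
    """Return True if the title satisfies any include entry and no exclude word."""
    lowered = title.lower()
    included = any(all(word in lowered for word in entry.lower().split())
                   for entry in include)
    excluded = any(word in lowered
                   for entry in exclude
                   for word in entry.lower().split())
    return included and not excluded

# Word order is ignored, so this title matches 'public relations manager'.
print(title_matches("Account Manager (Public Relations)",
                    include=["public relations manager"], exclude=[]))
print(title_matches("Account Manager (Public Relations)",
                    include=["public relations manager"], exclude=["account"]))
```

Because the words are matched as substrings, short words such as `it` also hit titles like "Security Officer", so exclusion lists need some care.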
            

In [383]:
pd.set_option('display.max_rows', 500)
import copy
import json
# Run this to initialise the dictionary object
# with open('manual_tagging.json', 'r') as outfile:
#     manual_tagging1 = json.load(outfile)

In [382]:
# Run this to export the manual tagging to the JSON file
# with open('manual_tagging.json', 'w') as outfile:
#     json.dump(manual_tagging, outfile)

Set the SSOC you are scanning for here

In [428]:
ssoc = 12221
include_job_titles = [17640, 30290, 34491, 36065, 36409, 37141, 42499]
for detailed_def_job in detailed_definitions['Jobs Cleaned'][detailed_definitions['SSOC 2020'] == str(ssoc)].values[0].split('.'):
    print(detailed_def_job.strip())
    include_job_titles.append(detailed_def_job.strip())

Corporate Relations Manager
Director of marketing communications
Public relations  director


In [439]:
job_titles_idx = find_matching_job_title(extra_info,
                                         include = ['public relations manager'],
                                         exclude = [])

In [446]:
print(job_titles_idx)
for i, title in extra_info.loc[job_titles_idx, 'title'].items():
    print(f"{i}: {title}")

[8884, 17640, 30290, 34491, 36065, 36409, 37141, 42499]
8884: Account Manager (Public Relations)
17640: Digital Public Relations Manager #SGUP
30290: Senior Manager, Public Relations
34491: Public Relations Manager #SGUNITED
36065: Public Relations Manager
36409: Public Relations Manager
37141: Public Relations Manager
42499: Public Relations and Marketing Manager


In [445]:
print(extra_info.loc[34491, 'description'])

<p><strong>KEY ROLES &amp; RESPONSIBILITIES</strong></p>
<p><strong>- </strong>Leading the planning and implementation of PR initiatives related to the company's brand(s).</p>
<p>- Leveraging a variety of media channels to maximise brand and campaign exposure.</p>
<p>- Building strong and positive relationships with media, influencers, and other related parties.</p>
<p>- Tracking effectiveness of various campaigns and course correcting as required.</p>
<p>- Maintain and develop relationship with the customers.</p>
<p>-Execute any other job responsibilities which the management assigns from time to time.</p>
<p><strong>Knowledge and Skill Requirements</strong></p>
<p>&gt; Minimum Diploma in Marketing/Business/Creative Multimedia/Advertising/Media/Mass Communication or equivalent qualification</p>
<p>&gt;Proficient in Microsoft Office (Word, Excel, PowerPoint) and prefer with Adobe Software skills.</p>
<p>&gt;Proficient in managing social media networks for business purposes.</p>
<p>&gt;

In [447]:
ssoc_index = SSOC_2020[SSOC_2020['SSOC 2020'] == ssoc].index[0]
identify_top_n(SSOC_2020_nlp[ssoc_index], data, extra_info, target_vecs, top_n = 15, threshold = 0.85)

Index: 8185
Cosine similarity: 0.955994710767933
Predicted SSOC: 24320
Job title: Senior Executive – Corporate Communications and Community Engagement
Description: Create engaging and creative messages, collaterals and publicity materials to support programmes and activities, including fundraising campaigns. Planning and writing publications for Newsletters, Annual Report, Brochures etc. Assist in the updating and maintaining of website and social media platform. Handling of media enquiries, interviews and media liaison where required. Assist in the management of community partnership projects with schools, corporate companies, sponsors, etc. Management of organisation-wide communications plans (internal and external) Participate in events for Marketing/ Fundraising. Conceptualise and implement integrated branding campaign across a various communication platform including internal and outward-facing channels. Develop and nurture good relationships with the external stakeholders, media 

Use this to find job postings whose titles contain all of the included words and none of the excluded words

In [400]:
words = ['admin', 'manager'] # what words to include
exclude = ['account', 'database', 'it', 'project'] # what words to exclude
output = copy.deepcopy(extra_info)
for word in words:
    output = output[output['title'].str.lower().str.contains(word)]
for word in exclude:
    output = output[~output['title'].str.lower().str.contains(word)]
job_titles_idx = output.index.tolist()
print(job_titles_idx)

[36, 3712, 4096, 4234, 4282, 4814, 5087, 5125, 5657, 6845, 7185, 7221, 7245, 9761, 10365, 11048, 11254, 11894, 12867, 14357, 14573, 14652, 14748, 15072, 15314, 16527, 16822, 16999, 19009, 19756, 20075, 20098, 20105, 20209, 20319, 20426, 20815, 20899, 21097, 21911, 22086, 22249, 22414, 22588, 22625, 22739, 23144, 23148, 23370, 23488, 24173, 24440, 24704, 24999, 26027, 26832, 27082, 27821, 29594, 30976, 31347, 31541, 32018, 32176, 32478, 32516, 32967, 33005, 33542, 33724, 35692, 36537, 36674, 37348, 37682, 38313, 38352, 41510]


In [None]:
#extra_info.loc[25834, 'description']

Change the list `inputting` here to input the indices of the job postings that you want to manually tag as that SSOC

In [378]:
manual_tagging[ssoc] = []
inputting = []
inputting_dedup = list(set(inputting))
# Collect duplicates first: removing items from a list while iterating over it skips elements
already_tagged = set()
for key in manual_tagging.keys():
    for new_idx in inputting_dedup:
        if new_idx in manual_tagging[key]:
            print(f'Duplicate detected for index {new_idx} which has already been marked for SSOC {key}')
            already_tagged.add(new_idx)
manual_tagging[ssoc].extend([idx for idx in inputting_dedup if idx not in already_tagged])
print(f'SSOC: {ssoc}')
print(manual_tagging[ssoc])

Duplicate detected for index 10797 which has already been marked for SSOC 12111
Duplicate detected for index 40689 which has already been marked for SSOC 12111
Duplicate detected for index 33662 which has already been marked for SSOC 12111
Duplicate detected for index 16999 which has already been marked for SSOC 12121
Duplicate detected for index 26832 which has already been marked for SSOC 12121
Duplicate detected for index 14748 which has already been marked for SSOC 12121
SSOC: 12112
[4096, 5125, 7185, 14357, 5657, 9761, 36, 41510, 7221, 22588, 19009, 12867, 22086, 7245, 22625, 23144, 21097, 20075, 23148, 24173, 31347, 11894, 10365, 3712, 24704, 20098, 20105, 4234, 16527, 27821, 36537, 4282, 6845, 32967, 4814, 22739, 32478, 15072, 22249, 14573, 33005, 20209, 30976, 32516, 33542, 32018, 11048, 19756, 37682, 31541, 14652, 36674, 23370, 20815, 20319, 35692, 24440, 22414, 21911, 29594, 20899, 24999, 38313, 26027, 32176, 16822, 33724, 23488, 20426, 27082, 38352, 15314, 5087, 37348, 11254

In [385]:
# ssoc_index = SSOC_2020[SSOC_2020['SSOC 2020'] == ssoc].index[0]
# identify_top_n(SSOC_2020_nlp[ssoc_index], data, extra_info, target_vecs, top_n = 15, threshold = 0.85)

### D) Implementing text augmentation
