## Data Augmentation

In this notebook, we aim to resolve the data quality issue through various methods of augmenting our dataset.

### A) Setting up

In [1]:
import os
os.chdir('..')

In [2]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action



In [3]:
# Download fasttext model, only run once
#from nlpaug.util.file.download import DownloadUtil
#DownloadUtil.download_fasttext(model_name = 'wiki-news-300d-1M', dest_dir = 'Models')

In [4]:
#import nltk
#nltk.download('averaged_perceptron_tagger')

In [5]:
model_dir = 'Models/'

In [6]:
import pandas as pd
SSOC_2020 = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')
data = pd.read_csv('Data/Processed/Training/train-aws/train_full.csv')
extra_info = pd.read_csv('Data/Processed/MCF_Training_Set_Full.csv')



In [7]:
# with open('ssoc_autocoder/sentaugment/data/sentences.txt', 'w') as f:
#     for item in SSOC_2020['Description'][295:296]:
#         f.write("%s\n" % ''.join([i if ord(i) < 128 else ' ' for i in item]))
#         f.write("%s\n" % ''.join([i if ord(i) < 128 else ' ' for i in item]))

### B) Testing different types of augmentation

In [8]:
text = SSOC_2020['Description'][SSOC_2020['SSOC 2020'] == 25121].values[0]

In [9]:
print(text)

Software developer researches, designs and develops computer and network software or specialised utility programs. He/she analyses user needs and develops software solutions, applying principles and techniques of computer science, engineering, and mathematical analysis. He/she also updates software, enhances existing software capabilities, and develops and directs software testing and validation procedures. He/she may work with computer hardware engineers to integrate hardware and software systems, and develop specifications and performance requirements. . researching, analysing and evaluating requirements for software, web and multimedia applications. designing, and developing computer software, web and multimedia systems. designing and developing digital animations, imaging, presentations, games, audio and video clips, and Internet applications using multimedia software, tools and utilities, interactive graphics and programming languages. consulting with engineering staff to evaluate

#### 1. Using pretrained word embeddings (`fasttext`)

In [16]:
fasttext_aug = naw.WordEmbsAug(model_type = 'fasttext', 
                               model_path = model_dir + 'wiki-news-300d-1M.vec',
                               action = "substitute",
                               top_k = 5,
                               aug_p = 0.5,
                               aug_min = 10,
                               aug_max = None)

In [18]:
fasttext_augmented_text = fasttext_aug.augment(text, num_thread = 4)
print(fasttext_augmented_text)

Software builder researches, prototypes and builds computer which networks software or specialized efficiency programs. .He / her Analysis interface wants in builds software issues, applies tenets, technologies of computing sceince, engineering, and algebraic anaylsis. He / she still updates softare, strengthens existing hardware capabilities, , develops which directs software tests and validation procedures. They / she will research along computing software engineering to combine firmware in software components, but cultivate specification but performance standards. . researching, summarising which examining requirments for software, websites and audio applications. constructing, but establishing computer software, web in multi-media system. designing in creating digital visuals, imaging, lectures, game, recordings and video- excerpts, in Internet applications using interactive sofware, toolkit and utilities, multimedia graphic and programs langauages. consulting with engineering facu
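The output above replaces a large share of the original words, which is consistent with `aug_p = 0.5`. As a rough way to reason about these settings, the sketch below estimates how many tokens would be changed. It assumes the count is `aug_p` times the number of tokens, raised to `aug_min` and capped at `aug_max` when one is given. The helper `estimated_aug_count` is hypothetical and not part of `nlpaug`; the library's exact rounding may differ.

```python
import math

def estimated_aug_count(n_tokens, aug_p, aug_min=1, aug_max=None):
    """Rough estimate of how many tokens get replaced (assumed rule, not nlpaug's code)."""
    if n_tokens == 0:
        return 0
    count = math.ceil(aug_p * n_tokens)
    if aug_min is not None:
        count = max(count, aug_min)   # never fewer than aug_min
    if aug_max is not None:
        count = min(count, aug_max)   # never more than aug_max
    return min(count, n_tokens)       # cannot change more tokens than exist

# Settings from the fasttext cell: aug_p = 0.5, aug_min = 10, aug_max = None
for n in (8, 40, 200):
    print(n, estimated_aug_count(n, aug_p=0.5, aug_min=10, aug_max=None))
```

Under this reading, about half of a long description is rewritten, so a lower `aug_p` may be worth trying if the augmented text drifts too far from the original meaning.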

#### 2. Using back translation
Back translation means translating the whole text to another language and back to English.

In [None]:
back_translation_aug = naw.BackTranslationAug(from_model_name='facebook/wmt19-en-de', 
                                              to_model_name='facebook/wmt19-de-en',
                                              device = 'cuda',
                                              max_length = 2000)

In [None]:
backtransl_augmented_text = back_translation_aug.augment(text, num_thread = 4)
print(backtransl_augmented_text)
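To illustrate why a round trip through another language yields paraphrases, here is a toy sketch. The two lookup tables are made-up stand-ins for the `wmt19` translation models and cover only a handful of words. The point is that several English words can map to the same pivot word, which then maps back to just one of them.

```python
# Toy lookup tables standing in for the en->de and de->en translation models.
EN_TO_DE = {"start": "beginnen", "begin": "beginnen", "job": "arbeit", "work": "arbeit"}
DE_TO_EN = {"beginnen": "begin", "arbeit": "work"}

def translate(text, table):
    # Words missing from the table are passed through unchanged.
    return " ".join(table.get(word, word) for word in text.split())

def back_translate(text):
    pivot = translate(text, EN_TO_DE)   # English -> pivot language
    return translate(pivot, DE_TO_EN)   # pivot language -> English

print(back_translate("start the job"))  # prints "begin the work"
```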

#### 3. Using synonyms

In [None]:
synonym_aug = naw.SynonymAug(aug_src = 'ppdb', 
                             model_path = model_dir + 'ppdb-2.0-tldr',
                             aug_p = 0.5,
                             aug_min = 10,
                             aug_max = None)

In [None]:
synonym_augmented_text = synonym_aug.augment(text, num_thread = 4)
print(synonym_augmented_text)

#### 4. Using contextual word embeddings

In [20]:
distilbert_aug = naw.ContextualWordEmbsAug(model_path = 'distilbert-base-uncased', 
                                           action = "substitute",
                                           top_k = 10,
                                           aug_p = 0.7,
                                           aug_min = 5,
                                           aug_max = None,
                                           device = 'cpu')

In [21]:
distilbert_augmented_text = distilbert_aug.augment(text, num_thread = 4)
print(distilbert_augmented_text)

website designer prepares, implements and develops network and database architecture or virtual web hardware. he / she directs user design and engineers internal hardware, designs hardware and tools of information hardware, automation, and industrial physics. he / she also builds computers, manufactures computing system components, and maintains and maintains performance analysis and assessment programs. he / she may work with application network developers to develop website and server products, and implements standards and compliance standards.. develops, analysing and verification procedures for network, web and web services. preparing, and supervising interactive networking, web and desktop applications. design and producing electronic audio, audio, graphic, multimedia, web and multimedia presentations, and graphical products. multimedia systems, graphics and components, presentation programs and interactive products. assists with application professionals to update components of h

#### 5. Using sentence augmentation

In [None]:
sentence_aug = nas.ContextualWordEmbsForSentenceAug(model_path = 'distilgpt2',
                                                    min_length = 100,
                                                    max_length = 300,
                                                    top_k = 50,
                                                    top_p = .9,
                                                    device = 'cuda')

In [None]:
sentence_augmented_text = sentence_aug.augment(text, num_thread = 4)
print(sentence_augmented_text)

#### 6. Using summarisation

In [None]:
summ_aug = nas.AbstSummAug(model_path = 't5-base', 
                           min_length = 50,
                           max_length = 100,
                           top_k = 20)

In [None]:
summ_augmented_text = summ_aug.augment(text, num_thread = 4)

In [None]:
summ_augmented_text

#### 7. Adding spelling mistakes

In [31]:
spl_aug = naw.SpellingAug(dict_path=None, 
                          name='Spelling_aug',
                          aug_min=1, 
                          aug_max=10, 
                          aug_p=0.3)

In [32]:
spl_augmented_text = spl_aug.augment(text)

In [33]:
spl_augmented_text

'Software developer researches, designs and develops computer amd network software or specialised utility programs. He / she analisys user needs and develops software soliutions, applying princilpes and techniques of computer science, engineering, and mathematical analysis. He / whe also updates software, enhances existing softwares capabilities, and develops and directs software testing an validation procedures. He / she may work with computer hardware engineers to integrate hardware and software systems, and develop specifications and performance requirements. . researching, analysing and evaluating requirements for software, web and multimedia applications. designing, and developing computer sotfware, web and multimedia systems. designing and developing digital animations, imaging, presentations, games, audio and video clips, and Internet applications using multimedia software, tools and utilities, interactive graphics and programming languages. consulting with engineering staff to 
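For intuition about this kind of noise, the sketch below uses a simpler and different technique from `SpellingAug`: it randomly swaps two adjacent inner characters of a word. The function name `swap_typos` and its parameters are hypothetical and only serve as an illustration.

```python
import random

def swap_typos(text, p=0.3, seed=0):
    """Swap two adjacent inner characters in roughly a fraction p of the longer words."""
    rng = random.Random(seed)          # seeded so that runs are repeatable
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < p:
            i = rng.randrange(1, len(word) - 2)   # keep first and last characters fixed
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)

print(swap_typos("Software developer researches and designs software", p=0.5))
```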

### C) Using GloVe embeddings to find and label more examples

In [3]:
import spacy
from spacy.language import Language
nlp = spacy.load('en_core_web_lg', disable = ['tagger', 'parser', 'ner', 'lemmatizer'])
stopwords = nlp.Defaults.stop_words

Add a custom pipeline component that keeps only alphabetic tokens and removes stop words

In [4]:
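@Language.component("additional_preprocessing")
def additional_preprocessing(doc):
    # Keep only alphabetic tokens that are not stop words.
    # Note: this returns a list of tokens rather than a Doc,
    # so the component has to stay last in the pipeline.
    kept_tokens = [tok for tok in doc
                   if tok.is_alpha and tok.text.lower() not in stopwords]
    return kept_tokens
nlp.add_pipe('additional_preprocessing', last = True)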

<function __main__.additional_preprocessing(doc)>

In [6]:
import pandas as pd
import numpy as np
SSOC_2020 = pd.read_csv('Data/Processed/Training/train-aws/SSOC_2020.csv')
data = pd.read_csv('Data/Processed/Training/train-aws/train_full.csv')
extra_info = pd.read_csv('Data/Processed/MCF_Training_Set_Full.csv')

Run the `nlp` processing pipeline over the two corpora, then convert each job posting into a vector by averaging its token vectors

In [8]:
SSOC_2020_nlp = list(nlp.pipe(SSOC_2020['Description']))
data_nlp = list(nlp.pipe(data['Cleaned_Description']))

In [11]:
target_vecs = []
for i, desc in enumerate(data_nlp):
    if i % 100 == 0:
        print(f'Job posting {i}/{len(data_nlp)}...\r', end = '')
    if len(desc) == 0:
        target_vecs.append(np.zeros(300))  # no tokens left: fall back to a zero vector
    else:
        target_vecs.append(np.mean([token.vector for token in desc], axis = 0))

Job posting 42800/42842...
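The averaging step can be sketched in plain Python, without spaCy or NumPy. The helper `mean_vector` is a hypothetical name. It mirrors the idea of the loop above: average the token vectors component by component, and fall back to a zero vector when no tokens are left.

```python
def mean_vector(vectors, dim):
    """Average a list of equal-length vectors; return zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(mean_vector([[1.0, 2.0], [3.0, 4.0]], dim=2))  # prints [2.0, 3.0]
print(mean_vector([], dim=3))                        # prints [0.0, 0.0, 0.0]
```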

Write a simple function to identify the top `n` jobs that are closest to the selected SSOC

In [66]:
from sklearn.metrics.pairwise import cosine_similarity
def identify_top_n(selected,
                   data,
                   extra_info,
                   target_vecs,
                   top_n = 10,
                   threshold = 0.9):
    
    source_vec = np.array([np.mean([token.vector for token in selected], axis = 0)])
    matrix = cosine_similarity(source_vec, target_vecs)
    indices = np.apply_along_axis(lambda x: x.argsort()[-top_n:][::-1], axis = 1, arr = matrix)
    above_threshold = matrix[0][indices][0] >= threshold
    indices = [idx for idx, above in zip(indices[0], above_threshold) if above]
    if len(indices) == 0:
        print('None meet the threshold required.')
    else:
        cosine_similarity_index = 0
        for i, row in data.loc[indices, :].iterrows():
            print(f'Index: {i}')
            print(f'Cosine similarity: {matrix[0][indices][cosine_similarity_index]}')
            print(f'Predicted SSOC: {row["SSOC 2020"]}')
            print(f'Job title: {extra_info["title"][i]}')
            print(f'Description: {row["Cleaned_Description"]}')
            print('================================================================')
            cosine_similarity_index += 1
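As a small, dependency-free illustration of the ranking logic, the sketch below scores every target vector against a source vector, keeps the `top_n` best, and drops any that fall below the threshold. The names `cosine` and `top_n_above_threshold` are hypothetical, and the two-dimensional vectors are toy data rather than real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_above_threshold(source, targets, top_n=10, threshold=0.9):
    """Return (index, similarity) pairs for the best matches, highest first."""
    scored = sorted(((cosine(source, t), i) for i, t in enumerate(targets)),
                    reverse=True)
    return [(i, score) for score, i in scored[:top_n] if score >= threshold]

targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(top_n_above_threshold([1.0, 0.0], targets, top_n=2, threshold=0.5))
```

Treating a zero vector as having similarity 0.0 keeps empty postings out of the results, which is a design choice of this sketch.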

In [382]:
#manual_tagging = {}
pd.set_option('display.max_rows', 500)
import copy
import json
with open('manual_tagging.json', 'w') as outfile:
    json.dump(manual_tagging, outfile)

In [381]:
len(manual_tagging.keys())

19

In [380]:
#print(manual_tagging[12191])

In [246]:
#manual_tagging[12191].append(42271)

In [378]:
manual_tagging[ssoc] = []
inputting = []
inputting_dedup = list(set(inputting))
for key in manual_tagging.keys():
    # Iterate over a copy: removing from the list being looped over would skip elements
    for new_idx in list(inputting_dedup):
        if new_idx in manual_tagging[key]:
            print(f'Duplicate detected for index {new_idx} which has already been marked for SSOC {key}')
            inputting_dedup.remove(new_idx)
manual_tagging[ssoc].extend(inputting_dedup)
print(f'SSOC: {ssoc}')
print(manual_tagging[ssoc])

Duplicate detected for index 10797 which has already been marked for SSOC 12111
Duplicate detected for index 40689 which has already been marked for SSOC 12111
Duplicate detected for index 33662 which has already been marked for SSOC 12111
Duplicate detected for index 16999 which has already been marked for SSOC 12121
Duplicate detected for index 26832 which has already been marked for SSOC 12121
Duplicate detected for index 14748 which has already been marked for SSOC 12121
SSOC: 12112
[4096, 5125, 7185, 14357, 5657, 9761, 36, 41510, 7221, 22588, 19009, 12867, 22086, 7245, 22625, 23144, 21097, 20075, 23148, 24173, 31347, 11894, 10365, 3712, 24704, 20098, 20105, 4234, 16527, 27821, 36537, 4282, 6845, 32967, 4814, 22739, 32478, 15072, 22249, 14573, 33005, 20209, 30976, 32516, 33542, 32018, 11048, 19756, 37682, 31541, 14652, 36674, 23370, 20815, 20319, 35692, 24440, 22414, 21911, 29594, 20899, 24999, 38313, 26027, 32176, 16822, 33724, 23488, 20426, 27082, 38352, 15314, 5087, 37348, 11254
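When checking new indices against ones that are already tagged, removing items from a list while looping over that same list can silently skip the element right after each removal. A minimal sketch of an alternative, which builds new lists instead of mutating the input, is shown below. The function name `split_new_and_duplicates` is hypothetical, and the sample dictionary reuses a few indices from the output above purely for illustration.

```python
def split_new_and_duplicates(candidates, tagged):
    """Separate candidate indices into new ones and ones already tagged.

    tagged maps an SSOC code to the list of indices already assigned to it.
    """
    owner = {}
    for code, indices in tagged.items():
        for idx in indices:
            owner.setdefault(idx, code)       # remember the first SSOC seen per index
    new, duplicates = [], []
    for idx in dict.fromkeys(candidates):     # drop repeats while keeping order
        if idx in owner:
            duplicates.append((idx, owner[idx]))
        else:
            new.append(idx)
    return new, duplicates

tagged = {12111: [10797, 40689], 12121: [16999]}
new, duplicates = split_new_and_duplicates([36, 10797, 40689, 16999, 3712, 36], tagged)
print(new)         # prints [36, 3712]
print(duplicates)  # prints [(10797, 12111), (40689, 12111), (16999, 12121)]
```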

In [379]:
ssoc = 12221
# stopped here

In [374]:
words = ['admin', 'manager']
exclude = ['account', 'database', 'it', 'project']
output = copy.deepcopy(extra_info)
for word in words:
    output = output[output['title'].str.lower().str.contains(word)]
for word in exclude:
    output = output[~output['title'].str.lower().str.contains(word)]
job_titles_idx = output.index.tolist()
print(job_titles_idx)

[36, 3712, 4096, 4234, 4282, 4814, 5087, 5125, 5657, 6845, 7185, 7221, 7245, 9761, 10365, 11048, 11254, 11894, 12867, 14357, 14573, 14652, 14748, 15072, 15314, 16527, 16822, 16999, 19009, 19756, 20075, 20098, 20105, 20209, 20319, 20426, 20815, 20899, 21097, 21911, 22086, 22249, 22414, 22588, 22625, 22739, 23144, 23148, 23370, 23488, 24173, 24440, 24704, 24999, 26027, 26832, 27082, 27821, 29594, 30976, 31347, 31541, 32018, 32176, 32478, 32516, 32967, 33005, 33542, 33724, 35692, 36537, 36674, 37348, 37682, 38313, 38352, 41510]
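One caveat with substring matching is that a short exclusion such as `'it'` also matches inside longer words, for example "Facilities". The sketch below reproduces the include/exclude logic on a few made-up titles and adds an optional whole-word check for the exclusions. The function `title_matches` and its `exclude_whole_word` flag are hypothetical additions, not part of the notebook's code.

```python
import re

def title_matches(title, include, exclude, exclude_whole_word=False):
    """True if the title contains every include term and no exclude term."""
    lowered = title.lower()
    if not all(word in lowered for word in include):
        return False
    for word in exclude:
        if exclude_whole_word:
            hit = re.search(rf"\b{re.escape(word)}\b", lowered) is not None
        else:
            hit = word in lowered             # plain substring match
        if hit:
            return False
    return True

titles = ["Admin Manager", "IT Admin Manager",
          "Facilities Admin Manager", "Project Admin Manager"]
include = ["admin", "manager"]
exclude = ["account", "database", "it", "project"]

print([i for i, t in enumerate(titles) if title_matches(t, include, exclude)])
print([i for i, t in enumerate(titles)
       if title_matches(t, include, exclude, exclude_whole_word=True)])
```

With substring matching only the first title survives, while the whole-word variant also keeps "Facilities Admin Manager".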


In [375]:
#extra_info.loc[25834, 'description']

In [376]:
extra_info.loc[job_titles_idx, 'title']

36                                  Administrative Manager
3712     Senior Executive / Assistant Manager (Ophthalm...
4096               Human Resource & Administration Manager
4234                                         Admin Manager
4282                      ASSISTANT ADMINISTRATIVE MANAGER
4814                                Administration Manager
5087                    Finance and Administrative Manager
5125                                Administrative Manager
5657                        Manager (Finance & Fund Admin)
6845     Finance & Admin Manager (Fullset SAGE, HR Admi...
7185     R00004959 Assistant Manager, Student Administr...
7221                              Admin operations manager
7245                                         ADMIN MANAGER
9761                    Finance and Administrative Manager
10365              Human Resource & Administration Manager
11048                      Assistant Manager (Admin Sales)
11254                 General Manager, HR & Administrati

In [377]:
ssoc_index = SSOC_2020[SSOC_2020['SSOC 2020'] == ssoc].index[0]
identify_top_n(SSOC_2020_nlp[ssoc_index], data, extra_info, target_vecs, top_n = 15, threshold = 0.85)

Index: 41600
Cosine similarity: 0.9692999556017325
Predicted SSOC: 12111
Job title: Finance Manager (Up $8000 / Group Conso / Manufacturing / North)
Description: Lead staff in accordance with organization's policies and applicable legislation. Responsibilities include planning, assigning, and directing work appraising performance and guiding professional development rewarding and disciplining employees addressing employee relations issues and resolving problems. Approve actions on human resources matters. May contribute to discussions on implementation of Finance strategy on a group/regional basis and implement objectives, as appropriate. Ensure staff have a consistent understanding and positive impression of business strategy in local country. Ensure compliance with established financial processes, and local compliance requirements, statutory government reporting, procedures and controls financial reporting and review of revenue, expenditures, asset management, tax, management reporti

In [56]:
picked = 25513
print(extra_info.loc[picked, 'title'])
print(extra_info.loc[picked, 'description'])

Group Legal & Compliance Counsel
<p>This incumbent will assist to monitor changes to laws and regulations in Singapore and APAC offices, coordinate between Compliance Team and other depts on implementation, give guidance on operationalising changes in laws and interpretation of laws and regulations and on compliance matters that require legal input, updating and maintenance of cross border opinions on permissibility of marketing in various countries, advising on laws and regulations relating to new products and businesses, assist Dept Head in managing litigations and projects related to legal or regulatory matters.</p>
<p><br></p>
<p><strong>Responsibilities:</strong></p>
<ul>
  <li>Monitoring for changes to laws and regulations.</li>
  <li>Summarising changes to laws, understanding implications and impact and advising      relevant depts on the changes and on implementation.</li>
  <li>Ensuring Legal Team is aware of and understands the changes to laws.</li>
  <li>Updating and mainten

### D) Trying out lambada

Ref: https://github.com/makcedward/nlpaug/blob/master/example/lambada-train_model.ipynb

In [71]:
test_data = data.sample(100)

test_data = test_data[['SSOC 2020', 'Cleaned_Description']]

In [72]:
test_data.rename({'SSOC 2020': 'label', 'Cleaned_Description': 'text'}, axis=1, inplace=True)

In [73]:
test_data = test_data[['text', 'label']]

In [76]:
test_data.to_csv('Data/test/classification.csv', index = False)

Training the classifier

Download the files from nlpaug.
Copy the scripts from nlpaug into the `Models` folder.
Create the path `model\lambada\cls` inside the `Models` folder.
Uploaded c

In [78]:
!python Models/scripts/lambada/train_cls.py  \
    --train_data_path Data/test/classification.csv \
    --val_data_path Data/test/classification.csv \
    --output_dir Models/model/lambada/cls \
    --device cpu \
    --num_epoch 2

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

Process the data and write the output to `mlm_data.txt`

In [79]:
!python Models/scripts/lambada/data_processing.py \
    --data_path Data/test/classification.csv \
    --output_dir Data/test

In [98]:
!python Models/scripts/lambada/run_clm.py \
    --tokenizer_name Models/model/lambada/cls \
    --model_name_or_path gpt2 \
    --model_type gpt2 \
    --train_file Data/test/mlm_data.txt \
    --output_dir Models/scripts/lambada/gen \
    --do_train \
    --overwrite_output_dir \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --save_steps=10000 \
    --num_train_epochs 2

10/15/2021 17:53:44 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

[INFO|file_utils.py:1631] 2021-10-15 17:53:46,440 >> https://huggingface.co/gpt2/resolve/main/config.json not found in cache or force_download set to True, downloading to C:\Users\benjamin\.cache\huggingface\transformers\tmpgxgi5rlu

[INFO|file_utils.py:1635] 2021-10-15 17:53:47,507 >> storing https://huggingface.co/gpt2/resolve/main/config.json in cache at C:\Users\benjamin/.cache\huggingface\transformers\fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|file_utils.py:1643] 2021-10-15 17:53:47,510 >> creating metadata file for C:\User


gradient_accumulation_steps=1,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=Models/scripts/lambada/gen\runs\Oct15_17-53-44_benjamin-PC,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
output_dir=Models/scripts/lambada/gen,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=gen,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=Models/scripts/lambada/g


The following cells have not been tested yet

In [None]:
aug = nas.LambadaAug(model_dir='../model/lambada', threshold=0.3, batch_size=4)

In [None]:
aug.augment(['24111', '23619'], n=10)

The entry below raised errors because one of its characters is not UTF-8 compliant, though it is not yet clear which one.

In [96]:
'''Atleast 7 years of working experience in Oracle Financial Cloud consultant and 2 live projects implementation. Strong expertise on Oracle Financial modules –General Ledger, Payables, Fixed Assets, Cash Management, Accounts Payables, Accounts Receivables, Inventory, Purchasing and order Management modules and preparing the Financial Statements for client. Strong technical and analytical skills on problem-solving and suggest solutions. Working experience on Integrations between Cloud and on-premise systems. Should have ability to work independently on Project deliverables.'''


' Oracle Financial modules –Gen'
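The slice above contains an en dash just before "General". One way to locate such characters is to list every non-ASCII character together with its position and Unicode name, as in the sketch below. The function `find_non_ascii` is a hypothetical helper, and whether the en dash is what actually triggers the error is an assumption that would still need checking.

```python
import unicodedata

def find_non_ascii(text):
    """Return (position, character, Unicode name) for every non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text) if ord(ch) > 127]

# Shortened sample based on the entry above; \u2013 is the en dash.
sample = "Oracle Financial modules \u2013General Ledger"
for position, char, name in find_non_ascii(sample):
    print(position, repr(char), name)
```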

### E) Implementing Data Augmentation