## Generating train and test sets

**Author:** Benjamin Aw  
**Date:** 13 Dec 2021  
**Context:** Extracted data needs to be cleaned and split for training and testing purposes.  
**Objective:** To apply previously written functions to clean the data, generate augmented data, and finally split the result into training and test sets.

#### A) Setting up

Importing the libraries and setting the file path for the required datasets

In [1]:
import pandas as pd
from ssoc_autocoder.processing import process_text
from ssoc_autocoder.augmentation import data_augmentation
from tqdm.auto import tqdm
import math
from itertools import chain
from sklearn.model_selection import GroupShuffleSplit

path = "../Data/"

tqdm.pandas()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\benjamin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


#### B) Functions

We have to create 4 functions: 

1. To clean up the original data, `cleaning_up_dataset`
2. To create augmented data, `augmenting_data`
3. To supplement the data with the SSOC detailed definitions, `supplementing_data`
4. To create the train test split, `train_test_split`

In [2]:
def cleaning_up_dataset(path):
    
    # Reading in the data
    df = pd.read_csv(path + "Raw/Raw_Labelled.csv")
    
    # Keeping only the necessary columns, with MCF_Job_Ad_ID as the primary identifier
    # (.copy() avoids a SettingWithCopyWarning when the column is reassigned below)
    df = df[["MCF_Job_Ad_ID", "Predicted_SSOC_2020", "title", "description"]].copy()
    
    # Applying cleaning function across the description column
    df["description"] = df["description"].apply(process_text)
    
    return df

In [3]:
def augmenting_data(df, prob, edit_phrase):
    
    # Applying the augmentation function across the description column
    df["description_augmented"] = df["description"].progress_apply(data_augmentation, args = (prob, edit_phrase))
    
    # Renaming description column
    df = df.rename(columns={'description':'description_originial'})
    
    # Appending to the original dataset
    df = pd.concat([df, df['description_augmented'].apply(pd.Series)], axis=1)
    
    # Dropping used or repeated columns
    df = df.drop(columns="description_augmented")
    df = df.drop(columns="orginal_text")
    
    return df
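
The `apply(pd.Series)` step above is what turns the augmentation output into separate columns. The sketch below illustrates it on a toy frame, assuming `data_augmentation` returns a dict-like object per row; the keys and values used here are placeholders for illustration, not the function's real output.

```python
import pandas as pd

# Toy stand-in for the augmentation output: one dict per row.
# Keys and values are assumptions made for this illustration.
toy = pd.DataFrame({
    "description": ["manage a team", "write code"],
    "description_augmented": [
        {"synonym_out": "lead a team", "sent_out": "a team manage"},
        {"synonym_out": "author code", "sent_out": "code write"},
    ],
})

# Each dict key becomes its own column, aligned on the original index
expanded = toy["description_augmented"].apply(pd.Series)

# Attach the new columns and drop the now-redundant dict column
toy_wide = pd.concat([toy.drop(columns="description_augmented"), expanded], axis=1)

print(toy_wide.columns.tolist())  # ['description', 'synonym_out', 'sent_out']
```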

In [4]:
def supplementing_data(df, data_detailed_def):
    
    # Getting the relevant columns
    data_detailed_def = data_detailed_def[["SSOC 2020", "SSOC 2020 Title", "Detailed Definitions"]]
    
    # Keeping only the 5-digit (5D) SSOC entries and dropping entries that start with X
    data_detailed_def = data_detailed_def[data_detailed_def["SSOC 2020"].apply(lambda x: len(x) > 4)]
    data_detailed_def = data_detailed_def[data_detailed_def["SSOC 2020"].apply(lambda x: x[0] != 'X')]
    
    # Changing column names for merging purposes
    data_detailed_def = data_detailed_def.rename(columns={"SSOC 2020": "Predicted_SSOC_2020", 
                                                          "SSOC 2020 Title": "title", 
                                                          "Detailed Definitions": "description"}, 
                                                 errors="raise")
    
    # Creating an additional column of ID, default to None for now
    data_detailed_def["MCF_Job_Ad_ID"] = None
    
    # Concatenating the two datasets (DataFrame.append was removed in pandas 2.0)
    output = pd.concat([df, data_detailed_def])
    
    output = output.sort_values(by=['Predicted_SSOC_2020', 'title'])
    
    output = output.reset_index(drop=True)
    
    return output
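
As a rough illustration of the filtering and concatenation in `supplementing_data`, the sketch below runs the same steps on a made-up miniature of the definitions sheet. All codes, titles and definitions are placeholders, and the `astype(str)` call is an added safeguard (an assumption, not part of the function above) in case the codes are read in as numbers.

```python
import pandas as pd

# Made-up miniature of the definitions sheet; every value is a placeholder
defs = pd.DataFrame({
    "SSOC 2020": ["1", "11", "11110", "X5000"],
    "SSOC 2020 Title": ["Title A", "Title B", "Title C", "Title D"],
    "Detailed Definitions": ["Def A", "Def B", "Def C", "Def D"],
})

# Keep codes longer than 4 characters that do not start with "X"
codes = defs["SSOC 2020"].astype(str)
keep = (codes.str.len() > 4) & ~codes.str.startswith("X")

# Rename to match the labelled data and add an empty ID column
defs_5d = (
    defs[keep]
    .rename(columns={"SSOC 2020": "Predicted_SSOC_2020",
                     "SSOC 2020 Title": "title",
                     "Detailed Definitions": "description"})
    .assign(MCF_Job_Ad_ID=None)
)

# Made-up labelled row standing in for the cleaned job ads
labelled = pd.DataFrame({
    "MCF_Job_Ad_ID": ["ad-1"],
    "Predicted_SSOC_2020": ["25121"],
    "title": ["Some job"],
    "description": ["Some text"],
})

combined = pd.concat([labelled, defs_5d], ignore_index=True)
print(combined["Predicted_SSOC_2020"].tolist())  # ['25121', '11110']
```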

In [5]:
def train_test_split(df, train_set_size):
    
    # Create a dictionary based on the number of occurrence   
    counter = df.groupby('Predicted_SSOC_2020').count().to_dict(orient='dict')['title']
    counter_once = { key:value for (key,value) in counter.items() if value == 1}
    counter_multiple = { key:value for (key,value) in counter.items() if value > 1}
    
    # Subset out multiple 
    df_multiple = df[df["Predicted_SSOC_2020"].apply(lambda x: x in counter_multiple.keys())]
    
    train_inds, test_inds = next(GroupShuffleSplit(test_size= (1 - train_set_size), n_splits=2, random_state = 7).split(df_multiple, groups = df_multiple['Predicted_SSOC_2020']))
    
    df_multiple_train = df_multiple.iloc[train_inds]
    
    df_multiple_test = df_multiple.iloc[test_inds]
    
    # Subset out once
    df_once = df[df["Predicted_SSOC_2020"].apply(lambda x: x in counter_once.keys())]
    
    # Append the single-occurrence rows to the training rows
    df_train = pd.concat([df_multiple_train, df_once])
    
    # Pivot longer df_train
    df_train = pd.melt(df_train, 
                       id_vars= ["MCF_Job_Ad_ID", 
                                 "Predicted_SSOC_2020", 
                                 "title"], 
                       value_vars=["description_originial", 
                                   "wrd_emb_out", 
                                   "bk_trans_out", 
                                   "synonym_out", 
                                   "context_emb_out", 
                                   "sent_out", 
                                   "summ_out"],
                       var_name='Augment', 
                       value_name='job_description')
    
    # Pivot longer for df_test; we are only interested in testing on the original descriptions
    df_test = pd.melt(df_multiple_test, 
                       id_vars= ["MCF_Job_Ad_ID", 
                                 "Predicted_SSOC_2020", 
                                 "title"], 
                       value_vars=["description_originial", 
                                   "wrd_emb_out", 
                                   "bk_trans_out", 
                                   "synonym_out", 
                                   "context_emb_out", 
                                   "sent_out", 
                                   "summ_out"],
                       var_name='Augment', 
                       value_name='job_description')
    
    df_test = df_test[df_test["Augment"] == "description_originial"]
    
    return df_train, df_test 
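
The "pivot longer" step can be hard to picture, so here is a small sketch of `pd.melt` on a toy frame with one original and one augmented column. The rows are invented for illustration; only the column names follow the function above.

```python
import pandas as pd

# Toy wide-format frame: one row per job ad, one column per text variant
wide = pd.DataFrame({
    "MCF_Job_Ad_ID": ["a1", "a2"],
    "Predicted_SSOC_2020": ["11110", "25121"],
    "title": ["t1", "t2"],
    "description_originial": ["orig 1", "orig 2"],
    "synonym_out": ["syn 1", "syn 2"],
})

# Long format: one row per (job ad, text variant) pair
long = pd.melt(
    wide,
    id_vars=["MCF_Job_Ad_ID", "Predicted_SSOC_2020", "title"],
    value_vars=["description_originial", "synonym_out"],
    var_name="Augment",
    value_name="job_description",
)
print(long.shape)  # (4, 5)

# Keeping only the original descriptions, as done for the test set
originals = long[long["Augment"] == "description_originial"]
print(originals["job_description"].tolist())  # ['orig 1', 'orig 2']
```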

#### C) Running functions


Reading in the raw CSV file and cleaning up the dataset, while extracting only the necessary columns. The function takes a single input: the path of the data folder containing `Raw_Labelled.csv`

In [None]:
df = cleaning_up_dataset(path)

df.to_csv(path + "Processed/Processed_Labelled.csv", index = False)

Once the dataset is cleaned, we supplement it with the SSOC definitions so that SSOCs not present in the labelled data are still represented. This is done by appending the 5-digit entries from `SSOC2020 Detailed Definitions.xlsx`

In [12]:
# Chunk to read in output from above cell, only do this if you have run the above function before

df = pd.read_csv(path + "Processed/Processed_Labelled.csv")

In [13]:
data_detailed_def = pd.read_excel(path + "Reference/SSOC2020 Detailed Definitions.xlsx", header = 4)

df = supplementing_data(df, data_detailed_def)

df.to_csv(path + "Reference/Intermediate_Dataset.csv", index = False)

  warn("""Cannot parse header or footer so it will be ignored""")


Augmenting all 15k entries takes a long time (roughly 9-10 days when run locally on my own PC), so for now we augment only a sample of the dataset to test the pipeline. A better way to run the full augmentation still needs to be found.

In [14]:
df = df.sample(n = 200)
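
One possible way to make a multi-day run more manageable is to process the data in chunks and save each chunk as it finishes, so that an interrupted run can resume instead of starting over. The sketch below is only an illustration of that idea: `run_in_chunks`, its parameters and the chunk file naming are all hypothetical, and a trivial upper-casing function stands in for the real augmentation. It does not reduce the total compute time.

```python
import os
import tempfile

import pandas as pd


def run_in_chunks(df, process_chunk, out_dir, chunk_size=50):
    """Apply process_chunk to df in chunks, saving each result to out_dir.

    Chunks whose output file already exists are loaded instead of being
    recomputed, so an interrupted run can pick up where it stopped.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk_path = os.path.join(out_dir, f"chunk_{start}.csv")
        if os.path.exists(chunk_path):
            part = pd.read_csv(chunk_path)
        else:
            part = process_chunk(df.iloc[start:start + chunk_size].copy())
            part.to_csv(chunk_path, index=False)
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


def fake_augment(chunk):
    # Stand-in for the real augmentation so the sketch runs without any models
    chunk["description_upper"] = chunk["description"].str.upper()
    return chunk


toy = pd.DataFrame({"description": [f"job {i}" for i in range(5)]})
with tempfile.TemporaryDirectory() as tmp:
    first = run_in_chunks(toy, fake_augment, tmp, chunk_size=2)
    second = run_in_chunks(toy, fake_augment, tmp, chunk_size=2)  # reuses saved chunks

print(first["description_upper"].tolist())
```

In this notebook, `process_chunk` could plausibly be something like `lambda chunk: augmenting_data(chunk, 0.5, True)`, though that has not been tested here.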

Augmenting the current dataset: each row represents a job description, and the additional columns hold the augmented versions of that description

In [15]:
df = augmenting_data(df, 0.5, True)

df.to_csv(path + "Processed/Processed_Augmented_Labelled_sample.csv", index = False)

  0%|          | 0/200 [00:00<?, ?it/s]

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ..\aten\src\ATen\native\BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Input length of input_ids is 236, but ``max_length`` is set to 236.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.
Input length of input_ids is 139, but ``max_length`` is set to 115.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.
Input length of input_ids is 241, but ``max_length`` is set to 195.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.
Input length of input_ids is 13, but ``max_length`` is set to 13.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.

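Now we split the dataset into training and test sets.

If an SSOC value appears only once in the dataset, we keep that entry in the training set.

If an SSOC value has more than one entry, those rows are split according to the second argument of `train_test_split` (the proportion assigned to training), and the single-entry rows above are appended to the training portion. Because the split uses `GroupShuffleSplit` grouped on `Predicted_SSOC_2020`, all rows sharing an SSOC value land on the same side of the split.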

In [6]:
# Chunk to read in output from above cell, only do this if you have run the above function before

df = pd.read_csv(path + "Processed/Processed_Augmented_Labelled_sample.csv")

In [7]:
train, test = train_test_split(df, 0.8)

train.to_csv(path + "Train/Train.csv", index = False)

test.to_csv(path + "Test/Test.csv", index = False)
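
A quick way to sanity-check the splitting logic is to run it on a toy frame and look at which SSOC values end up on each side. The sketch below mirrors the steps of `train_test_split` with made-up codes (`A` to `E`) and skips the melt step; it is an illustration, not the notebook's function.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: codes A-D occur several times, code E occurs once
toy = pd.DataFrame({
    "Predicted_SSOC_2020": ["A"] * 3 + ["B"] * 3 + ["C"] * 2 + ["D"] * 2 + ["E"],
    "title": [f"job {i}" for i in range(11)],
})

# Separate codes that occur once from codes that occur multiple times
counts = toy["Predicted_SSOC_2020"].map(toy["Predicted_SSOC_2020"].value_counts())
multiple = toy[counts > 1]
once = toy[counts == 1]

# Group-wise split of the multi-occurrence rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=7)
train_idx, test_idx = next(
    splitter.split(multiple, groups=multiple["Predicted_SSOC_2020"])
)

# Single-occurrence rows always go to the training set
train = pd.concat([multiple.iloc[train_idx], once])
test = multiple.iloc[test_idx]

train_codes = set(train["Predicted_SSOC_2020"])
test_codes = set(test["Predicted_SSOC_2020"])
print(train_codes & test_codes)  # set(): no code appears on both sides
print("E" in train_codes)        # True
```

As the empty intersection shows, `GroupShuffleSplit` never places the same SSOC value in both sets. If the goal is for each multi-entry SSOC value to contribute rows to both training and test, a stratified splitter such as `StratifiedShuffleSplit` may be closer to that intent; this is a suggestion rather than what the notebook currently does.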