
## Task 1. Basic Text Pre-processing
#### Name: Matthew Bentham


Date: 02/10/2022

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* numpy
* nltk 
* itertools
* os

## Introduction

The objective of this task is to perform basic text pre-processing on 776 job descriptions so that the vocabulary of each description is cleaner and easier to analyse. Once this process is complete, the simplified, less noisy version of each job description can be used to generate document vectors as input to an NLP machine learning model for text classification.

**Pre-processing tasks:**   
1. Tokenisation & case uniformity 
2. Removal of short words (length < 2)
3. Removal of stopwords 
4. Removal of less frequent words and most frequent words


**INPUTS**:
- 4 folders, each named after a job category and containing the job files for that category
- Categories:
    * Accounting Finance 
    * Engineering 
    * Healthcare Nursing 
    * Sales
- Each job.txt contains:
    * Title 
    * Web index value (unique)
    * Company 
    * Description 

**OUTPUTS**:
- **jobs.txt**: Contains all job descriptions in a single file (one line per description)
- **vocab.txt**: Contains the unigram vocabulary, in the format *word_string:word_integer_index*
- **webindxs.txt**: Contains a list of all the stored web index values in the same order as the jobs.txt file
- **jobtitles.txt**: Contains a list of all the titles in the same order as the jobs.txt file
- **job_types.txt**: Contains a list of all the job types in the same order as the jobs.txt file

## Importing libraries 

In [2]:
# Code to import libraries
import nltk
from nltk.probability import FreqDist
import os
import numpy as np
from itertools import chain
import re
import pandas as pd

### 1.1 Loading data
*NOTE: format of the data folders can be seen in the intro above*  

**TASKS**
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [10]:
# LOAD DATA
# Directories used:
dir_ = "data/"
paths =['Accounting_Finance','Engineering','Healthcare_Nursing','Sales']
# Initialise lists to store description , job_indx , job titles , job_type
job_descriptions = [] 
job_indxs = []
job_titles = []
job_type=[]
company = []

# Iterate through each directory and extract the required information using regex

for type1 in paths:
    
    dir_path = dir_+type1
    for filename1 in sorted(os.listdir(dir_path)): # load the job files in ascending order of their file names
        if filename1.endswith(".txt"): # only txt files are read
            file = dir_path+"/"+filename1 # this gives the file path
            
            with open(file,"r",encoding= 'unicode_escape') as f:
                lines = f.readlines()
                count=0
                company_count = 0
                for line in lines:
                    index=re.search(r'Webindex: (\d+)',line) 
                    title = re.search(r'Title: (.+)\n$',line)
                    Description = re.search(r'Description: (?:[a-zA-Z]+(?:\s[a-zA-Z]+)?: )?(.+)',line)
                    Company = re.search(r'Company: (.+)\n$',line)
                    
                    if title:
                        job_titles.append(title.group(1))
                    if index:
                        job_indxs.append(index.group(1))
                    if Company:
                        company_count = 1
                        company.append(Company.group(1))
                    if Description:
                        job_descriptions.append(str(Description.group(1)))
                if company_count == 0:
                    company.append('NA')
                job_type.append(type1)
                # no explicit f.close() is needed: the with-block closes the file
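
To make the extraction patterns easier to follow, here is a minimal sketch that applies the same four regular expressions to a single made-up job file held in a string. The helper name `parse_job` and the sample values are hypothetical and are not part of the notebook; the sample deliberately has no `Company:` line to show the `'NA'` fallback.

```python
import re

# Hypothetical sample, laid out like the job files described in the introduction
sample_job = (
    "Title: Finance Manager\n"
    "Webindex: 12345678\n"
    "Description: Position: Excellent opportunity to join our client\n"
)

def parse_job(text):
    """Extract the fields of one job file; Company falls back to 'NA'."""
    record = {"title": None, "webindex": None,
              "company": "NA", "description": None}
    for line in text.splitlines(keepends=True):
        title = re.search(r'Title: (.+)\n$', line)
        index = re.search(r'Webindex: (\d+)', line)
        company = re.search(r'Company: (.+)\n$', line)
        # The optional group skips one extra subheading such as "Position: "
        description = re.search(
            r'Description: (?:[a-zA-Z]+(?:\s[a-zA-Z]+)?: )?(.+)', line)
        if title:
            record["title"] = title.group(1)
        if index:
            record["webindex"] = index.group(1)
        if company:
            record["company"] = company.group(1)
        if description:
            record["description"] = description.group(1)
    return record

record = parse_job(sample_job)
print(record)
```

Collecting the fields of one file into a single record, rather than appending to separate lists, is one way to keep titles, indexes and descriptions aligned when a field is missing from a file.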

In [11]:
len(company)

776

In [12]:

data={'Web indexs':job_indxs,'Catergory':job_type,'Title':job_titles,'Company':company,'Description':job_descriptions}
data=pd.DataFrame(data=data)
data.head()

|   | Web indexs | Catergory | Title | Company | Description |
|---|---|---|---|---|---|
| 0 | 68802053 | Accounting_Finance | FP&A Blue Chip | Hays Senior Finance | A market leading retail business is going thro... |
| 1 | 70757636 | Accounting_Finance | Part time Management Accountant | FS2 UK Ltd | You will be responsible for the efficient runn... |
| 2 | 71356489 | Accounting_Finance | IFA EMPLOYED | Clark James Ltd | Role The purpose of the role is to provide adv... |
| 3 | 69073629 | Accounting_Finance | Finance Manager | Accountancy Action Ltd | Excellent opportunity to join our client, an e... |
| 4 | 70656648 | Accounting_Finance | Management Accountant | Alexander Lloyd | Our client offers a interesting opportunity fo... |


In [13]:
data.to_csv('data.csv', sep=',')

In [None]:
print('Number of Job files:',len(job_type))

Number of Job files: 776


### 1.2 Pre-processing data

After some preliminary analysis of the raw data, it can be seen that some formatting errors were introduced when the advertisements were converted to txt files. These include:
- `brbr` scattered throughout the files (HTML formatting)
- Additional subheadings after `Description:` (e.g. `Position:`, `Job Description:`)

In [None]:
# Example
job_descriptions[488]

'Staff Nurse  RGN will also consider Newly Qualified brLocation: Selby brSalary: **** per hour plus overtime rate brbrJob Description: brI am currently looking to recruit a qualified RGN to work for a service based within a rural location. The service is CQC compliant and part of a Yorkshire based Healthcare Company brbrJob Requirements:brbrResponsible for the assessment of care/support needs of service usersbrDevelopment and implementation of care programmes brWorking alongside other nurses reporting to the Manager brSkills/ Qualifications:brbrRegistered Nurse  RGN will also consider newly qualified brDesire to make a difference to people brPassionate at delivering services that enhance lives brBenefits:brSalary **** per hour plus overtime rate brHoliday entitlement brEXCELLENT career progression and training opportunities brPicturesque working environment brFor more information on how to apply for this fantastic opportunity please contact Shona Blackburn on or email a copy of your up

As you can see, subheadings like `brBenefits:brSalary` or `Skills/ Qualifications:` don't give any intrinsic information about the job itself and can therefore be removed.

In [None]:
# Compile both patterns once, outside the loop
br_pattern = re.compile(r'(?:br)+([A-Z])')
subheading_pattern = re.compile(r'[A-Z]\w+:')

for i, description in enumerate(job_descriptions):
    # Remove runs of 'br' that directly precede a capital letter
    cleaned = br_pattern.sub(r'\1', description)
    # Remove all subheadings (a capitalised word followed by a colon)
    job_descriptions[i] = subheading_pattern.sub('', cleaned)
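
As a rough illustration of what these two substitutions do, the sketch below applies them to a shortened fragment in the style of the example description above. The function name `clean_description` is a hypothetical helper, not something defined in the notebook.

```python
import re

def clean_description(text):
    """Strip 'br' runs before a capital letter, then drop 'Word:' subheadings."""
    without_br = re.sub(r'(?:br)+([A-Z])', r'\1', text)
    return re.sub(r'[A-Z]\w+:', '', without_br)

fragment = ("Newly Qualified brLocation: Selby brSalary: **** per hour "
            "brbrJob Description: brI am currently looking")
cleaned = clean_description(fragment)
print(cleaned.split())

# Words that merely contain 'br' before a lowercase letter are left alone
print(clean_description("the library"))
```

Note that only the last word of a multi-word subheading is removed (`Job Description:` leaves `Job` behind), so a few stray heading words can survive this step.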
    
    


#### 1.2.1 Tokenization & case uniformity 
Each job description needs to be tokenized. The word tokenization must use the following regular expression: `r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"`, which accounts for words containing a `-` or `'` symbol. All tokens are converted to lowercase.

In [None]:
def tokenize_txt(txt):
    """
        This function lowercases and tokenizes a raw text document.
    """
    txt_lower = txt.lower() # convert all words to lowercase

    pattern = r'''(?x)
    [a-zA-Z]+(?:[-'][a-zA-Z]+)?       # match words, including those with an embedded hyphen or apostrophe
    '''
    tokenizer = nltk.RegexpTokenizer(pattern)
    tokenised_text = tokenizer.tokenize(txt_lower)
    return tokenised_text

tokenised_job_descriptions = [tokenize_txt(job) for job in job_descriptions]  # list comprehension, generates a list of tokenized job descriptions
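
To see what the pattern keeps and what it discards, here is a small sketch on a made-up sentence. It uses `re.findall` from the standard library in place of `nltk.RegexpTokenizer`; for a pattern like this one, with no capturing groups, the two are expected to return the same matches, but that equivalence is an assumption of this sketch.

```python
import re

TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

def tokenize_sketch(txt):
    """Lowercase the text and return every match of the token pattern."""
    return re.findall(TOKEN_PATTERN, txt.lower())

# Made-up sentence, not taken from the data set
tokens = tokenize_sketch("The client's full-time role pays 30k - apply now!")
print(tokens)
```

Digits and punctuation are dropped, so a string such as `30k` leaves a lone `k` behind; single-letter fragments of this kind are what the short-word filter in the next step removes.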

#### 1.2.2 Remove small words 
Remove words with a length of less than 2. This ensures that indefinite articles like `a` or pronouns like `i` do not take up space in the tokenised job descriptions, as they give no indication of job type.

In [None]:
removed_words = [[w for w in job if len(w)<2] \
                      for job in tokenised_job_descriptions]
tokenised_job_descriptions = [[w for w in job if len(w)>=2] \
                      for job in tokenised_job_descriptions]
print('-'*40)
print('Removed words:')
print('-'*40)
print(removed_words[0:10])
print('-'*40)

----------------------------------------
Removed words:
----------------------------------------
[['a', 'a', 'a', 'a', 'a', 't'], ['a', 'a', 'a'], ['a', 'a', 'a'], ['a'], ['a', 'a', 'a', 'a', 'i', 't', 'a', 'a', 'a', 'k'], ['a', 'a', 'a', 'a', 'a'], ['a'], ['a', 'i', 'a', 'a', 'a', 'a'], ['a', 'a', 'a', 'a', 'p', 'l', 'a', 'a', 'a', 's'], ['a']]
----------------------------------------


As seen above, single letters like 'a' were removed from the token lists, as they carry no information about the job type of the description they appear in.

#### 1.2.3 Remove stop words
Stopwords are removed using the provided stop words list (stopwords_en.txt).

In [None]:
# Generate list of stop words 
stopwords_ = []
with open('./stopwords_en.txt') as f:
    stopwords_ = f.read().splitlines()


# filter out stop words located in tokenised_job_descriptions

# a set gives fast membership checks; the stopwords_ list itself is left unchanged
stopwords_set = set(stopwords_)
tokenised_job_descriptions = [[w for w in job if w not in stopwords_set]
                              for job in tokenised_job_descriptions]

In [None]:
print('-'*40)
print('Stopwords:')
print('-'*40)
print(stopwords_[0:100])
print('-'*40)

----------------------------------------
Stopwords:
----------------------------------------
['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'com

As seen above, stopwords were removed from the token lists, as they generally carry little information and draw focus away from the more important class-specific words.

#### 1.2.4 Remove less and most frequent words
- Remove words that appear only once in the document collection (term frequency = 1).
- Remove the top 50 most frequent words by the number of documents they appear in (document frequency).
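
The difference between term frequency and document frequency can be shown on a toy corpus. The sketch below uses `collections.Counter` from the standard library instead of nltk's `FreqDist`, removes only the single most frequent word rather than the top 50, and works on made-up token lists; all names in it are illustrative.

```python
from collections import Counter
from itertools import chain

# Made-up tokenised documents
docs = [
    ['sales', 'target', 'sales', 'client'],
    ['nurse', 'care', 'client'],
    ['audit', 'client', 'care'],
]

# Term frequency: every occurrence of a word counts
term_freq = Counter(chain.from_iterable(docs))
hapaxes = {w for w, count in term_freq.items() if count == 1}
docs = [[w for w in doc if w not in hapaxes] for doc in docs]

# Document frequency: a word counts once per document it appears in
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))
most_frequent = {w for w, _ in doc_freq.most_common(1)}
docs = [[w for w in doc if w not in most_frequent] for doc in docs]

print(sorted(hapaxes))
print(most_frequent)
print(docs)
```

Here `sales` has a term frequency of 2 but a document frequency of 1, which is why the two counts are computed separately.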

In [None]:
# Compute VOCAB
words = list(chain.from_iterable(tokenised_job_descriptions))# we put all the tokens in the corpus in a single list
vocab = set(words) 

In [None]:
rm_fd = FreqDist(words) # compute term frequency for each unique word/type
print('-'*40)
print('Term frequencies:')
print('-'*40)
print(rm_fd.most_common()[9300:-1])
print('-'*40)
# get words with a term frequency of 1 (hapaxes)
lessFreqWords = set(rm_fd.hapaxes())
print('-'*40)
print('Less frequent words:')
print('-'*40)
lessFreqWords

----------------------------------------
Term frequencies:
----------------------------------------
[('slow', 1), ('applytodaystarttomorrownewsalesfor', 1), ('envolve', 1), ('ue', 1), ('organically', 1), ('complemented', 1), ('wwf', 1), ('amnesty', 1), ('bundles', 1), ('invaluable', 1), ('waiter', 1), ('bartender', 1), ('altringham', 1), ('bolton', 1), ('rochdale', 1), ('helens', 1), ('embarking', 1), ('faint', 1), ('hearted', 1), ('recruitmentsalesexecutive', 1), ('appealing', 1), ('ordered', 1), ('stuck', 1), ('salesadministrator', 1), ('titlebusiness', 1), ('industryinternational', 1), ('parcel', 1), ('surpass', 1), ('businessdevelopmentmanagercourierservices', 1), ('koteuncapp', 1), ('redirects', 1), ('solves', 1), ('educate', 1), ('browse', 1), ('infrastructures', 1), ('developmentongoing', 1), ('atmospherea', 1), ('qualityrewarded', 1), ('generously', 1), ('struggling', 1), ('openended', 1), ('towcester', 1), ('resultsfocussed', 1), ('percentage', 1), ('mentors', 1), ('monetary',

{'whois',
 'inpatient',
 'assertiveness',
 'stone',
 'londoncare',
 'facilitation',
 "cv's",
 'dealt',
 'validity',
 'sapa',
 'shelf',
 'agendas',
 'allowancepensionhealthcare',
 'bromsgrove',
 'multicultural',
 'kgv',
 'trollies',
 'gatekeeper',
 'interrogate',
 'remaining',
 'pest',
 'deferred',
 'btec',
 'yellow',
 'bamber',
 'mainting',
 'therapies',
 'newspapers',
 'ampm',
 'processor',
 'allround',
 'mode',
 'teller',
 'greeks',
 'retrieve',
 'amplitude',
 'woods',
 'sjh',
 'rigorous',
 'offenders',
 'takeoffs',
 'mdm',
 'washington',
 'implication',
 'vulnerability',
 'timeframes',
 'kpmg',
 'eurolondon',
 'geo',
 'yearly',
 'dominates',
 'werkshage',
 'academia',
 'implicitly',
 'slots',
 'portsdown',
 'citations',
 'hazan',
 'hurt',
 'classleading',
 'thunderhclplc',
 'opportunties',
 'corporates',
 'acs',
 'churchdown',
 'socialwork',
 'spending',
 'cousins',
 'mortgageprocessor',
 'alliance',
 'supportability',
 'label',
 'chose',
 'stamping',
 'attains',
 'testandvalidation

In [None]:
def removewords(array,words):
    """ removes every token found in `words` from a list of tokens
    """
    return [w for w in array if w not in words]
# Remove less frequent words:
tokenised_job_descriptions = [removewords(job,lessFreqWords) for job in tokenised_job_descriptions]

As seen above, words with a term frequency of 1 were removed from the token lists because, like stop words, they hold no useful predictive capability: they appear only once in the whole collection.

In [None]:
# Generate a document frequency distribution
words_2 = list(chain.from_iterable([set(job) for job in tokenised_job_descriptions]))
doc_fd = FreqDist(words_2)  # compute document frequency for each unique word/type
doc_fd_50=doc_fd.most_common(50)

# remove top 50 words in document frequency distribution 
mostfreq= set([k for k, v in doc_fd_50])


tokenised_job_descriptions = [removewords(job,mostfreq) for job in tokenised_job_descriptions]

## Saving required outputs
**FILES TO SAVE**
- vocab.txt
- jobs.txt
- job_types.txt
- webindxs.txt
- jobtitles.txt


In [None]:
def saveinfo(directory, array):
    # the with-block creates the txt file and closes it automatically
    with open(directory, 'w') as out_file:
        for job in array:
            try:
                if directory == "./jobs.txt":
                    # each job is a list of tokens: join them with spaces
                    out_file.write(' '.join(job) + '\n')
                else:
                    # each entry is already a single string
                    out_file.write(''.join(job) + '\n')
            except Exception:
                out_file.write('INVALID'+'\n')

# Save files: 
job_file = "./jobs.txt"
jobtype_file = "./job_types.txt"
webindxs_file = "./webindxs.txt"
jobtitles_file="./jobtitles.txt"

saveinfo(job_file,tokenised_job_descriptions)
saveinfo(jobtype_file,job_type)
saveinfo(webindxs_file,job_indxs)
saveinfo(jobtitles_file,job_titles)
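
Because every output file is meant to follow the same order as jobs.txt, it can be worth confirming that the lists have equal lengths before they are written. The following is a minimal sketch of such a check; `check_aligned` is a hypothetical helper and the lists passed to it are made up.

```python
def check_aligned(**columns):
    """Return the shared length of all sequences, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Misaligned outputs: {}".format(lengths))
    return next(iter(lengths.values()))

# Made-up stand-ins for the token lists, job types, web indexes and titles
n_jobs = check_aligned(
    descriptions=[['sales', 'client'], ['nurse', 'care']],
    job_types=['Sales', 'Healthcare_Nursing'],
    webindxs=['11111111', '22222222'],
    titles=['Sales Executive', 'Staff Nurse'],
)
print(n_jobs)
```

A mismatch would indicate that at least one job file lacked a field that the loading loop only appends when it is found.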


In [None]:
# Generate vocab.txt file
# generate the vocab list, sorted alphabetically:
words = list(chain.from_iterable(tokenised_job_descriptions))
vocab = sorted(set(words))

# write each word with its index in alphabetical order; note that the index starts from 0
with open("./vocab.txt", 'w') as out_file: # creates a txt file named './vocab.txt', opened in write mode
    for ind, word in enumerate(vocab):
        out_file.write("{}:{}\n".format(word, ind))
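
For downstream use, the *word_string:word_integer_index* lines can be read back into a dictionary. The sketch below shows one possible round trip on in-memory lines rather than on the real vocab.txt; both function names and the toy documents are hypothetical.

```python
def build_vocab_lines(tokenised_docs):
    """Sort the unique tokens and format them as 'word:index' lines."""
    vocab = sorted({w for doc in tokenised_docs for w in doc})
    return ["{}:{}".format(word, idx) for idx, word in enumerate(vocab)]

def parse_vocab_lines(lines):
    """Map each word back to its integer index."""
    vocab_index = {}
    for line in lines:
        # split on the last colon, so the word part is taken as-is
        word, idx = line.rstrip("\n").rsplit(":", 1)
        vocab_index[word] = int(idx)
    return vocab_index

# Made-up tokenised documents
toy_docs = [["nurse", "care"], ["audit", "care", "full-time"]]
vocab_lines = build_vocab_lines(toy_docs)
vocab_index = parse_vocab_lines(vocab_lines)
print(vocab_lines)
print(vocab_index)
```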

## Summary

Overall, a relatively comprehensive text pre-processing methodology was applied to the job description data to improve the accuracy of any NLP model used on it. The methodology included tokenization, stopword removal, removal of the least and most frequent terms, and removal of short words. All of these steps aim to reduce the impact of words with limited predictive capability, so that words that give a better indication of the job type make up a larger proportion of the data. Another common step would be lemmatization or stemming, which aim to reduce tokens to their root form; however, for the purpose of our evaluation this step is not important.

