# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Shivam Manish Shinde
#### Student ID: s3994666

Date: 20-05-2024

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used in the assignment are as follows:
* pandas
* re
* numpy
* os
* regexp_tokenize
* word_tokenize
* Counter
* defaultdict
* load_files
* chain
* warnings

## Introduction
Milestone I of this assignment focuses on Natural Language Processing (NLP) and is part of a larger project to develop an automated job advertisement classification system. Such a system is intended to categorise job adverts more accurately by reducing human error, which in turn increases the exposure of these ads to qualified candidates. The purpose of this milestone is to process and prepare job advertisement data so that text classification models can predict job categories automatically.

### Objective of Task 1

Task 1 is to preprocess job advertisements for modelling and classification. This includes cleaning the text data, creating a vocabulary, and ensuring that the data is in the proper format for training machine learning models.

#### Key Tasks:

This task focuses entirely on **Text Preprocessing**, which involves the following subtasks:

* Tokenization: 
Use the provided regex pattern to tokenize job posting descriptions.


* Normalisation: 
To standardise the data, convert all tokens to lowercase.


* Filtering: 
Filter out tokens based on length (fewer than 2 characters), stopwords, rarity (words that appear only once), and frequency (the 50 most frequent words).


* Vocabulary Construction:
Create a vocabulary from the cleaned job advertisements, excluding words that were filtered out during the pre-processing stage. The vocabulary should be organised alphabetically, with an integer index starting at 0.


* Saving Processed Data:
Save preprocessed job advertisement texts in a given format, including web index and cleaned description.
Also, store the built vocabulary in a text file with the chosen format.


## Importing libraries 

In [1]:
#Importing the required Libraries  for Task 1
import re
import os
import pandas as pd
import numpy as np
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize
from collections import Counter
from collections import defaultdict
from sklearn.datasets import load_files
from itertools import chain
import warnings 
warnings.filterwarnings("ignore")

### 1.1 Examining and loading data
- Examine the data folder, including the categories and job advertisement txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


### Loading the Data

In [2]:
# Loading the text data file 
# Path to the directory containing the job categories
Data_Path = r"D:/AP_Assignment 2/rename_me/data" 

#Defining the function for base path
def data_path(*args):
    return os.path.join(Data_Path, *args)

# Loading the job advertisement data
job_data = load_files(data_path())

In [3]:
#Accessing the first document and its category through print statements
print("Sample Job Description:", job_data.data[0])  
print("Category:", job_data.target_names[job_data.target[0]])

Category: Accounting_Finance


###  Examine the data

In [4]:
#Checking for the file names
job_data['filenames']

array(['D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00382.txt',
       'D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00354.txt',
       'D:/AP_Assignment 2/rename_me/data\\Healthcare_Nursing\\Job_00547.txt',
       'D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00246.txt',
       'D:/AP_Assignment 2/rename_me/data\\Healthcare_Nursing\\Job_00543.txt',
       'D:/AP_Assignment 2/rename_me/data\\Engineering\\Job_00089.txt',
       'D:/AP_Assignment 2/rename_me/data\\Healthcare_Nursing\\Job_00580.txt',
       'D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00419.txt',
       'D:/AP_Assignment 2/rename_me/data\\Sales\\Job_00767.txt',
       'D:/AP_Assignment 2/rename_me/data\\Sales\\Job_00670.txt',
       'D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00263.txt',
       'D:/AP_Assignment 2/rename_me/data\\Accounting_Finance\\Job_00374.txt',
       'D:/AP_Assignment 2/rename_me/data\\Engineering\\Job_00111.txt',
       'D:/AP

In [5]:
#Targeting the folder names with respect to their category
job_data['target_names']

['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']

In [6]:
job_data['target']

array([0, 0, 2, 0, 2, 1, 2, 0, 3, 3, 0, 0, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2,
       3, 3, 0, 2, 2, 2, 0, 2, 3, 1, 2, 0, 1, 3, 3, 1, 1, 0, 2, 2, 2, 2,
       0, 0, 2, 1, 3, 1, 1, 2, 2, 3, 0, 0, 1, 0, 2, 2, 3, 3, 3, 0, 3, 0,
       1, 2, 3, 1, 3, 2, 3, 1, 3, 2, 1, 3, 2, 1, 3, 2, 2, 1, 0, 1, 1, 1,
       3, 0, 3, 1, 3, 2, 2, 0, 2, 3, 2, 1, 0, 1, 1, 2, 0, 3, 0, 1, 3, 2,
       1, 2, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 3,
       2, 0, 0, 1, 3, 2, 0, 1, 0, 3, 1, 2, 1, 0, 0, 0, 3, 0, 1, 2, 3, 1,
       1, 1, 2, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 2, 2, 0, 2, 3, 2, 2, 0, 2,
       1, 0, 1, 1, 1, 3, 1, 3, 1, 0, 3, 1, 0, 2, 0, 0, 2, 1, 1, 0, 1, 3,
       0, 1, 1, 3, 0, 1, 0, 2, 3, 0, 2, 0, 1, 0, 1, 3, 1, 0, 1, 1, 0, 1,
       0, 1, 2, 1, 3, 1, 2, 3, 1, 1, 2, 0, 0, 1, 2, 0, 3, 2, 3, 2, 2, 3,
       0, 1, 1, 1, 1, 1, 1, 0, 3, 1, 1, 0, 0, 2, 1, 2, 2, 2, 2, 1, 3, 1,
       2, 1, 2, 3, 2, 3, 0, 1, 3, 0, 2, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 2,
       2, 1, 2, 0, 2, 2, 1, 2, 0, 1, 0, 0, 3, 2, 1,

Here job_data['target'] encodes the job category of each document as an integer, as follows:
* 0 = Accounting_Finance
* 1 = Engineering
* 2 = Healthcare_Nursing
* 3 = Sales
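
As a quick check of this mapping, the integer labels can be counted and translated back into category names. The sketch below is illustrative only: `first_ten_targets` holds just the first ten labels from the output above, not the full array, so the counts do not describe the whole dataset.

```python
from collections import Counter

# Category names in the order reported by job_data['target_names']
target_names = ['Accounting_Finance', 'Engineering', 'Healthcare_Nursing', 'Sales']

# First ten labels copied from the job_data['target'] output (not the full array)
first_ten_targets = [0, 0, 2, 0, 2, 1, 2, 0, 3, 3]

# Count how many documents carry each integer label
label_counts = Counter(first_ten_targets)

# Translate the integer labels back into category names
category_counts = {
    target_names[label]: count for label, count in sorted(label_counts.items())
}
print(category_counts)
```

Applied to the full `job_data['target']` array, the same counting would show how the documents are spread across the four categories.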

### Extraction of data
In this step we extract only the data required for preprocessing the text. To do so, we first define helper functions for extracting each field and then collect the results in a dataframe.

In [7]:
# Defining a function to safely read file content
def read_file_content(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            return file.read()
    except Exception as e:
        print(f"Error reading file {filename}: {e}")
        return None  

In [8]:
# Function to extract Webindex from job advertisements
def extract_web_index(text):
    # Regular expression to find 'Webindex: <number>'
    match = re.search(r'Webindex:\s*(\d+)', text)
    return int(match.group(1)) if match else None

In [9]:
# Function to extract Title from job advertisements
def extract_title(text):
    # Regular expression to find 'Title: <text>' at the start of the file
    match = re.match(r"Title: (.*?)\n", text)
    return match.group(1) if match else "No Title Found"

In [10]:
# Function to extract the description from each job advertisement
def extract_description(text):
    # Extracting description section
    match = re.search(r'Description:\s*(.*)', text, re.DOTALL)
    return match.group(1).strip() if match else "No Description Found"

### Converting the data into DataFrame

Before converting the data into a dataframe, we defined four separate functions:
1. read_file_content - Reads the full content of a text file.
2. extract_web_index - Extracts the web index from the text. It uses re.search() with a regex pattern that captures only the digits following 'Webindex:'.
3. extract_title - Extracts the title from the text. It uses re.match() with a simple regex pattern, so the title must appear at the very start of the text.
4. extract_description - Extracts the description of each job advertisement. It uses the re.DOTALL flag, which lets '.' match any character, including newlines.
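
The difference between re.match(), re.search() and the re.DOTALL flag can be seen on a small example. The sketch below uses a made-up advert (`sample_text`) and assumes the files follow a "Title / Webindex / Description" layout; the real files may contain further fields.

```python
import re

# Hypothetical advert text; the layout is an assumption for illustration
sample_text = (
    "Title: Deputy Home Manager\n"
    "Webindex: 68700336\n"
    "Description: An exciting opportunity has arisen.\n"
    "Apply today.\n"
)

# re.match only succeeds if the pattern matches at the very start of the string
title = re.match(r"Title: (.*?)\n", sample_text).group(1)

# re.search scans the whole string and returns the first match
web_index = int(re.search(r"Webindex:\s*(\d+)", sample_text).group(1))

# With re.DOTALL, '.' also matches '\n', so the capture runs to the end of the text
description = re.search(r"Description:\s*(.*)", sample_text, re.DOTALL).group(1).strip()

# Without re.DOTALL the capture stops at the first line break
first_line_only = re.search(r"Description:\s*(.*)", sample_text).group(1)

print(title, web_index)
print(repr(description))
print(repr(first_line_only))
```

Without the flag, a multi-line description would be cut off after its first line, which is why re.DOTALL matters for the description field.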


In [11]:
#Creating a dataframe and placing the extracted content into columns for easy preprocessing of the text
df = pd.DataFrame({
    'web_index': [extract_web_index(content.decode('utf-8')) for content in job_data.data],
    'title': [extract_title(content.decode('utf-8')) for content in job_data.data],
    'description': [extract_description(content.decode('utf-8')) for content in job_data.data],
    'category': [job_data.target_names[target] for target in job_data.target],
    'target': job_data.target
})

df.head()

| | web_index | title | description | category | target |
|---|---|---|---|---|---|
| 0 | 68997528 | Finance / Accounts Asst Bromley to ****k | Accountant (partqualified) to **** p.a. South ... | Accounting_Finance | 0 |
| 1 | 68063513 | Fund Accountant Hedge Fund | One of the leading Hedge Funds in London is cu... | Accounting_Finance | 0 |
| 2 | 68700336 | Deputy Home Manager | An exciting opportunity has arisen to join an ... | Healthcare_Nursing | 2 |
| 3 | 67996688 | Brokers Wanted Imediate Start | OneTwoTrade is expanding their Sales Team and ... | Accounting_Finance | 0 |
| 4 | 71803987 | RGN Nurses (Hospitals) Penarth | RGN Nurses (Hospitals) Immediate fulltime and ... | Healthcare_Nursing | 2 |


### 1.2 Pre-processing data
The pre-processing steps are as follows:

### Tokenization, Lowercase and Removal of short words
1. For tokenization we used the regex pattern mentioned in the assignment.
2. To convert the tokens to lowercase we used the lower() function.
3. To remove words with fewer than 2 characters we used the len() function.

All three steps are combined in a single function to keep the code short and readable.
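
To see what the pattern does with apostrophes and hyphens, it helps to run it on one sentence. The sketch below is a minimal illustration on a made-up sentence; it uses re.findall from the standard library as a stand-in for nltk's regexp_tokenize, on the assumption that both return the same tokens for a pattern containing only non-capturing groups.

```python
import re

# Pattern from the assignment: a run of letters, optionally followed by
# ONE hyphen or apostrophe and another run of letters
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sentence chosen to exercise apostrophes and hyphens
sample = "The GP's day-to-day role is a full-time job"

# Stand-in for regexp_tokenize(sample, pattern)
raw_tokens = re.findall(pattern, sample)

# Lowercase every token and drop tokens shorter than 2 characters
clean_tokens = [token.lower() for token in raw_tokens if len(token) > 1]

print(raw_tokens)
print(clean_tokens)
```

Because the optional group can match only once, "day-to-day" is split into "day-to" and "day", while "full-time" and "GP's" stay whole. The single-letter token "a" is dropped by the length filter.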

In [12]:
def custom_tokenize(text):
    # Regular expression pattern for tokenization(mentioned in the assignment)
    regex_pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    
    # Tokenizing the text using the defined regular expression
    tokens = regexp_tokenize(text, regex_pattern)
    
    # Converting each token to lower case and filtering out tokens with length less than 2
    filtered_tokens = [token.lower() for token in tokens if len(token) > 1]
    
    print("Original tokens:", '\n', tokens)
    print('\n', "Filtered tokens:", '\n', filtered_tokens,'\n')
    
    return filtered_tokens

# Applying the updated tokenization function to each description
df['tokens'] = df['description'].apply(custom_tokenize)

# Displaying the first few rows of the DataFrame to verify the results
df[['description', 'tokens']].head(n=15)





| | description | tokens |
|---|---|---|
| 0 | Accountant (partqualified) to **** p.a. South ... | [accountant, partqualified, to, south, east, l... |
| 1 | One of the leading Hedge Funds in London is cu... | [one, of, the, leading, hedge, funds, in, lond... |
| 2 | An exciting opportunity has arisen to join an ... | [an, exciting, opportunity, has, arisen, to, j... |
| 3 | OneTwoTrade is expanding their Sales Team and ... | [onetwotrade, is, expanding, their, sales, tea... |
| 4 | RGN Nurses (Hospitals) Immediate fulltime and ... | [rgn, nurses, hospitals, immediate, fulltime, ... |
| 5 | Production Coordinator Sandbach Salary: pound;... | [production, coordinator, sandbach, salary, po... |
| 6 | Job Reference VAC**** Job Title Scrub Nurse ... | [job, reference, vac, job, title, scrub, nurse... |
| 7 | Our client is looking to recruit an experience... | [our, client, is, looking, to, recruit, an, ex... |
| 8 | Our client is a leading recruitment company ba... | [our, client, is, leading, recruitment, compan... |
| 9 | Position: Business Development Executive Fiel... | [position, business, development, executive, f... |


In [13]:
# Apply the updated tokenization function to a single description for testing
test_description = df['description'].iloc[2]
test_tokens = custom_tokenize(test_description)

Original tokens: 
 ['An', 'exciting', 'opportunity', 'has', 'arisen', 'to', 'join', 'an', 'establish', 'provider', 'of', 'elderly', 'care', 'services', 'The', 'role', 'of', 'Deputy', 'Home', 'Manager', 'is', 'to', 'support', 'the', 'Home', 'Manager', 'in', 'the', 'day', 'to', 'day', 'running', 'of', 'the', 'home', 'You', 'need', 'to', 'have', 'a', 'passion', 'for', 'working', 'in', 'the', 'care', 'sector', 'and', 'have', 'proven', 'experience', 'in', 'this', 'role', 'Job', 'Description', 'To', 'assist', 'the', 'Registered', 'Manager', 'in', 'the', 'management', 'of', 'the', 'home', 'To', 'take', 'overall', 'responsibility', 'for', 'the', 'home', 'in', 'the', 'absence', 'of', 'the', 'Registered', 'Manager', 'Job', 'Requirements', 'Ensure', 'that', 'high', 'standards', 'of', 'care', 'are', 'delivered', 'to', 'meet', 'the', 'needs', 'of', 'the', 'individual', 'residents', 'Ensure', 'the', 'healthcare', 'needs', 'of', 'the', 'residents', 'are', 'met', 'by', 'liaising', 'with', 'GP', 's', '

### Removal of Stopwords
To remove stopwords, we first load the stopwords_en.txt file, print the loaded stopwords for inspection, and then remove them from the tokens.
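
The idea can be sketched on a toy example before applying it to the real data. The stopword text below (`toy_stopword_text`) is made up and far shorter than stopwords_en.txt; it only assumes that the file holds one stopword per line.

```python
# Hypothetical file content standing in for stopwords_en.txt (one word per line)
toy_stopword_text = "the\nis\nto\nof\nan\n"

# str.split() with no arguments splits on any whitespace, including newlines;
# a set gives fast membership tests during filtering
toy_stopwords = set(toy_stopword_text.split())

tokens = ["an", "exciting", "opportunity", "to", "join", "the", "team"]

# Keep only the tokens that are not stopwords, preserving their order
kept_tokens = [token for token in tokens if token not in toy_stopwords]
print(kept_tokens)
```

Storing the stopwords in a set rather than a list keeps each membership test cheap, which matters when every token of every description is checked.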

In [14]:
#Defining a function to load the given stopwords from text file
def load_stopwords(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        stopwords = set(file.read().split())
    return stopwords

stopwords_path = data_path('stopwords_en.txt')  
stopwords = load_stopwords(stopwords_path)
print(stopwords)

{'could', 'anyway', 'tries', 'et', 'otherwise', 'thence', 'myself', "he's", 'ours', 'whereas', 'why', 'at', 'qv', 'amongst', "i'll", 'unto', 'going', 'becoming', 'if', 'inner', 'secondly', 'hopefully', 'were', 'five', 'theirs', 'thereupon', "we'd", 'somebody', 'am', 'almost', 'said', 'whereafter', 'fifth', 'hereafter', 'from', 'u', 'lately', 'tried', 'none', 'down', 'thoroughly', 'old', 'vs', 'wherever', 'edu', 'regarding', 'much', 'z', 'particularly', 'many', 'elsewhere', 'causes', 'therein', 'seriously', 'out', 'somewhat', 'indeed', 'specifying', 'under', 'where', 'next', 'seven', 'help', 'yours', 'latterly', 'ok', 'accordingly', 'knows', 'along', 'howbeit', 'such', 'aside', 'downwards', "it'll", 'according', 'until', 'around', 'use', 'went', 'eg', 'cannot', 'for', 'him', 'outside', 'she', 'anything', 'liked', 'four', 'new', 'l', 'just', "can't", 'but', 'even', 'there', 'towards', 'whence', 'unless', 'except', 'every', 'herein', 'comes', 'g', 'ex', 'oh', 'go', 'normally', 'soon', 'sp

In [15]:
def remove_stopwords(tokens, stopwords):
    # Filtering out stopwords from the tokens
    return [token for token in tokens if token not in stopwords]

# Applying custom_tokenize to get filtered_tokens and then remove stopwords
df['tokens'] = df['description'].apply(custom_tokenize)
df['tokens'] = df['tokens'].apply(lambda tokens: remove_stopwords(tokens, stopwords))



Original tokens: 
 ['Accountant', 'partqualified', 'to', 'p', 'a', 'South', 'East', 'London', 'Our', 'client', 'a', 'successful', 'manufacturing', 'company', 'has', 'an', 'immediate', 'requirement', 'for', 'an', 'Accountant', 'for', 'permanent', 'role', 'in', 'their', 'modern', 'offices', 'in', 'South', 'East', 'London', 'The', 'Role', 'Credit', 'Control', 'Purchase', 'Sales', 'Ledger', 'Daily', 'collection', 'of', 'debts', 'by', 'phone', 'letter', 'and', 'email', 'Handling', 'of', 'ledger', 'accounts', 'Handling', 'disputed', 'accounts', 'and', 'negotiating', 'payment', 'terms', 'Allocating', 'of', 'cash', 'and', 'reconciliation', 'of', 'accounts', 'Adhoc', 'administration', 'duties', 'within', 'the', 'business', 'The', 'Person', 'The', 'ideal', 'candidate', 'will', 'have', 'previous', 'experience', 'in', 'a', 'Credit', 'Control', 'capacity', 'you', 'will', 'possess', 'exceptional', 'customer', 'service', 'and', 'communication', 'skills', 'together', 'with', 'IT', 'proficiency', 'You'




In [16]:
# Printing two randomly sampled entries for verification
for index, row in df.sample(n=2).iterrows():
    print("Original Description:", row['description'],'\n')
    print("Processed Tokens:", row['tokens'])
    print()


Original Description: Apply Today, Start Tomorrow New Sales for 2013 Is this you? Do you need money and a career immediately? We are looking for at least **** people to start now No experience is necessary, as we provide all successful applicants with full product training along with continuous coaching from day one in the office. Although we welcome candidates with previous experience in Sales, Customer Service, Advertising, Promotions, Retail, Call Centre, Hospitality Marketing. Opportunities throughout 2013 We never slow down If you are hardworking and looking to start a new career, then please apply now for an immediate appointment ****  **** per week  Average Earnings APPLY ONLINE NOW ALL CANDIDATES MUST BE **** OR OVER, LIVE IN THE UK  PLYMOUTH AREA  AND MUST BE ABLE TO COMMUTE TO OUR PLYMOUTH OFFICE DAILY. This job was originally posted as www.totaljobs.com/JobSeeking/ApplyTodayStartTomorrowNewSalesfor2013_job**** 

Processed Tokens: ['apply', 'today', 'start', 'tomorrow', 's

### Removal of words that appear only once
Here we remove the words that appear only once across the whole collection, based on term frequency. Before calculating the term frequency, we flatten the tokens of all documents into a single list, which makes counting the words easier.
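
A minimal sketch of this step on a made-up three-document corpus (`toy_docs`) is shown below; the tokens are illustrative and not taken from the real data.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [
    ["credit", "control", "ledger", "accounts"],
    ["nurse", "care", "home", "accounts"],
    ["credit", "care", "sales"],
]

# Term frequency: total number of occurrences across the whole collection
term_frequency = Counter(token for doc in toy_docs for token in doc)

# Words that occur exactly once in the collection
single_words = {word for word, count in term_frequency.items() if count == 1}

# Remove those words from every document
filtered_docs = [
    [token for token in doc if token not in single_words] for doc in toy_docs
]
print(sorted(single_words))
print(filtered_docs)
```

Note that the count is taken over the collection, not per document: "accounts" appears once in each of two documents, so its term frequency is 2 and it is kept.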

In [17]:
# Converting all tokens into a single list
all_tokens = [token for tokens_list in df['tokens'] for token in tokens_list]
print(all_tokens)



In [18]:
# Calculate term frequency across all documents
term_frequency = Counter(all_tokens)
print(term_frequency)



In [19]:
# Get words that appear only once
single_words = {word for word, count in term_frequency.items() if count == 1}
print(single_words)

{'indicating', 'pttls', 'buick', 'timcryerbaker', 'bodmin', 'als', 'corporates', 'plump', 'confront', 'southeast', 'rheoli', 'prince', 'conformity', 'offences', 'onboarding', 'stations', 'chemicals', 'electrification', 'nx', 'goruchwylio', 'ceramics', 'saas', 'relief', 'faro', 'minority', 'angels', 'lodgings', 'sterling', 'fatigue', 'throughlife', 'polishes', 'recreations', 'exposed', 'coast', 'newspapers', 'logo', 'downloading', 'kingswodd', 'thin', 'nebosh', 'generallocation', 'storrington', 'ooad', 'twilight', 'loveridge', 'tidying', 'waverley', 'spinalcord', 'confidentially', 'glesurfredlinegroup', 'iosh', 'bmssales', 'reallocations', "limited's", 'quoted', 'seeker', 'weeklies', 'ears', 'cannulate', 'recruitmentconsultantcontractmediadivision', 'samplingward', 'nsrp', 'genetics', 'nonaggressive', 'cyfweliadau', 'warranties', 'negligible', 'scanners', 'firstly', 'stone', 'innate', 'chwarae', 'constraint', 'spares', 'omega', 'salespurchaseledgerclerkmaternitycover', 'stan', 'easter',

In [20]:
# Function to remove words that appear only once in the entire collection
def remove_single_words(tokens):
    return [token for token in tokens if token not in single_words]

# Applying the function to each document's tokens
df['tokens'] = df['tokens'].apply(remove_single_words)

# Printing the first few entries for verification
for index, row in df.head(n=2).iterrows():
    print("Description:", row['description'],'\n')
    print("Tokens:", row['tokens'])
    print()  # Adds a blank line for better readability between entries


Description: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role 

Tokens: ['accountant', 'partqualified', 'south', 'east', 'london', 'client', 'successful', 'manufacturing', 'company', 'requirement', 'accountant', 'permanent', 'role', 'modern', 'offices', 'south', 'ea

### Removal of Top 50 most frequent words
To remove the top 50 most frequent words from the documents, we first initialize a dictionary to count document frequencies, i.e. the number of documents each word appears in. After calculating the document frequency, we remove the 50 words with the highest values.
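
The sketch below illustrates document frequency on a made-up corpus, selecting the top 2 words instead of the top 50. The alphabetical tie-break in the sort key is an assumption added for the illustration; without one, words with equal document frequency are ordered by set iteration order, which may differ between runs.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [
    ["team", "sales", "team", "team"],
    ["team", "nurse", "care"],
    ["sales", "care", "team"],
    ["engineer", "care"],
]

# Document frequency: set(doc) makes each document count a word at most once
document_frequency = Counter(token for doc in toy_docs for token in set(doc))

# Rank by descending document frequency; ties broken alphabetically (assumption)
k = 2
ranked = sorted(document_frequency.items(), key=lambda item: (-item[1], item[0]))
top_k_words = {word for word, _ in ranked[:k]}

# Remove the top-k words from every document
filtered_docs = [
    [token for token in doc if token not in top_k_words] for doc in toy_docs
]
print(ranked)
print(filtered_docs)
```

Here "team" occurs five times in total but in only three documents, so its document frequency is 3, the same as "care".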

In [21]:
# Initializing a dictionary to count document frequencies
document_frequency = defaultdict(int)

# Calculating the document frequency
for tokens in df['tokens']:
    unique_tokens = set(tokens)
    for token in unique_tokens:
        document_frequency[token] += 1

In [22]:
# Sorting words by document frequency and selecting the top 50
top_50_frequent = sorted(document_frequency.items(), key=lambda x: x[1], reverse=True)[:50]
top_50_words = {word for word, freq in top_50_frequent}
print(top_50_words)

{'knowledge', 'client', 'essential', 'based', 'include', 'full', 'cv', 'posted', 'good', 'service', 'job', 'leading', 'join', 'role', 'services', 'management', 'salary', 'skills', 'support', 'development', 'candidate', 'team', 'opportunity', 'information', 'benefits', 'jobseeking', 'required', 'uk', 'originally', 'apply', 'successful', 'strong', 'including', 'manager', 'sales', 'position', 'experience', 'provide', 'business', 'excellent', 'working', 'work', 'contact', 'ability', 'www', 'clients', 'training', 'company', 'recruitment', 'high'}


In [23]:
# Function to remove top 50 most frequent words based on document frequency
def remove_top_50_words(tokens, top_50_words):
    return [token for token in tokens if token not in top_50_words]

# Applying the function to each document's tokens
df['tokens'] = df['tokens'].apply(lambda tokens: remove_top_50_words(tokens, top_50_words))

# Displaying the first few entries for verification
for index, row in df.head(n=2).iterrows():
    print("Description:", row['description'],'\n')
    print("Tokens:", row['tokens'])
    print()  


Description: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role 

Tokens: ['accountant', 'partqualified', 'south', 'east', 'london', 'manufacturing', 'requirement', 'accountant', 'permanent', 'modern', 'offices', 'south', 'east', 'london', 'credit', 'control', 'purcha

### Statistics Print
In this step we compute the following statistics:
1. Vocabulary Size
2. Total Number of Tokens
3. Lexical Diversity
4. Total Number of Descriptions
5. Average Description Length
6. Maximum Description Length
7. Minimum Description Length
8. Standard Deviation of Description Length
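
As a small worked example of these statistics, the sketch below computes them for a made-up three-document corpus. It uses the standard library's statistics module in place of numpy; pstdev is the population standard deviation, which is assumed here to correspond to np.std with its default settings.

```python
from statistics import mean, pstdev

# Hypothetical tokenized descriptions
toy_docs = [
    ["credit", "control"],
    ["care", "home", "care", "credit"],
    ["sales", "care", "home", "credit", "sales", "control"],
]

# All tokens in one list, and the set of unique tokens (the vocabulary)
words = [token for doc in toy_docs for token in doc]
vocab = set(words)

# Lexical diversity: unique tokens divided by total tokens
lexical_diversity = len(vocab) / len(words)

# Description lengths measured in tokens
lens = [len(doc) for doc in toy_docs]
avg_len = mean(lens)
std_len = pstdev(lens)

print(len(vocab), len(words), lexical_diversity)
print(avg_len, max(lens), min(lens), std_len)
```

A lexical diversity close to 0 means the same words are reused heavily, while a value close to 1 means almost every token is distinct.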


In [24]:
#Defining the stats_print function and applying it to the processed dataframe
def stats_print(df):
    words = list(chain.from_iterable(df['tokens'])) # Flattening the list of tokenized descriptions to get all tokens in a single list

    vocab = set(words) # Computing the vocabulary by converting the list of words/tokens to a set, thus getting unique words

    # Calculating the lexical diversity as the ratio of unique words to total words
    lexical_diversity = len(vocab) / len(words) if words else 0  # Adding a check to avoid division by zero

    print("Vocabulary size:", len(vocab))
    print("Total number of tokens:", len(words))
    print("Lexical diversity:", lexical_diversity)
    print("Total number of descriptions:", len(df))

    # Calculating the length of each description
    lens = [len(description) for description in df['tokens']]
    
    # Computing average and other statistics using numpy for better handling of numerical operations
    if lens:  # Checking if the list is not empty to avoid errors with numpy functions
        print("Average description length:", np.mean(lens))
        print("Maximum description length:", np.max(lens))
        print("Minimum description length:", np.min(lens))
        print("Standard deviation of description length:", np.std(lens))
    else:
        print("Average description length: N/A")
        print("Maximum description length: N/A")
        print("Minimum description length: N/A")
        print("Standard deviation of description length: N/A")

stats_print(df)

Vocabulary size: 5168
Total number of tokens: 81205
Lexical diversity: 0.06364140139153993
Total number of descriptions: 776
Average description length: 104.64561855670104
Maximum description length: 401
Minimum description length: 7
Standard deviation of description length: 58.44628718710534


## Saving required outputs
Save the vocabulary, the preprocessed job advertisements, and the titles and categories into text files.
- vocab.txt
- preprocessed_job_ads.txt
- title_category.txt

In [25]:
# Saving the preprocessed job advertisement texts to a text file
preprocessed_path = data_path('preprocessed_job_ads.txt')

def preprocessed_data(df, filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        for index, row in df.iterrows():
            web_index = row['web_index']
            tokens = row['tokens']
            cleaned_description = ' '.join(tokens)
            # Format: web_index:cleaned_description, followed by a blank line
            f.write(f"{web_index}:{cleaned_description}\n\n")

# Call the function to save the data
preprocessed_data(df, preprocessed_path)

#Printing the path to verify the text has been created
print("Preprocessed job advertisements saved to:", preprocessed_path)

Preprocessed job advertisements saved to: D:/AP_Assignment 2/rename_me/data\preprocessed_job_ads.txt


In [26]:
# Building and saving the vocabulary to a new text file
vocab_path = data_path('vocab.txt')
with open(vocab_path, 'w', encoding='utf-8') as f:
    
    # Flattening all tokens into a single list to build vocabulary
    all_cleaned_tokens = [token for tokens_list in df['tokens'] for token in tokens_list]
    
    # Counting the remaining tokens; the Counter keys are the unique words that survived all filtering steps
    vocab_counter = Counter(all_cleaned_tokens)
    
    # Sorting the unique words alphabetically to form the vocabulary
    vocabulary = sorted(vocab_counter)
    
    # Saving the vocabulary with index
    for index, word in enumerate(vocabulary):
        f.write(f"{word}:{index}\n")

        
#Printing the path to verify the text has been created        
print("Vocabulary file saved to:", vocab_path)

Vocabulary file saved to: D:/AP_Assignment 2/rename_me/data\vocab.txt
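
For later tasks, the saved vocabulary can be read back into a dictionary. The sketch below is a minimal illustration, assuming the word:index format written above; it parses the first four entries of the vocabulary from an in-memory string (`vocab_text`) instead of opening vocab.txt.

```python
import io

# First four vocabulary entries in the word:index format
vocab_text = "aap:0\naaron:1\naat:2\nabb:3\n"

word_to_index = {}
# io.StringIO stands in for open(vocab_path, 'r', encoding='utf-8')
with io.StringIO(vocab_text) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Split on the last colon so the index is always the final field
        word, index = line.rsplit(":", 1)
        word_to_index[word] = int(index)

print(word_to_index)
```

Splitting on the last colon is a defensive choice: the tokenization pattern never produces a colon inside a word, but the index is guaranteed to be the final field either way.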


In [27]:
# Defining a Function to save title and category to a text file
def save_title_category(df, filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        for index, row in df.iterrows():
            title = row['title']
            category = row['category']
            # Format: title:category
            f.write(f"{title}:{category}\n")
            
# Path to save title and category file
title_category_path = data_path('title_category.txt')

save_title_category(df, title_category_path)

# Printing the path to verify the text has been created
print("Title and Category file saved to:", title_category_path)


Title and Category file saved to: D:/AP_Assignment 2/rename_me/data\title_category.txt


In [28]:
def preprocessed_job_contents(file_path):
    # Opening the file in read mode
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            print(line.strip())  # Using strip() to remove leading/trailing whitespace

preprocessed_job_contents(preprocessed_path)

68997528:accountant partqualified south east london manufacturing requirement accountant permanent modern offices south east london credit control purchase ledger daily collection debts phone letter email handling ledger accounts handling accounts negotiating payment terms cash reconciliation accounts adhoc administration duties person ideal previous credit control capacity possess exceptional customer communication part fully qualified accountant considered

68063513:hedge funds london recruiting fund accountant paying outstanding west end report head fund accounting number fund accountants senior fund accountants responsible fund accounting number hedge funds dealing equity related products involves aspects fund accounting preparation journal voucher entries nav control part nav review fund accountant reviews cash securities reconciliation trade input pricing financial statements

68700336:exciting arisen establish provider elderly care deputy home home day day running home passion c

In [29]:
def vocab_contents(file_path):
    # Opening the file in read mode
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            print(line.strip())  # Using strip() to remove leading/trailing whitespace

#Printing the vocab file contents 
vocab_contents(vocab_path)

aap:0
aaron:1
aat:2
abb:3
abenefit:4
aberdeen:5
abi:6
abilities:7
abreast:8
abroad:9
absence:10
absolute:11
ac:12
aca:13
academic:14
academy:15
acca:16
accept:17
acceptable:18
acceptance:19
accepted:20
access:21
accessible:22
accident:23
accommodates:24
accommodation:25
accomplished:26
accordance:27
account:28
accountabilities:29
accountability:30
accountable:31
accountancy:32
accountant:33
accountants:34
accounting:35
accounts:36
accreditation:37
accredited:38
accruals:39
accuracy:40
accurate:41
accurately:42
achievable:43
achieve:44
achieved:45
achievement:46
achievements:47
achiever:48
achieving:49
acii:50
acquired:51
acquisition:52
acquisitions:53
act:54
acting:55
action:56
actions:57
actionscript:58
active:59
actively:60
activites:61
activities:62
activity:63
acts:64
actual:65
actuarial:66
acumen:67
acute:68
ad:69
adam:70
adapt:71
adaptability:72
add:73
added:74
addiction:75
adding:76
addition:77
additional:78
additionally:79
additions:80
address:81
addresses:82
addressing:83
adec

In [30]:
# Function to print the contents of the new file
def print_file_contents(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            print(line.strip())

#Printing the contents of the file to verify
print_file_contents(title_category_path)

Finance / Accounts Asst Bromley to ****k:Accounting_Finance
Fund Accountant  Hedge Fund:Accounting_Finance
Deputy Home Manager:Healthcare_Nursing
Brokers Wanted Imediate Start:Accounting_Finance
RGN Nurses (Hospitals)  Penarth:Healthcare_Nursing
Production Coordinator:Engineering
Scrub Nurse:Healthcare_Nursing
Sales & Purchase Ledger Clerk  Maternity Cover:Accounting_Finance
Recruitment Sales Executive:Sales
Business Development Executive  Field Sales  Dartford:Sales
Investments & Treasury Controller:Accounting_Finance
European Payroll:Accounting_Finance
Engineering Assessor / Instructor  South Yorkshire:Engineering
International Account Manager:Sales
Senior Production Technologist (Malaysia):Engineering
Insurance Sales Executive  Horsham:Sales
Vehicle Purchaser / Car Sales:Sales
Marine Engines Specialist – Product Support:Engineering
Sales Manager/Medical Sales Executive:Sales
Optical Assistant  Oxfordshire:Healthcare_Nursing
PERM Unit Mgr RGN Kid minster Flexi ****K due:Healthcare_Nu

## Summary
In Task 1 of the assignment, we built a complete text preprocessing pipeline for a collection of job advertisements. The pipeline extracts the web index, title and description from each advertisement, tokenizes the descriptions with the regular expression specified in the assignment, and converts all tokens to lowercase for consistency. It then removes short words and stopwords, followed by words that appear only once in the collection and the 50 most frequent words by document frequency. The remaining tokens are used to build an alphabetically sorted vocabulary, which is saved alongside the preprocessed texts (keyed by web index) and a file of titles and categories. This leaves the text data cleaned and formatted, ready for feature extraction and the machine learning tasks in later phases of the project. The implementation follows the assignment requirements, with an emphasis on data quality and suitability for automated job classification.

