# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Mrwan Alhandi
#### Student ID: s3969393

Date: 9/10/23

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* pandas
* re
* os
* json
* collections
* glob

## Introduction
In this task, we will perform text pre-processing on a given dataset, focusing on the description of job advertisements. The following steps are executed:

1. Extraction of information from each job advertisement.
2. Tokenization using a specific regular expression.
3. Conversion of all words to lowercase.
4. Removal of words with a length of less than 3.
5. Elimination of stopwords using the provided stop words list (stopwords_en.txt).
6. Removal of words that appear in only one document of the collection (document frequency of 1).
7. Removal of the top 50 most frequent words based on term frequency.
8. Saving of preprocessed job advertisement text and information in the given format.
9. Building a vocabulary of cleaned job advertisement descriptions and saving it in a txt file.

These pre-processing steps prepare the data for further analysis.
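
As a quick illustration of the tokenization and lowercasing steps listed above, the sketch below applies the notebook's regular expression to a made-up sentence. The sample text is hypothetical and is not taken from the dataset.

```python
import re

# Same pattern as the notebook's tokenizer: alphabetic runs, optionally
# joined once by a hyphen or an apostrophe.
PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sentence, made up for illustration
sample = "Part-time Accountant needed: 3 years' experience, client's office."

# Lowercase first, then extract every match of the pattern
tokens = re.findall(PATTERN, sample.lower())
print(tokens)
```

Digits and punctuation are dropped, hyphenated words such as `part-time` stay as one token, and a trailing apostrophe (as in `years'`) is not kept because the pattern requires letters on both sides of it.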

## Importing libraries 

In [1]:
# Required Imports
import os
import glob
import re
import json
import pandas as pd
from collections import Counter

### 1.1 Examining and loading data

In [2]:
def read_files_in_folder(folder_path):
    """
    Read every job advertisement file in `folder_path` and return a list of
    dicts with the keys title, webindex, company and description.
    """
    data = []
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        entry = {"title": None, "webindex": None, "company": None, "description": None}
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                # Skip blank or malformed lines that have no "Key: value" separator
                if ":" not in line:
                    continue
                key, value = map(str.strip, line.split(":", 1))
                entry[key.lower()] = value
        data.append(entry)
    return data
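
The loader treats each line of a file as a `Key: value` pair and splits on the first colon only. The sketch below shows that parsing step on an in-memory string. The helper name `parse_advert` and the sample content are hypothetical, and the guard for lines without a colon is a defensive choice in this sketch.

```python
def parse_advert(text):
    """Parse 'Key: value' lines into a dict with lowercase keys."""
    entry = {"title": None, "webindex": None, "company": None, "description": None}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so colons inside the value survive
        key, value = map(str.strip, line.split(":", 1))
        entry[key.lower()] = value
    return entry


# Hypothetical file content, made up for illustration
sample = """Title: Finance Manager
Webindex: 12345678
Company: Example Ltd
Description: Excellent opportunity: join a growing team."""

advert = parse_advert(sample)
print(advert)
```

Because of `split(":", 1)`, the colon inside the description is preserved. Every value, including the webindex, is read as a string.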

In [4]:
folder_paths = glob.glob("./data/*")

combined_data = []

for folder_path in folder_paths:
    category = os.path.split(folder_path)[1]
    folder_data = read_files_in_folder(folder_path)
    category_data = pd.DataFrame(folder_data)
    category_data["category"] = category
    combined_data.append(category_data)
    
combined_data = pd.concat(combined_data, axis=0)

In [5]:
# Number of files for each of the different categories
combined_data["category"].value_counts()

Engineering           231
Healthcare_Nursing    198
Accounting_Finance    191
Sales                 156
Name: category, dtype: int64

In [9]:
combined_data.shape

(776, 5)

In [6]:
combined_data.head()

| | title | webindex | company | description | category |
|---|---|---|---|---|---|
| 0 | FP&A Blue Chip | 68802053 | Hays Senior Finance | A market leading retail business is going thro... | Accounting_Finance |
| 1 | Part time Management Accountant | 70757636 | FS2 UK Ltd | You will be responsible for the efficient runn... | Accounting_Finance |
| 2 | IFA EMPLOYED | 71356489 | Clark James Ltd | Role The purpose of the role is to provide adv... | Accounting_Finance |
| 3 | Finance Manager | 69073629 | Accountancy Action Ltd | Excellent opportunity to join our client, an e... | Accounting_Finance |
| 4 | Management Accountant | 70656648 | Alexander Lloyd | Our client offers a interesting opportunity fo... | Accounting_Finance |


### 1.2 Pre-processing data

In [10]:
def preprocess_text(df, column):
    """
    Tokenize the text in `column` and store the filtered tokens in a new
    'tokens' column. Removes stop words, tokens shorter than 3 characters,
    words that occur in only one document, and the 50 most frequent words.
    """
    # Tokenization pattern: alphabetic runs, optionally joined by one hyphen or apostrophe
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

    def tokenize(text):
        # Lowercase the text before matching so that all tokens are lowercase
        return re.findall(pattern, text.lower())

    # Apply the tokenization function to the input column
    df['tokens'] = df[column].apply(tokenize)

    # Document frequency (DF) and term frequency (TF) over the whole collection
    document_frequency = Counter()
    term_frequency = Counter()
    for tokens in df['tokens']:
        document_frequency.update(set(tokens))  # set: count each word once per document
        term_frequency.update(tokens)

    # Words that occur in only one document (DF == 1)
    rare_words = [word for word, count in document_frequency.items() if count == 1]

    # The 50 most frequent words by term frequency
    top_50_words = [word for word, _ in term_frequency.most_common(50)]

    # Read the stop words once, rather than once per row
    with open("stopwords_en.txt", "r") as stopword_file:
        stop_words = stopword_file.read().splitlines()

    # Combine the stop words, rare words, and top 50 frequent words
    words_to_remove = set(stop_words + rare_words + top_50_words)

    def remove_words(tokens):
        # Keep tokens of length >= 3 that are not in the removal set
        return [token for token in tokens if token not in words_to_remove and len(token) >= 3]

    # Apply the remove_words function to the 'tokens' column
    df['tokens'] = df['tokens'].apply(remove_words)

    return df

    
# Preprocessing is done for all three models - [Title Only, Description, and Both]

# Preprocessing for Title Only Model
combined_data = preprocess_text(combined_data, "title")
combined_data.rename(columns={'tokens': 'title_tokens'}, inplace=True)

# Preprocessing for Concat Model - Uses both Title and Description
combined_data["concat_feature"] = combined_data["title"] + " " + combined_data["description"]
combined_data = preprocess_text(combined_data, "concat_feature")
combined_data.rename(columns={'tokens': 'concat_feature_tokens'}, inplace=True)

# Preprocessing for Main Model - [Description]
combined_data = preprocess_text(combined_data, "description")
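
The two frequency-based filters depend on the difference between term frequency (total occurrences across the collection) and document frequency (number of documents that contain the word). The following is a minimal sketch on a toy corpus. The token lists are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Toy corpus: each inner list stands for the tokens of one document
docs = [
    ["sales", "manager", "sales", "team"],
    ["nurse", "team"],
    ["engineer", "team", "audit", "audit"],
]

term_frequency = Counter()
document_frequency = Counter()
for tokens in docs:
    term_frequency.update(tokens)            # every occurrence counts
    document_frequency.update(set(tokens))   # at most one count per document

# Words with a count of exactly 1 under each measure
tf_once = sorted(w for w, n in term_frequency.items() if n == 1)
df_once = sorted(w for w, n in document_frequency.items() if n == 1)

print("TF == 1:", tf_once)
print("DF == 1:", df_once)
print("Most frequent by TF:", term_frequency.most_common(1))
```

Here `sales` and `audit` each occur twice but only in a single document, so they have a document frequency of 1 and a term frequency of 2. Which measure a filter uses therefore changes which words are removed.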

In [11]:
combined_data.head()

| | title | webindex | company | description | category | title_tokens | concat_feature | concat_feature_tokens | tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | FP&A Blue Chip | 68802053 | Hays Senior Finance | A market leading retail business is going thro... | Accounting_Finance | [] | FP&A Blue Chip A market leading retail busine... | [blue, chip, market, leading, retail, rapid, g... | [market, leading, retail, rapid, growth, due, ... |
| 1 | Part time Management Accountant | 70757636 | FS2 UK Ltd | You will be responsible for the efficient runn... | Accounting_Finance | [part, management] | Part time Management Accountant You will be re... | [part, time, accountant, responsible, efficien... | [responsible, efficient, running, accounting, ... |
| 2 | IFA EMPLOYED | 71356489 | Clark James Ltd | Role The purpose of the role is to provide adv... | Accounting_Finance | [ifa] | IFA EMPLOYED Role The purpose of the role is ... | [ifa, employed, purpose, provide, advice, tele... | [purpose, provide, advice, telephone, leads, s... |
| 3 | Finance Manager | 69073629 | Accountancy Action Ltd | Excellent opportunity to join our client, an e... | Accounting_Finance | [] | Finance Manager Excellent opportunity to join ... | [finance, opportunity, join, expanding, based,... | [opportunity, join, expanding, based, recruit,... |
| 4 | Management Accountant | 70656648 | Alexander Lloyd | Our client offers a interesting opportunity fo... | Accounting_Finance | [management] | Management Accountant Our client offers a inte... | [accountant, offers, interesting, opportunity,... | [offers, interesting, opportunity, part, quali... |


## Saving required outputs
Save the vocabulary files and the preprocessed job advertisement data as per the specification.
- vocab.txt
- vocab_title.txt
- vocab_concat_feature.txt
- combined_data.xlsx

In [12]:
# Build the vocabulary
def build_vocabulary(df, output_file, column):
    """
    Build the vocabulary from the token lists in `column` and save it to
    `output_file`, one `word:integer_index` entry per line.
    """
    # Collect the unique tokens across all documents, sorted alphabetically
    vocabulary = sorted({token for tokens in df[column] for token in tokens})

    # Save the vocabulary to a text file in word:integer_index format (index starts at 0)
    with open(output_file, 'w', encoding='utf-8') as file:
        for i, word in enumerate(vocabulary):
            file.write(f"{word}:{i}\n")


# Build and save the vocabulary to a text file - [General Model]
build_vocabulary(combined_data, 'vocab.txt', 'tokens')

# Build and save vocab for title_tokens - [Title Only Model]
build_vocabulary(combined_data, 'vocab_title.txt', 'title_tokens')

# Build and save vocab for concat_feature_tokens - [Combined Model]
build_vocabulary(combined_data, 'vocab_concat_feature.txt', 'concat_feature_tokens')
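
A vocabulary file in the `word:integer_index` format can be read back into a dictionary for later use. The sketch below is one way to do that. The helper `load_vocabulary` and the file `vocab_demo.txt` are hypothetical names used only for this illustration, and the demo writes to a temporary directory rather than touching the real output files.

```python
import os
import tempfile


def load_vocabulary(path):
    """Read a word:index file into a {word: index} dictionary."""
    word_to_index = {}
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            # Split on the last colon: the index is always the final field
            word, index = line.rsplit(":", 1)
            word_to_index[word] = int(index)
    return word_to_index


# Demo: write a tiny vocabulary in the same format, then load it again
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "vocab_demo.txt")
    with open(demo_path, "w", encoding="utf-8") as file:
        for i, word in enumerate(sorted({"nurse", "audit", "part-time"})):
            file.write(f"{word}:{i}\n")
    vocab = load_vocabulary(demo_path)

print(vocab)
```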

In [13]:
# Saving the lists as JSON objects for easier loading in the other Jupyter Notebook
combined_data["tokens"] = combined_data["tokens"].apply(json.dumps)
combined_data["title_tokens"] = combined_data["title_tokens"].apply(json.dumps)
combined_data["concat_feature_tokens"] = combined_data["concat_feature_tokens"].apply(json.dumps)

# Saving the data for later use
combined_data.to_excel("./combined_data.xlsx", index=False)
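
A spreadsheet cell cannot hold a Python list, which is why the token lists are serialized to JSON strings before saving. The sketch below shows the round trip on a made-up token list. How the task2 notebook actually loads the file is not shown here, so the `read_excel` call in the comment is an assumption.

```python
import json

# Hypothetical token list, made up for illustration
tokens = ["market", "leading", "retail", "client's"]

# Serialize the list to a string, as done before writing the spreadsheet
encoded = json.dumps(tokens)

# Decode the string back into a Python list, as a later notebook could do
decoded = json.loads(encoded)

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)

# Assumed usage when loading the saved file with pandas:
#   df = pd.read_excel("./combined_data.xlsx")
#   df["tokens"] = df["tokens"].apply(json.loads)
```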

## Summary
This Jupyter Notebook generates four files: three vocabulary files (vocab.txt, vocab_title.txt and vocab_concat_feature.txt), one for each of the three modelling approaches, and a combined data file (combined_data.xlsx), which is used by all the models in the task2 Jupyter Notebook.