# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Ziqing Yan
#### Student ID: s3749857

Date: 1 October 2021

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* os
* re
* nltk
* sklearn
* itertools

## Introduction
In Task 1, I pre-process the job advertisement text data and write the cleaned information to txt files.

## Importing libraries 

In [1]:
# Import the libraries used in this task
import os
import re
import nltk
from sklearn.datasets import load_files
from itertools import chain
from nltk.probability import FreqDist
from nltk.util import ngrams

### 1.1 Examining and loading data
- In the data folder, there are 8 folders, each named after a job advertisement category.
- Each folder contains many .txt files holding job advertisements of that category.
- Each file contains the title, web index and description of one job advertisement. Some files also contain the company name.

Now let's extract the useful information from the files and store it in lists.

In [2]:
# Extract the information of the files and store them in lists
ad_id = [] # job advertisement id
category = [] # job category
web_index = [] # web index of the job advertisement
title = [] # the title of the job description
description = [] # job description

data = "./data" # path to the folder where the data is stored

for folder_name in os.listdir(data):
    folder_path = os.path.join(data, folder_name) # get the path of the current folder
    for file_name in os.listdir(folder_path):
        ad_id.append(re.search(r"\d{5}", file_name).group()) # use Regex to extract the 5 digit id in the file name and store it
        category.append(folder_name) # add the folder name as category name into the list
        file_path = os.path.join(folder_path, file_name) # get the file path
        with open(file_path, "r", encoding= 'utf-8') as f:
            lines = f.read().splitlines() # read the information of each line in the current file
            # Add the information of the file into the corresponding list
            title.append(lines[0].split(":", 1)[1].strip())
            web_index.append(lines[1].split(":", 1)[1].strip())
            # Check if the file contains the company name
            if lines[2].split(":")[0].strip() == "Company":
                description.append(lines[3].split(":", 1)[1].strip())
            else:
                description.append(lines[2].split(":", 1)[1].strip())

The lists storing the job advertisement information all share the same order, so the same position in each list corresponds to the same job advertisement.
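
The per-file parsing done inside the loop above can also be sketched as a small helper that works on one file's text at a time. This is only an illustration: `parse_ad` is a hypothetical name, the sample record is invented, and the `Title`, `Webindex` and `Description` labels are assumptions about the file layout (only the `Company` label is confirmed by the code above).

```python
def parse_ad(text):
    """Split one advertisement into its fields (hypothetical helper).

    Assumes one 'Label: value' pair per line, in the order
    title, web index, optional company, description.
    """
    lines = text.splitlines()

    def value(line):
        # Split on the first colon only, so colons inside the value survive
        return line.split(":", 1)[1].strip()

    has_company = lines[2].split(":", 1)[0].strip() == "Company"
    return {
        "title": value(lines[0]),
        "web_index": value(lines[1]),
        "company": value(lines[2]) if has_company else None,
        "description": value(lines[3] if has_company else lines[2]),
    }


# Invented sample record, not taken from the data set
sample = (
    "Title: Trainee Mortgage Advisor\n"
    "Webindex: 12345678\n"
    "Company: Example Ltd\n"
    "Description: Hours: 9 to 5. Are you results driven?"
)
ad = parse_ad(sample)
print(ad["title"], "|", ad["description"])
```

Splitting with `split(":", 1)` matters here: a description that itself contains a colon is kept whole instead of being cut at the second colon.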

### 1.2 Pre-processing data
After loading the content of these files, we can pre-process the job advertisement descriptions according to the requirements.

Let's first tokenize the descriptions.

In [3]:
# Tokenize the descriptions
tokenized_descriptions = [] # store the tokenized descriptions
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = nltk.RegexpTokenizer(pattern)

for des in description:
    lower_des = des.lower() # turn the description to lowercase
    tokenized_description = tokenizer.tokenize(lower_des) # tokenize the description using Regex
    tokenized_descriptions.append(tokenized_description)
tokenized_descriptions

[['are', 'you', 'a', 'successful', 'results', 'driven', 'person', 'are',
  'you', 'looking', 'to', 'start', 'a', 'career', 'in', 'financial',
  'services', 'are', 'you', 'looking', 'for', 'a', 'fast', 'paced',
  'dynamic', 'working', 'environment', 'our', 'client', 'is', 'proud', 'to',
  'be', 'an', 'independent', 'estate', 'agent', 'with', 'an', 'unbeatable',
  'local', 'knowledge', 'and', 'reputation', 'for', 'excellent', 'service',
  'they', 'have', 'a', 'mortgage', 'advisor', 'within', 'each', 'of', 'their',
  'branches', 'to', 'accommodate', 'their', 'customer', 's', 'needs', 'as',
  'a', 'trainee', 'mortgage', 'advisor', 'you', 'will', 'work', 'closely',
  'with', 'your', 'team', 'to', 'learn', 'and', 'develop', 'whilst',
  'achieving', 'shared', 'goals', 'in', 'addition', 'to', 'this', 'you',
  'will', 'have', 'the',
  'det
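
To see what the tokenization pattern does and does not match, the sketch below applies it to a single invented sentence. It uses the standard-library `re.findall` instead of `nltk.RegexpTokenizer`, on the assumption that both give the same matches for a pattern that contains only a non-capturing group.

```python
import re

# Same pattern as the tokenizer above: a run of letters, optionally followed
# by ONE hyphen- or apostrophe-joined run of letters
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Invented sample sentence
sample = "We're hiring a results-driven C++ developer; state-of-the-art tools!"
tokens = re.findall(pattern, sample.lower())
print(tokens)
```

Digits and symbols are dropped, and because the optional group can match only once, a word with several hyphens such as `state-of-the-art` is split into `state-of` and `the-art`.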

In [4]:
# Look at the vocabulary size and the total number of tokens
print("Vocabulary size: ", len(set(chain.from_iterable(tokenized_descriptions))))
print("Total number of tokens: ", sum(len(des) for des in tokenized_descriptions))

Vocabulary size:  89591
Total number of tokens:  13799127


Then, remove the words that are shorter than 2 characters or that appear in the stopword list.

In [5]:
# Load the stopwords into a set so that membership tests are fast
with open('./stopwords_en.txt') as f:
    stopwords = set(f.read().splitlines())

# Remove the words that are shorter than 2 characters or that appear in the stopword list
tokenized_descriptions = [[word for word in des if word not in stopwords and len(word) >= 2] for des in tokenized_descriptions]
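
The filtering rule can be checked on a toy example. The stopwords and documents below are invented for illustration; the sketch keeps the stopwords in a `set`, whose membership test takes constant time on average, whereas a list is scanned element by element.

```python
# Invented stopword list; the notebook loads the real one from stopwords_en.txt
stopwords = {"a", "are", "you", "to", "in"}

docs = [
    ["are", "you", "a", "results", "driven", "person"],
    ["customer", "s", "needs", "to", "start"],
]

# Keep a token only if it is not a stopword and has at least two characters
filtered = [[w for w in doc if w not in stopwords and len(w) >= 2] for doc in docs]
print(filtered)
```

The single-letter token `s` (left over from a possessive such as "customer's") is removed by the length check even though it is not in the toy stopword list.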

Next, remove the words that appear only once across all descriptions.

In [6]:
# Remove the words that appear only once based on term frequency
words = list(chain.from_iterable(tokenized_descriptions)) # put all words in a single list
word_freq = FreqDist(words)
appear_once_word = set([key for key, value in word_freq.items() if value < 2]) # get the words that appear only once

tokenized_descriptions = [[word for word in des if word not in appear_once_word] for des in tokenized_descriptions]
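
As a minimal sketch of this step on invented data, the example below counts term frequency with `collections.Counter` (used here in place of `FreqDist` so that the snippet runs without NLTK) and drops the words whose corpus-wide count is 1.

```python
from collections import Counter
from itertools import chain

# Invented toy corpus of tokenized descriptions
docs = [
    ["mortgage", "advisor", "branch"],
    ["mortgage", "advisor", "trainee"],
    ["mortgage", "mortgage", "estate"],
]

# Term frequency: total occurrences across the whole corpus
term_freq = Counter(chain.from_iterable(docs))
once_only = {word for word, count in term_freq.items() if count == 1}

cleaned = [[w for w in doc if w not in once_only] for doc in docs]
print(sorted(once_only))
print(cleaned)
```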

The top 50 most frequent words based on document frequency also need to be removed.

In [7]:
# Remove the top 50 most frequent words based on document frequency
word2 = list(chain.from_iterable([set(des) for des in tokenized_descriptions])) # count each word at most once per description
word_freq2 = FreqDist(word2)
top_50 = set([key for key, value in word_freq2.most_common(50)]) # get the top 50 most frequent words

tokenized_descriptions = [[word for word in des if word not in top_50] for des in tokenized_descriptions]
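
Document frequency differs from term frequency in that a word is counted at most once per description. The sketch below shows the difference on invented data, again with `collections.Counter` standing in for `FreqDist`; the cut-off of 2 is a placeholder for the 50 used in this task.

```python
from collections import Counter
from itertools import chain

# Invented toy corpus of tokenized descriptions
docs = [
    ["sales", "sales", "sales", "team"],
    ["team", "manager"],
    ["team", "manager", "care"],
]

term_freq = Counter(chain.from_iterable(docs))
# set(doc) collapses repeats, so each document contributes at most 1 per word
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))

print(term_freq["sales"], doc_freq["sales"])

# Words with the highest document frequency (top 2 here, top 50 in the task)
top_k = {word for word, _ in doc_freq.most_common(2)}
print(sorted(top_k))
```

Here `sales` has the joint-highest term frequency but appears in only one description, so it is not among the words removed by document frequency.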

In [8]:
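# Look at the vocabulary size and the total number of tokens again
print("Vocabulary size: ",len(set(list(chain.from_iterable(tokenized_descriptions)))))
# Note: `words` was built in cell [6], before the once-only and top-50 words were removed,
# so the total printed below is the token count after stopword and length filtering only
print("Total number of tokens: ", len(words))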

Vocabulary size:  40038
Total number of tokens:  7863307


Let's move on to extracting the top 10 bigrams based on term frequency and saving them in a txt file called <span style="color: red">bigram.txt</span>.

In [9]:
word3 = list(chain.from_iterable(tokenized_descriptions)) # put all words in a single list
bigrams = ngrams(word3, n = 2)
bigrams_freq = FreqDist(bigrams)
top_10 = bigrams_freq.most_common(10) # extract the top 10 bigrams

# Save the top 10 bigrams as bigram.txt; the with-statement closes the file automatically
with open("./bigram.txt", "w") as bigram_file:
    for (first, second), count in top_10:
        bigram_file.write("{} {},{}\n".format(first, second, count)) # output them in the required format
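
One property of the cell above is worth noting: because `word3` concatenates all descriptions, the last word of one advertisement and the first word of the next are also counted as a bigram. If that is not wanted, bigrams can be counted per description instead. The following is a minimal sketch on invented data, using `zip` and `collections.Counter` in place of `nltk.util.ngrams` and `FreqDist`.

```python
from collections import Counter

# Invented toy corpus of tokenized descriptions
docs = [
    ["mortgage", "advisor", "role"],
    ["mortgage", "advisor", "team"],
]

# Flattened corpus: also pairs the end of doc 1 with the start of doc 2
flat = [w for doc in docs for w in doc]
flat_counts = Counter(zip(flat, flat[1:]))

# Per-document counting: pairs never cross a description boundary
doc_counts = Counter()
for doc in docs:
    doc_counts.update(zip(doc, doc[1:]))

print(("role", "mortgage") in flat_counts)
print(("role", "mortgage") in doc_counts)

# Same "word1 word2,count" line format as bigram.txt
for (first, second), count in doc_counts.most_common(1):
    print("{} {},{}".format(first, second, count))
```

Whether the boundary-crossing pairs matter depends on the assignment's definition of a bigram; with a corpus this large they are unlikely to reach the top 10, but the per-document version avoids the question altogether.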

We also need to save all job advertisement text and information in a txt file called <span style="color: red">job_ads.txt</span>.

In [10]:
# Save all job advertisement text and information
tokenized_descriptions = [" ".join(des) for des in tokenized_descriptions] # join the words of each description with spaces

# The lists created at the beginning of the task and tokenized_descriptions all have the same length and
# consistent indices, so the length of any of them can serve as the loop bound (here we use ad_id)
with open("./job_ads.txt", "w", encoding = "utf-8") as job_ads: # the with-statement closes the file automatically
    for i in range(len(ad_id)):
        job_ads.write("ID: " + str(ad_id[i]) + "\n")
        job_ads.write("Category: " + category[i] + "\n")
        job_ads.write("Webindex: " + str(web_index[i]) + "\n")
        job_ads.write("Title: " + str(title[i]) + "\n")
        job_ads.write("Description: " + tokenized_descriptions[i] + "\n")
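
Since every advertisement is written as exactly five `Label: value` lines, the file can be read back by grouping lines in fives. The sketch below illustrates this on an in-memory sample; `read_job_ads` is a hypothetical helper, and the record values (including the category names) are invented.

```python
import io

# Invented two-record sample in the same five-line layout as job_ads.txt
sample = io.StringIO(
    "ID: 00001\nCategory: Sales\nWebindex: 111\n"
    "Title: Sales Rep\nDescription: sell products\n"
    "ID: 00002\nCategory: Engineering\nWebindex: 222\n"
    "Title: Engineer\nDescription: design systems\n"
)


def read_job_ads(handle, fields_per_record=5):
    """Group consecutive 'Label: value' lines into one dict per advertisement."""
    lines = handle.read().splitlines()
    records = []
    for start in range(0, len(lines), fields_per_record):
        record = {}
        for line in lines[start:start + fields_per_record]:
            # partition splits at the first ": " only
            label, _, value = line.partition(": ")
            record[label] = value
        records.append(record)
    return records


ads = read_job_ads(sample)
print(len(ads), ads[1]["Category"])
```

With the real file, an open file handle (`open("./job_ads.txt", encoding="utf-8")`) would take the place of the `io.StringIO` sample.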

Finally, we build a vocabulary from the pre-processed descriptions and store it in a txt file called <span style="color: red">vocab.txt</span>.

In [11]:
# Store the words in the txt file in alphabetical order
ordered_words = sorted(set(word3)) # sort the distinct words alphabetically (word3 is a single list containing all pre-processed tokens)
with open("./vocab.txt", "w", encoding = "utf-8") as vocab: # the with-statement closes the file automatically
    for index, word in enumerate(ordered_words): # enumerate makes the index value start from 0
        vocab.write("{}:{}\n".format(word, index))
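
The `word:index` format can be round-tripped into a dictionary, which is handy when the vocabulary is needed again later. The sketch below is an illustration on invented words, writing to an in-memory buffer rather than to `vocab.txt`.

```python
import io

# Invented token list with a repeated word
words = ["team", "advisor", "mortgage", "advisor"]

# Write: one 'word:index' line per distinct word, alphabetical, index from 0
buffer = io.StringIO()
for index, word in enumerate(sorted(set(words))):
    buffer.write("{}:{}\n".format(word, index))

# Read back into a word -> index dictionary
buffer.seek(0)
vocab_index = {}
for line in buffer:
    # rsplit on the last colon, since the index is always the final field
    word, index = line.rstrip("\n").rsplit(":", 1)
    vocab_index[word] = int(index)
print(vocab_index)
```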

## Summary
In this task, we extract the information from the job advertisement files, pre-process the descriptions, and store the results in three txt files (bigram.txt, job_ads.txt and vocab.txt) in the required formats.