### Advanced Python AI and ML Tools - Assignment 1

__Group Members:__
1) Aanal Patel - C0910376
2) Bimal Shresta - C0919385
3) Danilo Diaz - C0889539
4) Ernie Sumoso - C0881591

### Index
- __Step 1. Dataset Description (web scraped)__
- __Step 2. Data Wrangling (cleaning, formatting, structuring, validating)__
    - __Step 9. NLP techniques (data cleaning, stopword and punctuation removal, tokenizing, stemming, and lemmatization)__
- __Step 3. Plotting methods for distribution__
- __Step 4. Pandas profiling for EDA (exploratory data analysis)__
- __Step 5. Encoding methods, creating new numerical columns__
- __Step 6. Outlier identification (with boxplots and IQR)__
- __Step 7. Addressing outliers with Quantile-based flooring and capping, Trimming, and Log Transformation__
- __Step 8. Unsupervised learning methods__

### Step 1. Dataset Description (web scraped)

(Bimal add a description of what you did to web scrap the data here, what is the source and what were your steps)

In [1]:
import pandas as pd

# reading the web scraped data from the CSV file, setting the index column
df = pd.read_csv("job_data.csv", index_col=0)

# displaying the raw data
df.tail()

Unnamed: 0,job_title,company,salary,job_location,post,job_type,job_desc,company_qns,job_posted_date,job_link
2610,Level 2/3 Support Engineer,Fuse Technology Pty Ltd,,Sydney NSW,Help Desk & IT Support (Information & Communic...,Full time,The opportunityAs part of our exciting growth ...,Which of the following statements best describ...,2024-02-21,https://www.seek.com.au/job/73930150?type=stan...
2611,NIGHT SHIFT WAREHOUSE TEAM LEADER WANTED WETHE...,Labourforce,$47 per hour + penalties,"Wetherill Park, Sydney NSW","Warehousing, Storage & Distribution (Manufactu...",Contract/Temp,Our client is one of Australia's leading Manuf...,,2024-02-21,https://www.seek.com.au/job/73870879?type=stan...
2612,Casual Retail Assistant,Independent Living Specialists,"$31.11 per hour, plus super","Randwick, Sydney NSW",Retail Assistants (Retail & Consumer Products),Casual/Vacation,Independent Living Specialists is a fast-growi...,Do you have customer service experience?Do you...,2024-02-21,https://www.seek.com.au/job/73899163?type=stan...
2613,Studio Assistant,Cendre,,"Oxenford, Gold Coast QLD","Pickers & Packers (Manufacturing, Transport & ...",Full time,Cendré is a revered e-commerce jewellery brand...,,2024-02-21,https://www.seek.com.au/job/73875587?type=stan...
2614,Junior IT Support Officer,Hare & Forbes,,"Northmead, Sydney NSW",Help Desk & IT Support (Information & Communic...,Full time,"Parramatta locationWork with a close-knit, exp...",Do you have demonstrated experience diagnosing...,2024-02-21,https://www.seek.com.au/job/73868216?type=stan...


In [2]:
# display the number of rows, columns and the column names
def display_shape_and_colnames(df):
    print("Number of Rows:", df.shape[0])
    print("Number of Columns:", df.shape[1])
    print(df.columns)
    
display_shape_and_colnames(df)

Number of Rows: 9800
Number of Columns: 10
Index(['job_title', 'company', 'salary', 'job_location', 'post', 'job_type',
       'job_desc', 'company_qns', 'job_posted_date', 'job_link'],
      dtype='object')


Some of our __column names__ are __redundant__ because we are working with job data.

Let's delete the prefix __"job"__ from our column names.

Some other __column names__ are __abbreviated__ (e.g. "job_desc", "company_qns").

Let's __replace them with full names__ so we can have accurate column names.

In [3]:
def clean_colnames(df):
    # delete the prefix "job_" on our column names
    for column_name in df.columns.to_list():
        if column_name.startswith("job_"):
            # slice off the literal prefix (str.lstrip would strip a set of characters instead)
            df.rename(columns={column_name : column_name[len("job_"):]}, inplace=True)

    # rename abbreviated column names
    df.rename(columns={'desc':'description', 'company_qns':'company_questions', 'post':'department'}, inplace=True)

clean_colnames(df)
# display clean column names
df.head(2)

Unnamed: 0,title,company,salary,location,department,type,description,company_questions,posted_date,link
0,Experienced Support Worker (PPT & CAS),Ability Gateway,$35.50 per hour [PPT],"Wagga Wagga, Wagga Wagga & Riverina NSW",Aged & Disability Support (Community Services ...,Part time,About usWe are an outcome focused NDIS service...,Do you own or have regular access to a car?Whi...,2024-02-21,https://www.seek.com.au/job/73909631?type=prom...
1,Regional Manager - Inspire@HOME,CatholicCare Tasmania,,"Launceston, Launceston & North East TAS","Child Welfare, Youth & Family Services (Commun...",Full time,CatholicCare Tasmania is the primary social se...,,2024-02-21,https://www.seek.com.au/job/73909232?type=prom...
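
A short aside on prefix removal: `str.lstrip` takes a set of characters to strip, not a literal prefix, so it can remove more than intended. The sketch below illustrates the difference; `strip_prefix` and the column name `job_board` are hypothetical and only used for illustration.

```python
def strip_prefix(name: str, prefix: str) -> str:
    """Return name without the literal prefix, if the prefix is present."""
    return name[len(prefix):] if name.startswith(prefix) else name

# "job_board" is a made-up column name, not part of our dataset
stripped_chars = "job_board".lstrip("job_")           # strips any leading j, o, b, _ characters
stripped_prefix = strip_prefix("job_board", "job_")   # strips only the literal "job_" prefix

print(stripped_chars)    # ard
print(stripped_prefix)   # board
```

For the column names in our dataset both approaches happen to give the same result, but slicing (or `str.removeprefix` on Python 3.9+) is the safer habit.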


Now let's understand our columns by describing each one:
- __title__: title of the posted job
- __company__: name of the company that has posted the job
- __salary__: salary range for the job, can be defined per hour, monthly, annually, etc.
- __location__: geographical location of the job or company
- __department__: field or department of the job (e.g. IT, Sales, etc.)
- __type__: employment type of the job (e.g. Full time, Part time, Casual/Vacation, Contract/Temp)
- __description__: long description of the job posting
- __company_questions__: questions issued by the company to the applicants, according to the post
- __posted_date__: format yyyy-mm-dd
- __link__: link of the job posting

Now that we have a general understanding of our web scraped data, let's move on to the next step and apply our data wrangling methods.

### Step 2. Data Wrangling (cleaning, formatting, structuring, validating)

This is a crucial step, as we are dealing with real-world data that is often unclean and needs lots of processing.

But before performing any action, let's get to know our data by doing some basic analysis.

We will check the following stats by implementing functions:
- missing values per column
- duplicated rows
- number of unique values per column

In [4]:
def check_missing_values(df):
    # check for number of missing values per column
    print("# Missing Values")
    print(df.isna().sum())
    
    # check for % of missing values
    print("\n% Missing Values")
    print(df.isna().mean() * 100)
    
check_missing_values(df)

# Missing Values
title                   0
company                 0
salary               5216
location                0
department              0
type                    0
description             0
company_questions    5034
posted_date             0
link                    0
dtype: int64

% Missing Values
title                 0.000000
company               0.000000
salary               53.224490
location              0.000000
department            0.000000
type                  0.000000
description           0.000000
company_questions    51.367347
posted_date           0.000000
link                  0.000000
dtype: float64


As expected, many job posts do not include a salary range or any information about the salary.

It is no surprise that __more than half of our data has missing values for salary__.

On the other hand, we also have __more than half missing values for the company questions column__.

In [5]:
def check_duplicated_values(df):
    # check for number of duplicated values
    print("# Duplicated Values")
    print(df.duplicated().sum())
    
    # check for % of duplicated values
    print("\n% Duplicated Values")
    print(df.duplicated().mean() * 100)

check_duplicated_values(df)

df[df.duplicated()].tail(4)

# Duplicated Values
944

% Duplicated Values
9.63265306122449


Unnamed: 0,title,company,salary,location,department,type,description,company_questions,posted_date,link
2587,Pick Packers,Action Workforce,35,"Maddington, Perth WA","Warehousing, Storage & Distribution (Manufactu...",Casual/Vacation,Action Workforce are looking for Experienced P...,,2024-02-21,https://www.seek.com.au/job/73901168?type=stan...
2593,Accounts Person- KALGOORLIE RESIDENTS ONLY,Golden mile cleaning services,$30 – $33.50 per hour,"Kalgoorlie, Kalgoorlie, Goldfields & Esperance WA",Administrative Assistants (Administration & Of...,Part time,Job Title: Accounts Person We are currently se...,Which of the following statements best describ...,2024-02-21,https://www.seek.com.au/job/73908087?type=prom...
2603,Warehouse Assistant,Omni Recruit,,"Truganina, Melbourne VIC","Pickers & Packers (Manufacturing, Transport & ...",Casual/Vacation,Business is booming and we are currently seeki...,Do you agree to the privacy policy of Omni Rec...,2024-02-20,https://www.seek.com.au/job/73863322?type=stan...
2612,Casual Retail Assistant,Independent Living Specialists,"$31.11 per hour, plus super","Randwick, Sydney NSW",Retail Assistants (Retail & Consumer Products),Casual/Vacation,Independent Living Specialists is a fast-growi...,Do you have customer service experience?Do you...,2024-02-21,https://www.seek.com.au/job/73899163?type=stan...


In [6]:
df['type'].unique()

array(['Part time', 'Full time', 'Casual/Vacation', 'Contract/Temp',
       'Contract/Temp, Casual/Vacation, Part time',
       'Contract/Temp, Casual/Vacation, Full time, Part time',
       'Contract/Temp, Part time', 'Casual/Vacation, Full time'],
      dtype=object)

A considerable share of our data __(around 9.6%) consists of duplicated__ rows.

This can be __dangerous for analysis__, as it can affect multiple metrics and our model training.

We have to __deal with these duplicated values__ in future steps.
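
As a minimal sketch of what that later step could look like (an assumption, not necessarily the approach used further on), `DataFrame.drop_duplicates` keeps the first occurrence of each fully identical row. The toy dataframe below stands in for our scraped data.

```python
import pandas as pd

# toy data standing in for the scraped job postings
toy = pd.DataFrame({
    "title": ["Pick Packers", "Pick Packers", "Studio Assistant"],
    "company": ["Action Workforce", "Action Workforce", "Cendre"],
})

# number of rows that repeat an earlier row
n_duplicated = int(toy.duplicated().sum())

# keep the first occurrence of each duplicated row and rebuild the index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicated, len(deduped))
```

Note that `duplicated()` compares whole rows by default; passing `subset=[...]` would restrict the comparison to selected columns such as the link.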

In [7]:
def check_nunique_values(df):
    # check number of unique values per column
    print("# Unique Values per Column")
    for col in df.columns:
        print("'"+col+"'", "# of unique values:", df[col].nunique())
        
    # check % of unique values per column (relative to number of total rows in the dataset)
    print("\n% Unique Values per Column")
    for col in df.columns:
        print("'"+col+"'", "% of unique values:", round(df[col].nunique() * 100 / df.shape[0], 2), "%")
        
check_nunique_values(df)

# Unique Values per Column
'title' # of unique values: 5655
'company' # of unique values: 4965
'salary' # of unique values: 2645
'location' # of unique values: 1448
'department' # of unique values: 451
'type' # of unique values: 8
'description' # of unique values: 7958
'company_questions' # of unique values: 2730
'posted_date' # of unique values: 95
'link' # of unique values: 8664

% Unique Values per Column
'title' % of unique values: 57.7 %
'company' % of unique values: 50.66 %
'salary' % of unique values: 26.99 %
'location' % of unique values: 14.78 %
'department' % of unique values: 4.6 %
'type' % of unique values: 0.08 %
'description' % of unique values: 81.2 %
'company_questions' % of unique values: 27.86 %
'posted_date' % of unique values: 0.97 %
'link' % of unique values: 88.41 %


Some of our columns have a __large number of unique values__.

Although we still have not processed our values, we must __seek to reduce the number of unique values through data processing__.

There are some columns with a very high number of unique values __(>50% of total rows)__. These columns are:
- title
- company
- description
- link

Let's start __dealing with the unique values per column.__

To reduce the number of unique values, let's apply some NLP methods to each column's values.

We will start with some basic cleaning that includes:
- removing punctuation
- removing digits
- lowercasing all letters
- removing extra whitespace

To accomplish this, we will implement a __class called NLP__ that will __contain all__ of our implemented __NLP methods/techniques__ that will be __applied on our data__.
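
Before wrapping these steps in a class, here is a minimal standalone sketch of the four cleaning steps applied to a single string. The helper name `basic_clean` is hypothetical, and it uses `str.split` instead of NLTK's `word_tokenize` so it runs without any NLTK data; that substitution is an assumption made for illustration.

```python
import re
import string

def basic_clean(text: str) -> str:
    """Remove punctuation and digits, lowercase, and collapse whitespace."""
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    text = text.lower()
    # str.split() with no arguments drops repeated whitespace
    return ' '.join(text.split())

print(basic_clean("Level 2/3 Support Engineer"))        # level support engineer
print(basic_clean("Regional Manager - Inspire@HOME"))   # regional manager inspirehome
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as typographic dashes, curly apostrophes, and emojis survive this step.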

In [8]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import numpy as np
import string
import re

# class containing our implemented NLP techniques and methods
class NLP():
    
    # remove all punctuation from a word (string)
    def remove_punctuation(self, word):
        if not isinstance(word, str): return word
        return word.translate(str.maketrans('', '', string.punctuation))
    
    # remove all digits/numbers from a word (string)
    def remove_digits(self, word):
        if not isinstance(word, str): return word
        return re.sub(r'\d+', '', word)
    
    # checks if word is a string and returns lower cased
    def lower_word(self, word):
        if not isinstance(word, str): return word
        return word.lower()

    # perform basic operations to clean 1 column of a dataframe
    def basic_clean_text_column(self, df, colname):
        print("Basic cleaning on column '" + colname + "':")
        nunique = df[colname].nunique()
        print("# Unique values before cleaning:", nunique)

        # map every unique raw value to its cleaned version
        cleaned = {}
        for og_value in df[colname].unique():
            # if we are dealing with a null value, don't modify anything
            # (pd is imported in the first cell)
            if pd.isna(og_value): continue

            # remove punctuation from the column value
            value = self.remove_punctuation(str(og_value))

            # remove digits from column value
            value = self.remove_digits(value)

            # lower case column value
            value = self.lower_word(value)

            # word tokenize the column value and re-join with single spaces
            cleaned[og_value] = ' '.join(word_tokenize(value))

        # update the column in one pass instead of one chained in-place replace per value
        df[colname] = df[colname].map(lambda v: cleaned.get(v, v))
        new_nunique = df[colname].nunique()
        print("# Unique values after cleaning:", new_nunique)
        print("% of unique values reduction:", round(100 - (new_nunique*100/nunique),2), "%")
    

Now that we have implemented a class for our methods,

let's go ahead and __apply a basic cleaning on all our columns__.

Then, we can __compare values before vs after cleaning__.

In [9]:
def clean_and_compare_column(df, colname):
    # save the raw column data into a new dataframe just to compare before vs after cleaning
    df_compare = df[[colname]].copy()

    # perform the basic cleaning on the column
    nlp = NLP()
    nlp.basic_clean_text_column(df, colname)

    # compare before vs after
    df_compare["clean "+colname] = df[colname]
    display(df_compare)

clean_and_compare_column(df, 'title')

Basic cleaning on column 'title':
# Unique values before cleaning: 5655
# Unique values after cleaning: 5541
% of unique values reduction: 2.02 %


Unnamed: 0,title,clean title
0,Experienced Support Worker (PPT & CAS),experienced support worker ppt cas
1,Regional Manager - Inspire@HOME,regional manager inspirehome
2,Family Support Worker,family support worker
3,CPS Case Manager,cps case manager
4,Intake Worker,intake worker
...,...,...
2610,Level 2/3 Support Engineer,level support engineer
2611,NIGHT SHIFT WAREHOUSE TEAM LEADER WANTED WETHE...,night shift warehouse team leader wanted wethe...
2612,Casual Retail Assistant,casual retail assistant
2613,Studio Assistant,studio assistant


After this __1st experiment__ of __cleaning the 'title' column__, we notice that we have __reduced the number of unique values by 114__.

This is equivalent to approximately __2% of the total unique values__, which is __not a significant reduction__.

However, we have considerably cleaned our raw texts, and this will allow us to apply further NLP techniques that should do better at reducing the number of unique values.
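
For reference, the reduction percentage reported by the cleaning method can be reproduced by hand. The helper name `pct_reduction` below is hypothetical; the formula is the one used in the printed output.

```python
def pct_reduction(before: int, after: int) -> float:
    """Percentage drop in the number of unique values, rounded to 2 decimals."""
    return round(100 - after * 100 / before, 2)

# 'title' column: 5655 unique values before cleaning, 5541 after
print(5655 - 5541)                  # 114
print(pct_reduction(5655, 5541))    # 2.02
```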

But before getting ahead, let's __apply the same basic cleaning to the rest of our text columns__:
- company
- location
- department
- description
- company_questions

In [10]:
# define the remaining text columns that we need to perform a basic clean
text_cols = ['company', 'location', 'department', 'description', 'company_questions']

# implement a function to perform the cleaning on these columns
def clean_and_compare_columns(df, cols):
    for colname in cols:
        clean_and_compare_column(df, colname)

# call the implemented function
clean_and_compare_columns(df, text_cols)

Basic cleaning on column 'company':
# Unique values before cleaning: 4965
# Unique values after cleaning: 4965
% of unique values reduction: 0.0 %


Unnamed: 0,company,clean company
0,Ability Gateway,ability gateway
1,CatholicCare Tasmania,catholiccare tasmania
2,Community Gro,community gro
3,Open Minds,open minds
4,The Centre for Women & Co.,the centre for women co
...,...,...
2610,Fuse Technology Pty Ltd,fuse technology pty ltd
2611,Labourforce,labourforce
2612,Independent Living Specialists,independent living specialists
2613,Cendre,cendre


Basic cleaning on column 'location':
# Unique values before cleaning: 1448
# Unique values after cleaning: 1448
% of unique values reduction: 0.0 %


Unnamed: 0,location,clean location
0,"Wagga Wagga, Wagga Wagga & Riverina NSW",wagga wagga wagga wagga riverina nsw
1,"Launceston, Launceston & North East TAS",launceston launceston north east tas
2,"Townsville, Northern QLD",townsville northern qld
3,"Nambour, Sunshine Coast QLD",nambour sunshine coast qld
4,"Underwood, Brisbane QLD",underwood brisbane qld
...,...,...
2610,Sydney NSW,sydney nsw
2611,"Wetherill Park, Sydney NSW",wetherill park sydney nsw
2612,"Randwick, Sydney NSW",randwick sydney nsw
2613,"Oxenford, Gold Coast QLD",oxenford gold coast qld


Basic cleaning on column 'department':
# Unique values before cleaning: 451
# Unique values after cleaning: 451
% of unique values reduction: 0.0 %


Unnamed: 0,department,clean department
0,Aged & Disability Support (Community Services ...,aged disability support community services dev...
1,"Child Welfare, Youth & Family Services (Commun...",child welfare youth family services community ...
2,"Child Welfare, Youth & Family Services (Commun...",child welfare youth family services community ...
3,Community Development (Community Services & De...,community development community services devel...
4,"Child Welfare, Youth & Family Services (Commun...",child welfare youth family services community ...
...,...,...
2610,Help Desk & IT Support (Information & Communic...,help desk it support information communication...
2611,"Warehousing, Storage & Distribution (Manufactu...",warehousing storage distribution manufacturing...
2612,Retail Assistants (Retail & Consumer Products),retail assistants retail consumer products
2613,"Pickers & Packers (Manufacturing, Transport & ...",pickers packers manufacturing transport logistics


Basic cleaning on column 'description':
# Unique values before cleaning: 7958
# Unique values after cleaning: 7928
% of unique values reduction: 0.38 %


Unnamed: 0,description,clean description
0,About usWe are an outcome focused NDIS service...,about uswe are an outcome focused ndis service...
1,CatholicCare Tasmania is the primary social se...,catholiccare tasmania is the primary social se...
2,Community Gro Inc is a community-based non-pro...,community gro inc is a communitybased nonprofi...
3,As a Case Manager for Coastal Supports at Open...,as a case manager for coastal supports at open...
4,About Us and Our Team Culture   At The Centre ...,about us and our team culture at the centre fo...
...,...,...
2610,The opportunityAs part of our exciting growth ...,the opportunityas part of our exciting growth ...
2611,Our client is one of Australia's leading Manuf...,our client is one of australias leading manufa...
2612,Independent Living Specialists is a fast-growi...,independent living specialists is a fastgrowin...
2613,Cendré is a revered e-commerce jewellery brand...,cendré is a revered ecommerce jewellery brand ...


Basic cleaning on column 'company_questions':
# Unique values before cleaning: 2730
# Unique values after cleaning: 2728
% of unique values reduction: 0.07 %


Unnamed: 0,company_questions,clean company_questions
0,Do you own or have regular access to a car?Whi...,do you own or have regular access to a carwhic...
1,,
2,Which of the following statements best describ...,which of the following statements best describ...
3,,
4,Which of the following statements best describ...,which of the following statements best describ...
...,...,...
2610,Which of the following statements best describ...,which of the following statements best describ...
2611,,
2612,Do you have customer service experience?Do you...,do you have customer service experiencedo you ...
2613,,


It seems that __most of the columns have not reduced their number of unique values yet__.

Let's take a look at the entire __dataframe__ in its __current clean version__.

In [11]:
# display our current dataframe version
df.head()

Unnamed: 0,title,company,salary,location,department,type,description,company_questions,posted_date,link
0,experienced support worker ppt cas,ability gateway,$35.50 per hour [PPT],wagga wagga wagga wagga riverina nsw,aged disability support community services dev...,Part time,about uswe are an outcome focused ndis service...,do you own or have regular access to a carwhic...,2024-02-21,https://www.seek.com.au/job/73909631?type=prom...
1,regional manager inspirehome,catholiccare tasmania,,launceston launceston north east tas,child welfare youth family services community ...,Full time,catholiccare tasmania is the primary social se...,,2024-02-21,https://www.seek.com.au/job/73909232?type=prom...
2,family support worker,community gro,$40 – $44 per hour,townsville northern qld,child welfare youth family services community ...,Full time,community gro inc is a communitybased nonprofi...,which of the following statements best describ...,2024-02-19,https://www.seek.com.au/job/73832771?type=stan...
3,cps case manager,open minds,$82k – 84k + super + salary packaging + benefits,nambour sunshine coast qld,community development community services devel...,Full time,as a case manager for coastal supports at open...,,2024-02-21,https://www.seek.com.au/job/73901240?type=stan...
4,intake worker,the centre for women co,$41 – $42 per hour,underwood brisbane qld,child welfare youth family services community ...,Full time,about us and our team culture at the centre fo...,which of the following statements best describ...,2024-02-20,https://www.seek.com.au/job/73861002?type=stan...


One important step during text processing and cleaning is the __removal of stopwords__.

We have seen __lots of stopwords across our dataset__.

Our next step for cleaning is to remove all those stopwords.

However, there is a catch. We must __pay attention to words that are considered stopwords but carry important meaning__.

- __Example:__ The word __"it"__ is normally considered a stopword. However, "IT" in job posting titles may refer to "Information Technology".

This example and many others need to be considered before simply deleting all stopwords.

On the other hand, __some words that are not considered stopwords may need to be deleted__. In those cases we need to add them as stopwords.

To get a sense of which stopwords we must remove and keep, let's start by identifying the words of length 1, 2, and 3 in our columns.

After taking a general look at them we may __identify which ones to remove, and which ones to keep__.

In [12]:
# return a list of lists, each list will contain the words of length 1, 2, 3... n
def identify_words_len_1_to_n(df, colname, n):
    # set n number of empty lists
    words = [[] for _ in range(n)]
    
    # loop through unique values of the column
    for value in df[colname].unique():
        # if it's not a string, go to the next value
        if not isinstance(value, str): continue
        
        # tokenize the value, loop through the words, and if the word length is in range, add the word to the corresponding list
        tokens = word_tokenize(value)
        for word in tokens:
            if len(word) <= n:
                words[len(word)-1].append(word)
                
    # delete repeated values in the lists and sort them
    words_len_1_to_n = [sorted(list(set(words_sublist))) for words_sublist in words]
    
    # print the results (each list)
    print("Words of length 1 to", n, "on column '"+colname+"'")
    for i in range(n):
        print("- Words Length", i+1)
        print(words_len_1_to_n[i])
    return words_len_1_to_n

words_len_1_to_3 = identify_words_len_1_to_n(df, 'title', 3)

Words of length 1 to 3 on column 'title'
- Words Length 1
['a', 'd', 'f', 'i', 'k', 'l', 'm', 'n', 'p', 's', 't', 'v', 'w', 'x', 'y', '–', '’', '💡', '🤝']
- Words Length 2
['ah', 'ai', 'am', 'an', 'ao', 'ap', 'ar', 'as', 'at', 'au', 'av', 'ba', 'bb', 'bi', 'bp', 'ca', 'cc', 'ci', 'co', 'cx', 'dc', 'do', 'ds', 'ea', 'el', 'er', 'fm', 'fq', 'ft', 'gc', 'gm', 'go', 'gp', 'hc', 'hm', 'hr', 'ic', 'in', 'it', 'iv', 'ld', 'le', 'lf', 'lo', 'ma', 'mc', 'md', 'mq', 'mr', 'ms', 'mt', 'my', 'nd', 'no', 'nt', 'od', 'of', 'on', 'oo', 'or', 'ot', 'pa', 'pc', 'ph', 'pm', 'po', 'pt', 'pw', 'px', 'qa', 'qc', 'rd', 're', 'rn', 'sa', 'sc', 'sr', 'st', 'sw', 'to', 'tq', 'up', 'us', 'vp', 'wa', 'we', 'yr', '⚽️']
- Words Length 3
['abn', 'acm', 'act', 'age', 'ags', 'aid', 'ain', 'air', 'ald', 'alh', 'ali', 'all', 'ame', 'and', 'anz', 'aod', 'app', 'aps', 'apy', 'arc', 'are', 'aso', 'asx', 'atm', 'aus', 'aws', 'bar', 'bas', 'bay', 'bdm', 'bft', 'bgs', 'bid', 'bms', 'bom', 'box', 'bus', 'bws', 'c

For our title column, all words of length 1 need to be removed, as they don't bring any value to our analysis.

The only 1-length string that will not be removed is the apostrophe, to keep word consistency.
- __’__: apostrophe

From our 2-length words, we will remove most of them except for the following common job acronyms:
- __hr__: Human Resources
- __it__: Information Technology

From the 3-length words, again we will remove most of them except for the following:
- __ceo__: Chief Executive Officer
- __cfo__: Chief Financial Officer
- __aws__: Amazon Web Services
- __pmo__: Project Management Office
- __pcp__: Primary Care Physician
- __crm__: Customer Relationship Management
- __sap__: SAP (Systems, Applications and Products), a leading ERP vendor
- __app__: application
- __dev__: developer
- __lab__: laboratory
- __web__: internet
- __law__: self-explanatory

Let's perform the same operation with the rest of the text columns.

In [13]:
def identify_words_len_1_to_n_columns(df, text_columns, ns):
    # loop through the specified columns and identify the words of length 1 to n
    words_per_col = []
    for i, colname in enumerate(text_columns):
        words_per_col.append(identify_words_len_1_to_n(df, colname, ns[i]))
        print("\n")
    return words_per_col

# define the word lengths per text column
text_cols = ['title', 'company', 'location', 'department', 'description', 'company_questions']
word_max_lens = [3, 3, 3, 3, 2, 3]
print("Text columns:", text_cols, end='\n')
print("Words max length:", word_max_lens, end='\n\n')
words_per_col = identify_words_len_1_to_n_columns(df, text_cols, word_max_lens)

Text columns: ['title', 'company', 'location', 'department', 'description', 'company_questions']
Words max length: [3, 3, 3, 3, 2, 3]

Words of length 1 to 3 on column 'title'
- Words Length 1
['a', 'd', 'f', 'i', 'k', 'l', 'm', 'n', 'p', 's', 't', 'v', 'w', 'x', 'y', '–', '’', '💡', '🤝']
- Words Length 2
['ah', 'ai', 'am', 'an', 'ao', 'ap', 'ar', 'as', 'at', 'au', 'av', 'ba', 'bb', 'bi', 'bp', 'ca', 'cc', 'ci', 'co', 'cx', 'dc', 'do', 'ds', 'ea', 'el', 'er', 'fm', 'fq', 'ft', 'gc', 'gm', 'go', 'gp', 'hc', 'hm', 'hr', 'ic', 'in', 'it', 'iv', 'ld', 'le', 'lf', 'lo', 'ma', 'mc', 'md', 'mq', 'mr', 'ms', 'mt', 'my', 'nd', 'no', 'nt', 'od', 'of', 'on', 'oo', 'or', 'ot', 'pa', 'pc', 'ph', 'pm', 'po', 'pt', 'pw', 'px', 'qa', 'qc', 'rd', 're', 'rn', 'sa', 'sc', 'sr', 'st', 'sw', 'to', 'tq', 'up', 'us', 'vp', 'wa', 'we', 'yr', '⚽️']
- Words Length 3
['abn', 'acm', 'act', 'age', 'ags', 'aid', 'ain', 'air', 'ald', 'alh', 'ali', 'all', 'ame', 'and', 'anz', 'aod', 'app', 'aps', 'apy', 

Now that we have identified more words to remove from our own data,

let's __implement a function that removes all stopwords__ from the English language, on top of the ones identified in our analysis of words of length 1 to 3.

Let's also keep in mind the __list of values that should not be removed__ (from the same analysis).

But first, let's __define our additional stopwords and our exceptions__ as mentioned before.

In [14]:
from nltk import flatten # convert nested list into 1D list

def set_additional_stopwords(words_per_col):
    # set our additional stopwords making use of the identified 1 to 3 length words for each column
    additionals = []
    for column_words in words_per_col:
        # make sure we only have unique values by using set
        additionals.append(list(set(flatten(column_words))))
    return additionals

additionals = set_additional_stopwords(words_per_col)

# set the exceptions manually based on our previous word length analysis
exceptions = ['’', 'hr', 'it', 'ceo', 'cfo', 'aws', 'pmo', 'pcp', 'crm', 'sap', 'app', 'dev', 'lab', 'web', 'law']

Now that we have defined the additional stopwords and the exceptions,

let's __implement a new class that stores our stopword removal methods__.

We will use this class to perform our stop words removal __taking into account our additional stopwords and exceptions__.

This will be __performed for every text column__.
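
The core idea can be sketched on a single token list before looking at the full class. The stopword set below is a small hand-picked stand-in (an assumption for illustration; the notebook itself uses NLTK's English stopword list), and `filter_tokens` is a hypothetical helper name.

```python
# hypothetical, hand-picked stopwords; not NLTK's actual English list
toy_stopwords = {"a", "an", "and", "for", "in", "it", "of", "the"}
# words that look like stopwords but carry meaning in job titles
toy_exceptions = {"it"}

def filter_tokens(tokens, stop_set, keep):
    """Drop tokens that are stopwords unless they are listed as exceptions."""
    return [t for t in tokens if t not in stop_set or t in keep]

tokens = "help desk and it support for the team".split()
print(filter_tokens(tokens, toy_stopwords, toy_exceptions))
# ['help', 'desk', 'it', 'support', 'team']
```

Using sets for the stopwords and exceptions keeps each membership check constant-time, which matters when the same filter runs over thousands of long descriptions.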

In [15]:
class NLP_stopwords():
    
    def remove_stopwords_columns(self, df, colnames, additionals=[], exceptions=[]):
        if additionals == []:
            additionals = [[] for _ in range(len(colnames))]
        if len(colnames) != len(additionals):
            raise Exception("Column names length must be equal to the additional stop words.")

        # remove stopwords on specified columns
        for i, colname in enumerate(colnames):
            self.remove_stopwords_column(df, colname, additionals[i], exceptions)

    def remove_stopwords_column(self, df, colname, additional=[], exceptions=[]):
        print("Removing stopwords on column '" + colname + "'")
        nunique = df[colname].nunique()
        print("# Unique values with stopwords:", df[colname].nunique())        
        
        # loop through unique values of the column
        for value in df[colname].unique():
            # make sure the value is a string
            if not isinstance(value, str): continue
            
            # tokenize the unique column value
            tokens = word_tokenize(value)

            # remove stopwords
            self.remove_stopwords(tokens, additional, exceptions)

            # update the df column (assign back instead of a chained in-place replace)
            df[colname] = df[colname].replace(value, ' '.join(tokens))
        
        new_nunique = df[colname].nunique()
        print("# Unique values without stopwords:", df[colname].nunique())
        print("% of unique values reduction:", round(100 - (new_nunique*100/nunique),2), "%", end="\n\n")

    def remove_stopwords(self, tokens, additional=[], exceptions=[]):
        # remove stopwords on a list of word tokens
        i = 0
        # add the additional parameter stopwords
        total_stopwords = stopwords.words('english') + additional
        while i < len(tokens):
            word = tokens[i]
            # if the word is in exceptions, don't remove it
            if word in total_stopwords and word not in exceptions:
                tokens.pop(i)
                i -= 1
            i += 1

nlp = NLP_stopwords()
nlp.remove_stopwords_columns(df, text_cols, additionals, exceptions)

Removing stopwords on column 'title'
# Unique values with stopwords: 5541
# Unique values without stopwords: 5398
% of unique values reduction: 2.58 %

Removing stopwords on column 'company'
# Unique values with stopwords: 4965
# Unique values without stopwords: 4704
% of unique values reduction: 5.26 %

Removing stopwords on column 'location'
# Unique values with stopwords: 1448
# Unique values without stopwords: 1439
% of unique values reduction: 0.62 %

Removing stopwords on column 'department'
# Unique values with stopwords: 451
# Unique values without stopwords: 449
% of unique values reduction: 0.44 %

Removing stopwords on column 'description'
# Unique values with stopwords: 7928
# Unique values without stopwords: 7927
% of unique values reduction: 0.01 %

Removing stopwords on column 'company_questions'
# Unique values with stopwords: 2728
# Unique values without stopwords: 2728
% of unique values reduction: 0.0 %



Now that we have __removed stopwords__ (including the additional stopwords we identified, while keeping some exceptions),

we see that the __number of unique values has dropped by only a small percentage__ (between 0% and 5.26%, depending on the column).

However, our text columns __now contain cleaner data, made up mostly of relevant words__.
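
As a minimal sketch of how additional stopwords and exceptions interact, the snippet below filters a token list against a tiny hand-written stopword set. The names `BASE_STOPWORDS` and `filter_stopwords`, as well as the sample tokens, are hypothetical stand-ins for illustration only; the notebook itself relies on NLTK's English stopword list.

```python
# Hypothetical stand-in for NLTK's English stopword list (assumption, kept tiny on purpose)
BASE_STOPWORDS = {"the", "of", "and", "a", "to", "in", "it"}

def filter_stopwords(tokens, additional=(), exceptions=()):
    """Return the tokens that are not stopwords; words in exceptions are always kept."""
    stop = BASE_STOPWORDS | set(additional)
    keep = set(exceptions)
    return [t for t in tokens if t not in stop or t in keep]

# made-up tokens from a job title / company name
tokens = ["head", "of", "it", "support", "pty", "ltd"]

# 'pty' and 'ltd' are treated as extra stopwords; 'it' is protected as an exception
cleaned = filter_stopwords(tokens, additional=["pty", "ltd"], exceptions=["it"])
print(cleaned)
```

In this toy case, protecting 'it' matters because the word could stand for information technology rather than the pronoun.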

Let's perform one more step to __group our unique values__ and hopefully __reduce the number of uniques significantly__.

For starters, let's __find our unigrams, bigrams, and trigrams__.
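
To make the idea concrete, here is a small sketch of what unigrams, bigrams, and trigrams look like for a single value. The helper `ngrams_of` and the sample title are hypothetical and only illustrate the sliding-window idea; a value with k tokens yields k - n + 1 n-grams.

```python
def ngrams_of(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# hypothetical, already-cleaned job title split on whitespace
tokens = "senior property manager sydney".split()

unigrams = ngrams_of(tokens, 1)
bigrams = ngrams_of(tokens, 2)
trigrams = ngrams_of(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```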

Once again, we will __define a third NLP class to hold our newly implemented methods__.

In [16]:
class NLP_ngrams():
        
    def get_column_ngram(self, df, colname, n=1):
        # get an n-gram dictionary (example: bigrams) of 1 specified column
        ngrams = {}
        
        # iterate through the values of the column
        for value in df[colname]:
            # tokenize the value only if it's a string
            if not isinstance(value, str): continue
            tokens = word_tokenize(value)
            
            # for each window of n consecutive tokens, add or update its count in the dictionary of ngrams
            for j in range(len(tokens) - n + 1):
                ngram = ' '.join(tokens[j:j+n])
                ngrams[ngram] = ngrams.get(ngram, 0) + 1
        
        # sort the dictionary of ngrams by value in descending order
        return dict(sorted(ngrams.items(), key = lambda x : x[1], reverse=True))
    
    def get_column_ngrams(self, df, colname, n=1):
        # get all n-grams from one specific column
        # example: with n = 3, ngrams will store a list of 3 dictionaries: unigrams, bigrams, and trigrams
        
        if n < 1: raise ValueError("n in n-grams must be an integer greater than or equal to 1")
        ngrams = [{} for _ in range(n)]
        # loop through n, populating the n-gram (unigram, bigrams, etc.)
        for i in range(n):
            ngrams[i] = self.get_column_ngram(df, colname, i+1)
        return ngrams
        
    def get_columns_ngrams(self, df, colnames, n=1):
        # get all ngrams (1-n) from all columns specified
        # dictionary, keys = column names, values = ngram list returned from get_column_ngrams
        columns_ngrams = {}
        for colname in colnames:
            columns_ngrams[colname] = self.get_column_ngrams(df, colname, n)
        return columns_ngrams
    

            
nlp = NLP_ngrams()
columns_ngrams_1_to_n = nlp.get_columns_ngrams(df, text_cols, 3)

Let's briefly explain the previous code results.

We have stored a __dictionary__ in the variable __'columns_ngrams_1_to_n'__.

This dictionary has the __column names as keys__. In this case we have only __text columns: ['title', 'company', 'location', 'department', 'description', 'company_questions']__

The __value for each key is a list__.

This __list contains one dictionary per n-gram size__. __Each dictionary maps an n-gram to the number of times it occurs in the column__.

__Example:__ key='title', value = [{unigrams dictionary}, {bigrams dictionary}, {trigrams dictionary}]
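
The nested structure can be sketched with a toy dictionary. Everything below is illustrative: `toy_columns_ngrams` uses made-up counts, and `get_ngram_counts` is a hypothetical helper that is not part of the notebook. The point to notice is that the dictionary for n-grams of size n sits at list index n - 1.

```python
# toy stand-in for columns_ngrams_1_to_n, with made-up counts (assumption)
toy_columns_ngrams = {
    "title": [
        {"manager": 3, "property": 2},      # index 0: unigrams
        {"property manager": 2},            # index 1: bigrams
        {"senior property manager": 1},     # index 2: trigrams
    ],
}

def get_ngram_counts(columns_ngrams, colname, n):
    """Return the n-gram dictionary of a column; size n is stored at list index n - 1."""
    return columns_ngrams[colname][n - 1]

title_bigrams = get_ngram_counts(toy_columns_ngrams, "title", 2)
print(title_bigrams)
```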

Now, let's take a __closer look at an n-gram dictionary__.

In [17]:
# access bigrams from the column 'title', print only the 21 most common bigrams (the loop breaks after index 20)
print("{")
for i, (key, value) in enumerate(columns_ngrams_1_to_n['title'][1].items()):
    print("'"+key+"':", str(value)+",")
    if i == 20: break
print("}")

{
'property manager': 199,
'support officer': 166,
'general manager': 153,
'administration officer': 142,
'administration assistant': 140,
'people culture': 138,
'business partner': 138,
'customer service': 119,
'part time': 116,
'human resources': 115,
'sales assistant': 115,
'support worker': 104,
'real estate': 103,
'assistant accountant': 98,
'accounts payable': 97,
'it support': 95,
'financial accountant': 90,
'finance manager': 88,
'medical receptionist': 86,
'team member': 86,
'executive officer': 83,
}


In [18]:
# access trigrams from the column 'company', print only the 21 most common trigrams (the loop breaks after index 20)
print("{")
for i, (key, value) in enumerate(columns_ngrams_1_to_n['company'][2].items()):
    print("'" + key + "':", str(value) + ",")
    if i == 20: break
print("}")

{
'australian federal police': 34,
'recruitment real estate': 27,
'sharp carter accounting': 27,
'carter accounting clerical': 27,
'gough recruitment real': 26,
'real estate property': 26,
'estate property development': 26,
'property development construction': 26,
'department communities justice': 25,
'aldi stores australia': 25,
'hospital health service': 23,
'perigon group limited': 23,
'department health queensland': 21,
'australian government solicitor': 20,
'local health district': 19,
'eden ritchie recruitment': 19,
'allianz australia insurance': 18,
'randstad business support': 17,
'australian clinical labs': 17,
'australian football league': 17,
'vincent paul society': 15,
}


In [19]:
# access unigrams from the column 'location', print only the 21 most common unigrams (the loop breaks after index 20)
print("{")
for i, (key, value) in enumerate(columns_ngrams_1_to_n['location'][0].items()):
    print("'" + key + "':", str(value) + ",")
    if i == 20: break
print("}")

{
'sydney': 2287,
'melbourne': 1793,
'brisbane': 1372,
'coast': 1130,
'perth': 778,
'north': 566,
'west': 473,
'south': 464,
'adelaide': 455,
'newcastle': 340,
'central': 324,
'gold': 314,
'maitland': 256,
'hunter': 237,
'canberra': 232,
'valley': 198,
'park': 184,
'sunshine': 181,
'toowoomba': 177,
'wagga': 162,
'wollongong': 159,
}


We have seen some examples of our stored n-gram dictionaries (unigrams, bigrams, and trigrams, for all our text columns).
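
The three cells above repeat the same printing loop, and because the `i == 20` check runs after the print, each one shows 21 entries. As a possible alternative, the sketch below takes exactly the first k entries with `itertools.islice`. The helper `top_k` and the toy counts are hypothetical; it assumes the dictionary is already sorted by count in descending order, as the dictionaries returned by `get_column_ngram` are.

```python
from itertools import islice

def top_k(sorted_counts, k=20):
    """Return the first k (ngram, count) pairs of a dictionary already sorted by count."""
    return list(islice(sorted_counts.items(), k))

# toy dictionary, already sorted in descending order of count (made-up values)
toy_unigrams = {"sydney": 5, "melbourne": 4, "brisbane": 3, "perth": 2}

top_two = top_k(toy_unigrams, k=2)
print(top_two)
```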

Let's go ahead and __count these unigrams, bigrams, and trigrams per column__.

In [20]:
def count_columns_ngrams_unique_values(columns_ngrams):
    # for each column, save the number of distinct n-grams (keys) within each n-gram dictionary
    nuniques = {}
    for colname, ngrams_list in columns_ngrams.items():
        nuniques[colname] = [len(ngrams) for ngrams in ngrams_list]

    return nuniques

nuniques = count_columns_ngrams_unique_values(columns_ngrams_1_to_n)
nuniques

{'title': [2885, 8696, 8268],
 'company': [4652, 5658, 2687],
 'location': [1326, 1751, 1007],
 'department': [351, 575, 653],
 'description': [103283, 819322, 1433393],
 'company_questions': [2762, 7674, 11191]}

We have __many n-grams for each column__; for example, 'description' alone has more than 1.4 million distinct trigrams.

In our next step, we will __replace each dataset value with the most common n-gram it contains__, where "most common" refers to the n-gram counts of that same column.

This will __standardize our values and significantly reduce the number of unique values__.
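
Before looking at the class, here is a minimal sketch of the replacement rule on toy data. The function `most_common_ngram`, the bigram counts, and the titles are all made up for illustration: each value is replaced by the bigram it contains that has the highest column-wide count, and values without any known bigram are left untouched.

```python
def most_common_ngram(tokens, n, ngram_counts):
    """Return the n-gram inside tokens with the highest column-wide count, or None."""
    best, best_count = None, 0
    for i in range(len(tokens) - n + 1):
        candidate = " ".join(tokens[i:i + n])
        count = ngram_counts.get(candidate, 0)
        if count > best_count:
            best, best_count = candidate, count
    return best

# made-up bigram counts for a hypothetical 'title' column
bigram_counts = {"property manager": 5, "senior property": 2, "manager sydney": 1}

titles = ["senior property manager", "property manager sydney", "chef"]

# fall back to the original value when no known bigram is found
standardized = [most_common_ngram(t.split(), 2, bigram_counts) or t for t in titles]
print(standardized)
```

In this toy case, three distinct titles collapse into two, which is the effect we are after on the real columns.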

In [21]:
# first, let's reset the rows index, dropping the old index instead of keeping it as a column
df.reset_index(drop=True, inplace=True)

class NLP_standardization():
    
    def standardize_column_based_on_ngrams(self, df, colname, ngrams_list, n):
        # dictionary of n-grams (unigrams, bigrams, or trigrams)
        ngrams = ngrams_list[n-1]
        
        # iterate through the values of the column (idx is the row label)
        for idx, value in df[colname].items():
            # make sure value is a string
            if not isinstance(value, str): continue
            
            # word tokenize the value
            tokens = word_tokenize(value)
            
            # loop through all possible n-grams and keep the one with the highest count in the dictionary of n-grams
            most_common_ngram = "", 0
            for j in range(len(tokens) - n + 1):
                ngram = ' '.join(tokens[j:j+n])
                if ngram not in ngrams: continue
                if ngrams[ngram] > most_common_ngram[1]:
                    most_common_ngram = ngram, ngrams[ngram]
            
            # replace the value only if it contains at least one known n-gram;
            # df.at writes to the dataframe directly, avoiding chained assignment
            if most_common_ngram[1] != 0:
                df.at[idx, colname] = most_common_ngram[0]
                
    def standardize_columns_based_on_ngrams(self, df, colnames, ngrams_list, n_list):
        # repeat the previous method process for every column specified as parameter (with their corresponding n of n-gram)
        for i, colname in enumerate(colnames):
            self.standardize_column_based_on_ngrams(df, colname, ngrams_list[colname], n_list[i])
        
nlp = NLP_standardization()


In [22]:
nunique = df['title'].nunique()
print("# Unique values column 'title' before replacing values with bigrams:", nunique)

# replace 'title' values with most common bigrams contained in the values
nlp.standardize_column_based_on_ngrams(df, 'title', columns_ngrams_1_to_n['title'], 2)
new_nunique = df['title'].nunique()
print("# Unique values column 'title' after replacing values with bigrams:", new_nunique)
print("% Unique values reduction:", round(100 - (new_nunique*100/nunique),2), "%", end="\n\n")

# Unique values column 'title' before replacing values with bigrams: 5398
# Unique values column 'title' after replacing values with bigrams: 2173
% Unique values reduction: 59.74 %
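
The reduction percentage printed above follows the formula 100 - (after * 100 / before). As a quick check, the sketch below wraps that formula in a hypothetical helper, `reduction_pct`, and applies it to the 'title' counts reported above.

```python
def reduction_pct(before, after):
    """Percentage drop in the number of unique values, rounded to 2 decimals."""
    return round(100 - (after * 100 / before), 2)

# 'title' column: 5398 unique values before the bigram replacement, 2173 after
title_reduction = reduction_pct(5398, 2173)
print(title_reduction)  # 59.74
```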



In [23]:
reduce_text_columns = ['location', 'department']
n_list = [1, 1]
nuniques = [df['location'].nunique(), df['department'].nunique()]
print("# Unique values column 'location' before replacing values with unigrams:", nuniques[0])
print("# Unique values column 'department' before replacing values with unigrams:", nuniques[1])

# Unique values column 'location' before replacing values with unigrams: 1439
# Unique values column 'department' before replacing values with unigrams: 449


In [24]:
nlp.standardize_columns_based_on_ngrams(df, reduce_text_columns, columns_ngrams_1_to_n, n_list)

new_nuniques = [df['location'].nunique(), df['department'].nunique()]
print("# Unique values column 'location' after replacing values with unigrams:", new_nuniques[0])
print("# Unique values column 'department' after replacing values with unigrams:", new_nuniques[1])
print("% Unique values reduction column 'location':", round(100 - (new_nuniques[0]*100/nuniques[0]),2), "%", end="\n")
print("% Unique values reduction column 'department':", round(100 - (new_nuniques[1]*100/nuniques[1]),2), "%", end="\n")

# Unique values column 'location' after replacing values with unigrams: 47
# Unique values column 'department' after replacing values with unigrams: 29
% Unique values reduction column 'location': 96.73 %
% Unique values reduction column 'department': 93.54 %


In [25]:
df

Unnamed: 0,title,company,salary,location,department,type,description,company_questions,posted_date,link
0,support worker,ability gateway,$35.50 per hour [PPT],wagga,services,Part time,uswe outcome focused ndis service provider bas...,regular access carwhich following statements b...,2024-02-21,https://www.seek.com.au/job/73909631?type=prom...
1,regional manager,catholiccare tasmania,,north,services,Full time,catholiccare tasmania primary social services ...,,2024-02-21,https://www.seek.com.au/job/73909232?type=prom...
2,support worker,community,$40 – $44 per hour,northern,services,Full time,community gro inc communitybased nonprofit org...,following statements best describes right work...,2024-02-19,https://www.seek.com.au/job/73832771?type=stan...
3,case manager,open minds,$82k – 84k + super + salary packaging + benefits,coast,services,Full time,case manager coastal supports open minds sunsh...,,2024-02-21,https://www.seek.com.au/job/73901240?type=stan...
4,intake worker,centre women,$41 – $42 per hour,brisbane,services,Full time,team culture centre women men services work su...,following statements best describes right work...,2024-02-20,https://www.seek.com.au/job/73861002?type=stan...
...,...,...,...,...,...,...,...,...,...,...
9795,support engineer,fuse technology,,sydney,support,Full time,opportunityas part exciting growth expansion s...,following statements best describes right work...,2024-02-21,https://www.seek.com.au/job/73930150?type=stan...
9796,team leader,labourforce,$47 per hour + penalties,sydney,transport,Contract/Temp,client one australias leading manufacturer dis...,,2024-02-21,https://www.seek.com.au/job/73870879?type=stan...
9797,retail assistant,independent living specialists,"$31.11 per hour, plus super",sydney,retail,Casual/Vacation,independent living specialists fastgrowing bus...,customer service experiencedo current ndis wor...,2024-02-21,https://www.seek.com.au/job/73899163?type=stan...
9798,studio assistant,cendre,,coast,transport,Full time,cendré revered ecommerce jewellery brand compr...,,2024-02-21,https://www.seek.com.au/job/73875587?type=stan...
