### Advanced Python AI and ML Tools - Assignment 1

__Group Members:__
1) Aanal Patel - C0910376
2) Bimal Shresta - C0919385
3) Danilo Diaz - C0889539
4) Ernie Sumoso - C0881591

### Index
- __Step 1. Dataset Description (web scraped)__
- __Step 2. Data Wrangling (cleaning, formatting, structuring, validating)__
- __Step 3. Plotting methods for distribution__
- __Step 4. Pandas profiling for EDA (exploratory data analysis)__
- __Step 5. Encoding methods, creating new numerical columns__
- __Step 6. Outlier identification (with boxplots and IQR)__
- __Step 7. Addressing outliers with Quantile-based flooring and capping, Trimming, and Log Transformation__
- __Step 8. Unsupervised learning methods__
- __Step 9. NLP techniques (data cleaning, stopword and punctuation removal, tokenizing, stemming, and lemmatization)__

### Step 1. Dataset Description (web scraped)

(Bimal add a description of what you did to web scrap the data here, what is the source and what were your steps)

In [42]:
import pandas as pd

# reading the web scraped data from CSV file, setting the index column
df = pd.read_csv("job_data.csv", index_col=0)

# displaying the raw data
df.tail()

| | job_title | company | salary | job_location | post | job_type | job_desc | company_qns | job_posted_date | job_link |
|---|---|---|---|---|---|---|---|---|---|---|
| 2610 | Level 2/3 Support Engineer | Fuse Technology Pty Ltd | NaN | Sydney NSW | Help Desk & IT Support (Information & Communic... | Full time | The opportunityAs part of our exciting growth ... | Which of the following statements best describ... | 2024-02-21 | https://www.seek.com.au/job/73930150?type=stan... |
| 2611 | NIGHT SHIFT WAREHOUSE TEAM LEADER WANTED WETHE... | Labourforce | $47 per hour + penalties | Wetherill Park, Sydney NSW | Warehousing, Storage & Distribution (Manufactu... | Contract/Temp | Our client is one of Australia's leading Manuf... | NaN | 2024-02-21 | https://www.seek.com.au/job/73870879?type=stan... |
| 2612 | Casual Retail Assistant | Independent Living Specialists | $31.11 per hour, plus super | Randwick, Sydney NSW | Retail Assistants (Retail & Consumer Products) | Casual/Vacation | Independent Living Specialists is a fast-growi... | Do you have customer service experience?Do you... | 2024-02-21 | https://www.seek.com.au/job/73899163?type=stan... |
| 2613 | Studio Assistant | Cendre | NaN | Oxenford, Gold Coast QLD | Pickers & Packers (Manufacturing, Transport & ... | Full time | Cendré is a revered e-commerce jewellery brand... | NaN | 2024-02-21 | https://www.seek.com.au/job/73875587?type=stan... |
| 2614 | Junior IT Support Officer | Hare & Forbes | NaN | Northmead, Sydney NSW | Help Desk & IT Support (Information & Communic... | Full time | Parramatta locationWork with a close-knit, exp... | Do you have demonstrated experience diagnosing... | 2024-02-21 | https://www.seek.com.au/job/73868216?type=stan... |


In [43]:
# display the number of rows, columns and the column names
def display_shape_and_colnames(df):
    print("Number of Rows:", df.shape[0])
    print("Number of Columns:", df.shape[1])
    print(df.columns)
    
display_shape_and_colnames(df)

Number of Rows: 9800
Number of Columns: 10
Index(['job_title', 'company', 'salary', 'job_location', 'post', 'job_type',
       'job_desc', 'company_qns', 'job_posted_date', 'job_link'],
      dtype='object')


Some of our __column names__ are __redundant__ because we are working with job data.

Let's delete the prefix __"job"__ from our column names.

Some other __column names__ are __abbreviated__ (e.g. "job_desc", "company_qns").

Let's __replace them with full names__, and rename "post" to "department", so that every column name describes its contents.

In [44]:
def clean_colnames(df):
    # delete the prefix "job_" from our column names
    # (slice the prefix off: str.lstrip would strip a set of characters, not a prefix)
    prefix = "job_"
    for column_name in df.columns.to_list():
        if column_name.startswith(prefix):
            df.rename(columns={column_name: column_name[len(prefix):]}, inplace=True)

    # rename abbreviated column names
    df.rename(columns={'desc':'description', 'company_qns':'company_questions', 'post':'department'}, inplace=True)

clean_colnames(df)
# display clean column names
df.head(2)

| | title | company | salary | location | department | type | description | company_questions | posted_date | link |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Experienced Support Worker (PPT & CAS) | Ability Gateway | $35.50 per hour [PPT] | Wagga Wagga, Wagga Wagga & Riverina NSW | Aged & Disability Support (Community Services ... | Part time | About usWe are an outcome focused NDIS service... | Do you own or have regular access to a car?Whi... | 2024-02-21 | https://www.seek.com.au/job/73909631?type=prom... |
| 1 | Regional Manager - Inspire@HOME | CatholicCare Tasmania | NaN | Launceston, Launceston & North East TAS | Child Welfare, Youth & Family Services (Commun... | Full time | CatholicCare Tasmania is the primary social se... | NaN | 2024-02-21 | https://www.seek.com.au/job/73909232?type=prom... |
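
A note on stripping prefixes from column names: `str.lstrip` treats its argument as a set of characters to remove, not as a literal prefix, so it can remove more than intended. The sketch below compares it with an explicit prefix check. The helper `strip_prefix` and the column name `job_board` are hypothetical, chosen only to expose the difference; they are not part of our dataset.

```python
import pandas as pd

def strip_prefix(name, prefix):
    # remove `prefix` only when `name` actually starts with it
    return name[len(prefix):] if name.startswith(prefix) else name

# hypothetical column names ("job_board" is not in our dataset)
columns = ["job_title", "job_board", "company"]

# lstrip("job_") removes any leading 'j', 'o', 'b' or '_' characters
lstrip_result = [c.lstrip("job_") for c in columns]

# the explicit prefix check removes exactly "job_" and nothing else
prefix_result = [strip_prefix(c, "job_") for c in columns]

# the same helper can be passed directly to DataFrame.rename
demo = pd.DataFrame(columns=columns).rename(columns=lambda c: strip_prefix(c, "job_"))

print(lstrip_result)
print(prefix_result)
print(list(demo.columns))
```

With our actual column names both approaches happen to give the same result, because none of the remaining names starts with one of the characters `j`, `o`, `b` or `_`.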


Now let's understand all of our columns by providing a description for each one:
- __title__: title of the posted job
- __company__: name of the company that has posted the job
- __salary__: salary range for the job, can be defined per hour, monthly, annually, etc.
- __location__: geographical location of the job or company
- __department__: field or department of the job (e.g. IT, Sales, etc.)
- __type__: employment type of the job (e.g. Full time, Part time, Casual/Vacation, Contract/Temp)
- __description__: long description of the job posting
- __company_questions__: questions issued by the company to the applicants, according to the post
- __posted_date__: format yyyy-mm-dd
- __link__: link of the job posting
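
Since the posted date follows the yyyy-mm-dd format, it could be converted to a proper datetime type during wrangling. The following is a minimal sketch on a toy DataFrame, assuming the column was read from the CSV as plain strings; the variable names `toy` and `latest` are illustrative only.

```python
import pandas as pd

# toy data in the same yyyy-mm-dd format as our posted_date column
toy = pd.DataFrame({"posted_date": ["2024-02-21", "2024-02-20", "2024-02-21"]})

# parse the strings into datetime64 values using an explicit format
toy["posted_date"] = pd.to_datetime(toy["posted_date"], format="%Y-%m-%d")

# datetime columns support comparisons and date arithmetic
latest = toy["posted_date"].max()
print(toy.dtypes)
print(latest)
```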

Now that we have a general understanding of our web scraped data, let's move on to the next step and perform our data wrangling methods.

### Step 2. Data Wrangling (cleaning, formatting, structuring, validating)

In [47]:
def check_missing_values(df):
    # check for number of missing values per column
    print("# Missing Values")
    print(df.isna().sum())
    
    # check for % of missing values
    print("\n% Missing Values")
    print(df.isna().mean() * 100)
    
check_missing_values(df)

# Missing Values
title                   0
company                 0
salary               5216
location                0
department              0
type                    0
description             0
company_questions    5034
posted_date             0
link                    0
dtype: int64

% Missing Values
title                 0.000000
company               0.000000
salary               53.224490
location              0.000000
department            0.000000
type                  0.000000
description           0.000000
company_questions    51.367347
posted_date           0.000000
link                  0.000000
dtype: float64


As expected, many job posts do not include a salary range or any information about the salary.

It is no surprise that __more than half of our data has missing values for salary__.

Similarly, __more than half of the values in the company questions column are missing__ (about 51%).
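
The counts and percentages above can also be combined into a single table, which is easier to scan when there are many columns. This is a small sketch on toy data; the helper `missing_summary` and its output column names are our own assumptions for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    # one row per column: count and percentage of missing values
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
    })
    # most incomplete columns first
    return summary.sort_values("pct_missing", ascending=False)

# toy data with the same kind of gaps as our dataset
toy = pd.DataFrame({
    "title": ["Support Worker", "Regional Manager", "Pick Packer", "Studio Assistant"],
    "salary": ["$35.50 per hour", np.nan, "35", np.nan],
    "company_questions": [np.nan, np.nan, np.nan, "Do you own a car?"],
})

summary = missing_summary(toy)
print(summary)
```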

In [55]:
def check_duplicated_values(df):
    # check for number of duplicated values
    print("# Duplicated Values")
    print(df.duplicated().sum())
    
    # check for % of duplicated values
    print("\n% Duplicated Values")
    print(df.duplicated().mean() * 100)

check_duplicated_values(df)

df[df.duplicated()].tail(4)

# Duplicated Values
944

% Duplicated Values
9.63265306122449


| | title | company | salary | location | department | type | description | company_questions | posted_date | link |
|---|---|---|---|---|---|---|---|---|---|---|
| 2587 | Pick Packers | Action Workforce | 35 | Maddington, Perth WA | Warehousing, Storage & Distribution (Manufactu... | Casual/Vacation | Action Workforce are looking for Experienced P... | NaN | 2024-02-21 | https://www.seek.com.au/job/73901168?type=stan... |
| 2593 | Accounts Person- KALGOORLIE RESIDENTS ONLY | Golden mile cleaning services | $30 – $33.50 per hour | Kalgoorlie, Kalgoorlie, Goldfields & Esperance WA | Administrative Assistants (Administration & Of... | Part time | Job Title: Accounts Person We are currently se... | Which of the following statements best describ... | 2024-02-21 | https://www.seek.com.au/job/73908087?type=prom... |
| 2603 | Warehouse Assistant | Omni Recruit | NaN | Truganina, Melbourne VIC | Pickers & Packers (Manufacturing, Transport & ... | Casual/Vacation | Business is booming and we are currently seeki... | Do you agree to the privacy policy of Omni Rec... | 2024-02-20 | https://www.seek.com.au/job/73863322?type=stan... |
| 2612 | Casual Retail Assistant | Independent Living Specialists | $31.11 per hour, plus super | Randwick, Sydney NSW | Retail Assistants (Retail & Consumer Products) | Casual/Vacation | Independent Living Specialists is a fast-growi... | Do you have customer service experience?Do you... | 2024-02-21 | https://www.seek.com.au/job/73899163?type=stan... |


A considerable share of our rows __(around 9.6%) are duplicates__.

This can be __dangerous for analysis__, as it can affect multiple metrics and our model training.

We have to __deal with these duplicated values__.
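
One common way to deal with duplicated rows is `DataFrame.drop_duplicates`. The sketch below runs on toy data and is only one possible approach under our assumptions, not necessarily the final treatment of this dataset; the `link-a` and `link-b` values are placeholders.

```python
import pandas as pd

# toy data: the first and last rows are exact duplicates
toy = pd.DataFrame(
    {
        "title": ["Pick Packers", "Warehouse Assistant", "Pick Packers"],
        "company": ["Action Workforce", "Omni Recruit", "Action Workforce"],
        "link": ["link-a", "link-b", "link-a"],  # placeholder values
    },
    index=[10, 11, 10],
)

n_before = len(toy)

# keep the first occurrence of each duplicated row;
# only column values are compared, the index is ignored
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

n_removed = n_before - len(deduped)
print("Rows removed:", n_removed)
```

Resetting the index afterwards also gives every remaining row a unique label, which is convenient when the original index repeats.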

In [66]:
def check_nunique_values(df):
    # check number of unique values per column
    print("# Unique Values per Column")
    for col in df.columns:
        print("'"+col+"'", "# of unique values:", df[col].nunique())
        
    # check % of unique values per column (relative to number of total rows in the dataset)
    print("\n% Unique Values per Column")
    for col in df.columns:
        print("'"+col+"'", "% of unique values:", round(df[col].nunique() * 100 / df.shape[0], 2), "%")
        
check_nunique_values(df)

# Unique Values per Column
'title' # of unique values: 5655
'company' # of unique values: 4965
'salary' # of unique values: 2645
'location' # of unique values: 1448
'department' # of unique values: 451
'type' # of unique values: 8
'description' # of unique values: 7958
'company_questions' # of unique values: 2730
'posted_date' # of unique values: 95
'link' # of unique values: 8664

% Unique Values per Column
'title' % of unique values: 57.7 %
'company' % of unique values: 50.66 %
'salary' % of unique values: 26.99 %
'location' % of unique values: 14.78 %
'department' % of unique values: 4.6 %
'type' % of unique values: 0.08 %
'description' % of unique values: 81.2 %
'company_questions' % of unique values: 27.86 %
'posted_date' % of unique values: 0.97 %
'link' % of unique values: 88.41 %


Some of our columns have a __large number of unique values__.

Although we have not processed our values yet, we should __seek to reduce the number of unique values through data processing__.

These are the columns with a very large number of unique values __(>50% of total rows)__:
- title
- company
- description
- link
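
The list above can also be derived programmatically by comparing each column's share of unique values against a threshold. A minimal sketch on toy data follows; the helper name `high_cardinality_columns` and the default threshold of 0.5 are assumptions for illustration.

```python
import pandas as pd

def high_cardinality_columns(df, threshold=0.5):
    # share of unique values per column, relative to the number of rows
    ratio = df.nunique() / len(df)
    # keep only the columns strictly above the threshold, highest first
    return ratio[ratio > threshold].sort_values(ascending=False)

# toy data: 'title' is fully unique, 'type' has few distinct values
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Full time", "Full time", "Part time", "Full time"],
    "link": ["u1", "u2", "u3", "u3"],
})

result = high_cardinality_columns(toy)
print(result)
```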

In [175]:
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
import re



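def remove_stopwords(tokens, add_stopwords=None):
    # combine NLTK's English stopwords with our custom ones (a set gives fast lookups)
    total_stopwords = set(stopwords.words('english')) | set(add_stopwords or [])
    # filter in place so the caller's list is updated
    tokens[:] = [word for word in tokens if word not in total_stopwords]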

def remove_digits(tokens):
    # strip digits from every token, then drop the tokens that end up empty
    stripped = [re.sub(r'\d+', '', word) for word in tokens]
    return [word for word in stripped if word != ""]

def remove_short_words(tokens, min_length, exceptions=()):
    # drop words shorter than min_length (in place), keeping the listed exceptions
    tokens[:] = [word for word in tokens if len(word) >= min_length or word in exceptions]
        
def populate_len_words(tokens):
    # group the words of length 1, 2, and 3 into three separate lists
    counts = [], [], []
    for word in tokens:
        if 0 < len(word) <= len(counts):
            counts[len(word) - 1].append(word)
    return counts
        
len_1 = []
len_2 = []
len_3 = []
words_count = {}
bigram_count = {}
trigram_count = {}
for title_value in df['title'].unique():
    # remove punctuation from the title value
    title_no_punc = title_value.translate(str.maketrans('', '', string.punctuation))
    
    # word tokenize the title value
    word_tokens = word_tokenize(title_no_punc)
    
    # lower word tokens
    word_tokens = list(map(str.lower, word_tokens))
    
    # remove stopwords
    add_stopwords = ['ict', 'plus', 'per', 'week', 'bws', 'new', 'asap', 'pae', 'year', 'years', 'itc', 'day']
    remove_stopwords(word_tokens, add_stopwords)
    
    # remove shorter words (abbreviations)
    exceptions = ['hr']
    min_length = 3
    remove_short_words(word_tokens, min_length, exceptions)
    
    # remove digits
    word_tokens = remove_digits(word_tokens)
    
    # populate 1, 2, and 3 len words
    current_words = populate_len_words(word_tokens)
    len_1 += current_words[0]
    len_2 += current_words[1]
    len_3 += current_words[2]
    
    # build bigrams from word tokens
    bigrams = []
    for i in range(len(word_tokens)-1):
        bigrams.append(word_tokens[i] + ' ' + word_tokens[i+1])
    
    # build trigrams from word tokens
    trigrams = []
    for i in range(len(word_tokens)-2):
        trigrams.append(word_tokens[i] + ' ' + word_tokens[i+1] + ' ' + word_tokens[i+2])
    
    # update the word counter
    for word in word_tokens:
        words_count[word] = words_count.get(word, 0) + 1
    
    # update the bigrams
    for bigram in bigrams:
        bigram_count[bigram] = bigram_count.get(bigram, 0) + 1
    
    # update the trigrams
    for trigram in trigrams:
        trigram_count[trigram] = trigram_count.get(trigram, 0) + 1
    
words_sorted = sorted(words_count.items(), key=lambda x : x[1], reverse=True)
bigrams_sorted = sorted(bigram_count.items(), key=lambda x : x[1], reverse=True)
trigrams_sorted = sorted(trigram_count.items(), key=lambda x : x[1], reverse=True)
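
The word, bigram, and trigram counting can be written more compactly with `collections.Counter`. The sketch below is a simplified illustration: it tokenizes with `str.split` instead of NLTK, skips the stopword and short-word filtering, and uses a few made-up titles, so it is not a drop-in replacement for the cell above. The helper name `ngrams` is our own.

```python
from collections import Counter

def ngrams(tokens, n):
    # contiguous runs of n tokens, joined with spaces
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# made-up titles for illustration only
titles = [
    "Senior Support Worker",
    "Support Worker",
    "Disability Support Worker Casual",
]

word_counter, bigram_counter, trigram_counter = Counter(), Counter(), Counter()
for title in titles:
    tokens = title.lower().split()  # simplified tokenization
    word_counter.update(tokens)
    bigram_counter.update(ngrams(tokens, 2))
    trigram_counter.update(ngrams(tokens, 3))

# most_common returns (item, count) pairs sorted by descending count
print(word_counter.most_common(2))
print(bigram_counter.most_common(1))
print(trigram_counter.most_common(3))
```

`Counter.most_common` replaces the manual `sorted(..., key=..., reverse=True)` calls, and `Counter.update` replaces the `dict.get(key, 0) + 1` pattern.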

In [178]:
list(set(len_1))

['t', 'a', 'y', 'f', 'k']

In [179]:
list(set(len_2))

['ca', 'hr', 'el', 'hm', 'px', 'nd', 'po', 'pm', 'bb', 'oo', 'ao', 'am', 'ic']

In [180]:
list(set(len_3))

['bas',
 'sap',
 'ald',
 'mid',
 'elm',
 'aod',
 'sub',
 'try',
 'ivf',
 'qsr',
 'fix',
 'smp',
 'eca',
 'gis',
 'esg',
 'gin',
 'car',
 'aws',
 'euc',
 'rpd',
 'sme',
 'whv',
 'uni',
 'net',
 'qld',
 'alh',
 'due',
 'hbc',
 'jmf',
 'cns',
 'bdm',
 'fit',
 'bid',
 'far',
 'cbd',
 'ute',
 'inc',
 'pmo',
 'fpa',
 'sil',
 'anz',
 'pty',
 'ims',
 'sor',
 'two',
 'acm',
 'fom',
 'hcp',
 'trc',
 'eho',
 'ote',
 'cnc',
 'imc',
 'rab',
 'kit',
 'aso',
 'cps',
 'ras',
 'gpc',
 'law',
 'jay',
 'woy',
 'png',
 'pqe',
 'cas',
 'job',
 'whs',
 'hsp',
 'ecm',
 'gmp',
 'mep',
 'wet',
 'vps',
 'itt',
 'iga',
 'pcp',
 'ses',
 'ctp',
 'non',
 'ceo',
 'arc',
 'rda',
 'vic',
 'msp',
 'aps',
 'exp',
 'one',
 'pet',
 'stp',
 'ngs',
 'bay',
 'ame',
 'van',
 'wfh',
 'wfa',
 'pos',
 'yha',
 'tas',
 'crk',
 'bft',
 'cmt',
 'cad',
 'dna',
 'bar',
 'eso',
 'ops',
 'ltd',
 'lvl',
 'syd',
 'sea',
 'tax',
 'egm',
 'rep',
 'end',
 'apy',
 'ttw',
 'rto',
 'nfp',
 'spa',
 'icp',
 'app',
 'web',
 'mcv',
 'pts',
 'phd',
