# DATA102 Major Course Output

**Group Number**: 7

**Members**:
- Guerra, Angelo
- Hinolan, Charles
- Lasala, Kyle
- Lorenzo, Antonio
- Roco, Katrina

**Section**: S11

**Instructor**: Mr. Jude Michael Teves

# Problem Statement

In today's competitive job market, many people are searching for work, whether they are professionals seeking a career change or advancement, or recent graduates just entering the industry. Although job ads are plentiful online, job seekers often struggle to sort through the flood of listings spread across platforms. These websites differ in user interface, job classification, and search method, which makes it hard for candidates to locate suitable jobs quickly. Furthermore, most job portals lack personalization, resulting in a generic browsing experience that ignores each user's preferences, skills, and career objectives.

Because job seekers must repeatedly enter their qualifications, sift through countless irrelevant job advertisements, and monitor application statuses, the process is both time-consuming and mentally taxing. This monotony frequently leads to frustration and missed opportunities, especially for people juggling job hunting with other obligations. The added pressure of keeping up with new openings across multiple channels calls for a smarter, more efficient approach. A job recommender system that scrapes job search websites and surfaces highly relevant opportunities tailored to each user could greatly reduce the time, effort, and stress of job searching, freeing people to concentrate on writing strong applications and, eventually, landing the right position.

## Preliminaries

In [None]:
%pip install webdriver-manager
%pip install -U sentence-transformers





Below are the libraries needed to perform the data operations throughout this notebook:

In [1]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

import time
import numpy as np
import pandas as pd
import multiprocessing
import re
from rapidfuzz import process
import unicodedata
import os

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.metrics import jaccard_score
from rapidfuzz.distance.Levenshtein import normalized_similarity

# Data Collection: Sample Users of Recommender Systems

To build our user dataset, we designed a survey to collect users’ job preferences, which serve as their baseline profiles. The responses are stored in a Google Sheet, from which the data will be extracted and loaded into a dataframe for further processing.

In [2]:
sheet_id = "18GpQ2NjwZhjtDy6N3j6rDVBfR0sH50NDroJcTKRKuJQ"
sheet_name = "FormResponses1"

url= f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

df_form_response = pd.read_csv(url)

df_form_response.head()

Unnamed: 0,Timestamp,Do you give your consent?,Complete Name,Which location do you prefer to work in?,What is your preferred job role or function?,What is your preferred job type?,What keywords or phrases would you typically use when searching for jobs?,What is your expected monthly salary?,What is your preferred work setup?,What are your primary skills related to your desired role?,What benefits and perks are most important to you in a job?,What type of work environment do you prefer?,What level of experience are you looking for in your next role?,Do you have any additional preferences or requirements for your ideal job?
0,4/3/2025 17:47:08,"Yes, I give my consent.","GO, Daphne Janelyn L.","Taguig, National Capital Region (NCR)",Information Technology,Full-time,"data science, data analytics, software developer","₱50,000 - ₱60,000",Hybrid,"data science, machine learning, data analytics","Health Issurance, Bonuses",A mix of both collaborative/team-oriented and ...,Entry level,No
1,4/3/2025 17:57:37,"Yes, I give my consent.","PAYONGAYONG, Joanna Angela B.","Makati, National Capital Region (NCR)",Healthcare,Full-time,Hospitals in Makati,"₱20,000 - ₱30,000",On-Site,Phlebotomy,Healthcare literate,A mix of both collaborative/team-oriented and ...,Internship,
2,4/3/2025 17:57:37,"Yes, I give my consent.","SIASOYCO, Gabriel M.","Paranaque, National Capital Region (NCR)",Engineering,Internship,Mechanical Engineer,"Below ₱20,000",Hybrid,"CAD, Project Management, SOLIWORKS, ANSYS, Pro...","Work-life balance, Work load",Collaborative/team-oriented,Entry level,HVAC Industry
3,4/3/2025 17:57:49,"Yes, I give my consent.","ONG, Camron Evan C.","Manila, National Capital Region (NCR)",Information Technology,Full-time,software engineer,"₱30,000 - ₱40,000",Hybrid,Programming,Work-life balance,A mix of both collaborative/team-oriented and ...,Entry level,
4,4/3/2025 17:58:38,"Yes, I give my consent.","LORENZO, Antonio Jose Maria A.","Makati, National Capital Region (NCR)",Engineering,Internship,IT jobs Makati,"₱40,000 - ₱50,000",Hybrid,Data Visualization,Work-life balance,Collaborative/team-oriented,Entry level,
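The same Google Sheets CSV export URL pattern is rebuilt in several cells of this notebook. As a minimal sketch, it could be wrapped in a small helper; `gviz_csv_url` is a hypothetical name that does not appear in the notebook, and the sheet ID below is a placeholder.

```python
from urllib.parse import quote


def gviz_csv_url(sheet_id, sheet_name):
    """Build the CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces in the tab name
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


# "SHEET_ID" is a placeholder, not a real spreadsheet ID
url = gviz_csv_url("SHEET_ID", "Job_Functions")
print(url)
```

The resulting string can be passed to `pd.read_csv` exactly as the cells here do.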


Clean the `job_query` feature by removing quotation marks (`"`, `'`, `“`, `”`) and commas, trimming surrounding whitespace, replacing spaces with hyphens (`-`) so that each response becomes a single hyphenated string, and converting everything to lowercase.

In [3]:
job_query = df_form_response["What keywords or phrases would you typically use when searching for jobs?"].tolist()
job_query = [str(x).replace('"', '').replace("'", '').replace('“', '').replace('”', '').replace(',', '').strip().replace(' ', '-').lower() for x in job_query]
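To make the effect of this cleaning chain easier to inspect, the sketch below applies the same steps to a few sample answers. `clean_query` is a hypothetical helper name; the replacement steps mirror the list comprehension above.

```python
def clean_query(raw):
    """Apply the same cleaning steps as the list comprehension above."""
    text = str(raw)
    # strip straight quotes, curly double quotes, and commas
    for ch in ('"', "'", '“', '”', ','):
        text = text.replace(ch, '')
    # trim, join words with hyphens, lowercase
    return text.strip().replace(' ', '-').lower()


for answer in ["IT jobs Makati", "data science, data analytics", float("nan")]:
    print(repr(answer), "->", clean_query(answer))
```

Two behaviours are worth keeping in mind when reading the results: a comma-separated answer collapses into one long slug rather than several queries, and a missing answer becomes the literal string `nan` because of the `str(...)` call.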

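Extract `location`, `job_func`, and `job_type` from their corresponding survey questions (`job_type` is read from the experience-level question), then concatenate the remaining preference answers into a single `job_desc` text field.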

In [4]:
# get job_func, location, job_type
cols = [
    "Which location do you prefer to work in?",
    "What is your preferred job role or function?",
    "What level of experience are you looking for in your next role?"
]

all_responses = df_form_response[cols].rename(columns={
    "What is your preferred job role or function?": "job_func",
    "Which location do you prefer to work in?": "location",
    "What level of experience are you looking for in your next role?": "job_type"
})

# concat job desc
job_desc_cols = [
    "What is your preferred work setup?",
    "What are your primary skills related to your desired role?",
    "What benefits and perks are most important to you in a job?",
    "What type of work environment do you prefer?",
    "What is your expected monthly salary?"
]

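# fill missing answers and cast to str so that ' '.join does not fail on NaN
job_desc = df_form_response[job_desc_cols].fillna('').astype(str).agg(' '.join, axis=1)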
all_responses['job_desc'] = job_desc

all_responses.head()

Unnamed: 0,location,job_func,job_type,job_desc
0,"Taguig, National Capital Region (NCR)",Information Technology,Entry level,"Hybrid data science, machine learning, data an..."
1,"Makati, National Capital Region (NCR)",Healthcare,Internship,On-Site Phlebotomy Healthcare literate A mix o...
2,"Paranaque, National Capital Region (NCR)",Engineering,Entry level,"Hybrid CAD, Project Management, SOLIWORKS, ANS..."
3,"Manila, National Capital Region (NCR)",Information Technology,Entry level,Hybrid Programming Work-life balance A mix of ...
4,"Makati, National Capital Region (NCR)",Engineering,Entry level,Hybrid Data Visualization Work-life balance Co...


# Data Collection: Scraping Job Entries from Websites

To speed up scraping, each website is handled by its own scraping function (to accommodate the differences between sites), and these functions are run in parallel.

In [5]:
# scraping for respondent #
len_jobs = 30

# sample user looking for data science jobs
!python parallel_scraper.py "data-science" {len_jobs}

No jobs found for this query on Foundit.
Foundit Scraping Time 87.91761517524719
Kalibrr Scraping Time 140.00044345855713
Jobstreet Scraping Time 159.61431884765625
Linkedin Scraping Time 225.87779712677002
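The contents of `parallel_scraper.py` are not shown in this notebook, so the following is only a sketch of the general idea of running one scraper per site concurrently. It uses a thread pool from `concurrent.futures` as a stand-in, which may differ from what the script actually does, and `scrape_site` is a hypothetical placeholder that returns fake rows instead of scraping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_site(site, query, limit):
    """Hypothetical stand-in for a per-site scraper; returns fake rows."""
    time.sleep(0.1)  # pretend to wait on the network
    return [{"source": site, "title": f"{query} job {i}"} for i in range(limit)]


def scrape_all(sites, query, limit):
    """Run one scraper per site concurrently and collect results by site."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {pool.submit(scrape_site, s, query, limit): s for s in sites}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


sites = ["Linkedin", "Foundit", "Kalibrr", "Jobstreet"]
jobs = scrape_all(sites, "data-science", 3)
print({site: len(rows) for site, rows in jobs.items()})
```

With this layout the total wall time is roughly that of the slowest site rather than the sum of all sites, which is the motivation for scraping in parallel.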


# Checkpoint

Each website's scraped job data is stored in its own CSV file; load each file into a separate dataframe.

In [6]:
# loading
linkedin_df = pd.read_csv('linkedin.csv')
foundit_df = pd.read_csv('foundit.csv')
jobstreet_df = pd.read_csv('jobstreet.csv')
kalibrr_df = pd.read_csv('kalibrr.csv')

# Preprocessing

In [7]:
linkedin_df.columns

Index(['Unnamed: 0', 'title', 'link', 'location', 'company', 'emp_type',
       'job_func', 'job_desc', 'posted'],
      dtype='object')

In [8]:
foundit_df.columns

Index(['Unnamed: 0', 'title', 'company', 'link', 'location', 'posted',
       'emp_type', 'job_func', 'job_desc'],
      dtype='object')

In [9]:
jobstreet_df.columns

Index(['Unnamed: 0', 'title', 'link', 'company', 'posted', 'location',
       'job_func', 'emp_type', 'job_desc'],
      dtype='object')

In [10]:
kalibrr_df.columns

Index(['Unnamed: 0', 'title', 'link', 'company', 'emp_type', 'location',
       'job_func', 'posted', 'job_desc'],
      dtype='object')

In [11]:
# Drop 'Unnamed: 0' column
linkedin_df = linkedin_df.drop(columns=['Unnamed: 0'], errors='ignore')
foundit_df = foundit_df.drop(columns=['Unnamed: 0'], errors='ignore')
jobstreet_df = jobstreet_df.drop(columns=['Unnamed: 0'], errors='ignore')
kalibrr_df = kalibrr_df.drop(columns=['Unnamed: 0'], errors='ignore')

### Standardizing Job Functions

In [12]:
sheet_id = "1FJgg2JWrKfzyWbi-76vKcR1dm-q6RwCAipobzotr_Eg"
sheet_name = "Job_Functions"

url= f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

df_jobs = pd.read_csv(url)

df_jobs

Unnamed: 0,Standardized,Linkedin,Foundit,Kalibrr,Jobstreet,Unnamed: 5
0,Accounting & Auditing,Accounting/Auditing,,Accounting and Finance,Accounting,"""Accounting & Auditing"","
1,Administration & Office Support,Administrative,Admin/secretarial/front office,Adminstration and Coordination,Administration & Office Support,"""Administration & Office Support"","
2,Advertising,,Advertising/entertainment/media,Media and Creatives,"Advertising, Arts & Media","""Advertising"","
3,Analyst,Analyst,Analytics/business intelligence,,,"""Analyst"","
4,Architecture,,Architecture/interior design,Architecture and Engineering,Design & Architecture,"""Architecture"","
...,...,...,...,...,...,...
64,Sales,,Sales/business development,,,"""Sales"","
65,Sales,Entrepreneurship,Retail chains,,,"""Sales"","
66,Sales,,Fashion/apparels,,,"""Sales"","
67,Sciences,Research,,Sciences,Science & Technology,"""Sciences"","


In [13]:
# mapping and standardizing
def standardization_map_exact(df, source_col, standard_col='Standardized'):
    source = df.dropna(subset=[source_col, standard_col])[source_col]
    target = df.dropna(subset=[source_col, standard_col])[standard_col]
    mapping = dict(zip(source, target))
    return mapping


def standardize_job_function(df_jobs_ref, df_target, site_column, source_col, target_col='job_func_stand'):
    # Create the mapping dictionary from the job mapping table
    mapping_dict = standardization_map_exact(df_jobs_ref,site_column)
    
    # Apply the mapping to the job function column
    df_target[target_col] = df_target[source_col].apply(
        lambda x: ", ".join(
            mapping_dict.get(f.strip(), f.strip()) 
            for f in str(x).split(",") if f.strip()
        )
    )
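The per-label mapping inside `standardize_job_function` can be illustrated without pandas. The sketch below uses a plain dictionary with two pairs taken from the mapping table above plus one made-up pair (`Business Development`); `map_functions` and its de-duplication step are assumptions added for illustration and are not part of the notebook.

```python
def map_functions(raw, mapping, dedupe=False):
    """Map each comma-separated label; labels without a mapping pass through."""
    parts = [p.strip() for p in str(raw).split(",") if p.strip()]
    mapped = [mapping.get(p, p) for p in parts]
    if dedupe:
        # optional extra step (not in the notebook): drop repeats, keep order
        mapped = list(dict.fromkeys(mapped))
    return ", ".join(mapped)


mapping = {
    "Accounting/Auditing": "Accounting & Auditing",  # from the mapping table
    "Entrepreneurship": "Sales",                     # from the mapping table
    "Business Development": "Sales",                 # made-up pair for the demo
}

print(map_functions("Entrepreneurship, Business Development", mapping))
print(map_functions("Entrepreneurship, Business Development", mapping, dedupe=True))
print(map_functions("Analyst, Unknown Label", mapping))
```

Without de-duplication, two source labels that map to the same standard label produce a repeated value such as `Sales, Sales`. Splitting on commas also means that a source label which itself contains a comma is looked up piece by piece.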

Fixing Job Functions in Linkedin, Kalibrr, and Jobstreet

In [14]:
standardize_job_function(df_jobs, linkedin_df, 'Linkedin', 'job_func')
standardize_job_function(df_jobs, kalibrr_df, 'Kalibrr', 'job_func')

jobstreet_df['job_func_clean'] = jobstreet_df['job_func'].str.extract(r'\(([^)]+)\)')
jobstreet_df['job_func_clean'] = jobstreet_df['job_func_clean'].str.replace('&amp;', '&', regex=False)

standardize_job_function(df_jobs, jobstreet_df, 'Jobstreet', 'job_func_clean')

In [15]:
# after preprocessing
job_func_count = pd.concat([linkedin_df['job_func_stand'].value_counts(),kalibrr_df['job_func_stand'].value_counts(), jobstreet_df['job_func_stand'].value_counts()],axis=1)
job_func_count.columns = ['Linkedin', 'Kalibrr', 'Jobstreet']
job_func_count


Unnamed: 0_level_0,Linkedin,Kalibrr,Jobstreet
job_func_stand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Information Technology,5.0,22.0,17.0
Analyst,4.0,,
"Engineering, Information Technology",3.0,,
"Banking & Financial Services, Sales",2.0,,
"Sciences, Analyst, Information Technology",2.0,,
"Information Technology, Analyst",1.0,,
"Analyst, Strategy/Planning",1.0,,
General Business,1.0,,
"Engineering, Science, Analyst",1.0,,
Legal,1.0,,


### Standardizing Location

In [16]:
sheet_id = "1FJgg2JWrKfzyWbi-76vKcR1dm-q6RwCAipobzotr_Eg"
sheet_name = "LocationwCoordinates"

url= f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

df_locations = pd.read_csv(url)


df_locations['city_clean'] = df_locations['City/Province'].str.replace(r'\s+City$', '', regex=True)


df_locations

Unnamed: 0,City/Province,Region,Type,Latitude,Longitude,Unnamed: 5,city_clean
0,Caloocan City,National Capital Region (NCR),City,14.6500,120.9667,"""Caloocan City, National Capital Region (NCR)"",",Caloocan
1,Las Pinas City,National Capital Region (NCR),City,14.6333,121.0333,"""Las Pinas City, National Capital Region (NCR)"",",Las Pinas
2,Makati City,National Capital Region (NCR),City,14.5503,121.0327,"""Makati City, National Capital Region (NCR)"",",Makati
3,Malabon City,National Capital Region (NCR),City,14.6600,120.9600,"""Malabon City, National Capital Region (NCR)"",",Malabon
4,Mandaluyong City,National Capital Region (NCR),City,14.6167,121.0333,"""Mandaluyong City, National Capital Region (NC...",Mandaluyong
...,...,...,...,...,...,...,...
211,Lanao del Sur,Bangsamoro Autonomous Region in Muslim Mindana...,Province,7.9167,124.3000,"""Lanao del Sur, Bangsamoro Autonomous Region i...",Lanao del Sur
212,Sulu,Bangsamoro Autonomous Region in Muslim Mindana...,Province,6.0000,121.0000,"""Sulu, Bangsamoro Autonomous Region in Muslim ...",Sulu
213,Tawi-Tawi,Bangsamoro Autonomous Region in Muslim Mindana...,Province,5.1333,120.1000,"""Tawi-Tawi, Bangsamoro Autonomous Region in Mu...",Tawi-Tawi
214,Maguindanao del Norte,Bangsamoro Autonomous Region in Muslim Mindana...,Province,7.1833,124.4333,"""Maguindanao del Norte, Bangsamoro Autonomous ...",Maguindanao del Norte


In [17]:
def standardize_and_add_coordinates(df_target, location_col, df_locations, new_col='location_standardized'):
    def normalize(text):
        text = str(text).lower().replace(',', '').replace('-', '').replace('philippines', '').strip()
        text = unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode() 
        return text

    def smart_location_match(raw_location, keywords, df_lookup):
        
        # handle null values
        if not isinstance(raw_location, str):
            return None

        raw = normalize(raw_location)

        if raw_location.lower().strip() == "philippines" or raw == "other":
            return "Philippines"

        for keyword in keywords:
            if normalize(keyword) in raw:
                match = keyword
                break
        else:
            match, score, _ = process.extractOne(raw, keywords, processor=normalize)
            if score <= 80:
                return "Other"

        if match in df_lookup['city_clean'].values:
            matches = df_lookup[df_lookup['city_clean'] == match]
            region = matches['Region'].values[0]
            type_ = matches['Type'].values[0]
            if type_ == 'City':
                return f"{match.title()} City, {region}"
            else:
                return f"{match.title()}, {region}"
        elif match in df_lookup['Region'].values:
            return match
        else:
            return "Other"

    # Build keywords from cities + regions
    keywords = pd.concat([df_locations['city_clean'], df_locations['Region']]).dropna().unique().tolist()

    # Standardize location
    df_target[new_col] = df_target[location_col].apply(lambda x: smart_location_match(x, keywords, df_locations))

    # Extract city_clean for merge
    df_target['city_clean_match'] = df_target[new_col].apply(
        lambda x: x.split(",")[0].replace(" City", "").strip().lower() if isinstance(x, str) else None
    )

    # Prepare reference for matching
    df_locations_temp = df_locations.copy()
    df_locations_temp['city_clean'] = df_locations_temp['city_clean'].str.lower()

    # Remove old lat/lon if any
    df_target = df_target.drop(columns=['Latitude', 'Longitude'], errors='ignore')

    # Merge lat/lon
    df_target = df_target.merge(
        df_locations_temp[['city_clean', 'Latitude', 'Longitude']],
        how='left',
        left_on='city_clean_match',
        right_on='city_clean'
    )

    # Add coordinates for "Philippines"
    philippines_mask = df_target[new_col] == "Philippines"
    df_target.loc[philippines_mask, "Latitude"] = 12.8797
    df_target.loc[philippines_mask, "Longitude"] = 121.7740

    # Clean up
    df_target.drop(columns=['city_clean_match', 'city_clean'], inplace=True, errors='ignore')

    return df_target
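The first two stages of `smart_location_match` (text normalization, then a substring search over known place names) can be tried in isolation. The sketch below copies the `normalize` logic from the function above; `keyword_match` is a hypothetical helper, the keyword lists are small samples, and the fuzzy fallback is left out.

```python
import unicodedata


def normalize(text):
    """Lowercase, drop commas/hyphens/'philippines', and strip accents."""
    text = str(text).lower().replace(',', '').replace('-', '').replace('philippines', '').strip()
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode()


def keyword_match(raw_location, keywords):
    """Return the first keyword whose normalized form appears in the location."""
    raw = normalize(raw_location)
    for keyword in keywords:
        if normalize(keyword) in raw:
            return keyword
    return None  # the notebook falls back to fuzzy matching at this point


print(normalize("Parañaque City, Philippines"))
print(keyword_match("Makati, Metro Manila", ["Makati", "Paranaque"]))
print(keyword_match("Taguig, Metro Manila", ["Manila", "Taguig"]))
```

Because the first keyword found wins, the order of the keyword list matters: in the last call, `Manila` is returned for a Taguig listing since `manila` also appears in `Metro Manila`. Results that look surprisingly skewed toward one city may be worth checking against this behaviour.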

Fixing Locations in Linkedin, Foundit, Kalibrr, and Jobstreet

In [18]:
linkedin_df = standardize_and_add_coordinates(linkedin_df, location_col='location', df_locations=df_locations)
foundit_df = standardize_and_add_coordinates(foundit_df, location_col='location', df_locations=df_locations)
kalibrr_df = standardize_and_add_coordinates(kalibrr_df, location_col='location', df_locations=df_locations)
jobstreet_df = standardize_and_add_coordinates(jobstreet_df, location_col='location', df_locations=df_locations)

In [19]:
# after preprocessing
loc_count = pd.concat([linkedin_df['location_standardized'].value_counts(),foundit_df['location_standardized'].value_counts(),kalibrr_df['location_standardized'].value_counts(), jobstreet_df['location_standardized'].value_counts()],axis=1)
loc_count.columns = ['Linkedin', 'Foundit', 'Kalibrr', 'Jobstreet']
loc_count

Unnamed: 0_level_0,Linkedin,Foundit,Kalibrr,Jobstreet
location_standardized,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Makati City, National Capital Region (NCR)",10.0,,1.0,6.0
"Pasay City, National Capital Region (NCR)",4.0,,,
"Manila City, National Capital Region (NCR)",4.0,4.0,4.0,19.0
"Taguig City, National Capital Region (NCR)",3.0,1.0,10.0,
"Pasig City, National Capital Region (NCR)",2.0,,6.0,
"Mandaluyong City, National Capital Region (NCR)",2.0,,3.0,4.0
Philippines,2.0,1.0,,
Other,1.0,,,
National Capital Region (NCR),1.0,,,
"Quezon City, National Capital Region (NCR)",1.0,,4.0,


### Standardizing Job Type

In [20]:
sheet_id = "1FJgg2JWrKfzyWbi-76vKcR1dm-q6RwCAipobzotr_Eg"
sheet_name = "Job_Type"

url= f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"

df_job_type = pd.read_csv(url)

df_job_type

Unnamed: 0,Standardized,Linkedin,Foundit,Kalibrr,Jobstreet
0,Full-time,Full-time,Permanent Job,Full time,Full time
1,Contract/Temporary,Contract,Contract Job,Contractual,Contract/Temp
2,Internship,Internship,,,
3,Part-time,Part-time,,Part time,Part time
4,Contract/Temporary,Temporary,,Freelance,Casual/Vacation


In [21]:
# clean comma-separated job types
def clean_job_type(val):
    job_types = [t.strip() for t in str(val).split(',')]
    
    remove = {'Other','Work From Home', 'Jobs for Women'}
    
    cleaned = [x for x in job_types if x not in remove]
    
    return ", ".join(cleaned) if cleaned else None

# infer job type from job title
def job_type_from_title(df, target_col='job_type_stand', text_col='title'):

    # intern & internship == Internship; part time & part-time == Part-time
    keywords = {r'\bintern\b|\binternship\b': 'Internship', r'\bpart-time\b|\bpart time\b': 'Part-time'}

    for pattern in keywords:
        job_type = keywords[pattern]

        is_match = df[text_col].str.contains(pattern, case=False, na=False)

        df.loc[is_match, target_col] = job_type
    
    return df


def standardize_job_type(df, mapping_dict, source_col):
    df['job_type_stand'] = df[source_col].apply(
        lambda x: ", ".join(
            mapping_dict.get(f.strip(), f.strip()) 
            for f in str(x).split(",") if f.strip()
        )
    )

# entire preprocessing function
def preprocess_job_type(df, source_name, source_col='emp_type', standard_col='Standardized'):
    if df is None or df.empty:
        df['job_type_stand'] = None
        return df
    
    df[source_col] = df[source_col].apply(clean_job_type)

    df = job_type_from_title(df, target_col='job_type_stand', text_col='title')

    mapping = standardization_map_exact(df_job_type, source_name, standard_col)

    standardize_job_type(df, mapping, source_col)

    return df
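The keyword patterns used by `job_type_from_title` can be tried on plain strings to see what they do and do not catch. This is a minimal sketch: `infer_job_type` and its `default` parameter are hypothetical, with `default` standing in for a job type obtained some other way.

```python
import re

# same patterns as in job_type_from_title
TITLE_PATTERNS = {
    r'\bintern\b|\binternship\b': 'Internship',
    r'\bpart-time\b|\bpart time\b': 'Part-time',
}


def infer_job_type(title, default=None):
    """Return the job type implied by the title, or `default` if none matches."""
    result = default
    # as in the notebook's loop, a later matching pattern overrides an earlier one
    for pattern, job_type in TITLE_PATTERNS.items():
        if re.search(pattern, str(title), flags=re.IGNORECASE):
            result = job_type
    return result


for title in ["Data Science Intern", "International Sales Manager",
              "Part Time Data Encoder", "Data Scientist"]:
    print(title, "->", infer_job_type(title, default="Full-time"))
```

The `\b` word boundaries keep `International` from being read as `intern`, which is the main reason to prefer these patterns over a plain substring test.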



In [22]:
# before preprocessing
job_type_count = pd.concat([linkedin_df['emp_type'].value_counts(),foundit_df['emp_type'].value_counts(),kalibrr_df['emp_type'].value_counts(), jobstreet_df['emp_type'].value_counts()],axis=1)
job_type_count.columns = ['Linkedin', 'Foundit', 'Kalibrr', 'Jobstreet']
job_type_count

Unnamed: 0_level_0,Linkedin,Foundit,Kalibrr,Jobstreet
emp_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Full-time,26.0,,,
Contract,4.0,,,
Permanent Job,,6.0,,
Full time,,,30.0,28.0
Freelance,,,1.0,
Contractual,,,1.0,
Contract/Temp,,,,1.0


Fixing Job Type in Linkedin, Foundit, Kalibrr, Jobstreet

In [23]:
# preprocessing
linkedin_df = preprocess_job_type(linkedin_df, 'Linkedin')
foundit_df = preprocess_job_type(foundit_df, 'Foundit')
kalibrr_df = preprocess_job_type(kalibrr_df, 'Kalibrr')
jobstreet_df = preprocess_job_type(jobstreet_df, 'Jobstreet')

In [24]:
# after preprocessing
job_type_count = pd.concat([linkedin_df['job_type_stand'].value_counts(),foundit_df['job_type_stand'].value_counts(),kalibrr_df['job_type_stand'].value_counts(), jobstreet_df['job_type_stand'].value_counts()],axis=1)
job_type_count.columns = ['Linkedin', 'Foundit', 'Kalibrr', 'Jobstreet']
job_type_count

Unnamed: 0_level_0,Linkedin,Foundit,Kalibrr,Jobstreet
job_type_stand,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Full-time,26.0,6.0,30.0,28
Contract/Temporary,4.0,,2.0,1
,,,,1


In [25]:
linkedin_df_stand = linkedin_df.drop(['job_func','location','emp_type'], axis=1)
kalibrr_df_stand = kalibrr_df.drop(['job_func','location','emp_type'], axis=1)
# foundit_df_stand = foundit_df.drop(['job_func','location'], axis=1)
foundit_df_stand = foundit_df.drop(['location','emp_type'], axis=1)
jobstreet_df_stand = jobstreet_df.drop(['job_func','job_func_clean','location','emp_type'], axis=1)

### Merging the Job Listings

In [26]:
linkedin_df_stand.columns

Index(['title', 'link', 'company', 'job_desc', 'posted', 'job_func_stand',
       'location_standardized', 'Latitude', 'Longitude', 'job_type_stand'],
      dtype='object')

In [27]:
kalibrr_df_stand.columns

Index(['title', 'link', 'company', 'posted', 'job_desc', 'job_func_stand',
       'location_standardized', 'Latitude', 'Longitude', 'job_type_stand'],
      dtype='object')

In [28]:
foundit_df_stand.columns

Index(['title', 'company', 'link', 'posted', 'job_func', 'job_desc',
       'location_standardized', 'Latitude', 'Longitude', 'job_type_stand'],
      dtype='object')

In [29]:
jobstreet_df_stand.columns

Index(['title', 'link', 'company', 'posted', 'job_desc', 'job_func_stand',
       'location_standardized', 'Latitude', 'Longitude', 'job_type_stand'],
      dtype='object')

In [30]:
# merge the four dataframes
columns = ['title', 'company', 'link', 'job_desc','job_func_stand', 'location_standardized', 'job_type_stand']

linkedin_df_stand['source'] = 'Linkedin'
kalibrr_df_stand['source'] = 'Kalibrr'
foundit_df_stand['source'] = 'Foundit'
jobstreet_df_stand['source'] = 'Jobstreet'

for df in [linkedin_df_stand, kalibrr_df_stand, foundit_df_stand, jobstreet_df_stand]:
    for col in columns:
        if col not in df.columns:
            df[col] = np.nan


job_listings = pd.concat([
    linkedin_df_stand[columns + ['source']],
    kalibrr_df_stand[columns + ['source']],
    foundit_df_stand[columns + ['source']],
    jobstreet_df_stand[columns + ['source']]
], ignore_index=True)

job_listings

Unnamed: 0,title,company,link,job_desc,job_func_stand,location_standardized,job_type_stand,source
0,DATA PH Hiring (All Roles),EY,https://ph.linkedin.com/jobs/view/data-ph-hiri...,"\n At EY, you’ll have the chance to b...",Information Technology,"Taguig City, National Capital Region (NCR)",Full-time,Linkedin
1,Data Analytics Analyst (Remote),Deloitte,https://ph.linkedin.com/jobs/view/data-analyti...,\n <p><strong>What impact will you ma...,Analyst,"Taguig City, National Capital Region (NCR)",Contract/Temporary,Linkedin
2,Passenger Sales Support Executive (Data Analyt...,Cebu Pacific Air,https://ph.linkedin.com/jobs/view/passenger-sa...,\n <strong>Department<br><br></strong...,"Sales, Sales","Pasay City, National Capital Region (NCR)",Full-time,Linkedin
3,Data Scientist,Jollibee Group,https://ph.linkedin.com/jobs/view/data-scienti...,\n Title: Data Scientist<br><br>The <...,"Engineering, Information Technology","Pasig City, National Capital Region (NCR)",Full-time,Linkedin
4,Data Scientist,Maya,https://ph.linkedin.com/jobs/view/data-scienti...,\n <p><strong>Overview:</strong></p><...,Information Technology,"Mandaluyong City, National Capital Region (NCR)",Full-time,Linkedin
...,...,...,...,...,...,...,...,...
93,Data Science Specialist,Nityo Infotech Services Philippines Inc.,https://ph.jobstreet.com/job/83191584?type=sta...,<p><strong>Location: </strong>Makati<br><stron...,Information Technology,"Makati City, National Capital Region (NCR)",Full-time,Jobstreet
94,"Cloud Data Engineer (Azure &amp; Fabric, SQL/P...",Satellite Office,https://ph.jobstreet.com/job/83357046?type=sta...,<p><strong>DATA ANALYST</strong></p><p>Work fo...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
95,Business Analyst - Data &amp; Insights,MyBudget,https://ph.jobstreet.com/job/83375610?type=sta...,<p>MyBudget Asia provides global support for M...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
96,AI Developer – Azure Databricks &amp; Machine ...,Outsourced Quality Assured Services Inc. (ISO ...,https://ph.jobstreet.com/job/83130305?type=sta...,<p><strong>Company Description</strong><br>Out...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet


### Dropping Duplicates

In [31]:
duplicates = job_listings[job_listings.duplicated(subset=['title', 'company'], keep=False)]
duplicates = duplicates.groupby(['title', 'company','source']).size().reset_index(name='count')
duplicates = duplicates.sort_values(by='count', ascending=False)
duplicates

Unnamed: 0,title,company,source,count
4,Senior Data Scientist,LeapFroggr Inc.,Kalibrr,4
2,Enterprise Analytics Specialist,Cebu Pacific Air,Linkedin,2
3,RISK ANALYTICS OFFICER,Bank of the Philippine Islands (BPI),Linkedin,2
5,Technology Consultant | Data Architecture Prin...,Accenture in the Philippines,Kalibrr,2
0,Data Analyst | Makati,MedGrocer,Kalibrr,1
1,Data Analyst | Makati,MedGrocer,Linkedin,1


In [32]:
# retain first listing
job_listings = job_listings.drop_duplicates(subset=['title', 'company'], keep='first')
job_listings

Unnamed: 0,title,company,link,job_desc,job_func_stand,location_standardized,job_type_stand,source
0,DATA PH Hiring (All Roles),EY,https://ph.linkedin.com/jobs/view/data-ph-hiri...,"\n At EY, you’ll have the chance to b...",Information Technology,"Taguig City, National Capital Region (NCR)",Full-time,Linkedin
1,Data Analytics Analyst (Remote),Deloitte,https://ph.linkedin.com/jobs/view/data-analyti...,\n <p><strong>What impact will you ma...,Analyst,"Taguig City, National Capital Region (NCR)",Contract/Temporary,Linkedin
2,Passenger Sales Support Executive (Data Analyt...,Cebu Pacific Air,https://ph.linkedin.com/jobs/view/passenger-sa...,\n <strong>Department<br><br></strong...,"Sales, Sales","Pasay City, National Capital Region (NCR)",Full-time,Linkedin
3,Data Scientist,Jollibee Group,https://ph.linkedin.com/jobs/view/data-scienti...,\n Title: Data Scientist<br><br>The <...,"Engineering, Information Technology","Pasig City, National Capital Region (NCR)",Full-time,Linkedin
4,Data Scientist,Maya,https://ph.linkedin.com/jobs/view/data-scienti...,\n <p><strong>Overview:</strong></p><...,Information Technology,"Mandaluyong City, National Capital Region (NCR)",Full-time,Linkedin
...,...,...,...,...,...,...,...,...
93,Data Science Specialist,Nityo Infotech Services Philippines Inc.,https://ph.jobstreet.com/job/83191584?type=sta...,<p><strong>Location: </strong>Makati<br><stron...,Information Technology,"Makati City, National Capital Region (NCR)",Full-time,Jobstreet
94,"Cloud Data Engineer (Azure &amp; Fabric, SQL/P...",Satellite Office,https://ph.jobstreet.com/job/83357046?type=sta...,<p><strong>DATA ANALYST</strong></p><p>Work fo...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
95,Business Analyst - Data &amp; Insights,MyBudget,https://ph.jobstreet.com/job/83375610?type=sta...,<p>MyBudget Asia provides global support for M...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
96,AI Developer – Azure Databricks &amp; Machine ...,Outsourced Quality Assured Services Inc. (ISO ...,https://ph.jobstreet.com/job/83130305?type=sta...,<p><strong>Company Description</strong><br>Out...,Information Technology,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
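The keep-first rule applied by `drop_duplicates(subset=['title', 'company'], keep='first')` can be written out in plain Python to make the consequence explicit. `drop_duplicate_listings` is a hypothetical helper; the sample rows reuse a title and company from the duplicates table above, with the other fields omitted.

```python
def drop_duplicate_listings(rows, keys=("title", "company")):
    """Keep only the first row seen for each (title, company) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"title": "Data Analyst | Makati", "company": "MedGrocer", "source": "Linkedin"},
    {"title": "Data Scientist", "company": "Maya", "source": "Linkedin"},
    {"title": "Data Analyst | Makati", "company": "MedGrocer", "source": "Kalibrr"},
]

for row in drop_duplicate_listings(rows):
    print(row["title"], "|", row["company"], "|", row["source"])
```

Since the sources are concatenated in a fixed order, a listing posted on several sites keeps the link and description of whichever source comes first in that order. The match is also exact, so titles that differ only in case or spacing are treated as different listings.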


In [33]:
job_listings = job_listings.reset_index(drop=True)
job_listings.columns = ['title', 'company', 'link','job_desc','job_func','location','job_type','source']

### Cleaning Job Description

In [34]:
def remove_html_tags(text):
    if not isinstance(text, str):
        return text 
    clean = re.sub(r'<.*?>', '', text)
    clean = clean.replace('\n', '').replace('\r', '').strip()
    return clean

In [35]:
job_listings['job_desc'] = job_listings['job_desc'].apply(remove_html_tags)
job_listings.head()

Unnamed: 0,title,company,link,job_desc,job_func,location,job_type,source
0,DATA PH Hiring (All Roles),EY,https://ph.linkedin.com/jobs/view/data-ph-hiri...,"At EY, you’ll have the chance to build a caree...",Information Technology,"Taguig City, National Capital Region (NCR)",Full-time,Linkedin
1,Data Analytics Analyst (Remote),Deloitte,https://ph.linkedin.com/jobs/view/data-analyti...,"What impact will you make?At Deloitte, we offe...",Analyst,"Taguig City, National Capital Region (NCR)",Contract/Temporary,Linkedin
2,Passenger Sales Support Executive (Data Analyt...,Cebu Pacific Air,https://ph.linkedin.com/jobs/view/passenger-sa...,DepartmentSales SupportEmployee TypeProbationa...,"Sales, Sales","Pasay City, National Capital Region (NCR)",Full-time,Linkedin
3,Data Scientist,Jollibee Group,https://ph.linkedin.com/jobs/view/data-scienti...,Title: Data ScientistThe Data Scientist is res...,"Engineering, Information Technology","Pasig City, National Capital Region (NCR)",Full-time,Linkedin
4,Data Scientist,Maya,https://ph.linkedin.com/jobs/view/data-scienti...,"Overview:As a Data Scientist, you will be resp...",Information Technology,"Mandaluyong City, National Capital Region (NCR)",Full-time,Linkedin
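In the cleaned descriptions above, removing tags outright glues neighbouring words together (for example `make?At`), and the scraped text can still contain HTML entities such as `&amp;`. A possible variation, sketched here with a hypothetical `clean_html` helper rather than the notebook's own function, replaces each tag with a space, decodes entities, and collapses whitespace.

```python
import html
import re


def clean_html(text):
    """Strip tags, decode entities, and collapse whitespace (illustrative variant)."""
    if not isinstance(text, str):
        return text
    text = re.sub(r'<[^>]+>', ' ', text)   # replace each tag with a space
    text = html.unescape(text)             # e.g. &amp; -> &
    return re.sub(r'\s+', ' ', text).strip()


sample = "<p><strong>What impact will you make?</strong></p><p>Data &amp; Insights</p>"
print(clean_html(sample))
```

Keeping word boundaries intact may matter later if the descriptions are tokenized or embedded, since fused words such as `make?At` are harder for a text model to interpret.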


# Checkpoint

In [36]:
job_listings.columns = ['title', 'company', 'link', 'job_desc', 'job_func', 'location', 'job_type', 'source']
job_listings.to_csv('job_listings.csv', index=False)  # skip the index so no 'Unnamed: 0' column is written

In [37]:
job_listings = pd.read_csv('job_listings.csv')
job_listings.head()

Unnamed: 0,title,company,link,job_desc,job_func,location,job_type,source
0,DATA PH Hiring (All Roles),EY,https://ph.linkedin.com/jobs/view/data-ph-hiri...,"At EY, you’ll have the chance to build a caree...",Information Technology,"Taguig City, National Capital Region (NCR)",Full-time,Linkedin
1,Data Analytics Analyst (Remote),Deloitte,https://ph.linkedin.com/jobs/view/data-analyti...,"What impact will you make?At Deloitte, we offe...",Analyst,"Taguig City, National Capital Region (NCR)",Contract/Temporary,Linkedin
2,Passenger Sales Support Executive (Data Analyt...,Cebu Pacific Air,https://ph.linkedin.com/jobs/view/passenger-sa...,DepartmentSales SupportEmployee TypeProbationa...,"Sales, Sales","Pasay City, National Capital Region (NCR)",Full-time,Linkedin
3,Data Scientist,Jollibee Group,https://ph.linkedin.com/jobs/view/data-scienti...,Title: Data ScientistThe Data Scientist is res...,"Engineering, Information Technology","Pasig City, National Capital Region (NCR)",Full-time,Linkedin
4,Data Scientist,Maya,https://ph.linkedin.com/jobs/view/data-scienti...,"Overview:As a Data Scientist, you will be resp...",Information Technology,"Mandaluyong City, National Capital Region (NCR)",Full-time,Linkedin


# Building Utility Matrix for Content-Based Filtering

In [38]:
matrix = job_listings[['title', 'job_desc', 'job_func', 'location', 'job_type']]

In [39]:
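def location_to_coord(df_target, df_locations):
    """Replace the 'location' column with Latitude/Longitude looked up in df_locations."""
    # Build the lookup key on a copy so the caller's df_locations is not modified
    df_locations = df_locations.copy()
    df_locations['location'] = df_locations['City/Province'] + ', ' + df_locations['Region']
    df_target = df_target.merge(df_locations[['location', 'Latitude', 'Longitude']], on='location', how='left')
    df_target = df_target.drop('location', axis=1)
    # Unmatched locations fall back to a default coordinate (roughly the center of the Philippines)
    df_target['Latitude'] = df_target['Latitude'].fillna(12.8797)
    df_target['Longitude'] = df_target['Longitude'].fillna(121.7740)
    return df_target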

def job_type_encoding(df_target):
    """One-hot encode 'job_type' into four fixed columns."""
    one_hot = pd.get_dummies(df_target['job_type'])
    # Add any job type missing from the data so every frame has the same columns
    for col in ['Full-time', 'Contract/Temporary', 'Part-time', 'Internship']:
        if col not in one_hot.columns:
            one_hot[col] = False
    one_hot = one_hot[['Full-time', 'Contract/Temporary', 'Part-time', 'Internship']]
    df_target = pd.concat([df_target, one_hot.astype(int)], axis=1)
    df_target = df_target.drop('job_type', axis=1)
    return df_target

def sentence_2_vec(df_target):
    """Replace 'job_desc' and 'job_func' with their sentence embeddings."""
    df_target = df_target.fillna({'job_desc': '', 'job_func': ''})
    model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings
    embeddings = pd.DataFrame(model.encode(df_target['job_desc']))
    embeddings2 = pd.DataFrame(model.encode(df_target['job_func']))
    # Shift the job_func column labels so they follow the job_desc ones (0-383, then 384-767)
    embeddings2.columns = [x+len(embeddings.columns) for x in embeddings2.columns]
    df_target = df_target.drop(['job_desc', 'job_func'], axis=1)
    df_target = pd.concat([df_target, embeddings, embeddings2], axis=1)
    return df_target

In [40]:
matrix = job_type_encoding(matrix)
matrix = location_to_coord(matrix, df_locations)
matrix = sentence_2_vec(matrix)

In [41]:
matrix

Unnamed: 0,title,Full-time,Contract/Temporary,Part-time,Internship,Latitude,Longitude,0,1,2,...,758,759,760,761,762,763,764,765,766,767
0,DATA PH Hiring (All Roles),1,0,0,0,14.5500,121.0833,-0.040261,-0.037374,0.036982,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
1,Data Analytics Analyst (Remote),0,1,0,0,14.5500,121.0833,-0.051495,-0.015045,-0.041116,...,0.060892,0.000673,0.014715,-0.061912,-0.089441,0.054234,0.083023,-0.048958,0.005434,0.004617
2,Passenger Sales Support Executive (Data Analyt...,1,0,0,0,14.5378,121.0014,0.001428,-0.002253,0.011876,...,0.031585,-0.021183,-0.008356,-0.042382,-0.033441,0.018713,0.045198,-0.020632,0.042962,0.060534
3,Data Scientist,1,0,0,0,14.5605,121.0765,-0.003965,-0.005569,-0.025925,...,0.035966,0.039613,0.034523,-0.052508,-0.043252,0.039149,0.056286,-0.075177,0.094946,0.039914
4,Data Scientist,1,0,0,0,14.6167,121.0333,-0.026785,-0.032693,0.018242,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,Data Science Specialist,1,0,0,0,14.5503,121.0327,-0.045670,0.040086,-0.066529,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
87,"Cloud Data Engineer (Azure &amp; Fabric, SQL/P...",1,0,0,0,14.5995,120.9842,0.002385,0.021173,-0.021664,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
88,Business Analyst - Data &amp; Insights,1,0,0,0,14.5995,120.9842,-0.011788,0.061682,-0.034131,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
89,AI Developer – Azure Databricks &amp; Machine ...,1,0,0,0,14.5995,120.9842,-0.070792,-0.012066,-0.020706,...,0.031285,-0.023702,0.100298,0.000189,0.006308,0.064320,0.056485,-0.074071,0.074240,0.010131
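
The four job-type columns in the matrix above have to be present and in the same order for both the job listings and the user profile, otherwise the dot product used later would compare mismatched positions. A minimal sketch of how that fixed layout could also be obtained with `DataFrame.reindex`; the names `JOB_TYPES` and `encode_job_type` are hypothetical and this is only an illustrative variant of `job_type_encoding`.

```python
import pandas as pd

JOB_TYPES = ['Full-time', 'Contract/Temporary', 'Part-time', 'Internship']

def encode_job_type(df):
    """One-hot encode 'job_type' into a fixed set of columns (illustrative variant)."""
    dummies = pd.get_dummies(df['job_type'])
    # reindex adds any missing category as an all-zero column and drops unexpected ones
    dummies = dummies.reindex(columns=JOB_TYPES, fill_value=0).astype(int)
    return pd.concat([df.drop(columns='job_type'), dummies], axis=1)

demo = pd.DataFrame({'title': ['Data Scientist', 'Data Intern'],
                     'job_type': ['Full-time', 'Internship']})
encoded = encode_job_type(demo)
print(encoded)
```

Unlike the loop in `job_type_encoding`, this variant silently drops job types outside `JOB_TYPES`, which may or may not be desirable depending on the scraped data.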


# User Profile

In [70]:
# sample user profile
user = {
    'location': ['Makati, National Capital Region (NCR)'],
    'job_func': ['Information Technology'],
    'job_type': ['Internship'],
    'job_desc': ['data science, data analytics, Bonuses A mix of both collaborative/team-oriented and independent/solo work ₱50,000 - ₱60,000']
}
user = pd.DataFrame(user)

In [71]:
user_profile = job_type_encoding(user)
user_profile = location_to_coord(user_profile, df_locations)
user_profile = sentence_2_vec(user_profile)

# Recommending the Jobs

## Content-based Filtering

In [72]:
def compute_similarity_score1(user, matrix):
    # Relies on the notebook-level job_query defined earlier for the title score
    # Text similarity over the job_desc and job_func embedding columns
    job_func_desc_score = cosine_similarity(user.iloc[:,6:], matrix.iloc[:,7:])
    # Closer coordinates score nearer to 1
    location_score = 1/(1+euclidean_distances(user[['Latitude','Longitude']],matrix[['Latitude','Longitude']]))
    # 1 when the job type matches the user's preferred type, 0 otherwise
    job_type_score = np.array([matrix.apply(lambda row: np.dot(user.iloc[:,:4].values[0], row[1:5].values), axis=1)])
    title_score = np.array(matrix.apply(lambda row: normalized_similarity(job_query, row['title']), axis=1))

    total = job_func_desc_score + location_score + job_type_score + title_score
    return pd.DataFrame({"total":total[0]})

In [73]:
total = compute_similarity_score1(user_profile, matrix)
recommended_jobs = job_listings.iloc[total.sort_values('total', ascending=False).index]
recommended_jobs

Unnamed: 0,title,company,link,job_desc,job_func,location,job_type,source
82,"Data Analytics and Reporting | Tableau, VBA, SQL",Cardinal Health International Philippines Inc.,https://ph.jobstreet.com/job/83337259?type=sta...,About the TeamThe Strategic Delivery Solutions...,Information Technology,,,Jobstreet
27,ML Engineer,Redient Security,https://ph.linkedin.com/jobs/view/ml-engineer-...,"Job Title: Machine Learning Engineer (Remote, ...","Engineering, Information Technology",Philippines,Contract/Temporary,Linkedin
18,Business Intelligence Analyst,BDO Unibank,https://ph.linkedin.com/jobs/view/business-int...,The Business Intelligence Analyst is responsib...,Analyst,National Capital Region (NCR),Full-time,Linkedin
25,"Data Analyst (Remote, Graveyard)",Anytime Mailbox,https://ph.linkedin.com/jobs/view/data-analyst...,As a Data Analyst specializing in analyzing in...,Analyst,Philippines,Contract/Temporary,Linkedin
57,Head of Credit Policy and Data Science,Growsari,https://www.foundit.com.ph/job/head-of-credit-...,About the Company:SariPay is a fintech company...,,Philippines,Full-time,Foundit
...,...,...,...,...,...,...,...,...
56,Data Science Manager - Customer Success,Feedzai,https://www.foundit.com.ph/job/data-science-ma...,Feedzai is the worlds first RiskOps platform f...,,"Manila City, National Capital Region (NCR)",Full-time,Foundit
8,RISK ANALYTICS OFFICER,Bank of the Philippine Islands (BPI),https://ph.linkedin.com/jobs/view/risk-analyti...,This position is primarily responsible for ena...,"Banking & Financial Services, Sales","Manila City, National Capital Region (NCR)",Full-time,Linkedin
63,Data Scientist,"CHAMP Cargosystems Philippines, Inc.",https://ph.jobstreet.com/job/83098272?type=sta...,Overview&nbsp;CHAMP Cargosystems provides the ...,Sciences,"Manila City, National Capital Region (NCR)",Full-time,Jobstreet
5,Data Scientist,Bank of the Philippine Islands (BPI),https://ph.linkedin.com/jobs/view/data-scienti...,This position is primarily responsible for the...,"Analyst, Sales, Iba pa","Makati City, National Capital Region (NCR)",Full-time,Linkedin
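
To make the scoring easier to follow, here is a toy, self-contained sketch of three of the components summed in `compute_similarity_score1` for a single user/job pair: cosine similarity of the text embeddings, `1/(1+distance)` on the coordinates, and the dot product of the one-hot job types. The title component is left out because the import of `normalized_similarity` is not shown in this section. The two-dimensional embeddings, the dictionaries, and the function names are made up for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_score(user, job):
    """Sum of the text, location, and job-type components for one user/job pair."""
    text = cosine(user['embedding'], job['embedding'])
    distance = np.linalg.norm(np.array(user['coord']) - np.array(job['coord']))
    location = 1.0 / (1.0 + distance)  # equals 1 when the coordinates coincide
    job_type = float(np.dot(user['job_type'], job['job_type']))  # 1 if same type
    return text + location + job_type

user = {'embedding': np.array([1.0, 0.0]), 'coord': (14.55, 121.03), 'job_type': [0, 0, 0, 1]}
same = {'embedding': np.array([1.0, 0.0]), 'coord': (14.55, 121.03), 'job_type': [0, 0, 0, 1]}
other = {'embedding': np.array([0.0, 1.0]), 'coord': (15.55, 121.03), 'job_type': [1, 0, 0, 0]}

print(toy_score(user, same))   # identical profile: 1 + 1 + 1
print(toy_score(user, other))  # orthogonal text, one degree away, different type
```

Because the components are added with equal weight, a matching job type contributes as much as a perfect text match. Note also that the distance is measured in degrees of latitude and longitude, so it is only a rough proxy for geographic distance.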


## Collaborative Filtering

### Building Utility Matrix for Collaborative Filtering

In [74]:
selected_jobs_per_user = {}
for each in os.listdir('./job_outputs'):
    df = pd.read_csv(f'./job_outputs/{each}')
    selected = df[df['selected'] == 1].reset_index(drop=True)
    selected_links = list(selected['link'])
    name = each.split('_')[1]  # the user's name is the second '_'-separated part of the filename
    # Keep only users who picked exactly five jobs; otherwise store five missing values
    if len(selected_links) != 5:
        selected_links = [np.nan] * 5
    selected_jobs_per_user[name] = selected_links

# One row per user, with the five selected links in columns 0-4
selected_jobs_df = pd.DataFrame(selected_jobs_per_user).T.reset_index()
selected_jobs_df = selected_jobs_df.rename({'index': 'name'}, axis=1)

In [75]:
all_responses['name'] = df_form_response['Complete Name']
all_responses['job_title'] = job_query
# Copy the slice so the assignment below does not trigger a SettingWithCopyWarning
user_matrix = all_responses[['name', 'job_title', 'location', 'job_func', 'job_type', 'job_desc']].copy()
user_matrix['name'] = user_matrix['name'].str.split(', ').str[0]  # keep only the part before the first ', '
user_item_matrix = user_matrix.merge(selected_jobs_df, on='name', how='left')

In [76]:
user_matrix_num = job_type_encoding(user_item_matrix[['job_title', 'location', 'job_func', 'job_type', 'job_desc']])
user_matrix_num = location_to_coord(user_matrix_num, df_locations)
user_matrix_num = sentence_2_vec(user_matrix_num)

In [81]:
def compute_similarity_score2(user, user_matrix, job_query):
    # Same components as compute_similarity_score1 minus the title score; job_query is accepted but not used
    job_func_desc_score = cosine_similarity(user.iloc[:,6:], user_matrix.iloc[:,7:])
    location_score = 1/(1+euclidean_distances(user[['Latitude','Longitude']], user_matrix[['Latitude','Longitude']]))
    job_type_score = np.array([user_matrix.apply(lambda row: np.dot(user.iloc[:,:4].values[0], row[1:5].values), axis=1)])
    total = job_func_desc_score + location_score + job_type_score
    return pd.DataFrame({"total":total[0]})

In [111]:
total = compute_similarity_score2(user_profile, user_matrix_num, "data_science")
similar_users = user_item_matrix.loc[total.sort_values('total', ascending=False).index][:5]
similar_users

Unnamed: 0,name,job_title,location,job_func,job_type,job_desc,0,1,2,3,4
17,ESPERANZA,internships-software-engineer-web-application-...,"Calamba, CALABARZON (Region IV-A)",Information Technology,Internship,"Hybrid programming, web development, UI/UX Hea...",https://www.kalibrr.com/c/a-philippines/jobs/2...,https://ph.jobstreet.com/job/83256230?type=sta...,https://www.kalibrr.com/c/a-philippines/jobs/2...,https://ph.jobstreet.com/job/83187788?type=sta...,https://www.kalibrr.com/c/glyphstudios-inc-1/j...
10,MANLISES,ai-machine-learning-internship-nlp,"Manila, National Capital Region (NCR)",Information Technology,Internship,"Remote (Work from Home) programming, machine l...",https://ph.jobstreet.com/job/83240150?type=sta...,https://www.kalibrr.com/c/media-meter/jobs/246...,https://www.kalibrr.com/c/media-meter/jobs/252...,https://www.kalibrr.com/c/media-meter/jobs/247...,https://www.kalibrr.com/c/media-meter/jobs/246...
53,DIMAGIBA,internships-data-science-makati-software-engin...,"Makati, National Capital Region (NCR)",Information Technology,Internship,"Hybrid programming, web development, UI/UX, da...",https://ph.jobstreet.com/job/83187694?type=sta...,https://www.kalibrr.com/c/medgrocer/jobs/25338...,https://www.kalibrr.com/c/likha-it/jobs/252701...,https://www.kalibrr.com/c/gardenia/jobs/253220...,https://www.kalibrr.com/c/medgrocer/jobs/23630...
15,ROCO,it-internship-manila-it-internship-ph,"Taguig, National Capital Region (NCR)",Information Technology,Internship,"Hybrid RPA, programming, QA work-life balance ...",https://ph.jobstreet.com/job/83235471?type=sta...,https://ph.jobstreet.com/job/83009607?type=sta...,https://ph.jobstreet.com/job/82585606?type=sta...,https://www.kalibrr.com/c/media-meter/jobs/246...,https://www.kalibrr.com/c/multisys-technologie...
28,GUBAN,marketing-jobs-makati,"Makati, National Capital Region (NCR)",Marketing,Internship,"On-Site flexible, selling, leadership Work-lif...",https://ph.jobstreet.com/job/83229220?type=sta...,https://ph.linkedin.com/jobs/view/marketing-as...,https://ph.linkedin.com/jobs/view/business-rel...,https://ph.linkedin.com/jobs/view/marketing-as...,https://ph.linkedin.com/jobs/view/program-mana...
