<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# <u><b>Capstone:</b></u> Recommending adjacent jobs to platform delivery drivers using NLP and cosine similarity

--- 
### Part 1: Find adjacent jobs using cosine similarity, based on O*NET task descriptions
---

The Occupational Information Network (O\*NET) Program is the United States' primary source of occupational information. The O\*NET Standard Occupational Classification (SOC) taxonomy identifies and defines 923 distinct occupations and their associated tasks, based on data collected from job incumbents or occupation experts.

This information is essential to understanding the rapidly changing nature of work and how it impacts the workforce. It helps:
- workers find the training and jobs they need,
- employers find the skilled workers they need to be competitive in the marketplace,
- schools and government agencies develop and maintain a skilled workforce.

The Singapore Standard Occupational Classification (SSOC) is adapted from SOC and localized to Singapore's context.

The SOC tasks dataset lists both core and supplementary tasks of each occupation.

This notebook applies cosine similarity to the SOC core tasks of each occupation, to find the occupations whose tasks are most similar to those of `Light Truck Drivers`.
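
For reference, the cosine similarity between the word-count vectors of two occupations can be written as follows. The symbols $\mathbf{a}$ and $\mathbf{b}$ are introduced here for illustration only; each component counts how often one vocabulary word appears in an occupation's core tasks.

$$
\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^2} \, \sqrt{\sum_{i} b_i^2}}
$$

Because word counts are non-negative, the score lies between 0 (no shared words) and 1 (identical word proportions).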

`Light Truck Drivers` drive a light vehicle, such as a truck or van, with a capacity of less than 26,001 pounds Gross Vehicle Weight (GVW), primarily to pick up merchandise or packages from a distribution center and deliver them. They may also load and unload the vehicle.

Their core tasks include:
- Obey traffic laws and follow established traffic and transportation procedures.
- Turn in receipts and money received from deliveries.
- Read maps and follow written or verbal geographic directions.
- Verify the contents of inventory loads against shipping papers.
- Load and unload trucks, vans, or automobiles.
- Drive vehicles with capacities under three tons to transport materials to and from specified destinations, such as railroad stations, plants, residences, offices, or within industrial yards.
- Maintain records, such as vehicle logs, records of cargo, or billing statements, in accordance with regulations.
- Inspect and maintain vehicle supplies and equipment, such as gas, oil, water, tires, lights, or brakes, to ensure that vehicles are in proper working condition.
- Present bills and receipts and collect payments for goods delivered or loaded.
- Report any mechanical problems encountered with vehicles.
- Perform emergency repairs, such as changing tires or installing light bulbs, fuses, tire chains, or spark plugs.
- Report delays, accidents, or other traffic and transportation situations to bases or other vehicles, using telephones or mobile two-way radios.

In [1]:
import pandas as pd
import numpy as np

import nltk 
nltk.download('punkt') 
nltk.download('averaged_perceptron_tagger') 
nltk.download('wordnet') 
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
from nltk.corpus import stopwords 
nltk.download('stopwords') 
stop_words = set(stopwords.words('english')) 

VERB_CODES = {'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity 

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# user-defined functions
from eda_utils import clean_string

[nltk_data] Downloading package punkt to /Users/pris/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pris/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/pris/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/pris/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# load O*NET task statements (one row per occupation task)
occupations = pd.read_csv('../Data/Task Statements.csv')

In [3]:
occupations.columns

Index(['O*NET-SOC Code', 'Title', 'Task ID', 'Task', 'Task Type',
       'Incumbents Responding', 'Date', 'Domain Source'],
      dtype='object')

In [4]:
occupations=occupations[occupations['Task Type']=='Core'].groupby(['O*NET-SOC Code', 'Title']).agg({'Task': lambda x: ' '.join(x)})

In [5]:
occupations=occupations.reset_index()

In [6]:
occupations.head()

| | O*NET-SOC Code | Title | Task |
|---|---|---|---|
| 0 | 11-1011.00 | Chief Executives | Direct or coordinate an organization's financi... |
| 1 | 11-1011.03 | Chief Sustainability Officers | Monitor and evaluate effectiveness of sustaina... |
| 2 | 11-1021.00 | General and Operations Managers | Review financial statements, sales or activity... |
| 3 | 11-2011.00 | Advertising and Promotions Managers | Plan and prepare advertising and promotional m... |
| 4 | 11-2021.00 | Marketing Managers | Identify, develop, or evaluate marketing strat... |
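
As a minimal sketch of what the filter-and-aggregate step above does, the toy frame below keeps only the `Core` tasks and joins them into one string per occupation. The rows, codes, and titles are made up for illustration and are not taken from the O\*NET file.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the O*NET file
toy = pd.DataFrame({
    'O*NET-SOC Code': ['00-0001.00', '00-0001.00', '00-0001.00', '00-0002.00'],
    'Title': ['Job A', 'Job A', 'Job A', 'Job B'],
    'Task': ['Drive vans.', 'Load vans.', 'Wash vans.', 'Sort mail.'],
    'Task Type': ['Core', 'Core', 'Supplemental', 'Core'],
})

# Keep core tasks, then concatenate each occupation's tasks into one document
core = (
    toy[toy['Task Type'] == 'Core']
    .groupby(['O*NET-SOC Code', 'Title'])
    .agg({'Task': lambda x: ' '.join(x)})
    .reset_index()
)
print(core)
```

Each occupation ends up as a single row whose `Task` column is one long text document, which is the shape the vectorizer expects later on.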


In [7]:
# Preprocess tasks
occupations['task_proc'] = occupations['Task'].map(lambda x: clean_string(x, stem='spacy'))
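
The `clean_string` helper comes from `eda_utils` and is not shown in this notebook. The sketch below is a hypothetical stand-in, not the actual implementation: it assumes the helper lowercases text, strips punctuation, and removes stop words. It uses a small hand-written stop-word set and does not perform the lemmatization implied by `stem='spacy'`.

```python
import re

# Small illustrative stop-word set (an assumption; the notebook loads NLTK's list)
STOP_WORDS_SKETCH = {'a', 'all', 'an', 'and', 'for', 'in', 'of', 'or', 'the', 'to'}

def clean_string_sketch(text):
    """Lowercase, drop non-letter characters, and remove stop words.

    Hypothetical stand-in for eda_utils.clean_string; no lemmatization.
    """
    text = text.lower()
    # Removing punctuation also fuses hyphenated words: 'load-related' -> 'loadrelated'
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = [t for t in text.split() if t not in STOP_WORDS_SKETCH]
    return ' '.join(tokens)

print(clean_string_sketch('Load and unload trucks, vans, or automobiles.'))
# prints: load unload trucks vans automobiles
```

A lemmatizing version would additionally reduce words such as `trucks` to `truck`, so that singular and plural forms count as the same feature.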

In [8]:
count = CountVectorizer()
count_matrix = count.fit_transform(occupations['task_proc'])

In [9]:
## create a dataframe of word counts: one row per occupation, one column per vocabulary word
table = count_matrix.toarray()
df = pd.DataFrame(table, 
                  columns=count.get_feature_names_out(), 
                  index=occupations['Title'])
df

| Title | ab | abandon | abatement | abbreviation | abdominal | ability | abnormal | abnormality | aboard | aboveground | ... | yoga | youth | zero | zerobase | zeta | zinc | zipper | zone | zoning | zoom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chief Executives | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Chief Sustainability Officers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| General and Operations Managers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Advertising and Promotions Managers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Marketing Managers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Gas Compressor and Gas Pumping Station Operators | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Pump Operators, Except Wellhead Pumpers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Wellhead Pumpers | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Refuse and Recyclable Material Collectors | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [10]:
cosine_sim = cosine_similarity(count_matrix)
print(cosine_sim)

[[1.         0.17993506 0.56764621 ... 0.0345854  0.08240419 0.05805395]
 [0.17993506 1.         0.11673081 ... 0.02417018 0.01919619 0.03606336]
 [0.56764621 0.11673081 1.         ... 0.05889681 0.03508232 0.12083171]
 ...
 [0.0345854  0.02417018 0.05889681 ... 1.         0.11622583 0.27293788]
 [0.08240419 0.01919619 0.03508232 ... 0.11622583 1.         0.23603839]
 [0.05805395 0.03606336 0.12083171 ... 0.27293788 0.23603839 1.        ]]
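
The structure of this matrix can be checked on a small example. The sketch below uses a made-up 3 x 7 count matrix (an assumption for illustration), computes one score by hand with NumPy, and compares it with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy word-count matrix: 3 documents, 7 vocabulary words (illustrative values)
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 2, 0, 0, 0],
])

# Manual cosine similarity between documents 0 and 1
a, b = counts[0], counts[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sim = cosine_similarity(counts)   # full pairwise matrix, shape (3, 3)

print(manual)      # 0.25: one shared word, both vectors have norm 2
print(sim[0, 1])   # matches the manual value
```

The matrix is symmetric and has ones on the diagonal, since every document is perfectly similar to itself. That is why the target occupation appears first when its row is sorted in the next cell.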


In [11]:
# Row position of the target occupation in the similarity matrix
target_idx = occupations[occupations['Title']=='Light Truck Drivers'].index.values[0]
similar_jobs = list(enumerate(cosine_sim[target_idx]))

## Sort the list in descending order
sorted_similar_jobs = sorted(similar_jobs, key=lambda x:x[1], reverse=True)
sorted_similar_jobs

[(837, 0.9999999999999988),
 (836, 0.5147074385286154),
 (854, 0.36810096477564),
 (459, 0.36575397259125625),
 (828, 0.30257135910226235),
 (573, 0.30162467340889487),
 (838, 0.3004291855343842),
 (684, 0.27263898807805886),
 (694, 0.2718837195356705),
 (843, 0.26569371377681855),
 (861, 0.2611041844116734),
 (851, 0.2480537002136374),
 (685, 0.2401989193427306),
 (849, 0.2399137233451914),
 (831, 0.2332957588208874),
 (467, 0.22501017075982685),
 (860, 0.22393089929376958),
 (850, 0.22282825891079314),
 (862, 0.21659542988464364),
 (575, 0.216366366222047),
 (839, 0.21382898357642316),
 (682, 0.21296273478183453),
 (488, 0.21022399042875153),
 (449, 0.2100303248045671),
 (440, 0.20817020153834942),
 (460, 0.20599127755175234),
 (845, 0.20495895585975488),
 (871, 0.20061823949625285),
 (149, 0.19881337906064875),
 (847, 0.1982515714200406),
 (549, 0.19228665013374344),
 (18, 0.19059801293412926),
 (448, 0.18916010217834134),
 (835, 0.18809603121561158),
 (716, 0.1856064674669238),
 (8

In [12]:
# The first entry is the job itself, so we skip it
most_similar_jobs = pd.DataFrame(sorted_similar_jobs[1:6], columns=["index", "similarity"])
most_similar_jobs

| | index | similarity |
|---|---|---|
| 0 | 836 | 0.514707 |
| 1 | 854 | 0.368101 |
| 2 | 459 | 0.365754 |
| 3 | 828 | 0.302571 |
| 4 | 573 | 0.301625 |
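
The ranking step can also be wrapped in a small helper that returns titles directly instead of row indices. This is a sketch under stated assumptions: `top_similar` is a hypothetical function, not part of the notebook, and the titles and scores below are toy values.

```python
import numpy as np
import pandas as pd

def top_similar(titles, sim_matrix, target, k=5):
    """Return the k titles most similar to `target`, excluding the target itself."""
    titles = pd.Series(list(titles))
    sim_matrix = np.asarray(sim_matrix)
    target_pos = titles[titles == target].index[0]
    # Drop the target by position rather than assuming it sorts first
    scores = pd.Series(
        np.delete(sim_matrix[target_pos], target_pos),
        index=titles.drop(target_pos).values,
    )
    return scores.sort_values(ascending=False).head(k)

# Toy similarity matrix for three made-up occupations
toy_titles = ['A', 'B', 'C']
toy_sim = [[1.0, 0.2, 0.7],
           [0.2, 1.0, 0.1],
           [0.7, 0.1, 1.0]]

result = top_similar(toy_titles, toy_sim, 'A', k=2)
print(result)
```

Removing the target by position avoids relying on its self-similarity sorting to the top, which could fail if another occupation happened to have an identical task vector.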


In [13]:
## The last 5 entries of the sorted list are the least similar jobs
least_similar_jobs = pd.DataFrame(sorted_similar_jobs[-5:], columns=["index", "similarity"])
least_similar_jobs   

| | index | similarity |
|---|---|---|
| 0 | 589 | 0.009522 |
| 1 | 599 | 0.00855 |
| 2 | 748 | 0.006223 |
| 3 | 636 | 0.006203 |
| 4 | 539 | 0.0 |


In [14]:
for i in range(5):
    print(occupations.loc[most_similar_jobs.loc[i]['index']])
    print("")

O*NET-SOC Code                                           53-3032.00
Title                       Heavy and Tractor-Trailer Truck Drivers
Task              Check all load-related documentation for compl...
task_proc         check loadrelated documentation completeness a...
Name: 836, dtype: object

O*NET-SOC Code                                           53-6051.07
Title             Transportation Vehicle, Equipment and Systems ...
Task              Inspect vehicles or other equipment for eviden...
task_proc         inspect vehicle equipment evidence abuse damag...
Name: 854, dtype: object

O*NET-SOC Code                                           33-3041.00
Title                                   Parking Enforcement Workers
Task              Enter and retrieve information pertaining to v...
task_proc         enter retrieve information pertain vehicle reg...
Name: 459, dtype: object

O*NET-SOC Code                                           53-1043.00
Title             First-Line Superviso

In [15]:
for i in range(5):
    print(occupations.loc[least_similar_jobs.loc[i]['index']])
    print("")

O*NET-SOC Code                                           43-9031.00
Title                                            Desktop Publishers
Task              Operate desktop publishing software and equipm...
task_proc         operate desktop publish software equipment des...
Name: 589, dtype: object

O*NET-SOC Code                                           45-2041.00
Title                    Graders and Sorters, Agricultural Products
Task              Place products in containers according to grad...
task_proc         place product container accord grade mark grad...
Name: 599, dtype: object

O*NET-SOC Code                                           51-4071.00
Title                                   Foundry Mold and Coremakers
Task              Sift and pack sand into mold sections, core bo...
task_proc         sift pack sand mold section core box pattern c...
Name: 748, dtype: object

O*NET-SOC Code                                           47-2171.00
Title                            Reinf

The top 5 most similar jobs identified are:
- Heavy and Tractor-Trailer Truck Drivers
- Transportation Vehicle, Equipment and Systems Inspectors, Except Aviation
- Parking Enforcement Workers
- First-Line Supervisors of Material-Moving Machine and Vehicle Operators
- Couriers and Messengers

The next step is to scrape job postings from local job sites, to see whether the job descriptions listed by employers are indeed similar to the tasks of `Light Truck Drivers`. For verification, we will also compare against one of the least similar jobs identified above, to confirm that its job descriptions are indeed dissimilar to `Light Truck Drivers` and that it is not a suitable recommendation.
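
One possible way to run that comparison is sketched below. It assumes the same bag-of-words approach is reused: the vectorizer is fitted on occupation task text, and a scraped posting is projected onto that vocabulary with `transform`. The task strings and the posting are hypothetical toy text, not scraped data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy preprocessed task text for two occupations (illustrative only)
occupation_tasks = [
    'drive vehicle deliver package load unload truck',
    'operate desktop publish software design page layout',
]
# Hypothetical preprocessed job posting
posting = ['deliver package customer drive van load truck']

vec = CountVectorizer()
task_matrix = vec.fit_transform(occupation_tasks)

# Reuse the fitted vocabulary; words unseen during fitting are ignored
posting_vec = vec.transform(posting)

scores = cosine_similarity(posting_vec, task_matrix)[0]
print(scores)   # first score is high, second is 0.0 (no shared words)
```

Using `transform` rather than refitting keeps the posting and the occupations in the same feature space, so the scores are comparable with the occupation-to-occupation similarities above.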