## Pulling MCF Data

**Author:** Shaun Khoo  
**Date:** 4 Dec 2021  
**Context:** Need the full set of MCF job ads data to run language modelling (pre-training) for the DistilBERT model  
**Objective:** Call the MCF Jobs API for all the job ad IDs that we have in our dataset

#### A) Setting up

Importing the libraries and the dataset containing the job ad IDs to be pulled

In [None]:
import pandas as pd
import requests
import hashlib 
import time
import random
import json
import os

`existing1` contains the 50k subset which we previously pulled data for. `existing2` contains the 200k subset which we received from the Jumpstart team from DSAID's hackathon in Jan 2021.

In [2]:
existing1 = pd.read_csv('../Data/Archive/MCF_Training_Set_Full.csv')

In [3]:
existing1 = existing1[['MCF_Job_Ad_ID', 'uuid']]

In [5]:
existing2 = pd.read_csv('../Data/Archive/WGS_Dataset_Part_1_JobInfo.csv')

In [6]:
existing2 = existing2[['job_post_id', 'status']]
# Rename to line up with existing1; 'status' is only used later as a non-null presence flag
existing2.columns = ['MCF_Job_Ad_ID', 'uuid']

`new` contains all the job ad IDs from Jan 2019 to Jun 2021, which we extracted from the Data Lab.

In [8]:
new = pd.read_csv('../Data/Archive/MCF_Job_Ad_ID.csv')
new.drop('Unnamed: 0', axis = 1, inplace = True)
new.columns = ['MCF_Job_Ad_ID']

`total` is the left join of `new` with `existing1` and `existing2`. A row that matches in either join is a job ad we already have data for.

In [9]:
total = new.merge(existing1, on = 'MCF_Job_Ad_ID', how = 'left').merge(existing2, on = ['MCF_Job_Ad_ID'], how = 'left')

We initially prioritised the job ads that had not already been pulled, in case we got blocked from the API. We then pulled data for the ads we already had, to complete our raw dataset. The filter below selects that second batch.

In [10]:
total['avail'] = total['uuid_x'].notnull() | total['uuid_y'].notnull()

In [11]:
to_be_pulled = total[total['avail']][['MCF_Job_Ad_ID']]
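To make the join and the `avail` flag concrete, here is a minimal sketch on toy data. The IDs and values are invented for illustration and the `*_toy` names are hypothetical; only the merge pattern mirrors the cells above.

```python
import pandas as pd

# Toy stand-ins for `new`, `existing1` and `existing2` (values are made up).
new_toy = pd.DataFrame({'MCF_Job_Ad_ID': ['JOB-A', 'JOB-B', 'JOB-C', 'JOB-D']})
existing1_toy = pd.DataFrame({'MCF_Job_Ad_ID': ['JOB-A'], 'uuid': ['u1']})
existing2_toy = pd.DataFrame({'MCF_Job_Ad_ID': ['JOB-C'], 'uuid': ['Open']})

# Both right-hand frames carry a `uuid` column, so pandas suffixes them
# as `uuid_x` (from existing1_toy) and `uuid_y` (from existing2_toy).
total_toy = (
    new_toy
    .merge(existing1_toy, on='MCF_Job_Ad_ID', how='left')
    .merge(existing2_toy, on='MCF_Job_Ad_ID', how='left')
)

# An ad counts as available if either join found a match.
total_toy['avail'] = total_toy['uuid_x'].notnull() | total_toy['uuid_y'].notnull()

already_pulled = total_toy.loc[total_toy['avail'], 'MCF_Job_Ad_ID'].tolist()
not_yet_pulled = total_toy.loc[~total_toy['avail'], 'MCF_Job_Ad_ID'].tolist()
print(already_pulled, not_yet_pulled)
```

One thing to keep in mind with this pattern: a left merge repeats rows when the right-hand frame has duplicate keys, so deduplicating the ID columns first may be worthwhile.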

Checking the total number of job ad IDs to pull data for

In [12]:
to_be_pulled.shape

(271258, 1)

#### B) Preparing to pull the data

Applying the MD5 hash to each job ad ID to get the `uuid` for querying the API.

In [13]:
to_be_pulled['uuid'] = [hashlib.md5(job_id.encode()).hexdigest() for job_id in to_be_pulled['MCF_Job_Ad_ID']]
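The same hashing step can be written as a small standalone function, which makes it easier to check a single ID by hand. This is a sketch only; the function name `job_id_to_uuid` is hypothetical and not part of the notebook.

```python
import hashlib


def job_id_to_uuid(job_id: str) -> str:
    """Return the hex MD5 digest of a job ad ID (UTF-8 encoded)."""
    return hashlib.md5(job_id.encode('utf-8')).hexdigest()


example = job_id_to_uuid('JOB-2020-0000007')
print(example)  # a 32-character lowercase hex string
```

MD5 is deterministic, so the same job ad ID always maps to the same `uuid`, and the mapping can be recomputed at any time rather than stored.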

In [14]:
to_be_pulled.head()

| | MCF_Job_Ad_ID | uuid |
|---|---|---|
| 254979 | JOB-2020-0000007 | 92f307a9f0eed799d73a0d105415cc8a |
| 254980 | JOB-2020-0000008 | 25f5ed1896dc86da537fe0fee4c00928 |
| 254981 | JOB-2020-0000016 | dc63df578b3fac5ae605b09962ea19c9 |
| 254982 | JOB-2020-0000018 | 7c73f2caa9dd923b109bdd63ba28dff1 |
| 254983 | JOB-2020-0000032 | 0ebf6690c67b53ba9ce1d506605665b2 |


We query for an MCF job ad that doesn't exist in the database to capture the error message, so that we can recognise it later and avoid writing it out as a JSON file.

In [17]:
test = requests.get('https://api.mycareersfuture.gov.sg/v2/jobs/bb5faebc85f3504c17b83e16d2b4dafb')

In [19]:
uuid_not_found = test.json()
print(uuid_not_found)

{'message': 'UUID is not found in the database.'}
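The comparison against this payload can be wrapped in a small helper. The sketch below is hypothetical (the name `is_uuid_not_found` is not from the notebook) and takes a status code and body text instead of a live response, so it can be tried without calling the API. It assumes the not-found reply is always exactly the JSON shown above.

```python
import json

# The payload captured from the test call above.
UUID_NOT_FOUND = {'message': 'UUID is not found in the database.'}


def is_uuid_not_found(status_code: int, body_text: str) -> bool:
    """True if a non-200 reply carries the 'UUID is not found' JSON payload."""
    if status_code == 200:
        return False
    try:
        return json.loads(body_text) == UUID_NOT_FOUND
    except ValueError:
        # The body was not valid JSON, so it cannot be the not-found payload.
        return False
```

With a `requests` response object `resp`, this could be called as `is_uuid_not_found(resp.status_code, resp.text)`.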


Initialise the variables for monitoring the progress of the API calls: `rate_limit_count` counts how many times we have been rate-limited (so we can adjust the time delay if needed), and `errors` is a list of MCF job ad IDs for which we couldn't pull any info.

In [22]:
rate_limit_count = 0 
errors = []

#### C) Pulling the data

Run the code below to call the MCF Jobs API for the list of MCF job ad IDs from above. Note the file directory structure: each JSON response is written to the `Data/Raw/mcf_raw` folder, and IDs that already have a file there are skipped, so the cell can be re-run after an interruption.

In [31]:
base_url = 'https://api.mycareersfuture.gov.sg/v2/jobs'
output_dir = '../Data/Raw/mcf_raw'

# Read the same folder we write to, and use a set for fast membership checks
completed = {filename.replace('.json', '') for filename in os.listdir(output_dir)}
total_count = len(to_be_pulled)

for i, (ad_id, uuid) in enumerate(zip(to_be_pulled['MCF_Job_Ad_ID'], to_be_pulled['uuid']), start = 1):

    if ad_id in completed:
        continue

    req = requests.get(f'{base_url}/{uuid}')

    if req.status_code != 200:
        try:
            body = req.json()
        except ValueError:
            # non-JSON body: we assume we are getting rate limited, so wait and retry once
            print('Backing off...\r', end = '')
            rate_limit_count += 1
            time.sleep(2)
            req = requests.get(f'{base_url}/{uuid}')
            if req.status_code != 200:
                errors.append(ad_id)
        else:
            # JSON error body: either the uuid can't be found or an error we did not anticipate
            if body != uuid_not_found:
                print(f'Unexpected response for {ad_id}: {body}')
            errors.append(ad_id)

    if req.status_code == 200:
        try:
            # parse before opening the file so a bad body doesn't leave an empty JSON behind
            data = req.json()
            with open(f'{output_dir}/{ad_id}.json', 'w') as file:
                json.dump(data, file)
            print(f'{i}/{total_count} completed - called {ad_id} successfully! Error count: {len(errors)}, Rate limit count: {rate_limit_count}\r', end = '')
        except (ValueError, OSError):
            errors.append(ad_id)
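The loop above waits a fixed two seconds and retries once. If rate limiting turns out to be frequent, one alternative is exponential backoff, sketched below. This is a swapped-in technique, not what the notebook does: `fetch_with_backoff`, its parameters, and the convention that `fetch` returns `None` when rate-limited are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], Optional[dict]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call `fetch` until it returns a payload or the attempts run out.

    `fetch` should return the parsed JSON on success and None when the
    request was rate-limited. The wait doubles after each failed attempt.
    """
    for attempt in range(max_attempts):
        payload = fetch()
        if payload is not None:
            return payload
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


# Demo with a fake fetcher that fails twice, then succeeds.
# Waits are recorded in a list instead of actually sleeping.
replies = iter([None, None, {'uuid': 'abc'}])
waits = []
result = fetch_with_backoff(lambda: next(replies), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In the notebook it would be left at its default of `time.sleep`.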

Checking how many MCF job ad IDs have been pulled successfully. Note that due to an earlier bug, some JSONs were exported even though there was no data. These will be removed by Ben.

In [25]:
completed = [filename.replace('.json', '') for filename in os.listdir('../Data/Raw/mcf_raw')]
len(completed)

882583

In [32]:
len(total)

883402
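The two counts differ by 819, although they are not strictly comparable if some exported JSONs are empty or if `total` contains repeated IDs. To list exactly which IDs still have no file, a set difference is enough. The sketch below is hypothetical (the helper `missing_ids` is not from the notebook) and runs against a temporary folder with made-up IDs rather than the real data.

```python
import tempfile
from pathlib import Path


def missing_ids(all_ids, response_dir):
    """Return the IDs in `all_ids` that have no `<id>.json` file in `response_dir`."""
    done = {path.stem for path in Path(response_dir).glob('*.json')}
    return sorted(set(all_ids) - done)


# Demo in a temporary folder with made-up IDs.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'JOB-A.json').write_text('{}')
    remaining = missing_ids(['JOB-A', 'JOB-B'], tmp)

print(remaining)
```

Against the real data, this could be called as `missing_ids(total['MCF_Job_Ad_ID'], '../Data/Raw/mcf_raw')`.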