# Job Sieve
---

Development documentation for the screen scraping of multiple job boards for useful statistics, filtered high-value prospects, and correlative analytics.

---

## <a name="toc"></a> Table of Contents
1. [Process Job Board](#process_job_board)
  1. [Job Listings](#job_listings)
  2. [Job Posts](#job_posts)
2. [Analytics](#analytics)
  1. [Keyword Frequencies](#keyword_frequencies)
  2. [Resume Correlations](#resume_correlations)
  3. [Filtering Prospects](#filtering_prospects)



In [8]:
# -------------------- LOAD DEPENDENCIES -------------------- #

# Environment hard reset
%reset -f

# Standard math and data libraries
import numpy as np
import pandas as pd

# Plotting libraries
import matplotlib.pyplot as plt
%matplotlib inline

# Libraries for scraping
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import lxml.html as lh
import ssl

import json

# Date time for date operations
import datetime

# Levenshtein fuzzy comparisons
from fuzzywuzzy import fuzz 
from fuzzywuzzy import process

# Import string cleaning functions
import re

# Flask support
from flask import request, jsonify

# Configure paths
from pathlib import Path
# data_path = Path('Datasets')


## <a name="process_job_board"></a> [Process Job Board](#toc)

Given a job board, input the desired job characteristics, form them into a query, and aggregate all job postings returned to us from that query.

1. [Job Listings](#job_listings)
2. [Job Posts](#job_posts)



### <a name="job_listings"></a> [Job Listings](#toc)

When using an online job board, any query entered into the search bar returns several pages of job listings. Each listing on those pages is referred to as a card. The card is an HTML object which stores all of the relevant details that identify that card ti the user. These details include the job title, a link to the job posting itself, the company, the location, and a brief summary of the job's key features. The role of this program is to construct the query for the user, to execute the query against the job board, and to process (parse) those job cards into a Python object that can be expanded upon.

After the first call, the number of pages returned by the query are extracted. The program iterates through each page and compiles the job cards for future analysis.

**Future development:** <br>
- The original results tell us the most relevant job listings (ordered typically by decending relevance) as well as the number of pages returned by the query. Using asynchronous techniques, these pages can be called simultaneously and processed by the order in which the results are returned. This technique greatly reduced the overhead expense of calling each query and waiting for the server to reply. The program can process the cards on its own time as the cards return.



In [9]:
# -------------------- PARSE JOB CARDS -------------------- #

keywords = ["CCNA"]
location =  "Albuquerque"

from custom.indeed.form_query import form_query
base_query = form_query(keywords, location)

# ---------- #

from custom.indeed.call_query import call_query
from custom.indeed.extract_cards import extract_cards
from custom.indeed.parse_cards import parse_cards
from custom.indeed.page_range import page_range

page = call_query(base_query)
_, num_pages = page_range(page)
cards = extract_cards(page)

parsed_cards = parse_cards(cards)
for i in range(1, num_pages-1):
    ()
    next_page = call_query(base_query + "&start=" + str(10*1))
    next_cards = extract_cards(next_page)
    parsed_cards.extend(parse_cards(next_cards))

parsed_cards[0]


{'title': 'Proactive Network/System Administrator',
 'link': 'https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CvxNjQrHGEGG9TW8zjO7tEzRAP9CfA1HQoqhnWZQgiasWbPY4MsRVzzLoi5Z2o-cVF77GkCgLii7gqA37iI8sFHdQtQK1_gkWnW3VPOkJsxVvLFmm7mn0dFMozrVyLG3JyCJQe9D0sZTyKG7AYPiFIwi1XQSt0ecmSHt0ttXsuOZhoQeyM2ehQtLj_y3URJYvMhYqqvAC_37XaOXsSdTO9_T5Kvz1Vt3Hppny8n603YLwBtfySdWBbsCLk_TjKzYVHFvxT1IKF_xLUSmkt9wvTvco0wS6IWVrD2Uw2JY1k1nptSXoWacDxIDqiPQXcGRCNjDao3TtTLUEsLEdCya8VjsYjawu7DETfi7bu2jWeF3kDoBMNogH1V8xEj6DaluPJM6AWNpl0D_EN5yVeiSVAxOV7-tJTpadDh1gsnkkv_35PPA67QaBoaABEFvNYEPKc7sIP9VwWwb5egIlLperXRt2oR9v7k-c=&p=0&fvj=1&vjs=3',
 'company': 'Steady Networks',
 'location': 'Albuquerque, NM 87110 (Montgomery Park area)',
 'summary': ["Documentation of our clients' technical environment.",
  "Comprehensive knowledge of our clients' technical layout, capacity and deficiencies."]}

### <a name="job_posts"></a> [Job Posts](#toc)

With the cards processed, the next step is to iterate over the cards and parse each resulting job post. This information will be compiled for future analysis to determine if the job posting is worth the time it takes to formally apply for the job. This step boils down to just calling the link to each posting, extracting the description, extracting the site to apply to the job, and appending both attributes to the card to create a complete accounting of each opportunity. Some job postings have an option to apply for the job directly on the job board website. In such an instance, the application link is simply set to the job posting link itself.



In [16]:
# -------------------- PARSE JOB POSTINGS -------------------- #

from custom.indeed.parse_job_postings import parse_job_postings
cards = parse_job_postings(parsed_cards)
cards[0]


{'title': 'Proactive Network/System Administrator',
 'link': 'https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CvxNjQrHGEGG9TW8zjO7tEzRAP9CfA1HQoqhnWZQgiasWbPY4MsRVzzLoi5Z2o-cVF77GkCgLii7gqA37iI8sFHdQtQK1_gkWnW3VPOkJsxVvLFmm7mn0dFMozrVyLG3JyCJQe9D0sZTyKG7AYPiFIwi1XQSt0ecmSHt0ttXsuOZhoQeyM2ehQtLj_y3URJYvMhYqqvAC_37XaOXsSdTO9_T5Kvz1Vt3Hppny8n603YLwBtfySdWBbsCLk_TjKzYVHFvxT1IKF_xLUSmkt9wvTvco0wS6IWVrD2Uw2JY1k1nptSXoWacDxIDqiPQXcGRCNjDao3TtTLUEsLEdCya8VjsYjawu7DETfi7bu2jWeF3kDoBMNogH1V8xEj6DaluPJM6AWNpl0D_EN5yVeiSVAxOV7-tJTpadDh1gsnkkv_35PPA67QaBoaABEFvNYEPKc7sIP9VwWwb5egIlLperXRt2oR9v7k-c=&p=0&fvj=1&vjs=3',
 'company': 'Steady Networks',
 'location': 'Albuquerque, NM 87110 (Montgomery Park area)',
 'summary': ["Documentation of our clients' technical environment.",
  "Comprehensive knowledge of our clients' technical layout, capacity and deficiencies."],
 'description': "Job detailsSalary$60,000 - $70,000 a yearJob TypeFull-timeQualificationsExperience:Network Administration, 2 years (

## <a name="analytics"></a> [Analytics](#toc)

This program is capable of aggregating large amounts of job posting data. By processing that data, patterns in aggregate employer demand can be discovered, the jobs with the most correlation to your resume can be identified, and the greatest prospects for an application can be filtered out and returned to the user. This analytics section aims to do just that.

1. [Keyword Frequencies](#keyword_frequencies)
2. [Resume Correlations](#resume_correlations)
3. [Filtering Prospects](#filtering_prospects)



### <a name="keywords"></a> [Keyword Frequencies](#toc)

This analysis component seeks to answer the question: What is industry looking for in its candidates? In this sense, what we are really looking for are common keywords used amongst them. These keywords represent the terms under which an employer is more likely to consider a given candidate. Examples included "degree", "bachelors", "masters", "experience", "X certification", etc. There are many ways to execute this analysis.

A predefined list of terms can be compiled and compared against the applications. The advantage of this approach lies in its simplicity. Without having to mine for key words, a boolean state is determined representing if the posting either has or does not have the given keyword. Unfortunately, this approach also limits the scope of the analysis to those keywords that are known by the users and cannot be used to mine for novel keywords.

Another, more convoluted approach, involves comparing all job postings against each other to determine what words they have in common. The issue with this approach is that many meaningless keywords will be identified. At first, structural words like "the" will be identified in mass. After being filtered out, the next issue will be in filtering out contextually-dependent and insigificant words like "application" and "responsibilities". This filtering process will take up more time than any other component of this project. For this reason, the predefined approach will be used.

My particular interest lies in determining what computerscience/cybersecurity certifications are most in demand by employers. Consequently, the predetermined keyword approach is best suited. Compilations of certifications like [this one](https://www.webopedia.com/reference/computer-certifications/) will serve nicely for seeding the target keywords list.



In [18]:
# -------------------- STUDY KEYWORDS -------------------- #

# Load keywords
with open("keywords/keywords.json") as f:
    keywords = json.load(f)

# Study jobpostings
for card in cards:
    for certification in keywords["entries"]:
        for target in certification["targets"]:
            if target in card["description"]:
                certification["count"] += 1
            else:
                continue

keywords


{'entries': [{'name': 'CompTia A+',
   'targets': ['A+', 'A +', 'A Plus'],
   'count': 124},
  {'name': 'CompTia Network+',
   'targets': ['Network+', 'Network +', 'Network Plus'],
   'count': 202},
  {'name': 'CompTia Security+',
   'targets': ['Security+', 'Security +', 'Security Plus'],
   'count': 223},
  {'name': 'CompTia Cloud+',
   'targets': ['Cloud+', 'Cloud +', 'Cloud Plus'],
   'count': 0},
  {'name': 'CompTia Linux+',
   'targets': ['Linux+', 'Linux +', 'Linux Plus'],
   'count': 0},
  {'name': 'CompTia Server+',
   'targets': ['Server+', 'Server +', 'Server Plus'],
   'count': 0},
  {'name': 'CompTia CySA+',
   'targets': ['Cybersecurity Analyst', 'CySA', 'CSA'],
   'count': 52},
  {'name': 'CompTia CASP+',
   'targets': ['CASP+', 'CASP +', 'CASP Plus'],
   'count': 0},
  {'name': 'CompTia PenTest+',
   'targets': ['PenTest+', 'PenTest +', 'PenTest Plus'],
   'count': 0},
  {'name': 'CompTia Project+',
   'targets': ['Project+', 'Project +', 'Project Plus'],
   'count': 0}