- The notebook is part of a project to re-design a course curriculum for MIE 1624: Introduction to Data Science and Analytics. This is done by performing a web scraping exercise to extract relevant skills required for data analyst, data scientist, data manager, data engineer, etc. from well-known job posting sites, such as Indeed, glassdoor, linkedin, upwork, etc. Additional data will also be obtained from Kaggle datasets and other online platforms such as CognitiveClas.ai, Coursera, EdX, DataCamp, etc.
- This notebook will extract the skills for data related jobs from Indeed sites, focusing on North America countries: US and Canada.
- The scraping is conducted using "requests", "BeautifulSoup", and if needded, "Selenium" libraries in Python, then "pandas" library will be used to assemble data into dataframe for further pre-processing and cleaning steps. Note that BeautifulSoup is:
  * a Python-based parsing library that allows you to extract data from web pages
  * It structures an HTML or XML web page. BS is made up of different parsing tools such as html.parser, lxml, and HTML5lib
  * user-friendly
  
- Selenium is used when target websites has a lot of Javascript elements in its code. Selenium is an API that allow you to control a headless browser through a series of programs. When using Selenium, you can also perform other actions such as mouse clicks and filling forms. 
- A URL for data scientist job search in Toronto from Indeed site looks like: "https://ca.indeed.com/jobs?q=data%20scientist&l=Toronto%2C%20ON", where:


    * "q=" begins the string for the “what” field on the page, separating search terms with “+” (i.e. searching for “data+scientist” jobs)
    * “&l=” begins the string for city of interest, separating search terms with “+” if city is more than one word (i.e. “New+York”
    * Each page of the job results have 15 job posts.



- Resources used to construct this notebook:
*   Web Scraping Job Postings from Indeed: https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
*   How to scrape job posts from Indeed with Python: https://www.youtube.com/watch?v=eN_3d4JrL_w
* https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319
* https://www.youtube.com/watch?v=QiD1lbM-utk



# Canada

4 queries were made into ca.Indeed.com to obtain job postings for Data Analyst, Data Engineer, Data Scientist, and Machine Learning. Due to large quantities of job postings, a limit of 70 pages were extracted (equaling around 1000 job posts) for each role. 

Job title index was also assigned to each job posting during the web scrape. For example, while scraping for Data Analyst roles, an index number of 1 was assigned to each posting. This will help for analysis. 

In [3]:
!pip install beautifulsoup4



You should consider upgrading via the 'C:\Users\Adam\anaconda3\python.exe -m pip install --upgrade pip' command.


In [4]:
!pip install Selenium



You should consider upgrading via the 'C:\Users\Adam\anaconda3\python.exe -m pip install --upgrade pip' command.


In [5]:
# BeautifulSoup
import requests  #to send the get requests to servers to get the raw html
from bs4 import BeautifulSoup # to parse the html and extract data from Indeed

# for Selenium
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

import urllib # use urlencode function from urllin to create full url
import pandas as pd

import csv #to export data
from datetime import datetime #to get current date



## Functions for ca.Indeed.com

In [6]:
driver = webdriver.Chrome("./chromedriver")

  driver = webdriver.Chrome("./chromedriver")


In [41]:
def find_jobs_from(job_title, location, filename="results.xls"):    
    """
    This function extracts all the desired characteristics of all new job postings
    of the title and location specified and returns them in single file.
    The arguments it takes are:
        - Job_title
        - Location
        - Filename: to specify the filename and format of the output.
            Default is .xls file called 'results.xls'
    """
    
    job_soup = load_indeed_jobs_div(job_title, location)
    jobs_list, num_listings = extract_job_information_indeed(job_soup)
      
    save_jobs_to_excel(jobs_list, filename)
 
    print('{} new job postings retrieved from {}. Stored in {}.'.format(num_listings, filename))
    

In [42]:
## ================== FUNCTIONS FOR CA.INDEED.COM =================== ##

# This function takes job title a user is searching for, and the specified location (City, province, or remote) and return an object that contains all job cards
def load_indeed_jobs_div(job_title, location):
    getVars = {'q' : job_title, 'l' : location, 'fromage' : 'last', 'sort' : 'date'}    #get the latest job posting by sorting and get last
    url = ('https://ca.indeed.com/jobs?' + urllib.parse.urlencode(getVars))
    
    # print(url)

    page = requests.get(url)   #send request to browser and get response
    
    soup = BeautifulSoup(page.text, 'lxml')
    # soup = BeautifulSoup(page.content, "html.parser")    #parse content using either html.parser or lxml
    
    # job_soup = soup.find(id="resultsCol")   #this id returns the entire column of job cards
    job_soup = soup.find("div",id="mosaic-provider-jobcards")
    return job_soup



In [43]:
# test
job_soup = load_indeed_jobs_div('data scientist', 'toronto, on')
job_elems = job_soup.find_all('a', class_='tapItem')
job_elems[0]

<a class="tapItem fs-unmask result job_506e546718dd26ca resultWithShelf sponTapItem desktop" data-hide-spinner="true" data-hiring-event="false" data-jk="506e546718dd26ca" data-mobtk="1fiun2plgpi2o801" href="/rc/clk?jk=506e546718dd26ca&amp;fccid=d5a5824be27ba831&amp;vjs=3" id="job_506e546718dd26ca" rel="nofollow" target="_blank"><div class="slider_container"><div class="slider_list"><div class="slider_item"><div class="job_seen_beacon"><table cellpadding="0" cellspacing="0" class="jobCard_mainContent" role="presentation"><tbody><tr><td class="resultContent"><div class="heading4 color-text-primary singleLineTitle tapItem-gutter"><h2 class="jobTitle jobTitle-color-purple jobTitle-newJob"><div class="new topLeft holisticNewBlue desktop"><span class="label">new</span></div><span title="Senior Data Scientist (Remote, Canada)">Senior Data Scientist (Remote, Canada)</span></h2></div><div class="heading6 company_location tapItem-gutter"><pre><span class="companyName"><a class="turnstileLink com

In [64]:
# This function takes the result from the above function and return a job lists and number of listings
def extract_job_information_indeed(job_soup):
    job_elems = job_soup.find_all('a', class_='tapItem')
    
    cols = []
    extracted_info = []
    
    # Extract titles  
    titles = []
    cols.append('titles')
    for job_elem in job_elems:
        titles.append(extract_job_title_indeed(job_elem))   #call the extract_job_title_indeed below
    extracted_info.append(titles)                    
    
    links = []
    cols.append('links')
    for job_elem in job_elems:
        links.append(extract_link_indeed(job_elem))
    extracted_info.append(links)
    
    snippets=[]
    cols.append('snippets')
    for job_elem in job_elems:
      snippets.append(extract_jobsnippet(job_elem))
    extracted_info.append(snippets)
    
    jobs_list = {}
    
    for j in range(len(cols)):
        jobs_list[cols[j]] = extracted_info[j]
    
    num_listings = len(extracted_info[0])
    
    return jobs_list, num_listings

In [45]:
job_elems[0]

<a class="tapItem fs-unmask result job_506e546718dd26ca resultWithShelf sponTapItem desktop" data-hide-spinner="true" data-hiring-event="false" data-jk="506e546718dd26ca" data-mobtk="1fiun2plgpi2o801" href="/rc/clk?jk=506e546718dd26ca&amp;fccid=d5a5824be27ba831&amp;vjs=3" id="job_506e546718dd26ca" rel="nofollow" target="_blank"><div class="slider_container"><div class="slider_list"><div class="slider_item"><div class="job_seen_beacon"><table cellpadding="0" cellspacing="0" class="jobCard_mainContent" role="presentation"><tbody><tr><td class="resultContent"><div class="heading4 color-text-primary singleLineTitle tapItem-gutter"><h2 class="jobTitle jobTitle-color-purple jobTitle-newJob"><div class="new topLeft holisticNewBlue desktop"><span class="label">new</span></div><span title="Senior Data Scientist (Remote, Canada)">Senior Data Scientist (Remote, Canada)</span></h2></div><div class="heading6 company_location tapItem-gutter"><pre><span class="companyName"><a class="turnstileLink com

In [46]:
def extract_job_title_indeed(job_elem):
    title_elem = job_elem.find('h2', class_='jobTitle')
    
    spans = title_elem.find_all('span')

    title = spans[1].text.strip()

    return title


In [47]:
# test
extract_job_title_indeed(job_elems[0])

'Senior Data Scientist (Remote, Canada)'

In [54]:
def extract_link_indeed(job_elem):
    href = job_elem.get('href')
    link = f"https://ca.indeed.com{href}"

    
    # driver = webdriver.Chrome("./chromedriver")
    # link.click()
    return link



In [49]:
# test
extract_link_indeed(job_elems[0])

'https://ca.indeed.com/rc/clk?jk=506e546718dd26ca&fccid=d5a5824be27ba831&vjs=3'

In [61]:
def extract_jobsnippet(job_elem):
  job_summary = job_elem.find('div', 'job-snippet').text.replace('\n', '')
  return job_summary

In [62]:
extract_jobsnippet(job_elems[0])

'Masterful data storytelling and strategic thinking.Influence leadership to drive more data-informed decisions.Deep understanding of advanced SQL techniques.'

In [50]:
## ======================= GENERIC FUNCTIONS ======================= ##

def save_jobs_to_excel(jobs_list, filename):
    jobs = pd.DataFrame(jobs_list)
    jobs.to_excel(filename)

## Demo

In [51]:
find_jobs_from('"data scientist"', 'toronto, on')

IndexError: ignored