This notebook runs web scraping of the career seach websites for some companies. It outputs a csv file for each company.

By default the script updates the jobs for all companies. If only selected companies are to be updated, list those companies in the python list of 'update_list' which can be found at the end of the script.

When a new company is added, the following items need to be updated:
1. complete_company_list
2. add a function of scrape_jobs_[company_name]() 
3. add a block to the function of main_update_func(update_list) for the new company

Scraping of some websites may fail for various reasons. Often times if you rerun for these companies only using 'update_list' will solve the problem. If not, then one needs to debug and fix the code if necessary (html format may change overtime). 


In [6]:
import time
import os
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import re
from timeit import default_timer as timer
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

In [7]:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [8]:
# for debugging, set to -1 to scrape all jobs, otherwise will scrape jobs from index of 0 to test-1
test=-1

In [63]:
complete_company_list = ['Accenture',
                        'Amazon',
                        'Apple',
                        'Cisco',
                        'Collabera',
                        'Deloitte',
                        'Expedia',
                        'Fox News',
                        'Google',
                        'IBM',
                        'Infosys',
                        'Intel',
                        'JnJ',
                        'JPM',
                        'KPMG',
                        'Microsoft',
                        'Nvidia',
                        'Oracle',
                        'State Farm',
                        'Texas Instruments',
                        'Vistra',
                        'Vizient',
                        'Walmart']

total_company_counts = len(complete_company_list)
print(total_company_counts)

23


## Specify the URL for each website (keywords = 'technology')

In [10]:
accenture_url1 = 'https://www.accenture.com/us-en/careers/jobsearch?jk=technology&sb=0&vw=0&is_rj=0&pg='
accenture_url2 = ''

amazon_url1 = 'https://www.amazon.jobs/en/search?offset='
amazon_url2 = '&result_limit=10&sort=relevant&country%5B%5D=USA&distanceType=Mi&radius=24km&latitude=&longitude=&loc_group_id=&loc_query=&base_query=technology&city=&country=&region=&county=&query_options=&'
         
deloitte_url = 'https://apply.deloitte.com/careers/SearchJobs/technology?sort=relevancy'

google_url1 = 'https://careers.google.com/jobs/results/?page='
google_url2 = '&q=technology'
    
ibm_url = 'https://www.ibm.com/careers/us-en/search/?search=technology&filters=primary_country:CA,primary_country:US'

intel_url = 'https://jobs.intel.com/en/search-jobs/technology/599/1' 

jnj_url1 = 'https://jobs.jnj.com/en/jobs/?page='
jnj_url2 = '&search=technology&country=United+States&pagesize=20#results' 

jpm_url = 'https://jpmc.fa.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/requisitions?keyword=software+engineer&location=United+States&locationId=300000000289738&locationLevel=country'
    
kpmg_url1 = 'https://www.kpmguscareers.com/job-search/?career-level-parents=Experienced%7C&career-level=&spage='
kpmg_url2 = ''    

microsoft_url1 = 'https://careers.microsoft.com/us/en/search-results?keywords=technology&from='
microsoft_url2 = '&s=1'

nvidia_url = 'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'

oracle_url = 'https://eeho.fa.us2.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1/requisitions?keyword=technology&location=United+States&locationId=300000000149325&locationLevel=country&mode=location'

statefarm_url1 = 'https://jobs.statefarm.com/main/jobs?page='
statefarm_url2 = '&keywords=Technology&sortBy=relevance'

ti_url1 = 'https://careers.ti.com/search-jobs/?keyword=technology&pg='
ti_url2 = ''

vistra_url = 'https://vst.wd5.myworkdayjobs.com/en-US/vistra_careers'

vizient_url = 'https://vizient.wd1.myworkdayjobs.com/Vizient_Careers'



### Uncomment the cell below if want to scrape all jobs (no keywords)

"""
accenture_url1 = 'https://www.accenture.com/us-en/careers/jobsearch?jk=&sb=1&vw=0&is_rj=0&pg=1'
accenture_url2 = ''

amazon_url1 = 'https://www.amazon.jobs/en/search?offset='
amazon_url2 = '&result_limit=10&sort=relevant&distanceType=Mi&radius=24km&latitude=&longitude=&loc_group_id=&loc_query=&base_query=&city=&country=&region=&county=&query_options=&'
 
deloitte_url = 'https://apply.deloitte.com/careers/SearchJobs?sort=relevancy'

google_url1 = 'https://careers.google.com/jobs/results/?page='
google_url2 = ''
    
ibm_url = 'https://www.ibm.com/careers/us-en/search/?filters=primary_country:CA,primary_country:US'
   
intel_url = 'https://jobs.intel.com/en/search-jobs' 

jnj_url1 = 'https://jobs.jnj.com/en/jobs/?page='
jnj_url2 = '&country=United+States&pagesize=20#results' 

jpm_url = 'https://jpmc.fa.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/requisitions?location=United+States&locationId=300000000289738&locationLevel=country'

kpmg_url1 = 'https://www.kpmguscareers.com/job-search/?career-level-parents=Experienced%7C&career-level=&spage='
kpmg_url2 = ''    

microsoft_url1 = 'https://careers.microsoft.com/us/en/search-results?from='
microsoft_url2 = '&s=1'

nvidia_url = 'https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite'

oracle_url = 'https://eeho.fa.us2.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1/requisitions?location=United+States&locationId=300000000149325&locationLevel=country&mode=location'

statefarm_url1 = 'https://jobs.statefarm.com/main/jobs?page='
statefarm_url2 = '&sortBy=relevance'

ti_url1 = 'https://careers.ti.com/search-jobs/?pg='
ti_url2 = ''

vistra_url = 'https://vst.wd5.myworkdayjobs.com/en-US/vistra_careers'

vizient_url = 'https://vizient.wd1.myworkdayjobs.com/Vizient_Careers'

"""

## Common functions

In [11]:
def remove_chars(s):
  s_new = s.replace('\n', ' ').replace('\xa0', ' ')
  while '  ' in s_new:
    s_new = s_new.replace('  ', ' ')
  return s_new

In [12]:
def make_df(company_name, title, link, qual, descrp):
    df = pd.DataFrame(zip(title, qual, link, descrp))
    df.columns = ['TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION']
    df['COMPANY'] = company_name
    df = df.iloc[:, [4, 0, 1, 2, 3]]
    return df

In [13]:
def get_empty_rows(df, colname) :
  # look for empty QUALIFICATION entries
  empty_idx=[]
  for i in range(len(df[colname])):
    count = len(df[colname][i])
    if count<=10 : empty_idx.append(i)
    else: pass
  return empty_idx

In [14]:
def get_html(driver, url):
    driver.get(url)
    time.sleep(1)
    return BeautifulSoup(driver.page_source, 'html.parser')

In [15]:
def post_process_and_ouput(company_name, title, link, qual, descrp):
    
    qual_cleaned = [remove_chars(q) for q in qual]
    descrp_cleaned = [remove_chars(d) for d in descrp]
    qual_cleaned = [d if len(q)==0 else q for q,d in zip(qual_cleaned,descrp_cleaned)] 
    
    # create a dataframe from the data
    df = make_df(company_name, title, link, qual_cleaned, descrp_cleaned)
    #print(df.shape[0])
    
    # drop the empty Qualification entries
    df
    df.drop(get_empty_rows(df, 'QUALIFICATIONS'), inplace=True)
    
    #remove the duplicated jobs
    df_nodup = df.drop_duplicates()
    print("There are {} jobs from {}.".format(df_nodup.shape[0], company_name))
    
    filename = company_name + '_technology_jobs_cleaned.csv'
    try:
        df_nodup.to_csv(filename)
    except:
        print(f'ERROR: Failed to save the file ({filename})')

In [16]:
def get_titles_links_byUrl(title_tag, title_attr, title_value, 
                           link_tag, link_attr, link_value, link_prefix, 
                           url1, url2, start=1, multiplier=1):
    job_title=[]
    job_link=[]
    page_num = start
    url = url1 + str(page_num*multiplier) + url2
    driver=webdriver.Chrome(options = chrome_options)

    try:
        soup = get_html(driver, url)
    except:
        print(f'ERROR: Failed to load {url}.')

    # will exit while loop when soup.find_all returns None
    while soup.find_all(title_tag, {title_attr: title_value}):
        job_title.extend([td.text for td in soup.findAll(title_tag, {title_attr: title_value})])
        job_link.extend([link_prefix + td['href'] for td in soup.findAll(link_tag, {link_attr: link_value})])
        driver.quit()
              
        page_num += 1
        driver=webdriver.Chrome(options = chrome_options)
        url = url1 + str(page_num*multiplier) + url2
        
        try:
            soup = get_html(driver, url)
        except:
            print(f'ERROR: Failed to load {url}.')

    driver.quit()
        
    df = pd.DataFrame(zip([remove_chars(job) for job in job_title],job_link), columns=['JOB_TITLE', 'JOB_LINK'])
    df = df.drop_duplicates() 
    return df
    

In [17]:
def get_titles_links_by_btnClick(driver, total_pages, NextBtn, title_tag, title_attr, title_value, 
                                 link_tag, link_attr, link_value, link_prefix):
    job_title=[]
    job_link=[]
    
    current_page = 1
    while current_page <= total_pages:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        job_title.extend([td.text for td in soup.findAll(title_tag, {title_attr: title_value})])
        job_link.extend([link_prefix + td['href'] for td in soup.findAll(link_tag, {link_attr: link_value})])

        try:
            next_button = driver.find_element('xpath', NextBtn) 
            driver.execute_script("arguments[0].click();", next_button)
            time.sleep(1)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            current_page += 1
        except: break
    driver.quit()
        
    df = pd.DataFrame(zip(job_title,job_link), columns=['JOB_TITLE', 'JOB_LINK'])
    df = df.drop_duplicates() 
    return df

## Accenture

In [18]:
def accenture_get_jobs(url1, url2, totalPages, jobsPerPage):
    """
    retrieve job titles and job links from each page
    """
    title = []
    link = []
    page_num = 0

    driver=webdriver.Chrome(options = chrome_options)
    URL = url1 + str(page_num+1) + url2
    try:
        soup = get_html(driver, URL)
    except:
        print(f'ERROR: Failed to load {URL}.')
    
    soup.find('a', {"class" : "cmp-jobs-results__title"}).tag
    print(totalPages, jobsPerPage)
    # loop through all pages
    while page_num <= totalPages :
        driver.quit()
        tags = soup.find_all("a", {"class": "cmp-teaser__title-link"})[2:jobsPerPage+2]
        link.extend(t['href'] for t in tags)
        tags = soup.find_all("h3", {"class": "cmp-teaser__title"})[2:jobsPerPage+2]
        title.extend(t.text for t in tags)

        page_num += 1
        driver=webdriver.Chrome(options = chrome_options)
        URL = url1 + str(page_num+1) + url2        
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')
    
    driver.quit() 
    df = pd.DataFrame(zip(title, link), columns=['JOB_TITLE', 'JOB_LINK'])
    df = df.drop_duplicates() 
    return df

In [19]:
def accenture_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]       
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')
        
        s = ''
        d = ''
        
        try:
          tags = soup.find('h2', text="Qualifications").findNext('ul').parent.find_all('ul')
          for t in tags:
            s = s + " " + t.text
        except: pass

        # retrieve job descriptions
        try:
          d = soup.find('div', {'class': "description-content"}).text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [20]:
def scrape_jobs_accenture():   
    # get the first page
    driver=webdriver.Chrome(options = chrome_options)
    URL = accenture_url1 + "1" + accenture_url2
    
    try:
        soup = get_html(driver, URL)
    except:
        print(f'ERROR: Failed to load {URL}.')
    
    jobs_per_page = 0
    
    tags = soup.find_all("a", {"class": "cmp-teaser__title-link"})
    for t in tags:
        if "www.accenture.com/us-en/careers" in t['href']:
            jobs_per_page += 1
    
    # get total number of jobs and calculate total number of pages
    s = soup.find('a', {"class" : "cmp-jobs-results__title"}).text
    total_count = int(s.replace(")", "").split("(")[1])
    total_pages = int(total_count / jobs_per_page + 1)
    
    df_title_link = accenture_get_jobs(accenture_url1, accenture_url2, total_pages, jobs_per_page)
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = accenture_job_description(df_title_link['JOB_TITLE'].values[:-2], df_title_link['JOB_LINK'].values[:-2])

    post_process_and_ouput('Accenture', title, link, qual, descrp)

## Amazon

In [21]:
def amazon_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')

        s = ''
        d = ''
        try:
          tag = soup.find("h2", text='BASIC QUALIFICATIONS').parent
          if tag.find('li'):
            for t in tag.find_all('li'):
              s = s + ' ' + t.text
          else:
            s = s + ' ' + tag.text
        except: pass

        try:
          tag = soup.find("h2", text='PREFERRED QUALIFICATIONS').parent
          if tag.find('li'):
            for t in tag.find_all('li'):
              s = s + ' ' + t.text
          else:
            s = s + ' ' + tag.text
        except: pass

        try:
          tag = soup.find("h2", text='DESCRIPTION').parent
          if tag:
            d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])

    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [22]:
def scrape_jobs_amazon():   
    df_title_link = get_titles_links_byUrl('h3', 'class', 'job-title', 
                                           'a', 'class', 'job-link', 'https://www.amazon.jobs', 
                                           amazon_url1, amazon_url2, 0, 10)
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = amazon_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Amazon', title, link, qual, descrp)

## Apple

In [23]:
def scrape_jobs_apple():  
    x = 1
    i = 1
    job_title = []
    job_link = []

    driver = webdriver.Chrome()
    driver.implicitly_wait(20)

    url = 'https://www.apple.com/careers/us/'

    driver.get(url)
    driver.maximize_window()

    searchButton = driver.find_element(By.XPATH, "//a[text()='Search']")
    searchButton.click()

    searchTextBox = driver.find_element(By.XPATH, "//input[@placeholder='Search by role or keyword']")
    searchTextBox.send_keys("Software Engineer")
    time.sleep(3)

    searchTextBox.send_keys(Keys.RETURN)
    time.sleep(3)

    try:

        for x in range(5):

            y = 0
            while y != 20:
                title = driver.find_element(By.XPATH, "(//tbody//a[contains(@id,'jotTitle')])[" + str(y + 1) + "]").text
                job_title.append(title)

                link = driver.find_element(By.XPATH, "(//td[@class='table-col-1']//a[@class][@href])[" + str(y + 1) + "]")
                job_link.append(link.get_attribute('href'))
                y = y + 1

            driver.find_element(By.XPATH, "//span[@class='next']").click()
            x=x+1

    except NoSuchElementException:...

    driver.close()    

    df = pd.DataFrame(zip(job_title, job_link))
    df.columns = ['Title', 'Link']

    df['Qualifcations'] = np.nan

    driver = webdriver.Chrome()
    driver.implicitly_wait(20)

    for i in range(len(df['Link'])):

            url = (df['Link'][i])       
            driver.get(url)
            driver.maximize_window()
            time.sleep(3)

            qualificationList = driver.find_element(By.XPATH, "(//ul[@class='jd__list'])[1]")
            qualificationsearch = qualificationList.find_elements(By.TAG_NAME, "li")
            df.loc[i, 'Qualifcations'] = ''

            for y in qualificationsearch:
                df['Qualifcations'][i] = df['Qualifcations'][i]+' '+y.text  


    driver.close()

    df = df.dropna()

    to_drop = df[df['Qualifcations'] == ''].index
    df = df.drop(to_drop)

    df = df.reset_index(drop=True)
    df['Company'] = 'Apple'
    df = df[['Company', 'Title', 'Link', 'Qualifcations']]
    df.head()
    df.to_csv('apple_jobs.csv')
    print("There are {} jobs from Apple.".format(df.shape[0]))

## Cisco

In [24]:
def scrape_jobs_cisco():

    job_title = []
    job_link = []

    x = 1
    i = 0

    driver = webdriver.Chrome()
    driver.implicitly_wait(10)

    while x > 0:
        try:
            url = 'https://jobs.cisco.com/jobs/SearchJobs/?21178=%5B169482%5D&21178_format=6020&listFilterMode=1&projectOffset='+str(i*25)    
            driver.get(url)

            #table = driver.find_element(By.XPATH, '//*[@id="content"]/div/div[2]/table')

            for j in range(1,25):
                job_title.append(driver.find_element(By.XPATH, '//*[@id="content"]/div/div[2]/table/tbody/tr['+str(j)+']/td[1]/a').text)       
                job_link.append(driver.find_element(By.XPATH, '//*[@id="content"]/div/div[2]/table/tbody/tr['+str(j)+']/td[1]/a').get_attribute('href'))

            i += 1
        except:
            print('last page scraped: ', i)
            x -= 1 

    driver.close()
    print(len(job_title))

    df = pd.DataFrame(zip(job_title, job_link))
    df.columns = ['TITLE', 'LINK']

    df['QUALIFICATIONS'] = np.nan

    driver = webdriver.Chrome()

    for i in range(len(df['LINK'])):
        try:
            url = (df['LINK'][i])       
            driver.get(url)

            txt = driver.find_element(By.XPATH, '//*[@id="content"]/div/div[2]/div/div[2]/div[3]/div/div[1]')
            qual = txt.find_elements(By.TAG_NAME, 'li')
            df.loc[i,'QUALIFICATIONS'] = ''

            for q in qual:
                df.loc[i,'QUALIFICATIONS'] = df['QUALIFICATIONS'][i]+' '+q.text  
        except:
            df.loc[i,'QUALIFICATIONS'] = np.nan 

    driver.close()

    df = df.dropna()

    to_drop = df[df['QUALIFICATIONS'] == ''].index
    df = df.drop(to_drop)

    df = df.reset_index(drop=True)

    df['COMPANY'] = 'Cisco'
    #df['DESCRIPTION'] = df['QUALIFICATIONS']
    df['DESCRIPTION'] = ''
    df = df[['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION']]
    print("There are {} jobs from Cisco.".format(df.shape[0]))

    df.to_csv('cisco_jobs_usa.csv')

## Collabera

In [25]:
def scrape_jobs_collabera():
    # loop doesn't break. Once the pages end, it automatically loads the last page.
    x = 1
    i = 1

    job_title = []
    job_link = []

    driver = webdriver.Chrome()
    driver.implicitly_wait(20)

    while x != 0:
        try:
            url = 'https://collabera.com/job-search/?sort_by=&industry=&keyword=&location=&Posteddays=0&q='+str(i)

            driver.get(url)

            for j in range(1,11):
                title = driver.find_element(By.XPATH,'/html/body/div[1]/section[3]/div/div/div[2]/div/div[2]/div['+str(j)+']/div/a/h5')
                job_title.append(title.text)

                link = driver.find_element(By.XPATH,'/html/body/div[1]/section[3]/div/div/div[2]/div/div[2]/div['+str(j)+']/div/a')
                job_link.append(link.get_attribute('href'))    
        except:
            x -= 1
        i +=1

    driver.close()
    
    print(len(job_title))
    
    df = pd.DataFrame(zip(job_title, job_link))
    df.columns = ['TITLE', 'LINK']
    df['QUALIFICATIONS'] = np.nan

    # Inconsistant formatting/wording on the job description page.
    # I tested getting just the li tags. However, the amount of text is very similar.
    # But since some job description pages don't have li tags, those will be missed by the script.
    # Therefore, I decided to get the whole text on the job description page. Luckily, the text is not very long on the pgaes.

    driver = webdriver.Chrome()

    for i in range(len(df['LINK'])):

        try:
            url = (df['LINK'][i]) 
            driver.get(url)             
            txt = driver.find_element(By.XPATH, '/html/body/div[1]/section[2]/div/div/div[1]/div[1]/div')
            df.loc[i,'QUALIFICATIONS'] = txt.text

        except:
            df.loc[i,'QUALIFICATIONS'] = np.nan

    driver.close()

    df = df.dropna()
    df[df['QUALIFICATIONS'] == '']

    df['COMPANY'] = 'Collabera'
    #df['DESCRIPTION'] = df['QUALIFICATIONS']
    df['DESCRIPTION'] = ''
    df = df.reindex(columns=['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION'])
    df = df.reset_index(drop=True)
    print("There are {} jobs from Collabera.".format(df.shape[0]))
    
    df.to_csv('collabera_jobs.csv')


## Deloitte

In [26]:
def deloitte_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    qual_pattern = re.compile("qualifications|required:", re.IGNORECASE)
    descrp_pattern = re.compile("work you’ll do|job duties", re.IGNORECASE)

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')
        
        s = ''
        d = ''
        
        # get descriptions
        try:
          tag = soup.find("strong", text=descrp_pattern).findNext("ul")
          d = tag.text
        except: pass
    
        # get qualifications
        try:
          tag = soup.find(re.compile("(strong|span)"), text=qual_pattern).findNext("ul")
          s = tag.text
          if tag.findNext("ul"):
            s = s + " " + tag.findNext('ul').text
        except: pass
                      
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [27]:
def scrape_jobs_deloitte():
    # go to the first page of job post
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(deloitte_url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # load all jobs to the current page by clicking the load more button
    next_button = driver.find_element('xpath', '//*[@class="button button--default button--loadmore"]')
    while next_button:
        driver.execute_script("arguments[0].click();", next_button)
        time.sleep(2)
        try:
            next_button = driver.find_element('xpath', '//*[@class="button button--default button--loadmore"]')
        except:
            break

     # scrape all job titles and links
    job_title=[]
    job_link=[]

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    job_title.extend([td.text for td in soup.find_all("a", {"class": "link"})])
    job_link.extend([td['href'] for td in soup.find_all("a", {"class": "link"})])   

    # remove unwanted chars
    job_title_cleaned = [remove_chars(s) for s in job_title]
    job_link_cleaned = [remove_chars(s) for s in job_link]

    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title_cleaned, job_link_cleaned), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = deloitte_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Deloitte', title, link, qual, descrp)

## Expedia

In [28]:
def scrape_jobs_expedia():
    job_title = []
    job_link = []

    #this url redirects to the page 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#home'
    #url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#keyWordSearch=technology%20or%20software%20engineering%20or%20developer%20or%20azure%20or%20aws&locationSearch='

    url = 'https://careers.expediagroup.com/jobs/?&filter[country]=United+States'

    driver = webdriver.Chrome()

    driver.implicitly_wait(20)

    driver.get(url)
    time.sleep(2)

    x = 1
    i = 1

    while x == 1:
        try:

            next_button = driver.find_element(By.ID,'loadmore')
            next_button.click()

        except:
            print('page',i)
            x = 0 
        i +=1

    TITLE = []
    LINK = []

    y = 1     
    i = 1

    while y != 0:
    #while i < 3:

        y = 2

        try:
            j_title = driver.find_element(By.XPATH,'//*[@id="resultslist"]/li['+str(i)+']/a/h3')
            TITLE.append(j_title.text) 
        except:
            TITLE.append('')
            y-=1

        try:
            j_link = driver.find_element(By.XPATH,'//*[@id="resultslist"]/li['+str(i)+']/a')
            LINK.append(j_link.get_attribute('href'))
        except:
            LINK.append('')
            y-=1

        i = i+1

    # convert to dataframe

    df = pd.DataFrame(zip(TITLE, LINK))
    df.columns = ['TITLE', 'LINK']
    print(df.shape[0])
    df.to_csv('Expedia_title_link.csv')
    df = pd.read_csv('Expedia_title_link.csv', index_col=0)

    df = df.dropna()
    #df.isna().sum()

    df['QUALIFICATIONS'] = np.nan
    df['DESCRIPTION'] = np.nan
    df.drop_duplicates(subset=['TITLE', 'LINK'])

    driver = webdriver.Chrome()

    for i in range(len(df['LINK'])):
        url = (df['LINK'][i])
        #print(i, url)
        driver.get(url)

        try:
            qual = driver.find_element(By.XPATH,'/html/body/main/div[2]/div/div/div[1]/section[3]/div/div/div/div[1]/ul[2]')
            df.loc[i,'QUALIFICATIONS'] = qual.text
        except:
            pass

        try:
            desc = driver.find_element(By.XPATH,'/html/body/main/div[2]/div/div/div[1]/section[3]/div/div/div/div[1]/ul[1]')
            df.loc[i,'DESCRIPTION'] = desc.text
        except:
            pass

    driver.close()

    df = df.dropna()

    df['QUALIFICATIONS'] = df['QUALIFICATIONS'].str.lower()
    df['DESCRIPTION'] = df['DESCRIPTION'].str.lower()
    df = df.drop_duplicates(subset=['TITLE', 'QUALIFICATIONS', 'DESCRIPTION'])
    df = df.reset_index(drop=True)
    df = df.reindex(columns=['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION'])
    df['COMPANY'] = 'Expedia'
    
    print("There are {} jobs from Expedia.".format(df.shape[0]))
    df.to_csv('expedia_jobs_usa.csv')
    os.remove('Expedia_title_link.csv')

## Fox News

In [29]:
# retrieve job qualifications and descriptions
def fox_job_description_2(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        text_pattern_descrp = re.compile("A SNAPSHOT OF YOUR RESPONSIBILITIES|JOB DESCRIPTION", re.IGNORECASE)
        #text_pattern_descrp_2 = re.compile("JOB DESCRIPTION", re.IGNORECASE)
        
        
        text_pattern_qual = re.compile("WHAT YOU WILL NEED|\
                                        WHAT YOU NEED|\
                                        JOB RELATED KNOWLEDGE, SKILLS AND ABILITIES:|\
                                        ABOUT YOU", re.IGNORECASE)
        #text_pattern_qual_2 = re.compile("WHAT YOU NEED", re.IGNORECASE)
        #text_pattern_qual_3 = re.compile("JOB RELATED KNOWLEDGE, SKILLS AND ABILITIES:", re.IGNORECASE)
        #text_pattern_qual_4 = re.compile("ABOUT YOU", re.IGNORECASE)
        text_pattern_qual_5 = re.compile("QUALIFICATIONS", re.IGNORECASE)        
        
        #trys all of the qualification metrics
        try:
            tag = soup.find("b", text=text_pattern_qual).findNext("ul")
            s = s + tag.text
        except: pass
        try:
            tag = soup.find("b", text=text_pattern_qual_5).findNext("p")
            s = s + tag.text
        except: pass               
        #trys all of the description metrics
        try:
            tag = soup.find("b", text=text_pattern_descrp).findNext("ul")
            d = tag.text
        except: 
            try:
                tag = soup.find("b", text = text_pattern_descrp).findNext("br")
                s = s + tag.text
            except: pass        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [30]:
# retrieve job qualifications and descriptions
def fox_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        text_pattern_descrp_1 = re.compile("A SNAPSHOT OF YOUR RESPONSIBILITIES", re.IGNORECASE)
        text_pattern_descrp_2 = re.compile("JOB DESCRIPTION", re.IGNORECASE)
        
        
        text_pattern_qual_1 = re.compile("WHAT YOU WILL NEED", re.IGNORECASE)
        text_pattern_qual_2 = re.compile("WHAT YOU NEED", re.IGNORECASE)
        text_pattern_qual_3 = re.compile("JOB RELATED KNOWLEDGE, SKILLS AND ABILITIES:", re.IGNORECASE)
        text_pattern_qual_4 = re.compile("ABOUT YOU", re.IGNORECASE)
        text_pattern_qual_5 = re.compile("QUALIFICATIONS", re.IGNORECASE)        
        
        #trys all of the qualification metrics
        try:
            tag = soup.find("b", text=text_pattern_qual_1).findNext("ul")
            s = s + tag.text
        except: pass
        
        try:
            tag = soup.find("b", text = text_patter_qual_2).findNext("ul")
            s = s + tag.text
        except: pass
        
        try:
            tag = soup.find("b", text = text_patter_qual_3).findNext("ul")
            s = s + tag.text
        except: pass
        
        try:
            tag = soup.find("b", text = text_patter_qual_4).findNext("ul")
            s = s + tag.text
        except: pass
        
        try:
            tag = soup.find("b", text = text_patter_qual_5).findNext("br")
            s = s + tag.text
        except: pass
        
        
               
        #trys all of the description metrics
        try:
            tag = soup.find("b", text=text_pattern_descrp_1).findNext("ul")
            d = tag.text
        except: pass
        
        try:
            tag = soup.find("b", text=text_pattern_descrp_2).findNext("ul")
            d = tag.text
        except: pass
        
        try:
            tag = soup.find("b", text=text_pattern_descrp_2).findNext("p")
            d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [31]:
def scrape_jobs_fox():
    # specify the url strings for the company's job posting website
    url = 'https://www.foxcareers.com/Search/SearchResults'
    driver = webdriver.Chrome()
    driver.get(url)

    # get job titles and links for each page and click the next button to go to the next page until no more
    job_title=[]
    job_link=[]

    next_button = driver.find_element('xpath', '//*[@id="loadMoreButton"]')  
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        job_title.extend([td.text for td in soup.findAll("a", {"class": "searchResultTitle"})])
        job_link.extend(['https://www.foxcareers.com' + td['href'] for td in soup.findAll("a", {"class": "searchResultTitle"})])
        try:
            next_button.click()
            time.sleep(1)
        except: break
            
     # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = fox_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Fox News', title, link, qual, descrp)       


## Google

In [32]:
def google_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        s = ''
        d = ''
        
        soup = get_html(driver, link[i])
        try:
          tag = soup.find("h3", text='Minimum qualifications:').parent.find("ul")
          if tag:
            s = s + tag.text
        except: pass
        
        try:
          tag = soup.find('h3', text='Preferred qualifications:').parent.find("ul").findNextSibling('ul')
          if tag:
            s = s + tag.text
        except: pass

        try:
          tag = soup.find("div", {'id': 'accordion-responsibilities'})
          if tag:
            d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [33]:
def scrape_jobs_google():
    # retrieve job titles and job links
    df_title_link = get_titles_links_byUrl('h2', 'class', 'gc-card__title gc-heading gc-heading--beta', 
                                           'a', 'class', 'gc-card', 'https://careers.google.com', 
                                           google_url1, google_url2)
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = google_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Google', title, link, qual, descrp)

## IBM

In [34]:
def ibm_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []
    
    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')

        s = ''
        d = ''
              
        # get the job qualifications
        try:
            tag1 = soup.find("span", text="Required Technical and Professional Expertise")
            tag = tag1.findNext(['ul', 'ol'])
            s = tag.text
        except: pass
        
        try:
            tag2 = soup.find("span", text="Preferred Technical and Professional Expertise")
            tag = tag1.findNext(['ul', 'ol'])
            s = s + ". " + tag.text
        except: pass

        # get the job responsibilities
        try:
            tag1 = soup.find("span", text="Your Role and Responsibilities")
            tag = tag1.findNext(['ul', 'ol'])
            d = tag.text
        except: pass        
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [35]:
def scrape_jobs_ibm():
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(ibm_url)
    driver.implicitly_wait(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    # handle the cookies message window
    try:
        next_button = driver.find_element('xpath', '//*[@id="truste-consent-button"]')  
        driver.execute_script("arguments[0].click();", next_button)
    except:
        pass

    # we first get the total number of jobs and number of jobs per page
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    result = soup.find("div", {"class": "UpperList_quantityJobs__eDIK8"}).text
    jobs_per_page = int(result.split()[1])
    total_jobs = int(result.split()[3])

    # now calculate the total number of pages
    total_pages = total_jobs//jobs_per_page + 1    

    # retrieve job titles and job links
    df_title_link = get_titles_links_by_btnClick(driver, total_pages, '//*[@aria-labelledby="tooltip-6"]', 
                                 'h3', 'class', 'bx--card__heading', 
                                 'a', 'class', 'cds--link bx--card__footer undefined', '')  
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = ibm_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('IBM', title, link, qual, descrp)

## Infosys

In [36]:
def scrape_jobs_infosys():
    url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#home'
    driver = webdriver.Chrome()
    driver.implicitly_wait(20)
    driver.get(url)

    location = driver.find_element(By.XPATH,'//*[@id="initialSearchBox__26"]')
    location.send_keys('usa')

    search_button = driver.find_element(By.XPATH,'//*[@id="searchControls_BUTTON_2"]')
    search_button.click()

    x = 1
    while x == 1:
        try:
            next_button = driver.find_element(By.XPATH,'//*[@id="showMoreJobs"]')
            next_button.click()
        except:
            x = 0    

    job_title = []
    job_link = []

    y = 2
    i = 1

    while y != 0:  
        y = 2
        try:  
            job = driver.find_element(By.XPATH,'//*[@id="mainJobListContainer"]/div/div/ul/li['+str(i)+']/div[2]/div[1]')        
            job_title.append(job.text)
        except:
            job_title.append('')
            y -= 1

        try:
            link = driver.find_element(By.XPATH,'//*[@id="Job_'+str(i)+'"]')        
            job_link.append(link.get_attribute('href'))
        except:
            job_link.append('')
            y -= 1        

        i = i+1

    # convert to dataframe

    df = pd.DataFrame(zip(job_title, job_link))
    df.columns = ['TITLE', 'LINK']
    df.head()

    df['QUALIFICATIONS'] = np.nan

    driver = webdriver.Chrome()   

    for i in range(len(df['LINK'])):
        try:
            job_text = ''
            url = (df['LINK'][i])

            driver.get(url)

            desc = driver.find_element(By.XPATH,'//*[@id="content"]/div[1]/div[7]/div[4]/div[2]/div/div[3]/div[4]/p[2]')
            texts = desc.find_elements(By.TAG_NAME, 'li')

            for Text in texts: 
                job_text = job_text+Text.text+' '

            df.loc[i, 'QUALIFICATIONS'] = job_text

        except:
            df.loc[i, 'QUALIFICATIONS'] = ''

    driver.close()

    to_drop = df[df['QUALIFICATIONS'] == ''].index
    df = df.drop(to_drop)
    df = df.dropna()

    df['QUALIFICATIONS'] = df['QUALIFICATIONS'].str.lower()
    df = df.drop_duplicates(subset=['TITLE', 'QUALIFICATIONS'])
    df = df.reset_index(drop=True)

    df['COMPANY'] = 'Infosys'
    df['DESCRIPTION'] = ''

    df = df.reindex(columns=['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION'])
    print("There are {} jobs from Infosys.".format(df.shape[0]))
    df.to_csv('infosys_usa_ jobs.csv')

## Intel

In [37]:
def intel_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []
    
    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        # get job qualifications
        # retrieve the text between "Qualifications" and the next section with tag "h2"
        try:
            tag = soup.find('h2', text='Qualifications').findNextSibling()
            ul=0
            space_count = 0
            while (tag.name!='h2') & (ul<2):
                if tag.name == 'ul':
                    ul += 1
                s = s + " " + tag.text
                space_count += 1
                try: tag = tag.findNextSibling() 
                except: pass
            
            # if there are no 'ul' or other text blocks found, then look for text under tag 'br'
            if len(s)<=space_count : 
                try:
                    tag = soup.find('h2', text='Qualifications').findNextSibling('br')
                    s = s + " " + tag.next_element
                except: pass
                    
        except: pass

        # get job descriptions
        # retrieve the text between "Job Description" and the next section with tag "h2"
        try:
            tag = soup.find('h2', text='Job Description').findNextSibling()
            ul=0
            space_count = 0
            while (tag.name!='h2'):
                d = d + " " + tag.text
                space_count += 1
                try: tag = tag.findNextSibling() 
                except: pass
            
            # if there are no 'ul' or other text blocks found, then look for text under tag 'br'
            if len(s)<=space_count : 
                try:
                    tag = soup.find('h2', text='Job Description').findNextSibling('br')
                    d = d + " " + tag.next_element
                except: pass
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])

    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [38]:
def scrape_jobs_intel():
    driver = webdriver.Chrome()
    driver.implicitly_wait(2)
    driver.get(intel_url)

    # handle the pop up cookies message
    try:
        next_button = driver.find_element('xpath', '//*[@id="igdpr-button"]')
        next_button.click()
    except: pass

    # get job titles and links for jobs on each page
    # click button to go to the next page

    job_title=[]
    job_link=[]

    while True:
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        tags = soup.find_all('a', {'class': 'job-title-link'})
        job_link.extend(['https://jobs.intel.com' + t['href'] for t in tags])
        job_title.extend([t.find('h2').text for t in tags])

        try:
            next_button = driver.find_element('xpath', '//*[@class="next"]')
            driver.execute_script("arguments[0].click();", next_button)
        except: break

    job_title = [remove_chars(job) for job in job_title]

    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = intel_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Intel', title, link, qual, descrp)

## JnJ

In [39]:
def jnj_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []
    
    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''
        
        text_pattern = text=re.compile("qualifications", re.IGNORECASE)
       
        # get job qualifications
        try:
          tags = soup.find('h3', text=text_pattern).parent.find_all(['ul', 'ol'])
          if len(tags) > 0:
                for t in tags:
                    s = s + " " + t.text
          else:
              try:
                tags = soup.find('', text=text_pattern).parent.findNextSiblings(['p'])
                for t in tags:
                  s = s + " " + t.text
              except: pass
        except: pass
            
        # get job descriptions
        try:
          tag = soup.find("h2", text='Description').findNext(['ul', 'ol'])
          d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [40]:
def scrape_jobs_jnj():   
    # retrieve job titles and job links
    df_title_link = get_titles_links_byUrl('a', 'class', 'stretched-link js-view-job', 
                                           'a', 'class', 'stretched-link js-view-job', 'https://jobs.jnj.com', 
                                           jnj_url1, jnj_url2)
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = jnj_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('JnJ', title, link, qual, descrp)

## JPM

In [41]:
def jpm_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []
    
    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        d = ''
              
        # get the job responsibilities
        try:
            tag = soup.find("div", {'class': "job-description"})
            tags = tag.find_all('ul')
            if len(tags) > 0:
                for t in tags:
                    d = d + " " + t.text
        except: pass
        
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, description, description

In [42]:
def scrape_jobs_jpm():
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(jpm_url)

    cookie_button = driver.find_element('xpath', '//*[@class="cookie-consent__button cookie-consent__button--primary"]')
    cookie_button.click()

    #have to scroll up and down several times to make load more button visible
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, 0);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

    next_button = driver.find_element('xpath', '//*[@class="search-results-load-more-btn"]')

    while next_button:
        driver.execute_script("arguments[0].click();", next_button)
        time.sleep(2)
        try:
            next_button = driver.find_element('xpath', '//*[@class="search-results-load-more-btn"]')
        except:
            break

    # scrape job titles and links
    job_title=[]
    job_link=[]
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    job_title.extend([td.text for td in soup.find_all("h3", {"class": "job-title"})])
    job_link.extend([td['href'] for td in soup.find_all("a", {"class": "joblist-tile"})])  

    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = jpm_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('JPM', title, link, qual, descrp)   

## KPMG

In [43]:
def kpmg_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        try:
          tag = soup.find("h3", text='Qualifications:').findNextSibling("ul").find_all('li')
          if tag:
            for t in tag:
                s = s + '. ' + t.text
        except: pass

        try:
          tag = soup.find("h3", text='Responsibilities:').findNextSibling("ul").find_all('li')
          if tag:
            for t in tag:
                d = d + '. ' + t.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [44]:
def scrape_jobs_kpmg():   
    # retrieve job titles and job links
    df_title_link = get_titles_links_byUrl('div', 'class', 'h5 text-dark-grey', 
                                           'a', 'class', 'box-shadow d-block', 'https://www.kpmguscareers.com', 
                                           kpmg_url1, kpmg_url2)
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = kpmg_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('KPMG', title, link, qual, descrp)

## Microsoft

In [45]:
def microsoft_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        s = ''
        d = ''
        
        soup = get_html(driver, link[i])
        try:
          tag = soup.find('p', {'data-ph-at-id' : 'job-qualifications-text'}).find_all('li')
          for t in tag:
            s = s + '. ' + t.text
        except: pass
        
        try:
          tag = soup.find('p', {'data-ph-at-id' : 'job-responsibilities-text'})
          d =  tag.text
        except: pass
        
        if len(s) > 10:
            qualifications.append(s)
            description.append(d)   
            jobtitle.append(title[i])
            joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [46]:
def scrape_jobs_microsoft():
    page_num = 0
    url = microsoft_url1 + str(page_num*20) + microsoft_url2
    driver=webdriver.Chrome()
    soup = get_html(driver, url)
    next_button = driver.find_element('xpath', '//*[@class="btn primary-button btn-lg phs-search-submit au-target"]')
    next_button.click()
    time.sleep(1)

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    jobs_per_page = len(soup.find_all('span', {'class': 'job-title'})) 
    total_jobs = int(soup.find('span', {'class': 'total-jobs'}).text)
    total_pages = total_jobs//jobs_per_page + 1    
    #print(jobs_per_page, total_jobs, total_pages)

    # retrieve job titles and job links
    df_title_link = get_titles_links_by_btnClick(driver, total_pages, '//a[@aria-label="View Next page"]', 
                                 'span', 'class', 'job-title', 
                                 'a', 'data-ph-at-id', 'job-link', '')   
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = microsoft_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Microsoft', title, link, qual, descrp)

## Nvidia

In [47]:
def nvidia_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')
        
        s = ''
        d = ''

        text_pattern_descrp = re.compile("ll be doing", re.IGNORECASE)
        text_pattern_qual_1 = re.compile("what we need to see", re.IGNORECASE)        
        text_pattern_qual_2 = re.compile("ways to stand out from the crowd", re.IGNORECASE)  
        
        try:
          tag = soup.find("b", text=text_pattern_qual_1).findNext("ul")
          s = s + tag.text
        except: pass
        
        try:
          tag = soup.find('b', text=text_pattern_qual_2).findNext("ul")
          s = s + tag.text
        except: pass

        try:
          tag = soup.find("b", text=text_pattern_descrp).findNext("ul")
          d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [48]:
def scrape_jobs_nvidia():
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(nvidia_url)

    # get job titles and links for each page and click the next button to go to the next page until no more
    job_title=[]
    job_link=[]
    time.sleep(1)
    next_button = driver.find_element('xpath', '//*[@aria-label="next"]')  
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        job_title.extend([td.text for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        job_link.extend(['https://nvidia.wd5.myworkdayjobs.com' + td['href'] for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        try:
            next_button.click()
            time.sleep(1)
        except: break

    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = nvidia_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Nvidia', title, link, qual, descrp)   

## Oracle

In [49]:
def oracle_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []
    
    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(2)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        s = ''
        d = ''
        r = ''
        
        # job qualifications are inside of the responsibilities section
        try:
            tag1 = soup.find("div", {'data-bind': "html: job().responsibilities"})
            
            # for responsibilities we retrieve the entire text under tag1
            r = tag1.text
            
            # for qualifications we retrieve the text under 'ul' only
            for t in tag1.find_all('ul'): 
                s = s + ' ' + t.text
        except: pass
        
        try:
            # for descriptions we retrieve the entire text under tag2, however, a lot of 
            # job descriptions are under qualifications section. So we will combined them 
            tag2 = soup.find("div", {'data-bind': "html: job().description"})
            d = tag2.text + " " + r
        except: pass      
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [50]:
def scrape_jobs_oracle():
    driver = webdriver.Chrome()
    driver.implicitly_wait(2)
    driver.get(oracle_url)

    #have to scroll up and down several times to make load more button visible
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

    driver.execute_script("window.scrollTo(0, 0);")
    time.sleep(1)

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

    # now click load more button
    next_button = driver.find_element('xpath', '//*[@class="search-results-load-more-btn"]')        
    while next_button:
        next_button.click()
        time.sleep(5)
        try:
            next_button = driver.find_element('xpath', '//*[@class="search-results-load-more-btn"]') 
        except:
            break
    
    # scrape job titles and links
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    job_title=[]
    job_link=[]

    job_title.extend(t.text for t in soup.find_all("h3", {"class": "job-title"}))
    job_link.extend(t['href'] for t in soup.find_all("a", {"class": "joblist-tile"}))


    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = oracle_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Oracle', title, link, qual, descrp)

## StateFarm

In [51]:
# retrieve job qualifications and descriptions
def sf_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):    
        s = ''
        d = ''
        
        soup = get_html(driver, link[i])   
        
        # get qualifications
        try:
          tag = soup.find("strong", text='Qualifications')
          # for some jobs the qualifications are under 'ul' tag as bullet points
          # for some other jobs they are under 'p' tag
          if tag.findNextSibling("ul"):
              s = tag.findNextSibling("ul").text
          elif tag.findNextSibling("p"):
              s = tag.findNextSibling("p").text
        except: pass
        
        # get responsibilities
        try:
          tag = soup.find("strong", text='Responsibilities')
          # For some jobs the responsibilities are under 'ul' tag as bullet points
          # For some other jobs they are under 'p' tag
          # When 'ul' is absent from "Responsibilities" but presents in "Qualifications", 
          # the code grabs the 'ul' text under qualifications, and this is ok
          if tag.findNextSibling("ul"):
              d = tag.findNextSibling("ul").text
          elif tag.findNextSibling("p"):
              d = tag.findNextSibling("p").text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])

    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [52]:
def scrape_jobs_statefarm():
    # retrieve job titles and job links
    df_title_link = get_titles_links_byUrl('p', 'class', 'job-title', 
                                     'a', 'class', 'job-title-link', 
                                     'https://jobs.statefarm.com',
                                      statefarm_url1, statefarm_url2)
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = sf_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])
    
    post_process_and_ouput('StateFarm', title, link, qual, descrp)

## Texas Instrument

In [53]:
def ti_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        try:
            soup = get_html(driver, URL)
        except:
            print(f'ERROR: Failed to load {URL}.')
        
        s = ''
        d = ''

        try:
          tag = soup.find(re.compile("(b|strong)"), text=re.compile("(Minimum Reqirements|Minimum reqirements)")).findNext("ul")
          if tag:
            s = s + " " + tag.text
        except: pass
        
        try:
          tag = soup.find(re.compile("(b|strong)"), text=re.compile("(Preferred|Required)")).findNext("ul")
          if tag:
            s = s + " " + tag.text
        except: pass

        # retrieve job descriptions. This will only work if the descriptions are listed as 
        # bullet points under tag "ul"
        try:
          tag = soup.find("span", text="Apply online").findNext('ul')
          d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [54]:
def scrape_jobs_ti():   
    # retrieve job titles and job links
    df_title_link = get_titles_links_byUrl('div', 'class', 'jobTitle', 
                                           'a', 'class', 'av-icon-char', 'https://careers.ti.com', 
                                           ti_url1, ti_url2)
    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = ti_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('TI', title, link, qual, descrp)

## Vistra

In [55]:
# retrieve job qualifications and descriptions
def vistra_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        text_pattern_descrp = re.compile("Job Description", re.IGNORECASE)
        text_pattern_qual = re.compile("Education, Experience, & Skill Requirements|\
                            Key Metrics|\
                            Key Accountabilities (directly or through others)",
                            re.IGNORECASE)
        
        try:
            tag = soup.find("", text=text_pattern_qual).findNext("ul")
            s = s + tag.text
        except: pass        
           
        #trys all of the description metrics
        try:
            tag = soup.find("", text=text_pattern_descrp).findNext("ul")
            d = tag.text
        except: pass
      
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [56]:
def scrape_jobs_vistra():
    # specify the url strings for the company's job posting website
    
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(vistra_url)

    # get job titles and links for each page and click the next button to go to the next page until no more
    job_title=[]
    job_link=[]

    next_button = driver.find_element('xpath', '//*[@aria-label="next"]')  
    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        job_title.extend([td.text for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        job_link.extend(['https://vst.wd5.myworkdayjobs.com' + td['href'] for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        try:
            next_button.click()
            time.sleep(1)
        except: break
            
    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()
    print(df_title_link.shape[0])
    
    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = vistra_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Vistra', title, link, qual, descrp)



## Vizient

In [57]:
# retrieve job qualifications and descriptions
def vizient_job_description(title, link):
    qualifications = []
    description = []
    jobtitle = []
    joblink = []

    driver=webdriver.Chrome('chromedriver',options=chrome_options)
    for i in range(len(link)):
        URL=link[i]
        driver.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        s = ''
        d = ''

        text_pattern_descrp_1 = re.compile("Responsibilities:", re.IGNORECASE)
        
        
        text_pattern_qual_1 = re.compile("Qualifications:", re.IGNORECASE)        
        
        
        #trys all of the qualification metrics
        try:
            tag = soup.find("b", text=text_pattern_qual_1).findNext("ul")
            s = s + tag.text
        except: pass
        
               
        #trys all of the description metrics
        try:
            tag = soup.find("b", text=text_pattern_descrp_1).findNext("ul")
            d = tag.text
        except: pass
        
        qualifications.append(s)
        description.append(d)   
        jobtitle.append(title[i])
        joblink.append(link[i])
        
    driver.quit()            
 
    return jobtitle, link, qualifications, description

In [58]:
def scrape_jobs_vizient():
    driver = webdriver.Chrome()
    driver.implicitly_wait(1)
    driver.get(vizient_url)
    time.sleep(1)
    
    # get job titles and links for each page and click the next button to go to the next page until no more
    job_title=[]
    job_link=[]

    next_button = driver.find_element('xpath', '//*[@aria-label="next"]') 

    while next_button:
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        job_title.extend([td.text for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        job_link.extend(['https://vizient.wd1.myworkdayjobs.com' + td['href'] for td in soup.findAll("a", {"data-automation-id": "jobTitle"})])
        try:
            next_button.click()
            time.sleep(1)
        except: break

    # create a dataframe that contains job titles and links for all job categories
    df_title_link = pd.DataFrame(zip(job_title, job_link), columns=['JOB_TITLE', 'JOB_LINK'])

    # drop the duplicates
    df_title_link = df_title_link.drop_duplicates()

    print(df_title_link.shape[0])

    # retrieve the qualification and descriptions for each job.
    title, link, qual, descrp = vizient_job_description(df_title_link['JOB_TITLE'].values[:test], df_title_link['JOB_LINK'].values[:test])

    post_process_and_ouput('Vizient', title, link, qual, descrp)

## Walmart

In [59]:
def scrape_jobs_walmart():
    # The jobs page automatically defaults to your location. Therefore, it turns up 0 results.
    # The below code will help with making the jobs page not default to a specific locaion.

    driver = webdriver.Chrome()
    driver.implicitly_wait(10)

    url = 'https://careers.walmart.com/'
    driver.get(url)

    job_search = driver.find_element(By.XPATH, '//*[@id="search"]')
    job_search.send_keys('technology')

    search_button = driver.find_element(By.XPATH, '//*[@id="location"]')
    search_button.click()

    TITLE = []
    LINK = []

    y = 1
    i = 1

    while y != 0: 
        try:        
            url = 'https://careers.walmart.com/results?q=&page='+str(i)+'&sort=rank&jobCategory=00000161-7bad-da32-a37b-fbef5e390000,00000161-7bf4-da32-a37b-fbf7c59e0000,00000161-7bff-da32-a37b-fbffc8c10000,00000161-8bd0-d3dd-a1fd-bbd0febc0000,00000161-8be6-da32-a37b-cbe70c150000&jobSubCategory=0000015a-a577-de75-a9ff-bdff284e0000&expand=department,0000015e-b97d-d143-af5e-bd7da8ca0000,00000161-8be6-da32-a37b-cbe70c150000,brand,type,rate&type=jobs'        
            driver.get(url) 

            for j in range(1,26): 
                job = driver.find_element(By.XPATH, '//*[@id="search-results"]/li['+str(j)+']/div[1]/h4/a')

                try:
                    TITLE.append(job.text)
                except:
                    TITLE.append('')

                try:
                    LINK.append(job.get_attribute('href'))
                except:
                    LINK.append('')             
        except:
            y -= 1

        i += 1

    driver.close()    
    print(len(TITLE), len(LINK)) 

    df = pd.DataFrame(zip(TITLE, LINK))
    df['QUALIFICATIONS'] = np.nan

    df.columns = ['TITLE', 'LINK', 'QUALIFICATIONS']

    df = df.drop_duplicates(subset=['TITLE'])
    print(df.shape[0])

    driver = webdriver.Chrome()

    for i in range(len(df['LINK'])):
        try:        
            url = (df['LINK'][i])
            driver.get(url)
            desc = driver.find_element(By.XPATH, '/html/body/main/section[3]/div/div[2]')
            df.loc[i,'QUALIFICATIONS'] = desc.text   
        except:
            df.loc[i,'QUALIFICATIONS'] = np.nan

    driver.close()

    df = df.drop_duplicates(subset=['TITLE', 'LINK', 'QUALIFICATIONS'])
    df['QUALIFICATIONS'] = df['QUALIFICATIONS'].str.lower()
    df = df.dropna()
    df['COMPANY'] = 'Walmart'
    df = df.reset_index(drop = True)
    len(df)

    for i in range(len(df['QUALIFICATIONS'])):
        try:
            desc = df['QUALIFICATIONS'][i]
            mid = desc.index('minimum qualifications')+22
            desc = desc[mid:]
            df['QUALIFICATIONS'][i] = desc

        except:
            pass

    # removing unnecessary text
    remove1 = '...\noutlined below are the required minimum qualifications for this position. if none are listed, there are no minimum qualifications.'
    remove2 = '...\noutlined below are the optional preferred qualifications for this position. if none are listed, there are no preferred qualifications.'

    for i in range(len(df['QUALIFICATIONS'])):
        try:
            df['QUALIFICATIONS'][i] = df['QUALIFICATIONS'][i].replace(remove1, '')
            df['QUALIFICATIONS'][i] = df['QUALIFICATIONS'][i].replace(remove2, '')
            df['QUALIFICATIONS'][i] = df['QUALIFICATIONS'][i].replace('\n', '')
            df['QUALIFICATIONS'][i] = df['QUALIFICATIONS'][i].replace('• ', '')
        except:
            pass

    for i in range(len(df['QUALIFICATIONS'])):
        try:
            desc = df['QUALIFICATIONS'][i]
            mid = desc.index('primary location')
            desc = desc[:mid]
            df.loc[i,'QUALIFICATIONS'] = desc      
            df.loc[i,'QUALIFICATIONS'] = df['QUALIFICATIONS'][i].strip()
        except:
            pass

    df['DESCRIPTION'] = df['QUALIFICATIONS']
    df = df[['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION']]

    #df['DESCRIPTION'] = df['QUALIFICATIONS']
    df['DESCRIPTION'] = ''
    df = df[['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION']]
    print("There are {} jobs from Walmart.".format(df.shape[0]))
    df.to_csv('walmart_technology_jobs.csv')

## Main update function

In [60]:
def main_update_func(update_list):
    start_0 = timer()
    if update_list.count('Accenture') > 0:
        start = timer()
        try:
            scrape_jobs_accenture()
            end = timer()
            print(f'Accenture: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Accenture scraping failed\n')
            
    if update_list.count('Apple') > 0:
        start = timer()
        try:
            scrape_jobs_apple()
            end = timer()
            print(f'Apple: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Apple scraping failed\n')
            
    if update_list.count('Amazon') > 0:
        start = timer()
        try:
            scrape_jobs_amazon()
            end = timer()
            print(f'Amazon: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Amazon scraping failed\n')

    if update_list.count('Deloitte') > 0:
        start = timer()
        try:
            scrape_jobs_deloitte()
            end = timer()
            print(f'Deloitte: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Deloitte scraping failed\n')

    if update_list.count('Google') > 0:
        start = timer()
        try:
            scrape_jobs_google()
            end = timer()
            print(f'Google: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Google scraping failed\n')

    if update_list.count('IBM') > 0:
        start = timer()
        try:
            scrape_jobs_ibm()
            end = timer()
            print(f'IBM: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('IBM scraping failed\n')

    if update_list.count('Intel') > 0:
        start = timer()
        try:
            scrape_jobs_intel()
            end = timer()
            print(f'Intel: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Intel scraping failed\n')

    if update_list.count('JnJ') > 0:
        start = timer()
        try:
            scrape_jobs_jnj()
            end = timer()
            print(f'JnJ: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('JnJ scraping failed\n')

    if update_list.count('JPM') > 0:
        start = timer()
        try:
            scrape_jobs_jpm()
            end = timer()
            print(f'JPM: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('JPM scraping failed\n')

    if update_list.count('KPMG') > 0:
        start = timer()
        try:
            scrape_jobs_kpmg()
            end = timer()
            print(f'KPMG: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('KPMG scraping failed\n')

    if update_list.count('Microsoft') > 0:
        start = timer()
        try:
            scrape_jobs_microsoft()
            end = timer()
            print(f'Microsoft: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Microsoft scraping failed\n')


    if update_list.count('Nvidia') > 0:
        start = timer()
        try:
            scrape_jobs_nvidia()
            end = timer()
            print(f'Nvidia: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Nvidia scraping failed\n')

    if update_list.count('Oracle') > 0:
        start = timer()
        try:
            scrape_jobs_oracle()
            end = timer()
            print(f'Oracle: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Oracle scraping failed\n')


    if update_list.count('State Farm') > 0:
        start = timer()
        try:
            scrape_jobs_statefarm()
            end = timer()
            print(f'State Farm: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('State Farm scraping failed\n')


    if update_list.count('Texas Instruments') > 0:
        start = timer()
        try:
            scrape_jobs_ti()
            end = timer()
            print(f'Texas Instruments: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Texas Instruments scraping failed\n')


    if update_list.count('Cisco') > 0:
        start = timer()
        try:
            scrape_jobs_cisco()
            end = timer()
            print(f'Cisco: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Cisco scraping failed\n')
            

    if update_list.count('Collabera') > 0:
        start = timer()
        try:
            scrape_jobs_collabera()
            end = timer()
            print(f'Collabera: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Collabera scraping failed\n')


    if update_list.count('Expedia') > 0:
        start = timer()
        try:
            scrape_jobs_expedia()
            end = timer()
            print(f'Expedia: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Expedia scraping failed\n')


    if update_list.count('Infosys') > 0:
        start = timer()
        try:
            scrape_jobs_infosys()
            end = timer()
            print(f'Infosis: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Infosys scraping failed\n')

            
    if update_list.count('Walmart') > 0:
        start = timer()
        try:
            scrape_jobs_walmart()
            end = timer()
            print(f'Walmart: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Walmart scraping failed\n')
            
    if update_list.count('Vizient') > 0:
        start = timer()
        try:
            scrape_jobs_vizient()
            end = timer()
            print(f'Vizient: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Vizient scraping failed\n')

    if update_list.count('Vistra') > 0:
        start = timer()
        try:
            scrape_jobs_vistra()
            end = timer()
            print(f'Vistra: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Vistra scraping failed\n')
            
    if update_list.count('Fox News') > 0:
        start = timer()
        try:
            scrape_jobs_fox()
            end = timer()
            print(f'Fox News: time elapsed {int((end-start)/60)} minutes\n')
        except:
            print('Fox News scraping failed\n')
            
    end_0 = timer()
    print(f'\nTotal time elapsed {int((end_0-start_0)/60)} minutes')

# The program expects to have a list of companies for which the jobs will be updated

In [62]:
complete_company_list

['Accenture',
 'Amazon',
 'AppleCisco',
 'Collabera',
 'Deloitte',
 'Expedia',
 'Fox NewsGoogle',
 'IBM',
 'InfosysIntel',
 'JnJ',
 'JPM',
 'KPMG',
 'Microsoft',
 'Nvidia',
 'Oracle',
 'State Farm',
 'Texas Instruments',
 'Vistra',
 'Vizient',
 'Walmart']

In [61]:
#update_list=['Expedia']
update_list=complete_company_list
main_update_func(update_list)

48 9
293
There are 251 jobs from Accenture.
Accenture: time elapsed 22 minutes

553
There are 552 jobs from Amazon.
Amazon: time elapsed 26 minutes

1863
There are 1765 jobs from Deloitte.
Deloitte: time elapsed 120 minutes

147
There are 133 jobs from IBM.
IBM: time elapsed 11 minutes

440
There are 385 jobs from JnJ.
JnJ: time elapsed 22 minutes

263
There are 92 jobs from JPM.
JPM: time elapsed 13 minutes

631
There are 594 jobs from KPMG.
KPMG: time elapsed 36 minutes

280
There are 269 jobs from Microsoft.
Microsoft: time elapsed 9 minutes

219
There are 208 jobs from Nvidia.
Nvidia: time elapsed 6 minutes

1051
There are 948 jobs from Oracle.
Oracle: time elapsed 55 minutes

47
There are 45 jobs from StateFarm.
State Farm: time elapsed 5 minutes

397
There are 393 jobs from TI.
Texas Instruments: time elapsed 27 minutes

1476
There are 1476 jobs from Collabera.
Collabera: time elapsed 53 minutes

page 4
1
Expedia scraping failed

363 363
220
There are 148 jobs from Walmart.
Walma