<a href="https://colab.research.google.com/github/JessicaOjo/Job-trend-analysis-/blob/main/Scrapping_Indeed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import sys
import requests as rq
from bs4 import BeautifulSoup as bs
from time import sleep
from time import time
from random import randint
from warnings import warn
import json
import pandas as pd

Most websites do not leave job postings for more than 2-5 months hence getting job posting of previous years from the jobsites directly is impossible. 

To get old job postings, I'm extracting data from a web archive called [the wayback machine](https://archive.org/web/). The drawback to this is that the links are not clickable hence being able to extract the job posting descriptions might be difficult.

# Scraping Indeed job site 

In [8]:
roles = [
    'Marketing Technologist',
    'SEO Consultant',
    'Web analytics Developer',
    'Digital Marketing Manager',
    'Social media manager',
    'Content Manager',
    'Information Architect',
    'UX designer',
    'UI Designer',
    'Front end designer',
    'Front end developer',
    'Mobile Developer',
    'Full stack developer',
    'Software Developer',
    'WordPress Developer',
    'Python Developer',
    'Systems Engineer',
    'Data Architect',
    'Database Administrator',
    'Data Analyst', 
    'Data scientist',
    'Cloud Architect',
    'DevOps Manager',
    'Agile project manager',
    'Product Manager',
    'Security specialist',
    'QA (Quality Assurance) specialist',
    'Game developer',
    'Computer Graphics animator',
    'Information security analyst',
    'Network and system administrator',
    'Product owner'

]

In [9]:
url = f'http://web.archive.org/cdx/search/cdx?url=www.indeed.com/jobs?q=data+scientist&amp;explvl=entry_level&from=20160101&to=20230215&output=json'
urls = rq.get(url).text
parse_url = json.loads(urls)
print(parse_url)

[['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'], ['com,indeed)/jobs?q=data%20scientist', '20170811213301', 'https://www.indeed.com/jobs?q=Data%20Scientist', 'text/html', '200', 'BZ27G3E7AWVPQU5UPN63EDAAZ6X3MCXQ', '32285'], ['com,indeed)/jobs?q=data%20scientist', '20170818212747', 'https://www.indeed.com/jobs?q=Data%20Scientist', 'text/html', '200', '47JW7VM3NF2HEW4X4JAX7PZ5EY6JPLUY', '34450'], ['com,indeed)/jobs?q=data%20scientist', '20170825185625', 'https://www.indeed.com/jobs?q=Data%20Scientist', 'text/html', '200', '7KXUK6RVYP7DEOU7TX6MC6XJZ4XQY3ZR', '34087'], ['com,indeed)/jobs?q=data%20scientist', '20170901184836', 'https://www.indeed.com/jobs?q=Data%20Scientist', 'text/html', '200', 'REJV43YXUVQYAV7OJOC2CDBEGDZ7O5A6', '34467'], ['com,indeed)/jobs?q=data%20scientist', '20170908173444', 'https://www.indeed.com/jobs?q=Data%20Scientist', 'text/html', '200', 'HW7BOKP6TXUZ3BBFUHXRVD4JESG24XHR', '33937'], ['com,indeed)/jobs?q=data%20scientist', '20170

In [10]:
def get_archive_link(role):
  url = f'http://web.archive.org/cdx/search/cdx?url=www.indeed.com/jobs?q={role}&amp;explvl=entry_level&from=20160101&to=20230215&output=json'
  urls = rq.get(url).text
  parse_url = json.loads(urls) #parses the JSON from urls.

  url_list = []
  for i in range(1,len(parse_url)):
    orig_url = parse_url[i][2]
    tstamp = parse_url[i][1]
    waylink = tstamp+'/'+orig_url
    url_list.append(waylink)

  ## Compiles final url pattern.
  final_list = []
  for url in url_list:
    final_url = 'https://web.archive.org/web/'+url
    final_list.append(final_url)

  return final_list

In [11]:
# extract company
def extract_company(div): 
    company = div.find_all(name="span", attrs={"class":"company"})
    if len(company) > 0:
      for b in company:
        return (b.text.strip())
    else:
      sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
      for span in sec_try:
          return (span.text.strip())
    return 'NOT_FOUND'


# extract job salary
def extract_salary(div): 
    salaries = []
    try:
      return (div.find('nobr').text)
    except:
      try:
        div_two = div.find(name='div', attrs={'class':'salary no-wrap'})
        div_three = div_two.find('div')
        salaries.append(div_three.text.strip())
        return salaries
      except:
        try:
          div_two = div.find(name='div', attrs={'class':'sjcl'})
          div_three = div_two.find('div')
          salaries.append(div_three.text.strip())
          return salaries
        except:
          return ('NOT_FOUND')
    return 'NOT_FOUND'


# extract job location
def extract_location(div):
  for span in div.findAll('span', attrs={'class': 'location'}):
    return (span.text)
  return 'NOT_FOUND'


# extract job title
def extract_job_title(div):
  for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
    return (a['title'])
  return('NOT_FOUND')


# extract jd summary
def extract_summary(div): 
  spans = div.findAll('span', attrs={'class': 'summary'})
  for span in spans:
    return (span.text.strip())
  return 'NOT_FOUND'
 

# extract link of job description 
def extract_link(div): 
  for a in div.find_all(name='a', attrs={'data-tn-element':'jobTitle'}):
    return (a['href'])
  return('NOT_FOUND')


# extract date of job when it was posted
def extract_date(div):
  try:
    spans = div.findAll('span', attrs={'class': 'date'})
    for span in spans:
      return (span.text.strip())
  except:
    return 'NOT_FOUND'
  return 'NOT_FOUND'


# extract full job description from link
def extract_fulltext(url):
  try:
    page = rq.get('http://www.indeed.com' + url)
    soup = bs(page.text, "lxml", from_encoding="utf-8")
    spans = soup.findAll('span', attrs={'class': 'summary'})
    for span in spans:
      return (span.text.strip())
  except:
    return 'NOT_FOUND'
  return 'NOT_FOUND'

In [12]:
# define dataframe columns
df = pd.DataFrame(columns = ['unique_id', 'job_qry','job_title', 
                             'company_name', 'location', 'summary', 
                             'salary', 'link', 'date', 'full_text'])

In [13]:
for role in roles:
  role = role.lower().replace(' ', '+')
  archive_url_list = get_archive_link(role)
  print(archive_url_list)

[]
[]
[]
['https://web.archive.org/web/20190708121537/http://indeed.com/jobs?q=Digital%20Marketing%20Manager', 'https://web.archive.org/web/20190708121538/http://www.indeed.com/jobs?q=Digital%20Marketing%20Manager', 'https://web.archive.org/web/20190708121539/https://www.indeed.com/jobs?q=Digital%20Marketing%20Manager', 'https://web.archive.org/web/20190805185632/http://indeed.com/jobs?q=Digital%20Marketing%20Manager', 'https://web.archive.org/web/20190805185632/http://www.indeed.com/jobs?q=Digital%20Marketing%20Manager', 'https://web.archive.org/web/20190805185633/https://www.indeed.com/jobs?q=Digital%20Marketing%20Manager']
['https://web.archive.org/web/20190708121358/http://indeed.com/jobs?q=Social%20Media%20Manager', 'https://web.archive.org/web/20190708121359/http://www.indeed.com/jobs?q=Social%20Media%20Manager', 'https://web.archive.org/web/20190708121400/https://www.indeed.com/jobs?q=Social%20Media%20Manager', 'https://web.archive.org/web/20190805185447/http://indeed.com/jobs?q

In [15]:
for role in roles:
  role_ = role.lower().replace(' ', '+')
  archive_url_list = get_archive_link(role_)
  for url in archive_url_list:
    for i in range(3): #retry 3 times if connection error
      while True:
        try:
          pg = rq.get(url).text
          sleep(3) #ensuring 5 seconds sleep after every grab
        except ConnectionError:
          sleep(3)
          continue
        break

    soup = bs(pg,'html.parser')
    divs = soup.find_all(name="div", attrs={"class":"row"})

    cnt = 0
    for div in divs:
      #specifying row num for index of job posting in dataframe
      num = (len(df) + 1) 
      cnt = cnt + 1
      #job data after parsing
      job_post = [] 

      #append unique id
      job_post.append(div['id'])

      #append job qry
      job_post.append(role)

      #grabbing job title
      job_post.append(extract_job_title(div))

      #grabbing company
      job_post.append(extract_company(div))

      #grabbing location name
      job_post.append(extract_location(div))

      #grabbing summary text
      job_post.append(extract_summary(div))

      #grabbing salary
      job_post.append(extract_salary(div))

      #grabbing link
      link = extract_link(div)
      job_post.append(link)

      #grabbing date
      job_post.append(extract_date(div))

      #grabbing full_text
      job_post.append(extract_fulltext(link))

      #appending list of job post info to dataframe at index num
      df.loc[num] = job_post
  roles.remove(role)
  print(roles)

  sleep(5)

df.to_csv('job_data_indeed.csv', index=False)

['Digital Marketing Manager', 'Content Manager', 'UX designer', 'Front end designer', 'Front end developer', 'Mobile Developer', 'Full stack developer', 'Software Developer', 'WordPress Developer', 'Python Developer', 'Systems Engineer', 'Data Architect', 'Database Administrator', 'Data Analyst', 'Data scientist', 'Cloud Architect', 'DevOps Manager', 'Agile project manager', 'Product Manager', 'Security specialist', 'QA (Quality Assurance) specialist', 'Game developer', 'Computer Graphics animator', 'Information security analyst', 'Network and system administrator', 'Product owner']
['Digital Marketing Manager', 'UX designer', 'Front end designer', 'Front end developer', 'Mobile Developer', 'Full stack developer', 'Software Developer', 'WordPress Developer', 'Python Developer', 'Systems Engineer', 'Data Architect', 'Database Administrator', 'Data Analyst', 'Data scientist', 'Cloud Architect', 'DevOps Manager', 'Agile project manager', 'Product Manager', 'Security specialist', 'QA (Qual

In [None]:
# todo 
# check availability of all the websites and their links 
# get salary estimates for the years 

In [16]:
df

Unnamed: 0,unique_id,job_qry,job_title,company_name,location,summary,salary,link,date,full_text
1,pj_b9c0fe264fb6191d,Social media manager,Social Media Manager,360SWEATER,NOT_FOUND,NOT_FOUND,[360SWEATER],/web/20190708121400/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
2,pj_1b5c54a6b7341e61,Social media manager,Social Media Community Manager,A Shoc Beverage,NOT_FOUND,NOT_FOUND,[A Shoc Beverage],/web/20190708121400/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
3,pj_557f226535d37d66,Social media manager,Social Media Manager - (FT),Advanced Plastic Surgery Solutions,NOT_FOUND,NOT_FOUND,[Advanced Plastic Surgery Solutions],/web/20190708121400/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
4,pj_88f7ec126a72b559,Social media manager,Social Engagement Manager,Margaritaville Resort and Spa and Margaritavil...,NOT_FOUND,NOT_FOUND,[Margaritaville Resort and Spa and Margaritavi...,/web/20190708121400/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
5,pj_c9872f45c191dcea,Social media manager,Social Media Manager,AMResorts,NOT_FOUND,NOT_FOUND,[AMResorts\n\n\n26 reviews],/web/20190708121400/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
...,...,...,...,...,...,...,...,...,...,...
1225,p_f574e968bf0ef913,Security specialist,Security/Protection Specialist,Yes Sir Security,"Los Angeles, CA","Security guard, security officer, security age...",NOT_FOUND,/web/20180201191539/https://www.indeed.com/com...,17 days ago,NOT_FOUND
1226,p_a0f8d025f0085b6a,Security specialist,Security Specialist 1,Los Alamos National Laboratory,"Los Alamos, NM",Personnel Security ensures that granting a wor...,NOT_FOUND,/web/20180201191539/https://www.indeed.com/rc/...,6 days ago,NOT_FOUND
1227,p_32e6d41f53d528ec,Security specialist,Personnel Security/Industrial Security Specialist,Advanced Integration Technology,"Plano, TX",Previous experience as a personnel security sp...,NOT_FOUND,/web/20180201191539/https://www.indeed.com/rc/...,16 days ago,NOT_FOUND
1228,pj_8b3ccda4b1691a7d,Security specialist,Security Specialist,"Security Industry Specialists, Inc.","Cupertino, CA",The Security Specialist reports to the Securit...,[$18 an hour],/web/20180201191539/https://www.indeed.com/pag...,20 hours ago,NOT_FOUND


In [17]:
df.sample(50)

Unnamed: 0,unique_id,job_qry,job_title,company_name,location,summary,salary,link,date,full_text
57,pj_a3bc38ebfa1c6ea3,Social media manager,Social Media Marketing Manager,4Patriots,NOT_FOUND,NOT_FOUND,[4Patriots],/web/20190805185449/https://www.indeed.com/pag...,52 minutes ago,NOT_FOUND
467,p_2d3258453c305fa8,Software Developer,"Software Engineer, Frontend",Dealpath,"San Francisco, CA",Familiar with a variety of software developmen...,NOT_FOUND,/web/20171025122224/https://www.indeed.com/rc/...,4 days ago,NOT_FOUND
133,pj_955b3c865a034ccd,Front end developer,JavaScript Developer,Indeed Prime,"Seattle, WA",Apply to 100+ top companies with 1 simple appl...,NOT_FOUND,/web/20170825194327/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
152,p_bf96220a690aa483,Front end developer,"Hiring ::: UI/React ::: Sunnyvale,CA ::: Contract",velossent,"Sunnyvale, CA",We are looking for a Senior Front-end Develope...,NOT_FOUND,/web/20170901193501/https://www.indeed.com/com...,4 hours ago,NOT_FOUND
1055,pj_2425ed3b560218db,Data Analyst,Data Analyst,Masson Farms of New Mexico,NOT_FOUND,NOT_FOUND,[Masson Farms of New Mexico\n\n\n4 reviews],/web/20190708112906/https://www.indeed.com/pag...,2 days ago,NOT_FOUND
904,p_bc1bdf5e26f3a314,Data Analyst,"Data Analyst, UberEverything",Uber,"San Francisco, CA 94103 (South Of Market area)","The Data Analyst, UberEverything is first and ...",NOT_FOUND,/web/20171123153424/https://www.indeed.com/rc/...,17 days ago,NOT_FOUND
941,pj_d3b85d2974213b97,Data Analyst,Quality Improvement Analyst,"Northwest Health Services, Inc.","Saint Joseph, MO 64506",Knowledge and experience of data analysis and ...,NOT_FOUND,/web/20171228171517/https://www.indeed.com/pag...,NOT_FOUND,NOT_FOUND
563,p_4e28e8a1243286f1,Software Developer,Software Engineer Intern,6sense,"San Francisco, CA",Software Engineering Intern. Java or Python. T...,NOT_FOUND,/web/20180104170818/https://www.indeed.com/rc/...,30+ days ago,NOT_FOUND
846,p_d70b6c73f40e58d6,Data Analyst,Data Analytics Intern,Slack,"San Francisco, CA",Slack is looking for interns to work alongside...,NOT_FOUND,/web/20171011122454/https://www.indeed.com/rc/...,15 days ago,NOT_FOUND
992,p_e9837f9e05fd5970,Data Analyst,"Sales Finance, Business Intelligence Analyst",Twitter,"San Francisco, CA 94103 (South Of Market area)",Facilitate integration of new acquisition reve...,NOT_FOUND,/web/20180118164443/https://www.indeed.com/rc/...,1 day ago,NOT_FOUND
