WEB SCRAPING - INDEED.COM

In this project, I wanted to showcase my ability to collect my own data through web scraping. I explored data analyst, business analyst, and product analyst jobs posted in California on indeed.com, a job aggregator that updates multiple times daily. I scraped the site using the “requests” and “BeautifulSoup” libraries in Python to fetch and parse Indeed’s pages, then used the “pandas” library to assemble the data into a dataframe for further cleaning and analysis.

Importing necessary libraries

In [2]:
import requests
import bs4
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

To check whether a website responds to your request, you can inspect status_code as follows:

import requests

r=requests.get("ENTER URL OF YOUR CHOICE")

r.status_code

The output should be 200, which means the request succeeded. Any other code means the page could not be fetched as requested (for example, 403 when access is refused or 404 when the page does not exist). Note that a 200 response does not by itself mean that scraping is permitted; the site's robots.txt file and terms of service state what is allowed.
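
A status code only reports whether a request went through. Whether automated access is welcome is usually declared in the site's robots.txt file. The following is a minimal sketch of reading such rules with Python's standard urllib.robotparser module. The helper name is_path_allowed and the sample rules are made up for illustration and are not Indeed's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_path_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only to demonstrate the check
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_path_allowed(SAMPLE_ROBOTS, "https://example.com/jobs"))
print(is_path_allowed(SAMPLE_ROBOTS, "https://example.com/private/report"))
```

For a live site, RobotFileParser can also download the file itself through set_url() followed by read().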

TASK 1:
SEARCHING JOB POSTINGS FOR DATA ANALYST - ENTRY LEVEL - FULL TIME IN LAST 14 DAYS FROM INDEED.COM

Checking that indeed.com responds to a search request

In [3]:
res=requests.get("https://www.indeed.com/jobs?q=data+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14")
res.status_code

200

Getting entry-level, full-time data analyst openings posted in California in the last 14 days

In [4]:
type(res)

requests.models.Response

In [5]:
res.text

'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n<meta http-equiv="content-type" content="text/html;charset=UTF-8">\n<script type="text/javascript" src="//d3fw5vlhllyvee.cloudfront.net/s/3c2ab36/en_US.js"></script>\n<link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css">\n<link rel="alternate" type="application/rss+xml" title="Data Analyst Jobs, Employment in California" href="http://rss.indeed.com/rss?q=data+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level">\n<link rel="alternate" media="only screen and (max-width: 640px)" href="/m/jobs?q=data+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14">\n<link rel="alternate" media="handheld" href="/m/jobs?q=data+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14">\n\n<script type="text/javascript">\n\nif (typeof window[\'closureReadyCallbacks\'] == \'undefined\') {\nwindow[\'closureReadyCallbacks\'] = [];\n}\n\nfunction call_wh

In [6]:
soup = bs4.BeautifulSoup(res.text,"lxml")

In [7]:
soup

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="//d3fw5vlhllyvee.cloudfront.net/s/3c2ab36/en_US.js" type="text/javascript"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="http://rss.indeed.com/rss?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level" rel="alternate" title="Data Analyst Jobs, Employment in California" type="application/rss+xml"/>
<link href="/m/jobs?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level&amp;fromage=14" media="only screen and (max-width: 640px)" rel="alternate"/>
<link href="/m/jobs?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level&amp;fromage=14" media="handheld" rel="alternate"/>
<script type="text/javascript">

if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReady

Creating a data frame for data analyst positions, with columns for job title, job location, and company name

In [8]:
job_title=[]
company_name=[]
job_location=[]

Getting job title

In [9]:
soup.select('.jobtitle')[0]['title']

'Healthcare Data Analyst'

Getting name of the company

In [10]:
soup.select('.sjcl .company')[0].text.replace('\n','')

'American Indian Health & Services'

Getting location of the job

In [11]:
soup.select(".location")[0].text

'Santa Barbara, CA 93111'

Gathering data from the homepage

In [12]:
base_url='https://www.indeed.com/jobs?q=data+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14'
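
The search URL above is a fixed path plus a query string (q, l, radius, jt, explvl, fromage). As a sketch, the same string can be assembled from named values with the standard library's urlencode, which makes it easier to change the query later. The function name build_search_url and its keyword names are assumptions chosen for this example.

```python
from urllib.parse import urlencode

INDEED_SEARCH = "https://www.indeed.com/jobs"


def build_search_url(query, location="California", radius=100,
                     job_type="fulltime", level="entry_level", max_age_days=14):
    """Build a search URL with the same parameters used in this notebook."""
    params = {
        "q": query,               # search terms; spaces are encoded as '+'
        "l": location,
        "radius": radius,
        "jt": job_type,
        "explvl": level,
        "fromage": max_age_days,  # only postings from the last N days
    }
    return INDEED_SEARCH + "?" + urlencode(params)


print(build_search_url("data analyst"))
```

Calling build_search_url("business analyst") or build_search_url("product analyst") would give the URLs used in the later tasks.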

In [13]:
#Gathering data from homepage
res=requests.get(base_url)
soup=bs4.BeautifulSoup(res.text,"lxml")

In [14]:
soup

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="//d3fw5vlhllyvee.cloudfront.net/s/3c2ab36/en_US.js" type="text/javascript"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/b45d10b/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="http://rss.indeed.com/rss?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level" rel="alternate" title="Data Analyst Jobs, Employment in California" type="application/rss+xml"/>
<link href="/m/jobs?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level&amp;fromage=14" media="only screen and (max-width: 640px)" rel="alternate"/>
<link href="/m/jobs?q=data+analyst&amp;l=California&amp;radius=100&amp;jt=fulltime&amp;explvl=entry_level&amp;fromage=14" media="handheld" rel="alternate"/>
<script type="text/javascript">

if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReady

In [15]:
# Run each selector once per page instead of on every loop iteration
titles=soup.select('.jobtitle')
locations=soup.select('.location')
companies=soup.select('.sjcl .company')

for i in range(len(titles)):
    job_title.append(titles[i]['title'])
    job_location.append(locations[i].text)
    company_name.append(companies[i].text.replace('\n',''))

In [16]:
job_title

['Data Analyst',
 'Data Analyst',
 'Survey Data Analyst',
 'Technical Data Analyst Fellow',
 'Data and Policy Analyst IV (4955663)',
 'Data Scientist',
 'Data Analyst - Bilingual English and Slovak Language',
 'Analyst, Data',
 'Staff Business Data Analysis',
 'Strategic Intelligence Analyst/Data Analyst',
 'Market Research Analyst',
 'Data Analyst',
 'Administrative Analyst',
 'Data Associate',
 'Big Data Engineer Internship']

Gathering data from other pages

In [17]:
# Follow-up pages are requested by adding a 'start' offset in steps of 10
for start in range(10,230,10):
    scrape_url=base_url+'&start={}'.format(start)
    res=requests.get(scrape_url)
    soup=bs4.BeautifulSoup(res.text,"lxml")

    titles=soup.select('.jobtitle')
    locations=soup.select('.location')
    companies=soup.select('.sjcl .company')

    # Inner index renamed so it no longer shadows the page offset
    for j in range(len(titles)):
        job_title.append(titles[j]['title'])
        job_location.append(locations[j].text)
        company_name.append(companies[j].text.replace('\n',''))
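
To make the pagination explicit, here is a small sketch of the URLs the loop above requests. It assumes only what the loop already shows, namely that follow-up pages are reached by appending &start= with offsets in steps of 10. The generator name page_urls and the example.com address are illustrative.

```python
def page_urls(base_url, stop, step=10):
    """Yield the URL of each follow-up results page, mirroring range(step, stop, step)."""
    for start in range(step, stop, step):
        yield "{}&start={}".format(base_url, start)


urls = list(page_urls("https://example.com/jobs?q=data+analyst", 230))
print(len(urls))   # number of follow-up pages requested
print(urls[0])
print(urls[-1])
```

With stop=230 this produces 22 follow-up pages, from start=10 through start=220, in addition to the first results page.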

In [18]:
len(job_title)

348

In [19]:
len(job_location)

348

In [20]:
len(company_name)

348

In [21]:
indeed_data_analyst = pd.DataFrame(list(zip(company_name, job_title,job_location)), 
               columns =['Company Name', 'Job Title', 'Job Location'])

In [22]:
indeed_data_analyst.head()

   Company Name      Job Title                             Job Location
0  Giving Assistant  Data Analyst                          San Francisco, CA 94103 (Mission area)
1  methinks          Data Analyst                          Mountain View, CA
2  Change Research   Survey Data Analyst                   San Francisco, CA
3  Change Research   Technical Data Analyst Fellow         San Francisco, CA
4  Acumen LLC        Data and Policy Analyst IV (4955663)  Burlingame, CA 94010


In [23]:
indeed_data_analyst['Company Name'].value_counts()

US Department of the Navy                                36
US Department of the Army                                18
AppleOne                                                 11
Merakey                                                  10
East Bay Municipal Utility District                       9
Giving Assistant                                          9
Universal Consulting Services, Inc.                       9
Broadleaf-inc                                             9
All Stem Connection                                       9
Cathay Bank                                               8
Athens Services                                           8
University Enterprises, Inc.                              7
Ascent Services Group                                     7
Intuit                                                    6
American Indian Health & Services                         6
University of California San Francisco                    6
Positive Behavior Supports Corporation (

In [24]:
indeed_data_analyst.groupby('Job Location').count()

Job Location                                Company Name  Job Title
Agoura Hills, CA 91301 (Whizin's Row area)  2             2
Alameda County, CA                          1             1
Anaheim, CA                                 1             1
Burlingame, CA 94010                        1             1
Calabasas, CA                               2             2
California                                  14            14
Camarillo, CA 93010                         6             6
Carlsbad, CA                                1             1
Carlsbad, CA 92008 (North Beach area)       1             1
Chico, CA 95973                             2             2

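The grouped output above counts "Carlsbad, CA" and "Carlsbad, CA 92008 (North Beach area)" as separate locations, because some postings include a ZIP code and neighborhood. As one possible cleaning step, the sketch below trims each location to its "City, ST" prefix before counting. The helper name city_state and the regular expression are assumptions for this example, and the sample values are taken from the output above.

```python
import re
from collections import Counter

# Shortest prefix that ends in a comma followed by a two-letter state code
_CITY_STATE = re.compile(r"^(.+?,\s*[A-Z]{2})\b")


def city_state(location):
    """Trim ZIP code and neighborhood, keeping 'City, ST' when it is present."""
    match = _CITY_STATE.match(location)
    # Values without a 'City, ST' prefix (such as 'California') are kept unchanged
    return match.group(1) if match else location


sample = [
    "Carlsbad, CA",
    "Carlsbad, CA 92008 (North Beach area)",
    "Burlingame, CA 94010",
    "California",
]
counts = Counter(city_state(loc) for loc in sample)
print(counts)
```

In the notebook, the same function could be applied with indeed_data_analyst['Job Location'].map(city_state) before grouping.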

TASK 2: SEARCHING JOB POSTINGS FOR BUSINESS ANALYST - ENTRY LEVEL - FULL TIME IN LAST 14 DAYS FROM INDEED.COM

In [25]:
job_title2=[]
company_name2=[]
job_location2=[]

base_url='https://www.indeed.com/jobs?q=business+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14'
res=requests.get(base_url)
soup=bs4.BeautifulSoup(res.text,"lxml")

#Gathering home page data
titles=soup.select('.jobtitle')
locations=soup.select('.location')
companies=soup.select('.sjcl .company')

for i in range(len(titles)):
    job_title2.append(titles[i]['title'])
    job_location2.append(locations[i].text)
    company_name2.append(companies[i].text.replace('\n',''))

#Gathering data from other pages
for start in range(10,50,10):
    scrape_url=base_url+'&start={}'.format(start)
    res=requests.get(scrape_url)
    soup=bs4.BeautifulSoup(res.text,"lxml")

    titles=soup.select('.jobtitle')
    locations=soup.select('.location')
    companies=soup.select('.sjcl .company')

    #Inner index renamed so it no longer shadows the page offset
    for j in range(len(titles)):
        job_title2.append(titles[j]['title'])
        job_location2.append(locations[j].text)
        company_name2.append(companies[j].text.replace('\n',''))

In [26]:
len(job_title2)

56

In [27]:
len(company_name2)

56

In [28]:
indeed_business_analyst = pd.DataFrame(list(zip(company_name2, job_title2,job_location2)), 
               columns =['Company Name', 'Job Title', 'Job Location'])

In [29]:
indeed_business_analyst.head()

   Company Name                      Job Title                                     Job Location
0  Spotline Inc                      Business Analyst (W2/Fulltime) PHARMA DOMAIN  Palo Alto, CA
1  Arbor Financial Systems           Graduate Business Analyst                     San Francisco Bay Area, CA
2  Intuit                            Staff Business Data Analysis                  San Francisco, CA 94102 (Downtown area)
3  Pacific Gas And Electric Company  Business Analyst                              San Francisco, CA 94105 (South Beach area)
4  Applicantz                        Privacy Operations Business Analyst           San Rafael, CA


TASK 3: SEARCHING JOB POSTINGS FOR PRODUCT ANALYST - ENTRY LEVEL - FULL TIME IN LAST 14 DAYS FROM INDEED.COM

In [30]:
job_title3=[]
company_name3=[]
job_location3=[]

base_url='https://www.indeed.com/jobs?q=product+analyst&l=California&radius=100&jt=fulltime&explvl=entry_level&fromage=14'
res=requests.get(base_url)
soup=bs4.BeautifulSoup(res.text,"lxml")

#Gathering home page data
titles=soup.select('.jobtitle')
locations=soup.select('.location')
companies=soup.select('.sjcl .company')

for i in range(len(titles)):
    job_title3.append(titles[i]['title'])
    job_location3.append(locations[i].text)
    company_name3.append(companies[i].text.replace('\n',''))

#Gathering data from other pages
for start in range(10,200,10):
    scrape_url=base_url+'&start={}'.format(start)
    res=requests.get(scrape_url)
    soup=bs4.BeautifulSoup(res.text,"lxml")

    titles=soup.select('.jobtitle')
    locations=soup.select('.location')
    companies=soup.select('.sjcl .company')

    #Inner index renamed so it no longer shadows the page offset
    for j in range(len(titles)):
        job_title3.append(titles[j]['title'])
        job_location3.append(locations[j].text)
        company_name3.append(companies[j].text.replace('\n',''))
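
The three tasks repeat the same steps with a different query and page range. One way to avoid the repetition, sketched here under assumptions, is a single function that walks the pages and delegates the per-page parsing to a callable. The names collect_postings and parse_page are hypothetical. In the notebook, parse_page would wrap the requests and BeautifulSoup calls used above, while the stand-in below only lets the sketch run without network access.

```python
def collect_postings(base_url, stop, parse_page, step=10):
    """Collect postings from the first results page and each follow-up page.

    parse_page is a hypothetical callable that takes a URL and returns a list
    of (company, title, location) tuples for that page.
    """
    postings = list(parse_page(base_url))
    for start in range(step, stop, step):
        postings.extend(parse_page("{}&start={}".format(base_url, start)))
    return postings


def fake_parse_page(url):
    """Stand-in parser: returns one made-up posting that records the URL."""
    return [("Example Co", "Analyst", url)]


rows = collect_postings("https://example.com/jobs?q=product+analyst", 50, fake_parse_page)
print(len(rows))
```

Because each posting is kept as one tuple, the company, title, and location cannot drift out of alignment, and the result can be passed directly to pd.DataFrame(rows, columns=['Company Name', 'Job Title', 'Job Location']).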

In [31]:
indeed_product_analyst = pd.DataFrame(list(zip(company_name3, job_title3,job_location3)), 
               columns =['Company Name', 'Job Title', 'Job Location'])

In [32]:
indeed_product_analyst.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 302 entries, 0 to 301
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Company Name  302 non-null    object
 1   Job Title     302 non-null    object
 2   Job Location  302 non-null    object
dtypes: object(3)
memory usage: 7.2+ KB


TASK 4: CONVERTING INTO CSV FILES

In [34]:
indeed_data_analyst.to_csv('indeed_data_analyst.csv',index=False)
indeed_business_analyst.to_csv('indeed_business_analyst.csv',index=False)
indeed_product_analyst.to_csv('indeed_product_analyst.csv',index=False)