## Demo: Web scraping with BeautifulSoup

This demo shows how to use BeautifulSoup to crawl job listings on Indeed.

In [None]:
## Import the necessary packages
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urlopen is used by every cell that fetches a page
import re
import pandas as pd

### 1. Reach the link of jobs first

Use the Indeed mobile site (https://www.indeed.com/m/), since its HTML is simpler.

In [3]:
from urllib.request import urlopen
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')

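# Each job link on the results page is an <a> tag with rel="nofollow"
all_matches = soup.find_all('a', attrs={'rel': ['nofollow']})
for i in all_matches:
    print(i['href'])                                # relative link to the job page
    print(type(i['href']))
    print("https://www.indeed.com/m/" + i['href'])  # absolute link to the job page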

viewjob?jk=8d11ab80a497e4bd
<class 'str'>
https://www.indeed.com/m/viewjob?jk=8d11ab80a497e4bd
viewjob?jk=c5d25ae33dedf097
<class 'str'>
https://www.indeed.com/m/viewjob?jk=c5d25ae33dedf097
viewjob?jk=ed0c4721f9b456bd
<class 'str'>
https://www.indeed.com/m/viewjob?jk=ed0c4721f9b456bd
viewjob?jk=110b9c2db276db58
<class 'str'>
https://www.indeed.com/m/viewjob?jk=110b9c2db276db58
viewjob?jk=193547a91e3c7357
<class 'str'>
https://www.indeed.com/m/viewjob?jk=193547a91e3c7357
viewjob?jk=0610bd006c630479
<class 'str'>
https://www.indeed.com/m/viewjob?jk=0610bd006c630479
viewjob?jk=ee6d5ddff98335ac
<class 'str'>
https://www.indeed.com/m/viewjob?jk=ee6d5ddff98335ac
viewjob?jk=260ceedfb7f96549
<class 'str'>
https://www.indeed.com/m/viewjob?jk=260ceedfb7f96549
viewjob?jk=e8fe83ac09c606ce
<class 'str'>
https://www.indeed.com/m/viewjob?jk=e8fe83ac09c606ce
viewjob?jk=c61846f8b8f7017f
<class 'str'>
https://www.indeed.com/m/viewjob?jk=c61846f8b8f7017f
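
The string concatenation above works because every `href` in the output is relative to `/m/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves a relative link against a base URL and leaves an already absolute link untouched. `BASE_URL` and `absolute_job_url` are hypothetical names introduced here for illustration.

```python
from urllib.parse import urljoin

# Assumed base for the mobile site; the trailing slash matters to urljoin.
BASE_URL = "https://www.indeed.com/m/"

def absolute_job_url(href):
    """Resolve a relative href such as 'viewjob?jk=...' against BASE_URL."""
    return urljoin(BASE_URL, href)

print(absolute_job_url("viewjob?jk=8d11ab80a497e4bd"))
```

For the first link in the output above, this prints the same absolute URL as the concatenation does.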


### 2. Find the title, company, location and detailed job description for each job

Let's first see a brief example:

In [4]:
test_html= \
'''
<html>
	<body>
		<p>
			<b>
				<font size="+1">Analyst - Data Science</font>
			</b>
			<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
		</p>
	</body>
</html>
'''


In [5]:
bs = BeautifulSoup(test_html,'lxml')

In [6]:
print(bs.body.p.b.font.text)

Analyst - Data Science


In [7]:
print(bs.body.p.text)



Analyst - Data Science

The Boston Consulting Group - Los Angeles, CA



In [11]:
print(bs.body.p.span.text)

Los Angeles, CA
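
The dotted navigation above (`bs.body.p.span`) depends on the exact nesting of the tags. A minimal sketch of an alternative, assuming the location is always marked with `class="location"` as in the sample: look tags up by name and attribute with `find`, and guard against a missing tag. The variable names are illustrative, and `html.parser` is used only so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Same structure as the sample HTML above, repeated so the sketch is self-contained.
sample_html = """
<html><body><p>
<b><font size="+1">Analyst - Data Science</font></b>
<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
</p></body></html>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# find() returns None when nothing matches, so check before reading the text.
title_tag = sample.find("font")
location_tag = sample.find("span", class_="location")
title_text = title_tag.get_text(strip=True) if title_tag else None
location_text = location_tag.get_text(strip=True) if location_tag else None

print(title_text)
print(location_text)
```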


#### Find title, company, location and job description for one position

In [12]:
title = []
company = []
location = []
jd = []
for each in all_matches:
    jd_url = 'https://www.indeed.com/m/' + each['href']
    jd_page = urlopen(jd_url)
    jd_soup = BeautifulSoup(jd_page, 'lxml')
    # Find the description block, i.e. <div id="desc">...</div>
    jd_desc = jd_soup.find_all('div', attrs={'id': ['desc']})
    # break  # uncomment to stop after fetching only the first job page
    title.append(jd_soup.body.p.b.font.text)
    company.append(jd_desc[0].span.text)
    location.append(jd_soup.body.p.span.text)
    jd.append(jd_desc[0].text)

In [13]:
## Job Description
print(jd_desc[0].text)

Hulu is a premium streaming TV destination that seeks to captivate and connect viewers with the stories they love. We create amazing experiences that celebrate the best of entertainment and technology. We’re looking for great people who are passionate about redefining TV through innovation, unconventional thinking and embracing fun. It’s a mission that takes some serious smarts, intense curiosity and determination to be the best. Come be part of the team that’s powering play. SUMMARY
At Hulu, number one priority is our customers. We make business decisions around our customers’ preferences. Data Sciences Team at Hulu combines deep data analysis and research of our rich user data to present a compelling vision around user retention and preferences across the vast ecosystem of Hulu’s product offerings and content. We are looking for data scientists who are passionate about using data to drive strategy and product recommendations. You will be engaged with senior leaders to design well-con

In [14]:
## Job Title 
print(jd_soup.body.p.b.font.text)

Data Scientist


In [15]:
## Company Name
print(jd_desc[0].span.text)
print(jd_soup.body.p.span.previous_sibling.split('-')[0][1:])

Hulu
Hulu 
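
The `split('-')[0][1:]` expression above leaves a trailing space (visible in the second `Hulu ` output) and would cut a hyphenated company name short. A minimal sketch of a more forgiving cleanup, assuming the text before the location `<span>` always has the form `<company> - `. `company_from_prefix` is a hypothetical helper, not part of BeautifulSoup.

```python
def company_from_prefix(prefix):
    """Extract the company name from text shaped like '<company> - '."""
    text = prefix.strip()
    # Drop only the final separator so hyphens inside the name survive.
    if text.endswith("-"):
        text = text[:-1]
    return text.strip()

print(company_from_prefix("\nHulu - "))
```

In the notebook it would be called as `company_from_prefix(jd_soup.body.p.span.previous_sibling)`.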


In [16]:
title

[]

#### Save the data into a DataFrame

In [None]:
job = {'title': title,
       'company': company,
       'location': location,
       'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [None]:
df
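
Building the frame from four parallel lists requires them to have equal lengths; if one `append` fails partway through the loop, pandas raises a `ValueError` when the frame is built. A hedged alternative is to collect one dictionary per job and build the frame from those records. The rows below are made-up sample values standing in for scraped results, and `records` and `jobs_df` are illustrative names.

```python
import pandas as pd

# Hypothetical sample rows; in the notebook each dict would be filled inside the loop.
records = [
    {"title": "Data Scientist", "company": "Hulu", "location": "Los Angeles, CA"},
    {"title": "Analyst - Data Science", "company": "The Boston Consulting Group",
     "location": "Los Angeles, CA"},
]

# One row per dictionary; the column names come from the dictionary keys.
jobs_df = pd.DataFrame.from_records(records)
print(jobs_df.shape)
```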

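If we don't break out of the loop above, it crawls the information for every job on the page. With the `break` line active (uncommented), the loop stops before anything is appended, which is why `title` is shown as an empty list above.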

### 3. Change pages automatically

In [None]:
title = []
company = []
location = []
jd = []
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
for i in range(2):  # crawl the first two result pages
    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    all_matches = soup.find_all(attrs={'rel': ['nofollow']})
    for each in all_matches:
        jd_url = 'https://www.indeed.com/m/' + each['href']
        jd_page = urlopen(jd_url)
        jd_soup = BeautifulSoup(jd_page, 'lxml')
        jd_desc = jd_soup.find_all(attrs={'id': ['desc']})
        title.append(jd_soup.body.p.b.font.text)
        company.append(jd_desc[0].span.text)
        location.append(jd_soup.body.p.span.text)
        jd.append(jd_desc[0].text)

    # Move on to the next page; stop if there is no rel="next" link
    url_all = soup.find_all(attrs={'rel': ['next']})
    if not url_all:
        break
    url = 'https://www.indeed.com/m/' + str(url_all[0]['href'])


In [None]:
job = {'title': title,
       'company': company,
       'location': location,
       'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [None]:
df
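
The paging logic can also be separated from the fetching, which makes it possible to try it on canned HTML before hitting the live site. The following is only a sketch under stated assumptions: `crawl_pages`, its `fetch` parameter (any callable that takes a URL and returns HTML text) and the `delay_seconds` pause are hypothetical names, and the next-page link is assumed to be an `<a rel="next">` tag with a relative `href`, as in the loop above.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def crawl_pages(start_url, fetch, max_pages=2, delay_seconds=1.0):
    """Follow rel="next" links and return the list of page URLs visited."""
    visited = []
    url = start_url
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch(url), "html.parser")
        visited.append(url)
        next_links = soup.find_all("a", attrs={"rel": ["next"]})
        if not next_links or not next_links[0].get("href"):
            break  # no further page to follow
        # Resolve the relative href against the page it was found on.
        url = urljoin(url, next_links[0]["href"])
        time.sleep(delay_seconds)  # pause between requests
    return visited

# Canned pages standing in for the live site (made-up URLs and markup).
fake_pages = {
    "https://example.com/m/jobs": '<a rel="next" href="jobs?start=10">Next</a>',
    "https://example.com/m/jobs?start=10": "<p>No more results</p>",
}
print(crawl_pages("https://example.com/m/jobs", fake_pages.__getitem__, delay_seconds=0))
```

Against the live site, `fetch` could be something like `lambda u: urlopen(u).read()`, with the per-job scraping added inside the loop.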