## Demo: Web scraping with BeautifulSoup

This demo shows how to use BeautifulSoup to scrape job listings from Indeed.

In [1]:
## Import the necessary packages
from bs4 import BeautifulSoup
import urllib
import re
import pandas as pd

### 1. Reach the link of jobs first

We use Indeed's mobile web version (https://www.indeed.com/m/) because its HTML is simpler.

In [2]:
from urllib.request import urlopen
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')

all_matches = soup.find_all('a', attrs={'rel':['nofollow']})
for i in all_matches:
    print(i['href'])                                # relative link to the job page
    print(type(i['href']))
    print("https://www.indeed.com/m/" + i['href'])  # absolute link

viewjob?jk=8d11ab80a497e4bd
<class 'str'>
https://www.indeed.com/m/viewjob?jk=8d11ab80a497e4bd
viewjob?jk=c5d25ae33dedf097
<class 'str'>
https://www.indeed.com/m/viewjob?jk=c5d25ae33dedf097
viewjob?jk=ed0c4721f9b456bd
<class 'str'>
https://www.indeed.com/m/viewjob?jk=ed0c4721f9b456bd
viewjob?jk=110b9c2db276db58
<class 'str'>
https://www.indeed.com/m/viewjob?jk=110b9c2db276db58
viewjob?jk=193547a91e3c7357
<class 'str'>
https://www.indeed.com/m/viewjob?jk=193547a91e3c7357
viewjob?jk=0610bd006c630479
<class 'str'>
https://www.indeed.com/m/viewjob?jk=0610bd006c630479
viewjob?jk=ee6d5ddff98335ac
<class 'str'>
https://www.indeed.com/m/viewjob?jk=ee6d5ddff98335ac
viewjob?jk=260ceedfb7f96549
<class 'str'>
https://www.indeed.com/m/viewjob?jk=260ceedfb7f96549
viewjob?jk=e8fe83ac09c606ce
<class 'str'>
https://www.indeed.com/m/viewjob?jk=e8fe83ac09c606ce
viewjob?jk=c61846f8b8f7017f
<class 'str'>
https://www.indeed.com/m/viewjob?jk=c61846f8b8f7017f
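
The loop above builds each job URL by string concatenation. As a minimal sketch of an alternative (not part of the original notebook), the standard library's `urllib.parse.urljoin` resolves a relative `href` against a base URL, and `parse_qs` can pull out the `jk` job key. The names `BASE_URL`, `absolute_job_url` and `job_key` are illustrative.

```python
from urllib.parse import urljoin, urlparse, parse_qs

# Base of the mobile site used in this demo; the trailing slash matters for urljoin.
BASE_URL = "https://www.indeed.com/m/"

def absolute_job_url(href, base=BASE_URL):
    """Resolve a relative href such as 'viewjob?jk=...' against the base URL."""
    return urljoin(base, href)

def job_key(url):
    """Return the value of the 'jk' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("jk")
    return values[0] if values else None

href = "viewjob?jk=8d11ab80a497e4bd"  # first href printed above
full_url = absolute_job_url(href)
print(full_url)
print(job_key(full_url))
```

Keeping the job key around can help with de-duplicating listings that show up on more than one results page.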


### 2. Find the title, company, location and detailed job description for each job

Let's first see a brief example:

In [3]:
test_html= \
'''
<html>
	<body>
		<p>
			<b>
				<font size="+1">Analyst - Data Science</font>
			</b>
			<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
		</p>
	</body>
</html>
'''


In [4]:
bs = BeautifulSoup(test_html,'lxml')

In [5]:
print(bs.body.p.b.font.text)

Analyst - Data Science


In [6]:
print(bs.body.p.text)



Analyst - Data Science

The Boston Consulting Group - Los Angeles, CA



In [7]:
print(bs.body.p.span.text)

Los Angeles, CA
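
Chained access such as `bs.body.p.b.font` raises an `AttributeError` as soon as one tag in the chain is missing. The sketch below shows one possible defensive variant using `find()` and explicit `None` checks. `parse_header` and `SAMPLE_HTML` are hypothetical names, and the built-in `html.parser` is used here only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup, NavigableString

# Same structure as the test_html example above, condensed.
SAMPLE_HTML = """
<html><body><p>
<b><font size="+1">Analyst - Data Science</font></b>
<br>The Boston Consulting Group - <span class="location">Los Angeles, CA</span>
</p></body></html>
"""

def parse_header(html):
    """Extract title, company and location; missing pieces come back as None."""
    soup = BeautifulSoup(html, "html.parser")
    font = soup.find("font")
    span = soup.find("span", class_="location")

    company = None
    if span is not None and isinstance(span.previous_sibling, NavigableString):
        # The company name is the bare text right before the location span ("Company - ").
        company = span.previous_sibling.rsplit("-", 1)[0].strip() or None

    return {
        "title": font.get_text(strip=True) if font is not None else None,
        "company": company,
        "location": span.get_text(strip=True) if span is not None else None,
    }

header = parse_header(SAMPLE_HTML)
empty = parse_header("<html><body></body></html>")
print(header)
print(empty)
```

Returning `None` for a missing field lets a scraping loop continue past a page with an unexpected layout instead of stopping with an exception.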


#### Find title, company, location and job description for one position

In [8]:
title = []
company = []
location = []
jd = []
for each in all_matches:
    jd_url = 'https://www.indeed.com/m/' + each['href']
    jd_page = urlopen(jd_url)
    jd_soup = BeautifulSoup(jd_page, 'lxml')
    jd_desc = jd_soup.find_all('div', attrs={'id':['desc']}) ## find the structure like: <div id="desc">...</div>
#    break
    title.append(jd_soup.body.p.b.font.text)
    company.append(jd_desc[0].span.text)
    location.append(jd_soup.body.p.span.text)
    jd.append(jd_desc[0].text)
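
The loop above requests one page per job in quick succession. As a hedged sketch of a more courteous fetch helper, the code below sends an explicit `User-Agent`, sets a timeout, and pauses before each request. Whether Indeed requires any of this is not stated in this notebook; `USER_AGENT`, `build_request` and `fetch` are hypothetical names, and the delay and timeout values are arbitrary.

```python
import time
from urllib.request import Request, urlopen

# Hypothetical identifier for this demo; replace it with one that describes your client.
USER_AGENT = "bs4-demo-scraper/0.1"

def build_request(url, user_agent=USER_AGENT):
    """Create a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch(url, delay=1.0, timeout=10):
    """Wait `delay` seconds, then return the response body as bytes."""
    time.sleep(delay)
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://www.indeed.com/m/viewjob?jk=8d11ab80a497e4bd")
print(req.full_url)
print(req.get_header("User-agent"))  # urllib stores header names capitalized this way
```

In the loop, `BeautifulSoup(fetch(jd_url), 'lxml')` could then stand in for the direct `urlopen` call.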

In [9]:
## Job Description
print(jd_desc[0].text)

Research Scientists work on and contribute to research projects relevant to Disney Research and the company. Our researchers have the possibility to publish results in top venues such as NIPS, ICML, CVPR, or EMNLP. Areas of interest include, but are not limited to:
Information extraction from text, knowledge base construction and completion, embedding methods
Dialogue systems
Hybrid approaches between statistical and symbolic AI
Probabilistic modeling and approximate inference
Deep learning, representation learning
Recommender systems
Machine learning for robotics
Activity and action recognition in videos, metadata extraction

Duties of a Research Scientist include:
Conduct complex, advanced research projects in areas of interest to Disney Research and our partnering Business Units
Develop new and advanced cutting-edge techniques and algorithms
Transfer and implement results and technology in hard- and software prototypes and demo systems relevant to the Disney businesses
Survey releva

In [10]:
## Job Title 
print(jd_soup.body.p.b.font.text)

Research Scientist, Machine Learning / Artificial Intelligence


In [11]:
## Company Name
print(jd_desc[0].span.text)
print(jd_soup.body.p.span.previous_sibling.split('-')[0][1:])

Disney
Disney Parks & Resorts 


In [12]:
title

['Data Scientist',
 'Data Analyst',
 'Intern – Data Scientist, Data Analytics & Insights',
 'Data Scientist',
 'Data Analyst - Devices Customer Engagement',
 'Resource Intern LA - Summer 2018',
 'Data Scientist',
 'Data Scientist',
 'Manager of Data Science & Analytics',
 'Research Scientist, Machine Learning / Artificial Intelligence']

#### Save the data into Data Frame

In [13]:
job = {'title': title,
         'company': company,
         'location': location,
         'Job Description': jd}
df = pd.DataFrame.from_dict(job)
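
As a small sketch of what could follow this step, the code below fixes the column order explicitly and round-trips the frame through CSV. It uses two rows copied from the output in this notebook and an in-memory buffer instead of a file. An `Unnamed: 0` header typically appears when a CSV that was written with its index is read back, and `index=False` avoids that.

```python
import io
import pandas as pd

# Two rows copied from the notebook output; the descriptions are shown truncated there.
job = {
    "title": ["Data Scientist", "Data Analyst"],
    "company": ["Hulu", "Amazon.com"],
    "location": ["Santa Monica, CA", "Manhattan Beach, CA"],
    "Job Description": [
        "Hulu is a premium streaming TV destination tha...",
        "Are you passionate about leveraging data to de...",
    ],
}

# Passing `columns` fixes the column order explicitly.
columns = ["title", "company", "location", "Job Description"]
df = pd.DataFrame(job, columns=columns)

# index=False keeps the row index out of the CSV, so no extra column appears on reload.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
restored = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(restored.columns))
```

To write to disk instead, pass a path such as the hypothetical `"jobs.csv"` to `to_csv` in place of the buffer.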

In [14]:
df

| | Job Description | company | location | title |
|---|---|---|---|---|
| 0 | Hulu is a premium streaming TV destination tha... | Hulu | Santa Monica, CA | Data Scientist |
| 1 | Are you passionate about leveraging data to de... | Amazon.com | Manhattan Beach, CA | Data Analyst |
| 2 | \nIntegrate multiple data sources and define k... | NBCUniversal | Universal City, CA | Intern – Data Scientist, Data Analytics & Insi... |
| 3 | -\n\n328267\n\nDiscover it Here.\n\nAt Nordstr... | Nordstrom | Los Angeles, CA 90045 | Data Scientist |
| 4 | Consider this challenge: every day, tens of mi... | Amazon.com | Santa Monica, CA | Data Analyst - Devices Customer Engagement |
| 5 | Reporting to the Sr. Resource Manager, this po... | BuzzFeed | Los Angeles, CA 90036 | Resource Intern LA - Summer 2018 |
| 6 | In addition to the responsibilities listed bel... | Kaiser Permanente | Pasadena, CA | Data Scientist |
| 7 | Are you interested in working for the music in... | Universal Music Group | Santa Monica, CA | Data Scientist |
| 8 | -\n\n329525\n\nDiscover it Here.\n\nAt Nordstr... | Nordstrom | Los Angeles, CA 90045 | Manager of Data Science & Analytics |
| 9 | Research Scientists work on and contribute to ... | Disney | Glendale, CA | Research Scientist, Machine Learning / Artific... |


With the `break` in the loop above commented out, the loop crawls the job information for every listing on the results page; uncomment it to stop after a single job.

### 3. Change pages automatically

In [15]:
title = []
company = []
location = []
jd = []
url = "https://www.indeed.com/m/jobs?q=data+scientist&l=Los+Angeles%2C+CA"
for i in range(2): # scrape the first 2 result pages

    page = urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    all_matches = soup.find_all(attrs={'rel':['nofollow']})
    for each in all_matches:
        jd_url = 'https://www.indeed.com/m/' + each['href']
        jd_page = urlopen(jd_url)
        jd_soup = BeautifulSoup(jd_page, 'lxml')
        jd_desc = jd_soup.find_all(attrs={'id':['desc']})
        title.append(jd_soup.body.p.b.font.text)
        company.append(jd_desc[0].span.text)
        location.append(jd_soup.body.p.span.text)
        jd.append(jd_desc[0].text)

    ## Move on to the next page; stop if there is no "next" link
    url_all = soup.find_all(attrs={'rel':['next']})
    if not url_all:
        break
    url = 'https://www.indeed.com/m/' + str(url_all[0]['href'])
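
A results page without a `rel="next"` link yields an empty list from the lookup above. One way to make that case explicit is a small helper that returns `None` on the last page. This is a sketch under assumptions: `next_page_url` is a hypothetical name, the sample markup is invented for illustration (the real `href` format is not shown in this notebook), and `html.parser` is used only to avoid depending on `lxml`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.indeed.com/m/"

def next_page_url(soup, base=BASE_URL):
    """Return the absolute URL of the rel="next" link, or None if there is none."""
    link = soup.find(attrs={"rel": "next"})
    if link is None or not link.get("href"):
        return None
    return urljoin(base, link["href"])

# Hypothetical markup standing in for a results page with and without a next link.
with_next = BeautifulSoup('<a rel="next" href="jobs?start=10">Next</a>', "html.parser")
without_next = BeautifulSoup('<a rel="nofollow" href="viewjob?jk=abc">Job</a>', "html.parser")

print(next_page_url(with_next))
print(next_page_url(without_next))
```

A paging loop can then run `while url is not None:` and assign `url = next_page_url(soup)` at the end of each iteration, rather than relying on a fixed page count.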


In [16]:
job = {'title': title,
         'company': company,
         'location': location,
         'Job Description': jd}
df = pd.DataFrame.from_dict(job)

In [17]:
df

| | Job Description | company | location | title |
|---|---|---|---|---|
| 0 | Hulu is a premium streaming TV destination tha... | Hulu | Santa Monica, CA | Data Scientist |
| 1 | Are you passionate about leveraging data to de... | Amazon.com | Manhattan Beach, CA | Data Analyst |
| 2 | \nIntegrate multiple data sources and define k... | NBCUniversal | Universal City, CA | Intern – Data Scientist, Data Analytics & Insi... |
| 3 | -\n\n328267\n\nDiscover it Here.\n\nAt Nordstr... | Nordstrom | Los Angeles, CA 90045 | Data Scientist |
| 4 | Consider this challenge: every day, tens of mi... | Amazon.com | Santa Monica, CA | Data Analyst - Devices Customer Engagement |
| 5 | Reporting to the Sr. Resource Manager, this po... | BuzzFeed | Los Angeles, CA 90036 | Resource Intern LA - Summer 2018 |
| 6 | In addition to the responsibilities listed bel... | Kaiser Permanente | Pasadena, CA | Data Scientist |
| 7 | Are you interested in working for the music in... | Universal Music Group | Santa Monica, CA | Data Scientist |
| 8 | -\n\n329525\n\nDiscover it Here.\n\nAt Nordstr... | Nordstrom | Los Angeles, CA 90045 | Manager of Data Science & Analytics |
| 9 | Research Scientists work on and contribute to ... | Disney | Glendale, CA | Research Scientist, Machine Learning / Artific... |
