### In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

In this notebook, we'll see how we can retrieve the contents of a website and then parse the resulting HTML to extract the data we want.

For this, we'll again be using the requests library.

In [1]:
import requests

# Question 1.
Start by performing a GET request on the url above and convert the response into a BeautifulSoup object. 
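
One way to wrap this step is sketched below. The helper names `make_soup` and `fetch_soup`, the 10-second timeout, and the choice of Python's built-in `html.parser` are assumptions made for illustration; they are not part of the exercise.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html):
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, 'html.parser')


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    return make_soup(response.text)


# Offline check on a tiny hand-written document (not the real page).
demo = make_soup('<html><body><h1>Fake Python</h1></body></html>')
print(demo.h1.text)
```

Calling `fetch_soup('https://realpython.github.io/fake-jobs/')` would then perform the GET request and the parsing in one step.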

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

In [3]:
response.status_code

200

In [4]:
#response.text

# Question 1a.
Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title. 
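
As a rough illustration of the hint, `find` accepts a tag name together with a `class_` filter. The HTML below is a hand-written stand-in for one job card; only its tags and classes are modeled on the page, and the `sample_` names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one job card, not the real page source.
sample_html = """
<div>
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match on the tag name and on one of its CSS classes.
first_title_tag = sample_soup.find('h2', class_='title')

# get_text(strip=True) removes leading and trailing whitespace.
first_title = first_title_tag.get_text(strip=True)
print(first_title)
```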


In [5]:
from bs4 import BeautifulSoup

In [6]:
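# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')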

In [7]:
#print(soup.prettify())

In [8]:
soup.find('h2')

<h2 class="title is-5">Senior Python Developer</h2>

In [9]:
type(soup.find('h2'))

bs4.element.Tag

In [10]:
soup.find('h2').text.strip()

'Senior Python Developer'

# Question 1b.
Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list. 

In [11]:
# find_all is the current name for the older findAll alias
job_titles = soup.find_all('h2')
job_title = [tag.text.strip() for tag in job_titles]
job_title

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer',
 'Software Engineer (Python)',
 'Interpreter',
 'Architect',
 'Meteorologist',
 'Audiological scientist',
 'English as a second language teacher',
 'Surgeon',
 'Equities trader',
 'Newspaper journalist',
 'Materials engineer',
 'Python Programmer (Entry-Level)',
 'Product/process development scientist',
 'Scientist, research (maths)',
 'Ecologist',
 'Materials engineer',
 'Historic buildings inspector/conservation officer',
 'Data scientist',
 'Psychiatrist',
 'Structural engineer',
 'Immigration officer',
 'Python Programmer (Entry-Level)',
 'Neurosurgeon',
 'Broadcast engineer',
 'Make',
 'Nurse, adult',
 'Air broker',
 'Editor, film/video',
 'Production assistant, radio',
 'Engineer, communications',
 'Sales executive',
 'Software Deve

# Question 1c.
Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  
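
A possible alternative to searching by bare tag name is to use CSS selectors, which tie each query to the class that identifies the field. The sketch below runs on hand-written HTML modeled on the tags shown in this notebook; the `clean_texts` helper and the `sample_` names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two job cards, not the real page source.
sample_html = """
<div>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')


def clean_texts(tags):
    """Return the whitespace-stripped text of each tag."""
    return [tag.get_text(strip=True) for tag in tags]


sample_companies = clean_texts(sample_soup.select('h3.company'))
sample_locations = clean_texts(sample_soup.select('p.location'))
# The datetime attribute carries the date without any surrounding whitespace.
sample_dates = [tag['datetime'] for tag in sample_soup.select('time')]

print(sample_companies, sample_locations, sample_dates)
```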

In [12]:
soup.find('h3')

<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>

In [13]:
soup.find('h3').text

'Payne, Roberts and Davis'

In [14]:
companies = soup.find_all('h3')
company = [tag.text.strip() for tag in companies]
company

['Payne, Roberts and Davis',
 'Vasquez-Davidson',
 'Jackson, Chambers and Levy',
 'Savage-Bradley',
 'Ramirez Inc',
 'Rogers-Yates',
 'Kramer-Klein',
 'Meyers-Johnson',
 'Hughes-Williams',
 'Jones, Williams and Villa',
 'Garcia PLC',
 'Gregory and Sons',
 'Clark, Garcia and Sosa',
 'Bush PLC',
 'Salazar-Meyers',
 'Parker, Murphy and Brooks',
 'Cruz-Brown',
 'Macdonald-Ferguson',
 'Williams, Peterson and Rojas',
 'Smith and Sons',
 'Moss, Duncan and Allen',
 'Gomez-Carroll',
 'Manning, Welch and Herring',
 'Lee, Gutierrez and Brown',
 'Davis, Serrano and Cook',
 'Smith LLC',
 'Thomas Group',
 'Silva-King',
 'Pierce-Long',
 'Walker-Simpson',
 'Cooper and Sons',
 'Donovan, Gonzalez and Figueroa',
 'Morgan, Butler and Bennett',
 'Snyder-Lee',
 'Harris PLC',
 'Washington PLC',
 'Brown, Price and Campbell',
 'Mcgee PLC',
 'Dixon Inc',
 'Thompson, Sheppard and Ward',
 'Adams-Brewer',
 'Schneider-Brady',
 'Gonzales-Frank',
 'Smith-Wong',
 'Pierce-Herrera',
 'Aguilar, Rivera and Quinn',
 'Lowe,

In [15]:
soup.find('p')

<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>

In [16]:
soup.find('p').text.strip()

'Fake Jobs for Your Web Scraping Journey'

In [17]:
soup.find_all('p', class_='location')

[<p class="location">
         Stewartbury, AA
       </p>,
 <p class="location">
         Christopherville, AA
       </p>,
 <p class="location">
         Port Ericaburgh, AA
       </p>,
 <p class="location">
         East Seanview, AP
       </p>,
 <p class="location">
         North Jamieview, AP
       </p>,
 <p class="location">
         Davidville, AP
       </p>,
 <p class="location">
         South Christopher, AE
       </p>,
 <p class="location">
         Port Jonathan, AE
       </p>,
 <p class="location">
         Osbornetown, AE
       </p>,
 <p class="location">
         Scotttown, AP
       </p>,
 <p class="location">
         Ericberg, AE
       </p>,
 <p class="location">
         Ramireztown, AE
       </p>,
 <p class="location">
         Figueroaview, AA
       </p>,
 <p class="location">
         Kelseystad, AA
       </p>,
 <p class="location">
         Williamsburgh, AE
       </p>,
 <p class="location">
         Mitchellburgh, AE
       </p>,
 <p class="location

In [18]:
locs = soup.find_all('p', class_='location')
location = [tag.text.strip() for tag in locs]
location

['Stewartbury, AA',
 'Christopherville, AA',
 'Port Ericaburgh, AA',
 'East Seanview, AP',
 'North Jamieview, AP',
 'Davidville, AP',
 'South Christopher, AE',
 'Port Jonathan, AE',
 'Osbornetown, AE',
 'Scotttown, AP',
 'Ericberg, AE',
 'Ramireztown, AE',
 'Figueroaview, AA',
 'Kelseystad, AA',
 'Williamsburgh, AE',
 'Mitchellburgh, AE',
 'West Jessicabury, AA',
 'Maloneshire, AE',
 'Johnsonton, AA',
 'South Davidtown, AP',
 'Port Sara, AE',
 'Marktown, AA',
 'Laurenland, AE',
 'Lauraton, AP',
 'South Tammyberg, AP',
 'North Brandonville, AP',
 'Port Robertfurt, AA',
 'Burnettbury, AE',
 'Herbertside, AA',
 'Christopherport, AP',
 'West Victor, AE',
 'Port Aaron, AP',
 'Loribury, AA',
 'Angelastad, AP',
 'Larrytown, AE',
 'West Colin, AP',
 'West Stephanie, AP',
 'Laurentown, AP',
 'Wrightberg, AP',
 'Alberttown, AE',
 'Brockburgh, AE',
 'North Jason, AE',
 'Arnoldhaven, AE',
 'Lake Destiny, AP',
 'South Timothyburgh, AP',
 'New Jimmyton, AE',
 'New Lucasbury, AP',
 'Port Cory, AE',
 

In [19]:
soup.find('time')

<time datetime="2021-04-08">2021-04-08</time>

In [20]:
soup.find('time').text

'2021-04-08'

In [21]:
dat = soup.find_all('time')
date = [tag.text.strip() for tag in dat]
date

['2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-

# Question 1d. 
Take the lists that you have created and combine them into a pandas DataFrame. 
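
A minimal sketch of this step, assuming the four lists are the same length: the two-row sample data below is copied from the outputs above, while the length check, the `sample_` names, and the conversion of `date` to a datetime column are optional additions rather than requirements of the exercise.

```python
import pandas as pd

# Two rows copied from the outputs above, standing in for the full lists.
sample_columns = {
    'job_title': ['Senior Python Developer', 'Energy engineer'],
    'company': ['Payne, Roberts and Davis', 'Vasquez-Davidson'],
    'location': ['Stewartbury, AA', 'Christopherville, AA'],
    'date': ['2021-04-08', '2021-04-08'],
}

# Lists of unequal length would mean a field was missed for some job.
lengths = {name: len(values) for name, values in sample_columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f'Column lengths differ: {lengths}')

# Passing a dict names the columns at construction time.
sample_df = pd.DataFrame(sample_columns)
sample_df['date'] = pd.to_datetime(sample_df['date'])
print(sample_df.dtypes)
```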

In [22]:
import pandas as pd
import io

In [23]:
#table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

#from IPython.core.display import HTML

#HTML(table_html)

In [24]:
df = pd.DataFrame({'col1': job_title, 'col2': company, 'col3' : date, 'col4': location})

In [25]:
#df

In [26]:
df = df.rename(columns={'col1': 'job_titles', 'col2': 'company', 'col3':'date', 'col4':'location'})

# Question 2.
Next, add a column that contains the url for the "Apply" button. Try this in two ways.

a. First, use the BeautifulSoup find_all method to extract the urls.
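
The idea can be sketched on a small hand-written footer that copies the two anchor tags shown in this notebook. Filtering on `string='Apply'` keeps only the links whose text is exactly "Apply"; the `sample_` names and `apply_links` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one card footer, not the real page source.
sample_html = """
<footer>
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .get('href') returns None instead of raising if an anchor has no href.
apply_links = [a.get('href') for a in sample_soup.find_all('a', string='Apply')]
print(apply_links)
```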

In [27]:
soup.find('a')

<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>

In [28]:
# Find all links for anchor tag and specific "Apply" button
#links = soup.find_all('a', string='Apply')
# create a blank list links_list
#links_list = []
#for link in links:
    # Create a list of all the URL's for Apply button
#    links_list.append(link.get('href'))
# print final list of links
#links_list

In [29]:
soup.find('a', string = 'Learn')

<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>

In [30]:
soup.find('a', string = 'Apply')

<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>

In [31]:
#[a['href'] for a in soup.findAll('a', string = 'Learn')]

In [32]:
apply = [a['href'] for a in soup.find_all('a', string='Apply')]
apply

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi

In [33]:
df

|     | job_titles | company | date | location |
|----:|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | 2021-04-08 | Stewartbury, AA |
| 1 | Energy engineer | Vasquez-Davidson | 2021-04-08 | Christopherville, AA |
| 2 | Legal executive | Jackson, Chambers and Levy | 2021-04-08 | Port Ericaburgh, AA |
| 3 | Fitness centre manager | Savage-Bradley | 2021-04-08 | East Seanview, AP |
| 4 | Product manager | Ramirez Inc | 2021-04-08 | North Jamieview, AP |
| ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | 2021-04-08 | Lake Abigail, AE |
| 96 | Radiographer, diagnostic | Holder LLC | 2021-04-08 | Jacobshire, AP |
| 97 | Database administrator | Yates-Ferguson | 2021-04-08 | Port Susan, AE |
| 98 | Furniture designer | Ortega-Lawrence | 2021-04-08 | North Tiffany, AA |


In [34]:
df['url']= apply

# Question 2b
b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
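
One possible way to build and check the urls is sketched below. It rests on an assumption: that every run of characters other than letters and digits in the title becomes a single hyphen. That rule agrees with the urls visible in the output above, but it has not been verified against every job on the page. The names `BASE_URL`, `build_job_url`, and `find_mismatches` are hypothetical.

```python
import re

BASE_URL = 'https://realpython.github.io/fake-jobs/jobs/'


def build_job_url(title, index):
    """Build a job url from its title and its position on the page."""
    slug = title.lower()
    # Assumed rule: any run of non-alphanumeric characters becomes one hyphen.
    slug = re.sub(r'[^a-z0-9]+', '-', slug)
    # Drop hyphens left at the ends, e.g. by a closing parenthesis.
    slug = slug.strip('-')
    return f'{BASE_URL}{slug}-{index}.html'


def find_mismatches(built, scraped):
    """Return the (built, scraped) pairs that differ."""
    return [(b, s) for b, s in zip(built, scraped) if b != s]


# Titles and urls copied from the outputs above (positions 0 and 10).
titles = {0: 'Senior Python Developer', 10: 'Software Engineer (Python)'}
scraped = [
    'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
    'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
]
built = [build_job_url(title, index) for index, title in titles.items()]
print(find_mismatches(built, scraped))
```

An empty list from `find_mismatches` means the constructed urls agree with the scraped ones for the pairs that were compared.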

In [35]:
#Q.2.b. #url is constructed base url = 'https://realpython.github.io/fake-jobs'  and   /jobs/  and job title  and .html 
# if the title has () it is replaced with -  .
#job_titles # job titles from q.1
#base_url = 'https://realpython.github.io/fake-jobs'
#constructed_url = [base_url + '/jobs/' +title.replace(' ','-').replace(',','').replace('(','').replace(')','').lower() + f'-{index}.html' for index,title in enumerate(job_titles)]
#enumerate used to iterate over a sequence ,while keeping track of index of each item it returns both index and item
#constructed_url

In [36]:
#Q2b. translate method
import string
base_url = 'https://realpython.github.io/fake-jobs/jobs/'
url_title = [x.replace('/',' ').translate(str.maketrans('','', string.punctuation)).lower().replace(' ','-') for x in job_title]
url_2 = [f'{base_url}{x}-{i}.html' for i, x in enumerate(url_title)]

In [37]:
#2nd method
#jobs['url2'] = 'https://realpython.github.io/fake-jobs/jobs/' + (
#    jobs['title']
#    .str.lower()
#    .str.replace('[\s/]', '-', regex = True)
#    .str.replace('[(),]', '', regex = True)
#) + '-' + jobs.index.astype(str) + '.html'

# Question 3a

In [38]:
# import the urllib module
import urllib.request

# URL of the first job's detail page
url = "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html"

# open the URL for reading
html = urllib.request.urlopen(url)

# parse the HTML
htmlParse = BeautifulSoup(html, 'html.parser')

# the second <p> on the page holds the job description
description = htmlParse.find_all("p")[1]
print(description.get_text())

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.


In [39]:
#Method 2 
for div in htmlParse.find_all("div", class_="content"):
    para = div.find_all("p")[0]
    print(para.text)

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.


# Question 3b

In [40]:
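def get_descriptions(url):
    """Fetch a job's detail page and return the text of its description."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # The second <p> on the detail page holds the description
    description = soup.find_all('p')[1]
    return description.text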

In [41]:
df['description']=df['url'].apply(get_descriptions)
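
A variation on this step, offered only as a sketch: parsing is split from fetching so the parsing can be checked offline, and one `requests.Session` is reused across all requests. The `div` with class `content` comes from Method 2 above; the function names, the 10-second timeout, and the sample page are assumptions for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the text of the first <p> inside <div class="content">, or None."""
    page = BeautifulSoup(html, 'html.parser')
    content = page.find('div', class_='content')
    paragraph = content.find('p') if content else None
    return paragraph.get_text(strip=True) if paragraph else None


def fetch_descriptions(urls, timeout=10):
    """Fetch each url through one shared session and extract its description."""
    descriptions = []
    with requests.Session() as session:
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()
            descriptions.append(extract_description(response.text))
    return descriptions


# Offline check on a hand-written page (not a real job page).
sample_page = '<div class="content"><p> Professional asset web application. </p><p>Other text</p></div>'
print(extract_description(sample_page))
```

With the DataFrame from this notebook, `df['description'] = fetch_descriptions(df['url'])` would fill the same column.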

In [42]:
df

|     | job_titles | company | date | location | url | description |
|----:|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | 2021-04-08 | Stewartbury, AA | https://realpython.github.io/fake-jobs/jobs/se... | Professional asset web application environment... |
| 1 | Energy engineer | Vasquez-Davidson | 2021-04-08 | Christopherville, AA | https://realpython.github.io/fake-jobs/jobs/en... | Party prevent live. Quickly candidate change a... |
| 2 | Legal executive | Jackson, Chambers and Levy | 2021-04-08 | Port Ericaburgh, AA | https://realpython.github.io/fake-jobs/jobs/le... | Administration even relate head color. Staff b... |
| 3 | Fitness centre manager | Savage-Bradley | 2021-04-08 | East Seanview, AP | https://realpython.github.io/fake-jobs/jobs/fi... | Tv program actually race tonight themselves tr... |
| 4 | Product manager | Ramirez Inc | 2021-04-08 | North Jamieview, AP | https://realpython.github.io/fake-jobs/jobs/pr... | Traditional page a although for study anyone. ... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Museum/gallery exhibitions officer | Nguyen, Yoder and Petty | 2021-04-08 | Lake Abigail, AE | https://realpython.github.io/fake-jobs/jobs/mu... | Paper age physical current note. There reality... |
| 96 | Radiographer, diagnostic | Holder LLC | 2021-04-08 | Jacobshire, AP | https://realpython.github.io/fake-jobs/jobs/ra... | Able such right culture. Wrong pick structure ... |
| 97 | Database administrator | Yates-Ferguson | 2021-04-08 | Port Susan, AE | https://realpython.github.io/fake-jobs/jobs/da... | Create day party decade high clear. Past trade... |
| 98 | Furniture designer | Ortega-Lawrence | 2021-04-08 | North Tiffany, AA | https://realpython.github.io/fake-jobs/jobs/fu... | Pressure under rock next week. Recognize so re... |
