## Websites to scrape
* [Simple page](https://dataquestio.github.io/web-scraping-pages/simple.html)
* [Another simple page](https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html)
* [More complex webpage](https://au.indeed.com/software-developer-jobs-in-Australia?vjk=6fcf7af0c601fbca)
* [Complex webpage](https://www.pacsun.com/mens/)

## [BeautifulSoup documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)

In [None]:
from bs4 import BeautifulSoup # import the package
import requests

# variable that holds our URL
url = "https://www.pacsun.com/mens/"

# request the page and keep the response body as HTML text
html = requests.get(url).text
print(html[:1000])  # preview only; printing a whole page can flood the notebook output

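The request above sets no timeout and does not check the HTTP status, so a slow or blocked page can hang the cell or return an error page silently. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second timeout, and the `User-Agent` value are assumptions for illustration, not part of the original notebook.

```python
import requests

# Hypothetical header value; some sites reject the default python-requests User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; scraping-tutorial)"}


def fetch_html(url, timeout=10):
    """Return the page body as text; raise on timeouts and 4xx/5xx responses."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
    return response.text
```

With this helper, `html = fetch_html(url)` would stand in for the `requests.get(url).text` line.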

In [None]:
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
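To experiment with the parser without downloading anything, the same constructor can be pointed at an inline string. The HTML below is made up for illustration, and `soup_demo` is a hypothetical name chosen so it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="title is-1">Fake Python</h1>
    <p>First paragraph</p>
  </body>
</html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

print(soup_demo.title.text)    # text of the <title> tag
print(soup_demo.body.h1.text)  # first <h1> inside <body>
print(soup_demo.p.text)        # first <p> anywhere in the document
```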

In [None]:
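# soup here must be built from the fake-jobs page ("Another simple page" above), not the PacSun URL
soup.body.section.find('h1', class_ = 'title is-2').text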

'Senior Python Developer'
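A multi-word `class_` value such as `'title is-2'` is compared against the whole `class` attribute string, so the order of the class names matters. The sketch below uses made-up inline HTML to show that behaviour next to a CSS selector, which matches each class independently. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

html_demo = (
    '<h1 class="title is-1">Fake Python</h1>'
    '<h1 class="title is-2">Senior Python Developer</h1>'
)
soup_demo = BeautifulSoup(html_demo, "html.parser")

# The multi-word string matches the exact value of the class attribute.
exact = soup_demo.find("h1", class_="title is-2")

# The same classes in a different order do not match the attribute string.
reversed_order = soup_demo.find("h1", class_="is-2 title")

# A CSS selector checks each class separately, in any order.
by_selector = soup_demo.select_one("h1.is-2.title")

print(exact.text)
print(reversed_order)
print(by_selector.text)
```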

In [None]:
soup.body.find_all('h1')

[<h1 class="title is-1">
         Fake Python
       </h1>, <h1 class="title is-2">Senior Python Developer</h1>]

In [None]:
for h1 in soup.body.find_all('h1'):
  print(h1.text)


        Fake Python
      
Senior Python Developer
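The blank lines and indentation in the output come from whitespace inside the first `<h1>` tag, which `.text` returns unchanged. One way to clean it up is `get_text(strip=True)`, sketched here on inline HTML that imitates the two headings.

```python
from bs4 import BeautifulSoup

# Inline copy of the two headings, including the stray whitespace.
html_demo = """
<h1 class="title is-1">
        Fake Python
      </h1>
<h1 class="title is-2">Senior Python Developer</h1>
"""
soup_demo = BeautifulSoup(html_demo, "html.parser")

# strip=True trims leading and trailing whitespace from each piece of text.
headings = [h1.get_text(strip=True) for h1 in soup_demo.find_all("h1")]
print(headings)
```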


In [None]:
content = soup.body.find('div', class_ = 'content')
content

<div class="content">
<p>Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.</p>
<p id="location"><strong>Location:</strong> Stewartbury, AA</p>
<p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>

In [None]:
for ele in content.find_all('p', id=["location", 'date']):
  print(ele.text)

Location: Stewartbury, AA
Posted: 2021-04-08
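If only the values are needed, without the "Location:" and "Posted:" labels, one option is to read the text node that follows each `<strong>` tag. This is a sketch on an inline copy of the two paragraphs; the `details` dictionary is a hypothetical name.

```python
from bs4 import BeautifulSoup

html_demo = """
<div class="content">
<p id="location"><strong>Location:</strong> Stewartbury, AA</p>
<p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>
"""
soup_demo = BeautifulSoup(html_demo, "html.parser")

details = {}
for p in soup_demo.find_all("p", id=["location", "date"]):
    # The value is the text node right after the <strong> label.
    details[p["id"]] = p.strong.next_sibling.strip()

print(details)
```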


In [None]:
# soup here must be built from the Indeed page ("More complex webpage" above)
content = soup.body.find_all('div', class_ = 'job_seen_beacon')

In [None]:
job_data = []
for job in content:
  # extract title; drop the leading "new" badge text only when it is present
  title = job.h2.text
  if title.startswith('new'):
    title = title[len('new'):]
  
  #company name
  employer = job.find('span', class_ = 'companyName').text
  
  # extract rating (not every listing has one)
  try:
    rating = job.find('span', class_ ="ratingNumber").text
  except AttributeError:
    # find() returned None, so there is no rating for this job
    rating = None

  #location
  location = job.find('div', class_ = "companyLocation").text

  #job description 
  desc = job.find('li').text
  
  data = {
      'title': title,
      'employer': employer,
      'rating': rating,
      'location': location,
      'description': desc
  }
  job_data.append(data)
job_data[0]

{'description': 'Experience working one or more of the following: web or mobile application development, Unix/Linux environments, distributed and parallel systems, machine…',
 'employer': 'Google',
 'location': 'Sydney NSW',
 'rating': '4.3',
 'title': 'Software Engineer, Early Career'}
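Only the rating lookup is guarded in the loop, so a listing that lacks a company name, location, or description would still raise `AttributeError`. A small helper can apply the same guard to every field. This is a sketch under assumptions: `text_or_none` is a hypothetical helper, and the HTML is made up to imitate the class names used in the loop.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used in the loop.
html_demo = """
<div class="job_seen_beacon">
  <h2>Software Engineer</h2>
  <span class="companyName">Acme</span>
  <span class="ratingNumber">4.3</span>
</div>
<div class="job_seen_beacon">
  <h2>Data Analyst</h2>
  <span class="companyName">Globex</span>
</div>
"""


def text_or_none(parent, name, class_):
    """Return the stripped text of the first match, or None when nothing matches."""
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else None


soup_demo = BeautifulSoup(html_demo, "html.parser")
ratings = [
    text_or_none(job, "span", "ratingNumber")
    for job in soup_demo.find_all("div", class_="job_seen_beacon")
]
print(ratings)
```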

In [None]:
import json
print(json.dumps(job_data, indent = 2))
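Besides JSON, a list of flat dictionaries like `job_data` also maps directly onto CSV rows. The sketch below uses the standard-library `csv.DictWriter` with one made-up record and writes to an in-memory buffer; the record contents are assumptions for illustration.

```python
import csv
import io

# One made-up record shaped like the dictionaries built in the loop.
rows = [
    {
        "title": "Software Engineer",
        "employer": "Acme",
        "rating": None,
        "location": "Sydney NSW",
        "description": "Example description",
    },
]

# In-memory buffer; open("jobs.csv", "w", newline="") would write a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)  # None values are written as empty cells

csv_text = buffer.getvalue()
print(csv_text)
```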

In [None]:
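# assumes soup was rebuilt from the PacSun URL set in the first cell
# find_all is used so the result is a list, matching the output shown for this cell
content = soup.body.find_all('div', class_ ="row refinement-wrapper")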

In [None]:
content

[]
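An empty result means no matching `<div>` exists in the HTML that was parsed. A possible cause, which the output alone does not confirm, is that the page builds this section with JavaScript or serves different markup to scripts. The sketch below shows what `find` and `find_all` return when nothing matches, using made-up inline HTML, so the result can be checked before further parsing.

```python
from bs4 import BeautifulSoup

# Made-up HTML that lacks the class combination being searched for.
soup_demo = BeautifulSoup("<div class='row'></div>", "html.parser")

missing_one = soup_demo.find("div", class_="row refinement-wrapper")
missing_all = soup_demo.find_all("div", class_="row refinement-wrapper")

print(missing_one)  # find gives None when nothing matches
print(missing_all)  # find_all gives an empty, list-like ResultSet

if missing_one is None:
    print("No matching <div>; inspect the downloaded HTML before parsing further.")
```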