# **Web Scraping Solo Project**

## 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

In [10]:
import requests
from bs4 import BeautifulSoup as BS

In [11]:
URL = 'https://realpython.github.io/fake-jobs/'

# # Not always needed
# headers = {
#     "User-Agent": "MyPythonScript/1.0 (contact@example.com)"
# }

response = requests.get(URL) # (URL, headers = headers) if headers needed
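The request above can be wrapped so that a failed response is caught early instead of being parsed as if it were the job listing. This is only a sketch: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption, not something the exercise requires.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Return the page body as text (hypothetical helper, not part of the exercise)."""
    # timeout keeps the call from hanging indefinitely on an unresponsive server
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(URL)` would then return the same text as `response.text`, or raise if the status code is not a success.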

In [12]:
type(response)

requests.models.Response

In [13]:
response.status_code

200

In [14]:
requests.get('https://realpython.github.io/fake-jobs/') # (, headers = headers) if needed

<Response [200]>

In [15]:
soup = BS(response.text, 'html.parser') # name the parser explicitly to avoid a warning and parser-dependent results

In [78]:
# print(soup.prettify())

#### Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.

In [31]:
# Inspecting the page shows 'Senior Python Developer' sits in an <h2> tag with class 'title is-5'.
# Passing 'title' alone returns the same result. Note that the keyword is class_ (trailing
# underscore), because class is a reserved word in Python.

soup.find('h2', class_ = ['title', 'is-5']).text

'Senior Python Developer'
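How `class_` matches is easy to misread, so here is a small offline sketch. The HTML snippet is made up for illustration (it only imitates the page's class names), and `sample` is a hypothetical variable name. A list passed to `class_` matches a tag carrying any of the listed classes, while a CSS selector can require all of them.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names seen on the page
html = """
<h1 class="title is-1">Fake Python</h1>
<h2 class="title is-5">Senior Python Developer</h2>
<h2 class="subtitle is-5">Not a job title</h2>
"""
sample = BeautifulSoup(html, "html.parser")

# A list matches tags that carry ANY of the listed classes
any_match = [t.text for t in sample.find_all("h2", class_=["title", "is-5"])]

# A single class name matches tags that carry that class, possibly among others
title_only = [t.text for t in sample.find_all("h2", class_="title")]

# A CSS selector requires ALL of the listed classes on the same tag
both = [t.text for t in sample.select("h2.title.is-5")]

print(any_match)
print(title_only)
print(both)
```

On this snippet the list form also picks up the second `<h2>`, because it shares `is-5`. That is why `'title'` alone is the safer filter here.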

#### Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [79]:
# find_all returns the matching tags; keep the stripped text of each one to get a list of titles
jobtitles = [h2.text.strip() for h2 in soup.find_all('h2', class_ = 'title')]
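As a self-contained illustration of collecting every title into a list, the sketch below runs against a tiny made-up HTML string rather than the live page. The markup and the names `sample` and `job_titles` are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Made-up cards; the first title has stray whitespace to show the cleaning step
html = """
<div class="card-content"><h2 class="title is-5"> Senior Python Developer </h2></div>
<div class="card-content"><h2 class="title is-5">Energy engineer</h2></div>
<div class="card-content"><h2 class="title is-5">Legal executive</h2></div>
"""
sample = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
job_titles = [h2.get_text(strip=True) for h2 in sample.find_all("h2", class_="title")]
print(job_titles)
```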

#### Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.

In [49]:
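# Company: <h3 class="subtitle">; location: <p class="location">; posting date: <p class="is-small">.
# find_all returns a list of tags, so .find and .text cannot be chained onto its result;
# use find for a single tag, or loop over the list.

soup.find('h3', class_ = 'subtitle').text.strip()

'Payne, Roberts and Davis'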
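`find_all` returns a list-like `ResultSet`, not a single tag, so tag methods such as `.find` have to be applied to each element in turn. The sketch below shows both the failure and the per-element fix on made-up markup. The class names `is-6 company` and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating two job cards
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
</div>
"""
sample = BeautifulSoup(html, "html.parser")
cards = sample.find_all("div", class_="card-content")

# Chaining .find onto the ResultSet fails, because a list of tags is not a tag
try:
    cards.find("h3")
    chained_ok = True
except AttributeError:
    chained_ok = False

# Calling .find on each card works
companies = [card.find("h3", class_="subtitle").text.strip() for card in cards]
print(chained_ok, companies)
```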

In [74]:
import pandas as pd

jobcards = soup.find_all('div', class_='card-content') # find_all is the current name; findAll is a legacy alias

data = []

for card in jobcards:
    title = card.find('h2', class_ = ['title']).text
    company = card.find('h3', class_ = ['subtitle']).text
    location = card.find('p', class_ = ['location']).text
    posting_date = card.find('p', class_ = ['is-small']).text

    jobinfo = {
        'Title': title.strip(), 
        'Company': company.strip(),
        'Location': location.strip(),
        'Date Posted': posting_date.strip()
    }

    data.append(jobinfo)

In [75]:
jobpostinfo = pd.DataFrame(data)

In [77]:
jobpostinfo.head()

| | Title | Company | Location | Date Posted |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 |


## 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

#### First, use the BeautifulSoup find_all method to extract the urls.
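One possible approach is to select the anchor tags whose text is "Apply" and read their `href` attribute. The sketch below assumes each card has a footer with a "Learn" link and an "Apply" link. The markup is a made-up approximation, so confirm the real structure with the browser inspector before relying on it.

```python
from bs4 import BeautifulSoup

# Assumed markup: a card footer holding two links, only one of which is "Apply"
html = """
<div class="card">
  <div class="card-content"><h2 class="title is-5">Senior Python Developer</h2></div>
  <footer class="card-footer">
    <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
    <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
  </footer>
</div>
"""
sample = BeautifulSoup(html, "html.parser")

# string="Apply" keeps only the <a> tags whose text is exactly "Apply"
apply_urls = [a["href"] for a in sample.find_all("a", string="Apply")]
print(apply_urls)
```

If the links sit outside the `card-content` div, as assumed here, they need to be collected from the whole page (or from the enclosing card) rather than from `jobcards`.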

#### Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
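A possible way to build the urls is sketched below. It assumes the pattern is the base path, then the lowercased title with punctuation removed and spaces turned into hyphens, then the job's position on the page, then `.html`. That pattern, the helper name `build_apply_url`, and the index values used in the test calls are assumptions to check against the urls extracted with BeautifulSoup.

```python
import re

BASE = "https://realpython.github.io/fake-jobs/jobs/"


def build_apply_url(title, index):
    """Build an Apply url from a job title and its position (assumed pattern)."""
    slug = title.strip().lower()
    # Drop parentheses, commas and other punctuation, keeping letters, digits, spaces, hyphens
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)
    # Collapse runs of spaces or hyphens into a single hyphen
    slug = re.sub(r"[\s-]+", "-", slug).strip("-")
    return f"{BASE}{slug}-{index}.html"


# The index values below are illustrative, not taken from the page
print(build_apply_url("Software Engineer (Python)", 10))
print(build_apply_url("Scientist, research (maths)", 20))
```

With both sets of urls stored as DataFrame columns, comparing them element-wise and checking that every comparison is true would confirm whether the assumed pattern holds.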