# Webscraping

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

To retrieve the contents of a website, we will use the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [1]:
#Import the requests library.
import requests

#Import BeautifulSoup, so we can parse (and prettify) the HTML of the website
from bs4 import BeautifulSoup as BS

#other libraries that may or may not be used:
from IPython.display import HTML
import pandas as pd

## 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)
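
A request can fail or hang, so it can help to wrap the GET in a small function. The sketch below is one possible way to do that; the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, not part of the exercise.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of the page at `url` as text.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Uncomment to perform the request against the exercise page:
# html = fetch_html('https://realpython.github.io/fake-jobs/')
```

Failing loudly on a bad status code avoids handing an error page to the parser by mistake.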

### a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.

In [3]:
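#Before we can use .find, we need to parse the HTML into a BeautifulSoup object
#Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BS(response.text, 'html.parser')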

In [4]:
#print(soup.prettify())

In [5]:
#To find just the first job title:
soup.find('h2').text

'Senior Python Developer'
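
The hint above suggests narrowing the search by class as well as by tag type. Here is a minimal sketch on a hand-written snippet that is assumed to resemble one job card; the class names (`title`, `is-5`, `subtitle`, `company`) are assumptions about the page markup and should be checked against `soup.prettify()`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one job card; the class names are assumptions.
sample_html = """
<div class="card">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_='title' matches tags whose list of classes contains 'title'
first_title = sample_soup.find('h2', class_='title').text.strip()
print(first_title)
```

Filtering on both tag and class is less likely to break if the page later gains another `h2` above the listings.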

### b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list. 

In [6]:
#To find all job titles (find_all is the current name; findAll is a legacy alias):
all_jobs = soup.find_all('h2')

In [7]:
#Use a for loop to save each job title to a list
#First, create an empty list
title = []

#Then loop over each tag in all_jobs from above
#store just the text portion of each tag in a variable
#use .strip() to remove any leading or trailing spaces or line breaks
#append each text string to the list
for listing in all_jobs:
    j = listing.text.strip()
    title.append(j)
    
#print result to confirm it worked
#print(title)

### c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [8]:
#Adding company data:
all_companies = soup.find_all('h3')
company = []
for listing in all_companies:
    c = listing.text.strip()
    company.append(c)
    
#print(company)

In [45]:
#Adding locations
all_locations = soup.find_all('p', {'class': 'location'})
location = []
for listing in all_locations:
    loc_text = listing.text.strip()
    location.append(loc_text)
    
#print(location)

In [46]:
#alternative code, from class review: 
location_alt = [loc.text.strip() for loc in all_locations]
#print(location_alt)

In [10]:
#Adding posting dates
all_dates = soup.find_all('time')
date_posted = []
for listing in all_dates:
    d = listing.text.strip()
    date_posted.append(d)
    
#print(date_posted)
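
Building four separate lists works as long as every card has every field. If one card were missing a field, the lists would end up with different lengths or go out of step with each other. An alternative sketch is to loop over the cards and pull all four fields from each one. The snippet below is hand-written, and the container class `card-content` is an assumption about the page markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two job cards; the container class is an assumption.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# One dictionary per card keeps the fields of a single job together
records = []
for card in sample_soup.find_all('div', class_='card-content'):
    records.append({
        'title': card.find('h2').get_text(strip=True),
        'company': card.find('h3').get_text(strip=True),
        'location': card.find('p', class_='location').get_text(strip=True),
        'date_posted': card.find('time').get_text(strip=True),
    })

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.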

### d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [11]:
jobs_df = pd.DataFrame({'title':title,'company':company,'location':location,'date_posted':date_posted})
jobs_df

Unnamed: 0,title,company,location,date_posted
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08
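
The scraped posting dates are plain strings. If they will be sorted or filtered later, one option is to convert the column with `pd.to_datetime`. This is a sketch on a two-row stand-in DataFrame (`sample_df` is a hypothetical name), using values taken from the table above.

```python
import pandas as pd

# Two-row stand-in for jobs_df, using values from the table above
sample_df = pd.DataFrame({
    'title': ['Senior Python Developer', 'Energy engineer'],
    'date_posted': ['2021-04-08', '2021-04-08'],
})

# An explicit format makes the parse fail loudly on unexpected values
sample_df['date_posted'] = pd.to_datetime(sample_df['date_posted'], format='%Y-%m-%d')
print(sample_df.dtypes)
```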


## 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

### a. First, use the BeautifulSoup find_all method to extract the urls.

In [49]:
#Use .find_all() to extract all the <a> tags, assign to variable
#Create empty list to put URLs in
#Use a for loop to get each "href" inside the <a> tags and append it to the empty list
#Use [1::2] to keep every second link, because each card has two links and the "Apply" link comes second
#Create new column in the jobs_df table
all_urls = soup.find_all('a')
apply_urls = []
for link in all_urls[1::2]:
    apply_urls.append(link.get('href'))
jobs_df['link'] = apply_urls
jobs_df

Unnamed: 0,title,company,location,date_posted,link
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...
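
Slicing with `[1::2]` depends on every card having exactly two links in the same order. A sketch of a less position-dependent option is to select the links by their visible text with the `string` argument of `find_all`. The snippet is hand-written, and the "Learn" link target is an assumption.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two card footers; the "Learn" href is an assumption.
sample_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" class="card-footer-item">Apply</a>
</footer>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# string='Apply' keeps only <a> tags whose text is exactly 'Apply'
apply_links = [a['href'] for a in sample_soup.find_all('a', string='Apply')]
print(apply_links)
```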


In [51]:
#alternative: look inside each card footer and take its second link
all_footers = soup.find_all(class_='card-footer')
apply_urls = [footer.find_all('a')[1]['href'] for footer in all_footers]
print(apply_urls)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-

In [53]:
#Dibran's code, from class review:
#.get('href', '') avoids a KeyError if an <a> tag has no href attribute
urls = []
for x in soup.find_all('a'):
    if 'jobs' in x.get('href', ''):
        urls.append(x['href'])
        
print(urls)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-
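
The same "filter on the href" idea can also be written as a CSS attribute selector with `soup.select`. A sketch on a hand-written snippet follows (the "Learn" link target is an assumption). Matching on `/jobs/` rather than `jobs` is slightly stricter, since the site's base URL also contains the text `jobs`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two card footers; the "Learn" href is an assumption.
sample_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
</footer>
<footer class="card-footer">
  <a href="https://www.realpython.com">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html">Apply</a>
</footer>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# a[href*="/jobs/"] selects <a> tags whose href contains the substring "/jobs/"
selected_urls = [a['href'] for a in sample_soup.select('a[href*="/jobs/"]')]
print(selected_urls)
```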

### b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [56]:
#https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html
#https://realpython.github.io/fake-jobs/jobs/scientist-research-maths-22.html

#Dibran's answer:
'https://realpython.github.io/fake-jobs/jobs/' + (
    jobs_df['title']
    .str.lower()
    .str.replace(r'[\s/]', '-', regex = True)   #spaces and slashes become hyphens
    .str.replace(r'[(),]', '', regex = True)    #parentheses and commas are dropped
) + '-' + jobs_df.index.astype(str) + '.html'

0     https://realpython.github.io/fake-jobs/jobs/se...
1     https://realpython.github.io/fake-jobs/jobs/en...
2     https://realpython.github.io/fake-jobs/jobs/le...
3     https://realpython.github.io/fake-jobs/jobs/fi...
4     https://realpython.github.io/fake-jobs/jobs/pr...
                            ...                        
95    https://realpython.github.io/fake-jobs/jobs/mu...
96    https://realpython.github.io/fake-jobs/jobs/ra...
97    https://realpython.github.io/fake-jobs/jobs/da...
98    https://realpython.github.io/fake-jobs/jobs/fu...
99    https://realpython.github.io/fake-jobs/jobs/sh...
Length: 100, dtype: object
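
The exercise also asks to confirm that the constructed urls match the extracted ones. A minimal sketch of that check uses only the two tricky titles named in the prompt, together with their urls from the comments above. `sample_df`, `slug`, `built`, and `mismatches` are hypothetical names, and the index values stand in for each job's position on the page.

```python
import pandas as pd

# The two tricky titles and their extracted URLs; the index is the job's position
sample_df = pd.DataFrame(
    {
        'title': ['Software Engineer (Python)', 'Scientist, research (maths)'],
        'link': [
            'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
            'https://realpython.github.io/fake-jobs/jobs/scientist-research-maths-22.html',
        ],
    },
    index=[10, 22],
)

# Lowercase, turn whitespace and slashes into hyphens, drop parentheses and commas
slug = (
    sample_df['title']
    .str.lower()
    .str.replace(r'[\s/]', '-', regex=True)
    .str.replace(r'[(),]', '', regex=True)
)

built = (
    'https://realpython.github.io/fake-jobs/jobs/'
    + slug + '-' + sample_df.index.to_series().astype(str) + '.html'
)

# Rows where the constructed URL differs from the extracted one
mismatches = sample_df[built != sample_df['link']]
print(len(mismatches))
```

On the full table, the same comparison against `jobs_df['link']` would list any titles whose cleaning rules still need work.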

## 3. Finally, we want to get the job description text for each job.

### a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph. 

In [20]:
#Request the page for a single job
single_job = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
single_job_response = requests.get(single_job)

#Parse the HTML into a BeautifulSoup object, naming the parser explicitly
soup_2 = BS(single_job_response.text, 'html.parser')

In [57]:
body = soup_2.find('div', {'class':'content'}).find('p').text
print(body)

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.


### b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than ... special good along.".  

In [58]:
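def get_job_desc(url):
    """Return the description paragraph from the job page at url."""
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    #The description is the first <p> inside the div with class "content"
    return soup.find('div', {'class':'content'}).find('p').text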
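
If a page lacks the expected `div` or paragraph, chaining `.find(...).find(...)` raises an `AttributeError`. One possible variation separates parsing from fetching and returns `None` when the description is missing. `parse_job_desc` is a hypothetical helper, and the sample page is hand-written, with markup that is an assumption about the real job pages.

```python
from bs4 import BeautifulSoup


def parse_job_desc(html):
    """Return the first paragraph inside div.content, or None if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    content = soup.find('div', class_='content')
    if content is None:
        return None
    paragraph = content.find('p')
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


# Hand-written stand-in for a job page; the markup is an assumption.
sample_page = """
<div class="content">
  <p>Professional asset web application environmentally friendly detail-oriented asset.</p>
  <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
</div>
"""

print(parse_job_desc(sample_page))
print(parse_job_desc('<html><body>Not found</body></html>'))
```

Because the parsing step takes a string, it can be tried on saved HTML without making a request each time.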

### c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.

In [59]:
jobs_df['job_description'] = jobs_df['link'].apply(get_job_desc)
jobs_df

                                 title                     company  \
0              Senior Python Developer    Payne, Roberts and Davis   
1                      Energy engineer            Vasquez-Davidson   
2                      Legal executive  Jackson, Chambers and Levy   
3               Fitness centre manager              Savage-Bradley   
4                      Product manager                 Ramirez Inc   
..                                 ...                         ...   
95  Museum/gallery exhibitions officer     Nguyen, Yoder and Petty   
96            Radiographer, diagnostic                  Holder LLC   
97              Database administrator              Yates-Ferguson   
98                  Furniture designer             Ortega-Lawrence   
99                         Ship broker   Fuentes, Walls and Castro   

                location date_posted  \
0        Stewartbury, AA  2021-04-08   
1   Christopherville, AA  2021-04-08   
2    Port Ericaburgh, AA  2021-04-08   

In [60]:
jobs_df

Unnamed: 0,title,company,location,date_posted,link,job_description
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...,Professional asset web application environment...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...,Party prevent live. Quickly candidate change a...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...,Administration even relate head color. Staff b...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...,Tv program actually race tonight themselves tr...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...,Traditional page a although for study anyone. ...
...,...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/mu...,Paper age physical current note. There reality...
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ra...,Able such right culture. Wrong pick structure ...
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/da...,Create day party decade high clear. Past trade...
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fu...,Pressure under rock next week. Recognize so re...
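
To see what `.apply` does without sending one request per row, the sketch below swaps the real function for a dictionary lookup. The names `fake_pages` and `lookup_desc` and the description strings are placeholders invented for illustration.

```python
import pandas as pd

# Placeholder descriptions keyed by URL; these strings are invented for the demo
fake_pages = {
    'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html': 'description 0',
    'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html': 'description 1',
}


def lookup_desc(url):
    """Stand-in for get_job_desc that reads from fake_pages instead of the web."""
    return fake_pages.get(url)


links = pd.Series(list(fake_pages))

# .apply calls lookup_desc once per element and collects the results in a new Series
descriptions = links.apply(lookup_desc)
print(descriptions)
```

With the real function, each element triggers a separate request, which is why the cell above takes noticeably longer to run than the earlier ones.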
