In [1]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd


In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.


1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  
a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  
b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  
c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  
d. Take the lists that you have created and combine them into a pandas DataFrame. 
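
As a warm-up for parts a through c, the sketch below runs the same kinds of calls on a small hard-coded snippet instead of the live page. The markup is an assumption, modelled loosely on a job card (an `h2` title, a `p` with class `location`, and a `time` tag); the real page may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not copied from the real page.
html = """
<div class="card">
  <h2 class="title">Senior Python Developer</h2>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card">
  <h2 class="title">Energy engineer</h2>
  <p class="location">
    Christopherville, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .find returns the first matching tag; .find_all returns every match.
first_title = soup.find("h2").get_text(strip=True)
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# class_ filters on the CSS class; strip=True trims surrounding whitespace.
locations = [tag.get_text(strip=True) for tag in soup.find_all("p", class_="location")]

# Attribute values are read with .get, much like a dictionary lookup.
dates = [tag.get("datetime") for tag in soup.find_all("time")]
```

Passing `strip=True` to `get_text` is one way to meet the "clean text" requirement at extraction time, rather than cleaning the strings afterwards in pandas.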

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

In [3]:
response.status_code

200

In [4]:
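# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning.
soup = BS(response.text, 'html.parser')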

In [5]:
soup.find('title').text

'Fake Python'

In [6]:
print(soup.find('h2').text)

Senior Python Developer


b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  


In [7]:
# find_all is the current name for the older findAll alias.
table = soup.find_all('h2')

In [8]:
roles = [x.text for x in table]

In [9]:
jobs_df = pd.DataFrame(roles, columns=['jobs'])

In [10]:
jobs_df

Unnamed: 0,jobs
0,Senior Python Developer
1,Energy engineer
2,Legal executive
3,Fitness centre manager
4,Product manager
...,...
95,Museum/gallery exhibitions officer
96,"Radiographer, diagnostic"
97,Database administrator
98,Furniture designer


c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [11]:
# attrs expects a dict ({'class': 'subtitle'}), not a set ({'class', 'subtitle'}).
company = [x.text.strip() for x in soup.find_all('h3', attrs={'class': 'subtitle'})]

In [12]:
# attrs expects a dict ({'class': 'location'}), not a set ({'class', 'location'}).
location = [x.text for x in soup.find_all('p', attrs={'class': 'location'})]

In [13]:
date = [x.get('datetime') for x in soup.findAll('time')]

In [14]:
company_df = pd.DataFrame(company, columns=['company'])

In [15]:
location_df = pd.DataFrame(location, columns=['location'])

In [16]:
date_df = pd.DataFrame(date, columns=['date'])

In [17]:
# strip() with no argument removes spaces as well as newlines from both ends.
location_df['location'] = location_df['location'].str.strip()

In [18]:
location_df['location']=location_df['location'].str.replace('\n','')

In [19]:
jobs_df['company'] = company_df['company']

In [20]:
jobs_df['location'] = location_df['location']

In [21]:
jobs_df['date']= date_df['date']

d. Take the lists that you have created and combine them into a pandas DataFrame. 
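
The cells above build the DataFrame one column at a time. An alternative is to pass a dictionary of lists to `pd.DataFrame` in a single step. The sketch below uses two rows of sample values taken from the expected output; the `sample_` names are hypothetical and chosen so they do not overwrite the notebook's own lists.

```python
import pandas as pd

# Two rows of sample values, for illustration only.
sample_roles = ["Senior Python Developer", "Energy engineer"]
sample_company = ["Payne, Roberts and Davis", "Vasquez-Davidson"]
sample_location = ["Stewartbury, AA", "Christopherville, AA"]
sample_date = ["2021-04-08", "2021-04-08"]

# Each dictionary key becomes a column; all lists must have the same length.
sample_df = pd.DataFrame(
    {
        "jobs": sample_roles,
        "company": sample_company,
        "location": sample_location,
        "date": sample_date,
    }
)
```

Because every list must have the same length, this form also fails loudly if one of the extractions returned too few or too many items.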


In [22]:
jobs_df

Unnamed: 0,jobs,company,location,date
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08
...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08


2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.   
    a. First, use the BeautifulSoup find_all method to extract the urls.  
    b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
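
For part b, the URLs shown in this notebook suggest a pattern: the lowercased title with spaces and punctuation turned into hyphens, followed by the row position and `.html`. The sketch below is one possible way to build that; `build_apply_url` is a hypothetical helper, and the rule "every run of non-alphanumeric characters becomes one hyphen" is an assumption inferred from a few URLs, so the result should still be compared against the extracted links.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def build_apply_url(title, index):
    """Build a detail-page URL from a job title and its row position."""
    # Lowercase, then collapse every run of non-alphanumeric characters
    # (spaces, commas, slashes, parentheses) into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"


first_url = build_apply_url("Senior Python Developer", 0)
comma_url = build_apply_url("Radiographer, diagnostic", 96)
slash_url = build_apply_url("Museum/gallery exhibitions officer", 95)
```

Stripping leading and trailing hyphens matters for titles that end in a parenthesis, where the closing `)` would otherwise leave a doubled hyphen before the index.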

In [23]:
# Every card has two links ("Learn" and "Apply"), so this list contains both kinds.
apply = [x.get('href') for x in soup.find_all('a')]

In [24]:
print(apply)

['https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://www.realpython.com', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://www.realpython.com', 'https://realpython.github

In [25]:
apply_df =  pd.DataFrame(apply, columns=['link'])

In [26]:
apply_df_2 = [x for x in apply_df['link'] if x != "https://www.realpython.com"]

In [27]:
apply_df_3 = pd.DataFrame(apply_df_2,columns=['link'])

In [28]:
jobs_df['link'] = apply_df_3['link']

In [29]:
jobs_df.head(15)

Unnamed: 0,jobs,company,location,date,link
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/se...
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/en...
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/le...
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fi...
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/pr...
5,Medical technical officer,Rogers-Yates,"Davidville, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/me...
6,Physiological scientist,Kramer-Klein,"South Christopher, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/ph...
7,Textile designer,Meyers-Johnson,"Port Jonathan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/te...
8,Television floor manager,Hughes-Williams,"Osbornetown, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/te...
9,Waste management officer,"Jones, Williams and Villa","Scotttown, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/wa...


3. Finally, we want to get the job description text for each job.  
    a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  
    b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".  
    c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.
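
One way to structure part b is to separate parsing from downloading, so that the parsing step can be checked without a network call. The sketch below assumes the description is the first `<p>` inside the `div` with class `content`; both function names and the sample markup are hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the text of the first <p> inside div.content, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None:
        return None
    paragraph = content.find("p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


def get_description(url):
    """Download a job page and return its description text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return parse_description(response.text)


# Hypothetical markup standing in for a job page (assumed structure).
sample_html = """
<div class="content">
  <p>Party prevent live. Quickly candidate change although.</p>
  <p id="location"><strong>Location:</strong> Christopherville, AA</p>
</div>
"""
sample_description = parse_description(sample_html)
```

With a function like this, part c would be a call such as `jobs_df['link'].apply(get_description)`, which makes one request per row.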

In [30]:
URL2 = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

response2 = requests.get(URL2)
soup2 = BS(response2.text, 'html.parser')
# The div with class "content" holds the description, location, and posting date.
f = [x.text for x in soup2.find_all('div', attrs={'class': 'content'})]
f = [string.strip() for string in f]


In [31]:
f

['Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.\nLocation: Stewartbury, AA\nPosted: 2021-04-08']

In [32]:
useing = jobs_df['jobs'].reset_index()

In [33]:
useing['jobs']=useing['jobs'].str.lower()

In [34]:
# Use a raw string so that \s is passed to the regex engine unchanged.
useing['jobs'] = useing['jobs'].replace(r'\s+', '-', regex=True)

In [35]:
# Append each row's position to its slug, e.g. "energy-engineer" -> "energy-engineer-1".
useing['jobs'] = [f"{title}-{i}" for i, title in enumerate(useing['jobs'])]

In [36]:
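# Drop parentheses and commas, and turn slashes into hyphens, to match the
# extracted urls (e.g. "radiographer-diagnostic-96", "museum-gallery-exhibitions-officer-95").
useing['jobs'] = useing['jobs'].str.replace(r'[(),]', '', regex=True).str.replace('/', '-', regex=False)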

In [37]:
useing['linkn']= 'https://realpython.github.io/fake-jobs/jobs/' + useing['jobs'] + '.html'

In [38]:
hello = useing['linkn']
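
The exercise asks for a check that the constructed urls match the extracted ones. A minimal sketch of such a check, using two short stand-in Series rather than the notebook's columns (the second constructed value is deliberately wrong so that the mismatch is visible):

```python
import pandas as pd

# Small stand-ins for the extracted and constructed URL columns.
extracted = pd.Series([
    "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
    "https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html",
])
constructed = pd.Series([
    "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
    "https://realpython.github.io/fake-jobs/jobs/energy,-engineer-1.html",  # deliberately wrong
])

# Element-wise comparison gives a boolean Series marking the rows that differ.
mismatch = extracted != constructed
all_match = not mismatch.any()

# Show only the differing rows side by side for inspection.
comparison = pd.DataFrame({"extracted": extracted, "constructed": constructed})
mismatched_rows = comparison[mismatch]
```

In the notebook, the same comparison would presumably be made between `jobs_df['link']` and `useing['linkn']`.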

In [39]:
pd.set_option('max_colwidth',1000)

In [52]:
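def tot(entry):
    """Return the description paragraph from the job page at the URL `entry`."""
    response3 = requests.get(entry)
    soup3 = BS(response3.text, 'html.parser')
    # The second <p> on each job page holds the description text.
    f = soup3.find_all('p')[1].text.strip()
    return f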


In [53]:
jobs_df['dscpt'] = jobs_df['link'].apply(tot)

In [54]:
jobs_df

Unnamed: 0,jobs,company,location,date,link,dscpt
0,Senior Python Developer,"Payne, Roberts and Davis","Stewartbury, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html,Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.
1,Energy engineer,Vasquez-Davidson,"Christopherville, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html,Party prevent live. Quickly candidate change although. Together type music hospital. Every speech support time operation wear often.
2,Legal executive,"Jackson, Chambers and Levy","Port Ericaburgh, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html,Administration even relate head color. Staff beyond chair recently and off. Own available buy country store build before. Already against which continue. Look road article quickly. International big employee determine positive go Congress. Level others record hospital employee toward like.
3,Fitness centre manager,Savage-Bradley,"East Seanview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html,Tv program actually race tonight themselves true power. Study economy night actually score from. Name care several. Good explain grow water plant perform resource. Security stock ball organization recognize civil. Pm her then nothing increase.
4,Product manager,Ramirez Inc,"North Jamieview, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/product-manager-4.html,Traditional page a although for study anyone. Could yourself plan base rise would. Wear individual about add senior woman. Partner couple part cup few read consider. Take however ball ever laugh society technology. President stage population boy.
...,...,...,...,...,...,...
95,Museum/gallery exhibitions officer,"Nguyen, Yoder and Petty","Lake Abigail, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/museum-gallery-exhibitions-officer-95.html,Paper age physical current note. There reality size move red join. Trouble you eight describe pattern hard however sign. Space majority bit instead smile happen. Green wife in end decade leader. Begin actually team industry only. Various picture rule poor information. Admit play avoid few. Day dog receive wife.
96,"Radiographer, diagnostic",Holder LLC,"Jacobshire, AP",2021-04-08,https://realpython.github.io/fake-jobs/jobs/radiographer-diagnostic-96.html,Able such right culture. Wrong pick structure wrong figure continue food. Glass loss increase organization decide. Present spend make garden social man. Any manager political keep attack behavior security movement. Perhaps such those position wrong. Quickly wind include allow must point. Similar age option war partner determine. Wide method movie painting. Rate measure brother approach five. Later role change. Adult prepare son particular economy evening same trouble. Family east within walk school.
97,Database administrator,Yates-Ferguson,"Port Susan, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/database-administrator-97.html,Create day party decade high clear. Past trade believe worry film. Approach beautiful late. Manage every quality he under. Town foot hotel brother. Perform particular only his.
98,Furniture designer,Ortega-Lawrence,"North Tiffany, AA",2021-04-08,https://realpython.github.io/fake-jobs/jobs/furniture-designer-98.html,Pressure under rock next week. Recognize so relationship risk. Similar myself improve.
