### In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

In [1]:
import requests
from bs4 import BeautifulSoup as BS

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

headers = {
    "User-Agent": "MyPythonScript/1.0 (contact@example.com)"
}

response = requests.get(URL, headers = headers)

In [3]:
# Inspect the response fetched above instead of sending a second request
response

<Response [200]>

## #1 Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.
### - a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.




In [4]:
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BS(response.text, 'html.parser')

In [5]:
soup.find('h2', class_ ='title').text


'Senior Python Developer'

In [6]:
soup.find(class_ ='title is-5').text

'Senior Python Developer'

In [7]:
soup.find(class_ ='media-content').text

'\nSenior Python Developer\nPayne, Roberts and Davis\n'
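The lookups above can also be tried without touching the network. The sketch below parses a small hand-written HTML fragment (an assumption modeled on the class names used above, not a copy of the real page) and contrasts three ways of reaching the same tag. A multi-word `class_` value is compared against the full attribute string, while a CSS selector passed to `select_one` matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on one job card; the real page has more markup.
sample_html = """
<div class="media-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag name plus one class: matches an <h2> whose class list contains "title".
by_tag_and_class = sample_soup.find("h2", class_="title").text

# A multi-word class_ value is matched against the exact attribute string.
by_exact_string = sample_soup.find(class_="title is-5").text

# A CSS selector does not depend on the order of the classes.
by_selector = sample_soup.select_one("h2.is-5.title").text

print(by_tag_and_class, by_exact_string, by_selector, sep="\n")
```

All three calls reach the same `<h2>` here; the selector form is the more forgiving choice if the order of classes in the markup ever changes.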

### - b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [8]:
titles = soup.find_all(class_ ='title is-5')
for i in titles:
    print(i.text)

Senior Python Developer
Energy engineer
Legal executive
Fitness centre manager
Product manager
Medical technical officer
Physiological scientist
Textile designer
Television floor manager
Waste management officer
Software Engineer (Python)
Interpreter
Architect
Meteorologist
Audiological scientist
English as a second language teacher
Surgeon
Equities trader
Newspaper journalist
Materials engineer
Python Programmer (Entry-Level)
Product/process development scientist
Scientist, research (maths)
Ecologist
Materials engineer
Historic buildings inspector/conservation officer
Data scientist
Psychiatrist
Structural engineer
Immigration officer
Python Programmer (Entry-Level)
Neurosurgeon
Broadcast engineer
Make
Nurse, adult
Air broker
Editor, film/video
Production assistant, radio
Engineer, communications
Sales executive
Software Developer (Python)
Futures trader
Tour manager
Cytogeneticist
Designer, multimedia
Trade union research officer
Chemist, analytical
Programmer, multimedia
Engineer, b

In [9]:
title_list = []
for t in titles:
    title_list.append(t.text)

print(title_list)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour

### - c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.

In [10]:

companies = soup.find_all(class_='subtitle is-6 company')
locations = soup.find_all(class_='location')
dates = soup.find_all(class_='is-small has-text-grey')

# .strip() removes the whitespace and newlines surrounding each value
company_list = [c.text.strip() for c in companies]
l_list = [loc.text.strip() for loc in locations]
d_list = [d.text.strip() for d in dates]

print(company_list)
print()
print(l_list)
print()
print(d_list)
print()

['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc', 'Rogers-Yates', 'Kramer-Klein', 'Meyers-Johnson', 'Hughes-Williams', 'Jones, Williams and Villa', 'Garcia PLC', 'Gregory and Sons', 'Clark, Garcia and Sosa', 'Bush PLC', 'Salazar-Meyers', 'Parker, Murphy and Brooks', 'Cruz-Brown', 'Macdonald-Ferguson', 'Williams, Peterson and Rojas', 'Smith and Sons', 'Moss, Duncan and Allen', 'Gomez-Carroll', 'Manning, Welch and Herring', 'Lee, Gutierrez and Brown', 'Davis, Serrano and Cook', 'Smith LLC', 'Thomas Group', 'Silva-King', 'Pierce-Long', 'Walker-Simpson', 'Cooper and Sons', 'Donovan, Gonzalez and Figueroa', 'Morgan, Butler and Bennett', 'Snyder-Lee', 'Harris PLC', 'Washington PLC', 'Brown, Price and Campbell', 'Mcgee PLC', 'Dixon Inc', 'Thompson, Sheppard and Ward', 'Adams-Brewer', 'Schneider-Brady', 'Gonzales-Frank', 'Smith-Wong', 'Pierce-Herrera', 'Aguilar, Rivera and Quinn', 'Lowe, Barnes and Thomas', 'Lewis, Gonzalez and Vasq
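Collecting each field in its own `find_all` pass relies on every job having every field, so that the lists stay aligned by position. An alternative sketch is to walk the page one card at a time and pull all fields from the same card. The `card-content` wrapper and the fragment below are assumptions made for illustration; the values are the first two jobs shown in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical two-card fragment; class names follow the ones used above.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <p class="is-small has-text-grey"><time>2021-04-08</time></p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <p class="is-small has-text-grey"><time>2021-04-08</time></p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for card in sample_soup.find_all("div", class_="card-content"):
    # get_text(strip=True) drops the surrounding whitespace and newlines.
    records.append({
        "title": card.find("h2", class_="title").get_text(strip=True),
        "company": card.find("h3", class_="company").get_text(strip=True),
        "location": card.find("p", class_="location").get_text(strip=True),
        "date": card.find("time").get_text(strip=True),
    })

print(records[0])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`.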

### - d. Take the lists that you have created and combine them into a pandas DataFrame.

In [11]:
import pandas as pd

# Avoid naming the DataFrame `list`, which would shadow Python's built-in type
jobs_df = pd.DataFrame(
    {'companies': company_list,
     'locations': l_list,
     'dates': d_list
    })
print(jobs_df)

                     companies             locations       dates
0     Payne, Roberts and Davis       Stewartbury, AA  2021-04-08
1             Vasquez-Davidson  Christopherville, AA  2021-04-08
2   Jackson, Chambers and Levy   Port Ericaburgh, AA  2021-04-08
3               Savage-Bradley     East Seanview, AP  2021-04-08
4                  Ramirez Inc   North Jamieview, AP  2021-04-08
..                         ...                   ...         ...
95     Nguyen, Yoder and Petty      Lake Abigail, AE  2021-04-08
96                  Holder LLC        Jacobshire, AP  2021-04-08
97              Yates-Ferguson        Port Susan, AE  2021-04-08
98             Ortega-Lawrence     North Tiffany, AA  2021-04-08
99   Fuentes, Walls and Castro     Michelleville, AP  2021-04-08

[100 rows x 3 columns]
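The table above holds companies, locations, and dates; the titles from part b can be added as a fourth column in the same way. The sketch below uses short stand-in lists (hypothetical names prefixed with `sample_`) and checks the list lengths first, since pandas raises a `ValueError` when the columns differ in length.

```python
import pandas as pd

# Stand-in lists; in the notebook these would be the full 100-item lists.
sample_titles = ["Senior Python Developer", "Energy engineer"]
sample_companies = ["Payne, Roberts and Davis", "Vasquez-Davidson"]
sample_locations = ["Stewartbury, AA", "Christopherville, AA"]
sample_dates = ["2021-04-08", "2021-04-08"]

columns = {
    "title": sample_titles,
    "company": sample_companies,
    "location": sample_locations,
    "date": sample_dates,
}

# Report mismatched lengths with a readable message before building the frame.
lengths = {name: len(values) for name, values in columns.items()}
assert len(set(lengths.values())) == 1, f"column lengths differ: {lengths}"

sample_df = pd.DataFrame(columns)
print(sample_df)
```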


## #2 Next, add a column that contains the url for the "Apply" button. Try this in two ways.
### - a. First, use the BeautifulSoup find_all method to extract the urls.

In [80]:
apply_url = soup.find_all('a',string= 'Apply')
apply_url = [x['href'] for x in apply_url]
print(apply_url)


['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-
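The exercise asks for the urls as a column, so the extracted list still has to be attached to a DataFrame. A minimal sketch on a hand-written fragment follows; the footer markup, the `example.com` link, and the `sample_` names are assumptions for illustration. `string='Apply'` keeps only the links whose text is exactly "Apply".

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment: each card has one unrelated link and one "Apply" link.
sample_html = """
<footer class="card-footer">
  <a href="https://example.com/learn">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html">Apply</a>
</footer>
<footer class="card-footer">
  <a href="https://example.com/learn">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html">Apply</a>
</footer>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the <a> tags whose text is "Apply", then read their href attribute.
sample_urls = [a["href"] for a in sample_soup.find_all("a", string="Apply")]

sample_df = pd.DataFrame({"title": ["Senior Python Developer", "Energy engineer"]})
sample_df["url"] = sample_urls  # one url per row, in page order
print(sample_df)
```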

### - b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [28]:
import re

# title_list (built in part 1b) holds the titles in page order

# Regex cheat sheet: https://www.dataquest.io/cheat-sheet/regular-expressions-cheat-sheet/

# Reminder: re.sub finds every match of a pattern in a string and replaces it
# syntax: re.sub(pattern, repl, string, count=0, flags=0)

# Patterns that are handy for cleanup
# \w matches a word character: a letter (either case), a digit, or an underscore
# \s matches a whitespace character such as a space, \n, or \t
# [] defines a set; a match occurs if any one character from the set appears
# ^ at the start of a set negates it, so [^a-z0-9] matches anything except a-z and 0-9
# + repeats the preceding item one or more times

base = "https://realpython.github.io/fake-jobs/jobs"

urls = []

for i, title in enumerate(title_list):
    # Replace each run of non-alphanumeric characters with a single hyphen, so
    # "Product/process development scientist" becomes
    # "product-process-development-scientist" rather than "productprocess-..."
    cleaned = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop a hyphen left at either end, e.g. by a closing parenthesis
    cleaned = cleaned.strip("-")
    url = f"{base}/{cleaned}-{i}.html"
    urls.append(url)
print(urls)

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html', 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html', 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html', 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html', 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html', 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html', 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html', 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html', 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html', 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html', 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html', 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html', 'https://realpython.github.io/fake-jobs/jobs/architect-12.html', 'https://realpython.github.io/fake-
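The exercise also asks for a check that the constructed urls match the extracted ones. One way to do that is sketched below on three jobs whose urls appear in this notebook. The cleaning rule (collapse every run of characters other than a-z and 0-9 into one hyphen) is an assumption about how the site builds its slugs, and `build_url` is a hypothetical helper name.

```python
import re

BASE = "https://realpython.github.io/fake-jobs/jobs"


def build_url(index, title):
    """Build a job url from the job's position on the page and its title."""
    # Collapse each run of characters other than a-z and 0-9 into one hyphen,
    # then trim hyphens left at either end (e.g. by a closing parenthesis).
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE}/{slug}-{index}.html"


# (index, title, url as it appears in this notebook) for three jobs.
known = [
    (0, "Senior Python Developer",
     f"{BASE}/senior-python-developer-0.html"),
    (10, "Software Engineer (Python)",
     f"{BASE}/software-engineer-python-10.html"),
    (21, "Product/process development scientist",
     f"{BASE}/product-process-development-scientist-21.html"),
]

# Collect every job whose constructed url differs from the known one.
mismatches = [(i, t) for i, t, expected in known if build_url(i, t) != expected]
print(mismatches)
```

In the notebook itself, the same idea extends to all jobs by comparing the constructed list against `apply_url` element by element and printing any pairs that differ.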

## #3 Finally, we want to get the job description text for each job.
### - a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [14]:
URL = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

headers = {
    "User-Agent": "MyPythonScript/1.0 (contact@example.com)"
}

response = requests.get(URL, headers = headers)

In [15]:
# Inspect the response for the job page fetched above
response

<Response [200]>

In [16]:
# `soo` holds the parsed job page; `soup` still holds the listing page
soo = BS(response.text, 'html.parser')

In [17]:
soo.find(class_ ='content').text

'\nProfessional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.\nLocation: Stewartbury, AA\nPosted: 2021-04-08\n'

### - b. We want to be able to do this for all pages. 
### Write a function which takes as input a url and returns the description text on that page. 
### For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".

In [47]:
d = BS(response.text, 'html.parser')
desc = d.find('div', class_ = 'content')
para = desc.find('p').text
print(para)

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.


In [83]:
# Handles one url per call; part c applies it to every url
def input_description(url):
    """Return the description paragraph from the job page at `url`."""
    if url is None:
        print("error")
        return None
    response = requests.get(url)
    d = BS(response.text, 'html.parser')
    # The description is the first <p> inside <div class="content">
    desc = d.find('div', class_='content')
    return desc.find('p').text

input_description('https://realpython.github.io/fake-jobs/jobs/product-process-development-scientist-21.html')

'Tell time special beyond could key assume. Play wait education think similar particular. Film manage several dark. Hit simple personal home they although.'
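A function like this is easier to test when the parsing step is separated from the download. The sketch below is one hedged way to do that: `extract_description` is a hypothetical helper that takes HTML text rather than a url and returns `None` when the expected tags are missing, instead of raising an `AttributeError`. The HTML fragments are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the first paragraph inside div.content, or None if it is absent."""
    page = BeautifulSoup(html, "html.parser")
    content = page.find("div", class_="content")
    if content is None:
        return None
    paragraph = content.find("p")
    return paragraph.get_text(strip=True) if paragraph else None


# Hypothetical job page fragment: description first, then location details.
sample_html = """
<div class="content">
  <p>Tell time special beyond could key assume.</p>
  <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
</div>
"""

found = extract_description(sample_html)
missing = extract_description("<html><body>No job here</body></html>")
print(found)
print(missing)
```

A url-based wrapper would then only need to fetch the page and pass `response.text` to this helper.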

### - c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.

In [85]:
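# Earlier attempts raised an error because they used `soup` (the listing page)
# where `soo` (the parsed job page) was needed
df = pd.DataFrame({'url': apply_url})
# Avoid naming the result `list`, which would shadow Python's built-in type
descriptions = df["url"].apply(input_description)

print(descriptions)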

0     Professional asset web application environment...
1     Party prevent live. Quickly candidate change a...
2     Administration even relate head color. Staff b...
3     Tv program actually race tonight themselves tr...
4     Traditional page a although for study anyone. ...
                            ...                        
95    Paper age physical current note. There reality...
96    Able such right culture. Wrong pick structure ...
97    Create day party decade high clear. Past trade...
98    Pressure under rock next week. Recognize so re...
99    Management common popular project only. Must s...
Name: url, Length: 100, dtype: object
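The result above is a standalone Series. To keep each description next to its url, the output of `.apply` can be assigned to a new column. The sketch below shows the pattern with a stand-in function, so it runs without any network access; `fake_pages`, `fake_description`, and the placeholder texts are all hypothetical.

```python
import pandas as pd

# Stand-in for the scraped pages: maps each url to placeholder text.
fake_pages = {
    "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html":
        "placeholder description A",
    "https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html":
        "placeholder description B",
}


def fake_description(url):
    """Look the description up locally instead of requesting the page."""
    return fake_pages.get(url)


sample_df = pd.DataFrame({"url": list(fake_pages)})

# .apply calls the function once per url and returns a Series aligned to the rows.
sample_df["description"] = sample_df["url"].apply(fake_description)
print(sample_df)
```

Swapping `fake_description` for `input_description` gives the real descriptions, at the cost of one request per row.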
