In [None]:
import requests
from bs4 import BeautifulSoup as BS
import pandas as pd
import re

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

In [None]:
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

type(response)

In [None]:
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BS(response.text, 'html.parser')

In [None]:
print(soup.prettify())

a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.

In [None]:
soup.find('h2')

In [None]:
soup.find('h2').text

In [None]:
soup.find(attrs={'class':'title is-5'}).text
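
Passing the full string 'title is-5' matches the class attribute exactly as written, so the lookup depends on the order of the classes. A minimal sketch of two alternatives that match on a single class, assuming a hypothetical HTML snippet shaped like one job card (it is not copied from the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like one job card (an assumption for illustration).
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

# Match on one class only; any other classes on the tag are ignored.
by_class = snippet.find("h2", class_="title").text

# The same lookup written as a CSS selector.
by_selector = snippet.select_one("h2.title").text

print(by_class, "|", by_selector)
```

Either form should keep working if the page reorders the classes on the tag.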

b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [None]:
job_titles = soup.find_all(attrs={'class':'title is-5'})
print(type(job_titles))


In [None]:
first_job = job_titles[0]
print(first_job)

In [None]:
job_titles_text = [x.text for x in job_titles]
print(job_titles_text)

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.

_let's do one at a time here. i think for my approach i can inspect the page and see if there's a class around the div i can use to grab the companies, locations, posting dates. it looks like div .card-content could be useful here. i'm going to make a list of those divs first._

In [None]:
job_cards = soup.find_all('div', attrs={'class': 'card-content'})

In [None]:
print(job_cards[0])

_ok pivoting back to doing it simply but keeping this beginning here in case i want to come back to it_

In [None]:
job_companies = soup.find_all('h3', attrs={'class': 'subtitle is-6 company'})

In [None]:
# Strip surrounding whitespace so the company names are clean, as the exercise asks.
job_companies_text = [job.text.strip() for job in job_companies]
job_companies_text

In [None]:
job_locations = soup.find_all('p', attrs={'class': 'location'})
job_locations_text = [job.text.strip() for job in job_locations]
job_locations_text

In [None]:
job_dates = soup.find_all('time')
job_dates_text = [job.text.strip() for job in job_dates]
job_dates_text
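
The card-based idea from the note above (one div.card-content per job) can be sketched as follows. Parsing each card on its own keeps the title, company, location, and date of one job together, so the fields cannot drift out of alignment if a card is missing a tag. The sample HTML and the `parse_card` helper are assumptions for illustration, not the page's exact markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for a single card (an assumption, modelled on the exercise text).
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <div class="content">
    <p class="location">
      Stewartbury, AA
    </p>
    <p class="is-small has-text-grey">
      <time datetime="2021-04-08">2021-04-08</time>
    </p>
  </div>
</div>
"""


def parse_card(card):
    """Pull the four fields out of one card, stripping surrounding whitespace."""
    return {
        "job_title": card.find("h2").get_text(strip=True),
        "job_company": card.find("h3").get_text(strip=True),
        "job_location": card.find("p", class_="location").get_text(strip=True),
        "job_posting_date": card.find("time").get_text(strip=True),
    }


cards = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="card-content")
records = [parse_card(card) for card in cards]
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`.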

d. Take the lists that you have created and combine them into a pandas DataFrame.



In [None]:
fake_jobs_df = pd.DataFrame({'job_title': job_titles_text,
                             'job_company': job_companies_text,
                             'job_location': job_locations_text,
                              'job_posting_date':job_dates_text})
fake_jobs_df

2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

a. First, use the BeautifulSoup find_all method to extract the urls.

_I want to isolate the URLs from the Apply button. The problem I'm running into is that each card has two a tags without IDs, and they share the same class. Maybe I can use a for loop to build a list of only the a tags with "Apply" as the text, or use a filter in find_all. The pattern is that we want every other item, so another option is to separate out the odd-indexed items._

In [None]:
urls = soup.find_all('a')

urls

In [None]:
url_apply_list = []

# Keep only the anchors whose visible text is exactly 'Apply'.
for url in urls:
    if url.text == 'Apply':
        url_apply_list.append(url)

url_apply_list
    

In [None]:
url_text = [url.get('href') for url in url_apply_list]
url_text
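
The filter mentioned in the note can also live inside find_all itself through its `string` argument, which keeps only tags whose text matches. A minimal sketch, assuming a hypothetical card footer (the "Learn" link and the class names are assumptions, and the match requires the link text to be exactly "Apply" with no surrounding whitespace):

```python
from bs4 import BeautifulSoup

# Hypothetical footer with the two links found on each card (an assumption).
links_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
"""
footer = BeautifulSoup(links_html, "html.parser")

# string='Apply' keeps only the <a> tags whose text is exactly 'Apply'.
apply_links = [a["href"] for a in footer.find_all("a", string="Apply")]
print(apply_links)
```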

In [None]:
fake_jobs_df['application_link'] = url_text
fake_jobs_df

In [None]:
#enum_urls = list(enumerate(urls))
# This was an attempt at the odd/even approach: loop over the enumerated links and check whether each index is divisible by two.

In [None]:
#enum_urls[0]

b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

_Okay, so the URL pattern looks like this: each URL starts with https://realpython.github.io/fake-jobs/jobs/, followed by the job title in lowercase with the words joined by dashes, then a dash, a number, and .html (for example, senior-python-developer-0.html). Special characters seem to be dropped or turned into a dash, so museum/gallery becomes museum-gallery. So the plan is to loop through the job title column, lowercase each title, clean out the special characters, replace spaces with dashes, and concatenate the result with the base URL._

In [None]:
fake_jobs_df['string_manipulation_urls'] = ' '

In [None]:
base_url = 'https://realpython.github.io/fake-jobs/jobs/'

for index, row in fake_jobs_df.iterrows():
    print ('index: ', index)
    print('row: ', row)

In [None]:
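for index, row in fake_jobs_df.iterrows():
    # Swap punctuation for spaces, collapse doubled spaces, then join the words with dashes.
    url_value = row.job_title.lower().replace('(', ' ').replace(')', ' ').replace('/', ' ').replace(',', ' ').replace('  ', ' ').strip().replace(' ', '-')
    # Assign to this row only (assigning to the whole column overwrites every row with the last URL).
    # The trailing number appears to be the job's position on the page, which matches the DataFrame index.
    fake_jobs_df.loc[index, 'string_manipulation_urls'] = f'{base_url}{url_value}-{index}.html'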
        
fake_jobs_df['string_manipulation_urls']
        

In [None]:
fake_jobs_df
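
The chain of replace calls can be condensed with a regular expression that turns every run of non-alphanumeric characters into a single dash. This is a sketch under assumptions: `slugify` and `build_url` are hypothetical helpers, and the rule that the trailing number is the job's zero-based position on the page is inferred from the example URLs in this exercise, not confirmed by the site.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def slugify(title):
    """Lowercase the title and collapse each run of non-alphanumerics into one dash."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_url(title, position):
    # Assumption: the number before .html is the job's zero-based position on the page.
    return f"{BASE_URL}{slugify(title)}-{position}.html"


print(slugify("Software Engineer (Python)"))
print(slugify("Scientist, research (maths)"))
print(build_url("Senior Python Developer", 0))
```

To confirm that the constructed URLs match the extracted ones, a comparison such as `(fake_jobs_df['string_manipulation_urls'] == fake_jobs_df['application_link']).all()` should return True.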

3. Finally, we want to get the job description text for each job.

a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [None]:
job_url = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

job_response = requests.get(job_url)

job_soup = BS(job_response.text, 'html.parser')

In [None]:
print(job_soup.prettify())

In [None]:
job_soup.find('div', attrs={'class':'content'}).find('p').text

b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".

In [None]:
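def scrape(page_url):
    """Fetch a job page and return the text of its description paragraph."""
    response = requests.get(page_url)
    soup = BS(response.text, 'html.parser')

    # The description is the first <p> inside the div with class "content".
    description = soup.find('div', attrs={'class':'content'}).find('p').text

    return description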

scrape('https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html')
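
One possible refinement is to separate the parsing from the network request, so the parsing step can be checked without fetching anything and a page with an unexpected layout returns None instead of raising AttributeError. A minimal sketch, where `extract_description` is a hypothetical helper and the sample HTML is an assumption about the page layout:

```python
from bs4 import BeautifulSoup


def extract_description(html):
    """Return the first paragraph inside div.content, or None if it is missing."""
    page = BeautifulSoup(html, "html.parser")
    content = page.find("div", class_="content")
    if content is None:
        return None
    paragraph = content.find("p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


# Hypothetical page body (an assumption, shortened for illustration).
sample_page = """
<div class="content">
  <p>At be than always different American address.</p>
  <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
</div>
"""
print(extract_description(sample_page))
print(extract_description("<html><body></body></html>"))
```

A fetching function could then call `extract_description(requests.get(page_url).text)`.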

c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.

In [None]:
fake_jobs_df['application_link'].apply(scrape)

In [None]:
descriptions = fake_jobs_df['application_link'].apply(scrape)
type(descriptions)


In [None]:
# Reuse the Series computed in the previous cell instead of requesting every page again.
fake_jobs_df['job_description'] = descriptions
fake_jobs_df
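
Each call to .apply(scrape) requests every job page again, so it can help to run it once and store the result. The sketch below wraps that step in a hypothetical `add_descriptions` helper that accepts the fetching function as an argument. A dictionary lookup stands in for the real scraper so the example runs offline, and the URLs and descriptions are made-up placeholders.

```python
import pandas as pd


def add_descriptions(df, url_column, fetch):
    """Return a copy of df with a job_description column filled by fetch(url)."""
    result = df.copy()
    result["job_description"] = result[url_column].apply(fetch)
    return result


# Stand-in for the real scraper (hypothetical data, no network access needed).
fake_pages = {
    "https://example.com/job-0.html": "First description.",
    "https://example.com/job-1.html": "Second description.",
}

demo = pd.DataFrame({"application_link": list(fake_pages)})
demo = add_descriptions(demo, "application_link", fake_pages.get)
print(demo)
```

With the real data, the call would look like `add_descriptions(fake_jobs_df, 'application_link', scrape)`.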