## Webscraping

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as BS
URL = 'https://realpython.github.io/fake-jobs/'

response = requests.get(URL)

a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  

In [None]:
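soup = BS(response.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning
soup.find('h2').get_text(strip=True)  # text of the first <h2>, i.e. the first job title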

b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  

In [None]:
titles = soup.find_all('h2')
titles = [x.get_text(strip=True) for x in titles]  # get_text's first argument is a separator, not a tag name
print(type(titles))
titles

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [None]:
companies = soup.find_all('h3')
companies = [x.get_text(strip=True) for x in companies]  # strip=True trims leading/trailing whitespace
print(type(companies))
companies

In [None]:
location = soup.find_all('p', attrs={'class': 'location'})
location = [x.get_text(strip=True) for x in location]  # strip=True removes the surrounding whitespace
print(type(location))
location

In [None]:
date = [x['datetime'] for x in soup.find_all('time', {'datetime': True})]
date

d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [None]:
jobs = pd.DataFrame(list(zip(titles, companies, location, date)), columns=['Job Title', 'Company', 'Location', 'Date Posted'])
jobs
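
Zipping four separately collected lists only works if every job card has exactly one title, company, location, and date, all in the same order. A minimal sketch of an alternative is to read all four fields from each card, so a row cannot mix data from two jobs. The inline HTML, the `card-content` class, and the second job are assumptions made up for illustration; check the real page source before reusing the selector.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hand-written stand-in for the listings page (assumed structure, not the real HTML).
SAMPLE_HTML = """
<div class="card-content">
  <h2>Senior Python Developer</h2>
  <h3>Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2>Example Job</h2>
  <h3>Example Co</h3>
  <p class="location">
    Sampletown, ZZ
  </p>
  <time datetime="2021-04-09">2021-04-09</time>
</div>
"""


def parse_jobs(html):
    """Build one DataFrame row per job card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="card-content"):
        rows.append({
            "Job Title": card.find("h2").get_text(strip=True),
            "Company": card.find("h3").get_text(strip=True),
            "Location": card.find("p", class_="location").get_text(strip=True),
            "Date Posted": card.find("time")["datetime"],
        })
    return pd.DataFrame(rows)


jobs_demo = parse_jobs(SAMPLE_HTML)
print(jobs_demo)
```

On the real page, `parse_jobs(response.text)` would play the same role as the `zip` call, provided the card selector matches.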


2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

 a. First, use the BeautifulSoup find_all method to extract the urls.  

In [None]:
links = soup.find_all('a')
links = [a.get('href') for a in links]  # every href on the page, not only the "Apply" links
print(type(links))
print(links)


In [None]:
apply1 = soup.find_all('a', string='Apply', attrs={'class': ['card-footer-item']})
apply = [x.get('href').strip() for x in apply1]
apply  # the group's approach; it only worked for me after rerunning all earlier cells, likely because `soup` had been reassigned

In [None]:
jobs = pd.DataFrame(list(zip(titles, companies, location, date, apply)), columns=['Job Title', 'Company', 'Location', 'Date Posted', 'Link'])
jobs

 b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
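
One possible way to build the urls is sketched below. It assumes the pattern is the lowercased title with every run of non-alphanumeric characters collapsed to a single hyphen, followed by the job's zero-based position on the page. The two urls quoted in this exercise (`senior-python-developer-0.html` and `television-floor-manager-8.html`) are consistent with that, but it remains a guess until compared against the scraped links. `build_apply_url` and `BASE_URL` are names introduced here.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def build_apply_url(title, index):
    """Turn a job title and its position on the page into an assumed 'Apply' url."""
    # Lowercase, then replace each run of characters other than a-z or 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"


print(build_apply_url("Senior Python Developer", 0))
# The index values below are placeholders, not the jobs' real positions.
print(build_apply_url("Software Engineer (Python)", 1))
print(build_apply_url("Scientist, research (maths)", 2))
```

With the DataFrame from part a, the built urls could be compared against the scraped ones with `built = [build_apply_url(t, i) for i, t in enumerate(jobs['Job Title'])]` followed by `(pd.Series(built) == jobs['Link']).all()`.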

3. Finally, we want to get the job description text for each job.  

a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph. 

In [None]:
URL = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

response = requests.get(URL)

In [None]:
soup = BS(response.text, 'html.parser')
# '$0' is not an HTML tag name, so soup.find('$0') returns None. Print the parsed page and look for the right tag instead.
print(soup.prettify())

In [None]:
# Find the first <div> whose class is "box", then the first <p> inside it, and take its text.
description = soup.find('div', class_='box').find('p').get_text()
description

 b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".  

In [None]:
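def find_description(url):
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    # Same selector as in part a; the earlier attempt with class_='card-text' never returned a value.
    return soup.find('div', class_='box').find('p').get_text()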


In [None]:
url = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
description = find_description(url)
print(description)

In [None]:
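def description(url):
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')  # parse this response instead of relying on the global `soup`
    return soup.find('p')  # first <p> Tag on the page, which is not necessarily the description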

url = "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html"
alldescriptions = description(url)
alldescriptions


In [None]:
def description(url):
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    return soup.find('div', class_='box').find('p').get_text() #code from part a

url = "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html"
alldescriptions = description(url)
alldescriptions #YESSSSSS!

c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.

In [None]:
#trying to figure out the apply method but running out of time. Will probably need to come back to this.
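
A minimal sketch of the `.apply` step, assuming the url column is named `Link` as in part 2a. `Series.apply` calls the given function once per value and collects the results into a new Series. To keep the example runnable offline, a stand-in function (`fake_description`, hypothetical) replaces the real scraper; `add_descriptions` is also a name introduced here.

```python
import pandas as pd


def add_descriptions(jobs, get_description, url_column="Link"):
    """Return a copy of jobs with a 'Description' column, one function call per url."""
    result = jobs.copy()
    result["Description"] = result[url_column].apply(get_description)
    return result


def fake_description(url):
    """Offline stand-in for the real scraping function; returns a label based on the file name."""
    return f"Description for {url.rsplit('/', 1)[-1]}"


demo = pd.DataFrame({"Link": [
    "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
    "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html",
]})
demo = add_descriptions(demo, fake_description)
print(demo["Description"].tolist())
```

In the notebook itself, the equivalent one-liner would be `jobs['Description'] = jobs['Link'].apply(description)`. That makes one HTTP request per row, so it can take a while to finish.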

_______________________________________________________________________________________________________________________________________________________________

## Webscraping Bonus

1. Navigate to https://www.billboard.com/charts/hot-100/. Using BeautifulSoup, extract the This Week, artist, song, Last Week, Peak Position, and Weeks on Chart values into a pandas DataFrame. Hint: The HTML for the number one ranked song is slightly different from that of the rest of the songs.

2. After getting the code working for the current chart, navigate to last week's chart. Notice how the url for the page changes. Write a function which will, given a date, return a pandas DataFrame containing the Billboard chart data for that date.

3. Write a loop to retrieve the Billboard chart data for the last 10 weeks.
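
The date handling for steps 2 and 3 can be sketched without touching the page's HTML. This assumes a dated chart url is the base url followed by the date in `YYYY-MM-DD` form; confirm that against the address bar when navigating to last week's chart. `chart_url` and `last_n_weeks` are names introduced here, and the start date is an arbitrary example.

```python
from datetime import date, timedelta

BASE_URL = "https://www.billboard.com/charts/hot-100/"


def chart_url(chart_date):
    """Build the (assumed) url of the chart for a given date."""
    return f"{BASE_URL}{chart_date.isoformat()}/"


def last_n_weeks(n, start):
    """Return n dates, one week apart, going backwards from start."""
    return [start - timedelta(weeks=i) for i in range(n)]


# Arbitrary example start date; in practice this could be date.today().
urls = [chart_url(d) for d in last_n_weeks(10, date(2024, 1, 6))]
for url in urls:
    print(url)
```

A scraping function written for step 2 could then be called on each of these urls inside the loop, with the resulting DataFrames combined using `pd.concat`.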