# Webscraping REMOTE.CO

Hello!  This notebook walks through my thought process and web-scraping pipeline for the website [Remote Developer Jobs](https://remote.co/remote-jobs/developer).  Here I pick up where Martin Breuss's [Beautiful Soup Tutorial](https://realpython.com/beautiful-soup-web-scraper-python/#find-elements-by-id) leaves off by practicing the concepts I learned on a suggested website.

Overall I found this a bit more challenging because the HTML structure doesn't feel as "organized" as the Fake Python Job Website, and I definitely had to get a bit creative in ways that might be unconventional for typical coding practice.  As always, feel free to follow along, and please share any comments, whether they are suggestions on coding etiquette, better methods, or errors I have made.  I hope you enjoy going through this as much as I did!

### Set-up

I use ```BeautifulSoup``` and ```requests``` in this exercise.  Our overall objective is to scrape all job-related content from the website and learn to filter for the jobs we are interested in.

Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Data Source: https://remote.co/remote-jobs/developer/

After inspecting the webpage, let's retrieve the HTML using the URL provided and parse the content.

In [13]:
# Import Packages
import requests
from bs4 import BeautifulSoup

URL = "https://remote.co/remote-jobs/developer/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

After going through the DOM by inspecting the website, I identified the HTML element that encompasses all the job content by its tag (a ```div``` container) and its ```class```.

In [None]:
# Inspect the elements of interest 
results = soup.find("div", class_ = "card bg-white m-0")
print(results.prettify()) 

The structure looks pretty solid to me.  In an earlier attempt, I think I used a ```class``` value that was too generic, so it picked up a lot of content we don't really need.
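
One thing worth guarding against here: ```.find()``` returns ```None``` when nothing matches, so if the site ever changes its class names, calling ```.prettify()``` on the result would fail with an ```AttributeError```.  Below is a minimal sketch of that check.  The HTML snippet and the ```find_container``` helper are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (assumption for illustration)
SAMPLE_HTML = """
<div class="card bg-white m-0">
  <a href="/job/example-1/">Example job</a>
</div>
"""

def find_container(html, class_name):
    """Return the first <div> whose class attribute matches class_name, or None."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_=class_name)

found = find_container(SAMPLE_HTML, "card bg-white m-0")
missing = find_container(SAMPLE_HTML, "no-such-class")

# Check for None before touching the result
if missing is None:
    print("Container not found; the page structure may have changed.")
```

Checking for ```None``` right after ```.find()``` gives a readable message instead of a traceback further down the notebook.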

### Scraping Job Contents

Let's start scraping the job content.  I created an iterable, ```job_cards```, containing all job-related content by specifying unique parameters.  I used a ```for``` loop to iterate through all ```job_cards```, extracting job titles and company names with the ```.find()``` method and the apply links from each card's ```href``` attribute.

Notice that finding the company name is much more difficult.  The company name is sub-text underneath each job card and is grouped together with the tags the job is labeled with.  When I extracted the text from the HTML element, I also picked up the blank spaces and the job's tags.  I had to use the ```.split()``` method to turn the text into a list of strings and then index it to print out the correct text.
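
To make the ```.split()``` step concrete, here is a small sketch on a made-up string.  The sample text is my own assumption about what the extracted sub-text roughly looks like (company name first, then tag labels on later lines), and ```first_line``` is a hypothetical helper that does the same job a little more defensively.

```python
# Made-up sub-text: company name, then blank lines and tag labels
# (assumption; the text extracted from the real page may differ)
sub_text = "  Acme Corp\n\n      Full-time\n      High-paying\n  "

def first_line(text):
    """Return the first non-empty line of text, stripped of whitespace."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""

# Strip the outer whitespace, split on newlines, and take the first piece
company = sub_text.strip().split("\n")[0]
print(company)

# The helper also copes with an empty string instead of raising an error
print(first_line(sub_text))
```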

In [None]:
# Create iterable containing all job postings
job_cards = results.find_all("a", class_ = "card m-0 border-left-0 border-right-0 border-top-0 border-bottom")

for job in job_cards:
    # Print out job title
    title = job.find("span", class_ = "font-weight-bold larger")
    print(title.text.strip())
    # Print out company 
    sub_text = job.find("p", class_ = "m-0 text-secondary").text.strip() 
    sub_text_split = sub_text.split("\n") # Company name is separated from the rest of sub_text by a newline "\n"
    print(sub_text_split[0]) # Company name comes first, so print the first string in the split list
    # Print job apply links
    job_url = job["href"]
    print(f"Apply: https://remote.co{job_url}\n") # Prepend the site's domain because the href is a relative path
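
As an alternative to gluing the domain onto the link with an f-string, the standard library's ```urljoin``` can build the full URL.  This is only a sketch: the two ```href``` values below are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = "https://remote.co"

# Hypothetical href values (assumptions for illustration)
relative_href = "/job/example-developer-123/"
absolute_href = "https://example.com/apply"

# A relative path is resolved against the base URL
full_relative = urljoin(BASE_URL, relative_href)
# An href that is already absolute is returned unchanged
full_absolute = urljoin(BASE_URL, absolute_href)

print(f"Apply: {full_relative}")
print(f"Apply: {full_absolute}")
```

The nice part is that ```urljoin``` handles both cases, whereas the f-string would produce a broken link if a card ever carried an absolute URL.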


### Filtering specific jobs

1. Select only full stack jobs
2. Select only high-paying jobs

In [None]:
# Select only "Full Stack" jobs

# Create iterable containing all job postings with "Full Stack" in the job title
full_stack_jobs = results.find_all(
    "span", class_ = "font-weight-bold larger",
    # Take into account different ways of writing full stack; skip tags with no single text string (None)
    string = lambda text: text is not None and (
        "fullstack" in text.lower()
        or "full stack" in text.lower()
        or "full-stack" in text.lower()
    )
)

# Use a list comprehension to climb the hierarchy to the element that encompasses all of a job's content
job_cards = [
    elements.parent.parent.parent.parent.parent.parent for elements in full_stack_jobs
]

# Print out all job info
for job in job_cards:
    # Print out job title
    title = job.find("span", class_ = "font-weight-bold larger")
    print(title.text.strip())
    # Print out company (the company name is the first line of the sub-text)
    sub_text = job.find("p", class_ = "m-0 text-secondary").text.strip()
    sub_text_split = sub_text.split("\n") # Company name is separated from the rest of sub_text by "\n"
    print(sub_text_split[0]) # Company name comes first, so print the first string in the split list
    # Print job apply links
    job_url = job["href"]
    print(f"Apply: https://remote.co{job_url}\n") # Prepend the site's domain because the href is a relative path
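
The three spellings of "full stack" checked above could also be folded into one regular expression.  This is a sketch of that idea, not what the cell above does; the ```is_full_stack``` helper and the sample titles are made up for illustration.

```python
import re

# Matches "fullstack", "full stack", and "full-stack" in any letter case
FULL_STACK = re.compile(r"full[\s-]?stack", re.IGNORECASE)

def is_full_stack(title):
    """Return True if title mentions full stack; tolerates None."""
    return bool(title) and FULL_STACK.search(title) is not None

# Made-up job titles (assumption for illustration)
titles = [
    "Senior Fullstack Engineer",
    "Full-Stack Developer",
    "Full Stack Dev",
    "Backend Engineer",
    None,
]

matches = [t for t in titles if is_full_stack(t)]
print(matches)
```

Because the helper takes a single string and returns a boolean, it should be usable as the ```string``` argument of ```find_all()``` in place of the lambda.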

In [None]:
# Select only jobs with high-paying tags
high_pay_jobs = results.find_all(
    # Skip <small> tags with no single text string (None) before lowercasing
    "small", string = lambda text: text is not None and "high-paying" in text.lower()
)

# Use list comprehension to select parent elements
job_cards = [
    elements.parent.parent.parent.parent.parent.parent.parent for elements in high_pay_jobs
]

# Print out all relevant info
for job in job_cards:
    title = job.find("span", class_ = "font-weight-bold larger")
    print(title.text.strip())
    sub_text = job.find("p", class_ = "m-0 text-secondary").text.strip()
    sub_text_split = sub_text.split("\n")
    print(sub_text_split[0])
    job_url = job["href"]
    print(f"Apply: https://remote.co{job_url}\n")
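
Counting ```.parent``` hops works, but it breaks as soon as the nesting depth changes.  Beautiful Soup's ```find_parent()``` walks up to the nearest matching ancestor instead.  Here is a sketch on made-up markup; the nesting below is an assumption for illustration and not the real page's structure.

```python
from bs4 import BeautifulSoup

# Made-up markup (assumption): a tag label nested several levels inside a job link
SAMPLE_HTML = """
<a class="card" href="/job/example-1/">
  <div><div><p><span><small>High-paying</small></span></p></div></div>
</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the label by its text, skipping tags that have no single string
label = soup.find(
    "small",
    string=lambda text: text is not None and "high-paying" in text.lower(),
)

# Walk up to the nearest enclosing <a>, however many levels away it is
job_link = label.find_parent("a")
print(job_link["href"])
```

With this approach the same line would serve both filters, since it no longer matters whether the starting element sits six or seven levels below the job card.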


Awesome.  I hope everything worked.  Now that we have filtered for the high-paying jobs, we know which ones we should probably aim for if we wanna get that bag.  Thanks for following along!