# The Free Python Job Board

## Task: Web scraping a static website

In this exercise, we write a web scraping script that retrieves every job listing from the [Pythonjobs](https://pythonjobs.github.io/) website. We use the `requests` library to fetch the page over HTTP, then `BeautifulSoup` and `SoupStrainer` from `bs4` to parse the HTML and extract the job data by tag and attribute. Finally, we load the extracted data into a `pandas` DataFrame, which gives a tabular view of the job listings that can be manipulated or analyzed further.

### Importing Packages 
1. Pandas 
2. Requests
3. BeautifulSoup

In [2]:
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import requests

### Scrape Web Content and Parse HTML Code

`SoupStrainer` is a class provided by the Beautiful Soup library in Python. It allows you to parse only certain parts of an incoming document, which can be particularly useful when dealing with large HTML or XML files. By creating a `SoupStrainer` object, you can specify the criteria for the parts of the document you want to parse, such as specific tags, attributes, or strings. This selective parsing can lead to significant efficiency improvements, as it reduces the amount of data that needs to be processed.

Here's a brief overview of how `SoupStrainer` works:
- You define a `SoupStrainer` by specifying the conditions that match the elements you want to parse. These conditions can be based on tag names, attributes, or string content.
- When you create a `BeautifulSoup` object, you pass the `SoupStrainer` object as the `parse_only` argument.
- Beautiful Soup will then parse only the parts of the document that match the criteria defined in the `SoupStrainer`, ignoring the rest.
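
The steps above can be sketched on a small inline document. The markup here is made up for illustration and is not the real page; only the `SoupStrainer("section")` filter mirrors what this notebook uses.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup, invented for this example
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <nav><a href="/about">About</a></nav>
    <section><div class="job"><h1>Python Developer</h1></div></section>
    <footer><p>Contact us</p></footer>
  </body>
</html>
"""

# Keep only <section> tags and everything nested inside them
only_section_tags = SoupStrainer("section")
soup = BeautifulSoup(html, "html.parser", parse_only=only_section_tags)

print(soup.find("h1").text)  # heading inside <section> is available
print(soup.find("nav"))      # None: <nav> was never parsed
print(soup.find("footer"))   # None: <footer> was never parsed
```

Because elements outside `<section>` never enter the tree, later `find_all` calls search a much smaller document.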



In [21]:
#Define the url
url = "https://pythonjobs.github.io/"
#Send an HTTP request
pages = requests.get(url)
#Parse only the 'section' tag
only_section_tags = SoupStrainer("section")

#Stop with an error if the HTTP request failed, so that soup is never left undefined
pages.raise_for_status()
#Parse the response, keeping only the 'section' tags
soup = BeautifulSoup(pages.content,"html.parser",parse_only=only_section_tags)

### Extract Attributes from HTML elements 

In [71]:
for element in soup.find_all("div",class_="job")[:2]:
    #Get all other details except the job title 
    oth_details = element.find_all("span")
    #print job title
    print(element.h1.text.strip())
    print()
    #Job description 
    print(element.p.text.strip())
    
    #print other details 
    print(oth_details[0].text.strip())#location
    print(oth_details[1].text.strip())#Posting date
    print(oth_details[2].text.strip())#Contract type
    print(oth_details[3].text.strip())#Company
    #get application url
    print(url+ element.a["href"])
    print("--------------------------------------------")

Strats Python Developer

Overview HBK is searching for a Python software developer to join our Strats team in London on a full-time basis. The Strats group works closely and primarily with investment professionals in all our offices to help...
London, UK
Thu, 06 Oct 2022
Permanent
HBK Europe Management LLP
https://pythonjobs.github.io//jobs/hbk-strats-developer.html
--------------------------------------------
Python Software Developer

We’re hiring a Python Software Developer to join our interdisciplinary team, working with data publishers and users. To find out more about this role and working at Open Data Services check out this twitter thread....
Remote, UK-only
Thu, 23 Jun 2022
permanent
Open Data Services Co-operative
https://pythonjobs.github.io//jobs/open-data-services-co-operative-python-software-developer.html
--------------------------------------------
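
The links printed above contain a double slash (`.io//jobs/...`) because the base URL ends with `/` and each `href` on the page starts with `/`. One possible way to avoid this, shown here as a small sketch rather than as part of the original notebook, is `urljoin` from the standard library. The `href` value below is the one implied by the first link in the output.

```python
from urllib.parse import urljoin

base_url = "https://pythonjobs.github.io/"
href = "/jobs/hbk-strats-developer.html"  # relative link as it appears in the page

# Plain concatenation keeps both slashes
concatenated = base_url + href
# urljoin resolves the relative path against the base URL
joined = urljoin(base_url, href)

print(concatenated)  # https://pythonjobs.github.io//jobs/hbk-strats-developer.html
print(joined)        # https://pythonjobs.github.io/jobs/hbk-strats-developer.html
```

`urljoin` also handles `href` values that are already absolute URLs, which plain concatenation would corrupt.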


### Creating a dataframe 

In [72]:
#create table
job_listings = pd.DataFrame()

#Create empty lists to hold the different elements
job_title = []
job_desc = []
job_location = []
job_posting_date =[]
job_contract_type = []
job_company = []
job_url =[]

for element in soup.find_all("div",class_="job"):
    oth_details = element.find_all("span")
    job_title.append(element.h1.text.strip())
    job_desc.append(element.p.text.strip())
    job_location.append(oth_details[0].text.strip())
    job_posting_date.append(oth_details[1].text.strip())#Posting date
    job_contract_type.append(oth_details[2].text.strip())#Contract type
    job_company.append(oth_details[3].text.strip())#Company
    job_url.append(url+ element.a["href"])
    
#Add data to our dataframe
job_listings["Title"] = job_title
job_listings["Company"] = job_company
job_listings["Description"] = job_desc
job_listings["location"] = job_location
job_listings["Contract Type"] = job_contract_type
job_listings["Posting Date"] = job_posting_date
job_listings["Link"] = job_url

#View 
job_listings.head()

| | Title | Company | Description | location | Contract Type | Posting Date | Link |
|---|---|---|---|---|---|---|---|
| 0 | Strats Python Developer | HBK Europe Management LLP | Overview HBK is searching for a Python softwar... | London, UK | Permanent | Thu, 06 Oct 2022 | https://pythonjobs.github.io//jobs/hbk-strats-... |
| 1 | Python Software Developer | Open Data Services Co-operative | We’re hiring a Python Software Developer to jo... | Remote, UK-only | permanent | Thu, 23 Jun 2022 | https://pythonjobs.github.io//jobs/open-data-s... |
| 2 | Senior Software Engineer, Back-End (Remote) | Oomnitza | Oomnitza offers enterprise IT a unique solutio... | Galway, Ireland, Remote | permanent | Wed, 30 Mar 2022 | https://pythonjobs.github.io//jobs/oomnitza-ba... |
| 3 | Python Backend Engineer | BMAT | What you’ll be doing In the Data team we want ... | Remote | Permanent | Tue, 23 Nov 2021 | https://pythonjobs.github.io//jobs/bmat-python... |
| 4 | Senior Backend Engineer | BMAT | What you’ll be doing BMAT is teaming up with a... | Remote | Permanent | Tue, 23 Nov 2021 | https://pythonjobs.github.io//jobs/bmat-senior... |
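
The posting dates above are stored as plain strings, so they cannot be sorted chronologically as they stand. The following sketch, which is an optional extension and not part of the original notebook, builds a small DataFrame from a list of dictionaries and converts the date column with `pd.to_datetime`. The three records are copied from the output above; the variable names `records`, `jobs`, and `newest_first` are made up for the example, and the format string assumes an English locale for day and month names.

```python
import pandas as pd

# A few rows copied from the scraped output above
records = [
    {"Title": "Python Backend Engineer", "Company": "BMAT",
     "Posting Date": "Tue, 23 Nov 2021"},
    {"Title": "Strats Python Developer", "Company": "HBK Europe Management LLP",
     "Posting Date": "Thu, 06 Oct 2022"},
    {"Title": "Python Software Developer", "Company": "Open Data Services Co-operative",
     "Posting Date": "Thu, 23 Jun 2022"},
]

# One dict per job builds the whole table in a single call
jobs = pd.DataFrame(records)

# Convert strings such as "Thu, 06 Oct 2022" into real timestamps
jobs["Posting Date"] = pd.to_datetime(jobs["Posting Date"], format="%a, %d %b %Y")

# With a datetime column, sorting is chronological rather than alphabetical
newest_first = jobs.sort_values("Posting Date", ascending=False)
print(newest_first[["Title", "Posting Date"]])
```

Collecting one dictionary per job also keeps each row's fields together, which avoids the risk of parallel lists drifting out of step.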


### Putting it all together

In [2]:
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
import requests
from IPython.display import HTML

url = "https://pythonjobs.github.io/"
pages = requests.get(url)
only_section_tags = SoupStrainer("section")

# Stop with an error if the HTTP request failed, so that soup is never left undefined
pages.raise_for_status()
soup = BeautifulSoup(pages.content,"html.parser",parse_only=only_section_tags)


job_listings = pd.DataFrame()

job_title = []
job_desc = []
job_location = []
job_posting_date =[]
job_contract_type = []
job_company = []
job_url =[]

for element in soup.find_all("div",class_="job"):
    oth_details = element.find_all("span")
    job_title.append(element.h1.text.strip())
    job_desc.append(element.p.text.strip())
    job_location.append(oth_details[0].text.strip())
    job_posting_date.append(oth_details[1].text.strip())
    job_contract_type.append(oth_details[2].text.strip())
    job_company.append(oth_details[3].text.strip())
    job_url.append(url+ element.a["href"])



    
job_listings["Title"] = job_title
job_listings["Company"] = job_company
job_listings["Description"] = job_desc
job_listings["location"] = job_location
job_listings["Contract Type"] = job_contract_type
job_listings["Posting Date"] = job_posting_date
job_listings["Link"] = job_url


# Define a function to make URLs clickable
def make_clickable(val):
    return f'<a target="_blank" href="{val}">{val}</a>'

# Apply the function to the 'Link' column to make it clickable
df_styled = job_listings.style.format({'Link': make_clickable})

# Render the styled DataFrame as HTML
HTML(df_styled.to_html())

| | Title | Company | Description | location | Contract Type | Posting Date | Link |
|---|---|---|---|---|---|---|---|
| 0 | Strats Python Developer | HBK Europe Management LLP | Overview HBK is searching for a Python software developer to join our Strats team in London on a full-time basis. The Strats group works closely and primarily with investment professionals in all our offices to help... | London, UK | Permanent | Thu, 06 Oct 2022 | https://pythonjobs.github.io//jobs/hbk-strats-developer.html |
| 1 | Python Software Developer | Open Data Services Co-operative | We’re hiring a Python Software Developer to join our interdisciplinary team, working with data publishers and users. To find out more about this role and working at Open Data Services check out this twitter thread.... | Remote, UK-only | permanent | Thu, 23 Jun 2022 | https://pythonjobs.github.io//jobs/open-data-services-co-operative-python-software-developer.html |
| 2 | Senior Software Engineer, Back-End (Remote) | Oomnitza | Oomnitza offers enterprise IT a unique solution to manage the entirety of the digital estate. Unlike our competitors, who deliver siloed solutions, Oomnitza offers granular control and orchestration across the... | Galway, Ireland, Remote | permanent | Wed, 30 Mar 2022 | https://pythonjobs.github.io//jobs/oomnitza-back-end-sw-enginneer-irl-remote.html |
| 3 | Python Backend Engineer | BMAT | What you’ll be doing In the Data team we want to have the most complete music metadata database in the world. As part of our team you will help in any stage: from integrating new sources, enhancing the entity... | Remote | Permanent | Tue, 23 Nov 2021 | https://pythonjobs.github.io//jobs/bmat-python-backend-engineer.html |
| 4 | Senior Backend Engineer | BMAT | What you’ll be doing BMAT is teaming up with a global music streaming service to revolutionize the way copyrights and royalties are handled in the online world. As we transition from discovery to an MVP phase, we’re... | Remote | Permanent | Tue, 23 Nov 2021 | https://pythonjobs.github.io//jobs/bmat-senior-backend-engineer.html |
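
The scraped `Contract Type` values mix `Permanent` and `permanent`, so grouping or counting on that column would treat them as two categories. As one example of further manipulation, the sketch below normalizes the casing before counting. It is an optional extra step, not part of the original notebook; the five values are copied from the rows shown above, and the variable names are made up for the example.

```python
import pandas as pd

# Contract types of the five rows shown above, as scraped
contract_types = pd.Series(
    ["Permanent", "permanent", "permanent", "Permanent", "Permanent"],
    name="Contract Type",
)

# Raw values: the two spellings are counted separately
print(contract_types.value_counts())

# Strip whitespace and capitalize so both spellings collapse into one
normalized = contract_types.str.strip().str.capitalize()
print(normalized.value_counts())
```

The same two string methods could be applied to `job_listings["Contract Type"]` before any analysis that groups by contract type.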
